Emotion Recognition in Multimodal Social Data

Passaro, Lucia; Amadei, Davide; Bacciu, Davide

doi:10.14428/esann/2026.es2026-287

Emotion recognition on social media is often approached in unimodal or single-label settings, despite the multimodal nature of online communication. This paper presents a study of multilabel emotion recognition from paired text-image data. We evaluate vision-language encoders and compare them with strong unimodal baselines and a zero-shot multimodal LLM. A simple multimodal classifier built on CLIP achieves the most reliable performance. Data-centric additions such as emoji transcription, caption augmentation, and pseudo-labelling offer only limited gains, whereas calibrated decision thresholds yield consistent improvements. The results highlight the value of visual cues and expose limitations of recent vision-language models (VLMs).
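
To illustrate what calibrating decision thresholds can mean in a multilabel setting, the following is a minimal sketch and not the procedure used in the paper. It assumes that each emotion label receives its own threshold, chosen by grid search to maximise F1 on a held-out validation split. The function names (`f1_binary`, `tune_thresholds`, `predict`), the grid, and the F1 criterion are all illustrative assumptions.

```python
from typing import List, Optional, Sequence


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1; returns 0.0 when there are no true or predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(
    probs: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    grid: Optional[Sequence[float]] = None,
) -> List[float]:
    """For each label independently, pick the grid threshold with the best validation F1.

    Assumption: `probs` holds per-label probabilities from any classifier
    (rows are examples, columns are emotion labels).
    """
    if grid is None:
        grid = [i / 20 for i in range(1, 20)]  # candidate thresholds 0.05 ... 0.95
    n_labels = len(labels[0])
    thresholds = []
    for j in range(n_labels):
        y_true = [row[j] for row in labels]
        scores = [row[j] for row in probs]
        best_t, best_f1 = 0.5, -1.0
        for t in grid:
            y_pred = [1 if s >= t else 0 for s in scores]
            f1 = f1_binary(y_true, y_pred)
            if f1 > best_f1:  # strict comparison keeps the lowest threshold on ties
                best_t, best_f1 = t, f1
        thresholds.append(best_t)
    return thresholds


def predict(
    probs: Sequence[Sequence[float]], thresholds: Sequence[float]
) -> List[List[int]]:
    """Apply one threshold per label to turn probabilities into 0/1 predictions."""
    return [[1 if s >= t else 0 for s, t in zip(row, thresholds)] for row in probs]
```

In this sketch the thresholds would be fitted on validation data only and then applied unchanged to the test set. A label whose predicted probabilities are systematically low, as can happen for rare emotions, would then receive a threshold below the default 0.5 instead of never being predicted.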