Emotion Recognition in Multimodal Social Data
Passaro, Lucia; Bacciu, Davide
2026-01-01
Abstract
Emotion recognition on social media is often approached in unimodal or single-label settings, despite the multimodal nature of online communication. This paper presents a study of multilabel emotion recognition from paired text-image data. We evaluate vision-language encoders and compare them with strong unimodal baselines and a zero-shot multimodal LLM. A simple multimodal classifier built on CLIP achieves the most reliable performance. Data-centric additions such as emoji transcription, caption augmentation, and pseudo-labelling offer limited gains, whereas calibrated decision thresholds have a consistent effect. The results highlight the value of visual cues and show limitations of recent VLMs.
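
The abstract does not say how the decision thresholds are calibrated. As an illustration only, the sketch below shows one common approach for multilabel classifiers: choosing a separate threshold per emotion label by maximising F1 on held-out validation data. The function names (`tune_thresholds`, `predict`), the threshold grid, and the F1 criterion are assumptions made for this example, not the authors' procedure.

```python
from typing import List, Optional, Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for one label; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(
    probs: Sequence[Sequence[float]],
    labels: Sequence[Sequence[int]],
    grid: Optional[Sequence[float]] = None,
) -> List[float]:
    """Pick, for each label, the grid threshold with the best validation F1.

    probs[i][j] is the predicted probability of label j for example i;
    labels[i][j] is the corresponding 0/1 gold annotation.
    """
    if grid is None:
        # Hypothetical grid: 0.05, 0.10, ..., 0.95
        grid = [i / 20 for i in range(1, 20)]
    n_labels = len(labels[0])
    thresholds = []
    for j in range(n_labels):
        y_true = [row[j] for row in labels]
        scores = [row[j] for row in probs]
        best_t, best_f1 = 0.5, -1.0
        for t in grid:
            y_pred = [int(s >= t) for s in scores]
            f1 = f1_score(y_true, y_pred)
            if f1 > best_f1:  # ties keep the lowest threshold
                best_t, best_f1 = t, f1
        thresholds.append(best_t)
    return thresholds


def predict(
    probs: Sequence[Sequence[float]], thresholds: Sequence[float]
) -> List[List[int]]:
    """Apply the per-label thresholds to probabilities to get 0/1 predictions."""
    return [[int(s >= t) for s, t in zip(row, thresholds)] for row in probs]
```

In this sketch, thresholds would be tuned once on a validation split and then reused unchanged on test data, so that rare emotion labels are not held to the same 0.5 cut-off as frequent ones.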


