Evaluation of event plausibility recognition in Large (Vision)-Language Models

Maria Cassese; Alessandro Bondielli; Alessandro Lenci
2024-01-01

Abstract

Transformer-based Language Models (LMs) achieve outstanding performance on a variety of tasks but still exhibit limitations in recognizing common world events, i.e., Generalized Event Knowledge (GEK), particularly when such events require referential information or real-world experience. Assuming that the visual knowledge in Vision-Language Models (VLMs) provides additional referential information, this paper tests their ability to leverage implicit event knowledge to acquire robust and generalizable representations of agent-patient interactions, assessing their capacity to distinguish between plausible and implausible events. The evaluation compares unimodal and multimodal models of varying sizes and architectures on the task of recognizing the plausibility of minimal sentence pairs. Our analysis suggests several findings: 1) decoder-only models tend to outperform encoder-only ones; 2) model size has a minor impact: although larger models perform better in absolute terms, the differences between 7B and 13B parameter models are not significant for this task; 3) while smaller encoder-only VLMs consistently fall short of their LLM counterparts, larger ones achieve similar or slightly better performance; 4) all models perform worse on the more challenging sentences; 5) adding corresponding images to the textual stimuli affects the accuracy of some models. These findings open avenues for further analysis of the inner workings of VLMs and their ability to model event knowledge with and without visual inputs.
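The abstract does not spell out the scoring procedure, but a common way to run a minimal-pair plausibility task with a decoder-only LM is to score each sentence by its log-probability and count the model as correct when the plausible member of the pair receives the higher score. The sketch below assumes that setup; the model name ("gpt2") and the example sentence pair are illustrative placeholders, not items from the paper's benchmark.

```python
# Minimal sketch of a sentence-level plausibility comparison for a minimal pair,
# assuming a log-probability scoring setup (not necessarily the paper's exact procedure).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper evaluates larger (V)LMs
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability of the sentence under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token,
    # so multiply by the number of predicted tokens and negate.
    n_predicted = ids.size(1) - 1
    return -out.loss.item() * n_predicted

# Illustrative agent-patient minimal pair (hypothetical, not a dataset item)
plausible = "The teacher graded the exam."
implausible = "The exam graded the teacher."
print("plausible preferred:", sentence_logprob(plausible) > sentence_logprob(implausible))
```

Accuracy over a set of such pairs is then simply the fraction of pairs for which the plausible sentence is preferred.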


Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1322153