Assessing Language and Vision-Language Models on Event Plausibility
Maria Cassese;Alessandro Bondielli;Alessandro Lenci
2023-01-01
Abstract
Transformer-based Language Models (LMs) excel in many tasks, but they appear to lack robustness in capturing crucial aspects of event knowledge due to their reliance on surface-level linguistic features and the mismatch between language descriptions and real-world occurrences. In this paper, we investigate the potential of Transformer-based Vision-Language Models (VLMs) in comprehending Generalized Event Knowledge (GEK), aiming to determine whether the inclusion of a visual component affects the mastery of GEK. To do so, we compare multimodal Transformer models with unimodal ones on a task evaluating the plausibility of curated minimal sentence pairs. We show that current VLMs generally perform worse than their unimodal counterparts, suggesting that VL pre-training strategies are not yet as effective at modeling semantic understanding, and that the resulting models behave more like bag-of-words models in this context.
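To illustrate the kind of evaluation the abstract describes, the sketch below scores a minimal sentence pair with a masked LM via pseudo-log-likelihood. This is an assumption about the scoring setup, not the paper's documented protocol: the model name, the scoring function, and the example pair are all illustrative.

```python
# Hypothetical sketch: comparing event plausibility of a minimal sentence pair
# using pseudo-log-likelihood under a masked LM. The paper's actual models,
# scoring method, and stimuli may differ.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"  # illustrative choice of unimodal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum the log-probability of each token, masking one position at a time."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    # Skip the [CLS] (first) and [SEP] (last) special tokens.
    for i in range(1, len(ids) - 1):
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[ids[i]].item()
    return total

# An invented minimal pair: a plausible vs. an implausible event description.
plausible = "The chef chopped the onion."
implausible = "The onion chopped the chef."
print(pseudo_log_likelihood(plausible) > pseudo_log_likelihood(implausible))
```

Under this setup, a model that captures generalized event knowledge should assign the plausible sentence a higher score; since the two sentences share the same words, a bag-of-words-like model would struggle to separate them.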