Tracing information flow in LLaMA Vision: A step toward multimodal understanding

Vittorio Pipoli
2025-01-01

Abstract

Multimodal Large Language Models (MLLMs) have recently emerged as a powerful framework for extending the capabilities of Large Language Models (LLMs) to reason over non-textual modalities. However, despite their success, understanding how they integrate visual and textual information remains an open challenge. Among them, LLaMA 3.2-Vision represents a significant milestone in the development of open-source MLLMs, offering a reproducible and efficient architecture that competes with leading proprietary models, such as Claude 3 Haiku and GPT-4o mini. Motivated by these characteristics, we conduct the first systematic analysis of the information flow between vision and language in LLaMA 3.2-Vision. We analyze three visual question answering (VQA) benchmarks, covering the tasks of VQA on natural images (using both open-ended and multiple-choice question formats) as well as document VQA. These tasks require diverse reasoning capabilities, making them well-suited to reveal distinct patterns in multimodal reasoning. Our analysis unveils a four-stage reasoning strategy: an initial semantic interpretation of the question, an early-to-mid-layer multimodal fusion, a task-specific reasoning stage guided by the resulting multimodal embedding, and a final answer prediction stage. Furthermore, we reveal that multimodal fusion is task-dependent: in complex settings such as document VQA, the model postpones cross-modal integration until semantic reasoning over the question has been established. Overall, our findings offer new insights into the internal dynamics of MLLMs and contribute to advancing the interpretability of vision-language architectures. Our source code is available at https://github.com/AImageLab/MLLMs-FlowTracker.
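For readers who want to try this kind of layer-wise inspection themselves, the sketch below shows one generic way to probe where LLaMA 3.2-Vision attends to the image across decoder layers. It is an illustrative example and not the procedure used in the paper: it assumes the Hugging Face checkpoint meta-llama/Llama-3.2-11B-Vision-Instruct, that transformers exposes the model as MllamaForConditionalGeneration, and that passing output_attentions=True returns per-layer attention maps in which the interleaved cross-attention layers attend over image tokens (the exact output structure may vary across transformers versions). The image file and the question are placeholders.

```python
# Illustrative sketch (not the paper's method): inspect, layer by layer, how the
# final prompt token of LLaMA 3.2-Vision distributes attention over text vs. image keys.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"   # assumed checkpoint
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")                        # placeholder image
question = "What color is the car in the picture?"       # placeholder VQA question
messages = [{"role": "user",
             "content": [{"type": "image"}, {"type": "text", "text": question}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# In this architecture, visual features enter the language model through interleaved
# cross-attention layers, so their attention maps have an image-token key axis whose
# length differs from the text sequence length.
text_len = inputs["input_ids"].shape[-1]
for layer_idx, attn in enumerate(outputs.attentions):
    if attn is None:  # guard against layers for which no map is returned
        continue
    probs = attn[0, :, -1, :].float()                    # (heads, key_len) for the last prompt token
    entropy = -(probs * (probs + 1e-9).log()).sum(-1).mean().item()
    kind = "cross (image keys)" if attn.shape[-1] != text_len else "self (text keys)"
    print(f"layer {layer_idx:02d} [{kind}] last-token attention entropy: {entropy:.3f}")
```

Plotting such per-layer statistics over many examples is one common way to surface where in depth a model concentrates on the visual input, i.e., the kind of early-to-mid-layer fusion and task-dependent delay of cross-modal integration described in the abstract.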
Year: 2025
ISBN: 978-3-032-05059-5
Files in this record:

CAIP2025.pdf (embargoed until 17/09/2026)
Type: Post-print
License: All rights reserved
Size: 993.53 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1324620