A Comparative Evaluation of Function-Calling LLMs in a Cognitive Architecture
Pardini, Marco; Galatolo, Federico A.; Cominelli, Lorenzo; Cimino, Mario G. C. A.; Greco, Alberto; Scilingo, Enzo Pasquale
2025-01-01
Abstract
Large Language Models (LLMs) now possess function-calling capabilities, enabling them to interact with external tools and APIs. Within cognitive architectures for social robotics, this provides a robust mechanism for an LLM to orchestrate a set of discrete functions (conceptually echoing the brain's functional specificity) managing operations such as visual perception, auditory processing, speech generation, and memory access. However, LLMs exhibit varying propensities and strategies when issuing these function calls. This paper presents a comparative evaluation of four LLMs that differ significantly in parameter scale and origin: Llama3 70b, Gemma2 9b, Mixtral 8x7b, and Phi3 mini 3.8b, each acting as the orchestrator in such an architecture. Our testbed was a human-participant study (N=20) in which individuals engaged in ambiguous social scenarios, interacting with the architecture driven by each of the four LLMs (four trials per participant). Results revealed statistically significant differences in the frequency of 'Look' (F=13.62, p<0.001), 'Talk' (F=9.29, p<0.001), and 'Hear' (F=10.34, p<0.001) calls across LLMs. Notably, Llama3 70b made significantly more 'Look' calls (M=3.45), a behavior that corresponded with strong user preference (18/20 participants), suggesting its interaction style was perceived as more natural and contextually aware. Mixtral 8x7b, in contrast, favored 'Talk' (M=11.05) and 'Hear' (M=11.15) calls. These findings demonstrate that analyzing function-call patterns offers a quantitative lens for understanding and comparing the interaction strategies of different LLMs in orchestrating robotic behavior.
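The orchestration pattern described in the abstract, an LLM emitting named function calls that the architecture routes to perception and action modules, and an analysis that counts calls per function, can be sketched minimally as below. The function names mirror the paper's 'Look', 'Talk', and 'Hear' calls; the function bodies, the call-record format, and the helper names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of function-call dispatch and per-function frequency
# counting, assuming the LLM emits calls as {"name": ..., "arguments": ...}
# records (an assumed format, not the paper's actual interface).
from collections import Counter

def look(target: str) -> str:
    # Stand-in for the visual perception module.
    return f"visual description of {target}"

def talk(utterance: str) -> str:
    # Stand-in for the speech generation module.
    return f"spoke: {utterance}"

def hear() -> str:
    # Stand-in for the auditory processing module.
    return "transcribed user speech"

# Registry mapping LLM-visible function names to modules.
TOOLS = {"Look": look, "Talk": talk, "Hear": hear}

def dispatch(call: dict) -> str:
    """Route one LLM-emitted function call to the matching module."""
    return TOOLS[call["name"]](**call.get("arguments", {}))

def call_frequencies(calls: list) -> Counter:
    """Per-function call counts, the metric compared across LLMs."""
    return Counter(c["name"] for c in calls)

# Example trace from one hypothetical interaction turn.
calls = [
    {"name": "Look", "arguments": {"target": "the user"}},
    {"name": "Hear"},
    {"name": "Talk", "arguments": {"utterance": "Hello!"}},
    {"name": "Look", "arguments": {"target": "the room"}},
]
for c in calls:
    dispatch(c)
print(call_frequencies(calls))  # Counter({'Look': 2, 'Hear': 1, 'Talk': 1})
```

Counting calls this way is what makes the comparison quantitative: each LLM's orchestration style reduces to a frequency distribution over the registered functions, which can then be tested for significance across models.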


