Reproducibility, Replicability, and Insights into Dense Multi-Representation Retrieval Models: from ColBERT to Col★

Tonellotto N.;
2023-01-01

Abstract

Dense multi-representation retrieval models, exemplified by ColBERT, estimate the relevance between a query and a document based on the similarity of their contextualised token-level embeddings. By using contextualised token embeddings, dense retrieval can match tokens either exactly or semantically, which can increase effectiveness for both in-domain and out-of-domain retrieval tasks and makes this family of models important to study. However, the exact role that these semantic matches play has not yet been well investigated. For instance, although tokenisation is one of the crucial design choices for pretrained language models, its impact on matching behaviour has not been examined in detail. In this work, we inspect the reproducibility and replicability of the contextualised late interaction mechanism by extending ColBERT to Col★, which implements the late interaction mechanism across various pretrained models and different types of tokenisers. As different tokenisation methods can directly impact the matching behaviour within the late interaction mechanism, we study the nature of the matches occurring in different Col★ models, and further quantify the contribution of lexical and semantic matching to retrieval effectiveness. Overall, our experiments successfully reproduce the performance of ColBERT on various query sets, and replicate the late interaction mechanism on different pretrained models with different tokenisers. Moreover, our experimental results yield new insights, such as: (i) semantic matching behaviour varies across different tokenisers; (ii) more specifically, high-frequency tokens are more likely to perform semantic matching than other token families; (iii) the late interaction mechanism benefits more from lexical matching than from semantic matching; (iv) special tokens, such as [CLS], play a very important role in late interaction.
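As an illustration of the late interaction mechanism described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the commonly described MaxSim formulation (each query token contributes its maximum cosine similarity over the document tokens) and assumes that a match is "lexical" when the best-matching document token has the same token id as the query token, and "semantic" otherwise. The function names, token ids, and toy embeddings are hypothetical.

```python
import math


def _normalise(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def late_interaction_score(query_ids, query_embs, doc_ids, doc_embs):
    """Return (total, lexical, semantic) MaxSim contributions.

    For each query token, take the maximum cosine similarity over all
    document tokens. Assumption: the match is lexical when the best-matching
    document token shares the query token's id, and semantic otherwise.
    """
    q = [_normalise(v) for v in query_embs]
    d = [_normalise(v) for v in doc_embs]
    lexical = 0.0
    semantic = 0.0
    for q_id, q_vec in zip(query_ids, q):
        sims = [sum(a * b for a, b in zip(q_vec, d_vec)) for d_vec in d]
        best = max(range(len(sims)), key=sims.__getitem__)
        if doc_ids[best] == q_id:
            lexical += sims[best]
        else:
            semantic += sims[best]
    return lexical + semantic, lexical, semantic


# Toy example with made-up token ids and 2-dimensional embeddings.
query_ids = [7, 9]
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_ids = [7, 3]
doc_embs = [[2.0, 0.0], [0.0, 3.0]]
total, lexical, semantic = late_interaction_score(
    query_ids, query_embs, doc_ids, doc_embs
)
```

In this toy case the first query token finds its best match in a document token with the same id, while the second matches a token with a different id, so the score splits into one lexical and one semantic contribution. Because the split depends on token ids, a different tokeniser can change which matches count as lexical.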
2023
9781450394086


Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1205607