ColBERT-PRF: Semantic Pseudo-Relevance Feedback for Dense Passage and Document Retrieval
Tonellotto N.
2023-01-01
Abstract
Pseudo-relevance feedback mechanisms, from Rocchio to the relevance models, have shown the usefulness of expanding and reweighting the users' initial queries using information occurring in an initial set of retrieved documents, known as the pseudo-relevant set. Recently, dense retrieval – through the use of neural contextual language models such as BERT for analysing the documents' and queries' contents and computing their relevance scores – has shown promising performance on several information retrieval tasks still relying on the traditional inverted index for identifying documents relevant to a query. Two different dense retrieval families have emerged: the use of single embedded representations for each passage and query, e.g., using BERT's [CLS] token, or via multiple representations, e.g., using an embedding for each token of the query and document (exemplified by ColBERT). In this work, we conduct the first study into the potential for multiple representation dense retrieval to be enhanced using pseudo-relevance feedback, and present our proposed approach, ColBERT-PRF. In particular, based on the pseudo-relevant set of documents identified using a first-pass dense retrieval, ColBERT-PRF extracts representative feedback embeddings from the document embeddings of the pseudo-relevant set. Among these representative feedback embeddings, the embeddings that most highly discriminate among documents are employed as the expansion embeddings, which are then added to the original query representation. We show that these additional expansion embeddings enhance the effectiveness both of a reranking of the initial query results and of an additional dense retrieval operation. Indeed, experiments on the MSMARCO passage ranking dataset show that MAP can be improved by up to 26% on the TREC 2019 query set and 10% on the TREC 2020 query set by the application of our proposed ColBERT-PRF method on a ColBERT dense retrieval approach.
We further validate the effectiveness of our proposed pseudo-relevance feedback technique for a dense retrieval model on the MSMARCO document ranking and TREC Robust04 document ranking tasks. For instance, ColBERT-PRF exhibits up to 21% and 14% improvement in MAP over the ColBERT E2E model on the MSMARCO document ranking TREC 2019 and TREC 2020 query sets, respectively. Additionally, we study the effectiveness of variants of the ColBERT-PRF model with different weighting methods. Finally, we show that ColBERT-PRF can be made more efficient, attaining up to a 4.54× speedup over the default ColBERT-PRF model with little impact on effectiveness, through the application of approximate scoring and different clustering methods.
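The expansion step the abstract describes — clustering the token embeddings of the pseudo-relevant documents, keeping the most discriminative centroids, and appending them to the query embeddings — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and parameter names (`colbert_prf_expand`, `k_clusters`, `f_expand`, `beta`) and the `idf_fn` informativeness callback are hypothetical, and a simple Lloyd's k-means stands in for whatever clustering the paper actually uses.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the (k, dim) centroid matrix."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid (squared Euclidean).
        assign = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def colbert_prf_expand(query_embs, doc_embs, idf_fn, k_clusters=24, f_expand=10, beta=1.0):
    """Hypothetical sketch of a ColBERT-PRF style expansion.

    query_embs : (num_query_tokens, dim) original ColBERT query embeddings.
    doc_embs   : (num_feedback_tokens, dim) token embeddings pooled from the
                 pseudo-relevant documents of a first-pass dense retrieval.
    idf_fn     : maps a centroid to an IDF-like informativeness score
                 (e.g. via its nearest vocabulary token) -- an assumption here.
    """
    # 1. Cluster the feedback embeddings; centroids act as the
    #    representative feedback embeddings.
    centroids = kmeans(doc_embs, k_clusters)
    # 2. Keep the f_expand centroids that most highly discriminate
    #    among documents, as judged by the informativeness score.
    scores = np.array([idf_fn(c) for c in centroids])
    expansion = centroids[np.argsort(-scores)[:f_expand]]
    # 3. Append the weighted expansion embeddings to the query
    #    representation; beta controls their contribution.
    return np.vstack([query_embs, beta * expansion])
```

The expanded matrix can then be scored against document embeddings with ColBERT's usual MaxSim operator, either to rerank the first-pass results or to drive a second dense retrieval, as the abstract describes.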