ColBERT-PRF: Semantic Pseudo-Relevance Feedback for Dense Passage and Document Retrieval
Tonellotto N.
2023-01-01
Abstract
Pseudo-relevance feedback mechanisms, from Rocchio to the relevance models, have shown the usefulness of expanding and reweighting the users' initial queries using information occurring in an initial set of retrieved documents, known as the pseudo-relevant set. Recently, dense retrieval – through the use of neural contextual language models such as BERT for analysing the documents' and queries' contents and computing their relevance scores – has shown promising performance on several information retrieval tasks still relying on the traditional inverted index for identifying documents relevant to a query. Two different dense retrieval families have emerged: the use of single embedded representations for each passage and query, e.g., using BERT's [CLS] token, or via multiple representations, e.g., using an embedding for each token of the query and document (exemplified by ColBERT). In this work, we conduct the first study into the potential for multiple representation dense retrieval to be enhanced using pseudo-relevance feedback, and present our proposed approach, ColBERT-PRF. In particular, based on the pseudo-relevant set of documents identified using a first-pass dense retrieval, ColBERT-PRF extracts representative feedback embeddings from the document embeddings of the pseudo-relevant set. Among these representative feedback embeddings, the embeddings that most highly discriminate among documents are employed as the expansion embeddings, which are then added to the original query representation. We show that these additional expansion embeddings enhance the effectiveness both of a reranking of the initial query results and of an additional dense retrieval operation. Indeed, experiments on the MSMARCO passage ranking dataset show that MAP can be improved by up to 26% on the TREC 2019 query set and 10% on the TREC 2020 query set by the application of our proposed ColBERT-PRF method on a ColBERT dense retrieval approach.
We further validate the effectiveness of our proposed pseudo-relevance feedback technique for a dense retrieval model on the MSMARCO document ranking and TREC Robust04 document ranking tasks. For instance, ColBERT-PRF exhibits up to 21% and 14% improvement in MAP over the ColBERT E2E model on the MSMARCO document ranking TREC 2019 and TREC 2020 query sets, respectively. Additionally, we study the effectiveness of variants of the ColBERT-PRF model with different weighting methods. Finally, we show that ColBERT-PRF can be made more efficient, attaining up to a 4.54× speedup over the default ColBERT-PRF model with little impact on effectiveness, through the application of approximate scoring and different clustering methods.
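The expansion step the abstract describes — clustering the token embeddings of the pseudo-relevant documents, keeping the most discriminative centroids, and appending them to the query embeddings — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and parameter names (`colbert_prf_expand`, `k_clusters`, `f_expand`, `beta`) and the `idf_fn` informativeness callback are hypothetical, and a simple Lloyd's k-means stands in for whatever clustering the paper actually uses.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the (k, dim) centroid matrix."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid (squared Euclidean).
        assign = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def colbert_prf_expand(query_embs, doc_embs, idf_fn, k_clusters=24, f_expand=10, beta=1.0):
    """Hypothetical sketch of a ColBERT-PRF style expansion.

    query_embs : (num_query_tokens, dim) original ColBERT query embeddings.
    doc_embs   : (num_feedback_tokens, dim) token embeddings pooled from the
                 pseudo-relevant documents of a first-pass dense retrieval.
    idf_fn     : maps a centroid to an IDF-like informativeness score
                 (e.g. via its nearest vocabulary token) -- an assumption here.
    """
    # 1. Cluster the feedback embeddings; centroids act as the
    #    representative feedback embeddings.
    centroids = kmeans(doc_embs, k_clusters)
    # 2. Keep the f_expand centroids that most highly discriminate
    #    among documents, as judged by the informativeness score.
    scores = np.array([idf_fn(c) for c in centroids])
    expansion = centroids[np.argsort(-scores)[:f_expand]]
    # 3. Append the weighted expansion embeddings to the query
    #    representation; beta controls their contribution.
    return np.vstack([query_embs, beta * expansion])
```

The expanded matrix can then be scored against document embeddings with ColBERT's usual MaxSim operator, either to rerank the first-pass results or to drive a second dense retrieval, as the abstract describes.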