Static Pruning for Multi-Representation Dense Retrieval
Acquavia, A.; Tonellotto, N.
2023-01-01
Abstract
Dense retrieval approaches are challenging the prevalence of inverted index-based sparse representation approaches for information retrieval systems. Different families have arisen: single representations for each query or passage (such as ANCE or DPR), or multiple representations (usually one per token), as exemplified by the ColBERT model. While ColBERT is effective, it requires significant storage space for each token's embedding. In this work, we aim to prune the embeddings of tokens that are unimportant for effectiveness. We show that, by adapting standard uniform and document-centric static pruning methods to embedding-based indexes while retaining their focus on low-IDF tokens, we can attain large improvements in space efficiency while maintaining high effectiveness. In experiments conducted on the MSMARCO passage ranking task, removing all embeddings corresponding to the 100 most frequent BERT tokens reduces the index size by 45%, with limited impact on effectiveness (e.g. no statistically significant degradation of nDCG@10 or MAP on the TREC 2020 query set). Similarly, on TREC Covid, we observed a 1.3% reduction in nDCG@10 for a 38% reduction in total index size.
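The uniform pruning strategy described in the abstract can be illustrated with a short sketch. The following Python example is not the authors' implementation; it is a minimal illustration, under assumed data structures (per-passage arrays of BERT token ids and matching token-embedding matrices), of dropping all embeddings whose tokens rank among the k most frequent in the collection. The function name and inputs are hypothetical.

```python
from collections import Counter
import numpy as np

def prune_frequent_tokens(doc_token_ids, doc_embeddings, k=100):
    """Sketch of uniform static pruning for a multi-representation index:
    drop the embeddings of the k most frequent tokens in the collection
    (frequent tokens serve here as a proxy for low-IDF tokens).

    doc_token_ids:  list of 1-D int arrays, one per passage (BERT token ids)
    doc_embeddings: list of 2-D float arrays, one embedding row per token
    """
    # Collection-level token frequencies. Counting each token once per
    # passage (document frequency) would match the IDF focus more closely;
    # raw counts keep the sketch simple.
    freq = Counter()
    for ids in doc_token_ids:
        freq.update(ids.tolist())
    stop_ids = {tok for tok, _ in freq.most_common(k)}

    # Remove the rows of each passage's embedding matrix whose token id
    # falls in the pruned set; the surviving rows form the smaller index.
    pruned_ids, pruned_embs = [], []
    for ids, embs in zip(doc_token_ids, doc_embeddings):
        keep = np.array([t not in stop_ids for t in ids])
        pruned_ids.append(ids[keep])
        pruned_embs.append(embs[keep])
    return pruned_ids, pruned_embs
```

With k=100, as in the abstract's MSMARCO experiment, such pruning would discard the stored vectors for the 100 most frequent BERT tokens, the setting reported there to cut index size by 45% with limited effectiveness loss.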