Large-alphabet strings, prevalent in information retrieval and natural language processing, pose unique storage and processing challenges. This paper explores the efficient implementation of the alphabet-partition approach, introducing a compressed data structure that efficiently supports the operations ${\mathsf{rank}}$ and ${\mathsf{select}}$. Our implementation significantly outperforms existing methods, improving the ${\mathsf{select}}$ operation speed by 80% with only 11% additional space. We demonstrate the utility of our structure in various applications, including inverted list intersections, run-length compressed strings, and distributed computation of ${\mathsf{rank}}$ and ${\mathsf{select}}$. Notably, for run-length compressed strings using the Burrows-Wheeler transform, our data structure requires only 0.98-1.09 times the space of state-of-the-art RLFM-indexes to achieve 1.23-2.33 times faster pattern occurrence counting while also providing better theoretical guarantees.

Engineering rank/select data structures for large-alphabet strings

Carmona, Gabriel;
2025-01-01

Abstract

Large-alphabet strings, prevalent in information retrieval and natural language processing, pose unique storage and processing challenges. This paper explores the efficient implementation of the alphabet-partition approach, introducing a compressed data structure that efficiently supports the operations ${\mathsf{rank}}$ and ${\mathsf{select}}$. Our implementation significantly outperforms existing methods, improving the ${\mathsf{select}}$ operation speed by 80% with only 11% additional space. We demonstrate the utility of our structure in various applications, including inverted list intersections, run-length compressed strings, and distributed computation of ${\mathsf{rank}}$ and ${\mathsf{select}}$. Notably, for run-length compressed strings using the Burrows-Wheeler transform, our data structure requires only 0.98-1.09 times the space of state-of-the-art RLFM-indexes to achieve 1.23-2.33 times faster pattern occurrence counting while also providing better theoretical guarantees.
2025
Arroyuelo, Diego; Carmona, Gabriel; Larrañaga, Héctor; Riveros, Francisco; Rojas-Morales, Carlos Eugenio; Sepúlveda, Erick...espandi
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11568/1342707
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? 0
social impact