Compressed Text Indexes: From Theory to Practice

Ferragina, Paolo; Venturini, Rossano
2008-01-01

Abstract

A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This represents a significant advancement over the (full-)text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this algorithmic technology has matured to the point where theoretical research is giving way to practical developments. Nonetheless, this transition requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date, only isolated implementations and focused comparisons of compressed indexes have been reported, and they lacked a common API, which prevented their re-use or deployment within other applications. The goal of this article is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza & Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective test-beds and scripts for their automatic validation and testing. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel algorithmic technology.
Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems-Pattern matching, computations on discrete structures, sorting and searching; H.2.1 [Database Management]: Physical Design-Access methods; H.3.2 [Information Storage and Retrieval]: Information Storage-File organization; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval-Search process
General Terms: Algorithms
Additional Key Words and Phrases: Text indexing, text compression, data structures, data storage representation, coding and information theory, indexing methods, textual databases, bioinformatics databases.
Ferragina, Paolo; Gonzalez, R.; Navarro, G.; Venturini, Rossano

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/131360