We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form. Our first compressed data structure retrieves the occ occurrences of a pattern P[1,p] within a text T[1,n] in O(p + occ log^{1+eps} n) time for any chosen eps, 0< eps <1. This data structure uses at most 5 n H_k(T) + o(n) bits of storage, where H_k(T) is the kth order empirical entropy of T. The space usage is Theta(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows-Wheeler Transform, and can be regarded as a compressed suffix array. Our second compressed data structure achieves O(p+occ) query time using O(n H_k(T) log^{eps} n) + o(n) bits of storage for any chosen eps, 0< eps <1. Therefore, it provides optimal output-sensitive query time using o(n log n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows-Wheeler Transform and the LZ78 algorithm.

Indexing compressed texts

FERRAGINA, PAOLO;G. MANZINI
2005-01-01

Abstract

We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form. Our first compressed data structure retrieves the occ occurrences of a pattern P[1,p] within a text T[1,n] in O(p + occ log^{1+eps} n) time for any chosen eps, 0< eps <1. This data structure uses at most 5 n H_k(T) + o(n) bits of storage, where H_k(T) is the kth order empirical entropy of T. The space usage is Theta(n) bits in the worst case and o(n) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows-Wheeler Transform, and can be regarded as a compressed suffix array. Our second compressed data structure achieves O(p+occ) query time using O(n H_k(T) log^{eps} n) + o(n) bits of storage for any chosen eps, 0< eps <1. Therefore, it provides optimal output-sensitive query time using o(n log n) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows-Wheeler Transform and the LZ78 algorithm.
2005
Ferragina, Paolo; Manzini, G.
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11568/92400
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 510
  • ???jsp.display-item.citation.isi??? 376
social impact