Given a sequence S = s1s2..sn of integers smaller than r = O(polylog(n)), we show how S can be represented using n H_0(S) + o(n) bits, so that we can know any character at position q, as well as answer rank and select queries on S, in constant time. H_0(S) is the zero-order empirical entropy of S and n H_0(S) provides an information-theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in n H_0(S) + o(n log r) bits and answer queries in O(log r/log log n) time. Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet Sigma. Specifically, we design a variant of the FM-index that indexes a string T[1, n] within n H_k(T) + o(n) bits of storage, where H_k(T) is the kth-order empirical entropy of T. This space bound holds simultaneously for all k <= alpha log_{Sigma} n, constant 0 < alpha < 1, and |Sigma| = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log^{1+eps} n) time for any constant 0 < eps < 1; and reports a text substring of length L in O(L + log^{1+eps} n) time. Compared to all previous works, our index is the first that removes the alphabet-size dependance from all query times, in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space of the kth-order entropy of the text T, which is the best space obtained in previous work. We can also handle larger alphabets of size |Sigma| = O(n^{beta}), for any 0 < beta < 1, by paying o(n log|Sigma|) extra space and multiplying all query times by O(log |Sigma|/log log n).

Compressed Representations of sequences and full-text indexes

FERRAGINA, PAOLO;G. MANZINI;
2007

Abstract

Given a sequence S = s1s2..sn of integers smaller than r = O(polylog(n)), we show how S can be represented using n H_0(S) + o(n) bits, so that we can know any character at position q, as well as answer rank and select queries on S, in constant time. H_0(S) is the zero-order empirical entropy of S and n H_0(S) provides an information-theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in n H_0(S) + o(n log r) bits and answer queries in O(log r/log log n) time. Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet Sigma. Specifically, we design a variant of the FM-index that indexes a string T[1, n] within n H_k(T) + o(n) bits of storage, where H_k(T) is the kth-order empirical entropy of T. This space bound holds simultaneously for all k <= alpha log_{Sigma} n, constant 0 < alpha < 1, and |Sigma| = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log^{1+eps} n) time for any constant 0 < eps < 1; and reports a text substring of length L in O(L + log^{1+eps} n) time. Compared to all previous works, our index is the first that removes the alphabet-size dependance from all query times, in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space of the kth-order entropy of the text T, which is the best space obtained in previous work. We can also handle larger alphabets of size |Sigma| = O(n^{beta}), for any 0 < beta < 1, by paying o(n log|Sigma|) extra space and multiplying all query times by O(log |Sigma|/log log n).
Ferragina, Paolo; Manzini, G.; Makinen, V.; Navarro, G.
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/11568/115552
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 249
  • ???jsp.display-item.citation.isi??? 145
social impact