The Burrows-Wheeler transform (BWT) is a famous text transformation that rearranges the symbols of the input strings so that occurrences of a same symbol tend to occur in runs. The number of runs is an important parameter in the BWT output string, historically associated with its high compressibility and more recently used as a measure for the space complexity of efficient data structures. It is a known fact that reordering the strings in the input collection S affects the number of runs in the output string bwt(S) produced by applying the BWT to the string collection. In this paper, we define a class of transformed strings where symbols in particular blocks of the bwt(S) can be reordered according to a different adaptive alphabet order. Then, we introduce new heuristics to reduce the number of runs in the BWT output of a string collection that improve on the two existing heuristics introduced in Cox et al. [7]. These new heuristics are computed when applying the BWT to a string collection assuming no a priori order on the input strings and without requiring any pre- and/or post- processing of the collection S or of the BWT string. In this paper, we also face the problem of reconstructing the input collection S from the string bwt(S) together with the string permutation realized when applying an alphabetical reordering of symbols during the construction of bwt(S).
A Class of Heuristics for Reducing the Number of BWT-Runs in the String Ordering Problem
Guerrini V.
;Rosone G.
2024-01-01
Abstract
The Burrows-Wheeler transform (BWT) is a famous text transformation that rearranges the symbols of the input strings so that occurrences of a same symbol tend to occur in runs. The number of runs is an important parameter in the BWT output string, historically associated with its high compressibility and more recently used as a measure for the space complexity of efficient data structures. It is a known fact that reordering the strings in the input collection S affects the number of runs in the output string bwt(S) produced by applying the BWT to the string collection. In this paper, we define a class of transformed strings where symbols in particular blocks of the bwt(S) can be reordered according to a different adaptive alphabet order. Then, we introduce new heuristics to reduce the number of runs in the BWT output of a string collection that improve on the two existing heuristics introduced in Cox et al. [7]. These new heuristics are computed when applying the BWT to a string collection assuming no a priori order on the input strings and without requiring any pre- and/or post- processing of the collection S or of the BWT string. In this paper, we also face the problem of reconstructing the input collection S from the string bwt(S) together with the string permutation realized when applying an alphabetical reordering of symbols during the construction of bwt(S).I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.