Degenerate strings (DS) and elastic degenerate strings (EDS) are a way to represent, in a compact form, strings that contain a high degree of similarity. They can be particularly useful in some fields, such as text processing or the study of DNA mutations in computational biology, where it is necessary to efficiently manage several variations of a sequence. In practice, a degenerate string is a string whose symbols, called degenerate, can have several alternatives (hence a degenerate symbol is a set). In the literature different constraints have been imposed on degenerate string symbols. For example, the symbol can only be i) a set of letters of the alphabet, ii) a set of strings of the same length, or iii) a set of strings of variable length (including the empty string). We consider the latter in its most general form, which is known as elastic degenerate strings. Our contribution is the introduction of the Burrows-Wheeler transform of an elastic-degenerate string (EDS-BWT). We show that EDS-BWT is reversible and that it can be used to solve the pattern matching problem, i.e., the problem of finding a standard string pattern within an EDS, by adapting the inner properties of the classical Burrows-Wheeler transform. Finally, we implemented the EDS-BWT encoding/decoding and the prototype edsBWTSearch to experimentally compare our pattern matching approach to other existing tools managing elastic degenerate strings.
The Burrows-Wheeler Transform of an Elastic-Degenerate String
Cioni L.;Guerrini V.;Rosone G.
2024-01-01
Abstract
Degenerate strings (DS) and elastic degenerate strings (EDS) are a way to represent, in a compact form, strings that contain a high degree of similarity. They can be particularly useful in some fields, such as text processing or the study of DNA mutations in computational biology, where it is necessary to efficiently manage several variations of a sequence. In practice, a degenerate string is a string whose symbols, called degenerate, can have several alternatives (hence a degenerate symbol is a set). In the literature different constraints have been imposed on degenerate string symbols. For example, the symbol can only be i) a set of letters of the alphabet, ii) a set of strings of the same length, or iii) a set of strings of variable length (including the empty string). We consider the latter in its most general form, which is known as elastic degenerate strings. Our contribution is the introduction of the Burrows-Wheeler transform of an elastic-degenerate string (EDS-BWT). We show that EDS-BWT is reversible and that it can be used to solve the pattern matching problem, i.e., the problem of finding a standard string pattern within an EDS, by adapting the inner properties of the classical Burrows-Wheeler transform. Finally, we implemented the EDS-BWT encoding/decoding and the prototype edsBWTSearch to experimentally compare our pattern matching approach to other existing tools managing elastic degenerate strings.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.