Degenerate strings (DS) and elastic degenerate strings (EDS) are a way to represent, in a compact form, strings that contain a high degree of similarity. They can be particularly useful in some fields, such as text processing or the study of DNA mutations in computational biology, where it is necessary to efficiently manage several variations of a sequence. In practice, a degenerate string is a string whose symbols, called degenerate, can have several alternatives (hence a degenerate symbol is a set). In the literature different constraints have been imposed on degenerate string symbols. For example, the symbol can only be i) a set of letters of the alphabet, ii) a set of strings of the same length, or iii) a set of strings of variable length (including the empty string). We consider the latter in its most general form, which is known as elastic degenerate strings. Our contribution is the introduction of the Burrows-Wheeler transform of an elastic-degenerate string (EDS-BWT). We show that EDS-BWT is reversible and that it can be used to solve the pattern matching problem, i.e., the problem of finding a standard string pattern within an EDS, by adapting the inner properties of the classical Burrows-Wheeler transform. Finally, we implemented the EDS-BWT encoding/decoding and the prototype edsBWTSearch to experimentally compare our pattern matching approach to other existing tools managing elastic degenerate strings.

The Burrows-Wheeler Transform of an Elastic-Degenerate String

Cioni L.;Guerrini V.;Rosone G.
2024-01-01

Abstract

Degenerate strings (DS) and elastic degenerate strings (EDS) are a way to represent, in a compact form, strings that contain a high degree of similarity. They can be particularly useful in some fields, such as text processing or the study of DNA mutations in computational biology, where it is necessary to efficiently manage several variations of a sequence. In practice, a degenerate string is a string whose symbols, called degenerate, can have several alternatives (hence a degenerate symbol is a set). In the literature different constraints have been imposed on degenerate string symbols. For example, the symbol can only be i) a set of letters of the alphabet, ii) a set of strings of the same length, or iii) a set of strings of variable length (including the empty string). We consider the latter in its most general form, which is known as elastic degenerate strings. Our contribution is the introduction of the Burrows-Wheeler transform of an elastic-degenerate string (EDS-BWT). We show that EDS-BWT is reversible and that it can be used to solve the pattern matching problem, i.e., the problem of finding a standard string pattern within an EDS, by adapting the inner properties of the classical Burrows-Wheeler transform. Finally, we implemented the EDS-BWT encoding/decoding and the prototype edsBWTSearch to experimentally compare our pattern matching approach to other existing tools managing elastic degenerate strings.
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11568/1281467
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact