Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Manzini G.
2019-01-01

Abstract

While short read aligners, which predominantly use the FM-index, can easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the two main components of the FM-index in more detail: first, a rank data structure over the Burrows-Wheeler Transform (BWT) of the string, which allows us to find the interval in the string's suffix array (SA) containing pointers to the starting positions of occurrences of a given pattern; second, a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases by run-length compressing the BWT, but until recently there was no known means of keeping the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building Gagie et al.'s SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to index partial and whole human genomes and show that it improves over Bowtie with respect to both memory and time.
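To make the two FM-index components concrete, the following is a minimal illustrative sketch, not the construction method of the paper. It assumes a tiny toy string, builds the full SA naively by sorting suffixes (instead of sampling it), and answers rank queries by scanning the BWT (instead of using a compressed rank structure). All function names here are hypothetical and chosen only for this example.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffixes themselves.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    # BWT[i] is the character preceding the i-th smallest suffix (wrapping around).
    return "".join(text[i - 1] for i in sa)


def run_lengths(bwt):
    # Collapse the BWT into (character, run length) pairs.
    runs = []
    for ch in bwt:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, n) for c, n in runs]


def backward_search(bwt, pattern):
    # C[c] = number of characters in the text strictly smaller than c.
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def rank(c, i):
        # Occurrences of c in bwt[:i]; a linear scan stands in for a real rank structure.
        return bwt[:i].count(c)

    lo, hi = 0, len(bwt)  # half-open SA interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0, 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi


# Toy input (an assumption for illustration); "$" is a terminator smaller than any base.
text = "GATTACAGATTACA$"
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
lo, hi = backward_search(bwt, "TTA")
positions = sorted(sa[lo:hi])  # SA entries in the interval are the match positions
print(positions, run_lengths(bwt))
```

In this toy setting the SA interval for "TTA" maps to text positions 2 and 9. The sketch stores every SA entry, which is exactly what does not scale: a practical index keeps only a sample of the SA and recovers the remaining entries through the rank structure.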
2019
978-3-030-17082-0

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1097601