In this paper we present a parallel version for the algorithm BFQzip, we introduced in [Guerrini et al., BIOSTEC – BIOINFORMATICS 2022], that modifies the bases and quality scores components taking into account both information at the same time, while preserving variant calling. The resulting FASTQ file achieves better compression than the original data. Here, we introduce a strategy that splits the FASTQ file into t blocks and processes them in parallel independently by using the BFQzip algorithm. The resulting blocks with the modified bases and smoothed qualities are merged (in order) and compressed. We show that our strategy can improve the compression ratio of large FASTQ files by taking advantage of the redundancy of reads. When splitting into blocks, the reads belonging to the same portion of the genome could end up in different blocks. Therefore, we analyze how reordering reads before splitting the input FASTQ can improve the compression ratio as the number of threads increases. We also propose a paired-end mode that allows to exploit the paired-end information by processing blocks of FASTQ files in pairs. Availability: The software is freely available at https://github.com/veronicaguerrini/BFQzip

Parallel Lossy Compression for Large FASTQ Files

Guerrini V.;Rosone G.
2023-01-01

Abstract

In this paper we present a parallel version for the algorithm BFQzip, we introduced in [Guerrini et al., BIOSTEC – BIOINFORMATICS 2022], that modifies the bases and quality scores components taking into account both information at the same time, while preserving variant calling. The resulting FASTQ file achieves better compression than the original data. Here, we introduce a strategy that splits the FASTQ file into t blocks and processes them in parallel independently by using the BFQzip algorithm. The resulting blocks with the modified bases and smoothed qualities are merged (in order) and compressed. We show that our strategy can improve the compression ratio of large FASTQ files by taking advantage of the redundancy of reads. When splitting into blocks, the reads belonging to the same portion of the genome could end up in different blocks. Therefore, we analyze how reordering reads before splitting the input FASTQ can improve the compression ratio as the number of threads increases. We also propose a paired-end mode that allows to exploit the paired-end information by processing blocks of FASTQ files in pairs. Availability: The software is freely available at https://github.com/veronicaguerrini/BFQzip
2023
978-3-031-38853-8
978-3-031-38854-5
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11568/1212471
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact