Parallel Lossy Compression for Large FASTQ Files

Guerrini, V.; Louza, F. A.; Rosone, G.

doi:10.1007/978-3-031-38854-5_6

In this paper we present a parallel version of the BFQzip algorithm, which we introduced in [Guerrini et al., BIOSTEC – BIOINFORMATICS 2022]. BFQzip modifies the bases and quality scores components of a FASTQ file, taking both kinds of information into account at the same time, while preserving variant calling. The resulting FASTQ file achieves better compression than the original data. Here, we introduce a strategy that splits the FASTQ file into t blocks and processes them independently and in parallel using the BFQzip algorithm. The resulting blocks, with modified bases and smoothed qualities, are merged (in order) and compressed. We show that our strategy can improve the compression ratio of large FASTQ files by taking advantage of the redundancy of reads. When the file is split into blocks, reads belonging to the same portion of the genome could end up in different blocks. Therefore, we analyze how reordering reads before splitting the input FASTQ can improve the compression ratio as the number of threads increases. We also propose a paired-end mode that exploits paired-end information by processing blocks of FASTQ files in pairs. Availability: The software is freely available at https://github.com/veronicaguerrini/BFQzip
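The split, process, merge and compress flow described in the abstract can be illustrated with a minimal Python sketch. It is not the BFQzip implementation: `process_block` is a hypothetical stand-in that returns each block unchanged, the function names are invented for illustration, and `zlib` is used only as a placeholder for the final compressor. Threads are used to keep the example short; they show the ordering logic, not real CPU speedups.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def read_records(text):
    """Group FASTQ text into 4-line records (header, bases, '+', qualities)."""
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must have a multiple of 4 lines")
    return [lines[i:i + 4] for i in range(0, len(lines), 4)]


def split_blocks(records, t):
    """Split records into at most t contiguous blocks of near-equal size."""
    size, extra = divmod(len(records), t)
    blocks, start = [], 0
    for i in range(t):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty blocks when t exceeds the record count
            blocks.append(records[start:end])
        start = end
    return blocks


def process_block(block):
    """Hypothetical stand-in for running BFQzip on one block.

    Assumption: the real tool modifies bases and smooths qualities;
    this placeholder returns a copy of the block unchanged.
    """
    return [list(record) for record in block]


def parallel_pipeline(text, t):
    """Split into t blocks, process them in parallel, merge in order, compress."""
    blocks = split_blocks(read_records(text), t)
    with ThreadPoolExecutor(max_workers=t) as pool:
        # Executor.map yields results in input order, so the merge is ordered.
        processed = list(pool.map(process_block, blocks))
    merged = "\n".join("\n".join(rec) for block in processed for rec in block) + "\n"
    return merged, zlib.compress(merged.encode("ascii"))
```

Because the placeholder leaves the reads untouched, calling `parallel_pipeline(text, t)` on a well-formed FASTQ string returns the input text itself together with its compressed bytes; with a real per-block transformation, only `process_block` would change.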

2023-01-01


Year

2023

ISBN

978-3-031-38853-8
978-3-031-38854-5


Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1212471
