Engineering a Textbook Approach to Index Massive String Dictionaries

Ferragina, Paolo; Rotundo, Mariagiovanna; Vinciguerra, Giorgio

doi:10.1007/978-3-031-43980-3_16

We study the problem of engineering space-time efficient indexes that support membership and lexicographic (rank) queries on very large static dictionaries of strings. Our solution is based on a very simple approach that consists of decoupling string storage and string indexing by means of a blockwise compression of the sorted dictionary strings (to be stored in external memory) and a succinct implementation of a Patricia trie (to be stored in internal memory) built on the first string of each block. Our experimental evaluation on two new datasets, which are at least one order of magnitude larger than the ones used in the literature, shows that (i) the state-of-the-art compressed string dictionaries (such as FST, PDT, CoCo-trie) do not provide significant benefits if used in an indexing setting compared to Patricia tries, and (ii) our two-level approach enables the indexing of 3.5 billion strings taking 273 GB in less than 200 MB of internal memory, which is available on any commodity machine, while still guaranteeing comparable or faster query performance than those offered by array-based solutions used in modern storage systems, such as RocksDB, thus possibly influencing their future designs.
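The two-level layout described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: a plain sorted list searched with `bisect` stands in for the succinct Patricia trie, front coding is used as one possible blockwise compression, the "external memory" blocks are kept in an ordinary Python list, and rank is taken to mean the number of dictionary strings less than or equal to the query. The names `TwoLevelIndex`, `front_encode`, `front_decode` and the `block_size` parameter are hypothetical.

```python
from bisect import bisect_right
from os.path import commonprefix


def front_encode(block):
    """Front-code a sorted block: store each string as
    (length of the prefix shared with the previous string, remaining suffix)."""
    encoded, prev = [], ""
    for s in block:
        lcp = len(commonprefix([prev, s]))
        encoded.append((lcp, s[lcp:]))
        prev = s
    return encoded


def front_decode(encoded):
    """Rebuild the strings of a front-coded block, in order."""
    strings, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        strings.append(prev)
    return strings


class TwoLevelIndex:
    """Sketch of a two-level index over a sorted list of distinct strings."""

    def __init__(self, sorted_strings, block_size=4):
        self.block_size = block_size
        blocks = [sorted_strings[i:i + block_size]
                  for i in range(0, len(sorted_strings), block_size)]
        # Upper level (internal memory): the first string of each block.
        # Assumption: a sorted list replaces the succinct Patricia trie.
        self.heads = [b[0] for b in blocks]
        # Lower level (external memory in the paper, a list here):
        # the blocks themselves, compressed independently of each other.
        self.blocks = [front_encode(b) for b in blocks]

    def _locate(self, query):
        """Index of the only block that can contain query, or -1."""
        return bisect_right(self.heads, query) - 1

    def rank(self, query):
        """Number of dictionary strings that are <= query."""
        b = self._locate(query)
        if b < 0:
            return 0
        block = front_decode(self.blocks[b])  # one block access per query
        return b * self.block_size + bisect_right(block, query)

    def contains(self, query):
        """Membership test: decode one block and look for the query."""
        b = self._locate(query)
        return b >= 0 and query in front_decode(self.blocks[b])


words = sorted(["car", "card", "care", "cat", "dog",
                "door", "dot", "zebra", "zoo"])
index = TwoLevelIndex(words, block_size=4)
print(index.rank("cow"), index.contains("door"), index.contains("do"))
```

In this sketch only `heads` needs to stay in fast memory, and each query decodes a single block. That is the decoupling of string indexing from string storage that the abstract refers to.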

2023-01-01

Year: 2023
ISBN: 978-3-031-43979-7, 978-3-031-43980-3
Appears in types: 4.1 Conference proceedings contribution

Files in this record:

File	Access	Type	License	Size	Format
Engineering a Textbook Approach to Index Massive String Dictionaries.pdf	Open access	Post-print document	All rights reserved	414.7 kB	Adobe PDF
Textbook approach.pdf	Not available	Publisher's final version	Not public (private/restricted access)	3.79 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1202667

CINECA IRIS Institutional Research Information System
