We study the problem of engineering space-time efficient indexes that support membership and lexicographic (rank) queries on very large static dictionaries of strings. Our solution is based on a very simple approach that consists of decoupling string storage and string indexing by means of a blockwise compression of the sorted dictionary strings (to be stored in external memory) and a succinct implementation of a Patricia trie (to be stored in internal memory) built on the first string of each block. Our experimental evaluation on two new datasets, which are at least one order of magnitude larger than the ones used in the literature, shows that (i) the state-of-the-art compressed string dictionaries (such as FST, PDT, CoCo-trie) do not provide significant benefits if used in an indexing setting compared to Patricia tries, and (ii) our two-level approach enables the indexing of 3.5 billion strings taking 273 GB in less than 200 MB of internal memory, which is available on any commodity machine, while still guaranteeing comparable or faster query performance than those offered by array-based solutions used in modern storage systems, such as RocksDB, thus possibly influencing their future designs.

Engineering a Textbook Approach to Index Massive String Dictionaries

Ferragina, Paolo;Rotundo, Mariagiovanna;Vinciguerra, Giorgio
2023-01-01

Abstract

We study the problem of engineering space-time efficient indexes that support membership and lexicographic (rank) queries on very large static dictionaries of strings. Our solution is based on a very simple approach that consists of decoupling string storage and string indexing by means of a blockwise compression of the sorted dictionary strings (to be stored in external memory) and a succinct implementation of a Patricia trie (to be stored in internal memory) built on the first string of each block. Our experimental evaluation on two new datasets, which are at least one order of magnitude larger than the ones used in the literature, shows that (i) the state-of-the-art compressed string dictionaries (such as FST, PDT, CoCo-trie) do not provide significant benefits if used in an indexing setting compared to Patricia tries, and (ii) our two-level approach enables the indexing of 3.5 billion strings taking 273 GB in less than 200 MB of internal memory, which is available on any commodity machine, while still guaranteeing comparable or faster query performance than those offered by array-based solutions used in modern storage systems, such as RocksDB, thus possibly influencing their future designs.
2023
978-3-031-43979-7
978-3-031-43980-3
File in questo prodotto:
File Dimensione Formato  
Engineering a Textbook Approach to Index Massive String Dictionaries.pdf

accesso aperto

Tipologia: Documento in Post-print
Licenza: Tutti i diritti riservati (All rights reserved)
Dimensione 414.7 kB
Formato Adobe PDF
414.7 kB Adobe PDF Visualizza/Apri
Textbook approach.pdf

non disponibili

Tipologia: Versione finale editoriale
Licenza: NON PUBBLICO - accesso privato/ristretto
Dimensione 3.79 MB
Formato Adobe PDF
3.79 MB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11568/1202667
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 2
  • ???jsp.display-item.citation.isi??? 2
social impact