CINECA IRIS Institutional Research Information System

Schema Inference for Massive JSON Datasets, extended abstract

Baazizi, Mohamed Amine; Ben Lahmar, Houssem; Colazzo, Dario; Ghelli, Giorgio; Sartiani, Carlo

2016-01-01

Abstract

Recent years have seen the widespread adoption of JSON as a data format for representing massive data collections that are managed and analysed by crucial applications. JSON data collections are usually schemaless, which allows flexible data management. However, the absence of schema information has several disadvantages: the correctness of complex queries and programs cannot be statically checked, users have no way to discover structural properties of the underlying data, and, more generally, schema-based optimisations cannot be applied. In this paper we address the problem of inferring a schema from massive JSON datasets. Our first contribution is the identification and definition of a JSON type language that offers a good compromise between simplicity and the ability to capture complex structural properties of the input data. Our second contribution is the design of a schema inference algorithm and its implementation on Spark, which ensures a reasonable schema inference time for massive collections. Finally, we report on a preliminary experimental analysis showing the effectiveness of our approach in terms of the precision and conciseness of the inferred schemas.

Year

2016

Appears in categories:

4.1 Contribution in conference proceedings

Files in this item:

There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/840263

Warning: the data shown have not been validated by the university.
