Schema Inference for Massive JSON Datasets, extended abstract
GHELLI, GIORGIO;
2016-01-01
Abstract
Recent years have seen the widespread adoption of JSON as a data format for representing massive data collections that are managed and analysed by crucial applications. JSON data collections are usually schemaless, which allows data to be managed flexibly. However, the absence of schema information has several disadvantages: the correctness of complex queries and programs cannot be statically checked, users have no way to discover structural properties of the underlying data, and, more generally, schema-based optimisations cannot be applied. In this paper we address the problem of inferring a schema from massive JSON datasets. Our first contribution is the identification and definition of a JSON type language that is a good compromise between simplicity and the ability to capture complex structural properties of the input data. Our second contribution is the design of a schema inference algorithm and its implementation on Spark, which keeps schema inference time reasonable for massive collections. Finally, we report on a preliminary experimental analysis showing the effectiveness of our approach in terms of the precision and conciseness of the inferred schemas.
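To make the idea of schema inference concrete, the following is a minimal illustrative sketch, not the algorithm or the type language of the paper, neither of which is specified in this abstract. It assumes a two-step approach: each JSON value is first mapped to a simple structural type, and the types are then merged pairwise. The type representation (`"Num"`, `("Record", ...)`, `("Union", ...)`) and all function names (`infer`, `fuse`, `infer_schema`) are hypothetical and chosen only for illustration.

```python
import json
from functools import reduce

# Hypothetical type representation (an assumption, not the paper's language):
#   "Null" | "Bool" | "Num" | "Str"           basic types
#   ("Array", elem_type or None)              None marks an empty array
#   ("Record", {field: (type, is_optional)})
#   ("Union", [types])                        at most one alternative per kind


def kind(t):
    """Return the kind tag of a type: the string itself or the tuple head."""
    return t if isinstance(t, str) else t[0]


def alternatives(t):
    """Flatten a type into its list of union alternatives."""
    if t is None:
        return []
    return t[1] if kind(t) == "Union" else [t]


def fuse_same_kind(a, b):
    """Merge two types known to have the same kind."""
    if isinstance(a, str):
        return a
    if a[0] == "Array":
        return ("Array", fuse(a[1], b[1]))
    fa, fb = a[1], b[1]
    fields = {}
    for name in sorted(fa.keys() | fb.keys()):
        if name in fa and name in fb:
            (ta, oa), (tb, ob) = fa[name], fb[name]
            fields[name] = (fuse(ta, tb), oa or ob)
        else:
            # A field seen on one side only becomes optional.
            t, _ = fa[name] if name in fa else fb[name]
            fields[name] = (t, True)
    return ("Record", fields)


def fuse(a, b):
    """Merge two types, keeping one alternative per kind."""
    by_kind = {}
    for t in alternatives(a) + alternatives(b):
        k = kind(t)
        by_kind[k] = fuse_same_kind(by_kind[k], t) if k in by_kind else t
    alts = [by_kind[k] for k in sorted(by_kind)]
    if not alts:
        return None
    return alts[0] if len(alts) == 1 else ("Union", alts)


def infer(value):
    """Map one parsed JSON value to a structural type."""
    if value is None:
        return "Null"
    if isinstance(value, bool):  # checked before numbers: bool subclasses int
        return "Bool"
    if isinstance(value, (int, float)):
        return "Num"
    if isinstance(value, str):
        return "Str"
    if isinstance(value, list):
        return ("Array", reduce(fuse, map(infer, value), None))
    return ("Record", {k: (infer(v), False) for k, v in value.items()})


def infer_schema(lines):
    """Infer one type for a collection of JSON documents, one per line."""
    return reduce(fuse, (infer(json.loads(line)) for line in lines), None)


docs = ['{"a": 1, "b": "x"}', '{"a": null, "c": [1, "y"]}']
schema = infer_schema(docs)
```

In this sketch the merge step does not depend on the order in which documents are combined, which is the property that would let the same two functions be used in a distributed map and reduce (for example on Spark) instead of the sequential `reduce` shown here. How the paper actually trades precision against conciseness is not described in the abstract and is not modelled above.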