CINECA IRIS Institutional Research Information System

From text to data: a NLP approach to digital archaeology

Elisa Paperini; Nevio Dubbini; Gabriele Gattiglia

2019-01-01

Abstract

Extracting data from archaeological texts (from grey literature to journal papers) is one of archaeology's foremost challenges. In recent years, Natural Language Processing (NLP) has also been adopted in the archaeological domain, but robust results are still far off. This work is part of a larger project on the extraction, visualisation and analysis of text data, carried out by MAPPA Lab, a digital archaeology lab of the University of Pisa, together with the Italian NLP Laboratory of the Institute for Computational Linguistics "A. Zampolli" (http://www.italianlp.it/). The aim is to set up a procedure, as automatic as possible, to overcome one of the main barriers to data accessibility: digitising data in a form that allows them to be processed. We developed a semi-automated workflow for extracting text from PDF files and processing the data into a previously designed RDBMS. The extraction of data about location, date, authors, bibliography, archaeological findings and chronology was tested on about 1,300 short communication papers (about 120,000 text lines) published in the Italian journal of Medieval Archaeology (Archeologia Medievale) from 1974 to 2017. A formalised vocabulary of archaeological terms was developed first; text extraction and NLP algorithms were then applied to detect, tag and insert the extracted data into the database. The method can be applied to any source that calls for similar research activities. Moreover, the retrieved data are digital, accessible and reusable.
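The abstract does not specify the algorithms or the database schema, so the following is only a minimal sketch of the detect, tag and insert step under stated assumptions: a plain dictionary lookup stands in for the NLP tagging, an in-memory SQLite database stands in for the RDBMS, and the vocabulary entries, table layout and the names `VOCABULARY`, `tag_line` and `load_tags` are all hypothetical.

```python
import re
import sqlite3

# Hypothetical controlled vocabulary: surface form -> category.
# The real formalised vocabulary is not given in the abstract.
VOCABULARY = {
    "ceramica": "finding",
    "moneta": "finding",
    "sepoltura": "finding",
    "medievale": "chronology",
}

# One alternation over all vocabulary terms, matched as whole words.
TERM_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, VOCABULARY)) + r")\b", re.IGNORECASE
)


def tag_line(line):
    """Return (term, category) pairs for vocabulary terms found in a line."""
    return [
        (m.group(1).lower(), VOCABULARY[m.group(1).lower()])
        for m in TERM_RE.finditer(line)
    ]


def load_tags(conn, paper_id, lines):
    """Tag each text line and store the hits in a simple (assumed) table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tag ("
        "paper_id TEXT, line_no INTEGER, term TEXT, category TEXT)"
    )
    rows = [
        (paper_id, line_no, term, category)
        for line_no, line in enumerate(lines, start=1)
        for term, category in tag_line(line)
    ]
    conn.executemany("INSERT INTO tag VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Usage with two made-up text lines (not taken from the journal).
conn = sqlite3.connect(":memory:")
sample = [
    "Scavo di una sepoltura con ceramica medievale.",
    "Nessun ritrovamento.",
]
count = load_tags(conn, "paper-001", sample)
```

A lookup of this kind only finds exact surface forms; handling inflected forms, place names or dates would need the linguistic processing that the abstract attributes to the NLP stage.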

Year: 2019
ISBN: 978-83-948382-7-0
Appears in types: 4.2 Abstract in conference proceedings

Files in this item:

File	Size	Format
CAA2019_programabstracts_v20190423-31.pdf (open access; description: Abstract; type: final published version; licence: Creative Commons)	61.03 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1023242
