From text to data: a NLP approach to digital archaeology

Nevio Dubbini (second author); Gabriele Gattiglia (last author)
2019-01-01

Abstract

Extracting data from archaeological texts, from grey literature to journal papers, is one of archaeology's foremost challenges. In recent years Natural Language Processing (NLP) has also been adopted in the archaeological domain, but robust results are still far off. This work is part of a broader project on the extraction, visualisation and analysis of text data, carried out by MAPPA Lab, a digital archaeology lab of the University of Pisa, together with the Italian NLP Laboratory of the Institute for Computational Linguistics "A. Zampolli" (http://www.italianlp.it/). Its aim is to set up a procedure that is as automatic as possible to overcome one of the main barriers to data accessibility: digitising data in a form that allows them to be processed. We developed a semi-automated workflow for extracting text from PDF files and processing the data into a previously designed RDBMS. The extraction of data on location, date, authors, bibliography, archaeological findings and chronology was tested on about 1,300 short communication papers (about 120,000 text lines) published in the Italian journal Archeologia Medievale (Medieval Archaeology) from 1974 to 2017. A formalised vocabulary of archaeological terms was developed first; text extraction and NLP algorithms were then applied to detect and tag the data and insert them into the database. The same method can be applied to other sources that call for similar research activities. Moreover, the retrieved data are digital, accessible and reusable.
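The tagging and insertion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary entries, the `tag` table layout and the function names `tag_line` and `store_tags` are all hypothetical, and plain whole-word matching stands in for the NLP algorithms, which the abstract does not specify. SQLite is used only as a convenient stand-in for the RDBMS.

```python
import re
import sqlite3

# Hypothetical mini-vocabulary (term -> category); the project's real
# formalised vocabulary is not given in the abstract.
VOCABULARY = {
    "ceramica": "finding",
    "moneta": "finding",
    "sepoltura": "finding",
    "xii secolo": "chronology",
}


def tag_line(line, vocabulary):
    """Return (term, category) pairs for vocabulary terms found in a line."""
    lowered = line.lower()
    hits = []
    for term, category in vocabulary.items():
        # Whole-word match, so "moneta" does not match inside "monetario".
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((term, category))
    return hits


def store_tags(conn, paper_id, lines, vocabulary):
    """Tag every line of one paper and insert the hits into an assumed table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tag "
        "(paper_id TEXT, line_no INTEGER, term TEXT, category TEXT)"
    )
    rows = []
    for line_no, line in enumerate(lines, start=1):
        for term, category in tag_line(line, vocabulary):
            rows.append((paper_id, line_no, term, category))
    conn.executemany("INSERT INTO tag VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

In this sketch each paper is passed as a list of already extracted text lines, so the PDF extraction stage is assumed to have run beforehand. Keeping the line number with each tag makes it possible to trace a database record back to its place in the source text.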
2019
978-83-948382-7-0
Files in this item:
CAA2019_programabstracts_v20190423-31.pdf
Access: open access
Description: Abstract
Type: Final published version
License: Creative Commons
Size: 61.03 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1023242