From text to data: a NLP approach to digital archaeology
Nevio Dubbini (second author); Gabriele Gattiglia (last author)
2019-01-01
Abstract
Extracting data from archaeological texts, from grey literature to journal papers, is one of archaeology's foremost challenges. In recent years, Natural Language Processing (NLP) has also been adopted in the archaeological domain, but robust results are still a long way off. This work is part of a larger project on the extraction, visualisation and analysis of text data, carried out by MAPPA Lab, a digital archaeology lab of the University of Pisa, together with the Italian NLP Laboratory of the Institute for Computational Linguistics "A. Zampolli" (http://www.italianlp.it/). Its aim is to set up a procedure, as automatic as possible, to overcome one of the main barriers to data accessibility: digitising data in a form that allows them to be processed. We developed a semi-automated workflow that extracts text from PDF files and processes the data into a previously designed RDBMS. The extraction of data about location, date, authors, bibliography, archaeological findings and chronology was tested on about 1300 short communication papers (about 120000 text lines) published in the Italian journal of medieval archaeology Archeologia Medievale from 1974 to 2017. A formalised vocabulary of archaeological terms was developed first; text extraction and NLP algorithms were then applied to detect, tag and insert the extracted data into the database. The method can be applied to any source that calls for similar research activities. Moreover, the retrieved data are digital, accessible and reusable.

File | Size | Format |
---|---|---|
CAA2019_programabstracts_v20190423-31.pdf (open access). Description: Abstract. Type: final published version. Licence: Creative Commons | 61.03 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.