From Unstructured Documents to Annotated Information: An Optimized Pipeline to Process Industrial Requirements

Nocente Arianna; Pannocchia Gabriele; Rossetti Giulio
2024-01-01

Abstract

Managing bid documentation in large, evolving technology companies is inherently complex, often because of inconsistencies introduced by translations, file updates, and manual data extraction. These processes involve multiple departments, including software, hardware, products, infrastructure, materials, and regulations, and require collaboration across geographically distributed teams with different native languages. The complexity is compounded by the need to trace requirements from bid offers through to code and product development, and to perform similarity analysis when needed. Unstructured information comes from diverse sources, such as scans and editable texts with tables and images, written in various languages and using domain-specific terminology. Manual processing is error-prone, and translating data can lead to the loss of context-specific meaning or to issues in safety-critical domains. This study combines Natural Language Processing (NLP) and Optical Character Recognition (OCR) to classify data as "information" or "requirement" while preserving multilingualism. A dual-pipeline approach is developed, featuring both a meta-classifier (an ensemble of Logistic Regression, Support Vector Machine, Multinomial Naive Bayes, and Random Forest) for robust and interpretable results, and a BERT model for capturing subtle linguistic patterns. The proposed pipeline is validated on a real-world case study in railway requirement annotation. Additionally, to demonstrate the methodology's flexibility, a second case study is conducted on topic classification of newspaper articles using publicly accessible data. The pipeline's output is a software solution that uses pre-trained models tailored to the respective domains. Future developments will include a graphical user interface (GUI) that enables distributed users to search and update their requirements easily and efficiently, and to extract custom PDFs processed with a translator and OCR.
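
As an illustration of the meta-classifier branch described in the abstract, the following is a minimal sketch that assumes the ensemble is combined by stacking over TF-IDF features using scikit-learn. The abstract does not say how the four base models are combined or which features are used, so the stacking scheme, the logistic-regression final estimator, the helper name `build_meta_classifier`, and the toy sentences are all hypothetical and not taken from the study.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy sentences for illustration only (not data from the study).
texts = [
    "The system shall log every braking event.",
    "The supplier must provide a maintenance manual.",
    "The door controller shall respond within 200 ms.",
    "All software must comply with the applicable standard.",
    "This chapter describes the history of the line.",
    "The depot is located near the central station.",
    "Figure 3 shows an overview of the network.",
    "The project started several years ago.",
]
labels = ["requirement"] * 4 + ["information"] * 4


def build_meta_classifier():
    """Hypothetical helper: stack the four base models named in the abstract."""
    base_models = [
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", LinearSVC()),
        ("mnb", MultinomialNB()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
    # Assumption: a logistic-regression final estimator learns from the
    # cross-validated outputs of the base models (cv=2 because the toy set is tiny).
    stack = StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(),
        cv=2,
    )
    # TF-IDF turns raw sentences into non-negative sparse features,
    # which all four base models accept.
    return make_pipeline(TfidfVectorizer(), stack)


model = build_meta_classifier()
model.fit(texts, labels)
predictions = model.predict([
    "The signalling unit shall restart automatically.",
    "The station was renovated recently.",
])
print(list(predictions))
```

With such a small toy set the predicted labels carry no meaning; the sketch only shows how the four base classifiers could feed a single meta-level decision.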

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/1321687