How to harvest Word Combinations from corpora. Methods, evaluation and perspectives

Lenci, Alessandro (co-first author); Passaro, Lucia (co-first author)
2017-01-01

Abstract

This paper reports on work carried out within the CombiNet project on the automatic extraction of word combinations from large corpora, with a view to representing the full distributional profile of selected lemmas. We describe two extraction methods, based on part-of-speech sequences (P-method) and on syntactic patterns (S-method) respectively. We evaluate their performance, both against each other and against external benchmarks, and discuss the relevance of automatic knowledge acquisition for lexicographic purposes. Our results indicate that both approaches provide valuable data and confirm previous claims that P-methods and S-methods are largely complementary, as they tend to retrieve different types of word combinations. In the second part of the paper, we present SYMPAThy, a data representation format devised to merge the two methods by leveraging their respective strengths. To explore SYMPAThy's potential, we also describe a preliminary investigation of a small set of Italian idioms, focusing on their degree of fixedness/productivity.
2017
Lenci, Alessandro; Masini, Francesca; Nissim, Malvina; Castagnoli, Sara; Lebani, Gianluca; Passaro, Lucia; Senaldi, Marco
Files in this item:
File: LENCI et al..pdf
Access: authorized users only
Description: Main article
Type: Publisher's final version
License: NOT PUBLIC - Private/restricted access
Size: 2.07 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/892044