CINECA IRIS Institutional Research Information System

How to harvest Word Combinations from corpora. Methods, evaluation and perspectives

Lenci, Alessandro; Masini, Francesca; Nissim, Malvina; Castagnoli, Sara; Lebani, Gianluca; Passaro, Lucia; Senaldi, Marco (all co-first authors)

2017-01-01

Abstract

This paper reports on work carried out within the CombiNet project on the automatic extraction of word combinations from large corpora, with a view to representing the full distributional profile of selected lemmas. We describe two extraction methods, based on part-of-speech sequences (P-method) and syntactic patterns (S-method) respectively. We evaluate their performance, both against each other and against external benchmarks, and discuss the relevance of automatic knowledge acquisition for lexicographic purposes. Our results indicate that both approaches provide valuable data and confirm previous claims that P-methods and S-methods are largely complementary, as they tend to retrieve different types of word combinations. In the second part of the paper, we present SYMPAThy, a data representation format devised to merge the two methods by leveraging their respective strengths. To explore SYMPAThy's potential, we also describe a preliminary investigation of a small set of Italian idioms, focusing on their degree of fixedness/productivity.

Year

2017

All authors

Lenci, Alessandro; Masini, Francesca; Nissim, Malvina; Castagnoli, Sara; Lebani, Gianluca; Passaro, Lucia; Senaldi, Marco

Files in this record:

File	Size	Format
LENCI et al..pdf (authorized users only; Description: main article; Type: publisher's final version; License: not public, private/restricted access)	2.07 MB	Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/892044
