Entity-linking is a natural-language-processing task that consists in identifying strings of text that refer to a particular item in some reference knowledge base. When the knowledge base is Wikipedia, the problem is also referred to as wikification (in this case, items are Wikipedia articles). Entity-linking consists conceptually of many different phases: identifying the portions of text that may refer to an entity (sometimes called "entity detection"), determining a set of concepts (candidates) from the knowledge base that may match each such portion, and choosing one candidate for each set; the latter step, known as candidate selection, is the phase on which this paper focuses. One instance of candidate selection can be formalized as an optimization problem on the underlying concept graph, where the quantity to be optimized is the average distance between the selected items. Inspired by this application, we define a new graph problem which is a natural variant of the Maximum Capacity Representative Set. We prove that our problem is NP-hard for general graphs; we propose several heuristics trying to optimize similar easier objective functions; we show experimentally how these approaches perform with respect to some baselines on a real-world dataset. Finally, in the appendix, we show an exact linear time algorithm that works under some more restrictive assumptions.

Using graph distances for named-entity linking

MARINO, ANDREA
In corso di stampa

Abstract

Entity-linking is a natural-language-processing task that consists in identifying strings of text that refer to a particular item in some reference knowledge base. When the knowledge base is Wikipedia, the problem is also referred to as wikification (in this case, items are Wikipedia articles). Entity-linking consists conceptually of many different phases: identifying the portions of text that may refer to an entity (sometimes called "entity detection"), determining a set of concepts (candidates) from the knowledge base that may match each such portion, and choosing one candidate for each set; the latter step, known as candidate selection, is the phase on which this paper focuses. One instance of candidate selection can be formalized as an optimization problem on the underlying concept graph, where the quantity to be optimized is the average distance between the selected items. Inspired by this application, we define a new graph problem which is a natural variant of the Maximum Capacity Representative Set. We prove that our problem is NP-hard for general graphs; we propose several heuristics trying to optimize similar easier objective functions; we show experimentally how these approaches perform with respect to some baselines on a real-world dataset. Finally, in the appendix, we show an exact linear time algorithm that works under some more restrictive assumptions.
In corso di stampa
Blanco, Roi; Boldi, Paolo; Marino, Andrea
File in questo prodotto:
File Dimensione Formato  
main.pdf

accesso aperto

Tipologia: Documento in Pre-print
Licenza: Creative commons
Dimensione 536.31 kB
Formato Adobe PDF
536.31 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11568/790903
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 5
  • ???jsp.display-item.citation.isi??? ND
social impact