WikiExtractor

ATTARDI, GIUSEPPE
2012-01-01

Abstract

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. It requires Python 2.7 and no additional libraries. The current version performs template expansion by preprocessing the whole dump and extracting template definitions. The code provides these performance features:
  • multiprocessing is used to process articles in parallel
  • a cache is kept of parsed templates

Use this identifier to cite or link to this document: https://hdl.handle.net/11568/773144