Introduction

The project uses the Italian Wikipedia as a source of documents for several purposes, including as training data and as a source of data to be annotated.

Each month the Wikipedia maintainers provide an XML dump of all documents in the database: a single XML file containing the whole encyclopedia, which can be used for various kinds of analysis, such as statistics, service lists, etc.

The Italian Wikipedia dumps are available from the Wikipedia database download page.

In order to perform text analysis, it is necessary to extract plain text from the documents by removing syntactic decorations (bold, italics, underlining, etc.).

The Wikipedia extractor tool generates plain text from a Wikipedia database dump, discarding any other information or annotation present in Wikipedia pages, such as images, tables, references and lists.

Each document in the dump of the encyclopedia is represented as a single XML element, encoded as illustrated in the following example from the document titled Armonium:

<page>
  <title>Armonium</title>
  <id>2</id>
  <timestamp>2008-06-22T21:48:55Z</timestamp>
  <username>Nemo bis</username>
  <comment>italiano</comment>
  <text xml:space="preserve">[[Immagine:Harmonium2.jpg|thumb|right|300 px]]
  
  L''''armonium'''' (in francese, ''harmonium'') è uno [[strumenti musicali|
  strumento musicale]] azionato con una [[tastiera (musica)|tastiera]], detta
  manuale. Sono stati costruiti anche alcuni armonium con due manuali.
  
  ==Armonium occidentale==
  Come l'[[organo (musica)|organo]], l'armonium è utilizzato tipicamente in
  [[chiesa (architettura)|chiesa]], per l'esecuzione di [[musica sacra]], ed è
  fornito di pochi registri, quando addirittura in certi casi non ne possiede
  nemmeno uno: il suo [[timbro (musica)|timbro]] è molto meno ricco di quello
  organistico e così pure la sua estensione.
  
  ...
  
  ==Armonium indiano==
  {{S sezione}}
  
  == Voci correlate ==
  *[[Musica]]
  *[[Generi musicali]]</text>
</page>

For this document the Wikipedia extractor produces the following plain text:

<doc id="2" url="http://it.wikipedia.org/wiki/Armonium">
Armonium.
L'armonium (in francese, “harmonium”) è uno strumento musicale azionato con
una tastiera, detta manuale. Sono stati costruiti anche alcuni armonium con
due manuali.

Armonium occidentale.
Come l'organo, l'armonium è utilizzato tipicamente in chiesa, per l'esecuzione
di musica sacra, ed è fornito di pochi registri, quando addirittura in certi
casi non ne possiede nemmeno uno: il suo timbro è molto meno ricco di quello
organistico e così pure la sua estensione.
...
</doc>
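
As a rough illustration of the mapping between the two formats above, the following Python sketch reads one <page> element and wraps its text in the <doc> format. It is not the extractor's actual code: the name page_to_doc and the clean hook are hypothetical, and the URL prefix is simply copied from the sample output. The sketch also assumes the element carries no XML namespace, which real dumps may declare.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote
from xml.sax.saxutils import quoteattr

# URL prefix copied from the sample output; an assumption for other wikis.
BASE_URL = "http://it.wikipedia.org/wiki/"


def page_to_doc(page_xml, clean=lambda text: text):
    """Wrap one <page> element in the <doc> format.

    `clean` is a hypothetical hook that turns wikitext into plain text;
    by default the text is passed through unchanged.
    """
    page = ET.fromstring(page_xml)
    title = page.findtext(".//title")
    doc_id = page.findtext(".//id")        # first <id> in document order
    text = page.findtext(".//text") or ""
    url = BASE_URL + quote(title.replace(" ", "_"))
    return "<doc id=%s url=%s>\n%s.\n%s\n</doc>" % (
        quoteattr(doc_id), quoteattr(url), title, clean(text).strip())


sample = """<page>
  <title>Armonium</title>
  <id>2</id>
  <text xml:space="preserve">L'armonium è uno strumento musicale.</text>
</page>"""

print(page_to_doc(sample))
```

Passing a real cleaning function as clean would replace the raw wikitext with plain text before it is wrapped.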

The extraction tool is implemented in Python and aims to achieve high accuracy in the extraction task.

The standard page format adopted by Wikipedia makes use of the wiki syntax, a simple and intuitive formalism for specifying meta-information associated with text (bold, italics, underlining, images, tables, etc.). Unfortunately, not every author follows this standard: some prefer to insert HTML markup directly into the documents. Moreover, wiki and HTML tags are often misused (unclosed tags, wrong attributes, etc.). The extractor therefore applies several heuristics to maximize the probability of a successful extraction. The main direction for future work is improving the accuracy of these heuristics.
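
To give an idea of what such heuristics can look like, here is a minimal regex-based sketch in Python. It is an assumption-laden illustration, not the rule set used by the extractor: the function name clean_wikitext is hypothetical, nested templates and nested links are not handled, and runs of four or more apostrophes (as in the Armonium sample) are not treated specially. HTML tags are removed one by one, so an unclosed tag does no harm.

```python
import re


def clean_wikitext(text):
    """Heuristically strip wiki and HTML markup (a rough sketch only)."""
    # Drop non-nested templates such as {{S sezione}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Drop image links such as [[Immagine:foo.jpg|thumb|right]].
    text = re.sub(r"\[\[(?:Immagine|Image|File):[^\[\]]*\]\]", "", text)
    # Keep only the anchor text of [[target|anchor]] and [[target]] links.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Bold ('''...''') is dropped; italics (''...'') become quoted text.
    text = re.sub(r"'''(.*?)'''", r"\1", text)
    text = re.sub(r"''(.*?)''", "\u201c\\1\u201d", text)
    # Remove HTML tags individually, whether or not they are balanced.
    text = re.sub(r"</?[a-zA-Z][^<>]*>", "", text)
    # Turn "==Heading==" lines into "Heading.".
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1.", text, flags=re.M)
    return text.strip()


print(clean_wikitext("Come l'[[organo (musica)|organo]], per [[musica sacra]]"))
```

The order of the substitutions matters: templates and image links are removed before generic links are resolved, otherwise the image parameters would leak into the text.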

Description

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. The output is stored in a number of files of similar size in a given directory. Each file contains several documents in the <doc> format shown above.
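
The grouping of documents into files of similar size can be sketched as follows. This is only an illustration of the idea, not the script's implementation: the name split_into_chunks is hypothetical, and reading the default of 500K as 500 * 1024 bytes is an assumption.

```python
def split_into_chunks(documents, max_bytes=500 * 1024):
    """Group document strings into chunks of at most `max_bytes` UTF-8 bytes.

    A single document larger than `max_bytes` is kept whole, in a chunk
    of its own, rather than being cut in the middle.
    """
    chunk, size = [], 0
    for doc in documents:
        doc_size = len(doc.encode("utf-8"))
        # Start a new chunk when adding this document would exceed the limit.
        if chunk and size + doc_size > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(doc)
        size += doc_size
    if chunk:
        yield chunk


for number, chunk in enumerate(split_into_chunks(["aaaa", "bbbb", "cccc"], 8)):
    print(number, chunk)
```

Each yielded chunk would then be written to its own output file.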

The tool reads from standard input and, by default, writes to the current directory.

The script can be invoked as follows:

WikiExtractor.py [options]

The possible options are the following:

-c, --compress        : compress output files using bzip2 algorithm
-b ..., --bytes=...   : put specified bytes per output file (500K by default)
-o ..., --output=...  : place output files in specified directory (current
                        directory by default)
--help                : display this help and exit
--usage               : display script usage

Sample sessions:

WikiExtractor.py          : reads input from stdin and writes output to
                            current directory
WikiExtractor.py -o wiki  : reads input from stdin and writes output to
                            wiki directory
WikiExtractor.py -b 10M   : processes input and stores output in files of
                            10 MB
WikiExtractor.py -c       : processes input and stores output in compressed
                            files
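
Size arguments such as 500K or 10M, as accepted by the -b option, could be converted to a byte count along these lines. The helper parse_size is hypothetical, and the use of 1024 as the multiplier is an assumption; the script may interpret the suffixes differently.

```python
import re

# Multipliers for the size suffixes; powers of 1024 are an assumption.
UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value):
    """Convert a string such as '500K' or '10M' to a number of bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", value.strip().upper())
    if match is None:
        raise ValueError("invalid size: %r" % value)
    number, unit = match.groups()
    return int(number) * UNITS[unit]


print(parse_size("500K"), parse_size("10M"))
```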

Example of Use

The following commands illustrate how to apply the script to a Wikipedia dump:

> wget http://download.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
> mkdir extracted
> bzip2 -dc itwiki-latest-pages-articles.xml.bz2 |
  WikiExtractor.py -cb 250K -o extracted
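
For inspecting a dump directly from Python, without decompressing it to disk first, a streaming reader along the following lines may be useful. It is a sketch under stated assumptions rather than part of the tool: the name iter_pages is hypothetical, and only the title and text of each page are collected.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_pages(dump_path):
    """Yield (title, wikitext) pairs from a bzip2-compressed XML dump."""
    with bz2.open(dump_path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            # Dumps may put elements in an XML namespace; compare only the
            # local part of the tag so that both cases are handled.
            if elem.tag.rsplit("}", 1)[-1] != "page":
                continue
            title, text = None, ""
            for child in elem.iter():
                name = child.tag.rsplit("}", 1)[-1]
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text or ""
            yield title, text
            elem.clear()  # release the memory held by this page
```

Because pages are yielded one at a time and cleared afterwards, memory use stays modest even though the dump is a single large XML file.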

In order to combine the whole extracted text into a single file one can issue:

> find extracted -name '*bz2' -exec bzip2 -dc {} \; > text.xml
> rm -rf extracted
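
The same combination step can also be sketched in Python, decompressing each output file in turn and appending it to a single target. The function name combine_extracted is hypothetical; files are visited in sorted order, which is a choice made here for reproducibility and not something the commands above guarantee.

```python
import bz2
import os
import shutil


def combine_extracted(directory, target_path):
    """Decompress every file ending in 'bz2' under `directory` into one file."""
    with open(target_path, "wb") as target:
        for root, dirs, files in os.walk(directory):
            dirs.sort()  # visit subdirectories in a stable order
            for name in sorted(files):
                if name.endswith("bz2"):
                    with bz2.open(os.path.join(root, name), "rb") as part:
                        shutil.copyfileobj(part, target)
```

A call such as combine_extracted("extracted", "text.xml") mirrors the find command; the target file should live outside the directory being scanned.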

Downloads

Related Work

  • WikiPrep: a Perl tool for preprocessing Wikipedia XML dumps.