Introduction

The Italian Wikipedia was chosen as the source for text extraction. The encyclopedia maintainers publish a monthly dump of the database: a single XML file containing the whole encyclopedia, which is normally used to derive information from Wikipedia, such as statistics, service lists, etc.

The Italian Wikipedia dumps are available from the Wikipedia database download page.

In order to perform text analysis, plain text must first be extracted from the documents by removing syntactic decorations (bold, italics, underlining, etc.).

The aim of the Wikipedia extractor tool is to generate plain text from the Wikipedia database, discarding any inessential information or annotation contained in Wikipedia pages, such as images, tables, references and lists.
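
As a rough illustration of what removing these decorations involves, the following sketch handles a handful of wiki constructs with regular expressions. It is not the code used by the tool: the function name strip_wiki_markup is hypothetical, only non-nested templates and links are covered, and the choice of turning italics into quoted text is an assumption inferred from the sample output on this page.

```python
import re

def strip_wiki_markup(text: str) -> str:
    """Reduce a fragment of wiki syntax to plain text (illustrative sketch only)."""
    # Drop non-nested templates such as {{S sezione}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Drop image links such as [[Immagine:Harmonium2.jpg|thumb|right|300 px]].
    text = re.sub(r"\[\[(?:Immagine|Image|File):[^\[\]]*\]\]", "", text)
    # Keep only the visible label of internal links: [[target|label]] -> label.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Bold ('''x''') is simply unwrapped.
    text = re.sub(r"'''(.*?)'''", r"\1", text)
    # Italics (''x'') become quoted text (assumption based on the sample output).
    text = re.sub(r"''(.*?)''", "\u201c\\1\u201d", text)
    return text.strip()
```

Real pages contain nested templates, tables and malformed markup, so a sketch like this covers only the simplest cases.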

In the encyclopedia dump each document is represented by a single XML node, encoded as illustrated in the following example for the document titled Armonium:

<page>
  <title>Armonium</title>
  <id>2</id>
  <timestamp>2008-06-22T21:48:55Z</timestamp>
  <username>Nemo bis</username>
  <comment>italiano</comment>
  <text xml:space="preserve">[[Immagine:Harmonium2.jpg|thumb|right|300 px]]

  L''''armonium'''' (in francese, ''harmonium'') è uno [[strumenti musicali|strumento musicale]] azionato con una
  [[tastiera (musica)|tastiera]], detta manuale. Sono stati costruiti anche alcuni armonium con due manuali.
  
  ==Armonium occidentale==
  Come l'[[organo (musica)|organo]], l'armonium è utilizzato tipicamente in [[chiesa (architettura)|chiesa]], per
  l'esecuzione di [[musica sacra]], ed è fornito di pochi registri, quando addirittura in certi casi non ne possiede
  nemmeno uno: il suo [[timbro (musica)|timbro]] è molto meno ricco di quello organistico e così pure la sua estensione.
  
  ...
  
  ==Armonium indiano==
  {{S sezione}}
  
  == Voci correlate ==
  *[[Musica]]
  *[[Generi musicali]]</text>
</page>

For this document the Wikipedia extractor produces the following plain text:

<doc id="2" url="http://it.wikipedia.org/wiki/Armonium">
Armonium.
L'armonium (in francese, “harmonium”) è uno strumento musicale azionato con una tastiera, detta manuale. Sono stati
costruiti anche alcuni armonium con due manuali. Armonium occidentale. Come l'organo, l'armonium è utilizzato
tipicamente in chiesa, per l'esecuzione di musica sacra, ed è fornito di pochi registri, quando addirittura in certi
casi non ne possiede nemmeno uno: il suo timbro è molto meno ricco di quello organistico e così pure la sua estensione.
...
</doc>
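
The mapping from a page node to the doc wrapper can be sketched as follows. This is only an illustration under stated assumptions: the function name format_doc and its base_url parameter are hypothetical, the page is assumed to have the simplified structure shown in the example above, and the already cleaned text is passed in by the caller.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

def format_doc(page_xml: str, plain_text: str,
               base_url: str = "http://it.wikipedia.org/wiki/") -> str:
    """Wrap cleaned text in a <doc> element built from a <page> node (sketch)."""
    page = ET.fromstring(page_xml)
    doc_id = page.findtext("id")
    title = page.findtext("title")
    # Assumption: the URL is the base URL plus the title, with spaces as underscores.
    url = base_url + quote(title.replace(" ", "_"))
    # The title is emitted as the first line of the body, followed by a period.
    return f'<doc id="{doc_id}" url="{url}">\n{title}.\n{plain_text}\n</doc>'
```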

The extraction tool is implemented in Python and aims to achieve high accuracy in the extraction task.

The standard page format adopted by Wikipedia uses wiki syntax, a simple and intuitive formalism for specifying meta-information associated with text (bold, italics, underlining, images, tables, etc.). Unfortunately, not every author follows this standard, and some prefer to insert HTML markup into the documents. Wiki and HTML tags are also often misused in the text (unclosed tags, wrong attributes, etc.). The extractor therefore applies several heuristics to maximize the probability of a successful extraction. The main direction for future work is improving the accuracy of these heuristics.
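
The kind of heuristic meant here can be illustrated with a small sketch that tolerates unclosed HTML tags. It does not reproduce the heuristics of the tool, which are not listed on this page: the function name drop_html, the DISCARD list, and the rule of cutting an unclosed element at the end of its line are all assumptions made for the example.

```python
import re

# Elements whose whole content is discarded (assumed list, for illustration).
DISCARD = ("ref", "table", "gallery")

def drop_html(text: str) -> str:
    """Remove HTML markup without requiring tags to be balanced (sketch)."""
    for tag in DISCARD:
        # Well-formed case: remove the element together with its content.
        text = re.sub(rf"<{tag}\b[^>]*>.*?</{tag}\s*>", "", text, flags=re.S | re.I)
        # Unclosed opening tag: discard from the tag to the end of the line.
        text = re.sub(rf"<{tag}\b[^>/]*>[^\n]*", "", text, flags=re.I)
    # Any remaining tag, balanced or not, is removed and its content kept.
    return re.sub(r"</?[a-zA-Z][^<>]*>", "", text)
```

Cutting at the end of the line is a deliberately conservative choice: it may lose some text after an unclosed tag, but it prevents one missing closing tag from swallowing the rest of the document.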

Description

wiki-extractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. The output is stored in a number of files of similar size in a given directory. Each file contains several documents in the doc format shown above.
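
One simple way to obtain files of similar size is to group whole documents until a byte limit would be exceeded. The sketch below shows that idea only; the function name split_by_size is hypothetical, the default of 500 * 1024 bytes is an assumed reading of "500 KB", and the script may split its output differently.

```python
from typing import Iterable, Iterator, List

def split_by_size(docs: Iterable[str],
                  max_bytes: int = 500 * 1024) -> Iterator[List[str]]:
    """Group documents into batches of at most max_bytes UTF-8 bytes (sketch)."""
    batch: List[str] = []
    size = 0
    for doc in docs:
        doc_size = len(doc.encode("utf-8"))
        # Start a new batch when adding this document would exceed the limit.
        if batch and size + doc_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        yield batch
```

Documents are never split across batches, so a single document larger than the limit ends up in a batch of its own.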

The tool reads the dump from standard input and writes the extracted documents to files in the output directory (see the -o option).

The script can be invoked as follows:

wiki-extractor.py [options]

The following options are available:

-s, --split-sentences  split sentences using trained Punkt tokenizer
-c, --compress         compress output files using bzip2 algorithm
-b ..., --bytes=...    put specified bytes per output file (500 KB by default)
-o ..., --output=...   place output files in specified directory (current directory by default)
--help                 display this help and exit
--usage                display script usage

Sample sessions:

wiki-extractor.py          : reads input from stdin and writes output to current directory
wiki-extractor.py -o wiki  : reads input from stdin and writes output to wiki directory
wiki-extractor.py -b 10M   : processes input and stores output in files of 10 MB
wiki-extractor.py -sc      : processes input by splitting the sentences and compresses output files
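
Size values such as 10M or 250K have to be converted to a number of bytes. A minimal sketch of such a conversion follows; the function name parse_size is hypothetical, and treating K and M as powers of 1024 is an assumption, since this page does not say whether the script uses 1000 or 1024.

```python
import re

# Assumed multipliers: binary units (1 K = 1024 bytes).
MULTIPLIERS = {"": 1, "K": 1024, "M": 1024 ** 2}

def parse_size(value: str) -> int:
    """Convert a size such as '500', '250K' or '10M' to bytes (sketch)."""
    match = re.fullmatch(r"(\d+)([KM]?)", value.strip().upper())
    if match is None:
        raise ValueError(f"invalid size: {value!r}")
    number, suffix = match.groups()
    return int(number) * MULTIPLIERS[suffix]
```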

Usage example

The following commands illustrate a complete run of the script:

> wget http://download.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
> mkdir extracted
> bzip2 -dc itwiki-latest-pages-articles.xml.bz2 | wiki-extractor.py -scb 250K -o extracted

Optionally, the entire text extracted from Wikipedia can be collected into a single file:

> find extracted -name '*bz2' -exec bzip2 -dc {} \; > text.xml
> rm -rf extracted
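
The merge step can also be done from Python with the standard bz2 module, decompressing each output file and appending it to a single target. This is a sketch, not part of the tool: the function name merge_extracted is hypothetical, and processing the files in sorted path order is a choice made here for reproducibility.

```python
import bz2
import shutil
from pathlib import Path

def merge_extracted(directory: str, target: str) -> int:
    """Decompress every .bz2 file under directory into target; return the file count."""
    count = 0
    with open(target, "wb") as out:
        # Sorted order keeps the result stable between runs.
        for path in sorted(Path(directory).rglob("*.bz2")):
            with bz2.open(path, "rb") as part:
                # Stream the data so large files are not loaded into memory at once.
                shutil.copyfileobj(part, out)
            count += 1
    return count
```

For the session above, a call such as merge_extracted("extracted", "text.xml") would play the role of the find command.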

Downloads