
Introduction

The project uses the Italian Wikipedia as a source of documents for several purposes: as training data and as a source of data to be annotated.

The Wikipedia maintainers provide, each month, an XML dump of all documents in the database: a single XML file containing the whole encyclopedia, which can be used for various kinds of analysis, such as statistics and service lists.

Wikipedia dumps are available from the Wikipedia database download site, http://download.wikimedia.org/.

The Wikipedia extractor tool generates plain text from a Wikipedia database dump, discarding any other information or annotation present in Wikipedia pages, such as images, tables, references and lists.

Each document in the dump of the encyclopedia is represented as a single XML element, encoded as illustrated in the following example from the document titled Armonium:

<page>
  <title>Armonium</title>
  <id>2</id>
  <timestamp>2008-06-22T21:48:55Z</timestamp>
  <username>Nemo bis</username>
  <comment>italiano</comment>
  <text xml:space="preserve">[[Immagine:Harmonium2.jpg|thumb|right|300 px]]
  
  L''''armonium'''' (in francese, ''harmonium'') è uno [[strumenti musicali|
  strumento musicale]] azionato con una [[tastiera (musica)|tastiera]], detta
  manuale. Sono stati costruiti anche alcuni armonium con due manuali.
  
  ==Armonium occidentale==
  Come l'[[organo (musica)|organo]], l'armonium è utilizzato tipicamente in
  [[chiesa (architettura)|chiesa]], per l'esecuzione di [[musica sacra]], ed è
  fornito di pochi registri, quando addirittura in certi casi non ne possiede
  nemmeno uno: il suo [[timbro (musica)|timbro]] è molto meno ricco di quello
  organistico e così pure la sua estensione.
  
  ...
  
  ==Armonium indiano==
  {{S sezione}}
  
  == Voci correlate ==
  *[[Musica]]
  *[[Generi musicali]]</text>
</page>

For this document the Wikipedia extractor produces the following plain text:

<doc id="2" url="http://it.wikipedia.org/wiki/Armonium">
Armonium.
L'armonium (in francese, “harmonium”) è uno strumento musicale azionato con
una tastiera, detta manuale. Sono stati costruiti anche alcuni armonium con
due manuali.

Armonium occidentale.
Come l'organo, l'armonium è utilizzato tipicamente in chiesa, per l'esecuzione
di musica sacra, ed è fornito di pochi registri, quando addirittura in certi
casi non ne possiede nemmeno uno: il suo timbro è molto meno ricco di quello
organistico e così pure la sua estensione.
...
</doc>
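
Such <page> elements can be streamed out of the compressed dump with just the Python standard library. The following is only a minimal sketch, not the extractor's own code; the dump file name is the Italian one used in the examples further down this page:

import bz2
import xml.etree.ElementTree as ET

DUMP = 'itwiki-latest-pages-articles.xml.bz2'  # dump file from the example below

def local(tag):
    # Strip the XML namespace that real dumps put on every element,
    # e.g. '{http://www.mediawiki.org/xml/export-0.10/}page' -> 'page'
    return tag.rsplit('}', 1)[-1]

with bz2.open(DUMP, 'rb') as stream:               # decompress on the fly
    for _, elem in ET.iterparse(stream, events=('end',)):
        if local(elem.tag) == 'page':
            title, text = None, ''
            for child in elem.iter():
                if local(child.tag) == 'title':
                    title = child.text
                elif local(child.tag) == 'text':
                    text = child.text or ''
            print(title, len(text))
            elem.clear()                           # keep memory usage bounded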

The extraction tool is written in Python and requires no additional library. It aims to achieve high accuracy in the extraction task.

Wikipedia articles are written in the MediaWiki Markup Language, which provides a simple notation for formatting text (bold, italics, underlines, images, tables, etc.). It is also possible to insert HTML markup in the documents. Wiki and HTML tags are often misused (unclosed tags, wrong attributes, etc.), therefore the extractor deploys several heuristics in order to circumvent such problems. Template expansion, previously a missing feature, is supported in the current beta version described below.
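
To give a concrete, if greatly simplified, idea of what such cleaning involves, the following sketch strips a few of the most common markup constructs with regular expressions. It is an illustration only, not the heuristics the extractor actually applies:

import re

def naive_clean(wikitext):
    # Illustration only: a tiny subset of what the real extractor does.
    text = re.sub(r"'{2,}", '', wikitext)                      # drop bold/italic quote markup
    text = re.sub(r'\[\[[^|\]]*\|([^\]]*)\]\]', r'\1', text)   # [[target|label]] -> label
    text = re.sub(r'\[\[([^\]]*)\]\]', r'\1', text)            # [[target]] -> target
    text = re.sub(r'<[^>]+>', '', text)                        # strip HTML tags
    text = re.sub(r'\{\{[^{}]*\}\}', '', text)                 # drop simple templates
    return text

print(naive_clean("Come l'[[organo (musica)|organo]], l'armonium è utilizzato "
                  "tipicamente in [[chiesa (architettura)|chiesa]]."))
# Come l'organo, l'armonium è utilizzato tipicamente in chiesa.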

Description

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. The output is stored in a number of files of similar size in a given directory. Each file contains several documents in the document format.

This is a beta version that performs template expansion by preprocessing the whole dump and extracting template definitions.

Usage:

WikiExtractor.py [options] xml-dump-file

optional arguments:

 -h, --help            show this help message and exit
 -o OUTPUT, --output OUTPUT
                       output directory
 -b n[KM], --bytes n[KM]
                       put specified bytes per output file (default is 1M)
 -B BASE, --base BASE  base URL for the Wikipedia pages
 -c, --compress        compress output files using bzip
 -l, --links           preserve links
 -ns ns1,ns2, --namespaces ns1,ns2
                       accepted namespaces
 -q, --quiet           suppress reporting progress info
 -s, --sections        preserve sections
 -a, --article         analyze a file containing a single article
 --templates TEMPLATES
                       use or create file containing templates
 -v, --version         print program version
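
Since collecting template definitions requires an extra pass over the dump, the --templates option lets one save them to a file on a first run and reuse that file afterwards. For example (the template file name here is just a placeholder):

> WikiExtractor.py --templates itwiki-templates.txt -cb 250K -o extracted itwiki-latest-pages-articles.xml.bz2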

Example of Use

The following commands illustrate how to apply the script to a Wikipedia dump:

> wget http://download.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
> WikiExtractor.py -cb 250K -o extracted itwiki-latest-pages-articles.xml.bz2

In order to combine the whole extracted text into a single file one can issue:

> find extracted -name '*bz2' -exec bzip2 -dc {} \; > text.xml
> rm -rf extracted
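
The <doc> documents in the combined file can then be iterated over with a few lines of Python. This is a minimal sketch, assuming the text.xml file produced above:

import re

# Split the combined output back into (id, url, text) triples.
# For very large files one would stream instead of reading everything at once.
DOC_RE = re.compile(r'<doc id="([^"]+)" url="([^"]+)"[^>]*>(.*?)</doc>', re.DOTALL)

with open('text.xml', encoding='utf-8') as f:
    combined = f.read()

for doc_id, url, body in DOC_RE.findall(combined):
    title = body.lstrip('\n').split('\n', 1)[0]   # first line repeats the title
    print(doc_id, url, title)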

Downloads

* Wikipedia Extractor (version 2.40): http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
  This version is capable of expanding MediaWiki templates. Expanding templates requires a double pass on the dump, one for collecting the templates and one for performing the extraction. Processing templates can hence take considerably longer; it is possible, though, to save the extracted templates to a file with the option --templates FILE, in order to avoid repeating the scan for templates.
* Wikipedia Extractor on GitHub: https://github.com/attardi/wikiextractor
* Wikipedia Plain Text Extractor with Link Annotations: https://github.com/jodaiber/Annotated-WikiExtractor

Wikipedia dumps

* Italian Wikipedia database dump: http://download.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
* All Wikipedia database dumps: http://download.wikimedia.org/
* Data dump torrents (http://meta.wikimedia.org/wiki/Data_dump_torrents), for use with a BitTorrent client such as uTorrent (http://www.utorrent.com/)

Related Work
