==Introduction==

The project uses the Italian Wikipedia as a source of documents for several purposes: as training data and as a source of data to be annotated.

The Wikipedia maintainers provide, each month, an XML ''dump'' of all documents in the database: it consists of a single XML file containing the whole encyclopedia, which can be used for various kinds of analysis, such as statistics, service lists, etc.
Wikipedia dumps are available from [http://download.wikimedia.org/ Wikipedia database download].
The Wikipedia extractor tool generates plain text from a Wikipedia database dump, discarding any other information or annotation present in Wikipedia pages, such as images, tables, references and lists.
Each document in the dump of the encyclopedia is represented as a single XML element, encoded as illustrated in the following example from the document titled ''Armonium'':

 <page>
   <title>Armonium</title>
   <id>2</id>
   <timestamp>2008-06-22T21:48:55Z</timestamp>
   <username>Nemo bis</username>
   <comment>italiano</comment>
   <text xml:space="preserve">[[Immagine:Harmonium2.jpg|thumb|right|300 px]]
 
   L''''armonium'''' (in francese, ''harmonium'') è uno [[strumenti musicali|
   strumento musicale]] azionato con una [[tastiera (musica)|tastiera]], detta
   manuale. Sono stati costruiti anche alcuni armonium con due manuali.
 
   ==Armonium occidentale==
   Come l'[[organo (musica)|organo]], l'armonium è utilizzato tipicamente in
   [[chiesa (architettura)|chiesa]], per l'esecuzione di [[musica sacra]], ed è
   fornito di pochi registri, quando addirittura in certi casi non ne possiede
   nemmeno uno: il suo [[timbro (musica)|timbro]] è molto meno ricco di quello
   organistico e così pure la sua estensione.
 
   ...
 
   ==Armonium indiano==
   {{S sezione}}
 
   == Voci correlate ==
   *[[Musica]]
   *[[Generi musicali]]</text>
 </page>

For this document the Wikipedia extractor produces the following plain text:

 <doc id="2" url="http://it.wikipedia.org/wiki/Armonium">
 Armonium.
 L'armonium (in francese, “harmonium”) è uno strumento musicale azionato con
 una tastiera, detta manuale. Sono stati costruiti anche alcuni armonium con
 due manuali.
 
 Armonium occidentale.
 Come l'organo, l'armonium è utilizzato tipicamente in chiesa, per l'esecuzione
 di musica sacra, ed è fornito di pochi registri, quando addirittura in certi
 casi non ne possiede nemmeno uno: il suo timbro è molto meno ricco di quello
 organistico e così pure la sua estensione.
 ...
 </doc>
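
The dump itself can be processed incrementally with any streaming XML parser. As an illustration (a minimal sketch, not part of WikiExtractor.py, and much simplified with respect to what the script actually does), the following Python code iterates over the <page> elements of a plain or bzip2-compressed dump using only the standard library:

 import bz2
 import xml.etree.ElementTree as ET
 
 def _local(tag):
     # Strip the MediaWiki export namespace: '{http://...}page' -> 'page'
     return tag.rsplit('}', 1)[-1]
 
 def iter_pages(dump_path):
     """Yield (title, wikitext) pairs from a .xml or .xml.bz2 dump."""
     opener = bz2.open if dump_path.endswith('.bz2') else open
     with opener(dump_path, 'rb') as stream:
         title, text = None, None
         for _, elem in ET.iterparse(stream, events=('end',)):
             name = _local(elem.tag)
             if name == 'title':
                 title = elem.text
             elif name == 'text':
                 text = elem.text or ''
             elif name == 'page':
                 yield title, text
                 elem.clear()  # free the finished page to keep memory usage flat
 
 if __name__ == '__main__':
     # Smoke test: print the first few page titles and the size of their wikitext
     for i, (title, text) in enumerate(iter_pages('itwiki-latest-pages-articles.xml.bz2')):
         print(title, len(text))
         if i == 4:
             break

The actual script does considerably more per page (namespace filtering, markup cleaning, composing the <doc> header shown above), but it consumes exactly this <page> structure.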

The extraction tool is written in Python and requires no additional libraries. It aims to achieve high accuracy in the extraction task.

Wikipedia articles are written in the [http://www.mediawiki.org/wiki/Help:Formatting MediaWiki Markup Language], which provides a simple notation for formatting text (bolds, italics, underlines, images, tables, etc.). It is also possible to insert HTML markup in the documents. Wiki and HTML tags are often misused (unclosed tags, wrong attributes, etc.), so the extractor deploys several heuristics in order to circumvent such problems. Template expansion was not supported by earlier versions of the extractor; the current beta version adds it, as described below.
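
As an illustration of what such heuristics may look like (a simplified sketch with made-up rules, not the actual ones implemented in the extractor), a handful of regular expressions already handle the most common wiki and HTML markup:

 import re
 
 # Simplified cleanup rules: the real extractor uses many more and also
 # copes with nesting and malformed markup.
 RULES = [
     (re.compile(r"'''''|'''|''"), ''),                        # bold/italic quotes
     (re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]'), r'\1'),   # [[target|label]] -> label
     (re.compile(r'\[https?://\S+ ([^\]]+)\]'), r'\1'),        # external links -> label
     (re.compile(r'<ref[^>]*>.*?</ref>', re.DOTALL), ''),      # references
     (re.compile(r'<[^>]+>'), ''),                             # leftover HTML tags
     (re.compile(r'\{\{[^{}]*\}\}'), ''),                      # simple, non-nested templates
 ]
 
 def clean(wikitext):
     for pattern, replacement in RULES:
         wikitext = pattern.sub(replacement, wikitext)
     return wikitext
 
 sample = "Come l'[[organo (musica)|organo]], l'armonium è utilizzato in [[chiesa (architettura)|chiesa]]."
 print(clean(sample))  # Come l'organo, l'armonium è utilizzato in chiesa.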
==Description==
[http://medialab.di.unipi.it/wiki/Wikipedia_Extractor WikiExtractor.py] is a Python script that extracts and cleans text from a [http://download.wikimedia.org/ Wikipedia database dump].
The output is stored in a number of files of similar size in a given directory.
Each file contains several documents in the [[Document Format|document format]].
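
For downstream processing, the extracted files can be read back with a few lines of Python. The sketch below assumes the <doc id="..." url="..."> wrapper shown in the example above (other versions may add further attributes, which the pattern tolerates) and handles both plain and bzip2-compressed output files:

 import bz2
 import os
 import re
 
 # One extracted document; the first line of the body holds the title.
 DOC_RE = re.compile(r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)"[^>]*>\n(?P<body>.*?)\n</doc>', re.DOTALL)
 
 def iter_docs(output_dir):
     """Yield (id, url, body) for every document below the output directory."""
     for root, _, files in os.walk(output_dir):
         for name in sorted(files):
             opener = bz2.open if name.endswith('.bz2') else open
             with opener(os.path.join(root, name), 'rt', encoding='utf-8') as f:
                 for match in DOC_RE.finditer(f.read()):
                     yield match.group('id'), match.group('url'), match.group('body')
 
 for doc_id, url, body in iter_docs('extracted'):
     print(doc_id, url, body.splitlines()[0])  # e.g. 2 http://it.wikipedia.org/wiki/Armonium Armonium.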

This is a beta version that performs template expansion by preprocessing the whole dump and extracting template definitions.
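
The first of the two passes can be sketched as follows (again a simplification, assuming pages are supplied as (title, wikitext) pairs, e.g. by the iter_pages sketch above; note that the Template: namespace prefix depends on the dump language):

 import json
 
 def collect_templates(pages, templates_file, prefix='Template:'):
     """First pass: save the wikitext of every page in the Template namespace.
 
     pages is any iterable of (title, wikitext) pairs; storing the definitions
     in a file lets later runs skip this pass, in the spirit of the
     --templates option listed below.
     """
     templates = {title[len(prefix):]: text
                  for title, text in pages
                  if title and title.startswith(prefix)}
     with open(templates_file, 'w', encoding='utf-8') as out:
         json.dump(templates, out, ensure_ascii=False)
     return templates
 
 # Hypothetical usage, with the iter_pages sketch from above:
 #   collect_templates(iter_pages('itwiki-latest-pages-articles.xml.bz2'), 'templates.json')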
Usage:
  WikiExtractor.py [options] xml-dump-file
optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        output directory
  -b n[KM], --bytes n[KM]
                        put specified bytes per output file (default is 1M)
  -B BASE, --base BASE  base URL for the Wikipedia pages
  -c, --compress        compress output files using bzip
  -l, --links           preserve links
  -ns ns1,ns2, --namespaces ns1,ns2
                        accepted namespaces
  -q, --quiet           suppress reporting progress info
  -s, --sections        preserve sections
  -a, --article         analyze a file containing a single article
  --templates TEMPLATES
                        use or create file containing templates
  -v, --version         print program version
== Example of Use ==
The following commands illustrate how to apply the script to a Wikipedia dump:
  > wget http://download.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
  > WikiExtractor.py -cb 250K -o extracted itwiki-latest-pages-articles.xml.bz2
In order to combine the whole extracted text into a single file one can issue:
  > find extracted -name '*bz2' -exec bzip2 -dc {} \; > text.xml
  > rm -rf extracted
==Downloads==
* [http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py Wikipedia Extractor (version 2.40)] This version is capable of expanding MediaWiki templates. Expanding templates requires a double pass on the dump: one for collecting the template definitions and one for performing the extraction. Processing with templates can hence take considerably longer; it is possible, though, to save the extracted templates to a file with the option <tt>--templates FILE</tt> in order to avoid repeating the template scan on later runs.
* [https://github.com/attardi/wikiextractor Wikipedia Extractor on github]
* [https://github.com/jodaiber/Annotated-WikiExtractor Wikipedia Plain Text Extractor with Link Annotations]
=== Wikipedia dumps ===
* [http://download.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2 Italian Wikipedia database dump]
* [http://download.wikimedia.org/ All Wikipedia database dumps]
* [http://meta.wikimedia.org/wiki/Data_dump_torrents torrents] for use with a BitTorrent client such as [http://www.utorrent.com/ uTorrent]
== Related Work ==
* [http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/ WikiPrep] A Perl tool for preprocessing Wikipedia XML dumps.
* [http://evanjones.ca/software/wikipedia2text.html Extracting Text from Wikipedia] Another Python tool for extracting text from Wikipedia XML dumps.
* [http://www.mediawiki.org/wiki/Alternative_parsers Alternative Parsers] A list of links, descriptions, and status reports of the various alternative MediaWiki parsers.
