Wikipedia Extractor

From Medialab

Revision as of 18:16, 10 March 2008 by Antonio.Fuschetto (talk | contribs)

The wiki-extractor is a Python script that extracts cleaned text from a Wikipedia database dump and writes the output to fixed-size files.

The script can be invoked as follows:

python wiki-extractor.py [options] [file]

The following options are available:

-z, --gzip            : compress output files using bzip2 algorithm
-b ..., --bytes=...   : put specified bytes per output file (500 KB by default)
-o ..., --output=...  : place output files in specified directory (current directory by default)
--help                : display this help and exit
--usage               : display script usage
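
To make the option list concrete, here is a minimal sketch of how a command line like this could be parsed with Python's standard argparse module. It is not the script's actual parsing code: the function name build_parser is hypothetical, the 500 KB default is assumed to mean 500 * 1024 bytes, and -b is assumed to take a plain integer byte count.

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    # add_help=False so that only --help is registered, as in the list above.
    parser = argparse.ArgumentParser(prog="wiki-extractor.py", add_help=False)
    parser.add_argument("-z", "--gzip", action="store_true",
                        help="compress output files")
    # Assumption: "500 KB" is interpreted as 500 * 1024 bytes.
    parser.add_argument("-b", "--bytes", type=int, default=500 * 1024,
                        help="bytes per output file")
    parser.add_argument("-o", "--output", default=".",
                        help="directory for output files")
    parser.add_argument("--help", action="help",
                        help="display this help and exit")
    # Assumption: the caller prints parser.format_usage() when this is set.
    parser.add_argument("--usage", action="store_true",
                        help="display script usage")
    # Optional input file; None means "read from stdin".
    parser.add_argument("file", nargs="?", default=None)
    return parser
```

Under these assumptions, build_parser().parse_args(["infile", "-o", "outdir"]) yields a namespace with file set to "infile" and output set to "outdir", while an empty argument list leaves file as None, which the caller would treat as reading from stdin.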

Some sample sessions follow:

wiki-extractor.py                   : reads input from stdin and writes output to current directory
wiki-extractor.py infile            : reads input from infile and writes output to current directory
wiki-extractor.py -o outdir         : reads input from stdin and writes output to outdir
wiki-extractor.py infile -o outdir  : reads input from infile and writes output to outdir
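
The fixed-size output behavior can be illustrated with a short sketch. This is one plausible way to roll output over into a new file once a byte limit is reached, not the script's own implementation: the function name write_fixed_size and the wiki_NN file naming are hypothetical, and the sketch assumes a document is never split across two files, so a single document larger than the limit still gets a file of its own.

```python
import os

def write_fixed_size(documents, outdir, max_bytes=500 * 1024):
    """Write documents to numbered files, starting a new file near max_bytes.

    Returns the list of file paths that were created.
    """
    os.makedirs(outdir, exist_ok=True)
    paths = []
    out = None
    written = 0
    for doc in documents:
        data = doc.encode("utf-8")
        # Open a new file if none is open yet, or if this document would
        # push the current non-empty file past the limit.
        if out is None or (written > 0 and written + len(data) > max_bytes):
            if out is not None:
                out.close()
            # Hypothetical naming scheme: wiki_00, wiki_01, ...
            path = os.path.join(outdir, "wiki_%02d" % len(paths))
            out = open(path, "wb")
            paths.append(path)
            written = 0
        out.write(data)
        written += len(data)
    if out is not None:
        out.close()
    return paths
```

Because whole documents are kept together, file sizes in this sketch are approximate: each file stays at or below max_bytes unless a single document exceeds the limit on its own.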