Document files contain a series of Wikipedia articles, each represented by an XML doc element.


The element doc has the following attributes:

  • id, which identifies the document by a unique serial number;
  • url, which provides the URL of the original Wikipedia page.

The content of a doc element consists of plain text, one sentence per line.

Here is an example of a doc element:

<doc id="2" url="">
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
Sono stati costruiti anche alcuni harmonium con due manuali.
</doc>

Note that, because of Wikipedia conventions, the first sentence is the title of the article.
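
The format described above can be read with a few lines of Python. The following is only a minimal sketch, not part of the original tools. It assumes that a file is a plain concatenation of doc elements, that each opening tag lists id before url as in the example, and that each element is closed by </doc>. The names iter_docs, DOC_RE and SAMPLE, as well as the sample text itself, are hypothetical and serve only for illustration.

```python
import re
from typing import Dict, Iterator

# Assumed layout: <doc id="..." url="..."> ... </doc>, with id before url.
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)"[^>]*>(?P<body>.*?)</doc>',
    re.DOTALL,
)


def iter_docs(text: str) -> Iterator[Dict[str, object]]:
    """Yield one record per doc element found in `text`."""
    for match in DOC_RE.finditer(text):
        # One sentence per line; blank lines are skipped.
        lines = [ln.strip() for ln in match.group("body").splitlines() if ln.strip()]
        yield {
            "id": match.group("id"),
            "url": match.group("url"),
            # By convention the first sentence is the article title.
            "title": lines[0] if lines else "",
            "sentences": lines[1:],
        }


# Hypothetical sample input, made up for this sketch.
SAMPLE = (
    '<doc id="7" url="http://example.org/wiki/Sample">\n'
    "Sample title.\n"
    "First sentence of the article.\n"
    "Second sentence of the article.\n"
    "</doc>\n"
    '<doc id="8" url="">\n'
    "Another title.\n"
    "</doc>\n"
)

for doc in iter_docs(SAMPLE):
    print(doc["id"], doc["title"], len(doc["sentences"]))
```

A regular expression is used here instead of an XML parser because a file holding several doc elements has no single root element, so it is not a well-formed XML document as a whole.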

Such documents are produced by Wikipedia Extractor followed by Sentence Splitter.

