Each document file contains a series of Wikipedia articles, each represented by an XML doc element:

<doc>...</doc>
<doc>...</doc>
...
<doc>...</doc>

The element doc has the following attributes:

  • id, which identifies the document by a unique serial number;
  • url, which provides the URL of the original Wikipedia page.

The content of a doc element is plain text, with one sentence per line.

Here is an example of a doc element:

<doc id="2" url="http://it.wikipedia.org/wiki/Harmonium">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
Sono stati costruiti anche alcuni harmonium con due manuali.
...
</doc>

Note that, following Wikipedia conventions, the first sentence is the title of the article.
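
The layout described above can be read with a simple line-based scan. The following Python code is a minimal sketch, not part of the tools mentioned on this page: the names Doc, DOC_OPEN and iter_docs are hypothetical, and it assumes that each opening tag looks exactly like the one in the example (id first, then url, on a line of its own) and that each closing </doc> tag is also on its own line. A line scan is used instead of an XML parser because a file holding several top-level doc elements is not a single well-formed XML document.

```python
import re
from typing import Iterator, List, NamedTuple

# Assumed tag shape, taken from the example: <doc id="..." url="...">
DOC_OPEN = re.compile(r'<doc id="([^"]*)" url="([^"]*)"[^>]*>')


class Doc(NamedTuple):
    id: str               # unique serial number, kept as a string
    url: str              # URL of the original Wikipedia page
    title: str            # first line of the element, kept verbatim
    sentences: List[str]  # remaining lines, one sentence each


def iter_docs(lines) -> Iterator[Doc]:
    """Yield one Doc per <doc>...</doc> element in an iterable of lines."""
    header = None
    body: List[str] = []
    for raw in lines:
        line = raw.strip()
        match = DOC_OPEN.match(line)
        if match:
            # Start of a new element: remember its attributes.
            header = match.groups()
            body = []
        elif line == "</doc>" and header is not None:
            # End of the element: the first collected line is the title.
            title = body[0] if body else ""
            yield Doc(header[0], header[1], title, body[1:])
            header = None
        elif header is not None and line:
            body.append(line)
```

To use it, open a document file as UTF-8 text and pass the file object to iter_docs, for example with open(path, encoding="utf-8") as f followed by a loop over iter_docs(f), where path is whatever file name applies locally. The title keeps its trailing full stop, as in the example ("Harmonium.").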

Such documents are produced by running Wikipedia Extractor and then Sentence Splitter on its output.
