Description

SentenceSplitter.py is a Python tool that splits plain text, in the document format shown in the examples below, into sentences. It can be run as a script from the command line or imported as a module.

The program uses the Punkt tokenizer, trained on a corpus. The technique behind Punkt is described in Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005).

When used as a script, the tool reads the plain text from the standard input and writes the sentences to the standard output.
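The filter behaviour described above can be sketched roughly as follows. This is only an illustration, not the tool's actual code: the function name split_stream is hypothetical, and the regular expression is a naive stand-in for the trained Punkt model (it simply breaks after '.', '!' or '?' followed by whitespace). The sketch assumes that <doc ...> and </doc> lines pass through unchanged, as the usage examples on this page suggest.

```python
import re

# Naive stand-in for the trained Punkt model (assumption for illustration):
# a boundary is any whitespace that follows '.', '!' or '?'.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_stream(lines):
    """Yield <doc> markup unchanged and all other text one sentence at a time."""
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        if text.startswith("<doc") or text == "</doc>":
            yield text  # document markup passes through untouched
        else:
            for sentence in _BOUNDARY.split(text):
                if sentence:
                    yield sentence
```

Iterating over split_stream(sys.stdin) and printing each item would give a filter with the same overall shape as the script: plain text in, one sentence per line out.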

The script can be invoked as follows:

SentenceSplitter.py [options]

The possible options are the following:

-m ..., --model-file=...  : use the specified model file (default: Italian model)
--help                    : display this help and exit
--usage                   : display script usage

Usage examples

The following session shows the Sentence Splitter used as a script from the command line:

> SentenceSplitter.py -m IT-Model.pickle
<doc id="1" url="test">
Prima frase. Seconda frase.
</doc>
<doc id="1" url="test">
Prima frase.
Seconda frase.
</doc>
^D
> _

The following session uses the Sentence Splitter as a module from the Python interpreter:

>>> import sys
>>> from Tanl.split.SentenceSplitter import *
>>> 
>>> splitter = SentenceSplitter('IT-Model.pickle').pipe(sys.stdin)
>>> for sentence in splitter:
...    print(sentence)
...
<doc id="1" url="test">
Prima frase. Seconda frase.
</doc>
<doc id="1" url="test">
Prima frase.
Seconda frase.
</doc>
^D
>>> _

Sentence Splitter Trainer

SentenceSplitterTrainer.py is a Python script that builds models for the Sentence Splitter from a training corpus.

The program uses the Punkt tokenizer trainer. The technique behind Punkt is described in Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005).

The trainer reads the training set from the standard input and writes the parameters of the sentence splitter to a model file.

The training corpus should be a file with a sequence of sentences separated by newlines.
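A corpus in this shape could be prepared with a small helper like the one below. This is a sketch only: the function name format_training_corpus is hypothetical and not part of the tool, and it assumes the sentences have already been segmented by hand or by another process.

```python
def format_training_corpus(sentences):
    """Return corpus text with exactly one sentence per line."""
    lines = []
    for sentence in sentences:
        # Collapse internal whitespace, including stray newlines, so that
        # each sentence occupies a single line.
        flat = " ".join(sentence.split())
        if flat:
            lines.append(flat)
    # Terminate every sentence, including the last one, with a newline.
    return "".join(line + "\n" for line in lines)
```

The returned string can be written to a file and then fed to the trainer on standard input.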

The script can be invoked as follows:

SentenceSplitterTrainer.py [options] model-file

The possible options are the following:

-v, --verbose  : explain what is being done
--help         : display this help and exit
--usage        : display script usage

A sample session is:

SentenceSplitterTrainer.py IT-Model.pickle < IT-Training-corpus.txt
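The .pickle extension of the model file suggests, although this page does not confirm it, that the parameters are serialized with Python's pickle module. Under that assumption, saving and reloading a model might look like the sketch below. The helper names and the parameter dictionary are made-up placeholders, not the trainer's real data structure.

```python
import os
import pickle
import tempfile


def save_model(params, path):
    """Serialize the parameter object to a model file."""
    with open(path, "wb") as handle:
        pickle.dump(params, handle)


def load_model(path):
    """Read a parameter object back from a model file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip through a temporary directory; the dictionary is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "IT-Model.pickle")
    save_model({"abbreviations": ["sig", "dott"]}, model_path)
    restored = load_model(model_path)
```

If the model files are indeed pickles, they should only be loaded from trusted sources, since unpickling can execute arbitrary code.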
