Sentence Splitter

From Medialab

Sentence Splitter is a Python tool for splitting plain text into sentences. It can be used as a script (from the command line) or as a module.

The program relies on the Punkt tokenizer, trained on a corpus. The technique used by Punkt is described in Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005).
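The core idea behind Punkt is that abbreviations can be detected with type-based statistics over the corpus, without supervision: a word type that almost always occurs with a trailing period is probably an abbreviation, so a period after it is probably not a sentence boundary. The sketch below illustrates only that intuition with a crude frequency ratio; the real algorithm uses log-likelihood tests and collocation evidence, and all names here are illustrative.

```python
# Simplified illustration (NOT the full Punkt algorithm): score each
# period-final word type by how often the type carries a trailing period,
# a crude stand-in for Punkt's type-based abbreviation statistics.
from collections import Counter

def abbreviation_scores(tokens):
    """Return, for each type seen with a final period, the fraction of
    its occurrences that carry that period."""
    with_period = Counter()
    total = Counter()
    for tok in tokens:
        base = tok.rstrip(".").lower()
        if not base:
            continue  # skip bare punctuation tokens
        total[base] += 1
        if tok.endswith("."):
            with_period[base] += 1
    return {t: with_period[t] / total[t] for t in with_period}

tokens = "Dr. Rossi lives in Pisa . Dr. Bianchi too .".split()
scores = abbreviation_scores(tokens)
# "dr" always appears with a period, so it looks like an abbreviation
```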

When used as a script, the tool reads the plain text from the standard input and writes the sentences to the standard output.
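The stdin-to-stdout contract can be sketched as follows. The regex below is a naive stand-in for the real Punkt model: it splits after `.`, `!`, or `?` followed by whitespace, so it will mishandle abbreviations that the trained model would recognize.

```python
# Minimal sketch of the script's behavior: read plain text from stdin,
# write one sentence per line to stdout. Naive splitting rule only.
import re
import sys

def split_sentences(text):
    """Split text after sentence-final punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

if __name__ == "__main__":
    for sentence in split_sentences(sys.stdin.read()):
        print(sentence)
```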

The training option builds the model for the sentence splitter from a training corpus.

The trainer reads the training set from the standard input and writes the parameters of the sentence splitter to a model file.
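The trainer's input/output contract can be sketched like this. The "model" below is a toy frequency table of period-final word types, pickled to a file; the real trainer computes Punkt's parameters, and all function names here are illustrative.

```python
# Sketch of the trainer: read newline-separated sentences, derive some
# parameters, and pickle them to a model file. Toy parameters only.
import pickle
from collections import Counter

def train(lines):
    """Collect period-final word types from a list of training sentences."""
    abbrevs = Counter()
    for line in lines:
        for tok in line.split():
            if tok.endswith(".") and len(tok) > 1:
                abbrevs[tok.lower()] += 1
    return abbrevs

def save_model(model, path):
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```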

The training corpus should be a file with a sequence of sentences separated by newlines.

The script can be invoked as follows: [options] model-file

The possible options are the following:

-t, --train  : train splitter from input
--help       : display this help and exit
--usage      : display script usage

Usage examples

The following commands show a complete use of the Sentence Splitter as a script from the command line:

> --train IT-Model.pickle < IT-Training-corpus.txt
> IT-Model.pickle
Prima frase. Seconda frase.
Prima frase.
Seconda frase.
> _

The following instructions use the Sentence Splitter as a module from the Python interpreter:

>>> import sys
>>> from Tanl.split.SentenceSplitter import *
>>> splitter = SentenceSplitter('IT-Model.pickle').pipe(sys.stdin)
>>> for sentence in splitter:
...    print(sentence)
Prima frase. Seconda frase.
Prima frase.
Seconda frase.
>>> _
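The interpreter session above suggests that `pipe()` wraps a text stream and yields one sentence at a time. A stand-in class with the same usage shape can be sketched as below; it uses a naive regex rule instead of the pickled Punkt model that the real `SentenceSplitter` loads, and the class name is hypothetical.

```python
# Stand-in for the SentenceSplitter usage shape: pipe() takes a text
# stream and yields sentences. Naive splitting, no trained model.
import io
import re

class NaiveSentenceSplitter:
    def pipe(self, stream):
        """Yield sentences from a line-oriented text stream."""
        for line in stream:
            for part in re.split(r"(?<=[.!?])\s+", line.strip()):
                if part:
                    yield part

splitter = NaiveSentenceSplitter().pipe(io.StringIO("Prima frase. Seconda frase.\n"))
sentences = list(splitter)  # ["Prima frase.", "Seconda frase."]
```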