SentenceSplitter.py is a Python tool for splitting plain text into sentences. It can be used as a command-line script or as a module.
The program uses the Punkt tokenizer, trained on a corpus. The technique used by Punkt is described in "Unsupervised Multilingual Sentence Boundary Detection" (Kiss and Strunk, 2005).
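To illustrate the core idea behind Punkt, here is a toy sketch (not the actual Punkt algorithm, and not part of this tool): tokens that almost always appear with a trailing period are learned as abbreviations from unlabeled text, so a period after them is not treated as a sentence boundary. The function names and threshold below are hypothetical.

```python
from collections import Counter

def train_abbreviations(corpus, threshold=0.7):
    # Toy version of the Punkt intuition: a token that almost always
    # occurs with a trailing period is likely an abbreviation.
    with_dot = Counter()
    total = Counter()
    for token in corpus.split():
        word = token.rstrip('.').lower()
        if not word:
            continue
        total[word] += 1
        if token.endswith('.'):
            with_dot[word] += 1
    return {w for w in total
            if total[w] >= 2 and with_dot[w] / total[w] >= threshold}

def split_sentences(text, abbreviations):
    # Split on periods that do not follow a learned abbreviation.
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith('.') and token.rstrip('.').lower() not in abbreviations:
            sentences.append(' '.join(current))
            current = []
    if current:
        sentences.append(' '.join(current))
    return sentences

corpus = "Mr. Smith arrived. He met Mr. Jones. Mr. Brown left early."
abbrevs = train_abbreviations(corpus)
print(split_sentences("Mr. Smith spoke. Everyone listened.", abbrevs))
# → ['Mr. Smith spoke.', 'Everyone listened.']
```

The real Punkt algorithm uses statistical tests over collocations, abbreviations, and sentence starters, all learned without supervision.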
When used as a script, the tool reads the plain text from the standard input and writes the sentences to the standard output.
The training option builds the model for the sentence splitter from a training corpus.
The trainer reads the training set from the standard input and writes the parameters of the sentence splitter to a model file.
The training corpus should be a plain-text file containing one sentence per line.
The script can be invoked as follows:
SentenceSplitter.py [options] model-file
The possible options are the following:
-t, --train : train the splitter from the input
--help      : display this help and exit
--usage     : display script usage
The following commands show a complete use of the Sentence Splitter as a script from the command line:
> SentenceSplitter.py --train IT-Model.pickle < IT-Training-corpus.txt
> SentenceSplitter.py IT-Model.pickle
Prima frase. Seconda frase.
^D
Prima frase.
Seconda frase.
> _
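The script works as a classic Unix filter: it reads all text from standard input and writes one sentence per line to standard output. A minimal sketch of that filter pattern, with a naive regex splitter standing in for the trained model (the function name `split_stream` is hypothetical, not part of the tool):

```python
import io
import re

def split_stream(stream, out):
    # Naive stand-in for the trained Punkt model: split on
    # sentence-final punctuation followed by whitespace.
    text = stream.read().strip()
    for sentence in re.split(r'(?<=[.!?])\s+', text):
        if sentence:
            out.write(sentence + '\n')

# Demonstrate with in-memory streams instead of real stdin/stdout:
src = io.StringIO("Prima frase. Seconda frase.")
dst = io.StringIO()
split_stream(src, dst)
print(dst.getvalue(), end='')
```

In the real script, `src` and `dst` would be sys.stdin and sys.stdout, and the splitting would come from the pickled model given on the command line.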
The following session shows the Sentence Splitter used as a module from the Python interpreter:
>>> import sys
>>> from Tanl.split.SentenceSplitter import *
>>>
>>> splitter = SentenceSplitter('IT-Model.pickle').pipe(sys.stdin)
>>> for sentence in splitter:
...     print sentence
...
Prima frase. Seconda frase.
^D
Prima frase.
Seconda frase.
>>> _
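The module API shown above follows a pipe-iterator shape: pipe(stream) returns an iterator of sentences that can be consumed lazily in a for loop. A self-contained sketch of that interface, with a hypothetical ToySplitter class using a naive regex in place of the pickled Punkt model:

```python
import io
import re

class ToySplitter:
    """Illustrative stand-in with the same interface as
    SentenceSplitter: pipe(stream) yields sentences one at a
    time. The real class loads a pickled Punkt model; this
    toy version splits on a naive regex instead."""
    def pipe(self, stream):
        text = stream.read().strip()
        for sentence in re.split(r'(?<=[.!?])\s+', text):
            if sentence:
                yield sentence

# In-memory stream in place of sys.stdin:
for s in ToySplitter().pipe(io.StringIO("Prima frase. Seconda frase.")):
    print(s)
```

Because pipe() returns an iterator, sentences can be processed one at a time without buffering the entire output, which matches the streaming usage shown in the interpreter session.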