Sentence Splitter

From Medialab

* [http://medialab.di.unipi.it/Project/SemaWiki/Tools/SentenceSplitter.tgz Sentence Splitter] (Version 1.5)
* [http://medialab.di.unipi.it/Project/SemaWiki/Tools/SentenceSplitter.py Sentence Splitter] (Version 1.5)
* [[Corpora|Italian Training Corpus]]

Latest revision as of 17:54, 11 August 2009


SentenceSplitter.py is a Python tool for splitting plain text into sentences. It can be used as a script from the command line or imported as a module.

The program uses the Punkt tokenizer, trained on a corpus. The technique used by Punkt is described in "Unsupervised Multilingual Sentence Boundary Detection" (Kiss and Strunk, 2005).

When used as a script, the tool reads the plain text from the standard input and writes the sentences to the standard output.

With the --train option, the tool builds the model for the sentence splitter from a training corpus.

The trainer reads the training set from the standard input and writes the parameters of the sentence splitter to a model file.

The training corpus should be a file with a sequence of sentences separated by newlines.
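
The corpus format described above can be sketched as follows. This is a minimal illustration, not the tool's own code: the function name `read_training_corpus` and the sample sentences are hypothetical, and `io.StringIO` stands in for the redirected standard input.

```python
import io


def read_training_corpus(stream):
    """Return the non-empty lines of `stream`, one sentence per line."""
    sentences = []
    for line in stream:
        sentence = line.strip()
        if sentence:  # skip blank lines between sentences
            sentences.append(sentence)
    return sentences


# Stand-in for `< IT-Training-corpus.txt`; these sentences are made up.
corpus = io.StringIO("Prima frase.\nSeconda frase.\n\nTerza frase.\n")
sentences = read_training_corpus(corpus)
print(len(sentences))
```

Passing `sys.stdin` instead of the `io.StringIO` object would read the corpus the same way the trainer is described as doing.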

The script can be invoked as follows:

SentenceSplitter.py [options] model-file

The possible options are the following:

-t, --train  : train splitter from input
--help       : display this help and exit
--usage      : display script usage
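
For readers who want to wrap or reimplement this interface, the options listed above could be declared roughly as follows. This is only a sketch using the standard `argparse` module; how the actual script parses its arguments is not stated here, and the `build_parser` function and `model_file` attribute name are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical)."""
    parser = argparse.ArgumentParser(
        prog="SentenceSplitter.py",
        description="Split plain text from standard input into sentences.",
    )
    # argparse adds -h/--help by itself, which covers the --help option.
    parser.add_argument("-t", "--train", action="store_true",
                        help="train splitter from input")
    # In this sketch --usage is a plain flag; model-file is still required.
    parser.add_argument("--usage", action="store_true",
                        help="display script usage")
    parser.add_argument("model_file", metavar="model-file",
                        help="path of the model file to read or write")
    return parser


args = build_parser().parse_args(["--train", "IT-Model.pickle"])
print(args.train, args.model_file)
```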

Usage examples

The following commands show a complete session with Sentence Splitter used as a script from the command line: the first trains a model, the second splits text typed on the standard input.

> SentenceSplitter.py --train IT-Model.pickle < IT-Training-corpus.txt
> SentenceSplitter.py IT-Model.pickle
Prima frase. Seconda frase.
Prima frase.
Seconda frase.
> _

The following statements use the Sentence Splitter as a module from the Python interpreter:

>>> import sys
>>> from Tanl.split.SentenceSplitter import SentenceSplitter
>>> splitter = SentenceSplitter('IT-Model.pickle').pipe(sys.stdin)
>>> for sentence in splitter:
...     print(sentence)
Prima frase. Seconda frase.
Prima frase.
Seconda frase.
>>> _
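
The session above follows a pipe pattern: an object is given a stream of lines and is then iterated to obtain sentences one at a time. The sketch below imitates that pattern without Tanl installed. `NaiveSplitter` is a hypothetical stand-in, not the Tanl class, and it swaps the trained Punkt model for a plain regular expression that breaks after `.`, `!` or `?`.

```python
import io
import re


class NaiveSplitter:
    """Hypothetical stand-in; NOT the Tanl SentenceSplitter."""

    # Break after '.', '!' or '?' when followed by whitespace.
    _boundary = re.compile(r"(?<=[.!?])\s+")

    def pipe(self, lines):
        """Lazily yield sentences from an iterable of text lines."""
        for line in lines:
            for sentence in self._boundary.split(line.strip()):
                if sentence:
                    yield sentence


# io.StringIO stands in for sys.stdin so the example runs unattended.
source = io.StringIO("Prima frase. Seconda frase.\n")
for sentence in NaiveSplitter().pipe(source):
    print(sentence)
```

The naive rule also breaks after abbreviations, so "Il sig. Rossi arriva." is cut after "sig."; telling such abbreviations apart from real sentence ends is the job of the trained Punkt model.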