
Revision as of 15:52, 11 August 2009

Description

SentenceSplitter.py is a Python tool for splitting plain text into sentences. It can be used as a script (from the command line) or as a module.

The program relies on the Punkt Tokenizer, trained on a corpus. The technique used by punkt is described in Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005; http://www.linguistics.ruhr-uni-bochum.de/~strunk/ks2005FINAL.pdf).

When used as a script, the tool reads the plain text from the standard input and writes the sentences to the standard output.

The --train option builds the model for the sentence splitter from a training corpus. In this mode the tool reads the training set from the standard input and writes the parameters of the sentence splitter to the model file.

The training corpus should be a plain-text file with one sentence per line.
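
As a rough illustration of this input/output contract, the sketch below reads a one-sentence-per-line corpus from a stream and pickles a toy model to a file. It is not the Punkt training procedure used by the tool: the `train` and `load_model` functions and the abbreviation heuristic are assumptions made for this example only, and the code targets Python 3.

```python
import pickle


def train(corpus, model_path):
    """Build a toy model from a one-sentence-per-line corpus and pickle it.

    Hypothetical stand-in for the real trainer: the "model" is only the set
    of period-final tokens that occur before the end of a line.
    """
    abbreviations = set()
    for line in corpus:
        tokens = line.split()
        # Each line is assumed to hold exactly one sentence, so a token
        # ending in "." before the last token cannot close a sentence.
        for token in tokens[:-1]:
            if token.endswith(".") and len(token) > 1:
                abbreviations.add(token.lower())
    # Write the learned parameters to the model file.
    with open(model_path, "wb") as model_file:
        pickle.dump(abbreviations, model_file)
    return abbreviations


def load_model(model_path):
    """Read back the parameters written by train()."""
    with open(model_path, "rb") as model_file:
        return pickle.load(model_file)
```

Calling `train(sys.stdin, "IT-Model.pickle")` would mirror the way the script takes the corpus from the standard input and the model file name as an argument.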

The script can be invoked as follows:

SentenceSplitter.py [options] model-file

The possible options are the following:

-t, --train  : train splitter from input
--help       : display this help and exit
--usage      : display script usage
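
A command-line interface with these options could be declared with Python's `argparse` roughly as follows. This is a sketch, not the script's actual source: `build_parser` and `UsageAction` are hypothetical names, and only the option names and help texts come from the list above.

```python
import argparse


class UsageAction(argparse.Action):
    """Hypothetical action: print the one-line usage message and exit."""

    def __init__(self, option_strings, dest, **kwargs):
        # nargs=0 makes --usage a flag that takes no value.
        super().__init__(option_strings, dest, nargs=0, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        parser.print_usage()
        parser.exit()


def build_parser():
    # add_help=False so that only --help (and no -h) is registered,
    # matching the option list in the text.
    parser = argparse.ArgumentParser(
        prog="SentenceSplitter.py",
        usage="%(prog)s [options] model-file",
        add_help=False,
    )
    parser.add_argument("-t", "--train", action="store_true",
                        help="train splitter from input")
    parser.add_argument("--help", action="help",
                        help="display this help and exit")
    parser.add_argument("--usage", action=UsageAction,
                        default=argparse.SUPPRESS,
                        help="display script usage")
    parser.add_argument("model_file", metavar="model-file",
                        help="file holding the splitter parameters")
    return parser
```

With this parser, `build_parser().parse_args(["--train", "IT-Model.pickle"])` yields a namespace whose `train` attribute is true and whose `model_file` attribute is the model file name.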

Usage examples

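The following commands show a complete session with Sentence Splitter used as a script from the command line. The first command trains a model from a corpus; the second loads that model and splits the text typed on the standard input: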

> SentenceSplitter.py --train IT-Model.pickle < IT-Training-corpus.txt
> SentenceSplitter.py IT-Model.pickle
Prima frase. Seconda frase.
^D
Prima frase.
Seconda frase.
> _

The following session uses Sentence Splitter as a module from the Python interpreter:

>>> import sys
>>> from Tanl.split.SentenceSplitter import *
>>> 
>>> splitter = SentenceSplitter('IT-Model.pickle').pipe(sys.stdin)
>>> for sentence in splitter:
...    print(sentence)
...
Prima frase. Seconda frase.
^D
Prima frase.
Seconda frase.
>>> _
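
The `pipe` call above takes a stream and returns an iterable of sentences. The shape of that interface can be sketched with a small stand-in class. `ToySplitter`, its regular expression and its abbreviation set are assumptions for illustration; they do not reproduce the Punkt-based SentenceSplitter, and the stand-in works line by line, which the real module may not do. The code targets Python 3.

```python
import re


class ToySplitter:
    """Hypothetical stand-in mirroring the SentenceSplitter(model).pipe(stream) shape."""

    # Candidate boundary: whitespace that follows '.', '!' or '?'.
    _BOUNDARY = re.compile(r"(?<=[.!?])\s+")

    def __init__(self, abbreviations=()):
        # Period-final tokens that must not close a sentence.
        self.abbreviations = {a.lower() for a in abbreviations}

    def pipe(self, stream):
        """Lazily yield sentences from a stream of text lines."""
        for line in stream:
            for sentence in self.split(line):
                yield sentence

    def split(self, text):
        """Yield the sentences found in a single string."""
        pending = []
        for chunk in self._BOUNDARY.split(text.strip()):
            if not chunk:
                continue
            pending.append(chunk)
            last_token = chunk.split()[-1].lower()
            if last_token in self.abbreviations:
                # The period belongs to an abbreviation: keep accumulating.
                continue
            yield " ".join(pending)
            pending = []
        if pending:
            yield " ".join(pending)
```

Because `pipe` is a generator, sentences are produced as the input is consumed, so the same loop works with `sys.stdin`, an open file, or an in-memory `io.StringIO`.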

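Downloads

* Sentence Splitter (Version 1.5): http://medialab.di.unipi.it/Project/SemaWiki/Tools/SentenceSplitter.tgz
* Italian Training Corpus (see the Corpora page)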
