Sentence Splitter

From Medialab

 

Latest revision as of 16:54, 11 August 2009

Description

SentenceSplitter.py is a Python tool for splitting plain text into sentences. It can be used as a script (from command line) or as a module.

The program exploits the Punkt Tokenizer, trained on a corpus. The technique used by punkt is described in Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005).

When used as a script, the tool reads the plain text from the standard input and writes the sentences to the standard output.

Using the training option builds the model for the sentence splitter from a training corpus.

The trainer reads the training set from the standard input and writes the parameters of the sentence splitter to a model file.

The training corpus should be a file with a sequence of sentences separated by newlines.
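The model file in the examples below has a .pickle extension, which suggests the learned parameters are serialized with Python's pickle module. The sketch below illustrates that train-then-persist workflow with a deliberately naive stand-in for Punkt's statistics (Punkt actually learns abbreviation, collocation, and sentence-starter frequencies from the corpus; train_splitter here is hypothetical):

```python
import pickle

def train_splitter(corpus_lines):
    """Naive stand-in for Punkt training: treat short tokens ending in a
    period as abbreviations. The real model is far richer; this only
    illustrates the train-then-pickle workflow."""
    abbreviations = set()
    for sentence in corpus_lines:          # one sentence per corpus line
        for token in sentence.split():
            if token.endswith(".") and len(token) <= 5:
                abbreviations.add(token.lower())
    return abbreviations

# Each line of the training corpus is one sentence.
corpus = ["Il dott. Rossi arriva.", "Prima frase.", "Seconda frase."]
params = train_splitter(corpus)

# The model file would be an ordinary pickle of the learned parameters.
blob = pickle.dumps(params)
restored = pickle.loads(blob)
```

Round-tripping through pickle lets the splitter reload the parameters later without retraining.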

The script can be invoked as follows:

SentenceSplitter.py [options] model-file

The possible options are the following:

-t, --train  : train splitter from input
--help       : display this help and exit
--usage      : display script usage
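A sketch of how a script with this interface could dispatch those options using the standard getopt module (the actual argument handling of SentenceSplitter.py may differ; parse_args is hypothetical):

```python
import getopt
import sys

USAGE = "SentenceSplitter.py [options] model-file"

def parse_args(argv):
    """Parse the options listed above; return (train_flag, model_file)."""
    try:
        opts, args = getopt.getopt(argv, "t", ["train", "help", "usage"])
    except getopt.GetoptError as err:
        sys.exit("%s\nusage: %s" % (err, USAGE))
    train = False
    for opt, _ in opts:
        if opt in ("-t", "--train"):
            train = True        # build the model instead of splitting
        elif opt in ("--help", "--usage"):
            sys.exit("usage: " + USAGE)
    if len(args) != 1:          # exactly one positional model-file
        sys.exit("usage: " + USAGE)
    return train, args[0]
```

For example, `parse_args(["-t", "IT-Model.pickle"])` selects training mode, while `parse_args(["IT-Model.pickle"])` selects splitting mode.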

Use examples

The following commands show a complete use of Sentence Splitter as a script from the command line:

> SentenceSplitter.py --train IT-Model.pickle < IT-Training-corpus.txt
> SentenceSplitter.py IT-Model.pickle
Prima frase. Seconda frase.
^D
Prima frase.
Seconda frase.
> _

The following is a set of instructions that uses the Sentence Splitter as a module from the Python interpreter:

>>> import sys
>>> from Tanl.split.SentenceSplitter import *
>>> 
>>> splitter = SentenceSplitter('IT-Model.pickle').pipe(sys.stdin)
>>> for sentence in splitter:
...    print sentence
...
Prima frase. Seconda frase.
^D
Prima frase.
Seconda frase.
>>> _
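The module interface above iterates over `splitter.pipe(stream)`. A minimal stand-in with the same shape, assuming pipe() is a generator over an input stream; the regex split below replaces the real Punkt model, and NaiveSentenceSplitter is hypothetical:

```python
import re

class NaiveSentenceSplitter:
    """Stand-in mimicking the pipe() shape of Tanl's SentenceSplitter;
    splits on sentence-final punctuation followed by whitespace instead
    of using a trained Punkt model."""
    def __init__(self, model_file=None):
        self.model_file = model_file  # the real class would unpickle Punkt here

    def pipe(self, stream):
        # Lazily yield one sentence per detected boundary.
        for line in stream:
            for sentence in re.split(r"(?<=[.!?])\s+", line.strip()):
                if sentence:
                    yield sentence

splitter = NaiveSentenceSplitter("IT-Model.pickle")
sentences = list(splitter.pipe(["Prima frase. Seconda frase."]))
```

Because pipe() is lazy, the splitter can consume sys.stdin line by line just as in the interpreter session above.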

Downloads

* Sentence Splitter (Version 1.5): http://medialab.di.unipi.it/Project/SemaWiki/Tools/SentenceSplitter.py
* Italian Training Corpus