Tanl modules can be connected in a pipeline where each module consumes a stream of input items and produces a stream for later modules.

The pipeline model consists of three types of components:

  1. source: creates an initial pipe (e.g. a document reader)
  2. transform: receives data from one pipe and produces output on another pipe
  3. sink: consumes the output of a pipe.

For example, a SentenceSplitter is a source that creates a pipe from an input stream:

ss = SentenceSplitter('italian.pickle').pipe(stdin)

The pipe can be connected to other stages that perform tokenization, POS tagging and parsing as follows:

# creates a SentenceSplitter reading from standard input
ss = SentenceSplitter('italian.pickle').pipe(stdin)
# connect a tokenizer to its output
wt = Tokenizer().pipe(ss)
# connect a PosTagger to add POS tags to the sentences
pt = PosTagger('italian.pos').pipe(wt)
# connect a Parser that uses an SVM model for Italian
pa = Parser.create('italian.SVM').pipe(pt)

A sink consumes the output of a pipeline and thereby drives its execution. It can be as simple as a Python iteration:

for sent in pa:
   print sent
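
Internally, each stage just wraps the pipe it receives. As a rough illustration (the class below is hypothetical, not part of the Tanl API), a transform can be written as an object whose pipe() method returns a generator over the items of the upstream pipe:

# hypothetical transform, shown only to illustrate the model (not a Tanl module)
class Lowercaser:
    def pipe(self, source):
        # wrap the upstream pipe lazily: each item is produced
        # only when a downstream stage asks for it
        return (item.lower() for item in source)

# usage: lc = Lowercaser().pipe(some_pipe), where some_pipe is any upstream pipe

In a sketch like this, nothing is computed until a sink starts pulling items from the last pipe, which matches the driving role of the sink described above.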

Split and join

Sometimes a stage in the pipeline splits the data into parts that are processed by separate pipelines and later recombined by means of a join.

In the following example, the source is created by reading documents from a corpus. A pipe containing just the text of each document is created so that the text can be processed independently. The output of this pipeline is then recombined with the original source to produce the resulting documents.

The source is created from the document reader of the Wikipedia corpus:

co = Corpus("wikipedia")
reader = co.docReader(open('wiki.xml'))

A pipe of plain text is obtained from this reader:

p1 = reader.textIterator()

and this is passed through further transforms:

p2 = SentenceSplitter('italian.punkt').pipe(p1)
p3 = Tokenizer().pipe(p2)
p4 = PosTagger('italian.ttag').pipe(p3)

The output of the last stage of the pipe is recombined into documents:

p6 = reader.join(p4)

The output from this pipe is printed:

for doc in p6:
   print doc

All together:

co = Corpus("wikipedia")
reader = co.docReader(open('wiki.xml'))
p1 = reader.textIterator()
p2 = SentenceSplitter('italian.punkt').pipe(p1)
p3 = Tokenizer().pipe(p2)
p4 = PosTagger('italian.ttag').pipe(p3)
p6 = reader.join(p4)
for doc in p6:
    print doc
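
The join step pairs the tagged text coming out of the pipeline back with the documents produced by the reader; the actual recombination is performed by the reader itself. As a rough sketch only (the attribute names are hypothetical), a join can be pictured as a generator that, for each original document, collects its processed sentences and attaches them to it:

# illustrative sketch of a join; attribute names are hypothetical,
# the real recombination is done by the document reader
def join(documents, processed):
    # documents: the original document objects from the reader
    # processed: the iterator produced by the last stage of the text pipeline
    for doc in documents:
        # collect as many processed sentences as the document contributed
        doc.sentences = [next(processed) for _ in range(doc.num_sentences)]
        yield doc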

Enumerator Interface

A stream of items is defined through an Enumerator interface:

template <class T>
class Enumerator
{
public:
  typedef T             ItemType;

  /** Advances to the next element of the collection.
      @return true if there is another item available. */
  virtual bool          MoveNext() = 0;

  /** @return the current element. */
  virtual ItemType      Current() = 0;

  /**   Optional reset to the beginning of the enumeration. */
  virtual void          Reset() {}
};

Each module provides an interface for connecting to a pipeline:

template <class Tin, class Tout>
struct IPipe {
   Enumerator<Tout>* pipe(Enumerator<Tin>&);
};

For example, the POS tagger implements the interface:

class PosTagger : public IPipe<vector<Token*>*, vector<Token*>*>
{
public:
   Enumerator<vector<Token*>*>* pipe(Enumerator<vector<Token*>*>& se);
 …
};

so that it can be connected to a pipeline like this:

// get a pipe from a tokenizer
Enumerator<vector<Token*>*>* wt = tokenizer.pipe(cin);
// create a tagger
PosTagger pt("italian.params");
// connect it to a pipe
Enumerator<vector<Token*>*>* ptpipe = pt.pipe(*wt);
// consume data from the pipe
while (ptpipe->MoveNext()) {
   vector<Token*>* tagged = ptpipe->Current();
}

C++ modules are made available to Python through SWIG, and a pipeline can be built dynamically from a Python script.
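
For instance, a wrapped Enumerator exposes MoveNext() and Current() to Python; adapting it to the iteration style used in the examples above could look like the following sketch (the adapter is illustrative, not part of the generated bindings):

# illustrative adapter, not part of the SWIG-generated bindings:
# turns a wrapped C++ Enumerator into an ordinary Python iterator
class EnumeratorAdapter:
    def __init__(self, enum):
        self.enum = enum              # a SWIG-wrapped Enumerator instance
    def __iter__(self):
        return self
    def next(self):                   # iteration protocol used by 'for ... in'
        if not self.enum.MoveNext():
            raise StopIteration
        return self.enum.Current()

# usage: for sent in EnumeratorAdapter(some_wrapped_pipe): print sent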
