POS Tagger

From Medialab

Revision as of 11:10, 5 June 2013 by Giuseppe.Attardi

Tanl includes two Part-of-Speech taggers:

  • HmmTagger, a tagger derived from Hunpos and rewritten in C++
  • PosTagger, a variant of TreeTagger.

Both taggers can be trained on an annotated corpus and can incorporate knowledge from a full lexicon, including information about lemmas. The taggers can therefore return both the POS and the lemma for each token.

The taggers have been trained for Italian using an annotated corpus of about 220,000 tokens, consisting of:

  • 112,000 tokens from Repubblica.it
  • 97,500 tokens from the TUT-MIDT corpus
  • 19,000 tokens from a corpus of questions.

A large Italian lexicon of fully inflected forms, containing about 1.4 million forms, was also provided for training.

Relevant features of HmmTagger (partly shared with Hunpos):

  • HMM training/tagging is much faster than with more complex models, e.g. SVM and CRF.
  • Integrates knowledge from morphological analyzers/dictionaries into best path calculation.
  • A probabilistic suffix guesser handles unknown and unseen words. Unlike Hunpos, HmmTagger learns suffixes from the whole lexicon.
  • Handles large tag sets smoothly. For example, the Tanl Italian POS tag set consists of over 300 tags that incorporate morphological knowledge, and the tagger handles it without degradation in accuracy or performance. Training non-generative models on such data would be computationally expensive.
  • Contextualized lexical probabilities with a context window of any size. Unlike traditional HMM models, HmmTagger estimates emission (lexical) probabilities based on the current tag and previous tags.
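
The last point can be illustrated with a minimal Python sketch, which is not the HmmTagger implementation. It estimates emission probabilities by relative frequency, first conditioned on the current tag alone, as in a standard HMM, and then on the previous and the current tag. The function names, the toy corpus, the tag labels and the "<s>" start marker are assumptions made for illustration; the context is limited to one previous tag, and smoothing, which a real tagger needs, is omitted.

```python
from collections import Counter

def emission_tables(tagged_sentences):
    """Relative-frequency emission estimates from sentences of (word, tag) pairs."""
    tag_count = Counter()   # count(tag)
    word_tag = Counter()    # count(tag, word)
    ctx_count = Counter()   # count(prev_tag, tag)
    word_ctx = Counter()    # count(prev_tag, tag, word)
    for sent in tagged_sentences:
        prev = "<s>"        # hypothetical sentence-start pseudo tag
        for word, tag in sent:
            tag_count[tag] += 1
            word_tag[tag, word] += 1
            ctx_count[prev, tag] += 1
            word_ctx[prev, tag, word] += 1
            prev = tag

    def plain(word, tag):
        """P(word | tag): the standard HMM emission probability."""
        total = tag_count[tag]
        return word_tag[tag, word] / total if total else 0.0

    def contextual(word, prev_tag, tag):
        """P(word | prev_tag, tag): emission conditioned on one previous tag."""
        total = ctx_count[prev_tag, tag]
        return word_ctx[prev_tag, tag, word] / total if total else 0.0

    return plain, contextual

# Toy corpus (invented for this sketch).
corpus = [
    [("la", "DET"), ("porta", "NOUN")],
    [("la", "DET"), ("casa", "NOUN")],
    [("porta", "NOUN"), ("aperta", "ADJ")],
]
plain, contextual = emission_tables(corpus)
```

On this toy corpus the two estimates differ: plain("porta", "NOUN") pools all three NOUN occurrences, while contextual("porta", "DET", "NOUN") only looks at nouns that follow a determiner.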

A combination of the two Tanl PoS taggers achieved the best score in the Evalita 2009 shared task on PoS Tagging.

Tagger Interface

The Tanl POS taggers implement the following interface:

class PosTagger : public IPipe<Enumerator<std::vector<Token*>*>,
                               Enumerator<std::vector<Token*>*> >
{
public:
   /**
    *  Creates a pipe connected to the @c Enumerator @param se.
    *  @param se an @c Enumerator<vector<Token*>> from which
    *     vector<Token*> are extracted representing a sentence
    *     to tag.
    *  @return an @c Enumerator<vector<Token*>> of the tagged
    *     sentences produced by the tagger.
    *     The @c Token's in the result @c Enumerator are
    *     extensions of the corresponding input @c Token's with
    *     the addition of one attribute:
    *       POSTAG
    *     whose value represents the POS tag of the @c Token.
    *     Optionally the tagger may add also the attribute:
    *       LEMMA
    *     that represents the lemma of the given @c Token.
    */
   Enumerator<std::vector<Token*>*>*
      pipe(Enumerator<std::vector<Token*>*>& se);

   /** @return the set of tags used by the tagger */
   std::set<char const*> tags();
};
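
To make the contract of pipe concrete, here is a minimal Python sketch rather than the Tanl API. Sentences are lists of tokens, each token is modeled as a dict of attributes, and the stage yields copies of the input tokens extended with a POSTAG attribute. The class name ToyTagger, the dict representation, the FORM key and the lookup table are all assumptions made for illustration.

```python
class ToyTagger:
    """Stand-in tagger (hypothetical, not part of Tanl): tags by dictionary lookup."""

    def __init__(self, lexicon, default_tag="UNK"):
        self.lexicon = lexicon          # maps a word form to a POS tag
        self.default_tag = default_tag  # tag assigned to unknown forms

    def pipe(self, sentences):
        """Lazily yield each input sentence with POSTAG added to every token."""
        for sentence in sentences:
            tagged = []
            for token in sentence:
                extended = dict(token)  # copy, so the input token is not modified
                extended["POSTAG"] = self.lexicon.get(token["FORM"], self.default_tag)
                tagged.append(extended)
            yield tagged

    def tags(self):
        """Return the set of tags this tagger can assign."""
        return set(self.lexicon.values()) | {self.default_tag}

sentences = [[{"FORM": "un"}, {"FORM": "fiore"}]]
tagger = ToyTagger({"un": "DET", "fiore": "NOUN"})
result = list(tagger.pipe(iter(sentences)))
```

Because pipe is a generator, sentences are tagged one at a time as the consumer asks for them, which mirrors the enumerator-to-enumerator shape of the interface above.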

HmmTagger

The HmmTagger Tanl POS tagger exposes the following interface:

class HmmTagger : public PosTagger
{
  public:
  HmmTagger(const char* modelFile);

   Enumerator<std::vector<Token*>*>*
      pipe(Enumerator<std::vector<Token*>*>& se);
};

TreeTagger

The TreeTagger Tanl POS tagger exposes the following interface:

class TreeTagger : public PosTagger
{
  public:
  TreeTagger(const char* modelFile);

   /**
    *  Creates a pipe connected to the @c Enumerator @param se.
    *  @param se an @c Enumerator<vector<Token*>> from which
    *     vector<Token*> are extracted representing a sentence
    *     to tag.
    *  @return an @c Enumerator<vector<Token*>> of the tagged
    *     sentences produced by the tagger.
    *     The @c Token's in the result @c Enumerator are
    *     extensions of the corresponding input @c Token's with
    *     the addition of two attributes:
    *       POSTAG
    *       LEMMA
    *     whose values represent respectively:
    *       the POS tag and the lemma of the @c Token.
    */
   Enumerator<std::vector<Token*>*>*
      pipe(Enumerator<std::vector<Token*>*>& se);

   /** @return the set of tags used by the tagger */
   std::set<char const*> tags();
};

Pipeline Usage

Both taggers can be used in a pipeline, accepting as input either:

  • a stream
  • a Tanl pipe
  • a Python iterator

For example:

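from Tokenizer import *
from HmmTagger import *

# Assumption: path to a trained model, e.g. the model.hmm produced by PosTrain
taggerModel = "model.hmm"

lines = ["Per fare un albero,", "ci vuole un fiore."]
p1 = Tokenizer().pipe(iter(lines))      # tokenize each line
p2 = HmmTagger(taggerModel).pipe(p1)    # tag the tokenized sentences

for sent in p2:
  print(sent)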

Instead of an iterator, a file can be passed as the argument to pipe, for example sys.stdin to read from standard input.
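
Why a list iterator, a file, or another pipe are interchangeable as input can be sketched with plain Python generators. The two stages below are hypothetical stand-ins for the Tanl Tokenizer and HmmTagger modules, which are not imported here; the whitespace tokenizer and the lookup table are assumptions made only to keep the sketch runnable.

```python
import io

def tokenize(lines):
    """Stand-in tokenizer stage: one whitespace-split sentence per input line."""
    for line in lines:
        yield line.split()

def tag(sentences, tag_of):
    """Stand-in tagger stage: pair each token with a tag from a lookup table."""
    for sentence in sentences:
        yield [(form, tag_of.get(form, "UNK")) for form in sentence]

tag_of = {"un": "DET", "fiore": "NOUN"}

# Input as a Python iterator over strings.
from_iterator = list(tag(tokenize(iter(["un fiore"])), tag_of))

# Input as a file-like stream; sys.stdin could be passed the same way.
stream = io.StringIO("un fiore\n")
from_stream = list(tag(tokenize(stream), tag_of))
```

Each stage only iterates over its argument, so anything that yields lines or sentences can feed it, including the output of another stage.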

Command Line Usage

HmmTagger

Training usage:

$ PosTrain -m lexicon model.hmm < train-file

Tagging usage:

$ PosTag model.hmm < input-file > output-file
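
When the tagger has to be driven from a script, the same invocation can be wrapped in Python. The sketch below only reproduces the shell redirection of the tagging command with the standard subprocess module. The function name tag_file and its runner parameter are assumptions for illustration, and PosTag is assumed to be on the PATH.

```python
import subprocess

def tag_file(model, input_path, output_path, runner=subprocess.run):
    """Run `PosTag model < input_path > output_path` and return the runner's result."""
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        # check=True raises CalledProcessError if PosTag exits with an error.
        return runner(["PosTag", model], stdin=src, stdout=dst, check=True)
```

A call such as tag_file("model.hmm", "input-file", "output-file") then corresponds to the command line shown for tagging.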

TreeTagger

Training usage:

$ train-tree-tagger -utf8 -cl 2 -dtg 0.05 -sw 2 -ecw 1 -atg 1 -st "FS" fullexMorph.tanl openclassMorph.tanl train-file model.par

Tagging usage:

$ tree-tagger -token model.par input-file > output-file