Difference between revisions of "POS Tagger"

From Medialab


Revision as of 19:02, 1 September 2009

Part-of-speech tagging is performed using a variant of TreeTagger.

The tagger is trained on a large Italian lexicon of fully inflected forms, containing about 1.4 million forms.

Tagger Interface

The Tanl POS taggers implement the following interface:

class PosTagger : IPipe<Enumerator<std::vector<Token*>*>,
                        Enumerator<std::vector<Token*>*> >
{
public:
   /**
    *  Creates a pipe connected to the @c Enumerator @param se.
    *  @param se an @c Enumerator<vector<Token*>> from which
    *     vector<Token*> are extracted representing a sentence
    *     to tag.
    *  @return an @c Enumerator<vector<Token*>> of the tagged
    *     sentences produced by the tagger.
    *     The @c Token's in the result @c Enumerator are
    *     extensions of the corresponding input @c Token's with
    *     the addition of one attribute:
    *       POSTAG
    *     whose value represents the POS tag of the @c Token.
    *     Optionally, the tagger may also add the attribute:
    *       LEMMA
    *     that represents the lemma of the given @c Token.
    */
   Enumerator<std::vector<Token*>*>*
      pipe(Enumerator<std::vector<Token*>*>& se);

   /** @return the set of tags used by the tagger */
   std::set<char const*> tags();
};
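
The pipe pattern described above can be sketched with stand-in types. Everything in this block is an assumption made for illustration: the `Token` layout, the `MoveNext`/`Current` methods on `Enumerator`, and the names `VectorEnumerator`, `ToyTaggerStage`, `toyPipe` and `demoTags` are hypothetical and are not taken from the Tanl headers. The tagging rule is a placeholder, and for brevity the sketch sets POSTAG on the input tokens in place rather than producing extended copies.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the Tanl types; the real headers may differ.
struct Token {
    std::string form;
    std::map<std::string, std::string> attributes;
    explicit Token(std::string f) : form(std::move(f)) {}
};

typedef std::vector<Token*> Sentence;

template <class T>
struct Enumerator {
    virtual ~Enumerator() {}
    virtual bool MoveNext() = 0;  // advance; returns false when exhausted
    virtual T Current() = 0;      // item at the current position
};

// Source stage: enumerates sentences already held in memory.
class VectorEnumerator : public Enumerator<Sentence*> {
public:
    explicit VectorEnumerator(std::vector<Sentence*> items)
        : items_(std::move(items)), next_(0) {}
    bool MoveNext() override {
        if (next_ >= items_.size()) return false;
        ++next_;
        return true;
    }
    Sentence* Current() override {
        assert(next_ > 0);  // MoveNext() must succeed before Current()
        return items_[next_ - 1];
    }
private:
    std::vector<Sentence*> items_;
    std::size_t next_;
};

// Tagger stage: pulls one sentence at a time from the upstream enumerator
// and sets POSTAG on each token. The rule below is a placeholder, not a model.
class ToyTaggerStage : public Enumerator<Sentence*> {
public:
    explicit ToyTaggerStage(Enumerator<Sentence*>& source)
        : source_(source), current_(nullptr) {}
    bool MoveNext() override {
        if (!source_.MoveNext()) return false;
        current_ = source_.Current();
        for (Token* tok : *current_)
            tok->attributes["POSTAG"] = (tok->form == ".") ? "FS" : "UNK";
        return true;
    }
    Sentence* Current() override { return current_; }
private:
    Enumerator<Sentence*>& source_;
    Sentence* current_;
};

// Same shape as PosTagger::pipe: takes an enumerator by reference and
// returns a new enumerator that the caller must delete.
Enumerator<Sentence*>* toyPipe(Enumerator<Sentence*>& se) {
    return new ToyTaggerStage(se);
}

// Runs one sentence through the pipe and renders it as "form/TAG ...".
std::string demoTags(const std::vector<std::string>& forms) {
    std::vector<Token> tokens;
    for (const std::string& f : forms) tokens.emplace_back(f);
    Sentence sentence;
    for (Token& t : tokens) sentence.push_back(&t);
    std::vector<Sentence*> sentences;
    sentences.push_back(&sentence);
    VectorEnumerator source(sentences);

    Enumerator<Sentence*>* tagged = toyPipe(source);
    std::string out;
    while (tagged->MoveNext()) {
        for (Token* tok : *tagged->Current()) {
            if (!out.empty()) out += ' ';
            out += tok->form + "/" + tok->attributes["POSTAG"];
        }
    }
    delete tagged;
    return out;
}
```

Because each stage pulls sentences from the one before it only when asked, tagging happens lazily, and further stages could be chained onto the returned enumerator in the same way.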

TreeTagger

The TreeTagger Tanl POS tagger exposes the following interface:

class TreeTagger : public PosTagger
{
public:
   TreeTagger(const char* modelFile);

   /**
    *  Creates a pipe connected to the @c Enumerator @param se.
    *  @param se an @c Enumerator<vector<Token*>> from which
    *     vector<Token*> are extracted representing a sentence
    *     to tag.
    *  @return an @c Enumerator<vector<Token*>> of the tagged
    *     sentences produced by the tagger.
    *     The @c Token's in the result @c Enumerator are
    *     extensions of the corresponding input @c Token's with
    *     the addition of two attributes:
    *       POSTAG
    *       LEMMA
    *     whose values represent respectively:
    *       the POS tag and the lemma of the @c Token.
    */
   Enumerator<std::vector<Token*>*>*
      pipe(Enumerator<std::vector<Token*>*>& se);

   /** @return the set of tags used by the tagger */
   std::set<char const*> tags();
};
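
One point worth keeping in mind when consuming `tags()`: a `std::set<char const*>` with the default comparator orders and looks up its elements by pointer address, not by the text they point to. The sketch below shows one way a caller might get content-based lookup by copying the tags into a `std::set<std::string>`. The helper names `tagsByValue` and `hasTag` are hypothetical and not part of the Tanl API.

```cpp
#include <set>
#include <string>

// Copies a pointer-keyed tag set into a set keyed by string content,
// so that membership tests compare text rather than addresses.
std::set<std::string> tagsByValue(const std::set<char const*>& tags) {
    std::set<std::string> out;
    for (char const* t : tags) {
        if (t) out.insert(t);  // skip null pointers defensively
    }
    return out;
}

// Returns true if some element of the set spells the given tag.
bool hasTag(const std::set<char const*>& tags, const std::string& tag) {
    return tagsByValue(tags).count(tag) > 0;
}
```

If many lookups are needed, it is cheaper to call `tagsByValue` once and reuse the result than to call `hasTag` repeatedly.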


Usage

TreeTagger

Training usage:

$(TAGGERDIR)/train-tree-tagger -utf8 -cl 2 -dtg 0.05 -sw 2 -ecw 1 -atg 1 -st "FS" fullexMorph.tanl openclassMorph.tanl train-file model.par

Tagging usage:

$(TAGGERDIR)/tree-tagger -token model.par input-file > output-file

Hunpos

Training usage:

$(HUNPOSDIR)/hunpos-train model.par < train-file

Tagging usage:

$(HUNPOSDIR)/hunpos-tag -m lexicon.tanl model.par < input-file > output-file