The SuperSense Tagger (SST) assigns one of the 45 standard WordNet supersenses to each noun, verb, adjective and adverb in a sentence. The current version is based on a modified version of the Tanl NE Tagger, extended with task-specific features.

History

The first version of the tool was based on the SuperSense Tagger developed by Ciaramita and Altun, an implementation of Michael Collins's perceptron-trained HMM (Hidden Markov Model).

The new tool has significantly better accuracy and speed and is fully integrated in the Tanl pipeline.


SST API

The SST functionality is provided through the following class:

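// SST is a pipeline stage: it consumes and produces sentences,
// each represented as a vector of tokens.
class SST : public
   IPipe<Enumerator<vector<Token*>*>, Enumerator<vector<Token*>*> >
{
public:
   // configFile is optional.
   SST(char const* modelFile, char const* configFile = 0);

   // Train on the sentences supplied by sentenceReader.
   void train(SentenceReader* sentenceReader);

   // Tag the sentences enumerated by sen.
   Enumerator<vector<Token*>*> pipe(Enumerator<vector<Token*>*>& sen);
};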

The SST can also be used from Python as part of a pipeline:

from splitter.SentenceSplitter import *
from splitter.Tokenizer import *
from tag.PosTagger import *
from tag.SST import *

ssp = SentenceSplitter('en.ss').pipe('infile')
tokp = Tokenizer('en.tok').pipe(ssp)
posp = PosTagger('en.pos').pipe(tokp)
sstp = SST('it.sst').pipe(posp)

for tok in sstp:
   print(tok)


SST Pipe

The SST implements the Tanl pipeline interface. The following Python example trains a model:

import SST
c = SST.Corpus('it', 'conll03.fmt')
sr = c.sentenceReader('sst.train')
sst = SST.SST(None)
sst.train(sr, 'sst.me')

or tagging a document:

import SST
c = SST.Corpus('it', 'conll03.fmt')
sr = c.sentenceReader('sst.test')
sst = SST.SST('sst.me')   # load the trained model
np = sst.pipe(sr)
for x in np:
   print(x)


SST Design

The SST is based on a Maximum Entropy classifier, and uses two types of features: local and global.

The following are the local features, extracted from contiguous input tokens:

Lexical Features

 AllAlpha
 AllDigits
 AllQuoting
 AllUpper
 Capitalized
 ContainsDOT
 ContainsComma
 ContainsSlash
 ContainsDash
 ContainsDigit
 ContainsDollar
 firstWordCap   // first word of sentence is capitalized
 firstWordNoCap // first word but not capitalized
 IsYear
 HyphenCapCap   // Str1-Str2
 HyphenNoCapCap // str1-Str2
 HyphenCapNoCap // Str1-str2
 MixedCase
 NoLetter
 SingleChar
 SingleS  // 's
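
As an illustration only, a few of the lexical features above could be computed as in the following Python sketch. It is not the Tanl implementation: the function name lexical_features, its is_first parameter and the exact definitions (for example, what counts as Capitalized or as a year) are assumptions made for this example.

```python
import re

def lexical_features(token, is_first=False):
    """Return the names of some lexical features that fire for a token.

    Hypothetical helper: the definitions below are plausible guesses,
    not the ones used by the tagger itself.
    """
    feats = []
    if token.isalpha():
        feats.append('AllAlpha')
    if token.isdigit():
        feats.append('AllDigits')
    if token.isalpha() and token.isupper():
        feats.append('AllUpper')
    # Assumed: Capitalized means the first character is uppercase.
    capitalized = token[:1].isupper()
    if capitalized:
        feats.append('Capitalized')
    if '.' in token:
        feats.append('ContainsDOT')
    if '-' in token:
        feats.append('ContainsDash')
    if any(c.isdigit() for c in token):
        feats.append('ContainsDigit')
    if is_first:
        feats.append('firstWordCap' if capitalized else 'firstWordNoCap')
    # Assumed: a year is a four-digit number starting with 1 or 2.
    if re.match(r'^[12]\d{3}$', token):
        feats.append('IsYear')
    if len(token) == 1:
        feats.append('SingleChar')
    if token == "'s":
        feats.append('SingleS')
    return feats
```

Under these assumptions, lexical_features('FCC') yields AllAlpha, AllUpper and Capitalized, while lexical_features('1999') yields AllDigits, ContainsDigit and IsYear.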

Token category features

adj.all
adj.pert
adj.ppl
adv.all
noun.Tops
noun.act
noun.animal
noun.artifact
noun.attribute
noun.body
noun.cognition
noun.communication
noun.event
noun.feeling
noun.food
noun.group
noun.location
noun.motive
noun.object
noun.other
noun.person
noun.phenomenon
noun.plant
noun.possession
noun.process
noun.quantity
noun.relation
noun.shape
noun.state
noun.substance
noun.time
verb.body
verb.change
verb.cognition
verb.communication
verb.competition
verb.consumption
verb.contact
verb.creation
verb.emotion
verb.motion
verb.perception
verb.possession
verb.social
verb.stative
verb.weather

Local Features

 prevCap        // previous is capitalized
 nextCap        // next is capitalized
 seqCap         // previous, current and next are all Capitalized
 seqBreakCap    // previous and next are Capitalized, current not
 CapNext        // current and next are Capitalized
 noCapNext      // current is Capitalized but next is not
 CapPrev        // current and previous are Capitalized
 noCapPrev      // current is Capitalized but previous is not

 withinQuotes   // word is in sequence within quotes
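
The capitalization features above depend only on the current token and its immediate neighbours. The following Python sketch shows one way they might be derived. The names is_cap and context_features are hypothetical, a missing neighbour at a sentence boundary is assumed to count as not capitalized, and withinQuotes is left out because it needs quote tracking across the sentence.

```python
def is_cap(token):
    # Assumed definition: the first character is uppercase.
    return token[:1].isupper()

def context_features(tokens, i):
    """Capitalization-context features for tokens[i] (illustrative only)."""
    cur = is_cap(tokens[i])
    prev = i > 0 and is_cap(tokens[i - 1])
    nxt = i + 1 < len(tokens) and is_cap(tokens[i + 1])
    feats = []
    if prev:
        feats.append('prevCap')
    if nxt:
        feats.append('nextCap')
    if prev and nxt:
        # Both neighbours capitalized: current either continues or breaks the run.
        feats.append('seqCap' if cur else 'seqBreakCap')
    if cur:
        feats.append('CapNext' if nxt else 'noCapNext')
        feats.append('CapPrev' if prev else 'noCapPrev')
    return feats
```

For example, in the sequence Bank, of, America, said, the token "of" receives prevCap, nextCap and seqBreakCap.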
 

POS Tags

POS at positions -2 -1 0 1 2

CPOS Tags

CPOS at positions -2 -1 0
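
The POS and CPOS features are window features: the tag found at each listed offset from the current token becomes a feature. A minimal Python sketch follows, assuming the tags are available as a list per sentence. The helper name window_features and the feature string format (NAME, signed offset, tag) are invented for this example, and offsets falling outside the sentence are simply skipped.

```python
def window_features(tags, i, name, offsets):
    """Build one feature per offset that falls inside the sentence.

    Hypothetical helper; the feature naming scheme is an assumption.
    """
    feats = []
    for off in offsets:
        j = i + off
        if 0 <= j < len(tags):
            # %+d always prints the sign, e.g. -2, +0, +1.
            feats.append('%s%+d=%s' % (name, off, tags[j]))
    return feats

POS_OFFSETS = (-2, -1, 0, 1, 2)   # POS at positions -2 -1 0 1 2
CPOS_OFFSETS = (-2, -1, 0)        # CPOS at positions -2 -1 0
```

For instance, window_features(['DT', 'NN', 'VBZ'], 0, 'POS', POS_OFFSETS) produces features only for offsets 0, 1 and 2, since the first token has no left context.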

Word Shape (currently under development)

A transformation of the word in which each character c of a string s is replaced by X if c is uppercase, by x if c is lowercase and by d if c is a digit; any other character is left unchanged. In addition, each sequence of two or more identical characters c is replaced by c*.
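
The transformation can be sketched in Python as follows. The function name word_shape is hypothetical, and the sketch assumes that repeated characters are collapsed after the character mapping has been applied (so that a run of lowercase letters becomes x*); the feature is described as under development, so the actual behaviour may differ.

```python
import re

def word_shape(s):
    """Map a word to its shape (illustrative sketch)."""
    mapped = []
    for c in s:
        if c.isupper():
            mapped.append('X')
        elif c.islower():
            mapped.append('x')
        elif c.isdigit():
            mapped.append('d')
        else:
            mapped.append(c)   # punctuation and other symbols are kept
    shape = ''.join(mapped)
    # Collapse each run of two or more identical characters c into c*.
    return re.sub(r'(.)\1+', r'\1*', shape)
```

With this definition, "McDonald" maps to XxXx*, "2008" maps to d* and "AT&T" maps to X*&X.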


Global document features

otherCapitalized // another occurrence of the word, not as first word
                 // of a sentence, had the feature Capitalized

  // Acronyms
  // An AllUpper word is stored as an acronym.
  // Caps sequences with those initials will be given the following features:
Acronym         // e.g. FCC
AcronymBegin    // e.g. Federal
AcronymContinue // e.g. Communications
AcronymEnd      // e.g. Commission
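
The acronym features can be pictured as a two-pass procedure over a document: first collect the all-uppercase words, then look for capitalized sequences whose initials spell one of them. The Python sketch below is a simplified illustration under several assumptions: the function name acronym_features is hypothetical, acronyms must have at least two letters, every word of the expansion must be capitalized (so lowercase function words inside a name are not handled), and the document is given as a flat list of tokens.

```python
def acronym_features(tokens):
    """Return one acronym feature (or None) per token of a document."""
    # Pass 1: store every all-uppercase alphabetic word as an acronym.
    acronyms = set(t for t in tokens
                   if len(t) > 1 and t.isalpha() and t.isupper())
    feats = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in acronyms:
            feats[i] = 'Acronym'
    # Pass 2: mark sequences of words whose initials spell an acronym.
    for acr in acronyms:
        n = len(acr)
        for i in range(len(tokens) - n + 1):
            span = tokens[i:i + n]
            if all(w[:1] == c and not w.isupper()
                   for w, c in zip(span, acr)):
                feats[i] = 'AcronymBegin'
                for j in range(i + 1, i + n - 1):
                    feats[j] = 'AcronymContinue'
                feats[i + n - 1] = 'AcronymEnd'
    return feats
```

Given a document containing both "FCC" and the sequence Federal, Communications, Commission, the three words are marked AcronymBegin, AcronymContinue and AcronymEnd, and "FCC" itself is marked Acronym.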


Italian Corpora

Corpora used to test the performance of the SuperSense Tagger:

  • MultiWordNet (MWN): a multilingual lexical database, developed at ITC-IRST in Trento, Italy, strictly aligned with Princeton WordNet 1.6. The Italian part of the corpus is built on the translation of a subset of the sentences that constitute the English Brown Corpus.
  • ItalWordNet (IWN): a large semantic database for the automatic treatment of the Italian language, developed at ILC in Pisa, Italy. It was built within the SI-TAL (Integrated System for the Automatic Treatment of Language) Italian National Project. The database was created by extending the Italian wordnet developed within the EuroWordNet project.

We are currently working to improve the quality of the supersense annotation of the ItalWordNet corpus.


To Do

  • Tune SstTagger Features
  • Improve Italian Corpus (IWN)


Resources

  • phpMyAdmin interface to access the MWN and IWN databases: link (restricted access)

