A Named Entity tagger prototype is available in directory 'NER'.

NER API

The NER functionality is provided through the following class:

class NER : public
   IPipe<Enumerator<vector<Token*>*>, Enumerator<vector<Token*>*> >
{
public:
   NER(char const* modelFile, char const* configFile = 0);

   void train(SentenceReader* sentenceReader);

   Enumerator<vector<Token*>*> pipe(Enumerator<vector<Token*>*>& sen);
};

As usual, the NER can be used from Python as part of a pipeline:

from splitter.SentenceSplitter import *
from splitter.Tokenizer import *
from tag.PosTagger import *
from tag.NER import *

ssp = SentenceSplitter('en.ss').pipe('infile')
tokp = Tokenizer('en.tok').pipe(ssp)
posp = PosTagger('en.pos').pipe(tokp)
nerp = NER('en.ner').pipe(posp)

for tok in nerp:
   print tok

NER Design

The tagger is inspired by the design of Chieu & Ng (2003), which achieved the second-best score (and the best among non-combination taggers) at the CoNLL 2003 Shared Task.

The NER is based on a Maximum Entropy classifier, and uses two types of features: local and global.
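
As a rough illustration of how a Maximum Entropy (log-linear) classifier turns binary features into class probabilities, here is a minimal sketch. The function name, the weight values, and the "O" label for non-entity tokens are assumptions made for this example; they are not taken from the tagger's code.

```python
import math

def maxent_probabilities(active_features, weights, classes):
    """Return P(class | features) under a log-linear model.

    weights maps (feature, class) pairs to real values; pairs that are
    absent contribute 0 to the score.
    """
    scores = {
        c: sum(weights.get((f, c), 0.0) for f in active_features)
        for c in classes
    }
    # Subtract the maximum score before exponentiating, for numerical stability.
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

# Hypothetical weights, for illustration only.
weights = {
    ("Capitalized", "PERSON"): 1.2,
    ("lastOrg", "ORGANIZATION"): 2.0,
    ("AllDigits", "O"): 1.5,
}
probs = maxent_probabilities({"Capitalized", "lastOrg"}, weights,
                             ["PERSON", "ORGANIZATION", "O"])
best = max(probs, key=probs.get)
```

In a real model the weights would be estimated during training; here they only show how active features add up to a per-class score.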

The following are the local features, extracted from contiguous input tokens:

 // Lexical Features
 AllAlpha
 AllDigits
 AllQuoting
 AllUpper
 Capitalized
 ContainsDOT
 ContainsComma
 ContainsSlash
 ContainsDash
 ContainsDigit
 ContainsDollar
 firstWordCap   // first word of sentence is capitalized
 firstWordNoCap // first word but not capitalized
 IsYear
 HyphenCapCap   // Str1-Str2
 HyphenNoCapCap // str1-Str2
 HyphenCapNoCap // Str1-str2
 MixedCase
 NoLetter
 SingleChar
 SingleS  // 's

 // Token category features

 UNKNOWN
 PERSON
 ORGANIZATION
 LOCATION
 PRODUCT
 MONEY
 NUMBER
 MEASURE
 DURATION
 DATE
 TIME
 QUANTITY
 URL
 EMAIL

 // Local Features
 prevCap        // previous is capitalized
 nextCap        // next is capitalized
 seqCap         // previous and next are Capitalized, current also
 seqBreakCap    // previous and next are Capitalized, current not
 CapNext        // current and next are Capitalized
 noCapNext      // current is Capitalized but next is not
 CapPrev        // current and previous are Capitalized
 noCapPrev      // current is Capitalized but previous is not

 withinQuotes   // word is in sequence within quotes
 rare           // word not present in FWL

 bigramLoc      // w-2, w-1 appear in CPB list for Locations
 bigramOrg      // w-2, w-1 appear in CPB list for Organizations
 bigramPers     // w-2, w-1 appear in CPB list for Persons

 suffixLoc      // 3-letter suffix present in SUF list for Locations
 suffixOrg      // 3-letter suffix present in SUF list for Organizations
 suffixPers     // 3-letter suffix present in SUF list for Persons
 suffixProd     // 3-letter suffix present in SUF list for Product

 lastLoc        // present in List of Location Last words
 lastOrg        // present in List of Organization Last words
 lastPers       // present in List of Person Last words
 lastProd       // present in List of Product Last words

 lowerSeqLoc    // word in List LNW appears in sequence of Caps
 lowerSeqOrg    // word in List LNW appears in sequence of Caps
 lowerSeqPers   // word in List LNW appears in sequence of Caps

 lowerLoc       // word appearing in list Lower Location Words
 lowerOrg       // word appearing in list Lower Organization Words
 lowerPers      // word appearing in list Lower Person Words

 // POS Tags
 POS            // multiple values: one for each POS
 prevPOS        // similarly
 nextPOS        // similarly
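
A few of the local features listed above can be sketched as follows. This is only an illustration under assumptions: the exact definitions used by the tagger (for instance what counts as a year, or whether AllUpper requires letters only) are not given here, and the function names are hypothetical.

```python
import re

def lexical_features(word, first_in_sentence=False):
    """Return the names of some lexical features that fire for one token."""
    feats = set()
    capitalized = word[:1].isupper()
    if word.isalpha():
        feats.add("AllAlpha")
    if word.isdigit():
        feats.add("AllDigits")
    if word.isalpha() and word.isupper():
        feats.add("AllUpper")
    if capitalized:
        feats.add("Capitalized")
    if any(ch.isdigit() for ch in word):
        feats.add("ContainsDigit")
    if "-" in word:
        feats.add("ContainsDash")
    # Assumption: a year is a 4-digit number between 1000 and 2099.
    if re.fullmatch(r"(1[0-9]|20)\d\d", word):
        feats.add("IsYear")
    if len(word) == 1:
        feats.add("SingleChar")
    if first_in_sentence:
        feats.add("firstWordCap" if capitalized else "firstWordNoCap")
    return feats

def is_cap(word):
    return word[:1].isupper()

def context_features(tokens, i):
    """Capitalization features of token i relative to its neighbours."""
    cur = is_cap(tokens[i])
    prev = i > 0 and is_cap(tokens[i - 1])
    nxt = i + 1 < len(tokens) and is_cap(tokens[i + 1])
    feats = set()
    if prev:
        feats.add("prevCap")
    if nxt:
        feats.add("nextCap")
    if prev and nxt:
        feats.add("seqCap" if cur else "seqBreakCap")
    if cur:
        feats.add("CapNext" if nxt else "noCapNext")
        feats.add("CapPrev" if prev else "noCapPrev")
    return feats

tokens = ["Bank", "of", "America", "said"]
middle = context_features(tokens, 1)
```

For the token "of" in "Bank of America", both neighbours are capitalized while the token itself is not, so seqBreakCap fires together with prevCap and nextCap.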

The token category features are extracted from tokens by means of language-specific regular expressions.
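
The regular expressions themselves are not listed on this page. The sketch below shows how such a categorizer might look for English; every pattern in it is a hypothetical stand-in, covering only a handful of the categories.

```python
import re

# Hypothetical English patterns, checked in order; the first match wins.
CATEGORY_PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")),
    ("URL", re.compile(r"(https?://|www\.)\S+")),
    ("MONEY", re.compile(r"[$€£]\d+([.,]\d+)*")),
    ("TIME", re.compile(r"([01]?\d|2[0-3]):[0-5]\d")),
    ("NUMBER", re.compile(r"[+-]?\d+([.,]\d+)*")),
]

def token_category(token):
    """Return the first category whose pattern matches the whole token."""
    for name, pattern in CATEGORY_PATTERNS:
        if pattern.fullmatch(token):
            return name
    return "UNKNOWN"

categories = [token_category(t) for t in
              ["john@example.com", "$12.50", "10:30", "42", "hello"]]
```

Ordering matters in this sketch: more specific patterns such as MONEY and TIME come before NUMBER, so that a token is not claimed by a more general category.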

For some of the above features, the NER uses the following dictionaries:

  1. companies
  2. money
  3. names
  4. person
  5. time
  6. product
  7. location

in order to compute the following features:

Company
Location
Money
Name   // first name
PrevName
Person
NextPerson
Product
Time
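
A minimal sketch of dictionary lookup features follows. The dictionary contents, the case-insensitive matching, and the reading of PrevName and NextPerson as "previous token is a first name" and "next token is in the person dictionary" are all assumptions made for illustration.

```python
# Hypothetical dictionary contents; the real dictionaries are loaded from files.
DICTIONARIES = {
    "Company": {"ibm", "fiat"},
    "Location": {"pisa", "edmonton"},
    "Name": {"john", "maria"},     # first names
    "Person": {"smith", "rossi"},
}

def dictionary_features(tokens, i):
    """Return dictionary features for token i, including neighbour lookups."""
    word = tokens[i].lower()
    feats = {name for name, words in DICTIONARIES.items() if word in words}
    if i > 0 and tokens[i - 1].lower() in DICTIONARIES["Name"]:
        feats.add("PrevName")
    if i + 1 < len(tokens) and tokens[i + 1].lower() in DICTIONARIES["Person"]:
        feats.add("NextPerson")
    return feats

tokens = ["John", "Smith", "visited", "Pisa"]
per_token = [dictionary_features(tokens, i) for i in range(len(tokens))]
```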

The following lists are created during training:

FWL (Frequent Word List): words that occur in more than 5 documents

CPW (Common Preceding Words): 20 words that most often precede
   a certain class

CPB (Common Preceding Bigrams): bigrams that often precede
   a certain class

SUF (Suffix for Class): common 3-4 letter suffix for a certain class
   (-ian, -ish)

NLW (Name Last Words): list of words terminating a Name sequence
   Organization: Inc, Org, Co
   Locations: center, museum, square, street
   Person: Jr, II, III

LNW (Lowercase Name Words): list of lowercase words appearing in a
   Name sequence
   Organization: al, in, zonder, vor, for
   Person: "van der", "de", "of"
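
Two of these lists, FWL and CPW, can be sketched as below. The input formats (documents as token lists, sentences as lists of (word, label) pairs with "O" for non-entity tokens) and the function names are assumptions; the sketch also assumes a preceding word is the non-entity token immediately before the start of an entity.

```python
from collections import Counter, defaultdict

def frequent_word_list(documents, min_docs=5):
    """Words that occur in more than min_docs documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))   # count each word once per document
    return {w for w, n in doc_freq.items() if n > min_docs}

def common_preceding_words(sentences, top_n=20):
    """For each class, the top_n words that most often precede an entity."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for (prev_word, prev_label), (_, label) in zip(sent, sent[1:]):
            if prev_label == "O" and label != "O":
                counts[label][prev_word.lower()] += 1
    return {label: [w for w, _ in c.most_common(top_n)]
            for label, c in counts.items()}

documents = [["the", "cat"]] * 6 + [["dog"]]
fwl = frequent_word_list(documents)

sentences = [
    [("Mr.", "O"), ("Smith", "PERSON")],
    [("Mr.", "O"), ("Rossi", "PERSON")],
    [("in", "O"), ("Pisa", "LOCATION")],
]
cpw = common_preceding_words(sentences)
```

The rare feature would then fire for any word that is not in the resulting FWL set.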

The following are the global document features:

otherPrevLoc    // another occurrence of current word
                // was preceded by a CPW for class Location
otherPrevOrg    // another occurrence of current word
                // was preceded by a CPW for class Organization
otherPrevPers   // another occurrence of current word
                // was preceded by a CPW for class Person

otherPrevBiLoc  // another occurrence of current word
                // was preceded by a bigram for class Location
otherPrevBiOrg  // another occurrence of current word
                // was preceded by a bigram for class Organization
otherPrevBiPers // another occurrence of current word
                // was preceded by a bigram for class Person

otherLastLoc    // another occurrence of current word
                // had feature lastLoc
otherLastOrg    // another occurrence of current word
                // had feature lastOrg
otherLastPers   // another occurrence of current word
                // had feature lastPers

otherCapitalized        // an occurrence not as first word had
                        // feature Capitalized

// Acronyms
// An AllUpper word is stored as an acronym.
// Caps sequences with those initials will be given the following features:
Acronym         // e.g. FCC
AcronymBegin    // e.g. Federal
AcronymContinue // e.g. Communications
AcronymEnd      // e.g. Commission
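
Two of the global features above, otherCapitalized and the acronym features, can be sketched over a whole document as follows. This is an illustration only: the function names are hypothetical, and the sketch assumes an acronym is an all-uppercase alphabetic token of at least two letters.

```python
def other_capitalized(sentences):
    """Words seen capitalized somewhere other than sentence-initial position."""
    seen = set()
    for sent in sentences:
        for tok in sent[1:]:
            if tok[:1].isupper():
                seen.add(tok)
    return seen

def acronym_features(tokens):
    """Return one feature set per token of the document."""
    feats = [set() for _ in tokens]
    acronyms = {t for t in tokens
                if len(t) > 1 and t.isalpha() and t.isupper()}
    for i, tok in enumerate(tokens):
        if tok in acronyms:
            feats[i].add("Acronym")
    for acr in acronyms:
        n = len(acr)
        # Slide a window of n tokens and compare initials with the acronym.
        for start in range(len(tokens) - n + 1):
            window = tokens[start:start + n]
            if all(w[:1] == ch for w, ch in zip(window, acr)):
                feats[start].add("AcronymBegin")
                for j in range(start + 1, start + n - 1):
                    feats[j].add("AcronymContinue")
                feats[start + n - 1].add("AcronymEnd")
    return feats

tokens = ["The", "Federal", "Communications", "Commission", "(", "FCC", ")"]
feats = acronym_features(tokens)
caps = other_capitalized([["Apple", "rose"], ["Shares", "of", "Apple", "fell"]])
```

A token would receive otherCapitalized when it belongs to the set returned by other_capitalized, which is useful for sentence-initial words whose capitalization is otherwise uninformative.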

References

  1. Chieu, H.L. and Ng, H.T. (2003). Named Entity Recognition with a Maximum Entropy Approach. In: Proceedings of CoNLL-2003, Edmonton, Canada, pp. 160-163.