A Named Entity tagger prototype is available in the directory 'NER'.

NER API

The NER functionality is provided through the following classes:

class NER : public
   IPipe<Enumerator<vector<Token*>*>, Enumerator<vector<Token*>*> >
{
public:
   NER(char const* modelFile, char const* configFile = 0);

   void train(SentenceReader* sentenceReader);

   Enumerator<vector<Token*>*> pipe(Enumerator<vector<Token*>*>& sen);
};

As usual, the NER can be used from Python as part of a pipeline:

from splitter.SentenceSplitter import *
from splitter.Tokenizer import *
from tag.PosTagger import *
from tag.NER import *

ssp = SentenceSplitter('en.ss').pipe('infile')
tokp = Tokenizer('en.tok').pipe(ssp)
posp = PosTagger('en.pos').pipe(tokp)
nerp = NER('en.ner').pipe(posp)

for tok in nerp:
   print(tok)

NER Pipe

The Tanl pipeline interface is available for the NER. Here is an example of use in Python, training a model:

import NER
c = NER.Corpus('it', 'conll03.fmt')
sr = c.sentenceReader('ner.train')
ner = NER.NER(None)
ner.train(sr, 'ner.me')

or tagging a document:

import NER
c = NER.Corpus('it', 'conll03.fmt')
sr = c.sentenceReader('ner.test')
ner = NER.NER('ner.me')
np = ner.pipe(sr)
for x in np:
   print(x)

NER Design

The tagger is inspired by the design of Chieu & Ng (2003), which achieved the second-best score (the best among non-combination taggers) at the CoNLL 2003 Shared Task.

The NER is based on a Maximum Entropy classifier, and uses two types of features: local and global.

The following are the local features, extracted from contiguous input tokens:

Lexical Features

 AllAlpha
 AllDigits
 AllQuoting
 AllUpper
 Capitalized
 ContainsDOT
 ContainsComma
 ContainsSlash
 ContainsDash
 ContainsDigit
 ContainsDollar
 firstWordCap   // first word of sentence is capitalized
 firstWordNoCap // first word but not capitalized
 IsYear
 HyphenCapCap   // Str1-Str2
 HyphenNoCapCap // str1-Str2
 HyphenCapNoCap // Str1-str2
 MixedCase
 NoLetter
 SingleChar
 SingleS  // 's
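
As an illustration, a few of these lexical predicates could be sketched as simple string and regular-expression tests. This is a hypothetical Python sketch: the feature names follow the list above, but the exact rules used by the NER are assumptions.

```python
import re

def lexical_features(word, first_in_sentence=False):
    """Illustrative subset of the lexical features listed above."""
    feats = set()
    if word.isalpha():
        feats.add('AllAlpha')
    if word.isdigit():
        feats.add('AllDigits')
    if word.isalpha() and word.isupper():
        feats.add('AllUpper')
    if word[:1].isupper() and word[1:].islower():
        feats.add('Capitalized')
    if re.fullmatch(r'(1[0-9]|20)\d\d', word):       # e.g. 1999, 2008
        feats.add('IsYear')
    if re.fullmatch(r'[A-Z]\w*-[A-Z]\w*', word):     # Str1-Str2
        feats.add('HyphenCapCap')
    if first_in_sentence:
        feats.add('firstWordCap' if word[:1].isupper() else 'firstWordNoCap')
    if len(word) == 1:
        feats.add('SingleChar')
    return feats
```

Each predicate fires independently, so a single token may carry several lexical features at once.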

Token category features

UNKNOWN
PERSON
ORGANIZATION
LOCATION
PRODUCT
MONEY
NUMBER
MEASURE
DURATION
DATE
TIME
QUANTITY
URL
EMAIL

Local Features

 prevCap        // previous is capitalized
 nextCap        // next is capitalized
 seqCap         // previous, current and next are capitalized
 seqBreakCap    // previous and next are capitalized, current is not
 CapNext        // current and next are capitalized
 noCapNext      // current is capitalized but next is not
 CapPrev        // current and previous are capitalized
 noCapPrev      // current is capitalized but previous is not

 withinQuotes   // word is in a sequence within quotes
 rare           // word not present in FWL

 bigramLoc      // w-2, w-1 appear in CPB list for Locations
 bigramOrg      // w-2, w-1 appear in CPB list for Organizations
 bigramPers     // w-2, w-1 appear in CPB list for Persons

 suffixLoc      // 3-letter suffix present in SUF list for Locations
 suffixOrg      // 3-letter suffix present in SUF list for Organizations
 suffixPers     // 3-letter suffix present in SUF list for Persons
 suffixProd     // 3-letter suffix present in SUF list for Products

 lastLoc        // present in List of Location Last words
 lastOrg        // present in List of Organization Last words
 lastPers       // present in List of Person Last words
 lastProd       // present in List of Product Last words

 lowerSeqLoc    // word in List NLW appears in sequence of Caps
 lowerSeqOrg    // word in List NLW appears in sequence of Caps
 lowerSeqPers   // word in List NLW appears in sequence of Caps

 lowerLoc       // word appearing in list Lower Location Words
 lowerOrg       // word appearing in list Lower Organization Words
 lowerPers      // word appearing in list Lower Person Words

POS Tags

POS            // multiple values: one for each POS
prevPOS        // similarly
nextPOS        // similarly
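
The contextual capitalization features above can be sketched as follows. This is illustrative Python; `cap_context_features` and its exact logic are assumptions, not the actual implementation.

```python
def cap_context_features(words, i):
    """Capitalization features of words[i] relative to its neighbours."""
    def cap(w):
        return w is not None and w[:1].isupper()

    prev = words[i - 1] if i > 0 else None
    nxt = words[i + 1] if i + 1 < len(words) else None
    cur = words[i]
    feats = set()
    if cap(prev):
        feats.add('prevCap')
    if cap(nxt):
        feats.add('nextCap')
    if cap(prev) and cap(nxt):
        # the whole window is capitalized, or the current token breaks it
        feats.add('seqCap' if cap(cur) else 'seqBreakCap')
    if cap(cur) and cap(nxt):
        feats.add('CapNext')
    if cap(cur) and not cap(nxt):
        feats.add('noCapNext')
    if cap(cur) and cap(prev):
        feats.add('CapPrev')
    if cap(cur) and not cap(prev):
        feats.add('noCapPrev')
    return feats
```

For example, the middle token of "New York City" would receive seqCap, while "of" in "Bank of America" would receive seqBreakCap.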

The token category features are extracted from tokens by means of language-specific regular expressions.
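
For instance, a hypothetical set of English patterns might look like the following; the actual Tanl expressions are language specific and configured elsewhere, so these patterns are illustrative assumptions.

```python
import re

# Patterns are tried in order; the first match determines the category.
CATEGORY_PATTERNS = [
    ('MONEY',  re.compile(r'\$\d[\d,]*(\.\d+)?$')),
    ('TIME',   re.compile(r'\d{1,2}:\d{2}(:\d{2})?$')),
    ('DATE',   re.compile(r'\d{1,2}/\d{1,2}/\d{2,4}$')),
    ('URL',    re.compile(r'(https?://|www\.)\S+$')),
    ('EMAIL',  re.compile(r'[\w.+-]+@[\w-]+(\.[\w-]+)+$')),
    ('NUMBER', re.compile(r'-?\d[\d,]*(\.\d+)?$')),
]

def token_category(token):
    for cat, pat in CATEGORY_PATTERNS:
        if pat.match(token):
            return cat
    return 'UNKNOWN'
```

Note that the more specific patterns (MONEY, DATE) are tried before the generic NUMBER pattern.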

For some of the above features, the NER uses the following dictionaries:

  1. companies
  2. money
  3. names
  4. person
  5. time
  6. product
  7. location

in order to compute the following features:

Company
Location
Money
Name   // first name
PrevName
Person
NextPerson
Product
Time
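
A minimal sketch of how such dictionary lookups could yield features follows; the gazetteer contents and the `GAZETTEERS` / `dictionary_features` names are illustrative assumptions, not the actual dictionaries.

```python
# Toy gazetteers standing in for the dictionaries listed above.
GAZETTEERS = {
    'Company':  {'Acme', 'Fiat'},
    'Location': {'Rome', 'Paris'},
    'Name':     {'John', 'Maria'},     # first names
    'Person':   {'Smith', 'Rossi'},
}

def dictionary_features(words, i):
    feats = set()
    # features for the current token itself
    for feat, entries in GAZETTEERS.items():
        if words[i] in entries:
            feats.add(feat)
    # PrevName / NextPerson look at the neighbouring tokens
    if i > 0 and words[i - 1] in GAZETTEERS['Name']:
        feats.add('PrevName')
    if i + 1 < len(words) and words[i + 1] in GAZETTEERS['Person']:
        feats.add('NextPerson')
    return feats
```

In "John Smith", for example, the second token gets both Person (from the person dictionary) and PrevName (because the preceding token is a first name).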

The following lists are created during training:

FWL (Frequent Word List): words that occur in more than 5 documents

CPW (Common Preceding Words): 20 words that most often precede
   a certain class

CPB (Common Preceding Bigrams): bigrams that often precede
   a certain class

SUF (Suffix for Class): common 3-4 letter suffix for a certain class
   (-ian, -ish)

NLW (Name Last Words): list of words terminating a Name sequence
   Organization: Inc, Org, Co
   Locations: center, museum, square, street
   Person: Jr, II, III

LNW (Lowercase Name Words): list of lowercase words appearing in a
   Name sequence
   Organization: al, in, zonder, vor, for
   Person: "van der", "de", "of"
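
The first two lists can be sketched as simple counting over the training corpus. This is illustrative Python: `build_fwl` and `build_cpw` are assumed names, and the corpus representation (documents as word lists, sentences as (word, class) pairs) is an assumption.

```python
from collections import Counter

def build_fwl(documents, min_docs=5):
    """FWL: words occurring in more than min_docs documents."""
    df = Counter()
    for doc in documents:
        for w in set(doc):          # count each word once per document
            df[w] += 1
    return {w for w, n in df.items() if n > min_docs}

def build_cpw(tagged_sentences, top=20):
    """CPW: per entity class, the `top` words most often preceding it."""
    prec = {}
    for sent in tagged_sentences:
        for i in range(1, len(sent)):
            word, _ = sent[i - 1]
            _, cls = sent[i]
            if cls != 'O':
                prec.setdefault(cls, Counter())[word] += 1
    return {cls: [w for w, _ in c.most_common(top)]
            for cls, c in prec.items()}
```

CPB would be built analogously from the two preceding words, and SUF from suffix counts per class.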

Global document features

otherPrevLoc    // another occurrence of current word
                // was preceded by a CPW for class Location
otherPrevOrg    // another occurrence of current word
                // was preceded by a CPW for class Organization
otherPrevPers   // another occurrence of current word
                // was preceded by a CPW for class Person

otherPrevBiLoc  // another occurrence of current word
                // was preceded by a bigram for class Location
otherPrevBiOrg  // another occurrence of current word
                // was preceded by a bigram for class Organization
otherPrevBiPers // another occurrence of current word
                // was preceded by a bigram for class Person

otherLastLoc    // another occurrence of current word
                // had feature lastLoc
otherLastOrg    // another occurrence of current word
                // had feature lastOrg
otherLastPers   // another occurrence of current word
                // had feature lastPers

otherCapitalized // an occurrence not as first word had
                 // feature Capitalized

  // Acronyms
  // An AllUpper word is stored as an acronym.
  // Caps sequences with those initials will be given the following features:
Acronym         // e.g. FCC
AcronymBegin    // e.g. Federal
AcronymContinue // e.g. Communication
AcronymEnd      // e.g. Committee
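
The acronym matching described above might be sketched as follows; the matching rules here are assumptions based on the description, not the actual implementation.

```python
def acronym_features(words):
    """All-uppercase tokens are stored as acronyms; a capitalized
    sequence whose initials spell an acronym gets Begin/Continue/End."""
    acronyms = {w for w in words
                if w.isupper() and w.isalpha() and len(w) > 1}
    feats = [set() for _ in words]
    for i, w in enumerate(words):
        if w in acronyms:
            feats[i].add('Acronym')
    for acr in acronyms:
        n = len(acr)
        for start in range(len(words) - n + 1):
            span = words[start:start + n]
            # capitalized (but not all-uppercase) words whose initials
            # spell the acronym
            if (all(w[:1].isupper() and not w.isupper() for w in span)
                    and ''.join(w[0] for w in span) == acr):
                feats[start].add('AcronymBegin')
                for j in range(start + 1, start + n - 1):
                    feats[j].add('AcronymContinue')
                feats[start + n - 1].add('AcronymEnd')
    return feats
```

Given "The Federal Communication Committee or FCC", the capitalized sequence matching the initials of FCC receives AcronymBegin, AcronymContinue and AcronymEnd, while FCC itself receives Acronym.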

References

  1. Chieu, H.L. and Ng, H.T. (2003). Named Entity Recognition with a Maximum Entropy Approach. In: Proceedings of CoNLL-2003, Edmonton, Canada, 2003, pp. 160-163.
  2. Chieu, H.L. and Ng, H.T. (2002). Named Entity Recognition: A Maximum Entropy Approach Using Global Information. In: Proceedings of COLING 2002.