A Named Entity tagger prototype is available in directory 'NER'.

The tagger is inspired by the design of Chieu and Ng (2003), which achieved the second-best score (and the best among non-combination taggers) at the CoNLL 2003 Shared Task.

The NER is based on a Maximum Entropy classifier, and uses two types of features: local and global.
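
As a rough illustration of how a Maximum Entropy classifier turns binary features into a tag decision, the sketch below computes P(tag | features) as a normalized exponential of summed feature weights. The function name, the weight table and the numbers are hypothetical and are not taken from the prototype.

```python
import math

def maxent_probs(active_features, weights, tags):
    """Return P(tag | features) under a log-linear (maximum entropy) model.

    weights maps (feature, tag) pairs to real values; missing pairs count as 0.
    """
    scores = {t: sum(weights.get((f, t), 0.0) for f in active_features)
              for t in tags}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exps.values())      # normalization constant
    return {t: e / z for t, e in exps.items()}

# Hypothetical weights, invented for illustration (not learned from data).
WEIGHTS = {
    ("Capitalized", "PERSON"): 1.2,
    ("prevCap", "PERSON"): 0.8,
    ("Capitalized", "UNKNOWN"): -0.5,
    ("AllDigits", "NUMBER"): 2.0,
}
TAGS = ["UNKNOWN", "PERSON", "NUMBER"]

probs = maxent_probs({"Capitalized", "prevCap"}, WEIGHTS, TAGS)
best_tag = max(probs, key=probs.get)  # "PERSON" with these toy weights
```

In a trained model the weights would be estimated from annotated data; here they only show how active features combine into a score per tag.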

The following are the local features, extracted from contiguous input tokens:

 // Lexical Features
 AllAlpha
 AllDigits
 AllQuoting
 AllUpper
 Capitalized
 ContainsDOT
 ContainsComma
 ContainsSlash
 ContainsDash
 ContainsDigit
 ContainsDollar
 firstWordCap   // first word of sentence is capitalized
 firstWordNoCap // first word but not capitalized
 IsYear
 HyphenCapCap   // Str1-Str2
 HyphenNoCapCap // str1-Str2
 HyphenCapNoCap // Str1-str2
 MixedCase
 NoLetter
 SingleChar
 SingleS  // 's

 // Local Features
 prevCap        // previous is capitalized
 nextCap        // next is capitalized
 seqCap // previous and next are Capitalized, current also
 seqBreakCap    // previous and next are Capitalized, current not
 CapNext        // current and next are Capitalized
 noCapNext      // current is Capitalized but next is not
 CapPrev        // current and previous are Capitalized
 noCapPrev      // current is Capitalized but previous is not

 withinQuotes   // word is in sequence within quotes
 rare   // word not present in FWL

 bigramLoc      // w-2, w-1 appear in CPB list for Locations
 bigramOrg      // w-2, w-1 appear in CPB list for Organizations
 bigramPers     // w-2, w-1 appear in CPB list for Persons

 suffixLoc      // 3-letter suffix present in SUF list for Locations
 suffixOrg      // 3-letter suffix present in SUF list for Organizations
 suffixPers     // 3-letter suffix present in SUF list for Persons
 suffixProd     // 3-letter suffix present in SUF list for Product

 lastLoc        // present in List of Location Last words
 lastOrg        // present in List of Organization Last words
 lastPers       // present in List of Person Last words
 lastProd       // present in List of Product Last words

 lowerSeqLoc    // word in List LNW appears in sequence of Caps
 lowerSeqOrg    // word in List LNW appears in sequence of Caps
 lowerSeqPers   // word in List LNW appears in sequence of Caps

 lowerLoc       // word appearing in list Lower Location Words
 lowerOrg       // word appearing in list Lower Organization Words
 lowerPers      // word appearing in list Lower Person Words

 // POS Tags
 POS            // multiple values: one for each POS
 prevPOS        // similarly
 nextPOS        // similarly
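
A minimal sketch of how a few of the lexical and capitalization features listed above could be computed for a token. The exact definitions used by the prototype are not given here, so the tests below (for example, treating "Capitalized" as "first character is uppercase" and "IsYear" as a four-digit number from 1000 to 2099) are assumptions.

```python
import re

def lexical_features(token):
    """Return the lexical feature names that fire for a single token."""
    feats = set()
    if token.isalpha():
        feats.add("AllAlpha")
    if token.isdigit():
        feats.add("AllDigits")
    if token.isalpha() and token.isupper():
        feats.add("AllUpper")
    if token[:1].isupper():          # assumed meaning of "Capitalized"
        feats.add("Capitalized")
    if "." in token:
        feats.add("ContainsDOT")
    if "-" in token:
        feats.add("ContainsDash")
    if any(ch.isdigit() for ch in token):
        feats.add("ContainsDigit")
    if re.fullmatch(r"(1[0-9]|20)[0-9]{2}", token):  # assumed year range
        feats.add("IsYear")
    if len(token) == 1:
        feats.add("SingleChar")
    return feats

def context_features(tokens, i):
    """Return capitalization features that depend on the neighbouring tokens."""
    def is_cap(j):
        return 0 <= j < len(tokens) and tokens[j][:1].isupper()

    cur, prev, nxt = is_cap(i), is_cap(i - 1), is_cap(i + 1)
    feats = set()
    if prev:
        feats.add("prevCap")
    if nxt:
        feats.add("nextCap")
    if prev and nxt:
        feats.add("seqCap" if cur else "seqBreakCap")
    if cur:
        feats.add("CapNext" if nxt else "noCapNext")
        feats.add("CapPrev" if prev else "noCapPrev")
    return feats

tokens = ["Bank", "of", "America"]
print(sorted(context_features(tokens, 1)))  # ['nextCap', 'prevCap', 'seqBreakCap']
print(sorted(lexical_features("1999")))     # ['AllDigits', 'ContainsDigit', 'IsYear']
```

Each token's feature set would be the union of both functions' results, together with the list-based and POS features.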

For some of the above features, the NER uses the following dictionaries:

  1. companies
  2. money
  3. names
  4. person
  5. time
  6. product
  7. location

in order to compute the following features:

Company
Location
Money
Name   // first name
PrevName
Person
NextPerson
Product
Time
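
The dictionary features can be pictured as simple set lookups. The sketch below assumes that lookups are case-insensitive, that PrevName fires when the previous token is a first name, and that NextPerson fires when the next token is in the person dictionary. These readings, and the tiny dictionaries themselves, are illustrative guesses rather than the prototype's actual data.

```python
# Tiny stand-ins for the dictionary files (hypothetical contents).
DICTIONARIES = {
    "Company": {"ibm", "fiat"},
    "Location": {"paris", "pisa"},
    "Name": {"john", "maria"},      # first names
    "Person": {"smith", "rossi"},
}

def dictionary_features(tokens, i):
    """Return the dictionary-based features for the token at position i."""
    feats = set()
    word = tokens[i].lower()
    for feature, entries in DICTIONARIES.items():
        if word in entries:
            feats.add(feature)
    # Assumed reading of the two context-dependent features.
    if i > 0 and tokens[i - 1].lower() in DICTIONARIES["Name"]:
        feats.add("PrevName")
    if i + 1 < len(tokens) and tokens[i + 1].lower() in DICTIONARIES["Person"]:
        feats.add("NextPerson")
    return feats

tokens = ["John", "Smith", "visited", "Pisa"]
for i, tok in enumerate(tokens):
    print(tok, sorted(dictionary_features(tokens, i)))
```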

The following lists are created during training:

FWL (Frequent Word List): words that occur in more than 5 documents

CPW (Common Preceding Words): 20 words that most often precede
   a certain class

CPB (Common Preceding Bigrams): bigrams that often precede
   a certain class

SUF (Suffix for Class): common 3-4 letter suffix for a certain class
   (-ian, -ish)

NLW (Name Last Words): list of words terminating a Name sequence
   Organization: Inc, Org, Co
   Locations: center, museum, square, street
   Person: Jr, II, III

LNW (Lowercase Name Words): list of lowercase words appearing in a
   Name sequence
   Organization: al, in, zonder, vor, for
   Person: "van der", "de", "of"
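
To make the first two lists concrete, here is a sketch of how FWL and CPW might be collected from training data. It assumes each document is a list of (token, tag) pairs, ignores sentence boundaries, and counts a preceding word only at the start of an entity; these choices and the function names are assumptions, not the prototype's documented behaviour.

```python
from collections import Counter

def frequent_word_list(documents, min_docs=5):
    """Words that occur in more than min_docs documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each word at most once per document.
        doc_freq.update({tok for tok, _ in doc})
    return {w for w, n in doc_freq.items() if n > min_docs}

def common_preceding_words(documents, target_tag, top_n=20):
    """The top_n words that most often immediately precede target_tag."""
    counts = Counter()
    for doc in documents:
        for (prev_tok, prev_tag), (_, tag) in zip(doc, doc[1:]):
            # Only count words just before the start of an entity.
            if tag == target_tag and prev_tag != target_tag:
                counts[prev_tok] += 1
    return [w for w, _ in counts.most_common(top_n)]

# Toy corpus: six identical documents plus one different document.
docs = [[("in", "UNKNOWN"), ("Paris", "LOCATION")]] * 6
docs.append([("Mr", "UNKNOWN"), ("Smith", "PERSON")])

fwl = frequent_word_list(docs)                       # {"in", "Paris"}
cpw_loc = common_preceding_words(docs, "LOCATION")   # ["in"]
```

The CPB list could be built the same way by counting pairs of preceding tokens instead of single tokens.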

The following are the global document features:

otherPrevLoc    // another occurrence of current word
                // was preceded by a CPW for class Location
otherPrevOrg    // another occurrence of current word
                // was preceded by a CPW for class Organization
otherPrevPers   // another occurrence of current word
                // was preceded by a CPW for class Person

otherPrevBiLoc  // another occurrence of current word
                // was preceded by a bigram for class Location
otherPrevBiOrg  // another occurrence of current word
                // was preceded by a bigram for class Organization
otherPrevBiPers // another occurrence of current word
                // was preceded by a bigram for class Person

otherLastLoc    // another occurrence of current word
                // had feature lastLoc
otherLastOrg    // another occurrence of current word
                // had feature lastOrg
otherLastPers   // another occurrence of current word
                // had feature lastPers

otherCapitalized        // an occurrence not as first word had
                        // feature Capitalized

  // Acronyms
  // An AllUpper word is stored as an acronym.
  // Caps sequences with those initials will be given the following features:
Acronym         // e.g. FCC
AcronymBegin    // e.g. Federal
AcronymContinue // e.g. Communications
AcronymEnd      // e.g. Commission
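
The acronym features can be sketched as follows: collect the all-uppercase words of a document, then look for runs of words whose initials spell one of them. The minimum acronym length of two letters and the function name are assumptions made for this example.

```python
def acronym_features(tokens):
    """Map token index -> acronym feature for one document."""
    # An all-uppercase alphabetic word (2+ letters, assumed) is an acronym.
    acronyms = {t for t in tokens
                if len(t) > 1 and t.isalpha() and t.isupper()}
    feats = {i: "Acronym" for i, t in enumerate(tokens) if t in acronyms}

    for acro in acronyms:
        n = len(acro)
        for start in range(len(tokens) - n + 1):
            window = tokens[start:start + n]
            # Every word must start with the matching uppercase initial
            # and must not itself be an acronym.
            if all(w[:1] == ch and w not in acronyms
                   for w, ch in zip(window, acro)):
                feats[start] = "AcronymBegin"
                for k in range(start + 1, start + n - 1):
                    feats[k] = "AcronymContinue"
                feats[start + n - 1] = "AcronymEnd"
    return feats

text = "The Federal Communications Commission ( FCC ) said"
feats = acronym_features(text.split())
# {5: 'Acronym', 1: 'AcronymBegin', 2: 'AcronymContinue', 3: 'AcronymEnd'}
```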

The tagger is trained to learn the following tags:

UNKNOWN
PERSON
ORGANIZATION
LOCATION
PRODUCT
MONEY
NUMBER
MEASURE
DURATION
DATE
TIME
QUANTITY
URL
EMAIL

some of which are recognized by means of suitable regular expressions.
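
Which tags are handled by regular expressions is not specified here. As a hedged illustration only, the sketch below assigns EMAIL, URL, TIME and NUMBER with simplified patterns; both the choice of tags and the patterns are assumptions.

```python
import re

# Illustrative patterns only; not the prototype's actual expressions.
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")),
    ("URL", re.compile(r"(https?://|www\.)\S+")),
    ("TIME", re.compile(r"([01]?\d|2[0-3]):[0-5]\d")),
    ("NUMBER", re.compile(r"[+-]?\d+([.,]\d+)*")),
]

def regex_tag(token):
    """Return the first tag whose pattern matches the whole token, else None."""
    for tag, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return tag
    return None

for tok in ["info@example.org", "http://example.org/ner", "12:30", "1,250.75", "Paris"]:
    print(tok, regex_tag(tok))
```

Tokens that match no pattern return None and would be left to the classifier.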

References

Chieu, H.L. and Ng, H.T. (2003). Named Entity Recognition with a Maximum Entropy Approach. In: Proceedings of CoNLL-2003, Edmonton, Canada, pp. 160-163.
