A Named Entity tagger prototype is available in directory 'NER'.
The tagger is inspired by the design of Chieu (2003), which achieved the second-best score (and the best among non-combination taggers) at the CoNLL 2003 Shared Task.
The tagger is based on a Maximum Entropy classifier and uses two types of features: local and global.
The following are the local features, extracted from contiguous input tokens:
// Lexical Features
AllAlpha
AllDigits
AllQuoting
AllUpper
Capitalized
ContainsDOT
ContainsComma
ContainsSlash
ContainsDash
ContainsDigit
ContainsDollar
firstWordCap      // first word of sentence is capitalized
firstWordNoCap    // first word but not capitalized
IsYear
HyphenCapCap      // Str1-Str2
HyphenNoCapCap    // str1-Str2
HyphenCapNoCap    // Str1-str2
MixedCase
NoLetter
SingleChar
SingleS           // 's

// Local Features
prevCap           // previous is capitalized
nextCap           // next is capitalized
seqCap            // previous and next are Capitalized, current also
seqBreakCap       // previous and next are Capitalized, current not
CapNext           // current and next are Capitalized
noCapNext         // current is Capitalized but next is not
CapPrev           // current and previous are Capitalized
noCapPrev         // current is Capitalized but previous is not
withinQuotes      // word is in sequence within quotes
rare              // word not present in FWL
bigramLoc         // w-2, w-1 appear in CPB list for Locations
bigramOrg         // w-2, w-1 appear in CPB list for Organizations
bigramPers        // w-2, w-1 appear in CPB list for Persons
suffixLoc         // 3-letter suffix present in SUF list for Locations
suffixOrg         // 3-letter suffix present in SUF list for Organizations
suffixPers        // 3-letter suffix present in SUF list for Persons
suffixProd        // 3-letter suffix present in SUF list for Products
lastLoc           // present in List of Location Last words
lastOrg           // present in List of Organization Last words
lastPers          // present in List of Person Last words
lastProd          // present in List of Product Last words
lowerSeqLoc       // word in List NLW appears in sequence of Caps
lowerSeqOrg       // word in List NLW appears in sequence of Caps
lowerSeqPers      // word in List NLW appears in sequence of Caps
lowerLoc          // word appearing in list Lower Location Words
lowerOrg          // word appearing in list Lower Organization Words
lowerPers         // word appearing in list Lower Person Words

// POS Tags
POS               // multiple values: one for each POS
prevPOS           // similarly
nextPOS           // similarly
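As an illustration, a few of the lexical and capitalization-context features listed above could be computed along the following lines. This is only a sketch, not the prototype's actual code: the function names `lexical_features` and `context_features`, the year range accepted by `IsYear`, and the exact definition of `MixedCase` are assumptions made for the example.

```python
import re

# Assumption: a "year" is a four-digit number between 1000 and 2099.
YEAR_RE = re.compile(r"(1\d|20)\d{2}")
HYPHEN_RE = re.compile(r"([^-]+)-([^-]+)")


def lexical_features(token, is_first=False):
    """Return the set of lexical feature names that fire for a single token."""
    feats = set()
    capitalized = token[:1].isupper()
    if token.isalpha():
        feats.add("AllAlpha")
    if token.isdigit():
        feats.add("AllDigits")
    if token.isupper():
        feats.add("AllUpper")
    if capitalized:
        feats.add("Capitalized")
    if "." in token:
        feats.add("ContainsDOT")
    if "-" in token:
        feats.add("ContainsDash")
    if any(c.isdigit() for c in token):
        feats.add("ContainsDigit")
    if not any(c.isalpha() for c in token):
        feats.add("NoLetter")
    if YEAR_RE.fullmatch(token):
        feats.add("IsYear")
    if len(token) == 1:
        feats.add("SingleChar")
    if token == "'s":
        feats.add("SingleS")
    # Assumption: MixedCase means an uppercase letter after the first
    # character together with at least one lowercase letter.
    if any(c.isupper() for c in token[1:]) and any(c.islower() for c in token):
        feats.add("MixedCase")
    if is_first:
        feats.add("firstWordCap" if capitalized else "firstWordNoCap")
    m = HYPHEN_RE.fullmatch(token)
    if m:
        left_cap = m.group(1)[:1].isupper()
        right_cap = m.group(2)[:1].isupper()
        if left_cap and right_cap:
            feats.add("HyphenCapCap")      # Str1-Str2
        elif right_cap:
            feats.add("HyphenNoCapCap")    # str1-Str2
        elif left_cap:
            feats.add("HyphenCapNoCap")    # Str1-str2
    return feats


def context_features(tokens, i):
    """Return capitalization features relating token i to its neighbours."""
    def cap(j):
        return 0 <= j < len(tokens) and tokens[j][:1].isupper()

    cur, prev, nxt = cap(i), cap(i - 1), cap(i + 1)
    feats = set()
    if prev:
        feats.add("prevCap")
    if nxt:
        feats.add("nextCap")
    if prev and nxt:
        feats.add("seqCap" if cur else "seqBreakCap")
    if cur:
        feats.add("CapNext" if nxt else "noCapNext")
        feats.add("CapPrev" if prev else "noCapPrev")
    return feats
```

For example, `context_features(["Bank", "of", "America"], 1)` yields `prevCap`, `nextCap` and `seqBreakCap` for the lowercase word "of".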
For some of the above features, the NER uses the following dictionaries:
in order to compute the following features:
Company
Location
Money
Name              // first name
PrevName
Person
NextPerson
Product
Time
The following lists are created during training:
FWL (Frequent Word List): words that occur in more than 5 documents
CPW (Common Preceding Words): 20 words that most often precede a certain class
CPB (Common Preceding Bigrams): bigrams that often precede a certain class
SUF (Suffix for Class): common 3-4 letter suffix for a certain class (-ian, -ish)
NLW (Name Last Words): list of words terminating a Name sequence
    Organization: Inc, Org, Co
    Locations: center, museum, square, street
    Person: Jr, II, III
LNW (Lowercase Name Words): list of lowercase words appearing in a Name sequence
    Organization: al, in, zonder, vor, for
    Person: "van der", "de", "of"
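To make the definitions of FWL and CPW concrete, the sketch below shows one way such lists could be collected from a tagged training corpus. It is not taken from the prototype: the function names `build_fwl` and `build_cpw`, the input layout (one list of tokens or of `(token, tag)` pairs per document), the lowercasing of preceding words, and the use of `UNKNOWN` as the tag for non-entity tokens are all assumptions.

```python
from collections import Counter, defaultdict

OUTSIDE = "UNKNOWN"  # assumption: tag carried by tokens outside any entity


def build_fwl(documents, min_docs=5):
    """FWL: words that occur in more than `min_docs` documents.

    `documents` is a list of token lists, one per document.
    """
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    return {word for word, n in doc_freq.items() if n > min_docs}


def build_cpw(tagged_docs, top_n=20):
    """CPW: for each class, the `top_n` words that most often precede it.

    `tagged_docs` is a list of documents, each a list of (token, tag) pairs.
    A word counts as "preceding" a class when the next token starts an
    entity of that class, i.e. its tag differs from the current token's tag.
    """
    counts = defaultdict(Counter)
    for doc in tagged_docs:
        for (prev_word, prev_tag), (_, tag) in zip(doc, doc[1:]):
            if tag != OUTSIDE and tag != prev_tag:
                counts[tag][prev_word.lower()] += 1
    return {tag: [w for w, _ in c.most_common(top_n)]
            for tag, c in counts.items()}
```

The CPB and SUF lists could be gathered in the same way, counting preceding bigrams or token suffixes instead of single preceding words.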
The following are the global document features:
otherPrevLoc      // another occurrence of current word
                  // was preceded by a CPW for class Location
otherPrevOrg      // another occurrence of current word
                  // was preceded by a CPW for class Organization
otherPrevPers     // another occurrence of current word
                  // was preceded by a CPW for class Person
otherPrevBiLoc    // another occurrence of current word
                  // was preceded by a bigram for class Location
otherPrevBiOrg    // another occurrence of current word
                  // was preceded by a bigram for class Organization
otherPrevBiPers   // another occurrence of current word
                  // was preceded by a bigram for class Person
otherLastLoc      // another occurrence of current word
                  // had feature lastLoc
otherLastOrg      // another occurrence of current word
                  // had feature lastOrg
otherLastPers     // another occurrence of current word
                  // had feature lastPers
otherCapitalized  // an occurrence not as first word had
                  // feature Capitalized

// Acronyms
// An AllUpper word is stored as an acronym.
// Caps sequences with those initials will be given the following features:
Acronym           // e.g. FCC
AcronymBegin      // e.g. Federal
AcronymContinue   // e.g. Communication
AcronymEnd        // e.g. Committee
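The acronym features can be illustrated with a short sketch. It assumes the whole document is available as a flat token list and that an acronym is any all-uppercase alphabetic token of at least two letters; the function name `acronym_features` and these matching rules are assumptions for the example, not the prototype's code.

```python
def acronym_features(tokens):
    """Return one feature set per token with the document-level acronym features."""
    # Assumption: an acronym is an all-uppercase alphabetic token, length >= 2.
    acronyms = {t for t in tokens if len(t) > 1 and t.isalpha() and t.isupper()}
    feats = [set() for _ in tokens]

    for i, tok in enumerate(tokens):
        if tok in acronyms:
            feats[i].add("Acronym")

    for acro in acronyms:
        n = len(acro)
        # Slide a window of n tokens and compare initials with the acronym.
        for start in range(len(tokens) - n + 1):
            window = tokens[start:start + n]
            if any(w in acronyms for w in window):
                continue  # do not expand an acronym from other acronyms
            if all(w[:1] == c for w, c in zip(window, acro)):
                feats[start].add("AcronymBegin")
                feats[start + n - 1].add("AcronymEnd")
                for k in range(start + 1, start + n - 1):
                    feats[k].add("AcronymContinue")
    return feats
```

With the tokens of "The Federal Communication Committee ( FCC ) said", the token "FCC" receives `Acronym`, while "Federal", "Communication" and "Committee" receive `AcronymBegin`, `AcronymContinue` and `AcronymEnd` respectively.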
The tagger is trained to learn the following tags:
UNKNOWN PERSON ORGANIZATION LOCATION PRODUCT MONEY NUMBER MEASURE DURATION DATE TIME QUANTITY URL EMAIL
some of which are recognized by means of suitable regular expressions.
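The text does not say which tags are handled by regular expressions or what the patterns look like. Purely as an illustration, the sketch below assumes that EMAIL, URL, MONEY and NUMBER are matched at the token level; the patterns are deliberately simplified and the names `PATTERNS` and `regex_tag` are hypothetical.

```python
import re

# Hypothetical, simplified patterns; checked in order, first match wins.
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")),
    ("URL", re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)),
    ("MONEY", re.compile(r"[$€£]\d+(,\d{3})*(\.\d+)?")),
    ("NUMBER", re.compile(r"[+-]?\d+(,\d{3})*(\.\d+)?")),
]


def regex_tag(token):
    """Return the tag of the first pattern matching the whole token, else None."""
    for tag, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return tag
    return None
```

Tokens for which `regex_tag` returns `None` would be left to the Maximum Entropy classifier.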
Chieu, HL (2003). Named Entity Recognition with a Maximum Entropy Approach. In: Proceedings of CoNLL-2003, Edmonton, Canada, 2003, pp. 160-163.