A Named Entity tagger prototype is available in directory 'NER'.
The tagger is inspired by the design of Chieu & Ng (2003), which achieved the second-best score (and the best among non-combination taggers) at the CoNLL 2003 Shared Task.
The NER is based on a Maximum Entropy classifier, and uses two types of features: local and global.
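As a rough illustration of how a Maximum Entropy classifier turns active features into tag probabilities, the sketch below scores each tag by summing the weights of the features that fire and then normalizes with a softmax. The function name, the weight table layout (keyed by feature and tag), and the toy weights are assumptions made for this example; they are not taken from the NER code.

```python
import math

def maxent_probabilities(active_features, weights, tags):
    """Return p(tag | features) under a log-linear (maximum entropy) model.

    `weights` maps (feature, tag) pairs to real-valued weights; this layout
    is an assumption for illustration. Missing pairs count as weight 0.
    """
    scores = {
        tag: sum(weights.get((feat, tag), 0.0) for feat in active_features)
        for tag in tags
    }
    # Subtract the maximum score before exponentiating, for numerical stability.
    top = max(scores.values())
    exps = {tag: math.exp(score - top) for tag, score in scores.items()}
    total = sum(exps.values())
    return {tag: value / total for tag, value in exps.items()}

# Toy weights, invented for the example.
toy_weights = {("Capitalized", "PERSON"): 1.0, ("prevCap", "PERSON"): 0.5}
probs = maxent_probabilities({"Capitalized", "prevCap"}, toy_weights,
                             ["PERSON", "UNKNOWN"])
best_tag = max(probs, key=probs.get)
```

In a trained model the weights would come from the training procedure; here they only show how feature evidence accumulates per tag.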
The following are the local features, extracted from contiguous input tokens:
    // Lexical Features
    AllAlpha
    AllDigits
    AllQuoting
    AllUpper
    Capitalized
    ContainsDOT
    ContainsComma
    ContainsSlash
    ContainsDash
    ContainsDigit
    ContainsDollar
    firstWordCap    // first word of sentence is capitalized
    firstWordNoCap  // first word but not capitalized
    IsYear
    HyphenCapCap    // Str1-Str2
    HyphenNoCapCap  // str1-Str2
    HyphenCapNoCap  // Str1-str2
    MixedCase
    NoLetter
    SingleChar
    SingleS         // 's

    // Local Features
    prevCap         // previous is capitalized
    nextCap         // next is capitalized
    seqCap          // previous and next are Capitalized, current also
    seqBreakCap     // previous and next are Capitalized, current not
    CapNext         // current and next are Capitalized
    noCapNext       // current is Capitalized but next is not
    CapPrev         // current and previous are Capitalized
    noCapPrev       // current is Capitalized but previous is not
    withinQuotes    // word is in sequence within quotes
    rare            // word not present in FWL
    bigramLoc       // w-2, w-1 appear in CPB list for Locations
    bigramOrg       // w-2, w-1 appear in CPB list for Organizations
    bigramPers      // w-2, w-1 appear in CPB list for Persons
    suffixLoc       // 3-letter suffix present in SUF list Locations
    suffixOrg       // 3-letter suffix present in SUF list Organizations
    suffixPers      // 3-letter suffix present in SUF list for Persons
    suffixProd      // 3-letter suffix present in SUF list for Product
    lastLoc         // present in List of Location Last words
    lastOrg         // present in List of Organization Last words
    lastPers        // present in List of Person Last words
    lastProd        // present in List of Product Last words
    lowerSeqLoc     // word in List NLW appears in sequence of Caps
    lowerSeqOrg     // word in List NLW appears in sequence of Caps
    lowerSeqPers    // word in List NLW appears in sequence of Caps
    lowerLoc        // word appearing in list Lower Location Words
    lowerOrg        // word appearing in list Lower Organization Words
    lowerPers       // word appearing in list Lower Person Words

    // POS Tags
    POS             // multiple values: one for each POS
    prevPOS         // similarly
    nextPOS         // similarly
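To make the feature definitions more concrete, here is a minimal sketch of how a few of the lexical and capitalization-context features could be computed for a token. The function names, the choice of features covered, and the four-digit rule used for IsYear are assumptions for illustration; the actual extraction code in the NER directory may differ.

```python
import re

def lexical_features(token, is_first):
    """Return the names of some lexical features that fire for one token."""
    feats = set()
    if token.isalpha():
        feats.add("AllAlpha")
    if token.isdigit():
        feats.add("AllDigits")
    if token.isalpha() and token.isupper():
        feats.add("AllUpper")
    if token[:1].isupper():
        feats.add("Capitalized")
    if "." in token:
        feats.add("ContainsDOT")
    if "-" in token:
        feats.add("ContainsDash")
    if any(ch.isdigit() for ch in token):
        feats.add("ContainsDigit")
    if is_first:
        feats.add("firstWordCap" if token[:1].isupper() else "firstWordNoCap")
    # Assumption: a year is a four-digit number starting with 1 or 2.
    if re.fullmatch(r"[12]\d{3}", token):
        feats.add("IsYear")
    # Hyphenated pairs such as Str1-Str2, str1-Str2, Str1-str2.
    pair = re.fullmatch(r"([A-Za-z]+)-([A-Za-z]+)", token)
    if pair:
        left_cap = pair.group(1)[0].isupper()
        right_cap = pair.group(2)[0].isupper()
        if left_cap and right_cap:
            feats.add("HyphenCapCap")
        elif right_cap:
            feats.add("HyphenNoCapCap")
        elif left_cap:
            feats.add("HyphenCapNoCap")
    if not any(ch.isalpha() for ch in token):
        feats.add("NoLetter")
    if len(token) == 1:
        feats.add("SingleChar")
    return feats

def is_cap(token):
    """True if the token starts with an uppercase letter."""
    return token[:1].isupper()

def context_features(tokens, i):
    """Return the capitalization-context features for tokens[i]."""
    feats = set()
    cur = is_cap(tokens[i])
    prev = i > 0 and is_cap(tokens[i - 1])
    nxt = i + 1 < len(tokens) and is_cap(tokens[i + 1])
    if prev:
        feats.add("prevCap")
    if nxt:
        feats.add("nextCap")
    if prev and nxt:
        feats.add("seqCap" if cur else "seqBreakCap")
    if cur:
        feats.add("CapNext" if nxt else "noCapNext")
        feats.add("CapPrev" if prev else "noCapPrev")
    return feats
```

For example, `context_features(["Bank", "of", "America"], 1)` yields prevCap, nextCap and seqBreakCap, since the lowercase "of" sits between two capitalized words.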
For some of the above features, the NER uses the following dictionaries:
in order to compute the following features:
    Company
    Location
    Money
    Name        // first name
    PrevName
    Person
    NextPerson
    Product
    Time
The following lists are created during training:
    FWL (Frequent Word List): words that occur in more than 5 documents
    CPW (Common Preceding Words): 20 words that most often precede a certain class
    CPB (Common Preceding Bigrams): bigrams that often precede a certain class
    SUF (Suffix for Class): common 3-4 letter suffix for a certain class (-ian, -ish)
    NLW (Name Last Words): list of words terminating a Name sequence
        Organization: Inc, Org, Co
        Locations: center, museum, square, street
        Person: Jr, II, III
    LNW (Lowercase Name Words): list of lowercase words appearing in a Name sequence
        Organization: al, in, zonder, vor, for
        Person: "van der", "de", "of"
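The FWL and CPW lists can be derived with simple counting over the training corpus. The sketch below assumes each document is a list of (word, tag) pairs and that non-entity tokens carry a designated outside tag ("O" here); both the data layout and the function names are assumptions for illustration, not the format used by the NER code.

```python
from collections import Counter, defaultdict

def build_fwl(documents, min_docs=5):
    """FWL: words that occur in more than `min_docs` documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document, however often it appears.
        doc_freq.update({word for word, _tag in doc})
    return {word for word, count in doc_freq.items() if count > min_docs}

def build_cpw(documents, top_n=20, outside="O"):
    """CPW: for each class, the `top_n` words that most often precede it."""
    preceding = defaultdict(Counter)
    for doc in documents:
        for (prev_word, prev_tag), (_word, tag) in zip(doc, doc[1:]):
            # Only count the word just before the start of an entity.
            if tag != outside and prev_tag != tag:
                preceding[tag][prev_word.lower()] += 1
    return {
        tag: [word for word, _count in counts.most_common(top_n)]
        for tag, counts in preceding.items()
    }

# Tiny invented corpus, for illustration only.
docs = [
    [("Mr.", "O"), ("Smith", "PERSON"), ("visited", "O"), ("Rome", "LOCATION")],
    [("Mr.", "O"), ("Jones", "PERSON")],
]
cpw = build_cpw(docs)
fwl = build_fwl(docs, min_docs=1)
```

The CPB list could be built the same way by counting the two preceding words instead of one.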
The following are the global document features:
    otherPrevLoc      // another occurrence of current word
                      // was preceded by a CPW for class Location
    otherPrevOrg      // another occurrence of current word
                      // was preceded by a CPW for class Organization
    otherPrevPers     // another occurrence of current word
                      // was preceded by a CPW for class Person
    otherPrevBiLoc    // another occurrence of current word
                      // was preceded by a bigram for class Location
    otherPrevBiOrg    // another occurrence of current word
                      // was preceded by a bigram for class Organization
    otherPrevBiPers   // another occurrence of current word
                      // was preceded by a bigram for class Person
    otherLastLoc      // another occurrence of current word
                      // had feature lastLoc
    otherLastOrg      // another occurrence of current word
                      // had feature lastOrg
    otherLastPers     // another occurrence of current word
                      // had feature lastPers
    otherCapitalized  // an occurrence not as first word had
                      // feature Capitalized

    // Acronyms
    // An AllUpper word is stored as an acronym.
    // Caps sequences with those initials will be given the following features:
    Acronym           // e.g. FCC
    AcronymBegin      // e.g. Federal
    AcronymContinue   // e.g. Communication
    AcronymEnd        // e.g. Committee
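The acronym features described above can be sketched as a two-pass scan over a document: first collect the all-uppercase words, then look for runs of capitalized words whose initials spell one of them. The function name and the exact matching rule (acronyms of at least two letters, contiguous words, exact initial match) are assumptions for this example.

```python
def acronym_features(tokens):
    """Return one feature set per token, marking acronyms and their expansions."""
    feats = [set() for _ in tokens]
    # Pass 1: store every all-uppercase alphabetic word as an acronym.
    acronyms = {t for t in tokens if len(t) > 1 and t.isalpha() and t.isupper()}
    for i, token in enumerate(tokens):
        if token in acronyms:
            feats[i].add("Acronym")
    # Pass 2: find capitalized sequences whose initials match an acronym.
    for acro in acronyms:
        n = len(acro)
        for start in range(len(tokens) - n + 1):
            window = tokens[start:start + n]
            matches = all(
                word[:1] == letter and not word.isupper()
                for word, letter in zip(window, acro)
            )
            if not matches:
                continue
            feats[start].add("AcronymBegin")
            feats[start + n - 1].add("AcronymEnd")
            for k in range(start + 1, start + n - 1):
                feats[k].add("AcronymContinue")
    return feats

tokens = "The Federal Communication Committee ( FCC ) said".split()
features = acronym_features(tokens)
```

With this input, "Federal", "Communication" and "Committee" receive AcronymBegin, AcronymContinue and AcronymEnd respectively, and "FCC" receives Acronym.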
The tagger is trained to learn the following tags:
    UNKNOWN
    PERSON
    ORGANIZATION
    LOCATION
    PRODUCT
    MONEY
    NUMBER
    MEASURE
    DURATION
    DATE
    TIME
    QUANTITY
    URL
    EMAIL
some of which are recognized by means of suitable regular expressions.
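As an illustration of regular-expression based recognition, the sketch below assigns a tag to a token when it fully matches one of a few patterns. The patterns, the set of tags covered, and the function name are assumptions made for this example; the expressions actually used by the tagger are not listed in this document.

```python
import re

# Illustrative patterns only; order matters, the first full match wins.
PATTERNS = [
    ("URL", re.compile(r"(?:https?://|www\.)\S+")),
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("MONEY", re.compile(r"[$€£]\d+(?:[.,]\d+)*")),
    ("NUMBER", re.compile(r"\d+(?:[.,]\d+)*")),
]

def regex_tag(token):
    """Return the tag of the first pattern matching the whole token, else None."""
    for tag, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return tag
    return None
```

Tokens that return None here would be left to the statistical classifier.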
- Chieu, H.L. and Ng, H.T. (2003). Named Entity Recognition with a Maximum Entropy Approach. In: Proceedings of CoNLL-2003, Edmonton, Canada, pp. 160-163.