A Named Entity tagger prototype is available in directory 'NER'.

NER API

The NER functionality is provided through the following class:

class NER : public
   IPipe<Enumerator<vector<Token*>*>, Enumerator<vector<Token*>*> >
{
public:
   // Load a trained model from modelFile; configFile optionally
   // supplies additional settings.
   NER(char const* modelFile, char const* configFile = 0);

   // Train the tagger on the sentences provided by sentenceReader.
   void train(SentenceReader* sentenceReader);

   // Process a stream of sentences, adding named entity annotations.
   Enumerator<vector<Token*>*> pipe(Enumerator<vector<Token*>*>& sen);
};

As usual, the NER can be used from Python as part of a pipeline:

from splitter.SentenceSplitter import *
from splitter.Tokenizer import *
from tag.PosTagger import *
from tag.NER import *

ssp = SentenceSplitter('en.ss').pipe('infile')
tokp = Tokenizer('en.tok').pipe(ssp)
posp = PosTagger('en.pos').pipe(tokp)
nerp = NER('en.ner').pipe(posp)

for tok in nerp:
   print tok

NER Pipe

The Tanl pipeline interface is available for the NER. Here is an example of its use in Python, training a model:

import NER
c = NER.Corpus('it', 'conll03.fmt')
sr = c.sentenceReader('ner.train')
ner = NER.NER(None)
ner.train(sr, 'ner.me')

or tagging a document:

import NER
c = NER.Corpus('it', 'conll03.fmt')
sr = c.sentenceReader('ner.test')
ner = NER.NER('ner.me')
np = ner.pipe(sr)
for x in np:
   print x

NER Design

The tagger is inspired by the design of Chieu & Ng (2003), which achieved the second-best score (the best among non-combination taggers) at the CoNLL 2003 Shared Task.

The NER is based on a Maximum Entropy classifier and uses two types of features: local and global.
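
To make the classification step concrete, here is a minimal sketch of a maximum entropy (log-linear) classifier over binary features, trained by plain gradient ascent. The feature names and toy data are invented for illustration; this is not the Tanl implementation:

import math
from collections import defaultdict

# Toy training data: each token is a set of active binary features
# (hypothetical names) paired with its gold tag.
training = [
    ({'Capitalized', 'prevCap'}, 'PERSON'),
    ({'Capitalized', 'suffixOrg'}, 'ORGANIZATION'),
    ({'AllDigits', 'IsYear'}, 'UNKNOWN'),
    ({'Capitalized', 'bigramLoc'}, 'LOCATION'),
]
tags = sorted(set(t for _, t in training))

weights = defaultdict(float)  # (feature, tag) -> weight

def probs(features):
    """P(tag | features) under the current log-linear model."""
    scores = [math.exp(sum(weights[(f, t)] for f in features)) for t in tags]
    z = sum(scores)
    return [s / z for s in scores]

# A few epochs of gradient ascent on the conditional log-likelihood.
for epoch in range(50):
    for features, gold in training:
        p = probs(features)
        for i, t in enumerate(tags):
            grad = (1.0 if t == gold else 0.0) - p[i]
            for f in features:
                weights[(f, t)] += 0.1 * grad

# Classify a new token by picking the most probable tag.
p = probs({'Capitalized', 'prevCap'})
print(tags[p.index(max(p))])  # PERSON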

The following are the local features, extracted from contiguous input tokens:

Lexical Features

 AllAlpha
 AllDigits
 AllQuoting
 AllUpper
 Capitalized
 ContainsDOT
 ContainsComma
 ContainsSlash
 ContainsDash
 ContainsDigit
 ContainsDollar
 firstWordCap   // first word of sentence is capitalized
 firstWordNoCap // first word but not capitalized
 IsYear
 HyphenCapCap   // Str1-Str2
 HyphenNoCapCap // str1-Str2
 HyphenCapNoCap // Str1-str2
 MixedCase
 NoLetter
 SingleChar
 SingleS  // 's
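
For illustration only (this is not the Tanl code), several of these lexical tests can be expressed as simple predicates on the token string:

import re

def lexical_features(word):
    """Compute a few of the lexical features listed above (illustrative)."""
    feats = set()
    if word.isalpha():
        feats.add('AllAlpha')
    if word.isdigit():
        feats.add('AllDigits')
    if word.isupper():
        feats.add('AllUpper')
    if word[:1].isupper() and word[1:].islower():
        feats.add('Capitalized')
    if '.' in word:
        feats.add('ContainsDOT')
    if '-' in word:
        feats.add('ContainsDash')
    if any(c.isdigit() for c in word):
        feats.add('ContainsDigit')
    if re.match(r'^(1[89]|20)\d\d$', word):   # assumed year range
        feats.add('IsYear')
    if re.match(r'^[A-Z]\w*-[A-Z]\w*$', word):
        feats.add('HyphenCapCap')
    if len(word) == 1:
        feats.add('SingleChar')
    if word == "'s":
        feats.add('SingleS')
    return feats

print(sorted(lexical_features('FCC')))   # ['AllAlpha', 'AllUpper']
print(sorted(lexical_features('1998')))  # ['AllDigits', 'ContainsDigit', 'IsYear']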

Token category features

UNKNOWN
PERSON
ORGANIZATION
LOCATION
PRODUCT
MONEY
NUMBER
MEASURE
DURATION
DATE
TIME
QUANTITY
URL
EMAIL

Local Features

 prevCap        // previous is capitalized
 nextCap        // next is capitalized
 seqCap         // previous and next are Capitalized, current also
 seqBreakCap    // previous and next are Capitalized, current not
 CapNext        // current and next are Capitalized
 noCapNext      // current is Capitalized but next is not
 CapPrev        // current and previous are Capitalized
 noCapPrev      // current is Capitalized but previous is not

 withinQuotes   // word is in sequence within quotes
 rare           // word not present in FWL

 bigramLoc      // w-2, w-1 appear in CPB list for Locations
 bigramOrg      // w-2, w-1 appear in CPB list for Organizations
 bigramPers     // w-2, w-1 appear in CPB list for Persons

 suffixLoc      // 3-letter suffix present in SUF list Locations
 suffixOrg      // 3-letter suffix present in SUF list Organizations
 suffixPers     // 3-letter suffix present in SUF list for Persons
 suffixProd     // 3-letter suffix present in SUF list for Product

 lastLoc        // present in List of Location Last words
 lastOrg        // present in List of Organization Last words
 lastPers       // present in List of Person Last words
 lastProd       // present in List of Product Last words

 lowerSeqLoc    // word in List NLW appears in sequence of Caps
 lowerSeqOrg    // word in List NLW appears in sequence of Caps
 lowerSeqPers   // word in List NLW appears in sequence of Caps

 lowerLoc       // word appearing in list Lower Location Words
 lowerOrg       // word appearing in list Lower Organization Words
 lowerPers      // word appearing in list Lower Person Words

POS Tags

POS            // multiple values: one for each POS
prevPOS        // similarly
nextPOS        // similarly

The token category features listed above are extracted from tokens by means of language-specific regular expressions.
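
A minimal sketch of how such patterns might look follows; the regular expressions below are invented examples for English, not the ones shipped with the tagger:

import re

# Hypothetical category patterns; the real tables are language-specific
# configuration, not hard-coded like this. Order matters: the first
# matching pattern wins.
CATEGORY_PATTERNS = [
    ('MONEY', re.compile(r'^\$\d+(\.\d+)?$')),
    ('TIME', re.compile(r'^\d{1,2}:\d{2}$')),
    ('DATE', re.compile(r'^\d{1,2}/\d{1,2}/\d{2,4}$')),
    ('URL', re.compile(r'^https?://\S+$')),
    ('EMAIL', re.compile(r'^[\w.+-]+@[\w-]+\.[\w.]+$')),
    ('NUMBER', re.compile(r'^\d+(\.\d+)?$')),
]

def token_category(word):
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.match(word):
            return category
    return 'UNKNOWN'

print(token_category('$12.50'))            # MONEY
print(token_category('12:30'))             # TIME
print(token_category('info@example.com'))  # EMAIL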

For some of the above features, the NER uses the following dictionaries:

  1. companies
  2. money
  3. names
  4. person
  5. time
  6. product
  7. location

in order to compute the following features:

Company
Location
Money
Name   // first name
PrevName
Person
NextPerson
Product
Time
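
For illustration, such dictionary features can be computed by looking up the current token and its neighbours in the word lists. The in-memory dictionaries and the exact reading of PrevName and NextPerson below are assumptions, not the actual implementation:

# Hypothetical dictionaries; the real ones are loaded from the files
# listed above (names, person, companies, ...).
FIRST_NAMES = {'John', 'Mary'}
COMPANIES = {'Acme'}
PERSONS = {'Smith'}

def dictionary_features(tokens, i):
    feats = set()
    if tokens[i] in COMPANIES:
        feats.add('Company')
    if tokens[i] in FIRST_NAMES:
        feats.add('Name')            # first name
    if tokens[i] in PERSONS:
        feats.add('Person')
    if i > 0 and tokens[i - 1] in FIRST_NAMES:
        feats.add('PrevName')        # previous token is a first name
    if i + 1 < len(tokens) and tokens[i + 1] in PERSONS:
        feats.add('NextPerson')      # next token is a person word
    return feats

sentence = ['John', 'Smith', 'works', 'at', 'Acme']
print(sorted(dictionary_features(sentence, 1)))  # ['Person', 'PrevName']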

The following lists are created during training:

FWL (Frequent Word List): words that occur in more than 5 documents

CPW (Common Preceding Words): 20 words that most often precede
   a certain class

CPB (Common Preceding Bigrams): bigrams that often precede
   a certain class

SUF (Suffix for Class): common 3-4 letter suffix for a certain class
   (-ian, -ish)

NLW (Name Last Words): list of words terminating a Name sequence
   Organization: Inc, Org, Co
   Locations: center, museum, square, street
   Person: Jr, II, III

LNW (Lowercase Name Words): list of lowercase words appearing in a
   Name sequence
   Organization: al, in, zonder, vor, for
   Person: "van der", "de", "of"
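
As an example of how one of these lists can be built at training time, here is a sketch of constructing the FWL; the corpus representation (one token list per document) is an assumption:

from collections import defaultdict

def build_fwl(documents, min_docs=5):
    """FWL: words that occur in more than min_docs documents."""
    doc_count = defaultdict(int)
    for doc in documents:
        for word in set(doc):        # count each word once per document
            doc_count[word] += 1
    return set(w for w, n in doc_count.items() if n > min_docs)

# Toy corpus: each document is a list of tokens.
docs = [['the', 'EU', 'said'] for _ in range(6)] + [['rare', 'word']]
print(sorted(build_fwl(docs)))  # ['EU', 'said', 'the']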

Global document features

otherPrevLoc    // another occurrence of current word
                // was preceded by a CPW for class Location
otherPrevOrg    // another occurrence of current word
                // was preceded by a CPW for class Organization
otherPrevPers   // another occurrence of current word
                // was preceded by a CPW for class Person

otherPrevBiLoc  // another occurrence of current word
                // was preceded by a bigram for class Location
otherPrevBiOrg  // another occurrence of current word
                // was preceded by a bigram for class Organization
otherPrevBiPers // another occurrence of current word
                // was preceded by a bigram for class Person

otherLastLoc    // another occurrence of current word
                // had feature lastLoc
otherLastOrg    // another occurrence of current word
                // had feature lastOrg
otherLastPers   // another occurrence of current word
                // had feature lastPers

otherCapitalized // an occurrence not as first word had
                 // feature Capitalized

  // Acronyms
  // An AllUpper word is stored as an acronym.
  // Caps sequences with those initials will be given the following features:
Acronym         // e.g. FCC
AcronymBegin    // e.g. Federal
AcronymContinue // e.g. Communication
AcronymEnd      // e.g. Committee
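
The acronym mechanism can be sketched as follows: every all-uppercase token is recorded as an acronym, and any later capitalized sequence whose initials spell it out receives the Acronym* features. The code below is illustrative, not the Tanl implementation:

def acronym_features(tokens):
    """Mark capitalized sequences whose initials spell a seen acronym."""
    acronyms = set(t for t in tokens if t.isupper() and len(t) > 1)
    feats = [set() for _ in tokens]
    for i, t in enumerate(tokens):
        if t in acronyms:
            feats[i].add('Acronym')
    for i in range(len(tokens)):
        for acr in acronyms:
            n = len(acr)
            span = tokens[i:i + n]
            # A run of n capitalized words whose initials spell the acronym.
            if (len(span) == n and all(w[:1].isupper() for w in span)
                    and ''.join(w[0] for w in span) == acr):
                feats[i].add('AcronymBegin')
                for j in range(i + 1, i + n - 1):
                    feats[j].add('AcronymContinue')
                feats[i + n - 1].add('AcronymEnd')
    return feats

toks = ['FCC', 'stands', 'for', 'Federal', 'Communication', 'Committee']
for tok, fs in zip(toks, acronym_features(toks)):
    print(tok + ' ' + str(sorted(fs)))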

References

  1. Chieu, H.L. and Ng, H.T. (2003). Named Entity Recognition with a Maximum Entropy Approach. In: Proceedings of CoNLL-2003, Edmonton, Canada, pp. 160-163. http://www.cnts.ua.ac.be/conll2003/pdf/16063chi.pdf
  2. Chieu, H.L. and Ng, H.T. (2002). Named Entity Recognition: A Maximum Entropy Approach Using Global Information. In: Proceedings of COLING 2002. http://www.aclweb.org/anthology-new/C/C02/C02-1025.pdf