Tanl Linguistic Pipeline |
Public Member Functions | |
Resources (char const *POStag, char const *NEtag) | |
Resources (std::string &locale, char const *POStag, char const *NEtag) | |
Resources (std::string &resourceDir, std::string &locale, char const *POStag, char const *NEtag) | |
size_t | typesCount () |
char const * | typeName (EntityType et) |
void | load (std::string &resourceDir) |
Load all resources from the given directory. | |
template<class WordSet > | |
void | load (std::vector< WordSet > &sets, char const *file) |
Load a group of WordSets from a file. | |
template<class WordSet > | |
void | load (vector< WordSet > &sets, char const *file) |
template<class WordSet > | |
void | load (WordSet *sets, char const *file) |
Public Attributes | |
Text::WordIndex | classId |
Maps class names to class IDs. | |
char const * | language |
char const * | POStag |
char const * | NEtag |
TagSet | prevTokenType |
TagSet | nextTokenType |
std::vector < Tanl::Text::NormWordSet > | dict |
Tanl::Text::NormWordSet | moneyDict |
Tanl::Text::NormWordSet | namesDict |
Tanl::Text::NormWordSet | timeDict |
Tanl::Text::NormWordSet | prodDict |
Tanl::Text::NormWordSet | FWL |
FWL (Frequent Word List): words that occur in more than 5 documents. | |
std::vector < Tanl::Text::NormWordSet > | designators |
CPW (Common Preceding Words): 20 words that most often precede names of a certain class. | |
std::vector < Tanl::Text::NormWordSet > | preBigrams |
CPB (Common Preceding Bigrams): bigrams that often precede names of a certain class. | |
std::vector < Tanl::Text::NormWordSet > | prefixes |
PRE (Prefix for Class): common 3-letter prefix for each class. | |
std::vector< Tanl::Text::Suffixes > | suffixes |
SUF (Suffix for Class): common 3-letter suffix for each class. | |
std::vector < Tanl::Text::NormWordSet > | firstWords |
EFW (Entity First Words): list of words starting an entity. | |
std::vector < Tanl::Text::NormWordSet > | lastWords |
ELW (Entity Last Words): list of words terminating an entity. | |
std::vector < Tanl::Text::NormWordSet > | lowerInterm |
NAW (Name After Words): list of words after an entity. | |
Static Public Attributes | |
static IXE::conf_set< std::string > | entityTypes |
The entity type names. |
void Tanl::NER::Resources::load (std::vector< WordSet > &sets, char const *file) [inline]
Load a group of WordSets from a file.
The file contains one word per line, in the format "class word", where class is an entity type such as LOC, MISC, ORG or PER.
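The "class word" file format described above can be read along the following lines. This is only a minimal sketch, not the Tanl implementation: SimpleWordSet and loadWordSets are hypothetical names, the class-name-to-index map is passed in explicitly, and the real NormWordSet may normalize words in ways this stand-in does not.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a WordSet; not the real Tanl::Text::NormWordSet.
using SimpleWordSet = std::set<std::string>;

// Read "class word" lines from `in` and add each word to the set of its class.
// `classId` maps class names (e.g. LOC, MISC, ORG, PER) to indices into `sets`.
// Returns the number of distinct entries stored. Lines with an unknown class
// or a missing word are skipped.
inline std::size_t loadWordSets(std::istream& in,
                                const std::map<std::string, std::size_t>& classId,
                                std::vector<SimpleWordSet>& sets)
{
    std::size_t stored = 0;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string cls, word;
        if (!(fields >> cls >> word))
            continue;                        // blank or malformed line
        auto it = classId.find(cls);
        if (it == classId.end())
            continue;                        // unknown entity type
        if (it->second >= sets.size())
            sets.resize(it->second + 1);     // grow so the class index is valid
        if (sets[it->second].insert(word).second)
            ++stored;                        // count only newly inserted words
    }
    return stored;
}
```

Passing a std::ifstream opened on the resource file, or a std::istringstream in a test, works equally well because the function only depends on std::istream.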
void Tanl::NER::Resources::load (std::string &resourceDir)
Load all resources from the given directory.
Referenced by Tanl::NER::NER::NER().
Text::WordIndex Tanl::NER::Resources::classId
Maps class names to class IDs.
IXE::conf_set< std::string > Tanl::NER::Resources::entityTypes [static]
The entity type names.
Referenced by Tanl::NER::NER::tag().
Tanl::Text::NormWordSet Tanl::NER::Resources::FWL
FWL (Frequent Word List): words that occur in more than 5 documents.
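This page does not say how the FWL resource is produced. As an illustration of the "more than 5 documents" criterion only, a list of this kind could be derived from a tokenized corpus as sketched below. Document and frequentWords are hypothetical names, and the threshold parameter is an assumption taken from the description above.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical representation: a document is a sequence of tokens.
using Document = std::vector<std::string>;

// Return the words that occur in more than `minDocs` distinct documents.
// Repeated occurrences inside one document count only once.
inline std::set<std::string> frequentWords(const std::vector<Document>& docs,
                                           std::size_t minDocs = 5)
{
    std::map<std::string, std::size_t> docFreq;
    for (const Document& doc : docs) {
        // Deduplicate tokens so each document contributes at most 1 per word.
        std::set<std::string> unique(doc.begin(), doc.end());
        for (const std::string& w : unique)
            ++docFreq[w];
    }
    std::set<std::string> result;
    for (const auto& entry : docFreq)
        if (entry.second > minDocs)          // strictly more than minDocs
            result.insert(entry.first);
    return result;
}
```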
std::vector<Tanl::Text::NormWordSet> Tanl::NER::Resources::lowerInterm
NAW (Name After Words): list of words after an entity, e.g.: center, museum, square, street.
LIW (Lowercase Intermediate Words): list of lowercase words appearing within a sequence, e.g.:
PER: "van der", "de", "of"
ORG: al, in, zonder, vor, for
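How the tagger consumes lowerInterm is not documented here. One plausible use of a lowercase-intermediate-word list, shown purely as a sketch, is to decide whether a lowercase token bridges two capitalized tokens of the same entity. SimpleWordSet, startsUpper and isIntermediate are hypothetical names, and only single-token entries are handled (a bigram entry such as "van der" would need extra logic).

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for Tanl::Text::NormWordSet.
using SimpleWordSet = std::set<std::string>;

// True if the token is non-empty and begins with an uppercase ASCII letter.
inline bool startsUpper(const std::string& s)
{
    return !s.empty() && std::isupper(static_cast<unsigned char>(s[0])) != 0;
}

// True if tokens[i] is a known lowercase intermediate word and both of its
// neighbours start with an uppercase letter, as in "Charles de Gaulle".
inline bool isIntermediate(const std::vector<std::string>& tokens,
                           std::size_t i,
                           const SimpleWordSet& lowerInterm)
{
    if (i == 0 || i + 1 >= tokens.size())
        return false;                        // needs a neighbour on each side
    return lowerInterm.count(tokens[i]) != 0
        && startsUpper(tokens[i - 1])
        && startsUpper(tokens[i + 1]);
}
```

Since lowerInterm is a vector of word sets, presumably one per entity class, a caller would select the set for the class under consideration before making this check.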