Tanl Linguistic Pipeline |
The task of the suffix guesser is to predict a tag-distribution based on the suffix of the word.
#include <SuffixGuesser.h>
Public Member Functions | |
void | serialize (std::ostream &out) |
Serializes a SuffixGuesser object. | |
void | serialize (std::istream &in) |
Deserializes a SuffixGuesser object. | |
void | add_word (int n, std::string &word, TagID tag, int count) |
Adds a word to the suffix trie. | |
double | tagprob (std::string &word, int tagid) |
TODO. | |
double | tagprobs (std::string &word, std::vector< double > &probs) |
TODO. | |
Static Public Member Functions | |
static double | calculate_theta (std::vector< double > &apriori_tag_probs) |
TODO. | |
Public Attributes | |
double | theta |
Theta used in the interpolation process. | |
TrieNode | trie |
Trie of suffixes. | |
Counts | empty_counts |
Empty Counts object. |
The task of the suffix guesser is to predict a tag-distribution based on the suffix of the word.
In the training phase, it counts how often each suffix occurs in the corpus, both in total and for each tag separately. Consider a word ending in ABCDE. During prediction, the guesser linearly interpolates the predictions looked up for the suffixes ABCDE, BCDE, CDE, DE, E and "" (the empty suffix). Interpolation is done successively with weights 1 and theta, so the effective weights are basically powers of 1/(1+theta), with the shorter suffix getting the larger weight.
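The prediction step described above can be sketched as follows. This is a minimal illustration, not the Tanl implementation: the function names suffixes_longest_first and interpolate are hypothetical, and the exact form of the recursion is an assumption. The page only says that each step uses the weights 1 and theta; the sketch assumes that the running estimate is multiplied by theta, the next shorter suffix's estimate is added with weight 1, and the sum is divided by 1 + theta. Under that assumption, and for theta below 1, shorter suffixes end up with the larger weights.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Suffixes of `word`, from the longest one (at most `max_len` characters)
// down to the empty suffix: "ABCDE" -> ABCDE, BCDE, CDE, DE, E, "".
std::vector<std::string> suffixes_longest_first(const std::string &word,
                                                std::size_t max_len) {
    const std::size_t longest = word.size() < max_len ? word.size() : max_len;
    std::vector<std::string> result;
    for (std::size_t len = longest + 1; len-- > 0;) {
        result.push_back(word.substr(word.size() - len));
    }
    return result;
}

// Successive interpolation for ONE fixed tag.
// p_longest_first[i] is the relative frequency of that tag for the i-th
// suffix returned by suffixes_longest_first().
// ASSUMPTION (not confirmed by this page): each step computes
//   acc = (theta * acc + p_shorter) / (1 + theta)
double interpolate(const std::vector<double> &p_longest_first, double theta) {
    assert(!p_longest_first.empty());
    double acc = p_longest_first.front();
    for (std::size_t i = 1; i < p_longest_first.size(); ++i) {
        acc = (theta * acc + p_longest_first[i]) / (1.0 + theta);
    }
    return acc;
}
```

Because the two weights are normalized by 1 + theta at every step, the result is a convex combination of the per-suffix estimates: if every suffix gives the same estimate, the interpolated value equals that estimate. How suffixes that were never seen in training are handled is not covered here.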
void Tanl::POS::SuffixGuesser::add_word (int n, std::string &word, TagID tag, int count)
Adds a word to the suffix trie.
n | Max suffix size. | |
word | String to be added to the trie. | |
tag | Tag identifier used to tag the word we are trying to add. | |
count | Number of times word was tagged with tag in the corpus. |
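To make the role of the four parameters concrete, here is a minimal sketch of a suffix trie that stores a total count and per-tag counts at every node. It is not the Tanl TrieNode or Counts class: SketchNode, sketch_add_word and sketch_find are hypothetical names, int stands in for TagID, and the layout (one child per character, walking the word from its last character backwards) is an assumption about how such a trie is typically built.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a trie node with its counts.
struct SketchNode {
    int total = 0;                // occurrences of this suffix, all tags
    std::map<int, int> per_tag;   // occurrences of this suffix, per tag id
    std::map<char, std::unique_ptr<SketchNode>> children;
};

// Registers `count` occurrences of `word` with `tag` for the empty suffix
// (the root) and for every suffix of length 1..n.
void sketch_add_word(SketchNode &root, int n, const std::string &word,
                     int tag, int count) {
    SketchNode *node = &root;
    node->total += count;
    node->per_tag[tag] += count;
    const std::size_t max_len = n < 0 ? 0 : static_cast<std::size_t>(n);
    const std::size_t limit = word.size() < max_len ? word.size() : max_len;
    for (std::size_t i = 0; i < limit; ++i) {
        const char c = word[word.size() - 1 - i];  // walk from the end
        std::unique_ptr<SketchNode> &child = node->children[c];
        if (!child) child.reset(new SketchNode());
        node = child.get();
        node->total += count;
        node->per_tag[tag] += count;
    }
}

// Returns the node for `suffix`, or nullptr if that suffix was never stored.
const SketchNode *sketch_find(const SketchNode &root,
                              const std::string &suffix) {
    const SketchNode *node = &root;
    for (std::size_t i = 0; i < suffix.size(); ++i) {
        const char c = suffix[suffix.size() - 1 - i];
        auto it = node->children.find(c);
        if (it == node->children.end()) return nullptr;
        node = it->second.get();
    }
    return node;
}
```

With n = 3, adding "walked" (tag 1, count 3), "talked" (tag 1, count 2) and "red" (tag 2, count 5) leaves a total of 10 at the node for "ed", split 5 and 5 between the two tags, while the four-character suffix "lked" is not stored at all.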
void Tanl::POS::SuffixGuesser::serialize (std::istream &in)
Deserializes a SuffixGuesser object.
in | The stream from which the object will be read. |
References Tanl::POS::TrieNode::serialize(), serialize(), theta, and trie.
void Tanl::POS::SuffixGuesser::serialize (std::ostream &out)
Serializes a SuffixGuesser object.
out | The stream to which the object will be written. |
References Tanl::POS::TrieNode::serialize(), theta, and trie.
Referenced by serialize().
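Both directions share the name serialize and are distinguished only by the stream type of the argument. The sketch below illustrates that overload pattern on a hypothetical stand-in type; SketchModel, its counts member, round_trip and the whitespace-separated text format are all assumptions made for the example and say nothing about the on-disk format Tanl actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical stand-in for a class with a serialize() overload pair.
struct SketchModel {
    double theta = 0.0;
    std::vector<int> counts;

    // Writing overload: selected when the argument is an output stream.
    void serialize(std::ostream &out) {
        out << std::setprecision(17) << theta << ' ' << counts.size();
        for (int c : counts) out << ' ' << c;
        out << '\n';
    }

    // Reading overload: selected when the argument is an input stream.
    void serialize(std::istream &in) {
        std::size_t size = 0;
        in >> theta >> size;
        counts.assign(size, 0);
        for (int &c : counts) in >> c;
    }
};

// Writes a model to a string and reads it back into a fresh object.
SketchModel round_trip(SketchModel model) {
    std::ostringstream out;   // output-only stream -> writing overload
    model.serialize(out);
    std::istringstream in(out.str());  // input-only stream -> reading overload
    SketchModel copy;
    copy.serialize(in);
    return copy;
}
```

One consequence of this design is worth noting: a std::stringstream or std::fstream is both an input and an output stream, so passing one directly to such an overload pair is ambiguous and does not compile. Using std::ostringstream and std::istringstream (or std::ofstream and std::ifstream), or casting to the intended base type, selects the direction explicitly.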