Tanl Linguistic Pipeline

Tanl::POS::Context< CT > Class Template Reference

This is a language model to calculate P(C | A, B) with linear interpolation.
#include <linear_interpolated_lm.h>
Public Types

    typedef std::tr1::unordered_map< int, Context< CT > * >  CMap
    typedef std::tr1::unordered_map< CT, double >            WordFreq

Public Member Functions

    void serialize (std::ostream &out)
    void serialize (std::istream &in)
    void add_word (std::vector< int >::const_iterator context, int n, CT &word)
        Add token word in context context.
    WordFreq & get_words ()
    unsigned total_context_freq ()
    size_t word_count_at_context ()
    std::vector< double > calculate_lambdas (int level)
    void counts_to_prob (std::vector< double > &lambdas)
        Translate frequencies to log probabilities.
    double wordprob (CT const &word, std::vector< int > &context)

Public Attributes

    unsigned freq
        Total word occurrences.
    CMap childs
        Context map.
    WordFreq words
        Word map.
This is a language model to calculate P(C | A, B) with linear interpolation, i.e.

    P(C | A B) = l3 ML(C | A B) + l2 ML(C | B) + l1 ML(C) + l0

The lambdas are calculated as in Brants (2000).

The data structure is similar to the one used by SRILM: a context tree whose root holds the null context (unigrams), whose node B holds bigrams starting with B, and whose node B->A holds trigrams starting with A B. Every node of the context tree has a word map storing the frequency of each word given that context.

The item type is parametric: int for tags, string for words.

This module first counts the frequencies, then calculates the lambda parameters of the linear interpolation, and finally transforms the frequencies into probabilities.

TODO: frequency counting and lambda calculation should be separated.
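A minimal usage sketch, under stated assumptions: CT = int for a tag model, the toy corpus, the most-recent-first context ordering (suggested by the root -> B -> A tree layout above) and the value passed to calculate_lambdas are all illustrative guesses, not documented API facts.

    #include <iostream>
    #include <vector>
    #include <linear_interpolated_lm.h>

    int main()
    {
      // Trigram tag model: CT = int, contexts are vectors of tag ids.
      Tanl::POS::Context<int> lm;

      // Toy corpus of tag ids (illustrative).
      int tags[] = {3, 7, 5, 3, 7, 5, 3, 7};
      std::vector<int> corpus(tags, tags + 8);

      // Count frequencies: each token is added in its two-tag left
      // context, assumed most-recent-first (B, then A).
      for (size_t i = 2; i < corpus.size(); ++i) {
        std::vector<int> context;
        context.push_back(corpus[i - 1]);   // B
        context.push_back(corpus[i - 2]);   // A
        lm.add_word(context.begin(), 2, corpus[i]);
      }

      // Estimate the interpolation weights and turn the counts into
      // log probabilities (the level argument is a guess).
      std::vector<double> lambdas = lm.calculate_lambdas(2);
      lm.counts_to_prob(lambdas);

      // Query log P(5 | 3 7): context again most-recent-first.
      std::vector<int> ctx;
      ctx.push_back(7);   // B
      ctx.push_back(3);   // A
      std::cout << lm.wordprob(5, ctx) << std::endl;
      return 0;
    }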
void Tanl::POS::Context< CT >::add_word (std::vector< int >::const_iterator context, int n, CT &word)  [inline]
Add token word in context context.
The contexts considered are the n-grams [context : context+i], 0 <= i < n. To add the trigram A B C, walk down the tree and add the word C at every level.
References Tanl::POS::Context< CT >::add_word(), Tanl::POS::Context< CT >::childs, Tanl::POS::Context< CT >::freq, and Tanl::POS::Context< CT >::words.
Referenced by Tanl::POS::Context< CT >::add_word().
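A sketch of how this descent could be written, shown as a free function over the public attributes (freq, words, childs); it is recursive, consistent with the cross references above, but the on-demand allocation of missing child nodes is an assumption, not documented behaviour.

    #include <vector>
    #include <linear_interpolated_lm.h>

    // Sketch only: the real add_word is a member of Context< CT >.
    template <class CT>
    void add_word_sketch(Tanl::POS::Context<CT>& node,
                         std::vector<int>::const_iterator context,
                         int n, CT& word)
    {
      ++node.freq;          // one more token observed under this context
      ++node.words[word];   // count the word at the current level

      if (n == 0)           // longest context reached: stop descending
        return;

      // Move to the child holding the next (older) context item and add
      // the word there as well, so A B C updates root, B and B->A.
      Tanl::POS::Context<CT>*& child = node.childs[*context];
      if (child == 0)
        child = new Tanl::POS::Context<CT>();
      add_word_sketch(*child, context + 1, n - 1, word);
    }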
std::vector< double > Tanl::POS::Context< CT >::calculate_lambdas (int level)  [inline]
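No detailed description accompanies this member; the class description only says the lambdas are calculated as in Brants (2000). The following free-standing sketch of that deleted-interpolation scheme is an illustration of the formula, not of this member's code: it works on a flat n-gram count table rather than the context tree, and it does not cover the constant l0 fallback.

    #include <map>
    #include <vector>

    // Deleted interpolation as in Brants (2000), over a flat count table
    // keyed by n-grams of any length.
    typedef std::map<std::vector<int>, double> Counts;

    static double count_of(const Counts& f, const std::vector<int>& ngram)
    {
      Counts::const_iterator it = f.find(ngram);
      return it == f.end() ? 0.0 : it->second;
    }

    std::vector<double> brants_lambdas(const Counts& f, double N)
    {
      std::vector<double> lambda(3, 0.0);          // l1, l2, l3
      for (Counts::const_iterator it = f.begin(); it != f.end(); ++it) {
        if (it->first.size() != 3) continue;       // only trigrams vote
        int t1 = it->first[0], t2 = it->first[1], t3 = it->first[2];
        double f123 = it->second;

        std::vector<int> t12(2); t12[0] = t1; t12[1] = t2;
        std::vector<int> t23(2); t23[0] = t2; t23[1] = t3;
        std::vector<int> s2(1, t2), s3(1, t3);

        // Three candidate estimates, each with the current trigram's
        // occurrence "deleted" from the counts.
        double c3 = count_of(f, t12) > 1 ? (f123 - 1) / (count_of(f, t12) - 1) : 0.0;
        double c2 = count_of(f, s2)  > 1 ? (count_of(f, t23) - 1) / (count_of(f, s2) - 1) : 0.0;
        double c1 = N > 1 ? (count_of(f, s3) - 1) / (N - 1) : 0.0;

        // The largest estimate's lambda receives the trigram's count.
        if (c3 >= c2 && c3 >= c1)      lambda[2] += f123;   // l3
        else if (c2 >= c1)             lambda[1] += f123;   // l2
        else                           lambda[0] += f123;   // l1
      }
      double sum = lambda[0] + lambda[1] + lambda[2];
      if (sum > 0)
        for (int i = 0; i < 3; ++i) lambda[i] /= sum;
      return lambda;
    }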
void Tanl::POS::Context< CT >::counts_to_prob (std::vector< double > &lambdas)  [inline]
Translate frequencies to log probabilities.
At level n we need to know the probability of the word at level n-1. For example:

    P(C | A B) = l3 ML(C | A B) + l2 ML(C | B) + l1 ML(C) + l0
               = l3 ML(C | A B) + P(C | B)

so P(C | B) is calculated first.

ML is the maximum likelihood probability:

    ML(C)       = f(C) / N
    ML(C | B)   = f(B, C) / f(B)
    ML(C | A B) = f(A, B, C) / f(A, B)

where N is the number of tokens in the corpus.
References Tanl::POS::Context< CT >::childs.
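A possible shape for the recursion described above, written as a free function over the public attributes. This is assumed code, not the shipped member: the lambdas are taken to hold {l0, l1, l2, l3}, the parent map carries the previous level's linear (not log) probabilities, and the level/parent arguments are just how this sketch threads state between levels.

    #include <cmath>
    #include <vector>
    #include <linear_interpolated_lm.h>

    // Sketch: each node turns its counts into ML estimates, interpolates
    // them with the probabilities already computed for the shorter context
    // (parent), hands its own linear probabilities down to the children,
    // and finally stores log probabilities in its word map.
    template <class CT>
    void counts_to_prob_sketch(Tanl::POS::Context<CT>& node,
                               const std::vector<double>& lambdas,
                               int level,
                               const typename Tanl::POS::Context<CT>::WordFreq& parent)
    {
      typedef typename Tanl::POS::Context<CT>::WordFreq WordFreq;
      typedef typename Tanl::POS::Context<CT>::CMap CMap;

      WordFreq interpolated;                       // linear P(word | this context)
      typename WordFreq::iterator w;
      for (w = node.words.begin(); w != node.words.end(); ++w) {
        double ml = w->second / node.freq;         // ML(word | context)
        typename WordFreq::const_iterator p = parent.find(w->first);
        double lower = (p == parent.end()) ? lambdas[0]   // l0 at the root
                                           : p->second;   // P at level n-1
        interpolated[w->first] = lambdas[level] * ml + lower;
      }

      // P(C | B) must exist before P(C | A B) uses it, so the children
      // (longer contexts) receive this level's probabilities now.
      for (typename CMap::iterator c = node.childs.begin();
           c != node.childs.end(); ++c)
        counts_to_prob_sketch(*c->second, lambdas, level + 1, interpolated);

      // Replace this node's counts with log probabilities.
      for (w = node.words.begin(); w != node.words.end(); ++w)
        w->second = std::log(interpolated[w->first]);
    }

A root call would be counts_to_prob_sketch(root, lambdas, 1, Tanl::POS::Context<int>::WordFreq()): unigrams are interpolated against the constant l0, and every deeper level against the level just computed, matching the equation above.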