Tanl Linguistic Pipeline

Tanl::POS::Context< CT > Struct Template Reference

This is a language model that calculates P(C | A, B) by linear interpolation.

#include <linear_interpolated_lm.h>

Inheritance diagram for Tanl::POS::Context< CT >:
Tanl::POS::ProbLM< CT >


Public Types

typedef std::tr1::unordered_map< int, Context< CT > * > CMap
typedef std::tr1::unordered_map< CT, double > WordFreq

Public Member Functions

void serialize (std::ostream &out)
void serialize (std::istream &in)
void add_word (std::vector< int >::const_iterator context, int n, CT &word)
 Add token word in context context.
WordFreq & get_words ()
unsigned total_context_freq ()
size_t word_count_at_context ()
std::vector< double > calculate_lambdas (int level)
void counts_to_prob (std::vector< double > &lambdas)
 Translate frequencies to log probabilities.
double wordprob (CT const &word, std::vector< int > &context)

Public Attributes

unsigned freq
 total word occurrences
CMap childs
 context map
WordFreq words
 word map

Detailed Description

template<class CT>
struct Tanl::POS::Context< CT >

This is a language model to calculate P(C | A, B) with linear interpolation i.e.

P(C| A B) = l3 ML (C| A B) + l2 ML (C | B) + l1 ML (C) + l0

The lambdas are calculated as in Brants 2000.

The data structure is similar to the one used by SRILM: a context tree whose root holds the null context (unigrams), with a node B for bigrams starting with B and a node B->A for trigrams starting with A B. Every node of the context tree has a word map storing the frequency of each word given that context.
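The tree layout described above can be sketched as follows. This is only an illustration, not the actual Tanl code: Node and find_context are hypothetical names, std::map and std::unique_ptr stand in for the tr1 containers and raw pointers of the real typedefs, and the lookup assumes the context is given most recent item first (B, then A).

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node type: one node per context, as in the description above.
struct Node {
    std::map<std::string, unsigned> words;         // frequency of each word in this context
    std::map<int, std::unique_ptr<Node>> childs;   // next (older) context item -> child node
};

// Walk down from the root (null context) following the context items,
// most recent first. Returns nullptr if the context was never seen.
const Node* find_context(const Node& root, const std::vector<int>& recent_first) {
    const Node* node = &root;
    for (int item : recent_first) {
        auto it = node->childs.find(item);
        if (it == node->childs.end()) return nullptr;
        assert(it->second != nullptr);  // a stored child is never null in this sketch
        node = it->second.get();
    }
    return node;
}
```

With B = 2 and A = 1, find_context(root, {2}) returns the bigram node B and find_context(root, {2, 1}) returns the trigram node B->A.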

The item type is parametric: int for tags, string for words.

This module first calculates the frequencies, then calculates the lambda parameters of linear interpolation and finally transforms frequencies to probabilities.

TODO: the freq counting and the lambda calculation should be separated.


Member Function Documentation

template<class CT>
void Tanl::POS::Context< CT >::add_word ( std::vector< int >::const_iterator  context,
int  n,
CT &  word 
) [inline]

Add token word in context context.

Consider as contexts the n-grams [context:context+i], 0 <= i < n. If you want to add the A B C trigram, just go down in the tree and add word C at every level.
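A minimal sketch of this recursive descent, under stated assumptions: SketchContext is a hypothetical stand-in for Context< CT >, the iterator is assumed to point at the most recent context item first (B, then A), and n is assumed to count the levels to update (n = 3 for a trigram). The real implementation may differ in these details.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for Context<CT>; member names mirror the documented ones.
template <class CT>
struct SketchContext {
    unsigned freq = 0;                                               // total word occurrences
    std::unordered_map<int, std::unique_ptr<SketchContext>> childs;  // context map
    std::unordered_map<CT, double> words;                            // word map

    // Count `word` at this node, then descend one context item and repeat
    // until n levels have been updated.
    void add_word(std::vector<int>::const_iterator context, int n, const CT& word) {
        assert(n >= 1);
        ++freq;
        words[word] += 1.0;
        if (n == 1) return;  // deepest level reached
        std::unique_ptr<SketchContext>& child = childs[*context];
        if (!child) child.reset(new SketchContext());
        child->add_word(context + 1, n - 1, word);
    }
};
```

For the trigram A B C with A = 1 and B = 2, calling add_word on the root with the context vector {2, 1} and n = 3 counts C at the root, at node B, and at node B->A.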

References Tanl::POS::Context< CT >::add_word(), Tanl::POS::Context< CT >::childs, Tanl::POS::Context< CT >::freq, and Tanl::POS::Context< CT >::words.

Referenced by Tanl::POS::Context< CT >::add_word().

template<class CT >
std::vector< double > Tanl::POS::Context< CT >::calculate_lambdas ( int  level  )  [inline]

See also:
http://citeseer.ist.psu.edu/brants00tnt.html (Brants 2000). The algorithm is detailed in Figure 1 of that paper.

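For orientation, the deleted interpolation procedure from Figure 1 of Brants 2000 can be sketched as below. This is not the Tanl implementation: NgramCounts, count_of, safe_ratio and deleted_interpolation are hypothetical names, the counts are held in flat tables rather than in the context tree, ties are resolved in favour of the higher order (an assumption, since the paper does not specify it), and the l0 term of this class is not covered.

```cpp
#include <array>
#include <cassert>
#include <map>
#include <vector>

// Hypothetical flat count tables for tags t1 t2 t3.
struct NgramCounts {
    std::map<std::array<int, 3>, double> tri;  // f(t1, t2, t3)
    std::map<std::array<int, 2>, double> bi;   // f(t1, t2)
    std::map<int, double> uni;                 // f(t)
    double total = 0.0;                        // N, number of tokens
};

template <class Map, class Key>
double count_of(const Map& m, const Key& k) {
    auto it = m.find(k);
    return it == m.end() ? 0.0 : it->second;
}

// A ratio with a zero (or negative) denominator is defined as 0.
static double safe_ratio(double num, double den) { return den > 0.0 ? num / den : 0.0; }

// Returns {l1, l2, l3}: weights for the unigram, bigram and trigram estimates.
std::vector<double> deleted_interpolation(const NgramCounts& c) {
    assert(c.total > 0.0);
    std::vector<double> lambdas(3, 0.0);
    for (const auto& entry : c.tri) {
        const std::array<int, 3>& t = entry.first;
        const double f123 = entry.second;
        if (f123 <= 0.0) continue;
        const std::array<int, 2> t12 = {{t[0], t[1]}};
        const std::array<int, 2> t23 = {{t[1], t[2]}};
        // Each estimate is computed with the current trigram "deleted" (counts minus 1).
        const double tri = safe_ratio(f123 - 1.0, count_of(c.bi, t12) - 1.0);
        const double bi = safe_ratio(count_of(c.bi, t23) - 1.0, count_of(c.uni, t[1]) - 1.0);
        const double uni = safe_ratio(count_of(c.uni, t[2]) - 1.0, c.total - 1.0);
        // Credit the trigram count to the order with the largest estimate.
        if (tri >= bi && tri >= uni) lambdas[2] += f123;
        else if (bi >= uni) lambdas[1] += f123;
        else lambdas[0] += f123;
    }
    const double sum = lambdas[0] + lambdas[1] + lambdas[2];
    if (sum > 0.0) {
        for (double& l : lambdas) l /= sum;  // normalize so the weights sum to 1
    }
    return lambdas;
}
```

Each trigram votes for the n-gram order that predicts it best once its own occurrence is removed, so sparse higher-order counts automatically shift weight toward the lower orders.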
template<class CT >
void Tanl::POS::Context< CT >::counts_to_prob ( std::vector< double > &  lambdas  )  [inline]

Translate frequencies to log probabilities.

At level n we need the probability of the word at level n-1. For example,

P(C| A B) = l3 ML(C| A B) + l2 ML(C| B) + l1 ML(C) + l0, which is P(C| A B) = l3 ML(C| A B) + P(C| B), where P(C| B) = l2 ML(C| B) + l1 ML(C) + l0,

so P(C| B) is calculated first.

ML is the maximum likelihood probability:

ML(C) = f(C)/N
ML(C|B) = f(B, C)/f(B)
ML(C|A B) = f(A, B, C)/f(A, B)

where N is the number of tokens in the corpus.
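The formulas above can be checked with a small worked sketch. It is not the class's own code: TrigramCounts, ml, bigram_prob and trigram_prob are hypothetical names, the lambdas are assumed to be ordered {l0, l1, l2, l3}, a zero denominator is assumed to give an ML estimate of 0, and plain probabilities are returned rather than the log probabilities that counts_to_prob stores.

```cpp
#include <cassert>
#include <vector>

// Hypothetical bundle of the counts needed for one trigram A B C.
struct TrigramCounts {
    double f_abc;  // f(A, B, C)
    double f_ab;   // f(A, B)
    double f_bc;   // f(B, C)
    double f_b;    // f(B)
    double f_c;    // f(C)
    double n;      // N, number of tokens in the corpus
};

// Maximum likelihood estimate; unseen contexts are assumed to give 0.
static double ml(double num, double den) { return den > 0.0 ? num / den : 0.0; }

// P(C | B) = l2 ML(C | B) + l1 ML(C) + l0
double bigram_prob(const TrigramCounts& c, const std::vector<double>& l) {
    assert(l.size() == 4);  // {l0, l1, l2, l3}
    return l[2] * ml(c.f_bc, c.f_b) + l[1] * ml(c.f_c, c.n) + l[0];
}

// P(C | A B) = l3 ML(C | A B) + P(C | B), reusing the lower-level result.
double trigram_prob(const TrigramCounts& c, const std::vector<double>& l) {
    assert(l.size() == 4);
    return l[3] * ml(c.f_abc, c.f_ab) + bigram_prob(c, l);
}
```

With f(A, B, C) = 1, f(A, B) = 2, f(B, C) = 2, f(B) = 4, f(C) = 5, N = 10 and lambdas {0, 0.25, 0.25, 0.5}, every ML estimate is 0.5, so P(C | B) = 0.25 and P(C | A B) = 0.5. A caller wanting log probabilities would apply std::log to the results.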

References Tanl::POS::Context< CT >::childs.


The documentation for this struct was generated from the following files:
Copyright © 2005-2011 G. Attardi. Generated on 4 Mar 2011 by doxygen 1.6.1.