Tanl Linguistic Pipeline

Tanl::POS::Context< CT > Class Template Reference

This is a language model to calculate P(C | A, B) with linear interpolation.
#include <linear_interpolated_lm.h>
Public Types

    typedef std::tr1::unordered_map< int, Context< CT > * >  CMap
    typedef std::tr1::unordered_map< CT, double >            WordFreq

Public Member Functions

    void serialize (std::ostream &out)
    void serialize (std::istream &in)
    void add_word (std::vector< int >::const_iterator context, int n, CT &word)
        Add token word in context context.
    WordFreq & get_words ()
    unsigned total_context_freq ()
    size_t word_count_at_context ()
    std::vector< double > calculate_lambdas (int level)
    void counts_to_prob (std::vector< double > &lambdas)
        Translate frequencies to log probabilities.
    double wordprob (CT const &word, std::vector< int > &context)

Public Attributes

    unsigned freq
        Total word occurrences.
    CMap childs
        Context map.
    WordFreq words
        Word map.
This is a language model to calculate P(C | A, B) with linear interpolation, i.e.

    P(C | A B) = l3 ML(C | A B) + l2 ML(C | B) + l1 ML(C) + l0

The lambdas are calculated as in Brants (2000).

The data structure is similar to the one used by SRILM: a context tree whose root holds the null context (unigrams), whose node B holds bigrams starting with B, and whose node B->A holds trigrams starting with A B. Every node of the context tree has a word map storing the frequency of each word given that context.

The item type is parametric: int for tags, string for words.

This module first counts the frequencies, then calculates the lambda parameters of the linear interpolation, and finally transforms the frequencies into probabilities.

TODO: frequency counting and lambda calculation should be separated.
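A minimal usage sketch, under stated assumptions: CT = int for a tag model, the toy corpus, the most-recent-first context ordering (suggested by the root -> B -> A tree layout above) and the value passed to calculate_lambdas are all illustrative guesses, not documented API facts.

    #include <iostream>
    #include <vector>
    #include <linear_interpolated_lm.h>

    int main()
    {
      // Trigram tag model: CT = int, contexts are vectors of tag ids.
      Tanl::POS::Context<int> lm;

      // Toy corpus of tag ids (illustrative).
      int tags[] = {3, 7, 5, 3, 7, 5, 3, 7};
      std::vector<int> corpus(tags, tags + 8);

      // Count frequencies: each token is added in its two-tag left
      // context, assumed most-recent-first (B, then A).
      for (size_t i = 2; i < corpus.size(); ++i) {
        std::vector<int> context;
        context.push_back(corpus[i - 1]);   // B
        context.push_back(corpus[i - 2]);   // A
        lm.add_word(context.begin(), 2, corpus[i]);
      }

      // Estimate the interpolation weights and turn the counts into
      // log probabilities (the level argument is a guess).
      std::vector<double> lambdas = lm.calculate_lambdas(2);
      lm.counts_to_prob(lambdas);

      // Query log P(5 | 3 7): context again most-recent-first.
      std::vector<int> ctx;
      ctx.push_back(7);   // B
      ctx.push_back(3);   // A
      std::cout << lm.wordprob(5, ctx) << std::endl;
      return 0;
    }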
void Tanl::POS::Context< CT >::add_word (std::vector< int >::const_iterator context, int n, CT &word)  [inline]
Add token word in context context.
The contexts considered are the n-grams [context : context+i], 0 <= i < n. To add the trigram A B C, walk down the tree and add the word C at every level.
References Tanl::POS::Context< CT >::add_word(), Tanl::POS::Context< CT >::childs, Tanl::POS::Context< CT >::freq, and Tanl::POS::Context< CT >::words.
Referenced by Tanl::POS::Context< CT >::add_word().
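A sketch of how this descent could be written, shown as a free function over the public attributes (freq, words, childs); it is recursive, consistent with the cross references above, but the on-demand allocation of missing child nodes is an assumption, not documented behaviour.

    #include <vector>
    #include <linear_interpolated_lm.h>

    // Sketch only: the real add_word is a member of Context< CT >.
    template <class CT>
    void add_word_sketch(Tanl::POS::Context<CT>& node,
                         std::vector<int>::const_iterator context,
                         int n, CT& word)
    {
      ++node.freq;          // one more token observed under this context
      ++node.words[word];   // count the word at the current level

      if (n == 0)           // longest context reached: stop descending
        return;

      // Move to the child holding the next (older) context item and add
      // the word there as well, so A B C updates root, B and B->A.
      Tanl::POS::Context<CT>*& child = node.childs[*context];
      if (child == 0)
        child = new Tanl::POS::Context<CT>();
      add_word_sketch(*child, context + 1, n - 1, word);
    }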
std::vector< double > Tanl::POS::Context< CT >::calculate_lambdas (int level)  [inline]
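No detailed description accompanies this member; the class description only says the lambdas are calculated as in Brants (2000). The following free-standing sketch of that deleted-interpolation scheme is an illustration of the formula, not of this member's code: it works on a flat n-gram count table rather than the context tree, and it does not cover the constant l0 fallback.

    #include <map>
    #include <vector>

    // Deleted interpolation as in Brants (2000), over a flat count table
    // keyed by n-grams of any length.
    typedef std::map<std::vector<int>, double> Counts;

    static double count_of(const Counts& f, const std::vector<int>& ngram)
    {
      Counts::const_iterator it = f.find(ngram);
      return it == f.end() ? 0.0 : it->second;
    }

    std::vector<double> brants_lambdas(const Counts& f, double N)
    {
      std::vector<double> lambda(3, 0.0);          // l1, l2, l3
      for (Counts::const_iterator it = f.begin(); it != f.end(); ++it) {
        if (it->first.size() != 3) continue;       // only trigrams vote
        int t1 = it->first[0], t2 = it->first[1], t3 = it->first[2];
        double f123 = it->second;

        std::vector<int> t12(2); t12[0] = t1; t12[1] = t2;
        std::vector<int> t23(2); t23[0] = t2; t23[1] = t3;
        std::vector<int> s2(1, t2), s3(1, t3);

        // Three candidate estimates, each with the current trigram's
        // occurrence "deleted" from the counts.
        double c3 = count_of(f, t12) > 1 ? (f123 - 1) / (count_of(f, t12) - 1) : 0.0;
        double c2 = count_of(f, s2)  > 1 ? (count_of(f, t23) - 1) / (count_of(f, s2) - 1) : 0.0;
        double c1 = N > 1 ? (count_of(f, s3) - 1) / (N - 1) : 0.0;

        // The largest estimate's lambda receives the trigram's count.
        if (c3 >= c2 && c3 >= c1)      lambda[2] += f123;   // l3
        else if (c2 >= c1)             lambda[1] += f123;   // l2
        else                           lambda[0] += f123;   // l1
      }
      double sum = lambda[0] + lambda[1] + lambda[2];
      if (sum > 0)
        for (int i = 0; i < 3; ++i) lambda[i] /= sum;
      return lambda;
    }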
void Tanl::POS::Context< CT >::counts_to_prob (std::vector< double > &lambdas)  [inline]
Translate frequencies to log probabilities.
At level n we need to know the probability of the word at level n-1. For example:

    P(C | A B) = l3 ML(C | A B) + l2 ML(C | B) + l1 ML(C) + l0
               = l3 ML(C | A B) + P(C | B)

so P(C | B) is calculated first.

ML is the maximum likelihood probability:

    ML(C)       = f(C) / N
    ML(C | B)   = f(B, C) / f(B)
    ML(C | A B) = f(A, B, C) / f(A, B)

where N is the number of tokens in the corpus.
References Tanl::POS::Context< CT >::childs.
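A possible shape for the recursion described above, written as a free function over the public attributes. This is assumed code, not the shipped member: the lambdas are taken to hold {l0, l1, l2, l3}, the parent map carries the previous level's linear (not log) probabilities, and the level/parent arguments are just how this sketch threads state between levels.

    #include <cmath>
    #include <vector>
    #include <linear_interpolated_lm.h>

    // Sketch: each node turns its counts into ML estimates, interpolates
    // them with the probabilities already computed for the shorter context
    // (parent), hands its own linear probabilities down to the children,
    // and finally stores log probabilities in its word map.
    template <class CT>
    void counts_to_prob_sketch(Tanl::POS::Context<CT>& node,
                               const std::vector<double>& lambdas,
                               int level,
                               const typename Tanl::POS::Context<CT>::WordFreq& parent)
    {
      typedef typename Tanl::POS::Context<CT>::WordFreq WordFreq;
      typedef typename Tanl::POS::Context<CT>::CMap CMap;

      WordFreq interpolated;                       // linear P(word | this context)
      typename WordFreq::iterator w;
      for (w = node.words.begin(); w != node.words.end(); ++w) {
        double ml = w->second / node.freq;         // ML(word | context)
        typename WordFreq::const_iterator p = parent.find(w->first);
        double lower = (p == parent.end()) ? lambdas[0]   // l0 at the root
                                           : p->second;   // P at level n-1
        interpolated[w->first] = lambdas[level] * ml + lower;
      }

      // P(C | B) must exist before P(C | A B) uses it, so the children
      // (longer contexts) receive this level's probabilities now.
      for (typename CMap::iterator c = node.childs.begin();
           c != node.childs.end(); ++c)
        counts_to_prob_sketch(*c->second, lambdas, level + 1, interpolated);

      // Replace this node's counts with log probabilities.
      for (w = node.words.begin(); w != node.words.end(); ++w)
        w->second = std::log(interpolated[w->first]);
    }

A root call would be counts_to_prob_sketch(root, lambdas, 1, Tanl::POS::Context<int>::WordFreq()): unigrams are interpolated against the constant l0, and every deeper level against the level just computed, matching the equation above.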