Tanl Linguistic Pipeline

IXE::Lexicon Class Reference

Manage and use an inverted index dictionary. More...

#include <lexicon.h>

Inheritance diagram for IXE::Lexicon:
IXE::IndexTable< LexEntry >

List of all members.

Classes

class  Cursor

Public Types

typedef std::pair
< const_iterator,
const_iterator > 
Range
 A Range is-a pair of iterators marking the beginning and end of a range over which a given word matches.

Public Member Functions

void access (mappedFile &mf, off32_t &offset)
 Access the mapped file containing the data.
void accessPostings (FileHandle *fh)
 Access the file containing the postings.
Size getWord (TermID termID, char *term, Size size) const
 Get the word with given TermID.
const_iterator find (char const *word)
 Lookup the given word in the lexicon.
const_iterator find (byte *word, Size len)
 Lookup the given word, with the given length in the lexicon.
Range findPrefix (char const *term)
 Compute the range which contains words starting with the given prefix.
Range findPrefix (char const *term, Size len)
 Determines the range of words in the Lexicon which start with the given prefix.
Range findPrefix (wchar_t const *term, Size len)
Count frequency (char const *word)
Cursor cursor (int i=0)

Public Attributes

BigramTable bigrams
FileHandlepostingsFile
 Handle to the file containing the postings.

Friends

class Cursor
struct EntryCompare

Detailed Description

Manage and use an inverted index dictionary.

 SYNOPSIS
 include <ixe.h>
 using namespace IXE;
 Lexicon dict(filename);
 TermInfo tinfo = dict.get(term);
 

Lexicon maps strings to unique identifiers and frequency in the inverted index. Whenever a new word is found, the Lexicon class can be asked to assign it a termID. The frequency is useful for boolean query optimizations.

The termID are integers in the range 1 to 2^32 exclusive.

A supra-index (bigram index) is used to speed up searching words in the Lexicon. It has 256*256+1 entries.
Suppose the word 'accent' is the first word with starting with 'ac' then the value of char 'a' and 'c' is used to get an index into the bigram index (ie 97 * 256 + 99 = 24931 is the index into L1). The last entry points to the last word.

For example, suppose there are the following words:

 abate
 accent
 acute
 aerial
 

The word index keeps the word stripped of the first 2 characters.

The BigramTable and the word_index will look like:

               bigrams        word_index
             +----------+     +----------+
   ab: 24930 |      ----+---> |        -------> ate...
             +----------+     +----------+
   ac: 24931 |      ----+---> |        -------> cent...
             +----------+     +----------+
   ad: 24932 |      ----+-+   |        -------> cord...
             +----------+ |   +----------+
   ae: 24933 |      ----+-+   |        -------> ute...
             +----------+ |   +----------+
                          +-> |        -------> rial...
                              +----------+
 

The bigram index is used to narrow the search to a portion of the full index.

When searching for the word 'acute', first we extract 'ac' do a look up and get begin = bigrams['ac'], an iterator referring to the first word starting with 'ac', i.e. 'accent'.

The next entry (end = bigrams['ac' + 1]) contains the index of the first word lexicographically after 'ad' (it may or may not start with 'ad'). In this case it is 'aerial', since there are no words starting with 'ad'.

A binary search is performed using the STL procedure equal_range on the two iterators 'begin' and 'end' and so 'acute' is found.

The entries after the last one pointing to a word in the index are filled with the index one beyond the last one in word_index.


Member Function Documentation

const_iterator IXE::Lexicon::find ( byte *  word,
Size  len 
)

Lookup the given word, with the given length in the lexicon.

Use this for Unicode or Huffman encoded strings.

Lexicon::const_iterator IXE::Lexicon::find ( char const *  word  ) 

Lookup the given word in the lexicon.

Looks for a word in the Lexicon.

Parameters:
word the word to find
Returns:
An iterator pointing to the word entry. If no such words exists, returns lexicon.end().

Referenced by frequency().

Lexicon::Range IXE::Lexicon::findPrefix ( char const *  term,
Size  len 
)

Determines the range of words in the Lexicon which start with the given prefix.

Parameters:
t the prefix
len the prefix length
Returns:
a range of first and last word starting with prefix. If no such words exists, range.first == range.second.
Count IXE::Lexicon::frequency ( char const *  word  ) 
Returns:
the document frequency of the term, 0 if not present.

References find(), and postingsFile.

Size IXE::Lexicon::getWord ( TermID  termID,
char *  term,
Size  size 
) const

Get the word with given TermID.

Result is stored into

Parameters:
term,whose size is
size. 
Returns:
the length of the word, or -1 if not found.

The documentation for this class was generated from the following files:
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Defines
 
Copyright © 2005-2011 G. Attardi. Generated on 4 Mar 2011 by doxygen 1.6.1.