IXE::Lexicon Class Reference

Manage and use an inverted index dictionary. More...

#include <lexicon.h>

Inheritance diagram for IXE::Lexicon:

Classes
class	Cursor
Public Types
typedef std::pair < const_iterator, const_iterator >	Range
	A Range is-a pair of iterators marking the beginning and end of a range over which a given word matches.
Public Member Functions
void	access (mappedFile &mf, off32_t &offset)
	Access the mapped file containing the data.
void	accessPostings (FileHandle *fh)
	Access the file containing the postings.
Size	getWord (TermID termID, char *term, Size size) const
	Get the word with given TermID.
const_iterator	find (char const *word)
	Lookup the given word in the lexicon.
const_iterator	find (byte *word, Size len)
	Lookup the given word, with the given length in the lexicon.
Range	findPrefix (char const *term)
	Compute the range which contains words starting with the given prefix.
Range	findPrefix (char const *term, Size len)
	Determines the range of words in the Lexicon which start with the given prefix.
Range	findPrefix (wchar_t const *term, Size len)
Count	frequency (char const *word)
Cursor	cursor (int i=0)
Public Attributes
BigramTable	bigrams
FileHandle *	postingsFile
	Handle to the file containing the postings.
Friends
class	Cursor
struct	EntryCompare

Detailed Description

Manage and use an inverted index dictionary.

 SYNOPSIS

 include <ixe.h>

 using namespace IXE;
 Lexicon dict(filename);
 TermInfo tinfo = dict.get(term);

Lexicon maps strings to unique identifiers and frequency in the inverted index. Whenever a new word is found, the Lexicon class can be asked to assign it a termID. The frequency is useful for boolean query optimizations.

The termID are integers in the range 1 to 2^32 exclusive.

A supra-index (bigram index) is used to speed up searching words in the Lexicon. It has 256*256+1 entries.
Suppose the word 'accent' is the first word with starting with 'ac' then the value of char 'a' and 'c' is used to get an index into the bigram index (ie 97 * 256 + 99 = 24931 is the index into L1). The last entry points to the last word.

For example, suppose there are the following words:

 abate
 accent
 acute
 aerial

The word index keeps the word stripped of the first 2 characters.

The BigramTable and the word_index will look like:

               bigrams        word_index

             +----------+     +----------+
   ab: 24930 |      ----+---> |        -------> ate...
             +----------+     +----------+
   ac: 24931 |      ----+---> |        -------> cent...
             +----------+     +----------+
   ad: 24932 |      ----+-+   |        -------> cord...
             +----------+ |   +----------+
   ae: 24933 |      ----+-+   |        -------> ute...
             +----------+ |   +----------+
                          +-> |        -------> rial...
                              +----------+

The bigram index is used to narrow the search to a portion of the full index.

When searching for the word 'acute', first we extract 'ac' do a look up and get begin = bigrams['ac'], an iterator referring to the first word starting with 'ac', i.e. 'accent'.

The next entry (end = bigrams['ac' + 1]) contains the index of the first word lexicographically after 'ad' (it may or may not start with 'ad'). In this case it is 'aerial', since there are no words starting with 'ad'.

A binary search is performed using the STL procedure equal_range on the two iterators 'begin' and 'end' and so 'acute' is found.

The entries after the last one pointing to a word in the index are filled with the index one beyond the last one in word_index.

Member Function Documentation

const_iterator IXE::Lexicon::find	(	byte *	word,
		Size	len
	)

Lookup the given word, with the given length in the lexicon.

Use this for Unicode or Huffman encoded strings.

Lexicon::const_iterator IXE::Lexicon::find ( char const * word )

Lookup the given word in the lexicon.

Looks for a word in the Lexicon.

Parameters:

word

the word to find

Returns:: An iterator pointing to the word entry. If no such words exists, returns lexicon.end().

Referenced by frequency().

Lexicon::Range IXE::Lexicon::findPrefix	(	char const *	term,
		Size	len
	)

Determines the range of words in the Lexicon which start with the given prefix.

Parameters:

	t	the prefix
	len	the prefix length

Returns:: a range of first and last word starting with prefix. If no such words exists, range.first == range.second.

Count IXE::Lexicon::frequency ( char const * word )

Returns:: the document frequency of the term, 0 if not present.

References find(), and postingsFile.

Size IXE::Lexicon::getWord	(	TermID	termID,
		char *	term,
		Size	size
	)			const

Get the word with given TermID.

Result is stored into

Parameters:

	term,whose	size is
	size.

Returns:: the length of the word, or -1 if not found.

The documentation for this class was generated from the following files:

search/ixe/Common/lexicon.h
search/ixe/Common/lexicon.cpp

IXE::Lexicon Class Reference

Classes

Public Types

Public Member Functions

Public Attributes

Friends

Detailed Description

Member Function Documentation