Tanl Linguistic Pipeline |
Manage and use an inverted index dictionary. More...
#include <lexicon.h>
Classes | |
class | Cursor |
Public Types | |
typedef std::pair < const_iterator, const_iterator > | Range |
A Range is-a pair of iterators marking the beginning and end of a range over which a given word matches. | |
Public Member Functions | |
void | access (mappedFile &mf, off32_t &offset) |
Access the mapped file containing the data. | |
void | accessPostings (FileHandle *fh) |
Access the file containing the postings. | |
Size | getWord (TermID termID, char *term, Size size) const |
Get the word with given TermID. | |
const_iterator | find (char const *word) |
Lookup the given word in the lexicon. | |
const_iterator | find (byte *word, Size len) |
Lookup the given word, with the given length in the lexicon. | |
Range | findPrefix (char const *term) |
Compute the range which contains words starting with the given prefix. | |
Range | findPrefix (char const *term, Size len) |
Determines the range of words in the Lexicon which start with the given prefix. | |
Range | findPrefix (wchar_t const *term, Size len) |
Count | frequency (char const *word) |
Cursor | cursor (int i=0) |
Public Attributes | |
BigramTable | bigrams |
FileHandle * | postingsFile |
Handle to the file containing the postings. | |
Friends | |
class | Cursor |
struct | EntryCompare |
Manage and use an inverted index dictionary.
SYNOPSIS
include <ixe.h>
using namespace IXE; Lexicon dict(filename); TermInfo tinfo = dict.get(term);
Lexicon maps strings to unique identifiers and frequency in the inverted index. Whenever a new word is found, the Lexicon class can be asked to assign it a termID. The frequency is useful for boolean query optimizations.
The termID are integers in the range 1 to 2^32 exclusive.
A supra-index (bigram index) is used to speed up searching words in the Lexicon. It has 256*256+1 entries.
Suppose the word 'accent' is the first word with starting with 'ac' then the value of char 'a' and 'c' is used to get an index into the bigram index (ie 97 * 256 + 99 = 24931 is the index into L1). The last entry points to the last word.
For example, suppose there are the following words:
abate accent acute aerial
The word index keeps the word stripped of the first 2 characters.
The BigramTable and the word_index will look like:
bigrams word_index
+----------+ +----------+ ab: 24930 | ----+---> | -------> ate... +----------+ +----------+ ac: 24931 | ----+---> | -------> cent... +----------+ +----------+ ad: 24932 | ----+-+ | -------> cord... +----------+ | +----------+ ae: 24933 | ----+-+ | -------> ute... +----------+ | +----------+ +-> | -------> rial... +----------+
The bigram index is used to narrow the search to a portion of the full index.
When searching for the word 'acute', first we extract 'ac' do a look up and get begin
= bigrams['ac'], an iterator referring to the first word starting with 'ac', i.e. 'accent'.
The next entry (end
= bigrams['ac' + 1]) contains the index of the first word lexicographically after 'ad' (it may or may not start with 'ad'). In this case it is 'aerial', since there are no words starting with 'ad'.
A binary search is performed using the STL procedure equal_range on the two iterators 'begin'
and 'end'
and so 'acute' is found.
The entries after the last one pointing to a word in the index are filled with the index one beyond the last one in word_index.
const_iterator IXE::Lexicon::find | ( | byte * | word, | |
Size | len | |||
) |
Lookup the given word, with the given length in the lexicon.
Use this for Unicode or Huffman encoded strings.
Lexicon::const_iterator IXE::Lexicon::find | ( | char const * | word | ) |
Lookup the given word in the lexicon.
Looks for a word in the Lexicon.
word | the word to find |
Referenced by frequency().
Lexicon::Range IXE::Lexicon::findPrefix | ( | char const * | term, | |
Size | len | |||
) |
Count IXE::Lexicon::frequency | ( | char const * | word | ) |
References find(), and postingsFile.
Size IXE::Lexicon::getWord | ( | TermID | termID, | |
char * | term, | |||
Size | size | |||
) | const |
Get the word with given TermID.
Result is stored into
term,whose | size is | |
size. |