Tanl Linguistic Pipeline |
Global configuration parameters. More...
Namespaces | |
namespace | io |
Platform independent IO. | |
Classes | |
class | Pattern2 |
Extension of RegExp::Pattern that stores original pattern for serialization. More... | |
class | conf< Replacements > |
class | DocInfo |
Abstract class for document info. More... | |
struct | eptacode |
struct | FileFormat |
This variable contains the version number of the file format for the index. More... | |
struct | FileHeader |
Header for fulltext index files. More... | |
class | IndexTable |
An instance of this class is used to access either the word, stop-word, file, or meta-name index portions of an index file. More... | |
class | BigramTable |
A BigramTable contains the index (TermID) of the first word in the Lexicon starting with that bigram. More... | |
class | StringTable |
class | Item |
Items represent values for fields. More... | |
class | ItemOf |
class | KeyValuePairs |
Represents a sorted collection of associated string keys and string values that can be accessed with the key. More... | |
struct | LexEntry |
struct | EntryCompare |
class | TermInfo |
class | Lexicon |
Manage and use an inverted index dictionary. More... | |
struct | Condition |
struct | SimpleCondition |
Condition without the additional lock required by pthread_cond_wait(). More... | |
struct | Lock |
Lock object, used for synchronization. More... | |
class | Locked |
class | LockUp |
Mutex interface: locks the mutex at creation and unlocks it at destruction. More... | |
class | Array |
Array of persistent objects. More... | |
class | ArrayOf |
class | Indexable |
Interface for classes providing indexer access. More... | |
class | Reference |
Reference to another persistent object. More... | |
struct | NullType |
struct | TrueType |
struct | If |
struct | If< NullType, Positive, Negative > |
struct | isPointer |
struct | isPointer< T * > |
struct | isReference |
struct | isReference< Reference< T > > |
struct | isClass |
struct | isClass< T * > |
struct | isClass< bool > |
struct | isClass< char > |
struct | isClass< signed char > |
struct | isClass< short int > |
struct | isClass< int > |
struct | isClass< long int > |
struct | isClass< float > |
struct | isClass< double > |
struct | isClass< long double > |
struct | isClass< unsigned char > |
struct | isClass< unsigned short int > |
struct | isClass< unsigned int > |
struct | isClass< unsigned long int > |
struct | isClass< const bool > |
struct | isClass< const char > |
struct | isClass< const short int > |
struct | isClass< const int > |
struct | isClass< const long int > |
struct | isClass< const float > |
struct | isClass< const double > |
struct | isClass< const long double > |
struct | isClass< const unsigned char > |
struct | isClass< const unsigned short int > |
struct | isClass< const unsigned int > |
struct | isClass< const unsigned long int > |
struct | isClass< signed long long int > |
struct | isClass< unsigned long long int > |
struct | isClass< const signed long long int > |
struct | isClass< const unsigned long long int > |
struct | isClass< std::vector< T > > |
struct | isClass< std::string > |
struct | isArray |
struct | ArrayType |
struct | isArray< ArrayOf< T > > |
struct | ArrayType< ArrayOf< T > > |
struct | ArrayType< std::vector< T > > |
struct | isArray< std::vector< T > > |
struct | deref |
struct | deref< T * > |
class | HasMetaClass |
Determine if a class has a MetaClass. More... | |
struct | MetaClassOf |
struct | MetaClassOf< T, 1 > |
class | Field |
Abstract class for representing fields in tables. More... | |
class | CompositeField |
Composite fields. More... | |
class | FixedField |
Fixed size fields. More... | |
class | VarField |
Variable size fields. More... | |
class | ReferenceField |
Reference fields. More... | |
class | ArrayField |
Arrays for fields. More... | |
class | ArrayField< ArrayOf< T > > |
class | ArrayField< std::vector< T > > |
class | ArrayField< std::vector< std::string > > |
class | ArrayField< std::vector< char const * > > |
struct | CompositeBuilder |
Interface for creating fields. More... | |
struct | FieldBuilder |
class | MetaClass |
Describes the structure of an object. More... | |
class | AnyObject |
Generic object, used for reading/writing dynamically defined tables. More... | |
struct | Options |
Options describes a set of command-line options. More... | |
class | OptionStream |
Given the traditional argc and argv for command-line arguments, extract options from them following the stream model. More... | |
struct | PostingOffset |
class | PostingList |
This class, given an IndexTable::const_iterator, accesses the list of postings for a word. More... | |
class | Runnable |
struct | DBT |
Represents items inserted/extracted from DB. More... | |
class | SubField |
Represents fields in table containing references to other objects. More... | |
class | MappedSubField |
Represents fields in a table containing references to other objects. This is the version for value type objects, which are stored in a separate mapped file. More... | |
class | SubCursor |
Represents a cursor on a Reference subfield. More... | |
struct | NoDoc |
Predicate false for any document. More... | |
class | Table |
class | DynamicTable |
DynamicTable. More... | |
class | InvalidThreadStateError |
class | Thread |
A class to start and manage a thread of execution. More... | |
class | ThreadGroup |
Java-like ThreadGroup. More... | |
class | ThreadPool |
A ThreadPool pre-creates and manages a pool of persistent threads to do tasks taken from a queue. More... | |
struct | FileAction |
Set the limit for the given resource to its maximum value. More... | |
class | conf< ColorMap > |
conf<ColorMap> is a Var containing a set of pair<TermColor, TermWeight> associated to an HTML tag/meta-attribute. More... | |
class | Configuration |
A Configuration object that holds all the configuration variables. More... | |
class | Var |
Configuration variable. More... | |
class | VarDefault |
Configuration variable with default value. More... | |
class | conf |
class | conf< bool > |
A conf<bool> is a Var for containing the value of a Boolean configuration variable. More... | |
class | conf< Dictionary > |
class | conf< float > |
A conf<float> is a Var for containing the value of a float configuration variable. More... | |
class | conf< int > |
A conf<int> is a Var for containing the value of an integer configuration variable. More... | |
class | conf< PatternSet > |
A conf_PatternSet contains a list of shell wildcard patterns. More... | |
class | conf_set |
A conf_set contains a set of configuration values. More... | |
class | conf< std::string > |
A conf<string> is a configuration variable containing a string value. More... | |
class | conf< std::vector< std::string > > |
A conf_vector contains a set of configuration values. More... | |
class | Conversion |
A Conversion maps a filename pattern to a conversion command. More... | |
class | ExcludeFile |
An ExcludeFile contains the set of filename patterns to exclude during either indexing or extraction. More... | |
class | FileType |
A FileType maps a filename pattern to a file type. More... | |
class | IncludeFile |
An IncludeFile contains the set of filename patterns to include during either indexing or extraction. More... | |
class | PatternList |
A PatternList contains a list of shell wildcard patterns. More... | |
class | PatternMap |
A PatternMap maps a shell wildcard pattern to an object of type T. More... | |
class | PatternSet |
A PatternSet contains a set of shell wildcard patterns. More... | |
class | PatternVar |
A PatternVar is a configuration variable containing a set of filename patterns. More... | |
class | Enumerator |
Enumerator interface. More... | |
class | Error |
Base class for all errors reported. More... | |
class | LogicError |
Base class for errors due to programming errors. More... | |
class | RuntimeError |
Base class for errors due to run time problems. More... | |
class | AssertionError |
Thrown if an internal consistency check fails. More... | |
class | UnimplementedError |
Thrown when an attempt to use an unimplemented feature is made. More... | |
class | InvalidArgumentError |
Thrown when an invalid argument is supplied to the API. More... | |
class | ConfigFileError |
Thrown when reading a configuration file fails. More... | |
class | FileError |
Thrown when opening a file fails. More... | |
class | MmapError |
Thrown when mmap fails mapping a file to memory. More... | |
class | FormatError |
Wrong index format file. More... | |
class | DocNotFoundError |
Thrown when an attempt is made to access a document which is not in the collection. More... | |
class | InternalError |
Thrown when an internal inconsistency occurs. More... | |
class | IndexingError |
Thrown during indexing. More... | |
class | RangeError |
Thrown when an element is out of range. More... | |
class | ReaderError |
Thrown when reader fails interpreting document format. More... | |
class | CollectionError |
Thrown for miscellaneous collection errors. More... | |
class | NetworkError |
Thrown when there is a communications problem with a remote collection. More... | |
class | MemoryError |
Thrown when there is a communications problem with a remote collection. More... | |
class | OpeningError |
Thrown when opening a collection fails. More... | |
class | TableError |
Thrown when accessing a database Table fails. More... | |
class | ParserError |
class | QueryError |
Thrown when an SQL query fails. More... | |
class | InvalidResultError |
Thrown when trying to access invalid data. More... | |
class | SystemError |
Thrown when a system call fails. More... | |
class | IOError |
class | Set |
A Set is a set but with the addition of a contains() member function, one that returns a simpler bool result indicating whether a given element is in the set. More... | |
class | Set< char const * > |
Specialize Set for C-style strings so as not to have a reference (implemented as a pointer) to a char const*. More... | |
struct | TermHit |
TermHit is used to represent a word occurrence in a document, a sentence delimiter or a tag. More... | |
class | Timer |
class | unordered_map |
struct | IVisitor |
Define Visitable classes as: More... | |
struct | Visitor |
struct | Visitable |
struct | FileEnum |
Typedefs | |
typedef std::vector< std::pair < Pattern2, std::string > > | Replacements |
A conf_Replacements contains pairs of (RegExp::Pattern, replacement). | |
typedef std::map< DocID, DocID > | Remap |
Pairs <from, to>, sorted by increasing from. | |
typedef std::map< char const *, char const * > | Dictionary |
A conf_dictionary contains a dictionary. | |
typedef conf_set< std::string > | conf_stringset |
A conf_set contains a set of configuration string values. | |
typedef FileType | MimeType |
A MimeType maps a mime type to a document reader type. | |
typedef unsigned | DocID |
DocID is a numeric ID for documents in IXE collections. | |
typedef unsigned short | HitPosition |
Word position in document. | |
typedef unsigned short | Occurrences |
Number of occurrences of a word in a document. | |
typedef short | TermColor |
TermColor is a numeric ID of a 'color' attribute for the word. | |
typedef unsigned | TermID |
typedef unsigned | Count |
typedef unsigned | Size |
typedef unsigned char | byte |
typedef Set< char const * > | chars_set |
typedef signed char | int1 |
typedef unsigned char | nat1 |
typedef signed short | int2 |
typedef unsigned short | nat2 |
typedef signed int | int4 |
typedef unsigned int | nat4 |
typedef float | real |
typedef float | real4 |
typedef double | real8 |
typedef unsigned long long | nat8 |
typedef signed long long | int8 |
typedef unsigned char | uchar |
typedef unsigned char | uint8 |
typedef unsigned short | uint16 |
typedef unsigned long | ulong |
typedef long long | longlong |
typedef unsigned long long | ulonglong |
typedef short | TermWeight |
Weight associated to a term: usually depends on the color. | |
Enumerations | |
enum | index_id { stop_word_index = 0, color_index = 1 } |
enum | { Exit_Success = 0, Exit_Config_File = 1, Exit_Usage = 2, Exit_Malformed_Query = 40, Exit_No_Read_Index = 50, Exit_No_Write_PID = 51, Exit_No_Socket = 52, Exit_No_Unlink = 53, Exit_No_Bind = 54, Exit_No_Listen = 55, Exit_No_Accept = 56, Exit_No_Fork = 57, Exit_No_Change_Dir = 58, Exit_No_Create_Thread = 59, Exit_No_Detach_Thread = 60, Exit_End_Enum_Marker } |
Functions | |
REGISTER (DocInfo) | |
std::ostream & | operator<< (std::ostream &s, const DocInfo &d) |
ostream & | outEptacode (ostream &o, register unsigned n) |
Write an unsigned integer to the given ostream in eptabit binary coding (base 127), low digit first. | |
ostream & | padEptacode (ostream &o, register unsigned no, unsigned old) |
Output number no to o stream, padding it so that it fits the same space occupied by previous number old, assuming that no <= old. | |
unsigned int | toEptacode (unsigned char *dst, register unsigned n) |
Write. | |
unsigned int | parseEptacode (register unsigned char const *&p) |
Parse an integer from an EPTACODE-encoded byte sequence (low digit first). | |
std::ostream & | outEptacode (std::ostream &, unsigned) |
std::ostream & | padEptacode (std::ostream &o, register unsigned no, unsigned old) |
std::ostream & | operator<< (std::ostream &o, const eptacode &e) |
std::ostream & | operator<< (std::ostream &s, const Item &t) |
template<class T > | |
std::ostream & | operator<< (std::ostream &s, const ItemOf< T > &t) |
Field * | makeField (char const *name, char const *typeName, Size &offs, Size maxLength) |
Create a Field dynamically, eg. | |
REGISTER (MetaClass) | |
template<Size size> | |
void | storeBigEndian (byte *dst, byte *src) |
template<> | |
void | storeBigEndian< 1 > (byte *dst, byte *src) |
template<> | |
void | storeBigEndian< 2 > (byte *dst, byte *src) |
template<> | |
void | storeBigEndian< 4 > (byte *dst, byte *src) |
template<> | |
void | storeBigEndian< 8 > (byte *dst, byte *src) |
template<int size> | |
void | fetchBigEndian (byte *dst, byte *src) |
template<int size> | |
void | storeLittleEndian (byte *dst, byte *src) |
template<> | |
void | storeLittleEndian< 1 > (byte *dst, byte *src) |
template<> | |
void | storeLittleEndian< 2 > (byte *dst, byte *src) |
template<> | |
void | storeLittleEndian< 4 > (byte *dst, byte *src) |
template<> | |
void | storeLittleEndian< 8 > (byte *dst, byte *src) |
template<int size> | |
void | fetchLittleEndian (byte *dst, byte *src) |
template<class T > | |
Field & | createField (char const *name, Size maxLength, Size offs, T *, Field::IndexType const indexType, char const *merge=0) |
std::ostream & | operator<< (std::ostream &s, const MetaClass &m) |
OptionStream & | operator>> (OptionStream &os, OptionStream::Option &o) |
Parse and extract an option from an option stream (argv values). | |
bool | realloc_record (DBT &record, ulong len) |
int | revlex_cmp (const char *aptr, int asiz, const char *bptr, int bsiz, void *op) |
Reverse lexicographic comparison for little endian machines. | |
int | float_cmp (const char *aptr, int asiz, const char *bptr, int bsiz, void *op) |
Float comparison for little endian machines. | |
void | fappend (char *f1, char *f2) |
void | incrementKey (DBT &key_) |
void | decrementKey (DBT &key_) |
TRESULT | ThreadMain (void *) |
void | mapDir (char const *pathname, FileAction &action, bool recurse_subdirectories, bool follow_symbolic_links, int verbosity) |
Perform action on each file in directory tree. | |
int | url_decode (char *dest, char const *src, int len) |
Decodes any %## encoding in the given string. | |
char * | url_encode (char const *s) |
Returns a string in which all non-alphanumeric characters except "-_.!~*'()," have been replaced with a percent (%) sign followed by two hex digits. | |
int | url_encode (char *dst, char const *s) |
Put into dst the URL encoding of s. | |
void | reverseURLdomain (char *revDomain, char const *url, Size len) |
Return the URL's site in reverse. | |
void | unreverseURLdomain (char *domain, char const *revDomain) |
Size | availableMemory () |
Detect the available memory. | |
void | cgi_parse (map< char const *, char const * > &keyMap, char *qstart) |
Parse a cgi query into a map of key values. | |
void | cgi_parse (std::map< char const *, char const * > &keyMap, char *qstart) |
Variables | |
FileFormat | fileFormatVersion = { 1, 3 } |
char const | Bext [] = "tcb" |
char const | Hext [] = "tch" |
char const | Fext [] = "tcf" |
char const | Uext [] = "unv" |
pthread_key_t | jmpbuf_key |
conf< ColorMap > | colorMapVar ("Colors") |
ColorMap & | colorMap = colorMapVar.value |
Contains a set of pair<TermColor, TermWeight> associated to an HTML tag/meta-attribute. | |
conf< bool > | VerboseConfig ("VerboseConfig", false) |
FileType | fileTypes |
MimeType | mimeTypes |
int const | WordMaxSize = 30 |
int const | WordMinSize = 2 |
char const | ConfigFileDefault [] = "ixe.conf" |
char const | TableNameDefault [] = "INDEX/docinfo" |
int const | FilesGrowDefault = 100 |
char const | IndexExt [] = ".fti" |
char const | PostingExt [] = ".pst" |
char const | TableExt [] = ".bdb" |
char const | ContentsExt [] = ".gz" |
int const | ResultsMaxDefault = 10 |
char const | TempDirectoryDefault [] = "/tmp" |
int const | WordPercentMaxDefault = 100 |
int const | Word_Threshold = 60000 |
int const | max_columns = 64 |
int const | max_prefix_lists = 5000 |
int const | Max_CursorAll_Hits = 20 |
int const | Postings_Segment_Size = 1024 |
int const | Min_Postings_Table = 4096 |
DocID const | noDocID = 0 |
HitPosition const | noPosition = 0 |
HitPosition const | maxPosition = (HitPosition)-1 |
Occurrences const | maxOccurrences = (Occurrences)-1 |
TermColor const | noColor = -1 |
TermColor const | color_Not_Found = -2 |
int const | num_bigrams = 256*256 + 1 |
TermWeight const | noWeight = 1 |
TermWeight const | repeatWeight = 0 |
char const | version [] = "0.9" |
Global configuration parameters.
All the IXE classes are declared in namespace IXE.
The IXE Toolkit is a set of modular C++ classes and utilities for indexing and querying documents.
typedef unsigned IXE::DocID |
DocID is a numeric ID for documents in IXE collections.
DocIDs start at 1; 0 is reserved for no ID.
typedef unsigned short IXE::HitPosition |
Word position in document.
Positions start at 1; 0 is reserved for no position.
typedef short IXE::TermColor |
TermColor is a numeric ID of a 'color' attribute for the word.
The color may represent a META NAME, or some HTML TAG.
int IXE::float_cmp | ( | const char * | aptr, | |
int | asiz, | |||
const char * | bptr, | |||
int | bsiz, | |||
void * | op | |||
) |
Float comparison for little endian machines.
OptionStream& IXE::operator>> | ( | OptionStream & | os, | |
OptionStream::Option & | o | |||
) |
Parse and extract an option from an option stream (argv values).
Options begin with either a '-' for short options or a "--" for long options. Either a '-' or "--" by itself explicitly ends the options; however, the difference is that '-' is returned as the first non-option whereas "--" is skipped entirely.
When there are no more options, the OptionStream converts to bool as false. The OptionStream's shift() member is the number of options parsed which the caller can use to adjust argc and argv.
Short options can take an argument either as the remaining characters of the same argv or in the next argv, unless the next argv looks like an option by beginning with a '-'.
Long option names can be abbreviated so long as the abbreviation is unambiguous. Long options can take an argument either directly after a '=' in the same argv or in the next argv (but without an '='), unless the next argv looks like an option by beginning with a '-'.
os | The OptionStream to extract options from | |
o | The option to deposit into. |
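The rules above for recognizing options can be illustrated with a minimal sketch. It is not the OptionStream implementation: `ArgKind`, `classifyArg` and `splitLongOption` are hypothetical names introduced only for illustration, and the sketch covers just the classification of a single argv element and the `name=value` form of long options, not abbreviation matching or arguments taken from the next argv.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical classification of one argv element, following the prose above.
enum ArgKind {
    ShortOption,          // "-x" or "-xvalue"
    LongOption,           // "--name" or "--name=value"
    EndOfOptionsKept,     // "-" alone: ends options, returned as first non-option
    EndOfOptionsSkipped,  // "--" alone: ends options, skipped entirely
    NonOption             // anything not starting with '-'
};

ArgKind classifyArg(char const* arg) {
    assert(arg != 0);
    if (arg[0] != '-') return NonOption;
    if (arg[1] == '\0') return EndOfOptionsKept;
    if (arg[1] != '-') return ShortOption;
    if (arg[2] == '\0') return EndOfOptionsSkipped;
    return LongOption;
}

// Split "--name=value" into (name, value); value is empty when there is no '='.
// The caller is expected to pass an argument classified as LongOption.
std::pair<std::string, std::string> splitLongOption(char const* arg) {
    assert(classifyArg(arg) == LongOption);
    char const* body = arg + 2;                 // skip the leading "--"
    char const* eq = std::strchr(body, '=');
    if (eq == 0)
        return std::make_pair(std::string(body), std::string());
    return std::make_pair(std::string(body, eq - body), std::string(eq + 1));
}
```

With these helpers, `classifyArg("-")` yields `EndOfOptionsKept` while `classifyArg("--")` yields `EndOfOptionsSkipped`, which mirrors the distinction drawn in the description.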
ostream& IXE::outEptacode | ( | ostream & | o, | |
register unsigned | n | |||
) |
Write an unsigned integer to the given ostream in eptabit binary coding (base 127), low digit first.
The last digit has sign bit 1.
o | The ostream to write to. | |
n | The number to be written. |
Here is the largest number encodable with n bytes:
bytes | largest number |
1 | 127 |
2 | 16129 |
3 | 2048383 |
4 | 260144641 |
5 | 33038369407 |
An alternative similar to Golomb encoding would be to use 3 bits unary encoded to represent length (0, 10, 110, 1110), and the rest in normal base 2. The largest number encodable would be:
bytes | largest number |
1 | 127 |
2 | 16384 |
3 | 2097152 |
4 | 268435456 |
5 | 34359738368 |
Since the difference is marginal, we opt for the first solution which is simpler to encode/decode.
Referenced by padEptacode().
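As an illustration of the coding described above, here is a minimal sketch of an encoder. It is not the IXE source: `encodeEptacodeSketch` is a hypothetical name, the radix 127 is taken from the description, and the sketch assumes that "sign bit 1" means bit 0x80 is set on the final digit only. It returns the bytes in a vector instead of writing to an ostream so that the result is easy to inspect.

```cpp
#include <cassert>
#include <vector>

// Assumptions taken from the description above, not from the IXE source.
unsigned const kEptaRadix = 127;               // "base 127"
unsigned char const kLastDigitFlag = 0x80;     // sign bit marks the last digit

// Encode n with the low digit first; only the final digit carries the flag.
std::vector<unsigned char> encodeEptacodeSketch(unsigned n) {
    std::vector<unsigned char> out;
    while (n >= kEptaRadix) {
        out.push_back(static_cast<unsigned char>(n % kEptaRadix));
        n /= kEptaRadix;
    }
    out.push_back(static_cast<unsigned char>(n | kLastDigitFlag));
    assert(out.back() & kLastDigitFlag);       // the terminator is always present
    return out;
}
```

Under these assumptions 126 occupies one byte (0xFE) and 127 occupies two bytes (0x00, 0x81), so the byte count grows by one at each power of 127.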
unsigned int IXE::parseEptacode | ( | register unsigned char const *& | p | ) | [inline] |
Parse an integer from an EPTACODE-encoded byte sequence (low digit first).
Integers are terminated by a byte with sign bit 1.
p | A pointer to the start of the EPTACODE-encoded integer. After an integer is parsed, p is left one past the terminator byte. |
Referenced by IXE::PostingList::HitsCursor::next(), IXE::PostingList::const_iterator::next(), IXE::PostingList::HitsCursor::nth(), IXE::PostingList::const_iterator::operator++(), and IXE::PostingList::PostingList().
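The pointer-advancing behaviour described above can be sketched as follows. This is an illustrative decoder, not the IXE implementation: `parseEptacodeSketch` is a hypothetical name, and the sketch assumes radix 127 and a terminator byte with bit 0x80 set, as stated in the documentation of outEptacode(). It also assumes the input is well formed, so a terminator is always reached.

```cpp
#include <cassert>

// Decode one integer starting at p and leave p one past the terminator byte.
unsigned parseEptacodeSketch(unsigned char const*& p) {
    assert(p != 0);
    unsigned value = 0;
    unsigned weight = 1;                        // 127^k for the k-th digit
    for (;;) {
        unsigned char b = *p++;
        value += (b & 0x7Fu) * weight;          // strip the flag, add the digit
        if (b & 0x80u)                          // sign bit set: last digit
            return value;
        weight *= 127u;
    }
}
```

Because the pointer is passed by reference, consecutive integers in a buffer can be read by calling the function repeatedly with the same pointer variable.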
void IXE::reverseURLdomain | ( | char * | revDomain, | |
char const * | url, | |||
Size | len | |||
) |
Return the URL's site in reverse.
url | is the URL |
int IXE::revlex_cmp | ( | const char * | aptr, | |
int | asiz, | |||
const char * | bptr, | |||
int | bsiz, | |||
void * | op | |||
) |
Reverse lexicographic comparison for little endian machines.
unsigned int IXE::toEptacode | ( | unsigned char * | dst, | |
register unsigned | n | |||
) |
int IXE::url_decode | ( | char * | dest, | |
char const * | src, | |||
int | len | |||
) |
Decodes any %## encoding in the given string.
Referenced by cgi_parse(), and IXE::KeyValuePairs::FillFromQueryString().
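A minimal sketch of %## decoding is shown below. It mirrors the parameter list of url_decode() but is not its implementation: `urlDecodeSketch` and `hexValue` are hypothetical names, and returning the decoded length, null-terminating dest, copying malformed sequences literally and leaving '+' untouched are all assumptions that the documentation does not confirm.

```cpp
#include <cassert>

// Hypothetical helper: value of a hex digit, or -1 if c is not a hex digit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decode the first len bytes of src into dest, which must hold len + 1 bytes.
// Assumed return value: the number of decoded bytes written (excluding '\0').
int urlDecodeSketch(char* dest, char const* src, int len) {
    assert(dest != 0 && src != 0 && len >= 0);
    int out = 0;
    for (int i = 0; i < len; ++i) {
        if (src[i] == '%' && i + 2 < len) {
            int hi = hexValue(src[i + 1]);
            int lo = hexValue(src[i + 2]);
            if (hi >= 0 && lo >= 0) {
                dest[out++] = static_cast<char>(hi * 16 + lo);
                i += 2;                         // skip the two hex digits
                continue;
            }
        }
        dest[out++] = src[i];                   // ordinary or malformed byte
    }
    dest[out] = '\0';
    return out;
}
```

Decoding never lengthens the string, which is why a destination of len + 1 bytes is sufficient in this sketch.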
int IXE::url_encode | ( | char * | dst, | |
char const * | s | |||
) |
Put into dst the URL encoding of s.
char * IXE::url_encode | ( | char const * | s | ) |
Returns a string in which all non-alphanumeric characters except "-_.!~*'()," have been replaced with a percent (%) sign followed by two hex digits.
This is the encoding described in RFC 2396 for protecting literal characters from being interpreted as special URL delimiters, and for protecting URLs from being mangled by transmission media with character conversions (like some email systems).
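The encoding rule can be illustrated with a short sketch. It is not the IXE function: `urlEncodeSketch` is a hypothetical name, it returns a std::string rather than a char* to sidestep ownership questions, and the use of uppercase hex digits is an assumption. The set of characters left unencoded is the one quoted in the description.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

// Percent-encode every byte that is neither alphanumeric nor in the kept set.
std::string urlEncodeSketch(char const* s) {
    assert(s != 0);
    static char const kept[] = "-_.!~*'(),";    // characters left as they are
    static char const hex[] = "0123456789ABCDEF";
    std::string out;
    for (; *s != '\0'; ++s) {
        unsigned char c = static_cast<unsigned char>(*s);
        if (std::isalnum(c) || std::strchr(kept, c) != 0) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];                 // high nibble
            out += hex[c & 0x0F];               // low nibble
        }
    }
    return out;
}
```

For example, a space becomes %20 and a slash becomes %2F, while a string made only of the kept characters is returned unchanged.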
According to RFC 2396, only alphanumerics, the unreserved characters "-_.!~*'()", and the reserved characters ";/?:@&=+$,", used for their reserved purposes, may be used unencoded within a URL. (