Tanl Linguistic Pipeline

IXE Namespace Reference

Global configuration parameters. More...

Namespaces

namespace  io
 Platform independent IO.


Classes

class  Pattern2
 Extension of RegExp::Pattern that stores original pattern for serialization. More...
class  conf< Replacements >
class  DocInfo
 Abstract class for document info. More...
struct  eptacode
struct  FileFormat
 This variable contains the version number of the file format for the index. More...
struct  FileHeader
 Header for fulltext index files. More...
class  IndexTable
 An instance of this class is used to access either the word, stop-word, file, or meta-name index portions of an index file. More...
class  BigramTable
 A BigramTable contains the index (TermID) of the first word in the Lexicon starting with that bigram. More...
class  StringTable
class  Item
 Items represent values for fields. More...
class  ItemOf
class  KeyValuePairs
 Represents a sorted collection of associated string keys and string values that can be accessed with the key. More...
struct  LexEntry
struct  EntryCompare
class  TermInfo
class  Lexicon
 Manage and use an inverted index dictionary. More...
struct  Condition
struct  SimpleCondition
 Condition without the additional lock required by pthread_cond_wait(). More...
struct  Lock
 Lock object, used for synchronization. More...
class  Locked
class  LockUp
 mutex interface: locks mutex at creation, unlocks at destruction More...
class  Array
 Array of persistent objects. More...
class  ArrayOf
class  Indexable
 Interface for classes providing indexer access. More...
class  Reference
 Reference to another persistent object. More...
struct  NullType
struct  TrueType
struct  If
struct  If< NullType, Positive, Negative >
struct  isPointer
struct  isPointer< T * >
struct  isReference
struct  isReference< Reference< T > >
struct  isClass
struct  isClass< T * >
struct  isClass< bool >
struct  isClass< char >
struct  isClass< signed char >
struct  isClass< short int >
struct  isClass< int >
struct  isClass< long int >
struct  isClass< float >
struct  isClass< double >
struct  isClass< long double >
struct  isClass< unsigned char >
struct  isClass< unsigned short int >
struct  isClass< unsigned int >
struct  isClass< unsigned long int >
struct  isClass< const bool >
struct  isClass< const char >
struct  isClass< const short int >
struct  isClass< const int >
struct  isClass< const long int >
struct  isClass< const float >
struct  isClass< const double >
struct  isClass< const long double >
struct  isClass< const unsigned char >
struct  isClass< const unsigned short int >
struct  isClass< const unsigned int >
struct  isClass< const unsigned long int >
struct  isClass< signed long long int >
struct  isClass< unsigned long long int >
struct  isClass< const signed long long int >
struct  isClass< const unsigned long long int >
struct  isClass< std::vector< T > >
struct  isClass< std::string >
struct  isArray
struct  ArrayType
struct  isArray< ArrayOf< T > >
struct  ArrayType< ArrayOf< T > >
struct  ArrayType< std::vector< T > >
struct  isArray< std::vector< T > >
struct  deref
struct  deref< T * >
class  HasMetaClass
 Determine if a class has a MetaClass. More...
struct  MetaClassOf
struct  MetaClassOf< T, 1 >
class  Field
 Abstract class for representing fields in tables. More...
class  CompositeField
 Composite fields. More...
class  FixedField
 Fixed size fields. More...
class  VarField
 Variable size fields. More...
class  ReferenceField
 Reference fields. More...
class  ArrayField
 Arrays for fields. More...
class  ArrayField< ArrayOf< T > >
class  ArrayField< std::vector< T > >
class  ArrayField< std::vector< std::string > >
class  ArrayField< std::vector< char const * > >
struct  CompositeBuilder
 Interface for creating fields. More...
struct  FieldBuilder
class  MetaClass
 Describes the structure of an object. More...
class  AnyObject
 Generic object, used for reading/writing dynamically defined tables. More...
struct  Options
 Options describes a set of command-line options. More...
class  OptionStream
 Given the traditional argc and argv for command-line arguments, extract options from them following the stream model. More...
struct  PostingOffset
class  PostingList
 This class, given an IndexTable::const_iterator, accesses the list of postings for a word. More...
class  Runnable
struct  DBT
 Represents items inserted/extracted from DB. More...
class  SubField
 Represents fields in table containing references to other objects. More...
class  MappedSubField
 Represents fields in table containing references to other objects. This is the version for value type objects, which are stored in a separate mapped file. More...
class  SubCursor
 Represents a cursor on a Reference subfield. More...
struct  NoDoc
 Predicate false for any document. More...
class  Table
class  DynamicTable
 DynamicTable. More...
class  InvalidThreadStateError
class  Thread
 A class to start and manage a thread of execution. More...
class  ThreadGroup
 Java-like ThreadGroup. More...
class  ThreadPool
 A ThreadPool pre-creates and manages a pool of persistent threads to do tasks taken from a queue. More...
struct  FileAction
 Set the limit for the given resource to its maximum value. More...
class  conf< ColorMap >
 conf<ColorMap> is a Var containing a set of pair<TermColor, TermWeight> associated to an HTML tag/meta-attribute. More...
class  Configuration
 A Configuration object that holds all the configuration variables. More...
class  Var
 Configuration variable. More...
class  VarDefault
 Configuration variable with default value. More...
class  conf
class  conf< bool >
 A conf<bool> is a Var for containing the value of a Boolean configuration variable. More...
class  conf< Dictionary >
class  conf< float >
 A conf<float> is a Var for containing the value of a float configuration variable. More...
class  conf< int >
 A conf<int> is a Var for containing the value of an integer configuration variable. More...
class  conf< PatternSet >
 A conf_PatternSet contains a list of shell wildcard patterns. More...
class  conf_set
 A conf_set contains a set of configuration values. More...
class  conf< std::string >
 A conf<string> is a configuration variable containing a string value. More...
class  conf< std::vector< std::string > >
 A conf_vector contains a set of configuration values. More...
class  Conversion
 A Conversion maps a filename pattern to a conversion command. More...
class  ExcludeFile
 An ExcludeFile contains the set of filename patterns to exclude during either indexing or extraction. More...
class  FileType
 A FileType maps a filename pattern to a file type. More...
class  IncludeFile
 An IncludeFile contains the set of filename patterns to include during either indexing or extraction. More...
class  PatternList
 A PatternList contains a list of shell wildcard patterns. More...
class  PatternMap
 A PatternMap maps a shell wildcard pattern to an object of type T. More...
class  PatternSet
 A PatternSet contains a set of shell wildcard patterns. More...
class  PatternVar
 A PatternVar is a configuration variable containing a set of filename patterns. More...
class  Enumerator
 Enumerator interface. More...
class  Error
 Base class for all errors reported. More...
class  LogicError
 Base class for errors due to programming errors. More...
class  RuntimeError
 Base class for errors due to run time problems. More...
class  AssertionError
 Thrown if an internal consistency check fails. More...
class  UnimplementedError
 Thrown when an attempt to use an unimplemented feature is made. More...
class  InvalidArgumentError
 Thrown when an invalid argument is supplied to the API. More...
class  ConfigFileError
 Thrown when reading a configuration file fails. More...
class  FileError
 Thrown when opening a file fails. More...
class  MmapError
 Thrown when mmap fails mapping a file to memory. More...
class  FormatError
 Wrong index format file. More...
class  DocNotFoundError
 Thrown when an attempt is made to access a document which is not in the collection. More...
class  InternalError
 Thrown when an internal inconsistency occurs. More...
class  IndexingError
 Thrown during indexing. More...
class  RangeError
 Thrown when an element is out of range. More...
class  ReaderError
 Thrown when the reader fails to interpret the document format. More...
class  CollectionError
 Thrown for miscellaneous collection errors. More...
class  NetworkError
 Thrown when there is a communications problem with a remote collection. More...
class  MemoryError
 Thrown when there is a communications problem with a remote collection. More...
class  OpeningError
 Thrown when opening a collection fails. More...
class  TableError
 Thrown when accessing a database Table fails. More...
class  ParserError
class  QueryError
 Thrown when an SQL query fails. More...
class  InvalidResultError
 Thrown when trying to access invalid data. More...
class  SystemError
 Thrown when a system call fails. More...
class  IOError
class  Set
 A Set is a set but with the addition of a contains() member function, one that returns a simpler bool result indicating whether a given element is in the set. More...
class  Set< char const * >
 Specialize Set for C-style strings so as not to have a reference (implemented as a pointer) to a char const*. More...
struct  TermHit
 TermHit is used to represent a word occurrence in a document, a sentence delimiter or a tag. More...
class  Timer
class  unordered_map
struct  IVisitor
 Define Visitable classes as:. More...
struct  Visitor
struct  Visitable
struct  FileEnum

Typedefs

typedef std::vector< std::pair< Pattern2, std::string > > Replacements
 A conf_Replacements contains pairs of (RegExp::Pattern, replacement).
typedef std::map< DocID, DocID > Remap
 Pairs <from, to>, sorted by increasing from.
typedef std::map< char const *, char const * > Dictionary
 A conf_dictionary contains a dictionary.
typedef conf_set< std::string > conf_stringset
 A conf_set contains a set of configuration string values.
typedef FileType MimeType
 A MimeType maps a mime type to a document reader type.
typedef unsigned DocID
 DocID is a numeric ID for documents in IXE collections.
typedef unsigned short HitPosition
 Word position in document.
typedef unsigned short Occurrences
 Number of occurrences of a word in a document.
typedef short TermColor
 TermColor is a numeric ID of a 'color' attribute for the word.
typedef unsigned TermID
typedef unsigned Count
typedef unsigned Size
typedef unsigned char byte
typedef Set< char const * > chars_set
typedef signed char int1
typedef unsigned char nat1
typedef signed short int2
typedef unsigned short nat2
typedef signed int int4
typedef unsigned int nat4
typedef float real
typedef float real4
typedef double real8
typedef unsigned long long nat8
typedef signed long long int8
typedef unsigned char uchar
typedef unsigned char uint8
typedef unsigned short uint16
typedef unsigned long ulong
typedef long long longlong
typedef unsigned long long ulonglong
typedef short TermWeight
 Weight associated to a term: usually depends on the color.

Enumerations

enum  index_id { stop_word_index = 0, color_index = 1 }
enum  {
  Exit_Success = 0, Exit_Config_File = 1, Exit_Usage = 2, Exit_Malformed_Query = 40,
  Exit_No_Read_Index = 50, Exit_No_Write_PID = 51, Exit_No_Socket = 52, Exit_No_Unlink = 53,
  Exit_No_Bind = 54, Exit_No_Listen = 55, Exit_No_Accept = 56, Exit_No_Fork = 57,
  Exit_No_Change_Dir = 58, Exit_No_Create_Thread = 59, Exit_No_Detach_Thread = 60, Exit_End_Enum_Marker
}

Functions

 REGISTER (DocInfo)
std::ostream & operator<< (std::ostream &s, const DocInfo &d)
ostream & outEptacode (ostream &o, register unsigned n)
 Write an unsigned integer to the given ostream in eptabit binary coding (base 127), low digit first.
ostream & padEptacode (ostream &o, register unsigned no, unsigned old)
 Output number no to o stream, padding it so that it fits the same space occupied by previous number old, assuming that no <= old.
unsigned int toEptacode (unsigned char *dst, register unsigned n)
 Write n in eptacode to dst.
unsigned int parseEptacode (register unsigned char const *&p)
 Parse an integer from an EPTACODE-encoded byte sequence (low digit first).
std::ostream & outEptacode (std::ostream &, unsigned)
std::ostream & padEptacode (std::ostream &o, register unsigned no, unsigned old)
std::ostream & operator<< (std::ostream &o, const eptacode &e)
std::ostream & operator<< (std::ostream &s, const Item &t)
template<class T >
std::ostream & operator<< (std::ostream &s, const ItemOf< T > &t)
Field * makeField (char const *name, char const *typeName, Size &offs, Size maxLength)
 Create a Field dynamically, eg.
 REGISTER (MetaClass)
template<Size size>
void storeBigEndian (byte *dst, byte *src)
template<>
void storeBigEndian< 1 > (byte *dst, byte *src)
template<>
void storeBigEndian< 2 > (byte *dst, byte *src)
template<>
void storeBigEndian< 4 > (byte *dst, byte *src)
template<>
void storeBigEndian< 8 > (byte *dst, byte *src)
template<int size>
void fetchBigEndian (byte *dst, byte *src)
template<int size>
void storeLittleEndian (byte *dst, byte *src)
template<>
void storeLittleEndian< 1 > (byte *dst, byte *src)
template<>
void storeLittleEndian< 2 > (byte *dst, byte *src)
template<>
void storeLittleEndian< 4 > (byte *dst, byte *src)
template<>
void storeLittleEndian< 8 > (byte *dst, byte *src)
template<int size>
void fetchLittleEndian (byte *dst, byte *src)
template<class T >
Field * createField (char const *name, Size maxLength, Size offs, T *, Field::IndexType const indexType, char const *merge=0)
std::ostream & operator<< (std::ostream &s, const MetaClass &m)
OptionStream & operator>> (OptionStream &os, OptionStream::Option &o)
 Parse and extract an option from an option stream (argv values).
bool realloc_record (DBT &record, ulong len)
int revlex_cmp (const char *aptr, int asiz, const char *bptr, int bsiz, void *op)
 Reverse lexicographic comparison for little endian machines.
int float_cmp (const char *aptr, int asiz, const char *bptr, int bsiz, void *op)
 Float comparison for little endian machines.
void fappend (char *f1, char *f2)
void incrementKey (DBT &key_)
void decrementKey (DBT &key_)
TRESULT ThreadMain (void *)
void mapDir (char const *pathname, FileAction &action, bool recurse_subdirectories, bool follow_symbolic_links, int verbosity)
 Perform action on each file in directory tree.
int url_decode (char *dest, char const *src, int len)
 Decodes any %## encoding in the given string.
char * url_encode (char const *s)
 Returns a string in which all non-alphanumeric characters except "-_.!~*'()," have been replaced with a percent (%) sign followed by two hex digits.
int url_encode (char *dst, char const *s)
 Put into dst the URL encoding of s.
void reverseURLdomain (char *revDomain, char const *url, Size len)
 Return the URL's site in reverse.
void unreverseURLdomain (char *domain, char const *revDomain)
Size availableMemory ()
 Detect the available memory.
void cgi_parse (map< char const *, char const * > &keyMap, char *qstart)
 Parse a cgi query into a map of key values.
void cgi_parse (std::map< char const *, char const * > &keyMap, char *qstart)

Variables

FileFormat fileFormatVersion = { 1, 3 }
char const Bext [] = "tcb"
char const Hext [] = "tch"
char const Fext [] = "tcf"
char const Uext [] = "unv"
pthread_key_t jmpbuf_key
conf< ColorMap > colorMapVar ("Colors")
ColorMap & colorMap = colorMapVar.value
 Contains a set of pair<TermColor, TermWeight> associated to an HTML tag/meta-attribute.
conf< bool > VerboseConfig ("VerboseConfig", false)
FileType fileTypes
MimeType mimeTypes
int const WordMaxSize = 30
int const WordMinSize = 2
char const ConfigFileDefault [] = "ixe.conf"
char const TableNameDefault [] = "INDEX/docinfo"
int const FilesGrowDefault = 100
char const IndexExt [] = ".fti"
char const PostingExt [] = ".pst"
char const TableExt [] = ".bdb"
char const ContentsExt [] = ".gz"
int const ResultsMaxDefault = 10
char const TempDirectoryDefault [] = "/tmp"
int const WordPercentMaxDefault = 100
int const Word_Threshold = 60000
int const max_columns = 64
int const max_prefix_lists = 5000
int const Max_CursorAll_Hits = 20
int const Postings_Segment_Size = 1024
int const Min_Postings_Table = 4096
DocID const noDocID = 0
HitPosition const noPosition = 0
HitPosition const maxPosition = (HitPosition)-1
Occurrences const maxOccurrences = (Occurrences)-1
TermColor const noColor = -1
TermColor const color_Not_Found = -2
int const num_bigrams = 256*256 + 1
TermWeight const noWeight = 1
TermWeight const repeatWeight = 0
char const version [] = "0.9"

Detailed Description

Global configuration parameters.

All the IXE classes are declared in namespace IXE.

The IXE Toolkit is a set of modular C++ classes and utilities for indexing and querying documents.


Typedef Documentation

typedef unsigned IXE::DocID

DocID is a numeric ID for documents in IXE collections.

DocIDs start at 1; 0 is reserved for no ID.

typedef unsigned short IXE::HitPosition

Word position in document.

Positions start at 1; 0 is reserved for no position.

typedef short IXE::TermColor

TermColor is a numeric ID of a 'color' attribute for the word.

The color may represent a META NAME, or some HTML TAG.


Function Documentation

int IXE::float_cmp ( const char *  aptr,
int  asiz,
const char *  bptr,
int  bsiz,
void *  op 
)

Float comparison for little endian machines.

Returns:
< 0 if a < b, = 0 if a = b, > 0 if a > b.
OptionStream& IXE::operator>> ( OptionStream &  os,
OptionStream::Option &  o 
)

Parse and extract an option from an option stream (argv values).

Options begin with either a '-' for short options or a "--" for long options. Either a '-' or "--" by itself explicitly ends the options; however, the difference is that '-' is returned as the first non-option whereas "--" is skipped entirely.

When there are no more options, the OptionStream converts to bool as false. The OptionStream's shift() member gives the number of options parsed, which the caller can use to adjust argc and argv.

Short options can take an argument either as the remaining characters of the same argv or in the next argv (unless the next argv looks like an option by beginning with a '-').

Long option names can be abbreviated so long as the abbreviation is unambiguous. Long options can take an argument either directly after a '=' in the same argv or in the next argv (but without an '='), unless the next argv looks like an option by beginning with a '-'.

Parameters:
os The OptionStream to extract options from
o The option to deposit into.
Returns:
The passed-in OptionStream.
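
The rules for '-' and "--" described above can be illustrated with a small standalone sketch. It does not use the OptionStream class itself: firstNonOption is a hypothetical helper introduced only for this example, and it deliberately ignores option arguments and long-option abbreviation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not part of IXE): returns the index of the first
// non-option argument. Option arguments are deliberately ignored.
std::size_t firstNonOption(const std::vector<const char*>& args) {
    std::size_t i = 0;
    while (i < args.size()) {
        const char* a = args[i];
        if (a[0] != '-') break;                // ordinary argument: stop
        if (std::strcmp(a, "-") == 0) break;   // lone "-" ends options and is kept
        ++i;                                   // consume the option (or "--")
        if (std::strcmp(a, "--") == 0) break;  // "--" ends options and is skipped
    }
    return i;
}
```

The returned index plays the role that the text assigns to shift(): the caller would advance argv by that amount before processing the remaining arguments.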
ostream& IXE::outEptacode ( ostream &  o,
register unsigned  n 
)

Write an unsigned integer to the given ostream in eptabit binary coding (base 127), low digit first.

The last digit has sign bit 1.

Parameters:
o The ostream to write to.
n The number to be written.
Returns:
The passed-in ostream.

Here is the largest number encodable with n bytes:

bytes  largest number
1      127
2      16129
3      2048383
4      260144641
5      33038369407

An alternative similar to Golomb encoding would be to use 3 bits unary encoded to represent length (0, 10, 110, 1110), and the rest in normal base 2. The largest number encodable would be:

bytes  largest number
1      127
2      16384
3      2097152
4      268435456
5      34359738368

Since the difference is marginal, we opt for the first solution which is simpler to encode/decode.

Referenced by padEptacode().
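
As a rough illustration of the coding described above, the following sketch encodes a number as base-127 digits, low digit first, and sets the high ("sign") bit on the last digit. It is written only from this description and is not the IXE implementation; the name encodeEptacodeSketch and the use of std::vector instead of an ostream are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the IXE code): base-127 digits, low digit first,
// with bit 0x80 set on the final digit to mark the end of the number.
std::vector<unsigned char> encodeEptacodeSketch(std::uint32_t n) {
    std::vector<unsigned char> out;
    while (n >= 127) {
        out.push_back(static_cast<unsigned char>(n % 127));  // next low digit
        n /= 127;
    }
    out.push_back(static_cast<unsigned char>(n | 0x80));     // terminating digit
    return out;
}
```

Under these assumptions 132 (1 * 127 + 5) is written as the two bytes 0x05 0x81, and any value below 127 takes a single byte.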

unsigned int IXE::parseEptacode ( register unsigned char const *&  p  )  [inline]

Parse an integer from an EPTACODE-encoded byte sequence (low digit first).

Integers are terminated by a byte with sign bit 1.

Parameters:
p A pointer to the start of the EPTACODE-encoded integer. After an integer is parsed, it is left one past the terminator character.
Returns:
The integer.

Referenced by IXE::PostingList::HitsCursor::next(), IXE::PostingList::const_iterator::next(), IXE::PostingList::HitsCursor::nth(), IXE::PostingList::const_iterator::operator++(), and IXE::PostingList::PostingList().
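
A decoder matching this description might look like the sketch below. It assumes base-127 digits stored low digit first, as stated for outEptacode(), and takes the pointer by reference so that it is left one past the terminator. The name decodeEptacodeSketch is hypothetical, and the real inline function may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not the IXE code): accumulate base-127 digits, low
// digit first, until a byte with bit 0x80 set is seen. On return, p points
// one past that terminating byte.
std::uint32_t decodeEptacodeSketch(const unsigned char*& p) {
    std::uint32_t value = 0;
    std::uint32_t scale = 1;               // 127^k for the k-th digit
    for (;;) {
        unsigned char b = *p++;
        value += (b & 0x7F) * scale;
        if (b & 0x80) break;               // terminator reached
        scale *= 127;
    }
    return value;
}
```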

void IXE::reverseURLdomain ( char *  revDomain,
char const *  url,
Size  len 
)

Return the URL's site in reverse.

Parameters:
url is the URL
int IXE::revlex_cmp ( const char *  aptr,
int  asiz,
const char *  bptr,
int  bsiz,
void *  op 
)

Reverse lexicographic comparison for little endian machines.

Returns:
< 0 if a < b, = 0 if a = b, > 0 if a > b.
unsigned int IXE::toEptacode ( unsigned char *  dst,
register unsigned  n 
)

Write n in eptacode to dst.

Parameters:
dst The buffer to write to.
n The number to be written.
Returns:
The number of bytes written.
int IXE::url_decode ( char *  dest,
char const *  src,
int  len 
)

Decodes any %## encoding in the given string.

Returns:
the difference in size between src and dest.

Referenced by cgi_parse(), and IXE::KeyValuePairs::FillFromQueryString().
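
The decoding of %## sequences can be sketched as follows. This is a standalone illustration using std::string rather than the char buffers of the real function; urlDecodeSketch and hexValue are hypothetical names, and leaving malformed sequences unchanged is an assumption, not documented behaviour.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Value of a single hexadecimal digit (caller guarantees c is a hex digit).
static int hexValue(unsigned char c) {
    if (std::isdigit(c)) return c - '0';
    return std::tolower(c) - 'a' + 10;
}

// Hypothetical helper (not the IXE code): replace each "%XY" with the byte
// whose hexadecimal value is XY; anything else is copied unchanged.
std::string urlDecodeSketch(const std::string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '%' && i + 2 < s.size() &&
            std::isxdigit(static_cast<unsigned char>(s[i + 1])) &&
            std::isxdigit(static_cast<unsigned char>(s[i + 2]))) {
            out += static_cast<char>(hexValue(s[i + 1]) * 16 + hexValue(s[i + 2]));
            i += 2;                        // skip the two hex digits
        } else {
            out += s[i];
        }
    }
    return out;
}
```

In this sketch the size difference that url_decode() is documented to return corresponds to s.size() - out.size().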

int IXE::url_encode ( char *  dst,
char const *  s 
)

Put into dst the URL encoding of s.

Parameters:
dst The buffer that receives the encoded string.
s The string to encode.
Returns:
The length of the encoded string.
char * IXE::url_encode ( char const *  s  ) 

Returns a string in which all non-alphanumeric characters except "-_.!~*'()," have been replaced with a percent (%) sign followed by two hex digits.

This is the encoding described in RFC 2396 for protecting literal characters from being interpreted as special URL delimiters, and for protecting URLs from being mangled by transmission media with character conversions (like some email systems).

According to RFC 2396, only alphanumerics, the unreserved characters "-_.!~*'()", and reserved characters ";/?:@&=+$,", used for their reserved purposes may be used unencoded within a URL.

See also:
http://www.faqs.org/rfcs/rfc2396.html
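
The encoding rule stated above can be sketched as below. The function works on std::string instead of returning a char *, the name urlEncodeSketch is hypothetical, and the use of uppercase hex digits is an assumption; the text does not say which case the real function emits.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical helper (not the IXE code): keep alphanumerics and the
// characters "-_.!~*'()," and replace every other byte with %XY.
std::string urlEncodeSketch(const std::string& s) {
    static const char hex[] = "0123456789ABCDEF";   // case is an assumption
    static const char safe[] = "-_.!~*'(),";
    std::string out;
    for (unsigned char c : s) {
        if (std::isalnum(c) || (c != 0 && std::strchr(safe, c) != nullptr)) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}
```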
Copyright © 2005-2011 G. Attardi. Generated on 4 Mar 2011 by doxygen 1.6.1.