Document Store

The IXE Document Store provides object persistence by implementing an object-relational interface to database tables.

IXE tables are implemented by means of Berkeley Database, using either the B-tree or Hash access methods. Locking and synchronization is implicitely provided, so that multiple threads or processes can access the same table concurrently. Locking is dealt by means of atomic instructions, avoiding overheads in database operations. No locking at all occurrs if the database is read in readonly mode.

IXE provides a convenient and efficient C++ interface, based on templates and cursors.

Building servers or Web Services that access an IXE object store is easily accomplished by means of the Thread, ThreadGroup and Socket classes. See class SearchServer for an example.

High speed of query execution is provided by compiling queries to an internal form that exploits cursors on the available indexes, performs the Small Adaptive Set Intersection (SASI) algorithm and exploits efficient access to disk indexes provided through hardware supported caching and memory mapping.

IXE is an application-oriented database. Database tables are constructed using information about application classes supplied with metadata annotation.

IXE provides flexible and convenient interface for retrieving data from database through cursors and queries.

IXE is able to efficiently handle databases with several millions objects and up to terabyte in size.

Cursors and Queries

In IXE table rows are considered as object instances and table correspond to classes.

Unlike SQL, IXE is oriented to work with objects instead of SQL tuples. So the result of each query execution is a cursor on a the set of objects fulfilling the query. These are fully formed objects, to which methods can be applied, not just ResultSets as with ADO or ODBC, from which the application must extract individual column values.

Cursors

IXE tables provide cursors to extract objects from a table of objects of class T:

Table::Cursor<T> cursor(collection);
while (cursor.MoveNext())
   cout << cursor.get();

Notice that method cursor.get() returns a genuine object of class T, whose methods can be directly invoked, without any need to reconstruct it or extract its individual fields through explicit code, as in ADO or ODBC.

Cursors can be created on any indexed field. For instance, one can create a cursor on the title field of class VideoInfo, and access an item given its title, like this:

Table::Cursor<VideoInfo> cursor(collection, "title");
VideoInfo* v = cursor.get("High noon");

Cursors implement the Enumerator<T> interface, consisting of methods Enumerator::MoveNext() and Enumerator::get().

Cursor objects typically provide methods for performing:

initialization: constructor returns a cursor positioned before the first element;
movement and test: MoveNext(), moves the cursor and returns true if there is a next element;
access: get(), returns the element pointed by the cursor;
specific cursor operations, depending on its type.

For instance, dumping the lexicon for a fulltext field is done in the following way:

Lexicon::Cursor cur = lexicon.cursor();
while (cur.next()) {
     word = cur.get();		// fetch the word
     cout << word;
     postList = cur.postlist();	// fetch the posting for the word
     cout << postList;
  }

Query Cursors

Besides generic cursors, which scan a whole table or an inverted index, it is possible to use query cursors, which allow examining the results of queries. For example, consider the following class:

class VideoInfo {
public:
  DocID id;
  char* title;
  char* format;

  META(DocInfo,
       (KEY(id, autoincrement),
	VARKEY(title, 255),
	FIELD(format)));
};

The following code implements a search on a collection of video documents, described as objects of class Video:

Collection<Video> collection("ixe/video");
Query query("title MATCHES Totti and format=wav");
QueryCursor<Video> cursor(collection, query);
while (cursor.hasNext())
   cout << *cursor.get();

The first line opens a Collection of Video objects, located in file "ixe/video". Then a query is created from an SQL-like expression. Search is performed by opening a Cursor on the collection for the results of the query. Each result is extracted by scanning the cursor.

A QueryCursor returns QueryResult's, which describe individual results of a query. A QueryResult contains:

class QueryResult {
  CollectionID  collection;
  DocID  	id;
  double	rank;
  HitPosition	position;
}

The rank is the information retrieval rank, computed by the cosine measure, and is used to sort the results by relevance to the query. Position information is useful for queries involving proximity.

A typical search application needs to return just the top k ranking documents among the N matching the query. If k << N, sorting the whole result set can be quite expensive (k is often 10 while N can be > 100,000).

A possible solution is to use a min-heap, that maintains a heap of the top k elements, sorted in reverse order, so that the least element is at index 0. Each result is compared with the least element: if it is smaller, it is just discarded, otherwise the least element is removed and the new element inserted.

This algorithm has a worst case complexity is O(N log(k)), with average complexity of: O(N + k log(k) log(N/k)) if k << N.

Using an STL heap, a typical loop for a search application would look like this:

vector<QueryResult> sorted;
QueryCursor<T> cursor(coll, query);
while (max_results-- && cursor.hasNext()) {
  tot++;
  if (i < sorted_size) {
    sorted.push_back(cursor.get());
    i++;
  } else {
    // remove smallest
    std::pop_heap(sorted.begin(), sorted.end(), rank_greater());
    // insert at just freed last place
    sorted[max_results - 1] = cursor.get();
    // restore the heap property
    std::push_heap(sorted.begin(), sorted.end(), rank_greater());
  }
}
std::sort_heap(sorted.begin(), sorted.end(), rank_greater());

Since only max_results are required, the code maintains a heap of this size where results get sorted.

The example uses a rank_greater() function object to sort results by information retrieval rank, but other applications can use different criteria.

For instance, if an overall document rank is required, one can create an additional table:

class PageRank {
    DocID      docID;
    Double     rank;

    META(PageRank,
	((KEY(docID),
	  FIELD(rank))
	);
};

Table<PageRank> PageRankTable("rank");

where items can be inserted and extracted as follows:

    PageRankTable.insert(pageRank);
    Table<PageRank>::Cursor c(PageRankTable);
    while (c.MoveNext()) {... c->get() ...}

Random access to the elements is provided by:

   PageRank = c->get((byte*)&docID);

In alternative to storing the rank in a separate table, IXE allows automatic handling of reference fields, i.e. fields whose type is a pointer type. This is obtained by declaring the index type for the field as Field::Mapped in the META annotation for a class.

This means that the field is stored in a separate file, that contains the values for the field stored consecutively. Very fast access to the field is achieved by mapping the file into memory and directly accessing the value for each document by using the ID of the document as an index.

For instance, we would define

class VideoInfo {
 public:
  DocID      id;
  char*      title;
  char*      format;
  PageRank*  rank;

  META(VideoInfo,
       (KEY(id, autoincrement),
	VARKEY(title, 255),
	FIELD(format),
	KEY(rank, Field::Mapped)));
};

The effect of this annotation is that the field is stored in a separate file, that contains the values for the field stored consecutively. Fast access to the field is achieved by mapping the file into memory and directly accessing the value for each document by using the ID of the document as an index.

Query Language

IXE supports different programmable query languages. A language with an SQL-like syntax is available as well as a simpler one with the typical syntax of Web Search engines.

Grammar conventions
Example	Meaning
expression	non-terminals
not	terminals
\|	disjoint alternatives
[else cmd]	optional part

The following rules in BNF-like notation specify the syntax for IXE query cursor conditions:

condition:  match
        | NOT match
        | match relOp condition

match:	ident MATCHES pattern 
	| ( condition )
        | comparison

pattern: attr patternOp pattern
        | attr pattern
        | attr

attr:	color = primary
        | primary

color:	ident

ident:	WORD

primary: ( pattern )
	| ! primary
	| WORD
        | WORDSTAR
        | PHRASE
        | proximity

proximity: PROXIMITY dist ( proxlist )

dist:	LONG_NUM

proxlist: ( wordlist ) & proxlist
        | ( wordlist )

wordlist: WORD wordlist
        | WORD

relOp:  AND
	| OR

patternOp: &&
	| ||

comparison:  operand = operand
        | operand NE operand
        | operand LT operand
        | operand LE operand
        | operand GT operand
        | operand GE operand
        | operand LIKE operand
        | operand NOT LIKE operand
        | operand BETWEEN operand AND operand
        | operand NOT BETWEEN operand AND operand

operand: factor
        | operand + factor
        | operand | factor
        | operand - factor

factor: power
        | factor * power
        | factor / power

power: term
        | term ^ power

term: field
        | LONG_NUM
        | FLOAT_NUM
        | STRING

field: ident
        | ident . ident