Tanl Linguistic Pipeline |
The IXE Document Store provides object persistence by implementing an object-relational interface to database tables.
IXE tables are implemented by means of Berkeley Database, using either the B-tree or Hash access methods. Locking and synchronization is implicitely provided, so that multiple threads or processes can access the same table concurrently. Locking is dealt by means of atomic instructions, avoiding overheads in database operations. No locking at all occurrs if the database is read in readonly mode.
IXE provides a convenient and efficient C++ interface, based on templates and cursors.
Building servers or Web Services that access an IXE object store is easily accomplished by means of the Thread, ThreadGroup and Socket classes. See class SearchServer for an example.
High speed of query execution is provided by compiling queries to an internal form that exploits cursors on the available indexes, performs the Small Adaptive Set Intersection (SASI) algorithm and exploits efficient access to disk indexes provided through hardware supported caching and memory mapping.
IXE is an application-oriented database. Database tables are constructed using information about application classes supplied with metadata annotation.
IXE provides flexible and convenient interface for retrieving data from database through cursors and queries.
IXE is able to efficiently handle databases with several millions objects and up to terabyte in size.
In IXE table rows are considered as object instances and table correspond to classes.
Unlike SQL, IXE is oriented to work with objects instead of SQL tuples. So the result of each query execution is a cursor on a the set of objects fulfilling the query. These are fully formed objects, to which methods can be applied, not just ResultSets as with ADO or ODBC, from which the application must extract individual column values.
IXE tables provide cursors to extract objects from a table of objects of class T:
Table::Cursor<T> cursor(collection); while (cursor.MoveNext()) cout << cursor.get();
Notice that method cursor.get()
returns a genuine object of class T
, whose methods can be directly invoked, without any need to reconstruct it or extract its individual fields through explicit code, as in ADO or ODBC.
Cursors can be created on any indexed field. For instance, one can create a cursor on the title
field of class VideoInfo
, and access an item given its title, like this:
Table::Cursor<VideoInfo> cursor(collection, "title"); VideoInfo* v = cursor.get("High noon");
Cursors implement the Enumerator<T> interface, consisting of methods Enumerator::MoveNext() and Enumerator::get().
Cursor objects typically provide methods for performing:
For instance, dumping the lexicon for a fulltext field is done in the following way:
Lexicon::Cursor cur = lexicon.cursor(); while (cur.next()) { word = cur.get(); // fetch the word cout << word; postList = cur.postlist(); // fetch the posting for the word cout << postList; }
Besides generic cursors, which scan a whole table or an inverted index, it is possible to use query cursors, which allow examining the results of queries. For example, consider the following class:
class VideoInfo { public: DocID id; char* title; char* format;
META(DocInfo, (KEY(id, autoincrement), VARKEY(title, 255), FIELD(format))); };
The following code implements a search on a collection of video documents, described as objects of class Video:
Collection<Video> collection("ixe/video"); Query query("title MATCHES Totti and format=wav"); QueryCursor<Video> cursor(collection, query); while (cursor.hasNext()) cout << *cursor.get();
The first line opens a Collection of Video
objects, located in file "ixe/video". Then a query is created from an SQL-like expression. Search is performed by opening a Cursor on the collection for the results of the query. Each result is extracted by scanning the cursor.
A QueryCursor returns QueryResult's, which describe individual results of a query. A QueryResult contains:
class QueryResult { CollectionID collection; DocID id; double rank; HitPosition position; }
The rank
is the information retrieval rank, computed by the cosine measure, and is used to sort the results by relevance to the query. Position information is useful for queries involving proximity.
A typical search application needs to return just the top k ranking documents among the N matching the query. If k << N, sorting the whole result set can be quite expensive (k is often 10 while N can be > 100,000).
A possible solution is to use a min-heap, that maintains a heap of the top k elements, sorted in reverse order, so that the least element is at index 0. Each result is compared with the least element: if it is smaller, it is just discarded, otherwise the least element is removed and the new element inserted.
This algorithm has a worst case complexity is O(N log(k)), with average complexity of: O(N + k log(k) log(N/k)) if k << N.
Using an STL heap, a typical loop for a search application would look like this:
vector<QueryResult> sorted; QueryCursor<T> cursor(coll, query); while (max_results-- && cursor.hasNext()) { tot++; if (i < sorted_size) { sorted.push_back(cursor.get()); i++; } else { // remove smallest std::pop_heap(sorted.begin(), sorted.end(), rank_greater()); // insert at just freed last place sorted[max_results - 1] = cursor.get(); // restore the heap property std::push_heap(sorted.begin(), sorted.end(), rank_greater()); } } std::sort_heap(sorted.begin(), sorted.end(), rank_greater());
Since only max_results
are required, the code maintains a heap of this size where results get sorted.
The example uses a rank_greater()
function object to sort results by information retrieval rank, but other applications can use different criteria.
For instance, if an overall document rank is required, one can create an additional table:
class PageRank { DocID docID; Double rank;
META(PageRank, ((KEY(docID), FIELD(rank)) ); };
Table<PageRank> PageRankTable("rank");
where items can be inserted and extracted as follows:
PageRankTable.insert(pageRank); Table<PageRank>::Cursor c(PageRankTable); while (c.MoveNext()) {... c->get() ...}
Random access to the elements is provided by:
PageRank = c->get((byte*)&docID);
In alternative to storing the rank in a separate table, IXE allows automatic handling of reference fields, i.e. fields whose type is a pointer type. This is obtained by declaring the index type for the field as Field::Mapped
in the META annotation for a class.
This means that the field is stored in a separate file, that contains the values for the field stored consecutively. Very fast access to the field is achieved by mapping the file into memory and directly accessing the value for each document by using the ID of the document as an index.
For instance, we would define
class VideoInfo { public: DocID id; char* title; char* format; PageRank* rank;
META(VideoInfo, (KEY(id, autoincrement), VARKEY(title, 255), FIELD(format), KEY(rank, Field::Mapped))); };
The effect of this annotation is that the field is stored in a separate file, that contains the values for the field stored consecutively. Fast access to the field is achieved by mapping the file into memory and directly accessing the value for each document by using the ID of the document as an index.
IXE supports different programmable query languages. A language with an SQL-like syntax is available as well as a simpler one with the typical syntax of Web Search engines.
Example | Meaning |
---|---|
expression | non-terminals |
not | terminals |
| | disjoint alternatives |
[else cmd] | optional part |
The following rules in BNF-like notation specify the syntax for IXE query cursor conditions:
condition: match | NOT match | match relOp condition
match: ident MATCHES pattern | ( condition ) | comparison
pattern: attr patternOp pattern | attr pattern | attr
attr: color = primary | primary
color: ident
ident: WORD
primary: ( pattern ) | ! primary | WORD | WORDSTAR | PHRASE | proximity
proximity: PROXIMITY dist ( proxlist )
dist: LONG_NUM
proxlist: ( wordlist ) & proxlist | ( wordlist )
wordlist: WORD wordlist | WORD
relOp: AND | OR
patternOp: && | ||
comparison: operand = operand | operand NE operand | operand LT operand | operand LE operand | operand GT operand | operand GE operand | operand LIKE operand | operand NOT LIKE operand | operand BETWEEN operand AND operand | operand NOT BETWEEN operand AND operand
operand: factor | operand + factor | operand | factor | operand - factor
factor: power | factor * power | factor / power
power: term | term ^ power
term: field | LONG_NUM | FLOAT_NUM | STRING
field: ident | ident . ident