Tanl Linguistic Pipeline
IXE uses the Factory pattern [gamma] to manage the list of implemented readers and to create readers dynamically when it needs them.
Reader classes include code that automatically registers them with a ReaderMap object, which contains a list of all known readers, and is able to dynamically instantiate readers on demand.
The interaction between the indexer, the readers and the document objects (objects describing metadata about documents) follows a precise message protocol, described below.
IXE provides a variety of classes to represent various properties of text fields present within documents: document format, compression, location.
Each fulltext field in a document is represented by an instance of class TextField, that contains:
For instance, to represent video documents that have a name, a duration, a video format, a summary and two text fields (caption, transcript), we may use the following class:
class Video {
public:
    char const* name;
    unsigned duration;
    char const* format;
    char const* caption;
    char const* summary;
    CompressedText<65565> transcript;

    META(Video,
         ((VARKEY(name, 2047, Field::unique),
           FIELD(duration),
           VARKEY(format, 80, Field::fulltext),
           VARKEY(caption, 255, Field::fulltext),
           VARFIELD(summary, 8096),
           KEY(transcript, Field::fulltext))));
};
In the example, format and caption are represented by plain TextFields, implicitly created from the corresponding strings.
The transcript field has type CompressedText<65565>, an instantiation of the template CompressedText with a maximum compressed size of 65565 bytes. CompressedText fields are automatically compressed before being stored and decompressed when read.
Each field in a document, or even the same field in two different documents, may be indexed using a different document reader. For instance, the caption may be simple text that can be read by a TextReader, while the transcript may be in various formats, e.g. XML requiring an XMLReader, HTML requiring an HTMLReader, and so on.
Given an instance video1 of class Video, we may set the document format type for the transcript field as follows:
video1.transcript.Type("HTML");
We can set the contents of the transcript either from text in memory:
char const* tr = "Sample transcript text";
video1.transcript.Contents(tr);
or from an external file:
video1.transcript.source("/local/file.html");
In the former case, the whole contents of the field will be stored (in compressed form) within the IXE table, while in the latter case only the name of the file will be stored.
Finally note that while the format field could have been stored as a plain secondary KEY:
VARKEY(format, 80),
the use of a fulltext index is more efficient, since it creates a dictionary of all possible formats, which is used to create a more compact inverted index.
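The dictionary idea can be illustrated with a minimal sketch. It is not IXE's implementation: the class TermIndex and its members are hypothetical names chosen for illustration. It assumes only that each distinct term is stored once in a dictionary and that documents are then recorded against a small integer id instead of against the full string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch, not IXE code.
namespace sketch {

class TermIndex {
public:
    using TermId = std::uint32_t;
    using DocId = std::uint32_t;

    // Record that document `doc` contains `term`.
    void add(DocId doc, std::string const& term) {
        // The next free id equals the number of posting lists created so far.
        auto inserted = dictionary_.emplace(term, static_cast<TermId>(postings_.size()));
        if (inserted.second) postings_.emplace_back();  // first occurrence of this term
        assert(postings_.size() == dictionary_.size()); // one posting list per term
        postings_[inserted.first->second].push_back(doc);
    }

    // Number of strings actually stored, however many documents were added.
    std::size_t distinctTerms() const { return dictionary_.size(); }

    // Documents containing `term`, in insertion order; empty if unknown.
    std::vector<DocId> lookup(std::string const& term) const {
        auto it = dictionary_.find(term);
        if (it == dictionary_.end()) return {};
        return postings_[it->second];
    }

private:
    std::unordered_map<std::string, TermId> dictionary_;  // term -> id
    std::vector<std::vector<DocId>> postings_;             // id -> documents
};

} // namespace sketch
```

With a field such as format, which takes few distinct values, the dictionary stays small no matter how many documents are indexed, and each posting costs one integer.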
When an object is indexed, the IXE indexer analyzes each field and builds an index for each indexed field.
For fulltext fields, it generates a fulltext index, by creating an instance of the appropriate document reader, as determined by the type of the TextField.
The chosen document reader is invoked as follows by the indexer:
reader->addHits(*this, video1);
passing itself to the reader, as well as the document object to read.
A document reader will typically extract text from the document, skipping tags and other formatting information, tokenize the text into words, and call back the indexer with a hit for each word:
indexer->hit(hit)
A TermHit describes a single occurrence of a word, with properties like its color, case, weight and offset from the start of the document.
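The callback protocol can be sketched as follows. This is a simplified illustration under stated assumptions, not IXE code: Indexer, MarkupReader and this reduced TermHit (word and offset only) are hypothetical stand-ins, and the reader is constructed directly from a string instead of receiving a document object in addHits.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch, not IXE code.
namespace sketch {

struct TermHit {
    std::string word;    // the token, lower-cased
    std::size_t offset;  // offset of the token from the start of the text
};

class Indexer {
public:
    // Callback invoked by a reader once per word occurrence.
    void hit(TermHit const& h) {
        assert(!h.word.empty());
        hits_.push_back(h);
    }
    std::vector<TermHit> const& hits() const { return hits_; }

private:
    std::vector<TermHit> hits_;
};

class MarkupReader {
public:
    explicit MarkupReader(std::string text) : text_(std::move(text)) {}

    // Skips anything between '<' and '>', splits the rest into
    // alphanumeric words, and reports each word to the indexer.
    void addHits(Indexer& indexer) const {
        bool inTag = false;
        std::string word;
        std::size_t start = 0;
        for (std::size_t i = 0; i <= text_.size(); ++i) {
            // End of input is treated as a separator so the last word is flushed.
            unsigned char c = i < text_.size() ? static_cast<unsigned char>(text_[i]) : ' ';
            if (c == '<') inTag = true;
            bool isWordChar = !inTag && std::isalnum(c);
            if (isWordChar) {
                if (word.empty()) start = i;
                word += static_cast<char>(std::tolower(c));
            } else if (!word.empty()) {
                indexer.hit(TermHit{word, start});
                word.clear();
            }
            if (c == '>') inTag = false;
        }
    }

private:
    std::string text_;
};

} // namespace sketch
```

The reader knows nothing about how hits are stored, so the same indexer can be driven by readers for different formats.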
Certain fields in a document object are to be filled with information extracted directly from a field in the document through the document reader itself.
In the example, the summary field could be extracted by the reader for the transcript field.
IXE provides a protocol for instructing readers on which fields to fill.
The protocol starts with a call to method mapFields() of the document object being indexed:
video1.mapFields(reader, "transcript");
This gives the document (video1) the opportunity to tell the reader (reader) which fields it wants filled from the field named "transcript".
The document may tell the reader that it wants the field "summary" to be filled, that the field is located at address summary, and that its maximum size is the one given in the META annotation for the class:
void Video::mapFields(DocReader* dr, char const* field) {
    if (!strcmp(field, "transcript"))
        dr->fillField("summary", summary, _metaClass.find("summary")->maxLength());
}
Here _metaClass.find("summary") accesses the field "summary" in the metaclass for Video, and method maxLength() returns its maximum size.
Finally, when the document reader has extracted the summary, it fills the field in the object video1 with its value.
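The whole exchange can be sketched in a few lines. The classes below are hypothetical stand-ins, not IXE's: in particular the method deliver, the fixed-size summary buffer and the use of sizeof in place of the metaclass lookup are assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>
#include <string>

// Hypothetical sketch, not IXE code.
namespace sketch {

class DocReader {
public:
    // Called by the document: store field `name` at `dest`,
    // using at most `maxLength` bytes including the terminator.
    void fillField(std::string const& name, char* dest, std::size_t maxLength) {
        assert(dest != nullptr && maxLength > 0);
        requests_[name] = Request{dest, maxLength};
    }

    // Called by the reader once it has extracted a value for `name`.
    // Copies the value, truncated to fit, only if the document asked for it.
    bool deliver(std::string const& name, std::string const& value) {
        auto it = requests_.find(name);
        if (it == requests_.end()) return false;
        std::size_t room = it->second.maxLength - 1;
        std::size_t n = value.size() < room ? value.size() : room;
        std::memcpy(it->second.dest, value.data(), n);
        it->second.dest[n] = '\0';
        return true;
    }

private:
    struct Request {
        char* dest;
        std::size_t maxLength;
    };
    std::map<std::string, Request> requests_;
};

struct Video {
    char summary[16] = "";  // assumed fixed-size buffer for the sketch

    // Tell the reader which fields to fill while it reads `field`.
    void mapFields(DocReader* dr, char const* field) {
        if (!std::strcmp(field, "transcript"))
            dr->fillField("summary", summary, sizeof summary);
    }
};

} // namespace sketch
```

Because the document states the destination and the size limit up front, the reader can write the extracted value without knowing anything about the layout of the document class.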
Additional readers can be made available to an application simply by linking their binaries with it.
To register itself, the file that contains the implementation of a document reader must contain an invocation of the macro REGISTER_READER, whose arguments are the document type with which the reader should be associated and the static method that creates an instance of the reader on a given TextField.
For example, the TextReader can be registered for type "text" as follows:
DocReader* TextReaderFactory(TextField& tf) {
    char const* source = tf.Contents();
    Size size = tf.size();
    return new TextReader(source, size);
}
REGISTER_READER("text", TextReaderFactory);
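One common way to build such a self-registering factory is sketched below. It is an assumption about how a macro of this kind may work, not IXE's actual definition: SKETCH_REGISTER_READER, Registrar and the reduced TextField, DocReader, TextReader and ReaderMap classes are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical sketch, not IXE code.
namespace sketch {

struct TextField {
    std::string contents;  // stand-in: in-memory contents only
};

struct DocReader {
    virtual ~DocReader() = default;
    virtual std::string name() const = 0;
};

struct TextReader : DocReader {
    TextReader(char const* source, std::size_t size) : text(source, size) {}
    std::string name() const override { return "TextReader"; }
    std::string text;
};

using ReaderFactory = DocReader* (*)(TextField&);

class ReaderMap {
public:
    // Function-local static: constructed on first use, so registrations that
    // run during static initialization never see an unconstructed map.
    static ReaderMap& instance() {
        static ReaderMap map;
        return map;
    }
    void add(std::string const& type, ReaderFactory factory) {
        assert(factory != nullptr);
        factories_[type] = factory;
    }
    // Returns a new reader for `type`, or null if no reader is registered.
    std::unique_ptr<DocReader> create(std::string const& type, TextField& tf) const {
        auto it = factories_.find(type);
        if (it == factories_.end()) return nullptr;
        return std::unique_ptr<DocReader>(it->second(tf));
    }

private:
    std::map<std::string, ReaderFactory> factories_;
};

// Constructing a Registrar at namespace scope performs the registration
// before main() starts.
struct Registrar {
    Registrar(char const* type, ReaderFactory factory) {
        ReaderMap::instance().add(type, factory);
    }
};

} // namespace sketch

#define SKETCH_CONCAT_(a, b) a##b
#define SKETCH_CONCAT(a, b) SKETCH_CONCAT_(a, b)
#define SKETCH_REGISTER_READER(type, factory) \
    static sketch::Registrar SKETCH_CONCAT(sketch_registrar_, __LINE__)(type, factory)

sketch::DocReader* TextReaderFactory(sketch::TextField& tf) {
    return new sketch::TextReader(tf.contents.data(), tf.contents.size());
}

SKETCH_REGISTER_READER("text", TextReaderFactory);
```

A caller would then obtain a reader with sketch::ReaderMap::instance().create("text", field) without naming TextReader anywhere. One caveat with this technique in general: when a reader lives in a static library, the linker may drop an object file that nothing references, and the registration in it never runs.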