Tanl Linguistic Pipeline

Faceted Search

IXE provides faceted search.

Facets Indexing

A field in a document can be used as a source of facets by defining its index type as Field::facet.

For example, a book catalog entry could be defined as:

class Book : public DocInfo {
  char const*	author;
  char const*	title;
  char const*	ISBN;
  char const*	category;
  float		price;
  time_t	date;
  char const*	language;
  int		year;
  int		pages;
  BookRank*	rank;
  META(Book,
       (SUPERCLASS(DocInfo),
	VARKEY(author, 255, Field::facet),
	VARKEY(title, 800, Field::fulltext),
	VARFIELD(ISBN, 255),
	VARKEY(category, 255, Field::facet),
	KEY(price, Field::indexed),
	KEY(date, Field::indexed),
	VARKEY(language, 255, Field::facet),
	KEY(year, Field::indexed),
	KEY(pages, Field::indexed),
	KEY(rank, Field::mapped)
	)
       );
};

Indexing

During indexing, a ForwardList is collected for each facet field of a document. The list contains a temporary TermID for each term, which is just a sequence number. The list is stored in a TC database of type Hash, keyed by the DocID of the document. Values for facets are obtained through normal full-text indexing, when the following method is called:

bool ColumnIndexer::hit(TermHit& hit)

If the field is a faceted field, and the hit type is TermHit::lex, the hit term is included in the ForwardList. Other terms are indexed regularly: this allows for a faceted field to also be indexed with individual tokenized terms, so that for instance phrase search would work on the field.
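
The collection step described above can be sketched as follows. This is only an illustration under stated assumptions: FacetCollector, tempId and addFacetHit are hypothetical names, not part of the IXE API, and in-memory maps stand in for the TC hash database.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using TermID = std::uint32_t;
using DocID = std::uint32_t;

// Hypothetical stand-in for the per-field state kept while indexing.
struct FacetCollector {
  std::unordered_map<std::string, TermID> tempIds;              // term -> sequence number
  std::unordered_map<DocID, std::vector<TermID>> forwardLists;  // DocID -> facet TermIDs

  // Return the temporary TermID of a term, assigning the next sequence
  // number the first time the term is seen.
  TermID tempId(const std::string& term) {
    auto it = tempIds.find(term);
    if (it != tempIds.end()) return it->second;
    TermID id = static_cast<TermID>(tempIds.size());
    tempIds.emplace(term, id);
    return id;
  }

  // Record a whole-value hit on a facet field of document doc.
  void addFacetHit(DocID doc, const std::string& term) {
    assert(!term.empty());
    forwardLists[doc].push_back(tempId(term));
  }
};
```

With this sketch, two books by the same author end up with the same temporary TermID in their forward lists, while a new author receives the next sequence number.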

When the index is committed, the forward list databases are scanned and the temporary TermIDs are replaced by their final value, which is their position in the Lexicon. The IDs are stored compressed as eptacoded deltas.
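
As a rough illustration of the commit step, the sketch below replaces temporary IDs with final ones and stores the result as compressed deltas. The exact eptacode format is not described here, so a generic 7-bit variable-byte code is used as a stand-in; the function names are hypothetical, and sorting the list before delta coding is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using TermID = std::uint32_t;

// Replace temporary IDs by final ones and sort, so that deltas are non-negative.
std::vector<TermID> finalizeList(const std::vector<TermID>& tempList,
                                 const std::vector<TermID>& tempToFinal) {
  std::vector<TermID> out;
  out.reserve(tempList.size());
  for (TermID t : tempList) {
    assert(t < tempToFinal.size());
    out.push_back(tempToFinal[t]);
  }
  std::sort(out.begin(), out.end());
  return out;
}

// Encode sorted IDs as deltas, 7 bits per byte, low-order group first;
// the high bit of a byte means that more bytes follow.
std::vector<std::uint8_t> encodeDeltas(const std::vector<TermID>& ids) {
  std::vector<std::uint8_t> out;
  TermID prev = 0;
  for (TermID id : ids) {
    assert(id >= prev);  // input must be in ascending order
    TermID delta = id - prev;
    prev = id;
    while (delta >= 0x80) {
      out.push_back(static_cast<std::uint8_t>((delta & 0x7F) | 0x80));
      delta >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(delta));
  }
  return out;
}

// Inverse of encodeDeltas: rebuild the IDs by summing the decoded deltas.
std::vector<TermID> decodeDeltas(const std::vector<std::uint8_t>& bytes) {
  std::vector<TermID> ids;
  TermID prev = 0, delta = 0;
  int shift = 0;
  for (std::uint8_t b : bytes) {
    delta |= static_cast<TermID>(b & 0x7F) << shift;
    if (b & 0x80) { shift += 7; continue; }
    prev += delta;
    ids.push_back(prev);
    delta = 0;
    shift = 0;
  }
  return ids;
}
```

Small deltas fit in a single byte, which is why storing differences between consecutive IDs tends to be more compact than storing the IDs themselves.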

Dealing with partial indexes requires additional bookkeeping.

When the WordThreshold is reached, a partial index is created, storing the full-text index for the documents analyzed so far. Information on terms is kept in a WordMap, which associates each term with a sequence number and a number of occurrences. When creating the partial index, the WordMap is used to extract the words in sorted order, and the position of each word in that order is used as its temporary TermID. A lexicon is built for each column, associating each term with its TermID. The forward lists are updated as well, replacing the temporary TermIDs with the ones from the lexicon.
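
The renumbering from sequence numbers to sorted positions might look like the sketch below. WordInfo and buildSeqToSorted are hypothetical names, and a std::map is assumed in place of the actual WordMap so that iteration yields the terms in sorted order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using TermID = std::uint32_t;

struct WordInfo {
  TermID seq;         // sequence number assigned when the term was first seen
  std::size_t count;  // number of occurrences
};

// std::map iterates its keys in sorted order, so the iteration rank of a
// term can serve as its TermID in the lexicon of the partial index.
// Returns a table mapping sequence number -> sorted TermID.
std::vector<TermID> buildSeqToSorted(const std::map<std::string, WordInfo>& wordMap) {
  std::vector<TermID> seqToSorted(wordMap.size());
  TermID rank = 0;
  for (const auto& entry : wordMap) {
    assert(entry.second.seq < seqToSorted.size());
    seqToSorted[entry.second.seq] = rank++;
  }
  return seqToSorted;
}
```

The resulting table can then be applied to every entry of a forward list to replace sequence numbers with lexicon positions.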

In order to do this, each forward list database is scanned from the start. However, a partial index only contains a portion of the docs, while the forward list database contains all of them, so the lists for docs from previous partial indexes are skipped by maintaining a count of how many docs the previous indexes contained.

When the partial indexes are merged, a table of mappings, tableIndex[], is created, which maps the TermIDs from each partial index to the corresponding TermIDs in the final index. The forward lists are scanned again and the TermIDs are updated accordingly.
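
A minimal sketch of the merge-time update follows, assuming the forward lists are held in memory in DocID order and that tableIndex is a two-level table indexed by partial index and then by TermID. The name remapForwardLists and the docsPerPartial parameter are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using TermID = std::uint32_t;
using ForwardList = std::vector<TermID>;

// forwardLists holds one list per document, in DocID order.
// docsPerPartial[p] is the number of documents covered by partial index p.
// tableIndex[p][t] is the final TermID for TermID t of partial index p.
void remapForwardLists(std::vector<ForwardList>& forwardLists,
                       const std::vector<std::size_t>& docsPerPartial,
                       const std::vector<std::vector<TermID>>& tableIndex) {
  assert(docsPerPartial.size() == tableIndex.size());
  std::size_t doc = 0;  // first document of the current partial index
  for (std::size_t p = 0; p < docsPerPartial.size(); ++p) {
    std::size_t end = doc + docsPerPartial[p];
    assert(end <= forwardLists.size());
    for (; doc < end; ++doc) {
      for (TermID& t : forwardLists[doc]) {
        assert(t < tableIndex[p].size());
        t = tableIndex[p][t];  // rewrite in place with the final TermID
      }
    }
  }
}
```

The running doc counter plays the role of the count of docs in previous partial indexes: it determines which mapping table applies to each forward list.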

Search

Faceted search can be expressed with additional query parameters. For example:

http://server/search?q=title:energy&fc=author

looks for books with energy in the title and collects the values of the author field in the results, so that they can be displayed to the user as suggestions for refining the query.

This is handled in three steps:

In order to collect facets, the results are scanned and their IDs are used as indexes into the forward lists of each requested facet. The values for those facets are collected in a map and counted.
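
The counting step can be sketched as below for a single facet. This assumes in-memory forward lists indexed by DocID and a lexicon vector that maps a TermID back to its term; countFacetValues is a hypothetical name, not an IXE function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using DocID = std::uint32_t;
using TermID = std::uint32_t;

// Count how often each facet value occurs among the result documents.
std::map<std::string, std::size_t> countFacetValues(
    const std::vector<DocID>& results,
    const std::vector<std::vector<TermID>>& forwardLists,
    const std::vector<std::string>& lexicon) {
  std::map<std::string, std::size_t> counts;
  for (DocID doc : results) {
    assert(doc < forwardLists.size());
    for (TermID t : forwardLists[doc]) {
      assert(t < lexicon.size());
      ++counts[lexicon[t]];  // operator[] starts a new value at zero
    }
  }
  return counts;
}
```

Only the documents in the result set are visited, so the cost of collecting facets grows with the number of results rather than with the size of the collection.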

Facets are displayed using a suitable result template. For example:

<dl>
<ixe:foreach var='f' in='facets'>
  <dt>$(f.facet)</dt>
  <dd>
    <ixe:foreach var='x' in='$(f.values)'>
      <a href="javascript:queryExpand('$(f.facet)', '$(x.name)')">$(x.name)</a> ($(x.count)),
    </ixe:foreach>
  </dd>
</ixe:foreach>
</dl>

This template builds a description list that shows, for each facet, its values with their counts in parentheses. Each value links to a refined query, in which that facet value is added to the current query.

Copyright © 2005-2011 G. Attardi. Generated on 4 Mar 2011 by doxygen 1.6.1.