Software Architecture

The software architecture of SemaWiki consists of a pipeline of modules for:

  • Text Extraction from Wikipedia
  • Tagging
  • Indexing

The architecture of the tagging modules is based on a streaming metaphor, as illustrated in this diagram. Each module in the pipeline produces a stream of tokens, which the following stages consume.
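The streaming idea can be sketched in C++ as a chain of pull-style stages, where each stage asks the one before it for the next token. This is only an illustration under stated assumptions: the names SketchToken, TokenStream, WordSource, ToyTagger and drain are hypothetical and are not part of the SemaWiki code, and the tagging rule is a toy.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal token, used only for this sketch.
struct SketchToken {
  int         id = 0;   // sequence number
  std::string form;     // word form
  std::string pos;      // annotation slot filled in by a later stage (assumed)
};

// Hypothetical pull-style stream: next() returns false when exhausted.
struct TokenStream {
  virtual ~TokenStream() {}
  virtual bool next(SketchToken& tok) = 0;
};

// Source stage: emits one token per word of a pre-split sentence.
class WordSource : public TokenStream {
public:
  explicit WordSource(std::vector<std::string> words)
    : words_(std::move(words)), pos_(0) {}

  bool next(SketchToken& tok) override {
    if (pos_ >= words_.size()) return false;
    tok.id = static_cast<int>(pos_) + 1;
    tok.form = words_[pos_++];
    tok.pos.clear();
    return true;
  }

private:
  std::vector<std::string> words_;
  std::size_t pos_;
};

// Downstream stage: consumes upstream tokens and adds an annotation.
class ToyTagger : public TokenStream {
public:
  explicit ToyTagger(TokenStream& upstream) : upstream_(upstream) {}

  bool next(SketchToken& tok) override {
    if (!upstream_.next(tok)) return false;
    // Toy rule, for illustration only: capitalized forms get "NNP".
    bool capitalized = !tok.form.empty() &&
        std::isupper(static_cast<unsigned char>(tok.form[0]));
    tok.pos = capitalized ? "NNP" : "X";
    return true;
  }

private:
  TokenStream& upstream_;
};

// Pulls every token out of the last stage of a pipeline.
std::vector<SketchToken> drain(TokenStream& stream) {
  std::vector<SketchToken> out;
  SketchToken tok;
  while (stream.next(tok)) out.push_back(tok);
  return out;
}
```

Constructing a WordSource, wrapping it in a ToyTagger and calling drain on the tagger pulls tokens through both stages one at a time, so no stage needs to hold the whole input in memory.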

Tokens are represented in a flexible way, so that each stage can add further annotations to be passed along to later stages.

A base token has the following interface:

struct TokenBase {
  int                   id;     ///< sequence number
  std::string           form;   ///< word form
};

Generic tokens have an extensible set of features, consisting of attributes and links:

struct Token : public TokenBase {
   Attributes attributes;   ///< extensible set of attributes
   Links links;             ///< extensible set of links
};

typedef std::string Attribute;

struct Link {
   int target;          ///< the ID of the target of the link
   std::string label;   ///< the label for the link
};

typedef std::map<char const*, Attribute> Attributes;
typedef std::map<char const*, Link> Links;
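As a rough sketch of how a stage might add annotations to a token, the declarations can be put in compilable order and filled in as follows. Several things here are assumptions rather than the actual SemaWiki code: the helpers makeToken, annotatePos and annotateHead are hypothetical, the feature names "POS" and "HEAD" are illustrative, and the maps are keyed by std::string instead of char const* so that lookup compares key contents rather than pointer values.

```cpp
#include <map>
#include <string>

struct TokenBase {
  int         id;     // sequence number
  std::string form;   // word form
};

typedef std::string Attribute;

struct Link {
  int         target;  // the ID of the target of the link
  std::string label;   // the label for the link
};

// Assumption: std::string keys, unlike the char const* keys shown above.
typedef std::map<std::string, Attribute> Attributes;
typedef std::map<std::string, Link>      Links;

struct Token : public TokenBase {
  Attributes attributes;
  Links      links;
};

// Hypothetical helper: builds a token with no annotations yet.
Token makeToken(int id, const std::string& form) {
  Token tok;
  tok.id = id;
  tok.form = form;
  return tok;
}

// Hypothetical stage step: a tagger records a part-of-speech attribute.
void annotatePos(Token& tok, const std::string& pos) {
  tok.attributes["POS"] = pos;
}

// Hypothetical stage step: a parser records a link to the head token.
void annotateHead(Token& tok, int headId, const std::string& label) {
  Link link;
  link.target = headId;
  link.label = label;
  tok.links["HEAD"] = link;
}
```

Because attributes and links are open-ended maps, a later stage can read what earlier stages wrote and add its own entries without any change to the Token type.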

The modules can be connected into a pipeline, as described in Tanl Pipeline.