Status: 2009/9

From Medialab

The following deliverables were listed in the original project plan:

  1. Training corpus for Italian dependency parsing
  1. Visual annotation tool
  1. Dependency parser for Italian
  1. Prototype Question Answering system for Italian
  1. Prototype of First Topic Detection on news flow

We will add to these a set of Specifications for Italian NLP and replace the last one with the Tanl Pipeline.

The deliverables will then become:

  1. Specifications for Italian corpora annotation
  1. Training corpus for Italian dependency parsing
  1. Visual annotation tool
  1. Dependency parser for Italian
  1. Prototype Question Answering system for Italian
  1. Natural language processing pipeline


Status

Specifications

These are ready and include the Tanl specifications for POS and Dependency, together with the corresponding Guidelines for Annotation.

Corpora

The corpora include:

  1. Repubblica POS
  2. ISST Tanl
  3. Named Entity corpus for Italian
  4. Super Sense training corpus
  5. Sentence Splitter Italian training corpus

Parser

The DeSR parser has been completed and participated successfully in the CoNLL 2007 and 2008 shared tasks.

A model for Italian will be built from a corpus based on Tanl ISST and extended through a self-training process.


Visual Annotator

The DgAnnotator is available.

An Interactive Parser Simulator has been released that lets users visualize the parsing process interactively.

Italian Question Answering

An enriched index of the Italian Wikipedia has been built that includes tags for LEMMA, POS, Dependency, NER, SST.

The index can be queried online, using a service called Deep Search.

A simple interface to Deep Search will be developed that parses simple natural language queries and translates them into Deep Search queries, in order to illustrate the feasibility of natural language QA.

The techniques developed in the project have been used in Yahoo! Correlator and in another forthcoming demo on Yahoo! Answers.

Tanl Pipeline

The Tanl pipeline consists of the following tools, which can be combined through Python scripts to build a variety of complex applications.

  1. Wiki Extractor
  2. Sentence Splitter
  3. Tokenizer
  4. POS Tagger
  5. Lemmatizer
  6. Morph Splitter
  7. Dependency parser
  8. Named Entity Recognizer
  9. Super Sense Tagger

Most of the tools are written in C++, and a Python wrapper is available so that they can be called from Python.
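As an illustration of how such tools might be combined through a Python script, the following minimal sketch chains three stages as generators, each consuming the output of the previous one. It does not use the actual Tanl Python wrapper: the functions sentence_splitter, tokenizer, pos_tagger and build_pipeline are hypothetical toy stand-ins written only for this example, and the tags they produce are placeholders.

```python
import re
from functools import reduce

# NOTE: every function here is a hypothetical stand-in, not the Tanl API.

def sentence_splitter(texts):
    """Toy stand-in: split each text after sentence-final punctuation."""
    for text in texts:
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                yield sentence

def tokenizer(sentences):
    """Toy stand-in: split each sentence into word and punctuation tokens."""
    for sentence in sentences:
        yield re.findall(r"\w+|[^\w\s]", sentence)

def pos_tagger(token_lists):
    """Toy stand-in: attach a placeholder tag to every token."""
    for tokens in token_lists:
        yield [(tok, "X" if tok[0].isalnum() else "PUNCT") for tok in tokens]

def build_pipeline(*stages):
    """Compose stages so that each one consumes the previous stage's output."""
    def run(source):
        return reduce(lambda stream, stage: stage(stream), stages, source)
    return run

pipeline = build_pipeline(sentence_splitter, tokenizer, pos_tagger)
result = list(pipeline(["Il gatto dorme. Il cane corre!"]))

for tagged_sentence in result:
    print(tagged_sentence)
```

Because every stage is a generator, sentences flow through the chain one at a time, so a long input such as a Wikipedia dump would not need to be held in memory in full. Swapping a stage or appending a new one only changes the argument list of build_pipeline.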