Activities: 2009

September 2009

The following activities are being carried out in order to complete the deliverables.

Corpora

Dependency TreeBank

The ISST Tanl TreeBank, consisting of approximately 80,000 tokens, will be extended to up to 500,000 tokens by a process of Self Training on the Italian Wikipedia.

The process consists in parsing Wikipedia with a parser trained on the base ISST TreeBank, selecting, among the over 5 million sentences, a few thousand that are very likely to have been analyzed correctly, adding them to the corpus, and retraining the parser on the extended corpus.
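
As a rough illustration, the loop can be outlined as follows; train, parse and select here are placeholders for the actual DeSR training procedure, the parser itself and the selection criteria discussed below, so this is only a minimal sketch.

  # Minimal outline of the self-training loop (Python sketch).
  def self_train(base_corpus, unlabeled_sentences, train, parse, select, rounds=1):
      """Extend a treebank with automatically parsed sentences.

      train  -- callable that builds a parser from a list of annotated sentences
      parse  -- callable that applies a parser to one raw sentence
      select -- callable that keeps only the parses that are very likely correct
      """
      corpus = list(base_corpus)
      for _ in range(rounds):
          parser = train(corpus)                        # train on the current corpus
          parses = [parse(parser, s) for s in unlabeled_sentences]
          corpus.extend(select(parses))                 # add likely-correct parses
      return train(corpus)                              # final parser on the extended corpus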

Self Training is a variant of Active Learning: while Active Learning selects new examples from a large collection of unannotated text and submits them to an annotator for labeling, Self Training does not involve an external annotator. It uses the output of the system itself, possibly with a confidence score produced by the system, plus additional evidence from other sources, in order to perform the selection.

In our case we have been testing two alternative approaches:

  1. use an alternative parser

The approach consists in parsing the unannotated collection with the DeSR parser trained on ISST and with another parser trained on the Catalan corpus. The selection criterion consists in choosing short sentences (of length 3-8) whose parse trees have the same structure according to both parsers. Attardi has done some encouraging experiments in this direction.
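
As an illustration of this criterion, assuming both parsers emit CoNLL-X output (tab-separated columns with the head index in the seventh field), the agreement check might be sketched along these lines; the function names are only illustrative.

  def heads(conll_sentence):
      # Extract the head indices (7th CoNLL-X column) of a parsed sentence.
      return [line.split("\t")[6] for line in conll_sentence.strip().split("\n")]

  def agreeing_short_sentences(parses_a, parses_b, min_len=3, max_len=8):
      # Keep sentences of 3-8 tokens whose unlabeled trees coincide in both parses.
      selected = []
      for sent_a, sent_b in zip(parses_a, parses_b):
          h_a, h_b = heads(sent_a), heads(sent_b)
          if min_len <= len(h_a) <= max_len and h_a == h_b:
              selected.append(sent_a)
      return selected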

  2. use corpus statistics

This approach consists in computing a plausibility score for each sentence and selecting the sentences with the highest scores. The plausibility is a measure of how likely the sentence is to contain meaningful relations. In particular, it is based on statistical frequency measures of co-occurrences of words in the large unannotated collection.

Felice Dell'Orletta has done some experiments in this direction. Hayouan Li is refining the approach using slightly different scoring criteria.

Still further experiments are required in order to tune the scoring function and possibly to incorporate other criteria, e.g. Pointwise Mutual Information as evidence that two words are meaningful together.
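
As a rough example of such a criterion, the sketch below estimates PMI from sentence-level co-occurrence counts in the unannotated collection and scores a sentence by the average PMI of its word pairs; the scoring functions actually under evaluation are more refined than this.

  import math
  from collections import Counter
  from itertools import combinations

  def build_pmi(sentences):
      # Estimate PMI of word pairs from sentence-level co-occurrence counts.
      word_df, pair_df, n = Counter(), Counter(), 0
      for tokens in sentences:
          vocab = set(tokens)
          word_df.update(vocab)
          pair_df.update(frozenset(p) for p in combinations(sorted(vocab), 2))
          n += 1
      def pmi(w1, w2):
          pair = frozenset((w1, w2))
          if pair not in pair_df:
              return 0.0
          return math.log((pair_df[pair] / n) / ((word_df[w1] / n) * (word_df[w2] / n)))
      return pmi

  def plausibility(tokens, pmi):
      # One possible score: average PMI over the word pairs of the sentence.
      pairs = list(combinations(sorted(set(tokens)), 2))
      return sum(pmi(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0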

Another aspect of the problem is to filter out almost identical sentences, which occur frequently in Wikipedia as the product of templates. For instance:

Mancano 2 giorni alla fine dell'anno. ("2 days remain until the end of the year.")
Il 6 agosto è il 218° giorno del Calendario Gregoriano. ("August 6 is the 218th day of the Gregorian Calendar.")

appear 366 times, once for each day of the year. In order to avoid adding them to the corpus, and therefore skewing the statistical distribution, we are removing near duplicates by a process of shingling, i.e. computing a set of hashes for each sentence that allows near duplicates to be identified efficiently. A special variant of the technique has been developed for dealing with sentences, since the technique is usually applied to whole documents.
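
A minimal sketch of the idea, assuming word-trigram shingles and a Jaccard similarity threshold (the actual variant developed for sentences uses different details), could look like this:

  import hashlib

  def shingles(sentence, k=3):
      # Fingerprint the word k-grams (shingles) of a sentence.
      tokens = sentence.split()
      grams = [" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))]
      return {hashlib.md5(g.encode("utf-8")).hexdigest()[:8] for g in grams}

  def near_duplicates(s1, s2, k=3, threshold=0.8):
      # Two sentences are near duplicates when the Jaccard similarity
      # of their shingle sets exceeds the threshold.
      a, b = shingles(s1, k), shingles(s2, k)
      union = a | b
      return bool(union) and len(a & b) / len(union) >= threshold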

Named Entity Tagger

The NER tagger for English developed by Attardi achieves state-of-the-art accuracy. The version for Italian has been developed by Dell'Orletta, who is expected to deliver the final version.

Super Sense Tagger

A Super Sense Tagger, as well as a corpus for training it, has been developed by Stefano Dei Rossi with the involvement of Giulia. Dei Rossi is expected to produce the final release of the corpus and the tagger.

Question Answering

  • The indexer is operational on Linux, using a new B+ tree implementation provided by TokyoCabinet.

Tamberi is involved in tuning and fixing it.

  • Deep Search is available online on paleo. It will have to be updated to use the new index with the addition of NER and SST tags.
  • Demo. A demo of Deep Search will be prepared for presentation to the Fondazione.
  • Question Answering Prototype.

Maria Simi has developed a special corpus of questions, parsed automatically and then corrected manually. A specialized parser trained on this corpus achieves an accuracy close to 90%. It will be used as part of a prototype of Piqasso, which will parse each question, classify it according to question type and turn it into a Deep Search query, in order to perform semantic search on the annotated Wikipedia.
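
The details of the prototype are still being worked out; purely as an illustration, assuming a simple rule-based question classifier and a hypothetical query representation in which the expected answer type constrains the annotations searched for, the analysis step might look like this:

  # Illustrative only: the question-type rules and the query format below are
  # assumptions, not the actual Piqasso implementation.
  QUESTION_TYPES = {
      "chi": "PERSON",       # "who"      -> expect a person entity
      "dove": "LOCATION",    # "where"    -> expect a location entity
      "quando": "DATE",      # "when"     -> expect a temporal expression
      "quanti": "NUMBER",    # "how many" -> expect a numeric answer
  }

  def classify_question(question):
      # Guess the expected answer type from the initial question word.
      first = question.lower().split()[0] if question.split() else ""
      return QUESTION_TYPES.get(first, "OTHER")

  def to_deep_search_query(question):
      # Turn a question into keywords plus an expected-answer-type constraint.
      words = question.rstrip("?").split()
      return {"keywords": words[1:], "answer_type": classify_question(question)}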



Data Preparation

In order to complete the final deliverables, several data sets must be produced, which require lengthy processing.

These include:

  1. processing the latest version of the Italian Wikipedia with the latest tools

Parsing, NER tagging, SST tagging and indexing need to be performed. In order to speed up the process, Fuschetto and Tamberi will set up the Tanl pipeline on the Maranello cluster, using native Windows binaries. Tamberi will take care of updating the DeSR Visual Studio solution so that it compiles on Windows.

Similarly, a cluster of virtual machines will be set up. Stefano Palla will provide instructions on how to generate a Linux kernel incorporating the suitable Hyper-V Linux integration components.