Technical Annex

From Medialab


The project aims at developing tools and techniques to facilitate access to high-quality Web documents related to Italian culture, chosen according to different criteria of interest. Given the high dynamicity and fast evolution of material on the Web, automated tools are needed for classification, knowledge extraction, and the building of catalogues, annotations and navigation structures. Moreover, machine translation is required in order to make the material available in other languages.

The project aims at producing the following results. The experience gained in designing and producing the architecture of the Italian Library, based on the OAIS (Open Archival Information System) logical model, makes it possible to design an exact and methodical indexing and evaluation of the websites related to the Italian textual heritage. The goal of the project is to develop these indexing and evaluation procedures and to make available, for each website and for each document, an analytical description of the content and its ranking as to scholarly reliability and accordance with international standards, in terms of both librarianship (including the preservability of the digital record) and scholarship (philological requirements, etc.). The same principles will be followed in evaluating websites concerning the Italian language, current cultural events and art.

Meta Search Engine

Design and implementation of a (meta-)search engine for Italian Culture on the Web, offering algorithmic solutions at the state of the art and based on the linguistic resources produced by the partners of this project. The search engine will offer the user the possibility to acquire various “points of view” on all the results returned for a search, without scanning through all of them, but by looking at labelled folders organized hierarchically and formed on the fly as a function of their topics [FG04,FG05,B07,Z07], of categories in an ontology [CFM07], or of authoritative domains selected by our partners. This way, the user can improve his/her query by narrowing its scope or by extracting new knowledge.
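As an illustration of the folder-based presentation described above, the following Python sketch groups result snippets into labelled folders using shared salient terms. It is a toy simplification (the stop-word list and snippets are invented for the example), not the project's actual clustering algorithm:

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "in", "and", "on", "for", "to", "is"}

def label_folders(snippets, max_folders=3):
    """Group search-result snippets into labelled folders on the fly.

    Each folder is labelled by a salient term shared by its snippets;
    a snippet may appear in several folders (soft clustering).
    """
    # Count in how many snippets each content word occurs.
    df = defaultdict(int)
    tokenized = []
    for s in snippets:
        words = {w for w in s.lower().split() if w not in STOPWORDS}
        tokenized.append(words)
        for w in words:
            df[w] += 1
    # Salient labels: terms shared by more than one snippet,
    # most frequent first (alphabetical tie-break for determinism).
    labels = sorted((w for w in df if df[w] > 1),
                    key=lambda w: (-df[w], w))[:max_folders]
    return {lab: [s for s, words in zip(snippets, tokenized) if lab in words]
            for lab in labels}

snippets = [
    "Dante Alighieri and the Divine Comedy",
    "manuscripts of the Divine Comedy in Florence",
    "Dante exhibitions in Florence museums",
]
folders = label_folders(snippets)
# e.g. folders["dante"] holds the two snippets mentioning Dante
```

A real engine would of course label folders with multi-word phrases and nest them hierarchically, but the principle of on-the-fly grouping by shared themes is the same.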

This design activity will be accompanied by basic research on algorithms and data structures for text mining and compressed indexing [FM05,FGMM05,FLMM06], with the goal of proposing innovative solutions for searching the large amount of textual data our tools will need to cope with.

Tools for Knowledge Extraction

Design and development of software components for knowledge extraction from Italian Culture Web pages. This goal will be pursued by combining stochastic Machine Learning algorithms with advanced Natural Language Processing techniques, in line with the state of the art in computational linguistics. In particular, two software components will be developed: a) one for extracting terminologies and structuring the acquired terms into proto-conceptual structures, and b) one for semantically annotating documents on the basis of a reference ontology. The final result of the terminology extraction and structuring process will be used as the starting point for the construction and/or customization of ontologies by domain experts. Semantic annotation of documents will be carried out with reference to the domain ontologies semi-automatically built within the project; moreover, the semantic mark-up will be exploited by the (meta-)search engine for semantic indexing, classification, querying and navigation of documents.

Statistical Machine Translation

Development of an Italian-English machine translation system, based on statistical and machine learning techniques. In recent years these techniques have produced significant breakthroughs in the analysis of natural language texts, and they have already been applied successfully by the proponents in the development of state-of-the-art lexical analyzers (POS taggers) and grammatical analyzers (parsers) for the Italian language [A06, ADSCC07, AS07]. For machine translation we intend to follow a variant of the Context-Based Machine Translation approach, which makes use of an extended bilingual dictionary and exploits a large text corpus in the target language to learn how to combine short sequences of translated words into complete sentences that preserve the context of the original language. The proposed variant consists in using treelets, i.e. fragments of syntactic trees, instead of simple term sequences in the composition, in order to better reconstruct the grammatical structure of the translated sentences. This approach has the advantage of not requiring the huge parallel corpora needed by other current approaches to machine translation, while maintaining similar benefits in scalability, that is, the possibility of improving the quality of the translation as the learning corpus grows.

State of the Art

Although the efficiency and effectiveness of search engines have increased significantly, the search for relevant information on the Web is still difficult because (1) the Web is huge, heterogeneous, dynamic and uncontrolled; and (2) users issue imprecise queries and look at few results. Many researchers and software companies are addressing difficulty (1) by designing meta-search engines that combine results coming from various sources (e.g. Google, Yahoo, MSN, ...) with the goal of improving the recall and the freshness of the query results. Nonetheless, it is well known that this meta-approach incurs two further difficulties:

  1. it needs sophisticated algorithms to properly combine the multi-lists of results returned by the queried search engines, in order to produce one unique list;
  2. this list becomes longer and thus more difficult for users to visualize and analyze.
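One simple way to combine the multi-lists into a unique list is Borda-count rank fusion, in which each engine awards points inversely proportional to rank. The sketch below is an illustrative baseline (engine names and URLs are invented), not necessarily the aggregation scheme a production meta-engine would adopt:

```python
from collections import defaultdict

def borda_merge(ranked_lists):
    """Merge ranked result lists from several search engines into one.

    Simple Borda-count aggregation: in a list of n results, the item at
    rank r earns n - r points; items are then sorted by total points
    (alphabetical tie-break for determinism).
    """
    scores = defaultdict(float)
    for results in ranked_lists:
        n = len(results)
        for rank, url in enumerate(results):
            scores[url] += n - rank   # the top result gets the most points
    return sorted(scores, key=lambda u: (-scores[u], u))

engine_a = ["dante.it", "crusca.it", "treccani.it"]
engine_b = ["treccani.it", "dante.it", "uffizi.it"]
merged = borda_merge([engine_a, engine_b])
# "dante.it" ranks first: high in both input lists
```

Real rank-fusion schemes also weight engines by reliability and normalize for lists of different lengths, but the core idea of turning several partial rankings into one is captured here.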

If we add to these difficulties the fact that users are lazy in looking at search results (see (2) above), it is not surprising that current research on Web Information Retrieval mainly addresses the design of post-processing tools that enrich the flat list of results with novel “points of view” and/or propose "query suggestions" that help users refine their searches. In this project we will design a meta-search engine for Italian Culture on the Web, starting from our expertise in Information Retrieval over various data types: Web pages [FG04,FG05], text collections [FG05,FGMM05], XML [FLMM06], and biological data [CFM07]. We will also dig into the recent literature on “semantic” search engines, which are based on linguistic and ontology resources [B07,Z07].

The problem of automatically extracting relevant information out of the enormous and steadily growing amount of electronic text data is becoming more and more pressing. To overcome this problem, various technologies for information management systems have been explored within the Natural Language Processing (NLP) and AI community. Two promising lines of research are represented by the investigation and development of technologies for a) ontology learning from document collections, and b) knowledge mark-up of texts.

Ontology learning is concerned with knowledge acquisition from texts as a basis for the construction of ontologies, i.e. explicit and formal specifications of the concepts of a given domain and of the relations holding between them; the learning process is typically carried out by combining NLP technologies with machine learning techniques. [BUI05] organizes the knowledge acquisition process into a “layer cake” of increasingly complex subtasks, ranging from terminology extraction and synonym acquisition to the bootstrapping of concepts and of the relations linking them. Term extraction is a prerequisite for all aspects of ontology learning from text: measures for termhood assessment range from raw frequency to Information Retrieval measures such as TF-IDF, up to more sophisticated measures [FRA99],[DELLORL06]. The dynamic acquisition of synonyms from texts is typically carried out through clustering techniques as well as lexical association measures [LIN98],[ALLE03]. The most challenging research area in this domain is the identification and extraction of relationships between concepts (taxonomical and otherwise); this area presents strong connections with the extraction of relational information from texts, both relations and events (see below).
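The TF-IDF termhood measure mentioned above can be sketched as follows. The toy documents are invented for the example, and real term extraction would operate on multi-word candidates rather than single tokens:

```python
import math
from collections import Counter

def tfidf_terms(documents, top_k=3):
    """Rank candidate terms in each document by TF-IDF termhood.

    A first step of ontology learning: tokens frequent in one document
    but rare across the collection are good domain-term candidates.
    """
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each token appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    ranked = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {w: (tf[w] / len(toks)) * math.log(n_docs / df[w])
                  for w in tf}
        ranked.append(sorted(scores, key=lambda w: (-scores[w], w))[:top_k])
    return ranked

docs = ["codex codex manuscript", "fresco fresco manuscript"]
top = tfidf_terms(docs)
# "manuscript" appears everywhere, so it scores zero; the
# document-specific words "codex" and "fresco" rank first
```

More sophisticated termhood measures such as C-value/NC-value [FRA99] additionally reward multi-word terms nested inside longer candidate terms.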

Knowledge mark-up is the task of automatically identifying in texts instances of semantic classes defined in an ontology [UR05]. This task includes the recognition and semantic classification of items representing the domain's referential entities, either “named entities” or any kind of word or expression that refers to a domain-specific entity. Recently, the annotation of inter-entity relational information has become a crucial task: annotated relations range from place_of, author_of, etc. (Relation Extraction) to specific events in which entities take part, usually with predefined roles (Event Extraction). Different approaches to the problem are reported in the literature, from hand-crafted rule-based template patterns to stochastic algorithms, both supervised and unsupervised. Since most inter-entity relations are explicitly represented through the linguistic micro-structure of a text, many such approaches rely on advanced NLP techniques.

In the field of machine translation, the traditional approach [H05], so-called Rule-Based MT, is based on a set of linguistic rules formulated by specialists: in particular, grammatical rules for analyzing the texts to be translated, plus transformation and generation rules. The procedure consists of two phases: decoding the meaning of the original text, then encoding it in the target language. Most commercial systems are based on these techniques; an example is SYSTRAN, used by the European Commission.

In recent years statistical learning techniques have produced significant breakthroughs in the analysis of natural language texts. Among the statistical approaches to machine translation, the Corpus-Based approach extracts linguistic knowledge from parallel text corpora and learns how to perform translations. The translation is performed directly, without going through a representation of the meaning. The advantages are that grammars are not required and that the quality improves as the size of the corpus grows. The construction of translation systems of this kind entails: preparation of the corpora, extraction of training examples, and implementation of the decoder. This approach is used by the translation system recently introduced by Google [O07]. Recent results of the NIST evaluation campaigns [NIST06] show that statistical MT systems are able to reach a better quality, measured according to the BLEU score (Bilingual Evaluation Understudy), than current commercial products, and close to the performance of human translators. The new techniques rely on a processing infrastructure able to index and retrieve (match) billions of n-grams efficiently, rather than on complex rule systems produced with a large investment of human resources in knowledge engineering. A more recent approach, Context-Based Machine Translation (CBMT) [C06], is based on novel language processing algorithms that use a bilingual dictionary and a large text corpus in the target language in order to construct and connect sequences of words in the target language, while preserving the context of the sentence in the original language. These resources are used in two main processes: a "flooder", corresponding to a translation model, and an "n-gram overlap resolver", corresponding to a decoder. A third component discovers near-synonym sentences through an unsupervised learning process and deals with cases where the decoder fails to resolve very long n-grams. With respect to Corpus-Based MT, there is no need for parallel corpora in the two languages, while scalability is preserved.
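The n-gram overlap resolution at the heart of CBMT decoding can be illustrated with a greedy toy version that stitches translated n-grams together by maximal word overlap. The real resolver scores vast numbers of candidate combinations against the target-language corpus, so this is only a sketch of the principle:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b (in words)."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def resolve(ngrams):
    """Greedy n-gram overlap resolver: stitch translated n-grams into one
    sentence, always extending with the candidate that overlaps the current
    chain the most.  A toy sketch of CBMT decoding, not the full algorithm.
    """
    chain = list(ngrams[0])
    remaining = [list(g) for g in ngrams[1:]]
    while remaining:
        best = max(remaining, key=lambda g: overlap(chain, g))
        k = overlap(chain, best)
        chain.extend(best[k:])        # append only the non-overlapping tail
        remaining.remove(best)
    return " ".join(chain)

# Overlapping windows of a translated sentence:
sentence = resolve([("the", "divine", "comedy"),
                    ("divine", "comedy", "was"),
                    ("comedy", "was", "written")])
```

Because adjacent windows share most of their words, maximal overlaps let the decoder recover a fluent word order without ever consulting a grammar.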


[B07] H. Bast, A. Chitea, F.M. Suchanek, I. Weber: ESTER: efficient search on text, entities, and relations. Procs of ACM SIGIR, 671-678, 2007.
[CFM07] C. Corsi, P. Ferragina, R. Marangoni. The BioPrompt-box: an ontology-based clustering tool for searching in biological databases. BMC Bioinformatics, 8(Suppl 1), 2007.
[FG04] P. Ferragina, A. Gullì. The Anatomy of a Hierarchical Clustering Engine for Web-page, News and Book Snippets. IEEE Conf. Data Mining (ICDM), 395–398, 2004.
[FG05] P. Ferragina, A. Gullì. A personalized search engine based on web-snippet hierarchical clustering. Proc. Int. Conf. WWW, 801-810, 2005.
[FGMM05] P. Ferragina, R. Giancarlo, G. Manzini, M. Sciortino. Compression boosting in optimal linear time. Journal of the ACM, 52(4):688-713, 2005.
[FLMM06] P. Ferragina, F. Luccio, G. Manzini, S. Muthukrishnan. Compressing and searching XML data via two zips. Proc. Int. Conf. WWW, 751-760, 2006.
[FM05] P. Ferragina, G. Manzini. Indexing compressed texts. Journal of the ACM, 52(4):552-581, 2005.
[Z07] H. Zaragoza et al. Ranking Very Many Typed Entities on Wikipedia. Proc. CIKM, 2007.
[A06] G. Attardi. Experiments with a Multilanguage Non-Projective Dependency Parser. Proc. of the Tenth CoNLL, 2006.
[AC07] G. Attardi, M. Ciaramita. Tree Revision Learning for Dependency Parsing. Proc. of HLT/NAACL, 2007.
[AS07] G. Attardi, M. Simi. DeSR at the Evalita Dependency Parsing Task. Proc. of Workshop Evalita 2007, Intelligenza Artificiale, 4(2), 2007.
[ADSCC07] G. Attardi, F. Dell'Orletta, M. Simi, A. Chanev, M. Ciaramita. Multilingual Dependency Parsing and Domain Adaptation using DeSR. Proc. of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, Prague, 2007.
[CA07] M. Ciaramita, G. Attardi. Dependency Parsing with Second-Order Feature Maps and Annotated Semantic Information. Proc. of the 10th Int. Conf. on Parsing Technologies (IWPT 2007), Prague, 2007.
[C06] J. Carbonell, et al. 2006. Context-Based Machine Translation. Proc. 7th Conference of the Association for Machine Translation in the Americas, 19–22.
[H05] J. Hutchins. Current commercial machine translation systems and computer-based translation tools: system types and their uses. International Journal of Translation, 17(1-2), 5-38, 2005.
[M06] A. Moschitti. 2006. Making tree kernels practical for natural language learning. Proc. 11th Int. Conf. of the EACL, Trento, 2006.
[NIST06] NIST 2006 Machine Translation Evaluation: Official Results. NIST, 2006.
[O07] F.J. Och. Machine Translation at Google. HLTC 2007.
[ZACA07] H. Zaragoza, J. Atserias, M. Ciaramita, and G. Attardi. Semantically annotated snapshot of the English Wikipedia, 2007.
[BUI05] P. Buitelaar, P. Cimiano, B. Magnini (eds.). Ontology Learning from Text: Methods, Evaluation and Applications, Frontiers in Artificial Intelligence and Applications, vol. 123. IOS Press, July 2005.
[FRA99] K. Frantzi, S. Ananiadou. The C-value/NC-value domain-independent method for multi-word term extraction. Journal of Natural Language Processing, 6(3), 145-179, 1999.
[DELLORL06] F. Dell'Orletta, A. Lenci, S. Marchi, S. Montemagni, V. Pirrelli. Text-2-Knowledge: una piattaforma linguistico-computazionale per l'estrazione di conoscenza da testi. In Proceedings of the XL International Congress of the Società di Linguistica Italiana (SLI 2006), Vercelli, September 20-22, 2006. Bulzoni, Rome.
[LIN98] D. Lin. Automatic retrieval and clustering of similar words. In Proceedings of the 17th International Conference on Computational Linguistics, pages 768-774, Morristown, NJ, USA, 1998.
[ALLE03] P. Allegrini, S. Montemagni, V. Pirrelli, Example-Based Automatic Induction Of Semantic Classes Through Entropic Scores, “Linguistica Computazionale” XVI-XVII, pp.1-45, 2003.
[UR05] V. Uren, P. Cimiano, J. Iria, S. Handschuh, M. Vargas-Vera, E. Motta, F. Ciravegna. Semantic annotation for knowledge management: Requirements and a survey of the state of the art. Web Semantics: Science, Services and Agents on the World Wide Web, 4(1), 14-28, January 2006.
[LEN07] A. Lenci, S. Montemagni, V. Pirrelli, G. Venturi. NLP-based ontology learning from legal texts: a case study. In Proceedings of the II Workshop on Legal Ontologies and Artificial Intelligence Techniques (LOAIT '07), June 4, 2007, pp. 113-130.

Research Plan

The project aims at developing powerful and friendly tools for intelligent and efficient access to the Italian cultural contents existing on the Web, which are scattered and extremely large: tools that can meet users' demand in a much more precise, selective and qualified way than the existing generalist search engines do. A further pressing need is a reliable machine translation system capable of giving access in English (and later possibly in other languages as well) to all Web contents in Italian.

Therefore, the following four advanced tools will be designed, built, integrated with each other into a functional platform and made available on the Internet:

  1. a meta-search engine based on the clustering of results and specialized for Italian;
  2. a tool for knowledge extraction and structuring from text material through the use of computational linguistics technologies;
  3. a systematic indexing and qualitative evaluation of websites concerning the Italian language and culture;
  4. a new system of machine translation from Italian to English and vice versa.

Meta Search Engine

In Sect. 2.2 we commented on the computational and structural difficulties that make designing an efficient and effective search engine a hard task. Given our limited computational resources, we will implement a meta-search engine that draws its query results from commodity search engines (like Google, Yahoo, MSN, etc.) and then deploys innovative tools to post-process these results, in order to make the subsequent user analysis and visualization more efficient and effective. Specifically, these tools will draw inspiration from the following three well-known (and possibly orthogonal) approaches:

  1. Vivisimo, which enriches the list of query results with folders that are hierarchically organized and labelled with phrases capturing the theme underlying the results contained in them (see also Grokker, Kartoo, Ujiko, ...). The user can then navigate through these folders, driven by their labels, with the goal of narrowing his/her searches and/or extracting new knowledge.
  2. Google Directory, which offers two paradigms for searching over a Web ontology (i.e. DMOZ): keyword search and directory navigation. DMOZ is publicly available and contains 4 million Web pages annotated in various languages by thousands of editors all around the world.
  3. Google Co-op, which allows restricting the Web search to a set of authoritative domains selected by the user or by a community of experts on a topic.

Therefore, our objective will be to implement a “vertical” variant of these tools, specialized to work on the part of the Web related to Italian Culture; it will deploy our expertise in the design of meta-search engines and, further, the linguistic and ontology resources developed in this project. We will also investigate the design of “semantic” search engines based on a sophisticated combination of full-text searches and semantic annotations derived from a proper ontology and taggers (in the spirit of [B07,Z07]), and we will study the design of innovative tools for text indexing and mining derived from [FG04,FG05,FLMM06].

Statistical Machine Translation

An Italian-English (and English-Italian) machine translation system will be developed, based on statistical techniques and following the Context-Based Machine Translation approach. A first version of the translation system will use n-gram corpora in the two languages, from which statistical parameters will be extracted and used to choose the most adequate combination of n-grams constituting a sentence translation. The texts to be translated have to be pre-processed with a series of tools: sentence splitter, tokenizer, POS tagger, Named Entity tagger, dependency parser. To this end we will use and improve existing tools previously developed by the members of the research unit. The translation procedure will be organized in the following phases: segmentation, transliteration, retrieval of possible combinations in the corpus, score computation for each sequence, and selection of the resulting sequence. The segmentation phase requires gazetteers of special terms in the two languages, which will be constructed by gathering and combining resources available on the Internet and by using a Named Entity tagger.
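The segmentation phase can be illustrated as a moving window of n-grams over the tokenized input. The window length and stride below are free parameters chosen for the example, not values fixed by the CBMT approach:

```python
def sliding_ngrams(tokens, n=4, stride=1):
    """Segment a tokenized sentence into a moving window of n-grams,
    the first phase of the translation procedure.  Sentences shorter
    than the window are returned as a single segment.
    """
    if len(tokens) <= n:
        return [tuple(tokens)]
    return [tuple(tokens[i:i + n])
            for i in range(0, len(tokens) - n + 1, stride)]

segments = sliding_ngrams("la divina commedia fu scritta da dante".split(), n=4)
# Seven tokens with a 4-word window and stride 1 yield four
# overlapping segments, each to be translated independently.
```

The overlap between consecutive windows is what later allows the decoder to stitch the translated segments back into a single coherent sentence.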

The transliteration phase requires a bilingual dictionary containing all the inflected forms of the terms in the two languages, which will be constructed by extending traditional dictionaries. The research unit has obtained a collection of one trillion English words combined in n-grams. A similar collection is needed for Italian. Finally, a collection of near-synonyms is needed for resolving situations where the combination results are unsatisfactory. Such near-synonyms will be constructed automatically by determining co-occurrences of sentences in analogous contexts.

A more advanced version of the translation system will be based on the use of treelets, small portions of syntactic trees, instead of n-grams. This approach will facilitate the construction of grammatically and morphologically correct sentences. Treelets will be obtained by applying the multilingual parser DeSR to text collections in the two languages.


Project Coordination and Monitoring

Setup of Internet and Intranet site and Integrated Service Platform

Definition and Verification of Standards

Meta Search Engine based on Clustering of Results

Design and implementation of a (meta-)search engine for the Web pages related to Italian Culture, which offers the user the possibility to acquire various “points of view” over all search results by means of labelled folders, organized hierarchically and formed on the fly as a function of: categories in an ontology (module A), authoritative domains (module B), or their topics (module C). In detail:

  1. The first module will be inspired by Google Directory, and thus it will offer users the possibility to search and navigate an ontology obtained by combining known authoritative directories (e.g. Yahoo, DMOZ, Looksmart, ...) with the one created by our partners on Italian Culture. In developing this module we will deploy our expertise in the design of a search engine based on the Gene Ontology [CFM07].
  2. The second module will be inspired by Google Co-op and will allow users to classify and filter the search results based on an authoritative list of Web domains about Italian Culture, which will be selected by our experts or derived from publicly available authoritative directories such as DMOZ, Yahoo!, Wikipedia, LII, Looksmart, ...
  3. The third module will be inspired by SnakeT [FG04,FG05] in order to group the search results into folders, hierarchically organized and labelled with meaningful phrases that capture the theme underlying the results contained in them. Currently SnakeT bases the effectiveness of the clustering process on statistical techniques and on the use of the DMOZ ontology. This often induces a poor clustering, with numerous folders and unintelligible labels, when used on Italian texts. To improve this approach we will design more effective algorithms for label extraction and clustering. In this respect, we will use the linguistic tools and parsers implemented by our partners. We notice that the clustering offered by SnakeT is poorer than the one generated by module (A), in that it is constructed algorithmically on the fly at query time, but it is more powerful and potentially finer because it is not limited to a specific hand-made ontology.

In this project we will also investigate the application of [B07,Z07] to the “Web slice” related to Italian Culture, with the goal of constructing (on the fly) “semantic” clusters based on the annotations provided by the parsers of our partners and/or on new algorithms for text mining derived from [FG04,FG05,FLMM06]. Software development (in Java) of the meta-search engine and basic theoretical research will go hand in hand, the latter aiming at the design of novel algorithmic solutions for text compression and mining, useful to improve the above tools.

Knowledge Extraction from Textual Data

The prerequisites for semantic access to Italian Culture Web pages will be created by developing software components based on a sophisticated battery of integrated tools for Natural Language Processing (NLP), statistical text analysis and machine learning. In particular, the focus will be on the extension and customization of pre-existing software components aimed at:

  1. the extraction of terminological and conceptual information from a document base, articulated as follows:
    1. a collection of domain terms;
    2. organisation of acquired domain terms into thesaurus-like structures, i.e. conceptual taxonomies and clusters of semantically related terms (or “quasi-synonyms”);
    3. structuring of identified domain concepts into a semantic network, where concepts denoting domain-relevant entities will be linked through the events in which they typically take part with specific roles, with the final goal of constructing a conceptual map of the domain;
  2. semantic mark-up of textual documents on the basis of the reference ontology developed by domain experts, starting from the output of the ontology learning process (see point 1 above). The semantic mark-up will be concerned with the recognition and semantic categorization of a) named entities (persons, localities, organizations, etc.) as well as domain-relevant entities, and b) the relationships holding between them.
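A minimal sketch of one ingredient of this mark-up, dictionary-based entity annotation, is shown below: entities listed in a gazetteer are wrapped in inline tags carrying their semantic class. The gazetteer entries and class names are invented for illustration; the project's components would combine statistically trained taggers with the actual reference ontology rather than simple lookup:

```python
import re

# A toy gazetteer mapping surface forms to semantic classes; these
# class names are illustrative, not the project's reference ontology.
GAZETTEER = {
    "Dante Alighieri": "Person",
    "Florence": "Location",
    "Uffizi": "Organization",
}

def annotate(text):
    """Mark up gazetteer entities with inline XML-style tags,
    a minimal sketch of dictionary-based semantic annotation.
    """
    for surface, cls in GAZETTEER.items():
        text = re.sub(re.escape(surface),
                      f'<entity class="{cls}">{surface}</entity>', text)
    return text

marked = annotate("Dante Alighieri was exiled from Florence.")
```

Such inline annotations are exactly what the (meta-)search engine can later exploit for semantic indexing, classification and query of documents.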

Machine translation

The goal of this activity is to produce an MT system whose accuracy grows with the size of the training corpus, reaching beyond the current state of the art. The technique of Context-Based Machine Translation (CBMT) will be used, which requires neither parallel corpora in the two languages nor articulated and complex linguistic resources. CBMT exploits a simple translation model, which uses an extended bilingual dictionary containing all the forms of terms, and a decoder, which uses a context formed by long n-gram chains and computes overlaps between adjacent contexts. In experiments on Spanish-to-English translation [C06] this technique obtained a BLEU score of 0.695, close to the level of human translators (0.71-0.79).
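The BLEU score cited above is, in essence, a geometric mean of modified n-gram precisions multiplied by a brevity penalty. The following simplified single-sentence, single-reference version illustrates the computation; the official metric aggregates counts over a whole corpus and multiple references, and applies smoothing:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU for one candidate/reference token list pair:
    geometric mean of modified n-gram precisions (n = 1..max_n)
    times a brevity penalty for short candidates.
    """
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        total = sum(cand.values())
        # Clip each n-gram's count by its count in the reference.
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        if total == 0 or clipped == 0:
            return 0.0       # no smoothing in this toy version
        precisions.append(clipped / total)
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect translation scores 1.0, and a candidate sharing no unigrams with the reference scores 0.0; real system outputs fall in between, which is how the 0.695 figure above compares against the human range.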

In order to adapt the CBMT approach to the Italian-English pair, it is necessary to gather two collections of n-grams, one for each language, extracted from a large quantity of good-quality texts. For English it will be possible to use the 1 Tera Web n-grams corpus distributed by the Linguistic Data Consortium, while for Italian a similar corpus will have to be obtained from a large collection of Italian texts extracted from the Web. Such a collection will be gathered using a parallel crawler, and will need to be cleaned of spurious parts (HTML tags, navigation structure, etc.), normalized, split into tokens and syntactically analysed. The Italian collection will use the same data format as the LDC corpus, in order to facilitate its use with other tools and its sharing with other research groups. Finally, gazetteers will be used, i.e. lists of named entities such as proper names, geographical locations and organizations. The n-grams in the two languages, the bilingual dictionary and the gazetteers will all be indexed for fast access.
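Conceptually, the n-gram index is a frequency table over word n-grams. In the sketch below an in-memory Counter stands in for the disk-based compressed indexes that collections of billions of n-grams actually require:

```python
from collections import Counter

def build_ngram_index(corpus_sentences, n=3):
    """Build a frequency index of word n-grams from a monolingual corpus,
    the resource CBMT needs for fast n-gram matching.  An in-memory
    Counter is used here for illustration only; at Web scale the index
    must be compressed and disk-resident.
    """
    index = Counter()
    for sentence in corpus_sentences:
        toks = sentence.lower().split()
        for i in range(len(toks) - n + 1):
            index[tuple(toks[i:i + n])] += 1
    return index

idx = build_ngram_index(["the divine comedy was written by dante",
                         "the divine comedy is an epic poem"], n=3)
# idx maps each trigram to its corpus frequency
```

The decoder queries this index to check how often each candidate translated n-gram, and each overlap between adjacent candidates, actually occurs in the target language.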

The translation procedure will be based on the following phases: segmentation of the original text into n-grams, transduction of the n-grams, and overlap decoding. The transduction of n-grams uses the indices to transform a moving window of n-grams from the original text into groups of translated n-grams in the target language. The overlap decoder computes overlaps between these n-grams and those in the target-language corpus, assigning a score to each n-gram. At the end of the process a translated sentence is obtained as a combination of the highest-scoring n-grams.

In a second stage of the project, we propose to extend the CBMT technique by using treelets, i.e. grammatical tree structures obtained from the grammatical analysis of texts (parsing), instead of n-grams, which represent linear text sequences. This technique will allow dispensing with the third component of current CBMT systems, devoted to the treatment of long n-grams. To this end the multilingual parser DeSR [A06] will be used to analyze texts in the two languages, building collections of parse trees together with their relative occurrence frequencies. In order to adapt CBMT to the use of treelets, efficient indexing and tree matching techniques will be developed, based on the tree kernel approach [M06].
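A treelet, in its simplest depth-one form, is a head word together with its ordered dependents. The sketch below extracts and counts such treelets from a dependency parse represented as a head-index array; this representation is an assumption made for the example, and DeSR's actual output format may differ:

```python
from collections import Counter

def treelets(heads, tokens):
    """Extract depth-one treelets (a head with its ordered dependents)
    from a dependency parse.  heads[i] is the index of token i's head,
    with -1 marking the root, as dependency parsers commonly encode it.
    Returns a Counter of (head_word, (dependent_words...)) frequencies.
    """
    deps = {}
    for i, h in enumerate(heads):
        if h >= 0:
            deps.setdefault(h, []).append(i)
    return Counter(
        (tokens[h], tuple(tokens[d] for d in children))
        for h, children in deps.items()
    )

# "wrote" governs "Dante" and "poem"; "the" attaches to "poem".
tokens = ["Dante", "wrote", "the", "poem"]
heads = [1, -1, 3, 1]
counts = treelets(heads, tokens)
```

Accumulating such counts over a large parsed corpus yields the treelet frequency collections the advanced system needs, in direct analogy with the n-gram frequency index of the first version.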