The ISST Italian Treebank

The Italian Syntactic-Semantic Treebank (ISST) is a multi-layered annotated corpus of Italian. It is one of the main outcomes of SI-TAL, an Italian national project funded by the Italian Ministry of Science and Research (MURST) to design and develop an integrated suite of tools and resources for Italian Natural Language Processing.

The project consortium brought together Italian companies and research sites with different areas of expertise in computational linguistics (ILC-CNR/CPR, Venezia University/CVR, ITC-IRST, "Tor Vergata" University/CERTIA and Synthema). ISST was developed between 1999 and 2001 and is updated whenever needed.

Among the expected uses of ISST are the training (and/or tuning) of grammars and sense disambiguation systems, and the evaluation of language technology systems. For more details see Montemagni et al. (2003a, 2003b) and the SI-TAL web page.

ISST overall architecture

ISST has a five-level structure covering the orthographic, morpho-syntactic, syntactic and semantic levels of linguistic description. Syntactic annotation is distributed over two different levels: the constituent structure level and the dependency annotation level. The fifth level deals with lexico-semantic annotation, which is carried out as sense tagging of lexical heads (nouns, verbs and adjectives), augmented with other types of semantic information. ItalWordNet is the reference lexical resource used for the sense tagging task.

Both syntactic and lexico-semantic annotations refer to the morpho-syntactically annotated text, which in turn is linked to the orthographic file with the text and mark-up of macrotextual organisation (e.g. titles, subtitles, summary, body of article, paragraphs).
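The linking between levels described above can be pictured with a small data model. The following is only an illustrative sketch: the class names, field names, relation labels and the sense identifier are all hypothetical and do not reproduce the actual ISST file format. It assumes that the morpho-syntactic tokens carry character offsets into the orthographic text, and that the higher levels refer to tokens by identifier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    """Morpho-syntactic level: one token, linked to the orthographic text."""
    token_id: int
    form: str
    lemma: str
    pos: str          # hypothetical part-of-speech label
    char_start: int   # offset into the orthographic text
    char_end: int


@dataclass
class Constituent:
    """Constituent structure level: a labelled span of token ids."""
    label: str
    token_ids: List[int]


@dataclass
class Dependency:
    """Dependency level: a labelled head-dependent pair of token ids."""
    head_id: int
    dependent_id: int
    relation: str     # hypothetical relation label


@dataclass
class SenseTag:
    """Lexico-semantic level: a sense assigned to a lexical head."""
    token_id: int
    sense_key: str    # hypothetical identifier, not a real ItalWordNet key


@dataclass
class AnnotatedSentence:
    text: str                                   # orthographic level
    tokens: List[Token]
    constituents: List[Constituent] = field(default_factory=list)
    dependencies: List[Dependency] = field(default_factory=list)
    senses: List[SenseTag] = field(default_factory=list)

    def dangling_references(self) -> List[int]:
        """Token ids used by higher levels but missing from the token level."""
        known = {t.token_id for t in self.tokens}
        used = set()
        for c in self.constituents:
            used.update(c.token_ids)
        for d in self.dependencies:
            used.update((d.head_id, d.dependent_id))
        used.update(s.token_id for s in self.senses)
        return sorted(used - known)


sentence = AnnotatedSentence(
    text="Il gatto dorme",
    tokens=[
        Token(1, "Il", "il", "DET", 0, 2),
        Token(2, "gatto", "gatto", "NOUN", 3, 8),
        Token(3, "dorme", "dormire", "VERB", 9, 14),
    ],
    constituents=[Constituent("NP", [1, 2]), Constituent("S", [1, 2, 3])],
    dependencies=[Dependency(3, 2, "subj"), Dependency(2, 1, "det")],
    senses=[SenseTag(2, "gatto#1")],
)
```

In this sketch the two syntactic levels and the semantic level never refer to each other, only to the token level, which mirrors the layering described in the text.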

Syntactic annotation in ISST

One of the main features that sets ISST apart from other treebanks is its distributed approach to syntactic annotation. Most treebanks adopt a single syntactic representation layer.

ISST also differs from multi-level treebanks such as the Prague Dependency Treebank (PDT). Whereas the PDT annotation levels refer respectively to (a) the surface dependency relations and (b) the underlying sentence structure, the ISST syntactic annotation levels are intended to provide orthogonal and independent views of the same surface syntax. Neither ISST syntactic annotation level presupposes the other; however, combined views of the complementary information they contain can be provided, e.g. dependency information can be projected onto the constituent structure.
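As a rough illustration of what projecting dependency information onto the constituent structure could look like, the sketch below marks the head token of a constituent as the token whose own head lies outside the constituent. This is a generic technique chosen for illustration, not the procedure used in ISST; the function name, the token numbering and the toy sentence ("Il gatto dorme") are assumptions.

```python
from typing import Dict, List, Optional


def external_heads(span: List[int], head_of: Dict[int, Optional[int]]) -> List[int]:
    """Return the tokens in `span` whose head is outside the span.

    `head_of` maps each token id to the id of its head (None for the root).
    A constituent that matches a dependency subtree yields exactly one token.
    """
    inside = set(span)
    return [tok for tok in span if head_of[tok] not in inside]


# Toy dependency level: 1 "Il" -> 2 "gatto" -> 3 "dorme" (root).
head_of = {1: 2, 2: 3, 3: None}

np_heads = external_heads([1, 2], head_of)     # constituent "Il gatto"
s_heads = external_heads([1, 2, 3], head_of)   # whole sentence
```

Because the two levels are independent, a span may also yield zero or several tokens; a combined view would have to decide how to report such mismatches.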

The motivations underlying this "double-track" approach to syntactic representation range from language-specific ones (e.g. the free constituent order and the pro-drop property of Italian) to usage-oriented ones: the ISST syntactic annotation levels are intended to be exploitable both in real applications and for research purposes, and to be compatible with different approaches to syntax, whether adopted in theoretical or applied frameworks.

Corpus composition

The ISST corpus consists of 305,547 word tokens reflecting contemporary language use. It includes two different sections:

  1. a "balanced" corpus, representing general language usage, for a total of 215,606 tokens;
  2. a specialised corpus, amounting to 89,941 tokens, with texts belonging to the financial domain.
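The figures above are consistent with each other, as a quick check shows. The snippet is only a sanity check on the reported counts; the variable names are illustrative.

```python
# Token counts as reported for the two ISST sections.
SECTIONS = {"balanced": 215_606, "financial": 89_941}
REPORTED_TOTAL = 305_547

total = sum(SECTIONS.values())

# Share of each section in the whole corpus, in percent (one decimal).
shares = {name: round(100 * count / total, 1) for name, count in SECTIONS.items()}

print(total == REPORTED_TOTAL)  # True
print(shares)                   # {'balanced': 70.6, 'financial': 29.4}
```

The balanced section therefore accounts for roughly 70% of the corpus and the financial section for roughly 30%.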

The balanced corpus contains a selection of articles from different types of Italian texts, namely newspapers (La Repubblica and Il Corriere della Sera) and a number of periodicals, selected to cover a wide variety of topics (politics, economy, culture, science, health, sport, leisure, etc.) over a 10-year period (1985-1995). The financial corpus includes articles taken from Il Sole-24Ore which were published in 1994.