The ISST Italian Treebank at CoNLL-2007

Preamble

The ISST-CoNLL corpus was developed through a collaboration between the Dipartimento di Informatica of the University of Pisa and the Istituto di Linguistica Computazionale (ILC) of the National Research Council (CNR).

Source

The Italian dependency annotated corpus, developed for the CoNLL-2007 Shared Task, was derived from the Italian Syntactic-Semantic Treebank (ISST), a multi-layered annotated corpus of Italian which represents one of the main outcomes of a major Italian national project, SI-TAL.

Copyright and license

ISST-CoNLL is copyrighted material. It may be used for research purposes only and may not be distributed in any original or modified form (see the license agreement form).

Contacts

Simonetta Montemagni (Istituto di Linguistica Computazionale).

Maria Simi (Dipartimento di Informatica).

Resource download

The data can be obtained from the CoNLL Shared Task web site.

Download the README file.

Download a more complete documentation file (PDF format).

Documentation

Data format

The data adhere to the following format, with ten fields per token:
Field 1: ID Token counter, starting at 1 for each new sentence.
Field 2: FORM Word form or punctuation symbol.
Field 3: LEMMA Lemma of the word form.
Field 4: CPOSTAG Coarse-grained part-of-speech tag, based on the ILC/PAROLE tagset.
Field 5: POSTAG Fine-grained part-of-speech tag, based on the ILC/PAROLE tagset.
Field 6: FEATS Morpho-syntactic features, which depend on the POS, as detailed in the linked file.
Field 7: HEAD Non-projective head of the current token, which is either a value of ID or zero ('0').
Field 8: DEPREL Dependency relation to the non-projective head, which is 'ROOT' when the value of HEAD is zero. See Dependency relations for more information.
Field 9: PHEAD Projective head of the current token, which is always an underscore because it is not available for the Italian treebank.
Field 10: PDEPREL Dependency relation to the projective head, which is always an underscore because it is not available for the Italian treebank.
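
The ten fields above can be read with a few lines of Python. The following is a minimal sketch, not part of the distribution. It assumes the usual CoNLL shared task layout, which this page does not spell out: one token per line, fields separated by tabs, and a blank line after each sentence. The names Token, parse_token and read_sentences are hypothetical, and the sample tokens and tags are illustrative placeholders, not taken from the corpus.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    id: int        # Field 1: token counter, starting at 1 in each sentence
    form: str      # Field 2: word form or punctuation symbol
    lemma: str     # Field 3
    cpostag: str   # Field 4: coarse-grained POS tag
    postag: str    # Field 5: fine-grained POS tag
    feats: str     # Field 6: kept as a raw string here
    head: int      # Field 7: ID of the head, or 0 for the root
    deprel: str    # Field 8: 'ROOT' when head is 0
    phead: str     # Field 9: always '_' in ISST-CoNLL
    pdeprel: str   # Field 10: always '_' in ISST-CoNLL


def parse_token(line: str) -> Token:
    """Split one token line into its ten fields (assumes tab separation)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 fields, found {len(cols)}: {line!r}")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4], cols[5],
                 int(cols[6]), cols[7], cols[8], cols[9])


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Group token lines into sentences (assumes blank-line separators)."""
    sentence: List[Token] = []
    for line in lines:
        if line.strip():
            sentence.append(parse_token(line))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence


# Illustrative input only: the tags below are placeholders.
SAMPLE = (
    "1\tPiove\tpiovere\tV\tV\t_\t0\tROOT\t_\t_\n"
    "2\t.\t.\tPU\tPU\t_\t1\tpunc\t_\t_\n"
    "\n"
)

for sent in read_sentences(SAMPLE.splitlines()):
    for tok in sent:
        print(tok.id, tok.form, tok.head, tok.deprel)
```

HEAD is converted to an integer so that it can be compared directly with ID values, while PHEAD stays a string because it only ever holds an underscore in this corpus.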

Corpus composition

ISST-CoNLL is a subset of the balanced ISST corpus. It contains 79654 word tokens (of which 65016 are non-punctuation tokens) in a total of 4162 sentences, corresponding to the Corriere della Sera and periodicals partitions of ISST.

Statistics

#sentences 4162
#tokens 79654
#non-punct tokens 65016
#coarse pos tags 14
#fine pos tags 28
#deprels 21
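
Counts of this kind can be recomputed from the data files. The sketch below is one possible way to do so, under stated assumptions: tab-separated fields, blank lines between sentences, and a punctuation test based on Unicode character categories. This page does not say how punctuation tokens were identified for the official figures, so is_punct is a guess and the results may differ from the table above. All function names are hypothetical.

```python
import unicodedata
from typing import Dict, Iterable


def is_punct(form: str) -> bool:
    """Assumption: a token is punctuation if every character is Unicode punctuation."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count sentences, tokens and distinct tags in CoNLL-style lines."""
    n_sentences = n_tokens = n_nonpunct = 0
    cpostags, postags, deprels = set(), set(), set()
    in_sentence = False
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            in_sentence = False      # a blank line closes the sentence
            continue
        if not in_sentence:
            n_sentences += 1         # first token line of a new sentence
            in_sentence = True
        cols = line.split("\t")
        n_tokens += 1
        if not is_punct(cols[1]):    # Field 2: FORM
            n_nonpunct += 1
        cpostags.add(cols[3])        # Field 4: CPOSTAG
        postags.add(cols[4])         # Field 5: POSTAG
        deprels.add(cols[7])         # Field 8: DEPREL
    return {
        "sentences": n_sentences,
        "tokens": n_tokens,
        "non_punct_tokens": n_nonpunct,
        "coarse_pos_tags": len(cpostags),
        "fine_pos_tags": len(postags),
        "deprels": len(deprels),
    }


# Illustrative input only: the tags below are placeholders.
SAMPLE_LINES = [
    "1\tPiove\tpiovere\tV\tV\t_\t0\tROOT\t_\t_",
    "2\t.\t.\tPU\tPU\t_\t1\tpunc\t_\t_",
    "",
    "1\tBene\tbene\tB\tB\t_\t0\tROOT\t_\t_",
]
print(corpus_stats(SAMPLE_LINES))
```

On a real file the same function can be called as corpus_stats(open(path, encoding="utf-8")), where path and the encoding are again assumptions about the local copy of the data.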

Conversion

Conversion from the ISST corpus consisted of two steps:
  1. combining information coming from two different annotation levels;
  2. converting the ISST annotation scheme for dependency annotation into the CoNLL-2007 format.

The conversion had to cope with the fact that, in ISST, dependency relations are expressed as binary relations holding between two lexical heads belonging to major lexical classes only (i.e. non-auxiliary verbs, nouns, adjectives and adverbs). Information about grammatical words (e.g. determiners, prepositions, auxiliaries) is instead encoded as features associated with the participants in the relation.

During the conversion process, the dependency relations involving grammatical words had to be reconstructed from the original ISST annotation, and the existing dependency relations had to be revised accordingly. This was done semi-automatically by means of several conversion scripts, whose output was manually revised with the help of a graphical annotation tool. Further scripts were run to validate the consistency of the final output.
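
The validation scripts themselves are not described here. As an illustration only, a consistency check based on the rules listed under Data format might look like the sketch below. It is not the authors' code, and the function name validate_sentence is hypothetical. It takes one sentence as a list of rows, each row being the list of ten field strings for a token, and returns a list of error messages.

```python
from typing import List


def validate_sentence(rows: List[List[str]]) -> List[str]:
    """Check one sentence against the field rules given under Data format."""
    errors: List[str] = []
    n = len(rows)
    for position, cols in enumerate(rows, start=1):
        if len(cols) != 10:
            errors.append(f"token {position}: expected 10 fields, found {len(cols)}")
            continue
        tok_id, head, deprel, phead, pdeprel = (
            cols[0], cols[6], cols[7], cols[8], cols[9]
        )
        # ID must count tokens from 1 within the sentence.
        if tok_id != str(position):
            errors.append(f"token {position}: ID is {tok_id!r}")
        # HEAD must be 0 or the ID of another token in the same sentence.
        try:
            head_id = int(head)
        except ValueError:
            head_id = -1
        if not 0 <= head_id <= n:
            errors.append(f"token {position}: HEAD {head!r} out of range")
        elif head_id == position:
            errors.append(f"token {position}: token is its own head")
        elif head_id == 0 and deprel != "ROOT":
            errors.append(f"token {position}: HEAD is 0 but DEPREL is {deprel!r}")
        # PHEAD and PDEPREL are not available for this treebank.
        if phead != "_" or pdeprel != "_":
            errors.append(f"token {position}: PHEAD/PDEPREL must be '_'")
    return errors


# Illustrative input only: the tags below are placeholders.
good = [
    ["1", "Piove", "piovere", "V", "V", "_", "0", "ROOT", "_", "_"],
    ["2", ".", ".", "PU", "PU", "_", "1", "punc", "_", "_"],
]
print(validate_sentence(good))
```

The check covers only what the format description states. It does not test for cycles in the dependency graph or for tag values outside the tagset, which a fuller validator would probably also cover.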

An intermediate XML format was produced in this process, preserving original annotations that could not be accommodated in the CoNLL format.

Acknowledgements

Isidoro Barraco and Patrizia Topi did most of the work of writing conversion scripts and revising tags.
Kiril Ribarov, Alessandro Lenci and Giuseppe Attardi contributed useful discussions on critical issues.