The ISST Italian Treebank at CoNLL-2007

Preamble

The ISST-CoNLL corpus was developed through cooperation between the Dipartimento di Informatica of the University of Pisa and the Istituto di Linguistica Computazionale (ILC) of the National Council for Research (CNR).

Source

The Italian dependency-annotated corpus developed for the CoNLL-2007 Shared Task was derived from the Italian Syntactic-Semantic Treebank (ISST), a multi-layered annotated corpus of Italian that is one of the main outcomes of SI-TAL, a major Italian national project.

Copyright and license

ISST-CoNLL is copyrighted material which can be used for research purposes only and which cannot be distributed in any original or modified form (see the licence agreement form).

Contacts

Simonetta Montemagni (Istituto di Linguistica Computazionale).

Maria Simi (Dipartimento di Informatica).

Resource download

The data can be obtained from the CoNLL Shared Task web site.

Download the README file.

Download a more complete documentation file (PDF format).

Documentation

Data format

Data adheres to the following rules:

Data files contain one or more sentences separated by a blank line.
A sentence consists of one or more tokens, each one starting on a new line.
A token consists of ten fields described in the table below. Fields are separated by one tab character.
All data files contain these ten fields, although only the ID, FORM, CPOSTAG, POSTAG, FEATS, HEAD and DEPREL columns are used.
Data files are UTF-8 encoded (Unicode).

Field 1: ID	Token counter, starting at 1 for each new sentence.
Field 2: FORM	Word form or punctuation symbol.
Field 3: LEMMA	Lemma of word form.
Field 4: CPOSTAG	Coarse-grained part-of-speech tag. Based on the ILC/PAROLE tagset.
Field 5: POSTAG	Fine-grained part-of-speech tag. Based on the ILC/PAROLE tagset.
Field 6: FEATS	Morpho-syntactic features depend on the POS, as detailed in the linked file.
Field 7: HEAD	Non-projective head of current token, which is either a value of ID or zero ('0').
Field 8: DEPREL	Dependency relation to the non-projective head, which is 'ROOT' when the value of HEAD is zero. See Dependency relations for more information.
Field 9: PHEAD	Projective head of current token, which is always an underscore because it is not available for the Italian treebank.
Field 10: PDEPREL	Dependency relation to projective head, which is always an underscore, because it is not available for the Italian treebank.
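
The format rules above can be illustrated with a minimal reader sketch in Python. It is not part of the ISST-CoNLL distribution: the function name read_sentences and the FIELDS list are illustrative assumptions, and the sketch only relies on what the rules state (blank-line sentence separators, one token per line, ten tab-separated fields).

```python
from typing import Dict, Iterable, Iterator, List

# Column names in the order given by the field table above.
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

Token = Dict[str, str]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}"
            )
        sentence.append(dict(zip(FIELDS, values)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence
```

A data file would typically be opened with encoding="utf-8" and passed to read_sentences directly, since a file object iterates over its lines. All field values are kept as strings; HEAD and ID can be converted with int() where needed.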

Corpus composition

ISST-CoNLL is a subset of the balanced ISST corpus, corresponding to the Corriere della Sera and periodicals partitions of ISST. It contains 79654 word tokens (of which 65016 are non-punctuation tokens) in a total of 4162 sentences.

Statistics

#sentences	4162
#tokens	79654
#non-punct tokens	65016
#coarse pos tags	14
#fine pos tags	28
#deprels	21
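
Counts of this kind can be recomputed from a data file with a short script. The following Python sketch shows one way to do it; the function names are illustrative, and the punctuation test is an assumption (a token counts as punctuation when every character of its FORM is in a Unicode punctuation category), so the non-punctuation count it produces may differ from the figure reported above.

```python
import unicodedata
from typing import Dict, Iterable


def is_punct(form: str) -> bool:
    """Assumption: punctuation means every character is Unicode punctuation."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count sentences, tokens and distinct tags in CoNLL-formatted lines."""
    sentences = tokens = non_punct = 0
    cpos, pos, deprels = set(), set(), set()
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: the next token line starts a new sentence.
            in_sentence = False
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        cols = line.split("\t")
        tokens += 1
        if not is_punct(cols[1]):   # FORM
            non_punct += 1
        cpos.add(cols[3])           # CPOSTAG
        pos.add(cols[4])            # POSTAG
        deprels.add(cols[7])        # DEPREL
    return {
        "sentences": sentences,
        "tokens": tokens,
        "non_punct_tokens": non_punct,
        "coarse_pos_tags": len(cpos),
        "fine_pos_tags": len(pos),
        "deprels": len(deprels),
    }
```

Whether the 'ROOT' label is included in the reported number of dependency relations is not stated here; the sketch counts every distinct DEPREL value, including 'ROOT'.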

Conversion

Conversion from the ISST corpus consisted of:

combining information from two different annotation levels;
converting the ISST dependency annotation scheme into the CoNLL-2007 format.

Conversion had to cope with the fact that in ISST dependency relations are expressed as binary relations holding between two lexical heads belonging to major lexical classes only (i.e. non-auxiliary verbs, nouns, adjectives and adverbs). Information about grammatical words (e.g. determiners, prepositions, auxiliaries) is instead encoded as features associated with the participants in the relation.

During the conversion process the dependency relations involving grammatical words had to be reconstructed from the original ISST annotation, and the already existing dependency relations had to be revised accordingly. This was done semi-automatically by means of several conversion scripts whose output was manually revised with the help of a graphical annotation tool. Further scripts were run to validate the consistency of the final output.

An XML intermediate format was produced in this process, preserving original annotations that could not be accommodated in the CoNLL format.
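
The validation scripts themselves are not described here. As an illustration only, the following Python sketch shows the kind of per-sentence consistency checks that the data format implies: token IDs counting up from 1, HEAD being zero or a valid ID, DEPREL being 'ROOT' when HEAD is zero, and PHEAD and PDEPREL being underscores. The cycle check is an additional assumption based on general dependency-tree well-formedness, and the function name validate_sentence is hypothetical.

```python
from typing import List, Optional, Sequence


def validate_sentence(rows: Sequence[Sequence[str]]) -> List[str]:
    """Return a list of problems found in one sentence (empty if none).

    Each row holds the ten column values of one token, in file order.
    """
    problems: List[str] = []
    n = len(rows)
    heads: List[Optional[int]] = []
    for i, cols in enumerate(rows, start=1):
        if len(cols) != 10:
            problems.append(f"token {i}: expected 10 fields, got {len(cols)}")
            return problems
        tok_id, head, deprel = cols[0], cols[6], cols[7]
        if tok_id != str(i):
            problems.append(f"token {i}: ID is {tok_id!r}, expected {i}")
        try:
            h: Optional[int] = int(head)
        except ValueError:
            h = None
        if h is None or not 0 <= h <= n:
            problems.append(f"token {i}: HEAD {head!r} is not 0 or a valid ID")
            h = None
        elif h == 0 and deprel != "ROOT":
            problems.append(f"token {i}: HEAD is 0 but DEPREL is {deprel!r}, not 'ROOT'")
        heads.append(h)
        if cols[8] != "_" or cols[9] != "_":
            problems.append(f"token {i}: PHEAD and PDEPREL should be '_'")
    # Assumption: head chains must reach 0 without revisiting a token.
    for i in range(1, n + 1):
        seen = set()
        node: Optional[int] = i
        while node:  # stops at the root (0) or at an invalid head (None)
            if node in seen:
                problems.append(f"token {i}: head chain contains a cycle")
                break
            seen.add(node)
            node = heads[node - 1]
    return problems
```

Returning a list of messages rather than raising on the first error lets a whole file be scanned in one pass, which suits a workflow where the reported tokens are then corrected by hand.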

Acknowledgements

Isidoro Barraco and Patrizia Topi did most of the work of writing conversion scripts and revising tags.
Kiril Ribarov, Alessandro Lenci and Giuseppe Attardi contributed useful discussions on critical issues.