The ISST-CoNLL corpus was developed through a collaboration between the Dipartimento di Informatica of the University of Pisa and the Istituto di Linguistica Computazionale (ILC) of the National Research Council (CNR).
Simonetta Montemagni (Istituto di Linguistica Computazionale).
Maria Simi (Dipartimento di Informatica).
The data can be obtained from the CoNLL Shared Task web site.
Download the README file.
Download a more complete documentation file (PDF format).
Field 1: ID | Token counter, starting at 1 for each new sentence. |
Field 2: FORM | Word form or punctuation symbol |
Field 3: LEMMA | Lemma of word form |
Field 4: CPOSTAG | Coarse-grained part-of-speech tag. Based on the ILC/PAROLE tagset. |
Field 5: POSTAG | Fine-grained part-of-speech tag. Based on the ILC/PAROLE tagset. |
Field 6: FEATS | Morpho-syntactic features depend on the POS, as detailed in the linked file. |
Field 7: HEAD | Non-projective head of current token, which is either a value of ID or zero ('0'). |
Field 8: DEPREL | Dependency relation to the non-projective head, which is 'ROOT' when the value of HEAD is zero. See Dependency relations for more information. |
Field 9: PHEAD | Projective head of current token, which is always an underscore because it is not available for the Italian treebank. |
Field 10: PDEPREL | Dependency relation to projective head, which is always an underscore, because it is not available for the Italian treebank. |
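The ten fields above can be read with a few lines of Python. This is a minimal sketch, not a tool distributed with the corpus: it assumes the usual CoNLL shared task layout (one token per line, fields separated by tabs, a blank line between sentences), and the names `Token`, `read_sentences` and `SAMPLE` are hypothetical. The tags in the sample sentence are made-up placeholders, not values taken from the ILC/PAROLE tagset.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One token line, with the ten fields in the order listed above."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int
    deprel: str
    phead: str    # always "_" in this treebank
    pdeprel: str  # always "_" in this treebank


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield each sentence as a list of Token records.

    Assumes tab-separated fields and blank lines between sentences.
    """
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 fields, got {len(cols)}: {line!r}")
        sentence.append(Token(
            int(cols[0]), cols[1], cols[2], cols[3], cols[4],
            cols[5], int(cols[6]), cols[7], cols[8], cols[9],
        ))
    if sentence:
        # The file may not end with a blank line.
        yield sentence


# Hypothetical sample; the tag values are placeholders.
SAMPLE = (
    "1\tIl\til\tR\tRD\t_\t2\tdet\t_\t_\n"
    "2\tcane\tcane\tS\tSS\t_\t3\tsubj\t_\t_\n"
    "3\tdorme\tdormire\tV\tVF\t_\t0\tROOT\t_\t_\n"
    "4\t.\t.\tPU\tPU\t_\t3\tpunc\t_\t_\n"
)

sentences = list(read_sentences(SAMPLE.splitlines()))
roots = [tok.form for tok in sentences[0] if tok.head == 0]
```

HEAD and ID are converted to integers so that a head can be looked up directly as `sentence[tok.head - 1]`; the remaining fields are kept as plain strings.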
#sentences | 4162 |
#tokens | 79654 |
#non-punct tokens | 65016 |
#coarse pos tags | 14 |
#fine pos tags | 28 |
#deprels | 21 |
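Counts of this kind can be recomputed from the data files. The sketch below is only an illustration: it assumes tab-separated fields and blank lines between sentences, and it treats a token as punctuation when its FORM consists solely of Unicode punctuation characters. That criterion is an assumption, since the page does not say how punctuation tokens were identified, so the result may differ from the figure above. The names `corpus_stats` and `is_punct` are hypothetical.

```python
import unicodedata
from typing import Dict, Iterable


def is_punct(form: str) -> bool:
    """Heuristic (an assumption): every character is Unicode punctuation."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count sentences, tokens and distinct tag values in CoNLL-style lines."""
    n_sentences = n_tokens = n_non_punct = 0
    cpostags, postags, deprels = set(), set(), set()
    in_sentence = False
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 10:
            # Blank (or malformed) line: treat it as a sentence boundary.
            in_sentence = False
            continue
        if not in_sentence:
            n_sentences += 1
            in_sentence = True
        n_tokens += 1
        if not is_punct(cols[1]):       # FORM
            n_non_punct += 1
        cpostags.add(cols[3])           # CPOSTAG
        postags.add(cols[4])            # POSTAG
        deprels.add(cols[7])            # DEPREL
    return {
        "sentences": n_sentences,
        "tokens": n_tokens,
        "non_punct_tokens": n_non_punct,
        "coarse_pos_tags": len(cpostags),
        "fine_pos_tags": len(postags),
        "deprels": len(deprels),
    }


# Hypothetical two-sentence sample; the tag values are placeholders.
SAMPLE_LINES = [
    "1\tPiove\tpiovere\tV\tVF\t_\t0\tROOT\t_\t_",
    "2\t.\t.\tPU\tPU\t_\t1\tpunc\t_\t_",
    "",
    "1\tIl\til\tR\tRD\t_\t2\tdet\t_\t_",
    "2\tcane\tcane\tS\tSS\t_\t3\tsubj\t_\t_",
    "3\tdorme\tdormire\tV\tVF\t_\t0\tROOT\t_\t_",
]

stats = corpus_stats(SAMPLE_LINES)
```

To run it on a real file, pass an open file object, for example `corpus_stats(open(path, encoding="utf-8"))`, where `path` points to a local copy of the data.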
The conversion had to cope with the fact that ISST expresses dependency relations as binary relations between two lexical heads belonging to major lexical classes only (i.e. non-auxiliary verbs, nouns, adjectives and adverbs). Information about grammatical words (e.g. determiners, prepositions, auxiliaries) is instead encoded as features associated with the participants in the relation.
During the conversion process, the dependency relations involving grammatical words had to be reconstructed from the original ISST annotation, and the existing dependency relations had to be revised accordingly. This was done semi-automatically by means of several conversion scripts, whose output was manually revised with the help of a graphical annotation tool. Further scripts were run to validate the consistency of the final output.
An intermediate XML format was produced in this process, preserving original annotations that could not be accommodated in the CoNLL format.
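The validation scripts themselves are not described here. As an illustration of the kind of consistency check that can be run on the final output, the sketch below tests only constraints stated in the field descriptions: IDs start at 1 in each sentence, HEAD is zero or an ID of the same sentence, DEPREL is 'ROOT' when HEAD is zero, and PHEAD and PDEPREL are underscores. The self-head check is an extra assumption, and the function name `check_sentence` is hypothetical.

```python
from typing import List


def check_sentence(rows: List[List[str]]) -> List[str]:
    """Return a list of problems found in one sentence (empty if none).

    Each row is the list of ten field values of one token.
    """
    if any(len(row) != 10 for row in rows):
        return ["every token must have exactly 10 fields"]
    try:
        ids = [int(row[0]) for row in rows]
        heads = [int(row[6]) for row in rows]
    except ValueError:
        return ["ID and HEAD must be integers"]

    errors: List[str] = []
    if ids != list(range(1, len(rows) + 1)):
        errors.append("ID values must count up from 1")

    valid_ids = set(ids)
    for row, tok_id, head in zip(rows, ids, heads):
        if head != 0 and head not in valid_ids:
            errors.append(f"token {tok_id}: HEAD {head} is not an ID in this sentence")
        if head == tok_id:
            # Assumption: a token cannot be its own head.
            errors.append(f"token {tok_id}: token is its own head")
        if head == 0 and row[7] != "ROOT":
            errors.append(f"token {tok_id}: HEAD is 0 but DEPREL is {row[7]!r}")
        if row[8] != "_" or row[9] != "_":
            errors.append(f"token {tok_id}: PHEAD and PDEPREL must be '_'")
    return errors


# Hypothetical examples; the tag values are placeholders.
good = [
    ["1", "Piove", "piovere", "V", "VF", "_", "0", "ROOT", "_", "_"],
    ["2", ".", ".", "PU", "PU", "_", "1", "punc", "_", "_"],
]
bad = [
    ["1", "Piove", "piovere", "V", "VF", "_", "0", "pred", "_", "_"],
    ["2", ".", ".", "PU", "PU", "_", "5", "punc", "_", "_"],
]

good_errors = check_sentence(good)
bad_errors = check_sentence(bad)
```

Returning messages instead of raising on the first problem makes it possible to report every inconsistency in a file in a single pass.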
Isidoro Barraco and Patrizia Topi did most of the work of writing conversion scripts and revising tags.
Kiril Ribarov, Alessandro Lenci and Giuseppe Attardi contributed useful discussions on critical issues.