The format of MIDT is the CoNLL format, a de facto standard used in most shared tasks in NLP.
The data encoding is UTF-8 (Unicode).
Data adheres to the following rules:
- Data files contain one or more sentences separated by a blank line.
- A sentence consists of one or more tokens, each one starting on a new line.
- A token consists of ten fields described in the table below. Fields are separated by one tab character.
- All data files contain these ten fields, although only the ID, FORM, CPOSTAG, POSTAG, FEATS, HEAD and DEPREL columns are guaranteed to contain non-dummy (non-underscore) values.
- Data files are UTF-8 encoded (Unicode).
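The rules above can be sketched as a small reader. This is a minimal illustration, not an official MIDT tool: the function name `read_sentences` and the sample sentences, tags and feature values are invented for the example and are not taken from the treebank.

```python
from typing import List

NUM_FIELDS = 10  # each token line carries ten tab-separated fields


def read_sentences(text: str) -> List[List[List[str]]]:
    """Split CoNLL-style text into sentences, tokens, and fields.

    Returns a list of sentences; each sentence is a list of tokens;
    each token is a list of ten field strings.
    """
    sentences: List[List[List[str]]] = []
    current: List[List[str]] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            # A blank line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != NUM_FIELDS:
            raise ValueError(
                f"line {line_no}: expected {NUM_FIELDS} fields, got {len(fields)}"
            )
        current.append(fields)
    if current:
        # The last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


# Invented sample (hypothetical tags and features, for illustration only).
SAMPLE = (
    "1\tIl\til\tR\tRD\tnum=s|gen=m\t2\tDET\t_\t_\n"
    "2\tgatto\tgatto\tS\tS\tnum=s|gen=m\t3\tSUBJ\t_\t_\n"
    "3\tdorme\tdormire\tV\tV\tnum=s|per=3\t0\tROOT\t_\t_\n"
    "\n"
    "1\tPiove\tpiovere\tV\tV\tnum=s|per=3\t0\tROOT\t_\t_\n"
)

for sentence in read_sentences(SAMPLE):
    print(" ".join(token[1] for token in sentence))
```

When reading from disk, opening the file with `open(path, encoding="utf-8")` keeps the decoding consistent with the encoding rule stated above.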
|Field 1: ID
||Token counter, starting at 1 for each new sentence.
|Field 2: FORM
||Word form or punctuation symbol
|Field 3: LEMMA
||Lemma of word form
|Field 4: CPOSTAG
||Coarse-grained part-of-speech tag.
|Field 5: POSTAG
||Fine-grained part-of-speech tag.
|Field 6: FEATS
||Morpho-syntactic features depend on the POS, as detailed in the linked file.
|Field 7: HEAD
||Non-projective head of current token, which is either a value of ID or zero ('0').
|Field 8: DEPREL
||Dependency relation to the non-projective head, which is 'ROOT' when the value of HEAD is zero. See Dependency relations for more information.
|Field 9: PHEAD
||Projective head of current token, which is always an underscore because it is not available for the Italian treebank.
|Field 10: PDEPREL
||Dependency relation to projective head, which is always an underscore because it is not available for the Italian treebank.
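The field descriptions in the table can be turned into a simple consistency check. The sketch below is an assumption-laden illustration rather than part of the MIDT distribution: the `Token` class and the functions `parse_token` and `check_sentence` are hypothetical names, and the checks cover only the constraints stated in the table (ID counting from 1, HEAD being zero or a token ID, DEPREL being 'ROOT' when HEAD is zero, PHEAD and PDEPREL being underscores).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One token line; attribute names mirror the ten fields (hypothetical class)."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int
    deprel: str
    phead: str
    pdeprel: str


def parse_token(line: str) -> Token:
    """Convert one tab-separated line into a Token."""
    f = line.rstrip("\r\n").split("\t")
    if len(f) != 10:
        raise ValueError(f"expected 10 fields, got {len(f)}")
    return Token(int(f[0]), f[1], f[2], f[3], f[4], f[5],
                 int(f[6]), f[7], f[8], f[9])


def check_sentence(tokens: List[Token]) -> List[str]:
    """Return a list of human-readable problems; empty if none are found."""
    problems: List[str] = []
    n = len(tokens)
    for position, tok in enumerate(tokens, start=1):
        # ID is a counter that starts at 1 for each new sentence.
        if tok.id != position:
            problems.append(f"token {position}: ID is {tok.id}")
        # HEAD is either zero or the ID of a token in the same sentence.
        if not 0 <= tok.head <= n:
            problems.append(f"token {position}: HEAD {tok.head} out of range")
        # DEPREL is 'ROOT' when HEAD is zero.
        if tok.head == 0 and tok.deprel != "ROOT":
            problems.append(f"token {position}: HEAD is 0 but DEPREL is {tok.deprel}")
        # PHEAD and PDEPREL are always underscores in this treebank.
        if tok.phead != "_" or tok.pdeprel != "_":
            problems.append(f"token {position}: PHEAD/PDEPREL must be '_'")
    return problems


# Invented example lines (hypothetical tags and features, for illustration only).
lines = [
    "1\tIl\til\tR\tRD\tnum=s|gen=m\t2\tDET\t_\t_",
    "2\tgatto\tgatto\tS\tS\tnum=s|gen=m\t3\tSUBJ\t_\t_",
    "3\tdorme\tdormire\tV\tV\tnum=s|per=3\t0\tROOT\t_\t_",
]
print(check_sentence([parse_token(line) for line in lines]))
```

Returning a list of messages instead of raising on the first problem makes it easier to report every inconsistency in a sentence at once.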