MIDT uses the CoNLL format, a de facto standard used in most shared tasks in NLP.
Data files are encoded in UTF-8 (Unicode).
Data adheres to the following rules:
- Data files contain one or more sentences separated by a blank line.
- A sentence consists of one or more tokens, each one starting on a new line.
- A token consists of ten fields described in the table below. Fields are separated by one tab character.
- All data files contain these ten fields, although only the ID, FORM, CPOSTAG, POSTAG, FEATS, HEAD and DEPREL columns are used.
- Data files are UTF-8 encoded (Unicode).
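
The rules above translate fairly directly into a reader. The following is a minimal sketch in Python, not an official MIDT tool: the function name `read_sentences` and the sample rows (including their tag and relation values) are made up for illustration.

```python
import io

NUM_FIELDS = 10  # every token line carries ten tab-separated fields


def read_sentences(stream):
    """Yield each sentence as a list of tokens; each token is a list of 10 strings."""
    sentence = []
    for line_no, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if one is open.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != NUM_FIELDS:
            raise ValueError(
                f"line {line_no}: expected {NUM_FIELDS} fields, got {len(fields)}"
            )
        sentence.append(fields)
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Hypothetical sample: tag and relation values are placeholders, not the MIDT tagset.
SAMPLE = (
    "1\tIl\til\tR\tRD\t_\t2\tdet\t_\t_\n"
    "2\tcane\tcane\tS\tS\t_\t3\tsubj\t_\t_\n"
    "3\tdorme\tdormire\tV\tV\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tPiove\tpiovere\tV\tV\t_\t0\tROOT\t_\t_\n"
    "2\t.\t.\tF\tFS\t_\t1\tpunc\t_\t_\n"
)

sentences = list(read_sentences(io.StringIO(SAMPLE)))
print(len(sentences))
print([token[1] for token in sentences[0]])
```

To read a file from disk instead of an in-memory string, the same function can be given `open(path, encoding="utf-8")`, where `path` stands for whichever data file is being read.
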
|Field 1: ID||Token counter, starting at 1 for each new sentence.|
|Field 2: FORM||Word form or punctuation symbol|
|Field 3: LEMMA||Lemma of word form|
|Field 4: CPOSTAG||Coarse-grained part-of-speech tag.|
|Field 5: POSTAG||Fine-grained part-of-speech tag.|
|Field 6: FEATS||Morpho-syntactic features, which depend on the POS, as detailed in the linked file.|
|Field 7: HEAD||Non-projective head of the current token, which is either a value of ID or zero ('0').|
|Field 8: DEPREL||Dependency relation to the non-projective head, which is 'ROOT' when the value of HEAD is zero. See Dependency relations for more information.|
|Field 9: PHEAD||Projective head of current token, which is always an underscore because it is not available for the Italian treebank.|
|Field 10: PDEPREL||Dependency relation to projective head, which is always an underscore, because it is not available for the Italian treebank.|
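
The constraints stated in the table can be checked mechanically. The sketch below is one possible way to do so in Python, under the assumption that each token line has already been isolated; the names `Token`, `parse_token` and `check_sentence` are hypothetical, and the tag and relation values in the example rows are placeholders rather than the actual MIDT tagset.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int
    deprel: str
    phead: str
    pdeprel: str


def parse_token(line):
    """Split one tab-separated token line into its ten named fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    return Token(
        int(fields[0]), fields[1], fields[2], fields[3], fields[4],
        fields[5], int(fields[6]), fields[7], fields[8], fields[9],
    )


def check_sentence(tokens):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    ids = [t.id for t in tokens]
    # ID is a counter that starts at 1 for each new sentence.
    if ids != list(range(1, len(tokens) + 1)):
        problems.append("ID values do not count up from 1")
    for t in tokens:
        # HEAD is either zero or the ID of a token in the same sentence.
        if t.head != 0 and t.head not in ids:
            problems.append(f"token {t.id}: HEAD {t.head} is not a token ID")
        # DEPREL is 'ROOT' when HEAD is zero.
        if t.head == 0 and t.deprel != "ROOT":
            problems.append(f"token {t.id}: HEAD is 0 but DEPREL is {t.deprel!r}")
        # PHEAD and PDEPREL are always an underscore in this treebank.
        if t.phead != "_" or t.pdeprel != "_":
            problems.append(f"token {t.id}: PHEAD/PDEPREL should be '_'")
    return problems


# Hypothetical rows with placeholder tags.
good = [
    parse_token("1\tIl\til\tR\tRD\t_\t2\tdet\t_\t_"),
    parse_token("2\tcane\tcane\tS\tS\t_\t3\tsubj\t_\t_"),
    parse_token("3\tdorme\tdormire\tV\tV\t_\t0\tROOT\t_\t_"),
]
bad = [parse_token("1\tcane\tcane\tS\tS\t_\t0\tmod\t_\t_")]

print(check_sentence(good))
print(check_sentence(bad))
```

Converting ID and HEAD to integers while leaving the remaining fields as strings keeps the check simple; the checks cover only the constraints stated in the table and nothing beyond them.
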