Format
MIDT uses the CoNLL format, a de facto standard used in most NLP shared tasks.
The data encoding is UTF-8 (Unicode).
Data adheres to the following rules:
- Data files contain one or more sentences separated by a blank line.
- A sentence consists of one or more tokens, each one starting on a new line.
- A token consists of ten fields described in the table below. Fields are separated by one tab character.
- All data files contain these ten fields, although only the ID, FORM, CPOSTAG, POSTAG, FEATS, HEAD and DEPREL columns are used.
- Data files are UTF-8 encoded (Unicode).
Field 1: ID | Token counter, starting at 1 for each new sentence. |
Field 2: FORM | Word form or punctuation symbol |
Field 3: LEMMA | Lemma of word form |
Field 4: CPOSTAG | Coarse-grained part-of-speech tag. |
Field 5: POSTAG | Fine-grained part-of-speech tag. |
Field 6: FEATS | Morpho-syntactic features depend on the POS, as detailed in the linked file. |
Field 7: HEAD | Non-projective head of the current token, which is either a value of ID or zero ('0'). |
Field 8: DEPREL | Dependency relation to the non-projective head, which is 'ROOT' when the value of HEAD is zero. See Dependency relations for more information. |
Field 9: PHEAD | Projective head of current token, which is always an underscore because it is not available for the Italian treebank. |
Field 10: PDEPREL | Dependency relation to projective head, which is always an underscore, because it is not available for the Italian treebank. |
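The rules and field table above can be turned into a small reader. The following Python sketch is only an illustration, not part of MIDT: the names `FIELDS`, `parse_conll` and `SAMPLE` are hypothetical, and the sample tokens and their tag values are invented to show the layout rather than taken from the treebank. It splits sentences on blank lines, splits each token line on tab characters, and checks that every token has exactly ten fields.

```python
from typing import Dict, Iterator, List

# The ten column names, in the order given in the field table.
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def parse_conll(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by field name."""
    sentence: List[Dict[str, str]] = []
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} tab-separated fields, got {len(values)}"
            )
        sentence.append(dict(zip(FIELDS, values)))
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence


# Invented sample data (illustrative tag values only): two sentences
# separated by a blank line; unused fields hold an underscore.
SAMPLE = "\n".join([
    "1\tMaria\tMaria\tS\tSP\t_\t2\tSUBJ\t_\t_",
    "2\tdorme\tdormire\tV\tV\t_\t0\tROOT\t_\t_",
    "3\t.\t.\tF\tFS\t_\t2\tPUNC\t_\t_",
    "",
    "1\tPiove\tpiovere\tV\tV\t_\t0\tROOT\t_\t_",
    "2\t.\t.\tF\tFS\t_\t1\tPUNC\t_\t_",
])

sentences = list(parse_conll(SAMPLE))
for token in sentences[0]:
    print(token["ID"], token["FORM"], token["HEAD"], token["DEPREL"])
```

To read a real data file, the text would be loaded with `open(path, encoding="utf-8")`, matching the UTF-8 encoding stated above. All values are kept as strings; converting ID and HEAD to integers is left to the caller.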