MIDT uses the CoNLL format, a de facto standard used in most shared tasks in NLP.

The data encoding is UTF-8 (Unicode).

Data adheres to the following rules:

  • Data files contain one or more sentences separated by a blank line.
  • A sentence consists of one or more tokens, each one starting on a new line.
  • A token consists of ten fields described in the table below. Fields are separated by one tab character.
  • All data files contain these ten fields, although only the ID, FORM, CPOSTAG, POSTAG, FEATS, HEAD and DEPREL columns are used.
  • Data files are UTF-8 encoded (Unicode).
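
As a rough illustration of the rules above, the following Python sketch splits a stream into sentences at blank lines and splits each token line on tab characters. The function name `read_sentences` and the sample data are hypothetical and not part of MIDT; the tags and relations in the sample are placeholders, not real treebank values.

```python
import io


def read_sentences(stream):
    """Yield sentences; each sentence is a list of tokens,
    and each token is a list of ten string fields."""
    sentence = []
    for line in stream:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(
                f"expected 10 tab-separated fields, got {len(fields)}: {line!r}"
            )
        sentence.append(fields)
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Hypothetical two-sentence sample (placeholder tags and relations).
SAMPLE = (
    "1\tPiove\t_\tV\tV\t_\t0\tROOT\t_\t_\n"
    "2\t.\t_\tF\tFS\t_\t1\tpunc\t_\t_\n"
    "\n"
    "1\tGrazie\t_\tI\tI\t_\t0\tROOT\t_\t_\n"
)
sentences = list(read_sentences(io.StringIO(SAMPLE)))
```

For a real data file, the stream would typically be opened with `open(path, encoding="utf-8")`, since the files are UTF-8 encoded.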

Field 1: ID - Token counter, starting at 1 for each new sentence.
Field 2: FORM - Word form or punctuation symbol.
Field 3: LEMMA - Lemma of word form.
Field 4: CPOSTAG - Coarse-grained part-of-speech tag.
Field 5: POSTAG - Fine-grained part-of-speech tag.
Field 6: FEATS - Morpho-syntactic features, which depend on the POS, as detailed in the linked file.
Field 7: HEAD - Non-projective head of the current token, which is either a value of ID or zero ('0').
Field 8: DEPREL - Dependency relation to the non-projective head, which is 'ROOT' when the value of HEAD is zero. See Dependency relations for more information.
Field 9: PHEAD - Projective head of the current token, which is always an underscore because it is not available for the Italian treebank.
Field 10: PDEPREL - Dependency relation to the projective head, which is always an underscore because it is not available for the Italian treebank.
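
The field descriptions above imply a few consistency checks that can be sketched in Python. This is a minimal sketch, not an official MIDT validator: the names `Token`, `parse_token` and `check_sentence` are hypothetical, and the sample tags and relations are placeholders. It checks only what the descriptions state: ID counts from 1, HEAD is zero or the ID of a token in the sentence, DEPREL is 'ROOT' when HEAD is zero, and PHEAD and PDEPREL are underscores.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int
    deprel: str
    phead: str
    pdeprel: str


def parse_token(line):
    """Turn one tab-separated line into a Token with numeric ID and HEAD."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}: {line!r}")
    ident, form, lemma, cpostag, postag, feats, head, deprel, phead, pdeprel = fields
    return Token(int(ident), form, lemma, cpostag, postag, feats,
                 int(head), deprel, phead, pdeprel)


def check_sentence(tokens):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for position, tok in enumerate(tokens, start=1):
        if tok.id != position:
            problems.append(f"token {position}: ID is {tok.id}")
        if not 0 <= tok.head <= len(tokens):
            problems.append(f"token {position}: HEAD {tok.head} is not an ID or 0")
        if tok.head == 0 and tok.deprel != "ROOT":
            problems.append(f"token {position}: HEAD is 0 but DEPREL is {tok.deprel!r}")
        if tok.phead != "_" or tok.pdeprel != "_":
            problems.append(f"token {position}: PHEAD/PDEPREL should be '_'")
    return problems


# Hypothetical sample sentence (placeholder tags and relations).
SAMPLE_LINES = [
    "1\tPiove\t_\tV\tV\t_\t0\tROOT\t_\t_",
    "2\t.\t_\tF\tFS\t_\t1\tpunc\t_\t_",
]
tokens = [parse_token(line) for line in SAMPLE_LINES]
problems = check_sentence(tokens)
```

Returning a list of problems rather than raising on the first one makes it easier to report every inconsistency in a sentence at once.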