Corpus Format

From Medialab

Tokens in a corpus typically share a common format. Each corpus has its own SentenceReader, which reads tokens in a given format.

The format of input files is expressed in XML notation, as in this example:

 <CorpusFormat name="tab">
    <field name="ID" use="ECHO" role="ID" value="INTEGER" />
    <field name="FORM" use="INPUT" role="FORM" />
    <field name="LEMMA" use="INPUT" />
    <field name="POS" use="INPUT" />
    <field name="HEAD" use="OUTPUT" link="DEP" />
    <field name="DEPREL" use="OUTPUT" label="DEP"/>
 </CorpusFormat>

Field attributes have the following meanings:

  • name, the name of the field
  • use, either INPUT (for fields available as input during both training and parsing), OUTPUT (for fields available during training, but to be predicted and added as output when parsing), or ECHO (for fields simply copied from input to output)
  • role, marks a field that has a specific role in the process. For instance, ID specifies that the value is to be used as the ID of the token.
  • link, indicates that the field value is an integer giving the ID of the token that is the target of a link; the attribute's value is the name by which the link is referred to.
  • label, indicates that the field value is a label for the link named by the attribute's value.

If not otherwise specified, all values are strings in UTF-8 encoding.
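
The attributes described above can be read mechanically. The following is a minimal Python sketch, not DeSR's own code: it parses the example format with the standard xml.etree.ElementTree module and groups field names by their use attribute. The function name load_format and the dictionary layout are assumptions made for illustration; the STRING default for value follows the rule that values are strings unless otherwise specified.

```python
import xml.etree.ElementTree as ET

# The simplified "tab" format from the example, with its closing tag.
FORMAT_XML = """
<CorpusFormat name="tab">
    <field name="ID" use="ECHO" role="ID" value="INTEGER" />
    <field name="FORM" use="INPUT" role="FORM" />
    <field name="LEMMA" use="INPUT" />
    <field name="POS" use="INPUT" />
    <field name="HEAD" use="OUTPUT" link="DEP" />
    <field name="DEPREL" use="OUTPUT" label="DEP" />
</CorpusFormat>
"""


def load_format(xml_text):
    """Return (format name, list of field descriptors). Hypothetical helper."""
    root = ET.fromstring(xml_text.strip())
    fields = []
    for elem in root.findall("field"):
        fields.append({
            "name": elem.get("name"),
            "use": elem.get("use"),
            "role": elem.get("role"),    # None when the attribute is absent
            "link": elem.get("link"),
            "label": elem.get("label"),
            # Values are strings unless the format says otherwise.
            "value": elem.get("value", "STRING"),
        })
    return root.get("name"), fields


name, fields = load_format(FORMAT_XML)

# Group field names by their use (ECHO, INPUT, OUTPUT).
by_use = {}
for field in fields:
    by_use.setdefault(field["use"], []).append(field["name"])

print(name)
print(by_use)
```

Grouping by use makes explicit which columns a parser may consult (INPUT), which it must predict (OUTPUT), and which it passes through unchanged (ECHO).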

The example presents a simplified version of the CoNLL 2008 Shared Task format, which includes both dependency links (referred to as DEP) and semantic roles (referred to as ARG0 and ARG1).
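
To make the pairing of link and label concrete, here is a hedged Python sketch of how token lines might be read with the simplified format. It assumes the usual CoNLL layout: one token per line, tab-separated columns in the order the fields are declared, and a blank line between sentences. The names parse_token and read_sentences, the token dictionary layout, and the two sample sentences are all invented for illustration and do not come from DeSR.

```python
# Field descriptors mirroring the simplified "tab" example (only the
# attributes this sketch needs).
FIELDS = [
    {"name": "ID", "value": "INTEGER"},
    {"name": "FORM"},
    {"name": "LEMMA"},
    {"name": "POS"},
    {"name": "HEAD", "link": "DEP"},
    {"name": "DEPREL", "label": "DEP"},
]

# Made-up sample data: two sentences separated by a blank line.
SAMPLE = (
    "1\tJohn\tjohn\tNNP\t2\tSBJ\n"
    "2\tsleeps\tsleep\tVBZ\t0\tROOT\n"
    "\n"
    "1\tHello\thello\tUH\t0\tROOT\n"
)


def parse_token(line, fields=FIELDS):
    """Map one tab-separated line onto the declared fields."""
    columns = line.split("\t")
    if len(columns) != len(fields):
        raise ValueError("expected %d columns, got %d" % (len(fields), len(columns)))
    token = {"links": {}}
    for spec, raw in zip(fields, columns):
        # Fields are strings unless declared as INTEGER.
        token[spec["name"]] = int(raw) if spec.get("value") == "INTEGER" else raw
        if "link" in spec:
            # The field holds the ID of the link's target token.
            token["links"].setdefault(spec["link"], {})["target"] = int(raw)
        if "label" in spec:
            # The field holds the label of the link with the same name.
            token["links"].setdefault(spec["label"], {})["label"] = raw
    return token


def read_sentences(lines, fields=FIELDS):
    """Yield sentences as lists of tokens; blank lines end a sentence."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(parse_token(line, fields))
    if sentence:
        yield sentence


sentences = list(read_sentences(SAMPLE.splitlines()))
for sentence in sentences:
    for token in sentence:
        print(token["ID"], token["FORM"], token["links"]["DEP"])
```

Because HEAD declares link="DEP" and DEPREL declares label="DEP", the two columns end up describing the same DEP link: its target token and its label.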

The DeSR parser, in particular, uses this notation to specify the format of input files. Here, for example, is the full specification for the CoNLL 2008 Shared Task:

<CorpusFormat name="conll08">
 <field name="ID"     use="ECHO" value="INTEGER"/>
 <field name="FORM"   use="IGNORE" value="STRING"/>
 <field name="LEMMA"  use="IGNORE" value="STRING"/>
 <field name="GPOS"   use="INPUT" value="STRING"/>
 <field name="PPOS"   use="INPUT" value="STRING"/>
 <field name="SPLIT_FORM" use="INPUT" value="STRING" role="FORM"/>
 <field name="SPLIT_LEMMA" use="INPUT" value="STRING"/>
 <field name="PPOSS"  use="INPUT" value="STRING"/>
 <field name="HEAD"   use="OUTPUT" link="DEP" role="HEAD"/>
 <field name="DEPREL" use="OUTPUT" label="DEP" role="DEPREL"/>
 <field name="PRED"   use="OUTPUT" value="STRING" role="PREDICATE"/>
 <field name="ROLE0"  use="OUTPUT" label="ROLE0" default="_"/>
 <field name="ROLE1"  use="OUTPUT" label="ROLE1" default="_"/>
 <field name="ROLE2"  use="OUTPUT" label="ROLE2" default="_"/>
 <field name="ROLE3"  use="OUTPUT" label="ROLE3" default="_"/>
 <field name="ROLE4"  use="OUTPUT" label="ROLE4" default="_"/>