EVALITA 2009 - PoS Tagging task

Dipartimento di Informatica, Università di Pisa

Task description

The evaluation will be based on three data sets:

Training Corpus (TrC): contains data annotated using the Tanl tagset [1] and must be used for training participating systems
Development Corpus (DvC): a smaller corpus to be used for development
Test Set (TeS): contains blind test data for the evaluation

The Tanl tagset includes morphological features and consists of 328 tags, from 14 basic categories. The task will therefore evaluate the ability of taggers to handle a large tagset, which is useful for obtaining both lexical and morphological information from a PoS tagger.

There will be two subtasks:

a closed task, where participants are not allowed to use any external resources besides the supplied TrC and DvC
an open task, where participants can use external resources.

The evaluation will be based on a token-by-token comparison (only ONE tag is allowed for each token). The evaluation metrics will be:

Tagging accuracy: the percentage of correctly tagged tokens with respect to the total number of tokens in TeS.
Unknown Words Tagging Accuracy: the Tagging Accuracy computed only over unknown words. In this context, an "unknown word" is a token that occurs in TeS but not in TrC.
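The two metrics above can be illustrated with a minimal Python sketch. This is not the official poseval.py evaluator; the function name tagging_accuracy, its arguments, and the choice of an exact, case-sensitive comparison of word forms against the TrC vocabulary are assumptions made for illustration only.

```python
def tagging_accuracy(gold, predicted, training_vocab):
    """Return (accuracy, unknown-word accuracy) as percentages.

    gold, predicted: lists of (form, tag) pairs aligned token by token.
    training_vocab:  set of word forms seen in the training corpus (TrC).
    All names here are hypothetical, not part of the official evaluator.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")

    total = correct = unk_total = unk_correct = 0
    for (form, gold_tag), (_, pred_tag) in zip(gold, predicted):
        hit = gold_tag == pred_tag
        total += 1
        correct += hit
        # Assumption: a token is "unknown" if its exact form is absent from TrC.
        if form not in training_vocab:
            unk_total += 1
            unk_correct += hit

    accuracy = 100.0 * correct / total if total else 0.0
    unk_accuracy = 100.0 * unk_correct / unk_total if unk_total else 0.0
    return accuracy, unk_accuracy
```

Both values are reported as percentages; when TeS contains no unknown words the sketch returns 0.0 for the second metric rather than raising an error.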

Participants are required to provide a brief description of their system and a full notebook paper describing their experiments, in particular the techniques and the resources used, and presenting an analysis of the results.

Detailed guidelines (PDF format).

Corpora description

Source of training data

The training data set provided to the participants consists of articles from the online edition of the newspaper La Repubblica (http://www.repubblica.it/).

These data have been annotated in several steps. The first step was performed by the group of Andrea Baroni at the Università di Bologna and consisted of manually assigning a set of coarse-grained PoS tags. The Morph-it! [2] automated tool was then used to assign a list of possible morphological tags to each token. Finally, a conversion script incorporating some heuristics was used to convert the PoS and morphological tags into the Tanl tagset.

A final manual revision was applied to the whole corpus, followed by a complete automated cross-check against an Italian lexicon of over 1.25 million forms.

These activities were performed as part of the project SemaWiki (Text Analytics and Natural Language processing - TANL) [1], a collaboration between the University of Pisa and the Institute for Computational Linguistics of CNR.

Training corpus statistics

The whole corpus consists of 108,875 word forms divided into 3,719 sentences.

#sentences	3,719
#tokens	108,875
#coarse PoS tags	14
#Morphed PoS tags	230
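
Counts of this kind can be recomputed from a loaded corpus with a short Python sketch. The function name corpus_statistics and the representation of a sentence as a list of (form, tag) pairs are assumptions for illustration. The sketch does not count coarse categories, because the mapping from a full tag to its basic category is defined by the tagset description and is not reproduced here.

```python
def corpus_statistics(sentences):
    """Count sentences, tokens and distinct tags.

    sentences: iterable of sentences, each a list of (form, tag) pairs.
    This representation is a hypothetical choice, not a prescribed one.
    """
    tags = set()
    n_sentences = n_tokens = 0
    for sentence in sentences:
        n_sentences += 1
        n_tokens += len(sentence)
        tags.update(tag for _, tag in sentence)
    return {
        "sentences": n_sentences,
        "tokens": n_tokens,
        "distinct_tags": len(tags),
    }
```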

Copyright and license

Repubblica-TANL is copyrighted material which can be used for research purposes only and which cannot be distributed in any original or modified form. Participants will be requested to agree to these terms and conditions upon downloading the resource.

Resource download

PoS Tagging Corpus (TrC and DvC): PoSTaggingCorpus.tgz (3rd version)
PoS Tagging Accuracy Evaluator: poseval.py

PoS Test Corpus (TeS): PoSTest.tanl (NEW: released 10 September 2009)

Submission details

Participants should submit their results by September 20th, midnight Italian time.

NEW: Contrary to what is specified in the guidelines, and in order to encourage experimentation with different settings, each participant may send up to 4 runs for each subtask.

Runs must be sent to the organizers' address, evalita@di.unipi.it, as a file in the same format as the Training Corpus, named as:

<team>: a short name for the team, without special characters
<Open|Closed>: Open or Closed subtask
<run>: a number between 1 and 4

The assessment of the submitted runs will be sent to the participants by October 5th, 2009, together with the gold-standard version of TeS.

Contacts

Giuseppe Attardi
Maria Simi

Dipartimento di Informatica, Università di Pisa
Largo B. Pontecorvo, 3
I-56127 Pisa
Italy
Phone: (+39) 050 2212700
Fax: (+39) 050 22127266

Documentation

Data format

Data adheres to the following rules:

Data files contain sentences separated by an empty line.
A sentence consists of a sequence of tokens, one token per line.
A token consists of two fields described in the table below. Fields are separated by one tab character.
Characters are UTF-8 encoded (Unicode).

Field Name	Description
FORM	Word form or punctuation symbol
POSTAG	Fine-grained part-of-speech tag, with morphology, based on the TANL tagset.
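
The rules above can be turned into a small reader. The following Python sketch is one possible implementation, not a tool supplied with the task; the function name read_sentences and the choice to return (FORM, POSTAG) pairs are assumptions.

```python
def read_sentences(lines):
    """Yield sentences as lists of (form, postag) pairs.

    lines: any iterable of text lines, for example an open file.
    A line that does not contain exactly two tab-separated fields
    raises ValueError when it is unpacked.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            # An empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        form, postag = line.split("\t")
        sentence.append((form, postag))
    # The last sentence may not be followed by an empty line.
    if sentence:
        yield sentence
```

A file would typically be opened with an explicit encoding, for instance open(path, encoding="utf-8"), where path is a placeholder for the location of the unpacked corpus file.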

Example

A	E
ben	B
pensarci	Vfc
,	FF
l'	RDns
intervista	Sfs
dell'	EAns
on.	SA
Formica	SP
è	VAip3s
stata	VApsfs
accolta	Vpsfs
in	E
genere	Sms
con	E
disinteresse	Sms
.	FS

Tokenization issues

The example illustrates some tokenization issues:

abbreviations are properly identified as tokens (on.);
apostrophes representing a truncation are kept with the truncated token (l'intervista);
possible multi-word expressions (MWE) are not combined into a single token (in_genere);
clitics are not separated from the token (pensarci).

The TANL tagset

The Tanl tagset is designed according to the EAGLES guidelines [3], an agreed standard in the NLP community. In particular, it was derived from the morphosyntactic classification of the ISST corpus [4]. Description of the Tanl tagset and Annotation guidelines.

Acknowledgements

Felice Dell'Orletta, Antonio Fuschetto, Alessandro Lenci, Simonetta Montemagni, Francesco Tamberi, Eva Maria Vecchi.

References

[1] G. Attardi et al. 2008. Tanl (Text Analytics and Natural Language processing). Project Analisi di Testi per il Semantic Web e il Question Answering, http://medialab.di.unipi.it/wiki/SemaWiki.
[2] E. Zanchetta, M. Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. Proc. of Corpus Linguistics 2005, University of Birmingham, Birmingham, UK. http://dev.sslmit.unibo.it/linguistics/morph-it.php
[3] M. Monachini. 1995. ELM-IT: An Italian Incarnation of the EAGLES-TS. Definition of Lexicon Specification and Classification Guidelines. Technical report, Pisa.
[4] S. Montemagni, et al. 2003. Building the Italian Syntactic-Semantic Treebank. In Abeillé (ed.), Building and using Parsed Corpora, Language and Speech series, Kluwer, Dordrecht, 189–210.