EVALITA 2009 - PoS Tagging task
Dipartimento di Informatica, Università di Pisa
Task description
The evaluation will be based on three data sets:
- Training Corpus (TrC): contains data annotated using the Tanl tagset [1] and must be used for training participating systems
- Development Corpus (DvC): a smaller corpus to be used for development
- Test Set (TeS): contains blind test data for the evaluation
The Tanl tagset includes morphological features and consists of 328 tags, from 14 basic categories.
The task will hence evaluate the ability of taggers to handle a large tagset, useful for obtaining
both lexical and morphological information from a POS tagger.
There will be two subtasks:
- a closed task, where participants are not allowed to use any external resources
besides the supplied TrC and DvC
- an open task, where participants can use external resources.
The evaluation will be based on a token-by-token comparison (only ONE tag is allowed for each token).
The evaluation metrics will be:
- Tagging Accuracy: the percentage of correctly tagged tokens with respect to the total number of tokens in TeS.
- Unknown Words Tagging Accuracy: the Tagging Accuracy computed only over unknown words. In this context, an "unknown word" is a token present in TeS but not in TrC.
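The two metrics can be illustrated with a minimal Python sketch. This is not the official poseval.py script: the function name, the input layout (aligned lists of form/tag pairs), and the case-sensitive comparison of word forms are assumptions made for illustration only.

```python
def tagging_accuracy(gold, predicted, known_forms):
    """Return (overall accuracy, unknown-word accuracy) as percentages.

    gold, predicted: lists of (form, tag) pairs aligned token by token.
    known_forms: set of word forms occurring in the training corpus (TrC).
    Assumption: forms are compared case-sensitively.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of tokens")

    total = correct = unk_total = unk_correct = 0
    for (form, gold_tag), (_, pred_tag) in zip(gold, predicted):
        hit = gold_tag == pred_tag
        total += 1
        correct += hit
        # A token is "unknown" when its form never occurs in TrC.
        if form not in known_forms:
            unk_total += 1
            unk_correct += hit

    overall = 100.0 * correct / total if total else 0.0
    unknown = 100.0 * unk_correct / unk_total if unk_total else 0.0
    return overall, unknown


# Toy data: the training vocabulary and the system tags are invented.
train_forms = {"l'", "intervista", "è", "stata", "accolta", "."}
gold = [("l'", "RDns"), ("intervista", "Sfs"), ("dell'", "EAns"), ("Formica", "SP")]
pred = [("l'", "RDns"), ("intervista", "Sfs"), ("dell'", "E"), ("Formica", "SP")]

overall, unknown = tagging_accuracy(gold, pred, train_forms)
print(overall, unknown)
```

In the toy data, three of the four tokens are tagged correctly, and one of the two tokens missing from the training vocabulary (dell', Formica) is tagged correctly.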
Participants are required to provide a brief description of their system and a full
notebook paper describing their experiments, in particular the techniques and the
resources used, and presenting an analysis of the results.
Detailed guidelines (PDF format).
Corpora description
Source of training data
The training data set provided to the participants consists of articles from the online edition of the newspaper La Repubblica (http://www.repubblica.it/).
These data were annotated in several steps. First, the group of Andrea Baroni at the Università di Bologna manually assigned a set of coarse-grained POS tags. Then the Morph-it! [2] automated tool was used to assign a list of possible morphological tags to each token. Next, a conversion script incorporating some heuristics converted the POS and morphological tags into the Tanl tagset.
A final manual revision was applied to the whole corpus, followed by a complete
automated cross-check with an Italian lexicon of over 1.25 million forms.
These activities were performed as part of the project SemaWiki
(Text Analytics and Natural Language processing - TANL) [1], a collaboration between
the University of Pisa and the Institute for Computational Linguistics of CNR.
Training corpus statistics
The whole corpus consists of 108,875 word forms divided into 3,719 sentences.
#sentences | 3,719 |
#tokens | 108,875 |
#coarse PoS tags | 14 |
#fine-grained PoS tags (with morphology) | 230 |
Copyright and license
Repubblica-TANL is copyrighted material which can be used for
research purposes only and which cannot be distributed in any
original or modified form. Participants will be requested to agree
on these terms and conditions upon downloading the resource.
Resource download
PoS Tagging Corpus (TrC and DvC): PoSTaggingCorpus.tgz (3rd version)
PoS Tagging Accuracy Evaluator: poseval.py
PoS Test Corpus (TeS): PoSTest.tanl (NEW: released 10/9/09)
Submission details
Participants should submit their results by September 20th, midnight Italian time.
NEW: Contrary to what is specified in the guidelines, and in order to encourage experimentation with different settings, each participant may send up to 4 runs for each subtask.
Each run must be sent to the organizers' address, evalita@di.unipi.it,
as a file in the same format as the Training Corpus, named as follows:
<team>_POS_<Open|Closed>_<run>
- <team>: a short name for the team, without special characters
- <Open|Closed>: Open or Closed subtask
- <run>: a number between 1 and 4
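The naming scheme above can be checked before submission with a short Python sketch. The helper names are hypothetical, and two points are assumptions rather than rules stated by the organizers: "without special characters" is read here as ASCII letters and digits only, and the file name carries no extension. The team name UniPI is an invented example.

```python
import re

# <team>_POS_<Open|Closed>_<run>, with <run> between 1 and 4.
# Assumption: a team name contains ASCII letters and digits only.
RUN_NAME = re.compile(r"(?P<team>[A-Za-z0-9]+)_POS_(?P<subtask>Open|Closed)_(?P<run>[1-4])")


def format_run_name(team, subtask, run):
    """Build a run file name and validate it against the pattern."""
    name = f"{team}_POS_{subtask}_{run}"
    if RUN_NAME.fullmatch(name) is None:
        raise ValueError(f"invalid run name: {name!r}")
    return name


def parse_run_name(name):
    """Split a run file name into (team, subtask, run number)."""
    match = RUN_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"invalid run name: {name!r}")
    return match["team"], match["subtask"], int(match["run"])


print(format_run_name("UniPI", "Closed", 2))
print(parse_run_name("UniPI_POS_Open_4"))
```

A name with a run number outside 1 to 4, or with an underscore inside the team name, is rejected with a ValueError.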
The assessment of the submitted runs will be sent to the participants by October 5th, 2009,
together with the gold-standard version of TeS.
Contacts
Giuseppe Attardi
Maria Simi
Dipartimento di Informatica, Università di Pisa
Largo B. Pontecorvo, 3
I-56127 Pisa
Italy
Phone: (+39) 050 2212700
Fax: (+39) 050 22127266
Documentation
Data format
Data adheres to the following rules:
- Data files contain sentences separated by an empty line.
- A sentence consists of a sequence of tokens, one token per line.
- A token consists of two fields, described in the table below. Fields are separated by one tab character.
- Characters are UTF-8 encoded (Unicode).
Field Name | Description |
FORM | Word form or punctuation symbol |
POSTAG | Fine-grained part-of-speech tag, with morphology, based on the TANL tagset. |
Example
A E
ben B
pensarci Vfc
, FF
l' RDns
intervista Sfs
dell' EAns
on. SA
Formica SP
è VAip3s
stata VApsfs
accolta Vpsfs
in E
genere Sms
con E
disinteresse Sms
. FS
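A file in this format can be read with a few lines of Python. The sketch below assumes exactly the rules listed above (one token per line, two tab-separated fields, an empty line between sentences). The function name read_sentences is hypothetical and the code is not part of the distributed tools.

```python
def read_sentences(lines):
    """Yield each sentence as a list of (FORM, POSTAG) pairs."""
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # An empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Exactly two tab-separated fields are expected;
        # anything else raises ValueError.
        form, postag = line.split("\t")
        sentence.append((form, postag))
    if sentence:
        # The last sentence may lack a trailing empty line.
        yield sentence


sample = "in\tE\ngenere\tSms\n.\tFS\n\nA\tE\nben\tB\n"
sentences = list(read_sentences(sample.splitlines()))
for sentence in sentences:
    print(sentence)
```

To read a corpus from disk, pass an open file object instead of the sample lines, opened with encoding="utf-8" as required by the format.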
Tokenization issues
The example illustrates some tokenization issues:
- abbreviations are properly identified as tokens (on.);
- apostrophes representing a truncation are kept with the truncated token
(l'intervista);
- possible multi-word expressions (MWE) are not combined into a single token
(in_genere);
- clitics are not separated from the token (pensarci).
The TANL tagset
The Tanl tagset is designed according to the EAGLES guidelines [3], an agreed
standard in the NLP community. In particular it was derived from the morphosyntactic
classification of the ISST corpus [4].
Description of the Tanl tagset and Annotation guidelines.
Acknowledgements
Felice Dell'Orletta, Antonio Fuschetto, Alessandro Lenci, Simonetta Montemagni, Francesco Tamberi,
Eva Maria Vecchi.
References
[1] G. Attardi et al. 2008. Tanl (Text Analytics and Natural Language processing).
Project Analisi di Testi per il Semantic Web e il Question Answering,
http://medialab.di.unipi.it/wiki/SemaWiki.
[2] E. Zanchetta, M. Baroni. 2005. Morph-it! A free corpus-based morphological
resource for the Italian language. Proc. of Corpus Linguistics 2005, University of
Birmingham, Birmingham, UK. http://dev.sslmit.unibo.it/linguistics/morph-it.php
[3] M. Monachini. 1995. ELM-IT: An Italian Incarnation of the EAGLES-TS.
Definition of Lexicon Specification and Classification Guidelines. Technical
report, Pisa.
[4] S. Montemagni et al. 2003. Building the Italian Syntactic-Semantic Treebank. In
Abeillé (ed.), Building and using Parsed Corpora, Language and Speech series,
Kluwer, Dordrecht, 189–210.