EVALITA 2011 Super Sense Tagging task

Dipartimento di Informatica, Università di Pisa

Task description

Super-sense tagging (SST) is a Natural Language Processing task that consists of annotating each significant entity in a text (nouns, verbs, adjectives and adverbs) according to a general semantic taxonomy defined by the WordNet lexicographer classes (called super-senses) [1]. SST can be considered a task half-way between Named-Entity Recognition (NER) and Word Sense Disambiguation (WSD): it extends NER, since it uses a larger set of semantic categories, and it is easier and more practical than WSD, which deals with very specific senses.

The 45 Super-sense categories are reported in the following table: SST tags.

Subtasks

Closed subtask
The closed subtask measures super-sense tagging accuracy when only the corpus provided for training is used.

Open subtask
In the open subtask participants will be free to use any external resource in addition to the corpus provided for training; for example, instances of WordNet as well as other lexical or semantic resources.

The evaluation metrics will be:

tagging accuracy, the percentage of correctly classified tokens with respect to the total number of tokens;
precision, the percentage of correct positive predictions over the total number of positive predictions made by the system;
recall, the percentage of correct positive predictions made by the system over the expected predictions;
F1-measure, the harmonic mean of precision and recall.
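The four metrics can be illustrated with a minimal Python sketch. It assumes that precision and recall are computed over whole labelled spans, as in CoNLL-2000 style chunking evaluation, and that accuracy is computed per token. The function names `spans` and `evaluate` are hypothetical, and the official evaluation script may differ in details.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def spans(tags: List[str]) -> Set[Span]:
    """Collect labelled spans from a sequence of IOB tags."""
    found: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes the last span
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if start is not None and not continues:
            found.add((start, i, label))
            start, label = None, None
        # Assumption: an "I" tag with no open span starts a new span.
        if prefix == "B" or (prefix == "I" and start is None):
            start, label = i, name
    return found


def evaluate(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Token accuracy plus span-level precision, recall and F1."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences differ in length")
    correct_tokens = sum(g == p for g, p in zip(gold, pred))
    gold_spans, pred_spans = spans(gold), spans(pred)
    hits = len(gold_spans & pred_spans)  # same boundaries and same label
    precision = hits / len(pred_spans) if pred_spans else 0.0
    recall = hits / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    accuracy = correct_tokens / len(gold) if gold else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


gold = ["B-noun.person", "I-noun.person", "O", "B-verb.change"]
pred = ["B-noun.person", "I-noun.person", "O", "B-verb.creation"]
print(evaluate(gold, pred))
# accuracy 0.75, precision 0.5, recall 0.5, f1 0.5
```

In this toy example three of four tokens are tagged correctly, while only one of the two predicted spans matches a gold span, so accuracy is higher than the span-level scores.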

An evaluation script, adapted from the CoNLL2000 shared task on chunking, is made available for evaluation purposes.

Participants are required to provide a brief description of their system and a full notebook paper describing their experiments, in particular the techniques and the resources used, and presenting an analysis of the results.

Corpora description

Source of training data

A corpus for Super-sense tagging was created starting from the Italian Syntactic-Semantic Treebank (ISST) [2] by a semi-automatic correction and conversion process, followed by manual revision. This process is detailed in [3].

ISST-SST (about 300,000 tokens) will be made available for the task and for research purposes. A portion of about 276,000 tokens will be used for training and development.

The evaluation will be performed on a smaller corpus obtained from a held-out portion of ISST-SST (about 30,000 tokens) and a portion of the Italian Wikipedia (about 20,000 additional tokens).

The creation of ISST-SST was initiated as part of the project SemaWiki (Text Analytics and Natural Language processing - TANL) [4], a collaboration between the University of Pisa and the Institute for Computational Linguistics of CNR.

Training corpus statistics

The training corpus consists of about 276,000 word forms divided into 11,342 sentences.

#documents	430
#sentences	11,342
#tokens	276,423
#Annotated tokens	135,738

Copyright and license

ISST-SST is copyrighted material which can be used for research purposes only and which cannot be distributed in any original or modified form. Participants will be requested to agree on these terms and conditions upon downloading the resource.

Resource download

ISST-SST Training Corpus: Download (1st version)
SST Tagging Accuracy Evaluator: Evaluation script

Use of the Perl evaluation script conlleval.pl:

conlleval.pl -g <gold-file> -s <sys-output>

ISST-SST Test Corpus: Download (1st version)
Please use the same password that you received when you signed the agreement.

Submission details

Participants should submit their results by midnight (Italian time) on October 14th, 2011.

Runs must be sent to the organizers' address, deirossi@di.unipi.it, as a file in the same format as the Training Corpus, named as:

<team>: a short name for the team, without special characters
<Open|Closed>: Open or Closed subtask

The assessment of the submitted runs will be sent to the participants by October 28th, 2011, together with the gold-standard version of the test data.

Contacts

Stefano Dei Rossi
Giulia Di Pietro
Maria Simi

Dipartimento di Informatica, Università di Pisa
Largo B. Pontecorvo, 3
I-56127 Pisa
Italy
Phone: (+39) 050 2212758
Fax: (+39) 050 22127266

Documentation

Data format

Data adheres to the following rules:

Characters are UTF-8 encoded (Unicode).
Data files are organized in documents. Each document starts with the line
```
-DOCSTART- -X- O O
```
A document contains sentences separated by an empty line.
A sentence consists of a sequence of tokens, one token per line.
A token consists of four fields described in the table below. Fields are separated by one tab character.
SST tags can span several tokens and use the IOB notation: labels are prefixed with "B" for begin and "I" for inside, while "O" marks a token outside any label. This notation is typical of several CoNLL tagging tasks.

Field Name	Description
Form	Word form or punctuation symbol
Lemma	Word lemma or punctuation symbol
PoS	Part-of-speech tag, with morphological features, based on the TANL tagset.
Super Sense Tag	Super Sense tag in IOB notation
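A minimal Python sketch of a reader for this format is given below. It relies only on the rules listed above (a `-DOCSTART-` line opens a document, an empty line ends a sentence, each token has four tab-separated fields). The names `Token` and `read_documents` are hypothetical and are not part of the distributed tools.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    pos: str
    sst: str


def read_documents(lines: Iterable[str]) -> List[List[List[Token]]]:
    """Return documents as lists of sentences, each a list of tokens."""
    documents: List[List[List[Token]]] = []
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("-DOCSTART-"):
            if sentence:  # flush a sentence not followed by an empty line
                documents[-1].append(sentence)
                sentence = []
            documents.append([])
        elif not line.strip():
            if sentence:
                documents[-1].append(sentence)
                sentence = []
        else:
            if not documents:  # tolerate input without a -DOCSTART- line
                documents.append([])
            form, lemma, pos, sst = line.split("\t")
            sentence.append(Token(form, lemma, pos, sst))
    if sentence:
        documents[-1].append(sentence)
    return documents


sample = [
    "-DOCSTART- -X- O O\n",
    "\n",
    "Un\tun\tRIms\tO\n",
    "incendio\tincendio\tSms\tB-noun.event\n",
    "\n",
    "Umberto\tumberto\tSP\tB-noun.person\n",
    "Agnelli\tagnelli\tSP\tI-noun.person\n",
]
docs = read_documents(sample)
print(len(docs), len(docs[0]), docs[0][1][1].sst)
```

For a real corpus file, the same function can be applied to a file object opened with `encoding="utf-8"`, since the data are UTF-8 encoded.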

Example

VENARIA	venaria	SP	B-noun.location
(	,	FF	O
Torino	torino	SP	B-noun.location
)	)	FB	O
-	-	FC	O
Un	un	RIms	O
incendio	incendio	Sms	B-noun.event
,	,	FF	O
che	che	PRnn	O
si	si	PC3nn	O
sarebbe	essere	VAd3s	O
sviluppato	sviluppare	Vpsms	B-verb.creation
per	per	E	O
cause	causa	Sfp	B-noun.motive
accidentali	accidentale	Anp	B-adj.all
,	,	FF	O
ha	avere	VAip3s	O
gravemente	gravemente	B	B-adv.all
danneggiato	danneggiare	Vpsms	B-verb.change
a	a	E	O
Fiano	fiano	SP	B-noun.location
(	,	FF	O
Torino	torino	SP	B-noun.location
)	)	FB	O
,	,	FF	O
uno	uno	RIms	O
chalet	chalet	Smn	B-noun.artifact
di	di	E	O
proprietà	proprietà	Sfn	B-noun.possession
di	di	E	O
Umberto	umberto	SP	B-noun.person
Agnelli	agnelli	SP	I-noun.person
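To make the IOB notation in the example concrete, the following Python sketch merges consecutive tokens into labelled phrases, so that "Umberto" and "Agnelli" become a single noun.person entity. The function name `phrases` is hypothetical, and treating a stray "I" tag as the start of a new phrase is an assumption of this sketch.

```python
from typing import List, Tuple


def phrases(rows: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge (form, IOB tag) pairs into (phrase, super-sense) pairs."""
    merged: List[Tuple[str, str]] = []
    open_label = None  # label of the phrase currently being extended
    for form, tag in rows:
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == open_label:
            text, _ = merged[-1]
            merged[-1] = (f"{text} {form}", label)
        elif prefix in ("B", "I"):
            merged.append((form, label))
            open_label = label
        else:  # "O" closes any open phrase
            open_label = None
    return merged


rows = [
    ("di", "O"),
    ("proprietà", "B-noun.possession"),
    ("di", "O"),
    ("Umberto", "B-noun.person"),
    ("Agnelli", "I-noun.person"),
]
print(phrases(rows))
# [('proprietà', 'noun.possession'), ('Umberto Agnelli', 'noun.person')]
```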

The TANL tagset

The description of the Tanl tagset for morpho-syntactic annotation and the annotation guidelines were developed in the SemaWiki project.

Acknowledgements

Giuseppe Attardi, Alessandro Lenci, Simonetta Montemagni.

References

[1] C. Fellbaum (Ed.) (1998) WordNet: An Electronic Lexical Database. MIT Press, Cambridge.
[2] S. Montemagni, et al. 2003. Building the Italian Syntactic-Semantic Treebank. In Abeillé (ed.), Building and using Parsed Corpora, Language and Speech series, Kluwer, Dordrecht, 189–210.
[3] G. Attardi, S. Dei Rossi, G. Di Pietro, A. Lenci, S. Montemagni, M. Simi, A Resource and Tool for Super-sense Tagging of Italian Texts, Proceedings of 7th Language Resources and Evaluation Conference (LREC 2010), Malta, 17-23 May 2010.
[4] G. Attardi et al. 2008. Tanl (Text Analytics and Natural Language processing). Project Analisi di Testi per il Semantic Web e il Question Answering, http://medialab.di.unipi.it/wiki/SemaWiki.