Dipartimento di Informatica, Università di Pisa
Super-sense tagging (SST) is a Natural Language Processing task that consists of annotating each significant entity in a text (nouns, verbs, adjectives and adverbs) with a category from a general semantic taxonomy defined by the WordNet lexicographer classes (called super-senses) [1]. SST can be considered a task half-way between Named-Entity Recognition (NER) and Word Sense Disambiguation (WSD): it extends NER, since it uses a larger set of semantic categories, and it is easier and more practical than WSD, which deals with very specific senses.
The 45 Super-sense categories are reported in the following table: SST tags.
Closed subtask
In the closed subtask, the goal is to measure super-sense tagging accuracy when only the corpus provided for training is used.
Open subtask
In the open subtask participants will be free to use any external resource in addition to the corpus provided for training; for example, instances of WordNet as well as other lexical or semantic resources.
The evaluation metrics will be:
An evaluation script, adapted from the CoNLL2000 shared task on chunking, is made available for evaluation purposes.
Participants are required to provide a brief description of their system and a full notebook paper describing their experiments, in particular the techniques and the resources used, and presenting an analysis of the results.
A corpus for Super-sense tagging was created starting from the Italian Syntactic-Semantic Treebank (ISST) [2] by a semi-automatic correction and conversion process, followed by manual revision. This process is detailed in [3].
ISST-SST (about 300,000 tokens) will be made available for the task and for research purposes. A portion of about 276,000 tokens will be used for training and development.
The evaluation will be performed on a smaller corpus obtained from a held-out portion of ISST-SST (about 30,000 tokens) and a portion of the Italian Wikipedia (about 20,000 additional tokens).
The creation of ISST-SST was initiated as part of the project SemaWiki (Text Analytics and Natural Language processing - TANL) [4], a collaboration between the University of Pisa and the Institute for Computational Linguistics of CNR.
The training corpus consists of about 276,000 word forms divided into 11,342 sentences.
#documents | 430 |
#sentences | 11,342 |
#tokens | 276,423 |
#Annotated tokens | 135,738 |
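Counts like those in the table above can be recomputed from a corpus file with a few lines of Python. The following is only a sketch under stated assumptions: documents are taken to start with a `-DOCSTART-` line, sentences are assumed to be separated by blank lines, and an "annotated token" is assumed to be any token whose super-sense tag is not `O`. None of these conventions is confirmed by this page, and the function name `corpus_stats` is hypothetical.

```python
from typing import Dict, Iterable


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count documents, sentences, tokens and annotated tokens.

    Assumptions (not confirmed by the task description):
    - a line starting with -DOCSTART- opens a new document;
    - a blank line ends the current sentence;
    - the last field of a token line is its super-sense tag.
    """
    stats = {"documents": 0, "sentences": 0, "tokens": 0, "annotated_tokens": 0}
    in_sentence = False
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "-DOCSTART-":
            if fields:
                stats["documents"] += 1
            in_sentence = False  # separators close any open sentence
            continue
        if not in_sentence:
            stats["sentences"] += 1
            in_sentence = True
        stats["tokens"] += 1
        if fields[-1] != "O":
            stats["annotated_tokens"] += 1
    return stats


sample = [
    "-DOCSTART- -X- O O",
    "",
    "Un un RIms O",
    "incendio incendio Sms B-noun.event",
    "",
    "Umberto umberto SP B-noun.person",
    "Agnelli agnelli SP I-noun.person",
]
stats = corpus_stats(sample)
print(stats)
```

To run it on a real file, pass an open file object instead of the `sample` list; the path to the downloaded corpus is left to the reader.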
ISST-SST Training Corpus: Download (1st version)
SST Tagging Accuracy Evaluator: Evaluation script
Use of the Perl evaluation script conlleval.pl:
conlleval.pl -g <gold-file> -s <sys-output>
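For a quick sanity check before running the official script, chunk-level scores can be approximated in Python. The original CoNLL-2000 evaluation reports precision, recall and F1 over labeled chunks; the sketch below assumes the adapted script scores super-sense chunks in the same way, which this page does not spell out. It works on one sentence at a time, the function names `chunks` and `precision_recall_f1` are hypothetical, and it is not a replacement for `conlleval.pl`.

```python
from typing import Sequence, Set, Tuple

Chunk = Tuple[int, int, str]  # (start, end_exclusive, label)


def chunks(tags: Sequence[str]) -> Set[Chunk]:
    """Collect the labeled chunks of one IOB-tagged sentence."""
    found: Set[Chunk] = set()
    start, label = None, None
    # A trailing "O" sentinel closes a chunk that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the current chunk
        if label is not None:
            found.add((start, i, label))
        # An I- tag that does not continue a chunk is leniently
        # treated as the start of a new one.
        start, label = (None, None) if tag == "O" else (i, tag[2:])
    return found


def precision_recall_f1(gold_tags: Sequence[str], sys_tags: Sequence[str]):
    """Chunk-level precision, recall and F1 for one sentence."""
    gold, pred = chunks(gold_tags), chunks(sys_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["B-noun.person", "I-noun.person", "O", "B-noun.location"]
system = ["B-noun.person", "I-noun.person", "O", "B-noun.artifact"]
scores = precision_recall_f1(gold, system)
print(scores)
```

A chunk counts as correct only when both its boundaries and its label match the gold annotation, so the mislabeled last token above costs both precision and recall.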
ISST-SST Test Corpus: Download (1st version)
Please use the same password that you received when you signed the agreement.
Participants should submit their results by October 14th, midnight Italian time.
Runs must be sent to the organizers' address, deirossi@di.unipi.it, as a file in the same format as the Training Corpus, named as follows:
<team>_SST_<Open|Closed>_<run>
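A small helper can build and check run file names against this pattern. This is a sketch only: the team name `MyTeam` is a made-up example, and the restriction that team and run identifiers contain no underscores or whitespace is an assumption introduced here to keep the name unambiguous, not a rule stated by the organizers.

```python
import re

# Pattern for <team>_SST_<Open|Closed>_<run>.
# Assumption: team and run identifiers have no underscores or whitespace.
RUN_NAME = re.compile(r"^[^_\s]+_SST_(Open|Closed)_[^_\s]+$")


def run_name(team: str, subtask: str, run) -> str:
    """Build a run file name and verify it matches the expected pattern."""
    name = f"{team}_SST_{subtask}_{run}"
    if not RUN_NAME.match(name):
        raise ValueError(f"unexpected run file name: {name!r}")
    return name


example = run_name("MyTeam", "Closed", 1)
print(example)
```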
-DOCSTART- -X- O O
Field Name | Description |
---|---|
Form | Word form or punctuation symbol |
Lemma | Word lemma or punctuation symbol |
PoS | Part-of-speech tag, with morphological features, based on the TANL tagset. |
Super Sense Tag | Super Sense tag in IOB notation |
VENARIA venaria SP B-noun.location
( , FF O
Torino torino SP B-noun.location
) ) FB O
- - FC O
Un un RIms O
incendio incendio Sms B-noun.event
, , FF O
che che PRnn O
si si PC3nn O
sarebbe essere VAd3s O
sviluppato sviluppare Vpsms B-verb.creation
per per E O
cause causa Sfp B-noun.motive
accidentali accidentale Anp B-adj.all
, , FF O
ha avere VAip3s O
gravemente gravemente B B-adv.all
danneggiato danneggiare Vpsms B-verb.change
a a E O
Fiano fiano SP B-noun.location
( , FF O
Torino torino SP B-noun.location
) ) FB O
, , FF O
uno uno RIms O
chalet chalet Smn B-noun.artifact
di di E O
proprietà proprietà Sfn B-noun.possession
di di E O
Umberto umberto SP B-noun.person
Agnelli agnelli SP I-noun.person
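The four-field layout and the IOB notation can be illustrated with a short Python reader that turns token lines into records and groups `B-`/`I-` tags into labeled spans. This is a minimal sketch: it assumes one token per line with whitespace-separated fields, skips `-DOCSTART-` lines, and uses hypothetical names (`Token`, `parse_sentence`, `extract_spans`) that are not part of any tool distributed for the task.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    form: str
    lemma: str
    pos: str
    tag: str  # super-sense tag in IOB notation


def parse_sentence(lines: List[str]) -> List[Token]:
    """Parse token lines with four whitespace-separated fields."""
    tokens = []
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines and document markers.
        if len(fields) != 4 or fields[0] == "-DOCSTART-":
            continue
        tokens.append(Token(*fields))
    return tokens


def extract_spans(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, label) for each tagged span."""
    spans = []
    start, label = None, None
    for i, tok in enumerate(tokens):
        if tok.tag.startswith("I-") and label == tok.tag[2:]:
            continue  # token continues the current span
        if label is not None:
            spans.append((start, i, label))
        start, label = (None, None) if tok.tag == "O" else (i, tok.tag[2:])
    if label is not None:
        spans.append((start, len(tokens), label))
    return spans


lines = [
    "di di E O",
    "Umberto umberto SP B-noun.person",
    "Agnelli agnelli SP I-noun.person",
]
tokens = parse_sentence(lines)
spans = extract_spans(tokens)
print(spans)
```

Here `Umberto Agnelli` comes out as a single `noun.person` span covering tokens 1 and 2, because the `I-` tag continues the span opened by the preceding `B-` tag.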
Giuseppe Attardi, Alessandro Lenci, Simonetta Montemagni.