Annotation guidelines

From Medialab

Tokenization

POS tagging is based on the following assumptions:

  1. Tokens are not split: 'andandosene', 'mangiandolo', etc. are each kept as one word. Splitting is performed in a post-processing step, after POS tagging.
  2. Multi-word expressions are not recognized: 'grazie_al' is treated as 'grazie al', 'prima_delle' as 'prima delle'.
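
As a minimal sketch of the second assumption, the snippet below breaks underscore-joined multi-word tokens back into separate words while leaving tokens such as 'andandosene' untouched. The underscore convention is taken from the examples above; the function names are hypothetical and not part of any existing tool.

```python
def split_multiword(token):
    """Split an underscore-joined multi-word token into its component words."""
    parts = [part for part in token.split("_") if part]
    # Keep the token as-is if splitting would leave nothing (e.g. a bare "_").
    return parts or [token]


def normalize_tokens(tokens):
    """Flatten a token list so that multi-words become separate tokens."""
    normalized = []
    for token in tokens:
        normalized.extend(split_multiword(token))
    return normalized


print(normalize_tokens(["grazie_al", "prima_delle", "andandosene"]))
```

Clitic splitting ('andandosene') is deliberately not attempted here, since the guidelines defer it to a post-processing step after tagging.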

Verbs

Verbs are split into the following categories:

  1. VA: auxiliary verbs (essere, avere, or venire in passive constructions);
  2. VM: all occurrences of modal verbs (volere, potere, dovere, solere);
  3. V: main verbs.
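
The three categories can be sketched as a small lookup. This is only an illustration: whether a given occurrence of essere, avere, or venire is actually used as an auxiliary depends on context, so the sketch assumes that decision is supplied by the caller through a hypothetical used_as_auxiliary flag.

```python
AUXILIARY_LEMMAS = {"essere", "avere", "venire"}
MODAL_LEMMAS = {"volere", "potere", "dovere", "solere"}


def verb_category(lemma, used_as_auxiliary=False):
    """Return VA, VM, or V for a verb lemma.

    used_as_auxiliary is an assumption of this sketch: the caller decides
    from context whether the verb is acting as an auxiliary.
    """
    if lemma in MODAL_LEMMAS:
        # All occurrences of modal verbs are VM.
        return "VM"
    if used_as_auxiliary and lemma in AUXILIARY_LEMMAS:
        return "VA"
    return "V"


print(verb_category("potere"))
print(verb_category("essere", used_as_auxiliary=True))
print(verb_category("essere"))
```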

Annotating Proper Names

  1. Tokens with a grammatical role, such as articles, prepositions, articulated prepositions, and numbers, should not be tagged as SP.
  2. More importance should be given to capitalized words in mid-phrase positions. Capitalized words in mid-phrase positions generally refer to a person, organization, title (of a film, book, song, etc.), or place. It is useful for Named Entity Recognition to annotate tokens that can be classified as such with SP.
  3. The critical case is when the name of an organization, title, place, etc. is written in lowercase. For example, in 'la Corte costituzionale' both 'Corte' and 'costituzionale' should be tagged as SP because they refer to an organization, even though not both words are capitalized. These cases should be analyzed by hand to decide whether they should be considered proper names.


Underspecified Gender and Number

Rationale: The gender [mfn] of a noun or adjective is closely tied to the lexicon. If a noun or adjective may appear in different contexts with different genders, it is given the gender n (underspecified). No attempt is made to infer the gender from the context. This also applies to professions that are usually held by men, e.g. general [ns] or president [ns]. The same is true for number [spn].

Examples

rappresentante [ns]
complice [ns]
militare [ns]
elettricista [ns]
coniuge [ns]
magnati [np]
militari [np]
civili [np]
narcotrafficanti [np]
portabandiera [nn]
portaborse [nn]
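
A minimal sketch of the underspecification rule, assuming the lexicon can be queried for the set of genders and numbers a form may take (the function names and the set-based representation are hypothetical):

```python
def underspecify(values):
    """Return the only value in the set, or 'n' when it is not unique."""
    values = set(values)
    if len(values) == 1:
        return next(iter(values))
    # More than one possible value (or none recorded): underspecified.
    return "n"


def gender_number(genders, numbers):
    """Build the two-letter gender/number code, e.g. 'ns', 'np', 'nn'."""
    return underspecify(genders) + underspecify(numbers)


print(gender_number({"m", "f"}, {"s"}))        # rappresentante
print(gender_number({"m", "f"}, {"p"}))        # militari
print(gender_number({"m", "f"}, {"s", "p"}))   # portabandiera
```

Note that the context of the token is never consulted, in line with the rationale above.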

Critical Cases

carcere [ms], carceri [fp] - annotated this way consistently in isst_tanl
ordine [ms] - annotated as [ns] in isst_tanl
zairoti [mp] - annotated as [np] in isst_tanl
zairota [ns] - annotated as [fs] in isst_tanl
Note: carcere, carceri and ordine are now tagged as shown above in isst_tanl (consistently)

Adjective vs Past Participles or Present Participles

Adjective

  1. If it is used to qualify a noun, then it is an adjective. Note that the lemma should change accordingly. For example:
(a) il ristorante aperto [Ams]
(b) lui si trova ricoverato [Ams]
(c) Maria era interessata [Afs]

Verb

It is a verb when it is used with an auxiliary verb as part of a more complex verbal phrase. For example:
(d) hanno isolato [Vpsms]
(e) si è trasformato [Vpsms]
(f) è stato [Vpsms] squalificato [Vpsms]
(g) ha dovuto essere sgomberata [Vpsfs]
It is a verb when there is an implicit auxiliary verb in a passive construction [Vps] or an implicit relative clause [Vps]. For example:
(h) Finito [Vpsms] il lavoro, Riccardo torna a casa
(i) la casa abitata [Vpsfs] da un anno
(j) il ristorante aperto [Vpsms] la scorsa settimana

Critical Cases

The distinction is not always clear-cut. Critical cases include situations where the adjective/verb follows the verb 'essere' and acts as a predicate [adj]; ... and so on. Consider the following sentences:

(k) La porta deve essere aperta [Afs] perché c'è corrente.
(l) La porta deve essere aperta [Vpsfs] ogni mattina alle otto.
In order to determine the tags for these examples, it is necessary to have contextual knowledge. This knowledge, however, is not always provided at the morpho-syntactic level (as seen in the examples above), thus it is difficult to generate clear-cut criteria for annotating these ambiguities. The two options available to handle such cases are either to maintain the ambiguity or for the human annotator to make an educated guess.


Adjective/Past Participle vs Noun

It is often the case that adjectives as well as past participles can act as nouns. Consider the following contexts:

1. Io preferisco la bionda (A/S), tu la castana (A)
2. Alla stazione ho incontrato un belga (A/S)
3. Non so quanto sia il dovuto (V/A/S)
4. L’Ispettorato invierà all’Unità Operativa lo stralcio del tabulato del digitato (V/A) degli ultimi due mesi

The parentheses give the possible morpho-syntactic interpretations of the preceding word according to the De Mauro Paravia Italian dictionary. In principle, any Italian adjective or past participle can act as a noun in specific contexts. Dictionaries, however, record only the most typical or frequent usages, which explains the asymmetric lexicographic treatment of biondo and castano in 1) above.

For the specific goals of our work, I do not think it is a good idea to associate every adjective, and possibly every verb (past participle) entry, with a noun interpretation in order to deal with these cases. I believe instead that the following criteria should be followed when annotating these types of contexts:

a. If the adjective acting as a noun has a noun interpretation in the dictionary, then it should be tagged as a noun.
b. If the adjective acting as a noun does not have a noun interpretation in the dictionary, then it should be tagged as an adjective; if it only has a verbal interpretation, then it should be tagged as a verb.

This is what we have been doing so far. Here follows the annotation of examples 1-4 according to the criteria stated above:

1. Io preferisco la bionda (S), tu la castana (A)
2. Alla stazione ho incontrato un belga (S)
3. Non so quanto sia il dovuto (S)
4. L’Ispettorato invierà all’Unità Operativa lo stralcio del tabulato del digitato (A) degli ultimi due mesi
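
Criteria a and b amount to a fixed preference order over the dictionary readings (noun, then adjective, then verb). The sketch below illustrates this, assuming the dictionary readings are available as a collection of the codes S, A, and V; the function name is hypothetical.

```python
def tag_nominalized(dictionary_readings):
    """Pick the tag for an adjective or participle used as a noun.

    dictionary_readings holds the interpretations recorded in the
    dictionary for the word, using the codes 'S', 'A', and 'V'.
    """
    readings = set(dictionary_readings)
    for tag in ("S", "A", "V"):  # preference order from criteria a and b
        if tag in readings:
            return tag
    raise ValueError("no S, A, or V reading found in the dictionary")


print(tag_nominalized({"A", "S"}))       # bionda, belga
print(tag_nominalized({"A"}))            # castana
print(tag_nominalized({"V", "A", "S"}))  # dovuto
print(tag_nominalized({"V", "A"}))       # digitato
```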

Examples and Critical Cases

The following examples of critical cases are tagged as shown below because they are only categorized as adjectives in our lexicon.

all'improvviso [Ams]
sanno di trentino [Ams]
sanno di buono [Ams]

Personal Pronouns and Clitics

Rationale: we have introduced PC (clitic pronoun) for most cases of personal pronouns in the clitic form and for other clitics (me, mi, te, ti, se, si, ce, ci, ve, vi, gli, la, le, lo, li, ne). In particular:

mi , me [PC1ns]
ti , te [PC2ns]
si , se [PC3nn]
ci , ce [PC1np] when personal pronoun
ci , c' , ce [PCnn] when adverb
vi , ve [PC2np] when personal pronoun
vi [PCnn] when adverb
lo [PC3ms]
la [PC3fs]
l' [PC3ns]
li [PC3mp]
le [PC3fp] when accusative, [PC3fs] when dative
gli [PC3ms], [PC3mp], [PC3fp] according to context => better to always use PC3ms (MS)
ne , n' [PCnn]
Note: in isst_tanl PC is also used for ci and vi when they are adverbs, while in the Repubblica resource this usage is marked as B. If the guidelines were applied consistently, we could distinguish these cases by the presence of the person feature.
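
The table above can be sketched as a lookup with two context flags. This is an illustration only: the adverbial and dative flags are assumptions of the sketch and must be decided by the annotator from context, and gli is always mapped to PC3ms following the preference noted in the table.

```python
# Forms whose tag does not depend on context, as listed in the table above.
FIXED_TAGS = {
    "mi": "PC1ns", "me": "PC1ns",
    "ti": "PC2ns", "te": "PC2ns",
    "si": "PC3nn", "se": "PC3nn",
    "c'": "PCnn",
    "ve": "PC2np",
    "lo": "PC3ms", "la": "PC3fs", "l'": "PC3ns", "li": "PC3mp",
    "gli": "PC3ms",  # simplification: always PC3ms
    "ne": "PCnn", "n'": "PCnn",
}


def clitic_tag(form, adverbial=False, dative=False):
    """Return the PC tag for a clitic form.

    adverbial: the form (ci, ce, vi) is used as an adverb.
    dative: the form 'le' is used as a dative.
    """
    form = form.lower()
    if form in ("ci", "ce"):
        return "PCnn" if adverbial else "PC1np"
    if form == "vi":
        return "PCnn" if adverbial else "PC2np"
    if form == "le":
        return "PC3fs" if dative else "PC3fp"
    return FIXED_TAGS[form]


print(clitic_tag("Ci", adverbial=True))
print(clitic_tag("le", dative=True))
print(clitic_tag("gli"))
```

Under this scheme an adverbial ci or vi is recognizable by the missing person digit in its tag (PCnn rather than PC1np or PC2np).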

Critical Cases

Ci [PCnn] crederò soltanto
Qualcuno ci [PCnn] ha provato
Che ci [PCnn] devo fare?
Il doppio di quanto ne [PCnn] elaborino
Vi [PCnn] fanno parte

Number One vs Indefinite Article

Rationale: we define specific environments in which uno should be tagged as a number, [N], or as an indefinite article, [RI].

There are two cases in which uno is considered a number:
  1. When followed by the preposition di
  2. When followed by the preposition su plus a number.
All other instances of uno should be tagged as an indefinite article.
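
The two environments can be sketched as a check on the tokens that follow uno. Several details are assumptions of this sketch rather than part of the guidelines: articulated forms of di (such as degli) are counted as the preposition di, the feminine and elided forms of uno are accepted, and a number is recognized either as digits or from a small, incomplete list of number words.

```python
UNO_FORMS = {"uno", "una", "un", "un'"}
DI_FORMS = {"di", "d'", "del", "dello", "della", "dell'", "dei", "degli", "delle"}
# Illustrative, incomplete list of Italian number words.
NUMBER_WORDS = {"due", "tre", "quattro", "cinque", "sei", "sette",
                "otto", "nove", "dieci", "cento", "mille"}


def is_number(token):
    """Rough number test: digits or a known number word."""
    return token.isdigit() or token.lower() in NUMBER_WORDS


def tag_uno(tokens, index):
    """Return 'N' or 'RI' for the form of uno at tokens[index]."""
    if tokens[index].lower() not in UNO_FORMS:
        raise ValueError("token is not a form of uno")
    following = [t.lower() for t in tokens[index + 1:index + 3]]
    if following and following[0] in DI_FORMS:
        return "N"
    if len(following) == 2 and following[0] == "su" and is_number(following[1]):
        return "N"
    return "RI"


print(tag_uno("uno degli studenti".split(), 0))
print(tag_uno("uno studente".split(), 0))
print(tag_uno("a uno su dieci non piace".split(), 1))
print(tag_uno("a una persona su dieci".split(), 1))
```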

Examples

uno degli studenti [N] => annotated as PIms
uno studente [RI] => annotated as RIms
hanno trovato che a uno su dieci non piace andare in bici [N] => annotated as PIms
a una persona su dieci non piace andare in bici [RI] => annotated as PIfs