GELATO Research Topic:
Finite-State Automata for Inflectional Morphology

Recent years have seen renewed interest in applying the finite automaton (FA) model to the design of taggers in natural language processing (NLP), including part-of-speech tagging. This is due to the speed and compactness of the representation. As tagging systems grow more complex, the space required by an implementation becomes an important issue in commercial applications, together with computational efficiency. This is especially the case for inflectional languages with a great variety of morphological processes, such as Spanish. We summarize some of the outstanding problems we have to deal with:

A highly complex conjugation paradigm, with nine simple tenses and nine compound tenses, each conjugated in six persons. If we add the Present Imperative with two forms, the Infinitive, Compound Infinitive, Gerund and Compound Gerund, and the Participle with four forms, then 118 inflected forms are possible for each verb (9 × 6 + 9 × 6 + 2 + 4 + 4 = 118).
Irregularities in both verb stems and endings. Very common verbs, such as hacer (to do), have up to seven different stems: hac-er, hag-o, hic-e, har-é, hiz-o, haz, hech-o. Approximately 30% of Spanish verbs are irregular. We have implemented 39 groups of irregular verbs.
Verbal forms with enclitic pronouns at the end. Attaching the pronouns can change the written stem, because an accent may have to be added: da (give), dame (give me), dámelo (give it to me). We have implemented forms with up to three enclitic pronouns, like tráetemelo (bring it for you and me). Here, the analysis has to segment the word and return four tokens.
Some highly irregular verbs can be handled only by including many of their forms directly in the lexicon. This is the case, for example, of ir (to go) and ser (to be).
Gaps in some verb paradigms, in which some forms are missing or simply not used. For instance, meteorological verbs such as nevar (to snow) are conjugated only in the third person singular.
Duplicate past participles, like impreso and imprimido (printed). In such cases, the tagger has to handle both.
A highly complex gender inflection, with words that have only one gender, such as hombre (man) and mujer (woman), and words with the same form for both genders, such as azul (blue). For words with separate masculine and feminine forms, there are many models: autor, autora (author); jefe, jefa (boss); poeta, poetisa (poet); rey, reina (king) or actor, actriz (actor). We have implemented 20 variation groups for gender.
A number inflection that is also highly complex, with words that have only a singular form, such as estrés (stress), and others for which only the plural form is correct, such as matemáticas (mathematics). Forming the plural does not involve as many variants as gender does, but there are still a certain number of models: rojo, rojos (red); luz, luces (light); lord, lores (lord) or frac, fraques (dress coat). We have implemented 10 variation groups for number.
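
The enclitic case in the list above can be sketched as follows. This is only a minimal illustration in Python, not the implementation described here: the enclitic list, the two-entry verb lexicon and the function names are all assumptions made for the example. It strips pronouns from the right, removes the written accent from what remains, and accepts the segmentation only if the result is a known verb form.

```python
import unicodedata

# Hypothetical toy data: a real system would derive these from its lexicon.
ENCLITICS = ("me", "te", "se", "lo", "la", "le", "nos", "os", "los", "las", "les")
VERB_FORMS = {"da", "trae"}  # unaccented verb forms known to the toy lexicon


def strip_acute(text):
    """Remove acute accents (á -> a) while leaving other marks such as ñ intact."""
    decomposed = unicodedata.normalize("NFD", text)
    without_acute = decomposed.replace("\u0301", "")
    return unicodedata.normalize("NFC", without_acute)


def segment(word, max_clitics=3):
    """Return [verb, clitic, ...] for a verb form with enclitics, or None."""

    def search(rest, clitics):
        base = strip_acute(rest)
        # Accept only if at least one pronoun was removed and the rest is a verb.
        if clitics and base in VERB_FORMS:
            return [base] + clitics
        if len(clitics) == max_clitics:
            return None
        for clitic in ENCLITICS:
            if rest.endswith(clitic) and len(rest) > len(clitic):
                found = search(rest[: -len(clitic)], [clitic] + clitics)
                if found:
                    return found
        return None

    return search(word, [])


for form in ("dame", "dámelo", "tráetemelo"):
    print(form, segment(form))
```

For tráetemelo the sketch returns four tokens (the verb form plus three pronouns). Checking the remainder against the lexicon is what keeps ordinary words that merely end in a pronoun-like string from being segmented.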

This complexity suggests the need for an interface to the tagging process, so that the required properties can be verified easily and maintenance is simpler. As an example, consider the word sobre. It has three possible readings in Spanish: preposition (on, upon, over, about), noun (envelope) and verb (to exceed, to be unnecessary). As a verb, it can be either first or third person. So the output of the morphological analyzer should contain four taggings:

     Word: "sobre"
           Preposition, "sobre"
           Common Noun, Masculine, Singular, "sobre"
           Verb, Subjunctive Present, First, Singular, "sobrar"
           Verb, Subjunctive Present, Third, Singular, "sobrar"
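
One way to picture such a lookup is a deterministic acyclic automaton whose final states carry the list of taggings. The Python sketch below uses a plain trie for this; the class and function names are hypothetical, and the single entry is the one from the example above, so it should be read as an illustration rather than as the analyzer itself.

```python
class State:
    """One state of a deterministic acyclic automaton (here, a plain trie)."""

    def __init__(self):
        self.next = {}      # character -> State
        self.taggings = []  # non-empty only on final states


def add_entry(root, form, tag, lemma):
    """Add one (tag, lemma) reading for a surface form."""
    state = root
    for ch in form:
        state = state.next.setdefault(ch, State())
    state.taggings.append((tag, lemma))


def analyze(root, form):
    """Follow the form character by character; return its taggings, or []."""
    state = root
    for ch in form:
        state = state.next.get(ch)
        if state is None:
            return []
    return list(state.taggings)


lexicon = State()
for tag, lemma in [
    ("Preposition", "sobre"),
    ("Common Noun, Masculine, Singular", "sobre"),
    ("Verb, Subjunctive Present, First, Singular", "sobrar"),
    ("Verb, Subjunctive Present, Third, Singular", "sobrar"),
]:
    add_entry(lexicon, "sobre", tag, lemma)

print('Word: "sobre"')
for tag, lemma in analyze(lexicon, "sobre"):
    print(f'      {tag}, "{lemma}"')
```

Lookup time depends only on the length of the word, not on the size of the lexicon, which is the speed argument made for FA-based taggers.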

The maintenance of FA-based systems is not always trivial. Most authors therefore propose update protocols based on simple recompilation from the set of grammatical rules that make up the descriptive formalism for tagging. This technique is friendlier than directly modifying the FA that serves as the kernel of the system, but the process should also ensure that linguistically related paths in the automaton are shared, in order to permit the implementation of efficient error recovery and debugging tools. Classic determinization and minimization techniques for FAs do not guarantee sharing on this basis. This implies a loss of declarative power and makes it difficult to study segmentation phenomena, which are often of interest to language specialists. Our goal is to reconcile declarative power with computational efficiency and safety.
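
The point about classic minimization can be illustrated with a small Python sketch. It builds a trie for three plural forms and merges states that have the same right language, which is the standard criterion for minimizing an acyclic automaton. The word list and the function names are chosen only for this example.

```python
def build_trie(words):
    """Trie as a dict: state id -> {"final": bool, "next": {char: state id}}."""
    trie = {0: {"final": False, "next": {}}}
    for word in words:
        state = 0
        for ch in word:
            if ch not in trie[state]["next"]:
                new_id = len(trie)
                trie[new_id] = {"final": False, "next": {}}
                trie[state]["next"][ch] = new_id
            state = trie[state]["next"][ch]
        trie[state]["final"] = True
    return trie


def minimize(trie):
    """Merge states with identical right languages; return state id -> class id."""
    registry = {}  # signature -> class id
    class_of = {}

    def visit(state):
        node = trie[state]
        signature = (
            node["final"],
            tuple(sorted((ch, visit(child)) for ch, child in node["next"].items())),
        )
        class_of[state] = registry.setdefault(signature, len(registry))
        return class_of[state]

    visit(0)
    return class_of


def state_after(trie, prefix):
    """Return the trie state reached after reading prefix."""
    state = 0
    for ch in prefix:
        state = trie[state]["next"][ch]
    return state


trie = build_trie(["rojos", "jefes", "luces"])
class_of = minimize(trie)
print("trie states:", len(trie))
print("minimized states:", len(set(class_of.values())))
print("jef and luc merged:",
      class_of[state_after(trie, "jef")] == class_of[state_after(trie, "luc")])
```

In this sketch the states reached after jef and luc are merged because both are followed by es, although the morphological boundary falls in different places (jefe + s, luc + es). The minimized automaton is smaller, but its shared paths no longer follow the linguistic segmentation.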

We must provide a mechanism to verify the correctness of the incremental development of taggers. This can help minimize the errors present in new releases of the system by ensuring compatibility with previous ones. The verification method we advocate is based on reductions of a global FA. These reductions collapse states of the automaton until it reaches a size small enough to be printed and well understood. We can then focus only on the relevant information, which can be easily manipulated.
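
As a rough illustration of what collapsing states might look like, the Python sketch below maps every state of a small automaton to a class label and keeps only the transitions that cross from one class to another. The automaton, the class labels and the decision to drop transitions internal to a class are all assumptions made for this example; the text above does not specify how the reductions are defined.

```python
def reduce_automaton(transitions, finals, class_of):
    """Collapse states into classes.

    transitions: set of (source, symbol, target) triples
    finals: set of final states
    class_of: dict mapping every state to the label of its class
    Transitions that stay inside one class are dropped (an assumption of this
    sketch), so only the moves between classes remain visible.
    """
    reduced = {
        (class_of[s], a, class_of[t])
        for s, a, t in transitions
        if class_of[s] != class_of[t]
    }
    reduced_finals = {class_of[s] for s in finals}
    return reduced, reduced_finals


# Hypothetical automaton for rojo, roja, rojos, rojas.
transitions = {
    (0, "r", 1), (1, "o", 2), (2, "j", 3),
    (3, "o", 4), (3, "a", 5),
    (4, "s", 6), (5, "s", 6),
}
finals = {4, 5, 6}
class_of = {0: "stem", 1: "stem", 2: "stem", 3: "stem",
            4: "gender", 5: "gender", 6: "number"}

reduced, reduced_finals = reduce_automaton(transitions, finals, class_of)
for source, symbol, target in sorted(reduced):
    print(f"{source} --{symbol}--> {target}")
print("final classes:", sorted(reduced_finals))
```

Seven states become three classes, and what remains is a summary of how gender and number endings attach to the stem. Comparing such summaries between two releases is one possible way to check that an update has not changed the behavior of the parts that were supposed to stay the same.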

Selected Readings

M. Vilares Ferro, J. Graña Gil and P. Alvariño Alvariño,
Finite-State Morphology and Formal Verification,
in András Kornai (ed.), Extended Finite State Models of Language, Cambridge University Press, 1997.
M. Vilares Ferro, J. Graña Gil and A. Pan Bermúdez,
Building Friendly Architectures for Tagging,
Procesamiento del Lenguaje Natural, 19:127-132, 1996.

Note: On-line versions of these papers are available on the COLE Publications page.

