GELATO Research Topic: Recent years have seen renewed interest in applying the finite automaton (FA) model to the design of taggers in natural language processing (NLP), even for part-of-speech tagging. This is due to the speed and compactness of FA representations. Indeed, the growing complexity of current tagging systems makes the space required by implementations an important issue in commercial applications, together with computational efficiency. This is especially the case for inflectional languages with a great variety of morphological processes, such as Spanish. We summarize some of the outstanding problems we have to deal with:
This complexity suggests the need for an interface to the tagging process, so that the required properties can be verified easily and maintenance is facilitated. As an example, consider the word sobre. This word has three possible readings in Spanish: preposition (on, upon, over, about), noun (envelope) and verb (to exceed, to be unnecessary). When it is a verb, there are two possible values for the person: first and third. So the output of the morphological analyzer should contain four taggings:
Word: "sobre"
Preposition, "sobre"
Common Noun, Masculine, Singular, "sobre"
Verb, Subjunctive Present, First, Singular, "sobrar"
Verb, Subjunctive Present, Third, Singular, "sobrar"
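To make the shape of this output concrete, the four taggings can be sketched as plain data. This is only an illustrative sketch: the `Tagging` record, the `LEXICON` table and the `analyze` and `format_tagging` helpers are hypothetical names introduced here, and a dictionary lookup stands in for the FA-based analyzer discussed in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Tagging:
    """One reading of a word form (hypothetical record layout)."""
    category: str
    features: Tuple[str, ...]
    lemma: str


# Toy lexicon holding only the example from the text; a real system
# would obtain these readings from the compiled automaton.
LEXICON: Dict[str, List[Tagging]] = {
    "sobre": [
        Tagging("Preposition", (), "sobre"),
        Tagging("Common Noun", ("Masculine", "Singular"), "sobre"),
        Tagging("Verb", ("Subjunctive Present", "First", "Singular"), "sobrar"),
        Tagging("Verb", ("Subjunctive Present", "Third", "Singular"), "sobrar"),
    ],
}


def analyze(word: str) -> List[Tagging]:
    """Return every tagging known for the word, or an empty list."""
    return LEXICON.get(word.lower(), [])


def format_tagging(tagging: Tagging) -> str:
    """Render a tagging in the comma-separated layout used above."""
    return ", ".join((tagging.category, *tagging.features, f'"{tagging.lemma}"'))


if __name__ == "__main__":
    for tagging in analyze("sobre"):
        print(format_tagging(tagging))
```

Keeping the lemma and the feature list separate from the surface form makes the ambiguity explicit: the two verbal readings differ only in the person feature.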
The maintenance of FA-based systems is not always trivial. Most authors therefore propose updating protocols based on simple recompilation from the set of grammatical rules that constitute the descriptive formalism for tagging. This technique is friendlier than directly modifying the FA that serves as the kernel of the system, but the process should also ensure that linguistically related paths in the automaton are shared, in order to permit the implementation of both efficient error recovery and debugging tools. In this respect, classic determinization and minimization techniques for FAs do not guarantee sharing on the basis of this requirement. This implies a loss of declarative power and makes it difficult to study segmentation phenomena, which are often of interest to language specialists. At this point, our goal is to reconcile declarative power with computational efficiency and safety.
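The remark about classic minimization can be illustrated with a small sketch. It assumes a hypothetical four-word list, compiles it into a trie, and then merges states that have identical right languages, which is the usual bottom-up minimization for acyclic automata. The names `State`, `build_trie`, `minimize`, `count_states` and `accepts` are introduced here for illustration only and do not describe any particular system.

```python
from typing import Dict, Iterable


class State:
    """A state of a deterministic acyclic automaton."""

    def __init__(self) -> None:
        self.edges: Dict[str, "State"] = {}
        self.final = False


def build_trie(words: Iterable[str]) -> State:
    """Recompile the whole word list into a trie (one path per word)."""
    root = State()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, State())
        node.final = True
    return root


def minimize(state: State, registry: dict) -> State:
    """Merge states with the same right language, children first."""
    for ch, child in list(state.edges.items()):
        state.edges[ch] = minimize(child, registry)
    # Children are already canonical, so their identity is a valid key.
    signature = (state.final,
                 tuple(sorted((ch, id(t)) for ch, t in state.edges.items())))
    return registry.setdefault(signature, state)


def count_states(root: State) -> int:
    """Count the distinct states reachable from the root."""
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.edges.values())
    return len(seen)


def accepts(root: State, word: str) -> bool:
    """Check whether the automaton recognizes the word."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


# Hypothetical toy word list, chosen only for illustration.
WORDS = ["sobre", "sobra", "cobre", "cobra"]

if __name__ == "__main__":
    print(count_states(build_trie(WORDS)))
    print(count_states(minimize(build_trie(WORDS), {})))
```

For this toy list the trie has 13 states and the minimized automaton has 6. The paths for the sobr- and cobr- forms are merged only because their remaining suffixes coincide, not because the words are linguistically related, which is the kind of sharing the paragraph above describes as insufficient.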
We must provide a mechanism to verify the correctness of the incremental development of taggers. This may help to minimize the number of errors present in new releases of the system by ensuring compatibility with previous ones. The verification method we advocate is based on reductions of a global FA. These reductions collapse states of the automaton until it reaches a size small enough to be printed out and well understood. We can thus focus our attention only on the relevant information, which can be easily manipulated.
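One simple way to picture such a reduction is as a quotient of the automaton: a mapping sends each state to a block, and every transition is redrawn between blocks. The sketch below is a minimal illustration under that assumption, not necessarily the reduction used in the actual system. The function names, the toy automaton for sobre and the block name "stem" are all hypothetical.

```python
from typing import Dict, Hashable, Set, Tuple

Transition = Tuple[Hashable, str, Hashable]


def reduce_automaton(
    transitions: Set[Transition],
    finals: Set[Hashable],
    collapse: Dict[Hashable, Hashable],
) -> Tuple[Set[Transition], Set[Hashable]]:
    """Collapse states: each state maps to its block, or to itself."""

    def block(state: Hashable) -> Hashable:
        return collapse.get(state, state)

    reduced = {(block(src), sym, block(dst)) for src, sym, dst in transitions}
    return reduced, {block(state) for state in finals}


def states_of(transitions: Set[Transition]) -> Set[Hashable]:
    """Collect every state that appears in a transition."""
    return {s for s, _, _ in transitions} | {d for _, _, d in transitions}


# Toy automaton spelling "sobre" with states 0..5 (illustrative only).
TRANSITIONS: Set[Transition] = {
    (0, "s", 1), (1, "o", 2), (2, "b", 3), (3, "r", 4), (4, "e", 5),
}
FINALS: Set[Hashable] = {5}

# Hypothetical reduction: hide the inner stem states behind one block.
COLLAPSE: Dict[Hashable, Hashable] = {1: "stem", 2: "stem", 3: "stem"}

if __name__ == "__main__":
    reduced, reduced_finals = reduce_automaton(TRANSITIONS, FINALS, COLLAPSE)
    print(len(states_of(TRANSITIONS)), "->", len(states_of(reduced)))
```

A quotient of this kind may accept more strings than the original automaton, since collapsed states can introduce loops. It is therefore better suited to inspecting the structure of the tagger than to replacing it.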