GELATO GELATO Research Topic:
Incremental Parsing



The notion of incrementality has been used in two differing senses in the literature on parsing. In the first sense, the purpose of an incremental parser is to construct the analysis of a text bit by bit in a single left-to-right pass, rather than in one go when the text has come to an end. This is typically the case of syntax-directed editors, where the user writes the text in a top-down fashion, guided by the system itself. The other sense for incrementality stresses the necessity of efficiently analyzing arbitrary changes within the set of input. We shall focus our attention on this last meaning that clearly includes the first one.

Reasons to achieve incremental parsing are well known. For example, in the context of incremental program development, several consecutive corrections of the input are usually made. Therefore incremental parsing can be used to make the overall parsing process efficient, as the goal of the incremental context-free parser is to recover stable parts of a shared forest between consecutive parsing steps, at no cost in space and time. In particular, we have to guarantee the same level of sharing in the resulting parse forest, as in standard parsing. In the context of natural language processing, interest is due to the necessity of efficiently handling arbitrary changes within current input during text composition in language-sensitive editors. Here, incremental parsing can be used to make the overall parsing process efficient, in a context where several consecutive corrections of the text are usually made. This means that preparing a text requires significantly less effort than developing it from scratch. Another application that can motivate incremental parsing is the growing importance of highly interactive, and real-time systems, where the analysis process must be prompted immediately at the onset of new input. Incrementality is also required in systems allowing incomplete parsing, as is the case of the speech recognition where the input language can only be approximately defined, and individual inputs can vary widely from the norm. Finally, the incremental facility can take an interest in parse systems capable of combining pieces of information from different knowledge sources. This is the case of systems involving multimodal communication, where different parts of an utterance can be expressed in different modalities, say, one part in natural language and another by gesture.

In relation to context-free parsing, we chose to work in the context of the application of deterministic techniques to generate very efficient non-deterministic parsers. In order to solve the problems derived from grammatical constraints, we extends Earley's classic algorithm to push-down transducers (PDTs), separating the execution strategy from the implementation of the push-down automaton (PDA) interpreter. So, a family of general context-free parallel parsers is obtained for each family of deterministic context-free push-down ones, simulating all possible computations of any PDT, at worst in cubic time. However, the method is linear in a large class of grammars.

In relation to incremental parsing, the algorithm we developed is closer to the bottom-up algorithm of Ghezzi and Mandrioli than to the top-down one of Jalili and Gallier. The reason for that is the natural predisposition of Earley-like algorithms to incremental bottom-up approaches. Effectively, most of the additional structures required for this class of incrementality, are provided at no expense by the item concept in Earley's method, essentially a structure representing the state of the parsing process for a branch of the parse forest, at each point in the scan.

In the context of the recovery process, we shall define stable items between the initial parsing of and the parsing of the modified input string, as those items that represent a stable configuration of the PDT that would be reconstructed if we had redone an entire parse of the modified input string up to that point. From a practical point of view, we shall only consider the case of recovery on the basis of complete itemsets. This is equivalent to the recovery of all the OR-descendants of a node in a shared forest. Eventually, there will be no recovery of trivial nodes in an itemset. Even if this does not guarantee that all superfluous computations will be avoided, it allows to notably reduce the comparison between stack configurations corresponding to the original and the modified input string. At this point, our experience has shown that the incremental treatment is not interesting from a practical point of view when only a part of an itemset is stable, as it is usually the case when the input grammar has a lot of ambiguities generating crossed forests.

In the case of the parser, we must find in addition the scope of the modifications in the original forest, a simple task given that the nodes affected by changes in its structure are those common with the new parse forest having at least a changed descendant in relation to the new development, that is, those common items capable to be accessed in the continuation of the parsing process in the case no incremental treatment was applied. To update one of these nodes it will be sufficient to find its stable descendants in the parse forest which have been effectively recomputed and to replace them by the original corresponding structure in the recovered parse forest.

Selected Readings:

Note: On-line version of these papers are available in the COLE Publications page.


Send comments and suggestions to webmaster@coleweb.dc.fi.udc.es