published as: CLAUS Report number 5, Saarland University, February 1991
A Bottom-Up Algorithm for Parsing and Generation
Abstract
We present a bottom-up algorithm for parsing and generation. The algorithm is a bottom-up chart parser, whose lexical lookup phase has been modified
for generation. An analysis of the algorithm offers interesting insights into the
relationship between parsing and generation, summarized by the statement that parsing is a very constrained form of generation. The use of the
generation algorithm as a component of a grammar development
environment is discussed.
1 Parsing and Generation
In this paper, we present a chart-based algorithm for parsing and generation.
We will not consider generation from a semantic representation, for which
the usual head-driven algorithms [Shieber et al. 1990] are better suited, but rather the generation of large sets of sentences. This functionality is needed
in grammar development systems for exploring the coverage of a given
grammar, and for checking whether a grammar overgenerates, which is particularly useful if grammars are developed not only for analysis, but also
for generation.
Let us start by giving an intuitive characterization of bottom-up parsing and generation.
* This work was conducted in IBM Germany's text understanding project LILOG. I would like to thank Roman Georg Arens, Monika Becker, Günter Neumann, Karel Oliva and Hans Uszkoreit for fruitful discussion.
Bottom-up parsing starts by creating an item for every occurrence of a word in
the input string. Items carry information about the portion of the string they cover, their position in the string, and their syntactic category. If adjacent
items match the right-hand side of a grammar rule, they are combined into
larger items until an item is found that covers the entire string and whose category is whatever the grammar defines as a well-formed structure. The
parsing process produces as many solutions as there are different derivations
for the given string.
In bottom-up generation, one initial item is created for every entry of the lexicon. In the case of generation, items need not carry information about
string position. Any sequence of items whose categories match the right-hand
side of a grammar rule can be combined into a larger item whose category is the left-hand side of the grammar rule. This process continues until no more
items can be built. As any interesting language contains an infinite number of
sentences, such an algorithm never stops generating items unless it is somehow restricted (see section 3).
Under this view, parsing and generation differ in two respects:
1. In the initialization phase, a parser uses only the words of the string,
whereas a generator uses all words of the lexicon.
2. Items used in parsing are indexed for their position in the string in order
to check adjacency, whereas items used for generation carry information
about the string they cover, but not about string position.
These differences are illustrated in figure 1.
These differences suggest that parsing is a very restricted form of generation, in which it is already known which sentence must be generated. A similar
point on the relationship between parsing and generation is made in [Kay 1980].
Figure 1
We will now outline a generation algorithm that functions as a parser if more
restrictions are imposed on the process. The only difference between parsing and generation occurs in the initialization phase.
2 The Algorithm
The algorithm presented here differs from the intuitive formulation given in
section 1 in that the application of a rule to one or more items in order to
form a new item is generalized in the usual way by using active items. A rule is only applied to one item whose category unifies with the first category of
the rule's right-hand side, leading to the creation of an item, whose remainder is the rest of the rule's right-hand side. If the remainder is not
empty, the item is active, and can be combined with a passive item whose
category unifies with the leftmost element of the active item's remainder.
2.1 Data Structures
The principal data structure used by the algorithm is the item. An item is a quadruple <Span,String,Category,Remainder>, where
• Span is a pair <Starting Vertex, Ending Vertex>.
• String is the string covered by the item.
• Category is a non-terminal symbol of the grammar1.
• Remainder is a list of categories. If Remainder is empty, the item is passive (complete), otherwise it is active (incomplete).
Tree structures are not included in the item, as they can be built up in the syntactic category, e.g. if Definite Clause Grammar or HPSG [Pollard and Sag 1987] is used as the grammar formalism.
When running the algorithm as a parser, the items need not carry information
about the string, whereas in the generation case, it is crucial that the string, but not the span, is instantiated at the beginning of the process.
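To illustrate the difference, the following (hypothetical) item for the word cat with category n is shown in the Prolog notation used in the appendix, once as it would appear during parsing and once during generation:

    item(span(2,3), _,     n, [])    % parsing: the span is known, the covered string need not be stored
    item(span(_,_), [cat], n, [])    % generation: the string is known, the span is left uninstantiated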
Items may be actual or pending. The set of actual items is called the chart, and the set of pending items is called the agenda. Pending items have a
priority, and are added to the chart in the order of descending priority.
Whenever a pending item is added to the chart, it becomes actual, and new pending items are created by applying a rule to the item or by combining it
with an item of complementary activity (i. e. one of the two items must be active, and the other passive).
The reason that pending items are not immediately added to the chart is that this scheme allows one to control the order of exploration of the search space.
If pending items are added to the chart immediately, the result is depth-first
search, which leads to the enumeration of an uninteresting segment of the search space for the generation case, as illustrated by the following example:
The cat saw the cat.
The cat saw the green cat.
The cat saw the green green cat.
The cat saw the green green green cat.
...
In order to avoid such behaviour, we have adopted a strategy that will first
generate all well-formed strings2 of length 1, then all strings of length 2, all strings of length 3, length 4 and so on, until a user-defined maximum length
is reached. If the user does not define a maximum length, the algorithm will
run forever. In order to achieve this behavior, we define a parsing strategy which prefers short pending items over long ones.
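One way to realize this strategy is a priority function that simply counts the words of the item's string (this is essentially the evaluate/2 predicate of the Prolog code in the appendix; length/2 is the standard list-length predicate):

    evaluate(item(_,String,_,_),Priority) :-
        length(String,Length),
        Priority is 0 - Length.     % shorter strings receive higher priority

Since the agenda is kept sorted by descending priority, short candidate strings are taken off the agenda before longer ones.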
1 We are not committed to any particular grammar formalism. Our algorithm works for Context-Free Grammars, Definite Clause Grammars [Pereira and Warren 1980] and Feature Unification Grammars [Shieber 1986].
2 For practical purposes, only a subset of the possible strings of each length is generated, see section 3.
For parsing, this scheme opens the possibility to implement different parsing strategies (see [Erbach 1987]).
2.2 The Procedures
The procedure PROCESS is used for both parsing and generation. For parsing, it must be called with an input string, for generation with a variable. After this process, all the results for parsing or generation can be found in the chart.
INITIALIZE(String)
1. remove all items from previous parsing or generation processes
2. if String is instantiated (parsing):
   for every word W in the string from position n to n+1
      for every lexical entry of W with category Cat
         ADD-ITEM(<span(n,n+1),W,Cat,nil>)
3. if String is uninstantiated (generation):
   for every lexical entry with word W and category Cat
      ADD-ITEM(<span(?,?),W,Cat,nil>)3
The purpose of the initialization procedure is to clear the chart and the
agenda, to perform lexical lookup, and to create an initial agenda of pending
items. Lexical lookup is the only phase in which the algorithm differs for parsing
and generation. The difference is the following: In the case of parsing, we know which words occur in the string and what their positions are. Consequently, one item will be created for every lexical
entry of every word token of the string, and each item is indexed with string
positions. Such indexing allows the algorithm to check for adjacency of items. In the case of generation, we do not know which words will be used, nor do
we know their string positions. Therefore, an item is created for every word in the lexicon. The string position is left uninstantiated. This allows a word to occur more than once in the generated string, because it can be used at any position.
3 nil is the empty list, and ? is a free variable.
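The two variants of lexical lookup can be sketched in Prolog as follows; lexicon/2, nth/4 and add_item/1 are the predicates of the appendix, while the predicate names lookup_parse/1 and lookup_generate/0 are only chosen for this illustration:

    lookup_parse(String) :-             % parsing: one item per word token,
        nth(String,N,N1,Word),          % indexed by its position in the string
        lexicon(Word,Cat),
        add_item(item(span(N,N1),[Word],Cat,[])),
        fail.
    lookup_parse(_).

    lookup_generate :-                  % generation: one item per lexical entry;
        lexicon(Word,Cat),              % the span stays uninstantiated, so the
        add_item(item(span(_,_),[Word],Cat,[])),   % word may be used at any position
        fail.
    lookup_generate.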
ADD-ITEM(<span(Begin,End),String,Cat,Rem>)4
1. add item <span(Begin,End),String,Cat,Rem> to the chart
2. if the item is passive (Rem = nil):
   for every rule LHS -> RHS of the grammar
      if Cat unifies with first(RHS)
         ADD-TO-AGENDA(<span(Begin,End),String,LHS,rest(RHS)>)
   for every active item <span(Begin0,End0),String0,Cat0,Rem0>
      if End0 = Begin and Cat unifies with first(Rem0)
         ADD-TO-AGENDA(<span(Begin0,End),String0+String,Cat0,rest(Rem0)>)
3. if the item is active:
   for every passive item <span(Begin1,End1),String1,Cat1,nil>
      if End = Begin1 and first(Rem) unifies with Cat1
         ADD-TO-AGENDA(<span(Begin,End1),String+String1,Cat,rest(Rem)>)
ADD-ITEM adds an item to the chart, and creates new pending items:
- in case of a passive item, for every applicable rule and every applicable active item,
- in case of an active item, for every applicable passive item.
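As a small worked example, assume a toy grammar rule s -> np vp, and suppose a passive np item for the string the cat, spanning vertices 0 to 1, is added to the chart. The rule creates an active item, which later combines with a passive vp item for sleeps:

    <span(0,1), the cat, np, nil>           passive np item added to the chart
    <span(0,1), the cat, s, [vp]>           rule s -> np vp applied: active item
    <span(1,2), sleeps, vp, nil>            passive vp item added later
    <span(0,2), the cat sleeps, s, nil>     combination of the two: passive s item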
ADD-TO-AGENDA(Item)
1. EVALUATE(Item,Priority)
2. if Item is not longer than the maximal length
      add pair <Item,Priority> to the agenda
ADD-TO-AGENDA adds a new item to the agenda instead of adding it to the
chart immediately like ADD-ITEM. Items which are on the agenda are assigned a priority, and the item with the highest priority is added to the
chart by MAIN-LOOP. The assignment of a priority to an item on the agenda
is the means for controlling the search strategy of the algorithm.
EVALUATE(Item,Priority)
1. Priority := 0 - length of Item's string
4 first(List) selects the first element of List, rest(List) selects the rest of the list (i.e. the list without its first element), + denotes concatenation of strings, and what unify means depends on the grammatical formalism. In the case of context-free grammar it means identity of non-terminal symbols, in the case of DCG it means Prolog unification, and for feature unification grammars, it means feature term unification.
This strategy is useful in order to achieve complete generation for strings of increasing length. Any other search strategy may be defined.
MAIN-LOOP
1. while the agenda is not empty:
      remove item I with the highest priority from the agenda
      ADD-ITEM(I)
The procedure MAIN-LOOP makes pending items actual, thereby creating new pending items.
3 Application in a Grammar Development Environment
Traditionally, grammar development environments consist of an editor for
writing grammars, and a parser for verification of the grammar. A parser is
useful for checking whether a grammar can analyze a given sample of sentences, and whether it assigns sentences the correct structures.
When writing grammars for generation5, it is important to assure that the
grammar is restrictive enough, and does not allow any ungrammatical
sentences. For this task, a parser is not well suited. Even though one can provide the parser with a negative sample of ungrammatical sentences to
check whether they are correctly rejected, one can never be sure to anticipate
all possibilities for ungrammaticality. What is needed is a component which generates a representative sample of sentences, so that the grammar writer can check which set of sentences the
grammar generates, and whether it contains ungrammatical sentences.
There is a division of labor between the parser and the generator. The parser
checks the coverage of the grammar (in logical terms the completeness), and
the generator checks whether the grammar overgenerates (in logical terms the correctness).
Because natural languages are infinite, it is not possible to generate the entire
language. Therefore, a representative finite subset of the language must be
generated. There are a number of ways to restrict the generated language:
Restrict the lexicon
5 Here, we mean generation in the usual sense as generation from a semantic representation.
Restriction of the lexicon avoids uninteresting lexical variation in the generated sample sentences:
John saw the cat
John saw the dog
John saw the rat
John saw the mat
...
For a class of structurally similar lexical entries, only one representative is selected. This requires modification of the lexical lookup phase.
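A minimal sketch of such a modified lookup for generation, assuming the lexicon/2 format of the appendix and a grammar in which the class of an entry is simply its category (the predicate names and the use of setof/3 and member/2 are our own choices):

    representative_entry(Word,Cat) :-
        setof(C, W^lexicon(W,C), Cats),   % all categories occurring in the lexicon
        member(Cat,Cats),
        first_entry(Cat,Word).            % one representative word per category

    first_entry(Cat,Word) :-
        lexicon(Word,Cat), !.

During generation, lexical lookup then enumerates representative_entry(Word,Cat) instead of lexicon(Word,Cat).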
Limit recursion
Unrestricted recursion leads to the generation of sentence sets like the following:
John saw the blue car
John saw the blue blue car
John saw the blue blue blue car
John saw the blue blue blue blue car
...
Avoiding this requires storing the derivational history with the items, and putting an upper limit on the repetition that is allowed. However, one should not eliminate recursion completely, since it is an interesting property of the grammar.
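A sketch of such a check, under the assumption that every item additionally carries its derivation tree in the form tree(Rule,Daughters) with leaf(Word) at the leaves (this representation and the predicate names are assumptions made for the illustration):

    % succeeds if no rule occurs more than Limit times on any root-to-leaf path
    within_repetition_limit(Tree,Limit) :-
        path_check(Tree,[],Limit).

    path_check(leaf(_),_,_).
    path_check(tree(Rule,Daughters),Path,Limit) :-
        count_occurrences(Rule,Path,N),
        N < Limit,
        check_daughters(Daughters,[Rule|Path],Limit).

    check_daughters([],_,_).
    check_daughters([D|Ds],Path,Limit) :-
        path_check(D,Path,Limit),
        check_daughters(Ds,Path,Limit).

    count_occurrences(_,[],0).
    count_occurrences(X,[X|T],N) :- !, count_occurrences(X,T,N0), N is N0 + 1.
    count_occurrences(X,[_|T],N) :- count_occurrences(X,T,N).

An item whose derivation tree violates the limit is then simply not added to the agenda.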
Limit the length of the sentences generated
This prevents the algorithm from running forever.
Restrict the set of rules
One may want to avoid some specialized grammar rules, if they treat
phenomena of the language that one is currently not interested in.
Assure that new or modified rules or lexical entries are used
After modification of a grammar, one may want to know what effect the
changes have on the coverage of the grammar. This restriction prevents
generation of sentences which are based exclusively on old material.
4 Implementation and Efficiency
The algorithm has been implemented in Quintus Prolog under the AIX
operating system on an IBM PS/2. Due to different restrictions on the parsing
and the generation process, we use two separate programs (which share procedures), so that each can be optimized for parsing or generation [Arens and
Erbach 1991]. They are part of a grammar development environment which
consists of a generator, a parser and tools for inspecting feature structures and parse results. STUF [Bouma, König, Uszkoreit 1988] is used as the
grammar formalism.
The generation algorithm which was actually implemented differs from the above idealized algorithm in the following respects:
1. In order to save storage space, we do not store the feature structure with
the item, but rather the derivation tree, from which the feature structure can be reconstructed. Note that reconstructing the feature structure of a string
is computationally much less expensive than parsing the string because no search is involved.
2. We do, however, store a Prolog term for each rule and each item which contains a subset of the information contained in the corresponding feature structure. The motivation is to have a simple and inexpensive test, which can be applied before reconstructing and unifying feature structures. If the Prolog terms do not unify, neither will the corresponding feature structures. Only if the Prolog terms unify do the feature structures have to be reconstructed and unified. Since the feature structures contain more information than the Prolog terms, their unification may fail, even though the Prolog terms unify. But most of the useless computation steps are thus avoided. The same optimization is used by the parser (see the sketch after this list).
3. For binary rules, no active items are constructed, but only passive items,
which are produced by applying the rule to two items. This optimization
saves a lot of storage space. Incidentally, we use the same optimization in the bottom-up active chart parser of our grammar development environment.
4. No active items are produced whose length is the maximal length. In parsing, no active items are produced which extend to the end of the input string.
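As an illustration of the second optimization, the order of the two tests can be sketched as follows; quick_check/2, reconstruct_fs/2 and fs_unify/3 stand for the access to the stored Prolog term, the reconstruction of the feature structure from the derivation tree, and feature-term unification, and are assumed interfaces rather than predicates of the actual implementation:

    try_to_combine(Item1,Item2,FS) :-
        quick_check(Item1,T1),           % coarse Prolog term stored with the item
        quick_check(Item2,T2),
        \+ \+ (T1 = T2),                 % cheap filter: do the terms unify at all?
        reconstruct_fs(Item1,FS1),       % only then rebuild the feature structures
        reconstruct_fs(Item2,FS2),
        fs_unify(FS1,FS2,FS).            % expensive feature-structure unification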
With all of these optimizations, the algorithm must be run overnight to get a sample of the sentences up to length 5 and over the weekend for sentences
up to length 10, as preliminary experimentation with a comprehensive
grammar of German indicates.
5 Conclusion
Chart-based algorithms are useful for both parsing and generation of large
samples of sentences. They store intermediate results, and avoid
recomputation of structures. This is very important for generation of samples, because larger strings are built from shorter ones. Moreover, we
found that some optimizations which were developed for chart parsing also make generation more efficient.
References
[Arens and Erbach 1991] R. G. Arens and G. Erbach. Evaluation von Grammatiken für die Analyse natürlicher Sprache durch Generierung einer repräsentativen Satzmenge. In: Th. Christaller (ed.), Proceedings of GWAI 91. Springer, Berlin.
[Bouma, König, Uszkoreit 1988] G. Bouma, E. König and H. Uszkoreit. A Flexible Graph-Unification Formalism and its Application to Natural-Language Processing. IBM Journal of Research and Development 32(2), 170-184.
[Erbach 1987] Karl Gregor Erbach. An Efficient Chart Parser Using Different Strategies. Department of Artificial Intelligence Working Paper No. 52, University of Edinburgh.
[Kay 1980] Martin Kay. Algorithm Schemata and Data Structures in Syntactic Processing. Report CSL-80-12, Xerox PARC, Palo Alto, CA.
[Pereira and Warren 1980] F. Pereira and D. H. D. Warren. Definite Clause Grammars for Natural Language Analysis. A Survey of the Formalism and a Comparison with Augmented Transition Networks. Artificial Intelligence 13, 231-278.
[Pollard and Sag 1987] C. Pollard and I. Sag. Information-Based Syntax and Semantics. Volume 1: Fundamentals. CSLI Lecture Notes No. 13, Stanford, CA.
[Shieber 1986] Stuart M. Shieber. An Introduction to Unification-Based Approaches to Grammar. CSLI Lecture Notes No. 4, Stanford, CA.
[Shieber et al. 1990] Stuart M. Shieber, Gertjan van Noord, Fernando C. N. Pereira and Robert C. Moore. Semantic-Head-Driven Generation. Computational Linguistics 16(1), 30-42.
Appendix: Prolog Code of the Algorithm
The code presented here is not optimized for efficiency, but rather kept as simple as possible.
/***********************************************************************/
/* process(?String,?Length,?Structure) */
/* process/3 is a procedure for parsing strings and for generating all */
/* possible strings up to a given length */
/* Examples: process([the,cat,is,on,the,mat],Length,Structure) */
/***********************************************************************/
:- dynamic item/4, pending/1, max_length/1.

process(String,Length,Structure) :-              % clause head and glue assumed
    retractall(item(_,_,_,_)),                    % remove items of previous runs
    retractall(pending(_)), retractall(max_length(_)),
    ( integer(Length) -> assert(max_length(Length)) ; true ),
    initialize(String),
    main_loop([]),
    clause(item(_,String,Structure,[]),true).     % read results off the chart

initialize(String) :-                             % lexical lookup (head assumed)
    nth(String,N,N1,Word),     % for every word in the string from N to N1
    lexicon(Word,Cat),         % for every lexical entry of Word
    add_item(item(span(N,N1),[Word],Cat,[])),
    fail.
initialize(_).

main_loop(OldAgenda) :-                           % head and recursion assumed
    update_agenda(OldAgenda,[ITEM-Priority|Agenda]), % sort pending items
    add_item(ITEM),            % add item to chart creating new pending items
    main_loop(Agenda).
main_loop(_).
add_item(Item) :-                    % assumed: first, the item itself is
    assert(Item),                    % added to the chart
    fail.
add_item(Passive) :- % apply rule to passive item
    Passive = item(span(Begin,End),String,Cat0,[]),
    rule(LHS,[First|RHS]),           % assumed rule format and category test
    unify(Cat0,First),
    add_item(item(span(Begin,End),String,LHS,RHS)),
    fail.
add_item(Passive) :- % combine passive with active item
    Passive = item(span(Mid,End),String1,Cat1,[]),
    Active = item(span(Begin,Mid),String0,Cat0,[Cat|Remainder]),
    clause(Active,true),             % assumed: chart lookup, category test
    unify(Cat,Cat1),                 % and string concatenation
    append(String0,String1,String),
    add_to_agenda(item(span(Begin,End),String,Cat0,Remainder)),
    fail.
add_item(Active) :- % combine active with passive item
    Active = item(span(Begin,Mid),String0,Cat0,[Cat|Remainder]),
    Passive = item(span(Mid,End),String1,Cat1,[]),
    clause(Passive,true),            % assumed: chart lookup, category test
    unify(Cat,Cat1),                 % and string concatenation
    append(String0,String1,String),
    add_to_agenda(item(span(Begin,End),String,Cat0,Remainder)),
    fail.
add_item(_).
add_to_agenda(Item) :-                     % assumed clause: evaluate the item,
    evaluate(Item,Priority),               % check the maximal length and record
    Item = item(_,String,_,_),             % it as pending; update_agenda/2 then
    length(String,N),                      % sorts pending items into the agenda
    ( max_length(Max) -> N =< Max ; true ), !,
    assert(pending(Item-Priority)).
add_to_agenda(_).

update_agenda(OldAgenda,NewAgenda) :-      % assumed clause
    retract(pending(Item-Priority)),
    insert_agenda(Item-Priority,OldAgenda,Agenda), !,
    update_agenda(Agenda,NewAgenda).
update_agenda(Agenda,Agenda).

insert_agenda(Item-Priority,[],[Item-Priority]).            % assumed base case
insert_agenda(Item-Priority,[T0-P0|Items],[Item-Priority,T0-P0|Items]) :-
    Priority >= P0.                                          % assumed test
insert_agenda(Item-Priority,[T0-P0|Items],[T0-P0|Agenda]) :-
    Priority < P0,                                           % assumed test
    insert_agenda(Item-Priority,Items,Agenda).
evaluate(item(_,String,_,_),Priority) :-
    length(String,Length),
    Priority is 0 - Length. % shorter strings are given priority
unify(X,X).            % can be used for context-free grammar and DCG, replace this
                       % by your favorite unification procedure for other formalisms

lexicon(cat,n).        % this is the format for lexical entries (example entry assumed)
rule(np,[det,n]).      % assumed format for grammar rules: rule(LHS,RHS-list)