Collaborating on Utterances with a Spoken Dialogue System
Using an ISU-based Approach to Incremental Dialogue Management
Abstract

When a spoken dialogue system can process speech incrementally and engage in overlapping turn-taking, a whole range of interaction devices becomes available. We explore the use of one such device, the collaborative construction of utterances, supported by immediate feedback, trial intonations and overlapping non-linguistic actions. In an overhearer evaluation, the incremental system was judged as significantly more human-like and reactive than a non-incremental variant.

1 Introduction

In human–human dialogue, most utterances have only one speaker.1 However, the shape that an utterance ultimately takes on is often determined not just by the one speaker, but also by her addressees. A speaker intending to refer to something may start with a description, monitor while going on whether the description appears to be understood sufficiently well, and if not, possibly extend it, rather than finishing the utterance in the form that was initially planned. This monitoring within the utterance is sometimes even made very explicit, as in the following example from (Clark, 1996):

(1) A: Allegra, uh, replied and, uh, . . .

In this example, A makes use of what Sacks and Schegloff (1979) called a try marker, a “questioning upward intonational contour, followed by a brief pause”. As discussed by Clark (1996), this device is an efficient solution to the problem posed by uncertainty on the side of the speaker whether a reference is going to be understood, as it checks for understanding in situ, and lets the conversation partners collaborate on the utterance that is in progress.

Spoken dialogue systems (SDS) typically cannot achieve the close coupling between production and interpretation that is needed for this to work, as normally the smallest unit on which they operate is the full utterance (or, more precisely, the turn). (For a discussion see e.g. (Skantze and Schlangen, 2009).) We present here an approach to managing dialogue in an incremental SDS that can handle this phenomenon, explaining how it is implemented in a system (Section 4) that works in a micro-domain (which is described in Section 3). As we will discuss in the next section, this goes beyond earlier work on incremental SDS, combining the production of multimodal feedback (as in (Aist et al., 2007)) with fast interaction in a semantically more complex domain (compared to (Skantze and Schlangen, 2009)).

1 Though by far not all; see (Clark, 1996; Purver et al., 2009).

2 Related Work

Collaboration on utterances has not often been modelled in SDS, as it presupposes fully incremental processing, which itself is still something of a rarity in such systems. (There is work on collaborative reference (DeVault et al., 2005; Heeman and Hirst, 1995), but that focuses on written input, and on collaboration over several utterances and not within utterances.) There are two systems that come closer to what we aim to do here.

The system described in (Aist et al., 2007) is able to produce some of the phenomena that we are interested in. It operates in a reference game (as we will see, the domain we have chosen is very similar), where users can refer to objects shown on the screen, and the SDS gives continuous feedback about its understanding by performing on-screen actions. While we do produce similar non-linguistic behaviour in our system, we also go beyond this by producing verbal feedback that responds to the certainty of the speaker (expressed by the use of trial intonation). Unfortunately, very few technical details are given in that paper, so that we cannot compare the approaches in more detail.

Even more closely related is some of our own previous work (Skantze and Schlangen, 2009), where we modeled fast system reactions to delivery of information in installments in a number sequence dictation domain. In a small corpus study, we found a very pronounced use of trial or installment intonations, with the first installments of numbers being bounded by rising intonation, and the final installment of a sequence by falling intonation. We made use of this fact by letting the system distinguish these situations based on prosody, and giving it different reaction possibilities (back-channel feedback vs. explicit confirmation).

The work reported here is a direct scaling up of that work. For number sequences, the notion of utterance is somewhat vague, as there are no syntactic constraints that help demarcate its boundaries. Moreover, there is no semantics (beyond the individual number) that could pose problems – the main problem for the speaker in that domain is ensuring that the signal is correctly identified (as in, the string could be written down), and the trial intonation is meant to provide opportunities for grounding whether that is the case. Here, we want to go beyond that and look at utterances where it is the intended meaning whose recognition the speaker is unsure about (grounding at level 3 rather than (just) at level 2 in terms of (Clark, 1996)). This difference leads to differences in the follow-up potential: where in the numbers domain typical repair follow-ups were repetitions, in semantically more complex domains we can expect more varied follow-ups.
3 The Domain

To investigate these issues in a controlled setting, we chose a domain that makes complex and possibly underspecified references likely, and that also allows a combination of linguistic and non-linguistic feedback. In this domain, the user’s goal is to instruct the system to pick up and manipulate Tetris-like puzzle pieces, which are shown on the screen. We recorded human–human as well as human–(simulated) machine interactions in this domain, and indeed found frequent use of “packaging” of instructions, and immediate feedback, as in (2) (arrow indicating intonation).

We chose these as our target phenomena for the implementation: intra-utterance hesitations, possibly with trial intonation (as in line 2);2 immediate execution of actions (line 4), and their grounding role as display of understanding (“yeah” in line 3). The system controls the mouse cursor, e.g. moving it over pieces once it has a good hypothesis about a reference; other actions are visualised similarly.

2 Although we chose to label this “intra-utterance” here, it doesn’t matter much for our approach whether one considers this example to consist of one or several utterances; what matters is that differences in intonation and pragmatic completeness are reacted to immediately.
4 The System

Our system is realised as a collection of incremental processing modules in the InproToolKit (Schlangen et al., 2010), a middle-ware package that implements some of the features of the model of incremental processing of (Schlangen and Skantze, 2009). The modules used in the implementation will be described briefly below.

For speech recognition, we use Sphinx-4 (Walker et al., 2004), with our own extensions for incremental speech recognition (Baumann et al., 2009), and our own domain-specific acoustic model. For the experiments described here, we used a recogniser restricted to the vocabulary of the domain.
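To make the word-level incremental output concrete, the following is a minimal sketch (in Python, not the actual InproTK interface, which is Java-based) of a consumer that reacts to add/revoke edits on the growing hypothesis, in the spirit of the incremental-unit model of (Schlangen and Skantze, 2009); the AsrEdit type and all names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Literal

# Illustrative only: names and types are assumptions, not the InproTK API.
@dataclass
class AsrEdit:
    op: Literal["add", "revoke"]   # incremental ASR may retract earlier words
    word: str
    start: float                   # word start time (s)
    end: float                     # word end time (s)

class HypothesisBuffer:
    """Keeps the current best word sequence as edits stream in."""
    def __init__(self) -> None:
        self.words: List[AsrEdit] = []

    def apply(self, edit: AsrEdit) -> List[str]:
        if edit.op == "add":
            self.words.append(edit)
        elif edit.op == "revoke" and self.words:
            # a revoke undoes the most recent addition
            assert self.words[-1].word == edit.word
            self.words.pop()
        return [w.word for w in self.words]

# Usage: downstream modules (NLU, floor tracker) are notified after every edit,
# not only at the end of the turn.
buf = HypothesisBuffer()
for edit in [AsrEdit("add", "take", 0.1, 0.4),
             AsrEdit("add", "the", 0.4, 0.5),
             AsrEdit("revoke", "the", 0.4, 0.5),
             AsrEdit("add", "that", 0.4, 0.6)]:
    print(buf.apply(edit))
```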
Another module performs online prosodic analysis, based on pitch change, which is measured in semitones per second over the turn-final word, using a modified YIN (de Cheveigné and Kawahara, 2002). Based on the slope of the f0 curve, we classify pitch as rising or falling.
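As an illustration of this classification step, the sketch below fits a line to the f0 values of the turn-final word on a semitone scale and labels the slope; the reference frequency and the decision threshold are assumptions, not values taken from the system.

```python
import numpy as np

def semitones(f0_hz: np.ndarray, ref_hz: float = 100.0) -> np.ndarray:
    """Convert f0 values (Hz) to a semitone scale relative to a reference."""
    return 12.0 * np.log2(f0_hz / ref_hz)

def classify_pitch(times_s: np.ndarray, f0_hz: np.ndarray,
                   threshold_st_per_s: float = 5.0) -> str:
    """Label the turn-final word as 'rising' or 'falling' from the f0 slope.

    times_s, f0_hz: voiced frames of the turn-final word.
    threshold_st_per_s: illustrative decision boundary in semitones/second.
    """
    slope = np.polyfit(times_s, semitones(f0_hz), deg=1)[0]
    return "rising" if slope > threshold_st_per_s else "falling"

# Example: f0 climbing from 110 Hz to 150 Hz over 300 ms -> 'rising'
t = np.linspace(0.0, 0.3, 30)
f0 = np.linspace(110.0, 150.0, 30)
print(classify_pitch(t, f0))
```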
This information is used by the floor tracking module, which notifies the dialogue manager (DM) about changes in floor status. These status changes are classified by simple rules: silence following rising pitch leads to a timeout signal being sent to the DM faster (200ms) than silence after falling pitch (500ms). (Comparable to the rules in (Skantze and Schlangen, 2009).)
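A minimal sketch of such rules follows, assuming a hypothetical notify_dm callback and simple signal names; only the 200ms/500ms timeouts come from the description above.

```python
import time
from typing import Callable, Optional

# Timeouts from the text: silence after rising pitch is reported faster.
TIMEOUT_AFTER_RISING_S = 0.2
TIMEOUT_AFTER_FALLING_S = 0.5

class FloorTracker:
    """Notifies the DM once silence after the user's speech has lasted long enough."""
    def __init__(self, notify_dm: Callable[[str], None]) -> None:
        self.notify_dm = notify_dm
        self.silence_since: Optional[float] = None
        self.last_pitch = "falling"

    def on_speech(self, pitch_label: str) -> None:
        self.last_pitch = pitch_label        # 'rising' or 'falling'
        self.silence_since = None

    def on_silence_frame(self, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        if self.silence_since is None:
            self.silence_since = now
            return
        timeout = (TIMEOUT_AFTER_RISING_S if self.last_pitch == "rising"
                   else TIMEOUT_AFTER_FALLING_S)
        if now - self.silence_since >= timeout:
            # e.g. a rising-pitch timeout feeds the continuer-questioning rule in the DM
            self.notify_dm(f"timeout_{self.last_pitch}")
            self.silence_since = None

# Usage with injected timestamps (seconds):
ft = FloorTracker(notify_dm=print)
ft.on_speech("rising")
ft.on_silence_frame(now=10.00)
ft.on_silence_frame(now=10.25)   # prints 'timeout_rising'
```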
Natural language understanding, finally, is performed by a unification-based semantic composer, which builds simple semantic representations out of the lexical entries for the recognised words; and a resolver, which matches these representations against knowledge of the objects in the domain.
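The sketch below shows the general idea of such a composer and resolver on simple attribute-value frames; the lexicon, the feature names and the unification strategy are illustrative assumptions, not the system’s actual representations.

```python
from typing import Dict, List, Optional

Frame = Dict[str, str]

# Illustrative lexicon: each word contributes a partial semantic frame.
LEXICON: Dict[str, Frame] = {
    "take":   {"action": "take"},
    "delete": {"action": "delete"},
    "red":    {"colour": "red"},
    "cross":  {"shape": "cross"},
    "piece":  {"type": "piece"},
}

# Illustrative domain knowledge: the pieces currently on the screen.
PIECES: List[Frame] = [
    {"id": "p1", "colour": "red", "shape": "cross", "type": "piece"},
    {"id": "p2", "colour": "blue", "shape": "cross", "type": "piece"},
]

def unify(a: Frame, b: Frame) -> Optional[Frame]:
    """Combine two frames; fail on conflicting values for the same feature."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged

def compose(words: List[str]) -> Frame:
    """Build one semantic representation from the recognised words."""
    frame: Frame = {}
    for word in words:
        result = unify(frame, LEXICON.get(word, {}))
        frame = result if result is not None else frame
    return frame

def resolve(frame: Frame) -> List[Frame]:
    """Return the domain objects compatible with the (possibly partial) description."""
    constraints = {k: v for k, v in frame.items() if k != "action"}
    return [p for p in PIECES if all(p.get(k) == v for k, v in constraints.items())]

print(resolve(compose(["take", "the", "red", "cross"])))  # -> the red cross piece
print(len(resolve(compose(["take", "the", "cross"]))))    # -> 2, still ambiguous
```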
The DM reacts to input from three sides: semantic material coming from the NLU, floor state signals from the floor tracker, and notifications about execution of actions from the action manager.

The central element of the information state used in the dialogue manager is what we call the iQUD (for incremental Question under Discussion, as it’s a variant of the QUD of (Ginzburg, 1996)). Figure 1 gives an example. The iQUD collects all relevant sub-questions into one structure, which also records what the relevant non-linguistic actions are (RNLAs; more on this in a second, but see also (Buß and Schlangen, 2010), where we’ve sketched this approach before), and what the grounding status is of that sub-question.
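A minimal sketch of how such an information state could be represented is given below; the field names, slot numbers and example sub-questions are illustrative assumptions based on the description above and on the walkthrough that follows.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class SubQuestion:
    """One slot of the iQUD: a sub-question, its RNLAs and its grounding status."""
    slot: int
    question: str                  # e.g. "which piece should be taken?"
    rnlas: List[str]               # relevant non-linguistic actions
    grounded: bool = False         # has the answer been grounded?
    answer: Optional[dict] = None

@dataclass
class IQUD:
    """Incremental Question under Discussion: all open sub-questions in one structure."""
    slots: Dict[int, SubQuestion] = field(default_factory=dict)

    def add(self, sq: SubQuestion) -> None:
        self.slots[sq.slot] = sq

    def open_slots(self) -> List[SubQuestion]:
        return [sq for sq in self.slots.values() if not sq.grounded]

# Illustrative state after the system has asked "what shall I do now?":
# two anticipated replies, a take request or a delete request.
iqud = IQUD()
iqud.add(SubQuestion(1, "take which piece?", rnlas=["move_cursor_to_piece"]))
iqud.add(SubQuestion(10, "delete which piece?", rnlas=["move_cursor_to_piece"]))
print([sq.slot for sq in iqud.open_slots()])
```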
Let’s go through example (2). The iQUD in Figure 1 represents the state after the system has asked “what shall I do now?”. The system anticipates two alternative replies, a take request, or a delete request; this is what the specification of the slot value in 1 and 10 in the iQUD indicates. Now the user starts to speak and produces what is shown in line 1 in the example. The floor tracker reacts to the rising pitch and to the silence of appropriate length, and notifies the dialogue manager. In the meantime, the DM has received updates from the NLU module, has checked for each update whether it is relevant to a sub-question on the iQUD, and if so, whether it resolves it. In this situation, the material was relevant to both 4 and 13, but did not resolve them. This is a precondition for the continuer-questioning rule, which is triggered by the signal from the floor tracker. The system then back-channels as in the example, indicating acoustic understanding (Clark’s level 2), but failure to operate on the understanding (level 3). (As an aside, we found that it is far from trivial to find the right wording for this prompt.)

The user then indeed produces more material, which together with the previously given information resolves the question. This is where the RNLAs come in: when a sub-question is resolved, the DM looks into the field for RNLAs, and if there are any, puts them up for execution to the action manager. In our case, slots 4 and 13 are both applicable, but as they have compatible RNLAs, this does not cause a conflict. When the action has been performed, a new question is accommodated (not shown here), which can be paraphrased as “was the understanding displayed through this action correct?”. This is what allows the user reply in line 3 to be integrated, which otherwise would need to be ignored, or even worse, would confuse a dialogue system. A relevant continuation, on the other hand, would also have resolved the question. We consider this modelling of grounding effects of actions an important feature of our approach.

Similar rules handle other floor tracker events; they are not elaborated here for reasons of space. In our current prototype the rules are hard-coded, but we are preparing a version where rules and information-states can be specified externally.
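To make the walkthrough more concrete, here is a sketch of the two DM reactions described above: the continuer-questioning rule (relevant but unresolved material plus a rising-pitch timeout) and the execution of RNLAs once a sub-question is resolved. The rule name is from the text; the state fields, signal names and actions are illustrative assumptions.

```python
from typing import Callable, Dict, List

# Minimal state for one iQUD slot (cf. the sketch above).
def make_slot(rnlas: List[str]) -> Dict:
    return {"relevant_material": [], "resolved": False, "rnlas": rnlas}

def on_nlu_update(slot: Dict, material: Dict, resolves: bool) -> None:
    """Record relevant material; mark the sub-question resolved if it now is."""
    slot["relevant_material"].append(material)
    slot["resolved"] = slot["resolved"] or resolves

def on_floor_signal(slot: Dict, signal: str,
                    backchannel: Callable[[], None],
                    execute: Callable[[str], None]) -> None:
    """Continuer-questioning rule and RNLA execution, as in the walkthrough."""
    if signal == "timeout_rising" and slot["relevant_material"] and not slot["resolved"]:
        # relevant but unresolved material plus trial intonation: back-channel,
        # signalling level-2 (acoustic) understanding but not yet level-3
        backchannel()
    elif slot["resolved"]:
        for action in slot["rnlas"]:
            execute(action)
        # afterwards a new question would be accommodated: "was the
        # understanding displayed through this action correct?"

# Usage: an ambiguous description with trial intonation, then more material.
slot = make_slot(rnlas=["move_cursor_to_piece"])
on_nlu_update(slot, {"shape": "cross"}, resolves=False)
on_floor_signal(slot, "timeout_rising",
                backchannel=lambda: print("system: (back-channel prompt)"),
                execute=print)
on_nlu_update(slot, {"colour": "red"}, resolves=True)
on_floor_signal(slot, "timeout_falling",
                backchannel=lambda: print("system: (back-channel prompt)"),
                execute=print)               # prints the RNLA to execute
```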
5 Evaluation

Evaluating the contribution of one of the many modules in an SDS is notoriously difficult (Walker et al., 1998). To be able to focus on the evaluation of the incremental dialogue strategies and avoid interference from ASR problems (and more technical problems; our system is still somewhat fragile), we opted for an overhearer evaluation. (Such a setting was also used for the test of the incremental system described in (Aist et al., 2007).)

We implemented a non-incremental version of the system that does not give non-linguistic feedback during user utterances and has only one, fixed, timeout of 800ms (comparable to typical settings in commercial dialogue systems). Two of the authors then recorded 30 minutes of interactions with the two versions of the system. We then identified and discarded “outlier” interactions, i.e. those with technical problems, or where recognition problems were so severe that a non-understanding state was entered repeatedly. These criteria were meant to be fair to both versions of the system, and indeed we excluded similar numbers of failed interactions from both versions (around 10% of interactions in total).

We measured the length of interactions in the two sets, and found that the interactions in the incremental setting were significantly shorter (t-test, p < 0.005). This was to be expected, of course, as the incremental strategies allow faster reactions (execution time can be folded into the user utterance); other outcomes would have been possible, though, if the incremental version had systematically caused more misunderstandings.

We then had 8 subjects (university students, not involved in the research) watch and directly judge (questionnaire, Likert-scale replies to questions about human-likeness, helpfulness, and reactivity) 34 randomly selected interactions from either condition. Human-likeness and reactivity were judged significantly higher for the incremental version (Wilcoxon rank-sum test; p < 0.05 and p < 0.005, respectively), while there was no effect for helpfulness.
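For completeness, the two significance tests mentioned above could be run as follows; the data arrays are placeholders, not the measurements from the study.

```python
from scipy import stats

# Placeholder data, not the study's measurements:
# interaction lengths in seconds for the two conditions ...
lengths_incremental = [41.2, 38.5, 44.0, 36.8, 40.1]
lengths_baseline = [55.3, 49.8, 61.0, 52.4, 58.7]
# ... and Likert-scale judgements (e.g. human-likeness) from the overhearers.
ratings_incremental = [5, 4, 5, 4, 5, 4]
ratings_baseline = [3, 4, 3, 2, 3, 3]

# Interaction length: two-sample t-test.
t_stat, p_length = stats.ttest_ind(lengths_incremental, lengths_baseline)

# Ordinal Likert judgements: Wilcoxon rank-sum test.
z_stat, p_rating = stats.ranksums(ratings_incremental, ratings_baseline)

print(f"length: t = {t_stat:.2f}, p = {p_length:.4f}")
print(f"human-likeness: z = {z_stat:.2f}, p = {p_rating:.4f}")
```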
6 Conclusions

We described our incremental micro-domain dialogue system, which is capable of reacting to subtle signals from the user about expected feedback, and is able to produce overlapping non-linguistic actions, modelling their effect as displays of understanding. Interactions with the system were judged by overhearers to be more human-like and reactive than those with a non-incremental variant. We are currently working on extending and generalising our approach to incremental dialogue management.
Acknowledgments Funded by an ENP grant from DFG.
References

Gregory Aist, James Allen, Ellen Campana, Carlos Gomez Gallo, Scott Stoness, Mary Swift, and Michael K. Tanenhaus. 2007. Incremental understanding in human-computer dialogue and experimental evidence for advantages over nonincremental methods. In Proceedings of Decalog (Semdial 2007), Trento, Italy.

Timo Baumann, Michaela Atterer, and David Schlangen. 2009. Assessing and improving the performance of speech recognition for incremental systems. In Proceedings of NAACL-HLT 2009, Boulder, Colorado, USA.

Okko Buß and David Schlangen. 2010. Modelling sub-utterance phenomena in spoken dialogue systems. In Proceedings of Semdial 2010 (“Pozdial”), pages 33–41, Poznan, Poland, June.

Herbert H. Clark. 1996. Using Language. Cambridge University Press, Cambridge.

Alain de Cheveigné and Hideki Kawahara. 2002. YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America, 111(4):1917–1930.

David DeVault, Natalia Kariaeva, Anubha Kothari, Iris Oved, and Matthew Stone. 2005. An information-state approach to collaborative reference. In Short Papers, ACL 2005, Michigan, USA, June.

Jonathan Ginzburg. 1996. Interrogatives: Questions, facts and dialogue. In Shalom Lappin, editor, The Handbook of Contemporary Semantic Theory. Blackwell, Oxford.

Peter A. Heeman and Graeme Hirst. 1995. Collaborating on referring expressions. Computational Linguistics, 21(3):351–382.

Massimo Poesio and Hannes Rieser. 2010. Completions, coordination, and alignment in dialogue. Dialogue and Discourse.

Matthew Purver, Christine Howes, Eleni Gregoromichelaki, and Patrick G. T. Healey. 2009. Split utterances in dialogue: a corpus study. In Proceedings of SIGDIAL 2009, pages 262–271, London, UK, September.

Harvey Sacks and Emanuel A. Schegloff. 1979. Two preferences in the organization of reference to persons in conversation and their interaction. In George Psathas, editor, Everyday Language: Studies in Ethnomethodology, pages 15–21. Irvington Publishers, Inc., New York, NY, USA.

David Schlangen and Gabriel Skantze. 2009. A general, abstract model of incremental dialogue processing. In Proceedings of EACL 2009, pages 710–718, Athens, Greece.

David Schlangen, Timo Baumann, Hendrik Buschmeier, Okko Buß, Stefan Kopp, Gabriel Skantze, and Ramin Yaghoubzadeh. 2010. Middleware for incremental processing in conversational agents. In Proceedings of SIGDIAL 2010, Tokyo, Japan, September.

Gabriel Skantze and David Schlangen. 2009. Incremental dialogue processing in a micro-domain. In Proceedings of EACL 2009, pages 745–753, Athens, Greece.

Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, and Alicia Abella. 1998. Evaluating spoken dialogue agents with PARADISE: Two case studies. Computer Speech and Language, 12(3).

Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, and Joe Woelfel. 2004. Sphinx-4: A flexible open source framework for speech recognition. Technical report, Sun Microsystems Inc.