Human Computer Dialogue System: 2019-GCUF-04303 Supervisor: Prof. Kashif Ali
BY
ASAD MAJEED
2019-GCUF-04303
MASTER OF SCIENCE
YEAR 2021
DECLARATION
The work reported in this thesis was carried out by me under the supervision of Professor
Kashif Ali, Assistant Professor, Department of Computer Science, GC University Faisalabad,
Pakistan.
I hereby declare that the contents of the thesis titled "HUMAN COMPUTER DIALOGUE SYSTEM"
are the result of my own research and that no part has been copied from any published
source (except references and standard scientific or genetic models/equations/protocols,
etc.). I also declare that this work has not been submitted for any other degree or
certificate. If the information provided is found to be inaccurate at any stage, the
university may take action.
…………………………………..
ROLL # 1678
CERTIFICATE BY SUPERVISORY COMMITTEE
We certify that the contents and form of the thesis submitted by Asad Majeed, Registration No.
2019-GCUF-04303, have been found satisfactory and in accordance with the prescribed format.
We recommend it to be processed for evaluation by the external examiner for the
award of the degree.
Signature of supervisor
………………………………………………
Name: ……………………...……………….
Designation with stamp……………………..
Chairperson
Signature with stamp………………………..
Table of Contents
DECLARATION.......................................................................................................................2
Chapter 1....................................................................................................................................1
INTRODUCTION......................................................................................................................1
Dialogue System.....................................................................................................................3
Rule-based method.................................................................................3
Sequence-to-Sequence..............................................................................5
Reinforcement Learning Method.....................................................................6
Chapter 2....................................................................................................................................7
Review of literature....................................................................................................................7
Transfer effects.....................................................................................................................11
Research hypotheses.............................................................................................................13
Chapter 3..................................................................................................................................15
Research Methodology.............................................................................................................15
Experimentation...................................................................................................................17
Task Initiative.......................................................................................................................19
Negotiation...........................................................................................................................19
Mathematical Analysis.........................................................................................................20
Computational Implementation............................................................................................20
Speaker Utterance.................................................................................................................21
Participants...........................................................................................................................23
Materials...............................................................................................................................24
Dialogue scenarios................................................................................................................26
Satisfaction questionnaire.....................................................................................................28
Procedure..............................................................................................................................28
Design...................................................................................................................................28
Dependent variables.............................................................................................................29
Discourse structure...............................................................................................................29
Chapter 4..................................................................................................................................31
Missing data..........................................................................................................................31
Preliminary analyses.............................................................................................................31
Discourse structure...............................................................................................................38
Challenges............................................................................................................................39
General discussion................................................................................................................40
Usefulness evaluation...........................................................................................................51
Localizability........................................................................................................................52
Humanness Evaluation.........................................................................................................55
Chapter 5..................................................................................................................................62
Conclusions..............................................................................................................................62
References................................................................................................................................63
Abstract
Human-Computer dialogue systems provide a natural-language-based interface between
humans and computers. They are in wide demand for network information services,
intelligent companion robots, and so on. A Human-Computer dialogue system typically
consists of three parts, namely Natural Language Understanding (NLU), Dialogue
Management (DM) and Natural Language Generation (NLG). Each part has several different
subtasks. Each subtask has received considerable attention, and many improvements have
been achieved on each of them. However, systems built in the traditional pipeline way,
where the different subtasks are assembled sequentially, suffer from problems such as
error accumulation and propagation, and difficulty in domain transfer. Therefore,
research on jointly modeling several subtasks within one part, or across different
parts, has advanced greatly in recent years, especially with the rapid development of
joint models based on deep neural networks. There is even some work aiming to integrate
all subtasks of a dialogue system into a single model, namely end-to-end models.
This paper first introduces two basic frames of current dialogue systems and gives a brief
survey of recent advances on a variety of subtasks, and then focuses on joint models for
multiple dialogue subtasks. We review several different joint models, including the
integration of several subtasks inside NLU or NLG, joint modeling across NLG and DM, and
joint modeling through NLU, DM and NLG. Both the advantages and the problems of these joint
models are discussed. We consider that joint models, or end-to-end models, will be one
important trend in the development of dialogue systems. The most fundamental communication
mechanism for interaction is dialogue, involving speech, gesture, and semantic and pragmatic
knowledge. Various research efforts on dialogue management have been conducted, focusing on
standardized models for goal-oriented applications using machine learning and deep learning
models. The paper presents an overview of existing methods for dialogue manager training,
along with their advantages and limitations. Furthermore, a new image-based method is applied
to the Facebook bAbI Task 1 dataset in an Out-Of-Vocabulary (OOV) setting. The results show
that treating the dialogue as an image performs well and helps the dialogue manager
generalize to out-of-vocabulary dialogue tasks.
Chapter 1
INTRODUCTION
The Information and Communication Technology (ICT) with which we interact in daily life
is increasingly distributed and embodied in the environment (the so-called intelligent
space). Especially when designing ICT solutions for elderly people, who are often critical
of new technology, a distributed system can be even more challenging. To improve
Human Computer Interaction (HCI) with ICT solutions, direct natural interaction and
emotional intelligence are very important.
User studies of elder behavior change over the ageing process have identified that "a skill
that many elderly people retain, even with significant cognitive degradation, is the ability
to communicate in a multimodal face-to-face fashion. The skills for this type of interaction
are acquired in infancy and early childhood and comprise tacit, crystallized knowledge in
older adulthood". Face-to-face interaction incorporates a wide range of non-verbal and
paraverbal ways to carry semantic content complementary to the speech. It allows persons
with disabilities to compensate for some perception channels (e.g., hearing) by utilizing
other channels of communication (e.g., body gestures). Face-to-face dialog is also
characterized by well-established repair mechanisms of understanding, enabling the listener
to request a repetition or clarification from the speaker (e.g., communicated by a head nod).
Moreover, face-to-face dialog has built-in mechanisms for constraining the interactants'
focus of attention. This focus is important, as some elderly people have difficulty dividing
their attention or handling distractions.
This kind of face-to-face interaction can be provided using avatars. Avatars have the
potential to personify the technology in use and thus increase the acceptance of the
software. Interaction with avatars can provide multiple advantages. Avatars can, for
instance, provide gestures, which in turn can increase the understanding of the
presented information. Furthermore, the visual enrichment of verbal information, i.e.,
adding a lip-synched animated character to audio speech output, can increase intelligibility
and enhance the robustness of the information transmission, as known from natural speech.
Therefore, consistency between the visual and vocal output is of utmost importance.
Dialogue-based systems are of major interest in human-machine interaction for multimodal
interfaces and are widely preferred for Natural and Spoken Language Understanding (NLU or
SLU), Natural Language Generation (NLG) and Dialogue Processing (DP) tasks. A complete
dialogue solution, as shown in Figure 1, consists of many parts, each of which specializes
in a certain task. The Automatic Speech Recognition (ASR) module is responsible for
converting the user's spoken utterance into text. The Natural Language Interpreter converts
the textual information into meaningful features so that the Dialogue State Tracker (DST)
can process these features and update the current dialogue state. The DST outputs the
current dialogue state so that the Dialogue Response Selection (DRS) module (which is
trained to output a response to the user utterance) can generate a textual reply to the
user. This textual reply is then converted to speech by a text-to-speech (TTS) synthesizer.
Since ASR and TTS are not directly related to the dialogue manager, they can be considered
complementary modules of the complete dialogue solution.
[Figure 1: Dialogue System — pipeline from Automatic Speech Recognizer to Natural Language
Interpreter, Dialogue State Tracker, Dialogue Response Selection, Natural Language Generator,
and Text-to-Speech.]
The figure shows how a conversation passes through the system step by step: the user's
speech is recognized and interpreted, the dialogue state is updated, a response is selected
and generated, and the reply is finally spoken back to the user. This illustrates the basic
dialogue process.
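To make the pipeline concrete, here is a minimal Python sketch of the flow just described.
It is an illustration only: every function is a hypothetical stub standing in for a real
ASR/NLU/DST/DRS/TTS component, not part of any specific library.

    # Minimal sketch of the dialogue pipeline in Figure 1. All components are
    # hypothetical stubs; a real system would plug in actual models here.

    def asr(audio: bytes) -> str:
        """Automatic Speech Recognition: spoken utterance -> text."""
        raise NotImplementedError

    def interpret(text: str) -> dict:
        """Natural Language Interpreter: text -> meaningful features."""
        raise NotImplementedError

    def update_state(state: dict, features: dict) -> dict:
        """Dialogue State Tracker: fold new features into the dialogue state."""
        return {**state, **features}

    def select_response(state: dict) -> str:
        """Dialogue Response Selection: current state -> textual reply."""
        raise NotImplementedError

    def tts(text: str) -> bytes:
        """Text-to-Speech: textual reply -> audio."""
        raise NotImplementedError

    def dialogue_turn(audio_in: bytes, state: dict) -> tuple[bytes, dict]:
        """One full turn: speech in, speech out, with the state carried along."""
        text = asr(audio_in)
        state = update_state(state, interpret(text))
        return tts(select_response(state)), state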
Rule-based method
The first chatbot developed using a rule-based system was ELIZA, which uses pattern
matching on user replies. In rule-based systems, human dialogues are modeled as a set of
states, and the dialogue manager has to choose replies for the conversation from a given
set of rules. This model has been used in many different applications, such as restaurant
booking or online psychological therapy chatbots. In this method, a human, who is usually
a domain expert, analyzes the dialogue flow between human agents and tries to come up with
predefined dialogue states and possible replies for each state based on patterns. The
advantage of rule-based methods is that dialogue managers have control over selecting
replies for the conversation, and these replies, selected from the full set of replies,
ensure that the user is not upset or offended, thus keeping the system consistent.
NADIA is a recent example of such a system, where an expert can define a dialogue structure
(in .xml format), as seen on the left side of figure 3. In NADIA dialogue systems, the
expert can define a structure with questions and answers, and the dialogue manager uses
these hard-coded rules in order to engage in a conversation. On the right side of figure 3,
a sample conversation between NADIA and a user can be seen, which consists of answering the
user's questions while making a trip reservation. If the real human dialogue flow does not
have many different states and/or replies, rule-based systems usually outperform machine
learning models [21]. However, most of the time in real life, human language can become
very complex, and it is very easy to run out of the dialogue states designed by the expert.
In such scenarios, the only option left to a rule-based model is to give the user a generic
answer such as "I don't know what you are asking", which may frustrate the user after a
certain amount of time.
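As a concrete illustration of the approach described above, the sketch below implements a
toy rule-based dialogue manager for restaurant booking: expert-written states,
pattern-triggered replies, and the generic fallback mentioned in the text. It is my own
minimal example, not the NADIA system.

    import re

    # Toy rule-based dialogue manager: each state maps to expert-written rules
    # of the form (pattern, reply, next_state); unmatched input gets a fallback.
    RULES = {
        "greet": [
            (r"\b(book|reserve)\b.*\btable\b", "For how many people?", "party_size"),
            (r"\b(hello|hi)\b", "Hello! I can book a restaurant table for you.", "greet"),
        ],
        "party_size": [
            (r"\b\d+\b", "At what time would you like the table?", "time"),
        ],
        "time": [
            (r"\b\d{1,2}(:\d{2})?\s*(am|pm)?\b", "Your table is booked. Anything else?", "greet"),
        ],
    }

    def reply(state: str, utterance: str) -> tuple[str, str]:
        for pattern, answer, next_state in RULES.get(state, []):
            if re.search(pattern, utterance, re.IGNORECASE):
                return answer, next_state
        return "I don't know what you are asking.", state  # generic fallback

    state = "greet"
    for turn in ["Hi", "I want to book a table", "4 people", "7:30 pm"]:
        answer, state = reply(state, turn)
        print(f"User: {turn}\nBot:  {answer}")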
Sequence-to-Sequence
[Figure 2: Sequence-to-sequence dialogue model — an LSTM encoder reads the input turn and an
LSTM decoder generates the reply, e.g. "I am fine <EOL>".]
This figure shows how a dialogue system can work with the sequence-to-sequence method. The
LSTM encoder takes either the full history or the last reply and converts it into an encoded
feature vector; the LSTM decoder takes this vector and outputs a possible reply conditioned
on the encoded feature vector.
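The following PyTorch sketch shows the structure the figure describes: one LSTM encodes the
input turn into a feature vector, and a second LSTM decodes a reply conditioned on it. The
vocabulary size, dimensions, and token ids are assumed for illustration; this is a minimal
model skeleton, not the implementation evaluated in this thesis.

    import torch
    import torch.nn as nn

    class Seq2SeqDialogue(nn.Module):
        """Minimal LSTM encoder-decoder for dialogue, as in Figure 2."""
        def __init__(self, vocab_size=1000, embed_dim=64, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.encoder = nn.LSTM(embed_dim, hidden, batch_first=True)
            self.decoder = nn.LSTM(embed_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, src_ids, tgt_ids):
            # Encode the user turn (or full history) into a feature vector (h, c).
            _, state = self.encoder(self.embed(src_ids))
            # Decode the reply conditioned on that encoded state (teacher forcing).
            dec_out, _ = self.decoder(self.embed(tgt_ids), state)
            return self.out(dec_out)  # logits over the vocabulary per position

    # e.g., batch of 1: "how are you" -> "i am fine <EOL>" (token ids illustrative)
    src = torch.tensor([[11, 12, 13]])
    tgt = torch.tensor([[1, 21, 22, 23]])  # starts with <BOS> = 1
    logits = Seq2SeqDialogue()(src, tgt)
    print(logits.shape)  # torch.Size([1, 4, 1000])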
Reinforcement Learning Method
[Figure 3: Reinforcement learning for dialogue — two agents exchange encoded and decoded
turns, e.g. "How old are you?" → "I'm 16".]
In this method, reinforcement learning is applied. Person A asks Person B about his age, and
Person B tells his age. Person A, not sure about the answer, makes his decision after
observing Person B; the outcome of the exchange serves as the reward signal that drives
learning. This is called the reinforcement learning method.
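The figure does not name a specific algorithm, so as a hedged illustration the sketch below
uses a REINFORCE-style policy-gradient update, a common way to train such a dialogue agent:
the agent samples a reply, receives a reward from the exchange, and reinforces replies that
earned high reward. All sizes and the reward are made up for the example.

    import torch
    import torch.nn as nn

    # Toy policy: scores a small set of candidate replies given a dialogue-state
    # vector. A reward from the exchange reinforces the sampled reply (REINFORCE).
    policy = nn.Linear(8, 4)           # 8-dim state -> 4 candidate replies
    opt = torch.optim.SGD(policy.parameters(), lr=0.01)

    def rl_step(state_vec, reward):
        probs = torch.softmax(policy(state_vec), dim=-1)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()                  # sample a reply
        loss = -dist.log_prob(action) * reward  # reinforce high-reward replies
        opt.zero_grad(); loss.backward(); opt.step()
        return int(action)

    state = torch.randn(8)
    chosen = rl_step(state, reward=1.0)  # e.g., +1 if the user confirms the answer
    print("chosen reply index:", chosen)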
Chapter 2
Review of literature
Studies of interactive dialogues
Only a few studies have compared spoken dialogue interactions (speaking and listening) with
the corresponding written mode (typing at a keyboard, reading on a screen) (Oviatt, 1995;
Oviatt, Cohen, & Wang, 1994; Zoltan-Ford, 1991). Empirical studies have used a number of
experimental approaches: interactive human–human dialogues via a system in both the
written and spoken modes; simulated human– computer dialogues in the spoken mode, the
written mode, or both; real-system-based human–computer dialogues in the spoken and
written modes. More precisely, in the HCI literature, interactive discourse has primarily been
studied in the form of task-oriented dialogue. Like any other task-oriented activity, task-
oriented dialogue is guided by a goal, i.e. the accomplishment of a task (Falzon, 1991;
Falzon, Amalberti, & Carbonnel, 1986). Studies have tended to focus on two types of
dialogue: (a) dialogues with another individual via a system in order to accomplish a task, i.e.
for the most part Computer Mediated Communication (CMC) (Brennan, 1998; Whittaker,
2003); or (b) direct dialogues with a system in natural language or in a restricted language
(e.g., keywords) in order to accomplish a task, i.e. human–computer dialogues. Interagent
dialogues will not be considered here.
Researchers have often grounded their work on shared, non-exclusive theoretical bases such
as (1) the theory of language acts and pragmatics, (2) (socio) cognitive theories, or (3)
psycholinguistic theories. In brief, philosophers of language have studied conversation by
focusing on the partners' communicative intent (and the way this intent is perceived). The
production of a
statement in a dialogue context is conditioned by the realization of language acts and the
recognition of underlying intentions (Austin, 1970; Searle, 1969). At another level,
sociolinguists have suggested analyzing conversations via the study of the way turn-taking is
organized (Sacks, Schegloff, & Jefferson, 1974). Sacks et al. have shown that conversations
are governed by rules which determine, for example, turn-taking within the conversation.
Using a precise methodology, they also identified the existence of cues (e.g., intonations,
overlapping) which indicate to the other partner the transitions at which he or she can speak.
Clark and co-workers (Clark, 1996; Clark & Schaefer, 1987; Clark & Wilkes-Gibbs, 1986)
have combined these two perspectives to propose a (socio)cognitive model of human
communication. Their idea is that the partners work together to construct reciprocal
understanding by constructing a shared communicative field (or grounding), i.e. a
representation of their prior knowledge, the situation, and their beliefs. This co-adaptation of
the participants in the conversation has been characterized, in particular, by the convergence
of the use of lexical items and the syntactic structure of speech by the partners (Brennan &
Clark, 1996; Clark & Schaefer, 1987, 1989; Fussell & Krauss, 1992; Isaacs & Clark, 1987;
Krauss & Fussell, 1991). Thus the partners share the responsibility for cooperating in the
conversation and make an effort to understand one another (Clark & Wilkes-Gibbs, 1986).
Grounding has been defined as an active process of cooperation that permits conversational
exchange and the construction of a shared referent (Clark & Brennan, 1991).
Psycholinguistic research has studied language specific processes such as verbal production
(e.g., Levelt, 1989) or comprehension (e.g., for text comprehension, Kintsch, 1998), but
rarely within the framework of spoken interactive discourse (Garrod & Pickering, 2004).
Finally, whatever the theoretical positions, researchers have tended to use similar measures
(e.g., the mean length of utterances, the number of words – tokens – the ratio of one type of
occurrence to the total number of words – type token ratio or TTR).
Within these different conceptions, human–computer dialogue has very frequently been
studied using the Wizard of Oz (Woz) technique. This technique consists of making
individuals believe that they are interacting with a system whereas, in fact, the messages they
receive are being sent by a human. Although the method is a very good compromise for the
study of human–computer dialogues, it is subject to some inherent limitations. For example,
in the case of Woz studies involving only the degradation of the human voice, the bias of the
experimenter has often been identified as a major limitation. In semi-automatic systems, the
partner's response times are crucial for giving the impression that a real system is at work. A
brief overview of the results obtained using Woz studies will be presented below. However,
first of all, a presentation of the obtained CMC data may be useful for the study and
understanding of the conversational mechanisms involved in the written and spoken human–
computer dialogues.
In the majority of studies, two or more individuals have been interacting via a computer
system of the type ‘‘What You See Is What I See’’, i.e. a shared cooperative space in which
they can perform the task. The underlying aim of this type of research has been to compare
CMC situations
the task. The underlying aim of this type of research has been to compare CMC situations
with face-to-face communication. The major results indicate that conversation is organized
differently in the written and spoken modes. For example, Ferrara, Brunner, and Whittemore
(1991) observed the emergence of a new written dialogue style which combined the
properties of both spoken and written language. McKinlay, Procter, Masting, Woodburn, and
Arnood (1994) showed that the pauses between turns are longer in written than in spoken
discourses. Moreover, and at a more general level, problems relating to the regulation and
coherence of dialogue, characterized by an impairment (but not by the absence) of the
grounding processes have also been identified (Brennan & Ohaeri, 1999; Hancock & Dunham,
2001). The authors observed a fall-off in the number of courtesy formulations and noted that
this trend was more extreme in the written mode. Furthermore, some authors have noted that
speakers change the style they use to address one another combined with a reduction of
lexical diversity characterized by a fall in the number of copulatives, subject pronouns, and
articles (Ferrara et al., 1991; Hancock & Dunham, 2001). Most authors have attributed the
characteristics of CMC to the permanence of the written medium compared to the transience
of the spoken word (for a review on CMC, see Whittaker, 2003). Finally, on the basis of
subjective measures and performance indicators Brennan and Ohaeri (1999) have established
that face-to-face communication situations are generally richer than written interactive
situations. Some of the observed CMC results have also been found in Woz situations. The
aim of these studies has generally been to distinguish human–human dialogue from human
computer dialogue or to provide an accurate characterization of human computer dialogue.
The main results have shown that communications with a system and via a system have
certain points in common such as the adaptation to the partner (Fais, 1998; Leiser, 1989) both
in the written and spoken modes, thus giving rise to the use of certain terms (e.g., third-person
pronouns, Brennan, 1991). Nevertheless, a large number of differences have also been
demonstrated. In general, users are less verbose when interacting with a system than with a
human partner (Richards & Underwood, 1985).
This difference has been shown to take the form of a reduction in lexical diversity
characterized by the low use of anaphora, connectors, ellipsis, involvement markers
(Amalberti, Carbonnel, & Falzon, 1993; Brennan, 1991; Bubb-Lewis & Scerbo, 2002;
Pierrel, 1988). At the same time, in the spoken mode, the volume of information relevant for
the research in question has been shown to be low on the initial turn in the conversation and
then to subsequently increase as the interactions continue (Amalberti et al., 1993). Generally
speaking, the number of information items has been seen to increase while verbosity
decreases. Zoltan-Ford (1991) has reported different results, with utterance length tending to
remain stable in the oral mode while increasing over dialogues in the written mode (see also
Hauptmann & Rudnicky, 1988; Oviatt et al., 1994). Furthermore, the number of disfluencies
(e.g., hesitations, overlappings) seems to diminish during interactions with a system
compared to dialogues with a human partner (Bortfeld, Leon, Bloom, Schober, & Brennan,
2001; Oviatt, 1995). Similarly, indicators of grounding have been seen to be less frequent
when subjects interact with a machine (Brennan, 1991; Johnstone, Berry, Nguyen, & Asper, 1994).
For example, Brennan (1991) showed that a group interacting with a system produced fewer
first person subject pronouns. Johnstone et al. (1994) showed that confirmations, politeness,
and closures were not used as often in human–machine dialogue as in human–human
dialogue. Authors have found a number of different ways to explain the differences between
human–human dialogue and human–computer dialogue. Some have considered that the user's
knowledge, i.e. the speaker's model, of the system's supposed comprehension and problem-
solving abilities was erroneous (overestimated or underestimated). For example, Amalberti
et al. (1993) have suggested that individuals who only provide one item of information when
they take their first turn in the dialogue may be doubtful about the system's ability to process
more information than this. For other authors (Brennan, 1991; Bubb-Lewis & Scerbo, 2002;
Johnstone et al., 1994), the differences are primarily characterized by the absence of
grounding markers. Brennan (1998) has hypothesized that the shared context may be
impoverished in HCI situations. As far as the difference between the written and spoken
modes of interaction is concerned, a number of different interpretations have been proposed
(Oviatt, 1995; Zoltan-Ford, 1991). Zoltan-Ford (1991) discusses two explanatory hypotheses:
individuals have less confidence in the spoken mode and speaking requires less effort than
typing and is therefore less precise. Oviatt (1995) considers that the difference can be
explained by specific production levels, the underlying idea being that the planning load (in
Levelt's (1989) sense of the term) differs between the two modes of interaction.
In short, this set of results enables us to make an initial observation: There appear to be
invariants in the mechanisms involved in discourse regulation in both the written and spoken
modes. For example, individuals' real or supposed knowledge plays a decisive role in
discourse organization in human–computer dialogue interactions. Furthermore, the
representation of the partner changes as the interactions proceed, i.e. the individual adapts
lexically and syntactically (see also, the phenomenon of convergence). As interactions with a
machine or human partner proceed, the lexical and syntactical content of the discourse
becomes specialized. One of these characteristics is the disappearance of certain unnecessary
terms such as articles. This phenomenon of adaptation might initially simply be attributable
to the phenomenon of learning or familiarization, i.e. to the acquisition of knowledge
concerning the task to be performed and the procedures involved in performing it. However,
certain authors (Brennan, 1991; Johnstone et al., 1994) have illustrated the difference
between human–human and human–machine communication by pointing out the lack of
grounding in the latter type of dialogue. Thus, the interpretations of the mode-related
differences in interactions with or via a system are far from clear and different authors draw
different conclusions on this point. The interaction mode may have both direct and indirect
consequences on the production and the comprehension of utterances.
If the direct and indirect consequences can be distinguished then it should be possible to gain
a better understanding of the mechanisms underpinned by each of the modes (and, indirectly,
the modalities) of interaction during the performance of a complex cognitive activity.
The indirect consequences should provide information concerning task performance, i.e.
concerning the knowledge of the system and the task acquired by individuals during the
interaction. The
direct consequences should be specific to the mode of interaction. The study of the transfer of
knowledge previously acquired in one modality to the other should be a good way of
identifying what is directly dependent on the modality and what is dependent on other factors
such as the task or the use of the system.
Transfer effects
The transfer of knowledge acquired in human–human dialogue to human–computer dialogue
situations has often been taken as the starting point for studies of natural language use.
However, the concept of modal transfer to human–computer dialogue situations has rarely
been examined in itself. Nevertheless, with the emergence of multimodal systems and the
possibility to offer service continuity between different terminals (for example, from the
computer to the telephone or personal assistant), the full significance of this concept is
becoming clear. It is critical to know whether certain skills or items of knowledge acquired in
one mode of interaction are liable to impair the quality of interactions in another mode.
Within the field of computer environments, the transfer effect has been studied on the basis of
problem-solving situations (O'Hara & Payne, 1998, 1999), the completion of complex
cognitive activities (Guthrie, 1988), or knowledge learning situations (de Croock, van
Merriënboer, & Paas, 1998; Mayer, 1997; Moreno & Mayer, 2002; Sweller, 1998; van
Gerven, Paas, van Merriënboer, & Schmidt, 2002; van Merriënboer, Schuurman, de
Croock, & Paas, 2002). Nevertheless, the study of the transfer of earlier habits forms a well-
established tradition (James, 1929/1930). The idea is that initial training in one field may
have repercussions for subsequent training in other fields. If this action is positive then the
term ‘‘learning transfer’’ is used; if it is negative the effect is referred to as ‘‘interference’’.
The direction and scale of the effects depends on the level of learning in the first tasks as a
function of time or difficulty as well as on the similarity between the tasks. Thus in the case
of a goal-oriented activity, it is possible to speak of transfer due to the similarity of the
stimuli when the fact of having learned in one mode facilitates interaction in another mode.
Transfer studies have often revealed positive effects. That is the reason why much research
has been devoted to negative transfer (or interference). For example, if, in a problem-solving
situation, subjects learn an initial method then, when faced with a second problem which is
apparently similar to the preceding one, they may fail because they persist in vain with the
technique that previously worked instead of looking for another solution. Luchins' (1942)
traditional water jug experiment (reported in Fraisse, 1966) illustrates this phenomenon very
well. The subjects in this experiment were faced with a series of problems requiring them to
transfer a certain quantity of water in order to reach the desired volume.
The first problem acted as a demonstration, the next few problems could be solved by
applying a formula, the following problems could be solved using either the same or a
simpler method and the final problem required the use of a different method. The results
showed that the majority of the subjects applied the first method they had learned to all the
problems.
Assuming that human–computer dialogue situations (goal-oriented dialog) call upon mode-
specific skills, then the acquisition of a habit or knowledge in one mode of interaction may
have a negative effect when the activity is repeated in a different interaction mode.
Individuals should continue to apply procedures that are not suited to the new interaction
mode. This hypothesis is not new. For example, Green (1955) used a manual coordination
task to test the hypothesis that transfer should be better from a difficult to an easy task than,
from an easy to a difficult task. However, no experimental study has addressed this question
in a human–computer dialogue context.
Research hypotheses
Based on cognitive and sociocognitive theories of learning, as well as the results of CMC and
Woz studies, we formulated a number of hypotheses concerning learning and transfer effects
during human–computer natural language dialogue. In brief, the real or supposed knowledge
possessed about the partner (machine) has a major impact on the structure and organization of
the discourse. The representation of the partner changes as the interactions progress.
Furthermore, individuals do not process information in the same way in the written and
spoken modes. The hypotheses concerning performance, discourse organization and
discourse structure were as follows:
On the basis of this observation, we can predict that there will be fewer grounding indicators
in written than in spoken mode in human–computer dialogue situations. (4) Effect of transfer
on the representation of the partner, performance and discourse structure. Changing the mode
of interaction after learning in an initial mode should result in a transfer effect both at the
level of performance and at that of the discourse indicators. Despite this, transfer should be
greater for the interaction mode in which the effort involved in learning was greater. To test
these hypotheses, an experiment was conducted using an actual natural language dialogue
system within the framework of a goal-oriented activity.
Chapter 3
Research Methodology
Our goal is to produce voice interactive human computer collaborative systems. Creating
such a complex system is not done in one step. We follow a development methodology
illustrated in Figure 1. There are six stages of development: creation of an underlying model
(which is based upon previous research and observations), analytical evaluation of the model,
computational implementation of the model, simulation of the model, an implemented
demonstration system and finally full-scale development of a working system. This
development paradigm directly follows the methodology of computer systems design. At
each successive stage, the development of the underlying model is made more specific. Often
it is the case that mathematical analysis of the underlying model can only be made by
simplifying the domain model (the use of exponentially distributed arrival rates is a common
assumption in systems building). However, at the simulation stage, complexities can be
introduced that could not be modeled at an analytical level (for instance, a non-exponentially
distributed arrival rate).
[Figure 1: Development methodology — past research feeds an underlying model, which passes
through analysis and computer simulation to a demonstration system and finally a full-scale
system.]
In turn, the simulation may not truly model what
occurs in the actual domain. Thus each stage of the process brings the process closer to
realizing the full-scale system. Although the process generally moves forward through the
stages, there is feedback from later stages. For instance, during the mathematical analysis of
the model, flaws or deficiencies in the underlying model may be detected. This causes a
revision of the underlying model (which in turn may result in a revision of the analysis).
The dialogue system was implemented with several interesting features (Smith 1991; Hipp 1992):
- The Missing Axiom Theory is a driving force behind the generation of system utterances.
(Some other dialogue systems have used a similar approach to organizing dialogue (Cohen et
al. 1989; Gerlach & Horacek 1989; Quilici 1989; Young et al. 1989).)
- The system operates in four discrete dialogue modes that specify how much control the
computer will take in directing the problem-solving.
- The system maintains a dynamic user model.
- The system uses expectations to assist in speech recognition and clarification subdialogues.
- The system employs an iterative problem-solving strategy that allows it to consider new
information.
Experimentation
The system was tested on 8 subjects for a total of 141 human-computer dialogues. There are
several important contributions of the experiments:
- The performance of the system validates that the Missing Axiom Theory can be used as a
dialogue control mechanism. Users found the dialogues to be understandable and cohesive.
- The experiments also demonstrate the effects of dialogue mode on natural language discourse.
Underlying Model: The Collaborative Algorithm
The agents in human-human collaboration are individuals. Each participant is a separate
entity. The mental structures and mechanisms of one participant are not directly accessible
to the other. During collaboration the two participants satisfy goals and share this
information by some means of communication. We say effective collaboration takes place when
each participant depends on the other in solving a common goal or in solving a goal more
efficiently. It is the synergistic effect of the two problem-solvers working together that
makes the collaboration beneficial for both parties (Calistri-Yeh 1993). An overview of our
collaborative model is presented in
Figure 2. [Figure 2, "A Model of Collaboration", shows Participant A and Participant B, each
with a private plan, knowledge base (KB), and user model, linked by a dialogue channel.]
Notice that each participant has a private plan, knowledge base (KB), and user model. To
collaborate, there must also be some dialogue between the two participants. There are several
important assumptions made by the Collaborative Algorithm:
1. All knowledge in an agent’s knowledge base is "true". It follows that any information
contained in an utterance is also true. This contrasts with a more realistic environment where
agents may have false beliefs or utter false statements.
2. The communication channel is perfect. Unlike true natural language communication there
is no ambiguity nor is there information loss due to speech misrecognitions.
3. The focus of the interaction is on solving the mutual problem, not on teaching. Unlike a
tutoring environment where one agent would like the other to gain the ability to solve the
problem on its own, the agents in the Collaborative Algorithm only want to solve the top
mutual goal as quickly as possible. However, we are currently working on modifications of
the Collaborative Algorithm to model the tutoring environment more closely.
The problem-solver maintains a model of the other participant to determine what is
appropriate to request.
Dialogue Mechanisms for Conflict Resolution
The Collaborative Algorithm utilizes the Missing Axiom Theory to generate goal requests.
However, the issue of how and when goals are answered by other participants is left
unresolved by the Missing Axiom Theory. The Collaborative Algorithm provides a testbed to
determine effective
strategies for answering queries. An important issue in collaborative environments is the
concept of conflict resolution. Even though agents may be working together, there still may
exist conflicts between the two agents about which path should be taken in order to solve a
particular goal.
Task Initiative
An important conflict resolution mechanism is the specification of which participant has task
initiative. The participant with task initiative over a goal controls which decomposition of
that goal the two participants will use. We have developed several task initiative setting
algorithms (Guinn 1993a; 1994). The Continuous Mode algorithm sets the initiative level of a
participant based on a probabilistic analysis of the two participants' knowledge. Using the
Continuous Mode algorithm, the initiative levels of each agent will adapt during the problem-
solving.
Negotiation
Another strategy for conflict resolution is the usage of negotiation to determine which
decomposition the two participants will use. There are innumerable task initiative setting
algorithms as well as many possible negotiation strategies. In our collaborative model, agents
can negotiate when there is a task conflict. Negotiation takes the form of presenting evidence
for a particular path choice.
Plan Inference Assistance
Our dialogue initiative and negotiation algorithms require proper plan recognition. However,
plan inferencing can be very difficult
when an agent has limited information about the domain and limited information about the
other agent. We have found that certain utterances which we call Non-Obligatory Summaries
can assist in plan recognition. These utterances are announcements of goal results that have
not been explicitly asked for by the other participant. We utilize both mathematical analysis
of the Collaborative Algorithm and computer-computer simulation of the Collaborative
Algorithm to determine the advantages and disadvantages of different task initiative and
negotiation strategies and the usefulness of Non-Obligatory Summaries.
Mathematical Analysis
This paper concentrates on the empirical validation of the Collaborative Algorithm. However,
we briefly note some of the results from our analytical work:
- We have identified the necessary conditions to ensure the soundness and completeness of
the Collaborative Algorithm.
- We have analytically determined the effect of certain dialogue mode setting mechanisms
using simplified user models.
- We have analytically determined the effect of certain negotiation strategies using
simplified user models.
- We have analytically determined the effect of a class of utterances, non-obligatory
summaries, on dialogue efficiency.
A detailed account of the above analyses is given by Guinn (1994).
Computational Implementation
The process of implementing an algorithm is an important feedback step to the underlying
model. To create a working computational implementation of an algorithm there can be no
gaps or "handwaving" in the implementation. Every detail must be worked out. If the
underlying model is sufficiently robust, the "fleshing out" of procedures, functions and
modules should not affect the overall model. However, there are occasions when the
implementation of a seemingly simple step at the general model level turns out to have
consequences that affect the overall model. As an example, we found that the closed world
assumption of Prolog was unacceptable in an environment where participants are expected to
have incomplete knowledge.
Thus the problem-solver must be capable of handling a multi-valued logic. The Collaborative
Algorithm has been implemented on Sun workstations with a combination of Prolog and C
language routines. Most of the coding of the algorithm is in Quintus Prolog 3.1.4.
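To illustrate why the closed-world assumption is problematic here, consider a minimal sketch
(my own illustration in Python rather than the Prolog actually used): a fact absent from an
agent's knowledge base must come out as unknown — which is precisely the missing axiom that
triggers a question to the other participant — rather than as false.

    from enum import Enum

    class Truth(Enum):
        TRUE = "true"
        FALSE = "false"
        UNKNOWN = "unknown"  # a closed world would wrongly collapse this to FALSE

    def lookup(kb: dict, fact: str) -> Truth:
        # Open-world lookup: absence from the knowledge base means UNKNOWN,
        # which is what prompts a request to the other participant.
        return kb.get(fact, Truth.UNKNOWN)

    kb_a = {"suspect10_had_poison_access": Truth.TRUE}
    print(lookup(kb_a, "suspect10_had_poison_access"))  # Truth.TRUE
    print(lookup(kb_a, "suspect10_had_motive"))         # Truth.UNKNOWN -> ask partner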
An example dialogue carried out between two computer participants is given in Table 1. In
this example, task initiative for the top goal changes three times: twice when an agent
explicitly grants the other participant control and once following a negotiation.
[Table 1: Example dialogue between two computer participants investigating who murdered
Lord Dunsmore; the exchange includes questions about the poison and access to it, and
statements such as "Suspect10 had access to the poison."]
Computer Simulations
The implementation of the Collaborative Algorithm has the ability to function using either
the Continuous Mode Algorithm or the Random Mode Algorithm.
Using the simulation results, we have some predictive power in determining what dialogue
mechanisms are useful for producing more efficient joint problem-solving. The simulation
protocol (sketched in code below) is as follows:
1. A knowledge distribution is chosen for a set of experiments, i.e., how much knowledge
each participant will receive is set.
2. Knowledge is distributed between the two participants using the parameters chosen in
Step 1.
3. Using the knowledge distribution created in Step 2, the collaborators solve the problem
eight times, once for each possible combination of mode setting, negotiation and summaries.
4. Steps 2 and 3 are repeated until the desired number of simulations has been carried out
for a particular knowledge distribution.
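Under the assumption that the solver and the knowledge-distribution parameters are black
boxes (both are stubs below, with hypothetical names), the protocol can be summarized in a
short sketch:

    import itertools
    import random

    def distribute_knowledge(params: dict):
        """Step 2: split domain facts between the two participants (stub)."""
        facts = set(range(params["n_facts"]))
        share_a = set(random.sample(sorted(facts), params["facts_for_a"]))
        return share_a, facts - share_a

    def solve(kb_a, kb_b, mode_setting, negotiation, summaries):
        """Run one collaborative dialogue and return its cost, e.g. turns (stub)."""
        raise NotImplementedError

    def run_experiment(params: dict, n_simulations: int):
        for _ in range(n_simulations):                    # Step 4: repeat
            kb_a, kb_b = distribute_knowledge(params)     # Step 2
            # Step 3: eight runs, one per combination of the three mechanisms.
            for mode, neg, summ in itertools.product([False, True], repeat=3):
                solve(kb_a, kb_b, mode, neg, summ)

    # Example call (parameter values are illustrative):
    # run_experiment({"n_facts": 100, "facts_for_a": 60}, n_simulations=50)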
Participants
Fifty people took part in the experiment. The final number of participants was 48 (24 men, 24
women) with a mean age of 30 years (SD = 9.11). A major problem relating to system
operation was encountered by two of the participants. They were replaced by individuals
having the same profile (i.e. same sex, age group, academic level, familiarity with
computers). Each participant was given a 20€ voucher. The mean academic level was 13.8
years of study (SD = 2.6). The participants generally defined themselves as having an
average level of
computer skills (M = 3.9, SD = 1.3, on a five-point scale ranging from very poor to very
good). The majority of the participants used interactive on-line or telephone services only
occasionally or rarely (33 out of 48): two participants used this type of service more than
twice a week, and the rest between one and two times a week. None of them had previously
used the application ArtimisPlanResto. The participants were distributed equally into the
experimental groups.
Materials
The ArtimisPlanResto application
The ArtimisPlanResto application was used in this experiment. ArtimisPlanResto is a general
public prototype service based on an intelligent natural language dialogue system, used to
locate restaurants in Paris. The service was developed using ARTIMIS technology (Sadek,
1999; Sadek, Bretier, & Panaget, 1997; Sadek et al., 1996). Users can search on the basis of
three criteria: location of the restaurant, price, and food type. The system responds by
proposing the solutions that best match the request (for an example dialogue, see Fig. 1). A
search using ArtimisPlanResto can be subdivided into two phases: (1) a phase during which
the criteria are formulated, i.e. the user formulates a more or less precise request in
natural language with the help of the system, which asks him or her to specify the criteria
in greater
detail (if they want, users can specify multiple criteria at once).
[Fig. 1: Example of a spoken dialogue using the ‘‘PlanResto’’ application (translated from
French; note that all the search criteria can be specified together). The system greets the
user (‘‘Welcome to PlanResto […]. What would you like?’’), restates and progressively
narrows the search (a restaurant in the eleventh district, more than 10 found; an Indian
restaurant in the eleventh district, 7 found; an Indian restaurant for about 25 euros in the
eleventh district, one found, ‘‘La Ville de Jagannath’’), and at each step invites the user
to give more information or consult the solutions.]
(2) A fine-tuning phase: once users have stated their
criteria, they can consult and work through the solutions returned by the system. They can
also ask for specific information or further details concerning these solutions. Users can
switch between phases 2 and 1 at any time.
The ArtimisPlanResto service is a telephone-based voice service (voice synthesis, voice
recognition and dialogue modules). It also operates in written mode with a Web-type
interface without mouse support (text entered in a window and confirmed with the
Enter key – history displayed below the input window, see Fig. 2). The available interface
was only a test interface and did not correspond to what might be expected of a written
natural dialogue interface. Similarly, the system did not contain a spelling checker. The terms
were processed exactly as typed. For example, if Eiffel Tower was typed without capitals
then the system did not understand and the user had to reformulate the request. One very
important characteristic for the experimental validity of the study was that the content of the
system output was identical for the two interactive modes. Similarly, a user query in spoken
or written form produced precisely the same system response. Finally, the only way to
communicate with the system was to use language.
Dialogue scenarios
Information retrieval tasks were given to the users in the form of scenarios in order to place
them in a test situation. The users had to find a restaurant by means of the ArtimisPlanResto
service. Ideally, the restaurants corresponding to the search were predetermined.
Nevertheless, depending on the formulations chosen by the user or problems arising during
the interaction (e.g., voice recognition problems), the system could also propose other
solutions. Two or three criteria were given for each search. These took the form of the
location, the food type and the price of the restaurant. The location corresponded either to a
major site (e.g., close to the Eiffel Tower) or to a city district (e.g., in the first district). The
food type corresponded to the type of catering provided (e.g., gastronomic cuisine). The price
corresponded either to an exact amount (e.g., for 15 €) or to approximations (e.g., for
approximately 15€, for more than 15€). The number of possible responses varied between 1
and 6. The scenarios were presented in the form of lists of criteria. Twelve simple scenarios
were created by manipulating the number (2 or 3) and the order (6 combinations) of the
criteria, and the number of solutions. There was one possible scenario per order and number
of criteria. Moreover, the order of the scenarios was counterbalanced using multiple Latin
Squares per scenario. The scenarios were grouped together in sets of six across two
experimental sessions (Sessions 1 and 2).
Mental workload self-evaluation questionnaire
The NASA-TLX (Task Load Index, Hart & Staveland, 1988) is a weighted, bipolar,
subjective, multidimensional scale designed to evaluate the mental workload imposed by any
given task. It provides a global score based on the mean weights of six dimensions (or
sources): three of these relate to the task and three to the operator's involvement in the task.
The dimensions are (1) mental activity (mental effort), (2) time pressure (sequence of sub-
tasks), (3) effort (mental and physical work), (4) performance (effectiveness in accomplishing
the task), (5) frustration, (6) physical activity (physical effort required to accomplish the
task). An earlier version of the NASA-TLX made use of nine dimensions. However, studies
showed that certain of these were redundant or failed to provide relevant information. For
example, the ‘‘stress’’ dimension was found to be equivalent to the ‘‘frustration’’ dimension.
In general, each of the six dimensions is operationalized by a question (e.g. Prinzel, Pope,
Freeman, Scerbo, & Mikulka, 2001; Appendix A, p.55). The subjects respond using Likert-
type, 20-point scales (with 1 generally representing the lowest level of mental workload and
20 representing the highest workload: these extremes are labeled ‘‘Low’’ and ‘‘High’’
respectively, with the exception of the ‘‘performance’’ dimension which is labeled ‘‘Good’’
and ‘‘Poor’’). The mental workload index is calculated on the basis of 21 measurements.
More precisely, the NASA-TLX uses the weighted mean model calculated on the basis of six
dimensions and 15 pairwise comparisons between the dimensions. These pairwise
comparisons provide a matrix consisting of six weights corresponding to the relative
importance of the six sources in the global load.
The NASA-TLX questionnaire was modified for the purposes of our experiment, with a
number of changes being made. (1) For reasons relating to the formulation and translation of
the items, the ‘‘frustration’’ dimension was replaced by a ‘‘stress’’ dimension. (2) The
questionnaire contained 7 questions. The ‘‘performance’’ dimension was subdivided into two
questions, one of these relating to effectiveness and the other to achievement of the goal. The
means of the estimates were calculated. (3) With reference to the work performed on
cognitive load (de Croock et al., 1998; van Merrie¨nboer et al., 2002), the scales used in this
experiment contained only 9 points. (4) The mental workload index was calculated using a
simple mean model (Sato et al., 1999). After these modifications, the homogeneity of the test
for the various experimental conditions was Cronbach's α = 0.85–0.90. The homogeneity of
the original NASA-TLX measured using the test–retest method was 0.83 (Scerbo, 2001).
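To make the two index calculations concrete, here is a small sketch with made-up ratings
(the dimension names, rating values and weights are illustrative only): the standard
NASA-TLX weights each dimension by the number of times it wins its 15 pairwise comparisons,
while the modified version used in this experiment takes a simple mean.

    ratings = {"mental": 6, "temporal": 4, "effort": 5,
               "performance": 3, "frustration": 2, "physical": 1}  # 9-point scales

    # Weighted NASA-TLX: weight = number of wins over the 15 pairwise comparisons.
    wins = {"mental": 5, "temporal": 3, "effort": 4,
            "performance": 2, "frustration": 1, "physical": 0}     # weights sum to 15
    weighted_index = sum(ratings[d] * wins[d] for d in ratings) / 15

    # Simple-mean model used in this experiment (Sato et al., 1999).
    simple_index = sum(ratings.values()) / len(ratings)

    print(round(weighted_index, 2), round(simple_index, 2))  # 4.67 3.5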
Satisfaction questionnaire
A satisfaction questionnaire was drawn up on the basis of an earlier study. The original
questionnaire consisted of 25 questions. Only 11 of these were retained for the final
questionnaire (Cronbach's α = 0.81). The criteria used to choose the questions related to
satisfaction with the ease of use of the system. A 12th question assessing general
satisfaction was added to the questionnaire (Cronbach's α = 0.82).
Procedure
The participants were greeted individually in a quiet room and asked to complete an
information questionnaire. The ArtimisPlanResto service was then presented to them. They
were told that the service functioned using a natural language dialogue but were given no
further information. Next, the subjects were given a problem-solving instruction, i.e. to use
the service to find restaurants ‘‘as quickly and accurately as possible’’. After each scenario,
the subjects completed a subjective questionnaire relating to the mental workload. The
scenarios were subdivided into two sessions of six scenarios each. After each session, the
participants completed a satisfaction questionnaire. Finally, the participants were debriefed
concerning their participation. The entire experiment did not last longer than 45 min on
average. The questionnaires were presented in paper-and-pencil form. Half of the participants
performed the first six scenarios (session 1) in spoken mode (telephone) and the other half
performed them in written mode (at a PC). Half of the participants then completed the final
six scenarios (session 2) in the same modality (identical condition) and the other half in a
different modality (different).
Design
A number of independent variables were manipulated in order to study the effects of the
interaction mode, learning, and transfer. The interaction modes in session 1 (spoken vs.
written) and in session 2 (identical vs. different) were treated as between-subjects factors.
The ‘‘serial position’’ of the scenario in the session (positions 1 to 6) and ‘‘session’’ (session
1 vs. session 2) were treated as the within-subjects factors.
Dependent variables
Some dependent variables were based on the users' utterances. The utterances were
transcribed word-for-word. In the case of the voice interactions, the transcriptions were based
on recordings of the dialogues. The written and spoken dialogues were transcribed in the
same way. Other variables were gathered on the basis of the success levels and the responses
to the questionnaires. Finally, measures relating to the dialogues were gathered on the basis
of notes relating to the exchanges between the user and the system.
Discourse structure
Number of words. The total number of user words per dialogue was collected for each
participant. This provided a baseline for the calculation of the other indices. Length of
utterances. The mean number of words for each dialogue was calculated by dividing the total
number of words by the number of utterances. Hesitations and comments were excluded from
the calculation. TTR for articles. The TTR for definite (the) and indefinite (a, some) articles
was calculated for each dialogue by dividing the number of definite and indefinite articles by
the total number of words per dialogue. TTR for personal pronouns. The TTR for first person
pronouns (such as I, me, my) and third person pronouns (such as he, she, they, them) was
calculated by dividing the number of occurrences of these pronouns by the total number of
words per dialogue. Some authors (Brennan, 1991) have indicated that the function of the
first person pronouns is meta-conversational (often appearing in indirect questions which are
typically more polite) or conveys the assumption of responsibility and involvement in the
dialogue (Chafe, 1982). Literal command utterances. The level to which users and the system
shared lexicon and syntax was revealed by means of a command statement indicator. For
each dialogue, the three types of command statements ‘‘the next restaurant’’ ‘‘more
information please’’ and ‘‘consult the solutions’’ were categorized into four levels of
formulation: (1) Literal statements (presence of the three words in French forming the
commands), (2) Partial command statements (only two words), (3) Single words (one word of
the command) and (4) Reformulations. This indicator was divided by the number of turns
dedicated to commands, excluding utterances due to voice recognition or spelling errors,
for the task for
each session.
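For illustration only, the Python sketch below shows how these discourse-structure indices can be computed for one transcribed dialogue; the English word lists are assumptions standing in for the French forms actually used in the study.

# Illustrative computation of the discourse-structure indices described above.
# The article/pronoun lists are stand-ins (the study used French forms).

ARTICLES = {"the", "a", "an", "some"}        # definite + indefinite (assumed list)
FIRST_PERSON = {"i", "me", "my", "mine"}     # first-person pronouns (assumed list)

def discourse_indices(utterances):
    """utterances: list of strings, one per user turn in a dialogue."""
    tokens = [w.lower() for u in utterances for w in u.split()]
    total_words = len(tokens)
    return {
        "total_words": total_words,
        "mean_utterance_length": total_words / len(utterances),
        "ttr_articles": sum(t in ARTICLES for t in tokens) / total_words,
        "ttr_first_person": sum(t in FIRST_PERSON for t in tokens) / total_words,
    }

print(discourse_indices(["I would like a restaurant", "the next restaurant"]))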
Chapter 4
Results and discussion
Missing data and preliminary analyses
Missing data
The online data for one participant's first dialogue was lost. It was replaced by the mean data
values for the first dialogue in the corresponding mode of interaction.
Preliminary analyses
The global success level for restaurant searches was nearly 93%. The global error level,
defined as the proportion of dialogue turns due to recognition or interpretation errors, was
14% (18% on spoken utterances, 10% on written utterances). This error level corresponds to
the results obtained by Raymond, Béchet, De Mori, Damnati, and Estève (2004). These
authors calculated an error level on the basis of search criteria based on voice recognition
independently of the conduct of the dialogue. Two analyses were conducted. The first of
these related to the first six dialogues and tested hypotheses 1–3. The second related to the
entire set of dialogues distributed over the two six-dialogue sessions and investigated
hypothesis 4. The data analyses for the search criteria related only to the efficient search
utterances. The dialogue structure analyses, with the exception of the command statements
analysis, related to all turns taken during the dialogues.
Initial measures and representation of the partner
Number of relevant items of information and number of words on first turn. The mean scores
are presented in Table 1. The analysis revealed a main effect of the serial position of the
dialogue using Wilks' lambda criterion (Λ = 0.40, F(10,37) = 5.45, p < 0.0001) but no main
effect of the interaction mode (Λ = 0.97, F(2,45) = 0.69, p > 0.1). The multivariate analyses
for each of the measures taken independently revealed an effect of the serial position of the
dialogue on the number of items of information provided in the first turn (Λ = 0.59,
F(5,42) = 5.91, p < 0.001), but not on the number of words used during this first turn
(Λ = 0.90, F(5,42) = 0.93, p > 0.1).
Table 1
Means (and standard deviations) of initial measures for the first six dialogues (n = 24 per mode)

Measure / Mode                        D1            D2            D3            D4            D5            D6            Mean
Relevant information on first turn
  Spoken mode                  0.83 (0.816)  1.33 (0.917)  1.58 (0.830)  1.75 (0.944)  1.58 (0.881)  1.67 (0.761)  1.46
  Written mode                 1.38 (0.924)  1.71 (0.859)  1.96 (0.955)  1.83 (0.816)  1.92 (0.929)  2.00 (0.780)  1.80
Words on first turn
  Spoken mode                  5.83 (4.08)   6.67 (4.54)   6.83 (4.70)   7.62 (4.53)   6.50 (4.63)   6.96 (5.19)   6.74
  Written mode                 4.54 (3.74)   5.67 (4.18)   6.21 (4.20)   4.63 (3.33)   5.00 (3.78)   5.00 (3.51)   5.17
Table 2
Means (and standard deviations) of performance measures and subjective ratings for the first six dialogues (n = 24 per mode)

Measure / Mode                        D1             D2             D3             D4             D5             D6             Mean
Solution time (in seconds)
  Spoken mode                  165.7 (131.7)   99.0 (54.4)    94.7 (98.0)   104.0 (21.86)   92.1 (13.72)   76.8 (10.27)  105.4
  Written mode                 178.9 (121.8)  137.2 (139.1)   98.0 (11.42)   66.3 (6.51)    75.4 (12.00)   53.9 (5.37)   101.6
Appropriate turns
  Spoken mode                    7.50 (5.49)    4.67 (2.33)    3.88 (2.49)    4.25 (3.29)    4.25 (2.33)    3.67 (2.44)    4.70
  Written mode                   4.26 (2.25)    4.13 (3.11)    3.33 (2.10)    2.71 (1.43)    3.79 (4.35)    2.38 (1.13)    3.43
Mental workload
  Spoken mode                    4.32 (1.34)    3.58 (1.20)    3.08 (1.34)    3.26 (1.54)    3.15 (1.25)    2.68 (1.18)    3.34
  Written mode                   3.24 (1.11)    2.88 (1.15)    2.29 (1.12)    2.01 (0.83)    2.08 (1.36)    1.87 (0.74)    2.40
Satisfaction (end of session only)
  Spoken mode                    –              –              –              –              –              –              3.83 (0.51)
  Written mode                   –              –              –              –              –              –              4.40 (0.48)
More precisely, the trend analysis revealed significant linear (F(1,46) = 17.53, MSe = 0.505,
p < 0.001) and quadratic (F(1,46) = 9.90, MSe = 0.406, p < 0.01) components for the number
of relevant items of information on the first turn in the spoken mode, but only a linear
component (F(1,46) = 8.91, MSe = 0.505, p < 0.01) for the written mode. No other
comparison reached statistical significance. In other words, only the number of relevant items
of information on the first turn in the dialogue exhibited a tendency to increase during the
initial dialogues and then stabilize for the subsequent ones, at least in the spoken mode. The
number of words, by contrast, did not vary significantly.
Mental workload. The analysis revealed a main effect of interaction mode (F(1,43) = 6.59,
MSe = 4.10, p < 0.05) and of the serial position of the dialogue (F(5,215) = 3.17,
MSe = 0.877, p < 0.01). More precisely, the mental workload was greater in the spoken than
in the written mode. Moreover, the trend analysis revealed a significant linear component
both in the spoken (F(1,43) = 25.11, MSe = 1.20, p < 0.0001) and in the written mode
(F(1,43) = 12.49, MSe = 1.20, p < 0.001). No other comparison reached statistical
significance.
Table 3
Means (and standard deviations) of dialogue structure measures for the first six dialogues (n = 24 per mode)

Measure / Mode                        D1             D2             D3             D4             D5             D6             Mean
Total words
  Spoken mode                  37.58 (29.01)  28.04 (22.03)  25.50 (21.59)  28.75 (30.68)  24.79 (18.73)  22.17 (18.78)  27.81
  Written mode                 14.57 (8.04)   15.92 (13.15)  12.42 (5.41)    9.50 (4.76)   10.13 (5.41)    8.96 (3.30)   11.91
Words per turn
  Spoken mode                   4.15 (1.61)    4.57 (1.94)    5.64 (3.14)    5.26 (1.98)    5.49 (3.00)    4.89 (1.94)    5.00
  Written mode                  3.47 (2.30)    3.85 (2.82)    4.47 (3.40)    3.60 (1.81)    3.95 (2.72)    4.31 (2.92)    3.94
TTR I
  Spoken mode                  0.037 (0.033)  0.037 (0.037)  0.024 (0.038)  0.038 (0.037)  0.040 (0.040)  0.031 (0.042)  0.035
  Written mode                 0.011 (0.023)  0.004 (0.016)  0.007 (0.023)  0.006 (0.031)  0.004 (0.020)  0.005 (0.024)  0.006
TTR articles
  Spoken mode                  0.181 (0.085)  0.212 (0.088)  0.178 (0.078)  0.198 (0.047)  0.195 (0.079)  0.212 (0.076)  0.196
  Written mode                 0.099 (0.093)  0.114 (0.102)  0.115 (0.089)  0.093 (0.084)  0.092 (0.093)  0.115 (0.092)  0.105
In other words, the mental workload was greater in the spoken than in the written mode and
fell over the dialogues.
Satisfaction. The efficiency levels, solution times, mean number of appropriate turns, and
mental workload for the six dialogues were entered as covariates in the analysis. The analysis
revealed no effect of interaction mode (F(1,42) = 0.86, MSe = 0.182, p > 0.1). After six
dialogues, there was no significant difference in satisfaction between the spoken and written
modes. It should be noted that mental workload was the measure most highly correlated with
satisfaction level (R = 0.62, p < 0.0001), i.e. the more the mental workload increased, the
more the level of satisfaction fell. In other words, even if the analysis did not reveal any
significant difference, the strong correlation between satisfaction and mental workload
suggests that the two measures are interdependent. In the case of the initial dialogues,
satisfaction seems to have been determined more by the effort involved in performing the
task than by the interaction mode.
Discourse structure
All the data were log transformed to stabilize the variance for the purposes of statistical
analysis. For reasons of clarity, only untransformed data are presented in Tables 3 and 4.
Number of words. The analysis revealed an effect of interaction mode (F(1,46) = 33.18,
MSe = 0.190, p < 0.0001) and of the serial position of the dialogue (F(5,230) = 5.77,
MSe = 0.040, p < 0.0001). More precisely, the total number of words was greater in the
spoken than in the written mode. The trend analysis revealed a linear component for the
spoken mode (F(1,46) = 11.56, MSe = 0.042, p < 0.01) and the written mode
(F(1,46) = 14.59, MSe = 0.042, p < 0.0001). No other comparison reached statistical
significance. In other words, the number of words used to perform the task fell in a
comparable way as the interactions proceeded, irrespective of the modality.
Number of words per turn. The analysis revealed an effect of interaction mode
(F(1,46) = 5.27, MSe = 0.141, p < 0.05) and of the serial position of the dialogue
(F(5,230) = 3.10, MSe = 0.112, p = 0.01). More precisely, the number of words per turn was
greater in the spoken than in the written mode. The trend analysis revealed significant linear
(F(1,46) = 4.25, MSe = 0.016, p < 0.05) and quadratic components (F(1,46) = 6.01,
MSe = 0.013, p < 0.0001) in the spoken mode. No other comparison reached statistical
significance. In other words, the length of utterances increased over the dialogues in the
spoken mode but remained relatively stable in the written mode.
Table 4
Means (and standard deviations) of command statement measures for the first six dialogues
Challenges
Developers face many difficulties when building dialogue systems. These are due to the
computer's limited understanding of natural language, which raises many challenges for
developers, e.g. anaphora resolution, inference, ellipsis, pragmatics, reference resolution and
clarification, inter-sentential ellipsis, etc. [7]. Besides these language problems, other
challenges include the design of system prompts, grounding, the detection of conflicts, and
plan recognition. In spoken dialogue systems, problems related to the user's utterances also
occur, such as ill-formed utterances. These are some of the challenges that developers have
to take care of at design time.
General discussion
The aim of the experiment was to reveal learning, modality and transfer effects on
performance and dialogue structure during a complex, goal-oriented activity. Firstly, the
analyses showed that the representation of the partner changed over the dialogues (Amalberti
et al., 1993) and that it differed depending on the mode of interaction. The individuals
provided more relevant items of information during their first turn in the dialogue in the
written than in the spoken mode (Zoltan-Ford, 1991). The number of items of information
increased over dialogues, in particular in the spoken mode, to reach a maximum level.
Furthermore, the participants became increasingly concise, providing more and more
information in a number of words that remained relatively stable from one dialogue to the
next. This tendency was confirmed between sessions 1 and 2. Whatever the interaction mode
used for learning in session 1, the first utterance produced in the session 2 dialogues
contained multiple items of information and a minimum number of words. The differences
between the spoken and written modes became blurred after learning. More precisely, the
transfer of learning was identical whatever the modality. It is clear that the individuals
became aware of the system's potential and made full use of its capabilities. The implications
of results such as these are very encouraging for designers of natural language dialogue
systems. Furthermore, these results suggest that it should be possible to determine the
approximate level of familiarity with an information retrieval system that makes use of
natural language on the basis of the number of criteria stated and the number of words used in
the first dialogue turn. However, these results apply only to a service that involves a small
number of criteria and is used for information retrieval purposes. It would be interesting to
reproduce this type of experiment with a system using a larger number of search criteria and
in a different task context mobilizing different levels of user knowledge. Also, even though
the experiment was conducted with a real system, it was relatively controlled. The observed
effects might be due to familiarization with the task (information retrieval) rather than
familiarization with the use of the system. Furthermore, the data will only be relevant if the
tasks (scenarios) are sufficiently representative of real system
operation. Secondly, the analysis of the performance indicators revealed a fairly marked
learning effect and modality effect for the dialogues in session 1. The participants were more
efficient in the written than in the spoken mode. Nevertheless, the analysis did not indicate
that they reached an optimum level of efficiency any faster in the former than in the latter
mode. These results were partly corroborated by analysis 2: participant performances
continued to improve in session 2 when no change of interaction mode was introduced.
Furthermore, changing the interaction mode had a negative impact on solution times and we
even observed an increase in the number of dialogue turns required to complete the task when
participants switched from written to spoken mode. In both cases, the participants had to
adapt to the new interaction environment, with only a part of the knowledge constructed in
session 1 being transferred. In the case of the switch from the written to the spoken mode,
elements of transfer and interference could be identified simultaneously. The analysis of the
subjective ratings reveals the same patterns of results. Thus transfer was greater for the
interaction mode requiring the greater learning effort, i.e. from the spoken mode to the
written mode. There are a number of considerations that might help explain these results.
First of all, an information retrieval task is a goal-oriented activity that requires the
construction of goals and sub-goals. By definition, its completion involves a cognitive cost.
Lovett, Reder, and Lebiere (1999) have pointed out that the performance of almost any
cognitive task calls on working memory for the maintenance and retrieval of information
during processing. Because of the characteristics of the spoken mode, it was more difficult
for the participants to manage both planning of their utterances and planning of the task under
a time constraint, at least during their initial interactions with the system. Secondly, the
management of the cognitive resources demanded the establishment of priorities. In the
ACT-R model (Anderson, 1993), processing activities are dependent on the current goal. The
accessibility of the declarative and procedural knowledge varies as a function of the
experiment. Thus the focusing of the subjects' limited attentional resources on the goal
increases the accessibility of knowledge that is relevant to the goal when compared to other
knowledge (Lovett et al., 1999). The continuity of the activity took precedence over task
completion. Moreover, it is probable that only certain parts of the dialogue were responsible
for the effect, i.e. those requiring the greatest mental effort. Thirdly, expressing oneself in
speech and writing induces specific syntactic and lexical structures. Some of Zoltan-Ford's
(1991) results were reproduced concerning the length of utterances. The spoken utterances
were longer than the written utterances but, contrary to Zoltan-Ford's observations, the length
increased in the spoken mode but remained relatively stable in the written mode. The relative
stability of the number of words in the spoken mode, coupled with the reduction in the
number of turns taken between the two sessions, indicates that the users became increasingly
concise. This confirms the results obtained in WOz situations (Amalberti et al., 1993;
Brennan, 1991; Bubb-Lewis & Scerbo, 2002; Pierrel, 1988; Richards & Underwood, 1985).
Furthermore, there were fewer grounding and involvement indicators in the written than in
the spoken mode (Chafe, 1982). This result sheds new light on Brennan's (1991) analysis.
Brennan's experiment took place in the written mode, the conclusion being that involvement
and grounding indicators were less frequent in human–computer dialogue than human–
human dialogue situations. It is also possible to say that the use of involvement and
grounding indicators is dependent on the modality. Clark and Brennan (1991) have already
indicated that a number of factors, such as the interaction mode which imposes specific
constraints on the interaction, have an effect on grounding. Thus the establishment of the
common ground and involvement in the dialogue are more important in spoken mode. These
results are very encouraging for designers of dialogue systems. The active processes observed
in human–human dialogue situations are also at work in speech-based human–computer
dialogue situations (Allwood, Traum, & Jokinen, 2000). This interpretation is confirmed by
the analysis of the transfer effect in session 2. Some of the behaviors favoring grounding in
session 1 were also found in session 2. The individuals adapted to the system's lexicon and
syntactic structure when their first interaction was in spoken mode (Brennan, 1991; Fais,
1998; Leiser, 1989). Two explanatory hypotheses can be advanced to account for this. (1)
Literal restatements may characterize a more active construction of the common ground in
spoken than in written mode since, by its very nature, spoken mode should favor the
emergence of cooperative behavior. (2) Literal restatements may be characteristic of a mental
economy. More precisely, the phrases heard by the users would remain present in their
working memories (see also Garrod & Pickering, 2004, for syntactic priming; Levelt &
Kelter, 1982). The re-use of the material in this form would spare users the cost of
reformulating or of constructing new messages (Leiser, 1989). Fourthly, a cautionary
comment concerning all these results is warranted. We may have underestimated the impact
of voice recognition errors during the interaction, which is much greater than that of spelling
mistakes. For example, Baber and co-workers (Baber, Mellor, Graham, Noyes, & Tunley,
1996) demonstrated the effect of mental workload on the level of voice recognition errors.
Murray and co-workers (Murray, Jones, & Frankish, 1996b; Murray, Baber, & South, 1996a)
illustrated the effect of user stress on voice recognition. The voice recognition errors resulted
in an increase in stress and mental load which, in turn, led to an increase in the number of
voice recognition errors. In the presence of voice recognition errors, it is possible that users
employ specific error recovery procedures. If this is the case, then the effect of the interaction
mode would actually simply amount to a comparison of the written mode with a spoken
mode that contains failures. Nevertheless, the absence of a difference in satisfaction rating
between the written and spoken modes is encouraging. To summarize, the effect of the
interaction mode can be characterized in terms of direct consequences that are specific to
each of the interaction modes and indirect consequences that are specific to the task and the
management of the activity.
The permanence of the written trace and the transience of speech seem to be likely
explanations for some of the differences observed between the interaction modes. The time
required for reading and information processing resulted in a greater improvement in
performance in the written than in the spoken mode. The different nature of the management
of the activity is an indirect consequence of the effect of the interaction mode on task
completion. In contrast, the observation of typical behaviors such as the use of articles and
subject pronouns as a function of interaction mode is a direct consequence of the interaction
mode effect. As Oviatt (1995) has suggested, the syntactical arrangement involved in the
construction of grammatical sequences must differ between the written and spoken modes.
Levelt (1989) has pointed out that the order of the information is important in the macro-
planning of utterances. The time pressure present in the spoken mode does not generally
permit any large-scale control of order. In contrast, resources are distributed in a different way
in the written mode. The arguments contained in the utterances can be reorganized. Thus the
interaction mode has an indirect effect on the activity. The task and its completion are
prioritized in the written mode.
Finally, the transfer effect has not yet been studied in human–human dialogue situations to a
sufficient level to enable us to draw clear conclusions. Even in a natural language system
dialogue situation, a learning phase is required, at least for the purposes of task completion.
Furthermore, there are mode-specific differences that have to be taken into account during
system design. Nevertheless, individuals tend to adapt to the system, at least when using
speech. They are involved in an active task completion process within which they consider
the system as a partner rather than as a tool. Last of all, there is a clear transfer effect when
individuals first use the system in one interaction mode and then continue their work with it
in another mode. In order to complete and extend these results, experiments should be
reproduced using other dialogue systems and applications. Also, interaction mode-dependent
learning and transfer effects should be studied within an alternating learning context (a
speech dialogue, then a written dialogue, then a speech dialogue, etc.).
Jointly modeling multiple subtasks
Traditionally, dialogue systems were built in a pipeline way. Models for each subtask were
built separately and then assembled into a whole system. Pipeline systems are conceptually
clear: each part focuses on its own problems, and each model is developed independently.
But there are also some limitations to pipeline systems. Firstly, a pipeline cannot make use
of the interaction information between its different parts. There are significant interactions
between the subtasks, and these interactions can help to improve system performance.
Taking intent identification and slot filling in NLU as an example, slot filling is helpful to
intent identification, and vice versa. In a flight booking task, if only the destination slot is
labeled in a sentence, then the intent of the sentence is very likely to be stating the
destination; conversely, if the intent of a sentence is to state the departure city, then a
departure city is very likely to occur in the sentence. If the interactions between two subtasks
can be modeled properly, this should help to promote both tasks.
There are similar situations for the other subtasks. Secondly, the models for each subtask are
trained separately in a pipeline system. This brings difficulties on two fronts. On the one
hand, developers of dialogue systems usually only get feedback from the end users, who
inform them about the final performance of the system. It is difficult to back-propagate or
assign the final error signals of the system to each subtask, and it is time-consuming and
laborious to get labeled data for each subtask. On the other hand, because it is difficult or
impossible to ensure full correctness in each subtask, errors in earlier subtasks can hurt later
subtasks. The errors may accumulate and grow through the pipeline, and even become
uncontrollable. Thirdly, the interdependencies of the subtasks in dialogue systems make
online adaptation of the systems challenging.
For example, when one module (e.g. NLU) is retrained with new data, all the others that
depend on it (e.g. DM) become sub-optimal, because they were trained on the output
distributions of the older version of the NLU module. Although the ideal solution is to
retrain the entire pipeline to ensure global optimality, this requires significant human effort.
Recent advances explore how to overcome the above limitations of pipeline systems. Joint
modeling has proven to be an effective approach. There is a lot of work on joint models,
ranging from jointly modeling subtasks within NLU, DM or NLG respectively, to jointly
modeling subtasks across NLU and DM, and even jointly modeling across NLU, DM and
NLG. Here, "joint model" or "jointly modeling" means that two or more subtasks are
implemented in a single
model or in a strongly coupled frame, the model (or frame) is trained as a whole or
simultaneously instead of subtask by subtask.
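As a minimal sketch of what such a joint model can look like (assuming PyTorch; all names, dimensions, and the pooling choice are illustrative, not taken from any cited system), a shared encoder can feed two task heads trained with one summed loss:

# Sketch of a joint intent-classification + slot-filling model: a shared
# encoder with two task heads trained as a whole via a single summed loss.
import torch
import torch.nn as nn

class JointNLU(nn.Module):
    def __init__(self, vocab_size, n_intents, n_slot_tags, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.intent_head = nn.Linear(2 * dim, n_intents)  # utterance-level task
        self.slot_head = nn.Linear(2 * dim, n_slot_tags)  # token-level task

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        h, _ = self.encoder(self.embed(tokens))
        slot_logits = self.slot_head(h)                   # one tag per token
        intent_logits = self.intent_head(h.mean(dim=1))   # pooled sentence vector
        return intent_logits, slot_logits

model = JointNLU(vocab_size=1000, n_intents=5, n_slot_tags=9)
tokens = torch.randint(0, 1000, (2, 7))
intent_logits, slot_logits = model(tokens)
loss = nn.functional.cross_entropy(intent_logits, torch.tensor([1, 3])) \
     + nn.functional.cross_entropy(slot_logits.reshape(-1, 9),
                                   torch.randint(0, 9, (2 * 7,)))
loss.backward()

Because both heads share the encoder, the error signal of each subtask shapes the representation used by the other, which is precisely the interaction a pipeline cannot exploit.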
[42] and Zhou, Wen, & Wang [43] employed hierarchical neural network models for intent
classification and slot labeling. The former put slot labeling at the bottom of the hierarchical
network and intent identification on top. The latter tried two different arrangements (one
exactly the same as the former, the other inverted) and found that the subtask at the top of the
network always gained more benefit from the hierarchical structure, no matter which subtask
was put on top. It is not yet clear which kind of joint arrangement is better for given subtasks;
this problem has not been fully investigated.
Almost all current joint models are supervised, so they need labeled data for all subtasks.
Deep neural network models, in particular, demand a large amount of data for good
performance. So the second problem is how to obtain a large amount of labeled data, or
whether we should pursue some unsupervised approach. As we have seen, unsupervised
models performed significantly worse than supervised models on single subtasks, and there
is still no unsupervised approach for joint models. Could joint tasks yield better unsupervised
models than single tasks by exploiting the interaction information between two or more
subtasks? If so, joint models would gain another important advantage over pipeline models.
Another problem is domain adaptation. It is expensive to build a large amount of labeled
data, and even more expensive to do so for each domain. How can we reuse the labeled data
from one domain in another domain? We have to deal with new words, new intents, new slot
values, or even new slots for dialogues in new domains. There is some early work on this
problem. For example, Yazdani & Henderson [45] explored a zero-shot representation
learning model for SLU in new dialogue domains. They integrated intents (acts) and slots in
a label representation learning model, with different domains sharing common
word-embedding parameters. The experimental results showed that the word-vector-based
model could adapt well to new domains. We will see in the next section that word-based
models could also be a possible way to achieve cross-domain adaptation in other joint
models.
Jointly modeling subtasks across NLU and DM
Normally, the DM receives the semantic labels of a sentence from the NLU as inputs. Some
recent work has bridged this gap and uses the sentence directly as the input of the DM.
Henderson, Thomson, & Young proposed a word-based RNN model for state tracking. The
model mapped the n-grams of user inputs to dialog states without using an explicit semantic
decoder. Each slot was handled by a separate RNN model. The method was evaluated on the
second Dialog State Tracking Challenge (DSTC2) corpus, and the results demonstrated
consistently higher performance compared with pipeline models. Mrksic & Kadlex, et al.
[47] proposed a multi-domain state tracking model based on the work proposed in Ref. [].
The results showed the model could achieve good performance when combined with some
delexicalized features.
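A minimal sketch in the spirit of this word-based tracking (assuming PyTorch; the sizes and slot-value inventory are illustrative assumptions) is a per-slot RNN that consumes user words directly and outputs a distribution over that slot's values:

# Sketch of a word-based tracker for a single slot: the RNN reads user
# words directly (no explicit semantic decoder) and outputs a belief
# distribution over the slot's candidate values.
import torch
import torch.nn as nn

class SlotTracker(nn.Module):
    def __init__(self, vocab_size, n_values, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_values)      # one score per candidate value

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        _, h = self.rnn(self.embed(tokens))      # h: (1, batch, dim)
        return self.out(h.squeeze(0)).softmax(dim=-1)

# One such tracker per slot ("food", "area", ...), mirroring the separate
# per-slot RNNs mentioned above:
tracker = SlotTracker(vocab_size=500, n_values=4)
belief = tracker(torch.randint(0, 500, (1, 6)))  # distribution over 4 values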
Reinforcement Learning (RL) has been a major tool for policy modeling. Most current joint
models that include act generation employ Deep Reinforcement Learning (DRL), which was
first proposed in Ref. [48] for playing computer games. Mnih, Kavukcuoglu, & Silver, et al.
[48,49] implemented a screen-based game-playing agent. The agent selected game actions
according to screen images. They proposed a deep Q-learning algorithm on a Deep
Q-Network (DQN) with two convolutional layers and two fully connected feed-forward
layers for learning the Q-function. A mapping from image inputs to game acts was learned.
By utilizing DRL, screen understanding was integrated with game operation selection into an
end-to-end model. The model achieved better or competitive scores in a number of different
games compared with human players. In fact, game playing is very similar to dialogue:
screen images are analogous to user utterances, and game operations are analogous to
dialogue actions. The goal of the game agent is to achieve maximum long-term reward over
multiple turns, which is also analogous to the goal in dialogues. The only difference between
games and dialogues is:
The inputs of games are continuous images, while the inputs of dialogues are discrete
language symbols. Narasimhan, Kulkarni, & Barzilay proposed an LSTM-DQN model for
text-based games, where an LSTM was used to decode text inputs into a vector
representation which was then fed to a DNN to train a Q-function. It achieved better
performance than some previous models. Due to the great successes in computer games and
the similarities between games and dialogues, DRL was then rapidly adopted for building
end-to-end joint models for dialogue systems. Cuayahuitl & Keizer used deep reinforcement
learning on a non-cooperative dialogue to generate dialogue policies; they ran experiments
on a card game instead of dialogues. Cuayahuitl tried to construct a joint model from the
outputs of ASR to act generation, using the DRL approach of Cuayahuitl & Keizer for DM.
But they only showed some simple DRL results without a performance evaluation of the
dialogue system. Zhao & Eskenazi jointly modeled state tracking and action generation in a
deep reinforcement learning frame, with an LSTM used to track the history of the dialogue.
They also proposed a model with supervised information from the dialogue state. Dialogue
states were manually designed in past dialogue systems; this design was subjective and
time-consuming, and DRL provided an efficient way to avoid explicit design of the dialogue
states. But it was not so easy to train Q-function networks like DQN or LSTM-DQN. The
samples fed to the network were (s_t, a_t, r_t, s_{t+1}), t = 1, 2, …, N. They were not
independent and identically distributed (i.i.d.) because s_{t+1} (the state at time t + 1) was
determined by both s_t and a_t. The Q-function networks were therefore prone to oscillation
and difficult to converge. For training the DQN, Mnih, Kavukcuoglu, & Silver, et al. used an
experience replay mechanism proposed by Lin, which randomly sampled previous
transitions and thereby smoothed the training distribution over many past behaviors.
Recently, Hasselt, Guez, & Silver [55] alleviated the overestimation problem of standard
Q-learning by introducing the double DQN, and Schaul, Quan, & Antonoglou, et al.
improved the convergence speed of DQN via prioritized experience replay. Although these
measures worked to some extent and helped DRL achieve great success in computer games,
there was no general guarantee for the convergence of DRL. Ma & Wang [57] showed that
the Q-function networks could converge well when a dialogue has a small act space, but that
the situation worsened as the act space of the dialogue grew. How to train Q-function
networks will remain a problem for the near future.
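The replay mechanism itself is simple. The following sketch (assuming PyTorch, with an illustrative 8-dimensional state and 4 dialogue acts) shows Q-learning over randomly sampled (s, a, r, s') transitions, which breaks their temporal correlation:

# Sketch of Q-learning with an experience replay buffer, the mechanism
# discussed above for de-correlating (s_t, a_t, r_t, s_{t+1}) samples.
import random
from collections import deque
import torch
import torch.nn as nn

GAMMA = 0.95
q_net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)  # items: (state, action, reward, next_state)

# push transitions as they occur (action stored as a long tensor):
replay.append((torch.zeros(8), torch.tensor(2), torch.tensor(1.0), torch.zeros(8)))

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    # Random sampling breaks the temporal correlation between successive
    # transitions, smoothing the training distribution over past behaviours.
    s, a, r, s_next = map(torch.stack, zip(*random.sample(replay, batch_size)))
    q = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)       # Q(s, a) taken
    with torch.no_grad():                                  # bootstrapped target
        target = r + GAMMA * q_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()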
learned a CFG from an aligned corpus. Wong & Mooney [59] proposed an algorithm to learn
a synchronous context-free grammar (SCFG) automatically from a corpus of
sentence–semantic-frame alignments. The model used a left-to-right Earley chart to map
semantic frames to natural language sentences and re-ranked the mapping results using a
language model during decoding. Lu & Ng [60] proposed an SCFG-based forest-to-string
generation algorithm. Konstas & Lapata used a bottom-up chart decoder to learn a PCFG
from phrase–semantic-slot pairs harvested from a sentence–semantic-frame alignment
corpus, re-ranked the generated trees by combining n-grams and dependency relations, and
then output the sentence formed by the leaf nodes of the top-ranked tree.
The sentences output by syntax-based methods were grammatical, but it was difficult to
obtain good syntaxes. Manual rules were expensive and domain dependent, while grammar
learning relied on a large aligned corpus. Limited by their syntaxes, none of the above
methods could deal with semantic frames that did not occur in the training data, and the
sentences they generated lacked diversity. Sequence-based models took a sentence as a
sequence of words or phrases.
They predicted the generation probability of the next word based on the words already
generated. To cover the semantic frame in the generated sentence, the sequence model took
the dialogue act into consideration. So the generation probability of the nth word could be
estimated by p_θ(w_n | w_1, …, w_{n−1}, DA), where DA is the current dialogue act given
by the semantic frame and θ denotes the parameters of the probability function. Several
neural network based models, especially RNNs, were used to approximate this probability.
Zhang & Lapata described work on using
RNNs to generate Chinese poetry. Wen, Gasic, & Kim, et al. jointly trained a forward-RNN
generator, a CNN and an inverse-RNN ranker to generate natural sentences for a specific
DA. Wen, Gasic, & Mrksic used a DA control gate for sentence planning and an LSTM for
surface realization; the two parts were jointly trained to generate sentences that are
grammatical and semantically consistent with the DA. Mei, Bansal, & Walter proposed an
end-to-end, domain-independent neural encoder-aligner-decoder model to jointly model
content selection, sentence planning and surface realization.
An LSTM was first used to encode all the semantic slots, the salient semantic slots were then
extracted by an alignment model, and finally the natural sentence was generated by a
decoder. Dusek & Jurcicek proposed an attention-based LSTM to encode the input DA and
the words already generated, and then an LSTM decoder together with a logistic classifier
was used to generate the remaining words in sequence. They demonstrated that their model
can achieve performance comparable to other RNN-based models with less training data.
Compared to syntax-based models, sequence-based models did not need fine-grained
alignment data for training. The flexibility of sequence-based models in modeling dialogue
history, context and word selection brought diversity to the generated sentences. On the
other hand, because the generation process of sequence-based models was not controlled by
any specific syntax, it was unavoidable for them to generate some ungrammatical sentences,
and it was also possible for them to drop or repeat some slots of the DA.
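A minimal sketch of such a DA-conditioned sequence model (assuming PyTorch; sizes and the concatenation scheme are illustrative assumptions) concatenates a dialogue-act embedding to every word embedding before the recurrent step:

# Sketch of a generator estimating p_theta(w_n | w_1..w_{n-1}, DA): a
# dialogue-act vector is appended to each word embedding, so the next-word
# distribution is conditioned on both the history and the dialogue act.
import torch
import torch.nn as nn

class DAConditionedLM(nn.Module):
    def __init__(self, vocab_size, n_dialogue_acts, dim=64):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.da_embed = nn.Embedding(n_dialogue_acts, dim)
        self.rnn = nn.LSTM(2 * dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, words, da):              # words: (batch, n), da: (batch,)
        da_vec = self.da_embed(da).unsqueeze(1).expand(-1, words.size(1), -1)
        h, _ = self.rnn(torch.cat([self.word_embed(words), da_vec], dim=-1))
        return self.out(h)                     # logits for the next word

model = DAConditionedLM(vocab_size=800, n_dialogue_acts=12)
logits = model(torch.randint(0, 800, (2, 5)), torch.tensor([3, 7]))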
Following recent advances in sequence-to-sequence models for machine translation, some
sequence-to-sequence models for non-goal-driven dialogue were also proposed. Shang, Lu,
& Li presented an RNN-based encoder-decoder Neural Responding Machine with an
attention signal, and Vinyals & Le proposed an LSTM-based sequence-to-sequence
Conversational Model of the typical sequence-to-sequence structure. Li, Galley, & Brockett,
et al. used Maximum Mutual Information (MMI) as the objective function to measure the
mutual dependence between inputs and outputs. Experimental results showed that MMI
helped sequence-to-sequence models to produce more diverse responses.
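As a rough illustration (following the anti-language-model variant of MMI described by Li et al., with λ a tunable constant), the decoding objective can be written as

    T̂ = argmax_T { log p(T | S) − λ log p(T) },

where S is the input, T a candidate response, and the penalty −λ log p(T) down-weights the generic, high-frequency responses that a pure likelihood objective tends to favor.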
These approaches jointly modeled the process from input sentences to the generation of
responses for non-goal-driven dialogues. They did not include semantic parsing or an explicit
DM, and therefore could not be applied directly to goal-driven dialogues. Dodge, Gane, &
Zhang, et al. also considered the difficulty of evaluating these models. They therefore
proposed a set of tasks, including question answering, recommendation, question
answering + recommendation and chatting, to test the ability of end-to-end dialogue systems.
This might be an interesting way to bridge non-goal-driven and goal-driven dialogues. On
the other hand, there are many advances in sequence-to-sequence machine translation models
that might be borrowed to build more powerful goal-driven dialogue systems in the future. It
is clear that a full end-to-end goal-driven dialogue system should not only output a final
sentence in response to an input sentence, but also keep and update fruitful internal
representations or memories of the dialogue. The internal memories can either be explicitly
extracted and represented, or validated by some external tasks such as question answering.
Usefulness evaluation
If any computer system is to be taken up by users and customers, it must be demonstrably
useful, so "usefulness" is the first of the more qualitative evaluation criteria we look at. The
YPA "is a natural language dialogue system that allows users to retrieve information from
British Telecom's Yellow pages" (Kruschwitz et al., 1999, 2000).
The yellow pages contain advertisements, with the advertiser name and contact information.
The YPA system returns addresses, and if no address is found, a conversation is started and
the system asks the user for more details in order to give the user the required address. The
YPA is composed of the Dialog Manager, the Natural Language Frontend, the Query
Construction Component, and the Backend database. The Backend includes a relational
database that contains tables extracted from the Yellow pages. The conversation starts by
accepting user
input through a graphical user interface; the Dialogue Manager then sends the textual input
to the Natural Language Frontend for parsing.
After that, the parse tree is sent to the Query Construction Component, which translates the
input into a database query, to query the Backend Database and return the retrieved address.
If no addresses are found, then the Dialogue Manager starts putting more questions to the
user to obtain further clarification. To evaluate the YPA, 75 queries were extracted from a
query corpus, and a response sheet was prepared to see if the returned addresses were
appropriate or not, how many dialog steps were necessary, the total number of addresses
recalled, and the number of those relevant to the original query. Results show that 62 out of 75
queries managed to return addresses, and 74% of those addresses were relevant to the original
query. In a similar manner, we evaluated the "usefulness" of the responses generated by our
Qur'an chatbot. The Qur'an chatbot was developed using our chatbot-training program,
where the English/Arabic corpus of the Qur'an, the holy book of Islam, is used. The Qur'an
text is available via the Internet, and in principle the Qur'an provides guidance and answers
to religious and other questions. The resulting system accepts user input in English and
answers with appropriate ayyas from the Qur'an in the English and Arabic languages.
Localizability
The localizability aspect of evaluation tries to identify how easy it is to adapt a natural
language dialogue system to a new domain or language without affecting the way it works.
With this goal in mind, some dialogue systems have been designed to be retrainable to a new
domain via a domain corpus. Inui et al. (2003) introduced a natural language dialogue system
based entirely on the use of corpora. The aim of this system is to be so general that it can be
trained with any corpus in any domain and language. The system is mainly composed of
three modules, the NL Parser, the Matcher, and the NL Generator, as displayed in the figure
below. The input sentence is sent to the natural language (NL) parser, which analyzes it
using the N-gram-based shallow parser (Inui et al., 2002). The matcher uses keyword
matching and structural matching to find the dialogue most similar to the current flow in the
Dialogue Corpus. The matcher uses the Context Data Base, in which each dialogue act is
assigned an intention from a list containing greet, question, explain, etc. In the
keyword-based matcher, the nouns and verbs identified by the NL parser are matched with
the most similar nouns and verbs from the Dialogue Corpus. Before confirming this match,
the matcher checks the intentions associated with those nouns and verbs in the Context
Database. In the structural matcher (Koiso et al., 2002), the most similar dialogue is
determined by calculating the structural distance between two sentences. In this fully
corpus-based approach, the user can choose which matcher to use. The NL generator
generates the system's responses and applies the necessary exchanges to them.
Figure: Corpus-Based Approach to Building a Natural Language Dialogue System (NL
parser, matcher and NL generator, drawing on the parenthesized corpus, the Dialogue Corpus
and the Context DB)
We built a generic Java program that reads a dialogue from a corpus and maps it to the AIML
format used by the ALICE chatbot to produce different versions of the chatbot, which were
evaluated using different techniques. Table 1 displays the corpora used to train our program.
After creating AIML files for the corpus types displayed in Table 1, the Pandorabot
web-hosting service was used to publish the different versions of the corpus-trained chatbots
to make them available over the World Wide Web. Users were asked to chat with these
versions and provide their feedback. Based on user feedback and the retraining corpus, eight
system prototypes were generated to satisfy users' expectations. The key issue in building
these prototypes was how to expand the knowledge learned from the corpus to increase the
chances of finding a match. The idea of matching is based on finding the best match, which
is the longest one. Since the input will not necessarily exactly match a whole sentence
extracted from the corpus, other learning techniques were adopted.
In each prototype, machine-learning techniques were used and a new chatbot was tested. The
machine-learning techniques ranged from a primitive simple technique to more complicated
ones. Building atomic categories and comparing the input with all atomic patterns to find a
match is an instance-based learning technique.
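The corpus-to-AIML mapping itself is mechanical; the Python sketch below is an illustrative stand-in for the Java program described above (AIML 1.0 conventionally normalizes patterns to upper case):

# Hedged sketch of corpus-to-AIML conversion (the program described above
# was written in Java; this Python version is illustrative only).
from xml.sax.saxutils import escape

def to_aiml(turn_pairs):
    """turn_pairs: list of (user_utterance, system_response) pairs."""
    categories = []
    for pattern, template in turn_pairs:
        categories.append(
            "<category>"
            f"<pattern>{escape(pattern.upper())}</pattern>"  # patterns upper-cased
            f"<template>{escape(template)}</template>"
            "</category>"
        )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n<aiml version="1.0">\n'
            + "\n".join(categories) + "\n</aiml>")

print(to_aiml([("How are you", "Fine, thanks.")]))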
However, the learning approach does not stop at this level; it improves the matching process
by using the first word and the most significant words. This increases the ability to find a
nearest match by extending the knowledge base used during the matching process. Four
dialog transcripts generated by our Afrikaans prototype were used to measure the efficiency
of the adopted learning techniques. The frequency of each type of matching (atomic, first
word, significant word, and no match) in each generated dialogue was estimated, and the
absolute frequencies were normalised to relative probabilities as shown in Figure 4. The
results showed that the first-word and most-significant-word approaches increase the ability
to provide answers to users and to let the conversation continue.
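A hedged sketch of this layered matching follows; the rule for picking the "most significant word" is an assumption (here the longest word stands in for it), not the thesis's exact criterion:

# Sketch of the layered matching strategy described above: atomic match,
# then first-word match, then most-significant-word match, else no match.
def match(user_input, kb):
    """kb: dict mapping full patterns (upper-case) to responses."""
    text = user_input.upper().strip()
    if text in kb:                                    # 1. atomic match
        return kb[text], "atomic"
    first = text.split()[0]
    for pattern, response in kb.items():              # 2. first-word match
        if pattern.split()[0] == first:
            return response, "first word"
    significant = max(text.split(), key=len)          # 3. most significant word
    for pattern, response in kb.items():              #    (longest word, assumed)
        if significant in pattern.split():
            return response, "significant word"
    return None, "no match"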
Humanness Evaluation
The humanness aspect of a chatbot is traditionally measured by the ability of the dialogue
system to fool users into believing that they are interacting with a real human, not a virtual
one. Colby (1975) used this strategy to evaluate his chatbot PARRY that simulates a paranoid
patient. A blind test was applied by three psychiatrists questioning both PARRY and three
other human patients diagnosed as paranoid. Psychiatrists were not able to distinguish
the PARRY chatbot from the human patients.
The same policy was adopted in the Loebner prize competition, which allows users to chat
with a conversational agent for 10 minutes: if this chatting gives users the impression that
they are dealing with a human and not a machine, that conversational agent succeeds in the
competition. However, this is a somewhat superficial and subjective measure: 10 minutes is
not really enough to judge the humanness of a system, and the judgement depends on the
subjective opinions of a few users. We adopt a novel way to measure the humanness of a
natural language dialogue system: comparing dialogues generated by the system against
"real" human dialogues. To do this, the Wmatrix tool (Rayson 2003) was used to compare a
dialogue transcript generated by chatting with ALICE with real conversations extracted from
different dialogue corpora.
The comparison illustrates the strengths and weaknesses of ALICE as a human simulation,
according to the linguistic features: lexical, part-of-speech, and semantic differences. The
semantic comparison illustrates that explicit speech act expressions are used heavily within
ALICE, in an attempt to reinforce the impression that there is a real dialogue; pronouns (e.g.
he, she, it, they) are used more in ALICE, to pretend personal knowledge and contact;
discourse verbs (e.g. I think, you know, I agree) are overused in ALICE, to simulate human
trust and opinions during the chat; and liking expressions (e.g. love, like, enjoy) are overused
in ALICE, to give an impression of human feelings.
The part-of-speech analysis shows that the singular first-person pronoun (e.g. I), the
second-person pronoun (e.g. you) and proper names (e.g. Alice) are used more in ALICE, to
mark participant roles more explicitly and hence reinforce the illusion that the conversation
really has two participants.
At the lexical level, the analysis results show that the ALICE transcripts made more use of
the specific proper names "Alice" (not surprisingly!) and "Emily", and of "you_know", where
the underscore artificially creates a new single word from two real words. Table 2 illustrates
the lexical comparison between the ALICE transcript file, represented in column "O1", and
the real conversation file, represented in column "O2":
Item   O1   %1    O2   %2    LL
Do     44   3.90  35   0.65  +58.69
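The LL column is the log-likelihood statistic that Wmatrix computes for frequency comparisons; in the standard formulation (Rayson 2003), for a word with observed frequencies a and b in two corpora of total sizes c and d, the expected frequencies and the statistic are

    E1 = c(a + b) / (c + d),    E2 = d(a + b) / (c + d),
    LL = 2 [ a ln(a / E1) + b ln(b / E2) ],

with larger LL values indicating a more significant frequency difference between the two corpora.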
intelligible and does not need extra effort to listen to; the contents of the system's output
should be correct and relevant to the topic; adequate feedback is essential for users to feel in
control during the interaction; and the structure of the dialogue should be natural and reflect
users' intuitive expectations (Dybkjaer et al., 2004).
A recent study (2014, p. 1) discovered that "a chat bot that provides responses based on the
participant's input dramatically increased the perceived humanness and engagement of the
conversational agent." In their experiment, the researchers created a chat bot that asked
participants to describe a series of images. The interaction was either static, in which the
participants answered the base questions, or dynamic, in which there was a follow-up
question based on the participant's response. A survey was completed by each participant
after answering questions about all the images. In order to measure humanness, a question
about the chat partner was asked to see whether it was a human or a computer; a six-option
scale was used: definitely human; probably human; not sure but guess human; not sure but
guess computer; probably computer; and definitely computer. Results reveal that 79.2% of
static-interview participants thought their partner was definitely a computer, while only
41.9% of those using the dynamic chatbot thought the same.
Happy Assistant is "a natural language dialog-based navigation system that helps users
access e-commerce sites to find relevant information about products and services" (Chai et
al., 2001a). The system is composed of three main modules: the presentation manager (PM),
the dialog manager (DM), and the action manager (AM). The presentation manager applies a
shallow parsing technique to identify semantic and syntactic information of interest in the
user's textual input. It then translates the user's input into a well-formed XML message
called the logical form. The dialogue manager is responsible for matching concepts from a
user's query to business rules found in the knowledge domain. The business rules consist of
a list of concepts together with some metadata about the target product or service. If a match
is found, the webpage associated with that rule is presented to the user. Otherwise, the most
important missing concept is identified by putting questions to the user. Control is then
turned over to the action manager, which accesses the product that matched the query, and if
the user provides special preferences, a sorting algorithm is applied to yield a ranked list of
products.
To make users trust the system, it must offer some explanation before producing a result, so
the system summarizes the user's request by paraphrasing it using the context history. The
following excerpt from a sample conversation with the Happy Assistant system, taken from
(Chai et al., 2001a), illustrates this behavior:
U: Yes, absolutely
The target notebook is then displayed for the user, and beneath it a summary of the user's
request is displayed to explain why this product was chosen.
Usability in this system was evaluated through a study designed to explore how well the
system meets users' expectations in terms of ease of use, system flow, validity of the system
response, and user vocabulary (Chai et al., 2001b). The study compared the navigation
process in the dialog system against a menu-driven system for finding target products.
Results show that users preferred the dialog-based search over the menu-driven search (79%
to 21% of the users) for the following reasons: it was easy to use; it met the users' needs;
users liked the idea that they could express their needs in their own language without being
restricted to menu choices; users felt that the computer did all the work for them; and
moreover users found that the system reduced the interaction time. However, novice users
preferred the menu-driven system because there was no need for typing. In a similar manner,
we used comparative evaluation to compare the results generated by Google with the results
generated by the FAQchat system. FAQchat is another version of the chatbot-training
program described in Section 2, where the FAQ corpus of the School of Computing (SoC) at
the University of Leeds is used to train the program. The results returned by FAQchat are
similar to those generated by search engines such as Google, where the outcomes are links to
exact or nearest-match web pages. An evaluation sheet was prepared which contains 15
information-seeking tasks or questions on a range of different topics related to the FAQ
database. The evaluation sheet was distributed among 21 members: nine staff and the rest
postgraduate students.
An interface was built which has a box to accept the user input and a button to send this to
the system. The outcomes appear in two columns: one holds the FAQchat answers, and the
other holds the Google answers after filtering them to the FAQ database. Users were asked
to try the system and to state whether they were able to find answers using the FAQchat
responses or using the Google responses, and which of the two they preferred and why.
Results in Table 3 show that 68% of our sample of users overall managed to find answers
with FAQchat, while 46% found them with Google.
In terms of preferences, for each question users were asked to state which tool they preferred
to use to find the answer. The proportion of users who preferred each tool was calculated.
Results in Figure 6 show that 51% of the staff, 41% of the students, and 47% overall
preferred using FAQchat, against 11% who preferred using Google. As a result, we can claim
that our users found the chatbot a novel and interesting way to access the FAQ using natural
language questions.
In 2006, Quarteroni and Manandhar applied the same approach, using the chatbot as an
interface to an open-domain QA system. They developed a non-interactive QA system called
YourQA (Quarteroni and Manandhar, 2006), which accepts a textual natural language query
from a user, takes the top 20 Google results for this query, retrieves the corresponding Web
pages, analyses them to extract answers, ranks the answers by relevance to the question, and
presents the list of answers in the form of an HTML result page. In order to make their
system interactive, Quarteroni and Manandhar (2007) used ALICE as an interface for their
system. First of all, they used the Wizard-of-Oz (WOz) technique (Dahlbaeck et al., 1993),
in which a human operator hides behind a computer interface to simulate conversation with
a user who believes they are chatting with a computer system. A set of AIML categories was
prepared, and the wizard's role was to choose the appropriate category and utterance from
the available set as a response to the user.
In case there was no response, the wizard would create an utterance to keep the conversation
alive. The aim of this experiment was to obtain initial data (conversation logs) to enrich the
knowledge base of the ALICE chatbot. After this, the authors built an interactive interface
for the YourQA system that consists of an applet window with a left panel where the chat
takes place and a right panel where results are visualised when ready. A sample of chatting
with this version is shown below:
System0: Welcome!
User1: Hi
S2: Possible answers are in the panel on the right. Are you happy with these
answers?
U3: Yes!
S3: Good!
U5: Yes
S5: I found the answer in the panel on the right. Can I help you further
A sample of dialogue with the YourQA chat version (taken from Quarteroni and Manandhar,
2007)
Results
Simulations were carried out using three different knowledge distributions. The data
presented in Table 2 combines the results from all of those simulations. This data gives us
empirical validation of the effects of applying certain dialogue mechanisms in our underlying
model. For instance, we can see that both the Continuous Mode Algorithm and the use of
negotiation can greatly increase the efficiency of the joint problem-solving. Furthermore, the
use of Non-Obligatory Summaries greatly increased the efficacy of negotiation (but did not
significantly affect the Continuous Mode algorithm).
Demo and Full-Scale Systems
Work is now being done to implement the Collaborative Algorithm in a human-computer
interactive environment. The previous stages of model building, mathematical analysis,
computational implementation and simulation have enabled us to elaborate our dialogue
model before actually building our next-generation human-computer interactive system.
Using the simulation results, we have some predictive power in determining which dialogue
mechanisms are useful for producing more efficient joint problem-solving.
However, since simulations often involve simplifications of the actual domain being
modeled, we must not attribute results from analyzing the simulations to the real world being
modeled in the simulation. These simulations only give information about the model. But by
observing the resulting dialogues, we can ascertain whether the underlying model generates
interactions that have the target behaviors observed in human-human dialogues. We can also
apply statistical analysis to this generated corpus to further our understanding of the
computational model.
Chapter 5
Conclusions
This thesis gave a brief survey of goal-driven human–computer dialogue systems, including
two often-used frames and some recent research work on each subtask of dialogue systems.
However, the major concern of this work was joint models, which model multiple subtasks
of dialogue simultaneously.
We consider joint modeling to be one important trend in dialogue systems. In fact, there has
been a rapid increase in work on joint models in recent years. We have tried to survey most
of the related work and to classify it according to which subtasks were taken into the joint
models. As we have seen, there are several different types of joint models, such as flat or
hierarchical ones. There are also several different extents of integration, including the
integration of several subtasks inside NLU, DM or NLG, jointly modeling subtasks crossing
NLU and DM, and jointly modeling the whole process through NLU, DM and NLG.
Although joint models are still in their early stages, they have shown some advantages over
previous pipeline models. One significant advantage of joint models is that they can model
the interaction relations between different subtasks in a single model to improve the
performance of the whole system.
Another practical advantage is that joint models might remove some intermediate
representations that previously had to be built manually. This could reduce the subjectivity
of human design and make a dialogue model more flexible in adapting to different tasks in
different domains. It is not surprising that most recent joint models were constructed with
deep neural networks. Deep neural networks provide uniform structures and training
procedures for the different subtasks. Reinforcement learning is still the main tool for DM.
Although neural networks have long been used in reinforcement learning, it is the recent
combination of reinforcement learning with different deep neural networks, yielding deep
reinforcement learning, that has pushed research on joint models forward so greatly. Finally,
many problems in joint modeling still await solutions: how to obtain enough data for building
a dialogue system, how to train a joint model efficiently, how to adapt a joint model from
one domain to another, and so on. Some of these problems are of theoretical interest; others
have practical appeal.
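For the domain-adaptation question, one common approach is transfer by fine-tuning: keep the
shared encoder trained on the source domain and retrain only a new task head on the target
domain. The sketch below shows this pattern under the same illustrative assumptions as the
earlier sketches; the frozen encoder is assumed to carry pre-trained source-domain weights,
though here it is randomly initialized so that the sketch runs stand-alone.

import torch
import torch.nn as nn

# Shared encoder assumed pre-trained on the source domain.
encoder = nn.GRU(64, 128, batch_first=True)
for param in encoder.parameters():
    param.requires_grad = False          # freeze the shared representation

# Fresh DM head for the target domain's (assumed) 15-action inventory.
new_action_head = nn.Linear(128, 15)
optimizer = torch.optim.Adam(new_action_head.parameters(), lr=1e-3)

# One adaptation step on a dummy target-domain batch.
embedded = torch.randn(8, 12, 64)        # placeholder embedded utterances
_, hidden = encoder(embedded)
logits = new_action_head(hidden.squeeze(0))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 15, (8,)))
optimizer.zero_grad()
loss.backward()
optimizer.step()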