MICHAEL F. MCTEAR
University of Ulster

Spoken dialogue systems allow users to interact with computer-based applications such as databases and expert systems by using natural spoken language. The origins of spoken dialogue systems can be traced back to Artificial Intelligence research in the 1950s concerned with developing conversational interfaces. However, it is only within the last decade or so, with major advances in speech technology, that large-scale working systems have been developed and, in some cases, introduced into commercial environments. As a result many major telecommunications and software companies have become aware of the potential for spoken dialogue technology to provide solutions in newly developing areas such as computer-telephony integration. Voice portals, which provide a speech-based interface between a telephone user and web-based services, are the most recent application of spoken dialogue technology. This article describes the main components of the technology - speech recognition, language understanding, dialogue management, communication with an external source such as a database, language generation, speech synthesis - and shows how these component technologies can be integrated into a spoken dialogue system. The article describes in detail the methods that have been adopted in some well-known dialogue systems, explores different system architectures, considers issues of specification, design and evaluation, reviews some currently available dialogue development toolkits, and outlines prospects for future development.

Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing - discourse, speech recognition and synthesis; H.5.2 [Information Interfaces and Presentation]: User Interfaces - natural language, voice I/O

General Terms: Human Factors

Additional Key Words and Phrases: Dialogue management, human-computer interaction, language generation, language understanding, speech recognition, speech synthesis
1. INTRODUCING SPOKEN DIALOGUE TECHNOLOGY

The `conversational computer' has been the goal of researchers in speech technology and artificial intelligence (AI) for more than 30 years. A number of large-scale research programmes have addressed this goal, including the DARPA Communicator Project, Japan's Fifth Generation programme, and the European Union's ESPRIT and Language Engineering programmes. The impression of effortless spontaneous conversation with a computer has been fostered by examples from science fiction such as HAL in 2001: A Space Odyssey or the computer on the Star Ship Enterprise.
Author's address: Michael McTear, School of Information and Software Engineering, University of Ulster, Newtownabbey BT37 0QB, Northern Ireland, UK.
It is only recently, however, that spoken language interaction with computers has become a practical possibility in both scientific and commercial terms. This is due to advances in speech technology, language processing and dialogue modelling, as well as the emergence of faster and more powerful computers to support these technologies. Applications such as voice dictation and the control of appliances using voice commands are appearing on the market, and an ever-increasing number of software and telecommunications companies are seeking to incorporate speech technology into their products. It is important, however, to be aware of the limitations of these applications. Statements are commonly made in sales and marketing literature such as `Talk to your computer as you would talk to your next-door neighbour' or `Teach your computer the art of conversation'. However, the technologies involved would not be sufficient to enable a computer to engage in a natural conversation with a user. Voice dictation systems provide a transcription of what the user dictates to the system, but the system does not attempt to interpret the user's input nor to discuss it with the user. Command-and-control applications enable users to perform commands with voice input that would otherwise be performed using the keyboard or mouse. The computer recognises the voice command and carries out the action, or reports that the command was not recognised. No other form of dialogue is involved. Similar restrictions apply to most other forms of voice-based system in current use.
Spoken dialogue systems, on the other hand, can be viewed as an advanced application of spoken language technology. Spoken dialogue systems provide an interface between the user and a computer-based application that permits spoken interaction with the application in a relatively natural manner. In so doing, spoken dialogue systems subsume most of the major fields of spoken language technology, including speech recognition and speech synthesis, language processing, and dialogue management.

The aim of the current survey is to describe the essential characteristics of spoken dialogue technology at a level of technical detail that should be accessible to computer scientists who are not specialists in speech recognition and computational linguistics. The survey provides an overview for those wishing to research into or develop spoken dialogue systems, and hopefully also for those who are already experienced in this field. Most published work to date on spoken dialogue systems tends to report on the design, implementation, and evaluation of individual systems or projects, as would be expected with an emerging technology. The present paper will not attempt to survey the growing number of spoken dialogue systems currently in existence but rather will focus on the underlying technologies, using examples of particular systems to illustrate commonly occurring issues.[1]

[1] Inevitably there are omissions, in some cases of well-known and important systems, but this is unavoidable, as the aim is not to provide a comprehensive review of dialogue systems but to focus on the general issues of the technology. Interested readers can follow up particular systems in the references provided at the end of the survey and in Appendix A.
1.1 Overview of the paper
The remainder of the paper is structured as follows. In the next section spoken dialogue systems are defined as computer systems that use spoken language to interact
with users to accomplish a task. Dialogue systems are classified in terms of different control strategies and some examples are presented in section 3 that illustrate this classification and give a feel for the achievements as well as the limitations of current technology. Section 4 describes the components of a spoken dialogue system - speech recognition, language understanding, dialogue management, external communication, response generation, and text-to-speech synthesis. The key to a successful dialogue system is the integration of these components into a working system. Section 5 reviews a number of architectures and dialogue control strategies that provide this integration. Methodologies to support the specification, design, and evaluation of a spoken dialogue system are reviewed in section 6. Particular methods have evolved for specifying system requirements, such as user studies, the use of speech corpora, and Wizard-of-Oz studies. Methods have also been developed for the evaluation of dialogue systems that go beyond the methods used for evaluation of the individual elements such as the speech recognition and spoken language understanding components. This section also examines some current work on guidelines and standards for spoken language systems. A recent development is the emergence of toolkits and platforms to support the construction of spoken dialogue systems, similar to the toolkits and development platforms that are used in expert systems development. Some currently available toolkits are reviewed and evaluated in section 7. Finally, section 8 examines directions for future research in spoken dialogue technology.
2. SPOKEN DIALOGUE SYSTEMS - A DEFINITION
Spoken dialogue systems have been defined as computer systems with which humans interact on a turn-by-turn basis and in which spoken natural language plays an important part in the communication [Fraser 1997]. The main purpose of a spoken dialogue system is to provide an interface between a user and a computer-based application such as a database or expert system. There is a wide variety of systems that are covered by this definition, ranging from question-answer systems that answer one question at a time to `conversational' systems that engage in an extended conversation with the user. Furthermore, the mode of communication can range from minimal natural language, consisting perhaps of only a small set of words such as the digits 0-9 and the words yes and no, through to large vocabulary systems supporting relatively free-form input. The input itself may be spoken or typed and may be combined with other input modes such as DTMF (touch-tone) input, while the output may be spoken or displayed as text on a screen, and may be accompanied by visual output in the form of tables or images.

Spoken dialogue systems enable casual and naive users to interact with complex computer applications in a natural way using speech. Current IVR (Interactive Voice Response) systems limit users in what they can say and how they can say it. However, users of speech-based computer systems often do not know exactly what information they require and how to obtain it - they require the support of the computer to determine their precise requirements. For this reason it is essential that speech-based computer systems should be able to engage in a dialogue with users rather than simply respond to predetermined spoken commands. At the same time spoken dialogue systems are more restricted than conversational computers in that their conversational topics are limited, usually to a single domain such as flight enquiries.
Spoken dialogue systems can be classified into three main types, according to the methods used to control the dialogue with the user:

(1) Finite-state (or graph) based systems
(2) Frame-based systems
(3) Agent-based systems

The type of dialogue control strategy used has a bearing on how the system accomplishes two of its main tasks: processing the user's input and recovering from errors.
2.1 Finite-state based systems

In a finite-state based system the user is taken through a dialogue consisting of a sequence of pre-determined steps or states. Most commercially available spoken dialogue systems use this form of dialogue control strategy. The dialogue flow is specified as a set of dialogue states with transitions denoting various alternative paths through the dialogue graph. The system maintains control of the dialogue, produces prompts at each dialogue state, recognises (or rejects) specific words and phrases in response to the prompt, and produces actions based on the recognised response. The following is an example of an interaction with a basic finite-state based system in which the system verifies the user's input at each state of the dialogue:
System: What is your destination?
User: London
System: Was that London?
User: Yes
System: What day do you want to travel?
User: Friday
System: Was that Sunday?
User: No
System: What day do you want to travel?
Generally systems with state-based dialogue control restrict the user's input to single words or phrases that provide responses to carefully designed system prompts. A major advantage of this form of dialogue control is that the required vocabulary and grammar for each state can be specified in advance, resulting in more constrained speech recognition and language understanding. However, the disadvantage is that these systems restrict the user's input to predetermined words and phrases, making correction of misrecognised items difficult as well as inhibiting the user's opportunity to take the initiative and ask questions or introduce new topics. If augmented with a natural language component a state-based system can accept more natural input in the form of sentences or partial sentences. Furthermore, with a simple data structure such as a form or frame that keeps track of which information the user has provided and what the system still needs to know, a more flexible and more natural dialogue flow is possible. Verification can also be delayed until the system has gathered all the required information. The Nuance demo banking system to be described in section 3 is an example of a state-based system with these additional functionalities.
2.2 Frame-based systems

In a frame (or template) based system the user is asked questions that enable the system to fill slots in a template in order to perform a task such as providing train timetable information. In this type of system the dialogue flow is not pre-determined but depends on the content of the user's input and the information that the system has to elicit. For example:
System: What is your destination?
User: London
System: What day do you want to travel?
User: Friday

System: What is your destination?
User: London on Friday around 10 in the morning
System: I have the following connection ...
In the first example the user provides one item of information at a time and the system performs rather like a state-based system. However, if the user provides more than the requested information, as in the second example, the system can accept this information and check if any additional items of information are required before searching the database for a connection. Frame-based systems function like production systems, taking a particular action based on the current state of affairs. The questions and other prompts that the system should ask can be listed along with the conditions that have to be true for a particular question or prompt to be relevant. Some form of natural language input is required by frame-based systems to permit the user to respond more flexibly to the system's prompts, as in the second example. Natural language is also required to correct errors of recognition or understanding by the system. Generally, however, it is sufficient for the system to be able to recognise the main concepts in the user's utterance. The Philips train timetable information system, to be described in section 3, is an example of a frame-based system.
2.3 Agent-based systems

Agent-based or AI systems are designed to permit complex communication between the system, the user and the underlying application in order to solve some problem or task. There are many variants on agent-based systems, depending on what particular aspects of intelligent behaviour are included in the system. The following dialogue, taken from Sadek and de Mori [1998], illustrates a dialogue agent that engages in mixed-initiative co-operative dialogue with the user:

User: I'm looking for a job in the Calais area. Are there any servers?
System: No, there aren't any employment servers for Calais. However, there is an employment server for Pas-de-Calais and an employment server for Lille. Are you interested in one of these?
In this example the system's answer to the user's request is negative. But rather than simply responding `no', the system attempts to provide a more co-operative response that might address the user's needs.

In agent-based systems communication is viewed as interaction between two agents, each of which is capable of reasoning about its own actions and beliefs, and sometimes also about the actions and beliefs of the other agent. The dialogue model takes the preceding context into account with the result that the dialogue evolves dynamically as a sequence of related steps that build on each other. Generally there are mechanisms for error detection and correction, and the system may use expectations to predict and interpret the user's next utterances. These systems tend to be mixed initiative, which means that the user can take control of the dialogue, introduce new topics, or make contributions that are not constrained by the previous system prompts. For this reason the form of the user's input cannot be determined in advance as consisting of a set number of words, phrases, or concepts, and, in the most complex systems, a sophisticated natural language understanding component is required to process the user's utterances. The Circuit-Fix-It Shop system, to be presented in section 3, is an example of one type of agent-based system. Other types will be discussed in section 5.
2.4 Verification

In addition to the different levels of language understanding required by different types of dialogue system, there are also different methods for verifying the user's input. In the most basic state-based systems, in which user input is restricted to single words or phrases elicited at each state of the dialogue, the simplest verification strategy involves the system confirming that the user's words have been correctly recognised. The main choice is between confirmations associated with each state of the dialogue (i.e. every time a value is elicited the system verifies the value before moving on to the next state), or confirmations at a later point in the transaction. The latter option, which is illustrated in the example from the Nuance banking system in section 3, provides for a more natural dialogue flow. The more natural input permitted in frame-based systems also makes possible a more flexible confirmation strategy in which the system can verify a value that has just been elicited and, within the same utterance, ask the next question. This strategy of implicit verification is illustrated in the example from the Philips train timetable information system in section 3. Implicit verification provides for a more natural dialogue flow as well as a potentially shorter dialogue, and is made possible because the system is able to process the more complex user input that may arise when the user takes the initiative to correct the system's misrecognitions and misunderstandings. Finally, in agent-based systems, more complex methods of verification (or `grounding') are required along with decisions as to how and when the grounding is to be achieved. Verification will be discussed in greater detail in section 4.3.2 and some examples of verification strategies can be seen in the examples presented in section 3.
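The contrast between the two basic confirmation strategies can be shown schematically. The prompt wordings below are invented for illustration: explicit confirmation adds a separate yes/no turn, while implicit confirmation echoes the value to be verified inside the next question.

    # Sketch contrasting explicit and implicit confirmation of a value.
    def explicit_confirm(value, next_prompt):
        # Halts the dialogue flow until the user confirms the value.
        if input("Was that %s? " % value).strip().lower() != "yes":
            value = input("Please repeat. ").strip()
        return value, input(next_prompt + " ")

    def implicit_confirm(value, next_prompt):
        # Embeds the value to be verified in the next question; the user
        # need only object if it is wrong (objections not handled here).
        return value, input("OK, %s. %s " % (value, next_prompt))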
2.5 Knowledge sources for dialogue management

The dialogue manager may draw on a number of knowledge sources, which are sometimes referred to collectively as the dialogue model. A dialogue model might include the following types of knowledge relevant to dialogue management (a structural sketch follows the list):
A dialogue history: A record of the dialogue so far in terms of the propositions that have been discussed and the entities that have been mentioned. This representation provides a basis for conceptual coherence and for the resolution of anaphora and ellipsis.

A task record: A representation of the information to be gathered in the dialogue. This record, often referred to as a form, template, or status graph, is used to determine what information has not yet been acquired (see section 5.2). This record can also be used as a task memory [Aretoulaki and Ludwig 1999] for cases where a user wishes to change the values of some parameters, such as an earlier departure time, but does not need to repeat the whole dialogue to provide the other values that remain unchanged.

A world knowledge model: This model contains general background information that supports any common-sense reasoning required by the system, for example, that Christmas day is December 25.

A domain model: A model with specific information about the domain in question, for example, flight information.

A generic model of conversational competence: This includes knowledge of the principles of conversational turn-taking and discourse obligations - for example, that an appropriate response to a request for information is to supply the information or provide a reason for not supplying it.

A user model: This model may contain relatively stable information about the user that may be relevant to the dialogue - such as the user's age, gender, and preferences - as well as information that changes over the course of the dialogue, such as the user's goals, beliefs, and intentions.
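As a structural sketch only (the field names are invented, not drawn from any particular system), these knowledge sources might be grouped into a single record that the dialogue manager consults:

    # Hypothetical grouping of the dialogue model's knowledge sources.
    from dataclasses import dataclass, field

    @dataclass
    class DialogueModel:
        history: list = field(default_factory=list)        # propositions and entities so far
        task_record: dict = field(default_factory=dict)    # slots still to be filled
        world_knowledge: dict = field(default_factory=dict)  # e.g. {"christmas": "Dec 25"}
        domain_model: dict = field(default_factory=dict)   # e.g. flight information
        conversational_rules: list = field(default_factory=list)  # turn-taking, obligations
        user_model: dict = field(default_factory=dict)     # age, preferences, goals, beliefs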
These knowledge sources are used in different ways and to different degrees according to the dialogue strategy chosen. In the case of a state-based system these models, if they exist at all, are represented implicitly in the system. For example, the items of information and the sequence in which they are acquired are pre-determined and thus represented implicitly in the dialogue states. Similarly, if there is a user model, it is likely to be simple and to consist of a small number of elements that determine the dialogue flow. For example, the system could have a mechanism for looking up user information to determine whether the user has previous experience of this system. This information could then be used to allow different paths through the system (for example, with less verbose instructions), or to address user preferences without having to ask for them.

Frame-based systems require an explicit task model as this information is used to determine what questions still need to be asked. This is the mechanism used by these systems to control the dialogue flow. Generally the user model, if one exists, would not need to be any more sophisticated than that described for state-based systems. Agent-based systems, on the other hand, require complex dialogue and user models as well as mechanisms for using these models as a basis for decisions on how to control the dialogue. Information about the dialogue history and the user can be used to constrain how the system interprets the user's subsequent utterances and to determine what the system should say and how it should be said. These sorts of modelling involve representations of discourse structure, of intentions, goals and beliefs, and of dialogue as a collaborative activity. Various approaches to dialogue and user modelling in agent-based systems will be exemplified and discussed in later sections of this survey.
Table I. Dialogue control strategies

| Feature        | State-based | Frame-based | Agent-based |
|----------------|-------------|-------------|-------------|
| Input          | Single words or phrases | Natural language with concept spotting | Unrestricted natural language |
| Verification   | Explicit confirmation, either of each input or at end of transaction | Explicit and implicit confirmation | Grounding |
| Dialogue model | Information state represented implicitly in dialogue states; dialogue control represented explicitly with state diagram | Explicit representation of information states; dialogue control represented with control algorithm | Model of system's intentions, goals and beliefs; dialogue history; context |
| User model     | Simple model of user characteristics or preferences | Simple model of user characteristics or preferences | Model of user's intentions, goals and beliefs |
2.6 Summary

This section has provided a classification of spoken dialogue systems in terms of the type of dialogue control strategy used and has shown how the choice of dialogue control strategy is closely related to the form of the user's input as well as the use of particular error recovery mechanisms. These points are summarised in Table I. The next section presents three examples of spoken dialogue systems that illustrate these different control strategies.
3. EXAMPLES OF SPOKEN DIALOGUE SYSTEMS

3.1 State-based dialogue control: The Nuance automatic banking system

The Nuance automatic banking system enables callers to conduct transactions over the telephone such as paying a bill, transferring funds between accounts, or obtaining an account balance. The system conducts a dialogue in English with the user who can speak to the system using relatively unrestricted speech. It should be stressed that the following dialogues are based on recorded interactions with a demo system available in the UK (available at: +44 1223 516959), used here to illustrate some important aspects of state-based dialogue control. The shortcomings that are presented in some of the examples would have been addressed in a fully developed commercial application.
System: What would you like to do?
User: Pay a bill
System: What company would you like to pay?
User: Midland Bank
System: How much would you like to pay?
User: One hundred and twenty-five pounds fifty-seven pence
System: What date would you like the payment to be made on?
User: Monday
System: You want to pay HSBC one hundred twenty-five pounds and fifty-seven pence on April 5th? Is this correct?
User: Yes
System: Your payment will be sent. Would you like to pay another bill?

[Figure 1. Dialogue flow for payment of a bill: a state graph running from "Pay a bill" through "To which company?", "Amount?", and "Transfer date" to "Confirm: company, account, date", with a yes/no branch to "Pay another bill".]
In order to complete the transaction the system has to elicit three specific values from the caller - payee, amount, transfer date. These are confirmed and the transaction is completed with a return to the sub-menu for bill payments. Some points worth noting in this example are that the system is able to process phrases specifying currency amounts, is able to transform relative dates such as `Monday' into absolute dates such as `April 5th', and is able to handle multiple synonyms (`Midland Bank' or `HSBC').

The dialogue flow, which is system-directed with a pre-determined sequence of questions, can be modelled as a set of interconnected dialogue states, as shown in Figure 1.
Transferring amounts between accounts involves a different type of transaction in which the system can accept more complex natural language input from the caller, thus reducing the number of questions to be asked and the time required for the transaction:

System: What would you like to do?
User: I would like to transfer two hundred and fifty pounds from my current to my savings account
System: You want to transfer two hundred fifty pounds from current to savings. Is this correct?
User: Yes

This transaction is also modelled using dialogue states combined with multiple slot filling. Additional transitions need to be provided depending on which combinations of the values of amount, source account and destination account are provided by the user in the natural language input. Problems can arise if this functionality is not available throughout the dialogue, as illustrated in the following excerpt:
System1: What company would you like to pay?
User1: Abbey National
System2: How much would you like to pay?
User2: One hundred pounds next Monday
System3: What date would you like the payment to be made on?
User3: Next Monday
System4: You want to pay Abbey National one hundred pounds and sixty-one pence on April 5th? Is this correct?
User4: No
System5: Would you like to change the company, amount, or date?
User5: Amount
System6: How much would you like to pay?
User6: One hundred pounds
System7: You want to pay Abbey National one hundred pounds on April 5th? Is this correct?
User7: Yes
In User2 the user provides an over-informative answer, specifying not only the amount requested in System2 but also the date of the transaction. The system's next question (System3) indicates that either the date was misrecognised or that the system is not set up for multiple slot filling at this dialogue state. Whatever the reason, the system goes on to ask the question required at the next dialogue state, i.e. for the date of the transaction. Thus the user is requested to supply information that had just been given in the previous response. Moreover, as shown in System4, the system has recognised `next Monday' not as a date but as part of the amount, i.e. as `and sixty-one pence'. As shown in System5-User7, the system is able to recover from this error by getting the user to reconfirm the amount.
A final example illustrates some limitations on the system's mechanisms for error recovery:

System: Would you like to change the company, amount, or date?
User: The company and the date
System: Please say company, amount, or date

From this example it can be seen that the system is unable to correct more than one error at a time. Thus while the strategy of leaving the confirmation of the three items of information until they have all been elicited will normally result in a more natural and faster transaction, it may lead to a frustrating and rather unnatural dialogue where more than one item has to be changed.
The Nuance banking system is an advanced spoken dialogue system that employs state-based dialogue control in conjunction with natural language understanding and multiple slot filling. Given good recognition rates the system is likely to perform well and complete transactions efficiently and effectively. These sample interactions were chosen carefully to illustrate some potential problems that are likely to arise when a state-based system is extended to provide more sophisticated functionality. The examples would appear to indicate that the multiple slot filling afforded by the system's natural language processing is not available at each dialogue state. Although this can be easily remedied, a further complication is introduced as now the system has to maintain a dialogue model of which slots have been filled so as to direct the dialogue down alternative paths. With small numbers of slots this is not a problem, but with an increasing number of slots the combinations of state transitions soon reach unmanageable proportions. This is particularly the case when further transitions are added to deal with verification of the slot values. Thus while state-based dialogue control is useful for implementing simple transactions, introducing more sophisticated functionality may require additional mechanisms, such as an explicit dialogue model, that obscure the semantics of the state-based control. This issue will be discussed in greater detail in section 5. The next example will show how a frame-based system is able to handle these issues more effectively.
3.2 Frame-based dialogue control: The Philips automatic train timetable information system

The Philips automatic train timetable information system (available at: +49 241 604020) provides information over the telephone about train connections between 1200 German cities. The system conducts a natural dialogue in German with the user who can speak to the system using fluent, unrestricted speech. The aim of the dialogue is to enable the system to construct an appropriate database query that retrieves the information required by the user and then to present that information in a suitable form. This system falls within a class of similar systems concerned with the provision of information, such as travel schedules, banking, stocks and shares, and weather forecasts. The German version of the system is a research prototype, which is not connected with the official information service provided by German Rail (Deutsche Bahn AG). The following is an example of a dialogue with the Philips Automatic Train Timetable Information System, translated from the original German [Aust et al. 1995].
System1: Good morning. This is the automatic timetable information service. How can I help you?
User1: I need a connection from Hamburg to Munich.
System2: When would you like to go from Hamburg to Munich?
User2: Next Friday at eight o'clock.
System3: So you want to travel on Sunday the 20th of November at eight o'clock a.m.?
User3: No, on Friday.
System4: So you want to travel on Friday the 18th of November at eight o'clock a.m.?
User4: Yes.
System5: There is the following connection: With IntercityExpress 793, departure from Hamburg at 8:02, arrival in Munich at 14:05. Do you want me to repeat the connection?
User5: No, but I need a train for the return trip on Sunday.
System6: At what time would you like to go from Munich to Hamburg on Sunday the 20th of November?
User6: I have to arrive in Hamburg late in the evening.
System7: So you want to arrive in Hamburg between nine p.m. and eleven p.m.?
User7: Yes.
System8: There are three connections. This is the first one: With IntercityExpress 880, departure from Munich at 15:54, arrival in Hamburg at 21:58. Do you want the next connection?
User8: No, thanks, that one is fine.
System9: Do you need yet another connection?
User9: No, thank you.
System10: We wish you a pleasant journey. Thank you for calling. Goodbye.
It is informative to examine the dialogue flow in this example. The purpose of the system is to guide the user towards providing all the required values - departure city, destination city, departure time, and departure date - as well as ascertaining other values, such as whether the time is a.m. or p.m. and whether the time refers to the departure or arrival. These values are acquired over the course of the dialogue and once they have been acquired the system can execute a suitably constrained database query and provide one or more connections.

This system could have been implemented using state-based control in which the system would progress through a number of states eliciting and verifying the required values. Alternatively the system could have been implemented as a Question-Answer or Natural Language Database System in which the user would have been required to provide all the required parameters in one utterance, which could then have been translated directly into a database query. However, one of the aims of the designers of this system was to enable a more flexible interaction in which the user would not be constrained either to input one value at a time or to input all the values within one utterance. This flexibility is necessary as it cannot be determined in advance what a user might know regarding the information required to make a valid query. For example, the system may need to know if the user wishes to travel on an Inter-City train, requires a train with restaurant facilities, and so on. If the user is not aware of all the possibilities, the system has to issue relevant queries and elicit suitable values in order to find the best connection.
A second aspect of dialogue flow concerns the sequencing of the system's questions. There should be a logical order to the questions. This order may be largely determined by what information is to be elicited in a well-structured task such as a travel information enquiry. The disadvantage of a state-based approach combined with natural language processing capabilities is that users may produce over-informative answers that provide more information than the system has requested at that point. In the Philips example at System2-User2, the system's more open-ended prompt when would you like to go from Hamburg to Munich is ambiguous in that it can allow the user to supply departure time or date or both - as happens in User2. Even with a more constrained prompt such as on which day would you like to go from Hamburg to Munich the user might supply both date and time. A system that followed a predetermined sequence of questions might then ask at what time would you like to go from Hamburg to Munich - an unacceptable question as the time has already been given. The Philips system uses a status graph to keep track of which slots have already been filled. This mechanism will be described in greater detail in section 5.
A close examination of the dialogue also shows that the system is able to deal with recognition errors and misunderstandings. For example, in System3 the system attempts to confirm the departure date and time but has misrecognised the departure date and is corrected by the user in User3. More subtly, the system uses different strategies for confirmation. In System2 an implicit confirmation request is used in which the values for departure city and destination provided by the user in User1 are echoed back within the system's next question, which also includes a request for the value for the departure date and/or time. If the system's interpretation is correct, the dialogue can proceed smoothly to the next value to be obtained and the user does not have to confirm the previous values. Otherwise, if the system has misunderstood the input the user can correct the values before answering the next question. Conversely, an explicit confirmation request halts the dialogue flow and requires an explicit confirmation from the user. An example occurs in System3-User3 in which the system makes an explicit request for the confirmation of the departure date and time and the user corrects the date. The next exchange System4-User4 is a further example of an explicit confirmation request to verify the departure date and time.
One further aspect of the Philips system is its robustness. An example can be seen at System6-User6. In response to the system prompt for the departure time the user does not provide a direct response containing the required time but states a constraint on the arrival time, expressed vaguely as late in the evening. The system is able to interpret this expression in terms of a range (between 9 p.m. and 11 p.m.) and to find an appropriate departure time that meets this constraint. More generally, the system is robust enough to be able to handle a range of different expressions for dates and times (e.g. three days before Christmas, within this month) and to be able to deal with cases of missing and contradictory information.

The provision of information such as train times is a typical application of spoken dialogue technology. Philips has developed a system with similar functionality for Swiss Rail, which has been an official part of Swiss Rail's information service since 1996. Public reaction to the system has been favourable, with over 80% of the people who used the service rating it as "excellent". Strik et al. [1996] report on a project involving adaptation of the German system to the Dutch public transport network, while the European R&D project ARISE, which includes the Dutch, French and Italian railway operators, builds on earlier European projects and on the Philips systems to provide more elaborate services with a multilingual component. The strengths and weaknesses of frame-based dialogue control are discussed further in section 5.
3.3 Agent-based dialogue control for problem solving: the Circuit-Fix-It Shop system

Problem-solving tasks usually involve some degree of co-operation between agents, particularly when one of the agents is unable to solve the problem alone. Co-operation may take the form of assigning particular sub-tasks to particular agents, with the accomplishment of the sub-tasks contributing to the completion of the overall task. Such co-operative problem solving will require communication between the agents to report on the state-of-play of the problem solving and to supply the information required by the other agent. The next example illustrates collaborative problem solving in the Circuit-Fix-It Shop system [Smith and Hipp 1994]. Other agent-based systems will be discussed in section 5.

The Circuit-Fix-It Shop system helps users to fix an electronic circuit by engaging in a spoken dialogue with the user. Communication between system and user is motivated by the fact that neither participant is fully equipped to solve the problem alone. The system has all the necessary knowledge about the task but does not have the required sensory abilities to monitor the state of the world nor to perform actions in the world. These are the concern of the human user, who also has some knowledge of the task, which can range from that of a novice to that of an expert, depending on the individual user. The system employs user modelling - estimating the user's knowledge of particular objects and actions based on inferences from earlier exchanges in the dialogue - to determine the content of the information to be communicated to the user. Context-dependent expectations are also used to correct speech recognition hypotheses and to track when the user shifts to a different subdialogue. The following is an example of a dialogue between a user and the system:
System1: Is there a wire between connector 84 and connector 99?
User1: It is not there.
System2: Add a wire between connector 84 and connector 99.
User2: I need help.
System3: Obtain a wire.
User3: Done.
System4: Connect metal of one free end of the wire to connector 84.
User4: How?
System5: Bend back connector 84.
User5: Done.
System6: Insert free end of wire into connector 84.
User6: Wire between 84 and 99 is connecting.
System7: What is the LED displaying?
The dialogue evolves in the form of a proof, which can be illustrated using a goal tree. The goal tree represents the solution of the problem as it evolves dynamically. The system invokes rules to prove the goal in a top-down fashion, as in Prolog-style theorem proving. The proof may succeed using internally available knowledge, in which case no dialogue is required. However, the system is designed to deal with cases where the proof fails because the information required to complete the proof is not available to the system. In this case the system engages in dialogue with the user to obtain the missing information (described as `missing axioms') so that the proof can succeed.
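The missing-axiom mechanism can be sketched as a backward-chaining prover that, instead of failing on an unprovable goal, asks the user and records the answer. The facts and rules below are invented miniatures, not the system's actual axioms:

    # Sketch of missing-axiom theorem proving driving a dialogue.
    KNOWN = {}                                          # facts established so far
    RULES = {"circuit_ok": ["wire(84,99)", "led_ok"]}   # goal -> subgoals

    def prove(goal):
        if goal in KNOWN:
            return KNOWN[goal]
        if goal in RULES:
            # Prove a goal top-down via its subgoals, Prolog-style.
            return all(prove(sub) for sub in RULES[goal])
        # Missing axiom: the system cannot observe the world itself,
        # so it asks the user and records the answer as a new fact.
        answer = input("Is it true that %s? (yes/no) " % goal) == "yes"
        KNOWN[goal] = answer
        return answer

    # prove("circuit_ok") would ask about wire(84,99), then led_ok.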
At the beginning of the dialogue, the system does not know whether there is a wire between connector 84 and connector 99. As this is a missing axiom in the current proof, the system produces utterance System1 to ask the user. The state of the proof at this point is shown in the goal tree displayed in Figure 2.

[Figure 2. Goal tree before utterance System1: the goal fact(wire(84,99),exist,X) has a missing axiom, prompting System1: "Is there a wire between connector 84 and connector 99?"]

The user confirms that the wire is missing. From this the system can infer that the user knows the location of the connectors, and these facts are added to the user model.
Figure 3 shows the current state of the goal tree.

[Figure 3. Goal tree after utterance User1: from "it is not there" the system infers userknows(loc(84)) and userknows(loc(99)).]

So that the current goal can be completed, the system instructs the user to add a wire between the connectors.
This yields the goal tree shown in Figure 4.

[Figure 4. Goal tree after utterance System2: the missing axiom do(action(add,wire(84,99))) prompts System2: "Add a wire between connector 84 and connector 99."]

As the user does not know how to do this, a subgoal is inserted instructing the user on how to accomplish this task. This subgoal consists of the actions: locate connector 84, locate connector 99, obtain a wire, connect one end of wire to 84, and connect other end of wire to 99. These items are added to the goal tree depicted in Figure 5.
[Figure 5. Goal tree after utterance User2: inserted subgoal 1 expands into locate 84, locate 99, obtain wire, connect(end1,84), connect(end2,99).]
However, as the user model contains the information that the user can locate these connectors, instructions for the first two actions are not required and so the system proceeds with instructions for the third action, which is confirmed in User3, and for the fourth action. Here the user requires further instructions, which are given in System5 with the action confirmed in User5. At this point the user asserts that the wire between 84 and 99 is connecting, so that the fifth instruction to connect the second end to 99 is not required. A further missing axiom is discovered which leads the system to ask what the LED is displaying (System7).
3.4 Summary

The examples presented in this section have illustrated three different types of dialogue control strategy. The selection of a dialogue control strategy determines the degree of flexibility possible in the dialogue and places requirements on the technologies employed for processing the user's input and for correcting errors. There are many variations on the dialogue strategies illustrated here and these will be discussed in greater detail in section 5. The next section will examine the component technologies of spoken dialogue systems.
4. COMPONENTS OF A SPOKEN DIALOGUE SYSTEM

A spoken dialogue system involves the integration of a number of components that typically provide the following functionalities [Wyard et al. 1996]:

Speech recognition: The conversion of an input speech utterance, consisting of a sequence of acoustic-phonetic parameters, into a string of words;

Language understanding: The analysis of this string of words with the aim of producing a meaning representation for the recognised utterance that can be used by the dialogue management component;

Dialogue management: The control of the interaction between the system and the user, including the co-ordination of the other components of the system;

Communication with external system: For example, with a database system, expert system, or other computer application;

Response generation: The specification of the message to be output by the system;

Speech output: The use of text-to-speech synthesis or pre-recorded speech to output the system's message.

These components are examined in the following sub-sections in relation to their role in a spoken dialogue system (for a recent text on speech and language processing, see Jurafsky and Martin [2000]).
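In the simplest, linear arrangement these components form a pipeline. The sketch below wires together hypothetical stubs in that order; none of the names correspond to a real toolkit, and each stub stands in for a component described in the sub-sections that follow.

    # Hypothetical linear pipeline of spoken dialogue system components.
    def recognise(audio):
        return "transfer fifty pounds"              # stub: recognised word string
    def understand(words):
        return {"act": "transfer", "amount": 50}    # stub: meaning representation
    def manage(meaning, state):
        return "Please confirm the transfer.", state  # stub: response + new state
    def generate(response):
        return response                             # stub: message text
    def synthesise(text):
        print(text)                                 # stub: speech output

    def one_turn(audio, state):
        words = recognise(audio)
        meaning = understand(words)
        response, state = manage(meaning, state)
        synthesise(generate(response))
        return state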
4.1 Speech Recognition

The task of the speech recognition component of a spoken dialogue system is to convert the user's input utterance, which consists of a continuous-time signal, into a sequence of discrete units such as phonemes (units of sound) or words. One major obstacle is the high degree of variability in the speech signal. This variability arises from the following factors:

Linguistic variability: Effects on the speech signal caused by various linguistic phenomena. One example is co-articulation, i.e. the fact that the same phoneme can have different acoustic realisations in different contexts, determined by the phonemes preceding and following the sound in question;

Speaker variability: Differences between speakers, attributable to physical factors such as the shape of the vocal tract as well as factors such as age, gender, and regional origin; and differences within speakers, due to the fact that even the same words spoken on a different occasion by the same speaker tend to differ in terms of their acoustic properties. Physical factors such as tiredness, congested airways due to a cold, and changes of mood have a bearing on how words are pronounced, but the location of a word within a sentence and the degree of emphasis it is given are also factors which result in intra-speaker variability;

Channel variability: The effects of background noise, which can be either constant or transient, and of the transmission channel, such as the telephone network or a microphone.
The speech recognition component of a typical spoken dialogue application has to be able to cope with the following additional factors:

Speaker independence: As the application will normally be used by a wide variety of casual users, the recogniser cannot be trained on an individual speaker (or small number of speakers) who will use the system, as is the case for dictation systems; instead, for speaker-independent recognition samples have to be collected from a variety of speakers whose speech patterns should be representative of the potential users of the system. Speaker-independent recognition is more error-prone than speaker-dependent recognition.

Vocabulary size: The size of the vocabulary varies with the application and with the particular design of the dialogue system. Thus a carefully controlled dialogue may constrain the user to a vocabulary limited to a few words expressing the options that are available in the system, while in a more flexible system the vocabulary may amount to more than a thousand words.

Continuous speech: Users of spoken dialogue systems expect to be able to speak normally to the system and not, for example, in the isolated word mode employed in some dictation systems. However, it is difficult to determine word boundaries in continuous speech since there is no physical separation in the continuous-time speech signal.

Spontaneous conversational speech: Since the speech that is input to a spoken dialogue system is normally spontaneous and unplanned, it is typically characterised by disfluencies, such as hesitations and fillers (for example, umm and er), false starts, in which the speaker begins one structure then breaks off mid-way and starts again, and extralinguistic phenomena such as coughing. The speech recogniser has to be able to extract from the speech signal a sequence of words from which the speaker's intended meaning can be computed.
The basic process of speech recognition involves finding a sequence of words, using a set of models acquired in a prior training phase, and matching these with the incoming speech signal that constitutes the user's utterance. The models may be word models, in the case of systems with a small vocabulary, but more typically the models are of units of sound such as phonemes or triphones, which model a sound as well as its context in terms of the preceding and succeeding sounds. The most successful approaches view this pattern-matching as a probabilistic process which has to be able to account both for temporal variability - due to different durations of the sounds resulting from differences in speaking rate and the inherently inexact nature of human speech - and acoustic variability - due to the linguistic, speaker and channel factors described earlier. The following formula expresses this process:

    \hat{W} = \arg\max_{W} P(O \mid W) \, P(W)

In this formula \hat{W} represents the word sequence with the maximum a posteriori probability, while O represents the observation that is derived from the speech signal. Two probabilities are involved: P(O | W), known as the acoustic model, which has been derived through a training process and which is the probability that a sequence of words W will produce an observation O; and a language model P(W), derived from an analysis of a language corpus, giving the prior probability distribution assigned to the sequence of words W.
The observation O comprises a series of vectors representing acoustic features of the speech signal. These feature vectors are derived from the physical signal, which is sampled and then digitally encoded. Perceptually important speaker-independent features are extracted and redundant features are discarded.

Acoustic modelling is a process of mapping from the continuous speech signal to the discrete sounds of the words to be recognised. The acoustic model of a word is represented in Hidden Markov Models (HMMs), as in Figure 6. Each state in the HMM might represent a unit of sound, for example, the three sounds in the word dog. Transitions between the states, A = a_{12}, a_{13}, ..., a_{n1}, ..., a_{nn}, represent the probability of transitioning from one state to the next and model the temporal progression of the speech sounds. Due to variability in the duration of the sounds, a sound may spread across several frames so that the model can take a loop transition and remain in the same state. For example, if there were five frames for the word dog, the state sequence S_1, S_1, S_2, S_2, S_3 might be produced, reflecting the longer duration of the sounds representing d and o.

[Figure 6. A simple Hidden Markov model: a word model for dog with states d, o, g between start and end nodes, and the observation sequence o_1 ... o_t emitted with output probabilities b_i(o_t).]

A Hidden Markov Model
is doubly stochastic, as in addition to the transition probabilities the output of each state, B = b_i(o_t), is probabilistic. Instead of each state having a single unit of sound as output, all units of sound are potentially associated with each state, each with its own probability. The model is "hidden" because, given a particular sequence of output symbols, it is not possible to determine which sequence of states produced these output symbols. It is, however, possible to determine the sequence of states that has the highest probability of having generated a particular output sequence. In theory this would require a procedure that would examine all possible state sequences and compute their probabilities. In practice, because of the Markov assumption that being in a given state depends only on the previous state, an efficient dynamic programming procedure such as the Viterbi algorithm or A* decoding can be used to reduce the search space. If a state sequence is viewed as a path through a state-time lattice, at each point in the lattice only the path with the highest probability is selected.
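The dynamic programming step can be illustrated with a compact Viterbi decoder over a toy two-state HMM; all transition and output probabilities below are invented. At each time step only the best-scoring path into each state survives:

    # Viterbi decoding over a toy 2-state HMM with invented probabilities.
    STATES = ("S1", "S2")
    START = {"S1": 0.7, "S2": 0.3}              # initial probabilities
    TRANS = {"S1": {"S1": 0.6, "S2": 0.4},      # a_ij: transition probabilities
             "S2": {"S1": 0.1, "S2": 0.9}}
    OUT = {"S1": {"x": 0.8, "y": 0.2},          # b_i(o_t): output probabilities
           "S2": {"x": 0.3, "y": 0.7}}

    def viterbi(observations):
        # best[s] = (probability, path) of the best path ending in state s
        best = {s: (START[s] * OUT[s][observations[0]], [s]) for s in STATES}
        for obs in observations[1:]:
            best = {s: max(((p * TRANS[prev][s] * OUT[s][obs], path + [s])
                            for prev, (p, path) in best.items()),
                           key=lambda t: t[0])
                    for s in STATES}
        return max(best.values(), key=lambda t: t[0])

    print(viterbi(["x", "x", "y"]))   # -> (probability, most likely state path)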
The output of the acoustic modelling stage is a set of word hypotheses which can be examined to find the best word sequence, using a language model P(W). The language model contains knowledge about which words are more likely in a given sequence. Two types of model are possible. A finite state network predicts all the possible word sequences in the language model. This approach is useful if all the phrases that are likely to occur in the speech input can be specified in advance. The disadvantage is that perfectly legal strings that were not anticipated are ruled out. Finite state networks can be used to parse well-defined sequences such as expressions of time.

Alternatively, an N-gram model can be used. The use of N-grams involves computing the probability of a sequence of words as a product of the probabilities of each word, assuming that the occurrence of each word is determined by the preceding N-1 words. This relationship is expressed in the formula:
cedingN-1words. Thisrelationshipis expressedin theformula:
P(W)=P(w
1
;:::;w
n )=
N
Y
n=1 P(w
n jw
1
;:::;w
n 1 )
However,becauseofthehighcomputationalcostinvolvedin calculatingtheprob-
abilityofawordgivenalargenumberofprecedingwords,N-gramsareusuallyre-
ducedtobigrams(N=2)ortrigrams(N=3). ThusinabigrammodelP(w
i jw
i 1 )
theprobabilityof allpossiblenextwordsisbasedonlyonthecurrentword,while
inatrigrammodelP(w
i jw
i 2
;w
i 1
)itisbasedontwoprecedingwords. N-gram
modelsmayalsobebasedonclassesratherthanwordsi.e. thewordsaregrouped
intoclassesrepresentingeithersyntacticcategoriessuchasnounorverb,orseman-
tic categories,such asdaysof the week ornames of airports. A language model
reducesthe perplexity ofasystem,whichwill usuallyresultin greaterrecognition
accuracy. Perplexity is roughlydened asthe average branching factor, oraver-
age number of words, that might follow a given word. If the perplexity is low,
recognitionislikelytobemoreaccurateasthesearchspaceisreduced.
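A bigram model is simple enough to sketch directly. The toy corpus below is invented; the model estimates P(w_i | w_{i-1}) from counts and scores a candidate word sequence, the quantity a recogniser would combine with its acoustic scores:

    # Toy bigram language model estimated from an invented corpus.
    from collections import Counter

    corpus = "what time does the flight leave what time does a flight leave".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)

    def p_bigram(prev, word):
        # Maximum-likelihood estimate of P(word | prev); no smoothing here.
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

    def score(words):
        p = 1.0
        for prev, word in zip(words, words[1:]):
            p *= p_bigram(prev, word)
        return p

    print(score("what time does the flight leave".split()))   # plausible: > 0
    print(score("what time does the white leaf".split()))     # unseen bigram: 0.0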
The output of the speech recogniser may be a number of scored alternatives, as in the following example representing the recogniser's best guesses for the input string what time does the flight leave? [Wyard et al. 1996]:

(1) what time does the white leaf 1245.6
(2) what time does the flight leave 1250.1
(3) what time does a flight leave 1252.3
(4) what time did the flight leave 1270.1
(5) what time did a flight leave 1272.3

Sometimes there are only small differences between the alternatives, caused by one or two words that may not contribute to the meaning of the string. For this reason, the alternatives can be more economically represented in a directed graph or as a word lattice. The selection of the most likely sequence may be the responsibility of other system components. For example, if the domain of the dialogue system is flight enquiries, then the first sequence, which had the best score from the speech recogniser, would be discarded as contextually irrelevant. Similarly dialogue information would assist the choice between 2-3, which ask about a flight departure that has not yet taken place, and 4-5, which ask about some departure that has already happened.
As an alternative to returning the complete sequence of words that matches the acoustic signal, the recogniser can search for keywords. This technique is known as word spotting. Word spotting is useful for dealing with extraneous elements in the input, for example, detecting yes in the string well, uh, yes, that's right. The main difficulty for word spotting is to detect non-keyword speech. One method is to train the system with a variety of non-keyword examples, known as sink (or garbage) models. A word spotting grammar network can then be specified that allows any sequence of sink models in combination with the keywords to be recognised.

Users of spoken dialogue systems are generally constrained to having to wait until the system has completed its output before they can begin speaking. Once users are familiar with a system, they may wish to speed up the dialogue by interrupting the system. This is known as barge-in. The difficulty with simultaneous speech, which is common in human-human conversation, is that the incoming speech becomes corrupted with echo from the ongoing prompt, thus affecting the recognition. Various techniques are under development to facilitate barge-in.
4.1.1 Summary. This section has outlined the main characteristics of the speech recognition process, describing the uncertain and probabilistic nature of this process, in order to clarify the requirements that are put on the other system components. In a linear architecture the output of the speech recogniser provides the input to the language understanding module. Difficulties may arise for this component if the word sequence that is output does not constitute a legal sentence, as specified by the component's grammar. In any case, the design of the language understanding component needs to take account of the nature of the output from the speech recognition module. Similarly, in an architecture in which the dialogue management component interacts with each of the other components, one of the roles of dialogue management will be to monitor when the user's utterance has not been reliably recognised and to devise appropriate remedial steps. These issues will be discussed in greater detail in subsequent sections. For more extensive accounts of speech recognition, see, for example, Rabiner and Juang [1993] and Young and Bloothooft [1997]. For tutorial overviews, see Makhoul and Schwartz [1995] and Power [1996].
4.2 Language understanding

The role of the language understanding component is to analyse the output of the speech recognition component and to derive a meaning representation that can be used by the dialogue control component. Language understanding involves syntactic analysis, to determine the constituent structure of the recognised string (i.e. how the words group together), and semantic analysis, to determine the meanings of the constituents. These two processes may be kept separate at the representational level in order to maintain generalisability to other domains, but they tend to be combined during processing for reasons of efficiency. On the other hand, some approaches to language understanding may involve little or no syntactic processing and derive a semantic representation directly from the recognised string. The advantages and disadvantages of these approaches, and the particular problems involved in the processing of spoken language, will be reviewed in this section.
The theoretical foundations for language processing are to be found in linguistics, psychology, and computational linguistics. Current grammatical formalisms in computational linguistics share a number of key characteristics, of which the main ingredient is a feature-based description of grammatical units, such as words, phrases and sentences [Uszkoreit and Zaenen 1996]. These feature-based formalisms are similar to those used in knowledge representation research and data type theory.

Feature terms are sets of attribute-value pairs in which the values can be atomic symbols or further feature terms. Feature terms belong to types, which may be organised in a type hierarchy or as disjunctive terms, functional constraints, or sets. The following simple example shows a feature-based representation for the words lions, roar and roars as well as a simple grammar using the PATR-II formalism [Shieber 1986] that defines how the words can be combined in a well-formed sentence:
lexicon
lions: [cat: NP, head: [agreement: [number: plural, person: third]]]
roar: [cat: V, head: [form: finite, subject: [agreement: [number: plural, person: third]]]]
roars: [cat: V, head: [form: finite, subject: [agreement: [number: singular, person: third]]]]

grammar
S -> NP VP
  <S head> = <VP head>
  <S head subject> = <NP head>
VP -> V
  <VP head> = <V head>

The lexicon consists of complex feature structures describing the syntactically relevant characteristics of the words, such as whether they are singular or plural. The grammar consists of phrase structure rules and equations that determine how the words can be combined.
The means by which feature terms may be combined to produce well-formed feature terms is through the process of unification. For example, the words lions and roar can be combined as their features unify, whereas lions and roars cannot, as the agreement features are incompatible. This basic formalism has been used to account for a wide range of syntactic phenomena and, in combination with unification, to provide a standard approach to sentence analysis using string-combining and information-combining operations.
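The unification operation can be illustrated by recursive merging of nested attribute-value structures, failing on conflicting atomic values. This is a bare-bones sketch of the idea, not the algorithm of any particular formalism:

    # Minimal sketch of feature-structure unification over nested dicts.
    FAIL = None

    def unify(f1, f2):
        # Merge two feature terms; FAIL on incompatible atomic values.
        if isinstance(f1, dict) and isinstance(f2, dict):
            result = dict(f1)
            for key, value in f2.items():
                if key in result:
                    merged = unify(result[key], value)
                    if merged is FAIL:
                        return FAIL
                    result[key] = merged
                else:
                    result[key] = value
            return result
        return f1 if f1 == f2 else FAIL    # atomic values must match

    lions = {"agreement": {"number": "plural", "person": "third"}}
    roar  = {"agreement": {"number": "plural", "person": "third"}}
    roars = {"agreement": {"number": "singular", "person": "third"}}

    print(unify(lions, roar))    # succeeds: agreement features are compatible
    print(unify(lions, roars))   # FAIL (None): plural vs. singular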
Feature-based grammars are often subsumed under the term unification grammars. One major advantage of unification grammars is that they permit a declarative encoding of grammatical knowledge that is independent of any specific processing algorithm. A further advantage is that a similar formalism can be used for semantic representation, with the effect that the simultaneous use of syntactic and semantic constraints can improve the efficiency of the linguistic processing.
In computational semantics, sentences are analysed on the basis of their constituent structure, under the assumption of the principle of compositionality, i.e. that the meaning of a sentence is a function of the meanings of its parts. Each syntactic rule has a corresponding semantic rule, and the analysis of the constituent structure of the sentence leads to the semantic analysis of the sentence as the meanings of the individual constituents identified by the syntactic analysis are combined. The meaning representation from this form of semantic analysis is typically a logical formula in first order predicate calculus (FOPC) or some more powerful intermediate representation language such as Montague's intensional logic or Discourse Representation Theory (DRT). The advantage of representing the meaning of a sentence in a form such as a formula of FOPC is that it can be used to derive a set of valid inferences based on the inference rules of FOPC. For example, as Pulman [1996] shows, a query such as:

Does every flight from London to San Francisco stop over in Reykjavik?

cannot be answered straightforwardly by a relational database that does not store propositions of the form every X has property P. Instead a logical inference has to be made from the meaning of the sentence, based on the equivalence between every X has property P and there is no X that does not have property P. Based on this inference the system simply has to determine whether a non-stopping flight can be found, in which case the answer is no; otherwise it is yes.
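The inference can be sketched in a few lines of Python; the flight table and field names below are invented for illustration, not taken from any actual ATIS database:

    flights = [
        {"id": "BA283", "stopovers": ["Reykjavik"]},
        {"id": "VS019", "stopovers": []},        # a non-stop flight
    ]

    # "every X has property P" iff "there is no X lacking property P":
    # answer the universal query by searching for a counterexample.
    counterexample = any("Reykjavik" not in f["stopovers"] for f in flights)
    print("no" if counterexample else "yes")     # prints "no" for this data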
While linguistics and psychology provide a theoretical basis for computational linguistics, the characteristics of spoken language require additional (or even alternative) techniques. One problem is that naturally occurring text, both in written form, as in newspaper stories, and in spoken form, as in spoken dialogues, is far removed from the well-formed sentences that constitute the data for theoretical linguistics and psychology. In linguistics the main concern is with developing theories that can account for items of theoretical interest, often rare phenomena that demonstrate the wide coverage of the theory, while in psychology the main concern is with identifying the cognitive processes involved in language understanding. Traditionally a symbolic representation is used, with hand-crafted rules that produce a complete parsing of grammatically correct sentences but with a target coverage based on a relatively small set of exemplar sentences. When confronted with naturally occurring texts such as newspaper stories, these theoretically well-motivated grammars tend to generate a very large number of possible parses, due to ambiguous structures contained in the grammar rules, while, conversely, they often fail to produce the correct analysis of a given sentence, with a failure rate of more than 60% [Marcus 1995].
Spoken language introduces a further problem in that the output from the speech recogniser will often not have the form of a grammatically well-formed string that can be parsed by a conventional language understanding system. Rather it is likely to contain features of spontaneous speech, such as sentence fragments, after-thoughts, self-corrections, slips of the tongue, or ungrammatical combinations. The following examples of utterances (cited in Moore [1995]), from a corpus collected from subjects using either a simulated or an actual spoken language Air Travel Information System (ATIS), would not be interpreted by a traditional linguistic grammar:

What kind of airplane goes from Philadelphia to San Francisco Monday stopping in Dallas in the afternoon (first class flight)

(Do) (Do any of these flights) Are there any flights that arrive after five p.m.

The first example is a well-formed sentence followed by an additional fragment or after-thought, enclosed in brackets. The second example is a self-correction in which the words intended for deletion are enclosed in brackets.
Some of these performance phenomena occur sufficiently regularly that they can be described by special rules. For example, in some systems rules have been developed that can recognise and correct self-repairs in an utterance [Dowding et al. 1993; Heeman and Allen 1997]. A conventional grammar could be enhanced with additional rules to handle some of these phenomena, but it would be impossible to predict all the potential occurrences of these features of spontaneous speech in this way. An alternative approach is to develop more robust methods for processing spoken language.
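As a rough indication of what such a rule might look like, the following deliberately crude Python sketch deletes an abandoned first attempt when an utterance restarts on its opening word. The cited systems use far richer syntactic and acoustic cues, so this is no more than an illustration:

    def remove_retraces(words):
        """If the utterance restarts on its opening word, drop the abandoned
        first attempt (the reparandum) and try again on the remainder."""
        for restart in range(1, len(words)):
            if words[restart] == words[0]:
                return remove_retraces(words[restart:])
        return words

    print(remove_retraces("do do any of these flights".split()))
    # -> ['do', 'any', 'of', 'these', 'flights']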
Robust parsing aims to recover syntactic and semantic information from unrestricted text that contains features that are not accounted for in hand-crafted grammars. Robust parsing often involves partial parsing, in which the aim is not to perform a complete analysis of the text but to recover chunks, such as non-recursive noun phrases, that can be used to extract the essential items of meaning in the text. Thus the aim is to achieve a broad coverage of a representative sample of language which represents a reasonable approximate solution to the analysis of the text [Abney 1997]. In some systems mixed approaches are used, such as first attempting to carry out a full linguistic analysis of the input and only resorting to robust techniques if this is unsuccessful. BBN's Delphi system [Stallard and Bobrow 1992], MIT's TINA system [Seneff 1992] and SRI International's Gemini system [Dowding et al. 1993] work in this way. As Moore [1995] reports, different results have been obtained. The SRI team found that a combination of detailed linguistic analysis and robust processing resulted in better performance than robust processing alone, while the best performing system at the same evaluation (the November 1992 ATIS evaluation) was the CMU Phoenix system, which uses only robust processing methods and does not attempt to account for every word in an utterance.
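The flavour of partial parsing by concept spotting can be conveyed with a simple pattern-matching sketch in Python; the patterns and slot names are invented here and only loosely follow the ATIS domain:

    import re

    CITY = r"(boston|dallas|philadelphia|san francisco)"
    PATTERNS = {
        "origin":      re.compile(r"from " + CITY),
        "destination": re.compile(r"to " + CITY),
        "stopover":    re.compile(r"stopping in " + CITY),
    }

    def spot_concepts(utterance):
        """Fill whatever task slots can be recovered; ignore everything else."""
        slots = {}
        for slot, pattern in PATTERNS.items():
            match = pattern.search(utterance.lower())
            if match:
                slots[slot] = match.group(1)
        return slots

    # Works even on ill-formed input that a full grammar would reject:
    print(spot_concepts("uh what goes from Philadelphia to San Francisco "
                        "Monday stopping in Dallas in the afternoon"))
    # {'origin': 'philadelphia', 'destination': 'san francisco',
    #  'stopover': 'dallas'}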
4.2.1 Integration of the speech recognition and natural language understanding components. So far it has been assumed that the speech recogniser and the natural language understanding module are connected serially and that the speech module outputs a single string to be analysed by the language understanding module. Typically, however, the output from the speech recognition component is a set of ranked hypotheses, of which only a few will make sense when subjected to syntactic and semantic analysis. The most likely hypothesis may turn out not to be the string that is ranked as the best set of words identified by the speech recognition component (see the example in section 4.1). What this implies is that, in addition to interpreting the string (or strings) output by the speech recogniser to provide a semantic interpretation, the language understanding module can provide an additional knowledge source to constrain the output of the speech recogniser. This in turn has implications for the system architecture, in particular for the ways in which the speech recognition and natural language understanding components can be linked or integrated.
The standard approach to integration involves selecting as a preferred hypothesis the string with the highest recognition score that can be processed by the natural language component. The disadvantage of this approach is that strings may be rejected as unparsable that nevertheless represent what the speaker had actually said. In this case the recogniser would be over-constrained by the language component. Alternatively, if robust parsing were applied, the recogniser could be under-constrained, as a robust parser will attempt to make sense out of almost any word string.
One alternative approach to integration is word lattice parsing, in which the recogniser produces a set of scored word hypotheses and the natural language module attempts to find a grammatical utterance spanning the input signal that has the highest acoustic score. This approach becomes unacceptable in the case of word lattices containing large numbers of hypotheses, particularly when there is a large degree of word boundary uncertainty. Another alternative is to use N-best filtering, in which the recogniser outputs the N best hypotheses (where N may range from 10 to 100 sentence hypotheses), and these are then ranked by the language understanding component to determine the best-scoring hypothesis [Price 1996]. This approach has the advantage of simplicity but the disadvantage of a high computational cost given a large value for N. Many practical systems have, however, produced acceptable results with values as low as N=5, using robust processing if strict grammatical parsing was not successful with the top five recognition hypotheses [Kubala et al. 1992].
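A minimal sketch of N-best filtering with a robust fall-back is given below in Python; parse and robust_parse are invented stand-ins for a real grammar and robust processor:

    def parse(words):
        """Stand-in for strict grammatical analysis: only one toy template
        parses here; a real system would apply a full grammar."""
        if words.startswith("show me flights"):
            return {"act": "list_flights", "string": words}
        return None

    def robust_parse(words):
        """Stand-in for robust processing: always yields some interpretation."""
        return {"act": "unknown", "string": words}

    def select_interpretation(n_best):
        """n_best: list of (word_string, recognition_score), best first."""
        for words, score in n_best:          # try strict parsing in rank order
            interpretation = parse(words)
            if interpretation is not None:
                return interpretation
        return robust_parse(n_best[0][0])    # fall back on the top hypothesis

    n_best = [
        ("show me lights to boston", -410.2),    # top-ranked but misrecognised
        ("show me flights to boston", -412.7),   # parses, so it is preferred
    ]
    print(select_interpretation(n_best))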
4.2.2 Some solutions. Various solutions have been adopted to the problem of deriving a semantic representation from the string provided by the speech recognition component. These include comprehensive linguistic analysis, methods for dealing with ill-formed and incomplete input, and methods involving concept spotting. Some of these will be briefly reviewed in the following paragraphs.
4.2.2.1 SUNDIAL. In the SUNDIAL project [Peckham 1993], which was concerned with travel information in English, French, German, and Italian, several different approaches were adopted, with the following common features:

- a rich linguistic analysis;
- robust methods for handling partial and ill-formed input;
- a semantic representation language for task-oriented dialogues.
Linguistic analysis in the German version is based on a chart parser using a unification categorial grammar [Eckert and Niemann 1994]. Syntactic and semantic structures are built in parallel by unifying complex feature structures during parsing. The aim is to find a consistent maximal edge of the utterance, but if no single edge can be found, the best interpretation is selected for the partial descriptions returned by the chart parser. These partial descriptions are referred to as utterance field objects (UFOs). Various scoring measures are applied to the chart edges to determine the best interpretation. Additionally, some features of spontaneous speech, such as pauses, filled pauses, and ellipses, are represented explicitly in the grammar. The following example illustrates the use of UFOs in the analysis of the string I want to go - at nine o'clock from Koeln [Eckert and Niemann 1994]:
U1: syntax: [string: 'I want to go']
    semantics: [type: want, the_agent: [type: speaker], the_theme: [type: go]]
U2: syntax: [string: 'at nine o'clock']
    semantics: [type: time, the_hour: 9]
U3: syntax: [string: 'from Koeln']
    semantics: [type: go, the_source: [type: location, the_city: koeln]]
This sequence of UFOs is a set of partial descriptions that cannot be combined into a longer spanning edge, as U2, an elliptical construction, is not compatible with U1 and U3. However, it is still possible to build a semantic representation from these partial descriptions, as shown in this example.
This example also illustrates the semantic interface language (SIL), which is used in SUNDIAL to pass the content of messages between modules. Two different levels of detail are provided in SIL, both in terms of typed feature structures: a linguistically-oriented level, as shown above, and a task-oriented level, which contains information relevant to an application, such as relations in the application database. The task-oriented representation for the partial descriptions in the example above would be:
U1: [task_param: none]
U2: [task_param: [source_time: 900]]
U3: [task_param: [source_city: koeln]]
This task-oriented representation is used by the dialogue manager to determine whether the information elicited from the user is sufficient to permit database access or whether further parameters need to be elicited.
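A minimal Python sketch of this sufficiency check follows; the required slot set, including the hypothetical dest_city slot, is invented for illustration:

    REQUIRED = {"source_city", "dest_city", "source_time"}   # invented slots

    def merge_task_params(ufos):
        """Collect the task parameters contributed by each partial description."""
        params = {}
        for ufo in ufos:
            params.update(ufo)
        return params

    ufos = [{}, {"source_time": 900}, {"source_city": "koeln"}]  # as above
    params = merge_task_params(ufos)
    missing = REQUIRED - params.keys()
    if missing:
        print("elicit:", sorted(missing))      # -> elicit: ['dest_city']
    else:
        print("query database with", params)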
Reporting on a comparative evaluation between earlier versions of the system, which did not include a robust semantic analysis, and a later version that did, Eckert and Niemann [1994] found a much better dialogue completion rate in the later system, even though the word accuracy rate (the results from the speech recogniser) had remained roughly constant across the systems.
4.2.2.2 SpeechActs. The SpeechActs system [Martin et al. 1996], which enables prof