
MICHAEL F. MCTEAR
University of Ulster

Spoken dialogue systems allow users to interact with computer-based applications such as databases and expert systems by using natural spoken language. The origins of spoken dialogue systems can be traced back to Artificial Intelligence research in the 1950s concerned with developing conversational interfaces. However, it is only within the last decade or so, with major advances in speech technology, that large-scale working systems have been developed and, in some cases, introduced into commercial environments. As a result many major telecommunications and software companies have become aware of the potential for spoken dialogue technology to provide solutions in newly developing areas such as computer-telephony integration. Voice portals, which provide a speech-based interface between a telephone user and web-based services, are the most recent application of spoken dialogue technology. This article describes the main components of the technology - speech recognition, language understanding, dialogue management, communication with an external source such as a database, language generation, speech synthesis - and shows how these component technologies can be integrated into a spoken dialogue system. The article describes in detail the methods that have been adopted in some well-known dialogue systems, explores different system architectures, considers issues of specification, design and evaluation, reviews some currently available dialogue development toolkits, and outlines prospects for future development.

Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing - discourse, speech recognition and synthesis; H.5.2 [Information Interfaces and Presentation]: User Interfaces - natural language, voice I/O

General Terms: Human Factors

Additional Key Words and Phrases: Dialogue management, human-computer interaction, language generation, language understanding, speech recognition, speech synthesis

1. INTRODUCINGSPOKENDIALOGUETECHNOLOGY

The 'conversational computer' has been the goal of researchers in speech technology and artificial intelligence (AI) for more than 30 years. A number of large-scale research programmes have addressed this goal, including the DARPA Communicator Project, Japan's Fifth Generation programme, and the European Union's ESPRIT and Language Engineering programmes. The impression of effortless spontaneous conversation with a computer has been fostered by examples from science fiction such as HAL in 2001: A Space Odyssey or the computer on the Star Ship Enterprise.

Author's address: Michael McTear, School of Information and Software Engineering, University of Ulster, Newtownabbey BT37 0QB, Northern Ireland, UK.

Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 20YY ACM 0000-0000/20YY/0000-0001 $5.00


It is only recently, however, that spoken language interaction with computers has become a practical possibility both in scientific as well as in commercial terms. This is due to advances in speech technology, language processing and dialogue modelling, as well as the emergence of faster and more powerful computers to support these technologies. Applications such as voice dictation and the control of appliances using voice commands are appearing on the market and an ever-increasing number of software and telecommunications companies are seeking to incorporate speech technology into their products. It is important, however, to be aware of the limitations of these applications. Commonly statements are made in sales and marketing literature such as 'Talk to your computer as you would talk to your next-door neighbour' or 'Teach your computer the art of conversation'. However, the technologies involved would not be sufficient to enable a computer to engage in a natural conversation with a user. Voice dictation systems provide a transcription of what the user dictates to the system, but the system does not attempt to interpret the user's input nor to discuss it with the user. Command-and-control applications enable users to perform commands with voice input that would otherwise be performed using the keyboard or mouse. The computer recognises the voice command and carries out the action, or reports that the command was not recognised. No other form of dialogue is involved. Similar restrictions apply to most other forms of voice-based system in current use.

Spoken dialogue systems, on the other hand, can be viewed as an advanced application of spoken language technology. Spoken dialogue systems provide an interface between the user and a computer-based application that permits spoken interaction with the application in a relatively natural manner. In so doing, spoken dialogue systems subsume most of the major fields of spoken language technology, including speech recognition and speech synthesis, language processing, and dialogue management.

The aim of the current survey is to describe the essential characteristics of spoken dialogue technology at a level of technical detail that should be accessible to computer scientists who are not specialists in speech recognition and computational linguistics. The survey provides an overview for those wishing to research into or develop spoken dialogue systems, and hopefully also for those who are already experienced in this field. Most published work to date on spoken dialogue systems tends to report on the design, implementation, and evaluation of individual systems or projects, as would be expected with an emerging technology. The present paper will not attempt to survey the growing number of spoken dialogue systems currently in existence but rather will focus on the underlying technologies, using examples of particular systems to illustrate commonly occurring issues.[1]

[1] Inevitably there are omissions, in some cases of well-known and important systems, but this is unavoidable, as the aim is not to provide a comprehensive review of dialogue systems but to focus on the general issues of the technology. Interested readers can follow up particular systems in the references provided at the end of the survey and in Appendix A.

1.1 Overview of the paper

The remainder of the paper is structured as follows. In the next section spoken dialogue systems are defined as computer systems that use spoken language to interact with users to accomplish a task. Dialogue systems are classified in terms of different control strategies and some examples are presented in section 3 that illustrate this classification and give a feel for the achievements as well as the limitations of current technology. Section 4 describes the components of a spoken dialogue system - speech recognition, language understanding, dialogue management, external communication, response generation, and text-to-speech synthesis. The key to a successful dialogue system is the integration of these components into a working system. Section 5 reviews a number of architectures and dialogue control strategies that provide this integration. Methodologies to support the specification, design, and evaluation of a spoken dialogue system are reviewed in section 6. Particular methods have evolved for specifying system requirements, such as user studies, the use of speech corpora, and Wizard-of-Oz studies. Methods have also been developed for the evaluation of dialogue systems that go beyond the methods used for evaluation of the individual elements such as the speech recognition and spoken language understanding components. This section also examines some current work on guidelines and standards for spoken language systems. A recent development is the emergence of toolkits and platforms to support the construction of spoken dialogue systems, similar to the toolkits and development platforms that are used in expert systems development. Some currently available toolkits are reviewed and evaluated in section 7. Finally, section 8 examines directions for future research in spoken dialogue technology.

2. SPOKEN DIALOGUE SYSTEMS - A DEFINITION

Spoken dialogue systems have been defined as computer systems with which humans interact on a turn-by-turn basis and in which spoken natural language plays an important part in the communication [Fraser 1997]. The main purpose of a spoken dialogue system is to provide an interface between a user and a computer-based application such as a database or expert system. There is a wide variety of systems that are covered by this definition, ranging from question-answer systems that answer one question at a time to 'conversational' systems that engage in an extended conversation with the user. Furthermore, the mode of communication can range from minimal natural language, consisting perhaps of only a small set of words such as the digits 0-9 and the words yes and no, through to large vocabulary systems supporting relatively free-form input. The input itself may be spoken or typed and may be combined with other input modes such as DTMF (touch-tone) input, while the output may be spoken or displayed as text on a screen, and may be accompanied by visual output in the form of tables or images.

Spoken dialogue systems enable casual and naive users to interact with complex computer applications in a natural way using speech. Current IVR (Interactive Voice Response) systems limit users in what they can say and how they can say it. However, users of speech-based computer systems often do not know exactly what information they require and how to obtain it - they require the support of the computer to determine their precise requirements. For this reason it is essential that speech-based computer systems should be able to engage in a dialogue with users rather than simply respond to predetermined spoken commands. At the same time spoken dialogue systems are more restricted than conversational computers in that their conversational topics are limited, usually to a single domain such as flight enquiries.

Spoken dialogue systems can be classified into three main types, according to the methods used to control the dialogue with the user:

(1) Finite-state (or graph) based systems
(2) Frame-based systems
(3) Agent-based systems

The type of dialogue control strategy used has a bearing on how the system accomplishes two of its main tasks: processing the user's input and recovering from errors.

2.1 Finite-state based systems

In a finite-state based system the user is taken through a dialogue consisting of a sequence of pre-determined steps or states. Most commercially available spoken dialogue systems use this form of dialogue control strategy. The dialogue flow is specified as a set of dialogue states with transitions denoting various alternative paths through the dialogue graph. The system maintains control of the dialogue, produces prompts at each dialogue state, recognises (or rejects) specific words and phrases in response to the prompt, and produces actions based on the recognised response. The following is an example of an interaction with a basic finite-state based system in which the system verifies the user's input at each state of the dialogue:

System: What is your destination?
User: London
System: Was that London?
User: Yes
System: What day do you want to travel?
User: Friday
System: Was that Sunday?
User: No
System: What day do you want to travel?

Generally systems with state-based dialogue control restrict the user's input to single words or phrases that provide responses to carefully designed system prompts. A major advantage of this form of dialogue control is that the required vocabulary and grammar for each state can be specified in advance, resulting in more constrained speech recognition and language understanding. However, the disadvantage is that these systems restrict the user's input to predetermined words and phrases, making correction of misrecognised items difficult as well as inhibiting the user's opportunity to take the initiative and ask questions or introduce new topics. If augmented with a natural language component a state-based system can accept more natural input in the form of sentences or partial sentences. Furthermore, with a simple data structure such as a form or frame that keeps track of which information the user has provided and what the system still needs to know, a more flexible and more natural dialogue flow is possible. Verification can also be delayed until the system has gathered all the required information. The Nuance demo banking system to be described in section 3 is an example of a state-based system with these additional functionalities.
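To make the idea of a dialogue graph concrete, the following sketch shows one way such a control strategy might be encoded. It is a minimal, invented illustration rather than the implementation of any system discussed in this survey: each state carries a prompt and a transition table keyed on the recognised response, and any response not in the table re-enters the same state, which yields the re-prompting behaviour shown in the dialogue above.

# Minimal sketch of finite-state dialogue control (illustrative only).
DIALOGUE_GRAPH = {
    "ask_day": {"prompt": "What day do you want to travel?",
                "transitions": {"friday": "confirm_day",
                                "sunday": "confirm_day"}},
    "confirm_day": {"prompt": "Was that {value}?",
                    "transitions": {"yes": "done", "no": "ask_day"}},
    "done": {"prompt": "Thank you.", "transitions": {}},
}

def run_dialogue(start, recognise=input):
    # `recognise` stands in for the speech recognition component.
    state, value = start, None
    while state != "done":
        node = DIALOGUE_GRAPH[state]
        print("System:", node["prompt"].format(value=value))
        heard = recognise().strip().lower()
        if state == "ask_day":
            value = heard                       # store the elicited value
        state = node["transitions"].get(heard, state)  # unrecognised -> re-ask
    print("System:", DIALOGUE_GRAPH["done"]["prompt"])

# Example: run_dialogue("ask_day")

Note how the vocabulary accepted at each state is fixed in advance in the transition table - exactly the property that makes recognition more constrained but the dialogue less flexible.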

2.2 Frame-based systems

In a frame (or template) based system the user is asked questions that enable the system to fill slots in a template in order to perform a task such as providing train timetable information. In this type of system the dialogue flow is not pre-determined but depends on the content of the user's input and the information that the system has to elicit. For example:

System: What is your destination?
User: London
System: What day do you want to travel?
User: Friday

System: What is your destination?
User: London on Friday around 10 in the morning
System: I have the following connection ...

In the first example the user provides one item of information at a time and the system performs rather like a state-based system. However, if the user provides more than the requested information, as in the second example, the system can accept this information and check if any additional items of information are required before searching the database for a connection. Frame-based systems function like production systems, taking a particular action based on the current state of affairs. The questions and other prompts that the system should ask can be listed along with the conditions that have to be true for a particular question or prompt to be relevant. Some form of natural language input is required by frame-based systems to permit the user to respond more flexibly to the system's prompts, as in the second example. Natural language is also required to correct errors of recognition or understanding by the system. Generally, however, it is sufficient for the system to be able to recognise the main concepts in the user's utterance. The Philips train timetable information system, to be described in section 3, is an example of a frame-based system.
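The slot-filling mechanism can be sketched as follows. This is an invented miniature, not the Philips system: the frame lists the slots to be filled, the next prompt is chosen from whichever slot is still empty, and `parse` is a toy stand-in for the language understanding component (a crude concept spotter over an invented vocabulary).

# Minimal sketch of frame-based dialogue control (illustrative only).
PROMPTS = {"destination": "What is your destination?",
           "day": "What day do you want to travel?",
           "time": "At what time do you want to travel?"}

def parse(utterance):
    # Toy concept spotter: pull out whatever slot values it can find.
    found = {}
    words = utterance.lower().split()
    for day in ("monday", "friday", "sunday"):
        if day in words:
            found["day"] = day
    if "london" in words:
        found["destination"] = "London"
    if "morning" in words:
        found["time"] = "morning"
    return found

def run(frame, listen=input):
    while any(value is None for value in frame.values()):
        slot = next(s for s, v in frame.items() if v is None)
        print("System:", PROMPTS[slot])
        frame.update(parse(listen()))   # accepts over-informative answers
    print("System: I have the following connection ...")

# Example: run({"destination": None, "day": None, "time": None})

Because the loop asks only about empty slots, an over-informative answer such as 'London on Friday around 10 in the morning' fills three slots at once and the remaining questions are simply skipped.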

2.3 Agent-based systems

Agent-based or AI systems are designed to permit complex communication between the system, the user and the underlying application in order to solve some problem or task. There are many variants on agent-based systems, depending on what particular aspects of intelligent behaviour are included in the system. The following dialogue, taken from Sadek and de Mori [1998], illustrates a dialogue agent that engages in mixed-initiative co-operative dialogue with the user:

User: I'm looking for a job in the Calais area. Are there any servers?
System: No, there aren't any employment servers for Calais. However, there is an employment server for Pas-de-Calais and an employment server for Lille. Are you interested in one of these?

In this example the system's answer to the user's request is negative. But rather than simply responding 'no', the system attempts to provide a more co-operative response that might address the user's needs.

In agent-based systems communication is viewed as interaction between two agents, each of which is capable of reasoning about its own actions and beliefs, and sometimes also about the actions and beliefs of the other agent. The dialogue model takes the preceding context into account with the result that the dialogue evolves dynamically as a sequence of related steps that build on each other. Generally there are mechanisms for error detection and correction, and the system may use expectations to predict and interpret the user's next utterances. These systems tend to be mixed initiative, which means that the user can take control of the dialogue, introduce new topics, or make contributions that are not constrained by the previous system prompts. For this reason the form of the user's input cannot be determined in advance as consisting of a set number of words, phrases, or concepts, and, in the most complex systems, a sophisticated natural language understanding component is required to process the user's utterances. The Circuit-Fix-It Shop system, to be presented in section 3, is an example of one type of agent-based system. Other types will be discussed in section 5.

2.4 Verification

In addition to the different levels of language understanding required by different types of dialogue system, there are also different methods for verifying the user's input. In the most basic state-based systems, in which user input is restricted to single words or phrases elicited at each state of the dialogue, the simplest verification strategy involves the system confirming that the user's words have been correctly recognised. The main choice is between confirmations associated with each state of the dialogue (i.e. every time a value is elicited the system verifies the value before moving on to the next state), or confirmations at a later point in the transaction. The latter option, which is illustrated in the example from the Nuance banking system in section 3, provides for a more natural dialogue flow. The more natural input permitted in frame-based systems also makes possible a more flexible confirmation strategy in which the system can verify a value that has just been elicited and, within the same utterance, ask the next question. This strategy of implicit verification is illustrated in the example from the Philips train timetable information system in section 3. Implicit verification provides for a more natural dialogue flow as well as a potentially shorter dialogue, and is made possible because the system is able to process the more complex user input that may arise when the user takes the initiative to correct the system's misrecognitions and misunderstandings. Finally, in agent-based systems, more complex methods of verification (or 'grounding') are required along with decisions as to how and when the grounding is to be achieved. Verification will be discussed in greater detail in section 4.3.2 and some examples of verification strategies can be seen in the examples presented in section 3.
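As a rough illustration of the contrast between the two basic strategies, the following sketch generates an explicit confirmation turn and an implicit one. The prompt wording is invented and does not come from any of the systems discussed here.

def explicit_confirmation(value):
    # Halts the dialogue flow: a dedicated yes/no verification turn.
    return f"Was that {value}?"

def implicit_confirmation(value, next_question):
    # Echoes the just-elicited value inside the next question, so the
    # user can correct it without a separate confirmation turn.
    return f"You want to go to {value}. {next_question}"

print(explicit_confirmation("London"))
# Was that London?
print(implicit_confirmation("London", "What day do you want to travel?"))
# You want to go to London. What day do you want to travel?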

2.5 Knowledge sources for dialogue management

The dialogue manager may draw on a number of knowledge sources, which are sometimes referred to collectively as the dialogue model. A dialogue model might include the following types of knowledge relevant to dialogue management (a data-structure sketch follows the list):

A dialogue history: A record of the dialogue so far in terms of the propositions that have been discussed and the entities that have been mentioned. This representation provides a basis for conceptual coherence and for the resolution of anaphora and ellipsis.

A task record: A representation of the information to be gathered in the dialogue. This record, often referred to as a form, template, or status graph, is used to determine what information has not yet been acquired (see section 5.2). This record can also be used as a task memory [Aretoulaki and Ludwig 1999] for cases where a user wishes to change the values of some parameters, such as an earlier departure time, but does not need to repeat the whole dialogue to provide the other values that remain unchanged.

A world knowledge model: This model contains general background information that supports any common sense reasoning required by the system, for example, that Christmas day is December 25.

A domain model: A model with specific information about the domain in question, for example, flight information.

A generic model of conversational competence: This includes knowledge of the principles of conversational turn-taking and discourse obligations - for example, that an appropriate response to a request for information is to supply the information or provide a reason for not supplying it.

A user model: This model may contain relatively stable information about the user that may be relevant to the dialogue - such as the user's age, gender, and preferences - as well as information that changes over the course of the dialogue, such as the user's goals, beliefs, and intentions.
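As a rough indication of how these knowledge sources might be grouped in an implementation, the following sketch collects them into a single object. The field names and types are invented for illustration rather than drawn from any system discussed in this survey.

from dataclasses import dataclass, field

@dataclass
class DialogueModel:
    # Dialogue history: propositions discussed and entities mentioned so far.
    history: list = field(default_factory=list)
    # Task record: the form/template of slots still to be acquired.
    task_record: dict = field(default_factory=dict)
    # World knowledge, e.g. that Christmas day is December 25.
    world_knowledge: dict = field(default_factory=dict)
    # Domain model, e.g. flight information.
    domain: dict = field(default_factory=dict)
    # Generic conversational competence: turn-taking and discourse obligations.
    conversational_rules: list = field(default_factory=list)
    # User model: stable attributes plus evolving goals and beliefs.
    user_model: dict = field(default_factory=dict)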

These knowledge sources are used in different ways and to different degrees according to the dialogue strategy chosen. In the case of a state-based system these models, if they exist at all, are represented implicitly in the system. For example, the items of information and the sequence in which they are acquired are pre-determined and thus represented implicitly in the dialogue states. Similarly, if there is a user model, it is likely to be simple and to consist of a small number of elements that determine the dialogue flow. For example, the system could have a mechanism for looking up user information to determine whether the user has previous experience of this system. This information could then be used to allow different paths through the system (for example, with less verbose instructions), or to address user preferences without having to ask for them.

Frame-based systems require an explicit task model as this information is used to determine what questions still need to be asked. This is the mechanism used by these systems to control the dialogue flow. Generally the user model, if one exists, would not need to be any more sophisticated than that described for state-based systems. Agent-based systems, on the other hand, require complex dialogue and user models as well as mechanisms for using these models as a basis for decisions on how to control the dialogue. Information about the dialogue history and the user can be used to constrain how the system interprets the user's subsequent utterances and to determine what the system should say and how it should be said. These sorts of modelling involve representations of discourse structure, of intentions, goals and beliefs, and of dialogue as a collaborative activity. Various approaches to dialogue and user modelling in agent-based systems will be exemplified and discussed in later sections of this survey.

Table I. Dialogue control strategies

Input:
  State-based: single words or phrases
  Frame-based: natural language with concept spotting
  Agent-based: unrestricted natural language

Verification:
  State-based: explicit confirmation, either of each input or at the end of the transaction
  Frame-based: explicit and implicit confirmation
  Agent-based: grounding

Dialogue model:
  State-based: information state represented implicitly in dialogue states; dialogue control represented explicitly with a state diagram
  Frame-based: explicit representation of information states; dialogue control represented with a control algorithm
  Agent-based: model of the system's intentions, goals and beliefs; dialogue history; context

User model:
  State-based: simple model of user characteristics or preferences
  Frame-based: simple model of user characteristics or preferences
  Agent-based: model of the user's intentions, goals and beliefs

2.6 Summary

This section has provided a classification of spoken dialogue systems in terms of the type of dialogue control strategy used and has shown how the choice of dialogue control strategy is closely related to the form of the user's input as well as the use of particular error recovery mechanisms. These points are summarised in Table I. The next section presents three examples of spoken dialogue systems that illustrate these different control strategies.

3. EXAMPLES OF SPOKEN DIALOGUE SYSTEMS

3.1 State-based dialogue control: The Nuance automatic banking system

The Nuance automatic banking system enables callers to conduct transactions over the telephone such as paying a bill, transferring funds between accounts, or obtaining an account balance. The system conducts a dialogue in English with the user, who can speak to the system using relatively unrestricted speech. It should be stressed that the following dialogues are based on recorded interactions with a demo system available in the UK (available at: +44 1223 516959), used here to illustrate some important aspects of state-based dialogue control. The shortcomings that are presented in some of the examples would have been addressed in a fully developed commercial application.

System: What would you like to do?
User: Pay a bill
System: What company would you like to pay?
User: Midland Bank
System: How much would you like to pay?
User: One hundred and twenty-five pounds fifty-seven pence
System: What date would you like the payment to be made on?
User: Monday
System: You want to pay HSBC one hundred and twenty-five pounds and fifty-seven pence on April 5th? Is this correct?
User: Yes
System: Your payment will be sent. Would you like to pay another bill?

In order to complete the transaction the system has to elicit three specific values from the caller - payee, amount, and transfer date. These are confirmed and the transaction is completed with a return to the sub-menu for bill payments. Some points worth noting in this example are that the system is able to process phrases specifying currency amounts, is able to transform relative dates such as 'Monday' into absolute dates such as 'April 5th', and is able to handle multiple synonyms ('Midland Bank' or 'HSBC').

The dialogue flow, which is system-directed with a pre-determined sequence of questions, can be modelled as a set of interconnected dialogue states, as shown in Figure 1.

[Fig. 1. Dialogue flow for payment of a bill: states eliciting the company, the amount, and the transfer date, followed by a confirmation of company, account and date, and a yes/no branch for paying another bill.]

Transferring amounts between accounts involves a different type of transaction in which the system can accept more complex natural language input from the caller, thus reducing the number of questions to be asked and the time required for the transaction:

System: What would you like to do?
User: I would like to transfer two hundred and fifty pounds from my current to my savings account
System: You want to transfer two hundred fifty pounds from current to savings. Is this correct?
User: Yes

This transaction is also modelled using dialogue states combined with multiple slot filling. Additional transitions need to be provided depending on which combinations of the values of amount, source account and destination account are provided by the user in the natural language input. Problems can arise if this functionality is not available throughout the dialogue, as illustrated in the following excerpt:

System1: What company would you like to pay?
User1: Abbey National
System2: How much would you like to pay?
User2: One hundred pounds next Monday
System3: What date would you like the payment to be made on?
User3: Next Monday
System4: You want to pay Abbey National one hundred pounds and sixty-one pence on April 5th? Is this correct?
User4: No
System5: Would you like to change the company, amount, or date?
User5: Amount
System6: How much would you like to pay?
User6: One hundred pounds
System7: You want to pay Abbey National one hundred pounds on April 5th? Is this correct?
User7: Yes

In User2 the user provides an over-informative answer, specifying not only the amount requested in System2 but also the date of the transaction. The system's next question (System3) indicates that either the date was misrecognised or that the system is not set up for multiple slot filling at this dialogue state. Whatever the reason, the system goes on to ask the question required at the next dialogue state, i.e. for the date of the transaction. Thus the user is requested to supply information that had just been given in the previous response. Moreover, as shown in System4, the system has recognised 'next Monday' not as a date but as part of the amount, i.e. as 'and sixty-one pence'. As shown in System5-User7, the system is able to recover from this error by getting the user to reconfirm the amount.

A final example illustrates some limitations on the system's mechanisms for error recovery:

System: Would you like to change the company, amount, or date?
User: The company and the date
System: Please say company, amount, or date

From this example it can be seen that the system is unable to correct more than one error at a time. Thus while the strategy of leaving the confirmation of the three items of information until they have all been elicited will normally result in a more natural and faster transaction, it may lead to a frustrating and rather unnatural dialogue where more than one item has to be changed.

The Nuance banking system is an advanced spoken dialogue system that employs state-based dialogue control in conjunction with natural language understanding and multiple slot filling. Given good recognition rates the system is likely to perform well and complete transactions efficiently and effectively. These sample interactions were chosen carefully to illustrate some potential problems that are likely to arise when a state-based system is extended to provide more sophisticated functionality. The examples would appear to indicate that the multiple slot filling afforded by the system's natural language processing is not available at each dialogue state. Although this can be easily remedied, a further complication is introduced as now the system has to maintain a dialogue model of which slots have been filled so as to direct the dialogue down alternative paths. With small numbers of slots this is not a problem, but with an increasing number of slots the combinations of state transitions soon reach unmanageable proportions. This is particularly the case when further transitions are added to deal with verification of the slot values. Thus while state-based dialogue control is useful for implementing simple transactions, introducing more sophisticated functionality may require additional mechanisms, such as an explicit dialogue model, that obscure the semantics of the state-based control. This issue will be discussed in greater detail in section 5. The next example will show how a frame-based system is able to handle these issues more effectively.

3.2 Frame-based dialogue control: The Philips automatic train timetable information system

The Philips automatic train timetable information system (available at: +49 241 604020) provides information over the telephone about train connections between 1200 German cities. The system conducts a natural dialogue in German with the user, who can speak to the system using fluent, unrestricted speech. The aim of the dialogue is to enable the system to construct an appropriate database query that retrieves the information required by the user and then to present that information in a suitable form. This system falls within a class of similar systems concerned with the provision of information, such as travel schedules, banking, stocks and shares, and weather forecasts. The German version of the system is a research prototype, which is not connected with the official information service provided by German Rail (Deutsche Bahn AG). The following is an example of a dialogue with the Philips Automatic Train Timetable Information System, translated from the original German [Aust et al. 1995].

System1: Good morning. This is the automatic timetable information service. How can I help you?
User1: I need a connection from Hamburg to Munich.
System2: When would you like to go from Hamburg to Munich?
User2: Next Friday at eight o'clock.
System3: So you want to travel on Sunday the 20th of November at eight o'clock a.m.?
User3: No, on Friday.
System4: So you want to travel on Friday the 18th of November at eight o'clock a.m.?
User4: Yes.
System5: There is the following connection: With Intercity Express 793, departure from Hamburg at 8:02, arrival in Munich at 14:05. Do you want me to repeat the connection?
User5: No, but I need a train for the return trip on Sunday.
System6: At what time would you like to go from Munich to Hamburg on Sunday the 20th of November?
User6: I have to arrive in Hamburg late in the evening.
System7: So you want to arrive in Hamburg between nine p.m. and eleven p.m.?
User7: Yes.
System8: There are three connections. This is the first one: With Intercity Express 880, departure from Munich at 15:54, arrival in Hamburg at 21:58. Do you want the next connection?
User8: No, thanks, that one is fine.
System9: Do you need yet another connection?
User9: No, thank you.
System10: We wish you a pleasant journey. Thank you for calling. Goodbye.

It is informative to examine the dialogue flow in this example. The purpose of the system is to guide the user towards providing all the required values - departure city, destination city, departure time, and departure date - as well as ascertaining other values, such as whether the time is a.m. or p.m. and whether the time refers to the departure or arrival. These values are acquired over the course of the dialogue and once they have been acquired the system can execute a suitably constrained database query and provide one or more connections.

This system could have been implemented using state-based control in which the system would progress through a number of states eliciting and verifying the required values. Alternatively the system could have been implemented as a Question-Answer or Natural Language Database System in which the user would have been required to provide all the required parameters in one utterance, which could then have been translated directly into a database query. However, one of the aims of the designers of this system was to enable a more flexible interaction in which the user would not be constrained either to input one value at a time or to input all the values within one utterance. This flexibility is necessary as it cannot be determined in advance what a user might know regarding the information required to make a valid query. For example, the system may need to know if the user wishes to travel on an Inter-City train, requires a train with restaurant facilities, and so on. If the user is not aware of all the possibilities, the system has to issue relevant queries and elicit suitable values in order to find the best connection.

A second aspect of dialogue flow concerns the sequencing of the system's questions. There should be a logical order to the questions. This order may be largely determined by what information is to be elicited in a well-structured task such as a travel information enquiry. The disadvantage of a state-based approach combined with natural language processing capabilities is that users may produce over-informative answers that provide more information than the system has requested at that point. In the Philips example at System2-User2, the system's more open-ended prompt 'when would you like to go from Hamburg to Munich' is ambiguous in that it can allow the user to supply departure time or date or both - as happens in User2. Even with a more constrained prompt such as 'on which day would you like to go from Hamburg to Munich' the user might supply both date and time. A system that followed a predetermined sequence of questions might then ask 'at what time would you like to go from Hamburg to Munich' - an unacceptable question as the time has already been given. The Philips system uses a status graph to keep track of which slots have already been filled. This mechanism will be described in greater detail in section 5.

A close examination of the dialogue also shows that the system is able to deal with recognition errors and misunderstandings. For example, in System3 the system attempts to confirm the departure date and time but has misrecognised the departure date and is corrected by the user in User3. More subtly, the system uses different strategies for confirmation. In System2 an implicit confirmation request is used in which the values for departure city and destination provided by the user in User1 are echoed back within the system's next question, which also includes a request for the value for the departure date and/or time. If the system's interpretation is correct, the dialogue can proceed smoothly to the next value to be obtained and the user does not have to confirm the previous values. Otherwise, if the system has misunderstood the input the user can correct the values before answering the next question. Conversely, an explicit confirmation request halts the dialogue flow and requires an explicit confirmation from the user. An example occurs in System3-User3 in which the system makes an explicit request for the confirmation of the departure date and time and the user corrects the date. The next exchange System4-User4 is a further example of an explicit confirmation request to verify the departure date and time.

One further aspect of the Philips system is its robustness. An example can be seen at System6-User6. In response to the system prompt for the departure time the user does not provide a direct response containing the required time but states a constraint on the arrival time, expressed vaguely as 'late in the evening'. The system is able to interpret this expression in terms of a range (between 9 p.m. and 11 p.m.) and to find an appropriate departure time that meets this constraint. More generally, the system is robust enough to be able to handle a range of different expressions for dates and times (e.g. 'three days before Christmas', 'within this month') and to be able to deal with cases of missing and contradictory information.

The provision of information such as train times is a typical application of spoken dialogue technology. Philips has developed a system with similar functionality for Swiss Rail, which has been an official part of Swiss Rail's information service since 1996. Public reaction to the system has been favourable, with over 80% of the people who used the service rating it as "excellent". Strik et al. [1996] report on a project involving adaptation of the German system to the Dutch public transport network, while the European R&D project ARISE, which includes the Dutch, French and Italian railway operators, builds on earlier European projects and on the Philips systems to provide more elaborate services with a multilingual component. The strengths and weaknesses of frame-based dialogue control are discussed further in section 5.

3.3 Agent-based dialogue control for problem solving - the Circuit-Fix-It Shop system

Problem solving tasks usually involve some degree of co-operation between agents, particularly when one of the agents is unable to solve the problem alone. Co-operation may take the form of assigning particular sub-tasks to particular agents, with the accomplishment of the sub-tasks contributing to the completion of the overall task. Such co-operative problem solving will require communication between the agents to report on the state-of-play of the problem solving and to supply the information required by the other agent. The next example illustrates collaborative problem solving in the Circuit-Fix-It Shop system [Smith and Hipp 1994]. Other agent-based systems will be discussed in section 5.

The Circuit-Fix-It Shop system helps users to fix an electronic circuit by engaging in a spoken dialogue with the user. Communication between system and user is motivated by the fact that neither participant is fully equipped to solve the problem alone. The system has all the necessary knowledge about the task but does not have the required sensory abilities to monitor the state of the world nor to perform actions in the world. These are the concern of the human user, who also has some knowledge of the task, which can range from that of a novice to that of an expert, depending on the individual user. The system employs user modelling - estimating the user's knowledge of particular objects and actions based on inferences from earlier exchanges in the dialogue - to determine the content of the information to be communicated to the user. Context-dependent expectations are also used to correct speech recognition hypotheses and to track when the user shifts to a different subdialogue. The following is an example of a dialogue between a user and the system:

System1: Is there a wire between connector 84 and connector 99?
User1: It is not there.
System2: Add a wire between connector 84 and connector 99.
User2: I need help.
System3: Obtain a wire.
User3: Done.
System4: Connect metal of one free end of the wire to connector 84.
User4: How?
System5: Bend back connector 84.
User5: Done.
System6: Insert free end of wire into connector 84.
User6: Wire between 84 and 99 is connecting.
System7: What is the LED displaying?

The dialogue evolves in the form of a proof, which can be illustrated using a goal tree. The goal tree represents the solution of the problem as it evolves dynamically. The system invokes rules to prove the goal in a top-down fashion, as in Prolog-style theorem proving. The proof may succeed using internally available knowledge, in which case no dialogue is required. However, the system is designed to deal with cases where the proof fails because the information required to complete the proof is not available to the system. In this case the system engages in dialogue with the user to obtain the missing information (described as 'missing axioms') so that the proof can succeed.
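The flavour of this mechanism can be conveyed with a small sketch: a top-down prover that, when a goal can be proved neither from its facts nor from its rules, asks the user and treats the answer as the missing axiom. The rule and goal names below are invented, and the repair subdialogue of the real system (instructing the user to add the wire) is omitted.

FACTS = set()
RULES = {"circuit_ok": ["wire(84,99)", "led_displaying_expected_value"]}

def prove(goal, ask=input):
    if goal in FACTS:
        return True
    if goal in RULES:
        # Expand the goal top-down, as in Prolog-style theorem proving.
        return all(prove(subgoal, ask) for subgoal in RULES[goal])
    # Missing axiom: only the user can observe the state of the world.
    answer = ask(f"Is it true that {goal}? ")
    if answer.strip().lower().startswith("y"):
        FACTS.add(goal)
        return True
    return False

# Example: prove("circuit_ok") asks about the wire, then about the LED.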

At the beginning of the dialogue, the system does not know whether there is a wire between connector 84 and connector 99. As this is a missing axiom in the current proof, the system produces utterance System1 to ask the user. The state of the proof at this point is shown in the goal tree displayed in Figure 2.

[Fig. 2. Goal tree before utterance System1: the goal fact(wire(84,99), exist, X) is a missing axiom, prompting System1 ('Is there a wire between connector 84 and connector 99?').]

The user confirms that the wire is missing. From this the system can infer that the user knows the location of the connectors, and these facts are added to the user model. Figure 3 shows the current state of the goal tree.

[Fig. 3. Goal tree after utterance User1: the reply 'it is not there' supplies the missing axiom (X = absent) and licenses the inferences userknows(loc(84)) and userknows(loc(99)).]

So that the current goal can be completed, the system instructs the user to add a wire between the connectors. This yields the goal tree shown in Figure 4.

[Fig. 4. Goal tree after utterance System2: the action do(action(add, wire(84,99))) becomes the next missing axiom, prompting System2 ('Add a wire between connector 84 and connector 99').]

As the user does not know how to do this, a subgoal is inserted instructing the user on how to accomplish this task. This subgoal consists of the actions: locate connector 84, locate connector 99, obtain a wire, connect one end of the wire to 84, and connect the other end of the wire to 99. These items are added to the goal tree depicted in Figure 5.

[Fig. 5. Goal tree after utterance User2: an inserted subgoal expands the add-wire action into the steps locate 84, locate 99, obtain wire, connect(end1,84) and connect(end2,99).]

However, as the user model contains the information that the user can locate these connectors, instructions for the first two actions are not required and so the system proceeds with instructions for the third action, which is confirmed in User3, and for the fourth action. Here the user requires further instructions, which are given in System5 with the action confirmed in User5. At this point the user asserts that the wire between 84 and 99 is connecting, so that the fifth instruction to connect the second end to 99 is not required. A further missing axiom is discovered which leads the system to ask what the LED is displaying (System7).

3.4 Summary

The examples presented in this section have illustrated three different types of dialogue control strategy. The selection of a dialogue control strategy determines the degree of flexibility possible in the dialogue and places requirements on the technologies employed for processing the user's input and for correcting errors. There are many variations on the dialogue strategies illustrated here and these will be discussed in greater detail in section 5. The next section will examine the component technologies of spoken dialogue systems.

4. COMPONENTS OF A SPOKEN DIALOGUE SYSTEM

A spoken dialogue system involves the integration of a number of components that typically provide the following functionalities [Wyard et al. 1996]:

Speech recognition: the conversion of an input speech utterance, consisting of a sequence of acoustic-phonetic parameters, into a string of words;

Language understanding: the analysis of this string of words with the aim of producing a meaning representation for the recognised utterance that can be used by the dialogue management component;

Dialogue management: the control of the interaction between the system and the user, including the co-ordination of the other components of the system;

Communication with external system: for example, with a database system, expert system, or other computer application;

Response generation: the specification of the message to be output by the system;

Speech output: the use of text-to-speech synthesis or pre-recorded speech to output the system's message.

These components are examined in the following sub-sections in relation to their role in a spoken dialogue system (for a recent text on speech and language processing, see Jurafsky and Martin [2000]).
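As an indication of how these functionalities fit together, the following sketch wires stand-in functions into the linear pipeline implied by the list above. Every function body here is an invented placeholder; in a working system each stage would be a substantial component in its own right.

def recognise(audio):
    # Stand-in for the speech recogniser: audio in, word string out.
    return "i need a connection from hamburg to munich"

def understand(words):
    # Stand-in for language understanding: words in, meaning out.
    return {"intent": "timetable", "origin": "hamburg", "destination": "munich"}

def manage(meaning):
    # Stand-in for dialogue management: decide what to do next.
    return {"action": "query", "parameters": meaning}

def external_query(action):
    # Stand-in for communication with an external system (e.g. a database).
    return [{"departure": "8:02", "arrival": "14:05"}]

def generate(results):
    # Stand-in for response generation.
    return f"There is a departure at {results[0]['departure']}."

def speak(text):
    # Stand-in for text-to-speech output.
    print("System:", text)

def turn(audio):
    speak(generate(external_query(manage(understand(recognise(audio))))))

turn(audio=None)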

4.1 Speech Recognition

The task of the speech recognition component of a spoken dialogue system is to convert the user's input utterance, which consists of a continuous-time signal, into a sequence of discrete units such as phonemes (units of sound) or words. One major obstacle is the high degree of variability in the speech signal. This variability arises from the following factors:

Linguistic variability: effects on the speech signal caused by various linguistic phenomena. One example is co-articulation, i.e. the fact that the same phoneme can have different acoustic realisations in different contexts, determined by the phonemes preceding and following the sound in question;

Speaker variability: differences between speakers, attributable to physical factors such as the shape of the vocal tract as well as factors such as age, gender, and regional origin; and differences within speakers, due to the fact that even the same words spoken on a different occasion by the same speaker tend to differ in terms of their acoustic properties. Physical factors such as tiredness, congested airways due to a cold, and changes of mood have a bearing on how words are pronounced, but the location of a word within a sentence and the degree of emphasis it is given are also factors which result in intra-speaker variability;

Channel variability: the effects of background noise, which can be either constant or transient, and of the transmission channel, such as the telephone network or a microphone.

The speech recognition component of a typical spoken dialogue application has to be able to cope with the following additional factors:

Speaker independence: as the application will normally be used for a wide variety of casual users, the recogniser cannot be trained on an individual speaker (or small number of speakers) who will use the system, as is the case for dictation systems; instead, for speaker-independent recognition samples have to be collected from a variety of speakers whose speech patterns should be representative of the potential users of the system. Speaker-independent recognition is more error-prone than speaker-dependent recognition.

Vocabulary size: the size of the vocabulary varies with the application and with the particular design of the dialogue system. Thus a carefully controlled dialogue may constrain the user to a vocabulary limited to a few words expressing the options that are available in the system, while in a more flexible system the vocabulary may amount to more than a thousand words.

Continuous speech: users of spoken dialogue systems expect to be able to speak normally to the system and not, for example, in the isolated word mode employed in some dictation systems. However, it is difficult to determine word boundaries in continuous speech since there is no physical separation in the continuous-time speech signal.

Spontaneous conversational speech: since the speech that is input to a spoken dialogue system is normally spontaneous and unplanned, it is typically characterised by disfluencies, such as hesitations and fillers (for example, umm and er), false starts, in which the speaker begins one structure then breaks off mid way and starts again, and extralinguistic phenomena such as coughing. The speech recogniser has to be able to extract from the speech signal a sequence of words from which the speaker's intended meaning can be computed.

The basic process of speech recognition involves finding a sequence of words, using a set of models acquired in a prior training phase, and matching these with the incoming speech signal that constitutes the user's utterance. The models may be word models, in the case of systems with a small vocabulary, but more typically the models are of units of sound such as phonemes or triphones, which model a sound as well as its context in terms of the preceding and succeeding sounds. The most successful approaches view this pattern-matching as a probabilistic process which has to be able to account both for temporal variability, due to different durations of the sounds resulting from differences in speaking rate and the inherently inexact nature of human speech, and acoustic variability, due to the linguistic, speaker and channel factors described earlier. The following formula expresses this process:

    W* = argmax_W P(O|W) P(W)

In this formula W* represents the word sequence with the maximum a posteriori probability, while O represents the observation that is derived from the speech signal. Two probabilities are involved: P(O|W), known as the acoustic model, which has been derived through a training process and which is the probability that a sequence of words W will produce an observation O; and a language model P(W), derived from an analysis of a language corpus, giving the prior probability distribution assigned to the sequence of words W.

The observation O comprises a series of vectors representing acoustic features of the speech signal. These feature vectors are derived from the physical signal, which is sampled and then digitally encoded. Perceptually important speaker-independent features are extracted and redundant features are discarded.

Acoustic modelling is a process of mapping from the continuous speech signal to the discrete sounds of the words to be recognised. The acoustic model of a word is represented in Hidden Markov Models (HMMs), as in Figure 6. Each state in the HMM might represent a unit of sound, for example, the three sounds in the word dog. Transitions between the states, A = a_12, a_13, ..., a_nn, represent the probability of transitioning from one state to the next and model the temporal progression of the speech sounds. Due to variability in the duration of the sounds, a sound may spread across several frames so that the model can take a loop transition and remain in the same state. For example, if there were five frames for the word dog, the state sequence S_1, S_1, S_2, S_2, S_3 might be produced, reflecting the longer duration of the sounds representing d and o.

[Fig. 6. A simple Hidden Markov Model: a word model for dog, with states for the sounds d, o and g between start and end nodes, emitting the observation sequence o_1, o_2, ..., o_t with output probabilities b_i(o_t).]

A Hidden Markov Model is doubly stochastic, as in addition to the transition probabilities the output of each state, B = b_i(o_t), is probabilistic. Instead of each state having a single unit of sound as output, all units of sound are potentially associated with each state, each with its own probability. The model is 'hidden' because, given a particular sequence of output symbols, it is not possible to determine which sequence of states produced these output symbols. It is, however, possible to determine the sequence of states that has the highest probability of having generated a particular output sequence. In theory this would require a procedure that would examine all possible state sequences and compute their probabilities. In practice, because of the Markov assumption that being in a given state depends only on the previous state, an efficient dynamic programming procedure such as the Viterbi algorithm or A* decoding can be used to reduce the search space. If a state sequence is viewed as a path through a state-time lattice, at each point in the lattice only the path with the highest probability is selected.
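The dynamic programming step can be illustrated with a small sketch of the Viterbi algorithm over a toy HMM. All probabilities below are invented for the example; a real recogniser would use trained acoustic models and far larger state spaces.

def viterbi(obs, states, start_p, trans_p, emit_p):
    # best[t][s] = probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Keep only the highest-probability path into state s.
            prev, p = max(((r, best[t-1][r] * trans_p[r][s]) for r in states),
                          key=lambda x: x[1])
            best[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace back the most probable state sequence.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

# Toy example with two states and two observation symbols:
states = ("S1", "S2")
start_p = {"S1": 0.8, "S2": 0.2}
trans_p = {"S1": {"S1": 0.6, "S2": 0.4}, "S2": {"S1": 0.1, "S2": 0.9}}
emit_p = {"S1": {"a": 0.7, "b": 0.3}, "S2": {"a": 0.2, "b": 0.8}}
print(viterbi(["a", "b", "b"], states, start_p, trans_p, emit_p))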

The output of the acoustic modelling stage is a set of word hypotheses which can be examined to find the best word sequence, using a language model P(W). The language model contains knowledge about which words are more likely in a given sequence. Two types of model are possible. A finite state network predicts all the possible word sequences in the language model. This approach is useful if all the phrases that are likely to occur in the speech input can be specified in advance. The disadvantage is that perfectly legal strings that were not anticipated are ruled out. Finite state networks can be used to parse well-defined sequences such as expressions of time.

Alternatively, an N-gram model can be used. The use of N-grams involves computing the probability of a sequence of words as a product of the probabilities of each word, assuming that the occurrence of each word is determined by the preceding N-1 words. This relationship is expressed in the formula:

    P(W) = P(w_1, ..., w_N) = ∏_{n=1}^{N} P(w_n | w_1, ..., w_{n-1})

However, because of the high computational cost involved in calculating the probability of a word given a large number of preceding words, N-grams are usually reduced to bigrams (N=2) or trigrams (N=3). Thus in a bigram model P(w_i | w_{i-1}) the probability of all possible next words is based only on the current word, while in a trigram model P(w_i | w_{i-2}, w_{i-1}) it is based on the two preceding words. N-gram models may also be based on classes rather than words, i.e. the words are grouped into classes representing either syntactic categories, such as noun or verb, or semantic categories, such as days of the week or names of airports. A language model reduces the perplexity of a system, which will usually result in greater recognition accuracy. Perplexity is roughly defined as the average branching factor, or average number of words, that might follow a given word. If the perplexity is low, recognition is likely to be more accurate as the search space is reduced.
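A bigram model of this kind can be sketched in a few lines: probabilities are estimated as relative frequencies over a miniature, invented corpus, and a word sequence is scored as a product of word-given-previous-word probabilities.

from collections import Counter

# Tiny invented corpus; <end> marks sentence boundaries.
corpus = ("what time does the flight leave <end> "
          "what time did the flight leave <end>").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(prev, word):
    # Relative-frequency estimate of P(word | prev).
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def score(sentence):
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(prev, word)
    return p

print(score("what time does the flight leave"))  # non-zero: seen bigrams
print(score("what time does the white leaf"))    # zero: unseen bigrams

Note how the acoustically plausible but contextually unlikely string 'the white leaf' from the example below receives no support from such a model.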

The output of the speech recogniser may be a number of scored alternatives, as in the following example representing the recogniser's best guesses for the input string what time does the flight leave? [Wyard et al. 1996]:

(1) what time does the white leaf 1245.6
(2) what time does the flight leave 1250.1
(3) what time does a flight leave 1252.3
(4) what time did the flight leave 1270.1
(5) what time did a flight leave 1272.3

Sometimes there are only small differences between the alternatives, caused by one or two words that may not contribute to the meaning of the string. For this reason, the alternatives can be more economically represented in a directed graph or as a word lattice. The selection of the most likely sequence may be the responsibility of other system components. For example, if the domain of the dialogue system is flight enquiries, then the first sequence, which had the best score from the speech recogniser, would be discarded as contextually irrelevant. Similarly dialogue information would assist the choice between (2)-(3), which ask about a flight departure that has not yet taken place, and (4)-(5), which ask about some departure that has already happened.

As an alternative to returning the complete sequence of words that matches the acoustic signal, the recogniser can search for keywords. This technique is known as word spotting. Word spotting is useful for dealing with extraneous elements in the input, for example, detecting yes in the string 'well, uh, yes, that's right'. The main difficulty for word spotting is to detect non-keyword speech. One method is to train the system with a variety of non-keyword examples, known as sink (or garbage) models. A word spotting grammar network can then be specified that allows any sequence of sink models in combination with the keywords to be recognised.

Users of spoken dialogue systems are generally constrained to having to wait until the system has completed its output before they can begin speaking. Once users are familiar with a system, they may wish to speed up the dialogue by interrupting the system. This is known as barge-in. The difficulty with simultaneous speech, which is common in human-human conversation, is that the incoming speech becomes corrupted with echo from the ongoing prompt, thus affecting the recognition. Various techniques are under development to facilitate barge-in.

4.1.1 Summary. This section has outlined the main characteristics of the speech recognition process, describing the uncertain and probabilistic nature of this process, in order to clarify the requirements that are put on the other system components. In a linear architecture the output of the speech recogniser provides the input to the language understanding module. Difficulties may arise for this component if the word sequence that is output does not constitute a legal sentence, as specified by the component's grammar. In any case, the design of the language understanding component needs to take account of the nature of the output from the speech recognition module. Similarly, in an architecture in which the dialogue management component interacts with each of the other components, one of the roles of dialogue management will be to monitor when the user's utterance has not been reliably recognised and to devise appropriate remedial steps. These issues will be discussed in greater detail in subsequent sections. For more extensive accounts of speech recognition, see, for example, Rabiner and Juang [1993] and Young and Bloothooft [1997]. For tutorial overviews, see Makhoul and Schwartz [1995] and Power [1996].

4.2 Language understanding

The role of the language understanding component is to analyse the output of the speech recognition component and to derive a meaning representation that can be used by the dialogue control component. Language understanding involves syntactic analysis, to determine the constituent structure of the recognised string (i.e. how the words group together), and semantic analysis, to determine the meanings of the constituents. These two processes may be kept separate at the representational level in order to maintain generalisability to other domains, but they tend to be combined during processing for reasons of efficiency. On the other hand, some approaches to language understanding may involve little or no syntactic processing and derive a semantic representation directly from the recognised string. The advantages and disadvantages of these approaches, and the particular problems involved in the processing of spoken language, will be reviewed in this section.

The theoretical foundations for language processing are to be found in linguistics, psychology, and computational linguistics. Current grammatical formalisms in computational linguistics share a number of key characteristics, of which the main ingredient is a feature-based description of grammatical units, such as words, phrases and sentences [Uszkoreit and Zaenen 1996]. These feature-based formalisms are similar to those used in knowledge representation research and data type theory.

Feature terms are sets of attribute-value pairs in which the values can be atomic symbols or further feature terms. Feature terms belong to types, which may be organised in a type hierarchy or as disjunctive terms, functional constraints, or sets. The following simple example shows a feature-based representation for the words lions, roar and roars as well as a simple grammar using the PATR-II formalism [Shieber 1986] that defines how the words can be combined in a well-formed sentence:

lexicon

lions: [cat: NP, head: [agreement: [number: plural, person: third]]]
roar:  [cat: V, head: [form: finite, subject: [agreement: [number: plural, person: third]]]]
roars: [cat: V, head: [form: finite, subject: [agreement: [number: singular, person: third]]]]

grammar

S -> NP VP
  <S head> = <VP head>
  <S head subject> = <NP head>
VP -> V
  <VP head> = <V head>

The lexicon consists of complex feature structures describing the syntactically relevant characteristics of the words, such as whether they are singular or plural. The grammar consists of phrase structure rules and equations that determine how the words can be combined.

The means by which feature terms may be combined to produce well-formed feature terms is through the process of unification. For example: the words lions and roar can be combined as their features unify, whereas lions and roars cannot, as the agreement features are incompatible. This basic formalism has been used to account for a wide range of syntactic phenomena and, in combination with unification, to provide a standard approach to sentence analysis using string-combining and information-combining operations.
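
To make the unification operation concrete, the following minimal sketch (not from the original PATR-II system; the dictionary encoding and the function names are illustrative assumptions) implements recursive unification of feature terms in Python and applies the S → NP VP agreement constraint from the grammar above:

FAIL = object()  # sentinel marking unification failure

def unify(a, b):
    """Unify two feature terms: atomic values must be identical;
    feature structures (dicts) are unified attribute by attribute."""
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for attr, value in b.items():
            merged = unify(result.get(attr), value)
            if merged is FAIL:
                return FAIL
            result[attr] = merged
        return result
    return a if a == b else FAIL

# Feature-term encodings of the lexical entries shown above.
LEXICON = {
    "lions": {"cat": "NP",
              "head": {"agreement": {"number": "plural", "person": "third"}}},
    "roar": {"cat": "V",
             "head": {"form": "finite",
                      "subject": {"agreement": {"number": "plural",
                                                "person": "third"}}}},
    "roars": {"cat": "V",
              "head": {"form": "finite",
                       "subject": {"agreement": {"number": "singular",
                                                 "person": "third"}}}},
}

def combine_s(noun, verb):
    """S -> NP VP with <S head subject> = <NP head>: the verb's
    subject features must unify with the noun's head features."""
    np, vp = LEXICON[noun], LEXICON[verb]
    subject = unify(vp["head"]["subject"], np["head"])
    if subject is FAIL:
        return FAIL
    return {"cat": "S", "head": unify(vp["head"], {"subject": subject})}

print(combine_s("lions", "roar") is not FAIL)   # True: plural subject agrees
print(combine_s("lions", "roars") is not FAIL)  # False: number features clash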

Feature-based grammars are often subsumed under the term unification grammars. One major advantage of unification grammars is that they permit a declarative encoding of grammatical knowledge that is independent of any specific processing algorithm. A further advantage is that a similar formalism can be used for semantic representation, with the effect that the simultaneous use of syntactic and semantic constraints can improve the efficiency of the linguistic processing.

In computational semantics sentences are analysed on the basis of their constituent structure, under the assumption of the principle of compositionality, i.e. that the meaning of a sentence is a function of the meanings of its parts. Each syntactic rule has a corresponding semantic rule, and the analysis of the constituent structure of the sentence will lead to the semantic analysis of the sentence as the meanings of the individual constituents identified by the syntactic analysis are combined. The meaning representation from this form of semantic analysis is typically a logical formula in first order predicate calculus (FOPC) or some more powerful intermediate representation language such as Montague's intensional logic or Discourse Representation Theory (DRT). The advantage of a representation of the meaning of a sentence in a form such as a formula of FOPC is that it can be used to derive a set of valid inferences based on the inference rules of FOPC. For example, as Pulman [1996] shows, a query such as:

Does every flight from London to San Francisco stop over in Reykjavik?

cannot be answered straightforwardly by a relational database that does not store propositions of the form every X has property P. Instead a logical inference has to be made from the meaning of the sentence based on the equivalence between every X has property P and there is no X that does not have property P. Based on this inference the system simply has to determine whether a non-stopping flight can be found, in which case the answer is no; otherwise it is yes.
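
This inference is easy to picture procedurally. The following sketch (the flight table and its field names are invented for illustration, not taken from Pulman's system) answers the query by searching for a counterexample:

# Hypothetical flight relation; a relational database stores individual
# flights, not quantified propositions such as "every flight ...".
flights = [
    {"origin": "London", "destination": "San Francisco",
     "stopovers": ["Reykjavik"]},
    {"origin": "London", "destination": "San Francisco",
     "stopovers": []},
]

# every X has property P  <=>  there is no X that lacks property P,
# so the system searches for a counterexample flight.
counterexample = any(
    f["origin"] == "London"
    and f["destination"] == "San Francisco"
    and "Reykjavik" not in f["stopovers"]
    for f in flights
)
print("no" if counterexample else "yes")  # "no": a counterexample exists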

While linguistics and psychology provide a theoretical basis for computational linguistics, the characteristics of spoken language require additional (or even alternative) techniques. One problem is that naturally occurring text, both in written form, as in newspaper stories, as well as in spoken form, as in spoken dialogues, is far removed from the well-formed sentences that constitute the data for theoretical linguistics and psychology. In linguistics the main concern is with developing theories that can account for items of theoretical interest, often rare phenomena that demonstrate the wide coverage of the theory, while in psychology the main concern is with identifying the cognitive processes involved in language understanding. Traditionally a symbolic representation is used, with hand-crafted rules that produce a complete parsing of grammatically correct sentences but with a target coverage based on a relatively small set of exemplar sentences. When confronted with naturally occurring texts such as newspaper stories these theoretically well-motivated grammars tend to generate a very large number of possible parses, due to ambiguous structures contained in the grammar rules, while, conversely, they often fail to produce the correct analysis of a given sentence, often having a failure rate of more than 60% [Marcus 1995].

Spoken language introduces a further problem in that the output from the speech recogniser will often not have the form of a grammatically well-formed string that can be parsed by a conventional language understanding system. Rather it is likely to contain features of spontaneous speech, such as sentence fragments, after-thoughts, self-corrections, slips of the tongue, or ungrammatical combinations. The following examples of utterances (cited in Moore [1995]), from a corpus collected from subjects using either a simulated or an actual spoken language Air Travel Information System (ATIS), would not be interpreted by a traditional linguistic grammar:

What kind of airplane goes from Philadelphia to San Francisco Monday stopping in Dallas in the afternoon (first class flight)

(Do) (Do any of these flights) Are there any flights that arrive after five p.m.

The first example is a well-formed sentence followed by an additional fragment or after-thought, enclosed in brackets. The second example is a self-correction in which the words intended for deletion are enclosed in brackets.

Some of these performance phenomena occur sufficiently regularly that they could be described by special rules. For example, in some systems rules have been developed that can recognise and correct self-repairs in an utterance [Dowding et al. 1993; Heeman and Allen 1997]. A conventional grammar could be enhanced with additional rules that could handle some of these phenomena, but the problem is that it would be impossible to predict all the potential occurrences of these features of spontaneous speech in this way. An alternative approach is to develop more robust methods for processing spoken language.
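
As a toy illustration of such a rule (far simpler than the mechanisms of Dowding et al. [1993] or Heeman and Allen [1997]), the following sketch deletes a false start whose opening words are immediately repeated verbatim:

def remove_simple_restarts(utterance: str) -> str:
    """Delete a false start when its words are repeated immediately
    afterwards, e.g. a restarted phrase such as 'to to Dallas'."""
    words = utterance.lower().split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest repeated word sequence starting at position i.
        for span in range(min(4, (len(words) - i) // 2), 0, -1):
            if words[i:i + span] == words[i + span:i + 2 * span]:
                i += span  # drop the first copy (the false start)
                break
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

print(remove_simple_restarts("what what kind of airplane goes to to dallas"))
# -> "what kind of airplane goes to dallas"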

Robust parsing aims to recover syntactic and semantic information from unrestricted text that contains features that are not accounted for in hand-crafted grammars. Robust parsing often involves partial parsing, in which the aim is not to perform a complete analysis of the text but to recover chunks, such as non-recursive noun phrases, that can be used to extract the essential items of meaning in the text. Thus the aim is to achieve a broad coverage of a representative sample of language which represents a reasonable approximate solution to the analysis of the text [Abney 1997]. In some systems mixed approaches are used, such as first attempting to carry out a full linguistic analysis on the input and only resorting to robust techniques if this is unsuccessful. BBN's Delphi system [Stallard and Bobrow 1992], MIT's TINA system [Seneff 1992] and SRI International's Gemini system [Dowding et al. 1993] work in this way. As Moore [1995] reports, different results have been obtained. The SRI team found that a combination of detailed linguistic analysis and robust processing resulted in better performance than robust processing alone, while the best performing system at the same evaluation (the November 1992 ATIS evaluation) was the CMU Phoenix system, which uses only robust processing methods and does not attempt to account for every word in an utterance.
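
The flavour of partial parsing can be conveyed by a toy sketch (illustrative only; practical chunkers, such as those Abney describes, use cascaded finite-state recognisers over part-of-speech tags rather than keyword patterns) that recovers domain chunks from an utterance without requiring the whole string to parse:

import re

# Toy chunk patterns for an ATIS-like domain; the city and time
# vocabularies are invented for the example.
CHUNKS = {
    "source":      r"\bfrom (philadelphia|london|dallas)\b",
    "destination": r"\bto (san francisco|dallas|boston)\b",
    "arrival":     r"\barrive after (\w+ ?p\.?m\.?)\b",
}

def partial_parse(utterance: str) -> dict:
    """Extract whatever chunks are present, ignoring any material
    (fragments, restarts) that no pattern accounts for."""
    found = {}
    for label, pattern in CHUNKS.items():
        match = re.search(pattern, utterance.lower())
        if match:
            found[label] = match.group(1)
    return found

print(partial_parse(
    "what kind of airplane goes from philadelphia to san francisco "
    "monday stopping in dallas in the afternoon first class flight"))
# -> {'source': 'philadelphia', 'destination': 'san francisco'}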

4.2.1 Integration of the speech recognition and natural language understanding components. So far it has been assumed that the speech recogniser and the natural language understanding module are connected serially and that the speech module outputs a single string to be analysed by the language understanding module. Typically, however, the output from the speech recognition component is a set of ranked hypotheses, of which only a few will make sense when subjected to syntactic and semantic analysis. The most likely hypothesis may turn out not to be the string that is ranked as the best set of words identified by the speech recognition component (see the example in section 4.1). What this implies is that, in addition to interpreting the string (or strings) output by the speech recogniser to provide a semantic interpretation, the language understanding module can provide an additional knowledge source to constrain the output of the speech recogniser. This in turn has implications for the system architecture, in particular for the ways in which the speech recognition and natural language understanding components can be linked or integrated.

The standard approach to integration involves selecting as a preferred hypothesis the string with the highest recognition score that can be processed by the natural language component. The disadvantage of this approach is that strings may be rejected as unparsable that nevertheless represent what the speaker had actually said. In this case the recogniser would be over-constrained by the language component. Alternatively, if robust parsing were applied, the recogniser could be under-constrained, as a robust parser will attempt to make sense out of almost any word string.

One alternative approach to integration is word lattice parsing, in which the recogniser produces a set of scored word hypotheses and the natural language module attempts to find a grammatical utterance spanning the input signal that has the highest acoustic score. This approach becomes unacceptable in the case of word lattices containing large numbers of hypotheses, particularly when there is a large degree of word boundary uncertainty. Another alternative is to use N-best filtering, in which the recogniser outputs the N best hypotheses (where N may range from 10 to 100 sentence hypotheses), and these are then ranked by the language understanding component to determine the best-scoring hypothesis [Price 1996]. This approach has the advantage of simplicity but the disadvantage of a high computational cost given a large value for N. Many practical systems have, however, produced acceptable results with values as low as N=5, using robust processing if strict grammatical parsing was not successful with the top five recognition hypotheses [Kubala et al. 1992].
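
Schematically, this N-best strategy with a robust fallback can be rendered as follows (a sketch under assumed interfaces: strict_parse and robust_parse stand in for whatever grammatical and robust analysers a particular system provides):

from typing import Callable, Optional

def nbest_filter(hypotheses: list[tuple[str, float]],
                 strict_parse: Callable[[str], Optional[dict]],
                 robust_parse: Callable[[str], dict],
                 n: int = 5) -> dict:
    """Return the analysis of the best-scoring hypothesis that parses
    under the strict grammar; if none of the top n parses, fall back
    to robust processing on the single best recognition string."""
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)[:n]
    for text, _score in ranked:
        analysis = strict_parse(text)
        if analysis is not None:
            return analysis
    return robust_parse(ranked[0][0])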

4.2.2 Some solutions. Various solutions have been adopted to the problem of deriving a semantic representation from the string provided by the speech recognition component. These include: comprehensive linguistic analysis, methods for dealing with ill-formed and incomplete input, and methods involving concept spotting. Some of these will be briefly reviewed in the following paragraphs.

4.2.2.1 SUNDIAL. In the SUNDIAL project [Peckham 1993], which was concerned with travel information in English, French, German, and Italian, several different approaches were adopted, with the following common features:

- a rich linguistic analysis;
- robust methods for handling partial and ill-formed input;
- a semantic representation language for task-oriented dialogues.

Linguistic analysis in the German version is based on a chart parser using a unification categorial grammar [Eckert and Niemann 1994]. Syntactic and semantic structures are built in parallel by unifying complex feature structures during parsing. The aim is to find a consistent maximal edge of the utterance, but if no single edge can be found, the best interpretation is selected for the partial descriptions returned by the chart parser. These partial descriptions are referred to as utterance field objects (UFOs). Various scoring measures are applied to the chart edges to determine the best interpretation. Additionally some features of spontaneous speech, such as pauses, filled pauses, and ellipses, are represented explicitly in the grammar. The following example illustrates the use of UFOs in the analysis of the string I want to go - at nine o'clock from Koeln [Eckert and Niemann 1994]:

U1: syntax: [string: 'I want to go']
    semantics: [type: want, the agent: [type: speaker], the theme: [type: go]]

U2: syntax: [string: 'at nine o'clock']
    semantics: [type: time, the hour: 9]

U3: syntax: [string: 'from Koeln']
    semantics: [type: go, the source: [type: location, the city: koeln]]

This sequence of UFOs is a set of partial descriptions that cannot be combined into a longer spanning edge, as U2, an elliptical construction, is not compatible with U1 and U3. However, it is still possible to build a semantic representation from these partial descriptions, as shown in this example.

This example also illustrates the semantic interface language (SIL), which is used in SUNDIAL to pass the content of messages between modules. Two different levels of detail are provided in SIL, both in terms of typed feature structures: a linguistically-oriented level, as shown above, and a task-oriented level, which contains information relevant to an application, such as relations in the application database. The task-oriented representation for the partial descriptions in the example above would be:

U1: [taskparam: [none]]
U2: [taskparam: [sourcetime: 900]]
U3: [taskparam: [sourcecity: koeln]]

This task-oriented representation is used by the dialogue manager to determine whether the information elicited from the user is sufficient to permit database access or whether further parameters need to be elicited.
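
A dialogue manager's sufficiency check over such task parameters might look like the following sketch (the snake_case parameter names and the destination_city requirement are invented extensions of the example, purely for illustration):

# Parameters a flight-information task needs before database access;
# destination_city is an assumed addition to the example's parameters.
REQUIRED = ("source_city", "destination_city", "source_time")

def missing_parameters(task_params: dict) -> list[str]:
    """Return the required task parameters not yet elicited."""
    return [p for p in REQUIRED if task_params.get(p) is None]

# Parameters accumulated from the UFOs above: U2 and U3 contributed
# a time and a source city; no destination has been given yet.
elicited = {"source_city": "koeln", "source_time": 900,
            "destination_city": None}

todo = missing_parameters(elicited)
if todo:
    print("elicit:", todo)       # -> elicit: ['destination_city']
else:
    print("query the database")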

Reporting on a comparative evaluation between earlier versions of the system, which did not include a robust semantic analysis, and a later version that did, Eckert and Niemann [1994] found a much better dialogue completion rate in the later system, even though word accuracy rate (the results from the speech recogniser) had remained roughly constant across the systems.

4.2.2.2 SpeechActs. The SpeechActs system [Martin et al. 1996], which enables prof
