6. SPECIFYING, DESIGNING AND EVALUATING A SPOKEN DIALOGUE SYSTEM

Developing a spoken dialogue system can be viewed as a special case of software engineering, with its own methods and evaluation criteria that have evolved over the past few years. A recent EU project, DISC (Spoken Language Dialogue Systems and Components), is concerned with specifying a best practice methodology for the development and evaluation of spoken dialogue systems [Dybkjær et al. 1997]. Various sets of guidelines and standards have emerged as a result of research projects such as the Danish Dialogue Project [Bernsen et al. 1998], the EU-funded EAGLES project on standards for spoken language systems [Gibbon et al. 1997], and projects funded in the US under the ARPA initiatives on spoken language systems [Hirschman 1995]. The main trends in this work are reviewed in this section, looking first at the methods employed to support the specification, design and development of spoken dialogue systems, and then at methods of evaluation that have been used.

6.1 Development methodologies

Developing a spoken dialogue system involves deciding on the tasks that the system has to perform in order to solve a problem interactively with a human user; specifying a dialogue structure that will support the performance of the task; determining the recognition vocabularies and language structures that will be involved; and designing and implementing a solution that meets these criteria.

Various methods are in common use for establishing system requirements. These include: literature research; interviews with users to elicit the information required to construct the domain and task models; field-study observations or recordings of humans performing the tasks; field experiments, in which some parameters of the task are simulated; full-scale simulations; and rapid prototyping. In order to illustrate the issues involved, the two most commonly applied methods - design based on an analysis of human-human dialogues, and design based on simulations - will be described, followed by a discussion of usability issues and of design guidelines and standards.

6.1.1 Design based on the analysis of human-human dialogues. Human-human dialogues provide an insight into how humans accomplish task-oriented dialogues. Considerable effort has gone into collecting corpora of relevant dialogues, many of which are publicly available, such as the TRAINS corpus and the CSLU corpora. Analysis of the TRAINS corpora can provide information about the structure of the dialogues in support of conversational modelling - for example, whether task-oriented dialogues consist mainly of a single topic - and about the range of vocabulary and language structures involved. The CSLU corpora, on the other hand, are focussed mainly on modelling accents and multilingual pronunciations.

Analysis of natural dialogues may also pinpoint some aspects of how humans interact with software such as email and calendar applications when they are using a speech-based rather than a graphical user interface. In the SpeechActs projects at Sun Microsystems Laboratories, pre-design studies are used before dialogue design to help the designer view the task from the user's perspective and to develop a feel for the style of interaction [Yankelovich nd]. One of the findings was that users of the calendar application typically used relative dates such as tomorrow or next Monday, whereas absolute dates would be used in the version provided in the graphical user interface. The organisation of information, such as the numbering of messages in Sun's MailTool GUI, gave rise to confusion in the spoken language interface, as it became difficult to keep track of which messages were new and to refer back easily to previously read messages [Yankelovich et al. 1995]. Thus it was concluded that users of a speech user interface (SUI) employ a different set of mental abilities compared to when they use a graphical user interface (GUI). For this reason it was recommended that methods be developed to compensate for the lack of visual cues when interacting with software applications over a telephone line. The pre-design studies had an important bearing on issues such as these, as well as on the design of prompts, the selection of verification strategies, and the provision of immediate feedback.

6.1.2 Design based on simulations: Wizard of Oz and `System in the Loop'. Although the analysis of human-human dialogues can provide useful information to support the design of spoken dialogue systems, the main drawback of this approach is that it is not possible to generalise from unrestricted human-human dialogue to the more restricted human-computer dialogues that can be supported by current technology. Current systems are restricted by limited speech recognition capabilities, limited vocabulary and grammatical coverage, and limited ability to tolerate and recover from error. To investigate how humans might talk to a more restricted dialogue partner, such as a computer system, in a situation where no such system presently exists, some sort of simulation of the system is required.

The Wizard of Oz (WOZ) method is commonly used to investigate how humans might interact with a computer system [Fraser and Gilbert 1991]. In this method a human simulates the role of the computer, providing answers using a synthesised voice, and the user is made to believe that he or she is interacting with a computer. The situation is controlled with scenarios, in which the user has to find out one or more pieces of information from the system, for example, a flight arrival time and the arrival terminal. The use of a series of carefully designed WOZ simulated systems enables designs to be developed iteratively and evaluation to be carried out before significant resources have been invested in system building [Gibbon et al. 1997]. One of the greatest difficulties facing the WOZ method is that it is difficult for a human experimenter to behave exactly as a computer would, and to anticipate the sorts of recognition and understanding problems that might occur in the real system.

To overcome this disadvantage, the `System in the Loop' method may be used. In this case, a system with limited functionality is used to collect data. For example, the system might incorporate on the first cycle speech recognition and speech understanding modules, but the main dialogue management component may still be missing. On successive cycles additional components can be added and the functionality of the system increased, thus permitting more data to be collected. It is also possible to combine this method with the WOZ method, in which the human wizard simulates those parts of the system that have not yet been implemented.

6.1.3 Usability analysis. As with any other software, the success of a spoken dialogue system does not depend solely on the functionality and performance of the software but also on its usability and acceptance by the users for whom it is intended.

One aspect of usability is to determine the costs and benefits of the proposed system. Lennig et al. [1995] describe the development of a system which aimed to automate the handling of some directory assistance (DA) calls in Bell Canada. One of the initial investigations involved determining where potential savings could be made. Since the average operator work time per call was found to be approximately 25 seconds, and the cost to companies in the US of providing directory assistance was estimated to be over $1.5B, a reduction of 1 second in the work time per call would represent savings of over $60M per year. Operator acceptance was another factor that was investigated in this study. Operators were generally positive and particularly welcomed the fact that with the automatic system they did not have to continually repeat the same information. Using the system was easier on their voice and also required less keying, thus avoiding the problems of repetitive strain injury.
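The savings figure quoted above can be reproduced with a one-line calculation. The sketch below is only an illustration: it assumes that the $1.5B figure is an annual cost and that cost scales linearly with operator work time, neither of which is stated explicitly in the text, and the function name is hypothetical.

```python
def saving_per_second_usd(annual_cost_usd: float, avg_work_time_s: float) -> float:
    """Annual saving from cutting one second off every call.

    Assumes (not stated in the source) that cost is proportional to work time.
    """
    return annual_cost_usd / avg_work_time_s


# Figures from the text: about 25 s per call, over $1.5B in total cost.
saving = saving_per_second_usd(1.5e9, 25)
print(f"${saving / 1e6:.0f}M per year")
```

Because both inputs are lower bounds or approximations ("over $1.5B", "approximately 25 seconds"), the result is best read as an order-of-magnitude estimate.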

Similar findings emerged in a part-automated directory enquiries system developed by Vocalis for Telia TeleRespons, Sweden's leading network services provider [Peckham nd]. A major concern was how to reduce the running costs of directory enquiries without compromising customer satisfaction. The solution was a semi-automated system using a combination of voice response and speech recognition technology. The main tasks of an operator handling a directory enquiry call were identified and those parts that could be handled automatically were specified. Voice processing technology was used at the beginning and end of calls - to greet the caller and to release the requested number. The search for the number was handled by the human operator. Word spotting allowed the speech recognition component to recognise keywords in the midst of extraneous words and sounds, while `talkover' allowed the caller to speak over the system output if they did not wish to wait for the machine to finish before they began speaking. The commercial benefits of the system include increased staff productivity and substantial cost savings. From the technical perspective, the system achieves a speech recognition accuracy level of 97%, while Telia TeleRespons benefits from an 8% increase in efficiency and savings of millions of pounds each year. Research from Telia TeleRespons shows that over 90% of people are pleased to use the system and that an additional 6-7% actually prefer it. All the significant factors, including market conditions, pricing, the role of the operators, the views of the unions, suppliers and customers, were analysed at the outset before the technical requirements of the system were considered.

Table IV. Contextual functions and goals.

A. General constraints and criteria
   Overall design goal:
     Spoken language dialogue system prototype operating via the telephone and capable of replacing a human operator.
   Realism criteria:
     The artefact should be preferable to current technological alternatives.
     The system should run on machines which could be purchased by a travel agency.
   Usability criteria:
     Maximise the naturalness of user interaction with the system.
     Constraints on system naturalness resulting from trade-offs with system feasibility have to be made in a principled fashion based on knowledge of users in order to be practicable by users.

B. Application of constraints and criteria to the artefact within the design space
   System aspects:
     500 words vocabulary
     Max. 100 words in active vocabulary
     Limited speaker-independent recognition of continuous speech
     Close-to-real-time response
     Sufficient task domain coverage
   Task aspects:
     User tasks:
     Obtain information on and perform booking of flights between two specific cities
     Use single sentences (or max. 10 words)
     Use short sentences (average 3-4 words)

C. Hypothetical issues:
   Is a vocabulary of 500 words sufficient to capture the sublanguage vocabulary needed in the task domain?

6.1.4 Requirement Specification. Following the analysis of requirements for a spoken dialogue system based on one or more of the methods described in the preceding paragraphs, a formal requirements specification can be produced. The most elaborate approach would appear to be that employed in the Danish Dialogue Project, in which two sets of documents - a Design Space Development (DSD) and a Design Rationale (DR) - are produced [Bernsen 1993]. A DSD document (or frame) represents the design space structure and designer commitments at a given point during system design, so that a series of DSDs provides a series of snapshots of the evolving design process. A DSD contains information about general constraints and criteria as well as the application of these constraints and criteria to the system under development in the Danish Dialogue Project. Table IV shows some sample entries from a DSD (from [Dybkjær et al. 1996]). A DR frame represents the reasoning about a particular design problem. An example given in Dybkjær et al. [1996] describes a feature which had not been taken into account in the original specification - that users were not able to get the price of the tickets they had reserved. The DR contains information concerning the justification for the original specification, a list of possible options, the resolution adopted, and comments. In this way the evolving design and its rationale are comprehensively documented.
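As an illustration of the kind of record a DR frame holds, the following minimal sketch models one as a Python dataclass. The four content fields follow the description above; the class name, field names and example values (other than the ticket-price problem itself) are hypothetical and are not taken from the Danish Dialogue Project documents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DesignRationaleFrame:
    """One DR frame: the reasoning recorded about a single design problem."""
    problem: str
    justification: str  # why the original specification was as it was
    options: List[str] = field(default_factory=list)  # candidate solutions
    resolution: str = ""  # option adopted; empty while the problem is open
    comments: str = ""

    def is_resolved(self) -> bool:
        return bool(self.resolution)


# The problem comes from the example in the text; all other values are invented.
ticket_price = DesignRationaleFrame(
    problem="Users cannot get the price of the tickets they have reserved.",
    justification="Price information was not part of the original task analysis.",
    options=["Leave prices out", "State the price when a booking is confirmed"],
)
ticket_price.resolution = "State the price when a booking is confirmed"
```

Keeping the rejected options alongside the adopted resolution is what lets a series of such frames document why the design evolved as it did.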

In addition to requirement specification documents such as these, the EAGLES handbook [Gibbon et al. 1997] recommends a formal and explicit description of the proposed dialogue. One way to do this would be to represent the dialogue flow as a flowchart, state transition network, or dialogue grammar, in which all reachable states in the dialogue are specified, along with information on what actions should be performed in each state and how to decide which state should be the next. Tools exist for displaying this type of information graphically. For example, in the Danish Dialogue Project DDL (Dialogue Description Language), a graphical language for describing state transition diagrams for event-driven systems, is used to provide a formal specification of spoken dialogue systems.
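A state transition network of the kind described above can be sketched in a few lines. The example below is a hedged illustration only: the states, prompts and slot names are invented for a flight-enquiry task, and the code is not DDL or any tool from the projects cited. It shows the three ingredients named in the text: the reachable states, the action performed in each state (here, filling a slot), and the rule for choosing the next state.

```python
# Hypothetical states: each has a system prompt and the slot its answer fills.
STATES = {
    "ask_origin": {"prompt": "Where are you flying from?", "slot": "origin"},
    "ask_destination": {"prompt": "Where are you flying to?", "slot": "destination"},
    "confirm": {"prompt": "Shall I search for those flights?", "slot": "confirmed"},
}


def next_state(state, slots):
    """Decide which state follows `state`, given the slots filled so far."""
    if state == "ask_origin":
        return "ask_destination" if slots.get("origin") else "ask_origin"
    if state == "ask_destination":
        return "confirm" if slots.get("destination") else "ask_destination"
    if state == "confirm":
        return "done" if slots.get("confirmed") == "yes" else "ask_origin"
    raise ValueError(f"unknown state: {state}")


def run_dialogue(user_turns):
    """Feed recognised user turns through the network.

    Returns the final state, the filled slots and the list of states visited.
    """
    state, slots, visited = "ask_origin", {}, []
    for answer in user_turns:
        if state == "done":
            break
        visited.append(state)
        if state == "confirm" and answer != "yes":
            slots.clear()  # rejected: discard everything and start again
        else:
            slots[STATES[state]["slot"]] = answer  # action for this state
        state = next_state(state, slots)
    return state, slots, visited
```

An empty answer (for instance a recognition failure) leaves the slot unfilled, so the network stays in the same state and the question would be asked again.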

6.1.5 Design guidelines. The theoretical basis for much of the work on design guidelines comes from a theory of co-operative conversation developed by the philosopher of language Grice [1975]. Grice identified a number of maxims underlying co-operative conversation concerning quantity, quality, relation and manner of communication. For example: the quantity maxim states that a speaker should be as informative as required, but not more informative than required; the quality maxim relates to the truth of a conversational contribution, and the relation maxim to its relevance. Some commonly used evaluation metrics, such as Contextual Appropriateness (section 6.2.2), as well as the usability guidelines developed in the Danish Dialogue Project (see next section), are based loosely on Grice's work.

6.1.5.1 Guidelines in the Danish Dialogue Project. Gricean maxims have been developed and extended into a set of usability guidelines in the Danish Dialogue Project [Bernsen et al. 1996]. A first set of the guidelines was developed on the basis of analysis of 120 examples of user-system interaction problems identified in a corpus of dialogues from the Wizard of Oz (WOZ) simulations of the Danish dialogue system. The guidelines were subsequently refined and consolidated into a tool called DET (Dialogue Evaluation Tool), which can be used to support the design of co-operative dialogue systems and as a tool for diagnostic evaluation [Dybkjær et al. 1997]. DET consists of 22 guidelines grouped under seven different aspects of dialogue, such as informativeness and partner symmetry, and divided into generic (GG) and specific guidelines (SG). The following are some examples (those marked with * are based on Grice):

Informativeness.
GG1: *Make your contribution as informative as is required (for the current purposes of the exchange).
SG1: Be fully explicit in communicating to users the commitments they have made.

Partner Symmetry.
GG10: Inform the dialogue partners of important non-normal characteristics which they should take into account in order to behave co-operatively in dialogue. Ensure the feasibility of what is required of them.
SG4: Provide clear and comprehensible communication of what the system can and cannot do.

(Source: [Dybkjær et al. 1997]).

Using the guidelines for evaluation involves analysing transcripts of dialogues to identify instances of violations of the guidelines, which are marked up in the transcripts. The violations can then be examined in greater detail, disagreements between analysers resolved, and recommendations developed for enhancing co-operativity in the dialogue system. The generality of the guidelines has been explored by applying them successfully as a dialogue design guide to part of a corpus from the Sundial project [Dybkjær et al. 1997].

6.1.5.2 The EAGLES guidelines. In the EAGLES handbook a series of recommendations have been proposed to support the design of spoken dialogue systems. These guidelines include recommendations for the design of interactive voice response (IVR) systems and for the design of prompts. The following is a summary of the recommendations for the design of dialogue systems [Gibbon et al. 1997]:

(1) Data collection:
    - Study of recordings of human-human interaction in a situation similar to the one in which the system will be used.
    - Wizard-of-Oz simulations
    - Transcription of the dialogues
(2) Specification, design and implementation of a first version (X) of the dialogue system.
(3) Tests:
    - Laboratory tests using corpora recorded in Wizard-of-Oz simulations, and then with laboratory staff simulating users, recording new data
    - Field tests with real users, recording new corpora
(4) Tune the system by iteratively modifying, then testing it.
(5) Design and implement an X+1 version of the system, integrating new technologies.
(6) Tests (as in step 3)
(7) Return to step 4 unless the system is deemed to be complete.

Specific recommendations concerning the dialogue model and the vocabulary of the system are included in the following additional guidelines:

(1) Conduct a dialogue act analysis of the dialogues collected in the corpora, paying special attention to the conditions which must be satisfied in order to proceed from one dialogue state to the next.
(2) Describe the dialogue state transitions using some formally explicit apparatus (such as a flowchart or formal specification language).
(3) Use the data to identify the total lexicon required, then divide it into sublexicons, where each sublexicon is associated with a dialogue act.
(4) Use the data to identify a covering grammar, then divide it into subgrammars, where each subgrammar is associated with a dialogue act.
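Guideline (3) above can be illustrated with a small sketch that derives a total lexicon and per-dialogue-act sublexicons from a tagged corpus. The corpus, the dialogue act labels and the function name are invented for the example, and the sketch assumes that each transcribed user utterance has already been labelled with the dialogue act it belongs to; the EAGLES handbook does not prescribe this particular representation.

```python
from collections import defaultdict

# Hypothetical tagged corpus: (dialogue act, transcribed user utterance).
corpus = [
    ("give_departure_city", "i want to fly from london"),
    ("give_departure_city", "from dublin please"),
    ("confirm", "yes that is right"),
    ("confirm", "no"),
]


def build_lexicons(tagged_utterances):
    """Return (total lexicon, {dialogue act: sublexicon}) as sets of words."""
    sublexicons = defaultdict(set)
    for act, utterance in tagged_utterances:
        sublexicons[act].update(utterance.lower().split())
    total = set().union(*sublexicons.values())
    return total, dict(sublexicons)


total_lexicon, sublexicons = build_lexicons(corpus)
```

A recogniser could then activate only the sublexicon for the dialogue act expected at the current state, which is one way of staying within a limit on active vocabulary size. The same grouping idea would apply to subgrammars in guideline (4).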

6.2 Evaluation

Most of the methods used to support the specification and design of spoken dialogue systems - such as corpus analysis, WOZ, and system-in-the-loop - can also be used to collect data for evaluation. This section will focus on the metrics that are employed rather than on the methods of data collection.

Evaluation of spoken dialogue systems can involve either evaluation of the individual components (glass box evaluation), or evaluation of the system as a whole (black box evaluation). Evaluation of individual components, with measures such as word accuracy and sentence accuracy, has been employed for some time to measure the performance of spoken language systems under the ARPA initiatives [Hirschman 1995]. It is only more recently that measures have been developed for spoken dialogue systems as a whole. Both types of evaluation have been described in some detail in a review paper by Baggia [1996], from which much of the material in this section is derived. See also Smith [1997] and Smith and Gordon [1997] for a comparable set of evaluation methods.

6.2.1 Evaluation of individual components. Evaluation of individual components is generally based on the concept of a reference answer, which determines the desired output of the component to be compared with its actual output. Reference answers are easier to determine for components such as the speech recogniser and the language understanding component, but more difficult with the dialogue manager, where the range of acceptable behaviours is greater. The most commonly used measure for speech recognisers is Word Accuracy (WA). WA accounts for errors at the word level, which include insertion ($W_I$), deletion ($W_D$), and substitution of words ($W_S$). WA is calculated as a percentage using the formula:

$$\mathrm{WA} = 100 \left(1 - \frac{W_S + W_I + W_D}{W}\right) \%$$

where $W$ is the total number of words in the reference answer.
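The WA formula can be made concrete with a short sketch. The text does not say how the substitution, insertion and deletion counts are obtained; the code below assumes the usual approach of a minimum-edit-distance alignment between the reference and the recognised word strings, and the function names and example sentences are hypothetical.

```python
def word_error_counts(reference, hypothesis):
    """Return (substitutions, insertions, deletions) from a minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (total errors, subs, ins, dels) for ref[:i] versus hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, 0, i)  # reference words with no counterpart
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, j, 0)  # recognised words with no counterpart
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, n, d = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, n, d)  # correct word
            else:
                diagonal = (t + 1, s + 1, n, d)  # substitution
            t, s, n, d = table[i - 1][j]
            deletion = (t + 1, s, n, d + 1)
            t, s, n, d = table[i][j - 1]
            insertion = (t + 1, s, n + 1, d)
            # Tuples compare on total errors first, so min picks a cheapest path.
            table[i][j] = min(diagonal, deletion, insertion)
    _, subs, ins, dels = table[-1][-1]
    return subs, ins, dels


def word_accuracy(reference, hypothesis):
    """WA = 100 * (1 - (W_S + W_I + W_D) / W), with W the reference length."""
    total = len(reference.split())
    if total == 0:
        raise ValueError("reference answer must contain at least one word")
    subs, ins, dels = word_error_counts(reference, hypothesis)
    return 100.0 * (1 - (subs + ins + dels) / total)
```

For a five-word reference such as "show me flights to boston" recognised as "show flights to austin", the alignment finds one deletion and one substitution, giving a WA of 60%. Because insertions are counted as errors, WA can be negative when the recogniser outputs many spurious words.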

Sentence Accuracy (SA) is a measure of the percentage of utterances in a corpus that have been completely and correctly recognised. In this case the recognised string of words is matched exactly with the words in the reference answer. Sentence Understanding Rate (SU), on the other hand, measures the rate of understood sentences in comparison with a reference meaning representation. An alternative measure of understanding is Concept Accuracy (CA), which measures the percentage of concepts that have been correctly understood. CA is similar to WA, as it measures errors at the concept level which include insertions, deletions, and substitutions. Text Understanding (TA), a measure used in the Message Understanding Conferences, measures the amount of significant information that has been extracted from a text, using templates as the reference answers. Finally, an evaluation method has been developed in the ARPA Spoken Language System program to measure the correctness of database query responses by matching the actual responses with reference answers expressed as a set of minimal and maximal tuples [Hirschman 1995]. The correct answer must include at least the information in the minimal answer
