[Figure 9 not reproduced; surviving label text: Linguistic Interface Knowledge]
Fig. 9. The architecture of the Circuit-Fix-It Shop system
actions and how theorems are used to prove goal completion. Among the other
types of knowledge included in this component are general dialogue knowledge
about the linguistic realisations of task expectations and knowledge about the user
that is acquired during the course of the dialogue.
Problem solving in the Circuit-Fix-It Shop system involves co-operating with the
user to solve a specific goal, such as how to repair a particular circuit. Problem
solving is achieved through communication between the system and the user to
establish what actions have to be carried out and what the current state of the
task might be. The components described in this section support the system in its
reasoning about the steps required to complete a task, in deciding what information
to communicate to the user, and in the integration of information provided by the
user into the system's model of the current state of the task.
4.4.3 Communication with a planning system. Problem solving can also be
achieved through the use of a planning system that supports reasoning about goals,
plans and actions. While the task structures used in the Circuit-Fix-It Shop system
involve task decomposition into sub-tasks and subsequently into primitive actions
to be carried out, the problem-solving mechanisms are different from those that are
used in conventional planning systems. The Circuit-Fix-It Shop system is presented
with an explicit goal at the beginning of the dialogue and its task is to collaborate
with the user in proving the goal in much the same way as a theorem is proved.
Planning systems incorporate further complications in that often the system has to
infer the user's goals from statements or actions that may not explicitly represent
the goals (plan recognition). Planning systems typically include an explicit
representation of beliefs, desires and intentions that are reasoned about during the
course of the problem solving. These elements are assumed implicitly in the
Circuit-Fix-It Shop system.
The TRAINS project [Allen et al. 1995] is concerned with the integration of
natural language dialogue and plan reasoning to support collaborative problem
solving. The purpose of the dialogue is to negotiate and develop a plan. The
speech acts that comprise the dialogue are motivated by reasoning about the plan
and are at the same time interpreted in the light of the current plan. Figure 10
provides a simplified view of the components of the TRAINS system.
[Figure 10 not reproduced; component labels: Dialogue Manager, Natural language
generator, Domain plan reasoner, Execution planning and monitoring, Parser,
Simulated TRAINS world, Manager's utterance, System's utterance]
Fig. 10. The TRAINS architecture
Plan reasoning in the TRAINS system involves two algorithms - the incorporation
algorithm and the elaboration algorithm. The incorporation algorithm is concerned
essentially with plan recognition, i.e. with finding causal and motivational
connections between potential interpretations of the current utterance and the current
plan. The algorithm searches through a space of plan graphs with nodes
representing events and states, and links representing relations between events and states
such as enablement, effect, generation and justification. The elaboration algorithm
supports the system's construction of a plan using means-ends planning. If the user
encounters some choice that requires confirmation, for example, an element in the
plan that is ambiguous, the system generates an utterance to request confirmation.
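The idea of searching a plan graph for a connection between an interpreted utterance and the current plan can be illustrated with a minimal sketch. This is not the TRAINS incorporation algorithm itself: the graph contents, the node names and the function `find_connection` are all hypothetical, and a plain breadth-first search stands in for whatever search strategy the actual system uses.

```python
from collections import deque

# Hypothetical plan graph: each node is an event or state, and each link is
# labelled with one of the relation names mentioned in the text.
PLAN_GRAPH = {
    "load_oranges": [("effect", "oranges_loaded")],
    "oranges_loaded": [("enablement", "move_train")],
    "move_train": [("effect", "oranges_delivered")],
    "oranges_delivered": [],
}


def find_connection(graph, start, plan_nodes):
    """Breadth-first search for a chain of links from `start` to a plan node.

    Returns a list of (source, relation, target) triples, an empty list if
    `start` is already part of the plan, or None if no connection exists.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node in plan_nodes:
            return path
        for relation, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [(node, relation, target)]))
    return None


connection = find_connection(PLAN_GRAPH, "load_oranges", {"oranges_delivered"})
```

A result of None corresponds to the case in which no causal or motivational connection can be found, so that another interpretation of the utterance has to be tried.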
4.4.4 Summary. This discussion of the role of the external communication
component in a spoken dialogue system has shown how an integrated system
architecture, as illustrated in Figure 7, is required in order to support interaction between
the dialogue management component and the other system components. In
addition to the problem of determining whether sufficient information has been elicited
from the user to provide input to the external application, as discussed in section
4.3, obtaining the required information from the external source is not necessarily
a straightforward task and complex interactions may be required involving
mediations between the dialogue manager and the user. In the case of a database query
the requested information may not be available in the form that was requested
so that a reformulated query is required. In a plan reasoning application such as
TRAINS the plan reasoner may fail to find a connection between an event, goal
or fact inferred from the user's utterance and a node in the plan graph, in which
case it could be assumed that the user's utterance had been misinterpreted and the
language understanding component would be required to search for an alternative
interpretation, failing which the system would request clarification or repair. Thus
the interpretation and resolution of the user's query may involve complex
interaction with the external source before the system can report a result back to the
user.
4.5 Response generation
Assuming that the requested information has been retrieved from the external
source, the response generation component now has to construct the message that
is to be sent to the speech output component to be spoken to the user. Broadly
speaking, the construction of the message consists of three decisions involving:
(1) what information should be included;
(2) how the information should be structured;
(3) the form of the message - for example, the choice of words and syntactic
structure.
Response generation can be achieved using simple methods, such as the insertion of
the retrieved data into pre-defined slots in a template. On the other hand, complex
methods using natural language generation techniques may be used, although
generally these more complex methods have only been applied in research prototype
systems.
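The simple template method can be sketched in a few lines. The template wording, the slot names and the sample record below are illustrative assumptions and are not taken from any of the systems surveyed; the sketch uses the `string.Template` class from the Python standard library.

```python
from string import Template

# Hypothetical template with three pre-defined slots.
DEPARTURE_TEMPLATE = Template(
    "The next train to $destination leaves at $time from platform $platform."
)


def generate_response(template, record):
    """Insert retrieved data into the slots of a template.

    Template.substitute raises KeyError if the record lacks a slot value,
    so an incomplete record is detected rather than spoken.
    """
    return template.substitute(record)


# Hypothetical record, standing in for the result of a database query.
record = {"destination": "Edinburgh", "time": "10 am", "platform": "4"}
print(generate_response(DEPARTURE_TEMPLATE, record))
# prints: The next train to Edinburgh leaves at 10 am from platform 4.
```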
Response generation in a dialogue system involves additional tasks beyond those
required for other language generation tasks. Given that the information to be
generated is in the form of some non-linguistic representation - for example, the
results of a database query or a chain of reasoning from an expert system - the
dialogue manager has to relate the information to what was previously said (using
a discourse history) as well as to the user's goals and knowledge (using a user
model).
Use of a discourse history enables the system to provide a response that is
consistent and coherent with the preceding dialogue. For example, if some entity that
has already been mentioned is to be referred to again, the system should check
whether an anaphoric expression can be used unambiguously to refer to the entity
on a second mention, as in the following example taken from Reiter and Dale [1997]:
The next train is the Caledonian Express. It leaves at 10am. Many
tourist guidebooks highly recommend this train.
Little research has been done on the use of pronouns in language generation,
although there has been some research on generating definite descriptions - for
example, the use of "the train" if the Caledonian Express and no other train has been
previously mentioned [Dale and Reiter 1995].
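The check on whether an anaphoric expression can be used unambiguously might be sketched as follows. The `Entity` class, the function `refer` and the three-way rule (pronoun, definite description, full name) are simplifying assumptions made for illustration; they are not the algorithm of Dale and Reiter [1995], and the second train and the city in the usage lines are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str      # full description, e.g. "the Caledonian Express"
    kind: str      # head noun for a definite description, e.g. "train"
    pronoun: str   # pronoun that could refer to the entity, e.g. "it"


def refer(entity, history):
    """Choose a referring expression given the entities mentioned so far."""
    if entity not in history:
        return entity.name                    # first mention: full name
    others = [e for e in history if e != entity]
    if all(e.pronoun != entity.pronoun for e in others):
        return entity.pronoun                 # pronoun is unambiguous
    if all(e.kind != entity.kind for e in others):
        return f"the {entity.kind}"           # definite description suffices
    return entity.name                        # fall back to the full name


express = Entity("the Caledonian Express", "train", "it")
city = Entity("Edinburgh", "city", "it")            # hypothetical competitor
sleeper = Entity("the Night Sleeper", "train", "it")  # hypothetical competitor

print(refer(express, [express]))           # prints: it
print(refer(express, [express, city]))     # prints: the train
print(refer(express, [express, sleeper]))  # prints: the Caledonian Express
```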
As mentioned earlier, user modelling in the early 1980s was concerned with
making natural language dialogue systems more co-operative. In addition to supporting
the interpretation of the user's utterances by modelling the user's beliefs, goals and
plans, the other main application of user models was to enable a system to adapt
its output to the user's perceived needs [Wahlster and Kobsa 1989]. A number of
research projects addressed this issue, of which the following are indicative.
The KNOME system [Chin 1989] provided different levels of explanation of Unix
commands depending on its categorisation of the user's level of competence and the
degree of difficulty of the command in question. The TAILOR system [Paris 1989]
adapted its output to the user's level of expertise by selecting the type of description
and the particular information that would be appropriate for a given user. Based
on an extensive analysis of scientific texts, it was found that texts from adult
encyclopaedias and manuals for experts mainly included structural information
that could be represented using constituency schemas describing the parts of the
objects, while encyclopaedias for young children and manuals for novices contained
mainly process-oriented information that described the functional characteristics
of the objects. TAILOR was able to generate appropriate descriptions for different
types of user and to produce a range of descriptions for users falling between the
two extremes of novice and expert. Finally, in the IMP system, Jameson [1989]
investigated the use of anticipation feedback to determine the bias of the system's
output. This means that the system attempts to anticipate the
user's reaction to its output and then takes this anticipated reaction into account
in finalising its output. This technique is particularly appropriate for
evaluation-oriented dialogues, such as personnel selection interviews and dialogues involving
travel agents, hotel managers, and salespeople.
A user model was used in the Circuit-Fix-It Shop system to enable the system to
determine what needed to be said to the user and what could be omitted because of
existing user knowledge (see the example discussed in section 3.3). In this system
the dialogue controller invoked inferences to derive additional axioms about the
user based on the user's utterances. These inferences, which are similar to those
used by Chin [1989] in the KNOME system, included the following [Smith and Hipp
1994, p. 60]:
If the axiom meaning is that the user has a goal to learn some
information, then conclude that the user does not know about the information.
If the axiom meaning is that an action was completed, then conclude
that the user knows how to perform the action.
These inferences, which are based on abstract descriptions of actions and their
effects, were used to provide user model axioms that could be used by the theorem
prover along with other axioms that were available to prove goal completion. Thus
the user model information was employed within the dialogue system to determine
the selection of the information to be presented to the user.
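The two inference rules quoted above can be written as a small rule function. The tuple representation of axioms, the predicate names and the helper `needs_explanation` are assumptions introduced here for illustration; the Circuit-Fix-It Shop system itself represents axioms for a theorem prover, which this sketch does not attempt to reproduce.

```python
def infer_user_model(axioms):
    """Derive user model axioms from axioms obtained from user utterances.

    Each axiom is a (predicate, argument) tuple; the predicate names are
    hypothetical labels for the two axiom meanings described in the text.
    """
    inferred = set()
    for predicate, argument in axioms:
        if predicate == "user_goal_learn":
            # The user wants to learn something, so does not yet know it.
            inferred.add(("user_does_not_know", argument))
        elif predicate == "action_completed":
            # The user carried out the action, so knows how to perform it.
            inferred.add(("user_knows_how", argument))
    return inferred


def needs_explanation(action, user_model):
    """An action needs explaining unless the user is known to know how."""
    return ("user_knows_how", action) not in user_model


# Hypothetical axioms derived from two user utterances.
observed = [
    ("user_goal_learn", "location_of_switch"),
    ("action_completed", "measure_voltage"),
]
model = infer_user_model(observed)
```

With this model, `needs_explanation("measure_voltage", model)` is false, so an instruction on how to measure a voltage could be omitted from the system's output.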
A considerable amount of research in text generation has been concerned with
the organisation of messages, i.e. their discourse structure. One of the most widely
known approaches involves the use of rhetorical relations between elements of a text,
as described in Rhetorical Structure Theory (RST) [Mann and Thompson 1988].
Examples of rhetorical relations are elaboration, exemplification, and contrast.
Alternatively, schemas have been used to provide the structure of the information
to be presented [McKeown 1985]. A schema sets out the main components of a
text, using elements such as identification, analogy, comparison, and
particular-illustration, which have a sequential ordering in a text and can occur recursively.
Schema-based systems often use general programming constructs such as local
variables and conditional tests.
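A schema can be thought of as a small procedure that emits its elements in a fixed order. The sketch below is only loosely modelled on this idea and does not reproduce any of McKeown's schemas: the function `describe`, the fact keys and the sample entity are all hypothetical. It uses three of the element types named above together with a local variable and a conditional test.

```python
def describe(name, facts):
    """Build a description following a fixed schema order:
    identification, optional comparison, then particular illustrations."""
    sentences = []
    category = facts["category"]             # local variable
    sentences.append(f"{name} is {category}.")            # identification
    if "comparison" in facts:                # conditional test
        sentences.append(f"In comparison, {facts['comparison']}.")
    for example in facts.get("illustrations", []):        # illustrations
        sentences.append(f"For example, {example}.")
    return " ".join(sentences)


# Hypothetical entity and facts, invented for this example.
facts = {
    "category": "a long-distance train",
    "comparison": "local services stop at every station",
    "illustrations": ["it calls only at the two terminal cities"],
}
print(describe("The Example Express", facts))
```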
The form of the output is known as the linguistic realisation. This involves the
choices of lexical items and syntactic structures to express the desired meaning. The
choice of lexical items might involve deciding between the words leave and depart to
express the concept of DEPARTURE, while syntactic decisions might involve the
choice of an active or a passive sentence [Reiter and Dale 1997]. Linguistic
realisation also involves the generation of grammatically correct structures, for example,
selecting the appropriate tense and rules of agreement. From the perspective of the
construction of a text, four different categories of content may be involved [Reiter
and Dale 1997]:
(1) unchanging text - i.e. parts of the message that are always present in the
output text;
(2) directly-available data - i.e. information that has been retrieved from a
database or knowledge base;
(3) computable data - i.e. information that is derived from the data as a result
of some computation or reasoning (for example, the number of records found
in the database for trains between two cities);
(4) unavailable data - i.e. information that is not present in the data but which
supplements the information (this is common in texts authored by humans, for
example, extra information that a railway line may be blocked by snow).
A dialogue system may make use of at least the first three types, using unchanging
text for the constant parts of a message, retrieved data to convey the information
that was requested, and computable data to summarise the information or to require
a more specific choice from the user.
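The combination of the first three content types can be sketched as follows. The function `summarise_trains`, the record format, the wording and the threshold `max_listed` are assumptions made for illustration only.

```python
def summarise_trains(origin, destination, records, max_listed=3):
    """Combine unchanging text, retrieved data and computable data."""
    count = len(records)                       # computable data
    noun = "train" if count == 1 else "trains"
    message = f"I found {count} {noun} from {origin} to {destination}."
    if count == 0:
        return message
    if count > max_listed:
        # Too many results to read out: ask for a more specific choice.
        return message + " What time would you like to leave?"
    times = ", ".join(r["departs"] for r in records)   # retrieved data
    return message + f" They leave at {times}."


# Hypothetical records, standing in for the result of a database query.
records = [{"departs": "08:00"}, {"departs": "10:30"}]
print(summarise_trains("Belfast", "Dublin", records))
# prints: I found 2 trains from Belfast to Dublin. They leave at 08:00, 10:30.
```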
4.6 Speech output
Speech output involves the translation of the message constructed by the response
generation component into spoken form. In the simplest cases pre-recorded canned
speech may be used, sometimes with spaces to be filled by retrieved or previously
recorded samples, as in:
You have a call from <Jason Smith>. Do you wish to take the call?
in which most of the message is pre-recorded and the element in angular brackets is
either synthesised or played from a recorded sample. This method works well when
the messages to be output are constant, but synthetic speech is required when the
text is variable and unpredictable, when large amounts of information have to be
processed and selections spoken out, and when consistency of voice is required. In
these cases text to speech synthesis (TTS) is used.
Text to speech synthesis can be seen as a two stage process involving
(1) text analysis;
(2) speech generation [Edgington et al. 1996a; 1996b].
Text analysis involves the analysis of the input text that results in a linguistic
representation that can be used by the speech generation stage to produce synthetic
speech by synthesising a speech waveform from the linguistic representation. The
text analysis stage is sometimes referred to as text-to-phoneme conversion, although
this description does not cover the analysis of linguistic structure that is involved.
The second stage, which is often referred to as phoneme to speech conversion,
involves the generation of a prosodic description (including rhythm and intonation),
followed by speech generation which produces the final speech waveform. A
considerable amount of research has been carried out in text to speech synthesis which is
beyond the scope of the present survey (see, for example, [Edgington et al. 1996a;
1996b; Carlson and Granstrom 1997] for recent overviews). This research has
resulted in several commercially available text to speech systems, such as DECTalk
and the BT Laureate system [Page and Breen 1996]. The main aspects of text
to speech synthesis that are relevant to spoken dialogue systems will be reviewed
briefly. The text analysis stage of text to speech synthesis comprises four tasks:
(1) text segmentation and normalisation;
(2) morphological analysis;
(3) syntactic tagging and parsing;
(4) the modelling of continuous speech effects.
Text segmentation is concerned with the separation of the text into units such as
paragraphs and sentences. In some cases this structure will already exist in the
retrieved text, but there are many instances of ambiguous markers. For example,
a full stop may be taken as a marker of a sentence boundary, but it is also used
for several other functions such as marking an abbreviation (St.), as a component
of a date (12.9.97), or as part of an acronym (M.I.5). Normalisation involves the
interpretation of abbreviations and other standard forms such as dates, times and
currencies, and their conversion into a form that can be spoken. In many cases
ambiguity in the expressions has to be resolved - for example, St. can be `street'
or `saint'.
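The resolution of the St. ambiguity can be illustrated with a crude positional heuristic, which is an assumption made here and not a method described in the work cited: St. followed by a capitalised word is read as Saint, and any other St. is read as Street. The function name `expand_st` is hypothetical.

```python
import re


def expand_st(text):
    """Expand the abbreviation 'St.' using a simple positional heuristic."""
    # 'St.' directly before a capitalised word is taken to mean 'Saint'.
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Any remaining 'St.' is taken to mean 'Street'.
    return re.sub(r"\bSt\.", "Street", text)


print(expand_st("Turn left at Market St. and stop at St. Mary's Church"))
# prints: Turn left at Market Street and stop at Saint Mary's Church
```

The heuristic fails when a street name ends a sentence and the next sentence begins with a capital letter, which is exactly the kind of interaction between segmentation and normalisation described above.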
Morphological analysis is required on the one hand to deal with the problem of
storing pronunciations of large numbers of words that are morphological variants
of one another, and on the other to assist with pronunciation. Typically a
pronunciation dictionary will store only the root forms of words, such as write. The
pronunciations of related forms, such as writes and writing, can be derived using
morphological rules. Similarly, words such as staring need to be analysed
morphologically to establish their pronunciation. Potential root forms are star + ing
and stare + ing. The former is incorrect on the basis of a morphological rule that
requires consonant doubling (starring), while the latter is correct because of the
rule that requires e-deletion before the -ing form.
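The two rules in this example can be expressed as a small analyser for -ing forms. The root lexicon, the function names and the spelling-based doubling test are assumptions made for illustration; in particular the doubling test ignores stress, so it would wrongly expect doubling in a word such as visiting.

```python
ROOTS = {"star", "stare", "write"}       # hypothetical root lexicon
VOWELS = set("aeiou")
NEVER_DOUBLED = VOWELS | set("wxy")      # letters not doubled before -ing


def needs_doubling(root):
    """True if the root ends consonant + vowel + consonant (as in star)."""
    return (len(root) >= 3
            and root[-1] not in NEVER_DOUBLED
            and root[-2] in VOWELS
            and root[-3] not in VOWELS)


def analyse_ing(word, roots=ROOTS):
    """Return the root of an -ing form, or None if no analysis is found."""
    if not word.endswith("ing"):
        return None
    stem = word[:-3]
    # Consonant doubling: starring -> starr -> star
    if (len(stem) >= 2 and stem[-1] == stem[-2]
            and stem[:-1] in roots and needs_doubling(stem[:-1])):
        return stem[:-1]
    # e-deletion: staring -> star -> stare
    if stem + "e" in roots:
        return stem + "e"
    # Plain suffixation, allowed only if the root would not be doubled
    if stem in roots and not needs_doubling(stem):
        return stem
    return None


print(analyse_ing("staring"))    # prints: stare
print(analyse_ing("starring"))   # prints: star
```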
Tagging is required to determine the parts of speech of the words in the text and
to permit a limited syntactic analysis, usually involving stochastic processing. A
small number of words - estimated at between 1 and 2% of words in a typical lexicon
[Edgington et al. 1996a] - have alternative pronunciations depending on their part
of speech. For example, live as a verb will rhyme with give, but as an adjective
rhymes with five. The part of speech also affects stress assignment within a word
- for example, record as a noun is pronounced 'record (with the stress on the first
syllable), and as a verb as re'cord (with the stress on the second syllable).
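Once a tagger has supplied the part of speech, selecting the pronunciation of such a word reduces to a lookup keyed on the word and its tag. The table below holds only the two examples from the text, written informally as in the text and not in a phonetic alphabet; the names `HOMOGRAPHS` and `pronunciation` and the tag labels are assumptions.

```python
# Pronunciations that depend on the part of speech, keyed by (word, tag).
HOMOGRAPHS = {
    ("live", "verb"): "rhymes with give",
    ("live", "adjective"): "rhymes with five",
    ("record", "noun"): "'record",    # stress on the first syllable
    ("record", "verb"): "re'cord",    # stress on the second syllable
}


def pronunciation(word, part_of_speech, default_lexicon=None):
    """Look up a tag-dependent pronunciation, else use the default lexicon."""
    key = (word.lower(), part_of_speech)
    if key in HOMOGRAPHS:
        return HOMOGRAPHS[key]
    return (default_lexicon or {}).get(word.lower())


print(pronunciation("record", "noun"))   # prints: 'record
print(pronunciation("record", "verb"))   # prints: re'cord
```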
Modelling continuous speech effects is concerned with achieving natural-sounding
speech when the words are spoken in a continuous sequence. Two problems
are encountered. Firstly, there are weak forms of words, involving mainly function
words such as auxiliary verbs, determiners and prepositions. These words are
often unstressed and given reduced or amended articulations in continuous speech.
Without these adjustments the output sounds stilted and unnatural. The second
problem involves co-articulation effects across word boundaries, which have the
effect of deleting or changing sounds. For example, if the words good and boy are
spoken together quickly, the /d/ in good is assimilated to the /b/ in boy. Modelling
these co-articulation effects is important for the production of natural-sounding
speech.
There has been an increasing concern with the generation of prosody in speech
synthesis, as poor prosody is often seen as a major problem for speech systems that
tend to sound unnatural despite good modelling of the individual units of sound.
Prosody includes phrasing, pitch, loudness, tempo, and rhythm, and is used to
convey differences in meaning as well as to convey attitude.
The speech generation process involves mapping from an abstract linguistic
representation of the text, as provided by the text analysis stage, to a parametric
continuous representation. Two main methods have been used to model speech:
articulatory synthesis, which models characteristics of the vocal tract and speech
articulators, and formant synthesis, which models characteristics of the acoustic
signal. Formant synthesis has been the more successful method and has produced
commercial systems such as DECTalk that yield a high degree of intelligibility.
An alternative method that is used in recent work, for example, in BT's Laureate
system, involves concatenative speech synthesis, in which pre-recorded units of
speech are stored in a speech database and selected and joined together in speech
generation. The relevant units are usually not phonemes, due to the problems
that arise with co-articulation, but diphones, which assist in the modelling of the
transitions from one unit of sound to the next. Various algorithms have been
developed for joining the units together smoothly.
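The step from a phoneme sequence to the diphone units that would be looked up in the speech database can be sketched as follows. The unit naming scheme, the silence symbol and the phoneme symbols are assumptions; the sketch covers only unit selection by name and not the signal processing needed to join the units smoothly.

```python
def to_diphones(phonemes, silence="_"):
    """Convert a phoneme sequence into a list of diphone unit names.

    Each unit spans the transition from one phoneme to the next; silence is
    added at both ends so that the first and last sounds are also covered.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# Informal phoneme symbols for the word "good" (an assumption).
print(to_diphones(["g", "u", "d"]))
# prints: ['_-g', 'g-u', 'u-d', 'd-_']
```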
Generally relatively little emphasis has been put on the speech output process by
developers of spoken dialogue systems. This is partly due to the fact that text to
speech systems are commercially available that can be used to produce reasonably