The Role of Voice in Human-Machine Communication*
Optimism is growing that the near future will witness rapid growth in human-computer interaction using voice. System prototypes have recently been built that demonstrate speaker-independent real-time speech recognition and understanding of naturally spoken utterances in moderately sized vocabularies (1000 to 2000 words), and larger-vocabulary speech recognition systems are on the horizon. Already, computer manufacturers are building speech recognition subsystems into their new product lines. However, before this technology will be broadly useful, a substantial knowledge base about human spoken language and performance during computer-based interaction needs to be gathered and applied. This paper reviews application areas in which spoken interaction may play a significant role, assesses potential benefits of spoken interaction with machines, and attempts to compare voice with alternative and complementary modalities of human-computer interaction. The paper also discusses information that will be needed to build a firm empirical foundation for future designing of human-computer interfaces. Finally, it argues for a more systematic and scientific approach to understanding human language and performance with voice interactive systems.
*The writing of this paper was supported in part by a grant from the National Science Foundation (No. IRI-9213472) to SRI International.
From the beginning of the computer era, futurists have dreamed of the conversational computer, a machine that we could engage in spoken natural language conversation. For instance, Turing's famous "test" of computational intelligence imagined a computer that could conduct such a fluent English conversation that people could not distinguish it from a human. However, despite prolonged research and many notable scientific and technological achievements, until recently there have been few human-computer dialogues, none spoken. This situation has begun to change, as steady progress in speech recognition and natural language processing technologies, supported by dramatic advances in computer hardware, has made possible laboratory prototype systems with which one can engage in simple question-answer dialogues. Although far from human-level conversation, this initial capability is generating considerable interest and optimism for the future of human-computer interaction using voice.
This paper aims to identify applications for which spoken interaction may be advantageous, to situate voice with respect to alternative and complementary modalities of human-computer interaction, and to discuss obstacles that exist to the successful deployment of spoken language systems because of the nature of spoken language interaction.
Two general sorts of speech input technology are considered. First, we survey a number of existing applications of speech recognition technologies, for which the system identifies the words spoken, but need not understand the meaning of what is being said. Second, we concentrate on applications that will require a more complete understanding of the speaker's intended meaning, examining future spoken dialogue systems. Finally, we discuss how such speech understanding will play a role in future human-computer interactions, particularly those involving the coordinated use of multiple communication modalities, such as graphics, handwriting, and gesturing. It is argued that progress has been impeded by the lack of adequate scientific knowledge about human spoken interactions, especially with computers. Such a knowledge base is essential to the development of well-founded human-interface guidelines that can assist system designers in producing successful applications incorporating spoken interaction. Given recent technological developments, the field is now in a position to systematically expand that knowledge base.
Background and Definitions
Human-computer interaction using voice may involve speech input or speech output, perhaps in combination with each other or with other modalities of communication.
The speech analysis task is often characterized along five dimensions:
Speaker dependence. Speech recognizers are described as speaker-dependent/trained, speaker-adaptive, and speaker-independent. For speaker-dependent recognition, samples of a given user's speech are collected and used as models for his/her subsequent utterances. For speaker-adaptive recognition, parameterized acoustical models are initially available, which can be more finely tuned for a given user through pronunciation of a limited set of specified utterances. Finally, speaker-independent recognizers are designed to handle any user's speech, without training, in the given domain of discourse (see Flanagan, in this volume).
Speech continuity. Utterances can be spoken in an isolated manner, with breaks between words, or as continuous natural speech.
Speech type. To develop initial algorithms, researchers typically first use read speech as data, in which speakers read random sentences drawn from some corpus, such as the Wall Street Journal. Subsequent to this stage of algorithm development, speech recognition research attempts to handle spontaneous speech, in which speakers construct new utterances in the chosen domain of discourse.
Interactivity. Certain speech recognition tasks, such as dictation, can be characterized as noninteractive, in that the speaker is receiving no feedback from the intended listener(s). Other systems are designed to process interactive speech, in which speakers construct utterances as part of an exchange of turns with a system or with another speaker.
Vocabulary and grammar. The user can speak words from a tightly constrained vocabulary and grammar or from larger vocabularies and grammars that more closely approximate those of a natural language. The system's vocabulary and grammar can be chosen by the system designer or application developer, or they can be compiled from data based on actual users speaking either to a simulated system or to an early system prototype. Current speech recognition technologies require an estimate of the probability of occurrence of each word in the context of the other words in the vocabulary. Because these probabilities are typically approximated from the distribution of words in a given corpus, it is currently difficult to expand a system's vocabulary, although research is proceeding on vocabulary-independent recognition (Hon and Lee, 1991).
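The corpus-based estimation described in the last item can be illustrated with a minimal sketch. It assumes the simplest possible context (the single preceding word, i.e., a bigram model) and plain relative-frequency estimates; the function name and the toy corpus are hypothetical, and real recognizers use longer contexts and smoothing.

```python
from collections import Counter

def bigram_probabilities(sentences):
    """Estimate P(word | previous word) by relative frequency in a corpus."""
    context_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # "<s>" is a hypothetical start-of-sentence marker
        words = ["<s>"] + sentence.lower().split()
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, word)] += 1
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in pair_counts.items()
    }

# Toy corpus (invented for illustration only)
corpus = [
    "show flights to boston",
    "show flights to denver",
    "show fares to boston",
]
probs = bigram_probabilities(corpus)
print(probs[("show", "flights")])          # seen twice after three uses of "show"
print(probs.get(("to", "dallas"), 0.0))    # unseen word: no estimate available
```

The second lookup suggests why vocabulary expansion is hard under this scheme: a word that never occurred in the corpus has no probability estimate until new data are collected.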
Vendors often describe their speech recognition hardware as offering very high recognition accuracy, but it is only in the context of a quantitative understanding of the recognition task that one can meaningfully compare the performance of recognizers. To calibrate the difficulty of a given recognition task for a given system, researchers have come to use a measure of the perplexity of that system's language model, which measures, roughly speaking, the average number of word possibilities at each state of the grammar (Bahl et al., 1983; Baker, 1975; Jelinek, 1976). Word recognition accuracy has been found, in general, to be inversely proportional to perplexity. Most commercial vendors offer speech recognition systems claiming >95 percent word recognition accuracy given a perplexity on the order of 10. At least one vendor offers a 1000 to 5000 word, speaker-independent system, with perplexities in the range of 66 to 433, and a corresponding word-recognition error of 3 to 15 percent for recognition of isolated words (Baker, 1991). Current laboratory systems support real-time, speaker-independent recognition of continuously spoken utterances drawn from a vocabulary of approximately 1500 words, with a perplexity of 50 to 70, resulting in word recognition error rates between 4 and 8 percent (Pallett et al., 1993). The most ambitious speaker-independent systems are currently recognizing, in real time, read speech drawn from a 5000-word vocabulary of Wall Street Journal text, with a perplexity of 120, resulting in a word recognition error rate of 5 percent (Pallett et al., 1993). Larger vocabularies are now being attempted.
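As a rough illustration of the perplexity measure, the sketch below assumes the standard information-theoretic definition: 2 raised to the average negative log2 probability that the language model assigns to each word of a test sequence. The function name and the probability lists are hypothetical; they are not taken from any of the systems cited.

```python
import math

def perplexity(word_probabilities):
    """Perplexity = 2 ** (average negative log2 probability per word)."""
    n = len(word_probabilities)
    avg_neg_log2 = -sum(math.log2(p) for p in word_probabilities) / n
    return 2 ** avg_neg_log2

# If every word is one of ten equally likely choices, the perplexity is
# (up to floating-point error) ten, matching the "average number of word
# possibilities" reading of the measure.
uniform_ten = [0.1] * 20
print(perplexity(uniform_ten))

# A model that is more certain about each word yields a lower perplexity.
skewed = [0.5, 0.25, 0.5, 0.25]
print(perplexity(skewed))
```

Under this definition, a grammar that allows ten equally likely words at every state has a perplexity of ten regardless of the total vocabulary size, which is why perplexity rather than vocabulary size is used to calibrate task difficulty.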
The end result of voice recognition is the highest-ranking string(s) of words, or often lattice of words, that covers the signal. For small vocabularies and tightly constrained grammars, a simple interpreter can respond to the spoken words directly. However, for larger vocabularies and more natural grammars, natural language understanding must be applied to the output of the recognizer in order to recover the intended meaning of the utterance.1 Because this natural language understanding process is complex and open ended, it is often constrained by the application task (e.g., retrieving information from a data base) and by the domain of discourse (e.g., a data base about airline flights). Here the combination of speech recognition and language understanding will be termed speech understanding, and the systems that use such input will be termed spoken language systems. This paper reviews earlier work on the uses of speech recognition but concentrates on the uses of spoken language.
1 See Moore (in this volume) for a discussion of how these components can be integrated.
Three forms of speech synthesis technology exist:
Digitized speech. To produce an utterance, the machine assembles and plays back previously recorded and compressed samples of human speech. Although a noticeable break between samples can often be heard, and the overall intonation may be inaccurate, such a synthesis process can offer human-sounding speech of high intelligibility. This process is, however, limited to producing combinations of the recorded samples.
Text-to-speech. Text-to-speech synthesis involves an automated analysis of the structure of words into their morphological constituents. By combining the pronunciations of those subword units according to letter- and morph-to-sound rules, coupled with a large list of exceptional pronunciations (for English), arbitrary text can be rendered as speech. Because this technology can handle open-ended text, it is suitable for large-scale applications, such as reading text aloud to blind users or reading electronic mail over the telephone. Text-to-speech science and technology are covered at length elsewhere in this volume (see Allen, in this volume, and Carlson, in this volume).
Concept-to-speech. With text-to-speech systems, the text to be converted is supplied from a human source. Future dialogue systems will require computers to decide for themselves what to say and how to say it in order to arrive at a meaningful and contextually appropriate dialogue contribution. Such systems need to determine what speech action(s) to perform (e.g., request, suggestion), how to refer to entities in the utterance, what to say about them, what grammatical forms to use, and what intonation to apply. Moreover, the utterance should contribute to the course of the dialogue, so the system should keep a representation of what it has said in order to analyze and understand the user's subsequent utterances.
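The combination of an exception list with letter-to-sound rules, described above for text-to-speech synthesis, can be sketched as follows. This is a toy illustration only: the exception entries, the rule table, and the phoneme symbols are invented assumptions, and the morphological analysis step is omitted entirely.

```python
# Hypothetical exception list: words whose pronunciation the rules get wrong.
EXCEPTIONS = {"one": "W AH N", "two": "T UW"}

# Hypothetical letter-to-sound rules; two-letter rules are tried first.
RULES = {
    "sh": "SH", "ch": "CH", "th": "TH",
    "i": "IH", "e": "EH", "o": "AA",
    "p": "P", "n": "N", "t": "T", "s": "S",
}

def letters_to_sounds(word):
    """Apply the longest matching rule at each position in the word."""
    sounds = []
    i = 0
    while i < len(word):
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in RULES:
                sounds.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            sounds.append("?")  # no rule covers this letter
            i += 1
    return " ".join(sounds)

def pronounce(word):
    """Consult the exception list first, then fall back to the rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return letters_to_sounds(word)

print(pronounce("ship"))  # handled by rules
print(pronounce("one"))   # handled by the exception list
```

Because the fallback rules apply to any letter string, a scheme of this kind can produce some pronunciation for arbitrary text, which is the property that makes text-to-speech suitable for open-ended input.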
The research areas of speech synthesis and language generation have received considerably less attention than speech recognition and understanding but will increase in importance as the possibility of developing spoken dialogue systems becomes realizable.
The remainder of this paper explores current and future application areas in which spoken interaction may be a preferred modality of communication with computers. First, factors that may influence the desirability and efficiency of voice-based interaction with computers are identified, independent of whether a simple command language or a quasi-natural language is being spoken. Then, we discuss spoken language interaction, comparing it both to keyboard-based interaction and to the currently dominant graphical user-interface paradigm. After identifying circumstances that favor spoken language interaction, gaps in the scientific knowledge base of spoken communication are identified that present obstacles to the development of spoken language-based systems. It is observed that future systems will be multimodal, with voice being only one of the communication modalities available. We conclude with suggestions for further research that needs to be undertaken to support the development of voice-based unimodal and multimodal systems and argue that there is a pressing need to create empirically based human interface guidelines for system developers before voice-based technology can fulfill its potential.
WHEN IS SPOKEN INTERACTION WITH COMPUTERS USEFUL?
As yet there is no theory or categorization of tasks and environments that would predict, all else being equal, when voice would be a preferred modality of human-computer communication. Still, a number of situations have been identified in which spoken communication with machines may be advantageous:
• when the user's hands or eyes are busy,
• when only a limited keyboard and/or screen is available,
• when the user is disabled,
• when pronunciation is the subject matter of computer use, or
• when natural language interaction is preferred.
We briefly examine the present and future roles of spoken interaction with computers for these environments. Because spoken natural language interaction is the most difficult to implement, we discuss it extensively in the section titled "Natural Language Interaction."
The classic situation favoring spoken interaction with machines is one in which the user's hands and/or eyes are busy performing some other task. In such circumstances, by using voice to communicate with the machine, people are free to pay attention to their task, rather than breaking away to use a keyboard. Field studies suggest that, for example, F-16 pilots who can attain a high speech recognition rate can perform missions, such as formation flying or low-level navigation, faster and more accurately when using spoken control over various avionics subsystems, as compared with keyboard and multifunction-button data entry (Howard, 1987; Rosenhoover et al., 1987; Williamson, 1987). Similar results have been found for helicopter pilots in noisy environments during tracking and communications tasks (Simpson et al., 1982, 1985; Swider, 1987).2
Commercial hands/eyes-busy applications also abound. For instance, wire installers who spoke a wire's serial number and then were guided verbally by computer to install that wire achieved a 20 to 30 percent speedup in productivity, with improved accuracy and lower training time, over their prior manual method of wire identification and installation (Marshall, 1992). Parcel sorters who spoke city names instead of typing destination-labeled keys attained a 37 percent improvement in entry time during hands/eyes-busy operations (Visick et al., 1984). However, when the hands/eyes-busy component of parcel sorting was removed, spoken input offered no distinct speed advantages. In addition, VLSI circuit designers were able to complete 24 percent more tasks when spoken commands were available than when they used only a keyboard and mouse interface (see the section titled "Direct Manipulation") (Martin, 1989). Although individual field studies are rarely conclusive, many field studies of highly accurate speech recognition systems with hands/eyes-busy tasks have found that spoken input leads to higher task productivity and accuracy.
Not only does spoken input offer efficiency gains for a given hands/eyes-busy task, it also offers the potential to change the nature of that task in beneficial ways. For example, instead of having to remember and speak or type the letters "YYZ" to indicate a destination airport, a baggage handler could simply say "Toronto," thereby using an easy-to-remember name (Martin, 1989; Nye, 1982). Similar potential advantages are identified for voice-based telephone dialers, to which one can say "Call Tom," rather than having to remember and input a phone number (Rabiner et al., 1980). Other hands/eyes-busy applications that might benefit from voice interaction include data entry and machine control in factories and field applications (Martin, 1976), access to information for military command-and-control, astronauts' information management during extravehicular access in space, dictation of medical diagnoses (Baker, 1991), maintenance and repair of equipment, control of automobile equipment (e.g., radios, telephones, climate control), and navigational aids (Streeter et al., 1985).
2 Further discussion of speech recognition for military environments can be found in (Weinstein, 1991, in this volume).
A major factor determining success for speech input applications is speech recognition accuracy. For example, the best task performance reported during F-16 test flights was obtained once pilots attained isolated word recognition rates greater than 95 percent. Below 90 percent, the effort needed to correct recognition errors was said to outweigh the benefits gained for the user (Howard, 1987). Similar results showing the elimination of benefits once error correction is considered also have been found in tasks as simple as entering connected digits (Hauptmann and Rudnicky, 1990).
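Recognition rates such as those quoted above are commonly scored by aligning the recognizer's output against a reference transcript and counting word substitutions, insertions, and deletions. The sketch below assumes that common edit-distance definition of word error rate; the studies cited may have scored accuracy differently, and the example utterances are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dist[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                  # deletion
                dist[i][j - 1] + 1,                  # insertion
                dist[i - 1][j - 1] + substitution,   # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four reference words
print(word_error_rate("select radio channel two", "select radio channel ten"))
```

A recognition rate of 95 percent corresponds, under this scoring, to roughly one misrecognized word in every twenty spoken.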
To attain a sufficiently high level of recognition accuracy in field tests, spoken input has been severely constrained to allow only a small number of possible words at any given time. Still, even with such constraints, accuracy in the field often lags that of laboratory tests because of many complicating factors, such as the user's physical and emotional state, ambient noise, microphone equipment, the demands of real tasks, methods of user and system training, and individual differences encountered when an array of real users is sampled. However, it is claimed that most failures of speech technology have been the result of human factors engineering and management (Lea, 1992), rather than low recognition accuracy per se. Human factors issues are discussed further below and by Kamm (in this volume).
Limited Keyboard/Screen Option
The most prevalent current uses of speech synthesis and recognition are telephone-based applications. Speech synthesizers are commonly used in the telecommunications industry to support directory assistance by speaking the desired telephone number to the caller, thereby freeing the operator to handle another call. Speech recognizers have been deployed to replace or augment operator services (e.g., collect calls), handling hundreds of millions of callers each year and resulting in multimillion dollar savings (Lennig, 1989; Nakatsu, in this volume; Wilpon, in this volume). Speech recognizers for telecommunications applications accept a very limited vocabulary, perhaps spotting only certain key words in the input, but they need to function with high reliability for a broad spectrum of the general public. Although not as physically severe as avionic or manufacturing applications, telecommunications applications are difficult because callers receive little or no training about use of the system and may have low-quality equipment, noisy telephone lines, and unpredictable ambient noise levels. Moreover, caller behavior is difficult to predict and channel (Basson, 1992; Kamm, in this volume; Spitz, 1991).3
The considerable success at automating the simpler operator services opens the possibility for more ambitious telephone-based applications, such as information access from remote databases. For example, the caller might inquire about airline and train schedules (Advanced Research Projects Agency, 1993; Proceedings of the Speech and Natural Language Workshop, 1991; Peckham, 1991), yellow pages information, or bank account balances (Nakatsu, in this volume), and receive the answer auditorily. This general area of human-computer interaction is much more difficult to implement than simple operator services because the range of caller behavior is quite broad and because speech understanding and dialogue participation are required rather than just word recognition. When even modest quantities of data need to be conveyed, a purely vocal interaction may be difficult to conduct, although the advent of "screen phones" may well improve such cases.
Perhaps the most challenging potential application of telephone-based spoken language technology is interpreting telephony (Kurematsu, 1992; Roe et al., 1991), in which two callers speaking different languages can engage in a dialogue mediated by a spoken language translation system (Kitano, 1991; Yato et al., 1992). Such systems are currently designed to incorporate speech recognition, machine translation, and speech synthesis subsystems and to interpret one sentence at a time. A recent initial experiment organized by ATR International (Japan), with Carnegie-Mellon University (USA) and Siemens A.G. (Germany), involved Japanese-English and Japanese-German machine-interpreted dialogues (Pollack, 1993; Yato et al., 1992). Utterances in one language were recognized and translated by a local computer, which sent a translated textual rendition to the foreign site, where text-to-speech synthesis took place. AT&T has demonstrated a limited-domain spoken English-Spanish translation system (Roe et al., 1991), although not a telephone-based one, and Nippon Electric Corporation has demonstrated a similar Japanese-English system.
Apart from the use of telephones, a second equipment-related factor favoring voice-based interaction is the ever-decreasing size of portable computers. Portable computing and communications devices will soon be too small to allow for use of a keyboard, implying that the input modalities for such machines will most likely be digitizing pen and voice (Crane, 1991; Oviatt, 1992), with screen and voice providing system output. Given that these devices are intended to supplant both computer and telephone, users will already be speaking through them. A natural evolution of the devices will offer the user the capability to speak to them as well.
3 An excellent review of the human factors and technical difficulties encountered in telecommunications applications of speech recognition can be found in Karis and Dobroth (1991).
Finally, an emerging use of voice technology is to replace the many control buttons on consumer electronic devices (e.g., VCRs, receivers). As the number of user-controllable functions on these devices increases, the user interface becomes overly complex and can lead to confusion over how to perform even simple tasks. Products have recently been announced that allow users to program their devices using simple voice commands.
A major potential use of voice technology will be to assist deaf users in communicating with the hearing world using a telephone (Bernstein, 1988). Such a system would recognize the hearing person's speech, render it as text, and synthesize the deaf person's textual reply (if using a computer terminal) as a spoken utterance. Another use of speech recognition in assisting deaf users would be captioning television programs or movies in real time. Speech recognition could also be used by motorically impaired users to control suitably augmented household appliances, wheelchairs, and robotic prostheses. Text-to-speech synthesis can assist users with speech and motor impediments; can assist blind users with computer interaction; and, when coupled with optical character recognition technology, can read printed materials to blind users. Finally, given sufficiently capable speech recognition systems, spoken input may become a prescribed therapy for repetitive stress injuries, such as carpal tunnel syndrome, which are estimated to afflict approximately 1.5 percent of office workers in occupations that typically involve the use of keyboards (Tanaka et al., 1993), although speech recognizers may themselves lead to different repetitive stress injuries (Markinson, personal communication, 1993).4
4 The general subject of "assistive technology" is covered at length by H. Levitt (in this volume), and a survey of speech recognition for rehabilitation can be found in Bernstein (1988).
Subject Matter Is Pronunciation
Speech recognition will become a component of future computer-based aids for foreign language learning and for the teaching of reading (Bernstein and Rtischev, 1991; Bernstein et al., 1990; Mostow et al., 1993). For such systems, speakers' pronunciation of computer-supplied texts would be analyzed and given as input to a program for teaching reading or foreign languages. Whereas these may be easier applications of speech recognition than some because the words being spoken are supplied by the computer, the recognition system will still be confronted with mispronunciations and slowed pronunciations, requiring a degree of robustness not often considered in other applications of speech recognition. Substantial research will also be needed to develop and field test new educational software that can take advantage of speech recognition and synthesis for teaching reading. This is perhaps one of the most important potential applications of speech technology because the societal implications of raising literacy levels on a broad scale are enormous.
As with speech input, the factors favoring voice output are only informally understood. Just as tasks with a high degree of visual or manual activity may be more effectively accomplished using spoken input, such tasks may also favor spoken system output. A user could concentrate on a task rather than altering his or her gaze to view a system display. Typical application environments include flying a plane, in which the pilot could receive information about the status of the plane's subsystems during critical phases of operation (e.g., landing, high-speed maneuvering), and driving a car, in which the driver would be receiving navigational information in the course of driving. Other factors thought to favor voice output include remote access to information services over the telephone, lack of reading skills, darkened environments, and the need for omnidirectional information presentation, as in the issuing of warnings in cockpits, control rooms, factories, etc. (Simpson et al., 1985; Thomas et al., 1984).
There are numerous studies of speech synthesis, but no clear picture has emerged of when computer-human communication using speech output is most effective or preferred. Psychological research has investigated the intelligibility, naturalness, comprehensibility, and recallability of synthesized speech (Luce et al., 1983; Nusbaum and Schwab, 1983; Simpson et al., 1985; Thomas et al., 1984). Intelligibility and naturalness are orthogonal dimensions in that synthetic speech present in an environment of other human voices may be intelligible but unnatural. Conversely, human speech in a noisy environment may be natural but unintelligible (Simpson et al., 1985). Many factors influence the intelligibility of synthesized speech in an actual application environment, including the baseline phoneme intelligibility, speaking rate, signal-to-noise level, and presence of other competing voices, as well as the linguistic and pragmatic contexts (Simpson and Navarro, 1984; Simpson et al., 1985).
The desirability of voice output depends on the application environment. Pilots prefer to hear warnings with synthetic speech rather than digitized speech, as the former is more easily distinguished from other voices, such as radio traffic (Voorhees et al., 1983). However, in simulations of air traffic control systems, in which pilots would expect to interact with a human, digitized human speech was preferred to computer synthesized speech (Simpson et al., 1985). Users may prefer to receive information visually, either on a separate screen or on a heads-up display (Swider, 1987), reserving spoken output for critical warning messages (Simpson et al., 1985). Much more research is required in order to determine those types of information processing environments for which spoken output is beneficial and preferred. Furthermore, rather than just concentrating on the benefits of speaking an utterance as compared with other modes of presenting the same information, future research needs to evaluate user performance and preferences as a function of the content of what is being communicated, especially if the computer will be determining that content (e.g., the generation of navigational instructions for drivers). Finally, research is critically necessary to develop algorithms for determining the appropriate intonation contours to use during a spoken human-computer dialogue.
There are numerous existing applications of voice-based human-computer interaction, and new opportunities are developing rapidly. In many applications for which the user's input can be constrained sufficiently to allow for high recognition accuracy, voice input has been found to lead to faster task performance with fewer errors than keyboard entry. Unfortunately, no principled method yet exists to predict when voice input will be the most effective, efficient, or preferred modality of communication. Similarly, no comprehensive analysis has identified the circumstances when voice will be the preferred or most efficient form of computer output, though again hands/eyes-busy tasks may also be among the leading candidates for voice output.
One important circumstance favoring human-computer communication by voice is when the user wishes to interact with the machine in a natural language, such as English. The next section discusses such spoken language communication.
COMPARISON OF SPOKEN LANGUAGE WITH OTHER COMMUNICATION MODALITIES
A user who will be speaking to a machine may expect to be able to speak in a natural language, that is, to use ordinary linguistic constructs such as noun and verb phrases. Conversely, if natural language interaction is chosen as a modality of human-computer communication, users may prefer to speak rather than type. In either case, users may expect to be able to engage in a dialogue, in which each party's utterance sets the context for interpreting subsequent utterances. We first discuss the status of the development of spoken language systems and then compare spoken language interaction with typed interaction.
Spoken Language System Prototypes
Research is progressing on the development of spoken language question-answering systems, that is, systems that allow users to speak their questions freely and that then understand those questions and provide an accurate reply. The Advanced Research Projects Agency-supported air travel information systems, ATIS (Advanced Research Projects Agency, 1993), developed at Bolt, Beranek, and Newman (Kubala et al., 1992), Carnegie-Mellon University (Huang et al., 1993), the Massachusetts Institute of Technology (Zue et al., 1992), SRI International (Appelt and Jackson, 1992), and other institutions, allow novice users to obtain information in real time from the Official Airline Guide database, through the use of speaker-independent, continuously spoken English questions. The systems recognize the words in the user's utterance, analyze the meaning of those utterances, often in spite of word recognition errors, retrieve information from (a subset of) the Official Airline Guide's database, and produce a tabular set of answers that satisfy the question. These systems respond with the correct table of flights for over 70 percent of context-independent questions, such as "Which flights depart from San Francisco for Washington after 7:45 a.m.?" Rapid progress has been made in the development of these systems, with a 4-fold reduction in weighted error rates over a 20-month period for speech recognition, a 3.5-fold reduction over a 30-month period for natural language understanding, and a 2-fold reduction over a 20-month period for their combination as a spoken language understanding system. Other major efforts to develop spoken dialogue systems are ongoing in Europe (Mariani, 1992; Peckham, 1991) and Japan (Yato et al., 1992).
Much of the language processing technology used for spoken language understanding has been based on techniques for keyboard-based natural language systems.5 However, spoken input presents qualitatively different problems for language understanding that have no analog in keyboard interaction.
Spoken Language vs. Typed Language
In our review of findings about linguistic communication relevant to spoken human-computer interaction, some results are based on analyses of human-human interaction, some are based on human-to-simulated-computer interaction, and some are based on human-computer interaction. Studies of human-human communication can identify the communicative capabilities that people bring to their interactions with computers and can show what could be achieved were computers adequate conversationalists. However, because this level of conversational competence will be unachievable for some time, scientists have developed techniques for simulating computer systems that interact via spoken language (Andry et al., 1990; Fraser and Gilbert, 1991; Gould et al., 1983; Guyomard and Siroux, 1988; Leiser, 1989; Oviatt et al., 1992, 1993a; Pavan and Pelletti, 1990; Price, 1990) by using a concealed human assistant who responds to the spoken language. With this method, researchers can analyze people's language, dialogue, task performance, and preferences before developing fully functional systems.
Important methodological issues for such simulations include providing accurate and rapid response, and training the simulation assistant to function appropriately. Humans engage in rapid spoken interaction and bring expectations for speed to their interaction with computers. Slow interactions can cause users to interrupt the system with repetitions while the system is processing their earlier input (VanKatwijk et al., 1979) and, it is conjectured, can also elicit phenomena characteristic of noninteractive speech (Oviatt and Cohen, 1991a). One technique used to speed up such voice-in/voice-out simulations is the use of a vocoder, which transforms the assistant's naturally spoken response into a mechanical-sounding utterance (Fraser and Gilbert, 1991; Guyomard and Siroux, 1988). The speed of the "system" is thus governed by the assistant's knowledge and reaction time, as well as the task at hand, but not by speech recognition, language understanding, and speech synthesis. However, because people speak differently to a computer than they do to a person (Fraser and Gilbert, 1991), even to prompts for simple yes/no answers (Basson, 1992; Basson et al., 1989), the assistant should not provide too intelligent a reply, as this might reveal the "system" as a simulation. A second simulation method, which both constrains the simulation assistant and supports a rapid response, is to provide the assistant with certain predefined fields and structures on the screen that can be selected to reply to the subject (Andry et al., 1990; Dahlback et al., 1992; Leiser, 1989; Oviatt et al., 1992). More research is needed into the development of simulation methodologies that can accurately model spoken language systems, such that patterns of interaction with the simulator are predictive of interaction patterns with the actual spoken language system.
5 For a discussion of the state of research and technology of natural language processing, see Bates (in this volume).
Comparison of Language-Based Communication Modalities
In a series of studies of interactive human-human communication, Chapanis and colleagues (Chapanis et al., 1972, 1977; Kelly and Chapanis, 1977; Michaelis et al., 1977; Ochsman and Chapanis, 1974) compared the efficiency of human-human communication when subjects used any of 10 communication modalities (including face-to-face, voice-only, linked teletypes, interactive handwriting). The most important determinant of a team's problem-solving speed was found to be the presence of a voice component. Specifically, a variety of tasks were solved two to three times faster using a voice modality than a hardcopy one, as illustrated in Figure 1. At the same time, speech led to an 8-fold increase in the number of messages and sentences and a 10-fold increase in the rate of communicating words. These results indicate the substantial potential for efficiency advantages that may result from use of spoken language communication.
Research by the authors confirmed these efficiency results in human-human dialogues to perform equipment assembly tasks (Cohen, 1984; Oviatt and Cohen, 1991b), finding a 3-fold speed advantage for interactive telephone speech over keyboard communication. Furthermore, the structure of telephone dialogues differed from that of keyboard dialogues. Among the differences, spoken dialogues exhibited more cue phrases that signaled the structure of the dialogue (such as "next," "ok now"), and speakers interacted in a more "fine-grained" fashion than did keyboard users. Specifically, in order to achieve a subtask, speakers often made two requests, one for object identification and one for action, whereas keyboard users typically integrated both into one imperative utterance. Similar findings of a fine-grained approach during spoken interaction versus a more syntactically integrated approach for keyboard interaction have been found in a study of simulated human-computer interaction (Zoltan-Ford, 1991). Finally, spoken input was more "indirect" than keyboard input. That is, unlike keyboard interaction, spoken utterances did not literally convey the speaker's intention that the listener perform an action (Cohen, 1984). Future research needs to address the extent to which such results generalize to spoken human-computer interaction for comparable tasks.
One benefit of voice input is the elimination of typing, which could offer potential office productivity savings (Baker, 1991; Jelinek, 1985). In a study of a simulated "listening typewriter," Gould et al. (Gould, 1978, 1982; Gould et al., 1983) examined how novice and expert users of dictation would use a machine that could recognize and type the user's dictation of a business letter, as compared with dictating and editing the letter to a human or handwriting and editing the letter. The listening typewriter system was simulated, and the subjects were informed that they were in fact speaking to a person. It was claimed that users of a listening typewriter were as satisfied with that mode of communication as with the others and that dictating to a listening typewriter could potentially be as fast a mode of letter composition as typing. There is, however, countervailing evidence from a number of simulation studies (Murray et al., 1991; Newell et al., 1990) that speech-only word processors are less efficient and less preferred than composition methods based on writing or typing. Moreover, a combined method of using speech for text input and a touch screen for cursor control was more efficient than speech alone, though still less efficient than composition and editing using keyboards or handwriting.
Neither series of studies examined in detail the linguistic and discourse structure of the dictated material that might explain why spoken composition and editing are less efficient than other modalities. In a study of human-human communication it was found that inexperienced "dictators" providing instructions for a human listener produced more discourse structures that would require editing in order to make acceptable text, such as repetitions, elaborations, and unusual uses of referring expressions, than did users of interactive speech or interactive keyboard (Oviatt and Cohen, 1991a, 1991b). Thus, lack of interaction with a listener may contribute to poorly formulated input, placing a larger burden on the postediting phase where speech input is less efficient (Newell et al., 1990). In summary, though automatic dictation devices have been much touted as an important product concept for speech technology, their potential benefit remains a question.
The space of modality studies has not yet been systematically explored. We do not know precisely how results from human-human communication studies can predict results for studies of human-simulation or human-computer interactions. Also, more studies comparing the structure and content of spoken human-computer language with typed human-computer language need to be conducted in order to understand how to adapt technology developed for keyboard interaction to spoken language systems.
Common to many successful applications of voice-based technology is the lack of an adequate alternative to voice, given the task and environment of computer use. Major questions remain as to the applications where voice will be favored when other modalities of communication are possible. Some studies report a decided preference for speech when compared to other modalities (Rudnicky, 1993), yet other studies report an opposite conclusion (Murray et al., 1991; Newell et al., 1990). Thus, despite the aforementioned potential benefits of human-computer interaction using voice, it is not obvious why people should want to speak to their computers in performing their daily office work. To provide a framework for answering this question, the discussion below compares the currently dominant direct-manipulation user interface with typed or spoken natural language.
Comparison of Natural Language Interaction with Alternative Modalities
Numerous alternative modalities of human-computer interaction exist, such as the use of keyboards for transmitting text, pointing and gesturing with devices such as the mouse, a digitizing pen, trackballs, touchscreens, and digitizing gloves. It is important to understand what role speech, and specifically spoken language, can play in supporting human interaction, especially when these other modalities are available. To begin this discussion, we need to identify properties of successful interfaces. Ideally, such an interface should be:
Error free. The interface should prevent the user from formulating erroneous commands, should minimize misinterpretations of the user's intent, and should offer simple methods for error correction.
Transparent. The functionality of the application system should be obvious to the user.
High-level. The user should not have to learn the underlying computer structures and languages but rather should be able to state simply his or her desires and have the system handle the details.
Consistent. Strategies that work for invoking one computer function should transfer to the invocation of others.
Easy to learn. The user should not need formal training, but rather a brief process of exploration should suffice for learning how to use a given system.
Expressive. The user should be able to perform easily any combination of tasks in mind, within the bounds of the system's intended functionality.
Using this set of properties, we discuss the use of direct manipulation and natural language technologies.
The graphical user-interface paradigm involves a style of interaction that offers the user menus, icons, and pointing devices (e.g., the "mouse" [English et al., 1967]) to invoke computer commands, as well as multiple windows in which to display the output. These graphical user interfaces (GUIs), popularized by the Apple Macintosh and by Microsoft Windows, use techniques pioneered at SRI International and at Xerox's Palo Alto Research Center in the late 1960s and 1970s (Englebart, 1973; Kay and Goldberg, 1977). With GUIs, users perform actions by selecting objects and then choosing the desired action from a menu, rather than by typing commands.
In addition, with many GUIs a user can directly manipulate graphical objects in order to perform actions on the objects they represent. For example, a user can copy a file from one disk to another by selecting its icon with the pointing device and "dragging" it from the list of files on the first disk to the second. Other direct manipulation actions include using a "scroll bar" to view different sections of a file and dragging a file's icon on top of the "trash" icon to delete it. Apart from the mouse, numerous pointing devices exist, such as trackballs and joysticks, and some devices offer multiple capabilities, such as the use of pens for pointing, gesturing, and handwriting. Finally, to generalize along a different dimension, users now can directly manipulate virtual worlds using computer-instrumented gloves and bodysuits (Fisher, 1990; Kreuger, 1977; Rheingold, 1991), allowing for subtle effects of body motion to affect the virtual environment.
Strengths Many writers have identified virtues of well-designed graphically based direct manipulation interfaces (DMIs) (e.g., Hutchins et al., 1986; Shneiderman, 1983), claiming that
• Direct manipulation interfaces based on familiar metaphors are intuitive and easy to use.
• Graphical user interfaces can have a consistent "look and feel" that enables users of one program to learn another program quickly.
• Menus make the available options clear, thereby curtailing user errors in formulating commands and specifying their arguments.
• GUIs can shield the user from having to learn underlying computer concepts and details.
It is no exaggeration to say that graphical user interfaces supporting direct manipulation interaction have been so successful that no serious computer company would attempt to sell a machine without one.
Weaknesses Direct manipulation interfaces do not suffice for all needs. One clear expressive weakness is the paucity of means available for identifying entities. Merely allowing users to select currently displayed entities provides them with little support for identifying objects not on the screen (such as a file name in a list of 200 files), for specifying temporal relations that denote future or past events, for identifying and operating on large sets of entities, and for using the context of interaction. At most, developers of GUIs have provided simple string-matching routines that find objects based on exact or partial matches of their names. What is missing is a way for users to describe entities using some form of linguistic expression in order to denote or pick out an individual object, a set of objects, a time period, and so forth.6 At a minimum, a description language should include some way to find entities that have a given set of properties, to say which properties are of interest as well as which are not, to say how many entities are desired, to supply temporal constraints on actions involving those properties, and so forth. Moreover, a useful feature of a description language is the ability to reuse the referents of previous descriptions. Some of these capabilities are found in formal query languages, and all are found in natural languages.
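To make these requirements concrete, the following minimal Python sketch shows one way a description could pick out entities by property, exclude a property, limit how many are returned, apply a temporal constraint, and reuse the referents of an earlier description. Everything in it (the `Entity` record, the `describe` function, and the sample files) is hypothetical and invented for illustration; it does not correspond to any system discussed in this paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Entity:
    """A hypothetical file record; the fields are illustrative assumptions."""
    name: str
    kind: str
    modified: date


def describe(
    candidates: Iterable[Entity],
    having: Optional[Callable[[Entity], bool]] = None,
    lacking: Optional[Callable[[Entity], bool]] = None,
    modified_before: Optional[date] = None,
    how_many: Optional[int] = None,
) -> List[Entity]:
    """Return the entities that satisfy a description.

    having          -- property the entities must have
    lacking         -- property the entities must not have
    modified_before -- a simple temporal constraint
    how_many        -- upper bound on the number of entities wanted
    """
    selected = []
    for entity in candidates:
        if having is not None and not having(entity):
            continue
        if lacking is not None and lacking(entity):
            continue
        if modified_before is not None and not entity.modified < modified_before:
            continue
        selected.append(entity)
    return selected if how_many is None else selected[:how_many]


files = [
    Entity("budget.txt", "text", date(1993, 1, 5)),
    Entity("draft.txt", "text", date(1993, 3, 9)),
    Entity("logo.img", "image", date(1993, 2, 1)),
    Entity("notes.txt", "text", date(1993, 2, 20)),
]

# "the text files modified before March 1993"
old_text = describe(files, having=lambda e: e.kind == "text",
                    modified_before=date(1993, 3, 1))

# "the first of those that is not named budget.txt"
# Passing old_text as the candidates reuses the referents of the first description.
them = describe(old_text, lacking=lambda e: e.name == "budget.txt", how_many=1)
```

The second call takes the result of the first as its candidate set, which is the sketch's stand-in for anaphoric reuse of earlier referents; a natural language would express the same thing with a pronoun such as "those."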
Although shielding a user from implementation details, direct manipulation interfaces are often not high level. For example, one common way to request information from a relational database is to select certain fields from tables that one wants to see. To do this correctly, the user needs to learn the structure of the database, for example, that the data are represented in one or more tables, comprised of numerous fields, whose meanings may not be obvious. Thus, the underlying tabular implementation has become the user interface metaphor. An alternative is to develop systems and interfaces that translate between the user's way of thinking about the problem and the implementation. In so doing, the user might perhaps implicitly retrieve information but need not know that it is kept in a database, much less learn the structure of that database. By engaging in such a high-level interaction, users may be able to combine information access with other information processing applications, such as running a simulation, without first having to think about database retrieval, and then switching "applications" mentally to think about simulation.
6 Of course, the elimination of descriptions was a conscious design decision by the originators of GUIs.
When numerous commands are possible, GUIs usually present a hierarchical menu structure. As the number of commands grows, the casual user may have difficulty remembering in which menu they are located. However, the user who knows where the desired action is located in a large action hierarchy still needs to navigate the hierarchy. Software designers have attempted to overcome this problem by providing different menu sets for users of different levels of expertise, by preselecting the most recently used item in a menu, and by providing direct links to commonly used commands through special key combinations. However, in doing the latter, GUIs are borrowing from keyboard-based interfaces and command languages.
Because direct manipulation emphasizes rapid graphical response to actions (Shneiderman, 1983), the time of system action in DMIs is literally the time at which the action was invoked. Although some systems can delay actions until specific future times, DMIs and GUIs offer little support for users who want to execute actions at an unknown but describable future time.
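As a rough illustration of an action bound to a describable rather than a fixed future time, the Python sketch below stores a rule as a condition paired with an action and fires it when a later event satisfies the condition. The `Rule` class, the event dictionaries, and the `notify` stand-in are all hypothetical names introduced here; they are not part of any interface described in this paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, object]


@dataclass
class Rule:
    """A condition describing some future situation, paired with an action."""
    condition: Callable[[Event], bool]
    action: Callable[[Event], str]


def run_rules(rules: List[Rule], events: List[Event]) -> List[str]:
    """Apply every rule to each event as it arrives and collect the results."""
    log = []
    for event in events:
        for rule in rules:
            if rule.condition(event):
                log.append(rule.action(event))
    return log


def notify(event: Event) -> str:
    # Stand-in for a real notification; it only builds a message string.
    return "notify participants of " + str(event["meeting"])


# "If I am late for a meeting, notify the meeting participants."
late_rule = Rule(
    condition=lambda e: e["type"] == "meeting_started" and not e["user_present"],
    action=notify,
)

events = [
    {"type": "meeting_started", "meeting": "budget review", "user_present": True},
    {"type": "meeting_started", "meeting": "design review", "user_present": False},
]

log = run_rules([late_rule], events)
```

The point of the sketch is that the time of execution is never stated; it is determined by whichever event first matches the description.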
Finally, DMIs rely heavily on a user's hands and eyes. Given our earlier discussion, certain tasks would be better performed with speech. So far, however, there is little research comparing graphical user interfaces with speech. Early laboratory results of a direct-manipulation VLSI design system augmented with speaker-dependent speech recognition indicate that users were as fast at speaking single-word commands as they were at invoking the same commands with mouse-button clicks or by typing a single letter command abbreviation (Martin, 1989). That is, no loss of efficiency occurred due to use of speech for simple tasks at which DMIs typically excel. Note that a 2- to 3-fold advantage in speed is generally found when speaking is compared to typing full words (Chapanis et al., 1977; Oviatt and Cohen, 1991b). In a recent study of human-computer interaction to retrieve information from a small database (240 entries), it was found that speech was substantially preferred over direct-manipulation use of scrolling, even though the overall time to complete the task with voice was longer (Rudnicky, 1993). This study suggests that, for simple risk-free tasks, user preference may be based on time to input rather than overall task completion times or overall task accuracy.
Natural Language Interaction
Strengths Natural language is the paradigmatic case of an expressive mode of communication. A major strength is the use of psychologically salient and mnemonic descriptions. English, or any other natural language, provides a set of finely honed descriptive tools such as the use of noun phrases for identifying objects, verb phrases for identifying events, and verb tense and aspect for describing time periods. By the very nature of sentences, these capabilities are deployed simultaneously, as sentences must be about something, and most often describe events situated in time.
Coupled with this ability to describe entities, natural languages offer the ability to avoid extensive redescription through the use of pronouns and other "anaphoric" expressions. Such expressions are usually intended to denote the same entities as earlier ones, and the recipient is intended to infer the connection. Thus, the use of anaphora provides an economical benefit to the speaker, at the expense of the listener's having to draw inferences.
Furthermore, natural language commands can offer a direct route to invoking an action or making selections that would be deeply embedded in the hierarchical menu of actions or would require multiple menu selections, such as font and type style and size in a word processing program. In using such commands, a user could avoid having to select numerous menu entries to isolate the desired action. Moreover, because the invocation of an action may involve a description of its arguments, information retrieval is intimately woven into the invocation of actions.
Ideally, natural language systems should require only a minimum of training on the domain covered by the target system. Using natural language, people should be able to interact immediately with a system of known content and functionality, without having to learn its underlying computer structures. The system should have sufficient vocabulary, as well as linguistic, semantic, and dialogue capabilities, to support interactive problem solving by casual users, that is, users who employ the system infrequently. For example, at the present state of development, many users can successfully solve trip planning problems with one of the ATIS systems (Advanced Research Projects Agency, 1993), within a few minutes of introduction to the system and its coverage. To develop systems with this level of robustness, the system must be trained and tested on a substantial amount of data representing input from a broad spectrum of users.7 Currently, the level of training required to achieve a given level of proficiency in using these systems is unknown.
Weaknesses In general, various disadvantages are apparent when natural language is incorporated into an interface. Pure natural language systems suffer from opaque linguistic and conceptual coverage: the user knows the system cannot interpret every utterance but does not know precisely what it can interpret (Hendrix and Walter, 1987; Murray et al., 1991; Small and Weldon, 1983; Turner et al., 1984). Often, multiple attempts must be made to pose a query or command that the system can interpret correctly. Thus, such systems can be error prone and, as some claim (Shneiderman, 1980), lead to frustration and disillusionment. One way to overcome these problems was suggested in a menu-based language processing system in which users composed queries in a quasi-natural language by selecting phrases from a menu (Tennant et al., 1983). Although the resulting queries are guaranteed to be analyzable, when there is a large number of menu choices to make, the query process becomes cumbersome.
Many natural language sentences are ambiguous, and parsers often find more ambiguities than people do. Hence, a natural language system often engages in some form of clarification or confirmation subdialogue to determine if its interpretation is the intended one. Current research is attempting to handle the ambiguity of natural language input by developing probabilistic parsing algorithms for which analyses would be ranked by their probability of occurrence in the given domain (see Marcus, this volume). Also, research is beginning to investigate the potential for using prosody to choose among ambiguous parses (Bear and Price, 1990; Price et al., 1991). A third research direction involves minimizing ambiguities through multimodal interface techniques to channel the user's language (Cohen, 1991b; Cohen et al., 1989; Oviatt et al., 1993).
7 The ATIS effort has required the collection and annotation of over 10,000 user utterances, some of which are used for system development and the rest for testing during comparative evaluations conducted by the National Institute of Standards and Technology.
Another disadvantage of natural language interaction is that reference resolution algorithms do not always supply the correct answer, in part because systems have underdeveloped knowledge bases and in part because the system has little access to the discourse situation, even if the system's prior utterances and graphical presentations have created that discourse situation. To complicate matters, systems currently have difficulty following the context shifts inherent in dialogue. These contextual and world knowledge limitations undermine the search for referents and provide another reason that natural language systems are usually designed to confirm their interpretations.
It is not clear where typed natural language interaction will be a modality of choice. Studies comparing typed natural language database question answering with database querying using an artificial query language (e.g., SQL) (Chamberlin and Boyce, 1974) have given equivocal results, with some studies concluding that natural language interaction offers faster and more compact query formulation (Jarke et al., 1985), while others conclude that database querying using SQL is more accurate and easier to learn (Jarke et al., 1985; Shneiderman, 1980a). However, these studies are flawed by the use of prototype natural language systems rather than product quality systems. When a product quality natural language database retrieval system (INTELLECT; Harris, 1977) was studied in the field, users reported efficiency gains and a clear preference for natural language interaction as compared with a previous query language method of database interaction (Capindale and Crawford, 1990). Another difficulty in many laboratory studies is the lack of adequate controls on subject training. In one study comparing the utility of natural versus query language usage for database access (Shneiderman, 1980b), users in the natural language condition were given virtually no training on the content of a database, with the rationale that natural language systems should require no training, while users of SQL were trained on the file and field names of that database. Not surprisingly, under these conditions, natural language users made more "overshoot" errors, in the sense of asking for information not contained in the database.
Summary: Circumstances Favoring Spoken Language Interaction with Machines
Theoretically, direct manipulation should be beneficial when the objects to be manipulated are on the screen, their identity is known, and there are not too many objects from which to select. In addition, graphical user interfaces limit users' options, preventing them from making errors in formulating commands. Natural language interaction with computers offers potential benefits when users need to identify objects, actions, and events from sets too large to be displayed and/or examined individually and when users need to invoke actions at future times that must be described. Furthermore, natural language allows users to think about their problems and express their goals in their own terms rather than those of the computer. However, in allowing users to do so, systems need to have sufficient reasoning and interpretive capabilities to solve the problems of translating between the user's conceptual model and the system's implementation.
Combining the empirical results on circumstances favoring voice-based interaction with the foregoing analysis of interactions for which natural language may be most appropriate, it appears that applications requiring speedy user input of complex descriptions will favor spoken natural language communication. Moreover, this preference is likely to be stronger when a minimum of training about the underlying computer structures is possible. Examples of such application areas are asking questions of a database and creating rules for action (e.g., "If I am late for a meeting, notify the meeting participants"). Because usable spoken language systems are so recent, there are very few studies comparing spoken language interaction with direct manipulation for accomplishing real tasks.
So far, we have contrasted spoken interaction with other modalities. It is worth noting that these modalities have complementary advantages and disadvantages, which can be leveraged to develop multimodal interfaces that compensate for the weaknesses of one interface technology via the strengths of another (Cohen, 1991; Cohen et al., 1989). (See section titled "Multimodal Systems.")
HUMAN FACTORS OBSTACLES TO SPOKEN LANGUAGE SYSTEMS
Although there are numerous technical challenges to building spoken language systems, many of which are detailed in this volume, interface and human factors knowledge is especially needed about such systems. We consider below information needed about spontaneous speech, spoken natural language, and spoken interaction.
When an utterance is spontaneously spoken, it may well involve false starts, hesitations, filled pauses, repairs, fragments, and other types of technically "ungrammatical" utterances. These phenomena disrupt both speech recognizers and natural language parsers and must be detected and corrected before techniques based on present technology can be deployed robustly. Current research has begun to investigate techniques for detecting and handling disfluencies in spoken human-computer interaction (Bear et al., 1992; Hindle, 1983; Nakatani and Hirschberg, 1993), and robust processing techniques have been developed that enable language analysis routines to recover the meaning of an utterance despite recognition errors (Dowding et al., 1993; Huang et al., 1993; Jackson et al., 1991; Stallard and Bobrow, 1992).
Assessment of different types of human-human and human-computer spoken language has revealed that people's rate of spontaneous disfluencies and self-repairs is substantially lower when they speak to a system rather than to another person (Oviatt, 1993). A strong predictive relationship also has been demonstrated between the rate of spoken disfluencies and an utterance's length (Oviatt, 1993). Moreover, interface research has revealed that form-based techniques can reduce the disfluencies that occur during human-computer interaction by up to 70 percent, so that the system need not resolve them (Oviatt, 1993). In short, research suggests that some difficult types of input, such as disfluencies, may be avoided altogether through strategic interface design.
In general, because the human-machine communication in spoken language involves the system understanding a natural language but not the entire language, users will employ constructs outside the system's coverage. However, it is hoped that given sufficient data on which to base the development of grammars and templates, the likelihood will be small that a cooperative user will generate utterances outside the coverage of the system. Still, it is not currently known:
• how to select relatively "closed" domains, whose vocabulary and linguistic constructs can be acquired through iterative training and testing on a large corpus of user input,
• how well users can discern the system's communicative capabilities,
• how well users can stay within the bounds of those capabilities,
• what level of task performance users can attain,
• what level of misinterpretation users will tolerate, and what level is needed for them to solve problems effectively, and
• how much training is acceptable.
Systems are not adept at handling linguistic coverage problems, other than responding that given words are not in the vocabulary or that the utterance was not understood. Even recognizing that an out-of-vocabulary word has occurred is itself a difficult issue (Cole et al., 1992). If users can discern the system's vocabulary, we can be optimistic that they can adapt to that vocabulary. In fact, human-human communication research has shown that users communicating by typing can solve problems as effectively with a constrained task-specific vocabulary (500 to 1000 words) as with an unlimited vocabulary (Kelly and Chapanis, 1977; Michaelis et al., 1977). User adaptation to vocabulary restrictions has also been found for simulated human-computer interaction (Zoltan-Ford, 1983, 1991), although these results need to be verified for spoken human-computer interaction.
For interactive applications, the user may begin to imitate or model the language observed from the system, and the opportunity is present for the system to play an active role in shaping or channeling the user's language to match that coverage more closely. Numerous studies of human communication have shown that people will adopt the speech styles of their interlocutors, including vocal intensity (Welkowitz et al., 1972), dialect (Giles et al., 1987), and tempo (Street et al., 1983). Explanations for this convergence of dialogue styles include social factors such as the desire for approval (Giles et al., 1987), and psycholinguistic factors associated with memory limitations (Levelt and Kelter, 1982). Similar results have been found in a study of typed and spoken communication to a simulated natural language system (Zoltan-Ford, 1983, 1984), which showed that people will model the vocabulary and length of the system's responses. For example, if the system's responses are terse, the user's input is more likely to be terse as well. In a simulation study of typed natural language database interactions, subjects modeled simple syntactic structures and lexical items that they observed in the system's paraphrases of their input (Leiser, 1989). However, it is not known if the modeling of syntactic structures occurs in spoken human-computer interaction. If users of spoken language systems do learn to adopt the grammatical structures they observe, then new forms of user training may be possible by having system designers adhere to the principle that any messages supplied to a user must be analyzable by the system's parser. One way to guarantee such system behavior would be to require the system to generate its utterances, rather than merely reciting canned text, employing a bidirectional grammar. Any utterances the system could generate using that grammar would thus be guaranteed to be parseable.
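The idea of a single grammar serving both generation and analysis can be sketched in a few lines of Python. The example below is a deliberately tiny stand-in, assuming a "grammar" that is nothing more than a table of slot-filling templates; the template names, wording, and slot names are invented for illustration and are far simpler than the bidirectional grammars used in actual natural language systems.

```python
import re
from typing import Dict, Optional, Tuple

# A toy "bidirectional grammar": each rule is one template that is used both
# to generate system utterances and to parse user utterances.
# The rule names, templates, and slot names are illustrative assumptions.
TEMPLATES: Dict[str, str] = {
    "list_flights": "show flights from {origin} to {destination}",
    "ask_fare": "what is the fare for flight {number}",
}


def generate(rule: str, **slots: str) -> str:
    """Produce an utterance by filling the slots of a template."""
    return TEMPLATES[rule].format(**slots)


def _pattern(template: str):
    """Turn a template into a regular expression with one named group per slot."""
    parts = re.split(r"\{(\w+)\}", template)
    # Even positions hold literal text, odd positions hold slot names.
    regex = "".join(
        re.escape(part) if i % 2 == 0 else "(?P<%s>.+?)" % part
        for i, part in enumerate(parts)
    )
    return re.compile("^" + regex + "$", re.IGNORECASE)


def parse(utterance: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (rule, slots) for the first template that matches, else None."""
    for rule, template in TEMPLATES.items():
        match = _pattern(template).match(utterance.strip())
        if match:
            return rule, match.groupdict()
    return None


# Whatever the system says with generate() can be analyzed by parse(),
# because both functions read the same template table.
prompt = generate("list_flights", origin="Boston", destination="Denver")
analysis = parse(prompt)
```

Because the system's utterances are produced from the same table that drives analysis, a user who imitates the system's phrasing stays within coverage, which is the property the paragraph above argues for. Input that matches no template, such as "please book a hotel", is simply rejected by this sketch.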
A number of studies have investigated methods for shaping users' language into the system's coverage. For telecommunications applications, the phrasing of system prompts for information spoken over the telephone dramatically influences the rate of caller compliance for the expected words and phrases (Basson, 1992; Rubin-Spitz and Yashchin, 1989; Spitz, 1991). For systems with screen-based feedback, human spoken language can be effectively channeled through the use of a form that the user fills out with speech (Oviatt et al., 1993). Form-based interactions reduce the syntactic ambiguity of the user's speech by 65 percent, measured as the number of parses per utterance, thereby leading to user language that is simpler to process. At the same time, for the service transactions analyzed in this study, users were found to prefer forms-based spoken and written interaction over unconstrained ones by a factor of 2 to 1. Thus, not only can people's language be channeled, there appear to be cases where they prefer the guidance and sense of completion provided by a form.
Interaction and Dialogue
When given the opportunity to interact with systems via spoken natural language, users will attempt to engage in dialogues, expecting prior utterances and responses to set a context for subsequent utterances, and expecting their conversational partner to make use of that context to determine the referents of pronouns. Although pronouns and other context-dependent constructs sometimes occur less frequently in dialogues with machines than they do in human-human dialogues (Kennedy et al., 1988), context dependence is nevertheless a cornerstone of human-computer interaction. For example, contextually dependent utterances comprise 44 percent of the ATIS corpus collected for the Advanced Research Projects Agency spoken language community (MADCOW Working Group, 1992). In general, a solution to the problem of understanding context-dependent utterances will be difficult, as it may require the system to deploy an arbitrary amount of world knowledge (Charniak, 1973; Winograd, 1972). However, it has been estimated that a simple strategy for referent determination employed in text processing, and one that uses only the syntactic structure of previous utterances, can suffice to identify the correct referent for pronouns in over 90 percent of cases (Hobbs, 1978). Whether such techniques will work as well for spoken human-computer dialogue is unknown. One way to mitigate the inherent difficulty of referent determination when using a multimodal system may be to couple spoken pronouns and definite noun phrases with pointing actions (Cohen, 1991; Cohen et al., 1989).
Present spoken language systems have supported dialogues in which the user asks multiple questions, some of which request further refinement of the answers to prior questions (Advanced Research Projects Agency, 1993), or dialogues in which the user is prompted for information (Andry, 1992; Peckham, 1991). Much more varied dialogue behavior is likely to be required by users, such as the ability to engage in advisory, clarificatory, and confirmatory dialogues (Codd, 1974; Litman and Allen, 1987). With respect to dialogue confirmations, spoken communication is tightly interactive and speakers expect rapid confirmation of understanding through backchannels (e.g., "uh huh") and other signals. Studies have shown that communication delays as brief as 0.25 seconds can disrupt conversation patterns (Krauss and Bricker, 1967), leading speakers to elaborate and rephrase their utterances (Krauss and Weinheimer, 1966; Oviatt and Cohen, 1991a), and that telephone communications are especially sensitive to delays. The need for timely confirmations will challenge most applications of spoken language processing, particularly those involving telephony.
To support a broader range of dialogue behavior, more general models of dialogue are being investigated, both mathematically and computationally, including plan-based models of dialogue and dialogue grammars. Plan-based models are founded on the observation that utterances are not simply strings of words but rather are the observable performance of communicative actions, or speech acts (Searle, 1969), such as requesting, informing, warning, suggesting, and confirming. Moreover, humans do not just perform actions randomly; they plan their actions to achieve various goals, and, in the case of communicative actions, those goals include changes to the mental states of listeners. For example, speakers' requests are planned to alter the intentions of their addressees. Plan-based theories of communicative action and dialogue (Allen and Perrault, 1980; Appelt, 1985; Cohen and Levesque, 1990; Cohen and Perrault, 1979; Perrault and Allen, 1980; Sidner and Israel, 1981) assume that the speaker's speech acts are part of a plan, and the listener's job is to uncover and respond appropriately to the underlying plan, rather than just to the utterance. For example, in response to a customer's question of "Where are the steaks you advertised?", a butcher's reply of "How many do you want?" is appropriate because the butcher has discovered that the customer's plan of getting steaks himself is going to fail. Being cooperative, he attempts to execute a plan to achieve the customer's higher-level goal of having steaks (Cohen, 1978). Current research on this model is attempting to incorporate more complex dialogue phenomena, such as clarifications (Litman and Allen, 1987, 1990; Yamaoka and Iida, 1991), and to model dialogue more as a joint enterprise, something the participants are doing together (Clark and Wilkes-Gibbs, 1986; Cohen and Levesque, 1991; Grosz and Sidner, 1990).
The dialogue grammar approach models dialogue simply as a finite state transition network (Dahlback and Jonsson, 1992; Polanyi and Scha, 1984; Winograd and Flores, 1986), in which state transitions occur on the basis of the type of communicative action that has taken place (e.g., a request). Such automata might be used to predict the next dialogue "states" that are likely and thus could help speech recognizers by altering the probabilities of various lexical, syntactic, semantic, and pragmatic information (Andry, 1992; Young et al., 1989). However, a number of drawbacks to the model are evident (Cohen, 1993; Levinson, 1981). First, it requires that the communicative action(s) being performed by the speaker in issuing an utterance be identified, which itself is a difficult problem, for which prior solutions have required plan recognition (Allen and Perrault, 1980; Kautz, 1990; Perrault and Allen, 1980). Second, the model assumes that only one state results from a transition. However, utterances are multifunctional. An utterance can be, for example, both a rejection and an assertion. The dialogue grammar subsystem would thus need to be in multiple states simultaneously, a property typically not allowed. Finally, and most importantly, the model does not say how a system should choose among the next moves, that is, the states currently reachable, in order to play its role as a cooperative conversant. Some analog of planning is thus also likely to be required.
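A dialogue grammar of this kind can be sketched as a small transition table. The state names, speech act labels, and functions below are hypothetical, invented for illustration rather than drawn from any of the cited systems. The sketch also makes the drawbacks visible: it assumes each utterance arrives already labeled with a single act type, and it holds exactly one state at a time.

```python
# Hypothetical dialogue grammar: state -> {speech act type -> next state}.
TRANSITIONS = {
    "start": {"request": "awaiting_answer", "inform": "start"},
    "awaiting_answer": {
        "inform": "awaiting_ack",
        "clarify": "awaiting_clarification",
        "reject": "start",
    },
    "awaiting_clarification": {"inform": "awaiting_answer"},
    "awaiting_ack": {"confirm": "start", "request": "awaiting_answer"},
}

def expected_acts(state):
    """Act types licensed next; a recognizer might raise their probabilities."""
    return sorted(TRANSITIONS[state])

def advance(state, act):
    """Follow one transition, or fail if the act is not licensed in this state."""
    try:
        return TRANSITIONS[state][act]
    except KeyError:
        raise ValueError(f"act {act!r} not allowed in state {state!r}") from None

def run(acts, state="start"):
    """Drive the network over a sequence of act labels and return the final state."""
    for act in acts:
        state = advance(state, act)
    return state
```

For instance, `expected_acts("awaiting_ack")` lists the acts that could be favored after the system has answered a question. Note that nothing in the table tells the system which of its own reachable moves to make, which is the third drawback noted above.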
Dialogue research is currently the weakest link in the research program for developing spoken language systems. First and foremost, dialogue technology is in need of a specification methodology, in which a theorist could state formally what a dialogue system should do (i.e., what would count as acceptable dialogue behavior). As in other branches of computer science, such specifications may then lead to methods for mathematically and empirically evaluating whether a given system has met the specifications. However, to do this will require new theoretical approaches. Second, more implementation experiments need to be carried out, ranging from the simpler state-based dialogue models to the more comprehensive plan-based approaches. Research aimed at developing computationally tractable plan recognition algorithms is critically needed.
There is little doubt that voice will figure prominently in the array of potential interface technologies available to developers. Except for conventional telephone-based applications, however, human-computer interfaces incorporating voice will probably be multimodal, in the sense of combining voice with screen feedback, use of a pointing device, gesturing, handwriting, etc. (Cohen et al., 1989; Hauptmann and McAvinney, 1993; Oviatt, 1992; Wahlster, 1991). Many application systems require multimodal communication, such as inherently map-based interactions. Such systems can involve coordinated speaking, gesturing, pointing, or writing on the map during input, and speech synthesis coordinated with graphics for output. From the previous discussion, it is apparent that each interface technology has strengths and weaknesses, and it may be strategic to attempt to develop interfaces that capitalize on the strengths of one to overcome weaknesses in another (Cohen, 1991). That is, users should be able to speak when desired, supplemented with other modalities as needed.
There are many advantages to multimodal interfaces:
Error avoidance and robust performance. Multimodal interfaces can offer the potential to avoid errors that otherwise would be made in a unimodal interface. For example, it is estimated that 86 percent of the task-critical human performance errors that occurred during a study of interpreted telephony could have been avoided by opening up a screen-based handwriting channel (Oviatt, in press). Multimodal recognition also offers the possibility of enhanced recognition in adverse conditions. For example, simultaneous use of lip-reading speech recognizers may increase the recognition rate in high-noise environments (Garcia et al., 1992; Petajan et al., 1988) that otherwise would impair acoustic speech recognizers. Alternatively, in such environments, users of multimodal interfaces could simply switch modes, for example, to use handwriting.
Error correction. Multimodal interfaces offer more options for correcting errors that do occur. Recognition errors present a problem to users, partly because their source is not apparent. Users frequently respond to speech recognition errors by hyperarticulating. But since recognizers are typically not trained on hyperarticulated speech, this repair strategy leads to a lower likelihood of successful recognition for that content (Shriberg et al., 1992). Recognition problems can thus repeat numerous times on the same content, leading to a "degradation spiral" that is frustrating to users and may cause them to abort the application (Oviatt, 1992). By providing the option of using another modality, such as handwriting, a user can simply switch modes in order to correct an error in the first modality.
Situational and user variation. The various circumstances in which portable computers will be used are likely to alter people's preferences for one modality of communication or another. For example, the user may at times encounter noisy environments or desire privacy and would therefore rather not speak. Also, people may prefer to speak for some task content but not for others. Finally, different types of users may systematically prefer to use one modality rather than another. In all these cases a multimodal system offers the needed flexibility.
Even as we investigate multimodal interaction for potential solutions to problems arising in speech-only applications, many implementation obstacles need to be overcome in order to integrate and synchronize modalities. For example, multimodal systems could present information graphically or in multiple coordinated modalities (Feiner and McKeown, 1991; Wahlster, 1991) and permit users to refer linguistically to entities introduced graphically (Cohen, 1991; Wahlster, 1991). Techniques need to be developed to synchronize input from simultaneous data streams, so that, for example, gestural inputs can help resolve ambiguities in speech processing and vice versa. Research on multimodal interfaces needs to examine not only the techniques for forging a productive synthesis among modalities but also the effect that specific integration architectures will have on human-computer interaction. Much more empirical research on the human use of multimodal systems needs to be undertaken, as we yet know relatively little about how people use multiple modalities in communicating with other people, let alone with computers, or about how to support such communication most effectively.
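One simple way to picture the synchronization problem is to pair spoken deictic words with pointing gestures that occur close to them in time. The sketch below is only an illustration under stated assumptions: the `Word` and `Gesture` records, the list of deictic words, and the one-second window `max_gap` are all hypothetical, and a nearest-in-time heuristic is a stand-in for whatever integration technique a real system would use.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    time: float  # seconds; assumed to be the word onset reported by a recognizer

@dataclass
class Gesture:
    target: str  # identifier of the on-screen object pointed at
    time: float  # seconds; assumed to share a clock with the speech stream

# Hypothetical set of words treated as needing a pointed-at referent.
DEICTICS = {"this", "that", "here", "there"}

def resolve_deictics(words, gestures, max_gap=1.0):
    """Pair each deictic word with the nearest unused gesture within max_gap seconds."""
    unused = list(gestures)
    resolved = []
    for word in words:
        if word.text not in DEICTICS:
            continue
        candidates = [g for g in unused if abs(g.time - word.time) <= max_gap]
        if candidates:
            best = min(candidates, key=lambda g: abs(g.time - word.time))
            unused.remove(best)  # each gesture resolves at most one word
            resolved.append((word.text, best.target))
        else:
            resolved.append((word.text, None))  # left for linguistic resolution
    return resolved
```

A deictic word with no gesture inside the window is returned with `None`, so that other context can be tried. How wide the window should be, and whether gestures reliably precede or follow the words they accompany, are empirical questions of the kind raised in this section.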
SCIENTIFIC RESEARCH ON COMMUNICATION MODALITIES
The present research and development climate for speech-based technology is more active than it was at the time of the 1984 National Research Council report on speech recognition in severe environments (National Research Council, 1984). Significant amounts of research and development funding are now being devoted to building speech-understanding systems, and the first speaker-independent, continuous, real-time spoken language systems have been developed. However, some of the same problems identified then still exist today. In particular, few answers are available on how people will interact with systems using voice and how well they will perform tasks in the target environments as opposed to the laboratory. There is little research on the dependence of communication on the modality used, or the types of tasks, in part because there have not been principled taxonomies or comprehensive research addressing these factors. In particular, the use of multiple communication modalities to support human-computer interaction is only now being addressed.
Fortunately, the field is now in a position to fill gaps in its knowledge base about spoken human-machine communication. Using existing systems that understand real-time, continuously spoken utterances, which allow users to solve real problems, a number of vital studies can now be undertaken in a more systematic manner. Examples include:
• longitudinal studies of users' linguistic and problem-solving behavior that would explore how users adapt to a given system;
• studies of users' understanding of system limitations, and of their performance in observing the system's bounds;
• studies of different techniques for revealing a system's coverage, and for channeling user input;
• studies comparing the effectiveness of spoken language technology with alternatives, such as the use of keyboard-based natural language systems, query languages, or existing direct manipulation interfaces;
• studies analyzing users' language, task performance, and preferences to use different modalities, individually and within an integrated multimodal interface.
The information gained from such studies would be an invaluable addition to the knowledge base of how spoken language processing can be woven into a usable human-computer interface. Sustained efforts need to be undertaken to develop more adequate spoken language simulation methods, to understand how to build limited but robust dialogue systems based on a variety of communication modalities, and to study the nature of dialogue.
A vital and underappreciated contribution to the successful deployment of voice technology for human-computer interaction will come from the development of a principled and empirically validated set of human-interface guidelines for interfaces that incorporate speech (cf. Lea, 1992). Graphical user-interface guidelines typically provide heuristics and suggestions for building "usable" interfaces, though often without basing such suggestions on scientifically established facts and principles. Despite the evident success of such guidelines for graphical user interfaces, it is not at all clear that a simple set of heuristics will work for spoken language technology, because human language is both more variable and creative than the behavior allowed by graphical user interfaces. Answers to some of the questions posed earlier would be valuable in laying a firm empirical foundation for developing effective guidelines for a new generation of language-oriented interfaces.
Ultimately, such a set of guidelines embodying the results of scientific theory and experimentation should be able to predict, given a specified communicative situation, task, user population, and a set of component modalities, what the user-computer interaction will be like with a multimodal interface of a certain configuration. Such predictions could inform the developers in advance about potential trouble spots and could lead to a more robust, usable, and satisfying human-computer interface. Given the complexities of the design task and the considerable expense required to create spoken language applications, if designers are left to their intuitions, applications will suffer. Thus, for scientific, technological, and economic reasons, a concerted effort needs to be undertaken to develop a more scientific understanding of communication modalities and how they can best be integrated in support of successful human-computer interaction.
Many thanks to Jared Bernstein, Clay Coler, Carol Simpson, Ray Perrault, Robert Markinson, Raja Rajasekharan, and John Vester for valuable discussions and source materials.
Advanced Research Projects Agency. ARPA Spoken Language Systems Technology Workshop. Massachusetts Institute of Technology, Cambridge, Mass. 1993.
Allen, J. F., and C. R. Perrault. Analyzing intention in dialogues. Artificial Intelligence, 15(3):143-178, 1980.
Andry, F. Static and dynamic predictions: A method to improve speech understanding in cooperative dialogues. In Proceedings of the International Conference on Spoken Language Processing, Banff, Alberta, Canada, Oct. University of Alberta, 1992.
Andry, F., E. Bilange, F. Charpentier, K. Choukri, M. Ponamale, and S. Soudoplatoff. Computerised simulation tools for the design of an oral dialogue system. In Selected Publications, 1988-1990, SUNDIAL Project (Esprit P2218). Commission of the European Communities, 1990.
Appelt, D. Planning English Sentences. Cambridge University Press, Cambridge, U.K., 1985.
Appelt, D. E., and E. Jackson. SRI International February 1992 ATIS benchmark test results. In Fifth DARPA Workshop on Speech and Natural Language, San Mateo, Calif. Morgan Kaufmann Publishers, Inc., 1992.
Bahl, L., F. Jelinek, and R. L. Mercer. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):179-190, March 1983.
Baker, J. F. Stochastic modeling for automatic speech understanding. In D. R. Reddy, ed., Speech Recognition, pp. 521-541. Academic Press, New York, 1975.
Baker, J. M. Large vocabulary speaker-adaptive continuous speech recognition research overview at Dragon Systems. In Proceedings of Eurospeech'91: 2nd European Conference on Speech Communication and Technology, pp. 29-32, Genova, Italy, 1991.
Basson, S. Prompting the user in ASR applications. In Proceedings of COST232 Workshop, European Cooperation in Science and Technology, November 1992.
Basson, S., O. Christie, S. Levas, and J. Spitz. Evaluating speech recognition potential in automating directory assistance call completion. In AVIOS Proceedings. American Voice I/O Society, 1989.
Bear, J., J. Dowding, and E. Shriberg. Detection and correction of repairs in human-computer dialog. In D. Walker, ed., Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, Newark, Delaware, June 1992.
Bear, J., and P. Price. Prosody, syntax and parsing. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, pp. 17-22, Pittsburgh, Pa., 1990.
Bernstein, J. Applications of speech recognition technology in rehabilitation. In J. E. Harkins and B. M. Virvan, eds., Speech to Text: Today and Tomorrow. GRI Monograph Series B., No. 2. Gallaudet University Research Institute, Washington, D.C., 1988.
Bernstein, J., M. Cohen, H. Murveit, D. Rtischev, and M. Weintraub. Automatic evaluation and training in English pronunciation. In Proceedings of the 1990 International Conference on Spoken Language Processing, pp. 1185-1188, The Acoustical Society of Japan, Kobe, Japan, 1990.
Bernstein, J., and D. Rtischev. A voice interactive language instruction system. In Proceedings of Eurospeech '91, pp. 981-984, Genova, Italy. IEEE, 1991.
Capindale, R. A., and R. C. Crawford. Using a natural language interface with casual users. International Journal of Man-Machine Studies, 32:341-362, 1990.
Chamberlin, D. D., and R. F. Boyce. Sequel: A structured English query language. In Proceedings of the 1974 ACM SIGMOD Workshop on Data Description, Access and Control, May 1974.
Chapanis, A., R. B. Ochsman, R. N. Parrish, and G. D. Weeks. Studies in interactive communication: I. The effects of four communication modes on the behavior of teams during cooperative problem solving. Human Factors, 14:487-509, 1972.
Chapanis, A., R. N. Parrish, R. B. Ochsman, and G. D. Weeks. Studies in interactive communication: II. The effects of four communication modes on the linguistic performance of teams during cooperative problem solving. Human Factors, 19(2):101-125, April 1977.
Charniak, E. Jack and Janet in search of a theory of knowledge. In Advance Papers of the Third Meeting of the International Joint Conference on Artificial Intelligence, Los Altos, Calif. William Kaufmann, Inc., 1973.
Clark, H. H., and D. Wilkes-Gibbs. Referring as a collaborative process. Cognition, 22:1-39, 1986.
Codd, E. F. Seven steps to rendezvous with the casual user. In Proceedings IFIP TC-2 Working Conference on Data Base Management Systems, pp. 179-200. North-Holland Publishing Co., Amsterdam, 1974.
Cohen, P. R. On Knowing What to Say: Planning Speech Acts. PhD thesis, University of Toronto, Toronto, Canada. Technical Report No. 118, Department of Computer Science. 1978.
Cohen, P. R. The pragmatics of referring and the modality of communication. Computational Linguistics, 10(2):97-146, April-June 1984.
Cohen, P. R. The role of natural language in a multimodal interface. In The 2nd FRIEND21 International Symposium on Next Generation Human Interface Technologies, Tokyo, Japan, November 1991. Institute for Personalized Information Environment.
Cohen, P. R. Models of dialogue. In M. Nagao, ed., Cognitive Processing for Vision and Voice: Proceedings of the Fourth NEC Research Symposium. SIAM, 1993.
Cohen, P. R., and H. J. Levesque. Rational interaction as the basis for communication. In P. R. Cohen, J. Morgan, and M. E. Pollack, eds., Intentions in Communication. MIT Press, Cambridge, Mass., 1990.
Cohen, P. R., and H. J. Levesque. Confirmations and joint action. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, pp. 951-957, Sydney, Australia, Morgan Kaufmann Publishers, Inc. 1991.
Cohen, P. R., and C. R. Perrault. Elements of a plan-based theory of speech acts. Cognitive Science, 3(3):177-212, 1979.
Cohen, P. R., M. Dalrymple, D. B. Moran, F. C. N. Pereira, J. W. Sullivan, R. A. Gargan, J. L. Schlossberg, and S. W. Tyler. Synergistic use of direct manipulation and natural language. In Human Factors in Computing Systems: CHI'89 Conference Proceedings, pp. 227-234, New York, Addison Wesley Publishing Co. 1989.
Cole, R., L. Hirschman, L. Atlas, M. Beckman, A. Bierman, M. Bush, J. Cohen, O. Garcia, B. Hanson, H. Hermansky, S. Levinson, K. McKeown, N. Morgan, D. Novick, M. Ostendorf, S. Oviatt, P. Price, H. Silverman, J. Spitz, A. Waibel, C. Weinstein, S. Zahorain, and V. Zue. NSF Workshop on Spoken Language Understanding. Technical Report CS/E 92-014, Oregon Graduate Institute, September 1992.
Crane, H. D. Writing and talking to computers. Business Intelligence Program Report D91-1557, SRI International, Menlo Park, Calif., July 1991.
Dahlback, N., and A. Jonsson. An empirically based computationally tractable dialogue model. In Proceedings of the 14th Annual Conference of the Cognitive Science Society (COGSCI-92), Bloomington, Ind., July 1992.
Dahlback, N., A. Jonsson, and L. Ahrenberg. Wizard of Oz studies: why and how. In L. Ahrenberg, N. Dahlback, and A. Jonsson, eds., Proceedings from the Workshop on Empirical Models and Methodology for Natural Language Dialogue Systems, Trento, Italy, April. Association for Computational Linguistics, 1992.
Dowding, J., J. M. Gawron, D. Appelt, J. Bear, L. Cherny, R. Moore, and D. Moran. Gemini: A natural language system for spoken-language understanding. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 54-61, Columbus, Ohio, June 1993.
Engelbart, D. Design considerations for knowledge workshop terminals. In National Computer Conference, pp. 221-227, 1973.
English, W. K., D. C. Engelbart, and M. A. Berman. Display-selection techniques for text manipulation. IEEE Transactions on Human Factors in Electronics, HFE-8(1):5-15, March 1967.
Feiner, S. K., and K. R. McKeown. COMET: Generating coordinated multimedia explanations. In Human Factors in Computing Systems (CHI'91), pp. 449-450, New York, April. ACM Press, 1991.
Fisher, S. Virtual environments, personal simulation, and telepresence. Multimedia Review: The Journal of Multimedia Computing, 1(2), 1990.
Fraser, N. M., and G. N. Gilbert. Simulating speech systems. Computer Speech and Language, 5(1):81-99, 1991.
Garcia, O. N., A. J. Goldschen, and E. D. Petajan. Feature Extraction for Optical Speech Recognition or Automatic Lipreading. Technical Report, Institute for Information Science and Technology, Department of Electrical Engineering and Computer Science. The George Washington University, Washington, D.C., November 1992.
Giles, H., A. Mulac, J. J. Bradac, and P. Johnson. Speech accommodation theory: The first decade and beyond. In M. L. McLaughlin, ed., Communication Yearbook 10, pp. 13-48. Sage Publishers, Beverly Hills, California, 1987.
Gould, J. D. How experts dictate. Journal of Experimental Psychology: Human Perception and Performance, 4(4):648-661, 1978.
Gould, J. D. Writing and speaking letters and messages. International Journal of Man-Machine Studies, 16(1):147-171, 1982.
Gould, J. D., J. Conti, and T. Hovanyecz. Composing letters with a simulated listening typewriter. Communications of the ACM, 26(4):295-308, April 1983.
Grosz, B., and C. Sidner. Plans for discourse. In P. R. Cohen, J. Morgan, and M. E. Pollack, eds., Intentions in Communication, pp. 417-444. MIT Press, Cambridge, Mass., 1990.
Guyomard, M., and J. Siroux. Experimentation in the specification of an oral dialogue. In H. Niemann, M. Lang, and G. Sagerer, eds., Recent Advances in Speech Understanding and Dialogue Systems. NATO ASI Series, vol. 46. Springer Verlag, Berlin, 1988.
Harris, R. User oriented data base query with the robot natural language query system. International Journal of Man-Machine Studies, 9:697-713, 1977.
Hauptmann, A. G., and P. McAvinney. Gestures with speech for direct manipulation. International Journal of Man-Machine Studies, 38:231-249, 1993.
Hauptmann, A. G., and A. I. Rudnicky. A comparison of speech and typed input. In Proceedings of the Speech and Natural Language Workshop, pp. 219-224, San Mateo, Calif., June. Morgan Kaufmann Publishers, Inc., 1990.
Hendrix, G. G., and B. A. Walter. The intelligent assistant. Byte, pp. 251-258, December 1987.
Hindle, D. Deterministic parsing of syntactic non-fluencies. In Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, pp. 123-128, Cambridge, Mass., June 1983.
Hobbs, J. R. Resolving pronoun reference. Lingua, 44, 1978.
Hon, H.-W., and K.-F. Lee. Recent progress in robust vocabulary-independent speech recognition. In Proceedings of the Speech and Natural Language Workshop, pp. 258-263, San Mateo, Calif., October. Morgan Kaufmann Publishers, Inc., 1991.
Howard, J. A. Flight testing of the AFTI/F-16 voice interactive avionics system. In Proceedings of Military Speech Tech 1987, pp. 76-82, Arlington, Va., Media Dimensions, 1987.
Huang, X., F. Alleva, M.-Y. Hwang, and R. Rosenfeld. An overview of the SPHINX-II speech recognition system. In Proceedings of the ARPA Workshop on Human Language Technology, San Mateo, Calif. Morgan Kaufmann Publishers, Inc., 1993.
Hutchins, E. L., J. D. Hollan, and D. A. Norman. Direct manipulation interfaces. In D. A. Norman and S. W. Draper, eds., User Centered System Design, pp. 87-124. Lawrence Erlbaum Publishers, Hillsdale, N.J., 1986.
Jackson, E., D. Appelt, J. Bear, R. Moore, and A. Podlozny. A template matcher for robust NL interpretation. In Proceedings of the 4th DARPA Workshop on Speech and Natural Language, pp. 190-194, San Mateo, Calif., February. Morgan Kaufmann Publishers, Inc., 1991.
Jarke, M., J. A. Turner, E. A. Stohr, Y. Vassiliou, N. H. White, and K. Michielsen. A field evaluation of natural language for data retrieval. IEEE Transactions on Software Engineering, SE-11(1):97-113, 1985.
Jelinek, F. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64:532-536, April 1976.
Jelinek, F. The development of an experimental discrete dictation recognizer. Proceedings of the IEEE, 73(11):1616-1624, November 1985.
Karis, D., and K. M. Dobroth. Automating services with speech recognition over the public switched telephone network: Human factors considerations. IEEE Journal of Selected Areas in Communications, 9(4):574-585, 1991.
Kautz, H. A circumscriptive theory of plan recognition. In P. R. Cohen, J. Morgan, and M. E. Pollack, eds., Intentions in Communication. MIT Press, Cambridge, Mass., 1990.
Kay, A., and A. Goldberg. Personal dynamic media. IEEE Computer, 10(1):31-42, 1977.
Kelly, M. J., and A. Chapanis. Limited vocabulary natural language dialogue. International Journal of Man-Machine Studies, 9:479-501, 1977.
Kennedy, A., A. Wilkes, L. Elder, and W. S. Murray. Dialogue with machines. Cognition, 30(1):37-72, 1988.
Kitano, H. ΦDM-Dialog. IEEE Computer, 24(6):36-50, June 1991.
Krauss, R. M., and P. D. Bricker. Effects of transmission delay and access delay on the efficiency of verbal communication. Journal of the Acoustical Society of America, 41(2):286-292, 1967.
Krauss, R. M., and S. Weinheimer. Concurrent feedback, confirmation, and the encoding of referents in verbal communication. Journal of Personality and Social Psychology, 4:343-346, 1966.
Krueger, M. Responsive environments. In Proceedings of the National Computer Conference, 1977.
Kubala, F., C. Barry, M. Bates, R. Bobrow, P. Fung, R. Ingria, J. Makhoul, L. Nguyen, R. Schwartz, and D. Stallard. BBN BYBLOS and HARC February 1992 ATIS benchmark results. In Fifth DARPA Workshop on Speech and Natural Language, San Mateo, Calif. Morgan Kaufmann Publishers, Inc., 1992.
Kurematsu, A. Future perspective of automatic telephone interpretation. Transactions of IEICE, E75(1):14-19, January 1992.
Lea, W. A. Practical lessons from configuring voice I/O systems. In Proceedings of Speech Tech/Voice Systems Worldwide, New York. Media Dimensions, Inc., 1992.
Leiser, R. G. Exploiting convergence to improve natural language understanding. Interacting with Computers, 1(3):284-298, December 1989.
Lennig, M. Using speech recognition in the telephone network to automate collect and third-number-billed calls. In Proceedings of Speech Tech'89, pp. 124-125, Arlington, Va. Media Dimensions, Inc., 1989.
Levelt, W. J. M., and S. Kelter. Surface form and memory in question-answering. Cognitive Psychology, 14(1):78-106, 1982.
Levinson, S. Some pre-observations on the modelling of dialogue. Discourse Processes, 4(1), 1981.
Litman, D. J., and J. F. Allen. A plan recognition model for subdialogues in conversation. Cognitive Science, 11:163-200, 1987.
Litman, D. J., and J. F. Allen. Discourse processing and commonsense plans. In P. R. Cohen, J. Morgan, and M. E. Pollack, eds., Intentions in Communication, pp. 365-388. MIT Press, Cambridge, Mass., 1990.
Luce, P. A., T. C. Feustel, and D. B. Pisoni. Capacity demands in short-term memory for synthetic and natural speech. Human Factors, 25(1):17-32, 1983.
MADCOW Working Group. Multi-site data collection for a spoken language corpus. In Proceedings of the Speech and Natural Language Workshop, pp. 7-14, San Mateo, Calif., February. Morgan Kaufmann Publishers, Inc., 1992.
Mariani, J. Spoken language processing in the framework of human-machine communication at LIMSI. In Proceedings of Speech and Natural Language Workshop, pp. 55-60, San Mateo, Calif. Morgan Kaufmann Publishers, Inc., 1992.
Marshall, J. P. A manufacturing application of voice recognition for assembly of aircraft wire harnesses. In Proceedings of Speech Tech/Voice Systems Worldwide, New York. Media Dimensions, Inc., 1992.
Martin, G. L. The utility of speech input in user-computer interfaces. International Journal of Man-Machine Studies, 30(4):355-375, 1989.
Martin, T. B. Practical applications of voice input to machines. Proceedings of the IEEE, 64(4):487-501, April 1976.
Michaelis, P. R., A. Chapanis, G. D. Weeks, and M. J. Kelly. Word usage in interactive dialogue with restricted and unrestricted vocabularies. IEEE Transactions on Professional Communication, PC-20(4), December 1977.
Mostow, J., A. G. Hauptmann, L. L. Chase, and S. Roth. Towards a reading coach that listens: Automated detection of oral reading errors. In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93), Menlo Park, Calif. AAAI Press/The MIT Press, 1993.
Murray, I. R., J. L. Arnott, A. F. Newell, G. Cruickshank, K. E. P. Carter, and R. Dye. Experiments with a Full-Speed Speech-Driven Word Processor. Technical Report CS 91/09, Mathematics and Computer Science Department, University of Dundee, Dundee, Scotland, April 1991.
Nakatani, C., and J. Hirschberg. A speech-first model for repair detection and correction. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 46-53, Columbus, Ohio, June 1993.
National Research Council. Automatic Speech Recognition in Severe Environments. National Academy Press, Washington, D.C., 1984.
Newell, A. F., J. L. Arnott, K. Carter, and G. Cruickshank. Listening typewriter simulation studies. International Journal of Man-Machine Studies, 33(1):1-19, 1990.
Nusbaum, H. C., and E. C. Schwab. The effects of training on intelligibility of synthetic speech: II. The learning curve for synthetic speech. In Proceedings of the 105th meeting of the Acoustical Society of America, Cincinnati, Ohio, May 1983.
Nye, J. M. Human factors analysis of speech recognition systems. In Speech Technology 1, pp. 50-57, 1982.
Ochsman, R. B., and A. Chapanis. The effects of 10 communication modes on the behaviour of teams during co-operative problem-solving. International Journal of Man-Machine Studies, 6(5):579-620, Sept. 1974.
Oviatt, S. L. Pen/voice: Complementary multimodal communication. In Proceedings of Speech Tech'92, pp. 238-241, New York, February 1992.
Oviatt, S. L. Predicting spoken disfluencies during human-computer interaction. In K. Shirai, ed., Proceedings of the International Symposium on Spoken Dialogue: New Directions in Human-Machine Communication, Tokyo, Japan, November 1993.
Oviatt, S. L. Toward multimodal support for interpreted telephone dialogues. In M. M. Taylor, F. Neel, and D. G. Bouwhuis, eds., Structure of Multimodal Dialogue. Elsevier Science Publishers B.V., Amsterdam, Netherlands, in press.
Oviatt, S. L., and P. R. Cohen. Discourse structure and performance efficiency in interactive and noninteractive spoken modalities. Computer Speech and Language, 5(4):297-326, 1991a.
Oviatt, S. L., and P. R. Cohen. The contributing influence of speech and interaction on human discourse patterns. In J. W. Sullivan and S. W. Tyler, eds., Intelligent User Interfaces, pp. 69-83. ACM Press Frontier Series. Addison-Wesley Publishing Co., New York, 1991b.
Oviatt, S. L., P. R. Cohen, M. W. Fong, and M. P. Frank. A rapid semi-automatic simulation technique for investigating interactive speech and handwriting. In J. Ohala, ed., Proceedings of the 1992 International Conference on Spoken Language Processing, pp. 1351-1354, University of Alberta, October 1992.
Oviatt, S. L., P. R. Cohen, M. Wang, and J. Gaston. A simulation-based research strategy for designing complex NL systems. In ARPA Human Language Technology Workshop, Princeton, N.J., March 1993.
Pallett, D. S., J. G. Fiscus, W. M. Fisher, and J. S. Garofolo. Benchmark tests for the DARPA spoken language program. In Proceedings of the ARPA Workshop on Human Language Technology, San Mateo, Calif., Morgan Kaufmann Publishers, Inc., 1993.
Pavan, S., and B. Pelletti. An experimental approach to the design of an oral cooperative dialogue. In Selected Publications, 1988-1990, SUNDIAL Project (Esprit P2218). Commission of the European Communities, 1990.
Peckham, J. Speech understanding and dialogue over the telephone: An overview of the ESPRIT SUNDIAL project. In Proceedings of the Speech and Natural Language Workshop, pp. 14-28, San Mateo, Calif., February. Morgan Kaufmann Publishers, Inc., 1991.
Perrault, C. R., and J. F. Allen. A plan-based analysis of indirect speech acts. American Journal of Computational Linguistics, 6(3):167-182, 1980.
Petajan, E., B. Bradford, D. Bodoff, and N. M. Brooke. An improved automatic lipreading system to enhance speech recognition. In Proceedings of Human Factors in Computing Systems (CHI'88), pp. 19-25, New York. Association for Computing Machinery Press, 1988.
Polanyi, L., and R. Scha. A syntactic approach to discourse semantics. In Proceedings of the 10th International Conference on Computational Linguistics, pp. 413-419, Stanford, Calif., 1984.
Pollack, A. Computer translator phones try to compensate for Babel. New York Times, January 29, 1993.
Price, P. J. Evaluation of spoken language systems: The ATIS domain. In Proceedings of the 3rd DARPA Workshop on Speech and Natural Language, pp. 91-95, San Mateo, Calif. Morgan Kaufmann Publishers, Inc., 1990.
Price, P., M. Ostendorf, S. Shattuck-Hufnagel, and C. Fong. The use of prosody in syntactic disambiguation. In Proceedings of the Speech and Natural Language Workshop, pp. 372-377, San Mateo, Calif., October. Morgan Kaufmann Publishers, Inc., 1991.
Proceedings of the Speech and Natural Language Workshop, San Mateo, Calif., October. Morgan Kaufmann Publishers, Inc., 1991.
Rabiner, L. R., J. G. Wilpon, and A. E. Rosenberg. A voice-controlled, repertory-dialer system. Bell System Technical Journal, 59(7):1153-1163, September 1980.
Rheingold, H. Virtual Reality. Summit Books, 1991.
Roe, D. B., F. Pereira, R. W. Sproat, and M. D. Riley. Toward a spoken language translator for restricted-domain context-free languages. In Proceedings of Eurospeech'91: 2nd European Conference on Speech Communication and Technology, pp. 1063-1066, Genova, Italy. European Speech Communication Association, 1991.
Rosenhoover, F. A., J. S. Eckel, F. A. Gorg, and S. W. Rabeler. AFTI/F-16 voice interactive avionics evaluation. In Proceedings of the National Aerospace and Electronics Conference (NAECON'87). IEEE, 1987.
Rubin-Spitz, J., and D. Yashchin. Effects of dialogue design on customer responses in automated operator services. In Proceedings of Speech Tech'89, 1989.
Rudnicky, A. I. Mode preference in a simple data-retrieval task. In ARPA Human Language Technology Workshop, Princeton, N.J., March 1993.
Searle, J. R. Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, Cambridge, 1969.
Shneiderman, B. Natural vs. precise concise languages for human operation of computers: Research issues and experimental approaches. In Proceedings of the 18th Annual Meeting of the Association for Computational Linguistics, pp. 139-141, Philadelphia, Pa., June 1980a.
Shneiderman, B. Software Psychology: Human Factors in Computer and Information Systems. Winthrop Publishers, Inc., Cambridge, Mass., 1980b.
Shneiderman, B. Direct manipulation: A step beyond programming languages. IEEE Computer, 16(8):57-69, 1983.
Shriberg, E., E. Wade, and P. Price. Human-machine problem-solving using spoken language systems (SLS): Factors affecting performance and user satisfaction. In Proceedings of Speech and Natural Language Workshop, pp. 49-54, San Mateo, Calif. Morgan Kaufmann Publishers, Inc., 1992.
Sidner, C., and D. Israel. Recognizing intended meaning and speaker's plans. In Proceedings of the Seventh International Joint Conference on Artificial Intelligence, pp. 203-208, Vancouver, B.C., 1981.
Simpson, C. A., and T. N. Navarro. Intelligibility of computer generated speech as a function of multiple factors. In Proceedings of the National Aerospace and Electronics Conference (NAECON), pp. 932-940, New York, May. IEEE, 1984.
Simpson, C. A., C. R. Coler, and E. M. Huff. Human factors of voice I/O for aircraft cockpit controls and displays. In Proceedings of the Workshop on Standardization for Speech I/O Technology, pp. 159-166, Gaithersburg, Md., March. National Bureau of Standards, 1982.
Simpson, C. A., M. E. McCauley, E. F. Roland, J. C. Ruth, and B. H. Williges. System design for speech recognition and generation. Human Factors, 27(2):115-141, 1985.
Small, D., and L. Weldon. An experimental comparison of natural and structured query languages. Human Factors, 25:253-263, 1983.
Spitz, J. Collection and analysis of data from real users: Implications for speech recognition/understanding systems. In Proceedings of the 4th DARPA Workshop on Speech and Natural Language, Asilomar, Calif., February. Defense Advanced Research Projects Agency, 1991.
Stallard, D., and R. Bobrow. Fragment processing in the DELPHI system. In Proceedings of the Speech and Natural Language Workshop, pp. 305-310, San Mateo, Calif., February. Morgan Kaufmann Publishers, Inc., 1992.
Street, R. L., Jr., R. M. Brady, and W. B. Putman. The influence of speech rate stereotypes and rate similarity on listeners' evaluations of speakers. Journal of Language and Social Psychology, 2(1):37-56, 1983.
Streeter, L. A., D. Vitello, and S. A. Wonsiewicz. How to tell people where to go: Comparing navigational aids. International Journal of Man-Machine Studies, 22:549-562, 1985.
Swider, R. F. Operational evaluation of voice command/response in an Army helicopter. In Proceedings of Military Speech Tech 1987, pp. 143-146, Arlington, Va. Media Dimensions, 1987.
Tanaka, S., D. K. Wild, P. J. Seligman, W. E. Halperin, V. Behrens, and V. Putz-Anderson. Prevalence and Work-Relatedness of Self-Reported Carpal Tunnel Syndrome Among U.S. Workers: Analysis of the Occupational Health Supplement Data of the 1988 National Health Interview Survey. National Institute of Occupational Safety and Health, and Centers for Disease Control and Prevention (Cincinnati), in submission.
Tennant, H. R., K. M. Ross, R. M. Saenz, C. W. Thompson, and J. R. Miller. Menu-based natural language understanding. In Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, pp. 151-158, Cambridge, Mass., June 1983.
Thomas, J. C., M. B. Rosson, and M. Chodorow. Human factors and synthetic speech. In B. Shackel, ed., Proceedings of INTERACT'84, Amsterdam. Elsevier Science Publishers B.V. (North Holland), 1984.
Turner, J. A., M. Jarke, E. A. Stohr, Y. Vassiliou, and N. White. Using restricted natural language for data retrieval: A plan for field evaluation. In Y. Vassiliou, ed., Human Factors and Interactive Computer Systems, pp. 163-190. Ablex Publishing Corp., Norwood, N.J., 1984.
VanKatwijk, A. F., F. L. VanNes, H. C. Bunt, H. F. Muller, and F. F. Leopold. Naive subjects interacting with a conversing information system. IPO Annual Progress Report, 14:105-112, 1979.
Visick, D., P. Johnson, and J. Long. The use of simple speech recognisers in industrial applications. In Proceedings of INTERACT'84 First IFIP Conference on Human-Computer Interaction, London, U.K., 1984.
Voorhees, J. W., N. M. Bucher, E. M. Huff, C. A. Simpson, and D. H. Williams. Voice interactive electronic warning system (VIEWS). In Proceedings of the IEEE/AIAA 5th Digital Avionics Systems Conference, pp. 3.5.1-3.5.8, New York. IEEE, 1983.
Wahlster, W. User and discourse models for multimodal communication. In J. W. Sullivan and S. W. Tyler, eds., Intelligent User Interfaces, pp. 45-68. ACM Press Frontier Series. Addison-Wesley Publishing Co., New York, 1991.
Weinstein, C. Opportunities for advanced speech processing in military computer-based systems. Proceedings of the IEEE, 79(11):1626-1641, November 1991.
Welkowitz, J., S. Feldstein, M. Finkelstein, and L. Aylesworth. Changes in vocal intensity as a function of interspeaker influence. Perceptual and Motor Skills, 10:715-718, 1972.
Williamson, J. T. Flight test results of the AFTI/F-16 voice interactive avionics program. In Proceedings of the American Voice I/O Society (AVIOS) 87 Voice I/O Systems Applications Conference, pp. 335-345, Alexandria, Va., 1987.
Winograd, T. Understanding Natural Language. Academic Press, New York, 1972.
Winograd, T., and F. Flores. Understanding Computers and Cognition: A New Foundation for Design. Ablex Publishing Co., Norwood, N.J., 1986.
Yamaoka, T., and H. Iida. Dialogue interpretation model and its application to next utterance prediction for spoken language processing. In Proceedings of Eurospeech'91: 2nd European Conference on Speech Communication and Technology, pp. 849-852, Genova, Italy. European Speech Communication Association, 1991.
Yato, F., T. Takezawa, S. Sagayama, J. Takami, H. Singer, N. Uratani, T. Morimoto, and A. Kurematsu. International Joint Experiment Toward Interpreting Telephony (in Japanese). Technical Report, The Institute of Electronics, Information, and Communication Engineers, 1992.
Young, S. R., A. G. Hauptmann, W. H. Ward, E. T. Smith, and P. Werner. High level knowledge sources in usable speech recognition systems. Communications of the ACM, 32(2), February 1989.
Zoltan-Ford, E. Language Shaping and Modeling in Natural Language Interactions with Computers. PhD thesis, Psychology Department, Johns Hopkins University, Baltimore, Md., 1983.
Zoltan-Ford, E. Reducing variability in natural-language interactions with computers. In M. J. Alluisi, S. de Groot, and E. A. Alluisi, eds., Proceedings of the Human Factors Society-28th Annual Meeting, vol. 2, pp. 768-772, San Antonio, Tex., 1984.
Zoltan-Ford, E. How to get people to say and type what computers can understand. International Journal of Man-Machine Studies, 34:527-547, 1991.
Zue, V., J. Glass, D. Goddeau, D. Goodine, L. Hirschman, M. Phillips, J. Polifroni, and S. Seneff. The MIT ATIS system: February 1992 progress report. In Fifth DARPA Workshop on Speech and Natural Language, San Mateo, Calif. Morgan Kaufmann Publishers, Inc., 1992.