The Role of Voice in Human-Machine Communication*

Philip R. Cohen and Sharon L. Oviatt

SUMMARY

Optimism is growing that the near future will witness rapid growth in human-computer interaction using voice. System prototypes have recently been built that demonstrate speaker-independent real-time speech recognition and understanding of naturally spoken utterances in moderately sized vocabularies (1000 to 2000 words), and larger-vocabulary speech recognition systems are on the horizon. Already, computer manufacturers are building speech recognition subsystems into their new product lines. However, before this technology will be broadly useful, a substantial knowledge base about human spoken language and performance during computer-based interaction needs to be gathered and applied. This paper reviews application areas in which spoken interaction may play a significant role, assesses potential benefits of spoken interaction with machines, and attempts to compare voice with alternative and complementary modalities of human-computer interaction. The paper also discusses information that will be needed to build a firm empirical foundation for future designing of human-computer interfaces. Finally, it argues for a more systematic and scientific approach to understanding human language and performance with voice interactive systems.

*The writing of this paper was supported in part by a grant from the National Science Foundation (No. IRI-9213472) to SRI International.



INTRODUCTION

From the beginning of the computer era, futurists have dreamed of the conversational computer—a machine that we could engage in spoken natural language conversation. For instance, Turing's famous "test" of computational intelligence imagined a computer that could conduct such a fluent English conversation that people could not distinguish it from a human. However, despite prolonged research and many notable scientific and technological achievements, until recently there have been few human-computer dialogues, none spoken. This situation has begun to change, as steady progress in speech recognition and natural language processing technologies, supported by dramatic advances in computer hardware, has made possible laboratory prototype systems with which one can engage in simple question-answer dialogues. Although far from human-level conversation, this initial capability is generating considerable interest and optimism for the future of human-computer interaction using voice.

This paper aims to identify applications for which spoken interaction may be advantageous, to situate voice with respect to alternative and complementary modalities of human-computer interaction, and to discuss obstacles that exist to the successful deployment of spoken language systems because of the nature of spoken language interaction. Two general sorts of speech input technology are considered. First, we survey a number of existing applications of speech recognition technologies, for which the system identifies the words spoken, but need not understand the meaning of what is being said. Second, we concentrate on applications that will require a more complete understanding of the speaker's intended meaning, examining future spoken dialogue systems. Finally, we discuss how such speech understanding will play a role in future human-computer interactions, particularly those involving the coordinated use of multiple communication modalities, such as graphics, handwriting, and gesturing.

It is argued that progress has been impeded by the lack of adequate scientific knowledge about human spoken interactions, especially with computers. Such a knowledge base is essential to the development of well-founded human-interface guidelines that can assist system designers in producing successful applications incorporating spoken interaction. Given recent technological developments, the field is now in a position to systematically expand that knowledge base.

Background and Definitions

Human-computer interaction using voice may involve speech input or speech output, perhaps in combination with each other or with other modalities of communication.

Speech Analysis

The speech analysis task is often characterized along five dimensions:

Speaker dependence. Speech recognizers are described as speaker-dependent/trained, speaker-adaptive, and speaker-independent. For speaker-dependent recognition, samples of a given user's speech are collected and used as models for his/her subsequent utterances. For speaker-adaptive recognition, parameterized acoustical models are initially available, which can be more finely tuned for a given user through pronunciation of a limited set of specified utterances. Finally, speaker-independent recognizers are designed to handle any user's speech, without training, in the given domain of discourse (see Flanagan, in this volume).

Speech continuity. Utterances can be spoken in an isolated manner, with breaks between words, or as continuous natural speech.

Speech type. To develop initial algorithms, researchers typically first use read speech as data, in which speakers read random sentences drawn from some corpus, such as the Wall Street Journal. Subsequent to this stage of algorithm development, speech recognition research attempts to handle spontaneous speech, in which speakers construct new utterances in the chosen domain of discourse.

Interactivity. Certain speech recognition tasks, such as dictation, can be characterized as noninteractive, in that the speaker is receiving no feedback from the intended listener(s). Other systems are designed to process interactive speech, in which speakers construct utterances as part of an exchange of turns with a system or with another speaker.

Vocabulary and grammar. The user can speak words from a tightly constrained vocabulary and grammar or from larger vocabularies and grammars that more closely approximate those of a natural language. The system's vocabulary and grammar can be chosen by the system designer or application developer, or they can be compiled from data based on actual users speaking either to a simulated system or to an early system prototype.

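These five dimensions give a compact way to describe a recognition task. As an illustration only (not from the chapter), the sketch below records such a task profile as a small data structure; the type and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical profile of a speech recognition task along the five dimensions above.
@dataclass
class RecognitionTaskProfile:
    speaker_dependence: Literal["dependent", "adaptive", "independent"]
    continuity: Literal["isolated", "continuous"]
    speech_type: Literal["read", "spontaneous"]
    interactivity: Literal["noninteractive", "interactive"]
    vocabulary_size: int  # approximate number of words the system accepts
    grammar: Literal["tightly constrained", "quasi-natural"]

# Example: roughly the laboratory prototypes described in the summary.
prototype = RecognitionTaskProfile(
    speaker_dependence="independent",
    continuity="continuous",
    speech_type="spontaneous",
    interactivity="interactive",
    vocabulary_size=1500,
    grammar="quasi-natural",
)
print(prototype)
```
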
Current speech recognition technologies require an estimate of the probability of occurrence of each word in the context of the other words in the vocabulary. Because these probabilities are typically approximated from the distribution of words in a given corpus, it is currently difficult to expand a system's vocabulary, although research is proceeding on vocabulary-independent recognition (Hon and Lee, 1991).

Vendors often describe their speech recognition hardware as offering very high recognition accuracy, but it is only in the context of a quantitative understanding of the recognition task that one can meaningfully compare the performance of recognizers. To calibrate the difficulty of a given recognition task for a given system, researchers have come to use a measure of the perplexity of that system's language model, which measures, roughly speaking, the average number of word possibilities at each state of the grammar (Bahl et al., 1983; Baker, 1975; Jelinek, 1976). Word recognition accuracy has been found, in general, to be inversely proportional to perplexity. Most commercial vendors offer speech recognition systems claiming >95 percent word recognition accuracy given a perplexity on the order of 10. At least one vendor offers a 1000- to 5000-word, speaker-independent system, with perplexities in the range of 66 to 433 and a corresponding word-recognition error of 3 to 15 percent for recognition of isolated words (Baker, 1991). Current laboratory systems support real-time, speaker-independent recognition of continuously spoken utterances drawn from a vocabulary of approximately 1500 words, with a perplexity of 50 to 70, resulting in word recognition error rates between 4 and 8 percent (Pallett et al., 1993). The most ambitious speaker-independent systems are currently recognizing, in real time, read speech drawn from a 5000-word vocabulary of Wall Street Journal text, with a perplexity of 120, resulting in a word recognition error rate of 5 percent (Pallett et al., 1993). Larger vocabularies are now being attempted.

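To make the perplexity measure concrete, the following minimal sketch (an illustration, not part of the chapter or of any cited system) estimates bigram word probabilities from a tiny invented corpus and computes the perplexity of a test sentence; the corpus, the add-one smoothing, and all names are assumptions for the example.

```python
import math
from collections import Counter, defaultdict

# Toy training corpus; a real language model would be estimated from a large corpus
# such as the Wall Street Journal text mentioned above.
corpus = [
    "show me flights from boston to dallas".split(),
    "show me fares from boston to denver".split(),
    "list flights from dallas to denver".split(),
]

vocab = {w for sent in corpus for w in sent} | {"<s>", "</s>"}
bigram_counts = defaultdict(Counter)
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1

def prob(word, prev, k=1.0):
    """Add-k smoothed bigram probability P(word | prev)."""
    return (bigram_counts[prev][word] + k) / (sum(bigram_counts[prev].values()) + k * len(vocab))

def perplexity(sentence):
    """Perplexity: roughly the average number of word choices the model faces per position."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(math.log(prob(w, p)) for p, w in zip(tokens, tokens[1:]))
    return math.exp(-log_prob / (len(tokens) - 1))

print(perplexity("show me flights from boston to denver"))
```
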
The end result of voice recognition is the highest-ranking string(s) of words, or often a lattice of words, that covers the signal. For small vocabularies and tightly constrained grammars, a simple interpreter can respond to the spoken words directly. However, for larger vocabularies and more natural grammars, natural language understanding must be applied to the output of the recognizer in order to recover the intended meaning of the utterance.1 Because this natural language understanding process is complex and open ended, it is often constrained by the application task (e.g., retrieving information from a data base) and by the domain of discourse (e.g., a data base about airline flights). Here the combination of speech recognition and language understanding will be termed speech understanding, and the systems that use such input will be termed spoken language systems. This paper reviews earlier work on the uses of speech recognition but concentrates on the uses of spoken language.

1 See Moore (in this volume) for a discussion of how these components can be integrated.

Speech Synthesis

Three forms of speech synthesis technology exist:

Digitized speech. To produce an utterance, the machine assembles and plays back previously recorded and compressed samples of human speech. Although a noticeable break between samples can often be heard, and the overall intonation may be inaccurate, such a synthesis process can offer human-sounding speech of high intelligibility. This process is, however, limited to producing combinations of the recorded samples.

Text-to-speech. Text-to-speech synthesis involves an automated analysis of the structure of words into their morphological constituents. By combining the pronunciations of those subword units according to letter- and morph-to-sound rules, coupled with a large list of exceptional pronunciations (for English), arbitrary text can be rendered as speech. Because this technology can handle open-ended text, it is suitable for large-scale applications, such as reading text aloud to blind users or reading electronic mail over the telephone. Text-to-speech science and technology are covered at length elsewhere in this volume (see Allen, in this volume, and Carlson, in this volume).

Concept-to-speech. With text-to-speech systems, the text to be converted is supplied from a human source. Future dialogue systems will require computers to decide for themselves what to say and how to say it in order to arrive at a meaningful and contextually appropriate dialogue contribution. Such systems need to determine what speech action(s) to perform (e.g., request, suggestion), how to refer to entities in the utterance, what to say about them, what grammatical forms to use, and what intonation to apply. Moreover, the utterance should contribute to the course of the dialogue, so the system should keep a representation of what it has said in order to analyze and understand the user's subsequent utterances.

The research areas of speech synthesis and language generation have received considerably less attention than speech recognition and understanding but will increase in importance as the possibility of developing spoken dialogue systems becomes realizable.

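As a toy illustration of the text-to-speech approach described above (a large exception list backed by letter- and morph-to-sound rules), the fragment below looks each word up in a small exception dictionary and otherwise applies naive single-letter rules; the word list, rules, and phone symbols are invented for the example and are far cruder than any real synthesizer.

```python
# Hypothetical miniature "front end" for text-to-speech: consult an exception
# dictionary first, and fall back on letter-to-sound rules otherwise.
EXCEPTIONS = {            # words whose pronunciation the rules below would get wrong
    "colonel": "K ER N AH L",
    "one": "W AH N",
}
LETTER_TO_SOUND = {       # grossly simplified single-letter rules
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "OW", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K S", "y": "Y", "z": "Z",
}

def pronounce(word: str) -> str:
    """Return a phone string for one word: exceptions first, then letter rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return " ".join(LETTER_TO_SOUND[ch] for ch in word if ch in LETTER_TO_SOUND)

print(pronounce("cat"))      # K AE T
print(pronounce("colonel"))  # K ER N AH L (from the exception list)
```
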
The remainder of this paper explores current and future application areas in which spoken interaction may be a preferred modality of communication with computers. First, factors that may influence the desirability and efficiency of voice-based interaction with computers are identified, independent of whether a simple command language or a quasi-natural language is being spoken. Then, we discuss spoken language interaction, comparing it both to keyboard-based interaction and to the currently dominant graphical user-interface paradigm. After identifying circumstances that favor spoken language interaction, gaps in the scientific knowledge base of spoken communication are identified that present obstacles to the development of spoken language-based systems. It is observed that future systems will be multimodal, with voice being only one of the communication modalities available. We conclude with suggestions for further research that needs to be undertaken to support the development of voice-based unimodal and multimodal systems and argue that there is a pressing need to create empirically based human interface guidelines for system developers before voice-based technology can fulfill its potential.

WHEN IS SPOKEN INTERACTION WITH COMPUTERS USEFUL?

As yet there is no theory or categorization of tasks and environments that would predict, all else being equal, when voice would be a preferred modality of human-computer communication. Still, a number of situations have been identified in which spoken communication with machines may be advantageous:

• when the user's hands or eyes are busy,
• when only a limited keyboard and/or screen is available,
• when the user is disabled,
• when pronunciation is the subject matter of computer use, or
• when natural language interaction is preferred.

We briefly examine the present and future roles of spoken interaction with computers for these environments. Because spoken natural language interaction is the most difficult to implement, we discuss it extensively in the section titled "Natural Language Interaction."

Voice Input

Hands/Eyes-Busy Tasks

The classic situation favoring spoken interaction with machines is one in which the user's hands and/or eyes are busy performing some other task. In such circumstances, by using voice to communicate with the machine, people are free to pay attention to their task, rather than breaking away to use a keyboard. Field studies suggest that, for example, F-16 pilots who can attain a high speech recognition rate can perform missions, such as formation flying or low-level navigation, faster and more accurately when using spoken control over various avionics subsystems, as compared with keyboard and multifunction-button data entry (Howard, 1987; Rosenhoover et al., 1987; Williamson, 1987). Similar results have been found for helicopter pilots in noisy environments during tracking and communications tasks (Simpson et al., 1982, 1985; Swider, 1987).2

2 Further discussion of speech recognition for military environments can be found in Weinstein (1991; in this volume).

Commercial hands/eyes-busy applications also abound. For instance, wire installers, who spoke a wire's serial number and then were guided verbally by computer to install that wire, achieved a 20 to 30 percent speedup in productivity, with improved accuracy and lower training time, over their prior manual method of wire identification and installation (Marshall, 1992). Parcel sorters who spoke city names instead of typing destination-labeled keys attained a 37 percent improvement in entry time during hands/eyes-busy operations (Visick et al., 1984). However, when the hands/eyes-busy component of parcel sorting was removed, spoken input offered no distinct speed advantages. In addition, VLSI circuit designers were able to complete 24 percent more tasks when spoken commands were available than when they only used a keyboard and mouse interface (see the section titled "Direct Manipulation") (Martin, 1989). Although individual field studies are rarely conclusive, many field studies of highly accurate speech recognition systems with hands/eyes-busy tasks have found that spoken input leads to higher task productivity and accuracy.

Not only does spoken input offer efficiency gains for a given hands/eyes-busy task, it also offers the potential to change the nature of that task in beneficial ways. For example, instead of having to remember and speak or type the letters "YYZ" to indicate a destination airport, a baggage handler could simply say "Toronto," thereby using an easy-to-remember name (Martin, 1989; Nye, 1982). Similar potential advantages are identified for voice-based telephone dialers, to which one can say "Call Tom," rather than having to remember and input a phone number (Rabiner et al., 1980).

Other hands/eyes-busy applications that might benefit from voice interaction include data entry and machine control in factories and field applications (Martin, 1976), access to information for military command-and-control, astronauts' information management during extravehicular access in space, dictation of medical diagnoses (Baker, 1991), maintenance and repair of equipment, control of automobile equipment (e.g., radios, telephones, climate control), and navigational aids (Streeter et al., 1985).

A major factor determining success for speech input applications is speech recognition accuracy. For example, the best task performance reported during F-16 test flights was obtained once pilots attained isolated word recognition rates greater than 95 percent. Below 90 percent, the effort needed to correct recognition errors was said to outweigh the benefits gained for the user (Howard, 1987). Similar results showing the elimination of benefits once error correction is considered also have been found in tasks as simple as entering connected digits (Hauptmann and Rudnicky, 1990). To attain a sufficiently high level of recognition accuracy in field tests, spoken input has been severely constrained to allow only a small number of possible words at any given time. Still, even with such constraints, accuracy in the field often lags that of laboratory tests because of many complicating factors, such as the user's physical and emotional state, ambient noise, microphone equipment, the demands of real tasks, methods of user and system training, and individual differences encountered when an array of real users is sampled. However, it is claimed that most failures of speech technology have been the result of human factors engineering and management (Lea, 1992), rather than low recognition accuracy per se. Human factors issues are discussed further below and by Kamm (in this volume).

Limited Keyboard/Screen Option

The most prevalent current uses of speech synthesis and recognition are telephone-based applications. Speech synthesizers are commonly used in the telecommunications industry to support directory assistance by speaking the desired telephone number to the caller, thereby freeing the operator to handle another call. Speech recognizers have been deployed to replace or augment operator services (e.g., collect calls), handling hundreds of millions of callers each year and resulting in multimillion dollar savings (Lennig, 1989; Nakatsu, in this volume; Wilpon, in this volume). Speech recognizers for telecommunications applications accept a very limited vocabulary, perhaps spotting only certain key words in the input, but they need to function with high reliability for a broad spectrum of the general public.

Although not as physically severe as avionic or manufacturing applications, telecommunications applications are difficult because callers receive little or no training about use of the system and may have low-quality equipment, noisy telephone lines, and unpredictable ambient noise levels. Moreover, caller behavior is difficult to predict and channel (Basson, 1992; Kamm, in this volume; Spitz, 1991).3

3 An excellent review of the human factors and technical difficulties encountered in telecommunications applications of speech recognition can be found in Karis and Dobroth (1991).

The considerable success at automating the simpler operator services opens the possibility for more ambitious telephone-based applications, such as information access from remote databases. For example, the caller might inquire about airline and train schedules (Advanced Research Projects Agency, 1993; Proceedings of the Speech and Natural Language Workshop, 1991; Peckham, 1991), yellow pages information, or bank account balances (Nakatsu, in this volume), and receive the answer auditorily. This general area of human-computer interaction is much more difficult to implement than simple operator services because the range of caller behavior is quite broad and because speech understanding and dialogue participation are required rather than just word recognition. When even modest quantities of data need to be conveyed, a purely vocal interaction may be difficult to conduct, although the advent of "screen phones" may well improve such cases.

Perhaps the most challenging potential application of telephone-based spoken language technology is interpreting telephony (Kurematsu, 1992; Roe et al., 1991), in which two callers speaking different languages can engage in a dialogue mediated by a spoken language translation system (Kitano, 1991; Yato et al., 1992). Such systems are currently designed to incorporate speech recognition, machine translation, and speech synthesis subsystems and to interpret one sentence at a time. A recent initial experiment organized by ATR International (Japan), with Carnegie-Mellon University (USA) and Siemens A.G. (Germany), involved Japanese-English and Japanese-German machine-interpreted dialogues (Pollack, 1993; Yato et al., 1992). Utterances in one language were recognized and translated by a local computer, which sent a translated textual rendition to the foreign site, where text-to-speech synthesis took place. AT&T has demonstrated a limited-domain spoken English-Spanish translation system (Roe et al., 1991), although not a telephone-based one, and Nippon Electric Corporation has demonstrated a similar Japanese-English system.

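The one-sentence-at-a-time pipeline described above (recognize, translate, transmit the translated text to the remote site, synthesize) can be sketched as follows; every stage is a stub, and the function names and canned translation are placeholders rather than a description of the ATR, AT&T, or NEC systems.

```python
# Hypothetical sketch of the recognize -> translate -> transmit text -> synthesize
# pipeline used in the interpreting-telephony experiments described above.
CANNED_TRANSLATIONS = {  # placeholder for a machine translation subsystem
    "what time does the conference start": "kaigi wa nanji ni hajimarimasu ka",
}

def recognize(audio_utterance: str) -> str:
    """Stand-in for the speech recognizer: assume the words are already identified."""
    return audio_utterance.lower().strip("?.! ")

def translate(sentence: str) -> str:
    """Stand-in for machine translation of one sentence at a time."""
    return CANNED_TRANSLATIONS.get(sentence, "<translation unavailable>")

def send_text_to_remote_site(text: str) -> str:
    """Stand-in for transmitting the translated text to the foreign site."""
    return text

def synthesize(text: str) -> None:
    """Stand-in for text-to-speech synthesis at the receiving end."""
    print(f"[remote loudspeaker] {text}")

def interpret_turn(audio_utterance: str) -> None:
    # One conversational turn: each sentence goes through the whole pipeline.
    sentence = recognize(audio_utterance)
    synthesize(send_text_to_remote_site(translate(sentence)))

interpret_turn("What time does the conference start?")
```
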
Apart from the use of telephones, a second equipment-related factor favoring voice-based interaction is the ever-decreasing size of portable computers. Portable computing and communications devices will soon be too small to allow for use of a keyboard, implying that the input modalities for such machines will most likely be digitizing pen and voice (Crane, 1991; Oviatt, 1992), with screen and voice providing system output. Given that these devices are intended to supplant both computer and telephone, users will already be speaking through them. A natural evolution of the devices will offer the user the capability to speak to them as well.

Finally, an emerging use of voice technology is to replace the many control buttons on consumer electronic devices (e.g., VCRs, receivers). As the number of user-controllable functions on these devices increases, the user interface becomes overly complex and can lead to confusion over how to perform even simple tasks. Products have recently been announced that allow users to program their devices using simple voice commands.

Disability

A major potential use of voice technology will be to assist deaf users in communicating with the hearing world using a telephone (Bernstein, 1988). Such a system would recognize the hearing person's speech, render it as text, and synthesize the deaf person's textual reply (if using a computer terminal) as a spoken utterance. Another use of speech recognition in assisting deaf users would be captioning television programs or movies in real time. Speech recognition could also be used by motorically impaired users to control suitably augmented household appliances, wheelchairs, and robotic prostheses. Text-to-speech synthesis can assist users with speech and motor impediments; can assist blind users with computer interaction; and, when coupled with optical character recognition technology, can read printed materials to blind users. Finally, given sufficiently capable speech recognition systems, spoken input may become a prescribed therapy for repetitive stress injuries, such as carpal tunnel syndrome, which are estimated to afflict approximately 1.5 percent of office workers in occupations that typically involve the use of keyboards (Tanaka et al., 1993), although speech recognizers may themselves lead to different repetitive stress injuries (Markinson, personal communication, 1993).4

4 The general subject of "assistive technology" is covered at length by H. Levitt (in this volume), and a survey of speech recognition for rehabilitation can be found in Bernstein (1988).

Subject Matter Is Pronunciation

Speech recognition will become a component of future computer-based aids for foreign language learning and for the teaching of reading (Bernstein and Rtischev, 1991; Bernstein et al., 1990; Mostow et al., 1993). For such systems, speakers' pronunciation of computer-supplied texts would be analyzed and given as input to a program for teaching reading or foreign languages. Whereas these may be easier applications of speech recognition than some because the words being spoken are supplied by the computer, the recognition system will still be confronted with mispronunciations and slowed pronunciations, requiring a degree of robustness not often considered in other applications of speech recognition.

Substantial research will also be needed to develop and field test new educational software that can take advantage of speech recognition and synthesis for teaching reading. This is perhaps one of the most important potential applications of speech technology because the societal implications of raising literacy levels on a broad scale are enormous.

Voice Output

As with speech input, the factors favoring voice output are only informally understood. Just as tasks with a high degree of visual or manual activity may be more effectively accomplished using spoken input, such tasks may also favor spoken system output. A user could concentrate on a task rather than altering his or her gaze to view a system display. Typical application environments include flying a plane, in which the pilot could receive information about the status of the plane's subsystems during critical phases of operation (e.g., landing, high-speed maneuvering), and driving a car, in which the driver would be receiving navigational information in the course of driving. Other factors thought to favor voice output include remote access to information services over the telephone, lack of reading skills, darkened environments, and the need for omnidirectional information presentation, as in the issuing of warnings in cockpits, control rooms, factories, etc. (Simpson et al., 1985; Thomas et al., 1984).

There are numerous studies of speech synthesis, but no clear picture has emerged of when computer-human communication using speech output is most effective or preferred. Psychological research has investigated the intelligibility, naturalness, comprehensibility, and recallability of synthesized speech (Luce et al., 1983; Nusbaum and Schwab, 1983; Simpson et al., 1985; Thomas et al., 1984).

[Pages 45 through 64 of the chapter are not included in this excerpt.]

… the user may at times encounter noisy environments or desire privacy and would therefore rather not speak. Also, people may prefer to speak for some task content but not for others. Finally, different types of users may systematically prefer to use one modality rather than another. In all these cases a multimodal system offers the needed flexibility.

Even as we investigate multimodal interaction for potential solutions to problems arising in speech-only applications, many implementation obstacles need to be overcome in order to integrate and synchronize modalities. For example, multimodal systems could present information graphically or in multiple coordinated modalities (Feiner and McKeown, 1991; Wahlster, 1991) and permit users to refer linguistically to entities introduced graphically (Cohen, 1991; Wahlster, 1991). Techniques need to be developed to synchronize input from simultaneous data streams, so that, for example, gestural inputs can help resolve ambiguities in speech processing and vice versa. Research on multimodal interfaces needs to examine not only the techniques for forging a productive synthesis among modalities but also the effect that specific integration architectures will have on human-computer interaction. Much more empirical research on the human use of multimodal systems needs to be undertaken, as we yet know relatively little about how people use multiple modalities in communicating with other people, let alone with computers, or about how to support such communication most effectively.

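As a purely illustrative sketch of the cross-modal synchronization discussed above, the fragment below pairs a spoken deictic word with the pointing gesture closest to it in time; the event format, the one-second window, and all names are assumptions made for the example, not features of any cited system.

```python
from dataclasses import dataclass

@dataclass
class GestureEvent:
    time: float   # seconds from the start of the interaction
    target: str   # object the user pointed at on the display

# Hypothetical time-stamped input streams from a speech recognizer and a pen/touch device.
speech = [(2.1, "delete"), (2.6, "that"), (3.0, "flight")]
gestures = [GestureEvent(time=2.7, target="flight UA 212"),
            GestureEvent(time=9.4, target="fare table")]

def resolve_deictic(word_time, gesture_events, window=1.0):
    """Resolve a deictic word ("that", "this") to the nearest gesture within a time window."""
    candidates = [g for g in gesture_events if abs(g.time - word_time) <= window]
    if not candidates:
        return None
    return min(candidates, key=lambda g: abs(g.time - word_time)).target

# "that" spoken at 2.6 s is paired with the pointing gesture at 2.7 s, yielding its referent.
for t, word in speech:
    if word in ("that", "this"):
        print(word, "->", resolve_deictic(t, gestures))
```
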

SCIENTIFIC RESEARCH ON COMMUNICATION MODALITIES

The present research and development climate for speech-based technology is more active than it was at the time of the 1984 National Research Council report on speech recognition in severe environments (National Research Council, 1984). Significant amounts of research and development funding are now being devoted to building speech-understanding systems, and the first speaker-independent, continuous, real-time spoken language systems have been developed. However, some of the same problems identified then still exist today. In particular, few answers are available on how people will interact with systems using voice and how well they will perform tasks in the target environments as opposed to the laboratory. There is little research on the dependence of communication on the modality used, or the types of tasks, in part because there have not been principled taxonomies or comprehensive research addressing these factors.

In particular, the use of multiple communication modalities to support human-computer interaction is only now being addressed.

Fortunately, the field is now in a position to fill gaps in its knowledge base about spoken human-machine communication. Using existing systems that understand real-time, continuously spoken utterances, which allow users to solve real problems, a number of vital studies can now be undertaken in a more systematic manner. Examples include:

• longitudinal studies of users' linguistic and problem-solving behavior that would explore how users adapt to a given system;
• studies of users' understanding of system limitations, and of their performance in observing the system's bounds;
• studies of different techniques for revealing a system's coverage, and for channeling user input;
• studies comparing the effectiveness of spoken language technology with alternatives, such as the use of keyboard-based natural language systems, query languages, or existing direct manipulation interfaces; and
• studies analyzing users' language, task performance, and preferences to use different modalities, individually and within an integrated multimodal interface.

The information gained from such studies would be an invaluable addition to the knowledge base of how spoken language processing can be woven into a usable human-computer interface. Sustained efforts need to be undertaken to develop more adequate spoken language simulation methods, to understand how to build limited but robust dialogue systems based on a variety of communication modalities, and to study the nature of dialogue.

A vital and underappreciated contribution to the successful deployment of voice technology for human-computer interaction will come from the development of a principled and empirically validated set of human-interface guidelines for interfaces that incorporate speech (cf. Lea, 1992). Graphical user-interface guidelines typically provide heuristics and suggestions for building "usable" interfaces, though often without basing such suggestions on scientifically established facts and principles. Despite the evident success of such guidelines for graphical user interfaces, it is not at all clear that a simple set of heuristics will work for spoken language technology, because human language is both more variable and creative than the behavior allowed by graphical user interfaces.

Answers to some of the questions posed earlier would be valuable in laying a firm empirical foundation for developing effective guidelines for a new generation of language-oriented interfaces. Ultimately, such a set of guidelines embodying the results of scientific theory and experimentation should be able to predict, given a specified communicative situation, task, user population, and a set of component modalities, what the user-computer interaction will be like with a multimodal interface of a certain configuration. Such predictions could inform the developers in advance about potential trouble spots and could lead to a more robust, usable, and satisfying human-computer interface. Given the complexities of the design task and the considerable expense required to create spoken language applications, if designers are left to their intuitions, applications will suffer. Thus, for scientific, technological, and economic reasons, a concerted effort needs to be undertaken to develop a more scientific understanding of communication modalities and how they can best be integrated in support of successful human-computer interaction.

ACKNOWLEDGMENTS

Many thanks to Jared Bernstein, Clay Coler, Carol Simpson, Ray Perrault, Robert Markinson, Raja Rajasekharan, and John Vester for valuable discussions and source materials.

REFERENCES

Advanced Research Projects Agency. ARPA Spoken Language Systems Technology Workshop. Massachusetts Institute of Technology, Cambridge, Mass., 1993.
Allen, J. F., and C. R. Perrault. Analyzing intention in dialogues. Artificial Intelligence, 15(3):143-178, 1980.
Andry, F. Static and dynamic predictions: A method to improve speech understanding in cooperative dialogues. In Proceedings of the International Conference on Spoken Language Processing, Banff, Alberta, Canada, Oct. University of Alberta, 1992.
Andry, F., E. Bilange, F. Charpentier, K. Choukri, M. Ponamale, and S. Soudoplatoff. Computerised simulation tools for the design of an oral dialogue system. In Selected Publications, 1988-1990, SUNDIAL Project (Esprit P2218). Commission of the European Communities, 1990.
Appelt, D. Planning English Sentences. Cambridge University Press, Cambridge, U.K., 1985.
Appelt, D. E., and E. Jackson. SRI International February 1992 ATIS benchmark test results. In Fifth DARPA Workshop on Speech and Natural Language, San Mateo, Calif. Morgan Kaufmann Publishers, Inc., 1992.
Bahl, L., F. Jelinek, and R. L. Mercer. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):179-190, March 1983.
Baker, J. F. Stochastic modeling for automatic speech understanding. In D. R. Reddy, ed., Speech Recognition, pp. 521-541. Academic Press, New York, 1975.

Baker, J. M. Large vocabulary speaker-adaptive continuous speech recognition research overview at Dragon systems. In Proceedings of Eurospeech'91: 2nd European Conference on Speech Communication and Technology, pp. 29-32, Genova, Italy, 1991.
Basson, S. Prompting the user in ASR applications. In Proceedings of COST232 Workshop—European Cooperation in Science and Technology, November 1992.
Basson, S., O. Christie, S. Levas, and J. Spitz. Evaluating speech recognition potential in automating directory assistance call completion. In AVIOS Proceedings. American Voice I/O Society, 1989.
Bear, J., J. Dowding, and E. Shriberg. Detection and correction of repairs in human-computer dialog. In D. Walker, ed., Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, Newark, Delaware, June 1992.
Bear, J., and P. Price. Prosody, syntax and parsing. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, pp. 17-22, Pittsburgh, Pa., 1990.
Bernstein, J. Applications of speech recognition technology in rehabilitation. In J. E. Harkins and B. M. Virvan, eds., Speech to Text: Today and Tomorrow. GRI Monograph Series B., No. 2. Gallaudet University Research Institute, Washington, D.C., 1988.
Bernstein, J., M. Cohen, H. Murveit, D. Rtischev, and M. Weintraub. Automatic evaluation and training in English pronunciation. In Proceedings of the 1990 International Conference on Spoken Language Processing, pp. 1185-1188, The Acoustical Society of Japan, Kobe, Japan, 1990.
Bernstein, J., and D. Rtischev. A voice interactive language instruction system. In Proceedings of Eurospeech '91, pp. 981-984, Genova, Italy. IEEE, 1991.
Capindale, R. A., and R. C. Crawford. Using a natural language interface with casual users. International Journal of Man-Machine Studies, 32:341-362, 1990.
Chamberlin, D. D., and R. F. Boyce. Sequel: A structured English query language. In Proceedings of the 1974 ACM SIGMOD Workshop on Data Description, Access and Control, May 1974.
Chapanis, A., R. B. Ochsman, R. N. Parrish, and G. D. Weeks. Studies in interactive communication: I. The effects of four communication modes on the behavior of teams during cooperative problem solving. Human Factors, 14:487-509, 1972.
Chapanis, A., R. N. Parrish, R. B. Ochsman, and G. D. Weeks. Studies in interactive communication: II. The effects of four communication modes on the linguistic performance of teams during cooperative problem solving. Human Factors, 19(2):101-125, April 1977.
Charniak, E. Jack and Janet in search of a theory of knowledge. In Advance Papers of the Third Meeting of the International Joint Conference on Artificial Intelligence, Los Altos, Calif. William Kaufmann, Inc., 1973.
Clark, H. H., and D. Wilkes-Gibbs. Referring as a collaborative process. Cognition, 22:1-39, 1986.
Codd, E. F. Seven steps to rendezvous with the casual user. In Proceedings IFIP TC-2 Working Conference on Data Base Management Systems, pp. 179-200. North-Holland Publishing Co., Amsterdam, 1974.
Cohen, P. R. On Knowing What to Say: Planning Speech Acts. PhD thesis, University of Toronto, Toronto, Canada. Technical Report No. 118, Department of Computer Science, 1978.
Cohen, P. R. The pragmatics of referring and the modality of communication. Computational Linguistics, 10(2):97-146, April-June 1984.
Cohen, P. R. The role of natural language in a multimodal interface. In The 2nd FRIEND21 International Symposium on Next Generation Human Interface Technologies, Tokyo, Japan, November 1991. Institute for Personalized Information Environment.

Cohen, P. R. Models of dialogue. In M. Nagao, ed., Cognitive Processing for Vision and Voice: Proceedings of the Fourth NEC Research Symposium. SIAM, 1993.
Cohen, P. R., and H. J. Levesque. Rational interaction as the basis for communication. In P. R. Cohen, J. Morgan, and M. E. Pollack, eds., Intentions in Communication. MIT Press, Cambridge, Mass., 1990.
Cohen, P. R., and H. J. Levesque. Confirmations and joint action. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, pp. 951-957, Sydney, Australia. Morgan Kaufmann Publishers, Inc., 1991.
Cohen, P. R., and C. R. Perrault. Elements of a plan-based theory of speech acts. Cognitive Science, 3(3):177-212, 1979.
Cohen, P. R., M. Dalrymple, D. B. Moran, F. C. N. Pereira, J. W. Sullivan, R. A. Gargan, J. L. Schlossberg, and S. W. Tyler. Synergistic use of direct manipulation and natural language. In Human Factors in Computing Systems: CHI'89 Conference Proceedings, pp. 227-234, New York. Addison Wesley Publishing Co., 1989.
Cole, R., L. Hirschman, L. Atlas, M. Beckman, A. Bierman, M. Bush, J. Cohen, O. Garcia, B. Hanson, H. Hermansky, S. Levinson, K. McKeown, N. Morgan, D. Novick, M. Ostendorf, S. Oviatt, P. Price, H. Silverman, J. Spitz, A. Waibel, C. Weinstein, S. Zahorain, and V. Zue. NSF Workshop on Spoken Language Understanding. Technical Report CS/E 92-014, Oregon Graduate Institute, September 1992.
Crane, H. D. Writing and talking to computers. Business Intelligence Program Report D91-1557, SRI International, Menlo Park, Calif., July 1991.
Dahlback, N., and A. Jonsson. An empirically based computationally tractable dialogue model. In Proceedings of the 14th Annual Conference of the Cognitive Science Society (COGSCI-92), Bloomington, Ind., July 1992.
Dahlback, N., A. Jonsson, and L. Ahrenberg. Wizard of Oz studies—why and how. In L. Ahrenberg, N. Dahlback, and A. Jonsson, eds., Proceedings from the Workshop on Empirical Models and Methodology for Natural Language Dialogue Systems, Trento, Italy, April. Association for Computational Linguistics, 1992.
Dowding, J., J. M. Gawron, D. Appelt, J. Bear, L. Cherny, R. Moore, and D. Moran. Gemini: A natural language system for spoken-language understanding. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 54-61, Columbus, Ohio, June 1993.
Englebart, D. Design considerations for knowledge workshop terminals. In National Computer Conference, pp. 221-227, 1973.
English, W. K., D. C. Englebart, and M. A. Berman. Display-selection techniques for text manipulation. IEEE Transactions on Human Factors in Electronics, HFE-8(1):5-15, March 1967.
Feiner, S. K., and K. R. McKeown. COMET: Generating coordinated multimedia explanations. In Human Factors in Computing Systems (CHI'91), pp. 449-450, New York, April. ACM Press, 1991.
Fisher, S. Virtual environments, personal simulation, and telepresence. Multimedia Review: The Journal of Multimedia Computing, 1(2), 1990.
Fraser, N. M., and G. N. Gilbert. Simulating speech systems. Computer Speech and Language, 5(1):81-99, 1991.
Garcia, O. N., A. J. Goldschen, and E. D. Petajan. Feature Extraction for Optical Speech Recognition or Automatic Lipreading. Technical Report, Institute for Information Science and Technology, Department of Electrical Engineering and Computer Science, The George Washington University, Washington, D.C., November 1992.

Giles, H., A. Mulac, J. J. Bradac, and P. Johnson. Speech accommodation theory: The first decade and beyond. In M. L. McLaughlin, ed., Communication Yearbook 10, pp. 13-48. Sage Publishers, Beverly Hills, California, 1987.
Gould, J. D. How experts dictate. Journal of Experimental Psychology: Human Perception and Performance, 4(4):648-661, 1978.
Gould, J. D. Writing and speaking letters and messages. International Journal of Man-Machine Studies, 16(1):147-171, 1982.
Gould, J. D., J. Conti, and T. Hovanyecz. Composing letters with a simulated listening typewriter. Communications of the ACM, 26(4):295-308, April 1983.
Grosz, B., and C. Sidner. Plans for discourse. In P. R. Cohen, J. Morgan, and M. E. Pollack, eds., Intentions in Communication, pp. 417-444. MIT Press, Cambridge, Mass., 1990.
Guyomard, M., and J. Siroux. Experimentation in the specification of an oral dialogue. In H. Niemann, M. Lang, and G. Sagerer, eds., Recent Advances in Speech Understanding and Dialogue Systems. NATO ASI Series, vol. 46. Springer Verlag, Berlin, 1988.
Harris, R. User oriented data base query with the robot natural language query system. International Journal of Man-Machine Studies, 9:697-713, 1977.
Hauptmann, A. G., and P. McAvinney. Gestures with speech for direct manipulation. International Journal of Man-Machine Studies, 38:231-249, 1993.
Hauptmann, A. G., and A. I. Rudnicky. A comparison of speech and typed input. In Proceedings of the Speech and Natural Language Workshop, pp. 219-224, San Mateo, Calif., June. Morgan Kaufmann Publishers, Inc., 1990.
Hendrix, G. G., and B. A. Walter. The intelligent assistant. Byte, pp. 251-258, December 1987.
Hindle, D. Deterministic parsing of syntactic non-fluencies. In Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, pp. 123-128, Cambridge, Mass., June 1983.
Hobbs, J. R. Resolving pronoun reference. Lingua, 44, 1978.
Hon, H.-W., and K.-F. Lee. Recent progress in robust vocabulary-independent speech recognition. In Proceedings of the Speech and Natural Language Workshop, pp. 258-263, San Mateo, Calif., October. Morgan Kaufmann Publishers, Inc., 1991.
Howard, J. A. Flight testing of the AFTI/F-16 voice interactive avionics system. In Proceedings of Military Speech Tech 1987, pp. 76-82, Arlington, Va. Media Dimensions, 1987.
Huang, X., F. Alleva, M.-Y. Hwang, and R. Rosenfeld. An overview of the SPHINX-II speech recognition system. In Proceedings of the ARPA Workshop on Human Language Technology, San Mateo, Calif. Morgan Kaufmann Publishers, Inc., 1993.
Hutchins, E. L., J. D. Hollan, and D. A. Norman. Direct manipulation interfaces. In D. A. Norman and S. W. Draper, eds., User Centered System Design, pp. 87-124. Lawrence Erlbaum Publishers, Hillsdale, N.J., 1986.
Jackson, E., D. Appelt, J. Bear, R. Moore, and A. Podlozny. A template matcher for robust NL interpretation. In Proceedings of the 4th DARPA Workshop on Speech and Natural Language, pp. 190-194, San Mateo, Calif., February. Morgan Kaufmann Publishers, Inc., 1991.
Jarke, M., J. A. Turner, E. A. Stohr, Y. Vassiliou, N. H. White, and K. Michielsen. A field evaluation of natural language for data retrieval. IEEE Transactions on Software Engineering, SE-11(1):97-113, 1985.
Jelinek, F. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64:532-536, April 1976.

Jelinek, F. The development of an experimental discrete dictation recognizer. Proceedings of the IEEE, 73(11):1616-1624, November 1985.
Karis, D., and K. M. Dobroth. Automating services with speech recognition over the public switched telephone network: Human factors considerations. IEEE Journal of Selected Areas in Communications, 9(4):574-585, 1991.
Kautz, H. A circumscriptive theory of plan recognition. In P. R. Cohen, J. Morgan, and M. E. Pollack, eds., Intentions in Communication. MIT Press, Cambridge, Mass., 1990.
Kay, A., and A. Goldberg. Personal dynamic media. IEEE Computer, 10(1):31-42, 1977.
Kelly, M. J., and A. Chapanis. Limited vocabulary natural language dialogue. International Journal of Man-Machine Studies, 9:479-501, 1977.
Kennedy, A., A. Wilkes, L. Elder, and W. S. Murray. Dialogue with machines. Cognition, 30(1):37-72, 1988.
Kitano, H. ΦDM-Dialog. IEEE Computer, 24(6):36-50, June 1991.
Krauss, R. M., and P. D. Bricker. Effects of transmission delay and access delay on the efficiency of verbal communication. Journal of the Acoustical Society of America, 41(2):286-292, 1967.
Krauss, R. M., and S. Weinheimer. Concurrent feedback, confirmation, and the encoding of referents in verbal communication. Journal of Personality and Social Psychology, 4:343-346, 1966.
Kreuger, M. Responsive environments. In Proceedings of the National Computer Conference, 1977.
Kubala, F., C. Barry, M. Bates, R. Bobrow, P. Fung, R. Ingria, J. Makhoul, L. Nguyen, R. Schwartz, and D. Stallard. BBN BYBLOS and HARC February 1992 ATIS benchmark results. In Fifth DARPA Workshop on Speech and Natural Language, San Mateo, Calif. Morgan Kaufmann Publishers, Inc., 1992.
Kurematsu, A. Future perspective of automatic telephone interpretation. Transactions of IEICE, E75(1):14-19, January 1992.
Lea, W. A. Practical lessons from configuring voice I/O systems. In Proceedings of Speech Tech/Voice Systems Worldwide, New York. Media Dimensions, Inc., 1992.
Leiser, R. G. Exploiting convergence to improve natural language understanding. Interacting with Computers, 1(3):284-298, December 1989.
Lennig, M. Using speech recognition in the telephone network to automate collect and third-number-billed calls. In Proceedings of Speech Tech'89, pp. 124-125, Arlington, Va. Media Dimensions, Inc., 1989.
Levelt, W. J. M., and S. Kelter. Surface form and memory in question-answering. Cognitive Psychology, 14(1):78-106, 1982.
Levinson, S. Some pre-observations on the modelling of dialogue. Discourse Processes, 4(1), 1981.
Litman, D. J., and J. F. Allen. A plan recognition model for subdialogues in conversation. Cognitive Science, 11:163-200, 1987.
Litman, D. J., and J. F. Allen. Discourse processing and commonsense plans. In P. R. Cohen, J. Morgan, and M. E. Pollack, eds., Intentions in Communication, pp. 365-388. MIT Press, Cambridge, Mass., 1990.
Luce, P. A., T. C. Feustel, and D. B. Pisoni. Capacity demands in short-term memory for synthetic and natural speech. Human Factors, 25(1):17-32, 1983.
MADCOW Working Group. Multi-site data collection for a spoken language corpus. In Proceedings of the Speech and Natural Language Workshop, pp. 7-14, San Mateo, Calif., February. Morgan Kaufmann Publishers, Inc., 1992.
Mariani, J. Spoken language processing in the framework of human-machine communication at LIMSI. In Proceedings of Speech and Natural Language Workshop, pp. 55-60, San Mateo, Calif. Morgan Kaufmann Publishers, Inc., 1992.

Marshall, J. P. A manufacturing application of voice recognition for assembly of aircraft wire harnesses. In Proceedings of Speech Tech/Voice Systems Worldwide, New York. Media Dimensions, Inc., 1992.
Martin, G. L. The utility of speech input in user-computer interfaces. International Journal of Man-Machine Studies, 30(4):355-375, 1989.
Martin, T. B. Practical applications of voice input to machines. Proceedings of the IEEE, 64(4):487-501, April 1976.
Michaelis, P. R., A. Chapanis, G. D. Weeks, and M. J. Kelly. Word usage in interactive dialogue with restricted and unrestricted vocabularies. IEEE Transactions on Professional Communication, PC-20(4), December 1977.
Mostow, J., A. G. Hauptmann, L. L. Chase, and S. Roth. Towards a reading coach that listens: Automated detection of oral reading errors. In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI93), Menlo Park, Calif. AAAI Press/The MIT Press, 1993.
Murray, I. R., J. L. Arnott, A. F. Newell, G. Cruickshank, K. E. P. Carter, and R. Dye. Experiments with a Full-Speed Speech-Driven Word Processor. Technical Report CS 91/09, Mathematics and Computer Science Department, University of Dundee, Dundee, Scotland, April 1991.
Nakatani, C., and J. Hirschberg. A speech-first model for repair detection and correction. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 46-53, Columbus, Ohio, June 1993.
National Research Council. Automatic Speech Recognition in Severe Environments. National Academy Press, Washington, D.C., 1984.
Newell, A. F., J. L. Arnott, K. Carter, and G. Cruickshank. Listening typewriter simulation studies. International Journal of Man-Machine Studies, 33(1):1-19, 1990.
Nusbaum, H. C., and E. C. Schwab. The effects of training on intelligibility of synthetic speech: II. The learning curve for synthetic speech. In Proceedings of the 105th meeting of the Acoustical Society of America, Cincinnati, Ohio, May 1983.
Nye, J. M. Human factors analysis of speech recognition systems. In Speech Technology 1, pp. 50-57, 1982.
Ochsman, R. B., and A. Chapanis. The effects of 10 communication modes on the behaviour of teams during co-operative problem-solving. International Journal of Man-Machine Studies, 6(5):579-620, Sept. 1974.
Oviatt, S. L. Pen/voice: Complementary multimodal communication. In Proceedings of Speech Tech'92, pp. 238-241, New York, February 1992.
Oviatt, S. L. Predicting spoken disfluencies during human-computer interaction. In K. Shirai, ed., Proceedings of the International Symposium on Spoken Dialogue: New Directions in Human-Machine Communication, Tokyo, Japan, November 1993.
Oviatt, S. L. Toward multimodal support for interpreted telephone dialogues. In M. M. Taylor, F. Neel, and D. G. Bouwhuis, eds., Structure of Multimodal Dialogue. Elsevier Science Publishers B.V., Amsterdam, Netherlands, in press.
Oviatt, S. L., and P. R. Cohen. Discourse structure and performance efficiency in interactive and noninteractive spoken modalities. Computer Speech and Language, 5(4):297-326, 1991a.
Oviatt, S. L., and P. R. Cohen. The contributing influence of speech and interaction on human discourse patterns. In J. W. Sullivan and S. W. Tyler, eds., Intelligent User Interfaces, pp. 69-83. ACM Press Frontier Series. Addison-Wesley Publishing Co., New York, 1991b.
Oviatt, S. L., P. R. Cohen, M. W. Fong, and M. P. Frank. A rapid semi-automatic simulation technique for investigating interactive speech and handwriting. In J. Ohala, ed., Proceedings of the 1992 International Conference on Spoken Language Processing, pp. 1351-1354, University of Alberta, October 1992.

Page 73 Ohala, ed., Proceedings of the 1992 International Conference on Spoken Language Processing, pp. 1351-1354, University of Alberta, October 1992. Oviatt, S. L, P. R. Cohen, M. Wang, and J. Gaston. A simulation-based research strategy for designing complex NL systems. In ARPA Human Language Technology Workshop, Princeton, N.J., March 1993. Pallett, D. S., J. G. Fiscus, W. M. Fisher, and J. S. Garofolo. Benchmark tests for the DARPA spoken language program. In Proceedings of the ARPA Workshop on Human Language Technology, San Mateo, Calif., Morgan Kaufmann Publishers, Inc., 1993. Pavan, S., and B. Pelletti. An experimental approach to the design of an oral cooperative dialogue. In Selected Publications, 1988-1990, SUNDIAL Project (Esprit P2218). Commission of the European Communities, 1990. Peckham, J. Speech understanding and dialogue over the telephone: An overview of the ESPRIT SUNDIAL project. In Proceedings of the Speech and Natural Language Workshop, pp. 14-28, San Mateo, Calif., February. Morgan Kaufmann Publishers, Inc., 1991. Perrault, C.R., and J. F. Allen. A plan-based analysis of indirect speech acts. American Journal of Computational Linguistics, 6(3):167-182, 1980. Petajan, E., B. Bradford, D. Bodoff, and N. M. Brooke. An improved automatic lipreading system to enhance speech recognition. In Proceedings of Human Factors in Computing Systems (CHI'88), pp. 19-25, New York. Association for Computing Machinery Press, 1988. Polanyi, R., and R. Scha. A syntactic approach to discourse semantics. In Proceedings of the 10th International Conference on Computational Linguistics, pp. 413-419, Stanford, Calif., 1984. Pollack, A. Computer translator phones try to compensate for Babel. New York Times, January 29, 1993. Price, P. J., Evaluation of spoken language systems: The ATIS domain. In Proceedings of the 3rd DARPA Workshop on Speech and Natural Language, pp. 91-95, San Mateo, Calif. Morgan Kaufmann Publishers, Inc., 1990. Price, P., M. Ostendorf, S. Shattuck-Hufnagel, and C. Fong. The use of prosody in syntactic disambiguation. In Proceedings of the Speech and Natural Language Workshop, pp. 372-377, San Mateo, Calif., October. Morgan Kaufmann Publishers, Inc., 1991. Proceedings of the Speech and Natural Language Workshop, San Mateo, Calif., October, 1991, Morgan Kaufmann Publishers, Inc. Rabiner, L. R., J. G. Wilpon, and A. E. Rosenberg. A voice-controlled, repertory-dialer system. Bell System Technical Journal, 59(7):1153-1163, September 1980. Rheingold, H. Virtual Reality. Summit Books, 1991. Roe, D. B., F. Pereira, R. W. Sproat, and M. D. Riley. Toward a spoken language translator for restricted-domain context-free languages. In Proceedings of Eurospeech'91: 2nd European Conference on Speech Communication and Technology, pp. 10631066, Genova, Italy. European Speech Communication Association, 1991. Rosenhoover, F. A., J. S. Eckel, F. A. Gorg, and S. W. Rabeler. AFTI/F-16 voice interactive avionics evaluation. In Proceedings of the National Aerospace and Electronics Conference (NAECON'87). IEEE, 1987. Rubin-Spitz, J., and D. Yashchin. Effects of dialogue design on customer responses in automated operator services. In Proceedings of Speech Tech'89, 1989. Rudnicky, A. I. Mode preference in a simple data-retrieval task. In ARPA Human Language Technology Workshop, Princeton, N.J., March 1993. Searle, J. R. Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, Cambridge, 1969.

Shneiderman, B. Natural vs. precise concise languages for human operation of computers: Research issues and experimental approaches. In Proceedings of the 18th Annual Meeting of the Association for Computational Linguistics, pp. 139-141, Philadelphia, Pa., June 1980a.
Shneiderman, B. Software Psychology: Human Factors in Computer and Information Systems. Winthrop Publishers, Inc., Cambridge, Mass., 1980b.
Shneiderman, B. Direct manipulation: A step beyond programming languages. IEEE Computer, 16(8):57-69, 1983.
Shriberg, E., E. Wade, and P. Price. Human-machine problem-solving using spoken language systems (SLS): Factors affecting performance and user satisfaction. In Proceedings of the Speech and Natural Language Workshop, pp. 49-54, San Mateo, Calif. Morgan Kaufmann Publishers, Inc., 1992.
Sidner, C., and D. Israel. Recognizing intended meaning and speaker's plans. In Proceedings of the Seventh International Joint Conference on Artificial Intelligence, pp. 203-208, Vancouver, B.C., 1981.
Simpson, C. A., and T. N. Navarro. Intelligibility of computer-generated speech as a function of multiple factors. In Proceedings of the National Aerospace and Electronics Conference (NAECON), pp. 932-940, New York, May. IEEE, 1984.
Simpson, C. A., C. R. Coler, and E. M. Huff. Human factors of voice I/O for aircraft cockpit controls and displays. In Proceedings of the Workshop on Standardization for Speech I/O Technology, pp. 159-166, Gaithersburg, Md., March. National Bureau of Standards, 1982.
Simpson, C. A., M. E. McCauley, E. F. Roland, J. C. Ruth, and B. H. Williges. System design for speech recognition and generation. Human Factors, 27(2):115-141, 1985.
Small, D., and L. Weldon. An experimental comparison of natural and structured query languages. Human Factors, 25:253-263, 1983.
Spitz, J. Collection and analysis of data from real users: Implications for speech recognition/understanding systems. In Proceedings of the 4th DARPA Workshop on Speech and Natural Language, Asilomar, Calif., February. Defense Advanced Research Projects Agency, 1991.
Stallard, D., and R. Bobrow. Fragment processing in the DELPHI system. In Proceedings of the Speech and Natural Language Workshop, pp. 305-310, San Mateo, Calif., February. Morgan Kaufmann Publishers, Inc., 1992.
Street, R. L., Jr., R. M. Brady, and W. B. Putman. The influence of speech rate stereotypes and rate similarity on listeners' evaluations of speakers. Journal of Language and Social Psychology, 2(1):37-56, 1983.
Streeter, L. A., D. Vitello, and S. A. Wonsiewicz. How to tell people where to go: Comparing navigational aids. International Journal of Man-Machine Studies, 22:549-562, 1985.
Swider, R. F. Operational evaluation of voice command/response in an Army helicopter. In Proceedings of Military Speech Tech 1987, pp. 143-146, Arlington, Va. Media Dimensions, 1987.
Tanaka, S., D. K. Wild, P. J. Seligman, W. E. Halperin, V. Behrens, and V. Putz-Anderson. Prevalence and Work-Relatedness of Self-Reported Carpal Tunnel Syndrome Among U.S. Workers—Analysis of the Occupational Health Supplement Data of the 1988 National Health Interview Survey. National Institute of Occupational Safety and Health, and Centers for Disease Control and Prevention (Cincinnati), in submission.
Tennant, H. R., K. M. Ross, R. M. Saenz, C. W. Thompson, and J. R. Miller. Menu-based natural language understanding. In Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, pp. 151-158, Cambridge, Mass., June 1983.
Thomas, J. C., M. B. Rosson, and M. Chodorow. Human factors and synthetic speech. In B. Shackel, ed., Proceedings of INTERACT'84, Amsterdam. Elsevier Science Publishers B.V. (North Holland), 1984.
Turner, J. A., M. Jarke, E. A. Stohr, Y. Vassiliou, and N. White. Using restricted natural language for data retrieval: A plan for field evaluation. In Y. Vassiliou, ed., Human Factors and Interactive Computer Systems, pp. 163-190. Ablex Publishing Corp., Norwood, N.J., 1984.
VanKatwijk, A. F., F. L. VanNes, H. C. Bunt, H. F. Muller, and F. F. Leopold. Naive subjects interacting with a conversing information system. IPO Annual Progress Report, 14:105-112, 1979.
Visick, D., P. Johnson, and J. Long. The use of simple speech recognisers in industrial applications. In Proceedings of INTERACT'84 First IFIP Conference on Human-Computer Interaction, London, U.K., 1984.
Voorhees, J. W., N. M. Bucher, E. M. Huff, C. A. Simpson, and D. H. Williams. Voice interactive electronic warning system (VIEWS). In Proceedings of the IEEE/AIAA 5th Digital Avionics Systems Conference, pp. 3.5.1-3.5.8, New York. IEEE, 1983.
Wahlster, W. User and discourse models for multimodal communication. In J. W. Sullivan and S. W. Tyler, eds., Intelligent User Interfaces, pp. 45-68. ACM Press Frontier Series. Addison-Wesley Publishing Co., New York, 1991.
Weinstein, C. Opportunities for advanced speech processing in military computer-based systems. Proceedings of the IEEE, 79(11):1626-1641, November 1991.
Welkowitz, J., S. Feldstein, M. Finkelstein, and L. Aylesworth. Changes in vocal intensity as a function of interspeaker influence. Perceptual and Motor Skills, 10:715-718, 1972.
Williamson, J. T. Flight test results of the AFTI/F-16 voice interactive avionics program. In Proceedings of the American Voice I/O Society (AVIOS) 87 Voice I/O Systems Applications Conference, pp. 335-345, Alexandria, Va., 1987.
Winograd, T. Understanding Natural Language. Academic Press, New York, 1972.
Winograd, T., and F. Flores. Understanding Computers and Cognition: A New Foundation for Design. Ablex Publishing Co., Norwood, N.J., 1986.
Yamaoka, T., and H. Iida. Dialogue interpretation model and its application to next utterance prediction for spoken language processing. In Proceedings of Eurospeech'91: 2nd European Conference on Speech Communication and Technology, pp. 849-852, Genova, Italy. European Speech Communication Association, 1991.
Yato, F., T. Takezawa, S. Sagayama, J. Takami, H. Singer, N. Uratani, T. Morimoto, and A. Kurematsu. International Joint Experiment Toward Interpreting Telephony (in Japanese). Technical Report, The Institute of Electronics, Information, and Communication Engineers, 1992.
Young, S. R., A. G. Hauptmann, W. H. Ward, E. T. Smith, and P. Werner. High level knowledge sources in usable speech recognition systems. Communications of the ACM, 32(2), February 1989.
Zoltan-Ford, E. Language Shaping and Modeling in Natural Language Interactions with Computers. PhD thesis, Psychology Department, Johns Hopkins University, Baltimore, Md., 1983.
Zoltan-Ford, E. Reducing variability in natural-language interactions with computers. In M. J. Alluisi, S. de Groot, and E. A. Alluisi, eds., Proceedings of the Human Factors Society-28th Annual Meeting, vol. 2, pp. 768-772, San Antonio, Tex., 1984.
Zoltan-Ford, E. How to get people to say and type what computers can understand. International Journal of Man-Machine Studies, 34:527-547, 1991.
Zue, V., J. Glass, D. Goddeau, D. Goodine, L. Hirschman, M. Phillips, J. Polifroni, and S. Seneff. The MIT ATIS system: February 1992 progress report. In Fifth DARPA Workshop on Speech and Natural Language, San Mateo, Calif. Morgan Kaufmann Publishers, Inc., 1992.