Voice Communication Between
Humans and Machines
Some great pundit once remarked, "Every time has its technology, and every technology has its time." Although this is a somewhat simplistic view of technology, it is indeed true that when one thinks of a time period (especially in modern times) one always associates with it the key technologies that "revolutionized" the way people lived then. For example, key technologies that "came of age" in the 1970s include the VLSI integrated circuit, the photocopier, the computer terminal, MOS memory, and scanners. In the 1980s we saw the advent and growth of the personal computer, fiber optics, FAX machines, and medical imaging systems of all types. It is not too difficult to predict what some of the key technologies of the 1990s will be; these include voice processing, image processing, wireless communications, and personal information terminals.
If we examine the various technologies noted above and look at the interval between the time the technology was "understood" and the time the technology began to mature and grow, we see a very complex and intricate relationship. For example, the basic principles of FAX were well understood for more than 150 years. However, until there were established worldwide standards for transmission and reception of FAX documents, the technology remained an intellectual curiosity that was shown and discussed primarily in the research laboratory. Similarly, the concept (and realization) of a videophone was demonstrated at the New York World's Fair in 1964 (so-called
Picturephone Service), but the first commercially viable instruments were actually produced and sold in 1992. In this case it took a bandwidth reduction (from 1.5 Mbps down to 19.2 Kbps) and a major cost reduction, as well as algorithm breakthroughs in voice and video coding and in modem design, to achieve this minor miracle.
Other technologies were able to leave the research laboratory rather rapidly, sometimes in response to national imperatives (e.g., miniaturization for the space program) and sometimes in response to business necessities. For example, when fiber optic lines were first mass produced in the 1980s, it was estimated that it would take about two decades to convert the analog transmission facilities of the old Bell System to digital form. In reality the long-distance telephone network was fully digital by the end of 1989, more than 10 years earlier than predicted. Similarly, in the case of cellular telephony, it was predicted that it would be about a decade before there would be 1 million cellular phones in use in the United States. By the end of 1992 (i.e., about 8 years after the beginning of the "cellular revolution"), the 10-millionth cellular phone was already operating in the United States, and the rate of growth of both cellular and wireless telephony was continuing unabated.
Now we come to the decade of the 1990s and we have already seen strong evidence that the key technologies that are evolving are those that support multimedia computing, multimedia communication, ease of use, portability, and flexibility. The vision of the 1990s is ubiquitous, low-cost, easy-to-use communication and computation for everyone. One of the key technologies that must evolve and grow to support this vision is that of voice processing. Although research in voice processing has been carried out for several decades, it has been the confluence of cheap computation (as embodied by modern digital signal processor chips), low-cost memory, and algorithm improvements that has stimulated a wide range of uses for voice processing technology across the spectrum of telecommunications and consumer, military, and specialized applications.
ELEMENTS OF VOICE PROCESSING TECHNOLOGY
The field of voice processing encompasses five broad technology areas:
• voice coding, the process of compressing the information in a voice signal so as to either transmit it or store it economically over a channel whose bandwidth is significantly smaller than that of the uncompressed signal;
• voice synthesis, the process of creating a synthetic replica of a voice signal so as to transmit a message from a machine to a person, with the purpose of conveying the information in the message;
• speech recognition, the process of extracting the message information in a voice signal so as to control the actions of a machine in response to spoken commands;
• speaker recognition, the process of either identifying or verifying a speaker by extracting individual voice characteristics, primarily for the purpose of restricting access to information (e.g., personal/private records), networks (computer, PBX), or physical premises; and
• spoken language translation, the process of recognizing the speech of a person talking in one language, translating the message content to a second language, and synthesizing an appropriate message in the second language, for the purpose of providing two-way communication between people who do not speak the same language.
To get an appreciation of the progress in each of these areas of voice processing, it is worthwhile to briefly review their current capabilities.
Voice coding technology has been widely used for over two decades in network transmission applications. A key driving factor here has been international standardization of coding algorithms at 64 Kbps (μ-law Pulse Code Modulation, G.711), 32 Kbps (Adaptive Differential Pulse Code Modulation, G.721), and 16 Kbps (Low-Delay Code Excited Linear Prediction, G.728). Voice coding has also been exploited in cellular systems with the advent of the European GSM standard at 13.2 Kbps, the North American standard (IS-54, Vector Sum Excited Linear Prediction) at 8 Kbps, and the promise of the so-called half-rate standards of 6.6 Kbps in Europe and 4 Kbps in North America. Finally, low bit-rate coding for transmission has been a driving force for security applications in the U.S. government, based on standards at 4.8 Kbps (FS 1016, Code Excited Linear Prediction) and 2.4 Kbps (FS 1015, Linear Predictive Coding, LPC-10e).
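The 64-Kbps figure for G.711 follows from 8000 samples per second at 8 bits per sample; the 8-bit representation is made perceptually tolerable by μ-law companding, which allocates finer quantization to low-amplitude samples. The sketch below illustrates the continuous μ-law companding formula with μ = 255; note that the actual G.711 standard uses a piecewise-linear (segmented) approximation of this curve, and the function names here are our own:

```python
import math

MU = 255.0  # companding parameter used by mu-law PCM (G.711)


def mu_law_compress(x: float) -> float:
    """Map a linear sample x in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Invert the companding curve."""
    return math.copysign((math.exp(abs(y) * math.log1p(MU)) - 1.0) / MU, y)


def encode(x: float) -> int:
    """Quantize the companded value to 8 bits
    (8 bits/sample x 8000 samples/s = 64 Kbps)."""
    return round((mu_law_compress(x) + 1.0) / 2.0 * 255)


def decode(code: int) -> float:
    """Recover an approximate linear sample from an 8-bit code."""
    return mu_law_expand(code / 255.0 * 2.0 - 1.0)
```

Round-tripping a sample through `encode` and `decode` introduces only a small error, and the error shrinks with the signal amplitude, which is exactly the property that makes 8 bits sufficient for telephone-quality speech.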
In the area of voice coding for storage, perhaps the most important application is in the storage of voice messages in voice mailboxes. Typically, most voice mail systems compress the speech to 16 Kbps so as to minimize the total storage requirements of the system while maintaining high-quality voice messages. Another recent application that relies heavily on voice coding is the digital telephone
answering machine in which both voice prompts (including the outgoing message) and voice messages are compressed and stored in the machine's local memory. Current capabilities include tens of seconds of voice prompts and up to about 30 minutes of message storage.
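The storage figures quoted above follow from simple arithmetic on the coding rate. A back-of-the-envelope calculation, assuming the 16-Kbps compression typical of voice mail systems:

```python
# Rough storage budget for a digital answering machine, assuming
# voice compressed to 16 Kbps as in typical voice mail systems.
BIT_RATE = 16_000        # bits per second of compressed speech
MESSAGE_MINUTES = 30     # total message storage cited in the text

bits = BIT_RATE * MESSAGE_MINUTES * 60
megabytes = bits / 8 / 1_000_000
print(megabytes)  # 3.6 -> about 3.6 MB holds 30 minutes of messages
```

At early-1990s memory prices this is why 30 minutes of message storage, rather than hours, was the practical limit for a consumer device.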
Voice synthesis has advanced to the point where virtually any ASCII text message can be converted to speech, providing a message that is completely intelligible, albeit somewhat unnatural (machinelike) in quality. Although the range of applications of voice synthesis is growing rapidly, several key ones have already emerged. One such application is a voice server for accessing electronic mail (e-mail) messages remotely, over a dialed-up telephone line. Such a service is a valuable one for people who travel extensively (especially outside the United States) and who do not have access to computer lines to read their mail electronically. It is also valuable for bridging the "time gap" associated with travel, when the working day where you are may not align well with the working day in your home location.
Other interesting and evolving applications of voice synthesis include automated order inquiry (keeping track of the progress of orders); remote student registration (course selection and placement); proofing of text documents ("listening" to your written reports, responses to e-mail, etc.); and providing names, addresses, and telephone numbers in response to directory assistance queries.
Although speech recognition technology has made major advances in the past several years, we are still a long way from the science fiction recognition machines embodied by HAL in Stanley Kubrick's 2001: A Space Odyssey or R2-D2 in George Lucas's Star Wars. However, our current capability, albeit somewhat limited, has opened up a number of possibilities for improvements in the quality of life in selected areas.
A far-reaching application of speech recognition is the automation of the billing function of operator services, whereby 0+ calls that are not dialed directly (e.g., Collect, Person-to-Person, Third-Party Billing, Operator-Assisted, and Calling Card) are handled automatically. Based on calling volumes at the end of 1993, about 4 billion calls per year are handled by speech recognition technology for this application alone. (Most of the remaining 5 to 5.5 billion 0+ calls
are Calling Card calls, which are normally dialed directly via touch-tone responses rather than voice.)
Other recent applications of speech recognition include toys, cellular voice dialers for automobiles (which promise the ultimate in safety, namely "eyes-free" and "hands-free" communication), voice routing of calls (i.e., replacement for button pushing from touch-tone phones), automatic creation of medical reports (aids to radiologists and medical technicians), order entry (e.g., catalog sales, verification of credit), forms entry (insurance, medical), and even some elementary forms of voice dictation of letters and reports.
Speaker recognition technology is one area where the computer can outperform a human: the ability of a computer either to identify a speaker from a given population or to verify an identity claim from a named speaker exceeds that of a human trying to perform the same tasks.
Speaker identification is required primarily in some types of forensic applications (e.g., identifying speakers who make obscene phone calls or in criminal investigations). Hence, the number of applications affecting most people's day-to-day lives is limited.
Speaker verification is required for a wide variety of applications that provide secure entry to ATMs (automatic teller machines), PBXs, telecommunications services, banking services, private records, etc. Another interesting (and unusual) application of speaker verification is keeping track of prison parolees by having an automatic system call the place a parolee is supposed to be and verify (by voice) that the person answering the call is, in fact, the parolee. Finally, voice verification has been used to restrict entry to buildings, restricted areas, and secure installations by requiring users to speak "voice passwords" (a throwback to the "open sesame" command in ancient Persia) in order to gain entry.
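A toy sketch may help make the verification decision concrete: an incoming utterance is reduced to a feature vector, compared against the claimant's enrolled template, and accepted or rejected against a threshold. Real systems of the period used spectral (e.g., cepstral) features and statistical models rather than a single vector; the vectors, threshold, and function names below are purely hypothetical illustrations:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def verify(claimed_template, test_features, threshold=0.9):
    """Accept the identity claim only if the test utterance's features
    are close enough to the enrolled template for the claimed speaker."""
    return cosine_similarity(claimed_template, test_features) >= threshold


# Hypothetical enrolled template and two test utterances.
enrolled = [1.0, 0.8, 0.3, 0.5]
same_speaker = [0.95, 0.82, 0.28, 0.55]
impostor = [0.2, 0.1, 0.9, 0.4]

print(verify(enrolled, same_speaker))  # True: claim accepted
print(verify(enrolled, impostor))      # False: claim rejected
```

The threshold setting embodies the basic trade-off of any verification system: raising it reduces false acceptances of impostors at the cost of more false rejections of the true speaker.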
SPOKEN LANGUAGE TRANSLATION
Since spoken language translation relies heavily on speech recognition, speech synthesis, and natural language understanding, as well as on text-to-text (or message-to-message) translation, it is an ambitious long-term goal of voice processing technology. However, based on extensive research by NEC and ATR in Japan; AT&T, Carnegie Mellon University, and IBM in the United States; and Siemens and Telefonica in Europe (among other laboratories), a number of interesting laboratory systems for language translation have evolved. Such systems, although fairly limited in scope and capability, point the way to what "might be" in the future.
One example of such a laboratory language translation system is VEST, the Voice English-Spanish Translator, developed jointly by AT&T and Telefonica; it can handle a limited set of banking and currency transactions. This speaker-trained system, with a vocabulary of about 450 words, was demonstrated continuously at the Seville World's Fair in Spain in 1992 for about 6 months.
The NEC language translation system, which has been successfully demonstrated at two Telecom meetings, deals with tourist information on activities in Japan for foreign visitors. The ATR system is geared to "interpreting telephony," that is, voice translation over dialed-up telephone lines. The ATR system was demonstrated in 1993 via a three-country international call (among Japan, the United States, and Germany) for the purpose of registering for, and getting information about, an international conference.
NATURAL LANGUAGE PROCESSING
As voice processing technology becomes more capable and more sophisticated, the goals become more far-reaching, namely to provide human-like intelligence in every voice transaction with a machine. As such, the machine will ultimately have to go beyond "recognizing or speaking the words"; it will have to understand the meaning of the words, the talker's intent, and perhaps even the talker's state of mind and emotions.
Although we are a long way from being able to understand the nuances of spoken language, or to provide such nuances in synthetic speech, we are able to go "beyond the words" toward understanding the meaning of spoken input via evolving methods of natural language understanding. These include well-established methods of syntactic and semantic analysis, pragmatics, and discourse and dialogue analysis. Such natural language understanding provides the bridge between words and concepts, thereby enabling the machine to act properly based on a spoken request and to respond properly with an appropriate action taken by the machine.
This volume, Voice Communication Between Humans and Machines, follows a colloquium on Human/Machine Communication by Voice held in February 1993. This colloquium, sponsored by the
National Academy of Sciences, examined two rapidly evolving voice processing technologies, voice synthesis and speech recognition, along with the natural language understanding needed to integrate these technologies into a speech understanding system. The major purpose of the colloquium was to bring together researchers, system developers, and technologists in order to better understand the strengths and limitations of current technology and to think about both business and technical opportunities for applying the technology. The contents and structure of this book are the same as those of the colloquium.
The colloquium was organized into four sessions each day. The talks on the first day provided a perspective on our current understanding of voice processing in general, and of speech synthesis, speech recognition, and natural language understanding specifically. The talks on the second day discussed applications of the technology in the telecommunications area, in the government and military, in the consumer area, and in aids for handicapped persons. In addition, one session examined the hardware and software constraints in implementing speech systems, as well as the user interface, which is critical to the deployment and user acceptance of voice processing technology. The last session of the colloquium concentrated on trying both to predict the future and to provide a roadmap as to how we might achieve our vision for the technology.
Each session (with the exception of the final one on the second day) consisted of two 30-minute presentations followed by a lively 30-minute discussion period among the session chairman, the speakers, and the audience.
The first session, chaired by Ron Schafer (Georgia Tech), dealt with the scientific bases of human-machine communication by voice. Phil Cohen (SRI International) discussed the role of voice in human-machine communication and argued for interface designs that integrate multiple modes of communication. Jim Flanagan (Rutgers University) provided a comprehensive overview of the history of speech communications from the ancient Greeks (who were fascinated by talking statues) to modern voice communication systems.
The second session, chaired by Mark Liberman (University of Pennsylvania), discussed current understanding of the acoustic and linguistic aspects of speech synthesis. Rolf Carlson (Royal Institute of Technology, Sweden) argued for using speech knowledge from a wide range of disciplines. Jon Allen (MIT) presented the case for merging linguistic models with statistical knowledge obtained from exhaustive analysis of a large tagged database.
The third session, chaired by Steve Levinson (AT&T Bell Laboratories), was on speech recognition technology and consisted of an
overview of current speech recognition techniques by John Makhoul (BBN) and a talk on training and search methods by Fred Jelinek (IBM). Makhoul's talk emphasized the rate of progress in improving performance (as measured in terms of word accuracy) in continuous speech recognition over the past several years and the factors that led to these performance improvements. Jelinek concentrated on the mathematical procedures used to train and decode speech recognizers.
The final session of the first day, chaired by Lynette Hirschman (Massachusetts Institute of Technology), dealt with natural language understanding. Madeleine Bates (BBN) discussed models of natural language understanding and reviewed our current understanding in the areas of syntax, semantics, pragmatics, and discourse. Bob Moore (SRI) discussed the way in which speech can be integrated with natural language as the basis for a speech understanding system.
The two morning sessions on the second day were devoted to applications of the technology and were chaired by Chris Seelbach (Seelbach Associates) and John Oberteuffer (ASR News). Excellent overviews of key applications in the areas of telecommunications (Jay Wilpon, AT&T Bell Laboratories), aids for the handicapped (Harry Levitt, CUNY), the military (Cliff Weinstein, MIT Lincoln Laboratory), and consumer electronics (George Doddington, SISTO/DARPA) were given and stimulated lively discussion.
The next session, chaired by David Roe (AT&T Bell Laboratories), concentrated on technical and human requirements for successful technology deployment. Ryohei Nakatsu (NTT) discussed the hardware/software issues, and Candace Kamm (Bellcore) discussed the user interface issues that needed to be addressed and understood.
The final session, titled "Technology 2001," was chaired by Frank Fallside (University of Cambridge) and consisted of three views of where the technology is headed over the next decade and how each speaker thought it would get there. Bishnu Atal (AT&T Bell Laboratories) looked at fundamentally new research directions. Sadaoki Furui (NTT) predicted the directions of research in synthesis and recognition systems. Finally, Mitch Marcus (University of Pennsylvania) discussed new developments in statistical modeling of semantic concepts.
A highlight of the colloquium was an after-dinner talk by Yasuo Kato (NEC) on the future of voice processing technology in the world of computers and communications. Kato, who has contributed to the field for close to 40 years, looked back at how far we have come and gave glimpses of how far we might go in the next decade.