The Future of Voice-Processing Technology in the World of Computers and Communications
This talk, which was the keynote address of the NAS Colloquium on Human-Machine Communication by Voice, discusses the past, present, and future of human-machine communications, especially speech recognition and speech synthesis. Progress in these technologies is reviewed in the context of the general progress in computer and communications technologies.
EXPECTATIONS FOR VOICE INTERFACE
Many of us have now experienced conversation with computers by voice. When I first experienced it more than a quarter-century ago I was a little shocked, even though it was just a simple conversation. It gave me the feeling that I was actually communicating with a person; that is, I had the feeling that the machine had a personality. And it was fun! Since then I have come to believe that the voice interface is very special. It should not be regarded simply as one alternative for ordinary human-machine interface.
Speech communication is natural, common, and easy for us. Practically speaking, there are at least three advantages to a voice interface. First, it is an easy interface to use. Second, it can be used while the user is engaged in other tasks. Third, it accommodates multimodal interfaces. But it seems to me that there are other important or essential reasons for voice communication for humans, and this thought has driven me to pour energy into speech recognition and synthesis for a long time: 35 years now.
I think we are still only halfway to our goal of an advanced or smart interface. From here on the scientific path to our goal only gets steeper.
Today we live in the age of information. Five billion people can benefit from an economically efficient society supported by computers and communications. This will become truer as we become more information oriented.
The NEC Corporation recognized the importance of integrating computers and communications long ago and adopted "C&C" (computer and communications) as its corporate identity in 1977. In the future, C&C will become an indispensable tool for everyone; for many it already has. In this kind of environment, C&C must be easy to use. It must provide a friendly, natural, smart interface for people. Thus, the voice interface is an important component for the C&C of the future.
Recently, we have seen significant progress in speech-processing technologies, especially in the United States. At NEC we have also made a little progress with large-vocabulary recognition (10,000 words), speaker-independent continuous speech recognition (1000 words), and naturally spoken language understanding. Some of this technology is close to being commercialized.
I will not spend a lot of time discussing speech synthesis, but I must make one important comment. Prosodic features such as pauses, stresses, and intonation are related to semantic structure and emotion and are difficult to convey in written messages. But the role of these features will become very important in sophisticated smart interfaces.
VOICE INTERFACE IN THE C&C INFORMATION SOCIETY
There are many applications for new C&C technologies. In the area of public information services there are applications in employment, securities, risk management, taxation, health, education, marriage, entertainment, shopping, and even funeral arrangements. For business and personal use, there are applications in document design, presentations, publications, information access, and inter/intraoffice communication. And personal applications are also important: text preparation (dictation), e-mail, telephone, scheduling, and personal database management.
In the future information society, people will access and exchange a variety of services, applications, and information through networks. Access will be obtained through terminals in offices and homes and even through mobile terminals, as shown in Figure 1.
The speech-processing function for a voice interface can be located in those terminals or in central machines. I anticipate that this function will generally be handled by terminals. I will touch on this again later, but until now it has been easiest to locate this function centrally because of space or cost constraints. Next I will discuss some examples of centrally located speech-processing functions.
At NEC, we commercialized a speaker-independent recognition system, the SR-1000, and first used it for banking information services in 1979. Fourteen years ago is ancient history in voice-processing technology, but hindsight is 20-20, and to understand the future it is often helpful to know the past. This system could recognize 20 words (numerals plus control words, spoken separately) and could process 64 telephone lines simultaneously.
Our current model, the SR-3200, is also used for banking services as well as order entry and reservations. In all cases, centrally located speech recognizers must be speaker independent, which makes producing them more difficult.
For a moment I would like to go back even further. In 1959 we developed a spoken digit recognizer. I believed that word recognition for a small vocabulary was within reach, and we challenged ourselves to produce a voice dialing mobile telephone system. This
experimental system was another example of a centralized voice interface. We made it in 1960, and long-run field evaluations were done inside our laboratory building. Although the system was primitive, and of course not commercialized, we did confirm a future for the voice interface.
When we demonstrated the system at an EXPO in Japan, many customers were interested in using the recognizer in their work. However, most of them needed more than a simple word recognizer, and this forced me to begin research into continuous speech recognition.
Today, no one thinks about centralized machines because it is easy to install a voice dialing function inside the telephone. Soon you will be able to carry a voice dial mobile telephone in your pocket.
A FRIENDLY, SMART INTERFACE
A human-machine interface can be friendly and smart only if it is customized to individuals. By "smart" I mean that it is multimodal, accepts spoken language, and allows spontaneous and incomplete utterances. The personal terminal will become a tool for accessing and exchanging information from an infinite variety of applications and services, as shown in Figure 2. To customize terminals, however, it will be necessary for the terminal to know about the user, that is, to have knowledge of a person's job, application preferences, and acoustic and linguistic speech characteristics.
Similarly, the interface will also require knowledge of the applications or services being used or accessed. Because of difficulties with centralizing knowledge about users, this information will probably need to be contained in the terminal, and all speech recognition functions may also need to be carried out at the terminal.
In our speech recognition research and development at NEC, we have pursued terminal-type speech recognizers. The DP-100 was our first commercial speech recognizer. It was a speaker-dependent continuous speech recognizer with a 120-word vocabulary that we produced in 1978. Its operation is based on dynamic programming (DP) matching, which we developed in 1971. At that time we were very excited because DP matching significantly improved recognition performance. The DP-3000 is our current model; it has a 500-word vocabulary. To accomplish this we developed a high-speed DP matching LSI to reduce hardware size; it also tripled recognition performance.
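The DP matching at the heart of the DP-100 is, in essence, what is now commonly called dynamic time warping: a recognizer aligns an incoming utterance against each stored word template and picks the template with the smallest cumulative distance. The following is only a minimal sketch of that recurrence, not NEC's actual implementation; the one-dimensional "feature" values and the absolute-difference local distance are simplifying assumptions made for illustration.

```python
# Minimal dynamic-programming (DTW-style) template matching sketch.
# Illustrative only: real recognizers use short-time spectral vectors
# and a spectral distance, but the recurrence has the same shape.

def dp_match(template, utterance):
    """Return the cumulative alignment cost between two feature sequences."""
    INF = float("inf")
    n, m = len(template), len(utterance)
    # cost[i][j] = best cost of aligning template[:i+1] with utterance[:j+1]
    cost = [[INF] * m for _ in range(n)]
    cost[0][0] = abs(template[0] - utterance[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            d = abs(template[i] - utterance[j])  # local frame distance
            best_prev = min(
                cost[i - 1][j] if i > 0 else INF,                 # stretch template
                cost[i][j - 1] if j > 0 else INF,                 # stretch utterance
                cost[i - 1][j - 1] if i > 0 and j > 0 else INF,   # advance both
            )
            cost[i][j] = d + best_prev
    return cost[n - 1][m - 1]

# Recognition picks the vocabulary template with the smallest distance.
templates = {"zero": [1, 1, 2, 3], "one": [5, 4, 4, 2]}
spoken = [1, 2, 2, 3]
word = min(templates, key=lambda w: dp_match(templates[w], spoken))
```

Because the warping path may stretch either sequence, the match tolerates variations in speaking rate, which is what made DP matching such an improvement over fixed frame-by-frame comparison.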
The history of speech recognition research at NEC has been a series of studies and struggles for applications. The first application of the DP-100 was in parcel sorting, a job that requires hands-free data entry. Before we accepted contracts I conducted many joint studies with material-handling machine manufacturers and did frequent field experiments at end users' sorting yards. To promote sales of the system, I also visited many end users, including an airline company in the United States. The Federal Aviation Administration was one of our earliest customers for speech recognition products and tested our technology for applications in flight control training in 1979.
Other examples of the DP-series speech recognizer in use include inspection data entry for used-car auctions, meat auctions, and steel sheet inspection. The DP-3000 is also being used in a voice-controlled crane.
DP-series recognizers have been used primarily in eyes-busy or hands-busy situations. Some might say that we have spent too much time on speech recognition and that we started too soon, but I do not agree. Speech is difficult to handle, and speech-processing technological advances are closely tied to device technologies and market needs. So someone must make repeated efforts to show new practical possibilities and obtain funding for continued research.
From this perspective I do not think that we started too early or that our struggles have been useless. Through these experiences we have learned a lot. But clearly, more basic scientific research is needed because we do not yet have sufficient knowledge of speech.
As I mentioned earlier, the speech interface is for people, and I have always believed that it should be applied to personal systems or personal computers. We developed a speech recognition board for PCs in 1983 that was intended to popularize the speech interface. This board was a speaker-dependent word recognizer that had a 128-word vocabulary and was priced at $250. It saw limited sales, mainly
because the software environment was inadequate. Recently, because of increases in CPU (central processing unit) power, speech recognition is finding its way into personal computers, and a good software environment is developing.
We had a similar experience in the area of speech synthesis. In 1965 I made a computer-controlled terminal analog speech synthesizer (TASS), and it was really big. To control it you needed a computer that was 10 times bigger than the synthesizer. LSI technology enabled us to shrink the huge TASS, and in 1982 we commercialized a cartridge-type text-to-speech voice synthesizer. It was designed as a plug-in option to NEC's PCs. Initially, the synthesizer was not very popular at all. We subsequently included it as a standard feature of our PC, and as a result we successfully marketed hundreds of thousands of speech synthesizers. Now, though, even in a PC, we can easily accomplish speech synthesis with software and the CPU. To get 10 times as much linguistic processing power, we only have to add one VLSI (very-large-scale integrated) chip. In the year 2000 all of this will be on a small fraction of a single chip.
VOICE INTERFACE AND VLSI TECHNOLOGY
From now on, integrating a voice interface function in terminals will depend on progress in VLSI technology, so let us take a look at that progress. In the 2000s, with advances in device technology, gate counts in chips will increase a hundredfold, and clock frequencies will increase about tenfold. Most principal processing functions will be integrated on a single chip. Further advances in CAD (computer-aided design) technology will enable the development of large-scale application-specific ICs (ASICs). Because of these improvements, in the 2000s large-scale ASICs with 50 megagates will be easily designed in less time than it takes to design today's 10k-gate ASICs.
Roughly speaking, our speech recognition processor, which was used at the TELECOM'91 demonstration, performed 1000-word continuous speech recognition at 1 GIPS processing with 100 Megabits of memory. If we apply the previous estimate for VLSI technological advances to speech recognition processing, we will see 10,000-word continuous speech recognition with 100 GIPS CPUs and 1 Gbit memories in the 2000s, even if we assume a 10 to 1 increase in hardware for vocabulary improvement. This would correspond to three CPU chips and one memory chip. If we are clever enough to reduce increased hardware requirements by one-third, which does not appear to be an
unreasonable expectation, we will be able to produce a two-chip high-performance speech recognizer.
This kind of VLSI progress was beyond the imagination of an engineer who wanted to make a phonetic typewriter 30 years ago. But even back then we youngsters believed in the future progress of electronics, and such belief enabled us to challenge some big mountains. In 1960 we developed a phonetic typewriter that exemplifies the technological improvements I have been talking about. It was built by NEC for Kyoto University as an experimental tool, and it could recognize a hundred kinds of Japanese monosyllables. Though solid-state technology was very new, I elected to use fully transistorized circuitry in this system, and it took 5000 transistors and 3000 diodes. Roughly speaking, this corresponds to 1/250 of CPU chip space in today's technology and only 50k primitive instructions per second.
FUTURE RESEARCH AND DEVELOPMENT ISSUES
Progress in VLSI technology is going to enable us to do an enormous amount of speech processing in a chip, and as a result the contents of speech processing, that is, the methods themselves, become more important. There will be two issues in future voice technology research and development (R&D). The first is the development of speech interfaces for PCs and personal terminals. The second is basic research toward the smart interface. In both cases, R&D needs to be oriented toward applications and conducted with an awareness of them.
Because of increases in processing power, multimedia functions, including images and sound, are becoming popular in PCs and terminals. Recently, pen input has been put forward as a promising way to make the interface more friendly. Personally, I prefer to use fingers. But, in addition to pens and fingers, the importance of speech input and output needs to be recognized. Today, speech recognition performance is not perfect, but continued development of practical speech interfaces and platforms is necessary to extend areas of application and to popularize PCs and terminals.
With a speech interface, as shown in Figure 3, three processing layers need to be considered. They are the application interface, the operating system, and the device for speech processing. The implementation of speech processing will depend heavily on the processing power of the CPU.
Today, NEC manufactures the Hobbit chip developed by AT&T. This chip is suitable for utilizing a pen input function in PCs. NEC also produces the V810, NEC's original multimedia processing chip.
However, we believe that a speech-processing chip is essential to fully implement the functions of a speech interface.
The second issue for future R&D is the smart interface. Areas of importance for the smart interface include speech-to-text conversion, or dictation, and a conversational spoken language interface. Important functions include the allowance of spontaneous and incomplete utterances. Also, because it is often impossible to understand or correctly interpret what was intended by an utterance without knowledge of the situation and speaker, a dialogue function becomes an important component of the smart interface. It will also be important to utilize knowledge of applications, service, and users' characteristics in the implementation of a smart interface.
To produce a smart interface, a fusion of speech and language (i.e., spoken language processing) is necessary. So far, language-processing people have not been so interested in speech, while speech people have been interested in language processing. That has been an unrequited love. The importance of this fusion between language and speech is now being recognized. As I stated earlier, some speech phenomena, including pause, stress, and intonation, are valuable in this fusion because they are gifts from speech to language.
In 1983 I organized a new laboratory that integrated two groups: a speech group and a language group, which until then had been working independently. Our natural-language-processing research has also been working toward machine translation. In 1985 NEC commercialized a machine translation system, PIVOT, that translates between Japanese and English text. To do this, we developed an intermediate language called PIVOT Interlingua, which is suitable for
multilingual translation. In fact, we recently added translation in Korean, Spanish, and French to the system.
With this fusion between speech and language in mind, I would like to mention three future research needs. The first is to develop more precise models of human spoken language production and perception. The second is to develop sophisticated models for computer learning and recognition. Third, we need to develop a knowledge base of language and domains. Research on these topics involves many areas of science and technology: psychology, cognitive science, bioscience, linguistics, mathematics, computer science, and engineering, to name a few. Interdisciplinary collaboration among them will be extremely important, and I hope that people in the United States will help initiate this collaboration. For this collaboration there is a good common vehicle, and that is automatic interpretation.
TOWARD AUTOMATIC INTERPRETATION
When NEC first advocated the concept of C&C, we recognized that the future C&C would be large and complex. As a result, it must be intelligent enough to be reliable, self-controlling, and self-organized or autonomous, and it must offer people smart interfaces, which we called MandC&C (huMan and C&C). In 1983 we suggested automatic interpreting telephony as the ultimate goal of C&C. We think that the purpose of research into automatic interpreting telephony is not merely to realize an interpreting machine. This research involves the most important scientific and technological issues for future C&C.
At TELECOM'83 in Geneva, we demonstrated such a system. It was actually a small experimental system, intended to suggest that future R&D be directed toward automatic interpretation. You can imagine my pleasure when, right after TELECOM'83, the Ministry of Posts and Telecommunications in Japan announced the start of a national project on automatic interpretation. This led to the founding of ATR Interpreting Telephony Research Laboratories, and I offer them my congratulations on having recently developed a very sophisticated system. I am glad, as a proposer and supporter, that automatic interpretation research is now receiving worldwide attention, including work in the United States, Germany, and Japan. In 1991 NEC demonstrated its improved automatic interpretation system, INTERTALKER, at TELECOM'91 in Geneva. INTERTALKER is an integrated system that combines speaker-independent continuous speech recognition and text-to-speech conversion with the PIVOT machine translation system. The system recognizes Japanese and English speech and
translates into English, Japanese, French, and Spanish, as shown in Figure 4.
Automatic interpretation will help us realize the dream of human global communication that bridges the language gap. It is also an important goal for voice technology. Through the development of automatic interpreting telephony, we will be able to obtain and develop the technologies necessary for maintaining a C&C society.
In concluding, I want to reemphasize the importance of collaboration. It must be interdisciplinary collaboration: (1) collaboration between speech, language, and other disciplines and (2) collaboration between those involved in science and those involved in technology. It must also be an international collaboration.