Voice Communication Between Humans and Machines (1994)

Chapter: Speech Technology in the Year 2001

Suggested Citation:"Speech Technology in the Year 2001." National Academy of Sciences. 1994. Voice Communication Between Humans and Machines. Washington, DC: The National Academies Press. doi: 10.17226/2308.

Speech Technology in the Year 2001

Stephen E. Levinson and Frank Fallside

SUMMARY

This paper introduces the session "Technology in the Year 2001" and is the first of four papers dealing with the future of human-machine communication by voice. In looking to the future it is important to recognize both the difficulties of technological forecasting and the frailties of the technology as it exists today, frailties that are manifestations of our limited scientific understanding of human cognition. The technology to realize truly advanced applications does not yet exist and cannot be supported by our presently incomplete science of speech. To achieve this long-term goal, the authors advocate a fundamental research program using a cybernetic approach substantially different from more conventional synthetic approaches. In a cybernetic approach, feedback control systems will allow a machine to adapt to a linguistically rich environment using reinforcement learning.

INTRODUCTION

The title of this session is "Technology in the Year 2001." This colloquium has discussed a number of the state-of-the-art issues: the scientific bases of human-machine communication by voice; the three technologies, recognition, synthesis, and natural language understanding; and, finally, the applications of this technology.

When the blueprint for this session was fitted together this session was called "Future Technology." The organizers felt that we should think really about it in a very "blue sky" sort of way. I was alarmed by the project altogether at that stage, rushed back home, and started reading about Leonardo da Vinci, H. G. Wells, and dreamed up a few impossible applications for speech recognition. During these ruminations, I thought, there are many interesting things we could discover—how to navigate the oceans of the world safely or, possibly, information about the location of treasure ships lost by the Spanish many years ago. I am sure that squids and other marine animals could tell us a great deal about that. There is also the question of HAL or Blade Runner, Ed Newbard, and old Napoleon Solo who used to ask for channel D. However, after some discussion with the speakers today, they indicated they did not want this sort of stuff at all.

It was decided that we should talk about evolutionary technology—rather than revolutionary technology. So we are talking about what is likely to be possible in the year 2001. In passing, we might note that the ideas of some of our predictions are not all that far away. We have rough models of HAL right now; of Blade Runner, I'm less certain.

However, we have put together a very interesting program for this last session. Certainly, the three speakers are eminently suited to this. They have all made significant contributions to the state of the art in several areas. One of the things we decided to do was to change the order slightly so that Sadaoki Furui will talk first about ultimate synthesis/recognition systems to give us a flavor of his view of the systems that are likely to be available. And then our two other experts will discuss research directions—B. Atal, in the area of speech, and M. Marcus in the area of natural language.

The paragraphs above are a slightly edited version of an audio recording of Frank Fallside's introduction of this session of the colloquium. They are included here for two reasons. First, they capture rather well Frank's persona. As I read them, I can hear his enunciation of the words in his marvelous accent and diction, which ever so slightly betrayed the intended intellectual mischief. Second, of course, is the intellectual mischief itself. What Frank was saying was that predicting the future of technology is fraught with danger and is thus best approached with a bit of self-deprecating humor.

Before exploring that idea further, it is worthwhile to make a few observations about the views of the speakers in this session. There is no need to summarize the material as the papers are presented in this volume in their entirety. It is interesting, however, to note some common themes.

First, the three speakers recognize the difficulty of technological forecasting and thus do not fix any of their predictions or research programs to any specific date, not even the year 2001 of the session title. Both Atal and Furui use human performance as an important benchmark for assessing progress. The importance of this measure is discussed in Session III, Speech Recognition Technology (Levinson, in this volume). Atal lists specific problems whose present lack of a solution indicates gaps in our scientific understanding of spoken language. Included in his list are learning, adaptation, synthetic voice quality, and semantics. He suggests that some of these problems might be addressed by finding new, more faithful mathematical representations of the acoustic signal.

Furui points to other inadequacies such as poor multivoice and multilingual capabilities as indicative of a fundamental lack of understanding of speech. He suggests that combining recognition and synthesis in applications might be of help. As we shall note later, the closed recognition/synthesis loop is a very powerful tool that is central to Fallside's ambitious research program.

The presentation by Marcus is somewhat different from the two preceding it in the sense that it deals with the specific technical problem of statistical/structural models of language. Indirectly, however, he addresses two of the same problems discussed by Atal and Furui. First, his statistical approach aims at the problem of meaning because it is a syntactico-semantic theory in which the semantics derives from lexical co-occurrences in specific syntactic structures. Second, it bears on the problem of learning in the sense that these complex models must be trained on (i.e., must be learned from) large linguistic corpora.

Thus, although never explicitly stated, the thrust of all three presentations is a clear call for fundamental research to resolve some of the critical questions surrounding speech communication. As such, these papers stand in direct opposition to the sentiments expressed in the session on speech technology to the effect that there are no fundamental impediments to the application of speech technology. To some extent, Atal, and to a greater extent Furui, envision beneficial applications of a mature speech technology. But their call for fundamental research is an admission that the technology to realize these applications does not yet exist and cannot be supported by a presently incomplete science of speech.

After the three aforementioned presentations, session chairman Frank Fallside opened the session for general discussion. There was an enthusiastic response from the attendees mostly in the form of technical comments related to the subject matter of the presentations rather than their long-term implications. The chairman did not try to steer the discussion toward the more philosophical aspects of the presentations even though his opening remarks were of a decidedly philosophical tone. Nor did he choose to appropriate any of the discussion period to report on his own research program even though it is aimed squarely at solving some of the fundamental problems raised by the session's speakers. In retrospect it is a pity he did not do so, although such action on his part would have been out of character, because he died shortly after the colloquium having deliberately relinquished an opportunity to make his ideas more widely known.

However, Fallside's approach to speech communication is clearly set forth, if only in conceptual form, in his keynote lecture at the 1991 Eurospeech Conference (Fallside, 1991). The insight upon which his research program is based is that speech communication in humans is an acquired skill involving the simultaneous learning of both perception and generation. Therefore, he argues, a mechanical system should do likewise by forming a closed loop system of analysis and synthesis components and allowing it to adapt to a linguistic environment.

Fallside treats only the linguistic aspects of speech communication. In a similar spirit but quite independently, Levinson (1989) argues that the entire sensory-motor periphery is required for humans to fully develop their cognitive function. As did Fallside, Levinson suggests that this behavior can be simulated with a feedback-controlled robot that interacts with a natural environment in the presence of a cooperative teacher. This idea has been explored experimentally by Gorin et al. (1991) and Sankar and Gorin (1993).

Whether or not these two hypotheses have any value remains to be seen. They do, however, share two important features. First, they are cybernetic rather than synthetic approaches, and second, they are unconventional, highly speculative, and not presently feasible.

All present approaches to speech communication are synthetic—that is, they advocate that we should first figure out, by any means available, how spoken language works. We should then capture that process in a mathematical model and finally implement the model in a computer program. By contrast, the cybernetic approach says we should use feedback control systems to allow a machine to adapt to a linguistically rich environment using reinforcement learning. This approach requires only limited a priori understanding of the linguistic phenomena under study.
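The contrast can be made concrete with a deliberately tiny sketch. It is not the method of Fallside, Levinson, or Gorin et al.; it is a minimal epsilon-greedy learner, chosen here only for illustration, in which a machine is given no model of what words mean and acquires word-to-action associations solely from a teacher's scalar feedback. All names (`train_agent`, `respond`, `cooperative_teacher`, the three-word vocabulary) and all parameter values are hypothetical.

```python
import random


def train_agent(vocabulary, actions, teacher,
                episodes=2000, epsilon=0.1, learning_rate=0.2, seed=0):
    """Learn word -> action associations from scalar feedback only.

    No linguistic knowledge is built in: `value[word][action]` starts at
    zero and is shaped entirely by the reward the teacher returns.
    """
    rng = random.Random(seed)
    value = {w: {a: 0.0 for a in actions} for w in vocabulary}
    for _ in range(episodes):
        word = rng.choice(vocabulary)        # the environment produces an utterance
        if rng.random() < epsilon:           # occasionally explore a random response
            action = rng.choice(actions)
        else:                                # otherwise act on the current estimates
            action = max(actions, key=lambda a: value[word][a])
        reward = teacher(word, action)       # feedback closes the loop
        # Move the estimate a fraction of the way toward the observed reward.
        value[word][action] += learning_rate * (reward - value[word][action])
    return value


def respond(value, word):
    """Return the action the trained agent currently prefers for `word`."""
    return max(value[word], key=value[word].get)


# Hypothetical toy environment: the teacher knows the intended response to
# each word and rewards (+1) or penalizes (-1) the machine accordingly.
INTENDED = {"hello": "greet", "bye": "wave", "help": "assist"}


def cooperative_teacher(word, action):
    return 1.0 if INTENDED[word] == action else -1.0


if __name__ == "__main__":
    learned = train_agent(list(INTENDED), ["greet", "wave", "assist"],
                          cooperative_teacher)
    for w in INTENDED:
        print(w, "->", respond(learned, w))
```

The point of the sketch is only the division of labor: the designer supplies the feedback loop and the update rule, while the mapping itself is acquired from the environment. A real linguistic environment is, of course, vastly richer than a three-word lookup table.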

The boldness (many would say foolishness) of cybernetic organic approaches is actually appropriate to the magnitude of the task we have set for ourselves. It must be realized that the quest to build a machine with human-like linguistic abilities is tantamount to simulating the human mind. This is, of course, an age-old philosophical quest, the rationality of which has been debated by thinkers of every generation. If the problem of simulating the mind is intractable, we shall develop a speech technology that is little more than a curiosity with some limited commercial value. If, however, the problem admits of a solution, as I believe it does, the resulting technology will be of historic proportions.

Frank Fallside did not live to see his research program carried out. That program might well turn out to be an important component in the accomplishment of the ultimate goal of speech research, to build a machine that is indistinguishable from a human in its ability to communicate in natural spoken language. Frank Fallside will never see such a machine. Sadly, the same is most likely true for this colloquium's participants. However, I believe the ultimate goal can be accomplished. I only hope that our intellectual descendants who finally solve the problem do not wonder why we were so conservative in our thinking, thus leaving the breakthrough to be made by a much later generation.

REFERENCES

Fallside, F., "On the Acquisition of Speech by Machines, ASM," Proc. Eurospeech 91, Genoa, Italy, 1991.

Gorin, A. L., et al., "Adaptive Acquisition of Language," Computer Speech and Language 5(2):101-132, 1991.

Levinson, S. E., "Implication of an Early Experiment in Speech Understanding," Proceedings of the AI Symposium, pp. 36-37, Stanford, Calif., 1989.

Sankar, A., and A. L. Gorin, "Visual Focus of Attention in Adaptive Language Acquisition," Neural Networks for Speech and Vision Applications, R. Mammone, Ed., Chapman and Hall, 1993.
