Speech Technology in the Year 2001
This paper introduces the session "Technology in the Year 2001" and is the first of four papers dealing with the future of human-machine communication by voice. In looking to the future it is important to recognize both the difficulties of technological forecasting and the frailties of the technology as it exists today, frailties that are manifestations of our limited scientific understanding of human cognition. The technology to realize truly advanced applications does not yet exist and cannot be supported by our presently incomplete science of speech. To achieve this long-term goal, the authors advocate a fundamental research program using a cybernetic approach substantially different from more conventional synthetic approaches. In a cybernetic approach, feedback control systems will allow a machine to adapt to a linguistically rich environment using reinforcement learning.
The title of this session is "Technology in the Year 2001." This colloquium has discussed a number of the state-of-the-art issues: the scientific bases of human-machine communication by voice; the three technologies, recognition, synthesis, and natural language understanding; and, finally, the applications of this technology.
When the blueprint for this session was fitted together this session was called "Future Technology." The organizers felt that we should think really about it in a very "blue sky" sort of way. I was alarmed by the project altogether at that stage, rushed back home, and started reading about Leonardo da Vinci, H. G. Wells, and dreamed up a few impossible applications for speech recognition. During these ruminations, I thought, there are many interesting things we could discover: how to navigate the oceans of the world safely or, possibly, information about the location of treasure ships lost by the Spanish many years ago. I am sure that squids and other marine animals could tell us a great deal about that. There is also the question of HAL or Blade Runner, Ed Newbard, and old Napoleon Solo who used to ask for channel D. However, after some discussion with the speakers today, they indicated they did not want this sort of stuff at all.
It was decided that we should talk about evolutionary technology rather than revolutionary technology. So we are talking about what is likely to be possible in the year 2001. In passing, we might note that the ideas of some of our predictions are not all that far away. We have rough models of HAL right now; of Blade Runner, I'm less certain.
However, we have put together a very interesting program for this last session. Certainly, the three speakers are eminently suited to this. They have all made significant contributions to the state of the art in several areas. One of the things we decided to do was to change the order slightly so that Sadaoki Furui will talk first about ultimate synthesis/recognition systems to give us a flavor of his view of the systems that are likely to be available. And then our two other experts will discuss research directions: B. Atal, in the area of speech, and M. Marcus in the area of natural language.
The paragraphs above are a slightly edited version of an audio recording of Frank Fallside's introduction of this session of the colloquium. They are included here for two reasons. First, they capture rather well Frank's persona. As I read them, I can hear his enunciation of the words in his marvelous accent and diction, which ever so slightly betrayed the intended intellectual mischief. Second, of course, is the intellectual mischief itself. What Frank was saying was that predicting the future of technology is fraught with danger and is thus best approached with a bit of self-deprecating humor.
Before exploring that idea further, it is worthwhile to make a few observations about the views of the speakers in this session. There is no need to summarize the material as the papers are presented in this volume in their entirety. It is interesting, however, to note some common themes.
First, the three speakers recognize the difficulty of technological forecasting and thus do not fix any of their predictions or research programs to any specific date, not even the year 2001 of the session title. Both Atal and Furui use human performance as an important benchmark for assessing progress. The importance of this measure is discussed in Session III, Speech Recognition Technology (Levinson, in this volume). Atal lists specific problems to which the present lack of a solution is an indication of gaps in our scientific understanding of spoken language. Included in his list are learning, adaptation, synthetic voice quality, and semantics. He suggests that some of these problems might be addressed by finding new, more faithful mathematical representations of the acoustic signal.
Furui points to other inadequacies such as poor multivoice and multilingual capabilities as indicative of a fundamental lack of understanding of speech. He suggests that combining recognition and synthesis in applications might be of help. As we shall note later, the closed recognition/synthesis loop is a very powerful tool that is central to Fallside's ambitious research program.
The presentation by Marcus is somewhat different from the two preceding it in the sense that it deals with the specific technical problem of statistical/structural models of language. However, indirectly, he addresses two of the same problems discussed by Atal and Furui. First, his statistical approach aims at the problem of meaning because it is a syntactico-semantic theory in which the semantics derives from lexical cooccurrences in specific syntactic structures. It also bears on the problem of learning in the sense that these complex models must be trained on (i.e., must be learned from) large linguistic corpora.
Thus, although never explicitly stated, the thrust of all three presentations is a clear call for fundamental research to resolve some of the critical questions surrounding speech communication. As such, these papers stand in direct opposition to the sentiments expressed in the session on speech technology to the effect that there are no fundamental impediments to the application of speech technology. To some extent, Atal, and to a greater extent Furui, envision beneficial applications of a mature speech technology. But their call for fundamental research is an admission that the technology to realize these applications does not yet exist and cannot be supported by a presently incomplete science of speech.
After the three aforementioned presentations, session chairman Frank Fallside opened the session for general discussion. There was an enthusiastic response from the attendees mostly in the form of technical comments related to the subject matter of the presentations rather than their long-term implications. The chairman did not try to steer the discussion toward the more philosophical aspects of the presentations even though his opening remarks were of a decidedly philosophical tone. Nor did he choose to appropriate any of the discussion period to report on his own research program even though it is aimed squarely at solving some of the fundamental problems raised by the session's speakers. In retrospect it is a pity he did not do so, although such action on his part would have been out of character, because he died shortly after the colloquium having deliberately relinquished an opportunity to make his ideas more widely known.
However, Fallside's approach to speech communication is clearly set forth, if only in conceptual form, in his keynote lecture at the 1991 Eurospeech Conference (Fallside, 1991). The insight upon which his research program is based is that speech communication in humans is an acquired skill involving the simultaneous learning of both perception and generation. Therefore, he argues, a mechanical system should do likewise by forming a closed loop system of analysis and synthesis components and allowing it to adapt to a linguistic environment.
Fallside treats only the linguistic aspects of speech communication, whereas Levinson (1989), in a similar spirit but quite independently, argues that the entire sensory-motor periphery is required for humans to fully develop their cognitive function. As did Fallside, Levinson suggests that this behavior can be simulated with a feedback-controlled robot that interacts with a natural environment in the presence of a cooperative teacher. This idea has been explored experimentally by Gorin et al. (1991) and Sankar and Gorin (1993).
Whether or not these two hypotheses have any value remains to be seen. They do, however, share two important features. First, they are cybernetic rather than synthetic approaches, and second, they are unconventional, highly speculative, and not presently feasible.
All present approaches to speech communication are synthetic; that is, they advocate that we should first figure out, by any means available, how spoken language works. We should then capture that process in a mathematical model and finally implement the model in a computer program. By contrast, the cybernetic approach says we should use feedback control systems to allow a machine to adapt to a linguistically rich environment using reinforcement learning. This approach requires only limited a priori understanding of the linguistic phenomena under study.
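The flavor of the cybernetic idea can be conveyed by a deliberately tiny sketch. It is not a rendering of Fallside's or Levinson's proposals, which are far richer; it assumes a hypothetical three-word vocabulary, three hypothetical actions, and a cooperative teacher who returns a scalar reward. The learner is given no table of word meanings in advance. It forms its word-to-action associations only from the feedback that closes the loop, here by a simple epsilon-greedy value update chosen for illustration.

```python
import random


def train_agent(vocabulary, actions, teacher,
                episodes=2000, epsilon=0.1, rate=0.2, seed=0):
    """Learn word-to-action associations from scalar reward alone."""
    rng = random.Random(seed)
    # value[word][action] estimates the reward for answering `word` with `action`.
    value = {w: {a: 0.0 for a in actions} for w in vocabulary}
    for _ in range(episodes):
        word = rng.choice(vocabulary)        # the environment produces an utterance
        if rng.random() < epsilon:           # occasionally explore a random response
            action = rng.choice(actions)
        else:                                # otherwise use the best current estimate
            action = max(actions, key=lambda a: value[word][a])
        reward = teacher(word, action)       # the teacher's feedback closes the loop
        value[word][action] += rate * (reward - value[word][action])
    # The learned policy: the highest-valued action for each word.
    return {w: max(actions, key=lambda a: value[w][a]) for w in vocabulary}


# Hypothetical environment, invented for this illustration only.
INTENDED = {"hello": "greet", "bye": "leave", "help": "assist"}


def cooperative_teacher(word, action):
    """Reward 1.0 when the response matches what the teacher intended."""
    return 1.0 if INTENDED[word] == action else 0.0


learned = train_agent(list(INTENDED), ["greet", "leave", "assist"],
                      cooperative_teacher)
print(learned)
```

Note that the mapping in INTENDED is visible only to the teacher function; the learner sees words, its own actions, and rewards. A synthetic approach would instead begin by writing that mapping down and coding it directly.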
The boldness (many would say foolishness) of cybernetic organic approaches is actually appropriate to the magnitude of the task we have set for ourselves. It must be realized that the quest to build a machine with human-like linguistic abilities is tantamount to simulating the human mind. This is, of course, an age-old philosophical quest, the rationality of which has been debated by thinkers of every generation. If the problem of simulating the mind is intractable, we shall develop a speech technology that is little more than a curiosity with some limited commercial value. If, however, the problem admits of a solution, as I believe it does, the resulting technology will be of historic proportions.
Frank Fallside did not live to see his research program carried out. That program might well turn out to be an important component in the accomplishment of the ultimate goal of speech research, to build a machine that is indistinguishable from a human in its ability to communicate in natural spoken language. Frank Fallside will never see such a machine. Sadly, the same is most likely true for this colloquium's participants. However, I believe the ultimate goal can be accomplished. I only hope that our intellectual descendants who finally solve the problem do not wonder why we were so conservative in our thinking, thus leaving the breakthrough to be made by a much later generation.
Fallside, F., "On the Acquisition of Speech by Machines, ASM," Proc. Eurospeech 91, Genoa, Italy, 1991.
Gorin, A. L., et al., "Adaptive Acquisition of Language," Computer Speech and Language 5(2):101-132, 1991.
Levinson, S. E., "Implication of an Early Experiment in Speech Understanding," Proceedings of the AI Symposium, pp. 36-37, Stanford, Calif., 1989.
Sankar, A., and A. L. Gorin, "Visual Focus of Attention in Adaptive Language Acquisition," Neural Networks for Speech and Vision Applications, R. Mammone, Ed., Chapman and Hall, 1993.