Speech Recognition Technology: A Critique
This paper introduces the session on advanced speech recognition technology. The two papers comprising this session argue that current technology yields a performance that is only an order of magnitude in error rate away from human performance and that incremental improvements will bring us to that desired level. I argue that, to the contrary, present performance is far removed from human performance and a revolution in our thinking is required to achieve the goal. It is further asserted that to bring about the revolution more effort should be expended on basic research and less on trying to prematurely commercialize a deficient technology.
The title of this paper undoubtedly connotes different things to different people. The intention of the organizing committee of the Colloquium on Human-Machine Communication by Voice, however, was quite specific, namely to review the most advanced technology of the day as it is practiced in research laboratories. Thus, this paper fits rather neatly between one given by J. L. Flanagan, which discusses the fundamental science on which a speech recognition technology might rest, and those of J. G. Wilpon, H. Levitt, C. Seelbach, C. Weinstein, and J. Oberteuffer, which are devoted to real applications of speech recognition machines. While it is true that these applications use derivatives of some of the advanced techniques discussed here, they are not as ambitious as the purely experimental systems.
In keeping with the theme of advanced technology, J. Makhoul and R. Schwartz report on the "State of the Art in Continuous Speech Recognition." They give a phonetic and phonological description of speech and show how that structure is captured by a mathematical object called a hidden Markov model (HMM). This discussion includes a brief account of the history of the HMM and its application in speech recognition. Also included in the paper are discussions of extracting features from the speech waveform, measuring the performance of the system and the possibility of using the newer methods based on artificial neural networks.
Makhoul and Schwartz conclude that, as a result of the advances made in model accuracy, algorithms, and the power of computers, a "paradigm shift" has occurred in the sense that high-accuracy, real-time, speaker-independent, continuous speech recognition for medium-sized vocabularies can be implemented in software running on commercially available workstations. This assertion provoked an important and lively debate that I shall recount later in this paper. The HMM methodology allows us to cast the speech recognition problem as that of searching for the best path through a weighted, directed graph. The paper by F. Jelinek addresses two central and specific technical issues arising from this representation. First, how does one estimate the parameters of the model (i.e., weights of the graph) from data? This is usually referred to as the training problem. Second, given an optimal model, how does one use it in the recognition task? This second problem can be cast as a combinatorial search problem to which Jelinek outlines several solutions with emphasis on a dynamic programming approach known as the Viterbi algorithm.
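The best-path formulation that Jelinek addresses can be made concrete with a small sketch. The following is a minimal, illustrative implementation of the Viterbi algorithm on a toy two-state HMM; the state names, observation symbols, and all probabilities are invented for the example and are not drawn from either paper.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for an observation sequence."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Dynamic programming step: extend the best predecessor path
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Trace back from the best final state
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]

# Hypothetical two-"phoneme" model decoding a short feature-symbol sequence
states = ["ph1", "ph2"]
start_p = {"ph1": 0.6, "ph2": 0.4}
trans_p = {"ph1": {"ph1": 0.7, "ph2": 0.3}, "ph2": {"ph1": 0.4, "ph2": 0.6}}
emit_p = {"ph1": {"lo": 0.8, "hi": 0.2}, "ph2": {"lo": 0.3, "hi": 0.7}}
path, prob = viterbi(["lo", "lo", "hi"], states, start_p, trans_p, emit_p)
```

In the graph view of the paper, each `V[t][s]` is a node and each term `V[t-1][p] * trans_p[p][s] * emit_p[s][obs[t]]` is a weighted edge; the training problem is the separate task of estimating `trans_p` and `emit_p` from data.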
There is no need to review these papers in more detail here since they appear in their entirety in this volume. What does deserve discussion here are the scientific, technological, and commercial implications of these papers. These issues formed the core of the debate that ensued at the colloquium after these two excellent and comprehensive papers were presented.
I opened the discussion at the colloquium by asking the speakers to evaluate the state of the art of their most advanced laboratory prototype systems with respect to human performance in communication by spoken language. I raised this question because I think the ultimate goal of research in speech recognition is to provide a means whereby people can carry on spoken conversations with machines in the same effortless manner in which they speak to each other. As I
noted earlier, the purpose of this paper is to evaluate the highest expression of such research. Thus, while it may be comfortable to discuss progress in incremental terms, it is more instructive to evaluate our best efforts with respect to our ultimate goals.
My question turned out to be a provocative one on which opinion was sharply divided. Both speakers and a substantial number of participants vigorously supported the following propositions:
• The performance of today's best experimental systems is only an order of magnitude in error rate away from a level that compares favorably with human performance.
• When experimental systems do achieve human-like performance, their structure and methods will be strongly reminiscent of present systems.
• Today's advanced technology is commercially viable.
These and even more strongly optimistic sentiments have been expressed by Oberteuffer (1993).
I was supported in my diametrically opposite opinion of the first two assertions by a few members of the colloquium. The substance of our objections is the following. The current euphoria about speech recognition is based on Makhoul's characterization of our progress as a "paradigm shift." His use of the term is wholly inappropriate and misleading. The phrase was first used by Kuhn (1970) to characterize scientific revolution. Makhoul was thus casting incremental, technical progress as profound, conceptual scientific progress.
The difference is best understood by example. An important paradigm shift in astronomy was brought about by the combination of a heliocentric model of the solar system and the application of Newtonian mechanics to it. Placing the sun rather than the earth at the center of the solar system may seem like a radical idea. Although it is counterintuitive to the naive observer, it does not, by itself, constitute a paradigm shift. The revolutionary concept arises from the consideration of another aspect of the solar system besides planetary position. The Ptolemaic epicycles do predict the positions of the planets as a function of time. In fact, they do so more effectively than the crude elliptical orbits postulated by the Copernican/Newtonian theory. Indeed, by the incremental improvement of compounding epicycles upon epicycles, the incorrect theory can be made to appear more accurate than the coarse but correct one. So clearly, heliocentricity alone is not a paradigm shift.
However, if one asks what forces move the planets on these observed regular paths and how this accounts for their velocities and accelerations, the geocentric theory becomes mute while the classical
mechanical description of the heliocentric model turns eloquent. This, then, is the paradigm shift, and its consequences are enormous. Epicycles are acceptable for making ritual calendars and some navigational calculations, but Newtonian mechanics opens new vistas and, after some careful measurement, becomes highly accurate.
There is a very close analogy between early astronomy and modern speech recognition. At the present moment, we think of the problem of speech recognition as one of transcription, being able to convert the speech signal into a set of discrete symbols representing words. This decoding process corresponds to the computation of celestial position only. It ignores, however, the essence of speech, its capacity to convey important information (i.e., meaning), and is thus incomplete. The paradigm shift needed in our field is to make meaning rather than symbolic transcription the central issue in speech recognition, just as force replaced location as the central construct in celestial mechanics. If one can compute the forces acting on the planets, one can know their orbits and the positions come for free. Similarly, if one can extract the meaning from a spoken message, the lexical transcription will fall out. Some readers may object to this analogy by noting that the topic of "speech understanding" has been under study for two decades. Unfortunately, the current practice of "speech understanding" does not qualify as the needed paradigm shift because it is an inverted process that aims to use meaning to improve transcription accuracy rather than making meaning the primary aspect.
In short, the incremental improvements in phonetic modeling accuracy and search methods summarized by Makhoul and Jelinek in this session do not constitute a paradigm shift. The fact that these improved techniques can run in near real time on cheap, readily available hardware is merely a result of the huge advances in microelectronics that came about nearly independent of work in speech technology.
Furthermore, we are very far away from human performance in speech communication. Some attendees have suggested that human performance on the standard ATIS (Air Travel Information Service) task is not much, if at all, better than our best computer programs. I doubt this to be so, but, even if it were, it ignores the simple and crucial fact that the ATIS task is not natural to humans. Although the ATIS scenario was not intended to be unnatural, experimental approaches to it ended up being tailored to our existing capabilities. As such, the task is highly artificial and has only vague similarity to natural human discourse.
I believe it is highly unlikely that any incremental improvements to our existing technology will effectively address the problem of communication with machines in ordinary colloquial discourse under ordinary ambient conditions. It seems to me that fundamental science to support a human/machine spoken communication technology is missing. We will return to the question of what that science might be in the paper by Levinson and Fallside (this volume), which deals with future research and technology.
The debate outlined above is central to the continued progress of our field. The way we resolve it will have an enormous effect on the ultimate fate of speech recognition technology. Unfortunately, the debate is not about purely scientific issues but rather reflects the very delicate balance among scientific, technological, and economic factors. As the paper by L. Rabiner based on his opening address to the colloquium makes clear, the explanation of these sometimes contradictory factors is one of the principal motivations for this volume.
Here, then, is this author's admittedly minority opinion concerning the commercial future of today's laboratory-state-of-the-art speech recognition. By definition, any such viewpoint involves technological forecasting, which is one of the main themes of the papers by Levinson and Fallside, S. Furui, B. Atal, and M. Marcus. For the purposes of this discussion, however, it suffices to examine but one feature of technological forecasting. When technocrats predict the future of a new technology, they tend to be overly optimistic for the near term and overly pessimistic for the long haul.
The history of computing provides a classic example. In the early 1950s Von Neumann and his contemporaries foresaw many of the features of modern computing that are now commonplace (for example, time sharing, large memories, and faster speeds) and predicted that they would be immediately available. They also guessed that "computing would be only a tiny part of human activity" (Goldstine, 1972). In fact, the technological advances took much longer to materialize than they had envisioned. Moreover, they completely failed to imagine the enormous growth, 40 years later, of the market for what they envisioned as large computers.
I suggest that the same phenomenon will occur with speech technology. The majority opinion holds that technical improvements will soon make large-vocabulary speech recognition commercially viable for specific applications. My prediction, based on the aforementioned general characterization of technological forecasting, is that technical improvements will appear painfully slowly but that in 40 to 50 years speech recognition at human performance levels will be ubiquitous. That is, incremental technical advances will, in the near term, result in a fragile technology of relatively small commercial value in very special markets, whereas major technological advances resulting from a true paradigm shift in the underlying science will enable machines to display human levels of competence in spoken language communication. This, in turn, will result in a vast market of incalculable commercial value.
It is, of course, entirely possible that the majority opinion is correct, that a diligent effort resulting in a long sequence of rapid incremental improvements will yield the desired perfected speech recognition technology. It is, unfortunately, also possible that this strategy will run afoul of the "first-step fallacy" (Dreyfus, 1972), which warns that one cannot reach the moon by climbing a tree even though such an action initially appears to be a move in the right direction. Ultimately, progress stops far short of the goal when the top of the tree is reached.
If, as I argue, the latter possibility exists, what strategy should we use to defend against its undesirable outcome? The answer should be obvious. Openly acknowledge the risks of the incremental approach and devote some effort to achieving the paradigm shift from signal transcription to message comprehension alluded to earlier.
Perhaps more important, however, is recognition of the uniqueness of our technological goal. Unlike all other technologies that are integral parts of our daily lives because they provide us with capabilities otherwise unattainable, automatic speech recognition promises to improve the usefulness of a behavior at which we are already exquisitely proficient. Such a promise cannot be realized if the technology supporting it degrades our natural expertise in spoken communication. Since the present state of the art requires a serious diminution of our abilities and since we presently do not know how to leap the performance chasm between humans and machines, perhaps we should invest more in research aimed at finding a more nearly anthropomorphic and, by implication, potent technology. This would, of course, alter the subtle balance among science, technology, and the marketplace more toward precommercial experimentation with proportionately less opportunity for immediate profit. There is good reason to believe, however, that ultimately this strategy will afford the greatest intellectual, financial, and social rewards.
Dreyfus, H. L., What Computers Can't Do: A Critique of Artificial Reason, Harper, New York, 1972.
Goldstine, H. H., The Computer from Pascal to Von Neumann, Princeton University Press, Princeton, N.J., 1972, p. 344.
Kuhn, T. S., The Structure of Scientific Revolutions, 2nd ed., University of Chicago Press, Chicago, 1970, pp. 92 ff.
Oberteuffer, J., "Major Progress in Speech Technology Affirmed at National Academy of Sciences Colloquium," ASR News, Vol. 4, No. 2, Feb. 1993, pp. 5-7.