that produce speech with a range of quality and intelligibility. These synthesizers are in use in reading machines for the blind and in certain commercial applications that require information to be provided automatically in response to telephone requests. A comprehensive review of speech synthesis appears in Klatt (1987); information on commercial speech synthesis systems is available in the same newsletters and magazines cited above for speech recognition.
At least three critical factors contribute to the complexity of speech recognition by machine. The first is variation among speakers. A speaker-independent system must function regardless of the idiosyncratic features of any particular talker's speech, whereas a speaker-dependent system functions properly only for the voice or voices it has been trained to recognize. The second is the requirement that the system handle continuous speech input. At present, most systems can recognize only isolated words or commands separated by pauses of 100 to 250 ms; however, some systems are now becoming available that recognize limited, clearly stressed, continuous speech. The third is the number of words the system can reliably recognize. Vocabulary sizes in existing systems range from 2 to 50,000 words. Other factors that add to the difficulty of speech recognition include intrasubject variability in the production of speech, interfering background noise, and unclear pronunciation. In general, the more predictable the input speech, the better the performance. Thus, for example, a system designed to recognize discretely presented digits spoken by a single person in a sound-isolated room can be made to perform essentially perfectly.
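The isolated-word constraint can be made concrete with a small sketch. The Python fragment below is only an illustration under stated assumptions, not the method of any particular system: it assumes the input has already been reduced to one energy value per 10-ms frame, and it treats any run of low-energy frames lasting at least 100 ms (the lower end of the pause range given above) as a boundary between words. The function name `split_on_pauses`, the energy threshold, and the frame duration are all hypothetical choices made for the example.

```python
def split_on_pauses(frame_energies, frame_ms=10, energy_threshold=0.01,
                    min_pause_ms=100):
    """Split a sequence of per-frame energies into isolated-word segments.

    Returns a list of (start_frame, end_frame) pairs, end exclusive.
    A word ends only when the energy stays below the threshold for at
    least min_pause_ms; shorter dips are kept inside the word.
    All parameter values are illustrative assumptions.
    """
    min_pause_frames = max(1, min_pause_ms // frame_ms)
    segments = []
    start = None      # first frame of the current word, None while in silence
    silent_run = 0    # consecutive low-energy frames seen inside a word

    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # The pause is long enough: close the word at the first
                # silent frame of this run.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0

    if start is not None:
        # Input ended inside a word; drop any trailing silent frames.
        segments.append((start, len(frame_energies) - silent_run))
    return segments


# Two "words": a 150-ms pause separates them, while the 30-ms dip inside
# the second word is too short to count as a boundary.
energies = [0.0] * 5 + [1.0] * 20 + [0.0] * 15 + [1.0] * 30 + [0.0] * 3 + [1.0] * 10 + [0.0] * 5
print(split_on_pauses(energies))
```

In this toy input the 150-ms gap produces a boundary and the 30-ms dip does not, so two segments are returned. Continuous speech offers no such reliable pauses, which is one reason it is the harder case.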
Most of the currently successful speech recognition systems rely primarily on an information-theoretic approach, in which speech is viewed as a signal whose properties can best be discerned through statistical or stochastic analysis. Recognition systems based on this approach use a simple model to relate text to its acoustic realization; the parameters of this model are learned by the system during a training phase. In widely accepted practice, speech is represented as a set of 10 to 30 parameters extracted from the input at a fixed rate (typically once every 5 to 20 ms). In this fashion, the input speech is reduced to a stream of representative vectors or numerical indices for each parameter (Davis and Mermelstein, 1980).
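The fixed-rate parameterization described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the front end of any system discussed here. The 8-kHz sampling rate, the 20-ms analysis window, the 10-ms step, and the choice of 12 log band energies as the parameter set are all assumptions made for the example, and the function names `frame_signal` and `band_log_energies` are hypothetical. In particular, the band energies are a simple stand-in and are not the parametric representations studied by Davis and Mermelstein (1980).

```python
import math


def frame_signal(samples, sample_rate, frame_ms=20, step_ms=10):
    """Cut the signal into overlapping frames taken at a fixed rate."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, step)]


def band_log_energies(frame, n_bands=12):
    """Reduce one frame to n_bands parameters (log energy per band).

    Uses a Hamming window and a direct DFT so that only the standard
    library is needed; a real system would use an FFT.
    """
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)))
                for k, s in enumerate(frame)]
    half = n // 2
    power = []
    for b in range(half):
        re = sum(x * math.cos(2 * math.pi * b * k / n)
                 for k, x in enumerate(windowed))
        im = -sum(x * math.sin(2 * math.pi * b * k / n)
                  for k, x in enumerate(windowed))
        power.append(re * re + im * im)
    # Equal-width bands; bins left over at the top of the spectrum are ignored.
    band_size = half // n_bands
    return [math.log(sum(power[j * band_size:(j + 1) * band_size]) + 1e-10)
            for j in range(n_bands)]


# Synthetic input: 0.2 s of a 440-Hz tone sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * t / sample_rate)
          for t in range(sample_rate // 5)]

# One 12-element vector every 10 ms.
vectors = [band_log_energies(f) for f in frame_signal(signal, sample_rate)]
print(len(vectors), "vectors of", len(vectors[0]), "parameters each")
```

Each vector could be passed on as it stands, or replaced by the index of its nearest entry in a trained codebook, which would yield the stream of numerical indices mentioned in the text.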
Two classes of systems that fall into the information-theoretic category are dynamic time warping (DTW) and hidden Markov modeling