It was pointed out by Makhoul and Schwartz (this volume) that the problem of speech recognition can be formulated most effectively as follows:
Given observed acoustic data A, find the word sequence W that was the most likely cause of A.
The corresponding mathematical formula is:
$$\hat{W} = \arg\max_{W} P(A \mid W)\,P(W) \tag{1}$$
P(W) is the a priori probability that the user will wish to utter the word sequence W = w_1, w_2, ..., w_n (w_i denotes the individual words belonging to some vocabulary V). P(A | W) is the probability that if W is uttered, data A = a_1, a_2, ..., a_k will be observed (Bahl et al., 1983).
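As a minimal sketch of the decision rule in Equation 1, the Python fragment below scores a small, explicitly listed set of candidate word sequences and returns the one that maximizes log P(A | W) + log P(W). The function name `decode`, the candidate list, and all probability values are assumptions invented for illustration; they do not come from the text. A real recognizer cannot enumerate every W in this way and must search the space of word sequences instead.

```python
import math

def decode(candidates, acoustic_logprob, prior_logprob):
    """Return the candidate W that maximizes log P(A|W) + log P(W).

    Working in the log domain turns the product in Equation 1 into a sum
    and avoids numerical underflow for long utterances.
    """
    return max(candidates, key=lambda w: acoustic_logprob[w] + prior_logprob[w])

# Hypothetical candidates and made-up probabilities (illustration only).
candidates = [("recognize", "speech"), ("wreck", "a", "nice", "beach")]
acoustic_logprob = {            # log P(A | W) for one fixed observation A
    candidates[0]: math.log(0.020),
    candidates[1]: math.log(0.030),
}
prior_logprob = {               # log P(W)
    candidates[0]: math.log(0.0100),
    candidates[1]: math.log(0.0001),
}

best = decode(candidates, acoustic_logprob, prior_logprob)
print(" ".join(best))
```

In this toy case the second candidate fits the acoustic data slightly better, but its much smaller prior probability means the first candidate is selected.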
In this simplified presentation the elements a_i are assumed to be symbols from some finite alphabet A of size |A|. Methods of transforming the air pressure process (speech) into the sequence A are of fundamental interest to speech recognition but not to this paper. From my point of view, the transformation is determined and we live with its consequences.
It has been pointed out elsewhere that the probabilities P(A | W) are computed on the basis of a hidden Markov model (HMM) of speech production that, in principle, operates as follows: to each word of vocabulary V, there corresponds an HMM of speech production. A concrete example of its structure is given in Figure 1. The model of speech production of a sequence of words W is a concatenation of models of individual words w_i making up the sequence W (see Figure 2).
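The concatenation of word models can be sketched in Python as follows. This is one possible representation, not the structure shown in Figures 1 and 2: the class `WordHMM`, the convention that state 0 is initial and the last state is final, the use of a probability-1 null transition to join consecutive word models, and the example words and arc probabilities are all assumptions made for illustration. Output distributions are omitted.

```python
from dataclasses import dataclass

@dataclass
class WordHMM:
    word: str
    n_states: int   # states are 0 .. n_states-1; 0 is initial, n_states-1 is final
    arcs: list      # (source state, destination state, transition probability)

def concatenate(models):
    """Chain word models into a single model for the whole word sequence."""
    arcs = []
    offset = 0
    for i, m in enumerate(models):
        # Copy the word's own arcs, shifted into the combined state numbering.
        for src, dst, prob in m.arcs:
            arcs.append((src + offset, dst + offset, prob))
        # Assumed joining rule: a null transition taken with probability 1
        # from this word's final state to the next word's initial state.
        if i < len(models) - 1:
            arcs.append((offset + m.n_states - 1, offset + m.n_states, 1.0))
        offset += m.n_states
    return WordHMM(" ".join(m.word for m in models), offset, arcs)

# Hypothetical two-word example with invented transition probabilities.
the = WordHMM("the", 3, [(0, 1, 1.0), (1, 1, 0.4), (1, 2, 0.6)])
cat = WordHMM("cat", 2, [(0, 0, 0.3), (0, 1, 0.7)])
sentence = concatenate([the, cat])
print(sentence.word, sentence.n_states, len(sentence.arcs))
```

The combined model again has a single initial state (state 0) and a single final state (the last one), so it can be treated like a word model when computing P(A | W).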
We recall that the HMM of Figure 1 starts its operation in the initial state SI and ends it when the final state SF is reached. A transi-