Computer Speech Synthesis: Its Status and Prospects
Pages 107-115

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 107...
... One especially promising trend is the systematic optimization of large synthesis systems with respect to formal criteria of evaluation. Speech recognition has progressed rapidly in the past decade through such approaches, and it seems likely that their application in synthesis will produce similar improvements.
From page 108...
... Today's synthetic speech is good enough to support a wide range of applications, but it is still not enough like natural human speech for the truly universal usage that it ought to have. If there are also real prospects for significant improvement in synthesis quality, we should consider a redoubled research effort, especially in the United States, where the current level of research in synthesis is low compared to Europe and Japan.
From page 109...
... a message drawn from unrestricted digital text, including anything from electronic mail to on-line newspapers to patent or legal texts, novels, or cookbooks; 5. a message composed automatically from nontextual computer data structures (which we might think of as analogous to "concepts" or "meanings"); or 6.
From page 110...
... construction of messages by realistic modeling of the physiological and physical processes of human speech production, including dynamic control of articulation and models of the airflow dynamics in the vocal tract. The largest scale of commercial activity has been of types 1 and 2, which might be called stored voice.
From page 111...
... Its input was of type 6, consisting of a string of phonemic symbols with stress indications and marks for word boundaries and pauses; thus, it accomplished "speech synthesis proper," with no text analysis component. The underlying conception for this system is admirably simple: each phoneme is characterized by a single invariant acoustic target, and the observed contextually varied time functions are generated by a smoothing process.
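The target-and-smoothing idea in the excerpt above can be illustrated with a minimal sketch. It assumes one scalar parameter per phoneme (for example, a formant frequency) and uses a simple first-order low-pass filter as the smoother. The function name `smooth_targets`, the frame-based durations, and the constant `alpha` are all hypothetical and chosen for illustration; this is not the smoothing process the original system used.

```python
def smooth_targets(targets, durations, alpha=0.3):
    """Generate a parameter track from per-phoneme targets.

    targets:   one invariant target value per phoneme (e.g. a formant in Hz)
    durations: number of frames each phoneme lasts
    alpha:     smoothing constant in (0, 1]; hypothetical, not from the source
    """
    # Step function: hold each phoneme's target for its whole duration.
    steps = []
    for target, frames in zip(targets, durations):
        steps.extend([float(target)] * frames)

    # First-order low-pass filter: each frame moves a fraction alpha
    # of the way from the previous output toward the current target.
    track = []
    current = steps[0] if steps else 0.0
    for value in steps:
        current += alpha * (value - current)
        track.append(current)
    return track


if __name__ == "__main__":
    # Two phonemes with targets 500 and 1500, three frames each.
    print(smooth_targets([500, 1500], [3, 3], alpha=0.5))
```

Because the filter never jumps straight to the new target, a short phoneme may not reach its target at all, which is one way a single invariant target can yield contextually varied time functions.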
From page 112...
... simple architectures that permit program parameters to be optimized with respect to large bodies of actual speech and 2. easily calculated objective evaluation metrics that permit alternative designs to be compared quantitatively.
From page 113...
... His table of phonemic targets would certainly be amenable to corpus-based optimization; indeed, one can optimize arbitrarily large tables of acoustic targets, as long as enough data are brought to bear. However, Rabiner's method for time-function generation has some properties that would make optimization of its constants somewhat tricky and would hinder optimization of the table of phonemic targets as well.
From page 114...
... Only within the past few years have we seen a general use of systematic optimization techniques for purposes of inventory design, unit segmentation, unit selection, and unit combination algorithms. The general approach is to define a perceptually reasonable acoustic distortion metric and use it in a global comparison of alternatives (in allophonic clustering, in segmentation points, in unit selection, or whatever)
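As a rough sketch of the approach described in the excerpt above, the code below scores each alternative against a reference with a distortion metric and keeps the one with the lowest total. Plain Euclidean distance between feature vectors stands in for a perceptually motivated metric, and the names `distortion`, `total_distortion`, and `best_alternative` are hypothetical. It also assumes the frame sequences are already time-aligned and of equal length.

```python
import math


def distortion(a, b):
    """Euclidean distance between two equal-length feature vectors.

    Placeholder only: a real system would use a perceptually motivated metric.
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def total_distortion(candidate, reference):
    """Sum the frame-by-frame distortion over two aligned frame sequences."""
    return sum(distortion(c, r) for c, r in zip(candidate, reference))


def best_alternative(alternatives, reference):
    """Return the name of the alternative with the lowest total distortion."""
    return min(
        alternatives,
        key=lambda name: total_distortion(alternatives[name], reference),
    )


if __name__ == "__main__":
    reference = [[0, 0], [1, 1]]
    alternatives = {
        "a": [[0, 0], [1, 2]],
        "b": [[3, 4], [1, 1]],
    }
    print(best_alternative(alternatives, reference))
```

The same comparison loop applies whether the alternatives are candidate units, segmentation points, or cluster assignments; only the way the candidate frames are produced changes.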
From page 115...
... Will progress come by a scientific route, through better modeling of human speech production, or by an engineering route, through larger inventories of prerecorded elements with optimal automatic selection and combination methods? How far can we push current ideas about text analysis algorithms?


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.