Currently Skimming:

Linguistic Aspects of Speech Synthesis
Pages 135-156

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 135...
... In discourse contexts, several factors such as the specification of new and old information, contrast, and pronominal reference can be used to further modify the prosodic specification. When the prosodic correlates have been computed and the segmental sequence is assembled, a complete input suitable for speech synthesis has been determined.
From page 136...
... Instead, it has been necessary to first analyze the text into an underlying abstract linguistic structure that is common to both text and speech surface realizations. Once this structure is obtained, it can be used to drive the speech synthesis process in order to produce the desired output acoustic signal.
From page 137...
... Furthermore, we take as a working hypothesis the proposition that every aspect of linguistic structure manifests itself in the acoustic waveform. If this is true, the analysis part of the conversion process must provide a completely specified framework in order to ensure that the output speech waveform will be responsive to all linguistic aspects of the utterance.
From page 138...
... The oral and nasal passages serve as a time-varying filter to acoustic disturbances that are excited either by the vocal cords or frication generated at some constriction in the vocal tract. All of these constraints, acting together, drive the speech generation process, and hence the text-to-speech process must algorithmically discover the overall ensemble of constraints to produce the synthetic speech waveform.
From page 139...
... It is interesting that these tests apply to the abstract morphemic structure of the parse, rather than the surface morph covering. For example, "teething" can be parsed into both "teethe + ing" and "teeth + ing," but in the latter analysis "teeth" is already an inflected form ("tooth" + PLURAL)
From page 140...
... For specific applications, desired pronunciations of words, whether they would be analyzed by morph covering or letter-to-sound procedures, can be forced by the simple expedient of placing the entire word directly in the lexicon and hence treating it as an exception. A particularly difficult specific application is the pronunciation of surnames, as found in, say, the Manhat
From page 141...
... Thus, the PLURAL morpheme, normally expressed by the morph "s" or "es," takes on differing pronunciation based on the value of the voicing feature of the root word to which the suffix attaches. Hence, the plural of "horse" requires that a short neutral vowel be inserted between the end of the root and the phonemic realization of PLURAL, else the plural form would only lengthen the terminal /s/ in "horse." On the other hand, if the root word does not end in an s-like phoneme, pronunciation of the plural form has the place and fricative consonant features of /s/ but follows the voicing of the root.
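The plural rule described in this excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ARPAbet-style phoneme symbols, the two phoneme sets, and the function name `plural_phonemes` are hypothetical and do not come from the chapter.

```python
# Hypothetical ARPAbet-style symbols; the chapter does not specify a phoneme set.
SIBILANTS = {"S", "Z", "SH", "ZH", "CH", "JH"}   # "s-like" root endings
VOICELESS = {"P", "T", "K", "F", "TH"}           # voiceless, non-sibilant endings


def plural_phonemes(root):
    """Return the root's phonemes followed by the realization of PLURAL."""
    last = root[-1]
    if last in SIBILANTS:
        # s-like ending: insert a short neutral vowel so the suffix stays audible
        return root + ["AX", "Z"]
    if last in VOICELESS:
        # suffix copies the voicelessness of the root's final phoneme
        return root + ["S"]
    # voiced ending (vowel or voiced consonant): suffix is voiced
    return root + ["Z"]


print(plural_phonemes(["HH", "AO", "R", "S"]))  # horse
print(plural_phonemes(["K", "AE", "T"]))        # cat
print(plural_phonemes(["D", "AO", "G"]))        # dog
```

The sibilant check has to come first, because a root such as "horse" ends in a voiceless phoneme and would otherwise receive a bare /s/.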
From page 142...
... flapped consonants are produced as rapid stops in words such as "butter," and two voiceless stop consonants can assimilate, as in "Pat came home," where the "t" is assimilated into the initial consonant of the word "came." Sometimes the effect of articulatory smoothing must be resisted in the interest of intelligibility, as in the insertion of a glottal stop between the two words "he eats." Without this hiatus mechanism, the two words would run on into one, possibly producing "heats" instead of the desired sequence.
ORTHOGRAPHIC CONVENTIONS
Abbreviations and symbols must be converted to normal words in order for a text-to-speech system to properly pronounce all of the letter strings in a sample of text.
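As a rough illustration of converting abbreviations and symbols to normal words, the sketch below uses a small lookup table. The table entries, the tokenizing pattern, and the function name `expand` are assumptions made for this example; a real system would also need to spell out digits and resolve ambiguous abbreviations from context, which this sketch does not attempt.

```python
import re

# Hypothetical expansion table; real systems use much larger, context-sensitive tables.
EXPANSIONS = {"Mr.": "Mister", "etc.": "et cetera", "%": "percent", "&": "and"}

# A token is a word with a trailing period, a plain word, a digit run, or one symbol.
TOKEN = re.compile(r"[A-Za-z]+\.|[A-Za-z]+|\d+|\S")


def expand(text):
    """Replace known abbreviations and symbols with ordinary words."""
    tokens = TOKEN.findall(text)
    # Unknown tokens (including digit runs) pass through unchanged.
    return " ".join(EXPANSIONS.get(tok, tok) for tok in tokens)


print(expand("Mr. Smith & sons own 50%"))
```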
From page 143...
... Many words vary with their functioning part of speech, such as "wind, read, use, invalid, and survey." Thus, among these, "use" can be a noun or verb and changes its pronunciation accordingly, and "invalid" can be either a noun or an adjective, where the location of main stress indicates the part of speech. At the single-word level, suffixes have considerable constraining power to predict part of speech, so that "dom" produces nouns, as in "kingdom," and "ness" produces nouns, as in "kindness." But in English, a final "s," functioning as an affix, can form a plural noun or a third-person present-tense singular verb, and every common noun can be used as a verb.
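The constraining power of suffixes can be sketched as a simple longest-suffix lookup. The tag names, the table, and the function `guess_pos` are hypothetical; only the three suffixes themselves ("dom," "ness," and final "s") are taken from the excerpt, and a heuristic this small will mislabel words such as "random."

```python
# Longer suffixes are listed first so that "ness" is tried before the bare "s".
SUFFIX_POS = [
    ("ness", frozenset({"NOUN"})),
    ("dom", frozenset({"NOUN"})),
    ("s", frozenset({"NOUN", "VERB"})),  # plural noun or third-person singular verb
]


def guess_pos(word):
    """Return the set of parts of speech suggested by the word's suffix."""
    for suffix, tags in SUFFIX_POS:
        # Require some stem before the suffix so the suffix alone never matches.
        if word.endswith(suffix) and len(word) > len(suffix):
            return tags
    return frozenset()  # no suffix evidence


print(guess_pos("kindness"))
print(guess_pos("kingdom"))
print(guess_pos("surveys"))
```

For "surveys" the lookup returns both tags, which reflects the ambiguity of final "s" noted in the excerpt: the suffix alone cannot decide, so the surrounding syntax has to.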
From page 144...
... Nevertheless, despite this possibility, phrase-level parsing must for the present provide the needed structural basis given the lack of such higher-level constraints when the system input consists of text alone. When the input is obtained from a message-producing system, where semantic and pragmatic considerations guide the message composition process, alternative prosodic structures may be available for the determination of prosodic correlates (Young and Fallside, 1979).
From page 145...
... , so there is little motivation to extend the scope of syntactic analysis to the clause level.
PROSODIC MARKING
Once a syntactic analysis is determined, it remains to mark the text for prosodic features.
From page 146...
... These studies emphasized that performance structures have relatively small basic units, a natural hierarchy, and that the resulting overall structure was more balanced than that provided by syntactic constituent analysis. The "performance" aspect of these analyses utilized subjective appraisal of junctural breaks and discovered that word length and the syntactic label of structural nodes played an important role.
From page 147...
... For predicting prosodic phrase breaks from text, a dynamic programming algorithm is provided for finding the maximum probability prosodic parse. These recent studies are very encouraging, as they provide promising techniques for obtaining the prosodic phrasing of sentences based on input text.
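The excerpt does not give the model behind the dynamic programming algorithm, so the following is only a generic sketch of the idea: a two-state Viterbi search that labels each word juncture as "break" or "no break." The probability tables, the two-state formulation, and the function name `best_breaks` are all assumptions for illustration and should not be read as the cited studies' method.

```python
import math


def best_breaks(emit, trans):
    """Most probable break sequence over word junctures.

    emit[i][s]  : hypothetical probability of state s at juncture i
                  (s = 0 means no break, s = 1 means break)
    trans[p][s] : hypothetical probability of state s following state p
    All probabilities must be greater than zero.
    """
    if not emit:
        return []
    score = [math.log(emit[0][0]), math.log(emit[0][1])]
    back = []  # back[i-1][s] = best predecessor of state s at juncture i
    for i in range(1, len(emit)):
        new_score, ptr = [], []
        for s in (0, 1):
            cands = [score[p] + math.log(trans[p][s]) for p in (0, 1)]
            p_best = 0 if cands[0] >= cands[1] else 1
            new_score.append(cands[p_best] + math.log(emit[i][s]))
            ptr.append(p_best)
        score = new_score
        back.append(ptr)
    # Trace the best final state back to the first juncture.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


emit = [(0.9, 0.1), (0.3, 0.7), (0.45, 0.55)]
trans = [(0.7, 0.3), (0.9, 0.1)]  # a break is unlikely right after another break
print(best_breaks(emit, trans))
```

With these made-up numbers, choosing the likelier label at each juncture independently would place breaks at the last two junctures, while the search over whole sequences keeps only the middle one because adjacent breaks are penalized.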
From page 148...
... ) , where the latter parse would incorrectly imply main stress on "Street." But in a corpus derived from the Associated Press Newswire for 1988, "Wall Street" occurs 636 times outside the context of "Wall Street Journal," whereas "Street Journal" occurs only five times outside this context, and hence the mutual information measure will favor the first (correct)
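To make the bracketing decision for a three-word nominal concrete, the sketch below scores each adjacent word pair with pointwise mutual information and groups the stronger pair first. The formula is the standard pointwise definition and may differ from the exact measure used in the chapter; the function names and the counts in the usage lines are hypothetical and are not corpus figures.

```python
import math


def pmi(pair_count, x_count, y_count, total):
    """Pointwise mutual information (in bits) of a word pair from raw counts."""
    return math.log2(pair_count * total / (x_count * y_count))


def bracket(w1, w2, w3, left_score, right_score):
    """Group the more strongly associated adjacent pair first."""
    if left_score >= right_score:
        return ((w1, w2), w3)
    return (w1, (w2, w3))


# Toy counts chosen only to show the arithmetic.
left = pmi(pair_count=8, x_count=16, y_count=32, total=1024)
right = pmi(pair_count=1, x_count=32, y_count=16, total=1024)
print(left, right)
print(bracket("Wall", "Street", "Journal", left, right))
```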
From page 149...
... Once this structure is available, prosodic correlates must be specified. These include durations and the overall timing framework, and the fundamental frequency contour reflecting the overall intonation contour and local pitch accents to mark stress.
From page 150...
... must be strongly marked prosodically in order for the intended reading to be perceived. Discourse-level context may also facilitate the prosodic marking of complex nominals, designate prepositional phrase attachment, and disambiguate conjoined sentences.
From page 151...
... MULTILINGUAL SYNTHESIS
Several groups have developed integrated rule frameworks and languages for their design and manipulation (Carlson and Granstrom, 1986; Hertz, 1990; Van Leeuwen and te Lindert, 1993). Using these structures, a flexible formalism is available for expressing rules and for utilizing these rules to "fill in" the coordinated comprehensive linguistic description of an utterance.
From page 152...
... Where good models are available, such as for morphemic structure and lexical stress, the results are exceedingly robust. Linguistic descriptions of discourse are much needed, and a more detailed and principled prosodic theory that could guide both analysis and synthesis algorithms would be exceedingly useful.
From page 153...
... Many facts of speech production are best represented at the articulatory level, and a rule system focused on articulatory gestures is likely to be simpler than the current rule systems based on acoustic phonetics. Unfortunately, the acquisition of articulatory information is exceedingly difficult, since it involves careful observation of the entire set of speech articulators, many of which are either hidden from normal view or are difficult to observe without perturbing the normal speech production process.
From page 154...
... , "Text Analysis and Word Pronunciation in Text-to-Speech Synthesis," Chapter 24 in Advances in Speech Signal Processing, S Furui and M
From page 155...
... (1990), "Stress Assignment in Complex Nominals for English Text-to-Speech," in Proceedings of the ESCA Workshop on Speech Synthesis, pp.


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.