
Currently Skimming:

Toward the Ultimate Synthesis/Recognition System
Pages 450-466

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 450...
... The problems for speech recognition include robust recognition against speech variations, adaptation/normalization to variations due to environmental conditions and speakers, automatic knowledge acquisition for acoustic and linguistic modeling, spontaneous speech recognition, naturalness and ease of human-machine interaction, and recognition of emotion. The problems for speaker recognition are similar to those for speech recognition.
From page 451...
... Although the speech recognizer, which converts spoken input into text, and the language analyzer, which extracts meaning from text, are separated into two boxes in the figure, it is desirable that they operate in tight mutual connection, since the recognizer must use semantic information efficiently to obtain correct text. How to combine these two functions is a most important issue, especially in conversational speech recognition (understanding)
From page 454...
... From the human interface point of view, future computerized systems should be able to automatically acquire new knowledge about the thinking process of individual users, automatically correct user errors, and understand the intention of users by accepting rough instructions and inferring details. A hierarchical interface that initially uses figures and images (including icons)
From page 455...
... FUTURE SPEECH SYNTHESIZERS
Future speech synthesizers should have the following features:
· Highly intelligible (even under noisy and reverberant conditions and when transmitted over telephone networks)
· Natural voice sound
· Prosody control based on meaning
· Capable of controlling synthesized voice quality and choosing individual speaking style (voice conversion from one person's voice to another, etc.)
From page 456...
... FUTURE SPEECH RECOGNIZERS
Future speech recognition technology should have the following features:
· Few restrictions on tasks, vocabulary, speakers, speaking styles, environmental noise, microphones, and telephones
From page 457...
... Automatic knowledge acquisition is very important in achieving systems that can automatically follow variations in tasks, including the topics of a conversation. Not only the linguistic structures but also the acoustic characteristics of speech vary according to the task.
From page 458...
... · Few restrictions on text, speaking style, environmental noise, microphones, and telephones
· Robustness against speech variations
· Adaptation and normalization to variations due to environmental conditions and speakers
· Automatic acquisition of speaker-specific characteristics
· Naturalness and ease of human-machine interaction
· Incentive for customers to use the systems
· Low-cost creation of new revenues for suppliers
· Cooperation on standards and regulation
· Quick prototyping and development
One of the most serious problems arises from variability in a
From page 459...
... TOWARD ROBUST SPEECH/SPEAKER RECOGNITION UNDER ADVERSE CONDITIONS As described in the previous sections, robustness against speech variations is one of the most important issues in speech/speaker recognition (Furui, 1992b; Juang, 1991; Makhoul and Schwartz, in this volume; Weinstein, in this volume)
From page 460...
... When recognizing spontaneous speech in dialogues, it is necessary to deal with variations that are not encountered when recognizing speech that is read from texts. These variations include extraneous words, out-of-vocabulary words, ungrammatical sentences, botched utterances, restarts, repetitions, and style shifts.
From page 461...
... Although it is not always necessary or efficient for speech synthesis/recognition systems to directly imitate the human speech production and perception mechanisms, it will become more important in the near future to build mathematical models based on these mechanisms to improve performance (Atal, in this volume; Carlson, in this volume; Furui, 1989).
From page 462...
... Psychological and physiological research into human speech perception mechanisms shows that the human hearing organs are highly sensitive to changes in sounds, that is, to transitional (dynamic) sounds, and that the transitional features of the speech spectrum and the speech wave play crucially important roles in phoneme perception (Furui, 1986).
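As an illustration of what a "transitional (dynamic)" feature can look like in practice, the sketch below computes regression-based delta features: for each frame, the slope of every feature coefficient over a short window of neighboring frames. This is a common formulation and is offered only as an example; the excerpt does not specify a formula, and the function name `delta_features`, the `half_window` parameter, and the edge-padding rule are assumptions made here.

```python
def delta_features(frames, half_window=2):
    """Estimate the local time-slope of each feature coefficient.

    frames: list of per-frame feature vectors (lists of floats),
            e.g. cepstral coefficients; the layout is an assumption.
    half_window: number of frames used on each side (hypothetical default).
    Frames beyond the ends of the utterance are replaced by the nearest
    valid frame (edge padding, an assumption of this sketch).
    """
    if not frames:
        return []
    num_frames = len(frames)
    dim = len(frames[0])
    # Least-squares slope denominator: sum of k^2 for k = -K..K.
    denom = 2 * sum(k * k for k in range(1, half_window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, half_window + 1):
                ahead = frames[min(t + k, num_frames - 1)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


if __name__ == "__main__":
    # A coefficient that rises by one unit per frame.
    ramp = [[float(t)] for t in range(6)]
    print(delta_features(ramp))
```

Away from the edges, a coefficient that changes linearly yields a constant delta equal to its per-frame slope, and a stationary coefficient yields zero, which is why such features emphasize spectral transitions over steady-state sounds.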
From page 463...
... EVALUATION METHODS It is important to establish methods for measuring the quality of speech synthesis/recognition systems. Objective evaluation methods that ensure quantitative comparison of a broad range of techniques are essential to technological development in the speech-processing field.
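The excerpt does not name a particular metric. As one example of an objective, quantitative measure for recognizers, the sketch below computes word error rate: the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the reference length. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Both arguments are plain strings; splitting on whitespace is a
    simplifying assumption (no case folding or punctuation handling).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the score depends only on the reference and the output text, it allows different recognition techniques to be compared on the same test material.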
From page 464...
... The problems include automatic knowledge acquisition, speaking style control in synthesis, synthesis from concepts, robust speech/speaker recognition, adaptation/normalization, language processing, use of articulatory and perceptual constraints, and evaluation methods. One important issue that is not included in this paper is language identification.
From page 465...
... Furui, "Concatenated phoneme models for text-variable speaker recognition," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, pp.
From page 466...
... K Soong, "Recent research in automatic speaker recognition," Advances in Speech Signal Processing, ed.


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.