Speech, Physiology, and Other Interface Components
SPEECH RECOGNITION AND SYNTHESIS
Since speech is the most natural form of human intraspecies communication in the real world, it is important to examine the progress and problems associated with research and technology for speech recognition and synthesis by computers for use in communicating with humans in a synthetic environment (SE). Machine recognition and understanding of oral speech has been and continues to be a particularly difficult problem because of the enormous flexibility and variability in speech at both the intersubject and intrasubject levels. Moreover, basic grammar rules are often violated in oral communication. Although humans clearly are able to overcome the problems involved (e.g., by using contextual cues to assist in the interpretation of meaning), it is extremely difficult to duplicate these capabilities with computer software. General background on speech communication, both human and automatic, is available in O'Shaughnessy (1987); information on commercially available automatic speech recognition systems can be found in current newsletters and magazines such as ASR News, Voice News, Voice Processing, and Speech Technology.
Computer generation of speech also suffers from problems that remain to be solved. In particular, currently available speech synthesis technology does not provide speech that sounds natural or that can be easily matched to the characteristics of an individual speaker. Nevertheless, there are several types of speech synthesis systems available commercially
that produce speech with a range of quality and intelligibility. These synthesizers are in use in reading machines for the blind and in certain commercial applications that require information to be provided automatically in response to telephone requests. A comprehensive review of speech synthesis appears in Klatt (1987); information on commercial speech synthesis systems is available in the same newsletters and magazines cited above for speech recognition.
There are at least three critical factors contributing to the complexity of speech recognition by machine. The first relates to variation among speakers. For a system to be speaker independent, it must be able to function independently of all the idiosyncratic features associated with a particular talker's speech. In speaker-dependent systems, the computer system functions properly only for the voice or voices it has been trained to recognize. The second critical problem relates to the requirement that the system be able to handle continuous speech input. At the present time, most systems are capable of recognizing only isolated words or commands separated by pauses of 100-250 ms; however, some systems are now becoming available that recognize limited, clearly stressed, continuous speech. The third important factor is the number of words the system is capable of reliably recognizing. Vocabulary sizes in existing systems range from 2 to 50,000 words. Further important factors contributing to the difficulty of speech recognition include intrasubject variability in the production of speech and the presence of interfering background noise and unclear pronunciation. In general, the more predictable the input speech, the better the performance. Thus, for example, a system designed to recognize discretely presented digits spoken by a single person in a sound-isolated room can be made to perform essentially perfectly.
Most of the current successful speech recognition systems rely primarily on an information-theoretic approach in which speech is viewed as a signal with properties that can best be discerned through statistical or stochastic analysis. Recognition systems based on this approach use a simple model to relate text to its acoustic realization. The parameters of this model are then learned by the system during a training phase. Widely accepted practice represents speech as a set of 10 to 30 parameters extracted from the input at a fixed rate (typically every 5 to 20 ms). In this fashion, the input speech is reduced to a stream of representative vectors or numerical indices for each parameter (Davis and Mermelstein, 1980).
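The reduction of speech to a stream of parameter vectors can be sketched as follows. This is an illustrative toy front end, not the cepstral analysis of Davis and Mermelstein (1980); the frame length, step size, and coefficient count are assumptions chosen to match the ranges quoted above.

```python
import numpy as np

def frame_features(signal, rate=16000, frame_ms=25, step_ms=10, n_coeffs=12):
    """Reduce a waveform to a stream of short feature vectors,
    one every step_ms milliseconds (illustrative, not a full front end)."""
    frame_len = int(rate * frame_ms / 1000)
    step = int(rate * step_ms / 1000)
    features = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame)) + 1e-10
        # Real cepstrum: inverse FFT of the log magnitude spectrum
        cepstrum = np.fft.irfft(np.log(spectrum))
        features.append(cepstrum[:n_coeffs])
    return np.array(features)

# One second of a synthetic vowel-like tone
t = np.arange(16000) / 16000.0
feats = frame_features(np.sin(2 * np.pi * 200 * t))
print(feats.shape)  # roughly 100 vectors of 12 coefficients each
```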
Two classes of systems that fall into the information-theoretic category are dynamic time warping (DTW) and hidden Markov modeling
(HMM). In its simplest form, DTW compares the input speech to sets of prestored templates or exemplars corresponding to the prerecorded utterance of each vocabulary item. To add robustness, several templates per word may be recorded by several speakers representing a variety of dialects, speaking styles, and sentence positions. Dynamic programming algorithms are used to evaluate the best match between the input and the templates. DTW refers to the need to time compress (or expand) the input word in order to make it comparable in duration to the stored exemplar. Further discussion of this topic can be found in Dixon and Martin (1979) and Rabiner (1983).
In DTW, both storage and processing times increase, at least linearly, with vocabulary expansion. The system must be trained on every new word in the vocabulary. Furthermore, for speaker-independent recognition, an inordinate number of templates describing the possible variations need to be recorded and stored. As a result, systems with more than a few hundred words are not practical. Because of these limitations, accuracy considerations, and the need to train the system on every word, DTW has been replaced by the HMM approach for the development of high-performance systems.
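The core of the DTW approach can be sketched in a few lines: dynamic programming finds the lowest-cost alignment between the input and each stored template, allowing local time compression or expansion. The one-dimensional "features" and two-word vocabulary below are invented for illustration.

```python
import numpy as np

def dtw_distance(input_seq, template):
    """Minimum cumulative distance between two feature-vector sequences,
    allowing the input to be locally stretched or compressed in time."""
    n, m = len(input_seq), len(template)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(input_seq[i - 1]) -
                                  np.asarray(template[j - 1]))
            # Best of: repeat a template frame, repeat an input frame, or advance both
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(input_seq, templates):
    """Pick the vocabulary item whose template matches the input best."""
    return min(templates, key=lambda word: dtw_distance(input_seq, templates[word]))

# A slow rise matches 'up' despite the duration mismatch with the template
templates = {"up": [[0], [1], [2], [3]], "down": [[3], [2], [1], [0]]}
word = recognize([[0], [0], [1], [1], [2], [3], [3]], templates)
print(word)  # up
```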
HMMs represent speech units (words, syllables, phones, etc.) as stochastic machines consisting of several (3 to 20) states. With each state is associated a probability distribution describing the probability of that state's emitting a given observation vector. In addition, there is a set of probabilities of moving from one state to another. Control is transferred between states according to these probabilities each time a new vector is generated (i.e., at the frame rate of the system).
For recognition, a solution to the reverse problem is desired. Given a set of pretrained models and a sequence of observation vectors, the model most likely to have generated the observation must be determined. The utterance associated with this model is then the recognizer's output. The parameters of the model, the probabilities, are determined from labeled training speech data, presumably containing realizations of all the modeled utterances. Efficient algorithms exist for both the training and recognition tasks. The structure of the model (number of states, speech unit, etc.) is chosen judiciously and depends on the language as well as other speech knowledge.
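The recognition step can be sketched with a toy discrete-observation HMM: score the observation sequence against each pretrained model and report the model most likely to have generated it. Real systems use continuous observation densities and many more states; the two-word vocabulary and all the probabilities here are invented for illustration.

```python
import math

def viterbi_log_likelihood(obs, model):
    """Log probability of the best state path through one HMM
    for a sequence of discretized observation symbols."""
    pi, A, B = model["pi"], model["A"], model["B"]  # initial, transition, emission probs
    n_states = len(pi)
    logs = [math.log(pi[s]) + math.log(B[s][obs[0]]) for s in range(n_states)]
    for o in obs[1:]:
        logs = [max(logs[p] + math.log(A[p][s]) for p in range(n_states))
                + math.log(B[s][o]) for s in range(n_states)]
    return max(logs)

def recognize(obs, models):
    """Return the word whose model most likely generated the observations."""
    return max(models, key=lambda w: viterbi_log_likelihood(obs, models[w]))

# Two toy 2-state, left-to-right word models over observation symbols 0 and 1
models = {
    "yes": {"pi": [0.9, 0.1],
            "A": [[0.6, 0.4], [0.1, 0.9]],
            "B": [[0.8, 0.2], [0.2, 0.8]]},   # tends to emit 0 early, 1 late
    "no":  {"pi": [0.9, 0.1],
            "A": [[0.6, 0.4], [0.1, 0.9]],
            "B": [[0.2, 0.8], [0.8, 0.2]]},   # tends to emit 1 early, 0 late
}
answer = recognize([0, 0, 1, 1, 1], models)
print(answer)  # yes
```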
When large-vocabulary recognition is involved, the usual approach is to model speech as a sequence of very basic units such as phones. Co-articulatory phonetic effects are taken into account by having for each phone a different model for every context in which that phone exists (the context is specified by the neighboring phones). As a result, the number of models that must be trained, stored, and evaluated grows as the number of phonetic contexts increases. By taking context into account, the
recognition error rate can be cut in half (Chow et al., 1986). In phonetically based HMM systems, it is possible to recognize new words by simply including their phonetic pronunciation in the dictionary; there is no need to train the system specifically on the new words (although the accuracy on those words would be higher if the system were trained on them as well).
Stochastic systems make use of grammar rules that constrain the sequence of speech units that can occur. Grammars are usually either finite-state or statistical. Finite-state grammars describe explicitly the allowable word sequences (a word sequence is either allowed or it is not); statistical grammars allow all word sequences but with different probabilities. A rough measure of a grammar's restrictiveness is its so-called perplexity (Bahl et al., 1983). The lower this number, the more predictable the word sequence. It is easy to see how grammar restrictions can diminish the search that must be performed to determine the identity and order of words in an utterance. Not surprisingly, speech recognition for tasks with low perplexity can be performed much faster and more reliably than tasks with high perplexity.
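Perplexity can be computed as the geometric mean of the inverse conditional word probabilities along a sequence. The sketch below assumes a hypothetical bigram grammar in which every word predicts its successor with probability 1/2, which by construction has perplexity 2.

```python
import math

def perplexity(words, bigram_prob):
    """Perplexity of a word sequence under a statistical (bigram) grammar:
    the geometric mean of the inverse conditional probabilities."""
    log_sum = 0.0
    for prev, word in zip(words, words[1:]):
        log_sum += math.log2(bigram_prob(prev, word))
    n = len(words) - 1
    return 2 ** (-log_sum / n)

# Hypothetical, highly constrained grammar: each word has two equally
# likely successors, so the effective branching factor is 2
p = perplexity(["show", "all", "flights"], lambda prev, word: 0.5)
print(p)  # 2.0
```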
For some applications, in addition to recognizing the sequence of words spoken, it is important to understand what has been said and give an appropriate response. For this purpose, the output of the recognizer is sent to a language understanding component, which analyzes and interprets the recognized word sequence. To allow for possible errors in recognition, the recognition component sends to the language understanding component not only the top-scoring word sequence, but also the N top-scoring word sequences (where N is typically 10-20). The language understanding component then chooses the word sequence that makes most sense in that context (Schwartz and Austin, 1991).
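The N-best rescoring step might be sketched as follows; the additive score combination and the toy "understanding" function are assumptions for illustration, not the specific method of Schwartz and Austin (1991).

```python
def pick_best(nbest, understanding_score, weight=1.0):
    """Rescore an N-best list: combine each hypothesis's recognition log
    score with a language-understanding plausibility score."""
    return max(nbest, key=lambda h: h["log_score"] +
               weight * understanding_score(h["words"]))

# Hypothetical N-best list; the second hypothesis scores slightly better
# acoustically but makes no sense to the understanding component
nbest = [
    {"words": "show me flights to boston", "log_score": -12.1},
    {"words": "show me flights two boston", "log_score": -11.9},
]
interpretable = lambda words: -5.0 if "two boston" in words else 0.0
best = pick_best(nbest, interpretable)
print(best["words"])  # show me flights to boston
```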
State of the Art in Speech Recognition
More and more, speech recognition technology is making its way from the laboratory to real-world applications. Recently, a qualitative change in the state of the art has emerged that promises to bring speech recognition capabilities within the reach of anyone with access to a workstation. High-accuracy, real-time, speaker-independent, continuous speech recognition for medium-sized vocabularies (a few thousand words) is now possible in software on off-the-shelf workstations. Users will be able to tailor recognition capabilities to their own applications. Such software-based, real-time solutions usher in a whole new era in the development and utility of speech recognition technology.

As is often the case in technology, a paradigm shift occurs when several developments converge to make a new capability possible. In the
case of continuous speech recognition, the following advances have converged to make the new technology possible:
Higher-accuracy, continuous speech recognition, based on hidden Markov modeling techniques,
Better recognition search strategies that reduce the time needed for high-accuracy recognition, and
Increased power of off-the-shelf workstations.
The paradigm shift is taking place in the way we view and use speech recognition. Rather than being mostly a laboratory endeavor, speech recognition is fast becoming a technology that is pervasive and will have a profound influence on the way humans communicate with machines and with each other. For a recent survey of the state of the art in continuous speech recognition, see Makhoul and Schwartz (1994).
Using HMMs, the word error rate for continuous speech recognition has been dropping steadily over the last decade, with a factor of two drop in error rate about every two years. Research systems are now able to tackle problems with large vocabularies. For example, in a test using the ARPA Wall Street Journal continuous speech recognition corpus, word error rates of 11 percent have been achieved for speaker-independent performance on read speech (Pallett et al., 1994). Although this performance level may not be sufficient for a practical system today, continuing improvements in performance are likely to make such systems of practical use in a few years.
Because of the availability of large amounts of training speech data from large numbers of speakers (hundreds of hours of speech), speaker-independent performance has reached such levels that it rarely makes sense to train systems on the speech of specific speakers. However, there will always be outlier speakers for whom, for one reason or another, the system does not perform well. For such speakers, it is possible to collect a relatively small amount of speech (on the order of minutes of speech) and then adapt the system's models to the outlier speaker to improve performance significantly for that speaker.
For information retrieval applications, it is important to understand the user's query and give an appropriate response. Speech understanding systems have reached the stage at which it is possible to develop a practical system for specialized applications. The understanding component must be tuned to the specific application; the work requires significant amounts of data collection from potential users and months of labor-intensive work to develop the language understanding component for that application. In the ARPA Airline Travel Information Service (ATIS) domain, users access flight information using verbal queries. Speech understanding systems in the ATIS domain have achieved understanding
error rates of less than 10 percent in speaker-independent mode (Pallett et al., 1994). In these tests, users speak in a normal spontaneous fashion.
Until recently, it was thought that high-accuracy, real-time continuous speech recognition for large vocabularies would require either special-purpose very large-scale integration (VLSI) hardware or a multiprocessor. However, new developments in search algorithms have sped up the recognition computation by at least two orders of magnitude, with little or no loss in recognition accuracy (Schwartz and Austin, 1991). In addition, computing advances have achieved a two-orders-of-magnitude increase in workstation speeds in the last decade. These two advances have made software-based, real-time, continuous speech recognition a reality. The only requirement is that the workstation have an analog-to-digital converter to digitize the speech. All the signal processing, feature extraction, and recognition search is then performed in software in real time on a single-processor workstation.
For example, it is now possible to perform a 3000-word ATIS recognition task in real time on such workstations as the Silicon Graphics Indigo (SGI) R3000 or the Sun SparcStation 2. Most recently, a 40,000-word continuous dictation task was demonstrated in real time on a Hewlett-Packard 735 workstation, which has about three times the power of an SGI R3000. Thus, the computation grows much more slowly than linearly with the size of the vocabulary (Nguyen et al., 1993).
The real-time feats just described have been achieved at a relatively small cost in word accuracy. Typically, the word error rates are less than twice those of the best research systems.
The most advanced of these real-time demonstrations have not as yet made their way to the marketplace. However, it is possible today to purchase products that perform speaker-independent, continuous speech recognition for vocabularies of a few thousand words. These systems are being used in command and control applications, the routing of telephone calls by speaking the full name of the party being called, the training of air traffic controllers, and the control of workstation applications. Of particular significance to the public will be the development of transactional applications over the telephone, such as home shopping and airline reservations. Practical large-vocabulary, continuous-speech, speaker-independent dictation systems are a few years away.
HMMs have proven to be very good for modeling variability in time and feature space and have resulted in tremendous advances in continuous speech recognition. However, some of the assumptions made by HMMs are known not to be strictly true for speech—for example, the conditional independence assumptions in which the probability of being in a state is dependent only on the previous state, and the output probability at a state is dependent only on that state and not on previous
states. There have been attempts to ameliorate the effects of these assumptions by developing alternative speech models, such as stochastic segmental models and neural networks. In all these attempts, however, significant computational limitations have hindered the full exploitation of these methods; in fact, the performance of such alternative models has barely approached that of HMM systems. However, when such models are used in conjunction with HMM systems, the resulting hybrids have achieved word error rate reductions of 10-20 percent (Makhoul and Schwartz, 1994).
Despite all these advances, much remains to be done. Speech recognition performance for very large vocabularies and larger perplexities is not yet adequate for useful applications, even under benign acoustic conditions. Any degradation in the environment or changes between training and test conditions causes a degradation in performance. Therefore, work must continue to improve robustness to varying conditions: new speakers, new dialects, different channels (microphones, telephone), noisy environments, and new domains and vocabularies. What will be especially needed are improved acoustic models and methods for fast adaptation to new conditions. Many of these research areas will require more powerful computing resources. Fortunately, workstation speed and memory will continue to grow in the years to come. The resulting more powerful computing environment will facilitate the exploration of more ambitious modeling techniques and will, no doubt, result in additional significant advances in the state of the art.
In comparison with speech recognition, the field of language understanding, which is a much harder problem, is still in its infancy. One major obstacle for advancement is the lack of a representation of semantics that is general and powerful enough to cover major applications of interest. And even if such a representation were available, there is still a strong need to develop automatic methods for interpreting word sequences, without having to rely on the currently dominant methods of labor-intensive crafting of detailed linguistic rules.
A speech synthesizer is a device that accepts at its input the text of an utterance in orthographic or computer-readable form and transforms that text into spoken form. The synthesizer performs much the same function as a human who reads a printed text aloud. The synthesizer usually contains a component that performs an initial transformation from a written text into a sequence of phonetic units (e.g., transforming caught to /kot/), and these phonetic units then control the production of sound in the synthesizer.
The synthesizer can generate the simulated speech in one of two ways. One method, called concatenative synthesis, is to create utterances by stringing together segments of speech that have been excerpted from the utterances of a speaker. These segments consist of pieces of syllables, usually a sequence of consonant and vowel or a vowel and a consonant. For example, a word like mat would be synthesized by concatenating two pieces ma + at. Special procedures are used to ensure that the joining of the pieces is done smoothly with minimal artifacts. Rules are used to provide suitable adjustment of timing and other prosodic characteristics. An inventory of several hundred of these prerecorded segments of speech is needed to synthesize most utterances in English. In order to synthesize speech for different speakers, it is necessary to make new recordings for each speaker and to edit these recordings to obtain the necessary inventory of segments.
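The smooth joining of segments can be sketched as a short crossfade at the splice point; real concatenative systems also apply the prosodic adjustments described above, which are omitted here, and the "segments" below are synthetic tones standing in for recorded speech.

```python
import numpy as np

def concatenate(seg_a, seg_b, rate=16000, fade_ms=10):
    """Join two speech segments with a short linear crossfade so the
    splice point has no abrupt waveform discontinuity."""
    n = int(rate * fade_ms / 1000)
    fade_out = np.linspace(1.0, 0.0, n)
    fade_in = 1.0 - fade_out
    # Overlap the tail of the first segment with the head of the second
    overlap = seg_a[-n:] * fade_out + seg_b[:n] * fade_in
    return np.concatenate([seg_a[:-n], overlap, seg_b[n:]])

# Two synthetic "segments" (standing in for the 'ma' and 'at' pieces of 'mat')
t = np.arange(4000) / 16000.0
ma = np.sin(2 * np.pi * 220 * t)
at = np.sin(2 * np.pi * 180 * t)
word = concatenate(ma, at)
print(len(word))  # 4000 + 4000 - 160 = 7840 samples
```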
A second method for generating the sound from a phonetic sequence is to use a synthesis device that models some aspects of the human speech-generating process. This process regards the human speech system as consisting of a set of sound sources that simulate vocal cord vibration or noise due to turbulent airflow, together with a set of resonators that simulate the filtering of the sources by the airways that constitute the vocal tract. The control parameters for this model specify attributes such as the frequency and amplitude of vocal cord vibration and the frequencies of the resonators. Depending on the complexity of the synthesizer, there can be as many as 40 such control parameters or as few as 10. This type of synthesizer has been called a formant or terminal-analog synthesizer.
In this second type of synthesizer, the rules for converting phonetic units into control parameters specify an array of parameter values for each speech sound and describe how these parameters should move smoothly from one speech sound to the next. By proper adjustment of the ranges of the parameters, different male and female voices can be synthesized. The best synthesis based on these formant or articulatory models is somewhat more intelligible than concatenative synthesis.
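A highly simplified source-filter sketch of formant synthesis: an impulse train standing in for vocal cord vibration is passed through cascaded second-order resonators, one per formant. The formant frequencies and bandwidths below are illustrative round numbers, not a published voice model.

```python
import numpy as np

def resonator(signal, freq, bandwidth, rate=10000):
    """Second-order digital resonator simulating one vocal-tract formant."""
    r = np.exp(-np.pi * bandwidth / rate)
    theta = 2 * np.pi * freq / rate
    b0 = (1 - r) * np.sqrt(1 - 2 * r * np.cos(2 * theta) + r * r)  # rough gain normalization
    a1, a2 = 2 * r * np.cos(theta), -r * r
    out = np.zeros_like(signal)
    for i in range(len(signal)):
        out[i] = b0 * signal[i] + a1 * out[i - 1] + a2 * out[i - 2]
    return out

def synthesize_vowel(f0=100, formants=((700, 60), (1200, 90), (2600, 120)),
                     duration=0.2, rate=10000):
    """Impulse-train source (vocal cord vibration) passed through
    cascaded formant resonators."""
    n = int(duration * rate)
    source = np.zeros(n)
    source[::rate // f0] = 1.0          # one pulse per pitch period
    speech = source
    for freq, bw in formants:
        speech = resonator(speech, freq, bw, rate)
    return speech

vowel = synthesize_vowel()  # roughly /a/-like formant pattern
print(len(vowel))  # 2000 samples (0.2 s at 10 kHz)
```

Changing `f0` and the formant table is what allows the same model to produce different voices, as described above.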
For both concatenative and formant synthesis, the generation of natural-sounding utterances requires that rules be developed for controlling the temporal aspects of the speech and changes in fundamental frequency that indicate prominent syllables and that delineate groupings of words. The most successful devices for synthesis of speech from text produce speech with reasonably high intelligibility, although not quite as intelligible as human production of speech, and with some lack of naturalness. Continuing research is leading to improvements in naturalness through adjustments of rules for the rhythmic and other prosodic aspects of the speech.
In all the previous discussion of human-machine interfaces, the inputs to the human (the displays) and the outputs from the human (the controls) have occurred at the limits of the human periphery, i.e., they have made use of the natural input and output mechanisms. In principle, it is possible to construct interfaces that bypass the periphery and display information by stimulating neural structures directly or sense physiological variables for purposes of control. Independent of the extent to which such interfaces provide a useful adjunct to conventional interfaces for normal subjects, they undoubtedly will prove exceedingly important for subjects with certain kinds of sensorimotor disabilities (Warner et al., 1994). In this section, we discuss briefly different kinds of physiological responses that might serve as useful control signals. We have omitted consideration of neural stimulation because, with only minor exceptions, we believe that most devices for providing such stimulation will be employed only for individuals with severe sensory disabilities; for such individuals, these devices will be permanently implanted and become part of the subject, thus eliminating the need to include such devices in human-machine interfaces for SE systems. We have also omitted discussion of the use of physiological response measurements (e.g., associated with muscle actions) to help individuals with severe motor disabilities control computers and telerobots, because this topic is included in our discussion of the medicine and health care application domain in Chapter 12.
Most practical work on physiological responses has been conducted in the laboratory for purposes of establishing the effects of selected experimental conditions on an individual's emotional state or for designing systems and system tasks that take human capabilities and limitations into consideration. In the near term, those measures that have been found useful as indicators of mental, emotional, or physical states in the real world should be equally useful as indicators of the states in an SE. However, apart from use with individuals having severe motor disabilities, it is likely to be several years before physiological-response sensing will be included as a control element in most SE systems. (A possible exception to this statement is suggested by the current research on brain-activated control at Armstrong Medical Research Laboratory by Junker et al., 1988.) The following discussion briefly reviews the current status of work in physiological response measurement.
Measures of physiological responses have been used by researchers to describe the physical, emotional, and mental condition of human subjects, usually in relation to performance of a specific task, involvement in a particular social transaction, or exposure to a particular set of environmental variables. Over the years, ergonomic researchers have conducted
an enormous number of studies employing physiological responses as indicators or correlates of fatigue, stress, onset or decline of attention, and level or change in level of mental workload (Wierwille, 1979). The physiological responses traditionally used in such studies focus on involuntary responses controlled by the autonomic nervous system, such as heart rate, blood pressure, stomach activity, muscle tension, palmar sweating, and pupil dilation. The techniques for measuring these responses vary in the ease and intrusiveness of data collection, the statistical properties of the data flow, and the certainty with which analyzed data can be interpreted.
Another area of physiological research activity involves studies of brain organization and cognitive function (Druckman and Lacey, 1989; Kramer, 1993; Zeffiro, 1993). Neurophysiologists have been heavily involved in developing and using physiological measures to map sensory and motor functions in the brain and to identify patterns of brain activity associated with attentional states and cognitive workload.
Autonomic Nervous System Responses
The autonomic nervous system is composed of two subsystems, the sympathetic and the parasympathetic. The sympathetic nervous system is responsible for mobilizing the body to meet emergencies; the parasympathetic nervous system is responsible for maintaining the body's resources. These two systems can either work together or in opposition to one another. Thus, an increase in a physiological response such as heart rate or muscle tension may be interpreted as an increase in sympathetic activity or as a decrease in parasympathetic activity.
Physiological responses that are interpreted as sympathetic nervous system activities are often cited as indicators of emotional response. These include heart rate, systolic and diastolic blood pressure, muscle tension, and skin conductance (Hassett, 1978). In evaluating these measures, it is important to recognize that the sympathetic nervous system does not respond in a unitary way; humans often exhibit an increase in one of these responses such as elevated heart rate, without showing an increase in others. Thus, it is misleading to speak of arousal as if it were a unitary response. Moreover, there is no simple one-to-one correspondence between any single physiological response and a particular emotion or cognitive process (Cacioppo and Tassinary, 1990). Although it may be possible in the future to identify patterns of responses that relate to specific psychological states, the results to date are mixed. For example, it appears to be possible to describe the intensity of an emotion but not necessarily whether it is positive or negative.
Perhaps the most thorough study of the measurement and use of autonomic nervous system responses has been accomplished by researchers
interested in identifying physiological indicators of mental workload. Wierwille (1979) provided a complete review of 14 physiological measures and their usefulness. The results of this analysis point to three measures that have some correlation with mental workload: catecholamine excretion as determined by bodily fluid analysis, pupil diameter, and evoked cortical potential. Although each of these was judged to have promise, none is easily measured. In addition, Wierwille suggests that they should be used in combination with behavioral measures when drawing conclusions about levels or types of mental workload.
All physiological measures suffer from some drawback, either in terms of ease of measurement or accuracy of interpretation. One reason it is so difficult to make inferences about possible psychological concomitants of physiological processes is the complexity of the physiological systems and the sensory environments involved. Each physiological process may be internally regulated by multiple control systems and also may be complicated by individual differences (e.g., one person may respond to threat primarily with elevated heart rate, whereas another will respond with elevated blood pressure).
Heart rate can be assessed using electrocardiogram electrodes attached to the chest or by attaching a plethysmograph to the finger or earlobe to detect changes in blood flow. Although heart rate is relatively easy to measure, the results are conflicting. According to a review by Hartman (1980), some studies show increases in heart rate with mental workload; others show a decrease or no change in heart rate with increases in workload. Wierwille's (1979) results suggest that one fairly stable and useful measure is spectrum analysis of heartbeat intervals.
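Spectrum analysis of heartbeat intervals can be sketched as follows: the irregularly spaced R-R intervals are resampled to a uniform rate and Fourier analyzed, revealing slow oscillations such as the roughly 0.1 Hz blood-pressure-related rhythm simulated here. All the numbers below are synthetic, chosen only to make the peak visible.

```python
import numpy as np

def rr_spectrum(rr_intervals_s, resample_hz=4.0):
    """Power spectrum of heartbeat (R-R) interval variability; the beat-by-beat
    intervals are resampled to a uniform rate before the FFT."""
    t_beats = np.cumsum(rr_intervals_s)                     # time of each beat
    t_uniform = np.arange(t_beats[0], t_beats[-1], 1.0 / resample_hz)
    rr_uniform = np.interp(t_uniform, t_beats, rr_intervals_s)
    rr_uniform -= rr_uniform.mean()                         # remove the DC component
    power = np.abs(np.fft.rfft(rr_uniform)) ** 2
    freqs = np.fft.rfftfreq(len(rr_uniform), 1.0 / resample_hz)
    return freqs, power

# Simulated intervals: mean 0.8 s, modulated by a slow oscillation that
# works out to about 0.1 Hz in real time
n = np.arange(300)
rr = 0.8 + 0.05 * np.sin(2 * np.pi * 0.08 * n)
freqs, power = rr_spectrum(rr)
peak = freqs[np.argmax(power)]
print(peak)  # near 0.1 Hz
```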
Systolic and diastolic blood pressure are usually monitored by using a self-inflating arm cuff system (Weber and Drayer, 1984). A newer technology from OHMEDA uses a finger cuff to track pulse pressure continuously; it provides a measure of blood pressure and heart rate every two seconds. Systolic pressure is the peak pressure when the blood is being ejected from the left ventricle; diastolic pressure is the pressure between contractions while the ventricle is relaxing (Smith and Kampine, 1984). Heart rate and systolic blood pressure tend to show significant increases when subjects are shown emotionally arousing films, asked to carry out demanding tasks (such as mental arithmetic), or are placed in embarrassing or threatening situations (Krantz and Manuck, 1984). There are many other cardiovascular variables that can be assessed (e.g., pulse transit time, forearm blood flow—Smith and Kampine, 1984). Depending on the precise nature of the research question, additional measures may be needed in order to identify the underlying mechanisms that have produced any observed changes in blood pressure or heart rate. For example, an increase in blood pressure could be due to an increase in cardiac
output (either heart rate or stroke volume), or it could be due to an increase in peripheral resistance (constriction of blood vessels or capillaries). Knowing which of these has occurred may in turn point to the neurotransmitters likely to have been active.
Monitoring of muscle tension (electromyography—EMG) involves attaching two or more electrodes over the surface of the target muscle to detect naturally occurring bioelectric potentials. Large skeletal muscles such as the trapezius (located in the back of the neck) or frontalis (forehead) may be monitored as a way of assessing muscle tension or facial expression. Even changes too fleeting or slight to be observed by a human judge can be detected by placing electrodes near key facial muscles (Hassett, 1978; Cacioppo and Petty, 1983). The bioelectric potentials being detected are very small in magnitude, and the range of frequencies includes 60 Hz; to ensure clean recording of EMG, isolation from electrical fields is important. This recording procedure is otherwise relatively straightforward. Muscle tension may be an indicator of alertness; however, it also changes as a function of fatigue level, the individual's physical condition, and the physical demands of a task.
EMG signals have been employed for the control of powered prostheses such as the Utah Artificial Arm (Jacobsen et al., 1982). The basis for this control is detailed models of the muscle biomechanics that are used to predict the proper joint torques from the EMGs (Meek et al., 1990). Given the success in prosthetics, it is therefore natural to consider that EMGs could be used to control robotic devices. Hiraiwa et al. (1992) used EMGs to control finger position, torque, and motion of an artificial robot hand. Control was achieved with a neural net, trained by using a VPL DataGlove; accuracies were quite modest. Farry and Walker (1993) are working on EMG control of the Utah/MIT Dextrous Hand Master; various EMG processing methods are being tested using an EXOS Dextrous Hand Master to predict basic grasp types. They also present a concise up-to-date review of EMG signal processing and the use of EMGs in the control of prostheses. The use of EMGs to control telerobots or computers is clearly less appropriate for normal humans than for humans with severe motor disabilities. They are used for persons with disabilities (e.g., to control prosthetics) only because better choices do not exist.
EDA (electrodermal activity), also known as GSR (galvanic skin response), is assessed by attaching a pair of gold cup electrodes to the skin (usually on the hand), passing a very small electrical current between them, and measuring the resistance, or equivalently its reciprocal, the conductance. As palmar sweating increases, the conductance increases (Hassett, 1978). In response to stimuli, people typically show a transient change in EDA; the number and latency of peaks in EDA are often measured as a way of assessing stimulus response. According to Wierwille (1979), stress will cause decreases in skin resistance; however, extensive time averaging is required to demonstrate a change. Moreover, skin response changes as a function of several variables, including ambient temperature, humidity, physical exertion, and individual differences in body metabolism.
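The peak-count and latency measures mentioned above can be sketched as follows. The sampling rate, response threshold, and synthetic trace are assumptions for illustration, not values from the cited work:

```python
import numpy as np
from scipy.signal import find_peaks

FS = 20.0  # assumed EDA sampling rate, Hz

def scr_peaks(conductance, stim_index, fs=FS, min_rise=0.05):
    """Count skin-conductance response peaks after a stimulus and return
    the latency (s) of the first one. `min_rise` (microsiemens) is an
    assumed prominence threshold separating responses from slow drift."""
    post = conductance[stim_index:]
    peaks, _ = find_peaks(post, prominence=min_rise)
    if len(peaks) == 0:
        return 0, None
    return len(peaks), peaks[0] / fs

# Synthetic trace: slow drift plus one response about 2 s after the stimulus.
t = np.arange(0.0, 10.0, 1.0 / FS)
eda = 5.0 + 0.01 * t + 0.3 * np.exp(-((t - 4.0) ** 2) / 0.5)
n, latency = scr_peaks(eda, stim_index=int(2.0 * FS))
```

Using peak prominence rather than a fixed amplitude threshold makes the count less sensitive to the baseline drift that temperature, humidity, and individual differences introduce.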
Research on pupil dilation has shown a change in pupil size as mental workload increases: during high levels of workload the pupil dilates, but when the operator becomes overloaded, pupil size is reduced. Pupillary response can be recorded with a motion picture or video camera; analysis is accomplished by measuring pupil diameter in each frame through manual or automated techniques. The evaluation of pupillary response is complicated by the fact that pupil size also changes with the level of ambient illumination, fatigue, and acceleration.
In recent years, many researchers have made significant contributions to the measurement of brain activity (Kramer, 1993; Zeffiro, 1993; Druckman and Lacey, 1989). Four technologies are of particular interest: event-related potentials (ERPs), positron emission tomography (PET), magnetic resonance imaging (MRI), and magnetic stimulation mapping.
Event-related potentials are found on the electroencephalographic record. They reflect brain activities that occur in response to discrete events or tasks.
Druckman and Lacey (1989) report several important research efforts involving ERPs. For example, a negative voltage with a poststimulus latency of 100 ms (N100) has been shown to be related to the individual's focus of attention. This finding suggests that N100 might be used to determine whether an individual is following task instructions or has shifted attention to some other aspect of the environment. A negative voltage with a latency of 200 ms (N200) has been found to be related to a mismatch between an individual's expectations and events occurring in the environment. A positive voltage with a 300 ms latency (P300) has been associated with mental workload (Donchin et al., 1986; Kramer, 1993) and with the analysis of memory mechanisms (Neville et al., 1986; Karis et al., 1984). Kramer (1993) reports the extensive use of P300 in measuring both primary and secondary mental task load. A principal finding across studies is that the P300 elicited by discrete secondary tasks decreases as the difficulty of the primary task increases. Although no formal reliability studies have been conducted, a substantial body of evidence suggests that ERPs are reliable measures of mental workload in the laboratory.
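Because individual components such as P300 are far smaller than the background EEG, they are extracted by averaging many stimulus-locked epochs: background activity that is independent across trials shrinks roughly as one over the square root of the number of trials. A toy sketch, with all rates and amplitudes assumed:

```python
import numpy as np

FS = 250  # assumed EEG sampling rate, Hz

def average_erp(eeg, event_samples, fs=FS, window_s=0.6):
    """Average fixed-length epochs time-locked to each event; trial-to-
    trial background noise cancels, leaving the event-related waveform."""
    n = int(window_s * fs)
    epochs = np.stack([eeg[s : s + n] for s in event_samples])
    return epochs.mean(axis=0)

# Synthetic record: a P300-like bump 0.3 s after each event, buried in noise.
rng = np.random.default_rng(0)
t = np.arange(0.0, 0.6, 1.0 / FS)            # 150-sample epoch
bump = 2.0 * np.exp(-((t - 0.3) ** 2) / 0.002)
eeg = rng.normal(0.0, 3.0, FS * 200)         # 200 s of background activity
events = np.arange(FS, FS * 150, FS)         # one event per second
for s in events:
    eeg[s : s + bump.size] += bump
erp = average_erp(eeg, events)               # bump emerges near 0.3 s
```

No single epoch in this example shows the bump reliably; only the average does, which is the source of the multi-second delay discussed below.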
The next step is to begin experimentation in field environments (Druckman and Lacey, 1989; Kramer, 1993).
Another important aspect of the ERP is the readiness potential, which is manifested as a slow negative wave preceding a voluntary response. This wave was labeled the contingent negative variation (CNV) by Walter et al. (1964). It is one of a class of event-preceding negative waves that have been shown to relate to a subject's mental preparatory processes. There is some evidence that the size of the negative potential and its location by hemisphere can provide evidence regarding the response a subject is thinking about, independent of the response that is actually made (Coles et al., 1985).
It is important to note that obtaining reliable estimates of ERPs for cognitive functions may take several seconds. This time delay limits the potential of ERPs for VE applications requiring real-time, closed-loop control.
Functional and Structural Information
Functional and structural information about the brain can be obtained by a combination of methods, including magnetic stimulation mapping, positron emission tomography, and magnetic resonance imaging. Magnetic stimulation mapping is accomplished by discharging a magnetic coil held over the surface of the head; the resulting magnetic field induces currents that stimulate the underlying brain tissue. When the coil is placed over one of the motor areas of the brain and a brief burst of pulses is initiated, some of the nerve cells in that area will discharge and the part of the body controlled by those cells will move. Magnetic stimulation is also used to stimulate peripheral nerves, such as those in the back or limbs, for both clinical and research purposes. The number of cells activated by this technique is unclear.
PET is an imaging technique that portrays the flow of radioactive substances throughout the body. It is based on a combination of principles from computed tomography and radioisotope imaging (Martin et al., 1991). Researchers at the National Institutes of Health use PET to map the motor cortex: they examine blood flow data in the brain from a PET scan as subjects move different parts of their bodies. Areas of high blood flow are matched with the body part being moved; the increase in blood flow has been shown to be proportional to the rate of movement or degree of contraction. Although PET provides accurate information about the function of a particular area of the brain, it does not present particularly useful anatomical information. As a result, MRI, which does provide anatomical information, has been used in conjunction with PET; images from both techniques can be taken simultaneously for the same subject
and then spatially registered with one another. This provides a pattern that links structure and function.
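The registration step can be illustrated with a deliberately simplified, translation-only alignment. Real PET/MRI registration solves a full three-dimensional rigid-body problem; everything below is a toy stand-in:

```python
import numpy as np

def register_translation(fixed, moving, max_shift=5):
    """Brute-force search for the integer (dy, dx) shift that best aligns
    `moving` (e.g., a functional PET slice) to `fixed` (an anatomical MRI
    slice), scoring each candidate by the correlation of pixel values."""
    best, best_score = (0, 0), -np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(moving, dy, axis=0), dx, axis=1)
            score = (fixed * shifted).sum()
            if score > best_score:
                best, best_score = (dy, dx), score
    return best

# Synthetic slices: the "PET" image is the "MRI" image shifted by (-2, +3),
# so the recovered alignment shift should be (2, -3).
mri = np.zeros((32, 32))
mri[10:20, 12:22] = 1.0
pet = np.roll(np.roll(mri, -2, axis=0), 3, axis=1)
shift = register_translation(mri, pet)
```

Once the transform is known, each functional value can be assigned to the anatomical location it overlays, which is the structure-function link described above.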
Recently a procedure has been developed at Massachusetts General Hospital that uses MRI to collect functional data by imaging blood oxygenation status in response to stimulation and motor responses. This is a completely noninvasive approach to mapping the cortex, and it is much faster than PET. Functional MRI data have been collected that show changes in oxygenation status in the premotor cortex and somatosensory cortex when a subject thinks about making a movement; during actual movement, both of these areas plus the primary motor cortex are activated. Another promising process, magnetic resonance angiography (MRA), images blood vessels and can be used to measure blood flow through the brain (Martin et al., 1991).
Finally, there has been some interesting work conducted at the Armstrong Medical Research Laboratory (Junker et al., 1988, 1989) on brain-actuated control of a roll-axis tracking simulator. The focus of the work has been on providing closed-loop feedback that allows subjects to learn to control the resonance portion of the EEG and then to use these resonances to send control signals to the tracking simulator. Essentially, the studies have shown that subjects are able to roll accurately to the left or right in the simulator by controlling brain resonance signals. Although this research is in its early stages, it does appear to hold some promise for training individuals to use physiological responses as simple control signals (e.g., on-off, left-right).
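One crude way such a resonance could be turned into a binary left/right command is to compare its spectral power across two channels. The sketch below is a hypothetical band-power comparator, not the scheme actually used by Junker et al.; the 8-12 Hz band, sampling rate, and channel layout are all assumptions:

```python
import numpy as np

FS = 256  # assumed EEG sampling rate, Hz

def band_power(eeg, band, fs=FS):
    """Total spectral power of `eeg` within a frequency band (lo, hi)."""
    spectrum = np.abs(np.fft.rfft(eeg)) ** 2
    freqs = np.fft.rfftfreq(len(eeg), 1.0 / fs)
    mask = (freqs >= band[0]) & (freqs < band[1])
    return spectrum[mask].sum()

def roll_command(eeg_left, eeg_right, band=(8.0, 12.0)):
    """Issue a left/right command from the relative strength of an
    assumed 8-12 Hz resonance over the two hemispheres."""
    pl = band_power(eeg_left, band)
    pr = band_power(eeg_right, band)
    return "left" if pl > pr else "right"

# Synthetic test: a strong 10 Hz rhythm on the left channel only.
t = np.arange(0.0, 2.0, 1.0 / FS)
rng = np.random.default_rng(1)
left = np.sin(2 * np.pi * 10 * t) + rng.normal(0.0, 0.2, t.size)
right = rng.normal(0.0, 0.2, t.size)
```

In a closed-loop trainer, the command (or the raw power ratio) would be fed back to the subject continuously, which is what lets subjects learn to modulate the resonance at will.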
Recently this work on brain-actuated control has been extended to rehabilitation (Calhoun et al., 1993). These authors believe that, by employing appropriate evoking stimuli and feedback, people can learn to use EEG responses to control wheelchairs, computers, and prosthetics. Furthermore, they conclude that brain-actuated control may be better than most existing interfaces for people with disabilities.
OTHER INTERFACE COMPONENTS
Other interface components that could be of value involve stimulation of the olfactory (smell) and gustatory (taste) channels, as well as diffuse sensations related to heat, air currents, humidity, and chemical pollutants in the air that affect the eyes or skin surface. In some cases, such stimulation can have direct functional significance for the specified task; in others, it may serve merely to increase the general sense of presence or immersion in the environment.
In one research and development project aimed at the training of firefighters, the traditional helmet-mounted display of visual and auditory stimuli is being extended to include the display of odor and radiant heat stimuli (Cater et al., 1994). The current version of this experimental system, which makes use of a 44 lb backpack for housing the special additional equipment required, includes both a computer-controlled odor delivery system and a computer-controlled radiant heat delivery system. The odor system delivers odors by blowing air across microcapsules crushed by pinch rollers; the radiant heat system provides directional as well as overall intensity characteristics by controlling the individual intensity of infrared lamps arranged in a circular array.
The visual display component of the system is driven by the video outputs of two Silicon Graphics Indigos (one for each eye). Polhemus Fastrak position information is processed by a PC and communicated to one of the Indigos via an RS-232 serial bus. Another serial bus on that Indigo controls the odor delivery subsystem, and a third serial connection, on the second Indigo, drives a MIDI bus to control the radiant heat subsystem. An Ethernet connection between the two Indigos is used for synchronization and for sharing peripheral information.
The demonstration program Pyro makes use of two virtual human arms and hands, Polhemus trackers attached to the user's wrists, a virtual torch, and some virtual flammable spheres. The user's task in this demonstration is to light each of the spheres with the torch.
Among the problems encountered in this work are the visualization of the flame, gumming up of the pinch rollers used to crush the odor microcapsules, and the weight of the backpack.