The Auditory Channel
As indicated previously, the accomplishments and needs associated with the auditory channel differ radically from those associated with the visual channel. Specifically, in the auditory channel, the interface devices (earphones and loudspeakers) are essentially adequate right now. In other words, from the viewpoint of synthetic environment (SE) systems, there is no need for research and development on these devices and no need to consider the characteristics of the peripheral auditory system to which such devices must be matched. What is needed, however, is better understanding of what sounds should be presented using these devices and how these sounds should be generated. Accordingly, most of the material presented in this section is concerned not so much with auditory interfaces as with other aspects of signal presentation in the auditory channel.
STATUS OF THE RELEVANT HUMAN RESEARCH
There is no topic in the area of auditory perception that is not relevant to the use of the auditory channel in some kind of SE system. The illustrative topics included here have been chosen for discussion because we believe they have exceptional relevance or are receiving considerable attention by investigators in the field of audition who are concerned with the use of the auditory channel in SE systems. General background on audition can be found in Carterette and Friedman (1978); Pickles (1988); Moore (1989); Yost (1994); and Fay and Popper (1994).
Resolution and Information Transfer Rates
General comments on resolution and information transfer rates, most of which are not modality specific, are presented in the Overview, in the discussion of the current state of the SE field. Here we supplement those previous, more general remarks with information that is specific to the auditory channel.
Most work on auditory resolving power has focused on artificially simple stimuli (in particular, tone pulses) or speech sounds. Except for a small amount of work directed toward aiding individuals with hearing impairments, relatively little attention has been given to the resolution of environmental sounds. Nevertheless, knowledge of the normal auditory system's ability to resolve arbitrary sounds is quite advanced. Thus, for example, there is an extensive literature, both experimental and theoretical, on the ability to discriminate between two similar sounds, or to perceive a target sound in the presence of a background masking sound (presented simultaneously or temporally separated). On the whole, knowledge in this area appears to be adequate for most SE design purposes. Useful information on auditory resolution can be found in the general texts on audition cited above, in the Handbook of Human Perception and Human Performance (Boff et al., 1986), in the Engineering Compendium (Boff and Lincoln, 1988), and, most importantly, in the many articles published each year by the Journal of the Acoustical Society of America.
Issues related to information transfer rates are much more complex because such rates depend not only on basic resolving power, but also on factors related to learning, memory, and perceptual and cognitive organization. It appears that the upper limits on information transfer rates for spoken speech and Morse code, two methods of encoding information acoustically for which there exist subjects who have had extensive training in deciphering the code, are roughly 60 bits/s and 20 bits/s, respectively. Unfortunately, we are unaware of any estimates of the information transfer rate for the perception of music. We would guess, however, that the rate lies between the above two, with a value closer to that of speech than of Morse code (because of the much higher dimensionality of the stimulus set in music than in Morse code). Although we cannot prove it, we suspect that the rate achieved with spoken speech is close to the maximum that can be achieved through the auditory channel. We say this not because we believe that speech is special, in the sense argued by various speech experts (e.g., Liberman et al., 1968), but rather because of the high perceptual dimensionality of speech sounds and because of the enormous amount of learning associated with the development of speech communication. Furthermore, except possibly for the case of music, we believe that it would be extremely difficult to achieve a comparatively
high rate with any other coding system. Certainly, none of the individuals we know would be willing to spend an equivalent amount of time attempting to learn an arbitrary nonspeech code developed for purposes of general research or for use in SE systems. Unfortunately, there is no theory yet available that enables one to reliably predict the dependence of information transfer rates on the coding scheme or training procedures employed.
Much of the current work concerned with communicating information to individuals through the auditory channel falls under the heading of "auditory displays." A comprehensive view of work in this area, together with hundreds of references, can be found in the book edited by Kramer (1994).
The focus of work in this area has been the creation of different kinds of displays. Relatively little attention has been given to questions of evaluation, to analysis of the information transfer achieved, or to training issues and how performance changes with practice. The kinds of displays include audification, in which the acoustic stimulus involves direct playback of data samples, using frequency shifting, if necessary, to bring the signals into the auditory frequency range; and sonification, in which the data are used to control various parameters of a sound generator in a manner designed to provide the listener with information about the controlling data. In general, such displays are being used both for monitoring tasks (e.g., to monitor the condition of a hospital patient or the state of a computer) and for data exploration (e.g., in physics, economics).
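The sonification approach just described can be sketched concretely. The fragment below is a minimal parameter-mapping sonification in Python: each value in a data series controls the frequency of a short tone segment. The frequency range, segment duration, and sample rate are arbitrary illustrative choices, not values drawn from any particular system described here.

```python
import math

def sonify(data, f_min=220.0, f_max=880.0, seg_dur=0.1, rate=8000):
    """Map each data value to the frequency of a short tone segment:
    values are normalized to [0, 1] and mapped linearly onto the
    frequency range, and one sine segment is emitted per value."""
    lo, hi = min(data), max(data)
    span = (hi - lo) or 1.0
    n = int(seg_dur * rate)
    samples = []
    for x in data:
        f = f_min + (x - lo) / span * (f_max - f_min)
        # Phase restarts at each segment boundary; a real display would
        # smooth these joins to avoid audible clicks.
        samples.extend(math.sin(2 * math.pi * f * t / rate) for t in range(n))
    return samples

tones = sonify([3.0, 1.0, 4.0, 1.0, 5.0])
```

A monitoring display would stream such segments continuously; a data-exploration display might instead map several data dimensions simultaneously onto pitch, loudness, and rhythm.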
In the attempt to create effective displays, investigators are using perceptually high-dimensional stimulus sets and display sounds and codes that capitalize on the special sensitivities and strengths of the auditory system. Generally accepted positive characteristics of the auditory system include the ability to detect, localize, and identify signals arriving from any direction (despite the presence of objects that would cause visual occlusion), the tendency for sounds to cause an alerting and orienting response, and the exceptional sensitivity to temporal factors and to changes in the signal over time. These characteristics make auditory displays particularly useful for warning and monitoring functions.
A further set of issues in the design of acoustic displays concerns the extent to which the display makes use of everyday sounds and natural codes so that learning time is minimized. Although such codes are always preferred, some applications, such as the sonification of financial data, require codes that are highly abstract. Furthermore, when an abstract code is required, attention must be given to the way in which the
resulting complex of auditory signals will be analyzed into distinct perceptual streams (see the detailed discussion on auditory scene analysis below).
The topic of auditory spatial perception is important for three reasons: the perception of spatial properties is an important component of the overall perception of real sound fields; the location of a sound source is a variable that can be used to increase the information transfer rate achieved with an auditory display; and spatial perception has been a central research focus in the simulation of acoustic environments for virtual environment (VE) systems.
Auditory localization of isolated sound sources in anechoic (i.e., echo-free) environments is relatively well understood. The direction of a source is determined primarily by comparing the signals received at the two ears to determine the interaural amplitude ratio and the interaural time delay or phase difference as a function of frequency. Whereas interaural phase provides the dominant cue for narrowband low-frequency (< 1,500 Hz) signals, interaural amplitude ratio provides the dominant cue for narrowband high-frequency signals. For signals of appreciable bandwidth, interaural time delay also plays a significant role at high frequencies. These empirical psychophysical results are consistent with what one would expect based on the relevant physical acoustics: at the low frequencies, the effects of head shadow are relatively small because of diffraction; at the higher frequencies, measurement of time delay suffers from phase ambiguities unless there is sufficient bandwidth to eliminate these ambiguities. The results are also consistent with the known physiology: the neural firings in the auditory nerve are able to follow the individual cycles of a narrowband signal only at the lower frequencies; they can, however, follow modulations in the envelope of high-frequency signals with significant bandwidth at the output of the auditory "critical band" filters.
Ambiguities in the determination of direction achieved using interaural amplitude ratio and interaural time delay are well predicted by assuming that the head is a sphere with sensors located at the end of a diameter, and then determining the three-dimensional surfaces over which both the interaural amplitude ratio and the interaural time delay remain constant. All such surfaces are surfaces of revolution about the interaural axis, which can be approximated by cones for sources far removed from the head, and the median plane (in which the signals to the two ears are identical if one assumes the head is symmetrical) constitutes a limiting case of this family of surfaces.
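The cone-of-confusion geometry can be made concrete with a short computation. The sketch below uses the simplified model described above, two point sensors a diameter apart with diffraction ignored, and assumes a head radius of 8.75 cm and a sound speed of 343 m/s (typical textbook values, not measurements from the studies cited here).

```python
import math

SPEED_OF_SOUND = 343.0   # m/s (assumed)
HEAD_RADIUS = 0.0875     # m (an assumed average head radius)

def itd_seconds(azimuth_deg, elevation_deg=0.0):
    """Interaural time delay for a distant source, modeling the head as
    two point sensors separated by one head diameter.  The delay depends
    only on the angle between the source direction and the interaural
    axis, so all directions on a cone about that axis give the same ITD."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    # Component of the unit source-direction vector along the interaural axis.
    along_axis = math.sin(az) * math.cos(el)
    return (2.0 * HEAD_RADIUS / SPEED_OF_SOUND) * along_axis

# A source 30 deg to the right in front and its mirror image 30 deg to
# the right behind (azimuth 150 deg) produce identical ITDs: the
# front-back ambiguity that head movements help resolve.
front = itd_seconds(30.0)
back = itd_seconds(150.0)
```

Under this model the maximum ITD, for a source on the interaural axis, is about 510 microseconds; any source in the median plane gives an ITD of zero.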
The methods used by the auditory system to resolve these ambiguities
involve (1) head movements and (2) monaural processing. Thus, for example, front-back ambiguities are easily eliminated by rotating the head in azimuth. Directional information from monaural processing is achieved by estimating properties of the direction-dependent filtering of the transmitted signal that occurs when the transmitted signal propagates from the signal source to the eardrum. The directional information achieved in this manner, however, depends strongly on (1) the listener having adequate a priori information on the spectrum of the transmitted signal (so the spectrum of the received signal can be factored into the spectrum of the transmitted signal and the transfer function of the filter) and (2) the existence of high-frequency (> 5,000 Hz) energy in the transmitted signal (so that the listener's pinnae can have a strong directional effect on the spectrum of the received signal).
For isolated sound sources in an anechoic environment, the just-noticeable difference (JND) in azimuth for sources straight ahead of the listener is roughly 1 deg. However, the JND in azimuth for sources off to the side, the JND in elevation within the median plane, and the JND in distance are relatively large. In all these cases, the JND constitutes a substantial fraction (e.g., one-fifth) of the total meaningful range of the variable in question.
When reflections (echoes and reverberation) are present, direction finding tends to be degraded relative to performance in anechoic environments. This degradation is limited, however, by the tendency of the auditory system to enhance perception of the direct acoustic wave and suppress late-arriving echoes (the precedence effect), at least with respect to direction finding.
Perception of sound source distance is generally very poor. Whatever ability there is seems to be based on the following three changes in the received signal as source distance is increased: a decrease in intensity, an increase in high-frequency attenuation, and an increase in the ratio of reflected to direct energy. Distance perception is poor because none of these cues is reliable: all can be influenced by factors other than distance. The influence of reflections on both direction perception and distance perception, together with an inadequate understanding of how listeners separate the characteristics of the transmitted signal from those of the acoustic environment that modify the signal as it approaches the ear, makes the study of reflection effects in audition a high-priority item.
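The three distance cues listed above can be illustrated with rough numbers. In the sketch below, all constants (the reverberant-field level and the per-meter high-frequency absorption) are assumed, order-of-magnitude values chosen for illustration; none is taken from the literature cited here.

```python
import math

def distance_cues(distance_m, ref_distance_m=1.0,
                  reverb_level_db=-20.0, hf_absorption_db_per_m=0.05):
    """Illustrative magnitudes of the three distance cues for a point
    source in a room: (1) inverse-square intensity loss, (2) extra
    high-frequency loss from air absorption, and (3) the direct-to-
    reverberant energy ratio."""
    # 1. Inverse-square law: level falls 6 dB per doubling of distance.
    level_db = -20.0 * math.log10(distance_m / ref_distance_m)
    # 2. Air absorption adds a small extra high-frequency loss per meter.
    hf_extra_loss_db = hf_absorption_db_per_m * (distance_m - ref_distance_m)
    # 3. The direct level falls with distance while the diffuse
    #    reverberant level stays roughly constant, so their ratio shrinks.
    direct_to_reverb_db = level_db - reverb_level_db
    return level_db, hf_extra_loss_db, direct_to_reverb_db
```

The unreliability noted in the text is visible here: the listener receives only the combined result, and on the intensity cue alone a quiet source nearby is indistinguishable from a loud source far away.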
Even though distance perception is poor, under normal circumstances, sound sources are perceived to be located outside the head, that is, they are perceptually externalized. However, under special circumstances, the sources can be made to appear inside the head. Although in-head localization seldom occurs under normal conditions, it is a major
problem when sounds are presented through earphones as they are in head-mounted displays for VE systems.
Further important issues in the area of auditory spatial perception concern the identification of source position (as opposed to discrimination) and the phenomena that occur when multiple sources are simultaneously active at different locations. Identification of source position, like identification of many other variables, is limited by inadequate short-term memory and exhibits constraints on information transfer that are much stronger than those implied solely by the JND data. In the area of multiple sources, a great deal of research has been conducted on the ability to detect a given source in a background of interference emanating from sources at other locations; however, very little is known about the ability to localize a given source in such a background of interference. Furthermore, most of the work in this area is limited to the case in which the interference arises from a single location in space and the space is anechoic.
Detailed information on the topics discussed above is available in the following articles, chapters, and books, and in the many references cited in these publications: Blauert (1983); Colburn and Durlach (1978); Durlach and Colburn (1978); Durlach et al. (1992c); Durlach et al. (1993); Gilkey and Anderson (1994); Mills (1972); Searle et al. (1976); Wenzel (1992); and Yost and Gourevitch (1987).
The parameters of the human auditory system pertinent to real-time system dynamics include spatial resolution, as estimated by the JND in angle or, as it is sometimes called, the minimum audible angle or MAA (Mills, 1958); the velocity-dependent minimum audible movement angle or MAMA (Perrott, 1982); and the corresponding minimum perceivable latencies. As noted above, the auditory system is most sensitive to changes in the azimuthal position of sources located in front of the listener, so that the necessary angular resolution and accuracy of a head tracker are determined by the JND in azimuth for sources in front of the listener. This localization blur (Blauert, 1983) is dependent to some extent on the nature of the signal, but is never less than 1 deg. The angular accuracy of the commonly used magnetic trackers now available is on the order of 0.5 deg, and thus meets this requirement. The necessary translational accuracy of trackers depends on the distances of the sources to be simulated. However, since perception of source distance itself is very poor, permissible tracker translational error is not bound by error in simulated source distance but rather by error in simulated source angle. Again considering sources in front of the listener, source distance must be large enough that the maximum angular positional error (equal to the angular tracker error plus the error in angular position due to translational tracker error) is smaller than the JND in angular position. If the angular error for
the tracker is assumed to be 0.5 deg, then, as long as the translational error causes no greater than about 0.5 deg of angular error, it will be perceptually insignificant. For a given translational accuracy, sources above some minimum distance (dmin) from the listener can be simulated without perceptual error. The value of dmin is given by dmin = e / (2 tan(a/2)), where a is the allowable angular error due to translational error (0.5 deg) and e is the translational accuracy of the tracker. For an accuracy of e = 1 mm, dmin is about 12 cm, a value that seems sufficient for most practical cases. It should also be noted that the limited accuracy of the tracking device will, in most cases, cause no audible effects even if the sources are somewhat closer to the subject, because the localization blur (even in azimuth) is larger than 1 deg for many kinds of signals and is considerably larger for directions other than straight ahead.
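The dmin relation above is easy to check numerically. The following sketch simply evaluates dmin = e / (2 tan(a/2)) for the values quoted in the text:

```python
import math

def min_source_distance(translational_accuracy_m,
                        allowable_angular_error_deg=0.5):
    """Minimum simulated source distance beyond which the tracker's
    translational error e produces less angular error than the
    allowance a: d_min = e / (2 * tan(a / 2))."""
    a = math.radians(allowable_angular_error_deg)
    return translational_accuracy_m / (2.0 * math.tan(a / 2.0))

# 1 mm translational accuracy with a 0.5 deg angular allowance gives
# d_min of roughly 11.5 cm, which the text rounds to about 12 cm.
d_min = min_source_distance(0.001)
```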
The questions of the latency constraints and update rates required to create natural virtual auditory environments are still largely unanswered. Some relevant perceptual studies include the work of Perrott et al. (1989) and Grantham and his colleagues (e.g., Grantham, 1986, 1989, 1992; Chandler and Grantham, 1992). These studies have begun to examine the perception of moving auditory sources. For example, for source speeds ranging from 8 to 360 deg/s, minimum audible movement angles were found to range from about 4 to 21 deg, respectively, for a 500 Hz tone burst (Perrott, 1982; Perrott and Tucker, 1988). The recent work on the perception of auditory motion by Perrott and others using real sound sources (moving loudspeakers) suggests that the computational latencies currently found in fairly simple auditory VEs are acceptable for moderate velocities. For example, latencies of 50 ms (associated with positional update rates of 20 Hz as found in the Convolvotron system from Crystal River Engineering) correspond to an angular change of 18 deg when the relative source-listener speed is 360 deg/s, 9 deg when the speed is 180 deg/s, etc.
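The latency arithmetic in this example reduces to a single multiplication, sketched below with the figures quoted in the text (50 ms of latency, i.e., a 20 Hz positional update rate):

```python
def angular_lag_deg(latency_s, relative_speed_deg_per_s):
    """Angular error introduced by a fixed rendering latency when the
    source-listener direction changes at a constant angular speed."""
    return latency_s * relative_speed_deg_per_s

# 50 ms of latency lags a source moving at 360 deg/s by 18 deg and one
# moving at 180 deg/s by 9 deg.  Comparing such lags with the minimum
# audible movement angle at the same speed suggests when a given
# latency is perceptually acceptable.
lag_fast = angular_lag_deg(0.050, 360.0)
lag_slow = angular_lag_deg(0.050, 180.0)
```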
As yet, no relevant studies have been performed that investigate the delays that may be tolerable in rendering dynamic cues when subjects actively move their heads. Such psychophysical questions are currently being investigated in the United States by the National Aeronautics and Space Administration (NASA) and in Europe by the SCATIS (Spatially Coordinated Auditory/Tactile Interaction Scenario) project. In any event, the overall update rate in a complex auditory VE will depend not only on tracker delays (nominally 10 ms for the Polhemus Fastrak), but also on communication delays and delays in event processing or handling. The tracker delays in such systems will probably contribute only a small fraction of the overall delay, particularly when multiple sources or larger filters (e.g., those required to simulate reverberant rooms) are used.
Auditory Scene Analysis
Unlike the visual system, in which there is a direct mapping of source location to image on the retina, in the auditory system there is no peripheral representation of source location. There are only two sensors (the ears), each sensor receives energy from all sources in the environment, and the spatial analysis takes place centrally after the total signal in each ear is analyzed by the cochlea into frequency bands. Thus, the auditory picture of the acoustic scene must be constructed by first analyzing the filter outputs into components, then combining these components both across frequency bands and across the two ears. This signal-processing architecture differs not only from that found in the visual system, but also from that which would be used in the design of an artificial acoustic sensing system. Ideally, in an artificial system, the spatial processing would be performed prior to the frequency processing: one would use an array of microphones and parallel processing of the set of microphone outputs to achieve separate channels corresponding to different regions of space, and then separately analyze the frequency content of the signal arising from each spatial cell (Colburn et al., 1987; Durlach et al., 1993).
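The idealized artificial architecture just described, spatial processing ahead of frequency processing, can be sketched as a delay-and-sum beamformer, the simplest way to form one spatial channel from an array of microphone outputs. The geometry, sample rate, and sound speed below are illustrative assumptions, not a description of the systems cited.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s (assumed)

def delay_and_sum(mic_signals, mic_positions_m, look_azimuth_deg, rate):
    """Steer a linear microphone array toward one direction: delay each
    microphone's signal so that a plane wave arriving from that
    direction adds coherently, then average.  Positions are distances
    along the array axis; azimuth is measured from broadside.  Delays
    are rounded to whole samples for simplicity."""
    az = math.radians(look_azimuth_deg)
    n_out = len(mic_signals[0])
    output = [0.0] * n_out
    for sig, x in zip(mic_signals, mic_positions_m):
        # A plane wave from azimuth az reaches a mic at position x
        # earlier by x*sin(az)/c; delaying by that amount realigns it.
        delay = int(round(x * math.sin(az) / SPEED_OF_SOUND * rate))
        for n in range(n_out):
            m = n - delay
            if 0 <= m < n_out:
                output[n] += sig[m] / len(mic_signals)
    return output
```

Running one such beamformer per look direction yields a bank of spatial channels whose outputs could then be analyzed separately in frequency, which is exactly the processing order the text contrasts with the ear's cochlea-first architecture.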
Independent of the evolutionary developments underlying the way in which the auditory system is designed, the existing design presents a unique problem for creating a coherent auditory scene. The lack of a peripheral representation of source location complicates not only the task of determining source location, but also that of determining the number and character of the sources. Somehow, the higher centers in the auditory system must decompose the output of each auditory filter in each ear into elements that, after the decomposition, can be recombined into representations of individual sources.
In general, understanding auditory scene analysis is important to the design of SE systems because properties of this analysis play an important role in determining the effectiveness of auditory displays. Background information on this topic can be found in Hartmann (1988), Handel (1989), Bregman (1990), and Yost (1991). Briefly, auditory scene analysis refers to the fact that, analogous to the visual domain, one can conceive of the audible world as a collection of acoustic objects. The auditory system's job is to organize this world by identifying and parsing the frequency components belonging to individual objects from the mixture of components reaching the two ears that could have resulted from any number of "real" acoustic sources.
Historically, research in this area has been concerned with the phenomena of auditory stream segregation, in which an auditory stream corresponds to a single perceptual unit or object. Studies have shown that, in addition to spatial location, various acoustic features such as synchrony
of temporal onsets and offsets, timbre, pitch, intensity, frequency and amplitude modulation, and rhythm can serve to specify the identities of objects and segregate them from one another. A single stream or object, such as a male human voice, will tend to be composed of frequencies that start and stop at the same time, have similar low-frequency pitches (formants), similar rhythmic changes or prosody, and so on. A female voice is parsed as a separate object because it has higher-frequency formants, a different prosody, etc. Nonspeech sounds can possess the same sorts of distinguishing characteristics. Furthermore, Gestalt principles of perceptual organization normally applied to visual stimuli, such as principles of closure, occlusion, perceptual continuity, and figure-ground phenomena, have their counterparts in audition as well. For example, these principles operate in a drawing when one object, say the letter B, is occluded by another, say an irregular blob of ink. The visual system has no trouble seeing the discontinuous visible fragments of the occluded B as a unitary and continuous object. However, if the blob is removed, it is much more difficult to recognize the same fragments as belonging to a B. A similar effect can be heard when a series of rising and falling tonal glides is interspersed with bursts of white noise (Dannenbring, 1976). When the noise is not present, one hears bursts of rising and falling tones that seem like isolated events. However, when the noise is present, the auditory system puts the tonal glides together into one continuous stream that seems to pass through the noise, even though the glides are not physically present during the noise.
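Dannenbring's demonstration is easy to synthesize. The sketch below builds the two versions of the stimulus, glides separated by silence versus glides whose gaps are filled with noise bursts, as raw sample lists; the durations, frequencies, and sample rate are illustrative choices, not the parameters of the original study.

```python
import math
import random

RATE = 8000  # Hz, assumed sample rate

def glide(f0, f1, dur):
    """Tone whose frequency moves linearly from f0 to f1 (phase is the
    integral of the instantaneous frequency, so the sweep is smooth)."""
    n = int(dur * RATE)
    return [math.sin(2.0 * math.pi * (f0 * t / RATE
            + (f1 - f0) * t * t / (2.0 * n * RATE))) for t in range(n)]

def noise_burst(dur, rng):
    return [2.0 * rng.random() - 1.0 for _ in range(int(dur * RATE))]

def sequence(fill_gaps_with_noise):
    """Alternate rising and falling glides with 50 ms gaps that are
    either silent or filled with white noise."""
    rng = random.Random(0)
    out = []
    for f0, f1 in [(400.0, 800.0), (800.0, 400.0), (400.0, 800.0)]:
        out += glide(f0, f1, 0.2)
        out += (noise_burst(0.05, rng) if fill_gaps_with_noise
                else [0.0] * int(0.05 * RATE))
    return out
```

With silent gaps, listeners report isolated tonal events; with the gaps filled, the glides tend to be heard as one continuous stream passing through the noise, even though no glide energy is present during the bursts.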
While such illusions may seem merely like perceptual curiosities, these kinds of effects can have a profound impact on whether information is perceived and organized correctly in a display in which acoustic signals are used to convey meaning about discrete events or ongoing actions in the world and their relationships to one another. For example, if two sounds are assigned to the same perceptual stream, simultaneous acoustic masking may be greater, but temporal comparisons might be easier, because relative temporal changes can alter the overall Gestalt or cause the single stream to split perceptually into two streams. Suppose two harmonically related tonal pulses with similar rhythms are being used to represent two different friendly ship signatures in a sonar display. The information content of this nicely harmonic, unitary percept might be that all is well. However, if the signal recognition system of the sonar system now detects a change in one ship's acoustic signature, this might be represented as an inharmonicity of one component relative to the other. A further change in their relative temporal patterns might break the sound into two separate, inharmonic objects signaling "potential enemy in the vicinity." Conversely, if the inharmonic and temporal relationships are not made sufficiently distinct, the warning signal might
not be recognized. Although the above discussion is rather speculative, it illustrates how acoustic signals could be used, or misused, for monitoring important events in situations, such as a sonar display, in which the operator's visual channel is already overloaded.
In the context of display design, the notion of auditory scene analysis has been most influential in the recent interest in using abstract sounds, environmental sounds, and sonification for information display (Kramer, 1994). The idea is that one can systematically manipulate various features of auditory streams, effectively creating an auditory symbology that operates on a continuum from literal everyday sounds, such as the rattling of bottles being processed in a bottling plant (Gaver et al., 1991), to a completely abstract mapping of statistical data into sound parameters (Smith et al., 1990). Principles for design and synthesis can be gleaned from the fields of music (Blattner et al., 1989), psychoacoustics (Patterson, 1982), and higher-level cognitive studies of the acoustical determinants of perceptual organization (Bregman, 1990; Buxton et al., 1989). Recently, a few studies have also been concerned with methods for directly characterizing and modeling such environmental sounds as propeller cavitation, breaking or bouncing objects, and walking sounds (Howard, 1983; Warren and Verbrugge, 1984; Li et al., 1991). Other relevant research includes physically based acoustic models of sound source characteristics, such as radiation patterns (Morse and Ingard, 1968). Further discussion of some of these issues is presented in the section below on computer generation of nonspeech audio.
Adaptation to Unnatural Perceptual Cues
There are many SE situations in which a sensory display, or the manner in which such a display depends on the behavior of the operator of the SE system, deviates strongly from normal. Such situations can arise because of inadequacies in the design or construction of an SE system or because intentional deviations are introduced to improve performance. Whenever a situation of this type exists, it is important to be able to characterize the operator's ability to adapt.
There have been a number of instances in which acoustic signals have been used as substitutes for signals that would naturally be presented via other modalities (e.g., Massimino and Sheridan, 1993). Unfortunately, however, much of this work has not included careful experimentation on how subjects adapt to such sensory substitutions over time. A major notable exception to this can be found in the work on sonar systems for persons who are blind (Kay, 1974; Warren and Strelow, 1984, 1985; Strelow and Warren, 1985).
Most of the work on adaptation to unnatural perceptual cues has
focused on spatial perception and on transformations of cues within the auditory system, i.e., no sensory substitution has been involved. Extensive reviews of this work, with long lists of references, can be found in Welch (1978) and Shinn-Cunningham (1994).
Current research in this area directly relevant to SE systems continues to be concerned with spatial perception and is of two main types. First, studies are being conducted to determine the extent to which the signal-processing and measurement procedures required to achieve realistic simulations of real acoustic environments using earphone displays can be simplified and made more cost-effective. An important part of this effort concerns the extent to which listeners can adapt to the alterations in cues associated with these procedures. Thus, for example, a number of investigators have begun to explore the extent to which listeners can learn to make use of the direction-dependent filtering associated with someone else's ears (Wenzel et al., 1993a). Second, research is under way to determine the extent to which listeners can adapt to distortions that are purposefully introduced in order to improve spatial resolution. In particular, a number of transformations are being studied that present the listener with magnified perceptual cues that approximate in various ways the cues that would be present if the listener had a much larger head (Durlach and Pang, 1986; Van Veen and Jenison, 1991; Durlach et al., 1993). Also being studied now are transformations that enable the listener to perceive the distance of sound sources much more accurately (see the review by Durlach et al., 1993). Although it is fairly obvious that effective spatial resolution can be improved by such transformations, the degree to which the response bias introduced by these transformations can be eliminated by adaptation is not obvious.
In general, although adaptation is a topic that has been of interest to psychologists for a long time, there is still no theory available that will enable one to predict the rate and asymptote of adaptation as a function of the transformation and training procedure employed.
STATUS OF THE TECHNOLOGY
A recent comprehensive review of technology for the auditory channel in VE systems is available in Durlach et al. (1992b), and much of the material presented in this section is taken from that report.
An auditory interface for virtual environments should be capable of providing any specified pair of acoustic waveforms to the two ears. More specifically, it should (1) have high fidelity, (2) be capable of altering those waveforms in a prescribed manner as a function of various behaviors or outputs of the listener (including changes in the position and orientation of the listener's head), and (3) exclude all sounds not specifically
generated by the VE system (i.e., real environmental background sounds). Requirement (3) must be relaxed, of course, for an augmented-reality system in which the intention is to combine synthetically generated sounds and real environmental sounds (see the section on hear-through displays below).
Generally speaking, such results can be most easily achieved with the use of earphones; when loudspeakers located away from the head are employed, each ear receives sound from each loudspeaker and the control problem becomes substantial. Although commercial high-fidelity firms often claim substantial imaging ability with loudspeakers, the user is restricted to a single listening position within the room, only azimuthal imaging is achieved (with no compensation for head rotation), and the acoustic characteristics of the listening room cannot be easily manipulated. In addition, since the ears are completely open, extraneous (undesired) sounds within the environment cannot be excluded. Finally, although the tactual cues associated with the use of earphones may initially limit the degree of auditory telepresence, such cuing may actually prove useful, since the user will be required to transit back and forth between the virtual and the real environments. In any case, such cuing is likely to be present because of the visual interface. One set of situations for which loudspeakers might be needed comprises those in which high-energy, low-frequency acoustic bursts (e.g., associated with explosions) occur. In such cases, loudspeakers, but not earphones, can be used to vibrate portions of the body other than the eardrums.
Earphones vary in their electroacoustic characteristics, size and weight, and mode of coupling to the ear (see, for example, Shaw, 1966; Killion, 1981; Killion and Tillman, 1982). At one extreme are transducers that are relatively large and massive and are coupled to the ear with circumaural cushions (i.e., that completely enclose the pinnae). At the other extreme are insert earphones through which the sound is delivered to some point within the ear canal. The earphone may be very small and enclosed within a compressible plug (or custom-fitted ear mold) that fits into the ear canal, or the earphone may be remote from the ear and its output coupled via plastic tubing (typically 2 mm inner diameter) that terminates within a similar plug. Intermediate devices include those with cushions that rest on the pinnae (e.g., Walkman-type supraaural earphones) and those with smaller cushions (about 1.5 cm diameter) that rest within the concha (so-called earbuds).
All earphone types can deliver sounds of broad bandwidth (up to 15 kHz) with adequate linearity and output levels (up to about 110 dB sound
pressure level). Precise control of the sound pressure at the eardrum requires knowledge of the transfer function from the earphone drive voltage to the eardrum sound pressure. Probe-microphone sound-level measurements proximal to the eardrum can be used to obtain this information. These measurements, particularly at frequencies above a few kHz, are nontrivial to perform. In general, this transfer function is expected to be more complex (versus frequency) and more variable across individuals as the size of the space enclosed between the eardrum and the earphone increases. Thus, circumaural earphones that enclose the pinnae are expected to have more resonances than insert earphones. One might similarly expect greater variability with repeated placement of the earphones on a given individual, but recent measurements of hearing thresholds at high frequencies indicate that, even with relatively large supraaural earphones, test-retest variability measures are small (a few dB up to 14 kHz—Stelmachowicz et al., 1988). Thus, repeated measures of this transfer function on a given listener may be unnecessary. For some applications, even interindividual differences may be unimportant and calibration on an average (mannequin) ear may be sufficient. Insert earphones (or earplugs), by virtue of having a high ratio of contact area (with the ear canal wall) to exposed sound transmission area (the earplug cross-section), afford relatively good attenuation of external sounds (about 35 dB above 4 kHz, decreasing to about 25 dB below 250 Hz). Circumaural earphones can achieve similar high-frequency attenuation but less low-frequency attenuation. Recently developed active-noise-canceling circumaural headsets (e.g., Bose Corp.) provide up to 15 dB additional low-frequency attenuation, thereby making their overall attenuation characteristics similar to those of insert earphones.
Supraaural earphones, which rest lightly on (or in) the external ear, provide almost no attenuation (consistent with their commonly being referred to as open-air design). The greatest attenuation can be achieved by combining an insert earphone with an active-noise-canceling circumaural hearing protector. Even including such a protector, it is unlikely that the cost of such a sound delivery system will exceed $1,000.
Most of the past work on auditory interfaces for virtual environments has been directed toward the provision of spatial attributes for the sounds. Moreover, within this domain, work has focused primarily on the simulation of normal spatial attributes (e.g., Blauert, 1983; Wenzel et al., 1988; McKinley and Ericson, 1988; Wightman and Kistler, 1989a,b; Wenzel, 1992). Relatively little attention has been given to the provision of supernormal attributes (Durlach and Pang, 1986; Durlach, 1991; Van Veen and Jenison, 1991; Durlach et al., 1992a).
Normal spatial attributes are provided for an arbitrary sound by multiplying the complex spectrum of the transmitted sound by the transfer
function of the space filter associated with the transformation that occurs as the acoustic waveform travels from the source to the eardrum. (In the time domain, the same transformation is achieved by convolving the transmitted time signal with the impulse response of the filter.) For binaural presentation, one such filter is applied for each of the two ears. Inasmuch as most of the work on virtual environments has focused on anechoic space, aside from the time delay corresponding to the distance between source and ear, the filter is determined solely by the reflection, refraction, and absorption associated with the body, head, and ears of the listener. Thus, the transfer functions have been referred to as head-related transfer functions (HRTFs). Of course, when realistic reverberant environments are considered, the transfer functions are influenced by the acoustic structure of the environment as well as that of the human body.
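In code, the per-ear operation just described amounts to an FIR convolution of the source signal with each ear's head-related impulse response (HRIR). The sketch below uses toy impulse-response values purely for illustration; a real system would use measured HRIRs:

```python
# Minimal sketch of binaural rendering by convolution.
# The HRIR values below are toy placeholders, not measured responses.

def convolve(signal, impulse_response):
    """Direct-form FIR convolution (the time-domain equivalent of
    multiplying the source spectrum by the HRTF)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out

def render_binaural(source, hrir_left, hrir_right):
    """One spatialized source: one filter per ear."""
    return convolve(source, hrir_left), convolve(source, hrir_right)

# Toy HRIRs: the right-ear response is delayed and attenuated relative
# to the left, crudely mimicking a source to the listener's left.
hrir_l = [1.0, 0.5]
hrir_r = [0.0, 0.0, 0.6, 0.3]
left, right = render_binaural([1.0, 0.0, -1.0], hrir_l, hrir_r)
```

For a reverberant environment, the same structure applies but the impulse responses become much longer, since they must capture the room's reflections as well as the listener's body.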
Estimates of HRTFs for different source locations are obtained by direct measurements using probe microphones in the listener's ear canals, by roughly the same procedure using mannequins, or by the use of theoretical models (Wightman and Kistler, 1989a,b; Wenzel, 1992; Gierlich and Genuit, 1989). Once HRTFs are obtained, the simulation is achieved by monitoring head position and orientation and providing, on a more or less continuous basis, the appropriate HRTFs for the given source location and head position/orientation.
The process of measuring HRTFs for a set of listeners is nontrivial in terms of time, skill, and equipment. Although restriction of HRTF measurements to the lower frequencies is adequate for localization in azimuth, it is not adequate for vertical localization, for externalization, or for elimination of front-back ambiguities (particularly if no head movement is involved). Thus, ideally, a sampling frequency of roughly 40 kHz is required (corresponding to an upper limit for hearing of 20 kHz). Similarly, if the HRTFs are measured in an acoustic environment with reverberation, the associated impulse responses can be very long (e.g., more than a second). These two facts, combined with the desire to measure HRTFs at many source locations, in many environments, and for many different listeners, can result in a monstrously time-consuming measurement program.
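A rough calculation illustrates the scale of the measurement problem; every figure below is an illustrative assumption, not a stated requirement:

```python
# Back-of-envelope sketch of why exhaustive HRTF measurement scales
# badly. All numbers are illustrative assumptions.

fs = 40_000           # samples/s (covers hearing up to 20 kHz)
impulse_len_s = 1.0   # reverberant impulse responses can exceed a second
n_positions = 72      # source locations sampled around the head
n_environments = 5    # rooms to be simulated
n_listeners = 20      # individual subjects
bytes_per_sample = 2  # 16-bit storage

samples_per_response = int(fs * impulse_len_s)
total_bytes = (samples_per_response * 2          # two ears
               * n_positions * n_environments * n_listeners
               * bytes_per_sample)
print(total_bytes / 1e9, "GB")  # storage alone, before measurement time
```

Even with these modest assumptions, the raw storage alone runs to gigabytes, and the measurement time grows with the same product of factors.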
Preliminary investigations have demonstrated that, without training, nonindividualized HRTFs cause greater localization error, particularly in elevation and in front-back discrimination, than do HRTFs measured from each subject (Wenzel et al., 1993a). These results have been interpreted as showing that HRTFs must be measured for each individual subject in order to achieve maximum localization performance in an auditory SE. However, essentially all of the work done to date has been done without regard for the effects that could be achieved by means of sensorimotor adaptation (e.g., see Welch, 1978). We suspect that if subjects were given
adequate opportunity to adapt in a closed-loop SE, much of the current concern with detailed characteristics of the HRTFs and with the importance of intersubject differences would disappear. In addition, significant advances are now being made in the modeling of HRTFs and the development of parametric expressions for HRTFs based on abstractions of the head, torso, and pinnae. In fact, in the not-too-distant future, it may be possible to obtain reasonably good estimates of an individual's HRTFs merely by making a few geometric measurements of the outer ear structures. These two factors, combined with the use of models for describing the effects of reverberation, should greatly simplify the HRTF estimation problem.
Given an adequate store of estimated HRTFs, it is then necessary to select the appropriate ones (as a function of source and head position/orientation) and filter the source signal in real time. Although some ability to perform such processing has been achieved with relatively simple analog electronics (e.g., Loomis et al., 1990), the devices available for achieving the most accurate simulations employ digital signal processing. The earliest commercially available systems (e.g., the FocalPoint system from Gehring Research and the Convolvotron) employed simple time-domain processing schemes to "spatialize" input sound sources. The Acoustetron (successor to the Convolvotron) quadrupled the computational capabilities of these earlier systems. It stores 72 pairs of spatial filters sampled at 50 kHz for spatial positions sampled at 30 degrees in azimuth and 18 degrees in elevation. The spatial filters that most closely correspond to the instantaneous position of the source relative to the listener's head are retrieved in real time, interpolated to simulate positions between the spatial sampling points, and then convolved with the input sound source to generate appropriate binaural signals. The system can spatialize up to 32 sources in parallel, enabling simulation of simple acoustic room models (with first- and second-order reflections) in real time.
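The retrieval step can be sketched as a nearest-neighbor lookup on the spatial grid. The grid spacing below follows the figures just cited, but the elevation range is an assumption for illustration, and a real system would also interpolate between neighboring filters rather than snap to the nearest one:

```python
# Sketch of retrieving the stored filter pair nearest to a source
# direction on a grid sampled at 30 deg in azimuth and 18 deg in
# elevation. The elevation range (-90..+90 deg) is an illustrative
# assumption; real systems interpolate between neighbors.

def nearest_grid_index(azimuth_deg, elevation_deg,
                       az_step=30, el_step=18):
    az_idx = round((azimuth_deg % 360) / az_step) % (360 // az_step)
    el_idx = round((elevation_deg + 90) / el_step)
    return az_idx, el_idx

# A source at 44 deg azimuth, 0 deg elevation maps to the filter
# stored for 30 deg azimuth (index 1) at the horizontal plane.
idx = nearest_grid_index(44, 0)
```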
Another time-domain spatialization system was developed by McKinley and his associates in the Bioacoustics and Biocommunications Branch of the Armstrong Laboratory at Wright-Patterson Air Force Base in order to present three-dimensional audio cues to pilots (R.L. McKinley, personal communication, 1992). The HRTFs incorporated into this system, derived from measurements using the mannequin KEMAR, are sufficiently dense in azimuth (HRTFs are measured at every degree in azimuth) to eliminate the need for interpolation in azimuth. In elevation, the measurements are much less dense and linear interpolation is employed. The researchers at Wright-Patterson, in conjunction with Tucker Davis Technologies of Gainesville, Florida, have recently developed a new time-domain processor based on their earlier efforts. This machine, which is
now commercially available, contains more memory than has been available in previous spatialization systems. In addition, the product has been designed to maximize the flexibility of the system, allowing researchers to allocate the available processing power as necessary for the individual application.
Frequency-domain filtering is being employed in the Convolvotron II, a next-generation spatialization device under development at Crystal River Engineering. By performing frequency-domain filtering, the Convolvotron II will be capable of greater throughput for relatively low cost. In addition, because the Convolvotron II is being built from general-purpose, mass-produced digital signal processing (DSP) boards, the system will be completely modular and easily upgraded.
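The structure underlying such frequency-domain processing is overlap-add block filtering, sketched below. Here each block is filtered by direct convolution so that the example stays self-contained; a frequency-domain implementation would replace that inner step with an FFT, a spectral multiply, and an inverse FFT to gain throughput:

```python
# Sketch of overlap-add block filtering: each input block is filtered
# independently and the overlapping tails are summed. The per-block
# filtering here is a direct convolution; a frequency-domain system
# would do FFT -> multiply -> inverse FFT instead.

def conv(x, h):
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y

def overlap_add(signal, h, block=4):
    out = [0.0] * (len(signal) + len(h) - 1)
    for start in range(0, len(signal), block):
        piece = conv(signal[start:start + block], h)
        for i, v in enumerate(piece):
            out[start + i] += v
    return out
```

Blockwise processing also lets a long room impulse response be split across processors, which is one reason a modular DSP-board design is attractive.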
The state of the art in spatial auditory displays is not yet quite adequate for high-quality VE applications. For example, there are serious questions about the adequacy of the current techniques used for constructing interpolated HRTFs (Wightman et al., 1992; Wenzel and Foster, 1993). Similarly, with the current-generation processors, limitations in memory and filter size, as well as in processing speed and algorithm architecture, have limited the ability to simulate nonisotropic sources or reverberant environments (measured or synthesized). To date, the only real-time auditory spatialization system that included even simplified real-time room modeling was that developed by Foster and his associates (Foster et al., 1991) for the Acoustetron. Even though the acoustic model used was rather simple and provided only a small number of first- and second-order reflections, this system provides increased realism (e.g., it improves externalization). The role played by reverberation (as well as other factors) in the externalization of sound images produced by earphone stimulation is discussed in Durlach et al. (1992c).
More sophisticated and realistic sound-field models have been developed for architectural applications (e.g., Lehnert and Blauert, 1991, 1992; Vian and Martin, 1992), but they cannot be simulated in real time by any of the spatialization systems currently available. An overview of the current state of the art for sound-field modeling and a representative collection of contemporary papers may be found in special issues in Applied Acoustics (1992, 1993). As the computational power of real-time systems increases, the use of these detailed models will become feasible for the simulation of realistic environments.
The most common approach to modeling the sound field is to generate a spatial map of secondary sound sources (Lehnert and Blauert, 1989). In this method, the sound field due to a source in echoic space is modeled
as a single primary source and a cloud of discrete secondary sound sources (which correspond to reflections) in an anechoic environment. The secondary sources can be described by three major properties (Lehnert, 1993a): (1) distance (delay), (2) spectral modification with respect to the primary source (air absorption, surface reflectivity, source directivity, propagation attenuation), and (3) direction of incidence (azimuth and elevation).
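A minimal data record for one such secondary source, carrying the three properties listed above, might look as follows; the field names and values are purely illustrative:

```python
# Hypothetical record for one secondary source in a spatial map,
# following the three properties listed above. Names and values are
# illustrative, not from any particular system.

from dataclasses import dataclass

@dataclass
class SecondarySource:
    delay_s: float        # (1) distance, expressed as propagation delay
    spectral_gain: list   # (2) per-band modification vs. primary source
    azimuth_deg: float    # (3) direction of incidence at the listener
    elevation_deg: float

# A first-order floor reflection might then be described as:
floor_bounce = SecondarySource(delay_s=0.012,
                               spectral_gain=[0.9, 0.8, 0.6],
                               azimuth_deg=0.0, elevation_deg=-40.0)
```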
In contrast to the digital generation of reverberation (which has a long history—e.g., Schroeder, 1962), very few people have experience with real-time sound-field modeling. In order to achieve a real-time realization of sophisticated sound models, it has been suggested that early and late sources (Lehnert and Blauert, 1991; Kendall and Martens, 1984) be treated differently. Early reflections would be computed in real time, whereas late reflections would be modeled as reverberation.
Two methods are commonly used to find secondary sound sources: the mirror-image method (Allen and Berkley, 1979; Borish, 1984) and variants of ray tracing (Krokstadt et al., 1968; Lehnert, 1993a). Lehnert and colleagues (Shinn-Cunningham et al., 1994) have compared the computational efficiency of the two methods with respect to their achievable frame rate and their real-time performance for a room of moderate complexity with 24 reflective surfaces. A total of 8 first-order and 19 second-order reflections were calculated for a specific sender-receiver configuration.
For this test scenario, the mirror-image method was more efficient than the ray-tracing method. In addition, the mirror-image method is guaranteed to find all the geometrically correct sound paths. For the ray-tracing method, it is difficult to predict the number of rays required to find all desired reflections. The ray-tracing method does have the advantage that it produces reasonable results even when very little processing time is available, and it can easily be adapted to work at a given frame rate by adjusting the number of rays used, whereas the mirror-image method cannot be scaled back easily since the algorithm is recursive. Ray tracing will yield better results in more complex environments, since its processing time grows linearly with the number of surfaces rather than exponentially, as is the case for the mirror-image method. Thus, although for the given test case the mirror-image method was more efficient, there will probably be some scenarios in which the mirror-image method is superior and others in which the ray-tracing method offers better performance.
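The mirror-image idea can be sketched compactly for a rectangular (shoebox) room: each of the six walls yields one first-order image source by reflecting the source position across that wall, and real implementations reflect these images recursively to obtain higher-order paths. All coordinates below are illustrative:

```python
# Minimal mirror-image sketch for a rectangular room with walls at
# coordinate 0 and at the room dimension along each axis. Each wall
# produces one first-order image source; higher orders come from
# reflecting the images again (recursively).

def first_order_images(src, room):
    """src = (x, y, z) source position; room = (Lx, Ly, Lz)."""
    images = []
    for axis in range(3):
        lo = list(src)
        lo[axis] = -src[axis]                    # mirror across wall at 0
        hi = list(src)
        hi[axis] = 2 * room[axis] - src[axis]    # mirror across wall at L
        images.append(tuple(lo))
        images.append(tuple(hi))
    return images

imgs = first_order_images((1.0, 2.0, 1.5), (4.0, 5.0, 3.0))
# six image sources, one per wall
```

Each image source then contributes one secondary source (delay from its distance, spectral gain from the wall's reflectivity, direction from its position relative to the listener).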
Since calculation of the sound-field model will be the most time-consuming part of the auditory pipeline, optimization of these calculations is worthwhile (e.g., see the discussion in Shinn-Cunningham et al., 1994). Computational resources can be assigned as necessary to achieve the necessary accuracy of the simulation. If, for example, a primary source is to
be presented, no reflection filters are required and more resources (i.e., more filter coefficients) can be assigned to the HRTFs to obtain more precise spatial imaging. For a second-order reflection, spectral filtering must take place, but since the directivity of the delayed reflection is less salient psychoacoustically, less accuracy may be needed for the HRTF filtering. The structure of the auralization unit should allow for task-adequate assignment of resources. Efficient algorithms and signal-processing techniques for real-time synthesis of complex sound fields are currently being investigated in a collaborative project by Crystal River Engineering and NASA.
Off-Head, Hear-Through, and Augmented-Reality Displays
Apart from work being conducted with entertainment applications in mind, most of the research and development concerned with auditory displays in the SE area has been focused on stimulation by means of earphones. However, as indicated previously, such stimulation has two drawbacks: (1) it encumbers the listener by requiring that equipment be mounted on the user's head and (2) it stimulates only the listener's eardrums.
In considering item (1), it should be noted that for many of the head-tracking techniques now in use (see Chapter 5), head-tracking devices as well as earphones must be mounted on the head. If the visual display in the system is also head mounted, then the ergonomic difficulties associated with adding the earphones are minor: not only would a head-tracker presumably be required for the visual display even if none were required for the auditory display, but also the incremental burden of adding earphones to the head-mounted visual display would be relatively trivial (provided, of course, that the use of earphones was envisioned when the mounting system was designed and not added later as an afterthought).
In considering item (2), it should be noted that, even though earphones can generate sufficient power to deafen the user, stimulation via earphones is totally inadequate for delivering acoustic power to the user in a manner that affects body parts other than the ear. Although for most applications in the SE field, stimulation of the auditory system via the normal acoustic channel (outer ear, eardrum, middle ear, cochlea, etc.) is precisely what is desired, if one wants to provide realistic simulations of high-energy acoustic events in the environment, such as explosions or low-altitude flyovers by fast-moving aircraft, then acoustic stimulation of the rest of the body (e.g., shaking the user's belly) may also be important.
The design of off-head auditory displays, i.e., loudspeakers, has been a central focus of the audio industry for many years. Many of the loudspeaker systems now available are, like earphones, more than adequate
for all SE applications with respect to such characteristics as dynamic range, frequency response, and distortion. They are also adequate with respect to cost, although they tend to be more expensive than earphones, particularly if the application requires the production of very high intensity levels throughout a very large volume (e.g., loud rock music in a big theater). For SE applications, the main problem with loudspeaker systems, as it is with earphones, is that of achieving the desired spatialization of the sounds (including both the perceived localization of the sound sources and the perceived acoustic properties of the space in which the sources are located).
The major problem that arises in spatialization for loudspeaker systems, and that does not arise when earphones are used, concerns the difficulty of controlling the signals received at the two eardrums (and the differences between these two signals). With earphones, the signal received at a given eardrum is determined simply by the signal transmitted by the earphone at that ear (and the fixed filtering associated with acoustic transmission from the earphone transducer to the eardrum). With a loudspeaker system, by contrast, the signal received at a given eardrum is influenced by all the signals transmitted by all the loudspeakers in the room, together with all the transformations that each of these signals undergoes as it propagates through the room from loudspeaker to eardrum. Among the names used to designate various approaches to one or more components of this problem are binaural stereo, transaural stereo, spectral stereo, multichannel stereo, quadraphony, ambisonics, and surround sound. Even when a given system is tuned to provide an adequate perception for a given position of the listener's head, this perception is likely to degrade rapidly as the listener's head is moved out of the "sweet spot." To date, no loudspeaker system has been developed that incorporates head-tracking information and uses this information to appropriately adjust the inputs to the loudspeakers as the position or orientation of the listener's head is altered.
Background material on the spatialization problem for signals presented by means of loudspeakers is available in Eargle, 1986; Trahiotis and Bernstein, 1987; Cooper and Bauck, 1989; Griesinger, 1989; and MacCabe and Furlong, 1994.
Within the VE area, one of the best-known systems that makes use of off-head displays is the CAVE, a VE system developed at the Electronic Visualization Laboratory, University of Illinois at Chicago (Cruz-Neira et al., 1993). The current CAVE system employs four identical speakers located at the corners of the ceiling and amplitude variation (fading) to simulate direction and distance effects. In the system now under development, speakers will be located at all eight corners of the cube and reverberation and high-frequency attenuation will be added to the parameters
that can be used for spatialization purposes. In a new off-head system being developed by Crystal River Engineering, attempts are being made to utilize protocols similar to those used in the earphone spatialization systems (e.g., the Convolvotron), so that the user will be able to change from one type of system to the other with minimal waste of time and effort.
Finally, it should be noted that relatively little attention has been given to augmented reality in the auditory channel. As in the visual channel, there are likely to be many applications in which it is necessary to combine computer-synthesized or sampled audio signals with signals obtained (1) directly from the immediate environment or (2) indirectly from a remote environment by means of a teleoperator system. In principle, the signals from the immediate environment can be captured from controlled acoustic leakage around the earphones (hear-through displays) or by positioning microphones in the immediate environment (perhaps on the head-mounted display, HMD) and adding signals in the electronic domain rather than the acoustic domain. However, because one may want to manipulate the environmental signals before adding them, or because one may want to use the same system for the case in which the sources of the environmental signals are remote and sensed by a telerobot, the latter approach seems preferable. Ideally, an acoustic augmented-reality system should be capable of receiving signals sensed by microphones in any environment (immediate or remote), transforming these signals in a manner that is appropriate for the given situation, and then adding them to the signals presented by the VE system. Currently, the most obvious use of acoustic augmented-reality systems is to enable an individual who is deeply immersed in some VE task to simultaneously monitor important events in the real world (e.g., the occurrence of warning signals).
Computer Generation of Nonspeech Audio
While much work remains to be done in the areas of sound spatialization and modeling of environmental effects, even less has been accomplished in the area of physical modeling and the computer generation of nonspeech sounds. How can we build a real-time system that is general enough to produce the quasi-musical sounds usually used for auditory icons, as well as purely environmental sounds, like doors opening or glass breaking? The ideal synthesis device would be able to flexibly generate the entire continuum of nonspeech sounds described above as well as be able to continuously modulate various acoustic parameters associated with these sounds in real time. Such a device or devices would act as the generator of acoustic source characteristics that would then serve as
the inputs to a sound spatialization system. Thus, initially at least, source generation and spatial synthesis would remain as functionally separate components of an integrated acoustic display system. While there would necessarily be some overhead cost in controlling separate devices, the advantage is that each component can be developed, upgraded, and utilized as stand-alone components so that systems are not locked into an outmoded technology.
Many of the algorithms likely to be useful for generating nonspeech sounds will be based on techniques originally developed for the synthesis of music as well as speech. The main goal in speech synthesis is the production of intelligible (and natural sounding) speech waveforms. To do this, one must accurately synthesize the output of a specific type of instrument, the human voice. Also, in speech synthesis, the final acoustic output can be rated according to the measurable amount of information conveyed and the naturalness of the speech. In synthesizing music, typically the goals are not as specific or restricted: they are defined in terms of some subjective criteria of the composer. Usually, the goal is to produce an acoustic waveform with specific perceptual qualities: either to simulate some traditional, physical acoustic source or to produce some new, unique sound with appropriate attributes.
Because the aims of synthesized music are more diverse than those of speech synthesis, there are a number of different, acceptable methods for its synthesis. Choice of the method depends on the specific goals, knowledge, and resources of the composer. In any synthesis method, the composer controls some set of time-varying parameters to produce an acoustic waveform whose perceptual qualities vary accordingly. These computer-controlled parameters may be related to a physical parameter in a modeled instrument, to the shape or spectrum of the acoustic waveform, or to the perceptual quality desired for the sound. Often, these varying techniques are combined to get a specific effect. Some of the most common techniques are described below.
One method used for computer-controlled synthesis is known as additive synthesis. In this method, a synthesized voice (or instrumental line) is generated by the addition of simple sine waves using short-time Fourier methods. One of the problems with this method is that the number of parameters needed to generate an acoustic signal with specific qualities is large, reflecting the fact that many important music percepts are far removed from the short-time Fourier parameters. Thus, synthesizing a particular sound quality can be cumbersome. Often, additive synthesis is used to simulate known sounds by first analyzing the desired sound and
measuring these parameters directly (Grey and Moorer, 1977; Moorer, 1978; Portnoff, 1980; Dolson, 1986). Small alterations in these known parameters can then fine-tune the acoustic waveform.
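A minimal additive-synthesis sketch follows, using a fixed set of partials; a real system would vary the amplitudes and frequencies frame by frame, as in the analysis-based work cited above:

```python
# Additive synthesis in miniature: a tone built by summing sine-wave
# partials with individual amplitudes. This uses a fixed (static)
# spectrum; real systems update the parameters frame by frame.

import math

def additive(partials, fs=8000, dur=0.01):
    """partials: list of (frequency_hz, amplitude) pairs."""
    n = int(fs * dur)
    return [sum(a * math.sin(2 * math.pi * f * t / fs)
                for f, a in partials)
            for t in range(n)]

# Illustrative harmonic spectrum with a 1/k amplitude rolloff
tone = additive([(440 * k, 1.0 / k) for k in range(1, 5)])
```

The parameter count grows quickly: even this four-partial tone has eight numbers to control, and a realistic timbre with dozens of time-varying partials illustrates why additive synthesis is cumbersome to drive directly.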
Another common technique, which Cook et al. (1991) have described as an example of an abstract algorithm, is frequency modulation or FM synthesis, in which the composer specifies the carrier frequency, modulation index, and modulation frequency of a signal (Chowning, 1973; Chowning and Bristow, 1986; Schottstaedt, 1977). By varying the relation of the carrier to modulation frequency, the resultant sound can be harmonic or inharmonic. If harmonic, the relation defines what overtones exist in the sound (important in timbre perception). Changing the value of the modulation index controls the spread of the spectrum of the resultant sound, and therefore its perceptual brightness. Since the relationship between the values of the frequency modulation parameter and the corresponding perceptual aspects of the synthesized output is reasonably straightforward, the frequency modulation technique has proven both powerful and versatile. This technique is employed by many commercially available synthesizers, such as the Yamaha DX-7.
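The basic FM equation, y(t) = sin(2*pi*fc*t + I*sin(2*pi*fm*t)), is compact enough to sketch directly; the parameter values below are illustrative:

```python
# FM synthesis sketch. With fc/fm in a simple integer ratio the
# spectrum is harmonic; increasing the modulation index widens
# ("brightens") the spectrum. Parameter values are illustrative.

import math

def fm_tone(fc, fm, index, fs=8000, dur=0.01):
    n = int(fs * dur)
    return [math.sin(2 * math.pi * fc * t / fs
                     + index * math.sin(2 * math.pi * fm * t / fs))
            for t in range(n)]

bright = fm_tone(fc=440, fm=440, index=5.0)  # wide, bright spectrum
pure = fm_tone(fc=440, fm=440, index=0.0)    # index 0: a plain sine
```

Only three parameters (carrier, modulator, index) control the whole spectrum, which is the source of the technique's economy relative to additive synthesis.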
Subtractive synthesis is a term in music synthesis that refers to the shaping of a desired acoustic spectrum by one or more filtering operations and is a precursor to the more modern approaches that have come to be known as physical modeling. Subtractive synthesis is often used when a simple physical model of an instrument can be described by an excitatory source that drives some filter(s). Both the source waveform and the filters, which may or may not bear any relation to a real instrument, are specified by the composer.
Most traditional instruments can also be mathematically modeled in this way: they are excited by some vibratory source, and their physical properties acoustically filter that source. For brass-like instruments, vibrations of the player's lips are the acoustic source; in a stringed instrument, it is the string vibration; for a singer, it is the action of the vocal cords. The physical properties that affect the acoustic output include the shape of the instrument, its effective length, the reflectivity of the material out of which the instrument is constructed, etc. By modeling the vibratory mode of the excitatory source and the acoustic effects of the physical properties of the instrument, subtractive synthesis has been used to synthesize known instruments (Jaffe and Smith, 1983; Risset and Mathews, 1969). As an instrument is played, the effective dimensions of the instrument are altered by opening or closing keys or valves. These changes cause the modeled filters to change in time. Because such modeling is closely tied to physical parameters, it may be less cumbersome (and intuitively easier) to adjust parameters to achieve a desired effect.
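A minimal subtractive-synthesis sketch: broadband noise as the excitation, shaped by a one-pole lowpass filter standing in for the instrument's physical filtering. The filter coefficient is illustrative and corresponds to no particular instrument:

```python
# Subtractive synthesis sketch: a broadband excitation shaped by a
# recursive filter that stands in for the instrument body. The pole
# value is illustrative; a real model would derive its filters from
# the instrument's geometry and materials.

import random

def subtractive(n_samples, pole=0.95, seed=1):
    rng = random.Random(seed)
    out, y = [], 0.0
    for _ in range(n_samples):
        x = rng.uniform(-1, 1)           # noise excitation source
        y = pole * y + (1 - pole) * x    # one-pole lowpass shaping
        out.append(y)
    return out

shaped = subtractive(100)
```

Time-varying the pole (or using several such filters) corresponds to the player opening and closing keys or valves, i.e., to the modeled filters changing in time.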
Another alternative is to develop methods of generating sounds by
modeling the motions of the physical sound events, i.e., by numerical integration of the wave equation. Generating sound by solving the equations of motion of a musical instrument captures a natural parameterization of the instrument and includes many of the important physical characteristics of the sound. The conventional and perhaps most general way of representing an acoustic system is to use a set of partial differential equations (PDEs) in the temporal and three-dimensional spatial domain. However, solving such PDEs directly is not practical, given the intensive numerical computation required and the constraint of real-time operation. To reduce the computational complexity of the sound generation task without giving up the physical essence of the representation, one approach is to use the aggregate properties of the physical model instead of solving the problem at the microscopic level (i.e., solving the PDEs).
As noted above, subtractive synthesis was the first attempt at this type of modeling of aggregate properties. Several more recent physical modeling techniques based on aggregate properties have been developed, including the digital waveguide technique, transfer function techniques, modal synthesis, and mass spring models to synthesize sounds ranging from musical instruments to the singing voice of the human vocal tract (Borin et al., 1992; Cadoz et al., 1993; Cook, 1993; Djoharian, 1993; Keefe, 1992; Morrison and Adrien, 1993; Smith, 1992; Wawrzynek, 1991; Woodhouse, 1992).
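The simplest widely known example of the waveguide idea is the Karplus-Strong plucked-string algorithm, sketched below: a delay line models wave propagation on the string, and an averaging filter models frequency-dependent loss. All parameters are illustrative:

```python
# Karplus-Strong plucked string: the simplest digital-waveguide-style
# model. A noise burst "plucks" a delay line whose length sets the
# pitch; a two-point average models the string's lowpass losses.
# Parameter values are illustrative.

import random

def karplus_strong(freq, fs=8000, dur=0.02, seed=0):
    rng = random.Random(seed)
    n = int(fs / freq)                             # delay-line length
    line = [rng.uniform(-1, 1) for _ in range(n)]  # pluck = noise burst
    out = []
    for _ in range(int(fs * dur)):
        first = line.pop(0)
        avg = 0.5 * (first + line[0])              # lowpass loss filter
        line.append(avg)
        out.append(first)
    return out

string = karplus_strong(220)
```

The aggregate-property idea is visible here: a single delay line replaces the microscopic PDE description of wave motion on the string.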
Collision sounds are an example of a simple auditory event in a virtual environment. We may predict the potential collision of two objects by observing their paths with our eyes but, short of actual breaking or deformation of the objects, it is the sound of the collision that best reveals how the structure of the objects has been affected by the collision.
To illustrate how to generate collision sounds by using physically based models, we choose a uniform beam as an example structure. The vibrational mechanics of beam structures has been examined carefully and can provide solid groundwork for collision sound synthesis. The collision can be decomposed into two parts: the excitation component, which initiates the impact event, and the resonator component, in which the interesting vibration phenomena take place.
The excitation module is affected by the force, position, and duration of the impact; the resonator module is determined by the structure's boundary conditions, material density, modulus of elasticity, and object geometry (e.g., length, width, and height). Because the uniform beam has a simple structure, one can derive the equations that depict its major vibrational modes and calculate the natural resonant frequency associated with each mode from the aggregate physical properties. These natural resonant frequencies reveal the strong linkage between material type and object shape and thus convey the objects' attributes. For complex free-form
objects, finite element analysis can be used to calculate the associated resonant frequency for the major vibrational modes.
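A modal-synthesis sketch of such a collision follows: the output is a sum of exponentially damped sinusoids at the beam's modal frequencies. For a simply supported uniform beam the mode frequencies scale as n-squared times the fundamental; the fundamental frequency and damping value below are illustrative stand-ins for the material and geometric properties discussed above:

```python
# Modal-synthesis sketch of a struck uniform beam: a sum of damped
# sinusoids at the modal frequencies. For a simply supported beam the
# modes fall at f1 * n^2. The fundamental and damping values are
# illustrative; in a real model they follow from material density,
# modulus of elasticity, geometry, and boundary conditions.

import math

def beam_collision(f1, n_modes=4, decay=200.0, fs=8000, dur=0.01):
    """f1: fundamental (Hz), set by material and geometry."""
    modes = [(f1 * n * n, 1.0 / n) for n in range(1, n_modes + 1)]
    out = []
    for t in range(int(fs * dur)):
        s = t / fs
        out.append(sum(a * math.exp(-decay * s)
                       * math.sin(2 * math.pi * f * s)
                       for f, a in modes))
    return out

clank = beam_collision(f1=400)
```

The excitation component would shape the modal amplitudes and the onset (impact force, position, duration); here the amplitudes are simply fixed at 1/n for illustration.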
Additional topics that are often grouped into the broad field of music synthesis and may be relevant to the use of sound in SEs include the use of special computational structures for the composition of music, specific hardware and software developed for music synthesis, and notational or computational conventions that are specialized for music synthesis. Of particular interest is the research by Cadoz and associates that makes use of human-machine interfaces with tactual feedback and focuses on the production of music by gestural control of simulated instruments (Cadoz et al., 1984, 1990; Cadoz and Ramstein, 1991). Further information on music synthesis is available in Moore (1990), Mathews and Pierce (1989), Roads and Strawn (1988), and Richards (1988).
Current devices available for generating nonspeech sounds tend to fall into two general categories: samplers, which digitally store sounds for later real-time playback, and synthesizers, which rely on analytical or algorithmically based sound generation techniques originally developed for imitating musical instruments (see Cook et al., 1991; Scaletti, 1994; and the discussion above). With samplers, many different sounds can be reproduced (nearly) exactly, but substantial effort and storage media are required for accurately prerecording sounds, and there is usually limited real-time control of acoustic parameters. Synthesizers, in contrast, afford a fair degree of real-time, computer-driven control.
Most widely available synthesizers and samplers are based on MIDI (musical instrument digital interface) technology. The baud rate of such devices (31.25 kbaud), especially when connected to standard serial computer lines (19.2 kbaud), is still low enough that continuous real-time control of multiple sources/voices will frequently "choke" the system. In general, synthesis-based MIDI devices, such as those that use frequency modulation (FM), are more flexible than samplers in the type of real-time control available but less general in terms of the variety of sound qualities that can be reproduced. For example, it is difficult to generate environmental sounds, such as breaking or bouncing objects, with an FM synthesizer (see Gaver, 1993).
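The arithmetic behind the "choking" claim is easy to check: MIDI transmits each byte with a start and a stop bit, and a typical control-change message occupies three bytes, so the 31.25-kbaud bus carries only about a thousand control messages per second. The update-rate scenario in the sketch below is hypothetical, chosen by us only to show how quickly that budget is exhausted.

```python
BAUD_RATE = 31250            # MIDI serial rate, bits per second
BITS_PER_BYTE = 10           # 8 data bits plus start and stop bits
MSG_BYTES = 3                # a typical 3-byte control-change message

bytes_per_second = BAUD_RATE / BITS_PER_BYTE          # 3125 bytes/s
messages_per_second = bytes_per_second / MSG_BYTES    # ~1042 messages/s

# Hypothetical scenario: 16 sources, 4 continuously varied parameters
# each, updated 60 times per second.
required = 16 * 4 * 60                                # 3840 messages/s

print(messages_per_second, required)
```

Even this modest scenario demands nearly four times the available message rate, before any note events or running-status savings are considered.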
Large-scale systems designed for sound production and control in the entertainment industry or in music composition incorporate both sampling and digital synthesis techniques and are much more powerful. However, they are also expensive, require specialized knowledge for their use, and are primarily designed for off-line sound design and post-production. A potential disadvantage of both types of devices is that they are
primarily designed with musical performance and/or sound effects in mind. This design emphasis is not necessarily well suited to the generation and control of sounds for the display of information and, again, tends to require that the designers and users have specialized knowledge of musical or production techniques.
The most general systems would be based on synthesis via physical or acoustical models of sound source characteristics. A simpler but less versatile approach would use playback of sampled sounds or conventional MIDI devices, as in most current systems. Since very general physical models are both difficult (perhaps impossible) to develop and computationally intensive, a more practical and immediately achievable system might be a hybrid that combines real-time manipulation of simple parameters of sampled sounds, such as pitch, filter bandwidth, or intensity, with real-time interpolation between sampled sounds, analogous to "morphing" in computer graphics; the Emu Morpheus synthesizer is an example of this approach. Recently, several commercial synthesizer companies have announced new products based on physical modeling techniques. A sound card being developed by MediaVision is based on digital waveguides (Smith, 1992); the Yamaha VL1 keyboard synthesizer is based on an unspecified physical modeling approach; and the Macintosh-based Korg SynthKit allows construction of sounds via a visual programming language whose interconnected modular units represent hammer strikes, bows, reeds, etc.
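The two hybrid techniques mentioned, parameter manipulation of sampled sounds and interpolation between them, can both be sketched in a few lines. In the illustrative Python below (function names our own), pitch is shifted crudely by reading the sample table at a different rate, and "morphing" is approximated by a weighted blend of two equal-length sounds; real systems would use far more sophisticated spectral interpolation.

```python
def resample(samples, ratio):
    """Crude pitch shift: read the sample table at a different rate with
    linear interpolation. ratio > 1 raises pitch and shortens playback."""
    out = []
    pos = 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        out.append((1 - frac) * samples[i] + frac * samples[i + 1])
        pos += ratio
    return out

def blend(sound_a, sound_b, mix):
    """Naive interpolation between two equal-length sampled sounds:
    mix = 0 gives sound A, mix = 1 gives sound B."""
    return [(1 - mix) * a + mix * b for a, b in zip(sound_a, sound_b)]
```

Both operations are cheap enough to drive from real-time control parameters, which is precisely their appeal for interactive displays relative to full physical models.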
A few nonspeech-sound-generation systems have been integrated specifically for virtual environment applications (e.g., Wenzel et al., 1990; VPL Research's Audiosphere; see also Wenzel, 1994); some designers are beginning to develop systems intended for data "sonification" (e.g., Smith et al., 1990; Scaletti and Craig, 1991; Wenzel et al., 1993b). Related developments in auditory interfaces include the work on audio "windowing" systems for applications like teleconferencing (Ludwig et al., 1990; Cohen and Ludwig, 1991). However, far more effort needs to be devoted to the development of sound generation technology specifically aimed at information display (see Kramer, 1994). Even more critical, perhaps, is the need for further research on lower-level sensory and higher-level cognitive determinants of acoustic perceptual organization, since these results should serve to guide technology development. Furthermore, relatively little research has been concerned with how various acoustic parameters interact to determine the identification, segregation, and localization of multiple, simultaneous sound streams. Understanding such interaction effects is likely to be critical in any acoustic display developed for SE systems.
With the advances in algorithms and hardware to produce such simulations,
there is also a need to develop an extensible protocol for virtual audio. Such a protocol will need to encompass all the acoustic models in use today and those expected to be developed in the near future. This protocol should allow developers and designers of sonification systems to utilize SE technology, even as that technology makes dramatic improvements in its capabilities.
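One minimal way to picture such an extensible protocol is a self-describing message in which each acoustic model registers a name and carries its parameters in an open-ended field, so that new models can be added without changing the wire format. The Python sketch below is purely hypothetical; every field and model name is our own invention.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class AudioSourceMessage:
    """One message in a hypothetical extensible virtual-audio protocol.
    Each acoustic model registers a name, and model-specific parameters
    travel in an open-ended dict, so new models need no format change."""
    source_id: int
    model: str            # e.g. "sampler", "modal", "waveguide"
    position: tuple       # listener-relative (x, y, z), in meters
    params: dict = field(default_factory=dict)

    def encode(self):
        return json.dumps(asdict(self))

    @staticmethod
    def decode(text):
        return AudioSourceMessage(**json.loads(text))

msg = AudioSourceMessage(7, "modal", (1.0, 0.0, -2.0),
                         {"frequencies": [127.0, 509.0], "decay": 8.0})
roundtrip = AudioSourceMessage.decode(msg.encode())
```

The design choice worth noting is the separation of a small fixed envelope (identity, model, position) from an unconstrained parameter payload: renderers that do not understand a given model can still route and spatialize the source, which is one way a protocol can survive "dramatic improvements" in the underlying technology.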
Most of the needs associated with the auditory channel lie in the domain of perceptual studies. With one major exception, discussed below, the technology for the auditory channel is either adequate now or will be adequate in the near future.
Many of the perceptual issues in the auditory channel that require attention fall under the general heading of adaptation to alterations in sensorimotor loops. Some specific examples in this category, all of which relate to the spatialization problem, concern the extent to which (and rate at which) listeners adapt to various types of distortions, including those associated with (1) the use of simplified HRTFs or transformations designed to achieve supernormal localization performance in VE systems and (2) various mappings between telerobotic sensor arrays and the human auditory system in teleoperator systems that make use of nonanthropomorphic telerobots. Knowledge of how well and how fast individuals can adapt to distortions or transformations of these types under various training conditions is essential to the design of effective systems. Other closely related examples focus on the use of the auditory channel for sensory substitution purposes, e.g., the presentation of auditory signals to substitute for haptic feedback that is difficult to provide because of current equipment limitations.
Another major area in the perceptual domain that requires substantial work falls under the heading of auditory information displays. Current knowledge of how to encode information for the auditory channel in ways that produce relatively high information transfer rates with relatively small amounts of training is still rather meager. It is obvious that encoding all information into speech signals is inappropriate, and the statement that the encoding should be "natural" is simply not adequate to guide specific design choices. This general encoding and display problem is judged to be both important and difficult to solve. Also, it is believed that progress in this area will depend, at least in part, on an improved understanding of auditory scene analysis.
The main technology area that requires attention concerns the computer generation of sounds: software and hardware are required to enable VE system users to specify and generate the desired acoustic behavior of the objects and agents in the VE under consideration. Physical modeling of environmental sounds, together with the development of appropriate mathematical approximations and of software and hardware that enable such sounds to be computed and rendered in real time, constitutes a major task in this area. Subsequently, it will be necessary to develop a user-friendly system for providing speech, music, and environmental sounds all in one integrated package. And eventually, of course, the whole sound generation system will have to be appropriately integrated with the software and hardware system for generating visual and haptic images. Compared with these sound generation problems, most of the remaining technological problems in the spatialization area for both HMDs and off-head displays appear relatively minor. It should be noted, however, that current technology is still unable to provide interactive systems with real-time rendering of acoustic environments with complex, realistic room reflections.