5
Auditory Factors in the Design of Displays and Controls

The ambient sound environment of the dismounted soldier is likely to be extremely varied. At one extreme, the background noise will be so loud as to preclude any sensible message transmission, either incoming or outgoing. Such noise can be a serious source of stress (see Chapter 6) as well as an interference to communication. At the other extreme is the need for surreptitious activity in a quiet environment, in which any audible sound generated by the soldier is to be avoided for security reasons.

Either of these ambient conditions can restrict the utility of auditory subsystems. Still, high visual channel loadings and priority occupation of the hands mean that auditory displays and controls may offer real advantages.

AUDITORY DISPLAYS

The visual channel is the mode of choice for providing information at high rates to the dismounted infantry soldier. However, in certain tasks and situations an auditory display may be more appropriate. Auditory displays are frequently used for alerting, warnings, and alarms, that is, situations in which the information occurs randomly and requires immediate attention. The near omnidirectional character of auditory displays is a major advantage over other types of helmet-mounted displays. Table 5-1 summarizes some of the factors to consider when making a choice between an auditory and a visual display.

The individual soldier's computer/radio is the main source of the auditory information for the Land Warrior System. It is currently envisioned that the auditory displays will be presented to the soldier via a headset mounted in the helmet, although they also have the capability of interfacing to a handset. The auditory displays are currently envisioned to be either monaural with the handset or biaural [1] with the integrated headset. In this chapter we discuss the characteristics of auditory displays as well as some specific guidelines for their design.

[1] Biaural presents the same signal in both ears. This is not the same as stereo presentation. With stereo presentation, the signals are not the same in both ears.

TABLE 5-1 When to Use the Auditory Versus Visual Form of Presentation

| Use auditory presentation if: | Use visual presentation if: |
|---|---|
| The message is simple. | The message is complex. |
| The message is short. | The message is long. |
| The message will not be referred to later. | The message will be referred to later. |
| The message deals with events in time. | The message deals with location in space. |
| The message calls for immediate action. | The message does not call for immediate action. |
| The visual system of the person is overburdened. | The auditory system of the person is overburdened. |
| The receiving location is too bright, or dark adaptation integrity is necessary. | The receiving location is too noisy. |
| The person's job requires him or her to move about continually. | The person's job allows him or her to remain in one position. |

Source: Deatherage (1972: Table 4-1).

Detectability of a Sound

An auditory signal is detected with increasing probability as the level of the sound increases. The masked threshold is defined as the level required for 75 percent correct detection of the signal when presented to a listener in a two-interval task. In a two-interval task, the listener hears two noise intervals, one of which (chosen at random) contains the signal, and reports which interval contained it.

Some guidelines on setting the level of an auditory display are based on Sorkin (1986):

- A signal 6 to 10 dB above the masked threshold allows near-perfect detection in controlled conditions.
- Signal levels 16 dB above the masked threshold will be sufficient for situations requiring a rapid response to a signal, such as a warning signal.
- The level of an auditory warning signal should be less than 30 dB above the masked threshold, in order to minimize operator annoyance and the disruption of communications.
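As a rough illustration of how the level guidelines in this section might be applied, the following Python sketch maps a measured masked threshold to a candidate signal level. The function and constant names are hypothetical, the use of the 10 dB upper edge for signals that do not need a rapid response is an assumption, and the 115 dB check reflects this section's advice to consider nonauditory channels when higher levels would be required.

```python
# Offsets in dB above the masked threshold, taken from the guidelines in this section.
DETECT_LOW_DB = 6            # lower edge of the 6 to 10 dB band for near-perfect detection
DETECT_HIGH_DB = 10          # upper edge of that band (assumed default for non-urgent signals)
RAPID_RESPONSE_DB = 16       # offset suggested for signals requiring a rapid response
ANNOYANCE_LIMIT_DB = 30      # warning signals should stay below this offset
MAX_AUDITORY_LEVEL_DB = 115  # above this level, nonauditory channels should be considered


def classify_offset(offset_db):
    """Label an offset (signal level minus masked threshold) against the guidelines."""
    if offset_db < DETECT_LOW_DB:
        return "below guideline"
    if offset_db < RAPID_RESPONSE_DB:
        return "detectable"
    if offset_db < ANNOYANCE_LIMIT_DB:
        return "rapid-response"
    return "excessive"


def recommend_level(masked_threshold_db, rapid_response=True):
    """Return (level_db, use_auditory) for a hypothetical display controller.

    use_auditory is False when the recommended level would exceed 115 dB.
    """
    offset = RAPID_RESPONSE_DB if rapid_response else DETECT_HIGH_DB
    level_db = masked_threshold_db + offset
    return level_db, level_db <= MAX_AUDITORY_LEVEL_DB


if __name__ == "__main__":
    for threshold in (40.0, 70.0, 105.0):
        print(threshold, recommend_level(threshold))
```

In a field system the masked threshold would itself have to be estimated from the ambient noise, which this sketch treats as a given input.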

Nonauditory channels should be considered for environments that require sound levels above 115 dB.

Two factors that affect the determination of the masked threshold are the spectrum and duration of the interfering sound. More masking occurs when the signal and the interference are close in frequency, especially when the frequency of the interference is below that of the signal. As the intensity of the interference increases, the effect spreads to additional signal frequencies. When the signal is shorter than 100 msec, the masked threshold depends on the signal energy rather than the signal power, and such a signal is therefore more difficult to detect than a longer one.

In the operational environment of Land Warrior, the determination of a single masked threshold and therefore a single level for the auditory display may be difficult. The acoustic environment has a wide variety of interfering sounds that may or may not be present at any one time. The environment may range from complete quiet to the roar of battle, with both wide-band noise and impulse noise from weapons fire and explosions. An adaptive system that monitors the acoustic environment and sets the warning level appropriately should be investigated.

Tonal Displays

A sound composed of the harmonically related components of 1,250, 1,500, 1,750, and 2,000 Hz has the same pitch as a single component of 250 Hz. This low-frequency pitch is perceived even though no signal energy is present at low frequencies and even in the presence of interfering noise at low frequencies. This so-called missing fundamental frequency is heard because of the sensitivity of the auditory system to the harmonic structure of sounds. This harmonic pitch provides a useful code for auditory displays. The missing fundamental is stable and relatively insensitive to the relative levels of the component frequencies or to masking of some components, provided there is a sufficient number of components.
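The arithmetic behind the missing fundamental in the example above can be checked with a short Python sketch. It assumes, as a simplification, that the implied pitch of an exactly harmonic complex is the greatest common divisor of its component frequencies; real pitch perception is more tolerant and more complicated than this, and the function names are hypothetical.

```python
from functools import reduce
from math import gcd


def implied_pitch_hz(components_hz):
    """Greatest common divisor of integer component frequencies (simplified pitch model)."""
    return reduce(gcd, components_hz)


def harmonic_numbers(components_hz):
    """Harmonic number of each component relative to the implied fundamental."""
    f0 = implied_pitch_hz(components_hz)
    return [f // f0 for f in components_hz]


components = [1250, 1500, 1750, 2000]
print(implied_pitch_hz(components))          # 250
print(harmonic_numbers(components))          # [5, 6, 7, 8]
# Dropping one component (for example, if it is masked) leaves the implied pitch unchanged.
print(implied_pitch_hz([1250, 1500, 2000]))  # 250
```

The last line mirrors the point made in the text: the implied pitch survives the loss of a component as long as enough harmonics remain.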
Specific design criteria for tonal displays are informed by Patterson (1982):

- The pitch of warning sounds should be between 150 and 1,000 Hz.
- Signals should have at least four dominant frequency components, within the first 10 harmonics, in order to minimize masking effects, minimize pitch and quality changes during masking, and maximize the number of distinctive signals that can be generated.
- Signals should have harmonic rather than inharmonic spectra.
- Lower-priority warning signals should have most of their energy in the first five harmonics.
- Higher-priority, immediate-action signals should have more energy in harmonics 6 through 10.
- High-priority signals can be made distinctive by adding a small number of inharmonic components.

The frequency range of the signal should be restricted to 500 to 5,000 Hz, with the dominant frequency components within the range of 1,000 to 4,000 Hz.

Temporal form and shape of auditory displays are important factors for detectability, coding, and listener reaction. The following guidelines for this dimension of auditory displays are based on Patterson (1982):

- Near-optimal envelope parameters are a minimum of 100 msec duration, a 25 msec rise and fall time, and quarter-sine shaping.
- Onset rates should be less than 1 dB/msec, with the final level falling below 90 dB.
- Use a variety of temporal patterns in order to minimize confusion.
- Code urgency or priority with pulse rate (i.e., code high urgency with high pulse rate).

In a study of aircraft warning signals, Patterson and Milroy (1980) found that, although large sets of warnings can be learned, considerable learning time and regular retraining are required. Patterson's (1982) recommendations are that no more than six immediate-action warning signals and two attention signals should be used, provided distinctive temporal and spectral patterns are used, the perceived urgency of the warnings matches their priority, and warning sounds are followed by keyword speech warnings. An attention signal is a special warning sound that indicates the priority level of the warning that follows.

Speech Displays

For the dismounted infantry soldier, especially one with a helmet-mounted display, speech displays should be considered as a means to relieve the possible information overload of the visual channel. Deatherage (1972) gives the following reasons for using speech rather than other auditory signals in auditory displays:

- Flexibility.
- Ability to identify the message source.
- Listeners don't have or need special training in coded signals.
- Rapid, two-way exchange of information is required.
- Messages deal with the preparation for a future event.
- Situations of stress might cause the listener to forget the meaning of a coded message.

Issues that need to be addressed for speech displays are: (1) What are the optimal ways to generate speech displays? (2) What is the best way to integrate them with other displays and tasks? In this section we present some general principles and guidelines for speech displays, which are based largely on studies of cockpit displays but should be appropriate to the Land Warrior System.

Intelligibility is the most commonly used performance measure for speech displays and speech communications systems in general. Speech intelligibility is the percentage of utterances correctly recognized by the listener from a set of utterances presented under a given listening condition. Several intelligibility tests have been developed and are commonly used, including the Modified Rhyme Test (House et al., 1965), the Diagnostic Rhyme Test (Voiers, 1977), and tests based on phonetically balanced words. Intelligibility measures can be used to determine how sensitive speech displays will be to disruption by noise.

Determining the appropriate level for speech displays is more involved than for other auditory displays. Considerable speech information is carried in the consonant sounds. These sounds have shorter durations, higher frequencies, and lower power than vowels. In a high-noise environment, the level of the vowels must be much higher than the background noise in order for the consonants to be detectable. In such situations, preprocessing the speech with either a 3 dB/octave boost or peak clipping can improve the intelligibility of the speech in noise.

Synthetic speech systems allow considerable control of speech parameters such as pitch, speech rate, sex, and accent. This allows the generation of speech displays whose characteristics are distinct from those of speech heard over communications channels. One drawback to synthetic speech is that it is much more sensitive to the effects of linguistic and task context, the level of operator training, background noise, and other manipulations of the spectrum. Pisoni (1982) has shown that synthetic speech may require more attentional resources than natural speech.
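The two preprocessing options mentioned earlier in this section, a 3 dB/octave boost and peak clipping, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1,000 Hz reference frequency, the clipping ratio, and the function names are hypothetical, and the boost is shown as a gain curve rather than as a working filter.

```python
import math


def boost_db(freq_hz, ref_hz=1000.0, slope_db_per_octave=3.0):
    """Gain in dB at freq_hz for a constant dB/octave boost that is 0 dB at ref_hz."""
    return slope_db_per_octave * math.log2(freq_hz / ref_hz)


def peak_clip(samples, clip_ratio=0.25):
    """Clip samples at clip_ratio times the peak, then rescale to the original peak.

    After rescaling, low-level segments (such as consonants) sit closer to the
    level of high-level segments (such as vowels).
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    limit = clip_ratio * peak
    clipped = [max(-limit, min(limit, s)) for s in samples]
    return [s * (peak / limit) for s in clipped]


if __name__ == "__main__":
    for f in (500.0, 1000.0, 2000.0, 4000.0):
        print(f, boost_db(f))
    print(peak_clip([0.1, -1.0, 0.5]))
```

With a clipping ratio of 0.25, a sample at one tenth of the peak ends up at four tenths of the peak after rescaling, which is the sense in which clipping raises weak sounds relative to strong ones.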
One ongoing debate in the design of speech displays concerns the message format and the use of auditory alerts preceding the speech message. The major questions relate to the relative effectiveness of monosyllabic versus polysyllabic words, keywords versus sentences, and speech messages with or without preceding alert tones. For both natural and synthetic speech, polysyllabic words are more intelligible than monosyllabic words. Similarly, words in sentences are more intelligible than words in isolation. The context provided by the additional syllables in the words and/or the words in the sentence increases the redundancy and improves intelligibility.

Because of the ongoing use of the speech channel for other purposes and the difficulty of communicating over a long speech warning, Patterson (1982) advocates limiting the use of speech warnings for immediate-action emergency conditions. He suggests the use of sentence-length speech messages when signaling abnormal conditions that are less time critical. Sentence messages are also useful when disruptions are possible and the number of alternative messages is large. Patterson's integrated approach to auditory warnings is presented in Table 5-2.

TABLE 5-2 Integrated Auditory Warning System

| Priority | Purpose | Result | Description |
|---|---|---|---|
| Highest | Emergencies | Immediate | Tone sequences with keyword warning |
| Second | Abnormality | Immediate awareness | Specific tone prefix with voice message |
| Third | Advisory | Check visual display | Specific auditory signal |

Simpson and colleagues (1986), using a different design philosophy, recommend the use of speech only for the most time-critical warnings, the use of a distinctive voice, no nonspeech alerting signals, repetition of the speech warning only after the operator has had enough time to correct the problem, and speech messages of at least four syllables; the shorter 4- to 8-syllable messages present information more rapidly and cause less interference with other communication. In their view, the distinctive voice with a machine quality precludes the need for an alerting tone prefix. Patterson's philosophy is that the voice message is a backup to the aural signal and may provide advisory information. For different types of tasks, one or the other design philosophy may be more effective. Table 5-3 can be used as a guide to select which functions are appropriate for the different types of auditory displays.

TABLE 5-3 Functional Evaluation of Audio Signals

| Function | Tones (periodic) | Complex sounds (nonperiodic) | Speech |
|---|---|---|---|
| Quantitative indication | POOR. Maximum of 5 to 6 tones absolutely recognizable. | POOR. Interpolation between signals inaccurate. | GOOD. Minimum time and error in obtaining exact value in terms compatible with response. |
| Qualitative indication | POOR-TO-FAIR. Difficult to judge approximate value and direction of deviation from null setting unless presented in close temporal sequence. | POOR. Difficult to judge approximate deviation from desired value. | GOOD. Information concerning displacement, direction, and rate presented in form compatible with required response. |
| Status indication | GOOD. Start and stop timing. Continuous information where rate of change of input is low. | GOOD. Especially suitable for irregularly occurring signals (e.g., alarm signals). | POOR. Inefficient; more easily masked; problem of repeatability. |
| Tracking | FAIR. Null position easily monitored; problem of signal-response compatibility. | POOR. Required qualitative indications difficult to provide. | GOOD. Meaning intrinsic in signal. |
| General | Good for automatic communication of limited information. Meaning must be learned. Easily generated. | Some sounds available with common meaning (e.g., fire bell). Easily generated. | Most effective for rapid (but not automatic) communication of complex, multidimensional information. Meaning intrinsic in signal and context when standardized. Minimum of new learning required. |

An additional capability with the use of speech displays is the ability to store messages for playback at a later time (i.e., voice mail). This would provide the soldier with the ability to save incoming voice messages when he is not able to attend to them or to save important messages he may want to review again. Some indication that messages are stored would be needed, like the message light on an answering machine. Not all soldiers may want or need this kind of auditory display. It may be more appropriate for platoon or squad leaders than for individual squad members.

Three-Dimensional Auditory Displays

Spatial hearing technology has developed to the point at which it is now possible to present spatial information to a listener using headphones. The applications of three-dimensional auditory displays to the infantry environment include: indicating the location of other soldiers, threats, and targets; introducing spatial separation among communications channels; and providing an auditory beacon as a navigation aid.

Modern views of spatial hearing suggest that localization judgments depend on three classes of acoustic cues: interaural time differences, interaural level differences, and direction-specific frequency shaping of the high-frequency spectra introduced by the head and pinnae. Kistler and Wightman (1992) argue that listeners determine the laterality of sounds based on interaural time differences and interaural level differences and distinguish between front and back and between up and down on the basis of spectral cues. Wightman and Kistler (1992) showed that low-frequency interaural time differences are the dominant cue for sound localization. Therefore, for signals to be localized easily, they should include frequency components that spread across the entire spectrum.

This sensitivity to interaural time and intensity differences also enables the human listener to detect signals that otherwise would be buried in the noise. One can demonstrate as much as a 15 dB advantage in detection when there are interaural differences in the signal and noise input to each ear. That is, the level of the signal can drop 15 dB below the level detected with identical inputs to both ears or with monaural listening. In addition to improving detectability, these cues also facilitate improved processing of speech messages in noise. Speech intelligibility studies by Bronkhorst and Plomp (1988, 1992) have shown a 6 to 10 dB advantage with speech at 0 degrees azimuth and noise off-axis compared with the control condition of speech and noise at 0 degrees azimuth.

Spatial hearing also enables us to selectively attend to spatially separated conversations in a crowded, noisy room, the so-called cocktail party effect. (In such an environment, covering one ear tightly will cause a sudden decrease in the ability to hear separate conversations.) Ericson and McKinley (1996) studied speech intelligibility when two competing messages were presented spatially separated via headphones. Results showed a 10 to 40 percentage point improvement in intelligibility when compared with the control condition of both messages presented at 0 degrees azimuth.
Interestingly, the greatest improvement occurred for messages in which both competing talkers were female and the least improvement when both talkers were male.

To support the use of three-dimensional audio displays for the infantry soldier, two additional technologies are required: stereo headphones and head trackers. In order to minimize acoustic isolation, the stereo headphones should provide a minimum of attenuation to external signals (i.e., they should be acoustically transparent). The ear-rest-type earphones of the proposed Land Warrior System are not stereo and may cause significant attenuation and discomfort when inserted deep into the ear canal to maintain stability during the strenuous maneuvers the soldier will experience.

In another related concept that has been proposed, audio speakers are suspended inside the helmet shell. This arrangement would provide no interference with unaided hearing. The disadvantage is acoustic transmission of the auditory signal to the environment; this would be a significant disadvantage during covert operations. The arrangement would also cause problems for three-dimensional audio displays, since current systems require circumaural stereo headphones or ear insert devices. At least one manufacturer of 3DAD systems makes an acoustically transparent headset with which environmental signals such as speech can be easily heard. Whether this technology would provide the same capability as unaided hearing is an area that should be investigated. Also, the ability of such a headset technology to operate in a rugged military environment is still to be determined.

A means of determining the soldier's head position and orientation is required to most effectively utilize three-dimensional audio displays. This provides the capability to fix the auditory signal in space as the soldier rotates his head. This capability may also be of value for some types of visual information displays. The most common forms of head trackers use magnetic or ultrasonic sensing. Head trackers of this form are commonly used in virtual reality and cockpit applications in which there is a fixed range over which the transmitter and receiver can operate. New technologies being developed use miniature gyroscopes or sensors of the earth's magnetic field.

Table 5-4 summarizes the advantages and disadvantages of monaural, biaural, stereo, and three-dimensional audio displays.

TABLE 5-4 Advantages and Disadvantages of Monaural, Biaural, Stereo, and Three-Dimensional Audio Displays (3DADs)

| Display | Advantage | Disadvantage |
|---|---|---|
| Monaural | Inexpensive; single earphone | Sounds cannot be localized; reduced signal detection; reduced speech intelligibility |
| Biaural | Small increase in signal detection | Headphones required |
| Stereo | Increased signal detection; increased speech intelligibility; minimal increase in complexity | Headphones required |
| 3DAD | Improved signal detection; improved speech intelligibility; signals localizable; multiple channels monitoring; way point navigation | Increased complexity; headphones required |

When combined with the global positioning system, the three-dimensional audio display could present the radio communications of an individual's squad members in the direction they are relative to him. This would improve situation awareness and could reduce fratricide.
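To make the geometry of such a display concrete, the following Python sketch computes the direction from which a squad member's voice would need to be presented. It assumes positions are available as latitude and longitude in degrees and that a head tracker reports the head heading in degrees clockwise from north; the function names are hypothetical and the sketch ignores elevation and sensor error.

```python
import math


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360.0


def relative_azimuth_deg(head_heading_deg, target_bearing_deg):
    """Azimuth of the target relative to the head, in (-180, 180]; positive is to the right."""
    diff = (target_bearing_deg - head_heading_deg + 180.0) % 360.0 - 180.0
    return 180.0 if diff == -180.0 else diff


if __name__ == "__main__":
    # Listener at the origin facing 350 degrees; squad member due east of him.
    target = bearing_deg(0.0, 0.0, 0.0, 0.01)
    print(target, relative_azimuth_deg(350.0, target))
```

Recomputing the relative azimuth whenever the head heading changes is what keeps the sound fixed in space as the soldier turns his head; the same calculation would apply to a navigation beacon placed at a way point.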
Presenting each talker from a distinct direction would also be advantageous for the desired capability called conversation mode communications, which is defined as the capability for three or more stations on the radio network to communicate simultaneously with each other. Another application would be way point navigation. A tone could be presented in the desired direction of movement; the soldier would follow the tone until he reached the desired way point; at that time, the tone would either cease, indicating the desired position was reached, or another tone would be displayed indicating the new direction to follow. The same technique could be used to designate targets for directing fire. Each of these applications is being investigated by the Air Force for use in fighter cockpits.

For the Land Warrior System, not all soldiers may need this capability. For example, platoon or squad leaders could have the capability to locate their squad members, whereas special operations forces or scouts would need the navigation capability during night operations or adverse weather.

AUDITORY CONTROLS

In a broad sense, an auditory control is a machine activated by any sound. However, for the helmet-mounted display we focus our discussion on voice commands from the user and the attendant need for voice recognition capabilities. For the dismounted infantry soldier, speech recognition can be of considerable benefit in concurrent task situations: (1) when both hands are engaged (such as a soldier aiming a weapon); (2) when the eyes are engrossed in visual processing and cannot easily change gaze for manual data entry (e.g., forward fire control); and (3) when even a single hand is engaged in a manual task (e.g., squad leader deploying troops). It has been reported that speech recognition is as good as or better than keyboard data entry, although error rates may be higher depending on the specific system, vocabulary, and task situation (error rates as low as 1 percent have been reported).

However, speech recognition is not without its drawbacks. Feedback from a failure to correctly recognize an utterance may disrupt the concurrent activities, a disruption that speech recognition was intended to eliminate. Moreover, the same conditions that limit auditory displays (that is, either loud ambient noise or surreptitious operations) will strictly limit what voice commands the soldier can articulate and how.
Since the circuitry needed to allow machine recognition has some cost and adds weight to the soldier's pack, the cost-benefit determination is uncertain. A step toward reducing that uncertainty can be provided by a brief exploration of the technology for voice recognition subsystems.

Types of Speech Recognition Systems

Speech recognition systems are classified along several dimensions: (1) the number of speakers they recognize, (2) the type of speech they recognize, and (3) the size of the vocabulary. The following paragraphs address each of these dimensions and some associated trade-offs in turn.

Speaker-dependent systems recognize speech from only one speaker. Speaker-independent systems recognize speech from many speakers. The performance of speaker-dependent systems is generally better than that of speaker-independent systems. Speaker-adaptive systems start out with speaker-independent templates and adapt those templates to the current speaker and therefore approach the performance levels of speaker-dependent systems.

The next dimension is based on how the speech recognition system handles word boundaries. Isolated-word recognition systems require a pause of 100 to 250 msec or greater between words. Connected-word recognition systems require a very short pause between words. Continuous-speech recognition systems require no pause between words and accept fluent speech.

The last dimension is based on the number of words the system can recognize. Vocabulary size is generally classified as small (less than 200 words), large (1,000 to 5,000 words), very large (5,000 words or greater), and unlimited (greater than 64,000 words).

In general, the computational resources required to perform speech recognition increase across each of these dimensions. For example, a speaker-dependent, isolated-word, small-vocabulary system would require the least computational resources.

Word recognition accuracy is the most commonly used measure of performance for speech recognition algorithms. It is simply the percentage of words correctly recognized. Errors can be assigned to one of three categories: (1) substitution error (one word is recognized as another); (2) insertion error (a word is recognized that was not spoken); and (3) deletion error (a word spoken was not detected). An analysis of the errors provides feedback on the vocabulary design. For example, consistent substitution errors may imply that those words are easily confused and the vocabulary should be modified. In general, the word recognition accuracy decreases across each of the above dimensions. For example, as the vocabulary size increases, the word recognition rate decreases.
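The error categories above can be counted automatically by aligning the spoken word sequence with the recognized one. The following Python sketch uses a standard minimum-edit-distance alignment, which is an assumption about how the scoring would be done rather than a method described in this chapter; the function names and the example phrases are hypothetical.

```python
def align_counts(reference, hypothesis):
    """Return (errors, substitutions, insertions, deletions) for a minimum-edit alignment."""
    n, m = len(reference), len(hypothesis)
    # table[i][j] holds the counts for reference[:i] aligned with hypothesis[:j].
    table = [[None] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        table[i][0] = (i, 0, 0, i)      # every reference word deleted
    for j in range(1, m + 1):
        table[0][j] = (j, 0, j, 0)      # every hypothesis word inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, ins, d = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                match_or_sub = (e, s, ins, d)
            else:
                match_or_sub = (e + 1, s + 1, ins, d)
            e, s, ins, d = table[i - 1][j]
            deletion = (e + 1, s, ins, d + 1)
            e, s, ins, d = table[i][j - 1]
            insertion = (e + 1, s, ins + 1, d)
            table[i][j] = min(match_or_sub, deletion, insertion, key=lambda t: t[0])
    return table[n][m]


def word_accuracy(reference, hypothesis):
    """Percentage of reference words correctly recognized."""
    _, subs, _, dels = align_counts(reference, hypothesis)
    return 100.0 * (len(reference) - subs - dels) / len(reference)


if __name__ == "__main__":
    spoken = "move to grid four".split()
    recognized = "move two grid".split()
    print(align_counts(spoken, recognized))   # (2, 1, 0, 1)
    print(word_accuracy(spoken, recognized))  # 50.0
```

Note that accuracy defined as the percentage of words correctly recognized does not penalize insertions, so the insertion count should be reported alongside it.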
Environmental Issues

Most applications of speech recognition have been in relatively benign environments such as office or laboratory settings where the background noise is low. Speaker-dependent isolated-word and connected-word systems with 30- to 150-word vocabularies can provide greater than 95 percent accuracy in cockpit environments with high noise. Whereas previous work has characterized the acoustic environment for military weapons systems such as tanks, helicopters, and aircraft and examined the effects on speech recognition performance, little such work has been done for the battlefield.

For the dismounted infantry soldier, other environmental factors are also present that will change the way he speaks and could degrade speech recognition performance. The effects of physical and mental fatigue, sleep deprivation, and physical exertion on speech recognition performance have had little or no systematic study.

Human Factors Issues

Several human factors issues involved in the design of a speech recognition system are (1) vocabulary size, (2) vocabulary selection and syntax, (3) operator training, and (4) system training.

Vocabulary size is determined by computational and storage constraints on one hand and the required task on the other. Large vocabularies often lower the accuracy of a system because there is a greater possibility of error. The vocabulary should be designed so that it does not contain easily confused words. The vocabulary should be chosen so that it uses terminology that is familiar and common to the user in that task.

The vocabulary syntax can be designed to provide a greater effective vocabulary size by enabling a subvocabulary at each point in the command sequence. This also enables high recognition accuracy because there is a smaller number of words to choose from at each point. The syntax should also be chosen so that it is familiar to the user performing the task.

For isolated-word systems, the operator must learn to insert a pause between each word. For speaker-dependent systems, the system itself must be trained for each speaker with 4 to 10 repetitions of each word in the vocabulary. Thus, for large vocabularies, hundreds of words must be spoken; a 150-word vocabulary, for example, would require 600 to 1,500 utterances. This sort of training is very tedious and puts practical limits on the vocabulary size.

CONCLUSIONS AND DESIGN GUIDELINES

When the visual channel is heavily loaded, as is a distinct possibility in the Land Warrior System, the auditory channel may be an appropriate alternative way to convey information to the infantry soldier. The auditory channel has naturally evolved to serve a warning or alerting function. Issues to be addressed when using auditory displays are masking and interference by other signals, confusability between signals, training requirements, and signal localization.

The impact on unaided hearing depends on the type of auditory display chosen.
If monaural systems are chosen, there will be no interference with unaided hearing. A more problematic case is represented by the use of circumaural stereo headphones. Both laboratory and field experiments need to be conducted to determine the impact on overall task performance for the types of auditory displays being considered for use in the helmet-mounted display.

An overall issue to be addressed when using multiple displays is what information should be placed on what display and when. Are there certain tasks or stress situations when this allocation should change? (See Chapter 6 for additional discussion on these questions.) Little research has been conducted on the relationship between different display modalities (i.e., auditory and visual), the physical environment, and workload. Understanding this relationship and the trade-offs involved is critical for the successful use of such a complex system as the Land Warrior System, which will be used in a highly dynamic and wide-ranging set of environmental and workload conditions. This section has presented guidelines for the development and use of auditory displays. Specific studies for the Land Warrior System need to be performed to validate the guidelines for the particular conditions of use.

On the output side, speech recognition might provide a means of control and information input not presently available for the many hands-busy, eyes-busy tasks in which the dismounted infantry soldier is engaged. Although speech recognition technology has made significant strides in the past several years toward large-vocabulary, speaker-independent, continuous-speech systems, these advancements may not be practical in the near term for the dismounted soldier because of the computational resources required. Each type of speech recognition system has unique human factors and technology issues that will influence the final system design and performance. Trade-offs between the available computational resources and task requirements will need to be studied.