The Visual Channel
The visual interface will, in many cases, provide the human user with the most salient and detailed information regarding the synthetic environment (SE). Given current technical limitations on the real-time display of detailed, continuous-motion imagery, making optimal use of normal visual sensory capabilities constitutes a formidable challenge for system designers. Of the many display options that currently exist, none is completely adequate across all applications.
STATUS OF THE RELEVANT HUMAN RESEARCH
It is clear that the limits imposed by technological constraints on stimulation interact with the information-processing capabilities of the human observer. Consequently, the effectiveness of current technology must be assessed against knowledge of the full range of human sensory capabilities. A proper understanding of both the technological and sensory aspects of visual displays defines the challenge for subsequent research and development (Boff et al., 1986; Haber and Hershenson, 1980; Sekuler and Blake, 1990).
An effective visual interface provides an appropriate match of the parameters of stimulation to the characteristics of the sense organ. The production of visual stimuli in such a system depends in part on technological capabilities for monitoring movements of the observer (to generate both movement-contingent and noncontingent sensory information), graphics-processing power, and display characteristics. For a visual interface
to be effective in an SE context, the sensory information need not exceed, and may often be less than, the sensory-processing capabilities of the average human observer. The latter observation is borne out by the experience of people with deficient vision (or hearing): people with many kinds of sensory disabilities can still experience both real and virtual environments (VEs). Indeed, perturbations of normal visual stimulation may be disturbing but not necessarily destructive of perception in either real or synthetic environments. When sensory conditions are maintained with minimal variation, various forms of adaptation can occur and can be used to advantage in overcoming technical limitations (Held and Durlach, 1991).
SE-Relevant Aspects of Visual System Organization
The superficial similarity between the human visual system and created imaging systems (such as a television camera) has the potential to misguide efforts to advance the state of the art of the visual channel of SE systems. One example of a fundamental difference between the design goals of the two systems lies in the manner in which images are collected. Created systems generally strive to collect a uniformly resolved image of a scene. For applications in which any area of the scene might be attended to (by the ultimate viewer), this is an appropriate goal. However, the ideal proximal stimulus in an SE system is one in which the information is presented in a manner that is complementary to the normal operation of the sense organ. In the case of the eye, an optimal system would not ignore the role of eye motions, the uneven distribution of photoreceptors in the retina, and the limitations of the eye's optics (Westheimer, 1986). Because the eyes are used to actively probe a scene (Gregory, 1991) in a way somewhat similar to an insect probing the world with antennae, the concentration of image quality at the center of the fixation point may be a highly appropriate use of resources. In a sense, the fovea is like the sensitive tip of a blind person's cane; the rest is just context.
Although these concepts have begun to be addressed by designers of eye-slaved displays, the design of the visual channel of SE systems is likely to benefit from more widespread conceptualization of the eye as an output device as well as an input device. There are a number of potential benefits of utilizing such fixation information. It can guide the allocation of resources by matching the information transfer rate to the receiver (e.g., high resolution can be limited to the fovea). It can reduce the computational burden of image generation in VE systems. The fixation point information collected can be used to adapt the information presented (in the case of teleoperator systems) or generated (in the case of VE systems) to more seamlessly match the interests of the viewer.
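The resource-allocation idea above can be sketched concretely. The code below is an illustrative model only, not a design from this report: it assumes a common simplification in which visual acuity falls off as 1/(1 + e/e2) with eccentricity e, with e2 = 2.3 deg and a 60 pixel-per-degree foveal requirement as assumed constants, and shows how an eye-slaved display could taper pixel density away from the fixation point.

```python
# Illustrative sketch (assumed model, not from the report): allocating
# display resolution by eccentricity, as an eye-slaved display might.
# acuity(e) = 1 / (1 + e / e2) with e2 = 2.3 deg is an assumed
# simplification; 60 px/deg foveal density is likewise an assumption.

def relative_acuity(eccentricity_deg, e2=2.3):
    """Fraction of foveal acuity remaining at a given eccentricity."""
    return 1.0 / (1.0 + eccentricity_deg / e2)

def pixels_per_degree(eccentricity_deg, foveal_ppd=60.0):
    """Pixel density needed to match the assumed acuity at this point."""
    return foveal_ppd * relative_acuity(eccentricity_deg)

for e in (0, 5, 10, 20, 40):
    print(f"{e:3d} deg eccentricity: {pixels_per_degree(e):5.1f} px/deg")
```

Under these assumptions, required pixel density at 20 deg eccentricity is roughly a tenth of the foveal figure, which is the computational saving fixation-point information makes available.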
The effects of spatiotemporal blur can be managed using fixation point information. In this case, the analogous phenomena of display persistence (due to the time constants of the imaging device) and retinal persistence or smear (due to the time constants of the receptors and light-adaptation mechanisms) have different perceived outcomes based on the mode in which they are used. Direct view of a panoramic display requires that moving images present in the display be carefully managed so that attention is not directed to motion blur. This usually involves creative application of cinematographic techniques (Spottiswoode, 1967). The human observer, however, remains largely unaware of motion blur because of the links between eye motion, retinal and higher-level "suppression," and the perceptual processing involved in active viewing of a scene (Matin, 1975; Matin et al., 1972). Thus, rapid target motions that elicit saccadic (ballistic) eye movements are not perceived as blurred because visual sensitivity to the target is reduced during the period beginning about 50 ms before the saccade is initiated to 50 ms after the new fixation point is reached (Latour, 1962). Slower target motions, in the range of 5 to 40 deg of visual angle per second, elicit pursuit eye movements and are much less subject to suppression effects, thus presenting a challenge for the display designer.
In the visual system there are many such opportunities to achieve the dual goals of more closely accommodating the sense organ while allocating technological resources economically. The development of color head-mounted displays (HMDs), for example, has presented difficulties due to the need to present three channels of chromatic information. Examination of the photoreceptor population of the retina reveals first that the color-sensing receptors (cones) are not evenly distributed throughout the retina. The greatest concentration of cones is found where the visual axis intercepts the retina, but beyond 10 deg of visual angle the cone density is uniform at about 5 percent of the central value. Moreover, the non-color-sensing rod population at this intercept point is zero but increases with eccentricity to a maximum at 18 deg. Thus rods are present in far greater density throughout the periphery than are cones, with the result that the resolution of color information is limited in the periphery (Hood and Finkelstein, 1986).
Cones come in three general types as defined by the photopigment present. By measuring the differential response of the three cone types, the visual system is able to determine (within a metameric equivalent) the wavelength of stimulation. But unlike the case of created imaging systems, the quantity and distribution of the different cone types are not uniform. Overall, the ratio of short (S), medium (M), and long (L) wavelength peak cones is about 3:7:14. Short cone representation in the retina is relatively sparse, drastically limiting the resolving power of the eye on
this color dimension. (It also follows that the foveal vision system is particularly dependent on the medium and long cones and relatively independent of the short cones and rods.)
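The phrase "within a metameric equivalent" can be made concrete with a toy calculation. The cone sensitivities and spectra below are invented round numbers, not measured data: the point is only that collapsing a spectrum onto three cone responses means physically different lights (metamers) can be indistinguishable to the eye, which is precisely what allows three-primary displays to work at all.

```python
# Toy illustration (assumed numbers, not measured data): trichromatic
# encoding collapses a spectrum to three cone responses, so physically
# different spectra ("metamers") can look identical.

CONE_SENSITIVITY = {          # response of each cone type in 4 wavelength bins
    "S": (1, 0, 0, 0),
    "M": (0, 1, 1, 0),
    "L": (0, 0, 1, 1),
}

def cone_responses(spectrum):
    return {cone: sum(s * p for s, p in zip(sens, spectrum))
            for cone, sens in CONE_SENSITIVITY.items()}

spectrum_a = (1, 2, 2, 1)
spectrum_b = (1, 3, 1, 2)     # a physically different light
print(cone_responses(spectrum_a))  # {'S': 1, 'M': 4, 'L': 3}
print(cone_responses(spectrum_b))  # identical: the two lights are metamers
```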
These nonhomogeneities provide another opportunity to match the needs of the visual system while achieving technological savings. Since the bandwidth for color information is not uniform throughout the retina, the image-generating component of the display system is relieved of a technological constraint.
There are about 120 million rod and 6.5 million cone photoreceptors per eye. This represents an extraordinary amount of information to be processed by the visual system. In order to accommodate this flow of information, initial image processing occurs at the retinal level. The mosaic of receptors is not connected one-to-one with the optic nerve fibers. In fact, there are only about 1 million fibers leaving the eye that contain the codified information from the over 126 million photoreceptors. Intermediate cells (horizontal, bipolar, and amacrine) make excitatory and inhibitory connections to groups of receptors in various spatial configurations. These connections result in receptive fields that preferentially respond to retinal excitation of specific spatial configurations and temporal qualities. (The duplex nature of the retina alluded to earlier also becomes evident in the connectivity pattern: rods are part of one network and the various classes of cones are part of another, with minimal synaptic communication between the two.) Once again we find that the visual system allocates the smaller, higher spatial resolution channels to the fovea.
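The retinal compression described above can be sketched in two steps: the roughly 126:1 ratio of receptors to optic-nerve fibers (figures from the text), and a toy one-dimensional center-surround receptive field, which is an assumed simplification of the intermediate-cell circuitry that responds to spatial change rather than to uniform illumination.

```python
# Sketch of retinal compression. The receptor and fiber counts are the
# figures quoted in the text; the receptive field below is a toy 1-D
# simplification of center-surround wiring, not a physiological model.

receptors = 126e6
fibers = 1e6
print(f"compression ratio ~ {receptors / fibers:.0f}:1")   # ~126:1

def center_surround(signal, i):
    """Excitatory center minus an inhibitory surround (assumed weights)."""
    return signal[i] - 0.5 * (signal[i - 1] + signal[i + 1])

flat = [5, 5, 5, 5, 5]          # uniform field: no response
edge = [5, 5, 5, 9, 9]          # luminance step
print([center_surround(flat, i) for i in (1, 2, 3)])
print([center_surround(edge, i) for i in (1, 2, 3)])   # responds at the edge
```

The uniform field produces zero output everywhere, while the luminance step produces output only near the edge; this preferential coding of change is one way the retina fits 126 million receptor signals into a million fibers.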
The rather long axons of the retinal ganglion cells constituting the optic nerve regroup into left and right optic tracts containing left field or right field information from both eyes. About 20 percent of these fibers split off from here to make synaptic connections at the superior colliculus (SC), while the remaining 80 percent terminate in the lateral geniculate nucleus (LGN). (Of the ganglion cells destined for the LGN, there are two classes: P ganglion cells make connections to parvocellular areas and M ganglion cells make connections to the magnocellular areas of the LGN.) These connectivity patterns yield two important insights into the functioning of the visual system that have relevance for the display designer.
The primary flow of information through the lateral geniculate nucleus is from the retina to the visual cortex. But, in addition to the initial processing of motion, color, etc., described above, the lateral geniculate nucleus also serves as a modulator of visual stimulation. The thalamus, of which the lateral geniculate nucleus is a part, generally serves as a sort of volume control for sensory stimulation. (There are corresponding nuclei of the thalamus that provide analogous modulatory control for most other senses.) In this capacity, feedback signals from the reticular activating
system provide control that is based on the general level of arousal, whereas signals from the visual cortex help direct visual attention.
The first insight based on the postretinal connectivity patterns concerns the emergence of separate temporally acute and spatially acute processing channels. These channels roughly correspond to the periphery and fovea, respectively. There are six layers observable in the structure of the LGN: four outer layers made up of small cells (parvocellular layers) and two inner layers containing large cells (magnocellular layers). The parvo cells, although slower, process finer details in the image and support color perception (opponent color connection patterns, etc.). The color-blind magno cells respond quickly and are involved with motion perception.
The disproportionately large representation of the fovea, along with the "receptive field" neural coding of higher visual features (including more abstract visual characteristics such as orientation and contour) repeats throughout the cortex and is a basic mechanism of information extraction in the visual system (Graham, 1989). These representations may have implications for creating appropriate image compression and generation algorithms in SE systems.
A second insight (based on connectivity patterns) that should be more carefully studied and exploited involves the ambient visual system. As mentioned above, 20 percent of fibers from the optic tracts converge on the superior colliculus. This midbrain structure has close links to lower autonomic centers responsible for emotion, arousal, and motor coordination. The control of eye movement depends in large part on these signals (Hallett, 1986). One important neighboring area, the pretectal region, controls the pupillary reflex. Also, vestibular and tactile inputs, as well as signals coming from muscle tension and joint position sensors, converge in various ways with these signals from the superior colliculus (and other brainstem nuclei) yielding a concrete feeling of bodily position and configuration that ultimately guides motion.
We have seen many dualities present in the visual system: rod versus cone, and their retinal distribution with respect to the fovea; P versus M ganglion cells, with the corresponding parvocellular and magnocellular pathways supporting the perception of detail and color versus motion; superior colliculus versus cortical processing, with cortical signals underlying conscious perception of visual images. This latter distinction supports the notion that there are two visual systems: the focal system and the ambient system. One example of the separateness of these two systems is the ability (often called blindsight) of cortically blind people to point to objects they cannot "see."
This focal-ambient duality lends support to Bridgeman's (1991) demonstrations of separate cognitive and motor-oriented maps. The implications for display designers include customizing the display based on its
intended function: spatial displays, such as representations of instruments and dials, are subject to motion-induced illusions and cognitive biases; displays designed to facilitate visually guided behavior (such as tracking or piloting tasks) are not subject to these illusions, but are subject to constraints imposed by the limited memory for this information and so must be continuously displayed.
The integration of vestibular and visual information leads to interactions of form and orientation that can be exploited by the display designer (Mittelstaedt, 1991). Given the technical difficulties involved in artificially communicating inertial and gravitational cues in an SE, and considering the potential for enhancing the effectiveness of environments emphasizing orientation and motor control (as almost any application of SEs does), the area of ambient visual system effects should be explored. In addition, further knowledge of the autonomic effects of the ambient visual system is likely to be instrumental in mitigating the sopite syndrome.
The implications of the technical factors presented above for immersion and performance are not well known. Given the present limitations in generating veridical displays, much of what is known about vision will not be accommodated by the technology in the near term. However, if we are to realize the full potential of the visual interface, we must not let display development be driven solely by current trends in electronic design.
Sensory Constraints for Displays
Spatial distortion and image degradation may be caused either by inadequate channel capacity in the display unit or by inadequate optical arrangements between the display and the user's eyes. Channel capacity sets a fundamental limit on all attributes of the displayed imagery, including spatial resolution, temporal update rates and lags, contrast, luminance, and color resolution. Optical distortions can arise in connection with (among other things) the choice of focal distance, the stereoscopic arrangement, the alignment of the optic axis, and the relation between convergence and accommodation.
Special optical arrangements are required to transfer the image on the display screen to the image on the retina of the human observer. Undesirable distortions can result from this transfer, especially when the display screens are positioned just a few centimeters from the eyes (as in most virtual environment displays). To project the image on a tiny display screen into a real-world-size image on the retina, a high degree of optical magnification is required. Furthermore, at a fixed close viewing distance, the larger the magnification, the greater the geometric distortion (particularly
at the greater eccentricities that can arise in displays with large fields of view).
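The magnification problem can be quantified with a simple-magnifier model, which is an assumption; real HMD optics are more complex. If a display of width w sits at the focal plane of a lens of focal length f, the collimated field of view is 2·atan(w/2f): shrinking f to widen the field pushes rays off-axis, which is where geometric distortion grows. The 50 mm display width below is a hypothetical figure.

```python
import math

# Simple-magnifier sketch (assumed model): display of width w at the
# focal plane of a lens with focal length f subtends FOV = 2*atan(w/2f).
# The 50 mm display width is hypothetical, not a figure from the text.

def fov_deg(display_width_mm, focal_length_mm):
    return math.degrees(2 * math.atan(display_width_mm / (2 * focal_length_mm)))

for f in (80, 40, 20):   # shorter focal length -> greater magnification
    print(f"f = {f:2d} mm: FOV = {fov_deg(50, f):5.1f} deg")
```

Halving the focal length roughly doubles the field of view at small angles but less than doubles it at large angles, and the wide-field configurations are exactly the ones in which the off-axis distortions noted above must be corrected.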
In general, the optics must allow for clear focusing, minimization of geometric distortion, and off-axis eye movements without significant image vignetting. The optics must also be lightweight and easily adjustable for a wide variety of human users. In the case of stereoscopic displays, great care must be taken to ensure adequate alignment of the images provided to the two eyes and minimization of the inherent conflict between ocular convergence and accommodation (e.g., see Miyashita and Uchida, 1990). For example, in the case of near-field imagery, the image disparity between the two eyes will necessarily be large to elicit the proper stereo depth effect. In natural viewing this disparity in the near field is accompanied by an appropriate level of ocular convergence and lens accommodation. In a collimated-optics HMD, these additional depth cues are not present, so the percept is degraded.
Temporal resolution limitations center on three separable factors: frame rate, update rate, and delays (Piantanida et al., 1993). Frame rate is related to the perception of flicker and is hardware-determined. Because the human temporal contrast sensitivity function (CSF) shifts toward higher frequencies in response to increases in the brightness level of the display, higher light levels result in greater flicker sensitivity. Although models of the CSF can predict the human response to these stimuli (Wiegand et al., 1994), precise recommendations of frame rates must take into account the emission characteristics of the display as well as the time course of the stimuli. The typical luminances produced in current non-see-through displays are sufficient to cause flicker for rates of 30 Hz and below. The higher light levels required for see-through displays, coupled with the limited control of real-world lighting in many augmented-reality situations, may result in the need for higher frame rates for such applications. The CSF is also dependent on retinal eccentricity: peripheral sensitivity to flicker is greater than foveal (Pirenne and Marriott, 1962). As field-of-view limitations are removed, peripheral flicker will become objectionable.
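The brightness dependence of flicker sensitivity is often summarized by the Ferry-Porter law, under which the critical flicker-fusion frequency rises roughly linearly with log luminance. The sketch below uses this law with illustrative constants (a slope of 10 Hz per decade and an intercept of 35 Hz, which are assumptions, not values from the text) to show why brighter see-through displays tend to demand higher frame rates.

```python
import math

# Ferry-Porter sketch: critical flicker-fusion frequency rises roughly
# linearly with log luminance. The slope and intercept here are
# illustrative assumptions, not measured values from the report.

def cff_hz(luminance_cd_m2, slope=10.0, intercept=35.0):
    return slope * math.log10(luminance_cd_m2) + intercept

for lum in (10, 100, 1000):
    print(f"{lum:5d} cd/m^2 -> flicker fusion near {cff_hz(lum):.0f} Hz")
```

Even with these rough constants, the fusion frequency at ordinary display luminances sits well above 30 Hz, consistent with the observation that 30 Hz frame rates flicker, and each decade of added luminance for see-through operation pushes the required rate higher still.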
Update rate is determined primarily by computational speed. When the view in an HMD is updated at a rate above 12 Hz, motion is perceived as smooth, and motion parallax cues support depth perception. Below this rate illusory motion or even simulator sickness may be induced (Piantanida et al., 1993).
At the current time, orientation and position tracker system delays are the major contributing factor to viewpoint update delays in HMD images. These "phase lag" delays can be as long as 250 ms for commonly available electromagnetic tracking systems (Meyer et al., 1992). Because all of the above delays are additive, motion artifacts and desynchronization
of the visual channel with respect to the other channels of a multimodal SE are likely to dwarf the more subtle characteristics of human temporal response based on the CSF.
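Because the delays are additive, a simple end-to-end latency budget makes the point numerically. Only the 250 ms worst-case tracker figure comes from the text (Meyer et al., 1992); the rendering and scanout entries are hypothetical round numbers corresponding to one update period each at 30 Hz and 60 Hz.

```python
# End-to-end viewpoint latency budget. The 250 ms tracker figure is the
# worst case cited in the text (Meyer et al., 1992); the other entries
# are hypothetical single-period values, included only for illustration.

latency_ms = {
    "tracker":         250,  # worst-case electromagnetic tracker (from text)
    "rendering":        33,  # hypothetical: one 30 Hz update period
    "display scanout":  17,  # hypothetical: one 60 Hz frame period
}

total = sum(latency_ms.values())
print(f"end-to-end viewpoint delay ~ {total} ms")   # 300 ms
```

A total on this order swamps the tens-of-milliseconds structure of the CSF, which is the sense in which tracker lag "dwarfs" the subtler temporal characteristics of human vision.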
The Third Dimension
Distance in the third dimension (from the eyes) and its corollary, depth perception, are ubiquitous properties of realistic vision. A variety of monocular stimulus conditions are capable of eliciting the appearance of relative depth. These include relative motion, perspective, occlusion, shadows, texture coarseness, aerial perspective (changes of color and luminance with distance), and others. Beyond approximately 1 m, these monocular depth cues predominate in creating the percept of depth, whereas in the near field (arm's length), the combined use of motion cues with stereopsis becomes crucial for eliciting convincing three-dimensional perception.
Binocular vision provides all the stimulus conditions of monocular vision plus the advantage of stereopsis with its high resolution (a few seconds of arc) as well as the disadvantage of requiring alignment of the two eyes to preclude the appearance of double images and binocular rivalry.
Our ability to simulate full binocular vision requires better technology than is now available for the following reasons. In stereoptic vision not only is resolution of each image good to about 1 minute of arc, but also relative displacements between corresponding image features on the two retinas (disparities) can be discriminated at least an order of magnitude more finely. No currently available display can meet such specifications. As a consequence, resolution of detail in all three dimensions is considerably less than optimal—it corresponds to the vision of a visually disabled observer. However, inasmuch as depth perception is also elicited by the relative motion of retinal images and the use of triangulation (parallax), some degree of distance discrimination can be achieved without stereopsis.
Many of the factors discussed above become especially critical when augmented-reality systems are considered. The difficulties stem largely from the immediate comparison between the real-world stimulus and the synthetic elements displayed as an overlay. When the subtle shifts in light qualities, zero delay viewpoint changes, and three-dimensional qualities of objects present in the real world are combined with the less-responsive synthetic stimuli, separation of the real and the synthetic is obvious. Similarly, the perception of the surface qualities of reflective
objects, the distinction between self-luminous and reflective objects, the perceptual invariance of color and brightness of objects based on the perceptual integration of light source and other information present in the entire scene, are all likely to constrain the type of information that can be effectively and unambiguously displayed (Roufs and Goossens, 1988; Westerink and Roufs, 1989).
The visual system has evolved to resolve ambiguities in the proximal stimulus through knowledge about how visual scenes behave. In using see-through HMDs, it will be hard not to violate this knowledge. Further research is needed to evaluate these effects and to develop robust presentation techniques that are effective within the technological limitations.
Sensory Adaptation to Visual Distortion
Many simple alterations involving the visual channel are effortlessly adapted to. For example, most users have little difficulty learning to position a cursor on a vertical computer screen through the use of a mouse constrained to a horizontal tabletop. Stable homogeneous alterations can be adapted to, even if they are extreme (Kohler, 1964). Nonhomogeneous alterations that are stable with respect to the visual field (such as those introduced by prism goggles) can also be adapted to (Held and Freedman, 1963). Because HMD optics (or any lens system placed in front of the eyes) exhibit these prismatic distortions, the design of such systems can be informed by what is known about the process of adaptation to prismatic distortions.
Often, distortions are purposely introduced to enhance perception (e.g., magnification to increase visual acuity). By incorporating appropriate distortions, displays can be better matched to specific environments or tasks, thus enhancing the visual capabilities of the observer. These enhancements would typically involve local changes in magnification (similar to the use of bifocal lenses), but they can also be based on higher-order properties of the visual stimulus such as alterations of optical flow fields in motion-oriented tasks.
Another example of an enhancing visual distortion involves changes in the effective interocular distance. Depth through stereopsis is limited to a maximum retinal disparity that can be accommodated by ocular geometry and the size of the retinal fusional areas. The result is that stereopsis starts to substantially degrade at approximately 10 m and is not effective beyond about 135 m from the viewer (Haber and Hershenson, 1980). Increased interocular distance has been used in binoculars and other optical instruments to overcome this limitation. Enhancement of the near field is not so straightforward, however. The ability to fuse the disparate retinal images into a perception of depth is limited by fairly
unchangeable neural connectivity patterns. The limit of maximum disparity (Panum's limit) suggests an upper bound to the degree of enhancement of stereopsis that is possible through eye position manipulation. Adaptation to extreme interocular distance changes, if it occurs at all, is likely to have an extremely long time course and may never become complete.
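The geometry behind both the stereopsis range limit and the hyperstereo enhancement can be sketched directly. The disparity between a point at distance d and one at infinity is roughly I/d radians for interocular separation I, so the useful range ends where I/d falls below a disparity threshold. The 100-arcsec effective threshold below is an assumed round number chosen because it roughly reproduces the ~135 m limit cited from Haber and Hershenson (1980); it is not a value given in the text.

```python
import math

# Stereopsis range geometry (assumptions noted): disparity between a
# point at distance d and one at infinity is ~ I/d radians for
# interocular separation I. The 100-arcsec effective threshold is an
# assumed round number, chosen to roughly reproduce the ~135 m limit
# cited in the text; it is not a figure from the report.

ARCSEC_TO_RAD = math.pi / (180 * 3600)

def max_stereo_range_m(interocular_m, threshold_arcsec=100.0):
    """Distance beyond which disparity falls below the assumed threshold."""
    return interocular_m / (threshold_arcsec * ARCSEC_TO_RAD)

print(f"normal eyes (0.065 m baseline): ~{max_stereo_range_m(0.065):.0f} m")
print(f"10x hyperstereo (0.65 m):       ~{max_stereo_range_m(0.65):.0f} m")
```

The range scales linearly with the baseline, which is why a tenfold increase in effective interocular distance, as in binoculars, extends stereoscopic depth tenfold; the same linearity shows why near-field enhancement runs into the fixed neural limits (Panum's limit) rather than optical ones.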
It is clear that the purposeful introduction of alterations has the capacity to both overcome equipment limitations and enhance the sensory capabilities of the observer. Development of visual displays that capitalize on these effects must combine existing knowledge of perception with findings from our growing experience with SEs.
STATUS OF THE TECHNOLOGY
An ideal visual display would be capable of providing imagery that is indistinguishable from that encountered in everyday experience in terms of quality, update rate, and extent. Clearly, however, current technology will not support such highly veridical visual displays, nor is it clear that such a high level of verisimilitude is required for most applications. The relative importance of various display features including visual properties (e.g., field of view, resolution, luminance, contrast, and color), ergonomics, safety, reliability, and cost must therefore be carefully evaluated for any given application. McKenna and Zeltzer (1992) provide a thorough comparison among several of these features for five major three-dimensional display types.
A complete visual interface consists of four primary components: (1) a visual display surface and an attendant optical system that paints a detailed, time-varying pattern of light onto the retina of the user; (2) a system for positioning the visual display surface relative to the eye of the operator; (3) a system for generating the optical stimuli (from cameras viewing real-world scenes, from stored video imagery, from computer synthesis, or from some combination of the three); and (4) a system for sensing the positions and motions of the head and/or eyeballs. Items (3) and (4) are treated in Chapters 5 and 8; they are not treated here except insofar as they have direct implications for the first two topics.
Two major classes of visual display systems are available for SE applications: head-mounted and off-head displays. The imagery displayed by either form of device may be coupled to the motions of the user's head either directly (using sensors that measure head position and orientation) or indirectly (typically using manual control devices such as joysticks or speech input).
At present, the majority of visual displays for synthetic environments are physically coupled to the head of the operator by mounting display
hardware on a helmet or headband worn by the user. A significant advantage of head-mounted displays is that the display positioning servomechanism is provided by the human torso and neck. This allows the generation of a completely encompassing viewing volume with no additional hardware and eliminates lag introduced by display surface positioning systems required by some off-head strategies. In many HMDs, all the imagery is synthetic and generated by computer. For certain augmented-reality displays, however, a semitransparent display surface is used and the synthetic imagery is overlaid on imagery resulting directly from objects in the environment. In other augmented-reality displays, the synthetic imagery is combined with imagery derived from optical sensors on a telerobot.
Among the disadvantages of head-mounted displays are the weight and inertial burden that interfere with the natural, unencumbered motions of the user, the fatigue associated with these factors, and the increased likelihood of symptomatic motion sickness with increased head inertia (DiZio and Lackner, 1992). In general, it is difficult to build head-mounted display devices that exhibit good spatial resolution, field of view, and color yet are lightweight, comfortable, and cost-effective.
Although often not associated with synthetic environment applications, a variety of off-head displays are available for SE tasks. These displays range from relatively inexpensive monoscopic and stereoscopic panel and projected displays to experimental autostereoscopic displays. High-resolution color panel and projected display systems are available at relatively low cost due to common use in the computer graphics and entertainment fields. These systems, at most, require a lightweight pair of active or passive glasses to generate high-quality stereo and therefore impart a minimal inertial burden on the user and are comfortable to use. Within the limits of comfortable viewing range, the static field of view and spatial resolution of panel and projected displays are determined by user distance from the display surface. A relatively large field of view (e.g., >100 deg horizontal) can be readily achieved with larger display surfaces without requiring correcting optics. Panel and projection displays are typically larger and heavier than HMDs, a disadvantage in volume- and weight-limited applications. In addition, servo-controlled or multiple static display surfaces must be used to provide a completely encompassing visual environment. Another approach to increasing viewing volume is to use head position and orientation sensors of the same type and form used on head-mounted displays. Although the relationship between the user's head position and orientation and the remote or virtual viewpoint is not so straightforward as that encountered when using head-mounted displays, this form of viewpoint control can nonetheless be used to advantage to both control the viewpoint into the synthetic
environment (Spain, 1992) and to correct for off-axis viewing perspective (McKenna and Zeltzer, 1992).
Hybrid panel display systems exist that use smaller display surfaces servo-controlled or manually steered to the user's head position and orientation (Oyama et al., 1993; MacDowall et al., 1990). These systems remove the inertial burden of head-mounted displays yet allow the use of more readily available, nonminiature, high-resolution color panel display devices to generate the imagery. In addition, user head position and orientation information is readily available through instrumentation of the display support structure, and consequently a large effective viewing volume is achieved without additional complexity. In servo-controlled variants, delay and distortions introduced by the electromechanical control mechanisms must be dealt with, in addition to the safety issues associated with coupling a powered device to the human head. Manually steered devices do not have the same safety concerns but do limit the applications to which these devices may be applied, since one or both hands must be used to move the visual display and therefore are not available for manual interaction tasks.
Autostereoscopic off-head display techniques, such as slice-stacking and holographic video, do not require special viewing aids (e.g., field-sequential or polarized glasses). Displays of this type, currently in the research and development phase, are discussed further in the next section.
Components and Technologies
Display Surfaces and Optics
The most frequently used and technically mature display types for synthetic environment applications are cathode ray tubes (CRTs) and backlighted liquid crystal displays (LCDs). Although these technologies have proven to be very useful for near-term HMD applications, several shortcomings compromise their long-term promise. CRT technology has been able to deliver small, high-resolution, high-luminance monochromatic displays. These CRTs, however, are relatively heavy and bulky and place very high voltages on the human head. In addition, the development of miniature, high-resolution, high-luminance, color CRTs has been difficult. In contrast, current LCD technology can produce color images at low operating voltages, but only at marginal picture-element densities. Both technological approaches require bulky optics to form high-quality images with sufficiently large optical exit pupils. The following paragraphs discuss emerging technological approaches that have the potential to produce high-resolution color
imagery while reducing some of the weight, bulk, and cost of current displays.
A near-term approach that is bringing high-quality, color, CRT-based HMDs to the commercial marketplace is the use of mechanical or electronic color filtering techniques applied to monochrome CRTs. In this approach, the CRT is scanned at three times the normal rate, and red, green, and blue filters are sequentially applied. Some of the integrated systems discussed in subsequent sections of this report use this technique.
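The timing implication of this technique can be sketched with simple arithmetic. The 60 Hz base rate below is an illustrative assumption (typical of video displays of the period), not a figure from the text.

```python
# Back-of-envelope timing for field-sequential color on a monochrome CRT:
# each frame is scanned once per primary, so the CRT must sustain three
# times the base refresh rate. The 60 Hz figure is an assumed example.

def color_sequential_field_rate(refresh_hz: float, primaries: int = 3) -> float:
    """Field rate the monochrome CRT must sustain when red, green, and
    blue fields are drawn sequentially within each frame."""
    return refresh_hz * primaries

rate = color_sequential_field_rate(60)   # 180 fields per second
field_time_ms = 1000.0 / rate            # time available per color field

print(f"required field rate: {rate:.0f} Hz")
print(f"time per color field: {field_time_ms:.2f} ms")
```

At an assumed 60 Hz frame rate, the monochrome tube is left with under 6 ms to scan each color field, which is why only fast, high-bandwidth CRTs are suitable.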
Commercially available SE displays have almost exclusively relied on TV-quality liquid crystal display technology. This technology was developed and chosen because of the demand for large-area displays for computer and television screens; it is limited in its maximum pixel density and size by the thin-film manufacturing techniques used for its fabrication. The quality of transistors, lithographic resolution, and parasitic resistance of wires in the thin-film technologies are considerably inferior to what can be realized in silicon very large scale integrated (VLSI) technologies. In the VE and teleoperator fields, however, large-area displays are not desirable; these applications call for very high resolution in a compact, lightweight form.
Thomas Knight of the MIT Artificial Intelligence Laboratory is pursuing such display characteristics using silicon VLSI chip technology. The potential for density and performance can be seen by comparing the available resolution of liquid crystal (LC) displays with the density of commercially available image sensors based on silicon technology. The LCDs have maximum resolutions of about 640 × 400, with a pixel pitch of about 330 µ. Silicon sensors have been fabricated with resolutions of 4,000 × 4,000 pixels, with dot pitches of 11 µ. Smaller pixel dimensions are lithographically possible, but light sensitivity demands the use of larger pixel areas. Knight believes that a feasible prototype in today's technology is a 2,000 × 2,000 image display with pixels on an 8 µ pitch, producing a display with an active area of 16 mm × 16 mm.
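The geometries quoted above follow directly from pixel count and pitch, as the following check shows. The pixel counts and pitches are taken from the text; only the comparison itself is added here.

```python
# Arithmetic check of the display geometries cited above: linear extent is
# simply pixel count times pitch. Values are taken from the text.

def active_extent_mm(pixels: int, pitch_um: float) -> float:
    """Linear extent, in mm, of `pixels` elements at `pitch_um` microns."""
    return pixels * pitch_um / 1000.0

# Knight's proposed display: 2,000 pixels on an 8-micron pitch per side.
knight_side = active_extent_mm(2000, 8)   # 16 mm, matching the text

# A 640-pixel LCD row at a 330-micron pitch, for comparison:
lcd_width = active_extent_mm(640, 330)    # about 211 mm

print(f"Knight display: {knight_side:.0f} mm per side")
print(f"640-pixel LCD row: {lcd_width:.0f} mm wide")
```

The comparison makes the density gap vivid: a full 640-pixel LCD row spans over 200 mm, whereas the proposed silicon display packs 2,000 pixels into 16 mm.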
There are several approaches to coating such a high-resolution substrate with optically active materials. One simple approach is to use conventional field-sensitive liquid crystal material as a reflective display. An advantage of SE applications is the ability to control the illumination environment, making many problems in LC display technology, such as angular field of view, less critical. Some indication of the possible resolution available with LC techniques on silicon is the application of liquid crystals to diagnosis of failing integrated circuit parts, in which lines of under 3 µ are routinely imaged (Picart et al., 1989).
Kopin, Inc., in Taunton, Massachusetts, is also investigating high-resolution LCDs on silicon, with the use of very thin (1,000 Å) silicon substrates, which are optically transmissive. The Kopin approach relies
on fabricating very thin layers of recrystallized silicon on a silicon dioxide substrate, forming circuitry on the substrate, and finally releasing the recrystallized silicon by etching. The released thin silicon sheet is then floated onto a glass substrate and used as a transparent display. Devices with a resolution of 1,100 lines per square inch over a 0.5 × 0.5 inch area have been manufactured.
Another optical interface approach is the use of electrochromic polymers, such as polyaniline (Kobayashi et al., 1984). This material changes color reversibly from a pale yellow to a dark green when reduced by current flow in solution. By coating an area array of noble metal electrodes fabricated on an active silicon structure, reduction in pixel-sized regions is achievable. With either of these display technologies, the major advantages of cost-effective, mass producible, lithographically defined devices remain, leading to potentially dramatic system-level economic benefits.
The optical portion of an SE visual display is potentially amenable to significant cost savings with the use of holographic optical elements (HOEs) as a replacement for bulky and heavy lens assemblies. The combination of a silicon display and HOEs could lead to a light, cost-effective SE interface suitable for mass production. Work in this area is ongoing at the University of Alabama, where the focus is primarily on computer-generated holograms using LC materials on silicon substrates as a high-resolution binary display.
A unique alternative to the classic HMD based on CRT or LCD technology is being pursued at the University of Washington's Human Interface Technology (HIT) Laboratory (LaLonde, 1990). The HIT Laboratory is developing a display based on laser microscanner technology; it will use tiny solid-state lasers to scan color images directly onto the retina. The advantage of this approach is that it does not require the heavy, bulky optics typically used in CRT/LCD-based collimated aerial image systems, and it has the potential to yield high-resolution, lightweight, low-cost display systems. The laser microscanner display, however, still faces substantial technical obstacles. A thoughtful analysis of this form of display may be found in Holmgren and Robinett (1994).
Autostereoscopic displays are interesting in that no viewing aids (e.g., field-sequential or polarized glasses) are required and no inertial burden is added to the human user. Depending on the size of their viewing zone or viewing volume, autostereoscopic displays can be seen by multiple viewers. Some displays, such as lenticular or parallax barrier displays, can present a small number of precomputed perspective views, such that
limited motion parallax and lookaround—usually restricted to the horizontal plane—can be generated without tracking viewer head position. Holographic and slice-stacking displays can present continuously varying perspective views, although in a restricted viewing volume. Since autostereoscopic displays entail either restricted viewing volumes or large, resolution-limited display surfaces, however, they may not be appropriate for SE applications requiring a completely encompassing viewing volume (McKenna and Zeltzer, 1992).
Lenticular Displays A lenticular sheet is an array of cylindrical lenses that can be used to generate an autostereoscopic three-dimensional image by directing different two-dimensional images into viewing subzones. The subzones are imaged out at different angles in front of the lenticular sheet. When an observer's head is positioned correctly, each eye will be in a different viewing zone and will see a different image, thus allowing for binocular disparity.
Lenticular imaging requires very high resolution to image a large number of views. With CRT technology, the pixel size limits the upper resolution and thus the number of views. The bandwidth requirements can also become very large, since N views are displayed. Furthermore, N views must be rendered in real time, with the imagery "sliced" and placed into the vertical strips behind the lenticules. The number of displayable views is limited by the imperfect focusing ability of the cylindrical lenses. Lens aberrations and diffraction of the light reduce the directivity of the lenses, so that the focused imagery from the back screen does not emerge with parallel rays, but rather spreads with some angle. This spread limits the number of subzones that can be differentiated from each other. Another key issue with lenticular sheet displays is that the back screen imagery must be closely aligned with the slits or lenticules; otherwise, the subzone imagery will not be directed into the appropriate subzone.
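The resolution and bandwidth penalties described above scale linearly with the number of views, as the following sketch illustrates. The screen width, view resolution, and frame rate are illustrative assumptions, not figures from the report.

```python
# Scaling sketch for a lenticular (or other multiview) display. With N
# views interleaved on a single screen, each view receives only 1/N of the
# horizontal pixels, and the rendering load grows linearly with N. All
# numeric values here are assumed examples.

def per_view_h_pixels(screen_h_pixels: int, n_views: int) -> int:
    """Horizontal pixels left to each view after interleaving N views."""
    return screen_h_pixels // n_views

def render_rate_pixels(view_w: int, view_h: int, n_views: int, hz: float) -> float:
    """Pixels per second that must be rendered when N full views are
    drawn every frame (before slicing into strips)."""
    return view_w * view_h * n_views * hz

# Example: a 1,280-pixel-wide screen sharing 16 views at 30 Hz.
print(per_view_h_pixels(1280, 16))                    # 80 pixels per view
print(render_rate_pixels(640, 480, 16, 30.0) / 1e6)   # ~147 Mpixels/s
```

Even modest per-view imagery at 16 views demands roughly 147 million rendered pixels per second, which illustrates why view count is so constrained in practice.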
Parallax Barrier Displays A parallax barrier is a vertical slit plate that is placed in front of a display, simply to block part of the screen from each eye. A parallax barrier acts much like a lenticular screen, except that it uses barriers to obstruct part of the display rather than lenticules to direct the screen imagery. The screen displays two images, each of which is divided into vertical strips. The strips displayed on the screen alternate between the left and right eye images. Each eye can see only the strips intended for it, because of the slit plate. More than two images can be displayed on the screen to create multiple views from side to side. When a CRT monitor is used with a parallax barrier, the horizontal resolution is divided by the number of two-dimensional views provided. Multiple projecting monitors can be used to maintain a higher horizontal resolution with a large number of views. Each projector images a different
viewpoint, and the barrier and diffusion screen direct the light back to the viewing zones. Parallax barriers are not commonly used because they suffer from several drawbacks. First, the displayed imagery is often dim, because the barrier blocks most of the light to each eye. Also, with small slit widths, the diffraction of light from the slit gap can become problematic, as the light rays spread. As discussed above, the CRT imagery must be segmented into strips, as with a lenticular sheet display.
Slice-Stacking Displays Slice-stacking displays, also referred to as multiplanar displays, build up a three-dimensional volume by layering two-dimensional images (slices). Just as a spinning line of light emitting diodes (LEDs) can perceptually create a planar image, a rotating plane of LEDs can create a volumetric image. A similar volume can be scanned using CRT displays and moving mirrors. Rather than using a planar mirror, which would have to move over a large displacement at a high frequency, a variable-focus, or varifocal, mirror can be used. A 30 Hz acoustic signal is commonly used to vibrate a reflective membrane. As the mirror vibrates, its focal length changes, and a reflected monitor is imaged, over time, in a truncated-pyramid viewing volume. The mirror continuously changes its magnification, so that imagery scanned over time (as a CRT operates) will be continually changing in depth (not in discrete slices). A variant on this approach is under development by Texas Instruments. In this technique, micromechanical mirrors, 17 µ square, are supported by silicon beams on diagonal corners. The two unsupported corners are metalized and used as electrodes in an electrostatic actuator, which allows the mirrors to be pulled to one side or the other. Actuation times of about 10 microseconds and angular deflections of about 10 deg allow these microscopic mirrors to deflect incoming light to form high-resolution displays. A version containing approximately 700 × 500 pixels, using frame-sequential color derived from a color filter wheel and six-bit pulse-width control of the intensity of each pixel, has been demonstrated.
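A quick timing check suggests why 10-microsecond mirror switching suffices for six-bit pulse-width intensity control. The 60 Hz frame rate and three-primary color wheel are assumptions for illustration; only the switching time and bit depth come from the text.

```python
# Feasibility sketch for six-bit pulse-width intensity control with the
# micromechanical mirrors described above. Assumed (not from the report):
# 60 Hz frames with three sequential color fields per frame.

SWITCH_TIME_US = 10.0   # mirror actuation time, from the text

def lsb_slot_us(frame_hz: float = 60.0, primaries: int = 3, bits: int = 6) -> float:
    """Duration of the least-significant-bit time slot when each color
    field is divided into 2**bits - 1 slots (binary-weighted PWM)."""
    field_time_us = 1e6 / (frame_hz * primaries)
    return field_time_us / (2 ** bits - 1)

slot = lsb_slot_us()
print(f"LSB time slot: {slot:.0f} us (mirror switch time: {SWITCH_TIME_US:.0f} us)")
```

Under these assumptions the shortest intensity slot is roughly 88 microseconds, comfortably longer than the 10-microsecond mirror switching time.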
Another method for generating a volumetric image is to illuminate a rotating surface with a random access light source. Some experimental systems have employed a spinning double helix, illuminated by lasers and controlled by acousto-optic scanners (Soltan et al., 1992). In order to illuminate a specific location in the volume, the laser is timed to strike the helical surface as it passes through that location.
Slice-stacking methods trace out a luminous volume, such that objects are transparent, and normally obscured objects, further in depth, cannot be hidden. This can be ideal for volumetric datasets and solid modeling problems, but it is poorly suited to photographic or realistic images with hidden surfaces. The addition of head-tracking would allow
hidden surfaces to be approximately removed in the rendering step for one viewer. Not all surfaces can be correctly rendered, however, because the two eyes view from differing locations; each eye should see some surfaces that are obscured to the other.
Holographic Displays Computer-generated holograms fall under two main categories, computer-generated stereograms and computer-generated diffraction patterns. Computer-generated stereograms are recorded optically, from a set of two-dimensional views of a three-dimensional scene. The final hologram projects each two-dimensional image into a viewing zone, and stereo views can be seen with horizontal parallax (Benton, 1982). Full-color, high-resolution images have been generated, as well as large, wide field-of-view holograms. This is a non-real-time imaging technique, however; it requires off-line recording. A large amount of information is needed to generate the hologram as well, since every view (typically 100 to 300) must be generated.
Rather than record a set of two-dimensional views holographically, a true diffraction pattern can be generated by computer. When illuminated, the hologram will create a three-dimensional wavefront, imaging three-dimensional objects and light sources in space (Dallas, 1980; Tricoles, 1987). The methods used to compute the diffraction patterns are complex and computationally expensive. Until recently, computer-generated holograms had to be recorded using plotter or printing techniques, as an off-line process. A new method, however, allows a holographic image to be displayed in real time, from a fast-frame buffer storage (St. Hilaire et al., 1990; Benton, 1991). Because the holographic signal can be scanned in real time and potentially broadcast, this system is referred to as holographic video by its creators at the Massachusetts Institute of Technology.
The information contained in a hologram with dimensions of 100 × 100 mm with a viewing angle of 30 deg corresponds to approximately 25 Gbyte (25 billion samples), well beyond the range of current technology to update at frame rates of 30 Hz (St. Hilaire et al., 1990). The MIT system addresses this problem by reducing the information rate in three ways—by eliminating vertical parallax (saving several orders of magnitude), by limiting the viewing zone to approximately 14 deg (wider angles require higher spatial frequency diffraction patterns), and by limiting the image size. The diffraction patterns for a frame are computed on a supercomputer (Connection Machine II) in under 5 s for fairly simple objects composed of luminous points. The hologram is stored in a high-resolution frame buffer (approximately 6 Mbyte per frame) and is transmitted to a high-bandwidth acousto-optical modulator (AOM). The AOM modulates a coherent light source to create the three-dimensional image. Both
monochrome and tricolor displays have been demonstrated. Recently, a larger version of the system with a modified scanning architecture has been developed and can display a holographic image at 140 mm × 70 mm and a 30 deg angle of view, each frame of which requires 36 Mbyte of computed diffraction pattern.
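The 25-billion-sample figure quoted above can be roughly reproduced from the grating equation. The 633 nm wavelength and the treatment of the quoted 30 deg as the full diffraction angle are assumptions made here for illustration; neither appears explicitly in the text.

```python
import math

# Rough estimate of the sample count for a 100 mm x 100 mm hologram with
# a 30 deg viewing angle. Assumptions (not from the report): ~633 nm
# (HeNe red) illumination, 30 deg taken as the full diffraction angle,
# and Nyquist sampling (two samples per fringe) of the pattern.

WAVELENGTH_MM = 633e-6   # 633 nm expressed in millimeters

def samples_per_mm(view_angle_deg: float) -> float:
    """Highest fringe frequency needed to diffract light over the viewing
    angle (grating equation: sin(theta) = wavelength * f), sampled at
    twice that frequency."""
    f_max = math.sin(math.radians(view_angle_deg)) / WAVELENGTH_MM
    return 2.0 * f_max

n_per_mm = samples_per_mm(30.0)
total = (n_per_mm * 100) ** 2   # 100 mm in each dimension
print(f"~{n_per_mm:.0f} samples/mm, ~{total / 1e9:.0f} billion samples total")
```

The estimate lands near 1,580 samples per millimeter in each dimension, or about 25 billion samples overall, consistent with the figure cited from St. Hilaire et al. (1990).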
In the United States, following the pioneering work of Ivan Sutherland and his students in the late 1960s, research and development of display systems for synthetic environments have been carried out at MIT's Media Lab, the Ames and Langley Research Centers of the National Aeronautics and Space Administration (NASA), Wright-Patterson Air Force Base, the Naval Ocean Systems Center, the University of North Carolina, LEEP Optics, the University of Washington, the Japanese government's Mechanical Engineering Laboratory at Tsukuba, CAE Electronics, VPL Research, Virtual Research, Technology Innovation Group, Kaiser Electronics Electro-Optics Division, Hughes Electro-Optical & Data Systems Group, Stereographics Corporation, and Fake Space Labs, to name a few of the more prominent research and development centers in academia, government, and private industry. Applications of the various research systems include scientific visualization, vehicle guidance, and remote manipulation. A variety of relevant references are available in Aukstakalnis and Blatner (1993), Earnshaw et al. (1993), and Kalawsky (1993).
Over the years, advanced, high-performance HMDs have been developed for various military research applications (Merritt, 1991). These systems, although quite capable, often cost in excess of $0.5 million. The highest-resolution HMD technology used to date has been developed by CAE Electronics of Canada for a variety of military applications. High-resolution color images produced by a pair of General Electric liquid crystal light valve (LCLV) projectors are condensed onto fiber optic bundles that convey image information to an HMD incorporating wide-angle eyepiece optics. The Phase V version of this fiber optic HMD provides a 160 deg horizontal (H) by 80 deg vertical (V) field of view with a 38 deg (H) stereoscopic overlap centered in the operator's field of view. In contrast, commercial HMDs until recently consisted mainly of low-resolution, wide-field-of-view LCD-based systems. Historically, the most widely used commercial head-mounted systems were the Virtual Research Flight Helmet, the VPL EyePhone, and LEEP Systems CyberFace
II. All three provide a large total horizontal field of view of roughly 100 to 110 deg using LCD screens with effective spatial resolutions of around 300 × 200 pixels in the standard NTSC (National Television System Committee) aspect (width/height) ratio. All of these devices are or were sold for under $10,000. VPL Research introduced but then withdrew a higher-resolution HMD called the HRX, due to maintainability problems. W Industries (in the United Kingdom) markets a rugged HMD called the Visette, with properties similar to those of the VR Flight Helmet. Neither the VR Flight Helmet nor the VPL EyePhone is currently being sold, and few CyberFace devices remain in use.
The past few years have witnessed the introduction of commercially available systems that can serve as relatively simple, robust, and inexpensive developmental tools for those interested in developing SEs. These systems are often commercial variants of military systems and are introducing color CRT-based and advanced solid-state-based display devices to the commercial marketplace.
Virtual Research, upon discontinuing its successful Flight Helmet, introduced a lightweight helmet-mounted display using miniature monochrome CRTs with color wheels known as the EYEGEN3. This 28-ounce system takes two NTSC inputs, has a resolution of 250 (H) × 493 (V), and offers a field of view that ranges from 32 deg (H) at 100 percent binocular overlap to 48 deg (H) at 50 percent binocular overlap. It has outstanding brightness and contrast and costs under $8,000. In a similar price range, Liquid Image Corporation's Mirage HMD uses a 5.7-inch diagonal, thin-film transistor, active-matrix LCD display to provide a 240 (H) × 720 (V) biocular (i.e., nonstereoscopic) color display over a 110 deg (H) field of view. Although the system accepts an NTSC input, its response time is 80 ms (typical for LCD displays). At the higher end of the commercial performance spectrum (approximately $150,000), Kaiser Electro-Optics has introduced the Color Sim Eye. A direct result of Kaiser's military work, the Color Sim Eye provides 1,280 (H) × 1,024 (V) interlaced lines over a 40 deg diameter field of view. Variants with a 60-80 deg field of view are under development; a monochrome version is also available. CRTs are mounted low alongside the head to minimize their inertial burden during turns of the head. Simple relay optics are used to convey images into the operator's eyes; they also provide a 44 percent see-through capability. Surface-stabilized ferroelectric liquid crystal shutters are used to sequentially display the three primary colors at a refresh rate of 60 Hz (i.e., a 180 Hz scan rate). The complete system weighs 3.5 lb (1 lb less than the monochrome version). Technology Innovation Group's recently developed HMD systems provide (at somewhat higher cost) greater adjustability, more sophisticated relay optics that yield a wider field of view, and improved helmet features. Using an overall configuration similar to that of the Color Sim Eye
(i.e., CRTs mounted close to the turning axes of the head), the system is capable of providing an expanded field of view while minimizing geometric distortions.
Classic CRT-based panel and projection displays with sufficient scan rates and bandwidths can generate high-resolution (i.e., 1,000+ horizontal line) stereoscopic displays using field sequential techniques. High-resolution, color, high-bandwidth displays of this form start at approximately $2,000 for panel versions and $13,000 for wide-screen projection versions. This form of stereographic display uses a temporal sequencing approach to provide alternating stereo display pairs to the right and left eyes. The premier provider of field sequential devices in the computer graphics marketplace is Stereographics Corporation. Stereographics manufactures a series of LCD-shuttered glasses that use infrared technology for field synchronization. These glasses and the infrared transmitter can be added to appropriate computer consoles and projection systems for well under $2,000. A variety of even lower-cost, wire-synchronized, field sequential systems are available from 3D TV. Total system costs are approximately one order of magnitude less than those of comparable head-mounted displays.
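The temporal cost of this approach is easy to quantify: alternating left- and right-eye fields halves the image rate each eye receives. The display rates below are illustrative assumptions; the report quotes none.

```python
# Field-sequential (time-multiplexed) stereo: left- and right-eye images
# alternate on one display while shuttered glasses gate each eye, so each
# eye sees half the display's field rate. Rates are assumed examples.

def per_eye_rate_hz(display_field_hz: float) -> float:
    """Effective image rate each eye sees when two eyes share the display."""
    return display_field_hz / 2.0

for display_hz in (60.0, 120.0):
    print(f"{display_hz:.0f} Hz display -> {per_eye_rate_hz(display_hz):.0f} Hz per eye")
```

This halving is why field-sequential stereo demands monitors and projectors with unusually high scan rates: a 120 Hz display is needed just to hold each eye at a flicker-free 60 Hz.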
Several interesting hybrid systems are available. Fake Space Labs has developed a class of displays known as the BOOM. The user holds the display head by two handles, pointing and moving the view direction by turning the display head as if it were a pair of binoculars. Thumb buttons can be used to control forward and backward motion within the SE. Several configurations of this rugged, floor-mounted device are available. The low-end system uses twin CRT displays to provide 640 (H) × 480 (V) resolution in monochrome. The high-end three-color version of the BOOM provides 1280 (H) × 960 (V) interlaced resolution using color filtering techniques. BOOM viewers have integral tracking sensors in the six-degrees-of-freedom support structure. This tracking system is very fast, returning orientation and position information to the computer in 5 ms or less after a change of viewing direction. LEEP Optics has introduced CyberFace III, a higher-resolution monochrome system with a single image source and a countersprung support arm to neutralize the weight on the wearer's head. CyberFace III is aimed at the PC-based computer-aided design (CAD) market and is a lightweight, low-priced alternative to the Fake Space BOOM. It can be used with a face-fitting head mount or in a hand-guided mode. Costs for these hybrid systems fall between those of off-head panel displays and head-mounted displays.
Advantages and Disadvantages of Head-Mounted and Off-Head Displays
In summary, head-mounted displays provide a straightforward approach to the generation of a seamless, all-encompassing viewing volume. High-performance HMDs, however, are costly, and current technological challenges result in less than ideal performance. In particular: (1) high-resolution, miniature, lightweight, low-cost display surfaces are yet to be realized; (2) weight and inertial burdens imposed by most HMDs affect the incidence of symptomatic motion sickness, the ability of users to make proper judgments concerning orientation, and their long-term habitability; and (3) due to size, performance, and cost constraints, no fixation/focus compensation is utilized in most HMDs, and conflicting visual depth cues are provided to the user. Furthermore, the proper operation of HMDs is intimately tied to the performance of head-tracking systems (i.e., update rate and lag), which is currently less than ideal.
Off-head display approaches can help alleviate some of these shortcomings. Since the display is not worn on the head: (1) relatively inexpensive, widely available display surfaces (i.e., panel and projection) can provide high-resolution color imagery; (2) the effects of placing additional weight and inertial burdens on the user's head are minimized; (3) some of the volume and weight constraints on fixation/focus compensation systems are relaxed; and (4) some off-head configurations (i.e., those that supply an instantaneous encompassing viewing volume) do not necessarily require head-orientation tracking, and therefore, lags and distortions due to the tracking system are minimized. Off-head displays, however, are by no means perfect. If an instantaneous encompassing viewing volume is not provided, some form of head tracking is required. Not only must the computer-generated imagery be slaved to the user (as in HMDs), but also the display surfaces must be servo-controlled. Finally, the overall volume and weight of the display system is typically much greater for off-head systems than for head-mounted systems.
Within the past few years, the design and development of visual display systems appropriate for use in synthetic environments have engaged unprecedented commercial interest and involvement. High-end HMD technology, originally developed for military applications, has been transferred to the commercial sector and new commercial markets have resulted in innovative lower-end designs. Lightweight, intermediate-resolution (approximately 1,000 horizontal lines), color, see-through capable HMDs are now available for under $200,000; lower-resolution (i.e.,
NTSC level) systems are available for under $10,000. Intermediate-resolution, off-head, field-sequential, panel and projection systems are available for well under $20,000. Equivalent BOOM-type systems cost under $100,000. No special government-directed research effort seems to be required as an impetus to drive the efforts to refine system design and reduce system costs in the intermediate resolution display technologies under discussion. That is not to say that there remains no significant research and development to be done, but rather that commercial pressures seem adequate to motivate and fund these efforts.
Current commercial and technological trends, however, are still likely to result in displays for synthetic environments that have several shortcomings. The best of the current commercial systems, with resolutions roughly equivalent to high-definition TV, cannot provide eye-limited resolution except across a relatively small field of view (i.e., approximately 38 deg [H] field of view). It may be some time before commercial market pressures by themselves will drive the development of the higher-resolution display devices required for wider field-of-view, eye-resolution-limited display systems. Eye-tracking approaches (i.e., in which a high-resolution display area is kept within the foveal region of the eye and a lower-resolution image is presented elsewhere) do not reduce the absolute display surface resolution required but rather mitigate the computational requirements for generating visual images in virtual environments or the camera resolution requirements for teleoperators. Some government involvement therefore may be appropriate for encouraging high-resolution display design that will enable wide-field-of-view (i.e., perceptually seamless), eye-resolution-limited display systems to be built. In order to provide a 180 deg (H) by 120 deg (V) field of view (i.e., the instantaneous field of view of the human visual system), display devices providing approximately 4,800 (H) × 3,800 (V) lines of resolution are required (McKenna and Zeltzer, 1992). It seems prudent for both CRT-based and solid-state device-based approaches for display devices to be pursued. A particular area for emphasis should be the development of miniature, high-resolution CRT and solid-state display surfaces, since commercial pressures tend toward larger versions of these devices and HMDs require smaller versions.
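The angular budget implicit in the cited line counts can be checked with simple arithmetic. Reading the 180 deg and 120 deg fields against the 4,800 and 3,800 line figures is the only input here; the 1 arcmin comparison point is an assumed benchmark (20/20 acuity), not a figure from the report.

```python
# Angular-resolution arithmetic behind the display figures cited above:
# lines of resolution = field of view (in arcminutes) / angular budget
# per line. Inverting the cited counts shows the assumed budget; the
# 1 arcmin benchmark below is an illustrative assumption.

def implied_arcmin_per_line(fov_deg: float, lines: float) -> float:
    return fov_deg * 60.0 / lines

def lines_required(fov_deg: float, arcmin_per_line: float) -> float:
    return fov_deg * 60.0 / arcmin_per_line

h_budget = implied_arcmin_per_line(180, 4800)   # 2.25 arcmin per line
v_budget = implied_arcmin_per_line(120, 3800)   # about 1.9 arcmin per line

# At a strict 1 arcmin per line over the whole field, the requirement
# roughly doubles:
print(lines_required(180, 1.0), lines_required(120, 1.0))
```

The cited figures thus correspond to roughly 2 arcmin per line; holding a strict 1 arcmin everywhere would push the requirement toward 10,800 × 7,200 lines.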
Autostereoscopic off-head displays (e.g., video holographic displays), which are especially attractive for situation assessment applications, should also be considered for continued government-funded research and development.
In designing a visual display system for synthetic environments, it is important to consider the specific tasks to be undertaken and, in particular, the visual requirements inherent in these tasks. None of the available technologies is capable of providing the operator with imagery that is in all important respects indistinguishable from a direct view of a complex,
real-world scene. In other words, significant compromises must be made. The compromises made by designers should be based on the best available evidence regarding the relationships between vision system features and objective performance metrics—not on wholly subjective criteria. Although many important lessons can be learned from the work conducted over many years on large-scale simulators concerning these issues, further research is clearly required.
Unanswered questions include: For a defined task domain, how should one compromise between spatial resolution and field of view? Is color worth the added expense and loss of spatial resolution? Is a stereoscopic display worth the trouble? If so, what are the appropriate parameters (baselines, convergence angles, convergence/accommodation mismatches) for various types of tasks? Can supernormal cues (e.g., depth, contrast) be used to advantage? Can the simulator-induced queasiness that accompanies the use of wide-field-of-view HMDs for some users be minimized or eliminated entirely? Which forms of delays and distortions induced by the visual display system can be adapted to? How are the required visual display system parameters affected within multimodal systems? Can visual display system requirements be relaxed in multimodal display environments? What are the perceptual effects associated with the merging of displays from different display sources? How do we design a comfortable HMD that integrates visual, auditory, and position tracking capabilities? These are but a few of the many research issues that affect the practical design of any visual display system for synthetic environments.