Picture This: The Changing World of Graphics
Donald P. Greenberg
Andries van Dam
Henry Fuchs
In this session, we will give you three snapshots of our favorite things in graphics: the views of three random folks who are not necessarily representative of the field. I will talk about displays, virtual reality (VR), and things that interest me; Andries van Dam will talk about user interfaces; and Donald Greenberg will talk about rendering.
The first thing I want to talk about is graphics in a nutshell since the 1980s (Box 6.1). Here is my mantra: 2D (two-dimensional) graphics has become ubiquitous. That is, 20 years ago, 2D was a cult; 10 years ago it hit the mainstream; and now you cannot get a machine without 2D graphics on it. If you want a 24-by-80 text-only screen, you have to try really hard. In some sense, 3D (three-dimensional) graphics is 10 years behind 2D. It was still a cult 10 years ago; now, of course, you do not need to be in the cult of 3D in order to do 3D graphics in your work.
Some people say that VR, immersive displays, and interactive and immersive environments are also on this road, but I am a skeptic.
Speaking as a person who has been doing VR for 25 years, I think that the best recent development for VR is the Web coming along to take VR off the front pages. Until about a year ago, we had to spend a lot of time—and I suspect a lot of you who are working anywhere near the field spent a lot of time—just shooing people out of the lab in order to get anything done. In a nutshell, VR has been around doing useful work—by useful, I mean that people are willing to pay money for it when they do not have any alternatives—since at least 1970, in terms of flight simulators.
In the past 10 years, the military has spent lots of money on training both individual pilots and large groups of people, since it is not practical to train for all situations with real equipment. Also, the entertainment area is always going to be around, but with a nebulous future. However, I want to give you a couple of examples from medicine, an application area in which you can show incremental improvements that give some idea of what might be more widely available in the future (Box 6.2). Two examples can be presented using videos.
The first video is from Kikinis, Jolesz, and Lorensen.1 Many of you are familiar with their work at Brigham and Women's Hospital in Boston. The interesting part of this work is that they are using graphics, and the merging of graphics with the patient, as a standard procedure now, not just in the planning of neurosurgery but as a guide or a road map for the surgery itself. This is a planning system for neurosurgery that, of course, starts with computed tomography (CT) or magnetic resonance imaging (MRI) scans. The novel aspect is showing the results registered on top of the patient. This guides the surgeon in deciding where to cut the hole in the skull and, once inside, how much tissue to remove.
Box 6.1 Computer Graphics Since 1980
Now we have systems that combine graphics with something in the outside world. My theme today is that we could push that capability forward for the rest of us. The next video provides an example of how to do this for applications in which the medical imagery is coming to you in real time. For guidance, you could use not only the old data, but real-time data as well. This is similar to the previous application, except now you are using the real-time imagery as part of the display in order to guide what you are doing.
In the next 10 years, we will find that the use of 3D graphics is increasingly commonplace. There will be a merging of the live video with the display. You might well ask what we need live video for in our everyday lives. I hesitate to tell you, but I think it is going to be the year-2000 version of the picture phone. I say this despite the fact that some of you might laugh, because picture phones have come around about every 10 years and so far have been disappointing to just about everybody. I expect that people are not going to be disappointed 10 years from now. There will be enough bandwidth, display capability, and computing to make teleconferencing a compelling shared presence, which the current generation of teleconferencing hardware cannot do. My personal view is that the crucial aspect of shared presence is going to be the capability to extract a sense of 3D from one place and show it at another place in a way that makes participants feel they are together, even if they are thousands of miles apart.
We are going to have to do what you might call desktop VR for modeling and 3D teleconferencing. In 10 years, I think we will have the equivalent of, say, a dozen laptops packaged within a single workstation, with about a dozen times the power and display capability of a laptop: more pixels (perhaps 10 million), higher resolution, and a wider field of view. Think about a picture-window kind of display made up of, for example, 10 or 12 laptop-size screens and 10 or 12 cameras. These cameras will use the workstation's computational resources to extract a model of the small environment you are in (a little office cubicle, for instance) and transmit that model to another place. In this way, together with head-tracked stereo, you could share the sense of presence with the person you are talking to or the two or three people you might be having a teleconference with. I believe the principal problem remaining is how to display these remote environments, because their data representation is probably not going to be polygonal. This is going to be the most exciting frontier: modeling from imagery and warping the images in such a way as to make the output image seem more realistic. A lot of people are working on this. The 1996 SIGGRAPH conference, for example, featured many papers on how to take imagery, warp it, and correct it for another perspective, making the output image more like a photograph than a picture of a set of polygons.
Box 6.2 Virtual Reality in Medicine: Examples
These are my conclusions. We are going to have some automatic model extraction from video sequences, but not model extraction in the intelligent sense. The system will not know what the object is, but it will know where surfaces are in your little office cubicle. We are going to have desktop VR-like environments of what you might call a 3D window into another world. In addition, and totally separate from this, we will have laptops evolving into wearable computers the size of pagers; we will keep graphic displays with us all the time as we walk around. Much larger, higher-resolution images in the form of tiled displays will become commonplace. The capability will be far beyond the same 1,000- by 1,000-pixel displays we have had for the past 10 or 15 years.
Donald P. Greenberg
What has happened in the computer graphics industry in the first 25 years that I worked in it? The first system I worked on was a General Electric machine, used to simulate the lunar landing vehicle docking with the Apollo spacecraft. It was as far as we could go in the 1960s: there was no lighting model, and colors were simply assigned to 3D geometries. This was the extent of the complexity that you could display in real time to train the astronauts in a nongravitational environment. About five years later, the graphics pipeline started. Without going through the details, you have a model; you have to describe where you are looking at it from; you do a perspective transformation; you convert that to raster operations; and in the process, you do the lighting, put the result into storage, and then display it. Around 1972, researchers at the University of Utah developed the Phong model. They never could see a full picture; there was only enough memory to see 16 scan lines at one time. The model was polygonally based and able to show diffuse reflections, transparency, and specular highlights. That was the basis of the graphics pipeline then; it is the basis of the graphics pipeline today.
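The lighting step of the pipeline can be sketched in a few lines. This is a minimal illustration, assuming the textbook form of the Phong reflection model (an ambient term plus diffuse and specular terms); the function name `phong_intensity` and the coefficient values are hypothetical choices for illustration, not the original Utah code, and transparency is left out.

```python
import math

def normalize(v):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phong_intensity(normal, to_light, to_viewer,
                    ka=0.1, kd=0.6, ks=0.3, shininess=20):
    """Ambient + diffuse + specular intensity at one surface point.

    ka, kd, ks and shininess are assumed, illustrative material values.
    """
    n, l, v = normalize(normal), normalize(to_light), normalize(to_viewer)
    n_dot_l = dot(n, l)
    if n_dot_l <= 0.0:
        # Light is behind the surface: only the ambient term remains.
        return ka
    # Mirror the light direction about the normal: r = 2(n.l)n - l
    r = tuple(2.0 * n_dot_l * nc - lc for nc, lc in zip(n, l))
    specular = max(dot(r, v), 0.0) ** shininess
    return ka + kd * n_dot_l + ks * specular
```

With the light and the viewer both directly above the surface, all three terms contribute fully; moving the light to a grazing angle dims the diffuse term and moves the highlight away from the viewer.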
The next step occurred around 1979. Turner Whitted from Bell Labs introduced ray tracing, which bypassed the camera perspective and raster operations. That is, you would send a ray through every pixel into the environment, compute where it reflected, pick up the accumulated light contributions, put an image in storage, and display it. The time required for an image as complex as his first publicly presented image was measured in days on a VAX-11/780, a 1-MIPS machine.
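The core loop of ray tracing, one ray per pixel, can be sketched as follows. This is a minimal illustration under stated assumptions: a single sphere, a camera at the origin looking down the negative z axis, and primary rays only. The recursive reflected rays that accumulate light contributions are left out, and the names `hit_sphere` and `render` are hypothetical.

```python
import math

def hit_sphere(origin, direction, center, radius):
    """Distance along the ray to the nearer intersection, or None on a miss."""
    oc = tuple(o - c for o, c in zip(origin, center))
    a = sum(d * d for d in direction)
    b = 2.0 * sum(d * o for d, o in zip(direction, oc))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None
    t = (-b - math.sqrt(disc)) / (2.0 * a)  # nearer root only
    return t if t > 0.0 else None

def render(width, height, center=(0.0, 0.0, -3.0), radius=1.0):
    """Cast one ray through each pixel; '#' marks a hit, '.' a miss."""
    rows = []
    for j in range(height):
        row = ""
        for i in range(width):
            # Map the pixel centre onto a [-1, 1] image plane at z = -1.
            x = (i + 0.5) / width * 2.0 - 1.0
            y = 1.0 - (j + 0.5) / height * 2.0
            t = hit_sphere((0.0, 0.0, 0.0), (x, y, -1.0), center, radius)
            row += "#" if t is not None else "."
        rows.append(row)
    return rows
```

Calling `print("\n".join(render(21, 11)))` draws a rough disc in text. The cost grows with the number of pixels times the number of objects tested per ray, which gives some feel for why early images took so long.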
We have progressed a lot; we can do pretty well today even with complex environments, generating a picture with, for example, 32 million polygons in it. We are trying to see what happens as we start to get into complex environments, because the old algorithms begin to break down. We can generate the 32-million-polygon image I showed only on machines that have 768 megabytes of local storage.
Interest in making pictures for the movie industry led to advances over ray tracing. Borrowing from thermodynamics, we tried a radiosity approach. Instead of tracing millions and millions of rays, we took an environment, broke it up into thousands of polygons, and solved the interaction equations among all of those polygons. This is not a simple problem because, in contrast to finite element analysis, where the interactions are between nearest neighbors, this approach relies on the global interaction among every part of the environment. There is a fully populated, nondiagonally dominated matrix that must be solved. In complex environments (for example, an occupied meeting room), the solution can exceed hundreds of thousands of polygons. Once the system is solved for a diffuse environment, we no longer have to compute the lighting, and we can use the graphics pipeline to display the result.
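The interaction equations can be sketched for a toy case. This is a minimal illustration, assuming the standard diffuse radiosity relation in which each patch's radiosity equals its own emission plus its reflectance times the light gathered from every other patch, weighted by form factors. The function name `solve_radiosity`, the simple iterative (Gauss-Seidel style) solver, and the two-patch form factors are assumptions for illustration; real form factors come from the scene geometry.

```python
def solve_radiosity(emission, reflectance, form_factors, iterations=100):
    """Iteratively solve B[i] = E[i] + rho[i] * sum_j F[i][j] * B[j].

    Every patch gathers from every other patch, so the system is global
    rather than nearest-neighbour.
    """
    n = len(emission)
    b = list(emission)  # start from the emitted light only
    for _ in range(iterations):
        for i in range(n):
            gathered = sum(form_factors[i][j] * b[j] for j in range(n))
            b[i] = emission[i] + reflectance[i] * gathered
    return b

# Two facing patches: patch 0 emits light, patch 1 only reflects it.
radiosities = solve_radiosity(
    emission=[1.0, 0.0],
    reflectance=[0.5, 0.5],
    form_factors=[[0.0, 0.5], [0.5, 0.0]],  # assumed, illustrative values
)
```

Here the emitting patch ends up slightly brighter than its own emission (16/15) because some light bounces back from the other patch (4/15). The result does not depend on the viewpoint.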
So where are we headed? Well, part of the problem we have had—and it is our own fault—is the fact that we have been too successful in making good-looking images. We cannot really discern whether the picture we are looking at is real or synthetic. If we think about what is happening now and what will happen in the future, we have to look at graphics rendering, at least, as a modeling of the physical world. Once we have modeled the physical world, we have to create the image of the physical world, and then we have to see how the human mind would evaluate it in the perceptual world.
Beginning with a model of the physical world, which has not only the geometry but the material properties on a wavelength basis (the light on a goniometric basis with its full distributions), we can do the correct energy distribution throughout the environment and find out what radiance is coming from every surface in the environment. Then we want to create the picture from where the camera is. Our camera really is our eye, which is made up of curved lenses, retinas, and so on. We have to create the perceptual image, taking into account the perceptual response and the physiological change. Then there is the interpretation, where the cognitive is not necessarily just what you see in the image; this is a tough problem.
So how do we start? Well, we take off our computer science hats and go back to Maxwell's equations from the 1800s. We try to get a simulation of the reflection model, which includes the angle of incidence, the wavelength, and the roughness and distribution of the surface properties—how much it absorbs and how much it reflects. We get an arbitrary distribution in directional diffuse and specular, a very complex five-dimensional problem. We have to do this: we want to have the model because we do not even have the measurements of most materials.
I want to go back to a comment that Edward Feigenbaum made. The problem we have in computer graphics is perhaps a problem in computer science in general; that is, we forget the engineering and experimental part of it. What we really have to do is test these paradigms in terms of their computational complexity. We have to do the experimentation to make sure that our algorithms are correct. We then want to distribute it, but in fact, we can distribute it right now only for diffuse surfaces because the rest is too complex when using the radiosity algorithms.
At Cornell, we have set up a $500,000 measurement laboratory to try to measure bidirectional reflectance distribution functions (BRDFs). We have come up with an approach using particle tracing, not the correlate techniques described by physicists years ago. Now we will send out billions and billions of photons, determine every surface they hit at every wavelength from every direction, and then try to accumulate this statistically (we now have bounds on the density distribution functions and their variance). Then we will be able to start to display these nonexistent environments by sending out these billions of photons. If we go into our measurement lab, build these environments, compare them, and then find they are correct, we will know that the simulations are accurate.
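The statistical accumulation can be sketched with a toy particle tracer. This is a minimal illustration under strong assumptions, not the Cornell method: a closed environment in which every surface has the same diffuse reflectance, a uniform random choice of the next surface standing in for a real ray cast, and no wavelength or direction dependence. The name `trace_photons` is hypothetical.

```python
import random

def trace_photons(num_photons, reflectance, num_surfaces=4, seed=0):
    """Follow each photon until it is absorbed; return hit counts per surface."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    hits = [0] * num_surfaces
    for _ in range(num_photons):
        while True:
            surface = rng.randrange(num_surfaces)  # stand-in for a ray cast
            hits[surface] += 1
            if rng.random() >= reflectance:
                break  # photon absorbed at this surface
    return hits

hits = trace_photons(100_000, reflectance=0.5)
hits_per_photon = sum(hits) / 100_000
```

With reflectance 0.5, each photon is expected to hit 1 / (1 - 0.5) = 2 surfaces on average, and the estimate tightens as more photons are sent, which is the sense in which the accumulated counts come with bounds on their variance.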
We also model the human eye—the reflection of the cornea, the lens, and the scattering of light inside the material so we can get the glare effects. We model the adaptive perceptions so that we see what something might look like in moonlight versus daylight when, in fact, our spatial and color acuities are very different, depending on whether the rods or the cones are involved. Then we build models, simulate them, and assess what is real and what is false.
So what is going to happen in the future? It is clear that as the environments get more complex, we are going to have more polygons. Displays are going to have in the neighborhood of 1 million to 2 million pixels, so the ratio of pixels to polygons is going to approach one. When the pipeline started, there were about 400 pixels per polygon. Then it went to 100, to 50, and now down to 10. In the future, the balance point is going to be about one pixel per polygon, and I contend the graphics pipeline is going to go away.
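The arithmetic behind these ratios is easy to check. In this small sketch the 1,000 by 1,000 display and the polygon counts are assumed values chosen to reproduce the quoted ratios; they are not figures from the talk.

```python
def pixels_per_polygon(width, height, polygons):
    """Average number of screen pixels available to each polygon."""
    return (width * height) / polygons

# Assumed scene sizes on a 1,000 x 1,000 display (1 million pixels).
for polygons in (2_500, 10_000, 20_000, 100_000, 1_000_000):
    ratio = pixels_per_polygon(1000, 1000, polygons)
    print(f"{polygons:>9,} polygons -> {ratio:g} pixels per polygon")
```

Once a scene holds about as many polygons as the screen has pixels, most of the per-polygon setup work in a pipeline is spent on polygons that cover a single pixel or less.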
Because we will have to disambiguate complex environments, we need all the perceptual cues, such as shadows and interreflections. We will need global illumination algorithms, which are going to become common. As Andy van Dam said, we are going to need the progressive algorithms so we can get everything in real time. We will need to have a transition from WYSIWYG to WYSIWII. The former refers to "what you see is what you get." The movie industry is very happy with this because as long as the image is believable, the industry does not care whether it is real or not. I claim that it has to be "what you see is what it is" (WYSIWII) so that we can start to use these things to simulate what is going to be.
I would like to make three closing comments. First, with increasing processing and bandwidth, the real-time realistic displays—the "holy grail"—will occur within the next decade. If bandwidth is cheaper than processing and memory, then the client-server or cluster-of-computing paradigm will hold. If processing and memory are cheaper than bandwidth, then perhaps we will have local computation and display. In either case, however, we are going to have real-time, realistic image generation.
Second, I hope we will reach the stage soon where simulations can be accepted as valid tests. We do auto crashes that way. We do not build chips by building one and testing to see if it works. We test it on a simulator many, many times, so that on its first computation, it will work. At least in the design of automobile bodies or architectural interiors, for example, why can't we make sure that what we are going to see is what it is going to be? Again, the problem is that we have been too convincing in our images. If we start to believe in these tests, we could also start to use our simulated environments to improve the inverse rendering approaches and some of our computer vision problems.
Third, I would like to talk about the good news-bad news problem. The good-news problem is that 90 percent of the information goes through our visual system. The bad-news problem is that seeing is believing, and once we see the picture, we are liable to believe it. Photography will no longer be admissible in court. Seeing will not be believing, and it is going to be dangerous when we start to mix the real and virtual worlds. A picture is worth 1,024 words.
Andries van Dam
I am going to talk about what I call post-WIMP user interfaces, WIMP meaning windows, icons, menus, and pointing. Why user interfaces? Because we finally discovered that having raw functionality is not sufficient. If you cannot manifest that raw functionality in a usable form to ordinary users, it is not nearly good enough. Industries have discovered this and now have usability labs. Unfortunately, academia is still lagging a bit. The user interface does not get the attention that it deserves in computer science curricula, but this is a separate matter.
I will talk about what lies beyond the WIMP interfaces that we are familiar with today. This was, of course, presaged by Michael Dertouzos's plea that we should be much more interested in usability than we are and that computers should be much more user-friendly. This, to me, is an ideal. It was, by the way, considered an idea of the lunatic fringe 20 years ago that graphical user interfaces (GUIs) could make it possible for preschoolers (children as young as 2 years of age, I have been told, who can neither read nor write) to use computers productively. This is revolutionary, but it is just the beginning.
What else should we be thinking about? What is good about WIMP GUIs and what are they missing? What is good about them is ease of learning, ease of transferability from application to application, and ease of implementability, thanks to many layers of software. However, not every user fits that paradigm. Many people no longer want to point and click—those with repetitive stress injuries, for example, for whom speech is going to be far more important. Also, the layers of software support are both a feature and a bug. There is a huge learning curve in becoming a user interface programmer.
I am more concerned about the fundamental built-in limitations of WIMP user interfaces. They rest on a finite-state automaton (FSA) model where the application sits in a state, waits for an input, and then reacts to it. It is half-duplex at its worst. Human conversation, on the other hand, is marked by the fact that it is full-duplex and multichanneled (that is, you see me gesturing while you are listening to me talking), and user interfaces do not mimic this at all. Key presses are discrete, but if we are involved in gesture recognition for handwriting or speech recognition, all of a sudden input is continuous and has to be disambiguated. The problem becomes far more difficult as soon as we move outside the boundaries of the WIMP user interface.
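The sit-wait-react model can be sketched as a small transition table. This is a minimal illustration; the state names, event names, and the function `run_interface` are hypothetical and do not come from any particular toolkit.

```python
# (current state, discrete input event) -> next state
TRANSITIONS = {
    ("idle", "mouse_down_on_icon"): "dragging",
    ("idle", "mouse_down_on_menu"): "menu_open",
    ("dragging", "mouse_up"): "idle",
    ("menu_open", "mouse_up"): "idle",
}

def run_interface(events, state="idle"):
    """Process one discrete event at a time and record each state visited."""
    trace = [state]
    for event in events:
        # Events with no entry in the table leave the state unchanged.
        state = TRANSITIONS.get((state, event), state)
        trace.append(state)
    return trace

trace = run_interface(["mouse_down_on_icon", "mouse_up", "mouse_down_on_menu"])
```

Every input here is a single unambiguous token handled strictly in turn. Continuous speech or gesture input does not arrive in that form, so a recognizer would first have to decide, with some probability of error, which event occurred at all.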
The WIMP GUI does not appeal to our other senses. It just uses the visual channel for output and the tactile channel for input. This is not nearly good enough. We would not be able to communicate well if all we had was vision, so we need to appeal to our other senses. Henry Fuchs talked a little about immersive virtual reality. You are not about to carry your keyboard and your mouse into an immersive environment where you can walk around freely—where the computer is continuously tracking your body, head, hands, and maybe even your gaze, and is communicating with you in a style that is very different from the WIMP GUI you are accustomed to.
So what is a post-WIMP user interface? Clearly, it has to solve the problems of multiple parallel channels and multiple participants. Computing as a solitary vice is largely going to go away and be replaced by teamwork and collaboration supported by computer tools. A post-WIMP user interface will require much higher bandwidth than we are used to for keyboards and mice. It will be based on probabilistic, continuous input that has to be disambiguated (consider today's experiences with handwriting for personal digital assistants or data gloves for virtual reality); it will need to have backtracking. The objects, unlike our desktop objects that just sit there and have no autonomous behavior, will have their own built-in behaviors, and we will interact with them as they are interacting with each other. Twenty years ago, the Media Lab at the Massachusetts Institute of Technology had a wonderful demo ("Put That There") of somebody who could talk, gesture, and interact with objects on the screen that did their own thing; this is the kind of future we are moving toward.
WIMP GUIs are not going to go away; they will be augmented, suitably and appropriately. I see a spectrum of user interface techniques. On the one hand, there will be direct control: I, as a user, will be empowered to tell the application what I want it to do. On the other hand, there will also be indirect control where I can dispatch viruses or agents, however you view them, that will operate on my behalf. They will communicate with me in reasonable ways to let me know what they are up to so I can adjust them and make sure that they are not harmful. Agents and wizards in primitive forms are already here; you will see them used far more in the future, for better and for worse, especially on the Internet. Finally, we are going to have speech and gesture input.
Box 6.3 Challenges for 3D User Interfaces, Virtual Reality, and AR
A different set of user interface techniques based on 3D graphics holds promise. These involve 3D widgets and a sketching user interface that employs gesture recognition rather than traditional WIMP techniques. The 3D widget is a specialized tool that lives in the same 3-space as the 3D application object. The sketching interface provides a more limited vocabulary of direct manipulation.
There are a number of challenges that still have to be met before any of these techniques are used routinely in industry (Box 6.3). One that I can talk about briefly is time-critical computing, which the networking community calls quality of service. Instead of having algorithms that compute perfectly and take however long they require, we want algorithms that yield a usable result within a given time limit and produce higher-quality results if given more time. This will enable us to schedule the number of frames that we are able to generate and avoid motion sickness. Such time-critical computing requires a new way of looking at algorithms, something the artificial intelligence folks have been doing for a while. Object models are another pet peeve of mine. All of the current object-oriented programming languages have static class-instance hierarchies. In the 3D graphics multimedia world, we need dynamic, evolving objects that can change their type and their membership on the fly at run time. This is not supported, nor of course is any kind of input or output, which is all relegated to a library.
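The idea of a usable result within a time limit, improving if more time is allowed, can be sketched as follows. This is a minimal illustration: the series for pi stands in for any computation that can be refined step by step (a rendering pass, for instance), and the names `refine_pi`, `best_within`, and `ticking_clock` are hypothetical.

```python
import time

def refine_pi():
    """Yield successively better estimates of pi (Leibniz series partial sums)."""
    total, sign, k = 0.0, 1.0, 0
    while True:
        total += sign / (2 * k + 1)
        sign, k = -sign, k + 1
        yield 4.0 * total

def best_within(refinements, budget_s, clock=time.perf_counter):
    """Return the latest estimate available when the time budget runs out.

    At least one refinement step always runs, so there is always a result.
    """
    deadline = clock() + budget_s
    best = None
    for estimate in refinements:
        best = estimate
        if clock() >= deadline:
            break
    return best

def ticking_clock(tick=1.0):
    """Deterministic clock for demonstrations: advances by `tick` per call."""
    now = 0.0
    def clock():
        nonlocal now
        now += tick
        return now
    return clock

coarse = best_within(refine_pi(), 1, clock=ticking_clock())
finer = best_within(refine_pi(), 200, clock=ticking_clock())
```

With the real clock, `best_within(refine_pi(), 0.01)` returns whatever accuracy about ten milliseconds buys. A frame scheduler could call such a routine with whatever time remains before the next frame is due.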
In conclusion, things are going to change pretty dramatically in the future. Change will be driven not so much by what is happening in our university research laboratories as by the driving forces of the entertainment industry, which is putting enormous resources into making this revolution happen. In the graphics community, we are all very keen on these new technologies, but we need to conduct usability studies to find out whether they are really useful. This is terra incognita, but it is a very exciting new frontier.
WILLIAM PRESS: I have a question for Donald Greenberg. An idea whose time comes and goes and never really arrives is eye tracking, because people have estimated that the real bit rate into the entire visual cortex is perhaps only a megabyte. Does that have any future now?
DONALD GREENBERG: It definitely has a future. We are making a lot of progress on it. We are also looking at the difference between foveal and peripheral vision, which means that we can reduce the amount of resolution we need on the sides. Henry is conducting the particular tracking experiment.
HENRY FUCHS: You characterize it accurately: it is something that comes and goes but has not really arrived. I think it is like the picture phone. An optimist will say that when all the necessary technological components have arrived, then it will have some advantages. I think it has not arrived because various pieces of the puzzle are not in place yet.
Even when we are able to do eye tracking, for example, we are unable to take advantage of it because of other limitations such as the display mechanism itself, the graphic systems, and various other things. My guess is that, in 10 years, there will be a few selected applications, but it will take at least 10 years until we see some part of it have an impact.