
A Visionary Session

The Nature of Computer Vision

The second session of the workshop concerned research in computer vision. As explained by Larry Davis of the University of Maryland, the central challenge of the field is to give computers the basic visual abilities that humans possess: for example, deciding whether a person is in view, interpreting a streetscape well enough to make navigation decisions for an automobile, and visualizing what a scene looks like from a different viewpoint or under different lighting conditions. The input to these decision processes is one or more digital images. A two-dimensional image is divided into rectangular boxes—pixels—and for each pixel there is an observed value (or set of values if, for example, the image is in color rather than black and white). Needless to say, the observations can be quite noisy, so it is important that algorithms be robust. Tony Chan (UCLA), who moderated the session, noted that there are major challenges in this field. It is necessary to operate with only incomplete information because of factors such as motion or occlusion of some objects by others. In addition, many problems are inherently ill-posed.

Jitendra Malik from the University of California at Berkeley presented an overview of many of the open problems facing computer vision researchers. He divided the problems into four topic areas: image reconstruction, visual control, image segmentation, and object recognition.

Image Reconstruction

The reconstruction problem is an inverse problem: construct a three-dimensional scene given a two-dimensional image of it. A goal is to recover characteristics such as the geometry of surfaces, reflectances, and motion. Human viewers get clues from stereo vision, symmetries, prior expectations, and the multiple views imparted by motion, and these are the basic tools that underlie computer reconstructions, too (Problem 10, Problem 11, and Problem 12).

 Problem 10. Given multiple views of a scene, find the corresponding features in the views. (Current algorithms are about 95 percent accurate, but this is not good enough to yield reconstructions that are completely satisfying to human viewers.)
 Problem 11. Estimate the three-dimensional scene from multiple views. This problem is solved if the mapping is simple, i.e., (x, y, z) maps to (a(x), b(y), z), but not for perspective projection. (One workshop participant noted that total least squares, or errors-in-variables, is a possible approach to this problem.)
 Problem 12. Develop effective algorithms for modeling reflectance and texture for natural objects. The basic physics has been known for 100 years, but modeling surfaces such as human skin still cannot be done satisfactorily.
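
As a minimal sketch of the total least squares (errors-in-variables) idea raised under Problem 11, the Python fragment below fits a line to two-dimensional points whose x and y coordinates are both noisy, minimizing perpendicular rather than vertical distances. It illustrates only the estimation principle in its simplest setting, not a three-dimensional reconstruction; the function names and the sample points are hypothetical and do not come from the workshop.

```python
import math

def total_least_squares_line(points):
    """Fit a line to 2-D points by minimizing perpendicular distances.

    Returns (cx, cy, theta): the centroid, which lies on the fitted line,
    and the direction angle of the line in radians.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Entries of the 2x2 scatter matrix of the centered points.
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    # Angle of the principal (largest-variance) axis of the scatter matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return cx, cy, theta

def perpendicular_residual(point, line):
    """Signed distance from a point to the fitted line."""
    cx, cy, theta = line
    x, y = point
    # Project the offset from the centroid onto the unit normal (-sin, cos).
    return -(x - cx) * math.sin(theta) + (y - cy) * math.cos(theta)

# Hypothetical measurements scattered around y = 2x + 1, noisy in x and y.
points = [(0.0, 1.1), (1.0, 2.9), (2.1, 5.0), (2.9, 7.1), (4.0, 8.9)]
line = total_least_squares_line(points)
slope = math.tan(line[2])
residuals = [perpendicular_residual(p, line) for p in points]
print(round(slope, 2), [round(r, 3) for r in residuals])
```

Ordinary least squares would treat x as exact and penalize only vertical error; treating both coordinates as noisy is what makes the approach a candidate when every image measurement is uncertain.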

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001


The Interface of Three Areas of Computer Science with the Mathematical Sciences: Summary of a Workshop

Visual Control

A second interesting area of research is visual control, used to provide visual feedback for obstacle avoidance and locomotion guidance. In the mid-1990s, a computer system developed by Carnegie Mellon University successfully controlled a car driving cross-country for more than 90 percent of the travel time, although not in urban or congested areas. The control operates on a hierarchy of levels: the system needs to make low-level decisions, such as which lane to travel in, while taking into account upcoming decisions, such as when to make the next turn, and these decisions must be moderated by a high-level plan that navigates the vehicle toward its final destination (Problem 13).

 Problem 13. If we control a navigation system by making use of visual information, there will inevitably be delays in the feedback loop due to processing time. Thus we need to design control laws involving look-ahead to avoid instabilities in control.

Image Segmentation

Segmentation is a basic task in image processing: it is performed to partition a collection of pixels into objects. Object boundaries are cued by differences in brightness, color, texture, and/or stereographic depth among neighboring pixels, and their detection can be aided by motion and the recognition of certain common objects (e.g., a chair). This processing is usually performed bottom-up (processing pixels to determine objects), but a top-down approach can also be useful: for instance, if the first processing steps suggest that a human figure is present, a top-down model might then be invoked to list body parts that ought to be in the scene, and perhaps to suggest where on the image the pixels might constitute a face. Segmentation also can be performed by surface fitting or by probabilistic inference with a Markov random field model.

 Problem 14. Perform segmentation by unifying the bottom-up and top-down approaches and making use of all of the visual cues (brightness, color, depth, texture, and so on).

More recently, graph partitioning has been applied to the problem.3 Each pixel is a node in the graph, and the weight of an edge is based on the similarity between pairs of pixels in features such as brightness, texture, and their coordinate differences. The eigenvectors of the graph Laplacian can be used for partitioning the pixels into segments, but the mathematical theory is incomplete: properties of the eigenvectors are not well understood. Applications to computer vision problems have been pursued.4

3 Shi, J., and Malik, J., Self-inducing relational distance and its application to image segmentation, Proc. of Fifth European Conference on Computer Vision, H. Burkhardt and B. Neumann, eds., Springer-Verlag, Berlin, 1998, pp. 528-543.
4 See, for instance, Boykov, Y., Veksler, O., and Zabih, R., Fast approximate energy minimization via graph cuts, Proc. of Seventh IEEE International Conference on Computer Vision, IEEE Comput. Soc., Los Alamitos, Calif., 1999, pp. 377-384; Ishikawa, H., and Geiger, D., Occlusions, discontinuities, and epipolar lines in stereo, Proc. of Fifth European Conference on Computer Vision, H. Burkhardt and B. Neumann, eds., Springer-Verlag, Berlin, 1998, pp. 232-248; and Roy, S., and Cox, I.J., A maximum-flow formulation of the N-camera stereo correspondence problem, Sixth International Conference on Computer Vision, Narosa Publishing House, New Delhi, 1998, pp. 492-499.
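
The graph-partitioning idea described in this section can be sketched in a few lines of Python. The fragment treats a row of eight pixels as graph nodes, weights each edge by brightness similarity and coordinate distance, and splits the pixels by the sign of the Laplacian eigenvector with the second-smallest eigenvalue. Several choices here are assumptions made for illustration and are not taken from the cited papers: the Gaussian form of the weights, the values of `sigma_i` and `sigma_x`, the use of the unnormalized Laplacian L = D - W, and the shifted power iteration used in place of a library eigensolver.

```python
import math

def affinity_matrix(brightness, sigma_i=0.2, sigma_x=3.0):
    """Edge weights from brightness similarity and pixel proximity (assumed Gaussian form)."""
    n = len(brightness)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d_bright = (brightness[i] - brightness[j]) / sigma_i
                d_pos = (i - j) / sigma_x
                weights[i][j] = math.exp(-d_bright ** 2 - d_pos ** 2)
    return weights

def fiedler_vector(weights, iterations=2000):
    """Approximate the eigenvector of L = D - W with the second-smallest eigenvalue."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    # Twice the largest degree bounds the eigenvalues of L (Gershgorin),
    # so shift*I - L has nonnegative eigenvalues in reversed order.
    shift = 2.0 * max(degree)
    v = [i - (n - 1) / 2.0 for i in range(n)]  # zero-mean starting vector
    for _ in range(iterations):
        # One power-iteration step with (shift*I - L) = (shift*I - D) + W.
        v = [(shift - degree[i]) * v[i]
             + sum(weights[i][j] * v[j] for j in range(n))
             for i in range(n)]
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the constant eigenvector of L
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

# Hypothetical row of pixels: four dark ones followed by four bright ones.
brightness = [0.10, 0.12, 0.09, 0.11, 0.90, 0.88, 0.91, 0.90]
vector = fiedler_vector(affinity_matrix(brightness))
labels = [0 if x < 0 else 1 for x in vector]
print(labels)
```

For a real image the graph has one node per pixel, so a sparse eigensolver would replace the dense loops used here.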

Object Recognition

A fourth interesting area is the recognition of classes of objects (such as an automobile) or of an instance of an object in a class (such as a Volkswagen Beetle). Humans may be able to distinguish 10,000 to 100,000 objects and are tolerant of different views, illuminations, and occlusions, whereas computer systems are not (Problem 15 and Problem 16).

 Problem 15. Develop a system that can infer material properties (for example, distinguishing between skin and metal) and can make consequential decisions, such as how hard to grasp an object.
 Problem 16. Develop a unified framework for segmentation and recognition, representing variabilities within categories (e.g., shape) by using the interplay between discriminative models (neural networks, support vector machines) and generative models (probabilistic models that can synthesize new instances).

More Difficult Types of Object Recognition

Larry Davis emphasized high-level problems in his presentation, noting that segmentation is in part predicated on what level of detail one is looking for. For instance, in some imaging contexts it is necessary to identify a particular set of pixels as belonging to an automobile's bumper, while in others the critical need is to cluster all the pixels that compose the entire car. Motion is a helpful clue for segmentation, but it presents its own challenges (Problem 17).

 Problem 17. Distinguish the moving object itself from its shadows and reflections, especially in the presence of dynamic changes in background (owing, for example, to rain, wind, and lighting).

Rigid objects are identified by matching to a library of images, or by characterizing their shape and texture. Deformable objects present more challenges but more opportunities; detection of emotion, for instance, depends on studying the deformability of faces (Problem 18).

 Problem 18. Identify deformable objects, with one of the hardest cases being human faces.

Davis noted that visual learning is essential at every level of image processing. We must reason about what we see, using static and dynamic analysis to identify objects and to refine hypotheses after further observations.

Differential Equation Models in Vision

Guillermo Sapiro of the University of Minnesota discussed some applications of partial differential equations to vision. He noted that ideas that are commonplace in one field often become the “hottest developments” in a different one, so communication among researchers with different backgrounds is essential. Over the course of the session, several participants presented examples in which computer science ideas were rediscovered in the mathematical science literature and in which mathematical ideas were re-derived for vision applications. Sapiro presented examples in which diffusion equations (similar to those used in thin film fluid dynamics but with different boundary conditions) can be used to fill in objects that are partially occluded.

The Modeling Problem

Even young children know which visual cues to use in order to segment an image: sometimes shape, sometimes color, sometimes texture (Problem 19).

 Problem 19. How can a computer know which visual cues are important in each instance?

Other problems were posed during the plenary discussion by workshop participants (Problem 20 through Problem 23).

 Problem 20. Use differential geometry to match features in three-dimensional images, such as brain scans, just as it is used to match two-dimensional images today.
 Problem 21. Segment images by the distance between the object and the camera in order to be able to peel off subimages to isolate deeper layers.
 Problem 22. Compress images subject to constraints that preserve certain critical features such as coastlines.
 Problem 23. Create physical models and structural models with elastic component parts that capture the function of a complicated deformable object such as the human face.

Many problems in computer vision arise from projective geometry (Problem 24).

 Problem 24. Study orthogonal projection on non-flat manifolds, since perspective geometry leads to cones.

Other problems arise from possible mechanistic connections between computer vision and biological vision. Tomaso Poggio of MIT asked whether, in biological vision, an object is represented as a series of views or as a three-dimensional entity. More generally, how closely should computer vision algorithms relate to those of human vision?
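
The use of diffusion to fill in occluded regions, mentioned in the discussion of differential equation models, can be illustrated with a deliberately simplified Python sketch. It runs the plainest second-order diffusion (repeated four-neighbor averaging, i.e., explicit heat-equation steps) inside a masked region while holding the known pixels fixed. This is a stand-in chosen for brevity, not the equations Sapiro presented, and the function name, image, and mask are hypothetical.

```python
def diffuse_fill(image, mask, iterations=500):
    """Fill masked pixels by diffusing values inward from the known pixels.

    image: list of rows of floats; mask[r][c] is True where the value is
    unknown. Known pixels act as fixed boundary conditions.
    """
    rows, cols = len(image), len(image[0])
    current = [[0.0 if mask[r][c] else image[r][c] for c in range(cols)]
               for r in range(rows)]
    for _ in range(iterations):
        updated = [row[:] for row in current]
        for r in range(rows):
            for c in range(cols):
                if not mask[r][c]:
                    continue  # known pixels stay fixed
                # Average of the neighbors that lie inside the image.
                neighbors = [current[rr][cc]
                             for rr, cc in ((r - 1, c), (r + 1, c),
                                            (r, c - 1), (r, c + 1))
                             if 0 <= rr < rows and 0 <= cc < cols]
                updated[r][c] = sum(neighbors) / len(neighbors)
        current = updated
    return current

# Hypothetical 5x5 image with a left-to-right brightness ramp;
# the central 3x3 block is treated as occluded.
image = [[0.25 * c for c in range(5)] for _ in range(5)]
mask = [[1 <= r <= 3 and 1 <= c <= 3 for c in range(5)] for r in range(5)]
filled = diffuse_fill(image, mask)
for row in filled:
    print([round(value, 3) for value in row])
```

Plain diffusion of this kind recovers smooth shading but blurs edges that cross the hole, which is one reason more elaborate equations are studied for this task.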

Summary of Computer Vision Session

Research on computer vision is an example of a successful two-way street between mathematics and computer science: interesting mathematics has been used by vision researchers to make practical tools and to develop commercially important innovations such as the JPEG standard for image storage. In turn, the work of some vision researchers in using anisotropic diffusion, for instance, led mathematicians to study the differential equation model developed by those researchers. Stanley Osher of UCLA noted another example of rich interaction, wherein a paper5 that developed axioms for morphology-based image processing led to the analysis of motion of level sets by affine mean curvature. Also, L. Rudin's PhD thesis in computer science (from the California Institute of Technology) introduced shock wave theory and BV (bounded variation) spaces to image restoration.

The mathematical foundations of computer vision are well recognized. Geometry has contributed the basis for projections and for the level set method, and topology aids in the understanding of deformable surfaces. Probability and statistics are used in estimation under uncertainty. Tools such as wavelet analysis and methods for solving diffusion problems and evolving vector fields have been contributed by analysis and partial differential equations, while an understanding of properties such as reflectance and radiosity comes from mathematical physics. Tools from harmonic analysis, for example, have been basic to the development of the JPEG standard for storing images. Algebraic geometry has been useful in work on curve recognition. Computational mathematics is essential to developing fast algorithms, while many problems are best understood in terms of graph theory.

Despite the richness of past interactions, there is a potential for much more collaboration. Jitendra Malik noted that the connection between theoretical computer science and combinatorics has not yet been fully exploited. In order to make progress in computer vision, researchers need to jump from continuous formulations to discrete ones and back, giving an even greater opportunity for interplay.

5 Alvarez, L., Guichard, F., Lions, P.L., and Morel, J.M., Axioms and fundamental equations of image processing, Archive for Rational Mechanics and Analysis, Vol. 123 (1993), pp. 199-257.