

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




OCR for page 261
Roberta Lenczowski, "Introduction by Session Chair"

Roberta E. Lenczowski is technical director, National Imagery and Mapping Agency (NIMA), in Bethesda, Maryland. Ms. Lenczowski earned her B.A. in philosophy from Creighton University in 1963. She completed her M.A. in philosophy at St. Louis University in 1970. In November 1977, Ms. Lenczowski began her professional career with the Defense Mapping Agency, now the National Imagery and Mapping Agency. She held numerous technical and supervisory positions before her appointment as technical director in 2001.

DR. LENCZOWSKI: Let me try to get this session started, albeit a little bit late; I know that the discussions have been very enthusiastic and very, very interesting. One of our speakers, Dr. Rabiner, has to leave by a quarter to 6:00, so I don't want to do a whole lot of introduction with respect to my background. Suffice it to say, I am from the National Imagery and Mapping Agency. As a result, I have a very vested interest in understanding imagery analysis. So, based upon the short abstracts I have read for each of these speakers, I know that the things they are going to say will be very relevant to work that we do within the agency.

I will introduce each of the speakers and give you a little bit of an insight, a tickler, if you will, in terms of what their topic is. When they have completed, then, of course, we will open this to a discussion and an exchange.

Our first speaker is Dr. Malik. He received his Ph.D. in computer science from Stanford University in 1985. In 1986, he joined the faculty of the Computer Science Division in the Department of Electrical Engineering and Computer Science at the University of California at Berkeley, where he is currently a professor.

For those of you who are not familiar with his background, his research interests are in computer vision and computational modeling of human vision. His work spans a range of topics in vision, including image segmentation and grouping, texture, stereopsis, object recognition, image-based modeling and rendering, content-based image querying, and intelligent vehicle highway systems. He has authored or co-authored more than a hundred research papers on these topics. So, he is an incredible asset to the discussion as we continue.

I need to point out that, in my own background since the very early eighties, I have been following the work that has gone on in the community that I am predominantly from, that being the mapping side of the National Imagery and Mapping Agency, and I have watched as we have attempted to gain insights into some of the research issues of automatic, or what we refer to now as assisted, feature recognition, and automated or assisted target recognition.

So, one of the things that Dr. Malik plans to talk about, in terms of recognizing objects and actions, is analyzing images with the objective of recognizing those objects or those activities in the scene that is being

imaged. As we get into the later discussion period, hopefully I will have an opportunity to provide some more descriptive information about why this is so relevant to the business that we are in and how that, in fact, is of great support with respect to the core topic here of homeland security.

Introduction by Session Chair
Roberta Lenczowski

Dr. Lenczowski introduced herself as a scientist from the National Imagery and Mapping Agency who has a very vested interest in understanding imagery analysis. Recognizing objects and actions means analyzing images with the objective of recognizing those objects or those activities in the scene that is being imaged. Such analysis is important with respect to the core topic of homeland security.

Jitendra Malik, "Computational Vision"

Jitendra Malik was born in Mathura, India, in 1960. He received a B.Tech. degree in electrical engineering from the Indian Institute of Technology, Kanpur, in 1980 and a Ph.D. degree in computer science from Stanford University in 1985. In January 1986, he joined the University of California at Berkeley, where he is currently the Arthur J. Chick Endowed Professor of EECS and the associate chair for the Computer Science Division. He is also on the faculty of the Cognitive Science and Vision Science groups. His research interests are in computer vision and the computational modeling of human vision. His work spans a range of topics in vision, including image segmentation and grouping, texture, stereopsis, object recognition, image-based modeling and rendering, content-based image querying, and intelligent vehicle highway systems. He has authored or coauthored more than a hundred research papers on these topics. He received the gold medal for the best graduating student in electrical engineering from IIT Kanpur in 1980, a Presidential Young Investigator Award in 1989, and the Rosenbaum Fellowship for the Computer Vision Programme at the Newton Institute of Mathematical Sciences, University of Cambridge, in 1993. He received the Diane S. McEntyre Award for Excellence in Teaching from the Computer Science Division, University of California at Berkeley, in 2000. He was awarded a Miller Research Professorship in 2001. He serves on the editorial boards of the International Journal of Computer Vision and the Journal of Vision and on the scientific advisory board of the Pacific Institute for the Mathematical Sciences.

DR. MALIK: Thank you, Dr. Lenczowski. Let's begin at the beginning. How do we get these images, and what do I mean by images? The first thing to realize is that images arise from a variety of sources: they could be volumetric images, plain 2D images, or images over time. Of course, we have the canonical example of optical images, which are getting to be extremely cheap and extremely high resolution. There are going to be zillions of these devices everywhere, whether we like it or not. But I also wanted to mention other modalities, which have their own advantages. Take the x-ray machines you walk through at the airport: those are not tomography machines, but they increasingly will be, and thus they will give access to 3D data. This may be of great relevance to automated techniques for detecting weapons, explosives, and so on. There are range sensors, which give you not just information about the brightness coming back but also depths directly. We refer to this as 2.5D data, because it is not the full volume. You could have infrared sensors, and no doubt many other exotic techniques are constantly being developed.

I won't talk about the reconstruction part of the problem. I assume that we have the images; let's process them further. So, this is the core problem. We start off with what is just a collection of pixels, with some attributes and some spatial index: XY, XYZ, XYZT, whatever. But we want to be able to say what is in the scene. I am going to use examples from optical images, but many of the techniques do carry over. That is a recognition problem.

So, when we look at this image, we all recognize it. Maybe you have been to the Louvre. Maybe it is because this image is splattered over a zillion places. But probably you have not seen him before. So, I wish to argue that the ability to identify that something is a face doesn't rely on your having seen that specific face, and, of course, faces come in many varieties.

So, the key point to note is that recognition happens at varying levels of specificity. There is the category level: faces, cars. There is the specific instance: the Mona Lisa, or it could be a specific person in a specific expression. The second key point, and this is a very important one, is that we want to recognize objects in spite of certain kinds of variation, variation such as changes in pose. You can look at the person from different

viewpoints. There are changes in lighting, and, in real images, you would have occlusion: there would be clutter, and there would be objects hiding each other. This is why just thinking of it as a traditional statistical classification problem is not quite adequate, because in that framework these variations are treated as noise. Some of these variations, due to geometry and lighting, result from the laws of physics and are very systematic, so treating them as noise may not be the right thing to do.

So, that is about objects. But we can talk about action recognition as well, particularly when we have video data. Here is just a collection of common English words that are really associated with actions; you can read the list there. They involve movement and posture change, object manipulation, conversational gesture, moving the hands about, sign language, and so on. Certainly, we can recognize these as well.

So, how shall we do this? What I am going to do in this talk is work our way through this area, and in the later part I am going to mostly talk about work and approaches taken in my group, but I wanted to make the survey sort of inclusive. So, initially, I am going to talk about face recognition. Why do I pick faces? Faces are really just another class of object.

So, in principle, the same techniques apply. However, it is a very special and very important domain. It has been argued, and there have been experiments supporting this, that a baby at the age of one day can distinguish a face from a non-face. It may, in fact, be already hard-wired into the vision system. There has been a lot of work in the last ten years, and the implications for surveillance and so forth are obvious. So, it is worth treating faces specially.

This is work from Carnegie Mellon University. There are now pretty good techniques for doing face detection, and that has to be the first problem: you start out with some image, and you want to find the set of pixels which correspond to a face. That is the face detection problem. I am sure they picked apropos examples here. You can try this detection yourself: you can select any photograph you like and see the results of the algorithm. I think this is currently the best algorithm in the business, because it works for faces in a variety of views and does it reasonably well. This gives you sort of an idea of what the performance is like. The performance is a 94 percent detection rate, with a false detection every two

images. This is the current state of the art; no doubt things could be improved.

Okay. So, that is what you can do to try to find faces. But now you want to say whose face it is. Is it Nixon's face or Clinton's face? Here, what I am going to do is just report results from a study which was carried out a couple of years ago. The person who led this study is Jonathan Phillips, who is actually back in the room, so for any detailed questions, he is the one to ask. This technology has become commercial, which unfortunately always complicates things, because now scientific claims get mixed up with commercial interests and so on and so forth. Anyway, here is a study which compared systems from different vendors.

The key issue here is that you take some picture of a person at some time. Later, when you finally try to identify that person in a crowd, the person is not going to obligingly present his or her face in exactly the same pose, in exactly the same lighting, with exactly the same hairdo, with exactly the same state of wrinkledness or lack thereof. So, you have to consider this recognition in spite of these variations.
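The evaluation setup described here, enrolling one gallery image per person and later trying to identify a probe image taken under different conditions, can be sketched as a toy rank-1 identification test. Everything below is hypothetical: random 128-dimensional vectors stand in for face feature descriptors, and a noise term stands in for pose and lighting change. It illustrates the gallery/probe protocol, not any vendor's actual system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical gallery: one 128-dim feature vector per enrolled person.
gallery = rng.normal(size=(50, 128))

# Probe images of the same people, perturbed to mimic changes in
# pose, lighting, hairdo, and so on (the perturbation scale is made up).
probes = gallery + 0.3 * rng.normal(size=gallery.shape)

# Identify each probe by its nearest gallery entry (Euclidean distance).
dists = np.linalg.norm(probes[:, None, :] - gallery[None, :, :], axis=-1)
rank1 = (dists.argmin(axis=1) == np.arange(len(gallery))).mean()
print(f"rank-1 identification rate: {rank1:.2f}")
```

In a real study the interesting regime is when the perturbation grows (larger pose and lighting changes), which drives the rank-1 rate down; that is exactly the "recognition in spite of variations" problem described above.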

this, what you find is that basically each person is equal to one eigenface, after taking away the average. So, actually, what happens is that the eigen-analysis just gives you back the individual identities and so on. Now, eigen-analysis is used all the time throughout science, and people are going down exactly this road all the time. It is not well understood that if you go to high dimensions and you have relatively few examples, you get these distortions. The whole area of what distortions there are, and what systematic issues you have, is not very well understood.

Another aspect of dimensional glut is the curse of dimensionality: algorithms don't run well in high dimensions. A simple example, in the context of the talks that we have had, is that I have to search through a database for nearest neighbors. That gets slow as you go up into high dimensions, and there is a lot of theoretical research that has tried to crack the problem, but basically it has not been solved in any practically effective way. So, basically, you just have to go through a significant fraction of the data set to find the nearest neighbor. No ideas of hierarchical search really work in these very high dimensions. For example, Jerry Friedman in

the audience pioneered the idea of using hierarchical search. It works great in low dimensions, like ten dimensions or something, but, you know, when N is much smaller than D, it is not a viable method.

Well, the other issue is that we just don't understand enough about the data structures we are gathering. They are these things up in high-dimensional space. There aren't enough terms of reference that mathematics has provided to us to describe them. We lack the language, and it is really the responsibility of math to develop that language. So, a lot of these things are about point clouds in high dimensions, where the individual points are pictures or sounds or something. They live in high dimensions. If we collect the whole database, we have a cloud of points in high dimensions. That has some structure. What is it? We believe that there is something like a manifold or complex or some other mathematical object behind it.

We also know that when we look at things like the articulations of images as they vary through some condition, like the pose, we can actually see that things follow a trajectory up in some high-dimensional space, notionally, but we can't visualize it or understand it very well.
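The eigenface observation above, that eigen-analysis of a few examples in many dimensions essentially hands back the individual examples, can be checked numerically. A minimal sketch with synthetic data (numpy assumed; random vectors stand in for face images):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 4096                      # 10 "images", each a 64x64 pixel vector
X = rng.normal(size=(n, d))

Xc = X - X.mean(axis=0)              # take away the average "face"
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# With n << d, the centered data matrix has rank at most n - 1, so
# eigen-analysis can return at most n - 1 meaningful "eigenfaces" --
# essentially one per example, minus the mean -- no matter how large d is.
nonzero = int(np.sum(s > 1e-8 * s[0]))
print(nonzero)
```

The count of nonzero singular values is n - 1, independent of d: the "principal components" are spanned by the examples themselves, which is the distortion being warned about.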

So, people are trying to develop methods to let us understand better what the manifolds of data are like: tell me what this data looks like. Josh Tenenbaum, who I guess is now going to M.I.T., did this work with IsoMap, which, given an unordered set of pictures of hands, figures out a map that shows what the underlying parameter space is. So, there is a parameter that goes like this, and there is a parameter that goes like that, and that is discovered by a method that just takes this unstructured data set and finds the parameterization. We need a lot more understanding in this direction. The kinds of image data that you see here represent just one articulation, one specific issue, and one methodology for dealing with it. But we could go much further.

Representation of high-dimensional data: Professor Coifman mentioned this, and it is also relevant to some of Professor Malik's work. What are the right features, or the right basis, in which to represent these high-dimensional data so that you can pull out the important structures? Just as an example, if you look at facial gestures, faces articulating in time, and you try to find out how to represent those gestures best, one

could think of using a technique like principal components analysis to do that. It fails, very much because the structure of facial articulation is nothing like the linear manifold structure that, you know, 70 years ago was really cutting-edge stuff, when Hotelling came up with principal components. It is a much more subtle thing. So, independent components analysis deals much more with the structure as it really is, and you get little basis functions, each one of which can be mapped and correlated with an actual muscle that is able to do an articulation of the face. You can discover that from data.

It is also nice to come up with mathematically defined systems that correlate in some way with the underlying structures of the data. So, through techniques like multiresolution analysis, there are mathematicians designing, you know, bases and frames and so on that have some of the features that correlate in a way with what you see in these data. For example, through a construction of taking data, band-pass filtering it, breaking it into blocks, and doing a local Radon transform on each block, you can come up with systems that are built out of little local ridges and so on. So, you can represent with low dimensions structures like fingerprints

because you have designed special ways of representing data that have long, elongated structures, and it is also relevant to facial articulation because the muscles and so on have exactly the structure that you saw.

Okay. So, to recap some of the points I have made: we are entering an era of data. The generalizability, the robustness, and the scalability that have come up in some of these talks are issues that will have to be addressed in order for that data to deliver for any kind of homeland defense. Where mathematicians can contribute is in understanding fundamentally the structure of this high-dimensional data much better than we do today. Mathematics can provide a language to science and a whole new set of concepts that may help scientists think better about it. Our challenge is to think more about how to cope with dimensional glut, understand high-dimensional space, and represent high-dimensional data.

The speakers have brought up the issues that I have mentioned in various ways, and trying to, you know, cover all of this under one umbrella is pretty hard, but certainly Professor Malik's talk brought up the issue of robustness to articulation. We don't understand the articulation manifold of images well enough. As an object in the world is articulated, the images that it generates go

through a manifold in high-dimensional space. It is poorly understood. IsoMap is an example of a tool that goes in the right direction, but to get robustness to articulations, it would be good to understand them better. On the structure of image manifolds, Professor Malik mentioned working directly with shape; that is again trying to get away from a flat idea of how data sit in high-dimensional space and to come up with a completely different way to parameterize things. Professor Coifman, well, we are basically co-authors, so I am just extending his talk; anyway, the connections there are pretty obvious.

Finally, I think what Dr. Rabiner showed is, first of all, the ability to be robust to articulations. So, again, it would be interesting to understand the structure of the manifold of those strange noises in the data, up in high-dimensional space: the structure of speech and its confusers. I think he has also shown that if you work very hard for a long time, you can take some simple math ideas and really make them deliver, which is a very important thread for us not to lose track of. He mentioned a number of math problems that he thought would be interesting. I

wish he had been able to stay so we could have discussed those at greater length. There was a questioner in the audience who said, well, but you said you had just been carrying out an implementation for 20 years; are there any new math problems that we really need to break now? It would be very interesting to hear what he had to say.

I think all the speakers have pointed out that a lot of progress in math has been stimulated by the study of images and the associated multimedia data, and I think that will continue as well, and eventually we will see the payoff in homeland security deliverables, though perhaps after a long time. Thank you.

[Applause.]

PARTICIPANT: [Comment off microphone.]

MR. DONOHO: IsoMap is just a preprocessor for multidimensional scaling, for example. So, for instance, if you didn't have this standard method, multidimensional scaling, we wouldn't be able to do this; IsoMap just provides it with a special set of distances. Has any of this played out? I mean, there are 10,000 references a year to that methodology. I don't think I could track them all down and make a meta-analysis

of how many were just, you know, forgotten publications. But it just seems like millions of references.

PARTICIPANT: [Comment off microphone.]

MR. DONOHO: We are right now getting data in volumes that were never considered 20 years ago. We have computation in ways that were not possible 20 years ago, and the methodology of 80 years ago, applied with the data and computations that we now have, is starting to deliver some impressive results. So, if we are talking about more recent methodology, well, let the data take hold and let some computations take hold.

MR. MC CURLEY: [Comment off microphone.]

MR. DONOHO: Your earlier question about what features would help to index, I think it is connected with that. I think that the idea of looking for nearest neighbors in a database of all the multimedia experiences in the whole world, that is sort of where this is going, right? That is not going to happen. There ought to be a more intelligent way to index, to compress, and so on. Can we make a decision right now about how much data to gather? I don't know. I think economics will drive everything.

PARTICIPANT: [Comment off microphone.]
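Donoho's remark earlier in this exchange, that IsoMap is "just a preprocessor for multidimensional scaling," can be made concrete: build a nearest-neighbor graph, replace Euclidean distances by graph-geodesic distances, then hand those to classical MDS. The sketch below uses only numpy; the synthetic spiral, the sample size, and the choice of k are all made up for illustration, standing in for real image data. It recovers the one-dimensional parameter underlying a curve embedded in the plane:

```python
import numpy as np

n, k = 100, 6
t = np.linspace(0.5, 3 * np.pi, n)            # hidden 1-D parameter
X = np.c_[t * np.cos(t), t * np.sin(t)]       # spiral: a curve in 2-D

# Pairwise Euclidean distances
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Step 1 (IsoMap): k-nearest-neighbor graph ...
G = np.full((n, n), np.inf)
nbrs = np.argsort(D, axis=1)[:, 1:k + 1]
rows = np.repeat(np.arange(n), k)
G[rows, nbrs.ravel()] = D[rows, nbrs.ravel()]
G = np.minimum(G, G.T)                        # symmetrize the graph
np.fill_diagonal(G, 0.0)

# ... then geodesic (shortest-path) distances via Floyd-Warshall
for m in range(n):
    G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])

# Step 2: classical multidimensional scaling on the geodesic distances
J = np.eye(n) - 1.0 / n
B = -0.5 * J @ (G ** 2) @ J                   # double-centered Gram matrix
w, V = np.linalg.eigh(B)
coord = V[:, -1] * np.sqrt(max(w[-1], 0.0))   # top eigenvector: 1-D embedding

# The embedding should track the hidden parameter (up to sign and scale).
r = np.corrcoef(coord, t)[0, 1]
print(f"|corr(embedding, hidden parameter)| = {abs(r):.3f}")
```

The MDS step is unchanged 80-year-old methodology; all the novelty is in feeding it geodesic rather than Euclidean distances, which is exactly the "preprocessor" point.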

MR. DONOHO: It is an excellent question. I have some opinions on the matter, but I don't know if I really have the right to stand between people and liquid refreshment. Yes, I think that most of the political arguments that I have seen are between non-technical people who are playing out the dynamics of where they are coming from rather than looking at really solving the problem. So, you know, I am a little bit -- so, for example, a national I.D. card wouldn't have any privacy issues if all it was used for is, you know, that it contained data on the card, and it was yours, and it was digitally signed and encrypted and all of that stuff, and there was no national database that big brother could look through. Actually, the point would be that it was the card, and that was it, right? What people assume it means is that big brother has all the stuff there and can track their movements and things like that. So, there are all those kinds of concerns. But there is nothing that says the I.D. card can't just be used for face verification or identity verification: you are that person, and that is it. It is not communicating with anybody; it is illegal to communicate with anybody. That can all be done.

But when people get together and butt heads, it turns into a sort of Orwellian characterization, I guess. The science is absolutely clear that you could do it without all of those things -- without violating personal liberty and so on.

DR. LENCZOWSKI: Are there other questions for any of the speakers? I guess that everyone is ready to take their break. I want to thank you very much for your participation this afternoon.

[Applause.]

Remarks on Image Analysis and Voice Recognition
David Donoho

Our era has defined itself as a data era by creating pervasive networks and gargantuan databases that are promulgating an ethic of data-rich discourse and decisions. Some scientific issues that arise in working with massive amounts of data include these:

Generalizability -- building systems that can efficiently handle data for billions of people;
Robustness -- building systems that work under articulated conditions; and
Understanding the structure of the data -- techniques for working with data in very high-dimensional spaces.

Take, for example, the task of coping with high-dimensional data in body gesture recognition. If you look at an utterance or an image, you are talking about something that is a vector in a 1,000- or 1,000,000-dimensional space. For example, human identity, even highly decimated, might require some 65,000 variables just to produce raw imagery. Characterization of facial gestures is a subtle thing, and independent-components analysis (as opposed to principal-components analysis) deals much more with the structure as it really is. You get little basis functions, each one of which can be mapped and correlated with an actual muscle that is able to do an articulation of the face.
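The contrast drawn above between principal and independent components can be illustrated on synthetic signals. The sketch below is a generic FastICA-style fixed-point iteration (numpy only); the two source signals and the mixing matrix are made up for illustration and have nothing to do with facial data. It unmixes two linearly mixed sources that a purely second-order method like PCA would not separate:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sign(np.sin(3 * t)),             # source 1: square wave
          np.sin(5 * t)]                      # source 2: sinusoid
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                    # made-up mixing matrix
X = S @ A.T                                   # observed mixtures

# Whiten the mixtures (zero mean, identity covariance)
Xc = X - X.mean(axis=0)
d, E = np.linalg.eigh(Xc.T @ Xc / len(Xc))
Z = Xc @ (E @ np.diag(d ** -0.5) @ E.T)

# Symmetric FastICA fixed-point iteration with the tanh contrast
W = rng.normal(size=(2, 2))
for _ in range(200):
    Gm = np.tanh(Z @ W.T)
    W1 = Gm.T @ Z / len(Z) - np.diag((1 - Gm ** 2).mean(axis=0)) @ W
    U, _, Vt = np.linalg.svd(W1)
    W = U @ Vt                                # symmetric decorrelation
S_est = Z @ W.T                               # recovered sources

# Each recovered component matches one true source (up to sign and order).
C = np.abs(np.corrcoef(S_est.T, S.T)[:2, 2:])
print(C.max(axis=0))
```

PCA applied to the same mixtures only decorrelates them, returning rotated blends of the two sources; the non-Gaussian contrast function is what lets ICA find the individual, physically meaningful components.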