Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Roberta Lenczowski
"introduction by Session Chair"
Roberta E. Lenczowski is technical director, National Imagery and Mapping Agency (NIMA), in
Bethesda, Maryland. Ms. Lenczowski earned her B.A. in philosophy from Creighton University in
1963. She completed her M.A. in philosophy from St. Louis University in 1970. In November
1977, Ms. Lenczowski began her professional career with the Defense Mapping Agency, now the
National Imagery and Mapping Agency. She has held numerous technical and supervisory
positions before her appointment as technical director in 2001.
DR. LENCZOWSKI: To try to get this session started, albeit
a little bit late, but I know that the discussions have
been very enthusiastic, very, very interesting.
One of our speakers, Dr. Rabiner, has to leave by
quarter to 6:00. So, I don't want to do a whole lot of
introduction with respect to my background. Suffice it to
say, I am from the National Imagery and Mapping Agency. As
a result, I have a very vested interest in understanding
imagery analysis.
So, based upon the short abstracts I have read
for each of these speakers, I know that things they are
going to say will be very relevant to work that we do
within the agency. So, I will introduce each of the
speakers. I will give you a little bit of an insight, a
tickler, if you will, in terms of what their topic is.
When they have completed then, of course, we will
open this to a discussion and an exchange.
Our first speaker is Dr. Malik. He, in fact,
received his Ph.D. in computer science from Stanford
University in 1985. In 1986, he joined the faculty of the
Computer Science Division in the Department of Electrical
Engineering and Computer Science at the University of
California in Berkeley, where he is currently a professor.
He has
For those of you who are not familiar with his
background, his research interests are in computer vision
and computational modeling of human vision. His work spans
the range of topics in vision, including image
segmentation and grouping, texture, stereopsis, object
recognition, image-based modeling and rendering,
content-based image querying and intelligent vehicle highway
systems.
He has authored or co-authored more than a
hundred research papers on these topics. So, he is an
incredible asset to the discussion as we continue.
I need to point out that in my own background
since the very early eighties, I have been following the
work that has gone on in the community that I am
predominantly from, that being the mapping side of the
National Imagery and Mapping Agency and have watched as we
have attempted to have insights in terms of some of the
research issues of automatic or what we refer to now as
assisted feature recognition or automated or assisted
target recognition.
So, one of the things that Dr. Malik plans to talk about in
terms of recognizing the objects and the actions is
analyzing images with the objective of recognizing those
objects or those activities in the scene that is being
imaged. As we get into the later discussion
period, hopefully I will have an opportunity to provide
some more descriptive information about why this is so
relevant to some of the business that we are in and how that, in
fact, is of great support with respect to the core topic
here of homeland security.
Introduction by Session Chair
Roberta Lenczowski
Dr. Lenczowski introduced herself as a scientist from the National Imagery and Mapping
Agency who has a very vested interest in understanding imagery analysis. Recognizing
objects and actions means analyzing images with the objective of recognizing those
objects or those activities in the scene that is being imaged. Such analysis is important
with respect to the core topic of homeland security.
Jitendra Malik
"Computational Vision"
Jitendra Malik was born in Mathura, India, in 1960. He received a B.Tech. degree in electrical
engineering from the Indian Institute of Technology, Kanpur, in 1980 and a Ph.D. degree in
computer science from Stanford University in 1985. In January 1986, he joined the University of
California at Berkeley, where he is currently the Arthur J. Chick Endowed Professor of EECS and
the associate chair for the Computer Science Division. He is also on the faculty of the Cognitive
Science and Vision Science groups.
His research interests are in computer vision and the computational modeling of human vision.
His work spans a range of topics in vision, including image segmentation and grouping, texture,
stereopsis, object recognition, image-based modeling and rendering, content-based image
querying, and intelligent vehicle highway systems. He has authored or coauthored more than a
hundred research papers on these topics.
He received the gold medal for the best graduating student in electrical engineering from IIT
Kanpur in 1980, a Presidential Young Investigator Award in 1989, and the Rosenbaum
Fellowship for the Computer Vision Programme at the Newton Institute of Mathematical Sciences,
University of Cambridge, in 1993. He received the Diane S. McEntyre Award for Excellence in
Teaching from the Computer Science Division, University of California at Berkeley, in 2000. He
was awarded a Miller Research Professorship in 2001. He serves on the editorial boards of the
International Journal of Computer Vision and the Journal of Vision and on the scientific advisory
board of the Pacific Institute for the Mathematical Sciences.
DR. MALIK: Thank you, Dr. Lenczowski.
Let's begin at the beginning. How do we get
these images and what do I mean by images? So, the first
thing to realize is that images arise from a variety of
sources and they could be volumetric images, just 2D
images, images over time. So, of course, we have our
canonical example of optical images, which are getting to
be extremely cheap and extremely high resolution. There
are going to be zillions of these devices everywhere,
whether we like it or not.
But I also wanted to mention other modalities,
which have their own advantages. So, there are x-ray machines,
as when you walk through the airport. Those are not
tomography machines, but they increasingly will be and thus
they will give access to 3D data. This may be of great
relevance in automated techniques for detecting weapons,
explosives and so on.
There are range sensors, which give you not just
information about the brightness coming back, but also
directly depths. We refer to this as 2.5D data, because
it is not the full volume. You could have infrared sensors
and no doubt many other exotic techniques are constantly
being developed.
I won't talk about the reconstruction part of the
problem. I assume that we have the images and let's
process them further. So, this is the core problem. We
start off with -- it is just a collection of big -- with
some attributes and some spatial, XY, XYZ, XYZT, whatever.
But we want to be able to say in -- and I am going to use
examples from optical images but many of the techniques do
carry over. That is a problem. How do we do this?
That is a recognition problem. So, when we look
at this image, we all recognize it. Maybe it is because
you have been to the Louvre. Maybe it is because this
image is splattered over a zillion places. But probably
you have not seen him before. So, I wish to argue that --
to identify that something is a face, doesn't rely on you
having seen that specific face and, of course, you have
. .
varieties.
So, the key point to note is that recognition is
at varying levels of specificity. There is at a category
level, faces, cars. There is specific process, Mona Lisa,
or it could be a specific person in a specific expression.
The second key point and this is a very important
one is that we want to recognize objects in spite of
certain kinds of variation, variation such as changes in
pose. You can look at the person from different
viewpoints. Changes in lighting and -- images, you would
have occlusion. There would be clutter and there will be
objects hiding each other.
This is why just thinking of it as a traditional
statistical classification problem is not quite adequate
because in that framework, some of these variations due to
geometry and lighting, which are very systematic, which
result from the -- of physics, they are treated as noise
and that may not be the right thing to do. So, that is
about objects. But we can talk about action recognition as
well, particularly when we have video data, whatever.
Just a collection of common English words that
are really associated with actions, so you can read the
list there. They involve movement and posture change,
object manipulation, conversational gesture and moving the
hands about, sign language and so on. Certainly, we can
recognize these as well.
So, how shall we do this? So, what I am going to
do in this talk is sort of -- our way through this area and
at a later part, I am going to mostly talk about work and
approaches taken in my group, but I wanted to make the
survey sort of inclusive. So, initially, I am going to
talk about face recognition. I want to mention -- why do I
pick faces? Faces are really just another class of object.
So, in principle, the same techniques apply. However, it
is a very special and very important domain. It has been
argued that a baby at the age of one day -- and there have
been experiments which have argued that at age 15 -- can
distinguish a face from a non-face. It may, in fact, be
already hard-wired into the vision system.
There has been a lot of work in the last ten
years and the implications for surveillance and so forth
are obvious. So, it is worth taking that specialty. So,
this is work from Carnegie Mellon University. So, there
are now pretty good techniques for doing face detection.
That has to be the first problem. You start out with some
image and you want to find that set of pixels, which are --
to a face. So, that is the face detection problem.
I am sure they picked apropos examples here. You
can try this detector yourself. You can select any
photograph you like and see the results of the algorithm.
I think this is currently the best algorithm in the
business because it works for faces in a variety of views
and does it reasonably well.
This gives you sort of an idea of what the
performance is like. So, the performance is 94 percent
detection rate on this with false detection every two
images. This is the current state of the art. No doubt
things could be improved.
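As a small illustration of how the two quoted figures are defined, the sketch below computes a detection rate and a false-detection rate per image. The counts are hypothetical and were chosen only so that the result matches the operating point quoted in the talk (94 percent detection, one false detection every two images); they are not the counts from the Carnegie Mellon evaluation.

```python
def detection_metrics(true_faces, detected_faces, false_detections, images):
    """Return (detection rate, false detections per image)."""
    return detected_faces / true_faces, false_detections / images

# Hypothetical counts, picked only to reproduce the quoted operating point.
rate, false_per_image = detection_metrics(
    true_faces=500, detected_faces=470, false_detections=100, images=200)
print(f"detection rate: {rate:.0%}, false detections per image: {false_per_image}")
```

The two numbers trade off against each other: a detector tuned to miss fewer faces will generally raise more false alarms per image, which is why both are reported together.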
Okay. So, that is what you can do to try to find
faces. But now you want to say whose face is it. Is it
Nixon's face or Clinton's face? So, here what I am going
to do is to just report results from a study, which was
carried out a couple of years ago and the person who did
this study or was a leader in the study is Jonathan
Phillips, who is actually back in the room. So, any
detailed questions, he is the one to ask.
This -- become commercial, which unfortunately
always complicates things because now scientific claims get
mixed up with commercial interests and so on and so forth.
Anyway, here is a study, which -- systems from different
vendors.
So, the key issue here is you will take some
picture of a person at some time. You are not going to act
when you finally try to identify that person in a crowd.
The person is not going to obligingly present his or her
face in exactly the same pose, exactly the same lighting
with exactly the same hairdo, with exactly the same state
of wrinkledness or lack thereof. So, you have to consider
this recognition in spite of these variations.
OCR for page 356
this, what you find is that basically each person is equal
to one Eigenface, take away the average. So, actually what
happens is that the Eigen analysis just gives you back the
individual identities and so on.
Now, Eigen analysis is being used all the time
throughout science and people are exactly going down this
road all the time. It is not well-understood that if you
go to high dimensions and you have relatively few examples,
you have these distortions. The whole area of what
distortions there are and what systematic issues you have
are not very well understood.
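One way to get a feel for the high-dimension, few-examples regime is to look at angles between unrelated vectors. The sketch below is an illustration under stated assumptions, not the speaker's calculation: it draws a handful of hypothetical Gaussian vectors and reports the largest absolute cosine between any pair. In very high dimensions the examples are almost mutually orthogonal, which leaves an eigen-analysis little shared structure to find beyond the individual examples themselves.

```python
import math
import random

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def max_abs_cosine(n_samples, dim, seed=0):
    """Largest |cosine| over all pairs of random Gaussian vectors."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]
    return max(abs(cosine(data[i], data[j]))
               for i in range(n_samples) for j in range(i + 1, n_samples))

# Hypothetical sizes: 8 examples in 3 dimensions versus 5000 dimensions.
low = max_abs_cosine(n_samples=8, dim=3)
high = max_abs_cosine(n_samples=8, dim=5000)
print(f"largest |cosine|, 3 dimensions:    {low:.3f}")
print(f"largest |cosine|, 5000 dimensions: {high:.3f}")
```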
Another area of dimensional glut is that you have
the curse of dimensionality. Algorithms don't run well in
high dimensions. A simple example in the context of the
talks that we have had is say I have to search through a
database for nearest neighbors. That gets slow as you go
up into high dimensions and there is a lot of theoretical
research that has tried to crack the problem, but basically
it has not been really solved in any practically effective
way. So, basically you just have to go through a
significant fraction of the data set to find the nearest
neighbor.
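A minimal sketch of one reason nearest-neighbor search gets hard, assuming hypothetical points drawn uniformly from a unit cube: as the dimension grows, the nearest and the farthest point end up at almost the same distance from the query, so there is little to separate a good candidate from a bad one without looking at most of the data. The scan itself is the brute-force method that examines every point.

```python
import math
import random

def nearest_to_farthest_ratio(n_points, dim, seed=0):
    """Ratio of nearest to farthest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):  # linear scan: every point is examined
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)

# Hypothetical sizes: 500 points in 2 dimensions versus 500 dimensions.
ratio_low = nearest_to_farthest_ratio(n_points=500, dim=2)
ratio_high = nearest_to_farthest_ratio(n_points=500, dim=500)
print(f"nearest/farthest, 2 dimensions:   {ratio_low:.3f}")
print(f"nearest/farthest, 500 dimensions: {ratio_high:.3f}")
```

A ratio near zero means the nearest neighbor stands out clearly; a ratio near one means all points look about equally far away.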
No ideas of hierarchical search really work in
these way high dimensions. For example, Jerry Friedman in
the audience pioneered the idea of using hierarchical
search. It works great in low dimensions, like ten
dimensions or something, but you know, when N is much
smaller than D, it is not a viable method.
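The contrast between low and high dimensions can be sketched with a k-d tree, used here as one common form of hierarchical search (the talk does not say which structure was meant). The code counts how many tree nodes a nearest-neighbor query has to visit. The data are hypothetical uniform random points, and the sizes (1000 points, 2 versus 50 dimensions) are illustrative choices; pruning already breaks down well before the number of points falls below the dimension.

```python
import random

def build(points, depth=0):
    """Build a k-d tree as nested tuples (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def search(node, query, state):
    """Update state['best'] with the nearest point and count visited nodes."""
    if node is None:
        return
    point, axis, left, right = node
    state["visited"] += 1
    d2 = dist2(point, query)
    if d2 < state["best"][0]:
        state["best"] = (d2, point)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    search(near, query, state)
    # The far side can be skipped only if the splitting plane is farther
    # away than the best match found so far.
    if diff * diff < state["best"][0]:
        search(far, query, state)

def experiment(n_points, dim, seed=0):
    """Return (nodes visited, whether the result matches a brute-force scan)."""
    rng = random.Random(seed)
    points = [tuple(rng.random() for _ in range(dim)) for _ in range(n_points)]
    query = tuple(rng.random() for _ in range(dim))
    state = {"best": (float("inf"), None), "visited": 0}
    search(build(points), query, state)
    brute = min(points, key=lambda p: dist2(p, query))
    return state["visited"], state["best"][1] == brute

visited_low, ok_low = experiment(1000, 2)
visited_high, ok_high = experiment(1000, 50)
print(f"2 dimensions:  visited {visited_low} of 1000 nodes")
print(f"50 dimensions: visited {visited_high} of 1000 nodes")
```

In both cases the tree returns the correct neighbor; the difference is only in how much of the tree it has to walk to be sure.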
Well, the other issue is that we just don't understand
enough about what our data structures are that we are
gathering. They are these things up in high dimensional
space. There isn't enough terms of reference for
mathematics that has been provided to us to describe it.
We lack the language. It is really the responsibility of
math to develop that language.
So, a lot of these things are about point clouds
in high dimensions, just individual points are pictures or
sounds or something. They are up in high dimensions. If
we collect the whole data base, we have a cloud of points
in high dimensions. That has some structure. What is it?
We believe that there is something like a manifold or
complex or some other mathematical object behind it.
We also know that when we look at things like the
articulations of images as they vary through some condition
like the pose, that we can actually see that things follow
a trajectory up in some high dimensional space notionally,
but we can't visualize it or understand it very well.
So, people are trying to develop methods to let
us understand better what the manifolds of data are like.
Tell me what this data looks like. Josh Tenenbaum, who I
guess is now going to M.I.T., did this work with
IsoMap, which given an unordered set of pictures of the
hands, okay, figures out a map that shows what the
underlying parameter space is. So, there is a parameter
that is like this and there is -- that is this parameter
and then there is a parameter that goes like that and that
is discovered by a method that just takes this unstructured
data set and finds the parameterization.
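A minimal sketch of the distance stage of an IsoMap-style method, under stated assumptions: connect each point to its nearest neighbors, then measure distances along the resulting graph instead of straight through the ambient space. The data here are hypothetical points on a half circle rather than hand images, the neighbor count `k=2` is an illustrative choice, and the final embedding step (multidimensional scaling on the graph distances) is omitted.

```python
import heapq
import math

def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph as {i: {j: distance}}."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        order = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        for d, j in order[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph

def shortest_paths(graph, source):
    """Dijkstra: graph distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist.get(i, math.inf):
            continue  # stale queue entry
        for j, w in graph[i].items():
            if d + w < dist.get(j, math.inf):
                dist[j] = d + w
                heapq.heappush(heap, (d + w, j))
    return dist

# Hypothetical data: 50 points along a half circle of radius 1.
n = 50
points = [(math.cos(math.pi * i / (n - 1)), math.sin(math.pi * i / (n - 1)))
          for i in range(n)]
graph = knn_graph(points, k=2)
geodesic = shortest_paths(graph, 0)[n - 1]
straight = math.dist(points[0], points[-1])
print(f"straight-line distance between the two ends: {straight:.3f}")
print(f"distance along the neighbor graph:           {geodesic:.3f}")
```

The graph distance tracks the length of the curve (close to pi here), which is the one underlying parameter, while the straight-line distance cuts across it. The graph is built from pairwise distances only, so the order in which the points are listed plays no role.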
We need a lot more understanding in this
direction. The kinds of image data that you see here, this
is just one articulation, one specific issue and one
methodology for dealing with it. But we could go much
further.
Representation of high dimensional data,
Professor Coifman mentioned this and in some of Professor
Malik's work it is also relevant. What are the right
features or the right basis to represent these high
dimensional data in so that you can pull out the important
structures? So, just as an example, if you look at facial
gestures, faces articulating in time and you try to find
how do you actually represent those gestures the best, one
could think of using a technique like principal components
analysis to do that. It fails very much because the
structure of looking at facial articulation is nothing like
the linear manifold structure that, you know, 70 years ago
was really, you know, cutting edge stuff, when Hotelling
came up with principal components.
It is a much more subtle thing. So, independent
components analysis deals much more with the structure as it
really is and you get little basis functions, each one of
which can be mapped and correlated with an actual muscle
that is able to do an articulation of the face. You can
discover that from data.
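A toy sketch of why second-order methods such as principal components can miss structure that independent components analysis finds. Everything here is an assumption made for illustration: two hypothetical independent uniform sources are mixed by a rotation, and the independent directions are recovered by a simple grid search over angles using excess kurtosis as the measure of non-Gaussianity (one of several criteria used in ICA). It is not the facial-gesture analysis described in the talk.

```python
import math
import random

def rotate(xs, ys, angle):
    """Rotate paired coordinates counterclockwise by the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    return ([c * x - s * y for x, y in zip(xs, ys)],
            [s * x + c * y for x, y in zip(xs, ys)])

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 4 for v in values) / (n * var * var) - 3.0

rng = random.Random(0)
half_width = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has unit variance
s1 = [rng.uniform(-half_width, half_width) for _ in range(2000)]
s2 = [rng.uniform(-half_width, half_width) for _ in range(2000)]
true_angle = 0.5             # hypothetical mixing rotation, in radians
x1, x2 = rotate(s1, s2, true_angle)

# Second moments: the mixed coordinates are still nearly uncorrelated with
# equal variances, so the covariance alone cannot reveal the rotation.
cov12 = sum(a * b for a, b in zip(x1, x2)) / len(x1)

def contrast(phi):
    """Total non-Gaussianity of the coordinates after undoing rotation phi."""
    y1, y2 = rotate(x1, x2, -phi)
    return abs(excess_kurtosis(y1)) + abs(excess_kurtosis(y2))

# The answer is only defined up to a quarter turn, so search 0..89 degrees.
candidates = [math.radians(deg) for deg in range(90)]
estimate = max(candidates, key=contrast)
print(f"off-diagonal covariance: {cov12:.3f}")
print(f"true angle {true_angle:.3f}, estimated angle {estimate:.3f}")
```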
It is also nice to come up with mathematically
defined systems that correlate in some way with the
underlying structures of the data. So, through techniques
like multi-resolution analysis, there are mathematicians
designing, you know, bases and frames and so on that have
some of the features that correlate in a way with what you
see in these data. So, for example, through a construction
of taking data and band pass filtering it, breaking it into
blocks and doing a local Radon transform on each block,
you can come up with systems that have the property that
they add in little local ridges and so on. So, you can
represent with low dimensions structures like fingerprints
because you have designed special ways of representing data
that have long -- elongated structures and it is also
relevant to the facial articulation because the muscles and
so on have exactly the structure that you saw.
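To make the role of the Radon transform concrete, the sketch below computes a single discrete projection of one image block: pixel values are summed along parallel lines at a chosen orientation. The block, its size, and the nearest-bin rounding are hypothetical simplifications, and the band-pass filtering and blocking steps mentioned in the talk are left out. The point is that an elongated ridge collapses into one large coefficient when the projection direction lines up with it, which is what makes such structures cheap to represent.

```python
import math

def projection(image, angle_deg):
    """Sum pixel values along parallel lines for one orientation."""
    size = len(image)
    center = (size - 1) / 2.0
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    bins = [0.0] * (2 * size)
    for row in range(size):
        for col in range(size):
            # Signed offset of the pixel from the block center along (c, s);
            # pixels with the same offset lie on the same line.
            t = (col - center) * c + (row - center) * s
            bins[int(round(t)) + size] += image[row][col]
    return bins

# Hypothetical 32 x 32 block containing a vertical ridge one pixel wide.
size = 32
image = [[0.0] * size for _ in range(size)]
for row in range(size):
    image[row][10] = 1.0

peak_aligned = max(projection(image, 0))   # lines parallel to the ridge
peak_oblique = max(projection(image, 45))  # lines crossing the ridge
print(f"largest coefficient, aligned projection: {peak_aligned}")
print(f"largest coefficient, oblique projection: {peak_oblique}")
```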
Okay. So, just some of the points I have made,
we are entering an era of data, the generalizability, the
robustness, the scalability that have come up in some of
these things are issues that will have to be addressed and
in order for that data to deliver for any kind of homeland
defense, where mathematicians can contribute is
understanding fundamentally the structure of this high
dimensional data much better than we do today. It can
provide a language to science and a whole new set of
concepts that may make them think better about it.
Our challenge is to think more about how to cope
with dimensional glut, understand high dimensional space
and represent high dimensional data. The speakers have
brought up the issues that I have said in various ways and
trying to -- you know, under one umbrella, cover all of
this is pretty hard, but certainly in Professor Malik's
talk was brought up the issue of robustness to
articulation. We don't understand the articulation
manifold of the images well enough. As an object in the
world is articulated, the images that it generates go
through a manifold in high dimensional space. It is poorly
understood. IsoMap is an example of a tool that goes in
the right direction.
But to get robustness to articulations, it would
be good to understand them better. The structure of image
manifolds, as Professor Malik mentioned, working directly
with shape, that is again trying to get away from a flat
idea of the way data are in high dimensional space and come
up with a completely different way to parameterize things.
Professor Coifman, well, he basically like we are
co-authors, so I am just extending his talk, but, anyway,
the connections there are pretty obvious.
Finally, I think what Rabiner showed is first of
all the ability to be robust to articulations was
mentioned. So, again, to understand I think how those
strange noises in the data -- what is the structure of the
manifold of those things up in high dimensional space. So,
the structure of speech and confusers would be interesting
to know.
I think he has shown that if you work very hard
for a long time, you can take some simple math ideas and
really make them deliver, which also is a very important
thread for us to not lose track of. He mentioned a number
of math problems that he thought would be interesting. I
wish he had been able to stay and we could have discussed
those at greater length.
There was a questioner in the audience who said,
well, but you said you had just been carrying out an
implementation for 20 years. Are there any new math
problems that we really need to break now? It would be
very interesting to hear what he had to say.
I think all the speakers have pointed out that a
lot of progress has been stimulated by -- in math, I think,
from the study of images and -- associated multimedia data
and I think that will continue as well and eventually we
will see the payoff in homeland security deliverables, but
perhaps a long time.
Thank you.
[Applause.]
PARTICIPANT: [Comment off microphone.]
MR. DONOHO: IsoMap is just a preprocessor for
multidimensional scaling, for example. So, for instance,
if you didn't have this standard method, multidimensional
scaling, we wouldn't be able to -- just provide a special
set of distances.
Has any of this played out? I mean there are
10,000 references a year to that methodology. I don't
think I could track them all down and make a meta-analysis
of how many were just, you know, forgotten publications.
But it just seems like millions of references --
PARTICIPANT: [Comment off microphone.]
MR. DONOHO: We are right now getting data in
volumes that were never considered 20 years ago. We have
computation in ways that were not possible 20 years ago;
whereas, the methodology of 80 years ago, just with the
data and computations that we have is starting to deliver
some impressive results. So, if we are talking about more
recent methodology, well, let the data take hold and let
some computations take hold.
MR. MC CURLEY: [Comment off microphone.]
MR. DONOHO: Your earlier question about what
features would help to index, I think it is connected with
that. I think that the idea of looking for nearest
neighbors in a database of all the multimedia experiences
in the whole unit, it is sort of where this is going.
Right? That is not going to happen. There ought to be a
more intelligent way to index, to compress and so on.
Can we make a decision right now how much data to
gather? I don't know. I think economics will drive
everything.
PARTICIPANT: [Comment off microphone.]
MR. DONOHO: It is an excellent question. I have
some opinions on the matter but I don't know if I have
really the right to stand between people and liquid
refreshment.
Yes, I think that most of the political things
that I have seen are between non-technical people who are
playing out the dynamics of where they are coming from
rather than looking at really solving the problem. So, you
know, I am a little bit -- so, for example, a national I.D.
card wouldn't have any privacy issues if all it was used
for is for -- you know, it contained data on the card and
it was yours and it was digitally signed and encrypted and
all of that stuff and there was no national database that
big brother could look through.
Actually, the point was that it was the card and
that was it. Right? What people assume it means is
that big brother has all the stuff there and they can track
their movements and stuff like that. So, there are all
those kinds of things. There is nothing that says the I.D.
card can't just be used for face verification or identity
verification. You are that person and that is it. They
are not communicating with anybody. It is illegal to
communicate with anybody. That can all be done.
But people get together and butt heads is sort of
an Orwell characterization I guess. The science is
absolutely clear, that you could do it without having all
of those -- violating personal liberty and so on.
DR. LENCZOWSKI: Are there other questions for
any of the speakers? I guess that everyone is ready to
take their break.
I want to thank you very much for your
participation this afternoon.
[Applause.]
Remarks on Image Analysis and Voice Recognition
David Donoho
Our era has defined itself as a data era by creating pervasive networks and gargantuan
databases that are promulgating an ethic of data-rich discourse and decisions. Some
scientific issues that arise in working with massive amounts of data include these:
Generalizability: building systems that can efficiently handle data for billions
of people,
Robustness: building systems that work under articulated conditions, and
Understanding the structure of the data: techniques for working with data in
very high-dimensional space.
Take, for example, the task of coping with high-dimensional data in body gesture
recognition. If you look at an utterance or an image, you are talking about something that
is a vector in a 1,000- or 1,000,000-dimensional space. For example, human identity,
even highly decimated, might require some 65,000 variables just to produce raw imagery.
Characterization of facial gestures is a subtle thing, and independent-components analysis
(as opposed to principal-components analysis) deals much more with the structure as it
really is. You get little basis functions, each one of which can be mapped and correlated
with an actual muscle that is able to do an articulation of the face.
Representative terms from entire chapter:
computer vision