
Currently Skimming:

Image Analysis and Voice Recognition
Pages 261-366

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 261...
... In November 1977, Ms. Lenczowski began her professional career with the Defense Mapping Agency, now the National Imagery and Mapping Agency.
From page 262...
... So, I don't want to do a whole lot of introduction with respect to my background. Suffice it to say, I am from the National Imagery and Mapping Agency.
From page 263...
... I need to point out that in my own background since the very early eighties, I have been following the work that has gone on in the community that I am predominantly from, that being the mapping side of the National Imagery and Mapping Agency and have watched as we have attempted to have insights in terms of some of the research issues of automatic or what we refer to now as assisted feature recognition or automated or assisted target recognition. So, one of the things that Dr.
From page 264...
... imaged. As we get into the question period, hopefully I will have some more descriptive information relevant to the business that we are in; some of it, in fact, is of great support with respect to homeland security here.
From page 265...
... Recognizing objects and actions means analyzing images with the objective of recognizing those objects or those activities in the scene that is being imaged. Such analysis is important with respect to the core topic of homeland security.
From page 266...
... His work spans a range of topics in vision, including image segmentation and grouping, texture, stereopsis, object recognition, image-based modeling and rendering, content-based image querying, and intelligent vehicle highway systems. He has authored or coauthored more than a hundred research papers on these topics.
From page 267...
... So, of course, we have our canonical example of optical images, which are getting to be extremely cheap and extremely high resolution. There are going to be zillions of these devices everywhere, whether we like it or not.
From page 268...
... There is a specific object, the Mona Lisa, or it could be a specific person with a specific expression. The second key point, and this is a very important one, is that we want to recognize objects in spite of certain kinds of variation, variation such as changes in pose.
From page 269...
... There would be clutter and there will be objects hiding each other. This is why just thinking of it as a traditional statistical classification problem is not quite adequate, because in that framework some of these variations, which are very systematic, are treated as noise, and that is the wrong thing to do.
From page 270...
... So, the performance is a 94 percent detection rate on this, with a false detection every two
From page 271...
... So, the key issue here is you will take some picture of a person at some time. You are not going to act when you finally try to identify that person in a crowd.
From page 272...
... MALIK: If you have multiple views.
From page 273...
... So, now I want to talk about general object recognition and this is, again, a huge topic and I am going to use the typical caveat, which is that this is one approach and not necessarily the final approach, so on and so forth.
From page 274...
... How do we deal with object recognition? I think the key issue is dealing with shape variability.
From page 275...
... Now, this transform could be -- we then considered this variation model in terms of transformation and when we want to say is this possibly the same -- in the same category, we put probability distributions on the space of transformations. These will be around an operating point, which is the -- shape and this gets us out of this fundamental issue of how do we define shape in this general way.
From page 276...
... Otherwise you don't have a workable approach. So, a key issue here is the issue of correspondence.
From page 277...
... Suppose we represent a shape by a collection of points -- I basically won't tell you how we do this, but it ultimately comes down to attaching to each point some descriptor which captures the relative arrangement of the rest of the shape with respect to that point, and then trying to -- so, shape is essentially the relative configuration of the rest of the points with respect to a given point, and two points are to be matched if they have similar relative configurations. So, you can sort of formalize that in some way.
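The following is an illustrative sketch, not Professor Malik's actual implementation: a simplified per-point descriptor in the spirit described above, where each sample point on a shape gets a log-polar histogram of the relative positions of all the other points, and points on two shapes are matched by comparing histograms. The bin counts, the chi-squared comparison, and the greedy matching rule are assumptions made for this example only.

```python
# Illustrative sketch of "relative configuration" descriptors and matching.
import numpy as np

def point_descriptors(points, n_r=5, n_theta=12):
    """For each point, build an (n_r x n_theta) log-polar histogram of the
    relative positions of all the other points on the shape."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    descs = np.zeros((n, n_r * n_theta))
    diffs = points[None, :, :] - points[:, None, :]
    dists = np.linalg.norm(diffs, axis=2)
    mean_dist = dists[dists > 0].mean()      # normalize for scale invariance
    for i in range(n):
        d = dists[i] / mean_dist
        theta = np.arctan2(diffs[i, :, 1], diffs[i, :, 0])
        mask = d > 0                          # exclude the point itself
        r_bin = np.clip(np.digitize(np.log(d[mask] + 1e-9),
                                    np.linspace(-2, 1, n_r - 1)), 0, n_r - 1)
        t_bin = ((theta[mask] + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
        hist = np.zeros((n_r, n_theta))
        np.add.at(hist, (r_bin, t_bin), 1)    # accumulate counts into bins
        descs[i] = hist.ravel() / hist.sum()
    return descs

def match_points(shape_a, shape_b):
    """Greedy matching: pair each point of shape_a with the point of shape_b
    whose descriptor is closest in chi-squared distance."""
    da, db = point_descriptors(shape_a), point_descriptors(shape_b)
    cost = 0.5 * np.sum((da[:, None, :] - db[None, :, :]) ** 2
                        / (da[:, None, :] + db[None, :, :] + 1e-9), axis=2)
    return cost.argmin(axis=1)
```

A real system would solve a proper assignment problem over the full cost matrix (for example with the Hungarian algorithm) rather than matching points greedily, and would then estimate the aligning transformation as discussed above.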
From page 278...
... If I change the cost function, will I -- a little bit, will I dramatically change the output or -- DR. MALIK: There is continuity, yes.
From page 279...
... declare to be the result. You don't want a linear search through all the models.
From page 280...
... We had nothing special on this and we -- an error rate which is probably not significantly better but at least as good with a completely -- technique and I think tuned for digits with much less training data. These are actually all the errors and if you look through them carefully, you will find that some of these even you as a human will get wrong.
From page 281...
... So, now we are moving towards talking about action recognition. You want to identify people and what action they perform.
From page 282...
... So, they have these -- a variety of causes -- these artists single image and the reason you can do this is because of anthropometric constraints. You know all the lengths of the body parts, which essentially provide you the equivalent of a second view.
From page 283...
... So, object recognition and activity recognition is coupled. The fact that I am writing with a pen -- so, the philosophy that we can try to do this by recovering -- I will conclude here.
From page 284...
... I should, again, return to the beginning. I -- optical images, but that is not the only imaging modality.
From page 285...
... For the future, researchers are moving toward the problem of action recognition that is, identifying people and the action they are performing. Current approaches draw from the above techniques, but the work is in a very early stage.
From page 286...
... He is a recipient of the 1996 DARPA Sustained Excellence Award, the 1996 Connecticut Science Medal, the 1999 Pioneer Award of the International Council for Industrial and Applied Mathematics, and the 1999 National Medal of Science.
From page 287...
... He was chairman of the Yale Mathematics Department from 1986 until 1989. Today his topic, the mathematical challenges for real-time analysis of imaging data, is captured in the first statement of his abstract, where he points out that the range of tasks to be achieved by imaging systems designed to monitor, detect, and track objects requires performance comparable to the perception capabilities of an eye.
From page 288...
... DR. COIFMAN: I would like to actually describe an experiment that I am involved in but -- and raise and describe the various mathematical challenges that come along the way.
From page 289...
... One challenge is that we really don't have any tools to describe in any efficient fashion high-dimensional geometric structures. So, we heard this morning about having a variety of parameters, which have to be looked at as a multivariate system of parameters and then we want to have a probability distribution describing probabilities of various events, but we have no tool to describe the actual support of those probability distributions or just have an efficient way of describing the geometry underlying that.
From page 290...
... Even simpler than that, we don't know how to approximate ordinary functions, which are given to us empirically, as functions of some parameters. So, to give an example, you have a blood test.
From page 291...
... So, what I would like to -- there is just one idea I want to convey today in some sense: that compression, which is basically an efficient representation of data, is a key ingredient in everything that one wants to do.
From page 292...
... 292 David Donoho recently wrote a beautiful paper showing that if you have a minimum complexity description of a set of data, it provides you essentially an optimal statistical estimator. So, the two things are tied.
From page 293...
... So, we took this residual and now we compress it by painting -- taking the best description mode of it. Now you would have a Van Gogh type impressionistic image.
From page 294...
... We heard in the preceding talk what kind of processing it takes and how difficult it is to actually do very simple operations that we don't even think about, like seeing that there is a face down here. So, how does one deal with that directly without having to measure each individual pixel, make up an image and then start to fool around with the pixels and do things.
From page 295...
... So, you don't want to measure each individual pixel. You just want to extract some maybe potentially global information and you want to analyze it in some fashion.
From page 296...
... So, now we are getting into the data mining glut here. It is a vector, which may have hundreds of components and if we wanted to do -- by the way, everything I am telling you has been around for 50 years.
From page 297...
... looking in a scene for some chemical spill or maybe for the presence of some camouflaged object or somebody, which is really different from the environment or maybe it is melanoma on my skin, then the issue is how do I find this on the fly, so to speak. So, I will give you a recipe in a second.
From page 298...
... This is a device that Texas Instruments manufactures for projectors. It is an array that has half a million little mirrors.
From page 299...
... So, the next few slides will show you -- we took a chemical plume and we tried to find it exactly doing that. So, this is just an image of the mirror system and this is the image of our camera.
From page 300...
... So, this is just a tiny example of what you can do with this combined spectral -- combined spatial spectral mix and the switching of the mirror. So, the mirror operates as a computer, basically doing mathematical operations on the light as it comes in before
From page 301...
... it is being converted to digital as opposed to the usual way, which is you convert it to digital. You are dead and that is it, I mean, basically because you have too much data.
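As an illustration of the idea that the mirror array can do mathematical operations on the light before anything is digitized, here is a small simulation, not Professor Coifman's actual device: the "scene" is a vector of pixel intensities, each measurement is the inner product of the scene with one signed mirror pattern, and a known target signature is tested for directly in measurement space. The number of patterns, the matched-filter test, and the use of +/-1 patterns (physically this would take two exposures per pattern) are illustrative assumptions.

```python
# Sketch: detecting a known signature from a few optical projections,
# without ever reading out individual pixels.
import numpy as np

rng = np.random.default_rng(0)
n_pixels = 4096                      # a 64 x 64 scene, flattened
scene = rng.random(n_pixels)
target = np.zeros(n_pixels)
target[1000:1050] = 1.0              # hypothetical "plume" signature
scene_with_target = scene + 0.5 * target

n_measurements = 128                 # far fewer measurements than pixels
patterns = rng.choice([-1.0, 1.0], size=(n_measurements, n_pixels))

def measure(x):
    # Each row of `patterns` is one mirror configuration; the detector only
    # ever sees these few projections, never the individual pixels.
    return patterns @ x

# Matched-filter test in measurement space: project the known signature the
# same way and compare scenes with and without the target present.
m_target = measure(target)
score_with = measure(scene_with_target) @ m_target / np.linalg.norm(m_target)
score_without = measure(scene) @ m_target / np.linalg.norm(m_target)
print(f"score with target: {score_with:.1f}, without: {score_without:.1f}")
```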
From page 302...
... It enters in -- doing this sort of -- the data mining of the spectra, which is very similar to the data mining occurring in, say, gene expression data, for example; the same kind of activity is a dictionary going back and forth between activities. The issue is -- understanding the internal geometries of the spectra as they relate to the spatial distribution, this is critical if you are looking at objects of a given size.
From page 303...
... One challenge includes the fact that researchers don't have tools to describe, in any efficient fashion, high-dimensional geometric structures. In addition, they do not know how to approximate ordinary functions, which are given to us empirically, as functions of some parameters.
From page 304...
... He was promoted to supervisor in 1972, department head in 1985, director in 1990, and functional vice president in 1995. He joined the newly created AT&T Labs in 1996 as director of the Speech and Image Processing Services Research Lab and was promoted to vice president of research in 1998, where he managed a broad research program in communications, computing, and information sciences technologies.
From page 305...
... He received his Ph.D. from the Massachusetts Institute of Technology in 1967, but he moved to the newly created AT&T Labs in 1996 as the director of the Speech and Image Processing Services Research Lab.
From page 306...
... Rabiner is going to talk about today is the maturing, in fact, of our speech recognition to the point where it is not -- now it is very widely applied to a broad range of applications. However, as he points out, although the technology is often good enough for many of the current applications, there remain key challenges in virtually every aspect of voice recognition that prevent this technology from being ubiquitous in every environment for every speaker and for even broader ranges of applications.
From page 307...
... But even when you get the answer, I dialed the wrong number, in order for you to do something with that, okay -- this is a customer care application -- you have to go out to the next level, which is spoken language understanding and get the meaning. Now, I dialed the wrong number might mean I don't want to pay for that.
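A toy sketch, not the AT&T system, of the spoken language understanding step just described: mapping a recognized word string onto a meaning the application can act on. The intent labels and trigger phrases below are invented for illustration.

```python
# Toy "spoken language understanding": keyword spotting onto intents.
INTENTS = {
    "credit_request": ["dialed the wrong number", "don't want to pay", "wrong number"],
    "billing_question": ["my bill", "charge on my account"],
    "agent_request": ["speak to a person", "talk to an operator"],
}

def understand(utterance: str) -> str:
    """Return the first intent whose trigger phrase appears in the utterance."""
    text = utterance.lower()
    for intent, phrases in INTENTS.items():
        if any(p in text for p in phrases):
            return intent
    return "unknown"

print(understand("I dialed the wrong number"))   # -> credit_request
```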
From page 308...
... Hopefully, that system, that entire speech circle will enable you to keep using the system and these are the kind of services people try to build today. So, we are going to concentrate on that first box a lot because there really are a lot of challenges there.
From page 309...
... Then, finally, we do a little bit of cleanup and there is the classical computer science "hello world" program, you know. If you have got that one, you may even have a chance of going to a little more sophisticated one.
From page 310...
... So, it is below that, but it is fundamentally -- it has got some of the same parameters that the cell phone uses for actually operating in the poor environment of cell -- but it has got about 30 or -- that you use in cell phones. So, it is a little bit higher rate.
From page 311...
... They fall apart in noisy environments, when you have office noise, when you try to operate it in the middle of an airport lounge, when you are operating in your car at 60 miles an hour and you have road noise and you have all the background and the radio blaring. When you try to use it over a speaker phone or a cell phone, every system just about falls apart.
From page 312...
... So, the concept is let's try to make it a little bit more like humans would do it perceptually. So, here comes a nice little model of what goes on.
From page 313...
... The acoustic model, though, now we are at the second stage and before -- there were three major models. There is one that is based on physical speech data, called the acoustic model and two that are based on -- so, the acoustic model is basically how do you take frames, which come on a regular basis every 10 milliseconds, but sounds don't come regularly.
From page 314...
... So, you characterize each of these sounds by a hidden Markov model, which is a statistical method for characterizing the spectral properties of the signal, and it is not the right model for speech, but it is so powerful that you can get into the model and represent virtually any distribution. Now, the problem, of course, is you are over-training it in some sense, which makes you non-robust.
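To make the Markov-model idea concrete, here is a minimal sketch, not a production recognizer: the forward algorithm scoring a sequence of feature frames (arriving, say, every 10 milliseconds) against a tiny left-to-right hidden Markov model with one-dimensional Gaussian emissions. All parameter values are invented for illustration.

```python
# Minimal forward algorithm for a 3-state left-to-right HMM.
import numpy as np

A = np.array([[0.6, 0.4, 0.0],           # each state can stay or move right
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])            # always start in state 0
means = np.array([0.0, 2.0, 4.0])         # per-state emission means
var = 1.0                                  # shared emission variance

def gauss(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def log_likelihood(frames):
    """Total probability of the frame sequence, summed over all state paths,
    with per-step rescaling to avoid numerical underflow."""
    alpha = pi * gauss(frames[0], means)
    log_scale = 0.0
    for x in frames[1:]:
        alpha = (alpha @ A) * gauss(x, means)
        s = alpha.sum()
        alpha /= s
        log_scale += np.log(s)
    return log_scale + np.log(alpha.sum())

frames = np.array([0.1, -0.2, 1.8, 2.2, 3.9, 4.1])
print(log_likelihood(frames))
```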
From page 315...
... The next level of the language model is what word sequences are now valid in the task you are doing.
From page 316...
... Of course, the problem is that you get the text from somewhere and that assumes that people will speak according to that text model, much as we assume they would speak according to the acoustic model, and of course there are problems of how you build a new one for a new task. Still real challenges there.
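A minimal sketch of the statistical language model idea, again not the system described in the talk: estimate bigram probabilities from whatever text is available, then score candidate word sequences so that likely word orders beat unlikely ones. The tiny corpus and the add-one smoothing are illustrative choices.

```python
# Toy bigram language model with add-one smoothing.
from collections import Counter
import math

corpus = [
    "i dialed the wrong number",
    "i want credit for the wrong number",
    "please repeat the number",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words[:-1], words[1:]))

vocab_size = len(unigrams)

def log_prob(sentence):
    """Add-one smoothed bigram log-probability of a word sequence."""
    words = ["<s>"] + sentence.split()
    lp = 0.0
    for prev, word in zip(words[:-1], words[1:]):
        lp += math.log((bigrams[(prev, word)] + 1)
                       / (unigrams[prev] + vocab_size))
    return lp

print(log_prob("i dialed the wrong number"))   # plausible word order
print(log_prob("number wrong the dialed i"))   # same words, unlikely order
```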
From page 317...
... So, the real issue is how do you build an efficient structure, a finite state network for decoding and searching large vocabularies. Let's assume we have a vocabulary of a million words, okay, with 50 times 50 times 50 units, with all of the complexity of the language model.
From page 318...
... They get determinized and they get minimized using very, very fancy computational linguistic techniques and out comes a search network with significantly fewer states. So, here is an example of just a simple word transducer for the word "data" and basically the transducer just has arcs, which are the sounds, and states where it winds up.
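Here is an illustrative toy version of such a word transducer, not the AT&T decoder: a small state machine for the word "data" whose arcs are phone symbols, with two pronunciation variants. The phone labels and state numbering are assumptions made for this sketch; real decoders compose, determinize, and minimize far larger networks spanning the whole vocabulary and language model.

```python
# Toy transducer: phone sequences accepted for the word "data".
ARCS = {
    (0, "d"): 1,
    (1, "ey"): 2,   # "day-ta" pronunciation
    (1, "ae"): 2,   # "da-ta" pronunciation
    (2, "t"): 3,
    (2, "dx"): 3,   # flapped /t/
    (3, "ax"): 4,   # reaching state 4 emits the word
}
FINAL = {4: "data"}

def transduce(phones):
    """Return the word if the phone sequence is accepted, else None."""
    state = 0
    for p in phones:
        nxt = ARCS.get((state, p))
        if nxt is None:
            return None
        state = nxt
    return FINAL.get(state)

print(transduce(["d", "ey", "t", "ax"]))   # -> "data"
print(transduce(["d", "ae", "dx", "ax"]))  # -> "data"
print(transduce(["d", "ao", "g"]))         # -> None
```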
From page 319...
... The question is how do you know if that sequence of words is reasonable or the right one. So, we do what we call utterance verification.
From page 320...
... We have another one in a customer care application where you can speak freely and during it, people will list numbers, you know, like they want credit for numbers and there the word error rate is 5 percent. That is a 17 to 1 increase in digit error rate.
From page 321...
... It doesn't work. Well, we have a system I am going to show you later that started off with a 50 percent word error rate and the task error rate was virtually zero.
From page 322...
... Then digit recognition, here is digit recognition, the humans outperformed the machine by a factor of about 30 or 40 to 1, including those gross error rates. Humans do really well -- remember, we had telephone operators for a long time who got their pay based on never misrecognizing the numbers, because that would annoy a customer.
From page 323...
... They map high-information word sequences to meaning and we do that really, really well. We do performance evaluation and we do these operator-assisted tasks.
From page 324...
... This was our customer care, touch tone hell system, as we call it. I will play a little bit of this.
From page 325...
... [Multiple discussions.] Here is a little example of how you might do it on the web-based customer care agent.
From page 326...
... There are desktop applications like dictation and command and control of the desktop and control of document properties, stock quotes and traffic reports and weather and voice dialing. There are 20 million cell phones that are voice enabled today, access to messaging, access to calendars.
From page 327...
... We started going from tens and hundreds of words to tens of thousands and hundreds of thousands and eventually a million words in the early nineties; we had multimodal and dialogue systems in this time frame; and finally the last one as we go forward. The dialogue systems are going to be very large vocabulary, limited task and control.
From page 328...
... DR. RABINER: Carnegie Mellon particularly made their same -- they were one of the key pioneers in this.
From page 329...
... just get it all for research purposes only and use it. So, in a sense it is the most open community you can imagine.
From page 330...
... The goal is to put intelligence in the handset, so you can make speech recognition in cell phones much less dependent on the transmission you use.
From page 331...
... if you do the recognition in one language really well, the text-to-text translation itself is a real problem. So, there has been a lot of study on how you do text-to-text translation.
From page 332...
... You have to change it all the time and once you change it, that performance starts edging down a little bit because you are a variable yourself. But the technology is not good enough -- you can say I am going to guarantee to keep all these bad guys out -- but it is a pretty good return technology.
From page 333...
... Okay? So, if you listen to that North American Business News or whatever, the key words were all perfect.
From page 334...
... Synopsis isn't perfect, but most people find that you actually get through that and what it does is it gives them the triaging capability. So, if you are a sales person with 30 -- I get one voice mail message every month.
From page 335...
... Having something that converts it into text and looking for some key words. They are not going to be the key words you wanted just for the reason that you can't really listen to everything that goes on.
From page 336...
... this session was to listen and to capture some of the challenges, the complexities that we should be thinking about in terms of the presentations.
From page 337...
... In addition, speech recognition systems are often very environment-specific, performing poorly when tested with speaker phones or cell phones, or in noisy environments such as restaurants, airport lounges, or moving cars.
From page 338...
... David McLaughlin, "Remarks on Image Analysis and Voice Recognition." David McLaughlin is the provost and a professor of mathematics and neural science at New York University. His research interests are in visual neural science, integrable waves, chaotic nonlinear waves, and mathematical nonlinear optics.
From page 339...
... The second is that this isn't one problem, homeland defense. It is thousands of problems and as mathematicians or as scientists, we tend to attack one problem or one area of problems, if you like.
From page 340...
... Steve is a high-ranking officer in the Coast Guard, perhaps a lieutenant commander, but in any case a high-ranking officer, who this year is on leave from the Coast Guard in one of the government think tanks, and his task is to think about border security. He is concerned with two things.
From page 341...
... He talks about how in Pakistan, crates are loaded onto something and they get to Hong Kong and in Hong Kong, they are stored more or less in the open, after which they are loaded onto boats and come to San Diego. In San Diego they are loaded onto trucks.
From page 342...
... But as Professor Malik knows, and as you can see if you read his work, it is very wise in computer vision to attempt to use biological vision
From page 343...
... But when you think about how visual information is processed in the cortex, there are a few things that we do know and I just want to close by listing some of them and I am sure that the computer vision experts (a) know these, (b)
From page 344...
... For example, the visual system is often modeled as a feed-forward network where information enters through the retina and goes on to a filter called the LGN, the lateral geniculate nucleus, and on into the visual cortex; two specific layers of the primary visual cortex, 4 and 6. Now, it is known anatomically that there is a huge amount of feedback from 6 back to the LGN.
From page 345...
... Now, this is saying nothing to Professor Malik, who does this in his work continually, but it is a fact that I think the computer vision community doesn't talk with the biological vision community enough. So, those are the only three points that I really wanted to make, and it is late, and I think I will conclude with that.
From page 346...
... In computer vision, for example, it would be wise to attempt to use biological vision to guide our thinking and construction of algorithms.
From page 347...
... Bass Professor in the Humanities and Sciences at Stanford University. He has worked on wavelet denoising, on visualization of multivariate data, on multiscale approaches to ill-posed inverse problems, on multiscale geometric representations, such as ridgelets, curvelets, and beamlets, and on manifolds generated by high dimensional data.
From page 348...
... With respect to the voice recognition, I question that there were only two words there in 1980, because I know in 1981 the system we had recognized, or purportedly recognized, 10 if properly enunciated. The next discussant that we have here to make commentary on, again, the complexity of the various topics is David Donoho.
From page 349...
... The information technology business is a huge part of our economy. It dominates corporate America.
From page 350...
... So, homeland security that we can contribute to, I guess, everyone is assuming it is all about data gathering, data analysis and then somehow we are going to find in this mountain of data things that are going to be useful. For example, if we had a national I.D., there would be some way to codify that into a process that would let us keep people from doing bad things to our citizens.
From page 351...
... So, all this data is being gathered and I could only sketch here, just mention a few examples, but it goes on and on. So, certainly all aspects of human identity, gait and posture will be captured and digitized completely.
From page 352...
... Secondly, and this was mentioned before, there is the issue that in our system we do things by private enterprise and so there is going to be a lot of trade speech claiming benefits and workability, which is quite different from the actual experience of some of the customers who tried to do these things. In addition, there are massive engineering issues -- there are massive engineering problems associated with this, that just merely to have something like a national I.D. or to use biometrics in a consistent way that would really close off from undocumented people the ability to get on airplanes and so on, just to create that system is very challenging and there would be a lot of discussion about whether you could actually build the infrastructure.
From page 353...
... Robustness would be, again, things work in perfect conditions, but you can't deal with articulated conditions, where something is varied away from the research lab. A lack of fundamental understanding has to do with the fact that a lot of what we are doing is using what mathematics gave us 40 or 50 years ago and not developing mathematics that tries to understand the underlying issues in the data that we are trying to deal with.
From page 354...
... I see a couple of things that came out in the talks today. Coping with dimensional glut, understanding high dimensional space and representing high dimensional data in qualitatively new ways.
From page 355...
... Traditional statistical theory says that the number of dimensions compared to the number of observations should be such that you have a very favorable ratio; namely, D over N is small.
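A small numerical illustration of that point (mine, not from the talk): an ordinary least-squares fit with the dimension D approaching the number of observations N still fits the training data, but it generalizes badly. The dimensions, sample size, and noise level are arbitrary choices.

```python
# How test error of least squares degrades as D/N approaches 1.
import numpy as np

rng = np.random.default_rng(1)

def test_error(d, n_train=50, n_test=1000):
    true_w = rng.normal(size=d) / np.sqrt(d)
    X, Xt = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
    y = X @ true_w + 0.1 * rng.normal(size=n_train)
    yt = Xt @ true_w + 0.1 * rng.normal(size=n_test)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares fit
    return np.mean((Xt @ w_hat - yt) ** 2)

for d in (5, 25, 45, 49):          # N is fixed at 50
    print(f"D={d:2d}  D/N={d/50:.2f}  test MSE={test_error(d):.3f}")
```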
From page 356...
... The whole area of what distortions there are and what systematic issues you have is not very well understood. Another area of dimensional glut is that you have the curse of dimensionality.
From page 357...
... We also know that when we look at things like the articulations of images as they vary through some condition like the pose, that we can actually see that things follow a trajectory up in some high dimensional space notionally, but we can't visualize it or understand it very well.
From page 358...
... Representation of high dimensional data, Professor Coifman mentioned this and in some of Professor Malik's work it is also relevant. What are the right features or the right basis to represent these high dimensional data in so that you can pull out the important structures?
From page 359...
... It is a much more subtle thing. So, independent components analysis deals much more with the structure as it really is and you get little basis functions, each one of which can be mapped and correlated with an actual muscle that is able to do an articulation of the face.
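A brief sketch of the contrast being drawn, offered as an illustration rather than Professor Donoho's analysis: principal components analysis finds orthogonal directions of maximal variance, while independent components analysis recovers statistically independent sources, here two synthetic "articulation" signals mixed together. The signal shapes and the mixing matrix are made up for this example.

```python
# PCA vs. ICA on two mixed independent sources.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sign(np.sin(3 * t)),        # independent source 1
                np.sin(5 * t)]                 # independent source 2
sources = sources + 0.05 * rng.normal(size=sources.shape)
mixing = np.array([[1.0, 0.5],
                   [0.4, 1.0]])
observed = sources @ mixing.T                  # what we actually measure

pca_comps = PCA(n_components=2).fit_transform(observed)
ica_comps = FastICA(n_components=2, random_state=0).fit_transform(observed)

def best_corr(recovered, true):
    # correlation of each recovered component with its best-matching source
    return [max(abs(np.corrcoef(recovered[:, i], true[:, j])[0, 1])
                for j in range(2)) for i in range(2)]

print("PCA component/source correlations:", best_corr(pca_comps, sources))
print("ICA component/source correlations:", best_corr(ica_comps, sources))
```

In this kind of experiment the ICA components correlate far more strongly with the individual sources, which is the sense in which each recovered component can be tied to a single underlying "muscle."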
From page 360...
... It can provide a language to science and a whole new set of concepts that may make them think better about it. Our challenge is to think more about how to cope with dimensional glut, understand high dimensional space and represent high dimensional data.
From page 361...
... The structure of image manifolds, as Professor Malik mentioned, working directly with shape, that is again trying to get away from a flat idea of the way data are in high dimensional space and come up with a completely different way to parameterize things. Professor Coifman, well, we are basically like co-authors, so I am just extending his talk, but, anyway, the connections there are pretty obvious.
From page 362...
... I think all the speakers have pointed out that a lot of progress has been stimulated by -- in math, I think, from the study of images and -- associated multimedia data and I think that will continue as well and eventually we will see the payoff in homeland security deliverables, but perhaps after a long time. Thank you.
From page 363...
... of how many were just, you know, forgotten publications. But it just seems like millions of references -- PARTICIPANT: [Comment off microphone.]
From page 364...
... So, you know, I am a little bit -- so, for example, a national I.D. card wouldn't have any privacy issues if all it was used for is for -- you know, it contained data on the card and it was yours and it was digitally signed and encrypted and all of that stuff and there was no national database that big brother could look through.
From page 365...
... But people get together and butt heads is sort of an Orwell characterization I guess. The science is absolutely clear, that you could do it without having all of those -- violating personal liberty and so on.
From page 366...
... For example, human identity, even highly decimated, might require some 65,000 variables just to produce raw imagery. Characterization of facial gestures is a subtle thing, and independent-components analysis (as opposed to principal-components analysis)

