Read "The Mathematical Sciences' Role in Homeland Security: Proceedings of a Workshop" at NAP.edu


Suggested Citation:"Image Analysis and Voice Recognition." National Research Council. 2004. The Mathematical Sciences' Role in Homeland Security: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/10940.

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Roberta Lenczowski
"Introduction by Session Chair"

Roberta E. Lenczowski is technical director, National Imagery and Mapping Agency (NIMA), in Bethesda, Maryland. Ms. Lenczowski earned her B.A. in philosophy from Creighton University in 1963. She completed her M.A. in philosophy from St. Louis University in 1970. In November 1977, Ms. Lenczowski began her professional career with the Defense Mapping Agency, now the National Imagery and Mapping Agency. She held numerous technical and supervisory positions before her appointment as technical director in 2001.

DR. LENCZOWSKI: To try to get this session started, albeit a little bit late, but I know that the discussions have been very enthusiastic, very, very interesting. One of our speakers, Dr. Rabiner, has to leave by quarter to 6:00. So, I don't want to do a whole lot of introduction with respect to my background. Suffice it to say, I am from the National Imagery and Mapping Agency. As a result, I have a very vested interest in understanding imagery analysis. So, based upon the short abstracts I have read for each of these speakers, I know that things they are going to say will be very relevant to work that we do within the agency. So, I will introduce each of the speakers. I will give you a little bit of an insight, a tickler, if you will, in terms of what their topic is. When they have completed, then, of course, we will open this to a discussion and an exchange.

Our first speaker is Dr. Malik. He, in fact, received his Ph.D. in computer science from Stanford University in 1985. In 1986, he joined the faculty of the Computer Science Division in the Department of Electrical Engineering and Computer Science at the University of California in Berkeley, where he is currently a professor.

For those of you who are not familiar with his background, his research interests are in computer vision and computational modeling of human vision. His work spans the range of topics in vision, including image segmentation and grouping, texture, stereopsis, object recognition, imagery-based modeling and rendering, content-based imagery querying and intelligent vehicle highway systems. He has authored or co-authored more than a hundred research papers on these topics. So, he is an incredible asset to the discussion as we continue.

I need to point out that in my own background, since the very early eighties, I have been following the work that has gone on in the community that I am predominantly from, that being the mapping side of the National Imagery and Mapping Agency, and have watched as we have attempted to have insights in terms of some of the research issues of automatic or what we refer to now as assisted feature recognition, or automated or assisted target recognition. So, one of the things that Dr. Malik plans to talk about in terms of recognizing the objects and the actions is analyzing images with the objective of recognizing those objects or those activities in the scene that is being

imaged. As we get into the later discussion period, hopefully I will have an opportunity to provide some more descriptive information about why this is so relevant to the business that we are in and how that, in fact, is of great support with respect to the core topic here of homeland security.

Introduction by Session Chair

Roberta Lenczowski

Dr. Lenczowski introduced herself as a scientist from the National Imagery and Mapping Agency who has a very vested interest in understanding imagery analysis. Recognizing objects and actions means analyzing images with the objective of recognizing those objects or those activities in the scene that is being imaged. Such analysis is important with respect to the core topic of homeland security.

Jitendra Malik
"Computational Vision"

Jitendra Malik was born in Mathura, India, in 1960. He received a B.Tech. degree in electrical engineering from the Indian Institute of Technology, Kanpur, in 1980 and a Ph.D. degree in computer science from Stanford University in 1985. In January 1986, he joined the University of California at Berkeley, where he is currently the Arthur J. Chick Endowed Professor of EECS and the associate chair for the Computer Science Division. He is also on the faculty of the Cognitive Science and Vision Science groups. His research interests are in computer vision and the computational modeling of human vision. His work spans a range of topics in vision, including image segmentation and grouping, texture, stereopsis, object recognition, image-based modeling and rendering, content-based image querying, and intelligent vehicle highway systems. He has authored or coauthored more than a hundred research papers on these topics.

He received the gold medal for the best graduating student in electrical engineering from IIT Kanpur in 1980, a Presidential Young Investigator Award in 1989, and the Rosenbaum Fellowship for the Computer Vision Programme at the Newton Institute of Mathematical Sciences, University of Cambridge, in 1993. He received the Diane S. McEntyre Award for Excellence in Teaching from the Computer Science Division, University of California at Berkeley, in 2000. He was awarded a Miller Research Professorship in 2001. He serves on the editorial boards of the International Journal of Computer Vision and the Journal of Vision and on the scientific advisory board of the Pacific Institute for the Mathematical Sciences.

DR. MALIK: Thank you, Dr. Lenczowski. Let's begin at the beginning. How do we get these images and what do I mean by images? So, the first thing to realize is that images arise from a variety of sources and they could be volumetric images, just 2D images, images over time. So, of course, we have our canonical example of optical images, which are getting to be extremely cheap and extremely high resolution. There are going to be zillions of these devices everywhere, whether we like it or not. But I also wanted to mention other modalities, which have their own advantages. So, there are x-ray machines, as when you walk through the airport. Those are not tomography machines, but they increasingly will be, and thus they will give access to 3D data. This may be of great relevance for automated techniques for detecting weapons, explosives and so on. There are range sensors, which give you not just information about the brightness coming back, but also directly depths. We refer to this as 2.5D data, because it is not the full volume. You could have infrared sensors and no doubt many other exotic techniques are constantly being developed.
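As a rough illustration of the data types listed here, the sketch below lays out each modality as an array. The sizes, the axis order, and the two-channel layout for range data are assumptions made for the example, not anything specified in the talk.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
H, W, D, T = 4, 5, 3, 2

optical = np.zeros((H, W))      # 2D image: one brightness value per (y, x)
video = np.zeros((T, H, W))     # images over time: (t, y, x)
volume = np.zeros((D, H, W))    # volumetric data, e.g. tomography: (z, y, x)

# Range ("2.5D") data: brightness plus a single depth per pixel.
# It records one visible surface, not a full volume.
range_image = np.zeros((H, W, 2))   # channel 0 = brightness, channel 1 = depth

for name, arr in [("optical", optical), ("video", video),
                  ("volume", volume), ("range", range_image)]:
    print(name, arr.shape, arr.size)
```

The range image stores only twice as many values as the optical image, whereas the volume grows with the number of depth slices, which is one way to read the "2.5D" label.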

I won't talk about the reconstruction part of the problem. I assume that we have the images and let's process them further. So, this is the core problem. We start off with -- it is just a collection of big -- with some attributes and some spatial, XY, XYZ, XYZT, whatever. But we want to be able to say in -- and I am going to use examples from optical images but many of the techniques do carry over. That is a problem. How do we do this? That is a recognition problem.

So, when we look at this image, we all recognize it. Maybe it is because you have been to the Louvre. Maybe it is because this image is splattered over a zillion places. But probably you have not seen him before. So, I wish to argue that -- to identify that something is a face doesn't rely on you having seen that specific face and, of course, you have -- varieties. So, the key point to note is that recognition is at varying levels of specificity. There is a category level: faces, cars. There is a specific person, Mona Lisa, or it could be a specific person in a specific expression.

The second key point, and this is a very important one, is that we want to recognize objects in spite of certain kinds of variation, variation such as changes in pose. You can look at the person from different

viewpoints. Changes in lighting and -- images, you would have occlusion. There would be clutter and there will be objects hiding each other. This is why just thinking of it as a traditional statistical classification problem is not quite adequate, because in that framework, some of these variations due to geometry and lighting, which are very systematic and which result from the -- of physics, are treated as noise, and that may not be the right thing to do.

So, that is about objects. But we can talk about action recognition as well, particularly when we have video data, whatever. Just a collection of common English words that are really associated with actions, so you can read the list there. They involve movement and posture change, object manipulation, conversational gesture and moving the hands about, sign language and so on. Certainly, we can recognize these as well.

So, how shall we do this? So, what I am going to do in this talk is sort of -- our way through this area and at a later part, I am going to mostly talk about work and approaches taken in my group, but I wanted to make the survey sort of inclusive. So, initially, I am going to talk about face recognition. I want to mention -- why do I pick faces? Faces are really just another class of object.

So, in principle, the same techniques apply. However, it is a very special and very important domain. It has been argued that a baby at the age of one day -- and there have been experiments which have argued that at age 15 -- can distinguish a face from a non-face. It may, in fact, be already hard-wired into the vision system. There has been a lot of work in the last ten years and the implications for surveillance and so forth are obvious. So, it is worth treating it specially.

So, this is work from Carnegie Mellon University. So, there are now pretty good techniques for doing face detection. That has to be the first problem. You start out with some image and you want to find that set of pixels which are -- to a face. So, that is the face detection problem. I am sure they picked apropos examples here. You can try this detector yourself. You can select any photograph you like and see the results of the algorithm. I think this is currently the best algorithm in the business because it works for faces in a variety of views and does it reasonably well. This gives you sort of an idea of what the performance is like. So, the performance is a 94 percent detection rate on this, with a false detection every two

images. This is the current state of the art. No doubt things could be improved.

Okay. So, that is what you can do to try to find faces. But now you want to say whose face it is. Is it Nixon's face or Clinton's face? So, here what I am going to do is to just report results from a study which was carried out a couple of years ago, and the person who did this study, or was a leader in the study, is Jonathan Phillips, who is actually at the back of the room. So, any detailed questions, he is the one to ask. This -- become commercial, which unfortunately always complicates things because now scientific claims get mixed up with commercial interests and so on and so forth. Anyway, here is a study, which -- systems from different vendors.

So, the key issue here is you will take some picture of a person at some time. You are not going to act when you finally try to identify that person in a crowd. The person is not going to obligingly present his or her face in exactly the same pose, exactly the same lighting, with exactly the same hairdo, with exactly the same state of wrinkledness or lack thereof. So, you have to consider this recognition in spite of these variations.
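A study of this kind scores a system by how often a new "probe" image is matched to the right person in a stored "gallery". The sketch below is a minimal, assumed version of that scoring, not the protocol of the study itself. Faces are stand-in random feature vectors, matching is plain nearest neighbor, and added noise plays the role of a change in pose or lighting. All names and sizes are hypothetical.

```python
import numpy as np

def rank1_identification_rate(gallery, gallery_ids, probes, probe_ids):
    """Fraction of probes whose nearest gallery vector has the right identity."""
    gallery = np.asarray(gallery, dtype=float)
    probes = np.asarray(probes, dtype=float)
    # Pairwise Euclidean distances, shape (n_probes, n_gallery).
    dists = np.linalg.norm(probes[:, None, :] - gallery[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    predicted = np.asarray(gallery_ids)[nearest]
    return float(np.mean(predicted == np.asarray(probe_ids)))

# Toy data: 200 people, one 16-dimensional feature vector each (made up).
rng = np.random.default_rng(0)
ids = np.arange(200)
gallery = rng.normal(size=(200, 16))

# Probes taken under nearly the same conditions versus strongly changed ones.
probes_similar = gallery + 0.05 * rng.normal(size=gallery.shape)
probes_changed = gallery + 2.0 * rng.normal(size=gallery.shape)

rate_similar = rank1_identification_rate(gallery, ids, probes_similar, ids)
rate_changed = rank1_identification_rate(gallery, ids, probes_changed, ids)
print(rate_similar, rate_changed)
```

In this toy setting the rate falls as the probe conditions move away from the gallery conditions, which is the qualitative effect the speaker goes on to describe for pose and illumination.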

So, here the experimenter -- a gallery of 200 people. So, even though you can have exactly the same pose, at which point the accuracy -- pretty good. You can see that the face is slightly angled in each of these. At 25 degrees -- performance has dropped to 68 percent. And at 45 degrees, it is at 30 percent. So, basically it is useless at this stage. So, basically what we are talking -- variation of plus or minus 20 to 25 degrees. This is starting from frontal views. People at CMU have done a -- they tell me that the results from side views are even worse. The degree of variation that you can tolerate becomes even smaller. So, this is the reality and the reason is that the techniques are primarily two dimensional. So, they don't deal with a -- variation in any serious way.

PARTICIPANT: How about if you have --

DR. MALIK: If you have multiple views. Yes. So, the CMU people claim that -- because the frontal view is actually the most generalizable view. That gives you this essentially 45 degree cone. From the side view, the generalization cone is smaller. We can get into it in more detail later.

So, this is a question about illumination variation and here again there is a drop here, from 91

percent to 55 percent with different kinds of lighting. This is variation -- distance, but let's move on. So, here, this is sort of a summary slide, which shows how the performance -- so, this is performance, how it drops with variation in pose, variation in indoor versus outdoor lighting, different distances, and then the same person over differing time intervals. You know, the same person -- one year, two years.

So, I think the moral of the story is that face recognition is not a solved problem. If any propaganda has told you so, please disabuse yourself of this notion. But there has been concrete progress, with quantifiable, measurable results, but the issues of generalization, in particular, generalization to -- pose, dealing with the quality of data you might expect with surveillance cameras. These are all real issues and there is plenty of scientific work to be done on this problem. It is not a solved problem yet.

Okay. Let's move on. So, now I want to talk about general object recognition and this is, again, a huge topic and I am going to use the typical caveat, which is -- approach and not necessarily the final approach, so on and so forth.

Okay. So, here is the philosophy first. How do we deal with object recognition? I think the key issue is dealing with shape variability. That is the bottom line. How do we recognize an object? These are -- faces -- or fish. We identify them as the same category, even though they are not identical. The person who -- and now how do we think about this.

There is a standard statistical way of thinking about this. You construct a feature vector. You have some training examples and then you sort of model the variation. So that would require that you -- the problem into that framework, which means you take shapes and make them into feature vectors. This is not necessarily a good thing because you are losing the -- structure of the problem and you will come up with some feature vectors, like moments or whatever, and they will often not capture all the aspects of the problem. There will be arbitrariness.

How should we think about this? D'Arcy Thompson, nearly a hundred years ago, had a very good insight about this problem and he said that while it may be difficult to characterize shape, variability in shape may be easier if we think of it this way: there is this grid here, and what this grid is supposed to indicate is that there are spatial transformations, which transform one of these

shapes to this other shape. You can see that with the face. And this transform is obviously a non-linear transform. I mean, you could imagine simple rotation and translation, but we have to go beyond that. Now, this transform could be -- we then considered this variation model in terms of transformation and when we want to say is this possibly the same -- in the same category, we put probability distributions on the space of transformations. These will be around an operating point, which is the -- shape, and this gets us out of this fundamental issue of how do we define shape in this general way.

Now, I think mathematicians have studied shape for quite some time, but they basically have this notion of equivalence class and we really need this notion of similar but not the same. That needs a fair amount of work.

MS. CHAYES: So, what are the kinds of transformations that you allow? I mean, obviously, you are not allowing everything.

DR. MALIK: I will talk about it. So, this actually -- so, this philosophy is broadly called the deformable template philosophy and there has been a fair amount of work on it and -- has sort of pioneered this approach, but in his work, there are two issues in all of this, which is one -- statistical

efficiency, optimality estimation. There are questions like that. There are questions of geometry and -- questions of computational efficiency and algorithms. All of these need to be sort of factored and tracked at the same time. Otherwise you don't have a workable approach.

So, a key issue here is the issue of correspondence. If I want to estimate a transformation, so these two A's are the same, but to estimate the transformation, I must know what each point was. How do I find that out? Okay. So, that is the so-called correspondence problem and that -- if we can do that, then the second problem is this transformation problem. So, the first problem essentially is more a discrete -- algorithm kind of problem. The second problem is a more traditional continuous mathematics kind of problem, and this is often the case that there is this mixture of discrete type stuff and continuous type stuff and we have to play in both these games.

PARTICIPANT: It is not clear to me that you have to -- I mean, I might be wrong, but tell me why -- why can't I just say I look at such information such that the image -- is that --

DR. MALIK: Computationally it could be -- basically you will wind up doing gradient -- space, which is non-convex, and you will get stuck at a local minimum.

Suppose we represent a shape by a collection of -- I basically won't tell you how we do this, but it ultimately comes down to attaching to each point some descriptor which sort of captures the relative arrangement of the rest of the shape with respect to that point and then trying to -- so, shape is essentially the relative configuration of the rest of the points with respect to you, and two points are to be matched if they have similar relative configurations. So, you can sort of formalize that in some way. This is based on what is called an assignment problem or a bipartite matching problem, and we will skip the details here.

Okay. So, let's say we have done that. Now we talk about the problem of estimating the transformation. So, here is where -- now I can answer your question. So, estimating the transformation, the transformation cannot be arbitrary. You basically want it to -- you want it to be -- but not an arbitrary -- you want to somehow penalize ones which are non-smooth and -- but that smoothness has to be --

MS. CHAYES: So, there is a cost --
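The correspondence step described above can be sketched as an assignment problem. The descriptor used here, each point's sorted, scale-normalized distances to all the other points, is a simplified stand-in chosen for the example. The speaker explicitly skips the details of his own descriptor, so this should not be read as his method. The function names are hypothetical; `linear_sum_assignment` is SciPy's solver for minimum-cost bipartite matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def relative_descriptors(points):
    """Describe each point by its sorted distances to every point of the shape."""
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    d = d / d[d > 0].mean()      # normalize scale by the mean pairwise distance
    return np.sort(d, axis=1)    # unchanged by translation and rotation

def match_points(shape_a, shape_b):
    """Match points with similar relative configurations (equal-size point sets)."""
    da = relative_descriptors(shape_a)
    db = relative_descriptors(shape_b)
    # cost[i, j] = how different point i of A looks from point j of B.
    cost = np.linalg.norm(da[:, None, :] - db[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)   # minimum-cost bipartite matching
    return cols, float(cost[rows, cols].sum())

# Toy check: shape B is shape A rotated, shifted, and with its points shuffled.
rng = np.random.default_rng(1)
shape_a = rng.uniform(size=(12, 2))
theta = np.deg2rad(30.0)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
perm = rng.permutation(12)
shape_b = (shape_a @ rot.T + np.array([3.0, -1.0]))[perm]

matches, total_cost = match_points(shape_a, shape_b)
print(matches, total_cost)
```

Here `matches[i]` is the index in `shape_b` assigned to point `i` of `shape_a`. Solving the assignment globally, rather than picking each point's best match independently, enforces a one-to-one correspondence.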

DR. MALIK: Yes. So, the next slide will tell you what the cost function is. So, essentially what we use is a thin plate spline model. So, there will be some set of points at which you know the correspondence and now which -- how do you interpolate? Well, a very natural model is a thin plate spline model, in which case there is a certain cost function, which is -- energy, and you can solve this in a fairly natural way. So, the next slide actually will give you an example of this.

So, here is what I am trying to do. I have this unknown shape. I don't know what it is. My rule is going to be that I take one of my model shapes, which I know; I know that kind of fish. I will try to deform it to line up with this. If I succeed, well, then I am probably working.

MS. CHAYES: How sensitive is this to the cost function? If I change the cost function, will I -- a little bit, will I dramatically change the output or --

DR. MALIK: There is continuity, yes.

PARTICIPANT: How do you know to choose the model?

DR. MALIK: I will try each model in turn and then whichever one will match the best is what I would

declare to be the result. You don't want a linear search through all the models. But I won't get into that now.

The grid here is trying to indicate the transform. The grid is like a gauge figure to show you the transform. It is really two steps. One is a discrete -- step. One is a continuous transform estimation step. As you see in the -- you can see that it essentially stabilizes, effectively with the deformed shape being like this one.

PARTICIPANT: So, is it correspondence that you should match, that two points should have the same --

DR. MALIK: No, it is a bit more complicated than that. You don't want it to be just an XY thing. You want it to somehow be more in some way between geometric and topological, but that is a meaningless statement, I can see.

MR. AGRAWAL: [Comment off microphone.]

DR. MALIK: No, that separation of the -- and you can do it in either version. You can essentially define a local frame from the direction of -- and you can orient everything with respect to that. But that will be with a middle reflection problem, but it will be with small rotations.

We took this approach and -- with this approach we showed that we can do this for a variety of different

domains and this is important because, again, the tradition of object recognition work has been that it takes a lot of engineering and hard work to really make it succeed in any domain. So, digit recognition, you work on it for ten years. You want to do printed circuit board inspection, the same thing. Now, this is not a scalable solution because if we want to approach the level of 10,000 objects, a hundred thousand objects, you really need one general technique, which you could then clean up with appropriate data to work in many different settings. This approach seems to have that characteristic.

Digit recognition is a classic problem in this domain. People have worked on it for, you know, many years and there is this data set where every technique known to man, or at least to statisticians, has been tried. We had nothing special on this and we -- an error rate which is probably not significantly better but at least as good with a completely -- technique and I think tuned for digits with much less training data. These are actually all the errors and if you look through them carefully, you will find that some of these even you as a human will get wrong. This data set is digits written by high school students and -- two sets.

There is a subset from USPS employees and a subset from high school students. The USPS employees, everybody can do them. It is the high school students who create problems.

Here is how to do 3D objects. A crude way of doing it is to store multiple views and do the matching. There are techniques by which you can select automatically how many views you need. Here the performance can be judged this way. The tradeoff is between having many views, in which case you can do the problem easily, and having few views and yet having a good error rate. The point is that even with -- with only four views per object, we could get an error rate of 2.5 percent.

MS. CHAYES: How come that is not monotone?

DR. MALIK: Well, there is some sampling issue there. There are some stochastic processes involved in the selection of the view.

So, here is an example in a more difficult type of domain. So, now we are moving towards talking about action recognition. You want to identify people and what action they perform. Now, identifying people really -- if you have enough resolution on them really implies knowing their body parts, meaning knowing all their joints. So, on this figure you want to -- you want to be able to -- right shoulder, left shoulder, et cetera, et cetera. If you can

do that, then you can -- it turns out that the same approach works, except that you must use an appropriate deformation model. This shows some results and this is from a -- this is from a Japanese, this tradition of comic books. So, they have these -- a variety of poses -- these artists to copy from -- a stick figure model from a single image, and the reason you can do this is because of anthropometric constraints. You know all the lengths of the body parts; they essentially provide you an equivalent of a second view.

This is an example of -- if you move the mouse -- so, this is a sequence of -- what is being done in -- a stick figure is being identified. This -- the question of recognizing actions. This is what we mean by action recognition. I want to point out that actions are -- the -- action is in a very early stage. I mean, there has been work done by others in the community, but usually you do distinguish between two classes of action. Actions are to -- a hierarchy and this is an area which really needs to be explored.

Think of the action of withdrawing money from an ATM. It is in fact a sequence of steps, but the sequence of steps need not always be executed in exactly the same

order. Really, there is a partial-order plan, which is -- here, and multiple agents could be involved in this. So, this is very relevant because if we want to talk about situation assessment, that is really this -- I want to just remind you that the problems are actually quite isomorphic. More or less -- idea really applies in space time as well as in space. The cues are both static as well as dynamic. So, in some sense the -- and the dynamics, and when you want to recognize action, this is not just based on the movement of my body but on what objects are manipulated. So, object recognition and activity recognition are coupled. The fact that I am writing with a pen -- so, the philosophy that we can try to do this by recovering --

I will conclude here. This is my final slide. So, it should be obvious that there are zillions of challenges for mathematicians and statisticians in this business. It is a fundamental problem. Thirty percent of the brain is devoted to -- it has evolved through a million years. There is an arms race between animals trying to camouflage and animals trying to detect and break through camouflage.

Just a few of the problems. Shape: modeling shape variation. Shape is a fundamental problem. The

field of geometry in mathematics is all about shape. But here the question is similarity. How do we define similarity in that? This is a fundamental problem on how to think about it. The statistical viewpoint together with the algorithmic viewpoint, because, for example -- had great -- construct, because you claim that you have an optimization problem and then you can't solve it. Doesn't get you too far.

I should, again, return to the beginning. I -- optical images, but that is not the only imaging modality. With the more exotic imaging modalities you have more fundamental questions of how to produce the image, the question of reconstruction. Thank you.

[Applause.]

Computational Vision

Jitendra Malik

Recognition by computational vision may be carried out at varying levels of specificity. It can involve searching for any face, a type of face, a specific person's face, a specific person with a specific expression, or a person performing a certain activity. Major challenges in recognition relate to detecting objects despite "variation," which includes changes in pose, hair, lighting, viewpoint, distance, and age.

There are two approaches to the problem of shape variability. One possibility would be to divide a particular shape into components and perform comparisons between two shapes component-wise. A second would be to view the two shapes of interest as shapes on a basic grid that indicates spatial transformations and determine if there is a transformation mapping one shape to the other.

For the future, researchers are moving toward the problem of action recognition, that is, identifying people and the action they are performing. Current approaches draw from the above techniques, but the work is in a very early stage.
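The second approach, finding a smooth transformation that maps one shape onto the other, can be illustrated with a thin plate spline, a common interpolant for this kind of grid deformation. The sketch below is a minimal version under stated assumptions: point correspondences are already known, the kernel is the standard 2D form U(r) = r^2 log r, and `fit_tps`, `warp`, and the `reg` parameter are hypothetical names introduced for the example rather than the speaker's implementation.

```python
import numpy as np

def _kernel(r):
    """Thin plate spline kernel U(r) = r^2 log r, with U(0) = 0."""
    out = np.zeros_like(r)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out

def fit_tps(src, dst, reg=0.0):
    """Fit a 2D thin plate spline sending src points to dst points."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)
    K = _kernel(np.linalg.norm(src[:, None, :] - src[None, :, :], axis=2))
    P = np.hstack([np.ones((n, 1)), src])      # affine part: 1, x, y
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K + reg * np.eye(n)            # reg > 0 trades fit for smoothness
    A[:n, n:] = P
    A[n:, :n] = P.T
    rhs = np.zeros((n + 3, 2))
    rhs[:n] = dst
    params = np.linalg.solve(A, rhs)
    w, a = params[:n], params[n:]
    bending = float(np.trace(w.T @ K @ w))     # proportional to bending energy
    return w, a, bending

def warp(points, src, w, a):
    """Apply the fitted spline to arbitrary points."""
    points = np.asarray(points, dtype=float)
    src = np.asarray(src, dtype=float)
    U = _kernel(np.linalg.norm(points[:, None, :] - src[None, :, :], axis=2))
    return np.hstack([np.ones((len(points), 1)), points]) @ a + U @ w

# A 3x3 grid of landmarks, mapped once by a pure affine map and once with a bump.
xs, ys = np.meshgrid(np.arange(3.0), np.arange(3.0))
src = np.column_stack([xs.ravel(), ys.ravel()])
affine_dst = src @ np.array([[1.2, 0.1], [-0.2, 0.9]]) + np.array([0.5, -0.3])
bent_dst = src.copy()
bent_dst[4] += np.array([0.3, 0.2])            # push the center landmark

w_aff, a_aff, e_aff = fit_tps(src, affine_dst)
w_bent, a_bent, e_bent = fit_tps(src, bent_dst)
print(e_aff, e_bent)
```

An affine map costs essentially no bending energy, while a local bump does. That is one way a cost function can penalize non-smooth transformations and so give a graded notion of "similar but not the same".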

Ronald Coifman
"Mathematical Challenges for Real-Time Analysis of Imaging Data"

Ronald Coifman joined the mathematics and computer science faculty at Yale University in 1980. His research interests include nonlinear Fourier analysis, wavelet theory, singular integrals, numerical analysis and scattering theory, real and complex analysis, and new mathematical tools for efficient computation and transcriptions of physical data, with applications to numerical analysis, feature extraction recognition, and denoising. He is currently developing analysis tools for spectrometric diagnostics and hyperspectral imaging.

Dr. Coifman is a member of the American Academy of Arts and Sciences, the Connecticut Academy of Science and Engineering, and the National Academy of Sciences. He is a recipient of the 1996 DARPA Sustained Excellence Award, the 1996 Connecticut Science Medal, the 1999 Pioneer Award of the International Society for Industrial and Applied Science, and the 1999 National Medal of Science. Dr. Coifman is the Phillips Professor of Mathematics at Yale. He received his Licence en Sciences Mathématiques in 1962 and his Ph.D. from the University of Geneva in 1965.

DR. LENCZOWSKI: While we do the conversion to the next speaker's presentation, let me introduce Dr. Coifman. He is the professor of -- the Phillips Professor of Mathematics at Yale University. He received his Ph.D. from the University of Geneva in 1965. Prior to coming to Yale in 1980, he was a professor at Washington University in St. Louis, one of my alma maters. So, great regard for his work.

Professor Coifman is currently leading a research program to develop new mathematical tools for the efficient transcription of physical data with applications to feature extraction, recognition and denoising. He was chairman of the Yale Mathematics Department from 1986 until 1989.

Today his topic, the mathematical challenges for real-time analysis of imaging data, is caught in the first statement of his abstract, where he points out that the range of tasks to be achieved by imaging systems designed to monitor, detect and track objects requires a performance comparable to the perception capabilities of an eye. Such vision systems are still beyond our capabilities.

A couple of days ago, I had a couple of contractors come in and we were doing a briefing and describing the fact that despite all the work we have done

288 with knowledge management and knowledge engineering and artificial intelligence, that to date, even best computer is only comparable to the brain of a cat. So, clearly, you have the challenge. DR. COIFMAN: I would like to actually describe an experiment that I am involved in but -- and raise and describe the various mathematical challenges that come along the way. You will see that most of what we discussed this morning about data mining occurs immediately in the context of imaging and understanding imaging and trying to extract features from the images. Images you might think of are just what the camera gets but they are really as we have seen in the preceding talk are really quite a bit more complex. So, as we be able to do with track them, monitor what we do with our heard, I mean, what we really want to an imaging system is to detect objects, for objects and basically we want to do eye. Unfortunately, a camera is not an eye, as I said. A camera measures something for each individual pixel. Then you hand it on to a computer and the computer would -- hopefully you will have a decent algorithm to get something out of it. This is really not what we do with our eyes and I will indicate a system that allows you to resemble more to 288

289 an eye than to a camera and the system is basically mathematics converted to hardware. That is, I think, another challenge that will happen. I think that future devices will have the mathematics much more integrated in them because of existence of new technology like MEMS(?), which allows you to control on a micro level analogue data. So, we will see. We will get to that. The main obstruction to everything is what we heard this morning, but I would like to maybe describe a little bit the mathematical problems or basically our inadequacy to handle the kind of data streams that we are collecting. So, problem No. ~ is we really don't have any tools to describe in any efficient fashion high dimension geometric structures. So, we heard this morning about having a variety of parameters, which have to be looked at as a multivariate system of parameters and then we want to have a probability distribution describing probabilities of various events, but we have no tool to describe the actual support of those probability distributions or just have an efficient way of describing the geometry underlying that. In fact, we don't even know what the word "geometry" means. You know, just a few years ago it was Komolgorov, who made the observation that there are other 289

290 objects in traditional geometries, which are very useful. In fact, I think our task is really to follow that path and identify structures as they arise in nature as we confront them and the structures are not that simple. The second problem -- the second issue is that it might seem a little silly, but we really -- all the algorithms that we have are really exponential in their dimension. So, if you are already in Dimension 10, you are dead. We deal with things, which are much higher . . dimensions. Even simpler than that, we don't know how to approximate ordinary functions, which are given to us as empirically, as functions of some parameters. So, to give an example, you have a blood test. You will get ten chemicals being measured and you want to approximate the function, which tells you the state of your health in any sort of way. I am not aware of any mathematical algorithm capable of doing this efficiently and giving you any sort of information. What we need is, of course, structuring clusters, a structural clustering tool. We have to know what the word "structure" means and I will try and indicate how we get into that. Again, what I just said, in notions of geometric structures are really limited. Our imagination 290

291 is rather dull and we have to have nature sort of inspire us . Last but not least, probably the most important part of this at this point in terms of practical application in order to take those ideas and have them work is we need to have efficient digital representations of the massive data sets that we have been talking about. So, what I would like to -- there is just one idea I want to convey today in some sense. That compression, which is basically an efficient representation of data is a key ingredient in everything that one wants to do. I mean, you may think of compression as something that multimedia people do in order to transmit something on the line, but actually compression is basically an outcome razor-type principle. If you can compress and describe data very efficiently, you have a model. Having a model means that you basically are way ahead of this. So, a good test is actually -- so, in many cases, right, the fact that we cannot store the data is really an indication that we are just -- we just don't understand it. It is a real paradigm -- it is a real guidance. Over the last few years we have seen a lot of -- there have been many developments about trying to represent objects with minimum complexity. 291

David Donoho recently wrote a beautiful paper showing that if you have a minimum complexity description of a set of data, it provides you essentially an optimal statistical estimator. So, the two things are tied. Good compression and denoising are very tied together. Noise is a generic problem everywhere. The problem is that, of course, these are just nice goals. We need to know how to reach them. So, the next few slides, some of you may have seen them, but they sort of fit the themes that we are describing here. It is basically a way of taking a complex structure and trying to orchestrate it the way you would orchestrate a piece of music, as a collection of scores, which describe the effect of each instrument. Here, the structure is an image. So, you may have seen this. One of my favorites. Here is a complex image, which has a lot of stuff in it, and just trying to compress it immediately unravels it as a combination of different structures, which are layered one on top of the other. This is just compressing it by basically taking one-third of 1 percent of the number of parameters, and what you see happen is that it is blurred. It looks like a watercolor painting of the same thing.
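The tie between compression and denoising described above can be illustrated with a small sketch. This is an illustration under stated assumptions, not the method used in the talk: it assumes an orthonormal Haar wavelet transform on a one-dimensional signal (the transcript does not say which transform produced the slides), and the function names `haar_forward`, `haar_inverse`, and `keep_largest` are hypothetical. The alternating noise is chosen so that it lies entirely in the finest detail coefficients, which makes the effect exact here; real noise would only be reduced, not removed.

```python
import math

def haar_forward(x):
    """Orthonormal Haar transform; len(x) must be a power of two."""
    c = list(x)
    n = len(c)
    while n > 1:
        half = n // 2
        averages = [(c[2 * i] + c[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        details = [(c[2 * i] - c[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        c[:n] = averages + details  # coarse part first, details after it
        n = half
    return c

def haar_inverse(coeffs):
    """Invert haar_forward by undoing one averaging level at a time."""
    c = list(coeffs)
    n = 1
    while n < len(c):
        merged = []
        for a, d in zip(c[:n], c[n:2 * n]):
            merged += [(a + d) / math.sqrt(2), (a - d) / math.sqrt(2)]
        c[:2 * n] = merged
        n *= 2
    return c

def keep_largest(coeffs, k):
    """Zero all but the k largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:k])
    return [v if i in kept else 0.0 for i, v in enumerate(coeffs)]

# A step signal plus a small alternating disturbance (assumed test data).
clean = [4.0] * 4 + [0.0] * 4
noisy = [v + (0.1 if i % 2 == 0 else -0.1) for i, v in enumerate(clean)]

# Describe the noisy signal with only 2 of its 8 coefficients.
compressed = haar_inverse(keep_largest(haar_forward(noisy), 2))

error_noisy = max(abs(a - b) for a, b in zip(noisy, clean))
error_compressed = max(abs(a - b) for a, b in zip(compressed, clean))
print(error_noisy, error_compressed)
```

In this toy case the two retained numbers are a short description of the step, and the description is also closer to the clean signal than the measurement was, which is the sense in which an efficient description acts as a model.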

All of the hairy features have disappeared, but all the coloring information is there, which may be useful for a given task. Say, for example, the color might be all we need in order to determine the age or the state of health of this particular mandrill, and then that is fine. We don't need the rest of the stuff. This is the residual between the original image and the one that we had. So, it looks like a lousy impressionistic thing. So, we took this residual and now we compress it by taking the best description mode of it. Now you would have a Van Gogh type impressionistic image. Right? It means it has been synthesized with brush strokes, essentially. The other one was watercolors. But bear in mind there is no human intervention in this business. And this is a residual, okay. So, the original image is the sum of those three layers set upon each other, each one synthesized with a different instrument, and it is completely like an orchestration for music. Each musical score corresponds to an instrument. The score is given by a sort of alphabet or language that describes it. Exactly the same thing has happened here.
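The layering described above, compress, take the residual, compress the residual with a different "instrument", and keep what is left, can be sketched on a one-dimensional signal. Everything specific here is an assumption made for illustration: the first instrument is an orthonormal cosine transform (smooth "watercolor" shapes), the second is the plain sample basis (isolated "brush strokes"), and the names `dct`, `idct`, `keep_largest`, and `layered_decomposition` are hypothetical. The talk does not state which dictionaries were used for the image.

```python
import math

def dct(x):
    """Orthonormal DCT-II, written out directly (O(n^2), fine for a sketch)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n)
                               for t in range(n)))
    return out

def idct(c):
    """Inverse of dct; the transform is orthonormal, so this is its transpose."""
    n = len(c)
    out = []
    for t in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c[k] * math.cos(math.pi * (t + 0.5) * k / n)
        out.append(total)
    return out

def keep_largest(values, k):
    """Zero all but the k largest-magnitude entries."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    kept = set(order[:k])
    return [v if i in kept else 0.0 for i, v in enumerate(values)]

def layered_decomposition(signal, k_smooth, k_spike):
    """Return (smooth layer, spike layer, final residual); they sum to signal."""
    smooth = idct(keep_largest(dct(signal), k_smooth))
    residual = [s - a for s, a in zip(signal, smooth)]
    spikes = keep_largest(residual, k_spike)
    rest = [r - b for r, b in zip(residual, spikes)]
    return smooth, spikes, rest

# Assumed test signal: one smooth oscillation with a single sharp spike on top.
n = 16
signal = [3.0 * math.cos(math.pi * (t + 0.5) * 2 / n) + (5.0 if t == 7 else 0.0)
          for t in range(n)]
smooth, spikes, rest = layered_decomposition(signal, 1, 1)
rebuilt = [a + b + c for a, b, c in zip(smooth, spikes, rest)]
print(max(abs(a - b) for a, b in zip(rebuilt, signal)))
```

Each layer is described in its own alphabet with very few symbols, and the layers add back up to the original, which is the orchestration picture in miniature.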

Bear in mind this guiding principle that I have described is a principle of just complexity: the ability, with the tools we have at our disposal, to actually just compress this object to some accuracy, so to speak. It gives a pretty decent compression, much better than if you were to do it at once directly, but it is an indication that a lot of the things that we are trying to measure or assess are really the result of a superposition of structures, one on top of the other. If we had the ability to see that somehow, or to unravel it into those structures, in other words, lift it to higher dimensions in order to do that, we would really be doing a substantial job.

So, let's return to the eye topic. We heard in the preceding talk what kind of processing is involved and how difficult it is to actually do very simple operations that we don't even think about, like seeing that there is a face down here. So, how does one deal with that directly, without having to measure each individual pixel, make up an image, and then start to fool around with the pixels? In other words, we want to really proceed as an eye. The eye that I am involved in building at this point is an eye that should be superior to our eye. It is designed to not only see the colors of the image, but actually measure the electromagnetic spectrum, the absorption at maybe a hundred or 200 frequencies, the point being that this would allow you to detect a chemical plume. This would allow you to look at somebody if you want to tag or track him. You look at the spectral signature and you don't have to worry about shapes and stuff like that. Then you measure and you follow. We know that it is much harder to do any operation on a gray-level image than it is on a color image. If I want to discriminate objects by color, it is much easier. When I am looking for a green spot here, I find it on the spot. I don't have to measure whether there is green there or there or there. I can just look at it and my eye immediately goes to that location.

So, the question is how can we do that, and what is the role of the mathematician in doing that? Problem No. 1 is how does one integrate the processing with the actual measurement event? You don't want to measure each individual pixel. You just want to extract some, maybe potentially global, information and you want to analyze it in some fashion. The analogy, like we heard this morning, is that if I am looking for a needle in a haystack and I have a good magnet, I bypass a lot of image processing. So, here the magnet is to do the spectroscopy, in some sense, and just look at the color cubes. I will indicate how you integrate the computation, or the mathematics, into the measurement.

So, first of all, measuring the spectrum, and here is an example of spectra, gives very important information about the material component or ingredient that you are looking at. You see, there is a camouflage net over there and grass, and you see that those curves are different. So, to each pixel you have now associated a full curve. So, now we are getting into the data mining glut here. It is a vector, which may have hundreds of components. By the way, everything I am telling you has been around for 50 years. Hyperspectral imaging has been around. The reason it is not commonly used by everybody here is mathematics and data glut. You can't deal with the monstrous amounts of measurements that you are capable of making. For each pixel, you may have hundreds of data points. So, it does give you a beautiful signature, but if I am looking in a scene for some chemical spill, or maybe for the presence of some camouflaged object or somebody, which is really different from the environment, or maybe it is melanoma on my skin, then the issue is how do I find this on the fly, so to speak. I will give you a recipe in a second.

By the way, this is something that the people at the U.S. Geological Survey are doing. This is an image of the Nevada desert. What you really want to do is find out what minerals are present at various locations. What you want is to get this image, which tells you exactly what is present at various places. Unfortunately, this takes a long time to do. You can collect an image and it can take you days or months to actually get this kind of analysis, depending on the image. You may want to look at vegetation, see if it comes here or not, or you may look at tissue and see if it helps. Or you may want to track somebody and discriminate that somebody from everybody else in the environment. So, please go to the next one.

The mathematics and the tools being used are very primitive. These are the first nine components of a hyperspectral image of the arm. What is being done here is an analysis of the maybe a hundred components that they measured, and then they display the first three components in red, green, and blue, the next three in red, green, and blue, and the next three in red, green, and blue. So, you get those blurred images, which may or may not be useful. But you have now taken everything and mixed it all together. What most people do is use singular value decomposition and use that to find your coordinates, which is not really a particularly good tool for that kind of activity.

So, the camera we have is based on a digital mirror array. This is a device that Texas Instruments manufactures for projectors. It is an array that has half a million little mirrors. Each one of the mirrors has an on and an off position. It is usually used to project onto a screen by aiming the pixels that need to be aimed onto the screen, illuminating for a length of time corresponding to the intensity, and running a color -- what they have in their hands is a computer that they are using as a paperweight, essentially. This is just an example of how more smarts can use this device. So, you have an image here, which is this red plume over there and behind the missiles. You image the whole scene, if you wish, on this square array of mirrors and you ask yourself, is there any red spot in that region? So, the way it goes is you aim all the mirrors in the region, all of them together, into a spectral analyzer, which is also run by mirrors. So, because it found a red spot, the question is where did it find it. Well, it breaks the region into four quarters and checks each one of them, and then keeps on doing that, and in a few measurements it has the binary coordinates of the location, without having done a lot of measurements.

So, the next few slides will show you that we took a chemical plume and we tried to find it doing exactly that. This is just an image of the mirror system and this is the image of our camera. So, this is not just science fiction at this point. This shows how the spectral analyzer works. Light comes in and then goes through a prism or something. Different colors go to the mirror array up here. You can select any combination of colors for which you want to measure whether they are present or not, and then send them on to your detector. So, you both analyze spectrally and physically with the same thing.

So, this is a scene in which there is a chemical plume, in which we measured, for each individual pixel, whether we could detect a plume or not. But the level of the plume is so faint that the pixel was not capable of actually seeing anything. On the other hand, there is a method of multiplexing the light, called Hadamard multiplexing, which allows you to actually detect the plume. As you can see, it is quite noisy and the plume is rather faint, but in order to achieve that, we needed to do 250,000 measurements because that is how many pixels you have. If you follow the scheme I just described, well, in the white squares you find it. In the black ones, you don't. You keep doing it until you are finished. So, basically in 20 measurements you did it, as opposed to measuring the whole thing and then analyzing. First of all, measuring the whole thing was garbage anyway. It was below the threshold of the system. Secondly, we really needed to combine the intensity of light coming from different locations in order to get this thing to work. So, this is just a tiny example of what you can do with this combined spatial-spectral mix and the switching of the mirrors. So, the mirror array operates as a computer, basically doing mathematical operations on the light as it comes in, before it is converted to digital, as opposed to the usual way, which is you convert it to digital, you are dead, and that is it, basically because you have too much data.

PARTICIPANT: [Comment off microphone.]

DR. COIFMAN: Yes, I mean, this particular system of GI would go up to 2 1/2 microns. That is because they have a cover plate. If we change the cover plate, you can go all the way up. The cover plate is not letting the light go through it. So, the point being that I think this is going to be a paradigm for a lot of the sensing systems that are coming, and, in fact, the mathematics people at DARPA have a serious program pushing it forward. That is, if you want to reduce the data flow that comes into a system, you really have to do it as you measure it, which is really what a biological system is doing. We never collect everything and then process it. This sort of forces the issue, because if you wanted to do this in video, for a real-time thing, you are going to collect for each image several megabytes of data if you are lucky, if you have already reduced it. It just won't flow through the system. It doesn't really matter. The transmission is a problem. The storage is a problem. Computation is also a problem.

So, in all of that, where does mathematics enter? It enters in doing this sort of data mining of the spectra, which is very similar to the data mining occurring, say, in gene expression, for example. It is the same kind of activity if there is a dictionary going back and forth between the two activities. The issue is understanding the internal geometries of the spectra as they relate to the spatial distribution. This is critical if you are looking at objects of a given size, say an artificial object in the natural environment, or if you are looking at somebody, for example, who has makeup on his face. You will be able to immediately recognize that, or basically you want to just track an object. Again, there are lots of theoretical issues and a lot of practical issues, and I don't think you can separate one from the other.

[Applause.]
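The quadrant search described in the talk, pooling the light from a whole block of mirrors into one detector and then narrowing down by quarters, can be simulated in a few lines. This is a sketch under simplifying assumptions that are not in the talk: a square scene whose side is a power of two, a single bright target, and a noise-free pooled detector. The names `pooled_measurement` and `locate_target` and the `threshold` parameter are hypothetical.

```python
def pooled_measurement(scene, r0, r1, c0, c1):
    """Simulated detector reading when every mirror in the block is aimed at it."""
    return sum(scene[r][c] for r in range(r0, r1) for c in range(c0, c1))

def locate_target(scene, threshold=0.5):
    """Find one bright pixel with pooled measurements; returns (position, count)."""
    size = len(scene)
    count = 1  # first measurement: is there anything in the whole scene?
    if pooled_measurement(scene, 0, size, 0, size) < threshold:
        return None, count
    r0, c0 = 0, 0
    while size > 1:
        half = size // 2
        quadrants = [(r0, c0), (r0, c0 + half), (r0 + half, c0), (r0 + half, c0 + half)]
        # Test three quadrants; if none responds, the target is in the fourth.
        for qr, qc in quadrants[:3]:
            count += 1
            if pooled_measurement(scene, qr, qr + half, qc, qc + half) >= threshold:
                r0, c0 = qr, qc
                break
        else:
            r0, c0 = quadrants[3]
        size = half
    return (r0, c0), count

# Assumed test scene: 64 x 64 pixels with one target.
side = 64
scene = [[0.0] * side for _ in range(side)]
scene[37][5] = 1.0
position, measurements = locate_target(scene)
print(position, measurements, "pooled measurements instead of", side * side)
```

Under these assumptions a 512 by 512 array needs at most 1 + 3 x 9 = 28 pooled measurements, against roughly 262,000 single-pixel ones. The exact count depends on how the quarters are tested, and a real detector would also need the noise handling that this sketch leaves out.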

Mathematical Challenges for Real-Time Analysis of Imaging Data
Ronald Coifman

Researchers would like to develop an imaging system that will detect objects and track them, not pixel by pixel, as in a camera, but how we do it with our eye. One challenge includes the fact that researchers don't have tools to describe, in any efficient fashion, high-dimensional geometric structures. In addition, they do not know how to approximate ordinary functions, which are given to us empirically, as functions of some parameters. Also, in order to make data-mining techniques work, researchers need to have efficient digital representations of mass data sets. Dr. Coifman is currently involved in building an eye not only to see the colors of the image but to actually measure the electromagnetic spectrum, which could enable the detection of, say, a chemical spill, a melanoma on the skin, or the presence of particular vegetation in the Nevada desert. Looking at the spectral signature eliminates the need to look for shapes, and we can perform mathematical operations on light as it comes in as opposed to storing the information in digital form. With too much data, transmission is a problem, storage is a problem, and computation is a problem. For that reason, the data-mining of the spectra, which is very similar to the data-mining occurring, say, in gene expression, is going to be a paradigm for a lot of the sensing systems that are coming.
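One concrete instance of performing mathematical operations on light before it is digitized is the Hadamard multiplexing mentioned in the transcript. The sketch below is a simplified illustration, not a description of the actual instrument. It assumes a one-dimensional scene, +1/-1 weighting patterns (physical on/off mirrors would use 0/1 patterns instead), and detector noise that is independent of how much light is pooled. The names `hadamard`, `multiplexed_estimate`, and `rms` are hypothetical.

```python
import math
import random

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix; n must be a power of two."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def multiplexed_estimate(scene, noise):
    """Measure n weighted sums of the scene, add detector noise, then invert."""
    n = len(scene)
    h = hadamard(n)
    # Each reading pools every pixel with a +1 or -1 weight.
    readings = [sum(h[i][j] * scene[j] for j in range(n)) + noise[i] for i in range(n)]
    # H^T H = n I, so inversion is a transpose followed by division by n.
    return [sum(h[i][j] * readings[i] for i in range(n)) / n for j in range(n)]

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

# Assumed test data: a faint plume buried in unit-variance detector noise.
random.seed(0)
n = 64
scene = [0.2 if 20 <= j < 28 else 0.0 for j in range(n)]
noise = [random.gauss(0.0, 1.0) for _ in range(n)]

direct = [s + e for s, e in zip(scene, noise)]   # one noisy reading per pixel
multiplexed = multiplexed_estimate(scene, noise)  # same number of readings, pooled

rms_direct = rms([d - s for d, s in zip(direct, scene)])
rms_multiplexed = rms([m - s for m, s in zip(multiplexed, scene)])
print(rms_direct, rms_multiplexed)
```

With the same noise on each reading, the pooled scheme reduces the error by a factor of the square root of the number of readings (8 for this 64-pixel example), which is why a plume below the single-pixel threshold can still become visible.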

Larry Rabiner
"Challenges in Speech Recognition"

Larry Rabiner was born in Brooklyn, New York, on September 28, 1943. He received an S.B. and an S.M. simultaneously in June 1964 and a Ph.D. in electrical engineering in June 1967, all from the Massachusetts Institute of Technology. From 1962 through 1964, Dr. Rabiner participated in the cooperative program in electrical engineering at AT&T Bell Laboratories. During this period he worked on designing digital circuitry, issues in military communications problems, and problems in binaural hearing. Dr. Rabiner joined AT&T Bell Labs in 1967 as a member of the technical staff. He was promoted to supervisor in 1972, department head in 1985, director in 1990, and functional vice president in 1995. He joined the newly created AT&T Labs in 1996 as director of the Speech and Image Processing Services Research Lab and was promoted to vice president of research in 1998, where he managed a broad research program in communications, computing, and information sciences technologies. Dr. Rabiner retired from AT&T at the end of March 2002 and is now a professor of electrical and computer engineering at Rutgers University and the associate director of the Center for Advanced Information Processing (CAIP) at Rutgers. Dr. Rabiner is coauthor of the books Theory and Application of Digital Signal Processing (Prentice-Hall, 1975), Digital Processing of Speech Signals (Prentice-Hall, 1978), Multirate Digital Signal Processing (Prentice-Hall, 1983), and Fundamentals of Speech Recognition (Prentice-Hall, 1993). Dr. Rabiner is a member of Eta Kappa Nu, Sigma Xi, Tau Beta Pi, the National Academy of Engineering, and the National Academy of Sciences and a fellow of the Acoustical Society of America, the IEEE, Bell Laboratories, and AT&T. He is a former president of the IEEE Acoustics, Speech and Signal Processing Society, a former vice president of the Acoustical Society of America, a former editor of the ASSP Transactions, and a former member of the IEEE Proceedings editorial board.

DR. LENCZOWSKI: The third speaker in our session this afternoon takes us beyond the issue of the visual to the issue of the sense of hearing. Dr. Rabiner is an individual who has an experience base in the private sector. He received his Ph.D. from the Massachusetts Institute of Technology in 1967, and he moved to the newly created AT&T Labs in 1996 as the director of the Speech and Image Processing Services Research Lab. He was promoted to vice president of research in 1998, where he managed a broad research program in communications, computing, and information science technologies. He retired from AT&T at the end of March 2002. He has pioneered a number of novel algorithms for digital filtering and digital spectral analysis, and in the area of speech processing he has made contributions to the fields of speech synthesis and speech recognition. I am not sure what many of your personal experiences are with speech recognition systems, but I recall being assigned to a lab in 1981, where we had the first system with a computer-based capability that was to allow us to do some speech recognition so that we could give commands to the system for some of our mensuration work. It was my experience that voice recognition did not work very effectively. In fact, the only time that I knew that we truly had voice recognition was whenever we were scheduling tours in the lab. Somehow the equipment always recognized the term "tour" and we invariably would have crashes. So, that is my background experience. Now, what Dr. Rabiner is going to talk about today is the maturing, in fact, of speech recognition to the point where it is now very widely applied to a broad range of applications. However, as he points out, although the technology is often good enough for many of the current applications, there remain key challenges in virtually every aspect of voice recognition that prevent this technology from being ubiquitous in every environment, for every speaker, and for even broader ranges of applications.

DR. RABINER: Thank you. You took away my entire introduction. Actually, the question that people ask me a lot, after 40 years of working in this field, is, you know, isn't this a solved problem? There is actually a system we put out in 1992 that does 1.2 billion recognitions a year for operator services, and it is essentially flawless. It is about 99.8 percent accurate, but it only works on a five-word vocabulary over the phone.

It actually works in the environments you are in. But the whole problem is that it is very task specific. It is very environment specific. It is very well trained to what it can do. It just can't get to where we want to go. So, what I am going to try to do is use the next half hour to tell you what the current challenges in speech recognition are.

Now, for the most part, the services people talk about are broader than just automatic speech recognition. So, I will concentrate on that first box in what we call the speech circle, but in order to maintain a real voice-enabled service with a customer, or with somebody who wants to use the system, you have really got to go around the whole circle. So, the first thing is, the customer speaks something and I should recognize the words. That is a tough problem and we will talk about what the challenges are there. But even when you get the answer, "I dialed the wrong number," in order to do something with that (this is a customer care application) you have to go to the next level, which is spoken language understanding, and get the meaning. Now, "I dialed the wrong number" might mean "I don't want to pay for that." But it might also mean "Could you please help me get to the number? Something is wrong or there is something going funny here. What do I do?" So, meaning is something that is completely unclear because the exact same input has lots of meanings. Once you try to guess at that meaning, you have got to go do dialogue management, which means take some action. So, you might have to do a database dip. You might have to go out and look at the person's records, and you have to do spoken language generation, create some message to keep this speech circle working. Then you might want to say, "What number were you trying to dial?" so I can give you credit so that when you dial that number you don't get charged twice, or I help you dial it. We use a synthesizer to speak it: "What number did you want to call?" And now you are in a loop. Hopefully, that entire speech circle will enable you to keep using the system, and these are the kinds of services people try to build today.

So, we are going to concentrate on that first box a lot because there really are a lot of challenges there. The speech recognition process, after a very long time, is very mathematically oriented. It starts off with your input speech, you know, 10 kilohertz, 16 kilohertz, whatever you want to have, and does three processing steps. The first one is feature analysis. You really don't want to process that entire 64 kilobit signal, since it is laden with redundancy and it has information that is useful to you as a human, you know, to talk to somebody, to have a conversation, but that has absolutely nothing to do with the meaning of the words. So, things like the pitch and your inflection and your emotion, which are very important for human communication, for recognition we try to eliminate because they are confusing variables. So, we do a mathematical analysis. We will talk about that. Then we do the heart of the work, statistical decoding, based on statistical models, and, of course, this is the box that has three attached models. Those attached models are the training environment and we will get into that. Then, finally, we do a little bit of cleanup and there is the classical computer science "hello world" program, you know. If you have got that one, you may even have a chance of going to a little more sophisticated one.

Let's look at each of the boxes there. The feature analysis box is all about going out and getting robust features, things that are invariant to the environment, to the noise, to the transmission, to the transducer used, to the background situation, and that still carry the relevant information. That is the spectral information that characterizes the sounds of speech, the phonemes, the context, et cetera. So, over the last 40 years we have learned how to do it through spectrum analysis. We are using something like linear predictive coding, which lets us derive parameters that we think are an extremely efficient representation of the signal. In fact, we code speech from 64 kilobits down to about 5 kilobits. Now, if I have to give you kind of a guidance point, your cell phone uses an 8 kilobit representation. So, it is below that, but it has got some of the same parameters that the cell phone uses for actually operating in the poor cellular environment -- but it has got about 30 or -- that you use in cell phones. So, it is a little bit higher rate.

The first challenge you come across is the fact that, as I said, there are at least a dozen major systems out there. You can call Charles Schwab and get stock price quotes. You can call AT&T's customer care, and I will show you, for example, that system. They all work nicely, but they are not robust.
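The linear predictive coding step mentioned above replaces a frame of speech samples with a handful of predictor coefficients. The sketch below shows one textbook way to compute them, the autocorrelation method solved with the Levinson-Durbin recursion. It is offered as an illustration only: the transcript does not say which variant, frame length, or model order the AT&T systems used, and the names `autocorrelation` and `levinson_durbin`, the order of 4, and the synthetic frame are all assumptions.

```python
import math

def autocorrelation(frame, max_lag):
    """r[lag] = sum over t of frame[t] * frame[t - lag], for lag = 0..max_lag."""
    return [sum(frame[t] * frame[t - lag] for t in range(lag, len(frame)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations so that x[t] is approximated by
    a[0]*x[t-1] + a[1]*x[t-2] + ...; returns (coefficients, prediction error)."""
    a = [0.0] * (order + 1)  # a[0] is unused; a[k] multiplies x[t-k]
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        k_i = (r[i] - sum(a[k] * r[i - k] for k in range(1, i))) / error
        updated = a[:]
        updated[i] = k_i
        for k in range(1, i):
            updated[k] = a[k] - k_i * a[i - k]
        a = updated
        error *= 1.0 - k_i * k_i
    return a[1:], error

# Assumed test frame: 200 samples of a two-tone signal standing in for speech.
frame = [math.sin(0.3 * t) + 0.5 * math.sin(1.1 * t + 0.4) for t in range(200)]
order = 4
r = autocorrelation(frame, order)
coefficients, error = levinson_durbin(r, order)
print(coefficients, error / r[0])
```

In this toy case 200 samples are summarized by 4 numbers plus an error energy, which gives a feel for how a representation of this kind can bring the bit rate down by an order of magnitude.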

They fall apart in noisy environments, when you have office noise, when you try to operate it in the middle of an airport lounge, when you are operating in your car at 60 miles an hour and you have road noise and you have all the background and the radio blaring. When you try to use it over a speaker phone or a cell phone, every system just about falls apart. Get somebody with an accent or a dialect or a speech defect, real problem. Of course, noise and echo environments. So, the first problem and the major problem is robustness.

You might say to yourself, why don't you make parameters that are robust to everything? What happens is then you have taken a distribution which is finely tuned and works real well and made it broad and it works terribly all the time. That is your choice. So, we have got to do something. The second thing, of course, is that the feature set is one that we devised probably about 10 to 15 years ago and maybe there is some place for it to go.

So, let me play a couple of examples. The mismatch between when you train the system, so you train a system in an environment for a set of speakers, for a set of inputs, et cetera, and when you test them, on the whole, people don't obey those rules. You know, they really want it to work where they want, the way they want it to work. I will show you examples of how it can result in performance degradation. So, the concept is let's try to make it a little bit more like humans would do it perceptually.

So, here comes a nice little model of what goes on. When you train your system, you had a training signal. People spoke in the environment they were going to use it in. You extracted your features and you built models. Those statistical models are really good representations of that data. Now, all of a sudden you hand that system off to somebody and they start speaking in a new environment and now the signal is mismatched to that signal. So, of course, what you are saying is, well, I want to adapt to the new environment and retrain or, after the fact, since I can't really do that in many cases, I have three paths I can try.

One is I can say let me try to enhance the signal so that even though you are in -- well, let's enhance it. Let's get rid of the noise or let's get rid of the reverberation or let's make it less sensitive. So, enhancement is, in a sense, making this signal's statistical characteristics look like the original one's. The second thing you can do is take the features and try to normalize them so that the average statistical properties of these features match those. And, finally, the last thing you can do is when you finally have this model, you can say this model is a bad mismatch. Let me try to go at it, adapt that model. All of those have been tried and clearly with varying degrees of success.

The acoustic model, though, now we are at the second stage and before -- there were three major models. There is one that is based on physical speech data, called the acoustic model, and two that are based on -- so, the acoustic model is basically how do you take frames, which come on a regular basis every 10 milliseconds, but sounds don't come regularly. Some sounds are long vowels, some sounds are very brief, like stop consonants, very short. So, we want to map the acoustic features, which occur on a regular frame-by-frame basis, into events. The events are sounds of the English language and you can think of these as being the phonemes of the language. Now, that is just a very simplified view, because we don't use phonemes. We use context dependent phonemes. Instead of having about 50 sounds, we have 50 times 50 times 50 sounds. That is where mathematics really is nice because you can handle 50 times 50 times 50 sounds. Computers are pretty fast and we have some pretty good techniques.
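The jump from about 50 sounds to 50 times 50 times 50 context dependent sounds can be illustrated with a minimal sketch. It assumes a hypothetical inventory size of exactly 50, a made-up phone sequence for the word "data", a `sil` boundary symbol, and a `left-center+right` naming scheme; none of these names come from the talk.

```python
# Minimal sketch: turning a phone sequence into context dependent units.
# The phone symbols, the "sil" boundary marker, and the left-center+right
# naming scheme are illustrative assumptions, not taken from the talk.

NUM_PHONES = 50                      # "about 50 sounds" in the talk
num_context_units = NUM_PHONES ** 3  # 50 times 50 times 50


def to_context_units(phones, boundary="sil"):
    """Label each phone with its left and right neighbor."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Hypothetical phone sequence for one pronunciation of "data".
data_units = to_context_units(["d", "ey", "t", "ax"])
print(num_context_units)
print(data_units)
```

Each unit carries its own statistical model, which is why the number of models grows with the cube of the inventory size.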

So, you characterize each of these sounds by the Markov model, which is a statistical method for characterizing the spectral properties of the signal and it is not the right model for speech, but it is so powerful that you can get into the model and represent virtually any distribution. Now, the problem, of course, is you are over-training it in some sense, which makes you non-robust. You are training it to a distribution, which it doesn't actually follow. But actually in that environment, that overtraining works extremely well. So, it is a powerful statistical model and, of course, it is a classification model. You give it only exemplars of the sound you are trying to train. So, it does a terrible job.

So, for example, let's say your vocabulary is the letters of the alphabet and you are spelling a name. So, you have a letter like "e" and, you know, it is a set of b, c, d, e, g, p, t, v, z. Listen over the phone and you are almost random. When I spell my name, I say R-a-b as in boy, because if I don't say that, it will come out R-a-v, R-a-g, R-a-d. It is whatever somebody hears because they are almost indistinguishable. So, that is one of the challenges: if you only train models on similar events, how does it ever know how to discriminate that model from other models that are in the system to give you better performance? Okay?

The next thing is the word lexicon. So, we have got sounds and now we want to form words because not every sound is followed by every other sound and we know that. We know there are only a certain number of consonants that can follow each other before a vowel has to appear. So, the concept of the word lexicon is it is a dictionary. How do you map legal phone sequences into words according to rules? So, a name like David, it is "da a va id a." That is kind of nice for a name like David. How about a word like d-a-t-a? Is it "day-ta" or is it "dat-a"? Completely different. So, you have to put in at least two representations. We all know of words with multiple -- "ee-ther," "eye-ther," et cetera. Lots of pronunciations. So, here is a challenge. We can pick up a dictionary, you know, take -- you know, use it all, but how do you generate a word lexicon automatically? What happens if there are scientific terms or there are new names that you haven't seen before? How do you edit various dialects and word pronunciations? Mathematically complicated problems.

The next level, the language model, is what word sequences are now valid in the task you are doing, the so-called language models. How do you map words and phrases into sentences that are valid, based on the syntax of the recognition task you are doing? So, there are two ways. You can hand craft them. So, you literally write out a graph of what is allowed and people did this for a very long time or, of course, you go back to statistics. So, you take, you know, a few hundred million words of text and build a trigram, an n-gram statistic of, you know, what is the probability of this word followed and preceded by other words and this word or combinations of words. Of course, the problem is that you get the text from somewhere and that assumes that people will speak according to that text model, much as we assume they would speak according to the acoustic model and of course there are problems of how you build a new one for a new task. Still real challenges there.

Finally comes the heart of the work, the place where mathematics has really helped a lot and that is the pattern classification. So, we have got these features coming in. We want to determine the proper sequence of sounds, words and the sentence that comes out. We want to combine all of those, those three models, to generate the optimal word sequence, the highest probability, the maximum likelihood event given a task, which defines the syntax, given a set of words that could be spoken, which is the lexicon, and given the sounds of the language. So, we have an -- search through every possible recognition choice. So, the real issue is how do you build an efficient structure, a finite state network for decoding and searching large vocabularies.

Let's assume we have a vocabulary of a million words, okay, with 50 times 50 times 50 units, with all of the complexity of the language model. So, we have features, cross units, cross phones, cross words, cross sentences and some of our big recognizers have 10 to the 22nd states that have to be searched. Well, that is difficult. The goal is to take this structure, which even sequentially you are going to 10 to the 22nd states. No one can do that, even with nice fast computers. So, we have very, very fast approximation algorithms, like the A* decoder or the fast linear search, but instead we figured out methods that can compile the network down from 10 to the 22nd to 10 to the 8th states. That is 14 orders of magnitude of efficiency. We don't know what the limit is. And I will show you some examples of what we have got.

So, the concept is for a 64,000 word Wall Street Journal vocabulary -- this is the one that went down -- 10 to the 22nd, 10 to the 8th. The way it does it is by using -- weighted finite state transducers at the HMM level for the sound, at the phone level, at the word level, at the phrase level; they all get combined. They get optimized. They get determinized and they get minimized using very, very fancy computational linguistic techniques and out comes a search network with significantly fewer states.

So, here is an example of just a simple word transducer for the word "data" and basically the transducer just has arcs, which are the sounds, and states where it winds up. So, from the "d" we go to an "ay" or an "a" with certain probabilities, with certain weights. These weights can be optimized and they can be minimized for the whole graph and the end result of it is that we have been looking at these techniques for the better part of six years and here is Moore's law, for those who are familiar with what -- you know, every 18 months, computational capability doubles. This is what we call Mohri's law, from the name of the person who has been spearheading the effort at AT&T, Mehryar Mohri, and he has doubled the performance of search every single year. So, it is almost a factor of 2 to the 5th in five years of searching. Using the best approximation techniques in the speech community, it is off by a factor of about 2 to 1.
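The word transducer for "data" described above can be pictured with a small sketch: states joined by arcs, each arc carrying a sound and a weight. The phone symbols, the state numbering, and the 0.6/0.4 split between the two pronunciations are invented for illustration. The sketch only enumerates paths; it does not perform the composition, determinization, or minimization the talk refers to.

```python
import math

# arcs[state] = list of (phone, next_state, probability).
# All symbols and probabilities below are illustrative assumptions.
ARCS = {
    0: [("d", 1, 1.0)],
    1: [("ey", 2, 0.6), ("ae", 2, 0.4)],  # two pronunciations of the first vowel
    2: [("t", 3, 1.0)],
    3: [("ax", 4, 1.0)],
}
FINAL_STATE = 4


def enumerate_paths(state=0, phones=(), cost=0.0):
    """Yield (phone sequence, negative log probability) for every full path."""
    if state == FINAL_STATE:
        yield list(phones), cost
        return
    for phone, next_state, prob in ARCS.get(state, []):
        # Weights are added in the negative log domain, so lower cost is better.
        yield from enumerate_paths(next_state, phones + (phone,), cost - math.log(prob))


paths = sorted(enumerate_paths(), key=lambda item: item[1])
best_phones, best_cost = paths[0]
print(best_phones, math.exp(-best_cost))
```

In a full recognizer the same arc-and-weight representation would be used at every level (sounds, phones, words, phrases), which is what allows the levels to be combined into a single search network.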

So, the community has done pretty well, too. So, another interesting problem.

Finally, the last stage in recognition, utterance verification. The concept here is that you are asking the machine to give you the best sequence of words according to a grammar and a task. The question is how do you know if that sequence of words is reasonable or the right ones. So, we do what we call utterance verification. So, somebody says "credit please" in a somewhat noisy environment. The machine finds an equally valid sentence, "credit fees." When you are all done, you can ask the machine to tell you what is the likelihood or the confidence score of the word "credit" and it is very high. That is great. But "fees" was not. It was certainly better than "please." So, you would take that one and say I am not so sure that is the output. Okay. You can reject it or you could go out and ask for some clarification.

So, here are some examples. Somebody -- a baby cries in the background of this one. Those are both in this database, which I will show you the result on in a minute. So, the rejection problem is how do you extract -- how do you eliminate extraneous acoustic events, noise, background, get a confidence measure and you do the best you can with it.

So, what is the performance? I have three databases we have used over the last 15 years for just connected digit recognition, simple cases; you know, reading a telephone number or reading a calling card number or a credit card number. The first one is recorded by TI in a really antiseptic environment and the word error rate for 11 digits, 0 through 9, plus "oh," is 3/10ths of a percent. We have another one in a customer care application where you can speak freely and during it, people will list numbers, you know, like they want credit for numbers and there the word error rate is 5 percent. That is a 17 to 1 increase in digit error rate. It is a robustness problem. We are using pretty much the same models, but the good news as you work your way up is we can do a thousand word vocabulary that is -- for a number of years around 2 percent when they stopped the work at a thousand. Even as you work your way up to reading spontaneous speech for airline travel information, reading text from Wall Street Journal and business publications, broadcast news, listening to anchors on the radio and translating them, Switchboard is sort of what you think of, literally listening to conversations off of the telephone switchboard. These are cases of people literally calling home.

Now, the error rates certainly go way, way up. You say to yourself, my God, 40 percent error rate. It doesn't work. Well, we have a system I am going to show you later that started off with a 50 percent word error rate and the task error rate was virtually zero. Okay? Because it all depends on what words it kind of doesn't do too well on.

Here is the connected digit. That is 3/10ths of a percent. It is clean. It is beautiful speech. Here is North American Business News. Here is some from the Switchboard. Very folksy, but, you know. Here is an example of -- this one is hard to see, I guess, because it is not blown up enough. But fundamentally today we can operate 50, 60, 70,000 word vocabularies in essentially real time as you see there.

What we have been able to show as a result of all of the mathematics we have applied and all the signal processing is that whatever task we look at, we can make its word accuracy start approaching natural performance, but only approaching it. We haven't gotten there. We handle vocabularies -- in 1970, there was a two word vocabulary with a yes/no application. Today we are handling vocabularies in excess of a million words, a million names, et cetera. So, there is no limitation of that form at all.

But if you want to kind of figure out how we are doing compared to human performance, here is the range where machines outperform humans and you notice there are not many data points in this one, practically none. Within the 10 to 1 range, there is only a couple that get on the edge of 10 to 1. Then digit recognition, here is digit recognition, the humans outperformed the machine by a factor of about 30 or 40 to 1, including those great error rates. Humans do real -- Remember, we used to have a lot of operator services. We had operators for a long time who got their pay based on never misrecognizing the telephone numbers because that would annoy a customer. The best ones, if you look at what those are, like The Wall Street Journal, is within about a factor of 10. That is because humans don't -- you know, they do well, but it is deadly dull stuff and it is not the kind of thing you really want to translate.

So, let me take the last five or so minutes on the rest of it. Spoken language understanding is the next stage. Interpret the meaning of words and phrases and take action. The key thing here is that you can get accurate understanding without correctly recognizing every word.
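The gap between word error rate and task error rate can be made concrete with a small sketch. Word error rate is computed here as word-level edit distance divided by the number of reference words, which is the usual definition; the example sentences, the keyword table, and the call-type names are hypothetical and are not taken from any AT&T system.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(substitution, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[-1][-1] / len(ref)


# Hypothetical salient words and the call types they point to.
SALIENT_WORDS = {"balance": "ACCOUNT_BALANCE", "bill": "BILLING", "plan": "CALLING_PLANS"}


def route_call(hypothesis):
    """Pick a call type from the first salient word that was recognized."""
    for word in hypothesis.split():
        if word in SALIENT_WORDS:
            return SALIENT_WORDS[word]
    return "OPERATOR"


reference = "i want to check the balance on my account"
hypothesis = "i won to chuck a balance in my count"
wer = word_error_rate(reference, hypothesis)
print(round(wer, 2), route_call(hypothesis))
```

More than half the words in this made-up hypothesis are wrong, yet the one word that carries the meaning survives, so the task-level decision is still correct.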

Okay? So, I am going to show an application where we started with a 50 percent word error rate, but almost always got the key words that told us what the meaning was. That we actually have as a nationwide service. It is called the "How May I Help You" service. It is AT&T customer care for its long distance. It is operating nationwide today. Also spoken -- makes it possible to let a customer speak naturally, which is why we don't understand all the words. We don't even know what words they are going to say. We just happen to know that certain words key certain things to happen and we do really, really well on those words. So, we know the ones that key the things. So, we find what are the salient words and phrases. They map high information word sequences to meaning and we do that really, really well. We do performance evaluation and we do these operator assisted tasks. They are real hard problems. There are challenges there.

The next phase, of course, is that, in order to do spoken language, you need dialogue management and generation and that is -- the goal there is determine the meaning and what you could do based on the whole interaction, not just on that local loop. I really should have made that clear, that what you say depends on how often you have gone through the loop. The methodology is you have a model of dialogue to guide the dialogue forward to a clear and well-understood goal and, obviously, evaluation of speed and accuracy of obtaining the goal, like booking an airline reservation, renting a car, purchasing a stock, obtaining help, and we use that internally.

The challenge here is how do you get a science of dialogue. How do you keep it efficient in terms of -- how do you obtain goals, get answers and is there an -- how does the user interface play into the art and science of dialogue. We will see that there are examples where it is better to use multimodal interactions. Sometimes it is better to point than it is to --

Here is an example of a before and after the system. This was our customer care, touch tone hell system, as we call it. I will play a little bit of this. I am not going to let you go through this. It took two minutes and 55 seconds to get to a reverse directory assistance. When we started playing with -- 28 seconds right there all the way.

So, the way this thing works -- we call it the "How May I Help You." The prompt is simple. You can say anything you want. If you say pure gibberish, you are going to get pure gibberish out. From listening to about two or three million conversations between humans and our real attendants, we figured out that people call us to find out about account balances or about our calling plans or local plans or if there is an unrecognized number on their bill or whatever. There are between 15 and 50 things and guess what? It is easier to map all of those two million and figure out what words they say that cues you into what they wanted. Okay? So, we made it work.

So, here is -- I am going to play one example from our field test labs here. This guy was irate. He bleeped out in the beginning and in the end he got exactly what he wanted. It really worked well. The system has been in use nationwide since last November, a 37 percent decrease in repeat calls from customers who weren't happy and a 78 percent decrease in customer complaints. Customers love this system.

[Multiple discussions.]

Here is a little example of how you might do it on the web-based customer care agent. This is basically how you would use multimodal to basically get you in situations where it might be -- the concept is that it is sometimes better actually to point and sometimes it is better to speak and the combination is a more powerful paradigm.

We will play a few examples of things that were generated totally from text, never spoken by a human. That is not a human speaking. That is a machine. Now, here is some Spanish, same text. Now, for those who prefer a more natural face, this is the exact same synthesis lined up with the face. Now, this last example is a broad band one where we use 16 kilohertz rather than --

So, again, in 40 years, what have we done? We have put out a lot of services. There are desktop applications like dictation and command and control of the desktop and control of document properties, stock quotes and traffic reports and weather and voice dialing. There are 20 million cell phones that are voice enabled today, access to messaging, access to calendars. Voice works as a concept of converting any web page to a voice enabled site. So, you can answer any question on line and -- other key protocols into making this happen. Telematics, which is command and control of automotive features, not the car, but, you know, the comfort system, the radio, the windows, the sun roof and finally smart devices, PDAs and cell phones.

So, we have come a long way. Mathematics has driven us from small vocabulary to medium vocabulary in the sixties to -- and we got to the mathematics and statistical based in the seventies and early eighties. We started going from tens and hundreds of words to tens of thousands and hundreds of thousands and eventually a million words in the early nineties. We had multimodal and dialogue in this time frame and finally the last one as we go forward. The dialogue systems are going to be very large vocabulary, limited task and control. That is where we are today. In three years, we are going to be able to go through an arbitrary environment. We are going to have the robustness problem pretty much in hand and in about another three or four years after that, we will go to unlimited tasks. So, mathematics has played a key role in that 40 year evolution. Got a long way to go. Thanks.

[Applause.]

PARTICIPANT: Why did it take 40 years?

DR. RABINER: First of all, Moore's law is the key thing. Okay? Until you get that computation, you know, I run the starting and, you know, a two word vocabulary in 1980, we had a real application and we had to build specialized hardware for it on an operator handset. Okay? Today, you know, we can do in real time on your basic PC hundreds of thousands of words. It is a big difference when the computation and the storage is not --

PARTICIPANT: Did you in 1980 or -- did you really know exactly what you wanted to do but you couldn't do it fast enough or you didn't have the right -- the conceptual framework, how did it evolve?

DR. RABINER: You know, it is a chicken and an egg thing. In 1980, we had a sense of what we wanted to do, but it was computationally infeasible. You learn what works and then as machines get faster, you -- gee, that is not exactly what I want to do. Do I want to do mixture density? Do I want to play games with -- we had a sense of what we wanted to do but not the power to do it.

PARTICIPANT: How did this know-how occur that you have obtained?

DR. RABINER: Carnegie Mellon particularly made their same -- they were one of the key pioneers in this. In England, Cambridge University -- and they made that public. They first were going to sell it. Not many people wanted to buy it. So, they made it open source, public domain. So, anybody, any university today can go out and just get it all for research purposes only and use it. So, in a sense it is the most open community you can imagine. DARPA also helped to make it open because -- almost all of the projects over the last decade have been based on DARPA-sponsored initiatives. They would issue a challenge. All of the DARPA work is open.

MR. AGRAWAL: [Comment off microphone.]

DR. RABINER: There is nothing inhibiting us. I mean, I think we know what to do. Okay? One of the things that I didn't mention is a challenge. In synthesis, the reason synthesis -- what we realized is that if you really want to make synthesis good, you know the easiest way to do it? Record a trillion sentences that somebody says and decompose them into their basic units and store them. It was inconceivable when this was proposed in the 1980s. We store them all and we have extremely fast retrieval. We literally look through every one of them and say what is the best unit, in combination with every other best unit and -- in recognition there ought to be -- and that is called data driven. So, obviously, in theory, we actually store about 50 megabytes worth. But, you know, that is just the compression technology. Get that same database. Store across a few thousand people. Just do recognition by just looking up the units. That is one of the things we would love to do. We haven't actually done it. Why haven't we done it? Because the stuff works so well in limited environments. Synthesis has a problem because it never worked well. It always was intelligible but it sounded like -- we used to call it non-customer quality. If you listen for 20 minutes, you -- asleep. It would put you to sleep. But when you do this, you can listen to that forever. You know, it is absolutely perfect.

PARTICIPANT: Is there any intelligence that can be placed on the hand set that the receiver can be trained so that the recognition -- the server area can be --

DR. RABINER: Certainly in cell phone technology, there is a standard called Aurora. The goal is to put intelligence in the hand set, so you can make speech recognition in cell phones much less dependent on the transmission you use. All of the robustness techniques have that capability.

PARTICIPANT: It seems like the next doable challenge will be probably to do real time translation because it seems like you have --

DR. RABINER: Real time translation is a much harder problem. It is a much harder problem because even if you do the recognition in one language really well, the text to text translation itself is a real problem. So, there has been a lot of studying on how do you do text to text translation. There are old classic examples of -- in time, you know, idiomatic sentences that are almost impossible -- but the answer is there are limited task domains. We have done it with Japanese and with Chinese, English. Certainly, translation is probably about a decade away because of the text to text problems, not because of the recognition or the synthesis.

PARTICIPANT: [Comment off microphone.]

DR. RABINER: You are in an environment right now -- again, if -- piece of cake. Why not?

PARTICIPANT: [Comment off microphone.]

DR. RABINER: There are people who practice that art and who do it really very well. There are certain pieces of things -- user interface is an art. Dialogue management is an art. So, we have to make it into a science.

PARTICIPANT: -- speaker recognition?

DR. RABINER: That is a completely different topic. Okay? Speaker recognition is something that will probably never be done. Speaker verification is what you really want to ask about. Okay?

The answer is -- obviously, you can do it with some performance -- whether you use retinal scan, whether you use fingerprints, all of these have various -- false alarm rates. The best systems -- achieve 98, 99 percent verification accuracy with something on the order of a half to 1 percent false rejection. Okay? Is that good enough? Well, certainly it screens out some things. It is very easy to go at in some systems and just record your voice. You can't use a fixed, you know, verification phrase. You are dead as soon as somebody records you. You have to change it all the time and once you change it, that performance starts edging down a little bit because you are a variable yourself. But the technology is not good enough that you can say I am going to guarantee to keep all these bad guys -- but it is a pretty good deterrent technology.

MS. CHAYES: Why won't it be done? I mean, we can do it ourselves, right? You hear a song, you know who is singing it. You mean it won't be done on a short time scale or it won't be done on a --

DR. RABINER: Can you listen to somebody you have never heard before, listen to them two or three or four times and then -- I will let you listen to 30 people you have never heard before five times each and then a week later I am going to give you one of those 30 and then --

MS. CHAYES: The computer has a better memory than I do.

DR. RABINER: Not much.

MR. MC CURLEY: [Comment off microphone.]

DR. RABINER: Right now, the whole concept of this broadcast news is to say can you record all of these and use that as a searching index.

MR. MC CURLEY: I guess my question was would you convert it into text or would you convert it into something else.

DR. RABINER: No, you convert it into text. Okay? What it would do is it would do real well on the significant words. Okay? So, if you listen to that North American Business News or whatever, the key words were all perfect. "For" or "it" or "this," it gets wrong a lot of the time. So, the bottom line to this is you would align that text with the speech. Find all occurrences of Saddam Hussein. Okay? And guess what? You will find most of them absolutely perfectly. It will miss a few. And it will have a couple of false alarms, but you can check. You click on each one. That is not what I want. That is not what I want.
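The idea of aligning recognized text with the speech and then searching it can be sketched as follows. It assumes the recognizer produces one record per word with a start time and a confidence score; the record layout, the field names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    text: str          # recognized word
    start_sec: float   # where the word starts in the audio
    confidence: float  # recognizer confidence, between 0 and 1


def find_phrase(transcript, phrase):
    """Return (start time, lowest word confidence) for each occurrence of phrase."""
    target = phrase.lower().split()
    hits = []
    for i in range(len(transcript) - len(target) + 1):
        window = transcript[i:i + len(target)]
        if [w.text.lower() for w in window] == target:
            hits.append((window[0].start_sec, min(w.confidence for w in window)))
    return hits


# Made-up recognizer output for a short recording.
transcript = [
    AlignedWord("north", 0.4, 0.9), AlignedWord("american", 0.8, 0.9),
    AlignedWord("business", 1.2, 0.95), AlignedWord("news", 1.7, 0.9),
    AlignedWord("for", 2.1, 0.3), AlignedWord("today", 2.3, 0.8),
    AlignedWord("business", 7.5, 0.55), AlignedWord("news", 8.0, 0.6),
]
hits = find_phrase(transcript, "business news")
print(hits)
```

Each hit carries a time offset, so a listener can jump straight to that point in the recording and discard the false alarms by ear, as the speaker describes.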

So, it is not perfect, but it gets you a long, long way to where you want to be. You come back and you have 30 voice mail messages. You don't go through them sequentially. Okay? We have a system which we call ScanMail. It was actually written up in The New York Times yesterday. ScanMail converts every one of your voice messages to text and gives you a chance to scan it as though it is text mail. More than that, it is really, really good at finding numbers, call me back at 917-3211-2217. It will find that and it aligns it with it. You can click on it. So, it gives you a synopsis. Synopsis isn't perfect, but most people find that you actually get through that and what it does is it gives them the triaging capability. So, if you are a sales person with 30 -- I get one voice mail message every month. E-mail, I get 75 a day. But there are sales people who get 30, 40, 50 voice messages a day and very little e-mail and they are the people who love this kind of capability.

PARTICIPANT: [Comment off microphone.]

DR. RABINER: They come down to making things more efficient, so you can try more things.

PARTICIPANT: [Comment off microphone.]

DR. RABINER: Well, the text-to-text translation is a really tough problem. Okay. And very likely, most terrorists don't give you those free --

[Laughter.]

Obviously, homeland defense is -- you know, the speaker recognition, the verification. They go out in the surveillance. Having something that converts it into text and looking for some key words. They are not going to be the key words you wanted just for the reason that you can't really listen to everything that goes on. No one can do that.

MR. EUBANK: You talked a lot about robustness, but the kind of robustness you were describing is noise and things like that and not an adversarial --

DR. RABINER: What is adversarial in speech recognition?

MR. EUBANK: [Comment off microphone.]

DR. RABINER: Yes, probably, but, you know, you would have to be much more specific. I just can't give you an easy answer on that one. Sorry. I am going to turn this back to you.

DR. LENCZOWSKI: I want to thank you for very stimulating discussions. With that, I was going to turn this over to Dr. McLaughlin, whose responsibility during this session was to listen and to capture some of the challenges, the complexities that we should be thinking about in terms of the presentations.

Challenges in Speech Recognition

Larry Rabiner

There are two main challenges in speech recognition. To begin, many speech recognition systems are very task-specific, with limited vocabularies. In addition, speech recognition systems are often very environment-specific, performing poorly when tested with speaker phones or cell phones, or in noisy environments such as restaurants, airport lounges, or moving cars. It is easy to take a well performing, finely tuned system and make it so broad that it works terribly all the time. Further, when a system is designed in an environment with a set of speakers and a set of inputs, then in a new environment, the signal is mismatched.

There are three models for speech recognition:

1. The acoustic model. This model is based on physical speech data and the numerous types of sounds it must distinguish. One powerful approach in this domain is a Markov model, which is a statistical method for characterizing the spectral properties of the signal. The system works well in the given environment, but is nonrobust.

2. The word lexicon. This is a dictionary bridging legal phoneme sequences and words, according to rules. The mathematical challenge is in generating a word lexicon automatically, given that new words are often introduced to a language and the lexicon must handle a range of pronunciations.

3. The language model. This model builds valid sequences from words and phrases, based on the syntax of the recognition task. Like the acoustic model, it assumes that people will speak according to a certain text model, making this model nonrobust.

The goal is to determine the most probable word sequence, given a task (which defines the syntax), given a set of words that could be spoken (which is the lexicon), and given the sounds of the language. The current state of the art is a decoder that searches 10x possible recognition choices whereby combinations are weighted and minimized.
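The statistical language model described above can be illustrated with a very small sketch. The talk mentions trigram statistics built from hundreds of millions of words; for brevity this sketch uses bigrams, plain relative-frequency estimates with no smoothing, and a three-sentence made-up corpus, so it shows the idea rather than a usable model.

```python
from collections import Counter


def train_bigram(sentences):
    """Count each word as a context and each adjacent word pair."""
    context_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        context_counts.update(words[:-1])           # every word that has a successor
        pair_counts.update(zip(words[:-1], words[1:]))
    return context_counts, pair_counts


def bigram_prob(context_counts, pair_counts, previous, word):
    """Relative-frequency estimate of P(word | previous); 0.0 for unseen contexts."""
    if context_counts[previous] == 0:
        return 0.0
    return pair_counts[(previous, word)] / context_counts[previous]


# Made-up training text standing in for a task-specific corpus.
corpus = ["credit please", "credit card please", "collect call please"]
context_counts, pair_counts = train_bigram(corpus)
p_please = bigram_prob(context_counts, pair_counts, "credit", "please")
p_fees = bigram_prob(context_counts, pair_counts, "credit", "fees")
print(p_please, p_fees)
```

The zero probability for a word pair never seen in training is one face of the nonrobustness noted in the summary: the model assumes people will speak according to the text it was built from.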
The final stage is "utterance verification," where key words are identified. Currently, speech recognition systems have very large vocabularies and limited tasks. By 2006, systems should be able to perform effectively in arbitrary environments, and by 2010, systems should be able to perform unlimited tasks.
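The utterance verification stage can be pictured with a minimal sketch, loosely following the "credit please" versus "credit fees" example from the talk. It assumes the recognizer attaches a confidence score between 0 and 1 to each word; the scores, the 0.5 threshold, and the action names are hypothetical.

```python
def verify_utterance(scored_words, threshold=0.5):
    """Flag words whose confidence falls below the threshold.

    Returns the suggested action and the list of doubtful words.
    """
    doubtful = [word for word, confidence in scored_words if confidence < threshold]
    action = "accept" if not doubtful else "ask_for_clarification"
    return action, doubtful


# Made-up recognizer output: "credit" is confident, "fees" is not.
recognized = [("credit", 0.95), ("fees", 0.30)]
action, doubtful = verify_utterance(recognized)
print(action, doubtful)
```

A real system would derive the confidence from likelihood comparisons rather than a fixed number, and could reject the utterance outright instead of asking the caller to repeat it.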

David McLaughlin

"Remarks on Image Analysis and Voice Recognition"

Transcript of Presentation
Summary of Presentation
Video Presentation

David McLaughlin is a provost and a professor of mathematics and neural science at New York University. His research interests are in visual neural science, integrable waves, chaotic nonlinear waves, and mathematical nonlinear optics.

339 DR. MC LAUGHLIN: So, I am going to try to be very brief and just make a few remarks. The first remark I wanted to make concerned comments this morning about how do -- does the government or any interested set of parties get mathematicians and scientists interested in homeland defense. Rather than try to answer that, I just want to -- I want to state what I think are three of the major problems that have to be overcome. The first is that in our guts as scientists and mathematicians, we don't see the immediacy and the urgency of the problem in our guts. Intellectually, we say it is a big problem. But we all hope that we have had the last terrorist attack. That is combined with two other things that we realize. The second is that this isn't one problem, homeland defense. It is thousands of problems and as mathematicians or as scientists, we tend to attack one problem or one area of problems, if you like. This is pretty -- a pretty broad list of things for us to sink our teeth into. You combine that with how we know we work, we work over 40 year time frames and we are being -- and we think if we do think about the immediacy of the problem, we are worried about what is going to happen in the next six 339

months or a year. It seems to me that you have to face those three realities and I certainly don't have any solutions, but it does seem to me that those sorts of realities have to be faced by people concerned with getting scientists and mathematicians immediately involved in homeland defense as they were in the Manhattan Project, where some of those problems were not necessary to overcome or else the scientific leaders realized they were not necessary to overcome. I don't know the history of the time. That was the first point I wanted to make.

A second point I wanted to make concerned the small needle in a large haystack that has been referred to many times today and it refers to it somewhat indirectly, but I wanted to share with you a story that I have heard several times in recent months. There is a man by the name of Steve Flynn. Steve is a high officer in the Coast Guard, perhaps a lieutenant commander, but in any case a high officer, who this year is on leave from the Coast Guard in one of the government think tanks and his task is to think about border security. He is concerned with two things. That he wants to protect our borders; on the other hand, he doesn't want to close the Canadian border unnecessarily for financial reasons.

So, he tells this story, which I will try to abbreviate it because it is late, but it is rather disturbing. He talks about one problem of preventing a bomb, perhaps a dirty bomb, from being brought into the country through shipping. He talks about how in Pakistan, crates are loaded onto something and they get to Hong Kong and in Hong Kong, they are stored more or less in the open, after which they are loaded onto boats and come to San Diego. In San Diego they are loaded onto trucks. There is no inspection of this process yet. They are loaded onto trucks and nobody monitors where those trucks go. They know where they are going to end up. Perhaps they are going to end up in Newark at the -- near the Newark Airport in a storage facility, where they are -- where whoever has bought this equipment will pick them up. It is there that Customs will pay some attention to them. But in the meantime, they have gone through Chicago. They have taken detours totally unmonitored, depending on the truck drivers. Once they get in Newark, it would be conceivable for a phone call to detonate. The question is how do you discover which, if any, of the crates has a bomb in it? Now, Flynn's response -- so this is definitely a small needle in a large haystack

and his response is that it probably cannot be done at that point in time. In fact, he is urging people to monitor much more closely the packing of the boxes in -- at the entry point. The reason I mention that is -- and it is an obvious point, but when you are dealing with a small needle in a large haystack, it seems to me you want to carefully define that needle. Flynn is defining it as entry point, but in each case, I think we want to very carefully define the needle that we are seeking. So, that is the point two that I wanted to make.

The last point I wanted to make is slightly more scientific and it returns to a point that George Papanicolaou made this morning, when he was -- when he mentioned a comment about general data mining and he was urging people when possible to use science in addition to statistics. He used the example of the physics of earthquakes versus sound detonation. In what we talked about today, this afternoon, I want to focus on a few remarks about computer vision, although I believe similar remarks would apply to speech recognition and so forth. But as Professor Malik knows and if you read his work, he uses, it is very wise in computer vision to attempt to use biological vision

to guide our thinking and construction of computer vision algorithms. The problem is we know very little about real vision. But when you think about how visual information is processed in the cortex, there are a few things that we do know and I just want to close by listing some of them and I am sure that the computer vision experts (a) know these, (b) undoubtedly try to use them, but I also believe that since we don't understand how they are used in the human and mammalian visual system, we haven't yet unveiled how they should be used in the computer visual system. In any case, when you think about the cortical processing of visual information, the first thing that is completely clear is that that region of the brain is a massively complex feedback system. There are extensive feedbacks between regions of the brain and many of these regions of the brain are themselves layered structures. There is massive feedback between the layers. So, if you like, you can almost think about feedback within feedback within feedback. You can ask questions like how is the brain making use of the fact that the architecture is layered? People don't know the answer to that. People do know which regions are coupled to which. They know that is based on anatomy and they know

very well which layers are coupled to which, also based on anatomy. They don't have much feeling for the strengths of those couplings. For example, the visual system is often modeled as a feed forward network where information enters through the retina and it goes on to a filter called the LGN and on into the visual cortex; two specific layers of the primary visual cortex, 4 and 6. Now, it is known anatomically that there is a huge amount of feedback from 6 back to the LGN. In fact, there is more feedback anatomically from 6 back to the LGN than there is from the LGN forward to 6. Yet, nobody knows within the visual system what that feedback is doing. It must be doing something because there is so much of it anatomically. Another issue is time scales. People are beginning to wonder is there temporal delays between these different regions for which the feedback might be slightly slower or might be slightly delayed and, in fact, it could be delayed within delayed within delayed when you think about regions within regions within regions, which seems to be the architecture. It is quite controversial as to how important these delays are. David Mumford and LeMay and Lee have been concentrating on delays for several years now. Other neuroscientists focus upon the fact that there are

precursors in these delay processes that are almost instantaneous. But in any case, when you think about those features of the real visual system, namely, layered structures, massive feedbacks and the importance of temporal dynamics, some of the more static single layer or two dimensional image reconstruction that is done in computer vision, there is a chance that by better understanding the real visual system or even if we don't understand the real visual system by thinking about possible uses of this architecture and delay might well improve the computer vision system. Now, this is saying nothing to Professor Malik, who does this in his work continually, but it is a fact that I think that the computer vision community doesn't talk with the biological vision community enough. So, I think that that is the only three points that I really wanted to make and it is late and I think I will conclude with that.

Remarks on Image Analysis and Voice Recognition
David McLaughlin

How can the government, or any interested set of parties, get mathematicians and scientists interested in homeland defense? Some of the challenges are these:

1. As mathematicians and scientists, we don't see the immediacy and the urgency of the problem. Intellectually, we say it's a big problem, but we all hope that we've had the last terrorist attack.

2. Homeland defense is not a single problem. It's thousands of problems. But mathematicians and scientists tend to attack one problem, or one area of problems, at a time, and there is a pretty broad list of things for us to sink our teeth into.

3. We work over 40-year time frames, and if we do think about the immediacy of the problem, we worry about what's going to happen in the next 6 months or a year.

Many of the problems in homeland security involve finding a small needle in a large haystack. One way to deal with this problem is to put more effort into carefully defining that needle. Dr. McLaughlin illustrated his comment with a hypothetical example, attributed to Steve Flynn, a high-ranking officer in the U.S. Coast Guard, whose current task is to think about border security.

It is necessary, say, in data mining to use not only statistics but also insights from whatever other scientific realms inform the problem, even if they are distant from the analyst's own field. In computer vision, for example, it would be wise to attempt to use biological vision to guide our thinking and construction of algorithms.

David Donoho
"Remarks on Image Analysis and Voice Recognition"

David Donoho is the Anne T. and Robert M. Bass Professor in the Humanities and Sciences at Stanford University. He has worked on wavelet denoising, on visualization of multivariate data, on multiscale approaches to ill-posed inverse problems, on multiscale geometric representations, such as ridgelets, curvelets, and beamlets, and on manifolds generated by high dimensional data. He received his A.B. in statistics at Princeton and his Ph.D. in statistics at Harvard. He is a member of the National Academy of Sciences and the American Academy of Arts and Sciences.

DR. LENCZOWSKI: Once again setting up electronically here -- wouldn't it be nice to have voice recognition and simply tell it to connect. With respect to the voice recognition, that there were only two words there in 1980 because I know in 1981, that system we had recognized or purportedly recognized 10 if properly enunciated. The next discussant that we have here to make commentary on, again, the complexity of the various topics is David Donoho. Interestingly enough, he was also a reference in one of the previous presentations. So, he has the opportunity to determine whether or not he was represented properly.

MR. DONOHO: Well, the cardinal difficulty I have to face here is I am standing between you and a reception. So, be brief, right? I wanted to put in context an idea that I think it is implicit. Everybody already accepts, but I might as well just say it out loud. Why do we think that mathematics is relevant to homeland security? It is because we are entering a data saturated era. We are seeing a one time transition to ubiquitous measurement and exploitation of data and we believe that mathematics is relevant to that.

So, I just want to go into that and talk about some of the obstacles not yet addressed and that I think were exemplified in some of these talks and isolate a few mathematical challenges, but I can't possibly be exhaustive. So, while world hunger has been with us for a long time and will continue to be with us sadly, we are at a point now where we will never be like Sherlock Holmes. We will be inundated with so much data that we will never feel that we don't have what we need to make decisions. The era that we are entering we have got the prodigious efforts of the best and brightest in our society are dedicated to create huge data repositories and exploit it all sorts of ways. The information technology business is a huge part of our economy. It dominates corporate America. It is like 50 percent of Fortune 500 budgets in information technology. So, it is clear that this is an important phenomenon but it is really, let's say, pushing the thinking of everybody about how we ought to go about things, at least in the United States. So, if you look at the definition of an era, of course, there were, you know, great minds in the Renaissance and in the Middle Ages that defined the era for them by, for example,

building wonderful structures or creating great works of art. What we are doing instead is creating networks -- pervasive networking, gargantuan databases and promulgating an ethic of data rich discourse and decisions. So, homeland security that we can contribute to, I guess, everyone is assuming it is all about data gathering, data analysis and then somehow we are going to find in this mountain of data things that are going to be useful. For example, if we had a national I.D., there would be some way to codify that into a process that would let us keep people from doing bad things to our citizens. Okay? So, it is important to say how far this really goes. I mean, anything can be quantified and data really can be gathered about everything and we really will be able to encode all sorts of things about human entity or behavior. People would say smells are an example of something that is merely poetry, but actually now we have electronic noses, arrays that have excellent abilities to distinguish between let's say roses and garbage or more to the point, they can distinguish between, for example, different hydrocarbons like in this case methane and benzene and so on. They could do a job that wouldn't be so safe for humans to do.

As Professor Coifman pointed out, we have hyperspectral photography that goes considerably beyond the abilities of the human eye, getting thousands of spectral bands and able to identify what things are made of, not merely what they looked like, simply by looking at them. Human motion tracking, there are now sensors that can record gestures, pose and the articulation over time and so on. So, all this data is being gathered and I could only sketch here, just mention a few examples, but it goes on and on. So, certainly all aspects of human identity, gait and posture will be captured and digitized completely. This goes to the point that the NBA is currently -- has a web site where they have nine weekly games and talk about amazing moves that people made that were atypical for that player. Okay. Now, we all know that there are people who are naysayers about the data era and they would say that there is going to be a violation of privacy or morals and then there are people who say that calculating all this data and compiling it is just wasting time and you are just going to be lost in this glut and be paralyzed by the fact that you have the information.

So, if we look at the issues for homeland security, I guess I see the following things. So, first of all, any time you talk about compiling data, for example, around imagery or around speech, something like that, there is going to be in this city in particular a big reaction, which is based around political maneuvering and there will be always issues of -- there will always be issues of civil liberties and so on. Secondly, and this was mentioned before, there is issue that in our system we do things by private enterprise and so there is going to be a lot of trade speech claiming benefits and workability, which is quite different from the actual experience of some of the customers who tried to do these things. In addition, there are massive engineering issues -- there are massive engineering problems associated with this, that just merely to have something like a national I.D. or to use biometrics in a consistent way that would really close off from undocumented people the ability to get on airplanes and so on, just to create that system is very challenging and there would be a lot of discussion about whether you could actually build the infrastructure. But none of those things are science. So, it is really not Ralph Nader or the Pope or the data glut also the

arguments. Suppose all of those things are solved, we really should just face what the scientific issues are in using all this data. I see them as being -- if we wanted the broad issues identified, generalizability, scalability, robustness or lack of fundamental understanding of the structure of the data. Generalizability would mean you can build these small systems in a research lab, but then you can't deploy them through the level of dealing with a quarter billion people in a satisfactory way. Robustness would be, again, work in perfect conditions, but you can't deal with articulated conditions, where something is buried away from the research lab. A lack of fundamental understanding has to do with the fact that a lot of what we are doing is using what mathematics gave us 40 or 50 years ago and not developing mathematics that tries to understand what the underlying issues out of the data that we are trying to deal with is. We are just getting what -- how can we deploy the stuff that is out there. So, it seems to me that all of those things are due to the fact that mathematics hasn't been going where the data revolution is pushing us, at least not enough to make a difference in the directions that we like. Where

the data revolution is pushing us is dealing with very, very high dimensional space. So, if you look at an utterance or an image, you are talking about something that is a vector up in a thousand or a million dimensional space. So, what we really need to think about is what is going on in high dimensions. I just mention here some examples, human identity, even highly decimated. It might be 65,000 variables just to take raw imagery, hyperspectral image. For a chemical, you have a thousand spectral lines for a sample. If you went to an image, you would have a thousand times the number of pixels. So, what are the needs with high dimensional data? I see a couple of things that came out in the talks today. Coping with dimensional glut, understanding high dimensional space and representing high dimensional data in qualitatively new ways. I think in various ways if mathematicians make major progress in these areas, there will be long term payoffs. Again, it might be just 40 years from now when the real problems are something else. But these things will get ultimately inserted in deliverables. One issue about dimensional glut, which underlay a little bit, when you have too many dimensions, often what

we are seeing in modern data sets is very, very high dimensions about each individual under study. So, an image is like maybe a million dimensions about one particular thing you are looking at. Traditional statistical theory says that the number of dimensions compared to the number of observations should be such that you have a very favorable ratio; namely, D over N. It is small. But where Moore's law is pushing us is D over N is going to infinity in some sense. We are getting more and more complex articulated data about just a few individuals and we are not getting more and more data per se about independent observations. As a result, we find it very difficult to train from this enormously high dimensional stuff and then get something that generalizes. Just as an example, I went out on the web and found a class project on Eigenfaces. So, you use principal components to find the underlying structure of face space in some sense. So, here is a class and there are their Eigenfaces. The thing is that what they have is each person is like a 16,000 dimensional vector. Right? Then there is only, I don't know, 30 kids in the class or something. So, it is exactly the situation that I was talking about. When you look at Eigen analysis on

this, what you find is that basically each person is equal to one Eigenface, take away the average. So, actually what happens is that the Eigen analysis just gives you back the individual identities and so on. Now, Eigen analysis is being used all the time throughout science and people are exactly going down this road all the time. It is not well-understood that if you go to high dimensions and you have relatively few examples, you have these distortions. The whole area of what distortions there are and what systematic issues you have are not very well understood. Another area of dimensional glut is that you have the curse of dimensionality. Algorithms don't run well in high dimensions. A simple example in the context of the talks that we have had is say I have to search through a database for nearest neighbors. That gets slow as you go up into high dimensions and there is a lot of theoretical research that has tried to crack the problem, but basically it has not been really solved in any practically effective way. So, basically you just have to go through a significant fraction of the data set to find the nearest neighbor. No ideas of hierarchical search really work in these way high dimensions. For example, Jerry Friedman in

the audience pioneered the idea of using hierarchical search. It works great in low dimensions, like ten dimensions or something, but you know, when N is much smaller than D, it is not a viable method. Well, the other issue is we just don't understand enough about what our data structures are that we are gathering. They are these things up in high dimensional space. There isn't enough terms of reference for mathematics that has been provided to us to describe it. We lack the language. It is really the responsibility of math to develop that language. So, a lot of these things are about point clouds in high dimensions, just individual points are pictures or sounds or something. They are up in high dimensions. If we collect the whole data base, we have a cloud of points in high dimensions. That has some structure. What is it? We believe that there is something like a manifold or complex or some other mathematical object behind it. We also know that when we look at things like the articulations of images as they vary through some condition like the pose, that we can actually see that things follow a trajectory up in some high dimensional space notionally, but we can't visualize it or understand it very well.
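The difficulty with nearest-neighbor search in high dimensions that comes up in this part of the talk can be illustrated numerically. The following is a minimal sketch and was not shown in the presentation: it assumes points drawn uniformly at random from the unit cube (an illustrative choice, not a model of real image or speech data), and the function name `nearest_to_farthest_ratio` is hypothetical. It reports the ratio of the nearest to the farthest distance from a random query point. As the dimension grows, this ratio tends to move toward 1, so distance-based pruning in a hierarchical search has less and less to work with.

```python
import math
import random

def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest Euclidean distance from a random query.

    Points and the query are drawn uniformly from the unit cube [0, 1]^dim.
    A ratio close to 1 means all points look almost equally far away.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return min(distances) / max(distances)

if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, round(nearest_to_farthest_ratio(dim), 3))
```

Real data often lie near a lower-dimensional structure, which is the manifold idea raised in the talk, so this uniform-cube experiment should be read as a worst-case illustration rather than a statement about any particular data set.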

So, people are trying to develop methods to let us understand better what the manifolds of data are like. Tell me what this data looks like. Josh Tenenbaum, who I guess is now going to M.I.T., did this work with IsoMap, which given an unordered set of pictures of the hands, okay, figures out a map that shows what the underlying parameter space is. So, there is a parameter that is like this and there is -- that is this parameter and then there is a parameter that goes like that and that is discovered by a method that just takes this unstructured data set and finds the parameterization. We need a lot more understanding in this direction. The kinds of image data that you see here, this is just one articulation, one specific issue and one methodology for dealing with it. But we could go much further. Representation of high dimensional data, Professor Coifman mentioned this and in some of Professor Malik's work it is also relevant. What are the right features or the right basis to represent these high dimensional data in so that you can pull out the important structures? So, just as an example, if you look at facial gestures, faces articulating in time and you try to find how do you actually represent those gestures the best, one

could think of using a technique like principal components analysis to do that. It fails very much because the structure of looking at facial articulation is nothing like the linear manifold structure that, you know, 70 years ago was really, you know, cutting edge stuff, when Hotelling came up with principal components. It is a much more subtle thing. So, independent components analysis deals much more with the structure as it really is and you get little basis functions, each one of which can be mapped and correlated with an actual muscle that is able to do an articulation of the face. You can discover that from data. It is also nice to come up with mathematically defined systems that correlate in some way with the underlying structures of the data. So, through techniques like multi-resolution analysis, there are mathematicians designing, you know, bases and frames and so on that have some of the features that correlate in a way with what you see in these data. So, for example, through a construction of taking data and band pass filtering it, breaking it into blocks and doing local Radon transform on each block, you can come up with systems that have the property that they add in little local ridges and so on. So, you can represent with low dimensions structures like fingerprints

because you have designed special ways of representing data that have long -- elongated structures and it is also relevant to the facial articulation because the muscles and so on have exactly the structure that you saw. Okay. So, just some of the points I have made, we are entering an era of data, the generalizability, the robustness, the scalability that have come up in some of these things are issues that will have to be addressed and in order for that data to deliver for any kind of homeland defense, where mathematicians can contribute is understanding fundamentally the structure of this high dimensional data much better than we do today. It can provide a language to science and a whole new set of concepts that may make them think better about it. Our challenge is to think more about how to cope with dimensional glut, understand high dimensional space and represent high dimensional data. The speakers have brought up the issues that I have said in various ways and trying to -- you know, under one umbrella, cover all of this is pretty hard, but certainly in Professor Malik's talk was brought up the issue of robustness to articulation. We don't understand the articulation manifold of the images well enough. As an object in the world is articulated, the images that it generates go

through a manifold in high dimensional space. It is poorly understood. IsoMap is an example of a tool that goes in the right direction. But to get robustness to articulations, it would be good to understand them better. The structure of image manifolds, as Professor Malik mentioned, working directly with shape, that is again trying to get away from a flat idea of the way data are in high dimensional space and come up with a completely different way to parameterize things. Professor Coifman, well, he basically like we are co-authors, so I am just extending his talk, but, anyway, the connections there are pretty obvious. Finally, I think what Rabiner showed is first of all the ability to be robust to articulations was mentioned. So, again, to understand I think how those strange noises in the data -- what is the structure of the manifold of those things up in high dimensional space. So, the structure is speech and confusers would be interesting to know. I think he has shown that if you work very hard for a long time, you can take some simple math ideas and really make them deliver, which also is a very important thread for us to not lose track of. He mentioned a number of math problems that he thought would be interesting. I

wish he had been able to stay and we could have discussed those at greater length. There was a questioner in the audience who said, well, but you said you had just been carrying out an implementation for 20 years. Are there any new math problems that we really need to break now? It would be very interesting to hear what he had to say. I think all the speakers have pointed out that a lot of progress has been stimulated by -- in math, I think, from the study of images and -- associated multimedia data and I think that will continue as well and eventually we will see the payoff in homeland security deliverables, but perhaps a long time. Thank you.

[Applause.]

PARTICIPANT: [Comment off microphone.]

MR. DONOHO: IsoMap is just a preprocessor for multidimensional scaling, for example. So, for instance, you didn't have this standard method multidimensional scaling, we wouldn't be able to -- just provide a special set of distances. Has any of this played out? I mean there are 10,000 references a year to that methodology. I don't think I could track them all down and make a meta-analysis

of how many were just, you know, forgotten publications. But it just seems like millions of references --

PARTICIPANT: [Comment off microphone.]

MR. DONOHO: We are right now getting data in volumes that were never considered 20 years ago. We have computation in ways that were not possible 20 years ago; whereas, the methodology of 80 years ago, just with the data and computations that we have is starting to deliver some impressive results. So, if we are talking about more recent methodology, well, let the data take hold and let some computations take hold.

MR. MC CURLEY: [Comment off microphone.]

MR. DONOHO: Your earlier question about what features would help to index, I think it is connected with that. I think that the idea of looking for nearest neighbors in a database of all the multimedia experiences in the whole unit, it is sort of where this is going. Right? That is not going to happen. There ought to be a more intelligent way to index, to compress and so on. Can we make a decision right now how much data to gather? I don't know. I think economics will drive everything.

PARTICIPANT: [Comment off microphone.]

MR. DONOHO: It is an excellent question. I have some opinions on the matter but I don't know if I have really the right to stand between people and liquid refreshment. Yes, I think that most of the political things that I have seen are between non-technical people who are playing out the dynamics of where they are coming from rather than looking at really solving the problem. So, you know, I am a little bit -- so, for example, a national I.D. card wouldn't have any privacy issues if all it was used for is for -- you know, it contained data on the card and it was yours and it was digitally signed and encrypted and all of that stuff and there was no national database that big brother could look through. Actually, the point was that it was the card and that was it. Right? What people assumes it means there is that big brother has all the stuff there and they can track their movements and stuff like that. So, there are all those kinds of things. There is nothing that says the I.D. card can't just be used for face verification or identity verification. You are that person and that is it. They are not communicating with anybody. It is illegal to communicate with anybody. That can all be done.

But people get together and butt heads is sort of an Orwell characterization I guess. The science is absolutely clear, that you could do it without having all of those -- violating personal liberty and so on.

DR. LENCZOWSKI: Are there other questions for any of the speakers? I guess that everyone is ready to take their break. I want to thank you very much for your participation this afternoon.

[Applause.]

Remarks on Image Analysis and Voice Recognition
David Donoho

Our era has defined itself as a data era by creating pervasive networks and gargantuan databases that are promulgating an ethic of data-rich discourse and decisions. Some scientific issues that arise in working with massive amounts of data include these:

Generalizability: building systems that can efficiently handle data for billions of people,
Robustness: building systems that work under articulated conditions, and
Understanding the structure of the data: techniques for working with data in very high-dimensional space.

Take, for example, the task of coping with high-dimensional data in body gesture recognition. If you look at an utterance or an image, you are talking about something that is a vector in a 1,000- or 1,000,000-dimensional space. For example, human identity, even highly decimated, might require some 65,000 variables just to produce raw imagery. Characterization of facial gestures is a subtle thing, and independent-components analysis (as opposed to principal-components analysis) deals much more with the structure as it really is. You get little basis functions, each one of which can be mapped and correlated with an actual muscle that is able to do an articulation of the face.
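One way to see why principal-components analysis can mislead when the dimension far exceeds the number of samples, a situation Dr. Donoho raised in his talk with a class of about 30 faces each stored as a roughly 16,000-dimensional vector, is the following rank argument. It is a standard linear-algebra observation written out here as a sketch. The symbols $x_i$, $\bar{x}$, $X_c$, $S$, and $v_k$ are notation introduced for this illustration and do not appear in the presentation.

```latex
% N samples x_1, ..., x_N in R^D with N much smaller than D
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
X_c =
\begin{pmatrix}
(x_1 - \bar{x})^{\top} \\
\vdots \\
(x_N - \bar{x})^{\top}
\end{pmatrix}
\in \mathbb{R}^{N \times D},
\qquad
S = \frac{1}{N-1}\, X_c^{\top} X_c \in \mathbb{R}^{D \times D}.

% The rows of X_c sum to zero, so they are linearly dependent:
\sum_{i=1}^{N} (x_i - \bar{x}) = 0
\;\Longrightarrow\;
\operatorname{rank}(S) = \operatorname{rank}(X_c) \le N - 1 .

% Hence S has at most N-1 nonzero eigenvalues. If v_1, ..., v_r (r <= N-1) are
% orthonormal eigenvectors for those nonzero eigenvalues, every centered sample
% is reproduced exactly by them:
x_i - \bar{x} = \sum_{k=1}^{r} \langle x_i - \bar{x},\, v_k \rangle\, v_k ,
\qquad i = 1, \dots, N .
```

With $N \approx 30$ and $D \approx 16{,}000$, the eigen analysis can produce at most 29 components with nonzero variance, and those components reproduce every training face exactly, which is consistent with the remark in the talk that the analysis largely gives back the individual identities.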


The Mathematical Sciences' Role in Homeland Security: Proceedings of a Workshop (2004)

Chapter: Image Analysis and Voice Recognition
