Suggested Citation:"TRANSCRIPT OF PRESENTATION." National Research Council. 2004. Statistical Analysis of Massive Data Streams: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/11098.

INCORPORATING INVARIANTS IN MAHALANOBIS DISTANCE-BASED CLASSIFIERS: APPLICATIONS TO FACE RECOGNITION

TRANSCRIPT OF PRESENTATION

MR. VIXIE: Thanks for your patience. So, the problems we are interested in range over a lot of different kinds of data, including hyperspectral data. The problem we chose to look at, for the methods I am going to tell you about, is face data, because not only has it been worked on a lot and it is hard, but there are nice sets of data out there, and there is also a way to make very nice comparisons among a bunch of competing algorithms. The people involved in this are myself, Andy Fraser, Nick Hengartner, who is here, and Brendt Wohlberg. A little more widely, there is a team of us forming that also includes other people who are here, like Peter Swartz, James and, of course, Nick.

The big picture is that we have these very large data sets. To do the kind of computations that we need to do, we need to do dimension reduction. The classical method, of course, is PCA, and we like to look at non-linear variants of that approach. The challenge we chose to address was how to build metrics that are invariant to shifts along surfaces in the image space which represent changes, like scaling, rotation, etc., that don't change the identity but do change the image. This is preliminary work, a couple of months of work, but the five-dimensional space we are working on is represented by shift in X, shift in Y, scale in X, scale in Y, and rotation. The next steps are to look at three-dimensional rotation and illumination change.

So, the prior work that we believe is relevant is on the order of 160 papers that we looked through. This morning, out of curiosity, I did an ISI search and also an INSPEC search, just to see what would come up when I typed in "face recognition." ISI gave me about 900 hits and INSPEC gave me about 3,200. So it is quite believable that our coverage, maybe 5 percent of the literature, is kind of poor.
So, the couple of papers we found that we thought were of help to us were the eigenfaces-versus-Fisherfaces paper, and then a paper on using tangent approximations to increase classification rates for character recognition. The data we used was from the FERET database. The test bed we chose was the Colorado State University test bed. We chose this because, in this test bed, they have a very uniform way of testing many different algorithms. There are 13 algorithms there, so with ours, it was 14. There is a standard pre-processing that is done, everybody trains on the same data with the same pre-processing, and there is a standard test. That allowed us to compare different algorithms in a very fair and unbiased manner.

Here is a sampling of the performance. I will explain this a little bit. The idea is that these algorithms train on the training data, and they end up with a trained algorithm that you can hand data to, and it will hand you back a distance matrix. So, if we hand it 100 faces, you end up with a 100-by-100 matrix, and this is the distances between the faces. Then, based on that, you can rank results and say, if I have two images of the same person, what I would like to happen is that, in the column corresponding to image one of this person, image two is ranked number one; it is the closest. This is sort of a Monte Carlo experiment, a histogram of how often, if you hand

an algorithm 100 faces, it is going to get the right person in rank one. So, it ranges. The bottom is the PCA-based Euclidean distance, which I will explain a little bit, and the top is linear discriminant analysis based on a correlation angle.

AUDIENCE: So, if you train on 100 faces, why doesn't it cover those 100 faces?

MR. VIXIE: First of all, the training and testing sets are different, and I will get to the details of how many it is trained on. Now, the pre-processing is done to all the training data and all the test data; you shift, scale and rotate based on eye coordinates. So, these slides aren't actually the very latest. I made some changes last night after I came and they are not showing. So, this is two pictures of the same guy. There is a different facial expression. You can also see at the bottom, these are the pre-processed images; these are the raw images. So, you scale, shift and rotate based on eye coordinates and location. Then you mask, using an elliptical mask. Then you equalize the histogram; this is standard in image processing. Then you shift and scale until you get mean zero, standard deviation one.

For training, 591 images were used; that was 197 individuals, three pictures each. Just to give a little flavor of what these different methods do: for PCA, of course, what you do is simply take the images, form the empirical covariance matrix, do the eigendecomposition, and then project down to some specified number of dimensions; here it was 60 percent, so you end up with a 354-dimensional space. Now, that is actually a very big reduction in dimension, because the images are on the order of maybe a quarter-million pixels. So, it is a big reduction in dimension. The LDA is just the Fisher basis, which probably everybody here knows about. In this case, you have knowledge about what is the within-class variance and what is the between-class variance.
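[As an editorial illustration of the PCA step just described, here is a generic NumPy sketch. The array sizes are toy stand-ins, not the actual quarter-million-pixel FERET images; only the 591-image, 354-dimension figures come from the talk.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the training set: 591 "images" of 400 pixels each.
# (The real images have on the order of a quarter-million pixels.)
X = rng.normal(size=(591, 400))

# PCA: center, form the empirical covariance, eigendecompose,
# and project onto the leading eigenvectors.
mu = X.mean(axis=0)
Xc = X - mu
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]            # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

d = 354                                      # about 60 percent of 591, as in the talk
basis = eigvecs[:, :d]
coeffs = Xc @ basis                          # (591, 354) low-dimensional codes

print(coeffs.shape)
```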
So, you know what differences correspond to differences between images of the same individual and what differences correspond to differences between different individuals. You try to pick a basis that optimally differentiates between-individual differences and within-individual differences. For the linear discriminant analysis, we simply take the within-class differences. There are 591 pictures, so for each individual's set of three images, you take all those differences, throw them together, and build a covariance matrix, and that gives you the within-class covariance matrix.

The Euclidean distance is very simple. You just project into the PCA basis and take the coefficients. For image A and image B, the distance between those images is simply the square root of the sum of the squared differences of the coefficients. That corresponded to the worst performance in that set of algorithms. The correlation, which was the best, is an angle-based classification. Here, after you project onto an LDA basis, you take the means and subtract them out, and simply take an inner product. If the inner product is maximized, then they are close. You want that to be a distance, so you subtract it from one, and it turns into something like a distance.

So, what did we do? Well, what we did was dictated not only by some interest in faces, but more by an interest in many kinds of data, including hyperspectral, voice, etc. So, we said, well, we would like to know, in a principled way, how to include in the distance metric the notion that there are differences that don't really matter.
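[The two distances just described can be sketched in a few lines; the function names and toy vector here are editorial inventions, not the test bed's code.]

```python
import numpy as np

def euclidean_dist(a, b):
    # a, b: coefficient vectors of two images in the PCA basis.
    return float(np.sqrt(np.sum((a - b) ** 2)))

def angle_dist(a, b):
    # Subtract the means, take the normalized inner product (the
    # correlation), and subtract from one so that "close" means "small".
    a0, b0 = a - a.mean(), b - b.mean()
    corr = a0 @ b0 / (np.linalg.norm(a0) * np.linalg.norm(b0))
    return 1.0 - corr

a = np.array([1.0, 2.0, 3.0, 4.0])
print(euclidean_dist(a, a))    # identical images are at distance zero
print(angle_dist(a, 2 * a))    # rescaling does not change the angle
```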

So, the blue curve represents the low-dimensional manifold. In our case, it is going to be five-dimensional, because there are five parameters we are modifying the image by. That represents sort of an orbit of the face in the image space, all of which is the same individual. So, what I would like to do is build a metric that says, if there are two points on it, a point here and a point there, there is no distance between them, because they are the same individual.

These manifolds are non-linear and it is not necessarily easy to compute them. So, as a first step, we do a linear approximation. This just represents the tangent approximation to that surface. What we are going to do now is modify this covariance matrix, which is used as a kernel to build the distance, or the inverse of that kernel, with the knowledge that differences along this direction don't matter. Now, of course, that is not quite true, because if the manifold is highly curved and I go too far out, the linear approximation isn't very good. So, we want to use the second-order information, which we can also compute, to limit how widely we let ourselves move along this direction with no penalty.

We built this new covariance matrix, and it enabled us to use the classification method. Notice that the key feature here is that, when you do this, you end up with different modifications at each point. Even though we start with the same within-class covariance, we end up with different modifications of it. So, it is a localized modification of something that is non-linear. I will come back to some of the details and show you the results. So, this is the same graphic as before with some curves removed. This was the worst, this is the best, and this is what we got.
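[A sketch of the idea described above: the tangent directions of the invariance manifold are added to the within-class covariance, so that motion along the manifold is discounted by the resulting Mahalanobis distance. All sizes and values here are illustrative assumptions, not the authors' implementation.]

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 5                     # pixels (tiny toy size) and parameters

S_w = np.eye(n)                  # stand-in within-class covariance
G = rng.normal(size=(n, d))      # tangent vectors of the manifold at this image
Sigma_theta = 10.0 * np.eye(d)   # how far we trust the linear approximation

# Modified covariance: large variance along the tangent directions makes
# motion along them cheap.  G depends on the image, so the modification
# is localized even though S_w is shared.
K = S_w + G @ Sigma_theta @ G.T
K_inv = np.linalg.inv(K)

def mahalanobis(u, v):
    w = u - v
    return float(np.sqrt(w @ K_inv @ w))

x = rng.normal(size=n)
t = G @ rng.normal(size=d)                   # a step along the manifold
r = rng.normal(size=n)
r *= np.linalg.norm(t) / np.linalg.norm(r)   # a random step of the same length

d_tangent, d_random = mahalanobis(x, x + t), mahalanobis(x, x + r)
print(d_tangent, d_random)                   # the tangent move costs much less
```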
What we got was an improvement. The next-to-last curve was what we did without the tangent modification, and then we got this big improvement. Again, our performance is untuned, and the thing you have to understand about the face literature is that tuning often makes a very big difference. So, we were encouraged: untuned performance, and then this big jump. The next step is to add this to the angle-based metric, where you might get an improvement there as well. Again, the real goal here isn't to do faces better than anybody else. It is to have a tool that is flexible, fast, and generally applicable.

Here are some details. There is this nice picture again; I like this picture. So, imagine there are a bunch of individuals and you have a space of images. This plane I am drawing represents the space of images. If you have 100-by-100 pixelated images, you get a 10,000-dimensional space. Then, I imagine that I have a space of parameters that controls where this individual is on the manifold. This is the transformation that maps you from individual and parameter to the image space. What you can do, to make a sensible little explanation, is assume the image equals this τ(F, θ) plus some noise. I am going to assume the noise is distributed normally, and that θ is also distributed normally, which is a helpful fiction that is not too far off. You can sort of convince yourself that it might be reasonable, because all of this face recognition is done where you first try to register the images the same way, but you make at least sub-pixel errors. So, you might expect that the errors you make are going to have something like a Gaussian distribution.
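[The model being described can be written out. The following is an editorial reconstruction from the transcript, not a formula taken from the slides: F is the individual, θ the five transformation parameters, τ the map into image space, and G_F the derivative of τ with respect to θ.]

```latex
x = \tau(F,\theta) + n, \qquad
n \sim \mathcal{N}(0,\sigma^{2} I), \qquad
\theta \sim \mathcal{N}(0,\Sigma_{\theta}),
% expanding tau to first order in theta, with G_F = dtau/dtheta at (F,0):
\tau(F,\theta) \approx \tau(F,0) + G_F\,\theta
\quad\Longrightarrow\quad
x \sim \mathcal{N}\bigl(\tau(F,0),\; \sigma^{2} I + G_F \Sigma_{\theta} G_F^{\top}\bigr).
```

[The second line is the linearity remark made below: a normal θ pushed through the linear map G_F stays normal, giving the mean-τ(F,0) distribution used for the maximum-likelihood classification.]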

So, if we expand τ in a Taylor series, then we get this. In our case, θ is a five-dimensional vector, G is going to be the number of dimensions you are using by five, and then we have the second-order term here. This is just a note that, because of linearity, if θ is distributed according to this distribution, then G_Fθ is distributed like that.

So, let me give an example of the shifts. Now, when I show this to people, first they say, what is the big deal? You are just shifting it. The deal is that in images this is non-linear; a shift of the image is a non-linear operation in pixel space. I will get to that a little later. You want to convolve these images with kernels to smooth them before you take the derivative. This is the image you start with, then you try to shift it up nine pixels, and this is with the second-order correction. Now, notice, you can see artifacts here, not that well, but there are some artifacts. It gets a little too light here and dark there. So, there are some artifacts with the second-order correction, because we have shifted it a little bit past the validity of this combination of kernel and image. So, we have to tune the combination of how big a kernel we use to smooth it and how far we shift it.

AUDIENCE: [Comment off microphone.]

MR. VIXIE: Yes, so let's take an example. If you have a very simple image with just one pixel that is black and everything else white, and you shift it one pixel, we are going to do it with finite differences. You shift it one pixel. Now, take the difference between those images: one at the new spot and minus one at the old spot. Add that to the old image to get the new image; it is shifted one pixel. Now, I would like a five-pixel shift. So, how about if you multiply that little difference vector by five? Are you getting a five-pixel shift? No. Does that make sense now?
AUDIENCE: Yes, but that is not usually what— [off microphone.]

MR. VIXIE: What I mean is I want to shift things— [off microphone.]

AUDIENCE: When you say linear, do you mean linear operating on the image, not on the location?

MR. VIXIE: Yes. Yes. Okay, some details. So, up to second-order error, if everything after the zeroth- and first-order terms is small, then the distribution for the image given F is going to be normal, according to this distribution. It is going to have its mean at τ(F, 0), and this covariance. So, the maximum-likelihood classification is simply minimizing, over i, this distance, where the kernel is now modified with the derivative.

Okay, so the question is, how do you pick Σθ? We have conflicting goals here. You want Σθ large, so that you can move out on the approximating manifold, but you want the second-order error to be small. So, the solution is to maximize the determinant of Σθ while constraining the second-order error. If you do that, it is fairly straightforward. You end up with Σθ equal to α times the inverse of this modified second-order term. What we have done here is modify it so that it is strictly positive. You might think that, okay, this weights all pixels evenly. So, this is the sum over the pixels.
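[The one-pixel finite-difference example above is easy to reproduce. A minimal 1-D sketch, my own construction: scaling the one-pixel difference vector by five produces two spikes, not a five-pixel shift, which is the non-linearity being discussed.]

```python
import numpy as np

# A 1-D "image": a single bright pixel on a dark background.
img = np.zeros(11)
img[5] = 1.0

# One-pixel shift via finite differences (circular shift for simplicity):
# the difference is +1 at the new spot and -1 at the old spot, and adding
# it to the old image really does give the shifted image.
diff = np.roll(img, 1) - img
one_shift = img + diff

# Now try to get a five-pixel shift by scaling the difference by five.
approx_5 = img + 5.0 * diff       # -4 at the old pixel, +5 at the next one
true_5 = np.roll(img, 5)          # a real five-pixel shift: a single 1.0

print(approx_5)                   # spikes, not a shifted image
print(true_5)
```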

The way to view this second-order term is as a third-order tensor, because at each pixel you have, in our case, a five-by-five matrix. What I would like to do is make it so that the second-order errors are discounted, or not paid as much attention to, if those second-order errors are in directions that the within-class covariance is ignoring anyway. So, what we do is decompose the within-class covariance, get the components of this tangent in each of those directions, and then simply weight those components of the final within-class covariance by the inverse square root of the— [off microphone.] Now we have the sense that, if there is a second-order error but it is in a direction that the within-class covariance is totally ignoring, then we can go much farther out. Again, we used this with the constrained optimization as before.

Okay, now just a couple of notes. The trick that makes this thing work quickly is the fact that, when you do the smoothing and then take derivatives, you can transfer the derivatives to the kernel. Then you can use the fact that, in this case with the shift, you don't get a spatially varying kernel; in other cases, especially when you are doing the second-order derivatives, you do get spatially varying kernels, and that is hard to do with the FFT. So, we simply pre-process the image by multiplying by that spatially varying term, and then we can do the FFT. That is what makes this work very quickly.

I guess that is all the slides I have here. I had more in here. The main points that I wanted to close with are that this method of modeling data is generally applicable. We are not interested, for a couple of reasons—legal reasons— [off microphone.]
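[The trick of transferring the derivative onto the smoothing kernel can be sketched in one dimension. This is my own minimal construction, not the project's code: both the circular central difference D and circular convolution are shift-invariant linear operators, so they commute, and D(k * I) = (Dk) * I exactly, with the convolutions done by FFT.]

```python
import numpy as np

n = 256
sigma = 4.0
x = np.minimum(np.arange(n), n - np.arange(n)).astype(float)  # circular distance
k = np.exp(-x**2 / (2 * sigma**2))
k /= k.sum()                         # periodic Gaussian smoothing kernel

img = np.zeros(n)
img[100:130] = 1.0                   # a simple 1-D "image"

def conv(a, b):
    # Circular convolution via the FFT.
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def D(a):
    # Circular central difference.
    return (np.roll(a, -1) - np.roll(a, 1)) / 2.0

smooth_then_diff = D(conv(k, img))   # smooth the image, then differentiate
diff_kernel = conv(D(k), img)        # differentiate the kernel, then convolve

# Agreement to machine precision: the derivative has been transferred
# to the kernel, so it can be precomputed and applied with one FFT pass.
print(np.max(np.abs(smooth_then_diff - diff_kernel)))
```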
I might feel bad about—I don't think I am going to do this, but if I created something that was perfect at facial recognition, maybe I would have a conscience issue. At any rate, this is something that is applicable to a very wide range of data. It is fast to compute, and it is headed toward the use of knowledge that we have about what sorts of transformations preserve identity or classification— [off microphone.]

The things we want to do next would be, first of all, to add the tangent modifications to the angle-based metric. Also, to look at more detailed, or more difficult, transformations; three-dimensional rotations would be much more difficult. Lighting is also an issue. Thank you.

MS. KELLER-McNULTY: Are there some questions while John is getting set up?

AUDIENCE: How do you find the— [off microphone.]

MR. VIXIE: We let the CSU test bed do that, simply because we wanted our comparison to be absolutely unbiased with respect to the other test cases. So, we used their algorithm. I didn't take care of that part. My job was the code to compute all the derivatives and second-order terms. I didn't do that piece of it or recode that piece of it.

MS. KELLER-McNULTY: Other questions? Do you want to make a comment on the data fusion, data assimilation question earlier?

MR. VIXIE: Yes, somebody asked about data fusion. Well, classically, when I think of data assimilation, I think of Kalman filtering, something where you have a dynamic process and you want to get back to a state space, but that state space is not fully

measured. I have a ten-dimensional space and I am taking one-dimensional measurements, and the idea is that over time I build up those one-dimensional measurements, and I know about the dynamic connections. Then I can get back to the state; I recover the state. When I think of assimilation, I think of that. In fact, I think that is what is commonly meant when you say data assimilation. You can correct me if you know better. Data fusion could include something like that. In essence, to do fusion, you might approach it this way, where you have an idea that there is some big invariant state space that is hidden, and fusion is enabling you to get back to that hidden space.

AUDIENCE: [Off microphone.]

MR. VIXIE: Fundamentally, there is no difference in the way I stated it. I think the way people often think about it, and what you find written about it, is quite different. I mean, I have a way of thinking about it that makes sense to me.
