

ENSEMBLES OF MODELS: SIMPLICITY (OF FUNCTION) THROUGH COMPLEXITY (OF FORM)

TRANSCRIPT OF PRESENTATION

MR. ELDER: I am going to talk today on two topics: ensembles of models, that is, combining competing models to improve accuracy, and then the complexity of doing that. Combining models tends to help on new data, which runs counter to the intuition, or religion, that simplification is important for generalization. So, I want to explore that. Combining models almost always improves generalization accuracy. The mechanisms for this aren't completely known, but there are a lot of intuitive reasons for it. I will mention a couple of methods of combining models, which I call bundling. You will notice that all good methods have to start with the letter "B." The question it raises, though, is why, if you take a model that is relatively complex already, and perhaps even slightly overfit based on your experiments, and combine it with other models of a similar ilk, you don't get even further overfit.

It certainly goes against the minimum description length principle for model selection, which I find very elegant and beautiful. That principle relates model selection to the communications world: I have to transmit an output variable to you. We all have the same input variables, so I can send the model as a signal and then send the noise, the corrections. Since the error distribution of the noise, which has to be completely described, is tighter given a good model, if I am able to describe the gist of the data very well, then I only have to send a little bit of correction for you to reproduce the data. However, if my model is not very good, the error component will be much larger.
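As a worked statement of the principle the speaker is paraphrasing, the usual two-part description length can be written as follows (this is the standard textbook form, not an equation from the talk):

```latex
\underbrace{L_{\text{total}}}_{\text{bits transmitted}}
  \;=\;
\underbrace{L(M)}_{\text{bits to describe the model}}
  \;+\;
\underbrace{L(D \mid M)}_{\text{bits to describe the errors given the model}}
```

A good model shrinks the second term, because the residuals are tightly distributed and cheap to encode, at the cost of a larger first term; MDL selects the model that minimizes the sum.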

So, this is very appealing, and it gets people into counting bits in terms of how to describe trees and so forth. Model ensembles really mess with this whole thing, because they are very difficult to describe and yet generalize well. So, that is a puzzle. I think an answer is to measure complexity by something other than the form of the model; if it looks complex, it isn't necessarily so. Look at the function of the model, at its behavior rather than its appearance. One of the metrics that does this was recently introduced by Jianming Ye, and I want to describe that briefly, if I can. Then I will give a very simple experimental example to look at how it works out in practice, and conclude with some factors that affect complexity and, therefore, generalization. So, you can see I am still wedded to the Occam's razor viewpoint, which has been under attack recently, and for good reason. Let's see if this is a partial answer to that.

Here is an experiment that Steven Lee at the University of Idaho and I performed, using five different algorithms on six different well-studied problems. Most of these data sets have fewer points in them than they have papers written about them, so they are, in some sense, not necessarily representative. In particular, the investment problem there on the right is completely made up. They are ordered on the x-axis in terms of increasing difference, in absolute terms, between the estimates. So, there is a higher variance of differences on the right-most data set than on the left-most. What I have plotted are the relative errors. So, the very worst would be near one at the top, and the very best would be at the bottom. We can see that this selection of five algorithms, which are not my five favorite algorithms (they are the five we could get hold of in S at the time), are nevertheless very different. We see that the age-old question of which algorithm is best really has just the kind of answer that we want: it depends on the problem. From this small set of data, probably the neural nets would be the one. Again, this is on out-of-sample data, so this was after the fact. I haven't shown here what we knew about the training data beforehand. After the fact, it looks like neural nets, the red line, is closest to the zero line in terms of relative error. You can see that most techniques are competitive in at least one of the problems.

The reason we wanted to do all this was to introduce a way of combining models. So, we looked at that, but then had to swallow our own medicine and look at simpler versions of it. We were introducing a technique called the advisor perceptron, and that is the red one here, and it does very well. Simple model averaging, the yellow line, although it is outperformed on each of the problems, still does just fine, and it is a whole lot easier.

So, our point was to try to introduce a new technique, and it ended up looking like splitting hairs compared to the big issue, which is that virtually any reasonable method of model combination helps. This is on the same scale as the previous slide, and you can see that the error variance, if you will, of choosing any one particular favorite technique is much greater than that of building five models and averaging the outputs together; you get a much more robust answer. In fact, there was no pre-filtering on the individual models. They didn't have to reach any kind of performance threshold to be included in the average. They were just all averaged together. So, this is exciting. It is not five times more work to build five models. It is maybe 10 percent more work, because all the work is getting the data ready, and the fun part is actually running the models. So, you get to do more fun things, and you get five times as much output, which is good when you have a client or boss, and yet your answer is better and you avoid the hard choice of which technique to use.

Now, I think this isn't done much because we tend to specialize; people become experts in a particular technique and don't really feel confident in other techniques. Perhaps this problem has been alleviated somewhat by the commercial tools, which keep improving and make the learning curve a little easier. It is certainly easier than it used to be. I remember when MARS was freeware, it took about a month to get it up and running. Now it is a commercial product, and I would rather pay a few thousand dollars than spend the month. Only graduate students would rather spend the month than pay real money; they don't realize their time is money, even for graduate students. The good news is that I think it is going to be possible for people to run more algorithms on the same problem with 5 or 10 percent more effort. That is the fun part, and the payoff is really good. So, that is a plug for bundling.
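As a minimal sketch of the kind of combination being advocated here, the snippet below fits a few very different learners and averages their out-of-sample predictions with no pre-filtering. The particular models, the synthetic data, and the scikit-learn API are illustrative assumptions, not the tools used in the talk.

```python
# Sketch: unweighted averaging of predictions from several different model families.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [
    LinearRegression(),
    DecisionTreeRegressor(max_depth=5, random_state=0),
    KNeighborsRegressor(n_neighbors=5),
    MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
]

# Fit each model and collect its out-of-sample predictions.
preds = np.column_stack([m.fit(X_train, y_train).predict(X_test) for m in models])

for m, p in zip(models, preds.T):
    print(type(m).__name__, mean_squared_error(y_test, p))

# The "bundle": a plain average of the individual predictions.
ensemble = preds.mean(axis=1)
print("Average of all models", mean_squared_error(y_test, ensemble))
```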

Just stepping back, bundling consists of two basic steps: first, build a variety of models, and second, combine them in some way. I have shown here that there are a lot of different options for how you can do those two steps. You can use different case weights, which is a generalization of bootstrapping, or you can modify the data values. You can run with the same data and change your guidance parameters, or you can use different subsets of the variables. So, you can think of data coming from different sources, building models that are specialized to each source, and combining them later, without having to merge the entire data sets and start from scratch. That is a possible way. Then, to combine them, you can vote, or weight, or do other fancy things, or partition the space and have models that are experts in one area gradually hand over their authority to models that are experts in overlapping areas. Now, all of these have been attempted. I am just going to list a handful of the major ways of combining models, and we have some experts here in a number of these areas. I also want to mention that Greg Ridgeway, from RAND, and I did a tutorial on this, and the notes from it are available on our web site.

Perhaps the one that has captured the most attention, bagging, is also perhaps the easiest one. It is just bootstrap aggregating: you take the same set of data, build bootstrap replicates of the data, which are certainly easy to do, build models on those replicates, and then aggregate them together, or vote.
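A minimal sketch of the bagging recipe just described, bootstrap replicates of the training set, one tree per replicate, predictions averaged. The helper names, the tree settings, and the use of scikit-learn are my own illustrative choices, not anything shown in the talk.

```python
# Bagging sketch: bootstrap-aggregate a handful of regression trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, n_models=5, max_leaf_nodes=8, seed=0):
    """Fit one tree per bootstrap replicate of (X, y); X and y are numpy arrays."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)          # sample n rows with replacement
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        models.append(tree.fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """Aggregate by averaging the individual trees' predictions (vote for classification)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```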

One that is much more sophisticated and complex is boosting. You build a model, and the accuracy of that model is evaluated for all the particular data points. Initially, every observation starts out with the same weight. The observations that you are able to nail get very low weight, and the ones that you have errors on get more weight, and you do a second pass. Then, the things you still have trouble with keep gaining weight, and the things you get right keep losing weight, and you get these obsessed models, as you build these stages, that are focused more and more on fewer and fewer things. The things that you do right are sort of taken for granted. It is really like marriage: the socks on the floor become really important. These models are a little crazy. You don't use that final obsessed model; you use a weighted sum of the whole sequence of models, and this, astonishingly, has very good performance. It wasn't invented by statisticians; they would never come up with anything quite that crazy. But a number of good statisticians looked at it after the fact to see why it is working, and Jerry Friedman, Trevor Hastie, and Rob Tibshirani have done some good analysis of that. Once you analyze why something works, you can start to twiddle with it, change parameters, and get papers out of it. So, that is a good thing.

There are a lot of different bundling techniques. Again, the distinction between them is probably less important than just doing them. My favorite way of generating variety, which is still rather rare, is to use completely different modeling methods. This is a surface representation of a decision tree. This is a nearest-neighbor surface representation, where a data point looks for its closest known point by some metric and takes its answer as the estimate; that gives you surfaces similar to a decision tree's, but with convex regions rather than rectangles. This is a piecewise planar approximation method; this is a polynomial network, which is similar to a neural network but smooth and differentiable; and then this is a kernel estimation surface. So, you can see that the strengths, the vocabularies, of these different methods are different, and they have different types of problems, or different regions of the same problem, that they will work well on. So, different modeling methods are a good source of diversity.
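Returning to the reweighting scheme described a moment ago, here is a minimal AdaBoost-style sketch: observations start with equal weight, misclassified points gain weight at each stage, and the final answer is a weighted vote over the whole sequence rather than the last "obsessed" model. This is the textbook algorithm offered as an illustration, not a transcription of anything shown in the talk; labels are assumed to be -1/+1 and decision stumps serve as the weak learner.

```python
# AdaBoost-style reweighting sketch; y is a numpy array with values in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=20):
    n = len(y)
    w = np.full(n, 1.0 / n)              # every observation starts with the same weight
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # Points we got wrong gain weight; points we got right lose weight.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # A weighted vote over the whole sequence of models, not just the final one.
    score = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(score)
```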

One more example in favor of bundling: we did a fairly sophisticated, long-term project on credit scoring performance, and I am showing just five of the dozen or so techniques that we utilized. Here is a tree. Trees are scalable, they can handle categorical variables and real variables, they are fast, and they are interpretable, at least in very small dimensions. They only have one problem: accuracy. A tree is like a really well-written book about a very uninteresting subject. So, trees alone are up here somewhere, but when you bundle them, they suddenly become competitive with much more complex techniques, though still the worst of these five.

What we did next was to try all pairwise combinations of these methods. This is out-of-sample performance, so you wouldn't necessarily have known beforehand to say, well, sure, let's employ MARS and neural nets together and we will do better than either one. This is the distribution of the pairwise performance, and then, of course, the three-, four-, and five-way performance. You can see that there is a trend here. The single best model is a combination of the three network methods. It is also possible for a combination, in this case stepwise regression, neural nets, and trees, to give you something worse than any of the individual models. So, it is not a silver bullet, but statisticians, or statistician wannabes, as I am, look for the trend, for the thing that will help you probabilistically. Just to highlight that, I will show the box plots of these distributions as the degree of integration of the models (in this case, just averaging the outputs) increases. You can see that the mean and the median of the error improve with each additional model that is combined.
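As a sketch of the all-pairwise (and higher-order) averaging experiment just described, the helper below scores every k-way average of a set of already-fitted models on a held-out set. The function name, the assumption that each model exposes a predict method with distinct class names, and the error metric are my own illustrative choices.

```python
# Evaluate every k-way average of a set of already-fitted, diverse models.
from itertools import combinations
import numpy as np
from sklearn.metrics import mean_squared_error

def combination_scores(fitted_models, X_test, y_test, max_k=None):
    """Return {(model names): out-of-sample error} for every subset of size 2..max_k."""
    names = [type(m).__name__ for m in fitted_models]   # assumes distinct class names
    preds = {n: m.predict(X_test) for n, m in zip(names, fitted_models)}
    max_k = max_k or len(names)
    scores = {}
    for k in range(2, max_k + 1):
        for subset in combinations(names, k):
            avg = np.mean([preds[n] for n in subset], axis=0)
            scores[subset] = mean_squared_error(y_test, avg)
    return scores
```

Plotting the scores grouped by subset size reproduces the kind of box plots described above, with the distribution of errors typically tightening and improving as more models are averaged in.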

We took this out to a dozen or so models. Obviously, at some point it saturates and there is some noise, but the trend was definitely a good one, and it is one that I have seen over and over. Of course, it is a great research area to ask under what criterion you should include a model in a bundle, and how you can estimate error rates and so forth; that is very much an open issue.

Well, this brings up the whole problem of Occam's razor, which has basically been taken to mean that simplicity is going to be a virtue out of sample. Pedro Domingos, here, got the best paper award a few years ago at the Knowledge Discovery and Data Mining conference for highlighting failings of the razor, failings of this assumption. Besides the performance of various kinds of model ensemble techniques, he showed that if you build an ensemble and then approximate it with a single model, that single model is more accurate than the same kind of model built directly from the data to start with. There is also a perverse result from 1996, whose paper I haven't had the guts to read, in which taking trees that were perfectly fit and then grafting excess nodes onto them apparently improved generalizability. That makes no sense. I would also point out some work by David Jensen on how much of overfit isn't really over-parameterization but excessive search. This brings up a number of issues with respect to model selection and model complexity. Typically, what has been done most often to control complexity is to count and penalize terms.

That certainly works for regression, where the number of terms equals the number of degrees of freedom the model is using, and you can use generalized cross-validation, which combines the error with a penalty based on the number of parameters. As complexity goes up the training error improves, and you look for the best trade-off between the two and select a model at that point. It is fast, and it allows you to use all your data for training instead of having to reserve some. It has long been known, though, that single parameters can have fewer or greater effective degrees of freedom, and I have included some references here from the 1970s and 1980s. For instance, with a neural net you can often have more parameters than you have data points and yet not necessarily be overfit; the net is not making full use of those parameters. The parameters themselves are weak; they don't have full power. Conversely, you can have a tree or MARS or something like that with the equivalent of three or four degrees of freedom for every parameter, because there is a lot more search going on, a lot more flexibility. So, counting terms is certainly not the whole story.

There is also the search: if you look at a model, you don't know how much work went into getting that model. If you have a three-term model, did you look at 10,000 variables to choose those three, or 10 variables? That makes a big difference. Hjorth, in 1989, said that the evaluation of a selected model cannot be based on that model alone, but requires information about the class of models and the selection procedure. So, we find ourselves in a situation where, if we truly want to evaluate the complexity, that is, the degree of overfit, we have to take the model selection process into account in coming up with a model selection metric. We need to take into account the extent of the search over model space, how thorough the algorithm was. In fact, a number of people have hypothesized that, well, we know that greedy stepwise regression is suboptimal, but it helps us avoid overfit; if we looked at all optimal subsets, that would be more likely to overfit. That is actually false, I believe, but it acknowledges that the thoroughness of the algorithm's search is a factor in overfit. How diverse the inputs are and how many there are matter, as do the power of the parameters and, often overlooked, the degree to which the problem itself is easy or hard. If there is clear structure in the data, a number of adaptive techniques will latch onto it; if it is much harder to find, there is much more chance for mischief. So, a model selection technique will have to be empirical and involve resampling.
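For reference, one standard form of the generalized cross-validation score mentioned above, for a model with p parameters fit to n observations, is shown below. This is the usual reference formula, not one taken from the talk, and other variants exist.

```latex
\mathrm{GCV}(p) \;=\; \frac{\tfrac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2}{\bigl(1 - p/n\bigr)^2}
```

The denominator inflates the training error as p grows, so the score penalizes complexity without holding out data.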

We may be able to get back to a place where we can do some of that, but we lose speed. We may be able to regain some of it if we do some early estimations and refine our complexity-parameter approach from that. So, it is heavily computer-dependent, but we may be able to do it just in some early stages and more intelligently select models from the vast search that data mining involves. In data mining, you come up with a final model and people interpret it. I always have trouble with that, because you have looped over a finite data set with noise in it. You have searched over billions of possible combinations of terms. The one that fell to the bottom, the one that won, just barely beat out hundreds of other models that had perhaps different parameters, and the inputs themselves are highly correlated. So, I tend to think of models as useful, and I try not to make too much of their interpretation, but I know that is not the case with everyone here, in terms of the importance of interpretability.

Well, there are a couple of metrics that have these properties. Jianming Ye introduced generalized degrees of freedom, where the basic idea is to perturb the output variable, refit your procedure, and then measure the changes in the estimates, with the idea that the flexibility of your process measures how truly complex it is. If your modeling process is just to take a mean, that is going to be fairly inflexible. It is certainly more subject to outliers than a median or something, but it is a whole lot less flexible to changes in the data than a polynomial network or a decision tree or something like that. So, if your modeling procedure can respond to noise that you inject, and responds very happily, then you realize you have an overfit problem.

This reminds me of when I was studying at the University of Virginia. One of the master's students in my group was working with some medical professionals across the street at the University of Virginia hospital. They were trying to measure heart output strength for young kids from very simple-to-obtain features, like the refresh rate of your skin when you press it (that is not the technical word for it, but how quickly it becomes pink again after you squeeze it), or the temperature of your extremities, and so forth, to see if they could get the same information that you could get through catheters, which are invasive and painful. They determined that they could, but I remember a stage along the way when he was presenting some intermediate results on an overhead. This is one of the dangers of graphs on overheads. The nurse and the head nurse and the doctor were going over it and saying, "I see how temperature rises and you get this change." My colleague realized, to his horror, that he had the overhead upside down and backwards. He changed it, and it took them about five seconds to say, "Oh, I see how that works, too. It is completely the opposite."

So, we want to believe, and if we are extremely flexible in response to changes in the data, then maybe we are too flexible, too easily fooled, too much overfit. Another method that I won't look at in detail, but will just mention, is Tibshirani and Knight's covariance inflation criterion. Instead of randomly perturbing the outputs, they shuffle the outputs and look at the covariance between the old and new estimates. The key idea in both is to put a loop around your whole procedure and look at its sensitivity. The first person I heard doing that was Julian Faraway, at an Interface conference in 1991, with his regression analysis tool, RAT. At the time, a two-second analysis took two days to get resampling results on, but I thought it was a very promising approach. I saw him later; he was one of those statisticians trapped in a math department, so he had gone away from doing useful things toward doing more theoretical things, out of peer pressure. Anyway, I was excited about this, and it was, I think, a good idea.

So, let me explain a little bit about what generalized degrees of freedom are. With regression, the number of degrees of freedom is the number of terms. If we extrapolate from that, you count the number of thresholds in a tree, or the number of splines in MARS, or something like that. People have noticed that the effect can be more like three degrees of freedom per term, as I mentioned before, for a spline, or even less than one for some particularly inefficient procedures. If, instead, we generalize from linear regression in a slightly different way, we notice that the degrees of freedom are also the trace of the hat matrix, which is the sensitivity of each fitted value to changes in its corresponding observed output. Then we can use that as a way of measuring: we can empirically perturb the output and refit the procedure. This is similar to what is done with splines, I understand.
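To make the generalization explicit: for linear regression the degrees of freedom equal the trace of the hat matrix, and generalized degrees of freedom extend that trace to an arbitrary fitting procedure by measuring the sensitivity of each fitted value to its own observed output. This is essentially Ye's definition stated informally; the notation here is mine.

```latex
\mathrm{df}_{\text{regression}} \;=\; \operatorname{tr}(H) \;=\; \sum_{i=1}^{n} \frac{\partial \hat{y}_i}{\partial y_i},
\qquad
\mathrm{GDF} \;=\; \sum_{i=1}^{n} \frac{\partial \hat{\mu}_i(\mathbf{y})}{\partial y_i},
```

where H is the hat matrix of the linear fit (so that the fitted values are Hy), and mu-hat_i(y) is the fitted value the entire modeling procedure produces for observation i when trained on outputs y. The partial derivatives are estimated, in expectation over added noise, by the perturbation experiments described next.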

So, put a loop around the process. That is nice, because the process itself can be a whole collection of procedures: outlier detection, all sorts of things can be thrown into this loop. Graphically, we have a modeling process, we have inputs, and we have an output variable. We add some noise to the output, and we record the perturbed output variable and the new forecast based on it. I kind of like the fact that the output, labeled ye, also spells the name of the fellow who came up with it. So, say you have 100 observations. I know that is a heretically small number for this gathering. You have 100 observations and 50 perturbations. You record each perturbed output and the corresponding change in the estimate, and you look at the sensitivity of the change in the estimate to the change in the output. If I were doing it, I would just naturally calculate a different number for each perturbation experiment. I do want to point out that Ye, I think, had a good idea: he took the matrix and sliced it the other way, saying, I am going to look at the effect on observation one of the changes across the experiments, which seems to be a little more robust than summing everything up within a single experiment. Also, interestingly, you can then assign complexity to individual observations. I excluded that graph from this talk, but if you are interested, I can show, on a sample problem, where we did that.

I will conclude, in these last few minutes, with a very simple problem. This is a decision tree, and this is the decision-making mechanism for the data, so naturally a tree is going to do pretty well on it. It is amazing how many test problems for trees involve trees as their source.
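A minimal sketch of the perturbation recipe just described: add small noise to the outputs, refit the whole procedure, and, for each observation, regress the change in its fitted value on the noise added to it across the experiments; the per-observation slopes sum to the GDF estimate. The fit_predict callable, the noise scale, and the helper name are assumptions made for illustration, not Ye's or the speaker's code.

```python
# Empirical generalized degrees of freedom by output perturbation.
import numpy as np

def estimate_gdf(fit_predict, X, y, n_perturb=50, sigma=0.5, seed=0):
    """fit_predict(X, y) must refit the entire modeling procedure on (X, y)
    and return its fitted values for X.  Returns the estimated GDF."""
    rng = np.random.default_rng(seed)
    n = len(y)
    base_fit = fit_predict(X, y)
    deltas = np.empty((n_perturb, n))
    shifts = np.empty((n_perturb, n))
    for t in range(n_perturb):
        delta = rng.normal(0.0, sigma, size=n)        # perturb the outputs
        deltas[t] = delta
        shifts[t] = fit_predict(X, y + delta) - base_fit
    # For each observation i: slope of its fitted-value change vs. its own
    # perturbation, across the experiments.  GDF is the sum of these slopes.
    slopes = np.empty(n)
    for i in range(n):
        slopes[i] = np.polyfit(deltas[:, i], shifts[:, i], 1)[0]
    return slopes.sum()
```

Used with a single tree, a bagged ensemble, or a regression as fit_predict, this is the kind of quantity compared across methods in the experiments that follow.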

Here is where we have added noise to it. The noise is at a level where it pretty much obscures the smallest structural features, but doesn't obscure the others. So, there are some features that are easy to discern and others that are harder, and that seems to be picked up very nicely by the GDF metric. Now, out of this surface we have sampled 100 training samples, and there are other experiments with 1,000 and so forth. I am just going to show you the very initial experiments and make only a few points; this is very much ongoing. From those samples we built single trees, and we also bootstrapped and put five trees together, grown to different depths: they either had four leaf nodes or eight leaf nodes in this example.

You can see that, if we build five four-leaf trees and put them together, you still get a tree; you still get something that could be represented as a single tree, though it would look rather complex. I can show what that looks like if you are interested, but you can see here that the bagging procedure produces gentler stairsteps than a raw tree. That smoothness tends to be good for generalization. So, bagging helps to smooth trees, and that is, I think, the major reason for the generalization improvement. When you go to more complex trees, eight leaf nodes means eight estimation surfaces, and there are only five in this data. So, four leaves is somewhat under-powerful and eight is over-powerful: the four-leaf version completely misses the smallest piece of structure up here, while the eight-leaf version picks up on it but also picks up on another structure that isn't there. So, it is the bias-variance type of trade-off.

Here, in essence, is the end result. We also did experiments with eight noise variables added in. The tree structure depends on two variables, but now you have thrown in eight distracting noise variables, so how does that affect things? Then we have, on this slide, the regression as well. As the number of parameters increases, the measured generalized degrees of freedom for regression is almost exactly what Gary would tell you: it is basically one for one. Interestingly, with the trees, a single tree, this purple line here, has an estimated rough slope of four, meaning that, on this problem, each split chosen by the tree algorithm has roughly the descriptive power of four linear terms; it is using up the data at four times the rate of a linear term. If you bag those trees, it reduces to a slope of three. If you add in the noise variables, it increases the slope to five, and then, if you bag the trees built with the noise variables present, it brings it back down to four.

So, just to summarize: adding in distracting noise variables increases the complexity of the model. The model looks exactly the same in terms of its structure, its size, and how long it takes to describe, but because it looked at eight variables that could only get in its way, it was effectively more complex. The other point is that bagging reduces the complexity of the model. So, the bagged ensemble of five trees is less complex than a single tree, and that is consistent with the Occam's razor idea that reduced complexity will increase generalizability after you reach that sort of saturation point.

To summarize, I have a lot of little points to make. Bundling almost always improves generalization, and I think that using different model families is a great source of the diversity you need for bundling. If we measure complexity as the flexibility of the procedure, then we partially revive, or partially answer, some of the problems with the general intuition that complexity is related to overfit. So, the more a modeling process can match an arbitrary change made to its output, the more complex it is by this measure. Complexity clearly increases with distracting variables. These are variables that are considered during the model search but may not appear at all in the final model. Actually, a variable would have to appear at least somewhat to affect the behavior, but the size of the model could be the same: two candidate models could have the exact same number of parameters and so forth, and yet one could be more complex because of what it looked at. Complexity is also expected to increase with parameter power and with the thoroughness of your search, and to decrease with the use of priors and shrinking, and with clarity in the structure of the data. In our early experiments, I certainly thought the complexity would decrease as you had more data. That seems to be the case with some methods and not with others; it seems to actually increase with decision trees and decrease with neural nets, for instance. By the way, neural nets on this same problem had about 1.5 degrees of freedom per parameter. It could be that local methods somehow become more complex with more data, and methods that fit more global models do not. So, that is an interesting question. Model ensembles usually have effective complexity less than that of their components, by this empirical measure. The nice thing about GDF is that you can now more fairly compare very diverse procedures, even multi-step procedures, as long as you can put a loop around them. That concludes my talk. Thank you.

MS. KELLER-McNULTY: Questions?

QUESTION: I have one. When you talk about the complexity of flexibility, do you mean robustness? I am confused about the local methods being more complex.

MR. ELDER: I am, too. It is confusing. The idea behind GDF, the idea behind these empirical measures of complexity, is that you want to be able to measure the responsiveness of your procedure. Obviously, you don't want something that, every time you change the task a little bit, is good for that one thing too.

You know, it is like the universal health oil: I really have hair problems; well, it is good for that, and it is good for your liver, too. Obviously, it is overfit. Where is that trade-off? So, the measure is: if you arbitrarily change the problem, or change it a little bit, how readily does the procedure recover the same accuracy and so forth? Again, it was a little bit confusing to me, but the trees, once they partition the data, are only concerned about their own little area; they are not paying attention to the big picture. Models that fit a global form use data from everywhere, and that tends to rein them in a little bit. That is my best guess. It is certainly an interesting area.

QUESTION: Do you have any extrapolation from your results? [Comments off microphone.]

MR. ELDER: The question is how things might change with many variables and massive data sets. I would like to revisit a problem we did. We participated in the KDD Cup challenge, I guess, two summers ago, and that had 140,000 variables, which is a lot for us, but only 2,000 cases. The final model, the one that won the contest, used three variables. Well, three variables seems like a simple model, but it was a huge search. It would be nice to put that under this microscope and see what happens. Again, my intuition was that more data would actually help reduce the complexity, but it is kind of wide open.

AUDIENCE: [Question off microphone.]

MR. ELDER: Here, I have just talked about combining the estimates, and some techniques need a little squeezing to get them into the right shape. Take a tree: you might need to do some kind of robust estimation. If it is a classification problem, you might get a class distribution and use that to turn the output into a real value so that you can average it in. What I haven't talked about is that there are whole areas for collaboration among these different techniques, which we have explored in the past. For instance, neural nets don't typically select variables; they have to work with whatever you give them. Other methods, like trees, stepwise regression, and polynomial regression, select variables as a matter of course. Typically, you can do better if you use the subset they select when you hand the problem off to the neural nets, and some methods are very good at finding outliers. There is a lot more collaboration that can go on between the methods. Up here, I talked about them just in terms of their final output, but yes, you sometimes do have to work to get them into a common language.

AUDIENCE: [Question off microphone.] I am not sure how what you are talking about here answers that.

MR. ELDER: Good point. Pedro's point is that the criticisms of Occam's razor work with any measure of complexity. The one that bothered me the most is about bundles: they are obviously more complex, they just look more complex, and yet they do better. How can that be? In fact, it is very easy to get many more parameters involved in your bundle than you have data points and not have overfit. They are not all freely, simultaneously modified parameters, though, so there is a difference.

Under this measure, the things that are more complex in terms of GDF tend to overfit more than things that aren't. So, it is a vote on the side of Occam's razor. I haven't addressed any of the other criticisms, but in the experiments we have done here, it is more like you would—

AUDIENCE: Like, for example, when you approximate an ensemble with one model back again, this model is still a lot more complex, because a lot of this complexity is apparent. It just seems that you still get the more complex models in that.

MR. ELDER: But it is apparent complexity.

AUDIENCE: Yes, but when you replace the apparent complexity with the actual complexity, not just counting the number of parameters. [Comment off microphone.]

MR. ELDER: I certainly would be eager to see that. Thanks.

