Statistical Analysis of Massive Data Streams: Proceedings of a Workshop

David Scott, Chair of Session on High-Energy Physics
Introduction by Session Chair
Transcript of Presentation

BIOSKETCH: David Scott is the Noah Harding Professor of Statistics at Rice University. He earned his BA in electrical engineering and mathematics in 1972 and his PhD in mathematical sciences in 1976, both from Rice University. Dr. Scott’s research interests focus on the analysis and understanding of data with many variables and cases. The research program encompasses basic theoretical studies of multivariate probability density estimation, computationally intensive algorithms in statistical computing, and data exploration using advanced techniques in computer visualization. Working with researchers at Rice, Baylor College of Medicine, and elsewhere, he has published practical applications in the fields of heart disease, remote sensing, signal processing, clustering, discrimination, and time series. With other members of the department, Dr. Scott worked with the former Texas Air Control Board on ozone forecasting, and currently collaborates with Rice Environmental Engineers on understanding and visualization of massive data.

In the field of nonparametric density estimation, Dr. Scott has provided a fundamental understanding of many estimators, including the histogram, frequency polygon, averaged shifted histogram, discrete penalized-likelihood estimators, adaptive estimators, oversmoothed estimators, and modal and robust regression estimators. In the area of smoothing parameter selection, he has provided basic algorithms, including biased cross-validation and multivariate cross-validation. Exciting problems in very high dimensional data and specialized topics remain open for investigation.

Dr. Scott is a fellow of the American Statistical Association (ASA), the Institute of Mathematical Statistics, and the American Association for the Advancement of Science, and a member of the International Statistical Institute. He received the ASA Don Owen Award in 1993. He is the author of Multivariate Density Estimation: Theory, Practice, and Visualization. He is currently editor of the Journal of Computational and Graphical Statistics. He is past editor of Computational Statistics and was recently on the editorial board of Statistical Science. He has served as associate editor of the Journal of the American Statistical Association and the Annals of Statistics. He has also held several offices in the Statistical Graphics Section of the American Statistical Association, including program chair, chair-elect, chair, and currently past chair.
Transcript of Presentation

MR. SCOTT: This is a very statistically oriented committee, but we were very much interested in bringing in research scientists to help us understand the data opportunities out there, and to bring a good description of problems that might be available for research. Certainly, our second topic today falls into this category in a nice way. It deals with high-energy physics. We have three outstanding speakers, two physicists and a computer scientist, to lead us in the discussion.

We want to remind everybody that we intend, in some sense, to have a question or two during the talks, if possible, as long as it is for clarification. Hopefully, in the discussion at the end, when we come back together, you will have a chance to express your ideas as well. We would like to capture those. I am editor of JCGS and again, I would like to extend an invitation to the speakers to consider talking with me about putting together a small research article for the journal, a special issue of the journal, later this year or next year.

With that, I would like to turn it over to our first speaker, who tells me that, like all physicists, he has traveled all around the world. He is now at Berkeley.
Robert Jacobsen
Statistical Analysis of High Energy Physics Data
Transcript of Presentation and PowerPoint Slides

BIOSKETCH: Robert Jacobsen obtained a BSEE from Massachusetts Institute of Technology in 1978. He spent 1976 through 1986 working in the computer and data communications industry for a small company that was successively bought out by larger companies. He left in 1986 to return to graduate school in physics, obtaining his PhD in experimental high energy physics from Stanford in 1991. From 1991 through 1994, he was a scientific associate and scientific staff member at CERN, the European Laboratory for Nuclear Physics, in Geneva, Switzerland. While there, he was a member of the ALEPH collaboration concentrating on B physics and on the energy calibration of the LEP collider. He joined the faculty at Berkeley in 1995.
Transcript of Presentation

MR. JACOBSEN: My name is Bob Jacobsen, and I am starting these three talks. So, I am going to lay some of the groundwork for what some of my colleagues will follow with. My science is different from what you have just been hearing about because we actually have a chart which explains it all. Because we have that chart, which is really what we call a standard model, it affects how we do our science. It affects what questions are interesting and what we look at in an important way.

So, I am going to start with an introduction that actually talks about this science, the goals of it and how we study it. Then I am going to talk about a very specific case, which is something that is happening today, an experiment that has been running for two years and will probably run for three or four more years (that is a little bit uncertain): what we do and how we do it in the context of the statistical analysis of high-energy physics data. I am going to leave to you the question at the end, as to whether we are actually analyzing a massive data stream or not. Then, I am going to end with a few challenges, because I think it is important to point out that, although we get this done, we are by no means doing it in an intelligent way.

So, the standard model of particle physics explains what we see at the smallest and most precise level in terms of two things: a bunch of constituents, which are things like electrons and neutrinos and quarks, and a bunch of interactions between them, which you can think of as forces, but we tend to actually think of as particles that communicate with each other by the exchange of other particles. Depending on how you count, there are basically three types of stuff, fermions, and there are basically four forces, and we omit the fact that we don’t actually understand gravity. We just sweep that under the rug and ignore it, and we do our experiments on electricity, magnetism, the strong force and the weak force.

It is important to point out that, just as in the talks that we have just been hearing, where you can’t actually see the underlying processes that are causing weather but only the resulting phenomenon, we can’t see any of the things in that fundamental theory. What we actually see are built-up composite structures like a proton, which is made up of smaller things, and protons you can actually manipulate and play with. More complicated ones are the mesons, which are easier for us to experiment with, so we generally do, but even then, we have to use these techniques to do our experiments. We have no other instruments except the forces we have just described for looking at these particles.

So, what we do is, we interact them through things that look like this big blob on, for you, it would be the right-hand side. For example, you bring an electron and anti-electron together. They interact through one of these forces, something happens and you study that outgoing product. This particular one is the annihilation of an electron and its anti-particle to produce two particles called B’s, which is what we study. The actual interactions of those particles are what my experiments study. In general, when you put together approximately a dozen constituents, four different forces, many different possibilities for how these can happen, there is a very large bestiary of things that can happen.
Now, my wife likes to say that this is not actually science. This is a cartoon. I get a little bit defensive about that. It actually is backed up by a formal mathematical structure that lives on a little card that is about this big, which looks like that, and I am not going to go through it. The advantage that we have with this actual physical theory is that we can make predictions from it. There are a few little asterisks here, and we won’t go over the details very much, but in many cases, we can actually make predictions to the level of four significant digits of what will happen in a particular reaction.

Now, these predictions have been checked many different ways, but there are some open questions that remain. The two basic categories of this are that we have the equivalent of turbulence. We have things that we cannot calculate from first principles. They happen for two reasons. One reason is that the mathematics of this theory are, like the mathematics of turbulence, beyond us to solve from first principles. We do not yet know how to do that. We know how to model it, we know how to approximate it, we know how, in many cases, to pick regimes in which the thing simplifies to the point where we can make precise calculations. If you walked up to us with a physical situation, we may not be able to calculate the answer to that.

Further, because of the way these processes work (and this is a cartoon here very similar to the one that I just talked about, the interaction between an electron and an anti-electron), the situation is very simple at first but, as time goes on, it gets more and more complicated. It starts as a very high-energy pure reaction but, as that energy is spread among more and more particles and their reaction products, the situation becomes more and more highly dimensional, becomes more and more complicated, just like the cascade of energy in turbulence. By the time you get down to the smallest level, you are dealing with something that is inherently probabilistic. In our case, it starts probabilistic at the very topmost level. These interactions could go any one of a number of ways when they happen.

AUDIENCE: Could you tell us how much energy you are talking about?

MR. JACOBSEN: It depends on the experiment. This particular picture was made for one that was 100 giga-electron volts in the center of mass. My experiment is actually 10. So, it is many times the rest mass of the particles that we see in the final thing, and you are making a lot of stuff out of energy.

The other problem here is that, like every physical theory, there are some underlying parameters. Depending on how you count, there are approximately 18 numbers that are in this theory that are not predictable by the theory. They are just numbers that are added in. Now, we have measured them many different ways. We are a little sensitive about the fact that 18 is very large. You know, is that too many? Wouldn’t you like a theory of everything that has zero or one numbers in it? Is 18 too many? I just say, thousands of chemical compounds were originally explained by Earth, wind and fire but, when people worked on it for a long time, they discovered they really needed 100 elements. That was good, because we could explain 300 elements in terms of protons, neutrons and electrons but, when they really looked into it, those protons, neutrons and electrons of the 1930s became hundreds of mesons and particles in the 1950s. Those were eventually explained in terms of five things, the electron, the neutrino and three quarks but, when you look really into it, it turns out there are 88 of those.
The modern version of the standard model of particle physics has 88 distinguishable particles in it. So, no matter what we do, we seem to keep coming back to approximately 100. This gives us a question as to whether or not there is another level of structure. Right now, we don’t know the answer to that. The standard model doesn’t need another level of structure, but it is certainly something we are looking for. So, with these 18 parameters, we measure them. That is what we do. Is this complete? Is there anything more there? We don’t know the answer to that. We know that some measurements that were just announced over the last couple of years and just updated about a week ago require us to add at least a couple more numbers.
Well, that is bad. The theory is getting more and more complicated, but we are still explaining all of the data, and there is one part where we have experimental evidence completely lacking for whether the theory is really working the way it is supposed to. Including people who will talk to you later on in this session, we are working very hard on that issue. My current goal is to check a very specific hypothesis, which is that this theory, in its structure, as it stands today, is correct in what it describes. It may not be the entire story, but is it right? While we are at it, we are going to look for hints of how to extend it, but mostly, I am interested in how it is right.

The usual method of doing this in science, which was invented by Galileo, was that you measure things 50 different ways. If they are all explained by one theory with one underlying parameter, the first one tells you nothing, because all you have done is measure the parameter. The second one, though, should be consistent, and the third one should be consistent and the fourth one should be consistent. The theory of gravity predicts that all bodies attract to the Earth at the same acceleration. The first object you drop tells you nothing, because you have just learned what that acceleration is. It is the second one that tells you something.

So, this is an example of a whole bunch of measurements with their error bars made by different experiments of different quantities, all of which are related by this underlying theory to a single parameter. We express these as measurements of the underlying parameter, but that is not why we are doing it. Nobody actually cares what that number is. What we are trying to do is test the theory which is saying all those things are related. They are all different except for the fact that some of them are replicated from different experiments. They are described by one single underlying phenomenon, and in this case, they are disgustingly well distributed. The chi squared of all those things is about 0.05 per degree of freedom. That is a separate problem.
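The consistency check described here can be illustrated with a small sketch. It is not taken from the talk: the function name `combine` and the numbers are hypothetical, and it assumes independent measurements with Gaussian errors, which a real combination would have to justify. It computes the inverse-variance weighted mean of several measurements of one parameter and the chi squared per degree of freedom, the quantity the speaker quotes.

```python
def combine(values, sigmas):
    """Return (weighted mean, chi-squared per degree of freedom).

    Assumes independent measurements of one parameter with Gaussian
    errors; this is an illustrative simplification.
    """
    if len(values) != len(sigmas) or len(values) < 2:
        raise ValueError("need at least two measurements with matching errors")
    weights = [1.0 / s ** 2 for s in sigmas]
    mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    chi2 = sum(w * (v - mean) ** 2 for w, v in zip(weights, values))
    dof = len(values) - 1  # one degree of freedom is used up by the fitted mean
    return mean, chi2 / dof


# Hypothetical measurements of the same underlying parameter.
measurements = [0.72, 0.75, 0.69, 0.74]
errors = [0.05, 0.04, 0.06, 0.05]
mean, chi2_per_dof = combine(measurements, errors)
print(f"mean = {mean:.3f}, chi2/dof = {chi2_per_dof:.2f}")
```

A value near 1 is what correctly estimated errors would give on average; a value far below 1, like the one mentioned above, suggests the quoted errors are larger than the actual scatter.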
Okay, so, how do you study it? The problem is that we can’t actually do an experiment in the classic sense. Like astronomers or people observing weather, we can’t go out and say, I want a storm here, right now. That is not going to happen. So, it is an observational science. We collide particles in our machines to put a lot of energy in a small volume, and then we let these reactions take over and everything that is permitted by nature will eventually happen. Some might be more common than others.

The typical reaction, though, is a little bit hidden from us. If I want to study the box that I showed you earlier, I can arrange for it to happen, but I cannot see the direct result. I can’t see the interaction happen. That happens over 10^-29 of a second. I can’t even see the things that it produces, because their lifetimes are just picoseconds. They will be produced. They will go some tiny little distance, a couple of microns. They will decay into things. Some of those things will decay further into other things and, if I am lucky, I may be able to see all of these things and try to infer backwards what actually happened. I have no control over this downstream process. I have only limited control over what I can measure from the things that come out of it. There are practical difficulties. What I want to see is what actually happens inside these little blocks.

That is the introduction to high-energy physics. Now, I want to specialize in what we do. We are trying to understand why matter and antimatter are different. The world has a lot more of one than the other. In this mathematical structure of the theory, this is intimately associated with the idea of time. Particle versus antiparticle, time and parity, left-handedness versus right-handedness are all tied together in the mathematical structures in ways that have only really been elucidated in the recent past. I consider 15 years to be the recent past.

For those of you who are statisticians and, therefore, very familiar with numbers, there is a deep connection with the appearance of complex numbers in this theory that represent a wave, a propagating wave, as e^(i times something). Changing parity, which changes one direction to another, like looking in a mirror, just changes the sign of your X coordinates. Time reversal, time going forward and backwards, changes the sign of your time coordinates. These waves become their own complex conjugates under these reversals, under these symmetries.

The bottom line is that we are trying to measure parameters, not just in terms of amplitudes, but also in terms of the complex nature of them, in a theory. We are interested in measuring complex numbers that nature seems to have. There are very few constants of nature that are complex. The gravitational constant, mass of the electron, tax rate (well, maybe not the tax rate): none of these are intrinsically complex. The standard model has a 3-by-3 matrix of complex numbers that just appears in it as a mathematical construct. Because it is a matrix that is relating how things transform into others, it is inherently unitary. Everything you start with becomes one of three possibilities. That means that these three amplitudes for doing this (for example, one of these rows: this thing could become one of those three things) are the sum of these. When you actually put the mathematics of unitarity together, what you get, basically dotting these matrices into each other, is an equation that looks like that. These rows are orthogonal to each other. You can represent that as a triangle. These are three complex numbers that sum back to zero, so they are a triangle.
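The slide with the equation is not reproduced in this transcript. As a hedged numerical illustration of the statement that orthogonal rows of a unitary 3-by-3 matrix give three complex numbers summing to zero, the sketch below builds a unitary matrix from three mixing angles and one phase using the commonly quoted standard parametrization. The function names and the angle values are hypothetical placeholders chosen for illustration; they are not measured values and not the speaker's numbers.

```python
import cmath
import math


def unitary_mixing_matrix(theta12, theta23, theta13, delta):
    """3x3 unitary matrix from three angles and one complex phase."""
    s12, c12 = math.sin(theta12), math.cos(theta12)
    s23, c23 = math.sin(theta23), math.cos(theta23)
    s13, c13 = math.sin(theta13), math.cos(theta13)
    phase = cmath.exp(1j * delta)
    return [
        [c12 * c13, s12 * c13, s13 * phase.conjugate()],
        [-s12 * c23 - c12 * s23 * s13 * phase,
         c12 * c23 - s12 * s23 * s13 * phase,
         s23 * c13],
        [s12 * s23 - c12 * c23 * s13 * phase,
         -c12 * s23 - s12 * c23 * s13 * phase,
         c23 * c13],
    ]


def triangle_sides(matrix, row_a, row_b):
    """The three complex terms in the inner product of two rows."""
    return [matrix[row_a][k] * matrix[row_b][k].conjugate() for k in range(3)]


# Illustrative angles only (radians); not physical measurements.
V = unitary_mixing_matrix(0.23, 0.04, 0.004, 1.2)
sides = triangle_sides(V, 0, 2)
closure = abs(sum(sides))  # distance by which the triangle fails to close
print("sides:", sides)
print("closure:", closure)
```

Laid head to tail in the complex plane, the three `sides` return to the starting point up to floating-point rounding, which is the triangle the speaker refers to. With the phase `delta` set to zero all three sides are real and the triangle collapses to a line.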
The important point that you need to carry away from this is that these constants are controlling physical reactions. So, by measuring how fast some particular physical reaction happens, I can measure this piece or I can measure this piece or I can measure this piece. I am measuring things that go into this construct, the triangle. By measuring a very large number of reactions, I can...

AUDIENCE: What does CKM stand for?

MR. JACOBSEN: Cabibbo-Kobayashi... Charlie will now give you the names of the three physicists.

CHARLIE: Cabibbo-Kobayashi-Maskawa.

MR. JACOBSEN: The three people who built this part of the theory. That was
Let’s assume I have an x. I want to apply on it an F. I get the result, which is a y, and I want to store it somewhere, in location L, which can be my screen or can be a CERN archive. I don’t care, but eventually, I have to park it somewhere. What I think we have to keep in mind here, which traditional computing, definitely high performance computing, has ignored is that moving this data in and out is an integral part of the problem, not just computing. So, getting a very fast computation on a high-performance machine, but then waiting several weeks to get the data on and off the machine is not the solution here. So, we have to bring in the data placement activity as part of the end to end solution.

Here are sort of the six basic steps that we have to carry out in order to do this y = F(x). So, first of all, we have to find some parking space for x and y. As I pointed out earlier, how do we know how big x is? That is relatively easy. How big y is gets even trickier, because it can depend on a lot of internal knowledge. Then we have to move x from some kind of a storage element to where we actually want it. Then we may have to place the computation itself, because the computation itself may not be a trivial piece of software that resides anywhere in this distributed environment. Then, we have the computation to be done. Then, we have to move the results to wherever the customer orders us and, in the end, we have to clean up the space.

Just doing this right is tough. I can assure you that you don’t do it right, even today, on a single machine. How many of you, when you open a file or you write to a file, check the return codes of the write, whether it succeeded or not? I am sure that most of your applications will die if the disk is full, and it will take root intervention, in many cases, just to recover it. We cannot afford it in a distributed environment because it has to work on autopilot with a lot of applications.

So, what we really have here, if you think about it, is really a DAG, a simple DAG in this case, a shish kebab. Do this, do this, do this, do this. Keep in mind that we have to free the space, even if things have failed in between, which creates some interesting challenges here. This has to be controlled by the client, because you are responsible for it. Somebody has to be in charge of doing all these steps and, again, you can look at it, if you really want to, as a transaction that has to go end to end.

I think I am sort of running out of time here. Here is a list of challenges. How do we move the data? The closest we can get to it is a quota system on some machines, but even that doesn’t guarantee you that you can actually write the data when you need it.
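The six steps above, with the requirement that space is freed even when something fails in between, can be sketched as follows. This is a toy, single-process illustration and not the speaker's system: the `Storage` class, `run_job`, and the step names are all hypothetical, and a real deployment would involve remote storage and transfer services.

```python
class Storage:
    """Toy storage element with a fixed capacity (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.reserved = {}  # name -> reserved size in bytes
        self.data = {}      # name -> stored payload

    def reserve(self, name, size):
        if sum(self.reserved.values()) + size > self.capacity:
            raise OSError(f"no space left to reserve {name!r}")
        self.reserved[name] = size

    def write(self, name, payload):
        # Every write is checked instead of being assumed to succeed.
        if name not in self.reserved or len(payload) > self.reserved[name]:
            raise OSError(f"write of {name!r} exceeds its reservation")
        self.data[name] = payload

    def release(self, name):
        self.reserved.pop(name, None)
        self.data.pop(name, None)


def run_job(x, f, scratch, destination, y_size_estimate):
    """Carry out y = f(x) as six explicit steps and return a step log."""
    log = []
    try:
        scratch.reserve("x", len(x))            # 1. find space for x and y
        scratch.reserve("y", y_size_estimate)
        log.append("reserve")
        scratch.write("x", x)                   # 2. move x to the work area
        log.append("stage-in")
        compute = f                             # 3. place the computation
        log.append("place")
        y = compute(scratch.data["x"])          # 4. compute
        scratch.write("y", y)
        log.append("compute")
        destination.reserve("y", len(y))        # 5. move the result out
        destination.write("y", y)
        log.append("stage-out")
    except OSError as exc:
        log.append(f"failed: {exc}")
    finally:
        scratch.release("x")                    # 6. always free the space
        scratch.release("y")
        log.append("cleanup")
    return log


print(run_job(b"abc", bytes.upper, Storage(100), Storage(100), 3))
```

The `finally` clause is the point of the sketch: cleanup is a visible step that runs whether or not the earlier steps succeeded, rather than something buried where a failure would skip it.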
So, I already talked a little bit about it, because I want to move faster to the examples to show you that we can actually do something with all that, but the approach that we have been taking is, first of all, to make data placement a first-class citizen. That means that when you write an application, when you design a system, make sure that getting space, moving the data, releasing it, is a clear action that is visible from the outside, rather than buried in a script that nobody knows about and that, if it fails, really doesn’t help us much. We have to develop appliances that allow us to use a managed storage space in this environment in a reasonable way, and then create a uniform framework for doing it.

So, let me show you what we have been able to do so far. The first one is how we can generate simulated events, with millions and millions of simulated events.
This is sort of the high-level architecture of what we have deployed out there. So, the application is generating a high-level description of what has to be done. This is getting into what we call the Directed Acyclic Graph Manager, which is responsible for controlling it. Now, for some of you, if it reminds you of the old days of JCL, yes, it is like JCL, at the higher level, but then it goes to what we call Condor-G, which is the computational part, and we have a data placement scheduler that uses the other tools to do that.

So, I am not going to talk about that since I have four minutes, and just, well, the physicists have way too much free time on their hands, so they can generate this wonderful animation. So, here is the way it works. We have a master site. IMPALA is the CMS tool (they have an even bigger detector than the BaBar detector) that is generating the events themselves. This is the master site. Then we have all these other sites where we send out the computations, get the data back, publish it, move the data in and we keep going.
Basically, each of these jobs is a DAG like this, and then we move them all to larger DAGs that include some controls before and after, and that is the way it works. So, here is an application. On this graph, this is hours since it started, out to 900 hours. So, this is one of the CMS data challenges, and this is the number of events that we have to generate. So, a job is two months, and it has to keep going, and we have to keep generating the events. This is what we have been generating using that infrastructure.

Let me show you another example of what happens when you have to do it in a more complex environment.
That is where we are putting in the planning. So, it is the same architecture as I showed you earlier, but there is a planning box there that is trying to make a decision on when to do it, how to do it, and what are the resources we should use for this. This is based on work that is actually done at Argonne National Laboratory and the University of Chicago as part of the GriPhyN Project, and that was because of the data system that includes higher-level information about the derivations that are formally defined and, from the derivation, we create transformations, which are the more specific acts that have to be done. This is creating the DAGs, but they are not being executed by the architecture. As we go through, we go back to the virtual system where we come and say, tell me, now, what to do.
So, this is what we have to do there. We have to aggregate information. We have all these images. We have to pull them together. We have to create distributions, as I showed you, of galaxy sizes or whatever it is. So, this is sort of the DAG that is going up rather than going out. This is an example of one job. This is an example of a collection of these jobs that we are actually executing. Each of the nodes in this DAG is a job that can be executed anywhere on the grid, and this is where we start.
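The idea of a DAG whose nodes are independent jobs, with several inputs fanning in to an aggregation step, can be sketched with Python's standard `graphlib` module. This is only an illustration of dependency-ordered execution, not the scheduler described in the talk; the job names and the `run_dag` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_dag(dependencies, run_node):
    """Run every node after all of its predecessors.

    `dependencies` maps each node to the set of nodes it depends on.
    Returns the list of "waves": nodes in the same wave have no
    dependencies on one another, so they could run in parallel.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        for node in ready:
            run_node(node)
        sorter.done(*ready)
        waves.append(ready)
    return waves


# Hypothetical fan-in workflow: process images, aggregate, build a distribution.
dependencies = {
    "aggregate": {"image_a", "image_b", "image_c"},
    "distribution": {"aggregate"},
}
executed = []
waves = run_dag(dependencies, executed.append)
print(waves)
```

Each wave is a set of jobs that are ready at the same time, which is what makes it possible to send them to different machines.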
This is the computing environment that we use to process the data, and these are the statistics. I will leave you with that, that if you want to write applications that work well in this environment, (a) be logical, and (b) be in control, because if you don’t get the right service from one server, you should be prepared to move on to somebody else, if you want to use it effectively. Everyone wants lunch.
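The closing advice, to stay in control and move on when one server does not deliver, might look like the following minimal sketch. The function `fetch_with_failover` and the server callables are hypothetical stand-ins; a real client would also need timeouts and some limit on how long it keeps trying.

```python
def fetch_with_failover(servers, request):
    """Try each (name, callable) in order; return the first success.

    Only OSError and its subclasses (connection and timeout errors)
    are treated as a reason to move on to the next server.
    """
    failures = []
    for name, call in servers:
        try:
            return name, call(request)
        except OSError as exc:
            failures.append((name, str(exc)))
    raise RuntimeError(f"all servers failed: {failures}")


def broken_server(request):
    raise ConnectionError("service unavailable")


def working_server(request):
    return f"result for {request}"


print(fetch_with_failover([("first", broken_server), ("second", working_server)], "job-1"))
```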
Report from Breakout Group

Instructions for Breakout Groups

MS. KELLER-McNULTY: There are three basic questions, issues, that we would like the subgroups to come back and report on. First of all, what sort of outstanding challenges do you see relative to the collection of material that was in the session? In particular there, we heard in all these cases that there are real specific constraints on these problems that have to be taken into consideration. We can’t just assume we get the process infinitely fast, whatever we want. The second thing is, what are the needed collaborations? It is really wonderful today. So far, we are hearing from a whole range of scientists. So, what are the needed collaborations to really make progress on these problems? Finally, what are the mechanisms for collaboration? You know, Amy, for example, had a whole list of suggestions with her talk. So, the three things are the challenges, what are the scientific challenges, what are the needed collaborations, and what are some ideas on mechanisms for realizing those collaborations?

Report from High-Energy Physics Breakout Group

GROUP TWO PRESENTER: I only took a few notes, so I am trying to stall, but I am glad to see Mark Hansen has arrived. So, we talked about experimental physics. What is interesting is that there is sort of a matrix in my mind of what we discussed. I think Paul had mentioned there was a conference in Durham earlier this year in March, in which there were 100 physicists and two statisticians starting to scratch the surface of issues. There is a follow-up meeting in Stanford in September. Somebody named Brad Efron is the keynote speaker. So, presumably, there will be at least one statistician. I think what was clear is that, sort of in the current context of what experimental physics is doing, there is a list of very specific questions that they think they would like answered.
What we had discussed went beyond that. We were really looking at, gee, if we had some real statisticians involved, what deeper issues could we get into. I think that, after a good round of discussion for an hour, we decided there were probably a lot of really neat, cool things that could be done by somebody who would like to have a career changing event in their lives. Alan Wilkes is feeling a little old, but he thinks he might be willing to do this. On another good note, collaborations are clearly in their infancy. There are only a few statisticians in the world, is sort of my observation. So, there is a reason why there are not a lot more collaborations than there should be, perhaps. If you look at Doug’s efforts in climatology, there are really some very established efforts. If you look at astronomy, you have had some efforts in the last four years that have really escalated to the next level, and I think physics is high on the list of making it to the next step. I think there are probably a lot of agencies here in this town that would help make that happen.

The thing that gets more to sort of the issue at hand here is that there are a whole lot of statistical things involved in what is called triggering. So, things are going on in this detector and the thing is when to record data, since they don’t record all 22 terabytes a second, although they would like to, I guess, if they could. The interesting statistic that I heard was, with what they do now, they think they get 99.1 percent of the interesting events among all the billions of ones that turn out not to be interesting. So, 99.1 is perhaps not a bad collection ratio. So, much of the really interesting statistics that we have talked about is sort of the off-line type. In other words, once you have stored away these gigabytes of data, there are lots of interesting pattern-recognition problems and stuff. On the real-time data mining issue, we didn’t pursue that particular issue very deeply.

What struck everybody was how time-sensitive the science is here, and that the way statisticians do science is sort of at the dinosaur pace and the way physicists do it is, if they only sleep three hours a night, the science would get done quicker, and it is a shame they can’t stay up 24 hours a day. There is lots of discussion about magic tricks to make the science work quicker. All in all, I think the conversation really grew in intensity and excitement for collaborations, and almost everybody seemed to have ideas about how they could contribute to the discussion. I think I would like to leave it there and ask anybody else in the group if they wanted to add something.