

STATISTICAL ANALYSIS OF HIGH ENERGY PHYSICS DATA

TRANSCRIPT OF PRESENTATION

MR. JACOBSEN: My name is Bob Jacobsen, and I am starting these three talks. So, I am going to lay some of the groundwork for what some of my colleagues will follow with. My science is different from what you have just been hearing about because we actually have a chart which explains it all. Because we have that chart, which is really what we call a standard model, it affects how we do our science. It affects what questions are interesting and what we look at in an important way.

So, I am going to start with an introduction that actually talks about this science, the goals of it and how we study it. Then I am going to talk about a very specific case, which is something that is happening today, an experiment that has been running for two years and will probably run for three or four more years—that is a little bit uncertain—what we do and how we do it in the context of the statistical analysis of high-energy physics data. I am going to leave to you the question at the end, as to whether we are actually analyzing a massive data stream or not. Then, I am going to end with a few challenges, because I think it is important to point out that, although we get this done, we are by no means doing it in an intelligent way.

So, the standard model of particle physics explains what we see at the smallest and most precise level in terms of two things: a bunch of constituents, which are things like electrons and neutrinos and quarks, and a bunch of interactions between them, which you can think of as forces, but we tend to actually think of as particles that communicate with each other by the exchange of other particles.

Depending on how you count, there are basically three types of stuff, fermions, and there are basically four forces, and we omit the fact that we don't actually understand gravity. We just sweep that under the rug and ignore it, and we do our experiments on electricity, magnetism, the strong force and the weak force.

It is important to point out that, just like you can't see—in the talks that we have just been hearing—you can't actually see the underlying processes that are causing weather. You can only see the resulting phenomenon. We can't see any of the things in that fundamental theory. What we actually see are built-up composite structures like a proton, which is made up of smaller things, and protons you can actually manipulate and play with. More complicated ones are the mesons, which are easier for us to experiment with, so we generally do, but even then, we have to use these techniques to do our experiments. We have no other instruments except the forces we have just described for looking at these particles.

So, what we do is, we interact them through things that look like this big blob on, for you, it would be the right-hand side. For example, you bring an electron and an anti-electron together. They interact through one of these forces, something happens, and you study that outgoing product. This particular one is the annihilation of an electron and its anti-particle to produce two particles called B's, which is what we study. The actual interactions of those particles are what my experiments study. In general, when you put together approximately a dozen constituents, four different forces, and many different possibilities for how these can happen, there is a very large bestiary of things that can happen.

Now, my wife likes to say that this is not actually science. This is a cartoon. I get a little bit defensive about that. It actually is backed up by a formal mathematical structure that lives on a little card that is about this big, which looks like that, and I am not going to go through it. The advantage that we have with this actual physical theory is that we can make predictions from it. We can—there are a few little asterisks here, and we won't go over the details very much, but in many cases, we can actually make predictions to the level of four significant digits of what will happen in a particular reaction.

Now, these predictions have been checked many different ways, but there are some open questions that remain. The two basic categories of this are that we have the equivalent of turbulence. We have things that we cannot calculate from first principles. They happen for two reasons. One reason is that the mathematics of this theory are, like the mathematics of turbulence, beyond us to solve from first principles. We do not yet know how to do that. We know how to model it, we know how to approximate it, we know how, in many cases, to pick regimes in which the thing simplifies to the point where we can make precise calculations. If you walked up to us with a physical situation, we may not be able to calculate the answer to that.

Further, because of the way these processes work—and this is a cartoon here very similar to the one that I just talked about, the interaction between an electron and an anti-electron—it is very simple but, as time goes on, it gets more and more complicated. It starts as a very high-energy pure reaction but, as that energy is spread among more and more particles and their reaction products, the situation becomes more and more highly dimensional, becomes more and more complicated, just like the cascade of energy in turbulence.

By the time you get down to the smallest level, you are dealing with something that is inherently probabilistic. In our case, it starts probabilistic at the very topmost level. These interactions could go any one of a number of ways when they happen.

AUDIENCE: Could you tell us how much energy you are talking about?

MR. JACOBSEN: It depends on the experiment. This particular picture was made for one that was 100 giga-electron-volts in the center of mass. My experiment is actually 10. So, it is many times the rest mass of the particles that we see in the final thing, and you are making a lot of stuff out of energy.

The other problem here is that, like every physical theory, there are some underlying parameters. Depending on how you count, there are approximately 18 numbers that are in this theory that are not predictable by the theory. They are just numbers that are added in. Now, we have measured them many different ways. We are a little sensitive about the fact that 18 is very large. You know, is that too many? Wouldn't you like a theory of everything that has zero or one numbers in it? Is 18 too many? I just say, thousands of chemical compounds were originally explained by earth, wind and fire but, when people worked on it for a long time, they discovered they really needed 100 elements. That was good, because we could explain those 100 elements in terms of protons, neutrons and electrons but, when they really looked into it, those protons, neutrons and electrons of the 1930s became hundreds of mesons and particles in the 1950s. Those were eventually explained in terms of five things, the electron, the neutrino and three quarks but, when you look really into it, it turns out there are 88 of those. The modern version of the standard model of particle physics has 88 distinguishable particles in it. So, no matter what we do, we seem to keep coming back to approximately 100. This gives us a question as to whether or not there is another level of structure. Right now, we don't know the answer to that. The standard model doesn't need another level of structure, but it is certainly something we are looking for.

So, with these 18 parameters, we measure them. That is what we do. Is this complete? Is there anything more there? We don't know the answer to that. We know that some measurements that were just announced over the last couple of years and just updated about a week ago require us to add at least a couple more numbers.

Well, that is bad. The theory is getting more and more complicated, but we are still explaining all of the data, and there is one part where experimental evidence is completely lacking as to whether the theory is really working the way it is supposed to. Including people who will talk to you later on in this session, we are working very hard on that issue. My current goal is to check a very specific hypothesis, which is that this theory, in its structure, as it stands today, is correct in what it describes. It may not be the entire story, but is it right? While we are at it, we are going to look for hints of how to extend it, but mostly, I am interested in whether it is right.

The usual method of doing this in science, which was invented by Galileo, was that you measure things 50 different ways. If they are all explained by one theory with one underlying parameter, the first one tells you nothing, because all you have done is measure the parameter. The second one, though, should be consistent, and the third one should be consistent, and the fourth one should be consistent. The theory of gravity predicts that all bodies are attracted to the Earth with the same acceleration. The first object you drop tells you nothing, because you have just learned what that acceleration is. It is the second one that tells you something.

So, this is an example of a whole bunch of measurements with their error bars, made by different experiments, of different quantities, all of which are related by this underlying theory to a single parameter. We express these as measurements of the underlying parameter, but that is not why we are doing it. Nobody actually cares what that number is. What we are trying to do is test the theory, which is saying that all those things are related. They are all different, except for the fact that some of them are replicated from different experiments. They are described by one single underlying phenomenon, and in this case, they are disgustingly well distributed. The chi squared of all those things is about 0.05 per degree of freedom. That is a separate problem.
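To make the over-constraint idea concrete, here is a minimal sketch in Python (not from the talk; the measurement values are invented) of combining several measurements that are supposed to share one underlying parameter and checking the chi squared per degree of freedom:

    import numpy as np

    # Hypothetical measurements of one underlying parameter from different
    # experiments, each with its own Gaussian error bar (illustrative numbers only).
    values = np.array([0.226, 0.221, 0.230, 0.224, 0.228])
    errors = np.array([0.004, 0.006, 0.005, 0.003, 0.007])

    # Inverse-variance weighted average: the best single-parameter fit.
    weights = 1.0 / errors**2
    best_fit = np.sum(weights * values) / np.sum(weights)
    best_fit_error = np.sqrt(1.0 / np.sum(weights))

    # Chi squared of the individual measurements about the combined value.
    chi2 = np.sum(((values - best_fit) / errors) ** 2)
    ndof = len(values) - 1  # one parameter was fitted

    print(f"combined value = {best_fit:.4f} +/- {best_fit_error:.4f}")
    print(f"chi2 / dof     = {chi2:.2f} / {ndof} = {chi2/ndof:.2f}")

A chi squared per degree of freedom near one is what consistent, honestly estimated errors would give; a value far below one, as in the plot he describes, is suspiciously good agreement.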

Okay, so, how do you study it? The problem is that we can't actually do an experiment in the classic sense. Like astronomers or people observing weather, we can't go out and say, I want a storm here, right now. That is not going to happen. So, it is an observational science. We collide particles in our machines to put a lot of energy in a small volume, and then we let these reactions take over, and everything that is permitted by nature will eventually happen. Some things might be more common than others.

The typical reaction, though, is a little bit hidden from us. If I want to study the box that I showed you earlier, I can arrange for it to happen, but I cannot see the direct result. I can't see the interaction happen. That happens over 10^-29 of a second. I can't even see the things that it produces, because their lifetimes are just picoseconds. They will be produced. They will go some tiny little distance—that is a couple of microns. They will decay into things. Some of those things will decay further into other things and, if I am lucky, I may be able to see all of these things and try to infer backwards what actually happened. I have no control over this downstream process. I have only limited control over what I can measure from the things that come out of it. There are practical difficulties. What I want to see is what actually happens inside these little blocks. That is the introduction to high-energy physics.

Now, I want to specialize in what we do. We are trying to understand why matter and antimatter are different. The world has a lot more of one than the other. In this mathematical structure of the theory, this is intimately associated with the idea of time.

Particle versus antiparticle, time and parity, left-handedness versus right-handedness are all tied together in the mathematical structures in ways that have only really been elucidated in the recent past. I consider 15 years to be the recent past. For those of you who are statisticians and, therefore, very familiar with numbers, there is a deep connection here with the appearance of complex numbers in this theory, which represent a wave, a propagating wave, as e^(i times something). Changing parity, which changes one direction to another, like looking in a mirror, just changes the sign of your x coordinates. Time reversal, time going forward and backwards, changes the sign of your time coordinate. These waves become their own complex conjugates under these reversals, under these symmetries. The bottom line is that we are trying to measure parameters, not just in terms of amplitudes, but also in terms of the complex nature of them, in a theory. We are interested in measuring complex numbers that nature seems to have.

There are very few constants of nature that are complex. The gravitational constant, the mass of the electron, the tax rate—well, maybe not the tax rate—none of these are intrinsically complex. The standard model has a 3-by-3 matrix of complex numbers that just appears in it as a mathematical construct. Because it is a matrix that is relating how things transform into others, it is inherently unitary. Everything you start with becomes one of three possibilities. That means that these three amplitudes for doing this—for example, one of these rows—this thing could become one of those three things, and that is the sum of these. When you actually put the mathematics of unitarity together, what you get is—basically dotting these matrices into each other—you get an equation that looks like that. The rows are orthogonal to each other. You can represent that as a triangle. These are three complex numbers that sum back to zero, so they are a triangle.

The important point that you need to carry away from this is that these constants are controlling physical reactions. So, by measuring how fast some particular physical reaction happens, I can measure this piece, or I can measure this piece, or I can measure this piece. I am measuring things that go into this construct, the triangle. By measuring a very large number of reactions, I can—

AUDIENCE: What does CKM stand for?

MR. JACOBSEN: Cabibbo-Kobayashi—Charlie will now give you the names of the three physicists.

CHARLIE: Cabibbo-Kobayashi-Maskawa.

MR. JACOBSEN: The three people who built this part of the theory. That was before my time.
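For reference, the orthogonality condition behind the triangle being described is the standard unitarity relation of the CKM matrix used in B physics. The slide itself is not reproduced in this transcript, but the equation it alludes to can be written as:

    % Orthogonality of the first and third columns of the CKM matrix V:
    % three complex numbers summing to zero, i.e., a triangle in the complex plane.
    \[
      V_{ud}\, V_{ub}^{*} + V_{cd}\, V_{cb}^{*} + V_{td}\, V_{tb}^{*} = 0
    \]

Dividing the relation through by the middle term is what puts the triangle into the zero-to-one (rho, eta) plane mentioned later in the talk.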

You build up a large number of measurements and you try to over-constrain this. You try to see whether the hypothesis of this theory is correct, which says that all those measurements, when combined properly, will result in a triangle that sums to zero. That is what we want to know. Now, other experiments will have similar things they are trying to do.

This is the cartoon of what actually happens. Let me remind you that we can't measure the underlying reaction or the intermediate products. We can only measure the things in the final state and, from that, we have to infer it. Now, even worse, our experiment does not tell us everything that we want to know. The properties of these outgoing particles, their momenta or even their type, are only measured with finite resolution, and it is not as good as we need it to be. The information is incomplete in two senses. One is that sometimes things are not seen by the detector. There are pipes that carry particles in and cooling water in and information out. Particles that go in there are not measured, just like the satellites don't always tell you what you want at the time you want to see it. Our sampling of these events is imperfect. It is often confused as to how to interpret.

This is not from my experiment, but it is a great example. This is what happened in the detector during one of these collisions. There is no actual picture here; this is what the computer read out from the sensors inside the thing. The yellow-green lines are where the computer said, ah, a particle went there, because I can line up all these little dots that correspond to measurements, and there is some energy deposited out here.

You will notice that there is one track, sort of going down toward 5 o'clock here, that is not colored. The computer says that that is not a real particle. No real particle made that line. The reason is, it has got a little tiny kink in it, part way along. You can't really see it, but there is a missing dot and a kink right there. What it probably did was bounce off an atom and deflect. Instead of making a nice smooth trajectory, it hit an atom and deflected. That doesn't happen very often, but it doesn't fit my hypothesis that something just propagated through the detector. We will come back to this kind of problem. In this case, what the machine knows about what happened here is missing an important piece.

I didn't expect to do a mine-is-bigger-than-yours kind of slide, but let me start. Twenty-two terabytes a second. We don't keep it all, because most of it is stuff that people understood 200 years ago. So, we aren't interested in that. The fact that like charges repel and opposite charges attract, or maybe it is the other way around, we have mastered that. We don't need to study that any more. So, most of the 250 million collisions a second are just that, and there is nothing more to them than that. About a thousand result in things being thrown into the detector in ways that indicate something may have happened. So, the hardware detects that and drops this data stream down, both in space and time: fewer bytes per crossing, because you don't read out the parts of the detector that have nothing in them, and fewer crossings per second. Then, there is software that takes that 60 megabytes per second down to about 100 events per second that can't be distinguished from interesting events. It turns out, we will talk later about how many events, but we record about 600 gigabytes a day of these events that are interesting. This is what we analyze.

Because there are so many things that can happen in this decay chain, and because only certain reactions are interesting, we need a lot of events, and I will talk about those statistics later on. So, we run as much as we can, and we typically keep everything running for about two-thirds of the time, and we are now looking at a few billion of these events, 500 terabytes, but I will come back to that number later on.
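The reduction chain just quoted, from 22 terabytes per second at the detector down to roughly 100 recorded events per second and about 600 gigabytes per day, can be sanity-checked with simple arithmetic. The following is a back-of-the-envelope sketch, not anything from the experiment's software; the per-event size it prints is only the estimate implied by the quoted rates.

    # Back-of-envelope check of the data-reduction chain described above.
    # The input rates are the ones quoted in the talk; the derived quantities
    # (events per day, bytes per event) are rough illustrative estimates.
    TB = 1e12  # bytes
    GB = 1e9   # bytes

    raw_rate_bytes  = 22 * TB    # ~22 TB/s produced at the detector
    trigger_rate    = 60e6       # ~60 MB/s after the hardware trigger
    kept_event_rate = 100        # ~100 events/s written after software filtering
    daily_volume    = 600 * GB   # ~600 GB/day actually recorded

    seconds_per_day = 86_400
    events_per_day = kept_event_rate * seconds_per_day        # ~8.6 million
    bytes_per_event = daily_volume / events_per_day           # ~70 kB/event
    overall_reduction = raw_rate_bytes / (daily_volume / seconds_per_day)

    print(f"events recorded per day : {events_per_day:,.0f}")
    print(f"implied size per event  : {bytes_per_event / 1e3:.0f} kB")
    print(f"overall rate reduction  : {overall_reduction:.1e}x")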

The process of dealing with it is that you take the data that starts in this little hexagon, which is the detector itself, and you shove it onto disks. Then you have a farm of processors that looks at it and tries to convert it into usable data. I don't quite understand the terminology of Earth sensing, but I think this is Level 2. It is where you have taken it from raw bits, as in wire seven had so many counts, and you converted it to something went that-a-way, with a certain amount of precision. You try to get not just this precise measurement that that particular particle went that-a-way, but also an estimate of how precise that is. It is strictly Gaussian statistics estimates. The RMS of the scatter on this is expected to be [word missing].

We also do a tremendous amount of simulation, which is very detailed. The simulation knows about all the material in the detector, it knows about all the standard model, it knows how to throw dice to generate random numbers. It is generating events where we know what actually happened, because the computer writes that down, very similar to your simulations. We try to have at least three times as much simulated data as real data but, when you are talking hundreds of terabytes, what can you do? We feed these all through farms, and what comes out the bottom of that is basically in real time. We keep up with that.

Then it goes through a process where I am not sure where it fits into the models that we have described before, but it is very reminiscent of what we heard the NSA does. We look for what is interesting. We have had the trigger, where we have thrown stuff away. Which phone you tap is sort of the analog of that. But now we take all the stuff that we have written down, and individual people doing an analysis, looking for some particular thing they want to measure, will scan through this data to pick out the subset they look at. The ones who are smart will accept something like one in 10^5. The real things they are looking for are at the 10^-6, 10^-7 level, but they accept a larger set. People who don't put a good algorithm in there will accept 5 percent, and will have a huge data sample to do something with.

AUDIENCE: [Question off microphone.]

MR. JACOBSEN: The skims, the actual running through the thing, are supposed to be done three times a year. It is really done sort of every six months. It is a production activity.

AUDIENCE: [Question off microphone.]

MR. JACOBSEN: They will pick 10 million events out of this store, and then they will go away and study them for an extended period of time and write a paper. The process of actually picking from billions of events is a large-scale computer project.
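In spirit, a skim is just a cheap filter run over every stored event. The sketch below is hypothetical; the event fields, the thresholds, and the idea of representing events as dictionaries are invented for illustration and are not BaBar code.

    # Hypothetical skim: stream over reconstructed events and keep only those
    # passing cheap, loose criteria.  A real skim would use experiment-specific
    # quantities; these cuts are placeholders.
    def passes_skim(event):
        """Loose preselection meant to keep candidates for one analysis."""
        if event["n_charged_tracks"] < 4:      # require enough charged tracks
            return False
        if event["total_charge"] != 0:         # require zero net charge
            return False
        if event["total_energy_gev"] < 5.0:    # require enough visible energy
            return False
        return True

    def skim(event_stream):
        """Yield only the events that pass the preselection."""
        for event in event_stream:
            if passes_skim(event):
                yield event

    # Example with toy events:
    toy_events = [
        {"n_charged_tracks": 6, "total_charge": 0, "total_energy_gev": 9.8},
        {"n_charged_tracks": 2, "total_charge": 1, "total_energy_gev": 3.1},
    ]
    print(list(skim(toy_events)))  # keeps only the first toy event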

So, they will—I will talk more about how they look over and over again at these things, but the actual project running to do the selection is scheduled and done as a collaborative effort several times a year.

AUDIENCE: [Question off microphone.]

MR. JACOBSEN: That is about right. I will show you some plots. So, our task involves thousands of processors, hundreds of terabytes of disk, millions of lines of code, hundreds of collaborators, and I remind you that we still screw up a lot. We are doing the best we can but, for example, fixing the algorithm that didn't find that track involves interactions on all these levels, lots of code, more computer time, lots of people signing it off as the right thing to do. I have to show this. I love this. We are in the state of this poor guy. The task has grown and our tools have not, and this limits what we can do in science. This goes directly to our bottom line.

So, once you have this sample of some number of events, what do you do with it? You are looking for a particular reaction. You are looking for this decaying to that, because you want to measure how often that happens, to see the underlying mechanism. You want to find out how many storms produce how much something or other. So, you remove events that don't have the right properties. The events you are looking for will have a finite number of final-state particles. The electric charge has to add up to the right thing, due to charge conservation, etc. You cut everything away that doesn't match that.

Since the detector makes mistakes, you will have thrown away events that really were what you were looking for. A typical efficiency in terms of that kind of mistake for us is 20 percent. We can discuss whether it is 10 percent or 25 percent, but it is routine to throw away the majority of events that actually contain what you are looking for, because of measurement errors and uncertainty. We will come back to that.

In the end, you usually pick a couple of continuous variables, like the total energy and the invariant mass. Invariant mass is just the idea that, if you have something that blows up, from the products you can calculate how big the thing was that blew up. Everything is right, but you are still not certain they are the right ones, and you plot those two properties on these two axes for each of those events. You will get something that looks like this. This is actually a simulated run. In this bigger box are approximately 1,100 events, and in this little blue-purple box are 13.2 events. You get fractional events because this is simulated data and you have to adjust for the variable number, and we don't have exactly as many real data as simulated data. So, all the ones that are in the blue box are most likely to be the right thing, because that is centered on the right values of these quantities. This big box is not centered on the right quantities. It is the wrong ones. It is our background. Those are accidentals that get through all of your cuts, where something else happened.

It is easy to see what you have to actually do. You just count the events in those two boxes, take the stuff in the big box, and project how much of the stuff in the little box is background, and you have to worry about the shape and all the other stuff. Do the subtraction, and that is your answer, and then you write the paper, Nobel Prizes are awarded, etc. The way you do the extrapolation and the correction for efficiency is with these simulations. They are not perfect but, to the extent that they are only small corrections, how big an error can you make on a small correction? So, you have to add systematic errors for that.
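To make the counting step concrete, here is a minimal sketch under invented assumptions about the event content and the box sizes (it is not the experiment's code). It computes an invariant mass from final-state four-vectors and then does a crude sideband-style background subtraction between the small signal box and the big box.

    import math

    def invariant_mass(particles):
        """particles: list of (E, px, py, pz) four-vectors in GeV (natural units, c = 1)."""
        E  = sum(p[0] for p in particles)
        px = sum(p[1] for p in particles)
        py = sum(p[2] for p in particles)
        pz = sum(p[3] for p in particles)
        return math.sqrt(max(E**2 - (px**2 + py**2 + pz**2), 0.0))

    # Example with one toy set of decay products:
    print(round(invariant_mass([(2.7, 0.0, 0.0, 2.6), (2.7, 0.0, 0.0, -2.6)]), 3))

    # Toy box counting.  Each event is (invariant_mass, total_energy); the box
    # edges are invented, and the signal box is assumed to sit inside the big box.
    def count_in_box(events, m_lo, m_hi, e_lo, e_hi):
        return sum(1 for m, e in events if m_lo <= m <= m_hi and e_lo <= e <= e_hi)

    def background_subtracted_yield(events, signal_box, big_box):
        n_big    = count_in_box(events, *big_box)
        n_signal = count_in_box(events, *signal_box)
        # Assume a flat background, so it scales with the box area.
        area = lambda b: (b[1] - b[0]) * (b[3] - b[2])
        expected_bkg = (n_big - n_signal) * area(signal_box) / (area(big_box) - area(signal_box))
        return n_signal - expected_bkg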

Historically, though, there has been a tremendous concern, because everybody wants to be the first person to discover a particular reaction. So, they tend to sort of put the lines in the place that gets that last event in. BaBar has adopted a policy, which is followed about 85 percent of the time, called blind analysis. "Blind but not dumb" is the phrase we actually use. The analysts basically promise to make all their plots on their screen with this little box over the signal while they are doing all their tuning. There is nothing that we can do to enforce that. These tools are on the desktop, but they do it. Then, once they have got all the details of their analysis final and they promise not to change anything, you pop that off and you look at it. This seems to be very effective, because nobody worries about it any more.

As time goes on, we will get more complicated analyses. We will do a maximum-likelihood fit that has some linear combination of probability density functions. To the statisticians these probably aren't probability density functions, but that is what we call them, to find the rates of many things that are going on. Maybe that fit will have directly in it the standard model parameters that we are looking for. Maybe there will be parameters that describe a resolution. How much something is smeared out on the plot may actually be fit for, as it goes along. A typical fit has 10 parameters in it, but I have seen ones that have up to 215. The basic problem here is that, if you are fitting in a maximum-likelihood fit, you want to understand the distributions. We do not understand most of these distributions from first principles.
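As a cartoon of what such a fit looks like, the sketch below writes down an extended maximum-likelihood fit for a signal yield and a background yield. It is generic and illustrative only: the Gaussian signal shape, the exponential background shape, and the toy data are all assumptions, not the experiment's fitting framework.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm, expon

    rng = np.random.default_rng(1)

    # Toy data: a Gaussian "signal" peak on top of an exponential "background"
    # in some fit variable x (arbitrary units).  All shapes here are assumptions.
    x = np.concatenate([
        rng.normal(5.28, 0.03, size=120),        # signal-like events
        5.0 + rng.exponential(0.2, size=1000),   # background-like events
    ])
    x = x[(x > 5.0) & (x < 5.4)]

    def signal_pdf(x):
        return norm.pdf(x, loc=5.28, scale=0.03)

    def background_pdf(x):
        # exponential shape, approximately normalized over the fit window
        return expon.pdf(x - 5.0, scale=0.2)

    def negative_log_likelihood(params):
        n_sig, n_bkg = params
        if n_sig < 0 or n_bkg < 0:
            return np.inf
        density = n_sig * signal_pdf(x) + n_bkg * background_pdf(x)
        # Extended likelihood: Poisson term for the total yield plus the shape term.
        return (n_sig + n_bkg) - np.sum(np.log(density))

    result = minimize(negative_log_likelihood, x0=[100.0, 900.0], method="Nelder-Mead")
    print("fitted yields (signal, background):", result.x)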

We can extract them from simulations that we don't trust, or we can look at the data to try and figure out what they are. Now, we are not smart about it. This is a probability density function fit to some particular thing. You can't read this, because it is one of our tools, but it is best fit by the sum of two Gaussians. So, we parameterize it as the sum of two Gaussians and feed it into one of these huge fits. The real reason that it is the sum of two Gaussians is that there are two distributions there. There are actually two populations. In this plot down here, the high-energy physicist will motivate the fact that there are actually two types of events, but we do not separate them. It is very unusual; it was considered a major breakthrough for somebody to actually do a fit by doubling the number of parameters and separating these two populations. That is how you get up to 215, a bunch of simultaneous fits. Usually, we just parameterize these. The reason is subtle. The reason is that, even though we know it is two populations, we are still going to have to fit for two Gaussians, because we don't have a first-order understanding of what those things should actually look like. So, even though we calculate the errors on many quantities, we calculate them in a very simplistic Gaussian way, and we don't have any way of representing the more complicated things that are going on.

To come back to your question, there are typically 10K to 100K events in that final selection that was done in production, and that fits in a desktop computer, and that is really what sets how many of those there are. The analyses are done with very small-scale tools. They are physicist-written, general-purpose tool boxes, but users write their own algorithms for selecting events.
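The sum-of-two-Gaussians parameterization described above can be sketched as follows. This is a generic illustration; the mixing fraction, means, and widths are invented, but it shows the kind of shape that gets extracted from simulation and then frozen into the big fits.

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import curve_fit

    def double_gaussian(x, frac, mu1, sigma1, mu2, sigma2):
        """Normalized sum of two Gaussians: a narrow core plus a wide component."""
        return (frac * norm.pdf(x, mu1, sigma1)
                + (1.0 - frac) * norm.pdf(x, mu2, sigma2))

    # Toy "simulated" resolution distribution: mostly a narrow core, with a wider tail.
    rng = np.random.default_rng(2)
    sample = np.concatenate([rng.normal(0.0, 1.0, 9800), rng.normal(0.0, 3.0, 200)])

    # Fit the histogram of the sample with the double-Gaussian shape.
    counts, edges = np.histogram(sample, bins=100, range=(-10, 10), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    popt, _ = curve_fit(double_gaussian, centers, counts,
                        p0=[0.9, 0.0, 1.0, 0.0, 3.0])
    print("fitted (fraction, mu1, sigma1, mu2, sigma2):", popt)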

The analysis turnaround time is analogous to the attention span. That means that they will write code that is just complicated enough that it will complete before they come back with a cup of coffee. Now, my question to you—we don't have to answer this now—is that I am not sure that we are doing statistical analysis of massive data streams at this point. The only way we know how to do it right now is to get it onto somebody's desktop and somehow make it fit, and in the next talk we may hear about better ways to do this. This is the situation we are in. We have bigger jobs, and our tools are just not up to it. Every time we attempt it, we get embarrassed. Look at this picture, actually taken at a Home Depot. You will notice the engine is actually running. This guy tried to drive home.

In my last couple of slides, I want to turn this around the other way. I want to say something to the statisticians in the audience about what I perceive the real challenges in your area to be. All of them are this idea of capturing uncertainty. We do a tremendously bad job of dealing with the fact that our knowledge is poor, in many different ways. What we do basically now is, if we suspect that we have something funny going on, we punt. We throw the event away, we run the detector for an extra week, month, year, whatever. So, if only parts of the final state are present because of a measurement error, or if, when we look at the event, we could interpret it multiple ways, which can happen, there was this event—this cartoon is to remind you of the marginal track.

If something is marginal because maybe something a little bit more complicated than the first-order thing I am looking for has happened, we throw it away. You can easily do back-of-the-envelope estimations that we are talking about factors of three in running time, for experiments that cost $60 million, by doing this. Physicists, particularly high-energy physicists, think that we have completely mastered Gaussian statistics, and refuse to admit that there are any others, although secretly we will mention Poisson. We don't know how to propagate detailed uncertainty. We don't know how to deal with the fact that the real distribution of error on one of these tracks is, 98 percent of the time, a Gaussian of this width and, 2 percent of the time, a Gaussian of that width. We don't know how to combine all that throughout all the stages of this analysis to get out something that makes sense in the end. So, we choose to get rid of those special cases out of a fear of tails in our distributions. When we can't do that, we model it with these PDFs, which we don't really understand, and which are throwing away all the correlation information between those two, because we can't see that at all. This is accepted now, even though it costs a lot of statistical power. In the absence of a way of dealing with it—I don't want to say a formal way, I want to say a principled way of dealing with it—we do not know how to assign a systematic error if we allow these things to come in. It is not even so much that the systematic error from these will be large. We don't know what it should be. So, we don't really know how to balance off the complexity of our true information and the statistical power of our true information. As you go through the literature in high-energy physics, you will find people who will address this in one particular place in an analysis where it is useful. There has been no systematic approach, which is where the real gain is. The real gain will come from ten 5 percent gains, not from one 5 percent gain.

Now, sometimes, every data point matters. When what you are doing is the real final result, taking that triangle, which is shown in cartoon form here to remind you, and combining measurements that took hundreds of people years to make, because different experiments are measuring different things, and trying to see whether this is right, you can't throw out the work of 500 people because there is a tail on the distribution. You will get lynched. So, the way that we do that now is that we do it manually. People called theorists get together at conferences and they argue. This particular plot is the result of six theorists for three weeks, trying to deal with the statistics of combining data with non-Gaussian errors.

The way that you interpret this is that each little color band or hatched band is a different measurement from a different experiment, and this little egg-shaped thing here, the yolk, is the best estimate of where the corner of the triangle is in this plane. The triangle actually lies along the horizontal midplane and sort of extends up to there. As these measurements get smaller and smaller error bars on them, this will tighten up until eventually, hopefully, some of them will become inconsistent with each other.

AUDIENCE: [Question off microphone.]

MR. JACOBSEN: No, actually. What we do, remember, I said that this combination of all possible processes has to obey unitarity. Everything has to become only one possible thing. So, unitarity also implies that certain things don't happen. Some things sum to zero. So, we have this triangle, each side of which is related to certain rates. To make it easier to deal with, since we don't know any of them, we divide all the sides by the length of this one and put it in an arbitrary zero-to-one space. So, we call these things η and ρ, but they really are physical measurements. Everything is put in terms of one high-precision measurement. So, that is actually one of the things that is really spurious. That is not a good word to use. One of the things that is very hard about this is that everything, with the possible exception of some of the bands that appear in the theory, and one of the plots—but not all of the plots—is subject to experimental verification. We have no scale in this plot that is not subject to experimental verification. From here to here—I am not sure where the center is—there is the center, and there is the side. That is actually an experimental number. The width of that is actually secretly an experimental number. As fundamental physicists, we worry about that. When you learned electricity and magnetism, you probably learned it in terms of the charge of the electron. That did not tell us the charge of the electron. Somebody had to go out and measure it and, every time that measurement changes, it changes everything. It is now known to nine significant digits, so it doesn't change your electric bill, but at this level, it has to be taken into account that everything is subject to experimental verification. It is an unfortunate fact that there is nothing that is really well known in that plot. In 15 years, we will be doing this in high school but, right now, it is a very hard thing, because there is a lot of uncertainty in that plot.

MR. SCOTT: Can you just briefly, when the theorists are sitting around in this room deciding how to draw it, what exactly are they using for this?

MR. JACOBSEN: Leaving the names out, there is one guy who says the only way you can possibly do this is to take all the likelihood curves from all of these measurements, pour them into a computer and have it throw numbers, and then make a scatterplot of the possible outcomes. Then, there is always a discussion about whether we draw contour lines or not. Someone else will say, but that is completely bogus, because there are large correlations in the systematic errors that these experiments have made. They have made correlated assumptions, and then how do you deal with that? His response was, well, we vary the underlying assumptions in one of these computer models.
Someone else says, no, they told us it is plus or minus something in their abstract, that is all I know, that is what I am going to draw.
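To make the first approach concrete, here is a minimal toy sketch of it: sample points weighted by the product of the individual likelihoods, and then decide which 90 percent of the resulting dots to enclose. Everything in it is invented for illustration; the "measurements" are simple two-parameter Gaussian likelihoods, and taking the most likely 90 percent of the dots is exactly the kind of choice the speaker says is up for argument.

    import numpy as np

    rng = np.random.default_rng(3)

    # Invented "measurements": each one constrains the same two parameters
    # (call them rho and eta) with its own Gaussian likelihood.  The means and
    # uncertainties are purely illustrative, not real experimental inputs.
    means  = np.array([[0.20, 0.35], [0.15, 0.40], [0.25, 0.30]])
    sigmas = np.array([[0.05, 0.08], [0.10, 0.04], [0.06, 0.06]])

    # Throw candidate points, weight each by the product of the individual
    # likelihoods, and keep an accept-reject sample ("dots on the screen").
    points = rng.uniform([0.0, 0.1], [0.5, 0.6], size=(200_000, 2))
    log_w = np.zeros(len(points))
    for mean, sigma in zip(means, sigmas):
        log_w += -0.5 * np.sum(((points - mean) / sigma) ** 2, axis=1)
    weights = np.exp(log_w - log_w.max())
    keep = rng.random(len(points)) < weights
    dots = points[keep]

    # One possible answer to "which 90 percent?": rank the kept dots by their
    # combined likelihood and keep the most likely 90 percent of them.
    dot_logw = log_w[keep]
    cut = np.quantile(dot_logw, 0.10)
    inside = dots[dot_logw >= cut]
    print(f"{len(dots)} dots sampled, {len(inside)} inside the chosen 90% region")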

In fact, the one that actually won was the first one that I described. They put all this stuff in, and they ran a computer model that put dots on a screen. The question is, if you have got a whole bunch of dots on the screen, where do you draw the 90 percent contour? It has to enclose 90 percent of those dots. Which 90 percent? I am not a statistician. My job is to make that little angled line in there. I don't really understand this, but I know that they don't either, because I can show you six other plots, and they are not the same.

One last thing. I promised I would get done close to on time. Everybody—I teach freshman physics a lot. Everybody believes that science starts with “Eureka” and Archimedes running down the street naked. Not my science. My science starts with, “Hmm, I wonder what went wrong there.” Let me give you an example of this. This is a famous event. It has actually been redrawn with a different graphics package, so it is actually a simulated event. The real event that looked like this led to the discovery of the τ, Nobel Prizes were awarded, etc., etc., etc. What this event was, was this can't happen. In the standard model, as it was established at that time, the rate for this process was identically zero. Zero does not fluctuate to one. This was not noticed by a computer program that was looking for it. Why would you look for something that can't happen? That is not a great way to get measurements. It was noticed by someone flipping through events and saying, “Oops, what is that?” The process that I have outlined, which is the modern process of the last five years, is not the process of flipping through a significant fraction of events. You can't look through a significant fraction of a billion events. So, we are, in fact, quite weak on finding things that don't belong.

This is particularly difficult because, nowadays, although we have a theory that predicts this, one of my students checked: approximately 15 percent of the events that look like this are due to mistakes in the detector. Anomalies can be due to mistakes. The standard model predicts that electric charge is 100 percent conserved all the time. If you could prove that electric charge was not conserved, automatic Nobel Prize. A lot of our events don't conserve charge, because we will have missed a charged particle. Something disappeared and we didn't see it. So, if you naively add up the charges, it doesn't sum to zero. It sums to something else. So, how to look for anomalies in the presence of mistakes is something that we don't know how to do. The way we do it now is to say, “Oh, this weird thing that can't happen could be happening.” Let's go look for it as a clean signature. We search positively. Let's see if electric charge is conserved.

We do a long, complicated analysis, and the bottom line is that, with such and such confidence level, electric charge is conserved. It is a search paper, and there are a lot of those. They are all searches for specific things. It is very hard today to notice an anomaly.

Okay, here are my conclusions. I tried to give you a feeling for how we do high-energy physics with these data. In the discussions, I really hope to hear about better ways, because our successors definitely need one. I am in the position of these guys, who are back at the invention of the kite, and they are working with the technology they know and understand, and they are just going to use it as much as they can, but they are not going to succeed. So, thank you very much.

MR. SCOTT: While our second speaker is setting up, I want to thank Bob very much, and we do have a few minutes for questions.

AUDIENCE: Why do you do this parametrically, if you don't know what your distribution is?

MR. JACOBSEN: What other choice have I got?

AUDIENCE: Non-parametrics.

MR. JACOBSEN: I am embarrassed to say, I don't know how to do that, but let's talk. How do I put this? A lot of what we do is because we learned to do it that way. So, new ideas, in fact, distribute themselves in a very non-linear fashion in high-energy physics. Neural networks are a classic example. With neural networks, for a long time, the basic reaction of the high-energy community was something like this: keep this away from me, I will get out the garlic. Then a few people showed that they actually could be understood. Once that had been shown, you don't even mention it any more. So, methods that we don't understand, we resist until suddenly they are shown to be worthwhile.

AUDIENCE: Where are they used?

MR. JACOBSEN: Neural networks? They are used everywhere now, but they are predominantly used as categorizations of large groups, at least at the rejection steps before you get to the precision measurement. For example, the skims that are sorting out samples for people? I guess about a third of them probably use some sort of neural network recognition. It takes many variables to try to make a guess about these things. We do apply them on a large basis. We have simulated ones. We don't have neural network chips. We have neural network algorithms.

AUDIENCE: What are the number of— [off microphone]?

MR. JACOBSEN: Ten is the—plus or minus two.

AUDIENCE: And the categorization is— [off microphone]?

MR. JACOBSEN: Yes or no.
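To make the categorization idea concrete, a network with roughly ten input variables and a yes/no output, as just described, might look like the sketch below. It is purely illustrative: the toy features, the labels, and the use of scikit-learn's MLPClassifier are assumptions, not a description of the experiment's actual tools.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(4)

    # Toy training sample: ~10 event-level variables per event, with a label from
    # simulation saying whether the event is the kind we want to keep (1) or not (0).
    n_events, n_features = 5000, 10
    features = rng.normal(size=(n_events, n_features))
    # Invented rule for the toy labels: "signal" events have larger values in the
    # first two variables, plus some noise.
    labels = (features[:, 0] + features[:, 1]
              + rng.normal(scale=0.5, size=n_events) > 0).astype(int)

    # A small feed-forward network acting as a yes/no categorizer for a skim.
    classifier = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
    classifier.fit(features[:4000], labels[:4000])

    accuracy = classifier.score(features[4000:], labels[4000:])
    print(f"yes/no categorization accuracy on held-out toy events: {accuracy:.2f}")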

