DATA GRIDS (OR, A DISTRIBUTED COMPUTING VIEW OF HIGH ENERGY PHYSICS)

TRANSCRIPT OF PRESENTATION

MR. SCOTT: The next speaker is very appropriate. If you are going to have massive data sets, you need massive computing, and "grid" is the buzzword. We are very fortunate to have one of the leaders in the field to talk to us about the relationship between grid computing and high-energy physics.

MR. LIVNY: Thank you. Since I really didn't know what was expected from me in this presentation, I will try to address three areas. One is sort of our view from the computer science perspective of what high-energy physics is. The other one is to touch upon what I believe is a huge challenge, which is how to do real interdisciplinary work. So, what does it mean to change a title from a computer science professional to a computer scientist? That was not easy, and we are still working on it. The third one is to give you some update on technology, what can be done and how it works, because, at the end of the day, a lot of what was described earlier depends on computing resources and stuff like that. So, we will see how far we can go with all of that. I will give you a little bit of background, because I think it is important to understand (a) where we are coming from, and (b) what we have experienced so far.

So, at Wisconsin we have been running the Condor project now for over 15 years, and we are probably the proud holders of the title of the oldest computer science project that is still doing the same thing. This is the good news and the bad news. The good news is that there is a lot of work to be done there. The bad news is that it is really hard to do it right. When I say to do it right, I think it is important to understand that, if you want to do it, then you have to do it for real, which means that you have to develop new software, and you have to face all the challenges of software engineering, middleware engineering, whatever you want to call it, because it has to run on all the platforms and it has to do it right. You have to become part of these collaborations, and it is not that easy to be this dot in the picture that you saw earlier, and to survive there. We definitely, in computer science, think that a collaboration of three scientists is a huge effort. Suddenly realizing that we have to work in these larger collaborations—and I will talk about politics later—is an important part of it, and has a huge implication. So, we have to learn how to do it, and it is not simple.

The other part of it is that we have to work with real users. So, we cannot come and say, "Yes, we would like to help you, but can you change your distribution a little bit? I mean, you know, if the distribution had been a little bit different, it would have been much easier for us." The same thing is true for us as computer scientists. The other part of it is really this business of practicing what you preach. If you develop a method or you develop a tool, and you are not actually using it and figuring out how it works and when it works and when it doesn't work, then I think there is very little hope that it will be accepted by an end user. To remind you, the high-energy physics community, as an example, was very self-contained until recently. I think what you have been hearing here regarding statistics, and what we have experienced on the computer science side, is that they realized that they need some help or can use help. This has not been an easy process, again, and we have to develop things that actually work. If we come and provide something and say, this is the solution, and it falls on its face, then doing it the next time becomes very, very difficult.

Now, the good news is that today what we are doing—distributed computing, grids—all this is very fashionable and, therefore, we are getting a lot of support. It is good news and bad news because, when something is very fashionable, everyone steps up to the plate, and people who have been doing a lot of very different things suddenly are experts in distributed computing, and that can be very dangerous.

So, one message I want to leave with you, something we have learned over the last 15 years, is that when we build a distributed computing environment, we have to separate it into three main components, one of which represents the clients, the consumers. This is not the trivial part of the equation. Then, there is the other part that represents the resources. Again, this is a very intelligent component, especially when we get into all the political issues of distributed ownership. What we have realized works extremely well is that we interface these two with a framework that we call matchmaking, which allows providers of resources and requesters of resources to come together in the way that we, as humans, come together. I think one of the big messages that we have as a result of all of our work is that we should look at these computing environments more as communities rather than as a computing system where nanoseconds and picoseconds are what matter. The actual question of why you are part of this community, what you contribute, and what you get out of it becomes much more important than the latency of the network.

In the mid-1990s, the grid concept came to the front of our activities, and this is the Bible, and we are working on the New Testament now, the second version of the grid book. The main message there is that we are trying to create this pervasive and dependable computing facility and, as I said, there is a lot of activity on this. I could give you another two-hour talk to show how it is related to distributed computing; there are a lot of concepts that are resurfacing, and there is a lot of stuff that goes back to what we have been doing 20 and 30 years ago.
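
To make the matchmaking idea above concrete, here is a minimal sketch in Python of a two-way match: both the job and the machine advertise attributes together with a requirements expression, and a pair is formed only when each side accepts the other. The attribute names, the rank rule, and the greedy pairing below are illustrative assumptions, not Condor's actual ClassAd mechanism.

# A minimal sketch of two-way matchmaking in the spirit of Condor's ClassAds.
# The attribute names and the ranking rule are illustrative, not Condor's actual schema.

def matches(request, offer):
    """Both sides must accept: the job's requirements evaluated against the
    machine's attributes, and the machine's requirements against the job's."""
    return request["requirements"](offer["attrs"]) and offer["requirements"](request["attrs"])

def matchmake(requests, offers):
    """Greedy matchmaker: pair each request with the highest-ranked acceptable offer."""
    assignments = {}
    free_offers = list(offers)
    for req in requests:
        acceptable = [o for o in free_offers if matches(req, o)]
        if not acceptable:
            continue  # request stays idle until a suitable resource appears
        best = max(acceptable, key=lambda o: req["rank"](o["attrs"]))
        assignments[req["attrs"]["name"]] = best["attrs"]["name"]
        free_offers.remove(best)  # a machine serves one job at a time in this sketch
    return assignments

# A job asking for Linux with enough memory, preferring more memory.
job = {
    "attrs": {"name": "reco-run-42", "owner": "cms"},
    "requirements": lambda m: m["os"] == "linux" and m["memory_mb"] >= 512,
    "rank": lambda m: m["memory_mb"],
}
# A machine that only accepts jobs from the CMS group (a local ownership policy).
machine = {
    "attrs": {"name": "node07.hep.wisc.edu", "os": "linux", "memory_mb": 1024},
    "requirements": lambda j: j["owner"] == "cms",
}

print(matchmake([job], [machine]))   # {'reco-run-42': 'node07.hep.wisc.edu'}

The point of the two-sided test is exactly the distributed-ownership issue discussed above: the resource owner's policy is evaluated with the same standing as the user's request.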

So, if you look at this notion of the grid, there is the user who wants to do some analysis, who wants to create some kind of a histogram or scatterplot or what have you, or ask a question, or generate a million events, and I will show you a little bit about that. Then there is the fabric that has to do the work and, in between, there is this thing that we call the grid, or the middleware, that has to bring them together. Actually, following the division of labor that I showed you earlier, what we have been doing on our side, and have been contributing and moving into the high-energy physics community, is to come in and say there is what is called the Globus Toolkit, which is the inter-domain capability that allows you to cross domains and bring them together into a single domain, and then we have taken the technology that we have developed and attached it on one side to the fabric and on the other side to the computing environment, and this is what we have to add to it in order to make the whole thing work.

Now, what I will do today is focus more on the application, user side, because I think that this is much more applicable to what we are talking about here. One of the questions that we have to ask at the end of it is, how do we write applications, how do we develop interfaces that can actually take advantage of these capabilities? That, I think, is where some of the algorithmic and software development issues come into play. Now, we are in an even worse situation than what you have seen earlier, because you can see only the handles on that side, but when we are getting into a grid effort, in this case the Particle Physics Data Grid, which is a DOE activity, we have to deal with several of these experiments, and we have to deal with several software

contributors, in order to make the whole thing work and in order to understand what is common and what is different. This is, as I said, one example, and it is chopped off on the left. By the way, if you are looking for a use of computers, logo generation is a huge industry these days. There is a continuous stream of new logos, and I am sure there is a lot of money in that also.

Now, the hardware is also pretty challenging. This is one grid, in terms of infrastructure. This is the TeraGrid. I think it is a $45 million or $50 million effort of NSF. Part of it is, okay, if you want to do high-energy physics calculations, here are a few teraflops that you can use and a few petabytes to store the data. The question is, how do you get to that, and how do you get your results back? Let me try to generalize or abstract and say, this is the way we look at it, and that is the way we are trying to develop tools to help high-energy physics. So, of the two examples that I have, one is from high-energy physics and the other one is from the Sloan Digital Sky Survey, which is more on the astronomy side.

So, this is a simplified view of what is going on. These are the phenomena down there that can either be described by the detector or described by a set of models. That is sort of the information about the real world. Then it goes through this pyramid of data refinement and, eventually, at the end, what they want is statistics. We have to deal with the flow of data up, and we have to deal with the issue of a query going down. When a scientist comes in with a question about statistics, which is basically, in most cases, "Give me a histogram of these values on these events, under these conditions," then it is going down. One of the challenges here is that how far you have to go down depends on the question. So, the projection question is challenging, because you might get it from the top or you may have to go down. As you saw earlier, the amount of data involved here is huge and, in many cases, the computation need is huge as well.

What makes this more interesting is that all this is happening in a distributed environment. The distribution is in two dimensions. One is what we are used to, which is the physical distribution—namely, we have networks, we have machines that are interconnected by these networks, and these networks can be local area and these networks can be wide area, or can be wireless, or whatever they are. So, the machines are in different places. They are heterogeneous and all these kinds of wonderful things which, by the way, brings up one of the biggest challenges of this environment, which is that you don't have a single reboot button. It is so wonderful when one of these wonderful machines misbehaves: you reboot it and you bring it back to a stable state and you can keep going. When you have a distributed system, you can really never have the whole system in a stable state.
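
The histogram query that travels down the pyramid can be made concrete with a small sketch: a request is answered from the compact summary level when the requested value and the selection only need summary quantities, and it drills down to the bulky per-event detail only when forced to. The two-level layout and the attribute names are assumptions made for illustration, not any experiment's actual data model.

# A sketch of the "how far down the pyramid do we have to go" decision for a
# histogram query. The two-level layout (summary tags vs. full detail) and the
# attribute names are assumptions made for illustration.
from collections import Counter

# Summary level: one small record per event, cheap to scan.
summary = [
    {"event": 1, "n_muons": 2, "sum_pt": 85.0},
    {"event": 2, "n_muons": 0, "sum_pt": 40.0},
    {"event": 3, "n_muons": 1, "sum_pt": 120.0},
]
# Deep level: bulky per-event detail, expensive to touch (pretend it lives at a remote site).
deep = {
    1: {"tracks": [12.0, 33.0, 40.0]},
    2: {"tracks": [40.0]},
    3: {"tracks": [60.0, 60.0]},
}

SUMMARY_ATTRS = {"event", "n_muons", "sum_pt"}

def row_value(r, attr):
    # A derived attribute computed from deep data; anything else is read directly.
    return max(r["tracks"]) if attr == "leading_track_pt" else r[attr]

def histogram(attr, bin_width, condition_attrs, condition):
    """Histogram `attr` over events passing `condition`. If everything the
    query needs lives in the summary level, never touch the deep store."""
    needed = {attr} | set(condition_attrs)
    if needed <= SUMMARY_ATTRS:
        rows = [r for r in summary if condition(r)]    # stay at the top of the pyramid
    else:
        rows = []
        for r in summary:
            full = dict(r, **deep[r["event"]])         # drill down only when forced to
            if condition(full):
                rows.append(full)
    return Counter((row_value(r, attr) // bin_width) * bin_width for r in rows)

# Answerable from the summaries alone:
print(histogram("sum_pt", 50, {"n_muons"}, lambda r: r["n_muons"] >= 1))
# Forces a trip to the deep store:
print(histogram("leading_track_pt", 20, {"n_muons"}, lambda r: r["n_muons"] >= 1))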

The other principle, which we learned from the physicists, is the importance of time. What you know is what happened in the past and, by the time you react to it, it is even later. So, never assume that what you are doing has anything to do with reality. The second part, which is even more challenging, is the distributed ownership. Since computing is so cheap, and so powerful, we see smaller and smaller organizations owning more and more computing power. These computing resources are autonomous; they are managed by their own local software, but they are also managed by their own local administrators who reflect local policy. So, if you want to operate in this environment, you have to be prepared to deal with all that, which really means that you ought to take an opportunistic view of the world. You ought to go and use whatever is available for as long as it is available, then be ready to back off when it is gone, and be able to make sure that you behave according to the rules.

So, what is driving the fact that it is distributed? Cost, obviously. Commodity computing is here to stay with us, I think, for a long time at least, and whether we have to compute on desktops or on PlayStations or whatever, the computational needs, as you saw them earlier, are so huge. I heard earlier that they would like to have threefold Monte Carlo. I heard earlier that you would like to get 10-fold Monte Carlo. Some people tell me that today you can barely get onefold Monte Carlo, in many cases. So, if we can make it available to the HEP community and the others—by the way, the biologists are coming in with even more than that; I just talked earlier this week with one biologist who wanted to do a pairwise comparison of the 120,000 proteins that exist today, and another 300,000 that are coming down the pike later this year. So, we have to go after commodity, whatever it is, and it is distributed in nature.

The other part of it is performance, because we need to be close to where we collect the data. We want to be at CERN or at SLAC, where the detector is, and we want to be close to where the scientists are, in order to do the analysis. It is not only the performance, but it also brings in the availability issue. Why? Because if I have the data here and I have the computing here, I am in charge. I touch the mouse, all this is mine, even if it is not as big as it could have been if everyone had gone to a single place. That is also the politics. So, you have these international efforts and, for example,

the United Kingdom now has invested a huge amount of money in e-science, and they want the resources to be in the United Kingdom. All of a sudden, BaBar has more of a presence in the United Kingdom, not because that is what makes sense, but because that is where suddenly the money is. So, that is the politics. The other one is the sociology. I want to have control, I want to show that I am a contributor, that I am doing it. We have to understand this from the outset and really build everything around it—this also goes back to my previous comments—we have to understand how to operate in a large collaboration. It gets even more difficult when it is interdisciplinary, when we come in with different cultures and different ways of thinking and we have, eventually, to solve the same problems. While we are doing computer science and they are doing physics, at the end of the day they have to find one of these particles. I have been trying to give them particles all along; here it is, let's forget about all the other ones, but they don't want my particles. They are looking for another one.

Now, what can we bring to the table? There are a lot of principles we have learned over the years in computer science. I think one of them, which is actually coming from the database world, is the value of planning. One of the nice things about databases is that you come in with a logical request. You don't know anything about the physical implementation of the system or the data. Then, you let somebody do the planning for you. There are a lot of challenges in doing the planning, and there are a lot of challenges in convincing the user that what you did is right. For example, a typical physicist will not trust any piece of software. If you come to them and say, "Trust me, give me a high-level request and here is the histogram," they will say, "No, no, where was the byte, when was it moved, by whom was it generated, by which operating system, which library?"—all these kinds of things, because they have learned over the years that they are very sensitive to it. Now, whether it has to be this way or not is an interesting question, which I think also has statistical implications because, on the one hand, everything is random; on the other hand, you want to know exactly what the voltage distribution of the machine was when you ran it. So, that is getting into—the second item here is data provenance. There is a lot of work today in understanding, if this is the data, where did it come from? How much do we have to record in order to convince the young scientist that this data is valid, or that these two data sets are the same?
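
As one concrete reading of the provenance question just raised, here is a minimal sketch of the kind of record one might keep about how a data product was made, together with a fingerprint that lets two products be compared. The fields and the hashing scheme are assumptions for illustration, not the format used by any of the tools mentioned in this talk.

# A sketch of a provenance record: enough about how a data product was made to
# compare two products or to regenerate one. The fields chosen here are
# illustrative assumptions, not a standard from the grid projects discussed.
import hashlib, json, platform

def provenance_record(outputs, transformation, version, parameters, inputs):
    """Describe a derivation: what ran, with which arguments, on which inputs."""
    return {
        "outputs": outputs,
        "transformation": transformation,   # logical name of the program/step
        "version": version,                 # code release, since physicists ask "which library?"
        "parameters": parameters,
        "inputs": inputs,                   # provenance IDs (or raw-data IDs) of the inputs
        "environment": {"python": platform.python_version(), "machine": platform.machine()},
    }

def derivation_id(record):
    """A stable fingerprint of the derivation, ignoring where and when it happened.
    Two products with the same fingerprint were produced the same way."""
    essential = {k: record[k] for k in ("transformation", "version", "parameters", "inputs")}
    return hashlib.sha256(json.dumps(essential, sort_keys=True).encode()).hexdigest()[:16]

rec_a = provenance_record(["hist_muons.root"], "make_histogram", "reco-3.2",
                          {"cut": "n_muons>=1", "bins": 50}, ["run2002_tag_v1"])
rec_b = provenance_record(["hist_muons_copy.root"], "make_histogram", "reco-3.2",
                          {"cut": "n_muons>=1", "bins": 50}, ["run2002_tag_v1"])
print(derivation_id(rec_a) == derivation_id(rec_b))   # True: same derivation, so comparable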

In the database world, whether it is materialized or not is left to the database, and that is connected to the other concept that I have here, which is virtual data. We have a whole ITR project, which is called GriPhyN, that is dealing with virtual data, which is again coming to the end user and saying, "Tell me what you want and I will do it for you. Write it in SQL, and I will decide what join method I should use and where I should use it, and whether I materialized it earlier or not is up to me." There is a huge trust issue here, when you allow somebody else to do the planning for you. Now, the main issue in the planning is to figure out what the resource requirements are. As I point out later, a huge question is how large the space requirement of an operation is, because this is a bottleneck and a source of deadlock if you cannot write your data or bring your data in when you are trying to run a large operation. There is this question of when to derive and when to retrieve. If you want a million events that were simulated, what is cheaper, to go and retrieve them from CERN, or to rerun the simulation on your local farm? Now, if you can guarantee that the two are statistically equivalent, then maybe I should reproduce them on the local farm, rather than wait for the data to be shipped from CERN. We have not solved, I think, even in databases, the question of when to move the data and when to move the function. Given the amount of data involved and the amount of resources, we are facing this problem on a larger scale. Where do we do it, how long do we wait for the data to come, and where do we move the data? Can we push selection down to where the data is? The issues are coming from the database world, from computer science, but we have to apply them not in the standard way in which we have been doing it all along.

The other part of it is that we really have a huge data workflow problem here that we have to manage and control, because if we screw up, it can be very bad. If you really want to live in this brave new world where we have grids and we have computing and we can do things everywhere, then we have this continuous movement of data and of computing. We are talking about tens or hundreds of thousands of things that we want to do in this environment. If we don't keep track of what is happening, or somebody really misbehaves or loses control, we can grind the whole system to a halt.
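
The derive-or-retrieve decision mentioned above is, at bottom, a small cost comparison. Here is a hedged sketch of such a planner; the cost model, the link speed, and the per-event simulation time are invented numbers, and a real planner would also have to verify that a regenerated sample is statistically equivalent to the archived one.

# A sketch of the "retrieve from CERN or regenerate locally" planning decision.
# The cost model and the numbers are illustrative assumptions only.

def transfer_hours(size_gb, wan_mb_per_s):
    return size_gb * 1024 / wan_mb_per_s / 3600

def regenerate_hours(n_events, sec_per_event, local_cpus):
    return n_events * sec_per_event / local_cpus / 3600

def plan(n_events, gb_per_million_events, wan_mb_per_s, sec_per_event, local_cpus):
    size_gb = n_events / 1e6 * gb_per_million_events
    t_move = transfer_hours(size_gb, wan_mb_per_s)
    t_make = regenerate_hours(n_events, sec_per_event, local_cpus)
    choice = "retrieve from archive" if t_move <= t_make else "regenerate on local farm"
    return choice, round(t_move, 1), round(t_make, 1)

# One million simulated events, roughly 200 GB, over a 10 MB/s wide-area link,
# versus 30 CPU-seconds per event on a 500-CPU farm.
print(plan(1_000_000, 200, 10, 30, 500))   # ('retrieve from archive', 5.7, 16.7)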

So, when we move the data from the phenomenon to the data, when we measure it, we have a real-time constraint if the data is coming from an instrument. Whether it is a telescope or a detector, we have to make sure that this data goes in and we cannot lose anything, because this is real data. We also have to deal with the production of data coming from the Monte Carlo production, which is, again, a stream of data coming in. I think many of us have focused on the problem of how to optimize reads. The problem of how to maintain a pipeline where there are writes and data has to go in is still an open question. On the Web, we are all focusing, again, on how I can get the document quickly. We are not dealing with how I inject a lot of data into a common space.

Now, when we are going from data to data, then we have this multistage process, where what we are doing is feature extraction, indexing, extracting metadata, compression. It is basically the same thing, depending on how you want to look at the output of it. We have to deal with all these stages. We have to keep track of them. We have to know how the data was produced, the data provenance, and we also have to record how it is done, so that if we want to redo it on the fly, we can do it automatically. In the end, what we have to do is, again, select, project, and aggregate. Now, the selection may involve a lot of distributed fetches. So, I have to figure out where the data is. The index can tell me where it is, but it is distributed all over. Maybe some of it has to be reproduced but, eventually, I get the data. At what level I project, again, depends. Sometimes I want an attribute which is in the metadata, and sometimes I have to go deeper, even into the raw data, in order to get the attribute that I want. The typical approach we have is that we have the whole thing and we do the selection and then the projection. Here, we need something that is more of a semi-join structure, where we look at the attributes and what is going on, and then we go to the real tuples, and the real tuples may be very deep in the hierarchy and may require quite a lot of computing to get the data out.
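
A small sketch of that semi-join-style access pattern follows: the selection is evaluated against the compact metadata level, and only the matching tuples are fetched from wherever in the distributed hierarchy they live, or flagged for re-derivation when no copy exists. The catalog layout and the site names are invented for illustration.

# A sketch of semi-join-style access: run the selection on the small metadata
# level first, then plan fetches of only the matching deep records. The catalog
# layout and site names are invented for illustration.
from collections import defaultdict

# Small, local metadata: one row per event.
metadata = [
    {"event": 101, "n_jets": 4, "site": "fnal", "file": "aod_17.root"},
    {"event": 102, "n_jets": 1, "site": "cern", "file": "aod_02.root"},
    {"event": 103, "n_jets": 5, "site": "cern", "file": "aod_02.root"},
    {"event": 104, "n_jets": 6, "site": None,  "file": None},  # never materialized
]

def plan_fetches(selection):
    """The selection runs on metadata only; the plan says which deep records to
    pull from which site, and which ones must be re-derived because no copy exists."""
    by_site, to_derive = defaultdict(list), []
    for row in metadata:
        if not selection(row):
            continue                      # rejected without touching any remote data
        if row["site"] is None:
            to_derive.append(row["event"])
        else:
            by_site[row["site"]].append((row["file"], row["event"]))
    return dict(by_site), to_derive

fetches, derive = plan_fetches(lambda r: r["n_jets"] >= 4)
print(fetches)   # {'fnal': [('aod_17.root', 101)], 'cern': [('aod_02.root', 103)]}
print(derive)    # [104]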

So, let me try to give you a simple example of what is involved in doing what I would view as the most basic operation on the grid, and try to make it abstract. Let's assume I have an x. I want to apply an F to it. I get the result, which is a y, and I want to store it somewhere, in a location L, which can be my screen or can be a CERN archive. I don't care but, eventually, I have to park it somewhere. What I think we have to keep in mind here, which traditional computing, definitely high-performance computing, has ignored, is that moving this data in and out is an integral part of the problem, not just the computing. So, getting a very fast computation on a high-performance machine, but then waiting several weeks to get the data on and off the machine, is not the solution here. We have to bring in the data placement activity as part of the end-to-end solution.

Here are the six basic steps that we have to carry out in order to do this y = F(x). First of all, we have to find some parking space for x and y. As I pointed out earlier, how do we know how big x is? That is relatively easy. How big y is, is even trickier, because it can depend on a lot of internal knowledge. Then we have to move x from some kind of a storage element to where we actually want to compute. Then we may have to place the computation itself, because the computation itself may not be a trivial piece of software that resides everywhere in this distributed environment. Then, we have the computation to be done. Then, we have to move the results to wherever the customer orders us and, in the end, we have to clean up the space. Just doing this right is tough. I can assure you that you don't do it right, even today, on a single machine. How many of you, when you open a file or you write to a

file, check the return codes of the write, whether it succeeded or not? I am sure that most of your applications will die if the disk is full, and it will take root intervention, in many cases, just to recover. We cannot afford that in a distributed environment, because it has to work on autopilot with a lot of applications. So, what we really have here, if you think about it, is really a DAG, a simple DAG in this case, really a shish kebab: do this, do this, do this, do this. Keep in mind that we have to free the space even if things have failed in between, which creates some interesting challenges here. This has to be controlled by the client, because you are responsible for it. Somebody has to be in charge of doing all these steps and, again, you can look at it, if you really want to, as a transaction that has to go end to end. I think I am sort of running out of time here. Here is a list of challenges. How do we move the data? The closest we can get to it is a quota system on some machines, but even that doesn't guarantee that you can actually write the data when you need it.
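
Here is a minimal sketch of those six steps as the linear chain just described, with the cleanup guaranteed even when an earlier step fails and with failures surfacing instead of being silently ignored. Local directories stand in for storage elements and compute sites; on a real grid each step would go through storage and job-management services rather than the local file system.

# A sketch of the six steps behind "y = F(x), store y at L" as a linear chain
# (the shish kebab). Failures raise exceptions rather than being ignored, and
# the parking space is released either way.
import os, shutil, tempfile

def run_y_equals_f_of_x(x_path, f, destination_l):
    workspace = tempfile.mkdtemp(prefix="grid_job_")       # 1. allocate parking space for x and y
    try:
        staged_x = shutil.copy(x_path, workspace)          # 2. move x from its storage element to the work site
        # 3. "place the computation": f is already local here; on a grid this is its own transfer step
        with open(staged_x) as fh:
            y_value = f(fh.read())                         # 4. run the computation
        y_path = os.path.join(workspace, "y.out")
        with open(y_path, "w") as out:
            out.write(y_value)
        return shutil.copy(y_path, destination_l)          # 5. move the result to the location the customer ordered
    finally:
        shutil.rmtree(workspace, ignore_errors=True)       # 6. clean up the parking space, even after a failure

# Tiny demonstration: count the "events" in a local stand-in for x.
with open("x.txt", "w") as fh:
    fh.write("event 1\nevent 2\n")
os.makedirs("archive", exist_ok=True)
print(run_y_equals_f_of_x("x.txt", lambda text: f"{len(text.splitlines())} events\n", "archive/y.txt"))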

So, I have already talked a little bit about this, because I want to move faster to the examples to show you that we can actually do something with all that, but the approach that we have been taking is, first of all, to make data placement a first-class citizen. That means that when you write an application, when you design a system, you make sure that getting space, moving the data, and releasing it are clear actions that are visible from the outside, rather than buried in a script that nobody knows about and that, if it fails, really doesn't help us much. We have to develop appliances that allow us to use managed storage space in this environment in a reasonable way, and then create a uniform framework for doing it. So, let me show you what we have been able to do so far. The first example is how we can generate simulated events, millions and millions of simulated events.

This is sort of the high-level architecture of what we have deployed out there. The application generates a high-level description of what has to be done. This goes into what we call DAGMan, the Directed Acyclic Graph Manager, which is responsible for controlling it. Now, for some of you, if it reminds you of the old days of JCL, yes, it is like JCL at a higher level, but then it goes to what we call Condor-G, which is the computational part, and we have a data placement scheduler that uses the other tools to do that. So, I am not going to talk about that in detail, since I have four minutes.
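
A loose sketch of that layering: the application hands a directed acyclic graph of work to a manager, which walks it in dependency order and passes computational nodes to one scheduler and data placement nodes to another. The job format and the two print-statement "schedulers" below are stand-ins, not the actual DAGMan, Condor-G, or data placement interfaces.

# A loose sketch of the split between a DAG manager, a compute scheduler, and a
# data placement scheduler. The job format and dispatch are stand-ins only.
from graphlib import TopologicalSorter   # Python 3.9+

def run_dag(jobs, dependencies):
    """jobs: name -> (kind, action); dependencies: name -> set of prerequisite names."""
    for name in TopologicalSorter(dependencies).static_order():
        kind, action = jobs[name]
        if kind == "placement":
            print(f"[data placement scheduler] {name}: {action()}")
        else:
            print(f"[compute scheduler]        {name}: {action()}")

jobs = {
    "stage_in":  ("placement", lambda: "copy input events to site A"),
    "simulate":  ("compute",   lambda: "run detector simulation at site A"),
    "stage_out": ("placement", lambda: "archive the simulated events"),
}
dependencies = {"simulate": {"stage_in"}, "stage_out": {"simulate"}}
run_dag(jobs, dependencies)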

And, just as an aside, the physicists have way too much free time on their hands, so they can generate this wonderful animation. So, here is the way it works. We have a master site. IMPALA is the CMS tool (they have an even bigger detector than the BaBar detector) that is generating the events themselves. This is the master site. Then we have all these other sites to which we send out the computations; we get the data back, publish it, move the data in, and we keep going.

Basically, each of these jobs is a DAG like this, and then we merge them all into larger DAGs that include some controls before and after, and that is the way it works. So, here is an application. This graph covers about 900 hours; this is hours since it started. This is one of the CMS data challenges, and this is the number of events that we have to generate. So, a job runs for two months, and it has to keep going, and we have to keep generating the events. This is what we have been generating using that infrastructure.
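
The merging of many per-job DAGs into one larger DAG, with controls before and after each job, can be sketched as follows. The pre-check and publish steps are invented examples of such controls, not the actual scripts used in the CMS production.

# A sketch of wrapping each production job with a "before" and "after" control
# node and merging the wrapped jobs into one large dependency map.

def wrap_with_controls(job_name):
    """Return the nodes and edges of one wrapped job: pre_check -> job -> publish."""
    pre, post = f"{job_name}.pre_check", f"{job_name}.publish"
    nodes = [pre, job_name, post]
    edges = {job_name: {pre}, post: {job_name}}
    return nodes, edges

def merge_jobs(job_names):
    """Merge all wrapped jobs into a single dependency map a DAG manager can run."""
    all_nodes, all_edges = [], {}
    for name in job_names:
        nodes, edges = wrap_with_controls(name)
        all_nodes.extend(nodes)
        all_edges.update(edges)
    return all_nodes, all_edges

nodes, edges = merge_jobs([f"cms_mc_batch_{i:03d}" for i in range(3)])
print(len(nodes), "nodes")          # 9 nodes: three jobs, each with a pre and a post step
print(edges["cms_mc_batch_001"])    # {'cms_mc_batch_001.pre_check'}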

Let me show you another example of what happens when you have to do it in a more complex environment. That is where we are putting in the planning. So, it is the same architecture that I showed you earlier, but there is a planning box there that is trying to make a decision on when to do it, how to do it, and what resources we should use for this. This is based on work that is actually done at Argonne National Laboratory and the University of Chicago as part of the GriPhyN Project, and it uses a virtual data system that includes higher-level information about the derivations, which are formally defined and, from the derivations, we create transformations, which are the more specific acts that have to be done. This creates the DAGs, but they are not executed directly by the architecture; as we go through, we go back to the virtual data system and say, tell me, now, what to do.
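
Here is a small sketch of the virtual-data idea behind that planning box: the user names the product they want, and the planner consults a catalog of derivations, scheduling only the transformations whose outputs were never materialized. The catalog entries and product names are invented; the actual GriPhyN virtual data machinery is considerably richer.

# A sketch of virtual-data planning: deliver a named product by recursively
# planning its derivation, skipping anything already materialized. The catalog
# entries are invented for illustration.

# product -> (transformation that makes it, inputs it needs)
derivations = {
    "galaxy_size_histogram": ("aggregate", ["galaxy_catalog"]),
    "galaxy_catalog":        ("extract_sources", ["calibrated_images"]),
    "calibrated_images":     ("calibrate", ["raw_images"]),
}
materialized = {"raw_images", "calibrated_images"}   # already sitting in some storage element

def plan(product, steps=None):
    """Return the ordered list of transformations needed to deliver `product`."""
    steps = [] if steps is None else steps
    if product in materialized or any(p == product for _, p in steps):
        return steps                                  # already on disk, or already planned
    transformation, inputs = derivations[product]
    for needed in inputs:
        plan(needed, steps)                           # make sure the inputs will exist first
    steps.append((transformation, product))
    return steps

print(plan("galaxy_size_histogram"))
# [('extract_sources', 'galaxy_catalog'), ('aggregate', 'galaxy_size_histogram')]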

So, this is what we have to do there. We have to aggregate information. We have all these images. We have to pull them together. We have to create distributions, as I showed you, of galaxy sizes or whatever it is. So, this is sort of the DAG that is going up rather than going out. This is an example of one job. This is an example of a collection of these jobs that we are actually executing. Each of the nodes in this DAG is a job that can be executed anywhere on the grid, and this is where we start.

This is the computing environment that we use to process the data, and these are the statistics. I will leave you with this: if you want to write applications that work well in this environment, (a) be logical, and (b) be in control, because if you don't get the right service from one server, you should be prepared to move on to somebody else if you want to use it effectively. Everyone wants lunch.
