Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 1
Proceedings of a Workshop on Statistics on Networks Keynote Address, Day 1
OCR for page 2
Proceedings of a Workshop on Statistics on Networks Network Complexity and Robustness John Doyle, California Institute of Technology DR. DOYLE: I am going to try to set the stage for this meeting. The work that I am going to talk about is the result of a lot of collaborations. I certainly won’t give justice to those people here. All the good ideas and all the good work that I am going to talk about is done with them, so when I say “we” I mean “they.” A lot of them are here today; two of them have posters and Jean Carlson is going to talk later. Walter Willinger is here. I am going to talk a lot about work with them. There are buzzwords related to network complexity—network-centric, systems biology, pervasive, embedded. What I am interested in is the core theory challenges that underlie those common themes. I am going to be a little narrow in my focus in the sense that I am going to concentrate on biological networks and technological networks: I want to stick with stuff where we mostly know how the parts work, so that means not much social networks. There is a remarkably common core of theoretical challenges, so from the math-stat side I think there are some really common themes here. There has been recent dramatic progress in laying the foundation, yet there has also been amazingly, at the same time, a striking increase in what I would call unnecessary confusion. I will talk a little bit about both of these. One of the common themes I am going to talk about is the fact that we see power laws all over. I think a lot of people at this workshop are going to be talking about that, because it is a common trait across all advanced technology and biology. Another common theme is that many of these systems in biology and advanced technology are robust yet fragile. What I mean by that is they work really well most of the time but that they fail occasionally, and when they do fail it can be catastrophically. We will hear more about that from other speakers and poster presenters. What I am going to do today, though, is talk about motivation for new theory and also education. We need to educate each other about even some fairly elementary aspects of statistics and mathematics. To get this started on the broadest level possible, I am going to stick with stuff that only assumes the background provided by a Caltech undergraduate education. Let’s start with some data. The figures below show the 20th century’s hundred largest disasters worldwide. What I have plotted here are three kinds of disasters: technological disasters in tens of billions of dollars, natural disasters in hundreds of billions of dollars (and natural disasters can be even bigger), and power outages over some period of time and tens of millions of customers. What you see on a log-log chart are roughly straight lines with slopes of minus one.
OCR for page 3
Proceedings of a Workshop on Statistics on Networks Figure 2 shows the raw data. I have just taken the event sizes and plotted them; the worst events are along the bottom, while the more common events are higher up and to the left. The significance of this is that the worst event is orders of magnitude worse than the median event, which we are learning much more about this last year. It also means that demonstrations of these behaviors (events of the type represented in the upper left corner of Figures 1 and 2) and reality (which includes the events toward the lower right) can be very different, both in scale and in character, with the more common behaviors being robust and the rare behavior displaying the great fragility of these systems. When we take and build our new network-centric embedded-everywhere systems that some of us in this room are going to provide to all of us, they will degrade, and the reality may be much, much worse. This is also an indication of robust yet fragile. The typical behavior is much smaller than the worst case, and the worst case is very, very bad. FIGURE 1
OCR for page 4
Proceedings of a Workshop on Statistics on Networks FIGURE 2 I am not going to focus so much on disasters. Jean Carlson is going to talk about that tomorrow, but I do want to get some of the ideas behind these kinds of statistics. They are either all you study or you won’t study it at all, and there is not much conversation back and forth between those groups of people. The reason you get straight lines is that logarithms of a power law give you straight lines, and that is the slope alpha. That just describes the data, and I don’t assume any kind of model. If I said I am going to think of this as samples from a stochastic process, that is a model. What I have written down is just data and I have done no statistics on it. The fact that I have drawn a line there is just to help your eye visualize things. One important thing to do is to look at data. One of the things that isn’t often done in papers is people don’t show you their data in a way that you can make judgments about the statistics they have made. We will see that there are lots of errors that arise because of this. Power laws are really ubiquitous. Why is that and what do probability theory and statistics tell us about power laws? Well, if I were to ask you why are Gaussians so ubiquitous you would say it is central limit theorem. If I asked you why the exponential distribution is so popular or so common you would probably talk about the Markov properties or the marginalization properties, which are essentially, mathematically the same thing. They both occur in situations of low variability, so expect to see Gaussians and exponentials all over the
OCR for page 5
Proceedings of a Workshop on Statistics on Networks place. When it comes to systems with high variability, power laws actually have even more strong statistical properties than Gaussians or exponentials. In fact, they have all that Gaussians and exponentials have, and even more. They are in some sense—I think Walter Willinger coined this phrase—more normal than normal. We should not call Gaussians normal; we should call power laws normal. They arise for no reason other than the way you measure data, and the way you look at the world will make them appear all over the place, provided there is high variability. Much of science goes out of its way to get rid of high variability. We study in our labs low-variability phenomena, but high variability is everywhere. With the former, you get convergent moments, while the latter corresponds to divergent moments. Typical notions of moments and mean and variance don’t mean anything. We talk about the coefficient of variation in a Gaussian. That is a well-defined concept in an exponential; it is actually 1. In power laws, of course, it is divergent. The issue is that power laws are not exceptional, but the really important issue is low versus high variability. What about mechanisms? We have large disasters because we have uncertain environments, we put assets at risk, and we only devote a certain amount of resources to ameliorating those risks, and thus sometimes those resources get overwhelmed. So, large events are not really surprising. If we were just to plot the large power outage of August 14, 2003, it is completely consistent with this data. The ramifications of the attacks on September 11, 2001, as a technological disaster, are not off the map. Hurricane Katrina is not off the map. These are all more or less consistent with the largest event. Like I said, I am not going to discuss disasters very much. High variability is much more fundamental and is very ubiquitous, particularly in highly engineered or evolved systems. Power laws are more normal. They shouldn’t be thought of as signatures of any specific mechanisms any more than Gaussians are. It also means that their statistical properties lead people to find them where they aren’t, through statistical errors. They are very abundant in networks, and I will come back to try to talk about in biology, in particular, why they are everywhere in biology. Because of all these properties, there are major errors that are not just isolated instances, but typical in high-impact journals in science, presumably because mathematicians and statisticians play little or no role in writing, reviewing, or on the editorial boards of these journals. One of these errors is variability in power laws. They are badly misestimating the slope or basically, even more profoundly, misunderstanding where high variability comes from in the first place. A typical mistake that is made, taking data that are actually exponential and plotting them in such a way that suggests a power law, is exemplified in the following series of figures. I numerically generated random data with the little Matlab code shown, using Matlab’s random
OCR for page 6
Proceedings of a Workshop on Statistics on Networks number generator (actually a pseudo-random number generator). I generated exponentially distributed data, and on Figure 3’s semi-log plot you see it is nearly a straight line. I checked that I had written my Matlab program right, and that the Matlab random number generator works. It is linear on a semi-log plot because, obviously, you take logs of an exponential and you get linear with slope -a. FIGURE 3 Instead of these rank plots you might think that we could just look at frequencies. Frequencies are obviously related to the rank as shown in Figure 4, even in this case where they are integers. You could write a little program to calculate the frequency plot and do a frequency plot and, lo and behold, a power law as shown in Figure 5, slope 1.5. All of a sudden, I think I have a power law here. In this case I know I don’t, but in some sense I have made a classic error. The idea is that I just plotted in a silly way—again, you are differentiating data—it is noisy and you are going to get problems. I don’t want to belabor this point too much because it is well understood why this happens, there is no mystery about it; it is just a matter of proper education. The statistics community has to explain this to people.
OCR for page 7
Proceedings of a Workshop on Statistics on Networks FIGURE 4 FIGURE 5
OCR for page 8
Proceedings of a Workshop on Statistics on Networks FIGURE 6
OCR for page 9
Proceedings of a Workshop on Statistics on Networks FIGURE 7 We really have got to not do that. You would think that it was an isolated instance but, in fact, it is almost always the case that you will see the plots on the right above are systematic errors. There are a bunch of networks that have been claimed to have power laws in them, and the data that was presented actually didn’t. Figure 8 comes from a paper-on-protein interaction power law, an article from Nature. I got the data and re-plotted it, but this is basically a plot that appears in a supplement. The paper concluded that it is a power law with a slope 1. It doesn’t even have finite mean or variance, if you think of it as a probability model. What does it really have? Roughly it has exponential. Again, just to comment, this is an undergraduate sort of error. If a Caltech undergraduate did this they wouldn’t last long. Of course, this is in Nature and in some sense—I have reviewed a few papers where I have said, no, you have got to plot it—they say, no, if we don’t plot it the way we do on the right, then the editors won’t accept them. I said we have got to change that.
OCR for page 10
Proceedings of a Workshop on Statistics on Networks FIGURE 8
OCR for page 11
Proceedings of a Workshop on Statistics on Networks FIGURE 9 Figures 9-13 deal with an analogous error, from Science, does the power grid have a power law? No, it’s almost a perfect exponential. You can’t get a better exponential than that and yet it was called a power law. You can ask, what about the World Wide Web? Let’s re-plot to make sure of what they were doing. They did logarithmic binning. You might think that helps. In this case, it really might be a power law, but you get the wrong slope. For a power law, slopes of 1.1 versus 1.7 are really different. You are off by orders of magnitude. So, once you get into power laws, the slopes are really important—again the point is, it doesn’t exactly look like a power law, the real data. That is not the big deal. The big deal is it does have high variability. The Web has high variability. In fact, when you look at the Internet, practically everything has high variability. It is almost impossible to find something that doesn’t have high variability, although Internet routers provide an exception. If you look at the router size frequency plot, you could come up with a power law whereas, in fact, it is clearly exponential in the tail because there is an excess of small-degree routers. That is probably because, either at the edge of the network there are a lot of degree-one routers or, in fact, it is probably an artifact of the way the data was taken, and you have to check this, but it is certainly not a power law. These errors are typical, but the big one, I think, is badly misunderstanding the origins of high variability, because high
OCR for page 51
Proceedings of a Workshop on Statistics on Networks FIGURE 51 Hard bounds: The bounds are achievable under certain assumptions, and there is a decomposition theorem that is well known in coding theory about how to achieve that. In control theory there is the corresponding thing which is a little bit less known, and I am beginning to realize the reason it is less well known is because people like me teach this stuff really badly. So, everybody takes their undergraduate course in controls. How many people took an undergraduate course in controls? It is always the course you hated most. I don’t understand this, and it is a pathology and a virus that I have. We have got to get it out of some of this stuff.
OCR for page 52
Proceedings of a Workshop on Statistics on Networks FIGURE 52 Anyway, control theory has a similar thing. You take D and E, you take the transforms and you can define this thing called a sensitivity function. If you think of the other thing as a conservation law based on channel capacity, you can’t exceed channel capacity here. You have another conservation law that says that the total sensitivity is conserved. So, I have this demo, shown in the following illustration. It is harder to move the tip of the pointer around if it is up than if it is down. If I want to go around here, I can move it around real easy—same parts but way harder. This theorem tells us why. It turns out that there is an instability it adds to the problem, and it turns out that the instability gets worse as it gets shorter, and eventually I can’t do it at all, and you can do the experiment yourself at home. You can check out this theorem at home—same thing, hard bounds, achievable solution, decomposable.
OCR for page 53
Proceedings of a Workshop on Statistics on Networks FIGURE 53 These two things have been sitting staring at us for 60 years, which is a total embarrassment. What happens if you stick them together? What would be the best thing that could possibly be true? You just stick them together? There is the cost of stabilization, there are the benefits of remote sensing and there is a need for low latency. That can’t possibly be right, because it is so obvious you would have thought some undergraduate would have done it 50 years ago. The other weird thing is Shannon worked for Bode at Bell Labs. They actually sat and talked about this stuff together. Not only that but it is a hard bound; it is achievable under certain assumptions, with the usual, you first prove it for the added Gaussian case, and then the solution is decomposable under certain assumptions. It is possible to unify these things, and there are actually some undergraduate results that come out right away. This hasn’t been published yet, but it is going to appear at the next conference on decision and control; it has been submitted to IEEE Transactions on Control, and so on.
OCR for page 54
Proceedings of a Workshop on Statistics on Networks FIGURE 54
OCR for page 55
Proceedings of a Workshop on Statistics on Networks FIGURE 55
OCR for page 56
Proceedings of a Workshop on Statistics on Networks FIGURE 56 Here is a claim or probably, more properly, irresponsible speculation. A lot of the complexity we see in biology is dominated by dealing with this trade off, and the deal for techno-networks. We are starting to use communication networks to do control. Biology already does that. Biology is latency driven everywhere. What does the Internet manage? Really latency. We need to have a latency theory of communications. Go to the FAST web site and read about this. This is pretty cool. It is a case where a very interesting new theory was done, global stability with arbitrary numbers, arbitrary areas and delays so you get very great robustness, global stability of a nonlinear system in the presence of time delays. I never thought I would see this. If I had to do it, I wouldn’t. I have good students who did this. One of the co-authors of the paper is here, Lun Li, and you can talk to her about it. She, I think, actually understands it. Anyway, it has been very practical and it has led to the theory of new protocols that, again, here this is macho stuff; this is breaking the world record. What is the theory? The way you connect these things is it is really a constrained optimization problem. It starts out as constrained optimization, but then you really have to redo optimization theory to get it to work in this layer decomposition way. There is actually a coherent theory for the first time of these layered, decentralized, asynchronous protocols. It is a
OCR for page 57
Proceedings of a Workshop on Statistics on Networks nascent theory. That little theorem I showed you before is an example of the kind of unified theory that is now coming along, but it is very much in its beginnings because we had these stovepipe areas chugging along for 50 years with the smartest people I know cranking out theorems in an isolated way. We have to go back 50 years and start putting them back together again. It looks doable and as I tried to show you, we are going to have to think about high variability all over the place. The high-variability statistics is going to be a dominant issue, and it is not necessarily a bad thing. In our disaster world it is a bad thing, but a lot of times this high variability can be exploited. TCP exploits at a huge amount. My cell phone story, if that wasn’t there, TCP wouldn’t work. So, it is a situation where we have got to be able to learn to exploit that more systematically. I think you are probably going to hear about bits and pieces of this from others throughout the next couple of days. I will stop and see if you have any quick questions. QUESTIONS AND ANSWERS QUESTION: A very quick question. In your earlier plot you showed for the power law, you said that apparently it is not power law, but more like exponential. It looks like it is a better fit with the power law than exponential. How do you explain that? DR. DOYLE: It is a statistical error. So, it is a case where you simply are making a mistake. Again, I didn’t really explain the details. This is a standard thing. The idea is that the thing on the right was a differentiated version of the thing on the left. The simple way of thinking about it is, you differentiate it and you create all this noise. There are two advantages to the picture on the left, the rank plot. First of all, I show you the data in its raw form, and you can say whether or not you think it is a power law. There are no statistics done. I show you the raw data. Then you can do a little bit of statistics by just sort of eyeballing the straight line. The idea is that you shouldn't think of a straight line it has to fit. The idea is that we use mean and variance to describe all sorts of low-variability phenomena. We know it isn’t really Gaussian and we don’t check the other moments. We need similarly to find better ways, in a broader sense, to describe high-variability data. All of a sudden you get means of 1 and variances of 100—it is not a useful statistic, it doesn't convert. You take the data over again. Means and variances don’t mean anything in this high-variability world. What you need to do robust statistics is a solvable power law, but you would want to use that even when it wasn't a power law, just like you use mean and variance, so there are robust ways to do this. This is all decades old statistics, and we don't teach it very well. That is one of the
OCR for page 58
Proceedings of a Workshop on Statistics on Networks challenges that we need to do a better job, just of teaching these things. The problem is science has so focused on getting rid of high variability. High variability was thought to be a bad thing, like you are doing bad experiments. As Jean will point out, high variability exists all over in nature. In the laboratory, it only exists in very exotic circumstances, so it is associated with exotica. DR. HANDCOCK: I would like to pick up on that question a little bit, about the identification and characterization of power laws. I think one issue that needs to be discussed, or at least I would like to hear some discussion more of, is the case where statisticians in particular have looked at questions for a long period of time, but they haven’t really been used, that knowledge has not been used in other areas of science. DR. DOYLE: That is absolutely right, it has not been used. DR. HANDCOCK: In this area, I think it is particularly important. Clearly, statisticians have had a lot to say about curve fitting and other issues like that, but what I am most reminded of is the recurrence of these debates. The classic example to my mind is Morris Kendell’s 1961 address to the Royal Statistical Society, where he essentially lambastes previous work done in a half a century before that time on essentially very similar questions here. DR. DOYLE: This is a very old story, a very old story. DR. HANDCOCK: I would like to hear people say more, how can the work of statisticians be more recognized and just routinely used by other sciences, to avoid occurrences of this kind? DR. DOYLE: I will tell you that my experience has been to give them good software tools do it right and make it easy for them to do it right. If you go to Matlab and get the stat tool box, it’s all low variability. So, my experience has certainly been, if we want to get good, new robust control theory into the hands of people, you make software. If you want to get biologists to be able to share models, you have to have a systems biology mark-up language, you have to make software. So, you have to turn your theories into software, and it has got to be usable, it has got to be user friendly. So, we use Matlab and call the SVD, and how many of us know actually how the SVD works, sort of? Well, you don’t need to. The point is, if you do it right, you need to know what it does, but you don’t need to know exactly how it does it. We need to do the same thing for high-variability statistics. It is one thing to lambaste everybody about it. It is another thing to try to create the tools, and show people, teach people. This is the thing we have got to do, we have got to teach people. There are people out there who are discovering this high-variability data everywhere. We are going to hear a lot about it in the next two days, and it is everywhere. The
OCR for page 59
Proceedings of a Workshop on Statistics on Networks irony is, because of its strong statistical properties, you end up finding it where it isn’t, and the stats community can make a lot of contribution. The problem is that I am not sure anybody studies this any more. It is this old, old topic, old classical stability laws; it was hot in the 1920s, it was hot in the 1950s, and maybe it will come back. I think it should come back. DR. KLEINFELD: Implicit in your talk was this notion that in engineering systems things are components, and also implicit in your talk and sort of in the world is that, the way biology seems to design things, they are very much in modules. Is there some sort of theorem that says that one should really build things in modules, like some complexity diverges or … DR. DOYLE: Yes and no. I would say we have nascent, little toy theorems that suggest that these architectures—here the point is, you don’t have modules unless you have protocols. If you look around, you discover the modules first, you see them, you see the parts, but they are meaningless unless there are protocols. None of this stuff would work or hook together unless they had protocols. What we can prove now is that some of these protocols are optimal in the sense that they are as good as they can be. For example, TCP properly run achieves a global utility sharing, that is fair sharing among all the users that use it. What that says is that you can get the same performance as if you had a centralized non-modular solution, but you can get it with a protocol that is both robust and evolvable. Now, is there any other way to do it? We don't know. You prove that this is optimal and that this is robust. So, there is a lot more work to be done. I mean, it is only the last few years that we have even had a coherent theory of how TCP/IP works. So, this is quite a new area. I think that is an important question. How much further can we go in proving properties? What we need to design now, as engineers, we need to design protocols. We need to design protocols that are going to run our world, and we need to do them and make them robust and verifiable, and scalable. Now we don’t do that very well. In biology, what we are doing is reverse engineering the protocols that evolution has come up with. Unfortunately, the good news is, it looks like it uses more or less the same protocols. So, it is not going to be incomprehensible. It could just be this infinite parts list. If we just made a parts list of everything on the Internet, it would be completely bewildering, we would never make any sense of it. Because we know the architecture, and because we know the protocols, we know what everything is doing. We have got to do the same thing with biology.
OCR for page 60
Proceedings of a Workshop on Statistics on Networks REFERENCE Han, J.D., et al. 2004. “Evidence for Dynamically Organized Modularity in the Yeast Protein-Protein Interaction Network.” Nature 430.
OCR for page 61
Proceedings of a Workshop on Statistics on Networks Network Models