Proceedings of a Workshop on Statistics on Networks

Visualization and Variation: Tracking Complex Networks Across Time and Space

Jon Kleinberg, Cornell University

DR. KLEINBERG: Thanks very much to the organizing committee for inviting me to come speak. This has been a very interesting workshop for me, especially seeing how the techniques across all these different areas make use of common principles. Obviously there is a lot we are going to have to think about afterwards. I'm a computer scientist, so I'm coming at this from a computer science perspective. In particular, the kinds of networks that I'm interested in are the large information networks that are built up on top of the applications on the Internet, starting with things like the World Wide Web and spilling over into the networks that shape the way we communicate through things like email and instant messaging, the way we express interest in hobbies through online communities like LiveJournal, and the way we do electronic commerce, as indicated in Figure 1.

FIGURE 1

There is this whole, what some people call "speciation," event on the Internet, where we see all these new forms of media springing up very suddenly. We are looking at the emergence of not just the technological structure here, but also the social structure. This notion that there are interesting social structures going on in these online applications is something that has interested
me for quite a while, starting back with some work I did on the World Wide Web, where the question was how to use the link structure of the Web, and the fact that these are documents authored by people, to help improve search. This is something that combines the social and the technological aspects very clearly, although I haven't always been so good at conveying this.

I have been at workshops of this flavor for some time. A few years ago I gave a talk at a very early workshop that brought together some computer scientists and sociologists. I talked about links on the Web and kept casually referring to the Web as a social network without really trying to unpack what I was saying. This was at the Santa Fe Institute, and during the break some of us were standing out on the patio looking at this beautiful view of Santa Fe when a very eminent sociologist came up next to me smoking a cigar and asked, "Did you see that talk by the guy from Cornell? That was terrible. He thought he was talking about social networks, and all he was talking about was how Google worked." After explaining that I was that guy from Cornell, we actually had a reasonable conversation under the circumstances. That was what I view as one of the early steps in learning about some of the cultural gaps that exist between these different areas.

I think that, more than anything I have done, or social scientists have done, or workshops like this have accomplished, it has simply been the unbelievable profusion of these online media that has made it abundantly clear that all of these technological things we are seeing are in the end about content authored by people and people communicating with each other. In reality it's not so much that they both exist but that they are really intertwined. You can't disentangle the technological networks from the social networks anymore.
If I were to ask what are the networks on which email communication, instant messaging, blogging, or any of these other things take place—sure, your packets are traveling over the Internet, but if you are trying to configure a new instant messaging system you have to know a lot about the kinds of people that are going to be using it. You have to know their diurnal rhythms and whether it's going to be used by teenagers or by grown-ups and so forth. All this affects the way in which we configure the technology, so in a real sense these things are inextricable at this point.

A particular thing I wanted to focus on here, in the context of these social and technological networks, is the way in which they are embedded in space, embedded in time, and embedded in surrounding organizational structures. One thing that I wanted to get at is that we shouldn't just view the networks in isolation, but should think about them as being physically embedded, as concretely as we view networks in neuroanatomy as being physically embedded. I'm going to be talking about network models, as have most of the other talks at this
workshop, reflecting our interest in the structural properties of networks, but also, particularly for me, in the processes unfolding on networks—the algorithmic processes that unfold on networks—and trying to find models that relate these to each other. In the end, hopefully, even very simple toy models—and in this talk I'll try to go through a progression of toy models, up through more and more general models—don't just help us form a vocabulary of patterns, although that is certainly a very useful thing; in the computer science context, it's often very surprising to me how quickly those turn into design principles. In the online world, the ability to create new online media like peer-to-peer file sharing gives you a certain kind of flexibility, such that things that had seemed like toy artificial models can very rapidly become principles by which you actually architect a system. That is one of the interesting things about this domain also.

I also wanted to mention one thing that has been lurking beneath the surface of some of the talks, on the social science side, and that is the notion of working with model systems. I think, especially for the statistics aspect of this workshop, this is going to be something important to think about. So, if I could caricature the traditional research obstacle in studying complex networks, it would be that you'd like your networks to be large-scale, realistic, and completely mapped, but in actuality you can achieve only two of those. For obvious reasons it is very hard to achieve all three at once.

FIGURE 2

So, just as Mark Handcock stole a slide from Martina Morris, I'll steal a slide from Mark.
Figure 2 shows his slide of the Zachary Karate Club network, and the only thing I want to point out here is that this is clearly a real network of real people, and it's completely mapped. We know a lot about it, but the nature of how data like that are collected inherently limits the size. In his after-dinner talk last night, Steve Borgatti mentioned that a trend over the last few years has been to see networks go from sizes in the hundreds to sizes in the tens of thousands or hundreds of thousands. In the process of doing that, we have to ask what we are losing. We are obviously gaining a lot.

Many of these systems are what I would call online model systems. When I say model systems, I don't mean that there is anything fake or synthetic about them. All of the people involved in the activities listed on Figure 2 are real people engaged in social interaction. It's simply that we don't know as much about them as we might have known about smaller, more self-contained systems such as the Karate Club network, and it's hard to draw their boundaries. I'm thinking about things like the links among Web pages, or Web logs and how the communities of bloggers interact; social networking services like LiveJournal or Orkut or any of the other services that are actually trying to create a social network on their members; and email and instant messaging within organizations, which define network interaction. For online e-commerce sites, there has been some amount of interesting research simply on the behavior of people at these sites, because of course there is still the interaction of product reviews, forum discussions, replies, posts, and so forth, and then the download behavior of these people. We can then pick—and this is a line of work that Mark really stimulated my thinking on—something like the arXiv, which is a model system of, in this case, high-energy physicists, all publishing papers, co-authoring, citing each other, and so forth.
It's a very rich domain in which we can try studying things on a very large scale. It was this, for example, that led three of us at Cornell—Paul Ginsparg, the founder of the arXiv, Johannes Gehrke, and myself—to choose it as the data set for the annual KDD Cup competition, basically saying here is a very rich system in which you can do data analysis along all sorts of dimensions, looking at real social data. There are also things that we would like to know about the people in this network that we don't, but for looking at very-large-scale data, so far this is the best we can do.

Figure 3 gives an overview of my talk. The hope is to show how all these themes can be traced out through a single particular topic, small-world networks and decentralized search. I could have easily picked several other topics.
FIGURE 3

We'll start with an initial model, the small-world model of Watts and Strogatz, which many of you know, and then we'll move from that, which is really a structural model, looking at the structure of the network, to a model about processes on top of the network. In particular, the problem of searchability of networks is really about the process that operates on the network. We'll start with a very simple model of that, and then I'll try to pull back to create more general models. In particular, by abstracting a general pattern that's going on here, we'll start looking at both data and design principles. Then we'll try to find that pattern in large-scale network data. We'll see some rather surprising ways in which it has shown up in hyperlinks on the Web, email communication, and friendships in these online communities, and also how it's been used as a design principle in things like peer-to-peer systems. And finally, I'll describe some recent work I've been doing with Christos Faloutsos and Jure Leskovec looking at how these networks evolve over time, and some things that actually completely challenge the conventional wisdom we had on this. And necessarily, since my time is short, I will go through all of these rather briefly.
FIGURE 4

To introduce small-world networks, rather than reach back to the Watts-Strogatz model, I wanted to reach much further back to the original work in social psychology that stimulated a lot of this, Stanley Milgram's small-world experiments. Both models are summarized in Figure 4. To briefly recap Milgram's experiment for those of you who haven't seen it before, the idea was to trace out short paths in a social network, particularly the network of acquaintanceships in the United States. He picked a target person, who lived in a suburb of Boston, whom his subjects were supposed to reach, and he picked a whole bunch of starting people, mainly drawn from out in the Midwest. Each starting person would start with a letter that they had to forward to this target, and rather than mail it directly to the target, they had to use what they were told about the target and mail it to one of their friends, with the goal of reaching the target in as few steps as possible. Those were the rules of the experiment. You get a letter, you're trying to reach a stockbroker in Sharon, Massachusetts, and you think about the things that you might know about him and what you know about your friends. You figure out who you should pass it to, and the famous outcome of this was that the median number of steps in the completed chains was six, a number that has entered into the phrase "six degrees of separation."

There are a few things going on with this experiment, and I don't have time to unravel them all, but one is clearly the structural issue. Why is it that our social networks, in which our friends all seem to know each other—which, as a number of the talks have said before, look fairly clustered—should have these short chains? The highly influential Nature paper of Watts and Strogatz in 1998 can really be viewed as a proposal for a kind of network that we ought to be
studying. More than any particular result, it really said: let's think about networks as superpositions—the notion of a random network embedded in some reference frame, in this case maybe a spatial reference frame. To take one of the flavors of their model, we can imagine that in a very simple version we have a grid with nodes, and these might represent the people in the social network, knowing something about their geographic coordinates. We then add random links, even as few as one per node; an example is shown in red in Figure 4. These random links are enough to make the diameter of the world very small, even while locally it looks very clustered. So, adding a very small number of random links, the network still looks relatively grid-like, even though we now know that its diameter has become very small. We can actually prove things like that using results that have been known for some time in the area of random graphs, in particular the results of Bollobás and Chung (1988), which studied almost exactly this network. So this was the idea: think of these networks as a superposition—for our purposes, a random network embedded in some reference frame, be it spatial, organizational, or what have you.

When I first came across the Watts-Strogatz work—which wasn't hard, because they were also at Cornell at the time—the thing that really struck me about the original experiment was not just the structure. It was not just about the fact that there were these short chains; it was about a process. As we teach our undergraduates in computer science classes, if you want to find the shortest path in a network you shouldn't have done this experiment. You should have given a letter to someone and said, send this to all your friends, and tell them to send it to all their friends, and eventually you will see what the shortest chain to the target is.
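The superposition idea—local grid edges plus a handful of uniformly random shortcuts—is easy to check numerically. The sketch below is my own illustration, not code from the talk; the grid size, shortcut count, and BFS sample size are arbitrary. It builds an n × n grid, optionally adds one uniformly random link per node, and estimates the mean shortest-path distance by breadth-first search from sampled sources:

```python
import random
from collections import deque

def grid_with_shortcuts(n, shortcuts_per_node=1, seed=0):
    """Build an n x n grid; optionally add uniformly random long-range
    links (the Watts-Strogatz-style superposition)."""
    rng = random.Random(seed)
    nodes = [(i, j) for i in range(n) for j in range(n)]
    adj = {v: set() for v in nodes}
    for i, j in nodes:
        for di, dj in ((0, 1), (1, 0)):            # local grid edges
            u = (i + di, j + dj)
            if u in adj:
                adj[(i, j)].add(u)
                adj[u].add((i, j))
    for v in nodes:                                # random shortcuts
        for _ in range(shortcuts_per_node):
            w = rng.choice(nodes)
            if w != v:
                adj[v].add(w)
                adj[w].add(v)
    return adj

def avg_path_length(adj, samples=30, seed=1):
    """Estimate mean shortest-path distance via BFS from sampled sources."""
    rng = random.Random(seed)
    total, count = 0, 0
    for s in rng.sample(list(adj), samples):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count

grid_only = grid_with_shortcuts(30, shortcuts_per_node=0)
with_links = grid_with_shortcuts(30, shortcuts_per_node=1)
print(avg_path_length(grid_only), avg_path_length(with_links))
```

On a 30 × 30 grid the average distance drops sharply once the shortcuts are added, even though almost all edges remain local—the effect the Watts-Strogatz superposition is designed to capture.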
You can think about the obvious reasons why Milgram wasn't able to do that experiment; he was therefore forced to embark on this much more interesting one, where you actually have to tunnel your way through, making your best guess at the next hop. Essentially, what he was doing was decentralized routing, in a completely non-technological context, on the social network of people. That's the first thing I would like to try modeling here.

So, we would like to model a decentralized algorithm looking for a path through a network. Figure 5 describes a very basic model: imagine there is a current message holder who knows the grid structure, their coordinates on the grid, and the coordinates of the target on the grid. They don't know all the random links, but they know their own random links. So, this simply says that when you get the letter, you know where you live, you know the target is in Sharon, Massachusetts, and you know you have friends in New Jersey and Montana and Singapore and so forth, and you want to try to figure out how best to direct the letter. That's the idea. That's why
it is decentralized.

FIGURE 5

The gold standard here, which is going to pop up throughout a lot of the results, is this difference between things that grow logarithmically in n versus things that grow linearly in n; or, more generally, things that grow like polynomials in log n versus polynomials in n. The network has n × n nodes, but we want the delivery time to be exponentially better than that—something like log n or log n squared. That would somehow tell us that the world is small: we can get there in a short number of steps.

The funny thing here is that, although one could have asked this question immediately upon seeing the Milgram experiment, the Watts-Strogatz model, although not intended for this purpose, is really ideal as the kind of minimal model you would need if you wanted to ask this question. If you want to ask this question in an interesting way, what are the minimal ingredients you need? Well, you need a network that is partially known and partially unknown. A completely known network of course leads to shortest-path algorithms, but that's a very different subject; finding your way in a known network is basically the Mapquest problem. A completely unknown network is equally different: then we have nothing to go on. The known part should have high diameter, or else it would be too easy; the full network should have low diameter, so that there is a solution to find. And we want to explore this relation between the
known and the unknown parts. How does the random network relate to the way in which it's embedded in the spatial or organizational structure, or whatever? Again, if you are worried that this model is too simple—obviously, in a million ways it's too simple. To the extent that the grid in Figure 4 is a proxy for anything, maybe it's a proxy for geographic information. We know people in the Milgram experiment also used occupational information and other things. I'll try folding some of these things in, but I want to start with an intentionally very simple model, as shown in Figure 6.

FIGURE 6

The Watts-Strogatz model is a bit too simple. The first theorem you can prove here is a negative result, an impossibility result. The network has n × n nodes, and there really is a path of length about log n—there was a solution to find—but provably, with local information, you can't find it. Any decentralized algorithm, however hard it computes, is, simply through its lack of information, going to need at least about n to the ⅔ steps to find its way; and ⅔ is actually the right answer here—you can also achieve this—but it is exponentially larger than what we want. So, that's what happens when you have a simple toy model: you have this striking phenomenon, you try analyzing it, and sometimes there is not enough in there.

So, let's try generalizing the model just a little bit, just minimally, to try to do something. We can add one more knob to turn, and I'll call it a clustering exponent r. I'll still add
a random long-range link out of each node, as you see in Figure 6, but it's going to be tuned by this parameter r, which basically says: if I'm node v, I don't actually choose my long-range link uniformly at random; rather, I take into account the spatial structure and choose w as the other end with probability proportional to the lattice distance to the minus r. So, think about the effect r has: when r = 0, that's the uniform distribution and I'm back to the Watts-Strogatz model. As I crank r way up to some large number, I go from the network on the left in Figure 6 to the one on the right, where long links are very unlikely; the network goes back to having a large diameter, and my long-range links are not really helping me that much. I'm in this funny spectrum where I know that when r = 0, the network is too confusing—there were short paths, but I couldn't find them—and when r is very big, there are no short paths to find. The question is what happens in this middle region as we trace it out. At least we have a one-dimensional family we can study.

The answer is that something kind of funny happens. There actually is an optimal operating point for this network, as shown in Figure 7, and that's when r = 2. When I choose links with probability according to an inverse-square distribution, I can actually achieve this gold standard of log squared n—a polynomial in log n. As I move away from r = 2 in either direction, things get worse rapidly, at least in an asymptotic sense: things become n to some power, where the power depends on how far away r is from 2. Two was somehow the right operating point for this network.

FIGURE 7
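As an illustration of the model and of greedy decentralized routing, here is a small simulation. This is my own sketch, not code from the talk; the theorems are asymptotic statements, and at a tiny n the differences between values of r are modest, so treat the numbers as qualitative. Each node gets one long-range contact drawn with probability proportional to d^(−r), and the message is always forwarded to the known contact closest to the target in lattice distance:

```python
import random

def lattice_dist(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def make_long_links(n, r, rng):
    """One long-range contact per node, chosen with probability
    proportional to (lattice distance)^(-r), the clustering exponent r."""
    nodes = [(i, j) for i in range(n) for j in range(n)]
    long_link = {}
    for v in nodes:
        others = [w for w in nodes if w != v]
        weights = [lattice_dist(v, w) ** (-r) for w in others]
        long_link[v] = rng.choices(others, weights=weights)[0]
    return long_link

def greedy_route(n, long_link, s, t, max_steps=10_000):
    """Greedy decentralized routing: forward the message to the known
    contact (grid neighbour or own long-range link) closest to the
    target in lattice distance; count the steps taken."""
    v, steps = s, 0
    while v != t and steps < max_steps:
        i, j = v
        candidates = [(i + di, j + dj)
                      for di, dj in ((0, 1), (0, -1), (1, 0), (-1, 0))
                      if 0 <= i + di < n and 0 <= j + dj < n]
        candidates.append(long_link[v])
        v = min(candidates, key=lambda w: lattice_dist(w, t))
        steps += 1
    return steps

rng = random.Random(0)
n = 20
for r in (0.0, 2.0, 4.0):
    links = make_long_links(n, r, rng)
    trials = [greedy_route(n, links,
                           (rng.randrange(n), rng.randrange(n)),
                           (rng.randrange(n), rng.randrange(n)))
              for _ in range(50)]
    print(r, sum(trials) / len(trials))
```

Because some grid neighbor always strictly decreases the lattice distance to the target, the greedy route is guaranteed to terminate within the initial distance; the asymptotic claim is that its expected length is polylogarithmic only at r = 2.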
Let's think about that. Figure 8 gives some idea of how you prove something like that, and then we can go on to how we might extract what we have actually learned from this, beyond the fact that we are on a grid, beyond the fact that we have one long-range link, and so forth. But even in this very simple system there is obviously something to be proven. Referring to Figure 8, let me tell you how to prove the positive result with exponent 2, and I'll skip the negative results for the sake of time. The basic idea is that I say I'm trying to get to this target t and I'm at this node v. Rather than try to analyze the whole sequence of steps from here on, I want to ask a simpler question: what's the probability that in this very step I halve my distance to the target? To halve my distance to the target I have to jump from distance d into this ball of radius d/2 around the target. You can basically compute the probability mass in this ball and show that it's about 1/log n, independent of d. The d cancels out, which is the key here.

FIGURE 8

Basically, with probability 1/log n you halve your distance to the target right away. Therefore, every log n steps or so you expect to halve your distance. The distance can only be halved log n times, and so the delivery time is proportional to log n times log n. That's the one-slide outline. My interest here is to try to figure out whether we can conclude something more
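The halving argument above can be written out a little more explicitly. This is a schematic reconstruction with illustrative constants, following the one-slide outline:

```latex
% Normalizing constant for v's long-range link at r = 2: at most 4j
% nodes lie at lattice distance exactly j from v, so
Z \;=\; \sum_{w \neq v} d(v,w)^{-2}
  \;\le\; \sum_{j=1}^{2n} (4j)\, j^{-2}
  \;=\; 4 \sum_{j=1}^{2n} \frac{1}{j} \;=\; \Theta(\log n).
% If the current distance is d(v,t) = d, the ball
% B = \{ w : d(w,t) \le d/2 \} contains \Theta(d^2) nodes, each within
% lattice distance 3d/2 of v, hence
\Pr\bigl[\text{$v$'s long link lands in } B\bigr]
  \;\ge\; \frac{\Theta(d^2)\,\bigl(3d/2\bigr)^{-2}}{Z}
  \;=\; \Theta\!\left(\frac{1}{\log n}\right),
% independent of d: each halving takes O(\log n) expected steps, the
% distance can halve at most \log_2 n times, so delivery takes
% O(\log^2 n) steps in expectation.
```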
[…]
The first thing I thought when I saw the scatter plot was, isn't that weird—no one out west is using LiveJournal. But actually this is the same thing you would see if I just plotted the population density of the United States. They are indistinguishable at the level of the scatter plot.

FIGURE 13

LiveJournal is sampling pretty broadly, but it's very non-uniform. The ball around somebody using it out in eastern Nevada is going to grow very gradually compared with the ball around someone using it on a particular block in Manhattan. What do we do about that? One thing we can do is use groups, because groups didn't require a grid: simply say the groups are the closest k people to me, for each k—all the balls centered on all the points. Liben-Nowell et al. took something that was slightly easier to analyze in their context. It's very similar; it's what they called rank-based friendship. Node v simply rank-orders everyone in the world by proximity. So the center of the circle in Figure 14 is node v, and their closest people are the small circles within that circle, whose radius is the distance to the person ranked 7th. Then what you do is link with probability somehow decaying in the rank: I'm more likely to link to the nth closest person than to the (n+1)st closest person. What they did was prove a different generalization: rather than generalizing to group structures, they generalized to rank-based friendship and proved an analog of the inverse-square result, as shown in Figure 14.
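The connection between rank-based friendship and the inverse-square law can be made explicit. This is a schematic derivation; the constant c and the uniform-density assumption are for illustration:

```latex
% Rank-based friendship: v links to w with probability proportional to
% 1 / \operatorname{rank}_v(w), where
\operatorname{rank}_v(w) \;=\; \bigl|\{\, u : d(v,u) < d(v,w) \,\}\bigr|.
% Under a uniform two-dimensional population, a node at distance d has
% about c\,d^2 closer nodes, so
\Pr[v \to w] \;\propto\; \frac{1}{\operatorname{rank}_v(w)}
  \;\approx\; \frac{1}{c\, d(v,w)^{2}},
% recovering the inverse-square law as the uniform-density special case.
```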
FIGURE 14

Efficient routing is possible for a nearly arbitrary population density—again, under mild assumptions—if the link probability decays like 1/rank. Why is this a generalization? Because under a uniform density, a person at distance d has rank about d², since there are about d² people closer; this gives me the inverse-square law in that special case. Now they have a result saying what would be optimal for searchability, they have 500,000 people with zip codes, and they can look at the data. The punch line, which made me very happy, is that LiveJournal friendships actually approximate 1/rank quite closely. Time prevents me from getting into a lot of what they did here, but these lines are basically slope 1 and 1.05, and the chart in Figure 14 is one particular view of that data. For the Hewlett-Packard Labs email network, you could have argued that somebody sat down and drew up the org chart, and maybe they were thinking about this. This is weirder still: LiveJournal somehow configured itself to be like this. I don't have a good explanation for that.

I should also mention that, independent of searchability, this is a little surprising, because LiveJournal is a purely virtual online community. If someone had asked me what I expected, I would have said a mixture model: you log on and you immediately link to all your high school friends who live in the same zip code as you, so I get a big spike there; then I meet people utterly at random, and it's kind of like a uniform distribution on the rest. The point is that that is not what happened. If you lived in Nevada, you were nonetheless more likely to link to somebody in Nebraska and
Wyoming than to someone in New York City, and that's bizarre. There is a lot more that should be looked at in this data, so one hesitates to draw sweeping conclusions yet, but it's very intriguing.

In the final few minutes I have remaining I want to talk about not just spatial embedding, but how things evolve over time, because here, too, there tend to be some surprises that actually make us go back and rethink some things we need to be looking at. The question we wanted to ask was really a very simple one, stated in Figure 15: take one of these large model systems—we'll take the arXiv, because it works very nicely for our purposes—and ask how the degrees of separation change in the network as it grows over time. We can define degrees of separation as, say, the 90th-percentile distance among all pairwise distances, though we have tried other definitions. Why in the world has no one studied this on some large-scale system? Well, it's computationally hard to answer this question. You want a network you can run up to any point in time and stop. You would like to ask this question every month or every week, and every week that means you have to solve an all-pairs shortest-path problem on a 100,000-node network. This is computationally somewhat demanding, so we were actually able to take advantage of some nice software that one of my co-authors wrote for approximate all-pairs shortest paths in a large graph—some things Christos Faloutsos worked on—to track this effective diameter in several networks over time.
FIGURE 15

We looked at model systems beyond the arXiv, so in Figure 16 you will see several plots. Time runs on the x axis, and the effective diameter of the network runs on the y axis. The original plan was: we'll look at this and try to see which slowly growing function of n best fits the diameter—was it log² n, was it log n, or some other function? Given that plan, we found these plots a bit surprising, because slowly growing functions, as you know, go up, and these were all going down. The point was, despite the fact that the network was growing—and some of these networks were growing at an exponential rate over time—the effective diameter, which is basically a smoothed version of the 90th percentile that I can tell you more about offline, was going down as the network grew. The network was growing, but the degrees of separation were actually shrinking.

I have a few lines here which are probably better explained in question-and-answer form, but basically, there is a big missing-past issue here. We are coming in and observing networks starting at a particular time.
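For concreteness, here is one simple way to estimate an effective diameter by BFS from sampled sources. This is my own sketch of the general idea—the approximation scheme used in the work described is more sophisticated—and the quantile convention and sample size are arbitrary choices:

```python
import random
from collections import deque

def effective_diameter(adj, q=0.9, samples=50, seed=0):
    """Estimate the effective diameter: the smallest distance D such
    that at least a fraction q of connected node pairs are within D.
    Exact all-pairs BFS is O(n*m); sampling sources keeps the
    computation tractable on large graphs."""
    rng = random.Random(seed)
    nodes = list(adj)
    dists = []
    for s in rng.sample(nodes, min(samples, len(nodes))):
        dist = {s: 0}
        dq = deque([s])
        while dq:                         # BFS from the sampled source
            u = dq.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    dq.append(w)
        dists.extend(d for v, d in dist.items() if v != s)
    dists.sort()
    return dists[int(q * (len(dists) - 1))]

# Toy usage on a 12-node ring, whose distance structure is easy to check.
ring = {i: {(i - 1) % 12, (i + 1) % 12} for i in range(12)}
print(effective_diameter(ring))
```

On the 12-node ring, 90 percent of node pairs are within distance 5 (only the antipodal pairs are at distance 6), so the estimate comes out to 5.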
FIGURE 16

With the arXiv, we technically have no missing-past problem, because we have the arXiv from its very first paper. But in a very real sense we do have a missing-past problem, because the arXiv got plopped down in the middle of a lively, thriving research community that had been exchanging preprints for a long time; we only started measuring it at time "zero." One worries that what these results really show is the process of catching up with a stationary structure the network already had. I can't say that that's not what is going on, but one can look at the decrease in diameter in many different ways. For example, one can artificially cut off the past at any number of years you want. You can say, suppose we only started measuring in 1995. That's what the dotted line in the left-hand graph represents, and you get this initial massive swoop down as the network catches up with links that essentially were already there. That effect quickly goes away, and we still see a steady decline in diameter long after it appears to be negligible. One can also do things where you keep the network intact but only start measuring later—I would have to explain this more slowly—and it, too, catches up with this. So, long after a lot of the edge effects go away, we are still seeing this steady march downward.

Maybe in retrospect I should have seen this coming, because back on Figure 5 I said our gold standard is log n: we have n nodes and we want the diameter to grow like log n. That's sort of the graph
Proceedings of a Workshop on Statistics on Networks theorist’s take on what the small world property means. It means that you grow exponentially slower than the size of the network. The common sense take on what the small world property means is that even as the world is growing, there are forces out there that are causing us to become more closely connected; in other words a decreasing diameter. Intuitively, maybe this has been consistent with the common sense intuition behind the small world property all along. It was simply our random-graph background that caused us to inject this log imagery. Having said that, log n and log2 n and these slowly growing functions are a very useful modeling tool for the kinds of analyses I have been talking about up until now, which was not of a temporal trend nature but simply looking at static snapshots, and talking about asymptotics as a way of capturing these. But it says that there is a lot to do here. A related property, one reason that sort of makes it easier to believe why it might be going down is that the number of edges also decreases, so the average degree in the network is not constant. If you compare now versus 10 years ago, people cite more papers, they author more papers, they co-author with more people. Everything in the arXiv network is densifying as we go forward, and clearly, any model is going to have decreasing diameter. There are models that we have been thinking about to do this. For example, one way, at least through simulation—this seems very hard to analyze—but through simulation if I have a model where people enter a social network via some point of contact, you are introduced into the network through a sort of epidemic process in which you befriend some of that person's friends, some of their friends, and so forth. Those are your links, and now we iterate, add a new person. 
Through simulation, one can set the parameters so that the network densifies and the diameter goes down, but it seems very hard to prove anything about that kind of model. At least there are models out there that seem plausible and might capture this kind of thing.
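The growth process just described can be sketched in a few lines of stdlib Python. This is a minimal toy in the spirit of the "burning" models in Leskovec, Kleinberg, and Faloutsos (2005), not the exact model from the talk: each newcomer picks a contact, then befriends neighbors-of-neighbors epidemically with an illustrative burn probability. Snapshots of the growing graph let you watch the average degree and a sampled 90th-percentile ("effective") diameter over time.

```python
import random
from collections import deque

def bfs_dists(adj, src):
    """Hop distances from src via breadth-first search."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def effective_diameter(adj, quantile=0.9, sources=50, rng=None):
    """Estimate the 90th-percentile pairwise distance from sampled sources."""
    rng = rng or random.Random(0)
    nodes = list(adj)
    ds = []
    for s in rng.sample(nodes, min(sources, len(nodes))):
        ds.extend(bfs_dists(adj, s).values())
    ds.sort()
    return ds[int(quantile * (len(ds) - 1))]

def grow(n, p=0.4, seed=0):
    """Each newcomer picks a random contact, then 'burns' outward: every
    neighbor of an already-befriended node is befriended with prob. p."""
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        contact = rng.randrange(new)
        burned, frontier = {contact}, deque([contact])
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                if v not in burned and rng.random() < p:
                    burned.add(v)
                    frontier.append(v)
        adj[new] = set(burned)            # link the newcomer to all burned nodes
        for b in burned:
            adj[b].add(new)
    return adj

adj = grow(400)
for cut in (100, 200, 400):               # snapshots as the network grows;
    # every edge points "backward" in time, so the induced subgraph on the
    # first `cut` nodes is exactly the network as it stood at time `cut`
    sub = {u: {v for v in adj[u] if v < cut} for u in range(cut)}
    deg = sum(len(s) for s in sub.values()) / cut
    print(cut, round(deg, 2), effective_diameter(sub))
```

With the burn probability set this way the simulated network densifies (average degree rises across snapshots) while the effective diameter stays flat or shrinks, matching the qualitative behavior described above; the specific numbers depend entirely on the illustrative parameters.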
FIGURE 17 Figure 17 presents some reflections on various things and suggests some other questions. As I said, the set of themes that I wanted to trace through this was the way network models, even very specific models, become more general patterns and then design principles, and how, in the end, with enough patience, you can actually find data sets that really start to bear some of these things out. The shrinking-diameters result is again very suggestive of more work that could be done here, and in general I think we are only starting to have the computational ability to look at very large networks and snapshot them densely over time. I think it suggests that there are more surprises coming. One network that we would like to try to understand over time is the Web itself. There is currently a large effort going on at Cornell involving computer scientists and also people in the sociology department like Michael Macy and David Strang; in particular, we are trying to get a handle on Web evolution over time and what questions we can ask about it. In small-world networks there are any number of other interesting questions one can ask, which I haven't touched on here. One, for example, which came up in a paper a year ago by Christos Faloutsos, Kevin McCurley, and Andrew Tomkins, was roughly the question of when short paths are meaningful. Intuitively there is this tempting connection between social networks and, for example, homeland-security applications, in which you discover statements like: my next-door neighbor once sublet an apartment from a person who took a class with the brother of a 9/11 hijacker. You read in the news that social network analysis is going to cause us to discover things
like this, and that's going to be very useful. But we are really lacking any sense of how you score this: is this particular short path meaningful? Isn't the whole point of the small-world phenomenon that I could say something like this about almost anyone? I would say that we are really lacking (and I think this is an interesting direction for the statistics aspect of this workshop) some way of talking about when short paths in the network are meaningful, and when they simply suffer from what you could call the small-world problem: that you see short paths everywhere. Another area where this blends into economics and game theory (this is some work I have been doing with Prabhakar Raghavan, who is the head of research at Yahoo, where they are very interested in social media) is the notion that a lot of these online networks like LiveJournal and Orkut and Friendster and LinkedIn, and you can name any number of these, attracted a lot of excitement a year ago based on the idea that they would become big marketplaces for information and services. What's the value to you in joining something like Friendster or LinkedIn or Orkut? Well, maybe a year from now you are going to need to rent an apartment in Boulder, Colorado, so you ask your friends: do you know anyone who has a room to rent? They ask their friends, and gradually an answer percolates back. Not only is it an answer, but it somehow has been vetted by your friends. That was some of the appeal of these networks, but once you start thinking about this as a marketplace for information and services, you have to ask about the incentive issues at work. If you are actually offering incentives to people, and they have to act essentially as middlemen for these products and services, then when does this function efficiently?
When do things break down simply because of the friction of everybody skimming off the money that you offer them? It turns out there are some interesting transitional behaviors here, which relate to branching processes, and some sharp transitions between networks that function efficiently in this mode and networks that don't. There is something qualitatively important about an effective branching factor of 2; I can't really get into it here, but there is a lot to think about at this interface of small-world networks with the economic, game-theoretic take on them. In general, I think this is a window into a lot of interesting questions about these kinds of networks, many of which can be studied with some of the model-system data that we have. We are trying to accumulate more data, and I think in general there is much to be done here. I would be happy to talk about it more offline, but for now I'll stop here.
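The "small world problem" raised above, that short paths are everywhere and so a given short path may mean nothing, can be made concrete by computing a null baseline: how often does a path of a given length exist between two random people anyway? A stdlib-only sketch on a ring lattice with random shortcuts (a Watts-Strogatz-style toy; the node count, neighborhood size, and number of shortcuts are arbitrary illustrations, not values from the talk):

```python
import random
from collections import deque

def shortest(adj, s, t, cap=None):
    """BFS distance from s to t; None if unreachable (or beyond cap hops)."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == t:
            return dist[u]
        if cap is not None and dist[u] >= cap:
            continue                      # no need to expand past the cap
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return None

def ring(n, k=2):
    """Ring lattice: each node tied to its k nearest neighbors on each side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(1, k + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    return adj

def frac_within(adj, hops, pairs):
    """Fraction of the given node pairs joined by a path of <= `hops` links."""
    hits = 0
    for s, t in pairs:
        d = shortest(adj, s, t, cap=hops)
        if d is not None and d <= hops:
            hits += 1
    return hits / len(pairs)

rng = random.Random(1)
g = ring(500)
pairs = [tuple(rng.sample(range(500), 2)) for _ in range(2000)]
base = frac_within(g, 4, pairs)           # lattice only: short paths are rare
for _ in range(200):                      # add 200 random shortcut edges
    a, b = rng.sample(range(500), 2)
    g[a].add(b)
    g[b].add(a)
shortcut = frac_within(g, 4, pairs)       # same pairs, so directly comparable
print(round(base, 3), round(shortcut, 3))
```

The gap between the two fractions is the point: once shortcuts make short paths between random pairs common, the existence of a short path to, say, a hijacker carries little evidence by itself, and a meaningfulness score would have to beat a baseline like this one.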
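The sharp transitions mentioned above can be illustrated with a plain Galton-Watson branching process. To be clear about what this is and is not: the factor-of-2 threshold in the talk concerns the incentive setting analyzed by Kleinberg and Raghavan (2005), which this sketch does not model; what it does show is the classical branching-process transition that underlies such results. Suppose each person who receives a query forwards it to each of two friends independently with probability p, so the effective branching factor is 2p; the cascade almost surely dies out when 2p < 1 and survives with positive probability when 2p > 1. All parameters here are illustrative.

```python
import random

def survives(p, depth, rng, cap=1000):
    """Does a cascade, where each node forwards to each of 2 friends
    independently with probability p, still have live branches at `depth`?"""
    alive = 1
    for _ in range(depth):
        alive = sum((rng.random() < p) + (rng.random() < p)
                    for _ in range(alive))
        if alive == 0:
            return False                  # the query stopped propagating
        if alive >= cap:
            return True                   # large enough that extinction is
                                          # vanishingly unlikely; stop early
    return True

def survival_rate(p, trials=300, depth=20, seed=0):
    """Monte Carlo estimate of the probability the cascade reaches `depth`."""
    rng = random.Random(seed)
    return sum(survives(p, depth, rng) for _ in range(trials)) / trials

print(survival_rate(0.3))  # 2p = 0.6 < 1: essentially always dies out
print(survival_rate(0.9))  # 2p = 1.8 > 1: survives with high probability
```

The incentive version layers payments on top of this: each middleman skims part of the reward before forwarding, which effectively lowers p at every hop, and the Kleinberg-Raghavan analysis shows the resulting market behaves qualitatively differently on either side of an effective branching factor of 2 rather than 1.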
QUESTIONS AND ANSWERS

DR. HANDCOCK: I have one question with four related parts, coming back to the HP real-world email data set. One, how did you fit the model? Two, how did you assess the goodness of fit? Three, was there any missing data from the email responses, and how was that dealt with? And four, because it's an ad hoc model based on heuristics, how did it compare to other models that took the organizational structure into account?

DR. KLEINBERG: Those are all good questions, and I will start with an answer that applies to all four parts, which is that I didn't do any of this; it was actually the work of Adamic and Adar. So I don't want to answer in detail on their behalf, because there was thinking behind it that I don't have full access to. As far as I understood, there was not that much missing data, because they essentially collected these from logs; they got people's permission to collect them from logs. In terms of competing models, I would say there wasn't really a generative network model here. This was not a statement about a generative network model; it was simply an attempt to fit a decay of linkage frequency against group size. So it just comes down to the goodness of fit for that particular data set, for which you can consult their paper. There wasn't an attempt to then infer a general model from that.

DR. CROVELLA: I'm Mark Crovella from Boston University. I'm just sort of curious about that example. What do you think: is it more likely that the e-mail messages are the result of the organizational structure, or that the organizational structure is a result of the e-mail messages? If it's the latter, I would be interested in the social-engineering aspects of that.

DR. KLEINBERG: Yes, I think that's a very interesting question. This may sound like a cop-out, but I don't think it is that obvious.
As we know from looking at organizations, they do change: groups get reassigned, groups split, groups merge, and that can clearly be driven by communication traffic. Here, unfortunately, I can't draw much on social-science studies, since I haven't looked that closely at the organizational-theory literature. But simply observing that groups split and merge based on increased or decreased traffic between them suggests a sense in which the e-mail traffic, in the long run, is going to feed back into the organizational structure. But it's a good question. You could also argue that both of these are consequences of some hidden variable, namely that people are working on similar or different things, which influences both the organizational structure you see and who communicates with whom. In that sense there is a single root cause of both, but I think it's true that as that changes over time, you are going to see changes in the communication patterns in the organization.
REFERENCES

Bollobás, B., and F.R.K. Chung. 1988. "The Diameter of a Cycle Plus a Random Matching." SIAM Journal on Discrete Mathematics 1.
Faloutsos, C., K. McCurley, and A. Tomkins. 2004. "Fast Discovery of Connection Subgraphs." Proc. 10th ACM SIGKDD Conference. Seattle, WA.
Kleinberg, J. 2000. "Navigation in a Small World." Nature 406.
Kleinberg, J. 2002. "Small-World Phenomena and the Dynamics of Information." Advances in Neural Information Processing Systems (NIPS) 14.
Kleinberg, J., and P. Raghavan. 2005. "Query Incentive Networks." Proc. 46th IEEE Symposium on Foundations of Computer Science.
Leskovec, J., J. Kleinberg, and C. Faloutsos. 2005. "Graphs Over Time: Densification Laws, Shrinking Diameters and Possible Explanations." Proc. 11th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining.
Liben-Nowell, D., J. Novak, R. Kumar, P. Raghavan, and A. Tomkins. 2005. "Geographic Routing in Social Networks." Proc. Natl. Acad. Sci. USA 102.
Malkhi, D., M. Naor, and D. Ratajczak. 2002. "Viceroy: A Scalable and Dynamic Emulation of the Butterfly." Proc. 21st Annual ACM Symposium on Principles of Distributed Computing.
Manku, G.S., M. Bawa, and P. Raghavan. 2003. "Symphony: Distributed Hashing in a Small World." Proc. 4th USENIX Symposium on Internet Technologies and Systems.
Menczer, F. 2002. "Growing and Navigating the Small World Web by Local Content." Proc. Natl. Acad. Sci. USA 99(22):14014-14019.
Milgram, S. 1967. "The Small World Problem." Psychology Today 1.
Ratnasamy, S., P. Francis, M. Handley, R. Karp, and S. Shenker. 2001. "A Scalable Content-Addressable Network." Proc. ACM SIGCOMM.
Rowstron, A., and P. Druschel. 2001. "Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems." Proc. 18th IFIP/ACM International Conference on Distributed Systems Platforms (Middleware 2001).
Stoica, I., R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. 2001. "Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications." Proc. ACM SIGCOMM.
Watts, D.J., and S.H. Strogatz. 1998. "Collective Dynamics of 'Small-World' Networks." Nature 393.
Watts, D.J., P.S. Dodds, and M.E.J. Newman. 2002. "Identity and Search in Social Networks." Science 296:1302-1305.
Zhao, B.Y., J.D. Kubiatowicz, and A.D. Joseph. 2001. "Tapestry: An Infrastructure for Fault-Tolerant Wide-Area Location and Routing." UC Berkeley Computer Science Division, Report No. UCB/CSD 01/1141.