**Visualization and Variation: Tracking Complex Networks Across Time and Space**

**Jon Kleinberg, Cornell University**

**DR. KLEINBERG:** Thanks very much to the organizing committee for inviting me to come speak. This has been a very interesting workshop for me, especially seeing how the techniques across all these different areas make use of common principles. Obviously there is a lot we are going to have to think about afterwards.

I’m a computer scientist, so I’m coming at this from a computer science perspective. In particular, the kinds of networks that I’m interested in are the large information networks that are built up on top of the applications on the Internet, starting with things like the World Wide Web and spilling over into networks that affect the way we communicate through things like email and instant messaging, through the way we express interest in hobbies through online communities like LiveJournal, to the way we do electronic commerce, as indicated in Figure 1.

FIGURE 1

There is this whole, what some people call “speciation,” event on the Internet where we see all these new forms of media springing up very suddenly, and we are looking at the emergence of not just the technological structure here, but also the social structure. This notion that there are interesting social structures going on in these online applications is something that has interested
me for quite a while, starting back with some work I did on the World Wide Web where the
question was how to use the link structure of the Web, and the fact that these are documents authored by people, to help improve search. This is something that combines the social and
technology aspect very clearly, although I haven’t always been so good at conveying this.
I have been at workshops of this flavor for some time. A few years ago I gave a talk in a
very early workshop that brought together some computer scientists and sociologists. We talked
about links on the Web and kept casually referring to the Web as a social network without really
trying to unpack what I was really saying. This was at the Santa Fe Institute, and during the
break some of us were standing out on the patio looking out at this beautiful view of Santa Fe
when a very eminent sociologist came up next to me smoking a cigar and asked, “Did you see
that talk by the guy from Cornell? That was terrible. He thought he was talking about social
networks, and all he was talking about was how Google worked.” After explaining that I was that
guy from Cornell, we actually had a reasonable conversation under the circumstances. That was
what I view as one of the early steps in learning about some of the cultural gaps that exist
between these different areas.
I think, more than anything that I have done, or social scientists have done, or workshops like this have accomplished, it has simply been the unbelievable profusion of these online media which has made it abundantly clear that all of these technological things we are
seeing are in the end about content authored by people and people communicating with each
other.
In reality it’s not so much that they both exist but that they are really intertwined. You
can’t disentangle the technological networks from the social networks anymore. If I were to ask
what are the networks on which email communication, instant messaging, blogging, or any of
these other things take place, sure, your packets are traveling over the Internet, but if you are
trying to configure a new instant messaging system you have to know a lot about the kinds of
people that are going to be using it. You have to know their diurnal rhythms and whether
it’s going to be used by teenagers or by grown-ups and so forth. All this affects the way in which
we configure the technology, so in a real sense these things are inextricable at this point.
A particular thing I wanted to focus on here is in the context of these social and
technological networks, the way in which they are embedded in space, embedded in time, and
embedded in surrounding organizational structures. One thing that I wanted to get at is that we
shouldn’t just view the networks in isolation, but should think about them as being physically embedded, as concretely as we view networks in neuroanatomy as being physically embedded.
I’m going to be talking about network models, as have most of the other talks at this
workshop, reflecting our interest in the structural properties of networks, but also particularly for
me, in the processes unfolding on networks, the algorithmic processes that unfold on networks,
and trying to find models that relate these to each other. In the end, hopefully, even very simple
toy models—and in this talk I’ll try to go through a progression of toy models, up through more and
more general models—they don’t just help us form a vocabulary of patterns, although that is
certainly a very useful thing, but also in the computer science context, it’s often very surprising to
me how quickly those turn into design principles. In the online world the ability to create new
online media like peer-to-peer file sharing gives you a certain kind of flexibility such that things
that had seemed like toy artificial models can very rapidly become principles by which you
actually architect a system. That is one of the interesting things about this domain also.
I also wanted to mention one thing that has been lurking beneath the surface of some of
the talks, and that is the social science component. It is the notion of working with model
systems. I think, especially for the statistics aspect of this workshop, this is going to be
something important to think about. So, if I could caricature the traditional research obstacle in
studying complex networks, it would be that you’d like your networks to be large-scale, realistic,
and completely mapped, but in actuality you can only achieve any two of those. For obvious
reasons it is very hard to achieve all three at once.
FIGURE 2
So, just as Mark Handcock stole a slide from Martina Morris, I’ll steal a slide from Mark.
Figure 2 shows his Zachary Karate Club Network, and the only thing I want to point out here is
that this is clearly a real network of real people, and it’s completely mapped. We know a lot
about it, but the nature of how data like that are collected inherently limits the size. In his after-dinner talk last night, Steve Borgatti mentioned that a trend over the last few years has been to see networks go from sizes in the hundreds to sizes in the tens of thousands or hundreds of thousands. In the
process of doing that, we have to ask what are we losing? We are obviously gaining a lot. Many
of these systems are what I would call online model systems. When I say model systems, I don’t
mean that there is anything fake or synthetic about them. All of the people involved in the
activities listed on Figure 2 are real people engaged in social interaction. It’s simply that we
don’t know as much about them as we might have known about these smaller, more self-contained systems such as the Karate Club Network. It’s hard to draw their boundaries; I’m thinking about things like the links among Web pages, or Web blogs and the communities of bloggers and how they interact, or social networking services like LiveJournal or Orkut or any of the other things, which are actually trying to create a social network among their members.
Email and instant messaging within organizations define network interaction. For online
e-commerce sites, there has been some amount of interesting research on simply the behavior of
people at these sites, because of course there is still the interaction of product reviews, forum
discussions, replies, posts, and so forth, and then the download behavior of these people. We can
then pick—this is a line of thinking that Mark really stimulated—something like the arXiv,
which is a model system of, in this case, high-energy physicists, all publishing papers, co-
authoring, citing each other, and so forth. It’s a very rich domain in which we can try studying
things on a very large scale.
It was this, for example, that caused three of us at Cornell—Paul Ginsparg, the founder of the arXiv, Johannes Gehrke, and myself—to choose this as a data set for the annual KDD Cup
Competition, basically saying here is a very rich system in which you can do data analysis in all
sorts of dimensions, looking at real social data. There are also things that we would like to know
about the people in this network that we don’t, but for looking at very-large-scale data, so far this
is the best we can do.
Figure 3 gives an overview of my talk. The hope was to show how all these themes can
be traced out through a single particular topic, small-world networks and decentralized search. I
could have easily picked several other topics.
FIGURE 3
We’ll start with an initial model, the small-world model of Watts and Strogatz, which
many of you know, and then we’ll move from that, which is really a structural model, looking at
the structure of the network, to a model about processes on top of the network. In particular, the
problem of searchability of networks is really more about the process that operates on the
network. We’ll start with a very simple model of that. I’ll try to pull back to create more general
models. In particular, by abstracting a general pattern that’s going on here, we’ll start looking at
both data and design principles. Then we’ll try to find that pattern in large-scale network data.
We’ll see some rather surprising ways in which it has shown up in hyperlinks on the Web, email
communication, and friendships on these online communities, and also how it’s been used as a
design principle in things like peer-to-peer systems. And finally, some recent work I’ve been
doing with Christos Faloutsos and Jure Leskovec looking at how these networks evolve over time,
and some things that actually completely challenge the conventional wisdom we had had on this.
And necessarily, since my time is short, I will go through all of these rather briefly.
FIGURE 4
To introduce small-world networks, rather than reach back to the Watts-Strogatz model, I
wanted to reach much further back to the original work in social psychology that stimulated a lot
of this, Stanley Milgram’s small world experiments. Both models are summarized in Figure 4.
To briefly recap Milgram’s experiment for those of you who haven’t seen it before, the idea was
to trace out short paths in a social network, and particularly the network of acquaintanceships in
the United States. He picked a target person who his subject was supposed to reach, who lived in
a suburb of Boston, and he picked a whole bunch of starting people, mainly drawing them from
out in the Midwest. Each starter person would start with a letter that he had to forward to this
target, and rather than mail it directly to them, they had to use what they were told about the
target and mail it to one of their friends with the goal of reaching the target in as few steps as
possible. Those were the rules of the experiment. You get a letter, you're trying to reach a
stockbroker in Sharon, Massachusetts, and you think about the things that you might know about
them and you know about your friends. You figure out who you should pass it to, and the famous
outcome of this was that the median number of steps in the completed chains was six, a number
that has entered into the phrase six degrees of separation.
There are a few things going on with this experiment, and I don’t have time to unravel
them all, but one is clearly the structural issue. Why should our social networks, in which our friends all seem to know each other and which, as a number of the talks have noted, look fairly clustered, have these short chains? The highly influential Nature paper of Watts
and Strogatz in 1998 can really be viewed as a proposal for a kind of network that we ought to be
studying. More than any particular result, it really said let’s think about networks as
superpositions; the notion of a random network embedded in some reference frame, in this case
maybe a spatial reference frame. To take one of the flavors of their model, we can imagine that
in a very simple version we have a grid with nodes, and these might represent the people in the
social network, knowing something about their geographic coordinates. We then add random
links, even as few as one per node; an example is shown in red in Figure 4. These random links
are enough to make the diameter of the world very small, even while locally it looks very
clustered. So, adding a very small number of random links, the network still looks relatively
grid-like, even though we now know that its diameter has become very small. We can actually
prove things like that using results that have been known for some time in the area of random
graphs, in particular the results of Bollobás and Chung (1988), which studied almost exactly this
network. So this was the idea, think of these networks as a superposition or, for our purposes, a
random network embedded in some reference frame, be it spatial, organizational, or what have
you.
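As an illustrative sketch of this superposition (my own code, not anything from the talk; the function names are made up), one can build an n × n grid, optionally add one uniform random long-range link per node, and estimate average path lengths by breadth-first search from a sample of sources:

```python
import random
from collections import deque

def grid_with_shortcuts(n, shortcuts=True, seed=0):
    """An n x n grid; optionally one uniform random long-range link per node."""
    rng = random.Random(seed)
    nodes = [(i, j) for i in range(n) for j in range(n)]
    adj = {v: set() for v in nodes}
    for i, j in nodes:
        for di, dj in ((1, 0), (0, 1)):      # local grid edges
            if i + di < n and j + dj < n:
                adj[(i, j)].add((i + di, j + dj))
                adj[(i + di, j + dj)].add((i, j))
    if shortcuts:
        for v in nodes:                      # the superposed random links
            w = rng.choice(nodes)
            if w != v:
                adj[v].add(w)
                adj[w].add(v)
    return nodes, adj

def avg_path_length(nodes, adj, sources=30, seed=1):
    """Mean BFS distance from a sample of source nodes to all reachable nodes."""
    rng = random.Random(seed)
    total, count = 0, 0
    for s in rng.sample(nodes, sources):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count

nodes, g_plain = grid_with_shortcuts(20, shortcuts=False)
_, g_small = grid_with_shortcuts(20, shortcuts=True)
print(avg_path_length(nodes, g_plain, sources=20))  # grid-like: grows with n
print(avg_path_length(nodes, g_small, sources=20))  # shortcuts: much smaller
```

Even one random link per node collapses the average distance, while locally the graph still looks grid-like.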
When I first came across the Watts-Strogatz work—which wasn’t hard, because they
were also at Cornell at the time—the thing that really struck me about the original experiment
was not just the structure. It was not just about the fact that there were these short chains, but it
was about a process. As we teach our undergraduates in computer science classes, if you want to
find the shortest path in a network, this is not the experiment you would run. You would have
given a letter to someone and said, send this to all your friends. Tell them to send it to all their
friends, and eventually you will see what the shortest chain to the target is. You can think about
the obvious reasons why Milgram wasn’t able to do that flooding experiment; therefore, he was forced to embark on this much more interesting one where you actually have to tunnel your way, making your best guess about the next hop. Essentially, what he was doing was decentralized
routing.
This is decentralized routing in a completely non-technological context, on a social network of people. That’s the first thing I would like to try modeling here. So, we would like to model a
decentralized algorithm looking for a path through a network. Figure 5 describes a very basic
model: imagine there is a current message holder who holds the message and knows the grid structure,
their coordinates on the grid, and the coordinates of the target on the grid. They don’t know all
the random links, but they know their own random links. So, this simply says when you get the
letter, you know that you are living wherever you live. You know the target is in Sharon,
Massachusetts, and you know you have friends in New Jersey and Montana and Singapore and so
forth, and you want to try to figure out how best to direct the letter. That’s the idea. That’s why
it is decentralized.
FIGURE 5
The gold standard here, which is going to pop up through a lot of the results, is this
difference between things that grow logarithmically in n versus things that grow linearly in n; or
more generally, things that grow like polynomials in log n versus polynomials in n. The idea is
that the network has n × n nodes, but we want to be exponentially better than that. We want to be
something like log n or log n squared in our delivery time. That would somehow tell us that the
world is small. We can get there in a short number of steps.
The funny thing here is that one could have asked this question immediately upon seeing the Milgram experiment, and the Watts-Strogatz model, though not intended for this purpose, is really ideal as the kind of minimal model you would need to ask it. If you want to ask this question in an interesting way, what are the minimal ingredients
you need? Well, you need a network that is partially known and partially unknown. A
completely known network of course leads to the shortest-path algorithms, but that’s a very
different subject. Finding your way in a known network—that’s basically the MapQuest problem. A completely unknown network is equally different; then we have nothing to go on. The known part should have high diameter, or else it would be too easy. The full network should have low diameter, so there is a solution to find. And we want to explore this relation between the
known and the unknown parts.
How does the random network relate to the way in which it’s embedded in the spatial or
organizational structure or whatever? Again, if you are worried that this model is too simple,
obviously, in a million ways it’s too simple. Obviously, to the extent that the grid in Figure 4 is a
proxy for anything, maybe it’s a proxy for geographic information. We know people in the
Milgram experiment also used occupational information and other things. I’ll try folding some of
these things in, but I want to start with an intentionally very simple model, as shown in Figure 6.
FIGURE 6
The Watts-Strogatz model is a bit too simple. So, the first theorem you can prove here is a negative result, an impossibility result: the network has n × n nodes, and there really is a path of length about log n, a solution to find, but with only local information you provably can’t find it. For any decentralized algorithm, however hard it computes, its lack of information means you need at least about n to the ⅔ power steps to find your way, and ⅔ is actually the right exponent here; you can also achieve it. But this is exponentially larger than what we want. So, that’s what happens when you have a simple toy model: you have this striking phenomenon, you try analyzing it, and sometimes there is not enough in there.
So, let’s try generalizing the model just a little bit, just minimally. We add one more knob we can turn, which I’ll call a clustering exponent r. I’ll still add
a random long-range link out of each node, as you see in Figure 6, but it’s going to be tuned by
this parameter r, which basically says: if I’m node v, I don’t choose my long-range link uniformly at random, but rather take the spatial structure into account. I choose w as the other end with probability proportional to (lattice distance)^−r. So, think about the effect r
has: when r = 0, that’s the uniform distribution and I’m back to the Watts-Strogatz model. As I
crank r way up to some large number, I go from the network on the left in Figure 6 to the one on
the right, where long links are very unlikely. Therefore, the network goes back to having a large
diameter. My long-range links are not really helping me that much. I’m in this funny spectrum
here where I know that when r = 0, the network is too confusing. There were the short paths, but
I couldn’t find them, and when r is very big, there are no short paths to find. The question is what
happens in this middle region as we trace it out. At least we have a 1-dimensional family we can
study.
The answer is that something kind of funny happens. There actually is an optimal
operating point for this network as shown in Figure 7, and that’s when r = 2. So when I choose
links with probability according to an inverse square distribution, I can actually achieve this gold
standard of delivery time polynomial in log n, in fact log squared n. As I move away from r = 2 in either direction, things get
worse rapidly, at least in an asymptotic sense. Things become n to some power, where the power
depends on how far away r is from 2. Two was somehow the right operating point for this
network.
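This family of networks is easy to simulate. The sketch below (hypothetical code, with my own function names and a deliberately small grid) wires one long-range link per node with probability proportional to (lattice distance)^−r and measures the average delivery time of greedy forwarding:

```python
import random

def lattice_dist(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def build_long_links(n, r, rng):
    """One long-range contact per node, chosen with probability ~ dist^(-r)."""
    nodes = [(i, j) for i in range(n) for j in range(n)]
    link = {}
    for v in nodes:
        weights = [lattice_dist(v, w) ** -r if w != v else 0.0 for w in nodes]
        link[v] = rng.choices(nodes, weights=weights)[0]
    return link

def greedy_route(n, link, s, t):
    """Forward to the known contact closest to the target; return step count."""
    steps, v = 0, s
    while v != t:
        i, j = v
        nbrs = [(i + di, j + dj)
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= i + di < n and 0 <= j + dj < n]
        nbrs.append(link[v])                 # the one long-range contact
        v = min(nbrs, key=lambda w: lattice_dist(w, t))
        steps += 1
    return steps

def mean_delivery_time(n, r, trials=200, seed=0):
    rng = random.Random(seed)
    link = build_long_links(n, r, rng)
    nodes = list(link)
    return sum(greedy_route(n, link, rng.choice(nodes), rng.choice(nodes))
               for _ in range(trials)) / trials

for r in (0.0, 1.0, 2.0, 3.0):
    print(r, mean_delivery_time(20, r))
```

Because every greedy step moves strictly closer to the target, delivery time is always bounded by the initial lattice distance; the asymptotic minimum at r = 2 only separates cleanly from its neighbors at much larger grid sizes.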
FIGURE 7
Let’s think about that. Figure 8 gives some idea of how you prove something like that,
and then we can go on to how we might extract out what we have actually learned from this
beyond the fact that we are on a grid, beyond the fact that we have one long-range link, and so
forth. But even in this very simple system there is obviously something to be proven. Referring
to Figure 8, let me tell you how to prove the positive result with exponent 2, and I’ll skip the negative results for the sake of time. The basic idea is that I’m trying to get to this target t and I’m at this node v. Rather than try to analyze the whole sequence of steps from here on, I want to ask a simpler question: What’s the probability that in this very step I halve my distance to the target? To halve my distance I have to jump from distance d into the ball of radius d/2 around t. You can basically compute the probability mass in this ball and show that it’s about 1/log n, independent of d. The d cancels out, which is the key here.
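This cancellation can be checked numerically. The following sketch (my own illustration, exact up to grid boundary effects) computes, under the inverse-square rule, the probability that v’s long-range link lands in the ball of radius d/2 around a target at distance d:

```python
def halving_mass(n, d):
    """Probability, under inverse-square link weights on an n x n grid, that
    v's long-range link lands within distance d/2 of a target at distance d."""
    v = (n // 2, n // 2 - d // 2)            # place v and t near the center,
    t = (n // 2, n // 2 + (d - d // 2))      # lattice distance d apart
    nodes = [(i, j) for i in range(n) for j in range(n)]
    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    weights = {w: dist(v, w) ** -2 for w in nodes if w != v}
    Z = sum(weights.values())                # normalizing constant ~ log n
    return sum(p for w, p in weights.items() if dist(w, t) <= d / 2) / Z

for d in (8, 16, 32):
    print(d, round(halving_mass(100, d), 3))  # roughly constant in d
```

The printed masses stay in a narrow band as d varies, which is the point: the chance of halving the distance does not depend on how far away you are.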
FIGURE 8
Basically, with probability about 1/log n you halve your distance to the target in this step.
Therefore, every log n steps or so you expect to halve your distance. The distance can only be
halved log n times, and so the delivery time is proportional to log n times log n. That’s the one-
slide outline. My interest here is to try figure out whether we can conclude something more
[...]
first thing I thought when I saw the scatter plot was isn’t that weird, no one out west is using
LiveJournal. But actually this is the same thing you would see if I just plotted the population
density of the United States. So, they are indistinguishable at the level of scatter plot.
FIGURE 13
LiveJournal is sampling pretty broadly, but it’s very non-uniform. The ball around somebody using it out here in eastern Nevada is going to grow very gradually compared with the ball around someone using it on a particular block in Manhattan. What do we do about that? One thing we can do is use groups, because groups don’t require a grid: Liben-Nowell et al. simply said the groups are the balls centered on all the points, each one the closest so-many people within some distance. They actually took something that was slightly easier to analyze in their context. It’s very similar; it’s what they called rank-based friendship. Node v
simply rank-orders all their potential friends, everyone in the world, by proximity. So the center of the circle in Figure 14 is node v, and their closest people are the small circles within that circle, the diameter of which is equal to the distance to person 7; that person is ranked 7th. Then what you do is link with probability decaying in the rank: I’m more likely to link to the nth closest person than to the (n+1)st closest person. What they did was prove a different
generalization. So, rather than generalize things into group structures, they generalized it into
rank-based friendships and proved an analog of the inverse square results, as shown in Figure 14.
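The rank-based rule is easy to state in code. This sketch (illustrative, names my own) computes ranks on a uniform grid and confirms that the rank of a person at lattice distance d is exactly 2d² + 2d there, which is why linking with probability proportional to 1/rank recovers the inverse-square law in the uniform case:

```python
def lattice_dist(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def rank_of(v, w, points):
    """Rank of w in v's proximity ordering: the number of candidates
    (excluding v itself) at distance <= dist(v, w)."""
    d = lattice_dist(v, w)
    return sum(1 for p in points if p != v and lattice_dist(v, p) <= d)

n = 200
points = [(i, j) for i in range(n) for j in range(n)]
v = (100, 100)
for d in (5, 10, 20):
    r = rank_of(v, (100, 100 + d), points)
    print(d, r, r / d ** 2)                  # ratio tends to 2: rank ~ 2 d^2
```

Since rank grows like d², a 1/rank linking probability behaves like d^−2, the inverse-square rule, whenever the population density is uniform.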
FIGURE 14
Efficient routing is possible for a nearly arbitrary population density (again, under mild assumptions) if the link probability decays like 1/rank. Why is this a generalization? Well, because on a uniform grid a person at distance d has rank about d²; there are about d² people at least as close. This gives me the inverse square law in that special case. Now they have a result saying what would be optimal for searchability, and they have 500,000 people with zip codes, so they look at some
data. The punch line, which made me very happy, is that LiveJournal friendships actually
approximate 1/rank quite closely. Time prevents me from getting into a lot of what they did here,
but these lines are basically slope 1 and 1.05, and the chart in Figure 14 is one particular view of
that data.
For the Hewlett-Packard Labs network you could have argued that somebody sat down and drew up the org chart; maybe they were thinking about this. This is weirder still; the network somehow configured itself to be like this. I don’t have a good explanation for that. I should also mention that this searchability is a little surprising because LiveJournal is a purely virtual online
community, so if someone asked me what do you expect, I would have said a mixture model. I
would have said you log on and you immediately link to all your high school friends who live in
the same zip code as you. So, I get a big spike there, then I meet people utterly at random, and
it’s kind of like a uniform distribution on the rest. The point is that that is not what happened. If you
lived in Nevada, you were nonetheless more likely to link to somebody in Nebraska and
Wyoming than you were to someone in New York City, and that’s bizarre. There is a lot more
that should be looked at in this data, so one hesitates to draw sweeping conclusions yet, but it’s
very intriguing.
In the final few minutes I have remaining I want to talk about not just spatial embedding,
but how things have evolved over time because here, too, there tend to be some surprises that
actually make us go back and think about some things we need to be looking at. The question we
wanted to ask was really a very simple one, stated in Figure 15, which is take one of these large
model systems—we’ll take the arXiv, because it really works very nicely for our purposes—and
ask how do the degrees of separation change in the network as it grows over time? We could
define degrees of separation as, say, the 90th percentile among all pairwise distances, but
we have tried other ways. Why in the world has no one studied this on some large-scale system?
Well, it’s computationally hard to answer this question. You want a network you can run up to
any point in time and stop it. You would like to ask this question every month or every week, and
every week that means you have to do an all-pairs shortest-path problem on a 100,000-node
network. This is computationally somewhat demanding, so we took advantage of some nice code for approximate all-pairs shortest paths in a large graph that one of my co-authors, Faloutsos, had worked on, to track this effective diameter in several networks over time.
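On small snapshots the quantity in question can be computed exactly by breadth-first search. A sketch (my own illustration, not the approximation scheme used for the large graphs):

```python
import random
from collections import deque

def effective_diameter(adj, q=0.9, sample=None, seed=0):
    """q-th percentile of pairwise BFS distances (exact when sample is None,
    otherwise estimated from a random sample of source nodes)."""
    nodes = list(adj)
    rng = random.Random(seed)
    sources = nodes if sample is None else rng.sample(nodes, sample)
    dists = []
    for s in sources:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        dists.extend(d for v, d in dist.items() if v != s)
    dists.sort()
    return dists[int(q * (len(dists) - 1))]

# tiny check on a 10-node path graph
path = {i: {j for j in (i - 1, i + 1) if 0 <= j < 10} for i in range(10)}
print(effective_diameter(path))  # 7
```

Each full snapshot costs one BFS per source node, which is exactly why repeating this every week on a 100,000-node network calls for approximation.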
FIGURE 15
We looked at model systems beyond the arXiv, so in Figure 16 you will see
plots. Time runs on the x axis, and effective diameter of the network runs on the y axis. The
original plan was we’ll look at this, and we’ll try to see which slowly growing function of n best
fit the diameter. Was it log2 n, was it log n, or some other function? Given that was the original
plan, we found these plots a bit surprising, because slowly growing functions as you know go up,
and these were all going down. The point was, despite the fact that the network was growing—
and some of these networks were growing at an exponential rate over time—the effective
diameter (basically a smoothed version of the 90th percentile, which I can tell you more about offline) was going down as the network grew. The network was growing, but the degrees
of separation were actually getting closer. I have a few lines here which are probably better
explained in question-and-answer form, but basically, there is a big missing-past issue here: we come in and start observing networks at a particular time.
FIGURE 16
With the arXiv, we technically have no missing past problem because we have the arXiv
from its very first paper. But in a very real sense we have a missing past problem, because the
arXiv got plopped down in the middle of a lively, thriving research community that had been
exchanging pre-prints for a long time. We only started measuring it at time “zero.” One worries
that what these results really show is the process of catching up with a stationary structure the
network already had. I can’t say that that’s not what is going on, but one can look at the decrease
in diameter in many different ways. For example, one can artificially cut off the past, any number
of years you want. You can say, suppose we only started measuring in 1995. That’s what the
dotted line in the left-hand graph represents, and you get this initial massive swoop down as the
network catches up with links that essentially were already there. That effect quickly goes away,
and we still see a steady decline in diameter long after that effect appears to be negligible. One
can also do things where you keep the network intact, but only start measuring—I would have to
explain this more slowly—but it, too, catches up with this. So, long after a lot of the edge effects go away, we are still seeing this steady march downward. Maybe in
retrospect I should have seen this coming, because back on Figure 5 I said our gold standard is
log n. We have n nodes and we want the diameter to grow like log n. That’s sort of the graph
theorist’s take on what the small world property means. It means that you grow exponentially
slower than the size of the network. The common sense take on what the small world property
means is that even as the world is growing, there are forces out there that are causing us to
become more closely connected; in other words a decreasing diameter. Intuitively, maybe this
has been consistent with the common sense intuition behind the small world property all along. It
was simply our random-graph background that caused us to inject this log imagery. Having said
that, log n and log² n and these slowly growing functions are a very useful modeling tool for the kinds of analyses I have been talking about up until now, which were not about temporal trends but simply about static snapshots, with asymptotics as a way of capturing these. But it says that there is a lot to do here. A related property, one that makes it easier to believe that the diameter might be going down, is that the number of edges grows faster than the number of nodes, so the average degree in the network is not constant.
If you compare now versus 10 years ago, people cite more papers, they author more
papers, they co-author with more people. Everything in the arXiv network is densifying as we go
forward, and clearly any model is going to have to produce this decreasing diameter. There are models that we
have been thinking about to do this. For example, one way, at least through simulation—this
seems very hard to analyze—but through simulation if I have a model where people enter a social
network via some point of contact, you are introduced into the network through a sort of epidemic
process in which you befriend some of that person's friends, some of their friends, and so forth.
Those are your links, and now we iterate, add a new person. Through simulation, one can set the
parameters so that it densifies, the diameter goes down, but it seems very hard to prove anything
about that kind of model. At least there are models out there that seem plausible, that might
capture this kind of thing.
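A minimal simulation sketch of this kind of epidemic attachment (in the spirit of the forest-fire model of Leskovec, Kleinberg, and Faloutsos; the parameter value and details here are invented for illustration, not their exact model):

```python
import random

def grow_network(n, p=0.35, seed=1):
    """Grow an undirected graph to n nodes. Each new node picks a random
    existing 'ambassador' to befriend, then spreads epidemically: every
    neighbor of a node it has just linked to is also linked with
    probability p, and so on recursively."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}
    for new in range(2, n):
        ambassador = rng.randrange(new)
        linked, frontier = {ambassador}, [ambassador]
        while frontier:
            u = frontier.pop()
            for v in adj[u]:
                if v not in linked and rng.random() < p:
                    linked.add(v)
                    frontier.append(v)
        adj[new] = set(linked)  # attach the new node to everyone it reached
        for u in linked:
            adj[u].add(new)
    return adj
```

Sweeping p and measuring the edge count and effective diameter of successive snapshots is how one would check, by simulation, that such a model densifies while its diameter shrinks.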
FIGURE 17
Figure 17 presents some reflections on various things and suggests some other questions.
As I said, the set of themes that I wanted to try tracing through this was somehow the way the
network models, even very specific models, become more general patterns, they become design
principles. In the end, with enough patience, you can actually find data sets that really start to
bear some of these things out. The shrinking diameters thing is again very suggestive of more
work that could be done here, and in general I think we are only starting to have the
computational ability to look at very large networks and snapshot them densely over time. I think
it suggests that there are more surprises coming up.
One network that we would like to try understanding over time is the Web itself. There is
currently a large effort going on at Cornell involving some pure scientists, and also some people
in the sociology department like Michael Macy and David Strang. In particular, we are trying to
get a handle on Web evolution over time and the questions we can ask about it. In small world
networks, there are any number of other interesting questions that one can ask, which I haven’t
touched on here. One, for example, which came up in a paper a year ago by Christos Faloutsos,
Kevin McCurley, and Andrew Tomkins, was roughly the question of when short paths are
meaningful. So intuitively there is this tempting connection between social networks and, for
example, homeland security applications in which you discover statements like my next door
neighbor once sublet an apartment from a person who took a class with the brother of a 9/11
hijacker. You read in the news that social network analysis is going to cause us to discover things
like this, and that’s going to be very useful. But we are really lacking any sense of how to score
this: is this short path meaningful? Isn’t the whole point of the small world phenomenon that I could say
this about any instantiation? I would say that we are really lacking, and I think this is an interesting
direction for the statistics aspect of this workshop, some way of talking about when short paths in
the network are meaningful, and when they simply suffer from what you could call the small
world problem, that you see short paths everywhere.
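One crude way to start scoring this (my own illustration, not the connection-subgraph method of the Faloutsos-McCurley-Tomkins paper) is to compare the observed path against the typical distance in a random graph of the same size and average degree, which grows like ln n / ln(average degree); a path is only notable if it beats that baseline substantially:

```python
import math

def typical_random_distance(n, avg_degree):
    """Heuristic typical shortest-path length in a sparse random graph
    with n nodes and the given average degree: ln(n) / ln(avg_degree)."""
    return math.log(n) / math.log(avg_degree)

def path_is_surprising(observed_len, n, avg_degree, factor=0.5):
    """Flag a path only if it is much shorter (by `factor`) than what a
    comparable random graph would contain between typical node pairs."""
    return observed_len < factor * typical_random_distance(n, avg_degree)
```

With roughly 6 billion people and 50 acquaintances each, the baseline is about six hops, so a six-step chain to anyone is exactly what the small world problem predicts for everyone.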
Another area where this blends into economics and game theory—this is some work I
have been doing with Prabhakar Raghavan, who is the head of research at Yahoo, where they are
very interested in social media—is the notion that a lot of these online social networks like LiveJournal and
Orkut and Friendster and LinkedIn, and you can name any number of these, they attracted a lot
of excitement a year ago based on the idea that they would become big marketplaces for
information and services. You join one of these, and what’s the value to you in joining
something like Friendster or LinkedIn or Orkut? Well, because maybe a year from now you are
going to need to rent an apartment in Boulder, Colorado, and you can ask your friends: do you
know anyone who has a room to rent? They
ask their friends, and gradually an answer percolates back. Not only is it an answer, but it
somehow has been vetted by your friends. That was some of the appeal of these networks, but
once you start thinking about this as a marketplace for information and services, you have to ask
about the incentive issues at work. If you are actually offering incentives to people, and they
have to act essentially as middlemen for these products and services, then when does this
function efficiently? When do things break down simply because of the friction of everybody
skimming off money that you offer them?
It turns out there are some interesting transitional behaviors here, which relate to some
things about branching processes, and some sharp transitions between networks that function
efficiently in this mode and networks that don’t. There is something qualitatively important
about an effective branching factor of 2, which I can’t really get into here, but there is also a
lot to explore when you think about this interface of small world networks with the sort of economic game
theory take on it.
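The actual analysis in the Kleinberg-Raghavan paper is more delicate than can be reproduced here, but the basic branching-process ingredient can be sketched. For a Galton-Watson process with Poisson(b) offspring, the extinction probability is the smallest fixed point of q = exp(b(q - 1)), computable by iteration (my illustration; the threshold at branching factor 2 in the incentive setting is a different, sharper phenomenon than the classical survival threshold at b = 1):

```python
import math

def extinction_probability(b, iters=200):
    """Probability that a Galton-Watson branching process with Poisson(b)
    offspring eventually dies out: the smallest fixed point of
    q = exp(b * (q - 1)), found by iterating from q = 0."""
    q = 0.0
    for _ in range(iters):
        q = math.exp(b * (q - 1.0))
    return q
```

Below b = 1 the process dies out almost surely (extinction probability 1); above it, a query cascade can survive and reach an answer holder with positive probability.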
In general, I think this is a window into a lot of interesting questions in these kinds of
networks, a lot of which can be studied with some of the model system data that we have. We are
trying to accumulate more data, and I think in general there is much to be done here. I would be
happy to talk about it more offline, but for now I’ll stop here.
QUESTIONS AND ANSWERS
DR. HANDCOCK: I have one question which has four related parts. This is coming
back to the HP real-world email data set. First, how did you fit the model? Second, how did you
assess the goodness of fit? Third, was there any missing data from the email responses, and how
was that dealt with? And fourth, because it's an ad hoc model based on heuristics, how did it
compare to other models that took into account the organizational structure?
DR. KLEINBERG: Those are all good questions, and actually I will start with an
answer which applies to all four parts, which is I didn’t do any of this, because this was actually
the work of Adamic and Adar. So, I don’t want to answer in detail on their behalf, because there
was thinking behind that which I don’t have access to all of. There was actually not that much
missing data as far as I actually understood, because they essentially collected these from logs.
They got people’s permission to collect these from logs.
In terms of competing models, I would say there wasn’t really a generative network
model here, so, this was not a statement about a generative network model. It was a statement
about simply trying to fit a decay of linkage frequency to group size. And so then it just comes
down to a question of the goodness of fit for that particular data set, for which you can consult
their paper. But there wasn't an attempt to then infer from that a general model.
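For concreteness, fitting a decay of linkage frequency against group size typically means a least-squares fit in log-log space (a generic sketch, not Adamic and Adar's actual procedure):

```python
import math

def fit_loglog_slope(sizes, freqs):
    """Least-squares slope alpha in log-log coordinates for the model
    freq ~ C * size**alpha; a decaying relationship gives alpha < 0."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

The fitted exponent alpha then summarizes how sharply linkage frequency falls off with group size for that one data set, without committing to any generative network model.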
DR. CROVELLA: I’m Mark Crovella from Boston University. I’m just sort of curious
about that example. What do you think? Is it more likely that the e-mail messages are the result
of the organizational structure, or the organizational structure is a result of the e-mail messages?
If it’s the latter, I would be interested in the social engineering aspects of that.
DR. KLEINBERG: Yes, I think that’s a very interesting question. So, this may sound
like a cop out, but I don’t think it is that obvious. As we know from looking at organizations, they
do change: groups get reassigned, groups split, groups merge, and that clearly is the result of
increased bandwidth. So here, unfortunately, I’m not going to draw so much on social science
studies, since I haven’t looked that much at the organizational theory literature. But simply
suggesting that obviously groups split and merge based on increased or decreased traffic between
them, that’s a sense in which the e-mail traffic, in the long-range sense, is going to feed back
into the organizational structure. But it’s a good question. You could also argue that both of
these are consequences of some hidden variable: people are, in fact, working on similar or
different things, and that influences both the organizational structure you see and who communicates
with whom. In a sense, that’s the single root cause of these, but I think it’s true that as that
changes over time, you are going to see changes in communication patterns in the organization.
REFERENCES
Bollobas, B., and F.R.K. Chung. 1988. “The Diameter of a Cycle Plus a Random Matching.”
SIAM Journal on Discrete Mathematics 1.
Faloutsos, C., K. McCurley, and A. Tomkins. 2004. “Fast Discovery of Connection Subgraphs.”
Tenth ACM SIGKDD Conference. Seattle, WA.
Kleinberg, J. 2000. “Navigation in a Small World.” Nature 406.
Kleinberg, J. 2002. “Small-World Phenomena and the Dynamics of Information.” Advances in
Neural Information Processing Systems (NIPS) 14.
Kleinberg, J., and P. Raghavan. 2005. “Query Incentive Networks.” Proc. 46th IEEE Symposium
on Foundations of Computer Science.
Leskovec, J., J. Kleinberg, and C. Faloutsos. 2005. “Graphs Over Time: Densification Laws,
Shrinking Diameters and Possible Explanations.” Proc. 11th ACM SIGKDD Intl. Conf. on
Knowledge Discovery and Data Mining.
Liben-Nowell, D., J. Novak, R. Kumar, P. Raghavan, and A. Tomkins. 2005. “Geographic
Routing in Social Networks.” Proc. Natl. Acad. Sci. USA 102.
Malkhi, D., M. Naor, and D. Ratajczak. 2002. “Viceroy: A Scalable and Dynamic Emulation of
the Butterfly.” Proceedings of 21st Annual Symposium on Principles of Distributed Computing.
Manku, G.S., M. Bawa, and P. Raghavan. 2003. “Symphony: Distributed Hashing in a Small
World.” Proc. 4th USENIX Symposium on Internet Technologies and Systems.
Menczer, F. 2002. “Growing and Navigating the Small World Web by Local Content.” Proc.
Natl. Acad. Sci. USA 99(22):14014-14019.
Milgram, S. 1967. “The Small World Problem.” Psychology Today 1.
Ratnasamy, S., P. Francis, M. Handley, R. Karp, and S. Shenker. 2001. “A Scalable Content-
Addressable Network.” Proc. ACM SIGCOMM.
Rowstron, A., and P. Druschel. 2001. “Pastry: Scalable, Distributed Object Location and Routing
for Large-Scale Peer-to-Peer Systems.” Proc. 18th IFIP/ACM International Conference on
Distributed Systems Platforms (Middleware 2001).
Stoica, I., R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. 2001. “Chord: A Scalable
Peer-to-Peer Lookup Service for Internet Applications.” Proc. ACM SIGCOMM.
Watts, D.J., and S.H. Strogatz. 1998. “Collective Dynamics of ‘Small-World’ Networks.” Nature
393.
Watts, D.J., P.S. Dodds, and M.E.J. Newman. 2002. “Identity and Search in Social Networks.”
Science 296:1302-1305.
Zhao, B.Y., J.D. Kubiatowicz, and A.D. Joseph. 2001. “Tapestry: An Infrastructure for Fault-
Tolerant Wide-Area Location and Routing.” UC Berkeley Computer Science Division, Report No.
UCB/CSD 01/1141.