Proceedings of a Workshop on Statistics on Networks (CD-ROM) (2007)

Chapter: Network Data and Models--Martina Morris, University of Washington

Suggested Citation:"Network Data and Models--Martina Morris, University of Washington." National Research Council. 2007. Proceedings of a Workshop on Statistics on Networks (CD-ROM). Washington, DC: The National Academies Press. doi: 10.17226/12083.

Network Data and Models
Martina Morris, University of Washington

DR. MORRIS: As they say, I realize I am the only thing that stands between you and dinner. It is the end of the day. It is not an enviable spot, and all of these talks have become longer and longer, so I am going to see if I can work through this a little more quickly.

I am closer to the biologists, since I put the people up front. We have a large group of people working on network sampling and network modeling—Steve Goodreau, Mark Handcock, and myself at the University of Washington; Dave Hunter and Jim Moody, who are both here; Rich Rothenberg, who is an M.D., Ph.D.; Tom Snijders, who some of you know, is a network statistician; Philippa Pattison and Garry Robins from Melbourne have also done a lot of work on networks over the years; and then a gaggle of grad students who come from lots of different disciplines as well.

We are funded by NIH, and we are primarily interested in how networks channel the spread of disease. Figure 1 shows an example of a real network that comes from Colorado Springs, a visualization that Jim did. You can see that the network has a giant component, which is not surprising. It has this dendritic effect, which is what you tend to get with disassortative mixing. I think that is something that Mark Newman pointed out. After a while you get to look at these things and you can immediately pick that up. This is a kind of boy-girl disassortative mixing, and it generates these long, loosely connected webs. This is also a fairly high-risk group of individuals, and it was sampled to be exactly that. John Potterat thinks that he got about 85 percent of the high-risk folks in Colorado Springs. Every now and then you see a configuration like a ring of nodes connected to a central node, which represents a prostitute and a bunch of clients. Of course, not all networks look like that.

FIGURE 1

Just a little bit on motivation. Our application is infectious disease transmission, in particular HIV, recognizing that the real mechanism of transmission for HIV is partnership networks. What we are interested in, in a general sense, is what creates network connectivity, particularly in sparse networks. As most of you know, HIV has enormously different prevalence around the world. It can be as low as it is here, certainly less than 1 percent, compared to close to 40 percent in places like Botswana. So, there is a really interesting question there about what kind of network connectivity would generate that difference and variation in prevalence.

There are clearly properties of the transmission system that you need to think about, which include biological properties; heterogeneity, which is the distribution of attributes of the nodes, the persons, but also infectivity and susceptibility in the host population. There are multiple time scales to consider, and time scales are something that we haven't talked about much here, but I think they are very important in this context. You have the natural history and evolutionary dynamics of the pathogens, but you also have some interesting stuff going on with partnerships there. In addition, there is this real challenge of data collection. That is in contrast to traceroutes—it would be great if we could collect traceroutes of sexual behavior. Maybe we could do that with some of those little nodes that we stuck on the monkey heads, but at
this point we can't really do that. So, for us, this means that the data are, in fact, very difficult to collect, and that is a real challenge. One of the things we are aiming for—basically, this is our lottery ticket—is to find methods that will help us define and estimate models for networks using minimally sampled data. By minimal, I mean you can't even contact trace in this case, because contact tracing is, itself, a very problematic thing to do in the case of sexual behavior.

FIGURE 2

So, how do you generate connectivity in sparse networks? Lots of people have focused on high-degree hubs: all you need is one person out there with millions of partners and your population is connected. In fact, you can generate connectivity in very-low-degree networks as well, as shown in Figure 2. Your simple square dance circle is a completely connected population where everybody has only two partners. So it is important to recognize that there are lots of other ways to generate connectivity, and that even if one person did act as a hub, and you figured that out somehow and removed them, you would still have a connected population. I think that has some pretty strong implications for prevention.

Also, there is obviously clustering and heterogeneity in networks. Figure 3 shows data from a school. You can see in this case that there is a combination of clustering by grade, which generates the different colors, and also a very strong racial divide.
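The square dance circle mentioned above is easy to verify in a sketch: a ring of n nodes has degree exactly 2 everywhere, yet forms a single connected component. This is a minimal illustration, not code from the talk; the node count is arbitrary.

```python
# Connectivity in a very-low-degree network: a ring ("square dance circle").

def cycle_edges(n):
    """Edges of a ring of n nodes: 0-1, 1-2, ..., (n-1)-0."""
    return [(i, (i + 1) % n) for i in range(n)]

def is_connected(n, edges):
    """Breadth-first search from node 0; connected if every node is reached."""
    adj = {i: set() for i in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, frontier = {0}, [0]
    while frontier:
        node = frontier.pop()
        for nbr in adj[node] - seen:
            seen.add(nbr)
            frontier.append(nbr)
    return len(seen) == n

n = 20
edges = cycle_edges(n)
degree = {i: 0 for i in range(n)}
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

print(all(d == 2 for d in degree.values()))  # every node has exactly 2 partners
print(is_connected(n, edges))                # yet the graph is one component
```

Removing any single node still leaves the remaining nodes connected in a path, which is the prevention point: connectivity here does not depend on a hub.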

FIGURE 3

You might ask yourself, why is this? What is the model behind this? What generates this? A number of people have hinted at this kind of thing earlier. Are these exogenous attributes at work—that is, birds of a feather flock together? If you are in the same grade, or the same race as me, that is why we are likely to be friends. Or is it some kind of endogenous process where, if two people share a friend, they are more likely to be friends themselves? That is a friend-of-a-friend process. It is interesting that in popular language we have both of those ideas already ensconced in a little aphorism.

So, thinking about partnership dynamics and the timing and sequence and why that would matter: one of the things that we have started looking at in the field of networks and HIV is the role that concurrent partnerships play.
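Concurrency, which the talk defines next, is a property of timing rather than of partner counts: two partnerships of the same person are concurrent if their active intervals overlap. A sketch with hypothetical day-number intervals (the dates are made up, not data from the talk):

```python
# Concurrency check: same number of partners, different timing.

def concurrent(p1, p2):
    """Two partnerships (start, end) are concurrent if their intervals overlap."""
    s1, e1 = p1
    s2, e2 = p2
    return max(s1, s2) <= min(e1, e2)

# Serial monogamy: three partners, no overlap.
serial = [(0, 10), (11, 20), (21, 30)]
# Concurrent: one long partnership spanning the interval, another overlapping
# it, plus a one-day stand that makes three concurrent at one point in time.
concur = [(0, 30), (5, 15), (12, 12)]

def any_concurrency(partnerships):
    return any(concurrent(a, b)
               for i, a in enumerate(partnerships)
               for b in partnerships[i + 1:])

print(any_concurrency(serial))  # False
print(any_concurrency(concur))  # True
```

Both histories have the same number of partners; only the timing and sequence differ, which is exactly the point of the Uganda-Thailand comparison that follows.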

FIGURE 4

Concurrent partnerships are partnerships that don't obey the rule of serial monogamy. That is, each partnership does not have to end before the next one begins. You can see in Figure 4, in the long yellow horizontal line labeled "1," we have somebody who has a partnership all the way across the time interval, and then a series of partnerships concurrent with that, including a little one-night stand labeled "5" that makes three of them concurrent at some point in time. Now, this is the same number of partners as in the upper graph, so it is not about multiple partners, although you do need multiple partners to have concurrent partners. This is really a function of timing and sequence.

What we find in Uganda and Thailand is very interesting. Uganda, at the time we were doing this study, had a prevalence of about 18 percent and Thailand had about 2 percent, which is 10 times less. In Uganda men reported about 8 partners in their lifetime on average, and in Thailand it was 80. Now, 10 years later, Thailand doesn't have the epidemic that would come from having 80 partners in your lifetime, so something else is obviously going on: it is not just the number of partners. We looked into concurrency a little bit, and we found that in both cases you get concurrent partnerships. So, it is not the simple prevalence of concurrency that is the difference. The difference comes in the kinds of partnerships that are concurrent. In Uganda you tend to get two long-term partnerships that overlap, whereas in Thailand you have the typical sex partner pattern, which is one long-term partnership and then a short term on the side. The net result is that in Uganda the typical concurrent partnership is actually active on the day of the
interview—41 percent of them were, and they had been going for three years. The ones that are completed are also reasonably long, about 10 months. In Thailand you don't get anywhere near as many active on the day of the interview. They are about two months long. Ninety-five percent of them are a day long, so these concurrencies happen and they are over. That kind of thing can generate a very different epidemic pattern. The simulations that we have done suggest that if you take this into account you do, in fact, generate an epidemic in Uganda that has about 18 percent prevalence, whereas Thailand will typically have just about 2 percent.

The approach that we take is to think about generative models here: local rules that people use to select partners, which cumulate up to global structures and a network. What we want is a generative stochastic model for this process, and that model is not going to look like, for example, "I want to create clustering." A clustering coefficient, although it can be a great descriptive summary for a network, is not necessarily going to function well as a generative model for a network. It is also probably the case that when I select a partner I am not thinking that I want to create the shortest path, the shortest geodesic, to the east coast. That is probably also not going on. Again, it's a nice summary of a network but probably not a generative property.

We want to be able to fit both exogenous and endogenous effects in these models, and that turns out to be an interesting and difficult problem. We also want this to work well for sampled data. We want to be able to estimate based on observed data, and then make an inference to a complete network. Figure 5 shows what this generative model is going to look like. We have adapted this from earlier work in the field. It is an exponential random graph model.
It basically takes the probability of observing a network or graph—a set of relationships—as a function of something that looks a little bit like an exponentiated version of a linear model, with a normalizing constant down below that sums over all possible graphs of that size. This is the probability of the graph as a function of a model prediction, with that sum as the normalizing constant. The origins go back in the statistical literature at least to Bahadur in 1961, who talked about a multivariate binomial. Besag adapted this for spatial statistics, and Frank and Strauss first proposed it for networks in 1986. The benefits of this approach are that it considers all possible relations jointly, and that is important because the relations here are going to be dependent if there are any endogenous processes going on, like this friend-of-a-friend process. It is an exponential family, which means it has very well understood statistical properties, and it also turns out to be very flexible. It can represent a wide range of configurations and processes.
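Schematically, the model sets P(X = x) = exp(θ·z(x)) / c(θ), where c(θ) sums exp(θ·z) over every graph of that size. A brute-force sketch on a toy four-node graph makes the structure concrete; the statistics (edges and triangles) and θ values below are illustrative, not from the talk.

```python
# A brute-force sketch of the exponential random graph model on a toy graph:
# P(X = x) = exp(theta . z(x)) / c(theta), where the normalizing constant
# c(theta) sums over every possible graph of that size. Feasible only for
# tiny n; the statistics and theta values here are illustrative.
from itertools import combinations, product
from math import exp

n = 4
dyads = list(combinations(range(n), 2))  # all 6 possible undirected edges

def stats(edge_set):
    """z(x) = (number of edges, number of triangles)."""
    triangles = sum(
        1 for a, b, c in combinations(range(n), 3)
        if {(a, b), (a, c), (b, c)} <= edge_set
    )
    return (len(edge_set), triangles)

def weight(edge_set, theta):
    z = stats(edge_set)
    return exp(theta[0] * z[0] + theta[1] * z[1])

def all_graphs():
    """Enumerate every possible graph as a set of present dyads."""
    for bits in product([0, 1], repeat=len(dyads)):
        yield frozenset(d for d, on in zip(dyads, bits) if on)

def ergm_prob(edge_set, theta):
    """Probability of one graph: its weight over the sum of all weights."""
    c = sum(weight(g, theta) for g in all_graphs())
    return weight(edge_set, theta) / c

theta = (-1.0, 0.5)  # illustrative: favors sparse graphs, mild triangle bonus
probs = [ergm_prob(g, theta) for g in all_graphs()]
print(round(sum(probs), 6))  # a proper distribution: sums to 1
```

The normalizing constant is the obstacle the talk returns to later: even at 50 nodes there are 2^1225 graphs, so this enumeration is hopeless beyond toy sizes.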

FIGURE 5

What does model specification mean in a setting like this? There are two things we must choose: the set of network statistics z(x), and whether or not to impose homogeneity constraints on the parameter θ. We are going to choose a set of network statistics that we think are part of the self-organizing properties of this graph. A network statistic is just a configuration of dyads. Edges are a single dyad—that is the simplest network statistic, and it is used to describe the density of the graph. Others include k-stars, which are nested in the sense that a 4-star contains quite a few 3-stars, and a 3-star contains three 2-stars. So, that is a nested kind of parameterization that is common in the literature. We tend to prefer something like degree distributions instead, in part because I think they are easier to understand. Degrees are not nested: a node has one degree only, and you count up the number of nodes that have that degree. Triads, or closed triangles, are typically the basis for most of the clustering parameters that people are interested in. Almost anything you can think of in terms of a configuration of dyads can be represented as a network statistic.

Then you have the parameter θ, and your choice there is whether you want to impose homogeneity constraints; I believe Peter Hoff talked about this a little bit in his talk this morning.
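The statistics just named can all be computed directly from an adjacency structure. A sketch on a made-up toy graph, using the standard counting identities (a node of degree d is the center of C(d, k) k-stars):

```python
# Computing the network statistics named above from a toy adjacency list:
# edges, k-stars, triangles, and the (non-nested) degree distribution.
from itertools import combinations
from math import comb
from collections import Counter

adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}  # made-up undirected graph

degrees = {v: len(nbrs) for v, nbrs in adj.items()}
n_edges = sum(degrees.values()) // 2  # each edge counted from both ends

def k_stars(k):
    """Number of k-stars: a node of degree d is the center of C(d, k) of them."""
    return sum(comb(d, k) for d in degrees.values())

n_triangles = sum(
    1 for a, b, c in combinations(adj, 3)
    if b in adj[a] and c in adj[a] and c in adj[b]
)

degree_dist = Counter(degrees.values())  # degree -> number of nodes

print(n_edges, k_stars(2), k_stars(3), n_triangles, dict(degree_dist))
```

The nesting is visible in the counts: the one 3-star here (centered on node 0) contributes three of the five 2-stars, which is why nested k-star terms are harder to interpret than the degree distribution.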

FIGURE 6

There is clearly a lot of heterogeneity in networks. Heterogeneity can be based on either nodal or dyadic attributes. People can be different because of their age, their race, their sex—those kinds of exogenous attributes—or different types of partnerships may be different.

Let's think a little bit about how we create these network statistics. Referring to Figure 6, the vector here can range from something really minimal, such as the number of edges alone, which is the simple Bernoulli model for a random graph. But the vector can also be a saturated model, with one term for every dyad, which is a large number of terms. Obviously, you don't move immediately to a saturated model. It wouldn't give you any understanding of what was happening in the network anyway, so what we are trying to do is work somewhere in between these two, with some parsimonious summary of the structural regularities in the network.

Figure 6 gives some examples from the literature that have been used in the past; the initial model was the p1 model of Holland and Leinhardt in 1981. Peter Hoff's talk this morning fit the framework of a p1 model: it had an actor-specific in-degree and an actor-specific out-degree—two parameters for every actor in the network—plus a single parameter for the number of mutual dyads, that is, when i nominates j and j nominates i in return. That is a dyad-independent model, which is to say that the only dependence is within a dyad, between two nodes, through this mutual term; all other dyads are independent. That is a minimal model of what is going on in a network.

The first attempt to really begin to model dependent processes in networks—that is, where the edges are dependent—was the Markov graph proposal of Frank and Strauss, which is also
shown in Figure 6. This model includes terms for the number of edges, the number of k-stars—again, those are the nested terms—and the number of triangles.

In each of those cases you can impose homogeneity constraints on θ or not. So, for any network statistic in your model, the θ parameters can range from minimal—a single homogeneous θ for all actors—to maximal, where there is a family of θi's, each actor- or configuration-specific. That was the case in Peter Hoff's model: every actor would have two parameters there. You can say that every configuration has its own parameter, which is a function of the two actors, or the multiple actors, involved in it. In the Bernoulli graph, the edges are the statistic, and there is a single θ that says every edge is as likely as every other edge. So, that is a homogeneity constraint. When you go with maximal θ parameters, you quickly max out and lose insight, but you can sure explain a lot of variance that way. In between are parameters that are specific to groups or classes of nodes—so you might have different levels of activity or different mixing propensities by age or by sex.

In addition, there are parametric versus non-parametric representations of ordinal distributions. For a degree distribution, you can have a parameter for every single degree, or you could impose a parametric form of some kind—a Poisson, a negative binomial, something like that. The group parameterizations are typically used to represent attribute mixing in networks. We have heard a lot about that, and this is the statistical way to handle it; a lot of work has been done on that over the last 20 years. The parametric forms are usually used to parsimoniously represent configuration distributions—degree distributions, shared-partner distributions, and things like that.

FIGURE 7

Estimating these models has turned out to be a little harder than we had hoped; otherwise, we would all be talking about statistical models for networks today, and I don't think there would be anybody in here talking about anything else. The reason is that this thing is a little bit hard to estimate. Figure 7 shows the likelihood function P(X = x) that we are trying to maximize. The normalizing constant c makes direct maximum likelihood estimation of the θ vector impossible because, even with 50 nodes, there is an almost uncountable number of graphs. So, you are not going to be able to compute that directly.

Typically, there have been two approaches to this. The first, which dominated the literature for the last 25 years, is pseudolikelihood, which is essentially based on a logistic regression approximation. We now know, and we knew even then, that this isn't very good when the dependence among the ties is strong, because it makes assumptions about independence. MCMC is the alternative; it theoretically guarantees estimation.

FIGURE 8

But that's only "theoretically." Implementation has turned out to be very challenging. The reason is something called model degeneracy. Mark Handcock is going to talk a lot about this tomorrow, but I am just going to make a couple of notes about it today. Figure 8 shows a really simple model for a network, just density and clustering. Those are the only two things
that we think are going on. There is the density term, which is just the sum of the edges, and c(x), the clustering coefficient that people like to use, which is the fraction of 2-stars that are completed triangles.

What are the properties of this model? Let's examine it through simulation. Start with a network of 50 nodes. We are going to set the density to be about 4 percent, which is about 50 edges for a Bernoulli graph. The expected clustering, if this were a simple random graph, would then be just 3.8 percent, but let's give it some more clustering. Let's bump it up to 10 times higher than expected and see how well this model does. By construction, the networks produced by this simple model will have an average density of 4 percent and an average clustering of 38 percent. Figure 9 shows what the distribution of those networks looks like.

FIGURE 9

The target density and clustering would be where the dotted lines intersect, but virtually none of the networks look like that. Most of these networks, instead, are either fairly dense but without enough triangles, or not so dense but almost entirely clustered into triangles. What does that mean? It means that this is a bad model. For a graph with these properties—if you saw this kind of clustering and density—this is a bad model for it. The graph didn't come from that model; that is what it says. In the context of estimation, we call the model degenerate when the graph used to estimate the model is extremely unlikely under that model. Instead, the model
places almost all of its probability mass on graphs that bear no resemblance to the original graph. We never see anything like this in linear model settings, so this hung people up for about 10 years: they couldn't figure out why, every time they tried to estimate using MCMC, the parameters they got back would create either complete graphs or empty graphs, and nothing in between. What they did was make simpler and simpler models, because they figured something had to be wrong with the algorithm or something else. It turns out that the simpler the model the worse the properties, and 10 years of science went down the drain.

The morals—and I think these really are morals, because they are a new way to think about data—are three. First, descriptive statistics may not produce good generative models, and there is more truth to this in this setting than in the linear model setting. Second, starting simple and adding complexity often doesn't work—at least not the way we expect it to. Third, I think it is going to take a long time to develop new intuition, and we are going to have to develop new model diagnostics to help our intuition. One thing we can say pretty clearly now is that statistics for endogenous processes—like this friend-of-a-friend stuff, the clustering parameter—need to be handled differently, because they induce very large feedback effects. A little bit of that and all of a sudden the whole graph goes loopy. It has to create lots of whatever it is, because there is nothing to prevent it.

FIGURE 10
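The baseline of the degeneracy example above can be checked in a minimal simulation sketch: Bernoulli graphs on 50 nodes at 4 percent density have clustering near the density itself, nowhere near the 38 percent target, so the triangle term has to generate all of the extra clustering through exactly the feedback just described. The seed and replication count are arbitrary.

```python
# Clustering in Bernoulli graphs: for independent ties, the expected
# fraction of 2-stars that close into triangles is about the density.
import random
from itertools import combinations

def bernoulli_graph(n, p, rng):
    """Each of the C(n, 2) dyads is a tie independently with probability p."""
    return {d for d in combinations(range(n), 2) if rng.random() < p}

def clustering(n, edges):
    """Fraction of 2-stars that are completed triangles."""
    adj = {i: set() for i in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    two_stars = closed = 0
    for v in range(n):
        for a, b in combinations(sorted(adj[v]), 2):
            two_stars += 1
            if b in adj[a]:
                closed += 1
    return closed / two_stars if two_stars else 0.0

rng = random.Random(0)
cs = [clustering(50, bernoulli_graph(50, 0.04, rng)) for _ in range(200)]
mean_c = sum(cs) / len(cs)
print(round(mean_c, 3))  # near the density of 0.04, far below the 0.38 target
```

The gap between this baseline and the 38 percent target is what the two-parameter model has to bridge, and the bimodal outcome in Figure 9 shows how badly a raw triangle term does it.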

We have been thinking a little bit about new model terms to capture some of these endogenous clustering effects, because we think they are important, and we do want to be able to capture them. As shown in Figure 10, the old forms of these were typically either the number of triangles—that is what you saw in the Markov random graphs—or a clustering percentage. In those cases, every triangle has the same impact, θ. That is your parameter, and that is obviously a problem. What we have done is come up with new models where we parameterize a distribution of shared partners. You can think of those shared partners as each giving you an increasing probability of having a tie with the other person, but with a declining marginal return, so it doesn't blow up in the same way. It looks ugly, but it turns out to be a fairly simple weight that creates an alternating k-star—that is one way of thinking about it. 2-stars will be positively weighted and 3-stars will be negatively weighted; 4-stars will be positively weighted, and 5-stars will be negatively weighted. That is what this thing does. It tends to create reasonably nice, smooth parametric distributions for shared-partner statistics.

In the time I have left I want to give you a sense of how these models work in practice, and how friendly they look when you get them to actually estimate things well. Imagine a friendship network in which two different clustering processes are at work. There is homophily, the birds-of-a-feather exogenous attribute effect, in which people tend to choose as friends (i.e., create a link to) people who are like them in grade, race, etc. Then there is transitivity, in which people who have friends in common tend to also become friends. This is a friend-of-a-friend endogenous process for creating links. That can also generate a fair amount of assortative mixing, or can come from assortative mixing.
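One common form of the declining-marginal-return weight is the geometrically weighted shared-partner weight, w(i) = 1 - (1 - e^(-alpha))^i, where i is the number of shared partners on an edge. This is a sketch of that idea with an illustrative decay parameter alpha, not the exact specification from the talk.

```python
# Declining marginal returns for shared partners: each additional shared
# partner raises the weight, but by a geometrically shrinking increment,
# so the statistic cannot blow up the way raw triangle counts do.
from math import exp

def gw_weight(i, alpha):
    """Weight for an edge with i shared partners, geometrically weighted form."""
    return 1.0 - (1.0 - exp(-alpha)) ** i

alpha = 0.5  # illustrative decay parameter
weights = [gw_weight(i, alpha) for i in range(6)]
increments = [b - a for a, b in zip(weights, weights[1:])]

print([round(w, 3) for w in weights])
# Weights rise with shared partners, but each increment is smaller than the last:
print(all(later < earlier for earlier, later in zip(increments, increments[1:])))
```

Because the increments shrink geometrically, the first shared partner matters most and further ones matter less and less, which is what prevents the runaway feedback that degenerate triangle-count models exhibit.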
Imagine three of our actors in the same grade, and two of them are tied to the same person. Then the question is whether those two will have a tie between themselves. If they do, it could be because of transitivity, or it could be because they are in the same grade. So, how do we distinguish these? We have a number of different kinds of ties here. There are within-grade ties, which can come from both of these processes, and across-grade ties, which we can assume are due to transitivity. There are triangles that come from both, and there are ties not forming triangles, which we can assume come from homophily. So, we do have a little bit of information here that we can use to tease these apart statistically, and we are going to work with data collected from a school.
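The cross-classification of ties just described can be sketched on a toy labeled graph: classify each tie by whether it is within-grade (homophily can explain it) and whether it closes a triangle (transitivity can explain it). The nodes, grades, and edges below are made up for illustration.

```python
# Cross-classify ties by (within-grade?, closes a triangle?) on a toy graph.
from collections import Counter

grade = {0: 9, 1: 9, 2: 9, 3: 10, 4: 10}                # hypothetical grades
edges = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (1, 3)}  # sorted node pairs

def in_triangle(e, edge_set):
    """Does edge e share both endpoints with some third node's ties?"""
    a, b = e
    others = set(grade) - {a, b}
    return any(tuple(sorted((a, c))) in edge_set and
               tuple(sorted((b, c))) in edge_set for c in others)

table = Counter(
    (grade[a] == grade[b], in_triangle((a, b), edges)) for a, b in edges
)
# (within_grade, closes_triangle) -> number of ties
print(dict(table))
```

Cross-grade ties inside triangles and within-grade ties outside triangles are the informative cells: they are the ties that only one of the two processes can plausibly explain.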

FIGURE 11

Figure 11 shows data from school 10, a school that has a junior high school and a high school. You can see there aren't a whole lot of links between those two. In addition, there is a fair amount of grade-assortative mixing. Because we don't have a lot of black students in this school—it is mostly white and Asian—there is a little more mixing by race than you might get otherwise, so the race effects will not be quite as strong here.

We are going to try four models: edges alone (this is our simple random graph); edges plus attributes, which says the only thing going on is that people are choosing others like themselves; edges plus the weighted edgewise shared-partner statistic, which is transitivity only, just the friend-of-a-friend process; and then both of these together. Figure 12 summarizes these four models.

FIGURE 12

Our approach is depicted schematically in Figure 13. We start with the data and our model, and we estimate our model coefficients. We can then simulate graphs with those properties—those particular coefficients and statistics—drawn from a probability distribution. We are going to compare the graph statistics from the simulated data to the graph statistics from our observed data, but the graph statistics we use are not actually statistics that were in the model. What we are trying to do is predict things like path lengths or geodesics that are not part of our original model, because that says we have got the local rules right. These are the local rules that generate those global properties. That is the approach we are going to take.
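The simulate-and-compare loop can be sketched with the simplest possible "fit": match a Bernoulli model's tie probability to the observed density, simulate many graphs, and then examine a statistic that was not in the model (here, triangle counts) alongside one that was (edges). The toy "observed" graph and the seed are arbitrary.

```python
# Goodness-of-fit by simulation: in-model statistics match by construction;
# the informative comparison is on statistics the model never saw.
import random
from itertools import combinations

def triangles(n, edge_set):
    return sum(1 for a, b, c in combinations(range(n), 3)
               if {(a, b), (a, c), (b, c)} <= edge_set)

n = 30
# A made-up "observed" graph with some structure.
observed = {d for d in combinations(range(n), 2) if (d[0] * d[1]) % 7 == 1}
density = len(observed) / (n * (n - 1) / 2)  # the fitted Bernoulli parameter

rng = random.Random(1)
sim_edges, sim_tris = [], []
for _ in range(100):
    g = {d for d in combinations(range(n), 2) if rng.random() < density}
    sim_edges.append(len(g))
    sim_tris.append(triangles(n, g))

mean_edges = sum(sim_edges) / 100  # matches the data by construction
mean_tris = sum(sim_tris) / 100    # the real check: a statistic not in the model
print(len(observed), round(mean_edges, 1))
print(triangles(n, observed), round(mean_tris, 1))
```

Any gap on the out-of-model statistics is evidence that the local rules are wrong, which is exactly the logic behind the degree, shared-partner, and geodesic comparisons in the figures that follow.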

FIGURE 13

Figure 14 gives an example for a Bernoulli graph. You can see what the data look like, and the chart shows the degree distribution from that graph. Students were only allowed to nominate their five best male and five best female friends, so the distribution is truncated at 10. Nobody reported 10 in this particular school.

FIGURE 14

FIGURE 15
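A degree distribution of this kind is just a tabulation of how many nodes have each number of ties. A minimal sketch on a hypothetical four-node graph (in the school data the survey cap described above truncates the degrees at 10):

```python
# Tabulate the degree distribution of a small hypothetical graph.
from collections import Counter

nbrs = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
deg_dist = Counter(len(v) for v in nbrs.values())
print(sorted(deg_dist.items()))  # [(0, 1), (1, 2), (2, 1)]
```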

Figure 15 shows what the simulations look like from the Bernoulli model. You can see we don't get the degree distribution very well.

FIGURE 16

Figure 16 shows the edgewise shared-partner statistic. This is, for every edge in the graph, how many shared partners its two endpoints have. That is like a generalized degree distribution. That is what it looks like in the observed data. Figure 17 shows what it looks like from the Bernoulli model; obviously the Bernoulli model isn't capturing any of that kind of clustering at all.
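The edgewise shared-partner distribution can be computed directly: for each edge, count the partners shared by its two endpoints, then tabulate. A hedged sketch on a toy graph, not the school data:

```python
# Edgewise shared-partner (ESP) distribution for a toy graph.
from collections import Counter

edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
nbrs = {}
for u, v in edges:
    nbrs.setdefault(u, set()).add(v)
    nbrs.setdefault(v, set()).add(u)

# For each edge, the number of partners its endpoints share
esp = Counter(len(nbrs[u] & nbrs[v]) for u, v in edges)
print(sorted(esp.items()))  # [(0, 1), (1, 3)]
```

Here three edges sit in one triangle each and one edge (c, d) hangs off it with no shared partner; in a Bernoulli graph of the same density, almost all edges would have zero shared partners.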

FIGURE 17

Finally, Figure 18 shows the minimum geodesic distance between all pairs, with a certain fraction of them being unreachable.

FIGURE 18

Figure 19 shows how the Bernoulli model does, and it doesn't do very well.
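The geodesic distribution is computed by breadth-first search from every node, tabulating shortest-path lengths over all pairs and counting unreachable pairs separately. A minimal sketch on a hypothetical graph with one isolate:

```python
# Geodesic (shortest-path) distribution via BFS, toy graph.
from collections import Counter, deque

nbrs = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}  # d isolated

def bfs_dists(src):
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in nbrs[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

geo = Counter()
unreachable = 0
nodes = sorted(nbrs)
for i, s in enumerate(nodes):
    d = bfs_dists(s)
    for t in nodes[i + 1:]:
        if t in d:
            geo[d[t]] += 1
        else:
            unreachable += 1

print(sorted(geo.items()), unreachable)  # [(1, 2), (2, 1)] 3
```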

FIGURE 19

Putting these all together, Figure 20 shows the goodness-of-fit measures.

FIGURE 20

Figure 21 shows what it looks like for all the models. The first column shows the degree distribution; even the Bernoulli model is pretty good. Adding attributes doesn't get you much, but when you add the shared partners you get it exactly right, and the same is true when you add both the shared partners and the attributes. For the local clustering, which is this edgewise shared-partner term, the Bernoulli model does terribly, and I can't say the attribute model does a whole lot better. Of course, once you put the two-parameter weighted shared-partner distribution in, you capture that pretty well.
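The two-parameter weighted shared-partner term mentioned here is commonly written in the ERGM literature as the geometrically weighted edgewise shared partner (GWESP) statistic, in which each additional shared partner contributes a geometrically shrinking increment. A sketch of the per-edge weight, assuming that standard form and a hypothetical decay parameter:

```python
# Per-edge GWESP weight: each extra shared partner adds a geometrically
# shrinking increment, so the weight saturates at exp(alpha) instead of
# growing without bound the way a raw triangle count does.
import math

alpha = 0.5  # hypothetical decay parameter

def gwesp_weight(k, alpha):
    """Contribution of one edge whose endpoints have k shared partners."""
    return math.exp(alpha) * (1.0 - (1.0 - math.exp(-alpha)) ** k)

for k in (0, 1, 2, 5, 20):
    print(k, round(gwesp_weight(k, alpha), 3))
```

This saturation is what keeps the statistic from rewarding ever-denser clumps of triangles, the runaway feedback behind the degenerate edge-plus-triangle models discussed earlier.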

FIGURE 21

You don't capture the geodesic well with edges alone, but it is amazing how well you get the geodesic from attribute mixing alone, just from that homophily. In fact, the local clustering term, this edgewise shared partner, doesn't capture it anywhere near as well, and you actually don't do as good a job when you put both of them in. So, Figure 22 is the eyeball test. That is a different approach to goodness of fit. One thing you want to make sure of with these models is that they aren't just crazy. Obviously, those degenerate models were crazy, and you could see that very quickly: they would either be empty or they would be complete. For this figure, it would actually be hard to tell which one was simulated and which one was real. From the eyeball perspective, they are getting the structure pretty well.

FIGURE 22

There are 50 schools from which we can draw information. There are actually more, but 50 that have good information. They range in size from fairly small (school 10 is one of the smaller mid-sized schools, with 71 students in the data set) up to beyond 1,000 nodes. We have now used these models on 3,000-node networks, and they are very good and very stable. Figure 23 compares some results for different network sizes, using the model that has both the friend-of-a-friend and the birds-of-a-feather processes in it. It does very well for the smaller networks, but as you start getting out into the bigger networks, the geodesics are longer than you would expect. Basically, I think that is telling you there is more clustering, more pulling apart in these networks, and less homogeneity than these models assume. So, there is something else that is generating some heterogeneity here.

FIGURE 23

The other thing you can do is compare the parameter estimates across the models, which is really nice. In Figure 24, we look at 59 schools, using the attribute-only model. You can see the differential effects for grades: the older students tend to have more friends.

FIGURE 24

Figure 25 shows the homophily effect, the birds-of-a-feather effect, which is interesting. You see that it is strongest for the youngest and the oldest, and those are probably floor and ceiling effects. Mean effects for race don't show up as being particularly important, but blacks are significantly more likely than any other group to form assortative ties. Interestingly, you can see that Hispanics really bridge the racial divide. So, there are all sorts of nice things you can do by looking at those parameters as well. Finally, the other thing you can do is examine the effect of adding a transitivity term to a homophily model; that is, how much of the effect we attributed to homophily actually turns out to be this transitivity effect instead? It turns out the grade-based homophily estimates fall by about 14 percent once you control for this friend-of-a-friend effect. The race homophily usually falls but sometimes actually rises, so once you account for transitivity you find that the race effects can be even stronger than you would have expected with just the homophily model. This is shown in Figure 26.

FIGURE 25

FIGURE 26

The transitivity estimate, this friend-of-a-friend effect, falls by nearly 25 percent once you control for the homophily term, as shown in Figure 27. This doesn't seem like much, but it is amazing to be able to do this kind of thing on networks, because we have not been able to do this before. We have not been able to test these kinds of things before. What we have now is a program and a theoretical background that allow us to do this.

FIGURE 27

What this approach offers is a principled method for theory-based network analysis, where you have a framework for model specification, estimation, comparison, and inference. These are generative models and they have tests for fit, so it is not just that you see there is clustering; you can ask whether it is homogeneous clustering and how well the model fits. These give you the answers to those questions. We have methods for simulating networks that fall out of these things automatically, because we can reproduce the known, observed, or theoretically specified structure just by using the MCMC algorithm. For cross-sectional snapshots, this is a direct result of the fitting procedure. For dynamic stationary networks, it is based on a modified MCMC algorithm, and for dynamic evolving networks you have to have model terms for how that evolution proceeds. You can then simulate diffusion across these networks and, in particular for us, disease transmission dynamics. It turns out the methods also lend themselves very easily to sampled networks and other missing network data.
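The MCMC simulation idea can be sketched for the simplest case. This is a hedged, illustrative toggle sampler for an edges-only model, with a hypothetical size and coefficient; the statnet samplers handle the full statistic vector and the other model terms.

```python
# Illustrative Metropolis toggle sampler for an edges-only ERGM:
# propose flipping one dyad, accept with probability
# min(1, exp(theta * delta)), where delta is the change in edge count.
import math
import random

random.seed(1)
n = 20            # hypothetical network size
theta = -2.0      # hypothetical edges coefficient

adj = [[False] * n for _ in range(n)]
for _ in range(20000):
    i, j = random.sample(range(n), 2)
    delta = -1 if adj[i][j] else 1
    if random.random() < math.exp(theta * delta):
        adj[i][j] = adj[j][i] = not adj[i][j]

edges = sum(adj[i][j] for i in range(n) for j in range(i + 1, n))
density = edges / (n * (n - 1) / 2)
# At stationarity the expected density is exp(theta)/(1+exp(theta)), about 0.12
print(round(density, 2))
```

For this dyadic-independent model the sampler just reproduces a Bernoulli graph; the point is that the same toggle-and-accept machinery works unchanged when the change statistic includes shared-partner or attribute terms.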

With that, I will say that if you are interested in learning more about the package we have, it is an R-based package. We are going to make it available as soon as we get all the bugs out. If you want to be a guinea pig, we welcome guinea pigs; all you need to do is take a look at http://csde.washington.edu/statnet. Thank you very much.

QUESTIONS AND ANSWERS

DR. BANKS: Martina, that was a lovely talk, and you are exactly right: you can do things that have never been done before, and I am deeply impressed by all of it. I do wonder if perhaps we have not tied ourselves into a straitjacket by continuing to use the analytic mathematical formulation of these models.

DR. MORRIS: The analytic what formulation?

DR. BANKS: An analytic mathematical formulation growing out of a p1 model and just evolving. An alternative might be to use agent-based simulation, to try to construct things from simple rules for each of the agents. For example, it is very unlikely that anybody could have more than five best friends, or things like that.

DR. MORRIS: Actually, you would be surprised how many people report that. I am kidding. I agree, except that I think that, depending on how you want to define agent-based simulation, these are agent-based simulations. What I am doing is proposing certain rules about how people choose friends. So, I choose friends because I tend to like people the same age as me, the same race as me, the same socioeconomic status. Those are agent-based rules and, when other people are using those rules, we are generating the system that results from those rules being in operation. I don't see a distinction between these two in quite the same way, but I do agree. One thing that we did do was focus on the Markov model for far too long. It was edges, stars, and triangles. I think there is an intuitive baby in that bath water, and that is that edges are only dependent if they share a node. That is a very natural way to think about trying to model dependency, but I think it did narrow our focus probably more than it should have.

DR. BANKS: You are exactly right. Your models do instantiate something that could be an agent-based system, but there are other types of rules that would be natural for friendship formation that would be very hard to capture in this framework, I think.

DR. MORRIS: I would be interested in talking to you about what those are.

DR. BANKS: For example, the rule that you can't have more than five friends might be hard to build in.

DR. MORRIS: No, in fact, that is very easy to build in. That is one of the first things we had to do to handle the missing data here, because nobody could have more than five male or five female friends.

DR. HOFF: There was one middle part of your talk which I must have missed, because you started talking about the degeneracy problem with the exponentially parameterized graph models, and then at the end we saw how great they were. So, at some point there was a transition. Is it including certain statistics that makes them less degenerate, or is it the heterogeneity you talked about? I could see how adding heterogeneity to your models or to the parameters is going to drastically increase the amount of the sample space that you are going to cover. Could you give a little discussion about that?

DR. MORRIS: These models have very little heterogeneity in them relative to your models. Every actor does not have both an in-degree and an out-degree; there is basically just a degree term for classes. So, grades are allowed to have different degrees, and race is allowed to have different degrees. But that wasn't what made this work. What made this work was the edgewise shared partner. When we originally tried using the local clustering term, as either the clustering coefficient or the number of triangles with just a straight theta on it, those were degenerate models. The edgewise shared partner doesn't solve all problems either, but at least it bounds the tail: it says that people can't have that many shared partners, and that is essentially the effect it has. So, that is what changed everything.

DR. JENSEN: I think the description of model degeneracy is wonderful. Fortunately, it wasn't 10 years of work in my case; it was more like 6 months of work that went down the drain, and I didn't know why, and I think you have now explained it. Is that written up anywhere? Are there tests that we can run for degeneracy? What more can we do?

DR. MORRIS: That is a great question. Mark Handcock is really the wizard of model degeneracy, and I think he is going to give a talk tomorrow that can answer some of those questions. I don't think we have a test for it yet, although you can see whether your MCMC chain is mixing properly and, if it is always down here and then all of a sudden it goes up here, then you know you have a problem. It is still a bit of an art. statnet, this package, will have a lot of this stuff built into it.

DR. SZEWCZYK: My question is, is it model degeneracy, or is it model inadequacy? I look at a lot of these things, and my question is, can we take some of these models and, rather than just fitting one universal model, go in there and, say, fit a mixture of these p-star models, or these Markov models, or p1 models, rather than assuming that everyone acts the same within these groups?

DR. MORRIS: There are lots of ways to try to address the heterogeneity, I agree with you, and I think they need to be more focused on the substantive problem at hand.

So, just throwing in a mixture or throwing in a latent dimension, to me, kind of misses the point of why people form partnerships with others. So, when I go into this, I go in saying I want to add attributes to this. A lot of people who have worked in the network field don't think attributes matter, because somehow that is still at the respondent level. We all know that, as good network analysts, we don't care about individuals; we only care about network properties and dyads.

DR. SZEWCZYK: We care about individuals.

DR. MORRIS: Attributes do a lot of work. They do a lot of heavy lifting in these models, and they actually explain, I think, a fair amount. I would call it model degeneracy in this case only because you get an estimate and you might not even realize it was wrong. In fact, when people used the pseudolikelihood estimates, they had no idea that the cute little estimate they were getting, with a confidence interval, made no sense at all. It is degenerate because of what it does when it performs the function: it actually gets the average right, but then it gets all the details wrong. So, you can call that inadequate, and it is. It is a failure. It is a model failure. That is very clear.

DR. WIGGINS: So, one thing I was wondering about, to follow up on that: since each of these models defines a class, I wonder if you have thought about treating this using classifiers, large-margin classifiers like support vector machines. Some anecdotal evidence is that sometimes you can tell that none of your network models is really good for a network that you are interested in. Some of these techniques measure everything at once, rather than a couple of features you want to reproduce, and will show you that one network looks like model F in terms of one attribute but turns out to look like model G in terms of another. That might be one way of seeing whether you have heterogeneity, or whether none of your models is a good model. If you have a classifier, then all the different classifiers might choose "not me" as the class, in which case you can see that none of your models is the right model.

DR. MORRIS: Yes, that is a nice idea.


