Proceedings of a Workshop on Statistics on Networks (CD-ROM) (2007)

Chapter: Dependency Networks for Relational Data--David Jensen, University of Massachusetts

Suggested Citation: "Dependency Networks for Relational Data--David Jensen, University of Massachusetts." National Research Council. 2007. Proceedings of a Workshop on Statistics on Networks (CD-ROM). Washington, DC: The National Academies Press. doi: 10.17226/12083.

Dependency Networks for Relational Data

David Jensen, University of Massachusetts

DR. JENSEN: What I'm going to talk about is some work that has been done in statistical modeling of relational data; I'll tell you what I mean by that in a moment. This is joint work with one of my students, Jennifer Neville, who is one of those fabulous students to whom the previous speaker referred. She is also on the job market this year, so if you happen to be in computer science and looking for a good faculty candidate, she should definitely be on your list.

The main point of what I'm going to talk about is a new joint model for relational data: relational dependency networks (RDNs). I'll also discuss it in the context of a particular application of this kind of statistical modeling, some fraud-detection work that we have been doing for the National Association of Securities Dealers. I will also try to make this talk serve as an exemplar of work in this area of computer science, particularly in what we generally think of as machine learning or knowledge discovery. Finally, I'll try to point to some interesting new research directions.

This talk is a bit of a Russian doll: I make multiple passes, going a little deeper each time. I'm a computer scientist, and we do iterative-deepening search, so think of it that way. You may want to gauge your questions with that in mind.

FIGURE 1

First, let me start with the application. About two years ago we were approached by some people at the National Association of Securities Dealers (NASD).

They are the world's largest regulator of securities brokers, essentially deputized by the Securities and Exchange Commission to regulate the market in much the same way that the American Bar Association and the AMA regulate their particular professions. Some background on the NASD is included in Figure 1. Their particular interest was in predicting fraud, and in focusing their examinations for fraud, by accounting for the fact that fraud is a social phenomenon. They came to us because we were already doing this kind of work: modeling that takes into account context, the relations between people, Web pages, or other things.

NASD has a large dataset, and Figure 2 is a rough schema of the data. Intriguingly, this is data that you can access online, a single record at a time. We have the entire database, but anyone can look up the record of a particular stockbroker to find out whether that broker has engaged in questionable conduct before working with them. The website is http://www.nasdbrokercheck.com.

FIGURE 2

The NASD has data on roughly 1 million brokers, 300,000 branches, and 16,000 firms, so it is quite a large dataset, and it is virtually complete. It is a wonderful network to look at because there are essentially no missing data. The one respect in which data are missing concerns disclosures: brokers file disclosures indicating that they may have done something wrong. A disclosure can be a minor customer complaint or a felony; there is a large range of possible behaviors. It is a self-reporting system, so brokers file data on themselves, or their branch or firm files data about them. Certainly in the case where a broker committed a felony and was fired, it would be the firm filing the disclosure, and we know that these disclosures are fragmentary and incomplete.

We built a model, which I will describe in a little while, and used it to predict, prospectively, who was at high risk of fraud in the next 12 months. Contrary to people's general expectation, only a very small percentage of all brokers do anything fraudulent, so we knew this would be a difficult prediction problem. We rank-ordered all brokers by their predicted probability of fraud and compared this ranking to NASD's own rules. NASD has expert-derived rules that it uses to select a group of brokers for special attention by its examiners; the list of brokers identified that way included about 125 brokers. We took the top 125 brokers on our list and compared the two sets. We took brokers who appeared on both lists, brokers on one list or the other, and some brokers on neither list, and put them in front of human examiners, asking the examiners to say how interested they were in each broker: how much they would have wanted to know about that broker ahead of time. We then compared the scores that the four examiners gave to each case. There were 80 cases, each of which took about half an hour to examine, so this was roughly a person-week of effort per examiner.

Figure 3 shows what we ended up with. Brokers who were on neither list generally received a very low score, near 1. (Five indicates the most interest on the part of an examiner, 1 the least.) Using NASD's rules alone, or our rules alone, derived automatically from the data, we would have essentially the same ability to predict brokers of interest to the examiners. If you combined the rule sets and looked at individuals who were on both lists, you would end up with a set of brokers of even more interest to the examiners.
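As a minimal sketch of the list-comparison step described above, the following compares a model's top-ranked brokers against a rule-flagged set. The score array, flag rate, and list size of 125 are synthetic stand-ins, not NASD data.

```python
# Sketch of the list-overlap comparison; all values here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_brokers = 10_000
model_score = rng.random(n_brokers)            # hypothetical predicted P(fraud)
rule_flagged = rng.random(n_brokers) < 0.0125  # ~125 brokers hit by expert rules

# Top 125 brokers by model score, mirroring the size of the rule-based list.
top_model = set(np.argsort(model_score)[::-1][:125])
top_rules = set(np.flatnonzero(rule_flagged))

both = top_model & top_rules      # the "on both lists" group
only_one = top_model ^ top_rules  # on exactly one list
print(len(both), len(only_one))
```

The intersection corresponds to the "on both lists" group, which the examiners scored as the most interesting.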

FIGURE 3

We also got a little bit of informal, unsolicited feedback from one of the examiners. These are experts who have spent years examining individual brokers, and one of them said, "Wow, if you got this one case right, I'll be really impressed." This examiner had supervised barring the broker in question from the industry; the broker was so bad that he had used fraudulently obtained funds to attend a compliance conference that NASD runs on a regular basis. I wouldn't be telling you about this if their rules hadn't missed him, which was the case. In fact, our model ranked him in the top 100, so he would have been one we recommended the examiners take a look at. That was a nice anecdotal result as well.

Let's back up a minute: how did we do this? What are we in the business of doing? We were looking at relational data, data that tied brokers together based on the firms and branches they worked for. What do we really mean by that? As we have all been discussing here, a lot of traditional work in statistics assumes that you have a single table of data; you assume the instances are independent, and you look for relationships between the variables within a given instance. In contrast, we assume there are multiple tables and many different types of objects, and, importantly, that the statistical dependencies may run between instances in the same table or between tables.
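To make that concrete, here is a hypothetical, much-simplified stand-in for a schema like the one in Figure 2. The types and attributes are invented for illustration; the point is only that instances link to one another, so dependencies can cross table boundaries.

```python
# Hypothetical, simplified relational schema: multiple object types, with
# links that let dependencies run between instances, not just within rows.
from dataclasses import dataclass

@dataclass
class Firm:
    firm_id: int
    size: int

@dataclass
class Branch:
    branch_id: int
    firm: Firm

@dataclass
class Broker:
    broker_id: int
    age: int
    branch: Branch
    is_problem: bool = False

def coworkers(broker, all_brokers):
    """A broker's fraud risk may depend on attributes of these other instances."""
    return [b for b in all_brokers
            if b.branch is broker.branch and b is not broker]

firm = Firm(1, 120)
branch = Branch(10, firm)
b1, b2 = Broker(100, 31, branch), Broker(101, 45, branch, is_problem=True)
print([b.broker_id for b in coworkers(b1, [b1, b2])])  # -> [101]
```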

We might want to say that one broker influences the probability that another broker will engage in fraud, or that something about a broker's firm influences that probability. That is the sense in which the data we are working with are relational.

Our goal is to model attributes. Importantly, we are not trying to predict that a link will occur. We are trying to predict that some attribute or set of attributes holds for individual objects in the data, conditioned on the attributes of other objects and on the relational structure among many object types. We are also after joint models, so that we can do all the neat things you can do with a joint model; for instance, you can detect anomalies. We are also under the strong impression that our inferences about one object should inform our inferences about another. We are doing what we refer to as collective inference: joint inference across all of the instances in our data. We can't do that one instance at a time; we need to conduct inference on large and complex networks.

Also, because of the kinds of applications we work with (NASD is a good example), we want to construct understandable models. The first thing the NASD examiners asked when we first met them was: what does your model really say; how does it do what it does? We were able to put the models in front of them; they looked at them and said, yes, that makes sense. They were also able to say, "Oh, that thing over there . . . ," to which we replied, "Yes, yes, ignore that, that's probably noise," and they said, "No, that is actually something that is interesting to us, but there is a better way of expressing it." They helped us improve our models and our data sets.

So we are talking about large, complex networks over which we want to perform inference. What are the models we are building? Figure 4 gives a high-level view of the model built from the data I have just described; I'll explain it in detail in a minute. Some of the details aren't shown because NASD, for obvious reasons, asked us not to provide too many details of the exact model, although they are surprisingly willing to work with us on publication. In fact, three of the co-authors of a paper that we just presented at the Knowledge Discovery and Data Mining Conference were from NASD, and that paper goes into some detail. What we have here are plates, the boxes, indicating object types; circles indicating variables, or attributes, on those objects; and edges indicating statistical dependencies that run between variables.

FIGURE 4

This is a type of graphical model that I will explain in more detail in a minute: a dependency network. If you want to know what a variable depends on, you look at its parents, that is, at the nodes with arrows leading into the node you are interested in predicting. For example, if you want to know whether a broker is a problem broker, you examine the sources of the arrows leading into the oval labeled "Is problem." You would want to look at whether the broker has had problems in the past and whether they have had bad disclosures in the past. You would also look at the little loop under the oval in question, which asks whether there are known problem brokers in the same branch; there is an autocorrelation dependency, or homophily, among brokers who work in the same branch, so you should look at whether other brokers in that branch have had problems in the past. That is the dependency that goes out to the left side of the yellow box. Finally, there is one that goes up to the year of the individual's previous disclosures. You can look at these edges and get a sense of the overall types of statistical dependencies that exist in your data; this is the high-level summary of a model that has more detail sitting under each of the nodes, in much the same way that Bayesian networks do.

Also, there are black dots in the upper left-hand corners of the four boxes, which indicate dependence on the mere count of objects of those types. For instance, the link between the dot in the blue rectangle and the "on watchlist" oval in the yellow box indicates that the number of disclosures influences whether a broker is on NASD's watch list. The black dot indicates the count of objects of that type, not some variable on them.
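The qualitative structure just described can be written down directly. In the sketch below, the variable names paraphrase the ovals and edges narrated for Figure 4; the actual NASD model is not public, so treat every name as illustrative.

```python
# Sketch of the dependency structure read off Figure 4.  Names paraphrase
# the talk's narration; the real NASD model is deliberately not disclosed.
rdn_parents = {
    "broker.is_problem": [
        "broker.problem_in_past",
        "broker.past_disclosures",
        "broker.year_of_prev_disclosure",
        "branch.coworkers.is_problem",   # the autocorrelation/homophily loop
    ],
    "broker.on_watchlist": [
        "count(broker.disclosures)",     # the 'black dot': a count, not an attribute
    ],
}

# Reading the model: what does "is problem" depend on?
for parent in rdn_parents["broker.is_problem"]:
    print(parent)
```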

What are relational dependency networks? I'll tell you how they are constructed in a minute, but they are a very expressive model in that they can represent cyclic dependencies such as this autocorrelation or homophily. They are efficient to learn: learning is essentially linear in the time needed to learn a conditional distribution, because an RDN is composed of a large number of conditional models. They can perform collective inference, because the inferences about one object can inform the inferences about another; when you perform inference, you do it over the entire graph simultaneously. Finally, they are practical because of some nice properties: you can use component models that you know ahead of time, and you don't have to learn all of the elements of those models. I should point out that the model you just looked at was learned from the data set alone, without any expert going in to turn off or add dependencies. For details you can see two papers that Jen Neville and I recently produced. This work builds on work on propositional dependency networks (networks that assume independence among instances) by David Heckerman and co-authors at Microsoft Research, published in 2000.

The basic flavor of our work aligns with a body of other work in computer science and machine learning on what have come to be called probabilistic relational models, or just relational models. That includes work by Daphne Koller's group at Stanford on what they originally called PRMs, now often called relational Bayesian networks; relational Markov networks, also from Daphne's group and her student Ben Taskar; our own relational dependency networks; and, most recently, work at the University of Washington by Pedro Domingos and his student Matt Richardson on Markov logic networks. Figure 5 lists some of this work.
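Here is a minimal sketch of that composition idea: one conditional model per variable, each learned independently, with the collection serving as the joint model. The synthetic data and the plain decision-tree learner are placeholders for the relational probability trees described later in the talk.

```python
# Sketch of pseudo-likelihood learning for a dependency network: learn each
# variable's conditional given all the others; the set of conditionals *is*
# the model.  Data and learner are placeholders, not the RPT algorithm.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, 4))   # synthetic binary variables
variables = ["is_problem", "on_watchlist", "has_disclosure", "left_firm"]

conditionals = {}
for j, name in enumerate(variables):
    others = np.delete(X, j, axis=1)    # condition on all other variables
    conditionals[name] = DecisionTreeClassifier(max_depth=3).fit(others, X[:, j])

# `conditionals` now approximates the joint distribution through its full
# conditionals, which is exactly what Gibbs sampling consumes at inference time.
```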

FIGURE 5

As you can tell, there is an enormous alphabet soup of different model types here, and there are about 10 more that I haven't mentioned. There has been an explosion of work in the last five years on graphical models for relational data, with many different models being proposed. We are now in the process of trying to sort it all out and figure out the advantages and relative performance of these approaches. Over the last several years we have done work on statistical biases that can show up in these relational learning algorithms and in the collective inference process, the idea that you have a joint model over all of your data and are making simultaneous inferences.

How do these RDNs work? Let's open one more Russian doll. Relational dependency networks are a pseudo-likelihood model; you have heard about some of these previously. The model is constructed by composing a set of conditional models that are each learned independently. This sounds like a really bad idea, because what you would like to do is learn all of the elements of your model jointly. But if you have a sufficiently large data set, the data set performs a coordination function that keeps the individual conditional models consistent enough with each other that you end up with something that looks very much like a coherent joint distribution over the data.

What you do, then, is learn these individual conditional models. You want to learn, for instance, the probability that a broker is a problem broker. You look at a set of characteristics of that broker and the neighborhood surrounding them.

You look at the past behavior of that broker, which may be captured as an individual variable on that broker; at relations to other problem brokers and the attributes of those brokers; and at other things, such as the location and status of the firms the individual has worked for. You need to learn each of those models in a way that is parsimonious and accurate.

FIGURE 6

The types of conditional models we learn are relational probability trees. These are trees where you take, for instance, a broker and the surrounding neighborhood of objects, which we specify ahead of time, drop that into the tree, and it rattles down to one of several leaf nodes. The leaf node gives you a probability distribution over the variable you are trying to predict. Again, this is learned automatically from the data by a separate algorithm.

Part of the tree from our NASD work is shown in Figure 7, enlarged to give you a better sense of it. Is the particular broker older than 27 years of age? The answer is no, so we go down the right-hand part of the tree. Then you ask something about the size of the broker's current firm. You may then go down the right-hand branch and ask about past co-workers: how many past co-workers does the broker have? If it's more than 35, you go down the left-hand branch of the tree and reach a leaf node that says there is a roughly even probability of committing fraud or not committing fraud in the next 12 months.
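A sketch of that walk down the tree, using the splits narrated for Figure 7 (age over 27, a firm-size test, more than 35 past co-workers). The firm-size threshold and all leaf probabilities except the "roughly even" one are invented for illustration.

```python
# Sketch of classifying one instance with a relational probability tree.
# Splits follow the narration of Figure 7; thresholds/leaves marked
# "hypothetical" are invented, since the real tree is not shown in full.
def rpt_predict(broker):
    if broker["age"] > 27:
        return {"fraud": 0.05, "no_fraud": 0.95}   # hypothetical leaf
    # right-hand subtree: the "no" branch of the age test
    if broker["current_firm_size"] > 1000:         # hypothetical threshold
        return {"fraud": 0.10, "no_fraud": 0.90}   # hypothetical leaf
    if broker["n_past_coworkers"] > 35:
        return {"fraud": 0.5, "no_fraud": 0.5}     # the "roughly even" leaf
    return {"fraud": 0.02, "no_fraud": 0.98}       # hypothetical leaf

print(rpt_predict({"age": 25, "current_firm_size": 400,
                   "n_past_coworkers": 50}))
```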

FIGURE 7

How are these relational probability trees learned? They are tree-based classifiers like those the statistics and machine-learning communities have worked on for 30 years; trees are a very common way of constructing a conditional model. We take standard ways of building these trees and confront the problem that we need to start with a very large graph consisting of lots of objects and links. What do we do? As shown in Figure 8, we take pieces of that graph, for instance a broker and his or her surrounding neighborhood, and say: those are our instances. Now you have little subgraphs that you have cut out of the graph, things that are sort of like instances except that they have widely varying structure; some brokers may have many past co-workers, others very few. You need to do some aggregation to summarize that relational structure, which gives you a somewhat different data set. There also needs to be some adaptation of the learning to deal with problems of autocorrelation and some other issues; that is addressed in detail in the paper. Finally, you can construct one of these relational probability trees, do this for every variable in your data, and compose the trees into a single model: the relational dependency network I showed you earlier.
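A sketch of the "cut out pieces of the graph" step, using networkx ego networks as a stand-in. The actual system specifies neighborhoods with QGRAPH queries, described later in the talk; the toy graph here is invented.

```python
# Sketch of turning one big graph into variable-structure "instances":
# one local neighborhood subgraph per broker.  Toy graph for illustration.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("broker1", "branchA"), ("broker2", "branchA"),
                  ("broker3", "branchB"), ("branchA", "firmX"),
                  ("branchB", "firmX")])

# One instance per broker: its 2-hop ego network, which varies in size
# and shape from broker to broker.
instances = {n: nx.ego_graph(G, n, radius=2)
             for n in G if n.startswith("broker")}
print({k: sorted(v.nodes) for k, v in instances.items()})
```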

FIGURE 8

The key insight of this work, and of a lot of work on relational data in the machine learning/CS realm, is the need to do aggregation, and to search over different possible aggregators. What do I want to know about somebody's past co-workers? Do I want their average age? The number of disclosures those past co-workers have had? What are the things that matter? You want to search over that space and then choose the variables that help you most, the ones that let you form a highly accurate conditional distribution.

You can also have problems of what I would call pathological learning; an example is given in Figure 9. You can ignore the details of this particular network; the point is that we started with a very nice relational dependency network, and if we do the wrong kind of learning of the conditional models, we can end up with one that looks like the network in the figure: essentially impossible to interpret, which also makes inference much more complex. The reason is that there are at least two special problems you run into. One has been referred to several times already: autocorrelation. You need to account for autocorrelation when you are doing statistical hypothesis tests and parameter estimation.
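A sketch of the aggregator search: build candidate aggregate features over each broker's neighbors and score each against the class label. The chi-square scoring follows what Jensen describes in the question-and-answer session below; the data here are synthetic.

```python
# Sketch of searching over aggregators (mean, max, count) of a neighbor
# attribute, scoring each candidate feature with chi-square as described
# in the Q&A.  All data are synthetic stand-ins.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(2)
n = 400
coworker_ages = [rng.integers(22, 60, size=rng.integers(1, 20)) for _ in range(n)]
is_problem = rng.integers(0, 2, size=n)

aggregators = {"mean": np.mean, "max": np.max, "count": len}

for name, agg in aggregators.items():
    feature = np.array([agg(ages) for ages in coworker_ages])
    binary = feature > np.median(feature)   # discretize, as in the talk
    table = np.array([[np.sum(binary & (is_problem == y)),
                       np.sum(~binary & (is_problem == y))] for y in (0, 1)])
    chi2, p, _, _ = chi2_contingency(table)
    print(f"{name:>5}: chi2={chi2:.2f}, p={p:.3f}")
```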

FIGURE 9

Figure 10 shows a lot of autocorrelation. Say the green things there are firms and the blue things are brokers, and you are looking at the characteristics of brokers and whether they will commit fraud. Firms tend to have either lots of problem brokers or very few of them, so autocorrelation is a fact of life in many of these data sets.

FIGURE 10
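One simple way to see the kind of autocorrelation in Figure 10 numerically is to correlate an attribute's values across linked pairs of instances (here, brokers sharing a firm). The talk does not pin down a statistic, so this Pearson correlation over linked pairs is just one common choice, on synthetic data.

```python
# Sketch: measure relational autocorrelation as the correlation of an
# attribute across linked pairs (brokers in the same firm).  Synthetic data.
import numpy as np

rng = np.random.default_rng(3)
firm_of = rng.integers(0, 20, size=300)      # broker -> firm id
firm_rate = rng.random(20) * 0.5             # firms differ in problem rate,
is_problem = (rng.random(300) < firm_rate[firm_of]).astype(float)  # as in Fig. 10

pairs = [(i, j) for i in range(300) for j in range(i + 1, 300)
         if firm_of[i] == firm_of[j]]        # linked pairs: same-firm brokers
a = np.array([is_problem[i] for i, _ in pairs])
b = np.array([is_problem[j] for _, j in pairs])
print("relational autocorrelation:", np.corrcoef(a, b)[0, 1])
```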

Another problem is what we call structural dependence: you can often look at the mere structure, the count of something (for instance, how many brokers there are) and make an inference about the firm. There is a correlation between the structure, the degree in this case, and the attributes of objects in the graph. We have papers on both of these issues showing how they can affect both our algorithms and other algorithms in the field.

If you correct for these problems, autocorrelation and structural dependence, you end up with much smaller models. In each of the graphs shown in Figure 11, the far left-hand side is our approach to building a model and the far right-hand side is a standard approach in the community, C4.5. Our approach yields much more parsimonious models that are just as accurate: among all of these models, accuracy did not differ, but size certainly did. Size influences how many edges you end up with in that big network, which affects both its interpretability and the efficiency of inference.

FIGURE 11

You now have this great model, and the question is: now what? We need to apply it to a very large network. Figure 12 shows, notionally, a citation network with papers and authors, connected the way you would expect, with people authoring papers and papers citing each other. We have this network, and we also have attributes on the individual entities: attributes on the papers and attributes on the authors.

What our model says is that if we want to make some inference about the center node, the center variable (what is the topic of that paper?), we need to look at the topics of other papers published by that author, and we may need to look at characteristics of the authors as well. We need to go out into the network, grab the values of those variables, and bring them back. We need to do that for every variable we are trying to predict in the network, so this can obviously be a very intensive process. We also need the entire data set available to us so that we can make these inferences simultaneously, because some of the nodes that are parents of the node we are trying to predict are, in turn, variables whose values we are inferring.

FIGURE 12

We used Gibbs sampling for this inference process, and we have had very good experience with its convergence and other properties. What we end up with is a good ability to predict the value of unknown attributes wherever they reside in the network. Your data set can look essentially like Swiss cheese at whatever level you are trying to make inferences on, and you can fill in those gaps over a wide variety of variables, some known and some unknown.
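A minimal sketch of collective inference via Gibbs sampling: sweep over the unknown variables, resampling each from its conditional given the current labels of its neighbors. The graph and the conditional here are hand-coded placeholders, not learned relational probability trees.

```python
# Sketch of collective inference by Gibbs sampling over a tiny graph.
# The conditional is a hand-coded placeholder, not a learned model.
import random

random.seed(0)
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
labels = {0: 1, 3: 0}                  # observed labels; 1 and 2 are unknown
unknown = [n for n in neighbors if n not in labels]
state = {**labels, **{n: random.randint(0, 1) for n in unknown}}

def p_positive(node, state):
    """Placeholder conditional: lean toward the current neighbor majority."""
    vals = [state[m] for m in neighbors[node]]
    return 0.2 + 0.6 * (sum(vals) / len(vals))

counts = {n: 0 for n in unknown}
for sweep in range(2000):
    for n in unknown:                  # resample each unknown variable in turn
        state[n] = 1 if random.random() < p_positive(n, state) else 0
        if sweep >= 500:               # discard burn-in sweeps
            counts[n] += state[n]

# Estimated marginal P(label = 1) for each unknown node.
print({n: counts[n] / 1500 for n in unknown})
```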

We do this process we call collective inference, in contrast to conventional models (whether graphical models or something else) where you can look at an individual instance, make a prediction about it, and then move on to the next one. Because of the interconnections among instances, you need to make all of these inferences simultaneously. Doing this sort of inference has been shown to improve accuracy substantially in specific settings; for instance, Chakrabarti et al. (1998) showed that collective inference can really improve your ability to determine the topics of Web pages, and it has since been shown more generally. The influence of highly confident inferences can travel substantial distances in the graph. Collective inference exploits a clever factoring of the space of dependencies to reduce variance, thus improving performance over considering all relational attributes, as shown in Jensen, Neville, and Gallagher (2004).

Figure 13 shows another RDN, one we constructed on a citation network in computer science. It shows things that aren't terribly surprising: authors who are well known tend to co-author with other people who are well known (that's the average rank of the author), and the topics of papers tend to be related to the topics of other papers written by the same author. Again, this is not surprising, but it's nice that it fell out of the model without any input from us. The model was learned directly from the data, so those sorts of dependencies give us some confidence that it is actually learning something useful, that it matches our intuitions.

FIGURE 13

Another data set, shown schematically in Figure 14, is the Internet Movie Database.

We have movies, actors, directors, and studios, linked up the way you would expect. Here we learn large numbers of autocorrelation dependencies showing, for instance, that the genre of a movie is likely to be highly correlated with the genre of other movies made by the same director, made by the same studio, or starring the same actors. People tend to stay within genre. That's not a surprise, but again, it is nice to find it coming out of the model.

FIGURE 14

Figure 15 shows a basic gene regulatory network. This one is a bit more difficult for me to interpret, but it seems to make sense, and we have done experiments on simulated data to give us some confidence that the models are actually finding what we expect them to find.

Let me mention a couple of other pieces of work before I close. First, in the course of doing this we needed a query language for graphs. We started out using SQL, and it didn't work particularly well, so we came up with a visual query language for graphs, QGRAPH (Blau, Immerman, and Jensen, 2002), which has been very useful. QGRAPH allows us to submit queries, written in a simple visual syntax, that return entire subgraphs with heterogeneous structure.

FIGURE 15

For instance, Figure 16 illustrates a query we used on the NASD data; I'll blow up a portion of it. At the top, this says, basically: get me all the brokers, together with all of their disclosures. The zero there says get me all the disclosures for this broker, and get me all of the past branches they have worked at. The box in the center of Figure 16 is the equivalent of a sub-query: we want those structures that include the branch, the firm, the regulator of the firm, and all of the past co-workers at that firm, and, by the way, all of the past branches. These queries return structures that can be highly variable in size but have the same basic properties in terms of the kinds of objects they contain. We use these queries both as an ad hoc tool for looking at big graphs and pulling out small pieces of them that are easy to display visually, and as a way of conveying to the algorithm where it should look for possible statistical dependencies.

FIGURE 16

Also, a quick ad for a paper on a very different topic. Jon Kleinberg set this up very nicely by discussing the problem of decentralized search in networks. We are able to construct paths through a network with an algorithm called expected-value navigation, which we recently came up with and published at a conference in the summer of 2005. It does a remarkably good job of combining the information available for searching many of these networks, namely the degree of neighbors as well as the attributes of neighbors, and using that to navigate in large networks.

We have open-source software, PROXIMITY, available for building these relational dependency networks; it implements the query language and other things, and it is available from our website. We released version 4.0 in April, and we have had more than 600 downloads, which is a nice thing. We are pretty sure those aren't just crawlers, but people really downloading it.

I want to emphasize that this work has enormous potential for collaboration among all the kinds of people who are here, certainly people in computer science, but I think we also need connections to people who work in theory, statistics, social networks, and physics.
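A rough sketch of the expected-value navigation idea mentioned above: forward a message to the neighbor judged most likely to connect onward to the target, combining attribute homophily with neighbor degree. The scoring function here is a plausible stand-in, not the exact formula of Simsek and Jensen (2005).

```python
# Sketch of expected-value navigation: score each neighbor by the chance
# that at least one of its edges reaches the target's neighborhood.  The
# homophily model and constants are invented stand-ins.
def match_prob(u_attrs, t_attrs):
    """Homophily: per-edge link probability rises with shared attributes."""
    shared = len(set(u_attrs) & set(t_attrs))
    return min(0.9, 0.05 * (1 + shared))

def evn_score(degree, u_attrs, t_attrs):
    # P(at least one of `degree` independent edges hits the target's region)
    return 1.0 - (1.0 - match_prob(u_attrs, t_attrs)) ** degree

target_attrs = {"boston", "finance"}
neighbors = {
    "a": {"degree": 3,  "attrs": {"boston", "finance"}},
    "b": {"degree": 50, "attrs": {"seattle"}},
    "c": {"degree": 8,  "attrs": {"boston"}},
}
best = max(neighbors, key=lambda n: evn_score(neighbors[n]["degree"],
                                              neighbors[n]["attrs"],
                                              target_attrs))
print("forward to:", best)
```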

FIGURE 17

Figure 17 summarizes my thoughts on our growing research network. There is the potential for some really great collaboration here, and there is a lot of disparate work going on that has crossovers. I have heard a lot of it today and yesterday, really enjoyed it, and certainly gotten a lot of input from it. One example worth noting: Pedro Domingos and Matt Richardson put out a paper in 2001 that won the best-paper award at the Knowledge Discovery and Data Mining Conference, and two years later Jon Kleinberg and co-authors picked up on that line of work, which had to do with viral marketing and some really interesting technical issues. That sort of exchange between two areas of computer science was nice to see; now I hope we can branch out to other areas of science.

Currently we model attributes, but we would like to be able to say that we think links will form between two objects, or that we think there is a missing object or group in the data. Figure 18 lists some of my ideas for potential future research. Sampling issues, which have already been mentioned, include active learning and knowing how to sample while you are building models. There is also the problem of what I would call non-link relations: relations that aren't instantiated as links but are the kinds of things where we say two people are close in age, or they live next to each other. We don't want to have to instantiate all those links; we would like to represent them in other ways. Spatial and temporal models are another set of directions, ones that are more CS-oriented.

FIGURE 18

Finally, if you would like further information, please contact me at jensen@cs.umass.edu or visit kdl.cs.umass.edu.

QUESTIONS AND ANSWERS

DR. BLEI: Hi, I'm Dave Blei from Carnegie Mellon. I didn't quite see what the criterion was for determining dependence between two random variables. Maybe you could step us through, say from actor to genre, how one decides to put an arc there.

DR. JENSEN: Structural learning in RDNs is merely learning conditional models and looking at the variables that appear in those conditional models. So, if I look at a model that is X conditioned on everything else, and the model says Y1 and Y3 are in the model for X, I draw in that dependency.

DR. BLEI: How do we determine the independence of the variables? That's really the question.

DR. JENSEN: Oh, how do you determine independence? Graph separation.

DR. BLEI: I have some data. It's a big table, and I ask: is row A dependent on row B? What is the test for doing that?

DR. JENSEN: When you are building the relational probability tree, those are rows you are talking about, not columns.

DR. BLEI: So, you are building a probability model of a set of random variables?

DR. JENSEN: Let's say variables here are columns.

DR. BLEI: When you have an arc between two random variables, does that mean one is dependent on the other?

DR. JENSEN: Yes.

DR. BLEI: So, my question is just how to measure whether or not two variables are dependent on each other, since you are learning that arc.

DR. JENSEN: When we are learning the relational probability trees, we are looking at individual variables as possible predictors of our dependent variable, and we score that with chi-square in this case. These are discrete variables; if we have continuous variables, we discretize them. We use chi-square as our scoring function for individual conditional dependencies, which are learned in the conditional models. Say we select a feature as a split in a tree; to use a simple one, "the age of the broker is greater than 27" is the root node of our tree, and what we are predicting is whether the broker is a problem. That results in an arc in our RDN from broker age to broker-is-a-problem.

DR. BANKS: I wasn't quite sure that I followed exactly what the SQL-style search for graphs would give you. Could you say a few more words about how you would instantiate an inquiry?

DR. JENSEN: Certainly. You are asking what the query language does for you.

DR. BANKS: What types of questions would you ask that could not otherwise be asked? I could ask who is in my social network that is not in Ed Wegman's social network, or something like that, but that is not what you are going after.

DR. JENSEN: Figure 19 illustrates a very simple query: get me all the red things that are associated with two or more structures of the following kind, a blue thing associated with three or more green things. You could write that as an SQL query if you wanted to, but what you would get back are rows, each containing a red thing, a blue thing, and a green thing. What we want instead is to identify the subgraphs on the left and the right center as matches while skipping past subgraphs like the one in the lower right (which does not qualify because it has only two green things on its left end).
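A sketch of the count-constrained matching just described, using networkx: find red nodes having two or more blue neighbors that each have three or more green neighbors, and return whole subgraphs rather than flat rows. QGRAPH itself is visual, so this only mimics the semantics of the example; the toy graph is invented.

```python
# Sketch of the red/blue/green query semantics: return matching subgraphs,
# not flat rows.  Toy graph for illustration; QGRAPH itself is visual.
import networkx as nx

G = nx.Graph()
colors = {"r1": "red", "b1": "blue", "b2": "blue", "b3": "blue",
          "g1": "green", "g2": "green", "g3": "green",
          "g4": "green", "g5": "green", "g6": "green"}
G.add_edges_from([("r1", "b1"), ("r1", "b2"), ("r1", "b3"),
                  ("b1", "g1"), ("b1", "g2"), ("b1", "g3"),
                  ("b2", "g4"), ("b2", "g5"), ("b2", "g6"),
                  ("b3", "g1")])          # b3 has too few green neighbors

def matches(G, colors):
    for r in (n for n in G if colors[n] == "red"):
        good_blues = [b for b in G[r] if colors[b] == "blue"
                      and sum(colors[g] == "green" for g in G[b]) >= 3]
        if len(good_blues) >= 2:
            # Yield the whole matching subgraph, not a flat row.
            nodes = {r, *good_blues,
                     *(g for b in good_blues for g in G[b]
                       if colors[g] == "green")}
            yield G.subgraph(nodes)

for sub in matches(G, colors):
    print(sorted(sub.nodes))
```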

FIGURE 19

DR. GROSS: It's a hard thing to explain.

DR. JENSEN: It is subtly different from SQL in the sense that you are getting back these subgraphs, and the grouping constructs have some subtle differences, although what we are finally doing now is writing out the algebra and showing where it differs ever so slightly from SQL. What is nice about the language as a user is that it is visual. You draw these things, which is much more pleasant than writing the convoluted SQL we had to write to get these whole structures out. It is also easier to convey to people when you sit down and draw it out, because the drawing looks vaguely like the results you are trying to get. It's certainly a lot better than trying to explain SQL to someone who doesn't know the language.

DR. GROSS: [Remarks off mike.]

DR. JENSEN: We have an implementation, which I would definitely call a prototype, in PROXIMITY 4.0; it was also in the previous version of the system. We regularly query graphs on the order of 1 million to 2 million nodes. Very soon we will have the algebra complete, which will let you implement the language easily in whatever database system you choose. We implemented it on top of SQL in a previous version, and now on top of a vertical database called Monet. You wouldn't learn much from our implementation unless you were using this particular, rather unusual database, which we use for efficiency reasons. However, the algebra should provide a really good starting point for a fairly straightforward implementation, and many of the query optimizations that work for SQL will work for this as well. The algebra is only slightly different.

REFERENCES

Blau, H., N. Immerman, and D. Jensen. 2002. A Visual Language for Querying and Updating Graphs. University of Massachusetts Amherst Computer Science Technical Report 2002-037.

Chakrabarti, S., B. Dom, and P. Indyk. 1998. Enhanced Hypertext Categorization Using Hyperlinks. Proceedings of the ACM SIGMOD International Conference on Management of Data. Pp. 307-318.

Dietterich, T. 2003. "Relational revolution": new work in the CS areas of machine learning and knowledge discovery.

Domingos, P., and M. Richardson. 2001. Mining the Network Value of Customers. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, CA: ACM Press. Pp. 57-66.

Getoor, L., N. Friedman, D. Koller, and A. Pfeffer. 2001. Learning Probabilistic Relational Models. In Relational Data Mining (S. Dzeroski and N. Lavrac, eds.). Berlin: Springer-Verlag.

Jensen, D., and J. Neville. 2002. Linkage and Autocorrelation Cause Feature Selection Bias in Relational Learning. Proceedings of the 19th International Conference on Machine Learning.

Jensen, D., J. Neville, and M. Hay. 2003. Avoiding Bias When Aggregating Relational Data with Degree Disparity. Proceedings of the 20th International Conference on Machine Learning.

Jensen, D., J. Neville, and B. Gallagher. 2004. Why Collective Inference Improves Relational Classification. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Kempe, D., J. Kleinberg, and E. Tardos. 2003. Maximizing the Spread of Influence Through a Social Network. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Neville, J., and D. Jensen. 2000. Iterative Classification in Relational Data. In L. Getoor and D. Jensen (eds.), Papers of the AAAI-2000 Workshop on Learning Statistical Models from Relational Data. Menlo Park, CA: AAAI Press.

Neville, J., and D. Jensen. 2003. Collective Classification with Relational Dependency Networks. Proceedings of the 2nd Multi-Relational Data Mining Workshop, 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Neville, J., and D. Jensen. 2004. Dependency Networks for Relational Data. Proceedings of the 4th IEEE International Conference on Data Mining.

Richardson, M., and P. Domingos. 2004. Markov Logic: A Unifying Framework for Statistical Relational Learning. Proceedings of the ICML-2004 Workshop on Statistical Relational Learning and Its Connections to Other Fields. Banff, Canada: IMLS. Pp. 49-54.

Simsek, Ö., and D. Jensen. 2005. Decentralized Search in Networks Using Homophily and Degree Disparity. Proceedings of the 19th International Joint Conference on Artificial Intelligence.

Taskar, B., E. Segal, and D. Koller. 2001. Probabilistic Classification and Clustering in Relational Data. Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-01), Seattle, WA.

Taskar, B., P. Abbeel, and D. Koller. 2002. Discriminative Probabilistic Models for Relational Data. Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI-02), Edmonton, Canada.
