KEYNOTE ADDRESS: GRAPH MINING—DISCOVERY IN LARGE NETWORKS

TRANSCRIPT OF PRESENTATION

MR. PREGIBON: I guess I want to think of the words to apologize. I am listed as a keynote speaker, but I think any of the presentations we heard this morning, and probably this afternoon and tomorrow, would qualify for this lunchtime keynote speech. The only reason I can think that it is marked as keynote is maybe it has caught the funder's eye the most. So, with NSA being the major funding agency, this is probably an area that they are quite interested in. That is my excuse for why someone has named this a special talk. Otherwise, I think what we will see this morning, this afternoon and tomorrow is that this large data problem, these large data streams, come in many different shapes and forms. So, what you are going to hear is something completely different than what you heard this morning. This isn't big science. This isn't science at all. In fact, I am a bit jealous. It is nice to have science where you can at least hope to explain what is going on in some phenomenon in your large data. I am not sure that what I am going to talk about today is going to make any contribution to big science. The only thing I could think of is, I might donate my laptop to the scientific community, because I think it holds some new physics in it. It is very sensitive to the characteristics of electrons and, if I just touch it the wrong way, it is going to die and reboot. It only weighs seven pounds. So, it is not a couple-ton accelerator, but I really think there are some secrets in here that the physics community might be able to unlock.

Let me get on to the topic. We are going to talk about network data. I will give you a brief introduction to the sort of network data I want to talk about. Then I want to go into a little bit of a spiel where I show you some network data. The reason I am doing that is basically to guide how we think about such data, and you are going to find that, throughout the talk, there aren't a lot of statistics per se, but I think you will be able to see how statistical thinking guided a lot of the ideas and the things we migrated to. By virtue of describing this network data, the idea of a dynamic graph pops up. So, we will talk a little bit about defining a dynamic graph, necessarily how we approximate it, and sometimes how we post-process it and then use it in some applications. So, let me jump right in to say what the topic of my talk is, versus other sorts of network data that might exist. Generally speaking, with regard to networks, you can think of static versus dynamic. I will think of network topology as a somewhat static beast. Even though the number of switches in a telephone network, and the number of pieces of copper or fiber connected to them, do change through time, they are relatively static. They are, in many cases, physical objects. In the case of Web pages, maybe they don't have a physical identity, and Web links come and go, but they don't come and go as fast as other sorts of things. By that I mean that what I am going to be talking about is actual traffic on a physical network. That is the sort of data that makes up the topic of my talk. So, the things of interest here are the entities that are using the physical network. They are using this physical network to conduct something—communication, finance, commerce, it could be transportation. These transactions involve other entities. They could be voice or data transactions, investment transactions, or retail purchases. These things are totally out of the control of the person collecting the data.

Now, the data that are associated with these transactions are not the content of the transactions. Largely speaking, they are just a few bytes of information that describe some salient characteristics of the transaction: the date and time stamp at which it took place; possibly, encoded in the metadata, the location of the transaction; definitely in my case, and I think in many of these cases, some identifiers for the transactors; and often some element of size. It might be the number of bytes involved in the transaction, or the number of minutes the transaction lasted. In a financial or commerce application it could be the size in dollars of the transaction. So, it may not say what was purchased, but you would know the dollar amount associated with it. Now, these data aren't new. They have been around forever, and they have been around for a darned good reason, because people make money from them: these are the records that are used for billing. That is the thing that actually paid, and continues to pay, for the collection of such data. What we are going to talk about is trying to use these data. This morning we talked about Level 0, Level 1 and Level 2 data and, for this talk, you might think of these as Level 0 data. In fact, with the fields as I have described them, they are probably Level 1. The Level 0 data that we get off the network are pretty ugly. They come in a raw, unstructured form called AMA format, which we process and put into a different format called Level 1; we have our own name for it. As we go up the food chain, what we try to do is add value as we go up the levels. With large data sets you do have to filter your data and, if you can think of adding value as you go up this level stack, that is kind of what we are talking about, or one of the messages I want to get across to you in my talk.

Maybe the lesson here is that each record has just a few bytes of information: who talked to whom, when, and for how long. So, individually they are pretty trivial. If you put together a couple hundred million of those a day, though, and you do that day in and day out, you can get a very good picture of what is going on in your network. That is kind of the message here. I used graphs in the title of my talk and, for anyone who is associated with network data, this is a no-brainer. Graphs are a convenient representation of the data and, in the applications that I am going to be talking about, we are talking about call graphs. So, for every node in the network, the identifier is a phone number, and the edges we are talking about are phone calls, or aggregate summaries of phone calls, between telephone numbers. Some of the questions you might ask of such data: are there interesting nodes in your network? How close are two nodes to each other, not in geography, but in the topology, where the network is not the physical network but the communication network? Other questions that are interesting to me concern the temporal nature of the network: is the node that I am looking at today behaving similarly to the way it behaved in the past? Then you might actually talk about subgraphs of the large graph, and how you might capture and classify them. So, to try to motivate some of the ways we think about the data, let me show you some actual data.

It is going to be from what we will call a low-volume service. We get data all the time. So, it is not quite an accelerator application, but people are making phone calls and we bill them for those calls, so we have a continuous stream of data. This is going to be a stream of data from one of our services. I picked it up for 22 weeks. What I am going to do is just give you an idea of the size, shape and scope of network data. Let me just go through a series of graphs, and hopefully you will be able to see some of these features. So, this is a scatter plot and, for each of the 154 days in the study period, I have a point. Here is a point, there is a point, and there are a bunch of points. Each point contains the information on the number of nodes that were active in this service on that day, and the associated number of edges for that day. So, generally, on a weekday, we have roughly half a million transactors on our network, each transacting with about two others. So, there are about a million edges a day, and about half a million nodes a day. On the weekends, we see something quite different: maybe only a third of a million nodes, and half a million edges. There are some outliers that correspond to special events, holidays, and so on. So, this gives you an idea of how many transactors we are seeing on the network and how many transactions, and it also gives you a little hint of the sparsity of the graph. If you have a graph that has n nodes, there are roughly n² possible edges. This is hardly n². I mean, it is a factor of two. So, the graphs that we are talking about are very, very sparse, and that is a good thing, or can be a good thing.

This gets at the sparsity in a slightly different way. This is a plot of the number of edges versus the number of days on which they were observed. It is a typical plot you might see in an analysis of text and other things. Most edges occur only once: over a million of these edges you see only once, and about half a million of the edges you see only twice over the study period. Way down here, you see some of the edges every day. So, this gives you a little sense of the temporal nature and how frequently you observe things. These are the same data presented slightly differently, showing the cumulative frequency. So, 95 percent of the edges occurred on six or fewer days of a 154-day study period. That gives you an idea of the episodic nature with which these edges come and go.

This is a different way to view the data. This plot is about the lifetime of the edges. So, we have an edge and, if we have only seen it once, its lifetime is a day. Some edges we have seen twice, so we ask, did we see them today and tomorrow, or did we see one the first day of the study period and then the second time we saw it was the last day of the study period? What this display is meant to convey is that roughly 75 percent of the edges came and went within a week. Eighty percent of the edges came and went within about two weeks. Eighty-five percent came and went within a month and, for 90 percent of them, the span between the first time we saw them and the last time was about two months, which is less than half the study period. So, these things come and go, you may never see them again, and their half-lives are generally very short. This is one of my favorite plots. Someone criticized me for saying it is my favorite, because it takes so long to explain, and with a good plot you shouldn't have to explain it, but let me make a crack at it. What I want to get at in this plot is just the fact that nodes come in and go out through time, and the numbers are pretty staggering. What I did was take the entire study period and get the complete list of network identifiers. There were about 6.5 million of them. Then I said, okay, let's start at day one: how many are unique on day one? Well, they are all unique on day one, because that is where you are starting. On the second day, how many new ones did you see? On the third day, how many new ones that you hadn't seen the first couple of days, and so forth. After a while you get past this initial starting period and you reach a sort of steady state. I have fitted a trend line to this steady state up here. The slope of that line corresponds to the fact that every day we see 16,000 new nodes in our graph. Then I did the same thing in reverse. I said, okay, we have all these nodes, and now the steady state is going to be the reverse process. For how many nodes was the first day of the study period the last day I saw that node? For how many nodes was the second day the last day I saw it? I fitted a trend line to the beginning of that, because the truncation for that process occurs over here. Generally speaking, the slope of that trend line says we are losing 27,000 nodes a day. So, this is not a good business to be in. Again, what we are trying to capture is that nodes come in and go out, and there are lots of them. These are not trivial numbers. So, you have to take this into account.

AUDIENCE: [Question off microphone.]

MR. PREGIBON: They haven't converged yet. The question was, do these lines converge? My response is that they haven't converged yet.

AUDIENCE: [Question off microphone.]

MR. PREGIBON: The second question was, what would happen if I ran this for more weeks. I believe I would see the same slopes, and I think the lines would be closer to intersection than they are right now. This business, by the way, I don't think is that proprietary: it is our calling card business. Calling cards are dying out by virtue of cell phones and prepaid cards. So, this is a business that we are essentially harvesting. We are not investing in it, people aren't using it, and this is the rate at which that business is, you know, going in the toilet. So, the final thing is a similar plot—I guess it is not the final one, there is another plot after this—the same sort of story on the edges. What is a little interesting to me is that the attrition of edges is nearly linear. I can't explain that. This is the addition of edges. The numbers here are also big. The slopes of these lines say that, in the steady state, we are seeing 300,000 new edges a day that we have never seen before, and we are losing 400,000 edges a day that we will never see again. That gives you an idea of both the node set and the edge set. This is the final graph in this set. It speaks toward, again, the connectivity or the sparseness of these data. I have done it as a cumulative plot, and I think I have tried to capture the 90th percentile. So, 90 percent of the nodes on this graph have an in-degree of eight or fewer edges, and I think 90 percent of the nodes on this graph have an out-degree of 13 or fewer edges, and this is over the entire study period.

So, that is, again, a bit of background: when you are in the business of studying graphs or traffic or transactions on a network, that is the type of volatility that you have to deal with. Now, let's get into how those data motivated the structures we use to capture this graph. I keep saying "this graph" as if there were a single graph. It is changing all the time. So, therein lies the problem. We have to come up with a definition of what we need. So, I will introduce a Microsoft operator; I found it in the symbol table. It looks like a little plus sign. That is a graph operator. Basically, all we are going to say is that we can take two graphs—graph one and graph two—and combine them, or take a linear combination of them, such that the resulting graph will have the union of all the nodes and edges of graphs one and two, but we are just going to take a linear combination of the weights along those edges to define a new graph, graph three. I will show you how that is used on this viewgraph.
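
In symbols, the operator just described might be written as follows (a reconstruction from the verbal definition, not notation taken from the slides): for G₃ = a·G₁ ⊕ b·G₂,

```latex
% Linear combination of two edge-weighted graphs:
% nodes and edges are unioned, weights combine linearly.
\[
V(G_3) = V(G_1)\cup V(G_2), \qquad E(G_3) = E(G_1)\cup E(G_2),
\]
\[
w_{G_3}(e) \;=\; a\,w_{G_1}(e) + b\,w_{G_2}(e),
\qquad\text{where } w_{G_i}(e)=0 \text{ if } e\notin E(G_i).
\]
```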

Again, there are many different ways to define what we mean by the network graph. I am putting most of these up here as straw men to say, none of these are the ones we chose, but they are all legitimate choices for legitimate purposes. You can just define the graph to be the network traffic you see in some period, say, today. That could be the graph you are interested in. Given the volatility that we see, with 300,000 new edges today that we didn't see before and 400,000 going away, that graph may be viewed as too narrow for some applications, of too high variance. The complement of that is to say, don't throw anything away: let your graph just grow over time, don't delete anything, just keep adding stuff to it. That, possibly, for some applications, is too broad. If you are using this graph for inferential purposes, you may not want traffic from a year ago to affect inferences on your graph today. For some applications you may want exactly that; for some of the things we do, we don't. Something closer to what we want is this: we want the graph to be defined as what has happened recently on the network. You can think of this as simply a moving window. So, we have traffic coming over the network, and we are going to window it. The only problem we have with this is that, to maintain it, we have to maintain too many little graphs. In order to update this graph tomorrow, we have to throw this guy away and then add another one on the other end. Then, depending on how wide this window is—if it is a month—we are going to manage 30 graphs all the time. So, it is not a surprise where I am going with this. We like this windowing operation, but we don't like to pay for the storage and maintenance of all these graphs. One way to get the benefit of windowing is to simply do an exponential average, where we take a convex combination of the graph as it stood through yesterday and then just add in the graph for today, using a factor θ which is going to guide how wide a window you are looking at for your application. Again, just doing the simple algebra, you can expand the recursion and basically—this isn't anything new—for most people who study this type of data, I think this would be a natural definition. The benefits: from the recursion, you see that recent data have the most influence. From a computational point of view, it is great, because only one graph needs to be stored. You have yesterday's graph, you fold today's data in, and you immediately have the new graph for today.
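
As a concrete illustration, here is a minimal sketch of that exponential update in Python, representing a graph as a plain dictionary of edge weights. The representation and names are mine, not the speaker's production code:

```python
from collections import defaultdict

def exp_update(g_prev, g_today, theta):
    """One day's exponentially weighted graph update:
    G_t = theta * G_{t-1} (+) (1 - theta) * g_t.
    Graphs are dicts mapping (src, dst) pairs to weights;
    the result carries the union of both edge sets."""
    g_new = defaultdict(float)
    for edge, w in g_prev.items():
        g_new[edge] += theta * w          # decay yesterday's graph
    for edge, w in g_today.items():
        g_new[edge] += (1.0 - theta) * w  # fold in today's traffic
    return dict(g_new)

# Only one graph ever needs to be stored:
g = {}                                        # accumulated graph
g = exp_update(g, {("me", "amy"): 60.0}, theta=0.85)
```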

That parameter, θ, is adjustable. I will just show you some values of θ that we use for different applications. If we set θ to about .85 and you follow that curve down, for our applications, with daily processing, that means an hour-long phone call will last in our analysis for roughly a month. So, it won't wash out for a month. For other values of θ, and I think that curve is a θ of .4, if we set the value to .4, that one-hour call will wash out in about a week. Again, it is completely adjustable. It really depends on your application. In telecom, when phone numbers go out of existence they are reassigned, and typically they can be reassigned within a month. So, we typically use a 30-day window and a θ of .85, so that we don't blend activity of the new user of a phone number with activity of the previous owner.
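
As a worked check on those numbers (my arithmetic, not a figure from the talk): under the recursion above, one day's traffic of weight w₀ contributes

```latex
\[
w_n \;=\; (1-\theta)\,\theta^{\,n}\,w_0
\]
```

after n further days. With θ = .85, θ³⁰ ≈ 0.008, so an hour-long call retains less than 1 percent of its initial influence after a month; with θ = .4, θ⁷ ≈ 0.002, so it washes out within a week.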

So, let's talk about the graph again. That captures the dynamic nature of our graph; now, how are we going to represent it? One way that we like to represent the graph—and you will see the applications that motivate this—is in a constructive sense: we are going to think of it as a union of subgraphs where, for every node in the network—this is important here, and we will get to it again later—we keep the set of edges coming into it and the set of edges going out of it. That defines a diameter-one subgraph centered on that node, and we are going to do that for every node. Then the graph, by definition, is just the union of all those subgraphs. Now, there is a big penalty here, because I have stored every edge twice. If I call Amy up, that edge is going to be in my subgraph, and it is also going to be in Amy's subgraph. So, this is pretty redundant. The main reason we do this is that it allows for fast expansion of subgraphs centered on every node. By that I mean, in a lot of the applications we are interested in not just who called me and who I called; we want to take that out a level: for every one of the people I called, who called them and who did they call? By building this redundant structure, it is very easy to go from one level down to the next. So, it is very easy to traverse this data structure to build subgraphs of arbitrary depth, literally within a second. That is the main reason why we like this constructive representation.
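
The payoff of the redundant per-node storage is that expanding a subgraph is just repeated dictionary lookups. A hedged sketch, with a storage layout of my own choosing (one dict of out-edges and one of in-edges per node):

```python
def expand_subgraph(center, out_edges, in_edges, hops=2):
    """Expand the subgraph around `center` out to `hops` levels.
    `out_edges` and `in_edges` map every node to a dict of
    {neighbor: weight} -- the per-node edge sets kept for each
    node in the network.  Returns the edges reached, keyed
    (src, dst)."""
    frontier, visited, sub = {center}, set(), {}
    for _ in range(hops):
        next_frontier = set()
        for node in frontier - visited:
            visited.add(node)
            for nbr, w in out_edges.get(node, {}).items():
                sub[(node, nbr)] = w
                next_frontier.add(nbr)
            for nbr, w in in_edges.get(node, {}).items():
                sub[(nbr, node)] = w
                next_frontier.add(nbr)
        frontier = next_frontier
    return sub
```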

We don't use that, though, in practice. What we actually do is approximate the graph, and the nature of the approximation builds off this constructive definition. So, we are going to approximate our large graph by sacrificing something. Now, I am not going to sacrifice information on the node set; I want information on every node in my graph. What I am going to sacrifice is some of the edges. I am only going to retain edges in my graph that are big enough or important enough. That can be defined by the user, but I am going to truncate my edge set: say, I am only going to keep the k edges with the largest weights coming in, and the k edges with the largest weights going out. Now I can define my subgraph, centered on every node, and it is going to be approximate. Then I can define the big graph to be the union of all these subgraphs. Now, through sleight of hand, this graph isn't necessarily redundant. The reason for that is asymmetry. Amy might be very popular. When she calls me, I might fall out of her subgraph because she only spoke to me for a minute, whereas she talks to her family members and friends for quite long. I might be unpopular, so her edge may stick around in my subgraph for a long time. So, even though there is a possibility of redundancy in this representation, it may not occur. The other thing we try to do in our approximation is the following. We threw away edges, but we are going to try to maintain the total weight of the subgraph by keeping the discarded weight in a sort of funny node that we keep for each node in the graph, which we call slot "other." Every node will have two slot "others" that catch the carry-over, and let me show you how this works. This is, let's say, my subgraph; I have labeled it "me" in the middle. Bill Gates is responsible for rotating some of these names up here. I just typed them in, and I must have twisted something when I put in my circles, but when I went to type in the names, that is what it wanted to do and I couldn't undo it. So, if Bill Gates is responsible for doing it, it is my responsibility for not knowing how to undo it. Anyway, this is my subgraph. Now, suppose I wanted to do a top-four approximation of it. What this means is that I want to maintain only four of the nodes coming in and four of the nodes going out. I am going to do that by dropping the nodes with the smallest weights. If we look around here, it looks like these guys have the smallest weights. So, when my boss calls, or when Paul calls me, those nodes are going to be left out and, when I call Bill, I don't call him enough, so I am going to whack that node.

What I have done is replace those nodes with two other nodes, just labeled "other." So, they are not going to carry the names associated with who made those calls, or the number of calls that were in there. If I tried to mimic Amy's animation, I could do that. So, you can see how I am collapsing several nodes down to one, and then replacing them with the node "other."

AUDIENCE: [Question off microphone.]

MR. PREGIBON: So, the question is, what are the weights. The weights typically would be associated with some characteristic of the transaction. The weight might be the dollar value of a transaction, the length of a phone call, or the number of bytes of the connection between this IP address and that IP address. So, sorry for the confusion. Anyway, the top-four approximation of my subgraph maintains only the top four in and out directions, plus overflow nodes whose labels I have thrown away but whose weight I have maintained. So, my approximate subgraph has every node of the network in it, and it has the same total weight as the original subgraph. All I have done is prune some of the edges. So, that is the beast we have chosen to work with.

AUDIENCE: [Question off microphone.]

MR. PREGIBON: So, the question was that it is not really the top four, because of the temporal nature of the data: I could have had something in my top four that slid off the top four but is now coming back and would get back into my top four. You are absolutely right, and that is the segue into this viewgraph: how do we update this thing?
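
A minimal sketch of that top-k truncation, with the discarded weight folded into the catch-all "other" slot so the node's total weight is preserved (names and numbers here are illustrative, not from the talk):

```python
def top_k(edges, k):
    """Keep the k heaviest edges of one direction of a node's
    edge set; fold the remaining weight into an 'other' slot."""
    ranked = sorted(edges.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:k])
    overflow = sum(w for _, w in ranked[k:])
    if overflow > 0:
        kept["other"] = kept.get("other", 0.0) + overflow
    return kept

calls_in = {"amy": 42.0, "mom": 30.0, "pat": 7.0,
            "lee": 5.0, "boss": 2.0, "paul": 1.5}
print(top_k(calls_in, 4))
# {'amy': 42.0, 'mom': 30.0, 'pat': 7.0, 'lee': 5.0, 'other': 3.5}
```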

If something falls off, it will go into "other," and it has to fight its way back into the top four. The way it works, at the atomic level, is that we update this graph by updating all of the subgraphs. We have the node's edge set from yesterday and we have today's calls; we take the convex combination of the weights and then, once we have done that, we do the sort, and only the top k nodes in that sort are retained. If you were in my top four and you then fall into "other," I have got to keep calling you in order for you to fight your way back into my top four. So, the approximate graph through time is defined by that operation.
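
Combining the two sketches above, the per-node daily update might look like the following. I am assuming here that the "other" slot's weight is decayed like any real edge and carried over, which the talk does not spell out:

```python
def update_node(old_edges, today_edges, theta, k):
    """Atomic daily update of one direction of a node's top-k
    edge set: decay yesterday's weights, fold in today's calls,
    re-sort, and re-truncate.  An edge that slid into 'other'
    has to fight its way back into the top k."""
    carry = theta * old_edges.get("other", 0.0)
    merged = {n: theta * w
              for n, w in old_edges.items() if n != "other"}
    for n, w in today_edges.items():
        merged[n] = merged.get(n, 0.0) + (1.0 - theta) * w
    kept = top_k(merged, k)          # top_k as sketched above
    if carry > 0:
        kept["other"] = kept.get("other", 0.0) + carry
    return kept
```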

So, the final thing I will talk about is something that we do day in and day out. As the data stream in, we construct a network topology of the data and maintain it. You can think of this as kind of a database. Basically, we are building what you might call a materialized view of the raw transactions, and this view is the network view of the data, one that can be queried very, very fast. Again, you put in a seed node and, within a second, you can get out a subgraph surrounding that node. When you extract that subgraph, you may want to enhance it. This is just a fact of life, and reflects the fact that your data really aren't everything that you would like. For instance, there is something we brought on ourselves: we threw away some stuff. So, we might want to think about ways that, at least when we analyze the data, we account for the fact that we threw some stuff away. Then, there are two cases where we would have liked other stuff, but we didn't get it. In the good old days, AT&T was a monopoly, so we had all the network transactions and our graph was complete. Today, we are not a monopoly, so there are edges that we don't observe, because those calls aren't carried on our network. Now, in building these graphs, for security's sake, you may want to have all these edges in, even though your data may not have observations on them. So, that is a potential problem. Another one is that you may be capturing or intercepting data at one level in the network stack, and there are communications going on beneath that level that you won't see. In my application, AT&T is primarily a long-distance company, so I will see long-distance phone calls. If I call my parents in Ohio, and my brother, who lives in my same town, calls them, the AT&T network data will see calls going to my parents from each of our numbers. They won't see me calling my brother, because we live in the same town and that is a local call. You can think of it the same way in IP space. There are routers in different parts of the Internet. You may see traffic crossing some of the routers, but you are not going to see stuff below. If you are really trying to build up network connectivity for all of these network elements, that may be important to you. We are exploring a couple of approaches to enhancing the graphs, and I will show you examples of both. One is strictly algorithmic, the other probabilistic. So, the algorithmic one first. What we try to do here is capture some of the richness that our data don't bring to us, that we would like to bring back in. In a deterministic fashion, we will add some edges, and then we will do some pruning to get rid of junk. Again, through a cartoon: this might be my diameter-two subgraph, again showing the category "other" and, for the nodes I called or that called me, who they conversed with.

I may add some edges in. So, for some reason or other, these edges weren't in my data, but I want to add them in. Then, after I have added them in, I probably have too much junk in there, and I will get rid of some of it. One algorithm we use to get rid of things is just running strongly connected components, or some other fast graph algorithm, to basically prune stuff out. Maybe I should say something here that is obvious to anyone who has looked at these graphs. Graphs are like classification trees: people think of them as simple and easy to interpret. Classification trees, like graphs, are easy to interpret if they are small. Classification trees, regression trees and graphs are really hard to interpret if they are big. So, pruning is almost necessary in every application, because these things get big. We seldom go out more than a radius of two in our graphs; you just bring in too much junk. Even at diameter two, you have to prune away junk.
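
One way such a pruning pass might look, using networkx for the graph algorithm. The library choice and the keep-only-the-seed's-component policy are my assumptions, not necessarily what the speaker's system does:

```python
import networkx as nx

def prune_to_component(edges, seed):
    """Prune an expanded subgraph down to the strongly connected
    component containing the seed node, discarding the junk that
    a diameter-two expansion drags in."""
    g = nx.DiGraph()
    for (src, dst), w in edges.items():
        g.add_edge(src, dst, weight=w)
    for comp in nx.strongly_connected_components(g):
        if seed in comp:
            return {(s, d): w for (s, d), w in edges.items()
                    if s in comp and d in comp}
    return {}
```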

The second way we add stuff into our graphs is using probability models. The class of models we are using here is log-linear models, which are quite nice because they give nicely interpretable parameters, both for the graph overall and at the node level. So, if you wanted to compute these models for every node, you could then cluster the parameters to cluster your nodes. It is a very flexible type of model. The way we account for the fact that we are missing data is to put priors on some of these parameters, and you can decide which ones you want priors on. Then, once you have those, you can use an EM-type algorithm to estimate the parameters, and this is one of the things we are experimenting with. Generally, we are at the point now where, for about a 200-node diameter-two subgraph in the R language, we can compute these models in about two minutes. That is an interpreted language, so we can't do it for a large number of subgraphs. If we rewrote it in C, we think we could get it down to about two seconds per subgraph, but we are still not satisfied with that for the level of modeling we are doing. The other nice thing about these models is that, as statisticians, we like to think about the parameters of the model, but there are also the ultimate fitted probabilities. In the model that I put up before, we are actually modeling the probability of an edge. You can think of using the output of the statistical model as a way to prune and enhance your graph.
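
The talk does not give the model's form. One common log-linear parameterization, with per-node sender and receiver effects fit by simple gradient ascent on the Bernoulli likelihood, might look like the sketch below; it omits the priors and the EM step for unobserved edges that the speaker mentions:

```python
import numpy as np

def fit_edge_model(adj, n_iter=500, lr=0.5):
    """Fit logit P(edge i -> j) = mu + a_i + b_j to a 0/1
    adjacency matrix: mu is an overall density term, a_i and
    b_j are per-node out- and in-effects.  Returns the fitted
    edge probabilities, which can be thresholded to prune or
    probabilistically add edges."""
    n = adj.shape[0]
    mu, a, b = 0.0, np.zeros(n), np.zeros(n)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(mu + a[:, None] + b[None, :])))
        resid = adj - p                # d(loglik)/d(logit)
        mu += lr * resid.mean()
        a += lr * resid.mean(axis=1)
        b += lr * resid.mean(axis=0)
    return 1.0 / (1.0 + np.exp(-(mu + a[:, None] + b[None, :])))
```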

For instance, this is actually a very small subgraph around me; I think there are only six nodes in it. So, these are the six nodes, and this might be my observed data. If I were going to draw a graph of those data, this might be the thresholded version that I use for making the plot. If I fit a model to the data, these might be the probabilities associated with the edges under that model and, if I threshold those, this might be the graph I plot. So, this is the way we are going to use it: observe data, add in edges probabilistically, and then prune back to something that we hope we can understand, and that captures the salient features of the data. Let me, in the last five minutes or so, go over the applications and give you an idea of the volumes we are talking about. I mentioned before, we are a long-distance company, so these are the data from the long-distance network. We see about 350 million edges a day. That is a reasonably big number by this morning's discussion. The bigger number is the number of nodes that we see through time, nearly half a billion. So, it is a pretty big graph when you look at it through time. In our work, we tend to use the top-nine approximation. That is for historical reasons; we have been changing it over the past year. I will give you an idea about the sizes of these materialized views. Each direction, inbound and outbound, is about 7 gigabytes. It takes us about two hours to process the data daily. If we want to scan each of these maps and compute something for each, it takes about 20 minutes. It is easily under a second to expand a diameter-two subgraph for any node. When we want to do the enhancement, flavor it and then render it on someone's Web site, it is basically about 10 seconds from "give me the information" to seeing the subgraphs. Those are some of the operating characteristics that we are dealing with.

Okay, so without further ado, let's talk about the applications. The primary use of these tools is in fraud detection. I will outline one type of fraud that we deal with; those in the audience with similar problems probably have their own versions of this, which aren't necessarily fraud. Customers subscribe for service. We like that. They don't pay us; we don't like that. There are a lot of reasons why this happens. Some of it is our own doing. It is a competitive market, we work hard to get customers, and we don't screen them, maybe, as well as we should. Other times, people out there who want to be bad are pretty good at being bad. They will steal identities. No matter how well we check their credentials, they come up good, because it is a perfectly good credit record that this individual has.

The only problem is that it is not that person. Then the problem gets exacerbated because, not only did we give them service, we basically gave it to them for a couple of months. They use the service for 30 days and we send them a bill. Thirty days later, we find out they didn't pay the bill. So, we start reminding them, gently at first and more vigorously later. Eventually we lose our patience but, literally, that takes 60 to 90 days. So, it is not just getting service; it is service over an extended period. This is something we don't like. In a security operation, you may not want bad guys floating around for 60 or 90 days without your knowing about them. You would like to know about them within minutes, days or weeks, rather than 60 or 90 days. So, part of the plan is this: bad guys don't work in isolation, or you count on the fact that they don't. There is going to be some clustering of bad behavior in your network, and we exploit this to come up with ways to describe a suspect's affinity group, and to rank suspects by who they are associated with. So, for instance, I think I have outlined a node up here. This is a suspect. We can compute their diameter-two subgraph and add some value to it. I won't go into detail on what the nodes of different shapes and sizes mean; think of some as cellular handsets and others as land lines, with the thickness and color of the lines depicting the weight or strength of the association. Color is the main thing here: any node that is colored was associated with fraud very recently.

So, this particular phone number that was a suspect has, in its community of interest, five other bad guys associated with it. This is what we want to bring to the attention of the security staff: this would be someone good to investigate. The way we do this is, for every new account we see on our network for a particular type of service, we compute its subgraph seven days after we first sight it. Seven days is, again, just some made-up number, a waiting period. You want the calling circle to mature and, depending on the level of activity, it could be one day or six days, but seven is what we settled on. We do some enhancement of those subgraphs and then threshold them. We go to a database and look for recent fraud cases, and we color the subgraphs accordingly. We rank them and put them on a Web site, and the security associates then just go through, look at this, and do their investigations. This is the result of what we saw from a trial when we first developed this. On this axis is the number of nodes in a diameter-two subgraph that are associated with previous fraud cases. This was the outcome of new cases post-investigation. The number plotted at each position is the number of cases. So, there were 13 cases that we presented to security that had six bad guys in the community of interest, and about 45 percent of those turned out to be fraud. Out here, there were six cases that we presented that had about nine bad guys in them, and over 80 percent of those turned out to be bad. As soon as you get beyond that, it is almost 100 percent. If you are surrounded by badness, by golly, you are bad, no doubt about it.
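
In outline, the lead-generation loop just described might be coded as below. `subgraph_nodes` stands in for the diameter-two expansion and `recent_fraud` for the fraud-case database; both are hypothetical names of mine:

```python
def rank_new_accounts(new_accounts, subgraph_nodes, recent_fraud):
    """Score each matured new account by how many known-bad
    numbers sit in its community of interest, and rank the
    leads for the security associates."""
    leads = []
    for acct in new_accounts:
        bad = len(subgraph_nodes(acct) & recent_fraud)
        if bad > 0:
            leads.append((bad, acct))
    leads.sort(reverse=True)
    return leads
```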

Where the action is, though, is down here, where you don't see many bad guys. This is the mess we were cleaning up when we started this. Now, when you see one or two bad guys around a new person, that is a big indicator that things are bad. So, by rolling this out and having active investigation, we were able to really clean up the mess. The second thing I will talk about, and I will close on this, is tracking bad guys. You can think about this as account linkage. People who are bad, as I said, are good at it. Just like customers, they exhibit loyalty. If you have got a good product, they are going to be loyal. Bad guys are the same way: if you have got a bad product, they will be loyal to you. If you let them abuse your network for a year without catching them, they will come back tomorrow and abuse it for another year, because you are not hassling them. In fraud detection, versus some of the security applications, we don't have to be perfect. We just have to be better than the competition. The fraud is not going to go away; it is going to go to some network. We just don't want it to be ours. So, the nature of the game is to really hassle these guys. We want to hassle them but, again, they have their own tricks. Identity theft is not quite a science yet, but there are some people who are darned good at it. So, you know, you can knock them off your network and, because they don't cost you as much as you would like, they will come back on your network. If you are dumb enough not to know they are back, they are going to abuse you again.

So, it is the same old game except, you know, burned once, you don't want to get burned again. So, even though the node identifier—the phone identifier on your network—is new, is it possible that the person behind that identity is the same? We are after them. Basically, we want to exploit the fact that you are who you call. You may be burning your cell phone or a prepaid card at a fast rate but maybe your friends and family, your girlfriends or whatever, are not doing the same thing. So, their lifetime in your network has a different half-life than yours, and we want to exploit that fact. What we do is keep a library of subgraphs for these baddies and then, as new guys come along, we do a graph matching. We want to say, are any of these newbies likely to be any of the ones we have caught before? So, it is a big problem. Again, it is a graph-matching problem, and they don't necessarily come back on your network the same day that you bumped them off. So, you are comparing a pool of baddies with a pool of new customers, trying to do graph matching and then ranking these pairs to present to security. Again, there is a lot of engineering involved. And what do you mean by graph matching? Basically, you are taking a graph and a subgraph, overlapping them, and asking how good the overlap is. This is just something we have come up with. It mimics, I think, a little bit of what goes on in text analysis, some of the scoring methods that are used there. You want to account for the fact that big weights associated with edges are good but, if a node is very common, it is not very discriminatory and you want to down-weight it. The fact that both the newbie and the old guy call Lands' End isn't very informative. If they both called a number in Yemen, maybe that is informative, if no one else called that number. So, that is what this is getting at; I am not going to go into details.
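
A hedged sketch of such a score: overlap two calling circles on their shared neighbors, weighting each shared neighbor by the product of edge weights, down-weighted IDF-style as in text retrieval. The exact scoring function is mine; the talk only describes its flavor:

```python
import math

def match_score(nbrs_old, nbrs_new, call_freq, n_nodes):
    """Score how well a newbie's calling circle overlaps a known
    baddie's.  `nbrs_*` map neighbor -> edge weight; `call_freq`
    maps neighbor -> how many distinct nodes contact it
    network-wide.  Shared calls to a very common number count
    for little; a shared rarely called number counts for a lot."""
    score = 0.0
    for nbr, w_old in nbrs_old.items():
        w_new = nbrs_new.get(nbr)
        if w_new is not None:
            idf = math.log(n_nodes / (1.0 + call_freq.get(nbr, 0)))
            score += w_old * w_new * idf
    return score
```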

I guess this is a cartoon—actually, it is a real case. Anything that is in pink or red is in the overlap of the two subgraphs. The network IDs for the subgraphs we are overlaying are both in here. Anything that is green is associated with only the top number here, anything in blue with the bottom one. This just shows that these two identities, which appeared on our network at different times, quite likely have the same person behind them. Quickly, just to show you the results of implementing this in the investigative sense: we gave a bunch of leads to our security associates, based on the deciles of predicted match, and then looked through time to see how many of the cases that were presented turned out to be true matches, that is, that the pairs we presented were actually matched pairs. It just shows we are reasonably well calibrated. This would lie on a 45-degree line under perfect calibration. It shows that, when we think it is a match, then generally, after investigation, it is a match. The thing that, to us, makes this so powerful again speaks to a number of things that we are seeing today.

Generally speaking, we are getting tens of thousands of new customers on our network a day, and tens of thousands of baddies of different sorts, not paying their bills or whatever, a day. So, the fact that we are able to distill this down to fewer than a thousand things for our fraud team to investigate is a big crunching down. In a physical sense, as a physicist, you have got all of this data: where do you look? These tools are there to guide our "physicists," to show them, this is where you have got to look. I am just going to close by saying this is ongoing work, there is a lot going on, and thank you for your patience and your time. Thanks.

MS. KELLER-McNULTY: While we are getting John hooked up for the next session, are there a couple of questions?

AUDIENCE: [Question off microphone.]

MR. PREGIBON: The question is, do I ever use spectral techniques for these graphs. I think possibly what you mean there are some of the hubs-and-authorities type computations?

AUDIENCE: [Comments off microphone.]

MR. PREGIBON: No, we have not. We should probably talk off line. There is some mathematics associated with the analysis of graphs in the literature, with eigen-analysis. Those sorts of spectral methods are used to characterize and find special nodes on the graph. For people who are familiar with the work of Jon Kleinberg from Cornell, his hubs-and-authorities work is of that ilk.
