__Abstract of Presentation__

__Transcript of Presentation and PowerPoint Slides__

BIOSKETCH: Daryl Pregibon is head of the Statistics Research Department at AT&T Shannon Research Labs. His department is responsible for developing a theoretical and computational foundation of statistics for very large data sets. He has been with the Labs for over 20 years. He has interest in and has made contributions to the three main areas of statistics: modeling, data analysis, and computing. His specific contributions include data analytic methods for generalized linear and tree-based models, incorporating statistical expertise in data analysis software, and designing and building application-specific data structures in statistical computing. He is very active in data mining, which he defines as an interdisciplinary field combining statistics, artificial intelligence, and database research.

Dr. Pregibon received a PhD in statistics from the University of Toronto in 1979 and an MS in statistics from the University of Waterloo in 1976. He is a fellow of the American Statistical Association and has published over 50 articles in his field. He was coauthor of the best applications paper, “Empirical Bayes Screening for Multi-item Association in Large Databases,” at Knowledge Discovery and Data Mining 2001 (KDD2001) and the best research paper, “Hancock: A Language for Extracting Signatures from Data Streams,” at KDD2000. He is the past chair of CATS (Committee on Applied and Theoretical Statistics, National Academy of Sciences). He was co-chair of KDD97 and has been either a special advisor or member of the KDD program committees for the past 3 years. His is co-founder of SAIAS (Society for Artificial Intelligence and Statistics). Currently he is a member of CNSTAT (Committee on National Statistics, National Academy of Sciences), a member of the SIGKDD Executive Committee, a member of the Steering Committee of IDA (Intelligent Data Analysis), and a member of the Editorial Board of Data Mining and Knowledge Discovery.

Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.

Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
Daryl Pregibon
Keynote Address: Graph Mining—Discovery in Large Networks
Abstract of Presentation
Transcript of Presentation and PowerPoint Slides
BIOSKETCH: Daryl Pregibon is head of the Statistics Research Department at AT&T Shannon Research Labs. His department is responsible for developing a theoretical and computational foundation of statistics for very large data sets. He has been with the Labs for over 20 years. He has interest in and has made contributions to the three main areas of statistics: modeling, data analysis, and computing. His specific contributions include data analytic methods for generalized linear and tree-based models, incorporating statistical expertise in data analysis software, and designing and building application-specific data structures in statistical computing. He is very active in data mining, which he defines as an interdisciplinary field combining statistics, artificial intelligence, and database research.
Dr. Pregibon received a PhD in statistics from the University of Toronto in 1979 and an MS in statistics from the University of Waterloo in 1976. He is a fellow of the American Statistical Association and has published over 50 articles in his field. He was coauthor of the best applications paper, “Empirical Bayes Screening for Multi-item Association in Large Databases,” at Knowledge Discovery and Data Mining 2001 (KDD2001) and the best research paper, “Hancock: A Language for Extracting Signatures from Data Streams,” at KDD2000. He is the past chair of CATS (Committee on Applied and Theoretical Statistics, National Academy of Sciences). He was co-chair of KDD97 and has been either a special advisor or member of the KDD program committees for the past 3 years. His is co-founder of SAIAS (Society for Artificial Intelligence and Statistics). Currently he is a member of CNSTAT (Committee on National Statistics, National Academy of Sciences), a member of the SIGKDD Executive Committee, a member of the Steering Committee of IDA (Intelligent Data Analysis), and a member of the Editorial Board of Data Mining and Knowledge Discovery.

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
Abstract of Presentation
Graph Mining: Discovery in Large Networks Daryl Pregibon (with Corinna Cortes and Chris Volinsky), AT&T Shannon Labs
Large financial and telecommunication networks provide a rich source of problems for the data mining community. The problems are inherently quite distinct from traditional data mining in that the data records, representing transactions between pairs of entities, are not independent. Indeed, it is often the linkages between entities that are of primary interest. A second factor, network dynamics, induces further challenges as new nodes and edges are introduced through time while old edges and nodes disappear.
We discuss our approach to representing and mining large sparse graphs. Several applications in telecommunications fraud detection are used to illustrate the benefits of our approach.

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
Transcript of Presentation
MR. PREGIBON: I guess I want to think of the words to apologize. I am listed as a keynote speaker, but I think any of the presentations we heard this morning, and probably this afternoon and tomorrow probably would qualify for this lunchtime keynote speech. The only thing I can think that it is marked as keynote is maybe it has caught the funder’s eyes the most. So, with NSA being the major funding agency, this is probably an area that they are quite interested in. That is my excuse for why someone has named this a special talk. Otherwise, I think what we will see is this morning, this afternoon and tomorrow, there are many different shapes and forms for this large data problem and large data streams.
So, what you are going to hear is something completely different than what you heard this morning. This isn’t big science. This isn’t science at all. In fact, I am a bit jealous. It is nice to have science that you can at least hope to explain what is going on, in some phenomenon in your large data.
I am not sure that what I am going to talk about today is going to make any contribution to big science. The only thing I could think of is, I might donate my laptop to the scientific community because I think it holds some new physics in it. It is very sensitive to the characteristics of electrons and, if I just touch it the wrong way, it is going to die and reboot. It only weighs seven pounds. So, it is not a couple ton accelerator, but I really think there are some secrets in here that the physics community might be able to unlock.

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
Let me get on to the topic. We are going to talk about network data. I will give you a brief introduction of what sort of network data I want to talk about. Then I want to go into a little bit of a spiel where I want to show you some network data. The reason I am doing that is to basically guide how we think about such data, and you are going to find out that, throughout the talk, there aren’t a lot of statistics per se in the talk, but I think you will be able to see how statistical thinking guided a lot of sort of the ideas and the things we migrated to. By virtue of describing this network data, ideas of a dynamic graph pop up. So, we will talk a little bit about defining a dynamic graph, necessarily how we approximate that, and sometimes how we post-process that and then use it in some applications.
So, let me jump right in to say what the topic of my talk is versus other sorts of network data that might exist. So, generally speaking, with regard to networks, you can sort of think of static versus dynamic. I will think of network typology as a somewhat static beast. Even though the number of switches in a telephone network and the number of pieces of copper or fiber connected to them, they do change through time, they are relatively static.
They are, in many cases, physical objects. In the cases of Web pages, maybe they don’t have a physical entity or identity, and Web links come and go, but they don’t come and go as fast as other sorts of things. By that I mean, the thing that I am going to be talking about is actual traffic on a physical network. That is the sort of data that makes up the topic of my talk. So, the things of interest here are the entities that are using the physical network. They are using this physical network to conduct something—communication, finance, commerce, it could be transportation.
These transactions involve other entities. They could be voice data transactions, investment transactions, and retail purchases. These things are totally out of the sort of control of the person collecting the data.

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
Now, the data that are associated with these transactions isn’t the content of the transaction. Largely speaking, they are just a few bites of information that describe some salient characteristics of the transaction, such as the date and time stamp that it took place, possibly encoded in the metadata is the location for the transaction, maybe something—definitely in my case, and I think many of these cases—some identifiers for the transactors, and often some element of size. It is the number of bytes involved in the transaction, the number of minutes the transaction lasted. Maybe in a financial or a commerce application it could be the size in dollars of the transaction. So, it may not say what was purchased, but you would know the dollar amount associated with it.
Now, traditionally, these data aren’t new. They have been around forever, and they have been around for a darned good reason, because people make money, or this captures records that are used for billing. This is the thing that actually paid, and continues to pay, for the collection of such data. What we are going to talk about is trying to use these data—this morning we talked about Level 0, Level 1, Level 2 data, and for this talk, you might think of these as Level 0 data.
In fact, as the fields that I have described them, they are probably Level 1. The Level 0 data that we get off the network are pretty ugly. It is a raw, unstructured form called AMA format that we process and put into a different format called Level 1. We have our own name for it.
As we go up the food chain, I think what we try to do is add value as we go up the levels. So, this talk is almost on, as you go up, and with large data sets, you do have to filter your data. If you can think of adding value as you go up this level stack, that is kind

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
of what we are talking about, or one of the messages I want to get across to you in my talk.
Maybe the lesson here is that each record has just a few bytes of information, who talked to who, when and how long. So, they are pretty trivial. If you put together a couple hundred million of those a day, and you do that day in and day out, you can get a very good picture of what is going on in your network. So, that is kind of the message here.
I think I used graphs in the title of my talk, and anyone who is associated with network data, this is a no-brainer. Graphs are a convenient representation of the data and, in the applications that I am going to be talking about, we are talking about call graphs.
So, every node in a network, the network identifier is a phone number, and the edges we are talking about are phone calls or aggregate summaries of phone calls between telephone numbers.
Some of the questions you might ask of such data, are there interesting nodes in your network? How close are two nodes to each other, not in geography, but in sort of the network topology, where the network is not the physical network, but it is the communication network. Other sorts of nodes or other questions that are interesting to me concern the temporal nature of the network. Is the node that I am looking at today behaving similarly to the way it behaved in the past. Then you might actually talk about subgraphs of the large graphs and how you might capture them and classify them.
So, to try to motivate some of the ways we think about the data, let me show you

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
some actual data. It is going to be from what we will call a low-volume service. We get data all the time. So, it is not quite an accelerator application, but people are making phone calls and we bill them for their phone calls. So, we have a continuous stream of data.
This is going to be a stream of data from one of our services. I picked it up for 22 weeks. What I am going to do is just give you an idea of the size, shape and scope of network data. Let me just go through a series of graphs, and hopefully you will be able to see some of these features.
So, this is a plot. It is a scatter plot and, for each of the 154 days in the study period, I have a point. Here is a point, there is a point, and there are a bunch of points. That point contains the information on the number of nodes that were active in this service on that day, and the associated number of edges for that day.
So, generally, on a weekday, we have roughly half a million transactors on our network, and they are transacting with two others. So, there are about a million edges a day, and about a half million nodes a day. On the weekends, we see it quite different, maybe only a third of a million nodes, and half a million edges.
There are some outliers that correspond to special events, holidays, etc. So, this gives you an idea, a little bit, of how many transactors we are seeing on the network, how many transactions, and also gives you a little hint of the sparsity of the graph.
If you have a graph that has n nodes, roughly, there are n2 edges. This is hardly n2. I mean, it is a factor of two. So, the graphs that we are talking about, and are going to talk about, are very, very sparse, and that is a good thing, or can be a good thing.

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
This gets at the sparsity in a slightly different way. This is a plot of the number of edges versus the number of days on which they were observed. This is a typical plot you might see in an analysis of text and other things. Most edges occur only once. So, over a million of these edges, you only see once, over about half a million of the edges you only see twice over the study period. Way down here, you see some of the edges every day.
So, this gives you a little sense of the temporal nature and how frequently you observe things. These are the same data presented slightly differently, and it just shows you the cumulative frequency.
So, 95 percent of the edges have occurred on six or fewer days, of a 154-day study period. So, that gives you an idea of the episodic nature with which these edges come and go.

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
This is a different way to view the data. This talk is about the lifetime of the edges. So, we have an edge and, if we have only seen it once, its lifetime is a day.
Some edges we have seen twice. So, we say, have we seen them today and tomorrow, or did we see one the first day of the study period, and the second time we saw it was the last day of the study period.
What this display is meant to convey is that roughly, say, 75 percent of the edges came and went within a week. Eighty percent of the edges came and went in about two weeks. Eighty-five percent came and went within a month, and 90 percent of them, the first time we saw them and the last time we saw them was about two months, which is less than half the study period. So, these things come and go and you may never see them again, and their half lives are generally very short.
This is one of my favorite plots, and someone criticized me for saying it is my favorite one because it takes so long to explain, and a good plot, you shouldn’t have to explain it, but let me make a crack at it. What I want to get at in this plot is just the fact that nodes come in and go out through time, and the numbers are pretty staggering.
What I did was take the entire study period, and I got the complete list of network identifiers. There were about 6.5 million of them. Then I said, okay, let’s start at day one, or how many are unique on day one. Well, they are all unique on day one, because that is where you are starting. On the second day, how many new ones did you see? On the third day, how many new ones that you hadn’t seen the first couple of days and so forth.
After a while you sort of get over this initial starting period and you get to a sort of steady state. I have fitted a trend line to this steady state up here. The slope of that corresponds to the fact that every day we see 16,000 new nodes in our graph. Then, I did the same going in reverse. I said, okay, we have all these nodes and now the steady state is going to be the reverse process.
When was the first day of the study period? For how many nodes was that the last day I saw that node? The second day of the study period, for how many nodes was that the last day that I saw it, and fitted a trend line to the beginning of that, because the truncation for that will occur over here. Generally speaking, the slope of that trend line is that we are losing 27,000 new nodes a day. So, this is not a good business to be in.
Again, what we are trying to capture is, nodes come in and go out and there are lots of them. It is not trivial numbers. So, you have to take this into account.

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
AUDIENCE: [Question off microphone.]
MR. PREGIBON: They haven’t converged yet. The question was, do these lines converge? My response was they haven’t converged yet.
AUDIENCE: [Question off microphone.]
MR. PREGIBON: The second question was, what would happen if I had run this for more weeks. I believe what would happen, I would see the same slopes, and I think the lines would be closer to intersection than they are right now.
This business, by the way, I don’t think it is that proprietary. It is our calling card business. Calling cards are buying out by virtue of cell phones and prepaid cards. So, this is a business that we are essentially harvesting. We are not investing in it, people aren’t using it, and this is the rate at which that business is, you know, going in the toilet.
So, the final thing is a similar plot—I guess it is not the final one, there is another plot after this—the same sort of business on the edges. What is a little interesting to me is that the attrition of edges is nearly linear. I can’t explain that. This is the addition of edges.
The numbers here are also big. The slopes of these lines, we are seeing 300,000 new edges a day that we have never seen before in the steady state, and we are losing 400,000 edges a day that we will never see a again. That gives you sort of an idea of both the node set and the edge set.
This is the final graph in this set. This is speaking toward, again, the connectivity or the sparseness of these data. I have done it in a cumulative plot and I think I have tried

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
to capture the 90th percentile. So, 90 percent of the nodes on this graph have an in degree of eight or fewer edges, and I think 90 percent of the nodes on this graph had an out degree of 13 or fewer edges, and this is over the entire study period.
So, that is, again, a bit of a background for, when you are in this business of studying graphs or traffic or transactions on a network, that is the type of volatility that you have to deal with.
Now, let’s get into how that data motivated sort of the structures we used to capture this graph. I keep saying this graph as if there is a graph. It is changing all the time. So, therein lies the problem. We have to come up with a definition of what we need.
So, I will introduce a Microsoft operator. I found that in the symbol table. It looks like a little plus sign. That is a graph operator.
Basically, all we are going to say is that we can take two graphs—graph one and graph two—and combine them or take a linear combination of them, such that the resulting graph will have the union of all the nodes and edges of graphs one and two, but that we are just going to take a linear combination of the weights along those edges to find a new graph, or graph three.
I will show you how that is used on this viewgraph. Again, there are many different ways to define what we mean by the network graph. I am putting most of these up here as a straw man to say, none of these are the ones we chose, but these are all

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
I may add some edges in. So, for some reason or other, these edges weren’t in my data, but I wanted to add them in.
Then, after I added them in, I probably have too much junk in there and I will get rid of them. One algorithm we use to get rid of things is just running strongly connected components or some other fast graph algorithm, to basically prune out stuff.
Maybe I should say something here that is obvious to anyone who has looked at these graphs. Graphs are like classification trees. People think of them, oh, they are simple, they are easy to interpret. Classification trees, like graphs, are easy to interpret if they are small. Classification trees, regression trees and graphs are really hard to interpret if they are big. So, pruning is almost necessary in every application, because these things get big. We seldom go out more than a radius of two in our graphs. You just bring in too much junk. Even at diameter two, you have to prune away junk.

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
The second way we add stuff into our graphs is using probability models. The class of models we are using here are log-linear models, which are quite nice, because they give nice interpretable parameters, both for the graph overall, and also parameters at the node level. So, if you wanted to think about computing these models for every node, you can then cluster your parameters, to cluster your nodes. It is a very flexible type of model.
The way we account for the fact that we are missing data is to put some priors on some of these parameters, and you can decide which ones you want priors on. Then, once you have those, you can use an N type algorithm to estimate the parameters, and this is one of the things we are experimenting with.
Generally, we are at the point now for about a 200 node diameter two subgraph in the R language, we can compute these models in about two minutes. That is an interpretive language. So, we can’t do it for a large number of subgraphs. If we rewrote that in C, we think we could get it down to about two seconds per subgraph, but we are still not satisfied with it at the level of modeling that we are doing.
The other nice thing about these models, as a statistician, we would like to think about the parameters of the model. There are also the ultimate fitted probabilities. So, in the model that I put up before, we are actually modeling sort of the probability of an edge. You can think of then using the output of the statistical model as a way to prune and enhance your graph.
For instance, this is actually, I think, a very small subgraph around me. I think there are only six nodes in this subgraph. So, these are the six nodes, and these might be

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
my observed data. If I was going to draw a graph of that data, this might be the threshold version of data that I use for making that graph.
If I fit a model to that data, this might be the probabilities associated with the edges on that model and, if I threshold this thing, this might be the graph I plot. So, this is a way that we are going to be using to observe data, add in edges probabilistically, and then prune them back to something that we hope we can understand, and that captures the salient features of the data.
Let me, in the last five minutes or so, go over the applications and give you an idea for the volumes we are talking about. I mentioned before, we are a long-distance company. So, this is the data from the long-distance network.
We see about 350 million edges a day. That is a reasonably big number by this morning discussion. The bigger number is the number of nodes that we see through time, nearly half a billion nodes through time. So, it is a pretty big graph when you look at it through time. In our work, we tend to use the top nine approximation. That is for historical reasons. We have been changing that over the past year.
I will give you an idea about the sizes of these materialized views. Each direction, inbound and outbound, are about 7 gig. It takes us about two hours to process the data daily. If we wanted to scan each of these maps and compute something for each, it takes about 20 minutes. It is easily under a second to expand a diameter two subgraph for any node.
When we want to do the enhancement, flavor it and then render it on someone’s Web site, it is basically about 10 seconds from “give me the information” to seeing the subgraphs. Those are some of the operating characteristics that we are dealing with.

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
Okay, so without further ado, let’s talk about the applications. The primary use of these tools is in fraud detection. I will outline one type of fraud that we deal with, and those in the audience with similar problems, they probably have similar versions of this. They aren’t necessarily for fraud.
Customers subscribe for service. We like that. They don’t pay us, we don’t like that, and there are a lot of reasons why this happens. Some of it is our own doing. It is a competitive market. We work hard to get customers. We don’t screen them, maybe, as well as we should. Other times, people out there who want to be bad are pretty good at being bad. They will steal identities. No matter how good we check on their credentials, it comes up good because this is a perfectly good credit record that this individual has.

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
The only problem is that it is not the person.
Then, the problem gets exacerbated, because not only did we give them service, we basically gave it to them for a couple of months. They use the service for 30 days, we send them a bill. Thirty days later, we find out they didn’t pay the bill. So, we start reminding them, gently at first and more vigorously later. Eventually, we lose our patience, but literally, 60 to 90 days. So, it is not just getting service, but it is service over an extended period. So, this is something we don’t like.
In a security operation, you may not want bad guys floating around for 60 or 90 days without you knowing about them. You would like to know about them within minutes, days, weeks, rather than 60 or 90 days.
So, part of the plan we have is, you know, bad guys don’t work in isolation, or you count on the fact that they don’t. So, there is going to be some clustering of bad behavior in your network, and we will exploit this to come up with ways to describe this affinity group, and just to rank suspects by who they are associated with.
So, for instance, I think I have outlined a node up here. This is a suspect. We can compute their diameter two subgraph, add some value to it and then—I won’t go into detail on what the nodes of different shapes and sizes mean.
Think of some as cellular handsets and others as land lines, thickness and color of the lines depicting the weight or the strength of that association. Color is the main thing here. Any node that is colored was associated with fraud very recently. So, this

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
particular phone number that was a suspect, in their community of interest, has five other bad guys associated with it. This is what we want to bring to the attention of the security staff, that this would be someone good to investigate.
So, the way we do this is, for every new account we see in our network for a particular type of service, we will compute their subgraph seven days after we sight it. Seven days is, again, just some made up number, a waiting period. You want the calling circle to mature, depending on the level of activity. It could be one day, it could be six days, but seven is what we settled on.
We will do some enhancement of those subgraphs and then threshold them. We will go to a database and look for recent fraud cases. So, we will color those subgraphs. We will rank them and put them on a Web site and the security associates then just go through and look at this, and do their investigations.
This is a result of what we saw from a trial when we first developed this. On this axis is the number of nodes in a diameter subgraph that are associated with previous fraud cases. This was the outcome of new cases post investigation.
The number plotted at each plotting position is the number of cases. So, there were 13 cases that we presented to security that had six bad guys in the community of interest, and about 45 percent of those turned out to be fraud.
Out here, there were six cases that we presented that had about nine bad guys in it. Over 80 percent of those turned out to be bad. As soon as you get beyond that, it is

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
almost 100 percent. If you are surrounded by badness, by golly, you are bad, no doubt about it.
Where the action, though, is down here where you don’t see. This is the mess we were cleaning up when we started this. Now, when you see one or two bad guys around a new person, that is a big indicator that things are bad. So, by rolling this out and having active investigation, we are able to really clean up the mess.
The second thing I will talk about, and just close on this, is tracking bad guys. You can think about this as account linkage. People who are bad, as I said, they are good at it. Just like customers, they exhibit loyalty. If you have got a good product, they are going to be loyal. Bad guys are the same way. If you have got a bad product, they will be loyal to you.
If you let them abuse your network for a year without catching them, they will come back tomorrow and abuse it for another year because you are not hassling them. Fraud detection versus some of the security things, we don’t have to be perfect. We just have to be better than the competition. The fraud is not going to go away. It is going to go to some network. We just don’t want it to be ours.
So, the nature of the game is to really hassle these guys. So, we want to hassle them, but again, they have their own tricks. So, identity theft is not quite science yet, but there are some people that are darned good at it.
So, you know, you can knock them off your network and because they don’t cost

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
you as much as you would like to, they will come back on your network. If you are dumb enough not to know they are back, they are going to abuse you again. So, it is the same old game except, you know, burned once, you don’t want to get burned again.
So, even though the node identifier—so, the phone identifier on your network is new, is it possible that the person behind that identity is the same. So, we are after them. Basically, we want to exploit the fact that you are who you call. You may be burning your cell phone or a prepaid card at a fast rate, but maybe your friends and family, your girlfriends or whatever, they are not doing the same thing.
So, their lifetime in your network has a different half-life than yours, and we want to exploit that fact. So, what we do is, we keep a library of subgraphs for these baddies and then new guys come along and we want to do a graph matching.
We want to say, are any of these newbies likely to be any of these ones we have caught before. So, it is a big problem. Again, it is a graph matching problem, and they don’t necessarily come back on your network the same day that you bumped them off. So, you are collecting a pool of baddies with a pool of new customers, trying to do graph matching and then ranking these pairs to present to security.
Again, there is a lot of engineering involved, and what do you mean by graph matching? So, basically, you are taking a graph and a subgraph, overlapping them and saying, you know, how good is the overlap. You know, this is just something we have come up with. It mimics, I think, a little bit of what goes on in text analysis, some of the scoring methods that are used there. You want to account for the fact that big weights

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
associated with edges are good, but if a node is very common, it is not very discriminatory, you want to down weight it.
The fact that both the newbie and the old guy call Lands’ End isn’t very informative. If they both called a number in Yemen, maybe that is informative, if no one else called that number. So, that is what this is getting at.
I am not going to go into details. I guess this is a cartoon—it is actually a real case of this. Anything that is in pink or red are the overlaps on two subgraphs. The network IDs for the subgraphs we are overlaying are both in here. Anything that is green is associated with only the top number here, anything in blue the bottom one. This just shows that these two identities that are appearing on your network at different times are quite likely the same person behind it.
Quickly, just to show you the results of implementing this in the investigative sense, giving a bunch of leads to our spot associates, based on the deciles of predicted match, and then looking through time to say how many of those cases that were presented turned out to be true matches, and that the pairs that we presented were actually matched pairs.
It just shows we are reasonably well calibrated. This thing would be perfectly on a 45 degree, on a perfect calibration. It is showing that, when we think it is a match, generally, after investigation, it is a match. The thing that, to us, makes this so powerful, again, speaks to a number of things that we are seeing today.

OCR for page 136

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
Generally speaking, we are getting tens of thousands of new customers on our network a day, tens of thousands of baddies of different sorts, not paying their bills or whatever, a day. So, the fact that we are able to distill this down to less than a thousand things for our fraud team to investigate, it is a big crunching of this down.
In a physical sense, as a physicist, you have got all of this data. Where do you look? So, these tools are to guide our physicists to show, this is where you have got to look.
I am just going to close things by saying this is ongoing work, a lot going on, and thank you for your patience and your time. Thanks.
MS. KELLER-MC NULTY: While we are getting John hooked up for the next session, if there are a couple of questions?
[Question off microphone.]
MR. PREGIBON: The question is, do I ever use spectral techniques for these graphs. I think possibly what you mean there is some of the Hudson Authority type computations?
[Comments off microphone.]
MR. PREGIBON: No, we have not. We should probably talk off line. There is some mathematics associated with analysis of graphs in the literature, with argon analysis. Those sorts of spectral methods are used to characterize and process findings of special nodes on the graph. For people who are familiar with the work of John Kleinberg from Cornell, his Hudson Authority work is of that ilk.