Leland Wilkinson, Chair of Session on Mining Commercial Streams of Data

Introduction by Session Chair

Transcript of Presentation

BIOSKETCH: Leland Wilkinson is an adjunct professor of statistics at Northwestern University and senior vice president of SPSS, Inc. His research interests include statistical graphics and statistical computing.

Dr. Wilkinson received his AB from Harvard College in 1966, his STB from Harvard University in 1969, and his PhD from Yale University in 1975. In addition to his statistics background, Dr. Wilkinson has also served as a lecturer, visiting scholar, and professor of psychology at Yale University, the Israel Institute of Applied Social Research, and the University of Illinois at Chicago.

One of Dr. Wilkinson’s many accomplishments is the development of SYSTAT, a statistics and statistical graphics package that he designed in the late 1970s and then incorporated in 1983. An early feature of SYSTAT was MYSTAT, a free statistical package for students. SYSTAT and SPSS became the first statistical software companies to market full-featured Windows versions of their software. In 1995, SYSTAT was sold to SPSS, which in turn sold it to Cranes Software International in 2002.



MR. WILKINSON: All right, this session is on mining commercial streams of data, with Lee Rhodes, Pedro Domingos, and Andrew Moore. Only one of our speakers is from a company, although the other two, as you know, are involved in developing procedures for high-dimensional search, mining, and other areas that are highly relevant to what is done by businesses.

I just want to highlight the three major market areas for applications of streaming data analysis, and these are quite large. Monitoring and process control involves such applications as General Electric with its turbines worldwide. There are many, many turbines, and shutting down a turbine can cost millions of dollars per day in their system. So, they need to maintain continuous multivariate data stream monitoring on those turbines, and they have real needs for display, alert, and analysis capabilities. E-commerce goes without saying. We all know pretty much where that lies. Many are putting e-commerce data and Web logs into databases, but Amazon and other companies are analyzing these in real time. Financial services are another huge area for streaming data.

I thought I would give you a quick illustration of how that gets used. This is a Java application called Dancer that is based on the graphics algebra. We happen to be offline, of course, but the data we are putting into it now is a feed simulating a natural stream coming in. These are Microsoft stock trades, and they are coming in at roughly 5 to 10 per second. On the right, you see the list of trading houses, like Lehman Brothers, and so on. For these trades, the symbol size is proportional to the volume of the trade. An up arrow is a buy, a down arrow is a sell order, and a cross trade is a rectangle. These traders want to be able to do things like alter time, back it up, and reverse it. Those of you who have seen the TiVo system for TV video know that these kinds of manipulations of time can be critical. This application, by the way, is not simply replaying a visualization. It is actually doing the calculations as soon as the real-time feed comes in. Notice all the scaling is being done on the fly. You can speed up the series. If you speed this up fast enough, it is a time machine, but I won't go into that.

I will show you just one more aspect of real-time graphics, and these are the kinds of graphics that you can plug into what the rest of you guys do. When you develop algorithms, you can plug them into graphic displays of this sort. This one simulates the way I buy stock. Actually, I don't buy stock for this reason. It is just a simple exponential forecast. You can see the behavior. This is trading in Oracle and SPSS. This type of forecast represents exactly what I do, and probably some of you as well, which is: as soon as it starts going up a little bit, buy. What is being done here is that the model is being computed in real time.

So, in this kind of a system you get anywhere from 10 updates a second to 10,000 data events per second, and 90 percent of the effort in developing software in this area is in the data handling. How do you buffer 10,000 events per second and then render at frame rate using the graphics system? So, the rendering system is a lot simpler than the actual data handling system.
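As an illustration of the simple exponential forecast just described, here is a minimal sketch; the smoothing constant and tick prices are assumptions for illustration, not values from the actual application.

```python
class ExpSmoother:
    """Per-event exponential smoothing: fold each new trade into a running forecast."""
    def __init__(self, alpha=0.2):       # alpha is an assumed smoothing constant
        self.alpha = alpha
        self.level = None
    def update(self, price):
        if self.level is None:
            self.level = price           # the first observation seeds the forecast
        else:
            self.level += self.alpha * (price - self.level)
        return self.level

forecast = ExpSmoother(alpha=0.2)
for price in [27.10, 27.12, 27.09, 27.20, 27.25]:   # made-up tick prices
    print(forecast.update(price))
```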

So, now we are going to see some presentations that will highlight how these systems work, and we will begin with Lee Rhodes from Hewlett-Packard, who will tell you about data collection on the Web.

Lee Rhodes
A Stream Processor for Extracting Usage Intelligence from High-Momentum Internet Data

Transcript of Presentation

Technical Paper

BIOSKETCH: Lee Rhodes is chief architect for analysis in Hewlett-Packard's Management Software Organization. During his career at HP he has led numerous research and development efforts in a broad range of technology areas, including fiber optics, integrated circuit design, high-performance graphics software and hardware, CPU architecture, massively parallel processing systems, multimedia and video streaming software, communications software systems, and two- and three-dimensional visualization software.

Since 1996, Mr. Rhodes has been heavily involved in the development of operational systems software for the communications network service provider industry. He has invented a suite of technologies in the area of real-time analysis software that enables Internet service providers to quickly extract key statistical information about subscribers' usage behavior that is critical to security, network operations, capacity planning, and business product planning. He assembled and managed the R&D team that developed this technology into a commercially successful product.

Mr. Rhodes' formal educational background includes a master's degree in electrical engineering from Stanford University, where his emphasis was on integrated circuit design and solid-state physics. His undergraduate degree was in physics from the California State University at San Diego.

MR. RHODES: This should go flawlessly because of our advance planning. First, I want to tell you how honored I am to be here, and I want to thank the organizers of this conference for putting this setting together. I particularly want to thank Lee Wilkinson for being a great mentor and friend and guiding me along the way.

I am going to talk about a stream processor that we have developed at HP that is currently on the market. We sell this product. Before we get into any kind of measurements of what we are talking about, I just want to be clear that we should be calibrated: this is not science, this is engineering. Our role, my role, at HP is to develop advanced software. Our statistical sophistication is very low. I am learning and, with the help of Lee Wilkinson, I have learned an immense amount. I hated statistics when I was in college, but now I am really excited about it. So, I am really having fun with it. In isolation, much of the technology you will see here has been written about before in some form. Nonetheless, I think you will find it interesting.

The context of this technology is that we develop software for communications service providers. Those, particularly Internet providers although not exclusively, are our customers. We got started as a start-up within HP about five years ago, exclusively focused on the Internet segment, and particularly broadband Internet providers. We are finding that the technology we built is quite extensible to neighboring markets, particularly telephony, mobile, satellite, and so forth.

Now, the network service providers, as I am sure you know, have some very serious challenges. The first one is making money. The second one is keeping it. In terms of making money, Marketing 101 or Business 101 would tell you that you need to understand something about your customers. The real irony here is that few Internet service providers do any measurements at all about what their customers are doing. In fact, during the whole dotcom buildup, they were so focused on building infrastructure that they didn't take the time, or invest in, the systems that would allow them to understand more about customer behavior. That even goes for the ISPs that are part of big telephone companies. Of course, telephone companies have a long history of perusing the call detail records and understanding profiles of their customers.

There are some real challenges here, not only understanding your customers, but understanding what the differentiating services are. It is very competitive. What kinds of services are going to make money for you? Another irony is pricing this stuff. It is not simple. It is not simple now, and it will get even more complex, because of this illusion that bandwidth is free. That won't survive. It is not free. So, there have to be some changes and, as I go through the talk a little bit later, I think you will see why pricing is such a challenge, particularly for broadband. Certainly, you want to keep your own subscribers as part of your network, but you are also concerned about abuse, fraud, theft, and other kinds of security breaches.

Now, when you go and talk to these service providers, they own the big networks. What you find is like in any big organization: they have multiple departments and, of
course, the departments don't communicate very well. This is not a surprise. We have the same problem at HP. Nonetheless, the sales people are interested in revenue. So, they are really interested in mediation systems, which collect the data about the usage of their subscribers so that they can bill for it in some way. This is an emerging trend and will continue to be. They are interested not just in bytes, but in what type of traffic it is and the time of day. For instance, they want to be able to track gamers, say, to a local gaming host on the network, because their own network bits are cheaper than peering agreements out on the open networks. So, understanding who the people are who are playing games and so forth would be of interest to them.

Product development, these are the folks who come out with the new services. So, they need to have some sense of, well, is this going to make money or not, what is attractive. Network operations needs an understanding of utilization and performance on a day-by-day basis. They tend to be very focused on servers, on machines, on links, to make sure they are operating properly. Product planning is often in a different department. These are the ones who are interested in future capacity, in how to forecast current behavior forward to understand what to buy and when. Realize that a lot of quality of service, if you can call it that, on the Internet today is accomplished by over-provisioning. So, if I have bodacious amounts of bandwidth, nobody tends to notice. Of course, IP is particularly poor at quality of service, but there is work being done on that.

So, the technology challenges for the service provider, there are many, but here are a few key ones. They would like to capture once the data that would serve all these different needs. It is expensive to capture usage data, and the tendency among vendors such as HP is to go in and say, oh, great, we have this widget. We will just sample your key core routers with SNMP queries and get all this valuable data for you. Of course, every other vendor comes in and wants to do the same thing. So, they end up with 50 different devices querying all their routers and virtually bringing the routers down. Economic storage and management of the Internet usage data is a severe problem. Of course, they want the information right away and, of course, it has to scale.

So, I am talking about some of my back-of-the-envelope kind of analysis of this problem of data storage and analysis challenges. Starting with this, it is what I call a crossover chart. What I did is a very simplistic calculation saying that Internet traffic, particularly at the edges, is still growing at about doubling every, say, 12 months. At times it has been faster than that. Over the past several years, it seems to be pretty stable. One of the interesting things is that the traffic in the core of the Internet is not increasing as fast as it is at the edges, and a lot of that has to do with private peering agreements and caching that is going on at the edge, which is kind of interesting. The next thing I plotted was areal density of disk drives. In the disk industry, this is one of their metrics: how many millions of bits per square inch of magnetic surface
can they cram onto a disk. That has been doubling in a range of about every 15 months. So, it is a little bit slower. Then there is Moore's law, which doubles about every 18 months. The axes had no numbers on them. They don't need them. It doesn't matter where you originate these curves, you are going to have a crossover. If this continues to grow at this rate, then at some point, choose your measure, the traffic on the Internet is going to exceed some value. I think we can help with this one by better collection strategies and using statistics.

AUDIENCE: I have to admit, I am really confused here by comparing Internet traffic volumes to disk drive densities.

MR. RHODES: It is just a very simplistic assumption. It says that, if I am receiving traffic and I need to store information about that traffic that is proportional to the traffic itself, I have got to put it someplace.

AUDIENCE: What does it mean that they are equal?

MR. RHODES: I am just saying, choose a value. Suppose you can store so many trillion bytes or terabytes of data today. If the ability to store their data economically doesn't increase as fast as the traffic, and the need to store it, increases, you may have a problem.

AUDIENCE: So, where is the traffic coming from, if people can't store it?

MR. RHODES: That is on your own machines. Remember, the Internet is still growing. There are people joining. Now, the other crossing is Moore's law, which says that if the traffic continues to increase faster than Intel can produce CPUs that keep up with it, or Cisco can produce processors that keep up with it, you just have to add more horsepower.

AUDIENCE: Well, isn't the traffic consumed? If I am watching a video, I consume that traffic, I don't store it.

AUDIENCE: Some people might want to store it.

MR. RHODES: Okay, at the service provider, they are not storing the actual traffic. What they are interested in are the summary records, which are called usage data. The usage data are summaries of flows. At least, that is very common in the service providers. It is a fraction of the actual traffic, but as a fraction, it stays about the same. So, as a service provider, the tendency, and this may seem strange, is to save all of it. Those who have telecom backgrounds sometimes save their call detail records (CDRs) for seven years. Sometimes there are regulatory requirements. With Internet traffic, the number of summary records for a session is far higher, orders of magnitude higher, than for a single phone call. If you make a phone call, one record is produced. If you sit hitting links on the Internet, you are producing sometimes hundreds of sessions, as far as the way these sessions are recorded.

The second graph is also a back-of-the-envelope calculation. This is based on some measurements that we have done, and it is the storage required. Now, presume that you wanted to store each of these usage records. One of the factors that we have measured on broadband Internet is the number of what we call flows, really micro flows, per second per subscriber in a broadband environment. It is around 0.3, and varies, depending on time of day, from about 0.1 up to about 0.3. Now, you multiply that through times the size of a storage record, and they don't want to store just the flow information, they usually also need to put in information like the
subscriber ID and some other key information. You assume a couple hundred bytes per record. All of a sudden, you are talking about petabytes or exabytes of storage, if you want to store it for any kind of period. So, these curves represent different numbers of subscribers, different scales. The dark red one is about a million subscribers, and a service provider we are working with today saw this coming and realized that they had a problem.

The other one is also a back-of-the-envelope calculation, the time to process this stuff. Say you get it all into a database. You have got to scan it once. That can take a long time. There, I just projected different database systems, depending on how many spindles and how sophisticated you want to get, in terms of how many records per second you can process and how much money you want to spend on it. So, we are talking about years, sometimes, if you wanted to scan the whole thing. So, there is a problem here and it has to do with inventory; you just have too much inventory of data. Handling it is a severe problem.

So, this is a somewhat tongue-in-cheek illustration, somewhat exaggerated to make a point, but a lot of our major customers are very used to having very big data warehouses for all their business data. Data warehouses are tremendous assets. As soon as you start trying to plug these into the kinds of volume we are talking about, it no longer makes that kind of sense. What we have developed, and this is just a shortcut, is a sense of how we can capture information on the fly and build not just a single model, but hundreds or thousands of small models of what is going on in the network. Then, we have added the capability of essentially a real-time look-up, where the user, using a navigation scheme, can select what data they want to look at and then look at, for instance, the distribution statistics of that intersection.

This is the product. I promise I am not trying to sell anything, but I just want to say this is the architecture of the product that is the foundation of this. It is called Internet Manager. It is an agent-based technology. These represent software agents. It is these three things together here: encapsulator, rule engine, and a distributed data store. In a large installation, you can have scores to hundreds of these, and the whole idea is putting a lot of intelligence right up close to the source of this high-speed streaming data. We have different encapsulators. These are all plug-ins. The encapsulator is like a driver. It basically converts whatever unique source type or record type these various sources produce to an internal format. Then, this is a rule engine, which I won't talk about. Basically, the flow is generally to the right, although this is somewhat simplistic, so it represents a kind of pipeline. So, these rule engines process rules, and they scale in three dimensions. One is horizontal parallelization, which you have with many agents. The second is the size of the machine you put these on. The third is that you can defer certain rules downstream. So, you can spread your intelligence processing. Now, a lot of times, and this is where we initially got started, we were supplying data in database form, or file form, to various other business systems like rating, billing, reporting operations and so forth. That is how we got started.
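As a rough check on the back-of-the-envelope storage arithmetic above; the flow rate, record size, subscriber count, and database scan rate are the illustrative figures quoted in the talk or assumptions, not measurements. (For the crossover chart: traffic doubling every ~12 months against disk areal density doubling every ~15 months means the gap between them doubles roughly every 60 months.)

```python
# Illustrative arithmetic only, using the figures quoted in the talk plus assumptions.
flows_per_sec_per_sub = 0.3          # peak micro-flows per second per broadband subscriber
bytes_per_record = 200               # usage record plus subscriber ID and other keys (assumed)
subscribers = 1_000_000
seconds_per_year = 365 * 24 * 3600

bytes_per_year = flows_per_sec_per_sub * bytes_per_record * subscribers * seconds_per_year
print(f"{bytes_per_year / 1e15:.1f} PB per year")    # about 1.9 PB/year for a million subscribers

# Time to scan it once, assuming a database that reads 100,000 records per second (assumed).
records_per_year = flows_per_sec_per_sub * subscribers * seconds_per_year
print(f"{records_per_year / 1e5 / 86400:.0f} days to scan once at 100k records/s")  # roughly 3 years
```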
Now, to give you an idea of what the data look like, here is an example of one format of the hundreds that we read. This is a NetFlow version 5 record format. You can see all the
different types of information that comes out. Basically, it is summary information from the headers that you have been hearing about in the previous talks: source and destination addresses, source and destination ports, bytes, packets, and a lot of very valuable information here. It is for a flow. A flow is a group of packets matched on source and destination IP address, and sometimes on source and destination port as well. So, it is really nailed down to the particular transaction that is going on.

So, what do we do with this? Each of our engines can pull in records at anywhere from around 50,000 to 100,000 per second. The first task is to collect them and normalize them. The result of the normalization is what I like to think of as a vector, which was also spoken of earlier. This is a set of arbitrary attributes. Think of them as columns in a database, but it comes in as a single record and can actually be variable in the number of attributes, and dynamic. Now, once these come in, and we can have multiple streams coming in, usually we know quite a bit about these streams. We might have a stream coming in from an authentication service like DHCP, or a combination of DHCP, sometimes RADIUS, sometimes DNS, that basically authenticates a user, so the service provider knows it is a legitimate subscriber, as well as the usage information coming from the router itself. What we call them is normalized metered events. It is sort of the most atomic information about usage.

So, these entities come in just like a record, and they are processed in this rule chain, and a stream processing engine can't have loops. So, no for statements and stuff like that. We basically can't afford it. The event travels down, and you can have if-else-type statements. The other interesting thing is, as the data is processing through, each rule looks at particular fields, and this is all configurable, what rules you put in, of which there are several hundred. There is an association with a data tree. One of the things that this data tree can be used for is sorting. As the NME travels through, decisions are made; there is a natural selection going on based on a certain field. Then we can do simple summing, for instance. So, summing on a variable or even a group of variables is very straightforward, very much like the joins that were spoken about earlier. This all occurs in real time.

The other use of this data tree, as we call it, and it doesn't have to be just a tree, it can be a number of different things, is that each one of these triangles is a structure that can have an arbitrary container, and we can put data in it. So, one of the ways that we do stream correlation in real time is that we effectively have something like a switch, where we can select information coming from what we call a session correlation source. It will load information into the tree that is used for matching, and then, as the new entities come through, they correlate dynamically to the information that you want. For instance, it could be mapping an IP address to a subscriber, or you could do all different kinds of correlation.
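A minimal sketch (not HP's product code) of the kind of rule chain just described: a correlation table loaded from a session correlation source maps IP addresses to subscribers, and normalized metered events flowing through are correlated and summed.

```python
from collections import defaultdict

session_table = {}                       # IP address -> subscriber ID, fed by the auth stream
usage_by_subscriber = defaultdict(int)   # running byte totals: the "simple summing" case

def on_session_event(ip, subscriber_id):
    """Correlation source: remember which subscriber currently holds this IP address."""
    session_table[ip] = subscriber_id

def on_flow_record(nme):
    """Rule chain for one NME: correlate, then sum. Straight-line rules, no loops."""
    sub = session_table.get(nme["src_ip"], "unknown")
    usage_by_subscriber[sub] += nme["bytes"]

on_session_event("10.0.0.7", "subscriber-42")        # made-up session and flow records
for nme in [{"src_ip": "10.0.0.7", "bytes": 1500},
            {"src_ip": "10.0.0.7", "bytes": 400},
            {"src_ip": "10.0.0.9", "bytes": 900}]:
    on_flow_record(nme)
print(dict(usage_by_subscriber))
```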

Now, in any one engine, I am using a symbolic representation of what you just saw, this little triangle of a tree here, and you can have multiple ones. So, we can do fan-out. You can have a single source, because the same data needs to go to different applications and needs to be processed by different sets of rules. So, you can run them in parallel, going to different applications, or you can put them into sort of sequential schemes for more sophisticated rule processing.

So, the work that I have been doing has developed what I call capture models. As this data is flying by, I would like to collect more than just a sum. In fact, I would like to capture distributions of these variables or other kinds of characteristics. I think there are lots of things that you can do, Jacobians, say, although I haven't seen the need for that, but there is the opportunity. A capture model can have child models associated with it, but one of the rules of the capture model is that the NME that goes in on the left goes out on the right, because you can have a series of these in a row. So, you can have multiple of these capture models plugged together. I tend to look at this like a matrix. Inside any of these capture models there is a matrix where you have a number of different variables that you can track. If you are doing binning, then the other axis is the bins. So, instead of doing just simple summing, now you can do sorting of your data, and it feeds right into this capture model. You can put them in layers and do sequential summing. So, you create all these little matrices, and they are not very big, a few kilobytes, the largest eight to ten kilobytes. So, you can have thousands of them.

Now, the end-to-end architecture looks something like this, where you may have some pre-staging, for instance, some basic correlation going on. Then you put it directly into the models. That is one thing our customers are doing, or you can have these models work directly on the raw data. So, you can be binning it and making decisions as the data is flying by. What we do, then, is store just the models. Of course, the nice thing about these capture models is that they don't really grow with volume. The number of them is proportional to the size of the business problem that you are trying to deal with. Then, on the right here, you have the clients.

This is an example, not a great example, but one example, of a distribution that we collected. I don't have a good example of truly real time, but this kind of data can be collected in real time. It represents the usage of subscribers over a 30-day period. This thing is just constantly updating as the data is flying by. Red represents the actual number of subscribers, and the x-axis is the amount of their usage. Now, this is broadband Internet. So, you will see, I have a subscriber out here with 23 gigabytes of usage for that period, all the way down to tens or hundreds of bytes. So, there is a huge dynamic range. If you think about it, electric utilities or other types of usage services you might have, very few of them have this kind of wide dynamic range. Now, I fitted this, and it fitted pretty nicely to a lognormal. Plotting this on a linear axis doesn't make a lot of sense. In fact, what we do in the distribution models is logarithmic binning. This data fits that very, very nicely. It behaves very well in terms of binning.
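A minimal sketch of a capture model that does logarithmic binning on a streamed usage variable, in the spirit of what is described above; the bin layout and the usage values are assumptions for illustration, not the product's actual design.

```python
import math

class LogBinCapture:
    """Bin a streamed positive variable on a log10 scale; pass the event through unchanged."""
    def __init__(self, bins_per_decade=5, decades=12):
        self.bins_per_decade = bins_per_decade
        self.counts = [0] * (bins_per_decade * decades)   # tiny; the talk says real models are a few KB
    def add(self, value):
        if value > 0:
            b = int(math.log10(value) * self.bins_per_decade)
            if 0 <= b < len(self.counts):
                self.counts[b] += 1
        return value          # the NME goes in on the left and out on the right, so models chain

model = LogBinCapture()
for usage_bytes in [120, 5_000, 80_000, 2_000_000, 23_000_000_000]:   # made-up usage values
    model.add(usage_bytes)
print([(i, c) for i, c in enumerate(model.counts) if c])
```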

Now I can see up to 90 percent of my subscribers. There are two plots here. This is the subscribers at a particular usage, and this is the traffic that they create. One of the things it took me a while to figure out is why the right-hand side of this is so noisy. Notice it is jumping around quite a bit. Part of that is not just noise. Part of that is the fact that subscribers only come in unit quantities. So, a subscriber at 10 gigabytes of usage also creates big deltas out at the right-hand edge of this. The other reason is the actual binning. So, they may not fall in a particular bin. So, you will see some oscillation, and it is actually easier to see the oscillation between bins on this graph.

After reading Bill Cleveland's book, I tried the QQ plot, but I did a reverse QQ plot, because I have already got bytes on my x-axis, and these are the standard normal quantiles on the left. What is interesting is that the fit on this is very, very good, over about four orders of magnitude. I didn't bother doing any fancier fitting at the top or the bottom of this. The users at the bottom are using more than the model would predict, of course, and at the high end, they are using less. I find, in looking at about a dozen of these from different sites, that the top ones slop around a bit.

This is an asymmetry plot, which you read a lot about in the press. Actually, here, it is quantified. You can see, for instance, that 20 percent of the subscribers, the top 20 percent, are using 80 percent of all the traffic. That happens to be the way this distribution fell out. What they don't talk about is that 80 percent of the users are only using 20 percent, which is the obverse of that, which means they have got a really severe pricing and fairness problem, but I won't go into that.

Some extensions of this basic technology we are doing now, and actually deploying with one of our customers, involve using this kind of technique for security, abuse, fraud, and theft. We are doing a lot of learning in how to do this, but I am convinced that, once you have a distribution of a variable and you can normalize it, say, over some longer period of time for the standard population, then you can see changes in that distribution very quickly. If all of a sudden something pops up, like the fan-in or fan-out, which is the number of destination IP addresses or destination ports, all of a sudden explodes, then you know someone is scanning ports. These terms mean different things, but in the service provider industry, fraud and theft are different. Theft is when they are losing money. Fraud is only when someone is using your account, because you are still paying. Then, abuse is basically violation of the end-user agreement that you signed when you signed up with the service provider.

Now, the other thing I am working on is dynamic model configurations, where you can dynamically refocus a model, a collection model, on different variables, different thresholds, what algorithms are actually used and so forth. That allows you to do what I call drill forward. So, instead of always having to drill down into the history, when you see something anomalous, it is likely to come back. This is not like looking for subatomic particles. If someone is misbehaving, more likely it will occur again, and you want to zoom in on that and collect more data, and more detailed data. So, that is what I call drill forward.
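As an aside on the asymmetry plot mentioned above, here is a small simulated sketch of that calculation; the lognormal parameters are assumptions chosen so the simulated population lands near the 80/20 split described, not fitted values from real subscriber data.

```python
import numpy as np

rng = np.random.default_rng(1)
usage = rng.lognormal(mean=14.0, sigma=1.7, size=100_000)   # assumed per-subscriber usage, in bytes

usage = np.sort(usage)[::-1]                  # heaviest subscribers first
top20 = int(0.2 * usage.size)
share = usage[:top20].sum() / usage.sum()
print(f"top 20% of subscribers carry {share:.0%} of the traffic")
```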

just doing this in two dimensions. It is still pretty tractable in 10 or 20 dimensions. We hit statistical problems before we hit computational problems. Doing a fully general mixture of Gaussians in 20 dimensions is a very dangerous thing. There are so many parameters.

I just wanted to show you one thing you can do with this, but it is going to require some contortions with a mouse. One thing we often find a mixture of Gaussians useful for is just a nice density estimator. Without showing you the density estimates, we are going to just jump down here to the high resolution. I am going to home in on that particular region of the data set. Remember, it is still a copy of the data set down there. If we go down closer, we can see one of the brilliant properties that I love with mixtures of Gaussians: the way they manage to do this clustering at lots of different resolutions of interest. This isn't something that a K-means or a hierarchical model has been able to do, to spot the small things, while down there at a lower resolution are the big things.

Now I am going to get back to some more theory. Here is another question. Now, we are going to be going more toward the data cube types of questions that we have seen in a couple of talks today. Suppose we are dealing with a big data set with categorical attributes, not real-valued attributes, and, for some reason, we are constantly having to ask counting queries. For instance, given a data set, we might be asking, does the average wealth of inner-city blue-collar SUV owners seem to be correlated with their health? That is a question which
you can answer by going into a database, getting hold of the appropriate records, getting some sufficient statistics from them, and then doing a computation. Suppose that we know that we are not just about to be asked to do one of them, but we are going to be asked to do lots and lots of them. For example, you end up wanting to do this if you are building a Bayesian network. You end up wanting to do it if you are, for instance, using a very nice algorithm by DuMouchel and Pregibon that they introduced a couple of years ago, based on doing lots and lots of empirical Bayes tests on subsets of the data. Those are just two examples. We have had about 10 examples of situations where you want to have a big database and ask billions of those kinds of questions. You want to ask them quickly.

Here is a crazy idea. In order to answer a question like that, why don't we build a secondary database where we pre-compute the complete set of possible questions you could ask, and we pre-compute the answers for all of them? The reason that is a good thing is, if we had that database, then I could say I have a constant-time algorithm, independent of the amount of data, for doing those kinds of statistical questions. The only problem with it is, (a), it would take a huge amount of memory. In fact, on a typical database that we have been working with, we computed it would take about 10^40 bytes of memory to contain this secondary database. The other problem is, it would take a very long time to build the secondary database. The universe would have ended by the time we built it. Although it is a crazy idea to do this, sometimes we can just push our luck and get away with it if we can very intensively compress the secondary database while we are building it. AD-trees are a way to compress these databases. I am going to try to give you a sketch in about three minutes as to how we did it.
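To see why the uncompressed secondary database blows up, here is the counting argument with assumed arities (not the actual data set from the talk): each attribute in a query is either fixed to one of its values or left as "don't care".

```python
def num_count_queries(arities):
    """Number of distinct counting queries over categorical attributes:
    each attribute is fixed to one of its m values or left as "don't care",
    so the total is the product of (m + 1) over the attributes."""
    n = 1
    for m in arities:
        n *= m + 1
    return n

print(num_count_queries([2] * 50))    # 50 binary attributes: about 7e23 possible queries
print(num_count_queries([10] * 40))   # 40 ten-valued attributes: about 5e41 possible queries
```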

Here is a data structure which actually implements this impossibly expensive database I mentioned. It is a tree structure. Every node in it contains a count; for instance, this node contains a count of four. What does that mean? This node is answering the question, how many records have a1 equal to four and a2 we don't care? So, it turns out there are four records of that form. How many records have a1 equal to four and a2 equal to two? We end up looking down this branch of the data structure: a1 equals four, and the children of that node are the values of a2, so you come down here and you see that there are three such records. So, that is a database in which we can access any counts we like. If you like, it is an implicit representation of a big hypercube, combinatorially indexed by all combinations of attributes. We can do it with two attributes, but even with just a mere 20 attributes, this thing would take up too much memory to store on your disk drive, let alone in your memory. So, you try to save memory while you build this data structure.

The first thing you can do: whenever you get a count of zero, you don't need to store that node and you don't need to store any of the children of that node. For any node here with a count of zero, any specializations of that query also have counts of zero. So, that saved us a little bit of data, but not much. In the example I described before, it went down from 10^40 to about 10^30 bytes of memory. So, although we have decreased our memory requirements 10 billion-fold, it doesn't help us much.

Now, here is something else about this thing. Any node in this thing branches out considering all possible values of the attribute a1. I am going to mark in green the most common value of a1. Here, a1 contains the value four, and that has the highest count. That happened four times. The others happened less often.

Everywhere on the tree, whenever I am instantiating a variable, I take the most common value of it, delineate that in green, and delete all of those nodes from the database. All I am going to do is leave behind this little thing saying "most common value": this was the most common value before I deleted it. Now, there are two consequences of that. It turns out that you save a very, very large amount of data here. The reason is that it is always the most common value at every level that you are getting rid of. So, you might get rid of, if you are lucky, 40 percent of all the nodes on this level. Then, of the remaining 60 percent, you get rid of 40 percent of those on the next level. Of the remainder, whatever it is, you get rid of 40 percent of those on the next level. We can prove the amount of memory you need really goes down. In the example I was describing, we end up taking about 10^6 bytes in this thing. It is much, much smaller. So, that helps us a lot.

The other good piece of news is that you haven't lost any information you really needed. You can reconstruct any counts that you had been able to get before through this pruned data structure. I will give you an intuition of why that is true. I am going to try to do this in four minutes.

I will use this notation of a contingency table of a set of attributes. Well, you know, you are probably all familiar with what a contingency table is. In this data set, a contingency table over a1 and a2 is very similar to the distribution over a1 and a2.

It is a table recording the fact, for instance, that the combination 2, 2 occurs once in this database. There it is. Here is the marginal contingency table over a2, which is just a histogram over the values of a2. This is a conditional contingency table: the conditional contingency table of a2 among those records in which a1 has the value two. So, if you can get your head around this, among records in which a1 has the value two, a2 takes the value one twice. Those are the two records in which a1 has the value two and a2 takes the value one. These are just objects you need to know about.

There are two properties that contingency tables have, and they are very similar to the properties that you get from the axioms of probability, only these are based on counts. One of them: to get the marginal contingency table over a2, I can just row-wise add together the conditional contingency tables over each of the values of a1. Second fact: if I want to make the joint contingency table over a1 and a2, I can do that by gluing together the contingency tables of a2 conditioned on each of the values of a1.
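Written out in symbols (notation introduced here for clarity; it is not used in the talk), the two facts are:

```latex
% Fact 1: row-wise addition of the conditionals gives the marginal over a2.
\mathrm{CT}(a_2) \;=\; \sum_{v \in \mathrm{vals}(a_1)} \mathrm{CT}\bigl(a_2 \mid a_1 = v\bigr)

% Fact 2: gluing the conditionals along the a1 axis gives the joint over a1 and a2.
\mathrm{CT}(a_1, a_2) \;=\; \Bigl[\, \mathrm{CT}\bigl(a_2 \mid a_1 = v\bigr) \Bigr]_{v \in \mathrm{vals}(a_1)}
```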

So, let's hold those two facts in our memory, and then I will show you how you can do reconstruction. Here is our mutilated data structure. I am going to show you that we can recreate the joint contingency table over a1 and a2 from this mutilated data structure. I am going to have a recursive call, a program which progressively runs over this tree. It takes a node, in this case the root node, and a list of attributes, in this case a1 and a2. To build the joint contingency table of a1 and a2, it is going to be just what I said before. It is going to compute the conditional contingency tables, conditioned on each value of a1, from each of these nodes. To do this one, it just does a recursive call to the same algorithm. To do this one, it just realizes it has to create a table full of zeroes. This is the place where it looks like we are getting stuck, because this is the place where we don't have any data structure to run down to compute this conditional contingency table.

We can use the other fact I told you about contingency tables to save us. I know that if I do a row-wise addition of these four contingency tables, I should get the marginal contingency table on a2.

If I do another recursive call on my data structure here, asking it to create for me just the marginal contingency table for a2, then once I subtract these three contingency tables, that allows me to create this conditional contingency table, which I glue together with the remaining three, and that gets me the joint. Did I succeed in convincing you in four minutes?

AUDIENCE: Actually, I believed it when you said it.

MR. WILKINSON: You could have run into a stack problem; right?

MR. MOORE: The memory requirements for this are at most proportional to twice the size of the contingency table you create, multiplied by the number of attributes in the data. It is small. Now, it turns out that we can only do this practically for, I think our largest case was 500 attributes, which is still a fairly respectable size, but we are not talking about doing this on tens of millions of attributes. Of course, we can do it on a day-to-day basis; in one case we did it on a billion records, but the number of attributes was limited to a few hundred.
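Pulling the pieces together, here is a toy sketch of the idea: a simplified count tree in which zero-count branches and most-common-value branches are never stored, and a recursive reconstruction that recovers a joint contingency table by the subtraction trick just described. This is an illustrative reimplementation on made-up data, not the authors' code.

```python
import numpy as np

def build(rows, start, arities):
    """Node: total count plus, for each later attribute, children for every observed value
    except the most common one (stored as None); unobserved (zero-count) values are absent."""
    node = {"count": len(rows), "vary": {}}
    for a in range(start, len(arities)):
        groups = {}
        for r in rows:
            groups.setdefault(r[a], []).append(r)
        mcv = max(groups, key=lambda v: len(groups[v]))
        children = {v: (None if v == mcv else build(g, a + 1, arities))
                    for v, g in groups.items()}
        node["vary"][a] = {"mcv": mcv, "children": children}
    return node

def contab(node, attrs, arities):
    """Rebuild the joint contingency table over `attrs` (increasing attribute indices)."""
    if not attrs:
        return np.array(node["count"])
    a, rest = attrs[0], attrs[1:]
    vary = node["vary"][a]
    shape = tuple(arities[b] for b in rest)
    slices = {}
    for v in range(arities[a]):
        if v == vary["mcv"]:
            continue                               # recovered below by subtraction
        child = vary["children"].get(v)            # absent value means zero counts
        slices[v] = contab(child, rest, arities) if child else np.zeros(shape, int)
    marginal = contab(node, rest, arities)         # row-wise sum over all values of a
    slices[vary["mcv"]] = marginal - sum(slices.values())
    return np.stack([slices[v] for v in range(arities[a])])   # glue along the a axis

rows = [(0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 0, 0)]
arities = (2, 2, 2)
tree = build(rows, 0, arities)
print(contab(tree, [0, 2], arities))    # equals the direct cross-tabulation of attributes 0 and 2
```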

So, when we actually do use this for various things, we have to search, in this case, over a relatively small database, just 1.5 million records and 27 attributes, so 27 covariates. We also search over all five-dimensional tables. There are 27 choose 5 of those. Previously, we would have done a pass through the data set for each of those tables. We don't have to do that any more. Instead of taking 1.2 months of CPU time, it takes 21 minutes of CPU time to do it.

The final closing comments: what has happened as we have been developing these algorithms? I apologize, because I have been showing you algorithms that we actually developed a couple of years ago. I didn't show you our recent work because I didn't have time to get to that in this talk. What happened after we produced those algorithms was that a bunch of basically government places, and also some commercial places, who had run into the problem that they had a piece of statistics they really wanted to run, but they really could not run it using off-the-shelf tools, came to us. So, we have a consulting company which now routinely turns these things out as special cases that we do. We have found it is a very popular thing. We are turning away business rather than having to seek it out. So, I strongly recommend that this is a good business to be in, the computer science of making statistics go fast.
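For scale, the arithmetic quoted above (all five-dimensional tables from 27 attributes, and 1.2 months of CPU time versus 21 minutes) works out roughly as follows:

```python
from math import comb
print(comb(27, 5))                 # 80,730 five-dimensional tables over 27 attributes
print(1.2 * 30 * 24 * 60 / 21)     # about 2,500x fewer CPU-minutes (assuming a 30-day month)
```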

Just to ground it a little bit, one of the main academic things we are working on at the moment is applying spatial statistics to biosurveillance of possible bioterrorist attacks. We have a system running that monitors western Pennsylvania, and another one monitoring Utah. It was in operation during the Olympic Games, looking for hot spots in the sales of over-the-counter pharmaceuticals and in emergency department registrations. We are currently trying to take that national, to the stage where we are still going to be, by your standards, quite small. We will be getting in 10 million records a day, so not the billions of records that some of you have been talking about, and we are hoping to apply these methods in those cases.

I have shown you some of the physics applications. This has also been very useful in some large, high-throughput screening work with Pfizer. The coolest one is, we are putting something on Caterpillar diesel engines. This is a long shot, but it could become a product in 2008, when new emissions standards come into place, monitoring everything going on in a diesel engine in real time, doing real-time monitoring of emissions as a function of the timing of the diesel engine. We are also following this up by using this for a number of intelligence applications.

So, that is it. With good computational geometry, you can sometimes get even apparently impractical problems down to a practical size, and there are papers and software at that Web address. That is it.

MR. WILKINSON: Time for one or two questions.

MR. DOMINGOS: Going back to the pairwise distances problem, they were all guaranteed to be within 0.4 or farther than 0.4. What if that never happens? Do you just go down to single points? You could always have two boxes where the smallest distance is less than 0.4 and the largest distance is more than 0.4.

MR. MOORE: In the exact version of the algorithm there will be some cases where you have to go down to individual points in the database. You can show that the number of times that you will have to do that is the square root of the number of pairs in the database. The square root of the number of pairs in the database is linear, so the whole thing ends up being linear in the database size. If you tell the system that you are prepared to accept, say, a 1 percent error in your final count, usually it will never go down that far.

AUDIENCE: What about when the geometric methods are scaled to higher dimensions?

MR. MOORE: Good point, I should have mentioned that. These are based on a data structure called a KD-tree, for which Jerry Friedman was one of the inventors. KD-trees typically don't work very well above about 10 or 20 dimensions. So, with some of these algorithms, like the mixture-of-Gaussians ones, we get into computational problems at about 20 dimensions. Some of the other things, like the kernel methods or the two-point counts we have done, we actually these days run in a different data structure called a metric tree, where everything is stored in balls instead of hyper-rectangles. Under some circumstances, we have been able to run those in thousands of dimensions. In one particular case, we ran it in a million dimensions. That is not generally possible. If the distribution of the data is uniform, you can prove that you should not be able to do this efficiently. In empirical data, there are correlations among the covariates. Even if they are non-linear correlations, then you expect it to be practical, which is why we have been able to do this in thousands of dimensions.
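A minimal sketch of the kind of ball-based pruning discussed in this exchange: a toy ball tree and a dual-tree pair count that prunes a whole pair of balls whenever every cross pair is guaranteed to be within, or beyond, the distance threshold. The leaf size, data, and threshold are arbitrary choices for illustration, not the speakers' code.

```python
import numpy as np

class Ball:
    """A node of a toy ball tree: a center, a radius, and either points or two children."""
    def __init__(self, pts):
        self.pts = pts
        self.center = pts.mean(axis=0)
        self.radius = np.sqrt(((pts - self.center) ** 2).sum(axis=1)).max()
        self.left = self.right = None
        if len(pts) > 16:                         # arbitrary leaf size
            dim = pts.var(axis=0).argmax()        # split on the widest dimension
            order = pts[:, dim].argsort()
            mid = len(pts) // 2
            self.left, self.right = Ball(pts[order[:mid]]), Ball(pts[order[mid:]])

def count_pairs(a, b, r):
    """Count pairs (x in a, y in b) with ||x - y|| <= r, pruning whole pairs of balls."""
    d = np.linalg.norm(a.center - b.center)
    if d - a.radius - b.radius > r:               # every cross pair is farther than r
        return 0
    if d + a.radius + b.radius <= r:              # every cross pair is within r
        return len(a.pts) * len(b.pts)
    if a.left is None and b.left is None:         # two leaves: count exhaustively
        diff = a.pts[:, None, :] - b.pts[None, :, :]
        return int((np.sqrt((diff ** 2).sum(axis=2)) <= r).sum())
    if b.left is None or (a.left is not None and a.radius >= b.radius):
        return count_pairs(a.left, b, r) + count_pairs(a.right, b, r)
    return count_pairs(a, b.left, r) + count_pairs(a, b.right, r)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(2000, 3)), rng.normal(size=(2000, 3))
print(count_pairs(Ball(X), Ball(Y), r=0.4))
```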