TOWARD THE ROUTINE ANALYSIS OF MODERATE TO LARGE-SIZE DATA

TRANSCRIPT OF PRESENTATION

MR. WHITNEY: Good morning. I don't know that I am going to start out as ambitious as looking at the whole network. I am from Pacific Northwest National Laboratory, where we have a group of about 40 statisticians. We do all kinds of data analysis. There is a phrase that describes a fair amount of what we do, probably because of the situations we live in, and it is fighting our tools. They just don't quite do everything you want them to do. We have to do a lot of custom work. I really wish I had recorded the sound of my hard drive swapping, just to play for you. I hear it all the time, and it is really a key thought. If you have heard your own hard drive swapping, keep that sound in mind as we go through some of these analyses, some of the stories of what we have done.

I have a lot of colleagues. Here is a list of some that are related to the stories I am going to tell; I think one person on the list is in the audience. There are 15 people on that list, and 4 of them are statisticians. There is a fellow there who does risk analysis for things like nuclear reactors, a fellow who does remote sensing, and some software engineers. There are people here whose technical backgrounds I really don't know, but we do have useful technical interaction. It just never quite comes up, what they did in school.

Here is some data. It also happens to be where I am from. This is the Columbia River. This is a satellite image I downloaded from SPOT; it was a few years ago and I might have the name wrong. It is low resolution, but you can see a lot of what is going on.

This is the city here. There is the Snake River going into the Columbia; the Yakima you can't quite make out. There it is. I work right about there. This is agriculture, and it is blurry, probably not because the measurement is that bad, but you get an image in one form, you put it in another form, you put it into PowerPoint, and God knows what has happened by now; but there are a lot of those circles for irrigation. The reason you don't see anything like that here is that this is the Hanford site. They don't do agriculture there. There are a few installations here that we will look at. I haven't got a good idea about why there is this demarcation. I suspect a fence. This is Rattlesnake Mountain.

Okay, I am going to describe some data analysis stories. Just for background, I would like to describe what our computing environment looks like, so you can get some idea about the size and difficulty of the challenges we are facing. Forty-some-odd people, two to three computers per person, typically a newer one and an older one; some Unix, some Mac, some PC. Our bread-and-butter software tends to be these guys here, because of the flexibility they give you. Some other languages, less and less Fortran and C over time; I don't think I have written anything in those languages in a long time. Then packages, depending on the problem. We have the potential for a lot of storage, the AFS-share kind of thing, and the network is good and reliable. You know, a lot of PCs. The demographics associated with Macintoshes are interesting. Here is the RAM size; this is getting back to the swapping thing. There is one lucky dog who has a couple of gigabytes of RAM and, you know, my computer is right there, and then my portable is right there, pretty typical, good stuff, when you think about it.

Also, another thing to keep in mind is the computing model. It is the single-workstation computing model that we tend to deal with on a day-to-day basis.

Here is an outline. I thought I would start at the end. So, the computers are good. They are great, and stuff still breaks. The reason stuff breaks has to do, I believe, with the software model that we are using. The realization of that is nothing deep, and the solution is just work: you have to keep track of the RAM you are using, and that will have a lot of implications, I think, in the future. It is an observation echoed very strongly by Ed, and Johannes was worried about memory, also, very explicitly in his talk. This failure of tools happens not only on streaming data or massive data; ordinary data sets of a size you could presumably fit in memory will cause the computers to die with the tools that we typically use.

Then, there is another type of complexity that is a whole lot slipperier. I think it is hard to describe, but if you think about the potential complexity underlying a large data set, and how you start to get your arms wrapped around it, it starts to become kind of daunting. For instance, let's pretend that you did a cluster analysis of a homogeneous data set, and you had some nice formula that said what the optimum number of clusters might be. If you can get that down to 3,000, because that is the optimum, well, you are just not done. That is still way too many for a human to absorb. So, somehow the complexity isn't just in the data set. It is in the communication of the information to the user, and that is a tricky thing, a very slippery thing.

We spend way too much time handling data. I am sure that happens to everybody, and I don't have a deep recommendation there. We just work through it like everybody else does.

So, here is what we do. Here is a good strategy to consider; think of the data as being something like that image of the Hanford area. First off, you have got this data. It is in a digital format, but we want to be able to use our standard data analysis tools on it. So, the first thing that we do is make a vector out of it. There are a ton of ways to do that, and there is a lot of good work out there that can be used. For instance, if you are dealing with a collection of documents, this isn't what you would use today, but you could use as fundamental information just word frequency counts. The world has moved on from that, but it is a nice measurement to think about. If you are working with an image or a collection of images, you could imagine looking at textures, edges, various bits of color. It turns out that both of those things, while simple and way too briefly stated there, can eventually be made to work very well. It is kind of surprising and gratifying.

The characteristics of those are indicated here in these two bullets. One is sort of a social characteristic, the bottom one, and the upper one is pretty interesting. Each coordinate in the vector that you are using to construct a signature can be very uninformative. For instance, if you are making a vector of word frequencies to represent a document, and you have multiples of those because you have multiple documents, the number of times that the word "signature" appears is a pretty low level of information about the objects in question.
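To make that concrete, here is a minimal sketch, not from the talk, of the kind of word-frequency signature being described. The three toy documents are made up for illustration; each becomes one row of a document-by-word count matrix.

    # Toy documents (hypothetical); each becomes one signature vector.
    docs <- c("the tank farm is near the river",
              "the river runs past the reactor",
              "signature vectors summarize each document")
    # Split into lowercase words and build a shared vocabulary.
    tokens <- lapply(strsplit(tolower(docs), "[^a-z]+"), function(w) w[w != ""])
    vocab  <- sort(unique(unlist(tokens)))
    # Count how often each vocabulary word occurs in each document.
    sigs <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
    colnames(sigs) <- vocab
    sigs   # each row is a weak, but usable, signature vector for one document

Any single column of that matrix, such as the count for the word "signature," says little on its own; it is the whole vector, across many documents, that becomes useful.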

It turns out, you put that together and you can start to do useful data analysis. Similarly, with an image: a color histogram, by itself, might not tell you a lot, but you put that together with other information and you start to get somewhere. This social characteristic is important. Guys like me can start making up sensors, virtual sensors, basically. You can, too. That basically changes the nature of our profession and our jobs as data analysts, I think, in a good way. Finally, why would you care about a strategy like this? Well, people have been coming up with data analysis algorithms forever. It is great. If you take a strategy where you encode some data object as a signature vector, that means you can borrow algorithms from all kinds of stat packages, neural net packages, whatever, to try things. So, it is an effective strategy, basically, for running through a lot of analysis potential.

Here is a way-too-busy showing of a collection of documents. Let me back off that one and go to something a little simpler. Suppose you have got 400 documents. You want to know what is in there, and you have only got 30 minutes. So, you can't read them all; that is out of bounds at this point. Well, there are tools out there that will basically look at those, summarize what, say, the key words are, and do a visual clustering of the collection. Then, even better, they start to label the various regions in ways that you can hopefully begin to understand what the generic contents of that collection are. So, the technology: you make one of those signature vectors. It turns out you need the coordinates to be meaningful. You do a little non-metric multidimensional scaling; there is a lot of artistic license in this particular one, and you go. So, it is a good functional thing, and you can start to build analytic functionality on top of that as well.
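Here is a minimal sketch of that kind of document map, under stated assumptions: given a document-by-word signature matrix (simulated below as random counts for 400 documents), compute pairwise distances and lay the collection out in two dimensions with non-metric multidimensional scaling. MASS::isoMDS is one standard implementation; the actual tool chain behind the slide is not specified in the talk.

    library(MASS)                                   # isoMDS: non-metric multidimensional scaling
    set.seed(1)
    # Stand-in for the 400-document signature matrix (random counts over 50 "words").
    sigs <- matrix(rpois(400 * 50, lambda = 2), nrow = 400)
    d    <- dist(sigs)                              # distances between signature vectors
    fit  <- isoMDS(d, k = 2, trace = FALSE)         # two-dimensional layout of the collection
    plot(fit$points, pch = 20, cex = 0.5, xlab = "", ylab = "",
         main = "Document collection laid out by non-metric MDS")

Labels for the regions (key words, cluster summaries) would be drawn on top of a layout like this.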

This is a dangerous picture, in the sense that I have never been able to explain it very well in a public forum, or in a private forum either, probably. The picture suggests there is some sort of relation going on. Let me indicate broadly what it is. You have matched-pair data; think sophomore statistics here. It turns out that one measurement is one of these text vectors for the English version of a document, and the other measurement is the text vector for the Arabic version of the same document. You have got, then, in some sense, a regression problem you can do from English to Arabic or Arabic to English. That part, by the way, has just become standard data analysis. It is regression, and this is a very loosey-goosey regression display. By the way, it does show that there is some potential for mapping there. The reason you might care is that maybe you don't speak one of those languages. So, you say, okay, I am going to calculate my vector for this; you know, I speak Arabic and somebody gave me this darned English document. So, I will calculate my vector for it, I will plug it into the regression formula, I will find what Arabic part of the space it lies in, and then say, oh, that is just like these other documents. So, it gives you a way to do that kind of problem, again, based on this simple idea of calculating these signatures, using standard data analysis procedures, and then mapping back to the problem space.
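A minimal sketch of that matched-pair mapping, with everything simulated: fit a multivariate linear regression from English-side signature vectors to Arabic-side signature vectors for the same documents, then push a new English signature through the fitted map. The dimensions, sample sizes, and random matrices below are placeholders, not the data from the slide.

    set.seed(2)
    n <- 200; p_en <- 30; p_ar <- 30
    X_en <- matrix(rpois(n * p_en, 3), n, p_en)               # English-side signatures (simulated)
    B    <- matrix(rnorm(p_en * p_ar, sd = 0.2), p_en, p_ar)
    X_ar <- X_en %*% B + matrix(rnorm(n * p_ar), n, p_ar)     # Arabic-side signatures (simulated)
    fit  <- lm(X_ar ~ X_en)                                   # multivariate regression, English to Arabic
    x_new <- matrix(rpois(p_en, 3), nrow = 1)                 # signature of a new English document
    x_hat <- cbind(1, x_new) %*% coef(fit)                    # its predicted position in Arabic space
    # Find the Arabic-side documents nearest the predicted position.
    nearest <- order(rowSums(sweep(X_ar, 2, x_hat)^2))[1:5]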

Another type of data object is image and video. It turns out that there are walls of books on how to do image analysis. So, we don't have to do that. The science is done. What isn't done is things like, well, I have 10 of these images in my shoe box and I would like to sort them out. I have never actually organized my photo collection; how do I go about doing that? Well, if they are digitized, you can calculate a vector representation for each of those images. You can do one of those visual clustering type things, do a multidimensional scaling, and show basically an organized collection of images, and that is one way to go.

What this shows is something similar for a small video clip. I am assuming 80 percent of you can recognize what video clip this is. The calculations that led to this picture are really simple. We did one of those signature vectors for each frame. We did a cluster analysis of the signature vectors. We took the picture with the vector nearest each cluster's representative vector and just show it and say, well, there is a label you can slap on that tape that gives you some idea of the content in a quick and dirty fashion.
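A minimal sketch of that key-frame labeling, assuming the per-frame signature vectors are already computed (here they are simulated): cluster the frames with k-means and, for each cluster, keep the frame whose signature lies nearest the cluster center.

    set.seed(3)
    frames <- matrix(rnorm(500 * 20), nrow = 500)   # stand-in: 500 frames, 20 signature coordinates
    km <- kmeans(frames, centers = 6, nstart = 5)   # cluster analysis of the frame signatures
    # For each cluster, the frame closest to the center serves as its exemplar.
    key_frames <- sapply(seq_len(6), function(k) {
      idx <- which(km$cluster == k)
      d2  <- colSums((t(frames[idx, , drop = FALSE]) - km$centers[k, ])^2)
      idx[which.min(d2)]
    })
    key_frames   # indices of the exemplar frames, one per cluster; these label the clip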

The idea just goes on. Here is the Hanford site, a little part of it, the 200 Area, as it is affectionately known. It is an IKONOS shot, so this is one of the bands of that. You can see there are roads, some buildings, God only knows what, and so on. What this picture shows is what happens if you calculate a little vector that describes the content, just sort of there and there and there and there: you take a little subpicture around each vector location and say, well, I am going to do a multidimensional scaling type thing for those vectors. To indicate the content, I will just show the little pictures near each vector. Then you say, okay, I have got this. Have I got anything interesting from an exploratory analysis or even a classification point of view, even though it was an unsupervised thing? The database has been around for a while and this data has been around for a while, so you can start doing brushing ideas. So, you grab a bunch of these little pictures here in this region of the multidimensional scaling display and see what you grab over here, just to get an idea of whether you have done anything useful. Okay, we have got a lot of these sort of benign regions, fields, as it were. Actually, they would be more like gravel fields there.

Let's grab another region and see what we get. Well, when you look at it, we got roads. So, you develop the potential for building some interesting data analytic tools with that strategy, and it kind of gets you to the place of, you know, it is either the statistician's manifesto or statistician hubris, that everything is data. You have got all of these objects. You have got this strategy. You can use it and see what happens.

So, let's talk about some more data. Let's back up a second. You have got network traffic. You have got economic models. I was going to also talk about an error-reporting database. Very diverse data objects.

Network traffic: you have seen some information on the format of that, but we have got a strategy for summarizing the content, potentially. So, we will show how that begins to play out. An economic model: it turns out we spent some time analyzing the output of a simulation of an international economy. We have no idea, actually, if the model was any good; there is probably a joke in there somewhere. It is, again, a whole other data type, and the question is how you get your arms around it.

Well, let's just dive in. We are focusing on the content. I mean, there is a lot of work on the packets. You have seen a lot of that here today, and some indication of why you might be interested in that from a network design point of view, but we wanted to look at the payloads. Our model is that we are going to look at the contents going by at a particular point. So, you imagine, if this is a network cable, I am just going to keep track of what goes by here; something that has that type of mental model. There are tons of challenges. You have the ethical and legal issues associated with privacy. Data handling is always a problem. There are tools out there to help you get by that, but you have to learn how to deal with the tools; I am sure folks here have gone through that learning curve as well. Then, it is streaming data, and we have been challenged by data sets on the order of the size of memory in our computer, given our current tools. So, streaming data is a whole other level of difficulty.

So, what do we do? Well, we have got a strategy. So, let's try our strategy and see what happens. What can you do for a signature on streaming network data? It is not just text going by, and it is not just images going by. It is who knows what going by in those payloads. So, you have immediately got a challenge: I can't just read the image analysis literature, I can't just go to the computational linguistics literature; it is everything. I have got to do something that handles everything. Well, there are some really straightforward things that have worked in the past for text that have at least a mathematical extension to bytes, to just digital data. So, we decided to look at byte-based n-grams and a couple of simple means of summary.

N-grams are fairly well known, but I thought I would take a minute, just in case there are a couple of people who don't know what they are. An n-gram signature for a text document is basically a frequency table of successive runs of characters in the document. So, if you have got this type of text and you are doing a 3-gram [on the phrase "an n-gram"], then you have A-N-space, you have got N-space-N, space-N-dash, and so on. You just accrue those things. It turns out a fellow named Damashek did some very nice work showing that you could use those as the basis of a signature to distinguish among language types. Then, subsequently, folks figured out that, son of a gun, you could actually do a fair job of information retrieval based on those crude measurements. This, again, emphasizes a point: the coordinate A-N-space isn't awesomely informative all by itself, and the coordinate G-R-A is not buying you a lot either, all by itself. If you take that weak collection of measurements together, you can start to do stuff. There is nothing new about this.
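Here is a minimal sketch, not the speaker's code, of a byte-based n-gram signature: slide a window of n bytes over a chunk of raw data and accrue a frequency count for each pattern seen. To keep the vector a manageable length, the counts here are hashed into a fixed number of buckets, which is a simplification for this sketch rather than a description of what was done at PNNL.

    # Byte-based n-gram signature for an arbitrary chunk of raw bytes.
    byte_ngram_signature <- function(raw_bytes, n = 2, buckets = 1024) {
      b   <- as.integer(raw_bytes)                   # bytes as integers 0..255
      sig <- numeric(buckets)
      if (length(b) < n) return(sig)
      for (i in seq_len(length(b) - n + 1)) {
        gram <- b[i:(i + n - 1)]
        key  <- (sum(gram * 256^(seq_len(n) - 1)) %% buckets) + 1  # hash the n bytes to a bucket
        sig[key] <- sig[key] + 1                     # accrue the count for this n-gram
      }
      sig / sum(sig)                                 # relative frequencies
    }
    # Example: the signature of one file (the path is hypothetical).
    # sig <- byte_ngram_signature(readBin("some_file.bin", "raw", file.info("some_file.bin")$size))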

Here are some scanned Xeroxes from Cover and Thomas's information theory book, showing what some of these models actually do from a generative point of view. If you just start thinking about what an n-gram might mean, well, it kind of goes with this Markov model thing. So, you estimate some frequencies and you generate some data from the models. This is a second-order one, sort of like dealing with 2-grams, and okay, hey, there is a word, there is a word, and that is not bad. It gets kind of funny down here. This is a 4-gram: this could be a word, and so on. So, it does seem to have something to do with language. So, it is an okay measurement. We decided to just make a leap of faith and see if it would have anything to do with generic digital data objects when we went to byte-based n-grams.

Here is a summary of how this might work. Eventually what we did was, we used a bunch of stuff from my Web cache. We broke it up into pieces of 1,500 bytes, for reasons that are clear at this point in the day, and just started categorizing things in an unsupervised setting. This is a nice plot from R. You can see that, even though it was unsupervised, we did a really good job of isolating these guys. These guys are really strong in this cluster, but spread among other clusters. So, it is not an overwhelming connection between type of file and cluster, based on that signature, but it is not random either. It is better than random. The big challenge here, by the way, was just how you semantically represent the content of something this generic. We didn't address that challenge in this particular exercise; we were just going to use exemplar objects, as opposed to things like words and sub-images.
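A rough sketch of that exercise, reusing byte_ngram_signature from the previous sketch: break each cached file into 1,500-byte pieces, compute a signature per piece, cluster the pieces, and cross-tabulate cluster membership against file type. The directory name, the use of file extensions as the "type," and k-means with eight clusters are all placeholders for illustration.

    files <- list.files("web_cache", full.names = TRUE)        # hypothetical cache directory
    types <- tools::file_ext(files)                            # stand-in for the true file type
    chunk_sigs <- list(); chunk_type <- character(0)
    for (f in seq_along(files)) {
      bytes  <- readBin(files[f], "raw", file.info(files[f])$size)
      starts <- seq(1, length(bytes), by = 1500)               # 1,500-byte pieces
      for (s in starts) {
        piece <- bytes[s:min(s + 1499, length(bytes))]
        chunk_sigs[[length(chunk_sigs) + 1]] <- byte_ngram_signature(piece)
        chunk_type <- c(chunk_type, types[f])
      }
    }
    X  <- do.call(rbind, chunk_sigs)
    km <- kmeans(X, centers = 8, nstart = 5)                   # unsupervised clustering of the pieces
    table(chunk_type, km$cluster)                              # how clusters line up with file type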

This second bullet was our stumbling block. My computer kept making that noise, basically, during this exercise, using the types of tools I showed you. It turns out that if we just took a little time and rewrote one of the clustering algorithms to not use so much memory, we did much better. Basically, the compute time went from impossible to, okay, that is a cup of coffee.

I am going to go right to the end. Ed wrote a very nice paper that appeared in the mid-1990s that laid out one way to organize your computations to fit on workstations. One of my favorite lessons learned was that my workstation is pretty good; I can solve a lot of problems with my workstation. Even taking that advice, and just working with, say, a 100-megabyte data set on a 500-megabyte workstation, with the kinds of tools that we typically use, stuff happened. Bad stuff happened. The computer made the noise. We charge our time by the hour. It is bad. It turns out that this would be good: a bounded RAM requirement would be really good, and recursive formulations; there are a lot of things out there. U-statistics are good, the Kalman filter is good. Once you know what you are looking for, you can do a Web search and start finding good theory out there in the computer science literature, in the database area, saying how to organize your calculations to achieve this. I think they are just getting started there as well. There is a ton of good work that a lot of people could do.

Let's think about that together here for a second. There is, as I said, some work out there. There is some commercial software, actually, that is thinking along these lines, probably some I don't even know about.
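To make the bounded-RAM point concrete, here is a minimal sketch, not from the talk, of a recursive formulation: a running mean and variance updated one record at a time (Welford-style), so the memory needed stays fixed no matter how long the stream runs.

    # One-pass, bounded-memory mean and variance for a stream of numeric records.
    make_running_stats <- function() {
      n <- 0; m <- 0; m2 <- 0
      list(
        update = function(x) {                     # fold one new observation into the summary
          n  <<- n + 1
          delta <- x - m
          m  <<- m + delta / n
          m2 <<- m2 + delta * (x - m)
        },
        result = function() c(n = n, mean = m, var = if (n > 1) m2 / (n - 1) else NA)
      )
    }
    rs <- make_running_stats()
    for (x in rnorm(1e5)) rs$update(x)             # stand-in for records arriving off a stream
    rs$result()                                    # storage does not grow with the stream

The same pattern, a small fixed state plus an update rule, generalizes to many other statistics.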

One of our spin-off companies took some time to worry about keeping good control of the amount of RAM used in their calculations, but there is theory you could do, too. I mean, you could work out the relative efficiency of a bounded-RAM statistic versus an unbounded-RAM statistic for various problems. You could imagine re-architecting some of our standard tools, R, say, to start taking advantage of some of these ideas.

There is a lot of good work a lot of people could do that would get us not only to moderate data sets but, I think, get us all going in a streaming data sense. This second recommendation is slippery; I don't think I have made a great case for it. If you think about the complexity that ought to be inherent in some of these larger data sets, and how much trouble we have communicating some basic ideas, I believe there is a lot of effort we need to expend in that area, and it is going to take a lot of folks, in part because they are the people we are going to be communicating with, and in part because the statistics community isn't going to have all of those answers.

MS. MARTINEZ: It is lunchtime. One question.

AUDIENCE: What type of clustering algorithms have you had the best luck with, and have you developed algorithms specifically to deal with streaming data?

MR. WHITNEY: I tend, just by default, to use k-means, and it works pretty well. It is simple, it is fast. We have played with variants of it, a recursive partitioning version, with various reasons why you would recurse. The version that I ended up writing for that network problem wasn't for streaming data. It was for data organized so you could pass through it, and repass through it, and repass through it. So, we had an explicit data structure that represented that type of operation. I hope to use that as a basis for lots of other algorithms, though.
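As a rough illustration of that pass-and-repass idea, here is a minimal sketch, again not the speaker's code, of Lloyd-style k-means in which every iteration makes one pass over the data in fixed-size blocks, so only one block plus the running sums is in memory at a time. The block reader below just slices an in-memory matrix, standing in for whatever on-disk structure was actually used.

    # k-means where each iteration is one bounded-memory pass over the data in blocks.
    chunked_kmeans <- function(read_block, n_blocks, k, d, iters = 10) {
      centers <- NULL
      for (it in seq_len(iters)) {
        sums <- matrix(0, k, d); counts <- numeric(k)
        for (b in seq_len(n_blocks)) {
          X <- read_block(b)                                   # only this block is in RAM
          if (is.null(centers)) centers <- X[sample(nrow(X), k), , drop = FALSE]
          memb <- apply(X, 1, function(x) which.min(colSums((t(centers) - x)^2)))
          for (j in seq_len(k)) {
            rows <- X[memb == j, , drop = FALSE]
            sums[j, ] <- sums[j, ] + colSums(rows)
            counts[j] <- counts[j] + nrow(rows)
          }
        }
        nonzero <- counts > 0                                  # update centers after the full pass
        centers[nonzero, ] <- sums[nonzero, , drop = FALSE] / counts[nonzero]
      }
      centers
    }
    # Usage with a matrix sliced into ten blocks (a stand-in for reading from disk):
    set.seed(4); big <- matrix(rnorm(2e4 * 5), ncol = 5)
    reader <- function(b) big[((b - 1) * 2000 + 1):(b * 2000), , drop = FALSE]
    chunked_kmeans(reader, n_blocks = 10, k = 3, d = 5)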
