Suggested Citation:"TRANSCRIPT OF PRESENTATION." National Research Council. 2004. Statistical Analysis of Massive Data Streams: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/11098.


WELCOME AND OVERVIEW OF SESSIONS

TRANSCRIPT OF PRESENTATION

MR. SCHATZ: Thanks, Sallie. I am the chief of the math research office at NSA. The sponsorship of the conference here comes from an initiative that we have at the agency called Advanced Research and Development Activity, and Dr. Dean Collins is the head of that, whom I think some of you probably met at one of these conferences last year. We are very happy and proud to be part of the sponsorship and so forth. I am only going to talk for a few minutes, but interrupt with questions as needed, in the spirit of what I think you are here for, which is not lecturing to each other, but having a dialogue on things.

Of course, I don't think it is a big secret why the National Security Agency is interested in massive data sets. I don't know what the stream rate for massive data sets is either, but I think we are going to make it at our place.

Let me dwell on the obvious here just for a few minutes. As we look back over this past year, of course, a big question for us, not only for us as individuals, but for the government in the form of official commissions and so forth, is: could we have prevented 9/11? We look back on that as a kind of obvious point. There is a question: was there a message in our collection somewhere that said, attack at dawn, with all the details and so forth? While we certainly have done a due diligence search of everything we can lay our hands on that was available to us prior to 9/11 and we haven't found such a transmission, another type of question, though, that probably bothers us a lot more is: if we got one last night that said, attack at dawn, would we have noticed it? We have so much data to look at, and we don't look at the data. Our computers do first. So, if the computers don't find that message, and that means if the algorithms don't find that message, the analysts aren't going to read that message. That is just sort of the beginning part, of course, the most obvious.

Already, it gets us into huge questions about what is the nature of our databases, how do we store data, how do we retrieve data. Of course, in the past year of really being thrown into a whole new paradigm of intelligence analysis and so forth, we are more in the situation of asking the question, okay, we are probably not going to be fortunate enough to have a message that says, attack at dawn. What we are going to have to do is find clues in a massive data set and put that together and deduce that there is something happening tomorrow morning.

It has really been a huge change for us. It is not that we weren't thinking about massive data sets before; of course we were. When you are traditionally, after decades and decades, looking at well-defined nation-state targets, like Iraq, your way of approaching the analysis is sort of dictated by the fact that there is a country with national boundaries and a military and diplomats and various things like that to worry about. We were certainly aware of terrorist targets and studying them and worried about them and taking action long before 9/11, but of course, an event like that pumps up the level of urgency just beyond anything else that you could do. The postmortem stuff of analyzing what we could have done or would have done will, of course, go on for decades, just like today you still hear people talking about, could we have prevented Pearl Harbor. I think, 50 years from now, those same questions will still be asked. Of course, we are here now and we have to worry about the future, for sure.

I looked over the topics here and the group. It is a wonderful group of people. I am even starting to recognize lots of names and people who are good friends like David Scott, of course, who has been working with us for so long. I am not one of the experts in the group, but I know that we have got a good pile of the experts here.

Even if you are interested in environmental data or physics data, of course, there is one thing that you don't have to worry about with that data, I hope, which is that a good portion of it is encrypted. Even if we get past the encryption problem and say, suppose that all of our data is unencrypted, you probably do have to deal with some amount of garbling and that sort of stuff in your data, too. I imagine that we are all looking at the same kinds of questions, which are, there are certain events in our data that we would like to be able to spot and just pull out, because we know what they are going to be, and we just want to have rapid ways to do that. I think the real sense of where the science is going, at least for us, and I think for you, is: how do we take a massive data set and see pieces of what we need to see in many different places and pull it together and make conclusions based on that, and how do we find patterns?

For us, a key aspect of this problem, of course, is we don't have a homogeneous type of a data set. For any kind of communications medium that you can imagine, we will have that data. A key problem for us is kind of combining information from all these different things, and database structures for how you would actually set things up to make the algorithms run in a situation like that. Certainly, Kay Anderson and Dave Harris, who are from our group and are here today, were working on these types of problems long ago. It didn't start a year ago, but post 9/11, we have ramped up dramatically our efforts in these areas. In S&T, in the research area alone, there are dozens of people working on this who weren't working on it a year ago.

We have certainly got tons to learn about this stuff. It just seems, with the data explosion that everybody is going through, we are all kind of new at it, in a sense. I hope, in the spirit of these conferences, our guys will find ways to share technical things with you as best they can, and that even with all your proprietary information that you have to worry about, you can have a good technical exchange back with us. It is certainly an area where there are a lot of economic issues, and companies have ways of doing things and so forth, but hopefully the in-crowd here can get down to the mathematics and the statistics and share ideas. We need a lot of help in this area.

What we are doing is dramatically good. We have had some amazing success stories just in the past year that were an absolute direct result of what I would call data mining on massive data sets. I can assure you, for us, it is not just an academic exercise. We are right in the thick of it. We utilize it every day. It has done wonderful stuff for us in the past year, and we are getting a lot out of these conferences. I popped into one last year, and I am glad to see a lot of the same faces.

AUDIENCE: [Question off microphone.]

MR. SCHATZ: Probably not, but I do want you to know that I am not just saying that to make you feel good. We really have had some dramatic successes in terms of techniques we didn't have a year ago for looking for patterns in massive data, drawing conclusions and taking some known attributes of a situation and mining through the data to find new ones, and very algorithmically based, and really providing tools for our analysts. Of course, however many analysts we have (and I wouldn't know what that number is), it is finite, and any given human being can only look at so much text and so many pictures in one day. For us, it is all about teaching the machines how to work for us, and teaching the machines is teaching the algorithms. I can't think of an example that we could share with you, but real examples, real intelligence, real impact, plenty of it, just in this past year, based on the kinds of techniques we are learning with you.

Anyway, I don't want to overstay my welcome, because there is some real work to do, but if there are a couple more questions, I would be happy to talk.

AUDIENCE: [Question off microphone.]

MR. SCHATZ: Certainly, gigabytes on a daily basis and so forth. Maybe our experts will find a way they can give you a better sense of that. I don't really know. The thing is, we have lots of channels coming in at lots of rates, and if you put it all together, it would be something astronomical. We probably span every range of problems you could think of. It is not as though we have the mother lode coming in through one pipe every minute. We have lots of ways of collecting lots of sources. I am sure some of our most important stuff is very slow speed compared to the things you are talking about, and some of it is very high speed. There isn't any one kind of technique that we are looking for. Any range of techniques here, you know, something that takes longer and has to work at slower speeds, is probably just as interesting to us as something that has to work at the speed of light. We are going to have all kinds of problems to apply this stuff to.

Anything else I could give a vague kind of government answer to? Okay, Sallie, you are back, or maybe John is up, and thanks for being here, and we are happy to be part of this, and thanks for the research.

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop

Massive data streams, large quantities of data that arrive continuously, are becoming increasingly commonplace in many areas of science and technology. Consequently, development of analytical methods for such streams is of growing importance. To address this issue, the National Security Agency asked the NRC to hold a workshop to explore methods for analysis of streams of data so as to stimulate progress in the field. This report presents the results of that workshop. It provides presentations that focused on five different research areas where massive data streams are present: atmospheric and meteorological data; high-energy physics; integrated data systems; network traffic; and mining commercial data streams. The goals of the report are to improve communication among researchers in the field and to increase relevant statistical science activity.
