Sallie Keller-McNulty

Welcome and Overview of Sessions

Transcript of Presentation

BIOSKETCH: Sallie Keller-McNulty is group leader for the Statistical Sciences Group at Los Alamos National Laboratory. Before she moved to Los Alamos, Dr. Keller-McNulty was professor and director of graduate studies at the Department of Statistics, Kansas State University, where she had been on the faculty since 1985. She spent 2 years between 1994 and 1996 as program director, Statistics and Probability, Division of Mathematical Sciences, National Science Foundation. Her ongoing areas of research focus on computational and graphical statistics applied to statistical databases, including complex data/model integration and related software and modeling techniques, and she is an expert in the area of data access and confidentiality. Dr. Keller-McNulty currently serves on two National Research Council committees, the CSTB Committee on Computing and Communications Research to Enable Better Use of Information Technology in Government and the Committee on National Statistics' Panel on Research on Future Census Methods (for Census 2010), and chairs the National Academy of Sciences' Committee on Applied and Theoretical Statistics. She received her PhD in statistics from Iowa State University of Science and Technology. She is a fellow of the American Statistical Association (ASA) and has held several positions within the ASA, including currently serving on its board of directors. She is an associate editor of Statistical Science and has served as associate editor of the Journal of Computational and Graphical Statistics and the Journal of the American Statistical Association. She serves on the executive committee of the National Institute of Statistical Sciences, on the executive committee of the American Association for the Advancement of Science's Section U, and chairs the Committee of Presidents of Statistical Societies. Her Web page can be found at http://www.stat.lanl.gov/people/skeller.shtml




MS. KELLER-MCNULTY: Okay, I would like to welcome everybody today. I am Sallie Keller-McNulty. I am the current chair of the Committee on Applied and Theoretical Statistics, and this workshop is sponsored by CATS; that is the acronym for our committee.

It is a bit of déjà vu looking out into this room, back to 1995 and the nucleus of people who held, or at least attended, the first workshop that CATS had on the analysis of massive data sets. It has taken us a while to put a second workshop together. In fact, as CATS thought about what makes sense for a workshop today that really deals with massive amounts of data, we decided we would try to jump ahead a bit and look at the problems of streaming data, massive data streams.

Now, the workshop committee, which consisted of David Scott, Lee Wilkinson, Bill DuMouchel, and Jennifer Widom, was pretty comfortable with the concept of massive data streams when they started planning. I think that, by the time this actually came together, they debated whether, instead of data streams, it should be data rivers. Several of you have asked me what constitutes a stream and how fast the data have to flow. I am not qualified to answer that question, but I think our speakers throughout the day should be able to address what that means to them.

We need to give a really good thank you to our sponsors for this workshop, the Office of Naval Research and the National Security Agency. Now I will turn it over to Jim Schatz from NSA. He will give us an enlightening, boosting talk for the workshop.

James Schatz

Welcome and Overview of Sessions

Transcript of Presentation

BIOSKETCH: James Schatz is the chief of the Mathematics Research Group at the National Security Agency.

MR. SCHATZ: Thanks, Sallie. I am the chief of the math research office at NSA. The sponsorship of the conference here comes from an initiative that we have at the agency called the Advanced Research and Development Activity. Dr. Dean Collins is the head of that, and I think some of you probably met him at one of these conferences last year. We are very happy and proud to be part of the sponsorship. I am only going to talk for a few minutes, but interrupt with questions as needed, in the spirit of what I think you are here for, which is not lecturing to each other but having a dialogue on things.

Of course, I don't think it is a big secret why the National Security Agency is interested in massive data sets. I don't know what the stream rate for massive data sets is either, but I think we are going to make it at our place.

Let me dwell on the obvious here just for a few minutes. As we look back over this past year, a big question for us, not only for us as individuals but for the government in the form of official commissions and so forth, is, could we have prevented 9/11? Looking back, there is an obvious question: was there a message in our collection somewhere that said, attack at dawn, with all the details and so forth? We have certainly done a due-diligence search of everything we can lay our hands on that was available to us prior to 9/11, and we haven't found such a transmission. Another type of question, though, that probably bothers us a lot more is, if we got one last night that said, attack at dawn, would we have noticed it?

We have so much data to look at, and we don't look at the data first. Our computers do. So, if the computers don't find that message, and that means if the algorithms don't find that message, the analysts aren't going to read that message. That is just the beginning, of course, the most obvious part. Already, it gets us into huge questions about the nature of our databases: how do we store data, and how do we retrieve data?

Of course, in the past year of being thrown into a whole new paradigm of intelligence analysis, we are more in the situation of asking the question, okay, we are probably not going to be fortunate enough to have a message that says, attack at dawn. What we are going to have to do is find clues in a massive data set, put them together, and conclude from that that there is something happening tomorrow morning.

It has really been a huge change for us. It is not that we weren't thinking about massive data sets before; of course we were. But when you have spent decades and decades looking at well-defined nation-state targets, like Iraq, your way of approaching the analysis is dictated by the fact that there is a country with national boundaries and a military and diplomats and various things like that to worry about. We were certainly aware of terrorist targets, studying them, worrying about them, and taking action long before 9/11, but an event like that pumps up the level of urgency beyond anything else. The postmortem work of analyzing what we could have done or would have done will, of course, go on for decades, just as today you still hear people asking whether we could have prevented Pearl Harbor. I think, 50 years from now, those same questions will still be asked. Of course, we are here now, and we have to worry about the future, for sure.

I looked over the topics here and the group. It is a wonderful group of people. I am even starting to recognize lots of names and people who are good friends, like David Scott, of course, who has been working with us for so long. I am not one of the experts in the group, but I know that we have a good pile of the experts here.

Even if you are interested in environmental data or physics data, there is one thing that you don't have to worry about with that data, I hope, which is that a good portion of it is encrypted. Even if we get past the encryption problem and suppose that all of our data is unencrypted, you probably do have to deal with some amount of garbling and that sort of thing in your data, too. I imagine that we are all looking at the same kinds of questions: there are certain events in our data that we would like to be able to spot and just pull out, because we know what they are going to be, and we just want rapid ways to do that. I think the real sense of where the science is going, at least for us, and I think for you, is, how do we take a massive data set, see pieces of what we need to see in many different places, pull it together, and make conclusions based on that? How do we find patterns?

For us, a key aspect of this problem is that we don't have a homogeneous type of data set. Any kind of communications medium that you can imagine, we will have that data. A key problem for us is combining information from all these different things, and the database structures for how you would actually set things up to make the algorithms run in a situation like that. Certainly, Kay Anderson and Dave Harris, who are from our group and are here today, were working on these types of problems long ago. It didn't start a year ago, but post-9/11 we have ramped up our efforts in these areas dramatically. In the S&T research area alone, there are dozens of people working on this who weren't working on it a year ago.

We certainly have tons to learn about this stuff. With the data explosion that everybody is going through, we are all kind of new at it, in a sense. I hope, in the spirit of these conferences, our guys will find ways to share technical things with you as best they can, and that, even with all the proprietary information that you have to worry about, you can have a good technical exchange back with us. It is certainly an area where there are a lot of economic issues, and companies have their ways of doing things and so forth, but hopefully the in-crowd here can get down to the mathematics and the statistics and share ideas.

We need a lot of help in this area, and what we are doing is dramatically good. We have had some amazing success stories just in the past year that were an absolute direct result of what I would call data mining on massive data sets. I can assure you, for us, it is not just an academic exercise. We are right in the thick of it. We utilize it every day. It has done wonderful stuff for us in the past year, and we are getting a lot out of these conferences. I popped into one last year, and I am glad to see a lot of the same faces.

AUDIENCE: [Question off microphone.]

MR. SCHATZ: Probably not, but I do want you to know that I am not just saying that to make you feel good. We really have had some dramatic successes in terms of techniques we didn't have a year ago for looking for patterns in massive data, drawing conclusions, and taking some known attributes of a situation and mining through the data

to find new ones, very algorithmically based, really providing tools for our analysts. Of course, however many analysts we have, and I wouldn't know what that number is, it is finite, and any given human being can only look at so much text and so many pictures in one day. For us, it is all about teaching the machines to work for us, and teaching the machines is teaching the algorithms. I can't think of an example that we could share with you, but there are real examples, real intelligence, real impact, plenty of it, just in this past year, based on the kinds of techniques we are learning with you.

Anyway, I don't want to overstay my welcome, because there is some real work to do, but if there are a couple more questions, I would be happy to talk.

AUDIENCE: [Question off microphone.]

MR. SCHATZ: Certainly, gigabytes on a daily basis and so forth. Maybe our experts will find a way to give you a better sense of that; I don't really know. The thing is, we have lots of channels coming in at lots of rates, and if you put it all together, it would be something astronomical. We probably span every range of problems you could think of. It is not as though we have the mother lode coming in through one pipe every minute. We have lots of ways of collecting lots of sources. I am sure some of our most important stuff is very slow speed compared to the things you are talking about, and some of it is very high speed. There isn't any one kind of technique that we are looking for; something that takes longer and has to work at slower speeds is probably just as interesting to us as something that has to work at the speed of light. We are going to have all kinds of problems to apply this stuff to.

Anything else I could give a vague kind of government answer to? Okay, Sallie, you are back, or maybe John is up. Thanks for being here, and we are happy to be part of this, and thanks for the research.