Sallie Keller-McNulty, Chair of Session on Integrated Data Systems

Introduction by Session Chair

Transcript of Presentation

BIOSKETCH: Sallie Keller-McNulty is group leader for the Statistical Sciences Group at Los Alamos National Laboratory. Before she moved to Los Alamos, Dr. Keller-McNulty was professor and director of graduate studies at the Department of Statistics, Kansas State University, where she has been on the faculty since 1985. She spent 2 years between 1994 and 1996 as program director, Statistics and Probability, Division of Mathematical Sciences, National Science Foundation. Her on-going areas of research focus on computational and graphical statistics applied to statistical databases, including complex data/model integration and related software and modeling techniques, and she is an expert in the area of data access and confidentiality. Dr. Keller-McNulty currently serves on two National Research Council committees, the CSTB Committee on Computing and Communications Research to Enable Better Use of Information Technology in Government and the Committee on National Statistics’ Panel on the Research on Future Census Methods (for Census 2010), and chairs the National Academy of Sciences’ Committee on Applied and Theoretical Statistics. She received her PhD in statistics from Iowa State University of Science and Technology. She is a fellow of the American Statistical Association (ASA) and has held several positions within the ASA, including currently serving on its board of directors. She is an associate editor of Statistical Science and has served as associate editor of the Journal of Computational and Graphical Statistics and the Journal of the American Statistical Association. She serves on the executive committee of the National Institute of Statistical Sciences, on the executive committee of the American Association for the Advancement of Science’s Section U, and chairs the Committee of Presidents of Statistical Societies. Her Web page can be found at http://www.stat.lanl.gov/people/skeller.shtml




Transcript of Presentation

MS. KELLER-MCNULTY: Our next session has to do with integrated data streams. Actually, it has been alluded to in the sessions prior to this as well, the multiplatforms, how do you integrate the data? We are going to start off with a talk that is sort of overview in nature, that is going to present some pretty broad problems that we need to start to prepare ourselves to address. That is going to be given by Doug Beason, who is one of the deputy lab directors in the threat reduction directorate at Los Alamos. He has been involved with different presidential advisors at OSTP throughout his career, for both Clinton and Bush, and has a long history of looking into, being interested in, and doing, himself, science in this whole area. That will be followed by a talk by Kevin Vixie, who will look at some hyperspectral analyses, kind of focusing in on a piece of this problem. He is a mathematician at Los Alamos. Finally, our last speaker will be John Elder, who has been looking hard at integrating models and integrating data, and both hardware and software methods to do that.

J. Douglas Beason
Global Situational Awareness

Abstract of Presentation
Transcript of Presentation

BIOSKETCH: Douglas Beason is the director of the International, Space and Response (ISR) Division at the Los Alamos National Laboratory, responsible for over 400 professionals conducting research and development in intelligence, space, sensor, and directed energy programs. He has over 26 years of R&D experience that spans conducting basic research to directing applied science national security programs and formulating national policy. Dr. Beason previously served on the White House staff, working for the President’s Science Advisor in both the Bush and Clinton administrations. He has performed research at the Lawrence Livermore National Laboratory; directed a plasma physics laboratory; taught as an associate professor of physics and director of faculty research; was deputy director for directed energy, USAF Research Laboratory; and is a member of numerous national review boards and committees, including the USAF Science Advisory Board and a Vice Presidential commission on space exploration. He retired as a colonel from the Air Force after 24 years, with his last assignment as commander of the Phillips Research Site, Kirtland AFB, New Mexico. Dr. Beason holds PhD and MS degrees in physics from the University of New Mexico, an MS in national resource strategy from the National Defense University, and is a graduate of the Air Force Academy with bachelor’s degrees in physics and mathematics. The author of 12 books and more than 100 other publications, he is a fellow of the American Physical Society, a distinguished graduate of the Industrial College of the Armed Forces, a recipient of the NDU President’s Strategic Vision Award, and a Nebula Award finalist.

Abstract of Presentation

Global Situational Awareness
Douglas Beason, Los Alamos National Laboratory

Battlefield awareness can sway the outcome of a war. For example, General Schwarzkopf’s “Hail Mary” feint in the Gulf War would not have been possible if the Iraqis had had access to the same overhead imagery that was available to the Alliance forces. Achieving and maintaining complete battlefield awareness made it possible for the United States to dominate both tactically and strategically. Global situational awareness can extend this advantage to global proportions. It can lift the fog of war by providing war fighters and decision makers capabilities for assessing the state anywhere, at any time—locating, identifying, characterizing, and tracking every combatant (terrorist), facility, and piece of equipment, from engagement to theater ranges, and spanning terrestrial (land/sea/air) through space domains. In the world of asymmetric warfare that counterterrorism so thoroughly stresses, the real-time sensitivity to effects (as opposed to threats from specific, preidentified adversaries) that is offered by global situational awareness will be the deciding factor in achieving a dominating, persistent victory.

The national need for global situational awareness is recognized throughout the highest levels of our government. In the words of Undersecretary of the Air Force and Director of the National Reconnaissance Office Peter Teets, “While the intelligence collection capabilities have been excellent, we need to add persistence to the equation…. You’d like to know all the time what’s going on around the face of the globe.”

Global situational awareness is achieved by acquiring, integrating, processing, analyzing, assessing, and exploiting data from a diverse and globally dispersed array of ground-, sea-, air-, and space-based distributed sensors and human intelligence. This entails intelligently collecting huge (terabyte) volumes of multidimensional and hyperspectral data and text through the use of distributed sensors; processing and fusing the data via sophisticated algorithms running on adaptable computing systems; mining the data through the use of rapid feature-recognition and subtle-change-detection techniques; intelligently exploiting the resulting information to make projections in multiple dimensions; and disseminating the resulting knowledge to decision makers, all in as near a real-time manner as possible.

Transcript of Presentation

MR. BEASON: Thanks, Sallie. This will be from the perspective of a physicist. While I was flying out here, I sat by a particle physicist. I, myself, am a plasma physicist. I found out that, even though we were both from the same field, we really couldn’t understand each other. So, good luck on this.

Global situational awareness is a thrust that the government is undertaking, largely due to the events surrounding September 11 of 2001. Basically, it is a thrust to try to give decision makers the ability to assess the socioeconomic or tactical battlefield situation in near-real time. The vision is to be able to do it anywhere, any time—and this is versus everywhere, all the time. Now, everywhere all the time may never be achieved, and we may never want to achieve that, especially because of the legal ramifications. The vision is to be able to monitor nearly everywhere, any time, and what I am going to do is walk you through some of the logic involved. First of all, what do we mean by that? What do we mean by some of the sensors? Then, really get to the core of the matter, which is how do we handle the data. That really is the big problem—not only the assimilation of it, but understanding it and trying to fuse it together. We will mine it, fuse it, and then try to predict what is going to happen.

Here is an outline of the talk. What are the types of capabilities that I am talking about in the concept of operations? Then, I will spend a little bit of time on the signatures. That is kind of the gravy on here. Again, I am a physicist, and this is where the fun part is. What do we collect and why, a little bit of the physics behind it, and how do we handle the data, how do we mine it, and then how do we fuse it?

What scale of problem am I talking about? If we just purely consider—if we try to decouple the problem from the law enforcement to the space situational awareness, and just look, for example, at the battlefield awareness, what people are talking about in the Defense Department is some kind of grid on the order of 100 by 100 kilometers. So, that is 10^4 square kilometers. Then, up to 50 kilometers high, and knowing the resolution down to a meter. That is like 10^14 points. This is just the battlefield itself. So, it just staggers your mind, the problem.

So, let me give some examples of what we mean by global capabilities. First of all, the argument is being made that it is more than visible. That is, it is more than looking at photographs. It is more than looking at imagery. It includes all types of sensors, and I am going to walk you through this in a minute. It also includes cyberspace, Web sites, e-mail traffic, especially if there is a flurry of activity. What you would like to do is have the ability to look at what we call known sites, and to visit these in a time where, first of all, things don’t change very much. That is, it could be a building going up, and you may only have to revisit this site perhaps weekly, daily or even hourly, if you would like. These are sites where something may be going on, or even Web sites, but you know that the delta time change is not very much, so you don’t have to really revisit it too much. The second thing is that you really want to have the capability for monitoring for specific events.
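
As a back-of-the-envelope check on that scale estimate (my arithmetic, not the speaker’s): at one-meter resolution, a 100 km by 100 km area taken up to 50 km of altitude contains

\[
\frac{10^{5}\,\mathrm{m}\times 10^{5}\,\mathrm{m}\times 5\times 10^{4}\,\mathrm{m}}{(1\,\mathrm{m})^{3}} = 5\times 10^{14}
\]

grid cells, i.e., on the order of 10^14 points, consistent with the figure quoted above.
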
If there is a nuclear explosion, if missiles are being moved around, if terrorists are meeting somewhere, you want to have those specific events. You want to be able to telescope down to them, to be able to tap into them. You want to be able to do it on a global scale. Second of all, for those kinds of activities, you may have to have some kind of a tip-off. You may not know, through any kind of intercepts, like telephone conversations, or through visual intelligence. You may have to have human intelligence direct you to what is going to be happening. So, that is what you would like on a global scale.

On a local scale, you would want very specific things to occur. For example, perhaps when equipment is being turned on or off, if this is a terrorist that you have located who is communicating, you want to be able to not only geolocate the terrorist, but also to determine some of the equipment that they may be using. This thing of dismounts, right now, this is one of DARPA’s largest problems. A dismount is, if you can imagine a caravan going through the desert and you are tracking this caravan and all of a sudden somebody jumps off the caravan, you don’t want to divert your observation asset away from that caravan, but yet, this person who jumped off may be the key person. So, you would want to have the capability of not only following that caravan, but to follow this person across the desert, as they jump into a car, perhaps, drive to an airport, jump in a plane and then fly somewhere to go to a meeting. So, how do you do something like that? Again, it is not just visual. You can imagine an integrated system of sensors that combines, say, acoustic sensors that are embedded in the ground that can follow the individual, and then hands off to some kind of RF—a radio-frequency sensor—that could follow the car, and that could, again, follow the plane that the person may go into. So, what type of sensors are needed, and how do you integrate this in a way so that you don’t have a bunch of scientists sitting in a room, each person looking at an oscilloscope saying, okay, this is happening and that is happening, and then you are going to hand it off to the next person? What type of virtual space do you need to build to be able to assimilate all this information, integrate it and then hand it off? So, these are some of the problems I will be talking about.

The traditional way of looking at this problem is to build a layered system of systems. That is, you have to balance everything from sensitivity, resolution, coverage, and data volume. I will give you a quick example. Back in the Bosnian War, the communications channels of the military were nearly brought to their knees. The reason was not because of all the high information density that was going back and forth on the communications channel. It was because, when people would send out, say, air tasking orders or orders to go after a certain site, they would send them on PowerPoint slides with 50 or 60 emblems, each bitmapped, all around the PowerPoint. So, you had maybe 20 or 30 megabytes of a file that had maybe 20 bits of information on it. So, you have to be smart in this. So, the point there is that when you are making these studies, the answer is not just to build bigger pipes and to make bigger iron to calculate what is going on. You have to do it in a smart way.

Again, this is part of the problem, and now what I am going to do is walk you through, first of all, some of the sensors and some of the ways that people think we may be able to attack this problem. First of all, there is more to the sensing than visual imagery. Let me walk you through some examples of these. The case I am trying to build up here is that, for this global situational awareness, the problem is not really inventing new widgets. It is the information, and the information is really the key. It is where the bottleneck is. So, I am going to walk you through just some examples of some sensors that already exist. Some of them are already being used. It is not, again, a case of building new technology all the time.

On the lower left-hand side, what you are looking at is a Defense Threat Reduction Agency project—so, this is why I am here, to translate that—our project. It is a hard and deeply buried target project. We are basically looking at an underground cavern and trying to determine where assets are in this underground cavern. Of course, that is a timely question today. You can do this by using acoustic sensors. That is, you know the resonances that are built up in these three-dimensional cavities. Just like you can calculate the surface of a drumhead when it is struck, in a three-dimensional cavity, if you know where the resonances are located, what you can do is back out of that where the assets are, if you know that trucks are in there, for example, or that people are walking around. It is kind of like a three-dimensional pipe organ. This just shows some unique characteristics that arise from the power spectrogram of that. On the upper right-hand side it is just showing that, believe it or not, there is a difference in the acoustic signatures of solid-propellant and liquid-fuel rockets. You can back out what the differences are, and you can identify not only what type of rocket somebody has shot off, whether it is solid or liquid fueled, but also the unique rocket itself from that. So, there are other types of sensors, using sonics—and I will talk a little bit more here when I talk about distributed networks. If you have the ability to geolocate your sensors in a very precise manner—say, by using differential GPS—then what you can do is correlate the acoustic signatures that you get. You can, for example, geolocate the position of snipers. You can imagine, then, that those distributed sensors don’t even have to be stationary, but they could also be moving, if you have a time resolution that is high enough.

What are the types of sensors that we are talking about? Well, radio-frequency sensors. For example, on the lower left-hand side, it shows a missile launcher that is erecting. Most of the examples I am using for global situational awareness are military in nature, but that is because of the audience that this was pitched at. What occurs in a physics sense, any time you have a detonation that happens in an engine, is that you have a very low temperature plasma that is created. Any time you have a plasma—plasmas are not perfect, that is, they are not ideal MHD plasmas—you have charge separation, which means that you have radio-frequency emissions. You are able to pick that up. In fact, the emissions are dependent upon the cavity that they are created in. So, there is a unique signature that you can tag not only to each class of vehicle, but also to the vehicle itself that you can pick out. Up on the right-hand side, it shows the same type of phenomenology being used to detect high explosives when they go off. I am not talking about megaton-class high explosives. I am talking about the pound class, 5- to 10-pound classes of explosives. Again, it creates a very low temperature plasma. An RF field is generated, and you can not only detect that RF field, but also geolocate it. These are, again, examples of sensors that can be added to this global sensor array.

We have two examples here of spectral-type data. On the right-hand side is data from a satellite known as the Multispectral Thermal Imager. It is a national laboratory satellite, a joint effort between us and Sandia National Laboratory. It uses 15 bands in the infrared. It is the first time that something like this has been up and has been calibrated. What you are looking at is the actual dust distribution the day after the World Trade Center went down. This is from the hot dust that had diffused over lower Manhattan. I can’t tell you exactly what the calibration is on this, but it is extremely low.
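
The cavity-resonance idea above can be illustrated with a short sketch (my own toy example, not anything from the talk; the sampling rate and resonant frequencies are made up): synthesize an acoustic record containing a few weak, persistent tones buried in noise, compute its power spectrogram, and pick out the spectral lines that stay put over time.

```python
import numpy as np
from scipy import signal

# Hypothetical parameters -- illustration only.
fs = 2000.0                           # sampling rate (Hz)
t = np.arange(0, 10.0, 1.0 / fs)      # 10 seconds of "recording"
resonances = [55.0, 130.0, 310.0]     # assumed cavity resonances (Hz)

# Synthetic microphone record: weak resonant tones buried in broadband noise.
x = sum(0.2 * np.sin(2 * np.pi * f0 * t) for f0 in resonances)
x = x + np.random.default_rng(0).normal(scale=1.0, size=t.size)

# Power spectrogram: a time-frequency map of the acoustic energy.
freqs, times, Sxx = signal.spectrogram(x, fs=fs, nperseg=1024, noverlap=512)

# A persistent resonance is a frequency bin whose power stays high across
# time; average over time and pick the strongest peaks.
mean_power = Sxx.mean(axis=1)
peaks, _ = signal.find_peaks(mean_power, height=5 * np.median(mean_power))
print("Estimated resonant frequencies (Hz):", np.round(freqs[peaks], 1))
```

Finding where those resonances sit, and how they shift when the cavity’s contents change, is the kind of signature the speaker describes backing assets out of.
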

On the lower left-hand side, what you are looking at is an example of another sensor, which is a single-photon counter. What this sensor does, it works in either a passive or an active mode. This is in an active mode where we have used a source to illuminate a dish. This is a satellite dish. It was actually illuminated from above from an altitude of about 1,000 meters. What we are looking at are the returns, the statistical returns from the photon system that came back. The technical community now has the ability to count photons one photon at a time and to do so in a time-resolved way with a resolution of less than 100 picoseconds. What that means is that we can now get not only a two-dimensional representation of what a view is, but also, with that time resolution, you can build up a three-dimensional representation as well. The reason you can get the pictures from the bottom side is through a technique called ballistic photons. That is, you know when the source was illuminated, and you can calculate, then, on the return of the photons, the path of each of those individual photons. So, basically, what this is saying is that you can build up three-dimensional images now. You can, in a sense, look behind objects. It is not always true, because you need a backlight for the reflection. Again, there is more to sensors than visual imagery. That is kind of the fun part of this, as far as the toys and being able to look at the different things we are collecting.

The question then arises, how do we handle all this data? Finally, how do we go ahead and fuse it together? I talked about a paradigm earlier in which one way to collect data is to build bigger pipes and to make bigger computers to try to run through with different kinds of algorithms to assess what is going on. Another way to do this is to let the power of the computer actually help us by fusing the sensor with the computer at the source. It is possible now, because of a technique that was developed about 10 years ago in field programmable gate arrays—that is, being able to hardwire instructions into the registers themselves—to achieve speeds that are anywhere from a hundred to a thousand times greater than what you can achieve using software. That is because you are actually working with the hardware instead of the software to execute the code. Since these things are reprogrammable—that is, they are reconfigurable computers—you can do this not only on the fly, but also, what this means is that you can make the sensors themselves part of the computation at the spot, and take away the need for such a high bandwidth for getting the data back to some kind of unique facility that can help process the information.

Plus, what this gives you the ability to do is to change these sensors on the fly. What I mean by this is, consider a technology such as the software radio. As you know, a radio basically is a receiver, and then there is a bunch of electronics on the radio to change the capacitance, the inductance. All this electronics really does is change the bandwidth of the signal, to sample different bits in the data stream, and that type of thing.
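
As a toy illustration of the software-radio point just made (a sketch under assumed parameters, not anything demonstrated in the talk): once the antenna signal is digitized, “retuning” the receiver is just arithmetic—mix the samples down to baseband, low-pass filter to the channel bandwidth, and decimate—so the same hardware can behave like a different sensor by changing a few numbers.

```python
import numpy as np
from scipy import signal

# Assumed parameters for illustration.
fs = 1_000_000.0          # ADC sample rate: 1 MHz
f_carrier = 250_000.0     # channel we want to "tune" to (Hz)
bandwidth = 10_000.0      # desired channel bandwidth (Hz)

t = np.arange(0, 0.01, 1.0 / fs)
message = np.sin(2 * np.pi * 1_000.0 * t)                      # 1 kHz test tone
rf = (1 + 0.5 * message) * np.cos(2 * np.pi * f_carrier * t)   # AM test signal
rf = rf + 0.1 * np.random.default_rng(1).normal(size=t.size)   # receiver noise

# "Tuning" in software: mix to baseband with a complex oscillator...
baseband = rf * np.exp(-2j * np.pi * f_carrier * t)

# ...select the channel with a low-pass filter...
b, a = signal.butter(5, bandwidth / (fs / 2))
channel = signal.lfilter(b, a, baseband)

# ...and decimate to a rate matched to the channel bandwidth.
decim = int(fs // (4 * bandwidth))
channel = channel[::decim]

# Recover the AM envelope; retuning to another carrier or bandwidth only
# means changing f_carrier / bandwidth above -- no new electronics.
envelope = 2 * np.abs(channel)
print("recovered envelope samples:", np.round(envelope[:5], 3))
```
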
It is now possible—because computers are fast enough, and especially reconfigurable computers—to go ahead and make a reconfigurable computer that can do all the stuff that the wires, and years ago the tubes, and, now, the transistors do. What this means is that, if you have a sensor, say, like a synthetic aperture array, and you want to change the nature of the sensor from detecting things in the RF to being, say, an infrared radiometer, you can do it on the fly. What this provides people the power with is that, if you have platforms that you are going to put out in the field, be they ground-based, sea-, air- or space-based, you don’t have to figure out, 5 to 10 years ahead of time, what these sensors are going to be. That is, if it is going to be an RF sensor, then all that is important is the reception of this, and the bandwidth that you have for the reconfigurable computer. You can change the nature, the very nature, of the sensor on the fly. That is a long explanation of what this chart is, but what this shows is that, by putting the power of the computation next to the sensor, what you do is greatly reduce the complexity of the problem and the data streams that you need. You are still going to need the ability to handle huge amounts of data, because remember, I was talking about 10^14 different nodes. What this does is help solve that problem.

I don’t want to go too much longer on that, but you all know about the distributed arrays. I talked a little bit about the power of that earlier. Basically, you have non-centralized access to each of these, and once you have the positions of these things nailed down in a way—say, by using GPS or, even better, differential GPS—then they don’t even have to be fixed, if you can keep track of them. What is nice about distributed networks is that every node on this should automatically know what every other node knows, because that information is transmitted throughout. So, it degrades very gracefully. What this also gives you the power to do is not only to take information in a distributed sense but also, if you know the position of these sensors well enough, you will have the ability to phase them together, and to be able to transmit. What this means is, if you have very low-power transmitters at each of these nodes, say, even at the watt level, by phasing them together, you get the beauty of phasing, if you can manage to pull this off. It is harder to do at the shorter wavelengths, but at the longer wavelengths, it is easier to do.

Once you have all this data, how are you going to move it around in a fashion where, if it is intercepted, you know that it is still secure? Using new technologies that are starting to arise, such as quantum key distribution, this is really possible. For example, two and only two keys are created, but the keys are only created when one of the wave functions, of the two keys that exist, is collapsed. This is something that arises from the EPR paradox—Einstein, Podolsky, Rosen—and I would be happy to talk to anybody after this about it. It involves quantum mechanics, and it is a beautiful subject, but we don’t really have too much time to get into it. Anyway, keys have been transmitted now 10 kilometers here in the United States, and the Brits, through a collaboration, I think, with the Germans, have transmitted keys up to, I think, 26 kilometers through the air.

We also have the ability to use different technologies to transmit the data that aren’t laser technologies. Why not laser technologies? Well, the atmosphere is very opaque to laser energy under certain atmospheric conditions. We know that RF can transmit through the atmosphere, especially where there are holes in the spectrum. So, to be able to tap into regions of the electromagnetic spectrum that have not been touched before, in the so-called terahertz regime—this is normally about 500 gigahertz up to about 10 terahertz—is possible now, with advances in technology.
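
A small numerical sketch of the phasing argument (geometry and power levels assumed by me, not taken from the talk): if the node positions are known precisely enough to pre-compensate each transmitter’s propagation phase, the watt-class contributions add coherently at the target and the delivered power grows roughly as N^2, instead of N for uncompensated, effectively random phases.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed setup: N watt-class nodes at precisely surveyed positions
# (say, via differential GPS), all illuminating one distant point.
n_nodes = 16
positions = rng.uniform(0, 50, size=(n_nodes, 2))   # nodes scattered over ~50 m
target = np.array([5000.0, 0.0])

# Each node radiates 1 W; its field amplitude at the target falls off as 1/range.
ranges = np.linalg.norm(positions - target, axis=1)
amplitudes = 1.0 / ranges

# Phased (coherent): each node pre-compensates its propagation phase, so the
# field amplitudes simply add at the target.
coherent_power = np.sum(amplitudes) ** 2

# Unphased (incoherent): propagation phases are effectively random; average
# the delivered power over many random phase realizations.
rand_phases = rng.uniform(0, 2 * np.pi, size=(2000, n_nodes))
incoherent_power = np.mean(
    np.abs(np.sum(amplitudes * np.exp(1j * rand_phases), axis=1)) ** 2)

print(f"coherent / incoherent power ratio: {coherent_power / incoherent_power:.1f}")
print(f"(for {n_nodes} roughly equal nodes the gain is about {n_nodes}x)")
```
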
What I have shown here is something that was initially developed at SLAC. It is called the klystrino, which is based on their klystron RF amplifier. The key thing here is that the electron beam is not a pencil beam but, rather, a sheet beam, which spreads out the energy density. So, you don’t have a lot of the problems that you used to have with the older-type tubes.

So, we talked a little bit about handling, and we talked about the sensors. Let me talk about the data mining. What is envisioned here is having some kind of what we call a distributed data hypercube. That is, it is a multidimensional cube with all sorts of data on it. On one axis—you have probably heard in the news, this is Poindexter’s push at DARPA—tapping everything from credit cards on one axis to airline transactions on another axis, another axis being perhaps telephone intercepts, another axis being RF emissions, another axis perhaps visual information or human intelligence. Tapping into that, and being able to do so in a legal way—because there are large legal implications in this, as well as things that are prohibited by statute, as it turns out, especially when you talk about intelligence databases—to be able to render that using different types of algorithms and then be able to compute that and then feed that back in does two things. First of all, it gives you a state of where you are at today and, second of all, it tries to predict what is going to happen. I will give you a very short example, of about five minutes here, of something that is going on that is taking disparate types of databases to try to do something like this.

So, that is kind of the mining problem, and there are various ways to mine large types of data. Let me talk to you about two examples here where, again, you are not relying just on the algorithm itself, but you are relying on the computer to do this for you. This is an example of a program called GENIE that, in fact, was developed by a couple of the individuals that are here in this room. It is a genetic algorithm that basically uses an array of kernels to optimize the algorithm that you are going to render, and it does so in a sense where the algorithm evolves to find you the optimal algorithm. It evolves because, what you do is, you tell the algorithm at the onset, or tell the computer, what are the features, or some of the features, that you are looking for. On the left-hand side, this is an example of—this is an aerial overhead here of San Francisco Bay. What you want to do is look for golf courses on this. Let’s say that we paint each of the known golf courses green and then we tell the algorithm, okay, go off and find those salient characteristics which define what a golf course is. Now, as it turns out, it is very tough to do, because there is not a lot of difference between water and golf courses. A lot of golfers, I guess, think that is funny. The reason is because of the reflectivity and the edge of the surface, which has no straight lines in it. So, it is kind of tough. Especially when you are talking about something like global situational awareness, if you find a golf course, you have got to make absolutely sure that it is a golf course. You can imagine that this could be something else that you are going after. What you do is, you let the computer combine those aspects, especially if you have hyperspectral data. That could be information in the infrared that may show the reflectivity, for example, of chlorophyll. It could be information about the edges. The computer assembles this data, using, again, this basis of kernels that you have, and comes up with and evolves a unique optimized algorithm to search for these things.
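
To make the evolve-an-algorithm idea concrete, here is a deliberately tiny sketch in the same spirit (my own toy code, not GENIE itself; the scene, the operator bank, and the classifier are all invented for illustration): a bank of simple image-operator “genes,” a chromosome that selects a subset of them, a least-squares pixel classifier trained on the selected features, and a plain genetic algorithm that evolves the selection against labeled training pixels.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(3)

# --- Synthetic two-band "scene" (illustration only): a rectangular region is
# brighter in band 2, standing in for a chlorophyll-like response.
h = w = 64
band1 = rng.normal(0.5, 0.1, (h, w))
band2 = rng.normal(0.3, 0.1, (h, w))
truth = np.zeros((h, w), bool)
truth[20:40, 10:30] = True            # the labeled "golf course" pixels
band2[truth] += 0.4

# --- A small bank of image-operator "genes": each maps the scene to a feature plane.
def feature_bank(b1, b2):
    return np.stack([
        b1, b2,
        ndimage.uniform_filter(b1, 5), ndimage.uniform_filter(b2, 5),
        ndimage.sobel(b1), ndimage.sobel(b2),
        b2 - b1, (b2 + 1e-6) / (b1 + 1e-6),
    ])

features = feature_bank(band1, band2)          # (n_feat, h, w)
n_feat = features.shape[0]
X = features.reshape(n_feat, -1).T             # pixels x features
y = truth.ravel().astype(float)

def fitness(mask):
    """Train a least-squares pixel classifier on the selected features and
    score it on the labeled pixels, with a small parsimony penalty."""
    if not mask.any():
        return 0.0
    A = np.column_stack([X[:, mask], np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    accuracy = ((A @ coef > 0.5) == truth.ravel()).mean()
    return accuracy - 0.01 * mask.sum()

# --- A plain genetic algorithm over feature subsets.
pop = rng.random((30, n_feat)) < 0.5
for generation in range(40):
    scores = np.array([fitness(ind) for ind in pop])
    elite = pop[np.argmax(scores)].copy()
    # tournament selection of parents
    idx = [max(rng.choice(len(pop), 3, replace=False), key=lambda i: scores[i])
           for _ in range(len(pop))]
    parents = pop[idx]
    # uniform crossover followed by bit-flip mutation
    a, b = parents[::2], parents[1::2]
    cross = rng.random(a.shape) < 0.5
    children = np.concatenate([np.where(cross, a, b), np.where(cross, b, a)])
    children ^= rng.random(children.shape) < 0.02
    children[0] = elite                          # elitism: keep the best so far
    pop = children

best = max(pop, key=fitness)
print("selected feature indices:", np.flatnonzero(best),
      " fitness:", round(fitness(best), 3))
```

The point of the exercise is the one made in the talk: the search is over combinations of operators, and the surviving combination can be read back out (here, just the list of selected features), unlike a black-box fit.
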
Now, this is different from neural nets, because you can actually go back through and, through a deconvolution process, find out what the algorithms are, and it does make sense when you look at it. Basically, by using the power of computation to help you, what you are doing is reducing the complexity of the problem. Now, once you have taken that, you can go to the next step, and you can accelerate that algorithm, not just to run on a computer, but by hardwiring it into a reconfigurable computer, with that field-programmable gate array I told you about. What I will do is, I will show you an example of how you can do things on a—I don’t know if I am going to be able to do this. Say that you have streaming data coming in as video arrays. What is occurring here is that we have asked it to locate the car that you will see—that is the one in green on the right-hand side—at rates that are approaching 30 frames per second. The point is that, by marrying different types of technology, we will be able to help out and do things in a near-real-time manner.

There are other techniques for pulling very small or high—very low, I should say—signal-to-noise data out. What I have shown here is pulling some data out by using a template, on the lower left-hand side. On the right-hand side, it is looking at some spectrographic data and being able to pull out some chemical species. So, these are all examples of data fusing techniques.

Let me really wrap this up and leave time for some questions here. The whole goal of this is to be able to synthesize what we call a complete view of a situation from disparate databases. It is trying to pull things together to give people a wide range of ability, be it from the national scene down to the person—it could be a law enforcement officer—who may only want to know things that are happening 30 or 40 feet around him. What I have done is, I have put double-headed arrows on there, to show that there should be a capability for those sensors to be tasked themselves, which kind of makes the people who run the sensors kind of scared. On the other hand, if you don’t have that ability, then you don’t have the ability to allow feedback into the system.

There is an example of something going on like this, that is not nearly as complex, where some Forest Service data on wildfires—where they start, how they originate—is being combined with climatology data, looking at soil wetness, wind data and, what else, soil information. Then, that is combined with Department of Justice data on known arsonists, where they started forest fires before and where they are located now. What this is attempting to do, with this very small database, is to combine these disparate databases to predict—first of all, to give you the situation where we are now, and then perhaps to be able to predict if an arsonist were to strike at a particular place. So, this is a real, no kidding, national security problem, forest fires, because you can imagine—well, we, at Los Alamos ourselves, had devastating fires two years ago that nearly wiped out a national lab. So, by using small test problems like this, we will show not only the larger problems that will arise, but also, hopefully, that doing something like this is not merely a pipe dream.

In conclusion, I think it has been determined that a need for global situational awareness really exists. Again, this is a synthesis and an integration of space situational awareness, battlefield situational awareness, and law enforcement situational awareness, to be used in anti-terrorist activities. A lot of work needs to be done. This is not a one-lab or a one-university project. It is something that I think will really tap the S&T base of the nation. The key here is seamless integration. It is the data, and it is not really the sensors; it is integrating it in a way that makes sense, in a seamless sense. So, that is the talk. It is accelerated about 10 minutes faster than I normally give it. Might I answer any questions you might have?

AUDIENCE: In that golf course example, you actually had trained data, spectral data from real golf courses, and trained the model on that and predicted other golf

sampling. We may be able to get back to a place where we can do some of that, but we lose speed with that. We may be able to regain some of it if we do some early estimations and refine our complexity parameter approach from that. So, it is heavily computer-dependent, but we may be able to do it just in some early stages, and more intelligently select models from the vast search that data mining involves.

In data mining, you come up with a final model and people interpret it. I always have trouble with that, because you have looped over a finite data set with noise in it. You have searched over billions of possible combinations of terms. The one that fell to the bottom, or that won, just barely beat out hundreds of other models that had perhaps different parameters, and the inputs themselves are highly correlated. So, I tend to think of models as useful and try not to think of the interpretability as a danger, but I know that is not the case with everyone here, in terms of the importance of interpretability.

Well, there are a couple of metrics that have these properties. Jianming Ye introduced generalized degrees of freedom, where the basic idea is to perturb the output variable, refit your procedure, and then measure the changes in the estimates, with the idea that the flexibility of your process reflects how complex it truly is. If your modeling process was to take a mean, then that is going to be fairly inflexible. It is certainly more subject to outliers than a median or something, but it is a whole lot less flexible to changes in the data than a polynomial network or a decision tree or something like that. So, if your modeling procedure can respond to noise that you inject, and responds very happily, then you realize it is an over-fit problem.

This reminds me of when I was studying at the University of Virginia, and one of the master’s students in my group was working with some medical professionals across the street at the University of Virginia hospital. They were trying to measure heart output strength for young kids with very simple-to-obtain features, like the refresh rate for your skin when you press it—that is not the technical word for it, but how quickly it becomes pink again after you squeeze it—or the temperature of your extremities and so forth, and see if they could get the same information that you could get through these catheters that are invasive and painful. They determined that they could, but I remember a stage along the way when he was presenting some intermediate results on an overhead. This is one of the dangers of graphs on overheads. The nurse and the head nurse and the doctor were going over it and saying, “I see how temperature rises and you get this change.” My colleague realized, to his horror, that he had the overhead upside down and backwards.

He changed it, and it took them about five seconds to say, “Oh, I see how that works, too. It is completely the opposite.” So, we want to believe and, if we are extremely flexible with resultant changes to the data, then maybe we are too flexible, too easily fooled, too much over-fit.

Another method I won’t look at in detail, but I will just mention, is Tibshirani and Knight’s covariance inflation criterion. Instead of randomly perturbing the outputs, they shuffle the outputs and look at the covariance between old and new estimates. The key idea here is to put a loop around your whole procedure and look at its sensitivity. The first person I heard doing that was Julian Faraway, at an Interface conference in 1991, in his RAT regression analysis tool; at the time, a two-second analysis took two days to get resampling results on, but I thought it was a very promising approach. I saw him later. He was one of those statisticians trapped in a math department, so he had gone away from doing useful things and toward doing more theoretical things out of peer pressure. Anyway, I was excited about this and it was, I think, a good idea.

So, let me explain a little bit about what generalized degrees of freedom are. With regression, the number of degrees of freedom is the number of terms. If we extrapolate from that, you see that you count the number of thresholds in a tree, the number of splines in a MARS, or something. People had noticed that the effect can be more like three, as I mentioned before, for a spline, or even less than one for some particularly inefficient procedures. Well, if we instead generalize from linear regression in a slightly different way, and notice that the degrees of freedom are also the trace of the hat matrix—which is the sensitivity of the estimates to changes in the output—then we can use that as a way of measuring. We can empirically perturb the output, and then refit the procedure. This is similar to what is done with splines, I understand.
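
A minimal sketch of the perturb-and-refit recipe just described (my own toy implementation with made-up data, using scikit-learn estimators for convenience): add small noise to the response, refit the whole procedure, and estimate each observation’s sensitivity of fitted value to its own response; the sum of those sensitivities is the generalized degrees of freedom, which for a linear fit is just the trace of the hat matrix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy data (synthetic, for illustration only).
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)

def gdf(make_model, X, y, n_perturb=50, tau=0.25):
    """Ye-style generalized degrees of freedom by output perturbation:
    GDF = sum_i d E[yhat_i] / d y_i, estimated by regressing each fitted
    value on the noise added to its own response across perturbations."""
    deltas = rng.normal(scale=tau, size=(n_perturb, len(y)))
    fits = np.empty_like(deltas)
    for k in range(n_perturb):
        fits[k] = make_model().fit(X, y + deltas[k]).predict(X)
    # per-observation sensitivity: cov(yhat_i, delta_i) / var(delta_i)
    slopes = [np.cov(fits[:, i], deltas[:, i])[0, 1] / tau**2 for i in range(len(y))]
    return float(np.sum(slopes))

print("linear regression GDF ~", round(gdf(LinearRegression, X, y), 1),
      " (trace of hat matrix = p + 1 =", p + 1, ")")
print("5-leaf tree GDF ~",
      round(gdf(lambda: DecisionTreeRegressor(max_leaf_nodes=5, random_state=0), X, y), 1))
```

On this toy data the linear fit comes out near p + 1, while the tree’s GDF is typically well above its nominal number of splits, which is the kind of gap quantified later in the talk.
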

So, put a loop around the process. That is nice, because the process itself can be a whole collection of procedures—outlier detection, all sorts of things can be thrown into this loop. So, graphically, we have a modeling process, and we have inputs and we have an output variable. We add some noise to it, and we record the output variable and the new forecast based on that perturbed output variable. I kind of like the fact that the perturbed output, ye, also spells the name of the fellow that came up with it.

So, you have 100 observations. I know that is a heretically small number for this gathering. You have 100 observations and 50 perturbations. Then, you would record the perturbed output and its change, and you would look at the sensitivity of change in the output and change in the estimate. If I were doing it, I would just naturally—for each perturbation experiment, I would calculate a different number. I do want to point out that Ye, I think, had a good idea, and he took the matrix and sliced it this way and said, well, I am going to look at the effect on observation one of changes over time, which seems to be a little bit more robust than measuring up all of these within an experiment. Also, interestingly, you can then assign complexity to observations. Now, I excluded that graph from this, but if you are interested, I can show, on a sample problem, where we did that.

I will conclude, in these last few minutes, with a very simple problem. This is a decision tree, and it is the generating mechanism for the data. Naturally, the tree is going to do pretty well on it.

It is amazing how many test problems for trees involve trees as their source. Here is where we have added noise to it. The noise is at a level that, you can see, pretty much obscures the smallest structural features, but doesn’t obscure others. So, there are some features that are easy to discern and others that are harder, and that seems to be picked up very nicely by the GDF metric. Now, out of this surface we have sampled 100 training samples, and there are other experiments with 1,000 and so forth. I am just going to show you sort of the very initial experiments, and make only a few points. This is very much ongoing.

Out of those samples, we built trees, and we also bootstrapped and put five trees together, that went down different amounts. They either had four leaf nodes or eight leaf nodes in this example. You can see, if we built five four-leaf trees and put them together, you still get a tree. You still get something that could be represented as a single tree. It would look rather complex. So, I can show what that looks like, if you are interested, but you can see here that the bagging procedure does gentler stairsteps than a raw tree. That tends to be good for generalization, the smoothness. So, bagging helps to smooth trees, and that is, I think, the major reason for the generalization improvement. You can see that, when you go to more complex trees, eight leaf nodes would be eight estimation surfaces. There are only five in this data. So, four is sort of under-powerful and eight is over-powerful, and you can see that this one completely misses the smallest piece of structure up here. This one picks up on it, but it also picks up on another structure that isn’t there. So, it is the bias-variance type of trade-off.

Here is, in essence, the end result. We also did experiments with eight noise variables being added in. So, the tree structure depends on two variables, but now you have thrown in eight distracting noise variables. So, how does that affect things? Then, we have, on this slide, the regression as well. As the number of parameters increases, the measured generalized degrees of freedom for regression is almost exactly what Gary would tell you. It is basically one for one. Interestingly, here, with the trees, a single tree—this purple line here—has an estimated rough slope of four, meaning that, on this problem, each split that was chosen by the tree algorithm has roughly the equivalent descriptive power of four linear terms, or it is using up the data at four times the rate of a linear term. If you bag those trees, it reduces to a slope of three. If you add in noise variables, it increases the slope to five, and then, if you bag those trees that are built on noise variables, it brings it back to four.

So, just to summarize, adding in distracting noise variables increases the complexity of the model. The model looks exactly the same in terms of its structure and its size and how long it takes to describe the model, but because it looked at eight variables that could only get in its way, it was effectively more complex. Then, the other point is that the bagging reduces the complexity of the model. So, the bagged ensemble—five trees—is less complex than a single tree, and that is consistent with the Occam’s razor idea that reduced complexity will increase generalizability after you reach that sort of saturation point.
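
A small sketch of the bagging comparison above (synthetic data and scikit-learn estimators chosen by me; this does not reproduce the talk’s exact experiment): fit a single four-leaf tree and a bagged ensemble of five four-leaf trees to a noisy piecewise-constant function and compare the fitted curves.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(1)

# Noisy samples from a piecewise-constant, "tree-like" target (synthetic).
x = np.sort(rng.uniform(0, 10, 100)).reshape(-1, 1)
plateaus = np.array([0.0, 3.0, 1.0, 4.0, 2.0])          # five structural levels
y_true = plateaus[np.minimum((x[:, 0] / 2).astype(int), 4)]
y = y_true + rng.normal(scale=0.7, size=len(x))

single = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(x, y)
bagged = BaggingRegressor(DecisionTreeRegressor(max_leaf_nodes=4),
                          n_estimators=5, random_state=0).fit(x, y)

grid = np.linspace(0, 10, 200).reshape(-1, 1)
# Averaging five different four-leaf trees gives many smaller steps --
# gentler stairsteps than any one raw tree, the smoothing credited above.
print("distinct prediction levels, single tree:", len(np.unique(single.predict(grid))))
print("distinct prediction levels, bagged five:", len(np.unique(bagged.predict(grid).round(6))))
```
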

To summarize, I have a lot of little points to make. Bundling almost always improves generalization, and I think that different model families are a great source of the diversity that you need for the bundling. If we measure complexity as flexibility of the procedure, then we sort of revive or answer—partially answer—some of the problems with the general intuition of complexity being related to over-fit. So, the more a modeling process can match an arbitrary change made to its output, the more complex it is by this measure. Complexity clearly increases with distracting variables. These are variables that are considered during the model search process, which may not appear at all in the model at the end. Actually, a variable would have to appear, at least somewhat, to affect the behavior, but the size of the model could be the same between—two candidate models could have the exact same number of parameters and so forth, but one could be more complex because of what it looked at. The complexity is expected to increase, obviously, with parameter power and the thoroughness of your search, and to decrease with the use of priors and shrinking, and if there is clarity in the structure of the data. In our early experiments, I certainly thought the complexity would decrease as you had more data. That seems to be the case with some methods and not with others. It seems to actually increase with decision trees and decrease with neural nets, for instance. By the way, neural nets, on this same problem, had about 1.5 degrees of freedom per parameter. It could be that local methods somehow are more complex with more data, and methods that fit more global models are not. So, that is an interesting question. Model ensembles usually have effective complexity less than their components by this empirical measure. The nice thing about the GDF is that you now can more fairly compare very diverse procedures, even multi-step procedures, as long as you can put a loop around it. That concludes my talk. Thank you.

MS. KELLER-MCNULTY: Questions?

QUESTION: I have one. When you talk about the complexity of flexibility, are you meaning robustness? I am confused about the local methods being more complex.

MR. ELDER: I am, too. It is confusing. The idea behind GDF, the idea behind these empirical measures of complexity, is that you want to be able to measure the responsiveness of your procedure. Obviously, you don’t want something that, every time you change the task a little bit, it is good for that one thing.

You know, it is the universal health oil. I really have hair problems. Well, it is good for that, and it is good for your liver, too. Obviously, it is overfit. You know, where is that trade-off? So, the measure is, if you arbitrarily change the problem, or you change the problem a little bit, how responsive is it to suddenly get the same accuracy and so forth? Again, it was a little bit confusing to me, but the trees, once they partition the data, are only concerned about their little area, and they are not paying attention to the big picture, whereas, with models that are basically fitting a global model, data everywhere helps, and it tends to rein it in a little bit, is my best guess. It is certainly an interesting area.

QUESTION: Do you have any extrapolation from your results? [Comments off microphone.]

MR. ELDER: The question is how things might change with many variables and massive data sets. I would like to revisit a problem we did. We participated in the KDD cup challenge, I guess, two summers ago, and that had 140,000 variables, which is a lot for us, but only 2,000 cases. The final model, the one that won the contest, used three variables. Well, three variables seems like a simple model, but it was a huge search. It would be nice to put that under this microscope and see what happens. Again, my intuition was that more data would actually help reduce the complexity, but it is kind of wide open.

AUDIENCE: [Question off microphone.]

MR. ELDER: Here, I have just talked about combining the estimates, and some techniques need a little squeezing to get them into the right shape. Take a tree. You might need to do some kind of robust estimation. If it is a classification problem, you might get a class distribution and use that to get a real value—to turn it into a real value so that you can average it in. What I haven’t talked about is that there are whole areas for collaboration amongst these different techniques that we have explored in the past. For instance, neural nets don’t typically select variables. They have to work with whatever you give them, but other methods, like trees and stepwise regression and polynomial regressions, select variables as a matter of course. Typically, you can do better if you use the subset they select when you hand it off to the neural nets, and some methods are very good at finding outliers. There is a lot more that can go on between the methods. Up here, I talked about them just in terms of their final output, but yes, you do have to work sometimes to get them into a common language.

AUDIENCE: [Question off microphone.] I am not sure how what you are talking about here answers that.

MR. ELDER: Good point. Criticisms of Occam’s razor work with any measure of complexity—that is Pedro’s point. The one that bothered me the most is about bundles. They are obviously more complex. They just look more complex, and yet, they do better. How can that be? In fact, it is very easy to get many more parameters involved in your bundle than you have data points and not have over-fit. They are not all freely, simultaneously modified parameters, though. So, there is a difference.

Under this measure, the things that are more complex in GDF tend to over-fit more than things that aren’t. So, it is a vote on the side of Occam’s razor. I haven’t addressed any of the other criticisms, but in the experiments we have done here, it is more like you would—

AUDIENCE: Like, for example, when you approximate an ensemble with one model back again, this model is still a lot more complex, because a lot of this complexity is apparent. It just seems that you still get the more complex models in that.

MR. ELDER: But it is apparent complexity.

AUDIENCE: Yes, but when you replace the apparent complexity with the actual complexity, not just measuring to find the number of parameters. [Comment off microphone.]

MR. ELDER: I certainly would be eager to see that. Thanks.

Report from Breakout Group

Instructions for Breakout Groups

MS. KELLER-MCNULTY: There are three basic questions, issues, that we would like the subgroups to come back and report on. First of all, what sort of outstanding challenges do you see relative to the collection of material that was in the session? In particular there, we heard in all these cases that there are real, specific constraints on these problems that have to be taken into consideration. We can’t just assume we get the process infinitely fast, whatever we want. The second thing is, what are the needed collaborations? It is really wonderful today; so far, we are hearing from a whole range of scientists. So, what are the needed collaborations to really make progress on these problems? Finally, what are the mechanisms for collaboration? You know, Amy, for example, had a whole list of suggestions with her talk. So, the three things are the challenges—what are the scientific challenges—what are the needed collaborations, and what are some ideas on mechanisms for realizing those collaborations?

Report from Integrated Data Systems Breakout Group

MS. KELLER-MCNULTY: Our discussion was really interesting and almost broke out in a fistfight at one point, but we all calmed down and got back together. So, having given you the three questions, we didn’t really follow them, so let me go ahead and sort of take you through our discussion. When we did try to talk about what the challenges were, our discussion really wandered into the fact that there are sort of two ways that you can look at these problems. Remember, our session had to do with the integration of data streams. So, you can look at this in a stovepipe manner, where you look at each stream independently and somehow put them together, hoping the dependencies will come out, or you actually take into account the fact that these are temporally related streams of information and try to capture that. The thought is that, if one could actually get at that problem, that is where some significant gains could be made. However, it is really hard, and that was acknowledged in more ways than one as well.

That led us into talking about whether or not the only way to look at this problem domain is very problem-specific. Is every problem different, or is there something fundamental underneath all of this that we should try to pull out? In particular, should we be trying to look at, I am going to say, mathematical abstractions of the problem and the information, and how the information is being handled, to try to get at ways to look at this? What are the implications, and what database issues, database design issues, could be helpful here? There clearly was no agreement on that, ranging from, there is no new math to be done, math isn’t involved at all, to, in fact, there is some fundamental mathematics that needs to be done. Then, as we dug deeper into that and calmed down a little bit, we kind of got back to the notion that what is really at issue here is how to integrate the fundamental science into the problem.

If I have two streams of data, one coming from each sensor, and I am trying to put them together, it is because there is some hidden state that I am trying to get at. Neither sensor is, perhaps, modeling the physics of that hidden state. So, how do I start to characterize that process and take it into account? That really means that I have to significantly bring the science into the problem. So, then, we were really sounding quite patriotic from a scientific perspective.

One of our colleagues brought up the comment that, you know, this philosophical question between, am I modeling the data or am I modeling the science and the problem, has been with us for a long time. How far have we come in that whole discussion and that whole problem area since 1985? That had us pause for a minute: where are we compared to what we could do in 1985, and how is it different? In fact, we decided we actually are farther ahead in certain areas. In our ability to gather the data, process the data, model it, and actually use tools, we clearly are farther ahead. A really important point, which actually makes the PowerPoint comment not quite so funny, is that our ability to communicate, remote communication, distributed communication, all modes of communication, actually ought to work in our favor in this problem area as well. However, on the philosophical issue of how to integrate science and technology and mathematics and all these things together, it is not clear we are all that much farther ahead. It is the same soap box we keep getting on.

Then it was brought out that, well, maybe we are a little bit farther ahead in our thinking, because we have recognized the powerful use of hierarchical models and the hierarchical modeling approach, going from the phenomenology all the way up through integrating the science and putting the processing and tools together. It is not simply a pyramid; it is a dynamic pyramid. If we take into account the changing requirements of the analyst, if you will, the end user, the decision maker, we have to realize that there is a hierarchy here, but it is a hierarchy that is very dynamic in how it is going to change and move. There are actually methods, statistical and mathematical methods, that have evolved in the last 10 or 15 years that try to take this hierarchical approach. So, we thought that was pretty positive.

There is a really clear need, as soon as we go into this mode of trying to integrate multiple streams, to recognize that human expertise must be in the loop, and that the decision process and the decision environment, tied back to the domain specificity of what you are trying to do, have to be part of it. In a couple of the earlier sessions, we actually heard about the development of serious platforms for data collection without any regard to how that information was going to be integrated or how it was going to be used. Through some more serious collaborations, which I will get into in a second, maybe we can really influence the whole process, to design better ways to collect information, better instruments, things that are more tailored to whatever the problem at hand is.

I thought there was a really important remark made in our group about how, if you are really just looking at a single data stream and a single source of information, industry is really driving that single-source problem. They are going to build the best, fastest, most articulate sensor. What they are probably not going to nail is the fusion of this information. If you couple that with the fact that, if you let that fusion be done ad hoc, you are now going to have just random methods coming together with a lot of false positives, and then we got into the discussion of privacy invasion and how you balance all of that, we really need the serious thought, the serious integration, the multidisciplinary collaboration, to be developing the methods, overseeing the methodological development, as well as being able to communicate back to the public what is going on here. So, I thought that was kind of interesting.
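
As a minimal illustration of the "two sensors, one hidden state" point above, here is a small sketch, assuming a linear-Gaussian model, of a scalar Kalman filter that fuses two noisy streams observing the same underlying state. Every parameter value and name is made up for illustration; nothing here comes from the session itself.

    import numpy as np

    def fuse_two_streams(z1, z2, r1=0.5, r2=1.0, q=0.01, phi=1.0):
        """Scalar Kalman filter for a hidden state x_t = phi * x_{t-1} + w_t
        observed by two sensors: z1_t = x_t + v1_t and z2_t = x_t + v2_t."""
        n = len(z1)
        x, p = 0.0, 1.0                  # state estimate and its variance
        H = np.array([1.0, 1.0])         # both sensors observe the state directly
        R = np.diag([r1, r2])            # the two sensors' noise variances
        out = np.empty(n)
        for t in range(n):
            # Predict the hidden state forward one step
            x_pred = phi * x
            p_pred = phi * p * phi + q
            # Update with the stacked two-sensor measurement
            z = np.array([z1[t], z2[t]])
            S = p_pred * np.outer(H, H) + R     # innovation covariance
            K = p_pred * H @ np.linalg.inv(S)   # Kalman gain for the scalar state
            x = x_pred + K @ (z - H * x_pred)
            p = (1.0 - K @ H) * p_pred
            out[t] = x
        return out

    # Hypothetical usage: two noisy views of the same slowly varying quantity.
    # fused = fuse_two_streams(sensor_a_readings, sensor_b_readings)

The point of the sketch is only that the fusion step requires a model of the hidden state's dynamics and of each sensor's relationship to it, which is exactly where the science has to be brought into the problem.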

So, on collaboration: there needs to be very close collaboration among areas like systems engineering, hardware and software design, statistics, mathematics, computer science and database-type things, and basic science. That has to come together. Now, that is not easy because, again, we have been saying forever that this is how we are going to solve these problems. Then the question comes into play, what are the mechanisms by which we can try to do that? We didn't have a lot of good answers there. One idea was, is it possible to mount certain competitions that really get at serious fusion of information and that would require multidisciplinary teams like this to come together? There was a suggestion that at some of our national institutes, such as SAMSI, the Statistical and Applied Mathematical Sciences Institute, one of the new, though not solely, NSF-funded institutes, perhaps there could be some sort of focus here. I think that gets back to Doug's comment, which I thought was really good, that regular meetings, as opposed to one-off workshops, are the way we are probably going to foster relationships between these communities. Clearly, funding is required for those sorts of things. Can we get funding agencies to require collaborations, and how do they then monitor and mediate how that happens?

Then, one comment that was made at the end was that, if we just focus on statistics and statistics graduate training, there is a lot of question as to whether we are actually training our students such that they can really begin to bite off these problems. Do they have the computational skills necessary, and the ability to do the collaborations? I think that is a big question. My answer would be, I think in some of our programs we are, and in others we are not, and how do we balance that?

Just one last comment. We spoke at a very high level, and just at the end of our time (we then sort of ran out of time) it was pointed out that, if you really think of the data mining area and data mining problems, there has been a lot done on supervised and unsupervised learning. I think we understand pretty well that these are methods that have good predictive capabilities. However, it seems that the problem of the day is anomaly detection, and I really think that there, from a data fusion point of view, we really have a dearth of things we know how to do. So, the ground is fertile, the problems are hard, and somehow we have got to keep the dialogue going.
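
As a closing illustration of the anomaly detection point, here is a small sketch, under the simplifying assumption of a single, roughly stationary numeric stream, of flagging observations that drift far from an exponentially weighted running estimate. The smoothing constant, threshold, and warm-up period are arbitrary illustrative choices, and nothing this simple addresses the data fusion setting the group had in mind.

    import numpy as np

    def flag_anomalies(stream, alpha=0.05, k=4.0, warmup=30):
        """Return indices of points more than k estimated standard deviations
        away from an exponentially weighted running mean."""
        mean, var = 0.0, 1.0
        flags = []
        for i, x in enumerate(stream):
            if i >= warmup and abs(x - mean) > k * np.sqrt(var):
                flags.append(i)   # skip the update so the outlier does not
                continue          # contaminate the running estimates
            diff = x - mean
            mean += alpha * diff
            var = (1.0 - alpha) * (var + alpha * diff * diff)
        return flags

    # Example: inject a few spikes into Gaussian noise and see them flagged.
    rng = np.random.default_rng(2)
    data = rng.normal(size=1000)
    data[[200, 600]] += 8.0
    print(flag_anomalies(data))   # expected to include 200 and 600

Real fused streams would call for the kind of joint, science-informed modeling discussed above; a sketch like this only marks the single-stream baseline that such methods have to beat.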