Suggested Citation: "Report from Breakout Group." National Research Council. 2004. Statistical Analysis of Massive Data Streams: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/11098.

Report from Breakout Group

Instructions for Breakout Groups

MS. KELLER-McNULTY: There are three basic questions, or issues, that we would like the subgroups to come back and report on. First of all, what sort of outstanding challenges do you see relative to the collection of material that was in the session? In particular, we heard in all these cases that there are real, specific constraints on these problems that have to be taken into consideration. We can't just assume we get to process infinitely fast, whatever we want.

The second thing is, what are the needed collaborations? It is really wonderful today. So far, we are hearing from a whole range of scientists. So, what are the needed collaborations to really make progress on these problems?

Finally, what are the mechanisms for collaboration? You know, Amy, for example, had a whole list of suggestions with her talk. So, the three things are the challenges, what are the scientific challenges, what are the needed collaborations, and what are some ideas on mechanisms for realizing those collaborations?

Report from Integrated Data Systems Breakout Group

MS. KELLER-McNULTY: Our discussion was really interesting and almost broke out in a fist fight at one point, but we all calmed down and got back together. So, having given you the three questions, we didn't really follow them, so let me go ahead and sort of take you through our discussion.

When we did try to talk about what the challenges were, our discussion really wandered into the fact that there are sort of two ways that you can look at these problems. Remember, our session had to do with the integration of data streams. So, you can look at this in a stovepipe manner, where you look at each stream independently and somehow put them together, hoping the dependencies will come out, or you actually take into account the fact that these are temporally related streams of information and try to capture that. The thought is that, if one could actually get at that problem, that is where some significant gains could be made. However, it is really hard, and that was acknowledged in more ways than one as well.

That led us into talking about whether or not the only way to look at this problem domain is very problem-specific. Is every problem different, or is there something fundamental underneath all of this that we should try to pull out? In particular, should we be trying to look at, I am going to say, mathematical abstractions of the problem and the information, and how the information is being handled, to try to get at ways to look at this? What are the implications and database issues, database design issues, that could be helpful here? There clearly was no agreement on that, ranging from, there is no new math to be done, math isn't involved at all, to, in fact, there is some fundamental mathematics that needs to be done.

Then, as we dug deeper into that and calmed down a little bit, we got back to the notion that what is really at issue here is how to integrate the fundamental science into the problem.

If I have two streams of data, one coming from each sensor, and I am trying to put them together, it is because there is some hidden state that I am trying to get at. Neither sensor is perhaps modeling the physics of that hidden state. So, how do I start to try to characterize that process and take that into account? That really means that I have to significantly bring the science into the problem. So, then, we were really sounding quite patriotic from a scientific perspective.

One of our colleagues brought up the comment that, you know, this philosophy between, am I modeling the data or am I modeling the science and the problem, has been with us for a long time. How far have we come in that whole discussion and that whole problem area since 1985? That had us take pause for a minute, like, where are we compared to what we could do in 1985, and how is it different?

In fact, we decided, we actually are farther ahead in certain areas. In our ability to gather the data, process the data, to model and actually use tools, we clearly are farther ahead. A really important issue, which actually makes the PowerPoint comment not quite so funny, is that our ability in communication, remote communication, distributed communication, modes of communication, actually ought to work in our favor in this problem area as well. However, on the philosophical issue of how to integrate science and technology and mathematics and all these things together, it is not clear we are all that much farther ahead. It is the same soap box we keep getting on.

Then, it was really brought out, well, maybe we are a little bit farther ahead in our thinking, because we have recognized the powerful use of hierarchical models and the hierarchical modeling approach, looking at going from the phenomenology all the way up through integrating the science, putting the processing and tools together. It is not simply a pyramid; it is a dynamic pyramid. If we take into account the changing requirements of the analyst, if you will, the end user, the decision maker, we have to realize that there is a hierarchy here, but it is a hierarchy that is very dynamic in how it is going to change and move. There are actually methods, statistical and mathematical methods, that have evolved in the last 10 or 15 years that try to look at the hierarchical approach. So, we thought that was pretty positive.

There is a really clear need, as soon as we go into this mode of trying to integrate multiple streams, to recognize that expertise is needed: the human must be in the loop, and the decision process, the decision environment, goes back to the domain specificity of what you are trying to do.

In a couple of the earlier sessions, we actually heard about the development of serious platforms for data collection, without any regard to how that information was going to be integrated, or how it was going to be used. Through some more serious collaborations, which I will get into in a second, maybe we can really influence the whole process, to design better ways to collect information, better instruments, things that are more tailored to whatever the problem at hand is.

I thought there was a really important remark made in our group about how, if you are really just looking at a single data stream and a single source of information, industry is really driving that single-source problem. They are going to build the best, fastest, most articulate sensor. What they are probably not going to nail is the fusion of this information. If you couple that with the fact that, if you let that be done ad hoc, you are going to have just random methods coming together with a lot of false positives (and then we got into the discussion of privacy invasion, and how you balance all of that), we really need the serious thought, the serious integration, the multidisciplinary collaboration, to be developing the methods, overseeing the methodological development, as well as being able to communicate back to the public what is going on here. So, I thought that was kind of interesting.

So, collaboration. There needs to be very close collaboration in areas like systems engineering, hardware and software design, statistics, mathematics, computer science, database-type things, and basic science. That has to come together. Now, that is not easy because, again, we have been saying forever that this is how we are going to solve these problems.

Then that brings into play, what are the mechanisms we can use to try to do that? We didn't have a lot of good answers there. One idea was, is it possible to mount certain competitions that really are getting at serious fusion of information, that would require multidisciplinary teams like this to come together? There was a suggestion that, at some of our national institutes, such as SAMSI, that is, the Statistical and Applied Mathematical Sciences Institute, one of the new, not solely NSF-funded, but one of the new NSF-funded institutes, perhaps there could be some sort of a focus here. I think that gets back to Doug's comment, which I thought was really good, that regular meetings, as opposed to one-off workshops, are the way we are probably going to foster relationships between these communities. Clearly, funding is required for those sorts of things. Can we get funding agencies to require collaborations, and how do they then monitor and mediate how that happens?

Then, one comment that was made at the end was the fact that, if we just focus in on statistics, and statistics graduate training, there is a lot of question as to whether we are actually training our students such that they can really begin to bite off these problems. I mean, do they have the computational skills necessary and the ability to do the collaborations?
I think that is a big question. My answer would be, I think in some of our programs we are, and in others we are not, and how do we balance that?

Just one last comment. You know, we spoke at a very high level, and just at the end of our time (and then we sort of ran out of time) it was pointed out that, if you really think of the data mining area and data mining problems, there has been a lot done on supervised and unsupervised learning. I think we understand pretty well that these are methods that have good predictive capabilities. However, it seems that the problem of the day is anomaly detection, and I really think that there, from a data fusion point of view, we really have a dearth of what we know how to do. So, the ground is fertile, the problems are hard, and somehow we have got to keep the dialogue going.
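
Editorial illustration (not part of the presentation): the remark above about two sensor streams sharing a hidden state, and the closing remark about anomaly detection from a data fusion point of view, can be made concrete with a very small sketch. It assumes, purely for illustration, that both sensors measure the same scalar hidden state with independent Gaussian noise of known variance; none of these assumptions, and none of the names `fuse`, `fuse_streams`, or `disagreement_flags`, come from the workshop. It is one textbook way to combine two readings, not the breakout group's method.

```python
import math


def fuse(reading_a, var_a, reading_b, var_b):
    """Combine two noisy readings of one hidden scalar state.

    Assumption (hypothetical): independent Gaussian noise with known
    variances var_a and var_b. Returns (fused_estimate, fused_variance)
    using inverse-variance weighting.
    """
    weight_a = 1.0 / var_a
    weight_b = 1.0 / var_b
    fused_var = 1.0 / (weight_a + weight_b)
    fused = fused_var * (weight_a * reading_a + weight_b * reading_b)
    return fused, fused_var


def fuse_streams(stream_a, var_a, stream_b, var_b):
    """Fuse two time-aligned streams reading by reading."""
    return [fuse(a, var_a, b, var_b) for a, b in zip(stream_a, stream_b)]


def disagreement_flags(stream_a, var_a, stream_b, var_b, k=3.0):
    """Flag time steps where the two sensors disagree by more than k
    standard deviations of their difference.

    Under the independence assumption, (a - b) has variance var_a + var_b.
    """
    threshold = k * math.sqrt(var_a + var_b)
    return [abs(a - b) > threshold for a, b in zip(stream_a, stream_b)]


# Illustrative, made-up readings (not workshop data).
stream_a = [10.2, 10.4, 9.9, 10.1]
stream_b = [9.8, 10.0, 18.5, 10.3]
print(fuse_streams(stream_a, 1.0, stream_b, 4.0))
print(disagreement_flags(stream_a, 1.0, stream_b, 4.0))
```

Under these assumptions the fused variance is smaller than either sensor's variance on its own, which is the gain available from combining the streams. The disagreement check is only a crude stand-in for anomaly detection: it ignores the temporal dependence and the underlying science that the discussion above argues must be brought into the problem.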
