Statistical Analysis of Massive Data Streams: Proceedings of a Workshop


Suggested Citation:"Report from Breakout Group." National Research Council. 2004. Statistical Analysis of Massive Data Streams: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/11098.


Report from Breakout Group

Instructions for Breakout Groups

MS. KELLER-McNULTY: There are three basic questions, issues, that we would like the subgroups to come back and report on. First of all, what sort of outstanding challenges do you see relative to the collection of material that was in the session? In particular there, we heard in all these cases that there are real specific constraints on these problems that have to be taken into consideration. We can't just assume we get the process infinitely fast, whatever we want. The second thing is, what are the needed collaborations? It is really wonderful today. So far, we are hearing from a whole range of scientists. So, what are the needed collaborations to really make progress on these problems? Finally, what are the mechanisms for collaboration? You know, Amy, for example, had a whole list of suggestions with her talk. So, the three things are the challenges, what are the scientific challenges, what are the needed collaborations, and what are some ideas on mechanisms for realizing those collaborations?

Report from Atmospheric and Meteorological Data Breakout Group

MR. NYCHKA: The first thing that the reporter has to report is that we could not find another reporter except for me. I am sorry, I was hoping to give someone the opportunity, but everybody shrank from it. So, we tried to keep on track on the three questions. I am sure that the other groups realized how difficult that was.

Let me first talk about some technical challenges. The basic product you get out of this is a field. It is maybe a variable collected over space and time. There are some just basic statistical problems of how you summarize those in terms of probability density functions, if you have multiple samples of those, how you manipulate them, and also deal with them.
Also, if you wanted to study, say, a particular variable under an El Niño period versus a La Niña period, all those kinds of conditioning issues. So, that is basically, sort of very mainstream space-time statistics.

Another important component that came out of this is the whole issue of uncertainty. This is true in general, and there was quite a bit of discussion about aligning these efforts with the Climate Change Research Initiative, which is a very high level kind of organized effort by the U.S. government to study climate. Uncertainty measures are an important part of that, and no surprise that the typical deterministic geophysical community tends to sort of ignore these, but it is something that needs to be addressed.

There was also sort of the sentiment that one limitation is partly people's backgrounds. People use what they are familiar with. What they tend to do is limited by the tools that they know. They are sort of reticent to take on new tools. So, you have this sort of vicious circle that you only do things that you know how to do.

I think an interesting thing that came out of this, and let me highlight this as a very interesting technical challenge, is one of these curious things where, all of a sudden, a massive data set no longer becomes very massive. What John was bringing up is that these large satellite records typically have substantial non-zero biases, even when you average them. These biases are actually a major component of using these. So, a typical bias would be, simply change a satellite platform that is measuring a particular remotely sensed variable, and you can see a level shift or some other artifact. In terms of combining different satellites, you need to address this. These biases need to be addressed empirically as an important problem.

The other technical challenge is reducing data. This is another interesting thing about massive data sets, that part of the challenge here is to make them useful. In order to make them useful, you have to have some idea of what the clientele is. We have had some discussion about being careful about that, that you don't want to sort of create some kind of summary of the data and have that not be appropriate for part of the user community. The other thing is, whatever summary is done, the assumptions used to make it should be overt, and also there should be measures of uncertainty along with it.

Collaborations: I think we didn't talk about this much, because I think they were so obvious. Obviously, the collaborators should be people in the geophysical community that actually work and compile this data with the statisticians. Some obvious centers are JPL, NCAR, NOAA. Ralph, do you volunteer CORA as well?

AUDIENCE: Sure.

MR. NYCHKA: John, NCDC, I am assuming you will accept visitors if they show up.

AUDIENCE: Sure will. It is a great place to be in the summer, between the Blue Ridge and the Great Smokies.

MR. NYCHKA: Okay, so one thing statisticians should realize is that there are these centers of concentrations of geophysical scientists, and they are great places to visit. The other collaboration that was brought up is that there needs to be some training of computer science in this.
The other point, coming back to the Climate Change Research Initiative, is that this is another integrator, in terms of identifying collaborations.

In terms of how to facilitate these collaborations, one suggestion was postdocs, in particular postdocs at JPL. I tried to steer the discussion a little bit, just to test the waters. What I suggested is some kind of regular process where there are meetings that people can anticipate. I am thinking sort of along the interface model or research conference model. It seems like the knee-jerk reaction in this is simply, people identify an interesting area that they declare, let's have a workshop. We have the workshop, people get together, and then that is it. It is sort of the final point in time. I think John agreed with me, in particular, that a single workshop isn't the way to address this. So, I am curious about pursuing a sort of more regular kind of series of meetings. Okay, and that is it.
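Editorial illustration, not part of the transcript: the report describes a change of satellite platform producing a level shift in a long record, and asks that such biases be addressed empirically and reported with a measure of uncertainty. The sketch below shows one very simple way this could look. It assumes the changeover time is known, that the record is stationary near the changeover, and that the noise is independent; real satellite records are usually autocorrelated and trended, so the standard error here would be too optimistic. The function names, window length, and simulated data are all hypothetical.

```python
import numpy as np


def estimate_level_shift(series, change_index, window=24):
    """Estimate the step at a known changeover as a difference of window means.

    Returns (shift, standard_error). The standard error assumes independent
    noise, which is an assumption of this sketch, not a property of real data.
    """
    series = np.asarray(series, dtype=float)
    before = series[max(0, change_index - window):change_index]
    after = series[change_index:change_index + window]
    if len(before) < 2 or len(after) < 2:
        raise ValueError("need at least two samples on each side of the change")
    shift = after.mean() - before.mean()
    se = np.sqrt(before.var(ddof=1) / len(before) + after.var(ddof=1) / len(after))
    return shift, se


def remove_level_shift(series, change_index, shift):
    """Return a copy of the record with the estimated step subtracted."""
    adjusted = np.asarray(series, dtype=float).copy()
    adjusted[change_index:] -= shift
    return adjusted


# Simulated record (hypothetical): white noise with a 0.5-unit step
# introduced at the platform change.
rng = np.random.default_rng(0)
n, change_index, true_shift = 240, 120, 0.5
record = rng.normal(0.0, 0.2, n)
record[change_index:] += true_shift

shift, se = estimate_level_shift(record, change_index)
adjusted = remove_level_shift(record, change_index, shift)
print(f"estimated shift: {shift:.3f} +/- {se:.3f} (true shift {true_shift})")
```

Reporting the standard error alongside the estimated shift follows the group's point that any summary should carry its assumptions and an uncertainty measure with it.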
