Suggested Citation:"Report from Breakout Group." National Research Council. 2004. Statistical Analysis of Massive Data Streams: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/11098.


Report from Breakout Group

Instructions for Breakout Groups

MS. KELLER-McNULTY: There are three basic questions that we would like the subgroups to come back and report on. First of all, what sort of outstanding challenges do you see relative to the collection of material that was in the session? In particular, we heard in all these cases that there are very specific constraints on these problems that have to be taken into consideration. We can't just assume we get to process infinitely fast, or whatever we want. The second thing is, what are the needed collaborations? It is really wonderful today. So far, we are hearing from a whole range of scientists. So, what are the needed collaborations to really make progress on these problems? Finally, what are the mechanisms for collaboration? You know, Amy, for example, had a whole list of suggestions with her talk. So, the three things are: what are the scientific challenges, what are the needed collaborations, and what are some ideas on mechanisms for realizing those collaborations?

Report from Atmospheric and Meteorological Data Breakout Group

MR. NYCHKA: The first thing that the reporter has to report is that we could not find another reporter except for me. I am sorry, I was hoping to give someone the opportunity, but everybody shrank from it. So, we tried to keep on track on the three questions. I am sure that the other groups realized how difficult that was.

Let me first talk about some technical challenges. The basic product you get out of this is a field. It is maybe a variable collected over space and time. There are some just basic statistical problems of how you summarize those in terms of probability density functions and, if you have multiple samples of those, how you manipulate them and also deal with them.
Also, if you wanted to study, say, a particular variable under an El Niño period versus a La Niña period, there are all those kinds of conditioning issues. So, that is basically very mainstream space-time statistics.

Another important component that came out of this is the whole issue of uncertainty. This is true in general, and there was quite a bit of discussion about aligning these efforts with the Climate Change Research Initiative, which is a very high-level, organized effort by the U.S. government to study climate. Uncertainty measures are an important part of that, and it is no surprise that the typically deterministic geophysical community tends to ignore these, but it is something that needs to be addressed.

There was also the sentiment that one limitation is partly people's backgrounds. People use what they are familiar with. What they tend to do is limited by the tools that they know. They are reluctant to take on new tools. So, you have this vicious circle in which you only do things that you know how to do.

I think an interesting thing came out of this, and let me highlight it as a very interesting technical challenge. It is one of these curious things where, all of a sudden, a massive data set is no longer very massive. What John was bringing up is that these large satellite records typically have substantial non-zero biases, even when you average them. Handling these biases is actually a major component of using these records. A typical bias arises simply when you change the satellite platform that is measuring a particular remotely sensed variable, and you can see a level shift or some other artifact. In terms of combining different satellites, you need to address this. These biases need to be addressed empirically as an important problem.

The other technical challenge is reducing data. This is another interesting thing about massive data sets: part of the challenge here is to make them useful. In order to make them useful, you have to have some idea of what the clientele is. We had some discussion about being careful about that, in that you don't want to create some kind of summary of the data and have that not be appropriate for part of the user community. The other thing is, whatever summary is done, the assumptions used to make it should be overt, and there should also be measures of uncertainty along with it.

On collaborations, we didn't talk about this much, because I think they were so obvious. Obviously, the collaborators should be people in the geophysical community who actually work with and compile this data, together with the statisticians. Some obvious centers are JPL, NCAR, NOAA. Ralph, do you volunteer CORA as well?

AUDIENCE: Sure.

MR. NYCHKA: John, NCDC, I am assuming you will accept visitors if they show up.

AUDIENCE: Sure will. It is a great place to be in the summer, between the Blue Ridge and the Great Smokies.

MR. NYCHKA: Okay, so one thing statisticians should realize is that there are these centers with concentrations of geophysical scientists, and they are great places to visit. The other collaboration that was brought up is that there needs to be some training of computer science in this.
The other point, coming back to the Climate Change Research Initiative, is that this is another integrator, in terms of identifying collaborations.

In terms of how to facilitate these collaborations, one suggestion was postdocs, in particular postdocs at JPL. I tried to steer the discussion a little bit, just to test the waters. What I suggested is some kind of regular process where there are meetings that people can anticipate. I am thinking along the lines of the interface model or the research conference model. It seems like the knee-jerk reaction here is simply that people identify an interesting area and declare, "Let's have a workshop." We have the workshop, people get together, and then that is it. It is the final point in time. I think John agreed with me, in particular, that a single workshop isn't the way to address this. So, I am curious about pursuing a more regular series of meetings. Okay, and that is it.
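
Editorial illustration: the level-shift bias mentioned in the report, where a change of satellite platform introduces an offset into a remotely sensed variable, can be sketched roughly as follows. This is only a minimal sketch and not a method described by the breakout group. It assumes the two platforms have an overlap period, that the bias is a constant offset, and that the mean difference over the overlap is an adequate empirical estimate. The function names and the toy numbers are hypothetical.

```python
from statistics import mean


def estimate_level_shift(old_platform, new_platform):
    """Estimate a constant offset between two records.

    Each record is a dict mapping a time step to a measurement.
    Assumption: the bias is a constant level shift, so the mean
    difference over the shared time steps estimates it.
    """
    overlap = sorted(set(old_platform) & set(new_platform))
    if not overlap:
        raise ValueError("records share no time steps; no overlap to compare")
    return mean(new_platform[t] - old_platform[t] for t in overlap)


def merge_records(old_platform, new_platform):
    """Merge two records onto the old platform's level.

    Old-platform values are kept where both exist; new-platform
    values are shifted by the estimated offset before being added.
    """
    shift = estimate_level_shift(old_platform, new_platform)
    merged = dict(old_platform)
    for t, value in new_platform.items():
        if t not in merged:
            merged[t] = value - shift
    return merged, shift


# Hypothetical toy data: the new platform reads 2.0 units high.
old = {0: 10.0, 1: 10.5, 2: 11.0, 3: 10.5}
new = {2: 13.0, 3: 12.5, 4: 13.5, 5: 12.0}

merged, shift = merge_records(old, new)
print("estimated level shift:", shift)
print("merged record:", merged)
```

In keeping with the group's point about uncertainty, a real adjustment would also report how uncertain the estimated offset is, for example from the spread of the overlap differences, and would state the constant-offset assumption openly.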


Massive data streams, large quantities of data that arrive continuously, are becoming increasingly commonplace in many areas of science and technology. Consequently development of analytical methods for such streams is of growing importance. To address this issue, the National Security Agency asked the NRC to hold a workshop to explore methods for analysis of streams of data so as to stimulate progress in the field. This report presents the results of that workshop. It provides presentations that focused on five different research areas where massive data streams are present: atmospheric and meteorological data; high-energy physics; integrated data systems; network traffic; and mining commercial data streams. The goals of the report are to improve communication among researchers in the field and to increase relevant statistical science activity.
