Suggested Citation:"TRANSCRIPT OF PRESENTATION." National Research Council. 2004. Statistical Analysis of Massive Data Streams: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/11098.


STATISTICAL CHALLENGES IN THE PRODUCTION AND ANALYSIS OF REMOTE SENSING EARTH SCIENCE DATA AT THE JET PROPULSION LABORATORY

TRANSCRIPT OF PRESENTATION

MS. KELLER-McNULTY: Our next speaker is Amy Braverman from the Jet Propulsion Lab. Amy has a Mac, so this is going to be a new experience for me.

MS. BRAVERMAN: This is the first time I have used this computer for a presentation, and I also took the plunge and did my presentation in PowerPoint for the first time. So, beware, if I have problems.

MS. KELLER-McNULTY: While they are setting this up, I want to remind everybody to look at your programs. We have scheduled some time in the afternoon for some breakout sessions with focused discussion in each of the presentation areas. So, write your questions down and think about that, so if it isn't covered here and doesn't get covered in the break, you will have some time to talk to each other about some of the problem areas and ideas.

MS. BRAVERMAN: I would like to thank a couple of people. I would like to thank Doug for inviting me to come and talk here. I have been chomping at the bit to get some help, and this seems like the perfect opportunity to, I don't know, to whine for it, let's say. I would also like to thank the organizers for holding this workshop. It was on the basis of the proceedings of the 1995 conference that I got into this problem in the first place and was able to find Ralph Kahn at the Jet Propulsion Laboratory. That is how I met him: I was reading the proceedings and saying to myself, gee, he is right across town there and I need a good application for my dissertation work, that looks like a good one. So, the rest is history, and I finally got a job out of it, just last year. I actually got a job as a graduate student and then as a post doc, and now as a regular bona fide person. I would like to include some interesting images in the presentation, and I don't necessarily plan to explain all of them in detail.
They are just kind of there to liven things up. This is an Aster visible near-IR image of Washington taken on June 1, 2001. Aster is one of the instruments on NASA's Terra satellite which was launched about three years ago now, and I just thought, we are here in Washington, I might as well show you Washington.

Here is the outline for my talk. It is pretty simple. I just want to kind of go through some observations I have made in my five years of various experiences working at JPL. I should tell you that I work in the Earth and Space Sciences Division, which is a little bit unusual. Most of the people who do my kind of work are up in the machine learning group. I have the benefit, actually, of working directly with the scientists who face these problems, and I think that is a real advantage, because I get to hear them talk about their problems, kind of on a daily basis, what it is they need to do. The image on the left there is fake. It was put together before any of the data that it is trying to depict was actually collected (well, except for the El Niño red blob there). The El Niño stuff, I don't know where that comes from, exactly which satellite, but the other things are sort of designed to show what sorts of things the Earth Observing System would eventually provide. This is kind of a dream, which is a global picture, easily visualized, of what the world is doing right now. Anyway, what I was going to do is just run through what I see as the major statistical challenges in what we do at JPL, and then make some recommendations for how I think the statistics community could become more involved in what we do. I asked Doug what he thought would make a good talk and he said, well, some recommendations for how we could become more involved would be good. Statisticians are curiously absent from the scene at NASA right now. I think part of the reason for that is that we are pretty much perceived as not being practical enough to contribute, and that is something I want to come back to later. So, the Earth Observing System program is a long-term program to study the

Earth's climate system. What that means is looking at the atmosphere, the oceans, the biosphere, and looking at it all as one integrated system, and studying the feedbacks that are involved. The upper graphic there is kind of a little bit of an illustration of why clouds are important in this system, and it is because all the energy the Earth gets is from the Sun. The question is, how is that radiation budget working out for us? Some of that radiation is reflected back out into space. Some of it gets all the way through to the ground, gets absorbed, goes to power everything we do here, create fossil fuels and what not. That is really one of the major things we are trying to study. So, a very, very important question that we are trying to answer is, what are the radiative effects of clouds, and that necessitates knowing where the clouds are and how they are spatially distributed, how they are changing over time. There are also what we call aerosols in the atmosphere, some of which are manmade (pollution, for example), others of which are natural, like forest fire smoke, and these things, too, have an impact on the radiative balance of the Earth. The bottom image there is from an instrument called CERES, which I will mention again a little bit later on. It is a global map of the (what does it say?) outgoing longwave flux of the Earth, I guess for September 30, 2001. Anyway, we have lots of data at NASA. I am kind of amused by the question of, what is a massive data set. I know that some people don't agree with this but, in my view, a massive data set depends on who you are and what your computational resources are. You will understand why I have that perspective when I get on with what I am about to say about what our job is at NASA.
We are actually in the data production business at NASA, above and beyond everything else. We do participate in the analysis of the data that we collect, as you can imagine, but our primary responsibility is to build and fly instruments to collect data to study climate change. So, we are in the business of providing these data to the community. The community is a very diverse group of individuals with lots of different interests, lots of different opinions about what assumptions are valid, and lots of different resources. Our users range from university or college researchers with desktop computers to people like NOAA, for example, as we just heard, who use some of our data. So, it is a real challenge to try to design a one-size-fits-all sort of way of producing and distributing data that can satisfy everybody's needs. One of the things I wanted to mention specifically was that the EOS program is a long-term program that involves a number of different satellites and instruments. One of the prime intentions of the EOS program is that these data were supposed to be used synergistically. We were supposed to be able to combine data across instruments and across platforms, and as yet, we haven't really done that, and we don't know of anybody who has, largely because of the huge data volumes we are getting and the very complicated way in which the data are collected.

So, these are NASA's 23 strategic questions. I don't know how easy that is to read from where you sit. NASA has formulated these questions as a way of concretizing what sorts of problems we want to address with the data. I can certainly refer you to the Web page where this came from and let you know that I am not going to dwell on it or read them all, but you get kind of an idea of what the big-picture questions are. So, the Earth Observing System, right now it has a collection of satellites involved. Two of them are actually in orbit now. The Terra satellite was launched on December 18, 1999, from Vandenberg Air Force Base in California. It carries five instruments: MISR, the Multi-Angle Imaging Spectroradiometer; MODIS, the Moderate Resolution Imaging Spectrometer; ASTER, which I think is the Advanced Spaceborne Thermal Emission Radiometer (you get to be really good at acronyms when you work at NASA); CERES, Clouds and the Earth's Radiant Energy System; and MOPITT, Measurements of Pollution in the Troposphere. The reason MISR is in red there is because that is one of the projects that I work on, so I have drawn many of my examples from what I know about MISR. NASA is also very good at making pictures and doing animations and doing PR. This is a depiction of the Terra satellite in orbit, and the instruments that it carries, doing what they do. If you have heard me talk before, which some of you have, you will recognize MISR as the multi-colored beams stretching out forward and aft along the direction of flight. This depicts the fact that MISR collects data. So, MISR looks down at the Earth at nine angles and four wavelengths and has a swath width of about 300 kilometers or so. The Terra satellite is in a polar orbit. So, what we do is, we

successively see the same spot on the ground at nine angles and four wavelengths simultaneously. So, we have 36 channels' worth of data. Among the other instruments onboard is MODIS, which has a much wider swath width and, therefore, gets a lot more coverage. MISR, with its skinny little swath width, gets global coverage about every nine days. MODIS, with its much wider swath width, gets global coverage every day. The other instrument, CERES, was the red and yellow ones going back and forth like this in the back. I am not sure what its coverage is, nor am I sure about MOPITT. ASTER is what they call the zoom lens of the Terra platform, because it only looks at specific spots when it is told to do so, and it has very high resolution, about a meter, I think. So, I wanted to put that up. The other ... now I know I had better wait with this until I finish describing what I want to say. The second EOS satellite was launched on May 4, 2002. That is called the EOS-Aqua. It carries four instruments. You will notice some of the names appear again. MODIS and CERES, it has a MODIS and CERES as well. It also has the AIRS instrument, which is the other project that I work on at JPL, which is the Atmospheric Infrared Sounder, which looks down at Earth at a single view angle, but in 2,378 spectral bands, and has a spatial resolution of about 15 kilometers. MISR was about 1 kilometer resolution. So, that is how I knew what CERES looked like, from looking at that. Then we have AIRS, AMSU and HSB, which are actually bundled together in a single package, and are processed together at JPL. The spinning thing on the top is AMSR/E, which is going to come up. There is MODIS. You get some sort of relative idea of what the swath widths are, the relative widths, from the animation. I don't know what AMSR/E does, to tell you the truth. There is way too much to know at NASA.
It took me about four years before I could actually read a document without having to look up every acronym that I came across and therefore kind of understand what was going on. I like this, because you are getting bathed in the glow of the data here. I think that is kind of nice. Choking on it is more like it, as I said. [Question off microphone from audience.] It depends on which instrument you are talking about. For MISR, I can tell you it is about 75 gigabytes a day, and for AIRS it is about 28.2 gigabytes a day. It is a little hard to say how big the data are, because the data are processed on successive levels. So, if you quote a big huge number, you are really talking about the same data in different forms.

Pretty much, any way you slice it, it is a lot. I wasn't going to concentrate on the “gee, whiz, how big is it?” statistics, but one of the things people like to talk about is how, in the first six months of operation, the Terra satellite doubled all of NASA's holdings from the time that it began. So, there is a lot of data. At the MISR science team meeting this week, we had the people from the data distribution center come and talk. They told us that we now have 33 terabytes; 33 terabytes is 11 percent of what we have in storage now for MISR, and that is just one instrument on one platform. Let me say a few things about what happens to the data when we get it. It comes zipping down from the satellites and gets beamed around, and finally ends up at a data processing center called a DAAC, which is a Distributed Active Archive Center which, I think Ed pointed out, is a big oxymoron, all by itself, and it gets processed. The data, as it arrives at the DAAC, is called Level 0 data. It is just raw and uncalibrated from the spacecraft. It goes through a series of processing steps called Level 1 processing, which geolocates and calibrates the data and yields for you a data product called Level 1-B-2 in the terminology of things, which you can think of conceptually as a great big data set that has a row for every spatial scene on the ground, which would be 1 kilometer for MISR and 15 kilometers for AIRS, and a column for each observed channel, 36 columns for MISR and 2,378 columns for AIRS. Then comes the good part. So, that Level 1 data is then pumped through what we call Level 2 processing algorithms. John alluded to the word retrieval, which was a mysterious term to me for quite some time. That is where we retrieve the geophysical quantities that, in theory, produce those radiances.
So, Level 2 data products are derived geophysical quantities that will form the basis for answering those 23 questions. Because the data are so large, we have an obligation to provide them in a form that is a little bit easier to use, and that is the so-called Level 3 stage of processing, where we produce global gridded summaries of the data on a monthly or a daily or a weekly or whatever basis. This is supposed to satisfy the needs of people who can't handle 75 gigabytes of data a day, and make the analysis a little bit easier. Level 4 is sometimes talked about, and it is the analysis stage, where you put the input into a climate model and use it to actually generate something. In my mind, I make a distinction between—Levels 1, 2 and 3 are what I call data production, and Level 4 is the data analysis stuff, and I think it is important to make that distinction.
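As a rough illustration of the Level 3 idea, the sketch below bins point observations into latitude/longitude grid cells and keeps a count, mean, and standard deviation per cell. It is a minimal sketch under stated assumptions, not the operational algorithm: the function name `grid_summary`, the `(lat, lon, value)` tuple layout, the 1 degree default cell size (borrowed from the Level 3 description later in the talk), and the sample values are all hypothetical.

```python
import math
from collections import defaultdict

def grid_summary(observations, cell_deg=1.0):
    """Summarize (lat, lon, value) points on a regular grid.

    Returns {(lat_index, lon_index): (count, mean, std)}. The layout and
    names are illustrative assumptions, not a NASA product format.
    """
    # Running count, sum, and sum of squares per cell: a single pass over
    # the data, so nothing but the accumulators is held in memory.
    acc = defaultdict(lambda: [0, 0.0, 0.0])
    for lat, lon, value in observations:
        key = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        cell = acc[key]
        cell[0] += 1
        cell[1] += value
        cell[2] += value * value

    summary = {}
    for key, (n, s, ss) in acc.items():
        mean = s / n
        # Population variance; clamp tiny negative rounding errors to zero.
        var = max(ss / n - mean * mean, 0.0)
        summary[key] = (n, mean, math.sqrt(var))
    return summary

# Made-up values standing in for a retrieved Level 2 quantity.
obs = [(34.2, -118.1, 0.30), (34.7, -118.9, 0.50), (35.1, -118.4, 0.20)]
level3 = grid_summary(obs)
```

The single-pass accumulators are a deliberate choice here: they suit a setting where data arrive in large volumes and cannot be held or revisited.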

This is a granule map for AIRS. I said the data come down in chunks. Every day we get 240 chunks of AIRS data. Each granule is an array of 90 footprints across and 135 footprints along for AIRS, and there are 2,378 radiance observations for each footprint. I just wanted to put that up there so you get some idea, if you wanted to order data, and you wanted to order a whole world's worth of data for one day, you would have to order 240 files that look like this. If you wanted a specific spot, you would have to figure out which granule it is in. This changes every day. For AIRS, a granule is defined as six minutes' worth of data. One of the problems that we have is that each instrument defines its granule its own way. MODIS, for example, describes its granule as five minutes' worth of data. The naming conventions for the files are not easy to figure out and they are not standardized. The sampling grids, the grids at which the data are collected, are different for every instrument. So, you are looking at a big, big problem, if you want to compare data across instruments. Now, I wanted to mention just a few problems that I personally have encountered in working with the folks at JPL on looking at where statistics and data analysis fit in. There is a tremendous amount of statistics that goes on at JPL, and data analysis. Everybody does data analysis at JPL, and nobody is a statistician, except for me. One problem we had was with AIRS calibration data. The 2,378 channels are broken down into what they call 17 different modules. Channels go into the same modules because they share a certain amount of electronics. The channels are supposed to be independent of one another.
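One caution applies to any check of independence based on covariance or correlation matrices: zero correlation does not by itself establish independence unless the data are jointly Gaussian. The sketch below shows this with synthetic numbers only. It is not AIRS data, and the variable names are hypothetical: one series is fully determined by the other, yet their sample correlation is close to zero.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic "channel" x, symmetric about zero, and a second "channel" y
# that is a deterministic function of x (so the two are far from independent).
x = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
y = [v * v for v in x]

r_xy = pearson(x, y)                         # close to zero
r_absx_y = pearson([abs(v) for v in x], y)   # strongly positive
```

A correlation matrix would report the pair `(x, y)` as essentially uncorrelated, while a simple transformation of `x` exposes the dependence.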

These are covariance matrices for eight of the modules. They asked me to try to determine whether or not the measurements from these channels were, in fact, independent. To a lot of people at JPL, correlation means independence. Zero correlation would mean independence. Never mind the fact that the data don't really look very Gaussian when you try to look at them, but these are just a couple of the covariance matrices that we generated sort of a cohort so you could look at them quickly. It looks like we are doing okay on the last two, and then things get a little dicier as you go back, as you go down to number ... this one here. The previous speaker alluded to doing retrievals and doing forward models. The calibration issue I think of as a Level 1 issue. It is something that you do at Level 1 processing. This is an example of a Level 2 problem. We collect from MISR, for example, these radiances from nine angles and four wavelengths. You see here six of the channels, RGB nadir and RGB 70 degree forward, and the so-called optical depth retrieval that comes from it. As John said, the way this is done almost across the board is by matching the observed radiances to radiances that are predicted, under the assumption that certain conditions are true. Then, wherever you find the best matches, then, aha, that must be it; that is what we are going to call the truth. This is okay, I guess. It seems like a very rich area for statisticians to help improve how that is done. The thresholds for when they say something fits or doesn't fit are pretty much ad hoc, and could benefit from some good principled statistical thinking. The other thing is in how they characterize the uncertainties associated with these retrievals. It has nothing to do with variance at all. It has more to do with how many of the candidate models fit, how close they fit.
For example, if lots of the models fit, we are pretty uncertain. If none of the models fit, we are even more uncertain, and if just one of the models fits well, then they say we are pretty certain. The uncertainty characterization tends to be certain, not very certain, not certain at all, that sort of thing. It is qualitative, rather than quantitative.
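The matching procedure and the qualitative uncertainty labels described above can be illustrated with a toy example. This is a sketch under stated assumptions, not the MISR retrieval algorithm: the misfit measure (a mean squared normalized residual), the measurement error `sigma`, the `threshold`, the candidate names, and all radiance values are hypothetical.

```python
def retrieve(observed, candidates, sigma, threshold):
    """Match observed radiances against predicted ones.

    candidates maps a model name to its predicted radiances. Returns
    (best_model, models_that_fit, qualitative_label).
    """
    def misfit(predicted):
        # Mean squared residual in units of the assumed measurement error.
        return sum(((o - p) / sigma) ** 2
                   for o, p in zip(observed, predicted)) / len(observed)

    scores = {name: misfit(pred) for name, pred in candidates.items()}
    fits = sorted(name for name, s in scores.items() if s <= threshold)
    best = min(scores, key=scores.get)

    # Uncertainty is judged by how many candidates pass, not by a variance.
    if len(fits) == 1:
        label = "certain"
    elif len(fits) > 1:
        label = "not very certain"
    else:
        label = "not certain at all"
    return best, fits, label

observed = [0.21, 0.30, 0.42]
candidates = {
    "thin_haze": [0.20, 0.31, 0.41],
    "thick_haze": [0.35, 0.45, 0.60],
    "dust": [0.10, 0.15, 0.20],
}
result = retrieve(observed, candidates, sigma=0.02, threshold=1.0)
loose = retrieve(observed, candidates, sigma=0.02, threshold=100.0)
strict = retrieve(observed, candidates, sigma=0.02, threshold=0.01)
```

Changing only `threshold` moves the same observation from "certain" to "not very certain" to "not certain at all", which is one way to see why an ad hoc threshold is a weak basis for an uncertainty statement.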

Finally, the third area that I wanted to mention was the creation of Level 3 products, which is how I got involved and what my responsibilities on MISR and AIRS are, to design Level 3 products. Level 3 products tend to be maps of averages and standard deviations. You take all the data for the period of time over which you are summarizing, and you chop it up into little 1 degree by 1 degree bins. If you are creating a monthly summary, then you produce maps of each of the quantities you are interested in, with a mean value in each grid cell, and then maybe another map with the standard deviation in each grid cell. If you are really clever, maybe you make some maps of the correlations or covariances to go along with it. I think the natural reaction of a statistician is, oh, that is awful. You know, you are throwing away most of the data there. It begins to make more sense when you stop to think about the operational problems that go into doing these things. Doing things like density estimates, fancy stuff is just completely out of the question because of the processing. The processing has to keep up, basically, so it has to be fast. I will just plug my own thing here and something I have worked on a little bit with Ed Wegman. What I propose to do for both MISR and AIRS is to create a Level 3 product that I call a quantized data product. Instead of simply providing a mean and standard deviation in each grid cell, we provide basically the results of a clustering algorithm.
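A quantized summary of this kind for a single grid cell might be sketched as follows. The talk does not say which clustering algorithm is used, so plain k-means (Lloyd's iterations) stands in here as an assumption; the function names, the choice of `k`, and the sample data are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))

def quantize_cell(points, k, iters=20, seed=0):
    """Summarize the points in one grid cell by k representatives.

    Returns (representatives, weights, mse): cluster centers, the number
    of points each one stands for, and the mean squared error.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]

    weights = [0] * k
    total = 0.0
    for p in points:
        j = nearest(p, centers)
        weights[j] += 1
        total += sq_dist(p, centers[j])
    return centers, weights, total / len(points)

# Two tight groups plus one outlier, standing in for one cell's data.
cell = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1),
        (5.0, 5.0), (5.1, 4.9), (4.9, 5.1), (20.0, 20.0)]
reps, weights, mse = quantize_cell(cell, k=3)

# For comparison: the error of summarizing the cell by its mean alone.
mean = [sum(col) / len(cell) for col in zip(*cell)]
single_mse = sum(sq_dist(p, mean) for p in cell) / len(cell)
```

Comparing `mse` with `single_mse` shows how much distributional detail the representatives retain beyond a single mean for the cell.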
You have a number of representative vectors and associated weights, which might be the numbers of original data points represented by each of those representatives, and an error measure, which might be the within-cluster mean squared error, and provide this product as a quantitative Level 3 product that retains more of the distributional information about the data than just a simple mean and standard deviation would. In particular, it would retain some of the information about outliers which, for science analyses, tend to be among the most important things and the things you don't want to smooth out and throw away. This image here is just a map of the relative error in one of the products that I created to show at the American Geophysical Union last week. The original data set from which this image was created was about 550 megabytes, and the compressed or quantized product was about 60 kilobytes. So, it is about a 10-fold reduction in data size, and you suffer pretty much, at worst, about a 7 percent error in the data, as measured by mean squared error relative to the average magnitude of the data within each grid cell. So, that is pretty good and it is quite good for a lot of applications. Sort of the problem here is how do you tell people ahead of time whether it is good enough for their

application, because you don't know what their application is and, of course, if that is good enough or not depends on what their application is. I wanted to show one other thing here, which is this little zippy animation. So, this is an animation of the Aqua orbit. You can see how it goes. We are in a polar orbit. That little skinny red strip there is not too far off of what a MISR swath would look like. An AIRS swath would be considerably wider than that. You see that what is going on there is that, as you go around, the Earth is, of course, turning underneath you. So, how do you make a global summary of data like that? Someone who has thought about that, to some degree, is Noel Cressie, who has developed some techniques that we have experimented with a little bit, for creating kind of a Level 3 product that would sort of take account of spatial and temporal dependencies, in order to produce kind of a monthly summary of these data, that takes care of sort of the interpolation between swaths and over time. That is a big problem for us, too. I personally put that in the realm of the data analysis rather than the data production. So, data analysis, I will go through this quick because I want to get to the end here. Understanding what is in the data is a necessary precursor for doing anything with it, for inference. The vast majority of what is done with EOS data these days is exploratory and descriptive, because we are still just trying to understand what is in it. Inference is just something that is going to have to wait a little while. So, there are lots of opportunities for descriptive type techniques that need to be brought to bear on these data.

I am going to just zip along. These are three areas where I think we could really benefit from the help of the statistics community. The first is multivariate visualization, particularly for data sets where you want to preserve the spatial and temporal context of the data. In my view, the real problem here is how do you visualize features of joint distributions of very highly multivariate data as they evolve as you move around in space and time. Data mining and analysis of data sets, it is pretty obvious we need help with that. No one can look at all this data. So, we are doing some things. I like to think of the Level 3 products as kind of a first stab at that. Finally, data fusion: I was unable to find much literature by statisticians on that, and it is really important. That is a big problem for us. We want to be able to combine data from different instruments on the same satellite, from different satellites. We need to be able to combine information from ground sources with the satellite data that we get in order to validate it. So, if anybody has any good ideas about that, or would like to work on that, we would be very happy to have you help us. As far as analysis is concerned, I wanted to just mention a couple of examples. Ed Wegman has been real good to us and put together a new incarnation of his image tours ideas. I don't know how well you can see that. It is a little difficult to see here, but I will just run that. This was Ed's answer to our problem about how to visualize 36 channels' worth of information while retaining a spatial context. What is going on here is, we have 36 gray-scale images, each representing a different channel's worth of data.
It is successively producing the same linear combination as shown by the little thing down in the corner here, of the 36 channels in each pixel of the data, and I will refer you to Ed to explain that a little more carefully. The animation doesn't run very long, and it is hard to see what is going on there, but what you hopefully noticed there was that certain features pop in and out of view as you run that thing. It actually turned out to be pretty useful for finding features that you didn't otherwise know about. Now, if you already knew those features were there, you could go straight to the image where it was most easy to see, or to the combination of images where it was most easy to see. If you don't know they are there, which we don't know anything ahead of time about this, then you need something that is going to help you look at a lot of data quickly and find these things, and this was very useful for that. We are still working with

this. We need to do more work with it in order to be able to help the scientists interpret what it is showing us. A second thing that we are heavily engaged in is collaborating with the machine learning systems people at JPL to do data mining, particularly active learning methods and tools. We got some internal money from NASA to pursue this, and we are looking forward to eventually being able to show it at the JSM interface meeting. Finally, we were able to suck Bin Yu in. I met Bin at a workshop at MSRI, and she showed some interest in the data. So, we put her up to a particularly vexing problem in the analyses of these data, which is how to detect thin clouds over snow and ice with MISR data. You know, you are looking at a white object over a white background. That is difficult to see. She was down in Pasadena yesterday, gave a talk to our science team, and has some very nice results for us, and we are looking forward to hearing more from her. She has a student working on the problem, actually, and that has been great. Now we have kind of got him learning how to order data, and that looks good for us because it means somebody is using our data, and it is good for him because it is a neat problem to work on. Now for the general things I want to say. This, by the way, is the RGB nadir image of the previous movie here. I might have called this, like I said, a shameless cry for help, because NASA has so much wonderful data. Whether you are interested in massive data sets or whatever particular area of analysis you are interested in, there is a wonderful NASA data set that would be really interesting for you, I am sure. The great prohibition, I think, has been that a lot of statisticians think of practical constraints as kind of a detail that isn't really something they are interested in.
What we really need people to do is to devote the time to understanding the problem context and the practical restrictions on the analyses, and to accept those as important research problems in their own right. You know, it is not enough to think of a great way of summarizing data or analyzing data. It has got to be something that can be done, or they won't pay any attention to us. Thanks to Ben and Ed and a couple other people, we are moving in that direction. So, Doug said to bring some specific suggestions for how we could remedy the situation, and these are the ones that I came up with.

The first one is that I would really love to see ASA broker relationships between statisticians at colleges and universities and local NASA centers. Like I said, the NASA centers are just a tremendous source of really interesting data sets for teaching and research, and it would be great if we could get those into the hands of people who could actually do something with them.

We would love to see ASA organize a workshop specifically devoted to statistical applications and problems in geoscience, Earth science, and remote sensing. Doug and I were just at the American Geophysical Union meeting early this week, where we had a session on model testing and validation in the geosciences that went very well. It was very well received and very well attended. Ed, Doug, and Di Cook spoke last year at the AGU, and the AGU people are very happy to have us, so we would like to push forward some more formal relationship with the AGU. In fact, they suggested that we form a committee. They don't have any money, but if we could somehow find a way to come up with some money to send young researchers and students to the AGU to show their work, that would really be great, because that is really how we are going to inject ourselves into that community.

Also, I would like to see geoscience get a little bit higher profile at JSM. We have an occasional session on it, but it is nothing like bioinformatics, which you see all the time. It would be nice to have a session devoted to just bringing the problems to the statistics community, maybe not the solutions, but just tell us what the problems are.

This last one was Ralph's suggestion, speaking like the strapped NASA researcher he is. We would love to see a funding program to fund work that is directly relevant to NASA's problems.
NASA has not traditionally funded statistics, partly because of the problem with being practical and doing things that are directly useful for missions. We think that, if we could get some seed money to prove how useful we could be, that funding would then come later. So, those are the suggestions. I hope that we can make some of them ring true, and thank you.

Massive data streams, large quantities of data that arrive continuously, are becoming increasingly commonplace in many areas of science and technology. Consequently development of analytical methods for such streams is of growing importance. To address this issue, the National Security Agency asked the NRC to hold a workshop to explore methods for analysis of streams of data so as to stimulate progress in the field. This report presents the results of that workshop. It provides presentations that focused on five different research areas where massive data streams are present: atmospheric and meteorological data; high-energy physics; integrated data systems; network traffic; and mining commercial data streams. The goals of the report are to improve communication among researchers in the field and to increase relevant statistical science activity.
