__Transcript of Presentation__

BIOSKETCH: Douglas Nychka is a senior scientist at the __National Center for Atmospheric Research__ (NCAR) and is also the project leader for the __Geophysical Statistics Project__ (GSP). He works closely with many of the postdoctorate fellows at NCAR and his primary goal is to emphasize interdisciplinary research: migrating statistical techniques to important scientific problems and using these problems to motivate statistical research. Dr. Nychka’s personal research interests include nonparametric regression (mostly splines), and statistical computing, spatial statistics, and spatial designs. He received his undergraduate degree from Duke University in mathematics and physics and his PhD from the University of Wisconsin under the direction of Grace Wahba. He came to GSP/NCAR in 1997 after spending 14 years as a faculty member in the Statistics Department at North Carolina State University.

Statistical Analysis of Massive Data Streams: Proceedings of a Workshop
Douglas Nychka, Chair of Session on Atmospheric and Meteorological Data
Introduction by Session Chair
Transcript of Presentation
MR. NYCHKA: So, without further ado, our first speaker is going to be John Bates at the National Climatic Data Center. He will be talking about exploratory climate analysis and environmental satellites and weather radar data.

John Bates
Exploratory Climate Analysis Tools for Environmental Satellite and Weather Radar Data
Abstract of Presentation
Transcript of Presentation and PowerPoint Slides
BIOSKETCH: John J. Bates is the chief of the Remote Sensing Applications Division of the U.S. National Oceanic and Atmospheric Administration’s (NOAA’s) National Climatic Data Center. Dr. Bates received a PhD in meteorology from the University of Wisconsin-Madison in 1986 under William L. Smith on the topic of satellite remote sensing of air-sea heat fluxes. Dr. Bates then received a postdoctoral fellowship at Scripps Institution of Oceanography (1986–1988) to work with the California Space Institute and the Climate Research Division. He joined the NOAA Environmental Research Laboratories in Boulder, Colorado, in 1988 and there continued his work in applying remotely sensed data to climate applications. In 2002, Dr. Bates moved to the NOAA National Climatic Data Center in Asheville, North Carolina.
Dr. Bates’ research interests are in the areas of using operational and research satellite data and weather radar data to study the global water cycle and studying interactions of the ocean and atmosphere. He has authored over 25 peer-reviewed journal articles on these subjects. He served on the AMS Committee on Interaction of the Sea and Atmosphere (1987–1990) and the AMS Committee on Applied Radiation (1991–1994).
As a member of the U.S. National Research Council’s Global Energy and Water Cycle Experiment (GEWEX) Panel (1993–1997), Dr. Bates reviewed U.S. agency participation and plans for observing the global water cycle. He was awarded a 1998 Editors’ Citation for excellence in refereeing Geophysical Research Letters for “thorough and efficient reviews of manuscripts on topics related to the measurement and climate implications of atmospheric water vapor.” He has also been a contributing author and U.S. government reviewer of the Intergovernmental Panel on Climate Change Assessment Reports. He currently serves on the International GEWEX Radiation Panel, whose goal is to bring together theoretical and experimental insights into the radiative interactions and climate feedbacks associated with cloud processes, including the effects of water vapor within the atmosphere and at Earth’s surface.

Abstract of Presentation
Exploratory Climate Analysis Tools for Environmental Satellite and Weather Radar Data John Bates, National Climatic Data Center
1. Introduction
Operational data from environmental satellites form the basis for a truly global climate observing system. Similarly, weather radar provides the very high spatial and rapid time sampling of precipitation required to resolve physical processes involved in extreme rainfall events. In the past, these data were primarily used to assess the current state of the atmosphere to help initialize weather forecast models and to monitor the short-term evolution of systems (called nowcasting).
The use of these data for climate analysis and monitoring is increasing rapidly. So, also, are the planning and implementation for the next generation of environmental satellite and weather radar programs. These observing systems challenge our ability to extract meaningful information on climate variability and trends. In this presentation, I will attempt only to provide a brief glimpse of applications and analysis techniques used to extract information on climate variability. First, I will describe the philosophical basis for the use of remote sensing data for climate monitoring, which involves the application of the forward and inverse forms of the radiative transfer equation. Then I will present three examples of the application of statistical analysis techniques to climate monitoring: (1) the detection of long-term climate trends, (2) the time-space analysis of very large environmental satellite and weather radar data sets, and (3) extreme event detection. Finally, a few conclusions will be given.
2. Philosophy of the use of remote sensing data for climate monitoring
Remote sensing involves the use of active or passive techniques to measure different physical properties of the electromagnetic spectrum and to relate those observations to more traditional geophysical variables such as surface temperature and precipitation. Passive techniques use upwelling radiation from the Earth-atmosphere system in discrete portions of the spectrum (e.g., visible, infrared, and microwave) to retrieve physical properties of the system. Active techniques use a series of transmitted and returned signals to retrieve such information.
This is done by using the radiative transfer equation in the so-called forward and inverse model solutions. In the forward problem, sample geophysical variables, such as surface temperature and vertical temperature and moisture profiles, are input to the forward radiative transfer model. In the model, this information is combined with specified instrument error characteristics and responsivity to produce simulated radiances. The inverse radiative transfer problem starts with satellite-observed radiances. Because the inverse radiative transfer equation involves taking the inverse of an ill-conditioned matrix, a priori information, in the form of a first guess of the solution, is required to stabilize the matrix prior to inversion. The output of this process is geophysical retrievals. The ultimate understanding of the satellite or radar data requires full application of the forward and inverse problems and the impact of uncertainties associated with each step in the process.
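As a hedged illustration of the forward and inverse problems described above, one common linearized formulation can be written as follows. The abstract gives no equations, so the notation is an assumption: x is the geophysical state, y the observed radiances, F the forward radiative transfer model, x_a the first guess, K the Jacobian of F at x_a, S_ε the instrument error covariance, and S_a the covariance assigned to the first guess.

```latex
\begin{aligned}
\mathbf{y} &= F(\mathbf{x}) + \boldsymbol{\varepsilon}
  \approx F(\mathbf{x}_a) + \mathbf{K}\,(\mathbf{x} - \mathbf{x}_a) + \boldsymbol{\varepsilon}, \\
\hat{\mathbf{x}} &= \mathbf{x}_a
  + \left(\mathbf{K}^{\mathsf T}\mathbf{S}_\varepsilon^{-1}\mathbf{K} + \mathbf{S}_a^{-1}\right)^{-1}
    \mathbf{K}^{\mathsf T}\mathbf{S}_\varepsilon^{-1}\,\bigl(\mathbf{y} - F(\mathbf{x}_a)\bigr).
\end{aligned}
```

In this form the term S_a^{-1} is what stabilizes the otherwise ill-conditioned matrix K^T S_ε^{-1} K before it is inverted.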

Transcript of Presentation
MR. BATES: Thank you. I didn’t think we would be in the big room. It is nice to be in this building. I am going to mainly talk about some of our larger so-called massive data sets that we acquire now over the wire from both environmental satellites—the ones you see on the television news every night, the geostationary satellites.
NOAA, as well as Department of Defense, also operate polar-orbiting—that is, satellites that go pole to pole and scan across a swath of data on a daily basis. Also, right now, our biggest data stream coming in is actually the weather radar data, the precipitation animations that you see now nightly on your local news. In talking in terms of what we just heard, in terms of the different data sets that come in, they come in from all different sources.
The National Climatic Data Center is the official repository in the United States of all atmospheric weather-related data. As such, we get things like simple data streams, the automatic observing systems that give you temperature, moisture, cloud height at the Weather Service field offices. Those are mostly co-located now at major airports for terminal forecasting, in particular.
We have, in the United States, a set of what are called cooperative observers, about 3,000 people who have their own little backyard weather station, but it is actually an officially calibrated station. They take reports. Some of them phone them in, and deposit the data, and that is a rather old style way of doing things.
We have data that comes in throughout the globe, reports like that, upper-air reports from radiosondes, and then the higher-volume data now are the satellite and weather radar data. The United States operates nominally two geostationary satellites, one at 75 degrees West, one at 135 degrees West. The Japanese satellite, which is at 155 degrees East, is failing. So, we are in the process of actually moving one of our United States satellites out there. Then, of course, these polar-orbiting satellites.
I am mostly going to talk about the polar-orbiting satellites and some techniques we have used to analyze those data for climate signals. Those data sets started in about 1979, late 1978, and then go through the present.

This is what I want to talk about today, just give you a brief introduction of what we are thinking about as massive data is coming in, and we are responsible for the past data as well as the future planning for data coming in from the next generation of satellites. A couple, three examples of how we use some techniques, sometimes rather simplistic, but powerful, to look at the long-term climate trends, some time-space analysis—that is, when you have these very high spatial and temporal data sets, you would like to reduce the dimensionality, but yet still get something meaningful out about the system that we are trying to study. Then, just briefly talk about application of the radar data. I just inherited the radar data science there, and so, that is new stuff, and it has just really begun in terms of data mining. So, when you have rare events in the radar such as tornadic thunderstorms, how can we detect those? Then, just a couple of quick conclusions.
That is what we are talking about in terms of massive here. So, the scale is time from about 2002 projecting out about the next 15 years or so. This is probably conservative because we are re-examining this and looking at more data, probably, more products being generated than we had considered before. On the axis here is terabytes, because people aren’t really thinking of petabytes. Those numbers are really 10, 20, 30 petabytes. Right now, we have got a little over one petabyte and daily we are probably ingesting something like a terabyte.
The biggest data set coming in now is that we are getting the NEXRAD data—this is the weather radar data—from about 120 sites throughout the United States. We are getting about a third of that in real time. They used to come in on the little Exabyte 8-millimeter cassettes. For years, we used to just have boxes of those because there wasn’t a funded project to put this on mass store. In the last two years, we have had eight PC workstations, each having eight tape readers on them, to read back through all the data and get it into mass store. So, now we can get our hands on it.
So, the first lesson is accessibility of the data, and we are really just in a position now to be going through the data, because it is accessible. We are looking at data rates by 2010, on the order of the entire archive—this is cumulative archives. So, that is not data rate per year, so it is cumulative archive building up, of something over 30 petabytes by the year 2010 or so. So, that is getting fairly massive.
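The archive figures quoted above can be checked with a little arithmetic. The sketch below assumes a constant ingest of one terabyte per day, a starting archive of one petabyte in 2002, and the decimal convention of 1,000 terabytes per petabyte; the function name and all parameter values are illustrative, not figures from the NCDC projection itself.

```python
TB_PER_PB = 1000  # decimal convention; an assumption for this sketch


def archive_size_pb(start_pb, ingest_tb_per_day, years):
    """Archive size in petabytes after `years` at a constant daily ingest rate."""
    return start_pb + ingest_tb_per_day * 365 * years / TB_PER_PB


# Roughly the numbers quoted in the talk: about 1 PB on hand, about 1 TB per day.
size_2010 = archive_size_pb(start_pb=1.0, ingest_tb_per_day=1.0, years=8)
print(f"Archive in 2010 at a constant ingest rate: {size_2010:.2f} PB")
```

Under these assumptions the archive would stay near 4 petabytes, so a cumulative figure above 30 petabytes by 2010 implies that the daily ingest rate itself is expected to grow substantially.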
In terms of remote sensing, there is a philosophical approach, and I am not sure how many of you have worked with remote sensing data. There are two ways of looking at the data, sort of the data in the satellite observation coordinates or the geophysical space of a problem you want to deal with.
These are referred to variously as the forward problem and the inverse problem. Just very briefly, the forward problem, you have geophysical variables—temperature and moisture profiles of the atmosphere, the surface temperature, and your satellite is looking down into that system. So, using a forward model—a forward model being a radiative transfer model, physical model for radiative transfer in the atmosphere—you can simulate so-called radiances.
The radiances are what the satellite will actually observe. In the middle are those ovals that we want to actually work on, understanding the satellite data and then understanding the processes of climate, and then, in fact, improving modeling. As an operational service and product agency, NOAA is responsible for not just analyzing what is going on but, foolishly, we are attempting to predict things. Analogous to other businesses, we are in the warning business. The National Weather Service, of course, is bold enough to issue warnings.
However, when you issue warnings, you also want to look at things like false alarm rate. That is, you don’t want to issue warnings when, in fact, you don’t have severe weather, tornadoes, etc.
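To make the false-alarm idea concrete, here is a minimal sketch of scores computed from a two-by-two table of warnings against observed events. The counts are invented for illustration, and the function name is hypothetical. Note that forecast verification distinguishes the false alarm ratio (the fraction of warnings that did not verify) from the false alarm rate (the fraction of non-events that drew a warning); the talk does not say which is meant, so both are shown.

```python
def warning_scores(hits, false_alarms, misses, correct_negatives):
    """Basic verification scores from a 2x2 warning/event contingency table.

    Assumes every denominator is nonzero.
    """
    return {
        # Fraction of issued warnings with no severe weather.
        "false_alarm_ratio": false_alarms / (hits + false_alarms),
        # Fraction of non-events for which a warning was issued.
        "false_alarm_rate": false_alarms / (false_alarms + correct_negatives),
        # Fraction of severe events that were warned.
        "probability_of_detection": hits / (hits + misses),
    }


# Hypothetical counts, for illustration only.
scores = warning_scores(hits=30, false_alarms=70, misses=10, correct_negatives=890)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```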
The other aspect of the problem, the so-called inverse problem—so, starting from the bottom there—you take the satellite radiances and we have an inverse model that is usually a mathematical expression for the radiative transfer equation which is non-linear. We linearize the problem. Then we have a linear set of equations. The inverse model, then, is an inverse set of equations. The matrix is usually ill conditioned. So, we go to those yellow boxes and condition the matrix by adding a priori information, a forecast first guess, other a priori data, and then biases to somehow normalize the data set. We invert that to get geophysical retrieval. Then we can retrieve temperature and moisture profiles. We can retrieve surface properties, surface temperature, ocean wind speed, other geophysical quantities of interest.
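The effect of conditioning the matrix with a first guess can be seen in a toy two-variable example. This is only a sketch under stated assumptions: the matrix `k`, the noise, the first guess, and the weight are all made up, and the penalty on departures from the first guess stands in for whatever a priori constraints an operational retrieval would use.

```python
def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - b[0] * a[1][0]) / det]


def retrieve(k, y, first_guess, weight):
    """Minimize |k x - y|^2 + weight * |x - first_guess|^2 for a 2x2 matrix k."""
    # Normal equations: (K^T K + weight * I) x = K^T y + weight * first_guess
    lhs = [[sum(k[r][i] * k[r][j] for r in range(2)) + (weight if i == j else 0.0)
            for j in range(2)] for i in range(2)]
    rhs = [sum(k[r][i] * y[r] for r in range(2)) + weight * first_guess[i]
           for i in range(2)]
    return solve2(lhs, rhs)


# Nearly singular "linearized radiative transfer" matrix (hypothetical).
k = [[1.0, 1.0], [1.0, 1.0001]]
truth = [1.0, 2.0]
# Simulated radiances: k applied to the truth, plus a small amount of noise.
y = [3.0 + 0.001, 3.0002 - 0.001]

naive = solve2(k, y)  # direct inversion amplifies the noise
stabilized = retrieve(k, y, first_guess=[0.9, 2.1], weight=0.01)
print("direct inversion:", naive)
print("with first guess:", stabilized)
```

In this toy case the direct inversion lands far from the true state even though the noise is tiny, while the stabilized solution stays close to it. The price is that the answer leans on the first guess in the direction the data cannot resolve.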
So, the first application, detection of long-term climate trends using environmental satellite data, the issue of global warming has really surfaced in the last 10 years. We would like to know, is the Earth warming, how much. Are systems changing? How much? Is there more severe weather? That would just be an issue with the extremes in a distribution. You know, certain weather events are normal distributions. Certain aren’t. Precipitation is not normally distributed by any sense of the imagination. We get far fewer events of extreme rainfall—precipitation—than we do of light precipitation. So, it is more of a log normal distribution. We would like to know, are the extremes changing. So, that is a small portion of those distributions.
With satellite data, we face a couple of unique problems. First, we are sensing the atmosphere with satellites that have a lifetime of three to five years. So, we need to create a so-called seamless time series so that, when you apply time-space analysis techniques, you are not just picking up artifacts of the data that have to do with a different satellite coming on line. We use a three-step approach to that, something we call the nominal calibration. That is where you take an individual satellite, do the best you can with that satellite in terms of calibrating it, normalizing the satellites, and I will show you what that means. We have different satellites with different biases. Often, different empirical techniques are used to stitch those together in a so-called seamless manner. We would like an absolute calibration. That, of course, is very difficult, because what is absolute, what is the truth?
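One simple way to picture the normalization step is to estimate the offset between two satellites from the months in which both were operating and then remove it. The sketch below assumes that approach; the talk only says that different empirical techniques are used, so this is an illustration rather than the method actually applied. The month indices, brightness temperatures, and function names are hypothetical.

```python
from statistics import mean


def overlap_offset(ref, other):
    """Mean difference (other - ref) over the months observed by both satellites."""
    common = sorted(set(ref) & set(other))
    if not common:
        raise ValueError("the two records have no overlap period")
    return mean(other[m] - ref[m] for m in common)


def stitch(ref, other):
    """Extend the reference record with the other satellite, bias removed."""
    offset = overlap_offset(ref, other)
    merged = dict(ref)
    for month, value in other.items():
        # Keep the reference value where both exist; otherwise use adjusted data.
        merged.setdefault(month, value - offset)
    return merged, offset


# Hypothetical monthly mean brightness temperatures (kelvin), keyed by month index.
sat_a = {0: 240.0, 1: 240.2, 2: 240.1, 3: 240.3}
sat_b = {2: 240.6, 3: 240.8, 4: 240.7, 5: 240.9}

merged, offset = stitch(sat_a, sat_b)
print(f"estimated intersatellite offset: {offset:.2f} K")
print(merged)
```

A constant offset is the simplest possible bias model. It removes the step that would otherwise appear when a new satellite comes on line, but it does nothing about drift within a single satellite's lifetime and does not supply an absolute calibration.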
Then, we would like to apply some consistent algorithm. In the infrared, when you are remote sensing in the infrared, and you are looking down at the atmosphere from space, in the infrared, clouds are opaque. So, in order to sense the temperature and moisture profile down to the surface, you have to choose or detect the cloud-free samples. So, you have to have a threshold that tells you, this is cloudy, this is clear. You can base that on a number of different characteristics about the data, usually time and spatial characteristics, time and space variability. Clouds move, the surface tends to be more constant in temperature. Not always, but the oceans certainly do. So, you use those statistical properties about changes in time and space of the data set, to allow you to identify clouds. You have to look at navigation. You have to do all kinds of error checks, and then build retrievals to go from your radiance space to your geophysical space.
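A threshold test of the kind described can be sketched with spatial variability alone. The example below flags a pixel as clear when its 3x3 neighborhood is nearly uniform and the pixel itself is warm. Both thresholds and the scene values are hypothetical, and a real scheme would also use variability in time, as the talk notes.

```python
from statistics import pstdev


def clear_sky_mask(scene, max_std_k=0.5, min_temp_k=285.0):
    """Return a grid of booleans: True where a pixel looks cloud-free.

    A pixel passes when the standard deviation of its 3x3 neighborhood
    (clipped at the edges) is small and its own temperature is warm.
    """
    rows, cols = len(scene), len(scene[0])
    mask = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            window = [scene[r][c]
                      for r in range(max(0, i - 1), min(rows, i + 2))
                      for c in range(max(0, j - 1), min(cols, j + 2))]
            uniform = pstdev(window) <= max_std_k
            warm = scene[i][j] >= min_temp_k
            mask[i][j] = uniform and warm
    return mask


# Hypothetical infrared window brightness temperatures (kelvin);
# the right-hand column is a cold cloud over a uniform ocean surface.
scene = [[290.0, 290.1, 289.9, 270.0],
         [290.2, 290.0, 290.1, 268.0],
         [290.1, 289.9, 290.0, 272.0]]

mask = clear_sky_mask(scene)
for row in mask:
    print(row)
```

Pixels next to the cloud edge are rejected as well as the cloud itself, because their neighborhoods are no longer uniform. That conservative behavior is usually what a clear-sky retrieval wants.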
Then, we get, finally, into the fun part, exploratory data analysis. I tend to view this as sort of my tool kit out there in the shop working on a data set where, you know, you throw things at it and see what sticks. Once you get something that looks interesting, you start to formulate hypotheses about the physical system, how it works, and how your data set compares to what physics of the problems say are possible solutions. Then you might go on to look at data analysis and confirm your hypothesis.
Anyway, let’s go through the first step here. I am going to take more time with this first example and a little less with the second and just briefly go into the third one.
So, creation of seamless time series, you have here three different channels of data from a satellite, channel 8, that is an infrared window, channel 10 is actually a moisture channel in the upper atmosphere, a so-called water vapor channel, and these channels—10, 11, 12—are all water vapor channels. We look at emission lines of water vapor in the atmosphere. Channel 12 in particular we are going to look at because it is involved with a so-called water vapor feedback mechanism in global warming. In global warming, we hear these numbers quoted—atmospheric, oh, the temperature is going to go up two degrees in 100 years. Actually, anthropogenic CO2 manmade gasses only contribute one of the two degrees there. The extra warming, the other one degree of warming, comes from a so-called water vapor feedback. So, there has been a lot of controversy in the community about, does this water vapor feedback, in fact, work this way or not.
So, the different colors in these three charts, then, I am showing three things. One is the average global temperature over time, and this is a 20-year data set. So, the top line in each one of these is just your monthly mean data point for each of these satellites over time, about a 20-year time series on each one. These are four different channels. The mean—you see the march of the annual cycle up and down—the standard deviation of the data set, and just simply the number of observations, these are something like millions of observations a month—you can’t read that scale here, this is times 10^6. So, on the order of, you know, 5 or 6 million or so observations a month coming down.
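The three quantities plotted for each channel (monthly mean, standard deviation, and number of observations) amount to a grouped summary. A minimal sketch, assuming each observation arrives as a hypothetical `(year, month, value)` record:

```python
from collections import defaultdict
from statistics import mean, pstdev


def monthly_summary(observations):
    """Group (year, month, value) records; return (mean, std, count) per month."""
    groups = defaultdict(list)
    for year, month, value in observations:
        groups[(year, month)].append(value)
    return {key: (mean(vals), pstdev(vals), len(vals))
            for key, vals in sorted(groups.items())}


# A handful of made-up brightness temperatures (kelvin); the real series
# has millions of observations per month.
obs = [(1981, 1, 240.0), (1981, 1, 242.0),
       (1981, 2, 241.0), (1981, 2, 243.0), (1981, 2, 245.0)]

summary = monthly_summary(obs)
for (year, month), (avg, std, n) in summary.items():
    print(f"{year}-{month:02d}: mean={avg:.2f} K, std={std:.2f} K, n={n}")
```

At the data volumes described in the talk the lists would not fit comfortably in memory, so running sums of the values and their squares would be the more practical design.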
This is from the polar-orbiting satellite. So, these have sampled the entire planet. The geostationary only sample that region that they are over. You can see a bit of the problem here, especially in this channel 12 where, number one, there are offsets between the different colors. The different colors are the different satellites. Over this time period, there are eight different satellites. They are designated by this NOAA-7, 8, 9, 10,

blended field is consistent with power-law spectral properties observed by the QSCAT. The third panel shows the wind stress curl for the blended field.
The blended winds have been used to drive regional and global ocean model simulations. Milliff et al. (1999) demonstrated realistic enhancements to the response of a relatively coarse-resolution ocean general circulation model (OGCM) to the higher-wavenumber winds in the blended product. Higher resolution OGCM experiments are in progress now.
Bayesian Inference for Surface Winds in the Labrador Sea
The Labrador Sea is one of a very few locations in the world ocean where surface exchanges of heat, momentum and fresh water can drive the process of ocean deep convection. Ocean deep convection can be envisioned as the energetic downward branch of the so-called global ocean conveyor belt cartoon for the thermohaline general circulation that is important in the dynamics of the Earth climate. The energetic exchanges at the surface are often associated with polar low synoptic events in the Labrador Sea.
A Bayesian statistical model has been designed to exploit the areal coverage of scatterometer observations, and provide estimates of uncertainty in the surface vector wind inferences for the Labrador Sea. Here, the scatterometer system is the NASA Scatterometer or NSCAT system that preceded QSCAT. It has proved convenient to organize the Bayesian model components in stages. Data Model Stage distributions are specified almost directly from precise information that naturally arises in the calibration and validation of satellite observing systems. The Prior Model Stage (stochastic geostrophy) invokes a simple autonomous balance between surface pressure (a hidden process in our model) and the surface winds. The posterior distribution for the surface vector winds is obtained from the output of a Gibbs sampler.
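The staged structure described above can be illustrated with a deliberately tiny Gibbs sampler. This is a toy under stated assumptions, not the model of Royle et al. (1998): a single wind value `u` is observed with Gaussian error (data stage), `u` is tied to a hidden scalar pressure term `p` through a linear relation standing in for stochastic geostrophy (prior stage), and `p` has a zero-mean Gaussian prior. All names, variances, and the coefficient `a` are hypothetical.

```python
import random


def gibbs_wind(d, a=1.0, var_obs=1.0, var_wind=1.0, var_pres=4.0,
               n_iter=20000, burn_in=1000, seed=0):
    """Sample u from p(u | d) by alternating the two full conditionals.

    Model (toy): d | u ~ N(u, var_obs), u | p ~ N(a * p, var_wind),
    p ~ N(0, var_pres).
    """
    rng = random.Random(seed)
    u, p = 0.0, 0.0
    samples = []
    for it in range(n_iter):
        # Full conditional of the wind given the pressure term and the datum.
        prec_u = 1.0 / var_obs + 1.0 / var_wind
        mean_u = (d / var_obs + a * p / var_wind) / prec_u
        u = rng.gauss(mean_u, (1.0 / prec_u) ** 0.5)
        # Full conditional of the hidden pressure term given the wind.
        prec_p = 1.0 / var_pres + a * a / var_wind
        mean_p = (a * u / var_wind) / prec_p
        p = rng.gauss(mean_p, (1.0 / prec_p) ** 0.5)
        if it >= burn_in:
            samples.append(u)
    return samples


samples = gibbs_wind(d=3.0)
posterior_mean = sum(samples) / len(samples)
print(f"posterior mean wind from the sampler: {posterior_mean:.2f}")
```

Because every stage here is Gaussian, the posterior mean is also available in closed form: with the default values the marginal prior variance of `u` is 5, so the mean is 3 × 5 / 6 = 2.5, which gives a check on the sampler. The real model needs sampling precisely because no such closed form exists for it.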
An application of the Labrador Sea model for surface winds will be described at the end of this presentation. The first documentation of this model appears in Royle et al. (1998).
Bayesian Hierarchical Model for Surface Winds in the Tropics
The Bayesian Hierarchical Model (BHM) methodology is extended in a model for tropical surface winds in the Indian and western Pacific Ocean that derives from Chris Wikle’s postdoctoral work (Wikle et al., 2001). Here, the Data Model Stage reflects measurement error distributions for QSCAT in the tropics as well as for the surface winds from the NCEP analysis. The Prior Model Stage is prescribed in two parts. For large scales, the length scales and propagation of the leading modes of the equatorial β-plane are used. At smaller scales, once again, we invoke a wavelet decomposition constrained by the power-law behavior for wavenumber spectra in the tropics.
A recent implementation of this model generates 50 realizations of the surface wind field, 4 times per day, at 50 km resolution, for the domain 20° N to 20° S, 40° E to 180° E, for the QSCAT data record for the calendar year 2000. Figure 2 depicts snapshots of five randomly selected realizations for zonal wind and divergence fields for 25 December 1999 at 0000 UTC. Differences are smallest in regions recently sampled by QSCAT. This implies that the uncertainty in the observations is smaller than the uncertainty in the approximate physics assigned in the prior model stage.
Surface convergence in the tropics is a critical field in the analysis of the atmospheric deep convection process. However, single realizations of this field are rarely useful because divergence is an inherently noisy field. The production of 50 physically sensible realizations can begin to quantify the space-time properties of the signal vs. noise. The first use of this dataset will be to diagnose surface convergence patterns associated with the Madden-Julian Oscillation (MJO) in the regions where the MJO is connected to the surface by atmospheric deep convection.
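One way to use a set of realizations to separate signal from noise is to compare the ensemble mean with the ensemble spread at each grid cell. The sketch below assumes that simple diagnostic; the abstract does not say how the 50 realizations will be summarized, and the three short "fields" are invented values.

```python
from statistics import mean, stdev


def ensemble_summary(realizations):
    """Per-grid-cell mean and spread across equally likely realizations."""
    cells = list(zip(*realizations))  # regroup values by grid cell
    return [mean(c) for c in cells], [stdev(c) for c in cells]


# Three hypothetical realizations of divergence at three grid cells.
realizations = [[1.0, -2.0, 0.5],
                [1.2,  2.0, 0.4],
                [0.8,  0.0, 0.6]]

cell_mean, cell_spread = ensemble_summary(realizations)
for m, s in zip(cell_mean, cell_spread):
    print(f"mean={m:+.2f}  spread={s:.2f}  |mean|/spread={abs(m) / s:.1f}")
```

In the middle cell the realizations disagree even in sign, so the mean carries little information there. In the other two cells the mean is several times the spread, which is the kind of contrast a single noisy realization cannot show.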
A Bayesian Hierarchical Air-Sea Interaction Model
The Bayesian Hierarchical Model methods extend naturally to multi-platform observations and complex physical models of air-sea interactions. Berliner et al. (2002) demonstrate a prototype air-sea interaction BHM for a test case that mimics polar low propagation in the Labrador Sea, given both simulated altimeter and scatterometer observations. Hierarchical thinking leads to the development of a Prior Model distribution for the surface ocean streamfunction that is the product of an ocean-given-atmosphere model and a model for the atmosphere. The Prior Model stage for the atmosphere is a model similar to the Labrador Sea wind model introduced above.
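The factorization described in this paragraph can be written compactly in bracket notation for probability distributions. The symbols are assumptions introduced here: O is the ocean streamfunction process, A the atmospheric (wind) process, D_alt the altimeter data, and D_scat the scatterometer data. The sketch also assumes that each data set depends only on its own process and that the two data sets are conditionally independent, which the abstract does not state.

```latex
\begin{aligned}
[\,O, A\,] &= [\,O \mid A\,]\,[\,A\,], \\
[\,O, A \mid D_{\mathrm{alt}}, D_{\mathrm{scat}}\,] &\propto
  [\,D_{\mathrm{alt}} \mid O\,]\;[\,D_{\mathrm{scat}} \mid A\,]\;[\,O \mid A\,]\;[\,A\,].
\end{aligned}
```

The first line is the Prior Model distribution as a product of an ocean-given-atmosphere model and an atmosphere model; the second combines it with the Data Model Stage to give the posterior up to a normalizing constant.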
Figure 3 compares the evolution of the ocean kinetic energy distribution in the air-sea BHM with a case from which all altimeter data have been excluded. The BHM resolutions are 3 times coarser in space and O(1000) times coarser in temporal resolution with respect to a high-resolution “truth” experiment also shown in the comparison. Also, the “truth” fields are computed in a physical model that incorporates more sophisticated physics than those that form the basis of the Prior Model Stage in the air-sea BHM. Implications of this methodology to data assimilation issues in coupled general circulation models will be discussed.
References
Berliner, L.M., R.F. Milliff and C.K. Wikle, 2002: “Bayesian hierarchical modelling of air-sea interaction”, J. Geophys. Res., Oceans, in press.
Chin, T.M., R.F. Milliff and W.G. Large, 1998: “Basin-scale, high-wavenumber sea surface wind fields from multi-resolution analysis of scatterometer data”, J. Atmos. Ocean. Tech., 15, 741–763.
Milliff, R.F., M.H. Freilich, W.T. Liu, R. Atlas and W.G. Large, 2001: “Global ocean surface vector wind observations from space”, in Observing the Oceans in the 21st Century, C.J. Koblinsky and N.R. Smith (Eds.), GODAE Project Office, Bureau of Meteorology, Melbourne, 102–119.
Milliff, R.F., W.G. Large, J. Morzel, G. Danabasoglu and T.M. Chin, 1999: “Ocean general circulation model sensitivity to forcing from scatterometer winds”, J. Geophys. Res., Oceans, 104, 11337–11358.
Royle, J.A., L.M. Berliner, C.K. Wikle and R.F. Milliff, 1998: “A hierarchical spatial model for constructing wind fields from scatterometer data in the Labrador Sea”, in Case Studies in Bayesian Statistics IV, C. Gatsonis, R.E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli and M. West (Eds.), Springer-Verlag, 367–381.
Wikle, C.K., R.F. Milliff, D. Nychka and L.M. Berliner, 2001: “Spatiotemporal hierarchical Bayesian modeling: Tropical ocean surface winds”, J. Amer. Stat. Assoc., 96(454), 382–397.
Figure Captions
Table 1. Past, present, and planned missions to retrieve global surface vector wind fields from space (from Milliff et al., 2001). The table compares surface vector wind accuracies with respect to in-situ buoy observations. Launch dates for SeaWinds on ADEOS-2 and Windsat on Coriolis have slipped to 14 and 15 December 2002, respectively.
Figure 1. Three-panel depiction of the statistical blending method for surface winds from scatterometer and weather-center analyses. Panel (a) depicts the wind stress curl for the weather-center analyses on 24 January 2000 at 1800 UTC. Wind stress curl from QSCAT swaths within a 12-hour window centered on this time are superposed on the weather-center field in panel (b). Panel (c) depicts the wind stress curl for the blended field. Derivative fields such as wind stress curl are particularly sensitive to unrealistic boundaries in the blended winds.
Figure 2. A Bayesian Hierarchical Model is used to infer surface vector wind fields in the tropical Indian and western Pacific Oceans, given surface winds from QSCAT and the NCEP forecast model. Five realizations from the posterior distribution for (left) zonal wind and (right) surface divergence are shown for the entire domain on 30 January 2001 at 1800 UTC. The two panels in the first row are zonal wind and divergence from the first realization. Subsequent rows are zonal wind differences and divergence differences with respect to the first realization. The differences are for realizations 10, 20, 30, and 40 from a 50 member ensemble of realizations saved from the Gibbs sampler.
Figure 3. Summary plots for the Air-Sea interaction Bayesian hierarchical model (from Berliner et al., 2002). The basin average ocean kinetic energy distributions as functions of time are compared with a single trace (solid) from a “truth” simulation described in the text. The posterior mean vs. time (dashed) is indicated in panel (a) for the full air-sea BHM, and in panel (b) for an air-sea BHM from which all pseudo-altimeter data have been excluded. Panels (c-f) compare BHM probability density function estimates at days 1, 3, 5, and 7.

| Mission | Measurement approach | Swath (km) / daily cov. | Resolution (km) | Accuracy (wrt buoys) | URL (http://) |
|---|---|---|---|---|---|
| ERS-1/2 AMI 4/91–1/01 | C-BAND SCATT. | 500 / 41% | 50 (~70) | 1.4–1.7 m/s rms spd; 20° rms dir; ~2 m/s random comp. | earth.esa.int |
| ASCAT/METOP | C-BAND SCATT. | 2×550 / 68% | 25, 50 | Better than ERS | esa.int/esa/progs/www.METOP.html |
| NSCAT 9/96–6/97 | Ku-BAND SCATT. (fan beam) | 2×600 / 75% | (12.5), 25, 50 | 1.3 m/s (1–22 m/s) spd; 17° (dir); 1.3 random comp. | winds.jpl.nasa.gov/missions/nscat |
| SeaWinds/QuickSCAT 7/99–present | Ku-BAND SCATT. (dual conical scan) | 1600 / 92% (1400) | 12.5, 25 | 1.0 m/s (3–20 m/s) spd; 25° (dir); 0.7 random comp. | winds.jpl.nasa.gov/missions/quickscat |
| SeaWinds/ADEOS-2 2/02 | Ku-BAND SCATT. (w/u-wave Rad.) | 1600 / 92% (1400) | (12.5), 25 | Better than QuickSCAT | winds.jpl.nasa.gov/missions/seawinds |
| WINDSAT/CORIOLIS 3/02 | DUAL-LOOK POL. RAD. | 1100 / ~70% | 25 | ±2 m/s or 20% spd; ±20°?? | www.ipo.noaa.gov/windsat.html |
| CMIS/NPOESS 2010? | SINGLE-LOOK PO. RAD. | 1700 / >92% | 20 | ±2 m/s or 20% spd; ±20°?? (5–25 m/s) | www.ipo.noaa.gov/cmis.html |

Summary
- Global and regional surface wind datasets from spaceborne scatterometers are “massive” and important for climate and weather. Applications require:
  - regular grids
  - uniform spatial O(10 km) and temporal O(diurnal) resolution
- Blended scatterometer and weather-center analyses provide global, realistic high-wavenumber surface winds
  - impose spectral constraints via multi-resolution wavelets
- Bayesian Hierarchical Models to exploit massive remote sensing datasets
  - measurement error models from cal/val studies (likelihoods)
  - process models from GFD (priors)
  - advances in MCMC
- Tropical Winds Example (Wikle et al. 2001)
- Bayesian Hierarchical Model for Air-Sea Interaction (Berliner et al. 2002)
  - multi-platform data from scatterometer and altimeter
  - stochastic geostrophy (atmos) and quasi-geostrophy (ocean) priors
  - MCMC to ISMC linkage for posteriors
  - term-by-term uncertainty
  - realistic covariance structures
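
The pairing of likelihoods with measurement error models and priors with process models in the summary above matches the usual three-stage factorization of a Bayesian hierarchical model. As a sketch in generic notation (the symbols $D$ for the observations, $W$ for the wind or ocean process, and $\theta$ for the parameters are assumptions for illustration and do not come from the slides):

```latex
\[
\underbrace{[\,W, \theta \mid D\,]}_{\text{posterior}}
\;\propto\;
\underbrace{[\,D \mid W, \theta\,]}_{\text{data model (cal/val likelihood)}}
\;\times\;
\underbrace{[\,W \mid \theta\,]}_{\text{process model (GFD prior)}}
\;\times\;
\underbrace{[\,\theta\,]}_{\text{parameter model}}
\]
```

Here $[\,A \mid B\,]$ denotes the conditional distribution of $A$ given $B$. MCMC methods such as the Gibbs sampler draw realizations from the posterior on the left without needing its normalizing constant.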

Report from Breakout Group
Instructions for Breakout Groups
MS. KELLER-MCNULTY: There are three basic questions, issues, that we would like the subgroups to come back and report on.
First of all, what sort of outstanding challenges do you see relative to the collection of material that was in the session? In particular there, we heard in all these cases that there are real specific constraints on these problems that have to be taken into consideration. We can’t just assume we get the process infinitely fast, whatever we want.
The second thing is, what are the needed collaborations? It is really wonderful today. So far, we are hearing from a whole range of scientists. So, what are the needed collaborations to really make progress on these problems?
Finally, what are the mechanisms for collaboration? You know, Amy, for example, had a whole list of suggestions with her talk.
So, the three things are the challenges, what are the scientific challenges, what are the needed collaborations, and what are some ideas on mechanisms for realizing those collaborations?
Report from Atmospheric and Meteorological Data Breakout Group
MR. NYCHKA: The first thing that the reporter has to report is that we could not find another reporter except for me. I am sorry, I was hoping to give someone the opportunity, but everybody shrank from it.
So, we tried to keep on track on the three questions. I am sure that the other groups realized how difficult that was.
Let me first talk about some technical challenges. The basic product you get out of this is a field. It is maybe a variable collected over space and time. There are some just basic statistical problems of how you summarize those in terms of probability density functions, if you have multiple samples of those, how you manipulate them, and also deal with them. Also, if you wanted to study, say, like a particular variable under an El Niño period versus a La Niña period, all those kinds of conditioning issues. So, that is basically, sort of very mainstream space-time statistics.
Another important component that came out of this is the whole issue of uncertainty. This is true in general, and there was quite a bit of discussion about aligning these efforts with the Climate Change Research Initiative, which is a very high level kind of organized effort by the U.S. government to study climate. Uncertainty measures are an important part of that, and no surprise that the typical deterministic geophysical community tends to sort of ignore these, but it is something that needs to be addressed.
There was also sort of the sentiment that one limitation is partly people’s backgrounds. People use what they are familiar with. What they tend to do is limited by the tools that they know. They are sort of reticent to take on new tools. So, you have this sort of vicious circle that you only do things that you know how to do. I think an interesting thing came out of this, and let me highlight it as a very interesting technical challenge: it is one of these curious things where, all of a sudden, a massive data set no longer becomes very massive. What John was bringing up is that these large satellite records typically have substantial non-zero biases, even when you average them. These biases are actually a major component of using these records. So, a typical bias would come simply from changing the satellite platform that is measuring a particular remotely sensed variable, and you can see a level shift or some other artifact. In terms of combining different satellites, you need to address this. These biases need to be addressed empirically as an important problem.
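
The level shift at a platform change can be illustrated with a very small sketch. This is not a method described by the group; it assumes the changeover index is known, that the underlying signal is stationary across the change, and that errors are independent. The function names and the synthetic `record` are hypothetical. In practice an overlap period or an independent reference such as buoys would be needed to separate a platform bias from a real change in the variable.

```python
from math import sqrt
from statistics import mean, stdev


def level_shift(values, changeover):
    """Estimate the mean offset introduced at a known platform change.

    values: time-ordered list of measurements.
    changeover: index of the first value from the new platform.
    Returns (shift, standard_error), assuming independent errors.
    """
    before, after = values[:changeover], values[changeover:]
    if len(before) < 2 or len(after) < 2:
        raise ValueError("need at least two values on each side of the changeover")
    shift = mean(after) - mean(before)
    std_err = sqrt(stdev(before) ** 2 / len(before) + stdev(after) ** 2 / len(after))
    return shift, std_err


def remove_shift(values, changeover):
    """Return a copy of the record with the estimated shift subtracted
    from every value recorded after the changeover."""
    shift, _ = level_shift(values, changeover)
    return values[:changeover] + [v - shift for v in values[changeover:]]


# Synthetic record (made-up numbers): the second platform starts at index 5.
record = [10.0, 10.2, 9.8, 10.1, 9.9, 10.5, 10.7, 10.3, 10.6, 10.4]
shift, std_err = level_shift(record, changeover=5)
adjusted = remove_shift(record, changeover=5)
print(f"estimated shift: {shift:.2f} +/- {std_err:.2f}")
```

Reporting the standard error alongside the shift keeps the adjustment from being treated as exact when records from different satellites are combined.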
The other technical challenge is reducing data. This is another interesting thing about massive data sets, that part of the challenge here is to make them useful. In order to make them useful, you have to have some idea of what the clientele is. We have had some discussion about being careful about that, that you don’t want to sort of create some kind of summary of the data and have that not be appropriate for part of the user community. The other thing is, whatever summary is done, the assumptions used to make it should be overt, and also there should be measures of uncertainty along with it.
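
One simple way to picture a data summary that carries its own uncertainty is a gridded mean with a standard error and a sample count for each cell. The sketch below is only an illustration: the function name, the `(lat, lon, value)` layout, and the 1-degree default cell size are all assumptions, and the standard error assumes independent observations, which is optimistic for spatially correlated satellite data.

```python
from collections import defaultdict
from math import floor, sqrt
from statistics import mean, stdev


def grid_summary(obs, cell_deg=1.0):
    """Reduce point observations to per-cell summaries.

    obs: iterable of (lat, lon, value) tuples.
    cell_deg: grid cell size in degrees (assumed regular lat/lon grid).
    Returns {(row, col): (mean, standard_error, count)}; the standard
    error is None for cells with a single observation.
    """
    cells = defaultdict(list)
    for lat, lon, value in obs:
        key = (floor(lat / cell_deg), floor(lon / cell_deg))
        cells[key].append(value)

    summary = {}
    for key, vals in cells.items():
        n = len(vals)
        # Overt assumption: independent errors within a cell.
        std_err = stdev(vals) / sqrt(n) if n > 1 else None
        summary[key] = (mean(vals), std_err, n)
    return summary


# Made-up observations: two fall in cell (0, 0), one in cell (1, 0).
observations = [(0.2, 0.3, 5.0), (0.7, 0.9, 7.0), (1.5, 0.1, 3.0)]
summary = grid_summary(observations)
for cell, (avg, std_err, count) in sorted(summary.items()):
    print(cell, avg, std_err, count)
```

Keeping the count and the standard error next to each mean lets different users judge whether a given cell is adequate for their purpose.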
Collaborations, I think for this we didn’t talk about this much, because I think they were so obvious. Obviously, the collaborators should be people in the geophysical community that actually work and compile this data with the statisticians.
Some obvious centers are JPL, NCAR, NOAA—Ralph, do you volunteer CORA as well?
AUDIENCE: Sure.
MR. NYCHKA: John, NCDC, I am assuming you will accept visitors if they show up.
AUDIENCE: Sure will. It is a great place to be in the summer, between the Blue Ridge and the Great Smokies.
MR. NYCHKA: Okay, so one thing statisticians should realize is that there are these centers of concentrations of geophysical scientists, and they are great places to visit. The other collaboration that was brought up is that there needs to be some training of computer science in this. The other point, coming back to the Climate Change Research Initiative, is that this is another integrator, in terms of identifying collaborations. In terms of how to facilitate these collaborations, one suggestion was post docs, in particular post docs at JPL.
I tried to steer the discussion a little bit, just to test the waters. What I suggested is some kind of regular process where there are meetings that people can anticipate. I am thinking sort of along the interface model or research conference model. It seems like the knee-jerk reaction in this is simply that people identify an interesting area and declare, “Let’s have a workshop.” We have the workshop, people get together, and then that is it. It is sort of the final point in time. I think John agreed with me, in particular, that a single workshop isn’t the way to address this. So, I am curious about pursuing a sort of more regular kind of series of meetings. Okay, and that is it.