6
Status of Data Archives, Access, and Future Directions
In this chapter the status of available archived model-assimilated data sets (MADS) and supporting observed data is discussed, including the flow of data to archives and the accessibility of data by users. The need for a nationally focused and integrated archive system for assimilated data sets, with provision for ready access, is then examined. A future view of access to data sets, including software needs, is also presented.
AVAILABLE ANALYSES AND OBSERVATIONS
Only a brief outline of available observational data and model analyses is attempted in this report. An in-depth effort is needed to locate, evaluate, and compile available geophysical data resources and model analyses prior to development of a state-of-the-art model-assimilated data set extending back to about 1950, as recommended by the panel.
Atmospheric Analyses
Some surface analyses exist for the years since 1900. Archives of some upper-air analyses are available for the years since 1946. Daily model assimilation products exist for the years since about 1960. A daily set that includes ocean surface flux terms and model radiative terms is available for the years since 1985.
Ocean Analyses
Various analyses of surface fields (sea surface temperature [SST], stress, etc.) exist for periods ranging from 20 years to a century. Model-assimilated data have been available only for a few years and are not yet global. However, global ocean MADS, starting about 1985, are expected to become available in 1991.
Hydrological Analyses
River basin simulations have been run for more than 10 years for the United States. To the panel's knowledge, the fields have not been saved.
Paleoclimate Reconstructions
"Observed" fields such as topography, ice extent, and SST for the peak of the ice age (18,000 years ago) have been prepared for the world. Several climate models have been run using these fields as boundary conditions.
Surface Observations (Meteorology, Hydrology, Oceanography)
Archives of surface marine observations exist for the years 1854 to the present, with gaps during the two world war periods. A major project, the Comprehensive Ocean-Atmosphere Data Set (COADS), involving cooperation among the National Center for Atmospheric Research (NCAR), the National Oceanic and Atmospheric Administration/Environmental Research Laboratories (NOAA/ERL), the National Climatic Data Center (NCDC), and the University of Colorado's Cooperative Institute for Research in Environmental Sciences (CIRES), is under way to improve this data set. The main world archives of surface land data sets begin with the year 1967. A considerable amount of earlier data on tape also exists in diverse locations. In many areas, surface observations were routinely recorded at least by 1880, but the data would be difficult to compile and digitize.
The U.S. Geological Survey maintains a very good archive of daily and monthly river flow data for the United States. Monthly river flow data exist for several hundred sites in other parts of the world, but global coverage is far from complete.
Upper-Air Data (Meteorology)
Archives of global sets of daily upper-air rawinsonde observations start with the year 1957, with a gap in much southern hemispheric data from 1963 to 1966. Data from Australian and U.S. rawinsonde stations are archived
for the years from the late 1940s to the present. Archiving of satellite sounder data started in 1969, but there is a gap for 1971–1972.
These archives provide starting points for a state-of-the-art assimilation analysis effort. The archives exist because of patient, low-cost data-gathering efforts at a few centers. For purposes of reanalysis, archives should include delayed ship reports and data from isolated locations that were not included in real-time data sets. These efforts need enhancement. All available world data resources should be brought together in a nationally focused effort to develop an assimilation data archive system.
RESOURCES NEEDED FOR ARCHIVING MODEL-ASSIMILATED DATA SETS
New model assimilation products will continually become available. Additional efforts are needed at the NCAR, the NCDC, and the National Oceanographic Data Center (NODC) for efficient archiving of MADS. In addition, major MADS producers such as the National Meteorological Center (NMC), the European Centre for Medium Range Weather Forecasts (ECMWF), and others will need to apprise the archival centers of significant model attributes, accessing and imaging guidelines, and changes associated with the data.1
Resources Needed to Prepare Older Observed Data
Major efforts will be necessary to prepare older observed data for input to new analyses. An enhanced effort is needed in several discipline areas: marine ship data, upper-air meteorological data, and many types of land surface data (synoptic, solar, river discharge, soil moisture, etc.). An enhanced effort is needed for data sets that span more than a decade. International cooperation is essential; some of the needed coordination can be handled by the World Climate Research Program (WCRP) and bilateral programs.
ARCHIVE METHODS AND INSTITUTIONAL ARRANGEMENTS
Most available MADS analyses are located at NCAR, NCDC, and ECMWF. For example, NMC analyses and observations on magnetic tape are sent to both NCAR and NCDC within about 2 weeks of creation. In general, data sets should be stored in at least two archives in the United States for backup
protection. Such data backup has already proven valuable. Having data in two archives usually adds very little to the overall cost because the data are often needed in a second working archive for reasons independent of backup considerations.
As substantial periods of model assimilation are completed, the data sets should also be available in at least two permanent archives. In addition, these analyses should be placed on "publication media," such as digital audio tapes and CD-ROM (compact disk, read-only memory) disks, so they can be sent in bulk to a number of major users at reasonable cost.
Most major U.S. archives are on magnetic tape because the media costs are less than for optical disks. However, mass storage devices are preferable so that users do not have to search tapes for specific files.
A mass storage device permits a user to place named data sets (or files) on the storage device without having to be concerned about physical location, thereby greatly simplifying data handling for the user and permitting efficient use of the device. Mass storage devices for supercomputers typically cost $2 million to $4 million. Mass storage capability in the price range of $5,000 to $150,000 could possibly be developed for limited applications by local users.
Every 6 to 12 years it becomes necessary or desirable to transfer existing tape-stored data sets to new storage media. A very large benefit of proper mass storage design is that data sets can be transferred automatically to new storage media, without extensive user involvement; otherwise, each migration requires a substantial manual effort.
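The named-data-set abstraction described above can be sketched as a small catalog that maps logical names to physical volumes, so that a media migration updates only catalog entries while user code keeps referring to data sets by name. All class, data set, and volume names below are illustrative, not part of any actual archive system.

```python
# Minimal sketch of a mass-storage catalog: users refer to data sets by
# name; the catalog resolves each name to a physical location, so moving
# data to new media only rewrites catalog entries. Names are hypothetical.

class Catalog:
    def __init__(self):
        self._entries = {}  # data set name -> (volume id, byte offset)

    def register(self, name, volume, offset):
        self._entries[name] = (volume, offset)

    def locate(self, name):
        """Return the current physical location of a named data set."""
        return self._entries[name]

    def migrate(self, old_volume, new_volume):
        """Move every data set on old_volume to new_volume.

        Code that calls locate() by name is unaffected by the move."""
        for name, (vol, off) in self._entries.items():
            if vol == old_volume:
                self._entries[name] = (new_volume, off)

catalog = Catalog()
catalog.register("nmc_analysis_1985", "TAPE-0042", 0)
catalog.register("ecmwf_analysis_1986", "TAPE-0042", 1_000_000)
catalog.migrate("TAPE-0042", "OPTICAL-0007")
print(catalog.locate("nmc_analysis_1985"))  # volume changed, name unchanged
```

The design point is that the name-to-location mapping lives in one place, which is what allows wholesale media migration without touching any user's access code.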
FUTURE DIRECTIONS
In the future, changes in technology and cost will make it practical to store considerable amounts of data at the scientist's computer workstation, which will have the power of a 1980 Cray system. The major data archive centers will store the data and put major amounts of it on various storage media (DAT [digital audio tape], CD-ROM, and so on). The increasing bandwidth of communications also will permit more data to be sent by electronic transfer over the High Performance Computing and Communications (HPCC) program links.
Access to Model Data and Observations
The modes of access should be as follows:
- Operational assimilated data sets (including delayed mode) will be transmitted routinely by operational centers to NCDC, NCAR, NODC, or other designated national archive centers.
- Data may be obtained on request from NCDC, NCAR, NODC, or other major archive centers.
- Universities may request a selection of current data transmitted in real time, via UNIDATA (University Data Broadcast Project), NOAA, or other national computer networks.
- Large amounts of data should be stored at the scientist's site on DAT tapes, CD-ROM disks, or related inexpensive technology for immediate use.
- A scientist may choose to access large amounts of data directly from archival centers through computer networking.
- Archival centers should prepare and publish service-oriented mass storage media (e.g., CD-ROM disks) at the cost of publishing. Standardized data access (unpacking), manipulation, and display software should be included automatically in published media and electronic transfers.
- A scientist may request selected data sets on inexpensive published mass storage media from archival centers.
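The "data access (unpacking)" software mentioned above might, for instance, resemble the scale-and-offset packing widely used for gridded meteorological fields, in which physical values are stored as small integers relative to a reference value. The sketch below assumes that scheme; the field, reference, and precision values are made up for illustration.

```python
# Sketch of scale-and-offset unpacking, a common scheme for gridded
# meteorological fields stored as small integers: the physical value is
# reconstructed as reference + packed_int * scale. Values are illustrative.

def pack(values, reference, scale):
    """Pack floating-point values into integers relative to a reference."""
    return [round((v - reference) / scale) for v in values]

def unpack(packed, reference, scale):
    """Recover approximate physical values from packed integers."""
    return [reference + p * scale for p in packed]

# Example: 500 hPa geopotential heights (meters), packed to 1 m precision.
heights = [5520.0, 5541.0, 5576.0]
packed = pack(heights, reference=5000.0, scale=1.0)
print(packed)                                    # compact integer form
print(unpack(packed, reference=5000.0, scale=1.0))
```

Shipping the unpack routine with the published medium, as the list recommends, spares every receiving site from rediscovering the packing parameters on its own.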
Locate More Data at the Scientist's Site
Two new storage devices, CD-ROM and DAT or related technology, will permit scientists to store significant amounts of data in their own computer systems. The purchase costs of both CD-ROM and DAT hardware are expected to decrease to $50 to $200 within a few years.
Consider the time needed to prepare software for a CD-ROM or DAT. Suppose there are about 10 data sets on a CD-ROM. The time to formulate indexes to the data and prepare basic access software is about 6 weeks. A rather extensive set of display capabilities may take as long as 6 months to develop. The point is that this effort should be done once, routinely, in a nationally focused archive system, to meet the broad national needs for these data. Standard data access tools and metadata, manipulation, and imaging software should be included routinely with the assimilated data sets.
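The indexes mentioned above could be as simple as a table keyed by data set, date, and field, giving the byte range of each record on the read-only volume. The sketch below assumes that layout; all data set and field names are hypothetical.

```python
# Sketch of a simple access index for data sets published on read-only
# media: a table keyed by (data set, date, field) that gives the byte
# range of each record on the volume. Names and sizes are hypothetical.

def build_index(records):
    """records: iterable of (dataset, date, field, offset, length)."""
    index = {}
    for dataset, date, field, offset, length in records:
        index[(dataset, date, field)] = (offset, length)
    return index

def lookup(index, dataset, date, field):
    """Return (offset, length) for a requested field, or None if absent."""
    return index.get((dataset, date, field))

records = [
    ("nmc_mads", "1985-01-01", "z500", 0, 4096),
    ("nmc_mads", "1985-01-01", "t850", 4096, 4096),
]
idx = build_index(records)
print(lookup(idx, "nmc_mads", "1985-01-01", "z500"))  # (0, 4096)
```

Building such an index once, at the archive, is exactly the "done once, routinely" effort the text argues for: every recipient of the published medium then reads it with the same software.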
Software for Data Display and Manipulation
Many forecast centers, scientific groups, and government activities have software (e.g., NOAA/PROFS, McIDAS [Man-Computer Interactive Data Access System], UNIDATA [University Data Broadcast Project]) to view analyzed fields and observations. A basic software capability for display should be distributed along with the data. For example, if a depiction of geopotential height, temperature, or wind for a selected portion of the world is needed at a major forecast center, the computer at the center can be commanded to prepare a visual display from the assimilated data sets. Scientists have found these displays to be essential for research purposes. The key parts of the display software should be made portable so that they can be easily run on any computer workstation. Local display hardware could include, for example, multicolor map plotters, the computer monitor, or equipment for transferring displays to slides.
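A portable display capability of the kind described above can be reduced, in its simplest form, to a routine that renders a gridded field as a character map, output that runs on any workstation regardless of graphics hardware. The field values below are synthetic, for illustration only.

```python
# Sketch of a portable field display: render a small gridded field as a
# character map by contour band, a lowest-common-denominator output that
# works on any terminal. The 500 hPa height values are synthetic.

def render(field, levels="-=#"):
    """Map each grid value onto a character by contour band."""
    lo = min(min(row) for row in field)
    hi = max(max(row) for row in field)
    span = (hi - lo) or 1.0
    lines = []
    for row in field:
        chars = []
        for v in row:
            band = int((v - lo) / span * (len(levels) - 1))
            chars.append(levels[band])
        lines.append("".join(chars))
    return "\n".join(lines)

# Synthetic 500 hPa geopotential height field (meters) on a 3x4 grid.
z500 = [
    [5520.0, 5530.0, 5545.0, 5560.0],
    [5540.0, 5555.0, 5570.0, 5585.0],
    [5560.0, 5575.0, 5590.0, 5600.0],
]
print(render(z500))
```

Real display packages would of course drive map plotters or color monitors, but keeping a text fallback like this is one way to make the "key parts" portable to any machine.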
Funding agencies should support proposals for developing compilations of existing routines and preparing interfaces for existing data sets. Such an activity would be cost effective because it would take advantage of software already in existence.
National Weather Service Modernization Program
During the 1990s the National Weather Service (NWS) will be installing a wide range of new observational systems, including the WSR-88D Doppler radar, wind profilers, the Automated Surface Observing System (ASOS), and advanced satellite systems. The resulting 100-fold increase in the amount of data that must be integrated into a dynamically and kinematically consistent data set will require model-based assimilation systems if this greatly enhanced data flow is to be understood and used effectively.
Data assimilation systems that will utilize the new data streams of the 1990s are presently under development. The research community outside the operational centers thus has an opportunity to contribute to the design of the system and to ensure that the needs of the broad community are considered.
Advances in Computer Technology
Advances in computing technology will affect the process by which model-assimilated data sets are generated as well as the means by which they are interrogated by scientists. Greater computing power will allow more sophisticated assimilation techniques to be used at national laboratories and forecast centers and will permit modeling and measurements groups to utilize specialized data assimilation software tailored to their own needs. Routine assimilation of large earth-atmospheric-oceanic-biogeochemical data sets will become feasible as the High Performance Computing and Communications (HPCC) Initiative is carried out.
Use of these data sets by scientists will be facilitated greatly by the increasing speed and storage of individual computer systems at prices affordable to individuals and small laboratories. However, facilities for local processing and graphical evaluation of large assimilated data sets will only be useful if appropriately designed software for data management is also available and ready access to data at national archives is provided.