8
Interfacing Diverse Environmental Data—Issues and Recommendations

As described in Chapter 1, addressing many of the questions central to environmental research and assessment, and global change research in particular, requires combining geophysical and ecological data. Although this can be difficult, the success and cost-effectiveness of these research and assessment efforts depend significantly on the degree to which data interfacing issues are explicitly confronted. This chapter presents a working definition of data interfacing and describes in detail the technical and organizational barriers that impede it, including the barriers deriving from characteristics of data, from users' needs, from organizational interactions, and from information systems considerations. Specific recommendations also are provided. The chapter ends with a list of 10 Keys to Success, which are based on the committee's review of the case studies. These fundamental, generalized guiding principles should help practitioners to systematically respond to the challenges identified.

Real-world illustrations of problems and solutions relevant to data interfacing are used as examples throughout this chapter. Some of these are drawn from circumstances or applications that do not directly involve interfacing geophysical and ecological data. Examples of this sort were chosen because they effectively exemplify important elements or principles that are pertinent to such interfacing. Indeed, many of the challenges posed by interfacing these two data types are common to many other situations.



Copyright © National Academy of Sciences. All rights reserved.




Finding the Forest in the Trees: The Challenge of Combining Diverse Environmental Data

THE PROBLEM AND ITS CONTEXT

As defined in Chapter 1, interfacing of geophysical and ecological data is the coordination, combination, or integration of such data for the purposes of modeling, correlation, pattern analysis, hypothesis testing, and field investigation at various scales (Figure 8.1). The data being interfaced can be products of a single, integrated study or can be derived from several studies performed at different times or places. Similarly, the data could have been collected with the interfacing effort in mind, or for other purposes entirely.

FIGURE 8.1 Generalized representation of the processes involved in interfacing geophysical and ecological data.

This deliberately broad definition of interfacing is intended to fit as many situations as possible. As discussed in greater detail below, the specific questions scientists will ask and the ways in which they will therefore endeavor to integrate data are often ill-defined and constantly changing (see Box 8.1). As a result, no single narrowly framed definition and no mechanistic prescription or solution will be of much lasting use to scientists contending with the problems related to interfacing.

BOX 8.1 Complex Questions Require Data Interfacing

Many fundamental questions about how ecosystems respond to forcing by larger-scale variables can be answered only by analytic techniques that use geophysical and ecological data together. The following examples from our case studies provide some sense of the range of such questions currently being addressed.

The California Cooperative Oceanic Fisheries Investigation (CalCOFI) program used a variety of correlation, pattern analysis, and time series methods to look for relationships between mesoscale shifts in oceanic current systems and biological communities. It was successful in showing how these regional oceanographic and biological changes were linked to the larger-scale El Niño/Southern Oscillation phenomenon.

The First ISLSCP Field Experiment (FIFE) study focused on the mass and energy balances at the land surface/atmosphere boundary and on the physical and biological processes that control them. The study combined ground-based, helicopter, aircraft, and satellite observations at several scales in order to develop and validate models that would allow surface climatology to be predicted from satellite measurements.

The National Acid Precipitation Assessment Program's (NAPAP) Aquatic Processes and Effects studies collected a large variety of environmental data, including precipitation rates, rainfall chemistry, rates of surface water acidification, and potential effects on aquatic biota. The end products sought were models to predict future scenarios of surface water acidification.

At its simplest level, interfacing involves the identification, accessing, and combination of data. However, in practice these seemingly uncomplicated activities can be technically complex, stretching the limits of existing knowledge and the capabilities of available hardware and software systems. In addition, the very act of interfacing frequently requires crossing disciplinary and administrative boundaries, thereby adding another level of complexity to the process. Interfacing therefore can best be understood as occurring in a series of overlapping contexts, both technical and organizational.
Effective solutions must address and accommodate all of these relevant contexts. Interfacing efforts can be confounded by a variety of obstacles (see Mathews, 1983; Henderson-Sellers, 1990). The ones described in this chapter are typical of most situations involving data management and data analysis on complex data sets. However, the challenges facing global change research and other large, interdisciplinary environmental research programs are extreme because of the massive volumes of data, the broad (up to global) geographic scale, the temporal scale, the variety of natural and anthropogenic processes included, the scope of modeling efforts, the numbers of organizations involved, and the evolving nature of the research itself. Consequently, the repercussions of not addressing the barriers described below are correspondingly more significant and more severe than for more traditional single-discipline applications.

ADDRESSING BARRIERS DERIVING FROM THE DATA

Data interfacing efforts must sometimes confront the misperception that once data are in digital form and in a common format, interfacing is simply a matter of merging two or more data sets. As Townshend and Rasool (1993) have pointed out, "collecting data globally does not of itself create global data sets." There is a series of technical pitfalls and obstacles that must be considered and resolved for data interfacing, even on regional scales, to produce scientifically meaningful output.

Some of these stem from relatively simple discrepancies among data types and can be dealt with in a straightforward manner. Others, in contrast, reflect fundamental theoretical or "cultural" differences in the ways that ecological and geophysical studies are conceived and carried out. Barriers that arise from these more fundamental differences involve, among other things, the size and complexity of ecological versus geophysical studies, their spatial and temporal scales, the numbers and kinds of variables measured, the role of models in study design and analysis, and traditions of funding and project administration.

Spatial and Temporal Scale

Some of the most apparent barriers revolve around issues of scale. Geophysical studies are more likely to cover continental- and global-scale areas sampled at lower spatial resolution (Rasool and Ojima, 1989; Sellers et al., 1992a). In contrast, ecological studies generally tend to involve ground-based and closely spaced sampling of smaller areas over relatively short time periods. For example, a recent review of about 100 field experiments in community ecology revealed that nearly half were conducted on plots no larger than 1 m in diameter (Kareiva and Anderson, 1988).
There is, in fact, only one widely used global data set in the ecological realm, the Global Vegetation Index produced by NOAA at a spatial resolution of 15 to 20 km (Townshend and Rasool, 1993). This stems, in part, from a tradition in ecology of studies on single species and from ecology's roots in natural history studies performed by individual investigators (Worster, 1977). It also reflects an emphasis in the conduct of ecological studies on labor-intensive field and laboratory techniques that preclude sampling over broader spatial scales. For example, in the FIFE study, data on canopy-leaf-area index, green-leaf weight, dead-leaf weight, and litter weight had to be collected by hand from relatively small study plots. This could not be avoided, even though the study was designed from the outset to integrate ecological and geophysical data over larger areas.

Because of such differences in study design, ecological data often must be smoothed or averaged in order to match the coarser spatial and temporal scales characteristic of geophysical data and models. This occurs, for example, when Geographic Information Systems (GIS) are used to merge remotely sensed areal data (usually geophysical) with attribute data (usually ecological) from specific points on the ground (Elston and Buckland, 1993). This kind of averaging is a key step in the latest generation of integrated global climate models (e.g., Wessman, 1992; Baskin, 1993). However, such averaged data may not truly be representative of heterogeneous ecological communities. This is an important shortcoming when heterogeneity is a vital component determining an ecosystem's response to a changed environment. In fact, Holling (1992) points out that spatial heterogeneity, or lumpiness, is of primary interest to ecologists.

Such differences in spatial scale also are related to the kinds of processes each field considers important, and the range of spatial and temporal scales across which they can be integrated. Most ecological studies thus attempt to focus on well-bounded community types and the actions of individual species or groups of species within these. Even studies that, by ecologists' standards, cover large areas (see Box 8.2) are fairly restricted compared to the global scope of many geophysical investigations. In addition, Wiens (1989) suggests that ecologists have been slower than atmospheric and earth scientists to address issues of scale. These other sciences (e.g., Clark, 1985) have a longer history of linking physical processes from local to global scales. Further, most ecological models function at a single scale (Ustin et al., 1991) or do not explicitly address scale (Wiens, 1989).

There are, of course, exceptions to this generalization about the spatial scale of ecological studies.
For example, there is a long-standing tradition in ecology of interest in global patterns of community diversity and the body sizes of individual organisms, and more recent concerns about biodiversity (e.g., May, 1991; Jackson, 1994 a,b) and sustainability of the biosphere (Lubchenco et al., 1991) encompass a global perspective. However, none of these concerns has to date required the interfacing of large amounts of data from different sources.

The differences in spatial scale between geophysical and ecological studies are paralleled by analogous issues of temporal scale. Long-term time series of ecological data are relatively rare. This may make it difficult or impossible to create integrated data sets that focus on long-term changes in coupled ecological-geophysical systems. Where historical ecological data are available, they are more likely to represent data from several studies carried out independently over the period of research interest. Long-term ecological data sets of broad spatial extent are thus more likely to result from the combination of data from several sources. This in turn requires solving quality control, metadata, and data integration issues stemming from methodological differences between the data sets.

BOX 8.2 Ecological and Geophysical Scales Differ

While the majority of ecological studies focus on relatively restricted spatial and temporal scales, others have a larger perspective. The committee examined two of these, the NAPAP and CalCOFI programs, as case studies (see Chapters 3 and 7). Other areas of study that attempt to link ecological processes across local to regional scales are described below. Each illustrates ways in which ecological and geophysical data might be interfaced. Even though large by ecologists' standards, they are small in comparison with many geophysical studies.

Ecologists have used a wide variety of paleoecological data to explore the ways in which vegetation communities responded to past climate change, especially during and after the most recent glaciation. Most of these studies have concentrated on regions within Europe and North America (e.g., Davis, 1981; Cole, 1985; Pennington, 1986; Webb, 1987; Foster et al., 1990). A central concern in these studies is to understand how the ecological requirements and characteristics of individual vegetation species contribute to regional patterns of community change over time.

Marine ecologists have expanded their understanding of how intertidal communities are structured by including oceanographic processes in their studies. Connell (1985), Gaines and Roughgarden (1985), and Roughgarden et al. (1985, 1986) showed that the importance of predation and competition for space within the intertidal community depends on the numbers of larvae available for settlement. This in turn depends on regional processes, such as currents and upwelling, that extend far beyond the intertidal zone.

Forest ecologists are attempting to use simulation models that incorporate the birth, growth, and dynamics of individual plants to understand how vegetation would change on regional and global scales in response to climate change (Shugart et al., 1992). Such models include the physiological responses of individual plants to specific environmental variables. They also include somewhat broader-scale changes in community composition in response to physical disturbance as well as to environmental change.

Long-term studies in the Chesapeake Bay have examined how regional land use, hydrology, waste discharge, and natural ecological processes interact to affect important estuarine resources. These studies were based on a systems approach that depended on interfacing data on many different aspects of the estuary (NRC, 1988).

In addition to these differences in study design, the behavior of ecological systems can confound the interfacing of geophysical and ecological data. For example, environmentally induced changes in ecological systems often occur with time lags of varying lengths (e.g., Cole, 1985; Lewin, 1985; Pennington, 1986; Davis, 1989; Loehle et al., 1990; Steele, 1991). This can make it difficult to determine which ecological and geophysical data should most appropriately be interfaced. Such time lags may require interfacing data from the same location or region, but from different times. In many cases, such data may be difficult or impossible to find. Research into the long-term response of vegetation to climate change (e.g., Davis, 1981, 1989; Cole, 1985; Pennington, 1986; Webb, 1987; Foster et al., 1990) has also shown that species respond individualistically, that is, that communities do not respond uniformly as entire units. This means that interfacing studies may provide misleading or inconsistent results when using "representative" species as indicators of ecological response to geophysical variables.

Reflecting such differences in temporal scale, the time steps in the models that represent geophysical and ecological systems can be quite distinct. For example, general circulation models recompute winds and temperature every 20 minutes for each grid cell. In contrast, ecological models of vegetation change use monthly to yearly, or even decadal, time steps (Baskin, 1993), depending on the kind of response being modeled.

These scale-related problems stem from the fact that fundamentally different kinds of processes are at work in geophysical and ecological systems. They also arise from the fact that ecological systems can be viewed at many different scales and from many different perspectives. None of these is the only "correct" one (O'Neill, 1988; Wiens, 1989; Levin, 1992), and each is based on different choices about which underlying processes and mechanisms to look at. This complexity, of course, affects choices about what kinds of data should be selected for interfacing.

More complex problems that cut across several scales are complicated by the fact that variables and processes may or may not change in concert across scales (Wessman, 1992). Certain kinds of measurements in ecological systems may be correlated at one scale, but appear unrelated or negatively correlated at another (Wiens, 1989).
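
The scale dependence of correlation described above can be illustrated with a small sketch. The Python example below is illustrative only and is not drawn from the committee's case studies: the site names and values are hypothetical, chosen so that two variables rise together among plots within each site while their site averages move in opposite directions.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Return the Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical plot-level measurements (x, y), grouped by study site.
sites = {
    "site_a": [(1.0, 9.0), (2.0, 10.0), (3.0, 11.0)],
    "site_b": [(4.0, 6.0), (5.0, 7.0), (6.0, 8.0)],
    "site_c": [(7.0, 3.0), (8.0, 4.0), (9.0, 5.0)],
}

# Fine scale: correlation among plots within each site.
within_site = {
    name: pearson([x for x, _ in plots], [y for _, y in plots])
    for name, plots in sites.items()
}

# Coarse scale: average each site first, then correlate the site means.
mean_x = [mean(x for x, _ in plots) for plots in sites.values()]
mean_y = [mean(y for _, y in plots) for plots in sites.values()]
across_sites = pearson(mean_x, mean_y)

# Pooled: ignore site boundaries and correlate all plots together.
all_x = [x for plots in sites.values() for x, _ in plots]
all_y = [y for plots in sites.values() for _, y in plots]
pooled = pearson(all_x, all_y)

print(within_site, across_sites, pooled)
```

With these invented values the correlation is positive within every site but negative among the site means and in the pooled data, which is the kind of reversal that averaging across scales can conceal.
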
In addition, sampling at intervals that are too widely spaced in space and time often fails to capture important aspects of a system's underlying variability. In such instances, the well-known problem of aliasing can lead even sophisticated analysis approaches to falsely identify trends. Thus the scales at which different kinds of data are collected can constrain or even predetermine the relationships among ecological and geophysical variables. As a result, data collected at one scale cannot necessarily be used to represent processes at another scale (through averaging or subsetting). This means that scientists engaged in data interfacing efforts must exercise extreme care when attempting to integrate data across different scales. Merely merging ecological and geophysical data by rote without seriously considering scale-related issues and their implications could result in spurious relationships and misleading analysis results. Unfortunately, there are no well-developed guidelines to assist in such efforts, although hierarchy theory (O'Neill, 1988) is a promising conceptual approach for identifying ecological scales that maximize predictive power.

Developing an integrated conceptual model of the systems being studied will assist in defining methodological and data management problems, particularly with respect to spatial and temporal scales. For example, in the committee's NAPAP case study the decision to measure acid neutralizing capacity (ANC) was based on a model of how lake chemistry works. ANC turned out to be a key parameter in the survey of lake sensitivity.

The committee recommends that careful thought be given, in the planning for interdisciplinary research, to the implications of different inherent spatial and temporal scales and the processes they represent. These should be discussed explicitly in project planning documents. The methods used to accommodate or match inherent scales in different data types in any attempts to facilitate modeling and analysis should be carefully evaluated for their potential to produce artificial patterns and correlations.

Preliminary Data Processing and Statistical Uncertainty

A wide range of models and data processing algorithms typically are used in the development of ecological and geophysical data sets. Such preliminary processing is used for quality control and data cleanup, for data summarization and classification, and for extracting higher-level information from raw data (see Box 8.3).

BOX 8.3 Preliminary Processing Affects Data Compatibility

A wide range of preliminary processing methods are used to convert raw ecological and geophysical data to a usable form. Many of these can significantly affect later efforts to interface disparate data types.

Raw images from satellite or airborne sensors must be processed to remove distortion and degradation that arise from a variety of sources (Geman, 1990; NRC, 1991, 1992; Geman and Gidas, 1991; Simpson, 1992). Estimating the true image intensity at each point often involves sophisticated spatial statistics that use adjacent data points to estimate the degree of distortion in the raw signal.

In many studies, raw data are used to generate maps showing the distribution of variables such as age classes of trees in temperate forests, soil types, or ranges of sea-surface temperature. The resultant data sets no longer contain the original raw data, which can make it difficult if not impossible later to combine data sets with different class limits.

In the Sahel case study, vegetation density was represented by a vegetation index. After several attempts using different algorithms, a suitable index was derived by taking the difference between the reflectance of visible and near-infrared bands and dividing this by the sum of the reflectances of the two bands.

Raw data therefore are rarely used directly, with the result that assumptions (both implicit and explicit), biases, and various kinds of statistical error are unavoidably built into each data set.

These built-in features of the data have two important kinds of implications for data interfacing. First, they mean that data interfacing involves more than just combining the tangible aspects of the data, such as formats and data values. It also necessarily involves identifying, understanding, and accommodating the assumptions, perspectives, value judgments, and decisions inherent in each data set. In simpler terms, all data sets cannot be all things to all people. For example, Townshend and Rasool (1993) list the various data products currently being derived from NOAA's Advanced Very-High Resolution Radiometer (AVHRR) data. Each product responds to the specific needs of a different subset of users, and these products are not readily interchangeable, even though all are derived from the same basic raw data.

Second, all preliminary processing and derivation steps are associated with some kind and amount of statistical error or uncertainty. Data interfacing, whether it involves combining separate estimates of the same variable, different variables, summary statistics, derived spatial data, or data that incorporate subjective judgments, represents another source of statistical error (NRC, 1992). For example, investigators focusing on point-based data in the FIFE program were unprepared to deal with the registration accuracy problem when requesting "their" pixel of AVHRR data. Plus or minus one pixel may mean a 5-km uncertainty in location, whereas the ground-based instruments were sensitive at 100-m scales. A more complicated problem arose when FIFE investigators attempted to associate averaged flux measurements from a 15-km-long aircraft transect with point-based flux measurements collected on the ground.
When such sources of uncertainty are not documented and accounted for in the metadata for any given data set, biases can be introduced that affect the outcome of data analyses and the conclusions drawn from them. As a general rule, uncertainty and sensitivity analyses should routinely be performed for the outputs of all models using the same data sets.

The committee recommends that the metadata for each data set explicitly describe all preliminary processing associated with that data set, along with its underlying scientific purpose and its effects on the suitability of the data for various purposes. Further, the metadata also should describe and quantify to the extent feasible the statistical uncertainty resulting from each processing step. Planning for studies that involve interfacing should explicitly consider the effects of preliminary processing on the utility of the resultant integrated data set(s). Metadata issues also are discussed in more detail later in this chapter.
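
The vegetation index described in Box 8.3 for the Sahel case study can be sketched in a few lines of Python. This is a minimal illustration, not the case study's actual algorithm: the function name and the reflectance values are hypothetical, and the order of subtraction (near-infrared minus visible, as in the conventional normalized difference vegetation index) is an assumption, since the text states only that the difference of the two bands is divided by their sum.

```python
def vegetation_index(visible, near_infrared):
    """Normalized difference of near-infrared and visible reflectance.

    Assumption: near-infrared minus visible, so denser vegetation
    gives a larger value. Returns None when both reflectances are
    zero, because the ratio is then undefined.
    """
    total = near_infrared + visible
    if total == 0:
        return None
    return (near_infrared - visible) / total


# Hypothetical reflectances (fractions between 0 and 1), not measured data.
pixels = {
    "dense_vegetation": (0.05, 0.45),
    "sparse_vegetation": (0.20, 0.30),
    "no_signal": (0.0, 0.0),
}

for name, (vis, nir) in pixels.items():
    print(name, vegetation_index(vis, nir))
```

Because the index is a ratio, different pairs of band values can yield the same result, so the raw reflectances cannot be recovered from the derived product. This is one concrete way in which preliminary processing limits later reuse of a data set.
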

Data Volume

The sheer quantity of data projected for global change studies can pose significant challenges for nearly every aspect of the data storage, retrieval, and analytical systems currently available. As summarized by Townshend and Rasool (1993), these drastically increased volumes stem from a variety of sources:

A greater number of different sensing systems.
An increase in the number of spectral bands and frequencies per sensor system.
Improved sensor sensitivity.
Cumulative increases in data volume as the historical record grows and sensor technology continues to advance.
The proliferation of derived data sets to meet different needs.
The creation of regional- and global-scale data sets from preexisting, fragmented data.

In many cases, the projected volume of data from new instruments and programs is orders of magnitude higher than that currently produced by existing programs. This can make it impossible or impractical to continue using traditional data management and data interfacing methods. For example, the staff at the Carbon Dioxide Information Analysis Center (CDIAC) at Oak Ridge National Laboratory are concerned that their existing labor-intensive data cleanup and interfacing methods will not be suited to the demands of the new Atmospheric Radiation Measurements (ARM) program. This DOE program will produce approximately 1,000 gigabytes (1 terabyte) of data per year, an amount significantly larger than the 5 gigabytes of data archived by CDIAC. Up to now, two of CDIAC's most popular data sets, Keeling's atmospheric carbon dioxide concentrations from Mauna Loa and Marland's carbon dioxide emission estimates from fossil fuel burning, are only 0.03 and 75.6 megabytes in size, respectively.

Larger data volumes also can require changes in the relationship between data sources and users.
For instance, the FIFE Information System staff considered it unreasonable to respond to open-ended investigator requests such as, "Send me all your level-1 AVHRR-LAC (Local Area Coverage) data," which totaled 1.5 gigabytes on ninety 6,250-bpi 9-track tapes. Instead, user support staff worked with users to refine specific data requests for actual research requirements.

A related barrier stems from the need to provide the research community with information about and ready access to an ever-increasing volume of data. These volumes threaten to overwhelm some of the existing methods for data storage, archival, and retrieval.

The committee recommends that all proposed data management and interfacing methods be weighed carefully in terms of their ability to deal with large volumes of data. Assumptions that existing methods will continue to be suitable should be treated with caution.

Overcoming Data Incompatibilities

Data interfacing is often confounded by differences in the conventions that structure day-to-day practice. Each discipline has its own set of conventions, or language, which cannot easily be forced into a common terminology even when the same quantities are involved. For example, cartographers use "small scale" to refer to a very large area mapped without much detail, while other scientists use the same term to refer simply to a small area. Conversely, cartographers use "large scale" to refer to small areas mapped in great detail. Other scientists use this term to refer to extensive or large areas.

The FIFE and NAPAP studies provide a rich variety of examples of how seemingly innocuous data characteristics can bedevil data interfacing efforts. In the FIFE study, the Information System staff found that the same symbols or terms were used by different disciplines to refer to widely different quantities. An analogous problem arose from the fact that separate groups measured the same variables, but called them by such different names that it would be difficult to combine them without prior knowledge of their respective naming conventions. As another example, all disciplinary groups in the FIFE study measured time, but in ways that made it difficult to combine their respective data.
The surface biology group preferred wall clock time for field data collection, while more globally oriented investigators used Greenwich Mean Time as a more universal standard to link satellite and aircraft data collection. Further, because time of day is not traditionally noted with soil moisture measurements, it was difficult to use the soil moisture data with other data, such as those for surface fluxes, with higher temporal resolution. This incompatibility was problematic because diurnal variations in soil moisture can have significant effects on energy and moisture balance calculations. As yet another example, investigators working at a local scale preferred Universal Transverse Mercator coordinates, while latitude and longitude were needed to track satellite and aircraft operations and to link solar position and time in a scientifically consistent way. The surface flux group was focused on circulation modeling with grid cells several kilometers on a side. That group was therefore satisfied with

OCR for page 81
Finding the Forest in the Trees: The Challenge of Combining Diverse Environmental Data

BOX 8.5 Metadata: The Key to the Lock

Metadata document or describe all the facts, circumstances, and conditions associated with the actual data themselves. In most cases, metadata are the key needed for scientists other than the original investigator(s) to unlock the information contained in the data. This is because they provide insight into not only the raw characteristics of the data, but also constraints on their use and limits on their interpretation. Metadata are thus essential to the process of drawing scientifically defensible conclusions from the data.

The essential components of metadata differ from project to project and data type to data type. However, the key elements include at a minimum those listed here. In addition, the committee found that the most thorough and useful metadata result from the combined efforts of principal investigators, information management specialists, and other potential users not directly involved in collecting the data. Principal investigators are intimately familiar with the data and their quirks and peculiarities. Information management specialists are knowledgeable about ways in which the inherent structure of the data can affect their utility. Finally, other potential users will raise issues and propose applications of the data that would never have been thought of by the principal investigator.

Key elements of metadata should include detailed description of at least the following (adapted from CENR, in press):
Identification of contributors.
Scope and purpose of the research program.
Data collection methods (field and laboratory), including description of instrumentation.
Sampling/measurement patterns in space and time.
Gaps or inconsistencies in sampling/measurement patterns.
Constraints imposed by measurement and processing methods.
Preliminary processing and derivation algorithms.
Quality assurance and control methods, including uncertainties in data or derived results.
Definitions and formats for each variable.
Quality control flags associated with the data.
Quirks and peculiarities of the data.
Limitations of the data.
Potential problems in specific kinds of applications.
Planned and actual applications of the data, including references to published papers.

are interfacing large data sets cannot necessarily depend on embedded codes and flags to automatically subdivide, transform, or otherwise operate on the data. Thus, while researchers cannot necessarily depend on automation to solve this data management problem, neither can they be expected to exhaustively examine these detailed metadata manually to evaluate their relevance.

Third, there is a large and ever-increasing number of data sets potentially useful in interfacing applications. For researchers to make effective use of this variety, they require a means of identifying, evaluating, accessing, and retrieving relevant data. Metadata play a key role in providing the information necessary for these steps. At present, some of this information is available through a combination of publications (e.g., newsletters and catalogs from CDIAC and the National Geophysical Data Center) and on-line data directories. These avenues are suited to providing summary descriptions of available data sets. However, it will be a challenge to furnish researchers with an efficient source of information about all available relevant data without requiring them to search numerous catalogs and directories.

BOX 8.6 Metadata in an Interdisciplinary Context

A chronic problem with metadata is the reluctance of researchers to allow their data sets to be freely used by others. Good metadata, of course, are designed to allow this very thing to happen. Therefore, some mechanism needs to be established that encourages the free sharing of data. The H.J. Andrews Experimental Forest LTER study provides an interesting example of how a mechanism for sharing data across discipline boundaries developed on its own. The project started out as a series of related, but independent studies. As the study progressed, however, individual researchers began to realize that they needed data from other projects carried out at the site. As their need for other data increased, the individual scientists began to recognize that they had to make an effort to allow their data to be used by, and made useful to, their associates. This led to greater emphasis on adequately documenting their data sets.

The traditional approach to metadata, in which they are considered as information separate from the data themselves, will not meet the challenges just described. The committee found agreement among a large segment of the data management specialists it consulted that a new conceptual model of metadata is required in which metadata are somehow integral to the data themselves. There was equally wide agreement, however, that no quick fixes to this problem are readily apparent.

The committee recommends that the production of detailed metadata be a mandatory requirement of every study whose data might be used for interdisciplinary research. Metadata should be treated with the seriousness of a peer-reviewed publication and should include, at a minimum, a description of the data themselves, the study design and data collection protocols, any quality control procedures, any preliminary processing, derivation, extrapolation, or estimation procedures, the use of professional judgment, quirks or peculiarities in the data, and an assessment of features of the data that would constrain their use for certain purposes.
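As an illustration only, the key elements listed in Box 8.5 could be captured in a structured record so that incomplete metadata are easy to spot before a data set is shared. The sketch below is a minimal example, not a format proposed by the committee: the class name `MetadataRecord`, its field names, and the helper `missing_elements` are all hypothetical, and free-text strings stand in for what would in practice be much richer documentation.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class MetadataRecord:
    """Hypothetical record with one field per key element in Box 8.5."""
    contributors: List[str] = field(default_factory=list)
    scope_and_purpose: str = ""
    collection_methods: str = ""       # field and laboratory, incl. instrumentation
    sampling_pattern: str = ""         # patterns in space and time
    sampling_gaps: str = ""            # gaps or inconsistencies
    method_constraints: str = ""       # limits imposed by measurement/processing
    processing_algorithms: str = ""    # preliminary processing and derivation
    quality_assurance: str = ""        # QA/QC methods and uncertainties
    variable_definitions: str = ""     # definitions and formats for each variable
    quality_flags: str = ""            # QC flags attached to the data
    quirks: str = ""
    limitations: str = ""
    application_problems: str = ""     # problems in specific kinds of applications
    applications: str = ""             # planned and actual uses, with references


def missing_elements(record: MetadataRecord) -> List[str]:
    """Return the names of the key elements that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]
```

A data manager might run `missing_elements` on each submitted record and return the list to the principal investigator. A check of this kind only confirms that every element has been addressed; it says nothing about whether the descriptions are adequate, which still requires review by people familiar with the data.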

Retracing Data Paths

As described above, all data sets reflect a particular set of users' needs and perspectives, which are not necessarily applicable to all situations. Interfacing ecological and geophysical data can therefore require that they be reformatted, resummarized, reclassified, or otherwise adjusted. For example, in the Sahel study, remote sensing and ground-based precipitation data were combined to develop an overall picture of rainfall throughout the study area. In many cases, this can involve backtracking down the data path to an earlier version of the data and then proceeding from there along an alternate path (Figure 8.3). The ability to perform this kind of backtracking requires that detailed information be available about the prior processing steps that were used to create the data set(s) being retrieved for interfacing. Sometimes this can be accomplished with thorough metadata. However, when a large user community is simultaneously using, updating, and modifying a considerable number of data sets (as in the FIFE study), stand-alone documentation is not adequate.

FIGURE 8.3 Schematic representation of how data interfacing can involve data processing steps that would not be needed if data were being analyzed independently. For each unique data set, this different data processing can require backtracking along the data path to different points.

Instead, it is necessary to include this dynamic information in the database system itself by attaching it to each derived data set.

The committee recommends that metadata contain enough information to enable users who are not intimately familiar with the data to backtrack to earlier versions of the data so that they can perform their own processing or derivation as needed. Where stand-alone documentation is not adequate (for large and complex data sets or where multiple users are simultaneously updating and modifying data), data managers should investigate the feasibility of incorporating an audit trail into the data themselves.

Long-term Archiving of Data

An important concern is the stewardship of data sets throughout their life cycle, which for global change data extends over a minimum of decades to centuries and does not necessarily end when the primary users believe they no longer need the data. The committee concludes that far too many environmental research projects give insufficient attention, in either the planning or the implementation stage, to the long-term archiving of their data sets. Data from studies that contribute significantly to our understanding of components and processes of the Earth system must be preserved and made accessible for future potential users of the data. There is a need to create a mindset within the research community that valuable data must have a long-term life that extends far beyond the publication of the principal investigator's analyses. In this regard, the committee found that there are no well-established and widely accepted protocols to assist scientists in deciding which data should be archived, in what formats they should be stored, and where and how they should be archived to maximize access for potential users.
Further, in several cases the committee found little attention given to the long-term maintenance of data sets once they were archived. It is important to note, however, that there do not appear to be any insurmountable technical barriers to keeping all data collected in research projects, even data-intensive ones that involve high-resolution imagery, because advances in data storage and retrieval capabilities have kept pace with the ever-growing volumes of data in all fields of science. It is typical that the ensemble of all previous data in any scientific discipline is modest in volume compared to present and anticipated annual volumes. Therefore, the issue is not unmanageable volumes of data; rather, the challenge for long-term retention is maintaining the data sets in accessible, usable form over time. These and other issues related to the long-term archiving of geophysical and environmental data are discussed in depth in another report, Preserving Scientific Data on Our Physical Universe: A New Strategy for Archiving the Nation's Scientific Information Resources (NRC, 1995). However, the committee wishes to emphasize the following points regarding interdisciplinary data archiving:

The committee believes that, in general, the presumption in environmental research should be that "data worth collecting are worth saving." The committee therefore recommends that funding agencies consider stipulating that all research applicants include in their research plans well-conceived and adequately funded arrangements for data management and for the ultimate disposition of their data. While it is impossible to establish universal guidelines for funding, the committee's investigations suggest that setting aside 10 percent of the total project cost for data management would not be unreasonable. These cost estimates should include adequate funds for preparing thorough metadata that serve the needs of all potential users. In order for these requirements to be fully effective, however, the agencies must adequately support active archives and long-term data repositories.

Finally, the committee is concerned about the gaps in the existing system for long-term retention and maintenance of environmental data. The committee recommends that funding agencies provide guidelines that define the requirements for preparing data sets for long-term archiving. Educational and research institutions should be encouraged to incorporate strong data management and archival activities into every interdisciplinary project and should allocate sufficient funding to accomplish these functions. Professional recognition should be given to principal investigators and project data managers who perform these functions well.
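The recommendation under Retracing Data Paths, that an audit trail be incorporated into the data themselves so that users can backtrack to earlier versions, can be made concrete with a small sketch. The example below is only one possible design under simplifying assumptions, not the approach used in FIFE or any other case study: the class `DataVersion`, its methods, and the step labels are hypothetical, and a tuple of numbers stands in for a real data set. Each derived version keeps a description of the step that produced it and a link to the version it came from.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class DataVersion:
    """Hypothetical data set version that carries its own processing history."""
    values: Tuple[float, ...]
    step: str = "raw"                        # step that produced this version
    parent: Optional["DataVersion"] = None   # earlier version on the data path

    def derive(self, step: str,
               func: Callable[[Tuple[float, ...]], Sequence[float]]) -> "DataVersion":
        """Apply a processing step and return a new version linked to this one."""
        return DataVersion(tuple(func(self.values)), step, self)

    def audit_trail(self) -> List[str]:
        """List the processing steps from the raw data to this version."""
        trail: List[str] = []
        node: Optional["DataVersion"] = self
        while node is not None:
            trail.append(node.step)
            node = node.parent
        return list(reversed(trail))

    def backtrack(self, step: str) -> "DataVersion":
        """Return the most recent version on the path produced by the named step."""
        node: Optional["DataVersion"] = self
        while node is not None:
            if node.step == step:
                return node
            node = node.parent
        raise KeyError(f"no version produced by step {step!r}")
```

A user who receives only a final summary could call `backtrack("raw")` and then apply a different derivation with `derive`, following an alternate data path of the kind shown in Figure 8.3. Keeping every earlier version in memory is a simplification; a real system would more likely store the processing history and references to archived versions.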
TEN KEYS TO SUCCESS

The committee's investigations of the case studies and other related experience led it to identify 10 Keys to Success (Box 8.7), each of which incorporates both technical and cultural aspects. Keys 1 and 2 deal with the appropriate use of available information management technology. Keys 3, 4, and 5 describe design and management strategies. Keys 6, 7, and 8 refer to methods for accommodating the unavoidable realities of human behavior, motivation, and politics. Finally, keys 9 and 10 suggest ways of enhancing data interfacing by building a need for it into the structure of research programs. A discussion of the keys in terms of this grouping follows.

Be practical. Use appropriate information technology.

BOX 8.7 Ten Keys to Successful Data Interfacing

1. Be practical.
2. Use appropriate information technology.
3. Start at the right scale.
4. Proceed incrementally.
5. Plan for and build on success.
6. Use a collaborative approach.
7. Account for human behavior and motivation.
8. Consider needs of participants as well as users.
9. Create common needs for data.
10. Build participation by demonstrating the value of data interfacing.

The utopian ideal of a comprehensive technological solution for interfacing environmental data almost certainly will never become a reality. On the one hand, attempts at such solutions typically lead to disaster because they ignore the realities of users' shifting needs, diverse and evolving hardware and software systems, different personal motivations, and the generally complex organizational contexts of interdisciplinary research projects. On the other hand, failing to use appropriate, up-to-date information management technology can impede or even prevent data interfacing efforts. For example, the committee heard of numerous instances of the problems created by researchers' use of spreadsheets, rather than actual database software, for data management functions. Similarly, while checking data obtained from data sources, CDIAC staff found many errors that stemmed from a reluctance or inability to use more sophisticated programming languages to search for errors and discrepancies.

Balancing the constraints imposed by real-world practicality with a desire for the benefits of up-to-date information management technology is a difficult challenge. It requires close attention to users' needs and perceptions combined with the clever application of suitable technology. For example, both the FIFE Information System staff and the designers of the NASA Master Directory successfully applied a "least-common-denominator" principle for key elements of their systems. This helped achieve their goal of making these systems usable by the widest possible audience, even those with less than state-of-the-art hardware and software. However, while depending on a least-common-denominator user interface, the NASA Master Directory uses up-to-date networking technology to link widely separated databases. The CDIAC program provides another instance of this least-common-denominator approach. Its managers realized they could best fulfill their obligations as a key data resource by distributing their data packages via older, and more widely accessible, technology. As another strategy, the designers of the CalCOFI program accomplished their goal of developing a resilient long-term program by focusing on a few relatively simple parameters that would stand the test of time.

The successful case studies chose technology that best accommodated constraints imposed by users while also furthering the project's fundamental goals. None of them used information management technology for its own sake.

Start at the right scale. Proceed incrementally. Plan for and build on success.

Successful interfacing efforts begin with discrete, well-bounded, and manageable pieces of the larger interfacing problem. This approach permits solutions to be developed and tested at appropriate scales as working relationships and the understanding of users' needs evolve through experience. Success at these functionally well-bounded scales builds confidence, credibility, and the desire of the participants to buy into the effort. Thus, information management specialists should help select initial interfacing applications that have a high probability of success. They also should design hardware and software systems that can evolve over time by expanding, adapting, and rearranging semi-independent modules. Users' needs demand systems that have the ability to evolve over time, rather than static and broad-scale "solutions" that are often obsolete by the time they are installed. The information managers in the FIFE program responded to the interfacing needs that arose naturally from the interactions of the project scientists (but see also keys 9 and 10 below). These incremental and nonthreatening responses helped build momentum and created a supportive environment for interfacing. Similarly, CDIAC initially focused on demonstrating success with a few fundamentally important data sets.
It should be noted, however, that starting at too small or restricted a scale could lead to major problems in the future when it is realized, for example, that important functions have not been included in the system.

Use a collaborative approach. Account for human behavior and motivation. Consider needs of participants as well as users.

Both research scientists and information management specialists uniformly stress the importance of collaboration as the best way of dealing with the human element in data interfacing applications. This is particularly true of the process of hardware/software system design, where ongoing interaction between designers and users is vital to success. Collaboration also can help overcome potential political constraints to interfacing. Research scientists, used to relative autonomy, are not going to comply with data interfacing standards or procedures simply because someone tells them to. In addition, technology alone will not resolve such human issues. As Davenport et al. (1992) state, "No technology has yet been invented to convince unwilling [scientists] to share information or even to use it."

The best means of ensuring active cooperation in facilitating data interfacing is to solicit information about users' priorities and concerns and to somehow account for these (e.g., Kirkpatrick, 1993). The best data interfacing solutions are those that make users' day-to-day lives easier and help them accomplish those things that are important to them. For example, CDIAC makes every effort to see that data sources are recognized for their contribution. It lists the data source as the primary author of each data package, provides a suggested bibliographic format for citing the data package, and encourages users to cite data packages as publications. The CalCOFI program has a history of collaborative decision making, in which project scientists participate equally in decisions about program direction and scope. As a result, the program has been able to adapt to changes in funding levels while maintaining the involvement of participating scientists. The LTER network, with a large constituency of academic researchers (who are rewarded within the university system for individual research), has initiated a periodic research symposium as part of its ongoing efforts to promote data sharing and integration.
Interactions among data managers and researchers at various sites help exploit historical data in new ways and encourage new proposals focused on intersite analyses. These activities are intended to lead to additional publications and funding of research, positive incentives for data sharing that NSF promotes with supplemental research funding.

It is important not to lose sight of the distinction between users and participants. Users are typically scientists who make use of a data interfacing system to retrieve information or data that further their own investigations. Participants, on the other hand, are in some way essential to the success of the data interfacing effort, but do not necessarily make use of the data themselves. Each data interfacing application involves a mixture of users and participants, both of whose needs must be taken into account.

Create common needs for data. Build participation by demonstrating the value of data interfacing.

The potential conflict between data interfacing and individual users' needs can be resolved by designing programs so that users actually depend on successful data interfacing to meet their needs. For example, the fundamental scientific questions motivating the NAPAP and CalCOFI programs demanded that physical and biological scientists interface their data. Similarly, many of the projects within the FIFE program were designed so that researchers needed access to each other's data in order to meet their research goals.

Tangible benefits can further motivate researchers to participate in data interfacing efforts (see also key 7 above). The FIFE managers judged that the earlier inclusion of an integrative modeling group would have stimulated much more interaction and data interfacing by making the value of such activities much more apparent. Such tangible demonstrations of interfacing's benefits are vitally important. Successful interfacing efforts require participants to change fundamental aspects of their attitudes and behavior. People are much more likely to make such changes if they can see actual examples of the promised payoffs.

Environmental research and monitoring projects are certain to become more complex and interdisciplinary in scope. By adequately planning for and anticipating the data interfacing issues raised in this report, researchers and data managers can help reduce the barriers and challenges that they will inevitably encounter.

REFERENCES

Baskin, Y. 1993. Ecologists put some life into models of a changing world. Science 259: 1694–1696.
Clark, W.C. 1985. Scales of climate impacts. Climate Change 7: 5–27.
Cole, K. 1985. Past rates of change, species richness, and a model of vegetational inertia in the Grand Canyon. Am. Nat. 125: 289–303.
Committee on Earth and Environmental Sciences (CEES). 1992. The U.S. Global Change Data and Information Management Program Plan. National Science Foundation, Washington, D.C.
Committee on Environment and Natural Resources (CENR). In press. The U.S. Global Change Data and Information System Implementation Plan. Joint Oceanographic Institutions, Inc., Washington, D.C.
Connell, J.H. 1985. The consequences of variation in initial settlement vs. post-settlement mortality in rocky intertidal communities. J. Exp. Mar. Bio. Ecol. 93: 11–45.
Davenport, T.H. 1993. Process Innovation: Reengineering Work Through Information Technology. Harvard Business School Press, Cambridge, Mass. 337 pp.
Davenport, T.H., R.G. Eccles, and L. Prusak. 1992. Information politics. Sloan Manage. Rev. 34: 53–65.
Davis, M.B. 1981. Quaternary history and the stability of forest communities. In Forest Succession: Concepts and Application. D.C. West, H.H. Shugart, and D.B. Botkin, eds. Springer-Verlag, New York.

Davis, M.B. 1989. Insights from paleoecology on global change. Bull. Ecol. Soc. Am. 70: 222–228.
DeSmedt, W. 1994a. The wolf at the door. Database Programming & Design (April): 58–67.
DeSmedt, W. 1994b. CASE and prototyping: mind over model. Database Programming & Design (May): 47–49.
Foster, D.R., P.K. Schoonmaker, and S.T.A. Pickett. 1990. Insights from paleoecology to community ecology. Trends Ecol. Evol. 5: 119–122.
Freudenberg, W.R. 1992. Nothing recedes like success? Risk analysis and the organizational amplification of risks. Risk Issues Health Safety 3: 1–35.
Gaines, S., and J. Roughgarden. 1985. Larval settlement rate. Proc. Natl. Acad. Sci. USA 82: 3707–3711.
Geman, D. 1990. Random fields and inverse problems in imaging. Lecture Notes in Mathematics. Springer-Verlag, New York.
Geman, D., and B. Gidas. 1991. Image analysis and computer vision. Pp. 9–36 in Spatial Statistics and Digital Image Analysis. Panel on Spatial Statistics and Image Processing, National Research Council. National Academy Press, Washington, D.C.
Henderson-Sellers, A. 1990. Predicting generalized ecosystem groups with the NCAR CCM: First steps toward an interactive biosphere. J. Climate 3(9): 917–940.
Holling, C.S. 1992. Cross-scale morphology, geometry, and dynamics of ecosystems. Ecol. Mon. 62: 447–502.
Jackson, J.B.C. 1994a. Constancy and change of life in the sea. Philos. Trans. R. Soc. London Ser. B 344: 55–60.
Jackson, J.B.C. 1994b. Community unity? Science 264: 1412–1413.
Kanciruk, P., and M.P. Farrell. 1989. The Case for Issue-Oriented Information Analysis Centers in Support of the U.S. Global Research Program. Environmental Sciences Division, Publ. No. 334671189. Oak Ridge National Laboratory, Oak Ridge, Tenn. 30 pp.
Kareiva, P., and M. Anderson. 1988. Spatial aspects of species interactions: the wedding of models and experiments. Pp. 35–50 in Community Ecology, A. Hastings, ed. Lecture Notes in Biomathematics 77. Springer-Verlag, Berlin.
Katzenbach, J.R., and D.K. Smith. 1993. The Wisdom of Teams: Creating the High-Performance Organization. Harvard Business School Press, Cambridge, Mass.
Kirkpatrick, D. 1993. Making it all worker-friendly. Fortune 128(7): 44–53.
Levin, S.A. 1992. The problem of pattern and scale in ecology. Ecology 73: 1943–1967.
Lewin, R. 1985. Plant communities resist climate change. Science 228: 165–166.
Lincoln, Y.S. (ed.). 1985. Organization Theory and Inquiry. Sage, Newbury Park, Calif. 231 pp.
Loehle, C., J. Gladden, and E. Smith. 1990. An assessment methodology for successional systems. I. Null models and the regulatory framework. Environ. Manage. 14: 249–258.
Lubchenco, J., A.M. Olson, L.B. Brubaker, S.E. Carpenter, M.M. Holland, S.P. Hubbell, S.A. Levin, J.A. MacMahon, P.A. Matson, J.M. Melillo, H.A. Mooney, C.H. Peterson, H.R. Pulliam, L.A. Real, P.J. Regal, and P.G. Risser. 1991. The Sustainable Biosphere Initiative: An ecological research agenda. Ecology 72: 371–442.
May, R.M. 1991. Biodiversity: A fondness for fungi. Nature 352: 475–476.
National Research Council (NRC). 1988. Marine Environmental Monitoring in the Chesapeake Bay. National Academy Press, Washington, D.C.
National Research Council (NRC). 1991. Solving the Global Change Puzzle: A U.S. Strategy for Managing Data and Information. National Academy Press, Washington, D.C.
National Research Council (NRC). 1992. Combining Information: Statistical Issues and Opportunities for Research. National Academy Press, Washington, D.C.

National Research Council (NRC). 1995. Preserving Scientific Data on Our Physical Universe: A New Strategy for Archiving the Nation's Scientific Information Resources. National Academy Press, Washington, D.C.
Office of Science and Technology Policy (OSTP). 1991. Policy Statements on Data Management for Global Change Research. U.S. Global Change Research Program, National Science Foundation, Washington, D.C.
O'Neill, R.V. 1988. Hierarchy theory and global change. Pp. 29–46 in Scales and Global Change: Spatial and Temporal Variability in Biospheric and Geospheric Processes, T. Rosswall, R.G. Woodmansee, and P.G. Risser, eds. SCOPE 35. Wiley, New York.
Parsons, T. (ed.). 1947. Max Weber: The Theory of Social and Economic Organization. Free Press, New York.
Pennington, W. 1986. Lags in adjustment of vegetation to climate caused by the pace of soil development: Evidence from Britain. Vegetation 67: 105–118.
Rasool, S.I., and D.S. Ojima (eds.). 1989. Pilot Studies for Remote Sensing and Data Management: Report of a Meeting of the IGBP Working Group on Data and Information Systems. The Royal Swedish Academy of Sciences, Stockholm.
Reason, J. 1990. The contribution of latent human failures to the breakdown of complex systems. Philos. Trans. R. Soc. London Ser. B 327: 475–484.
Roughgarden, J., Y. Iwasa, and C. Baxter. 1985. Demographic theory for an open marine population with space-limited recruitment. Ecology 66: 54–67.
Roughgarden, J., S.D. Gaines, and S.W. Pacala. 1986. Supply side ecology: The role of physical transport processes. Pp. 491–518 in Organization of Communities: Past and Present, J.H.R. Gee and P.S. Giller, eds. Blackwell, Boston, Mass.
Sellers, P.J., F.G. Hall, G. Asrar, D.E. Strebel, and R.E. Murphy. 1992. An overview of the First International Satellite Land Surface Climatology Project (ISLSCP) Field Experiment (FIFE). J. Geophys. Res. 97: 18355–18371.
Shugart, H.H., T.M. Smith, and W.M. Post. 1992. The potential for application of individual-based simulation models for assessing the effects of global change. Ann. Rev. Ecol. Syst. 23: 15–38.
Steele, J.H. 1991. Marine ecosystem dynamics: Competition of scales. Eco. Res. 6: 175–183.
Strebel, D.E., J.A. Newcomer, J.P. Ornsby, F.G. Hall, and P.J. Sellers. 1990. The FIFE Information System. IEEE Transactions on Geoscience and Remote Sensing 28: 703–710.
Townshend, J.R.G., and S.I. Rasool. 1993. Global change. In Data for Global Change, P.S. Glaesar and S. Ruttenberg, eds. CODATA, Paris.
Ustin, S.L., C.A. Wessman, B. Curtiss, E. Kasischke, J. Way, and V.C. Vanderbilt. 1991. Opportunities for using the EOS imaging spectrometers and synthetic aperture radar in ecological models. Ecology 72: 1934–1945.
Webb, T., III. 1987. The appearance and disappearance of major vegetational assemblages: Long-term vegetational dynamics in eastern North America. Vegetation 69: 177–187.
Weick, K.E. 1976. Educational organizations as loosely coupled systems. Admin. Sci. Quart. 21: 1–19.
Weick, K.E. 1982. Management of organizational change among loosely coupled elements. In Change in Organizations, P. Goodman, ed. Jossey-Bass, San Francisco, Calif.
Weick, K.E. 1985. Sources of order in underorganized systems: Themes in recent organizational theory. Pp. 106–136 in Organizational Theory and Inquiry, Y.S. Lincoln, ed. Sage, Newbury Park, Calif.
Wessman, C.A. 1992. Spatial scales and global change: Bridging the gap from plots to GCM grid cells. Ann. Rev. Ecol. Syst. 23: 175–200.
Wiens, J.A. 1989. Spatial scaling in ecology. Funct. Ecol. 3: 385–387.
Worster, D. 1977. Nature's Economy. Cambridge University Press, New York.