Assessing the Implications of Multidirectional Interfaces
DATA INTEGRITY AND QUALITY
Assuring data integrity and quality in a distributed federated system is perhaps the biggest challenge in planning for the effective utilization of environmental satellite data in the next 10 to 15 years. Four distinct but related aspects of data have to be addressed to ensure effective stewardship of distributed data: data integrity, identity, quality, and lineage.
As data granules (delivery units) move among NASA, NOAA, and other agencies, brokers, value-added providers, and end users, there will have to be some way of assuring their integrity, that is, of assuring that content has not been altered. Digital signatures (e.g., checksums or one-way hash functions) provide a straightforward way to accomplish this. In the particular case of environmental satellite data, it will be most effective if the signature scheme is insensitive to the lossless transformations (compression, reformatting, etc.) that are likely to occur as part of the data’s normal utilization.
Implication: All data granules distributed by NASA and NOAA should include a digital signature calculated by a standard, open algorithm, so that NASA and NOAA and any subsequent users of the data can verify its integrity.
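As a concrete illustration, a signature scheme of this kind can be sketched in a few lines of Python. The SHA-256 digest and gzip compression below are stand-ins for whatever standard algorithm and lossless transformation an actual system would adopt; computing the signature over the canonical (uncompressed) content is one way to keep the signature stable across such transformations:

```python
import gzip
import hashlib

def granule_signature(content: bytes) -> str:
    """Compute a signature over a granule's canonical (uncompressed) content.

    Hashing the canonical form, rather than the delivered bytes, keeps the
    signature insensitive to lossless transformations such as compression.
    """
    return hashlib.sha256(content).hexdigest()

def verify_granule(delivered: bytes, expected_signature: str,
                   compressed: bool = False) -> bool:
    """Check that a delivered granule (possibly gzip-compressed) is unaltered."""
    content = gzip.decompress(delivered) if compressed else delivered
    return granule_signature(content) == expected_signature

# The signature computed at the archive survives compression in transit.
original = b"SST grid, 0.25 degree"        # stand-in for real granule bytes
sig = granule_signature(original)
shipped = gzip.compress(original)           # lossless transformation en route
assert verify_granule(shipped, sig, compressed=True)
```

Any tampering with the shipped bytes changes the recovered content and therefore the digest, so verification fails for altered granules.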
In addition to a signature derivable from their content, data granules need names that identify their content (semantics) independent of the content’s representation (structure). The separation between identity and integrity is subtle but important, since although it is desirable for their relationship to be one to one, in practice it is likely to be one to many. It may simply not be possible for a lossless transformation of a granule (e.g., reformatting from HDF to TIFF) to preserve the same signature. Even if the relationship were always one to one, there is a general need to be able to refer to granules with names that make some sense to users, as opposed to the apparently random bit strings of digital signatures.
Implication: All data streams distributed by NASA and NOAA should have a well-defined granule naming convention. Where possible, services should be supported that map between granule names and signatures, so that names of granules may be recovered from their signatures.
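The one-to-many relationship between names and signatures described above might be maintained by a registry service. The sketch below is a toy in-memory version (granule names and digest strings are invented for illustration); a real service would of course be networked and persistent:

```python
from collections import defaultdict

class GranuleRegistry:
    """Illustrative name <-> signature registry.

    One granule name may map to several signatures (one per lossless
    representation, e.g., HDF vs. TIFF), while each signature identifies
    exactly one granule name.
    """

    def __init__(self):
        self._sigs_by_name = defaultdict(set)  # name -> set of signatures
        self._name_by_sig = {}                 # signature -> one name

    def register(self, name, signature):
        self._sigs_by_name[name].add(signature)
        self._name_by_sig[signature] = name

    def name_for(self, signature):
        """Recover a granule's name from any one of its signatures."""
        return self._name_by_sig.get(signature)

    def signatures_for(self, name):
        """All known signatures (representations) of a named granule."""
        return set(self._sigs_by_name.get(name, ()))

# Two representations of the same granule share one name.
reg = GranuleRegistry()
reg.register("granule-001", "sha256:aaa")   # hypothetical HDF representation
reg.register("granule-001", "sha256:bbb")   # hypothetical TIFF representation
```

With such a mapping in place, a user holding only a verified signature can recover the human-meaningful name, and vice versa.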
Quality is simply a set of assertions about a data granule that provide sufficient ancillary/contextual information (metadata) about the granule to enable effective interpretation of its contents. Conventionally these assertions either are packaged with the granule (as embedded metadata) or are implicit in the granule’s parent data set (e.g., MODIS ocean color product). However, if a granule’s identity can be reliably established, then quality assertions can be provided by services that accept the granule’s name as input. This avoids the problem of constantly extending data formats to accommodate new forms of embedded metadata.
Implication: NASA and NOAA should provide services by which a data granule’s name may be used to recover all metadata relevant to that granule.
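Such a service might look like the following sketch, in which quality assertions live in a store keyed by granule name rather than embedded in the file; the granule name and the assertion fields shown are hypothetical:

```python
# Hypothetical quality-assertion store, keyed by granule name rather than
# embedded in the granule's data format.
QUALITY_ASSERTIONS = {
    "MODIS.oc.2002182.L2": {            # hypothetical granule name
        "cloud_fraction": 0.12,
        "calibration_version": "4.1",
        "known_issues": ["sun glint in NE quadrant"],
    },
}

def metadata_for(granule_name):
    """Return all quality assertions recorded for a granule (empty if none).

    Because assertions are served by name instead of being packaged with the
    granule, new kinds of metadata can be added without extending the data
    format itself.
    """
    return QUALITY_ASSERTIONS.get(granule_name, {})
```

New assertion types (say, a revised cloud mask) can then be added to the store at any time without reissuing or reformatting the granules themselves.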
Static assertions about a data granule are only part of the context required to interpret the data. Even more important is the lineage of the data: the graph of antecedent data granules and transformations from which the granule was produced. Lineage, the “pedigree” of a data granule, is often a key determinant of a granule’s fitness for a particular use. For example, retrospectively updating the calibration of a low-level data product invalidates any derived products. If the lineage of the derived products is available, then such broken dependencies will be obvious.
It is a straightforward step from a system that maintains information about data granules to one that maintains connections between granules, but there is as yet no standard way to represent such connections. This is a key area for future development to ensure the reliability of data products generated in a distributed fashion.
Implication: NASA and NOAA should pursue the development of a system of maintaining lineage connections between data granule names.
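Absent a standard representation, one minimal way to maintain such connections is a directed graph over granule names. The sketch below (with invented granule names) shows how broken dependencies become computable once lineage is recorded: the set of products invalidated by reprocessing a low-level granule is simply its set of descendants.

```python
from collections import defaultdict

class LineageGraph:
    """Directed graph of granule names; an edge A -> B means B was derived from A."""

    def __init__(self):
        self._derived = defaultdict(set)

    def record(self, antecedent, derived):
        """Record that `derived` was produced from `antecedent`."""
        self._derived[antecedent].add(derived)

    def descendants(self, name):
        """All granules transitively derived from `name` -- i.e., everything
        invalidated if `name` is reprocessed with updated calibration."""
        seen, stack = set(), [name]
        while stack:
            for child in self._derived[stack.pop()]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

# Recalibrating the raw granule invalidates both downstream products.
g = LineageGraph()
g.record("L0.raw", "L1.calibrated")
g.record("L1.calibrated", "L2.sst")
```

A production system would also attach the transformation (algorithm and version) to each edge, so that users can judge fitness for use, not merely detect staleness.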
The fundamental question regarding data accessibility is, Is the data readily available, in a form that I can use, at a cost that I can afford? Ready availability is primarily a combination of keeping the data online (as elaborated in the section “Storage” in Chapter 1) and published through well-defined services. In particular, the emerging Web services infrastructure (the UDDI, WSDL, and SOAP standards for discovery, specification, and invocation, respectively, all based on XML) should allow data access in ways directly supported by the development platforms (e.g., Java, Microsoft .NET) with which most future applications will be built. Like massive online storage, support for Web services is expected to be ubiquitous, and indeed to be the default means by which most satellite data will be accessed during the time frame of this report.
Implication: All NASA and NOAA satellite data should be accessible via standard Web services.
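To make the SOAP-based access pattern concrete, the sketch below constructs a minimal SOAP 1.1 request envelope for a hypothetical GetGranule operation; the service namespace, operation, and element names are invented for illustration (a real service would publish them in its WSDL description):

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SVC_NS = "urn:example:granule-access"                  # hypothetical service

def build_get_granule_request(granule_name):
    """Build the XML body of a SOAP request for a named granule.

    Any SOAP-capable toolkit (Java, .NET, etc.) would generate an
    equivalent envelope from the service's WSDL description.
    """
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SVC_NS}}}GetGranule")
    ET.SubElement(call, f"{{{SVC_NS}}}GranuleName").text = granule_name
    return ET.tostring(envelope, encoding="utf-8")
```

The point of the standards stack is precisely that a request like this can be generated and consumed automatically by mainstream development platforms, rather than hand-built as here.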
Data usability cuts directly to the contentious issue of data formats, that is, the logical structures (grids, geometry, embedded metadata, etc.) in which digital data are packaged for delivery. Digital data formats originated as “snapshots” of the internal state of specific software tools, for which moving data efficiently between the tool’s memory and an online store mattered more than interoperability. Only later were formats such as netCDF and HDF developed whose primary role was to move data between possibly dissimilar tools—i.e., formats for which interoperability was a primary design constraint. The problem to date with such “universal” data formats is that they have not supplanted the use of more specific formats within many communities of practice, nor have truly universal tools for translating between formats been widely promulgated. This situation can be expected to change; there are simply too many compelling benefits to be realized from having relatively few standard data formats to support. However, as the EOSDIS experience amply demonstrates, attempts to impose a standard data format on a community by fiat, and especially without comprehensive tools for supporting the community’s existing format(s), are doomed to failure. It is thus critical that the overall system be designed so that data formats are not obstacles to data utilization.
Implication: NASA and NOAA satellite data should be available in the formats best supported by the user communities. If NASA and NOAA cannot supply data in these formats directly, they should supply the data in standard formats that have been demonstrated to be readily convertible to community formats, and should aggressively support the development of third-party services that can transparently provide such format conversion.
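A third-party conversion service of the kind envisioned here might be organized as a registry of pairwise converters that falls back to routing through a standard hub format when no direct path exists. The sketch below uses tagged strings and real format names only as labels; the converter functions are invented placeholders, not actual netCDF/HDF/GeoTIFF translators:

```python
CONVERTERS = {}  # (source_format, target_format) -> conversion function

def converter(src, dst):
    """Decorator registering a converter between two named formats."""
    def register(fn):
        CONVERTERS[(src, dst)] = fn
        return fn
    return register

def convert(data, src, dst, hub="netCDF"):
    """Convert between formats, routing through a standard hub format
    (here hypothetically netCDF) when no direct converter is registered."""
    if src == dst:
        return data
    if (src, dst) in CONVERTERS:
        return CONVERTERS[(src, dst)](data)
    return CONVERTERS[(hub, dst)](CONVERTERS[(src, hub)](data))

# Toy converters operating on tagged strings, purely for illustration.
@converter("HDF", "netCDF")
def hdf_to_netcdf(data):
    return data.replace("HDF:", "netCDF:")

@converter("netCDF", "GeoTIFF")
def netcdf_to_geotiff(data):
    return data.replace("netCDF:", "GeoTIFF:")
```

With this structure, supporting a new community format requires only two converters (to and from the hub), rather than one per existing format—which is the practical argument for standardizing on a small number of hub formats.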
Finally, the accessibility of environmental satellite data can be hindered by the imposition of significant fees, or other constraints on use, as shown by NOAA’s previous experience with the Landsat program in the 1980s. Landsat was originally a NASA program. In 1979, President Jimmy Carter transferred the Landsat program to NOAA. In 1983 President Ronald Reagan directed NOAA to place the program in the hands of a private corporation. The Land Remote-Sensing Commercialization Act of 1984, enacted by Congress, gave guidelines for this transfer. However, studies showed that federal government subsidies of up to $500 million were required to make the commercialization effort viable.
The Reagan administration offered only $250 million in subsidies to the prospective contractors. As a result, only one company, Eosat, a joint venture between Hughes and RCA, was willing to bid for the Landsat commercialization contract. Eosat was to operate the existing Landsat 4 and 5 satellites, build two new satellites (Landsats 6 and 7), and hold exclusive rights to market satellite images and digital data.
Eosat never obtained the government subsidies necessary to make the venture profitable, so it quadrupled the price of satellite images and data and also collected large fees from overseas ground stations. Even so, the company never became financially viable: the higher prices significantly reduced the use of satellite images and data, further diminishing Eosat’s revenues.
It finally became clear to Congress that the market for satellite data was not ready to support a commercial company. A new law, the Land Remote-Sensing Policy Act of 1992, repealed the 1984 law and returned Landsat to the government. The price of Landsat images and data dropped and usage increased.
A significant benefit of a federated distributed system designed around public interfaces is an increased adaptability to technological change. As noted in the section “Processing” in Chapter 1, the various underlying technologies in the data management infrastructure are expected to experience cost/performance improvements at rates equal to at least that of Moore’s law, for at least the next 10 to 15 years. This argues strongly for “just in time” implementation of specific system
components, which in turn is much simpler if the components interact with the rest of the system only through well-defined, stable interfaces. This applies at both the system integration level (e.g., delaying purchasing storage hardware to take advantage of falling prices) and at the administrative level (e.g., multiple third parties making their products available to different user communities at different times, as the needs of those communities stabilize).
Moore’s law-driven technological change is reasonably predictable, but an interface-based system is also more adaptable to revolutionary/disruptive change. For example, in the early 1990s, systems with tightly integrated graphical user interfaces (GUIs) had much greater difficulty making the transition to the World Wide Web than did systems whose GUIs were separated from the underlying functionality by well-defined interfaces, and could thus be replaced by clients based on Web browsers. In general, a disruptive change whose consequences can be contained between system interfaces stands much less chance of disrupting the entire system.
As discussed above, there is a wide range of data users, from high-end users such as the numerical weather prediction centers, to companies providing products derived from satellite data, to individuals logging on to retrieve imagery to guide decisions about recreation. Because of this range, there is a corresponding diversity in the education needs of the users.
To facilitate and improve the use of environmental satellite data, education and/or training is needed to help users effectively address the following key issues:
What environmental parameters are measured by satellite observations?
For a desired parameter, how can data availability be determined?
For that data, what are the coverage, sampling, and data storage characteristics?
How can the data be obtained, from whom, in what form, at what cost, and with what delay?
How can the data be worked with to produce the desired products and information?
What are the quality and accuracy issues with the data, and where are the associated techniques and calibration procedures summarized (see the section “Bilko: A Case Study in Educating Users” in Appendix D)?
In partnership with satellite data providers, operational users of high volumes of satellite data, such as the numerical weather prediction centers, need to educate staff and to conduct preliminary trials with synthetic data sets. They can ill afford to experience loss of productivity and/or degradation of forecast skill by introducing
new data and associated assimilation methodologies. Education and outreach to such users must alert them far in advance about potential new sources and types of satellite data, the uncertainties in the data, the planned space/time sampling, and plans for the continuity of the observations in the future. These users must then work to develop the subsetting and assimilation tools they need to make optimal use of the data. Governmental, scientific, and commercial producers of derived and/or value-added products will seek similar opportunities to educate themselves about future satellite data and to prepare for its use, whether to develop products or to carry out research and assessments.
Many potential users of satellite data need less than the full volume fed to the high-end users. They are aware of the availability of satellite data and seek to use a space- or time-sampled subset. For them, the requirement of the interface is to guide them to access the geophysical variable they seek, and to explain the accuracy and sampling characteristics of the data. For these users, educational needs include learning what data is available from what servers and learning how to download and manipulate the data, including how to geolocate and subset the data.
Potentially, there can be much greater use of satellite data if new users have the opportunity for adequate training and education. For a new user, the volume of raw digital data and the trials of acquiring and manipulating that data can be a major obstacle. An approach that has had success in many instances is to provide user-friendly access to satellite-based imagery, allowing new users to become familiar with satellite fields and coverage by downloading or viewing image files. Training courses such as the United Nations Educational, Scientific, and Cultural Organization (UNESCO) Bilko software package (see Appendix D), which provides training in understanding and manipulating satellite images of oceanographic relevance, supply entry-level education for new users. For these users to effectively transition to working with digital data, additional educational needs must be met that include adequate training to efficiently locate, acquire, and manipulate digital data files.
For all levels of user sophistication, education is a nontrivial undertaking, and, to some extent, user education requirements can be mitigated by adhering to current standards and practices. This leverages knowledge that users are already likely to possess.
USER PARTICIPATION IN PLANNING AND PERFORMANCE FEEDBACK
Because of the long lead time associated with satellite missions and because of the limited involvement of the broad spectrum of potential end users in that planning process, the agencies must make a deliberate effort to actively involve end users in long-range strategic planning. They also need to derive the benefit of the
experience of the community of end users and to create pathways for feedback on the performance of the environmental satellite sensors and missions.
Two formats for user participation have been successful. One is the formation of a panel or team (sometimes called a science team or expert panel) with diverse representation, including end users external to the agency responsible for the sensor. This team works on sampling characteristics, algorithm development, and data quality, beginning prior to launch and following through after launch. Science teams have achieved close engagement with some end users and have added value through intense external scrutiny of performance.
A second approach to gaining user participation, one better suited to seeking guidance for planning future satellite observations, is the convening of user workshops. These can, for example, be focused on a specific observable, such as soil moisture; on a possible mission, such as measuring sea-surface salinity; or on the suite of instrumentation to be flown on a family of satellites, such as NPOESS. The participants provide feedback on the performance of current missions and on what new observations, new sampling characteristics, or new accuracies are needed to achieve their goals.