The U.S. Geological Survey (USGS) has a unique role and responsibility in preserving and archiving geospatial data on a national scale. An optimal spatial data infrastructure (SDI) for the USGS would include data standards, modern data-management services, and a set of key application services that are essential for addressing scientific questions. The SDI would also need to consider the importance of data-sharing and data discovery and to provide flexible methods for preserving geospatial data for long periods through numerous changes and updates, because the ability to document and analyze changes in temporal values on a national scale is of immense scientific and societal value.
Carrying out the committee’s vision for an SDI at the USGS requires synergistic partnerships with agencies and organizations that have already contributed to the SDI. A judicious selection of partners will enable the USGS to leverage its resources while adopting best practices and furthering interagency standardization. Finally, the success of a vision for a large program like the USGS SDI depends on supportive leadership and carefully planned, staffed, budgeted, and executed governance and policies.
Data are major tangible assets of the USGS. Although these assets are unique national and international resources, they have yet to realize their full potential, because many components remain inaccessible or are not interoperable. The lack of a single information space where data can be discovered and accessed is also an issue. The existence of data may be known to USGS programs and individuals but is largely unknown, and much less understood, outside the organization. An effective USGS SDI would need to enable and facilitate broad data discovery, sharing, and archiving across the research community.
Data discovery is the first important task of the USGS, which will need to ensure the discoverability of prime datasets in each division. Data discovery simply means that basic information about the existence of a spatial dataset and how it can be obtained is widely available. Once prime datasets (the datasets most critical to the pursuit of each science mission) have been identified and indexed, they must be searchable and accessible in a corporate data-management system. That will require the development of new institutional policies and a series of standards on metadata and data discovery, as well as compliance with those policies and standards.
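The core of data discovery, a searchable index of discovery-level metadata, can be illustrated with a minimal sketch in Python. The record fields are a simplified subset inspired by FGDC/ISO 19115 discovery metadata, and the dataset title and URL shown are placeholders rather than actual USGS holdings:

```python
from dataclasses import dataclass

@dataclass
class MetadataRecord:
    # Simplified, discovery-level fields (illustrative, not a full standard)
    title: str
    abstract: str
    keywords: list
    access_url: str

class Catalog:
    """A toy metadata catalog supporting keyword discovery."""

    def __init__(self):
        self.records = []

    def register(self, record):
        self.records.append(record)

    def search(self, term):
        # Match the term against title, abstract, and keywords
        term = term.lower()
        return [r for r in self.records
                if term in r.title.lower()
                or term in r.abstract.lower()
                or any(term in k.lower() for k in r.keywords)]

catalog = Catalog()
catalog.register(MetadataRecord(
    title="National Hydrography Dataset (sample record)",
    abstract="Surface-water features of the United States.",
    keywords=["hydrography", "streams", "water"],
    access_url="https://example.usgs.gov/nhd"))  # placeholder URL

hits = catalog.search("hydrography")
```

Even this trivial index captures the essential point: once basic existence-and-access information is recorded consistently, discovery becomes a search problem rather than a matter of personal knowledge.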
Data sharing is the second critical task for a functional USGS SDI. Data must be structurally and semantically interoperable so that they can be shared and integrated with other datasets in the USGS, around the nation, and with international partners. As a multidisciplinary organization, the USGS will need to be able to combine and synthesize data from various disciplines to contribute to its cross-domain missions.
The USGS has responsibility for maintaining data for the long term. Thus, an effective institutional strategy for data-archiving is needed as the third fundamental component of a USGS SDI to support temporal analysis. The USGS has a long history of creating authoritative spatial datasets; therefore, data creation is not included in the ‘discover and share for the long term’ mantra that was developed to help the USGS focus on the remaining steps beyond data creation.
Standards and interoperability are essential elements of an SDI, whether implemented in a region (such as the Infrastructure for Spatial Information in Europe) or in an organization (such as the British Geological Survey). Standards apply not just to data but to the array of processes that operate in an SDI. Standards require consistency of operation, which will be somewhat challenging for a scientific organization that needs to function within defined parameters but at the same time to innovate for future needs. Not all standards meet every user community’s needs, and, in some cases, non-standard derivative products may be necessary. However, the committee believes that existing standards that have been developed with input from across the user community are the best way of providing the widest possible access to outside users. The USGS also needs systems that are interoperable and that follow internationally agreed-upon consensus standards, such as those of the Open Geospatial Consortium (OGC) and the Geoscience Markup Language (GeoSciML), if it is to advance national and international multidisciplinary science. That will require the USGS to design and build an information-management system within the SDI so that information can be effectively managed, analyzed, and delivered to the appropriate stakeholders
within and outside USGS. As a major international player, the Survey will need to collaborate with international partners to address data standardization and to comply with international protocols. Expanding ongoing efforts to make spatial data available in Keyhole Markup Language (KML) and other formats compatible with popular web-based map viewers such as Google Earth and Bing Maps will provide great value to USGS spatial data.
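As a hedged illustration of the kind of KML export mentioned above, the following Python sketch uses only the standard library to emit a minimal KML document containing a single Placemark; the feature name and coordinates are illustrative:

```python
import xml.etree.ElementTree as ET

# Official KML 2.2 namespace (an OGC standard)
KML_NS = "http://www.opengis.net/kml/2.2"

def placemark_kml(name, lon, lat):
    """Build a minimal KML document containing one named point Placemark."""
    ET.register_namespace("", KML_NS)  # serialize KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    pm = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
    ET.SubElement(pm, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(pm, f"{{{KML_NS}}}Point")
    # KML coordinates are longitude,latitude[,altitude] in WGS84
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")

doc = placemark_kml("Sample gauge site (illustrative)", -77.3664, 38.9469)
```

A document produced this way can be opened directly in Google Earth or similar viewers, which is precisely what makes KML a low-cost channel for wider access to spatial data.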
Open Geospatial Consortium
The OGC is an international organization consisting of 420 government, industry, and academic entities that participate in a consensus process to develop open spatial data interface standards. Its core mission is “to develop standards that enable interoperability and seamless integration of spatial information, processing software, and spatial services” (OGC, 2004). It allows users of geospatial technology to work with technology providers. The OGC has defined some key interoperability standards for geospatial data that are supported by U.S. federal agencies, international data providers, national SDI organizations, and commercial software providers. Many OGC standards have been incorporated into International Organization for Standardization (ISO) standards and, conversely, many OGC standards incorporate ISO standards. OGC standards provide an essential infrastructure for SDIs that are designed to integrate fully onto the Web, and the OGC specification process and products have been adopted by nearly all SDI programs worldwide.
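To make the idea of an open OGC service interface concrete, the sketch below composes a Web Map Service (WMS) 1.3.0 GetMap request using the standard key-value-pair encoding; the endpoint URL and layer name are hypothetical:

```python
from urllib.parse import urlencode

def wms_getmap_url(base_url, layer, bbox, width=512, height=512):
    """Compose a WMS 1.3.0 GetMap request URL (key-value-pair encoding)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "CRS": "EPSG:4326",
        # In WMS 1.3.0 with EPSG:4326, BBOX axis order is
        # min-lat, min-lon, max-lat, max-lon
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint and layer, for illustration only
url = wms_getmap_url("https://example.usgs.gov/wms", "topo",
                     (38.0, -77.5, 39.0, -76.5))
```

Because the request format is standardized, any client that speaks WMS can retrieve a map image from any conforming server, regardless of which organization operates it.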
Geoscience Markup Language
GeoSciML is a major geoscience interoperability standards initiative that is being developed and supported by geological organizations worldwide. It is a Geography Markup Language application schema for transferring and sharing geologic information, typically in the form of geologic maps. GeoSciML is based on standards and specifications of the World Wide Web Consortium (W3C), the OGC, and the International Organization for Standardization (ISO). The OneGeology project is an initiative that uses GeoSciML to increase the accessibility of geologic map data worldwide, delivered in real time by merging data from several national geological surveys. To establish a common suite of features, GeoSciML draws on geoscience data-model efforts, geologic criteria (such as units, structures, and fossils), and artifacts of geologic investigations (such as specimens, sections, and measurements). Supporting objects (such as timescales and lexicons) are also considered so that they can be used as classifiers for the primary objects. GeoSciML meets the short-term goal of providing geoscience information associated with geologic maps and observations, and it could be extended in the long term to other geoscience data.
GeoSciML is governed by a working group of the International Union of
Geological Sciences Commission for the Management and Application of Geoscience Information. It would benefit from substantially increased USGS involvement because of USGS’s major role as a global geoscience-data provider, and the USGS would benefit greatly from making its data more interoperable, both within USGS and externally.
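The flavor of GeoSciML-style interoperability, structured geologic features exchanged as XML, can be suggested with a small parsing sketch. The namespace and element names below are loosely modeled on GeoSciML and are illustrative only; real GeoSciML schemas differ in structure and version:

```python
import xml.etree.ElementTree as ET

# Placeholder namespace, NOT the official GeoSciML namespace
GSML = "urn:example:geosciml"

# A hand-written fragment imitating a geologic-unit feature
sample = f"""
<gsml:GeologicUnit xmlns:gsml="{GSML}">
  <gsml:name>Ordovician limestone (illustrative)</gsml:name>
  <gsml:rank>formation</gsml:rank>
</gsml:GeologicUnit>
"""

root = ET.fromstring(sample)
# Namespace-qualified lookups keep the parse robust to prefix choices
name = root.findtext(f"{{{GSML}}}name")
rank = root.findtext(f"{{{GSML}}}rank")
```

The point of the exercise is that once feature names and structure are agreed on in a shared schema, any survey's data can be read by any partner's software without bespoke translation.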
Enterprise data management (EDM) refers to the ability of an organization to define data precisely, integrate them easily, and retrieve them effectively for both internal applications and external communication (DAMA, 2011). A common objective of EDM services is creating and maintaining data content that is accurate, precise, granular, consistent, transparent, and meaningful. There is an emphasis on integrating data content into business applications and facilitating the transfer of data from one business process to another. EDM applies to the management of spatial data resources and other types of scientific and business data, and it commonly tries to address circumstances in which users in organizations or in collaborative environments independently source, model, manage, and store data. Although EDM is not dependent on a specific data-type or technology strategy, it still requires a strategic approach to selecting appropriate technologies, processes, and governance structures. That can often be a challenge for organizations because EDM requires aligning activities (such as data-content management) with multiple user groups (such as finance, information technology, and operations). Moreover, in scientific organizations, responsibility for data management has typically fallen on individual researchers or small teams, with uneven results. Uncoordinated data-management approaches can result in data conflicts and inconsistencies in quality, which make it difficult for users to rely on such data for generating models, providing estimates, and informing decision-making.
The USGS SDI efforts can benefit from EDM techniques that have been adopted by others (see lessons learned and case studies in Chapter 3). For example, consolidation of data-management resources (such as database licensing, performance, backup and recovery, and archiving) helps to improve economies of scale. Those benefits apply whether data are centralized in an organization, distributed over multiple sites, or hosted in the cloud.
Data-Centric Research Challenges
The SDI concept is an outcome of a data-centric approach that is changing the management of information resources needed to support science and transforming scientific research and environmental policy-making. As data collections used to support specific research projects increase to petabytes, desktop Geographic Information Systems and statistical tools alone will be insufficient
to support the complex workflows required by scientists. Goble and De Roure (2009) have noted that “we are in an era of data-centric scientific research, in which hypotheses are not only tested through directed data collection and analysis but also generated by combining and mining the pool of data already available.” The data-driven landscape is expanding in scale and diversity. Spatial and scientific datasets grow in size and number, but they are often poorly coordinated and incompatible, so discovery and integration tasks present serious challenges. Mismatches may result from incompatibilities in scale, spatial or temporal resolution, or feature class, or from semantic differences. In this environment, it is important to go beyond relegating file management to individual scientists or research teams and to begin providing them with a robust data-management infrastructure.
Integrating Diverse Data
The types of digital spatial data created, maintained, and produced by USGS scientific teams are diverse. Some data resources are reference-data collections generated by sensors; others, like the Topographic Map Series, are maintained as national collections. Other pertinent types of USGS information include traditional scientific collections, such as reports, publications, drawings, and videos. Although a standard coordinate reference system would be invaluable in integrating various spatial datasets, it may be impractical. Nevertheless, there is a need to generate linkages among various spatial and nonspatial data collections (such as spreadsheets, published reports, documents, photographs, and engineering drawings) to support the USGS Science Strategy objectives. Integration of multiple data sources to support analysis (for example, to generate computational mashups) will be critical for an SDI and warrants high priority. The infrastructure will need the fundamental ability to cross-reference and cross-correlate information, facts, assumptions, and methods from different research domains on a global scale.
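The cross-referencing of spatial and nonspatial collections described above can be sketched as a simple join on a shared identifier; the gauge locations, site numbers, and report titles below are hypothetical:

```python
# Hypothetical spatial layer: stream-gauge sites keyed by site ID
gauges = {"06730500": {"lon": -105.0, "lat": 40.05}}

# Hypothetical nonspatial collection: published reports referencing site IDs
reports = [
    {"site": "06730500", "title": "Flood frequency analysis (illustrative)"},
    {"site": "09380000", "title": "Sediment load study (illustrative)"},
]

def link_reports(gauges, reports):
    """Cross-reference nonspatial reports to spatial gauge locations by site ID."""
    return [
        {**report, **gauges[report["site"]]}
        for report in reports
        if report["site"] in gauges  # drop reports with no matching site
    ]

linked = link_reports(gauges, reports)
```

The heavy lifting in practice is not the join itself but agreeing on and maintaining the shared identifiers that make such linkages possible across collections.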
Analysis and Modeling
Sophisticated data-mining techniques allow scientists to explore spatio-temporal patterns in data. The use of the data to model phenomena such as floods, landslides, and influenza pandemics can provide information helpful in projecting the course of events, which can be useful for prevention and intervention efforts with even just a few hours of lead time. A USGS SDI would enable spatial and nonspatial data from multiple sources and disciplines to be integrated and linked to a unified model of a given system (such as a watershed, ecosystem, or regional climate). The SDI should serve as a powerful Web platform for supporting search and analysis capabilities on a rich corpus of interlinked spatial data and conventional research publications. Moreover, to avoid moving massive amounts of data
around, an SDI enables computations to be pushed as close to the data sources as possible.
Security and Confidentiality
Deploying an SDI that is available to a wide user community requires attention to maintaining appropriate privacy, security, and provenance. Documenting provenance is particularly important for the publication and reward incentives that encourage contributions of data and model output. Security and privacy are critical in handling data related to human populations, endangered species, and sensitive environments. A security and confidentiality policy that addresses every stage of the geospatial data stream, from collection to end user, must be in place.
The data required to understand and model many of the environmental interactions highlighted in the USGS Science Strategy are massive (petabytes) and growing rapidly. The transition to wired and wireless distributed sensor networks that monitor the solid Earth and oceans is a major driving force for the increase in data volume. Others are the spread of data collection and archiving capabilities around the world, which are interlinked through the Internet, and the increasing use of new Web-enabled capabilities, such as cloud computing and social networking. However, the software tools and applications needed to fully manage and exploit this vast and distributed set of information resources are still emerging. Scientists and computer experts are developing a new generation of software applications to access, visualize, analyze, and interpret large amounts of diverse data for use in research and models that are designed to improve predictions and inform policy and decision-making.
Ideally, an SDI would not only enable the effective management of and easy access to distributed spatial data and information but support the tools and applications needed for research and decision-making, especially in support of the USGS Science Strategy. In the past, spatial data were often incorporated directly into scientific models and decision-support systems to meet specific research and policy needs. However, that led to “stovepiping” of information with poor transparency of data and methods and to numerous inconsistencies between approaches, which then inhibited data integration across multiple problems and discouraged reuse and repurposing of data and derived products. The common data layers, such as those that are hosted by The National Map, are an exception to this stovepiping.
Spatial Data Infrastructure as an Applications Platform
An essential function of an SDI is to provide an overall framework and architecture within which new applications can be developed and integrated. That does not necessarily mean that the SDI would itself need to encompass all those applications. Rather, the SDI could serve as a platform to support a large community (both in and outside USGS) in developing and operating a rich set of application services. In some cases, specific USGS elements or teams may need to build new applications to address specific science questions or meet specific mission needs. Using the SDI as a community applications platform would allow users to take advantage of existing applications that perform functions that they need rather than having to develop their own. In turn, when users develop new or improved services, they could more easily make them available to others through the SDI. Opening the SDI to application developers in other federal agencies or in the much larger geospatial data community could yield substantial benefits in shared resources, reduced duplication of effort, greater innovation, and expanded capabilities.
Spatial Data Infrastructure as a Workflow Platform
Recent work in other fields of electronic science, or e-science, demonstrates the valuable concept of high-throughput workflow processes as a means of processing and analyzing large and complex data resources (Taylor et al., 2007). Some workflow methods are designed to perform routine jobs and utilize the necessary computational protocols to undertake data-centric science. They enable scientists to focus on scientific discovery rather than having to spend effort and resources in routine data-processing. They also permit the development of more sophisticated tools for monitoring and detection (such as alert services for unusual or extreme events). Workflow approaches offer the opportunity to facilitate cross-disciplinary transfer and application of data-processing and analytic techniques in support of interdisciplinary research and problem-solving. Workflows that have been developed to address a specific problem in one field of science often are directly applicable to other, seemingly unrelated fields (Goble et al., 2008). Such cross-disciplinary uses of workflows can help to generate new analytic approaches, improve data quality and timeliness, reduce analysis costs, and speed the transfer of scientific knowledge to applications. Another important benefit of workflow approaches is the detailed record of data transformation and data-processing that is generated, which is a vital part of tracking the provenance of data and is critical for the long-term curation and reuse of data. Documenting and curating the workflows themselves may be an important role for an SDI in capturing geospatial expertise and ensuring the long-term reusability of the spatial data supported by the SDI.
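A minimal sketch of the provenance-recording benefit noted above, a workflow that logs a traceable fingerprint of each transformation, might look like the following; the gauge readings and processing steps are illustrative:

```python
import hashlib
import json

def run_workflow(data, steps):
    """Run a list of (name, function) steps, recording provenance for each."""
    provenance = []
    for name, fn in steps:
        data = fn(data)
        # Hash the intermediate result so each transformation is traceable
        digest = hashlib.sha256(
            json.dumps(data, sort_keys=True).encode()).hexdigest()
        provenance.append({"step": name, "sha256": digest[:12]})
    return data, provenance

# Hypothetical gauge readings and processing steps (illustrative only)
readings = [12.1, None, 11.8, 12.4]
steps = [
    ("drop_missing", lambda xs: [x for x in xs if x is not None]),
    ("mean", lambda xs: sum(xs) / len(xs)),
]
result, prov = run_workflow(readings, steps)
```

The provenance list produced here is the seed of the "detailed record of data transformation" the text describes: a later user can verify which steps, in which order, produced a given result.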
It is important for USGS management to appreciate the significance of USGS ownership of a comprehensive, reliable, national dataset. Developing and implementing the crucial elements of an SDI will require sponsorship and support at the highest levels in the USGS. Those in leadership who are specifically responsible for delivering the implementation will need the authority and resources to carry the task through. However, the biggest challenge in successfully implementing an SDI will be facilitating a cultural shift in the approach to data: data should be viewed as a corporate asset, not held as individual or divisional resources or as liabilities. Instituting the cultural change will be difficult given the natural tension between data-management activities and science research. As previously mentioned, incentivizing researchers to share data would be an example of a change in USGS management practices that could effect the necessary cultural shift.
The development and maintenance of an SDI requires collaborative partnerships with cooperating agencies, research organizations, nonprofit organizations, private organizations, and the public. The USGS Science Strategy acknowledges the need for its scientists and policy-makers to partner with other organizations in the sharing of scientific resources. With the creation of an SDI, the USGS is in a unique position to catalyze linkages between national science organizations and government environmental-assessment efforts, the broad research community, and the public. If properly coordinated and managed, the USGS SDI could provide considerable value to government efforts through data management, application of models, and other analysis tools. Through proper coordination and linkages with other, more science-based observatories, the USGS can focus its resources on high-priority science and environmental policy issues that are of national importance.
The public is an increasingly important partner in localized Earth observations through its use of devices such as GPS-enabled mobile telephones and cameras. Citizen-scientist field observations, such as records of plant species and growth and counts of fish and birds, will need to be integrated; such diverse volunteered information will be challenging but necessary to include in science analysis. A clear policy will need to be established regarding whether and how to incorporate a given spatial dataset, whether from citizen scientists or from private industry.
Effective partnerships will require workable and fair policies for the full and open exchange of data in compliance with U.S. federal government policies. In the United States, the White House Office of Management and Budget (OMB) establishes a framework policy for data access and reuse among federal agencies. In recognizing that government-generated information can be a valuable public resource, OMB Circular A-16 Revised (OMB, 2002) states that federal agencies have a responsibility to “collect, maintain, disseminate, and preserve spatial information such that the resulting data, information, or products can be readily shared with other federal agencies and non-federal users, and promote data integration between all sources.”
International partnering will be especially challenging. International science collaborators typically express a commitment to data access and data-sharing. For example, Europe has a policy framework with a specific directive that establishes the Infrastructure for Spatial Information in the European Community (INSPIRE) and a directive on public access to environmental information (Europa, 2003) that requires environmental information to be provided to the public in a timely manner. Furthermore, the European Commission has a policy that enables reuse of public-sector information (Directive 2003/98/EC) and, more recently, in December 2011 issued Decision 2011/833/EU so that documents could be provided on an open basis (Europa, 2011). In the absence of a supporting national policy, legal framework, and good data-management practices, however, such objectives are at risk of not being implemented. Many governance issues can arise when cooperating organizations span different legal jurisdictions or must comply with different organizational data-dissemination requirements. National policies and organizational practices that support these data-access systems will be necessary to ensure that research data flow as intended.
DAMA. 2011. The DAMA Dictionary of Data Management, 2nd Ed. Data Management International, 260 pp.
Europa. 2003. Directive 2003/4/EC of the European Parliament and of the Council of 28 January 2003. Available online at http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:041:0026:0032:EN:PDF (Accessed June 29, 2011).
Europa. 2011. Commission Decision of 12 December 2011 on the reuse of Commission documents. Available online at http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2011:330:0039:0042:EN:PDF (Accessed March 17, 2012).
Goble, C., and D. De Roure. 2009. The Impact of Workflow Tools on Data-centric Research. In The Fourth Paradigm: Data-Intensive Scientific Discovery, edited by T. Hey, S. Tansley, and K. Tolle, Microsoft Research. Pp. 137-145.
Goble, C.A., R. Stevens, D. Hull, K. Wolstencroft, and R. Lopez. 2008. Data curation + process curation=data integration + science. Briefings in Bioinformatics 9(6):506-517.
OGC (Open Geospatial Consortium, Inc.). 2004. The OGC – A Unique Organization Offering Unique Benefits. Available online at http://portal.opengeospatial.org/files/?artifact_id=7376 (Accessed June 24, 2011).
OMB (Office of Management and Budget). 2002. Coordination of Geographic Information and Related Spatial Data Activities. Circular No. A-16 Revised, August 19. Available online at http://www.whitehouse.gov/omb/circulars_a016_rev (Accessed June 29, 2011).
Taylor, I.J., E. Deelman, D.B. Gannon, and M. Shields (Eds.). 2007. Workflows for e-Science: Scientific Workflows for Grids. London: Springer.