Suggested Citation:"3 Strategy for Modernizing Data Storage, Retrieval, and Dissemination." National Research Council. 2012. Communicating Science and Engineering Data in the Information Age. Washington, DC: The National Academies Press. doi: 10.17226/13282.

3

Strategy for Modernizing Data Storage, Retrieval, and Dissemination

In this chapter, we propose a strategy for modernizing the infrastructure and processes that support the dissemination function of the National Center for Science and Engineering Statistics (NCSES). Several significant actions need to be taken in order to capitalize on the new technologies and processes that would facilitate this modernization. We make six recommendations for action, ranging from revising the format in which science and engineering (S&E) data are received from the survey contractors to paying more attention to archiving the data for long-term access and preservation.

CAPACITY OF NCSES TO TAKE ADVANTAGE OF NEW TECHNOLOGIES

Emerging technologies for data capture, storage, retrieval, and exchange will dramatically change the context in which NCSES will provide data to users in the future. These technologies will further increase efficiency, permitting users to access the data interactively and to dynamically integrate them with other information. For NCSES, the key to being able to take advantage of these technologies is to begin with a sharp focus on modernizing procedures for collection and ingestion of raw data and information about the data (metadata) into the data system. This is no simple task because modernization is likely to require infrastructure changes. Whether the existing systems will have the capacity to ingest the metadata and individual record data in formats that support the new technologies is not certain.

In order to take full advantage of many of the emerging data sharing and visualization tools described in Chapter 2, it is important that the incoming data be collected and ingested into the NCSES data processing system in as disaggregated a form as possible. The data should be accompanied by sufficient information about the data items (metadata) to support future analyses and comparability with previous analyses, and there should be an appropriate versioning/change management system to ensure that the ability to trace the origin and history of the data (provenance) is incorporated. This is challenging for NCSES because, for the most part, the agency's data are collected, updated, and accessed by its contractors. Since the collection, tabulation, and front-end activities are controlled by contractors, NCSES must specify requirements for data inputs that are compatible with retrieval in open data formats and in formats that support the common tools software developers use to process data.

The data also need to be in formats that enable taking advantage of the web development capabilities embedded in Data.gov and other emerging means of dissemination. The data must be capable of mashup with other data sources. These capabilities require that access to the data be available through an open application programming interface (API) that exposes the disaggregated data, along with their metadata, in machine-understandable form. Such access enriches results and enhances the value of the data to users.

It is critically important that the data be accompanied by the machine-actionable documentation (metadata) needed to establish the data’s history of origin and ownership (provenance) and include a record of any modifications made during data editing and clean-up. The documentation also needs to include the measurement properties of the data with sufficient detail and accuracy to enable publication-ready tables to be automatically generated in a statistically consistent manner.

Furthermore, it is critically important that a formal automated capability for tracking and controlling changes to a project’s files—in particular to source code, documentation, and web pages (version control)—and formal change management procedures be applied to data collected by contractors. This establishes a reliable data provenance and ensures that all previous publications can be automatically verified and replicated.
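As an illustration only, the idea of verifiable provenance can be sketched with content hashes: each delivered or edited file is fingerprinted, and each manifest entry points back to the version it replaced. The file name, sample records, and manifest layout below are hypothetical and are not drawn from any NCSES system; a production setup would more likely rely on an established version control tool.

```python
import hashlib
import json


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying this exact file content."""
    return hashlib.sha256(data).hexdigest()


def record_version(manifest: list, name: str, data: bytes, note: str) -> dict:
    """Append a provenance entry that links back to the previous version's hash."""
    previous = manifest[-1]["sha256"] if manifest else None
    entry = {
        "file": name,
        "sha256": fingerprint(data),
        "previous": previous,
        "note": note,
    }
    manifest.append(entry)
    return entry


def verify(manifest: list, data: bytes) -> bool:
    """Check that the data match the most recently recorded version."""
    return bool(manifest) and manifest[-1]["sha256"] == fingerprint(data)


# Hypothetical delivery followed by one documented correction.
manifest = []
raw = b"respondent_id,field,degree_year\n1,physics,2009\n2,biology,2010\n"
record_version(manifest, "survey.csv", raw, "as delivered by contractor")
edited = raw.replace(b"2009", b"2008")
record_version(manifest, "survey.csv", edited, "corrected degree_year for respondent 1")

print(json.dumps(manifest, indent=2))
```

With a record of this kind, a publication can cite the hash of the data version it was built from, and anyone holding that version can confirm they are working from the same content.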

In the panel’s judgment, NCSES is not very well positioned to meet the above preconditions for taking advantage of emerging technologies. The survey data that are entered into the center’s database are received from the survey contractors in tabular format, mainly through machine-readable tabulations, rather than in a more easily accessible microdata format.

This situation is not unique to the S&E data that are received from contractors by NCSES. Suzanne Acar (representing the U.S. Federal Bureau of Investigation and the Federal Data Architecture Subcommittee of the Chief Information Officers Council) stated that difficulty in fully utilizing emerging technologies is a government-wide issue, one that will be taken up by a group of the World Wide Web Consortium (W3C) and other standards organizations.¹ W3C has plans to develop contract templates to enable governmental organizations to properly specify the format for receipt of the data from their contractors.

According to Ron Bianchi (representing the Economic Research Service of the U.S. Department of Agriculture), barriers to taking advantage of emerging technologies are a widespread issue across the federal statistical system and have been identified as a major concern for the newly formed Statistical Community of Practice and Engagement (SCOPE). This coordinating activity involves most of the large federal statistical agencies. The initial plans for the SCOPE initiative have included developing a template for contract deliverables specifications for data formats and accompanying metadata.

Recommendation 3-1. The National Center for Science and Engineering Statistics should incorporate provisions in contracts with data providers for the receipt of versioned microdata, at the level of detail originally collected, in open machine-actionable formats.

Implementing this recommendation will be no simple task for NCSES. Currently, NCSES manages 13 major surveys that involve contracts with five private-sector organizations and the U.S. Census Bureau (see Table 2-1). Furthermore, adding this requirement may initially incur additional costs to support a shift from the current practice of formatting the data after they are received to requiring contractors to input the data in a new format. Some consideration will have to be given to reformatting the existing historical databases to be compatible with the new open formats and structures, when possible, so data can be manipulated across current and prior survey results.

To enable the receipt of metadata from contractors in a universally accessible format, NCSES should consider adopting an electronic data interchange (EDI) metadata transfer standard. The selection and adoption of a metadata transfer standard would be more effective if NCSES accomplished it through participation in a government-wide initiative, such as the W3C contract template development or the SCOPE effort, which is more focused on the federal statistical agencies.

____________

¹W3C is an international community of member organizations that develops web standards; see http://www.w3c.org [November 2011].

Improving Data Delivery, Presentation, and Quality

In their presentations to the panel, the NCSES staff produced a large hard-copy stack of tabulations, noting that the stack represented just one of the center’s periodic reports. The staff also noted that, even though the center has largely shifted to electronic dissemination, the dictates of data accuracy and reliability require that a great deal of NCSES time is spent in checking data and formatting them for print and electronic publication.² For example, each page of the hard copy must be checked by someone looking at the source data. This effort comes at the expense of ensuring data integrity at the source. We think this emphasis is misplaced.

Although it will never be possible to fully avoid edit and quality checks, because errors are prone to creep into data at any stage in processing, there is much to be gained by focusing primarily on the quality of the incoming raw data from the source. This approach is best ensured by adopting a comprehensive database management framework for the process, rather than the current primary focus on review of the tabular presentation. A framework that ensures integrity at the source of the data, buttressed by the availability of metadata, is the necessary foundation of real improvement in data dissemination. Adoption of such an approach should have further benefits. By changing to a dissemination framework from a review framework, NCSES could free up some existing resources or be able to reduce contractor involvement, which would allow for the realignment of resources and funding to focus on making further process improvements.

Recommendation 3-2. The National Center for Science and Engineering Statistics should transition to a dissemination framework that emphasizes database management rather than data presentation and strive to use auditable machine-actionable means, such as version control, to ensure integrity of the data and make the provenance of the data used in publications verifiable and transparent.

All of the tables published by NCSES are selections, aggregations, and projections of the underlying micro-level observations. Recommendation 3-2 envisions that, whenever possible, published tables should be defined explicitly in these terms and produced by an automated process that includes metadata.
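To make this concrete, the following minimal sketch defines a table declaratively as a selection, a projection, and an aggregation over microdata, and then produces it automatically. The records, variable names, and the `table_spec` layout are hypothetical illustrations, not an NCSES format, and only simple counts are implemented.

```python
from collections import Counter

# Hypothetical microdata: one dict per respondent (illustrative values only).
microdata = [
    {"id": 1, "field": "physics", "sex": "F", "year": 2010},
    {"id": 2, "field": "physics", "sex": "M", "year": 2010},
    {"id": 3, "field": "biology", "sex": "F", "year": 2010},
    {"id": 4, "field": "biology", "sex": "F", "year": 2009},
]

# Hypothetical table definition stored alongside the data as metadata.
table_spec = {
    "title": "Respondents by field and sex, 2010",
    "select": {"year": 2010},      # selection: which records are in scope
    "project": ["field", "sex"],   # projection: which variables define the cells
    "aggregate": "count",          # aggregation: how records in a cell are combined
}


def build_table(records, spec):
    """Produce the table cells described by spec from the microdata."""
    if spec["aggregate"] != "count":
        raise ValueError("this sketch only implements counts")
    in_scope = [
        r for r in records
        if all(r[key] == value for key, value in spec["select"].items())
    ]
    cells = Counter(tuple(r[col] for col in spec["project"]) for r in in_scope)
    return dict(cells)


table = build_table(microdata, table_spec)
for cell, count in sorted(table.items()):
    print(cell, count)
```

Because the definition is data rather than hand formatting, the same table can be regenerated and checked whenever the underlying microdata are corrected.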

The panel acknowledges that in some cases, such as the NCSES’s Science and Engineering Indicators, this approach may not be immediately feasible, since an extensive data appendix is necessary to support the analysis in the report. However, in general (following the practice that NCSES currently employs for the most detailed statistical tables), a web release of the raw data will reduce the burden on the NCSES staff of manually checking publications, will form the basis of a transition from tables to information, and will provide users with more timely information. This structured approach to release of data will also provide transparency in the process, increase replicability, and assuage any user concerns about the delay between data collection and their availability.

____________

²This information is based on the National Science Foundation presentation to the panel, October 27, 2010 (slide numbers 14-16).

It is important that the data provided by contractors to NCSES include machine-readable metadata that capture the statistical properties of the data and of the collection and research design. The appropriate form and content of these metadata are being considered in the SCOPE initiative. It is likely that such metadata are produced in the data collection process, since computer-assisted telephone interviewing (CATI) and other related survey tools use much of this information in their operations. However, metadata are currently not included in the required deliverables to the National Science Foundation (NSF) from contractors.

The shift to increased user capacity to produce customized output from the raw data is potentially a major enhancement that could offer great direct benefit, but such a change will also require consideration of second-order effects. Care will need to be taken to protect data confidentiality when providing users with cross-source microdata; consequently, rules about publishable cell size, for example, will have to be carefully considered.³ The greater transparency inherent in making more of the raw data available also increases the risk that users could juxtapose data in ways that lead to invalid interpretations, although this danger can certainly be reduced by the accessibility of robust metadata that explain the meaning (and limitations) of the data.
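A publishable cell size rule of the kind mentioned above can be sketched as a simple threshold check applied before a customized table is released. The threshold of 5 and the counts below are hypothetical, chosen only for illustration, and the sketch covers primary suppression only; actual disclosure limitation also has to prevent suppressed cells from being recovered from row and column totals.

```python
def suppress_small_cells(cells, min_cell_size):
    """Replace counts below the threshold with None (primary suppression only)."""
    return {
        key: (count if count >= min_cell_size else None)
        for key, count in cells.items()
    }


# Hypothetical cell counts from a user-defined tabulation.
counts = {("physics", "F"): 12, ("physics", "M"): 2, ("biology", "F"): 7}

# Hypothetical threshold; the appropriate value is a policy decision.
published = suppress_small_cells(counts, min_cell_size=5)

for cell, count in sorted(published.items()):
    print(cell, "suppressed" if count is None else count)
```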

Recommendation 3-3. The National Center for Science and Engineering Statistics (NCSES) should require that data received from contractors be accompanied by machine-actionable metadata so as to allow for automated production of NCSES publications, comparability with previous analysis, and efficient access for third-party visualization, integration, and analysis tools.

____________

³Several reports of the Committee on National Statistics address the need to maintain the confidentiality of data provided to government agencies in confidence: Privacy and Confidentiality as Factors in Survey Response (National Research Council, 1979); Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics (National Research Council, 1993); Protecting Student Records and Facilitating Education Research (National Research Council, 2009); and Protecting and Accessing Data from the Survey of Earned Doctorates (National Research Council, 2010).

Another positive benefit of providing transparency and tools for exploratory access to data is that users will be in a position to identify errors in the data. NCSES should be prepared to solicit and accept error reports and make corrections as necessary. In contemporary terms, this would be an application of “crowd sourcing”—a focused attempt to tap into the collective intelligence of the users of the data. Clearly, when the general public has access and tools to combine data across data sources, there may be additional questions about data accuracy and usefulness, and NCSES will need to do its best to educate users and respond to their discoveries.

In its presentations, NCSES staff stressed that they are a comparatively small organization with limited resources. One way that these limited resources could be stretched is for NCSES to consider digital distribution channels, including enhanced use of pdf files and, after investigation of cost and benefits, perhaps facilitating print-on-demand (POD) publication. NCSES may wish to consider turning to POD technology of the U.S. Government Printing Office (GPO) as a potential means of controlling the costs associated with printing and distributing the few remaining hard-copy reports that it produces (see Chapter 2 for details).

VISUALIZATION OF S&E DATA

Just as a picture may be worth a thousand words, so can the best data visualizations replace a ream of tabular output and written analysis. Applications of data visualization—or as Edward Tufte (2004) characterizes it, the visual display of quantitative information—are growing profusely. (See Ware, 2004, for a contemporary treatment of this area.) Data visualizations are increasingly being used by federal data-producing agencies and others to analytically depict large data sets, such as those produced by NCSES. Two of the larger statistical agencies—the Census Bureau and the Bureau of Economic Analysis—and other federal agencies maintain visualization sites that are suggestive of approaches that NCSES might profitably take.⁴

Indeed, assisted by NCSES, the National Science Board has provided visualized data in the form of charts, graphs, and maps in its printed and online digest published in support of the 2010 Science and Engineering Indicators volume (National Science Board, 2010b). These static displays of information have been chosen by NSF staff for their ability to clarify relationships and trends in visually pleasing and interesting ways. They are appropriately considered first-generation visualizations, since they are not associated with an electronic database and thus are not susceptible to manipulation by data users who want to interactively illustrate aspects of the data for their own analysis.

____________

⁴See http://blogs.census.gov/censusblog/2011/07/visualizing-foreign-trade-data.html; http://lehd.did.census.gov/led/datatools/visualization.html; http://www.bea.gov/newsreleases/glance.htm; http://www.bea.gov/itable/index.cfm; http://www.uspto.gov/dashboards/patents/main.dashxml [August 2011].

The field of data visualization is quite dynamic, with new approaches and technologies being offered in the form of online sites and applications by both private and public sectors, as well as nuanced approaches to building a community of analysts around visualized subject matter. Because of the shortage of staff resources and the fast-changing data visualization landscape, the panel suggests that NCSES choose among several deliberate approaches to improving visualization of its data. NCSES could (a) confederate with other federal statistical agencies that are already moving forward with visualization programs under an umbrella such as SCOPE; (b) work with private-sector vendors, such as the Google Public Data Explorer, to expand the potential for visualization of the NCSES data sets (taking much the same approach as Eurostat); or (c) continue to develop a select set of straightforward visualizations, such as those offered in the 2010 Digest, but continuously update those visualizations and post them to the Internet when new data become available.

As discussed in Chapter 2, a complementary approach would be to provide the data in machine-understandable formats using open standards and with appropriate metadata so that users can develop their own visualizations using the increasingly sophisticated private vendor visualization tools that are on the market. NCSES could take advantage of the rapidly emerging services that make data easier to find, aggregate, interpret, integrate, and link.

Recommendation 3-4. The National Center for Science and Engineering Statistics should proceed to make its data available through open interfaces and in open formats compatible with efficient access for third-party visualization, integration, and analysis tools.

RETRIEVAL AND DISSEMINATION TOOLS

Adopting a new approach to data management and distribution will open up many exciting opportunities for low-cost solutions to data retrieval and dissemination. These opportunities would expand utilization of emerging government and private-sector resources to go beyond the capabilities offered by the current Scientists and Engineers Statistical Data System (SESTAT), the Integrated Science and Engineering Resources Data System (WebCASPAR), the Industrial Research and Development Information System (IRIS), and the Survey of Earned Doctorates (SED) Tabulation Engine tools.

As discussed in Chapter 2, once the conditions are established for dissemination of data, the public services, such as Data.gov, and private services, such as the Google Public Data Explorer, can bear much of the burden of dissemination. A caveat is in order here, however. Although using private-sector tools for dissemination is a promising solution for NCSES, dissemination tool development is extremely dynamic in the private sector, as panel member Micah Altman observed at the panel workshop. Many of the start-up dissemination and data sharing services have gone out of business. In view of this uncertainty, his advice is that users should mitigate the risk of using any of these systems by opting for open-source software whenever possible, retaining preservation copies of files in other institutions, limiting use to dissemination only (not data management), and leveraging metadata and APIs to create one data source that is then disseminated through multiple sources.

Another caution was voiced at the panel workshop by Myron P. Gutmann, director of NSF’s Directorate of Social, Behavioral, and Economic Sciences, with regard to such private-sector services as the Google Public Data Explorer. He warned that it could be dangerous to rely too heavily on these private-sector dissemination tools, since the conditions of service or even the continued provision of service are corporate decisions that could significantly change or even end the dissemination mode. He also expressed a concern that distribution in a nongovernment-owned system could open the possibility of unauthorized changes in the data set unless there were strict controls in place within the dissemination tool and a policy that the data be anchored back to the originating federal agency source.

Altman identified research challenges and gaps between the state of the art and the state of the practice. Research challenges in this area include peta-scale online analysis, interactive statistical disclosure limitation, business models for long-term preservation, and data analysis tools for the visually impaired. Closable gaps include managing nontabular complex data and metadata-driven harmonization and linkage across data resources.

Recommendation 3-5. The National Center for Science and Engineering Statistics should develop a plan for redesign of its retrieval tools utilizing the emerging, sustainable capabilities of other government and private-sector resources.

PRESERVING ACCESS TO S&E DATA

When considering data release and management, it is important to have a long-term data management plan. Yet according to staff, the current NCSES approach to archival issues is ad hoc. In view of the importance of these data for historical reference, long-term access and permanent archival preservation are needed, and these could be ensured through proper policies and practices.

At a minimum, all of the collected data and the electronic and hard-copy publications that are produced should be scheduled for retention by the National Archives and Records Administration (NARA). In this regard, the NSF Sustainable Digital Data Preservation and Access Network Partners (DataNet) initiative is a ready in-house source of information on best practices and tools for implementing an active archival program.

NARA ELECTRONIC RECORDS PROGRAM

NARA has responsibility for the custody and retrieval of federal government records for which it has received a transfer of legal custody from the originating agency. A growing part of the NARA collections is in the form of electronic records. Because of the panel’s interest in ensuring the long-term retention and retrieval of NCSES data, we invited Margaret O. Adams, manager of the Archival Services Program, and Theodore J. Hull, senior archivist of accessions, to discuss the NARA reference services for electronic records.

The process for identifying records for archiving is a collaborative one. NCSES is required by law to manage records created or received in the course of business, and it does so by completing a form (Standard Form 115) that outlines the holdings and requests records disposition authority. Through a records scheduling and appraisal process, the archivist of the United States determines which federal records have temporary value and may be destroyed and which federal records have permanent value and must be preserved and transferred to the National Archives of the United States. The archivist’s determination constitutes mandatory authority for the final disposition of all federal records (36 CFR 1220.12). Only a very small percentage of records identified for permanent retention are actually accessioned by NARA, but the kind of electronic records that are produced by NCSES have a very high chance of being appraised for permanent retention: that is, social and economic microdata collected for input into periodic and onetime studies and statistical reports, including information filed to comply with government regulations, as well as summary statistical data from national or special censuses and surveys.

According to Hull, a good part of the accessioning work is done by NARA. When records, documentation, and accession documents (SF-258) are received, NARA conducts a preliminary assessment, which can involve converting files to ASCII, contacting the agency for replacements or additional documentation, verifying file formats, and selecting only permanent files for retention. Only then are records archived using NARA’s Archival Preservation System (APS).

After they are accessioned, they may be researched and retrieved using descriptions of the electronic records series in NARA’s online Archival Research Catalog (ARC).⁵ (This source will be replaced by NARA’s Online Public Access [OPA] system in coming months.) ARC includes descriptions for approximately 68 percent of NARA’s holdings nationwide and about 99 percent of accessioned electronic records.

The NARA records system is a very large system. As of January 2011, there were 717 series and 6.6 billion logical data records contributed by over 150 source agencies described in the ARC. The ARC search supports filtering by type of records (data files), and copies of fully releasable data files are provided on removable media for cost recovery. The Online Public Access system currently under development aims to support direct download of electronic records files.

In her presentation, Adams referred to the Committee on National Statistics publication, Principles and Practices for a Federal Statistical Agency (National Research Council, 2009, p. 27), which states that “a good dissemination program also uses a variety of channels to inform the broadest possible audience of potential users about available data products and how to obtain them.… Agencies should also arrange for archiving of data with the National Archives and Records Administration and other data archives, as appropriate, so that data are available for historical research in future years.”

As mentioned above, the archiving process begins with the identification of holdings and the request for records disposition authority by the agency. This is sometimes a challenging task, particularly with the growth of electronic versus hard-copy holdings. In the case of NCSES, the process of identifying and completing a records disposition authority request was last completed in 1995. Several types of records were then identified for permanent retention, including final published surveys and studies; electronic micro-level survey data, final edited versions of all electronic survey microdata, databases, spreadsheets, detailed tables, charts, statistical data, and other micro-level respondent information created prior to compiling, condensing, or summarizing the survey responses into the final summarized or published product; electronic text and detailed statistical tables, data analyses, and related records; electronic copies of survey reports, including the text of the final report and all other electronic records related to the report, such as detailed tables, charts, statistical data analyses, and spreadsheets; and technical information regarding data format and structure and other related computer program and system documentation, including codebooks, file layouts, data fields, data dictionaries, and other records that are necessary to understand the microdata. For most of these items, NCSES is instructed to retain them at the agency level for 10 years and then forward them to NARA.

____________

⁵See http://www.archives.gov/research/arc [November 2011].

Much has happened in terms of data collection, processing, and dissemination in the years since 1995. It is appropriate that NCSES review and refile, if necessary, a request for records disposition authority.

Recommendation 3-6. The National Center for Science and Engineering Statistics (NCSES) should work with the National Archives and Records Administration (NARA) to ensure long-term access and preservation of all of its publications and all data necessary to replicate these publications. As a necessary step, NCSES should review and update the request for disposition authority that is filed with NARA to ensure prompt and complete disposition of records and should regularly review the status of compliance with the records retention directive.
