THE NATIONAL ACADEMIES
Advisers to the Nation on Science, Engineering, and Medicine
Committee on National Statistics
Panel on Communicating NSF Science and Engineering Information to Data Users
The Keck Center 500 Fifth Street, NW Washington, DC 20001 Phone: 202 334 3096 Fax: 202 334 3751 www.national-academies.org/cnstat
Dr. Lynda Carlson
Director
National Center for Science and Engineering Statistics
National Science Foundation
4201 Wilson Boulevard, Suite 965 Arlington, VA 22230
Dear Dr. Carlson,
This letter report from the Panel on Communicating National Science Foundation (NSF) Science and Engineering Information to Data Users recommends action by the National Center for Science and Engineering Statistics (NCSES), formerly the Division of Science Resources Statistics (SRS), on four key issues: data content and presentation, meeting changing storage and retrieval standards, understanding data users and their emerging needs, and data accessibility. The panel members are listed in Appendix A.
The recommended actions in this letter report can be considered preliminary steps for NCSES to prepare for initiatives that will foster a transition from current practices and approaches to an improved program of data dissemination. These and other issues will be discussed in the panel’s final study report, which is scheduled to be issued in mid-2011. This letter report also includes a summary of the workshop that was held on October 27–28, 2010; see Appendix B. The workshop focused on several aspects of NCSES’s current approaches to communicating and disseminating statistical information—including the center’s information products, website, and database systems. It included presentations from NCSES staff and representatives of key user groups—including the academic research, private nonprofit research, and federal government policy-making communities; see Appendix C for the workshop agenda.
PANEL CHARGE AND ISSUES
The NCSES, as a means of fulfilling its mandate to collect and distribute information about the science and engineering enterprise for the National Science Foundation, conducts an ambitious program of data dissemination in several formats: hard-copy and electronic-only publications, an extensive NCSES website, and two tools that are used to retrieve data from the NCSES database: the Integrated Science and Engineering Resource Data System (WebCASPAR) and the Scientists and Engineers Statistical Data System (SESTAT). These outputs and tools serve a broad community of information users, with wide-ranging data needs, statistical knowledge, access preferences, and technical abilities.
Our panel was asked to review the NCSES communication and dissemination program that is concerned with the collection and distribution of information on science and engineering and recommend future directions for the program. Specifically, we were asked to:
- Review NCSES’s existing approaches to communicating and disseminating statistical information, including the center’s information products, website, and database systems. [This review will be conducted in the context of both current “best practices” and new and emerging techniques and approaches.]
- Examine existing NCSES data on websites, information gathered by and from NCSES staff, volunteered comments of users, and input solicited by the panel from key user groups and assess the varied needs of different types of users within NCSES’s user community.
- Consider the impact that current federal and NSF website guidance and policies have on the design and management of NCSES’s online (Internet) communication and dissemination program.
- Consider current research and practice in collecting, storing, and utilizing metadata, with particular focus on specifications for social science metadata developed under the Data Documentation Initiative (DDI).
- Consider the impact of government-wide activities and initiatives (such as FedStats and Data.gov) and the emerging user capability for online retrieval of government statistics.
IMPROVING DATA DELIVERY, PRESENTATION, AND QUALITY
In their presentations to the panel, NCSES staff produced a large stack of hard-copy tabulations, noting that the stack represented just one of the center’s periodic reports. The staff also noted that, even though the center has largely shifted to electronic dissemination, the dictates of data accuracy and reliability require that a great deal of NCSES time be spent checking data and formatting them for print and electronic publication.1 For example, each page of the hard copy must be checked by someone looking at the source data. This effort comes at the expense of ensuring data integrity at the source. We believe this emphasis is misplaced.
Although it will never be possible to fully avoid edit and quality checks, because errors can creep into data at any stage of processing, there is much to be gained by focusing primarily on the quality of the incoming “raw” data from the source. That focus is best achieved by adopting a comprehensive database management framework for the process, rather than the current primary emphasis on review of the tabular presentation. A framework that ensures integrity of the data at the source, buttressed by the availability of metadata (that is, data about the data), is the necessary foundation of real improvement in data dissemination.
RECOMMENDATION 1: The National Center for Science and Engineering Statistics should transition to a dissemination framework that emphasizes database management rather than data presentation and strive to ensure integrity of the data at the source.
All of the tables published by NCSES are selections, aggregations, and projections of the underlying microlevel observations. The recommendation above envisions that, wherever
possible, published tables should be defined explicitly in these terms and produced by an automated process that includes metadata.
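The idea of a table defined explicitly over microdata can be sketched briefly. In the illustration below (a sketch only; the field names, sample records, and specification format are hypothetical, not NCSES’s actual schema), a published table is a selection-and-aggregation specification that an automated process applies to the microdata, with the metadata carried along:

```python
# Illustrative sketch only: the fields, records, and spec format are hypothetical.
from collections import defaultdict

# Microdata: one record per survey response (hypothetical fields).
microdata = [
    {"field": "engineering", "degree_year": 2008, "count": 120},
    {"field": "engineering", "degree_year": 2009, "count": 135},
    {"field": "life_sciences", "degree_year": 2008, "count": 210},
    {"field": "life_sciences", "degree_year": 2009, "count": 190},
]

# A published table defined explicitly as a selection + aggregation of the
# microdata, with machine-readable metadata carried alongside.
table_spec = {
    "group_by": "field",
    "measure": "count",
    "filter": lambda rec: rec["degree_year"] >= 2008,
    "metadata": {
        "source": "hypothetical-survey-2009",
        "universe": "degree recipients, 2008-2009",
    },
}

def build_table(records, spec):
    """Produce a published table automatically from microdata and a spec."""
    totals = defaultdict(int)
    for rec in filter(spec["filter"], records):
        totals[rec[spec["group_by"]]] += rec[spec["measure"]]
    return {"cells": dict(totals), "metadata": spec["metadata"]}

table = build_table(microdata, table_spec)
print(table["cells"])  # {'engineering': 255, 'life_sciences': 400}
```

Because the table is regenerated from the specification on demand, quality assurance can concentrate on the source data rather than on cell-by-cell review of the printed output.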
The panel acknowledges that in some cases—such as NCSES’s Science and Engineering Indicators—this approach may not be feasible, since an extensive data appendix is necessary to support the analysis in the report. In general, however, a web release of the raw data (following the practice that NCSES currently employs for the most detailed statistical tables) will reduce the burden on NCSES’s staff, form the basis of a transition from “tables” to “information,” and provide users with more timely information. This structured approach to data release will also provide transparency in the process and assuage any user concerns about the delay between data collection and availability.
In its presentations, NCSES staff stressed that they are a comparatively small organization with limited resources. One way that these limited resources could be stretched is for NCSES to consider digital distribution channels, including enhanced use of PDF files and, after investigation of cost and benefits, perhaps facilitating print-on-demand publication. NCSES may wish to consider turning to the print-on-demand technology of the U.S. Government Printing Office as a potential means of controlling the costs associated with printing and distributing the few remaining hard-copy reports that it produces.
It is important that the data provided by contractors to NCSES include machine-readable metadata that capture the statistical properties of the data and of the collection and research design. The appropriate form and content of these metadata are being considered in the Statistical Community of Practice and Engagement (SCOPE), an initiative that spans the federal statistical agencies, which was discussed by Ron Bianchi (representing the Economic Research Service of the U.S. Department of Agriculture) at the workshop (see Appendix B). It is likely that such metadata are produced in the data collection process, since computer-assisted telephone interviewing (CATI) and other related survey tools use much of this information in their operations. However, metadata are currently not included in the required deliverables to NSF from contractors.
The shift to increased provision of raw data is a potentially significant enhancement with great direct benefit, but such a change will also require consideration of second-order effects. Care will need to be taken to ensure data confidentiality when providing users with cross-source microdata: consequently, rules about publishable cell size, for example, will have to be carefully considered.2 The greater transparency inherent in making more raw data available also increases the risk that users could juxtapose data in ways that lead to invalid interpretations, though this danger can certainly be lessened by the accessibility of robust metadata that explain the meaning (and limitations) of the data. The current state of metadata technology permits tagging data items with permanent uniform resource locator (URL) and uniform resource identifier (URI) codes that identify the source and meaning of the data items.
Another benefit of providing transparency and tools for exploratory access to data is that users will be in a position to identify errors in the data, and NCSES should be prepared to solicit and accept error reports and make corrections as necessary. Clearly, when the general public has access and tools to combine data across data sources, there will be additional questions about data accuracy and usefulness, and NCSES will need to do its best to educate users and respond to users’ discoveries.
Finally, when considering data release and management, it is important to have a long-term data management plan. Yet according to staff, NCSES’s current approach to archival issues is ad hoc. In view of the importance of these data for historical reference, long-term archival access is needed, and it can be assured through proper policies and practices. At a minimum, all of the collected data and publications should be scheduled for retention by the National Archives and Records Administration. In this regard, the NSF Sustainable Digital Data Preservation and Access Network Partners (DataNet) initiative is a ready in-house source of information on best practices and tools for implementing an active archival program.
MODERNIZING DATA CAPTURE, STORAGE, AND RETRIEVAL
Emerging technologies for data capture, storage, and retrieval will dramatically change the context in which NCSES will provide data to users in the future. For NCSES, the key to taking advantage of these technologies—which are designed to increase the efficiency of data capture, storage, and retrieval and to permit users to access the data interactively (such as with Web 2.0)—is to focus on procedures for entry of the raw data into the system. It is critically important that data enter the system in as disaggregated form as possible.
Furthermore, it is critically important that the data be accompanied by the machine-actionable documentation (metadata) needed to establish the data’s history of origin and ownership (known as provenance) and to record any modifications made during data editing and clean-up. The documentation also needs to include the measurement properties of the data in sufficient detail and with sufficient accuracy to enable publication-ready tables to be generated automatically in a statistically consistent manner.
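The kind of machine-actionable provenance and edit record described above can be sketched as follows; the record structure and field names are illustrative assumptions, not a proposed standard:

```python
# Illustrative sketch only: the record structure and field names are hypothetical,
# intended to show the kind of machine-actionable provenance the text describes.
import json

record = {
    "value": 135,
    "provenance": {
        "origin": "hypothetical CATI survey instrument",
        "collected_by": "example-contractor",
        "received": "2010-06-01",
    },
    "edit_history": [],
}

def apply_edit(rec, new_value, reason):
    """Record a modification made during editing, preserving the old value."""
    rec["edit_history"].append(
        {"old_value": rec["value"], "new_value": new_value, "reason": reason}
    )
    rec["value"] = new_value
    return rec

apply_edit(record, 137, "out-of-range check corrected against source document")
print(json.dumps(record["edit_history"], indent=2))
```

A record of this shape lets downstream tabulation software both trace every published cell back to its origin and reproduce (or reverse) each editing step.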
The data also need to be able to take advantage of the web development capabilities embedded in Data.gov and other emerging dissemination means to create “mashups” of data sources (a web page or application that uses and combines data, presentation, or functionality from two or more sources to create new services). These capabilities, incorporating such tools as open application programming interfaces (APIs), enrich results and enhance the value of the data to data users. Unfortunately, NCSES is not well positioned to take advantage of these new developments because the survey data that are entered into the center’s database are received from the survey contractors in tabular format (though machine readable) rather than in an easily accessible microdata format. This problem is not unique to the science and engineering data. The panel heard from Suzanne Acar (representing the U.S. Department of the Interior and the Federal Data Architecture Subcommittee) that this is a governmentwide issue, one that will be taken up by a group of the World Wide Web Consortium (W3C),3 which plans to develop contract templates to enable governmental organizations to properly specify the format for receipt of data from their contractors. Ron Bianchi (representing the Economic Research Service of the U.S. Department of Agriculture) stated that this is also a concern for the newly formed Statistical Community of Practice and Engagement (SCOPE) and that current plans for this coordinating activity, which involves most of the large federal statistical agencies, include developing a template for contract deliverables specifications.

3 W3C is an international community in which member organizations, a full-time staff, and the public work together to develop web standards; see http://www.w3c.org [November 2010].
RECOMMENDATION 2: The National Center for Science and Engineering Statistics should incorporate provisions in contracts with data providers for the receipt of data in formats, and accompanied by metadata, that will allow efficient access for third-party visualization and integration and the use of analysis tools.
Implementing this recommendation will be no simple task for NCSES. Currently, NCSES manages 13 major surveys, involving contracts with five private-sector organizations and the U.S. Census Bureau: see Table 1. Furthermore, adding this requirement may initially incur additional costs to support a shift from the current practice of formatting the data after they are received to requiring contractors to deliver the data in a new format.
To enable the receipt of metadata from contractors in a universally accessible format, NCSES will need to adopt an electronic data interchange (EDI) metadata transfer standard. The selection and adoption of a metadata transfer standard would be more effective if NCSES accomplished this through participation in a governmentwide initiative, such as the W3C contract template development or the SCOPE effort focused on the federal statistical agencies.
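As a rough illustration of what a machine-readable metadata deliverable might look like on receipt, the fragment below is loosely inspired by DDI-Codebook conventions but deliberately simplified: the element names and survey title are hypothetical and are not the official DDI schema.

```python
# Illustrative sketch only: a simplified, DDI-like XML fragment (not the official
# DDI schema) showing how a standard metadata deliverable could be parsed on receipt.
import xml.etree.ElementTree as ET

deliverable = """
<codeBook>
  <stdyDscr><titl>Hypothetical Survey of Example Doctorates</titl></stdyDscr>
  <dataDscr>
    <var name="degree_year"><labl>Year doctorate was awarded</labl></var>
    <var name="field"><labl>Field of degree</labl></var>
  </dataDscr>
</codeBook>
"""

root = ET.fromstring(deliverable)

# Pull the study title and the variable-level documentation out of the deliverable.
title = root.findtext("stdyDscr/titl")
variables = {v.get("name"): v.findtext("labl") for v in root.iter("var")}

print(title)      # Hypothetical Survey of Example Doctorates
print(variables)  # {'degree_year': 'Year doctorate was awarded', 'field': 'Field of degree'}
```

The value of a shared standard is precisely that one such parser, written once, could validate and ingest deliverables from every contractor.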
UNDERSTANDING DATA USERS AND THEIR EMERGING NEEDS
NCSES is strongly committed to serving the needs of data users, but it has little evidence of how well it is meeting those needs. NCSES has made several notable attempts to gather this intelligence about user needs, but it does not have a formal, consistent, structured, and continuing program for doing so.
One problem for NCSES is that there are multiple levels of users for whom products must be developed. For the most part, outreach efforts have been addressed to primary users: researchers and analysts of research and development (R&D) expenditures and the R&D workforce in the federal agencies that support government science and engineering analysis, in academia, and in the private sector. (Several of these users gave presentations at the workshop.) However, the needs of secondary users (those who rely on NCSES products to understand and gauge the implications for programs, policy, and advocacy) and tertiary users (such as policy makers and librarians) are given less attention, mostly because outreach to these groups is so difficult.
It is incumbent on NCSES to consider the needs of all of these groups, and the technology platforms they use to access the data, as it considers the program of measurement and outreach discussed in this letter. NCSES could consider novel means of harvesting information about data use to analyze usage patterns, such as reviewing citations to NCSES data in publications, periodicals, and news items. Reaching out to document librarians and other secondary users by means of surveys or interviews would be another worthwhile initiative. One way to help ensure that the needs of secondary and tertiary data users are met is to direct outreach specifically to members of the media—those who re-release NCSES data and interpret them for the public.
Among the tools that NCSES has used to assess user needs, according to John Gawalt (NCSES program director for information and technology services at the time of the workshop), are a web statistics program (Urchin) that analyzes web server log files and reports traffic patterns and software (WebTrends) that collects and presents information about user behavior on the center’s website. With proper permissions and protections, NCSES is also contemplating using cookies to identify return users and increase the efficiency of filling data requests. It also plans to sponsor and field a customer survey to formally measure satisfaction.
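As a simple illustration of the kind of usage analysis such tools perform, the sketch below tallies page requests and repeat requesters from made-up web server log lines; the paths and addresses are hypothetical.

```python
# Illustrative sketch only: the log lines and paths are made up; real analysis
# would run over actual server logs, as tools such as Urchin or WebTrends do.
from collections import Counter

log_lines = [
    '10.0.0.1 - - [27/Oct/2010] "GET /statistics/seind/ HTTP/1.1" 200',
    '10.0.0.2 - - [27/Oct/2010] "GET /statistics/wmpd/ HTTP/1.1" 200',
    '10.0.0.1 - - [28/Oct/2010] "GET /statistics/seind/ HTTP/1.1" 200',
]

# Count requests per page to see which data products draw the most traffic.
page_hits = Counter(line.split('"')[1].split()[1] for line in log_lines)

# Count requests per client address to identify repeat requesters (return users).
requests_per_client = Counter(line.split()[0] for line in log_lines)

print(page_hits.most_common(1))        # [('/statistics/seind/', 2)]
print(requests_per_client["10.0.0.1"])  # 2
```

Aggregates like these, collected continuously, would give NCSES the structured evidence of user behavior that the panel found lacking.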
In seeking a model for outreach to users, NCSES could consider the very aggressive program of Statistics Canada, described at the workshop by panel member Diane Fournier. Statistics Canada uses a combination of online questionnaires and focus groups to assess user needs and the usability of its website, and it has used the resulting information to develop a profile of its users and of how they access the database. One advantage of this approach, although it is resource intensive, is the possibility of gathering use information from a wide range of users, both those who are knowledgeable and regular users and secondary and tertiary users who are less familiar with the data.
Another initiative that NCSES could undertake to better determine user needs is to renew the data workshops that it conducted for several years but has since discontinued. Those workshops brought together users and potential users of licensed data. The same approach could be useful for acclimating users to web-based data and for introducing frequent users to changes in data dissemination practices and procedures. Such data workshops would be a good way to learn both how knowledgeable data users use NCSES data and what concerns users have about the data.
RECOMMENDATION 3: The National Center for Science and Engineering Statistics should proceed with its plans to conduct a customer survey through use of an online questionnaire; should analyze patterns of data use by web users; and should consider reinstating the program of user workshops in order to educate users about the data and to learn about the needs of users in a structured way.
DATA ACCESSIBILITY
The panel heard from Judy Brewer (director of the Web Accessibility Initiative [WAI] at the W3C) on the issue of accessibility of information on the web. Her presentation and the discussion that followed the presentation raised several important issues.
The convention when considering web design for individuals with disabilities is to ensure that a site is accessible to those who are visually impaired. However, a much wider range of accessibility needs should be considered when developing websites and web applications. For example, a chart that is color coded may not be readily interpreted by someone with color blindness; multimedia files may not be accessible to someone who is deaf unless they are accompanied by transcripts; and someone with a cognitive disability, such as attention deficit disorder, may find websites that lack a clear and consistent organization difficult to navigate.
A range of guidance materials are available for developing accessible websites. Section 508 of the 1998 amendments to the U.S. Rehabilitation Act (29 U.S.C. 794d) governs the accessibility of electronic and information technology for people with disabilities, and specifies
the minimum standards for accessibility.4 Other standards include the guidelines of W3C’s Web Accessibility Initiative, which provide guidance and tools for a range of websites and applications. Even more significant, given the possibility of rich dynamic interaction with these data resources, is that W3C has also developed standards for access to dynamic content, with specific guidelines in four categories:
- accessible rich Internet applications—address accessibility of dynamic web content, such as content developed with Ajax, Dynamic HTML, or other such technologies;
- authoring tool accessibility guidelines—address the accessibility of the tools used to create websites;
- user agent accessibility guidelines—address assistive technology for web browsers and media players; and
- web content accessibility guidelines—address the information in a website, including text, images, forms, and sounds.
In addition to issues of the usability and navigability of websites and web applications, there are issues related to the use and navigation of the datasets. Tabular data formats can be difficult to understand for those who must use screen readers, and data that are not organized in a transparent or immediately understandable way may be of limited or no utility for users with cognitive disabilities.
Through this presentation and the panel’s subsequent discussion, it became clear that the accessibility of tabular data and data visualization is a research question. Although W3C has pioneered standards for accessibility of dynamic user interfaces, many other issues—including table navigation, navigation of large numeric datasets, and dynamic data visualization—raise computer-human interaction challenges that have been explored only peripherally. The issue of accessibility is a clear opportunity for NSF to partner with scientists with disabilities and those who work on interface design and so lead by example.
RECOMMENDATION 4: The National Science Foundation should sponsor research and development on accessible visualization and other means for exploring tabular data.
The panel recognizes that such a far-reaching initiative is almost certainly beyond the capability and resources of NCSES, and so our recommendation is to the National Science Foundation. A program of research and development might fit well into the portfolio of other NSF units and could be considered for funding under such programs as the Broadening Participation Research Initiation Grants in Engineering Program. The importance of promoting scientific visualization as an aid to usability and accessibility is a recognized component of NSF’s cyberinfrastructure vision5 and is inherent in its human-centered computing cluster initiatives. Although NSF is not in a position to pioneer tool development, identifying the appropriate research areas is something that will have a major impact in the field.
NCSES could share the knowledge gained through these research and development activities with the broader community of practice and so have a major impact on a wide range of potential users of NCSES’s data (as well as other statistical datasets), thus making the data available to potential users for whom they are now inaccessible.

4 A summary of Section 508 is available online at http://www.section508.gov/index.cfm?fuseAction=stdsSum [November 2010].

5 National Science Foundation, Cyberinfrastructure Vision for 21st Century Discovery, March 2007, p. 28.
As a final note, the panel wishes to commend NCSES for encouraging this review of its dissemination practices. This is a particularly opportune time for incorporating lessons learned in several U.S. and international initiatives that are designed to increase the transparency, usability, and accessibility of government data. Implementing the recommendations in this interim report will go a long way toward laying the basis for significant improvements in the way center data are disseminated.
Sincerely yours,
Kevin Novak,
Chair
Panel on Communicating NSF Science and Engineering Information to Data Users
TABLE 1 Summary of Selected Characteristics of NSF Science and Engineering Surveys
Survey | Current Contractor | Database Retrieval Tool/Publication | Availability of Microdata | Series Initiated/Archiving
Education of Scientists and Engineers | | | |
Survey of Earned Doctorates | National Opinion Research Center (NORC) | WebCASPAR; InfoBriefs; Science and Engineering Degrees; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering; Doctorate Recipients from United States Universities: Summary Report; Academic Institutional Profiles | Access to restricted microdata can be arranged through a licensing agreement; a secure data access facility/data enclave providing restricted microdata access is under development with NORC | 1957 (conducted annually, limited data available 1920–1956)
Survey of Graduate Students and Postdoctorates in Science and Engineering | RTI International | WebCASPAR; InfoBriefs; Graduate Students and Postdoctorates in Science and Engineering; Science and Engineering Indicators; Women, Minorities, and Persons With Disabilities in Science and Engineering; Academic Institutional Profiles | Data for the years 1972–2008 are available in a public-use file format | 1975 (conducted annually)
Science and Engineering Workforce | | | |
Survey of Doctorate Recipients | NORC | SESTAT; InfoBriefs; Characteristics of Doctoral Scientists and Engineers in the United States; Science and Engineering Indicators; Women, Minorities, and Persons With Disabilities in Science and Engineering; Science and Engineering State Profiles | Access to restricted data for researchers interested in analyzing microdata can be arranged through a licensing agreement | 1973 (conducted biennially)
National Survey of Recent College Graduates | Mathematica Policy Research, Inc., and Census Bureau | SESTAT; InfoBriefs; Characteristics of Recent Science and Engineering Graduates; Science and Engineering Indicators; Women, Minorities, and Persons With Disabilities in Science and Engineering | Access to restricted data for researchers interested in analyzing microdata can be arranged through a licensing agreement | 1976 (conducted biennially)
National Survey of College Graduates | Census Bureau | SESTAT; InfoBriefs; Science and Engineering Indicators; Women, Minorities, and Persons With Disabilities in Science and Engineering | Public-use data files are available upon request | 1962 (conducted biennially)
Research and Development Funding and Expenditures | | | |
Business Research and Development and Innovation Survey (BRDIS) | Census Bureau | IRIS; InfoBrief; Business and Industrial R&D; Science and Engineering Indicators; National Patterns of Research and Development Resources; Science and Engineering State Profiles | Census Research Data Centers | 1953 (conducted annually)
Survey of Federal Funds for Research and Development | Synectics for Management Decisions, Inc. | WebCASPAR; InfoBrief; Federal Funds for Research and Development; Science and Engineering State Profiles; Science and Engineering Indicators; National Patterns of Research and Development Resources | Data tables only | 1952 (conducted annually)
Survey of Federal Science and Engineering Support to Universities, Colleges, and Nonprofit Institutions | Synectics for Management Decisions, Inc. | WebCASPAR; InfoBrief; Federal Science and Engineering Support to Universities, Colleges, and Nonprofit Institutions; Science and Engineering State Profiles; Science and Engineering Indicators; National Patterns of Research and Development Resources | Data tables only | 1965 (conducted annually)
Survey of R&D Expenditures at Federally Funded R&D Centers | ICF Macro | WebCASPAR; InfoBrief; R&D Expenditures at Federally Funded R&D Centers; Academic Research and Development Expenditures; Science and Engineering Indicators; National Patterns of Research and Development Resources | Data tables only | 1965 (conducted annually)
Survey of Research and Development Expenditures at Universities and Colleges | ICF Macro | WebCASPAR; InfoBrief; Academic Research and Development Expenditures; Science and Engineering Indicators; National Patterns of Research and Development Resources; Science and Engineering State Profiles; Academic Institutional Profiles | Data tables (selected items) only | 1972 (conducted annually, limited data available for various years for 1954–1970)
Survey of State Research and Development Expenditures | Census Bureau | InfoBrief; State Government R&D Expenditures; Science and Engineering Indicators | Data tables only | 1964 (conducted occasionally)
Science and Engineering Research Facilities | | | |
Survey of Science and Engineering Research Facilities | RTI International | WebCASPAR; Scientific and Engineering Research Facilities; Science and Engineering Indicators | Microdata from this survey for the years 1988–2001 are not available | 1986 (conducted biennially)
Other Surveys | | | |
Survey of Public Attitudes Toward and Understanding of Science and Technology | NORC, via an S&T module on the General Social Survey | Science and Engineering Indicators | Data tables only | ICPSR, 1979–2001; CD, 1979–2004 (conducted biennially)