A STATISTICAL AGENCY SHOULD STRIVE CONTINUALLY for the widest possible dissemination of the data it compiles, consistent with its obligations to protect confidentiality. Data should be disseminated in formats that are accessible and accompanied by documentation that is clear and complete. Dissemination should be timely, and information should be made readily available on an equal basis to all users. Agencies should have data curation policies and procedures in place so that data are preserved, fully documented, and accessible for use in future years.58
Planning for dissemination should be undertaken from the viewpoint that the public has contributed the data elements and paid for the data collection and processing. In return, the information should be accessible in ways that make it as useful as possible to the largest number of users—for decision making, program evaluation, scientific research, and public understanding.
An effective dissemination program comprises a wide range of elements:
- It should have an established publications policy, which describes, for a data collection program, the types of reports and other data releases to be made available, the formats to be used, the audience to be served, and the frequency of release.59
- It should have a variety of avenues for disseminating information about data availability and upcoming releases. Those avenues should be chosen to reach as broad a public as reasonably possible, including, but not limited to, an agency’s Internet website, conference exhibits and programs, newsletters and journals, email address lists, and social media and blogs. A statistical agency should also regularly communicate major findings to the media, which helps build the expectation of statistical agency releases without political interference or partisan spin.
- The public release of data should occur in a variety of forms (suitably processed to protect confidentiality), so that information can be accessed by users with varying skills and needs for retrieval and analysis. Useful data products include not only understandable maps, graphs, indicators, and tables on statistical agency websites, but also public-use microdata samples (PUMS) and other computer-readable files with richly detailed information.
- For data that are not publicly available, agencies should provide access for research and other statistical purposes through restricted modes that protect confidentiality, such as protected data enclaves and contractual licensing agreements.
- All data releases should be accompanied by careful and complete documentation or metadata, including explanatory material to assist users in appropriate interpretation (see Practice 4). For a complex database (such as a PUMS file), user training should be provided through webinars, online tutorials, and sessions at appropriate conferences.60
- The program should include archiving policies that guide which data to retain, where they are to be archived (with the National Archives and Records Administration, or an established archive maintained by an academic or other nonprofit institution, or both), and how they are to be accessible for future secondary analysis while protecting confidentiality.61
58 Data curation involves the management of data from collection and initial storage to archiving (or deletion should the data be deemed of no further use, e.g., a data file that represents an initial stage of processing). The purpose of data curation is to ensure that information can be reliably retrieved and understood by future users.
PUBLIC DATA PRODUCTS
Data release of aggregate statistics may take the form of regularly updated time series, cross-tabulations of aggregated characteristics of respondents, analytical reports, interactive maps and charts, and brief reports of key findings. Such products should be readily accessible through an agency’s website, which should also make available more detailed tabulations in formats that are downloadable from the website. Agencies should take care in designing their websites to make it as easy as possible for users to locate and access information, testing accessibility and usability with a range of users.
A relatively new way for agencies to expand public use of their aggregate statistics is by providing selected data through application programming interfaces (APIs) to developers who, in turn, build custom applications for the Internet, smartphones, and similar media. For example, the Census Bureau’s APIs include neighborhood population characteristics and county-level information on business activity.62
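To make the developer's side of such an API concrete, the following is a minimal sketch in Python of how an application might request aggregate statistics and reshape the response. The function names are hypothetical, and the dataset path (`acs/acs5`), the variable code (`B01001_001E`), and the assumed response layout (a header row followed by data rows) are illustrative assumptions that should be checked against the agency's current API documentation before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for illustration; confirm against the agency's documentation.
BASE_URL = "https://api.census.gov/data"


def build_query_url(year, dataset, variables, geography, within=None):
    """Assemble a query URL for one table of aggregate statistics."""
    params = [("get", ",".join(variables)), ("for", geography)]
    if within:
        params.append(("in", within))
    # Keep commas, colons, and asterisks readable instead of percent-encoding them.
    return f"{BASE_URL}/{year}/{dataset}?{urlencode(params, safe=',:*')}"


def rows_to_records(payload):
    """Convert a header row plus data rows into a list of dictionaries."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]


def fetch_records(url):
    """Download the JSON response and return it as a list of records."""
    with urlopen(url, timeout=30) as response:
        return rows_to_records(json.load(response))
```

A caller would build a URL with `build_query_url(2019, "acs/acs5", ["NAME", "B01001_001E"], "county:*", "state:06")` and pass it to `fetch_records`. Separating URL construction and parsing from the network call keeps the first two steps testable offline.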
Yet another form of dissemination involves access to individual-level microdata files, which make it possible to conduct in-depth research in ways that are not possible with aggregate data. PUMS files can be developed for general release. Such files contain data for samples of individual respondents that have been processed to protect confidentiality by deleting, aggregating, or modifying any information that might permit individual identification.63
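The three kinds of processing named above (deleting, aggregating, and modifying) can be illustrated with a small Python sketch. This is not any agency's actual procedure: the field names, the five-year age bands, and the income cap are hypothetical values chosen for illustration, and production PUMS files rely on a broader and more carefully evaluated set of methods.

```python
def protect_record(record, income_cap=150_000, age_band=5):
    """Return a protected copy of one respondent record.

    The field names and thresholds are illustrative assumptions only.
    """
    protected = dict(record)

    # Deleting: remove direct identifiers entirely.
    for field in ("name", "address"):
        protected.pop(field, None)

    # Aggregating: replace exact age with a coarser age band.
    low = (record["age"] // age_band) * age_band
    protected["age"] = f"{low}-{low + age_band - 1}"

    # Modifying: top-code income so extreme values do not stand out.
    protected["income"] = min(record["income"], income_cap)

    return protected
```

For example, a record with age 37 and income 400,000 would be released with age "35-39" and income 150,000, and without the name and address fields. The original record is left unchanged, so the unprotected file can be retained under restricted access.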
RESTRICTED DATA ACCESS
While honoring their obligation to be proactive in seeking ways to provide data to users, statistical agencies must be vigilant in their efforts to protect against disclosure of data obtained under a pledge of confidentiality (see Practices 7 and 8). The stunning improvements over the past three decades in computing speed, power, and storage capacity, the growing availability of information from a wide range of public and private sources on the Internet, and the increasing richness of statistical agency data collections have increased the risk that individually identifiable information can be obtained through reidentification of data thought to have been suitably protected (see Doyle et al., 2001; National Academies of Sciences, Engineering, and Medicine, 2017b:Ch. 5; National Research Council, 2003b, 2005b:Ch. 5). In response, statistical agencies may have to scale back the detail that is provided in PUMS files or other public data products.
As an alternative to public access, statistical agencies have pioneered several methods of restricted access. One method is to provide or arrange for a facility on the Internet to allow researchers to analyze restricted microdata to suit their purposes, with safeguards so that the researcher is not seeing the actual records and cannot obtain any output, such as too-detailed tabulations, that could identify individual respondents.64 A second method, pioneered by the National Center for Education Statistics (NCES), is to grant licenses to individual researchers to analyze restricted microdata at their own sites: such licenses require that the researchers agree to follow strict procedures for protecting confidentiality and accept liability for penalties if confidentiality is breached.65 A third method is to allow researchers to analyze restricted microdata at a secure site, such as one of the Federal Statistical Research Data Centers (FSRDCs) currently located at two dozen universities and research organizations around the country. The FSRDC network began as a Census Bureau initiative and now includes data from other agencies.66 Statistical agencies should continually seek to enlarge their suite of restricted access methods and, for each, to reduce as much as possible the cost, time, and burden of access for users.
63 For a review of methods for confidentiality protection of PUMS files, see Federal Committee on Statistical Methodology (2005).
64 The Data Enclave of NORC at the University of Chicago is such a facility: see http://www.norc.org/Research/Capabilities/Pages/data-enclave.aspx [April 2017]. It provides secure access by researchers to selected microdata sets of the Economic Research Service, the National Center for Science and Engineering Statistics, and several other federal agencies and private foundations. NCES provides similar functionality for access to its data sets: see, e.g., https://nces.ed.gov/datalab/ [April 2017].
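One safeguard a remote-analysis facility might apply to block too-detailed tabulations is a minimum cell size rule: any cell of a requested table that rests on fewer than a set number of respondents is withheld. The Python sketch below illustrates the idea only. The function name and the threshold of 10 are hypothetical, it is not a description of how any named facility works, and real output checking typically adds further steps, such as complementary suppression, that are not shown here.

```python
from collections import Counter


def safe_tabulation(records, keys, min_cell_size=10):
    """Cross-tabulate records by `keys`, withholding small cells.

    Cells with fewer than `min_cell_size` respondents are returned as None
    rather than as a count. The threshold is an illustrative assumption.
    """
    counts = Counter(tuple(record[key] for key in keys) for record in records)
    return {
        cell: (count if count >= min_cell_size else None)
        for cell, count in counts.items()
    }
```

The researcher receives the table returned by `safe_tabulation` and never the underlying records, which mirrors the separation between analysis and raw data described above.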