
Transparency in Statistical Information for the National Center for Science and Engineering Statistics and All Federal Statistical Agencies (2022)

Suggested Citation:"2 Current Practices for Documentation and Archiving in the Federal Statistical System." National Academies of Sciences, Engineering, and Medicine. 2022. Transparency in Statistical Information for the National Center for Science and Engineering Statistics and All Federal Statistical Agencies. Washington, DC: The National Academies Press. doi: 10.17226/26360.

2

Current Practices for Documentation and Archiving in the Federal Statistical System

THE COMPLEXITY AND SCIENTIFIC NATURE OF THE PRODUCTION OF OFFICIAL STATISTICS

The production of official statistics is more complicated than many users of federal statistics understand. It requires the collection and quality control of data from multiple sources, and it often uses models to produce valid estimates and to protect confidentiality. When statistics are based on collected survey data, their production can involve survey design (primary sampling units, stratification, clustering), design of the survey instrument, field instructions, various data treatments to prepare the raw data for input into the estimation program(s), an estimation methodology, and whatever validation is used to assess the quality of the resulting official statistics. If estimates instead use data from administrative records or digital trace data,1 other technical questions must be addressed.

More involved yet is the increasingly common situation in which survey data, administrative data, and digital trace data are used in some combination to exploit their inherent strengths and minimize their weaknesses. This can involve the use of models that need to be fitted and assessed. Whether the input data come from a survey, administrative records, or digital traces, each of these steps may require addressing various complex questions about how to collect the highest quality input data, or what methods will

___________________

1 Unfortunately, there is no commonly accepted term for the new sources of data that technology has made possible, which include Internet transaction data, social media data, and sensor data. Other terms that have been suggested for (some of) this new type of data are organic data and big data.


best treat the raw data for nonresponse or edit failures, or what model will produce the highest quality estimates. So, each stage—the collection of the input data, the processing of the input data, the computation of the estimates or indicators, the production of the associated metadata, and the evaluation of the resulting official statistics—can raise difficult issues that need careful resolution. And many of these steps can involve difficult scientific questions, some of which may have been addressed in the technical literature, and some of which may be novel and require innovative thinking.

To illustrate these issues, consider a survey conducted by the National Center for Science and Engineering Statistics (NCSES), the National Survey of College Graduates. In 2017, this survey, which is conducted every other year, sampled some 124,000 graduates, drawn from individuals who previously responded to the American Community Survey. The design of the study is a rotating panel: respondents are asked to respond to the survey four times, at 2-year intervals over a period of 8 years. Each survey year, a new cohort is recruited to serve in the panel, as the oldest cohort rotates out of the sample. Thus, each cross-sectional sample consists of new respondents and others who have responded in earlier waves. The rotating panel design is intended to capture changes in education and occupation circumstances in the U.S. adult population while limiting the participation burden on each respondent. An extensive questionnaire inquiring into educational experiences and their relation to occupational outcomes, particularly in the sciences and engineering, is administered in the first year of participation, with follow-up questionnaires completed in subsequent years. Three modes of data collection are employed to gather responses: Internet questionnaire, mail questionnaire, and telephone interview. Generally, the less expensive Internet or mail modes are tried first, followed by telephone if responses are not obtained with those self-administered methods. For the panelists in follow-up cohorts, the mode employed may be adjusted to use the one that achieved a response in the first wave. Unconditional incentive payments to encourage participation are made to some respondents whose response propensity is estimated to be lower and whose base weight (a measure of rarity in the sample) is high.

A number of post-survey adjustments are made to the resulting sample. Every sample case has a weight that reflects sample selection (the base weight), adjustment for unit nonresponse, trimming to eliminate extreme weights, raking to bring sampling weights in line with sample frame estimates, and procedures to convert the weights calculated for each sampling frame from which respondents are drawn into a final weight that reflects the population for the survey year. Imputation is also employed for item nonresponse. Many users of official statistics have an incomplete understanding not only of the complexity of the collection of the raw input data, but also of the measures taken to turn those input data into the final official estimates.
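The chain of weighting adjustments just described can be illustrated with a deliberately simplified sketch. Everything below is hypothetical: the weights, the single nonresponse cell, and the single control total are invented, and a real raking procedure iterates proportional fitting over several margins rather than scaling to one total.

```python
# Hypothetical sketch of a post-survey weighting pipeline:
# base weight -> nonresponse adjustment -> trimming -> calibration.
base_weights = [100.0, 250.0, 80.0, 900.0]   # inverse selection probabilities
responded = [True, True, False, True]

# 1. Nonresponse adjustment within one (illustrative) weighting cell:
#    respondents absorb the weight of the cell's nonrespondents.
cell_factor = sum(base_weights) / sum(b for b, r in zip(base_weights, responded) if r)
weights = [b * cell_factor for b, r in zip(base_weights, responded) if r]

# 2. Trim extreme weights, here to a cap of 3x the median weight.
cap = 3 * sorted(weights)[len(weights) // 2]
weights = [min(w, cap) for w in weights]

# 3. Scale so the weights hit a known sampling-frame control total
#    (one margin of what full raking would do iteratively over margins).
frame_total = 1500.0
scale = frame_total / sum(weights)
weights = [w * scale for w in weights]
```

After the final step the weights reproduce the frame control total, which is the essential property each calibration stage is designed to preserve.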


Simply put, official statistics are scientific estimates, which can sometimes be relatively complicated. This is another reason why greater transparency is important.

The committee’s impression is that the documentation of methods, the archiving of collected data, and the archiving of the resulting estimates in the federal statistical system are often viewed as having a low priority, and therefore that these actions are often carried out as an afterthought or not at all, depending on various time pressures. We strongly urge, instead, that these activities be treated as an integral part of the process of the development of a set of official statistics. They should be planned and accomplished alongside the work of constructing official estimates.

WHY TRANSPARENCY AND REPRODUCIBILITY ARE GOALS FOR NCSES AND THE FEDERAL STATISTICS SYSTEM

The benefits of transparency and reproducibility can be broken down into four dimensions: efficiency, innovation and progress, trust and confidence, and the value from the use of the data products themselves. These four dimensions are described in turn.

Efficiency

As mentioned, it is important to know how an agency produces a set of official statistics: that knowledge informs users about the quality of the methods used, and therefore the quality of the resulting official statistics, and it provides assurance that no special interests have played any role. Knowing how a set of official statistics is produced entails retaining details of the various data collection processes used, including the survey design, survey instruments, and field instructions if survey data are collected, or descriptions of any processes used to collect data from administrative sources and/or digital traces. Then, whatever computations are carried out in converting the raw input data into the inputs fed into the estimation methodology also need to be documented. This is done by retaining the commented code used in the workflow that takes the raw data, modifies it for nonresponse, failed edits, and other corrective actions, and carries out all needed data transformations. Finally, the commented code that describes how the estimation procedures are applied also needs to be retained, along with the relevant computing environment, as do the computations carried out to assess the variability of the official statistics and the efforts to validate them.
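As a hypothetical illustration of the kind of commented workflow code the paragraph envisions retaining, the fragment below applies one invented edit rule and one invented imputation rule; it does not depict any agency's actual treatments.

```python
# Hypothetical data-treatment step, retained as commented workflow code
# so the raw-to-input transformation can be rerun and reviewed later.
raw_hours = [40.0, 38.0, 200.0, None, 42.0]   # raw weekly-hours responses

# Edit rule: weekly hours above 168 are impossible, so treat as missing.
edited = [h if h is not None and h <= 168 else None for h in raw_hours]

# Item-nonresponse treatment: impute missing values with the cell mean
# (a single imputation cell here, purely for illustration).
observed = [h for h in edited if h is not None]
cell_mean = sum(observed) / len(observed)
treated = [h if h is not None else cell_mean for h in edited]
```

Code commented at this level of detail, stored with the raw and treated files, is what makes a later reproducibility check feasible.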

In all, as part of any organization that is involved with such complex processes, it is necessary for statistical agencies to retain, in an accessible way, detailed information as to how they accomplish their various tasks


in the production of a set of official statistics. By doing this, these agencies can accommodate changes to staff as discussed in Chapter 1. Also, such a careful description makes clear to staff that their work is subject to detailed review, and therefore encourages thoroughness and care. This reflects a practical business case for transparency.

In addition, as part of such detailed documentation, when changes are made it is important not only to make the associated changes in the documentation, but to make available an explanation as to why the changes needed to be made. The precise location of the changes in the software code, either for the workflow history of the data treatments or for estimations, also needs to be made available, depending on what has changed. That way, if the changes are later found to be the cause of any problems, it will be a simple matter to undo them, and possibly to understand where the logic behind the change went wrong. Many users are concerned when changes in procedures or estimation methods are made, since such changes can then be responsible for changes in the official statistics that are not due to real changes in the phenomena being measured. Therefore, it is important to collect such changes in a specific, permanent location online so that users looking at time series of official statistics can possibly determine when changes in the series are (and when they are not) due to such modifications.

Innovation and Progress

Science marches on, and the science underlying federal statistics is no different. It is very likely that we are currently in a period in which substantial changes are either occurring or will soon occur as a result of the greater cost of collecting survey data and the greater use of other sources for input data. There are many reasons for innovation and progress, including the dynamic nature of the population and businesses whose characteristics we need to measure, opportunities provided by advances in technology, and theoretical progress. Changes in society currently include a greater reluctance to participate in surveys. Technological changes include the growth of mixed survey-mode designs, which are jointly motivated by a greater reluctance to participate in surveys generally and the rising cost of the traditional face-to-face contact method. The mixed-mode designs often involve attempts to get responses by mail or through self-administered questionnaires on the Web, followed by more expensive telephone or face-to-face contacts for nonrespondents. These changes in design are accompanied by new software tools that facilitate collaboration in software development and make software code more easily available to the public, and by the development of standardized metadata, all of which are discussed in Chapters 4 and 5.

Examples of new methods that have resulted from theoretical developments include the application of model-assisted survey sampling (including


generalized regression estimators and calibration estimators), the application of Fay-Herriot models, often to combine information from disparate sources, and sample reuse methods for variance estimation. These improvements and innovations can arise either within an agency or through the external research community, and therefore both internal staff and external researchers need to be aware of the current approaches used for the production of a set of official statistics so that they can assess when a new idea might provide an improvement over the status quo.
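To make the Fay-Herriot idea concrete: the model's estimate for a small area is a variance-weighted compromise between the area's direct survey estimate and a regression ("synthetic") prediction. The sketch below uses invented numbers and assumes the variance components and the regression prediction have already been estimated.

```python
# Fay-Herriot shrinkage for a single small area (hypothetical numbers).
direct = 12.0      # direct survey estimate for the area
d_var = 4.0        # known sampling variance of the direct estimate
model_var = 2.0    # estimated between-area (model) variance
synthetic = 9.0    # regression prediction from auxiliary covariates

# Shrinkage factor: the noisier the direct estimate relative to the
# model, the more weight moves to the synthetic prediction.
gamma = model_var / (model_var + d_var)
estimate = gamma * direct + (1 - gamma) * synthetic
```

With these numbers gamma is 1/3, so the combined estimate sits closer to the synthetic value than to the noisy direct one, which is exactly how the model "combines information from disparate sources."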

Trust and Confidence

If statistical agencies wish the public to trust the estimates coming from the government, they must demonstrate that the techniques used for data collection and the methods applied to the collected data do not benefit any stakeholders and that they represent the current state of the art. This is accomplished by making their official statistics, their data collection techniques, and the estimation methods used to produce them available to the public, to the extent possible. Further, making available the data collection techniques and computations, along with the relevant input datasets (under secure arrangements to protect confidentiality), permits the validation of a set of official statistics by demonstrating its computational reproducibility.

Value from the Use of the Data Products

If a data product is not fully documented, for example, when an agency does not provide the weights necessary to support valid inferences (if the data are survey based with nonresponse) or does not provide the sampling variances (in cases where there is interest in combining estimates through use of a statistical model), then users are likely to produce substandard estimates or analyze the data inappropriately. As described in detail in Chapter 5, fully documented resources (e.g., datasets) are realized by providing metadata that conform to some predefined metadata schema, preferably a publicly accessible standard. The ultimate goal is to develop fully documented datasets through use of machine-actionable, organized, comparable, high-quality metadata tied to the relevant data, making all of it accessible through an application programming interface.
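A sketch of the variable-level, machine-actionable documentation such metadata might carry; the field names are purely illustrative and are not drawn from DDI, SDMX, or any agency's actual schema.

```python
# Hypothetical variable-level metadata record; in a real release these
# fields would conform to a published, machine-actionable standard.
variable_metadata = {
    "name": "salary",
    "label": "Annual salary from principal job",
    "units": "U.S. dollars",
    "universe": "Employed respondents",
    "weight_variable": "final_weight",    # needed for valid inference
    "sampling_variance_available": True,  # needed to combine with models
    "missing_codes": {98: "Refused", 99: "Don't know"},
}
```

The point of tying records like this to the data and exposing both through an application programming interface is that users, and their programs, can discover the weights, variances, and missing-data conventions without hunting through a codebook.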

If agencies exchange such documented information, users are able to resolve questions of the following nature: whether it is appropriate to utilize a tabulation that would require 10,000 cells, or what weights to provide for the survey-based estimates used in combination with a set of model-based estimates using administrative records in a small-area estimation model. For time-series analyses, the metadata could provide information on major


changes to the series, which would inform the user about whether indicator variables for those changes should be incorporated in such models.

Further, more generally, use of algorithms developed for the production of one data product may become useful for the production of another. Standardization of measurement for common concepts, where appropriate, could become possible. Understanding of the differences in estimates of phenomena measured in different surveys could be enhanced. Agencies would be better able to track the impact of changes in collection methods. Continuing surveys or other collections would be more easily and accurately carried out.

In sum, by enabling users to have a better understanding of the characteristics of a set of official statistics, agencies enable those statistics to gain in value by being put to use for additional purposes.

EXISTING REQUIREMENTS

The U.S. Office of Management and Budget (OMB) provides guidance to the agencies concerning what data collection methods (mainly for surveys) federal statistical programs are obligated to make public. Specifically, OMB’s Standards and Guidelines for Statistical Surveys (OMB, 2006) describes the process agencies must follow to receive permission from OMB to collect survey information. Once a survey is approved, OMB makes its information collection review publicly available at https://www.reginfo.gov/. Two parts of this OMB product, sections 7.3 and 7.4, directly address documentation of data collection methods.2 The federal statistical agencies do comply with this requirement, since OMB approval is a prerequisite for initiating data collection.

These standards say much less about statistical programs that are based on administrative data or digital trace data. Similarly, OMB provides less guidance with respect to the documentation of methodologies used either for data treatment or for estimation associated with the official statistics, including whether the associated software code should be made known to the public. (While these standards do not specify much regarding documentation of data treatment, some summary measures on data quality relative to nonresponse are required.)

With respect to guidance about archiving the input data and the official statistics, records schedules do exist informing agencies what is to be

___________________

2 It should be mentioned that sometimes changes are made after approval is granted, but in those instances OMB requires that a change request be made or, if substantive enough, approval of a new submission. In any case, revised plans are ultimately required by OMB within 3 years and therefore what is made public will soon be consistent with the survey data collection processes used.


retained at the National Archives. However, more comprehensive directives, often found in data management plans, specifying precisely what is to be retained, where it is to be located, for how long, under which metadata standard, and whether the public is to have access to it, are currently not produced.

Most federal statistical agencies have developed their own statistical standards and reporting guidelines in support of greater transparency in their programs. For example, the U.S. Census Bureau’s “Statistical Quality Standard F2—Providing Documentation to Support Transparency in Information Products” identifies topics that should be discussed or described in a survey’s technical documentation (see Chapter 1). Similarly, the National Center for Education Statistics (NCES) has developed statistical and reporting standards for data collection and statistical processing. NCES also stipulates standards for the dissemination of its data, including which topics should be reported in a statistical program’s technical documentation.

The approaches used by the statistical agencies to establish statistical and reporting standards vary—some agencies being very prescriptive and others less so. Also, we have not examined the extent to which these standards or guidelines at the individual agency level are implemented in practice.3

EXISTING PRACTICES

In most agencies, decisions on what to retain and what to release are made at the programmatic level. Agency standards, where they exist, provide only a minimum, which many programs choose to exceed. For example, at the 2017 workshop on transparency (NASEM, 2019a), representatives of the Longitudinal Employer-Household Dynamics program at the Census Bureau reported that they have the capability of recovering any input dataset, the associated program code, and the resulting official estimates in a few minutes.

What documentation is available very likely depends on a variety of factors, including the perceived importance of the program, whether the program is a one-time program providing estimates for a single occasion or a continuing program, the resources made available by the agency’s parent organization for the overall study or project, and whether the data collection and/or data treatment was carried out by a contractor, the Census Bureau, or in house by the agency. Also, the funding for documentation can be affected by issues that arise during data collection. For instance,

___________________

3 A complete listing of agency reporting guidelines and statistical standards is provided by the Interagency Council on Statistical Policy. A link to this is provided by NCES through the Federal Committee on Statistical Methodology (see https://nces.ed.gov/FCSM/policies.asp).


additional resources might be needed to obtain enough sample items, or the price of external data might increase. Agencies will often prioritize data collection and, as a result, take funds from reporting budgets. Also, agencies are often required to conduct the same data collection operation over time with a fixed budget, resulting in a similar problem of having to decide whether to collect more data or provide documentation.

To learn what was happening at the program level, we communicated with a small number of high-profile statistical programs by means of an informal questionnaire to see what documentation and archiving practices they used. We assumed that for lower-profile programs and for single-use circumstances the documentation and archiving practices were likely to be less complete. We sent this informal questionnaire to the program managers or high-level staff for 20 high-profile programs, and we received 11 responses, all from the programs listed in Box 2-1. We asked these individuals about their practices regarding the documentation of their data collections, the archiving of the resulting input datasets and resulting official statistics, and the documentation of statistical methods used to treat the data and to produce the indicated official statistics.

RESPONSES TO THE INFORMAL QUESTIONNAIRE

By requesting that program chiefs or other informed staff respond to these informal questionnaires, the committee was able to get a rough sense of what the current practice is, both internally and externally, regarding the documentation of the data collection methods, the data treatments used,


and the estimation routines employed. In addition, this informal questionnaire asked about the current policies regarding assessment of input data quality, where the input datasets and the resulting official estimates are stored, with what additional information about the data file, and for how long they are to be retained.

Internal Retention Policies

The questionnaire began by asking about internal retention policies, starting with whether input datasets were saved in some type of repository internal to the agency and, if so, for how long they were retained. Other questions concerned whether the program was a continuing program, whether every new data collection was saved in its own location, and whether metadata were provided along with the saved data files (and, if so, whether a metadata standard was used). Most respondents said that the input datasets were saved on internal directories and that they were generally not overwritten by subsequent data collections. Further, while these files were stored with relevant metadata, they said that typically no metadata standard was used.

Respondents said that care was taken to include everything needed to understand what the file contained and to support later reuse. The input data files retained were generally the edited files used as input to produce the associated official estimates, they said, but some programs retained the raw data files and other intermediate files prior to the production of the final input data. There often were internal guidelines for what and how to save but, respondents said, there were typically no guidelines on the use of metadata standards.4 Some agencies pointed out that there was no repository on their own (internal) Web page that stored all the data intended for later reuse. One agency used the Inter-University Consortium for Political and Social Research (ICPSR) as its archive, and other agencies used the National Archives and Records Administration to archive data.

Retention of Workflow History

The questionnaire next asked whether the workflow history of data treatments was retained somewhere, possibly as commented code, and whether the estimation methodology was likewise retained, again as commented code. It also asked whether there were technical memoranda that summarized the approaches used for these computations. Here the responses were fairly consistent. Commented code was retained internally and

___________________

4 An exception is the use of International Organization for Standardization standards—see https://www.iso.org/home.html.


stored in a way that supported later reuse, certainly in support of reproducibility studies with the above input datasets, but also for general research purposes. In addition, the agencies often made available a user’s guide and technical memoranda which were written to be accessible to a broad user community. Some respondents pointed out, however, that the comments and memoranda focused on what had been done with little discussion of how or why.

Some programs made extensive use of administrative records as input data. In those cases, it was typical for the disaggregated administrative records not to be released in any fashion, even in a restricted sense to a federal statistical research data center (FSRDC).

Use of Metadata Standards

The next two questions asked whether the Data Documentation Initiative (DDI), Statistical Data and Metadata Exchange (SDMX), or other metadata standards were used, and whether the decision to use such standards was based on a cost-benefit assessment. There were two mentions of current use of DDI and one of intended future use (in one case so that the data could be archived and disseminated at the ICPSR), and there was one mention of SDMX. Otherwise, the adoption of these standards or other metadata standards seems not to be typical of most federal statistical agencies.

One agency expressed concern that adopting these standards would obligate the agency to do so for the entire historical series to ensure backward compatibility. This is something that many agencies could be faced with. Our view is that backward compatibility should be achieved over time, but the focus should be on moving newer data to more open standards. We also note that when standards are open, contributions of this kind can also come from the user community. This is the approach that the Consumer Expenditure Surveys are taking.

What Is Made Available to the Public

The questionnaire asked what was made available externally to the public. We asked whether input datasets were available in FSRDCs and, if so, what metadata were provided with the data and what standards those metadata satisfied. The use of any alternatives to an FSRDC was also explored, especially the use of public-use microdata sets. The release of public-use microdata sets was quite common, and these were often accompanied by user’s manuals indicating how to use such files for analytic purposes. Some agencies also make arrangements for the release of microdata to researchers in other secure environments.


Documentation of the survey design and the wording used in the survey instrument was always available, but sometimes only by request. Typically, considerable information was also available on the workflow history of the data treatments and the estimation methodology used. It was not uncommon for the commented code for either the workflow history or the estimation methodology to be made available as well. (There were even efforts to ensure that an external user could check on the computational reproducibility of some official estimates.) However, while there were some outstanding examples of accessible descriptions of the data treatments and estimations—for instance, the American Community Survey Design and Methodology (January 2014)—it is common for no such documents to be available.5

Concerning documentation of data quality, many agencies provide a coefficient of variation for the official statistics (when they are weighted aggregates from a survey), as well as the frequency of nonresponse. Many agencies conduct research on issues such as the data treatments used and variance estimation, and commonly make the results available as technical reports or white papers, sometimes published in the technical literature. Four journals in which such papers appear are Survey Methodology, the Journal of Official Statistics, the Journal of Survey Statistics and Methodology, and Public Opinion Quarterly.
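For a weighted survey estimate, the coefficient of variation reported as a data-quality measure is simply the estimated standard error divided by the estimate itself; the numbers below are made up for illustration.

```python
# Coefficient of variation (CV) as a relative data-quality measure.
estimate = 250_000.0        # weighted survey estimate
standard_error = 7_500.0    # e.g., from a replication (sample reuse) method
cv = standard_error / estimate
print(f"CV = {cv:.1%}")     # prints "CV = 3.0%"
```

Because it is unitless, the CV lets users compare the reliability of estimates of very different magnitudes across tables or surveys.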

Limitations of This Questionnaire

This informal information gathering has many limitations that must be conceded. To begin with, it involved a very small “sample” with considerable nonresponse. Further, some questions requested detailed responses, but respondents sometimes struggled to provide all the processes and details the questionnaire was meant to elicit. The wording of some questions made implicit assumptions about the data collection or the methods, which made the questions difficult for respondents to interpret in

___________________

5 The federal statistical agencies have written a number of excellent technical reports and handbooks for various official estimates that are based on the data collected from surveys. A key example of this is U.S. Census Bureau’s Current Population Survey Technical Paper (CPS TP77) (U.S. Census Bureau, 2019). Also, there have been efforts to provide contextual documentation for public release datasets, including documentation of the Public Use Microdata Sample Files for surveys such as the American Community Survey (https://www2.census.gov/programs-surveys/acs/tech_docs/pums/ACS2014_2018_PUMS_README.pdf). In addition, the statistical agencies have funded data quality assessments for surveys such as the Survey of Income and Program Participation (SIPP) and for the American Housing Survey (AHS)—for the SIPP, see Kalton (1998) and Jabine, King, and Petroni (1990). For AHS see Chakrabarty and Torres (1996). Further, the agencies have regularly published articles in the scientific literature informing those interested in the technical details of the methodology employed in various official estimates. Examples include Fay (1984); Dippo, Fay, and Morganstein (1984); Findley (2005); and Zieschang (1990).


some situations. For example, the questions were somewhat survey-centric and did not provide enough guidance for programs that rely extensively on nonsurvey data. Finally, we asked about machine-readable metadata standards only with regard to nonsurvey data, which was an oversight.

Given these caveats, in combination with our knowledge of OMB requirements, several existing guidelines about documentation (in particular, those of the Census Bureau), our examination of the Web pages for the same programs, and our direct knowledge of the documentation that certain programs release, we believe we have a reasonable idea of the range of documentation currently provided both internally and externally to an agency.

IMPLICATIONS OF INFORMAL QUESTIONNAIRE RESULTS

With respect to internal retention of input data, all agencies retain these data and the official estimates for at least a year. Some agencies keep such data for 40 or more years. (The Census Bureau keeps data for much longer than that.) Some agencies save many versions of the input data, including the raw responses and the file containing responses treated for edit failures and nonresponse; the most commonly saved internal version is probably the file used as final input to generate the official estimates.

With respect to the metadata provided with the saved input data files, many data programs use SAS files to record variable definitions and produce codebooks to indicate variable locations on the files. Two data programs use the Data Documentation Initiative (DDI) as their metadata standard, another is working toward implementing it, and one uses the Statistical Data and Metadata eXchange (SDMX). No programs reported evaluating the adoption of metadata standards through a cost-benefit analysis. Further, no programs attempt to indicate the quality of their administrative record data by providing measures such as the nonresponse rate or the rate of undercoverage.
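To illustrate what a machine-readable variable record adds over a PDF codebook, the sketch below shows a hypothetical, much-simplified record in the spirit of DDI's variable descriptions. The field names and the EMPSTAT variable are illustrative inventions only; actual DDI is an XML standard with its own element vocabulary.

```python
import json

# Hypothetical, simplified variable-level metadata record (illustrative only;
# real DDI records are XML and far richer).
variable = {
    "name": "EMPSTAT",
    "label": "Employment status last week",
    "type": "categorical",
    "location": "columns 12-13",  # position on a fixed-width file, as a codebook would list
    "categories": {"1": "Employed", "2": "Unemployed", "3": "Not in labor force"},
    "missing_values": ["9"],
}

# Because the record is structured rather than free text, a program can
# validate data files against it or generate documentation from it automatically.
print(json.dumps(variable, indent=2))
```

The practical payoff of a community-normed standard is exactly this machine-actionability: the same record can drive validation, codebook generation, and cross-agency data discovery without manual re-entry.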

The workflow history is generally available internally with commented code. The same is true of the estimation methodology. Therefore, the statistical programs can typically support computational reproducibility internally, at least for a short period of time.

Regarding access to input data for external users, because these data are almost always confidential, the only access external users have to disaggregated input survey data is inside a Federal Statistical Research Data Center (FSRDC) or other secure computing environment. Some agencies make public-use microdata samples available, and some metadata are saved with the input data, generally in the form of codebooks and other agency-specific procedures.

Regarding information on data collection methods, data treatments, and estimation methodologies provided to external users, there is often substantial information on the survey design used and the survey instrument.


This is unsurprising, since most of this information is also available in the information collection requests submitted to OMB for data collection clearance. In addition, considerable information is available on the workflow history and the estimation used in technical methods documents, handbooks, white papers, and the statistical research literature. Surprisingly, even the commented code associated with these methods is at times made available to the user community.

A last issue, though not addressed in our informal request for information, is the need to help the external user community recognize when the processes used for data collection have changed substantially: for example, when variables have different content or different locations on the data file, or when the cross-tabulations in published tables have changed. A location on each program's Web page listing all major changes implemented since the previous data collection would be very useful. This is sometimes done, but not uniformly.

In general, agencies provide guidance that appears to be survey-centric. They focus on details of the data collection and not on details of the data treatment or the estimation methodology. They provide little information on the extent to which the data are saved or archived, where they are saved or archived, and how to gain access to such data. They rarely use community-normed metadata standards.

There is considerable variability across programs, and what is available for lower-priority programs is likely to be less comprehensive than what is available for ongoing, highly visible programs. There is probably less transparency about the collection and use of administrative data or digital trace data for input into official statistics programs than for survey inputs. Agencies provide little guidance to their programs about saving or archiving (preserving) data, especially input data but also public, official statistics. They also provide limited guidance about the kind of metadata and paradata that should be produced and shared.

Finally, transparency requires accessibility, so it is important to determine how easy or difficult it is to find technical documentation. To check this, we looked at how far removed these documents were from the home page of each program. In most cases, there was a link on the program's home page labeled “Technical Documentation” (or a similar phrase), so these documents were often only a single click away from the landing page.

Conclusion 2.1: Documentation of data collection methods, data treatments, and estimation methods by federal statistical agencies, while in need of some improvement, is generally fairly complete with respect to what is available internally to an agency. The practice of archiving input datasets and official estimates varies greatly across agencies, and


as a result some data are not retained even internally for long periods of time. Externally, while the public sometimes can gain access even to the code for various methodological processes, agencies often do not provide accessible methodological summaries for nonspecialists. Further, access to input datasets using secure avenues varies substantially across agencies.

CHALLENGES THAT ARISE IN IMPLEMENTING TRANSPARENCY AND REPRODUCIBILITY

There are challenges and costs associated with increased transparency. To start, there is widespread agreement, though not universal practice, that agencies should create comprehensive, detailed documentation of all the steps used in producing a set of official statistics, for all the reasons given at the start of this chapter. The costs of transparency attributable to what is needed internally are therefore unavoidable. What is less agreed upon is what should be made publicly available. What is made public is not always, and should not always be, simply a subset of such internal documentation, since it is often desirable to provide information in a less technical and less detailed form to serve many types of users. As a result, there may be added costs in providing external documentation to the public. It should be emphasized that the following challenges to complete disclosure of data and methods affect only the pace of movement toward full disclosure, not the ultimate goal of full transparency.

Resources

Resources is our term for the staff time needed to prepare the documents and materials that inform the public about how a set of official statistics was produced. Beyond time, there is also an opportunity cost, since the people best suited to produce at least some of these materials are likely to be the most talented staff: people who could otherwise be devoting their scarce time to refining existing methodologies and developing new ones. At the same time, agencies realize that the subset of their user community interested in the details of data collection or the various computations is likely quite small. It is therefore not unreasonable for agencies to decide that the demands on their technical staff's scarce time can at times outweigh the needs of a small fraction of their user community.

The resources needed also include the time spent on emails and phone calls with expert users who wish to find out precisely what was done and why, in response to this additional degree of clarity (though it can be argued that in a world where good, transparent documentation


is available, the number of such calls is likely to be substantially reduced). But it is conceivable that the desire for additional transparency, especially for nontechnical users, might lose out to resource constraints.

Participant/Respondent Confidentiality

Input datasets often contain personally identifiable information (PII), which by law cannot be released to the general public. As a prime example, data from individuals in federal household surveys include PII and therefore cannot be shared unless care is taken to virtually eliminate the chance of a disclosure. Also, when federal statistical agencies use administrative data at the national or state level to develop official statistics, there are often laws forbidding any sharing of those data. For survey input data, often the only way to make input datasets available for analysis by the external research community is to do so in secure environments that provide comprehensive protection against disclosure. Such datasets are therefore provided only through federal statistical research data centers and comparable facilities; they are anonymized, and additional techniques, such as differential privacy, are applied to reduce any remaining risk of disclosure. Finally, any material researchers wish to take with them from these centers is carefully checked for disclosure risk. Even so, many types of data are not available even under such protections.

Manipulation of Estimates by Third Parties

The concern here is best communicated with an example. Suppose there is full transparency about how the consumer price index is produced. This price index is used for contract indexation, and some of its components can rely on responses from a small number of firms, including firms whose contracts are indexed to the measure. Such firms could intentionally misreport in order to manipulate the estimates. These concerns are not limited to the official statistics produced by the Bureau of Labor Statistics. Any policy-related intervention may need to keep certain parameters secret, at least while the intervention is ongoing. Therefore, if an agency is not careful, efforts to enhance transparency can provide information that can be used to compromise policy assessments or to disclose PII.


Transparency and Reproducibility When Third Parties Collect Data for a Statistical Agency

Complications also arise when multiple agencies collaborate or when an agency relies on a private-sector contractor. Many agencies rely on the Census Bureau for data collection. A sponsoring statistical agency does not receive the raw responses if the data are collected under Title 13 of the U.S. Code; instead, the Census Bureau provides official estimates computed from the raw data. This limits transparency and reproducibility and can undermine the sponsoring agency's ability to improve its own official estimates or serve its users. This situation, we believe, also creates a minor communications problem when the documentation for a set of official statistics is split between the sponsoring agency and the Census Bureau. Further, the official statistics are often transmitted to the agency in tabular form, so the agency cannot itself provide the raw responses to an FSRDC or to a researcher in response to a special request.

Similar issues arise when a private-sector contractor collects the data and/or produces the estimates, as the agency may receive estimates only in tabular form, complicating the assessment of data quality. Contractors may also view the software used for data treatments or estimation as proprietary and may be unwilling to share it.

This constraint on transparency is not, however, a necessary outcome. It is important for statistical agencies to realize, first, that openness does not preclude licensing: code can be provided under a noncommercial-use license, with commercial use contracted separately. Second, the agencies, not the contractors, control the language that goes into Requests for Proposals. If a prospective contractor chooses not to submit a bid because doing so would reveal what it considers proprietary software, then so be it. Third, openness is the point: even commercial software used in government contracts could be required to be made available to others at reasonable prices, which at a minimum would allow “black-box horse races” and other evaluations. Finally, there is no need to hide source code; it has long been standard practice for software companies to offer a special government-use license that does not make the software open source.

Further, the metadata that accompany what agencies receive from contractors are often in PDF form, which is difficult to process by machine. This raises another concern, because such contractors then serve, in effect, as the location where the raw data are saved.


Intellectual Property of Contractors and Commercial Entities

As just mentioned, when contractors assist with data treatments or data collection, some of what is carried out may be considered proprietary. This might include a technique for imputing nonresponse or a method for identifying the proper respondent within a business. Commercial entities can also sell datasets that might be used to construct target populations for sampling. In such cases, contractors and commercial vendors may provide their algorithms or data to federal statistical agencies for a specific use and for a limited period, during which they are not to be shared.

In each of the cases where an agency decides that full transparency is not feasible or not cost effective, the agency remains obligated to inform the public about what has not been released and why.

Conclusion 2.2: A foundational element of agencies that produce federal statistics is transparency of operations, methods, and results so that users can trust that federal statistical estimates are produced in an unbiased manner and understand their properties and how best to use them. The principle of transparency is reinforced in numerous reports, directives, and legislation, including the Foundations for Evidence-Based Policymaking Act of 2018 and the Federal Data Strategy.

Recommendation 2.1: Leadership at the Office of Management and Budget, the Interagency Council on Statistical Policy, the National Center for Science and Engineering Statistics, and all agencies that produce federal statistics should establish transparency of processes and methods as a high priority and continuously reinforce this priority to their staffs.




