VIEWS FROM AGENCIES
John Eltinge: Operational Transparency Internal to an Agency
John Eltinge (U.S. Census Bureau) viewed his role not only as a presenter, but as a question asker and framer; he wanted everyone in attendance to participate after his framing was finished. He wondered about policies and practices in current or previous statistical organizations. For those who have been involved with areas outside of official statistics, he hoped that they would bring in relevant insights from those areas as well.
Eltinge said that his focus would be on internal agency processes. His colleague at the Bureau, Ruth Ann Killion, would follow with a parallel focus on engagement and transparency working with the agency’s external partners.
Eltinge said his questions would relate to three different types of transparency: (1) relative to data sources, (2) relative to the methodology, and (3) relative to output data products. What is appropriate to have in terms of transparency, as well as reproducibility and replicability, in each of those areas? What are the predominant benefits, costs, and risks of transparency, reproducibility, and replicability? What is current practice, especially for different subsets of one’s user community that have different levels of interest and technical sophistication? Finally, what might be the priorities for enhancement of transparency, reproducibility, and replicability?
Turning to the first of the three areas, input data sources, Eltinge said they could be surveys, administrative records, or other, possibly
Internet-based sources. For transparency, one would have access to internally archived data, paradata, metadata, systems, and code. In addition, one would like to have quality assessments and error profiles, and possibly total survey error decompositions and analogous extensions for administrative sources. Finally, he wondered whether there are special issues with nonsurvey data sources. One possibility is a more complex supply chain management.
In response to several questions about the distinction between input and output data in various workflows, Eltinge said that input data are data that originate either inside or outside of one’s agency, to which something is done to produce an output. One participant said that since the Bureau of Economic Analysis (BEA) had become co-located with the U.S. Census Bureau, staff have been able to learn a lot more about the internal aspects of the input data from the U.S. Census Bureau that BEA uses in creating its data products. Previously, there had been some “black box” elements, even though BEA has a program to monitor its source data. With respect to BEA data, there is a data archive so that staff can look at previous vintages.
Another participant said that he was very impressed with what everyone heard about the UK’s Office for National Statistics and Statistics Canada. At BEA, he said, the agency is getting closer to what they are calling a unified system that is interconnected and internally transparent. Also, staff are no longer duplicating each other’s work: in the past, for example, there might have been more than one BEA group estimating automobile production.
Another participant said that from the perspective of a user and from talking to U.S. Census Bureau staff, he believes that information on imputation, editing procedures, and the like is not in a condition to be effectively understood by anyone, whether inside the agency or outside. One cannot just share the SAS code. This is important, he said, because for some kinds of data, particularly people’s income as reported on surveys, the level of imputation that has to be done is quite large. That fact can make a difference for the policies to which the data are applied if the imputation is done in a way that does not take account of relevant variables. He asked whether there is some way of automatically documenting such techniques, which would be helpful both inside and outside an agency.
Another participant offered information about sharing data from the OECD, which has a centralized warehouse that stores all of its data for dissemination. The problem is that OECD has far too many datasets, partly because it does not have the means for storing the structural metadata. For each dataset, OECD has to store the code list, but those lists are not really available. When analysts want to create a new dataset, they cannot easily see what already exists; they just go and create another dataset, he said. One of the things that OECD is doing to alleviate this problem is to create a structural metadata repository. Analysts are given the tools to access and
use the repository. He said that he thinks several organizations, including the International Monetary Fund and Eurostat, have gone a long way in this direction. Having a data dictionary or metadata dictionary allows statisticians to analyze what already exists instead of creating a new dataset.
Another participant said that he wanted to bring up the distinction between access to data and access to the process that describes the data treatment, which are somewhat separate. Just because one cannot access the data, it does not mean that one cannot learn about the imputation methodology, the imputation rates, etc. One is documentation to some extent, and one is the realization of that documentation as an instantiation in the code. He noted that there are many standards for these activities, but there is a lot of noncompliance because it is an enormous amount of effort to trace some of these more complex systems in terms of their total variability or other characteristics.
The participant then noted that he and Eltinge coauthored a paper some time ago that reports on such an exercise. The cost of doing it is still high, so more research may be necessary to bring the cost down, but one can still look at the conceptual problems behind it. He and Eltinge attempted to do this in the code—similar to what Rob Sienkiewicz described earlier (see Chapter 5)—to comingle these different levels of information within the code base so that they could work from a code base that contained the high-level description of what the imputations were, as well as the mathematical representation and the code representation. That is where the self-documenting code has moved to. There are many methods for doing this, he said, but not everyone is doing it because that still means somebody needs to fill things out in that framework. Given the production schedules at most agencies, documentation will take a back seat. In considering the cost of outputting good documentation, the participant said it is important to decide what is worth documenting. That does not suggest that quality profiles of everything will be possible tomorrow, but the agencies should have the ability to create a quality profile without needing a 10-year effort. He noted that this issue is separate from the various data access issues.
Sarah Henry said that when acquiring the data, it is important to think about what conversations one will be having with the agencies that are providing the data. Those conversations could be quite broad, given the many possible future uses of the data. The people providing the data need to be fully aware as to what might happen with it. Another participant wanted to expand on Henry’s point because it is key: data are not the design of a survey. Some of it will be speculation on who is going to want to use the data for what. The extreme example given is from the biomedical world, where almost all of the electronic medical record systems in the United States are nearly impossible to use for research purposes. There is still an encounter-based system with billing as a major driver. Putting together a
longitudinal database to study something is very difficult, even though all the data are there. The commercial firms that hold these data do not find it beneficial to worry about this aspect, so the system is not going to be easily used for that purpose.
Ruth Ann Killion turned the discussion back to what is happening at the agencies. In her current position at the U.S. Census Bureau, her responsibilities include knowledge sharing. The agency has set up a group of about 20 people to talk about editing and imputation across the agency’s silos. At the group’s first meeting, someone from the population division said that he had worked on three censuses and was doing the same job as the other people in the group but that they had never met. That situation is where they started from, Killion said. Through discussions, the group has often discovered that what they are doing is not so different from what others are doing. They understand there are just a few methods that are actually used and that there are tweaks to them that can improve their performance. Killion said that one has to plan for these conversations to take place.
Eric Rancourt said that at Statistics Canada, there is a mechanism for the acquisition of administrative data. The agency has a small team that is dedicated to negotiations for the acquisition of data because if the task is left to the subject-matter people, they will likely prioritize what they are interested in. The agency requires that whoever goes to negotiate has to team up with someone from the administrative data acquisition group who has a corporate view. For example, if the agency is going to acquire data on four specific variables, the agency might want other variables that are also related to that program. Whenever there is the real possibility that some data will be acquired, there is a mini-broadcast within Statistics Canada to the analysts of health, social, and economic data that these data are coming and that maybe someone might need them. Rancourt said that the agency has been doing this for administrative data and is starting to also do it for commercial data or data from other sources.
Another participant added that the discussion had highlighted the importance of collecting the right metadata and thinking early about reuse and data cleaning. To be efficient, official statistics agencies invest heavily in selective editing. Selectively editing to add the most value to the output produced at the time is complicated: if the data are reused for something else, the selective editing will not have been done in the most efficient way for that other output. It could be useful to think about that early on when employing selective editing; otherwise, when an agency receives administrative data, it might decide to edit everything and use the data for different purposes.
Ruth Ann Killion: Operational Transparency External to an Agency
Turning to the question of transparency to users external to an agency, Audris Mockus (member, steering committee, University of Tennessee) introduced Ruth Ann Killion, senior adviser for Business Transformation at the U.S. Census Bureau. Killion began by discussing how the U.S. Census Bureau shares data externally and what some of the implications are for transparency and reproducibility. One of the current methods that the agency uses is public-use microdata sample files, which have been provided for decades. They allow substantial additional analysis of data for a sample. To accommodate the fact that the data are the result of sampling, the Bureau provides sample weights. One can then reproduce statistics, or at least the means and variances. Re-identification has become more and more of an issue as people become more inventive and have more access to big data sources. That change has been critical for the U.S. Census Bureau. In a 2013 review, the Bureau discovered that the knowledge of someone’s participation in a survey increased the probability that the person could be found in one of the microdata files.
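The reproduction of means and variances from a weighted public-use file, as Killion describes, can be sketched as follows. All of the values and weights below are invented for illustration; they are not Bureau data.

```python
import numpy as np

# A hypothetical public-use microdata extract: reported values and the
# sample weights released with them (all numbers invented).
values  = np.array([52.0, 31.0, 47.5, 60.2, 28.9])
weights = np.array([120.0, 95.0, 110.0, 130.0, 90.0])

# Weighted mean: each record stands in for `weight` population units.
wmean = np.sum(weights * values) / np.sum(weights)

# Weighted variance of the values around the weighted mean.
wvar = np.sum(weights * (values - wmean) ** 2) / np.sum(weights)
```

Because the weights encode how many population units each sampled record represents, a user who applies them correctly can recover the published point estimates, which is the limited form of reproducibility that public-use files support.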
Killion added that another method used at the Bureau is to allow researchers to obtain special sworn status as a Bureau employee, which provides access to data dependent on the requirement that the user protect confidentiality and privacy. It is a relatively easy approval process, she said, and there is no excessive cost to obtaining this status. Once one has this status, the research is typically conducted onsite, although in recent years some have done some research offsite. However, the data stay at the agency. In addition, Killion stressed, one has access only to the edited, imputed data, not the raw data.
For on-site use, the U.S. Census Bureau has research data centers, which are quite expensive to set up and require one full-time employee per center. There are now 30 such centers, with a large increase over the last 5 years. The centers provide more data access for researchers because they have access to “real data,” including from multiple agencies. However, Killion noted, researchers do have to go through an extensive, clunky, bureaucratic process, probably with more than one agency, to have access to the data, but the process does allow for very tight information technology security controls, which is one of the major benefits of this approach. A more recent alternative for providing data access is the use of synthetic files. Killion noted that the Longitudinal Employer–Household Dynamics (LEHD) Program created one of the major ones (see Chapter 5). This approach allows for the construction of a dataset that preserves the covariance structure of the distribution of the raw data. However, she said, such a construct loses its value if one is interested in aspects other than functions of the first two moments: if one is trying to look at the skewness, which can be important for economic data, or the kurtosis, what one gets is not as good as the original dataset. Killion also noted that disclosure avoidance techniques can affect the underlying dataset. All of these factors affect reproducibility.
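Killion’s point about synthetic files can be illustrated with a minimal sketch (the distributions and parameters are invented): a synthetic release that matches only the first two moments reproduces the mean and variance of a skewed source variable but destroys its skewness.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "confidential" source: right-skewed values (e.g., incomes).
raw = rng.lognormal(mean=10.5, sigma=0.8, size=50_000)

# A naive synthetic release matching only the mean and variance.
synth = rng.normal(raw.mean(), raw.std(), size=raw.size)

def skew(x):
    """Standardized third moment (sample skewness)."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

# raw is strongly right-skewed; synth has skewness near zero, so any
# analysis depending on the third moment is misled by the release.
```

Real synthetic-data systems such as the LEHD’s are far more sophisticated than this, but the underlying limitation Killion describes, that only the moments the synthesizer was built to preserve survive, is the same.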
In summary, Killion said, there are several ways that the U.S. Census Bureau currently shares data. One can see their range of value to researchers and also the strictness with which the Bureau provides access. In terms of transparency, documentation is the hardest product to produce. Killion said that one of the reasons she believes this is true is that the Bureau does not hire technical writers who could produce such documentation.
In response to a question about whether Killion was talking about documentation of code or a technical paper, Killion responded that it is both, or it could be a specification. Code always exists, but if it was, say, written in Fortran 30 years ago, it is not the same thing as having documentation of what the code is and what it is supposed to do. She also noted that in terms of methodology, sometimes the details cannot be disclosed. For example, in the decennial census, there is an algorithm called the primary selection algorithm that is invoked when there is more than one unit response for a given address and the Bureau has to decide which people really live there. There is an editing process for this situation, but it cannot be shared with outsiders. Therefore, researchers cannot reproduce what is done. These situations limit transparency, she said.
Killion noted again that disclosure avoidance techniques intentionally perturb the data. One cannot reproduce the output from the input if such techniques have been applied. The final issue on transparency for the U.S. Census Bureau, she said, which was also mentioned earlier by Susan Offutt regarding the National Agricultural Statistics Service, is that when the Bureau constructs estimates, subjective judgment is included. When the Bureau observed that a given number is out of line, the analysts use judgment to change this number. That happens constantly at the U.S. Census Bureau, she said. Not all of the editing is automated or rule-based, which is a problem for transparency.
Killion then talked about some current research that is happening at the economic directorate at the U.S. Census Bureau, noting that she knows about this because of the knowledge-sharing group on editing and imputation discussed earlier. For economic data, there is a history of experts who edit records by hand because they really know a company: for example, the company said “X” last fall, so it could not be saying the same thing this month. She characterized this as ad hoc editing. There is a lot of it being done, though not all economic data are amenable to that approach. For example, the Annual Capital Expenditure Survey is such that one cannot look at past entries to make decisions about current capital expenditures.
Turning next to timeliness, Killion said that the Bureau has been asked whether it can shorten the time for production of data products. Everybody is interested in this. Can the Bureau make the process more reproducible? Can it be done without changing the estimates? Clearly, she said, surveys could increase the efficiency of their associated production processes, and there are decades of research at the U.S. Census Bureau showing that there is over-editing of data. Identifying and dealing with outliers is absolutely necessary to get good estimates, she acknowledged, but she suggested that the Bureau does not necessarily have to do all of the microdata clean-up that those ad hoc editors do. More timely, relevant data are essential, she said, so it would be useful if the Bureau adopted repeatable editing practices that can be automated and are therefore much quicker.
Killion described some experiments done with American Community Survey (ACS) data. The Bureau looked at quantities over time, editing in the absence of edit failures, the impact of editing, and modeling stopping points. Examining quantities over time involved looking at raw sums, estimates, standard errors, and the number of edit failures over time. For editing in the absence of failures, the Bureau wanted to understand the nature of edits that are made to adjust data but not to correct for edit failures. For the impact of editing, it wanted to quantify the impact of editing on estimates by North American Industry Classification System (NAICS) codes and edit type. For modeling stopping points, it wanted to model when to stop editing NAICS codes and switch resources to other NAICS codes. How long should one analyze the data to get to the point at which one decides that the result of further editing is not going to appreciably change what the final confidence interval is? Can that analysis be done more quickly?
Killion reported that the basic results from the experiment were that edits that do not address edit failures have very little impact on the estimates. That is, she said, the Bureau is spending months and months doing hand editing, and it is having virtually no impact on the estimates. The production of results could happen about 2 months earlier if one only did automated edits and dealt with outliers. At that point in the processing cycle, she said, the estimates of many, but not all, variables are stabilized within the confidence intervals. If the Bureau used only automated edits, it would increase others’ ability to reproduce what is being done. Also, she noted, there are new editing practices being introduced, some of which are related to big data. Those practices could increase the likelihood of estimates becoming stable much earlier in the data production process, which, she said, is really exciting.
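The stopping-point idea, stop editing once further passes no longer move the estimate appreciably relative to its confidence interval, can be sketched as follows. The estimate sequence and standard error are invented, and the rule is a simplification of whatever model the Bureau would actually fit.

```python
# Estimate of some total after successive editing passes (invented).
estimates  = [1050.0, 1012.0, 1004.0, 1003.1, 1002.9, 1002.85]
std_error  = 5.0
half_width = 1.96 * std_error  # 95% confidence-interval half-width

def stopping_pass(estimates, half_width, frac=0.1):
    """Index of the first pass whose change from the previous pass is
    small relative to the confidence-interval half-width."""
    for i in range(1, len(estimates)):
        if abs(estimates[i] - estimates[i - 1]) < frac * half_width:
            return i
    return len(estimates) - 1
```

With these numbers the rule fires at the fourth pass (index 3), after which further hand editing changes the estimate by far less than the interval’s width, which is the point at which resources could switch to other NAICS codes.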
Killion then talked about future research. First, the Bureau would like to determine the impact of edits to create a hierarchical editing system, and it would like to automate certain types of edits. In addition, it would like to build a machine-learning process for automating subjective edits. The
Bureau is currently doing research on the use of big data editing techniques, along with new data sources, to increase data accuracy while decreasing analyst burden, which may have a potential impact on the Bureau’s ability to do things more transparently. Finally, the Bureau is continuing to research stopping-point models so that editing can become more adaptive, she said.
Regarding reproducibility, Killion said that many issues have to be addressed in order to make reproducibility practical. First, the Bureau will have to make use only of automated editing. Second, it is very common to use random-number generators in editing and imputation routines, but this approach reduces the ability to reproduce outputs because one is using random terms. Finally, she said, disclosure avoidance techniques perturb the data and therefore also reduce the possibility of reproducibility.
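The random-number issue Killion raises has a standard mitigation: record and fix the generator seed, so that the same inputs always yield the same imputations. A minimal hot-deck sketch, with a hypothetical function and invented data:

```python
import random

def hot_deck_impute(values, donors, seed=12345):
    """Replace missing entries (None) with a randomly chosen donor.
    Fixing the seed makes the imputation exactly reproducible."""
    rng = random.Random(seed)  # local generator; no shared global state
    return [v if v is not None else rng.choice(donors) for v in values]

run1 = hot_deck_impute([1.0, None, 3.0, None], donors=[2.0, 2.5, 4.0])
run2 = hot_deck_impute([1.0, None, 3.0, None], donors=[2.0, 2.5, 4.0])
# run1 == run2: identical seeds give identical outputs.
```

Seeding makes the stochastic step repeatable without changing its statistical properties, though, as Killion notes, disclosure avoidance perturbations pose a separate and harder problem for reproducibility.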
Killion ended by raising a couple of basic questions: Should analysts expect to be able to reproduce publicly published federal data? and Are current techniques robust and transparent enough to satisfy the goals of the scientific method? She said that the goal of the scientific method is what is important here, rather than the letter of the law of the scientific method. She believed that at this point the Bureau’s techniques are not robust and transparent enough, but she said she hopes they can be moved in that direction.
Killion also noted that for federal data there are many checks and balances and many people are reviewing the processes. Given these layers of review, could the Bureau at some level claim that what is produced is indeed transparent or reproducible because it has been looked at so many times and in so many different ways? Can the Bureau improve its documentation enough? What does the Bureau need to change about how it operates in order to make further changes? What questions will the Bureau be asked? For example, she asked, will the Bureau discuss reproducibility, the data and the process, and the document quality or true transparency?
Bill Eddy (chair, steering committee, Carnegie Mellon University) said there are two major problems about disclosure. One is that the disclosure avoidance techniques that the U.S. Census Bureau uses are not optimal. The other is that the techniques are not implemented in a uniform way across the agency. They are not even inserted into processing at exactly the same step, he said. What is done now lessens the value of the data, including the public-use microdata, which are also distorted.
Killion responded that she stated earlier that the data are perturbed and, as a result, one cannot preserve the distributions. Connie Citro (CNSTAT) asked whether Killion was talking about the synthetic public-use microdata samples because the ones with which she is familiar do not perturb that much. Killion said that they are perturbed, but that information is not provided to users.
Eddy mentioned that researchers can come into the research data centers and carry out research on the methods that the Bureau currently uses to implement confidentiality protection, but they are not told that the data are already perturbed. Another participant said that this was not just the practice of the U.S. Census Bureau.
Another participant said that as a computer scientist, he was not sure why it was not possible to preserve a third or fourth moment in creating a synthetic dataset. It does not seem like a very hard thing to do. Maintaining correlations does become exponentially harder as there are more variables, but if one is talking about a single variable and a third moment, he did not see this as being difficult.
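One way to do what the participant describes for a single variable is method-of-moments fitting; for example, a shifted gamma distribution can match a target mean, variance, and (positive) skewness exactly. This is a sketch of the idea, not an agency method:

```python
import numpy as np

def shifted_gamma_sample(mean, var, skew, size, seed=0):
    """Sample from a shifted gamma whose first three moments match the
    targets (requires skew > 0). Gamma(k, theta) has mean k*theta,
    variance k*theta**2, and skewness 2/sqrt(k)."""
    k = 4.0 / skew ** 2          # solve 2/sqrt(k) = skew
    theta = np.sqrt(var / k)     # solve k*theta**2 = var
    shift = mean - k * theta     # solve k*theta + shift = mean
    rng = np.random.default_rng(seed)
    return rng.gamma(k, theta, size) + shift
```

Matching joint third moments across many variables is, as the participant concedes, much harder, since the number of cross-moments to preserve grows combinatorially with the number of variables.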
John Abowd (U.S. Census Bureau) referred to a paper he coauthored with Ian Schmutte1 that offered a very detailed assessment of the difficulty of preserving both the estimates and the inferential validity when one uses data that have been subjected to undocumented disclosure limitation methods. Furthermore, he said, there is the claim that documenting those methods would compromise the confidentiality of the underlying data even further, and so not even the parameters that would allow one to compute a posterior distribution are released. Abowd said that the paper also shows how to recover some of those parameters without using any private information, from datasets that had a similar frame but used different confidentiality protection mechanisms.
Abowd added that he and Schmutte are certainly not the first or the last people to document the failures of ad hoc statistical disclosure limitation. An earlier paper by Dinur and Nissim2 proved that if one publishes too many statistics that are too accurate from the same confidential database, one exposes the confidential microdata with near certainty. The amount of noise that has to be infused into the ensemble of the publications—not one at a time—to maintain confidentiality is of the order of √N, where N is the number of records in the database. Abowd said he knows of no demonstration that any statistical disclosure limitation system anywhere in the world by a statistical office infuses that much noise into anything.
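As a back-of-the-envelope illustration of the √N noise scale implied by the Dinur–Nissim result (the database sizes below are illustrative, not figures from the workshop):

```python
import math

# Order-of-magnitude noise scale sqrt(N) required across the ensemble
# of publications, for a few illustrative database sizes.
for n in [10_000, 1_000_000, 330_000_000]:
    print(f"N = {n:>11,}   noise scale ~ {math.sqrt(n):,.0f}")
```

For a census-scale database of hundreds of millions of records, the required noise across the full set of publications is on the order of tens of thousands, which is far more than traditional suppression and swapping techniques introduce.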
Abowd reported that the U.S. Census Bureau, through its Data Stewardship Executive Policy Committee, has committed itself to protect the 2020 Census with a formal approach to privacy, using record-level differential privacy. This approach involves a complete rewrite of the disclosure avoidance system for the 2020 Census; it is planned to be implemented in the 2018 end-to-end test for all of the test files that are released from that
1 Abowd, J.M., and Schmutte, I.M. (2015). Economic analysis and statistical disclosure limitation. Brookings Papers on Economic Activity. Available: https://www.brookings.edu/bpea-articles/economic-analysis-and-statistical-disclosure-limitation [January 2018].
2 Dinur, I. and Nissim, K. (2003). Revealing information while preserving privacy. Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Available: http://www.cse.psu.edu/~ads22/privacy598/papers/dn03.pdf [November 2018].
exercise. He said that those test files may be released only to those with special sworn status, although this has not yet been decided. It represents a new disclosure avoidance methodology. All of the parameters and every policy decision will be made public.
However, Abowd noted, this approach for the 2020 Census does not fix the past, and it will never support provision of public-use microdata like those that have been available. The 2003 database reconstruction theorem by Dinur and Nissim (2003) is the death knell for public-use microdata in the format that statistical agencies have published for decades. He said that the implications of that theorem are only sinking in very gradually. Its main implication is that there is an information or privacy loss budget that is associated with the publication of all of the tabulations and microdata from any survey, census, or other confidential input or integrated confidential inputs, such as the LEHD data that Sienkiewicz talked about earlier (see Chapter 5). This is a sea-change problem for statistical agencies. Abowd noted that there has not been any training at the Bureau for how to think about metering out the privacy loss budget in order to deliver accuracy on the outputs that are important to users. He added that there are no examples of the ways in which perturbation of the data for any reason—editing or imputation or disclosure limitation—affects their use.
Abowd continued with additional information for the 2020 Census. The key outputs from the census are the redistricting files, which are block-level summaries of the race and ethnicity of voting age people in the United States. By agreement with the Department of Justice, the U.S. Census Bureau will not perturb the enumeration of the voting age population. The Department of Justice agreed to allow the U.S. Census Bureau to determine appropriate levels of statistical disclosure limitation to apply to the characteristics in those block-level files, which in this case are race and ethnicity if voting age does not count as a characteristic. The U.S. Census Bureau published 2.6 billion tabulations related to these redistricting files in 2010; there are 63 basic binary variables for race alone. At the block level, the Bureau published at least 2.6 billion additional tabulations of sex by age in 5-year intervals. The Bureau then published a 10 percent public-use microsample that was subjected to statistical disclosure limitation techniques.
Bill Eddy said that there is just no getting around the issue that the executives at statistical agencies have to be educated about how to manage a privacy loss budget. If that is not done, he said, some computer science class someplace is going to take one of the publications and reproduce the microdata. Fortunately, Eddy said, the people he knows who know how to do this also know that it would create a serious reputation problem for the statistical agencies so they do not do it. However, he said, those people
do use their skills for companies like Netflix, which has a business-related need to reproduce data.
A participant said the confidential data could be useful if maintained in a curated format with no disclosure limitation perturbations and no edits and imputations. The confidential data are what agencies spend the taxpayers’ dollars to obtain. He said that the taxpayers also paid for the agencies to improve the data in ways that Killion described very well. But that value added is also subject to scrutiny in how it is used when researchers are given access to the data.
Eric Rancourt (Statistics Canada) said he found the discussion very interesting because in Canada there is increasing reflection as to what it means for data to be within an agency. Is it within Statistics Canada? Is it within the data centers? Is it within other departments of the Canadian federal family? They are trying to interpret the law in a more modern way when people talk about data centers and about employees. How can one interpret it so that it still makes sense, he asked. Canadians seem to be a bit more open about what confidentiality and privacy protection mean than what he has heard during this workshop. Maybe data with different levels of protection would be available to colleagues and other departments, which one might consider to be within an agency, while a more protected version would go to researchers, and an even more protected one would go to general users.
Rancourt noted that in terms of how to better document data, he and all of his colleagues have been asking themselves that question for decades. A few years ago, the chief statistician issued an order that evaluations of directors were going to include an objective about documentation of their programs. If they failed to meet that objective, it would affect whether they received their full at-risk (retained) pay at the end of the year. Within 2 years, he said, compliance in the area of documentation substantially increased. In response to a question about what “documentation” means in the previous statement, Rancourt said that he was referring to ensuring that all of the Statistics Canada surveys have metadata in the public database.
Sally Thompson (BEA) said that she did not envy the U.S. Census Bureau and all of the challenges it faces as such a large organization with so much data and so much to organize. She said that BEA has a small slice of those issues because the agency has its own survey program to collect data on services, trade, and foreign investment. She complimented Killion for her presentation but quibbled with her about ad hoc editing versus automatic editing. Thompson said she is a big fan of automated editing, which was implemented at BEA last year for some of the agency’s benchmark surveys. BEA is using BANFF, a Statistics Canada program, and there are others. BEA has done some studies to determine the threshold in terms of the impact on the statistics between ad hoc and automated editing. Since
BEA is dealing with multinational enterprise data, such editing would not be used for a company such as Exxon. This situation is different from demographic-based responses, which have more well-behaved distributions. Thompson said that a big challenge for BEA is to figure out where to draw the line on auto-editing. One reason that BEA has its own survey program, rather than relying on the U.S. Census Bureau, is because the agency has experts in these areas who can make judgments about what is reasonable and what is not.
Thompson noted that BEA also has a special sworn employee program. This program gives access to unedited and edited microdata to special sworn employees, and that work is then checked for disclosure. Basically, she said, the researchers using those data seem to want the edited data because unedited data can be a mess.
Another participant said that one thing that has struck him in observing the U.S. statistical system is what one could call a market failure. There are no commercial vendors with state-of-the-art edit and imputation programs; the people creating such programs are at various statistical agencies around the world. Maybe it would be better to collaborate so not everybody is trying to reinvent the wheel.
A participant noted that the former head of the Australian Bureau of Statistics launched a program based on a similar idea. It is now known as the High-Level Group for the Modernisation of Official Statistics, and its goal is to make sure that not all organizations build systems for all of the necessary data treatment steps. The hope is that the programs can be “plug and play.”
Another participant pointed out that the problem of what is called extract, transform, and load is a standard data management problem outside of statistical agencies. There might be standard commercial tools that can significantly help an analyst. The general idea is if someone is going to import data into a database, they are going to have to make adjustments, and what is done depends on who they are, where the data came from, what was wrong with the data, and how they need to change the data to fit. He said this is a well-recognized commercial need, and there are tools to help the process.
Audris Mockus offered a different perspective from his work looking at software development. About 25 years ago, he said, he started looking into how AT&T writes their source code. The company had 5,000 people writing some very large programs. AT&T used what is called a version-control system to keep track of each version of the code. With that system, Mockus said he could look and see exactly what each person was doing. At that time, one large program had been running for 15 years and there were 150 million lines of code. It has been 40 years now, and one can assume that the program is much larger. A very important thing he learned—which he had
not already heard in relation to products and documenting the products—is that documenting changes is actually more important than documenting the current version. There are very simple ways to do that. There are many edits and mixes of inputs from various sources, and so understanding what that code is producing is very important. Mockus said his message is very simple: to understand any nontrivial piece of code, one needs to know why the various changes were made, and that understanding is only possible if all changes to the code are documented.
Mockus offered a more detailed explanation of version control, which is essentially keeping track of changes to documents, data, or products. That is, it is not enough just to keep track of the current methods; instead, one can document each change. If a person makes a change, one can learn exactly what was done, and if the imputation (or other change) works differently after the change, one knows the origin of the difference. If, however, such a change is not documented, one does not know which fixes resulted in which differences. Once someone makes a change, someone else has to inspect it. The reason is not so much that other analysts would know what to do best; the reason is just to have somebody else look at it. The person may provide useful inputs that improve the code. But more importantly, it is done to spread knowledge and also to foster innovation because now somebody else can take that part of the code and modify it and so forth. In summary, he said, a person records every change to a product, and every change can be undone. A grouping of such changes is referred to as a release. That allows people to collaborate and work on different releases.
For an illustration, Mockus talked about a small study he had done to understand how much version control is used among various federal agencies. He used Google to see what had been done, and he created a version control repository where he stored the findings. If anybody wants to look at the data or the scripts he wrote, they can do so.3 When the repository is edited, one commits the change; the recorded changes are accordingly called commits. One can also create what is called a pull request. This is what people typically do in groups: a person says that she or he is making a change but does not want the change to go into the shared branch until somebody else takes a look and says it is okay.
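The mechanics Mockus described can be sketched in a few lines of Python: every change is recorded with who made it and why, and any change can be undone by restoring an earlier version. This toy store is illustrative only (real systems such as Git are far more capable), and the class and method names are invented for the example.

```python
import time

class VersionStore:
    """Toy version control: record every change and its rationale."""

    def __init__(self, initial=""):
        self.history = [{"content": initial, "author": "init",
                         "message": "initial version", "time": time.time()}]

    def commit(self, content, author, message):
        """Record a new version together with who changed it and why."""
        self.history.append({"content": content, "author": author,
                             "message": message, "time": time.time()})
        return len(self.history) - 1   # version number of the new commit

    def current(self):
        return self.history[-1]["content"]

    def log(self):
        """Answer Mockus's key question: why was each change made?"""
        return [(i, h["author"], h["message"])
                for i, h in enumerate(self.history)]

    def revert(self, version):
        """Undo: restore an earlier version as a new, documented commit."""
        old = self.history[version]["content"]
        return self.commit(old, "revert", f"revert to version {version}")

store = VersionStore("impute with mean")
store.commit("impute with median", "analyst1", "median is robust to outliers")
store.revert(0)   # the imputation behaves differently; trace and undo it
```

Because the revert is itself a commit, the log still shows that the median experiment happened and why it was abandoned, which is exactly the change-level documentation Mockus argued matters more than documenting the current version.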
Mockus then considered version control in how agencies operate. He started with an observation from Watts Humphrey, which was that some projects could deliver their new releases on time and with high quality and other projects could not. Moreover, Humphrey observed, there were some systematic differences between the management of these projects. From this observation, Humphrey created the idea of the capability maturity model: projects with more maturity are more predictable and can do their job in a more repeatable fashion. One of the first things Mockus said he observed is that the projects that used version control were more mature. Second, they used automation for everything. If one has a repeatable process, everything is encoded, and analysts will run tests whenever they make a change. If any of these tests fails, one knows exactly that something went wrong and why. As a result, he said, productivity grows.
Mockus said that another benefit of version control is that it is administrative data. If one has information about the code, one has all versions of the code and all versions of the data. Second, one has the reasons for all of the changes to the code. It turns out that one can analyze that information and use it to understand all sorts of things. For about 15 years, Mockus said, he and his coworkers produced reports on the state of the software in the company, in which they used essentially all version-control data from all of the company's projects, supported by some surveys and interviews, to estimate what was going on, who was doing badly, and what the company needed to do.
Mockus acknowledged that things are a little bit more demand-oriented these days, referring to open-source ecosystems. Much of what civilization now relies on is built on open-source software that is maintained by other people, sometimes paid, sometimes unpaid. This works because of the huge personal and corporate benefit from operating in an open manner. These open-source systems are complicated, and understanding them is important. But one can study them easily because they are totally open, totally transparent, and totally reproducible. A person can take any piece of the code that underlies infrastructure and even maintain it if he or she wants to do so.
Mockus said that he believes this approach will be used for all of the statistical applications within the next 20 years, and this will have to happen because it will improve productivity and quality. To achieve such gains, he said, researchers and analysts should make sure that every document and every piece of code undergoes version control, at least internally. And, if possible, he said, one should make it open and external. There will be a lot of feedback that one did not anticipate. As far as transparency, he thinks that the notion of transparency, reproducibility, and replicability that John Eltinge framed could take place if the pieces were ordered by their ease of implementation and their importance to maturity, as part of some sort of transparency or reproducibility maturity model.
In response to a question about using blockchain for capturing version control, Mockus said that blockchain is more about keeping track of what actually happens. He said that he does not think blockchain could be useful unless one is worried about someone tampering with the data. For situations in which there are internal controls and if one trusts the data
contributors, he does not see the benefits of blockchain. However, he said, one advantage would be that one can keep track of who inserted what and when; if everything is opened to the public, one would want to have this capability. Mockus said that, to a large extent, current version-control systems such as Git are like a blockchain because the content is addressable, so one can trace things back and cannot modify the content of the code without such modifications being detectable. From a documentation perspective, that might be necessary: if someone sees something off in the output of a program, one can see what changes caused that result, why they were implemented, and their properties.
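The content addressing that underlies this tamper-evidence can be sketched in a few lines of Python: Git identifies a file's contents by the SHA-1 hash of a header ("blob", the size, a null byte) followed by the raw bytes, so any modification changes the identifier. This is a minimal illustration of Git's blob-hashing scheme, not a description of any agency's practice.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the object id Git assigns to a file's contents.

    Git hashes the header "blob <size>" plus a null byte, followed by
    the raw bytes; this matches `git hash-object --stdin` for the
    same input.
    """
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

original = git_blob_id(b"impute with median\n")
tampered = git_blob_id(b"impute with mean\n")
# Any change to the content yields a different id, so silent
# modification of tracked code or data is detectable.
```

Because commits in turn hash the identifiers of the trees and blobs they contain, altering any historical content changes every later commit id, which is the blockchain-like property Mockus noted.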
A participant said that this type of reproducibility goes to the software level and the implementation level for computational results. He wanted to add a footnote that it is not a panacea for the statistical questions regarding different levels of reproducibility. For example, it does not neatly or naturally facilitate such analyses as the comparison of workflows or pipelines in chains of analysis. That is one thing to think about as a future evolution of how one stores versions of software.
Mockus agreed that what he talked about is not going to resolve all problems, and he offered a few more benefits of version control. It enables collaborative work and building on the work of others, he stressed. In a way, it improves quality, and that is a very important part of reproducibility. It also allows others to contribute. For example, if the code were public, he could go to the U.S. Census Bureau's Website and report that he tried some example, that it was doing the wrong thing, and that it would be useful if it were fixed. But he acknowledged that the questioner was right: it is not a panacea. He noted, however, that he meant more than just keeping track of things; he also meant reuse. The reason is that while the workshop is focused on federal statistics, the support for this activity originated in an interest in open science. Perhaps everyone might think about reproducibility in the broader sense that researchers want to create more knowledge.
VIEWS FROM ACADEMIA
Lars Vilhuber: Making Confidential Data Part of Reproducible Research
Turning to a perspective from academia, Lars Vilhuber (Cornell University) started by asserting that in the social and statistical sciences, replicability of research is an increasingly required part of research output, but confidential data, such as those curated by statistical agencies, present problems. The replicability of research using proprietary data is perceived as problematic at best, impossible at worst. He proceeded to enumerate the characteristics of a good replication archive: a permanent URL, the identification and broad availability of original data (ideally with provenance),
and transformed data (and the programs to transform it). Furthermore, Vilhuber said, the archive should provide the core analysis programs. Current practice, however, requires that materials be deposited at journals. The provision of the original data is optional, but transformed (or analysis) data are not. Analysis programs are almost always provided.
Vilhuber said that the requirement to provide the transformed data is an issue because he can point to any number of graphs showing that the use of confidential data or restricted access data is increasing, including in research conducted at his research institute. This increase might reflect the opportunities for some really interesting research, so that articles based on confidential data are cited more frequently. However, the replicability of such findings is put in doubt. In current practice, to be perceived as “replicable,” the data underlying the research may need to be deposited at journals, which naturally cannot happen with confidential data. Though social science archives are able to handle some confidential data, researchers do not avail themselves of this opportunity very often. He gave as examples the Inter-university Consortium for Political and Social Research at the University of Michigan, the Dataverse project at Harvard, and Zenodo at CERN in Switzerland, which are mostly free for all practical purposes. They can handle some restricted-access data for which researchers have to request access.
Vilhuber posed the question of why researchers do not use these archives, answering that it is hard to say. In some cases, he said, it is because a law prevents the data from leaving the official statistical agency because it is the only accepted custodian (e.g., in the United States). In addition to those legal restrictions, however, he said that he believes that the real reasons are often that the agreement on confidentiality is between one researcher and one firm and so it is nontransferable, the lawyers get cold feet, data access costs money, or an institution such as a school system has no funds to archive the data, and so the data are destroyed.
Vilhuber acknowledged that reproducibility with confidential data is hard. However, he argued that data that are held by federal or national statistical offices can alleviate some of the concerns. He stressed that most national statistical offices have well-managed archives for the data and can make the data available. In addition, these datasets are well documented, and there are excellent metadata and various other important attachments. Most often, he said, this is not true for data provisioned by the private sector.
For the data available at national statistical offices, Vilhuber noted that access to data is typically tightly controlled, both legally and technically. Researchers can be given access at a secure research center but only after submitting a proposal and gaining ultimate approval. Once researchers get access to the data, the release of research results is tightly controlled because any of those results may be subject to disclosure avoidance measures,
as are the original statistics. For each research project, in order to request the data, the researcher’s proposal has to identify precisely the data that are needed, possibly down to the variable level. Results that “go out the door” are very well identified because of the detail-rich process of documenting the release of results, a process called a “disclosure review request” (various agencies may have different terms). A researcher could show the analysis programs to be used in order to prove that she or he did not manipulate the data in a way that would permit disclosure in the results, thus providing a very detailed provenance description of the analysis results.
Vilhuber went on to argue that this tight control actually leads to data, code, and results being well documented. As he put it, “you need to document why you actually allowed someone in the door and gave them access to some data,” as well as why that same person was allowed to use the results created from the data.
Revisiting the characteristics of a good replication archive posited earlier, it turns out that a project proposal and the subsequent release process have many of the same attributes: (1) the project proposal specifies exactly the required data (though a permanent URL for such data may still pose a problem); (2) intermediate files, the original data, and model parameters are stored in well-defined locations; and (3) all programs are stored in well-defined locations.
As it turns out, in many national statistical offices, this information is already being collected in a disclosure review request (DRR). Every research project with publications has at least one DRR. In theory, the DRR captures the information necessary for a replication.
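The replication-relevant content of a DRR could, in principle, be represented as a structured record along the lines of the following Python sketch. The field names and example values are hypothetical; actual DRR forms vary by agency.

```python
from dataclasses import dataclass

@dataclass
class DisclosureReviewRequest:
    """Hypothetical DRR record capturing what a replication needs."""
    project_id: str
    datasets: list            # exact input data, ideally to the variable level
    programs: list            # analysis code stored in well-defined locations
    intermediate_files: list  # transformed data and model parameters
    outputs: list             # results requested for release
    approved: bool = False    # set after disclosure avoidance review

drr = DisclosureReviewRequest(
    project_id="RDC-2024-017",
    datasets=["establishment_panel_2012_2016: payroll, employment"],
    programs=["01_clean.do", "02_estimate.do"],
    intermediate_files=["analysis_sample.dta"],
    outputs=["table1.csv"],
)
```

Each field corresponds to one of the replication-archive attributes listed above, which is the sense in which the DRR already captures, as administrative data, the information a replicator would need.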
There remain issues that detract from replicability: (1) access, which involves long lead times and capacity constraints (size, distance, computing); (2) replication, since the information necessary for replication is not easily available given the absence of a permanent URL; and (3) documentation.
Vilhuber provided some suggestions for addressing these issues: (1) access—find a consensus among data custodians that replication is a legitimate justification for access (speed up access for replications); (2) transparency—provide assistance to users on how to cite data, describe processing, and prepare replication packages; and (3) leadership—encourage and support replication activities and view replication as part of standard operating procedures.
An additional problem is that researchers would not be able to find the data and know about their existence in the first place to request access. Researchers do not necessarily know where one can find all of the elements of these particular pieces. They cannot easily find strong guidance for crafting project proposals. They cannot always find whether metadata for data exist. Furthermore, there is no consolidated way to document the analysis programs that the researchers have actually used. And the results
published by the researchers may not agree with the results that were actually released by the agency.
Vilhuber drew an analogy to the broader topic of reusing administrative data. The data about the application and release process are, in fact, administrative data and are currently being collected. The national statistical agencies collect and curate all of the elements of a DRR of a research data center (RDC) researcher. He pointed out that this is only a theoretical proof that replication is feasible, because replication can take a long time. Why would a researcher or a student ever ask to perform a replication that is simply going to test some simple code if doing so takes 1 year of project development, including a proposal? And if many researchers actually did this, for instance as part of a graduate-level class, there would still be a capacity constraint: could the current access systems, such as the federal statistical RDCs, accommodate a large number of "replicators"? Nevertheless, some suggestions can come out of this kind of structure. Integrating replication into the objective function of the people managing the system would speed up the process of checking code. If writing code and documenting that code are to be part of the normal operations of the system, then incentives should be created so that doing so becomes natural. Right now, the ability to support replication or reproducibility is not part of data access mechanisms. Would many students subject themselves to getting clearance to access data by becoming special sworn employees? A more reasonable approach would be to allow requesting access specifically to check replicability, which is not currently a widely accepted notion. The overall goal is greater methodological transparency about microdata. Vilhuber ended by stating that, in terms of leadership going forward, researchers can see that they are an integral part of the whole system, but in order to support them, replication may need to be seen as a standard operating procedure.
H.V. Jagadish: What Does Reproducibility Mean for Federal Statistics? An Academic’s Perspective
H.V. Jagadish (University of Michigan) addressed the question of what reproducibility means for federal statistics. He stated that reproducibility is the ability to duplicate an entire analysis of an experiment or study, either by the same researcher or by someone else working independently. Two problems that arise in federal statistics are that human judgment is not reproducible and processing by defunct software is not (easily) reproducible. The first part of his talk was how to achieve reproducibility, which was followed by a discussion of the costs.
With respect to achieving reproducibility, the goal is that someone else be able to use (substantially) the same method to get the same results. This requires
that every step be documented precisely. Jagadish pointed out that it is easy to be vague about the specific process used and to make incorrect assumptions about what would be “obvious” to someone else, so one has to strive to include all details.
In computing research, the question is whether experimental results (e.g., the performance evaluation of a new algorithm) reported in a research paper can be reproduced. Assume that a volunteer "reproducibility committee" attempts reproduction and awards a merit badge to work that can be reproduced. Challenges for this include unstated code dependencies, unstated assumptions, and the like.
For survey data, paradata are recorded and reported. While not part of the headlines, these are often critical for scholarly study, for reconciliation of differences, and so on. Additional processing of survey data is also recorded, but spottily (e.g., manual error correction).
The most recent concern is that agencies are now collecting new types of data. Under the rubric of "big data," they are increasingly repurposing administrative, business, and other data to compute statistics. For such information sources, recording "paradata" becomes even more important. Jagadish said that the meanings of variables may subtly differ, so it is important to be precise. Unfortunately, in practice, the provided paradata and metadata may be limited. Also, these data sources are likely to involve much more processing to get them ready for use, he said. Therefore, figuring out what is happening through all of the computational steps becomes important. This is where the notion of computational provenance comes in; it is a fairly well-developed subfield. The idea is to associate metadata with data, and these metadata describe where and how the data came about. There are many different taxonomies of what one records as provenance. One top-level distinction that he finds useful is between dataset-level provenance and data item-level provenance. Dataset-level provenance records the workflow that created the dataset: one has a result dataset and also the computational workflow that produced it. At the data item level, one might have provenance for a particular datum in the result dataset, indicating how that individual item was derived and on what source data items it depends.
For statistical purposes, the data item-level provenance, which technically is more challenging, is probably not that important, meaning that researchers have to deal only with the easier dataset-level provenance.
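Dataset-level provenance of the kind Jagadish described can be sketched by attaching metadata to each derived dataset as it is produced, so the workflow that created any result can be reconstructed from the result itself. The function and field names below are illustrative and do not follow any particular provenance standard.

```python
import datetime

def derive(name, inputs, operation, params, compute):
    """Run one workflow step and attach provenance to its output."""
    data = compute(*(d["data"] for d in inputs))
    return {
        "name": name,
        "data": data,
        "provenance": {
            "operation": operation,          # what was computed
            "params": params,                # how it was computed
            "inputs": [(d["name"], d["provenance"]) for d in inputs],
            "timestamp": datetime.datetime.now(
                datetime.timezone.utc).isoformat(),
        },
    }

# A raw dataset with trivial provenance, then one cleaning step.
raw = {"name": "survey_raw", "data": [3, 1, 2, None],
       "provenance": {"operation": "collect", "params": {}, "inputs": []}}
clean = derive("survey_clean", [raw], "drop_missing", {},
               lambda xs: [x for x in xs if x is not None])
```

Because each output nests the provenance of its inputs, the full chain of operations back to the source data travels with the dataset. Note, as Jagadish observed, that this records where, what, and how, but not why a step was taken; the rationale must still be documented separately.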
He explained that recording provenance for workflows has been done for 20 years and that it is known how to place software in a version repository. However, managing dependencies among the pieces of software required for it to run can be complicated, especially if there are dependencies on the specific operating system that the software runs in. For non-software operations, which include manual edits in particular (and which should be minimized), one has to record all changes. In other words, editors must record exactly what was changed, not merely that an expert cleaned up some dirty data. Finally, one must retain access to the source data: the workflow tells what operational steps were applied, but those steps were applied to the source data, and without the source data one cannot work forward.
Jagadish pointed out that while this is known, there are challenges. Whereas recording and maintaining provenance is an easy problem, using provenance remains a hard one. A full provenance dump is usually too much for a user to deal with, so provenance exploration methods are the subject of current research. In addition, he noted that provenance records where the source data came from, what was computed, and how it was computed, but it does not indicate why. There are many assumptions behind everything that is done, and these assumptions do not get surfaced; they are not part of the standard procedure. When repurposed data or sources are used, the why becomes important precisely because of those unsurfaced assumptions. That is a big area of potential minefields for users.
Jagadish then discussed costs. There are the costs to the producer, but there are also the costs to the consumer. If the costs for reproduction are expensive, which is typical, they can be delegated. Reproduction also requires skill. Most consumers of scientific conclusions (or statistics) cannot afford reproduction and do not have the skill, but everyone should still care about reproducibility because that is part of the scientific method and it is part of why they trust whatever was said. As a result, all users will need to rely on experts to do the reproduction and verification.
Regarding national statistics, consumers cannot (often) be given access to source data due to privacy concerns. However, consumers will still need explicit documentation of metadata to support reproducibility. This is particularly true as new sources of data are used and new methodologies are introduced. Jagadish said that people do that again through delegation. If there are metadata, if the methodologies are all documented, they rely on somebody else to be able to do this.
Jagadish stressed that verifiability is central to trust. The most straightforward way to verify a national statistic is to reproduce it. Reproducibility requires adequate metadata and a log of all actions taken. Metadata and logging are critical for trustworthy national statistics. He used the analogy that he does not actually need to know how the sausage was made, but he needs to know that a food inspector was able to visit the factory at some point.
A participant asked Sarah Henry whether, in the national-level office that oversees statistical agencies that she had described, what had just been presented to ensure reproducibility was part of its work. Henry responded that they
have not yet gotten to that level of detail. She anticipated that someone might ask her that question, and she believes it needs to be an ambition. The question is how far can statisticians get to that ambition, bearing in mind the purpose of their work and the level of detail people need in order to use statistics to go about their decision making in their daily lives. She has never met a user that scrutinizes that level of detail. Are users not doing that exactly because they are expecting that the food inspector did that for them, or because they have never given it a second thought? Some plain English engagement could be useful with the user community, which needs to let statisticians know how far to go.
The same participant continued that some of this reproducibility testing takes place because they have or can bring the data in-house. This raises the general question in each country as to where this activity can be outsourced. How does one build trust and credibility in national statistics by creating a third-party entity that has the resources to do this? How much more effective would it be if someone else were able to do it in a definitive way?
Another participant said that not everything needs verification. People are used to the notion of having a sample audit; that is how the IRS works, which keeps people honest in terms of filing taxes.
Someone else said that the policy analysis community in Washington, DC, does a fair amount of this auditing and the Congressional Budget Office (CBO) is part of that. But it is also run by people who run other policy analyses, such as microsimulation models. Unfortunately, there is currently no mechanism for such a feedback loop. What often happens is the modelers give feedback (e.g., the total of income reported in a certain program does not match what the administrative totals say from the agency that actually paid out the money) and then they make their own adjustments to the data so that they can do what they think is the right thing for the policy analysis. There is some verification work that goes on, but it is not organized.
Henry said that something glossed over in her presentation was the role of international standards and measurement, which is very important. She presented the example of when her group was trying to estimate the number of people from Great Britain who come to visit the United States and how much they spend, and compare that to what the Office for National Statistics estimates for Americans who visit Great Britain. This is where one has to make use of various adjustments. There are all sorts of reasons why one may not have reproducibility. The metadata may be clear, but it may be that one country says that a person is not a foreign visitor until he or she is there for at least 2 days. If someone has just been in the airport, it may be treated differently. There is an art to this, and the issue of reproducibility makes her nervous because, at the end of the day, statisticians want people to have confidence in their statistics. The notion that making things as transparent as possible is going to improve confidence is arguable.
Another participant said he thinks there is a thread here that ties these positions together. Maybe a statistic has to be replicable, not necessarily replicated. The threat of replication is generally sufficient. If it is actually tested, publication bias could occur, which means that the result that suggests that there is a problem is the only one that will be published. The confirmation that it worked well is not going to be mentioned. There is a generic communication issue for this. His research group struggled with this on the academic side, he explained, and suggested that statisticians should encourage some publication of this in some forum just to make sure that the system actually works. If a researcher can replicate the unemployment number, what will happen with that result? That kind of activity can be explicitly pushed so that estimates of verification frequency can become known. The IRS does this, which is why people believe that it actually audits the tax files.
Another participant pointed out that there are different but related goals that have been discussed. One is about scientific progress and knowledge creation: having transparency and reproducibility allows science to build on previous results and allows experts to understand why one result differs from another. The other is legitimacy, that is, how the public and policy makers perceive what statisticians do. The field is clearly at a point at which there is a crisis around social statistics and analysis using social statistics. Being transparent will not solve that problem, but transparency, he acknowledged, gives statistical agencies some credibility and legitimacy that they do not have if it seems like they are hiding things. Furthermore, panelists and participants have talked mainly about reproducibility in terms of data, but they have also talked about reproducibility in terms of research results and the implications of data analysis, and these are sometimes different. Ensuring that research results are reproducible is different from the steps to ensure that the data are reproducible. Jagadish focused on understanding the provenance of the data, which is making the data reproducible. Jagadish said it is okay if statisticians do not do that perfectly; when participants talked about the Public Use Microdata Sample, it was mentioned that the edits made to the data often do make a difference, and those are at the datum level. It is much harder to make that process completely transparent, and it is certainly not reproducible by an outside party because agencies do not make those data available. What Vilhuber touched on was making research results reproducible. While that process can be involved, it is not as challenging as making the data reproducible. Jagadish's point was that there is software that tracks everything that is done with a dataset, which allows an agency to maintain the provenance of those data.
Creating metadata this way actually helps to solve the problem that David Barraclough from OECD was talking about, which is having datasets and not being able to find them. Having good metadata helps an agency keep track of its own data.
Finally, the discussion returned to the last couple of points that Vilhuber presented about the idea that the restricted data systems provide a useful model for how to make research results reproducible. It is possible to share the code and the metadata and the actual data, but one of the problems with the reproducibility of research results in academia is that users are often taking data out of context and attaching them to a research article. People cannot do that with data in the RDCs. Then there is the question about whether an agency has the ability to absorb new techniques and ideas. This issue is one of many that they have run into in the RDC Program when interacting with researchers.