JOHN ABOWD: SUMMARIZING THE WORKSHOP SO FAR
The second day started with a summary of the workshop presented by John Abowd (U.S. Census Bureau). He asked Victoria Stodden, one of the reproducibility editors for the Journal of the American Statistical Association—Applications and Case Studies, and Maggie Levenstein, director of the Inter-university Consortium for Political and Social Research (ICPSR), to join him. Abowd started by stating that, with regard to transparency, reproducibility, and replicability, an easy target is the management of code libraries. The Research and Methodology Directorate at the U.S. Census Bureau houses the time-series programs used to seasonally adjust virtually every macroeconomic time series in the world. Brian Monsell manages this code base, and upon his retirement it will need to have been transformed into a modern version control system. The U.S. Census Bureau standard is GitHub,1 but there is enormous diversity of opinion among the people in the chain of command who must approve its use. The demonstration of the Longitudinal Employer–Household Dynamics (LEHD) Program the day before included what appeared to be a public GitHub page sponsored by the U.S. Census Bureau, but it was actually a page set up by a Bureau employee in public GitHub space. Abowd added that they are trying to establish exactly what the rules for GitHub are, and then the U.S. Census Bureau will set up an official GitHub space. Abowd continued that the next step is for the Bureau to be able to create digital object identifiers (DOIs). Levenstein answered that DOIs for artifact versions are best practice, but current practice falls short of that. Having permanent digital identifiers would be good both for versioned data and for the code that produces them.
1 Recently bought by Microsoft.
Abowd responded that they now post papers on the Web, and they have a new content management system that is database driven. However, if users do not use Google to search Census.gov, they are not going to find these papers. Abowd added that even when someone does find a paper, he or she cannot be sure exactly what was found.
Bill Eddy (chair, steering committee, Carnegie Mellon University) asked whether there is any internal resistance to this idea. Abowd answered that the problem is not internal resistance; it is that employees are not trained in digital curation. Archiving is necessary, but it is not sufficient. That is the library science side of information science, and the agency has extremely important contributions to make because it gradually transformed from ink on paper to bits on a medium, and it understands the process of moving the bits forward across generations of hardware, as well as the process of moving the metadata forward so that those bits remain interpretable. There are some experts at the U.S. Census Bureau, but not nearly enough to go around, and their expertise is tied to very specific systems; the system that Sienkiewicz talked about the day before got Vilhuber's expertise and that of two other contractors who understood how to build a metadata database and how to make an early version of a data link work.
Abowd added that another piece of this low-hanging fruit is the curation of code bases for confidential data analyses. For reasons discussed the day before, the U.S. Census Bureau is already curating a code base: if an external researcher working on an external project asks to have data released, there is a curated code base. A reasonable question is why this does not happen internally. The enforcement mechanism is going to be a unique identifier issued when a project undergoes a disclosure avoidance review. The Bureau is going to require that identifier before it will allow a working paper, whether internal or external, to be posted, and that will bring about internal compliance.
Some internal researchers have always had to work in the research data center (RDC) environment, but most internal researchers have not had to go through the formal clearance process. They were not obligated to document their code base, document the tables that they wanted released, or get a protocol number showing that their process was approved.
Stodden had a few thoughts. The narrowest scope mentioned is the use of GitHub, but at one point there were also SourceForge and code.google.com. Such platforms lose relevance over time, so one thing to be aware of is what is happening to these platforms external to the U.S. Census Bureau.
Levenstein mentioned code that produces datasets or reconciles them, as distinct from the analysis code that will be attached to working papers. The roadmap would identify these different types of code and the remedies appropriate to each. It may be easier, for example, to gather the analysis code that underlies working papers than the larger code bases that have been built up over decades around big data products.
Stodden continued that there are decisions to be made about when to snapshot code. If a user has this roadmap structure and is interested in the reproducibility of a result it produced, the user would snapshot the code and give it a DOI. The user may then want to step back up the chain of research steps and snapshot the code that produced the data that the analysis relies on, so that DOIs can be attached to those snapshots and associated with a final result that would be reproducible.
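Stodden's chain of snapshots and DOIs can be sketched in miniature. The sketch below is illustrative only: the `SnapshotRegistry` class, the `snap:` identifiers, and the content strings are hypothetical stand-ins for a real DOI-minting service and real artifacts.

```python
import hashlib

# A minimal registry: each snapshot gets a content-derived identifier
# (a stand-in for a DOI) and records the snapshots it depends on.
class SnapshotRegistry:
    def __init__(self):
        self.records = {}

    def snapshot(self, name, content, depends_on=()):
        digest = hashlib.sha256(content.encode()).hexdigest()[:12]
        ident = f"snap:{name}:{digest}"  # a real system would mint a DOI here
        self.records[ident] = {"name": name, "depends_on": list(depends_on)}
        return ident

    def provenance(self, ident):
        # Walk back up the chain of research steps from a final result.
        chain, stack = [], [ident]
        while stack:
            cur = stack.pop()
            chain.append(cur)
            stack.extend(self.records[cur]["depends_on"])
        return chain

reg = SnapshotRegistry()
data_id = reg.snapshot("source-data-build", "code that produced the data, v1")
code_id = reg.snapshot("analysis-code", "analysis code, v3", depends_on=[data_id])
result_id = reg.snapshot("table-2", "published estimates", depends_on=[code_id])
```

Walking the provenance of `result_id` recovers the analysis-code snapshot and, behind it, the snapshot of the code that built the data, which is exactly the backward chain Stodden describes.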
Levenstein said that she tends to think about things a bit differently. The workflow that the RDCs currently use to disclose results looks like a perfect process in some ways, but there are a couple of things to remember. It is not just output that is taken out; the code has to be disclosed as well. Abowd pointed out that the Bureau curates both the released and the unreleased versions internally. Levenstein agreed but noted that when people think about replication, taking that code outside of a Census environment means that it will not run, and that is very different from what people usually mean by replication. Again, it is important to go back to what the goal of transparency is. It is relatively easy to put the work of external researchers through this process, because the process exists to disclose it and researchers who work in the RDCs are used to doing so. But she thinks that the real benefit comes from, for example, curating the code for producing the data, and that the example of seasonality might not be low-hanging fruit.
Levenstein added that everyone is nervous about someone looking at their code and finding problems with it, but that is why transparency is a useful thing. People do not live forever, and how statisticians think about seasonality today is very different from how people thought about it 100 years ago. Thus, having transparency also makes it possible for there to be advances. Furthermore, it is in the production of data that transparency is extremely important.
Abowd said that he asked for the archived code from the 2010 Census that produced the publication tables, from the Census Unedited File through to the P.L. 94-171 redistricting data and Summary Files 1 and 2 (SF1/SF2), and he was surprised to find that this code does not exist. He is making sure this is not the case for the 2020 Census.
Levenstein said the easy changes that she thinks of in terms of what the research community wants (which is true for most of the statistical agencies) are having versions of survey instruments, data, and metadata that are
Data Documentation Initiative (DDI) compliant and have proper tags so that people can discover, use, and write code based on them. In terms of the legitimacy of federal statistics, it would be useful to be able to state that everybody in the U.S. Census Bureau knows how seasonal adjustments are produced and that these are not somebody's personal algorithm. In the age of mass production of data and federal statistics, we should not be using production techniques that are specific to an agency or researcher. That is where the most important investments should be.
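The kind of tagged, discoverable metadata record Levenstein describes can be sketched as follows. The element and attribute names here are simplified stand-ins in the spirit of DDI rather than the actual DDI schema, and the `EMPSTAT`/`CPS` values are hypothetical examples.

```python
import xml.etree.ElementTree as ET

# Build a simplified variable-level metadata record; the tags make the
# variable discoverable and give code enough context to use it correctly.
def variable_record(name, label, survey, version):
    var = ET.Element("var", {"name": name, "survey": survey, "version": version})
    ET.SubElement(var, "labl").text = label  # human-readable label
    return var

record = variable_record("EMPSTAT", "Employment status last week", "CPS", "2018.1")
xml_text = ET.tostring(record, encoding="unicode")
```

Because the record carries the survey name and a version, a researcher (or a crawler) can resolve exactly which vintage of the variable a piece of analysis code was written against.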
Abowd said, by way of good news, that the design of the 2020 Census forces his colleagues, who have been doing it the way just described with the metadata embodied in traditional documents, to transform this information into a code base. It is not a major change in census methodology, but it is in fact a watershed moment for the agency. There is a global data library, called a data lake, that contains the specifications and provides a general resource for each step in the process. The U.S. Census Bureau will not be able to get this ready for the 2018 end-to-end test, but it will soon afterward. That is a big step forward, Abowd said, because that code base will be curated. It already has a GitHub repository that the technical integrators are working from, so when his successor asks for the code base from the 2020 Census, there is going to be one.
Separate from the code that will produce census products and the internal code that might analyze data or produce working papers within the Bureau, there are external researchers who use census data. Thus, there may be a way to extend that pipeline, Stodden said: it could ensure reproducibility for results generated by external researchers who use agency data as an input to their analysis. Abowd said that external users are already required to document the contributions they make to the U.S. Census Bureau.
Levenstein said that there are 700 researchers working in the federal statistical RDCs, and every single one of them is required to produce benefits for the U.S. Census Bureau; because they are getting access to the data, they have a legal and moral obligation to do so. They also prepare technical memos about the data that they are using, and those technical memos get turned over to the Bureau. But there is very little capacity for the Bureau to absorb that knowledge. If a memo is about the longitudinal business database and it reaches the longitudinal business database administrator, then it gets to the right person. But for demographic surveys, and this is true for a lot of administrative data where there is little metadata, everyone would benefit if researchers systematically contributed to improving the metadata.
Levenstein said she thinks that the U.S. Census Bureau’s process or the federal statistical RDCs’ process could be a really good model for the research community where one cannot take any data out of the RDCs; all one can take out is the code. There is the code that generated the extract,
but when many manipulations of the variables are added, the resulting DOI is for the fully augmented dataset. Researchers do save intermediate datasets, but not in a way that is worth identifying. It is a better way of working for researchers in general. Levenstein said that if researchers are using the public-use SIPP, she would rather they associate their results not with a copy of a data extract from the SIPP, but with the code they used, taking the entire data file as input. They would like people to document the process of generating the results in a particular research article so that it can be replicated.
Audris Mockus mentioned some things that were implicit that he wanted to make explicit: virtualization and automation. Virtualization easily solves many of the problems previously mentioned. The approach creates a container within which the code compiles and can produce the expected results on an applicable test dataset. Sharing these containers is not much more difficult than sharing the code. Automation responds to the manual work described the previous day of the workshop. Automation is not only a good thing from a reproducibility perspective, it is also a good thing from a productivity perspective. Even though people are reluctant and it takes time to automate some of these manual tasks, Mockus believes that once one does, the reproducibility increases massively and there are many other benefits, such as increased quality, because iterations are shorter.
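Mockus's automation point can be illustrated with a minimal sketch: once each manual step is captured as a function, the whole sequence can be replayed identically on demand, which is what makes reruns reproducible and iterations short. The step names (`ingest`, `clean`, `tabulate`) are hypothetical stand-ins for real production tasks.

```python
# A minimal sketch of automating a formerly manual sequence of steps:
# each step is a plain function, and the pipeline replays them in order,
# so a rerun on the same input reproduces the same output.
def ingest(raw):
    # parse raw comma-separated values into numbers
    return [float(x) for x in raw.split(",")]

def clean(values):
    # drop obviously invalid (negative) entries
    return [v for v in values if v >= 0]

def tabulate(values):
    return {"n": len(values), "total": sum(values)}

PIPELINE = [ingest, clean, tabulate]

def run(raw):
    result = raw
    for step in PIPELINE:
        result = step(result)
    return result
```

Virtualization complements this: packaging such a pipeline in a container fixes the environment it runs in, so sharing the container is not much harder than sharing the code.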
Tom Louis referred to the RDC approach as catch and release: a user takes a dataset, creates an analysis dataset, uses it, and then releases it. He thinks it could be carried to extremes, where nothing is ever kept other than the original source, although that might be too challenging computationally. However, it helps protect privacy, since there are no analysis datasets that might eventually be assembled to produce some disclosure. Additionally, having only the original dataset and the code makes curation easier. As a result, there are many advantages, as long as the approach is not carried to the extreme.
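The catch-and-release idea can be sketched as follows, under the assumption that what is retained is the derivation code rather than the derived dataset. The microdata rows and the `make_analysis_dataset` helper are hypothetical illustrations, not an actual RDC workflow.

```python
# "Catch and release" sketched: the analysis dataset is never stored;
# only the code that derives it from the original source is kept, so the
# extract can be regenerated on demand inside the secure environment.
SOURCE = [  # stands in for a confidential microdata file
    {"id": 1, "age": 34, "wage": 52000},
    {"id": 2, "age": 29, "wage": 48000},
    {"id": 3, "age": 61, "wage": 75000},
]

def make_analysis_dataset(source):
    # The retained artifact is this function, not its output; note it also
    # drops the identifier and the raw wage, keeping only derived fields.
    return [{"age": r["age"], "high_wage": r["wage"] > 50000} for r in source]

first_run = make_analysis_dataset(SOURCE)
second_run = make_analysis_dataset(SOURCE)  # regenerated, not retrieved
```

Because the derivation is deterministic, regenerating the extract yields the same dataset every time, and there is no stored analysis file that could later be combined with others to produce a disclosure.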
Levenstein said that even for internal datasets for the Census, versioning helps. There is a version with the payer data and another version without them, and this allows one to know which data people have been analyzing when they produce results. Virtualization is a good idea for researchers working in a different kind of environment. Automating makes things easier, more efficient, and replicable. The alternative way to produce better metadata and improvements to data is crowdsourcing. Because agencies are using so much administrative data that does not come from a survey instrument, she thinks getting the free resources of the research community to help with this has a lot of potential, but it is exactly the opposite of automating it. She suggested creating a system of trusted data stewards who can participate in a crowdsourcing effort to improve administrative data. It may be necessary to think about how to build that in because otherwise there is too much data out there to handle within the federal statistical community.
Abowd also suggested that those who paid for data analyses conducted on restricted-access data could demand that the custodians of those data allow audits, funded either by the paying agency or by an independent body, that reproduce the published papers or some random subset of them. Such an experiment in reproducibility for external researchers' papers is feasible because researchers put the approval number for the release of their data on the disclaimer notice, and internal researchers will do so too once they have been issued one. Given that identifier, one could invite the audit.
Levenstein added that ICPSR has a criminal justice archive that contains mostly data from the Bureau of Justice Statistics, along with some data funded by the National Institute of Justice and the Office of Juvenile Justice and Delinquency Prevention. Most of the data are public, though some are restricted. The director of the criminal justice archive is now editing a special issue of the leading journal in the field that consists entirely of replication studies. During the workshop, speakers mentioned that it is hard to get people to do replication studies because replicating is not seen as interesting, yet one does not want to encourage people to publish only when they find problems.
Audris Mockus said that he thinks replication is quite interesting because in most cases interesting things are discovered. Publishing once replication is performed may be a little bit more challenging.
Sally Thompson asked if the problem for federal statistical products might be low confidence in the results of these studies. Abowd responded that political science journals would not publish papers written in the RDCs. Thompson asked whether the need is for certification for publication, or whether there really is a problem concerning the veracity of research published using census data. Abowd answered yes. Finally, Thompson said she was under the impression that the day before the participants were talking about federal statistics; while research published in academic journals is a part of that, should they instead be discussing audits of the production of the statistics themselves?
Abowd pointed out that the code base for production of the 2010 Census does not exist, and they have an auditor coming in to audit the production of summary files from the 2010 Census; that auditor has nothing to work with. In the long-range plans of the U.S. Census Bureau, central data repositories called data lakes will be established, with a common access protocol and a curated code base. The 2020 Census will have all of the files necessary so that an auditor could audit the sequence that produced the data released from the 2020 Census. In the meantime, the workshop has indicated some viable options. It is a serious issue that referees increasingly cannot access the data that were used; that is true both for official products and for research products downstream from them. It is not straightforward, but one could already figure out a resource chart for this and do it. Some production activities could be audited; the LEHD's production of quarterly workforce indicators could be audited back to the beginning.
Abowd continued that because the system Rob Sienkiewicz described can literally produce the publication data from any vintage, from its exact inputs and its exact code bases, it is ready to be audited. No other system at the U.S. Census Bureau has that capability. The American Community Survey has a very robust production system associated with it, but it has not been tested this way.
Thompson argued that the easy change, the replicability for internal purposes for all of the agencies, would be a great step. The challenge is that the old style required agencies to have their own Excel spreadsheets and their own databases and their own resident knowledge. Agencies are moving away from that and it is a good thing.
John Eltinge had two very brief comments. First, some of the earlier discussion advocated treating this cost as part of risk management. One way to think about this in a resource-constrained environment is to spell that out in greater depth; if documented, it could show that basic prudent management, along with the code curation and software engineering processes Abowd mentioned, is already customary practice in computer science. Large organizations that depend on computing operate this way and therefore already see the relevant costs, especially those involving risk management.
Eltinge continued that he liked Levenstein's comments about replicability studies often being perceived as uninteresting if they merely confirm a result. Perhaps that means people ought to spell out in greater detail not just a simple binary confirm/not confirm, but what is typically a much richer answer: for example, that there are differences in particular areas that may be attributed to particular factors, or that a result was a house effect related to data collection processes. That in turn makes an interesting contribution to the science, particularly in terms of getting the academic world and the broader research world to recognize the importance of such follow-up rather than focusing solely on a binary outcome of confirmed/not confirmed.
Levenstein agreed that is important, and not just for the research community; it is incumbent on agencies to teach the broader public about the research process, as well as how to think about data. “What do we think is wrong and what do we think that the implications of this might be for how we should think? That we learn from disagreement is important,” she noted.
Stodden made a small addition to Eltinge's first point. A 2017 National Academies study titled Fostering Integrity in Research2 includes two recommendations, both of which are associated with reproducibility. Recommendation 6 broadly advocates reproducibility as part of integrity in results and findings, and Recommendation 7 calls for more archiving support for reproducibility artifacts such as code and data.
2 National Academies of Sciences, Engineering, and Medicine. (2017). Fostering Integrity in Research. Washington, DC: The National Academies Press.
Ron Jarmin said that he has moved away from the RDC world and now runs one of the production areas of the U.S. Census Bureau. He emphasized that the cultural inertia against doing replication studies is not confined to academia; the idea will also meet with lots of resistance inside the places where the data are produced. The case with the decennial census that Abowd described is only possible because of one key retirement. The reason the code was bottled up is that a very small group of people were worried about disclosure avoidance and the primary selection algorithm. A number of business decisions were made to keep that information tightly held, which led to a code universe that was not best practice. Trading off best practice from a computer science perspective against business decisions made in a highly charged political environment is something to take into consideration, even though most current U.S. Census Bureau products do not have that aspect to them.
Jarmin continued that over the next few decades, the amount of data that the statistical agencies will produce that they are completely responsible for is going to be a vanishing share. Agencies already use lots of administrative records and are going to use more. So, large parts of the production function of statistical information are going to be outside of the control of a statistical agency and outside of the control of the end user. The practice of tracking what edits or changes are made to data since they were originally generated so that people know what is going on is going to be increasingly important.
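One minimal way to sketch the edit tracking Jarmin describes is an append-only log in which each entry is hash-chained to the previous one, so downstream users can both see and verify what has been done to the data since it was generated. The `log_edit` helper and the sample employer record are hypothetical, not an actual agency system.

```python
import hashlib
import json

# A minimal edit log: every change applied to a record after it was
# originally generated is appended with a hash chaining it to the previous
# entry, so the history of edits is visible and tamper-evident.
def log_edit(log, description, record_after):
    prev_hash = log[-1]["hash"] if log else "origin"
    payload = json.dumps(
        {"desc": description, "record": record_after, "prev": prev_hash},
        sort_keys=True,
    )
    entry = {
        "desc": description,
        "prev": prev_hash,
        "hash": hashlib.sha256(payload.encode()).hexdigest(),
    }
    log.append(entry)
    return log

log = []
record = {"employer": "ACME", "payroll": 100000}
log_edit(log, "received from state agency", record)
record["payroll"] = 98000
log_edit(log, "corrected payroll per follow-up", record)
```

Because each entry commits to its predecessor's hash, a reader can confirm that no edit was silently inserted or removed between the original administrative record and the version a statistical product was built from.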
Levenstein said that the day before, participants learned about the workflow of data coming in from different places. She thinks that the future includes lots of data coming in from many different places. Thinking about LEHD and the data they get from all of the different state agencies, it will be an exciting and fun challenge of the coming period.
H.V. Jagadish said that in his community, database information management, replication badging has become standard: it is voluntary, an author gets a badge, and that is it. A significant fraction of authors, still a minority, opt to supply the materials to get that badge, and there is no social pressure on the rest to comply.