
Federal Statistics, Multiple Data Sources, and Privacy Protection: Next Steps (2017)


3

Implications of Using Multiple Data Sources for Information Technology Infrastructure and Data Processing

Adopting and exploiting nontraditional sources of data for national statistics will require significant changes to the data collection and processing currently used by many statistical agencies. The design of computer systems to meet the increasing data processing demands is largely well understood, at least in the computing industry, though there remain challenges that continue to be studied in both industry and academia. In addition, there often are situation-specific issues, specific to the multisource system envisaged, that may not be covered by the accepted general solutions.

Many detailed texts describe the relevant design principles and corresponding performance tradeoffs in developing the envisaged computer systems (see, e.g., Kleppmann, 2017; Laudon and Laudon, 2017; Martin, 1981; Özsu and Valduriez, 2011). Here, we provide an overview of the design and implementation issues that can be expected to be encountered when developing information technology (IT) systems architectures for the proposed system. Our intent is not to provide solutions but, rather, to highlight considerations.

We begin with a brief overview of the IT issues that federal statistical agencies face with their current systems. We then review the nature of the architecture that will be needed by statistical agencies and for the panel’s recommended new entity. Broadly speaking, there is a choice between a centralized system and a distributed system. Because a centralized structure imposes prohibitive constraints, we focus on a distributed system and discuss various distributed configurations.

The chapter continues with a discussion of data processing issues. That is, we describe data acquisition, data cleaning and transformation, provenance, and reproducibility. We emphasize the existing and future quality control requirements of the individual statistical agencies and how these will be met.

The chapter concludes with two brief discussions of some considerations in transitioning existing systems toward the future environment and the implications for staffing. We note the implications of supplementing the existing systems not only in terms of architectures, but also in terms of staffing, both retraining and growth. Because a sudden shock would be difficult and is not needed, we discuss the evolution to a supplemented approach rather than a massive one-time change.

ISSUES FOR FEDERAL STATISTICAL AGENCY IT SYSTEMS

The decentralized nature of the U.S. federal statistical system requires that every statistical agency have its own IT system, both because of tradition and because of the laws that authorize the agencies. In recent years, with greater efforts toward centralizing IT systems within departments and the passage of the Federal Information Technology Acquisition Reform Act (P.L. 113-291), department chief information officers (CIOs) have a strong role in managing the information systems of all bureaus in their departments. However, the U.S. Office of Management and Budget has also issued guidance that CIOs are to work closely with their statistical agencies to meet statutory obligations to protect the confidentiality of their data and ensure the data are used only for statistical purposes (U.S. Office of Management and Budget, 2015).

This vertical organization and control of IT systems within departments means that individual statistical agencies cannot directly access each other’s systems or data. Even statistical agencies in the same department may not be able to access each other’s data. For example, the Bureau of Economic Analysis and the Census Bureau are both part of the Department of Commerce and were recently co-located at the Census Bureau’s headquarters building. However, they have completely separate IT systems and, given the different statutory protections and authorizations of their datasets, completely separate access. One senior manager described this situation as having a glass wall between the employees of the two agencies but a solid statutory brick wall between their datasets.

There were efforts a few years ago to create a statistical “community of practice” that could serve as a platform for statistical agencies to collaborate on common protocols and tools (Bianchi, 2011, p. 5):

[to] enhance the horizontal, functionally-based integration of IT resources among federal statistical agencies. A statistical enterprise data center would be mandated to foster creative approaches for collecting, storing, analyzing, and otherwise processing federal statistical data to meet statistical agencies’ missions—with significant cost savings—and to more efficiently feed data to Data.gov. The center would house federal and commercial statistical datasets; visualization and dissemination tools; data quality, interoperability and confidentiality tools; and statistical analytical applications and models for cloud-type access by all federal statistical agencies.

However, no specific funding or authorization was ever provided for these kinds of activities.

In most agencies, there are also organizational and programmatic silos for IT systems. It is not uncommon for statistical agencies to have separate systems for collecting and processing data for each of their survey programs, or a system may be shared by only a couple of related programs. However, there have been recent changes. The National Agricultural Statistics Service recently consolidated its own highly decentralized survey processing architecture from 46 field offices into a centralized system (see Nealon and Gleaton, 2013). The Census Bureau has embarked on a new census enterprise data collection and processing system in conjunction with the reengineering of the 2020 census: the goal is to attempt to reduce the more than 100 systems it operates for data collection and processing to a single unified approach.1 The Census Bureau is similarly seeking to streamline 30 different applications used to disseminate information into a unified approach by creating a new Center for Enterprise Dissemination Services and Consumer Innovation in the Census Bureau that will centrally disseminate information to application program interfaces as well as interactive web tools, data visualizations, and mapping tools.

The number and diversity of IT systems within and across federal statistical agencies will pose challenges as agencies move from being focused on processing a single survey or administrative data source to integrating and using data from multiple different sources. National statistical offices in other countries have been facing similar issues. Struijs et al. (2013) note that most statistics are produced on separate production lines, each with its own methodology and IT systems. Even countries with centralized statistical offices have been working recently to integrate their systems into an overall enterprise architecture because of the higher costs of developing and maintaining separate systems and the difficulties in combining and reusing data across systems when needed (see Borowik et al., 2012; Struijs et al., 2013).

As part of these modernization efforts, there have been international collaborations to develop a common metadata framework and a generic statistical business process model to describe the common processes that all organizations use in producing statistics (see Vale, 2009). A generic statistical information model has also been created to provide internationally agreed-on definitions for information objects (e.g., data, metadata, rules, parameters) that flow through the various processes in the production of statistical information. These frameworks facilitate communication across statistical offices and can help harmonize architectures and the sharing of statistical software across organizations, nationally and internationally (see Eltinge et al., 2013). The U.N. Economic Commission for Europe (2015) has also created the Common Statistical Production Architecture initiative to provide a reference architecture for the statistical industry to support the facilitation, sharing, and reuse of statistical services across and within statistical organizations. Adopting this common reference architecture would make it easier for organizations to standardize and combine components of statistical production, thus enabling the sharing of components across agencies or even countries.

___________________

1 See https://www.census.gov/library/video/cedcap_cedsci.html?ncid=edlinkushpmg00000313 [July 2017].

SYSTEM ARCHITECTURE

Traditionally, a computer system is designed as a single system, controlled by its owner and designed according to the owner’s choice. Databases may be stored and managed on this computer and access provided to others as needed. This access could be restricted to local access, that is, only to others who can physically visit a “safe room,” but more commonly access is provided across a network. Such access can be provided to selected authenticated parties or to the public.

When data from multiple data sources are aggregated into a database, one popular paradigm in the computing industry follows the centralized model described above: the traditional “data warehouse.” This was the expected structure of the National Data Center proposed many years ago (Kraus, 2013). In contrast, the panel believes that it is possible to obtain many of the benefits of a national data center without the privacy risks incurred by storing so much data in one place. Specifically, we would like to provide access to aggregated statistical data obtained through fusion of multiple sources, but not direct access to individual-level data (identifiable information about a person, household, or business), and to do so with careful attention to privacy (see Chapter 5 for a more detailed discussion). In this section, we look at architectural alternatives.

As the number of users, the sizes of databases, and the processing performed on the collected data are scaled up, it may no longer be feasible for a single centralized system to manage the load. In this case, a set of systems can be used in parallel to perform this task. It would still be a single centralized system in terms of the system architecture, even if implemented as a room full of machines.

Instead of being placed in a single machine room or data center, the set of machines could be distributed across multiple locations. Such distribution may be desirable for a variety of reasons, including proximity to users, resilience to a disastrous event, and availability and cost of space for a machine room. With a distributed physical structure, it would be necessary to decide whether to reflect that distribution in the data placement design. That is, the data could be partitioned across sites, with each site handling some of the data, or the data could be replicated so that multiple complete copies of the data are created, one at each site. A mix of the two approaches is also possible, with some popular data replicated at every site and most data held at only one site. The choice depends on various factors, including the desired performance requirements or objectives. Another decision to make with a distributed system is whether to expose the distribution at the logical level: should users know where the servers are located? Do they need to know? Often, but not always, the answer to these questions is “no.”
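
For illustration only, the following minimal Python sketch contrasts the two placement strategies just described; the site names are hypothetical, and a real system would route records over a network rather than with in-process functions.

```python
import hashlib

SITES = ["site-east", "site-west", "site-central"]  # hypothetical locations

def partition_site(record_key: str) -> str:
    """Hash partitioning: each record is stored at exactly one site."""
    digest = hashlib.sha256(record_key.encode("utf-8")).hexdigest()
    return SITES[int(digest, 16) % len(SITES)]

def replica_sites(record_key: str) -> list[str]:
    """Full replication: each record is stored at every site."""
    return list(SITES)

print(partition_site("establishment-00123"))  # a single site
print(replica_sites("establishment-00123"))   # all sites
```

A mixed design would apply the replication rule only to frequently requested data and the partitioning rule to everything else.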

Traditionally, businesses and other private-sector enterprises have developed their own data centers to meet their storage and processing needs. As in so many aspects of business, it sometimes makes sense to outsource this responsibility to a service provider that has particular expertise in this task. For data processing in particular, this outsourcing has been made especially easy through the “cloud.” The basic idea is that the data centers are owned and managed by service providers, which often share the same data center facilities across multiple enterprises. Enterprises rent needed capacity and services from the service provider. Arrangements vary greatly, from fixed capacity to variable on-demand plans, with various quality-of-service guarantees. The service contracted for could range from bare-bones compute and storage, through data management and web hosting, to sophisticated software capabilities.

Outsourcing responsibility for some tasks is not the same thing as transferring ownership. Even when an enterprise obtains services from a third-party service provider, it is still the owner of the data and responsible for all aspects of the data, including database design and data quality. The owner could continue to specify every aspect of the system. However, a looser federated design is possible, particularly when there are multiple sources of data: for example, instead of integrating all the data into a single warehouse, datasets could remain under the control of the providers of the original data or of intermediaries. A thin layer of software could provide users with the illusion of a centralized system while actually providing requested data from multiple places as needed (see Contreras and Reichman, 2015). In the case of derived data products, such as national statistics, the same principle would apply to the derived data creation process: in response to a user request, data could be combined from multiple owners at multiple locations and the combined data used to generate national statistics.
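
As a hedged illustration of such a “thin layer,” the Python sketch below presents callers with a single query function while actually fetching and merging partial records from independently controlled sources at request time; the source names and fields are hypothetical stand-ins, not actual agency systems.

```python
from typing import Callable

# Hypothetical members of the federation; in practice these would be network
# calls to systems owned and designed by the original data providers.
def query_tax_source(state: str) -> list[dict]:
    return [{"state": state, "payroll": 1_200_000.0}]

def query_survey_source(state: str) -> list[dict]:
    return [{"state": state, "employment": 5_400}]

SOURCES: list[Callable[[str], list[dict]]] = [query_tax_source, query_survey_source]

def federated_query(state: str) -> dict:
    """Give the caller the illusion of one system: fan the request out to every
    member source, then merge the partial results into a single record."""
    combined: dict = {"state": state}
    for source in SOURCES:
        for record in source(state):
            combined.update(record)
    return combined

print(federated_query("MD"))
```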

A key difference between a federated and a distributed architecture is in the logical control and ownership: in a federated system, each member of the federation designs and owns its own data; in a distributed architecture, there is a single owner in charge. This difference has an impact on data integration: if one party controls the design and structure of multiple databases, it can also determine the protocols required to combine them. However, if databases are independently designed and structured, the required translations are usually possible only through negotiated specifications and interfaces. Such negotiations are often cumbersome and can cause systems to become brittle and unresponsive to changes that may be necessary or desirable. At the same time, incompletely agreed-to standards and interfaces can become a barrier to data integration and can result in errors. Since in practice there are many scenarios in which one may encounter data from a system that does not adhere to desired or standard structures, there is a need to develop the ability to perform ad hoc integration.

As we described in Chapter 2, protocols for linking individual, household, establishment, and enterprise data records have been well developed by federal statistical agencies. Such handling of multiple datasets would be a key feature of the panel’s recommended new entity. If statistical systems use ad hoc integration technologies, care must be taken to validate results and manage any errors. If statistical systems use engineered integration technologies, they must develop processes for incrementally adapting integration rules as data sources evolve. The questions that need to be addressed include not only how to develop new mappings for a modified source data structure, but also how a statistical system will even know that the source has updated its data representation. Will the source system reliably convey information regarding updates to the statistical system? Will the statistical system perform some checks on the data supplied by the source system to validate assumptions regarding structure, representation, value encoding, and other characteristics before ingesting the data? The personnel responsible for performing such validation will need to have skills both in statistics and in computer science.
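
Such pre-ingest checks might look like the minimal sketch below, which validates assumptions about column structure and value encoding before a file is accepted; the column names, state codes, and rules are illustrative assumptions, not an agency specification.

```python
import csv

EXPECTED_COLUMNS = {"record_id", "state", "naics_code", "annual_payroll"}
VALID_STATES = {"AK", "AL", "AZ", "MD", "VA"}  # truncated for illustration

def validate_before_ingest(path: str) -> list[str]:
    """Return a list of problems found; ingest only if the list is empty."""
    problems: list[str] = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if set(reader.fieldnames or []) != EXPECTED_COLUMNS:
            problems.append(f"unexpected columns: {reader.fieldnames}")
            return problems  # structure changed at the source; stop here
        for i, row in enumerate(reader, start=1):
            if row["state"] not in VALID_STATES:
                problems.append(f"row {i}: unknown state code {row['state']!r}")
            if not row["naics_code"].isdigit():
                problems.append(f"row {i}: non-numeric NAICS code")
    return problems
```

A check like this can also serve as the signal that the source has changed its data representation, triggering a review of the integration rules rather than a silent ingest of misinterpreted data.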

The panel noted in its first report that data breaches and identity theft pose risks to the public and that a continuing challenge for federal statistical agencies is to produce data products that safeguard privacy. Even if strong access and data release practices are designed to satisfy privacy requirements, it is difficult to guarantee against a data breach. We discuss security issues and protecting privacy in more depth in Chapter 5, but we note here, in the context of systems design, that the privacy loss from a data breach can be greatly ameliorated by distributing the data among different places and making it difficult for an attacker to access all of those locations.

CONCLUSION 3-1 Moving to a paradigm of using multiple data sources requires a new and different information technology architecture than a paradigm based on a single data source. Federal statistical agencies will need to create research and production systems capable of using multiple, diverse data sources to create statistics.

CONCLUSION 3-2 A range of possible computing environments could enable use of multiple data sources for statistics. Federal statistical agencies will need to consider the governance, functionality, and flexibility of the system, as well as the implications for protecting privacy and addressing data providers’ concerns regarding privacy.

DATA PROCESSING ISSUES

Moving to a paradigm of integrating multiple data sources for federal statistics will necessitate a greater focus on data curation by federal statistical agencies, which requires the “processes and activities needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data” (Miller, 2014, p. 4). As noted in the panel’s first report, agencies are accustomed to using administrative data in a variety of ways to enhance their design, collection, and analysis of survey data, as well as to produce some statistics directly (see National Academies of Sciences, Engineering, and Medicine, 2017b). However, agencies generally directly collect much of the data they use to produce statistics, are accustomed to having a good deal of control over the design and collection of those data, and know what happens at each stage of collection, editing, imputation, and analysis. In the system we envisage, agencies will not have that control and may have limited knowledge or documentation of these processes for some of the data they acquire. It will be essential that federal statistical agencies, including the panel’s recommended new entity, carefully document all of the operations that they perform on datasets they acquire or access for federal statistics (see section, “Provenance”).

Data Acquisition

There are two main paradigms for software to obtain data from the source: “push” and “pull.” In a push paradigm, the data source pushes data to the statistical agency or other entity. This push could be periodic, say, once a month; it could be in response to an event occurring, such as the accumulation of 100 updates; or it could be on any other basis chosen by the data source, such as whenever the data source has spare computational and network bandwidth resources.

In a pull paradigm, the statistical agency or other entity pulls the data from the source when it needs the data. This pull could be based on the issuance of a service order; the pull could be periodic, just as in the case of a push. The pull can be whenever the agency or entity needs the data to perform some computation. In practice, the data source may not give free access to its data to the requesting (consumer) agency or entity. So a pull is typically implemented as a request from the consumer to the source, which the source then responds to. The key point is who controls the timing. However, a data pull can also be implemented without requiring the explicit cooperation of the data source: for example, the consumer system could scrape data from a website put up by the source.
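
A pull often amounts to no more than a scheduled request from the consumer to an interface the source has agreed to expose. The sketch below is a minimal illustration using only the Python standard library; the endpoint URL and query parameter are hypothetical.

```python
import json
import urllib.request

SOURCE_URL = "https://example.gov/api/updates"  # hypothetical endpoint

def pull_updates(since_version: int) -> list[dict]:
    """Consumer-initiated ('pull') transfer: the statistical agency decides
    when to ask; the source merely responds to the request."""
    with urllib.request.urlopen(f"{SOURCE_URL}?since={since_version}") as resp:
        return json.loads(resp.read().decode("utf-8"))

# A push would invert control: the source would send data to an agency
# endpoint on its own schedule, and the agency would only receive and queue it.
```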

Although data can be pulled without the explicit cooperation of the generating source, several issues regarding such a pull need to be addressed. First, there may be legal restrictions on the frequency or volume of access. Second, there may be no guarantee of continued access in the future. Third, there may be no guarantees regarding the quality of the data. Data quality is always of concern, and obtaining the data without coordination with the data generator further exacerbates the issue, since no explicit contractual guarantees are provided by the data generator to the data collector.

Another design parameter to consider is how each data transfer (whether push or pull) relates to what has been previously transferred. A transfer could be a complete refresh, meant to overwrite the previous data; it could be an addition, comprising only new data from the current period; or it could be a change log, comprising not only new data but also other changes (such as updates and deletes). In many uses of multiple data sources for the federal statistical system, the panel assumes that updates of data sources will be to update statistics. When data sources send updates, statistical agencies will need to keep track of the multiple versions of data. It is possible that various statistical products will have been computed using different versions of data. Reconciling these statistical results will require keeping track of specific versions used (see sections, “Provenance” and “Reproducibility”).
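
To make the distinction between transfer types concrete, the sketch below applies each of the three to a locally held copy and increments a version number so that any statistic can later be traced to the data version it used; the record layout and the convention of marking deletes with None are illustrative assumptions.

```python
def apply_transfer(current: dict, transfer: dict, kind: str, version: int) -> tuple[dict, int]:
    """current maps record_id -> record; returns the new snapshot and its version."""
    if kind == "full_refresh":        # overwrite everything previously held
        new = dict(transfer)
    elif kind == "addition":          # only new records from the current period
        new = {**current, **transfer}
    elif kind == "change_log":        # inserts and updates plus explicit deletes
        new = dict(current)
        for record_id, record in transfer.items():
            if record is None:        # a None value marks a delete in this sketch
                new.pop(record_id, None)
            else:
                new[record_id] = record
    else:
        raise ValueError(f"unknown transfer kind: {kind}")
    return new, version + 1
```

Retaining each (snapshot, version) pair, or enough change logs to reconstruct it, is what later allows two statistical products computed from different versions to be reconciled.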

Data Cleaning and Transformation

Generally, data that are obtained for statistical analyses for the first time will require statistical attention prior to analysis. Such attention may be required for many reasons, including: there could be recording errors in the original source; there could be mistakes in understanding or interpreting metadata; there could be errors in data linkage (see Chapters 2 and 6); and there could be missing data in fields needed for the computations. Regardless of the cause, such errors can be propagated and result in bad statistics if they are not corrected. As such, data cleaning is a critical function, which has to be performed when new data are received and possibly again after processing stages.

Data cleaning techniques range from the outright removal of data detected as erroneous to the replacement of or additions to data items through extrapolation, harmonization, or approximation. Rules might be imposed on the data, such as data domain ranges, averaging or mode selection, and comparison with and augmentation from external sources.

Data might also be enhanced or completed. One example is completing addresses by adding four-digit ZIP code suffixes to the originally provided five-digit codes or simply adding a missing city or state to the ZIP code provided. External information might be used to obtain the appropriate code suffixes.

The data might be harmonized. For example, state names might all be converted to two-letter state codes. Another example is the conversion of U.S. phone numbers, that is, stripping any characters other than the 10 digits of the number. In a quite different realm, for medical data, missing body temperatures might be assumed to be “normal” if only fever ratings are recorded. Regardless of the cleaning techniques used, caution is needed to ensure that the cleaning process itself does not introduce error or bias.
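
A minimal sketch of such harmonization and completion steps follows; the lookup tables are tiny illustrative placeholders, and a production system would draw them from authoritative reference files.

```python
import re

STATE_CODES = {"maryland": "MD", "virginia": "VA"}  # truncated lookup table

def harmonize_state(name: str) -> str:
    """Convert a spelled-out state name to its two-letter code when known."""
    return STATE_CODES.get(name.strip().lower(), name.strip())

def harmonize_phone(raw: str) -> str | None:
    """Strip everything but digits; keep only well-formed 10-digit numbers."""
    digits = re.sub(r"\D", "", raw)
    return digits if len(digits) == 10 else None

def complete_zip(zip5: str, suffix_lookup: dict[str, str]) -> str:
    """Add a ZIP+4 suffix from an external reference table when available."""
    return f"{zip5}-{suffix_lookup[zip5]}" if zip5 in suffix_lookup else zip5

print(harmonize_state(" Maryland "))              # MD
print(harmonize_phone("(301) 555-0123"))          # 3015550123
print(complete_zip("20001", {"20001": "0001"}))   # 20001-0001
```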

Federal statistical agencies are well acquainted with data cleaning and transformation in the context of survey data. A major difference between what they have been doing and what would be required in the envisaged new system is that they currently often build data cleaning checks into the acquisition stage so that they can collect more accurate information from the household or business respondent directly. For example, an Internet questionnaire or the interviewer’s computer-assisted personal interviewing (CAPI) instrument may have specified ranges built in that do not permit respondents to enter values that are out of range. Consistency checks across related items are also often built in to catch potential errors. However, what is done in the survey context often cannot be done in the same way with data acquired from other sources, which results in more work to clean the data and a shift in costs from data collection to preprocessing and preparing the data.
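
When such checks cannot be enforced in the collection instrument, they can still be expressed as edit rules applied after acquisition. The rules below are purely illustrative of the kind of range and consistency checks a CAPI instrument would apply at collection time.

```python
def edit_failures(record: dict) -> list[str]:
    """Apply illustrative range and consistency edits to one acquired record."""
    failures: list[str] = []
    if not (0 <= record.get("age", -1) <= 120):
        failures.append("age out of range")
    if record.get("hours_worked", 0) > 0 and record.get("employment_status") == "unemployed":
        failures.append("hours worked inconsistent with employment status")
    return failures

print(edit_failures({"age": 34, "hours_worked": 40, "employment_status": "unemployed"}))
```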

In the panel’s first report, we noted that there are a wide variety of structured, semi-structured, and unstructured data available that could have the potential to enhance federal statistics. Much of the data from these sources will not be available in the desired form and structure; it will need to be transformed. Usually, such a transformation is straightforward to perform, though it may not always be easy to specify the transformation correctly. Furthermore, it may not be possible to perform all required data cleaning at the time of acquisition. For example, if a selected data source has some critical missing values in some records, it may not be feasible to insist that these be completed, as could be done in a CAPI survey. Instead, these missing values may need to be imputed, and the most efficient way to perform such imputation is with additional context obtained through record linkage.
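
As a hedged illustration, the sketch below fills a missing field from a linked record when one is available and otherwise falls back to a simple donor value; the linkage key, field names, and the median fallback are hypothetical stand-ins for a proper imputation model.

```python
from statistics import median

def impute_income(record: dict, linked_admin: dict[str, dict], donors: list[float]) -> float:
    """Prefer the value from a linked administrative record; otherwise fall
    back to the median of donor values (a crude stand-in for a real model)."""
    if record.get("income") is not None:
        return record["income"]
    linked = linked_admin.get(record["person_id"])
    if linked and linked.get("reported_income") is not None:
        return linked["reported_income"]
    return median(donors)

print(impute_income({"person_id": "p17", "income": None},
                    {"p17": {"reported_income": 52_000.0}},
                    donors=[41_000.0, 48_500.0, 60_250.0]))
```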

Current surveys often collect detailed descriptions of jobs and industries, which coders then review and classify into the North American Industry Classification System or standard occupation codes. Federal statistical agencies have been developing and using sophisticated tools to streamline these kinds of coding tasks and will need to develop and apply similar tools with new data sources.
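
Tools of this kind are often built around supervised text classification. The toy sketch below, with a tiny made-up training set and illustrative occupation labels rather than actual SOC or NAICS codes, shows the general shape using scikit-learn; it is not a description of any agency’s coding system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data: free-text job descriptions paired with occupation labels.
descriptions = [
    "installs and repairs residential plumbing fixtures",
    "prepares financial statements and audits accounts",
    "teaches mathematics to high school students",
    "writes software and debugs web applications",
]
labels = ["plumber", "accountant", "teacher", "software developer"]

coder = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
coder.fit(descriptions, labels)

print(coder.predict(["audits corporate accounts and files tax reports"])[0])
```

In practice such models are trained on large sets of previously coded responses and used to suggest codes for human review rather than to assign them automatically.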

Provenance

There is a well-developed notion of metadata associated with surveys, including capturing and recording paradata—auxiliary information obtained during data collection that provides data about the data collection process itself. Similarly, there is a well-developed notion of reproducibility in software, achieved by recording the specific version of a program run and all parameters used. (Recording of the statistical methods used, such as imputation of missing values, removal of outliers, and the like, constitutes a subset of the issue of software reproducibility, discussed below.) In the computer science field, these notions are referred to under the concept of “provenance.”

Provenance, a term most often associated with a work of art, refers to its origin and provides confidence that it is not a fake. For data, provenance serves the same purpose. For data obtained from nontraditional sources, such as repurposed administrative data or private-sector data, it will be critically important to carefully specify what the equivalent of survey metadata and paradata would be. Since data are being repurposed, the meanings of particular values are likely to be subtly different. Having a precise understanding of the provenance of the data will be critical for correct interpretation, but it will be difficult because metadata recording standards vary across data sources. Similarly, population coverage and sampling bias are also of concern for repurposed data. Therefore, understanding and documenting what is known about these domains will be a necessary step to ensuring correctness. Finally, repurposed data will often require considerable manipulation. Therefore, recording the editing and cleaning processes applied to the data, as well as the statistical transformations and the software run, will also be critically important.

There are multiple types of metadata, and all of these have to be recorded at the source. However, mere recording is not in itself enough: metadata will also be needed as data are transformed and new data products are derived, so that the dependencies associated with any data product of interest can be fully understood, including, in particular, the final reported statistics. Data provenance methods, which allow researchers to track their data through all transformations, analyses, and interpretations, have been developed for this purpose.

There are many different ways in which provenance can be recorded. Perhaps the most important distinction is between set-level (or process) provenance and item-level (or database) provenance. The former is captured automatically by workflow systems: for a given dataset or statistic, it provides information on the sequence of operations used in its creation. However, if a particular item in a newly developed dataset surprises a user, knowing the creation process for the dataset is not necessarily helpful. In such cases, the alternative is to record provenance for individual data items, recording how each one was derived. Such fine-grain recording of provenance can involve significant time and, consequently, costs, and how to do this efficiently is an active area of research.

For national statistics, the panel assumes that users, for the most part, care only about the aggregate averages, so one might think that process-level provenance will suffice. However, specific values are often manipulated individually, and a detailed record of the individual manipulations is required. For instance, outliers may be eliminated, and erroneous entries may be manually corrected. It is not enough, for example, to record a manual review and error correction step at the level of a dataset: information needs to be recorded on the individual manual corrections applied.
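
A minimal sketch of item-level provenance follows: each manual correction appends a record of what changed, for which record and field, and why. The record identifier, field name, and correction reason are hypothetical.

```python
from datetime import datetime, timezone

def correct_value(record: dict, field: str, new_value, reason: str, provenance: list[dict]) -> None:
    """Apply a correction to one item and append a fine-grain provenance entry."""
    provenance.append({
        "record_id": record["record_id"],
        "field": field,
        "old_value": record.get(field),
        "new_value": new_value,
        "reason": reason,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    record[field] = new_value

log: list[dict] = []
row = {"record_id": "est-0042", "annual_payroll": 12.0}
correct_value(row, "annual_payroll", 12_000.0,
              "analyst review: value reported in thousands", log)
print(log[0]["old_value"], "->", log[0]["new_value"])
```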

The panel recognizes that federal statistical agencies currently have thorough documentation and good metadata, often including paradata, for their surveys. For many ongoing major surveys, there is a wealth of information and research that has been accumulated over many years. However, statistical agencies typically do not have or retain some of the fine-grained detailed records discussed above as part of data provenance. For example, analysts who review and edit business survey data may not retain a record of every action they took in cleaning the data. Edits and imputations performed on survey datasets may not be fully documented in a user-friendly manner except at a very high level (e.g., if hot deck imputation is used). Often, only a small number of people are familiar with the code that performs these operations, which also may not be clearly documented. Thus, instituting the more comprehensive and detailed documentation of these processes and activities that will be required for new data sources will be new to agencies and may be seen as burdensome, but we believe it is worthwhile.

The complete provenance associated with any dataset can be overwhelming. Even if the provenance contains all the requisite information, actual utilization is not easy. This “fitness-for-use” issue is much more complex when using multiple data sources, as multiple data sources often provide more information than traditional single data sources. Frequently, the user may not be interested in the provenance of the entire dataset, but only in a particular value. Much of the provenance recorded may not be relevant to that particular value, so a much less complete provenance may suffice for the user’s needs. However, it may not be sufficient simply to identify the data sources from which the particular result was derived; one may wish to identify the specific source values that contributed to it. The concept of fine-grain provenance will be valuable in such cases.

Often, when a user seeks provenance for a dataset or item, the user has a particular purpose in mind: for example, to answer a specific question. Rather than providing such a user with the full provenance for the dataset, should the user be provided with only the provenance components that are relevant to the question? This question and similar ones are topics of an ongoing stream of research, with some good ideas being investigated.

Reproducibility

In theory, a complete provenance record should permit the entire data production and computational process to be reproduced. In practice, a few additional considerations arise. One is dependence on secondary inputs. For example, suppose that one aggregates input from a credit card processor to determine spending by category. This process requires that every merchant be mapped to a category. If the category of a merchant changes, then the computation loses reproducibility unless the entire mapping table is also recorded. A second consideration is dependence on a software version. Small, supposedly innocuous changes in software can cause different results to be produced if rounding is done differently.

Software does not exist in a vacuum; it relies on an operating environment composed of both hardware and software. If the operating environment changes—for example, if the hardware platform is changed—results might change. The precision of a computation can change if the operating environment changes even though the application software system remains unaltered.
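
In practice, these dependencies can be captured alongside each statistical run. The sketch below records interpreter and platform versions together with a digest of a secondary input such as the merchant-category mapping table mentioned above; the file name is hypothetical, and a fuller manifest would also capture package versions and the code revision used.

```python
import hashlib
import json
import platform
import sys

def run_manifest(mapping_table_path: str) -> str:
    """Record enough about a run to help reproduce it later: interpreter and
    platform versions plus a digest of the secondary input table."""
    with open(mapping_table_path, "rb") as f:
        table_digest = hashlib.sha256(f.read()).hexdigest()
    return json.dumps({
        "python_version": sys.version,
        "platform": platform.platform(),
        "merchant_category_table_sha256": table_digest,
    }, indent=2)

# Stored next to the published statistic, e.g. (hypothetical file):
# print(run_manifest("merchant_categories.csv"))
```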

There are examples from the decennial census of how edit rules and coding have changed over time. At one time, “tailor” was an occupation with male connotations and “seamstress” an occupation with female connotations; a census form response identifying a woman as a tailor therefore quietly had her occupation changed to seamstress (Conk, 1980). In the 2000 census, same-sex couples who identified as married were recorded as unmarried because no states at that time legally recognized same-sex marriages.

In short, issues for reproducibility are mostly well understood, but perfect reproducibility can be very difficult to achieve in practice. Departures from full reproducibility may become necessary, sometimes for reasons of cost, but they should be undertaken with care, as reproducibility is of growing importance throughout scientific work (National Academies of Sciences, Engineering, and Medicine, 2016b). Reproducibility will be key to helping researchers both in and outside the federal statistical system understand what was done with different data sources and the quality implications for the resulting statistics (see Chapter 6). Allowing internal and qualified external researchers to access raw granular data, examine the provenance, and perform appropriate analyses will permit useful evaluations of what was done and the sharing of good practices across agencies. The panel recognizes that to the extent that reproducibility requires recording detailed, record-level information, there are implications for privacy (see Chapters 4 and 5).

Being able to explain exactly what process was used is also important to maintaining the credibility and trust of an agency with the users of its data. This consideration increases in importance as more complex and more computational processes are used to generate statistical products.

CONCLUSION 3-3 Creating statistics using multiple data sources often requires complex methodology to generate even relatively simple statistics. With the advent of new and different sources and innovations in statistical products, federal statistical agencies need to figure out ways to provide transparency of their methods and to clearly communicate these methods to users.

SYSTEM MIGRATION

As we noted above, computing statistics from diverse data sources will require a system architecture that differs substantially from what many statistical agencies have today. This requirement raises the question of the migration path for data both within an agency and to the panel’s recommended new entity. We note that this migration occurs not just for the computing systems, but also for the business processes used.

In general, a gradual migration introduces less risk. However, in many instances an agency may not have the luxury of being able to migrate gradually. For example, if credit card transaction data are to be used to compute some statistics of economic activity, there may be no reliable way to take a portion of the reported transactions and use traditional data collection techniques for the rest.


To soften the possible impact of and concerns about an abrupt migration, it would be advisable to initially run traditional approaches simultaneously with the new approaches, which is currently common practice in the statistical agencies. Comparing the findings of the two approaches would verify the correctness of the new approach and instill confidence in it. It may also be possible to create “sandbox” spaces where new computational streams can be experimented with and tested before being moved into production. In addition, agencies can use “rollback” mechanisms, in which some “old” processing modes can be used if difficulties are found in the new mode.

When transitions are made, the changes need to be carefully logged so that any errors or undesired changes can be detected and users will at least have the beginning of an explanation in many situations. Such a log may even permit a rollback of changes that have been applied.
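
A parallel run of the kind described above can be as simple as computing the same statistic through both pipelines and logging any divergence beyond an agreed tolerance. The sketch below assumes two hypothetical estimator functions standing in for the legacy and multi-source pipelines.

```python
import logging
import math

logging.basicConfig(level=logging.INFO)

def legacy_estimate(period: str) -> float:   # hypothetical traditional pipeline
    return 102.4

def new_estimate(period: str) -> float:      # hypothetical multi-source pipeline
    return 102.9

def parallel_run(period: str, tolerance: float = 0.01) -> float:
    """Publish the legacy figure while logging how far the new pipeline diverges."""
    old, new = legacy_estimate(period), new_estimate(period)
    if not math.isclose(old, new, rel_tol=tolerance):
        logging.warning("period %s: legacy=%.2f new=%.2f differ beyond tolerance",
                        period, old, new)
    else:
        logging.info("period %s: estimates agree within tolerance", period)
    return old  # keep publishing the legacy result until confidence is established

parallel_run("2017-Q2")
```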

PERSONNEL STAFFING AND SKILLS

The use of new technologies and new methodologies will require staff at federal statistical agencies and the panel’s recommended new entity to have appropriate new skills. This is true even if the bulk of the computational work is outsourced to a private contractor. For the existing statistical agencies, we believe this need can be met with a judicious use of training programs and a shift in the skill profile for new staff over time. The agencies have many staff with strong technical skills and experience that is relevant for dealing with data from any source. We believe that with additional training, many staff will be able to adopt the new paradigm, provided that key technical steps are outsourced.

The key requirement for moving to the new paradigm is a smooth transition. Agency history and domain knowledge need to be preserved. Hence, it is important not only that current staff do not experience hardships due to a migration to a new approach, but that they are also incentivized to be involved with the new approach so that they can capture the needed domain knowledge and improve the chances of meaningful new data being obtained and properly utilized. See Box 3-1 for a brief description of a pilot study that illustrates an exemplary migration.

RECOMMENDATION 3-1 Because technology changes continuously and understanding those changes is critical for the statistical agencies’ products, federal statistical agencies should ensure that their information technology staff receive continuous training to keep pace with these changes. Training programs should be set up to meet the current and expected future training needs for technology, and recruitment plans should account for future technology demands.
