Suggested Citation:"4 Using Private-Sector Data for Federal Statistics." National Academies of Sciences, Engineering, and Medicine. 2017. Innovations in Federal Statistics: Combining Data Sources While Protecting Privacy. Washington, DC: The National Academies Press. doi: 10.17226/24652.

4

Using Private-Sector Data for Federal Statistics

Recent years have witnessed an explosion of data from many sources, some of which are referred to as “big data” (e.g., Daas et al., 2015; Manyika et al., 2011). The term refers to the vast amounts of data that are now available in electronic form and are potentially accessible to analysis, including data that previously existed but were not centrally accessible (such as sales data and medical records) and new kinds of data for phenomena that were not previously measured on a consistent basis but now can be, using new kinds of measurements (such as sensors of natural and artificial phenomena, including weather and traffic). IBM has estimated that 2.5 exabytes (2.5 million terabytes) of data are produced every day.¹ As a comparison, the U.S. Library of Congress has roughly estimated that its entire printed collection of 26 million volumes totals 208 terabytes.² Some of these new data come from digital records of government agencies (e.g., the health care transaction records of the Centers for Medicare & Medicaid Services). But many of them come from private-sector enterprises (e.g., Manyika et al., 2011). Indeed, a whole set of new enterprises are using large digital data resources as the basis of their business models (e.g., Uber, Airbnb, LinkedIn).
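To make the scale of the comparison concrete, the two figures quoted above can be put on a common footing. The short calculation below is only an illustration using the decimal conversion stated in the text (1 exabyte = 1 million terabytes); the variable names are ours.

```python
# Figures quoted in the text; decimal units (1 exabyte = 1,000,000 terabytes).
DAILY_DATA_EXABYTES = 2.5
LIBRARY_OF_CONGRESS_TERABYTES = 208
TERABYTES_PER_EXABYTE = 1_000_000

daily_terabytes = DAILY_DATA_EXABYTES * TERABYTES_PER_EXABYTE
# How many printed Library of Congress collections fit into one day's data.
collections_per_day = daily_terabytes / LIBRARY_OF_CONGRESS_TERABYTES
print(f"{collections_per_day:,.0f} printed collections per day")
```

On these estimates, one day of new data corresponds to roughly twelve thousand times the printed collection.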

For this report’s purpose, we consider two kinds of private-sector data: private-sector structured data and private-sector data that have high dimensions, either in the number of observations or records or the number of attributes of the observations. Examples of high-dimensional data include streaming data production (e.g., utility meters, traffic cameras, and other sensors), Internet behavior documentation (e.g., browser search terms), and social media postings (e.g., data from Twitter, Facebook, LinkedIn). Examples of structured data include consumer information data, such as those from Zillow and Experian and other credit bureau data.

___________________

¹ See https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html [November 2016].

² See https://blogs.loc.gov/thesignal/2012/04/a-library-of-congress-worth-of-data-its-all-in-how-you-define-it [November 2016].

Some of these new data, whether from government or private-sector sources, could be used to create new statistics by themselves; others could be, and are being, used in conjunction with traditional statistical data. Some are stored in a form that permits useful statistical analysis immediately; others are stored in forms that would require significant processing prior to their statistical use.

In this chapter we first review the different kinds of private-sector data that are available and how the characteristics of these data affect their potential utility and usability for federal statistics. Next we briefly review efforts by national statistical offices around the world to examine and experiment with using these data sources to produce official statistics. We then review current work in the United States to examine and evaluate these new data sources for federal statistics. We conclude the chapter with a discussion of the challenges in using these data for federal statistics, including issues of access and quality.

DIMENSIONS OF NEW DATA SOURCES

We distinguish three dimensions of the new data resources: who owns or controls them (government agencies at the federal, state, or local level, or private-sector entities), the purpose for which the data were created (e.g., to record transactions, to capture output from sensors, or to communicate with others through social media), and the form of the data as stored (i.e., structured numeric data, semi-structured data, unstructured text, pixel data). In this chapter we deal primarily with private-sector data. Table 4-1 details these categories of new data resources.

The data sources shown in Table 4-1 vary in their “readiness” for use in federal statistics in terms of the likely time and effort it would take to clean and format them in order to produce usable statistics. As the second column of Table 4-1 indicates, private firms use surveys to assess their customers’ satisfaction or conduct broader surveys of a target population for market research or to make estimates of media use (e.g., Nielsen). The weaknesses in the survey paradigm (see Chapter 2) have also become very evident to private survey firms, which have generally lower response rates than do government surveys. In fact, many firms have abandoned the probability survey paradigm for opt-in Internet panels (Baker et al., 2010).

TABLE 4-1 Types and Examples of Private-Sector Data Sources

Definition and Examples	Structured Data from Censuses and Probability Surveys	Structured Data from Administrative Records	Other Structured Data	Semi-Structured Data	Unstructured Data
Definition	Data from a population or a sample of that population used to estimate the population’s characteristics through the systematic use of statistical methodology	Data collected by private companies from transactions, process control, or financial or human resource records	Data that are highly organized and can easily be placed in a database or spreadsheet, though they may still require substantial scrubbing and transformation for modeling and analysis	Data that have structure, but also permit flexibility in structure so that they cannot be placed in a relational database or spreadsheet; the scrubbing and transformation for modeling and analysis is usually more difficult than for structured data	Data, such as in text, images, and videos, that do not have any structure so that information of value must first be extracted and then placed in a structured table for further processing and analysis
Private-Sector Examples	Customer satisfaction surveys; Marketing research surveys; Media use surveys; Academic surveys	Data produced by businesses; Commercial transactions; Banking and stock records; Credit card records; Medical records; University and other nonprofit grant transactions	E-commerce transactions; Mobile phone location sensors; Global Positioning System sensors; Utility company sensors; Weather, pollution sensors	Extensible Markup Language (XML) files; Data from computer systems; Logs; Web logs; Mobile phone content: text messages; E-mail; Internet of things^a; Sport activity sensors (from watches, etc.)	Social network data (Facebook, Twitter, Tumblr, etc.); Internet blogs and comments; Documents; Pictures (Instagram, Flickr, Picasa, etc.); Videos (YouTube, etc.); Internet searches; Traffic webcams; Security/surveillance videos/images; Satellite images; Drones; Radar images

^aThe Internet of things refers to electronics embedded in devices and machines that allow them to be connected to the Internet to directly send and receive data.

In addition to government and private-sector data, surveys and censuses are also conducted by academic researchers. The data from these surveys are sometimes combined with administrative records to produce valuable information. For example, the Health and Retirement Study, conducted by the University of Michigan, obtains earnings records from the Social Security Administration and Medicare claims and summary information from the Centers for Medicare & Medicaid Services that are matched to respondents’ survey data to produce statistics and analysis about Americans’ physical and financial well-being.
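The matching of survey responses to administrative records described above is, mechanically, a keyed join. The following is a minimal sketch with made-up data, not a description of how the Health and Retirement Study performs its linkage; `respondent_id`, `link_records`, and all values are hypothetical.

```python
# Hypothetical survey responses keyed by an illustrative respondent_id.
survey = [
    {"respondent_id": 101, "self_rated_health": "good"},
    {"respondent_id": 102, "self_rated_health": "fair"},
    {"respondent_id": 103, "self_rated_health": "excellent"},
]
# Hypothetical administrative earnings records; respondent 102 has no match.
earnings_by_id = {101: 38500, 103: 61200}


def link_records(survey_rows, admin_lookup):
    """Left-join survey rows to administrative values, keeping unmatched rows."""
    linked = []
    for row in survey_rows:
        merged = dict(row)
        merged["annual_earnings"] = admin_lookup.get(row["respondent_id"])
        merged["matched"] = row["respondent_id"] in admin_lookup
        linked.append(merged)
    return linked


linked = link_records(survey, earnings_by_id)
match_rate = sum(r["matched"] for r in linked) / len(linked)
print(f"match rate: {match_rate:.0%}")
```

Keeping unmatched respondents in the output, with an explicit flag, lets an analyst report the match rate and check whether unmatched cases differ systematically from matched ones.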

As shown in the third column of Table 4-1, private firms also generate their own administrative records, which may be similar in structure to government administrative records. In the private sector, administrative records are often transactions, such as credit card purchase records or payroll documents. Sometimes these private-sector administrative records are used to produce statistics on their own, such as the National Employment Report from Automatic Data Processing, Inc. (ADP), which precedes the Bureau of Labor Statistics (BLS) release of the employment situation each month.³

The other three categories for private-sector data sources, shown in the last three columns of Table 4-1, vary in the structure of the data and how difficult they are to clean and transform into usable numeric form to produce statistics. By structured data we mean numeric data, often ordered into rectangular or fixed relational formats. The best structured data for statistical use have metadata attached to them, which document the format and meaning of each variable. However, even with these attributes, structured data generally need to be transformed for analytic purposes. Structured data in the private sector often include highly detailed geospatial data, such as those from mobile phones, traffic sensors, and Global Positioning System (GPS) devices, and these data may be available in real time. Some similar sensor data, including traffic monitoring sensors, may also be created by government agencies (see Table 3-1 in Chapter 3).

Semi-structured data are best described as data that can be turned into formatted numeric data by being coded and classified into numeric categories based on information available in the data themselves. Examples of semi-structured data include Extensible Markup Language (XML) files, e-mails, documents, mobile data content, and log data from computer systems.
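As an illustration of the kind of transformation involved, the sketch below flattens a small XML document into rectangular records of the sort a statistical package expects. The XML layout, element names, and the `flatten_orders` helper are invented for this example.

```python
import xml.etree.ElementTree as ET

# Invented sample: orders hold a variable number of items and optional notes,
# so the file does not map directly onto a single fixed-width table.
SAMPLE_XML = """
<orders>
  <order id="1001">
    <item sku="A-17" qty="2" price="4.50"/>
    <item sku="B-03" qty="1" price="12.00"/>
  </order>
  <order id="1002">
    <item sku="A-17" qty="5" price="4.25"/>
    <note>gift wrap</note>
  </order>
</orders>
"""


def flatten_orders(xml_text):
    """Return one flat record per <item>, repeating the parent order id."""
    root = ET.fromstring(xml_text)
    rows = []
    for order in root.findall("order"):
        for item in order.findall("item"):
            rows.append({
                "order_id": int(order.get("id")),
                "sku": item.get("sku"),
                "qty": int(item.get("qty")),
                "price": float(item.get("price")),
            })
    return rows


rows = flatten_orders(SAMPLE_XML)
for row in rows:
    print(row)
```

Note that the optional `<note>` element is simply dropped here; deciding which flexible elements to keep, code, or discard is part of the cleaning effort the text describes.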

Unstructured data include digital videos, digitized pictures, and digital sound recordings, as well as digitized text. Some common forms of private-sector unstructured data include text data from social networks (Facebook, Twitter, etc.), pictures (Instagram, etc.), videos (YouTube, surveillance cameras, etc.), satellite images, traffic webcams, data from drones, etc. These data are often the most difficult data to scrub and transform for statistical purposes as they require complicated transformations based on the specific data source.

___________________

³ ADP works in collaboration with Moody’s Analytics in using ADP’s large payroll dataset to predict private-sector employment prior to the BLS release. ADP processes the payrolls of about half a million private establishments in the United States, which employ nearly 20 percent of private-sector workers. Moody’s Analytics adjusts the ADP data to match those from BLS.

Overall, large amounts of high-dimensional data resources are held in the private sector by firms that are themselves information-based enterprises. This observation leads to issues of access for federal statistical purposes, which we address later in this chapter and further in Chapter 6. The table also makes clear that the new data resources arise not from the design of a statistician, but as part of other processes. Sometimes the processes generating the data produce information that may be relevant to official statistics, but this is not their primary purpose. Hence, although the data have been collected by these enterprises, they are not, for the most part, immediately usable for statistical purposes or analysis. For some, much processing work would have to be done to create structured numeric data that have statistical utility. Finally, because the data were not designed for a statistical purpose, they tend to be rather lean, that is, not consisting of a large number of attributes describing the measurement unit (e.g., a person or company). Instead, they measure only what is needed by the process producing them for the firm or other entity. Hence, there is a need to blend these new data resources with traditional survey data in new statistical analyses if they are to be used to improve any existing official statistics. Although blending data sources holds the potential to improve federal statistics, there is no guarantee that it will do so; thus, careful evaluation of data sources is necessary (see below).

USING PRIVATE-SECTOR DATA SOURCES FOR STATISTICS

The potential opportunities to use new data resources for building national statistics have been recognized by many countries of the world with the creation of the U.N. Working Group on Big Data⁴ in March 2014. The working group acknowledges that “using Big data for official statistics is an obligation for the statistical community based on the Fundamental Principles [of Official Statistics (see Box 2-1)] to meet the expectation of society for enhanced products and improved and more efficient ways of working” (U.N. Economic and Social Council, 2014, p. 1). The goal of the group is to find promising uses of such data for official statistics, specifically focusing on uses for GPS devices, automated teller machines, scanning devices, sensors, mobile phones, satellites, and social media. The working group has created principles for access to big data sources to ensure fair treatment of businesses supplying these data.

___________________

⁴ The full members of the working group are Australia, Bangladesh, Cameroon, China, Colombia, Denmark, Egypt, Indonesia, Italy, Mexico, Morocco, Netherlands, Oman, Pakistan, Philippines, Tanzania, the United Arab Emirates, the United States, the U.N. Economic Commission for Europe, the U.N. Economic and Social Commission for Asia and the Pacific, the U.N. Global Pulse, the International Telecommunications Union, the Organization for Economic Cooperation and Development (OECD), the World Bank, and the Statistical Centre for the Cooperation Council for the Arab Countries of the Gulf.

To assess how national statistical offices are seeking to use these new data sources, the U.N. Statistical Commission (UNSC) conducted a survey of 93 national statistical offices. The national statistical offices of countries similar to the United States⁵ were most interested in using big data for “faster, more timely statistics” (88%), “reducing response burden” (75%), and creating “new products and services” (72%). These new products and more timely statistics were more important than other factors for the use of big data, such as “modernization of the statistical production process” (69%) and cost reduction (63%) (U.N. Economic and Social Council, 2016). Although many countries are interested in various big data sources for official statistics, very few have yet been able to actually produce official statistics based on these sources.

Academic and private-sector organizations have created statistics based on web-scraped data from e-commerce sites, as in the Billion Prices Project (see Box 4-1). The project uses prices of products on the Internet to create a daily Consumer Price Index (CPI) for 22 countries (Cavallo and Rigobon, 2016).⁶

Statistics Netherlands has been able to use big data sources to create national statistics. It has drawn on data from road sensors for transportation and traffic statistics (Puts et al., 2016). Due to the large number of sensors detecting vehicles in about 20,000 highway loops, Statistics Netherlands is able to collect around 230 million records a day. The data are anonymous—the sensors do not record identifiable information, such as license plate numbers—but the data do allow for estimates of what kind of vehicle was observed based on the vehicle’s length traveling over the sensor, when vehicles entered and exited highways, and the time of day of the observation. After receiving the data and transforming them, the data are

___________________

⁵ The countries in this definition are those that are members of OECD: Australia, Austria, Belgium, Canada, Chile, Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Israel, Italy, Japan, Korea, Latvia, Luxembourg, Mexico, Netherlands, New Zealand, Norway, Poland, Portugal, Slovak Republic, Slovenia, Spain, Sweden, Switzerland, Turkey, the United Kingdom, and the United States.

⁶ The 22 countries are Argentina, Australia, Brazil, Canada, Chile, China, Colombia, France, Germany, Greece, Ireland, Italy, Japan, Korea, Netherlands, Russia, South Africa, Spain, Turkey, the United Kingdom, the United States, and Uruguay. See http://www.pricestats.com/inflation-series?chart=1836 [November 2016].

BOX 4-1
The Billion Prices Project

The Billion Prices Project was created by Cavallo and Rigobon (2016) at the Massachusetts Institute of Technology with the objective of measuring inflation using online posted prices for goods and services, an approach known as web-scraping. Web-scraping has the ability to transform the data underlying web pages into databases and, through a “data curation” approach, representative prices can be detected. Indeed, the main challenge of using big data is that most of the data are unimportant. Hence, the curation process involves carefully identifying the retailers that will serve as data sources; using web-scraping software to collect the data; then cleaning, homogenizing, categorizing, and finally extracting the information so it can be used in measurement and research applications.

As computing power has become less expensive, data have been downloaded for more than 50 countries and hundreds of retailers worldwide, and the project has constructed daily inflation measures for about 20 countries. The approach is hybrid: part of the information used in the project is collected by BLS (weights and some services such as education and health) to complement the online price data collection.

Data collection using web-scraping is orders of magnitude cheaper than traditional techniques, such as the surveys used to construct the CPI. More than 5 million items are tracked daily in the Billion Prices Project. In contrast, the CPI is based on prices collected on about 80,000 items per month,^a with the “market basket” of items used by the CPI determined from data collected about consumer spending in the Consumer Expenditure Survey. The CPI is based on both online and offline goods, while the index created by the Billion Prices Project is based on online goods only; however, in categories such as electronics, clothing, hotels, books, entertainment, travel, and food, the dominance of online retailers is imminent. The advantage of the Billion Prices Project is that, even though it does not include all goods, the data are available on a daily basis for a much larger collection of items than is otherwise available.

^aSee Question #8; available: https://www.bls.gov/cpi/cpifaq.htm [November 2016].
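To give a sense of how scraped prices might be turned into a daily index, the sketch below computes an unweighted geometric mean of day-over-day price ratios for items observed on both days (a Jevons-type elementary index). This formula is our assumption for illustration only; the box does not state which formula the Billion Prices Project uses, and the project additionally relies on BLS weights, which are omitted here. Item names and prices are invented.

```python
import math


def daily_price_index(prices_yesterday, prices_today):
    """Geometric mean of today/yesterday price ratios over items seen on both days."""
    matched = [
        item for item in prices_today
        if item in prices_yesterday
        and prices_yesterday[item] > 0
        and prices_today[item] > 0
    ]
    if not matched:
        raise ValueError("no items observed on both days")
    log_sum = sum(
        math.log(prices_today[item] / prices_yesterday[item]) for item in matched
    )
    return math.exp(log_sum / len(matched))


# Invented scraped prices: "socks" disappears and "jeans" appears, so only
# the two items present on both days contribute to the index.
yesterday = {"milk-1l": 1.00, "tv-40in": 400.00, "socks": 5.00}
today = {"milk-1l": 1.10, "tv-40in": 400.00, "jeans": 30.00}

index = daily_price_index(yesterday, today)
print(round(index, 4))
```

Restricting the calculation to matched items is one simple way to handle products that enter or leave a retailer's site; a production index would need a more careful treatment of item turnover and category weights.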

cleaned and adjusted for any possible errors—for example, if a sensor was not functioning properly—and estimates are created for the total number of vehicles on the highways. These estimates can be produced extremely quickly if needed. In early 2016, the Netherlands experienced glazed frost, and Statistics Netherlands was able to produce estimates of how the glazed frost had affected traffic within 2 days.⁷
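The cleaning and classification steps described for the road sensor data can be sketched as follows. This is a simplified illustration, not the Statistics Netherlands method: the length cut-offs, field names, and sensor identifiers are hypothetical, and records from a malfunctioning sensor are simply dropped here rather than adjusted.

```python
from collections import Counter


def classify_by_length(length_m):
    """Assign a vehicle class from measured length (hypothetical cut-offs, metres)."""
    if length_m < 5.6:
        return "short"
    if length_m < 12.2:
        return "medium"
    return "long"


def count_vehicles(records, faulty_sensors):
    """Count vehicles per class, skipping records from sensors flagged as faulty
    and records with a missing or non-positive length."""
    counts = Counter()
    for rec in records:
        if rec["sensor_id"] in faulty_sensors:
            continue
        length = rec.get("length_m")
        if length is None or length <= 0:
            continue
        counts[classify_by_length(length)] += 1
    return counts


records = [
    {"sensor_id": "S1", "length_m": 4.2},
    {"sensor_id": "S1", "length_m": 16.5},
    {"sensor_id": "S2", "length_m": 4.4},   # S2 is flagged as faulty below
    {"sensor_id": "S3", "length_m": None},  # unreadable measurement
    {"sensor_id": "S3", "length_m": 7.9},
]
counts = count_vehicles(records, faulty_sensors={"S2"})
print(dict(counts))
```

Because no identifying fields are recorded, each record carries only what the sensor measures, which is consistent with the anonymity of the data noted in the text.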

Another example of using high-dimensional data for national statistics comes from a partnership with private-sector mobile phone service providers. Ahas and colleagues (2011) created estimates of tourism statistics in Estonia using GPS-based data from mobile phones. Private-sector data sources are also being actively evaluated to provide new indicators of sustainability, especially for developing countries (U.N. Global Pulse, 2016). In fact, economists have used luminosity from satellite images as an estimator of gross domestic product (GDP) in developing countries (Chen and Nordhaus, 2010). Using 1° × 1° grid-cells that examine luminosity could provide important information on such factors as economic output where there is a lack of population or economic statistics, particularly from war-torn countries. However, luminosity has very little value added for countries that have high-quality statistical systems (Chen and Nordhaus, 2010).

___________________

⁷ See http://nos.nl/artikel/2079372-helft-minder-verkeer-door-ijzel.html [November 2016].

Other emerging uses of high-dimensional data combine them with more traditional statistics created by government statistical agencies. Marchetti and colleagues (2015) created estimates of poverty for small areas by blending mobile phone data with other data from the national statistical office in Italy. Statistics Canada (2016b) has begun to use satellite imagery data as an input for agricultural statistics, replacing a survey. Chessa (2016) used retail outlet scanner data to cover a part of the product prices needed for the CPI. The Colombia National Statistics Office (2016) reported blending satellite digital images to improve land use statistics and land coverage statistics. The U.N. Global Pulse (2014) explored using transformed Twitter data to provide real-time food pricing estimates. Daas and Puts (2014) blended social media sentiment data with traditional data sources to measure consumer confidence.

Many big data projects are currently in pilot project phases, including such projects as use by the Australian Bureau of Statistics (ABS) of satellite surface reflectance data to classify crop type and estimate crop production. ABS is still trying to work out many important challenges such as ensuring reliability of the image data over time, aligning data to statistical boundaries, determining proper level of granularity for the data, and identifying the most accurate statistical methods for estimating quantities of interest (Australian Bureau of Statistics, 2015).

In the United States, a number of federal statistical agencies have been exploring and researching private-sector data sources, such as credit card transactions, other information from commercial providers, and information from Internet sources. Some federal statistical agencies are blending private-sector high-dimensional data with traditional data sources. The Bureau of Justice Statistics (BJS) is currently running a pilot project that is web-scraping data from online articles in order to try to improve estimates for arrest-related deaths (see Box 4-2). BLS currently uses scanner data as part of the input for its CPI estimates (see Horrigan, 2013).

Other federal statistical agencies are using private-sector sources to augment information that could be obtained through surveys. The Economic Research Service (ERS) has purchased Nielsen and IRI scanner data, which are linked with individual household details, including demographic characteristics of residents, purchases, and prices. The information can be further linked with other geospatial and store characteristic data to get a more complete picture of the food environment for households.

BOX 4-2
Web-Scraping to Improve Statistics on Arrest-Related Deaths

BJS has been responsible for reporting arrest-related deaths since the Deaths in Custody Reporting Act (P.L. 106-247) was enacted in 2000. Until recently, BJS relied on state reporting coordinators (SRCs) in the criminal justice system for each state to identify relevant cases of law enforcement homicides. However, some states did not have SRCs who participated in the reporting of this information, and even the states with participating SRCs used varying methodologies to identify arrest-related deaths. In response to these weaknesses in the system, BJS created a pilot program, the Arrest-Related Deaths Program, in March 2015. The program used a hybrid approach of open-source web-scraping along with existing homicide reports. In brief, the system begins by web-scraping for cases and stories in which a suspect dies in police custody and then noting possible causes. The next step is a survey conducted both with law enforcement agencies and medical examiners about each case found from web-scraping. By this process, the program attempts to identify both false positives and false negatives.

The web-scraping process involves several steps. To begin, the pilot program uses a process to filter and sort through a large volume of articles and stories on the web in order to find cases that are in scope of arrest-related deaths. Each night, the scraping collects 30,000-40,000 different sources of information and news. Exact duplicates (the same web URL appearing in multiple sources) and “untrusted domains” that are not the original source for information (e.g., Wikipedia, Reddit, Amazon) are eliminated. Next, a “text similarity detector” process is performed: using a threshold of 80 percent of the text being the same, duplicated texts are removed. Next, a “relevancy classifier” is applied to the rest of the sources to identify sources relevant to the program’s scope. After all these steps, the remaining sources of information are called the “web front end” and constitute about 1,500 of the original 30,000-40,000 sources. Finally, about 10 coders read through these remaining articles and identify and extract information relevant to the program, including date, personal information, and location. This information is then checked, as noted above, by conducting a survey with both law enforcement agencies and medical examiners to confirm the case is in fact an arrest-related death.

The pilot has so far been very successful, identifying more arrest-related deaths than many other open-source collections. Additionally, the program’s ability to use surveys to confirm arrest-related deaths with law enforcement agencies and medical examiners has made its estimates less volatile than similar open-source collections. The pilot program has been able to identify more cases than previously captured by BJS: the program identified about 400 arrest-related deaths in its short 3-month trial period in comparison with about 800 per year that the program had received from the coordinators.
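The automated filtering stages described in Box 4-2 (exact-duplicate and untrusted-domain removal, near-duplicate text removal at an 80 percent threshold, then a relevancy check) can be sketched as a small pipeline. This is an illustrative sketch only, not the BJS implementation: the standard-library `difflib.SequenceMatcher` ratio stands in for the unspecified “text similarity detector,” a keyword match stands in for the “relevancy classifier,” and the URLs and article texts are invented.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

UNTRUSTED_DOMAINS = {"wikipedia.org", "reddit.com", "amazon.com"}  # examples named in the box
SIMILARITY_THRESHOLD = 0.80  # "80 percent of the text being the same"
KEYWORDS = ("died", "death", "killed")  # hypothetical stand-in for the classifier


def is_untrusted(url):
    """True if the URL's host is, or is a subdomain of, an untrusted domain."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in UNTRUSTED_DOMAINS)


def similarity(a, b):
    """Stand-in text similarity score between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


def is_relevant(text):
    """Stand-in relevancy check: any keyword appears in the text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def filter_articles(articles):
    # Stage 1: drop exact-URL duplicates and untrusted domains.
    seen_urls, stage1 = set(), []
    for art in articles:
        if art["url"] in seen_urls or is_untrusted(art["url"]):
            continue
        seen_urls.add(art["url"])
        stage1.append(art)
    # Stage 2: drop texts that are near-duplicates of an already kept text.
    stage2 = []
    for art in stage1:
        if not any(similarity(art["text"], k["text"]) >= SIMILARITY_THRESHOLD
                   for k in stage2):
            stage2.append(art)
    # Stage 3: keep only sources judged relevant; these go to human coders.
    return [art for art in stage2 if is_relevant(art["text"])]


base = "A man died in police custody on Tuesday after an arrest downtown."
articles = [
    {"url": "https://news.example.com/a1", "text": base},
    {"url": "https://news.example.com/a1", "text": base},             # same URL
    {"url": "https://en.wikipedia.org/wiki/Some_page", "text": base},  # untrusted
    {"url": "https://wire.example.org/x9", "text": base[:-1] + ", police said."},
    {"url": "https://news.example.com/b2",
     "text": "City council approves new budget for road repairs."},
]
kept = filter_articles(articles)
print([art["url"] for art in kept])
```

The pairwise comparison in stage 2 is quadratic in the number of sources, which is acceptable for a toy example but would likely need a faster near-duplicate technique at the nightly volumes the box reports.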

CONCLUSION 4-1 Enormous amounts of private-sector data that are being generated every day have the potential to improve the timeliness and detail of national statistics.

RECOMMENDATION 4-1 Federal statistical agencies should systematically review their statistical portfolios and evaluate the potential benefits of using private-sector data sources.

CHALLENGES TO USING PRIVATE-SECTOR DATA SOURCES FOR FEDERAL STATISTICS

Given the many different data types shown in Table 4-1 (above) and the many different potential private sources for these data, there is similarly a wide range of challenges for agencies seeking to acquire and use those data for federal statistics. Although these data sources hold the potential to add value to official statistics, there are many methodological and logistical issues that would need to be addressed before that potential can be realized. In this section we discuss two of the major challenges, access and quality of the data, which we will explore more fully in our second report.

Access

The approaches for accessing private-sector data resources are different from those for government-owned data. As noted in Chapter 3, U.S. federal statistical agencies typically develop a bilateral memorandum of understanding or an interagency agreement to codify the terms under which data can be shared between them. However, asking companies to share their data with federal agencies does not start from the same basic trust or common mission that exists among agencies. Although some companies publish their data and allow free access (e.g., Twitter), other companies sell data services and technology platforms. Companies may be reluctant to share their data for several reasons (Groves, 2013): (1) being liable for possible data breaches if their data are linked with government records; (2) increased attention to confidentiality issues and the data private firms have been collecting without much notice from the public; and (3) the possibility that other companies could use their data to create a profitable product, which they would be unable to capitalize on due to the collaborative agreement.

For companies that are willing to provide data, several approaches are possible. As noted above, ERS has purchased Nielsen and IRI scanner data for food policy research. And as described in Chapter 3, the Department of Housing and Urban Development purchased state and local county tax assessment data from CoreLogic, which is a private-sector firm that aggregates these data from local sources and sells them to interested parties. Statistical agencies can also form public-private partnerships with private firms to obtain access to their data. A public-private partnership is defined as a voluntary collaborative agreement between the public and private sectors. These partnerships are distinguished from other forms of public-private cooperation in that the partnership agreements contain defined roles, responsibilities, and rights and are typically characterized by long terms because of the need for longitudinal data (Robin et al., 2016). Data from private companies normally include information from data collection, including either active (survey) or passive (web-scraping) methods; administrative and similar data used for billing customers and targeting services; and transactional data.

Public-private partnerships are typically implemented through long-term contracts. There are four main approaches for access to and use of the data in public-private partnerships: (1) the company providing the data analyzes the data internally and then shares the relevant statistics with the agency; (2) the company transfers the data to the agency for the agency to compute the statistics; (3) the data are transferred to a trusted third party for analysis; and (4) the statistical agency’s functions, including data collection and processing, are outsourced to the private firm.

An example of the first approach comes from Mexico, where Telefónica analyzed its call detail records to assess the effectiveness of public health alerts about the spread of infectious diseases (Frias-Martinez and Frias-Martinez, 2012). Telefónica compared the mobility observed in call detail records for the area of a health alert with a hypothetical model of the same area in which no alert was given. The difference between modeled mobility without health alerts and actual mobility with health alerts allowed Telefónica to gauge how effective the alerts were in reducing the spread of infectious diseases, and it subsequently shared this information with public agencies.

In the second approach listed above, transfer of datasets is a sharing agreement that involves the physical transfer of databases to the statistical agency under a strict protocol that clearly specifies the terms and conditions and includes each party’s responsibilities and penalties for not following the agreement. BLS is currently negotiating with some large companies to provide payroll and other internal company data from which BLS will extract
relevant information, rather than asking the company to complete its surveys. Although one advantage of this type of agreement is that statistical offices can analyze the data themselves, it is important to note that many agencies may not have the internal capacity to work with private-sector data (Robin et al., 2016).

In the third approach listed above, the provider’s data are transferred to a third party, which analyzes them and delivers the results to the statistical agency. There is an example of this type of partnership in Estonia, where the national statistical office formed a public-private partnership to create travel statistics based on cell phone call detail records through the analytics company Positium and the central Bank of Estonia, Eesti Pank. Positium has been working with mobile network operators for more than 10 years and has demonstrated that it is a trusted third party between the Estonian national statistical system and the telecom providers. Positium manages important concerns in using the detailed records, including preservation of business secrets, protection of users’ privacy, and compliance with privacy legislation. It also offers benefits to the Estonian statistical system, as it possesses the technical ability to safely handle data provided by the mobile network operators on its private servers (Robin et al., 2016).

The last approach listed above, outsourcing a statistical agency’s functions, is a process in which activities conducted by statistical offices are contracted out. This approach is usually adopted for efficiency. It can include traditionally collected data as well as nonofficial data sources that are freely available. This approach is quite common for U.S. federal statistical agencies: of the $7.4 billion spent annually on statistical activities across the federal government, approximately $1.5 billion was designated for private contractors in fiscal 2016 (U.S. Office of Management and Budget, 2015b). Often this work involves survey data collection, but it may also include such activities as frame development, sample design, analysis, and report preparation.

Public-private partnerships offer a number of potential benefits to statistical agencies in that they permit access to private data sources, but there are also important risks and challenges in using those sources. Most of the private data provided in some form to statistical offices from public-private partnerships contain important business data about a firm’s customers and strategy that could have negative effects for the data provider if accidentally released or breached. Privacy and ethical issues are also important to consider in public-private partnerships as data often contain personally identifiable information, which is information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other information that is linked or linkable to a specific individual.⁸ In addition, a firm’s customers or clients can be extremely sensitive about other uses of their data. For example, mobile network operators may be concerned that some customers will change providers simply on the basis that mobile network operators are holding their call data records (Infas, 2010).

___________________

⁸ See OMB circular A-130, p. 33; available: https://obamawhitehouse.archives.gov/sites/default/files/omb/assets/OMB/circulars/a130/a130revised.pdf [February 2017].

Finally, incentives and sustainability for both parties need to be considered. Even if there are short-term benefits for both parties, long-run costs may become an issue as new methods of data collection become available that lead to issues of compatibility and completeness for longitudinal datasets. Moreover, statistical agencies may fear becoming dependent on an outside provider who may discontinue providing data at any time or could raise prices when it becomes clear an agency has no other source for the data.

From a company’s perspective, there are two primary access issues to consider: the privacy and confidentiality of the data and the profit objective, which come into play in different ways depending on the arrangement between the firm and the statistical agency. If a company has individual credit card data that could be used to assist in the construction of statistical measures, such as GDP or retail sales in the United States, the firm could work with the statistical agency in a couple of different ways with likely different implications. One possibility would be for the private firm and the statistical agency to jointly develop an index, which the company would sell to the statistical office. Privacy concerns would be minimized by providing aggregate statistics to the agency, but there could be implications for potential profits because such an index would also have value to others in the private sector. The statistical office would likely be unable to compensate the private firm sufficiently to keep it from also selling the index to other companies in the private sector.⁹

The second possibility is for a company to sell its raw credit card data to the statistical agency to analyze and combine with the agency’s other information. In this approach, the company and the statistical agency could then each develop their own separate indexes, and the company could sell its index to others without necessarily revealing the same information the statistical agency would publish. However, in this case, the firm would be very concerned about risks to the privacy of its clients and losing control of its data.¹⁰ We discuss issues of privacy and data security in detail in the next chapter, but the main point here is that a company’s privacy concerns and profit objective collide and make the form of engagement with a statistical agency complicated. There is likely no simple solution, but greater engagement between statistical offices and the private sector will be needed to try to meet the challenge.

___________________

⁹ There is also the potential issue of prerelease ownership and access. If early access to the statistics is potentially of value (e.g., to investors), then loss of control over release could be a disadvantage: that is, there could be a risk that the private partner could profit from sharing the statistics before their official release.

¹⁰ Using secure multiparty computing platforms, which we note in Chapter 5 and will discuss further in report 2, may address these concerns.

Data Quality

We began this chapter noting a wide range of domains in which alternative data sources have the potential to contribute to federal statistics, but these sources are not typically simple substitutes for federal surveys, and careful evaluations of quality are needed. Google Flu Trends was designed to predict influenza incidence reports from the Centers for Disease Control and Prevention (CDC), but it represents a cautionary tale in the use of private data sources for national statistics. Although it performed well initially, in early 2013 Google Flu Trends was predicting nearly twice as many doctor visits due to influenza-like illnesses as the actual number of visits collected by the CDC (Lazer et al., 2014) (see Box 4-3). Other examples have shown how biased data lead to serious problems in prediction models (see Lum and Isaac, 2016).

BOX 4-3
Google Flu Trends

One of the weaknesses of Google Flu Trends (discussed above) was its dependence on a correlation between influenza incidence and the entry of search terms that could signal that the user suffered from influenza (e.g., “achy shoulders, runny nose”). If events affect the nature of that relationship (e.g., media reports of prevention efforts against the flu), the correlation can change: more people who are not suffering from the flu may enter such search terms (Lazer et al., 2014).

The initial version of Google Flu Trends used existing data to find the best matches among 50 million search terms to fit only 1,152 data points (Ginsberg et al., 2009). As a result, it included some terms that correlated with random error instead of the underlying relationship (the model was “overfit”), so that many of the search terms that matched historical flu patterns did not predict actual future cases. After Google Flu Trends updated its methodology in 2009, research showed that Google Flu Trends was not much better than a fairly simple projection using already available, 2-week lagged CDC data (Goel et al., 2010). However, Lazer and colleagues (2014) note that combining the Google Flu Trends data with other health data, such as lagged CDC data, could improve prediction over using either source alone.

More recently, Yang and colleagues (2015) have created a new model called ARGO (AutoRegressive with Google search data) that accounts for changes in people’s search behavior over time. The model is able to self-correct by recalibrating every 2 years using search terms and the CDC’s historical flu data. The model incorporates seasonal information on flu outbreaks but does not include terms simply related to the winter season.

BOX 4-4
Scanner Data and Economic Indicators

Scanner data come from scanning consumers’ purchases at retailers. These data usually include goods sold, prices, quantities, and the goods’ characteristics. In principle, these data could have tremendous advantages for the construction of several aggregate economic indicators, such as the CPI, retail sales, and economic activity in general. In practice, however, the available data are often incomplete and need to be properly curated.

If every transaction were recorded, then the construction of the CPI would be expected to be more accurate and simpler. The problem is that only rarely are all the details that are needed (i.e., quantities and prices) actually included. Sometimes the retailer aggregates the data by computing the average (daily or weekly) quantity and price, but this procedure implies averaging across several prices: the regular price, the loyalty card price, the sales price, and the discounted price due to coupons. Averaging may also be done across different stores, or retailers may decide to share only a subset of a store’s data. Moreover, not all the transactions in the economy are recorded in scanner data, although this problem is relatively minor and likely to be minimal in the future.

Another challenge in using scanner data is that companies that collect scanner data are mostly interested in measures of market share, response of customers to promotions and price changes, impact of advertising, etc. Answering these questions requires data that differ from the data needed to measure the daily sales of each product. That is, scanner data are currently being collected with the marketing, operations, and production set of questions in mind, but a statistical office is interested in measurements of economic activity and aggregate behavior. To satisfy the statistical need, scanner data would need to include a different level of granularity and complete coverage.
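The overfitting problem described in Box 4-3 can be illustrated with a small simulation. The sketch below is not the Google Flu Trends method; it is a minimal illustration, under the assumption that both the target series and every candidate "search term" series are pure random noise. The function names and parameter values are hypothetical. When the best match is picked from many candidates using few data points, it tends to look strongly correlated in the fitting period and then fails on data it has not seen.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def best_spurious_match(n_train=30, n_test=30, n_candidates=1000, seed=0):
    """Pick the noise series that best matches a noise target in-sample.

    Returns (in_sample_r, out_of_sample_r) for the selected series.
    All series are independent random noise (an assumption of this sketch),
    so any in-sample correlation is spurious by construction.
    """
    rng = random.Random(seed)
    n = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(n)]
    best_r, best_series = 0.0, None
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n)]
        r = pearson(candidate[:n_train], target[:n_train])
        if abs(r) > abs(best_r):
            best_r, best_series = r, candidate
    # Evaluate the selected series on the held-out period.
    out_r = pearson(best_series[n_train:], target[n_train:])
    return best_r, out_r


if __name__ == "__main__":
    in_r, out_r = best_spurious_match()
    print(f"in-sample r = {in_r:.2f}, out-of-sample r = {out_r:.2f}")
```

Holding out part of the series, as done here, is one common way to expose this kind of selection effect before relying on a predictor.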

High-dimensional data sources present a variety of other quality challenges for statistical uses, including coverage of the population and measurement issues. In terms of coverage of the population, there are often concerns about sample bias with these data sources, in part because such data often exist only for the “haves” and not the “have-nots.” In addition, social media data on Twitter, for example, are available only for those who choose to use the application (Couper, 2013). Measurement issues also arise with these data sources because, unlike a carefully designed and tested survey question, social media and some other data often are collected without a set stimulus. Similarly, it is difficult to determine how much of a social media post reflects someone’s “true” values and beliefs (Couper, 2013). Even seemingly objective and straightforward scanner data can be fraught with measurement issues (see Box 4-4).
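The averaging issue raised in Box 4-4 can be made concrete with a short calculation. The sketch below assumes a retailer reports only a weekly quantity-weighted average price (a unit value) for a product; the `Sale` record, the price-type labels, and all figures are invented for illustration. No posted price changes between the two weeks, yet the reported average falls because more units are bought at the loyalty and coupon prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sale:
    price_type: str    # illustrative label: "regular", "loyalty", "coupon"
    unit_price: float  # price paid per unit
    quantity: int      # units sold at that price


def unit_value(sales):
    """Quantity-weighted average price: total revenue / total units."""
    units = sum(s.quantity for s in sales)
    revenue = sum(s.unit_price * s.quantity for s in sales)
    return revenue / units


# Hypothetical weekly records for one product; posted prices are identical.
week1 = [Sale("regular", 4.00, 80), Sale("loyalty", 3.50, 15), Sale("coupon", 3.00, 5)]
week2 = [Sale("regular", 4.00, 40), Sale("loyalty", 3.50, 40), Sale("coupon", 3.00, 20)]

if __name__ == "__main__":
    print(f"week 1 unit value: {unit_value(week1):.3f}")
    print(f"week 2 unit value: {unit_value(week2):.3f}")
```

An agency that receives only the averaged figure cannot tell a change in the purchase mix from a change in the posted price, which is why the transaction-level detail matters for price measurement.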

There have been some discussions on how to possibly address these issues (see, e.g., Struijs and Daas, 2014). It may be possible to create weights to reduce coverage bias based on the information that users provide in their social media profiles, which can include useful information about age, gender, or social group. However, considerably more research is needed in this area.
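One standard way to build such weights is post-stratification, sketched below. This is an illustrative assumption, not the specific method proposed in the work cited above: it presumes that each user can be assigned to a group (here, an age band read from a profile) and that the true population share of each group is known from an external source. The function names and all numbers are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Weight for each group = population share / sample share."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    weights = {}
    for group, share in population_shares.items():
        if counts[group] == 0:
            # A group absent from the sample cannot be reweighted.
            raise ValueError(f"no sample members in group {group!r}")
        weights[group] = share / (counts[group] / n)
    return weights


def weighted_mean(values, groups, weights):
    """Mean of values, with each unit weighted by its group's weight."""
    numerator = sum(weights[g] * v for v, g in zip(values, groups))
    denominator = sum(weights[g] for g in groups)
    return numerator / denominator


# Hypothetical sample: younger users are overrepresented (8 of 10),
# while the population is assumed to be split evenly between the groups.
groups = ["18-34"] * 8 + ["65+"] * 2
values = [1] * 8 + [0] * 2  # e.g., 1 = user mentions a given topic
shares = {"18-34": 0.5, "65+": 0.5}

if __name__ == "__main__":
    w = poststratification_weights(groups, shares)
    print("weights:", w)
    print("unweighted mean:", sum(values) / len(values))
    print("weighted mean:", weighted_mean(values, groups, w))
```

Weights of this kind correct only for the characteristics used to form the groups. They do not help when self-reported profile information is missing or inaccurate, or when users differ from nonusers within a group.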

CONCLUSION 4-2 The data from private-sector sources vary in their fitness for use in national statistics. Systematic research is necessary to evaluate the quality, stability, and reliability of data from each of these alternative data sources currently held by private entities for their intended use.

We discuss fitness for use further in Chapter 6, and we will discuss quality frameworks evaluating fitness for use in our second report. Because of the many sources and potential challenges with private-sector data, as well as the limited resources of the federal statistical agencies, it is necessary that this research be conducted as efficiently and effectively as possible. We note in Chapter 2 that the Interagency Council on Statistical Policy assists OMB in coordinating the federal statistical system. Since this council is composed of the heads of the principal statistical agencies, it is the logical entity, along with OMB, to oversee the development and implementation of such a research agenda by the agencies in a collaborative and complementary manner.

RECOMMENDATION 4-2 The Federal Interagency Council on Statistical Policy should urge the study of private-sector data and evaluate both their potential to enhance the quality of statistical products and the risks of their use. Federal statistical agencies should provide annual public reports of these activities.

We provide some additional discussion of data quality issues for alternative data sources in Chapter 6, and the panel will address this issue more deeply in its second report. Although evaluation of specific data sources is best done at the program level, there is a need across the decentralized federal statistical system for greater leveraging of limited resources for research and development of new methods and assessing the quality of data from new sources. Sustainable access to these data sources is fundamental for federal statistical agencies to make progress in evaluating the quality and usefulness of these data sources for federal statistics. Hence, we end this chapter with key questions facing the future use of high-dimensional data for federal statistics:

Can the United States develop a sustainable mechanism and environment to permit federal statistical agency access to private-sector high-dimensional data for statistical purposes?
If such access is sustained, how can the quality of these data sources be evaluated for the benefit of all statistical uses of the data?
If such access is sustained, how can federal statistical agencies detect changes in the data created by the data holders, which may affect statistical estimates?
