The analytic value of the ever-growing volume of data created by and captured from digital sources—from Internet-based storage and computing services to sensors scattered across cities and smart devices operated by millions of people—is now widely acknowledged. While alternative “big data” methods are being enthusiastically pursued, sustained work on the statistical validity of analyses based on them (e.g., the representativeness of samples in voluntary Internet-based surveys) is not well established. For this reason, household surveys remain, at this time, the primary means of compiling information about civic engagement, social cohesion, and other dimensions of social capital.
Nonetheless, the changing data-creation landscape holds promise. There are at least four reasons for considering alternatives to traditional survey methods:
- The field of survey research is at a crossroads, facing numerous challenges to the viability of telephone and other conventional-mode surveys, as well as to the validity of their findings. The Current Population Survey (CPS) is conducted through a combination of in-person and telephone interviews. This partially insulates it from these viability concerns, since government face-to-face surveys have thus far maintained very high response rates. Nonetheless, this approach is extremely expensive, raising concerns about whether it is sustainable in the long run. The increasing cost of government surveys
is also creating greater competition for the limited space available on questionnaires.
- Alternative survey modalities—most notably online instruments—have emerged, some with promising results.1 Although the underlying sample biases are not adequately known and require much more study, as do techniques for interpreting results, the knowledge base about this modality will grow rapidly. Even if these surveys do not enjoy the same levels of transparency and generalizability as traditional government surveys, their lower cost and more timely results may increasingly make them the information vehicle of choice for many uses.
- The emergence of big data that can be captured from a variety of (largely though not exclusively) digital information and communication technologies, coupled with advances in analytic techniques from computational science, raises the possibility of developing less obtrusive indicators of citizens’ civic engagement and social cohesion behaviors, and perhaps even their opinions. And, as noted by Einav and Levin (2013, p. 3): “[T]he recording of individual behavior does not stop with the internet: text messaging, cell phones and geo-locations, scanner data, employment records, and electronic health records are all part of the data footprint that we now leave behind us.” Big data—whether drawn from Web searches, people’s browsing habits, social media, sensor signals, locational data from smartphones, road use data from “smart passes,” or genomic information and surveillance videos—has the potential to revolutionize measurement.
- The demand for small-area estimates—that is, for geographic areas or population domains for which the sample size is inadequate to provide precise (direct) estimates—and for more timely data will continue to increase. As detailed above, it will not be possible for traditional federal survey instruments alone to meet this need. There is already an increased emphasis on modeled estimates to meet the demand for small-area data. Such demands will increase the pressure to use both massive datasets and alternative survey vehicles.
In this context, it is important to think about substitutes (and complements) for government surveys that could generate valuable information. As we discuss in Chapter 4, those surveys have major advantages—methodological transparency and generalizability—which, together with confidentiality and privacy protection, make them credible. However, their costs, and demands for more timely information, motivate consideration of alternative or complementary data sources.

1At this point in online survey development, sample validity requires a closed-population sample, such as the workforce of a corporation, in which it is known that all potential respondents have Internet access. For a thorough discussion of the characteristics of Web surveys and their capacity to collect accurate data, see Tourangeau et al. (2013).
In addition, as we argue above, national-level surveys do not always represent the most efficient way to gather data. For measuring social cohesion—important for purposes such as anticipating a city or community’s resilience to weather or other natural disasters, or providing an early warning system for social breakdown and civil unrest—the CPS supplement cannot capture its multidimensional character at community levels of aggregation; and, in many cases, the data are not timely or frequent enough to capture the trends of interest. Appropriate data collection in the areas of social cohesion and connectedness will increasingly rely on nonsurvey methods, many of which may be beyond the scope of current government programs. Therefore, in considering the measurement of social capital, it is important to consider the full range of options, both within and beyond the federal statistical system. The rest of this chapter discusses data linking and nonsurvey data collection methods, and recommendations for how to exploit them.
The simultaneous demands to lower costs and provide more integrated information suggest that the U.S. federal statistical system should substantially improve its ability to link information among federal surveys and with administrative information. The potential to link across survey sources and to draw from administrative and other kinds of records is a clear strategy for analyses that require a wide range of variables or for situations in which data are needed for targeted purposes.2 The capacity to link across surveys and to administrative records can add a broader set of demographic and socioeconomic variables to analyses and also carries the potential to improve the accuracy of the survey data fields.
“Data linkage” refers to a set of merging methods that vary with, and are motivated by, different analytic objectives. First, there is sometimes a need to augment the data obtained from a survey by adding information available for a respondent from administrative record sources. Individual-level records on items ranging from income and demographics to place of residence, program eligibility and participation, and employment reside in administrative sources (e.g., tax and social security records), while other variables—such as many of the elements of social capital represented in Table 2-1 (in Chapter 2)—are more commonly available in the form of direct survey responses. For studying community resilience, neighborhood engagement, or other aspects of social capital, it is easy to envision the value of being able to link survey data with localized information.

2The Health and Retirement Study is a good example of the latter; it is a survey with linkages to the administrative records of the Social Security Administration that is designed to facilitate research on health and pension policy questions (see Gustman and Steinmeier, 1999).
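The record-augmentation idea described above can be sketched in a few lines of code. This is a stylized illustration only: the identifier, field names, and values below are invented, and real linkage programs involve far more careful identifier validation and confidentiality protection.

```python
# Sketch of record-level data linkage: augmenting survey responses with
# administrative variables via a shared (hypothetical) person identifier.
# All field names and values are illustrative, not actual survey or tax fields.

def link_records(survey_rows, admin_rows, key="person_id"):
    """Left-join administrative fields onto survey rows; survey rows
    without an administrative match are kept unchanged."""
    admin_index = {row[key]: row for row in admin_rows}
    linked = []
    for row in survey_rows:
        merged = dict(row)  # copy so the survey data are not mutated
        match = admin_index.get(row[key])
        if match is not None:
            # add administrative variables, but never overwrite survey responses
            for field, value in match.items():
                merged.setdefault(field, value)
        linked.append(merged)
    return linked

survey = [
    {"person_id": 1, "volunteers": True},
    {"person_id": 2, "volunteers": False},
]
admin = [{"person_id": 1, "reported_income": 52000}]

result = link_records(survey, admin)
# person 1 gains the administrative income field; person 2 is unmatched
```

The same pattern extends to linking at the group level (e.g., attaching county- or tract-level administrative aggregates to each respondent) when individual identifiers cannot be shared.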
A second reason for data linking is to reduce the variance of small-area estimates. It is commonplace for federal surveys to have insufficient sample sizes to support local-level estimates that would be useful to policy communities; this has made small-area modeling crucial. Such estimation methods include generalized linear mixed models (e.g., Fay and Herriot, 1979) and hierarchical models (e.g., Lindley and Smith, 1972). Data linking comes into play in methods that use linear combinations of direct survey estimates and model-based estimates, in which the dependent variable is a function of survey responses and the predictors come from administrative sources.
The CPS’s sample sizes allow accurate estimates of labor force characteristics and of the employment and earnings status of the population at the national and state levels. With the exception of some large metropolitan areas, and except where data can be pooled across years, any geographic entity below the state level—such as a congressional district—would be considered a small area. In research on civic engagement, small areas may be carved out along a number of dimensions—geographic (e.g., a congressional district), political affiliation (e.g., Republican, Democrat, or Independent), demographic (e.g., Latino voters, young nonvoters), or some intersection of these.
One example of the use of hierarchical models allows CPS data to be augmented with census administrative records to indirectly estimate the number of school-age children living below the poverty threshold at the school district level; the allocation of more than $15 billion in federal funds is based on such model-based indirect estimators.3 Similarly, using ACS data, Malec (2005) applied multivariate modeling methods incorporating data from outside the small area of interest and “without making restrictive assumptions about within small area variance” to produce more efficient estimates of poverty and housing unit characteristics than could be made directly.4

3Gershunskaya (quoted in National Research Council, 2013a) differentiated between direct and indirect estimates:

Direct estimates use the values on the variable of interest from only the sample units for the domain and time period of interest. They are usually unbiased or nearly so but, due to limited sample size, can be unreliable. Indirect estimates “borrow strength” outside the domain or time period (or both) of interest and so are based on assumptions, either implicitly or explicitly. As a result of their use of external information, indirect estimates can have smaller variances than direct estimates, but they can be biased if the assumptions on which they are based are not valid. The objective therefore is to try to find an estimator with substantially reduced variance but with only slightly increased bias. Indirect methods rely on sets of assumptions regarding how information from outside the domain (small area) of interest relates to that within it.
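The trade-off between direct and indirect estimates can be made concrete with a small sketch of the composite (“shrinkage”) estimator underlying Fay-Herriot-style models: each area’s estimate is a variance-weighted blend of its noisy direct survey estimate and a regression prediction built from administrative covariates. This is a stylized illustration of the idea, not any agency’s production method, and all numbers are invented.

```python
# Composite estimator sketch: blend a direct survey estimate with a
# model-based (synthetic) estimate, weighting by their relative precision.

def composite_estimate(direct, sampling_var, synthetic, model_var):
    """Variance-weighted blend of a direct and a synthetic estimate.

    gamma -> 1 when the direct estimate is precise (small sampling_var);
    gamma -> 0 when the area's survey sample is too thin to trust.
    """
    gamma = model_var / (model_var + sampling_var)
    return gamma * direct + (1.0 - gamma) * synthetic

# Area with a large sample: the direct estimate dominates.
precise = composite_estimate(direct=0.20, sampling_var=0.0001,
                             synthetic=0.30, model_var=0.01)

# Area with a tiny sample: the estimate shrinks toward the model prediction.
noisy = composite_estimate(direct=0.20, sampling_var=0.09,
                           synthetic=0.30, model_var=0.01)
```

The behavior matches the Gershunskaya quote above: borrowing strength from outside the small area reduces variance at the cost of some bias toward the model, with the weight chosen to balance the two.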
The value added from data linking thus stems from two factors. First, national surveys, such as the CPS supplements, include a limited number of variables for studying specific topics. Linking datasets broadens the set of covariates that may be correlated with outcome measures. Combining individual-level survey information with data from other sources can provide contextual information about counties, districts, and states that may serve as useful explanatory variables. Second, and very relevant to the CPS Civic Engagement Supplement, the sample sizes associated with national-level population surveys are not typically adequate to support local-area analyses.
CONCLUSION 8: The Current Population Survey (CPS) cannot provide all the variables and the level of geographic detail necessary for research on social capital, social cohesion, and civic engagement. It is therefore essential that design strategies for the CPS be conceptualized with the presumption that this data source will need to be linked (even if only at the group level) to other data from the federal government and beyond. The national-level data collected on a regular basis should complement other sources, both government and nongovernment, for use by researchers. Research data centers operated by the federal statistical agencies can create opportunities for these kinds of coordinated efforts, which must comply with respondent confidentiality and privacy requirements.
Going forward, much of the value of the federal statistical apparatus will depend on whether it can expand its capacity to link data sources—survey and nonsurvey, national and local. The Census Bureau, for one, already has a significant capacity to link data sources; of course, the resulting research data products are stripped of individual identifiers and can typically only be accessed through secure means. Much of this work is being done by researchers using datasets available on a restricted-access basis in the Census Bureau’s Research Data Centers.
Some of the most innovative programs have taken place on the business side rather than the demographic side. The Longitudinal Employer–Household Dynamics (LEHD) Program, which could serve as something of a model for data coordination and research on social capital, combines data from state and federal sources to create a longitudinal linked employer-employee dataset. LEHD data have been used to analyze commuting patterns, to support transportation planning, and to study worker turnover, pensions, low-wage work, and worker productivity. One could envision similar linking to advance research in the area of social capital, although such work would be both technically difficult and resource intensive. Nevertheless, the panel strongly encourages continued work by federal agencies in this area.

4Alexander (1998) recommended that, for the ACS, direct (non-model-based) annual estimates be limited to areas with populations of at least 65,000; estimates for areas with smaller populations can be made by pooling data across years—for areas as small as 15,000 when data from 5 years are used. Because of the need for these smaller-area estimates, the Census Bureau has actively supported research on indirect modeling methods.
In addition to the technical difficulties and resources needed, institutional and legal issues present significant challenges to data linking. The capacity of the federal statistical system to make greater and more intensive use of its flagship surveys will depend in part on the extent to which a decentralized system can collaborate. While progress has been made, much remains to be done.5 Respondents’ willingness to allow linkages is also a constraint. In the U.S. system, social security numbers (SSNs) are the most widely used individual identifiers, and declining SSN item response is a growing challenge for linking data sources.6
Public reticence, declining response rates, the costs of traditional survey methods, and the emergence of massive data generation by new information and communication technologies are shifting the landscape of public opinion and behavioral research. It is, however, premature to transition away from traditional survey-based empirical approaches. Although online surveys are increasingly common in academic scholarship, major methodological questions about their quality remain unresolved, not least of which is the representativeness of the sample of people who respond. Web scraping to exploit unstructured data for social science research is also promising, but much remains to be understood about its accuracy and reliability. A recent Pew Research Center study (Mitchell and Hitlin, 2013) found, for example, that Twitter reaction to political events was often at odds with public opinion as measured by traditional surveys. Policy making that relies on commercial big data sources—assuming they can be made available and their methods made transparent—may still systematically underrepresent large segments of the population. To date, there has not been sufficient high-quality survey research on differential access among populations to make the necessary corrections. As big data sources become increasingly relied upon, it will be difficult to understand how our knowledge may (or may not) be skewed.

5For example, the 2002 Confidential Information Protection and Statistical Efficiency Act allowed greater data sharing among statistical agencies, but strong restrictions continue to apply to statistical uses of tax information.

6McNabb et al. (2009) described how the problem has affected two SSN linkage programs:

Respondents refusing to provide SSNs to SIPP [Survey of Income and Program Participation] interviewers increased from 12 percent to 35 percent between the 1996 and 2004 panels. Those refusing to provide SSNs in CPS increased from approximately 10 percent in 1994 to almost 23 percent by 2003…missing SSNs meant smaller and smaller proportions of the sample could be matched to administrative records. Additionally, differing rates of SSN nonresponse could instill potential bias into subsequent analyses.

The Census Bureau has responded to this growing item nonresponse problem by reducing the need to rely on direct SSN survey field entries. Under a new methodology, a respondent is informed that the survey data will be matched with other federal data for research purposes. Unless the respondent opts out, application information from SSA’s Numident file may be combined with address records from the IRS, SSA, and other sources to determine the respondent’s correct SSN. Using this methodology, match rates increased from about 60 percent in 2001 to 79 percent in 2004 (for details, see http://www.ssa.gov/policy/docs/ssb/v69n1/v69n1p75.html [February 2014]).
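Returning briefly to the linkage challenge noted above: when a shared identifier such as the SSN is unavailable, matching typically falls back on combinations of quasi-identifiers. The sketch below is a deliberately simplified deterministic illustration; all field names and records are invented, and real systems add address histories, probabilistic match weights, and clerical review.

```python
# Deterministic record matching on quasi-identifiers (name + birth date),
# a simplified stand-in for identifier-free linkage.

def match_key(record):
    """Normalize the fields used for matching so trivial formatting
    differences (case, stray spaces) do not block a link."""
    return (record["name"].strip().lower(), record["birth_date"])

def match_rate(survey_rows, admin_rows):
    """Fraction of survey records that find exactly one admin match."""
    admin_keys = {}
    for row in admin_rows:
        key = match_key(row)
        admin_keys[key] = admin_keys.get(key, 0) + 1
    matched = sum(
        1 for row in survey_rows if admin_keys.get(match_key(row)) == 1
    )
    return matched / len(survey_rows)

survey = [
    {"name": "Ada Lovelace ", "birth_date": "1815-12-10"},
    {"name": "Unknown Person", "birth_date": "1900-01-01"},
]
admin = [{"name": "ada lovelace", "birth_date": "1815-12-10"}]

rate = match_rate(survey, admin)  # 1 of 2 survey records matches
```

Requiring exactly one administrative match, as above, is one simple guard against false links when quasi-identifiers are not unique.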
Stiglitz et al. (2009, pp. 184-185) weighed in on the modern-day role of surveys in producing statistics on one dimension of social capital:
[R]eliable indicators can only be constructed through survey data. Only personal reports allow measuring the many and evolving forms of social connectedness. In recent years a number of statistical offices (in the United Kingdom, Australia, Canada, Ireland, the Netherlands, and most recently, the United States) have begun to gather and report survey-based measures of various forms of social connections. As an example of these endeavors, Appendix 2.2 presents the list of the questions included (since early 2008) in an annual Supplement to the November US Current Population Survey, which has traditionally probed respondents about voting in national elections. These questions have been selected after extensive vetting by the Census Bureau and the Bureau of Labor Statistics for reliability, intelligibility, and inoffensiveness; they cover several manifestations of civic and political engagement, as well as other forms of social connections (such as number of friends, or frequency of contacts and favors done for neighbors).
For the short run, this panel agrees. During the next several years (we will not attempt to predict how many), the current survey-centric approach—which provides a known inferential framework and for which problems of data accuracy, quality, representativeness, and confidentiality have largely been solved or limited—will continue to produce the most reliable and scientifically valid estimates.
But the improving ability to link data and the increasing spread of social media and other technologies that produce unstructured digital data are leading to significant changes in the way research is conducted.
A study of the long-term effects of 9/11 on political behavior is suggestive of the methodological transition that is under way: using only nonsurvey data sources—specifically, lists of all registered voters in the state of New York and digital obituaries to identify 9/11 victims—Hersh (2013) determined that “family members and residential neighbors of victims have become, and have stayed, significantly more active in politics in the last 12 years, and they have become more Republican.” The author noted that the methods of analysis used in this research would not have been possible without recent improvements in computational capacity and in the quality of public records.
The Kasinitz et al. (2008) study of immigrants in New York City and the Project on Human Development in Chicago Neighborhoods (Sampson and Graif, 2002, 2009) used detailed, multimode datasets, of which surveys were only one component, to capture the complexities of social capital, much of which operates most intensively through community-level social processes. These studies were designed to generate insights about the links among neighborhood characteristics, social organizations, community-level phenomena, social functioning, and quality of life. They utilized a wide range of methodologies, ranging from experimental designs capable of taking into account spatial and temporal dynamics to systematic observational approaches that benchmark data on neighborhood social processes. They also required the empirical study of communities for the better part of a decade. Only then could a comprehensive picture emerge of the processes whereby “neighborhoods influence a remarkably wide variety of social phenomena, including crime, health, civic engagement, home foreclosures, teen births, altruism, leadership networks, and immigration” (Sampson, 2012a, Foreword). Sampson (2013) described the “science of how cities and neighborhoods work”:
…using Chicago as an urban laboratory…My research team and I followed more than 6000 families wherever they moved, as well as studying the city’s neighbourhoods themselves. We surveyed more than 10,000 residents, watched video footage we took of thousands of city streets, assessed the social networks of community leaders and gathered data on collective civic events such as fundraising for schools and blood donation.…[lost letter and other experimental data were] combined with records on crime, violence, health, community organisations and population characteristics over 40 years.…Our research is part of a larger effort to develop tools to measure and evaluate the social-ecological infrastructure of cities, known as ‘ecometrics.’
The progress made with these in-depth studies helps in the development of questions for broader population surveys (as it has for the Neighborhood Capital Module of the American Housing Survey, discussed in Chapter 4). As we note throughout, however, without costly increases in sample sizes, neighborhood-level and subgroup-specific phenomena cannot be measured with data from a national survey.
Some dimensions of social science measurement (including some elements of social capital, which have both individual- and community-level components) are especially amenable to methods other than those developed by a statistical system built on 20th century data and methods. Indeed, as pointed out by Hampton et al. (2012, p. 19) as part of the Pew Research Center’s Internet & American Life Project:7
Some information on the use of social networking sites is extremely difficult or impossible to collect as part of a phone survey. For example, information on the structure of people’s online friendship networks, such as the number of friends of friends, or how densely connected are a person’s friends (i.e., if a person’s friends have all friended each other). Such measures, while difficult to collect in a survey, are important in understanding how use of Facebook is related to different social outcomes. For example, measures such as social cohesion (density) in people’s personal network of relations is a strong predictor of things like trust and social support—the ability of people to get support when they are in need or seeking help making decisions.
Social media and Web search technologies seem particularly promising in generating data capable of underpinning social science research on people’s networking and communications behaviors.
How to exploit data generated from social media and other digital sources to intuit people’s opinions, attitudes, and actions is an emerging topic in this still nascent area of research—much of which is being done in computer science departments. Ungar and Schwartz (2013) used what they called differential language analysis of social media data sources to measure what word use reveals about people’s psychological and emotional states, and subjective well-being. DiGrazia et al. (2013) demonstrated a social media-based alternative to polls and surveys for gauging public attitudes and monitoring political races. Google’s data correlation mining tool has been used to estimate unemployment claims filed (Wolfers, 2011) and corruption (Saiz and Simonsohn, 2007). Twitter data have been used to study word use associated with different circumstances such as job search and to anticipate trends in unemployment claims.8
7The Project fielded a nationally representative phone survey about the social and civic lives of social network site users. For the detailed findings, see Hampton et al. (2011).
Using longitudinal data from a representative sample of Internet users in Norway, Brandtzaeg (2012, p. 467) found significantly higher scores among social network site users than among nonusers on three of four social capital dimensions: “face-to-face interactions, number of acquaintances, and bridging capital…However, SNS [social network site]-users, and in particular males, reported more loneliness than nonusers.”9 Facebook data have also been used to demonstrate the political diversity of friend groups and the collective influence of weak ties to the media (Bakshy, 2012); and “web scrapes” have been used to show that Internet political groups and online news consumption are less polarized than many face-to-face interactions (Gentzkow and Shapiro, 2011) and perhaps less segregated than initially thought (e.g., Sunstein, 2001).

Beyond social media, data generated by individuals’ shopping and other online activities and by automated payroll systems have created private-sector alternatives (or, in some cases, complements) to key economic indicators. These include alternatives to the Consumer Price Index (CPI), such as the Web-based MIT Billion Prices Index,10 and to official employment statistics (e.g., the ADP National Employment Report).11 Premise, a new company, has begun constructing real-time price indexes based on Web searches of online retailers and on images of items on store shelves captured by individuals’ mobile phone cameras. The index reportedly picked up a price spike in onions in India 3 weeks before it sparked rioting.12

It is important to note that official statistics already draw on a variety of private-sector data sources.13 This use is not limited to economic indicators. For example, Google Flu Trends estimates the prevalence of influenza from flu-related Internet search queries.14 Such alternatives provide more timely data and data for smaller areas, though whether, in this case, they meet the quality standards of traditional data from the Centers for Disease Control and Prevention is not yet established. The 2012-2013 flu season, when Google data drastically overestimated peak flu levels, provided a cautionary example (Butler, 2013).15 Similarly, for gaining insights into aspects of social cohesion and connectedness, online and cell phone networking patterns and other unobtrusive measures, such as credit card use, may yield new attitudinal and behavioral information through the digital footprints people leave as they search, swipe, and click their way through the day.

8Organically generated digital data have also been used for tagging crime hotspots in communities; Facebook data have been word mined to generate well-being measures; a “Mappiness” real-time phone app has been used for well-being monitoring in the United Kingdom; and on and on. Using experimental studies and field research, Cook et al. (2009) examined the relationship between trust in anonymous online exchanges (“eTrust”) and cooperation between people. Einav and Levin (2013) explored more generally how “big data” will transform business, government, and other aspects of the economy.

9This article also provided an overview of studies on the effects of Internet use and social media use on various dimensions of social capital; the author’s basic conceptualization of social capital is formulated from Coleman (1988), Ellison et al. (2007), and Putnam (2000), much of it organized in terms of bridging and bonding social capital.

12For information, see http://money.cnn.com/2013/10/16/news/economy/real-timeinflation/ [February 2014].

13Horrigan (2013) identified current and potential uses by the Bureau of Labor Statistics of a number of nonsurvey and administrative (public and private) data sources in their price index and other programs (http://magazine.amstat.org/blog/2013/01/01/sci-policyjan2013/) [February 2014].
As alternative data sources are exploited, it is critical to understand the benefits and limitations of the corresponding estimates and the relationships among them. For example, users may choose traditional or nontraditional estimates of consumer prices based on their fitness for use in a given situation. However, such comparisons and choices can be made only if the properties of each estimator are well known. In the social sciences, where important policy and research findings have been produced largely from survey data, an abrupt migration to nonsurvey data could be quite damaging if the basic work needed to understand the new data is not done with a rigor approaching that earned through decades of survey methodology research.
Exploiting alternative data sources will affect the practices of federal statistical agencies. The breadth of data that statistical agencies attempt to collect themselves may narrow, while the content of what they process and analyze from sources beyond their own surveys and administrative records expands. Even for the subset of data collections that the federal statistical agencies are charged with overseeing, traditional survey methods will not always be the most cost-effective option; and the CPS and other population surveys will not always be the right vehicles for measuring public opinion, sentiment, or behavior. These changes will involve new relationships between the federal statistical system and the private sector, and the terms and conditions of these relationships are still unknown and will evolve over time.
While clearly promising, enough questions remain to warrant extreme caution as new methods are adopted and new resources tapped: To what extent does the utility of alternative data collection and analysis techniques vary by domain or topic? Are populations of interest well enough represented by those accounting for most Internet communications and transactions (e.g., the social connections of elderly people)? How can and should the ease and comprehensiveness of digital data collection be balanced with privacy concerns? Where algorithms are constantly being tweaked, how comparable are the data over time? And can “official statistics” be legitimately generated from private-sector data?

15This episode highlights the important point that techniques based on mining Web data and social media are, at this point, complements to, not substitutes for, traditional epidemiological surveillance. Butler (2013), making this point, noted that the problems with the algorithm may have been linked to widespread media coverage of the severe flu season and to social media, which spread news of the flu more quickly than the virus itself; apparently, the context of the word searches was not adequately taken into account in the analysis for the 2013-2014 season.
Active mechanisms are needed to keep the work necessary to understand and exploit emerging data sources in the forefront of agencies’ thinking and planning. As data increasingly derive from private-sector entities, the public will have less control over content and less influence over how data are used. Furthermore, if the statistical agencies are marginalized in the changing landscape, the leading institutional mechanism for ensuring quality control will be lost. The survey edifice rests on representativeness, coverage, privacy, and other fundamental attributes that are still needed to guide social science data collection and analysis methods. The federal statistical agencies can play an instrumental role in figuring out how to embrace and implement new data and new data strategies without abandoning scientific principles. This will require developing new approaches for linking data from a variety of sources and carrying out experiments to calibrate how answers differ under survey versus alternative data scenarios.
As described above (in the discussion about data linking), confidentiality, privacy, and transparency will also be major issues affecting the use of big data. The statistical agencies have extensive experience managing the protection of data at geographic levels smaller than cities (such as census tracts and block groups) so that those data can be accessed by the public and by researchers. Researchers of social capital need this kind of data detail, but there are legal, institutional, and administrative hurdles to obtaining it, as is the case for many surveys with geographic identifiers. The federal statistical agencies play a pivotal role in developing solutions to confidentiality issues that arise. They have long been concerned with respecting the privacy of citizens, ensuring the confidentiality of data collected about them, and developing a sound conceptual basis for these activities. In a study undertaken at the request of a group of federal statistical agencies, the National Research Council (1993, p. 3) developed what it called an ethos of information, which consisted of three principles: democratic accountability, constitutional empowerment, and individual autonomy:16
Functionally, democratic accountability recognizes the responsibilities of those who serve on behalf of others. It requires that the public have access to comprehensive information on the effectiveness of government policies. Government statistical agencies play a pivotal role in
16The title of the report, Private Lives and Public Policies, Confidentiality and Accessibility of Government Statistics, is indicative of its content.
ensuring democratic accountability by obtaining, protecting, and disseminating the data that allow the accurate assessment of the influence of government policies on the public’s well-being. Furthermore, they themselves are accountable to the public for two key functions in this process: (1) protecting the interests of data subjects through procedures that ensure appropriate standards of privacy and confidentiality and (2) facilitating the responsible dissemination of data to users.
Constitutional empowerment refers to the capability of citizens to make informed decisions about political, economic, and social questions. In the United States, constitutional theory emphasizes that ultimate power should reside in the people.…Constitutional practice emphasizes restraints on executive excess and broad access to the political process through the direct election of representatives as well as through separation and balance of power.
Individual autonomy refers to the capacity of members of society to function as individuals, uncoerced and with privacy. Protection of individual autonomy is a fundamental attribute of a democracy. If excessive surveillance is used to build data bases, if data are unwittingly dispersed, or if those who capture data for administrative purposes make that information available in personally identifiable form, individual autonomy is compromised.
These principles have stood the test of time. Federal statistical agencies’ practices are still based on the principle of individual autonomy—that sociodemographic information is the property of the individual.17 Because the information is owned by the individual, the government enters into a contract with the respondent promising to safeguard it (that is, to keep it confidential). Prior to 2002, the legislative authority for maintaining the confidentiality of identifiable information collected for statistical purposes was not uniform across statistical agencies. In 2002, the Confidential Information Protection and Statistical Efficiency Act (CIPSEA)18 was enacted to remedy this problem.
CIPSEA, which contains two key parts, provides a uniform standard of privacy and confidentiality for statistical agencies. The purposes of the first part are to:
- ensure that information supplied by individuals or organizations to an agency for statistical purposes under a pledge of confidentiality is used exclusively for statistical purposes;
17This principle is applicable even when a survey or census is declared to be mandatory, that is, when the public good for supplying the information is deemed to be sufficiently important to require participation.
18Confidential Information Protection and Statistical Efficiency Act of 2002, Title V of the E-Government Act of 2002 (Pub. L. 107-347).
- ensure that individuals or organizations who supply information under a pledge of confidentiality will not have that information disclosed in identifiable form to anyone not authorized in the legislation; and
- safeguard the confidentiality of individually identifiable information acquired under a pledge of confidentiality for statistical purposes by controlling access to, and uses made of, such information.
The second part of the act promotes statistical efficiency through limited sharing of business data among three designated statistical agencies, the Census Bureau (Census), the Bureau of Economic Analysis, and the Bureau of Labor Statistics.19
The uniform standards of privacy and confidentiality provided under CIPSEA were a major step forward; the federal government, particularly the Office of Management and Budget, deserves a great deal of credit for setting these rules. Until recently, the act’s reach covered much of the necessary ground, in that federal, state, and local governments collected most of the identifiable data about individuals and controlled the rules about privacy and confidentiality. However, with the emergence of big data (for example, social media giants such as Facebook, Twitter, and Instagram), the situation has changed dramatically.20 Now, far more data about individuals (and far more detailed data, including digital photos and videos) is collected and controlled by corporations than by governments. Legislation such as CIPSEA does not apply to these corporate institutions, which make their own rules about privacy and confidentiality. Privately controlled digital data sources are further differentiated from traditional statistical operations, such as the Current Population Survey, by the velocity, volume, and variety of the information generated. These trends can be expected to continue, complicating the development of privacy and confidentiality standards, both within the private sector and between private and public entities, that would allow integration of traditional and emerging big-data-based statistical sources. A recent report by the White House Office of Science and Technology Policy (OSTP) on the issues surrounding big data described the problems and the potential solutions in the following way:21
19See National Research Council (2007) for a detailed description of how CIPSEA legislation has contributed to data sharing among statistical agencies in the production of business statistics.
20In the United States alone, Facebook, Twitter, and Instagram have about 200 million, 50 million, and 35 million users, respectively (estimates vary depending on user-activity level specified, estimates of duplicate or bogus accounts, etc.), and the United States represents only a fraction of worldwide users of social media sites.
Big data technologies are driving enormous innovation while raising novel privacy implications that extend far beyond the present focus on online advertising. These implications make urgent a broader national examination of the future of privacy protections, including the Administration’s Consumer Privacy Bill of Rights, released in 2012. It will be especially important to re-examine the traditional notice and consent framework that focuses on obtaining user permission prior to collecting data. While notice and consent remains fundamental in many contexts, it is now necessary to examine whether a greater focus on how data is used and reused would be a more productive basis for managing privacy rights in a big data environment. It may be that creating mechanisms for individuals to participate in the use and distribution of his or her information after it is collected is actually a better and more empowering way to allow people to access the benefits that derive from their information. Privacy protections must also evolve in a way that accommodates the social good that can come of big data use.
To deal with these issues, the OSTP report recommends, inter alia, advancing a consumer privacy bill of rights. Such a bill of rights would impose reasonable time periods for notification, minimize interference with law enforcement investigations, and potentially prioritize notification about large, damaging incidents over less significant incidents. The report asserted (p. 62):
Consumers deserve more transparency about how their data is shared beyond the entities with which they do business directly, including “third-party” data collectors. This means ensuring that consumers are meaningfully aware of the spectrum of information collection and reuse as the number of firms that are involved in mediating their consumer experience or collecting information from them multiplies.
The statistical agencies are of course aware of data developments beyond the government sphere and have been working to incorporate changes into their programs. Nonetheless, the magnitude of the changes ahead warrants closer involvement by the federal statistical system in these new data developments. And, as indicated above, OSTP has recognized the opportunities created by emerging data sources and technologies; noting that the federal government is underinvesting in these opportunities, it has announced a “big data” research and development initiative.22 The initiative is designed to (p. 1): “advance state-of-the-art core technologies needed to collect, store, preserve, manage, analyze, and share huge quantities of data; harness these technologies to accelerate the pace of discovery in science and engineering, strengthen our national security, and transform teaching and learning; and expand the workforce needed to develop and use Big Data technologies.”
21Big Data: Seizing Opportunities, Preserving Values, Executive Office of the President, The White House, May 1, 2014.
A number of cities are also investing in “urban informatics.” New York City, for example, recently created an Office of Policy and Strategic Planning to house the city’s data-centered innovations, “conducting wide-ranging data mining and analysis to improve City services, enhance transparency and more effectively solve complex municipal issues.”23 Similarly, a National Science Foundation initiative focused on new research efforts to extract knowledge and insights from large and complex collections of digital data calls, among other things, for “Encouraging research universities to develop interdisciplinary graduate programs to prepare the next generation of data scientists and engineers.”24
While big data studies are often housed in university information technology departments, the statistical agencies, as the producers of official statistics, have a complementary role to play alongside computer scientists, for example, by managing data quality and addressing problems such as population representativeness.25 Developing methods for exploiting and integrating nontraditional data for use in official and other statistics is part of that role, and mechanisms will be needed to allow the statistical agencies to provide such guidance. Giving the agencies the capacity to hire staff with the appropriate expertise is a necessary first step.
The preceding discussion emphasizes the burgeoning interest in using private-sector data, as well as social media and other Internet-originating sources. There is only a limited window within which to make scientific decisions on how best to transition from a data collection system dominated by the survey-based model to one in which this
22For details, see http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf [February 2014].
23For details, see http://www.nyc.gov/portal/site/nycgov/menuitem.c0935b9a57bb4ef3daf2f1c701c789a0/index.jsp?pageID=mayor_press_release&catID=1194&doc_name=http://www.nyc.gov/html/om/html/2012b/pr337-12.html&cc=unused1978&rc=1194&ndi=1 [February 2014].
24This was announced at the same time as the OSTP initiative; see footnote 18.
25The statistical agencies, and survey statisticians more generally, are well positioned to help solve problems associated with unstructured web data. For example, to learn more about representativeness, questions (such as, Do you use Twitter or Facebook? How often?) could be added to population surveys—designed solely for the purpose of better understanding the properties of other nondesigned data sources. This kind of work will allow modeling for integrating the data sources and making them more useful.
model must coexist with alternatives. Taking advantage of this moment requires action.
RECOMMENDATION 5: Under the leadership of the U.S. Office of Management and Budget, the federal statistical system should accelerate (1) research designed to understand the quality of statistics derived from alternative data—including those from social media, other Web-based and digital sources, and administrative records; (2) monitoring of data from a range of private and public sources that have potential to complement or supplement existing measures and surveys; and (3) investigation of methods to integrate public and private data into official statistical products.
An improved understanding of the potential of alternative means of data gathering is important and worthwhile, independent of its relevance to the study of social capital.
The question of whether the research in Recommendation 5 can be accomplished is not trivial. The federal statistical system is decentralized, comprising more than 50 entities that produce statistics, of which about 15 are generally considered the principal statistical agencies. One of the drawbacks of such a system is the lack of a critical mass for the purpose of major research undertakings. The Census Bureau and perhaps the Bureau of Labor Statistics are the only agencies with significant numbers of in-house research staff, although there is exceptional research capability throughout the statistical system. However, many research topics, such as the ones recommended above, transcend the needs of any one agency and require a more centralized approach if they are to be successfully pursued.
Research in statistical agencies is also inhibited by the government’s recruitment and retention policies. With rare exceptions, one must be a U.S. citizen to be employed by the federal government, but the research community is becoming more, not less, diverse with respect to citizenship. The ability to attract and retain first-class talent is also challenged by substantial pay differentials between the private and public sectors. For other activities, the federal government has developed entities called federally funded research and development centers (e.g., the RAND and MITRE corporations). The same could be done here.
RECOMMENDATION 6: In mapping the way forward for the integration and exploitation of new data sources, the U.S. Office of Management and Budget should coordinate the exploration of alternatives for developing the necessary research capability across the federal statistical system. Among the alternatives
are extensions of the current partnership between the Census Bureau and the National Science Foundation and the creation of a federally funded research and development center for this work.
Such a center for statistics would also allow research to focus on topics that are of vital and common interest to the entire statistical system rather than unique to one agency. The federal statistical system has already recognized the importance of alternative approaches to research through the partnership between the Census Bureau and NSF to create research nodes.
The measurement areas described in this report, covering dimensions of civic engagement, social cohesion, and social capital, represent only a portion of those that factor into social science, urban planning, public health, and other research areas. But the nature of the activities, attitudes, and behaviors encompassed, along with the multiple geographic levels of interest and the role of group and individual interactions, makes social capital an illuminating case study of the growing need for multimode data collection to underpin modern research and policy. The characteristics of social capital highlight the opportunities now emerging in the rapidly evolving data landscape. And, because social capital is a relatively new strand of social science inquiry, where methods are not as entrenched as elsewhere, it is a good testing ground for developing experimental measurement approaches that explore and exploit these circumstances. Because data users have fewer preconceived notions of what the underlying statistical framework (and official statistics in the area) should look like, measurement of social cohesion, civic engagement, and other dimensions of social capital is a good place for statistical agencies to begin developing cutting-edge techniques for blending traditional survey data with new, nonsurvey data into integrated measurement programs.