2012-2013 flu season, when Google data drastically overestimated the peak flu levels, provided a cautionary example (Butler, 2013).15 Similarly, for gaining insights into aspects of social cohesion and connectedness, online and cell phone networking patterns and other unobtrusive measures such as credit card use may yield new attitudinal and behavioral information through the digital footprints left by people as they search, swipe and click their way through the day.
As alternative data sources are exploited, it is critical to understand the benefits and limitations of the corresponding estimates and the relationship between them. For example, users may choose traditional or nontraditional estimates of consumer prices based on their fitness for use in a given situation. However, such comparisons and choices can be made only if the properties of each estimator are well known. In the social sciences, where important policy and research findings have been produced largely from survey data foundations, an abrupt migration to nonsurvey data could be quite damaging if the basic work needed to understand the new data is not done in a way that approaches the rigor earned through decades of survey methodology research.
Exploiting alternative data sources will affect the practices of federal statistical agencies. The breadth of data that statistical agencies attempt to collect themselves may narrow, while the content of what they process and analyze from sources beyond their own surveys and administrative records may expand. Even for the subset of data collections that the federal statistical agencies are charged with overseeing, traditional survey methods will not always be the most cost-effective option, and the CPS and other population surveys will not always be the right vehicles for measuring public opinion, sentiment, or behavior. These changes will involve new relationships between the federal statistical system and the private sector, and the terms and conditions of these relationships are still unknown and will evolve over time.
While these alternative sources are clearly promising, enough questions remain to warrant extreme caution as new methods are adopted and new resources tapped: To what extent does the utility of alternative data collection and analysis techniques vary by domain or topic? Are populations of interest well enough represented by those accounting for most Internet communications and transactions (e.g., social connections of elderly people)? How can and
15 This episode highlights the important point that techniques based on mining Web data and social media are, at this point, complements to, not substitutes for, traditional epidemiological surveillance. Butler (2013), making this point, noted that the problems with the algorithm may have been linked to widespread media coverage of the severe flu season and to social media, which spread the news of the flu more quickly than the virus itself spread; apparently, the context of the word searches was not adequately taken into account in the analysis for the 2012-2013 season.