A PARADIGM SHIFT IN COLLECTING AND ANALYZING DATA
The National Center for Science and Engineering Statistics (NCSES) finds itself in the midst of a paradigm shift in the way data are gathered, manipulated, and disseminated. The agency’s science, technology, and innovation (STI) indicators program faces several challenges:
- Traditional surveys face increasing expense, declining response rates, and lengthy time lags between when data are gathered and when derived indicators and other statistics can be published.
- Tools for data extraction, manipulation, and analysis are rapidly evolving.
- Repositories of STI measures that users demand are distributed among several statistical agencies and private repositories.
- Sources of knowledge generation and innovation are expanding beyond the traditional developed countries to emerging and developing countries.
- Users’ expectations are rising, and they are demanding more access to statistics that are closer to the actual measures of what they want to know.
It is also expected that standards and taxonomies for data collection and analysis will change before the end of this decade. The Organisation for Economic Co-operation and Development’s National Experts on Science and Technology Indicators are discussing revising the Frascati and Oslo manuals (Organisation for Economic Co-operation and Development, 2002, 2005) on a rolling basis. The group plans to work on priority themes and to build a better bridge between the two manuals. The North American Industry Classification System (NAICS) codes and the Standard Occupational Classification (SOC) codes may also undergo revisions in the next decade or less. In light of these likely changes, this chapter offers the panel’s analysis and recommendations on activities that NCSES needs to consider in the near future to continue to prepare organizationally for these challenges and to improve its portfolio of STI indicators in this environment. The final report will provide recommendations and a roadmap on how to implement those recommendations.
NCSES and, indeed, other government statistical agencies, confront a world of dizzying change in how information technology is integrated into their data gathering and data management activities. The World Wide Web (the web), in particular, has been transformational in making possible new kinds of forecasting and data collection methods that provide useful insights in almost real time. These tools provide information much more rapidly than the traditional surveys with up to multiple-year lags. In addition, other changes are occurring. In his November 2011 presentation at the annual meeting of the Consortium of Social Science Associations, Robert Groves (2011a) conveyed the status of U.S. surveys, noting: “Threatened coverage of frames; falling participation rates; increasing reliance on nonresponse adjustments; and for surveys with high response rate targets, inflated costs.” His proposed solution set for what agencies should do to address these issues is to develop an approach of a “blended data world by building on top of existing surveys.”1 Groves (2011b) envisions multimodal data acquisition and manipulation of data, including: “Internet behaviors; administrative records; Internet self-reporting; telephone, face-to-face, paper surveys; real-time mode switch to fill in missing data; and real-time estimation.”
NCSES needs to determine now how it will handle these changes if they materialize and how the types and frequencies of various STI indicators will be affected. During the panel’s workshop, Alicia Robb (of the Kauffman Foundation) encouraged NCSES to explore the use of administrative records to produce STI indicators, but she also cautioned that ownership issues associated with use of those data will have to be addressed before they could become a reliable complementary data source to traditional survey data. Stefano Bertuzzi (of the National Institutes of Health and the STAR METRICS Program) also presented techniques of using administrative records at universities to determine the impact of federal research funds on scientific outputs and the development of human capital in the physical and biological sciences.
There are also foresight questions that STI indicators can inform. Demographic, economic, technological, and organizational changes will all influence the subjects being measured, the mechanisms used to measure them, and the products offered by NCSES. STI indicators will be called on to answer the following questions: How will demographic shifts affect the science, technology, engineering, and mathematics (STEM) workforce, nationally and internationally? Will those shifts change the locus of the most highly productive regions? Will global financial crises slow innovation activities or merely change the locus of activities? When will emerging economies be integrated into the global ecosystem of innovation, and what effects might they have on the system? However, as cautioned above, indicators are not predictors. They can be used in isolation or in groups to show tendencies, voids, and at times what additional information is needed.
All of this suggests a shift in emphasis over time for NCSES’s indicators program. The agency will have to make decisions on whether and how to adopt the new techniques. Although NCSES is not expected to eliminate all traditional survey methods, it is expected that the prolonged austerity of federal budgets will necessitate increased reliance on web-based techniques and databases. On the horizon, the panel believes that NCSES will have to use surveys more efficiently and increase use of web-based tools for harvesting data, particularly on human capital measures and output measures related to scientific discoveries and innovation, and databases from other government agencies and private providers.
INDICATORS FROM FRONTIER TOOLS
At the panel’s workshop, presentations by Erik Brynjolfsson (Massachusetts Institute of Technology), Lee Giles (Pennsylvania State University), Carl Bergstrom (University of Washington), and Richard Price (Academia.edu) provided insights about tools that can be used to give up-to-date information on science and engineering networks and on linkages of human capital investments to STI outcomes and impacts. These experts showed panel members how to use nowcasting, netometrics, CiteSeerX, Eigenfactor, and Academia.edu (similar to Lattes in Brazil) to create scientometric2 data for STI “talent” indicators. Such tools can be used, say, to track intangible assets and knowledge flows from online recruiting sites and social networks.
1 For further comments on this point, see Census Bureau discussions: http://blogs.census.gov/directorsblog/2011/09/the-future-of-producing-social-and-economic-statistical-information-part-i.html [December 2011]; http://blogs.census.gov/directorsblog/2011/09/the-future-of-producing-social-and-economic-statistical-information-part-ii.html [December 2011]; http://blogs.census.gov/directorsblog/2011/10/the-future-of-producing-social-and-economic-statistical-information-part-iii.html [December 2011].
In addition to improving survey methods and using administrative records databases directly, another potential avenue for acquiring data is web scraping, that is, collecting data publicly available on the web. This approach is distinct from web-based survey methods, which use the web to administer a survey. For example, many job seekers now publish their résumés online; students participate in social networks; and researchers use online collaboration tools and working paper repositories. Each of these kinds of web sites provides information about the population using them. Web data could therefore be used to address specific questions such as these two: (1) How many engineers are working in the United States (or what fraction of the workforce is made up of engineers)? (2) How many undergraduate students are majoring in mathematics in the United States? In this section of the report, we explore how current questions could be addressed with new data sources. Many of these sources also incorporate social networks, which may allow the development of entirely new types of indicators. However, we note that many of the questions that web-based data sources could address may be more efficiently addressed with administrative records. This is a matter for further research.
Some web-scraping projects—for example Google’s search engine—use an ad hoc approach to collecting data, examining every web resource that they can access for relevant information. This approach could also be useful for gathering STI statistics. However, a large portion of STI deals with the composition of the labor force and students, and information related to them is centralized in several large websites, rather than being distributed across individual home pages. There are at least three examples of these sites and the kind of information they could provide:
- Facebook, Google+: number of students at a university, how many major in which fields;
- Mendeley, Academia.edu, CiteULike: how many researchers are active in which fields, how many collaborations, who collaborates with whom, how useful is a given piece of research;
- LinkedIn, Monster.com, Zerply: the composition of the labor force, geographic breakdown, skill sets, and similar information.
One can collect data from a site such as Monster.com either by scraping information from the public website or by negotiating with the site for access to the data. Two reasons for preferring negotiation are legality and data structure. The legality of web scraping has been challenged several times in courts, both in the United States and abroad,3 and there does not appear to be a consensus about what is legal. However, all the cases to date that the panel found involved one for-profit company scraping data from another for-profit company’s site for commercial use. For example, a travel company might use web scraping to collect information on ticket prices from an airline and then use those prices to make it easy for customers to do comparative shopping. During the course of this study, the panel has not found an example of a nonprofit or governmental organization or academic researcher being sued over web scraping.
2In practice, scientometrics is often done using bibliometrics, a measurement of the impact of (scientific) publications and patents.
The goal of web scraping is to take semistructured data from a public web page and register it into a structured database. The major search engines have started to collaborate on mechanisms for structuring data on the web.4 The National Science Foundation (NSF) could consider participating or encouraging participation in the development of schemas for structuring data relevant to indicators, such as adding fields for educational background to the Person schema or defining fields for a journal article as a specialization of the CreativeWork schema.5
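As a minimal sketch of what "registering semistructured data into a structured database" could look like, the following Python fragment pulls schema.org Person records out of a web page and flattens them into rows. It assumes the page embeds its markup as JSON-LD inside script tags, which is only one of several ways sites publish structured data. The `fieldOfStudy` key is a hypothetical property standing in for the kind of educational-background field discussed above; it is not part of the published Person schema, whereas `name` and `jobTitle` are.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed markup rather than guess at its content


def extract_people(html):
    """Return one flat row per schema.org Person found in the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    rows = []
    for item in parser.items:
        if isinstance(item, dict) and item.get("@type") == "Person":
            rows.append({
                "name": item.get("name"),
                "job_title": item.get("jobTitle"),
                # Hypothetical property, used here for illustration only.
                "field_of_study": item.get("fieldOfStudy"),
            })
    return rows
```

Rows of this shape could be loaded into a database table directly. Pages without such markup would still require site-specific parsing, which is where most scraping errors are likely to arise.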
A company such as Monster.com already has such a structured database.6 Although this database is not public, some companies have supplied parts of their databases as part of academic collaborations, and they may be willing to work with NCSES. If the data come directly from a company, there is no chance of introducing errors during the web scraping process. On the other hand, the company may not be willing to supply sufficient data for the agency or may not be willing to supply it in a timely manner.
An advantage of web scraping is that it could be carried out continuously, and it does not require cooperation with the company. Alternatively, a middle ground would be to work with companies on structured feeds or digests of information that would be updated continuously. Companies may well prefer this to being scraped, which can require significant server resources. The National Institutes of Health, for example, makes XML versions of its grants database available and provides ongoing updates. Many researchers today mine social networks for data (without any legal consequences, as noted above). There are three basic questions for identifying sources of data:
- What STI topic areas does a particular company address?
- Is the company willing to provide data directly?
- How frequently is the company willing to update data?
A fundamental question requires more examination: What kind of statistical methodology should be applied to data from web scraping? There are other, related questions: What are the tradeoffs with using web-based data sources instead of survey data? Is it possible to adjust web-based data to represent a survey sample or to estimate errors? Is it possible to use a traditional survey to calibrate web-based data? How frequently must this be done? How frequently would NCSES want to publish new statistics? Would NCSES want to publish less reliable statistics if it means publishing them more frequently at lower cost?
3For example, Ryanair, a European airline, initiated a series of legal actions to prevent companies such as Billigfluege and Ticket Point from scraping ticket price data from its website to allow for easier comparison shopping: see http://www.ryanair.com/en/news/ryanair-wins-screenscraper-high-court-action [December 2011]. In a 2000 California case, eBay v. Bidder’s Edge, eBay sued Bidder’s Edge over price-scraping activities: see http://www.law.upenn.edu/fac/pwagner/law619/f2001/week11/bidders_edge.pdf [December 2011]. And in another California case, in 2009, Facebook, Inc. v. Power Ventures, Inc., Facebook sued Power Ventures over scraping of personal user data from the Facebook site: see http://jolt.law.harvard.edu/digest/9th-circuit/facebook-inc-v-power-ventures-inc [December 2011].
A company such as LinkedIn stores in its servers a social network representing all of its users and relationships between them, and techniques for accurately sampling this social network have been developed (see Maiya and Berger-Wolf, 2011; Mislove et al., 2007). However, to our knowledge, researchers have not yet addressed how well this social network represents the larger population. For example, if one is interested in measuring how many chemical engineers are working in the United States, some subset of these are represented in LinkedIn’s social network, but it is unclear how to adjust this subset accurately to represent the target population or how to estimate the error incurred in using this subset.7 It is important to understand how the data collected from websites compare with traditional survey data, particularly because different websites have very different coverage. Facebook, for example, covers a very large portion of the undergraduate population (at least for the next couple of years). However, sites such as Mendeley and Academia.edu definitely do not cover the entire population of researchers.
It could prove useful to adopt a combination approach, in which web-based statistics are periodically calibrated against a traditional survey. Of course the survey would have to be administered less frequently than currently or there would be no cost or time savings.
Since NCSES has reported that the response rates of some of its surveys are declining, questions arise about how well those data reflect the population sampled and how to calibrate web-based data to those surveys. It is relatively straightforward to calibrate to the Survey of Earned Doctorates (SED), which has a 100 percent response rate, but only once and on questions that the SED asks. One solution to this dilemma would be for NCSES to put resources into getting close to a 100 percent response from a small number of people from a standard survey and use that to calibrate information from web-based sources or the rest of the survey. The calibration is similar to what computer scientists and mathematicians do with compressed sensing of data on pixels and is a very interesting and exciting area of research.
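One simple way to picture the calibration idea is a ratio adjustment by stratum: at the time of a benchmark survey, compare the survey total for each group with the count observed on the web source, and then apply that ratio to later web counts. The Python sketch below illustrates only this arithmetic. The function names, strata, and numbers are hypothetical, and the approach assumes that a web source's coverage of each group stays roughly stable between benchmark surveys, which is precisely the open research question raised above.

```python
def calibration_factors(survey_totals, web_counts_at_survey):
    """Ratio of survey total to web count, per stratum, at benchmark time."""
    factors = {}
    for stratum, total in survey_totals.items():
        observed = web_counts_at_survey.get(stratum, 0)
        if observed > 0:  # a stratum unseen on the web cannot be calibrated
            factors[stratum] = total / observed
    return factors


def calibrated_estimate(web_counts_now, factors):
    """Scale current web counts by the benchmark factors."""
    return {
        stratum: count * factors[stratum]
        for stratum, count in web_counts_now.items()
        if stratum in factors
    }


# Illustrative numbers only; none of these figures come from a real survey.
benchmark = {"chemical engineers": 30000, "electrical engineers": 90000}
web_then = {"chemical engineers": 10000, "electrical engineers": 45000}
web_now = {"chemical engineers": 11000, "electrical engineers": 44000}

factors = calibration_factors(benchmark, web_then)
estimates = calibrated_estimate(web_now, factors)
```

A sketch like this says nothing about the error of the resulting estimates; quantifying that error would require the kind of methodological work described in this section.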
It may not yet be possible to achieve rigor comparable to a traditional survey with these methods, and NCSES will need to consider what its tolerance is for publishing statistics that may not be as reliable as those it has previously published. In such consideration, NCSES will need to balance reliability against timeliness: since little time is required for data collection with data mining techniques in comparison with traditional surveys, releasing statistics on a much more frequent basis is possible. In principle, nothing prevents statistics from being periodically or continuously updated. For example, the national unemployment rate, gross domestic product, and consumer price index are periodically updated without diluting the measures’ importance. The Billion Prices Project at the Massachusetts Institute of Technology uses an algorithm that collects prices daily from hundreds of online retailers worldwide, creating, among other things, a daily price index for the United States.8
7LinkedIn and similar data could be quite useful for questions involving relative rather than absolute measures. For example, are there more chemical than electrical engineers? Do chemical engineers change jobs more frequently than other engineers? Where in the country are chemical engineers most highly concentrated?
Developing Ideas Through Contests
One way to develop these ideas further would be through a contest or research funding or prize competition. There are several “open innovation” organizations that facilitate this type of contest, such as InnoCentive, the Innovation Exchange, and challenge.gov. Working with an outside entity to design and administer a contest would allow NCSES to focus on the problems it hopes to address rather than the implementation details of the contest. A National Research Council (2007) report, “Innovation Inducement Prizes at the National Science Foundation,” and the National Science Foundation’s new Innovation Corps Program could also serve as useful models, although these resources are focused more specifically on technology commercialization.
If the contest is designed to address the statistical questions around the usefulness of web-based data sources, it will be necessary to supply some sample data, and this might affect negotiations with companies. For example, LinkedIn might be willing to supply its data for NCSES to use but unwilling to allow its use in a public contest.
How can a federal statistical agency develop and rely on web-based and scientometric tools to produce gold-standard data for periodic publication? This is a basic question that needs to be considered in the current climate of rapidly changing technologies and increasing demands for data. There is a raft of related questions, including: How can an agency overcome looming privacy and security issues? How many personnel internal to the agency will it take to develop and operate the tools to produce the indicators? These are all good questions that will need to be fully addressed before NCSES or any other federal statistical agency implements the frontier methods described in this section.
One way to address these questions is by example. In 2011, the National Institutes of Health (NIH) decided to sponsor a competition9 to find improved methods of using the National Library of Medicine (NLM) to show knowledge flows from scientific exploration through to commercialized products. The agency also wanted to use the NLM resource for taxonomic development and to show relationships between research activities. Knowledge spillovers and effects are difficult to measure. NIH determined that one way to mine millions of data entries was to automate the process. Yet, that was not the expertise of any specific department at NIH, and it was important to cast a broad net to get the best expertise addressing the problem. The competition was announced on challenge.gov and was titled: The NLM Show Off Your Apps: Innovative uses of NLM Information Challenge. The competition was open to individuals, teams of individuals, and organizations and its purpose was to “develop innovative software applications to further NLM’s mission of aiding the dissemination and exchange of scientific and other information pertinent to medicine and public health.”10 The competition ended August 31, 2011, and winners were announced on November 2.11
11Another example of a competition is the Netflix Prize, documented in the materials for the Committee on Applied and Theoretical Statistics and the Committee on National Statistics of the National Academy of Sciences seminar, entitled: “The Story of the Netflix Prize,” November 4, 2011. (See http://www.netflixprize.com/community/ and http://www.science20.com/random_walk/predicting_movie_ratings_math_won_netflix_prize [January 2012]).
A Note of Caution
On a cautionary note, Boyd and Crawford (2011) assert that “[t]he era of Big Data has begun” and “it is imperative that we begin asking critical questions about what all this data means, who gets access to it, how it is deployed, and to what ends.”12 While mining data at the project or individual level may yield valuable results, it is also the case that archival data from some sources are poor or nonexistent, and Boyd and Crawford (2011) also noted: “There is a risk in an era of Big Data of treating every connection as equivalent to every other connection, of assuming frequency of contact is equivalent to strength of relationship, and of believing that an absence of connection indicates a relationship should be made.” This is a very important point. NCSES will have to proceed with caution as it considers integration of frontier tools and datasets into its indicators production processes.
MULTIMODAL DATA DEVELOPMENT
One issue that needs to be explored is the feasibility of blending the use of administrative records, scientometric tools, and survey techniques to produce more accurate data on STI human capital measures and other indicators that NCSES produces, such as R&D input and performance measures. A multimodal approach would help to create longitudinal series using existing and new information. In the near term, the topic could be explored through a workshop specifically designed to discuss the conceptual framework and feasibility of blending data acquisition techniques and using this mixed-methods approach to develop new indicators.13 This approach could be useful for developing real-time maps of networked scholars, while measuring return on investments from federal research funds as they are used and linking them to outputs (paper and patents). At the same time, this approach would periodically assemble basic data on education, employment, work activities, and demographic characteristics. We must stress, however, that it would be prudent to test the approach on data that are already well developed at NCSES.14
STI DATA LINKAGE AND COORDINATION
The panel discovered that there appear to be multiple agencies collecting information about innovation, including: NCSES; the Bureau of Labor Statistics (BLS) in the U.S. Department of Labor; the Census Bureau, the Bureau of Economic Analysis (BEA), and the National Institute of Standards and Technology in the U.S. Department of Commerce; and the U.S. Department of Agriculture. Indeed, if the subject is broadened to STI, we were told that at least five agencies collect these data. This suggests a need for an entity to assume a coordinating function, to ensure that STI data collection and reporting efforts across the government are efficiently distributed, to eliminate duplication, to take advantage of potential synergies, and to ensure high quality of the data and statistics. Such an entity could be an interagency council or a working group of agency representatives.
12For other important references on the use of visual analytics tools to answer science policy questions, see Zucker and Darby (2011) and Thomas and Mohrman (2011).
13Statistica Neerlandica has prepublication views of a series of articles on the use of administrative records for analytical purposes, including regression analysis: see http://onlinelibrary.wiley.com/journal/10.1111(ISSN)1467-9574/earlyview [December 2011]. For theoretical foundations on how to combine information from multiple data sources, see Molitor et al. (2009).
14In a recent article, Roach and Cohen (2011) use citation-based and survey-based approaches to obtain measures of knowledge flows. They find overall “close correspondence between citation-based measures of knowledge flows and [the] survey-based measure at the industry level.” However, when they control for industry fixed effects, the correlation between the two sets of data “drop by approximately three quarters.” Roach and Cohen acknowledge that there is measurement error in their survey data and in the patent data that they use for comparison. They and other researchers are attempting to determine the nature of such errors to improve the reliability of proximate measurements of knowledge flows. One conclusion to be drawn here, though, is that developing indicators using different techniques will give users the relevant range for the measures that they seek to use.
NCSES could take on an important role in such a council or working group, particularly given its function as a data clearinghouse with respect to STI data and statistics. The panel was told by staff in several agencies that it could be beneficial to the statistical system if NCSES became an enhanced data aggregator and data curator for STI-related information. NCSES could explore such a role through the creation of an interagency council on STI statistics. Such a council could identify and address STI statistical issues and opportunities among other statistical agencies. It could not only serve as a clearinghouse, but it could also initiate and monitor activities that serve the user community. Both the Office of Management and Budget and the Office of Science and Technology Policy could have roles in such a council.
Better integration of data sources is needed to develop more robust STI indicators. John Haltiwanger (University of Maryland) suggests that infrastructure datasets could be fully integrated to track the career histories of innovators and entrepreneurs and track the relationships between startup/young/high-growth firms and large, mature firms. These infrastructure datasets could be fully integrated with all existing Census Bureau business surveys and other data: for example, one could integrate economic censuses and annual surveys to measure productivity, business practices, and R&D, linked to patent, citation, and other information about innovators from the U.S. Patent and Trademark Office.
It will be important to integrate any new STI indicators that are developed into the existing infrastructure (if not at the person/business level, then at some level of disaggregation). Data sharing and synchronization would permit even richer integration of BLS and BEA firm data. At the panel’s workshop, Matthieu Delescluse (European Commission) also remarked that the European Union (EU) is commissioning the linking of patent data with company databases to develop new indicators. For example, the relationship between small and medium firms and number of patents can be tracked over time. The EU is also using data from the Community Innovation Survey Business Register to determine the international sourcing of R&D. This statistic could also be developed in the United States by linking Census Bureau and BEA data. Employment dynamics, including worker mobility trends in science and engineering occupations, could be developed by linking Census Bureau, BLS, and BEA data. Existing research data centers or data enclaves could facilitate platforms for data integration and potentially data comparability with other nations that also follow similar data administration policies.15
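To make the idea of linking patent data with company databases concrete, the following Python sketch joins patent records to a firm register by normalized name and counts patents per firm per year. It is only an illustration under strong assumptions: the record layouts (`firm_id`, `name`, `assignee`, `year`) and the list of legal suffixes are hypothetical, and exact matching on cleaned names is a much cruder technique than the probabilistic record linkage a statistical agency would likely need.

```python
import re
from collections import Counter

# Hypothetical list of legal suffixes to ignore when comparing firm names.
_SUFFIXES = {"inc", "corp", "corporation", "co", "llc", "ltd", "company"}


def normalize_name(name):
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def patents_per_firm(firms, patents):
    """Count patents by (firm_id, year); also report how many failed to link."""
    by_name = {normalize_name(f["name"]): f["firm_id"] for f in firms}
    counts = Counter()
    unmatched = 0
    for patent in patents:
        firm_id = by_name.get(normalize_name(patent["assignee"]))
        if firm_id is None:
            unmatched += 1  # keep track of linkage failures, do not hide them
        else:
            counts[(firm_id, patent["year"])] += 1
    return counts, unmatched
```

Reporting the number of unmatched records alongside the counts matters because linkage failures are rarely random; small and young firms, for example, may be harder to match than large, mature ones.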
NCSES already has the infrastructure at the National Opinion Research Center (NORC) to house many of its survey data and allow licensed researchers access through remote dedicated computers. Data that will be available by the beginning of 2012 will come from the following surveys: Survey of Earned Doctorates (SED), National Survey of Recent College Graduates (NSRCG), Survey of Doctorate Recipients (SDR), and the Scientists and Engineers Statistical Data System (SESTAT) integrated database (which includes NSRCG, SDR, and the National Survey of College Graduates [NSCG]). The panel heard from several people that NCSES sees the NORC data enclave as a way to build its community of licensed researchers while enabling its own staff to spend more time in helping researchers with the substance of the data rather than paperwork. Additionally, NCSES has worked with NORC to build an infrastructure that allows research teams to share their work in a secure environment, whether they are physically in one location or not.
15“The NORC Data Enclave exists to provide researchers with secure access to microdata and protect confidentiality, as well as index, curate and archive data.” See the full description of the NORC data enclave at http://popcenter.uchicago.edu/data [January 2012].
There is strong interest in the dynamics of firm demographics, births, deaths, employment contributions by firms, and the role of the high-growth firm. These statistics can be developed by the Census Bureau by analyzing its business register. If these data are available to researchers, say, at the NORC data enclave, a broad spectrum of evidence-based statistics and indicators could be made publicly available. One means by which such a program could begin is through the initiation of a research program by NCSES. Such a program would energize the research community to use survey and other data as soon as the data arrive at the NORC data enclave. It could also be designed to incentivize researchers to develop new, high-utility statistics that are based on linked data from several agencies and that relate inputs to outputs, outcomes, and effects.
NCSES strives to be the central repository in the federal government of data, knowledge, and expertise in all STEM (science, technology, engineering, and mathematics) topic areas, as it is so identified in the America COMPETES Reauthorization Act of 2010. Acting as such a repository will involve obtaining data from other federal agencies, as well as from intergovernmental and private sources. The focus would be on STEM aspects that are the official responsibility of NCSES. Although NCSES staff may obtain access to BRDIS data directly through the Survey Sponsor Data Center that was scheduled to come online December 30, 2011, researchers will not have access to these data. Linking data from the BRDIS and the Human Resource and Education Development survey with administrative longitudinal data would provide a rich dataset for linking knowledge inputs to outcomes. This would provide a longitudinal component to BRDIS analysis. NCSES could explore avenues for developing this type of longitudinal capability and the in-house analytical capacity to produce STI indicators from those data.
CONCLUSION AND RECOMMENDATION
Integration of the frontier tools (discussed above) into practice at NCSES would represent a paradigm shift for the agency, at a critical time when it is reaping benefits from investments in revised surveys during the past four to five years. Therefore, the panel recommends that NCSES in the near term undertake pilot work to determine how its indicators program can incorporate the new techniques with traditional survey methods. A logical extension of this is that NCSES begin to expand its role as a clearinghouse for STEM information, coordinating data interoperability and standards for metadata descriptions (e.g., participating in schema.org), and perhaps developing tools like open source web crawlers.
RECOMMENDATION 6: The National Center for Science and Engineering Statistics should fund exploratory activities on frontier data extraction and development methods. These activities should include
- research funding or prize competitions to harness the computing power of data specialists with a view to (a) analyzing existing public databases to develop better indicators of science, technology, and innovation activities and (b) analyzing the huge and growing amount of information on the Internet for similar purposes;
- pilot programs or experiments to produce a subset of indicators using web tools; and
- convening a workshop of experts on multimodal data development, to explore the new territory of developing metrics and indicators from surveys, administrative records, and scientometric sources.