The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
7
A Paradigm Shift in Data Collection and Analysis

The National Center for Science and Engineering Statistics (NCSES) finds itself in the midst of a paradigm shift in the way data are gathered, manipulated, and disseminated. The agency’s science, technology, and innovation (STI) indicators program must deal with several challenges:

- As discussed in previous chapters, traditional surveys face increasing expense, declining response rates (Williams, 2013), and lengthy time lags between when data are gathered and when derived indicators and other statistics can be published.
- Tools for data extraction, manipulation, and analysis are evolving rapidly.
- The STI measures that users demand are distributed among several statistical agencies and private repositories.
- Sources of knowledge generation and innovation are expanding beyond the traditional developed countries to emerging and developing countries.
- Users have rising expectations and are demanding greater access to statistics that come closer to measuring what they actually want to know.

This chapter explores this changing landscape of data collection and analysis and its implications for NCSES’s STI indicators program.

NEW METHODS, NEW DATA SOURCES

Standards and taxonomies for data collection and analysis are expected to change before the end of this decade. OECD’s National Experts on Science and Technology Indicators are discussing revising the Frascati Manual (OECD, 2002) and the Oslo Manual (OECD-Eurostat, 2005) on a rolling basis. The group plans to work on priority themes and to build a better bridge between the two manuals. The North American Industry Classification System (NAICS) codes and the Standard Occupational Classification (SOC) codes may also undergo revision within the next decade. NCSES and, indeed, other government statistical agencies confront a world of dizzying change in the way information technology is integrated into their data gathering and data management activities.
The World Wide Web, in particular, has been transformational in enabling new forecasting and data collection methods that yield useful insights in almost real time. These tools provide information much more rapidly than is possible with traditional surveys, which can entail lags of multiple years.

PREPUBLICATION COPY: UNCORRECTED PROOFS 7-1

7-2 CAPTURING CHANGE IN SCIENCE, TECHNOLOGY, AND INNOVATION: IMPROVING INDICATORS TO INFORM POLICY

Other changes are occurring as well. In his November 2011 presentation at the annual meeting of the Consortium of Social Science Associations, Robert Groves (2011a) conveyed the status of U.S. surveys: “Threatened coverage of frames; falling participation rates; increasing reliance on nonresponse adjustments; and for surveys with high response rate targets, inflated costs.” His proposed solution is for agencies to move toward a “blended data world by building on top of existing surveys.”1 Groves (2011b) envisions multimodal acquisition and manipulation of data, including “Internet behaviors; administrative records; Internet self-reporting; telephone, face-to-face, paper surveys; real-time mode switch to fill in missing data; and real-time estimation.”2

Some of these innovations are already being implemented at the Census Bureau. The agency’s economic directorate has combined administrative data with survey data in inventive ways. It also handles multiple response modes—paper forms, Internet responses, and telephone interviews. To address the timeliness of economic indicators, it has devised workable decision rules for defining which estimates are preliminary and what information is required to revise them.

Perhaps the most challenging innovation in Groves’ vision of the future of surveys is performing real-time estimation during data collection. Groves (2011b) envisions implementing real-time estimation routines—including imputation, nonresponse adjustment, and standard error estimation—after every 24 hours of data collection. Part of this process would entail assessing whether the standard error increase due to imputation was acceptable or additional nonresponse follow-up was necessary. In this context, imputation can, in effect, be viewed as another mode of data collection.
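The nightly routine Groves describes can be sketched in a few lines. The mean imputation and the error decomposition below are deliberately simple stand-ins, not the Census Bureau’s actual procedures; they only illustrate the decision logic of tracking imputation error alongside sampling error.

```python
import statistics

def nightly_estimation(responses, sample_size, se_target):
    """One illustrative pass of a real-time estimation routine: impute
    missing cases, form the estimate, track sampling and imputation
    error, and decide whether nonresponse follow-up should continue."""
    n_obs = len(responses)
    n_missing = sample_size - n_obs
    mean_obs = statistics.mean(responses)

    # Impute every missing case with the respondent mean (deliberately
    # naive; production systems use far richer imputation models).
    estimate = statistics.mean(responses + [mean_obs] * n_missing)

    # Sampling standard error from the observed cases.
    sampling_se = statistics.stdev(responses) / n_obs ** 0.5
    # Crude proxy for the extra error introduced by imputation,
    # growing with the share of cases that had to be imputed.
    imputation_se = sampling_se * (n_missing / sample_size) ** 0.5

    total_se = (sampling_se ** 2 + imputation_se ** 2) ** 0.5
    keep_following_up = total_se > se_target
    return estimate, total_se, keep_following_up
```

With responses [10, 12, 14, 16, 18] from a target sample of 10, the estimate is 14, and follow-up stops once the combined standard error falls below the target.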
To make trade-off decisions about whether to terminate nonresponse efforts for a case using a particular mode, key statistics on the fully imputed estimates, along with measures of the imputation standard error and sampling standard error of the estimates, would be actively tracked. Groves believes successfully implementing this real-time estimation and decision process at the Census Bureau would take at least 5 years.

In this vein, one issue that needs to be explored is the feasibility of blending the use of administrative records, scientometric tools, and survey techniques to improve the accuracy of data on STI human capital measures and other indicators that NCSES produces, such as research and development (R&D) input and performance measures. A multimodal approach would help in creating longitudinal series using existing and new information. In the near term, the topic could be explored through a workshop designed specifically to discuss the conceptual framework and feasibility of blending data acquisition techniques and using this mixed-methods approach to develop new indicators.3 This approach could be useful for developing real-time maps of networked scholars while measuring returns on investment from federal research funds as they are used and linking them to outputs (papers and patents). At the same time, this approach would include periodically assembling basic data on education, employment, work activities, and demographic characteristics.

1 For further comments on this point, see U.S. Census Bureau (2011a,b,c).
2 See Chapter 4 for detail on how business practice data (which include administrative records and web-based data) can be used to obtain innovation indicators.
3 Statistica Neerlandica has prepublication views of a series of articles on the use of administrative records for analytical purposes, including regression analysis; see http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1467-9574/earlyview [December 2011]. For theoretical foundations of combining information from multiple sources of data, see Molitor et al. (2009). Also see Eurostat (2003).

Data from administrative records and web-based sources—termed “business practice data” (see Chapter 4)—have been used for many years at federal agencies with two purposes: to
benchmark sample survey data and, along with sample survey data, to produce official statistics. Horrigan (2012, 2013) gives several examples of sources being used by the Bureau of Labor Statistics (BLS), including:

- the Billion Prices Project data;
- retail scanner data;
- the J.D. Power and Associates used car frame;
- stock exchange bid and ask prices and trading volume data;
- universe data on hospitals from the American Hospital Association;
- diagnosis codes from the Agency for Healthcare Research and Quality, used to develop the producer price index;
- Energy Information Administration administrative data on crude petroleum for the International Price Program;
- Department of Transportation administrative data on baggage fees and the Sabre data, used to construct airline price indices;
- insurance claims data, particularly Medicare Part B reimbursements to doctors, used to construct health care indices; and
- many more sources of administrative records data from within the U.S. government, as well as web-based data.

According to Horrigan (2013), in addition to the development of price indices, administrative records and web-scraping data are used to “improve the efficacy of estimates....the Current Employment Statistics (CES) Survey uses administrative data from the Quarterly Census of Employment and Wages (QCEW)....” BLS also is “using web-scraping techniques to collect input price information used to increase the sample of observations we use to populate some of our quality adjustment models” (Horrigan, 2013, p. 26). Horrigan cautions, however, that “the principle of constructing an inflation rate based on the rate of price increase for a known bundle of goods with statistically determined weights lies at the heart of what we do. While research may show the viability of using a web-scraped source of data for a particular item, it needs to be done within the framework of this methodology” (Horrigan, 2013, p. 27).
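The benchmarking theme running through these examples, in which an administrative or web source supplies rapidly changing detail while a survey supplies the authoritative level, can be sketched as a simple ratio calibration. The function and the numbers are illustrative, not an actual BLS procedure.

```python
def ratio_calibrate(source_totals, survey_total):
    """Rescale detailed totals from a business-practice data source so
    they sum to a survey benchmark, preserving the source's detailed
    category shares while adopting the survey's overall level."""
    factor = survey_total / sum(source_totals.values())
    return {category: value * factor
            for category, value in source_totals.items()}
```

For example, if scanner data show category totals of 30 and 70 but the survey benchmark puts the overall total at 120, the calibrated detail becomes 36 and 84.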
The statistical methodology related to sampling and weights must be developed if these multimodal techniques are to be fully relied upon to deliver bedrock STI indicators. The panel must stress, moreover, that business practice data must be regularly calibrated using sample survey data. Business practice data contain a wealth of detailed and rapidly changing information that is not practically acquired using surveys. However, businesses and government enterprises generally do not maintain the sort of consistency across organizations, categories, and time that would enable cross-sectional and longitudinal comparisons. In time, and with appropriate financial and human resources, NCSES and other statistical agencies should be able to publish indicators based on business practice data, but only if the raw data are adjusted using a well-designed program of sample surveys. Indeed, the challenge will be to design the most efficient combination—financially and statistically—of traditional sample surveys and administrative and web-based sources.

IMPLICATIONS FOR NCSES

NCSES needs to determine now how it will handle the above changes if they materialize and how the types and frequencies of various STI indicators will be affected. During the panel’s July 2011 workshop, Alicia Robb of the Kauffman Foundation encouraged NCSES to explore the use of administrative records to produce STI indicators. She also cautioned, however, that ownership issues associated with the use of those data will have to be addressed before they can become a reliable complement to traditional survey data.
Also at the workshop, Stefano Bertuzzi, at that time with the National Institutes of Health (NIH) and the STAR METRICS program, in collaboration with Julia Lane from the National Science Foundation (NSF), presented techniques for using administrative records at universities to determine the impact of federal research funds on scientific outputs and the development of
human capital in the physical and biological sciences. In follow-on discussions, George Chacko, who took the helm of the STAR METRICS program at NIH in late 2011, described that $1.5 million-per-year activity. Level 1 data outputs (described by Bertuzzi) are in the data validation process. There are two potential sources of error (data entry and data transformation into readable files), but biases are not yet known. Chacko noted that further research is needed to determine the quality of the Level 1 data. He also described Level 2, which will allow the integration of research project data; that effort had not yet begun as of the writing of this report. Participants in Level 2 will include the U.S. Department of Agriculture, the Environmental Protection Agency, NIH, and NSF. Since each agency has different ways of reporting the same kinds of grants, one of the first tasks will be the creation of a data ontology and taxonomy before a database is developed. Sometime in the future, Chacko expects that STAR METRICS Level 3 will enable demographic identifiers for individuals, thereby allowing analysis of science, technology, engineering, and mathematics (STEM) outcomes by gender and race/ethnicity.

In May 2012, Christopher Pece gave a presentation at NSF on NCSES’s Administrative Records Project (ARP), which is still in the feasibility testing stage. Pece cited the National Research Council (2010) report recommending that NCSES (Pece, 2012):

- develop R&D descriptors (tags) in administrative databases to better enable identification of R&D components of agency or program budgets;
- use administrative data to test new classification schemata by direct access to intramural spending information from agency databases; and
- develop several demonstration projects to test for the best method for moving to a system based at least partly on administrative records.
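The “R&D descriptors (tags)” idea in the first recommendation can be illustrated with a toy crosswalk that maps each agency’s own budget codes onto a common R&D taxonomy so spending can be aggregated across systems. Every agency code and category below is invented for illustration.

```python
# Hypothetical crosswalk from agency-specific budget codes to a common
# R&D taxonomy; all codes and categories here are invented.
CROSSWALK = {
    ("NIH", "BIOMED-B"): "basic research",
    ("NSF", "CORE-B"): "basic research",
    ("USDA", "AG-APPL"): "applied research",
}

def harmonize(records):
    """Tag agency budget records with the common R&D category and sum
    spending by category. Records with no crosswalk entry are set aside
    for manual review rather than silently dropped."""
    totals, unmatched = {}, []
    for agency, code, amount in records:
        category = CROSSWALK.get((agency, code))
        if category is None:
            unmatched.append((agency, code, amount))
        else:
            totals[category] = totals.get(category, 0.0) + amount
    return totals, unmatched
```

Records that fall outside the crosswalk are surfaced for review, which is where the negotiation over differing agency taxonomies described in this chapter would come in.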
Accordingly, NCSES is working with a subset of agencies that have data reported in the Federal Funds Survey and the Federal Support Survey to pilot methods for using administrative records to produce data comparable to the survey data. In addition to budgetary constraints and the negotiation of interagency agreements, other challenges must be addressed, including the creation of data tags and R&D crosswalks between agency systems that use different data taxonomies, accounting practices, and information technology systems. The panel was impressed by NCSES’s willingness to experiment with the use of administrative records to complement its survey-based datasets, but also recognized the need for NCSES to acquire increased resources—funding and staff—at least in the short term, with the potential ultimately for reduced survey costs, reduced data validation costs, and increased timeliness of data delivery.

During the July 2011 workshop, presentations by Erik Brynjolfsson of the Massachusetts Institute of Technology, Lee Giles of Pennsylvania State University, Carl Bergstrom of the University of Washington, and Richard Price of Academia.edu provided insights regarding tools that can be used to obtain up-to-date information on science and engineering networks and on linkages between human capital investments and STI outcomes and impacts. These experts showed panel members how nowcasting, netometrics, CiteSeerX, Eigenfactor, and Academia.edu (similar to Lattes in Brazil) can be used to create scientometric4 data for use in developing STI “talent” indicators. Such tools can be used, say, to track intangible assets and knowledge flows from online recruiting sites and social networks.

4 In practice, scientometrics often uses bibliometrics, a measurement of the impact of (scientific) publications and patents (see Chapter 5).

Many questions remain about the representativeness of data sources such as STAR METRICS Level 1 (which includes data from 80 universities) and datasets drawn from web-based sources, primarily because they are nonrandom convenience samples. Recent work on medical care expenditures by Ana Aizcorbe at the Bureau of Economic Analysis (BEA)5 and by Dunn and colleagues (2012) shows that insurance companies’ patient claims data can be used to develop reliable price estimates, given an appropriate weighting strategy. Both projects use MarketScan data, which include sampling weights designed to provide population estimates from the MarketScan sample. This is a potentially cost-effective approach compared with the use of traditional household surveys (in this case, the Medical Expenditure Panel Survey [MEPS]). Clearly, the MarketScan data cannot address all of the questions the MEPS was designed to address. However, Dunn and colleagues find that the MarketScan data “produce spending growth figures that are more aligned with other benchmark estimates of price and expenditure growth from national statistics” (Dunn et al., 2012, p. 26).

INDICATORS FROM FRONTIER TOOLS: EXAMPLE OF THE DATA SCIENCE DISCIPLINE

Consider the rise of data science, an emerging discipline that encompasses the analysis, visualization, and management of large datasets. (“Large” in this context typically means many millions or billions of records.) The digitization of records, increasing numbers of sensors, and inexpensive storage have combined to produce enormous quantities of data in the sciences and business. Data scientists use specialized techniques to sift through these troves of information to discover new insights and create new value.
Google’s chief economist, Hal Varian, has characterized the statistical work of data scientists as “the sexy job in the next 10 years” (Lohr, 2009); Forbes magazine describes the data scientist’s role as “the hot new gig in tech” (Lev-Ram, 2011); and The Economist (2011) says data science is “where the geeks go.” In line with perennial concerns about the supply of knowledge workers in the United States (Atkinson, 1990; Freeman and Aspray, 1999; Jackson, 2001; The Partnership for a New American Economy and The Partnership for New York City, 2012), data scientists are already projected to be in short supply in the near future. According to a 2011 McKinsey study (Manyika et al., 2011), “a shortage of people with the skills necessary to take advantage of the insights that large datasets generate is one of the most important constraints on an organization’s ability to capture the potential from big data.” Likewise, an EMC Corporation (2011) study foresees a “looming talent shortage.” Access to talent is not an issue just for industry: 23 percent of respondents to a 2011 Science survey said their laboratories were lacking in data analysis skills.

Given that past projections of shortages of knowledge workers have proven controversial (Lowell and Salzman, 2007; Matloff, 2003; Stanford News Service, 1995; Weinstein, unpublished), it is worth examining the above claims more closely. Consider some of the questions a policy maker concerned about the future data scientist workforce might ask of NCSES:

- How many new data scientists are graduating each year?
  - How many in the United States?
  - How many in other parts of the world?

5 See Aizcorbe et al. (2012) for more detail on the Health Care Satellite Accounts at BEA.

- Where were existing data scientists educated?
  - What schools?
  - What programs?
- Where are data scientists employed?
  - What fraction work in industry? In government? In academia?
- What range of salaries do data scientists command?
  - How much do salaries vary with degree level? With sector?
- Is the United States producing enough or too many data scientists?

A funding agency director (such as the director of NSF) might want to know:

- Is NSF funding data science research?
  - Which directorates?
  - How much is NSF spending?
- What basic research does data science draw upon?

NCSES’s existing STEM workforce surveys would be hard pressed to answer these questions. For one thing, “data science” is not in the taxonomy of fields used in the STEM degree surveys, so one cannot obtain data science degree counts directly from existing NCSES datasets. Similarly, the taxonomy of occupations used by the Current Population Survey/American Community Survey does not contain “data scientist,” so NCSES datasets derived from these surveys will likewise miss this group. Fixed, slowly evolving taxonomies restrict the ability of existing surveys to provide insights about emerging fields.

One potential remedy might be a one-time custom survey. Given the cost of existing surveys, however, this would likely be a prohibitively expensive undertaking. A custom survey would entail the additional difficulty that there is no obvious, well-defined frame. An alternative might be for NCSES to update its taxonomy of fields for future surveys. This would be a slow process, however: turnaround time for the NCSES surveys is on the order of 2 years (National Science Foundation, 2012a), and additional time would be needed to formulate and validate a revised taxonomy. Even if taxonomic issues were resolved, the limitation would remain that NSF’s datasets cover only the United States.
An Alternative Approach

Datasets exist that could shed light on questions about data science, but they are very different from those produced by NCSES. They are not among the datasets typically used to analyze the STEM workforce in part because, while they offer significant advantages over NCSES’s data, they also come with significant challenges.

Consider, first, the task of counting data science doctoral degrees. Rather than using a survey to ask new doctorate recipients whether they did data science-related research, an expert could look at their dissertations to make this determination. The expert could then tally the number of new data science graduates. The expert could also identify the degree-granting institutions and programs from the dissertations. The idea is not merely theoretical: both
ProQuest and WorldCat maintain large databases of dissertations (Online Computer Library Center, 2012; ProQuest, 2012). While counting data scientists in this fashion is labor-intensive, it is potentially faster and less expensive than either conducting a custom survey or waiting for a taxonomy update in an existing survey. In addition, the approach has the benefit of providing global counts of data science degrees, not just U.S. counts.

For the task of learning about the number of data scientists in the current workforce, one could examine a database of resumés; count the number of people whose job titles or job descriptions include such phrases as “data scientist,” “data mining,” or “big data”; tally their educational backgrounds; and observe their sectors of employment. A large, global resumé database such as LinkedIn (http://www.linkedin.com; see Box 7-1) or profiles of science professionals in science-related social networks such as ResearchGate (http://www.researchgate.net/), Academia.edu (http://academia.edu/), or BioMedExperts (http://www.biomedexperts.com/) could be used for this procedure. Again, the process of counting the resumés or profiles and classifying the associated educational backgrounds and employers would be labor-intensive, but it could potentially provide inexpensive and timely insights on the supply of data scientists in both the U.S. and international markets that would otherwise be unavailable or prohibitively expensive to generate.

BOX 7-1
Employment Shifts from LinkedIn Data

LinkedIn’s data science team (The Noisy Channel, 2012) recently collaborated with the White House Council of Economic Advisers to identify the industries that grew and shrank the most during the 2008-2009 recession and the subsequent recovery.
By following people who were site members in 2007 longitudinally through 2011, the team was able to see the rapid growth in renewable energy and Internet companies, as well as sharp declines in newspapers, restaurants, and the retail sector. The cohort they followed numbered in the tens of millions, and LinkedIn contains detailed data on its members’ educational backgrounds, so one can readily imagine conducting similar analyses restricted to workers with science, technology, engineering, and mathematics (STEM) degrees. Moreover, one of the study’s authors says that, in principle, LinkedIn could track such changes in real time.

SOURCES: Nicholson, 2012; The Economist, 2012.

To assess demand and salary levels for data scientists, one could turn to large databases of job listings such as Monster.com (http://www.monster.com/), Indeed.com (http://www.indeed.com/), or SimplyHired.com (http://www.simplyhired.com/). An expert could identify data science-related jobs and then tally the salaries offered as a function of the level of the job. Mandel and Scherer (2012) recently used techniques of this sort to estimate the size and location of the “App Economy”—jobs related to smartphone and tablet applications and to Facebook plugins.6 Since this is a very recently created category of jobs, existing labor market statistics could not provide useful information about these types of jobs. Mandel and Scherer turned to The Conference Board’s Help Wanted OnLine data (The Conference Board, 2012), a

6 See also Box 4-8 in Chapter 4 on the use of this approach to develop innovation indicators.

collection of job listings from more than 16,000 Internet job boards. They counted job listings containing keywords associated with the App Economy (e.g., “iPhone,” “iPad,” “Android”) and then used historical data on the ratio of job listings to jobs in the tech sector to estimate the total number of App Economy jobs. They were able to identify the geographic distribution of App Economy jobs from the job locations listed in the ads. One might apply a similar approach to data science jobs by looking for keywords related to this field (e.g., “data science,” “data mining,” “big data”) and common data science tools and techniques (e.g., “map-reduce,” “Hadoop,” “Hive,” “Pig”).

Mandel (2012) and Mandel and Scherer (2012) analyzed online help-wanted ads to track the demand for “app-related” skills geographically, using keywords such as “iOS” and “Android.” This analysis made it possible to identify states with a higher-than-expected number of app-related want ads relative to their size (see Table 7-1). This procedure could be repeated for any innovation; one could identify clusters of green innovations, software innovations, or medical innovations in the same way, for example, at the state or metro level.

TABLE 7-1 The App Leaders: States #1-15 [table not reproduced in this text]
SOURCE: Reprinted by permission from Mandel and Scherer, 2012.

The data from help-wanted ads also can be combined with conventional survey data to provide a more complete and timely picture of innovation activities at the national or subnational level. Suppose, for example, one wanted a picture of R&D activities in biotechnology by state. The Business Research and Development and Innovation Survey (BRDIS) would provide some information, but many cells of the tables derived from its data would likely be suppressed
to avoid disclosure. However, it would be possible to use biotechnology-related help-wanted ads to impute the missing state information without violating confidentiality. This analysis could even be taken down to the metro level, using help-wanted ads combined with state-level survey data to provide a benchmark.

Using help-wanted ads to track the diffusion and use of innovation at both the national and subnational levels has several advantages:

- the ads are public and continuously refreshed;
- full databases of the ads already exist in digital form, available in near real time;
- the ads are semistructured—they are free text, but must include information about the skills and experience needed, using recognizable terms; and
- organizations such as The Conference Board already have procedures for automatically tagging them by occupation, industry, location, and so forth.

As a result, the ads provide a continually changing real-time map of the skills employers need. In particular, as the content of the ads changes, one can see innovations diffusing through the economy.

Finally, to gauge existing support for data science research within NSF, an expert could read through existing NSF awards and identify those related to data science research. The expert could then identify the institutions and programs doing data science work, as well as tally the directorates supporting data science. To identify the basic research on which data science draws, the expert could compile a set of recent, important data science papers; follow their citations back, possibly through multiple levels; and tally the journals and/or fields of research represented by frequently cited precursors to data science papers. Several large databases of papers and citations—such as Web of Science (Thomson Reuters, 2012) or Scopus (http://www.scopus.com/home.url)—could be used to facilitate this backward tracing.
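The backward tracing just described is, in effect, a breadth-first walk over a citation graph. A minimal sketch, assuming the paper records have already been pulled from a source such as Web of Science or Scopus into plain dictionaries:

```python
from collections import Counter, deque

def precursor_journals(seed_papers, cites, journal_of, max_depth=2):
    """Follow citations backward from a seed set of papers, up to
    max_depth levels, tallying the journals of the cited precursors.
    cites maps a paper id to the ids it cites; journal_of maps a
    paper id to its journal."""
    tallies = Counter()
    seen = set(seed_papers)
    frontier = deque((paper, 0) for paper in seed_papers)
    while frontier:
        paper, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for cited in cites.get(paper, ()):
            tallies[journal_of[cited]] += 1  # count each citation edge
            if cited not in seen:
                seen.add(cited)
                frontier.append((cited, depth + 1))
    return tallies
```

Counting every citation edge, rather than every unique paper, weights precursors by how often they are cited; that is one reasonable design choice among several.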
Advantages and Challenges

Using the datasets described above to learn about the state of data science offers several advantages over using surveys. First, the datasets have already been created for other purposes, so the incremental cost of using them to learn about data science is modest. Second, the databases are all updated continuously, so the lengthy delays associated with gathering survey data do not come into play. And third, because experts classify the data, there is no locked-in, limiting, pre-existing taxonomy that could lead to misclassification of data scientists (although this also presents its own issues).

Along with the benefits associated with these new datasets, however, come challenges:

- In many cases an expert is needed to assign data to categories because the datasets are unstructured (see the discussion of this issue in the next section). There will be uncertainty in this classification process, especially if multiple experts are used, since they may not always agree.
- Classifying large collections of dissertations, resumés, awards, and papers by hand is labor-intensive for the expert—there is an issue of scale.
- Some of the datasets are commercial products, so one must pay for or negotiate access to them. In addition, some of the datasets are sensitive and require special handling.

More generally, concerns have been raised about the inconsistency in the way R&D-related data are specified, making data sharing and linking by researchers difficult. A
web-based infrastructure for data sharing and analysis, including clear data exchange standards, would be a useful first step in addressing this issue.7

Finally, the question of validation arises. Many of the databases cited above are incomplete in that they do not include the entire population of interest. It is necessary to understand the portion that is missing in order to estimate, or at least bound, the possible bias introduced by using the database of interest to characterize that population.8 In addition to coverage and sampling biases, measurement, nonresponse, and interviewer biases should be examined to determine the validity of statistical indicators derived from such databases. Moreover, a process needs to be in place for measuring the reliability of the expert’s classifications.

As noted, most of the web-based datasets described here are neither representative samples nor complete censuses of the population of interest. That being the case, developing and implementing methods for using these datasets is largely tilling new ground for the staff of any statistical agency. Should NCSES decide to move in this direction, it will need to ensure that it has staff with the appropriate training and experience to develop and implement suitable analytic and data collection approaches in this new environment. Considerable progress has been made in addressing all of these challenges, but many important open questions remain.

A NEW DIRECTION FOR NCSES

The general approach described above for learning quickly and inexpensively about an emerging field by repurposing existing datasets holds considerable promise for improving understanding of many aspects of innovation in science and engineering. At the same time, the approach entails methodological and logistical problems.
The tasks of structuring unstructured data, dealing with challenges of scale, negotiating access to data and protecting sensitive data, and validating nontraditional data sources are common to many potentially useful but nontraditional datasets. The panel proposes that NCSES explore and experiment with these new, nontraditional data sources. This section describes four core areas in which NCSES could contribute to improving the state of the art, with the goal of improving outputs from its data development and indicators programs.

Identification of Data Sources

Plummeting prices for data storage, low-cost sensors, improvements in data collection mechanisms, and increases in Internet access have given rise to vast new collections of digital data (The Economist, 2010a), and the total amount of digital data is growing rapidly: a 10-fold expansion occurred between 2006 and 2011 (Gantz et al., 2008). A wide variety of datasets could

[7] See Haak et al. (2012, pp. 196-197) for a discussion of this problem and possible solutions.
[8] Sample surveys are used to draw inferences about well-defined populations. Survey methodologists have developed tools for measuring how valid and robust their inferences from surveys are. Conceptually, these methods can be applied to nonsurvey databases as well. See Groves et al. (2009), particularly Chapters 3 and 5.

be used to better understand STEM innovation; the ones mentioned above barely scratch the surface. Annex 7-1 at the end of this chapter lists some promising possibilities. NCSES could help answer two key questions regarding these new data sources: What are the most promising new datasets for understanding the state of STEM? And what are effective ways to analyze these datasets?

NCSES has historically used a combination of internally generated and third-party datasets in assembling science and engineering indicators and InfoBriefs. The agency could test the waters by adopting the goal of including in its publications and on its website analyses performed with nontraditional data in the areas of human resources, R&D, and innovation. Such analyses could be performed by external researchers funded by targeted awards.

RECOMMENDATION 7-1: The National Center for Science and Engineering Statistics should use research awards to support the development and experimental use of new sources of data to understand the broad spectrum of innovation activities and to develop new measures of science, technology, and innovation. NCSES should also support the development of new datasets to measure changing public perceptions of science, international trade in technological goods and services, new regions for entrepreneurial activity in science and technology, and precommercialized inventions.

Structuring of Unstructured Datasets

The data generated from NCSES's surveys are structured: data are stored as tables of numbers, with each number having a well-defined meaning. As noted, many of the nontraditional datasets discussed above, perhaps the majority, are in unstructured form, such as free text. A traditional (but apocryphal) rule of thumb is that 80 percent of corporate data is unstructured (Grimes, 2011); a recent article in The Economist (2010b) estimates that 95 percent of new data is unstructured.
The databases of doctoral dissertations, resumés, job listings, and NSF grant proposals described above are vast and rich stores of information, but they are difficult to process by machine. The data science example given earlier assumes that a human expert is willing to spend weeks categorizing dissertations, job listings, and so forth. This role would likely be difficult to fill, as the work is tedious and repetitive. To tap the potential of unstructured datasets fully, new tools and techniques are needed.

The problem of extracting structured information from unstructured text is an active area of research, and several NSF directorates are funding work in this area (National Science Foundation, 2008). One approach is to use divide-and-conquer techniques: rather than having a single expert spend months on a repetitive task, one can use "crowdsourcing" (Wikipedia, 2012), in which a large task is accomplished by dividing it among a large collection of people. Services such as Amazon.com's Mechanical Turk (https://www.mturk.com/mturk/welcome) and CrowdFlower (http://crowdflower.com/) provide workers and infrastructure for distributing tasks to them.

Crowdsourcing can be used to extract information from unstructured data. For example, researchers have used the technique to identify people, organizations, and locations in tweets on Twitter (Finin et al., 2010) and to analyze and annotate charts (Willett et al., 2012).
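The crowdsourced divide-and-conquer workflow just described can be sketched in a few lines of Python: a large labeling task is split into worker-sized batches, each item is labeled redundantly by several workers, and the labels are reconciled by majority vote. The titles, label set, and worker responses below are invented for illustration.

```python
from collections import Counter

def batch(items, size):
    """Split a large labeling task into worker-sized chunks (e.g., HITs)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def majority_label(labels):
    """Reconcile redundant worker labels by majority vote."""
    return Counter(labels).most_common(1)[0][0]

# Invented dissertation titles, each labeled by three hypothetical workers
worker_labels = {
    "Scalable machine learning": ["data science", "data science", "statistics"],
    "Medieval trade routes": ["history", "history", "history"],
    "Bayesian networks": ["statistics", "data science", "data science"],
}

batches = batch(list(worker_labels), 2)   # two tasks of at most two titles each
consensus = {title: majority_label(votes) for title, votes in worker_labels.items()}
```

Disagreement among workers (as in the first and third titles) is itself informative: the vote split can be reported as a rough measure of classification uncertainty.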

A fundamental question concerning the use of unstructured data for indicators requires more examination: What kind of statistical methodology should be applied to data derived from web scraping? There are other, related questions: What trade-offs are entailed in using web-based data instead of survey data? Is it possible to adjust web-based data accurately to represent a survey sample and to estimate sampling errors and statistical biases? Is applying modeling techniques to web-based data and traditional survey data a promising approach to achieving this end? How frequently must this be done? How frequently would NCSES want to publish new statistics? Would NCSES want to publish less reliable statistics if it meant publishing them more frequently at lower cost?

A company such as LinkedIn stores in its servers a social network representing all of its users and the relationships among them, and techniques for accurately sampling this social network have been developed (see Maiya and Berger-Wolf, 2011; Mislove et al., 2007). To the panel's knowledge, however, researchers have not yet addressed how well this social network represents the larger population. For example, if one is interested in measuring how many chemical engineers are working in the United States, some subset of these individuals is represented in LinkedIn's social network. Adjusting this subset accurately to represent the target population and estimating the error incurred in using it is a daunting challenge.[9]

It is important to understand how the data collected from websites compare with traditional survey data, particularly because different websites have very different coverage. Facebook, for example, covers a large portion of the undergraduate population (at least for the next couple of years). However, sites such as Mendeley and Academia.edu clearly cover only a subset of the entire population of researchers.
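The adjustment problem described here, reweighting a convenience subset such as LinkedIn's membership so that it represents a known target population, is essentially post-stratification. A minimal sketch with invented strata and counts (a real adjustment would also need variance estimates and far finer strata):

```python
def poststratify(stratum_means, population_counts):
    """Weight each stratum's sample mean by its known population share."""
    total = sum(population_counts.values())
    return sum(stratum_means[s] * population_counts[s] / total
               for s in population_counts)

# Invented example: share of members who are chemical engineers, by degree field.
# The web sample over-represents the "eng" stratum relative to the population.
population = {"eng": 200_000, "other": 800_000}   # e.g., from a census benchmark
sample_sizes = {"eng": 5_000, "other": 2_000}
share_chem = {"eng": 0.30, "other": 0.01}         # stratum means in the web sample

naive = (sum(share_chem[s] * sample_sizes[s] for s in sample_sizes)
         / sum(sample_sizes.values()))            # ignores the skewed coverage
adjusted = poststratify(share_chem, population)   # 0.30*0.2 + 0.01*0.8 = 0.068
```

The naive estimate is badly inflated because engineers are over-represented in the web sample; the post-stratified estimate corrects for that, but only along strata whose population totals are actually known.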
It could prove useful to adopt a combination approach, in which web-based statistics would be calibrated periodically against a traditional survey. Of course, the survey would have to be administered less frequently than is currently the case, or there would be no cost or time savings. It could also be a useful experiment to run parallel efforts (collecting traditional indicators in addition to their possible replacements) for a period of time in order to examine the strengths and weaknesses of using certain nontraditional data sources for indicators. This period would also be important for assessing how well the newly constructed indicators identify trends and rates of change, particularly for policy purposes.

Since NCSES has reported that the response rates for some of its surveys are declining, questions arise about how well those data reflect the population sampled and how web-based data could be calibrated to those surveys. Calibrating to the Survey of Earned Doctorates (SED), which has a response rate above 90 percent, would be relatively straightforward, but only once and only on questions asked by that survey. One solution to this dilemma would be for NCSES to devote resources to sampling for nonresponse follow-up,[10] that is, strive to achieve close to a 100 percent response rate from a small sample of nonrespondents to a standard survey, adjust the survey results for nonresponse, and use the adjusted survey data to calibrate information from web-based sources.[11] The calibration would be similar to what computer scientists and

[9] LinkedIn and similar data could be quite useful for questions involving relative rather than absolute measures. For example, are there more chemical than electrical engineers? Do chemical engineers change jobs more frequently than other engineers? Where in the country are chemical engineers most highly concentrated?
[10] This is one of several tools used by survey methodologists to address nonresponse in sample surveys. See Groves et al. (2009, Chapter 6) for a description of some of these methods.
[11] This approach would entail using the survey data as the dependent variable in a model that used information from the web-based data source to create the explanatory variables. That model could then be used to "now-cast" population values of interest directly from the web-based data.

mathematicians do with compressed sensing of pixel data, and it is a promising area of research. Achieving a level of rigor comparable to that of a traditional survey with these methods may not be possible, and NCSES would need to consider its tolerance for publishing statistics that may not be as reliable as those it has previously published. The agency would need to balance reliability against timeliness: since little time is required for data collection with data mining techniques in comparison with traditional surveys, releasing statistics much more frequently would be possible. In principle, nothing prevents statistics from being updated periodically or continuously. For example, the national unemployment rate, gross domestic product, and the consumer price index are updated periodically with no compromise to their importance. And the Billion Prices Project at the Massachusetts Institute of Technology uses an algorithm that collects prices daily from hundreds of online retailers worldwide, creating, among other things, a daily price index for the United States (see Massachusetts Institute of Technology, 2013).

RECOMMENDATION 7-2: The National Center for Science and Engineering Statistics should pursue the use of text processing for developing science, technology, and innovation indicators in the following ways:
- explore synergies with National Science Foundation directorates that fund research on text processing; and
- enable NCSES staff to attend and develop workshops that bring together researchers working on text processing and on understanding the science, technology, engineering, and mathematics ecosystem.
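The calibration model sketched in footnote 11 (survey values as the dependent variable, web-derived measures as the explanatory variable, with the fitted model then used to "now-cast" between survey rounds) can be illustrated with a toy least-squares fit. All numbers are invented.

```python
def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

# Invented calibration data: years with both a web-based count and a survey estimate
web_counts = [10_000, 12_000, 15_000]   # e.g., profiles matching an occupation
survey_est = [52_000, 60_000, 72_000]   # survey estimate of the same population

intercept, slope = fit_line(web_counts, survey_est)
# Now-cast a year that has web data but no survey round
nowcast = intercept + slope * 13_500
```

A real calibration would use many predictors, report prediction intervals, and be refit each time a new survey round arrives; this sketch only shows the mechanics of the survey-anchored model.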
RECOMMENDATION 7-3: The National Center for Science and Engineering Statistics should use its grants program to encourage the sharing of new datasets and extracted metadata among researchers working on understanding innovation in science, technology, engineering, and mathematics.

Data Validation

While the datasets discussed in the data science example offered earlier provide new windows into the state of the STEM workforce, the accuracy of statistics gleaned from some of these datasets is unknown. The ProQuest and WorldCat dissertation databases are both large, for example, but neither is complete. Do either or both contain biased subsets of new dissertations? If so, can the biases be characterized in ways that can be understood and corrected systematically? One way to better understand omissions in a dataset such as a dissertation database would be to compare it with a definitive source such as the SED (National Science Foundation, 2011a) or the Integrated Postsecondary Education Data System (IPEDS) (National Center for Education Statistics, 2012). If, say, the databases were less likely to contain dissertations from students at private institutions or from humanities students, estimates based on those databases could be reweighted accordingly.
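The reweighting idea above can be sketched directly: estimate each field's coverage rate by comparing the dissertation database with a definitive source such as the SED in a benchmark year, then inflate raw database counts by the inverse of that coverage in years where only the database is available. All counts are invented.

```python
def coverage_rates(db_benchmark, definitive_benchmark):
    """Per-field coverage of the database in a year with a definitive total."""
    return {f: db_benchmark[f] / definitive_benchmark[f]
            for f in definitive_benchmark}

def corrected_counts(db_counts, coverage):
    """Inflate raw database counts by inverse coverage, field by field."""
    return {f: db_counts[f] / coverage[f] for f in db_counts}

# Invented benchmark year: database tallies vs. definitive (SED-style) totals
db_2010 = {"engineering": 8_000, "humanities": 3_000}
sed_2010 = {"engineering": 10_000, "humanities": 6_000}
cov = coverage_rates(db_2010, sed_2010)   # engineering 0.8, humanities 0.5

# A later year with database counts only: humanities is inflated the most,
# since it was the most under-covered field in the benchmark year
db_2012 = {"engineering": 8_400, "humanities": 3_100}
est_2012 = corrected_counts(db_2012, cov)
```

The key assumption, which would itself need validation, is that coverage rates are stable between the benchmark year and the year being estimated.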

Assessing the accuracy of metrics based on other types of sources can be more difficult. For example, counts of Twitter mentions (tweets) have been proposed as an indicator of the impact of a paper (Priem and Costello, 2010), and journals such as PLOS ONE now report Twitter mentions for each article (PLOS ONE, 2012). How might one assess the validity of tweets as an indicator of impact?

NSF is supporting ongoing research in areas that could facilitate assessing nontraditional data sources. Techniques from sampling theory, approaches for handling missing data, and imputation algorithms could all prove useful. In addition, NCSES's own datasets could be used for calibrating new datasets.

RECOMMENDATION 7-4: The National Center for Science and Engineering Statistics should coordinate with directorates at the National Science Foundation in supporting exploratory research designed to validate new sources of data related to innovation in science, technology, engineering, and mathematics.

Data Access

Many datasets that would be promising for better understanding STEM have restrictions on their usage. The ProQuest and WorldCat dissertation databases, Web of Science, Scopus, and The Conference Board's Help Wanted OnLine database are all commercial datasets for which the processes for obtaining access are well defined. Other datasets are more sensitive and may require carefully controlled access. For example, access to some types of census data entails stringent controls on how the data are handled. Likewise, some corporate datasets are zealously guarded by their owners and may be used only by employees. NCSES has considerable experience with managing access to sensitive datasets, such as the Survey of Doctorate Recipients (SDR) and census data, and the experience it has gained in the process may be useful in negotiating access to other sensitive datasets.
NCSES already has the infrastructure in place at the National Opinion Research Center (NORC) to house much of its survey data and allow licensed researchers access through remote dedicated computers.[12] In October 2012, data from the following surveys became available in the NORC Data Enclave: the SED, the National Survey of Recent College Graduates (NSRCG), the SDR, and the Scientists and Engineers Statistical Data System (SESTAT) integrated database (which includes the NSRCG, the SDR, and the National Survey of College Graduates [NSCG]). The panel heard from several people that NCSES sees the NORC Data Enclave as a way to build its community of licensed researchers while enabling its own staff to spend more time helping researchers with the substance of the data rather than with paperwork. Additionally, NCSES has worked with NORC to build an infrastructure that allows research teams to share their work in a secure environment, regardless of whether they are physically in one location.

[12] "The NORC Data Enclave exists to provide researchers with secure access to microdata and protect confidentiality, as well as index, curate and archive data. The NORC Data Enclave provides authorized researchers with remote access to microdata using the most secure methods to protect confidentiality." See the full description of the NORC Data Enclave in The University of Chicago (2013). BEA does not permit data migration to research data centers; instead, it has a program whereby individuals can use the data in house under a special sworn employee arrangement.
There is strong interest in the dynamics of firm demographics (births, deaths, and employment contributions) and in the role of high-growth firms. The Census Bureau can develop these statistics by analyzing its business register. If these data were available to researchers, say at the NORC Data Enclave, a broad spectrum of evidence-based statistics and indicators could be made publicly available. One means by which such capability could be built is through NCSES's initiation of a research program. Such a program would energize the research community to use survey and other data as soon as the data arrived at the NORC Data Enclave. The program could also be designed to incentivize researchers to develop new, high-utility statistics that were based on linked data from several agencies and that related inputs to outputs, outcomes, and effects.

For datasets that cannot be used outside of a company, another approach NCSES could take would be to work with NSF directorates that sponsor industrial fellowships. For example, LinkedIn has an excellent Data Science team that could potentially provide mentorship for a graduate student or postdoctoral fellow. A program modeled after the NSF Division of Mathematical Sciences' University-Industry Cooperative Research Programs in the Mathematical Sciences (National Science Foundation, 2004) could provide a way for researchers interested in the STEM labor market to collaborate with LinkedIn's Data Science team and explore LinkedIn's data under close supervision.

RECOMMENDATION 7-5: The National Center for Science and Engineering Statistics should explore the use of university-industry exchanges as a mechanism for giving researchers access to promising datasets and industry teams access to new research techniques.

NEXT STEPS

The emerging field of data science is more than the motivating example for this chapter.
The new approach to understanding STEM that the panel believes NCSES should explore is at its core a data science approach. Since the field of data science is new and the number of practitioners is relatively small, the panel proposes two concrete initiatives that would provide opportunities for NCSES to gain experience with new data science tools and to collaborate with data scientists.

NSF has a long history of funding university-industry research collaborations. The model is typically that an industry partner with a problem to solve is paired with a university partner that has experience with techniques and tools relevant to the problem domain. A graduate student or postdoctoral fellow (or professor) splits his or her time between the university and the corporation and is mentored by people in both institutions. The student or postdoctoral fellow gains valuable real-world experience, the industry partner gains solutions to problems, and the university partner gains a better understanding of real problems and potentially relevant data. One example of this approach is the previously mentioned NSF Division of Mathematical Sciences' University-Industry Cooperative Research Programs in the Mathematical Sciences (National Science Foundation, 2004).

NCSES could gain considerable experience in data science techniques by playing the role of the industry partner and collaborating with a university in this fashion. NCSES has a mandate to understand the state of STEM; access to interesting datasets; and a staff well versed in navigating federal research organizations, managing datasets, and conducting survey research.

A collaboration with a university laboratory focused on such matters as text processing, web mining, or Internet-oriented econometrics could yield valuable experience for both sides.

RECOMMENDATION 7-6: The National Center for Science and Engineering Statistics should collaborate with university researchers on the use of data science techniques to understand the science, technology, engineering, and mathematics ecosystem, using a mechanism similar to existing National Science Foundation university-industry partnerships. One or two graduate students or postdoctoral fellows could alternate between working at NCSES and at their home institution for up to 2 years, with the specific goal of contributing new findings to NCSES's data and indicators programs.

NCSES has considerable experience with managing structured data in the form of its survey products, but much less experience with unstructured data. Conveniently, NSF has a rich but relatively untapped source of unstructured data that could provide a wealth of new insights into STEM in the United States: its database of awards. This dataset is quite sensitive, but there is precedent for granting researchers access to the data: for the Discovery in a Research Portfolio report (National Science Foundation, 2010a), NSF's Science of Science and Innovation Policy program recently granted 10 academic research laboratories access to a small set of recent award data. NCSES is well versed in managing researcher access to sensitive datasets and would doubtless be up to the task of ensuring responsible access to its award database.

The Discovery in a Research Portfolio report recommends that NSF make a number of improvements to its award data, several of which align well with the panel's recommendations.
For example, the report recommends combining award data with internal and external data, a task that would benefit from automated techniques for extracting entities (people, laboratories, programs, institutions) from awards and linking them to related entities in other datasets. The report also recommends improving visualization techniques and understanding of the interrelationships between people and topics, both of which would make promising research projects for a visiting graduate student or postdoctoral fellow.

Managing award data for research purposes would provide a useful test bed for several other recommendations offered in this chapter. For example:

- NCSES and NSF's Science of Science and Innovation Policy program could formulate a set of key questions they believe NSF award data could help answer and then work with relevant directorates to fund this research.
- NCSES could work to share some of the tools used to add structure (in the form of automatically assigned topics) to awards.
- NCSES could also share the topics themselves, either as additions to the existing online awards database or as a separate metadata file.

RECOMMENDATION 7-7: The National Center for Science and Engineering Statistics should explore methods of mining the awards database at the National Science Foundation as one means of discovering leading pathways for transformational scientific discoveries. NCSES should engage researchers in this exploratory activity, using its grants program. NCSES

should develop mechanisms for using the tools and metadata developed in the course of this activity for the development of leading indicators of budding science and engineering fields.

One way to develop these ideas further would be through a contest for research funding or a prize competition. Several "open innovation" organizations, such as InnoCentive, the Innovation Exchange, and challenge.gov, facilitate this type of contest. Working with an outside entity to design and administer a contest would allow NCSES to focus on the problems it hoped to address rather than the implementation details of the contest. The National Research Council (2007) report Innovation Inducement Prizes at the National Science Foundation and NSF's new Innovation Corps Program could also serve as useful models, although these resources are focused more specifically on technology commercialization. If the contest were designed to address statistical questions related to the usefulness of web-based data sources, it would be necessary to supply some sample data, and this might affect negotiations with companies. For example, LinkedIn might be willing to supply its data for NCSES to use but unwilling to allow use of the data in a public contest.

How can a federal statistical agency develop and rely on web-based and scientometric tools to produce gold-standard data for periodic publication? This is a basic question that needs to be considered in the current climate of rapidly changing technologies and increasing demands for data. There are numerous related questions, including: How can an agency overcome looming privacy and security issues? And how many personnel internal to the agency will it take to develop and operate the tools to produce the indicators?
These are good questions that will need to be fully addressed before NCSES or any other federal statistical agency implements the frontier methods described in this section. One way to address them is by example. In 2011, NIH decided to sponsor a competition[13] to find improved methods for using the National Library of Medicine (NLM) to show knowledge flows from scientific exploration through to commercialized products. The agency also wanted to use the NLM resource for taxonomic development and for showing relationships among research activities. Knowledge spillovers and effects are difficult to measure. NIH determined that one way to mine millions of data entries would be to automate the process. Yet that was not the expertise of any specific department at NIH, and it was important to cast a broad net to obtain the best expertise for addressing the problem. The competition was announced on challenge.gov and was titled "The NLM Show Off Your Apps: Innovative Uses of NLM Information Challenge." It was open to individuals, teams of individuals, and organizations, and its purpose was to "develop innovative software applications to further NLM's mission of aiding the dissemination and exchange of scientific and other information pertinent to medicine and public health."[14] The competition ended August 31, 2011, and winners were announced on November 2.[15]

[13] The panel thanks Jerry Sheehan (National Institutes of Health) for providing information and materials on this competition (see http://showoffyourapps.challenge.gov/ [December 2011]).
[14] See http://showoffyourapps.challenge.gov/ [December 2011].
[15] The U.S. Census Bureau ran a visualization competition in 2012 to develop a statistical model that could predict census mail return rates (see U.S. Census Bureau, 2012).
Another example of a competition is the Netflix Prize, documented in the materials for the seminar of the Committee on Applied and Theoretical Statistics and the Committee on National Statistics of the National Academy of Sciences titled "The Story of the Netflix Prize" (November 4, 2011) (see http://www.netflixprize.com/community/ and Gillick, 2012).

SUMMARY

In this chapter, the panel has presented seven recommendations. The focus is on exploratory activities that should enable NCSES to produce, over time, STI indicators that measure more accurately what users really want measured, and to do so in a more timely fashion. Researcher engagement and incentives for exploratory activities are important aspects of these recommendations. While the recommendations in Chapters 4-6 take priority over those in this chapter, the panel views these exploratory efforts as important investments in the long-term viability of NCSES with respect to its ability to meet evolving user needs in changing technological and economic environments.

ANNEX 7-1: POTENTIAL DATA SOURCES TO EXPLORE

Measuring Research Impact

Considerable activity is focused on finding better measures than citations for the impact of papers (see Priem and Hemminger, 2010, for an overview). The approaches being used fall into three broad categories:

- Impact measured through refinements of citation counts: The Eigenfactor algorithm (http://www.eigenfactor.org/) gauges impact by computing impact-weighted citation counts. Citations from high-impact papers count for more than citations from low-impact papers. The algorithm is related to Google's PageRank algorithm for determining the relevance of web pages.
- Impact measured through aggregation of online indicators:
  - In addition to citations, Public Library of Science (PLOS) journals use article-level metrics to track usage statistics (article views and downloads), user feedback, and blog posts (Patterson, 2009).
  - Total Impact is an application programming interface (API) that allows sites to display PLOS-style article-level metrics for arbitrary articles (http://impactstory.org/).
  - Altmetric.com is a start-up that tracks mentions of scholarly articles in blogs, social media, newspapers, and magazines and provides scores for articles (http://altmetric.com/).
- Impact gauged by expert raters: Faculty of 1000 is a subscription-based service that selects new publications deemed by expert reviewers to be important. Articles are assigned a numerical rating (http://f1000.com/).

Measuring Knowledge Diffusion

Diffusion within the Research Literature

Citation flows from Thomson/Reuters have been analyzed to gauge the flow of knowledge within disciplines (see, e.g., Rosvall and Bergstrom, 2008). These flows can provide insight into the extent to which basic research in various fields propagates out into more applied disciplines.
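The impact-weighted citation counting behind Eigenfactor and PageRank can be illustrated with a short power iteration over a toy citation graph. Edges point from citing paper to cited paper; the graph and damping factor are illustrative choices, not Eigenfactor's exact algorithm.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration: score flows along citations, so a citation from a
    high-scoring paper is worth more than one from a low-scoring paper."""
    nodes = {n for edge in links for n in edge}
    score = {n: 1.0 / len(nodes) for n in nodes}
    cites = {n: [b for a, b in links if a == n] for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = cites[n] or list(nodes)   # dangling nodes spread evenly
            for t in targets:
                new[t] += damping * score[n] / len(targets)
        score = new
    return score

# Toy graph: papers A and B both cite C; C cites D
ranks = pagerank([("A", "C"), ("B", "C"), ("C", "D")])
# C outranks A and B, and D inherits weight from the high-scoring C
```

The same recursion underlies journal-level measures: a raw citation count would treat A's and C's citations identically, while the iterated score does not.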
Diffusion within and outside the Research Literature

The Kauffman-funded COMETS (Connecting Outcome Measures in Entrepreneurship Technology and Science) database (Ewing Marion Kauffman Foundation, 2012) and COMETSandSTARS database seek to shed light on the next stage of diffusion of ideas, from research to products (Zucker et al., 2011). The databases link awards from NSF/NIH, patents, papers from Web of Knowledge, doctoral dissertations, and more. The initial implementation of STAR METRICS at NSF involves similar types of linkages, initially linking research awards to patents and jobs (supported by awards), with ambitious future plans for tracking outputs such as publications, citations, workforce outcomes, public health outcomes, and more (National

Institutes of Health, 2012). The linking was accomplished using sophisticated text mining tools (Lai et al., 2011), in this case a variant of the Torvik-Smalheiser algorithm (Smalheiser and Torvik, 2009).

Diffusion outside the Research Literature

Alexopoulos and Cohen (2011) have mined Machine Readable Cataloging (MARC) records (Library of Congress, 2012) from the Library of Congress to identify patterns of technical change in the economy (see Box 7-2). They argue that the book publication counts captured in these records correspond more closely than patent records to the commercialization of ideas. Other tools for mining data in books include:

- Google's Ngram Viewer, a search engine for n-grams in Chinese, English, French, German, Hebrew, Spanish, and Russian books published between 1800 and 2008; and
- Culturomics for arXiv, a search engine for n-grams in scientific preprints published in arXiv between 1992 and 2012.

BOX 7-2
Tracking the Commercialization of Technologies through Records of New Books

Michelle Alexopoulos and collaborators at the University of Toronto have been measuring the commercialization of technology using records of new books from the Library of Congress (Alexopoulos and Cohen, 2011). The idea is that publishers invest in new books on commercially promising innovations and stop investing when innovations are in decline. Hence, a count of new book titles on a particular technology provides an indicator of the success of that technology in the marketplace. Alexopoulos and Cohen trace the diffusion of such inventions as the Commodore 64 and Microsoft Windows Vista by searching for related keywords in new book titles. One potential generalization of this work would be to attempt to trace the flow of ideas back to the research that preceded commercialization.
This task would be considerably more difficult than tracking commercialization, as it would be necessary to examine vastly more papers than books, but techniques such as automated topic extraction could make the task more feasible.

SOURCE: Panel's own work.

Two of NSF's administrative datasets could potentially shed additional light on the translation of research work into commercial products:

- NSF proposals have a required section (National Science Foundation, 2011b) on the impact of past NSF-funded work.
- NSF-funded researchers submit Research Performance Progress Reports (National Science Foundation, 2012c) that document accomplishments, including inventions, patent applications, and other products.
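Mining such progress reports need not require sophisticated tools to yield a first-pass signal. The sketch below flags reports that mention commercialization-related outputs with simple keyword patterns; the award identifiers, report texts, and pattern list are all invented for illustration, and a production system would use far richer text mining.

```python
import re

# Invented report texts; real inputs would be Research Performance
# Progress Report narratives.
reports = {
    "award-0001": "We filed a provisional patent application covering the sensor.",
    "award-0002": "Results were published in two journal articles.",
    "award-0003": "A patent was granted and a startup licensed the technology.",
}

# Naive keyword patterns signalling commercialization-related outputs.
PATTERNS = {
    "patent": re.compile(r"\bpatent(s|ed)?\b", re.IGNORECASE),
    "license": re.compile(r"\blicens(e|ed|ing)\b", re.IGNORECASE),
}

def flag_outputs(texts):
    """Return, for each award, the set of output types its report mentions."""
    return {award_id: {label for label, pattern in PATTERNS.items()
                       if pattern.search(text)}
            for award_id, text in texts.items()}

print(flag_outputs(reports))
# award-0001 and award-0003 mention patents; award-0003 also mentions licensing.
```

A keyword pass like this over-counts (a report may discuss someone else's patent), so its output is a candidate list for closer inspection rather than an indicator in itself.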
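The title-count indicator described in Box 7-2 amounts to counting, per publication year, new book titles that mention a technology keyword. A minimal sketch follows; the (year, title) records are invented stand-ins for fields extracted from MARC catalog records.

```python
from collections import Counter

def new_title_counts(records, keywords):
    """Count new book titles per publication year whose title mentions
    any of the given technology keywords (case-insensitive)."""
    counts = Counter()
    lowered = [k.lower() for k in keywords]
    for year, title in records:
        text = title.lower()
        if any(k in text for k in lowered):
            counts[year] += 1
    return dict(counts)

# Toy stand-ins for MARC-derived (publication year, title) pairs.
records = [
    (1983, "Programming the Commodore 64"),
    (1984, "Commodore 64 Graphics and Sound"),
    (1984, "Introduction to Pascal"),
    (1986, "Commodore 64 Tricks and Tips"),
    (2007, "Windows Vista: The Missing Manual"),
]

print(new_title_counts(records, ["commodore 64"]))
# {1983: 1, 1984: 1, 1986: 1}
```

A rising count suggests publishers see a market around the technology; a declining count suggests commercial interest is waning, which is the signal Alexopoulos and Cohen exploit.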
The STEM Labor Market

Demand

Large job boards such as Monster.com or job board aggregators such as Indeed.com or SimplyHired.com could provide a rich source of information on demand for STEM professionals. The Conference Board's Help Wanted OnLine dataset includes jobs collected from 16,000 online job boards (The Conference Board, 2012).

One can collect data from a site such as Monster.com either by scraping information from the public website or by negotiating with the site for access to the data. Two reasons for preferring negotiation are legality and data structure. The legality of web scraping has been challenged several times in courts both in the United States and abroad,16 and there appears to be no consensus on what is legal. However, all the cases to date that the panel found involved one for-profit company scraping data from another for-profit company's site for commercial use. For example, a travel company might use web scraping to collect information on ticket prices from an airline and then use those prices to facilitate customers' comparative shopping. During the course of this study, the panel found no example of a nonprofit organization, government agency, or academic researcher being sued over web scraping.

Supply

Several new social networks for researchers could be used to learn more about career trajectories in the sciences, particularly nonacademic careers:

- ResearchGate: http://www.researchgate.net/
- Mendeley: http://www.mendeley.com/
- Academia.edu: http://academia.edu/

LinkedIn.com is a broader social network for professionals that had 175 million members as of June 2012.

Several initiatives may make new data on researchers available online. Vivo is a web platform for exposing semantic data on researchers and their work on the websites of research institutions.
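The data-structure half of the scraping-versus-negotiation argument is easy to see in code: scraped postings arrive as HTML that must be parsed back into fields, whereas negotiated access would deliver structured records directly. The sketch below parses a small embedded HTML fragment, whose markup and class names are invented, using only the standard library; a real scraper would fetch pages over the network and face far messier markup.

```python
from html.parser import HTMLParser

# Invented markup standing in for one page of a job-board listing.
SAMPLE = """
<ul>
  <li class="job">Data Scientist - Acme Corp</li>
  <li class="job">Mechanical Engineer - Widgets Inc</li>
  <li class="job">Office Manager - Widgets Inc</li>
</ul>
"""

STEM_TERMS = ("scientist", "engineer", "developer", "statistician")

class JobTitleParser(HTMLParser):
    """Collect the text of <li class="job"> elements."""
    def __init__(self):
        super().__init__()
        self.in_job = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "job") in attrs:
            self.in_job = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_job = False

    def handle_data(self, data):
        if self.in_job and data.strip():
            self.titles.append(data.strip())

parser = JobTitleParser()
parser.feed(SAMPLE)
stem = [t for t in parser.titles if any(k in t.lower() for k in STEM_TERMS)]
print(len(stem), "of", len(parser.titles), "postings look STEM-related")
```

Every change to the site's markup breaks a parser like this, which is the practical case for negotiated, structured data feeds even where scraping is legally uncontested.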
Vivo tools provide a way for institutions to create rich, structured datasets on their research activities.

SciENCV is an NIH demonstration project that allows researchers to create public research profiles. These profiles are designed to streamline the process of applying for NIH and other grants, but they will also generate useful structured datasets on researchers.

16. For example, Ryanair, a European airline, initiated a series of legal actions to prevent companies such as Billigfluege and Ticket Point from scraping ticket price data from its website to allow for easier comparison shopping (see Ryanair, 2010). In a 2000 California case, eBay v. Bidder's Edge, eBay sued Bidder's Edge over price-scraping activities; see http://www.law.upenn.edu/fac/pwagner/law619/f2001/week11/bidders_edge.pdf [December 2011]. And in another California case, in 2009, Facebook, Inc. v. Power Ventures, Inc., Facebook sued Power Ventures over scraping of personal user data from the Facebook site; see http://jolt.law.harvard.edu/digest/9th-circuit/facebook-inc-v-power-ventures-inc [December 2011].
Brazil's Lattes Platform is a database of all Brazilian researchers and their work. It extends the ideas in Vivo and SciENCV, and participation is mandatory.

The Open Researcher and Contributor ID (ORCID) project seeks to provide researchers with unique identifiers that will be used as author identifiers for publications, awards, and so on. The goal is to facilitate the linking of datasets involving individual researchers. ORCID will serve as a registry rather than a data provider, but the use of these identifiers can help structure existing unstructured datasets. (Some researchers [Smalheiser and Torvik, 2009] have expressed skepticism about the utility of such identifiers, however.)

The U.S. Department of Labor issues quarterly foreign labor certification data for H-1B visa holders (U.S. Department of Labor, 2012a). The dataset contains job titles and employers for new H-1B holders, and degree level can be inferred for some broad categories of jobs (e.g., "postdoctoral scholar"). The data are imperfect: not all Ph.D.'s are on H-1B visas, there will be some overlap between Survey of Earned Doctorates (SED) respondents and those receiving H-1B visas, and job title is an imperfect predictor of degree status. Even so, one may be able to see useful year-to-year changes in the numbers of foreign postdoctoral fellows.

Finally, there are several databases of dissertations: ProQuest, WorldCat, and OpenThesis.
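The year-to-year postdoctoral signal described above can be approximated by classifying job titles and counting by year. In this sketch the (year, job title) rows are invented stand-ins for fields in the Labor Department's certification files, and a single substring test stands in for a real title classifier.

```python
from collections import Counter

# Invented rows mimicking (certification year, job title) fields from
# the Labor Department's H-1B foreign labor certification data.
rows = [
    (2011, "Postdoctoral Scholar"),
    (2011, "Software Engineer"),
    (2012, "Postdoctoral Research Fellow"),
    (2012, "Postdoctoral Scholar"),
    (2012, "Accountant"),
]

def postdoc_counts_by_year(data):
    """Count certifications whose job title signals a postdoctoral
    position, a rough proxy for doctorate-level hires."""
    return dict(Counter(year for year, title in data
                        if "postdoc" in title.lower()))

print(postdoc_counts_by_year(rows))
# {2011: 1, 2012: 2}
```

Because titles are an imperfect predictor of degree status, the absolute counts mean little; the year-over-year change in a consistently classified series is the more defensible indicator.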