
7

A Paradigm Shift in Data Collection and Analysis

The National Center for Science and Engineering Statistics (NCSES) finds itself in the midst of a paradigm shift in the way data are gathered, manipulated, and disseminated. The agency’s science, technology, and innovation (STI) indicators program must deal with several challenges:

  • As discussed in previous chapters, traditional surveys face increasing expense, declining response rates (Williams, 2013), and lengthy time lags between when data are gathered and when derived indicators and other statistics can be published.
  • Tools for data extraction, manipulation, and analysis are evolving rapidly.
  • Repositories of STI measures that users demand are distributed among several statistical agencies and private repositories.
  • Sources of knowledge generation and innovation are expanding beyond developed countries to emerging and developing economies.
  • Users have rising expectations, and they are demanding more access to statistics that are closer to actual measures of what they want to know.

This chapter explores this changing landscape of data collection and analysis and its implications for NCSES’s STI indicators program.

NEW METHODS, NEW DATA SOURCES

Standards and taxonomies for data collection and analysis are expected to change before the end of this decade. OECD’s National Experts on Science and Technology Indicators are revising the Frascati Manual (OECD, 2002) and the Oslo Manual (OECD-Eurostat, 2005) on a rolling basis. The group plans to work on priority themes and to build a better bridge between the two manuals. The North American Industry Classification System (NAICS) codes and the Standard Occupational Classification (SOC) codes may also undergo revision within the next decade.

NCSES and, indeed, other government statistical agencies confront a world of dizzying change in the way information technology is integrated into their data-gathering and data-management activities. The World Wide Web, in particular, has been transformational in enabling new forecasting and data collection methods that yield useful insights in almost real time. These tools provide information much more rapidly than is possible with traditional surveys, which can entail lags of multiple years.

Other changes are occurring as well. In his November 2011 presentation at the annual meeting of the Consortium of Social Science Associations, Robert Groves (2011a) conveyed the status of U.S. surveys: “Threatened coverage of frames; falling participation rates; increasing reliance on nonresponse adjustments; and for surveys with high response rate targets, inflated costs.” His proposed solution for agencies to address these issues is to develop an approach of a “blended data world by building on top of existing surveys.”1 Groves (2011b) envisions multimodal data acquisition and manipulation of data, including “Internet behaviors; administrative records; Internet self-reporting; telephone, face-to-face, paper surveys; real-time mode switch to fill in missing data; and real-time estimation.”2

Some of these innovations are already being implemented at the Census Bureau. The agency’s economic directorate has combined administrative data with survey data in inventive ways. It also handles multiple response modes—paper forms, Internet responses, and telephone interviews. To address the timeliness of economic indicators, it has devised workable decision rules for defining which estimates are preliminary and what information is required to revise them.

____________________

1For further comments on this point, see U.S. Census Bureau (2011a,b,c).

2See Chapter 4 for detail on how business practice data (which include administrative records and web-based data) can be used to obtain innovation indicators.


Perhaps the most challenging innovation in Groves’ vision of the future of surveys is performing real-time estimation during data collection. Groves (2011b) envisions implementing real-time estimation routines—including imputation, nonresponse adjustment, and standard error estimation—after every 24 hours of data collection. Part of this process would entail assessing whether the increase in standard error due to imputation was acceptable or whether additional nonresponse follow-up was necessary. In this context, imputation can, in effect, be viewed as another mode of data collection. To make trade-off decisions about whether to terminate nonresponse efforts for a case using a particular mode, key statistics on the fully imputed estimates and measures of the imputation standard error and sampling standard error of the estimates would be actively tracked. Groves believes successfully implementing this real-time estimation and decision process at the Census Bureau would take at least 5 years.
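
To make the idea concrete, the following minimal Python sketch shows the kind of daily decision rule Groves describes: impute for nonrespondents, estimate the sampling and imputation components of the standard error, and decide whether follow-up should continue. The mean-imputation model, the simple random sampling variance formula, and the 10 percent threshold are illustrative assumptions, not Census Bureau practice.

```python
import numpy as np

def daily_estimation(responses, n_sampled, se_threshold_share=0.10):
    """Toy daily estimation step: impute nonrespondents with the respondent
    mean, estimate sampling and imputation components of the standard error,
    and decide whether nonresponse follow-up should continue. The mean
    imputation, simple-random-sampling variance, and 10 percent threshold
    are illustrative assumptions, not Census Bureau practice."""
    y = np.asarray(responses, dtype=float)
    n_miss = n_sampled - y.size

    # Fully imputed estimate: the respondent mean stands in for nonrespondents.
    estimate = y.mean()

    # Sampling standard error of the mean, had everyone responded.
    sampling_se = y.std(ddof=1) / np.sqrt(n_sampled)

    # Extra uncertainty attributable to imputing the missing cases.
    imputation_se = y.std(ddof=1) * np.sqrt(n_miss) / n_sampled

    total_se = np.sqrt(sampling_se**2 + imputation_se**2)
    continue_followup = imputation_se > se_threshold_share * total_se
    return estimate, sampling_se, imputation_se, bool(continue_followup)

# Example: 600 of 1,000 sampled units have responded after a day of collection.
rng = np.random.default_rng(0)
print(daily_estimation(rng.normal(50.0, 10.0, size=600), n_sampled=1000))
```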

In this vein, one issue that needs to be explored is the feasibility of blending the use of administrative records, scientometric tools, and survey techniques to improve the accuracy of data on STI human capital measures and other indicators that NCSES produces, such as research and development (R&D) input and performance measures. A multimodal approach would help in creating longitudinal series using existing and new information. In the near term, the topic could be explored through a workshop designed specifically to discuss the conceptual framework and feasibility of blending data acquisition techniques and using this mixed-methods approach to develop new indicators.3 This approach could be useful for developing real-time maps of networked scholars while measuring return on investments from federal research funds as they are used and linking them to outputs (papers and patents). At the same time, this approach would include periodically assembling basic data on education, employment, work activities, and demographic characteristics.

Data from administrative records and web-based sources—termed “business practice data” (see Chapter 4)—have been used for many years at federal agencies with two purposes: to benchmark sample survey data and, along with sample survey data, to produce official statistics. Horrigan (2012, 2013) gives several examples of sources being used by the Bureau of Labor Statistics (BLS), including the Billion Prices Project data; retail scanner data; the J.D. Power and Associates used car frame; stock exchange bid and ask prices and trading volume data; universe data on hospitals from the American Hospital Association; diagnosis codes from the Agency for Healthcare Research and Quality, used to develop the producer price index; Energy Information Agency administrative data on crude petroleum for the International Price Program; Department of Transportation administrative data on baggage fees and the Sabre data, used to construct airline price indices; insurance claims data, particularly Medicare Part B reimbursements to doctors, used to construct health care indices; and many more sources of administrative records data from within the U.S. government, as well as web-based data. According to Horrigan (2013), in addition to the development of price indices, administrative records and web-scraping data are used to “improve the efficacy of estimates … the Current Employment Statistics (CES) Survey uses administrative data from the Quarterly Census of Employment and Wages (QCEW)….” BLS also is “using web-scraping techniques to collect input price information used to increase the sample of observations we use to populate some of our quality adjustment models” (Horrigan, 2013, p. 26). Horrigan cautions, however, that “the principle of constructing an inflation rate based on the rate of price increase for a known bundle of goods with statistically determined weights lies at the heart of what we do. While research may show the viability of using a web-scraped source of data for a particular item, it needs to be done within the framework of this methodology” (Horrigan, 2013, p. 27).

The statistical methodology related to sampling and weights must be developed if these multimodal techniques are to be fully relied upon to deliver bedrock STI indicators. The panel must stress, moreover, that business practice data must be regularly calibrated using sample survey data. Business practice data contain a wealth of detailed and rapidly changing information that is not practically acquired using surveys. However, businesses and government enterprises generally do not maintain the sort of consistency across organizations, categories, and time that would enable cross-sectional and longitudinal comparisons. In time, and with appropriate financial and human resources, NCSES and other statistical agencies should be able to publish indicators based on business practice data, but only if the raw data are adjusted using a well-designed program of sample surveys. Indeed, the challenge will be to design the most efficient combination—financially and statistically—of traditional sample surveys and administrative and web-based sources.

IMPLICATIONS FOR NCSES

NCSES needs to determine now how it will handle the above changes if they materialize and how the types and frequencies of various STI indicators will be affected. During the panel’s July 2011 workshop, Alicia Robb of the Kauffman Foundation encouraged NCSES to explore the use of administrative records to produce STI indicators. She also cautioned, however, that ownership issues associated with the use of those data will have to be addressed before they can become a reliable complement to traditional survey data.

____________________

3Statistica Neerlandica has prepublication views of a series of articles on the use of administrative records for analytical purposes, including regression analysis; see http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1467-9574/earlyview [December 2011]. For theoretical foundations of combining information from multiple sources of data, see Molitor et al. (2009). Also see Eurostat (2003).


Also at the workshop, Stefano Bertuzzi of the National Institutes of Health (NIH) and the STAR METRICS Program at that time, in collaboration with Julia Lane from the National Science Foundation (NSF), presented techniques for using administrative records at universities to determine the impact of federal research funds on scientific outputs and the development of human capital in the physical and biological sciences. In follow-on discussions, George Chacko, who took the helm of the STAR METRICS Program at NIH in late 2011, described that $1.5 million per year activity. Level 1 data outputs (described by Bertuzzi) are in the data validation process. There are two potential sources of error (data entry and data transformation into readable files), but biases are not yet known. Chacko noted that further research is needed to determine the quality of the Level 1 data. He also described Level 2, which will allow the integration of research project data; that effort had not yet begun as of the writing of this report. Participants in Level 2 will include the U.S. Department of Agriculture, the Environmental Protection Agency, NIH, and NSF. Because each agency has different ways of reporting the same kinds of grants, one of the first tasks will be the creation of a data ontology and taxonomy before a database is developed. Sometime in the future, Chacko expects that STAR METRICS Level 3 will enable demographic identifiers for individuals, thereby allowing analysis of science, technology, engineering, and mathematics (STEM) outcomes by gender and race/ethnicity.

In May 2012, Christopher Pece gave a presentation at NSF on NCSES’s Administrative Records Project (ARP), which is still in the feasibility testing stage. Pece cited the National Research Council (2010) report recommending that NCSES (Pece, 2012):

  • develop R&D descriptors (tags) in administrative databases to better enable identification of R&D components of agency or program budgets;
  • use administrative data to test new classification schemata by direct access to intramural spending information from agency databases; and
  • develop several demonstration projects to test for the best method for moving to a system based at least partly on administrative records.

Accordingly, NCSES is working with a subset of agencies that have data reported in the Federal Funds Survey and Federal Support Survey to pilot methods for using administrative records to produce data comparable to the survey data. In addition to budgetary constraints and negotiation of interagency agreements, other challenges must be addressed, including the creation of data tags and R&D crosswalks between agency systems that use different data taxonomies, accounting practices, and information technology systems. The panel was impressed by NCSES’s willingness to experiment with the use of administrative records to complement its survey-based datasets, but also recognized the need for NCSES to acquire increased resources—funding and staff—at least in the short term, with the potential ultimately for reduced survey costs, reduced data validation costs, and increased timeliness of data delivery.

During the July 2011 workshop, presentations by Erik Brynjolfsson of the Massachusetts Institute of Technology, Lee Giles of Pennsylvania State University, Carl Bergstrom of the University of Washington, and Richard Price of Academia.edu provided insights regarding tools that can be used to obtain up-to-date information on science and engineering networks and linkages between human capital investments and STI outcomes and impacts. These experts showed panel members how nowcasting, netometrics, CiteSeerX, Eigenfactor, and Academia.edu (similar to Lattes in Brazil) can be used to create scientometric4 data for use in developing STI “talent” indicators. Such tools can be used, say, to track intangible assets and knowledge flows from online recruiting sites and social networks.

Many questions remain about the representativeness of data sources such as STAR METRICS Level 1 (which includes data from 80 universities) and the datasets drawn from web-based sources, primarily because they are nonrandom convenience samples. Recent work on medical care expenditures by Ana Aizcorbe at the Bureau of Economic Analysis (BEA)5 and by Dunn and colleagues (2012) shows that insurance companies’ patient claims data can be used to develop reliable price estimates, given the appropriate weighting strategy. Both projects use MarketScan data, which include sampling weights designed to provide population estimates from the MarketScan sample. This is a potentially cost-effective approach compared with the use of traditional household surveys (in this case, the Medical Expenditure Panel Survey [MEPS]). Clearly, the MarketScan data cannot address all of the questions that the MEPS was designed to address. However, Dunn and colleagues find that the MarketScan data “produce spending growth figures that are more aligned with other benchmark estimates of price and expenditure growth from national statistics” (Dunn et al., 2012, p. 26).

INDICATORS FROM FRONTIER TOOLS: EXAMPLE OF THE DATA SCIENCE DISCIPLINE

Consider the rise of data science, an emerging discipline that encompasses analysis, visualization, and management of large datasets. (“Large” in this context typically means many millions or billions of records.) The digitization of records, increasing numbers of sensors, and inexpensive storage have combined to produce enormous quantities of data in the sciences and business. Data scientists use specialized techniques to sift through these troves of information to discover new insights and create new value.

____________________

4In practice, scientometrics often uses bibliometrics, a measurement of the impact of (scientific) publications and patents (see Chapter 5).

5See Aizcorbe et al. (2012) for more detail on the Health Care Satellite Accounts at BEA.


Google’s chief economist, Hal Varian, has characterized the statistical work of data scientists as “the sexy job in the next 10 years” (Lohr, 2009); Forbes magazine describes the data scientist’s role as “the hot new gig in tech” (Lev-Ram, 2011); and The Economist (2011) says data science is “where the geeks go.”

In line with perennial concerns about the supply of knowledge workers in the United States (Atkinson, 1990; Freeman and Aspray, 1999; Jackson, 2001; The Partnership for a New American Economy and The Partnership for New York City, 2012), data scientists are already projected to be in short supply in the near future. According to a 2011 McKinsey study (Manyika et al., 2011), “a shortage of people with the skills necessary to take advantage of the insights that large datasets generate is one of the most important constraints on an organization’s ability to capture the potential from big data.” Likewise, an EMC Corporation (2011) study foresees a “looming talent shortage.” Access to talent is not an issue just for industry: 23 percent of respondents to a 2011 Science survey said their laboratories were lacking in data analysis skills.

Given that past projections of shortages of knowledge workers have proven controversial (Lowell and Salzman, 2007; Matloff, 2003; Stanford News Service, 1995; Weinstein, unpublished), it is worth examining the above claims more closely. Consider some of the questions a policy maker concerned about the future data scientist workforce might ask of NCSES:

  • How many new data scientists are graduating each year?

-   How many in the United States?

-   How many in other parts of the world?

  • Where were existing data scientists educated?

-   What schools?

-   What programs?

  • Where are data scientists employed?

-   What fraction work in industry?

-   In government?

-   In academia?

  • What range of salaries do data scientists command?

-   How much do salaries vary with degree level?

-   With sector?

  • Is the United States producing enough or too many data scientists?

A funding agency director (such as the director of NSF) might want to know:


  • Is NSF funding data science research?

-   Which directorates?

-   How much is NSF spending?

  • What basic research does data science draw upon?

NCSES’s existing STEM workforce surveys would be hard pressed to answer these questions. For one thing, “data science” is not in the taxonomy of fields used in the STEM degree surveys, so one cannot obtain data science degree counts directly from existing NCSES datasets. Similarly, the taxonomy of occupations used by the Current Population Survey/American Community Survey does not contain “data scientist,” so NCSES datasets derived from these surveys will likewise miss this group. Fixed, slowly evolving taxonomies restrict the ability of existing surveys to provide insights about emerging fields.

One potential remedy might be a one-time custom survey. Given the cost of existing surveys, however, this would likely be a prohibitively expensive undertaking. A custom survey would entail the additional difficulty that there is no obvious, well-defined frame. An alternative might be for NCSES to update its taxonomy of fields for future surveys. This would be a slow process, however: turnaround time for the NCSES surveys is on the order of 2 years (National Science Foundation, 2012a), and additional time would be needed to formulate and validate a revised taxonomy. Even if taxonomic issues were resolved, the limitation would remain that NSF’s datasets cover only the United States.

An Alternative Approach

Datasets exist that could shed light on questions about data science, but they are very different from those produced by NCSES. They are not among the datasets typically used to analyze the STEM workforce in part because, while they offer significant advantages over NCSES’s data, they also come with significant challenges.

Consider, first, the task of counting data science doctoral degrees. Rather than using a survey to ask new doctorate recipients whether they did data science-related research, an expert could look at their dissertations to make this determination. The expert could then tally the number of new data science graduates. The expert could also identify the degree-granting institutions and programs from the dissertations. The idea is not merely theoretical: both ProQuest and WorldCat maintain large databases of dissertations (Online Computer Library Center, 2012; ProQuest, 2012). Although counting data scientists in this fashion is labor-intensive, it is potentially faster and less expensive than either conducting a custom survey or waiting for a taxonomy update in an existing survey. In addition, the approach has the benefit of providing global counts of data science degrees, not just U.S. counts.
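
A minimal sketch of the tallying step follows, assuming dissertation records with hypothetical fields (title, abstract, institution, country, year); access to the ProQuest or WorldCat databases themselves, and the expert review of flagged records, are not shown.

```python
from collections import Counter

# Hypothetical dissertation records; a real extract from ProQuest or WorldCat
# would have different (and richer) field names.
dissertations = [
    {"title": "Scalable topic models for large text corpora",
     "abstract": "We develop data mining and machine learning methods ...",
     "institution": "Univ. A", "country": "US", "year": 2012},
    {"title": "Medieval trade routes of the Hanseatic League",
     "abstract": "An archival study ...",
     "institution": "Univ. B", "country": "DE", "year": 2012},
]

KEYWORDS = ("data science", "data mining", "machine learning", "big data")

def looks_like_data_science(record):
    """Crude keyword screen; a human expert (or a trained classifier)
    would review the flagged records before tallying."""
    text = (record["title"] + " " + record["abstract"]).lower()
    return any(k in text for k in KEYWORDS)

flagged = [d for d in dissertations if looks_like_data_science(d)]
print(Counter((d["year"], d["country"]) for d in flagged))   # counts by year and country
print(Counter(d["institution"] for d in flagged))            # degree-granting institutions
```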

For the task of learning about the number of data scientists in the current workforce, one could examine a database of resumés; count the number of people whose job titles or job descriptions include such phrases as “data scientist,” “data mining,” or “big data”; tally their educational backgrounds; and observe their sectors of employment. A large, global resumé database such as LinkedIn (http://www.linkedin.com; see Box 7-1) or profiles of science professionals in science-related social networks such as ResearchGate (http://www.researchgate.net/), Academia.edu (http://academia.edu/), or BioMedExperts (http://www.biomedexperts.com/) could be used for this procedure.


BOX 7-1
Employment Shifts from LinkedIn Data

LinkedIn’s data science team (The Noisy Channel, 2012) recently collaborated with the White House Council of Economic Advisers to identify the industries that grew and shrank the most during the 2008-2009 recession and the subsequent recovery. By following people who were site members in 2007 longitudinally through 2011, they were able to see the rapid growth in renewable energy and Internet companies, as well as sharp declines in newspapers, restaurants, and the retail sector. The cohort they followed numbered in the tens of millions, and LinkedIn contains detailed data on its members’ educational backgrounds, so one can readily imagine conducting similar analyses restricted to workers with science, technology, engineering, and mathematics (STEM) degrees. Moreover, one of the study’s authors says that in principle, LinkedIn could track such changes in real time.

SOURCES: Nicholson (2012); The Economist (2012).

Again, the process of counting the resumés or profiles and classifying the associated educational backgrounds and employers would be labor-intensive, but it potentially could provide inexpensive and timely insights on the supply of data scientists in both the United States and international markets that would otherwise be unavailable or prohibitively expensive to generate.

To assess demand and salary levels for data scientists, one could turn to large databases of job listings such as Monster.com (http://www.monster.com/), Indeed.com (http://www.indeed.com/), or SimplyHired.com (http://www.simplyhired.com/). An expert could identify data science-related jobs and then tally offered salary levels as a function of the level of the job.

Mandel and Scherer (2012) recently used techniques of this sort to estimate the size and location of the “App Economy”—jobs related to smartphone and tablet applications and to Facebook plugins.6 Because this is a very recently created category of jobs, existing labor market statistics could not provide useful information about these types of jobs. Mandel and Scherer turned to The Conference Board’s Help Wanted OnLine data (The Conference Board, 2012), a collection of job listings from more than 16,000 Internet job boards. They counted job listings containing keywords associated with the App Economy (e.g., “iPhone,” “iPad,” “Android”) and then used historical data on the ratio of job listings to jobs in the tech sector to estimate the total number of App Economy jobs. They were able to identify the geographic distribution of App Economy jobs from the job locations listed in the ads. One might apply a similar approach to data science jobs by looking for keywords related to this field (e.g., “data science,” “data mining,” “big data”) and common data science tools and techniques (e.g., “map-reduce,” “Hadoop,” “Hive,” “Pig”).
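
The following sketch illustrates the arithmetic of this approach with made-up job ads, an assumed keyword list, and an assumed listings-to-jobs ratio; the original study estimates that ratio from historical tech-sector data rather than assuming it.

```python
from collections import Counter

# Hypothetical job-ad records; real Help Wanted OnLine data are licensed from
# The Conference Board and carry a much richer schema.
ads = [
    {"text": "Seeking data scientist with Hadoop and Hive experience", "state": "WA"},
    {"text": "iOS developer wanted for mobile app startup", "state": "CA"},
    {"text": "Registered nurse, night shift", "state": "CA"},
]

# Keyword list follows the terms suggested in the text.
TERMS = ("data science", "data mining", "big data", "map-reduce", "hadoop", "hive")

matched = [a for a in ads if any(t in a["text"].lower() for t in TERMS)]
ads_by_state = Counter(a["state"] for a in matched)

# Assumed ratio of online listings to actual jobs; illustrative only.
LISTINGS_PER_JOB = 0.05
jobs_by_state = {s: round(n / LISTINGS_PER_JOB) for s, n in ads_by_state.items()}
print(jobs_by_state)   # e.g., {'WA': 20}
```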

Mandel (2012) and Mandel and Scherer (2012) analyzed online help-wanted ads to track the demand for “app-related” skills geographically, using key words such as “iOS” and “Android.” This analysis made it possible to identify states with a higher-than-expected number of app-related want ads relative to their size (see Table 7-1). This procedure could be repeated for any innovation; one could identify clusters of green innovations, software innovations, or medical innovations in the same way, for example, at the state or metro level.

The data from help-wanted ads also can be combined with conventional survey data to provide a more complete and timely picture of innovation activities at the national or subnational level. Suppose, for example, one wanted to get a picture of R&D activities in biotechnology by state. The Business Research and Development and Innovation Survey (BRDIS) would provide some information, but many cells of the tables derived from its data would likely be suppressed to avoid disclosure. However, it would be possible to use biotechnology-related help-wanted ads to impute the missing state information without violating confidentiality. This analysis could even be taken down to the metro level, using help-wanted ads combined with state-level survey data to provide a benchmark. Using help-wanted ads to track the diffusion and use of innovation at both the national and subnational levels has several advantages: these ads are public and continuously refreshed; full databases of the ads already exist in digital form, available in near real time; the ads are semistructured—they are free text, but must include information about the skills and experience needed, using recognizable terms; and organizations such as The Conference Board already have procedures for automatically tagging them by occupation, industry, location, and so forth. As a result, the ads provide a continually changing real-time map of the skills employers need. In particular, as the content of the ads changes, one can see innovations diffusing through the economy.
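
A minimal sketch of the imputation idea follows, with entirely hypothetical numbers: the undisclosed remainder of a published national total is allocated across suppressed state cells in proportion to counts of biotechnology-related help-wanted ads.

```python
# Minimal sketch: allocate the undisclosed remainder of a published national
# biotechnology R&D total across suppressed state cells in proportion to
# counts of biotechnology-related help-wanted ads. All numbers are made up.
national_biotech_rd = 1000.0                               # published total ($ millions)
published_state_rd = {"CA": 400.0, "MA": 250.0}            # cells that were not suppressed
ad_counts_suppressed = {"MD": 120, "NC": 90, "WA": 60}     # ad counts in suppressed states

remainder = national_biotech_rd - sum(published_state_rd.values())
total_ads = sum(ad_counts_suppressed.values())
imputed = {s: round(remainder * n / total_ads, 1) for s, n in ad_counts_suppressed.items()}
print(imputed)   # {'MD': 155.6, 'NC': 116.7, 'WA': 77.8}
```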

Finally, to gauge existing support for data science research within NSF, an expert could read through existing NSF awards and identify those related to data science research. The expert could then identify the institutions and programs doing data science work, as well as tally the directorates supporting data science. To identify the basic research on which data science draws, the expert could compile a set of recent, important data science papers; follow their citations back, possibly multiple levels; and tally the journals and/or fields of research represented by frequently cited precursors to data science papers.

____________________

6See also Box 4-8 in Chapter 4 on the use of this approach to develop innovation indicators.


TABLE 7-1 The App Leaders: States #1-15

State | App Intensity (U.S. average = 1) | App Economy Jobs (thousands) | App Economy Economic Impact (millions of dollars, annual rate)
  1. Washington | 4.47 | 49.8 | $2,671
  2. California | 2.71 | 151.9 | 8,241
  3. Massachusetts | 1.71 | 21.4 | 1,143
  4. Oregon | 1.70 | 10.8 | 526
  5. Georgia | 1.56 | 24.0 | 1,062
  6. New Jersey | 1.29 | 19.5 | 1,087
  7. New York | 1.16 | 39.8 | 2,313
  8. Virginia | 1.04 | 15.0 | 788
  9. Delaware | 0.93 | 1.5 | 76
10. Colorado | 0.90 | 8.1 | 429
11. Illinois | 0.90 | 19.9 | 847
12. Connecticut | 0.88 | 5.6 | 294
13. Minnesota | 0.87 | 9.1 | 475
14. Utah | 0.86 | 4.1 | 192
15. Maryland | 0.84 | 8.4 | 436

NOTES: All figures are estimated as of April 2012 and include spillover jobs. Data are from The Conference Board, the Bureau of Labor Statistics, and calculations by South Mountain Economics LLC.

SOURCE: Reprinted with permission from Mandel and Scherer (2012).

Several large databases of papers and citations—such as Web of Science (Thomson Reuters, 2012) or Scopus (http://www.scopus.com/home.url)—could be used to facilitate the process of tracing papers back.
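
The tracing step can be expressed as a breadth-first walk backward over a citation graph. The sketch below uses a toy graph and journal lookup; in practice the edges would come from Web of Science or Scopus exports, and journals would be mapped to broader fields.

```python
from collections import Counter, deque

# Toy citation graph: paper -> papers it cites. In practice the edges would
# come from Web of Science or Scopus exports, and journals would be mapped
# to broader fields of research.
cites = {
    "ds_paper_1": ["stats_1976", "ml_1995"],
    "ds_paper_2": ["ml_1995", "db_1981"],
    "ml_1995":    ["stats_1976", "optim_1983"],
}
journal_of = {"stats_1976": "Annals of Statistics", "ml_1995": "Machine Learning",
              "db_1981": "ACM TODS", "optim_1983": "Mathematical Programming"}

def trace_back(seed_papers, max_depth=2):
    """Follow citations backward up to max_depth levels and tally the
    journals of the precursor papers."""
    seen = set(seed_papers)
    frontier = deque((p, 0) for p in seed_papers)
    tally = Counter()
    while frontier:
        paper, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for cited in cites.get(paper, []):
            tally[journal_of.get(cited, "unknown")] += 1
            if cited not in seen:
                seen.add(cited)
                frontier.append((cited, depth + 1))
    return tally

print(trace_back(["ds_paper_1", "ds_paper_2"]))
```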

Advantages and Challenges

Using the datasets described above to learn about the state of data science offers several advantages over using surveys. First, the datasets have already been created for other purposes, so the incremental cost of using them to learn about data science is modest. Second, the databases are all updated continuously, so the lengthy delays associated with gathering survey data do not come into play. And third, because experts classify the data, there is no locked-in, limiting, pre-existing taxonomy that could lead to misclassification of data scientists (although this also presents its own issues).

Along with the benefits associated with these new datasets, however, come challenges:

  • In many cases an expert is needed to assign data to categories because the datasets are unstructured (see the discussion of this issue in the next section). There will be uncertainty in this classification process, especially if multiple experts are used because they may not always agree.
  • Classifying large collections of dissertations, resumés, awards, and papers by hand is labor-intensive for the expert—there is an issue of scale.
  • Some of the datasets are commercial products, so one must pay for or negotiate access to the datasets. In addition, some of the datasets are sensitive and require special handling.

  • More generally, concerns have been raised about the inconsistency in the way R&D-related data are specified, making data sharing and linking by researchers difficult. A web-based infrastructure for data sharing and analysis, including clear data exchange standards, would be a useful first step in addressing this issue.7
  • Finally, the question of validation arises. Many of the databases cited above are incomplete in that they do not include the entire population of interest. It is necessary to understand the portion that is missing in order to estimate or at least bound the possible bias in using the database of interest to characterize that population.8 In addition to coverage and sampling biases, measurement, nonresponse, and interviewer biases should be examined to determine the validity of statistical indicators derived from such databases. Moreover, a process needs to be in place for measuring the reliability of the expert’s classifications.
  • As noted, most of the web-based datasets described here are neither representative samples nor complete censuses of the population of interest. That being the case, developing and implementing methods for using these datasets is largely tilling new ground for the staff of any statistical agency. Should NCSES decide to move in this direction, it will need to ensure that it has staff with the appropriate training and experience to develop and implement suitable analytic and data collection approaches in this new environment.

____________________

7See Haak et al. (2012, pp. 196-197) for a discussion of this problem and possible solutions.

8Sample surveys are used to draw inferences about well-defined populations. Survey methodologists have developed tools for measuring how valid and robust their inferences from surveys are. Conceptually, these methods can be applied to nonsurvey databases as well. See Groves et al. (2009), particularly Chapters 3 and 5.


Considerable progress has been made in addressing all of these challenges, but many important open questions remain.

A NEW DIRECTION FOR NCSES

The general approach described above for learning quickly and inexpensively about an emerging field by repurposing existing datasets holds considerable promise for improving understanding of many aspects of innovation in science and engineering. At the same time, the approach entails methodological and logistical problems. The tasks of structuring unstructured data, dealing with challenges of scale, negotiating access to data and protecting sensitive data, and validating nontraditional data sources are common to many potentially useful but nontraditional datasets.

The panel proposes that NCSES explore and experiment with these new, nontraditional data sources. This section describes four core areas in which NCSES could contribute to improving the state of the art in this area, with the goal of improving outputs from its data development and indicators programs.

Identification of Data Sources

Plummeting prices for data storage, low-cost sensors, improvements in data collection mechanisms, and increases in Internet access have given rise to vast new collections of digital data (The Economist, 2010a), and the total amount of digital data is growing rapidly—a 10-fold expansion occurred between 2006 and 2011 (Gantz et al., 2008). A wide variety of datasets could be used to better understand STEM innovation—the ones mentioned above barely scratch the surface. Annex 7-1 at the end of this chapter lists some promising possibilities.

NCSES could help answer two key questions regarding these new data sources: What are the most promising new datasets for understanding the state of STEM? and What are effective ways to analyze these datasets? NCSES has historically used a combination of internally generated and third-party datasets in assembling science and engineering indicators and InfoBriefs. The agency could test the waters by adopting the goal of including in its publications and on its website analyses performed with nontraditional data in the areas of human resources, R&D, and innovation. Such analyses could be performed by external researchers funded by targeted awards.

RECOMMENDATION 7-1: The National Center for Science and Engineering Statistics (NCSES) should use research awards to support the development and experimental use of new sources of data to understand the broad spectrum of innovation activities and to develop new measures of science, technology, and innovation. NCSES should also support the development of new datasets to measure changing public perceptions of science, international trade in technological goods and services, new regions for entrepreneurial activity in science and technology, and precommercialized inventions.

Structuring of Unstructured Datasets

The data generated from NCSES’s surveys are structured. Data are stored as tables of numbers, with each number having a well-defined meaning. As noted, many of the nontraditional datasets discussed above—perhaps the majority—are in unstructured form, such as free text. A traditional (but apocryphal) rule of thumb is that 80 percent of corporate data is unstructured (Grimes, 2011); a recent article in The Economist (2010b) estimates that 95 percent of new data is unstructured.

The databases of doctoral dissertations, resumés, job listings, and NSF grant proposals described above are vast and rich stores of information, but they are difficult to process by machine. The data science example given earlier assumes that a human expert is willing to spend weeks categorizing dissertations, job listings, and so forth. This role would likely be difficult to fill, as the work is tedious and repetitive. To tap the potential of unstructured datasets fully, new tools and techniques are needed.

The problem of extracting structured information from unstructured text is an active area of research, and several NSF directorates are funding work in this area (National Science Foundation, 2008). One approach is to use divide-and-conquer techniques: rather than having a single expert spend months on a repetitive task, one can use “crowdsourcing” (Wikipedia, 2012), a technique in which a large task is accomplished by being divided among a large collection of people. Services such as Amazon.com’s Mechanical Turk (https://www.mturk.com/mturk/welcome) and CrowdFlower (http://crowdflower.com/) provide workers and infrastructure for distributing tasks to them.

Crowdsourcing can be used to extract information from unstructured data. For example, researchers have used the technique to identify people, organizations, and locations in tweets on Twitter (Finin et al., 2010) and to analyze and annotate charts (Willett et al., 2012). In the realm of STEM, crowdsourcing has been used to identify the correct authors of papers when there is ambiguity in names (e.g., identifying which of several John Smiths wrote a particular paper) (Brooks et al., 2011).
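
The sketch below illustrates two generic pieces of such a crowdsourcing workflow, splitting a large classification job into small batches and reconciling workers' labels by majority vote; the platform-specific step of posting batches to a service such as Mechanical Turk or CrowdFlower is omitted.

```python
from collections import Counter

def make_batches(records, batch_size=10):
    """Split a large classification job into small batches that could be posted
    to a crowdsourcing service; the platform-specific posting step is omitted."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

def majority_label(labels):
    """Reconcile disagreement among workers by simple majority vote; more
    elaborate schemes weight workers by their past accuracy."""
    return Counter(labels).most_common(1)[0][0]

batches = make_batches([f"dissertation_{i}" for i in range(25)])
print(len(batches))   # 3 batches of at most 10 records each

# Three hypothetical workers label the same dissertation.
print(majority_label(["data science", "data science", "not data science"]))
```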


A second approach to structuring unstructured data is to use automated information extraction algorithms. Most of the tasks requiring an expert’s input in the data science example involve extracting topics from documents (“Is this dissertation related to data science?”) and extracting entities (“What university did the author of this dissertation attend?”). For other applications, it is also important to disambiguate entities (“Is the Susan Jones who wrote paper A the same as the Susan Jones who wrote paper B?”) and to link entities from one dataset to another (“Is the Kim Lee who received patent A the same as the Kim Lee who wrote NSF award B?”). Automated algorithms exist that can perform all of these tasks with varying degrees of success, and research is ongoing on all of these problems.
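
As a minimal illustration of the disambiguation task, the following sketch applies a toy rule (close name similarity plus a shared affiliation) to decide whether an award record and a patent record refer to the same person; the threshold and features are assumptions, and production systems use richer evidence and learned models.

```python
from difflib import SequenceMatcher

def same_person(rec_a, rec_b, name_threshold=0.8):
    """Toy disambiguation rule: two records refer to the same person if their
    names are similar and they share an affiliation. The threshold and the
    features are assumptions; production systems use richer evidence
    (coauthors, topics, years) and learned models."""
    name_sim = SequenceMatcher(None, rec_a["name"].lower(),
                               rec_b["name"].lower()).ratio()
    shared_affiliation = bool(set(rec_a["affiliations"]) & set(rec_b["affiliations"]))
    return name_sim >= name_threshold and shared_affiliation

award_pi = {"name": "Kim Lee", "affiliations": {"State University"}}
patent_inventor = {"name": "Kim J. Lee", "affiliations": {"State University", "Acme Corp."}}
print(same_person(award_pi, patent_inventor))   # True under this toy rule
```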

Indeed, staff in NSF’s Social, Behavioral, and Economic Sciences Directorate have started using some of these information extraction techniques in the construction of the directorate’s portion of the STAR METRICS Program (National Science Foundation, 2012b). For example, latent Dirichlet allocation (Blei et al., 2003), a technique for automatically identifying topics in documents based on the frequency of keywords, was used to assign topics to NSF awards (Lane and Schwarz, 2012). Automated entity disambiguation techniques are being used to link NSF awards to subsequent patents (Lai et al., 2011). Improvements in text processing techniques and broader availability of tools for topic and entity extraction could open up rich new datasets that could shed light on STEM innovation. NCSES could catalyze progress by coordinating research and facilitating the dissemination of ideas.
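
For readers unfamiliar with the technique, the following sketch applies latent Dirichlet allocation to a handful of stand-in award abstracts using scikit-learn; it shows the mechanics of topic assignment only and is not the STAR METRICS implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in award abstracts; real inputs would be abstracts from the public
# NSF awards database.
abstracts = [
    "statistical machine learning methods for large scale text data",
    "machine learning and data mining for scientific text corpora",
    "synthesis and characterization of novel polymer materials",
    "polymer chemistry for advanced composite materials",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
vocab = vectorizer.get_feature_names_out()

for k, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {k}: {', '.join(top_words)}")

print(lda.transform(X).round(2))   # per-document topic proportions (rows sum to 1)
```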

NSF made use of text processing tools internally for STAR METRICS. These tools may have applicability beyond NSF, and NCSES should explore the possibility of making these tools more widely available. Although NCSES should not be in the business of supporting software products, it could explore the possibility of making parts of the text processing code available without support or available as an open-source project. More generally, NCSES could encourage the sharing of text processing tools produced by NSF-supported researchers. The Apache Software Foundation (http://www.apache.org/foundation/) could serve as a model and potential collaborator. A valuable role for NCSES would be to provide organizational and financial support for regularly developing open-source text processing projects. For example, NCSES could pay for a common repository for source code (through a provider such as GitHub), fund meetings among contributors to projects, and organize workshops to bring developers together with users. The broad value of these tools for NSF implies an opportunity for sharing across directorates the resources required to develop these activities.

Encouraging the sharing of extracted structured data would be valuable as well. For example, NSF has autogenerated topics for awards through text processing software developed for STAR METRICS and could start including these topics in its award database so that other researchers might benefit from them. Similarly, if one team of NSF-funded researchers were to link, say, papers in entomology journals to patents in pest control for one project, it might be useful for another team to use the same set of linkages. NCSES could provide a common repository for the sharing of extracted metadata about datasets and encourage NSF-funded researchers to contribute to it.

A fundamental question concerning the use of unstructured data for indicators requires more examination: What kind of statistical methodology should be applied to data derived from web scraping? There are other, related questions: What trade-offs are entailed in using web-based data instead of survey data? Is it possible to adjust web-based data accurately to represent a survey sample and to estimate sampling errors and statistical biases? Is applying modeling techniques to web-based data and traditional survey data a promising approach to achieving this end? How frequently must this be done? How frequently would NCSES want to publish new statistics? Would NCSES want to publish less reliable statistics if it meant publishing them more frequently at lower cost?

A company such as LinkedIn stores in its servers a social network representing all of its users and relationships among them, and techniques for accurately sampling this social network have been developed (see Maiya and Berger-Wolf, 2011; Mislove et al., 2007). To the panel’s knowledge, however, researchers have not yet addressed how well this social network represents the larger population. For example, if one is interested in measuring how many chemical engineers are working in the United States, then some subset of these individuals is represented in LinkedIn’s social network. Adjusting this subset accurately to represent the target population and estimating the error incurred in using this subset is a daunting challenge.9 It is important to understand how the data collected from websites compare with traditional survey data, particularly because different websites have very different coverage. Facebook, for example, covers a large portion of the undergraduate population (at least for the next couple of years). However, sites such as Mendeley and Academia.edu clearly cover only a subset of the entire population of researchers.

It could prove useful to adopt a combination approach, in which web-based statistics would be calibrated periodically against a traditional survey. Of course, the survey would have to be administered less frequently than is currently the case, or there would be no cost or time savings. It could also be a useful experiment to run parallel efforts (collecting traditional indicators in addition to their possible replacements) for a period of time in order to examine the strengths and weaknesses of using certain nontraditional data sources for indicators. This period of time would also be important for

____________________

9LinkedIn and similar data could be quite useful for questions involving relative rather than absolute measures. For example, are there more chemical than electrical engineers? Do chemical engineers change jobs more frequently than other engineers? Where in the country are chemical engineers most highly concentrated?


assessing how well the newly constructed indicators identify trends and rates of change, particularly for policy purposes.

Because NCSES has reported that the response rates for some of its surveys are declining, questions arise about how well those data reflect the population sampled and how web-based data could be calibrated to those surveys. Calibrating to the Survey of Earned Doctorates (SED), which has a response rate above 90 percent, would be relatively straightforward, but only once and on questions asked by that survey. One solution to this dilemma would be for NCSES to devote resources to sampling for nonresponse follow-up,10 that is, strive to achieve close to a 100 percent response rate from a small sample of nonrespondents to a standard survey, adjust the survey results for nonresponse, and use the adjusted survey data to calibrate information from web-based sources.11 The calibration would be similar to what computer scientists and mathematicians do with compressed sensing of data on pixels and is a promising area of research.
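
The footnoted modeling idea can be sketched as a simple calibration regression: fit the survey benchmark on the web-based signal for years in which both are observed, then "now-cast" the current period from the already-observed web signal. All numbers below are hypothetical.

```python
import numpy as np

# Hypothetical history: an annual survey benchmark (say, employed chemical
# engineers, in thousands) and a web-based signal (say, professional-network
# profile counts) observed for the same years.
survey_benchmark = np.array([142.0, 150.0, 158.0, 166.0])   # dependent variable
web_signal = np.array([21.0, 23.5, 25.0, 27.0])             # explanatory variable

# Fit the calibration model: benchmark ~ intercept + slope * web_signal.
slope, intercept = np.polyfit(web_signal, survey_benchmark, deg=1)

# "Now-cast" the current period, for which only the web signal is observed.
current_signal = 28.4
print(round(intercept + slope * current_signal, 1))
```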

Achieving a level of rigor comparable to that of a traditional survey with these methods may not be possible, and NCSES would need to consider its tolerance for publishing statistics that may not be as reliable as those it has previously published. The agency would need to balance reliability against timeliness: because little time is required for data collection with data mining techniques in comparison with traditional surveys, releasing statistics much more frequently would be possible.

In principle, nothing prevents statistics from being updated periodically or continuously. For example, the national unemployment rate, gross domestic product, and the consumer price index are updated periodically with no compromise to their importance. And the Billion Prices Project at the Massachusetts Institute of Technology uses an algorithm that collects prices daily from hundreds of online retailers worldwide, creating, among other things, a daily price index for the United States (see Massachusetts Institute of Technology, 2013).
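
As a simplified illustration of such a scraped-price index, the sketch below chains an unweighted geometric mean (Jevons) of day-over-day price relatives for items observed on both days; the Billion Prices Project's actual methodology is considerably more elaborate.

```python
import math

def jevons_step(prices_today, prices_yesterday):
    """Unweighted geometric mean of day-over-day price relatives for items
    observed on both days; a simplified step, not the project's actual method."""
    common = prices_today.keys() & prices_yesterday.keys()
    log_relatives = [math.log(prices_today[i] / prices_yesterday[i]) for i in common]
    return math.exp(sum(log_relatives) / len(log_relatives))

day0 = {"sku1": 10.00, "sku2": 4.00, "sku3": 25.00}   # scraped prices, day 0
day1 = {"sku1": 10.20, "sku2": 3.96, "sku3": 25.00}   # scraped prices, day 1

index = 100.0 * jevons_step(day1, day0)   # chained from a base value of 100
print(round(index, 2))
```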

RECOMMENDATION 7-2: The National Center for Science and Engineering Statistics (NCSES) should pursue the use of text processing for developing science, technology, and innovation indicators in the following ways:

  • explore synergies with National Science Foundation directorates that fund research on text processing; and
  • enable NCSES staff to attend and develop workshops that bring together researchers working on text processing and on understanding the science, technology, engineering, and mathematics ecosystem.

RECOMMENDATION 7-3: The National Center for Science and Engineering Statistics should use its grants program to encourage the sharing of new datasets and extracted metadata among researchers working on understanding innovation in science, technology, engineering, and mathematics.

Data Validation

Although the datasets discussed in the data science example offered earlier provide new windows into the state of the STEM workforce, the accuracy of statistics gleaned from some of these datasets is unknown. The ProQuest and WorldCat dissertation databases are both large, for example, but neither is complete. Do either or both contain biased subsets of new dissertations? If so, then can the biases be characterized in ways that can be understood and corrected systematically?

One way to better understand omissions in a dataset such as a dissertation database would be to compare it with a definitive source such as the SED (National Science Foundation, 2011a) or the Integrated Postsecondary Education Data System (IPEDS) (National Center for Education Statistics, 2012). If, say, the databases were less likely to contain dissertations from students at private institutions or from humanities students, then estimates based on those databases could be reweighted accordingly.
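
The reweighting idea can be sketched as a simple post-stratification adjustment: scale the database's counts in each stratum by the ratio of the benchmark total to the database's coverage of that stratum. The strata and numbers below are illustrative.

```python
# Post-stratification sketch: if the dissertation database under-covers some
# strata (here, private institutions), scale its data science counts in each
# stratum by the ratio of the benchmark total to the database's coverage.
# All numbers are illustrative.
benchmark_totals = {"public": 30000, "private": 18000}   # e.g., SED doctorate counts
database_totals  = {"public": 27000, "private": 12000}   # dissertations in the database
ds_counts_in_db  = {"public": 540,   "private": 300}     # flagged as data science

adjusted = {s: ds_counts_in_db[s] * benchmark_totals[s] / database_totals[s]
            for s in benchmark_totals}
print(adjusted, round(sum(adjusted.values())))   # {'public': 600.0, 'private': 450.0} 1050
```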

Assessing the accuracy of metrics based on other types of sources can be more difficult. For example, counts of Twitter mentions (tweets) have been proposed as an indicator of the impact of a paper (Priem and Costello, 2010), and journals such as PLOS ONE now report Twitter mentions for each article (PLOS ONE, 2012). How might one assess the validity of tweets as an indicator of impact?
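
One simple starting point would be to correlate article-level tweet counts with a later, conventional impact measure such as citations, as in the sketch below; the data are hypothetical, and a real validation study would also need to address coverage and confounding.

```python
from scipy.stats import spearmanr

# Hypothetical article-level data: Twitter mentions shortly after publication
# and citations accumulated two years later for the same articles.
tweets    = [0, 3, 12, 1, 45, 7, 2, 30]
citations = [1, 4, 15, 0, 60, 9, 5, 22]

rho, p_value = spearmanr(tweets, citations)
print(round(rho, 2), round(p_value, 4))
```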

NSF is supporting ongoing research in areas that could facilitate assessing nontraditional data sources. Techniques from sampling theory, approaches for handling missing data, and imputation algorithms could all prove useful. In addition, NCSES’s own datasets could be used for calibrating new datasets.

RECOMMENDATION 7-4: The National Center for Science and Engineering Statistics should coordinate with directorates at the National Science Foundation in supporting exploratory research designed to validate new sources of data related to innovation in science, technology, engineering, and mathematics.

____________________

10This is one of several tools used by survey methodologists to address nonresponse in sample surveys. See Groves et al. (2009, Chapter 6) for a description of some of these methods.

11This approach would entail using the survey data as the dependent variable in a model that used information from the web-based data source to create the explanatory variables. That model could then be used to “now-cast” population values of interest directly from the web-based data.


Data Access

Many datasets that would be promising for better understanding STEM have restrictions on their usage. The ProQuest and WorldCat dissertation databases, Web of Science, Scopus, and The Conference Board’s Help Wanted OnLine database are all commercial datasets for which the processes for obtaining access are well defined. Other datasets are more sensitive and may require carefully controlled access. For example, access to some types of census data entails stringent controls on how the data are handled. Likewise, some corporate datasets are zealously guarded by their owners and may be used only by employees.

NCSES has considerable experience with managing access to sensitive datasets—the Survey of Doctorate Recipients (SDR) and census data. The experience it has gained in the process may be useful in negotiating access to other sensitive datasets.

NCSES already has the infrastructure in place at NORC to house many of its survey datasets and allow licensed researchers access through remote dedicated computers.12 In October 2012, data from the following surveys became available in the NORC Data Enclave: the SED, the National Survey of Recent College Graduates (NSRCG), the SDR, and the Scientists and Engineers Statistical Data System (SESTAT) integrated database (which includes the NSRCG, the SDR, and the National Survey of College Graduates [NSCG]). The panel heard from several people that NCSES sees the NORC Data Enclave as a way to build its community of licensed researchers while enabling its own staff to spend more time helping researchers with the substance of the data rather than on paperwork. Additionally, NCSES has worked with NORC to build an infrastructure that allows research teams to share their work in a secure environment, regardless of whether they are physically in one location.

There is strong interest in the dynamics of firm demographics (births, deaths, and employment contributions) and in the role of high-growth firms. The Census Bureau can develop these statistics by analyzing its business register. If these data were available to researchers—say, at the NORC Data Enclave—then a broad spectrum of evidence-based statistics and indicators could be made publicly available. One means by which such capability could be built is through NCSES’s initiation of a research program. Such a program would energize the research community to use survey and other data as soon as the data arrived at the NORC Data Enclave. The program could also be designed to incentivize researchers to develop new, high-utility statistics that were based on linked data from several agencies and that related inputs to outputs, outcomes, and effects.

For datasets that cannot be used outside of a company, another approach NCSES could take would be to work with NSF directorates that sponsor industrial fellowships. For example, LinkedIn has an excellent Data Science team that could potentially provide mentorship for a graduate student or postdoctoral fellow. A program modeled after the NSF Division of Mathematical Sciences University-Industry Cooperative Research Programs in the Mathematical Sciences (National Science Foundation, 2004) could provide a way for researchers interested in the STEM labor market to collaborate with LinkedIn’s Data Science team and explore LinkedIn’s data while under close supervision.

RECOMMENDATION 7-5: The National Center for Science and Engineering Statistics should explore the use of university-industry exchanges as a mechanism for giving researchers access to promising datasets and industry teams access to new research techniques.

NEXT STEPS

The emerging field of data science is more than the motivating example for this chapter. The new approach to understanding STEM that the panel believes NCSES should explore is at its core a data science approach. Because the field of data science is new and the number of practitioners is relatively small, the panel proposes two concrete initiatives that would provide some opportunities for NCSES to gain experience with new data science tools and to collaborate with data scientists.

NSF has a long history of funding university-industry research collaborations. The model is typically that an industry partner with a problem to solve is paired with a university partner that has experience with techniques and tools relevant to the problem domain. A graduate student or postdoctoral fellow (or professor) splits his or her time between the university and the corporation and is mentored by people in both institutions. The student/postdoctoral fellow gains valuable real-world experience, the industry partner gains solutions to problems, and the university partner gains a better understanding of real problems and potentially relevant data. One example of this approach is the previously mentioned NSF Division of Mathematical Sciences University-Industry Cooperative Research Programs in the Mathematical Sciences (National Science Foundation, 2004).

NCSES could gain considerable experience in data science techniques by playing the role of the industry partner and collaborating with a university in this fashion. NCSES has a mandate to understand the state of STEM; access to interesting datasets; and a staff well versed in navigating federal research organizations, managing datasets, and conducting survey research. A collaboration with a university laboratory focused on such matters as text processing, web mining, or Internet-oriented econometrics could yield valuable experience for both sides.

____________________

12“The NORC Data Enclave exists to provide researchers with secure access to microdata and protect confidentiality, as well as index, curate and archive data. The NORC Data Enclave provides authorized researchers with remote access to microdata using the most secure methods to protect confidentiality.” See the full description of the NORC Data Enclave in The University of Chicago (2013). BEA does not permit data migration to research data centers. Instead, it has a program whereby individuals can use the data in house under a special sworn employee arrangement.

Suggested Citation:"7 A Paradigm Shift in Data Collection and Analysis." National Research Council. 2014. Capturing Change in Science, Technology, and Innovation: Improving Indicators to Inform Policy. Washington, DC: The National Academies Press. doi: 10.17226/18606.
×

laboratory focused on such matters as text processing, web mining, or Internet-oriented econometrics could yield valuable experience for both sides.

RECOMMENDATION 7-6: The National Center for Science and Engineering Statistics (NCSES) should collaborate with university researchers on the use of data science techniques to understand the science, technology, engineering, and mathematics ecosystem, using a mechanism similar to existing National Science Foundation university-industry partnerships. One or two graduate students or postdoctoral fellows could alternate between working at NCSES and at their home institution for up to 2 years, with the specific goal of contributing new findings to NCSES’s data and indicators programs.

NCSES has considerable experience with managing structured data in the form of its survey products, but much less experience with unstructured data. Conveniently, NSF has a rich but relatively untapped source of unstructured data that could yield a wealth of new insights into STEM in the United States: its database of awards. This dataset is quite sensitive, but there is precedent for granting researchers access to it: for the Discovery in a Research Portfolio report (National Science Foundation, 2010a), NSF’s Science of Science and Innovation Policy Program recently granted 10 academic research laboratories access to a small set of recent award data. NCSES is well versed in managing researcher access to sensitive datasets and would doubtless be up to the task of ensuring responsible access to the award database.

The Discovery in a Research Portfolio report recommends that NSF make a number of improvements to its award data, several of which align well with the panel’s recommendations. For example, the report recommends combining award data with internal and external data—a task that would benefit from automated techniques for extracting entities (people, laboratories, programs, institutions) from awards and linking them to related entities in other datasets. The report also recommends improving visualization techniques and understanding the interrelationships between people and topics—both of which would make promising research projects for a visiting graduate student or postdoctoral fellow.
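To make the entity extraction step concrete, the sketch below pulls candidate people and organizations out of award abstracts using the open-source spaCy library. The input file layout and the reliance on spaCy's general-purpose named entity recognizer are illustrative assumptions, not part of any prescribed method; a production pipeline would still need to disambiguate the extracted entities and link them against authoritative rosters.

```python
# Illustrative sketch: extract candidate entities (people, organizations)
# from award abstracts so they can later be linked to other datasets.
# Assumes a JSON-lines file with "award_id" and "abstract" fields; both the
# file layout and the use of spaCy are hypothetical choices.
import json

import spacy

nlp = spacy.load("en_core_web_sm")  # small general-purpose English model

def extract_entities(path):
    """Yield (award_id, entity_text, entity_label) triples."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            award = json.loads(line)
            doc = nlp(award["abstract"])
            for ent in doc.ents:
                if ent.label_ in {"PERSON", "ORG"}:
                    yield award["award_id"], ent.text, ent.label_

if __name__ == "__main__":
    for award_id, text, label in extract_entities("awards.jsonl"):  # hypothetical file
        print(award_id, label, text, sep="\t")
```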

Managing award data for research purposes would provide a useful test bed for several other recommendations offered in this chapter. For example:

  • NCSES and NSF’s Science of Science and Innovation Policy Program could formulate a set of key questions they believe NSF award data could help answer and then work with relevant directorates to fund this research.
  • NCSES could work to share some of the tools used to add structure (in the form of automatically assigned topics) to awards; a minimal sketch of such topic assignment appears after this list. NCSES could also share the topics themselves, either as additions to the existing online awards database or as a separate metadata file.
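As a concrete illustration of the second bullet above, the sketch below assigns automatically generated topics to a toy set of award abstracts using latent Dirichlet allocation from scikit-learn. The corpus, the number of topics, and the choice of model are assumptions made purely for illustration; they are not the tools NCSES or NSF actually use.

```python
# Illustrative sketch: add structure to award abstracts by assigning topics
# with latent Dirichlet allocation (LDA). The toy corpus, topic count, and
# vectorizer settings are assumptions for illustration only.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

abstracts = [
    "graphene sensors for chemical detection and environmental monitoring",
    "doctoral workforce outcomes and career pathways in engineering",
    "machine learning methods for large scale text mining of award abstracts",
    "low cost graphene electrodes for energy storage devices",
    "text mining of career histories in the scientific workforce",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)  # rows: awards, columns: topic weights

# The top words per topic could be released as metadata accompanying the
# public awards database or as a separate metadata file.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-6:][::-1]]
    print(f"topic {k}: {', '.join(top)}")
```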

RECOMMENDATION 7-7: The National Center for Science and Engineering Statistics (NCSES) should explore methods of mining the awards database at the National Science Foundation as one means of discovering leading pathways for transformational scientific discoveries. NCSES should engage researchers in this exploratory activity, using its grants program. NCSES should develop mechanisms for using the tools and metadata developed in the course of this activity for the development of leading indicators of budding science and engineering fields.

One way to develop these ideas further would be through a research funding contest or prize competition. Several “open innovation” platforms, such as InnoCentive, the Innovation Exchange, and challenge.gov, facilitate this type of contest. Working with an outside entity to design and administer a contest would allow NCSES to focus on the problems it hoped to address rather than on the implementation details of the contest. The National Research Council (2007) report Innovation Inducement Prizes at the National Science Foundation and NSF’s Innovation Corps Program could also serve as useful models, although both are focused more specifically on technology commercialization.

If the contest were designed to address statistical questions related to the usefulness of web-based data sources, then it would be necessary to supply some sample data, and this might affect negotiations with companies. For example, LinkedIn might be willing to supply its data for NCSES to use but unwilling to allow use of the data in a public contest.

How can a federal statistical agency develop and rely on web-based and scientometric tools to produce gold-standard data for periodic publication? This basic question needs to be considered in the current climate of rapidly changing technologies and increasing demands for data. There are numerous related questions as well: How can an agency overcome looming privacy and security issues? How many internal staff will the agency need to develop and operate the tools that produce the indicators? These questions will need to be fully addressed before NCSES or any other federal statistical agency implements the frontier methods described in this section.

One way to address these questions is by example. In 2011, NIH decided to sponsor a competition13 to find improved methods for using the National Library of Medicine (NLM) to show knowledge flows from scientific exploration through to commercialized products. The agency also wanted to use the NLM resource for taxonomic development and for showing relationships among research activities. Knowledge spillovers and effects are difficult to measure, and NIH determined that one way to mine millions of data entries would be to automate the process. That expertise did not reside in any single NIH department, however, and it was important to cast a broad net to obtain the best expertise for addressing the problem. The competition, announced on challenge.gov as “The NLM Show Off Your Apps: Innovative Uses of NLM Information Challenge,” was open to individuals, teams of individuals, and organizations. Its purpose was to “develop innovative software applications to further NLM’s mission of aiding the dissemination and exchange of scientific and other information pertinent to medicine and public health.”14 The competition ended August 31, 2011, and winners were announced on November 2.15

____________________

13The panel thanks Jerry Sheehan (National Institutes of Health) for providing information and materials on this competition (see http://showoffyourapps.challenge.gov/ [December 2011]).

SUMMARY

In this chapter, the panel has presented seven recommendations. The focus is on exploratory activities that should enable NCSES, over time, to produce STI indicators that measure more accurately, and more promptly, what users really want to know. Researcher engagement and incentives for exploratory activities are important aspects of these recommendations. While the recommendations in Chapters 4-6 take priority over those in this chapter, the panel views these exploratory efforts as important investments in the long-term viability of NCSES and in its ability to meet evolving user needs in changing technological and economic environments.

ANNEX 7-1: POTENTIAL DATA SOURCES TO EXPLORE

Measuring Research Impact

Considerable activity is focused on finding better measures than citations for the impact of papers (see Priem and Hemminger, 2010, for an overview). The approaches being used fall into three broad categories:

1.   Impact measured through refinements of citation counts—The Eigenfactor algorithm (http://www.eigenfactor.org/) gauges impact by computing impact-weighted citation counts: citations from high-impact papers count for more than citations from low-impact papers. The algorithm is related to Google’s PageRank algorithm for determining the relevance of web pages (a simplified sketch of this family of methods appears after this list).

2.   Impact measured through aggregation of online indicators

a.   In addition to citations, Public Library of Science (PLOS) journals use article-level metrics to track usage statistics (article views and downloads), user feedback, and blog posts (Patterson, 2009).

b.   Total Impact is an application programming interface (API) that allows sites to display PLOS-style article-level metrics for arbitrary articles (http://impactstory.org/).

c.   Altmetric.com is a start-up that tracks mentions of scholarly articles in blogs, social media, newspapers, and magazines and provides scores for articles (http://altmetric.com/).

3.   Impact gauged by expert raters—Faculty of 1000 is a subscription-based service that selects new publications deemed by expert reviewers to be important. Articles are assigned a numerical rating (http://f1000.com/).
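To give a flavor of the first category, the sketch below scores a toy citation network with the PageRank algorithm via the networkx library. It is a simplified stand-in for Eigenfactor-style impact weighting, not an implementation of the Eigenfactor algorithm itself, and the citation pairs are invented.

```python
# Illustrative sketch: impact-weighted scoring of a toy citation network.
# PageRank is used as a stand-in for Eigenfactor-style algorithms; the
# citation pairs below are invented for illustration.
import networkx as nx

# A directed edge (a, b) means "paper a cites paper b".
citations = [
    ("paper_A", "paper_B"),
    ("paper_A", "paper_C"),
    ("paper_B", "paper_C"),
    ("paper_D", "paper_C"),
    ("paper_C", "paper_E"),
]

G = nx.DiGraph(citations)

# A citation counts for more when it comes from a paper that is itself
# highly cited, which is the intuition behind impact weighting.
scores = nx.pagerank(G, alpha=0.85)
for paper, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{paper}: {score:.3f}")
```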

Measuring Knowledge Diffusion

Diffusion Within the Research Literature

Citation flows from Thomson/Reuters have been analyzed to gauge the flow of knowledge within disciplines (see, e.g., Rosvall and Bergstrom, 2008). These flows can provide insight into the extent to which basic research in various fields propagates out into more applied disciplines.

Diffusion Within and Outside the Research Literature

The Kauffman-funded COMETS (Connecting Outcome Measures in Entrepreneurship Technology and Science) database (Ewing Marion Kauffman Foundation, 2012) and COMETSandSTARS database seek to shed light on the next stage of diffusion of ideas, from research to products (Zucker et al., 2011). The databases link awards from NSF/NIH, patents, papers from Web of Knowledge, doctoral dissertations, and more. The initial implementation of STAR METRICS at NSF involves similar types of linkages, initially linking research awards to patents and jobs (supported by awards), with ambitious future plans for tracking outputs, such as publications, citations, workforce outcomes, public health outcomes, and more (National Institutes of Health, 2012). The linking was accomplished using sophisticated text mining tools (Lai et al., 2011), in this case a variant of the Torvik-Smalheiser algorithm (Smalheiser and Torvik, 2009).
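The Torvik-Smalheiser algorithm itself is beyond the scope of this annex, but the sketch below conveys the basic shape of such record linkage: block candidate pairs on a cheap key (here, last name) and then score name similarity before accepting a link. The record layouts, the similarity measure, and the acceptance threshold are all assumptions for illustration, not a reimplementation of the published method.

```python
# Illustrative sketch: link award investigators to patent inventors by
# blocking on last name and scoring full-name similarity. A toy stand-in
# for disambiguation methods such as Torvik-Smalheiser; record layouts and
# the threshold are assumed.
from difflib import SequenceMatcher

awards = [
    {"award_id": "A1", "pi_name": "Jane Q. Smith"},
    {"award_id": "A2", "pi_name": "Carlos Alvarez"},
]
patents = [
    {"patent_id": "P9", "inventor_name": "Smith, Jane Quinn"},
    {"patent_id": "P7", "inventor_name": "Alvarez, C."},
]

def normalize(name):
    """Crude normalization: 'Last, First' -> 'first last', lowercased."""
    if "," in name:
        last, first = (p.strip() for p in name.split(",", 1))
        name = f"{first} {last}"
    return name.lower()

def last_name(name):
    return normalize(name).split()[-1]

THRESHOLD = 0.6  # assumed cutoff; real systems estimate match probabilities

links = []
for a in awards:
    for p in patents:
        if last_name(a["pi_name"]) != last_name(p["inventor_name"]):
            continue  # blocking step: compare only within the same last name
        score = SequenceMatcher(
            None, normalize(a["pi_name"]), normalize(p["inventor_name"])
        ).ratio()
        if score >= THRESHOLD:
            links.append((a["award_id"], p["patent_id"], round(score, 2)))

print(links)
```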

____________________

14See http://showoffyourapps.challenge.gov/ [December 2011].

15The U.S. Census Bureau ran a visualization competition in 2012 to develop a statistical model that could predict census mail return rates (see U.S. Census Bureau, 2012). Another example of a competition is the Netflix Prize, documented in the materials for the seminar of the Committee on Applied and Theoretical Statistics and Committee on National Statistics of the National Academy of Sciences titled “The Story of the Netflix Prize” (November 4, 2011) (see http://www.netflixprize.com/community/ and Gillick, 2012).


BOX 7-2
Tracking the Commercialization of Technologies Through Records of New Books

Michelle Alexopoulos and collaborators at the University of Toronto have been measuring the commercialization of technology using records of new books from the Library of Congress (Alexopoulos and Cohen, 2011). The idea is that publishers invest in new books on commercially promising innovations and stop investing when innovations are in decline. Hence, a count of new book titles on a particular technology provides an indicator of the success of that technology in the marketplace. Alexopoulos and Cohen trace the diffusion of such inventions as the Commodore 64 and Microsoft Windows Vista by searching for related keywords in new book titles.

One potential generalization of this work would be to attempt to trace the flow of ideas back to the research that preceded commercialization. This task would be considerably more difficult than tracking commercialization, since it would require examining vastly more papers than books, but techniques such as automated topic extraction could make it more feasible.
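A stripped-down version of this style of analysis is sketched below, assuming one already holds a table of book titles with publication years extracted from catalog records. The file layout and keyword list are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch: count new book titles per year that mention a given
# technology keyword, as a rough commercialization signal. Assumes a CSV of
# catalog records with "year" and "title" columns; the layout and keywords
# are hypothetical.
import csv
from collections import Counter

KEYWORDS = ["windows vista", "commodore 64"]  # hypothetical search terms

def title_counts(path, keyword):
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if keyword in row["title"].lower():
                counts[int(row["year"])] += 1
    return counts

if __name__ == "__main__":
    for kw in KEYWORDS:
        series = title_counts("book_titles.csv", kw)  # hypothetical file
        for year in sorted(series):
            print(kw, year, series[year], sep="\t")
```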

Diffusion Outside the Research Literature

Alexopoulos and Cohen (2011) have mined Machine Readable Cataloging (MARC) records (Library of Congress, 2012) from the Library of Congress to identify patterns of technical change in the economy (see Box 7-2). They argue that the book publication counts captured in these records correspond more closely than patent records to the commercialization of ideas. Other tools for mining data in books include

  • Google’s Ngram Viewer, a search engine for n-grams in Chinese, English, French, German, Hebrew, Russian, and Spanish books published between 1800 and 2008; and
  • Culturomics for arXiv, a search engine for n-grams in scientific preprints posted to arXiv between 1992 and 2012.

Two of NSF’s administrative datasets could potentially shed additional light on the translation of research work into commercial products:

  • NSF proposals have a required section (National Science Foundation, 2011b) on the impact of past NSF-funded work.
  • NSF-funded researchers submit Research Performance Progress Reports (National Science Foundation, 2012c) that document accomplishments, including inventions, patent applications, and other products.

The STEM Labor Market

Demand

Large job boards such as Monster.com or job board aggregators such as Indeed.com or SimplyHired.com could provide a rich source of information on demand for STEM professionals. The Conference Board’s Help Wanted OnLine dataset includes jobs collected from 16,000 online job boards (The Conference Board, 2012).

One can collect data from a site such as Monster.com either by scraping information from the public website or by negotiating with the site for access to the data. Two reasons for preferring negotiation are legality and data structure. The legality of web scraping has been challenged several times in courts both in the United States and abroad,16 and there appears to be no consensus on what is legal. However, all the cases to date that the panel found involved one for-profit company scraping data from another for-profit company’s site for commercial use. For example, a travel company might use web scraping to collect information on ticket prices from an airline and then use those prices to facilitate customers’ comparative shopping. During the course of this study, the panel found no example of a nonprofit or government organization or academic researcher being sued over web scraping.
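For completeness, a minimal and deliberately polite scraping sketch is shown below. The URL, query parameters, and CSS selector are hypothetical, and any real collection effort would need to respect the site's terms of service and robots.txt, which is part of why negotiated access is generally preferable.

```python
# Illustrative sketch: collect job posting titles for a STEM-related query
# from a job board's public search pages. The endpoint, parameters, and CSS
# selector are hypothetical; a real effort must honor the site's terms of
# service and robots.txt.
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example-jobboard.com/search"  # hypothetical endpoint

def fetch_titles(query="data scientist", pages=3, delay=5.0):
    titles = []
    for page in range(1, pages + 1):
        resp = requests.get(
            BASE_URL,
            params={"q": query, "page": page},
            headers={"User-Agent": "ncses-research-pilot"},
            timeout=30,
        )
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # ".job-title" is a hypothetical selector for posting headlines.
        titles += [el.get_text(strip=True) for el in soup.select(".job-title")]
        time.sleep(delay)  # rate limiting out of courtesy to the site
    return titles

if __name__ == "__main__":
    for title in fetch_titles():
        print(title)
```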

Supply

Several new social networks for researchers could be used to learn more about career trajectories in the sciences, particularly nonacademic careers:

LinkedIn.com is a broader social network for professionals that had 175 million members as of June 2012.

Several initiatives may make new data on researchers available online:

____________________

16For example, Ryanair, a European airline, initiated a series of legal actions to prevent companies such as Billigfluege and Ticket Point from scraping ticket price data from its website to allow for easier comparison shopping (see Ryanair, 2010). In a 2000 California case, eBay v. Bidder’s Edge, eBay sued Bidder’s Edge over price-scraping activities; see http://www.law.upenn.edu/fac/pwagner/law619/f2001/week11/bidders_edge.pdf [December 2011]. And in another California case, Facebook, Inc. v. Power Ventures, Inc. (2009), Facebook sued Power Ventures over the scraping of personal user data from the Facebook site; see http://jolt.law.harvard.edu/digest/9th-circuit/facebook-inc-v-power-ventures-inc [December 2011].

  • Vivo is a web platform for exposing semantic data on researchers and their work on the websites of research institutions. Vivo tools provide a way for institutions to create rich, structured datasets on their research activities.
  • SciENCV is an NIH demonstration project for allowing researchers to create public research profiles. These profiles are designed to streamline the process of applying for NIH and other grants, but will also generate useful structured datasets on researchers.
  • Brazil’s Lattes Platform is a database of all Brazilian researchers and their work. It extends the ideas in Vivo and SciENCV, and participation is mandatory.
  • The ORCID project seeks to provide researchers with unique identifiers that will be used as author identifiers for publications, awards, and so on. The goal is to facilitate linking of datasets involving individual researchers. ORCID will serve as a registry rather than a data provider, but the use of these identifiers can help structure existing unstructured datasets. (Some researchers [Smalheiser and Torvik, 2009] have expressed skepticism about the utility of such identifiers, however.)
  • The U.S. Department of Labor issues quarterly foreign labor certification data for H-1B visa holders (U.S. Department of Labor, 2012a). The dataset contains job titles and employers for new H-1B holders, and degree level can be inferred for some broad categories of jobs (e.g., “postdoctoral scholar”). The data are imperfect in that not all Ph.D.’s are on H-1B visas, there will be some overlap between SED respondents and those receiving H-1B visas, and job title is an imperfect predictor of degree status, but one may be able to see useful year-to-year changes in the number of foreign postdoctoral fellows (a simple illustration of using these files appears after this list).
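As a simple illustration of what could be done with these quarterly disclosure files, the sketch below counts filings whose job title suggests a postdoctoral position, by year. The column names are assumptions about the file layout and would need to be checked against the actual Department of Labor release.

```python
# Illustrative sketch: count H-1B labor condition filings whose job title
# suggests a postdoctoral position, by year. The column names
# ("CASE_SUBMITTED", "JOB_TITLE") are assumptions about the disclosure file
# layout and should be verified against the actual release.
import pandas as pd

def postdoc_counts(path):
    df = pd.read_csv(path, low_memory=False)
    df["year"] = pd.to_datetime(df["CASE_SUBMITTED"], errors="coerce").dt.year
    is_postdoc = df["JOB_TITLE"].str.contains("postdoc", case=False, na=False)
    return df[is_postdoc].groupby("year").size()

if __name__ == "__main__":
    print(postdoc_counts("h1b_disclosure_fy2012.csv"))  # hypothetical file
```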

Finally, there are several databases of dissertations:

  • ProQuest,
  • WorldCat, and
  • OpenThesis.