A motivation for the workshop was the need for new measures of science and technology describing inputs and outcomes associated with innovative activity and, concomitantly, the need to call upon multiple modes of data. Commercial data, some recoverable through web scraping and other computer science methods, may shape future measurement in a range of areas—for example, in tracking new product introductions, quality change, prices and productivity, and product diffusion—where large datasets provide advantages in terms of granularity, timeliness, geographic specificity, and other factors. During this session, presenters continued discussing several of the data-related questions raised at various points during the workshop: To what extent can the digital revolution transform metrics in the area of innovation measurement? What are the roles of specialized surveys in innovation data collection? And how can administrative records best be exploited?
Scott Stern (Massachusetts Institute of Technology [MIT]) presented his perspectives on developing metrics for capturing the characteristics of innovation application and diffusion. In the process, he made the case that distinguishing between innovation that is cumulative versus that which is one-off in nature is essential to understanding the relationship between inputs to, and the skewed distribution of outputs and outcomes from, the process.
Stern opened with an illustrative example—CRISPR, a gene-editing tool that grew out of publicly funded research dating to the 1980s. It was not until 2012, however—when a group of researchers at Berkeley, followed shortly thereafter by others at MIT and Harvard, figured out that the tool could be repurposed slightly—that CRISPR became arguably the single largest advance ever in life sciences gene editing. Forward citations of these research teams' articles on the topic now accrue at a rate of about 3,500 per year (see Ledford, 2016).
CRISPR is essentially a cut-and-paste editing tool for genes that can be used by large pharmaceutical companies, biotech firms, and startups. For example, very recently, the tool was decisive in the creation of a platform for detecting Zika virus. The emergence of CRISPR is an example of the process from basic discovery to traditional science and technology to unanticipated applications—in this case, all within a 3-year period. Stern said the episode illustrates three important points: (1) innovation is inherently cumulative, (2) innovation is inherently uncertain, and (3) innovation is highly skewed in the distribution of application and in its impact across multiple dimensions.
He noted that a key question of the workshop was how the fundamental characteristics of innovation can be reflected in a measurement framework and in systematic statistical programs for collecting information about innovation, in order to move beyond examples. The cumulativeness of innovation—that is, the ability to draw upon an ever-wider body of scientific and technical knowledge—is widely regarded as a critical component of idea-driven, long-term economic growth (Aghion and Howitt, 1998; Dasgupta and David, 1994; Mokyr, 2004; Romer, 1994; Rosenberg, 1982). But, Stern asked, how can it be known that innovation is cumulative, whether and how the degree of cumulativeness varies across time and place, and how can it be measured?
To test cumulative impacts, Furman and Stern (2011) examined biological resource centers, wherein biological materials used in research are deposited and made publicly accessible to future generations of researchers. This allows researchers to avoid reinventing the wheel, Stern noted. The authors identified an approach using citations data to track the diffusion of knowledge from scientific papers placed into an institutional environment that promotes cumulativeness—a biological resource center—compared with similar articles that remained in a less open system. The articles were tracked to estimate cumulative impact, measured by the rate of citations (which he noted is not the same as innovation, but an indicator of influence). By this metric, placement in a biological resource center more than doubled the impact of publicly funded knowledge on subsequent research productivity.
For purposes of systematic measurement, the question that arises is how to aggregate this kind of study of cumulativeness (and the role of institutions and policy in shaping cumulativeness and outcomes) beyond the level of individual “pieces” of knowledge. Can it be mapped? Research by Heidi Williams (MIT) does this by using evidence from the human genome project (HGP) to examine the role of intellectual property rights (IPR) on innovation. During the final years of the HGP, Celera Corporation was granted temporary licensing rights for sequences they identified prior to HGP coverage. Williams (2013) took advantage of this natural experiment to investigate whether follow-on research on individual genes in the post-HGP era was impacted by Celera’s IPR claims. Her results suggest an approximately 30 percent reduction in subsequent publications, phenotype-genotype linkages, and diagnostic tests for genes first sequenced by Celera—the company’s licensing rights impacted cumulative knowledge occurring at the level of a research community.
Turning to the topic of uncertainty, and the highly skewed nature of knowledge creation and innovation, Stern highlighted three points. First, he cited the work of Uzzi et al. (2013), who used unconventional metrics to predict which publications become the rare ones that go on to have a high impact. Their analysis of 17.9 million papers from all scientific fields suggests that science follows (p. 1) “a nearly universal pattern: the highest-impact science is primarily grounded in exceptionally conventional combinations of prior work yet that simultaneously features an intrusion of unusual combinations.” Stern called insights from these kinds of analyses “game changers.” In a similar fashion, the research by Cathy Fazio and her colleagues (summarized in Chapter 6) provided insights about the structure of skew in the area of entrepreneurial quality, where a large share of consequential outcomes emerges from 1 percent, or even one-tenth of 1 percent, of overall activity.
Stern turned next to the question of how the skewed impact of patents and other phenomena related to innovation can be mapped in a way that captures cumulativeness, uncertainty, and high skew. He noted that discovery may occur in areas that were unanticipated. By way of analogy, Stern described how the availability of “open” satellite image maps of the Earth impacted discovery and entrepreneurship in the gold industry. Research by Nagaraj (2015) found a large and persistent difference in the rate of gold mining and discovery depending on the availability of public images—open-access maps—for a given geographical area. Furthermore, entrepreneurs were found to be far more likely than established firms to take advantage of open-access maps. The mechanism for this application was a research community that figured out these images could dramatically improve the chances of locating gold deposits, an application that had nothing to do with the initial motivations for the NASA Landsat Program. The takeaway is that the discovery of gold, which was unexpected and highly skewed (but concentrated in regions where the satellite imagery was available), benefited from research developed for other purposes.
Stern concluded with several statements about future measurement and policy making:
- Innovation statistics and metrics are increasingly being used to evaluate and track innovation systems at multiple levels of granularity.
- There is a need to develop meaningful connections that allow cohesive assessment of the role of different elements of the innovation system (inputs and outputs) over time. This will not always require new data sources. It may be a matter of connecting existing data—and combining traditional measures with alternatives from unanticipated sources—in creative ways.
- There is a need in measurement frameworks to recognize cumulativeness, uncertain and highly skewed phenomena, and distributed impacts.
This agenda, he argued, is particularly important for areas of innovation beyond traditional “tech to market” applications of science, as it relates to emerging uses of digital knowledge, maps, and nonscience knowledge systems.
Jeff Oldham (Google) presented insights that can be drawn about innovation from linking patent metrics, which summarize sets of patents, to other types of indicators. Following up on one of Stern’s points, Oldham noted that one way to make knowledge more cumulative is to increase the accessibility of statistics and indicators. This is analogous to the idea behind the biological resource centers that Stern discussed.
The key to increasing the value of patent data is to create the capacity to link them to a web of other innovation indicators, Oldham said. In this way, patent data can serve as a nucleus of innovation measurement. Currently, anyone can go to Google Patents or to the U.S. Patent and Trademark Office (USPTO) Website to research detailed information about a patent; barriers to entry for this kind of inquiry are low. But it is difficult to compute over collections of patent records. This is why patent metrics, based on summaries, have been developed.
Among the metrics that can be generated from patent summaries are publication numbers (which serve as keys), applications and grants, scope of patents by CPC codes,1 family members (whose patents can be linked), backward and forward citations, and publication dates. A wide range of patent statistics can be added. Researchers can also compute their own patent statistics using simple structured query language (SQL) queries.
Oldham provided one sample application—an analysis of how long it takes for patents to be approved by the USPTO and whether, over time, that process has slowed or sped up. Estimates from this analysis reveal, for example, that sometime between 2000 and 2001 the process became much faster, a trend that has continued to the present.
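The kind of pendency statistic Oldham described can be sketched with a simple SQL query of the sort researchers might write against a patent-metrics table. The sketch below uses an in-memory SQLite database with a hypothetical table and column names (not the actual Google Patents schema) and made-up records, purely to illustrate the shape of such a query.

```python
import sqlite3

# Toy, in-memory stand-in for a patent-metrics table; the table and
# column names here are illustrative assumptions, not a real schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE patents (
        publication_number TEXT PRIMARY KEY,
        filing_date TEXT,
        grant_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO patents VALUES (?, ?, ?)",
    [
        ("US-0000001", "1999-03-01", "2001-02-01"),
        ("US-0000002", "1999-07-15", "2002-01-10"),
        ("US-0000003", "2005-01-20", "2009-06-30"),
        ("US-0000004", "2005-11-05", "2008-12-01"),
    ],
)

# Average pendency (filing to grant), in days, grouped by filing year.
rows = conn.execute("""
    SELECT substr(filing_date, 1, 4) AS filing_year,
           AVG(julianday(grant_date) - julianday(filing_date)) AS avg_days
    FROM patents
    GROUP BY filing_year
    ORDER BY filing_year
""").fetchall()
for year, days in rows:
    print(year, round(days))
```

Comparing the averages across filing years is the essence of the slowdown-or-speedup question Oldham posed.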
After providing a quick overview of what can be done with patent metrics, Oldham turned to his main message: The analytic power of patent metrics can be greatly increased by linking them to other databases—in other words, by making patent metrics the core of a web of innovative indicators. Specifically, Oldham suggested that links to the following sources would create analytic content:
- census data—to assess patents’ impact on people’s quality of life;
- inventor school(s)—to assess training effectiveness;
- inventor school funding sources—to inform STAR METRICS;2
- government R&D funding—to assess program effectiveness;
- company sales data—to investigate the financial impact of design patents (Bascavusoglu-Moreau and Tether, 2011);
- sector sales data—to assess impact of innovation on sales; and
- inventor citizenship—to assess the effect of immigration policy.
Oldham concluded by describing Google’s capacity to make various data sources available at low cost. For example, the company’s BigQuery is essentially an SQL service that allows researchers to analyze patent metrics data either from a browser or programmatically. Patent metrics provide a means of summarizing patents that allows anyone to access and analyze them. Oldham expressed the desire that these tools be made available in a format that is linkable to other data sources, such as tables created by other researchers, government agencies, or companies.
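The "web of indicators" idea—patent metrics joined to external tables—amounts to an SQL join on a shared key. The sketch below illustrates this with an in-memory SQLite database; the two tables, the `inventor_id` link key, and all values are hypothetical illustrations, not real schemas or data.

```python
import sqlite3

# Hypothetical patent-metrics table joined to a hypothetical table of
# R&D grants, linked on an assumed inventor identifier.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patent_metrics (
        publication_number TEXT, inventor_id TEXT, forward_citations INTEGER
    );
    CREATE TABLE grants (
        inventor_id TEXT, funding_agency TEXT
    );
    INSERT INTO patent_metrics VALUES
        ('US-1', 'inv-a', 12), ('US-2', 'inv-a', 3), ('US-3', 'inv-b', 40);
    INSERT INTO grants VALUES
        ('inv-a', 'NSF'), ('inv-b', 'NIH');
""")

# Patent counts and total forward citations per funding agency --
# the sort of program-effectiveness question Oldham listed.
result = conn.execute("""
    SELECT g.funding_agency,
           COUNT(p.publication_number) AS n_patents,
           SUM(p.forward_citations) AS total_citations
    FROM patent_metrics p
    JOIN grants g ON g.inventor_id = p.inventor_id
    GROUP BY g.funding_agency
    ORDER BY g.funding_agency
""").fetchall()
print(result)
```

Each bullet in Oldham's list corresponds to swapping in a different external table on a suitable link key.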
During open discussion, Javier Miranda asked if Google had begun to think through how data could be combined in safe ways so that information could be extracted while still residing in various silos. Oldham responded that cloud data systems have access-control levels, so it is possible to provide access to approved people for specific projects. He also noted the promise of remote access protocols whereby certified individuals from within an organization run programs and deliver the results to external researchers, or whereby researchers could compute over particular fields or experiment with a sample.
1CPC stands for Cooperative Patent Classification, which is a set of codes developed jointly by the European Patent Office and the USPTO. See http://www.uspto.gov/patents-application-process/patent-search/classification-standards-and-development [August 2016].
2From https://www.starmetrics.nih.gov/ [August 2016]: “STAR METRICS® is a federal and research institution collaboration to create a repository of data and tools that will be useful to assess the impact of federal R&D investments. The National Institutes of Health (NIH) and the National Science Foundation (NSF), under the auspices of Office of Science and Technology Policy (OSTP), are leading this project.”
Sallie Keller (Social and Decision Analytics Laboratory) agreed that technological solutions will continue to emerge. She pointed to progress in such fields as the biosciences around genomics and related areas. Computer scientists in the cryptography world are working on access and confidentiality issues and finding that it is possible to compute over two distinct datasets without either leaving its silo. She characterized these as problems likely to be solved within 5 or 10 years. Finally, while the concept of data governance is central across statistical agencies, she argued that it is important to begin changing the conversation about what privacy and confidentiality mean.
Ron Jarmin (U.S. Census Bureau) discussed activities and plans of U.S. statistical agencies to continue improvement of their data programs in the areas of business dynamics and innovation. Part of the plan calls for increased collaboration among such agencies as the Census Bureau, Bureau of Economic Analysis, and Bureau of Labor Statistics. Jarmin’s remarks focused on an initiative at the Census Bureau directed at measuring the inputs and outputs of innovation, or the activities that may lead to innovation, on a granular level. It is a collaborative research project among the Census Bureau, University of Michigan, Ohio State University, University of Chicago, and New York University that links administrative data from these institutions on funded research projects with data assets at the Census Bureau.
Jarmin said the agency’s goals for the program are to: (1) improve measurement of a small but important sector of the U.S. economy—individuals involved with funded research grants; (2) address data gaps in the measurement of innovation and its relation to economic growth; (3) collaborate with data providers to deliver data products they value, such as customized reports; and (4) initiate a prototype project that can be scaled and extended to other sectors of the economy. All of this is consistent with the Census Bureau’s economic and social measurement mission and directly relevant to the data providers, he said.
“Mashing up” university administrative and Census Bureau data to create new statistics and facilitate research requires collaboration, in this case, with the University of Michigan’s Institute on Research in Innovation and Science (IRIS). Data from IRIS-UMETRICS on sponsored research projects and the faculty, staff, postdocs, and students involved as grant recipients (see Jason Owen-Smith’s discussion in Chapter 4) allow the program to experiment with using “fat pipe” data for a sector of the economy. These data are highly complementary to business and household data at the Census Bureau, including the Business Register, Person Identification Validation System, Longitudinal Business Database, and the Longitudinal Employer-Household Dynamics (LEHD) Program underlying the Quarterly Workforce Indicators.
Linking these data, analysts at the Census Bureau are able to identify individuals and link them to records in the LEHD infrastructure, which includes data on all jobs in the economy covered by state unemployment insurance programs. Data on transactions can be linked to equipment and supply purchases and to the Business Register in a way that allows researchers to analyze the upstream and downstream value chain of university-based research. Accurate linkage is one of the major challenges inherent in the project, and the quality of the links is directly proportional to the quality of the input data. Universities so far have not provided the Census Bureau with individual identifiers, such as Social Security numbers, that would make linking more accurate; instead, name, address, and sometimes date-of-birth fields are used.
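The dependence of link quality on input fields can be illustrated with a minimal deterministic-matching sketch. This is not the Census Bureau's actual Person Identification Validation System; it is a toy example, with made-up records, of matching on normalized name plus date of birth when no unique identifier is available.

```python
import unicodedata

def normalize(s: str) -> str:
    """Normalize a name field: strip accents, case, and punctuation."""
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    return "".join(c for c in s.lower() if c.isalnum())

def link_key(record: dict) -> tuple:
    """Deterministic match key built from name plus date of birth."""
    return (normalize(record["last"]), normalize(record["first"]), record["dob"])

# Hypothetical university payroll record and LEHD-style job records.
university = [
    {"first": "María", "last": "García", "dob": "1988-04-02", "grant": "NSF-123"},
]
lehd = [
    {"first": "Maria", "last": "Garcia", "dob": "1988-04-02", "employer": "Acme"},
    {"first": "John", "last": "Doe", "dob": "1975-01-01", "employer": "Beta"},
]

# Index the larger file by key, then probe with the smaller one.
lehd_index = {link_key(r): r for r in lehd}
matches = [(u, lehd_index[link_key(u)])
           for u in university if link_key(u) in lehd_index]
print(len(matches))  # the accented and unaccented spellings link
```

Real linkage systems add blocking, fuzzy comparison, and probabilistic scoring on top of this idea, which is why richer input fields yield more accurate links.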
The link to census data allows researchers to investigate some of the flows associated with science investments to universities. Hiring and spending can be tracked at a granular level. Even detailed transactions, such as where universities are sourcing their laboratory mice, can be tracked at the lab level. More generally, one can determine the locations of businesses that provide materials purchased by the universities. From employment records, the flow of human capital (mainly students, but some faculty and staff as well) from the university outward into the economy can also be tracked. These data can address the question of what happens to people involved in funded research when they leave the university, Jarmin noted. For example, taxpayers in Michigan can see that more graduates who received grants on projects took jobs in Michigan than anywhere else. The characteristics of the hiring companies can also be tracked, which is helpful for creating a picture of the health and direction of local economies.
Jarmin pointed out that this project is only scratching the surface of what can be done with linked datasets; the ability to track the inputs and the outputs at a granular level opens a wide range of possibilities. Similar programs could in principle be set up for private-sector organizations that would allow tracking of upstream and downstream impacts of a broader array of activities in the economy than university-funded research. Systematizing standards for how companies might implement such data collections to bring in economy-wide transactions would be a difficult task. But, Jarmin suggested, if there were particular areas of interest, such as the health care sector, where advances in electronic health records are creating new possibilities, it could be possible to expand in new directions.
Jarmin concluded with the observation that economists have always wanted access to data on every transaction in the economy. The fact that it is now possible to take a group of organizations and look at every transaction for a particular set of activities shows promise that these desires may become reality. It may never be practical to measure every transaction in the economy, he said, but there is a vast space between the present and the near future in this regard.
Sallie Keller (Social and Decision Analytics Laboratory) pulled together a number of discussion threads developed over the course of the workshop about combining alternative types of data to analyze and understand society. She focused on leveraging local data sources in new and foundational ways. Nonsurvey data—some of it collected in real time while individuals are engaged in day-to-day life situations—provide a new lens on social observation, she said. Much of her work is oriented toward adapting statistical methods to make the best use of these data. As with high-powered telescopes peering into the universe, these emerging data sources may not be exposing things that are new, but they are exposing things that we have not been able to see before.
Eschewing the term “big data,” Keller described the data revolution in more precise categorical terms. Specifically, data analysts are now able to call on designed data collections (statistically designed and intentional observational data collections); administrative data (data collected for the administration of an organization or program); opportunity data (data generated through daily activities); and procedural data (data derived from policies and procedures).
Designed data collection includes the surveys that statistical agencies have conducted for decades, which Keller said are very good in terms of quality control and application. Administrative data are increasingly important in that they are broadly collected and highly underutilized across many areas of science and evidence-based policy. Opportunity data exist on the Internet and other platforms (e.g., sensors, GPS, and video) that can be captured and analyzed. These emerging data sources do not make standard ways of observing or collecting data obsolete, Keller argued, but they create opportunities to observe behavior at a finer level of granularity than with designed and administrative data. Finally, she said, procedural data provide information about policies and practices that govern a variety of activities and that provide the context necessary for comprehensive studies of various phenomena.
In addition to their advantages, new sources of data and new methods of combining them also create major challenges. Keller suggested that, while researchers address these challenges instinctively as they use data, the community needs coordinated, disciplined ways of approaching them. The central challenge in using such a diverse data landscape is data quality—both within and across buckets. The traditional approach in science involves controlling measurement processes, whether with a survey or a physical bench lab instrument. Disciplines have taken great pains to develop theory and statistical methods—optimizing sampling frames and designing collection processes—in engineering and in the bench sciences. Considerations of data quality have also required controlling the ownership of data to ensure that reliability and quality are maintained. In three of the four data buckets, Keller pointed out, there is a lack of control over data quality compared with traditional survey data. Because administrative data are collected primarily for nonresearch uses, they have to be repurposed to be used in statistical models and analyses. Another major challenge to using new kinds of data—and one she noted was not addressed extensively during the workshop—is privacy and confidentiality. Not all work can be done in secure research data centers, so this will continue to be an issue for some time.
Keller walked through some case studies being done by her lab for the Census Bureau that leverage new and detailed local data to support or enhance, or even possibly replace, some of the federal reporting and in particular some of the estimates in the American Community Survey (ACS). The idea underlying the projects, she said, is to “open the aperture” to all possible data sources, not just the ones that are easy to access.
In an education-sector case study, Keller’s group set out to acquire sources of data for key variables:
- Profiled variables: student ID, district code, year, gender, race/ethnicity, grade, age, and other variables (e.g., limited English proficiency)—data for most of these variables were complete and consistent, requiring very little cleaning.
- Transformed variables: school districts were matched with counties; ages were calculated from birthdates (North Carolina and Kentucky); for Texas, enrollment estimates were weighted to match the state-level counts.
- Restructured variables: Virginia data were restructured to create tables for race/ethnicity, by grade, gender by grade, and disadvantaged status by grade.
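The weighting step described for the Texas enrollment data can be sketched as a simple ratio adjustment: scale the local counts so they sum to a known state-level control total. The district names and numbers below are made up for illustration; the actual adjustment used by Keller's team may differ.

```python
# Hypothetical district enrollment counts and a known state control total.
district_enrollment = {"district_a": 900, "district_b": 1500, "district_c": 600}
state_control_total = 3300

# Single ratio adjustment: every district count is scaled by the same
# factor so the weighted counts sum to the state total.
factor = state_control_total / sum(district_enrollment.values())
weighted = {d: n * factor for d, n in district_enrollment.items()}

# The adjusted counts now reproduce the control total (up to rounding).
assert abs(sum(weighted.values()) - state_control_total) < 1e-9
print(weighted)
```

More refined schemes (e.g., raking over several margins) follow the same logic with one factor per cell.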
Figuring out which data are relevant and useful for the question at hand requires careful planning, Keller said. For state longitudinal data systems, challenges exist with geographic alignment whenever local data are being combined with state or federal sources. She noted that Feldman’s Research Triangle Park account provided a good example of this challenge as the Raleigh-Durham area spans two Metropolitan Statistical Areas (see Chapter 6). Likewise, school districts do not necessarily align with the Census Bureau’s Public Use Microdata Areas or even with counties. But, Keller argued, these are solvable problems.
Because the data assembled by Keller’s team are so much more granular than traditional federal statistical agency sources, it becomes possible to analyze new things. For example, in an analysis of school dropout data in Kentucky, the team was able to map rates for males and females in the Appalachian and non-Appalachian regions of the state in correspondence with the reasons for dropping out—including pull factors (e.g., employment, illness, pregnancy) and push factors (e.g., failing classes, expulsion).
Similarly, their analyses of housing exploited detailed data, including local tax assessment data, data from a host of vendors, a number of transformed variables (e.g., parcels weighted by estimated number of units), and restructured variables to create consistent geocodes per parcel. These data were used to produce neighborhood profiles at highly localized levels of geography. From the ACS, one can estimate median house values across census tracts. With local data, it becomes possible to compute actual house values and, if housing value is a surrogate for wealth, a granular index of wealth diversity at the neighborhood level.
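One way such a wealth-diversity index could be computed, once parcel-level house values are in hand, is as the Gini coefficient of values within each neighborhood. Both the index choice and the data below are illustrative assumptions, not the lab's actual methodology.

```python
def gini(values):
    """Gini coefficient of a list of positive values (0 = perfectly uniform)."""
    xs = sorted(values)
    n = len(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * sum(xs)) - (n + 1) / n

# Hypothetical parcel-level house values by neighborhood (census tract).
neighborhoods = {
    "tract_01": [210_000, 225_000, 198_000, 240_000],   # fairly homogeneous
    "tract_02": [95_000, 180_000, 450_000, 1_200_000],  # highly mixed
}
index = {name: round(gini(v), 3) for name, v in neighborhoods.items()}
print(index)
```

A tract with near-identical values scores near zero, while a tract mixing modest and very expensive parcels scores much higher, giving the kind of neighborhood-level contrast the ACS medians cannot show.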
Keller offered the following concluding points. First, she said, the use of external data (especially private-sector data) requires understanding and mitigating problems that may exist with the data because analysts and policy makers have no control over their collection; this is the opposite of the federal statistics paradigm developed over the course of decades. Second, case studies are extremely helpful for developing and testing new kinds of data frameworks. Finally, a disciplined, yet flexible and adaptable, data framework is needed to assess data quality and fitness-for-use. Keller said she envisions that analytic use of granular data from multiple sources, including data quality assessments, can be done in a disciplined way such that students in 10 years will have an approach for administrative, opportunity, and procedural data that is similarly rigorous to that which is currently in place for survey data.