7
Improving Data Collection and Dissemination
A PARADIGM SHIFT IN COLLECTING AND ANALYZING DATA
The National Center for Science and Engineering Statistics (NCSES) finds itself in the
midst of a paradigm shift in the way data are gathered, manipulated, and disseminated. The
agency’s science and technology innovation (STI) indicators program faces several challenges:
• Traditional surveys face increasing expense, declining response rates, and lengthy
  time lags between when data are gathered and when derived indicators and other
  statistics can be published.
• Tools for data extraction, manipulation, and analysis are rapidly evolving.
• Repositories of STI measures that users demand are distributed among several
  statistical agencies and private repositories.
• Sources of knowledge generation and innovation are expanding beyond the traditional
  developed countries to emerging and developing countries.
• Users’ expectations are rising, and they are demanding more access to statistics that
  are closer to the actual measures of what they want to know.
It is also expected that standards and taxonomies for data collection and analysis will change
before the end of this decade. The Organisation for Economic Co-operation and Development’s
National Experts on Science and Technology Indicators are discussing revising the Frascati and
Oslo manuals (Organisation for Economic Co-operation and Development, 2002, 2005) on a
rolling basis. The group plans to work on priority themes and to build a better bridge between the
two manuals. The North American Industry Classification System (NAICS) codes and the
Standard Occupational Classification (SOC) codes may also undergo revisions in the next decade or less. In light of
these likely changes, this chapter offers the panel’s analysis and recommendations on activities
that NCSES needs to consider in the near future to continue to prepare organizationally for these
challenges and to improve its portfolio of STI indicators in this environment. The final report
will provide recommendations and a roadmap on how to implement those recommendations.
NCSES and, indeed, other government statistical agencies confront a world of dizzying
change in how information technology is integrated into their data gathering and data
management activities. The World Wide Web (the web), in particular, has been transformational
in making possible new kinds of forecasting and data collection methods that provide useful
insights in almost real time. These tools provide information much more rapidly than
traditional surveys, which can lag by multiple years. In addition, other changes are occurring. In his
November 2011 presentation at the annual meeting of the Consortium of Social Science
Associations, Robert Groves (2011a) conveyed the status of U.S. surveys, noting: “Threatened
coverage of frames; falling participation rates; increasing reliance on nonresponse adjustments;
and for surveys with high response rate targets, inflated costs.” His proposed solution set for
what agencies should do to address these issues is to develop an approach of a “blended data
world by building on top of existing surveys.”1 Groves (2011b) envisions multimodal data
acquisition and manipulation of data, including: “Internet behaviors; administrative records;
Internet self-reporting; telephone, face-to-face, paper surveys; real-time mode switch to fill in
missing data; and real-time estimation.”
NCSES needs to determine now how it will handle these changes if they materialize and
how the types and frequencies of various STI indicators will be affected. During the panel’s
workshop, Alicia Robb (of the Kauffman Foundation) encouraged NCSES to explore the use of
administrative records to produce STI indicators, but she also cautioned that ownership issues
associated with use of those data will have to be addressed before they could become a reliable
complementary data source to traditional survey data. Stefano Bertuzzi (of the National Institutes
of Health and the STAR METRICS Program) also presented techniques of using administrative
records at universities to determine the impact of federal research funds on scientific outputs and
the development of human capital in the physical and biological sciences.
There are also foresight questions that STI indicators can inform. Demographic,
economic, technological, and organizational changes will all influence the subjects being
measured, the mechanisms used to measure them, and the products offered by NCSES. STI
indicators will be called on to answer the following questions: How will demographic shifts
affect the science, technology, engineering, and mathematics (STEM) workforce, nationally and
internationally? Will those shifts change the locus of the most highly productive regions? Will
global financial crises slow innovation activities or merely change the locus of activities? When
will emerging economies be integrated into the global ecosystem of innovation, and what effects
might they have on the system? However, as cautioned above, indicators are not predictors. They
can be used in isolation or in groups to show tendencies, voids, and at times what additional
information is needed.
All of this suggests a shift in emphasis over time for NCSES’s indicators program. The
agency will have to make decisions on whether and how to adopt the new techniques. Although
NCSES is not expected to eliminate all traditional survey methods, it is expected that the
prolonged austerity of federal budgets will necessitate increased reliance on web-based
techniques and databases. On the horizon, the panel believes that NCSES will have to use
surveys more efficiently and increase use of web-based tools for harvesting data, particularly on
human capital measures and output measures related to scientific discoveries and innovation, and
databases from other government agencies and private providers.
INDICATORS FROM FRONTIER TOOLS
At the panel’s workshop, presentations by Erik Brynjolfsson (Massachusetts Institute of
Technology), Lee Giles (Pennsylvania State University), Carl Bergstrom (University of
Washington), and Richard Price (Academia.edu) provided insights about tools that can be used
1
For further comments on this point, see Census Bureau discussions:
http://blogs.census.gov/directorsblog/2011/09/the-future-of-producing-social-and-economic-statistical-information-part-i.html [December 2011];
http://blogs.census.gov/directorsblog/2011/09/the-future-of-producing-social-and-economic-statistical-information-part-ii.html [December 2011];
http://blogs.census.gov/directorsblog/2011/10/the-future-of-producing-social-and-economic-statistical-information-part-iii.html [December 2011].
to give up-to-date information on science and engineering networks and linkages of human
capital investments to STI outcomes and impacts. These experts showed panel members how to
use nowcasting, netometrics, CiteSeerX, Eigenfactor, and Academia.edu (similar to Lattes in
Brazil) to produce scientometric2 data for building STI “talent” indicators. Such tools can be used,
say, to track intangible assets and knowledge flows from online recruiting sites and social
networks.
Web Scraping
In addition to improving survey methods and using administrative records databases
directly, another potential avenue for acquiring data is web scraping, that is, collecting data
publicly available on the web. This approach is distinct from web-based survey methods, which
use the web to administer a survey. For example, many job seekers now publish their résumés
online; students participate in social networks; and researchers use online collaboration tools and
working paper repositories. Each of these kinds of web sites provides information about the
population using them. Hence, there are two specific questions that could be addressed using
web data: (1) How many engineers are working in the United States (or what fraction of the
workforce is made up of engineers)? (2) How many undergraduate students are majoring in
mathematics in the United States? In this section of the report, we explore how current questions
could be addressed with new data sources. Many of these sources also incorporate social
networks, which may allow the development of entirely new types of indicators. However, we
note that it is possible that many of the questions that web-based data sources could address may
be more efficiently addressed with administrative records. This is a matter for further research.
Some web-scraping projects—for example Google’s search engine—use an ad hoc
approach to collecting data, examining every web resource that they can access for relevant
information. This approach could also be useful for gathering STI statistics. However, a large
portion of STI deals with the composition of the labor force and students, and information related
to them is centralized in several large websites, rather than being distributed across individual
home pages. There are at least three examples of these sites and the kind of information they
could provide:
• Facebook, Google+: number of students at a university, how many major in which
  fields;
• Mendeley, Academia.edu, CiteULike: how many researchers are active in which
  fields, how many collaborations, who collaborates with whom, how useful is a given
  piece of research;
• LinkedIn, Monster.com, Zerply: the composition of the labor force, geographic
  breakdown, skill sets, and similar information.
One can collect data from a site such as Monster.com either by scraping information from
the public website or by negotiating with the site for access to the data. Two reasons for
preferring negotiation are legality and data structure. The legality of web scraping has been
2
In practice, scientometrics is often done using bibliometrics, a measurement of the impact of (scientific)
publications and patents.
challenged several times in courts, both in the United States and abroad,3 and there does not
appear to be a consensus about what is legal. However, all the cases to date that the panel found
involved one for-profit company scraping data from another for-profit company’s site for
commercial use. For example, a travel company might use web scraping to collect information
on ticket prices from an airline and then use those prices to make it easy for customers to do
comparative shopping. During the course of this study, the panel has not found an example of a
nonprofit or governmental organization or academic researcher being sued over web scraping.
The goal of web scraping is to take semistructured data from a public web page and
register it into a structured database. The major search engines have started to collaborate on
mechanisms for structuring data on the web.4 The National Science Foundation (NSF) could
consider participating or encouraging participation in the development of schema for structuring
data relevant to indicators, such as adding fields for educational background to the Person
schema or defining fields for a journal article as a specialization of the CreativeWork schema.5
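As a minimal sketch of what moving from semistructured pages to a structured table can look like, the Python example below reads schema.org Person records that a page embeds as JSON-LD (one of the encodings schema.org supports today) and flattens them into rows. The sample HTML, the choice of fields (name, jobTitle, alumniOf), and the function name extract_people are illustrative assumptions, not part of any NCSES, NSF, or schema.org tooling.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_people(html):
    """Return one flat row per schema.org Person found in the page (hypothetical helper)."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    rows = []
    for block in parser.blocks:
        try:
            item = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed markup rather than guess at its content
        entries = item if isinstance(item, list) else [item]
        for entry in entries:
            if isinstance(entry, dict) and entry.get("@type") == "Person":
                alumni = entry.get("alumniOf")
                if isinstance(alumni, dict):
                    alumni = alumni.get("name")
                rows.append({
                    "name": entry.get("name"),
                    "job_title": entry.get("jobTitle"),
                    "alumni_of": alumni,
                })
    return rows


# Made-up page content, used only to exercise the sketch.
SAMPLE_HTML = """
<html><body>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Person", "name": "Jane Doe",
 "jobTitle": "Chemical Engineer",
 "alumniOf": {"@type": "CollegeOrUniversity", "name": "Example University"}}
</script>
</body></html>
"""

people = extract_people(SAMPLE_HTML)
```

When pages carry markup of this kind, the structuring step reduces to reading declared fields, which is the practical reason shared schemas matter for indicator work. Pages without such markup would still require site-specific parsing.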
A company such as Monster.com already has such a structured database.6 Although this
database is not public, some companies have supplied parts of their databases as part of academic
collaborations, and they may be willing to work with NCSES. If the data come directly from a
company, there is no chance of introducing errors during the web scraping process. On the other
hand, the company may not be willing to supply sufficient data for the agency's purposes or may not be
willing to supply it in a timely manner.
An advantage of web scraping is that it could be carried out continuously, and it does not
require cooperation with the company. Alternatively, a middle ground would be to work with
companies on structured feeds or digests of information that would be updated continuously.
Companies may well prefer this to being scraped, which can require significant server resources.
The National Institutes of Health, for example, makes XML versions of its grants database
available and provides ongoing updates. Many researchers today mine social networks for data
(without any legal consequences, as noted above). There are three basic questions for identifying
sources of data:
1. What STI topic areas does a particular company address?
2. Is the company willing to provide data directly?
3. How frequently is the company willing to update data?
A fundamental question requires more examination: What kind of statistical methodology
to apply to data from web scraping? There are other, related questions: What are the tradeoffs
with using web-based data sources instead of survey data? Is it possible to adjust web-based data
3
For example, Ryanair, a European airline, initiated a series of legal actions to prevent companies such as
Billigfluege and Ticket Point from scraping ticket price data from its website to allow for easier comparison
shopping: see http://www.ryanair.com/en/news/ryanair-wins-screenscraper-high-court-action [December 2011]. In a
2000 California case, eBay v. Bidder’s Edge, eBay sued Bidder’s Edge over price-scraping activities: see
http://www.law.upenn.edu/fac/pwagner/law619/f2001/week11/bidders_edge.pdf [December 2011]. And in another
California case, in 2009, Facebook, Inc. v. Power Ventures, Inc., Facebook sued Power Ventures over scraping of
personal user data from the Facebook site: see
http://jolt.law.harvard.edu/digest/9th-circuit/facebook-inc-v-power-ventures-inc [December 2011].
4
See http://schema.org/docs/faq.html [December 2011]
5
For the person schema, see http://schema.org/Person [December 2011]; for the CreativeWork schema, see
http://schema.org/CreativeWork [December 2011].
6
For example, http://www.simplyhired.com/ searches the web and pulls in jobs; indeed.com does the same.
to represent a survey sample or to estimate errors? Is it possible to use a traditional survey to
calibrate web-based data? How frequently must this be done? How frequently would NCSES
want to publish new statistics? Would NCSES want to publish less reliable statistics if it means
publishing them more frequently at lower cost?
A company such as LinkedIn stores in its servers a social network representing all of its
users and relationships between them, and techniques for accurately sampling this social network
have been developed (see Maiya and Berger-Wolf, 2011; Mislove et al., 2007). However, to our
knowledge, researchers have not yet addressed how well this social network represents the larger
population. For example, if one is interested in measuring how many chemical engineers are
working in the United States, some subset of these are represented in LinkedIn’s social network,
but it is unclear how to adjust this subset accurately to represent the target population or how to
estimate the error incurred in using this subset.7 It is important to understand how the data
collected from websites compares with traditional survey data, particularly because different
websites have very different coverage. Facebook, for example, covers a very large portion of the
undergraduate population (at least for the next couple of years). However, sites such as
Mendeley and Academia.edu definitely do not cover the entire population of researchers.
Combination Approaches
It could prove useful to adopt a combination approach, in which web-based statistics are
periodically calibrated against a traditional survey. Of course the survey would have to be
administered less frequently than currently or there would be no cost or time savings.
Since NCSES has reported that the response rates of some of their surveys are declining,
questions arise about how well those data reflect the population sampled and how to calibrate
web-based data to those surveys. It is relatively straightforward to calibrate to the Survey of
Earned Doctorates (SED), which has a 100 percent response rate, but only once and on questions
that the SED asks. One solution to this dilemma would be for NCSES to put resources into
getting close to a 100 percent response from a small number of people from a standard survey
and use that to calibrate information from web-based sources or the rest of the survey. The
calibration is similar to what computer scientists and mathematicians do with compressed
sensing of data on pixels and is a very interesting and exciting area of research.
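One simple form such calibration could take is post-stratification: at calibration time, a high-response survey gives trusted totals per stratum, the web source gives counts for the same strata, and their ratio is an estimated coverage rate that is later used to rescale fresh web counts. The Python sketch below is illustrative only. The strata, the numbers, and the function names are made-up assumptions, and this is not a method the panel prescribes.

```python
def coverage_rates(survey_totals, web_counts):
    """Estimate, per stratum, the share of the surveyed population seen in the web source."""
    rates = {}
    for stratum, total in survey_totals.items():
        if total <= 0:
            raise ValueError(f"survey total for {stratum!r} must be positive")
        rates[stratum] = web_counts.get(stratum, 0) / total
    return rates


def calibrated_estimates(new_web_counts, rates):
    """Rescale later web counts by the coverage rates measured at calibration time."""
    estimates = {}
    for stratum, count in new_web_counts.items():
        rate = rates.get(stratum)
        # A missing or zero rate means the web source cannot be calibrated for this stratum.
        estimates[stratum] = count / rate if rate else None
    return estimates


# Hypothetical calibration year: survey totals versus web profile counts.
survey_totals = {"chemical": 1000, "electrical": 4000}
web_at_calibration = {"chemical": 250, "electrical": 2000}
rates = coverage_rates(survey_totals, web_at_calibration)

# Hypothetical later period with web data only.
web_later = {"chemical": 300, "electrical": 2100}
estimates = calibrated_estimates(web_later, rates)
```

The sketch assumes that coverage within each stratum stays stable between calibrations. How quickly that assumption breaks down is exactly what determines how often the calibrating survey must be repeated.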
It may not yet be possible to achieve rigor comparable to a traditional survey with these
methods, and NCSES will need to consider what its tolerance is for publishing statistics that may
not be as reliable as those they have previously published. In such consideration, NCSES will
need to balance reliability against timeliness: since little time is required for data collection with
data mining techniques in comparison with traditional surveys, releasing statistics on a much
more frequent basis is possible. In principle, nothing prevents statistics from being periodically
or continuously updated. For example, the national unemployment rate, gross domestic product,
and consumer price index are periodically updated without diluting the measure’s importance.
The Billion Prices Project at the Massachusetts Institute of Technology uses an algorithm that
collects prices daily from hundreds of online retailers worldwide, creating, among other things, a
daily price index for the United States.8
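To make the idea of a continuously updated statistic concrete, the following Python sketch computes an elementary price index as the geometric mean of price relatives (a Jevons index) over items observed in both a base period and the current day. This is a textbook formula chosen for illustration and is not claimed to be the Billion Prices Project's methodology. The item keys, prices, and function name are made-up assumptions.

```python
import math


def jevons_index(base_prices, current_prices):
    """Geometric mean of current/base price ratios over matched items, with base = 100."""
    matched = [item for item in base_prices if item in current_prices]
    if not matched:
        raise ValueError("no items observed in both periods")
    log_sum = sum(math.log(current_prices[item] / base_prices[item]) for item in matched)
    return 100.0 * math.exp(log_sum / len(matched))


# Hypothetical scraped prices: one item doubled, one halved, one is new and unmatched.
base = {"item_a": 1.0, "item_b": 4.0}
today = {"item_a": 2.0, "item_b": 2.0, "item_c": 9.0}
index_today = jevons_index(base, today)
```

Only matched items enter the index, so products that appear or disappear between collection dates are ignored here. A production index would need an explicit rule for such turnover.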
7
LinkedIn and similar data could be quite useful for questions involving relative rather than absolute measures.
For example, are there more chemical than electrical engineers? Do chemical engineers change jobs more frequently
than other engineers? Where in the country are chemical engineers most highly concentrated?
8
See http://bpp.mit.edu/ [January 2012].
Developing Ideas Through Contests
One way to develop these ideas further would be through a contest or research funding or
prize competition. There are several “open innovation” organizations that facilitate this type of
contest, such as InnoCentive, the Innovation Exchange, and challenge.gov. Working with an
outside entity to design and administer a contest would allow NCSES to focus on the problems it
hopes to address rather than the implementation details of the contest. A National Research
Council (2007) report, “Innovation Inducement Prizes at the National Science Foundation,” and
the National Science Foundation’s new Innovation Corps Program could also serve as useful
models, although these resources are focused more specifically on technology
commercialization.
If the contest is designed to address the statistical questions around the usefulness of web-
based data sources, it will be necessary to supply some sample data, and this might affect
negotiations with companies. For example, LinkedIn might be willing to supply its data for
NCSES to use but unwilling to allow its use in a public contest.
How can a federal statistical agency develop and rely on web-based and scientometric
tools to produce gold-standard data for periodic publication? This is a basic question that needs
to be considered in the current climate of rapidly changing technologies and increasing demands
for data. There is a raft of related questions, including: How can an agency overcome looming
privacy and security issues? How many personnel internal to the agency will it take to develop
and operate the tools to produce the indicators? These are all good questions that will need to be
fully addressed before NCSES or any other federal statistical agency implements the frontier
methods described in this section.
One way to address these questions is by example. In 2011, the National Institutes of
Health (NIH) decided to sponsor a competition9 to find improved methods of using the National
Library of Medicine (NLM) to show knowledge flows from scientific exploration through to
commercialized products. The agency also wanted to use the NLM resource for taxonomic
development and to show relationships between research activities. Knowledge spillovers and
effects are difficult to measure. NIH determined that one way to mine millions of data entries
was to automate the process. Yet, that was not the expertise of any specific department at NIH,
and it was important to cast a broad net to get the best expertise addressing the problem. The
competition was announced on challenge.gov and was titled: The NLM Show Off Your Apps:
Innovative uses of NLM Information Challenge. The competition was open to individuals, teams
of individuals, and organizations and its purpose was to “develop innovative software
applications to further NLM’s mission of aiding the dissemination and exchange of scientific and
other information pertinent to medicine and public health.”10 The competition ended August 31,
2011, and winners were announced on November 2.11
9
We thank Jerry Sheehan (National Institutes of Health) for providing us with information and materials on this
competition. See http://showoffyourapps.challenge.gov/ [December 2011].
10
See http://showoffyourapps.challenge.gov/. [December 2011].
11
Another example of a competition is the Netflix Prize, documented in the materials for the Committee on
Applied and Theoretical Statistics and the Committee on National Statistics of the National Academy of Sciences
seminar, entitled: “The Story of the Netflix Prize,” November 4, 2011. (See
http://www.netflixprize.com/community/ and
http://www.science20.com/random_walk/predicting_movie_ratings_math_won_netflix_prize [January 2012]).
A Note of Caution
On a cautionary note, Boyd and Crawford (2011) assert that “[t]he era of Big Data has
begun” and “it is imperative that we begin asking critical questions about what all this data
means, who gets access to it, how it is deployed, and to what ends.”12 While mining data at the
project or individual level may yield valuable results, it is also the case that archival data from
some sources are poor or nonexistent, and Boyd and Crawford (2011) also noted: “There is a risk
in an era of Big Data of treating every connection as equivalent to every other connection, of
assuming frequency of contact is equivalent to strength of relationship, and of believing that an
absence of connection indicates a relationship should be made.” This is a very important point.
NCSES will have to proceed with caution as it considers integration of frontier tools and datasets
into its indicators production processes.
MULTIMODAL DATA DEVELOPMENT
One issue that needs to be explored is the feasibility of blending the use of administrative
records, scientometric tools, and survey techniques to produce more accurate data on STI human
capital measures and other indicators that NCSES produces, such as R&D input and performance
measures. A multimodal approach would help to create longitudinal series using existing and
new information. In the near term, the topic could be explored through a workshop specifically
designed to discuss the conceptual framework and feasibility of blending data acquisition
techniques and using this mixed-methods approach to develop new indicators.13 This approach
could be useful for developing real-time maps of networked scholars, while measuring return on
investments from federal research funds as they are used and linking them to outputs (papers and
patents). At the same time, this approach would periodically assemble basic data on education,
employment, work activities, and demographic characteristics. We must stress, however, that it
would be prudent to test the approach on data that are already well developed at NCSES.14
STI DATA LINKAGE AND COORDINATION
The panel discovered that there appear to be multiple agencies collecting information
about innovation, including: NCSES; the Bureau of Labor Statistics (BLS) in the U.S.
Department of Labor; the Census Bureau, the Bureau of Economic Analysis (BEA), and the
National Institute of Standards and Technology in the U.S. Department of Commerce; and the
12
For other important references on the use of visual analytics tools to answer science policy questions, see
Zucker and Darby (2011) and Thomas and Mohrman (2011).
13
Statistica Neerlandica has prepublication views of a series of articles on the use of administrative records for
analytical purposes, including regression analysis: see
http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1467-9574/earlyview [December 2011]. For theoretical
foundations on how to combine information from multiple data sources, see Molitor et al. (2009).
14
In a recent article, Roach and Cohen (2011) use citation-based and survey-based approaches to obtain measures
of knowledge flows. They find overall “close correspondence between citation-based measures of knowledge flows
and [the] survey-based measure at the industry level.” However, when they control for industry fixed effects, the
correlation between the two sets of data “drop by approximately three quarters.” Roach and Cohen acknowledge that
there is measurement error in their survey data and in the patent data that they use for comparison. They and other
researchers are attempting to determine the nature of such errors to improve the reliability of proximate
measurements of knowledge flows. One conclusion to be drawn here, though, is that developing indicators using
different techniques will give users the relevant range for the measures that they seek to use.
U.S. Department of Agriculture. Indeed, if the subject is broadened to STI, we were told that at
least 5 agencies collect these data. This suggests a need for an entity to assume a coordinating
function, to ensure that STI data collection and reporting efforts across the government are
efficiently distributed, to eliminate duplication, to take advantage of potential synergies, and to
ensure high quality of the data and statistics. Such an entity could be an interagency council or a
working group of agency representatives.
The NCSES could take on an important role in such a council or working group,
particularly given its function as a data clearinghouse with respect to STI data and statistics. The
panel was told by staff in several agencies that it could be beneficial to the statistical system if
NCSES became an enhanced data aggregator and data curator for STI-related information.
NCSES could explore such a role through the creation of an interagency council on STI
statistics. Such a council could identify and address STI statistical issues and opportunities
among other statistical agencies. It could not only serve as a clearinghouse, but it could also
initiate and monitor activities that serve the user community. Both the Office of Management
and Budget and the Office of Science and Technology Policy could have roles in such a council.
Better integration of data sources is needed to develop more robust STI indicators. John
Haltiwanger (University of Maryland) suggests that infrastructure datasets could be fully
integrated to track the career histories of innovators and entrepreneurs and track the relationships
between startup/young/high-growth firms and large, mature firms. These infrastructure datasets
could be fully integrated with all existing Census Bureau business surveys and other data: for
example, one could integrate economic censuses and annual surveys to measure productivity,
business practices, and R&D, linked to patent, citation, and other information about innovators
from the U.S. Patent and Trademark Office.
It will be important to integrate any new STI indicators that are developed into the
existing infrastructure (if not at the person/business level, then at some level of disaggregation).
Data sharing and synchronization would permit even richer integration of BLS and BEA firm
data. At the panel’s workshop, Matthieu Delescluse (European Commission) also remarked that
the European Union (EU) is commissioning the linking of patent data with company databases to
develop new indicators. For example, the relationship between small and medium firms and
number of patents can be tracked over time. The EU is also using data from the Community
Innovation Survey Business Register to determine the international sourcing of R&D. This
statistic could also be developed in the United States by linking Census Bureau and BEA data.
Employment dynamics, including worker mobility trends in science and engineering
occupations, could be developed by linking Census Bureau, BLS, and BEA data. Existing
research data centers or data enclaves could facilitate platforms for data integration and
potentially data comparability with other nations that also follow similar data administration
policies.15
NCSES already has the infrastructure at the National Opinion Research Center (NORC)
to house many of its survey data and allow licensed researchers access through remote dedicated
computers. Data that will be available by the beginning of 2012 will come from the following
surveys: Survey of Earned Doctorates (SED), National Survey of Recent College Graduates
(NSRCG), Survey of Doctorate Recipients (SDR), and the Scientists and Engineers Statistical
Data System (SESTAT) integrated database (which includes NSRCG, SDR, and the National
15
“The NORC Data Enclave exists to provide researchers with secure access to microdata and protect
confidentiality, as well as index, curate and archive data.” See the full description of the NORC data enclave at
http://popcenter.uchicago.edu/data [January 2012].
Survey of College Graduates [NSCG]). The panel heard from several people that NCSES sees
the NORC data enclave as a way to build its community of licensed researchers while enabling
its own staff to spend more time in helping researchers with the substance of the data rather than
paperwork. Additionally, NCSES has worked with NORC to build an infrastructure that allows
research teams to share their work in a secure environment, whether they are physically in one
location or not.
There is strong interest in the dynamics of firm demographics, births, deaths,
employment contributions by firms, and the role of the high-growth firm. These statistics can be
developed by the Census Bureau by analyzing its business register. If these data are available to
researchers (say, at the NORC data enclave), a broad spectrum of evidence-based statistics and
indicators could be made publicly available. One means by which such a program could begin is
through the initiation of a research program by NCSES. Such a program would energize the
research community to use survey and other data as soon as the data arrive at the NORC data
enclave. It could also be designed to incentivize researchers to develop new, high-utility statistics
that are based on linked data from several agencies and that relate inputs to outputs, outcomes,
and effects.
NCSES strives to be the central repository in the federal government of data, knowledge,
and expertise in all STEM (science, technology, engineering, and mathematics) topic areas, as it
is so identified in the America COMPETES Reauthorization Act of 2010. Acting as such a
repository will involve obtaining data from other federal agencies, as well as from
intergovernmental and private sources. The focus would be on STEM aspects that are the official
responsibility of NCSES. Although NCSES staff may obtain access to BRDIS data directly
through the Survey Sponsor Data Center that was scheduled to come online December 30, 2011,
researchers will not have access to these data. Linking data from the BRDIS and the Human
Resource and Education Development survey with administrative longitudinal data would
provide a rich dataset for linking knowledge inputs to outcomes. This would provide a
longitudinal component to BRDIS analysis. NCSES could explore avenues for developing this
type of longitudinal capability and the in-house analytical capacity to produce STI indicators
from those data.
CONCLUSION AND RECOMMENDATION
Integration of the frontier tools discussed above into practice at NCSES would represent
a paradigm shift for the agency at a critical time, when it is reaping benefits from
investments in revised surveys during the past four to five years. Therefore, the panel
recommends that NCSES in the near term undertake pilot work to determine how its indicators
program can combine the new techniques with traditional survey methods. A logical extension
of this work is for NCSES to begin expanding its role as a clearinghouse for STEM information,
coordinating data interoperability and standards for metadata descriptions (e.g., participating in
schema.org), and perhaps developing tools such as open source web crawlers.
RECOMMENDATION 6: The National Center for Science and Engineering
Statistics should fund exploratory activities on frontier data extraction and
development methods. These activities should include
• research funding or prize competitions to harness the computing power of data
  specialists with a view to (a) analyzing existing public databases to develop better
  indicators of science, technology, and innovation activities and (b) analyzing the
  huge and growing amount of information on the Internet for similar purposes;
• pilot programs or experiments to produce a subset of indicators using web tools;
  and
• convening a workshop of experts on multimodal data development, to explore
  the new territory of developing metrics and indicators from surveys,
  administrative records, and scientometric sources.