The National Academies Press

7 A Paradigm Shift in Data Collection and Analysis
Pages 87-100

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.

From page 87... ... The North American Industry Classification System (NAICS) codes and the Stan ... Footnote 2: See Chapter 4 for detail on how business practice data (which include administrative records and web-based data) ...
From page 88... ... envisions implementing real-time estimation routines -- including imputation, nonresponse adjustment, and standard error estimation -- after every 24 hours of data collection. Part of this progress would entail assessing whether the standard error increase due to imputation was acceptable or additional nonresponse ... Transportation administrative data on baggage fees and the Sabre data, used to construct airline price indices; insurance claims data, particularly Medicare Part B reimbursements to doctors, used to construct health care indices; and many more sources of administrative records data from within the U.S. ...
From page 89... ... Data scientists use specialized techniques to sift through these troves of information ... between agency systems that use different data taxonomies, accounting practices, and information technology systems. The panel was impressed by NCSES's willingness to experiment with the use of administrative records to complement ... Footnote 4: In practice, scientometrics often uses bibliometrics, a measurement of the impact of (scientific) ...
From page 90... ... Consider some of the questions a policy maker concerned about the future data scientist workforce might ask of NCSES: • How many new data scientists are graduating each year? ... Datasets exist that could shed light on questions about data science, but they are very different from those produced by NCSES. They are not among the datasets typically used to analyze the STEM workforce in part because, while they ...
From page 91... ... Using help-wanted ads to track the diffusion and use of innovation at both the national and subnational levels has several advantages: these ads are public and continuously refreshed; full databases of the ads ... international markets that would otherwise be unavailable or prohibitively expensive to generate. To assess demand and salary levels for data scientists, one could turn to large databases of job listings such as Monster. ...
From page 92... ... for the expert -- there is an issue of scale. • Some of the datasets are commercial products, so one ... Footnote 8: Sample surveys are used to draw inferences about well-defined populations. ...
From page 93... ... A NEW DIRECTION FOR NCSES: The general approach described above for learning quickly and inexpensively about an emerging field by repurposing existing datasets holds considerable promise for improving understanding of many aspects of innovation in science and engineering. ... Structuring of Unstructured Datasets: The data generated from NCSES's surveys are structured. Data are stored as tables of numbers, with each number hav ...
From page 94... ... For example, NSF has autogenerated topics for awards through text processing software developed for STAR METRICS and could start including these topics in its award database so that other researchers ... Footnote 9: LinkedIn and similar data could be quite useful for questions involving relative rather than absolute measures. For example, are there more chemical than electrical engineers? ...
From page 95... ... should pursue the use of text processing for developing science, technology, and innovation indicators in the ... of tweets as an indicator of impact? NSF is supporting ongoing research in areas that could facilitate assessing nontraditional data sources. ...
From page 96... ... Because the field of data science is new and the number of practitioners is relatively small, the panel proposes two concrete initiatives that would provide some opportunities for NCSES to gain experience with new data science tools and to collaborate with data scientists. ... Data Enclave as a way to build its community of licensed researchers while enabling its own staff to spend more time in helping researchers with the substance of the data rather than paperwork. Additionally, NCSES has worked with NORC to build an infrastructure that allows research teams ...
From page 97... ... For example, the report recommends combining award data with internal and external data -- a task that would benefit from automated techniques for extracting entities (people, laboratories, programs, institutions) ... How can a federal statistical agency develop and rely on web-based and scientometric tools to produce gold-standard data for periodic publication? This is a basic ...
From page 98... ... The initial implementation of STAR METRICS at NSF involves similar types of linkages, initially linking research awards to patents and jobs (supported ... Footnote 14: See http://showoffyourapps.challenge.gov/ [December 2011] ...
From page 99... ... of Technologies Through Records of New Books: Michelle Alexopoulos and collaborators at the University of Toronto have been measuring the commercialization of technology using records of new books from the Library of Congress (Alexopoulos and Cohen, 2011) ... The STEM Labor Market: Demand: Large job boards such as Monster.com or job board ...
From page 100... ... The goal is to facilitate linking of datasets involving individual researchers. ORCID will serve as a registry rather than a data provider, but the use of these identifiers can help structure existing unstructured datasets. ... Finally, there are several databases of dissertations: • ProQuest, • WorldCat, and ...

This material may be derived from roughly machine-read images, and so is provided only to facilitate research.
