Suggested Citation:"2 Knowledge for Data Scientists." National Academies of Sciences, Engineering, and Medicine. 2018. Data Science for Undergraduates: Opportunities and Options. Washington, DC: The National Academies Press. doi: 10.17226/25104.

2

Knowledge for Data Scientists

Over the past decade, data science has emerged out of a variety of widespread developments (as discussed in Chapter 1), and companies, academic institutions, and governments are striving to hire data scientists while transforming their practices (BHEF and PwC, 2017; Ernst and Young, 2017). There are many instances of academic data science. Still, “data science” is not yet fully defined as an academic subject; the central tenets, concepts, knowledge, skills, and ethics powering this emerging discipline remain points of active discussion and continue to evolve. A new generation of tool developers and tool users will require the ability to understand data, to make good judgments about and good decisions with data, and to use data analysis tools responsibly and effectively (referred to as “data acumen” throughout this report). Developers and users draw from computing, mathematics, statistics, and other fields and application domains. Educators and administrators are beginning to reimagine course content, delivery, and enrollment at the undergraduate level to best prepare students to operate in this new discipline.

New and greater volumes of information, along with its variety and velocity, compound long-standing challenges of data analysis—and raise new ones. The ability to measure, understand, and react to large quantities of complex data can shape scientific discovery, social interaction, political interactions and institutions, economic practice, public health, and many other areas. Data science workflows not only consume data, but they also produce data—such as intermediate data sets, statistics, and other by-products such as visualization—that need to be understood.

Although the definition of data science is evolving, it centers on the notion of multidisciplinary and interdisciplinary approaches to extracting knowledge or insights from large quantities of complex data for use in a broad range of applications. Data science is about synthesizing the most relevant parts of the foundational disciplines to solve particular classes of problems or applications that are newly enabled because the volume and variety of data available are expanding swiftly, data are available more immediately, and decisions based on data are increasingly automated and in real time. Data scientists often work at the interface of disciplines and can help develop new approaches to address problems in these areas.

Data science applications have varying levels of risk. For example, recommender systems that suggest purchases within an online shopping platform or select advertisements for website visitors are relatively low risk. Although provider sales may be affected if undesirable products are recommended and users may be dissatisfied with their purchases, the overall impact of poor retail recommender systems to individuals and society is generally low. Still, the recommendations can influence the behavior of large segments of a population and are often coupled with a just-in-time supply chain, which aims to forecast consumer demand given available data and optimize production and shipping of goods. In this case, the systems can have substantial impact, especially if they result in a shortage of necessary items, such as food and medicine, owing to natural disaster or unanticipated interactions with other external factors. But increasingly, as similar data-driven algorithms are used to recommend sentencing or release of criminals, guide testing or treatment of patients, plan urban development, draw political boundaries, allocate funds, and inform other critical public policy decisions, impacts on individuals and society can be profound.

While new volumes and types of information can make analyses more accurate than past methods that relied on sparse surveys with lower than desired survey frequency, response rates, and sample sizes, they still have limitations. Weaknesses in data quality and data analysis might have a wide range of negative policy effects: problems might be misunderstood in their causes and scale; a program that a family depends on might get insufficient funding; or a policy might be enacted that has unintended consequences for large segments of the population.

Thus, it will be important that data are collected and analyzed appropriately and that there are clear principles guiding the use of data for human good. Furthermore, the complexity of the analyses and the increasing dependency on data across all fields of human endeavor will drive demand for “smarter” tools and best practices for data science that minimize mistakes in interpretation.

Data science is not just the practice of analyzing a certain data set about a particular question. It often results in the creation of processes that continuously take in new data, often from many sources, and generate refined distillations of those data, which in turn become sources for new inquiries, questions, and analyses. The products of the data scientist—including data, code, visualizations, and recommendations—often take on a life of their own far beyond the initial question that gave rise to their creation. In this way, data science takes on aspects associated with engineering—namely, the creation of infrastructures that undergird society and must safely withstand unanticipated changes in demand and use.

Academic institutions, companies, and governments recognize these shifts and are rapidly embracing a vision of an emerging discipline of data science that is unique yet builds on knowledge from existing disciplines (NRC, 2014). Generally, each academic discipline recognizes that its viewpoint alone is insufficient to encompass all of data science.¹ Advances in the power and usability of data science computing tools have made it possible for even inexperienced people to conduct complex analyses over enormous data sets without really understanding the possible artifacts and biases that may be lurking in the data or the reliability of the results and interpretations. Machine learning models may achieve superhuman performance on challenging machine vision tasks yet may employ biased or unfair interpretations of the data (Jordan, 2013). Application domains (e.g., business, medicine, natural science, social sciences, or engineering) are developing and adapting machine learning and deep learning techniques to solve specific research questions. These techniques can be more effective than previously used methods but may lack mathematical or statistical rigor or computational scalability. Increasingly, domains in the humanities, such as philosophy, rhetoric, history, and literary studies, embrace elements of data science while issues of algorithmic bias present moral and ethical questions.

Data scientists have the potential to help address critical real-world challenges. Just a few examples are listed here:

Enabling more accurate diagnosis of melanomas through better analysis of images. Deep learning techniques have been applied to detect melanoma, the deadliest form of skin cancer. These methods improve the analysis of tissue images, promising a more accurate diagnosis than traditional techniques (Codella et al., 2017).
Enhancing business decisions. Business analytics can assist entrepreneurs and company executives in making timely decisions based on market trends. This can be coupled with analysis of online social media information to respond directly to consumer demands or create a more personalized advertising experience (Chen, Chiang, and Storey, 2012).
Helping aid organizations to respond faster. Data science and analytics are used to assist aid organizations to respond more quickly in times of need, such as when the Swedish Migration Board used data science to make predictions about and determine national implications of emigration trends (Pratt, 2016).
Developing “smart cities.” Cities around the world, such as London, Rio de Janeiro, and New York, collect real-time data from a variety of sources, such as public transportation, traffic cameras, environmental sensors for parameters such as temperature and humidity, and social media interactions regarding local issues. The data can then be processed, analyzed, and utilized to improve city efficiency and cost-effectiveness as well as resident well-being (Kitchin, 2014).

___________________

¹ The American Statistical Association and the Computing Research Association have both released formal statements to this effect (ASA, 2015; CRA, 2016). The Institute of Electrical and Electronics Engineers (IEEE) has introduced numerous data science conferences associated with its various special interest groups. National position statements around information management and operations research are less well defined. The popular press is full of comparisons of data science and business analytics or business intelligence; none assert that the latter two subsume data science.

However, there are also many instances of high-impact and high-profile data science research that has resulted in flawed or inaccurate findings, as well as ethical and legal quandaries. A few examples are listed here:

Inaccurate predictions of flu trends. In 2013, Google Flu Trends over-predicted true influenza-related doctors’ visits as determined by the Centers for Disease Control and Prevention. This has been primarily attributed to overreliance on outdated models (Butler, 2013).
Release of personally identifiable data. The abundance of data available on individuals from companies and social media can present ethical dilemmas to researchers in terms of privacy, scalability of results, and subject participation agreement. For instance, a 2013 study linking numerous Twitter users to sensitive information from their financial institutions prompted discussions of when researchers should be required to obtain written consent when using nominally publicly accessible information (Danyllo et al., 2013).
Biases in predictive policing. There is much debate over the use and appropriateness of predictive policing, the use of data science by law enforcement to predict crime before it occurs. There is no consensus yet on the effectiveness of this methodology, and civil liberties groups argue that the data used to develop (i.e., train) the models are inherently biased (Hvistendahl, 2016).
Surveillance of citizens. China is deploying facial recognition technologies as well as other data science approaches to track individuals and influence behavior. The national goal is to link these surveillance systems by 2020 to “implement a national ‘social credit’ system that would assign every citizen a rating based on how they behave at work, in public venues, and in their financial dealings” (Chin and Lin, 2017).

Data science is currently being applied in many organizations within industry, academia, and government, often by self-taught practitioners. There are indications of strong demand in a variety of domains for graduates with data science skills. A recent study by IBM found more than 2.3 million data science and analytics job listings in 2015, and both job openings and job demand are projected to grow significantly by 2020 (Columbus, 2017). Three-fifths of the data science and analytics jobs today are in the finance and insurance, professional services, and information technology sectors, but the manufacturing, health care, and retail sectors also are hiring significant numbers of data scientists (Markow et al., 2017). The IBM study also shows that it takes significant time to find and hire staff with the right mix of skills and experience (Columbus, 2017). Since many employers are themselves new to the use of data science, they may not be able to provide training and therefore may prefer to hire individuals who already have appropriate classwork and hands-on experience. More generally, a poll conducted by Gallup for the Business-Higher Education Forum revealed that 69 percent of employers expect candidates with data science and analytics skills to get preference for jobs in their organizations by 2021 (BHEF and PwC, 2017).

Current data science courses, programs, and degrees are highly variable in part because emerging educational approaches start from different institutional contexts, aim to reach students in different communities, address different challenges, and achieve different goals. This variation makes it challenging to lay out a single vision for data science education in the future that would apply to all institutions of higher learning, but it also allows data science to be customized and to reach broader populations than other similar fields have done in the past. Moreover, the continual emergence of new data sources and new analytical tools makes this an extremely fluid environment, where the courses that are taught today might be organized around concepts and practices that are supplanted in the near future. Any data science program will have to take this into account, and this complicates discussions about how to define and structure the field.

However, important foundational data science skills are highlighted in this chapter and may serve as a platform for any practicing data scientist. The themes described in this chapter underlie data science education, but they are not necessarily novel challenges or even unique to data science. The lessons learned from other disciplines can help pave the way to ensuring the success of data science education.

DATA SCIENTISTS OF TODAY AND TOMORROW

As was discussed in the previous section, there is a current shortage of workers with data science skills. The day-to-day work and thus educational needs of the different types of data scientists are highly differentiated. This section of the report will map the educational needs with the roles that students will be expected to perform in the workplace and the skills needed to prepare students for graduate studies and research careers in many fields of inquiry (NRC, 2013).

Data science roles vary across government, industry, and academia and will continue to evolve in the future. As with other complex fields of study, there is both differentiation and overlap in these roles. The breadth and depth of data science roles underscore the complexity that employers face in the identification of qualified candidates for their job postings and the challenges that academic institutions face in preparing their students for these emerging roles. Some current areas of focus for data scientists include the following:

Computing hardware and software platforms for data science. Data scientists who manage the platforms on which data science models are created focus on understanding and maintaining a computing environment that meets the demands for big data, fast (sometimes real time or near real time) model generation, and data interrogation, up to and including the demands of real-time data collection (i.e., streaming) and complex data visualizations. A significant challenge of this job is remaining current on the latest computing hardware and software. Unlike system administrators, who need to understand only one or two computer systems, these data scientists create environments for data science modelers and analysts that can be used across a range of computing platforms. This requires that they understand the changing programming languages used for data science, the supporting libraries, and the many types of data storage systems, as well as how to keep all of these components operational and secure. Because of the rapid rate of change in this area, educational training needs to focus on key topics such as database maintenance, security, programming hardware, and operating systems. A certain level of proficiency in these skills could be developed in a 2-year associate’s degree program, upon completion of which graduates will be able to manage changing computing systems and keep pace with the ever-growing computational needs of machine learning and data science model development and workflows. However, additional depth may be required to equip professionals to keep pace with rapidly advancing technology, perform capacity planning and availability assessments, and deploy solutions that are reliable and scalable.
Data storage and access. Data scientists who focus on managing data storage solutions as well as extracting, transforming, and loading data for modeling should have the ability to manage exceptionally large data sets from a variety of heterogeneous data sources and in batch or streaming form, and to assess the predictive value of these data sources. A strong knowledge of both databases and streamed analytical processing is key to this role. These data scientists need to understand the data science workflow, to document data quality problems, and to select appropriate methods of interpolation, in some cases even creating data models to clean and reduce errors in downstream model development performance. Some domain knowledge is likely needed (e.g., to understand data quality issues and how to best manage the data). The education needed for this role varies, and skill sets could be developed at both 2- and 4-year institutions (keeping in mind that a data science team would likely need to represent additional important skills, such as computing, continuous cross-validation, and adoption of new modeling techniques or frameworks).
Statistical modeling and machine learning. Experts in statistical modeling and machine learning interface with stakeholders to capture requirements and develop the scope of work for data science projects, undertake the data science analysis cycle, and typically bridge the gaps among more narrowly focused data science roles. Written and oral communication skills are essential for this position, as is experience with coordinating teams. Often these data scientists require considerable domain expertise in the field for which the data science models are being developed. For example, an individual developing a model for clinical trial analysis for drug development would need to have a significant understanding of pharmacology and clinical data collection. Although data science skills required for this role are broad, the disciplinary knowledge is highly specific. Owing to the breadth and complexity of this position, a 4-year undergraduate program may be required to develop the level of proficiency necessary for success. However, even 4-year programs are unlikely to develop sufficient knowledge in a domain through exposure to a small number of courses (i.e., a second major, co-major, or minor may be necessary). All data scientists need to acquire domain knowledge, but it is particularly important for this role.
Data visualization. Ideally, data visualization experts combine development and design skills with the ability to understand the meaning of the underlying analyses. These data scientists are adept at visual storytelling with data. They can examine large data sets and create clear, efficient, compelling online layouts, images, dashboards, and interactive features that can stand on their own or complement narrative text. At their core, they are effective translators between technical and statistical specialists and superior communicators with multiple nontechnical audiences. They are well versed in the key elements of effective graphical displays as well as the pitfalls of misrepresenting data and results. These data scientists combine knowledge of statistical analysis tools, libraries, and frameworks to complement a foundation in computational, statistical, and data management methods. They are well prepared to adjust quickly as new standards and tools become available. They can design for multiple formats and platforms and be grounded in user experience insights. They understand application programming interfaces—how to parse them and, ideally, how to build them—and are closely aligned with the data management functions performed by others on a team. Both 2- and 4-year programs can help prepare students for this role.
Business analysis. A growing number of positions involve making sense of and communicating about data without necessarily relying on programming skills. These jobs are built around assembling and presenting data to inform a decision-making process. These data scientists are common in many business areas, have expertise in various domains, and can utilize skills developed in both 2- and 4-year programs.

There are many other types of data scientists today, and their roles will continue to change and expand in the future. Beyond the differences among them, there is considerable variance in the lower-order and higher-order knowledge and skills that some data science jobs require. There are also many commonalities among the varied types of data scientists. All data scientists need to learn how to tackle questions with real data. It is insufficient for them to be handed a “canned” data set and be told to analyze it using the methods that they are studying. Such an approach will not necessarily prepare them to solve more realistic and complex problems taken out of context, especially those involving large, unstructured data. Instead, they need repeated practice with the entire cycle beginning with ill-posed questions and “messy” data.²

An effective data science workflow involves formulating good questions, considering whether available data are appropriate for addressing a problem, choosing from a set of different tools, undertaking analyses in a reproducible manner, assessing analytic methods, drawing appropriate conclusions, and communicating results. Students need practice applying a unified approach to problem solving with data. Such an integrated approach needs to be introduced in their first courses and remain a consistent theme in subsequent courses. Students need to see that data science is not simply a collection of varied tools (or methods), but rather a general approach to problem solving. Many of the emergent data science programs at every academic level encourage students to assume that they will benefit from continuing professional education throughout their careers. All require that graduates have the capability to identify problems to be solved with data, determine and implement solutions, assess results, and communicate results and findings (UC Santa Cruz, 2018).

Finding 2.1: Data scientists today draw largely from extensions of the “analyst” of years past trained in traditional disciplines. As data science becomes an integral part of many industries and enriches research and development, there will be an increased demand for more holistic and more nuanced data science roles.

Finding 2.2: Data science programs that strive to meet the needs of their students will likely evolve to emphasize certain skills and capabilities. This will result in programs that prepare different types of data scientists.

Recommendation 2.1: Academic institutions should embrace data science as a vital new field that requires specifically tailored instruction delivered through majors and minors in data science as well as the development of a cadre of faculty equipped to teach in this new field.

Recommendation 2.2: Academic institutions should provide and evolve a range of educational pathways to prepare students for an array of data science roles in the workplace.

___________________

² A description of the importance of the multistep scientific process and how it relates to data analysis can be found in the Curriculum Guidelines for Undergraduate Programs in Statistical Science (ASA, 2014).

DATA ACUMEN

Data science is a complex activity that requires specific skills, such as coding in advanced computer languages, and less well-defined but equally crucial skills, including the ability to do the following:

Combine many existing programs or codes into a “workflow” that will accomplish some important task;
“Ingest,” “clean,” and then “wrangle” data into reliable and useful forms;
Think about how a data processing workflow might be affected by data issues;
Question the formulation and establishment of sound analytical methods; and
Communicate effectively about properties of computer codes, task workflows, databases, and data issues.
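To make the "ingest, clean, wrangle" vocabulary in the list above concrete, the following is a minimal Python sketch using only the standard library. The data, the column names (`city`, `temp_c`), and the three function names are hypothetical and chosen purely for illustration; dropping rows with missing values is one possible cleaning policy among several, not a recommendation from the report.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw export: inconsistent casing, stray whitespace, a missing value.
RAW = """city,temp_c
 Austin ,31.5
austin,30.5
Boston,
boston,22.0
"""

def ingest(text):
    """Ingest: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Clean: normalize city names and drop rows with a missing temperature."""
    cleaned = []
    for row in rows:
        city = row["city"].strip().title()
        temp = row["temp_c"].strip()
        if not temp:
            continue  # assumption: drop incomplete rows (imputing is another option)
        cleaned.append({"city": city, "temp_c": float(temp)})
    return cleaned

def wrangle(rows):
    """Wrangle: reshape the cleaned rows into one mean temperature per city."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["city"]].append(row["temp_c"])
    return {city: sum(values) / len(values) for city, values in groups.items()}

rows = ingest(RAW)
summary = wrangle(clean(rows))
print(len(rows), summary)
```

Keeping each stage as a separate function makes it easier to reason about how a data issue (here, the missing Boston reading) propagates through the rest of the workflow.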

Aspiring data scientists need to develop these skills in order to avoid conducting flawed or incomplete analyses.

In short, getting a useful answer from data requires many skills that are often not fully developed on their own in traditional mathematics, statistics, and computer science courses—although such fields certainly come closest today to providing mastery of the desired skill set. Donoho (2017) noted the need for data scientists who can face “essential questions of a lasting nature and [use] scientifically rigorous techniques to attack those questions.”

Students also need to learn how to ensure that outcomes are valid—extracting the right insights and having confidence that, start to finish, what one says is true, within some margins of error. Repeated exposure to the data science life cycle (i.e., posing a question; collecting, cleaning, and storing data; developing tools and algorithms; performing exploratory analysis and visualization; making inferences and predictions; making decisions; and communicating results) is needed to help hone the skills required to assess the data at hand, extract meaning from them, and communicate those findings to nonexperts. Students also need to consider the provenance of the data used.
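As one illustration of reporting a result "within some margins of error," the sketch below computes a sample mean together with an approximate 95 percent margin. It assumes independent measurements and a normal approximation (the multiplier 1.96); for a sample this small, a t-based multiplier would be more appropriate, so the margin is likely understated. The measurements and the function name `mean_with_margin` are made up for the example.

```python
import math
import statistics

def mean_with_margin(sample, z=1.96):
    """Return (mean, margin) for an approximate 95% confidence interval.

    Assumes independent observations and a normal approximation;
    z=1.96 is a simplification for small samples.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    return mean, z * std_err

# Hypothetical repeated measurements of the same quantity.
measurements = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mean, margin = mean_with_margin(measurements)
print(f"estimate: {mean:.2f} +/- {margin:.2f}")
```

Reporting the estimate and the margin together, along with the assumptions behind the margin, is one small habit that supports the kind of validity checking described above.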

Building on the work of De Veaux et al. (2017), the committee puts forth the following key concept areas for data science: mathematical foundations, computational foundations, statistical foundations, data management and curation, data description and visualization, data modeling and assessment, workflow and reproducibility, communication and teamwork, domain-specific considerations, and ethical problem solving.

Experience and facility in these and other areas are essential to building what this committee defines as “data acumen.” Some exposure to key high-level topics is needed by all students, while other students will require additional exposure or extended work to develop expertise. The process of starting students down the path toward data acumen is a chief objective of data science education.

Finding 2.3: A critical task in the education of future data scientists is to instill data acumen. This requires exposure to key concepts in data science, real-world data and problems that can reinforce the limitations of tools, and ethical considerations that permeate many applications. Key concepts involved in developing data acumen include the following:

Mathematical foundations,
Computational foundations,
Statistical foundations,
Data management and curation,
Data description and visualization,
Data modeling and assessment,
Workflow and reproducibility,
Communication and teamwork,
Domain-specific considerations, and
Ethical problem solving.

Recommendation 2.3: To prepare their graduates for this new data-driven era, academic institutions should encourage the development of a basic understanding of data science in all undergraduates.

Mathematical Foundations

Mathematics is essential for data science; however, how much and what types of mathematics are needed vary. Data scientists need to know how to test hypotheses and determine why they do or do not align to real-world problems. They need to be capable of assessing their data science models, determining when these models fail and how to make corrections that lead to scientific discovery. Tools (e.g., Wolfram Alpha) can be utilized and combined to produce an outcome (e.g., simulation or visualization) that reinforces data scientists’ computational and statistical knowledge without demanding the study of calculus in full detail (see Hardin and Horton, 2017).

New, more flexible pathways to help establish a mathematical foundation for data science are being developed. The University of Texas at Austin Dana Center’s Mathematics Pathways³ is one such program designed to increase opportunities for students across the nation through mathematics and statistics education. This program instills confidence, advocates for degree or certificate completion, and provides students with the skills and tools to apply mathematical and quantitative reasoning at home and in the workplace. The development of additional pathways to help students develop mathematical foundations would be beneficial for the field of data science.

Key mathematical concepts/skills that would be important for all students in their data science programs and critical for their success in the workforce are the following:

Set theory and basic logic,
Multivariate thinking via functions and graphical displays,
Basic probability theory and randomness,
Matrices and basic linear algebra,
Networks and graph theory, and
Optimization.

Some data scientists and programs require a deeper understanding of mathematical underpinnings. This might include the following:

Partial derivatives (to understand interactions in a model),
Advanced linear algebra (i.e., properties of matrices, eigenvalues, decompositions),
“Big O” notation and analysis of algorithms, and
Numerical methods (e.g., approximation and interpolation).
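
As a brief illustration of the first item, consider a hypothetical two-predictor model with an interaction term. The model and the symbols $\beta_0, \ldots, \beta_3$, $x_1$, and $x_2$ are assumptions chosen for this sketch and do not come from the report.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2,
\qquad
\frac{\partial \hat{y}}{\partial x_1} = \beta_1 + \beta_3 x_2,
\qquad
\frac{\partial \hat{y}}{\partial x_2} = \beta_2 + \beta_3 x_1 .
```

Whenever $\beta_3 \neq 0$, the partial derivative with respect to $x_1$ depends on $x_2$. This is one way to read an interaction: the effect of one predictor changes with the level of the other.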

While linear algebra and optimization may be particularly helpful in data science, the traditional mathematics curriculum has many courses that precede multivariate calculus and linear algebra. It may be the case that institutions need to develop a “math for data science” class⁴ to build these foundations without requiring multiple semesters of coursework. This could potentially serve as an accelerated course in relevant mathematical approaches for data science and possibly replace further coursework for some students.

___________________

³ The website for the Dana Center’s Mathematics Pathways is http://www.utdanacenter.org/higher-education/dcmp/, accessed January 18, 2018.

⁴ See Hardin and Horton (2017) for one suggested approach.

Computational Foundations

Working with data requires extensive computing skills. Data science graduates need to be proficient in many of the foundational software skills of computer science, along with the algorithmic and computational problem-solving skills associated with that discipline. A data science student needs to be prepared to work with data as they are commonly found in the workplace and research laboratories. Accessing and organizing data in databases, scraping data from websites, processing text into data that can be analyzed, ensuring secure data storage, and protecting confidentiality all require extensive computing skills. Computational problem-solving skills recur throughout the data scientist’s workflow. As Wing (2006, p. 34) noted, “Thinking like a computer scientist means more than being able to program a computer. It requires thinking at multiple levels of abstraction.”

To be prepared for careers in data science, students also need facility with professional statistical analysis software packages and an understanding of the computational and algorithmic problem-solving principles that underlie these packages.

It is also important for data science students to be aware of the state of the art of information technology and for faculty to educate these students so that their knowledge will continue to evolve accordingly. Students will also benefit from instruction in aspects of data structures, object-oriented programs, and workflow (i.e., aspects of a broader set of project management skills). The first pedagogical approach to achieving this understanding is to teach students how to think about algorithms. Students will need further skill development to be able to deepen their understanding of abstraction and be able to learn new data technologies. It is more important for students to learn how to follow the information technology frontier than to master the details of today’s architecture.

While it would be ideal for all data scientists to have extensive coursework in computer science, new pathways may be needed to establish appropriate depth in algorithmic thinking and abstraction in a streamlined manner. This might include the following:

Basic abstractions,
Algorithmic thinking,
Programming concepts,
Data structures, and
Simulations.
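
The last item in this list can be made concrete with a small sketch. The example below is an illustration chosen for this purpose, not one taken from the report: it estimates by simulation the chance that at least two people in a group share a birthday, then compares the estimate with the closed-form answer. The function names, the 365-day assumption, and the seed are all hypothetical choices.

```python
import random


def estimate_shared_birthday(group_size: int, trials: int, seed: int) -> float:
    """Estimate by simulation the chance that at least two people share a birthday."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated exactly
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(365) for _ in range(group_size)]
        # A repeated value makes the set smaller than the list.
        if len(set(birthdays)) < group_size:
            hits += 1
    return hits / trials


def exact_shared_birthday(group_size: int) -> float:
    """Closed-form answer for comparison (365 equally likely days assumed)."""
    p_all_distinct = 1.0
    for k in range(group_size):
        p_all_distinct *= (365 - k) / 365
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    print("simulated:", estimate_shared_birthday(23, 20_000, seed=1))
    print("exact:    ", exact_shared_birthday(23))
```

Writing the simulation and then checking it against a known answer exercises several of the listed skills at once: abstraction (a trial as a list of integers), a data structure chosen for the task (a set to detect duplicates), and algorithmic thinking about how many trials are enough.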

Statistical Foundations

All data scientists need to understand basic statistical concepts, practice, and theory. According to De Veaux et al. (2017, p. 20), “Students should understand the basic statistical concepts of data analysis, data collection, modeling, and inference. A sound knowledge of basic theoretical foundations will help inform their analyses and the limits to their models. Successful graduates will be able to apply statistical knowledge and computational skills to formulate problems, plan data collection campaigns or identify and gather relevant existing data, and then analyze the data to provide insights.”

To avoid drawing invalid or incorrect conclusions, data science students need to understand the concept of inference, including sampling and nonsampling errors. Owing to the nature of observational data as found artifacts (which may represent a nonrandom selection or include confounding factors), it is important for students to study confounding and causal inference early to make sense of the data around them. As a specific example, having 30 million credit card records can help identify a number of relationships in the observed data (e.g., people who shop at a particular retailer tend to exceed a certain income threshold), but those relationships will not necessarily hold in the next set of records. In addition, other measured or unmeasured factors may be important in determining causal conclusions (e.g., students could wrongly conclude that use of sunscreen is associated with skin cancer if the amount of sun exposure is not controlled for in an analysis).
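
The sunscreen example can be explored with a small simulation. The sketch below is only an illustration under invented assumptions: the exposure split, the sunscreen usage rates, the baseline risks, and the 30 percent protective effect are made-up numbers, not estimates from any study. It shows how a protective factor can look harmful when sun exposure is ignored.

```python
import random


def simulate(n: int, seed: int):
    """Simulate n people; sun exposure drives both sunscreen use and cancer risk."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        high_sun = rng.random() < 0.5
        # Assumed: people with high sun exposure use sunscreen far more often.
        sunscreen = rng.random() < (0.8 if high_sun else 0.2)
        # Assumed: exposure raises baseline risk; sunscreen lowers it by 30%.
        risk = (0.20 if high_sun else 0.02) * (0.7 if sunscreen else 1.0)
        cancer = rng.random() < risk
        rows.append((high_sun, sunscreen, cancer))
    return rows


def cancer_rate(rows, sunscreen, high_sun=None):
    """Cancer rate among users or non-users, optionally within one exposure stratum."""
    subset = [r for r in rows
              if r[1] == sunscreen and (high_sun is None or r[0] == high_sun)]
    return sum(r[2] for r in subset) / len(subset)


if __name__ == "__main__":
    rows = simulate(200_000, seed=0)
    print("ignoring exposure:", cancer_rate(rows, True), cancer_rate(rows, False))
    for level in (True, False):
        label = "high exposure" if level else "low exposure"
        print(label, cancer_rate(rows, True, level), cancer_rate(rows, False, level))
```

Under these assumptions, sunscreen users show a higher overall cancer rate than non-users, yet within each exposure stratum the users' rate is lower. Comparing like with like (stratifying on the confounder) reverses the naive conclusion.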

As for the previous areas, work is needed to identify approaches to build a strong foundation in statistics. The American Statistical Association (ASA) guidelines for undergraduate programs in statistics (ASA, 2014) discuss important considerations for educating students in statistical practice, as do the “Curriculum Guidelines for Undergraduate Programs in Data Science,” which were endorsed by the ASA (De Veaux et al., 2017). Data science students need to know about randomized trials (commonly used in businesses running A/B comparisons) but need to quickly move to approaches that are applicable for nonrandomized studies. They need repeated practice with the whole data science life cycle.

Important statistical foundations might include the following:

Variability, uncertainty, sampling error, and inference;
Multivariate thinking;
Nonsampling error, design, experiments (e.g., A/B testing), biases, confounding, and causal inference;
Exploratory data analysis;
Statistical modeling and model assessment; and
Simulations and experiments.
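
An A/B comparison combines several of these items: a randomized design, sampling variability, and inference by simulation. The following is a minimal sketch of a permutation test, one of several reasonable ways to analyze such an experiment; the function name, the default number of permutations, and the conversion counts in the usage example are hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in mean outcome between two groups."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign labels at random, as if the variant had no effect
        a, b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical outcomes: 1 = converted, 0 = did not convert.
    variant_a = [1] * 30 + [0] * 170
    variant_b = [1] * 60 + [0] * 140
    print(permutation_p_value(variant_a, variant_b))
```

The test asks how often a difference at least as large as the observed one would arise if group labels were assigned purely by chance. It relies on the randomization itself and on few distributional assumptions, which is one reason it is a common teaching example.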

Data Management and Curation

At the heart of data science is the storage, preparation, and accessing of data. It is often said that a typical data analysis project is more than 70 percent data cleaning, merging, and marshaling. With the advent of large public databases with data of all kinds ranging from governmental to genomic, there has never been a better time to teach students about the many aspects of data management. The students can directly experience the many forms in which data can be found today, from spreadsheets and text files to relational and nonrelational databases.

One way in which data scientists succeed is by providing others a very clear understanding of the details of the data that went into a project, possibly also making the data, or their derivatives, available to others. Throughout their coursework, students need to become facile with data of different types (e.g., relational, text, images).

Key data management and curation concepts/skills that would be important for all students in their data science programs and critical for their success in the workforce are the following:

Data provenance;
Data preparation, especially data cleansing and data transformation;
Data management (of a variety of data types);
Record retention policies;
Data subject privacy;
Missing and conflicting data; and
Modern databases.
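
To make "missing and conflicting data" concrete, here is a minimal sketch of reconciling two record sets that describe the same people. The source names, the field names, and the policy of leaving conflicts unresolved for human review are all assumptions made for this illustration.

```python
def reconcile(source_a: dict, source_b: dict, fields: list):
    """Merge two record sets keyed by ID, reporting missing and conflicting values."""
    merged, missing, conflicts = {}, [], []
    for key in sorted(set(source_a) | set(source_b)):
        rec_a, rec_b = source_a.get(key, {}), source_b.get(key, {})
        row = {}
        for field in fields:
            val_a, val_b = rec_a.get(field), rec_b.get(field)
            if val_a is None and val_b is None:
                missing.append((key, field))
                row[field] = None
            elif val_a is not None and val_b is not None and val_a != val_b:
                # Do not silently pick a winner; record both values for review.
                conflicts.append((key, field, val_a, val_b))
                row[field] = None
            else:
                row[field] = val_a if val_a is not None else val_b
        merged[key] = row
    return merged, missing, conflicts


if __name__ == "__main__":
    # Hypothetical inputs.
    registry = {"p1": {"age": 34, "zip": "78701"}, "p2": {"age": 51, "zip": None}}
    survey = {"p1": {"age": 35, "zip": "78701"}, "p2": {"age": 51},
              "p3": {"age": None, "zip": "10001"}}
    merged, missing, conflicts = reconcile(registry, survey, ["age", "zip"])
    print(merged)
    print("missing:", missing)
    print("conflicts:", conflicts)
```

Returning the lists of missing and conflicting cells alongside the merged data keeps a record of what was done to the data, which also supports the provenance item above.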

Data Description and Visualization

Many data scientists create value by creating “dashboards” that display some basic statistics and visualizations to monitor the contents of an evolving database or stream. In this way, they provide situational awareness for decision makers. Students who might be creators or users of such dashboards need to learn about traditional descriptive statistics for developing a feel of what is in a data set as well as about traditional graphics such as scatter and time-series plots with decorations and modifications. This will help prepare them to present data in a clear and compelling fashion. After learning how to make basic displays, students then need to be taught how to use simple graphics to check data for artifacts, snafus, and inconsistencies. Then they can start to undertake exploratory data analysis.

Data visualization is at the core of data science insight extraction, communication with others, and quality assurance. A key challenge for data scientists is to be able to tell a story with data and translate key aspects of the data science life cycle and outcomes of efforts to both users and leaders. It is crucial that data visualization training goes hand-in-hand with communication training, as a well-chosen graph can efficiently convey to others some important feature of a data set that might otherwise be very difficult to capture in words. Such visual displays help to avoid a “garbage in, garbage out” situation where important outliers or incorrectly coded data lead to misleading conclusions.

Key data description and visualization concepts/skills that would be important for all students in their data science programs and critical for their success in the workforce are the following:

Data consistency checking,
Exploratory data analysis,
Grammar of graphics,
Attractive and sound static and dynamic visualizations, and
Dashboards.
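
Data consistency checking often starts with simple descriptive statistics. The sketch below assumes a hypothetical column of ages in which placeholder codes such as 999 and -1 have slipped in, and it flags values outside the conventional 1.5 × IQR fences. Both the fence multiplier and the data are illustrative choices, not recommendations from the report.

```python
import statistics


def flag_suspect_values(values, k=1.5):
    """Return a five-number summary and the values outside the IQR fences."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    suspects = [v for v in values if v < low or v > high]
    summary = {"min": min(values), "q1": q1, "median": median,
               "q3": q3, "max": max(values)}
    return summary, suspects


if __name__ == "__main__":
    # Hypothetical ages; 999 and -1 stand in for miscoded entries.
    ages = [34, 29, 41, 38, 999, 27, 45, -1, 33, 36]
    summary, suspects = flag_suspect_values(ages)
    print(summary)
    print("suspect values:", suspects)
    print("mean:", statistics.mean(ages))
```

For these ten values the mean is 128.1 while the median is 35, so the two miscoded entries distort one summary far more than the other. A box plot or histogram of the same column would expose the problem just as quickly.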

Data Modeling and Assessment

Data scientists have a rich and growing set of models and methods at their disposal. The challenge is how to identify which models are most appropriate for a given setting and assess whether the assumptions and conditions needed to apply that method are tenable.

Key data modeling and assessment concepts/skills that would be important for all students in their data science programs and critical for their success in the workforce are the following:

Machine learning,
Multivariate modeling and supervised learning,
Dimension reduction techniques and unsupervised learning,
Deep learning,
Model assessment and sensitivity analysis, and
Model interpretation (particularly for black box models).
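
Model assessment can be illustrated with a hold-out comparison: fit on one part of the data, then measure error on the rest against a trivial baseline. The sketch below uses synthetic data and a hand-written least-squares fit. The data-generating rule, the 70/30 split, and the function names are assumptions for illustration only.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and observed values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def holdout_comparison(n=200, seed=0):
    """Compare a fitted line with a predict-the-mean baseline on held-out data."""
    rng = random.Random(seed)
    # Synthetic data (assumed for illustration): y = 1 + 2x plus Gaussian noise.
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
    split = int(0.7 * n)  # first 70% for fitting, remaining 30% for assessment
    train_x, test_x = xs[:split], xs[split:]
    train_y, test_y = ys[:split], ys[split:]

    intercept, slope = fit_line(train_x, train_y)
    model_mse = mean_squared_error([intercept + slope * x for x in test_x], test_y)
    baseline = sum(train_y) / len(train_y)
    baseline_mse = mean_squared_error([baseline] * len(test_y), test_y)
    return model_mse, baseline_mse, (intercept, slope)


if __name__ == "__main__":
    model_mse, baseline_mse, coefficients = holdout_comparison()
    print("model MSE:", model_mse, "baseline MSE:", baseline_mse)
    print("intercept, slope:", coefficients)
```

Reporting error on data the model never saw, next to a baseline, guards against two common mistakes: judging a model by its fit to the training data and accepting an error figure with no point of comparison.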

Workflow and Reproducibility

Modern data science has at its core the creation of workflows—pipelines of processes that combine simpler tools to solve larger tasks. Documenting, incrementally improving, sharing, and generalizing such workflows are an important part of data science practice owing to the team nature of data science and broader significance of scientific reproducibility and replicability. Documenting and sharing workflows enable others to understand how data have been used and refined and what steps were taken in an analysis process. This can increase the confidence in results and improve trust in the process as well as enable reuse of analyses or results in a meaningful way.

Students need to be exposed to the concept of workflows and gain experience constructing them. Understanding the end-to-end structure of a workflow and being able to describe and document it are important. Students need to learn about software systems that enable building workflows (e.g., R and Python) and how to document what they do (e.g., R Markdown and Jupyter Notebook). Studying the end-to-end properties of workflows and then improving them incrementally, on the basis of evidence, matters as well. Students need to learn about such practices and how to carry them out autonomously.

Providing experiential learning at multiple time points is important as students learn workflow processes and practice implementing and documenting steps within a workflow. Students need practice developing a unified approach to analysis and integration of multiple methods applied to data sets in an iterative manner. Project management could be integrated into capstone experiences. Longer-term projects involving interim reports and evaluation are critical.

Key workflow and reproducibility concepts/skills that would be important for all students in their data science programs and critical for their success in the workforce are the following:

Workflows and workflow systems,
Documentation and code standards,
Source code (version) control systems,
Reproducible analysis, and
Collaboration.
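
A reproducible analysis can be sketched as a small pipeline whose steps are named, whose randomness is seeded, and whose output is fingerprinted so that two runs can be compared. Everything below (the step functions, the cleaning thresholds, and the use of a SHA-256 digest as the fingerprint) is an illustrative assumption rather than a prescribed practice.

```python
import hashlib
import json
import random


def generate(seed, n=100):
    """Stand-in for data acquisition; seeded so reruns give identical data."""
    rng = random.Random(seed)
    return [rng.gauss(50, 10) for _ in range(n)]


def clean(values):
    """Drop values outside an assumed plausible range."""
    return [v for v in values if 20 <= v <= 80]


def summarize(values):
    """Reduce the cleaned data to a small, reportable result."""
    return {"n": len(values), "mean": round(sum(values) / len(values), 6)}


def run_workflow(seed):
    """Run each step in order and return the result with a digest of the run."""
    result = generate(seed)
    log = ["generate"]
    for step in (clean, summarize):
        result = step(result)
        log.append(step.__name__)  # record what was done, in order
    payload = json.dumps({"seed": seed, "steps": log, "result": result}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return result, digest


if __name__ == "__main__":
    print(run_workflow(42))
    print(run_workflow(42) == run_workflow(42))  # same seed, same result and digest
```

Because the digest covers the seed, the step names, and the result, a collaborator who reruns the workflow can confirm agreement by comparing one short string. Version control, listed above, serves the same purpose for the code itself.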

Communication and Teamwork

One major distinguishing attribute of the work of data scientists centers on their capacity to frame research questions well and then communicate the findings in writing, in graphical form, and in conversation. In many cases, this involves coordinating among multidisciplinary actors, translating the interests of various parties, and then synthesizing the findings for nonexpert audiences. This requires competency in statistics, computer science, mathematics, coding, and domain-specific interests. Graduates also need to write clearly, speak articulately, construct effective visual displays and compelling written summaries, and communicate complex data science results in basic terms to various stakeholders.

The development of responsible oral and written communication skills is also essential for productive collaboration in the classroom and in the workplace. The ability to work well in multidisciplinary teams is a key component of data science education that is highly valued by industry, as teams of individuals with particular skill sets each play a critical role in producing data products. Multidisciplinary collaboration provides students with the opportunity to use creative problem solving and to refine leadership skills, both of which are essential for future project organization and management experiences in the workplace. Multidisciplinary teamwork also emphasizes inclusion and encourages diversity of thought in approaching data science problems.

Key communication and teamwork concepts/skills that would be important for all students in their data science programs and critical for their success in the workforce are the following:

Ability to understand client needs,
Clear and comprehensive reporting,
Conflict resolution skills,
Well-structured technical writing without jargon, and
Effective presentation skills.

Domain-Specific Considerations

Effective application of data science to a domain requires knowledge of that domain. Grounding data science instruction in substantive contextual examples (which will require the development of judgment and background in those areas) will help ensure that data scientists develop the capacity to pose and answer questions with data. Reinforcing skills and capacities developed in data science courses in the context of a specific domain will help students see the entire data science process. This might include completion of a track in a domain area, specialized connector courses that link data science concepts directly to students’ fields of interest to build data science skills in context, a minor in a domain area, or a co-major or double major in an application area. Hopefully, such an interconnected appreciation for a domain and for data science methods will generalize to other domains and applications.

Students who have completed the course Data 8: Foundations of Data Science at the University of California, Berkeley, for example, have an opportunity to enroll in specialized connector courses that are offered by a variety of academic departments. Examples of these connector courses include Data Science for Smart Cities, Making Sense of Cultural Data, Data Science and the Mind, and Data Science, Demography, and Immigration.⁵ Students at the University of Illinois, Urbana-Champaign, also have an opportunity to integrate their domain knowledge with data science concepts in the CS+X degree program. In this bachelor’s degree program, students dedicate half of their coursework to the study of computer science and the other half to a specific discipline; current selections include mathematics, statistics, anthropology, astronomy, chemistry, linguistics, music, philosophy, geoscience, crop science, and advertising.⁶ Such approaches reinforce the integrative nature of data science, offering a comprehensive educational experience that better prepares students for the future workforce. The committee anticipates that the demand for interdisciplinary experiences will increase as the field of data science continues to evolve. Additional interdisciplinary pairings (such as English and data science to prepare future data journalists) are likely to emerge.

___________________

⁵ The website for the connector curriculum is https://data.berkeley.edu/education/connectors, accessed February 20, 2018.

Ethical Problem Solving

As powerful analytical tools are growing to meet new possibilities of collecting data, students need to be aware of ethical challenges that can emerge. With this proliferation of data and advancement of innovation, data science practitioners may often be confronted with decisions about whether they should take certain actions just because they have the ability and tools to do so. The explosion of data potentially raises the possibilities of new intrusions and interventions in people’s lives and other previously “safe” and protected places. The misuse of data can pierce basic human dignities or thwart human agency and autonomy. Students working with data need to know the ways in which their findings might compromise people’s dignity and their identities. Most disturbing of all, data can be misused in ways that are socially unjust. Students also need to be aware of legal requirements aimed at protecting individuals’ privacy such as the European Union General Data Protection Regulation, which aims to increase the rights of data subjects and provides penalties for individuals or organizations that violate them.⁷

Ethical considerations, in other words, lie at the heart of data science. Unique ethical considerations arise in each step of and throughout the data science life cycle (i.e., when posing a question; collecting, cleaning, and storing data; developing tools and algorithms; performing exploratory analysis and visualization; making inferences and predictions; making decisions; and communicating results). Stand-alone courses on ethics could help students learn what intelligent systems and the tools of data science can and cannot do. It is important to emphasize to students that this is not simply a case of “do it like it is done today”; it is a case where ongoing improvement and elevation of standards are needed. Beyond the stand-alone ethics course, students stand to develop a deeper understanding of the role that ethics plays throughout the study and practice of the data science life cycle if ethical principles are incorporated into most of the courses in the data science curriculum.

___________________

⁶ The website for the CS+X program is https://cs.illinois.edu/academics/undergraduate/degree-program-options/cs-x-degree-programs, accessed February 20, 2018.

⁷ The website for the European Union General Data Protection Regulation is https://www.eugdpr.org/, accessed March 29, 2018.

Case studies may be an especially effective approach. For example, case studies could be used to show how vulnerable people can be exploited by means of their medical or behavioral data being shared. Through these case studies, students could begin to develop a sense of awareness of the potential impacts on inequality of the misuse of data. In addition to learning about standards for responsible behavior through such case studies, students would also benefit from instruction in developing specific skills to navigate the challenging ethical problems with which data scientists struggle.

Key aspects of ethics needed for all data scientists (and for that matter, all educated citizens) include the following:

Ethical precepts for data science and codes of conduct,
Privacy and confidentiality,
Responsible conduct of research,
Ability to identify “junk” science, and
Ability to detect algorithmic bias.
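
One simple starting point for detecting algorithmic bias is to compare how often a system returns a favorable decision for different groups. The sketch below computes per-group selection rates and two disparity measures. The group labels, the decision data, and the choice of these particular measures are assumptions for illustration; a gap in selection rates is a signal for further investigation, not proof of unfairness on its own.

```python
from collections import defaultdict


def selection_rates(records):
    """Share of positive decisions per group; records are (group, decision) pairs."""
    totals, positives = defaultdict(int), defaultdict(int)
    for group, decision in records:
        totals[group] += 1
        positives[group] += int(decision)
    return {group: positives[group] / totals[group] for group in totals}


def disparity(rates):
    """Difference and ratio between the lowest and highest group selection rates."""
    low, high = min(rates.values()), max(rates.values())
    ratio = low / high if high else float("nan")
    return {"difference": high - low, "ratio": ratio}


if __name__ == "__main__":
    # Hypothetical decisions: 1 = approved, 0 = declined.
    decisions = ([("A", 1)] * 60 + [("A", 0)] * 40
                 + [("B", 1)] * 30 + [("B", 0)] * 70)
    rates = selection_rates(decisions)
    print(rates)
    print(disparity(rates))
```

What counts as an acceptable gap depends on the context and on which notion of fairness applies, which is why this skill sits alongside the ethical precepts and codes of conduct in the list above.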

Recommendation 2.4: Ethics is a topic that, given the nature of data science, students should learn and practice throughout their education. Academic institutions should ensure that ethics is woven into the data science curriculum from the beginning and throughout.

A CODE OF ETHICS FOR DATA SCIENCE

Other disciplines have benefited from publishing specific ethical guidelines by which their members agree to conduct themselves. Practitioners in the fields of medicine and engineering have long traditions of similar ethical guidelines. The AMA Code of Medical Ethics of the American Medical Association includes guidance on interactions between medical professionals and their patients; use of medical treatments, including those that rely on new technologies; and “self-regulation” within the workplace (AMA, 2016). The “IEEE Code of Ethics” of the Institute of Electrical and Electronics Engineers encourages engineering professionals to prioritize the safety of the public, avoid or disclose conflicts of interest, present evidence-based claims, and maintain appropriate technical qualifications, for example (IEEE, 2017).

Considerable work in the study of ethical decision making for scientists has been undertaken by the Association for Computing Machinery (ACM) and the ASA. The 2018 ACM Code of Ethics and Professional Conduct: Draft 3 (an update to the ACM’s 1992 code of ethics) includes principles for moral conduct in addition to leadership guidelines for computing professionals acting in the interest of the public good (ACM, 2018). The ASA’s Ethical Guidelines for Statistical Practice presents guidelines pertaining to integrity and accountability in statistical work. It also details the various ethical responsibilities that statisticians have toward their research subjects, clients, employers, and colleagues (ASA, 2016).

These rules of conduct uphold specific ethical standards for professionals whose activities and practice can significantly impact the health and well-being of people, society, and their profession. As an emerging discipline, data science could benefit from having its own ethical standards of conduct. There are many areas specific to data science that could be addressed, including the responsibility to protect privacy of personal data, the responsibility to not misrepresent the data for personal gain, the responsibility to ensure fairness in the use of machine learning algorithms and choice of training data, and the responsibility to ensure that results produced by the analyst are reproducible.

Given the sensitive nature of certain types of data and the significant ethical implications of working with such data, efforts to establish a code of ethics for data scientists are under way throughout the field.⁸ Data science ethics might be codified in an “oath” similar to the Hippocratic Oath taken by physicians as a way to crystallize what is being asked of them. Although the specific content and form of an oath may be controversial, it can also underline the importance of the commitment being made. A draft version of such an oath was presented in the interim report from this committee, and a revised version appears in Appendix D of this report. The potential consequences of the ethical implications of data science cannot be overstated.

___________________

⁸ To read about other work in the development of data science codes of ethics, see, for example, https://datapractices.org/community-principles-on-ethical-data-sharing/, http://datafordemocracy.org/projects/ethics.html, http://www.datascienceassn.org/code-of-conduct.html, http://www.rosebt.com/blog/open-for-comment-proposed-data-science-code-of-professional-conduct, https://dssg.uchicago.edu/2015/09/18/an-ethical-checklist-for-data-science/, http://thedataist.com/a-proposal-for-data-science-ethics/, https://www.accenture.com/t20160629T012639Z__w__/us-en/_acnmedia/PDF-24/Accenture-Universal-Principles-Data-Ethics.pdf, accessed January 31, 2018.

Recommendation 2.5: The data science community should adopt a code of ethics; such a code should be affirmed by members of professional societies, included in professional development programs and curricula, and conveyed through educational programs. The code should be reevaluated often in light of new developments.

REFERENCES

ACM (Association for Computing Machinery). 2018. 2018 ACM Code of Ethics and Professional Conduct: Draft 3. https://ethics.acm.org/2018-code-draft-3. Accessed February 6, 2018.

AMA (American Medical Association). 2016. AMA Code of Medical Ethics. https://www.ama-assn.org/delivering-care/ama-code-medical-ethics. Accessed February 12, 2018.

ASA (American Statistical Association). 2014. Curriculum Guidelines for Undergraduate Programs in Statistical Science. http://www.amstat.org/asa/files/pdfs/EDUguidelines2014-11-15.pdf.

ASA. 2015. ASA Statement on the Role of Statistics in Data Science. http://ww2.amstat.org/misc/DataScienceStatement.pdf.

ASA. 2016. Ethical Guidelines for Statistical Practice. http://www.amstat.org/asa/files/pdfs/EthicalGuidelines.pdf.

BHEF and PwC (Business-Higher Education Forum and PricewaterhouseCoopers). 2017. Investing in America’s Data Science and Analytics Talent: The Case for Action. http://www.bhef.com/sites/default/files/bhef_2017_investing_in_dsa.pdf.

Butler, D. 2013. When Google got flu wrong. Nature 494:155-156.

Chen, H., R.H.L. Chiang, and V.C. Storey. 2012. Business intelligence and analytics: From big data to big impact. MIS Quarterly 36(4):1165-1188.

Chin, J., and L. Lin. 2017. China’s all-seeing surveillance state is reading its citizens’ faces. Wall Street Journal, June 26. https://www.wsj.com/articles/the-all-seeing-surveillance-state-feared-in-the-west-is-a-reality-in-china-1498493020.

Codella, N.C.F., Q.B. Nguyen, S. Pankanti, D. Gutman, B. Helba, A. Halpern, and J.R. Smith. 2017. Deep learning ensembles for melanoma recognition in dermoscopy images. IBM Journal of Research and Development 61(4):5.1-5.15.

Columbus, L. 2017. IBM predicts demand for data scientists will soar 28% by 2020. Forbes, May 13.

CRA (Computing Research Association). 2016. Computing Research and the Emerging Field of Data Science. https://cra.org/wp-content/uploads/2016/10/Computing-Research-and-the-Emerging-Field-of-Data-Science.pdf.

Danyllo, W.A., V.B. Alisson, N.D. Alexandre, LM.J. Moacir, B.P. Jansepetrus, and R.F. Oliveira. 2013. “Identifying Relevant Users and Groups in the Context of Credit Analysis Based on Data from Twitter.” Paper presented at the 2013 IEEE Third International Conference on Cloud and Green Computing, September/October, Karlsruhe, Germany.

De Veaux, R., M. Agarwal, M. Averett, B.S. Baumer, A. Bray, T.C. Bressoud, L. Bryant, et al. 2017. Curriculum guidelines for undergraduate programs in data science. Annual Review of Statistics and Its Application 4:15-30.

Donoho, D. 2017. 50 years of data science. Journal of Computational and Graphical Statistics 26(4):745-766.

Ernst and Young. 2017. “Data and Advanced Analytics: High Stakes, High Rewards.” Forbes Insights, February. https://www.forbes.com/forbesinsights/ey_data_analytics_2017/. Accessed February 13, 2018.

Hardin, J.S., and N.J. Horton. 2017. Ensuring that mathematics is relevant in a world of data science. Notices of the AMS 64(9):986-990. https://www.ams.org/publications/journals/notices/201709/rnoti-p986.pdf.

Hvistendahl, M. 2016. Can “predictive policing” prevent crime before it happens? Science, October 5. http://www.sciencemag.org/news/2016/09/can-predictive-policing-prevent-crime-it-happens.

IEEE (Institute of Electrical and Electronics Engineers). 2017. “IEEE Code of Ethics.” https://www.ieee.org/about/corporate/governance/p7-8.html. Accessed February 12, 2018.

Jordan, M. 2013. On statistics, computation and scalability. Bernoulli 19(4):1378-1390.

Kitchin, R. 2014. The real-time city? Big data and smart urbanism. GeoJournal 79:1-14.

Markow, S., S. Braganza, B. Taska, S. Miller, and D. Hughes. 2017. The Quant Crunch: How the Demand for Data Science Skills Is Disrupting the Job Market. https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=IML14576USEN&. Accessed June 21, 2017.

NRC (National Research Council). 2013. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press.

NRC. 2014. Training Students to Extract Value from Big Data: Summary of a Workshop. Washington, DC: The National Academies Press.

Pratt, M.K. 2016. Big data’s big role in humanitarian aid. Computer World, February 8. http://www.computerworld.com/article/3027117/big-data/big-datas-big-role-in-humanitarian-aid.html. Accessed June 21, 2017.

UC Santa Cruz (University of California, Santa Cruz). 2018. “Program Learning Outcomes: Programs, Curriculum Alignment, and Assessment Plans. Jack Baskin School of Engineering.” https://www.soe.ucsc.edu/departments/computer-science/programlearning-outcomes. Accessed January 18, 2018.

Wing, J.M. 2006. Computational thinking. Communications of the ACM 49(3):33-35.