Expanding data science training has the potential to transform scientific discovery, other academic research, many professions, and the broader society. With the onset of these new technologies, modes of thought, and means of communication, questions arise among industries, government agencies, and educational institutions: What skills are needed to be successful in the workplace and in society? Is data science a fundamental skill that all students should have some exposure to? How can data literacy be improved? In what skills, methods, and technologies should future data scientists be trained, given the wide variety of potential applications? Understanding the complexities of these questions is a first step in imagining the discipline for a diverse set of participants.
This chapter discusses some of the foundational, translational, ethical, and professional skills that make it possible for students to be effective data scientists.
What are the key ideas and principles to be included in the data science curriculum? One way to determine this is to consider the typical work cycle in which data scientists engage. For example, this cycle is often initiated with a domain-specific question that then leads to data collection. The data are typically curated, described, and modeled. Models are tested and deployed, then the results are put to use and communicated to stakeholders. There are potentially several phases of analysis within this work cycle that lead to other questions and deeper understandings. Additionally, data scientists will need to draw on computational, statistical, and mathematical knowledge, as well as domain-specific knowledge, to inform the analytic choices and interpretations made throughout the workflow.
A simplified description of a data scientist’s work cycle helps illuminate the essential components of a data science curriculum. For example, De Veaux et al. (2017) describe the following six areas of focus for data science curricula that map well to this workflow: (1) data description and curation, (2) mathematical foundations, (3) computational thinking, (4) statistical thinking, (5) data modeling, and (6) communication, reproducibility, and ethics. Given different definitions of computational thinking, it may not be evident that De Veaux’s six areas are intended to encompass not only basic computing concepts (such as abstraction and indirection) but also the array of computing skills required to manage data. Therefore, the committee suggests adding computing skills as a seventh important area.
Specific topics within these seven focus areas might include the following: software engineering, linear algebra, optimization, algorithms and data structures, information technology, basic statistics, uncertainty quantification, and tools for fitting models to data. Human–computer interaction research may also play a role in the foundational data science curriculum as it examines the design and use of computer technology focused on the interfaces between people and computers, including the range of ways in which humans are integrated with computational processes, data collection, dissemination, and analysis.
The path from research question to analysis has changed with the advent of data science. In the past, data were typically collected for a specific purpose and via a particular design to answer a priori research questions. These data, in turn, informed the statistical analyses. Increasingly, analyses utilize extant data, repurposing them to answer new questions and explore new hypotheses (Groves, 2011). As more of these existing data sets become accessible (e.g., via application programming interfaces), one core question is how to extract knowledge and insight from data that were collected for an entirely different purpose and, consequently, with little forethought to the design necessary for answering the new questions.
It might be possible to take a piecemeal approach to the data science curriculum in which courses are selected from existing departments, and such a selection might look reasonable as a curricular whole. In reality, however, such a curriculum will almost certainly lack educational and cross-disciplinary cohesion unless there is coordination across the departments and courses.
In developing an undergraduate data science curriculum, it is important to evaluate how particular topics and skill sets will both fulfill program requirements and prepare students to address data challenges they will face in their careers. Training in describing and documenting models, analyses, and value propositions effectively will benefit students preparing for a wide variety of data science careers. Although a deep theoretical foundation is less necessary for students pursuing data science positions after earning an undergraduate degree, an emphasis on developing sophisticated techniques and complex modeling skills is still valuable to help solve real-world data science problems. With more data and more complicated models available, interpretability of models is important in both data science education and practice, as is fairness in algorithms and computation-driven decisions. Many academic researchers, businesses, and government agencies that are hiring new employees value graduates with expertise in survey methodology and designed data, elements of statistical learning that serve as the framework for machine learning, Bayesian data analysis, and implementations of reproducible research.
Developing and applying these skill sets requires “data acumen”—the ability to make good judgments and decisions with data. This trait is increasingly important, especially given the large volume of data typically present in real-world problems, the relative ease of (mis)applying tools, and the vast ethical implications inherent in many data science analyses. With large volumes of data, it can be difficult to understand at first glance what is needed, what is possible, and what limitations exist. Still, questions remain as to how to most effectively build data acumen in students. Data acumen can be developed over time through research experience, industry partnerships, courses in creative data analysis, domain-specific data science courses or experiences, and extensions of capstone-like experiences throughout the curriculum. It can be enhanced through exposure to current key components of data science, including mathematical foundations, computational thinking, statistical thinking, data management, data description and curation, data modeling, ethical problem solving, communication and reproducibility, and domain-specific considerations. Similar to the concept of “mathematical maturity,” which typically denotes a mixture of mathematical insights and experiences that mathematicians develop and strengthen with time, data acumen is not a final state to be reached but rather a skill that data scientists develop and refine over time.
Finding 2.1: A critical component of data science education is to guide students to develop data acumen. This requires exposure to key concepts in data science, real-world data and problems that can reinforce the limitations of tools, and ethical considerations that permeate many applications. Key concepts related to developing data acumen include the following:
- Mathematical foundations,
- Computational thinking,
- Statistical thinking,
- Data management,
- Data description and curation,
- Data modeling,
- Ethical problem solving,
- Communication and reproducibility, and
- Domain-specific considerations.
The necessary level of exposure to each area will vary based on the overall objectives and duration of the data science program as well as the goals for the students.
As is true with curriculum design in other academic settings, it is important to build a curriculum in which students can recognize when they do not know something so that they learn which questions to ask. It is also important for students to learn to question what they may already perceive as fact, to understand the risks associated with using data, and to recognize that data are a product of the specific context in which they are generated, collected, analyzed, and interpreted. Incorporating insights from the humanities and social sciences into data science curricula enables students to consider how behaviors, interactions, and attitudes shape data in an informed and grounded way.
Lessons learned and effective practices from other domains can help shape how data science is taught—including co-curricular activities (such as mentorship programs), individualized advising, supplemental opportunities for students to learn fundamentals (such as summer bridge programs), and introductory courses designed to appeal to a wide student audience. High-impact educational practices, such as those put forth by the Association of American Colleges and Universities, describe teaching and learning practices that have been shown to be beneficial for postsecondary students from many backgrounds. These practices take many different forms, depending on learner characteristics and on institutional priorities and contexts, but could be useful in ensuring that data science education is effective. Box 2.1 highlights the educational practices put forth by the Association of American Colleges and Universities and provides some examples of how they are currently being applied to data science education.
In addition to developing foundational skills, it is valuable for data science students to apply techniques and technologies learned in the classroom or laboratory to specific and different situations in practice. In other words, educators would train students to do data science in real application contexts, incorporating real data, broad impact applications, and commonly deployed methods, as well as working in teams. Students benefit from experiences such as carrying out sentiment analysis of texts, generating interactive maps to explore spatial data, assessing relationships between links within social networks,
drawing samples from a distribution, visualizing multidimensional data to draw conclusions, and making decisions using data from a variety of sources and domains.
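As a minimal illustration of one such exercise, drawing samples from a distribution and summarizing them, a student might write something like the following sketch. It uses only the Python standard library, and the distribution parameters (mean 100, standard deviation 15) are purely illustrative:

```python
import random
import statistics

random.seed(42)  # fix the seed so the exercise is reproducible

# Draw 10,000 samples from an illustrative normal distribution.
samples = [random.gauss(100, 15) for _ in range(10_000)]

# Summarize: students compare the sample statistics with the
# parameters used to generate the data.
print(f"mean:   {statistics.mean(samples):.2f}")
print(f"stdev:  {statistics.stdev(samples):.2f}")
print(f"median: {statistics.median(samples):.2f}")
```

Even an exercise this small touches several curricular threads at once: reproducibility (the fixed seed), statistical thinking (sampling variability around the known parameters), and communication (choosing which summaries to report).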
It is useful for students to learn how to translate understandings across domains and think critically about assertions and also to appreciate the importance of reusing and sharing data. As an example, consider the insights that can be found from digitizing an entomological collection. While the images originally might have been thought to be important only for understanding insects, later study of the pollen on the legs of the insects could yield unique insights into the changing nature of ecological systems with local climate change over more than a century. There is great potential for unanticipated ancillary derivatives from the data generation and integration process, but only if the right foundational knowledge is instilled about how one can and should combine sources and share findings in a reproducible workflow for others to interrogate.
Students also benefit from experiences with integrating diverse data and accounting for outside factors. For example, randomized trials—where a treatment is applied to a randomly selected subgroup of a population and then the outcomes of the full group are tracked—are the gold standard of evidence-based practices. However, conducting and generalizing from randomized trials can be difficult in many settings. Since most trials have issues with adherence, compliance, and nonresponse, it is important to account for post-randomization factors. Similarly, survey methods that undergird large surveys and censuses, such as those undertaken by the Census Bureau and Bureau of Labor Statistics, have contributed greatly to understanding and decision making about our society and economy. Today, these surveys can be enhanced and complemented—but not replaced—by fusing data that do not arise from a well-characterized sampling frame and design with data that were carefully sampled and vetted. While this integration may not be straightforward, it has the potential to extend the reach of ongoing surveys and to answer new questions in different ways.
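The effect of post-randomization factors such as noncompliance can be demonstrated with a short simulation. The sketch below (hypothetical parameters; a real analysis would use observed trial data and formal estimators) shows how an intention-to-treat estimate is diluted when a share of the treated group does not actually take the treatment:

```python
import random

random.seed(0)
N = 20_000
TRUE_EFFECT = 2.0   # illustrative treatment effect on the outcome
COMPLIANCE = 0.7    # 30% of those assigned to treatment do not comply

assigned, outcomes = [], []
for _ in range(N):
    gets_treatment = random.random() < 0.5       # random assignment
    complies = random.random() < COMPLIANCE
    outcome = random.gauss(0, 1)                 # baseline noise
    if gets_treatment and complies:
        outcome += TRUE_EFFECT                   # effect only if actually treated
    assigned.append(gets_treatment)
    outcomes.append(outcome)

treated = [y for a, y in zip(assigned, outcomes) if a]
control = [y for a, y in zip(assigned, outcomes) if not a]

# Intention-to-treat: difference in means by *assignment*, which is
# pulled toward TRUE_EFFECT * COMPLIANCE by noncompliance.
itt = sum(treated) / len(treated) - sum(control) / len(control)
print(f"intention-to-treat estimate: {itt:.2f} (true effect: {TRUE_EFFECT})")
```

Students running this see the estimate land near 1.4 rather than 2.0, which motivates the need to account for post-randomization factors before generalizing from a trial.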
A major challenge relates to how results that are demonstrated using unstructured data are verified. This struggle is analogous to that of research environments adapting to the arrival of a new instrument. For example, the microscope, telescope, and genetic sequencing machines have all allowed researchers to resolve something previously unresolvable. Scientific cultures have to grapple with what to do when observations are unprecedented and when old methods are insufficient to determine whether the new instruments are accurate or not. The development of new data science methods to address data challenges and undertake new analyses will require similar adaptation. This analogy may also apply to the expanding computational infrastructure and capacity, which will continue to provide the opportunities to carry out analyses that were previously infeasible.
It is also important to better understand what types of questions and information are amenable to data science approaches. For example, understanding and conveying what computational approaches have produced and why is an important skill. New frameworks and models for carefully constructed distillations to be communicated are needed, along with reports that can be validated, reproduced, and assessed. Similar to the “John Henry” folklore1 that pitted man against machine, data science is furthering consideration of where the edges of human abilities are, what can be measured, and what can be analyzed.
Educational systems and structures need to prepare students to inhabit a world that will have different tools than those currently available. Students develop judgment through the practice of working through the entire data science cycle. They benefit from opportunities to gradually build large systems by composing smaller systems where the behavior of the smaller systems is better understood. While some curricula offer these opportunities in the form of a capstone course, internship or externship opportunities, or similar integrative experience, it is beneficial for additional complementary experiences to be provided earlier in the curriculum.
Finding 2.2: It is important for data science education to incorporate real data, broad impact applications, and commonly deployed methods.
In addition to the foundational and translational skills training that students receive, they would also benefit from a better understanding of ethics and social context of data (O’Neil, 2016; De Veaux et al., 2017). Several ethical considerations and corresponding examples that could be discussed include the following:
- Fairness. This multifaceted consideration can be summarized as the ability of data science techniques to treat all people equitably and avoid biases that may be inherent in training data sets. This is an especially important factor for applications that directly affect individuals, such as in the design of criminal justice models that determine sentencing practices without introducing racial or socioeconomic bias (Berk et al., 2017).
- Validity. Before data science methodologies can be applied, it is vital to ensure that the data set contains valid (e.g., accurate and relevant) information. Use of data that are falsified, not current, incomplete, from an unbalanced sample, biased in terms of survivorship, or not measuring the appropriate factors could lead to faulty conclusions, such as inaccurate estimates for health care needs in a given area based on outdated survey information (IOM, 2009).
- Data context. Similar to validity, it is important for individuals to understand the context of data sets before they are processed and analyzed. Knowing where, when, and how data were collected could lead to important insights that aid in analysis, detect inherent biases, and mitigate risk to individuals whose information is contained within the data sets.
- Data confidence. Recognition of the limitations of data science is important for avoiding overconfidence and the inclination to draw stronger-than-appropriate conclusions. This “data hubris” could be detrimental, for example, if too much confidence is invested in a data science model that makes stock market predictions for investments without any consideration of model limitations (Zacharakis and Shepherd, 2001).
- Stewardship. In data science, stewardship refers to the supervision of a data set at all stages of its existence, including collection, storage, and analysis. This facilitates protection of individuals whose information is within the data set, including considerations of intellectual property rights or cybersecurity risk.
- Privacy. Considerations regarding individuals’ privacy with respect to how data are collected and analyzed arise in many disciplines. Further information on privacy concerns in terms of data science is discussed later in this section.
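Several of these considerations can be made concrete in the classroom. For example, a first pass at the fairness concern above, checking whether a classifier's favorable predictions occur at similar rates across groups, might be sketched as follows. The data, group labels, and tolerance are entirely hypothetical, and real fairness auditing involves many more criteria than this single demographic-parity gap:

```python
from collections import defaultdict

# Hypothetical (group, prediction) pairs from some classifier;
# prediction 1 represents the favorable outcome.
predictions = [
    ("group_a", 1), ("group_a", 1), ("group_a", 0), ("group_a", 1),
    ("group_b", 0), ("group_b", 1), ("group_b", 0), ("group_b", 0),
]

counts = defaultdict(lambda: [0, 0])  # group -> [favorable, total]
for group, pred in predictions:
    counts[group][0] += pred
    counts[group][1] += 1

rates = {g: fav / total for g, (fav, total) in counts.items()}
for group, rate in sorted(rates.items()):
    print(f"{group}: favorable-prediction rate {rate:.2f}")

# Flag the gap if it exceeds an illustrative tolerance of 0.2.
gap = max(rates.values()) - min(rates.values())
print(f"parity gap: {gap:.2f}" + ("  <- review for bias" if gap > 0.2 else ""))
```

An exercise like this invites the follow-up questions the ethics discussion requires: whether the gap reflects the training data, whether equal rates are even the right criterion for this application, and what harms a flagged disparity would cause.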
By developing a truly integrative curriculum, it is possible to explore how society is affected by and reflected in data. Students would learn the importance of asking the following sequence of questions frequently, consistently, and thoughtfully: Whose data are being collected, by whom, for what purposes, and with what possible implications? Data are often representations (and simplifications) of the lives of people; this point could be integrated into every phase of a data science curriculum. Ethics plays a central role as students learn to problem-solve with data. An important lesson for students to learn is that transparency, trust-building, and validation/replication are key concepts; reputable data scientists are able to show why they do their work, explain the benefits that will emerge from it, and characterize and communicate the limitations of that work.
The trade-offs related to privacy play a key role in this discussion as well. A question arises about whether people and their information are “public by default” (Boyd, 2010), because levels of involvement are different and ever-changing for each individual. Moreover, data from multiple sources can be
combined, enabling an even richer and more intimate understanding of subjects. For example, with widespread adoption of mobile devices, data scientists may have access to detailed information about a person's location over time, which, in combination with other data, may reveal far more than the individual intended to share. Individuals appreciate the ability to make choices about and have control over their information; they want secure data sharing, clear disclosure mechanisms, and a process to seek redress for damages due to data breaches. All of these real-world problems could provide robust content in a data science curriculum.
An example of a course that integrates a study of data with a study of social context is “Data in Social Context”2 at Virginia Tech. This course promotes dual literacy (i.e., humanities skills for data analytics students and data analytics skills for humanities students), explores why people turn to data to explain historical phenomena, and shows students a different way to approach questions with accessible tools and data. It highlights how valuable social context is in data analytics; data are filled with narratives, and questions often arise about ethics, probability, and bias.
Finding 2.3: Incorporating ethics into an undergraduate data science program provides students with valuable skills that can be applied to complex, human-centered questions across disciplines.
Broad professional skills are particularly critical in data science (BHEW, 2017; Hicks and Irizarry, 2017). Industry partners relate that desirable characteristics include the ability to state goals clearly, to validate solutions, and to communicate with both technical and nontechnical audiences. Communication, both written and verbal, plays a significant role in data science because of diverse application areas, interdisciplinary research groups, and the ubiquity of data spanning many fields and being produced by many people. Conveying information to diverse audiences, expressing nuance regarding evidence in the presence of uncertainty, communicating limitations of analyses, and ensuring that what is conveyed is a faithful and honest representation of the data are all essential to data science.
Communication skills can be strengthened through practice with communicating various types of information to diverse audiences, such as a course or experiences that emphasize public speaking as well as technical and nontechnical writing. Communication can also be strengthened by improved understanding of diverse audiences. For example, what aspects of the data science process and results would domain scientists need to know to further their research, versus what would managers need to know to make relevant business decisions, versus what would policy makers need to know to make sound policies? These courses could also include a section on effective data visualization and its benefits, especially when explaining best communication practices for an audience from a nontechnical background.
The ability to work well in multidisciplinary teams is also important to data science and highly valued by industry. Multidisciplinary teamwork offers students the opportunity to use creative problem solving and refine leadership skills, and allows for diverse perspectives when tackling data science problems.
Finding 2.4: Strong oral and written communication skills and the ability to work well in multidisciplinary teams are critical to students’ success in data science.