Suggested Citation:"2 Acquiring Data Science Skills and Knowledge." National Academies of Sciences, Engineering, and Medicine. 2018. Envisioning the Data Science Discipline: The Undergraduate Perspective: Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/24886.

2

Acquiring Data Science Skills and Knowledge

Expanding data science training has the potential to transform scientific discovery, other academic research, many professions, and the broader society. With the arrival of new technologies, modes of thought, and means of communication, questions arise among industries, government agencies, and educational institutions: What skills are needed to be successful in the workplace and in society? Is data science a fundamental skill that all students should have some exposure to? How can data literacy be improved? In what skills, methods, and technologies should future data scientists be trained, given the wide variety of potential applications? Understanding the complexities of these questions is a first step in imagining the discipline for a diverse set of participants.

This chapter discusses some of the foundational, translational, ethical, and professional skills that make it possible for students to be effective data scientists.

FOUNDATIONAL SKILLS

What are the key ideas and principles to be included in the data science curriculum? One way to determine this is to consider the typical work cycle in which data scientists engage. This cycle often begins with a domain-specific question that then leads to data collection. The data are typically curated, described, and modeled. Models are tested and deployed, then the results are put to use and communicated to stakeholders. There are potentially several phases of analysis within this work cycle that lead to other questions and deeper understandings. Additionally, data scientists will need to draw on computational, statistical, and mathematical knowledge, as well as domain-specific knowledge, to inform the analytic choices and interpretations made throughout the workflow.
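The work cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method from the report: the data are synthetic, and the function names (`collect`, `curate`, `describe`, `fit_line`, `rmse`, `run_cycle`), the study-hours example, and all numeric settings are assumptions made up for this example.

```python
import random
import statistics

def collect(n, seed=0):
    """Stand-in for data collection: hypothetical (hours, score) pairs."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        hours = rng.uniform(0, 10)
        score = 50 + 4 * hours + rng.gauss(0, 3)
        # Roughly 10% of scores are recorded as missing (None).
        rows.append((hours, None if rng.random() < 0.1 else score))
    return rows

def curate(rows):
    """Curation step: drop records with a missing score."""
    return [(x, y) for x, y in rows if y is not None]

def describe(rows):
    """Description step: basic summary statistics."""
    xs, ys = zip(*rows)
    return {"n": len(rows), "mean_x": statistics.mean(xs), "mean_y": statistics.mean(ys)}

def fit_line(rows):
    """Modeling step: ordinary least squares for y = a + b * x."""
    xs, ys = zip(*rows)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    b = sxy / sxx
    return my - b * mx, b

def rmse(model, rows):
    """Testing step: root-mean-square error on held-out rows."""
    a, b = model
    return (sum((y - (a + b * x)) ** 2 for x, y in rows) / len(rows)) ** 0.5

def run_cycle(n=200, seed=0):
    """Run collect -> curate -> describe -> model -> test and return a report."""
    data = curate(collect(n, seed))
    split = int(0.8 * len(data))
    train, test = data[:split], data[split:]
    model = fit_line(train)
    return {"summary": describe(data), "model": model, "test_rmse": rmse(model, test)}

report = run_cycle()
print(report)  # the "communicate" step, reduced to a printed summary
```

In practice each step would be far richer, and the cycle would typically loop back to new questions rather than end with a single report.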

A simplified description of a data scientist’s work cycle helps illuminate the essential components of a data science curriculum. For example, De Veaux et al. (2017) describe the following six areas of focus for data science curricula that map well to this workflow: (1) data description and curation, (2) mathematical foundations, (3) computational thinking, (4) statistical thinking, (5) data modeling, and (6) communication, reproducibility, and ethics. Given different definitions of computational thinking, it may not be evident that De Veaux’s six areas are intended to encompass not only basic computing concepts (such as abstraction and indirection) but also the array of computing skills required to manage data. Therefore, the committee suggests adding computing skills as a seventh important area.

Specific topics within these seven focus areas might include the following: software engineering, linear algebra, optimization, algorithms and data structures, information technology, basic statistics, uncertainty quantification, and tools for fitting models to data. Human–computer interaction research may also play a role in the foundational data science curriculum as it examines the design and use of computer technology focused on the interfaces between people and computers, including the range of ways in which humans are integrated with computational processes, data collection, dissemination, and analysis.

The path from research question to analysis has changed with the advent of data science. In the past, data were typically collected with a specific purpose and via a particular design to answer a priori research questions. These data, in turn, informed the statistical analyses. More analyses are now using extant data and repurposing them to answer new questions and explore new hypotheses (Groves, 2011). As more of these existing data sets become accessible (e.g., via application programming interfaces), one core question is how to extract knowledge and insight from data that were collected for an entirely different purpose and, consequently, with little forethought to the design necessary for answering those questions.

It might be possible to take a piecemeal approach to the data science curriculum in which courses are selected from existing departments, and these courses might look reasonable as a curricular whole. In reality, however, such a curriculum will almost certainly lack educational and cross-disciplinary cohesion unless there is some coordination across the departments and courses.

In developing an undergraduate data science curriculum, it is important to evaluate how particular topics and skill sets will both fulfill program requirements and prepare students to address data challenges they will face in their careers. Training in describing and documenting models, analyses, and value propositions effectively will benefit students preparing for a wide variety of data science careers. Although a deep theoretical foundation is less necessary for students pursuing data science positions after earning an undergraduate degree, an emphasis on developing sophisticated techniques and complex modeling skills is still valuable to help solve real-world data science problems. With more data and more complicated models available, interpretability of models is important in both data science education and practice, as is fairness in algorithms and computation-driven decisions. Many academic researchers, businesses, and government agencies that are hiring new employees value graduates with expertise in survey methodology and designed data, elements of statistical learning that serve as the framework for machine learning, Bayesian data analysis, and implementations of reproducible research.

Developing and applying these skill sets requires “data acumen”—the ability to make good judgments and decisions with data. This trait is increasingly important, especially given the large volume of data typically present in real-world problems, the relative ease of (mis)applying tools, and the vast ethical implications inherent in many data science analyses. With large volumes of data, it can be difficult to understand at first glance what is needed, what is possible, and what limitations exist. Still, questions remain as to how to most effectively build data acumen in students. Data acumen can be developed over time through research experience, industry partnerships, courses in creative data analysis, domain-specific data science courses or experiences, and extensions of capstone-like experiences throughout the curriculum. It can be enhanced through exposure to current key components of data science, including mathematical foundations, computational thinking, statistical thinking, data management, data description and curation, data modeling, ethical problem solving, communication and reproducibility, and domain-specific considerations. Similar to the concept of “mathematical maturity,” which typically denotes a mixture of mathematical insights and experiences that mathematicians develop and strengthen with time, data acumen is not a final state to be reached but rather a skill that data scientists develop and refine over time.

Finding 2.1: A critical component of data science education is to guide students to develop data acumen. This requires exposure to key concepts in data science, real-world data and problems that can reinforce the limitations of tools, and ethical considerations that permeate many applications. Key concepts related to developing data acumen include the following:

Mathematical foundations,
Computational thinking,
Statistical thinking,
Data management,
Data description and curation,
Data modeling,
Ethical problem solving,
Communication and reproducibility, and
Domain-specific considerations.

The necessary level of exposure to each area will vary based on the overall objectives and duration of the data science program as well as the goals for the students.

As is true with curriculum design in other academic settings, it is important to build a curriculum in which students can recognize when they do not know something so that they learn which questions to ask. It is also important for students to learn to question what they may already perceive as fact, to understand the risks associated with using data, and to recognize that data are a product of the specific context in which they are generated, collected, analyzed, and interpreted. Incorporating insights from the humanities and social sciences into data science curricula enables students to consider how behaviors, interactions, and attitudes shape data in an informed and grounded way.

Lessons learned and effective practices from other domains can help shape how data science is taught—including co-curricular activities (such as mentorship programs), individualized advising, supplemental opportunities for students to learn fundamentals (such as summer bridge programs), and introductory courses designed to appeal to a wide student audience. High-impact educational practices, such as those put forth by the Association of American Colleges and Universities, describe teaching and learning practices that have been shown to be beneficial for postsecondary students from many backgrounds. These practices take many different forms, depending on learner characteristics and on institutional priorities and contexts, but could be useful in ensuring that data science education is effective. Box 2.1 highlights the educational practices put forth by the Association of American Colleges and Universities and provides some examples of how they are currently being applied to data science education.

BOX 2.1
High-Impact Educational Practices¹

A Brief Overview

The following teaching and learning practices have been widely tested and have been shown to be beneficial for college students from many backgrounds. These practices take many different forms, depending on learner characteristics and on institutional priorities and contexts.

On many campuses, assessment of student involvement in active learning practices such as these has made it possible to assess the practices’ contribution to students’ cumulative learning. However, on almost all campuses, utilization of active learning practices is unsystematic, to the detriment of student learning. Presented below are brief descriptions of high-impact practices that educational research suggests increase rates of student retention and student engagement.

First-Year Seminars and Experiences

Many schools now build into the curriculum first-year seminars or other programs that bring small groups of students together with faculty or staff on a regular basis. The highest-quality first-year experiences place a strong emphasis on critical inquiry, frequent writing, information literacy, collaborative learning, and other skills that develop students’ intellectual and practical competencies. First-year seminars can also involve students with cutting-edge questions in scholarship and with faculty members’ own research. An example of this first-year experience in data science has been explored by a professor at Duke University, who suggests the creation of a gateway course that would allow students to learn fundamental data science skills in a project-oriented curriculum (Cetinkaya-Rundel, 2017).

Common Intellectual Experiences

The older idea of a “core” curriculum has evolved into a variety of modern forms, such as a set of required common courses or a vertically organized general education program that includes advanced integrative studies and/or required participation in a learning community. These programs often combine broad themes (for example, technology and society, and global interdependence) with a variety of curricular and co-curricular options for students. An example model of a course that could be utilized in a core curriculum is Data8, a foundational data science course at the University of California, Berkeley that analyzes the technical and social implications of data science in a hands-on manner. The course is intended for students without a background in statistics or computer science, allowing students to explore these topics without prerequisites (Culler, 2016).

Learning Communities

The key goals for learning communities are to encourage integration of learning across courses and to involve students with “big questions” that matter beyond the classroom. Students take two or more linked courses as a group and work closely with one another and with their professors. Many learning communities explore a common topic and/or common readings through the lenses of different disciplines. Some deliberately link “liberal arts” and “professional courses”; others feature service learning. An example of this within the data science undergraduate experience is the Statistics Living-Learning Community at Purdue University, which provides a small group of sophomore students with the ability to live together in the same dorm; take courses together in probability theory, statistical theory, and data analysis; and conduct a year-long research project together (Purdue University, 2013).

Writing-Intensive Courses

These courses emphasize writing at all levels of instruction and across the curriculum, including final-year projects. Students are encouraged to produce and revise various forms of writing for different audiences in different disciplines. The effectiveness of this repeated practice “across the curriculum” has led to parallel efforts in such areas as quantitative reasoning, oral communication, information literacy, and, on some campuses, ethical inquiry. This practice is currently incorporated into the core curriculum at Columbia University through the University Writing Course “Readings in Data Science,” which aims to enhance students’ ability to understand data science through enhancement of reading and writing techniques (Columbia University, 2013). This aligns with the curriculum practices recommended by the American Statistical Association to teach students how to strengthen communication skills within the data science field (ASA, 2014).

Collaborative Assignments and Projects

Collaborative learning combines the following two key goals: learning to work and solve problems in the company of others and sharpening one’s own understanding by listening seriously to the insights of others, especially those with different backgrounds and life experiences. Approaches range from study groups within a course, to team-based assignments and writing, to cooperative projects and research. An example of this currently in place at a university level is CS169, “Software Engineering,” at the University of California, Berkeley. This course is designed for students to work in groups of six with an outside client to develop a “software-as-a-service” deliverable by course completion, highlighting the importance of applying skills learned inside the classroom to a real-world application (University of California, Berkeley, 2017). Another example is the Michigan Data Science Team run out of the Michigan Institute for Data Science at the University of Michigan. This not-for-credit extracurricular activity is organized and run by undergraduate computer science and data science students with light faculty oversight. Students collect and analyze data in the context of an application with high impact on local community or society. For example, in 2016 they applied their skills to help the city of Flint, Michigan, cope with the lead contamination crisis through better data collection and analysis (Meisler, 2017).

Undergraduate Research

Many colleges and universities are now providing research experiences for students in all disciplines. Undergraduate research, however, has been most prominently used in science disciplines. With strong support from the National Science Foundation (NSF) and the research community, scientists are reshaping their courses to connect key concepts and questions with students’ early and active involvement in systematic investigation and research. The goal is to involve students with actively contested questions, empirical observation, cutting-edge technologies, and the sense of excitement that comes from working to answer important questions. This is demonstrated through NSF’s Research Experience for Undergraduates, which places undergraduate students at a research institution for the summer to conduct innovative research across disciplines, including data science (NSF, 2017). This emphasizes the importance of exposing students to real data and problems, which can be messier and more challenging to work with than data sets used within the classroom.

Diversity/Global Learning

Many colleges and universities now emphasize courses and programs that help students explore cultures, life experiences, and worldviews different from their own. These studies, which may address U.S. diversity, world cultures, or both, often explore “difficult differences,” such as racial, ethnic, and gender inequality, or continuing struggles around the globe for human rights, freedom, and power. Frequently, intercultural studies are augmented by experiential learning in the community and/or by study abroad. For instance, the “Comparative Public Health: The U.S. and the World” study abroad program at St. Olaf College provides students a chance to explore public health facilities at the Centers for Disease Control and Prevention in Atlanta, Georgia, as well as the World Health Organization in Geneva, Switzerland. This interdisciplinary program allows undergraduates to explore and compare international public health efforts while individually exploring a public health topic of interest (Legler, 2017). Additionally, the Undergraduate Fellowships for Community Engaged and Translational Research at Virginia Commonwealth University seek to provide funding to a select few undergraduate research projects conducted with a community partner, with at least one project a year dedicated to human health (VCU, 2017).

Service Learning, Community-Based Learning

In these programs, field-based “experiential learning” with community partners is an instructional strategy—and often a required part of the course. The idea is to give students direct experience with issues they are studying in the curriculum and with ongoing efforts to analyze and solve problems in the community. A key element in these programs is the opportunity students have to both apply what they are learning in real-world settings and reflect in a classroom setting on their service experiences. These programs model the idea that giving back to the community is an important college outcome, and that working with community partners is good preparation for citizenship, work, and life. An example of this is the Center for Data Science and Public Policy at the University of Chicago, which has several opportunities to engage students in the intersection of public policy and data science. This includes undergraduate coursework opportunities such as “Data Analytics for Campaigns,” “Machine Learning for Public Policy,” and “Computation and Public Policy,” and the Eric and Wendy Schmidt Data Science for Social Good Fellowship at the University of Chicago, which brings together undergraduate- and graduate-level data scientists from around the globe to solve real-world social problems in conjunction with government agencies and nonprofits (University of Chicago, 2017).

Internships

Internships are another increasingly common form of experiential learning. The idea is to provide students with direct experience in a work setting, usually related to their career interests, and to give them the benefit of supervision and coaching from professionals in the field. If the internship is taken for course credit, students complete a project or paper that is approved by a faculty member. The variety of internship opportunities available within the data science field provides students the opportunity to gain real exposure and explore all aspects of the field, from the more technical side to potential policy implications. There are myriad examples of undergraduate data science internships, ranging from those housed at large technology companies to specialized programs such as the Atlanta Data Science for Social Good program.

Capstone Courses and Projects

Whether they are called “senior capstones” or some other name, these culminating experiences require students nearing the end of their college years to create a project of some sort that integrates and applies what they have learned. The project might be a research paper, a performance, a portfolio of “best work,” or an exhibit of artwork. Capstones are offered both in departmental programs and, increasingly, in general education. For example, the Statistics Capstone Course at the University of Georgia provides students the opportunity to engage in a year-long data analytics project that enhances understanding of advanced statistical material while reinforcing oral and written communication skills (Lazar et al., 2012). Similarly, the computer science capstone courses at Virginia Tech require undergraduates majoring in that field to pursue a design-intensive, team-based final project within the area of their choosing, ranging from “Issues in Scientific Computing” to “Human–Computer Interaction” (Virginia Polytechnic Institute and State University, 2007).

Undergraduate Teaching Assistantships

Undergraduate teaching assistantships are another form of experiential learning. In the State University of New York system, for example, “a student enrolled in a credit-bearing course with specific student learning outcomes to assist faculty in providing instructional support” can serve as an undergraduate teaching assistant. During such an experience, a teaching assistant for a data science course has the opportunity to learn about the teaching and learning processes, acquire expertise in data science concepts, hone oral and written communication skills, develop leadership and teamwork skills, and practice good time management. Whether or not teaching assistants plan to join the teaching profession, this experience and the skills it fosters prepare these students for a wide range of data science career opportunities (SUNY, 2012).

References

ASA (American Statistical Association). 2014. “Curriculum Guidelines for Undergraduate Programs in Statistical Sciences.” http://www.amstat.org/asa/files/pdfs/EDU-guidelines2014-11-15.pdf.

Cetinkaya-Rundel, M. 2017. “Teaching Data Science and Statistical Computation to Undergraduates.” United States Conference on Teaching Statistics, State College, Pa., May 20. https://www.causeweb.org/cause/uscots/uscots17/keynote/3.

Columbia University. 2013. “University Writing: Readings in Data Sciences.” https://www.college.columbia.edu/core/node/3290/.

Culler, D. 2016. “Data Science at the Heart of a 21st Century University,” presentation to the National Academies’ Committee on Envisioning the Data Science Discipline: The Undergraduate Perspective, Washington, D.C., December 12.

Lazar, N.A., J. Reeves, and C. Franklin. 2012. A capstone course for undergraduate statistics majors. The American Statistician 65(3): 183-189.

Legler, J. 2017. “ID 280: Comparative Public Health: the US and the World.” St. Olaf College. http://stolaf.studioabroad.com/_customtags/ct_FileRetrieve.cfm?File_ID=05027172754F73020D020306070B1C04080C0014757800006E06030E7A057773730271030775047172. Accessed June 28, 2017.

Meisler, D. 2017. “Google, U-M to build digital tools for Flint water crisis.” MDST Projects, January 25.

NSF (National Science Foundation). 2017. “Research Experience for Undergraduates (REU).” https://www.nsf.gov/crssprgm/reu/. Accessed June 21, 2017.

Purdue University. 2013. “Statistics Living-Learning Community.” http://llc.stat.purdue.edu/.

University of California, Berkeley. 2017. “UC Berkeley CS 169 Software Engineering.” http://cs169.saasclass.org/faq. Accessed June 21, 2017.

University of Chicago. 2017. “Learning Opportunities, Center for Data Science and Public Policy.” http://dsapp.uchicago.edu/research-areas/learning-opportunities/. Accessed June 22, 2017.

VCU (Virginia Commonwealth University). 2017. “Undergraduate Fellowships for Community Engaged and Translational Research.” http://www.research.vcu.edu/ugresources/ce_cctr_fellowship.htm. Accessed June 28, 2017.

Virginia Polytechnic Institute and State University. 2007. “Capstone Courses.” http://www.cs.vt.edu/undergraduate/capstones.

__________________

¹ The committee adapted this list from G.D. Kuh, 2008, High-Impact Educational Practices: What They Are, Who Has Access to Them, and Why They Matter (http://secure.aacu.org/imis/aacur) and provided additional content relevant to this interim report.

TRANSLATIONAL SKILLS

In addition to developing foundational skills, it is valuable for data science students to apply techniques and technologies learned in the classroom or laboratory to specific and different situations in practice. In other words, educators would train students to do data science in real application contexts, incorporating real data, broad impact applications, and commonly deployed methods, as well as working in teams. Students benefit from experiences such as carrying out sentiment analysis of texts, generating interactive maps to explore spatial data, assessing relationships between links within social networks, drawing samples from a distribution, visualizing multidimensional data to draw conclusions, and making decisions using data from a variety of sources and domains.

It is useful for students to learn how to translate understandings across domains and think critically about assertions and also to appreciate the importance of reusing and sharing data. As an example, consider the insights that can be found from digitizing an entomological collection. While the images originally might have been thought to be important only for understanding insects, later study of the pollen on the legs of the insects could yield unique insights into the changing nature of ecological systems with local climate change over more than a century. There is great potential for unanticipated ancillary derivatives from the data generation and integration process, but only if the right foundational knowledge is instilled about how one can and should combine sources and share findings in a reproducible workflow for others to interrogate.

Students also benefit from experiences with integrating diverse data and accounting for outside factors. For example, randomized trials—where a treatment is applied to a randomly selected subgroup of a population and then the outcomes of the full group are tracked—are the gold standard of evidence-based practices. However, conducting and generalizing from randomized trials can be difficult in many settings. Since most trials have issues with adherence, compliance, and nonresponse, it is important to account for post-randomization factors. Similarly, survey methods that undergird large surveys and censuses, such as those undertaken by the Census Bureau and Bureau of Labor Statistics, have contributed greatly to understanding and decision making about our society and economy. Today, these surveys can be enhanced and complemented—but not replaced—by fusing data that do not arise from a well-characterized sampling frame and design with data that were carefully sampled and vetted. While this integration may not be straightforward, it has the potential to extend the reach of ongoing surveys and to answer new questions in different ways.
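To illustrate why post-randomization factors matter, the following is a minimal simulation sketch, not an analysis from the report. It assumes a hypothetical trial in which healthier participants are more likely to adhere to the assigned treatment. The function names, the true effect of 2.0, and the adherence model are all invented for this example. It compares an intention-to-treat estimate (grouping by assignment) with a naive as-treated estimate (grouping by treatment actually received).

```python
import random
import statistics

def simulate_trial(n=10_000, true_effect=2.0, seed=1):
    """Simulate a two-arm trial in which only some assigned participants adhere."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        health = rng.gauss(0, 1)       # unobserved baseline health
        assigned = rng.random() < 0.5  # randomization step
        # Post-randomization factor: healthier participants adhere more often.
        adhered = assigned and (health + rng.gauss(0, 1) > 0)
        outcome = health + true_effect * adhered + rng.gauss(0, 1)
        rows.append((assigned, adhered, outcome))
    return rows

def mean_difference(rows, flag_index):
    """Difference in mean outcome between rows where the flag is True vs. False."""
    on = [r[2] for r in rows if r[flag_index]]
    off = [r[2] for r in rows if not r[flag_index]]
    return statistics.mean(on) - statistics.mean(off)

rows = simulate_trial()
itt = mean_difference(rows, 0)         # compare by assignment (intention to treat)
as_treated = mean_difference(rows, 1)  # compare by treatment actually received

print(f"intention-to-treat estimate: {itt:.2f}")
print(f"naive as-treated estimate:   {as_treated:.2f}")
```

Under these assumptions, the intention-to-treat estimate falls below the true effect because about half of the assigned group never receives treatment, while the as-treated comparison overshoots because adherence is correlated with baseline health. Neither number alone recovers the effect of the treatment itself, which is the kind of difficulty the paragraph above describes.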

A major challenge relates to how results that are demonstrated using unstructured data are verified. This struggle is analogous to that of research environments adapting to the arrival of a new instrument. For example, the microscope, telescope, and genetic sequencing machines have all allowed researchers to resolve something previously unresolvable. Scientific cultures have to grapple with what to do when observations are unprecedented and when old methods are insufficient to determine whether the new instruments are accurate or not. The development of new data science methods to address data challenges and undertake new analyses will require similar adaptation. This analogy may also apply to the expanding computational infrastructure and capacity, which will continue to provide the opportunities to carry out analyses that were previously infeasible.

It is also important to better understand what types of questions and information are amenable to data science approaches. For example, understanding and conveying what computational approaches have produced and why is an important skill. New frameworks and models are needed for communicating carefully constructed distillations, along with reports that can be validated, reproduced, and assessed. Similar to the “John Henry” folklore¹ that pitted man against machine, data science is furthering consideration of where the edges of human abilities are, what can be measured, and what can be analyzed.

Educational systems and structures need to prepare students to inhabit a world that will have different tools than those currently available. Students develop judgment through the practice of working through the entire data science cycle. They benefit from opportunities to gradually build large systems by composing smaller systems where the behavior of the smaller systems is better understood. While some curricula offer these opportunities in the form of a capstone course, internship or externship opportunities, or similar integrative experience, it is beneficial for additional complementary experiences to be provided earlier in the curriculum.

___________________

¹ As the legend goes, John Henry won but died from his efforts (see Johnstown Area Heritage Association, 2013).

Finding 2.2: It is important for data science education to incorporate real data, broad impact applications, and commonly deployed methods.

ETHICAL SKILLS

In addition to the foundational and translational skills training that students receive, they would benefit from a better understanding of the ethics and social context of data (O’Neil, 2016; De Veaux et al., 2017). Several ethical considerations and corresponding examples that could be discussed include the following:

Fairness. This multifaceted consideration can be summarized as the ability of data science techniques to treat all people equitably and avoid bias that may be inherent in training data sets. This is an especially important factor for applications that directly affect individuals, such as criminal justice models that determine sentencing practices, which need to be designed without introducing racial or socioeconomic bias (Berk et al., 2017).
Validity. Before data science methodologies can be applied, it is vital to ensure that the data set contains valid (e.g., accurate and relevant) information. Use of data that are falsified, not current, incomplete, from an unbalanced sample, biased in terms of survivorship, or not measuring the appropriate factors could lead to faulty conclusions, such as inaccurate estimates for health care needs in a given area based on outdated survey information (IOM, 2009).
Data context. Similar to validity, it is important for individuals to understand the context of data sets before they are processed and analyzed. Knowing where, when, and how data were collected could lead to important insights that aid in analysis, detect inherent biases, and mitigate risk to individuals whose information is contained within the data sets.
Data confidence. Recognition of the limitations of data science is important for avoiding overconfidence and the inclination to draw stronger-than-appropriate conclusions. This “data hubris” could be detrimental, for example, if too much confidence is invested in a data science model that makes stock market predictions for investments without any consideration of model limitations (Zacharakis and Shepherd, 2001).
Stewardship. In data science, stewardship refers to the supervision of a data set at all stages of existence, including collection, storage, and analysis. This facilitates protection of individuals whose information is within the data set, including considerations of intellectual property rights or cybersecurity risk.
Privacy. Considerations regarding individuals’ privacy with respect to how data are collected and analyzed arise in many disciplines. Privacy concerns specific to data science are discussed later in this section.

By developing a truly integrative curriculum, it is possible to explore how society is affected by and reflected in data. Students would learn the importance of asking the following sequence of questions frequently, consistently, and thoughtfully: Whose data are being collected, by whom, for what purposes, and with what possible implications? Data are often representations (and simplifications) of the lives of people; this point could be integrated into every phase of a data science curriculum. Ethics plays a central role as students learn to problem-solve with data. An important lesson for students to learn is that transparency, trust-building, and validation/replication are key concepts; reputable data scientists are able to show why they do their work, explain the benefits that will emerge from it, and characterize and communicate the limitations of that work.

The trade-offs related to privacy play a key role in this discussion as well. A question arises about whether people and their information are “public by default” (Boyd, 2010), because levels of involvement are different and ever-changing for each individual. Moreover, data from multiple sources can be combined, enabling an even richer and more intimate understanding of subjects. For example, with widespread adoption of mobile devices, data scientists may have access to detailed information about a person’s location over time, which, in combination with other data, may reveal far more than the individual intended to disclose. Individuals appreciate the ability to make choices about and have control over their information; they want secure data sharing, clear disclosure mechanisms, and a process to obtain reparation for damages caused by data breaches. All of these real-world problems could provide robust content in a data science curriculum.

An example of a course that integrates a study of data with a study of social context is “Data in Social Context”² at Virginia Tech. This course promotes dual literacy (i.e., humanities skills for data analytics students and data analytics skills for humanities students), explores why people turn to data to explain historical phenomena, and shows students a different way to approach questions with accessible tools and data. It highlights how valuable social context is in data analytics; data are filled with narratives, and questions often arise about ethics, probability, and bias.

Finding 2.3: Incorporating ethics into an undergraduate data science program provides students with valuable skills that can be applied to complex, human-centered questions across disciplines.

PROFESSIONAL SKILLS

Broad professional skills are particularly critical in data science (BHEW, 2017; Hicks and Irizarry, 2017). Industry partners relate that desirable characteristics include the ability to state goals clearly, to validate solutions, and to communicate with both technical and nontechnical audiences. Communication, both written and verbal, plays a significant role in data science because of diverse application areas, interdisciplinary research groups, and the ubiquity of data spanning many fields and being produced by many people. Conveying information to diverse audiences, expressing nuance regarding evidence in the presence of uncertainty, communicating limitations of analyses, and ensuring that what is conveyed is a faithful and honest representation of the data are all essential to data science.

Communication skills can be strengthened through practice in communicating various types of information to diverse audiences, for example through courses or experiences that emphasize public speaking as well as technical and nontechnical writing. Communication can also be strengthened by improved understanding of diverse audiences. For example, what aspects of the data science process and results would domain scientists need to know to further their research, versus what would managers need to know to make relevant business decisions, versus what would policy makers need to know to make sound policies? These courses could also include a section on effective data visualization and its benefits, especially when explaining best communication practices for an audience from a nontechnical background.

The ability to work well in multidisciplinary teams is also important to data science and highly valued by industry. Multidisciplinary teamwork offers students the opportunity to use creative problem solving and refine leadership skills, and allows for diverse perspectives when tackling data science problems.

Finding 2.4: Strong oral and written communication skills and the ability to work well in multidisciplinary teams are critical to students’ success in data science.

___________________

² Tom Ewing, “Data in Social Context,” http://ethomasewing.org/disc/, accessed August 21, 2017.