Established in December 2016, the National Academies of Sciences, Engineering, and Medicine’s Roundtable on Data Science Postsecondary Education was charged with identifying the challenges of and highlighting best practices in postsecondary data science education. Convening quarterly for 3 years, representatives from academia, industry, and government gathered with other experts from across the nation to discuss various topics under this charge. Some stakeholders argue for data science to be described as a discipline, others as a domain, and still others as an umbrella. No matter the label, academia is in the midst of a transformation that will continue to have profound implications across society. In an effort to train postsecondary students effectively, institutions of higher education are (re)examining who is taught what and why, as well as how and by whom, and considering how to increase interactions with students’ potential employers.
This introduction serves to orient readers to four central themes that emerged during the roundtable meetings: (1) foundations of data science; (2) data science across the postsecondary curriculum; (3) data science across society; and (4) ethics and data science. These themes are expanded in the chapters that follow, which contain detailed summaries of each roundtable meeting. These meeting summaries, as well as original meeting videos, are also available online.1 These meeting recaps were prepared by the National Academies of Sciences, Engineering, and Medicine as informal records of issues that were discussed during meetings of the Roundtable on Data Science Postsecondary Education. All opinions presented are those of the individual participants and do not necessarily reflect the views of the National Academies or the roundtable sponsors.
1 Watch meeting videos or download presentations at https://www.nationalacademies.org/our-work/roundtable-on-data-science-postsecondary-education, accessed February 13, 2020.
FOUNDATIONS OF DATA SCIENCE
Roundtable participants discussed the type(s) of training best suited for robust data science practice and considered what data science education could look like from a national perspective. No consensus opinion emerged as to how data science should be defined or what a data science degree should require. Nonetheless, roundtable participants agreed that this increasingly interdisciplinary field depends on foundational elements from many disciplines, including but not limited to statistics, computer science, engineering, and mathematics. Participants noted an abundance of foundational skills, techniques, and concepts—from one discipline or common to many—that are integral to proficiency in data science (see Chapter 2). As data science courses, programs, and degrees continue to evolve, consensus foundations may emerge but will depend on institutional contexts and opportunities.
DATA SCIENCE ACROSS THE POSTSECONDARY CURRICULUM
Roundtable participants discussed the intersection of data science and domain sciences and contemplated how this interplay impacts the teaching of data science. Faculty are challenged to learn new skills, adapt methods, and find new (often multidisciplinary) approaches to teaching data science. At the same time, student demand for data skills continues to increase, and data, computation, and software tools are becoming pervasive. Several promising approaches to postsecondary data science education have emerged, such as integrating data science perspectives into existing data-intensive domain courses; creating new courses that integrate multiple perspectives, skills, and fields; and teaching collaboratively (see Chapter 3). These and other paths forward could be implemented successfully alongside the following strategies: training graduate teaching assistants across a range of skills; identifying and better supporting faculty who are willing to experiment with and assess new approaches; improving the understanding of disciplinary needs for data science; developing methods to introduce data science to students without quantitative training; integrating standard disciplinary data sets to support data science instruction; and lessening traditional seminar teaching and single-author monograph publishing. A shared goal of many data science
Ph.D. programs in data science are more nascent than undergraduate and master’s programs in data science. Approaches to Ph.D. training in data science—such as a new entity created with existing faculty, an expansion of an existing entity, or an overlay model—often depend on where the Ph.D. program is housed within an academic institution and how it interacts with other departments. Each approach serves a unique purpose, and, collectively, these approaches are quickly creating options for advanced data science education. These approaches differ in their administrative mechanisms and application processes. For example, doctoral students could be directly admitted into a data science program or admitted into a home department; in some cases, admission into a data science program only happens after a student arrives on campus. Several programs compel students to complete all requirements in their home departments before completing additional requirements for data science. The ability to carry out a broader dissertation and research, often with interactions with multiple scientific domains, is advantageous in a data science Ph.D. program. Faculty flexibility and a willingness to work within the constraints of an institution are beneficial, especially given the challenges associated with starting new programs. In the future, it is likely that there will be some consolidation of the approaches taken, although it is unlikely (and undesirable) that one approach will fit all institutions. Evaluative measures could be useful to determine which programs flourished, how requirements varied across different programs, what types of dissertations were produced, and where graduates were hired (see Chapter 8).
Demand for employees with data science skills is expanding across industries, and some of today’s data science jobs could be filled by individuals with 2-year (associate’s) degrees. Many efforts are under way to better understand and align with the needs of employers, such as the development of data career pathways and expert worker profiles. Institutions are revising curricula accordingly to reflect the changing demands of the skilled workforce. Several 2-year colleges are developing courses, certificates, and associate’s degrees in data science, data analytics, and data management. Incentivizing faculty training is key to the success of these programs. Because funding and resource constraints make it difficult to implement new programs at 2-year colleges, there is value in establishing formal partnerships with nearby 4-year and master’s-granting institutions. Two-year colleges are also considering the potential for transfer as they design their programs and are beginning to develop articulation agreements that could create smoother transitions for students in search of advanced training (see Chapter 12).
The past decade has seen substantial experimentation in how data science education is delivered both within and outside the classroom. Three alternative mechanisms within academia are certificate programs, practicums, and collaborative environments. These mechanisms challenge the traditional model, in which practice, if incorporated into the curriculum at all, is most likely to be encountered through a class project or a capstone experience. Other unique educational opportunities include hackathons, boot camps, and activities in informal settings such as museums. Boot camps provide a way to fill the increasing gaps between degree-based programs in academia and on-the-job learning in industry, with focused, problem-based, team-oriented programs that build proficiency with the data science life cycle. Wider dissemination of successful efforts could be helpful for the efficient use of resources as well as to scale emerging best practices (see Chapter 5).
DATA SCIENCE ACROSS SOCIETY
Roundtable participants further examined the development of data science expertise for the workplace. Effective teamwork and the ability to communicate clearly with diverse audiences are particularly important skills. High-quality, free, online training is readily available, and assessments have revealed that participants are incentivized by the opportunity to work with real data sets to solve real problems (see Chapter 4). Increased coordination between academia and industry (as well as between academia and government) could be key to the future of robust data science education and practice. Students and faculty could benefit from the opportunity to spend time in industry in the form of prolonged internships and postdoctoral assignments or with joint appointments, respectively. Academic institutions could stimulate successful partnerships by leveraging experiences from other disciplines; benchmarking and developing best practices; fostering continued interactions; providing firm financial support; offering resources and incentives to both students and faculty; increasing diverse representation; developing synergistic relationships with neighboring institutions; building on credentials in well-established areas; creating legally binding master agreements; embracing open source, open data, and open science; and using cloud platforms (see Chapter 11).
ETHICS AND DATA SCIENCE
Roundtable participants posed several questions about ethical and privacy issues related to data science education (see Chapter 6):
- What does it mean for an algorithm to be transparent, interpretable, or explainable?
- What rights should individuals have when they are the subjects of algorithms, and how do these rights interface with existing legislation?
- Who is responsible for the effects of how data are used?
- What information about fairness could data scientists supply that is suitable for a range of metrics for fairness? How does one close the feedback loop from metrics of fairness back to the design of algorithms?
- What rights should individuals have about keeping their data private?
- What are the sources of bias in algorithms and in data science more generally? Could they be eliminated or substantially lessened by explicit protocols and policies?
- How could those with technical knowledge most accurately and understandably present trade-offs? Would advance knowledge of trade-offs skew the results against privacy and fairness?
- What are the lessons learned from other disciplines?
- How and at what level could students be taught about ethics and privacy?
Rigorous approaches to these ethical questions are being implemented in research and in academic institutions through new courses on ethics in data science and through modules as part of other data science courses. In this time of innovation, making teaching materials widely and quickly available could help to expand ethical conversations in the classroom. The discussion of ethics and privacy in data science education could be broadened to include perspectives from social science, philosophy, industry, law, and policy as the research begins to delve deeper into issues of accountability, transparency, fairness, privacy, and bias (see Chapter 6).
Reproducibility in computationally enabled research has been an area of active discussion in the academic community for several decades; this discussion influences data science practice and teaching on both theoretical and practical levels. Computational tools (e.g., the Jupyter Notebook) can help to address issues of reproducibility and transparency. Computational transparency permits not only the understanding of the reasoning behind scientific findings but also the comparison of results that may differ and yet claim to answer the same scientific question. These efforts inform modern practices in software development and coding for data science such as version control, the use of notebooks, and skills in specific languages (e.g., Python). The adaptation of techniques and tools from software engineering, database management, computing at scale, and statistical inference is essential to data science practice, but these techniques and tools do not guarantee the accuracy of the resulting scientific findings. Generally accepted standards for teaching computational transparency and reproducibility in data science could be useful, as could generally accepted standards for best practices in software engineering in data science applications (see Chapter 7).
Approaches to engaging students in meaningful projects with the potential for social impact are rapidly emerging, and these efforts could help to attract and retain future data scientists. Questions remain about which types of institutions are able to provide these experiences; whether emerging programs are particularly resource-intensive, scalable, and conducive to academic or industry reward structures; and how best to prepare people with different levels of authority in academic and industrial settings to raise and discuss ethical issues. The data science community more broadly could benefit from a process that builds trust between technologists and the social sector, increases attention to data collection and security, explains conclusions drawn from models, and plans for cases when harm is done to users (see Chapter 10).
Diversity and Inclusion
Many of the same institutional barriers that have impeded equity and inclusion in STEM affect data science education, such as the rigidity of the faculty reward system and implicit biases in faculty hiring and promotion and in graduate school admission. The data science community has succeeded in raising awareness of the importance of inclusion, in part owing to a nationwide shortage of data scientists. Mentorship programs and cohort experiences have been particularly successful in recruiting and retaining underrepresented groups for data science education. Given that academic institutions are slow to change, especially with regard to rewarding faculty involvement in activities that do not result in peer-reviewed publication, partnership with industry could be a promising avenue to increase diverse participation in data science. Other potential paths toward success could include a more coordinated effort to involve teachers, counselors, and administrators in implementing change at the K-12 level; increased assessment and the sharing of best practices; and a stronger connection between inclusive academic programs and organizations working to increase inclusive participation in the field of data science (see Chapter 9).