The emergence of a novel science of data highlights the need for new principles for data collection, storage, integration, and analysis. These new scientific principles are leading to new tools that uniquely respond to the challenges of big data. However, the main concepts, skills, and ethics powering this emerging discipline of data science still need to be identified. A new generation of tool developers and tool users will require the ability to make good judgments and decisions with data and use tools responsibly and effectively (referred to as “data acumen” throughout this report). Some of these developers and users will draw from computing, mathematics, and statistics fields, but many will come from other fields and application domains. Educators and administrators are beginning to reimagine data science course content, delivery, and enrollment at the undergraduate level to best prepare students to operate in this new paradigm.
New and greater volumes of information compound long-standing challenges of data analysis and raise new ones. The ability to measure, understand, and react to data can affect scientific discovery, social interaction, political tradition, economic practice, public health, and many other areas. Some data science applications are low risk, such as recommender systems that suggest purchases within an online shopping platform or select advertisements for website visitors. Although provider sales may be affected if undesirable products are recommended and users may be dissatisfied with their purchases, the overall impact of poor recommender systems on individuals and society is low. Of greater impact, census data are used to redraw political boundaries, allocate funds, and inform other critical public policy decisions. While new volumes and types of information can make analyses more accurate than past methods that relied on sparse surveys with lower than desired response rates, people can be negatively affected if the interpretation of the data does not account for all relevant factors. If weak data analysis is used, a program that a family depends on may not have sufficient funding, or a policy might be enacted that has unintended consequences for large segments of the population. Thus, it is important that data are collected and analyzed appropriately, especially as new demands are placed on data collection and evaluation and as new technologies emerge. It is equally important that there are clear principles guiding the use of data for human good. Further, the complexity of the analyses and the increasing dependency on data across all the fields of human endeavor drive demand for “smarter” tools and best practices for data science that will minimize mistakes in interpretation.
Academic institutions and industry recognize these shifts and are rapidly embracing the idea that there is an emerging discipline of data science that is unique yet builds on knowledge from existing disciplines (NRC, 2014). Traditional statistical methods are well established and clearly understood but often do not scale to handle the vast volumes of data that must be analyzed for today’s data science. Computing is unparalleled in its capacity to handle vast volumes or fast-flowing streams of information, but often without statistical and inferential guarantees, which can result in unreliable results and biased or unfair interpretations of the data (Jordan, 2013). Domain areas (e.g., business, medicine, natural science, social sciences, or engineering) are developing and adapting techniques to solve specific research questions, which can be more effective than using general methods. However, these approaches may suffer from insufficient mathematical or statistical rigor or lack computational scalability.
Although the definition of data science is evolving, it centers around the notion of multidisciplinary and interdisciplinary approaches to extracting knowledge or insights from data for use in a broad range of applications. It is the field of science that relies on processes and systems (mathematical, computational, and social) to derive information or insights from data. It is about synthesizing the most relevant parts of the foundational disciplines to solve particular classes of problems or applications while also creating novel techniques to address the “cracks” between those disciplines where no approaches may yet exist. This flexibility is an essential component of data science and is equally important in data science education.
Data scientists have the potential to help address critical real-world challenges. The following list includes just a few illustrative examples:
- Enabling more accurate diagnosis of melanomas through better analysis of images. Within the clinical field, deep learning techniques1 have been applied to detect melanoma, the most deadly form of skin cancer. These methods improve the analysis of tissue images, enabling a more accurate diagnosis than traditional techniques (Codella et al., 2017).
- Enhancing business decisions. Business analytics can assist entrepreneurs and company executives in making timely decisions based on market trends. This can be coupled with online social media information to respond directly to consumer demands or create a more personalized advertising experience (Chen et al., 2012).
- Helping aid organizations respond quickly. Data science and analytics help aid organizations respond more quickly in times of need. For example, the Swedish Migration Board used data science to make predictions about immigration trends and determine their national implications (Pratt, 2016).
- Developing “smart cities.” Cities around the world such as London, Rio de Janeiro, and New York City collect real-time data from a variety of sources, such as public transportation smart cards and traffic cameras, environmental sensors for parameters such as temperature and humidity, and social media interactions regarding local issues. The data can then be processed, analyzed, and utilized to improve city efficiency and cost-effectiveness as well as resident well-being (Kitchin, 2014).
However, there are also many instances of high-impact and high-profile data science research resulting in flawed or inaccurate findings, as well as ethical and legal quandaries. The following list includes a few illustrative examples:
- Inaccurate predictions of flu trends. In 2013, Google Flu Trends over-predicted true influenza-related doctors’ visits as determined by the Centers for Disease Control and Prevention. This was primarily the result of overreliance on outdated models (Butler, 2013).
- Use of personally identifiable data. The abundance of data available on individuals from companies and social media can present ethical dilemmas to researchers in terms of privacy, scalability of results, and subject participation agreement. For instance, a 2013 study linked numerous Twitter users to sensitive information from their financial institutions, which contributed to discussions of whether and when researchers should be required to obtain written consent when using nominally publicly accessible information (Danyllo et al., 2013).
- Predictive policing. There is much debate over the use and appropriateness of predictive policing, the use of data science by law enforcement to predict crime before it occurs. There is no consensus yet on the effectiveness of this methodology, and civil liberties groups argue that the data used to develop (i.e., train) the models are inherently biased (Hvistendahl, 2016).
1 Deep learning is a powerful class of machine learning methods that explore data representations using supervised, semi-supervised, or unsupervised learning.
Data science is currently being practiced in hundreds of organizations within industry, academia, and government, often by self-taught practitioners. There are indications of strong demand in a variety of domains for graduates with data science skills. A recent study by IBM found more than 2.3 million data science and analytics job listings in 2015, and both job openings and job demand are projected to grow significantly by 2020. Three-fifths of the data science and analytics jobs are in the finance and insurance, professional services, and information technology sectors, but the manufacturing, health care, and retail sectors also are hiring significant numbers of data scientists (Miller and Hughes, 2017; Columbus, 2017). The IBM study also shows that it takes significant time to find and hire staff with the right mix of skills and experience. Since many employers are themselves new to the use of data science, they may not be able to provide training and therefore may seek individuals who have appropriate classwork and hands-on experience.
Current data science courses, programs, and degrees are highly variable in part because emerging educational approaches start from different institutional contexts, aim to reach students in different communities, address different challenges, and achieve different goals. This variation makes it challenging to lay out a single vision for data science education in the future that would apply to all institutions of higher learning, but it also allows data science to be customized and to reach broader populations than other similar fields have done in the past. Data science educational programs are emerging within many existing fields such as statistics, computer science, business, and social sciences. These field-specific approaches bring about unique distinctions in how data science is taught, which skills are emphasized, which students are served, and which career paths graduates pursue. Other data science educational programs are taking a cross-disciplinary approach—for example, integrating statistics and computer science concepts into the undergraduate data science degree program. (Several example programs are discussed in Chapter 3 of this report.)
This report highlights some of the important common threads that can be woven throughout much of data science education. Chapter 2 discusses the foundational, translational, ethical, and professional skills that help students acquire data science skills and knowledge. Chapter 3 explores the role of innovative curriculum development and provides some considerations for institutions. Chapter 4 examines ways to ensure broad participation in data science, including recruitment and retention strategies, institutional partnerships, K-12 objectives, public outreach, and the role of evaluation and assessment. Chapter 5 provides some reflections on the committee’s findings throughout the report and proposes questions for public input. That chapter also lays out a draft Hippocratic Oath for data science, which is also open for public input.
The themes described in this report underlie data science education, but they are not necessarily novel challenges or even unique to data science (as demonstrated by the historical case study in Box 1.1). The lessons learned from other disciplines can help pave the way to ensuring the success of data science education.
This report aims to lay out some key questions to help advance conversations around data science education and to provide institutions and participants with a clearer picture of paths forward.
This National Academies interim report begins to address the statement of task for the committee, presented in Box 1.2.
At the time of this writing, the study committee had held two meetings and one webinar to collect information, engage a diverse community, and deliberate. The open session presentations given during these meetings are listed in Appendix B. Additional information-gathering activities are planned for the remainder of 2017, after which the committee will release a final report.
During the first meeting of the committee on December 12-13, 2016, participants discussed possible future directions based on progress with current data science programs; societal implications of the evolving field of data science; ways to expand diversity and inclusion in data science among students, employees, and even topic areas; and perspectives on envisioning the future of data science specifically for undergraduates.
The committee hosted a public webinar on April 25, 2017, and gathered public input and outside perspectives on the topics the committee should discuss throughout the remainder of the study.
The committee also held a workshop on May 2-3, 2017, where participants discussed (1) educational models to build relevant foundational, translational, and professional skills and knowledge for data scientists in various roles; (2) use of high-impact educational practices in the future delivery of data science education; and (3) strategies for broad participation in data science education that revolve around formal modes of evaluation and assessment. Other topics emerged as well, including the role of teacher education, the need to consult research on learning styles and teaching methods, the relationship between data science and popular culture, better methods for assessment of student and program success, and the ways in which students, institutions, and programs might change over the next 10 years and how these changes may affect plans for the future of data science education.