In the past decade, the world has been transformed by the rapidly evolving field of data science. This new science, which is already revolutionizing business, science, and society, builds on an array of technological developments, including the widespread use of smartphones and rapid technological progress in computing and communications. Massive investments have gone into building out wireless infrastructure and data centers (the cloud) and into leveraging such facilities. New methods have been developed to connect and understand the data being generated.¹
¹ Science and engineering have provided many notable examples of digital transformations in the previous decade, foreshadowing the large public transformation now taking place. Examples from the 1995-2005 time frame include virtual observatories (see, e.g., Szalay and Gray, 2001; NSF, 2018) and advanced computational methods (see, e.g., Berman, Fox, and Hey, 2003).
In this new landscape, all individuals constantly generate data about their whereabouts, habits, and preferences. All parts of commerce (browsing, ordering, shipping, inventory, manufacturing, advertising) have gained a digital footprint. Social network sites have illuminated relationships among billions of individuals, and tweets and posts have made global-scale communication patterns instantly visible. Governmental bodies have digitized and given public access to vast corpora of data and documents. Most of recorded history and literature have become digitized and accessible for algorithmic analysis. Electronic health records have allowed medical analyses across populations and time, while genomic sequencing has brought individualized treatment to the cellular level. Design and synthesis of pharmaceuticals, materials, and chemicals have become computational. The volume of data being collected automatically, and the processing of such data, has soared. New data-driven services have arisen (e.g., navigation apps, ride-hailing apps, and voice-driven assistants), exploiting this new data-driven environment and convincing the public of the power and elegance of the data-driven paradigm. Several of the highest market capitalization companies have been heavily involved in digital transformation, displacing oil and car companies that had been market leaders for decades.
These emblematic advances signal more extensive and widespread transformations to come. The smartphone, mobility, genomics, and cloud “revolutions” are in fact only at their inception as technologists find ways to leverage them ever further. The increased use of Internet-connected home thermostats and fitness wristbands has marked the beginning of the Internet-of-Things era, in which people are surrounded by an environment that is instrumented, communicative, and responsive. Meanwhile, rapid advances in machine learning are enabling new applications.
This year’s entering undergraduates, who may be in the workforce until roughly 2075, will face an employment landscape transformed by these developments. The data-driven era will spawn many new occupational niches based on the massive opportunities presented by new kinds and volumes of data even as it supplants traditional occupational categories.
Today, the term “data scientist” typically describes a knowledge worker who uses the complex and massive data resources characteristic of this new era. However, data science is a broader concept involving principles for data collection, storage, integration, analysis, inference, communication, and ethics appropriate for this new data-driven era. Several industries and academic disciplines have perceived that a new field of data science is emerging out of several established fields, including information technology, computer science, statistics, mathematics, operations management, and business analytics. However, core data science concepts involving the aforementioned principles are not being conveyed by mainstream training in any one field because data science is not reducible to any of the preexisting fields. Data scientists of the future will need to be educated in the full scope of data science principles.
There are many reports that industry finds itself constrained by today’s relatively small supply of well-trained data science talent, and hiring demand for data scientists has begun to increase rapidly; some projections forecast that approximately 2.7 million new data science positions will be available by 2020 (Columbus, 2017). Not only is the lack of data science talent an issue, but so too is students’ lack of understanding about what a data scientist is and what types of tasks such an individual might perform.
It is imperative that educators, administrators, and students begin today to consider how to best prepare for and keep pace with this data-driven era of tomorrow. Undergraduate teaching, in particular, offers a critical link in providing more data science exposure to students and expanding the supply of data science talent. Many distinct data science roles will exist in the future workplace; both specialists and broad users with different levels of knowledge and different skill sets will be in high demand.
Understandings of and applications for data science vary among professionals, within academic institutions, and throughout the broader world. One common observation is that data science is now essential in many academic fields (Hey, Tansley, and Tolle, 2009) and can be both pervasive in and yet distinct from other disciplines. For example, data science techniques and tools may be applied commonly across a variety of disciplines, including those in the sciences and in the humanities. However, what gives data science its unique identity is that it draws on individual skills and concepts from a wide spectrum of disciplines that may not always overlap with one another—a truly multidisciplinary field. As discussions continue regarding the distinctions among data science, computer science, statistics, and other fields, many U.S. academic institutions are considering how to best deliver data science education and thus better prepare graduates for the data-driven era that lies ahead of them.
The need for data science instruction is broad and extends to a wide range of students from varied programs. Depending on the students’ levels of interest and career goals, as well as institutional goals and resources, one can envision a variety of models for data science instruction, including discipline-centered data science courses offered by specific academic departments focusing narrowly on the skills needed by that department’s majors, large introductory data science courses serving the campus-wide student body, highly structured course sequences within a formal data science major, online courses, boot camps, and other innovative approaches. To achieve this vision, data science education and practice demand a level of collaboration not necessarily seen in other fields, new approaches to evaluating educational outcomes, and a constant eye toward refining and evolving the undergraduate experience as this field continues to advance. Stand-alone data science departments may emerge naturally on some campuses when the level of collaboration surpasses the bandwidth of currently established departments or when the student demand increases greatly. However, developing stand-alone departments is not the only means of delivering data science education effectively nor may it be appropriate in all settings—equipping students with data science skills can be done through a variety of pathways, as will be discussed in this report.
Imagine it is now 2040. Students born in 2018 are graduating from college. It is more than 30 years since billions of autonomous sensors and devices started continuously delivering data to cloud-based databases, which record the states and activities of vehicles, buildings, customers, patients, and citizens. Many other data-driven changes that were difficult to foresee have become pervasive and important. Thus, it is not farfetched to expect academic institutions to envision the data-driven world of 2040 as they shape the future undergraduate experience.
In the ideal case for the future evolution of data science, all private industries and public agencies would use data confidently and efficiently to operate fairly without gender or racial bias. Data science jobs would be plentiful. While some of these data science jobs would require vocational education, other data science subspecialties would require certificates, associate’s degrees, and bachelor’s degrees. Efforts would have been undertaken to distribute the workforce equitably over rural, urban, and suburban regions; socioeconomic strata; and ethnic identities. The importance of data skills would be appreciated in all high schools, and the vast majority of high school graduates would have a basic understanding of data science. Data science programs would themselves use data science methods to evolve continuously and meet the needs of their students.
Data scientists’ work would be varied, and different skill mixes would be needed for different data science positions. Some of these individuals would have been trained in particular fields but have learned data science along the way. Others would have explicit degrees in data science. For those who need a degree in data science for their work, there would likely be many options. They might earn those degrees remotely, on-site, or in combination. They might learn through a mixture of interactive web applications and augmented reality simulations, interactions with fellow learners and multidisciplinary faculty, and immersive industry apprenticeships. Students in 2- and 4-year institutions would be exposed to important concepts through a range of motivating applications. Humanities, social sciences, and professional education (e.g., music, art, and architecture) would be taught for enrichment, for building cross-disciplinary communication skills, and as contexts in which to provide examples of different types of data. Ethical data concepts such as privacy, justice, fairness, and reproducibility would be taught continuously in safe spaces where students learn from their mistakes without penalty and without harm to others. Faculty would use data science to continuously monitor their students’ progress and to adapt their curriculum to ensure student competency, confidence, and well-being with respect to the needs of industry, government, and society.
The committee’s vision for the world of 2040 has many debatable elements, namely whether the transformations just described will actually go nearly as far as depicted or whether this mostly utopian vision will develop dystopian elements. This much is not debatable: the undergraduate instructional framework will need to transform if it is to support the transition from the world of 2018 to the likely world of 2040. This report outlines some considerations and approaches for academic institutions and others in the broader data science communities to help guide this transformation, but it is not intended to be a final word on undergraduate data science education. This vision will need to evolve and be refined continually as the field matures.
In Chapter 2, the committee considers what data science professionals will need to know. Because expectations and tasks for data scientists will vary across industries and over time, it is important to consider the skill sets, learning outcomes, and ethical considerations best suited for individual undergraduate students to be successful in their future careers. In Chapter 3, the committee lays the groundwork for exploring how these data science students can be educated and thus well prepared. Using data from existing data science education programs, the committee discusses the successes and challenges associated with implementing and delivering 2- and 4-year undergraduate programs and classes, alternative courses, and interdisciplinary approaches in an effort to guide individual institutions to follow the pathways that simultaneously align with their missions and meet the varied needs of the field of data science. In Chapter 4, the committee describes a number of challenges that arise in creating a new data science program. Acknowledging that the field of data science and the content of data science education will continue to change rapidly, the committee considers how to evolve from current to future data science education and practice in Chapter 5. The committee evaluates strategies to refine educational and administrative infrastructure, create professional development opportunities, and draw on professional societies. In Chapter 6, the committee offers a summary of its findings and recommendations that appeared throughout Chapters 2 to 5.
Berman, F., G. Fox, and A.J.G. Hey, eds. 2003. Grid Computing: Making the Global Infrastructure a Reality. West Sussex, UK: Wiley.
Columbus, L. 2017. IBM predicts demand for data scientists will soar 28% by 2020. Forbes, May 13.
Hey, T., S. Tansley, and K. Tolle, eds. 2009. The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, Wash.: Microsoft Research.
NSF (National Science Foundation). 2018. The Observatory [National Ecological Observatory Network]: History. http://www.neonscience.org/observatory/history. Accessed February 6, 2018.
Szalay, A., and J. Gray. 2001. The world-wide telescope. Science 293(5537):2037-2040.