Suggested Citation:"1 Introduction." National Academies of Sciences, Engineering, and Medicine. 2018. Envisioning the Data Science Discipline: The Undergraduate Perspective: Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/24886.

1

Introduction

ENVISIONING DATA SCIENCE FROM AN UNDERGRADUATE PERSPECTIVE

The emergence of a novel science of data highlights the need for new principles for data collection, storage, integration, and analysis. These new scientific principles are leading to new tools that uniquely respond to the challenges of big data. However, the main concepts, skills, and ethics powering this emerging discipline of data science still need to be identified. A new generation of tool developers and tool users will require the ability to make good judgments and decisions with data and use tools responsibly and effectively (referred to as “data acumen” throughout this report). Some of these developers and users will draw from computing, mathematics, and statistics fields, but many will come from other fields and application domains. Educators and administrators are beginning to reimagine data science course content, delivery, and enrollment at the undergraduate level to best prepare students to operate in this new paradigm.

New and greater volumes of information compound long-standing challenges of data analysis and raise new ones. The ability to measure, understand, and react to data can affect scientific discovery, social interaction, political tradition, economic practice, public health, and many other areas. Some data science applications are low risk, such as recommender systems that suggest purchases within an online shopping platform or select advertisements for website visitors. Although provider sales may be affected if undesirable products are recommended and users may be dissatisfied with their purchases, the overall impact of poor recommender systems on individuals and society is low. Of greater impact, census data are used to redraw political boundaries, allocate funds, and inform other critical public policy decisions. While new volumes and types of information can make analyses more accurate than past methods that relied on sparse surveys with lower-than-desired response rates, people can be negatively affected if the interpretation of the data does not account for all relevant factors. A program that a family depends on may not have sufficient funding, or a policy might be enacted that has unintended consequences for large segments of the population if weak data analysis is used. Thus, it is important that data are collected and analyzed appropriately, especially as new demands are placed on data collection and evaluation and as new technologies emerge. It is equally important that there are clear principles guiding the use of data for human good. Further, the complexity of the analyses and the increasing dependency on data across all the fields of human endeavor drive demand for “smarter” tools and best practices for data science that will minimize mistakes in interpretation.

Academic institutions and industry recognize these shifts and are rapidly embracing the idea that there is an emerging discipline of data science that is unique yet builds on knowledge from existing disciplines (NRC, 2014). Traditional statistical methods are well established and clearly understood but often do not scale to handle the vast volumes of data that must be analyzed for today’s data science. Computing is unparalleled in its capacity to handle vast volumes or fast-flowing streams of information, but often without statistical and inferential guarantees, which can result in unreliable results and biased or unfair interpretations of the data (Jordan, 2013). Domain areas (e.g., business, medicine, natural science, social sciences, or engineering) are developing and adapting techniques to solve specific research questions, which can be more effective than using general methods. However, these approaches may suffer from insufficient mathematical or statistical rigor or lack computational scalability.

Although the definition of data science is evolving, it centers around the notion of multidisciplinary and interdisciplinary approaches to extracting knowledge or insights from data for use in a broad range of applications. It is the field of science that relies on processes and systems (mathematical, computational, and social) to derive information or insights from data. It is about synthesizing the most relevant parts of the foundational disciplines to solve particular classes of problems or applications while also creating novel techniques to address the “cracks” between those disciplines where no approaches may yet exist. This flexibility is an essential component of data science and is equally important in data science education.

Data scientists have the potential to help address critical real-world challenges. The following list includes just a few illustrative examples:

Enabling more accurate diagnosis of melanomas through better analysis of images. Within the clinical field, deep learning techniques¹ have been applied to detect melanoma, the most deadly form of skin cancer. These methods improve the analysis of tissue images, enabling a more accurate diagnosis than traditional techniques (Codella et al., 2017).
Enhancing business decisions. Business analytics can assist entrepreneurs and company executives in making timely decisions based on market trends. This can be coupled with online social media information to respond directly to consumer demands or create a more personalized advertising experience (Chen et al., 2012).
Helping aid organizations respond quickly. Data science and analytics help aid organizations respond more quickly in times of need, such as when the Swedish Migration Board used data science to predict immigration trends and determine their national implications (Pratt, 2016).
Developing “smart cities.” Cities around the world such as London, Rio de Janeiro, and New York City collect real-time data from a variety of sources, such as public transportation smart cards and traffic cameras, environmental sensors for parameters such as temperature and humidity, and social media interactions regarding local issues. The data can then be processed, analyzed, and utilized to improve city efficiency and cost-effectiveness as well as resident well-being (Kitchin, 2014).

However, there are also many instances of high-impact and high-profile data science research resulting in flawed or inaccurate findings, as well as ethical and legal quandaries. The following list includes a few illustrative examples:

Inaccurate predictions of flu trends. In 2013, Google Flu Trends over-predicted true influenza-related doctors’ visits as determined by the Centers for Disease Control and Prevention. This was primarily the result of overreliance on outdated models (Butler, 2013).
Use of personally identifiable data. The abundance of data available on individuals from companies and social media can present ethical dilemmas to researchers in terms of privacy, scalability of results, and subject participation agreement. For instance, a 2013 study linked numerous Twitter users to sensitive information from their financial institutions, which contributed to discussions of whether and when researchers should be required to obtain written consent when using nominally publicly accessible information (Danyllo et al., 2013).
Predictive policing. There is much debate over the use and appropriateness of predictive policing, the use of data science by law enforcement to predict crime before it occurs. There is no consensus yet on the effectiveness of this methodology, and civil liberties groups argue that the data used to develop (i.e., train) the models are inherently biased (Hvistendahl, 2016).

___________________

¹ Deep learning is a powerful class of machine learning methods that explore data representations using supervised, semi-supervised, or unsupervised learning.

Data science is currently being practiced in hundreds of organizations within industry, academia, and government, often by self-taught practitioners. There are indications of strong demand in a variety of domains for graduates with data science skills. A recent study by IBM found more than 2.3 million data science and analytics job listings in 2015, and both job openings and job demand are projected to grow significantly by 2020. Three-fifths of the data science and analytics jobs are in the finance and insurance, professional services, and information technology sectors, but the manufacturing, health care, and retail sectors also are hiring significant numbers of data scientists (Miller and Hughes, 2017; Columbus, 2017). The IBM study also shows that it takes significant time to find and hire staff with the right mix of skills and experience. Since many employers are themselves new to the use of data science, they may not be able to provide training and therefore may seek individuals who have appropriate classwork and hands-on experience.

Current data science courses, programs, and degrees are highly variable in part because emerging educational approaches start from different institutional contexts, aim to reach students in different communities, address different challenges, and achieve different goals. This variation makes it challenging to lay out a single vision for data science education in the future that would apply to all institutions of higher learning, but it also allows data science to be customized and to reach broader populations than other similar fields have done in the past. Data science educational programs are emerging within many existing fields such as statistics, computer science, business, and social sciences. These field-specific approaches bring about unique distinctions in how data science is taught, which skills are emphasized, which students are served, and which career paths graduates pursue. Other data science educational programs are taking a cross-disciplinary approach—for example, integrating statistics and computer science concepts into the undergraduate data science degree program. (Several example programs are discussed in Chapter 3 of this report.)

This report highlights some of the important common threads that can be woven throughout much of data science education. Chapter 2 discusses the foundational, translational, ethical, and professional skills that help students acquire data science skills and knowledge. Chapter 3 explores the role of innovative curriculum development and provides some considerations for institutions. Chapter 4 examines ways to ensure broad participation in data science, including recruitment and retention strategies, institutional partnerships, K-12 objectives, public outreach, and the role of evaluation and assessment. Chapter 5 provides some reflections on the committee’s findings throughout the report and proposes questions for public input. That chapter also lays out a draft Hippocratic Oath for data science, which is likewise open for public input.

The themes described in this report underlie data science education, but they are not necessarily novel challenges or even unique to data science (as demonstrated by the historical case study in Box 1.1). The lessons learned from other disciplines can help pave the way to ensuring the success of data science education.

This report aims to lay out some key questions to help advance conversations around data science education and to provide institutions and participants with a clearer picture of paths forward.

STUDY ORIGIN AND APPROACH

This National Academies’ interim report begins to address the statement of task for the committee, presented in Box 1.2.

BOX 1.1
Data Science Education: An Expedition

Envisioning a vibrant and robust approach to undergraduate data science education is an important mission that may require considerable deliberation and effort to achieve. Such missions of exploration feature challenges and opportunities.

A useful historical analogy can be made with the Lewis and Clark expedition (1804–1806), which laid the foundation for great discoveries about a newly expanded nation through systematic data collection and analysis. In his instructions to Meriwether Lewis, U.S. President Thomas Jefferson described the data-gathering equipment made accessible to the party: “Instruments for ascertaining by celestial observations the geography of the country thro’ which you will pass, have been already provided.”

Data collection and aspects of what we might think of as reproducible data were important to Jefferson:¹

Your observations are to be taken with great pains & accuracy to be entered distinctly, & intelligibly for others as well as yourself, to comprehend all the elements necessary, with the aid of the usual tables to fix the latitude & longitude of the places at which they were taken, & are to be rendered to the war office, for the purpose of having the calculations made concurrently by proper persons within the U.S. Several copies of these as well as of your other notes, should be made at leisure times, & put into the care of the most trustworthy of your attendants, to guard by multiplying them against the accidental losses to which they will be exposed. A further guard would be that one of these copies be written on the paper of the birch, as less liable to injury from damp than common paper.

Non-technical aspects (political considerations) were also acknowledged in the instructions:²

Your mission has been communicated to the Ministers here from France, Spain, & Great Britain, and through them to their governments: and such assurances given them as to its objects as we trust will satisfy them. The country of Louisiana having been ceded by Spain to France, the passport you have from the Minister of France, the representative of the present sovereign of the country, will be a protection with all its subjects: and that from the Minister of England will entitle you to the friendly aid of any traders of that allegiance with whom you may happen to meet.

The Lewis and Clark expedition made important contributions to science (most notably geography, ecology, and biology) while spurring growth and migration into the American frontier. Many discoveries were serendipitous, with unanticipated positive and negative outcomes. A similar ambitious exploration of how to formulate effective undergraduate data science education will not be easy but has the potential to have a significant impact on higher education and society.

__________________

¹ T. Jefferson, 1803, “Jefferson’s Instructions for Meriwether Lewis,” in the “Thomas Jefferson Papers,” Library of Congress, https://www.loc.gov/exhibits/lewisandclark/transcript57.html.

² Ibid.

COMMITTEE ACTIVITIES TO DATE

At the time of this writing, the study committee had held two meetings and one webinar to collect information, engage a diverse community, and deliberate. The open session presentations given during these meetings are listed in Appendix B. Additional information-gathering activities are planned for the remainder of 2017, after which the committee will release a final report.

During the first meeting of the committee on December 12-13, 2016, participants discussed possible future directions based on progress with current data science programs; societal implications of the evolving field of data science; ways to expand diversity and inclusion in data science among students, employees, and even topic areas; and perspectives on envisioning the future of data science specifically for undergraduates.

BOX 1.2
Statement of Task

A National Academies of Sciences, Engineering, and Medicine study will set forth a vision for the emerging discipline of data science at the undergraduate level. It will emphasize core underlying principles, intellectual content, and pedagogical issues specific to data science, including core concepts that distinguish it from neighboring disciplines. It will not consider the practicalities of creating materials, courses, or programs. It will develop this vision considering applications of and careers in data science. It will focus on the undergraduate level, addressing related issues at the middle and high school as well as community colleges as appropriate, and will draw on experiences in creating master’s-level programs. It will also consider opportunities created by the emergence of a new STEM [science, technology, engineering, and mathematics] field to engage underrepresented student populations and consider ways to reduce the “leakage” seen in existing STEM pathways. Information gathering will center around two workshops, the first likely focused on principles and intellectual content, and the second likely focused on pedagogy and implications for middle and high schools and community colleges. To get material on the record quickly and spark community feedback, an interim report will be issued following the first workshop. The interim report will not include recommendations, but may include findings or conclusions if the evidence warrants. A final report will be issued following both workshops and committee deliberations setting forth a vision for undergraduate education in data science.

The committee hosted a public webinar on April 25, 2017, and gathered public input and outside perspectives on the topics the committee should discuss throughout the remainder of the study.

The committee also held a workshop on May 2-3, 2017, where participants discussed (1) educational models to build relevant foundational, translational, and professional skills and knowledge for data scientists in various roles; (2) use of high-impact educational practices in the future delivery of data science education; and (3) strategies for broad participation in data science education that revolve around formal modes of evaluation and assessment. Other topics emerged as well, including the role of teacher education, the need to consult research on learning styles and teaching methods, the relationship between data science and popular culture, better methods for assessment of student and program success, and the ways in which students, institutions, and programs might change over the next 10 years and how these changes may affect plans for the future of data science education.
