Data sets—whether in science and engineering, economics, health care, public policy, or business—have been growing rapidly; the recent National Research Council (NRC) report Frontiers in Massive Data Analysis documented the rise of “big data,” as systems routinely return terabytes, petabytes, or more of information (National Research Council, 2013). Big data has become pervasive because of the availability of high-throughput data collection technologies, such as information-sensing mobile devices, remote sensing, radio-frequency identification readers, Internet log records, and wireless sensor networks. Science, engineering, and business have rapidly transitioned from the longstanding state of striving to develop information from scant data to a situation in which the challenge is that the amount of information exceeds a human’s ability to examine, let alone absorb, it. Web companies—such as Yahoo, Google, and Amazon—commonly work with data sets that consist of billions of items, and those data sets are likely to grow by an order of magnitude or more as the Internet of Things1 matures. In other words, the size and scale of data, which can be overwhelming today, are only increasing. In addition, data sets are increasingly complex, which compounds problems such as missing information and other data-quality concerns, data heterogeneity, and differing data formats.
1 The Internet of Things is the network of uniquely identifiable physical objects embedded throughout a network structure, such as home appliances that can communicate for purposes of adjusting their settings, ordering replacement parts, and so on.
Advances in technology have made it easier to assemble and access large amounts of data. A key challenge now is to develop the experts needed to draw reliable inferences from all that information. The nation’s ability to make use of the data depends heavily on the availability of a workforce that is properly trained and ready to tackle these high-need areas. A report from McKinsey & Company (Manyika et al., 2011) predicted shortfalls of 150,000 data analysts and 1.5 million managers who are knowledgeable about data and their relevance. It is becoming increasingly important to expand the pool of qualified scientists and engineers who can extract value from big data. Training students to be capable of exploiting big data requires experience with statistical analysis, machine learning, and the computational infrastructure that permits the real problems associated with massive data to be revealed and, ultimately, addressed. Repositories of both data and software, along with computational infrastructure, will be necessary to train the next generation of data scientists. Analysis of big data requires cross-disciplinary skills, including the ability to make modeling decisions while balancing trade-offs between optimization and approximation, all while being attentive to useful metrics and system robustness. To develop those skills in students, it is important to identify whom to teach, that is, the educational background, experience, and characteristics of a prospective data science student; what to teach, that is, the technical and practical content that should be taught; and how to teach, that is, the structure and organization of a data science program.
The topic of training students in big data is timely, as universities are already experimenting with courses and programs tailored to the needs of students who will work with big data. Eight university programs have been or will be launched in 2014 alone. The workshop that is the subject of this report was designed to enable participants to learn and benefit from emerging insights while innovation in education is still ongoing.
On April 11-12, 2014, the standing Committee on Applied and Theoretical Statistics (CATS) convened a workshop to discuss how best to train students to use big data. CATS is organized under the auspices of the NRC Board on Mathematical Sciences and Their Applications.
To conduct the workshop, a planning committee was first established to refine the topics, identify speakers, and plan the agenda. The workshop was held at the Keck Center of the National Academies in Washington, D.C., and was sponsored by the National Science Foundation (NSF). About 70 persons—including speakers, members of the parent committee and board, invited guests, and members of the public—participated in the 2-day workshop. The workshop was also webcast live, and at least 175 persons participated remotely.
A complete statement of task is shown in Box 1.1. The workshop explored the following topics:
- The need for training in big data.
- Curricula and coursework, including suggestions at different instructional levels and suggestions for a core curriculum.
- Examples of successful courses and curricula.
- Identification of the principles that should be delivered, including sharing of resources.
Although the title of the workshop was “Training Students to Extract Value from Big Data,” the term big data is not precisely defined. CATS, which initiated the workshop, has tended in the past to use the term massive data, which implies data on a scale for which standard tools are not adequate. The terms data analytics and data science are also becoming common. They seem to be broader, with a focus on using data—maybe of unprecedented scale, but maybe not—in new ways to inform decision making. This workshop was not developed to explore any particular one of these definitions or to develop definitions. But one impetus for the workshop was the current fragmented view of what is meant by analysis of big data, data analytics, or data science. New graduate programs are introduced regularly, and they have their own notions of what is meant by those terms and, most important, of what students need to know to be proficient in data-intensive work. What are the core subjects in data science? By illustration, this workshop began to answer that question. It is clear that training in big data, data science, or data analytics requires a multidisciplinary foundation that includes at least computer science, machine learning, statistics, and mathematics, and that curricula should be developed with the active participation of at least these disciplines. The chapters of this summary provide a variety of perspectives about those elements and about their integration into courses and curricula.
BOX 1.1
Statement of Task
An ad hoc committee will plan and conduct a public workshop on the subject of training undergraduate and graduate students to extract value from big data. The committee will develop the agenda, select and invite speakers and discussants, and moderate the discussions. The presentations and discussions at the workshop will be designed to enable participants to share experience and perspectives on the following topics:
- What current knowledge and skills are needed by big data users in industry, government, and academia?
- What will students need to know to be successful using big data in the future (5-10 years out)?
- How could curriculum and training evolve to better prepare students for big data at the undergraduate and graduate levels?
- What options exist for providing the necessary interdisciplinary training within typical academic structures?
- What computational and data resources do colleges and universities need in order to provide useful training? What are some options for assembling that infrastructure?
Although the workshop summarized in this report aimed to span the major topics that students need to learn if they are to work successfully with big data, not everything could be covered. For example, tools that might supplant MapReduce, such as Spark, are likely to be important, as are advances in deep learning. Means by which humans can interact with and absorb huge amounts of information—such as visualization tools, iterative analysis, and human-in-the-loop systems—are critical. And such basic skills as data wrangling, cleaning, and integration will continue to be necessary for anyone working in data science. Educators who design courses and curricula must consider a wide array of skill requirements.
The present report has been prepared by the workshop rapporteur as a factual summary of what occurred at the workshop. The planning committee’s role was limited to planning and convening the workshop. The views contained in the report are those of individual workshop participants and do not necessarily represent the views of all workshop participants, the planning committee, or the NRC.
Suzanne Iacono, National Science Foundation
Suzanne Iacono, of NSF, set the stage for the workshop by speaking about national efforts in big data, current challenges, and NSF’s motivations for sponsoring the workshop. She explained that the workshop was an outgrowth of the national big data research and development (R&D) initiative. The federal government is interested in big data for three reasons:
- To stimulate commerce and the economy.
- To accelerate the pace of discovery and enable new activities.
- To address pressing national challenges in education, health care, and public safety.
Big data is of interest to the government now because of the confluence of technical, economic, and policy interests, according to Iacono. Advances in technology have led to a reduction in storage costs, so it is easier to retain data today. On the policy side, data are now considered to be assets, and government is pushing agencies to open data sets to the public. In other words, there has been a democratization of data use and tools.
Iacono described a recent book (Mayer-Schönberger and Cukier, 2012) that outlined three basic shifts in today’s data:
- There are more data than ever before.
- Data are messy, and there must be an increased acceptance of imperfection.
- Correlations can help in making decisions.
She then described the national Big Data Research and Development Initiative in more detail. A 2010 report from the President’s Council of Advisors on Science and Technology argued that the federal government was not investing sufficiently in big data research and development and that investment in this field would produce large returns. A working group in big data, under the interagency Networking and Information Technology Research and Development (NITRD) program and managed by the Office of Science and Technology Policy, was charged with establishing a framework for agency activity. The result was that in 2012, $200 million was allocated for big data R&D throughout the NITRD agencies, including the Defense Advanced Research Projects Agency (DARPA), the Department of Energy (DOE) Office of Science, the National Institutes of Health (NIH), and NSF. Iacono showed the framework for moving forward with big data R&D, which included the following elements:
- Foundational research. Iacono stressed that this research is critical because data are increasing and becoming more heterogeneous.
- Cyberinfrastructure. New and adequate infrastructure is needed to manage and curate data and serve them to the larger research community.
- New approaches to workforce and education.
- New collaborations and outreach.
Iacono noted that policy envelops all four elements of the framework.
A 2013 White House memorandum directed executive branch agencies to develop plans to increase public access to the results of federally funded research, including access to publications and data, and plans are under way at the agency level to address this memorandum. Iacono noted that increased access to publications is not difficult, because existing publication-access methods in professional societies and some government agencies can be used as models. She also pointed out that NIH’s PubMed program may be a useful model in that it shares research papers. However, she noted that access to data will be much more difficult than access to publications because each discipline and community will have its own implementation plan and will treat data privacy, storage duration, and access differently.
Iacono described foundational R&D in more detail. She explained that NSF and NIH awarded 45 projects in big data in 2012 and 2013. About half were related to data collection and management and one-fourth to health and bioinformatics. The remaining awards were spread among social networks, physical sciences and engineering, algorithms, and cyberinfrastructure. Seventeen agencies are involved in the Big Data Senior Steering Group, and each is implementing programs of its own related to big data. For example, DARPA has implemented three new programs—Big Mechanism, Memex, and Big Data Capstone; the National Institute of Standards and Technology maintains a Big Data Working Group; DOE has an Extreme Scale Science initiative; and NSF and NIH each have a broad portfolio related to big data. Iacono stressed that big data is a national issue and that there is substantial interest now in industry and academe, so she believes that government should consider multistakeholder partnerships.
Iacono discussed three challenges related to big data:
- Technology. She emphasized that technology alone cannot solve big data problems, and she cited several recent popular books that discuss the folly of technological solutionism (Mayer-Schönberger and Cukier, 2012; Mele, 2013; Reese, 2013; Schmidt and Cohen, 2013; Surdak, 2014; Webb, 2013).
- Privacy. Iacono pointed out that many of our behaviors—including shopping, searching, and social interactions—are now tracked, and she noted that a White House 90-day review to examine the privacy implications of big data was under way.4 In general, Iacono noted the importance of regulating data use, as opposed to data collection; balancing interests; and promoting data sharing.
- Education and workforce. As noted above, the 2011 McKinsey & Company report predicted large shortfalls of big data experts. Iacono noted that the Harvard Business Review labeled data science as “the sexiest job of the 21st century” (Davenport and Patil, 2012) and that the New York Times has recently hired a chief data scientist. The bottom line, Iacono explained, is that the talent pool in data science must be expanded to meet current and future needs.
4 That study has since been completed and can be found at Executive Office of the President, Big Data: Seizing Opportunities, Preserving Values, May 2014, http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf.
Iacono pointed out that there are traditional ways to educate students through school curricula but that there are also other ways to learn. Such companies as DataKind and Pivotal are matching data scientists with data problems in the nonprofit community. Universities, such as the University of Chicago, as discussed by Rayid Ghani (see Chapter 2), are also working to connect data scientists to problems of social good. Iacono concluded by stressing the many opportunities and challenges in big data that lie ahead.
The remaining chapters of this report summarize the workshop presentations and discussions. To assist the reader, each chapter begins with a short list of important statements made by speakers during the workshop session. Chapter 2 outlines the need for training. Chapter 3 discusses some of the principles of working with big data. Chapter 4 focuses on courses and curricula needed to support the use of big data. Chapter 5 discusses shared resources, and Chapter 6 summarizes the group discussion of lessons learned from the workshop. Finally, Appendix A lists the workshop participants, Appendix B shows the workshop agenda, and Appendix C defines acronyms used in this report.