Robert Kass (Carnegie Mellon University) led a final panel discussion session at the end of the workshop. Panelists included James Frew (University of California, Santa Barbara), Deepak Agarwal (LinkedIn Corporation), Claudia Perlich (Dstillery), Raghu Ramakrishnan (Microsoft Corporation), and John Lafferty (University of Chicago). Panelists and participants were invited to add their comments to the workshop; final comments tended to fall into four categories: types of students, organizational structures, course content, and lessons learned from other disciplines.
WHOM TO TEACH: TYPES OF STUDENTS TO TARGET IN TEACHING BIG DATA
Robert Kass opened the discussion session by noting that the workshop had shown that there are many types of potential students and that each type would have different training challenges. One participant suggested that business managers need to understand the potential and realities of big data better to improve the quality of communication. Another pointed out that older students may be attracted to big data instruction to pick up missing skill sets. And another suggested pushing instruction into the high-school level. Several participants posited that the background of the student, more than the age or level, is the critical element. For instance, does the student have a background in computer science or statistics? Workshop participants frequently mentioned three main subjects related to big data: computation, statistics, and visualization. The student’s background knowledge in each of the three will have the greatest effect on the student’s learning.
HOW TO TEACH: THE STRUCTURE OF TEACHING BIG DATA
Numerous participants discussed the types of educational offerings, including massive open online courses (MOOCs), certificate programs, degree-granting programs, boot camps, and individual courses. Participants noted that certificate programs would typically involve a relatively small investment of a student’s time, unlike a degree-granting program. One participant proposed a structure consisting of an introductory data science course and three or four additional courses in the three domains (computation, statistics, and visualization). Someone noted that the University of California, Santa Barbara, has similar “emphasis” programs in information technology and technology management. These are sought after because students wish to demonstrate their breadth of understanding. In the case of data science, however, students may wish to use data science to further their domain science. As a result, the certificate model in data science may not be in high demand, inasmuch as students may see value in learning the skills of data science but not in receiving the official recognition of a certificate.
A participant reiterated Joshua Bloom’s suggestion made during his presentation to separate data literacy from data fluency. Data fluency would require several years of dedicated study in computing, statistics, visualization, and machine learning. A student may find that difficult to accomplish while obtaining a domain-science degree. Data literacy, in contrast, may be beneficial to many science students and less difficult to obtain. A participant proposed an undergraduate-level introductory data science course focused on basic education and appreciation to promote data literacy.
Workshop participants discussed the importance of coordinating the teaching of data science across multiple disciplines in a university. For example, a participant pointed out that Carnegie Mellon University has multiple master’s degree offerings (as many as nine) around the university that are related to data science. Each relevant discipline, such as computer science and statistics, offers a master’s degree. The administrative structure is probably stovepiped, and it may be difficult to develop multidisciplinary projects. Another participant argued that an inherently interdisciplinary field of study is not well suited to a degree crafted within a single department and proposed initiating task forces across departments to develop a degree program jointly. And another proposed examining the Carnegie Mellon University data science master’s degrees for common topics taught; those topics probably are the proper subset of what constitutes data science.
A workshop participant noted that most institutions do not have nine competing master’s programs; instead, most are struggling to develop one. Without collective agreement in the community about the content of a data science program of study, he cautioned that there may be competing programs in each school instead of a single comprehensive program. The participant stressed the need to understand the core requirements of data science and how big data fits into data science.
Someone noted the importance of having building blocks—such as MOOCs, individual courses, and course sequences—to offer students who wish to focus on data science. Another participant pointed out that MOOCs and boot camps are opposites: MOOCs are large and virtual, whereas boot camps are intimate and hands-on. Both have value as nontraditional credentials.
Guy Lebanon stated that industry finds the end result of data science programs to be inconsistent because they are based in different departments that have different emphases. As a result, industry is uncertain about what a graduate might know. It may be useful to develop a consistent set of standards that can be used in many institutions.
Ramakrishnan stated that “off-the-shelf” courses in existing programs cannot be stitched together to make a data science curriculum. He suggested creating a wide array of possible prerequisites; otherwise, students will not be able to complete the course sequences that they need.
WHAT TO TEACH: CONTENT IN TEACHING BIG DATA
The discussion began with a participant noting that it would be impossible to lay out specific topics for agreement. Instead, he proposed focusing on the desired outcomes of training students. Another participant agreed that the fields of study are well known (and typically include databases, statistics and machine learning, and visualization), but said that the specific key components of each field that are needed to form a curriculum are unknown.
Several participants noted the importance of team projects for teaching, especially the creation of teams of students who have different backgrounds (such as a domain scientist and a computer scientist). Team projects foster creativity and encourage new thinking about data problems. Several participants stressed the importance of using real-world data, complete with errors, missing data, and outliers. To some extent, data science is a craft more than a science, so training benefits from the incorporation of real-world projects.
A participant stated that an American Statistical Association committee had been formed to propose a model for a statistical data science program; it would probably include optimization and algorithms, distributed systems, and programming. However, other participants pointed out that the initiative did not include computer science experts in its curriculum development and that their involvement would alter the emphases.
One participant proposed including data security and data ethics in a data science curriculum.
Several participants discussed how teaching data science might differ from teaching big data. One noted that data science does not change its principles when data move into the big data regime, although the approach to each individual step may differ slightly. Temple Lang said that with large data sets, it is easy to get mired in detail, and it becomes even more important to reason through how to solve a problem.
Ramakrishnan recommended including algorithms and analysis in computer science. He noted that although grounding instruction in a specific tool (such as R, SAS, or SQL) teaches practical skills, teaching a tool can compete with teaching of the underlying principles. He endorsed the idea of adding a project element to data science study.
PARALLELS IN OTHER DISCIPLINES
Two examples in other domains that were discussed by participants could provide lessons learned to the data science community.
- Computational science. A participant noted that computational science was an emerging field 25 years ago. Interdisciplinary academic programs seemed to serve the community best, although that model did not fit every university. The participant discussed specifically how the University of Maryland structured its computational-science instruction, which consisted of core coursework and degrees managed through the domain departments. The core courses were co-listed in numerous departments. That model does not require new hiring of faculty or any major restructuring.
- Environmental science. Participants discussed an educational model used in environmental science. An interdisciplinary master’s-level program was developed so that students could obtain a master’s degree in a related science (such as geography, chemistry, or biology). The program involved core courses, research projects, team teaching, and creative use of the academic calendar to provide students with many avenues to an environmental-science degree.