2
The Need for Training: Experiences and Case Studies
Important Points Made by Individual Speakers
- Students often do not recognize that big data techniques can be used to solve problems that address societal good, such as those in education, health, and public policy; educational programs that foster relationships between data science and social problems have the potential to increase the number and types of students interested in data science. (Rayid Ghani)
- There may be a mismatch between some industry needs and related academic pursuits: current studies of recommendation systems, such as off-line score prediction, do not always correlate well with important industry metrics, such as sales and user engagement. (Guy Lebanon)
- Academia does not have sufficient access to practical data scenarios in industry. (Guy Lebanon)
Big data is becoming pervasive in industry, government, and academe. The disciplines that are affected are as diverse as meteorology, Internet commerce, genomics, complex physics simulations, health informatics, and biologic and environmental research. The second session of the workshop focused on specific examples and case studies of real-world needs in big data. The session was co-chaired by John Lafferty (University of Chicago) and Raghu Ramakrishnan (Microsoft Corporation), the co-chairs of the workshop’s organizing committee. Presentations were made in Session 2 by Rayid Ghani (University of Chicago) and Guy Lebanon (Amazon Corporation).
TRAINING STUDENTS TO DO GOOD WITH BIG DATA
Rayid Ghani, University of Chicago
Rayid Ghani explained that he has founded a summer program at the University of Chicago, known as the Data Science for Social Good Fellowship, to show students that they can apply their talents in data science to societal problems and in so doing affect many lives. He expressed his growing concern that the top technical students are disproportionately attracted to for-profit companies, such as Yahoo and Google, and posited that these students do not recognize that solutions to problems in education, health, and public policy also need data.
Ghani showed a promotional video for the University of Chicago summer program and described its applicant pool. Typically, half the applicants are computer science or machine learning students; one-fourth are students in social science, public policy, or economics; and one-fourth are students in statistics. Some 35 percent of the enrolled students are female (as Ghani pointed out, this is a larger proportion than is typical of a computer science graduate program). Many of the applicants are graduate students, and about 25 percent are undergraduate seniors. The program is competitive: in 2013, there were 550 applicants for 36 spots. Ghani hypothesized that the program would be appropriate for someone who had an affinity for mathematics and science but a core interest in helping others. Once in the program, students are matched with mentors, most of whom are computer scientists or economists with a strong background in industry.
He explained that the program is project-based, using real-world problems from government and nonprofit organizations. Each project includes an initial mapping of a societal problem to a technical problem and communication back to the agency or organization about what was learned. Ghani stressed that students need to have skills in communication and common sense in addition to technical expertise. The curriculum at the University of Chicago is built around tools, methods, and problem-solving skills. The program now consistently uses the Python language, and it also teaches database methods. Ghani emphasized the need to help students learn new tools and techniques. He noted, for instance, that some of the students knew of regression only as a means of evaluating data, whereas other tools may be more suitable for massive data.
Ghani described a sample project from the program. A school district in Arizona was experiencing undermatching: students who have the potential to go to college but do not, or students who have the potential to go to a more competitive college than the one they ultimately select. The school district had collected several years of data. In a summer project, the University of Chicago program students built models to predict who would graduate from college, who would go to college, and who was not likely to apply. In response to the data analysis, the school district has begun a targeted career-counseling program to begin intervention.
THE NEED FOR TRAINING IN BIG DATA: EXPERIENCES AND CASE STUDIES
Guy Lebanon, Amazon Corporation
Guy Lebanon began by stating that extracting meaning from big data requires skills of three kinds: computing and software engineering; machine learning, statistics, and optimization; and product sense and careful experimentation. He stressed that it is difficult to find people who have expertise and skills in all three and that competition for such people is fierce.
Lebanon then provided a case study in recommendation systems. He pointed out that recommendation systems (recommending movies, products, music, advertisements, and friends) are important for industry. He described a well-known method of making recommendations known as matrix completion. In this method, an incomplete user rating matrix is completed to make predictions. The matrix completion method favors low-rank (simple) completions. The best model is found by using a nonlinear optimization procedure in a high-dimensional space. The concept is not complex, but Lebanon indicated that its implementation can be difficult. Implementation requires knowledge of the three kinds referred to earlier. Specifically, Lebanon noted the following challenges:
- Computing and software engineering: language skills (usually C++ or Java), data acquisition, data processing (including parallel and distributed computing), knowledge of software engineering practices (such as version control, code documentation, building tools, unit tests, and integration tests), efficiency, and communication among software services.
- Machine learning: nonlinear optimization and implementation (such as stochastic gradient descent), practical methods (such as momentum and step-size selection), and common machine learning issues (such as overfitting).
- Product sense: an online evaluation process to measure business goals; model training; and decisions regarding history usage, product modification, and product omissions.
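The matrix completion method and the optimization issues listed above can be illustrated with a small sketch. The code below is not Lebanon's implementation; it is a minimal example, assuming a squared-error loss on the observed ratings, a rank-2 factorization, stochastic gradient descent with momentum, and an L2 penalty to limit overfitting. The function name `complete_matrix`, its parameter values, and the toy ratings are all hypothetical choices made for illustration.

```python
import random


def complete_matrix(ratings, rank=2, lr=0.01, momentum=0.5, reg=0.05,
                    epochs=500, seed=0):
    """Fill in a user-by-item rating matrix with a low-rank model.

    `ratings` is a list of rows; None marks a missing rating.
    Returns a dense matrix of predicted ratings (list of lists).
    """
    rng = random.Random(seed)
    n_users, n_items = len(ratings), len(ratings[0])

    # Small random starting factors: one vector per user and per item.
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    vel_U = [[0.0] * rank for _ in range(n_users)]
    vel_V = [[0.0] * rank for _ in range(n_items)]

    observed = [(u, i) for u in range(n_users) for i in range(n_items)
                if ratings[u][i] is not None]

    for _ in range(epochs):
        rng.shuffle(observed)  # stochastic: visit ratings in random order
        for u, i in observed:
            pred = sum(U[u][k] * V[i][k] for k in range(rank))
            err = ratings[u][i] - pred
            for k in range(rank):
                # Gradient of 0.5 * err^2 plus the L2 penalty.
                grad_u = -err * V[i][k] + reg * U[u][k]
                grad_v = -err * U[u][k] + reg * V[i][k]
                # Momentum: blend the previous step with the new gradient.
                vel_U[u][k] = momentum * vel_U[u][k] - lr * grad_u
                vel_V[i][k] = momentum * vel_V[i][k] - lr * grad_v
                U[u][k] += vel_U[u][k]
                V[i][k] += vel_V[i][k]

    return [[sum(U[u][k] * V[i][k] for k in range(rank))
             for i in range(n_items)] for u in range(n_users)]


# Hypothetical 4-user, 4-item example on a 1-5 scale.
toy_ratings = [
    [5.0, 4.0, None, 1.0],
    [4.0, None, 1.0, 1.0],
    [1.0, 1.0, None, 5.0],
    [None, 1.0, 5.0, 4.0],
]
completed = complete_matrix(toy_ratings)
```

Even in this toy form, the step size, momentum, and penalty strength need tuning, and the loop is sequential; a production system would also need the data processing, distributed computing, and online evaluation described in the list above.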
Lebanon described two problems that limit academic research in recommendation systems, both related to overlooking metrics that are important to industry. First, accuracy in academic, off-line score prediction does not always correlate well with important industry metrics, such as sales and increased user engagement. Second, academe does not have sufficient access to practical data scenarios from industry. Lebanon posited that academe cannot drive innovation in recommendation systems; research in recommendation systems does not always translate well to the real world, and prediction accuracy is incorrectly assumed to be equivalent to business goals.
He then described a challenge run by Netflix. In 2006, Netflix launched a competition to develop an improved recommendation system. It provided a data set of ratings that had been anonymized and offered a $1 million prize to the top team. The competition boosted research, leading to an increase in research papers and overall interest. However, a group of researchers at the University of Texas, Austin, successfully deanonymized the Netflix data by joining them with other data. Netflix later withdrew the data set and is now facing a lawsuit. As a result of that experience, industry is increasingly wary about releasing any data for fear of inadvertently exposing private or proprietary data, but this makes it difficult for academe to conduct relevant and timely research.
Lebanon pointed out that the important result in a recommendation system is prediction of a user’s reaction to a specific recommendation. For it to be successful, one needs to know the context in which the user acts (for instance, time and location information), but that context is not conveyed in an anonymized data set. In addition, methods that perform well on training and test data sets do not perform well in real environments when a user makes a single A/B comparison (see footnote 1). Lebanon proposed several new ideas to address those characteristics:
- Study the correlations between existing evaluation methods and increased user engagement in an A/B test.
- Develop new off-line evaluations to account for user context better.
- Develop efficient searches among the possibilities to maximize A/B test performance.
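As a rough illustration of how an A/B test outcome might be analyzed, the sketch below applies a two-sample hypothesis test to engagement counts from two variants of a recommender. It assumes that each user is shown one variant and either engages or does not, and it uses a pooled two-proportion z-test with a normal approximation; the function name and the counts are hypothetical and are not taken from the presentation.

```python
import math


def two_proportion_z_test(engaged_a, shown_a, engaged_b, shown_b):
    """Two-sided z-test for a difference in engagement rates.

    Returns (z, p_value); a positive z means variant B had the higher rate.
    """
    rate_a = engaged_a / shown_a
    rate_b = engaged_b / shown_b
    # Under the null hypothesis both variants share one engagement rate.
    pooled = (engaged_a + engaged_b) / (shown_a + shown_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / shown_a + 1 / shown_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 1,000 users per variant.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
```

A test of this kind measures the business outcome directly, which is why the first idea above asks how well off-line evaluation scores track it.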
Few data sets are publicly available, according to Lebanon. Working with limited data, the research community may focus on minor improvements in incremental steps, not substantial improvements that are related to the additional contextual information that is available to the owners of the data, the companies. He pointed out that real-world information and context, such as user addresses and other profile information, could potentially be incorporated into a traditional recommendation system.
______________
1 In A/B testing, more formally known as two-sample hypothesis testing, two variants are presented to a user, and the user determines a winner.
Lebanon concluded with a brief discussion of implicit ratings. In the real world, one often has implicit, binary-rating data, such as whether a purchase or an impression was made. Evaluating that type of binary-rating data requires a different set of tools and models, and scaling up from standard data sets to industry data sets remains challenging.
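One way to see why binary data call for different tools is to compare the evaluation measure. The sketch below is an assumption-laden illustration, not a method described by Lebanon: it maps a user-item score to a purchase probability with a logistic function and scores predictions with log loss rather than the squared error used for explicit star ratings. The function names and inputs are hypothetical.

```python
import math


def purchase_probability(user_factors, item_factors):
    """Logistic link: turn a factor dot product into a probability."""
    score = sum(u * v for u, v in zip(user_factors, item_factors))
    return 1.0 / (1.0 + math.exp(-score))


def log_loss(labels, probs, eps=1e-12):
    """Average negative log-likelihood of binary labels (1 = purchased)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0 and 1
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


# Hypothetical check: confident, correct predictions score a lower loss
# than uninformative 50/50 predictions.
uninformed = log_loss([1, 0], [0.5, 0.5])
confident = log_loss([1, 0], [0.9, 0.1])
```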