Important Points Made by Individual Speakers
- Synthetic knowledge bases for domain sciences, such as the PaleoDeepDive system at Stanford University, can be developed by automatically extracting data from scientific journals. (Christopher Ré)
- Divide and recombine methods are powerful tools for analysts who conduct deep examinations of data, and such systems can be used by analysts without the need for complex programming or a profound understanding of the tools used. (Bill Cleveland)
- Yahoo Webscope is a reference library of large, scientifically useful, publicly available data sets for researchers to use. (Ron Brachman)
- Amazon Web Services (AWS) hosts large data sets in a variety of models (public, requester-pays, and private or community) to foster large-scale data sharing. AWS also provides data computation tools, training programs, and grants. (Mark Ryland)
The fifth session of the workshop was chaired by Deepak Agarwal (LinkedIn Corporation). The session had four speakers: Christopher Ré (Stanford University), Bill Cleveland (Purdue University), Ron Brachman (Yahoo Labs), and Mark Ryland (Amazon Corporation).
Christopher Ré, Stanford University
Christopher Ré focused on a single topic related to data science: knowledge bases. He first discussed Stanford University’s experience with knowledge bases. Ré explained that in general scientific discoveries are published and lead to the spread of ideas. With the advent of electronic books, the scientific idea base is more accessible than ever before. However, he cautioned, people are still limited by their eyes and brain power. In other words, the entire science knowledge base is accessible but not necessarily readable.
Ré noted that today’s science problems require macroscopic knowledge and large amounts of data. Examples include health, particularly population health; financial markets; climate; and biodiversity. Ré used the latter as a specific example: broadly speaking, biodiversity research involves assembling information about Earth in various disciplines to make estimates of species extinction. He explained that this is “manually constructed” data—a researcher must input the data by examining and collating information from individual studies. Manually constructed databases are time-consuming to produce; with today’s data sources, the construction exceeds the time frame of the typical research grant. Ré posited that the use of sample-based data and their synthesis constitute the only way to address many important questions in some fields. A system that synthesizes sample-based data could “read” journal articles and automatically extract the relevant data from them. He stated that “reading” machines may be coming in the popular domain (from companies such as IBM, Google, Microsoft (Bing), and Amazon). The concept of these machines could be extended to work in a specific scientific domain. That would require higher-quality reading than a popular-domain application needs, because mistakes are more harmful in a scientific database.
Ré described a system that he has developed, PaleoDeepDive, a collaborative effort with geoscientist Shanan Peters of the University of Wisconsin–Madison. The goal of PaleoDeepDive is to build a higher-coverage fossil record by extracting paleobiologic facts from research papers. The system treats every character, word, or fragment of text from a research paper as a variable and then conducts statistical inference on billions of variables defined from the research papers to develop relationships between biologic and geologic research. PaleoDeepDive has been in operation for about 6 months, and preliminary results of occurrence relations extracted by PaleoDeepDive show a precision of around 93 percent; Ré indicated that this is a very high-quality score.
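The extraction-and-evaluation idea behind these figures can be illustrated with a toy sketch. This is not the actual PaleoDeepDive pipeline, which performs large-scale statistical inference; all taxa, formations, sentences, and labels below are invented for illustration. Candidate (taxon, formation) pairs are pulled from sentences, and precision is estimated against a small hand-labeled sample.

```python
# Toy illustration (not the real PaleoDeepDive system): extract candidate
# (taxon, formation) pairs by co-occurrence, then estimate precision
# against hand labels, as done when scoring occurrence relations.

TAXA = {"Tyrannosaurus", "Triceratops"}                    # hypothetical gazetteer
FORMATIONS = {"Hell Creek Formation", "Morrison Formation"}

def extract_pairs(sentence):
    """Emit every (taxon, formation) pair that co-occurs in a sentence."""
    taxa = [t for t in TAXA if t in sentence]
    formations = [f for f in FORMATIONS if f in sentence]
    return [(t, f) for t in taxa for f in formations]

docs = [
    "Tyrannosaurus remains occur in the Hell Creek Formation.",
    "Triceratops is also known from the Hell Creek Formation.",
    "The Morrison Formation is older; Tyrannosaurus is absent there.",
]
extracted = [p for d in docs for p in extract_pairs(d)]

# Hand-labeled truth for the sampled extractions (invented labels).
truth = {
    ("Tyrannosaurus", "Hell Creek Formation"): True,
    ("Triceratops", "Hell Creek Formation"): True,
    ("Tyrannosaurus", "Morrison Formation"): False,  # negated in the text
}
tp = sum(truth[p] for p in extracted)
precision = tp / len(extracted)
print(f"precision = {precision:.2f}")  # 2 of the 3 candidates are correct
```

Naive co-occurrence misreads negation (the third sentence), which is why a real system needs statistical inference over many features rather than pattern matching alone.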
Ré then stated the challenges for domain scientists related to synthetic knowledge bases:
- Students are not trained to ask questions of synthetic data sets. Ré noted that this may be changing; the University of Chicago, for instance, includes such training in its core curriculum. Stanford has an Earth-science class on how to use PaleoDeepDive.
- Students lack skills in computer science and data management. Ré indicated that this also may be changing; 90 percent of Stanford students now take at least one computer science class.
- Some people are skeptical of machine-generated synthetic data. Ré suggested that this is also changing as statistical methods take stronger hold.
Ré noted challenges for computer scientists related to synthetic knowledge bases:
- Finding the right level of abstraction. Ré posited that approaches to many interesting questions would benefit from the use of synthetic knowledge bases. However, PaleoDeepDive is not necessarily scalable or applicable to other disciplines.
- Identifying features of interest. Computer scientists, Ré noted, focus on algorithms rather than on features. However, synthetic knowledge bases are feature based and require a priori knowledge of what is sought from the data set.
A participant noted that noise, including misspelled words and words that have multiple meanings, is a standard problem for optical character recognition (OCR) systems. Ré acknowledged that OCR can be challenging and even state-of-the-art OCR systems make many errors. PaleoDeepDive uses statistical inference and has a package to improve OCR by federating open-source material together and using probabilistic inputs. Ré indicated that Stanford would be releasing tools to assist with OCR.
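The federation of OCR outputs that Ré mentioned can be illustrated with a minimal sketch. This is not Stanford's actual tool; the engine readings and confidence scores are invented. Several engines' readings of the same word are combined by a confidence-weighted, character-level vote, so the probabilistic inputs let the engines outvote one another's errors.

```python
from collections import Counter

# Minimal sketch of federating OCR engines (illustrative only): vote
# character by character, weighting each engine by a confidence score.

def federate_ocr(readings):
    """readings: list of (text, confidence) pairs of equal length."""
    length = len(readings[0][0])
    result = []
    for i in range(length):
        votes = Counter()
        for text, conf in readings:
            votes[text[i]] += conf          # probabilistic input as weight
        result.append(votes.most_common(1)[0][0])
    return "".join(result)

readings = [
    ("Tr1lobite", 0.70),   # engine A misreads 'i' as '1'
    ("Trilobite", 0.80),   # engine B reads correctly
    ("TriIobite", 0.60),   # engine C confuses 'l' with 'I'
]
print(federate_ocr(readings))  # prints "Trilobite"
```

Real federation must also handle engines that disagree on segmentation (different string lengths), which requires alignment before voting; that step is omitted here.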
Bill Cleveland, Purdue University
Bill Cleveland explained the goals of divide and recombine for big data:
- Methods and environments that do not require reductions in dimensionality should be used for analyzing data at the finest level of granularity possible. The data analysis could include visualization.
- At the front end, analysts can use a language for data analysis to tailor the data and make the system efficient.
- At the back end, a distributed database is accessible and usable without the need for the analyst to engage in the details of computation.
- Within the computing environment, there is access to the many methods of machine learning and visualization.
- Software packages enable communication between the front and back ends.
- The system can be used continuously to analyze large, complex data sets, generate new ideas, and serve as a test bed.
Cleveland then described the divide and recombine method. He explained that a division method is used first to divide the data into subsets. The subsets are then treated with one of two categories of analytic methods:
- Number-category methods. Analytic methods are applied to each of the subsets with no communication among the computations. The output from this method is numeric or categorical.
- Visualization. The data are organized into images, and the output is a set of plots. It is not feasible to examine all the plots, so the images are sampled. Sampling can be done rigorously; sampling plans can be developed by computing variables with one value per subset.
Cleveland described several specific methods of division. In the first, conditioning-variable division, the researcher divides the data on the basis of subject matter regardless of the size of the subsets. That is a pragmatic approach that has been widely used in statistics, machine learning, and visualization. In a second type of division, replicate division, observations are exchangeable, and no conditioning variables are used. The division is done statistically rather than by subject matter. Cleveland stated that the statistical division and recombination methods have an immense effect on the accuracy of the divide and recombine result. The statistical accuracy is typically less than that with other direct methods. However, Cleveland noted that this is a small price to pay for the simplicity in computation; the statistical computation touches subsets no more than once. Cleveland clarified that the process is not MapReduce; statistical methods in divide and recombine reveal the best way to separate the data into subsets and put them back together.
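The division and recombination mechanics described above can be sketched minimally. This is illustrative only; Cleveland's tooling uses R and Hadoop, whereas this toy uses plain Python, and the data are simulated. Exchangeable observations are split by replicate division, a least-squares line is fitted to each subset independently (no communication among fits), and the per-subset coefficients are averaged in the recombination step.

```python
import random

# Toy divide and recombine with replicate division: per-subset fits could
# run in parallel, and each subset is touched exactly once.

random.seed(0)
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.1))
        for x in [random.uniform(0, 10) for _ in range(1000)]]

def fit_line(subset):
    """Ordinary least squares for y = a*x + b on one subset."""
    n = len(subset)
    mx = sum(x for x, _ in subset) / n
    my = sum(y for _, y in subset) / n
    sxx = sum((x - mx) ** 2 for x, _ in subset)
    sxy = sum((x - mx) * (y - my) for x, y in subset)
    a = sxy / sxx
    return a, my - a * mx

# Divide: replicate division into 10 subsets of exchangeable observations.
subsets = [data[i::10] for i in range(10)]

# Apply the analytic method independently, then recombine by averaging.
fits = [fit_line(s) for s in subsets]
a = sum(f[0] for f in fits) / len(fits)
b = sum(f[1] for f in fits) / len(fits)
print(f"recombined fit: y = {a:.2f}x + {b:.2f}")  # close to y = 2x + 1
```

The averaged coefficients are slightly less accurate than a single fit to all the data, which is the small statistical price Cleveland describes paying for embarrassingly parallel computation.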
Cleveland explained that the divide and recombine method uses R for the front end, which makes programming efficient. R saves the analyst time, although it is slower than other options. It has a large support and user community, and statistical packages are readily available. On the back end, Hadoop is used to enable parallel computing. The analyst specifies, in R, the code to do the division computation with a specified structure. Analytic methods are applied to each subset or each sample. The recombination method is also specified by the analyst. Cleveland explained that Hadoop schedules the microprocessors effectively. Computation is done by the mappers, each with an assigned core for each subset. The same is true for the reducers; reducers carry out the recombination. The scheduling possibilities are complex, Cleveland said. He also noted that this technique is quite different from the high-performance computing systems that are prevalent today. In a high-performance computing application, time is reserved for batch processing; this works well for simulations (in which the sequence of steps is known ahead of time and is independent of the data), but it is not well suited to sustained analyses of big data (in which the process is iterative and adaptive and depends on the data).
Cleveland described three divide and recombine software components between the front and back ends. They enable communication between R and Hadoop to make the programming easy and insulate the analyst from the details of Hadoop. They are all open-source.
- R and Hadoop Integrated Programming Environment (RHIPE). This is an R package available on GitHub. Cleveland noted that RHIPE can be difficult to install on some operating systems.
- Datadr. Datadr is a simple interface for division, recombination, and other data operations, and it comes with a generic MapReduce interface.
- Trelliscope. This is a trellis display visualization framework that manages layout and specifications; it extends the trellis display to large, complex data.
Cleveland explained that divide and recombine methods are best suited to analysts who are conducting deep data examinations. Because R is the front end, R users are the primary audience. Cleveland emphasized that the complexity of the data set is more critical to the computations than the overall size; however, size and complexity are often correlated.
In response to a question from the audience, Cleveland stated that training students in these methods, even students who are not very familiar with computer science and statistics, is not difficult. He said that the programming is not complex; however, analyzing the data can be complex, and that tends to be the biggest challenge.
Ron Brachman, Yahoo Labs
Ron Brachman prefaced his presentation by reminding the audience of a 2006 incident in which AOL released a large data set with 20 million search queries for public access and research. Unfortunately, personally identifiable information was present in many of the searches, and this allowed the identification of individuals and their Web activity. Brachman said that in at least one case, an outside party was able to identify a specific individual by cross-referencing the search records with externally available data. AOL withdrew the data set, but the incident sent shock waves through the Internet industry. Yahoo was interested in creating data sets for academics around the time of the AOL incident, and the AOL experience caused a slow start for Yahoo. Yahoo persisted, however, working on important measures to ensure privacy, and has developed the Webscope data-sharing program. Webscope is a reference library of interesting and scientifically useful data sets. It requires a license agreement to use the data; the agreement is not burdensome, but it includes terms whereby the data user agrees not to attempt to reverse-engineer the data to identify individuals.
Brachman said that Yahoo has just released its 50th Webscope data set. Data from Webscope have been downloaded more than 6,000 times. Webscope has a variety of data categories available, including the following:
- Language and content. These can be used to research information-retrieval and natural-language processing algorithms and include information from Yahoo Answers. (This category makes up 42 percent of the data in Webscope.)
- Graph and social data. These can be used to research matrix, graph, clustering, and machine learning algorithms and include information from Yahoo Instant Messenger (16 percent).
- Ratings, recommendation, and classification data. These can be used to research collaborative filtering, recommender systems, and machine learning algorithms and include information on music, movies, shopping, and Yelp (20 percent).
- Advertising data. These can be used to research behavior and incentives in auctions and markets (6 percent).
- Competition data (6 percent).
- Computational-system data. These can be used to analyze the behavior and performance of different types of computer-system architectures, such as distributed systems and networks, and include data from the Yahoo Sherpa database system (6 percent).
- Image data. These can be used to analyze images and annotations and are useful for image-processing research (less than 4 percent).
Brachman explained that in many cases there is a simple click-through agreement for accessing the data, and they can be downloaded over the Internet. However, downloads are becoming impractical as database size increases. Yahoo had been asking for hard drives to be sent through the mail; now, however, it is hosting some of its databases on AWS.
In response to questions, Brachman explained that each data set is accompanied by a file explaining the content and its format. He also indicated that the data provided by Webscope are often older (around a year or two old), and this is one of the reasons that Yahoo is comfortable with its use for academic research purposes.
Brachman was asked whether any models fit between the two extremes of Webscope (with contracts and nondisclosure agreements) and open-source. He said that the two extremes are both successful models and that the middle ground between them should be explored. One option is to use a trusted third party to hold the data, as is the case with the University of Pennsylvania’s Linguistic Data Consortium data.
Mark Ryland, Amazon Corporation
Mark Ryland explained that resource sharing can mean two things: technology capabilities that allow sharing (such as cloud resources) and economic and cost sharing, that is, how to do things less expensively by sharing. AWS is a system that does both. AWS is a cloud computing platform that consists of remote computing, storage, and services. It holds a large array of data sets with three types of product. The first is public data sets: freely available data of broad interest to the community, including Yahoo Webscope data, Common Crawl data gathered by the open-source community (240 TB), Earth-science satellite data from NASA (40 TB), 1000 Genomes data from NIH (350 TB), and many more. Ryland stated that before the genome data were publicly stored in the cloud, fewer than 20 researchers worked with those data sets. Now, more than 200 are working with the genome data because of the improved access.
A second type of AWS data product is requester-pays data. This is a form of cost sharing in which data access is charged to the user’s account but data storage is charged to the data owner’s account. It is fairly popular but perhaps not as successful as AWS would like, and AWS is looking to broaden the program.
The third type of AWS data product is community and private. AWS may not know what data are shared in this model. The data owner controls data access. Ryland explained that AWS provides identity control and authentication features, including Web Identity Federation. He also described a science-oriented data service (Globus), which provides cloud-based services to conduct periodic or episodic data transfers. He explained that an ecosystem is developing around data sharing.
Sharing is also taking place in computation. Ryland noted that people are developing Amazon Machine Images with tools and data “prebaked” into them, and he provided several examples, including the Neuroimaging Tools and Resources Clearinghouse and scientific Linux tools. Ryland indicated that there are many big data tools, some commercial and some open-source. Commercial tools can be cost-effective when accessed via AWS in that a user can access a desired tool via AWS and pay only for the time used. Ryland also pointed out that AWS is not limited to single compute nodes and includes cluster management, cloud formation, and cross-cloud capabilities. AWS also uses spot pricing, which allows people to bid on excess computational capacity. That gives users cheap access to computing resources, but the resource is not reliable; if someone else bids more, the capacity can be taken away and redistributed. Ryland cautioned that projects must therefore be batch-oriented and able to track their own progress. For instance, MapReduce is designed so that computational nodes can appear and disappear.
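Ryland's caution about spot capacity can be illustrated with a small sketch. No real AWS APIs are used; the checkpoint file name and the workload are hypothetical. A batch job records its progress to a checkpoint so that, if the instance is reclaimed mid-run, a later run resumes where the previous one stopped rather than starting over.

```python
import json
import os
import random

# Illustrative sketch of a batch job that survives spot interruption by
# checkpointing progress to a file (hypothetical file name and workload).

CHECKPOINT = "progress.json"

def load_progress():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"next_item": 0, "partial_sum": 0}

def save_progress(state):
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

def run_batch(items, may_be_interrupted=lambda: False):
    state = load_progress()
    for i in range(state["next_item"], len(items)):
        if may_be_interrupted():          # spot capacity reclaimed
            save_progress(state)
            return None                   # a later run resumes from here
        state["partial_sum"] += items[i]
        state["next_item"] = i + 1
    save_progress(state)
    return state["partial_sum"]

items = list(range(100))
if os.path.exists(CHECKPOINT):
    os.remove(CHECKPOINT)                 # start from a clean state

random.seed(1)
# First run may be "interrupted" partway; the second run resumes and finishes.
run_batch(items, may_be_interrupted=lambda: random.random() < 0.05)
total = run_batch(items)
print(total)  # 4950, with or without an interruption
```

The same pattern underlies Ryland's MapReduce remark: because each unit of work is recorded as done, a node that disappears costs only the work in flight, not the whole job.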
Ryland explained that AWS offers a number of other managed services and provides higher-level application program interfaces. These include Kinesis (for massive-scale data streaming), Data Pipeline (for managed data-centric workflows), and Redshift (a data warehouse).
AWS has a grants program to benefit students, teachers, and researchers, Ryland said. It is eager to participate in the data science community to build an education base and the resulting benefits. Ryland reported that AWS funds a high percentage of student grants, a fair number of teaching grants, but few research grants. Some of the research grants are high in value, however. In addition to grants, AWS provides spot pricing, volume discounting, and institutional cooperative pricing. In the latter, members of the cooperative can receive shared pricing; AWS intends to increase its cooperative pricing program.
Ryland explained that AWS provides education and training in the form of online training videos, papers, and hands-on, self-paced laboratories. AWS recently launched a fee-based training course. Ryland indicated that AWS is interested in working with the community to be more collaborative and to aggregate open-source materials and curricula. In response to a question, Ryland clarified that the instruction provided by Amazon is on how to use Amazon’s tools (such as Redshift), and the instruction is product-oriented, although the concepts are somewhat general. He said that the instruction is not intended to be revenue-generating, and AWS would be happy to collaborate with the community on the most appropriate coursework.
A workshop participant posited that advanced tools, such as the AWS tools, enable students to use systems to do large-scale computation without fully understanding how it works. Ryland responded that this is a pattern in computer science: a new level of abstraction develops, and a compiled tool is developed. The cloud is an example of that. Ryland posited that precompiled tools should be able to cover 80 percent or more of the use cases although some researchers will need more profound access to the data.