Rod Smith, Emerging Technologies, IBM
The following is a summary of the presentation made by Rod Smith, who attended the Big Data Workshop via teleconference. According to Smith, IBM recognizes that, with the proper tools and an understanding of customer needs, there is significant business potential in gleaning information from the burgeoning stores of information loosely termed “big data.” Big data grows through the easy sharing of data on the web and through mobile applications, which are now the backbone of social interactions and many business transactions. While the sources of data are growing in both number and magnitude, the cost of storage and processing continues to decline. Through early and direct customer engagement, IBM has developed business insights that enable quick, profitable, and advantageous decisions.
When integrated with data from social media like Twitter (7 terabytes/day) and Facebook (10 terabytes/day), big data acquires a real-time dimension. The value increases even more with the addition of proprietary data. These combined sources of data yield a stronger, more discernible signal that illuminates insights and events that might otherwise go unnoticed. One example of the real-time effect of social media is the reporting of the August 2011 earthquake in Mineral, Virginia: Twitter users posted reports within about 40 seconds, whereas the U.S. Geological Survey issued reports, based on seismometer readings, 2 minutes after the event.
Open source projects like Hadoop1 and Cassandra2 are common platforms for big data solutions. The IBM tools for such analyses are refinements or modest modifications of rapidly evolving and widely available web technologies, so that processing developments using Java, Linux, and XML continue without direct investment from IBM. Using the open source environment is economical and ensures that efforts remain continuously at the cutting edge of technology.
An example of what IBM can do with these capabilities comes from the support it provided to the Mergers and Acquisitions Department of American Express (AMEX), for which IBM produced critical decision information through discovery and analysis of public and private data on intellectual property. The AMEX business question was whether the innovation of a specific company was enhanced by a particular business acquisition. The analysis began with a review of the company’s patents, ranking them in value according to the number of times they were cited in other patents. This analysis included all U.S. patents (1,400,000) from 2002 to 2009 and another 6,100,000 U.S. and international patents. With approximately one hour of processing time, the analysis showed that one patent had 67 citations, and 24 patents had one citation each. This search triggered an ancillary question (“Were these patents involved in litigation?”), which resulted in the identification of 3,600 cases from the Federal Circuit Court of Appeals (1993-2007).
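The citation-ranking step at the heart of such an analysis, counting how often each patent is cited by later patents and ordering the results, can be sketched in a few lines. This is a minimal illustration with hypothetical patent identifiers; it does not represent IBM’s actual tooling.

```python
from collections import Counter

def rank_by_citations(citation_pairs):
    """Rank patents by how often other patents cite them.

    citation_pairs: iterable of (citing_patent, cited_patent) tuples.
    Returns a list of (patent, citation_count), most-cited first.
    """
    counts = Counter(cited for _, cited in citation_pairs)
    return counts.most_common()

# Toy data: hypothetical patent "US-A" is cited by three later patents.
pairs = [
    ("US-X", "US-A"), ("US-Y", "US-A"), ("US-Z", "US-A"),
    ("US-Y", "US-B"),
]
print(rank_by_citations(pairs))  # [('US-A', 3), ('US-B', 1)]
```

At the scale described above (millions of patents), the same counting step would be distributed across many machines, but the ranking logic is unchanged.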
1 Hadoop, a product of the Apache Software Foundation, is a cross-platform framework that pairs a distributed file system with distributed processing, allowing large numbers of computationally independent systems to process enormous amounts of data.
2 Cassandra is a product of the Apache Software Foundation, which is dedicated to open source software. Apache Cassandra is a database management system designed to handle massive amounts of data distributed across many systems.
IBM believes that such applications are only beginning, demonstrating the start of a possible next wave of business applications. Further, the combination of astute data analytics from IBM and continued contact with customers is key to success. Smith posited that integrating social media analytics is critical to reducing time to value.
Darrell Long and Gilman Louie
Workshop attendee Gilman Louie and committee member Darrell Long next led a discussion on big data challenges. The challenges of big data are difficult to categorize, primarily because the exact definition of "big data" varies according to the intentions of the speaker. As such, several participants noted that it is important to specify precisely which problem applies in which context and then approach the problem from that definitional space. There was a robust dialog that included four different ways to view the problem: volume3 of data (too much data), ubiquity of sensor data, data fusion challenges, and too much of a certain type of data.
Beginning with the problem of big data as simply too much data, or overwhelming amounts of data, many participants felt that the challenge was not new. It was pointed out that too much data had always been a problem, particularly for aggressive collectors of data, as large governments tend to be, and that an excess of data typically inspires new approaches for handling it. Several participants, however, noted two elements of the current era of big data that seem to differ from previous eras. The first element is the relationship of data to individuals, uniquely and globally; i.e., the big data challenges of this era seem to be mostly about the data associated with individual movements, preferences, sentiments, and thoughts. This situation differs from previous eras, in which big data tended to be generated by economic activities, wars, and science. The individualization of big data stems primarily from the social networking phenomenon but is also enabled by the credit and debit card industries and the logistics industry, particularly point-of-sale applications. Other workshop participants noted that the second element is the importance of algorithmic analysis of data; i.e., the use of math, machine learning, and human emotion-behavior analyses (such as sentiment analysis) seems to have made both a quantitative and a qualitative difference in how data is used and interpreted.
It was noted by several workshop participants that part of the forcing function of the big data problem is what is called the data ingest challenge: more data is originating from more sensors. These sensors range from social network updates (Facebook posts, tweets, blog posts, etc.) to embedded, distributed utility functions (wifi repeaters, cameras, financial transactions). Some participants suggested that the greater variety of sources of data and data inputs requires different approaches to data integration and analysis, and also contributes to data communication and storage challenges.
3 Some believe that unlike in the “old world” where volume was a problem, in the big data world, volume is a friend: even dirty data can increase the “resolution” of an entity. In the big data world, data is processed differently. Unlike in the old world where data was processed by reducing collections down to semi-finished and finished intelligence (known as the INTs) and then re-integrating it (all-source analysis) to produce knowledge, in the big data world, data is computed all at once and across different data types, to reveal or allow discovery of knowledge and intelligence.
The great variety of data types, including audio, video, text, geographic location markers, and photos, was discussed by many workshop participants as having contributed to the growing need for different approaches to data fusion. The point was made that prior to 1999, most of the data available for analysis was structured; now it is mostly unstructured and varies in format and dimensionality. For example, fusing text to video is challenging, particularly if the video is not annotated in any way that allows the analyst (and the analysis) to know at what point in the video the text is pertinent. It was noted that these types of data fusion issues are of utmost concern to the intelligence community. The fusion problem, of course, is simply the “tip of the iceberg.” A further issue is how the data can be presented for cognitive review, i.e., how they might be visualized. A workshop participant commented that representational data can contribute to solving this issue, but that may simply replace one form of metadata (existing text tags) with a different sort of metadata (representational constructs of the actual data).
There is a point at which conventional approaches to storage (e.g., copies of files) may stop scaling. This was characterized as reflecting the difference between classical and relativistic physics: below the petabyte range, data storage is “Newtonian,” whereas at greater than petabyte sizes it becomes “Einsteinian.” Novel approaches to storage may help, such as the use of mathematical techniques for distributing elements of data sets and then recreating them as needed. Challenges related to representational data versus fully captured data include preprocessing, distribution of processing actions, reduction of communications needs, “data to decision,” targeted ads, signature/signal identification, and “analysis at the edge.”
Eldar Sadikov of Jetlore
Eldar Sadikov of Jetlore (formerly Qwisper) was asked to present as a representative of the community that exploits large-scale data for social analysis. Jetlore is capable of taking in unstructured data from social networking sites and producing detailed analysis. Sadikov noted that one of the challenges is natural language processing, particularly using context to recognize entities and relationships in unstructured texts written in less than grammatically correct language. He discussed how, in contrast to only a few years ago, there is now a wealth of data available in addition to traditional textual content: geocoordinates, images and video, and sensor data from many other sources that are not evident to the active user. He discussed using this information for non-traditional purposes such as event detection, referring to the speed of reports generated via Twitter, which are much quicker than those produced through traditional methods. Much of the discussion focused on the profound changes in the amount of data that is available and on the ability to fuse such data to derive information in ways that were not possible in the past. He also discussed where the best work is being done, noting that some of the foundational mathematics was done in Russia.
Chris Gladwin of Cleversafe
In his presentation on computational data, Chris Gladwin provided an overview of the size and scale of data as technologists move into the next 10 years and of the approaches that industry is using to cope with it. According to Gladwin, the world’s data requirements are growing exponentially. While much of the data of the past was textual (e.g., HTML 1.0 web pages) or in documents, most new data (up to 85 percent of it) is unstructured data in the form of media (audio, video, images) and other binary large objects (blobs). Complexity arises when these data must be indexed, accessed, and distributed effectively. As an example, Gladwin noted that one of his current customers, Shutterfly, needs to be able to serve 1.4 million photos per hour to its customer base with little to no perceived delay between request and response.
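That service level implies a steady rate of several hundred photos per second, as a quick back-of-the-envelope check shows (using only the figure quoted above):

```python
# Implied request rate from the 1.4 million photos/hour figure.
photos_per_hour = 1_400_000
photos_per_second = photos_per_hour / 3600
print(round(photos_per_second))  # roughly 389 photos served per second
```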
Gladwin noted that the new challenges related to massive amounts of data supporting dynamic requests have given birth over the last decade to a whole new subset of mathematical techniques for storage. Erasure coding, for instance, offers a means of breaking data into recoverable chunks and distributing them across more than one storage device, thus hardening the data against loss. Gladwin noted that Cleversafe uses a 20-of-26 strategy: the data is stored as 26 slices, of which any 20 suffice to recover the whole set. He stated, “At a quadrillion bits, you can’t assume they are all right.” Through such coding techniques, data can be protected against corruption.
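The k-of-n idea behind erasure coding can be illustrated with a small polynomial scheme over a prime field: k data symbols define a polynomial of degree k-1, n evaluations of that polynomial become the slices, and any k surviving slices determine the polynomial (and hence the data) uniquely. This is a toy sketch of the general technique only; Cleversafe’s production codes are optimized Reed-Solomon variants, and the field size and parameters here are chosen purely for illustration.

```python
P = 2_147_483_647  # a Mersenne prime; all arithmetic is mod P

def _lagrange_eval(points, x):
    """Evaluate, at x, the unique polynomial through the given points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data, n):
    """Spread k data symbols into n slices; any k slices suffice to recover."""
    points = list(enumerate(data))            # data defines p(0), ..., p(k-1)
    return [(x, _lagrange_eval(points, x)) for x in range(n)]

def decode(slices, k):
    """Recover the original k symbols from any k surviving slices."""
    pts = slices[:k]
    return [_lagrange_eval(pts, x) for x in range(k)]

data = [11, 22, 33]                            # k = 3 symbols
slices = encode(data, 5)                       # n = 5 slices (a "3-of-5" scheme)
survivors = [slices[1], slices[3], slices[4]]  # lose any two slices
print(decode(survivors, 3))                    # [11, 22, 33]
```

A 20-of-26 deployment is the same construction with k = 20 and n = 26, tolerating the loss of any six slices at a 30 percent storage overhead, versus 200 percent for triple replication.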
The final portion of the briefing addressed scalability of data storage systems. Gladwin pointed out that Hadoop, for instance, is limited to a few thousand addressable nodes within its namespace. Data storage and access needs in the exabyte range would require new, computationally efficient ways to build systems capable of supporting them. Gladwin noted that his company was able to prove the feasibility of a storage architecture, not yet in use, that could support 10 exabytes of data, which would be the single largest storage architecture in the world. An on-the-spot back-of-the-envelope calculation showed that this highly distributed and decentralized architecture would need roughly 300 million conventional hard drives and would consume approximately 2.4 GW just to operate the hard drives (not including the power demands of servers and networking infrastructure for such tasks as thermal management). The discussion then explored the extreme challenge that such a large data storage structure would present to the current ability to generate power at this scale.
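The per-drive implications of that back-of-the-envelope exercise can be reproduced with simple arithmetic, using only the figures quoted above:

```python
# Arithmetic check of the figures quoted in the discussion.
total_bytes = 10 * 10**18      # 10 exabytes
drives = 300_000_000           # roughly 300 million hard drives
power_watts = 2.4e9            # 2.4 GW to operate the drives

gb_per_drive = total_bytes / drives / 1e9
watts_per_drive = power_watts / drives
print(round(gb_per_drive, 1))   # ~33.3 GB of the total per drive
print(watts_per_drive)          # 8.0 W per drive, typical for a spinning disk
```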
John Marion of Logos Technologies
John Marion of Logos Technologies described a persistent surveillance system that evolved from a prototype developed at Lawrence Livermore National Laboratory into a series of systems that have been operationally deployed. The fundamental technology is a group of high-resolution image sensors mounted on a gimbaled platform carried aboard aircraft. The challenges presented by such sensors are centered on limited communication bandwidth, and although various communication technologies have been proposed, none has provided the necessary bandwidth. Marion stated that the current strategy is to exchange arrays of magnetic disks physically, but it is understood that techniques need to be developed for processing all the data on the platform in order to reduce the decision latency. By performing computation at the edge, for example, by using models to enable detection of change, Marion said, only small amounts of data need to be transmitted over the limited communications bandwidth that is available. Even though processing at the edge will be necessary, it is not anticipated that raw data will lose its value: some problems can be addressed through processing on the platform, while others will need higher-performance processing and fusion with other data that will be available only on the ground.
Committee member Al Velosa led a discussion on vulnerabilities. He mentioned three core areas of vulnerability in the big data arena: infrastructure, data and analysis, and tools and technology.
Infrastructure presents opportunities for vulnerabilities. For example, adversaries can and do have the same level of access to equivalent types of infrastructure as the United States does. The infrastructure tends to be built on standardized equipment that is available globally and can be installed by a large number of service providers. For adversaries who cannot afford the infrastructure, plenty of companies offer it through a “pay as you go” business model. The infrastructure does have the benefit of many data centers with redundant backups of the data, but some key facilities can be crippled by disrupting their power supply.
Data and analysis also present a variety of vulnerabilities. Data, and lots of it, is available to opposing forces, often for free. But the proliferation of data, and the speed with which it is used and consumed, sometimes limit how thoroughly the United States can verify the data. As a result, there is the possibility of false and malicious data being planted in U.S. systems (e.g., false data on stock movements that can drive capital markets, or data that can start a panic about a transmittable disease or contamination of food). Data vulnerabilities operate over different time frames. In an attempted manipulation of financial markets, a short-time-frame response might involve an army of bankers immediately figuring out what is happening and then announcing that it is all misinformation. A long-time-frame response might be needed in a scenario where people are faking illness: authorities might need time to discern what is happening and to determine that misinformation has been spread and that there is no cause for alarm. The analysis and communication of the truth in these scenarios are also challenging, in that trust becomes a critical issue. Thus data are very susceptible to issues that center on trust.
Tools and technology are widespread and often available on an open source basis. Thus opponents often have access to the same levels of analysis that the United States does. Furthermore, the large number of computer scientists graduating from both U.S. and foreign universities guarantees opponents a talent base that may develop opportunities and tools that they could deny to the United States.
Benjamin Reed of Yahoo! Research
Ben Reed, a research scientist at Yahoo! Research, gave a presentation on data discovery. One of the assumptions on which Yahoo! operates is that everyone else has the same kind of infrastructure that Yahoo! has; the secret sauce (what is kept confidential) is the code used to link data pieces. Yahoo! tries to anticipate the information wants of the general online population so that when someone seeks elaboration on a particular piece of news and goes to the Yahoo! website, he or she can easily find those details. For example, Yahoo! kept track of the buzz surrounding the death of Michael Jackson so that people could find out about the details.
Yahoo! also embraced open source implementation (specifically, Hadoop). Yahoo! has commoditized the hardware, the software, and access to the platforms. Yahoo!’s data analysis tools are open source (such as Pig4) and have contributors from all over the world; users actually contribute to the tools rather than merely use them. A Yahoo! Asia office coordinates these contributions.
4 According to Wikipedia, “Pig is a high-level platform for creating MapReduce programs used with Hadoop. Pig was originally developed at Yahoo! Research around 2006 for researchers to have an ad hoc way of creating and executing map-reduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation.”
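The map-reduce model that Pig scripts compile down to can be sketched in a few lines. This is a single-process illustration of the two phases applied to a word count; real Hadoop jobs distribute the same phases across many nodes.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle/sort by key, then reduce: sum the counts for each word."""
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

records = ["big data big", "data tools"]
print(dict(reduce_phase(map_phase(records))))  # {'big': 2, 'data': 2, 'tools': 1}
```

Because the map step is independent per record and the reduce step is independent per key, both phases parallelize naturally, which is what makes the model suitable for very large data sets.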
Automated, large-scale data-gathering agents, known as bots (short for software robots), generate a large volume of traffic to Yahoo! and tend to tax Yahoo! with large quantities of queries. Yahoo! deals with bots by giving them a “fake” version of the information they seek. Because attempts to ignore the bot queries, once they are identified as such, simply result in a multiplication of even more bot queries, Yahoo! simply replies with a version of what the bot asked for, minimally satisfying the query, but well enough to pacify the bot and clear bandwidth for other users.
Hardware to process big data is easily accessible, the software is free, and the processing models are widely available, so big data is no longer a niche market; there is no barrier to entry in the commercial market. During the workshop discussion, a question was asked about whether parallel processing across multiple nodes is as difficult as it is in high-performance computing. Yahoo! does do parallel computing, with algorithms designed to solve the big data problem that are often separate and distinct from those ubiquitous in traditional high-performance computing. With big data, a whole lot of information comes in, and not much comes out; in high-performance computing, a little bit of information comes in, but the outputs are tremendous. So a different type of tool is required for each of these two data environments.
Paul Twohey of Ness Computing
Ness Computing (not to be confused with Ness Corporation) is a small start-up headquartered in Los Altos, California. Currently with 15 employees, it embraces data analysis for commercial purposes. The firm’s LikeNess search engine, which draws data from various sources of information, such as social networking sites, applies machine learning techniques to tease out patterns that are then used to establish recommendations for users, generating a small profit per transaction. Ness Computing describes what it does in the following terms: “Ness creates products that connect our users with new experiences.”
Its flagship product, Ness, is an application that runs on mobile phones to provide users with restaurant recommendations. To seed the analysis, users are asked to input reviews of 10 restaurants. Based on these reviews and powerful back-end analysis of data from other users and social networks, the Ness “app” recommends other restaurants that the user will, in all probability, like.
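The general shape of such a recommendation step can be sketched with a simple similarity-based approach: find the user whose seed ratings most resemble the target’s and suggest the restaurants that user rated highly. The restaurant names, the cosine-similarity method, and the threshold below are all hypothetical illustrations; the source does not describe Ness Computing’s actual algorithm.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two users' ratings on shared restaurants."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[r] * b[r] for r in shared)
    return dot / (sqrt(sum(a[r] ** 2 for r in shared)) *
                  sqrt(sum(b[r] ** 2 for r in shared)))

def recommend(target, others, like_threshold=4):
    """Suggest restaurants the most similar user liked but the target hasn't rated."""
    best = max(others, key=lambda u: cosine(target, u))
    return sorted(r for r in best
                  if r not in target and best[r] >= like_threshold)

me    = {"Taqueria": 5, "Bistro": 2, "Noodles": 4}
alice = {"Taqueria": 5, "Bistro": 1, "Noodles": 5, "Sushi Bar": 5}
bob   = {"Taqueria": 1, "Bistro": 5, "Diner": 4}
print(recommend(me, [alice, bob]))  # ['Sushi Bar']
```

A production system would combine many such signals (social graph, location, sentiment) and learn weights for them, but the seed-then-match structure described above is the same.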
According to workshop presenter Paul Twohey, his firm’s approach to developing products is based on the emerging realities of electronic commerce and free social networking, whereby the user exchanges personal information for services. It requires extensive back-end computing power, which differs from high-performance computing. He stated that the computing approach is that of taking an enormous amount of data, performing complex mathematical analysis (including sentiment analysis), and providing customized output for each user. Ness Computing hires only employees with superb math and computer science skills, and a workshop participant observed that this approach is problematic given the low availability of individuals with such skills. Several participants at the workshop noted the need for an emphasis by U.S. educational institutions on advanced math skills so that the U.S. workforce can remain competitive in the future.