Committee member Al Velosa led a discussion on vulnerabilities. He mentioned three core areas of vulnerability in the big data arena: infrastructure, data and analysis, and tools and technology.
Infrastructure is one source of vulnerability. For example, adversaries can and do have the same level of access to equivalent types of infrastructure as the United States does. The infrastructure tends to be built on standardized equipment that is available globally and can be installed by a large number of service providers. For those adversaries who cannot afford the infrastructure, plenty of companies offer this level of infrastructure by means of a “pay as you go” business model. The infrastructure does have the benefit of many data centers with redundant backups of the data, but some key facilities can be crippled by disrupting their power supply.
Data and analysis also present a variety of vulnerabilities. Data, and lots of it, is available to opposing forces, often for free. But the proliferation of data, and the speed with which it is used and consumed, sometimes limit how much of the data the United States can verify. As a result, there is the possibility of false and malicious data being planted in U.S. systems (e.g., false data on stock movements that can drive capital markets or data that can start a panic about a transmittable disease or contamination of food). Data vulnerabilities operate over different time frames. In an attempted manipulation of financial markets, a short-time-frame response might involve an army of bankers immediately figuring out what is happening and then announcing that this is all misinformation. A long-time-frame response might be needed in a scenario where people are faking illness: authorities might need time to discern what is happening and to determine that misinformation has been spread and that there is no cause for alarm. The analysis and communication of the truth associated with these scenarios are also challenging, in that trust becomes a critical issue. Thus data are very susceptible to issues that center on trust.
Tools and technology are widespread and often available on an open source basis. Thus opponents often have access to the same levels of analysis that the United States does. Furthermore, the large number of computer scientists graduating from both U.S. and foreign universities guarantees opponents a talent base that may develop opportunities and tools that they could deny to the United States.
Benjamin Reed of Yahoo! Research
Ben Reed, a research scientist at Yahoo! Research, gave a presentation on data discovery. One of the assumptions on which Yahoo! operates is that everyone has the same kind of infrastructure that Yahoo! has. The secret sauce (what is kept confidential) is the code used to link data pieces. Yahoo! tries to anticipate the information wants of the general online population so that when someone seeks elaboration on a particular piece of news and goes to the Yahoo! website, he or she can easily find those details. For example, Yahoo! kept track of the buzz surrounding the death of Michael Jackson so that people could find out about the details.
Yahoo! also embraced open source implementation (specifically Hadoop). Yahoo! has commoditized hardware, software, and who can use the platforms. Yahoo! data analysis tools are open source (such as Pig⁴) and have contributors from all over the world. Users actually contribute, and do not just use the tools. There is also a Yahoo! Asia office that coordinates these contributions.
⁴ According to Wikipedia, “‘Pig’ is a high-level platform for creating MapReduce programs used with Hadoop. Pig was originally developed at Yahoo! Research around 2006 for researchers to have an ad hoc way of creating and executing map-reduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation.”