Big Data as a Data Fusion Challenge

The great variety of data types, including audio, video, text, geographic location markers, and photos, was discussed by many workshop participants as having contributed to the growing need for different approaches to data fusion. The point was made that prior to 1999, most of the data available for analysis were structured; now they are mostly unstructured and vary in format and dimensionality. For example, fusing text to video is challenging, particularly if the video is not annotated in any way that allows the analyst (and the analysis) to know at what point in the video the text is pertinent. It was noted that these types of data fusion issues are of utmost concern to the intelligence community. The fusion problem, of course, is only the “tip of the iceberg.” A further issue is how the data can be presented for cognitive review, i.e., how they might be visualized. A workshop participant commented that representational data can contribute to solving this issue, but that this may simply replace one form of metadata (existing text tags) with another (representational constructs of the actual data).
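The annotation dependence described above can be made concrete with a toy sketch. The segment format, function name, and sample captions below are entirely hypothetical; the point is only that free text can be aligned to a video wherever even coarse textual annotations exist, and has nothing to match against where they do not.

```python
# Toy sketch: aligning free text to time-annotated video segments.
# Segment format (start_sec, end_sec, annotation) is hypothetical.

STOPWORDS = {"the", "a", "an", "of", "and", "is"}

def match_text_to_segments(query, segments):
    """Return (start, end) windows whose annotations share content words
    with the query text."""
    query_words = set(query.lower().split()) - STOPWORDS
    hits = []
    for start, end, annotation in segments:
        annotation_words = set(annotation.lower().split()) - STOPWORDS
        if query_words & annotation_words:
            hits.append((start, end))
    return hits

segments = [
    (0, 30, "crowd gathers outside the stadium"),
    (30, 75, "speaker takes the podium"),
    (75, 120, "audience applauds the announcement"),
]
print(match_text_to_segments("the announcement drew applause", segments))
# → [(75, 120)]
```

An unannotated video would correspond to an empty annotation string for every segment, and every query would return no matches, which is exactly the fusion difficulty the participants raised.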

Big Data as Too Much of a Good Thing

There is a point at which conventional approaches to storage (e.g., copies of files) may stop scaling. This was characterized as reflecting the difference between classical and relativistic physics: below the petabyte range, data storage is “Newtonian,” whereas at greater than petabyte sizes it becomes “Einsteinian.” Novel approaches to storage may help, such as the use of mathematical techniques for distributing elements of data sets and then recreating them as needed. Challenges related to representational data versus fully captured data include preprocessing, distribution of processing actions, reduction of communications needs, “data to decision,” targeted ads, signatures/signals identification, and “analysis at the edge.”
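One minimal illustration of “mathematical techniques for distributing elements of data sets and then recreating them as needed” is parity-based erasure coding, in which a system stores algebraic combinations of data rather than full copies. The XOR example below is an assumption about the general kind of technique meant (production systems use richer codes such as Reed-Solomon), not a description of any specific system discussed at the workshop.

```python
# Toy erasure-coding sketch: one XOR parity block lets any single lost
# fragment be rebuilt, at far less cost than keeping a full second copy.

def make_parity(blocks):
    """XOR equal-length byte strings into a single parity block."""
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = bytes(a ^ b for a, b in zip(parity, block))
    return parity

def recover(surviving_blocks, parity):
    """Rebuild the one missing block from the survivors and the parity:
    XOR-ing everything that remains cancels all surviving fragments."""
    return make_parity(surviving_blocks + [parity])

data = [b"alpha", b"bravo", b"delta"]  # three equal-length fragments
parity = make_parity(data)

# Lose data[1], then recreate it on demand from what is still stored.
restored = recover([data[0], data[2]], parity)
print(restored)  # → b'bravo'
```

Full replication of the three fragments would double storage; the parity scheme adds only one fragment's worth, which is the kind of saving that matters once conventional copying stops scaling.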


Eldar Sadikov of Jetlore

Eldar Sadikov of Jetlore (formerly Qwisper) was asked to present as a representative of the community that exploits large-scale data for social analysis. Jetlore takes in unstructured data from social networking sites and produces detailed analysis. Sadikov noted that one of the challenges is natural language processing, particularly using context to recognize entities and relationships in unstructured texts written in less than grammatically correct language. He discussed how, in contrast to only a few years ago, there is a wealth of data available in addition to traditional textual content; these data include sensor data, geocoordinates, images and video, and data from many other sources that are not evident to the active user. He discussed using this information for non-traditional purposes such as event detection, referring to the speed of reports generated via Twitter, which are much quicker than those produced through traditional methods. Much of the discussion focused on the profound changes in the amount of data available and the ability to fuse such data to derive information in ways that were not possible in the past. He also discussed where the best work is being done, noting that some of the foundational mathematics was developed in Russia.
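The context-based entity recognition Sadikov described can be sketched in miniature. The gazetteer, labels, and cue words below are hypothetical and far simpler than any production system; the sketch only shows the core idea that surrounding words, not grammar, disambiguate an entity in noisy social-media text.

```python
# Hypothetical sketch: disambiguating an ambiguous token by its context
# words rather than by parsing, since social-media text is often
# ungrammatical. Gazetteer and cue words are invented for illustration.

CONTEXT_CUES = {
    "apple": {
        "ORG": {"stock", "iphone", "shares", "ceo"},
        "FOOD": {"pie", "ate", "juice", "baked"},
    },
}

def tag_entity(token, tokens):
    """Return an entity label for a known ambiguous token, chosen by
    which cue words appear among the surrounding tokens."""
    cues = CONTEXT_CUES.get(token.lower())
    if cues is None:
        return None
    context = {t.lower().strip(".,!?") for t in tokens}
    for label, words in cues.items():
        if context & words:
            return label
    return None

tweet = "omg apple stock is going crazyyy"
print(tag_entity("apple", tweet.split()))  # → ORG
```

Even the misspelled “crazyyy” does no harm here, because the decision rests on the cue word “stock” rather than on well-formed syntax.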

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.