Realizing the Value from Big Data36
February 28-March 2, 2011
Institute for Infocomm Research (I2R) of Singapore’s Agency for Science, Technology, and Research
This meeting convened bioinformatics scientists and environmental scientists together with computational/data scientists, who were asked to identify computational and policy roadblocks that prevent their disciplines from fully extracting value from “big data.” Bioinformatics and environmental sciences were selected not only because both are data-rich applications, but also because the underlying research challenges are inherently international. Invited participants were selected jointly with principals from Singapore’s I2R,37 which hosted the meeting, and were drawn from research organizations in Australia, China, England, Hong Kong, Japan, Korea, the Netherlands, Portugal, Singapore, and the United States. Discussions during the meeting reflected broad international interest in the subject, but also exposed the difficulties inherent in communicating across problem domains as well as across cultural contexts.
A central theme was the challenge researchers face in finding and using “big data” captured or generated by others; many participants agreed that improvements in this area would enhance the efficiency of their own research. Issues ranged from researchers’ inability to find and access relevant datasets to an inability to make sense of the data once access was obtained. While some barriers derived from policy (e.g., ownership, privacy), other impediments were related to the absence of metadata standards that could enable search engines to find relevant datasets and also help researchers understand the provenance and meaning of the data. Participants working in small groups were asked to identify specific initiatives that might mitigate key barriers; suggestions ranged from the development of common abstractions that could be reused across domains to the notion of a standardized Internet protocol that would facilitate identification and location of “big data” of interest to a research team.
Participants also discussed challenges related to the management and exploration of “big data,” e.g., the importance of common infrastructures to share the cost burden associated with “big data”; efficient processes and incentives to motivate researchers to share data; and common tools to facilitate mining and exploration of complex datasets. Participants expressed differing opinions on the definition of “big data”; some viewed it as a matter of size, while others associated it with complexity. Disciplinary differences arose, too. Most computer scientists voiced a desire to perform research at a level of abstraction above that valued by domain scientists working on specific problems (e.g., many computer scientists wanted to be seen as “enablers” rather than “plumbers”). Many participants expressed the view that a “principal investigator-centric” funding model is not well matched to “big data” problems, because such problems require a multidisciplinary, collaborative environment.
Participants at this meeting identified a diverse array of issues, most of which were common to all nations represented, that today limit their ability to fully extract value from “big data.”
36 A brief summary of this meeting can be found at http://sites.nationalacademies.org/xpedio/groups/pgasite/documents/webpage/pga_062988.pdf.