In response to the wide variety of topics discussed over the course of the first day, the workshop’s second day featured several different approaches to identifying and summarizing the important lessons in these presentations. George Poste, the chief scientist of the Complex Adaptive Systems Initiative and a Regents Professor in health innovation at Arizona State University, first provided his synthesis of the presentations. Next, the workshop participants broke into three groups that each discussed the following questions:
- What are the key lessons and takeaways from the discussions?
- How can big data and analytics best be managed and leveraged to tackle infectious diseases—and in what specific areas?
- What are the pros and cons, opportunities, and concerns (e.g., regulatory, legal, ethical, technical, human resources and training), and how can they be addressed?
- What concrete steps should be pursued next to advance discussions and actions for relevant research, operations, and policies?
The workshop activities then concluded with Poste moderating a discussion among Scott Dowell, the deputy director for surveillance and epidemiology at the Bill & Melinda Gates Foundation; Jennifer Gardy, an assistant professor in the School of Population and Public Health at the University of British Columbia; Kent Kester, the vice president and head of translational science and biomarkers at Sanofi Pasteur; Lonnie King, a professor and dean emeritus of the College of Veterinary Medicine at The Ohio State University; Martin Sepúlveda, a recently retired senior physician at IBM’s Watson Research Laboratory; Jay
Siegel, the chief biotechnology officer and head of scientific strategy and policy for Johnson & Johnson; and Lance Waller, the Rollins Professor and chair of the Department of Biostatistics and Bioinformatics at Emory University’s Rollins School of Public Health. Final thoughts on the workshop were given by Jeffrey Duchin, a professor of medicine at the University of Washington School of Medicine, and David Relman, the Thomas C. and Joan M. Merigan Professor at Stanford University and the chief of infectious diseases at the Veterans Affairs Palo Alto Health Care System.
When called on to provide a synthesis of the workshop’s first day, Poste said that the common message from all of the workshop’s speakers was that data are the critical currency for improving global health capabilities and preparedness for epidemic and pandemic threats. Big data and the associated analytics, he said, will have a paradigm-changing impact on how global systems will monitor infectious disease dynamics—including resurgent antibiotic resistance, the more rapid spread of new threats arising from global connectivities, potential bioterrorism threats, and the development of synthetic organisms. More importantly, he said, he believes that big data will provide a better understanding of the instabilities and complex adaptive responses of microorganisms that trigger emergent threats. Big data, which is already enabling the holistic, systems-based analysis of human, animal, and ecosystem interdependencies that make zoonotic diseases such a threat to human health, is also being overlaid onto the domains of molecular epidemiology, pathogen biology, and the development of diagnostics, drugs, and vaccines.
One key challenge, Poste said, will be dealing with not just the rapid proliferation of data but the growing diversity of data classes. Other challenges will include determining how to integrate, analyze, and curate the massive, heterogeneous datasets being generated; accounting for the variable reliability of the data; and generating actionable guidance. In the end, he said, the ultimate value of big data will come down to its utility for the public health practitioner and policy maker.
Learning from Others
Learning lessons from other fields, particularly with regard to integrating big datasets as opposed to merely aggregating them, will be critical to quickly and effectively addressing the challenges posed by Poste, a message reiterated by several workshop participants when each was given the opportunity to recount the key messages of the small group discussions that started the second day of the workshop. Gardy, Dowell, and Waller each noted the importance of learning from case studies and suggested creating a compilation of case studies with both successes and failures in using big data in public health and research
applications. In particular, they pointed to the PREDICT project and Chicago’s use of big data as examples of the type of case studies that could provide useful guidance to the field.
Potential of Big Data
The importance of early detection and preparedness mobilization is obvious, Poste said, but the field is still struggling with the challenge of profiling, detecting, and acting rapidly. How will big data help address this challenge? Compiling comprehensive genetic signature databanks for infectious diseases, he said, will lead to the development of distributed, rapid, and automated point-of-need diagnostic tests, which in turn will enable a real-time situational awareness of emerging threats and lead to faster mobilization of responses to those threats. The data for these signature databanks will come from comprehensive front-line sampling of sentinel species informed by geo-demographic and geographical information system applications. Already, Poste said, the real-time field reporting of anomalous events done by analyzing nontraditional sources of information, such as social media exchanges, is accelerating the detection of emerging outbreaks and enabling faster responses that save lives.
Some futuristic applications of big data may lead to on-body and in-body sensors for real-time remote monitoring and evaluation of health status as well as to a technological revolution in vaccine production for dealing with emerging threats. For the latter, big data may enable computational epitope mapping of bacteria and viruses and the identification of rule sets regarding the composition and structure of proteins that trigger different types of immune responses. The goal, Poste said, will be to use this information, in combination with the rapid profiling of an emerging threat, to chemically synthesize epitopes for vaccine production at point-of-need facilities. The ability to develop vaccines rapidly in this manner, he said, will be a critical component of any system designed to respond to bioterrorism agents created using evolving technologies such as genome editing, which the U.S. Director of National Intelligence, James R. Clapper, recently declared to be one of the top six existential threats to the nation.
Out-of-the box applications such as these will only come to fruition, Gardy said, if there are sustainable funding mechanisms and policies that support programs that fully explore the potential of big data, a point also made by Siegel. However, money alone will not be enough, Gardy said. “In the end, it will come down to people—people who have a vision, who are creative, who can gather people around them to work together and share their data.”
A New Mode of Operating
Managing the zettabyte-sized biomedical databases that are on the near horizon will not be a simple extrapolation from current information technology
practices, both Gardy and Poste said. In fact, they both said, current public health, biomedical research, and clinical institutional structures and information technology infrastructures are ill-prepared for the coming data deluge and incapable of conducting the type of analyses highlighted at the workshop. Already, Poste said, the four Vs that Guillaume Chabot-Couture, the associate principal investigator at the Institute for Disease Modeling at Intellectual Ventures, used to characterize big data are becoming the six Vs—volume, variety, velocity, veracity, visualization, and value—plus “three Ds”—dynamics, dimensionality, and decisions.
Quoting T. S. Eliot, who said, “Hell is a place where nothing connects,” Poste said that the challenge will be to master the six Vs and three Ds while creating a seamless process for communicating the results to those responsible for responding to an outbreak. Today, he said, many organizations have created an extraordinary repertoire of global disease surveillance capabilities, but they are poorly integrated and generate incomplete, inconsistent, and incompatible data report formats. Unless this situation changes, big data will only make matters worse, Poste said. In addition, too many biomedical datasets are problematic because of “sloppy science” that produces irreproducible data, the use of underpowered statistics that over-fit large feature sets to small sample sizes, silos and data tombs, a reluctance to share data, the limited use of common ontologies, inconsistent and incompatible data formats, and episodic snapshots of dynamic systems.
Siegel, Gardy, and Kester all stressed the need for researchers to stop hoarding their data. Limited access to data is a major obstacle to realizing the full potential of big data, said Siegel, who noted that many datasets languish in data silos for which the original investigators have little use but that could be of great utility to other researchers. He acknowledged the need for mechanisms that allow those who generate data to publish before providing open access, but he also noted that government agencies seem to be biased against the use of the data they generate for research purposes. Sepúlveda added that in addition to making their data available for others to use, those who generate the data should be enticed to lend their expertise as collaborators on big data projects.
While stressing the need for open-access data, both Siegel and Waller said that the field does need to establish rules and ethical standards to protect the privacy of individuals’ data, including processes for obtaining patient consent. King also stressed the importance of keeping ethical considerations in mind and reiterated the importance of communicating uncertainty when sharing data and making results available to the public. He noted, though, that there is an opportunity to use big data as a means of engaging the public as citizen scientists and helping individuals make better, more knowledgeable decisions.
Going forward, the critical challenges for generating and analyzing robust, large-scale data for public health purposes, as enumerated by Poste and several workshop participants, will include
- handling the size and scale of current and future databases;
- differentiating between the signal and the noise as datasets become larger, particularly when genomics data are added to the existing datasets;
- improving natural-language processing to analyze the 80 percent of databases that contain unstructured data;
- lowering the cost of storage and fast access of data;
- implementing better security protocols for biomedical databases;
- moving the field to open-source data systems that enable sharing, authentication, and attribution;
- designing protocols for validating and curating data;
- developing guidelines for privacy, consent, and data ownership and stewardship;
- building a workforce capable of supporting and analyzing big data;
- preventing the “digital Darwinism” that could result from the growing imbalance in sophistication of different end users and their ability to embrace data scale and complexity; and
- embracing multi-institutional, multi-investigator, and multidisciplinary research.
Big data, Poste concluded, will change the nature of discovery by adding unbiased analytics of large datasets to today’s hypothesis-driven methods. It will change the cultural process of knowledge acquisition by fostering the development of large-scale collaboration networks, consortia, and open systems to augment individual investigator-driven research as well as the analysis and application of knowledge to produce real-time intelligence, deeper insights, and better decision making. Big data, he predicted, will also change education, training, health care delivery, public policy formation, and the critical competencies and infrastructure required for institutional relevance and competitiveness.
In recapping the small group discussions, workshop participants suggested several activities that could be beneficial to pursue in the future. These included
- Convening representatives of the public, private, and nonprofit sectors to discuss approaches for integrating and harmonizing data from disparate sources (Sepúlveda, Siegel)
- Facilitating discussions on how to define a decision-making process based on big data analytics (Kester)
- Assembling case studies and best practices for data storage, analysis, and reporting (Dowell, Gardy, Kester, Waller)
The availability of big data and the analytics necessary to extract actionable information from those large datasets marks the beginning of a transformative era in infectious disease research, operations, and policy, said King. Calling himself a cautious optimist with regard to the impact big data will have on infectious disease research, he warned that the field should not be so seduced by technology that technology is used when it is not needed. “We have to be careful not to have a triumph of technology over purpose,” he said.
Commenting on some of the impacts he expects big data to have, King said that big data can help city, state, and federal agencies to do more with limited financial resources, as illustrated by both the PREDICT and Chicago projects. He said he believes that an especially exciting development is the ability to combine disparate datasets and open new avenues of research on microbial threats to humans and animals. King predicted that efforts to use big data will draw increasing numbers of researchers together into interdisciplinary teams, which will not only lead to new insights but also help break down the silos that impede research and innovation and slow the expansion of knowledge.
He also predicted that big data will have a transformative effect on public health. “Public health organizations will not be the same 5 or certainly 10 years from now thanks to the influx of data scientists and others whose organizations will need to take advantage of and leverage big data,” King said. In his opinion, he added, public health will benefit from the new partners that big data will bring to the field, and he encouraged public health organizations to reach out to disciplines such as retailing and marketing, which he said have the most experience with data analytics and novel insights into using big data. As a final thought, he said that these are disruptive times in public health as a result of the emergence of new infectious diseases and such times demand innovation.
Duchin said that the Centers for Medicare & Medicaid Services Accountable Health Communities project is aimed at helping transform the nation’s health care system by integrating big data across the entire spectrum of health care. “I think we need to think about aligning with that current activity,” he said. He added that health care systems are looking at big data analytics to meet value-based payment objectives and improve population health, another effort that could benefit the infectious disease community.
Relman offered the final comment: “I would argue that until we have additional discussions about the purposes of creating and using these kinds of data, there will be tensions between the public and private use of these data. Going
forward, part of the forum’s task will be to think about where we can delve into some of these use cases and explore where exactly benefit can be had, what is the most effective path toward those concrete, real-world benefits, and what the pitfalls are, and where we can be led astray by big data.”