Proceedings of a Workshop
Informing Environmental Health Decisions Through Data Integration
Proceedings of a Workshop—in Brief
Integrating large quantities of data from multiple, disparate sources can create new opportunities to understand complex environmental health questions. Currently, efforts are under way to develop methods to reliably integrate data from sources or designed experiments that are not traditionally used in environmental health research, such as electronic health records (EHRs), geospatial datasets, and crowd-based sources. However, combining new types and larger quantities of data to inform a specific decision also presents many new challenges.
On February 20–21, 2018, the National Academies of Sciences, Engineering, and Medicine’s (the National Academies’) Standing Committee on the Use of Emerging Science for Environmental Health Decisions held a 2-day workshop to explore the promise and potential pitfalls of environmental health data integration. The workshop brought together a multidisciplinary group of scientists, policy makers, risk assessors, and regulators to discuss the topic. The workshop was sponsored by the National Institute of Environmental Health Sciences (NIEHS). This Proceedings of a Workshop—in Brief summarizes the discussions that took place at the workshop, with emphasis on the comments from invited speakers.
PROMISE, PERILS, AND FOUNDATIONS FOR DATA INTEGRATION
The purpose of data integration for environmental health research and decision-making is to improve public health by monitoring environmental exposures and health outcomes, said Lance Waller from Emory University. The ultimate goal, Waller explained, is to link exposure and health outcome datasets to identify, propose, test, implement, and evaluate potential interventions. Today, the sources and types of data available for integration and analysis are almost limitless, so it is important to first decide on the key questions of interest and then identify the data needed to answer those questions to a desired level of accuracy and precision. Next comes a reality check in terms of the data available, the methods to access and analyze those data, and the questions that can be answered with those data and methods, Waller said. Last comes an assessment to see if the questions that can be answered are close to the original questions, and if not, then an exploration of what additional data, data sources, and methods are needed to get closer to answering the original question.
Waller emphasized that in an ideal world, it would be possible to measure the “exposome,” the sum of everything the body is exposed to during a person’s lifetime. The exposome can be used to understand the nurture perspective of the nature versus nurture concept because it integrates all of the insults and factors that affect a genome throughout the lifespan,1 but while the concept is useful, the full exposome is intangible and cannot be measured directly. Nonetheless, there are sources of data for pieces of the exposome, such as multiple sensors and monitors for various air pollutants, EHRs for clinical information, dietary records, longitudinal and spatial data from various mobile devices, and others. As Waller noted, there are large-scale and small-scale data sources that can be lined up to generate new insights, which is what John Snow and the London Board of Health did in 1854 when they incorporated meteorological variables with the incidence of diarrhea cases and cholera deaths into a single graph to gain insights into the relationship between disease and the environment.2
1 Miller, G. W., and D. P. Jones. 2014. The nature of nurture: Refining the definition of the exposome. Toxicological Sciences 137(1):1–2.
Waller explained that he does not think of data integration in terms of finding the perfect data to answer a question. Instead, he thinks of data integration as a process for bringing together available datasets in creative and informative ways to help refine research questions and inform the next cycle of data acquisition and analysis. For example, census data provide information on where people live, but data about where people work might be more informative. People generally spend more time at work than at home, so workplace exposures may have a stronger influence on a person’s exposome and ultimately his or her health. He said it is important when presented with a new dataset to ask what the best way is to incorporate it into the existing data corpus and learn both about the data and from them. This is particularly true with “big data,” which, according to Waller, have many definitions that boil down to “more data than you know what to do with” and for which data scientists are still in the formative stages of developing methods to characterize the data and understand their utility.
Data science, said Waller, sits at the intersection of statistics and computer science. Statistics is good for designing experiments and controlling, modeling, measuring, stabilizing, and dealing with uncertainty, while computer science excels at searching, sorting, and translating data, all of which are challenging activities with big data. One of the main thrusts of data science, and particularly artificial intelligence, is not just solving problems faster by using existing methods designed for “small” datasets, but rethinking the analytical problem through the lens of being able to bring in more data, integrate them in new ways, and calibrate results with what is already known about a particular problem.
When it comes to integrating data, Chris Gennings from the Icahn School of Medicine at Mount Sinai cautioned that big data are not always better. She emphasized that big data are usually more complex and should be used to conduct hypothesis-driven and confirmatory research, not just used in an exploratory manner. Another concern is that big data can make almost any finding look significant in the traditional statistical sense, even when the effect size is too small to be practically meaningful. Regarding data integration, there are two basic strategies, she said. The first strategy is to integrate data from disparate study types, such as those linking environmental exposures and health outcomes on the assumption that an exposure in a certain locality might be relevant to health outcomes in that locality. Gennings noted that a variation of this strategy involves linking human data with experimental study results, as in the case of the European Union’s EDC-MixRisk project, which links laboratory studies on endocrine disrupting chemicals with data from two birth cohorts to suggest what realistic relevant exposures might be and generate a risk assessment.
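Gennings’s distinction between statistical and practical significance can be sketched numerically: with a large enough sample, a negligible effect produces a tiny p-value. The sketch below uses hypothetical numbers and a simple two-sample z test computed from first principles.

```python
import math

def z_test_effect(diff, sd, n_per_group):
    """Two-sample z statistic, two-sided p-value, and Cohen's d
    for a difference in group means (equal-variance sketch)."""
    se = sd * math.sqrt(2.0 / n_per_group)
    z = diff / se
    # two-sided p-value from the standard normal tail
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    d = diff / sd  # standardized effect size, unchanged by sample size
    return z, p, d

# The same trivially small effect (d = 0.01) at two sample sizes:
z_small_n, p_small_n, d = z_test_effect(diff=0.01, sd=1.0, n_per_group=500)
z_big_n, p_big_n, _ = z_test_effect(diff=0.01, sd=1.0, n_per_group=1_000_000)

print(f"n=500:       p = {p_small_n:.3f}, d = {d}")  # not significant
print(f"n=1,000,000: p = {p_big_n:.2e}, d = {d}")    # "significant," same tiny effect
```

The effect size is identical in both cases; only the sample size changed, which is why Gennings cautioned against treating a small p-value from a big dataset as evidence of a meaningful effect.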
A second strategy, Gennings explained, is to integrate data across epidemiology studies. One goal of this strategy is to increase generalizability by combining, for example, multiple exposure studies from many different locations around the country. The Center for the Health Assessment of Mothers and Children of Salinas (CHAMACOS) study of pesticide exposures among children in a farmworker community3 is an example of this strategy. However, the challenge is compatibility—whether different datasets should be combined and how to determine if the data integration is done correctly, she said. Gennings stressed that the CHAMACOS investigators noted that their study’s findings were limited by differences in the study populations, which demonstrates that data integration is not always beneficial.
Meta-analysis, which produces a weighted average of estimates from individual studies, and meta-regression, which adjusts the meta-analysis with moderator values using regression techniques, are common approaches to analyzing integrated datasets. In general, these methods are linear techniques that may not account for different dose–response relationships across exposure ranges. One solution to this problem is to get the original datasets, pool them, and analyze the pooled data, or better yet, to establish a common data center and pool and harmonize the data from the start of an experiment. As an example of the latter, Gennings discussed the Children’s Health Exposure Analysis Resource (CHEAR), a network of universities that provides an infrastructure for adding or expanding exposure analysis to studies involving research in children’s health while advancing the understanding of the impact of environmental exposures on children’s health and development. One important aspect of CHEAR is that it has created common quality control materials that could enable researchers from different laboratories to check their results against one another. Aside from quality controls, another aspect of CHEAR is the method by which controls and cases are being matched. In order to reduce bias, control subjects are being shared and matched across studies using “propensity score matching.” Regardless of the particular approach to integrating data, it is important to examine heterogeneity across studies and develop metrics to assess what the differences might be and how those differences will affect subsequent data analyses. Gennings underscored that combining large datasets and then using standard analytical techniques is not the path to generating meaningful findings.
2 Koch, T. 2005. Cartographies of Disease: Maps, Mapping, and Medicine. Redlands, CA: Esri Press.
3 Harley, K. G., S. M. Engel, M. G. Vedar, B. Eskenazi, R. M. Whyatt, B. P. Lanphear, A. Bradman, V. A. Rauh, K. Yolton, R. W. Hornung, J. G. Wetmur, J. Chen, N. T. Holland, D. B. Barr, F. P. Perera, and M. S. Wolff. 2016. Prenatal exposure to organophosphorous pesticides and fetal growth: Pooled results from four longitudinal birth cohort studies. Environmental Health Perspectives 124(7):1084–1092.
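The weighted average that underlies meta-analysis, discussed above, can be illustrated with a minimal fixed-effect (inverse-variance) pooling of study estimates. The three study estimates and variances below are hypothetical.

```python
def fixed_effect_meta(estimates, variances):
    """Fixed-effect (inverse-variance) meta-analysis: each study's
    weight is 1/variance, so more precise studies count for more."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_var = 1.0 / sum(weights)  # variance of the pooled estimate
    return pooled, pooled_var

# Hypothetical effect estimates (e.g., log odds ratios) from three studies
estimates = [0.30, 0.10, 0.25]
variances = [0.04, 0.01, 0.02]
pooled, pooled_var = fixed_effect_meta(estimates, variances)
print(f"pooled estimate = {pooled:.3f}, SE = {pooled_var ** 0.5:.3f}")
```

As the text notes, this kind of pooling is linear and assumes the studies estimate a comparable quantity; it does not by itself account for heterogeneity or for dose–response relationships that differ across exposure ranges, which is one motivation for pooling and harmonizing the original datasets instead.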
In the same vein, Linda Birnbaum, director of NIEHS, emphasized that the path to informing environmental health decisions through data integration will not be a smooth one unless data are FAIR—findable, accessible, interoperable, and reusable. “If you think you are going to be able to keep your data for yourself, that is not going to happen,” stressed Birnbaum. She said data will be shared. She also noted that data sharing has become the norm in the genomics field and it will be required of exposure science as well, although there are ethical issues to acknowledge and work through.
Since NIEHS issued its last strategic plan in 2012, it has emphasized exposure science and the development of the exposome, along with knowledge and data management. In 2013, NIEHS, along with the National Center for Advancing Translational Sciences, the University of North Carolina, and Sage Bionetworks, held a toxicogenomics challenge to better understand how an individual’s genetics can influence toxic responses, a step toward integrating toxicity and genomic data. Birnbaum added that the National Institutes of Health (NIH) has funded centers to begin to develop and support research in data science. In addition, NIEHS has been developing a framework of environmental health science language and an ontology and has created several new offices that work together to further the applications of data science, big data, and data integration to environmental health issues.
Birnbaum noted that there are plans to morph the CHEAR infrastructure into a health environmental assessment resource (HEAR) that will be deployed across NIH, including to All of Us,4 the 1 million person NIH precision medicine cohort. HEAR will provide access to clinical, biological, and epidemiological measurements along with the ability to look at familial relationships across multiple time periods of fetal and childhood development. NIEHS has also developed a comparative toxicogenomics database containing more than 30 million toxicogenomic connections,5,6 and a suite of toxico-informatics tools for management, analysis, and visualization of chemical effects data. All of these databases and resources will be captured and included in the NIH data commons as well as in NIEHS’s own data commons and knowledge network.
As Birnbaum alluded to in her remarks, developing an ontology is essential for enabling widespread data sharing and integration. An ontology specifies a rich description of terminology, concepts, and nomenclature, as well as the relationships among concepts and individuals. Without an ontology that has a well-defined and controlled vocabulary, data sharing, experimental reproducibility, and data reuse will be difficult, said Deborah McGuinness from Rensselaer Polytechnic Institute. McGuinness explained that ontologies provide support for mapping and integration, inform decisions about variables that may be combined, help flag errors in a dataset, and help expose implicit information and find links. She also noted that ontologies enable FAIR data resources and can support movement across levels of abstraction, such as genomics to population health. She recommended selectively and thoughtfully reusing the many existing best practice ontologies and vocabularies instead of building an ontology from scratch. She also recommended engaging experts in choosing portions of existing ontologies and designing a knowledge architecture. Ecosystems and diverse teams are critical for success when creating an ontology, she added, and community-driven and maintained ontology-based systems are the future of data science.
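McGuinness’s point that a controlled vocabulary supports mapping, integration, and error flagging can be sketched with a toy term map. The variable names and canonical terms below are invented for illustration; a real effort would reuse existing best practice ontologies, as she recommends.

```python
# Toy controlled vocabulary: map free-text variable names from two
# hypothetical datasets onto shared canonical terms so the datasets
# can be merged on a common schema.
VOCAB = {
    "pm2.5": "PM2.5_concentration",
    "pm 2.5 ug/m3": "PM2.5_concentration",
    "fine particulate matter": "PM2.5_concentration",
    "bp_sys": "systolic_blood_pressure",
    "systolic bp": "systolic_blood_pressure",
}

def harmonize(record):
    """Rename keys to canonical terms; flag anything unmapped for review."""
    out, unmapped = {}, []
    for key, value in record.items():
        canonical = VOCAB.get(key.strip().lower())
        if canonical is None:
            unmapped.append(key)  # a gap in the vocabulary, surfaced early
        else:
            out[canonical] = value
    return out, unmapped

rec, missing = harmonize({"PM 2.5 ug/m3": 12.1, "Systolic BP": 118, "hr": 64})
print(rec)      # {'PM2.5_concentration': 12.1, 'systolic_blood_pressure': 118}
print(missing)  # ['hr'] -> flagged rather than silently dropped or guessed
```

A full ontology adds what a flat dictionary cannot: relationships among concepts (e.g., that PM2.5 is a kind of particulate matter), which is what enables movement across levels of abstraction.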
INTEGRATING DATA ON CLINICAL OUTCOMES
Lucila Ohno-Machado from the University of California, San Diego, described two issues with human clinical studies: (1) the allowed usage of the generated data is limited to uses that are specified in the consent for that study, and (2) the data are destroyed at the completion of many studies. Studies that rely on EHRs, however, do not require consent as long as the records have undergone de-identification following specifications in the Health Insurance Portability and Accountability Act (HIPAA). Given that integrating data across EHRs is not simple unless the data are in a common format, Ohno-Machado and her colleagues developed the Clinical Data Research Network, which includes EHRs from the University of California system, the University of Southern California, and the U.S. Department of Veterans Affairs enterprise clinical warehouse, which holds 21 million patient records. Creating this integrated database required developing compatible institutional policies, rules of engagement, and shared ethical principles. The result is a system that requires minimal computational infrastructure at the originating sites, said Ohno-Machado.
HIPAA de-identification regulations that enable EHR data to be used without explicit patient consent require the removal of 18 specific identifiers,7 including biometric data. Under those regulations, said Ohno-Machado, genomes are considered identifiers, and so using genomic data in a patient’s EHR requires patient consent. Ohno-Machado noted that what is not yet clear is whether wearable sensor data will also be treated as an identifier, given that research has shown that sensor data can be used to re-identify individuals.
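The removal of identifiers that Ohno-Machado described can be sketched as a simple field filter. The HIPAA Safe Harbor rule lists 18 identifier categories; this sketch names only a few hypothetical field names to show the shape of the operation, and real de-identification also requires handling identifiers embedded in free text and dates.

```python
# Toy Safe Harbor-style filter: drop fields treated as HIPAA identifiers.
# Field names here are hypothetical; per the discussion above, genomes
# are treated as identifiers and so are removed as well.
IDENTIFIER_FIELDS = {"name", "address", "phone", "email", "ssn",
                     "biometric_id", "genome"}

def deidentify(record):
    """Return a copy of the record with identifier fields removed."""
    return {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}

patient = {"name": "Jane Doe", "genome": "...", "diagnosis": "asthma",
           "year_of_birth": 1985}
print(deidentify(patient))  # {'diagnosis': 'asthma', 'year_of_birth': 1985}
```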
Although there are only a few major EHR vendors, every EHR is one of a kind because it is tailored to meet the design requirements for each health system, explained Marylyn Ritchie from the University of Pennsylvania Perelman School of Medicine. EHRs were developed in the 1990s for the primary purpose of billing insurance companies, not for doing research, and were later extended to assist with medical care, ordering procedures and medications, and scheduling. Ritchie said that when she took her first faculty position, she was skeptical that EHRs could be used for research, but her first project using an EHR, published in 2010 at Vanderbilt University, showed that it was possible to robustly replicate genotype–phenotype associations across multiple diseases using data in an EHR.8
6 Davis, A. P., C. J. Grondin, R. J. Johnson, D. Sciaky, B. L. King, R. McMorran, J. Wiegers, T. C. Wiegers, and C. J. Mattingly. 2017. The Comparative Toxicogenomics Database: Update 2017. Nucleic Acids Research 45(D1):D972–D978. doi: 10.1093/nar/gkw838.
Ritchie has since been involved with the Electronic Medical Record and Genomics Network, which integrates EHR genomics data from more than 100,000 individuals served by health systems around the country. One project based at the Marshfield Clinic in central Wisconsin is integrating EHR data with the results of a series of harmonized surveys known as the PhenX Toolkit that can provide environmental exposure data. Ritchie and her collaborators are using these datasets to look for associations between clinical traits in the EHR and environmental variables identified by the PhenX surveys.9 One analysis showed, for example, that the top two environmental correlates for type 2 diabetes were the frequency of alcohol consumption over 30 days and smoking at home, findings that were replicated in the National Health and Nutrition Examination Survey (NHANES). Another analysis found a correlation between cataracts and fatty acid consumption, a nutritional variable, and eventually linked the trait to variations in the genome.
In the future, said Ritchie, geocoding using mobile sensors, zip codes, and questionnaires will be important for using EHR data in environmental health studies given that most environmental exposure data are not captured in the EHR today. However, there are databases with geolocated environmental data that could be integrated with EHR data. The one caution is that a person’s address is a HIPAA identifier, making it important to have the proper institutional review board protections in place and to have patient consent to use geolocation data.
Mobile sensors, data integration, and advanced informatics will play a major role in the Pediatric Research using Integrated Sensor Monitoring Systems (PRISMS) project that the National Institute of Biomedical Imaging and Bioengineering launched in 2015. This project, explained Sandrah Eckel from the University of Southern California, aims to develop sensor-based, integrated health monitoring systems for measuring environmental, physiological, and behavioral factors in pediatric epidemiological studies of asthma and eventually other chronic diseases. Asthma, she noted, affects 1 in 12 people in the United States. The idea is that patients and various sensors will interact with PRISMS through smartphones or smartwatches that will securely upload data to the project’s informatics platform and data coordinating center. These individual-level data will be linked with external environmental data from sources such as the U.S. Environmental Protection Agency’s (EPA’s) monitoring networks or pollen counts and with EHR data. After synchronizing and integrating these data sources, PRISMS investigators will conduct predictive modeling that can be fed back to the patient or parent to both engage the patients and encourage patient compliance with an asthma management plan. Health care providers may also receive information from the system.
Children participating in PRISMS will receive smartwatches that can collect real-time data from built-in GPS, accelerometers, and gyroscopes, which will provide a measure of the child’s activity and microenvironment. Children will also carry portable Bluetooth-enabled spirometers—so-called smart inhalers that provide a geolocation and time stamp with every use—and sensors that can sample the environment and provide a personal measurement of exposure to air pollution. Currently, said Eckel, the project team is working on methods to process the torrent of data these sensors will generate, align the different sensor data streams, integrate them with external data sources, and use advanced analytics, including machine learning, to cluster patients and make predictions from individual baseline measurements. One question that she is working to answer is how often the team should update the PRISMS model with real-time streaming data so that it can inform the patient in a relatively short time frame about whether his or her risk for exacerbation has increased given the current environmental conditions and the individual’s activity level. Other challenges that the PRISMS team is addressing, said Eckel, include how to link sensor data and exposure data to health outcomes, given that exposures are measured at a much higher time resolution than the health outcomes will be measured, and linking GPS trajectories to exposure surfaces.
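One of the alignment challenges Eckel described, linking exposures measured at high time resolution to outcomes measured far less often, can be sketched by downsampling a fine-grained stream to the coarser stream’s resolution. The readings and symptom scores below are invented, and real pipelines must also handle gaps, clock drift, and quality flags.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical streams: minute-level PM2.5 readings vs. hourly symptom scores.
pm_readings = [  # (elapsed minutes, ug/m3)
    (0, 10.0), (15, 12.0), (45, 20.0),   # hour 0
    (70, 30.0), (110, 34.0),             # hour 1
]
symptom_scores = {0: 1, 1: 3}  # hour -> self-reported score

def hourly_mean(readings):
    """Downsample minute-level readings to hourly means so the exposure
    stream matches the (coarser) outcome stream's time resolution."""
    buckets = defaultdict(list)
    for minute, value in readings:
        buckets[minute // 60].append(value)
    return {hour: mean(vals) for hour, vals in buckets.items()}

exposure = hourly_mean(pm_readings)
aligned = [(h, exposure[h], symptom_scores[h]) for h in sorted(exposure)]
print(aligned)  # [(0, 14.0, 1), (1, 32.0, 3)]
```

Averaging is only one choice; depending on the health question, peak exposure or time above a threshold within each window may be the more relevant summary.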
According to Eckel, PRISMS can impact two major areas of science: environmental epidemiology research and personalized medicine. Time-resolved data, Eckel said, will help provide a better understanding of the timing between exposure and health response. Time-resolved data also help clarify the context of measurements of exposure. For example, these time-resolved data can be used to determine whether exposure to particulate matter from traffic is more or less harmful than exposure to particulate matter from cooking. Time-resolved and GPS data may help identify unknown sources of exposure and provide insights on the heterogeneity of response to exposure.
8 Ritchie, M. D., J. C. Denny, D. C. Crawford, A. H. Ramirez, J. B. Weiner, J. M. Pulley, M. A. Basford, K. Brown-Gentry, J. R. Balser, D. R. Masys, J. L. Haines, and D. M. Roden. 2010. Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record. American Journal of Human Genetics 86(4):560–572. doi: 10.1016/j.ajhg.2010.03.003.
9 McCarty, C. A., R. Berg, C. M. Rottscheit, C. J. Waudby, T. Kitchner, M. Brilliant, and M. D. Ritchie. 2014. Validation of PhenX measures in the personalized medicine research project for use in gene/environment studies. BMC Medical Genomics 7:3. doi: 10.1186/1755-8794-7-3.
Questions going forward include how long specific sensors need to be worn; whether they are measuring the right sources of data; how often they require calibration, given that they are inexpensive; and how to incorporate data quality metrics into models. In addition, the PRISMS team is developing privacy and ethics guidelines and procedures for providing feedback to the participants.
Margaret Karagas from Dartmouth College has been using the My Exposome wristband to track exposures on approximately 2,000 mother–infant pairs. Her research team is combining those data with ultrasound records from the mothers’ EHRs and metabolomic data collected during pregnancy, including from umbilical cord blood, in collaboration with the CHEAR initiative. The research team is also measuring exposures in early childhood using a wearable air monitor and accelerometer designed specifically for young children, and Karagas said she looked forward to collaborating with the PRISMS program to use some of its sensors. Other data for her studies come from geospatial observations of different environments, both terrestrial and aquatic, as well as data on environmental microbial communities; imaging data measuring bone density, body mass, and body fat; and even dental analysis to provide a trajectory of exposure starting in utero. The challenge she and her collaborators are dealing with is how to integrate the data coming from different data streams. Looking to the future, she has started a cross-disciplinary postdoctoral program to train investigators who can better tackle that problem.
Atul Butte from the University of California, San Francisco, discussed data integration efforts that are part of NIH’s All of Us research program. The All of Us research program, formerly known as the Precision Medicine Initiative, aims to create a cohort of 1 million or more individuals who will volunteer to be studied for 5 to 10 years. While the primary mission of this program is to “accelerate health research and medical breakthroughs that will enable individualized prevention, treatment, and care for everyone,”10 the All of Us database will also be incredibly useful for environmental health research, he emphasized. So far, the program has recruited more than 10,000 individuals who have been participating in beta testing the system, with a national launch expected in 2018. The program coordinators expect that half of the volunteers will come through health care provider organizations and half via direct recruitment. Although the participants will enter the program differently, they will have the same experience, which will include electronic consent, a basic physical examination, and the collection of urine and blood. Individuals recruited through their health care providers will also consent to provide access to their EHRs. In the future, participants may receive wearable devices and other sensors. One interesting aspect of All of Us is that the program solicited ideas from the research community for possible studies using the program’s data and let potential users vote on those ideas. One idea that made it into the top 10 was to explore how to better identify and assess environmental and genetic/physiological risk factors that lead to child/teen stress.
As an example of the type of research that is enabled by a huge, integrated database of health information, Butte described how he has used data in the University of California Research eXchange, which integrates EHR data from more than 15 million individuals in the University of California health system, to create maps of causes of death for Californians. The map for alcohol-related illnesses was rather simple, while the map for heart disease was far more complex and showed that a major cause of death among individuals who have had a heart attack was sepsis that developed several years later. The goal of this work is not to build maps, but to figure out the genetic, clinical, and environmental reasons that people progress from one disease state to another and to develop interventions that help prevent premature death.
INTEGRATING DATA FOR RISK ASSESSMENT
Kris Thayer from EPA described how the agency is using the Integrated Risk Information System (IRIS) to integrate chemical-specific information with data from human, animal, and mechanistic studies in the context of chemical health risk assessment. Thayer explained EPA’s systematic review process, which includes formulating the problem, identifying the available evidence, evaluating the quality of individual pieces of evidence, looking across studies to emphasize evidence integration approaches, and identifying hazards and deriving toxicity values. She stressed the importance of problem formulation, which affects decision-making and the screening of relevant, not supplemental, studies and increases the efficiency of the process. She noted that evidence integration is qualitative in IRIS assessments and is expressed in the context of confidence, although EPA is exploring quantitative methods for combining data. Thayer also explained that when looking at individual pieces of evidence, the focus is on separating decisions about internal validity from decisions about applicability.
Synthesizing evidence, Thayer explained, should be a systematic process that takes into account sensitivity and bias while describing individual studies rather than simply counting how many studies were positive or negative. Synthesis, she said, should rely on available medium and high confidence studies and try to assess the strength of the evidence across them. She added that synthesizing evidence from human and animal studies could draw attention to the need for particular mechanistic analyses and improve dose–response modeling and quantification of uncertainties in the data.
Thayer briefly discussed the Health Assessment Workspace Collaborative (HAWC), a free, Web-based, open-source software application that helps facilitate chemical assessments. HAWC enables evidence to be presented visually, which helps identify patterns in the data. She noted that the analyses of the quality of individual studies and pieces of evidence are challenges for complex data types. With regard to HAWC, Thayer said there is a need to monitor whether current structured frameworks for evidence synthesis and integration can accommodate newer types of evidence, including evidence derived from big data analyses. She added that it is necessary to determine if structured approaches for summarizing study design and methods need to be repurposed to change the way biomedical data are published. For example, linked Web-based data presentations might replace tables and charts in publications and make data synthesis across studies easier.
The basic goal of risk assessments, said Timothy Pastoor from Pastoor Science Communications, is to assimilate data to provide an understanding of what types of exposures will imperil someone’s health and how to keep people safe from exposures to potentially harmful chemicals in the environment. Risk and safety, he said, are functions of hazard and exposure. Toxicologists and epidemiologists identify potential hazards, but hazards are not meaningful for conducting a risk assessment in the absence of information about exposures, and information on exposures is often limited.
RISK21 is a scheme for conducting risk assessment that is driven by exposures. Developed by more than 120 participants over the course of 7 years, this framework starts by formulating a problem, developing a conceptual model, and generating specific data based on that problem and model. In contrast, the current paradigm generates reams of data and then sorts through them to find an answer to a problem formulated post hoc. Paraphrasing the famous statistician John Tukey, Pastoor explained that the goal should be to “have an approximate answer to the right question, rather than a precise answer to the wrong question,” and to have enough precision in the answer to make a decision.
Pastoor said that when conducting experiments to generate toxicity and exposure data, it is important for the toxicity studies and exposure studies to use the same units. As an example, he said that if a feeding study uses a dose of milligrams per kilogram per day, the exposure studies should also be done in milligrams per kilogram per day. In addition, exposure levels should be relevant to real-world human exposures. Then, plotting toxicity versus exposure will yield a risk matrix that can be used both to set exposure limits and to identify where more data are needed to improve a risk assessment. The RISK21 tool11 produces a visual matrix to communicate risk information. Pastoor cautioned that investigators should put their data through the systematic review process that Thayer discussed before using this tool.
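The comparison of toxicity and exposure in common units can be sketched as a margin-of-exposure calculation, one common way of expressing where a chemical falls on such a matrix. The benchmark, exposure estimate, and threshold below are hypothetical, not values from RISK21 or any regulatory program.

```python
def margin_of_exposure(toxicity_mg_kg_day, exposure_mg_kg_day):
    """Ratio of a toxicity benchmark to an estimated exposure, with both
    expressed in the SAME units (mg/kg/day), as the discussion above
    emphasizes. A larger ratio means more distance from the benchmark."""
    return toxicity_mg_kg_day / exposure_mg_kg_day

# Hypothetical values: a no-observed-adverse-effect level of 10 mg/kg/day
# against an estimated real-world human exposure of 0.002 mg/kg/day.
moe = margin_of_exposure(10.0, 0.002)
print(f"margin of exposure = {moe:.0f}")
if moe < 100:  # illustrative decision threshold, not a regulatory value
    print("exposure approaches the toxicity benchmark: refine the assessment")
```

The point of the units requirement is visible in the arithmetic: if the toxicity benchmark and the exposure estimate were in different units, the ratio, and any decision based on it, would be meaningless.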
One of the challenges in getting the data needed for risk assessment arises from the fact that most data are generated on what Barry Hardy from Douglas Connect and OpenTox12 called individual islands. As a result, there are many disconnects between data generation and processing, as well as between data integration and use. Building bridges between these islands will require investigators to work together on planning experiments by describing the data and making them accessible, usable, and actionable for specific purposes.
The first application of OpenTox’s extensible and interoperable approach to data integration and analysis was Bioclipse OpenTox, a decision-making tool for drug design. Since then, OpenTox has been constituted as an international nonprofit that recognizes the importance of open data, as well as open knowledge, methods, tools, and resources supporting predictive toxicology, safety assessment, and risk management. Hardy said that OpenTox, which has encouraged the formation of local chapters, sponsors community activities such as data hackathons and workshops that bring data scientists and scientists from other fields together to develop data processing tools.
ToxBank,13 a project supported by a public–private partnership between the European Commission and the European cosmetics industry, created new methods of evaluating repeated-dose toxicity. This project, said Hardy, tackled numerous issues, including the need to integrate infrastructure and processes, the selection of common reference compounds and biologicals used by all of the partners and projects across the program, case study driven research, and documentation and storage of all protocols used in the program linked to the datasets generated. ToxBank introduced the use of data templates, which described the experimental biological metadata based on emerging standards, and it also uncovered significant gaps between laboratory practices and the practices needed to achieve data integration. A subsequent project, EU-ToxRisk, is extending this work.
OpenRiskNet14 is the second-generation successor to OpenTox. OpenRiskNet uses case studies to develop an open engineering infrastructure to support data interoperability and data use for many aspects of risk assessment, explained Hardy. OpenRiskNet contains an interoperability layer that provides harmonized access to different sources of data, including data from European projects, U.S. agencies, and even databases in Japan and South Korea. Openness, said Hardy, enables innovation, and transparency drives the acceptance of the new methods developed for these projects. Attention to detail on the description, processing, and analysis of data provides a foundation for developing the reproducible framework and solutions that the field of risk assessment has been missing, he added.
Going forward, Hardy emphasized the importance of defining the best in silico practices for regulatory acceptance and of using many new sources of data that are becoming available. Toward that end, OpenRiskNet is supporting the construction of evidence-based toxicology workflows supported by knowledge interactions from heterogeneous sources of data, including in vivo and in vitro data, human adverse event reporting, and the systematic review of the literature, he said.
FRAMEWORK AND DESIGN PRINCIPLES FOR VISUAL COMMUNICATION OF DATA
Making sense of data visualization requires basic literacy—the ability to read and write text—visual literacy, and data literacy, explained Katy Börner from Indiana University. To develop a baseline for the average American’s data visualization literacy,15 she and her colleagues asked 1,000 youths and their caregivers at 6 U.S. science museums to name, interpret, and envision how they could use different visualizations. Perhaps not surprisingly, data visualization literacy was rather low. To remedy this situation, Börner and her colleagues are developing software bundles, which she calls macroscopes, that can incorporate new datasets and algorithms and produce visualizations that help communicate the results of an analysis. These macroscopes, she said, help find patterns, trends, and structures in large-scale or small-scale datasets. In addition to developing these tools, Börner and her collaborators teach an online information visualization course each spring that draws students of all ages from 100 countries and many different scientific disciplines. The course is also offered for free asynchronously, and anyone who finishes the 15-week course, at 8 hours per week, with good scores receives a Mozilla badge.
Börner pointed out that there are many different types of visualizations and that the field could use an ontology to describe them. Each type of visualization is useful for a particular purpose, such as looking for trends, seeking correlations, characterizing distributions, or creating rank orders of variables, and different types of visualizations can be useful for conveying the same information to different audiences. There are also different data scale types and different types, sizes, shapes, and colors of graphic symbols that can be overlaid on maps. There are even tools that will optimize the colors used in visualizations so that a colorblind individual can read them.
One typical standardized dataset is the table, said Miriah Meyer of The University of Utah, with the rows representing sets of items and the columns representing attributes. Tables can be multidimensional and can be visualized in several ways, including scatterplots, bar charts, and line charts. Another common dataset type comprises networks and trees, in which nodes have relationships with one another that can be visualized as linked nodes connected by edges.
Attributes come in different types. For example, categorical attributes have no implicit ordering, such as different types of fruit. Ordered attributes can be ordinal, where there is a meaningful order, such as bronze, silver, and gold, or quantitative, which is both ordered and has a meaningful magnitude, such as temperature or height. Understanding the properties of an attribute is vital, said Meyer, for choosing the right kind of visual encoding channels used to create visualizations. She noted that controlled laboratory studies have looked at the effectiveness of different encoding channels for their ability to convey information.
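The attribute-to-channel matching Meyer describes can be sketched as a simple lookup. The channel rankings below are illustrative assumptions for demonstration only, not her published guidance.

```python
# Illustrative sketch: choosing a visual encoding channel by attribute type.
# The channel rankings are simplified assumptions, not established guidance.

ENCODING_CHANNELS = {
    # Quantitative: ordered, with meaningful magnitude (e.g., temperature)
    "quantitative": ["position", "length", "angle", "area"],
    # Ordinal: meaningful order, no magnitude (e.g., bronze/silver/gold)
    "ordinal": ["position", "lightness", "size"],
    # Categorical: no implicit ordering (e.g., types of fruit)
    "categorical": ["spatial region", "hue", "shape"],
}

def best_channel(attribute_type: str) -> str:
    """Return the strongest encoding channel for a given attribute type."""
    try:
        return ENCODING_CHANNELS[attribute_type][0]
    except KeyError:
        raise ValueError(f"unknown attribute type: {attribute_type}")

print(best_channel("quantitative"))  # position
```

The point of the lookup is simply that the attribute type, not the data values themselves, constrains which channels can encode the data faithfully.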
Color, as it turns out, is a challenging encoding technique for several reasons. Background colors, for example, can alter how two gradations of a color will be interpreted. A good rule of thumb, said Meyer, is to avoid using color in visualizations. While spatial encoding is the strongest way to represent data, three-dimensional visualizations are problematic, particularly on a digital interface, she explained, unless the user has the ability to rotate the three-dimensional visualization. Animations are another poor choice for visualization, even when the data have a temporal component, because human short-term working memory is limited, explained Meyer. Instead, she recommended using a technique that she and her colleagues in the visualization community call small multiples, which are repeated visuals representing relevant time steps that the user can scan visually and make fine-grained comparisons that would be nearly impossible while watching a video.
The hardest part of choosing the right visualization for a given dataset is understanding what needs to be visualized and extracted from the data to address the problem at hand, said Meyer. In her role as co-director of The University of Utah’s visualization design laboratory, she spends a great deal of time working with researchers to understand the problem they want to address and how to encode the data to provide useful insights. She and her team create rapid prototypes to try ideas and then test them in the wild to get a sense of how they impact the world. The process is iterative, she said, and going through this user-centered design process with the researchers often leads to new questions.
SCIENCE, OPEN COLLABORATION, AND NEW METHODS
The opportunity today in the world of data networking, said John Wilbanks of the nonprofit Sage Bionetworks, is in determining how to connect data silos, which is different than building data commons. In thinking about this problem, Wilbanks alluded to the three pillars of modern scientific methods—team science, open science, and participant-centered science—and the need to develop pilot systems and approaches that can network these three in a way that creates open systems, provides incentives to participate in the resulting network, and makes it the norm to do so. He realizes this ideal system requires an infrastructure to provide robust, reusable solutions that can support member-driven, voluntary research communities. He explained that as it becomes the norm for scientists to share and integrate their data, demonstrating the ease and potential benefits of doing so should generate momentum within that scientific community. Once there is a large enough group of scientists getting the benefit of using the resource, others will be incentivized to join for fear of being outcompeted by their peers.
15 The ability to understand visual representations of data, including charts, graphs, and other illustrations.
One example of such a community is the Accelerating Medicines Partnership–Alzheimer’s Disease involving NIH, 10 biopharmaceutical companies, and several nonprofit organizations that tests six “radically different” scientific approaches to Alzheimer’s disease target discovery. Sage Bionetworks’ role in this effort is to use technical, social, and analytical processes to promote team science and open science practices. Wilbanks characterized this role as being the experts in collaboration, networking, and data sharing so that the researchers can be the experts in Alzheimer’s disease research. He noted that the participating investigators had some difficulty adjusting to this networked approach to research, but by the middle of the second year, even the most stubborn investigators began to see the benefits of this approach. Now in its third year, the project has won over its research groups, who are training new lab members in this way of doing science. The six teams began the project by working independently, in what Wilbanks called the fully closed phase, to identify new targets and explore new hypotheses. In the team phase, the research teams compare evidence to select the top targets, and in the final open phase, research and data will be presented to the larger community for input and external use.
In another project on colorectal cancer subtyping, the individual teams generated four papers on four different subtypes using four different datasets and algorithms. After using what Wilbanks termed “diplomacy,” the Sage team convinced the investigators to share their data and write a group consensus paper with a consensus molecular subtype that none of the groups would have been able to identify individually.16
One scientific need is for optimized algorithms developed through an unbiased, consistent, and rigorous assessment that samples a space of diverse methods. Wilbanks explained that Sage issues “challenges,” community competitions to create open source consensus methods that journals, regulators, and governments can use. Through the challenge approach, Sage builds consensus algorithms out of the top sets of challenge submissions and simultaneously creates an open source benchmark for investigators. Sage has done this for prostate cancer, digital demography, and single cell parameter estimation.
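One simple way to combine top challenge submissions into a consensus method, sketched here under the assumption that each submission outputs a numeric prediction per sample, is to average their outputs. This is an illustrative simplification; Sage's actual aggregation procedures are more sophisticated.

```python
# Minimal sketch of a consensus predictor built from challenge submissions:
# average the predictions of the top-scoring models for each sample.
# This is an illustrative assumption, not Sage Bionetworks' actual method.

def consensus_predictions(submissions):
    """submissions: list of per-sample prediction lists, all the same length."""
    n_models = len(submissions)
    n_samples = len(submissions[0])
    return [
        sum(model[i] for model in submissions) / n_models
        for i in range(n_samples)
    ]

# Hypothetical predictions from the top three submissions on two samples.
top_three = [
    [3.0, 6.0],
    [6.0, 9.0],
    [9.0, 12.0],
]
print(consensus_predictions(top_three))  # [6.0, 9.0]
```

Averaging across diverse methods tends to cancel method-specific errors, which is one reason a consensus built from top submissions can double as an open benchmark for later investigators.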
According to Wilbanks, one of the drivers for the life sciences to embrace mobile health as a research tool is the need to expand the size of observational studies that lead to diagnostic and treatment guidelines for the entire country. Currently, said Wilbanks, approximately 200 million Americans have smartphones, with penetration at 75 to 80 percent and rising in every major demographic group.17 These numbers are sure to increase over the next few years. To take advantage of this opportunity to involve the public in research, Sage partnered with Apple to create an open-source application framework, the Apple ResearchKit, that lets anyone build a medical research application for use on an iPhone. For example, Sage collaborated with researchers at the University of Rochester to develop mPower for Parkinson’s disease. mPower precisely measures dexterity, balance, memory, and gait—data that can help researchers and participants learn, recognize, and understand more about an individual’s symptoms. In the first 6 months of the project, more than 14,600 participants enrolled in the study, 9,500 of whom agreed to share their data broadly. Because the mPower application enables participants to take measurements before and after taking their medications, the resulting data have been able to identify people who have different responses to the same medication or whose response to the medication changes in reaction to some stimuli.
A limitation of this type of application is that it cannot make recommendations to patients, because then it would be regulated by the U.S. Food and Drug Administration. The solution was to provide data back to people in a format that allows them to draw their own conclusions. An advantage of more participant-centered science, said Wilbanks, is that it does not have as many institutional forces dragging it toward enclosure. He said all that is needed is a method of enabling informed consent, although creating a consent process that actually informs someone is not easy. Doing so on a phone requires treating consent as a design problem that links a picture to a headline, which slows down how the eye moves over a phone screen. This design for the consent process is required by Apple for all of the apps built using its ResearchKit. Sage has, in turn, brought this consent framework to Android phones, and it is the basis for the All of Us research program’s consent process.
One interesting finding from user surveys was that participants were upset when they found out that clinical scientists were not sharing their data. As a result, Sage requires anyone who wants access to its Cloud infrastructure to offer this as a choice to participants, and 70 percent of individuals across all of the studies Sage supports elect to donate their data to science. To protect against the negative externalities of letting anonymous users access the data, potential users have to pass a test, validate their identity, file a data use statement, and sign an oath of ethical data usage. In turn, Sage informs study participants about how their data are being used.
16 Guinney, J., R. Dienstmann, X. Wang, et al. 2015. The consensus molecular subtypes of colorectal cancer. Nature Medicine 21:1350–1356. doi: 10.1038/ nm.3967.
As an example of how technology can enable the generation of high-resolution environmental data, Melissa Lunden from Aclima discussed how her company is using Google Street View cars to create a ubiquitous, real-time sensor network for collecting fine-grained environmental data that can enable better decision-making. The Aclima platform, she explained, combines leading-edge sensing technology, Cloud computing, and artificial intelligence to generate a real-time picture of the environment at a cost that she said is 100 to 1,000 times less expensive than the current methods for measuring air pollutants. The goal is to provide environmental intelligence as a service to a range of users.
Aclima started developing its system with a 500-node network inside Google’s buildings. Building interiors represent an important environment given that most Americans now spend 90 percent of their time indoors and indoor air pollution levels can be higher than outdoor levels. She noted, for example, that indoor carbon dioxide levels in office buildings, conference rooms, and even cars can easily exceed the 1,000 parts per million (ppm) level that starts degrading cognitive performance. In one naturally ventilated school in which Aclima installed sensors, classroom carbon dioxide levels reached 3,500 ppm. By mapping the indoor space and combining building data with environmental and physiological measures, it is possible to understand how to design or modify spaces to optimize human performance and health, said Lunden.
Currently, outdoor air pollution monitoring occurs with broad spatial resolution, but by piggybacking on Google’s Street View cars, Aclima has been able to create what Lunden called hyper-local, city-wide maps of ozone, nitric oxide and nitrogen dioxide, black carbon (soot), and particulate matter across different particle size ranges for the entire San Francisco–Oakland metropolitan area. Instruments also measure location and meteorological variables. The sensors on the cars take readings every second and the data are streamed to the Cloud in real time.18 Questions that still need answers, she said, include how often a car needs to drive down a street to get meaningful readings and how to optimize routes to reduce driving time while still getting the desired data.
Lunden noted that averaging the high-resolution data around an area that includes a standard environmental monitoring station produces a value that closely matches the data currently used for regulatory purposes. What these more localized data can enable, however, is the ability to specifically tackle those areas of high pollutant concentration to get a bigger bang for the buck, she added. One finding from this work has been that the degree of variability is a function of the pollutant and the location. For example, the distribution of black carbon and other particulate matter showed significant differences by neighborhood.
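The comparison Lunden describes, averaging mobile readings collected near a fixed monitoring station, can be sketched as a distance-filtered mean. The 100-meter radius and the local grid coordinates here are illustrative assumptions.

```python
import math

# Sketch: average mobile-sensor readings within a radius of a fixed station,
# for comparison against that station's regulatory measurements.
# Coordinates are meters on a local grid; the 100 m radius is an assumption.

def station_area_average(readings, station_xy, radius_m=100.0):
    """readings: list of (x, y, value); returns mean of values within radius."""
    sx, sy = station_xy
    nearby = [
        v for x, y, v in readings
        if math.hypot(x - sx, y - sy) <= radius_m
    ]
    if not nearby:
        raise ValueError("no readings within radius of station")
    return sum(nearby) / len(nearby)

# Two readings near the station, one far away that is excluded.
readings = [(10, 5, 22.0), (50, 40, 30.0), (400, 0, 90.0)]
print(station_area_average(readings, station_xy=(0, 0)))  # 26.0
```

The same filter-and-aggregate step, run over every street segment instead of one station, is what turns second-by-second drive data into a hyper-local map.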
As a final point, Lunden said that public institutions are not set up to deal with this type of spatially and temporally high resolution data and do not have the tools or budgets to access data from the Cloud, which is expensive. This is why it will be important to deliver actionable data products and tools, rather than just data, for this type of large-scale, fine-grained sensor system to be useful for addressing real-world environmental health issues.
Jacob Abernathy from the Georgia Institute of Technology became involved in the response to the Flint, Michigan, lead-tainted water crisis through a program he had developed while he was at the University of Michigan. That program had students engage in collaborative competitions to help local organizations with various data-intensive analytical problems. At the time, there was a huge dataset being generated by homeowners submitting water samples to be tested for lead concentrations. When mapped to location, the homes with high lead levels were found to be scattered across the city, so early attempts to use the data to model contamination and predict which homes would be most at risk were not successful. In response to this challenge, Abernathy’s student group produced significant performance improvements in the model that have been incorporated into an app for Flint residents.
From that work, Abernathy became involved in the Flint service line replacement program. The service line is the pipe that connects a city’s water distribution system to each house. For much of the 20th century, service line pipes were made of lead, and an estimated 7.5 million homes in the United States have lead service lines that would cost as much as $250 billion to replace. One of the big issues with replacing these lines, he explained, is that they are buried and in many instances their exact locations and the material from which they are made are unknown. In Flint, records were kept on 3 × 5 index cards and incomplete handwritten maps stored in the basement of City Hall. Abernathy is working with a private company to digitize more than 100,000 of these records, which is a challenging task.
Abernathy and his team also developed a data collection smartphone app that contractors can use to enter information about what they find when they dig at a location to replace a service line. The information is uploaded in real time and can help make recommendations for how the contractors should proceed with a given replacement. They also developed a statistical model that predicts which homes are likely to have water service pipes made out of lead or other materials. He noted that these models have helped program coordinators to choose homes for replacement service lines, although not to the extent that Abernathy would have preferred due to political challenges to the process. He was, however, able to convince city leaders to use the model to guide where relatively small and inexpensive inspection holes would be dug to check on whether pipes were copper, and therefore safe, or if they needed to be replaced for a fraction of the cost of digging a large hole at random. Overall, statistical modeling and active inspection procedures have reduced costs by roughly 10 percent, he said, which means that 2,000 additional homes are receiving replacement service lines from the funds allocated by the state and federal governments.
18 Apte, J. S., K. P. Messier, S. Gani, M. Brauer, T. W. Kirchstetter, M. M. Lunden, J. D. Marshall, C. J. Portier, R. C. H. Vermeulen, and S. P. Hamburg. 2017. High-resolution air pollution mapping with Google Street View Cars: Exploiting big data. Environmental Science and Technology 51(12):6999–7008.
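The triage logic behind the inspection-hole strategy can be sketched as a simple decision rule on the model's predicted probability that a home has a lead line. The thresholds and addresses below are hypothetical illustrations, not the actual Flint model.

```python
# Illustrative triage sketch (not the actual Flint model): given a model's
# predicted probability that a home's service line is lead, decide whether
# to excavate for replacement, dig a cheap inspection hole, or deprioritize.
# The 0.8 and 0.3 thresholds are assumptions chosen for demonstration.

def triage(p_lead, replace_above=0.8, inspect_above=0.3):
    if p_lead >= replace_above:
        return "excavate and replace"   # confident enough to dig the big hole
    if p_lead >= inspect_above:
        return "dig inspection hole"    # cheap check resolves the uncertainty
    return "deprioritize"               # likely copper, and therefore safe

homes = {"12 Elm St": 0.92, "48 Oak Ave": 0.55, "7 Pine Ct": 0.10}
for address, p in homes.items():
    print(address, "->", triage(p))
```

Reserving expensive excavation for high-probability homes and cheap inspection holes for uncertain ones is what let the same budget reach more homes.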
In the workshop’s final presentation, Thomas Seager from Arizona State University discussed how data integration can inform a product’s environmental impact over its entire lifecycle. The problem with lifecycle analysis as it is currently done is that the results are not particularly useful for informing how people make decisions, he said. A different approach that he and many collaborators have developed attempts to understand tradeoffs between different products in a way that better informs decisions in the context of human values. Another problem when it comes to environmental decisions is that the units of comparison are often not comparable. Having incomparable units leads to value-driven questions, such as the level of human illness or disease that is acceptable before decision-makers are willing to implement policies to reduce air pollution.
Given the imperfect data that come from lifecycle analyses, Seager suggests discontinuing the use of point estimates. What he wants to see are probability distributions, and he and his collaborators model inventory data as a distribution when conducting a lifecycle analysis. They also take a stochastic approach to modeling stakeholder decisions to reflect the fact that little is known about how much weight should be assigned to each environmental category. The output of the model is a confidence score for one alternative over the other for a single criterion. He then accumulates these preferences and produces an overall estimate of preference for one choice over another.
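A stochastic comparison in the spirit Seager describes can be sketched by sampling uncertain criterion weights and uncertain per-criterion scores, then reporting how often one alternative beats the other. The uniform distributions and score ranges below are illustrative assumptions, not his actual model.

```python
import random

# Sketch of a stochastic multi-criteria comparison: because little is known
# about how much weight each environmental category deserves, draw random
# weights per trial, score both alternatives from their (assumed) score
# ranges, and report a confidence score for A over B. Illustrative only.

def preference_confidence(score_ranges_a, score_ranges_b, n_trials=10000, seed=42):
    """score_ranges_*: list of (low, high) per criterion; returns P(A beats B)."""
    rng = random.Random(seed)
    n_criteria = len(score_ranges_a)
    wins = 0
    for _ in range(n_trials):
        # Unknown weights: draw a random weight per criterion and normalize.
        raw = [rng.random() for _ in range(n_criteria)]
        total = sum(raw)
        weights = [w / total for w in raw]
        a = sum(w * rng.uniform(lo, hi) for w, (lo, hi) in zip(weights, score_ranges_a))
        b = sum(w * rng.uniform(lo, hi) for w, (lo, hi) in zip(weights, score_ranges_b))
        wins += a > b
    return wins / n_trials

# Alternative A scores somewhat higher on both criteria, with overlap.
conf = preference_confidence([(0.6, 0.9), (0.5, 0.8)], [(0.4, 0.7), (0.4, 0.7)])
print(round(conf, 2))  # a confidence score well above 0.5
```

The output is a probability rather than a point estimate, which is exactly the replacement for point estimates that Seager advocates: a decision-maker sees how confident the analysis is, not just which alternative nominally wins.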
Many workshop presenters and discussants emphasized that developing approaches for integrating data from multiple disparate sources has great potential for informing environmental health decisions. As emphasized by Waller, Gennings, and many other workshop participants, the challenges associated with data integration are significant, but investigators are making headway and producing results in forms that decision-makers find valuable.
Disclaimer: This Proceedings of a Workshop—in Brief was prepared by Joe Alper as a factual summary of what occurred at the workshop. The planning committee’s role was limited to planning the workshop. The statements made are those of the rapporteur or individual meeting participants and do not necessarily represent the views of all meeting participants, the planning committee, or the National Academies of Sciences, Engineering, and Medicine.
Reviewers: To ensure that it meets institutional standards for quality and objectivity, this Proceedings of a Workshop—in Brief was reviewed by Bhramar Mukherjee, University of Michigan; Chirag Patel, Harvard University; Marylyn D. Ritchie, University of Pennsylvania; and Lance Waller, Emory University.
Planning Committee for Informing Environmental Health Decisions Through Data Integration: Kim Boekelheide (Chair), Brown University; Chris Gennings, Mount Sinai Health System; Margaret Karagas, Dartmouth College; Patrick McMullen, ScitoVation; Donna Mendrick, U.S. Food and Drug Administration; David Reif, North Carolina State University; Gina Solomon, University of California, San Francisco; and Lance Waller, Emory University.
Sponsor: This workshop was sponsored by the National Institute of Environmental Health Sciences.
About the Standing Committee on Emerging Science for Environmental Health Decisions
The Standing Committee on Emerging Science for Environmental Health Decisions is sponsored by the National Institute of Environmental Health Sciences to examine, explore, and consider issues on the use of emerging science for environmental health decisions. The Standing Committee’s workshops provide a public venue for communication among government, industry, environmental groups, and the academic community about scientific advances in methods and approaches that can be used in the identification, quantification, and control of environmental impacts on human health. Presentations and proceedings such as this one are made broadly available, including at http://nas-sites.org/emergingscience.
Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2018. Informing Environmental Health Decisions Through Data Integration: Proceedings of a Workshop—in Brief. Washington, DC: The National Academies Press. doi: http://doi.org/10.17226/25139.
Division on Earth and Life Sciences
Copyright 2018 by the National Academy of Sciences. All rights reserved.