Proceedings of a Workshop
Leveraging Artificial Intelligence and Machine Learning to Advance Environmental Health Research and Decisions
Proceedings of a Workshop—in Brief
Artificial intelligence (AI) is a technological invention that promises to transform everyday life and the world. Investment in and enthusiasm for AI—or the ability of machines to carry out “smart” tasks—are driven largely by advancements in the subfield of machine learning. Machine learning algorithms can analyze large volumes of complex data to find patterns and make predictions, often exceeding the accuracy and efficiency of people who are attempting the same task. Powered by tremendous growth in data collection and availability as well as computing power and accessibility, AI and machine learning applications are becoming commonplace in many aspects of modern society, as well as in a growing number of scientific disciplines.
On June 6–7, 2019, the National Academies of Sciences, Engineering, and Medicine’s Standing Committee on the Use of Emerging Science for Environmental Health Decisions held a 2-day workshop to explore emerging applications and implications of AI and machine learning in environmental health research and decisions. Speakers highlighted the use of AI and machine learning to characterize sources of pollution, predict chemical toxicity, and estimate human exposures to contaminants, among other applications. Though these applications are promising, questions remain about the use of AI and machine learning in environmental health research and public policy decisions. For example, workshop participants examined how a lack of transparency and interpretability of AI systems compounds fundamental issues about the availability, quality, bias, and uncertainty in the data used to develop machine learning algorithms. Participants also discussed how these issues may affect the reproducibility and replicability of results, lead to misleading or inaccurate results, and potentially diminish social trust in research. “We have smart technologies everywhere. They are more ubiquitous than we could have ever expected,” stated Melissa Perry of The George Washington University, co-chair of the standing committee and member of the organizing committee, in her welcoming remarks. This workshop is “intended to be an anchor from which new ideas are generated, influencing your work and influencing new investigations and collaborations,” said Perry.
The workshop was sponsored by the National Institute of Environmental Health Sciences (NIEHS). This Proceedings of a Workshop—in Brief summarizes the discussions that took place at the workshop, with emphasis on the comments from invited speakers.
OPPORTUNITIES FOR AI IN ENVIRONMENTAL HEALTH
Jason Moore of the University of Pennsylvania briefly described the origin of the field of AI. The basic idea that underlies the field of AI is “whether we can get computers to plan and solve problems, and reason,” said Moore. In the 1950s, the now famous computer scientist Alan Turing first asked whether machines can think and how a person could know if a machine could think. “That point at which you can’t tell the difference between talking to a real person and talking to a computer, the computer is said to pass the Turing Test,” explained Moore. There are two broad approaches to AI: top-down AI and bottom-up AI. The top-down approach is to build a machine that can mimic how the human brain works, or an “artificial brain,” said Moore. However, he said that most AI researchers focus on bottom-up approaches.
They develop simple mathematical components that when pieced together collectively exhibit complex behavior. Deep learning neural networks that can accurately examine and classify images are an example of bottom-up AI, said Moore. The term “artificial intelligence” was coined during a workshop held at Dartmouth College in 1956. During that workshop, leaders in computer science recognized the need “to unify the field behind one name,” said Moore.
Moore’s research focuses on the different pathways or processes people use to analyze datasets. When people, whether statisticians, biologists, or informaticians, work with a new dataset, they often try different analyses to determine what works and what does not work, explained Moore. “Can we get a computer [to] do this? That is the AI piece,” stated Moore. This is particularly important in the era of “big data,” in which datasets can have potentially millions of features or variables. For example, predicting the impact of one or more toxic exposures on human health requires understanding how myriad biological and environmental variables work together. Machine learning can help with this, said Moore. “Machine learning is ideally suited to look at many factors simultaneously, model the complexities of interactions among all of these factors, and determine how that produces some clinical endpoint,” said Moore.
A challenge is knowing how to best pair machine learning methods with particular analyses. Moore explained that each of the dozens of available machine learning methods takes a different approach to finding patterns in data. Picking a method is difficult without knowing what kind of patterns the data will reveal. To help researchers, Moore created PennAI,1 an open-source and transparent system for selecting the most appropriate machine learning method. PennAI includes a comprehensive library of available machine learning methods and tests how a machine learning method analyzes a dataset, explained Moore. PennAI also learns from its own results so that it can continually improve on selecting appropriate machine learning methods to apply to different types of datasets, stated Moore.
Bhramar Mukherjee from the University of Michigan described the benefits of machine learning and AI as their ability to capture complexity, rely on less structured assumptions, and make better predictions. However, she argued that AI is better thought of as much more than just algorithms that can make predictions. She highlighted a recent article that argued that mapping observed inputs to observed outputs (i.e., prediction) does not really qualify as intelligence.2 In other words, “there is a life and science beyond prediction,” that includes characterizing uncertainty, principled data integration, establishing causality, developing generalizable and representative models, adequate sampling, and incorporating important scientific or clinical contexts into problems. AI integrates expert knowledge and the ability to make causal inferences to develop counterfactual predictions about the effect that different inputs will have on outcomes, emphasized Mukherjee. This framing is particularly important to recognize in order to facilitate the use of AI for transforming big data in environmental health into knowledge.
Together with multiple collaborators, Mukherjee is developing an AI system that can predict future health endpoints based on the environmental exposures to a moderate number of pollutants. This effort, which includes several longitudinal studies, focuses on how maternal exposures during pregnancy affect the future health of both mother and child. A key aspect of these studies is that they are looking at the effect of exposures to the multiple chemical and non-chemical stressors that everyone experiences every day, said Mukherjee. Her research team is attempting to use machine learning to develop a cumulative environmental risk score that goes beyond standard linear models. “You need computational tools. You need modern computation theory. You need good statistical inference,” she stated. She also emphasized that reproducible and rigorous science requires improved measurement processes, large sample sizes, principled methods, and user-friendly software.
Richard Woychik of NIEHS highlighted five areas of environmental health for which AI and machine learning could play an integral role in research: predicting the toxicology of chemicals, measuring the exposome,3 understanding the interactions between genes and environmental exposures, examining the role of epigenetics, and supporting systematic reviews4 of scientific literature. Woychik imagined a future in which AI could integrate environmental health datasets—such as curated legacy data, results of high-throughput screening assays, and details about specific chemical features—and provide “everything we need to know about various different chemicals,” within a few hours. However, Woychik noted that a major challenge is determining which datasets to use because biology is complex. “What datasets do you need to actually plug into an [AI] algorithm to [generate] a better understanding of differential susceptibility to our environment that we inherited from our parents?”
In a panel discussion, Moore, Mukherjee, and Woychik all stressed the importance of embracing complexity and of measuring and accounting for the exposome—that is, moving away from looking at one or a few environmental factors at a time—when studying the link between environment and health. AI and machine learning, along with a framework for how the data will be used, will play essential roles in enabling the resulting complex analyses that will be capable of modeling actual human biology. Woychik also stressed the need to have a framework ontology for how data will be used before starting data collection as a means of ensuring that it will be possible to merge the resulting database with the various available -omics databases, an idea Moore, Mukherjee, and Nicole Kleinstreuer from NIEHS’s National Toxicology Program seconded.
2 Hernán, M. A., J. Hsu, and B. Healy. 2019. A second chance to get causal inference right: A classification of data science tasks. Chance 32(1):42–49.
3 The totality of a person’s exposures from conception onward.
4 A structured method to collect, appraise, and synthesize data from published literature.
The panelists discussed when it is appropriate to use an AI system that makes predictions5 versus one that produces inferences.6 Moore pointed out that a predictive system would be useful when investigators do not need information about the biological mechanisms that establish causality. For example, a dermatologist who uses AI to diagnose melanoma from a skin image probably does not need to know the key biological features being used by the AI algorithm to make the prediction. In contrast, Moore pointed out that an inferential AI system would be more appropriate when mechanism and causality are important, such as trying to understand the mechanism by which a toxic metal affects human biology. However, the two approaches are not mutually exclusive, and using both types to address the same problem can provide complementary information.
PREDICTING CHEMICAL TOXICITY: THE APPLICATION OF AI TO CHEMICAL HAZARDS CHARACTERIZATION
To characterize chemical hazards, Thomas Luechtefeld from Insilica developed a large neural network that breaks down a given chemical of interest into various functional groups and other features and uses those to produce globally recognized chemical hazard labels. The analysis uses structural data contained in the PubChem database to create a measure of similarity between the chemical of interest and some 200,000 chemicals that have been classified according to 74 different hazards. For example, a chemical might be similar in structure and functional groups to others that are known mutagens or acute dermal hazards. Luechtefeld noted the importance of starting with as large a database as possible given that the goal is to generalize to the 98 million compounds in the PubChem database.
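The similarity idea described above can be illustrated with a minimal sketch. It is not Luechtefeld's neural network: it assumes each chemical is represented as a set of structural feature names and uses the Tanimoto (Jaccard) coefficient, a widely used chemical similarity measure, to borrow hazard labels from the most similar labeled chemical. The chemical names, features, and hazard labels are hypothetical and chosen only for illustration.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_labeled(query, labeled):
    """Return (name, similarity, hazards) for the most similar labeled chemical."""
    best_name = max(labeled, key=lambda name: tanimoto(query, labeled[name][0]))
    features, hazards = labeled[best_name]
    return best_name, tanimoto(query, features), hazards


# Hypothetical reference set: name -> (structural features, hazard labels)
LABELED = {
    "chem_A": ({"nitro", "aromatic_ring", "amine"}, {"mutagen"}),
    "chem_B": ({"carboxylic_acid", "alkyl_chain"}, {"skin_irritant"}),
}

# Hypothetical unlabeled chemical of interest
query_features = {"nitro", "aromatic_ring", "halogen"}
name, score, hazards = nearest_labeled(query_features, LABELED)
print(name, round(score, 2), sorted(hazards))
```

A real system would use far richer structural descriptors and many neighbors rather than a single nearest match; the sketch only shows how structural overlap can be turned into a similarity score.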
The computational models Luechtefeld developed for each of the 74 hazards do not operate in isolation but rather benefit from transfer learning, recognizing that a model for one hazard can inform a model for another hazard while increasing the amount of data available for analysis. He discussed how he uses the nearly 100 million chemicals in PubChem to determine whether to trust a model. Luechtefeld tested whether a model can correctly identify which PubChem chemicals without hazard labels have features similar to those of chemicals that are well characterized and have known toxicities.
Kleinstreuer is addressing the untested chemical problem—some 140,000 chemicals are used in commerce, but fewer than 10 percent have been tested for toxicity using a limited number of low-throughput, animal-based assays—by mining the existing quantitative structure-activity relationship (QSAR) databases and crowdsourcing consensus models. Her team’s approach starts with curating high-quality training datasets and releasing them to experienced modeling groups around the world that are interested in participating in this effort. The goal is to identify models that can make accurate predictions across large chemical sets to inform regulatory decision making.
Kleinstreuer discussed a project to identify chemicals with the potential to disrupt endocrine function through interactions with either estrogen or androgen receptors. The models relied on integrating data from high-throughput screening programs such as the Toxicology in the 21st Century (Tox21) program,7 which also provides excellent coverage of the biological pathways corresponding to these two receptors, she added. Nearly 50 models of the estrogen receptor pathway and more than 90 models of the androgen receptor pathway were assessed. While no one model can cover all of the chemicals of interest, each model may have specific areas in the chemical space for which it is better or worse at making predictions.
Each model was trained using high-quality data for 1,800 chemicals, and the evaluation dataset consisted of published data that were not necessarily curated for quality. Results were considered accurate when two or more models identified a chemical as being an endocrine disruptor. The U.S. Environmental Protection Agency (EPA) is now using the validated models to prioritize which chemicals to test in its endocrine disruptor screening program.
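The "two or more models" agreement rule can be sketched as a simple vote count. This is an illustrative assumption about how such a rule might be coded, not the project's actual pipeline; the model names, chemical identifiers, and the `consensus_flags` helper are hypothetical.

```python
from collections import Counter


def consensus_flags(predictions, min_votes=2):
    """Return chemicals flagged as active by at least `min_votes` models.

    predictions: dict mapping a model name to the set of chemicals it flags.
    """
    votes = Counter()
    for flagged in predictions.values():
        votes.update(flagged)  # each model contributes one vote per chemical
    return {chem for chem, n in votes.items() if n >= min_votes}


# Hypothetical model outputs
MODEL_PREDICTIONS = {
    "model_1": {"chem_A", "chem_B"},
    "model_2": {"chem_B", "chem_C"},
    "model_3": {"chem_B", "chem_C", "chem_D"},
}

consensus = consensus_flags(MODEL_PREDICTIONS)
print(sorted(consensus))
```

Raising `min_votes` trades sensitivity for specificity: fewer chemicals are flagged, but each flag is backed by more independent models.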
Kleinstreuer’s group is now modeling acute oral toxicity, for which it has datasets on more than 15,000 chemicals that it was able to curate, including data on some 1,000 chemicals for which there were oral toxicity data obtained at different time points. More than 140 were tested in this exercise that evaluated 5 different endpoints of interest to different regulatory agencies. Each modeling group received chemical identifiers for 48,000 chemicals to evaluate individual models. Hidden in those 48,000 were 2,000 chemicals for which the endpoints were known. The results of this evaluation enabled the collaborators to develop a consensus model using an open QSAR (OPERA) modeling suite.8,9 The consensus model was able to predict and replicate in vivo data (i.e., the endpoints of the 140 tested chemicals) and outperform each individual model. Kleinstreuer noted that consensus models can serve as a powerful way to leverage collective expertise and said the results of these exercises show that toxicology data can be synthesized and modeled effectively using AI and machine learning approaches.
5 The determination of what will or might happen in the future based on previous observation, experience, or scientific reason.
6 Derivations as conclusions based on known facts or evidence.
Addressing Chemical Mixtures
Traditional epidemiology has focused on assessing exposures one chemical at a time, an approach that has worked well for identifying toxicants and led to regulatory action, said Marianthi-Anna Kioumourtzoglou from Columbia University. However, this approach does not represent the reality of life, where, for example, someone breathes in a mixture of air pollutants rather than a single pollutant, and it can lead to spurious findings when exposure to a toxic substance is highly correlated with exposure to a non-toxic one. In addition, exposure to even a single chemical can occur with other stressors present, something that current exposure models do not take into account.
No one single statistical model can address all of the research questions of interest, explained Kioumourtzoglou, and the traditional approaches that have been used to analyze environmental health data are limited in their ability to handle the high dimensionality of data that occurs when including multiple chemicals in such models. Machine learning approaches, though, are well positioned to accommodate high dimensionality and complexity in data structures while providing the flexibility to capture non-linear interactions among chemicals in a mixture. Doing so, said Kioumourtzoglou, requires understanding the strengths and limitations of the many available machine learning techniques, as well as having environmental epidemiologists and toxicologists collaborating with machine learning experts and biostatisticians to tweak, adapt, and extend existing methods to make them more appropriate for use in environmental health studies, particularly regarding interpretability and robustness.
Kioumourtzoglou briefly described an approach she and her colleagues developed for reducing data dimensionality and identifying patterns by modifying a method used to interpret medical images. This approach, called Principal Component Pursuit (PCP), decomposes the data matrix into a low-rank matrix that identifies consistent patterns of exposure and a sparse matrix that identifies unique or extreme exposure events. PCP, she noted, is robust to noisy and corrupt data and requires minimal assumptions. Currently, her team is adapting PCP to address the problem of values below the limits of detection and to be able to incorporate prior knowledge about a mixture or its components.
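For reference, the decomposition described above is commonly written in the robust principal component analysis literature as the convex program below. The symbols are assumptions introduced here and were not part of the workshop summary: M is the observed exposure data matrix, L is the low-rank component (consistent patterns), S is the sparse component (unique or extreme events), the first norm is the nuclear norm (the sum of singular values), the second is the entrywise l1 norm, and lambda is a positive tuning parameter. The variant adapted by Kioumourtzoglou's team may differ in its details.

```latex
\min_{L,\,S}\; \|L\|_{*} + \lambda \,\|S\|_{1}
\quad \text{subject to} \quad L + S = M
```

Penalizing the nuclear norm encourages L to have low rank, while the l1 penalty encourages most entries of S to be exactly zero, so that only unusual observations remain in the sparse component.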
As a proof of concept, Kioumourtzoglou’s team tested its PCP model using a well-characterized dataset for particulate air pollution in Boston between 2003 and 2010. The PCP analysis produced results that agreed with those generated by other methods used for source apportionment, such as positive matrix factorization and absolute principal component analysis. More exciting, she said, was that PCP identified singular events, such as the particulate cloud produced by wildfires in Canada that passed through Boston on May 31, 2010. In this case, the analysis identified high levels of carbon black and potassium, a biomarker for biomass burning.
Biomarkers of Exposure
Discovering biomarkers of exposure—measurable biological indicators of the presence or severity of exposure—requires time-consuming and expensive validation that relies heavily on the computational or mathematical methods used to identify them, explained Katrina Waters from the Pacific Northwest National Laboratory. In addition, Waters explained, biomarker discovery uses complex and diverse datasets that reflect the influence of many different biological processes, including gene expression, post-translational modifications of proteins, and metabolism, as well as external factors that together produce a physiological response to an environmental exposure. As a result, the scale of data needed to identify biomarkers of exposure is huge and includes sequencing, microarray, and other -omics data, high-resolution mass spectrometry data, and even imaging data. “What we are talking about is a multi-scale problem going from genes to molecules to networks to the whole physiology of an individual combined with their microbiome, genetics, and everything else that is going on,” said Waters.
8 Kleinstreuer, N. C., A. L. Karmaus, K. Mansouri, D. G. Allen, J. M. Fitzpatrick, and G. Patlewicz. 2018. Predictive models for acute oral systemic toxicity: A workshop to bridge the gap from research to regulation. Computational Toxicology 8:21–24.
9 Mansouri, K., C. Grulke, R. Judson, and A. Williams. 2017. OPERA: A QSAR tool for physicochemical properties and environmental fate predictions. Presented at ACS Spring meeting, San Francisco, California, April 2–6.
One challenge for machine learning, then, is dealing with the scale and complexity of data. Another challenge arises when dealing with the structure of the data, that is, whether it is discrete or continuous, quantitative or qualitative, linear or non-linear, or complete or missing some elements. Model interpretation and reproducibility are also key challenges, as is the need to have a panel of markers, as opposed to a single biomarker, that can capture the multitude of pathways that lead to an adverse response to a pollutant or a collection of pollutants. Models need to be more than opaque black boxes for inputting data and generating results, said Waters. Rather, they need to be explainable in terms of why they identify a particular marker, generate results that inspire confidence in the biomarkers they identify, and provide lessons that inform subsequent model development when they fail.
Current research on integrative and interpretable machine learning focuses on identifying features that work in combination across multiple -omics and metadata that can predict a disease versus a control state. For example, the Diabetes Autoimmunity Study in the Young (DAISY)10 program is following more than 2,500 high-risk children who have a diabetic relative, while The Environmental Determinants of Diabetes in the Young (TEDDY)11 program is following 7,766 children in Europe and the United States who are at the highest risk of developing diabetes. In addition, the Human Islet Research Network (HIRN) is generating multi-omics profiles and single-cell images of human islets and their response to cytokine stimulation as a means of understanding how islet cells respond to chemicals, immune factors, and other molecules in the environment. Collecting these data across different populations and different life stages can help build a more comprehensive picture of the etiology and progression of diabetes and of the population-level variability in the data. So far, Waters and her collaborators have identified biomarker panels that can differentiate control groups from diabetic groups prior to the onset of clinical symptoms, offering the possibility of identifying those children at risk of developing diabetes. The next step will be to pull apart these markers using recursive feature elimination to identify features that are highly predictive.
Waters is also developing a model to identify what factors make some people more susceptible to Ebola infection. She and her collaborators carried out a small study using the genomics, proteomics, metabolomics, and lipidomics data from blood samples of survivors and non-survivors.12 They were able to identify metabolomic and lipidomic markers that indicate which patients, without immediate intensive treatment, are most likely to succumb to Ebola infection. Such a model could be valuable for triaging patients—enabling clinicians to determine which individuals need palliative care and which need survivor serum and other treatments to overcome infection, she explained.
When asked about how they address data quality, Waters, Kioumourtzoglou, and Kleinstreuer said they model uncertainty; perform many hours of manual curation; run consensus models to override data errors; use standardized pipelines for processing, normalizing, and analyzing data; and use statistically driven imputation approaches to fill in missing data. They noted that more work is needed to understand data quality and develop methods to deal with data issues. The panelists also emphasized that it is important for researchers to be transparent about possible data issues when discussing their models.
Kioumourtzoglou commented that machine learning techniques can provide important insights about complex interactions that might change with exposure level and that she is optimistic that machine learning methods can address complicated data structures, including those having to do with non-chemical stressors. The challenge with the latter is dealing with data over different scales. Waters added that data from wearable biosensors may help with developing algorithms that can include information on the time course of exposure.
Responding to a question about whether interacting with regulators and other stakeholders influences how models are developed, Kleinstreuer stressed the absolute necessity of involving end users as participants in the development process, starting from the earliest stages. “To be pragmatic and have your work put into practice, you have to understand what it is that regulatory decision makers are grappling with and what they have to deal with on a day-to-day basis,” said Kleinstreuer. In her work, discussions with stakeholders identified the endpoints that were most important to them. Waters remarked that the clinicians she works with have contributed their real-world expertise to single out measurements that would not be useful to make and variables that would produce data that are redundant and not independent.
10 Initiated in 1993 and continuously funded by the National Institutes of Health, DAISY is a long-term prospective cohort study to determine how genes and the environment interact to cause childhood (type 1) diabetes.
11 TEDDY, supported by an international consortium founded by the National Institutes of Health, is a long-term prospective cohort study to identify environmental exposures—such as infectious agents, diet, and psychosocial stress—that influence the development of autoimmunity and type 1 diabetes.
12 Kyle, J. E., K. E. Burnum-Johnson, J. P. Wendler, A. J. Eisfeld, P. J. Halfmann, T. Watanabe, F. Sahr, R. D. Smith, Y. Kawaoka, K. M. Waters, and T. O. Metz. 2019. Plasma lipidome reveals critical illness and recovery from human Ebola virus disease. Proceedings of the National Academy of Sciences of the United States of America 116(9):3919–3928.
ESTIMATING EXPOSURES: APPLICATION OF AI TO EPIDEMIOLOGY AND EXPOSURE SCIENCE
Arjun Manrai from Harvard Medical School asked whether researchers have the ingredients to answer questions about environmental health and the exposome using AI and machine learning. He pointed out that machine learning has been successful as currently practiced for many health-related questions. However, Manrai expressed doubt about AI’s utility for answering questions in environmental health and the exposome except when there is already a wealth of high-quality data. Even if high-quality data are available, Manrai emphasized that pervasive issues around measurement and reproducibility are likely to limit the application of AI and machine learning to environmental health questions. One high-quality public dataset that is useful for answering environmental health questions comes from the National Health and Nutrition Examination Survey (NHANES).13 As an example of how it can power machine learning, Manrai described how he used NHANES data in combination with linear regression to associate lead exposure with C-reactive protein, a blood marker of inflammation. He also suggested that genome-wide association studies could inspire the development of new machine learning approaches for studying environmental health associations in large datasets.14
Scott Weichenthal from McGill University is developing exposure models for population-based studies focusing on spatial variations in air pollution. Measuring or estimating exposure is the biggest challenge in these studies given they are intended to cover some meaningful period and must account for residential mobility. “Getting good exposure information is what makes or breaks many of these studies,” said Weichenthal. Geostatistical models and land-use regression models offer two approaches for estimating population-level exposures. For example, his group put out 100 air pollution monitors in Washington, DC, and collected air pollution data for 2 weeks. He then extracted land-use data from public sources on items such as traffic; whether a given monitor is in a residential, commercial, or industrial district; road width; and the type of vehicles driving past the monitors. These geographical parameters become predictors in a multiple regression model that can predict exposure information in places for which there are no direct measurements. Limitations of this approach include that geographic data (e.g., land use and traffic density) are available only on a limited spatial scale, which constrains how well spatial differences in human exposures can be captured; that it can predict only one pollutant at a time; and that it is difficult to evaluate interactions with geostatistical models. Moreover, geographic data do not fully capture the environment as people experience it.
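The land-use regression idea described above can be sketched as an ordinary least-squares fit. This is a minimal illustration under stated assumptions, not Weichenthal's model: the two predictors (a traffic index and road width), the monitor readings, and the helper names `fit_linear` and `predict` are hypothetical, and the toy data are constructed to be exactly linear so that the fitted coefficients are easy to check.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X: list of predictor rows; an intercept column is added automatically.
    Returns [intercept, coef_1, coef_2, ...].
    """
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        pivot = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(coefs, x):
    """Predicted concentration at a site with predictor values x."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], x))


# Hypothetical monitors: (traffic index, road width in meters)
SITES = [(1, 10), (2, 12), (3, 8), (4, 15), (5, 9)]
# Toy concentrations generated as 5 + 2 * traffic + 0.5 * width
MEASURED = [12.0, 15.0, 15.0, 20.5, 19.5]

coefs = fit_linear(SITES, MEASURED)
# Estimate exposure at an unmonitored location
estimate = predict(coefs, (6, 11))
print([round(c, 3) for c in coefs], round(estimate, 2))
```

In practice a statistics library would be used and real measurements would include noise, but the structure is the same: monitored sites supply the training data, and the fitted coefficients are applied to land-use values at locations without monitors.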
One approach Weichenthal is exploring to address those limitations is to use large databases of images of pollution to predict environmental exposures.15 This approach assumes that the built environment plays a large role in exposures and that photographs and satellite images of the built environment therefore contain exposure information. For the analysis, he uses a deep convolutional neural network16 that extracts information by analyzing layers within an image, makes predictions, compares the predicted results to the true value, adjusts the filters the network uses to extract information from an image, and then repeats this process iteratively to arrive at a best prediction. He noted that there is a certain amount of trial and error that goes into selecting the best parameters on which to train this type of model.
Weichenthal’s first test of this approach used satellite images and 20,000 ground-level 2.5-micron particulate matter (PM2.5) measurements from 6,000 sites worldwide to predict the annual average PM2.5 concentrations across the globe. He also tested it with some 100,000 remote-sensing measurements taken just in North America to see if the model could predict a narrower PM2.5 exposure range. Using downloaded satellite images for each monitoring site, his team examined several different zoom levels ranging from the size of a city to the size of a neighborhood. This approach explained 75 percent of the variation in actual ground-level PM2.5 measurements globally and 90 percent of the variation in North America. When compared to the current best-in-class global disease burden model, Weichenthal’s approach yielded results that came close to those of the accepted model. A new model, one that uses zoomed-in images to provide local information and zoomed-out images to get regional information, comes even closer to the standard global disease burden model.
14 Manrai, A. K., J. P. A. Ioannidis, and C. J. Patel. 2019. Signals among signals: Prioritizing nongenetic associations in massive data sets. American Journal of Epidemiology 188(5):846–850.
15 Weichenthal, S., M. Hatzopoulou, and M. Brauer. 2019. A picture tells a thousand…exposures: Opportunities and challenges of deep learning image analyses in exposure science and environmental epidemiology. Environment International 122:3–10.
16 A type of neural network (a machine learning algorithm loosely modeled on the human brain) that is particularly suited for recognizing and classifying images.
His group is starting to use audio data to provide more information about exposures. His team has recruited people in Montreal and India who will travel around with video and audio recorders, pollution monitors, and thermometers to build a large database for more locally based analysis to see if this type of model will work at the street level. His group is also building models to predict the heat island effect using 200 air temperature monitors placed around Montreal. The goal is to develop a high-resolution map of temperature and use this map with its deep learning model to predict hot spots using images of the built environment.
As previous speakers had mentioned, a major challenge in environmental health research is deciding which of the trillions of combinations of human-made chemicals present in the environment—and in human blood—to test for possible toxicity. At EPA’s National Center for Computational Toxicology, researchers used frequent itemset mining, an algorithm designed to inform which items should be placed near each other in the grocery store, and NHANES data to identify combinations of chemicals that co-occur in blood samples from the same individual. This data-mining technique identified 29 chemical mixtures that occurred in at least one-third of the U.S. population, which EPA’s John Wambaugh noted would be easy to run through NIEHS’s Tox21 program or EPA’s ToxCast program.
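A brute-force sketch of frequent itemset mining is given here for illustration. It is not EPA's implementation: production algorithms such as Apriori prune the search space rather than enumerating every combination, and the blood-sample contents, the 0.5 support threshold, and the `frequent_itemsets` helper are hypothetical (the analysis described above used a threshold of one-third of the population).

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(samples, min_support, max_size=3):
    """Return chemical combinations found in at least `min_support` of samples.

    samples: list of sets of chemical names detected in one individual.
    Only combinations of size 2..max_size are counted.
    """
    counts = Counter()
    for chems in samples:
        for size in range(2, max_size + 1):
            for combo in combinations(sorted(chems), size):
                counts[combo] += 1
    n = len(samples)
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}


# Hypothetical detections, one set per individual
SAMPLES = [
    {"lead", "cotinine", "pfos"},
    {"lead", "cotinine"},
    {"lead", "pfos"},
    {"cotinine", "pfos", "lead"},
    {"mercury"},
    {"lead", "cotinine", "mercury"},
]

result = frequent_itemsets(SAMPLES, min_support=0.5)
for combo, support in sorted(result.items()):
    print(combo, round(support, 2))
```

Sorting each sample before forming combinations ensures that the same set of chemicals always maps to the same tuple key, regardless of the order in which the chemicals were recorded.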
At EPA, Wambaugh and his colleagues have developed a framework called the Systematic Empirical Evaluation of Models (SEEM), which uses Bayesian statistical methods to incorporate multiple models into consensus predictions for thousands of chemicals.17 The consensus results predict a main tendency along with a window of uncertainty or a confidence interval. SEEM identified five factors that predict pesticide exposure for the entire U.S. population when the data are stratified by age, sex, and body mass index.18 These predictors focus on whether an industrial pesticide is also in public use (yes or no), whether the industrial pesticide is active or inert, and the production volume of the pesticide. Wambaugh noted that this model explains many of the variations in exposures measured by NHANES.
In collaboration with research groups in the United States, Canada, and Europe, Wambaugh examined whether machine learning could predict the probability that a chemical is associated with one of four exposure pathways: residential, dietary, pesticide, and industrial.19 First, the investigators identified 13 exposure models and grouped them by the chemical-specific pathway for toxicity and then by routes of human exposure. Next, they created a consensus meta-model based on the 13 exposure models using SEEM. In this meta-model, predictors of exposure were combined with information about the exposure pathway and predictors of chemical intake rate inferred from human biomonitoring data for 114 chemicals. The consensus model explained approximately 80 percent of the chemical-to-chemical variations in exposure for the average person. Wambaugh then applied the consensus model to predict the exposure pathway and intake rates for more than 687,000 chemicals in EPA’s database that have minimal exposure information. Although it may seem implausible to extrapolate exposure information for 687,000 chemicals from a machine learning model developed with data on 120 chemicals, Wambaugh and his colleagues were able to determine exposure pathways for 70 percent of the chemicals.
In another project,20 Wambaugh’s colleagues at EPA used random forest21 machine learning classification models to predict chemicals that could serve as alternatives to existing chemicals with a toxicity concern. “This is green chemistry by machine learning,” said Wambaugh.
17 Ring, C. L., J. A. Arnot, D. H. Bennett, P. P. Egeghy, P. Fantke, L. Huang, K. K. Isaacs, O. Jolliet, K. A. Phillips, P. S. Price, H. M. Shin, J. N. Westgate, R. W. Setzer, and J. F. Wambaugh. 2019. Consensus modeling of median chemical intake for the U.S. population based on predictions of exposure pathways. Environmental Science & Technology 53(2):719–732.
18 Wambaugh, J. F., A. Wang, K. L. Dionisio, A. Frame, P. Egeghy, R. Judson, and R. Woodrow Setzer. 2014. High throughput heuristics for prioritizing human exposure to environmental chemicals. Environmental Science & Technology 48(21):12760–12767.
19 Ring, C. L., J. A. Arnot, D. H. Bennett, P. P. Egeghy, P. Fantke, L. Huang, K. K. Isaacs, O. Jolliet, K. A. Phillips, P. S. Price, H. M. Shin, J. N. Westgate, R. W. Setzer, and J. F. Wambaugh. 2019. Consensus modeling of median chemical intake for the U.S. population based on predictions of exposure pathways. Environmental Science & Technology 53(2):719–732.
20 Phillips, K. A., J. F. Wambaugh, C. M. Grulke, K. L. Dionisio, and K. K. Isaacs. 2017. High-throughput screening of chemicals as functional substitutes using structure-based classification models. Green Chemistry 19(4):1063–1074.
21 A computational algorithm that utilizes multiple decision trees that operate as an ensemble.
When asked about the populations included in their studies, the workshop panelists noted that the NHANES database represents a true cross-section of the U.S. population with three exceptions: (1) NHANES oversamples pregnant women, (2) NHANES does not include children younger than 6 years old, and (3) NHANES does not include occupational exposures. Prior to 2011, NHANES also did not include a separate category for Asians. The panelists also noted that it can be hard to meld NHANES data with other datasets, such as genomic data. One obvious shortcoming of existing databases is that there are many places in the world where there are no data on exposures and no ground-level measurement technology. A possible solution might be to leverage information on exposure and images to generate reasonable estimates in data-poor regions of the world. Another solution might be to engage local populations in citizen science projects that, for example, could deploy a smartphone app to estimate exposure to noise and certain air pollutants.
Hands-on Learning Experience
To get a sense of how to use machine learning to answer questions regarding chemical exposure and environmental health, the audience participated in a hands-on learning experience. To provide that experience, David Dunson of Duke University and two of his students, Kelly Moran and Evan Poworoznek, guided the workshop participants through a demonstration of machine learning using ToxCast data on chemical features and chemical dose–responses to predict toxicity outcomes from chemicals that have yet to be studied.
SOCIAL AND ETHICAL CONSIDERATIONS OF USING AI
If AI and machine learning are to realize their potential to advance environmental health research and decision making, policy makers, the public, and other stakeholders will need to be able to trust these methodologies, and there should be a measure of accountability regarding the predictions they make, said Alex John London from Carnegie Mellon University. He noted that the experts who develop AI and machine learning systems have specialized knowledge and that stakeholders depend on those experts to safeguard and advance their interests. This dependence implies that stakeholders need a certain level of trust in the experts and that the experts are accountable for the output of their models. Being accountable, he explained, involves reducing information asymmetries by opening the decision-making process (for example, the process the experts used in deciding how to build their model) and making subject-matter knowledge available to stakeholders.
While the examples London used in his presentation involved the use of AI and machine learning to guide medical treatments, the lessons he highlighted apply equally to environmental health decisions. There is no one-size-fits-all recipe for ethically acceptable AI, he said, because different learning tasks pose different challenges and different datasets can support different inferences. Accountability, said London, requires justification for
- the problem being addressed,
- the choice of methods used versus alternatives,
- known and unknown limitations in the data,
- limitations of the methods chosen,
- strategies for validating model results, and
- a warrant for replicating results in practice.
In his opinion, the ability to explain how an AI or machine learning model works is not a panacea for engendering trust. Rather, for decision-making purposes, verification is often more important than a detailed explanation of how a model works. It is important, said London, to avoid false dichotomies between AI and empirical tests, as well as to seize opportunities to integrate AI with novel, controlled trial designs to create a system capable of learning and improving in its ability to make useful predictions.
For Lance Waller from Emory University, a key ethical question for those engaging in AI and machine learning is whether they should worry only about getting the mathematics of their models correct and running their computations correctly and faster, or whether they should be more concerned about the consequences of their calculations. In his opinion, calculations do have consequences, with the interpretation of results being an essential part of the analytic process. Theory and application, he added, cannot be separated from one another.
When thinking about cultural frameworks and ethics training in data science, Waller pointed to the ethical guidelines for professional practice issued by the American Statistical Association. In the ideal, statisticians—Waller considers those working in AI and machine learning to be operating in the statistics realm—should diligently search to develop evidence bearing on a hypothesis rather than on a predetermined conclusion, make wise use of methods to produce the best results from the analysis in relation to the problem at hand, and be willing to answer requests about the details of their work. He noted that most moral and ethical decisions are made without conscious thought or reflection and are often driven by habit. However, when an individual consciously and conscientiously examines how he or she makes moral and ethical choices, that individual becomes a better professional, a better citizen, and a better person.
Another ethical issue is the lack of recognition data scientists receive for their work. Waller argued that addressing the issue of how data scientists can be recognized and promoted is critical, given the level of scholarship required for discovering and developing the methodology, integrating the fields of data science and statistics, developing robust applications of data science, and teaching data science. Each of these criteria can be quantified in the same way all other research is quantified for recognition and promotion, namely through publications (including software and curated datasets) and citations demonstrating reproducibility. He noted that a recent commentary in Nature22 called for data generators, which includes modelers, to receive credit when others use their data or their models, an idea that several workshop participants seconded. The challenge, said Waller, is that software and data, unlike a publication, continue to change after their initial release, and there is currently no citation system that recognizes that fact.
Reproducibility and Replicability
Data dependency and data quality are critical issues in QSAR models, said Alexander Tropsha from the University of North Carolina at Chapel Hill. For example, if even a small percentage of the chemical structures in a dataset are incorrect, the accuracy of the modeling results will be reduced dramatically. Citing a 2013 paper, he noted that even something as small as a change in the reagent dispensing process can have a major effect on calculated biological activity as determined by computational and statistical analyses.23 Despite the view held by some that AI and machine learning algorithms can be forgiving when it comes to poor quality data, the fact is that data quality does matter.
As an example of how modelers can go wrong with the data and methods they use, Tropsha cited work from a group that claimed to be able to predict a substance’s toxicity for $295 using a QSAR-like model it developed. This work, which received a great deal of media attention, had multiple problems regarding data, perhaps the biggest of which was that much of the data used to build the model were not collected experimentally, but were predicted in other QSAR studies. In addition, the database serving as a major source of data was not curated and was determined to be unreliable. Moreover, said Tropsha, there was significant misuse and misinterpretation of the statistics, over-fitting of data, and a general failure to validate the model correctly. Tropsha pointed out that problems with interpretation and validation are not new or unique challenges in model development. He emphasized that scientists can leverage historical lessons to guide the development and use of AI and machine learning. Citing a symposium held at the 2019 annual meeting of the American Association for the Advancement of Science, Tropsha said that machine learning techniques used by thousands of scientists to analyze data are producing results that are misleading and often completely wrong. One reason for this is that machine learning algorithms have been designed specifically to find interesting things in datasets, and so when they search through huge amounts of data, they will inevitably find some sort of pattern, he said.
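The point that a search through enough data will inevitably turn up some pattern can be illustrated with a small simulation. In the sketch below, both the outcome and every candidate feature are pure random noise, so any correlation found is spurious. The function names, sample size, and feature counts are hypothetical choices made for illustration and are not drawn from the workshop.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_samples, n_features, seed=0):
    """Largest absolute correlation between a random outcome and any of
    `n_features` random features; none has a real relationship to it."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


# Testing one noise feature versus searching through 500 of them.
few = strongest_chance_correlation(n_samples=30, n_features=1)
many = strongest_chance_correlation(n_samples=30, n_features=500)
print(f"best |r| with 1 feature: {few:.2f}; with 500 features: {many:.2f}")
```

The larger the search, the stronger the best chance correlation tends to look, which is one reason validation on external data matters.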
Data curation is another issue that will need to be addressed by those interested in working with AI and machine learning systems, stated Tropsha. He said data curation is not a simple matter, but rather one that entails numerous steps, including duplicate analysis, analysis of intra- and inter-lab experimental variability, exclusion of unreliable data sources, detection and verification of what he called activity cliffs, and identification and correction of mislabeled compounds, among others.
In summary, Tropsha said that while the accumulation of big data created previously unachievable opportunities for using AI and machine learning approaches, it is critically important to curate and validate the primary data with extreme care. He noted that the growing use of models to guide experimental research raises the importance of rigorous and comprehensive model validation using truly external data.
Scientific Understanding from Model Results
In the workshop’s final presentation, Sorelle Friedler from Haverford College described some techniques that can be used along with serious domain expertise to try to understand what these modeling techniques are doing. She began with an example of a project she completed that aimed to predict whether a given compound will form a crystal, a problem that requires both positive and negative examples. While positive examples were plentiful, negative examples were not, so she and her collaborators collected data on failed crystallization experiments from laboratory notebooks. The result was a machine learning model that recommended experiments to perform. When those experiments were run, they successfully validated the model. Although Friedler thought she was finished with that project, her chemist collaborator wanted to know what the model was saying about crystallization and chemical structure that made it perform better than humans at making predictions.
22 Pierce, H. H., A. Dev, E. Strathan, and B. W. Bierer. 2019. Credit data generators for data reuse. Nature https://www.nature.com/articles/d41586-019-01715-4 (accessed August 23, 2019).
23 Ekins, S., J. Olecheno, and A. J. Williams. 2013. Dispensing processes impact apparent biological activity as determined by computational and statistical analyses. PLOS ONE 8(5). doi: 10.1371/journal.pone.0062325.
To get that information, Friedler created an interpretable model of the model, using the original model to predict outputs for training data for the new model. Those predicted outputs were then used as labels, which trained the new interpretable model. The first step toward creating an interpretable model is to quantify the relative importance of features in the black box model. One way to do this is to replace a feature with random noise and see how much the model’s accuracy degrades, though this approach can produce misleading results if two features are correlated, she explained. A more complicated approach is to systematically replace all of the feature values with the presence or absence of each feature. Using these approaches, along with domain expertise, can provide some understanding of how important various features are to a model and therefore how important those features are to the underlying system being studied.
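The noise-replacement idea can be sketched in a few lines. In this minimal example, a feature is scrambled by shuffling its values across rows, which is one common way to replace it with noise, and the resulting drop in accuracy is taken as its importance. The `black_box` function, the synthetic data, and the helper names are hypothetical stand-ins and are not Friedler’s crystallization model or her actual procedure.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, feature, seed=0):
    """Accuracy lost when one feature column is shuffled across rows."""
    rng = random.Random(seed)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    scrambled = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return accuracy(model, rows, labels) - accuracy(model, scrambled, labels)


def black_box(row):
    # Hypothetical opaque model: it happens to rely only on feature 0.
    return int(row[0] > 0.5)


rng = random.Random(1)
rows = []
for _ in range(200):
    x0 = rng.random()
    # Feature 1 duplicates feature 0 (perfectly correlated); feature 2 is noise.
    rows.append([x0, x0, rng.random()])

# Labels are the black box's own predictions, as when probing a trained model.
labels = [black_box(r) for r in rows]

importances = [permutation_importance(black_box, rows, labels, f) for f in range(3)]
print(importances)
```

Feature 1 scores zero even though it carries exactly the same information as feature 0, because this particular model never reads it. This is a simple instance of the caveat about correlated features: the score describes what the model uses, which is not necessarily what matters in the underlying system.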
In the ensuing discussion, Friedler, London, and Waller all stressed the importance of using AI and machine learning in conjunction with good science. This means asking questions, understanding the reasons for using a specific model and the possible errors and shortcomings associated with that model, looking critically at the data going into and results coming out of a model, and focusing on reproducibility.
Tropsha noted the need to do a better job educating current and future practitioners about how important it is to generate data objectively, to develop tools that capture data without human bias, and to appreciate the ethical issues involved in using AI and machine learning. A workshop participant noted that while it will be easy to educate graduate students and postdoctoral researchers, reaching out and educating those who are already in the workforce will be more difficult. Tropsha added that there does need to be some form of “reinforcement learning,” where reinforcement is done in educational and publishing policies that would prevent bad science from being propagated. He and others also pointed to the importance of working in teams and for AI and machine learning novices to partner with someone with more experience.
PERSPECTIVES ON THE USE OF AI RESEARCH FOR ENVIRONMENTAL HEALTH DECISIONS
To conclude the workshop, five panelists (Nadira de Abrew from Procter & Gamble, Anna Lowit from EPA, Kristi Pullen Fedinick from the Natural Resources Defense Council, Charles Schmitt from NIEHS, and Reza Rasoulpour from Corteva Agriscience) addressed two hypothetical scenarios from the perspective of decision makers and stakeholders. The first scenario involved a machine learning algorithm that used various measurables to predict which drinking water wells are more likely to be contaminated with nitrates. After validation with a small subset of homes from which water samples were collected, this algorithm identified 100 percent of the contaminated wells but with a 40 percent false positive rate.
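What a 40 percent false positive rate means for an individual homeowner depends on how common contamination is, which the scenario does not specify. The sketch below assumes that “false positive rate” means the share of clean wells that are wrongly flagged, and it uses a purely hypothetical prevalence of 10 percent. If the 40 percent instead referred to the share of flagged wells that are actually clean, the probability computed here would simply be 60 percent.

```python
def positive_predictive_value(sensitivity, false_positive_rate, prevalence):
    """Probability that a flagged well is truly contaminated (Bayes' rule)."""
    true_positives = sensitivity * prevalence
    false_positives = false_positive_rate * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Scenario values: every contaminated well is caught (sensitivity = 1.0) and
# 40 percent of clean wells are flagged. The prevalence is an assumption.
ppv = positive_predictive_value(
    sensitivity=1.0, false_positive_rate=0.40, prevalence=0.10
)
print(f"Share of flagged wells that are contaminated: {ppv:.0%}")  # about 22%
```

Under these assumptions, most flagged wells would turn out to be clean, which is consistent with the panelists’ concern about how such results are communicated.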
Lowit and Rasoulpour commented that while false positives can be an acceptable tradeoff for protecting public health, a false positive rate of that magnitude can lead to the public lacking trust in science. Lowit added that erring on the side of transparency could help ameliorate that issue, though she and other speakers voiced concerns that an uneducated public could turn one alarming result into a blanket fear of chemicals in the same way that one discredited paper linking vaccines to autism has had a profoundly negative impact on vaccination rates. In that regard, Lowit cautioned against notifying homeowners by mail that they might have a contaminated well because that information could be misinterpreted without further explanation. “Everything should be transparent, but we need to be careful about how we communicate that information,” said Lowit. Fedinick countered that disclosing this type of information to the public gives people autonomy to decide for themselves if they are at risk. The key, she said, is providing the data in a way that is informative rather than alarming.
Schmitt’s concern was that the high false positive rate suggests there is a methodology issue relating to oversampling wells that were positive for nitrates, which would bias the model toward false positives. As a homeowner with a well and children, however, he would want to know if he should be ordering a testing kit. What he would not want is someone else running the model and seeing that his well might be contaminated. de Abrew said she shared the other panelists’ concerns, particularly regarding the need for transparency and explanation.
In the second scenario, a hair care company developed a new anti-frizz treatment containing a new chemical that a machine learning algorithm, using a large database of chemical structure-activity relationships, suggests may be neurotoxic, raising the question of what steps the company could take next. Schmitt said the default position should be that if the data used to power the model are of high quality, then the model is correct, and the onus is to try to prove the model is wrong. Lowit noted that in her experience, most models tend to have high false positive rates, so she would counsel running an accepted cell- or tissue-based assay for neurotoxicity as a first step before turning to more elaborate testing. She also pointed out that the model does not predict at what level the compound might be neurotoxic, which might greatly exceed potential consumer exposures.
Rasoulpour commented that the company should look at other structurally similar molecules it created during its discovery work and see what the model predicts for those compounds. Another step would be to look for possible molecular targets and mechanisms by which the compound might cause neurotoxicity. He also noted the importance of considering the risk–benefit equation for this compound given that it is a hair product and not a new drug to treat a life-threatening condition. Rasoulpour thinks the utility of these models is that they can narrow down the chemical space to explore early in the discovery process, long before selecting the one compound with which to move forward. Thomas Barnum from the U.S. Agency for International Development remarked that while this discussion was interesting, decision makers want answers, not a discussion or nuanced options.
Tropsha commented that the problem with this scenario is that the expected accuracy of a model developed using a large dataset is valid only when another large dataset is used to predict the results. “It does not translate into prediction accuracy for a single molecule,” he explained. Tropsha said, “this is a very important area of statistical modeling that should be addressed.” Fedinick added that the modeling community needs to work to continually improve models because, as she put it, fundamentally all models are wrong even when they are useful.
When considering how to use AI and machine learning models to guide decision making, the panelists pointed to the need to spend time with decision makers and educate them about what goes into interpreting the results of these models, their limitations, and in what context they can be used. Lowit noted that the decision makers she had worked with are not looking for yes or no answers. Instead, they are interested in understanding the important fine points of how a model makes predictions and its limits. She also pointed out that important regulatory decisions are not made based on individual pieces of information, and in that respect, AI and machine learning are not meant to be the end-all and be-all for decision makers.
AI and machine learning “really do have the potential to revolutionize environmental health,” stated Gary Miller of Columbia University in his closing remarks for the workshop. But, he cautioned against hype and over-promising results. Miller also urged workshop participants to “maintain healthy skepticism” as the community grapples with the ethical dimensions of AI and machine learning as well as questions about reproducibility and replicability. He encouraged workshop participants to read the 2019 National Academies report Reproducibility and Replicability in Science,24 which focused on links between computation and rigor in science. Miller pointed out that remembering traditional frameworks for science is important whenever the research community explores the use of new technologies. “We were all trained scientists. We don’t abandon the things we’ve learned about proper study design,” when a new technology comes along, but explore how to use the new technology “to enhance what we’ve been trained to do,” he said. As a result, training programs may need to evolve. Miller envisioned a future in which the next generation of scientists are incentivized and equipped to “not just learn how to code,” but take into account issues such as human biases built into code, transparency, and replicability. With the rapid development and expansion of AI into research, “we have to really make sure that environmental health sciences stay out in front,” avoiding potential pitfalls and fostering ongoing conversation about challenges and opportunities, Miller concluded.
Disclaimer: This Proceedings of a Workshop—in Brief was prepared by Joe Alper and Keegan Sawyer as a factual summary of what occurred at the workshop. The planning committee’s role was limited to planning the workshop. The statements made are those of the rapporteurs or individual workshop participants and do not necessarily represent the views of all workshop participants, the planning committee, or the National Academies of Sciences, Engineering, and Medicine.
Planning Committee on Leveraging Artificial Intelligence and Machine Learning to Advance Environmental Health Research and Decisions: A Workshop
Kevin Elliott, Michigan State University; Nicole Kleinstreuer, National Institutes of Health; Patrick McMullen, ScitoVation; Gary Miller, Columbia University; Bhramar Mukherjee, University of Michigan; Roger D. Peng, Johns Hopkins University; Melissa Perry, The George Washington University; Reza Rasoulpour, Corteva Agriscience.
Staff: Elizabeth Barksdale Boyle, Program Officer, Board on Environmental Studies and Toxicology; Keegan Sawyer, Senior Program Officer, Board on Life Sciences; Ben Wender, Program Officer, Board on Mathematical Sciences and Analytics and Board on Energy and Environmental Systems.
Reviewers: To ensure that it meets institutional standards for quality and objectivity, this Proceedings of a Workshop—in Brief was reviewed by Kim Boekelheide, Brown University, and Lance Waller, Emory University.
Sponsor: This workshop was supported by the National Institute of Environmental Health Sciences.
About the Standing Committee on the Use of Emerging Science for Environmental Health Decisions: The Standing Committee on the Use of Emerging Science for Environmental Health Decisions convenes public workshops to explore the potential use of new science, technologies, and research methodologies to inform personal, public health, and regulatory decisions. These workshops provide a public venue for multiple sectors—academic, industry, government, and nongovernmental organizations among others—to exchange knowledge and discuss new ideas about advances in science, and the ways in which these advances could be used in the identification, quantification, and control of environmental impacts on human health. More information about the standing committee and this workshop can be found online at http://nas-sites.org/emergingscience.
Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2019. Leveraging Artificial Intelligence and Machine Learning to Advance Environmental Health Research and Decisions: Proceedings of a Workshop—in Brief. Washington, DC: The National Academies Press. doi: https://doi.org/10.17226/25520.
Division on Earth and Life Studies
Copyright 2019 by the National Academy of Sciences. All rights reserved.