Pathogen Detection and the Internet of Things—Future Prospects for Robotics, Disease Control, and Commercialization
Ethan Jackson, Microsoft Healthcare
Ethan Jackson is senior director and principal researcher at Microsoft Healthcare. His research focuses on intelligent systems that improve the health of people and their environments. He directs the Microsoft Premonition project, which he described along with some of its results. He explained that Microsoft Premonition aims to scalably monitor the biome from the perspective of its smaller elements. He noted that Earth’s biome—that is, the life all around us—affects the state and security of every economy, every government, and every society. In addition, most of Earth’s biome is composed of entities that are very small—viruses, microbes, and invertebrates.
Jackson highlighted threats to national security, human health, and agriculture from very small things in the biome that have been observed just in 2019 and 2020. These threats range from fungal diseases to Ebola to locusts to swine fever. He observed that we are discussing these sorts of threats today (at the time of the workshop in late February 2020) as COVID-19 is spreading around the world. He said that one big challenge, from a technological perspective, is that these threats are mainly invisible to modern sensor networks.
To explain this invisibility, he noted that when considering the full breadth of biodiversity, the larger terrestrial vertebrates make up approximately 10⁴ species. Terrestrial arthropods (invertebrates with exoskeletons)—which include mosquitoes and ticks (both of which can transmit diseases and affect agriculture)—are estimated at 10⁶ species, two orders of magnitude more, and are millimeters in size. Then, he said, there are even smaller entities in the biome, such as bacteria and viruses, some of which are threats. These are estimated at 10⁷ to 10⁹ species and range from microns down to nanometers in size. They are, he pointed out, too small to be detected by the sensor networks that have been built to date.
The goal of the Premonition Project is to build a sensor network that can detect these smaller entities to provide timely data about the biome. Jackson said that the focus has been on viruses, microbes, and arthropods, which comprise most terrestrial biodiversity. He went on to explain the capabilities that the project is developing and how they converge in order to meet the goal.
Jackson explained that the first capability needed is the ability to estimate the composition of an arbitrary biological sample. A sample (e.g., a collected mosquito) will contain material from many different organisms: the DNA of the mosquito itself, bacteria, and possibly viruses. He said his team’s goal is to be able to take one of those samples and, without knowing what it is, estimate everything in it. He explained that many items in the sample will be novel and will not even have a scientific name.
The technique that makes these sorts of estimates possible is called metagenomics. Jackson explained that metagenomics relies on converting biological samples into data by sequencing (converting DNA molecules into data). Metagenomics is the process of scanning these data using AI and machine learning techniques to estimate how they relate to things that have previously been sequenced.
The analysis requires scanning collected samples for similarity to all partially sequenced life forms. To do this, the project uses a reference database of about 3 trillion nucleotides of genetic material that has already been connected to existing life forms.
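As a rough illustration of the kind of matching metagenomics performs, the sketch below scores a sequencing read against a k-mer index built from reference sequences. The species names, the sequences, and the short k-mer length are all hypothetical toy values; production pipelines use much longer k-mers and trillions of reference nucleotides.

```python
from collections import defaultdict

K = 8  # toy k-mer length; real metagenomic tools typically use k around 31

def kmers(seq, k=K):
    """All length-k substrings of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(references):
    """Map each k-mer to the set of reference species containing it."""
    index = defaultdict(set)
    for species, seq in references.items():
        for km in kmers(seq):
            index[km].add(species)
    return index

def classify_read(read, index):
    """Score each species by the fraction of the read's k-mers it shares."""
    counts = defaultdict(int)
    read_kmers = kmers(read)
    for km in read_kmers:
        for species in index.get(km, ()):
            counts[species] += 1
    return {s: c / len(read_kmers) for s, c in counts.items()}

# Toy references (invented sequences, not real genomes)
refs = {
    "Anopheles gambiae": "ATCGGATTACAGGCATTACGGATCCATGCAAT",
    "Plasmodium falciparum": "TTGGCCAATTGGACCATGGCATCGTTAACGGA",
}
index = build_index(refs)
scores = classify_read("ATCGGATTACAGGCATTAC", index)
best = max(scores, key=scores.get)
```

At cloud scale the index is sharded and queried in parallel, but the core idea is the same: similarity scores against everything previously sequenced.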
Jackson explained that technological advances mean that sequencing is no longer relegated to a gigantic laboratory to which only a few people have access. The necessary equipment continues to become smaller and ever more field deployable. An advantage of sequencing from the field, he said, is that it can produce immense amounts of data. He described work that his team did on a data set sequenced by the Sanger Institute. A goal was to try to understand the genomics of malaria in mosquitoes. He said that they want to be able to understand everything that those mosquitoes have encountered through analysis of the collected sample.
He noted that although the project has a reference database of about 3 trillion nucleotides to compare against, there could be viruses or bacteria present in a sample that are not yet known or sequenced. He described the challenge of matching a very large set of genetic data against trillions of reference nucleotides. To do this, he said, the project uses a cloud-scale architecture to run highly parallel analyses. The project builds a cloud-scale statistical model that aims to eliminate false positives, which are a persistent challenge at this scale. The output of these tools can be seen as a statistical model that estimates the quantity of each species in a given sample. He noted that for mosquito samples, the output model estimates that material from humans and cows, as well as environmental bacteria, is also present. He said that this sort of metagenomic analysis reveals the types of organisms that were physically co-located with a given sample.
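One simple way to picture this kind of statistical output is a composition estimate that discards weakly supported species to suppress false positives. The species names, read counts, and minimum-read threshold below are illustrative assumptions, not the project's actual model.

```python
from collections import Counter

def estimate_composition(read_assignments, min_reads=3):
    """Estimate sample composition from per-read species assignments,
    discarding species supported by too few reads (likely false positives)."""
    counts = Counter(read_assignments)
    kept = {s: c for s, c in counts.items() if c >= min_reads}
    total = sum(kept.values())
    return {s: c / total for s, c in kept.items()}

# Invented per-read assignments for a single mosquito sample
reads = (["Anopheles gambiae"] * 70 + ["Homo sapiens"] * 20 +
         ["Bos taurus"] * 8 + ["spurious hit"] * 2)
composition = estimate_composition(reads)
```

The threshold is a stand-in for the far richer statistical modeling the project actually uses; the point is only that low-support hits are treated as noise rather than detections.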
The next step, Jackson explained, involves mining the data once they have been processed. Using the mosquito example again, he said his team noticed something interesting: the mosquitoes were biting cows. Thus, the question became: What viruses are associated with mosquitoes that have bitten cows? Is it possible to factor out the mosquito data to see what might be in a cow simply because a mosquito happened to bite it? He said the project’s data mining tool can answer this question in just a few seconds. The team filters the data until only signals from mosquitoes that bit cows are left. The next step is to determine which viruses are associated with just the cow samples. In the example he described, the team narrowed in on a known bovine parvovirus. The presence of this virus has nothing to do with the fact that these mosquitoes transmit malaria; it reflects only that a mosquito happened to bite a cow carrying the virus. The upshot, he said, is that the project is using the mosquito as a sensor to sample what was in the cow.
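The filter-then-tally query described above might be sketched as follows; the sample records, host names, and virus names are invented for illustration.

```python
from collections import Counter

# Each record: one mosquito sample with the hosts and viruses detected in it
samples = [
    {"id": 1, "hosts": {"Bos taurus", "Homo sapiens"}, "viruses": {"bovine parvovirus"}},
    {"id": 2, "hosts": {"Homo sapiens"}, "viruses": {"hepatitis B virus"}},
    {"id": 3, "hosts": {"Bos taurus"}, "viruses": {"bovine parvovirus"}},
    {"id": 4, "hosts": {"Capra hircus"}, "viruses": set()},
]

def viruses_by_host(samples, host):
    """Keep only samples whose metagenomic signal includes `host`,
    then tally the viruses found in just those samples."""
    tally = Counter()
    for s in samples:
        if host in s["hosts"]:
            tally.update(s["viruses"])
    return tally

cow_viruses = viruses_by_host(samples, "Bos taurus")
```

A real tool would run this over millions of records with indexed queries, but the shape of the question—filter by host signal, then tally associated viruses—is the same.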
Jackson then discussed the more general problem of looking at existing data sets using metagenomics to consider a sample and understand more about its environment. He said that one can think of metagenomics as the ability to probe deeper and deeper into the layers of data in a data set. He explained that the project can show that metagenomics does well at estimating what mosquito species the sample is from (particularly important when studying malaria).
Jackson explained that probing an order of magnitude deeper into this data set, the system can determine what species mosquitoes are biting. The team finds genetic material from the animals that the mosquitoes are encountering in the environment. In this example, he said, most of those data are from humans but they also find other species such as goats, dogs, cows, and donkeys. Two orders of magnitude deeper, he said, the team starts to see the malaria parasite itself in these mosquitoes, which makes sense given what they are and where they come from. He pointed out that this is not the normal way to detect malaria; the point is that it can be seen in a protocol that was not designed to see it. At another order of magnitude deeper, he said, the team finds human hepatitis-B virus in mosquitoes (which on its own has nothing to do with mosquitoes). Mosquitoes are not a vector for hepatitis B; the presence of the virus in a mosquito thus indicates only that an infected individual was bitten by one. He said the team can recover the virus’s genomes from this data and ask questions such as, Where does that virus sit on the phylogeny of known hepatitis-B viruses?
He compared this question to coronaviruses and in particular COVID-19. How, he asked, would we have seen that virus in the environment when we did not even have a name for it? When we have some data points about where it might sit on a phylogeny, where would we sample? Metagenomics, he said, is a fundamental tool to monitor and understand part of the biome.
Jackson described the next major capability needed by the Premonition Project: how to detect and collect samples to bring back from the environment for analysis, which is a major challenge to achieving effective convergence. He emphasized that although there would probably be something interesting in almost anything collected and sequenced, the challenge is to collect samples that are likely to contain something important.
To address that challenge, Jackson said the team has been developing a robotic field biologist, focusing specifically on terrestrial arthropods. It chose arthropods because they represent a large share of terrestrial biodiversity and because they transmit a number of infectious diseases, including the second most significant infectious disease, malaria. Malaria and dengue together contribute to some 600 million cases of human disease per year.
Jackson described the first generation of robotic field biologists that his team built and plans for the next generation. The robotic field biologist is a cylindrical object surrounded by 64 “smart cells.” Each cell implements an optical wingbeat detector. When something flies through the infrared beam that crosses a cell, its wings generate a periodic frequency. The robotic field biologist detects that frequency in real time and uses that information to try to identify what has just probed one of the cells. This identification allows a real-time decision about whether to close the cell and keep the biological sample. Over time, the device can start to learn within its environment and be trained on what to prioritize and bring back. Even when it does not bring back a sample, it can still acquire data. This process contrasts with traditional monitoring of arthropod environments. The robotic field biologist collects digital data on many organisms and then selects the ones that are most important for retrieval.
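A minimal sketch of the wingbeat-detection idea, under stated assumptions: estimate the dominant frequency of the signal crossing the beam, then map it to a taxon by frequency band. The sample rate, the frequency bands, and the naive single-bin Fourier probe are all illustrative; the actual device runs real-time signal processing on an embedded microcontroller with its own parameters.

```python
import math

SAMPLE_RATE = 8000  # Hz; illustrative, not the device's actual rate

def dominant_frequency(signal, candidates):
    """Return the candidate frequency with the strongest response
    (a naive per-frequency Fourier probe; real firmware would use an FFT)."""
    n = len(signal)
    best_f, best_power = None, -1.0
    for f in candidates:
        re = sum(s * math.cos(2 * math.pi * f * i / SAMPLE_RATE)
                 for i, s in enumerate(signal))
        im = sum(s * math.sin(2 * math.pi * f * i / SAMPLE_RATE)
                 for i, s in enumerate(signal))
        power = re * re + im * im
        if power > best_power:
            best_f, best_power = f, power
    return best_f

# Hypothetical wingbeat bands (Hz); actual species-specific ranges vary
BANDS = {"mosquito": (400, 700), "housefly": (150, 250)}

def classify(freq):
    for name, (lo, hi) in BANDS.items():
        if lo <= freq <= hi:
            return name
    return "unknown"

# Simulated 600 Hz wingbeat crossing the infrared beam
signal = [math.sin(2 * math.pi * 600 * i / SAMPLE_RATE) for i in range(400)]
freq = dominant_frequency(signal, candidates=range(100, 801, 50))
decision = classify(freq)
```

The classification result is what drives the real-time decision to close a cell and keep the sample, or let the insect go and keep only the data.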
Jackson explained that a lot of data are generated by these devices—sometimes several hundred megabytes per day. He noted that this amount of data allows the team not only to consider individually interesting samples, but also to generate a baseline of what happens in the environment in general. Because the system collects so much phenotypic data, he said, the team can observe how small-scale environmental factors such as time of day, ambient light level, and barometric pressure drive the presence or absence of various threats. With that information, the team can then use statistical modeling similar to that used for building weather forecasts and note interesting peaks and troughs in the predicted presence or absence of these different types of disease vectors. Once those peaks and troughs are known, one can start to take them into account and assess where the threats might be.
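The peak-and-trough analysis can be caricatured as simple anomaly detection on hourly detection counts; a z-score against the day's baseline stands in for the richer forecast-style models Jackson mentioned. The counts and threshold below are invented.

```python
import statistics

def flag_anomalies(hourly_counts, z_threshold=2.0):
    """Flag hours whose detection counts deviate strongly from the baseline,
    using a z-score (a stand-in for the forecast-style models described)."""
    mean = statistics.fmean(hourly_counts)
    sd = statistics.stdev(hourly_counts)
    return [hour for hour, c in enumerate(hourly_counts)
            if abs(c - mean) / sd > z_threshold]

# Hypothetical detections per hour over one day; hour 19 shows a dusk peak
counts = [2, 1, 1, 0, 1, 2, 3, 4, 3, 2, 2, 1,
          1, 2, 2, 3, 4, 6, 9, 30, 8, 5, 3, 2]
peaks = flag_anomalies(counts)
```

A production model would condition on covariates such as light level and barometric pressure rather than on raw hourly averages, but the output is the same kind of object: hours (or places) where vector activity departs from the expected baseline.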
Another challenge faced in the Premonition Project is ensuring that these devices are field deployable, able to work in places disconnected from the cloud, and low cost. Jackson explained that in the example above the team focused on how inexpensively it could deploy a design that could determine, with 90 percent accuracy, the species of the detected arthropods. Even without detecting a known pathogen, species correlates with which pathogens are carried. He explained that each wingbeat sensor costs about 25 cents, and the embedded microcontroller doing real-time processing costs about $10.
The third capability needed by the Premonition Project is the ability to do agile engineering in the context of real biology. He said the team built the Premonition Proving Ground—Microsoft’s first laboratory certified by the U.S. Department of Agriculture (USDA) and the Centers for Disease Control and Prevention (CDC). He explained that it is a secure facility that accepts biological specimens; currently, the team is focused on mosquitoes to test its robotic field biologist designs. The facility allows the team to conduct high-fidelity simulations of the mosquitoes’ native environments. The mosquitoes arrive as eggs, and the facility replays their environment and life cycle as closely as possible, including the solar cycle (the Sun’s movement across the sky in the mosquitoes’ native environment). The mosquitoes spend their entire lives in a simulated and monitored environment programmed to resemble the area from which they came. The testing environment allows researchers to vary parameters such as the presence of ultraviolet light, colors, and different types of chemicals.
The mosquitoes are securely released into a large, sealed room to interact with robots and test platforms. The room is heavily instrumented, and computer vision systems track how the mosquitoes move around the room. The platform produces both individual mosquito trajectories and a heat map of where the mosquitoes spent their time. Jackson noted that the general goal is to learn how to map and detect the biome and threats in the environment in near real time. He explained that arthropods are a good place to start because they can be readily collected and analyzed and because they interact with so many different sectors. He pointed out that another significant advantage of the robotic field biologist is that it can help prioritize what to sequence in a wet laboratory, saving time and costs.
Brinsfield inquired about the reason for Microsoft’s work in this area and where Jackson thinks the centers of gravity of expertise for these sorts of problems reside. Jackson explained some of the philosophy behind Microsoft Research, an institute within Microsoft that works across the technology readiness levels from university-style basic research to incubation and finally productization. He noted that early-stage basic research need not be tied to existing products. Further, health and life sciences is an important area of investment at Microsoft, generally. He explained that the existence of a university-like model embedded in Microsoft has been critical for rapidly exploring and de-risking projects such as Premonition and other successful Microsoft endeavors such as Azure Sphere (in the IoT space), leading to successful transitioning into the later incubation and productization phases. He emphasized that it takes many years of research-driven exploration to know whether or not a product could work commercially.
Fingar noted that the Premonition Project has clear applications to health, safety, and protection of the food supply, but asked whether it would be possible to turn the technology to negative uses such as engineering pathogenic threats. Jackson responded that, independently of the technologies described earlier, some viruses can be created from scratch from their known genomes. He noted that questions remain about how this capability translates as the team begins to map out the genomes of previously unknown entities. However, he further noted that examples already exist of known harmful pathogens that can be mutated today, such as highly pathogenic avian influenza virus. He recalled controversy in the community around determining the few mutations required to turn that virus into one that would infect ferrets. He said that he considers the role of these technologies to be improving understanding of what is moving and present in the environment so as to better detect when unwanted things emerge and spread in new ways. Jackson also pointed out the challenge of securing sensor networks against fraudulent data meant to induce false results. He noted that the team works to ensure that the systems are secure end to end to help protect against those kinds of attacks.
Elder asked about the use of National Institutes of Health (NIH) databases as information sources and whether Microsoft researchers are cleaning and returning the data to NIH. Jackson confirmed that the team does clean the data, and because it can look across the whole database, it can sometimes spot systematic sources of contamination. He noted that contamination is present in part because it is a challenge to acquire and then perfectly assemble the genome of only one organism. He said that the project’s approach is to systematically scan an end-to-end matrix of how different genomes are related to each other to spot contamination. He noted that when dealing with data sets this big, it is not possible to flag contamination manually, but statistical models can account for suspicious data. He said that the team has not yet shared these results with the data sources, noting an open question for the community: what is the right way to share back data that have been “cleaned,” and how should “cleaning” be defined?
One participant asked for observations on what might be next in this technological space, such as robots sampling different domains, or for concerns other than those related to public health. Jackson explained that the project’s original motivation came from the 2014-2015 West Africa Ebola outbreak. The project was concerned with how the world missed that signal, allowing it to get as far as it did. However, if the value of a system comes only when it finds a proverbial needle in a haystack, it becomes difficult to make that system scale and remain economically sustainable. He said that the team’s thinking has since evolved: being in the environment and understanding the biome better has numerous implications beyond detecting needles in haystacks.
Jackson noted that several new interventions come from the biome itself, such as phage therapies (using viruses to treat bacterial infections). Understanding what is in the biome also gives a better picture of challenges such as antimicrobial resistance, and scientists can look there for new treatments for microbes that are resistant to current antibiotics. Agriculture, too, depends a great deal on what is in the biome: pests are a long-running concern, and there are newer concerns about the ecosystem services of wild pollinators. Jackson said he believes that enough of a platform exists that it amortizes across several different sectors, providing the capability to look for threats in a new way while sustaining itself. He said, “If you make the haystack useful, then you have permission to find the needles.”
Another participant asked whether the project’s metagenomic analysis has found evidence of chemical contamination in the biome, noting that bee colonies can encounter chemical contamination that results in massive declines in health. Jackson replied that metagenomics cannot see chemicals, although there are other approaches, such as mass spectrometry, that the project has not tried. He noted that using arthropods as bio-indicators is a longstanding approach because some live in environments in which they accumulate chemicals in their bodies. Although his system would not see such indicators metagenomically, he said, there is research on environmental contaminants that uses arthropods in this way.
Mallory Stewart, a planning committee member, asked about concerns related to weaponization of such technologies and whether there are controls with respect to international players in the conversation. She noted that the commercial setting is very different from the national security setting. She asked about Microsoft’s theory on interactions with the national security apparatus domestically, and how it works to prevent a technology from moving out of the commercial setting to pose a national security threat. Jackson responded that from the domestic point of view his team has tried to understand, given the long-running missions in force health protection and pandemic preparedness, how this work could fit into the mission and needs of the Department of Defense (DoD). He added that his team has done some work with the Intelligence Advanced Research Projects Activity.
Jackson explained that the project’s interactions with international collaborators are important. Because other countries see disease threats first, it is important that his team understand and cooperate with them. He noted that their focused interactions will typically be around how to share data. He also noted that if Microsoft brings samples into the United States, it must work with the USDA and the CDC, as well as with partner organizations on the international side, regarding what kind of data can leave their country. In the genomic community, he said, there is a more general conversation around whether countries should share sequence information digitally. He emphasized that the scientific community believes it would be a major challenge if sequence information could not be shared between countries.