“After the Human Genome Project one of the big promises was that we would be able to map genes for different types of diseases,” Aviv Regev of the Massachusetts Institute of Technology told the workshop audience as she began her keynote address. Indeed, she continued, some 100,000 different regions in the human genome have now been associated with a range of diseases, everything from inflammatory bowel disease to schizophrenia. This has been a truly remarkable achievement by the human genetics research community, she said, but at the same time that achievement points to one of the biggest challenges facing that community today: How does one take these 100,000 variants scattered around the genome and understand what they do, first at the level of cells and then at the level of tissues, then organs, and then, finally, across entire human beings?
Regev provided a big-picture view of the challenges facing the biological researchers attempting to understand the genotype–phenotype connection—that is, how the information coded in the genome leads to the physical characteristics of an organism. To do this she detailed results from various lines of research and described the multi-level approach that she argued will be necessary to discover the “rules of life.”
Regev began by discussing why it has proved so difficult to move from genotype to phenotype. The overarching reason, she said, is that biology is the integration of multiple different levels of organization. The fundamental unit of life is the cell, which is made up of molecules that are either encoded directly by genes or else obtained through chemical reactions that gene products catalyze. These cells interact with each other, and in multi-cellular organisms they combine to make tissues, which in turn are assembled into organs, which account for an individual’s physiology. Increasing the complexity, individuals make up populations, which are part of ecologies, and it is at the ecological level that natural selection acts.
Research biologists interested in understanding the rules of life ask three different kinds of questions, Regev said. There is the “What?” question, which concerns structures. There is the “How?” question, which concerns genetics and mechanisms. And there is the “Why?” question, which is a functional question. Each of these questions must be asked at each of the organizational levels, including the levels of genes, tissue structure, physiology, and evolution.
The resulting complexity makes it difficult to answer any of these sorts of questions, Regev said. In addition, the interactions between the levels are complex and frequently nonlinear. This nonlinearity in particular made mapping and understanding these systems impossible for a long time.
“So it doesn’t really matter in which direction you look,” she said. “In theory, for each of these problems—and many other problems—the space of possibilities is enormous.” Regev noted that the main issue is that researchers do not know, up front, which connections matter and which do not.
The question then is, how does one study something systematically if it will never be possible to study it exhaustively? “So what I’m going to claim here today through many different examples,” Regev said, “is that in this case the bigness of biology is no longer our problem. It’s actually one of the best opportunities that has ever happened to us. We couldn’t really seize it in the past. We didn’t have the tools. But we do today.”
In the past the basic approach was for researchers to look for intelligent ways to limit “the bigness” because the tools were not available to deal with the entire search space. So instead of testing all possible mutated sequences, for example, one might probe for those mutations that were known to exist. To avoid testing all possible genetic interactions, researchers might look only for interactions between genes that were already known to be likely to interact. Or, instead of looking at sequences in general, they might design assays to look at ones that exist in nature or that relate to existing models. “These are all great approaches, and they have actually served us incredibly well,” Regev said. She noted, however, that “they don’t fully solve our problem. We’re not sure that what we get after we limit the search space is a general answer or a rule of life.”
In a search for a different approach, Regev offered a metaphor related to current approaches. Imagine that functional discovery in biology is like trying to figure out which painting is hidden behind a sheet of white tiles and every experiment in biology is like removing one small square from the sheet to see what is behind it. “Until recently,” she said, “all we could do was remove a few tiles, so we had to focus.” If, for instance, something interesting (a bit of red, say) was found by pulling off a tile, then one might uncover more tiles around that. That could lead to uncovering little patches in a sea of white where something is known about part of the underlying painting. Yet, the painting as a whole remains a mystery.
If one uncovered the same number of tiles randomly, it would be hard to determine anything about the painting because most of what was uncovered was not interesting—background or some equally uninformative part of the painting. “But what if I were allowed a lot more tiles?” Regev asked. Then, even if they were done randomly, one might well be able to discern enough pattern to recognize the painting underneath.
Doing this type of pattern recognition in experimental biology is not typical, Regev said. “It’s a great new opportunity—and one for which we’re already reaping benefits, as I’ll try to show you today, in many different ways.” The remainder of her talk consisted of examples related to cells, programs, and mechanisms.
Cells are a key intermediate between genotype and phenotype, Regev said. For example, even though every cell in a body carries the same genetic variants that confer the risk of a particular disease, the disease will typically manifest only in those cells that express the gene product or are affected indirectly by it. “Knowing our cells is going to be essential for understanding gene function in humans,” she said.
Consequently, it is a problem that scientists do not really know how many different cells there are or what their molecular characteristics are. Historically, cells have been categorized in many different ways (according to their structure, their location, their function, and so on), so there is not a unified way to think about all of them. One way to address this would be to come up with a “map” or an “atlas” that had a unified set of coordinates for all human cells, Regev suggested. Gene expression levels could provide such a collection of coordinates so that each cell would be a point in an extremely high-dimensional (20,000 or more dimensions) space, with one dimension for each gene.
This would not have worked in the past when it was not even possible to measure gene expression levels in single cells. In the past few years, breakthroughs in single-cell genomics have made it possible to measure expression, chromatin, and other molecular profiles in large numbers of individual cells. One technique in particular—single-cell RNA sequencing—has dramatically increased in recent years in the number of cells that can be examined, going from just over a dozen in 2011 to 30 million in 2019.
This capability makes it possible to carry out experiments that, in terms of Regev’s painting metaphor, seek to reveal many random pieces of the big picture—an approach she calls “design for inference.” She and her team decided to develop methods “that favor sparse and noisy data from massive numbers of cells over much richer and more precise data from a small number of cells”—in essence, uncovering lots of random tiles rather than focusing on one or two areas of interest. The reason for this decision, she said, was that her team knew they could handle the sparsity of data and data resolution, per cell, because gene expression and other characteristics of cells are highly structured both within and across cells. She noted that structure is key because it is possible to gain a lot more information from different types of data than one might expect.
That experimental design has proved successful, Regev said, although “it was extremely uncomfortable for the experimental biologists initially.” Their experiments looked for patterns and structures in sparse and noisy data and uncovered a number of discrete cell types. In addition, they can look at cellular changes over time, such as through development, and how the cells respond to environmental changes. Regev noted that they “can even see the imprint of where a cell is located—its anatomical position—in the fine histology or the tissue structure that it’s in and even the direct cellular neighbors that it has.”
What is most important to understand about this is that even though most of her group’s computational analysis methods are aimed at capturing one aspect at a time, the actual cell is all of those different things and identities at once. “It has a type and a location and a history and possible fates. It’s always undergoing multiple transitions all at once…. And all of these aspects interact with each other.”
The bottom line, she said, is that thinking about cells as the basic unit of biology can be useful for many purposes, but it can also be a limited representation because many of the behaviors that cells exhibit do not conform to those of a defined cell type. For that reason, she said, it is often helpful to carry out analyses where gene programs are a fundamental unit, rather than thinking only in terms of cells as the basic unit for analysis and understanding.
Gene programs are important in several ways, Regev said. First, they make it possible to better describe and understand cells that span the spectrum and do not obey the typical behavioral boundaries. Second, they help in studying the function of genes, that is, their phenotypic mapping. Finally, simply knowing that genes form structured programs helps with such problems as genetic interactions, which otherwise might appear intractable.
Understanding Cell Categories Using Gene Programs: An Example of Innate Lymphoid Cells in Psoriasis
She illustrated the idea that gene programs help describe cells that cannot be discretely categorized with an example related to the role of innate lymphoid cells (ILCs), a type of immune cell, in psoriasis. Initially in studies of psoriasis, it seemed that there were two distinct types of ILCs, labeled ILC2 and ILC3. ILC3s are thought to be the “first responders” that signal an immune response to T cells, which subsequently causes psoriatic skin. The issue, Regev said, is that when you look at healthy skin, there are no ILC3s present, only ILC2s, which leaves the enduring question of where the ILC3s come from. While it is possible that these cells circulate from other parts of the body, or are too rare to be noticed, the Regev lab wanted to shed some light on the situation and did so through work with a mouse model.
When Regev’s group investigated ILCs with single-cell RNA sequencing, they found that the ILCs were not discrete cell types but rather spanned a range of continuous cell states. This is difficult to capture when assuming that cells are the basic unit instead of considering the gene programs (Bielecki et al., 2018).
“So how can we capture this distinct biology?” Regev asked. To analyze what was happening in the cells in terms of gene programs she turned to an approach that is used in text analysis. To explain text analysis, she offered the example of a news article about the restaurant of Chef Gordon Ramsay. “Now, this piece is related to different topics,” she said. “There’s food, there’s business, and there’s celebrity culture.” And the words in the document will reflect the different topics. Some words, such as the names of dishes, are related only to one area—in this case, food and cooking. Others can be tied to multiple topics; “Gordon Ramsay,” for instance is related both to the topic of food, because he is a chef, and to the topics of business and celebrities. The key point is that one can analyze the words in the article—independent of the meaning of the individual sentences and paragraphs—to get information about the topics covered in the article.
“In the same way that words in a document result from topics that the document covers, even though no one tells me up front what these topics are,” Regev said, “gene transcripts in the cell are related to programs or processes in that cell even though we don’t know these topics or programs up front.” What is important is that by working from the gene transcripts in a cell, one can use various computational methods to capture the programs that are likely taking place in the cell. In her analysis of the role of ILCs in psoriasis, Regev used a particular method called weighted allocations. “They capture the topic [program] as a probability distribution over the genes where each gene has a weight of belonging to the topic,” she explained, “and then each topic has a weight of being in the cell.”
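The two layers of weights in that description can be illustrated with a toy calculation. The sketch below is not the method used in the study: the gene names and every number are invented for illustration, and only the program labels echo the ILC2-like and ILC3-like programs discussed in the talk. It shows how a cell's expected transcript composition follows once each program is a probability distribution over genes and each cell carries a weight for each program.

```python
# Toy illustration: programs ("topics") are probability distributions over
# genes, and each cell mixes programs with its own weights.
# Gene names and all numbers are hypothetical.

GENES = ["geneA", "geneB", "geneC", "geneD"]

# Each program assigns a probability to every gene (each list sums to 1).
PROGRAMS = {
    "ILC2_like": [0.6, 0.3, 0.1, 0.0],
    "ILC3_like": [0.0, 0.1, 0.3, 0.6],
}


def expected_profile(program_weights):
    """Mix the program distributions using one cell's program weights."""
    total = sum(program_weights.values())
    profile = [0.0] * len(GENES)
    for name, weight in program_weights.items():
        for i, gene_probability in enumerate(PROGRAMS[name]):
            profile[i] += (weight / total) * gene_probability
    return profile


# A cell running both programs at once, rather than belonging to one type.
mixed_cell = expected_profile({"ILC2_like": 0.5, "ILC3_like": 0.5})
for gene, fraction in zip(GENES, mixed_cell):
    print(f"{gene}: {fraction:.2f}")
```

In a real analysis the direction is reversed: the program distributions and the per-cell weights are both inferred from observed transcript counts rather than specified in advance.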
The analysis indicated that there were “co-expressing cells” that had both ILC2-like and ILC3-like programs or features and that one could better understand the ILCs in terms of the different programs running in the cells rather than in terms of different cell types. These co-expressing cells are poised to move in either the ILC2 or the ILC3 direction, depending on various epigenetic triggers (Bielecki et al., 2018). The lab was able to validate this result by looking at the single-cell expression profile under different conditions as well as testing it in a mouse model.
Thus, Regev concluded, “genetic programs are helpful in thinking better about the functionality of cells. They are just more fluid and more flexible than putting everything in these discrete categories.”
Defining Function Through Gene Programs
The second advantage of programs that Regev discussed was in thinking about the functionality of genes. To illustrate this, Regev first described some of her work on ulcerative colitis, a form of inflammatory bowel disease (IBD). IBD is the “poster child” of human genetic studies because researchers have used genome-wide association studies (GWASs) to identify hundreds of loci that are associated with the disease, and for the vast majority of these, researchers do not know the function of the associated genes.
To study the genes responsible for ulcerative colitis, Regev and colleagues examined gene expression in different cell types and in particular looked for cells that were enriched for risk genes for ulcerative colitis (Smillie et al., 2019). Once they had those data, they began looking for patterns in gene expression. “One way of predicting gene function is asking which other genes co-express with it,” she said. Because they had gene expression data by cell type, they could look for genes that were co-varying within the specific cell types in which those genes were expressed. In essence they were looking for gene programs to lead them to clues about gene function.
The result was a collection of gene modules made up of a set of genes whose expression co-varied in specific cell types, and, as it turned out, most of the modules they identified consisted of multiple genes identified by GWAS—that is, most of the genes that co-varied with a GWAS gene in a particular cell type were themselves GWAS genes. Out of about 100 different GWAS genes that had been implicated in risk for IBD, Regev and her colleagues formed about 10 different modules that include more than half of the GWAS genes in a cell-type-specific manner. These were gene programs implicated in ulcerative colitis.
The knowledge that such structures exist is useful in itself, Regev said. For example, because GWAS genes tend to aggregate with each other in modules, when examining candidate genes one could give preference to those candidates that co-vary with previously mapped GWAS genes or even just with other candidates.
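The idea of preferring candidates that co-vary with a mapped GWAS gene can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis from Smillie et al.: the gene labels, the expression values, and the 0.8 correlation threshold are all hypothetical, and Pearson correlation stands in for whatever co-variation measure a real study would use.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical expression of three genes across five cells of ONE cell type.
expression = {
    "gwas_gene_1": [1, 2, 3, 4, 5],
    "candidate_a": [2, 4, 5, 8, 10],
    "candidate_b": [5, 1, 4, 2, 3],
}

THRESHOLD = 0.8  # arbitrary cutoff chosen for this illustration

seed = expression["gwas_gene_1"]
co_varying = [
    gene
    for gene, values in expression.items()
    if gene != "gwas_gene_1" and pearson(seed, values) >= THRESHOLD
]
print(co_varying)
```

Restricting the calculation to cells of a single type matters here: correlating across mixed cell types would mostly recover which genes mark which type, not which genes vary together within a type.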
Programs can also be used more directly in examining gene function, Regev said—not just by relying on inference, as in the work with ulcerative colitis, but through direct experiments.
For example, she described genetic screens that make it possible to find all genes that can individually affect the expression of a particular target gene, called Gene X. A great deal of biology has been learned from this technique, she said, but it does have some limitations. For one thing, there is a simple readout for each cell—the level of just one gene, Gene X. This means that one must know how to choose Gene X in advance of the experiment. It also means that all hits are going to look the same, because they all affect the level of Gene X. That makes it difficult to capture complex biology.
Regev described a new assay technique called Perturb-seq, designed to get around these limitations, which involves pooled CRISPR screens with a single-cell RNA sequencing (RNA-seq) readout. One of its first applications was in dendritic cells, a type of immune cell, stimulated with lipopolysaccharide (LPS), a molecule found in the cell membrane of bacteria. They found that the genes whose expression changed in the analysis partitioned into five programs (Dixit et al., 2016). “That means that all of the genes in the program are affected in a similar way across the perturbations,” she explained.
What is particularly important about the Perturb-seq technique is that the presence of the programs makes the assays easily scalable. For program-level effects, Regev said, it is enough to have as few as 30 to 50 cells per perturbation and a few hundred reads per cell. “The rest is just given to you by structure.” This in turn means that the screens can be done for many different purposes and will produce a unified readout. The technique also can be carried out with coding or noncoding variants in one cell type or in multiple cell types simultaneously.
Regev illustrated the power of Perturb-seq with a recent example where she used the technique to characterize the potential function of 35 genes with loss-of-function variants in autism (Jin et al., 2019). These are genes that are known to play a role in autism from human genetics, but researchers know nothing about the specific roles they play, the cell types in which they act, or the processes by which they work. “It’s really hard to decide even which screen you should devise for them,” she said. “So we devised this screen you do when you don’t know anything.”
Testing 35 genes known to be implicated in autism spectrum disorder against five major cell types, they first examined the effects of individual genes on individual cell types and found very little. Only 1 of the 35 genes, dyrk1a, had any significant phenotypic effects.
But things looked different when they turned to the level of programs. Examining which programs were affected by the perturbations, they found that 15 of the 35 autism genes affected six programs across four different cell types. “This highlights that there are probably a limited number of cellular processes crossing different cell types that these genes actually converge into,” she said. She added that this was an early screen done with a relatively small set of genes, but experimental improvements have now made it possible to do very large—and even genome-wide—screens.
Using Structured Programs to Understand Genetic Interactions
The third way that genetic programs can be of assistance is that simply knowing that genes form structured programs helps with such problems as understanding genetic interactions. For example, Regev said, “we can use this knowledge that there are expression programs not just to change our analysis of data we already measured, but also to change how we do measurements in the first place.” This was, for instance, why she knew that she could get useful results from sparse and noisy data from a large number of cells—because there was an underlying structure.
“But we got greedier and greedier with time,” she said. And they came up with a simple idea: Instead of measuring the expression of individual genes, they would measure what they called “composite genes,” which were linear combinations of small subsets of genes. They could then use a mathematical technique called decompression to make individual gene measurements. The approach depended on the assumption that gene expression was structured—if it is structured, then the expression profile can be described by the linear combination of a small number of modules.
Regev described an application of this approach in the context of spatial imaging where they were limited in the number of measurement channels, “so it’s a big deal if you can get rid of that limitation and get information about more genes without more experiments.” They created composite genes by mixing probes against different genes but with the same label. The experiment was done in mouse motor cortex with nine composite genes, each including from 8 to 13 individual genes out of the 37 total genes covered by the study. They were able to successfully decompress the 9 composites to get patterns for all 37 genes (Cleary et al., 2017).
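The logic of measuring composites and then decompressing can be shown with a deliberately small example. The sketch below is not the algorithm of Cleary et al.: it assumes the gene modules are already known and that every gene's expression is exactly a combination of two modules, so ordinary least squares is enough. In practice the modules have to be learned and a sparse solver is needed. The matrices `MODULES` and `COMPOSITES` and all of their values are hypothetical.

```python
# Toy sketch: 6 genes are measured through only 3 composite channels and then
# recovered, assuming expression is built from 2 known modules.

# MODULES[g][m]: loading of gene g on module m (6 genes x 2 modules).
MODULES = [
    [1.0, 0.0],
    [0.5, 0.0],
    [0.2, 0.3],
    [0.0, 1.0],
    [0.0, 0.8],
    [0.0, 0.4],
]

# COMPOSITES[c][g]: 1 if the probe for gene g is pooled into composite c.
COMPOSITES = [
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
]


def matvec(matrix, vector):
    """Multiply a matrix (sequence of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def matmul(left, right):
    """Multiply two matrices given as sequences of rows."""
    columns = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in left]


def decompress(measured):
    """Recover per-gene expression from composite measurements."""
    m = matmul(COMPOSITES, MODULES)  # composites x modules
    m_t = list(zip(*m))              # transpose
    # Least squares via the normal equations: (M^T M) w = M^T y
    a = matmul(m_t, m)
    b = matvec(m_t, measured)
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    w0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    w1 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return matvec(MODULES, [w0, w1])


true_expression = matvec(MODULES, [2.0, 1.0])   # module activities 2.0 and 1.0
measured = matvec(COMPOSITES, true_expression)  # only 3 numbers are observed
recovered = decompress(measured)
print([round(value, 3) for value in recovered])
```

Recovery is exact here only because the data are noise-free and the module assumption holds perfectly. The point of the sketch is the counting argument: three measurements suffice for six genes because only two module activities are actually unknown.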
Summarizing her comments on programs, Regev said, “I showed you that many biological processes are best captured at the level of programs, not cells. This gives us an ability to handle spectrums of programs … and solves the problem that no single partitioning of cells is going to capture all perspectives of biology.” In the case of human genetics, most genes captured by GWAS are expressed in specific cell subsets, and they map into modules that vary within these subsets. The modules also help predict GWAS gene function. Programs provide multi-purpose, rich, robust, and diverse readouts for scaled pooled screens, and knowing that a structure exists makes it possible to perform more efficient experiments using compressed sensing.
To introduce her third topic, mechanisms, Regev asked the question, “How far can we go in chasing the really difficult problems?” There are certain problems in biology that seem to be so large and complicated that there are not enough cells in all of the humans in the world to do the necessary experiments and, indeed, where the number of necessary experiments is more than the total number of atoms in the universe. “So generally I think we all assume that these problems can never be fully tackled,” she said. “And that might still be true, but I think there’s some room for optimism.” She illustrated her point with two examples of how such problems can be addressed by thinking about them differently.
Creating Models to Understand Gene Expression
The first problem she discussed was predicting how genetic sequence controls levels of gene expression. The basic approach is to work from examples where there are known regulatory sequences and associated expression. For many years, one approach has been to take the sequences and associated expression data from individuals in a population. Alternatively, researchers design the sequences, usually starting with something from nature, and modify them according to some understanding or hypothesis. With massively parallel reporter assays it is possible to get tens of thousands to hundreds of thousands of examples to work with.
The issue is that even though that sounds like a lot of data, it is not enough to take full advantage of machine learning, Regev said. She went on to ask how a much bigger dataset could be generated. One straightforward way would be to use random sequences of DNA, which are available commercially, as training data. This should work, Regev said, because “transcription factors bind in short degenerate sequences, and most transcription factor binding sites should exist in random DNA.” According to a calculation she and her colleagues made in 2009, one transcription factor motif should appear in every 1,500 base pairs, so if one analyzed a library of 10 million 80-base-pair sequences, there should be more than 500,000 such motifs.
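The arithmetic behind that estimate is short enough to write out. The sketch below uses only the three figures quoted above and treats the library as one long stretch of random DNA, which ignores the small loss of motifs that would straddle the ends of the 80-base-pair sequences. The variable names are illustrative.

```python
# Expected number of occurrences of one transcription factor motif in a
# library of random DNA, using the figures quoted in the text.
library_size = 10_000_000   # number of random sequences
sequence_length_bp = 80     # length of each sequence in base pairs
bp_per_motif = 1_500        # one motif expected per this many base pairs

total_bp = library_size * sequence_length_bp
expected_motifs = total_bp / bp_per_motif

print(f"total base pairs: {total_bp:,}")
print(f"expected motif occurrences: {expected_motifs:,.0f}")
```

The library contains 800 million base pairs in total, which gives roughly 533,000 expected occurrences, consistent with the figure of more than 500,000.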
With this in mind, Carl de Boer, who was working in Regev’s laboratory at the time, devised a simple assay (de Boer et al., 2020): “You would measure the extremely noisy expression level of hundreds of millions of sequence examples. For most of them we would get exactly 0 or 1 estimate of their expression level. You will see them not at all or once.” It was an easy experiment to do, she said. “You put these in cells. You see how much expression they drive.”
The result was a huge amount of data that could be used to train a very complex model, Regev said. In particular, the model she and her team worked with was mechanistic, with many biological details, designed to be interpretable. Such models can provide detailed mechanistic insight. “For example,” she said, “they highlight very precise ways in which regulatory proteins might interact with DNA.” Most importantly, the model and all the data that went into it made it possible to look at much smaller effect sizes than had been possible before—with some surprising results. “These models showed—and we confirmed this with experiments—that weak interactions, which are usually completely ignored by models, … are actually a predominant way in which regulation happens.”
Besides offering insight, a second main use of models is to predict, and Regev’s team built a second model whose main purpose was to offer predictions rather than any insight into actual mechanisms. It worked well, Regev said. It accurately predicted the expression of both random sequences and native sequences from yeast (S. cerevisiae), their research organism. In particular, it predicted expression for random sequences with 98 percent accuracy and for native yeast sequences with 92 percent accuracy (de Boer et al., 2020). “It means we can now use the model based on random sequence to design sequences that have desired properties,” she said. “And when you do that, you can ask for ones that give you particularly high or particularly low expression.” Furthermore, she added, “if it’s as predictive as that, you might start thinking that you have a full landscape of how sequence maps to expression, and if expression maps to fitness, you can say something meaningful about evolution.”
Using Mechanisms to Understand Genetic Interactions
The second problem she discussed that can be approached in terms of mechanisms was the study of genetic interactions. This is an area in which the number of possible combinations is truly staggering. If, for example, one wished to test all possible five-gene combinations among the 20,000 or so human genes, there are not enough cells in the world to do the work.
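The scale of that search space can be checked directly. The sketch below counts unordered gene combinations of each size using the round figure of 20,000 genes given above. It counts only which genes are combined, not the additional conditions, cell types, or replicates an experiment would need.

```python
import math

N_GENES = 20_000  # approximate number of human genes, as stated in the text

# Number of distinct unordered combinations of k genes, for k = 2 through 5.
combinations = {k: math.comb(N_GENES, k) for k in range(2, 6)}

for k, count in combinations.items():
    print(f"{k}-gene combinations: {count:.3e}")
```

Pairs alone number about 200 million, and each additional gene multiplies the count by several thousand, so five-gene combinations exceed 10^19.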
Furthermore, the interactions will manifest differently, depending on the readout gene. At a small scale, genetic interactions can be studied by profiling-based methods such as Perturb-seq. Regev described one such study involving two genes for transcription factors, NF-κB1 and Rela, that jointly control a program, with Rela activating the program and NF-κB1 suppressing it. When perturbed together, her team found that in some interactions NF-κB1 was dominant over Rela and in others their interaction was perfectly additive.
Despite this success, a study such as this looking for two-, three-, four-, and five-way interactions among all human genes is never going to happen. “There aren’t enough cells in the world.” And that, Regev said, was motivation for thinking about how to do such experiments differently. “Is there some way of both doing the experiments more efficiently and learning more from the ones that we do?”
This is the current problem her lab is working on, and their approach is once again taking advantage of the fact that there is structure involved, so it is possible to detect patterns with much less data than would otherwise be needed. In this case, “the affected genes are structured in these co-regulated programs, and the targeted genes with genetic perturbations are structured into these co-functional modules.” So the idea behind her approach was not to measure the effects of individual genes and individual perturbations, but instead measure the effects of their compositions.
“From the perturbation side,” she explained, “it means we’re going to sum up perturbations together. We can do it in different ways. We can perturb separately and only sum up at the measurement phase. Or we could squeeze a lot of perturbed cells into a single measurement and measure them together. We can also squeeze in a lot of perturbations into one cell.”
As a test, her team perturbed 600 genes in dendritic cells stimulated with LPS. “We did it either the traditional way, 1 cell at a time, 82,000 single cells, 19 channels for the traditional models, or we did it in a compressed way, squeezing 250,000 cells into 2 channels.” After collecting the data, they used an algorithm to decompress the data from the second set and found that there was a 97 percent correlation between the results of the traditional and the compressed experiments. Similarly, when they examined the effects of five known major positive and negative regulators in this pathway on four major known targets, the results were consistent, showing that the approach works.
However, even with this sort of compression, examining all gene interactions in humans is still out of reach, she said. There are just too many possible interactions. So she is working on yet another approach that uses simple modular structures to help think about which genes are likely to interact. Using the GWAS genes from the ulcerative colitis work as seeds, they are building modules of two types: cell-type-specific modules that vary across all of the cell types, and program modules in which genes co-vary within a cell type. Then they will examine genetic interactions either where the genes in the same module interact or where the interactions are between genes in different modules. Some early results are just emerging from that work.
With that, Regev summarized the third part of her talk: “Responding genes form co-regulated programs, perturbed genes form co-functional modules, and this coupled structure can be leveraged in many different ways to tackle genetic interactions.” Furthermore, random experiments can be used to tackle such questions as the effects of sequence on gene expression. Indeed, she said, researchers have been doing random experiments for decades. “They just didn’t call them by that name.” But random experiments get better and better as the amount of data increases, she said. “So I think we have a lot of room for optimism.”
To conclude her talk, Regev mentioned the various large scientific initiatives that made the work she discussed possible. Details of this part of the talk are explained in Chapter 8 in relation to the consortia and databases with which other speakers work.
In addition to the topic of scientific initiatives, Regev’s talk touched on other topics that came up as themes throughout the workshop. These include understanding epigenetics and gene regulation, learning about how environmental interactions and perturbations affect gene expression, and the use of research organisms as models for laboratory work. Her talk also covered some conceptual ideas that other speakers brought up in their talks, including the non-linearity of biological interactions and gene expression, the complexity of understanding genetic function, and the inherent structure present in biological systems. Over the remainder of the 2.5-day workshop, speakers and panelists continually referred to the principles brought up in Regev’s talk as they related these important concepts to their own work.