Seth Shipman, University of California, San Francisco, and Gladstone Institutes
Seth Shipman, an assistant professor at the University of California, San Francisco (UCSF) and assistant investigator at Gladstone Institutes, opened by explaining that his laboratory’s approach to disease is to focus on building new technologies rather than on individual diseases. He noted that his laboratory focuses on making new parts that are not native to the cells, which are then delivered to those cells in synthesized or assembled pieces of DNA.
Shipman explained that once inside the cells, these new parts can execute new functions that cause cells to exhibit unusual behaviors and allow for analysis of what is happening in the cells. To facilitate the analysis, he created a molecular data acquisition device that functions inside of a cell, allowing him to collect data over time about what is happening inside the cells of a complex tissue. His team’s research led them to develop a way to use DNA as a storage medium.
Shipman explained that DNA in cells encodes essentially all of the biological information that programs the natural world; it is an information-bearing molecule. An individual strand of DNA is composed of four different bases, referred to as A, C, G, and T, and the order of these bases on a given strand can be used to represent any arbitrary information in the same way that a string of zeroes and ones in a binary code can be used to represent any arbitrary information. This, he said, is why DNA can be used to encode data.
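The analogy between base order and binary code can be made concrete with a minimal sketch. The two-bits-per-base mapping below is an illustrative assumption, not a scheme Shipman described, and the function names are hypothetical.

```python
# Illustrative assumption: each base carries two bits (A=00, C=01, G=10, T=11).
# Real encoding schemes differ; this only shows that base order can hold binary data.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of bases, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Turn a string of bases back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"DNA"
strand = encode(message)   # four bases per byte
restored = decode(strand)  # identical to the original message
```

Under this mapping every byte becomes four bases, so the three-byte message above becomes a 12-base strand.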
DNA is a remarkably successful information storage device in terms of capacity and durability, Shipman continued. He compared the storage capabilities of various media and noted that in Earth’s total biome, DNA stores roughly 10^16 zettabytes of information. As a storage medium, DNA far exceeds current digital media in terms of capacity and durability.
Shipman explained that most types of storage media used today have a very short lifetime, on the order of tens of years, meaning that all the data being stored must be regenerated or it will be lost due to the lifespan of the storage medium. He asserted that when stored properly in DNA, information can potentially last millions of years and would allow current generations to communicate with future generations, something that is not possible with current storage media. He compared this concept to an information seed vault. He said that it was a fair comparison, in part because although long lasting, DNA storage does not allow quick access. He noted that others are working to address that challenge.
Shipman explained that an artist, not a scientist, was one of the first people to encode data in DNA, in the 1980s. The artist encoded a pictogram into binary code, and then synthesized it into E. coli.1 Shipman added that information stored in this manner is inherently hidden, and that additional steps could make the existence of that information even less obvious.
1 A. Extance, How DNA could store all the world’s data, Nature 537:22-24, 2016, https://www.nature.com/news/how-dna-could-store-all-the-world-s-data-1.20496.
Continuing to provide historical context, Shipman told of researchers who encoded a World War II–era message in DNA. They then mixed it with other human DNA to further obscure it. Finally, taking inspiration from previous encryption methodologies, they blotted it in the size of a period on a page, making it almost impossible to find, although the information was fully recoverable.
Shipman asserted that DNA can be used just like any other storage medium, pointing out that sound, images, and videos all have been encoded in DNA and recovered. Further, he continued, the actual encoding schemes being used vary greatly. Variations address the trade-off between information density (that is, how much information is encoded in every base of the DNA) and redundancy and error correction. Using more DNA results in a less dense encoding scheme but one that is more robust to errors.
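The density-versus-redundancy trade-off can be illustrated with a simple repetition code. This is a sketch under stated assumptions: the threefold repetition and majority vote are a textbook technique chosen for illustration, not one of the schemes Shipman surveyed, and the function names are hypothetical.

```python
from collections import Counter


def add_redundancy(strand: str, copies: int = 3) -> str:
    """Repeat every base `copies` times: more DNA, lower density."""
    return "".join(base * copies for base in strand)


def majority_decode(strand: str, copies: int = 3) -> str:
    """Recover each base by majority vote within its block of copies."""
    decoded = []
    for i in range(0, len(strand), copies):
        block = strand[i:i + copies]
        decoded.append(Counter(block).most_common(1)[0][0])
    return "".join(decoded)


original = "CACA"
protected = add_redundancy(original)  # three times as many bases
corrupted = "CGC" + protected[3:]     # one substitution error in the first block
recovered = majority_decode(corrupted)
```

Here the protected strand uses three times as much DNA for the same content, and in exchange a single substitution within any block of three no longer corrupts the decoded message.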
One of the largest efforts in this field is a collaboration among Microsoft, the DNA synthesis company Twist Bioscience, and a laboratory at the University of Washington. They are exploring large-scale archival information storage in DNA. Startups and DNA synthesis companies are also working in this area; many of the synthesis companies are working on enzymatic DNA synthesis and see data storage as a potential market for their efforts.
Recent work in academia has investigated different physical embodiments for storage DNA, according to Shipman. He reported on a recent paper2 that showed that it is possible to deposit pieces of DNA in a polymer that has a shape related to that DNA. The authors described printing a rabbit that contained the code for the printing of that same rabbit within it, allowing one to replicate the printed rabbit. The same paper provided another example, where the researchers encoded a movie and then embedded the encoded movie in a polymer. They also put the polymer into a pair of eyeglasses. The information is present in the glasses but not apparent. The paper described that retrieving the information is extremely difficult even when its presence is known.
2 J. Koch, S. Gantenbein, K. Masania, W.J. Stark, Y. Erlich, and R.N. Grass, A DNA-of-things storage architecture to create materials with embedded memory, Nature Biotechnology 38:39-43, 2020, https://doi.org/10.1038/s41587-019-0356-z.
In addition to being a compact and long-lasting data storage device, DNA is also compatible with cells. Shipman believes this is a key factor, because a cell may be able to analyze its environment or what is occurring within itself and then store the results of that analysis back into DNA, instead of outputting the results in a way that must be read with a piece of machinery. When this happens, Shipman highlighted, the cells themselves would essentially become data acquisition devices. To Shipman, this idea relates to the earlier idea of building a cell that can report about the biology that it is experiencing. He said that his team starts with a cell’s genomic DNA and makes discrete modifications to it, organized over time. Because the modifications are made in a linear order, recovering the information also allows for recovery of the timing of the events that it records.
Shipman said that he began this work using pieces from the CRISPR (clustered regularly interspaced short palindromic repeats) system. He said that when a phage attacks a bacterium with a functional CRISPR immune system, that virus deposits its own genome into that bacterial cell, causing an immune response that activates two proteins called Cas1 and Cas2 to form an integrase complex. According to Shipman, these proteins grab part of that virus’s genome and incorporate it into the bacterium’s own genome, in a particular spot. He said that this can be thought of as a memory of that virus; if the cell sees the virus again, it retrieves that stored sequence, which can be transcribed and used to guide a nuclease to attack the virus. He observed that this process achieves resistance to a previously encountered virus.
Shipman said that his team was more interested in the first part of this process, the acquisition of information from the virus. These integrases do something that Shipman believes is unique: if they see another virus, they will acquire a part of that viral genome and also put it into the cell’s genome in that same spot. However, he noted that they do not swap out the information; instead, they add to it. He said that they add another piece onto this record, and that they will do this in a directional way; that is, they always add it on one side of the previous information. The result is a record of pieces of viral DNA that provides the history and the order in which that cell experienced those viral attacks. Because of the predictable behavior of the integrases, Shipman found that if he synthesized pieces of DNA of arbitrary content that resemble what the cell would normally grab from viruses, and delivered those pieces to cells that have one of these CRISPR systems, the cells would acquire the synthetic pieces into their CRISPR array. He said that by using different sequences over time, he could write these records of events right into a living cell’s genome.
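The directional, append-only behavior described here is what makes the array a timeline. A minimal sketch of that idea follows; it is a toy model, not Shipman’s software, and the class name, method names, and the choice to place new entries at index 0 are hypothetical modeling decisions.

```python
class CrisprArrayModel:
    """Toy model of an array that only ever grows on one side."""

    def __init__(self) -> None:
        # Modeling assumption: index 0 always holds the newest acquisition.
        self.spacers: list[str] = []

    def acquire(self, spacer: str) -> None:
        """Add a new piece next to the previous ones; nothing is overwritten."""
        self.spacers.insert(0, spacer)

    def chronological(self) -> list[str]:
        """Read the array from the far end back to the insertion side."""
        return list(reversed(self.spacers))


array = CrisprArrayModel()
for event in ["event_1", "event_2", "event_3"]:
    array.acquire(event)

history = array.chronological()  # oldest acquisition first
```

Because position in the array is determined only by acquisition order, reading the array recovers both what was recorded and the sequence in which it was recorded, with no separate timestamp needed.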
Shipman noted that this system is quite reliable. It always grabs pieces of DNA of the same length; in this case, each new segment is 33 bases long. He also discussed sequence requirements that his team found. When his team looked across the full range of possible orderings of G’s, A’s, T’s, and C’s, it found that the integrases preferred some sequences over others, in particular those containing a “protospacer adjacent motif.” He also mentioned the requirement of needing to grab something that looked biological. He said that having too many of the same motifs tended to make the system work less well. He added that long runs of a single base, which cause problems in other aspects of biology, also are not tolerated, nor is having equal numbers of the different bases.
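Two of the constraints mentioned here, the fixed 33-base length and the intolerance of long single-base runs, are simple enough to check mechanically. The sketch below is illustrative only: the run-length threshold `MAX_RUN` is a hypothetical value not given in the talk, the function names are made up, and the motif and base-composition preferences are deliberately left out because the summary does not specify them.

```python
from itertools import groupby

SEGMENT_LENGTH = 33  # segment length stated in the talk
MAX_RUN = 3          # hypothetical threshold; the talk gives no number


def longest_run(seq: str) -> int:
    """Length of the longest stretch of one repeated base."""
    return max(len(list(group)) for _, group in groupby(seq))


def passes_basic_checks(seq: str, max_run: int = MAX_RUN) -> bool:
    """Check length, alphabet, and single-base run length only."""
    return (
        len(seq) == SEGMENT_LENGTH
        and set(seq) <= set("ACGT")
        and longest_run(seq) <= max_run
    )


varied = ("ACGT" * 9)[:SEGMENT_LENGTH]  # no base repeats back to back
monotone = "A" * SEGMENT_LENGTH         # one long run of a single base

varied_ok = passes_basic_checks(varied)
monotone_ok = passes_basic_checks(monotone)
```

A filter of this kind could screen candidate segments before synthesis, although passing it says nothing about how efficiently the integrases would actually acquire a given sequence.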
Shipman then described his laboratory’s work to use this result to acquire complex information about developmental biology or about cellular ecosystems. He began by encoding a 30 by 30 pixel image that had 21 different grayscale values. While this was not a terribly complex image, it exceeded what could be put onto a single 33-base segment of the type that would be put into one cell. He said that the next step was to scale up the encoded image and distribute it across multiple individual strands of DNA. He also added what were essentially barcodes to enable him to reassemble those strands into an image. He explained that this process yielded an image with 21 grayscale values and 9 pixels per barcoded DNA strand. He said that the encoding was done by using a lookup table with three-base codes for each pixel value, but with redundancy.
By exploiting the flexibility of that redundancy, Shipman could encode the same pixel value using different nucleotide sequences, allowing him to account for the fact that certain sequences are better tolerated by biology than others. Shipman then encoded the information and delivered it to the cells. Once that was accomplished, Shipman propagated the resulting cells for hundreds of generations and later recovered that information by sequencing.
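A redundant lookup table of this kind can be sketched as follows. This is an illustration under stated assumptions, not the table Shipman’s team used: the assignment of the 64 possible three-base codes to the 21 pixel values, the run-length limit, and all function names are hypothetical, and the barcode is omitted.

```python
from itertools import product

N_VALUES = 21  # grayscale values, from the talk
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 three-base codes

# Hypothetical assignment: code i stands for value i % 21, so each value
# has at least three interchangeable codes.
DECODE = {code: i % N_VALUES for i, code in enumerate(TRIPLETS)}
ENCODE: dict[int, list[str]] = {}
for code, value in DECODE.items():
    ENCODE.setdefault(value, []).append(code)


def has_long_run(seq: str, limit: int = 3) -> bool:
    """True if any single base repeats more than `limit` times in a row."""
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if prev == cur else 1
        if run > limit:
            return True
    return False


def encode_pixels(pixels: list[int]) -> str:
    """Pick, for each pixel, a code that avoids creating a long run."""
    strand = ""
    for value in pixels:
        options = ENCODE[value]
        # Fall back to the first option if every alternative creates a long run.
        strand += next((c for c in options if not has_long_run(strand + c)), options[0])
    return strand


def decode_pixels(strand: str) -> list[int]:
    """Look each three-base code up in the table."""
    return [DECODE[strand[i:i + 3]] for i in range(0, len(strand), 3)]


pixels = [0, 0, 0, 5, 20, 7, 7, 13, 1]  # nine pixels, as on one strand
strand = encode_pixels(pixels)
decoded = decode_pixels(strand)
```

The point of the redundancy is visible in the first three pixels: all three share one value, yet the encoder can alternate between equivalent codes rather than emitting the same bases nine times in a row, and decoding is unaffected by which equivalent code was chosen.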
Shipman emphasized that the resulting code was more flexible and more closely resembled natural biology than he had originally thought was needed. He tested other coding versions that were less flexible and found that they did not resemble natural biology and did not work nearly as well. He observed that working with biological media drove him and his team to design an encoding scheme that took advantage of this finding by resembling natural biology.
Shipman said that when sequencing natural CRISPR systems in normal, wild bacteria, it is common to find pieces of unknown genetic sequence. When that happens, it is normal to presume that those pieces are drawn from viruses, even though there are no data to support that conclusion. He explained that very few bacterial viruses have been sequenced, and in fact, most viruses, in general, have not been sequenced. Building on this point, Shipman said that once information is encoded using his method, those sequences, which are not in any database of known entities, are indistinguishable from the sequences found in a wild bacterium.
Most of the DNA he works with is synthesized, which Shipman believes leads to an interesting consideration. He posited that the threat often mentioned in the context of synthesized DNA is the synthesis of dangerous functional elements; that is, parts of viruses or toxins that can be made. He noted that much effort is being devoted to identifying those sequences, in part to aid in the effort to control their sale and distribution to illegitimate actors, with the idea being that once dangerous sequences are cataloged, orders for them placed with a commercial company can be scrutinized. He believes that while this is an important effort, the truth is that even if these elements are synthesized, a significant amount of technical expertise and resources is needed to move from just a synthetic piece of DNA to something that is truly a functional piece of biology. He added that, conversely, when looking at storing information in DNA, that transfer of information takes very little technical expertise or resources. The synthesized DNA is usable for information storage immediately and costs only dollars for a small piece.
Shipman also noted that we live in an era with both information and false information, and that it is theoretically feasible to create a false information stream using synthesized pieces of DNA, creating what are essentially genomic “deep fakes.” This can be done by creating a signature of an organism, a pathogen, or even an individual in a place where that person or that organism never existed, using synthesized DNA. He said that the falsity of this information would be essentially undetectable; although some artifacts of synthesis would be a starting point for a search, no framework for searching for this information exists.
Shipman reminded the audience that the majority of the natural world has not been sequenced, meaning that information does not appear in databases. That means that just because an examined sample is not found when searching a database, it does not follow that the information is not biological. Further, he pointed out, even when there is an actual match in the database, it tends to be less scrutinized because the inclination is to conclude that the information is from a biological source because historically it always has been. He wanted to impress upon the audience that the vulnerability created by that situation is worth attention.