Figure 20. The structure of DNA: a double helix with matching base pairs of CG and AT. Image from U.S. Department of Energy Genomic Science Program.
If all of the sequence reads were perfectly accurate, matching the overlapping ones would be routine. However, about 1 percent of base pairs were misread, which meant that overlapping jigsaw puzzle pieces would not always match exactly. The approach then became to find a good match rather than the best match (see Figure 21).
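To make the idea of a "good match" concrete, here is a minimal sketch in Python. It is not the method the genome sequencers used; it simply looks for the longest stretch where the end of one read lines up with the start of another while tolerating a small fraction of misread bases. The function name and the parameters `min_len` and `max_error_rate` are hypothetical, chosen for illustration.

```python
def overlap_with_mismatches(a, b, min_len=5, max_error_rate=0.01):
    """Find the longest suffix of read `a` that matches a prefix of read `b`
    while allowing a limited fraction of mismatched bases.

    Returns (overlap_length, mismatch_count), or (0, 0) if no overlap of at
    least `min_len` bases is good enough. All names here are illustrative.
    """
    longest_possible = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink until one passes.
    for length in range(longest_possible, min_len - 1, -1):
        suffix = a[-length:]
        prefix = b[:length]
        mismatches = sum(1 for x, y in zip(suffix, prefix) if x != y)
        if mismatches <= max_error_rate * length:
            return length, mismatches
    return 0, 0


# Two toy reads whose true overlap contains one misread base.
read_a = "GATTACAGGCTA"
read_b = "AGGCTTCCGA"
tolerant = overlap_with_mismatches(read_a, read_b, min_len=5, max_error_rate=0.2)
exact = overlap_with_mismatches(read_a, read_b, min_len=5, max_error_rate=0.0)
print(tolerant, exact)
```

With an error tolerance of zero, the two toy reads appear unrelated; with a modest tolerance, the overlap is found. Real reads are hundreds of bases long, so a tolerance near the 1 percent error rate would be more realistic than the exaggerated value used for these short strings.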
Another issue, somewhat more subtle, was the problem of repeats. Human genomes include many sequences that repeat identically in many places. These repeats were a big headache for genome sequencers because when a contiguous region ended with a pattern that occurred in many places, they had no idea which puzzle piece should come next.
The way around the problem, it turned out, was to take a longer snippet of DNA—say, several thousand base pairs long—and sequence both ends. Even though you can’t sequence the middle, you can at least get a few hundred base pairs at each end and estimate how many base pairs lie between them. This gives you strings that link two jigsaw puzzle pieces together, including some that are on opposite sides of a gap or a repeat. These tethers create a scaffold to hang the “contigs” onto. Finally, the scaffolds could be wheeled into proper position by using the Human Genome Project’s high-level map of the genome.
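The tethering idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the assembler actually used: it assumes each tether has already been reduced to a triple (left contig, right contig, estimated gap), that the tethers form a single unbranched chain, and that the snippet length is known. The names `estimate_gap` and `build_scaffold` are hypothetical.

```python
def estimate_gap(snippet_length, left_tail, right_head):
    """Estimate the number of unsequenced bases between two contigs.

    snippet_length: approximate length of the long DNA snippet
    left_tail:  bases from the start of the first end-read to the end of
                the left contig
    right_head: bases from the start of the right contig to the end of
                the second end-read
    """
    return snippet_length - left_tail - right_head


def build_scaffold(links):
    """Order contigs into one chain using tethers.

    links: list of (left_contig, right_contig, estimated_gap) triples.
    Returns a list of (contig, gap_to_next); the last gap is None.
    Assumes (for illustration) a single linear chain with no branches.
    """
    successor = {left: (right, gap) for left, right, gap in links}
    has_predecessor = {right for _, right, _ in links}
    starts = [c for c in successor if c not in has_predecessor]
    if len(starts) != 1:
        raise ValueError("expected exactly one contig with no predecessor")

    scaffold = []
    current = starts[0]
    seen = set()
    while current in successor:
        if current in seen:
            raise ValueError("tethers form a cycle")
        seen.add(current)
        nxt, gap = successor[current]
        scaffold.append((current, gap))
        current = nxt
    scaffold.append((current, None))  # final contig has nothing after it
    return scaffold


# A 3,000-base snippet whose end-reads land in contigs A and B.
gap_ab = estimate_gap(3000, 400, 600)
print(build_scaffold([("B", "C", 500), ("A", "B", gap_ab)]))
```

Notice that the tether from A to B says nothing about what the missing bases are, only roughly how many there are. That is enough to place the contigs in order, even across a repeat.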
In the more than a decade since the completion of the human genome, the landscape has changed in at least two important ways. First, because a “reference” human genome is now available (in fact, many such reference genomes are), no human genome has to be sequenced from scratch. If you have a patient with cancer or with a genetic disease, you can zero in on the 0.1 percent of the genome that is different from the reference version and ignore the 99.9 percent that is the same. Thus the problem is not one of assembling the genome but of looking up sequences in the reference genome that are similar to (but slightly different from) your patient’s.
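The lookup problem can also be illustrated with a short Python sketch. It assumes a toy reference string, substitutions only (no inserted or deleted bases), and a simple index of short "seed" words; production tools use far more sophisticated indexes. The function names and the parameters `k` and `max_mismatches` are hypothetical.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every length-k word in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def locate_and_diff(read, reference, index, k, max_mismatches=2):
    """Find where a read best fits the reference and list the differences.

    Seeds of length k are taken from the read at offsets 0, k, 2k, ...
    Each exact seed hit proposes a placement, which is then checked base by
    base. Returns (start, diffs), where diffs is a list of
    (reference_position, reference_base, read_base), or None if no placement
    has at most `max_mismatches` differences.
    """
    best = None
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for hit in index.get(seed, []):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference):
                continue  # the read would hang off the end of the reference
            diffs = [
                (start + i, reference[start + i], base)
                for i, base in enumerate(read)
                if reference[start + i] != base
            ]
            if len(diffs) <= max_mismatches and (
                best is None or len(diffs) < len(best[1])
            ):
                best = (start, diffs)
    return best


reference = "ACGTACGGTCATTGCAAGCTTAGC"
index = build_index(reference, k=4)
patient_read = "GGTCACTGCAAG"  # differs from the reference at one base
print(locate_and_diff(patient_read, reference, index, k=4))
```

Using several seeds matters because any single seed might contain the very base that differs from the reference; as long as one seed is an exact copy, the read can still be placed and its differences reported.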