New Directions in Genomic Research
Janet D. Rowley, M.D.
University of Chicago Medical Center
Serving on the Scholar Selection Committee, and being part of an experiment in a challenging new concept for funding biomedical research, was a major thrill. The Selection Committee had heated discussions about the selection criteria and about whether it should follow the guideline that half of the awards go to M.D.s and half to Ph.D.s, or try some other combination.
The guiding hand of Purnell Choppin was critical, even though his name has not been mentioned here. He was the Chair of the Selection Committee when it first began, assisted initially by Philip Leder and then by David Kipnis. It was Purnell's wisdom that kept things on track. The Markey Scholars program has been very successful, which is a tribute to the strength of the initial guidelines, to the wisdom of the Selection Committee, and to the initiative of the scholars themselves.
There are some very fundamental questions emerging from genome projects. For example, Eric Green’s poster highlighted the kinds of answers that can be obtained now, and in the future, from comparative genomics. I am going to focus on a particular question related to genome research, one that is being raised now by RNA transcripts that we can detect experimentally and map to genomic DNA but that are not seen in the EST database or in the UniGene clusters. I will raise the question in the context of human leukemia because that is the area in which I have done my research.
The history of the study of chromosome translocations in cancer began in 1960 with the work of Peter Nowell and David Hungerford, who identified the Philadelphia Chromosome in Chronic Myelogenous Leukemia (CML). This was followed by the discovery in 1972 that the Philadelphia Chromosome was really a translocation of chromosomes 9 and 22. In 1984 the 9;22 translocation was cloned by Nora Heisterkamp and John Groffen, who identified the Abelson (ABL) gene on chromosome 9 and the BCR gene on chromosome 22. Subsequent research showed that the translocation creates a BCR-ABL fusion gene. The final advance, in 1998, was the use of Gleevec, which has turned out to be a miraculous drug for the treatment of many patients with CML. Although the chromosome abnormality was discovered in 1960, the discovery of specific treatment did not occur until 1998, almost a 40-year gap.
We examined leukemic cells from patients with acute myelogenous leukemia. Each patient had a specific chromosome abnormality, namely one of several different translocations, and each translocation is associated with a unique morphology. My assumption has always been that this unique morphology is in fact associated with a unique pattern of gene expression, and the challenge has been to figure out what these patterns are. All the breakpoints have been cloned. At the present time, except for the 15;17 translocation, we do not have genotype-specific therapy for any of these extremely common translocations.
The challenge is how to develop the optimal treatment. This is important from the clinical standpoint because the different chromosome abnormalities are associated with different survival rates, so they have prognostic implications. A 1998 study of the Medical Research Council Laboratory of Molecular Biology (MRC) that appeared in Blood showed that the survival of patients with the recurring translocations I illustrated previously is 60 to 70 percent at 5 years. Surviving patients in general tend to be younger. In contrast, patients with a complex karyotype, loss of chromosome 5 or 7, or translocations of chromosome 11 have a dismal survival. These are genotypically different types of leukemia that need different types of therapy.
This raises two issues: improved diagnosis and improved treatment. It is important to inform the physician of the patient’s likely prognosis, as this will influence treatment. However, we are not going to get to genotype-specific treatment until we have better information about the biology of these different leukemias.
To address this issue, Jim Downing’s laboratory at St. Jude has used microarray analyses in childhood acute lymphoblastic leukemia. There is a unique pattern of gene expression in each one of these chromosomally unique types of leukemia. Thus, even using only known genes, we can begin to develop diagnostic chips. I emphasize that these are known genes because those are all that are on the present Affymetrix microarrays.
We have taken a different approach: we are using Serial Analysis of Gene Expression (SAGE), which was developed by Ken Kinzler and his colleague Bert Vogelstein at Johns Hopkins Medical Center. We used the 3' end of the mRNA and reverse-transcribed it into cDNA. SAGE uses the NlaIII enzyme, which recognizes the four-base-pair restriction site CATG and cuts at that site; a second cut is made ten base pairs downstream, which releases a ten-base-pair SAGE tag. SAGE tags are ligated together; thus, in a 500-base-pair sequence you can sequence many, many SAGE tags. You can then compare each SAGE tag with the present expression map to see what your transcript really represents.
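As a rough illustration of the tag-extraction step just described, the following Python sketch pulls a ten-base tag from a cDNA sequence. It is a simplified model and not the laboratory's software: the function name is hypothetical, the example sequence is made up, and the choice of the CATG site closest to the 3' end is an assumption based on the standard SAGE protocol rather than something stated in the talk.

```python
ANCHOR_SITE = "CATG"   # NlaIII recognition site, as given in the text
TAG_LENGTH = 10        # tag length in base pairs, as given in the text


def extract_sage_tag(cdna):
    """Return the 10-base tag that follows the 3'-most CATG site.

    Assumption: the tag is taken from the CATG site closest to the
    3' end of the cDNA. Returns None when there is no CATG site or
    when fewer than 10 bases follow it.
    """
    seq = cdna.upper()
    site = seq.rfind(ANCHOR_SITE)          # last (3'-most) occurrence
    if site == -1:
        return None
    start = site + len(ANCHOR_SITE)
    tag = seq[start:start + TAG_LENGTH]
    return tag if len(tag) == TAG_LENGTH else None


# Made-up sequence with two CATG sites; only the second one is used.
example_cdna = "AAACATGTTTTTTTTTTGGCATGACGTACGTACGG"
example_tag = extract_sage_tag(example_cdna)
```

In this toy sequence the tag comes from the second CATG site, so `example_tag` holds the ten bases that follow it. Matching such tags against an expression database is a separate lookup step that is not modeled here.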
We have begun our analyses in normal hematopoietic cells. We have studied CD34+ cells and CD15+ cells from normal bone marrow; we identified 42,000 and almost 39,000 unique SAGE tags, respectively, from more than 100,000 individual SAGE tags. Note that 44 or 45 percent are novel tags; that is, they do not match sequences in any of the expression databases. Upon careful analysis, we have found that three quarters of the SAGE tags are present only once. Examining the distribution of novel tags, we found that among the very frequent SAGE tags there were a few unknown or novel tags in the CD34+ library and none in the CD15+ library. For the single-copy tags, slightly more than 50 percent are novel tags.
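The kind of bookkeeping behind these figures can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name and the toy tags are hypothetical, and `known_tags` stands in for whatever set of tags has a match in an expression database; it is not the analysis pipeline used for the libraries described here.

```python
from collections import Counter


def summarize_library(tags, known_tags):
    """Summarize one SAGE library.

    tags       -- every individual tag observed (repeats included)
    known_tags -- hypothetical set of tags that match an expression database
    """
    counts = Counter(tags)
    unique = list(counts)
    singletons = [t for t, n in counts.items() if n == 1]
    novel = [t for t in unique if t not in known_tags]
    novel_singletons = [t for t in singletons if t not in known_tags]

    def fraction(part, whole):
        return len(part) / len(whole) if whole else 0.0

    return {
        "total_tags": sum(counts.values()),
        "unique_tags": len(unique),
        # share of unique tags that were seen exactly once
        "singleton_fraction": fraction(singletons, unique),
        # share of unique tags with no database match
        "novel_fraction": fraction(novel, unique),
        # share of single-copy tags with no database match
        "novel_singleton_fraction": fraction(novel_singletons, singletons),
    }


# Toy library of five individual tags (four unique), for illustration only.
toy_tags = ["AAAAAAAAAA", "AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG", "TTTTTTTTTT"]
toy_known = {"AAAAAAAAAA", "CCCCCCCCCC"}
toy_summary = summarize_library(toy_tags, toy_known)
```

Counting by unique tag rather than by individual tag is a choice made for this sketch; the talk does not say which denominator underlies each percentage.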
We analyzed CD15+ cells from the bone marrow of leukemia patients, obtained prior to treatment. We found SAGE tags in the leukemia cells that were never found in normal bone marrow. Out of just this subset of 10 SAGE tags, only two matched known genes. One matched multiple genes, and all the rest are novel sequences. As a consequence, none of these would be present on any microarray.
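The comparison between a leukemia library and a normal library amounts to a filter on tag counts. The sketch below is illustrative only: the function name, the `min_count` threshold, and the toy counts are all assumptions introduced here, not values or methods from the study.

```python
def leukemia_specific_tags(leukemia_counts, normal_counts, min_count=1):
    """Return tags seen in the leukemia library but never in the normal one.

    Both arguments map a tag to the number of times it was observed.
    min_count is a hypothetical threshold for ignoring very rare tags.
    """
    return {
        tag: n
        for tag, n in leukemia_counts.items()
        if n >= min_count and normal_counts.get(tag, 0) == 0
    }


# Made-up counts, for illustration only.
toy_leukemia = {"ACGTACGTAC": 12, "TTTTGGGGCC": 3, "GGGGCCCCAA": 1}
toy_normal = {"ACGTACGTAC": 9, "CCCCAAAATT": 4}
toy_specific = leukemia_specific_tags(toy_leukemia, toy_normal, min_count=2)
```

With these toy counts, only the tag that is absent from the normal library and seen at least twice in the leukemia library is kept. A real comparison would also need to account for sequencing depth and sampling error, which this sketch ignores.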
We plan to analyze samples from five patients with each translocation to identify genes that are uniquely over- or underexpressed in a particular type of chromosome abnormality, so that we can use them as a fingerprint for diagnosis. When we have identified the unique set of SAGE sequences, we will then use it to build a microarray diagnostic chip. We hope that developing a prototype of a diagnostic chip will be helpful to physicians. At the same time we will try to understand the role of these transcripts in leukemia. We hope to use this information to develop new therapies.
All of our SAGE tags are identified using the 3' end of the RNA reverse-transcribed into cDNA. We can now use RT-PCR to amplify from the SAGE tag to the 3' end of the gene, using the SAGE tag as the specific primer and a universal primer at the 3' end, resulting in sequences several hundred base pairs in length. Using the 3' sequence information, we can obtain full-length cDNA. We have analyzed 23 cDNAs at present.
In one example, we extended a novel SAGE tag to full-length cDNA and found that it matched chromosome 19, band q13. The predicted intron/exon boundaries are based on a BLAST search; none of the various strategies for identifying expressed sequences in the genome, such as Ensembl or Genscan, identified any expressed sequences.
We have been able to compare alternatively spliced genes in the two normal libraries. In a series of four genes, different numbers of SAGE tags representing alternatively spliced forms were identified in CD34+ and CD15+ cells. In some cases one form is expressed only in CD15+ cells and not in CD34+ cells. In another example, the same forms are expressed, but at different levels.
It has been an uphill battle trying to persuade people that these novel tags are biologically important. We were reassured by the recent paper in Science from Kapranov, in collaboration with Bob Strausberg at NCI, describing large-scale transcriptional activity in chromosomes 21 and 22. One of the figures in the paper illustrated the use of the chip that was developed for the DiGeorge critical region on chromosome 22. RNA was obtained from a number of different cell lines and then hybridized to this chip, which carries the complete genomic sequence for this particular region. There was substantial hybridization of the RNA from these cell lines to the genomic DNA, but there was only a single exon predicted in this area. The conclusion from the paper in Science was that there is possibly an order of magnitude more sites of transcription of mature cytoplasmic poly A+ RNA than can be accounted for by the current annotation of the sequence of the human genome. Our data, in which about half of the SAGE tags have no match, support this idea.
There are a number of possibilities as to why this occurs. Clearly, the transcripts may be derived from splicing variants of known genes; I have already shown you, using our own data, that this is true in some cases. Another possibility is that the transcripts are present at very low levels. We have emphasized this from our own data, and it was also a point made in the Science paper. The level of RT-PCR and nested RT-PCR that they had to use to detect some of these transcripts indicates that they are present at low levels. The present method of generating ESTs really militates against finding low-level RNA copies; this suggests that we need to change some of our strategies.
Are these transcripts non-coding RNA? There is now a great deal of interest in RNA that regulates the transcription of other genes. So maybe these transcripts are present in the cell, and maybe they are much more abundant than any of us might imagine, but they are there to regulate other RNA or other cellular processes that we are not aware of at the present time. However, I do think that some of the transcripts may be novel genes that are not identified using current algorithms. So the broader message, and this is just an illustration, is that there is a great deal more that we are going to learn by comparison, not only amongst genomes, but amongst these transcribed sequences. We need to try to understand what, if any, function they have in the cell. My own interest is how they are related to the development of leukemia in these particular patients.
I want to acknowledge the people in the laboratory who have been responsible for this aspect of the research. San Ming Wang is the leader of the group. Jianjun Chen is a very clever colleague who developed the GLGI technique (Generation of Longer cDNA fragments from SAGE tags for Gene Identification). Sanggyu Lee and Guolin Zhou are doing the leukemia libraries. Markus Muschen was a visitor from Germany; his interest is in B-cells, and so he is doing SAGE on B-cells. We are also indebted to Terry Clark, who undertook the challenge of transforming the enormous SAGE database into something intelligible.