Converting Data to Knowledge

Ultimately, the tremendous amount of information now being generated by biologists and deposited into databases is useful only if it can be applied to create knowledge. And, indeed, researchers are finding that the many databases now available are making it possible for them to do many things that they never could before.

DATA MINING

Perhaps the best-known technique is data mining. Because many data are now available in databases (including information on genetic sequences, protein structure and function, genetic mutations, and diseases) and because data are available not only on humans but also on many other species, scientists are finding it increasingly valuable to “mine” the databases for patterns or connected bits of information that can be assembled into a larger picture. By integrating details from various sources in this way, researchers can generate new knowledge from the data assembled in the databases.

Much of today's data mining is done by biologists who have discovered a new gene or protein and wish to figure out what it does, said Stanford's Douglas Brutlag, professor of biochemistry and medicine. At first, the researcher might know little more about the new find than its genetic sequence (for a gene) or its sequence of amino acids (for a protein), but often that is enough. By searching through databases to find similar genes or proteins whose functions have already been identified, the researcher might be able to determine the function of the new item or at least make a reasonable guess.

In the simplest cases, data mining might work like this: A genome scientist has a new, unidentified human gene in hand and proceeds to search through genome databases on other species (the mouse, the fruit fly, the worm Caenorhabditis elegans, and so on), looking for known genes with a similar genetic sequence. Different species share many of the same genes; although the sequence of a particular gene might vary from species to species (more for distantly related species than for closely related ones), it is generally feasible to pick out genes in different species that correspond to a particular gene of interest. If a database search turns up such correspondences, the researcher now has solid evidence about what the newly discovered gene might do.

In reality, the database analysis of genes and proteins has become far more sophisticated than that simple searching for “homologues,” or items with similar structures. For instance, Brutlag noted, researchers have developed databases of families of sequences in which each family consists of a group of genes or proteins that have a structure or function in common. When a new gene or protein is found, its discoverer can compare it not just one on one with other individual genes or proteins, but with entire families, looking for one in which it fits. This is a more powerful technique than one-to-one comparisons because it relies on general patterns instead of specific details. Just as an unusual-looking mutt can be identified as a dog even if it cannot be classified as a particular breed, a new protein can often be placed in a family of proteins even if it is not a homologue of any known protein.
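To illustrate why a family-level comparison can succeed where one-to-one matching is weak, the following Python sketch builds a simple position-specific scoring matrix (one common way to summarize a family's general pattern) from a handful of aligned sequences and scores new sequences against it. It is a minimal illustration and not Brutlag's method: the sequences are invented, the uniform background frequency and the pseudocount are simplifying assumptions, and the function names (`build_pssm`, `score`, `percent_identity`) are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_pssm(aligned_family, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumes equal-length, gap-free sequences and a uniform background
    frequency; both are simplifications made for this sketch.
    """
    length = len(aligned_family[0])
    if any(len(seq) != length for seq in aligned_family):
        raise ValueError("family sequences must all have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_family) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_family)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, sequence):
    """Sum the per-position log-odds scores; higher means a better fit."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must match the matrix length")
    return sum(column[aa] for column, aa in zip(pssm, sequence))


def percent_identity(a, b):
    """One-to-one comparison: percentage of positions with the same residue."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Invented toy family and queries (not real protein data).
family = ["GKSTL", "GKTTL", "GRSTV", "GKSSL"]
pssm = build_pssm(family)

for query in ("GRTSV", "WPDEH"):
    best = max(percent_identity(query, member) for member in family)
    print(query, "best identity: %.0f%%" % best, "family score: %.2f" % score(pssm, query))
```

In this toy case the first query matches no single family member at more than three of its five positions, yet it scores positively against the family because every residue it carries has been seen at that position in some member. The second query shares nothing with the family and scores negatively.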
Researchers have developed a series of databases that can be used to classify genes and proteins, each with a different technique for identifying relationships: sequence motifs, consensus sequences, position-specific scoring matrices, hidden Markov models, and more. “I can hardly keep up with the databases myself,” Brutlag said.

With these techniques, researchers can now usually determine what a newly discovered human gene or protein does on the basis of nothing more than the information available in databases, Brutlag said. About a year before the workshop, his group created a database of all known human proteins and their functions. Over the next year, each time a new human protein was analyzed, they analyzed it by using homologues and a technique developed in Brutlag's laboratory called eMATRICES. “Using both methods, we assigned biologic functions to almost 77% of the human proteins. More than three-fourths of new proteins could be characterized by a technician who never left his computer; although the ultimate test remains experimental verification, this method promises to speed up drug discovery, for example.”

INTERNATIONAL CONSORTIUM FOR BRAIN MAPPING

A different way of exploiting the information in biologic databases is demonstrated by the International Consortium for Brain Mapping. The consortium is developing a database that will provide physicians and neuroscientists with a description of the structure and function of the human brain that is far more accurate and complete than anything available today. The database will be a combination of brain atlas, visualization device, clinical aid, and research tool.

Mapping the human brain is complicated, and not simply because the brain is a complicated organ. The more important factor is that the brain varies from person to person. “Every brain is different in structure and probably more so in function,” said John Mazziotta, director of the Brain Mapping Division at the UCLA School of Medicine. Even identical twins have brains that look noticeably different and whose functions are localized in slightly different areas. The brains of unrelated people have even greater variation, and this makes it impossible to create a single, well-defined representation of the brain. Any representation must be probabilistic; that is, the representation will not describe exactly where each structure or function lies, but will instead provide a set of possible locations and the likelihood of each.

So instead of creating a single, sharply defined map laying out the various features of the human brain and coloring in the areas responsible for different functions, any brain-mapping project must find some way to capture and display the inherent fuzziness in where things lie in the brain. “That is very hard,” Mazziotta said.
“In fact, we don't yet have a good way to do it.” Nonetheless, the consortium has developed ways in which the natural variation from brain to brain can be captured and displayed, which make it possible to get a much clearer picture of what is normal in the human brain and what falls outside the normal range.

The desire to create a brain-mapping tool was motivated by two main factors, Mazziotta said. The first was the sense that the various researchers in the field of brain mapping were heading off in many directions and that no one was attempting to bring all the threads together to see what they were jointly producing. “As in a pointillist painting, all of us in the imaging field were working on our dot in isolation, trying to refine it and get it better. The concept was that if we worked together and pooled the data, we would have a composite image that would show the big picture and be much more than the sum of the individual points.”

One experiment in particular was a major factor behind the push to create the brain-mapping consortium, Mazziotta said. He and colleagues at UCLA and in London studied the brains of four subjects who were observing an object moving across their visual field. The researchers found that in each subject a particular small area of the brain became active during the experiment and the researchers could identify the area as being involved in the visual perception of motion. “The location was consistent across subjects, but we don't know how consistent, because there is such variance [in brain structure] between individuals. This is a big problem.” Without having a good idea of what constitutes normal variation in brain structure and function among individuals, the researchers had no way to judge the meaning of their results.

One of Mazziotta's collaborators, John Watson, combed through the literature in search of information on patients who lose their ability to detect motion in their visual fields. He eventually found an article describing such a patient; the patient had damage in exactly the part of the brain that the researchers had already zeroed in on. Watson also found a 1918 description of which parts of a newborn's brain are myelinated (that is, in which areas the neurons were sheathed with myelin, a fatty coating that improves the performance of nerve cells). Only a few primary parts of the brain are myelinated at birth, but Watson found that one of those sections correlated precisely with the part of the brain that seemed to detect motion in the visual field. “Newborn infants might want to know that something is coming at them really fast in their visual environment,” Mazziotta said, “so that area has to be ready to go at birth. This is speculation, but it makes sense.”

At the end of the process, the group of researchers had woven together evidence from a number of studies that this particular spot in the brain was responsible for detecting motion in the visual field.
“The only problem was that this was a library exercise,” Mazziotta concluded. “What it needs to be, and what we want it to be, is a digital database exercise, where the framework is the structure of human brain, so we can do an experiment, find this observation, and go deep into the data and find other features that are now very awkward to identify.”

The second motivating factor for developing the brain-atlas database, Mazziotta said, was the sheer amount of data generated by even the simplest experiments with the human brain. “A typical human male brain has a volume of about 1,500 cubic centimeters, and any given cell can express, at any time, 50,000-75,000 gene products. If you took the most crude sampling (1 cubic centimeter) that one could envision, that represents 75 million data points. If you scale it down to a cellular size, 10 micrometers [10 thousandths of a millimeter], that represents 75,000 trillion data points for one brain at one time. If you take that across the age range, from birth to 100 years, and do that for different populations, you get truly astronomical amounts of data, just for this one perspective on gene expression as a function of age and spatial resolution.” With the potential for so many data, it seemed important to establish a place that could deal with them effectively, integrating the various types of data and creating a representation of the brain that was as complete as possible.

The brain-mapping consortium contains sites around the world, including the United States, Japan, and Scandinavia. Ultimately, it will include data on 7,000 subjects, although data on only 500 have been collected so far. The data include not only a variety of brain images both structural and functional, but also histories of the subjects, demographic information, behavioral information from handedness to neuropsychology, and, for most of the subjects, DNA samples. Mazziotta said that the system makes it possible to study the relationships among genetics, behavior, brain structure, and brain function in a way that takes into account the variations in structure and function that occur among people.

Mazziotta offered three examples of how this sort of system can be put to work. In the first, researchers at UCLA looked at images of the brains of 14-year-olds and 20-year-olds and asked whether there were any differences, a question that, because of natural brain-to-brain variation, would be nearly impossible to answer by looking at one or two subjects at each age. “The prevailing wisdom was that there was not a lot of change in brain structure between those ages.” But by mapping the normal range of 14-year-old brains and the normal range of 20-year-old brains, the group showed that changes did indeed take place in the prefrontal cortex and the base of the forebrain.
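The kind of group comparison described above can be sketched as a region-by-region test of whether two age groups differ by more than their within-group spread. The Python sketch below uses Welch's t statistic for that purpose. This is an illustrative assumption, not the method the UCLA group used, and the region names, measurements, threshold, and function names are all invented for the example.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t statistic: difference in means scaled by its standard error."""
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    std_err = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / std_err


def regions_that_differ(group_a, group_b, threshold=3.0):
    """Return {region: t} for regions whose |t| reaches the threshold.

    The threshold is an arbitrary choice for this sketch; a real analysis
    would also correct for testing many regions at once.
    """
    flagged = {}
    for region in group_a:
        t = welch_t(group_a[region], group_b[region])
        if abs(t) >= threshold:
            flagged[region] = t
    return flagged


# Invented measurements per subject (arbitrary units), not real data.
age_14 = {
    "region_A": [2.9, 3.1, 3.0, 3.2, 2.8],
    "region_B": [2.5, 2.7, 2.4, 2.6, 2.8],
}
age_20 = {
    "region_A": [2.5, 2.6, 2.4, 2.7, 2.3],
    "region_B": [2.6, 2.4, 2.7, 2.5, 2.8],
}

for region, t in regions_that_differ(age_14, age_20).items():
    print("%s differs between groups (t = %.1f)" % (region, t))
```

With only one or two subjects per age the within-group variance could not be estimated at all, which is why the comparison needs a population at each age.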
A second study compared the brains of a population of patients who had early Alzheimer's disease, averaged in probabilistic space, with the brains of a population of patients in the later stages of the disease. It found that Alzheimer's disease causes changes in the gross structure of the brain, thinning the corpus callosum and causing the upper part of the parietal lobe to shrink. “This is an example of a disease demonstrated not in an individual but in a group,” Mazziotta said, “and it is useful clinically to evaluate different therapies.” One might, for example, perform a clinical trial in which one-third of the patients were given an experimental therapy, another third a conventional therapy, and the rest a placebo. At the end of the trial, the probabilistic brain-mapping technique would produce a measure of the changes that took place in the brains of the three groups of patients and offer an objective measure of how well the different therapies worked.

The final example was a diagnostic one. “Let's say that a 19-year-old woman has seizures that come from this part of the brain in the frontal lobe. If we do an MRI scan and look qualitatively at the individual slices, for that kind of patient it would typically be normal, given the normal variance of the structure of that part of the brain.” In other words, a physician could probably not see anything in the MRI that was clearly abnormal, because there is so much normal variation in that part of the brain. If, however, it were possible to use a computer to compare the patient's MRI with a probabilistic distribution calculated from 7,000 subjects, some parts of the brain might well be seen to lie outside the normal range for brains. “And if you could compare her brain with those of a well-matched population of other 19-year-old left-handed Asian women who smoke cigarettes, had 2 years of college, and had not read Gone With the Wind, you might find that there is an extra fold in the gyrus here, the cortex is a half-millimeter thicker, and so on.”

In short, because of the data that it is gathering on its subjects and the capability of isolating the brains of subjects with particular characteristics, the probabilistic brain atlas will allow physicians and researchers to say not only what is normal for the entire population but also what is normal for subgroups with specific traits. And that is something that would not be possible without harnessing the tremendous data-handling capabilities of modern biologic databases.
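The diagnostic idea above, comparing one patient against a well-matched subgroup and not only against the whole population, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the consortium's software: the `Subject` record, its fields, the single thickness measurement, the toy atlas values, and the cutoff of two standard deviations are all hypothetical.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Subject:
    age: int
    sex: str
    handedness: str
    thickness_mm: float  # hypothetical cortical thickness at one atlas location


def matched_subgroup(subjects, **traits):
    """Keep only the subjects whose attributes equal every requested trait."""
    return [s for s in subjects
            if all(getattr(s, name) == value for name, value in traits.items())]


def z_score(value, reference):
    """Distance of value from the reference mean, in sample standard deviations."""
    return (value - statistics.fmean(reference)) / statistics.stdev(reference)


# Invented toy atlas (not consortium data).
atlas = [
    Subject(19, "F", "left", 2.40), Subject(19, "F", "left", 2.60),
    Subject(19, "F", "left", 2.50), Subject(19, "F", "left", 2.45),
    Subject(19, "F", "left", 2.55), Subject(45, "M", "right", 2.00),
    Subject(60, "F", "right", 3.00), Subject(30, "M", "left", 2.20),
    Subject(70, "M", "right", 2.80), Subject(52, "F", "right", 3.20),
    Subject(25, "M", "right", 1.90),
]
patient_thickness = 3.0
THRESHOLD = 2.0  # assumption: |z| above this counts as outside the normal range

everyone = [s.thickness_mm for s in atlas]
peers = [s.thickness_mm
         for s in matched_subgroup(atlas, age=19, sex="F", handedness="left")]

z_all = z_score(patient_thickness, everyone)
z_peers = z_score(patient_thickness, peers)
print("vs. whole atlas:   z = %.2f, outside normal range: %s" % (z_all, abs(z_all) > THRESHOLD))
print("vs. matched peers: z = %.2f, outside normal range: %s" % (z_peers, abs(z_peers) > THRESHOLD))
```

With these invented numbers the patient's value falls inside the normal range of the whole atlas but well outside that of the matched subgroup, which is the effect Mazziotta describes. The more traits are matched, the fewer subjects remain in the subgroup, which is one reason a large atlas matters.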