To prevent errors, Overton commented, it is necessary first to know how and why they appear. Some errors are entry errors. The experimentalists who generate the data and enter them into a database can make mistakes in their entries, or curators who transfer data into the database from journal articles and other sources can reproduce them incorrectly. It is also possible for the original sources to contain errors that are incorporated into the database. Other errors are analysis errors. Much of what databases include is not original data from experiments, but information that is derived in some way from the original data, such as predictions of a protein's function on the basis of its structure. “The thing that is really going to get us,” Overton said, “is genome annotation, which is built on predictions. We have already seen people taking predictions and running with them in ways that perhaps they shouldn't. They start out with some piece of genomic sequence, and from that they predict a gene with some sort of ab initio gene-prediction program. Then they predict the protein that should be produced by that gene, and then they want to go on and predict the function of that predicted protein.” Errors can be introduced at any of those steps.
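The concern about stacked predictions can be made concrete with a little arithmetic. The sketch below is illustrative only: the per-step accuracies are invented numbers, not figures from Overton's talk, and it assumes that errors at each step are independent, which real prediction pipelines need not satisfy.

```python
from math import prod


def chain_accuracy(step_accuracies):
    """Probability that every step in a prediction chain is correct,
    under the (assumed) simplification that the steps err independently."""
    return prod(step_accuracies)


# Hypothetical accuracies for the three chained predictions described above.
steps = {
    "gene predicted from genomic sequence": 0.9,
    "protein predicted from gene": 0.9,
    "function predicted from protein": 0.9,
}

overall = chain_accuracy(steps.values())
print(f"Chance the final annotation is right: {overall:.3f}")
```

Under these made-up numbers, three steps that are each right nine times out of ten yield a final annotation that is right only about 73 percent of the time, which is why a prediction built on predictions deserves more caution than any single step suggests.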
Once an error has made it into a database, it can easily be propagated, not only within the original database but also into any number of other systems. “Computational analysis will propagate errors, as will transformation and integration of data from various public data resources,” Overton said. “People are starting to worry about this problem. Data can be introduced in one database and then just spread out, like some kind of virus, to all the other databases out there.” And as databases become more closely integrated, the problem will only get worse.
Because Overton's group is involved with database integration, taking information from a number of databases and combining it in useful ways, it has been forced to find ways to detect and fix as many errors as possible in the databases that it accesses. For example, it has developed a method for correcting errors in the data that it retrieves from GenBank, the central repository for gene sequences produced by researchers in the United States and around the world. “Using a rule base that included a set of syntactic rules written as grammar,” Overton said, “we went through all the GenBank entries for eukaryotic genes and came up with a compact representation of the syntactic rules that describe eukaryotic genes.” If a GenBank entry was not “grammatical” according to this set of syntactic rules, the system would recognize that there must be an error and often could fix it.
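The idea of checking entries against a grammar can be illustrated with a toy example. The sketch below is not Overton's rule base, which the text does not describe in detail. It assumes a deliberately simplified grammar in which a gene is an optional promoter, an optional 5' UTR, one or more exons separated by introns, and an optional 3' UTR. The feature names, the one-letter codes, and the single repair rule are all hypothetical.

```python
import re

# Hypothetical one-letter codes for the feature types in an entry.
FEATURE_CODES = {
    "promoter": "P",
    "5'UTR": "U",
    "exon": "E",
    "intron": "I",
    "3'UTR": "T",
}

# Toy grammar (an assumption, not the actual rule base):
#   gene := [promoter] [5'UTR] exon (intron exon)* [3'UTR]
GENE_GRAMMAR = re.compile(r"P?U?E(IE)*T?")


def is_grammatical(features):
    """Return True if the ordered feature list fits the toy gene grammar."""
    try:
        encoded = "".join(FEATURE_CODES[f] for f in features)
    except KeyError:
        return False  # an unknown feature type cannot be parsed
    return GENE_GRAMMAR.fullmatch(encoded) is not None


def insert_missing_introns(features):
    """One illustrative repair: put an intron between two adjacent exons."""
    repaired = []
    for feature in features:
        if repaired and repaired[-1] == "exon" and feature == "exon":
            repaired.append("intron")
        repaired.append(feature)
    return repaired


entry = ["5'UTR", "exon", "exon", "3'UTR"]  # hypothetical malformed entry
if not is_grammatical(entry):
    entry = insert_missing_introns(entry)
print(entry, is_grammatical(entry))
```

In this sketch, an entry that fails to parse is flagged as containing an error, and a narrow, well-understood class of failures is repaired automatically. A real system would need a far richer grammar and would have to decide which repairs are safe to apply without human review.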