database and become quite accessible. Otherwise, it really depends on someone's personal initiative, either to get funding to create a database through one of the government funding sources or to find some support, so that the information can be put into a form that is actually accessible.
I will describe two different scenarios for the pharmaceutical industry. One is chemical information (which is where I have spent much of my career); the other is what I call the bioinformatics movement. They differ in how the data have been handled and collected. Historically, chemical informatics has been a very commercialized activity. Beilstein, Derwent, and Chemical Abstracts Service (CAS), three of the larger data repositories, all started out building printed repositories and ultimately turned into electronic sources. They are highly commercialized, profitable activities that have served a truly critical role in the preservation and availability of chemical information.
The Beilstein Institute, founded in 1881, first published its handbook with 1,500 compounds. The final version was printed in 1998, with the oldest references going back to 1771. Interestingly, the handbook was converted to electronic form with some assistance from the German government. Today it holds 9 million compounds, along with a lot of information and data, and is distributed on a commercial basis by a Reed Elsevier company.
Derwent, which is today a Thomson company, grew out of ideas that Monty Hyams had in 1948. He was trying to make some sense out of the patent literature and began writing some simple abstracts about what was being published at the time. This information became useful. People started getting more interested, and it turned into a commercial operation. Today, Derwent's World Patents Index is global in scope, covering 40 different patent-issuing authorities and detailing over 8 million separate inventions. It is a very large repository. If you work in the area of pharmaceutical research and development, you have to go to the Derwent database to understand what the patents are about.
CAS has a similar history. It was founded in 1907 with the goal of monitoring and abstracting the world's chemical-related literature. Today, with the Internet and all of the information that we have to deal with, it is mind-boggling to think that in 1907 this operation was formed because there was too much information to handle. CAS is a subsidiary of the American Chemical Society. There are 20 million organic and inorganic compounds registered at CAS; 21 million biological sequences; and almost 42 million separate and unique chemical entities registered in its system, all of them complete with names and references to the published literature, allowing scientists to find more information about these items. It is an incredible amount of information.
My observation is that chemical informatics has been quite commercialized and brings in quite a bit of money. In the area of pharmaceuticals, this information has all been organized and put out there to be used because of its high commercial interest. These three companies and others look at scientific journals, books, patents, conferences, and dissertations; they do the work, and they extract a significant fee for it. The data are organized and then made available to the community, but at a price. These data are not free, yet the reason is not that the underlying information is restricted, because in many cases it is freely available. Rather, the fact that these companies have organized the information and brought it together in a searchable (i.e., more useful) fashion gives these databases a very high value. This makes life much easier for the chemical information community.
There is clearly value here, given that much of the original research was publicly funded. There has also been a significant cost in creating these data sources. Although the pricing is fairly significant for these groups, they certainly have provided a service to the community. Frankly, it is doubtful whether these operations could have continued to exist in electronic form without the current funding support.
There are certainly hundreds, maybe thousands, of other databases and data repositories in chemistry that do not get picked up by these services. Are these less important to us as a scientific society? There are certainly cost barriers that prevent all the good data from being collected and organized in a reasonable fashion. Public funding has not been available to make these data generally available. I think that funding tends to go toward collecting more data, while funding for building repositories of these data and making them available has not been there. My personal opinion is that funding authorities should consider, from the start of these projects, the utility of the data they pay to have generated.
An interesting contrast to this illustration is in the bioinformatics arena, which in some ways is the antithesis of the chemical data franchise. Here, largely publicly funded projects have been formed by different highly motivated groups to put together what have become literally hundreds of sequence databases. A few of these are