use different terminology, and they are not really built to be compatible. So, how do you put them all together and start to do federated searches and queries?
The approach that we have chosen is to use available ontologies and, specifically, a technological tool called RDF, the Resource Description Framework, to traverse these data sources on the Web. To provide an analogy: on the Web, a link is basically just a connection between two pages. However, you can also conceive of URLs as definitions of things, and then the links can carry meaning. Rather than just saying that this page is linked to that page, we can say that this receptor is located in that cell membrane, and use these terms to connect different data sources. That is to say, the Web can be used to link not only pages but also concepts, and by doing so, to merge definitions and knowledge.
So, for example, each of these concepts could exist as a separate link or a separate resource on the Internet. Then, when you want to make that connection, you just link other things to these networks of definitions. When you study the genes related to Alzheimer’s, the networks of biological pathways are extremely complicated and the only way that you can begin to elucidate them using computational approaches is to build the skeleton of meaning on which the flesh of knowledge can be attached.
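The idea of linking data sources through shared identifiers can be sketched in a few lines, without any RDF tooling. The example below represents each source as a set of (subject, predicate, object) triples, in the spirit of RDF; all of the URIs and names (EGFR, locatedIn, and so on) are hypothetical illustrations, not terms from any real vocabulary.

```python
# Hypothetical shared vocabulary prefix; in real linked data this would be
# a published ontology namespace.
BIO = "http://example.org/bio/"

# Source A: a protein database states where a receptor is located.
source_a = {
    (BIO + "EGFR", BIO + "locatedIn", BIO + "cellMembrane"),
}

# Source B: a disease database links the same receptor to a pathway,
# and a different gene to a disease.
source_b = {
    (BIO + "EGFR", BIO + "participatesIn", BIO + "signalingPathway"),
    (BIO + "APP",  BIO + "associatedWith", BIO + "alzheimersDisease"),
}

# Because both sources use the same URI for EGFR, merging them is just
# set union -- no schema mapping is needed.
merged = source_a | source_b

# A simple "query": every statement whose subject is EGFR.
egfr_facts = {(p, o) for (s, p, o) in merged if s == BIO + "EGFR"}
for p, o in sorted(egfr_facts):
    print(p.rsplit("/", 1)[-1], "->", o.rsplit("/", 1)[-1])
```

The point of the sketch is the merge step: when independent sources agree on identifiers, combining them is trivial, and the "skeleton of meaning" is exactly that shared vocabulary. In practice this is what RDF stores and SPARQL engines provide at scale.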
The challenge is that, right now, we are limited to using sources that are open access or in the public domain; many journals with primary sources are not available for text mining. There are consequently many holes in our ability to do this kind of research because of the closed status of some journals. Many databases, including government databases, are built upon restrictive licensing models that make data integration impractical or impossible. Thus, the challenge is twofold: how do we reformat what we already have stored in databases, journals, and other sources of knowledge into a digitally networked commons that we can connect together, and how do we also get the materials related to these digital objects into the emerging research Web so that they can be accessible to those who need them most?
To explore the question of what kind of data-sharing protocol enables the kind of research we have been discussing here, I want to first describe three “licensing” regimes. The first broad category is data in the public domain. There are no restrictions on their use, no restrictions on their distribution, and if there is any copyright, it is waived. Such data are sometimes called the “functional public domain,” because they can be treated as public domain even if their formal legal status is different. The good news is that there are fields of research where the functional public domain is the norm. The human genome research community is one example, which evolved from the very deliberate consensus formed by the Bermuda Principles.
The second regime is community licenses, such as open source licenses like the GPL, and the Creative Commons licenses. What they have in common is that they are standard licenses that everyone within the relevant community uses. They sometimes offer a range of different rights, with some rights reserved. So this is not the public domain, because there are some restrictions, but, generally, the information or the resources are available to everybody under the same terms.
The third regime is private licenses, and, by that, I mean custom agreements that are specific to particular institutions or providers. These, of course, are the norm for commercially available sources of data, and they vary widely in terms of the rights provided to the user. However, in general, they are fairly restrictive about redistribution or sharing of data, because of the need to protect a revenue stream.