Electronic Scientific, Technical, and Medical Journal Publishing and its Implications: Proceedings of a Symposium

7. WHAT CONSTITUTES A PUBLICATION IN THE DIGITAL ENVIRONMENT?

INTRODUCTORY REMARKS

Clifford Lynch, Coalition for Networked Information

In this session, we start taking a look at the issue of what constitutes a publication in the evolving digital world, with specific attention to the frameworks of science and journals. Although we are focusing on science, there are many developments in scholarship broadly, beyond science, technology, and medicine. The humanities, for example, have been very active and creative in exploring the use of the new media.22 As we look at this question of what constitutes a publication and how the character of publications changes, we switch our focus from the questions of publication as process that the previous panel session discussed to how we author and bind together pieces of authorship into structures like journals. We can approach this from two perspectives. One is the individual author's point of view. The practice of science is changing, as is well documented in the National Science Foundation (NSF) cyberinfrastructure report, for example. It is becoming much more data intensive. Simulations are becoming a more important part of some scientific practices. We are seeing the development of community databases that structure and disseminate knowledge alongside traditional publishing. From that perspective, one might usefully ask how people author articles, given the new opportunities technology is making possible. It is clear that digital articles can be much more than electronic versions of their paper counterparts. In practice, however, articles are often still printed for serious reading and engagement, and journals are using all of this technology around an authorship model that remains strongly rooted in paper. There are trivial extensions that could be made, but there are less trivial extensions, too.
We can, of course, add multimedia. Moreover, not all of our readers are going to be human. Software programs are going to read the things we write, and they are not always very bright, so we have to write differently for them. So this is one perspective, the author's, from which we can look at the question. The other perspective is that of the journal, the aggregation of these articles, recognizing that the ecology in which journals exist has changed radically. It used to be that the other items in that ecology were other print journals and secondary abstracting and indexing databases. Now it has become very complicated. There are all kinds of data repositories. There are live linkages among journals. There is an interweaving of data and authored argument that is becoming very complex. These are the kinds of issues that we will have an opportunity to explore in this session.

22 See, e.g., the Roundtable on Computing and the Humanities, jointly sponsored by the National Research Council, the Coalition for Networked Information, the National Initiative for a Networked Cultural Heritage, and the Two Ravens Institute, which can be found at http://www7.nationalacademies.org/CSTB/pub_humanities.html.
THE SIGNAL TRANSDUCTION KNOWLEDGE ENVIRONMENT

Monica Bradford, Science

This presentation about Science's Signal Transduction Knowledge Environment (STKE)23 summarizes its history and current status, the specific issues related to defining what constitutes a publication in the digital environment, and how Science has used its standing as a more traditional publication to help move this less traditional project forward. The project was started in 1997. At the time, the staff at Science thought it was a bold experiment. It was formed jointly by the American Association for the Advancement of Science (AAAS), Stanford University Library, and Island Press. The reason the three groups came together was that Stanford University Library was very interested in making sure that the not-for-profit publishers and the smaller publishers were able to be players online in the digital environment; the library had also started up HighWire Press. AAAS had recently launched its electronic version of Science and was excited about the possibility of working with Stanford on new technology ideas. Island Press is a small environmental publisher, primarily of books, but it had ties to the Pew Charitable Trusts, which was interested in funding some kind of online publishing experiment, and Island Press was helping to determine what the right area for that might be. They were particularly interested in the intersection of science and policy. AAAS, for its part, was so excited about creating Science Online and all the potential that the online environment might offer that it was eager to try something new. The goal of the knowledge environment was to move beyond the electronic full-text journal model.
The idea was to provide researchers with an online environment that linked together all the different kinds of information they use, not just their journals, so that they could move more easily among those resources, decreasing the time required for gathering information and thereby giving them much more time for valuable research and increasing their productivity. Why was signal transduction the first area chosen? The funders of the project wanted an area that would have a chance to become self-sustaining. Science at the intersection with policy was therefore quickly eliminated, particularly because so much of the literature in that area is gray literature that had not been digitized and whose prospects were unclear. The project moved instead to an area where AAAS and Science, in particular, were very comfortable. Signal transduction is a very interdisciplinary field within the life sciences. Cell biologists, molecular biologists, developmental biologists, neuroscientists, structural biologists, immunologists, and microbiologists all come to a point in their research when they need to know something about signal transduction. There also were some business reasons, not necessarily cost or revenue, but the kinds of factors a publisher typically looks for. There seemed to be a broad potential user base, with both industry and academia very interested in the topic. There was no primary journal at the time, with the information spread across many journals, nor was there a major society. Other aspects of this area of research, and the kind of information in it, were among the most important reasons the partners wanted to pursue it. The field of signal transduction is very complex and its information is widely distributed. It was important to be able to create links between these discrete pieces of information to help push knowledge forward in the area.
It appeared that making these links offered the potential for substantial gains in practical and basic understanding of biological regulatory systems. In short, it was an ideal place for AAAS to begin such an experiment, because it would reach across disciplines, and after all, that is what AAAS is all about. The information in signal transduction had outgrown what could be done in print, and it really called out for a new way of communicating. One thing the STKE partners were somewhat surprised to discover was that they had to answer not only the question of why signal transduction is important but, for business reasons, the question of what signal transduction is. Although its meaning was very clear to researchers, in the business world the partners had to explain why a library should care about a knowledge environment around this topic—that many of its researchers, schools, and departments would be interested in it. So an education effort had to accompany the marketing process.

23 For additional information about Science's STKE, see http://stke.sciencemag.org/.
The STKE was the first of the knowledge environments hosted at HighWire Press; there are now five. AAAS considers the STKE part of the suite of products referred to as Science Online. One of the areas Science Online is moving into next is the biology of aging, because it has some of the same characteristics used in choosing signal transduction: it is interdisciplinary, does not have a home base, and scientists need to talk across fields. The STKE has both traditional and nontraditional types of publications and functions. In addition to providing access to the typical journal literature, Science tried to create a community environment, so there is an area with tools that relate to that community, resources that scientists would use, and then the most interesting part, the various external connections. On the macro level, what is a publication? Parts of the STKE are very familiar; they look like what you would think of as a publication, although it is really a collection of many items that each used to be considered a separate publication. The STKE has connected them all together in this environment, which is itself considered a publication. It has its own International Standard Serial Number (ISSN); it is indexed by Medline; and the updated reviews, protocols, and perspectives all have their own digital object identifiers. In many ways, that part of it is very familiar and very similar to a traditional publication, but it has been combined with these other aspects to make a larger publication. The scientific community will see more of this in the future, as publishers move away from simply putting the content of print journals online and try to pull diverse sources together.
The STKE virtual journal has full-text access to articles about signal transduction from 45 different journals, including some represented at this symposium; these are referred to as the participating publishers. When their journals go online at HighWire Press, the STKE uses an algorithm that scans the content and selects the materials related to signaling, which subscribers to the STKE can then access. The community-related functions of the STKE include letters, discussion forums, and directories. That has been the hardest part to develop. The initial presumption was that this would be perhaps the most exciting aspect, providing people with a place to talk across fields, to each other, and with scientists they do not normally see at their meetings; in fact, it has been the most difficult part to really get going. Other STKE functions include learning services, reviews of the current literature with highlights of what is important, custom electronic tables of contents, and personalization—the ability to put things in personalized folders, save searches, and relate resources. At the macro level, the STKE has been accepted as a publication. It has 45,000 unique computers visiting each month, with about a quarter of those coming back more than once a month. There are 30,000 requests for full texts of original articles, 5,000 PDF reprint downloads, and 10,000 connections map pages viewed. Within this larger publication there are parts that in and of themselves represent a new kind of publishing and a new kind of publication. One of these is the connections map, which is basically a Sybase relational database of information about signaling pathways. The connections map is pathway-centric, in contrast to some other efforts that are molecule-centric. In the long run, this will be an interesting way of synthesizing information and will have a lot of use.
Each entry in the database is created by an authority, who is solicited by the editors at the STKE, and on top of the database there is a graphical user interface. To populate the database, HighWire worked with the STKE to create software that an authority downloads in order to enter data into the system. As time goes on, the real value of this will lie in adding bioinformatic tools on top of the database to enhance the ability to find new and interesting information. The authorities who work with the STKE on this are willing to do so because Science had existing relationships with them as authors. They trusted the STKE to put out quality content and to try to make sure that this kind of effort is recognized; they put this kind of effort into creating the STKE because of the reputation and reliability of Science. During this time of experimenting, that is really important. At this point, the question is: Is this effort moving beyond what would just be called a database entry to a true publication? For each of these pathways, the authors are supplying abstracts, and metadata are being created at the pathway level. The STKE plans to submit those to PubMed and see what happens when it submits the rest of its information.
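The pathway-centric design described above can be illustrated with a minimal relational sketch. All table and column names here are hypothetical, chosen only to mirror the concepts in the text (pathways, authorities, components, update stamps); the actual STKE Sybase schema is not described in this chapter.

```python
import sqlite3

# Minimal pathway-centric relational sketch (hypothetical schema,
# not the actual STKE Sybase design).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pathway (
    pathway_id INTEGER PRIMARY KEY,
    name TEXT,
    authority TEXT,   -- the expert responsible for the entry
    updated TEXT      -- the date/time stamp shown with each component
);
CREATE TABLE component (
    component_id INTEGER PRIMARY KEY,
    pathway_id INTEGER REFERENCES pathway(pathway_id),
    molecule TEXT,
    role TEXT         -- e.g., receptor, kinase, target
);
""")
db.execute("INSERT INTO pathway VALUES (1, 'Example signaling pathway', "
           "'A. Authority', '2003-05-01')")
db.executemany("INSERT INTO component VALUES (?, 1, ?, ?)",
               [(1, 'ReceptorX', 'receptor'), (2, 'KinaseY', 'kinase')])

# A pathway-centric query: everything recorded about one pathway.
rows = db.execute("""
    SELECT p.name, c.molecule, c.role
    FROM pathway p JOIN component c ON c.pathway_id = p.pathway_id
    WHERE p.pathway_id = 1
""").fetchall()
print(rows)
```

Organizing the data around the pathway, rather than the molecule, makes the "everything about this pathway" query the cheap one, which matches the review workflow described below.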
Each pathway has an identified author, who is an authority and who can work with a group; other people may be brought in because some of these pathways are so complex that more than one authority is needed. They synthesize and vet a vast and sometimes conflicting literature. The STKE edits the data in the database before they are released and approved, though this process is far from perfect and has proved more difficult than anticipated. The pathways are reviewed each year, just before Science publishes its special issue that focuses on them. The review process is very hard, because one has to go through a lot of information, and it is all networked. One of the things the STKE needs to develop is tools for the reviewers—ways to know systematically what they looked at, how it is connected, and at what point they saw it, and to give them some sort of printed output to examine, because navigating through the database is difficult. At the bottom of each information component is the date and time it was updated. The "Viewpoint" in Science then provides a snapshot of the state of knowledge about a pathway at the time of review. The pathways can be reviewed continually, however, if there are significant changes to the pathway or changes in the authorities involved. In summary, the STKE is trying to create new knowledge and to explore the network properties of signaling systems in ways that cannot be done in the print product. One needs to look for interpathway connections, to query the pathways and look for networking properties, and then to use this tool for modeling and for integration with other bioinformatic resources. There are still many inputs that need work. The STKE project plans to look for more funding to do some of these things.
The goal is to make it easier for scientists to discover new knowledge and to be more productive. Some lessons have already been learned in the STKE experiment. The definition of a publication is evolving. The researchers understand how this information can work for them, and they are excited. That is the best part of the experiment and what makes it fun, because they say what they need. Efforts at standardizing data input and controlled vocabularies have been difficult. NIH tried to help with this, but the standards are not evolving; someone will have to just start doing it, and from there the standards will start to take hold. The reward system is not yet in place for the people doing this kind of authoring. For that reason, the STKE felt it had to link this work to the traditional mode of publication during the initial transition phase, so that the contributors get credit. Finally, the "Viewpoints" in Science do get cited. They are in PubMed and they draw attention to the site. Hopefully, over time, the pathways themselves will be cited directly, not just by way of Science.

PUBLISHING LARGE DATA SETS IN ASTRONOMY—THE VIRTUAL OBSERVATORY

Alex Szalay, The Johns Hopkins University

Why is the publishing of large data sets an issue? Scientific data are increasing exponentially, not just in astronomy, but in science generally. Astronomers currently have a few hundred terabytes of data derived from observations made in multiple wavelengths. Very soon, however, they will start projects that will reobserve the sky every four nights to look for variable objects in the temporal dimension. At that point, the data will increase to a few petabytes per year. Astronomers are searching for new kinds of objects, for interesting ones that are atypical. They are also studying the properties of the typical objects, and for every object they detect in the sky they derive 400 different attributes. The bottom line is that the volume of data doubles every one to one-and-a-half years.
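The doubling rate quoted above compounds quickly. A small sketch makes the point, assuming a 1.5-year doubling time and a starting archive of 300 terabytes (both are rough figures from the talk, not measurements):

```python
# Illustrative projection of exponential data growth, assuming the
# ~1.5-year doubling time quoted above and a 300 TB starting point.
# Both numbers are rough figures from the talk, not measurements.

def projected_volume_tb(start_tb, years, doubling_years=1.5):
    """Data volume after `years`, doubling every `doubling_years`."""
    return start_tb * 2 ** (years / doubling_years)

start = 300  # terabytes today
for year in (0, 3, 6, 9):
    print(f"year {year}: {projected_volume_tb(start, year):,.0f} TB")
# Within about three years the archive passes a petabyte
# (300 TB -> 1,200 TB), and by year 9 it is nearing 20 PB.
```

Under these assumptions, a linear plan for storage, bandwidth, or curation staff is obsolete within a single project lifetime, which is the problem the rest of this talk addresses.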
Astronomy, like most other fields of science, operates under a flat budget, so astronomers spend as much money as they get from their funding sources to build new observational tools and computer equipment to get more data. The data in astronomy typically become public after about one year; there is an initial proprietary period of use for the people who build the observing instruments and who schedule observing time, but after that the data are released, and everybody can see pretty much the same publicly available data. How will astronomers deal with this? The transfer of a terabyte of data at current Internet speeds takes about two days, but when we get to a petabyte, it will take years to do anything with the data in the
networked environment. This is already on the horizon, and, by the way, a petabyte currently would occupy 10,000 disks. Consequently, astronomers need to create databases that are searchable, so they do not have to look at all the data at once and can use indices to find what they need; they also can do searches in parallel that way. The driving force for astronomers is to make discoveries. They are trying to go deeper into the universe and also to explore a broader range of wavelengths. There is an interesting parallel to Metcalfe's law. Robert Metcalfe worked at Xerox PARC and is one of the inventors of the Ethernet. He postulated that the utility of a computer network grows not as the number of nodes on the network, but as the number of possible connections we can make. The same thing is true of data sets. If we have N different archives, we can make on the order of N squared connections between those data sets, so there are more things to discover. That is the utility of federating data sets. The current sky surveys have really proven this: astronomers have discovered many new objects by combining multiple surveys in different colors. There also is increasing reuse of scientific data, with people using each other's data for purposes for which they were not originally intended. Data publishing in this exponential world is also changing dramatically. All these big astronomy projects are undertaken by collaborations of 60 to 100 people, who work for 5 or 6 years to build an instrument that collects the data, and who then operate it for at least that long, because otherwise it would not be worth investing that much of their time. Once they have the data, they keep using them and eventually publish them. They put the data in a database and make them accessible on their Web sites.
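The transfer-time and federation arithmetic above can be checked with a quick back-of-the-envelope calculation. The sustained transfer rate is an assumption chosen to match the "one terabyte in about two days" figure quoted in the talk; it is not a measured value.

```python
# Back-of-the-envelope figures for the arithmetic quoted above.
TB = 1e12  # bytes
PB = 1e15

# Assume a sustained wide-area rate of ~7.5 MB/s, which moves a
# terabyte in roughly a day and a half, matching the talk's figure.
rate_bytes_per_s = 7.5e6

def transfer_days(size_bytes, rate=rate_bytes_per_s):
    """Days needed to move `size_bytes` at a sustained `rate` (bytes/s)."""
    return size_bytes / rate / 86400

print(f"1 TB: {transfer_days(TB):.1f} days")            # 1 TB: 1.5 days
print(f"1 PB: {transfer_days(PB) / 365:.1f} years")     # 1 PB: 4.2 years

# Metcalfe-style utility of federation: N archives allow on the
# order of N^2 pairwise cross-matches, so the number of potential
# discoveries grows much faster than the number of archives.
def pairwise_connections(n):
    return n * (n - 1) // 2

print(pairwise_connections(10))  # 45 pairs from only 10 archives
```

The years-per-petabyte result is why the talk concludes that the analysis must move to the data rather than the data to the analysis.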
When the project ends, the people go on to other projects, and at that point they are ready to hand over the data to some big national archive or centralized storage facility. Even after that, the scientists are still interacting with all the data. Why are the roles changing? The exponential growth makes a fundamental difference. First of all, because these projects last six years or more, at any one time the national data facilities will hold only about 12 percent of the data that have been collected; everything else still remains with the groups of scientists. This is very different from the previous linear progression of data collection. There is also more responsibility placed on the projects. The astronomers and other scientists are learning how to become publishers and curators, because they do not have a choice if they want to make their data public. This is Professor Szalay's situation. He was trained as a particle physicist, turned into a theoretical cosmologist, and then became an observational cosmologist, because that was where the new breakthroughs were occurring. Now he worries about data publishing and curation, because this is necessary to do the science. He spends an increasing fraction of his budget on software, in many cases reinventing the wheel; more standards and more templates would help with this. Luckily, there are many emerging concepts and developments that help. One is that it is becoming easier to standardize distributed data, for example, using XML. Web services are emerging, supported on many platforms, that make it very easy to exchange data, even complex data. There is also a major trend toward making computing more distributed. This is called grid computing: the computing is distributed across the Internet on multiple sites, and people can borrow time on central processing units (CPUs) whenever they need it.
The people who talk about grid computing, however, tend to think only about harvesting the CPUs; they do not think about the hundreds of terabytes or possibly petabytes of data behind them, because we currently lack the bandwidth to move the data to the computers. If you need huge amounts of data and every byte needs only a little bit of computing, it is not efficient to move the data to the computing site; it is more efficient to move the computers and the analysis to where the data are. Essentially, we now have an intercommunication mechanism, and a lot of what is done in grid computing also applies to this distributed work, which is growing exponentially. It is also getting exponentially more complex, and the threshold for starting a new project is getting lower and lower as hardware gets cheaper. Professor Szalay got into this through the Sloan Digital Sky Survey, sometimes called the Cosmic Genome Project. It is one of the first big astronomy projects set up in this mode—to map the sky not just to do one thing, but to try to create the ultimate map. Two surveys are being done: one is taking images in five colors and the other is trying to measure distances. There is quite a lot of software involved. When this project started, in 1992, 40 terabytes of raw data looked like an enormous amount; today it does not seem like such a big deal. In one night the
survey takes a 24,000 by 1 million pixel image of the sky in five colors. When it is finished in 2005, there will be approximately 2.5 terapixels of data. The project is also trying to measure distances so that, through the expansion of the universe, astronomers can figure out the distances of the galaxies. For about 1 percent of the objects they are trying to get very detailed information. The data flow is such that they take the data on the mountaintop in New Mexico and ship the tapes via Federal Express to Fermilab in Batavia, Illinois, because the networks are simply too slow to move that much data around. They then process the data and put them into an SQL database, which currently holds about 100 million objects and will hold about 300 million when the project is completed. Professor Szalay and Jim Gray, of Microsoft Research, have begun a project to make these complex data understandable and usable by high school students. They opened a Web site in 2001; after two years they had about 12 million page hits, and they now get about 1 million page hits per month. It is used by high school students who are learning astronomy and the process of scientific discovery, using up-to-date data—data as good as any astronomer can get today—and also by professional astronomers, who like the ease of use. The project has identified other issues that they are just starting to appreciate. After they released the first set of data, which was only about 100 gigabytes, they realized that once they put out the data, people started writing scientific papers based on them. They are putting out the next data release in the summer of 2003, close to a terabyte, but they still cannot throw away the old data set, because there are papers based on it.
Someone may want to go back and verify the papers, so whatever they put out once is like a publication; they cannot take it back. They have a yearly release of the data on the Web site, which is like a different edition of a book, except that the book is doubling every year. The project also brings up other interesting aspects. Once the project goes offline, the databases will be its only legacy. Most of the technical information is exchanged in e-mail communication, so they also have to capture, for example, all the e-mail archives of the project. They should not delete them but rather add them to the database, because this will be the only way that somebody years later can figure out what was done with some subject or a certain part of the sky. As Jim Gray says, astronomy data are special because they are entirely worthless. He means this in a complimentary and good sense: he works for Microsoft, so he does not have to sign disclosure agreements and bring in lawyers if he wants to play with some astronomy data. There are no proprietary or privacy concerns; you can take the data and give them to somebody else. They are great for experimenting with algorithms. They are real and well documented, with spatial and temporal dimensions. One can do all sorts of exercises with them that one would have to do with commercial data, but without being sued. Astronomical observations are diverse and distributed, with many different instruments constantly observing the sky from all the continents, in different wavelengths. The questions are interesting, and there are a lot of data. This all adds up to the concept of a "virtual observatory." Szalay and Gray were struggling with their data publication process, based on a relatively small survey, and their colleagues at the California Institute of Technology, the Space Telescope Science Institute, the NASA Goddard Space Flight Center, and various other places were all doing the same thing.
They all decided that it was much better to do this together because, eventually, astronomers would ask why the different archives did not work together. When they created the concept of the virtual observatory, the vision was to make data integration easy by creating some standard interfaces and to federate the diverse databases without having to rebuild everything from scratch. They also wanted to provide templates for others, for the next generation of sky surveys, so astronomers could build it the right way from the beginning. This idea has taken off. About a year and a half ago, NSF funded a project for building the framework for the national virtual observatory, which involves all the major astronomy data resources in the United States—astronomy data centers, national observatories, supercomputer centers, universities, and people from various disciplines, including statistics and computer science. This project has already developed some demos, which have led to some unexpected discoveries. It is also growing internationally: the effort is now being copied in more than 10 countries, including Japan, Germany, Italy, France, the United Kingdom, and Canada. Today, all these projects together are operating with funding of about $60 million, and there is really active cooperation. In late April there was a week-long meeting in Cambridge, England, about the standardization efforts—what are the common dictionaries, what are the common data exchange formats,
how to create common registry formats that are OAI compatible, and so on. There is now even a formal collaboration—the International Virtual Observatory Alliance. Publishing this much data requires a new model. It is not clear what this new model is, however, so astronomers are trying to learn. There are multiple challenges in the use of the data by different communities. There are data-mining and visualization challenges—how to visualize such a large, distributed, complex set of databases. There are important educational aspects; students now have the ability to see the same data as professional astronomers. And there is much more data coming, petabytes per year by 2010. Astronomy is largely following the path of particle physics, with about a 15-year time delay. Particle physics also grew through this evolutionary phase. It will probably last for the next 10 or 15 years, until the next telescope is so expensive that only the whole world together can afford to build one, as is happening today with the CERN LHC accelerator. Until then, there will continue to be this data doubling. Indeed, the same thing is happening in all of science. We are driven by Moore's law, whether in high-energy physics, genomics, cancer research, medical imaging, oceanography, or remote sensing. This also shows that there is a new, emerging kind of science. When you think about how science began, it was basically very empirical. It started with Leonardo da Vinci, who made beautiful drawings of turbulence and described the phenomena as he saw them. Then, through Kepler and through Einstein, people wrote down the equations that captured in the abstract sense the theoretical concepts behind the phenomena of the natural world, and provided a simple analytical understanding of the universe around us.
Then a computational branch of science emerged over the past 20 or 30 years, and what we are faced with now is data exploration. We are now generating so much data, both in simulations and in real observations, that we need theory, empirical and computational tools, and also information management tools to support the progress of science.

GENOMIC DATA CURATION AND INTEGRATION WITH THE LITERATURE

David Lipman, National Institutes of Health/National Center for Biotechnology Information24

David Lipman recently met with Jim Gray and Alex Szalay, and a stimulating discussion on the similarities and differences between biomedical research and astronomy ensued. As Alex Szalay noted, one of the driving forces for most scientists, certainly those in biological research, is that science is becoming more data intensive. This means that researchers are generating more data for each paper, but they are also using more data from other scientists in their own research. That has an impact on both the factual databases and the literature. Many scientists believe that in order to make the most of these resources we will need deeper links and better integration between the literature and the factual databases. This includes improving data retrieval from both types of database, improving their actual usability, and maximizing the value that can be extracted from them. The quality of the factual data can be much improved if tighter integration is achieved between the literature and the databases. The growth in the number of papers in PubMed and MEDLINE is substantial, but it is basically linear. If we look at a number of factual databases, however, in most areas of biology the amount of data is increasing exponentially. For example, this is true of both the number of sequences and the total number of nucleotides in GenBank. An example of postgenomic research that generates a lot of data is proteomics.
If you look at the number of proteins reported as identified in earlier studies, and compare this to the number reported in more recent articles, the amount of data generated with each paper is increasing. Most universities now have a variety of proteomics core services doing mass spectrometry that produces a lot of data. In the area of expression analysis, the kind of work that Pat Brown has pioneered, one can see the same kind of pattern. There are now many more labs doing this work, but also, any given lab can generate more data points per paper as the cost of doing this kind of experiment goes down. For example, an expression analysis experiment on the human genome may use a chip that monitors the

24 See http://www.ncbi.nlm.nih.gov/ for additional information about the NCBI.
expression of a few thousand to perhaps 20,000 genes over a range of conditions. Hundreds of thousands of data points could be generated, relevant not only for supporting the results reported in a scholarly paper; it would also be useful to make those data available so that others can learn new things.

Not only are scientists producing more data, but they are also using the data generated by others for their own research. There are more than 400,000 users per weekday at the National Center for Biotechnology Information (NCBI) Web site, and this number is growing; most of these people are using data to design experiments. As well as accessing information through direct searching of databases, papers in electronic journals now have many links to, or at least cite, the identifiers of database records that the authors used in writing their papers. Supplementary data files are also increasingly submitted with research articles, and these too are frequently linked back to the source information.

A typical functional genomics approach, such as gene expression analysis, generally requires a range of genomic data to set up the experiments; sequence data from a number of transcripts or genomes are needed to design microarray chips. In proteomics, it is essential to be able to compare mass spectrometry data against a number of genomic data sets in order to interpret them fully. Another important result of functional genomics processes is that the researcher will often generate a new kind of data. In proteomics, the new data might be on interacting proteins. These data may also be relevant to the very databases that were used to design the experiments. For example, a researcher doing proteomics in a microbe may get mass spectrometry data that confirm that certain proteins are actually translated and found in the cell.
That information needs to be transferred back to the relevant databases so that researchers using the databases in the future will know that some previously hypothetical genes are now confirmed, and are translated or expressed. A related point is that if the researcher who generates these data keeps them only as supplementary data files on a personal server, then they are not going to be structured consistently, and therefore the data are not going to be as useful as they would be in a public database, where data are structured and normalized. For expression analysis experiments, some journals are just beginning to require submission of the data to various archives. It is very important to keep in mind this cycle of feedback between source data and publications, because it will affect the way the literature is presented in the future. At PubMed Central,25 which is the National Institutes of Health (NIH) archive for the biomedical literature, there are many links and other functions. For example, it is possible to link from the cited references in a full-text article to the appropriate PubMed records, as well as to a variety of other databases. One can now also look computationally through the text for matches to known organism names, and if the names appear in certain places in the article, for example, a Methods section, this increases confidence that the paper contains relevant information with respect to that organism. By having this fairly fine level of integration and links between the literature and factual databases, the article has a higher value: not only can the reader better understand the point that the author was trying to make, but he or she can also go beyond that point and look at newer information that may be available. 
DISCUSSION OF ISSUES

Subscription Model as a Restriction on Use of the STKE

David Lipman began the discussion by noting that because the STKE is proprietary and requires a subscription to access it, people do not link into it as much as they might otherwise. It is a limitation when not everybody is able to get into some of these factual resources. Monica Bradford responded that the connections database is free and anyone can use it. The other parts of the site, those that are more typical journal items, are behind a subscription wall. One of the ideas behind the STKE was to see if you can support the whole combined environment while keeping the database itself free. That is what all the authorities want. The data most likely also will have value for drug discovery. One of the ideas is to see if STKE

25 For additional information about PubMed Central, see http://www.pubmedcentral.nih.gov/.
can license the data for use by pharmaceutical companies so they can integrate them with their own databases on a proprietary basis, while providing enough support for the overall site to allow it to remain free for academic use. So far, most of the STKE's experts, the authorities, are comfortable with that model. It is a little early to say how much traffic STKE will have, but it is expected to grow a lot. AAAS also has an NSF grant for Biosite Ednet, which is trying to take online resources and make them available, or make it easy for instructors to find and use them in their course work. The STKE has adopted the same metadata that are being developed for that program. This has allowed STKE to extend its audience from primarily researchers, who were the initial focus, to incredible use in undergraduate education.

The Need for Federal Coordination and Investment in the Cyberinfrastructure

Dan Atkins said that the three excellent presentations in this session were representative of the 80 or so exciting testimonies that his committee heard as part of the cyberinfrastructure investigations at NSF. He would like to encourage the scientific community to get behind NSF to do something bold in this area. First, the exponential data growth is now present in many fields. It illustrates that the challenge and opportunity include not only going to higher performance networks, higher speed computers, and greater capacity storage, but doing that together with another dimension he mentioned earlier, functional completeness, by having the complete range of services. The challenge involves this balance between increased capacity and increased function.
A second point that the preceding talks illustrate is the exciting potential for multiuse technologies: the fact that the same underlying infrastructure is serving the leading edge of science and making the learning of science more vivid, authentic, and exciting, all at the same time. Although a major investment is needed to create this infrastructure, once it is created, as the astronomy example illustrates, leading-edge teams or individual amateurs can make seminal and important contributions to science, provided they are given open access to these data and to the tools.

Finally, both the opportunities and the challenges illustrate the urgency for some leadership in this area. The various research communities are already doing this, and they are scraping together the resources. Cosmologists are becoming data curators, and so on. People are putting extraordinary efforts into it, and that is very commendable. At the same time, if we do not get the right investments or the right synergy between domain scientists and librarians and information specialists, we could end up with suboptimal solutions, solutions that do not scale, and, worst of all, balkanized environments that cannot interoperate, resulting in enormous opportunity costs. So this is a plea for the NIH, the Department of Energy, and the other agencies to cooperate and try to create the kind of environment that is going to allow these research opportunities to prosper on a grand scale.

Quality Control and Review of Data in Very Large or Complex Databases

Paul Resnick asked about the quality control or review process for the data that get into the very large or complex databases. Can anyone put their data in? Alex Szalay responded that in astronomy a threshold is typically emerging: the condition for putting contributed data online is that they are provided with adequate metadata in certain formats.
This keeps out a lot of people whose data are not of high enough quality and who have not documented them sufficiently. This issue was given a lot of consideration, but they did not want to institute a formal refereeing process. They will probably introduce an annotation system eventually, where people can feed in comments and annotations.

David Lipman noted that in the databases at NCBI, and typically for molecular biology in the life sciences, there are databases such as GenBank, where authors directly submit sequence data into an archive. When that process of direct submission started about 20 years ago, the sense was that people were going to put in all kinds of make-believe data and so forth. Actually, that does not happen, although some data are redundant and some versions are of higher quality than others. That has given rise to related databases, which are curated, some by expert groups on the outside, others by the database managers themselves. So, for example, NCBI curates the human sequences along with some others, such as mouse sequences. There is a comprehensive set of curated
sequences. For the molecular biology data, there are two related sets: the archived set, which represents what scientists provided at the time, and the curated set, which contains what is supposed to be the best version of the data. Those two data sets work well together, because sometimes what some experts think is correct in the curated set turns out to be incorrect and something that was deposited originally may be more correct, so there are pointers to the older versions.

Monica Bradford noted that the STKE is more of a curated database. The pathway and all the related components are externally reviewed by people outside the group of authorities that created it and did the data entry. That is a snapshot in time, however, and those pathways and data entries need to be constantly updated. Right now, the only way that someone can comment during the period between the formal reviews is either by directly sending an e-mail to the authority, which one can do right from the graphical interface, or by using the feedback function on the Web site, in which case everyone sees it. The STKE will provide some additional tools to improve the review process and also to allow for more community annotation over time, as long as it remains clear which portion is the community annotation and which is the official part that has gone through the review process. A combination of these two approaches will add the most value over time.

David Lipman raised one other point that relates to the difference between astronomy and biology. Astronomy has an organizing principle, space-time coordinates, that is largely agreed upon. Although there are a number of different coordinate systems in use, they can be resolved largely to one.
Given that there is a stronger theoretical base in astronomy from physics and there is this natural organizing principle, it is possible with a variety of computations to actually assess some aspects of how good the data are. In biology, at the level of the genome, transcripts, proteins, and, to some extent, protein structure, there are organizing principles that are natural and strong enough to allow detection of data anomalies that do not make sense. The information in a database fits together in a certain way, and as one gets more transcript information or more comparative data for proteins, one can see better how it fits together. Above the level of the genotype and perhaps protein structure, however, with expression data, pathways, and proteomics there are not such natural organizing principles. One of the difficulties of a project like STKE and other functional genomics projects is that it is much more difficult to use cross-validation to fully assess the quality of the data. Because there are high-throughput methods in biology that are at the level of function, it is a challenge to deal adequately with quality. Monica Bradford added that one of the reasons the STKE took the approach it did is because all the tools are not yet available. People are trying to make connections across different disciplines. A cancer specialist may find an oncogene that turns out to be in a signaling pathway. They may be looking for information in literature or data that they have not followed before. The value of having scientists, university libraries, and the STKE editorial processes associated with the curation is that it helps build some trust in the information when the more automated tools that David Lipman mentioned do not exist. You can also link to other archival and curated databases elsewhere, but at a certain point, when you are pushing the edges and trying to gain new knowledge, you have to figure out what you are going to trust. 
That is what the STKE hopes to improve.

Professor Szalay also observed that as the data grow exponentially, over a period of time they will grow by a factor of 100. The capacity of computers will also grow by a factor of 100, so if the organizing principle is to save every bit of data collected, one can keep up. If the problem is to connect every bit of data to every other bit of data, however, which is an N-squared problem, there will be 100 times more data and the computers will be 100 times faster, but the computational problem will be 10,000 times larger. We are starting to approach this limit.

Martin Blume said he was struck by the connection between a number of things that have been discussed here and things that have happened in the past. It is useful to look at how we got here, and perhaps to extrapolate into the future. The e-Print arXiv grew out of the preprint archives in print back in the 1960s that followed the Xerox revolution of that time. Looking back, particularly from astronomy to particle physics, there were experiments done in the 1960s with bubble chambers, and there were many photographs taken. Those data represented the equivalent of the current sky survey, because they could be used not just for one discovery, but for meta-experiments done on them. In fact, one can see the same thing now happening with the astronomical data, which can be reused for many different
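Szalay's scaling argument can be sketched numerically. The function name and growth factors below are illustrative only, a minimal sketch of the arithmetic rather than anything presented at the symposium:

```python
# Szalay's scaling argument, sketched: if the data volume and the
# computing capacity each grow by a factor of 100, a linear task
# (store or scan every item once) keeps pace, but an all-pairs task
# (connect every item to every other, roughly N**2 work) does not.

def relative_runtime(data_growth: float, compute_growth: float, exponent: int) -> float:
    """Change in runtime when work scales as N**exponent and compute speeds up."""
    return (data_growth ** exponent) / compute_growth

linear = relative_runtime(100, 100, exponent=1)    # storage-like problem
pairwise = relative_runtime(100, 100, exponent=2)  # connect-everything problem

print(linear)    # 1.0: storage keeps up with the data
print(pairwise)  # 100.0: the work is 10,000x larger, the computers only 100x faster
```

The same arithmetic explains why "save everything" remains feasible while exhaustive cross-correlation of a growing archive falls further behind each year.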
experiments or analyses that yield new data into the future. The particle physics data were available freely after a period of proprietary use by the investigators (as are the astronomy data today) to others who could use them for new ideas and new experiments. Of course, the digital technologies have changed this dramatically, but it still looked like a massive job in the 1960s. There was one paper where the omega minus was discovered, and 33 people had collaborated on it. It was a sign of things to come.

Data-Mining Restrictions from Proprietary STM Information

Linda Rosenstein, at the University of Pennsylvania, asked how we might automatically download, create, and/or centralize a repository of identified articles if the intent is to extract data and republish subsequently extracted facts. As the extracted data would in no way resemble the original text and would also have pointers to their sources, this should not pose a significant infringement issue. The University of Pennsylvania scientists want to do something that is scientifically quite new, and they think it is very important for cancer research. They know there are various restrictions on the access to the data and the use of the data that their library has licensed. How can they undertake this incredible data-mining process, which presumably will have great results in science and perhaps even in medicine, when they are still subject to the proprietary model of how the scientific literature is made available to them?

David Lipman said that is a good question: whether the current model of fee for access fits how the literature and the factual databases need to be used. However, if the activity in question involves text-mining work based on looking at papers, that will not be useful.
If, instead, the researchers want to mine the organized data sets associated with the literature, and would use the two together, that is an issue, one with which the publishing community will have to contend. There is clearly a tension between the current model and how scientists want to work, but it is not clear where it is going to go. PubMed Central is a service in which the publishers volunteer to participate. They are not necessarily providing the newest versions of their information into it, but scientists request subsets or entire sets of PubMed Central data to compute on locally and to search for discoveries. Some of the publishers that participate in PubMed Central, upon request, allow the download of this information. As PubMed Central begins to get more open-access journals, it will be able to provide the material for download automatically, just as it does with the sequence data. The more literature that is provided under open access, the more new directions become possible in terms of how the literature and the data sets are used, and how they are developed in the first place. PubMed Central recently interacted with some radiologists who were interested in creating a special database of radiological images, which would be useful for education and training. Clearly, if the journals were open access, then it would be a very natural thing to put the two together. This is a challenge that both the publishers and the scientific community are going to have to face.

Monica Bradford said that this tension is good. It helps publishers think about what they are doing and the basic goals they have. The purpose of STKE is to help researchers be more efficient. It is supposed to advance science and serve society, and STKE has the potential to help do that. The tension makes publishers rethink their model and experiment with new approaches. It is an evolution.
Hopefully, it will be possible to come up with some creative ways that will work in the marketplace, and not be totally dependent on government control or government funds. Publishing Large and Complex Data Sets for Knowledge Discovery Participant Eileen Collins pointed out that, presumably, the methods for organizing and labeling the huge data sets reflect current knowledge in the field. Does the availability of all these wonderful data to so many people enable researchers to make quantum leaps in their knowledge of phenomena? Or, is there a risk that because the data are organized according to what we understand now, it might be tempting just to stay in that vineyard and not make big advances or face big changes in modes of thinking? How might this issue be addressed, particularly for disciplines that are not as far along as those that are putting together these huge data sets? Alex Szalay responded that in the Sloan Digital Sky Survey, which is now about 40 percent ready,
after two nights of operation astronomers found 6 of the 10 most distant quasars in the universe. This is a small telescope by astronomy standards, but it has proven efficient in finding new objects by organizing the data and scanning through the outliers. Is it the only way to actually store the data? The answer is, clearly not. With these large data sets, the only way to make sure the data are safe is to store them at multiple sites around the world. If the mirror sites are organized slightly differently, each enabling certain types of functions, and the queries are redirected to the most appropriate place, it might improve the capabilities.

David Lipman said that for databases like GenBank, and for most of the factual databases in the life sciences, there are multiple sites. GenBank, the U.S. nucleotide sequence database, collaborates with similar centers in Japan and the United Kingdom. The data are exchanged every night, but the centers have different retrieval systems and different ways of organizing access into them. Furthermore, people can download large subsets or the entire database, and do whatever they want, including making commercial products. With the STM literature, if there were more open-access journals and multiple sites that provided comprehensive access, one would see different capabilities. PubMed Central is working with a group in France to set up a mirror archive that would organize those data in a different retrieval system. Open access to the data allows for multiple archives that are comprehensive and provide different ways to get at that information.

Dr. Collins's question raised a deeper issue, however. In biology, gene ontology is a way to make it easier for people to see what is there, and to move the information around and to understand it.
This represents a trade-off between what we understand now and the kind of new discoveries that Alex Szalay refers to, which change that. There is a tension between reductionism, computing from the bottom up and learning new things, and being able to say this is what we know right now, and having that superimposed from the top down. Right now, there is a huge amount of interest in ontologies in biology, and some of it may be misplaced. One of the reasons researchers focused on molecular biology was that they really did not understand enough from the top down. If you look at proteins or genes that are involved in cancer, you find those that are part of glycolytic pathways, and so forth. It is not clear how much these ontologies assist in finding and understanding things, and how much they obscure new discoveries. As long as we maintain access to the data in a very open way, and people can download them and do almost whatever they want with them, then if they want to ignore something like an ontology, they can do that.

Monica Bradford agreed with David Lipman, but also made a few observations that are not quite as global, based on the experience with STKE. The research in signaling started out looking at things very linearly. A researcher followed a signal down one pathway, and soon realized these were really networks; one could not think about one pathway, but had to think about all these pathways together and how they intersected and affected each other. This could not be done in print. So the ability to build a database, to represent these networks, and then to query across the pathways was very useful. AAAS would like to develop bioinformatic tools that will actually let researchers look for inter-pathway connections and find new knowledge. Ms. Bradford hopes that eventually these tools will allow researchers to play with the STKE and add in their own information to see if it changes the effect or has an effect on the pathways.
The information is vast, and tools are needed to help organize it. Once the organization of the information is taken care of, it gives researchers a chance to begin to think about new things and to look at the information differently. This process can be found in many other disciplines. Clifford Lynch noted that another striking example of that is the migration away from surrogates to full texts that allow computation. That capability is having a radical effect in many fields. It is useful to be able to find things by doing computation and searching on the full text of scholarship, as opposed to being locked into some kind of a classification structure for subject description that some particular group used in creating surrogates. That is a theme he hears from scholars in every field from history all the way through biology. Transformation of Archiving in the Knowledge Discovery Processes Marcus Banks, of the National Library of Medicine, asked a question that he hoped connects sessions four and five. If we are moving from publication as product to publication as process, should we make a similar transformation in archiving? Or should we still want to archive products that maybe are
extracted from that process? An example in STKE would be the “Viewpoints” from Science, which are snapshots in time. It seems that archiving the best information in a separate subset would be useful, but then that might be paper bound and published the old way we have done things.

Monica Bradford said that AAAS has been talking about that. Because STKE is constantly changing and the database is updated on a regular basis, they assume that as it scales and grows it will increase a lot. Should the STKE Viewpoints be archived on a certain schedule, perhaps every quarter? AAAS is not yet sure what the right frequency would be, but the Viewpoints do not have to be paper bound. Right now STKE is in a transition stage, and the scientific authorities still want to get credit and be recognized for the effort, because it takes a lot of effort to do these functions well. So the purpose of the Viewpoints is not so much archival, although they do serve that purpose; it is more to give the authorities some recognition.

Dan Atkins noted that Rick Luce touched upon the archiving issue during his presentation. There is an enormous stream of digital objects that could be created by knowledge discovery processes that are mediated through technology. One needs to be able to archive not only these objects, but the relationships between the objects and the social context in which they are created. We need to start thinking about archiving that includes these temporal streams.
One of the most profound ideas about all of this comes from John Seely Brown, former Chief Scientist at Xerox and Chief Innovation Officer of 12 Entrepreneuring, Inc., who says that perhaps the most important aspect of this technology-mediated way of work is not just relaxing the constraints of distance and time, enhancing access, and so forth; it actually comes from the possibility of archiving the process, not just sampling the artifacts along the way. The idea in areas of ubiquitous computing is that people could then come back and actually mine processes and extract new knowledge that otherwise has been left on the table. It is an extension of this whole notion of data mining into knowledge process mining, so it gets very abstract. We can start to see that it is not just fanciful, and it is something to think about. People who are interested in long-term preservation need to consider huge repositories that take into account not only the basic ingredients, but the social processes by which these ingredients are encountered.

Increasing Data and Lost Opportunities

Donald King noted that over a period of 15 years, scientists have increased the amount of time that they devote to their work by about 200 hours a year. The point is that scientists are now approaching the limit of how much time an individual can devote to work. Most of that additional 200 hours per year is devoted to communicating. The number of scientists increases about 3 percent a year, which means that the total number of scientists doubles about every 15 to 20 years or so, but the point was made earlier in the symposium that some of the information we are gathering doubles every year. It seems that one of the concerns is the limitation of the capacity of the human intellect to work with these data.
The National Academies, NSF, and others could focus on trying to increase both the number of scientists who work with these data and the infrastructure that can help them do so, to begin doubling the number of scientists every 5 to 10 years instead of every 15 to 20. In response, David Lipman agreed with the basic point Donald King made, but noted that scientists do adapt to dealing with large data sets. If presented with the computing power and the data, scientists ask different kinds of questions. It takes a long time before more scientists within the community shift and start to think of different kinds of questions; a few pioneers start to think a new way, and then it starts to happen. He has a lot of confidence that the more data we generate, if the computers and access to the data are there, people will come up with ways to ask the questions.

Donald King clarified that he wanted to know if there are lost opportunities. It seems as though there must be. David Lipman said that a lost opportunity does exist. NIH a few years ago set up an initiative called the Biomedical Information Science and Technology Initiative (BISTI) to try to get more people involved in computational biology. It came up with a lot of recommendations. However, he thought the goal of BISTI would be to recommend more training for scientists and more money for computational research for the kind of work that Alex Szalay referred to, where one analyzes other people's data sets, because there is a huge number of discoveries to be made. There is to some extent a lost opportunity, because there are not enough biologists researching and writing papers that get
published in biological journals. BISTI should have focused more on training people to do that kind of work and on more grants for that sort of research.

Professor Szalay also noted that because of this data avalanche, or data revolution, if the data get properly published, it will cause another fundamental sociological change. Today in science, many people train a lifetime to build up their own suite of tools that they then apply to the raw data that they can get. Through these digital resources, one can get reasonably polished data and think more about the discovery itself; a researcher does not have to spend so much time on the mechanics of scrubbing the raw data and converting them into usable form. People will be much less reluctant to cross boundaries, for example, if the data are all available in a ratified and documented form.

The Value of Knowledge Discovery

Mark Doyle, of the APS, said he was amazed when listening to the presentations and comments in this session, because they make what he does when he publishes simple papers look trivial in terms of the amount of data and text published. He previously mentioned the two to three orders of magnitude difference between pure dissemination and what a publisher might do in creating an archival XML version and doing peer review. There is another two or three orders of magnitude increase in what researchers actually are doing with their time. That makes him hopeful that publishing is really becoming a minor cost compared with the cost of doing research and related activities. He hopes that regular publishing can piggyback on these larger kinds of things, since publishing is important, but not nearly as costly as doing the research or accumulating these much more complex kinds of things.
He also asked whether a transition is under way in which these data resources become more primary than the journal articles that come out of them. Monica Bradford noted that the 200 hours of communicating is a huge cost. That is the real publishing: the communicating and the sharing of the idea. The amount of time someone puts into creating a product, be it a connections map or whatever, is significant; it is time away from doing basic research. At STKE, they are happy to hear from the contributing authorities that the work gives them added value, in that they have to organize their own understanding of and framework for their area. But that is a significant cost one cannot dismiss. Donald King also asked about the cost of producing a traditional paper, and whether a researcher or some granting agency would realistically be willing to pay $1,500 rather than going through a subscription model; the scale of the costs involved in traditional publishing is much greater than that. David Lipman raised the NIH budget as an example: dividing the total amount that NIH spends by the number of publications that come out of it gives a cost of probably $250,000 or $300,000 per paper. The cost of publishing in some journal is thus a small part of the total, and economic analyses of open-access publishing versus fee-for-service models should be viewed on that basis. However, as Monica Bradford and Mark Doyle noted, the time spent on all those other activities factors into this as well. In a sense, the paper does represent what the scientist did; it is the knowledge part of the work, and that can only be produced so fast. So there is a difference between organizing data sets and making them usable to other people, which is a challenge, and extracting what one thinks about that data set and getting it out there.
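Lipman's back-of-envelope figure follows from a simple division of total research spending by publication output. A sketch of that calculation, using illustrative round numbers chosen to land in his quoted range (the budget and paper counts below are assumptions, not figures from the symposium):

```python
# Back-of-envelope cost per published paper, in the spirit of Lipman's
# remark. Both inputs are illustrative assumptions.
nih_budget = 27e9          # assumed annual NIH budget, in dollars
papers_per_year = 90_000   # assumed NIH-funded publications per year

cost_per_paper = nih_budget / papers_per_year
print(f"${cost_per_paper:,.0f} per paper")  # → $300,000 per paper
```

The point of the exercise is the ratio, not the exact inputs: even with generous uncertainty in either number, the research cost behind a paper dwarfs a $1,500 publication fee by two orders of magnitude.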
The Role of Journals in the Successful Development of Databases in Molecular Biology

Bob Simoni, from Stanford University and the Journal of Biological Chemistry (JBC), added that the enormous success of factual databases, and our reliance upon them, is actually the result of the collaborative effort the journals have made through their requirement that data be deposited as a prerequisite to publication. One might think that is a natural thing, but it is not. In the area of protein structure, for example, the data that crystallographers gathered were long held back and not deposited. Pressure from peers to a certain extent, but more importantly from journals, resulted in those data being deposited as a condition of publication, which is what makes the databases in this area as robust as they are. He then asked David Lipman about the status of a centralized, normalized system for deposition of array data and gene expression data.
David Lipman responded that people at the NCBI have been discussing this with David Klein, who is active with the JBC on that issue. There are standards that people have tried to agree upon for the minimum amount of data that must be submitted to a database. Unfortunately, that minimum is set so high that a number of scientists are not willing to meet it. Gene Expression Omnibus (GEO), an NCBI database, has a somewhat lower threshold for submission, and the NCBI is in discussions with the international group on that issue. GEO is growing faster now, and some journals require submission. He also seconded Bob Simoni's point about the critical role journals play in getting data into these databases in a useful form. Even though the databases are useful, scientists often do not want to spend their time on data submission; what they are judged by is not what they put into a database, but what they publish. The role of the journals is therefore absolutely critical, and JBC was one of the real pioneer journals in pushing submission to the sequence databases and achieving essentially 100 percent compliance.