6 What Constitutes a Publication in the Digital Environment?

As we study the question of what constitutes a publication and how the character of publications is changing, we shift our focus from the environmental questions of publication as process to how authors create and bind together pieces of authorship into structures like journals.

Publication used to refer to the act of preparing and issuing a document for public distribution. It could also refer to the act of bringing a document to the public's attention. Now, publication means much more. It can refer to a document that is Web-enriched, with links, search capabilities, and potentially other services nested in it. A publication now generates usage data and provides many other functions.

We can approach this from two perspectives. One is the individual author's point of view. The practice of science is changing. It is becoming much more data intensive. Simulations are becoming a more important part of some scientific practices. We are seeing the development of community databases that structure and disseminate knowledge alongside traditional publishing. From that perspective, one might usefully ask how people author articles, given the new opportunities technology is making possible. It is clear that articles can be much more than paper versions rendered by digital means.

The other perspective is that of the journal publisher, regarding the aggregation of these articles, recognizing that the ecology in which journals exist has changed radically. There are all kinds of data repositories. There
are live linkages among journals. There is an interweaving of data and authored materials that is becoming very complex. In this session we look at three innovative examples of publications and some of the issues they raise.

THE SIGNAL TRANSDUCTION KNOWLEDGE ENVIRONMENT1

The goal of the Signal Transduction Knowledge Environment (STKE),2 developed by Science, was to move beyond the electronic full-text journal model. The idea was to provide researchers with an online environment that linked together all the different kinds of information they use, not just their journals, so that they could move more easily among them and decrease the time required for gathering information, thereby leaving them much more time for valuable research and increasing their productivity. The STKE was the first of the knowledge environments hosted at HighWire Press. There are now five.

The STKE has both traditional and nontraditional types of publications and functions. In addition to providing access to the typical journal literature, Science tried to create a community environment, an area with tools and resources that scientists in that community would use. The STKE also tries to create new knowledge and to explore the network properties of signaling systems in ways that one cannot with the print product.

The STKE virtual journal has full-text access to articles about signal transduction from 45 different journals. When these journals are placed online by HighWire Press, the STKE uses an algorithm that scans the content and selects the materials related to signaling, which subscribers to the STKE can then access. The community-related functions of the STKE include letters, discussion forums, and directories. That has been the hardest part to develop.

Some lessons have already been learned from the STKE experiment. The definition of a publication is evolving. Efforts to standardize data input and controlled vocabularies have been very difficult. Perhaps most important, the reward system is not yet in place for those who are doing this kind of authoring.

1This section is based on the presentation by Monica Bradford, executive editor of Science.
2For additional information about Science's STKE, see http://stke.sciencemag.org/.

PUBLISHING LARGE DATA SETS IN ASTRONOMY: THE VIRTUAL OBSERVATORY3

Why is the publishing of large data sets an issue? Scientific data are increasing exponentially, not just in astronomy but in science generally. Astronomers currently have a few hundred terabytes derived from observations made in multiple wavelengths. Very soon, however, they are going to start projects that will reobserve the sky every four nights, to look for variable objects in the temporal dimension. At that point, the data will increase to a few petabytes per year.

Astronomy, like most other fields of science, operates under a flat budget, so astronomers spend as much money as they get from their funding sources to build new observational tools and computer equipment to gather more data to analyze. There is also increasing reuse of scientific data, with people using each other's data for purposes that were not necessarily originally intended.

Data publishing in this exponential world is also changing dramatically. The big astronomy projects typically are undertaken by collaborations of 60 to 100 people, who work for 5 or 6 years to build an instrument that collects the data and who then operate it for at least that long, because otherwise it would not be worth investing that much of their time. Once they have the data, they keep using them and eventually publish the data and their analyses. They organize the data in a database and make them accessible on their Web sites. When the project ends, the scientists go on to other projects, and at that point they are ready to hand over the data to a big national archive or centralized data storage facility. After that, the scientists continue to use the archived data.

Why are the roles changing?
The exponential growth makes a fundamental difference. There is also more responsibility placed on the research projects. Astronomers and other scientists are learning how to become publishers and curators, because they do not have a choice if they want to make their data public. More standards and more templates would help with this.

3This section is based on the presentation by Alex Szalay, Alumni Centennial Professor, Department of Physics and Astronomy, The Johns Hopkins University.

There also is a major trend toward making high-capacity computing more distributed. This is called grid computing: the computing is distributed across the Internet at multiple sites, and people can borrow time on central processing units (CPUs) whenever they need it. The people who talk about grid computing, however, tend to think only about harvesting the CPUs; they do not think about the hundreds of terabytes or possibly petabytes of data behind them, because we currently lack the bandwidth to move the data to the computers.

Alex Szalay, of the Johns Hopkins University, and Jim Gray, of Microsoft Research, have begun a project to make these astronomical data understandable and useable by high school students. They opened their Web site in 2001, and after 2 years they have about 12 million pages online and get about 1 million hits per month. The site is used by high school students who are learning astronomy, but who are also learning the process of scientific discovery, using up-to-date data that are as good as any astronomer can get today.

Astronomical observations are diverse and distributed, with many different instruments constantly observing the sky from all the continents, in different wavelengths, and producing a lot of data. This all adds up to the concept of a "virtual observatory." The vision for the virtual observatory was to make data integration easy by creating standard interfaces and to federate the diverse databases without having to rebuild everything from scratch. Astronomers also wanted to provide templates for others, for the next generation of sky surveys, so they could build them the right way from the beginning. This idea has taken off.
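The bandwidth constraint behind grid computing is easy to quantify. A back-of-the-envelope sketch, in which the data volumes and link speeds are illustrative assumptions rather than figures from the chapter:

```python
# Rough estimate of how long it takes to move survey-scale data over a
# network link, illustrating why grid computing tends to move the
# computation to the data rather than the data to the CPUs.
# All volumes and link speeds below are illustrative assumptions.

def transfer_days(volume_bytes: float, link_bits_per_sec: float) -> float:
    """Days needed to move `volume_bytes` over a link, ignoring overhead."""
    seconds = volume_bytes * 8 / link_bits_per_sec
    return seconds / 86_400  # seconds per day

PETABYTE = 1e15

# A 1 Gb/s link, fully saturated, against one petabyte of survey data:
print(f"1 PB over 1 Gb/s:  {transfer_days(PETABYTE, 1e9):.0f} days")   # about 93 days

# Even a 10 Gb/s link still needs over a week per petabyte:
print(f"1 PB over 10 Gb/s: {transfer_days(PETABYTE, 1e10):.1f} days")
```

At petabytes per year, even ideal wide-area links fall months behind the instruments, which is why the computation is shipped to the archives instead.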
The National Science Foundation funded a project to build the framework for the national virtual observatory, which involves all the major astronomy data resources in the United States: astronomy data centers, national observatories, supercomputer centers, universities, and people from various disciplines, including statistics and computer science. There is now a formal international collaboration, the International Virtual Observatory Alliance.

Publishing this much data requires a new model. It is not clear what this new model is, however, so astronomers are trying to learn as they go. There are multiple challenges in the use of the data by different communities. There are data mining and visualization challenges, such as how to visualize such a large, distributed, and complex set of databases. There are important educational aspects as well; students now have the ability to see the same data
as professional astronomers do. And there is much more data coming: petabytes per year by 2010.

Indeed, the same thing is happening in all of science. Science is driven by Moore's law, whether in high-energy physics, genomics, cancer research, medical imaging, oceanography, or remote sensing. This also shows that a new kind of science is emerging. We are now generating so much data, both real data and simulations, that we need a combination of theory, empirical computational tools, and information management tools to support the progress of science.

GENOMIC DATA CURATION AND INTEGRATION WITH THE LITERATURE4

One of the driving forces for most scientists, certainly those in biological research, is that science is becoming more data intensive. Researchers are generating more data for each paper, but they are also using more data from others for each paper. That has an impact on both the factual databases and the literature. To make this work, we will need deeper links and better integration between the literature and the factual databases, to improve retrieval from both and to improve their usability and the extraction of value from them. The quality of the factual data can be much improved by a tighter integration between the literature and the databases.

In most areas of biology, as in all other areas of science, the increase in the amount of data is exponential. For example, the number of users per weekday at the National Center for Biotechnology Information (NCBI)5 Web site is more than 330,000 with different IP addresses, and this number is growing. They are using the NCBI data to design experiments and write their articles. Electronic journals now carry many links, reflecting the growing number of database identifiers that authors include in their papers. This is also the case with supplementary data files.
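One low-tech way to build the kind of literature-to-database links described here is to scan article text for identifier patterns. A minimal sketch; the regular expression and the sample sentence are hypothetical simplifications (real accession grammars vary by database), and the URL scheme is illustrative:

```python
import re

# Hypothetical, simplified pattern for GenBank-style nucleotide accessions:
# one or two uppercase letters followed by five or six digits, e.g. U12345
# or AB123456. Real accession formats are more varied than this.
ACCESSION = re.compile(r"\b[A-Z]{1,2}\d{5,6}\b")

def find_accessions(text: str) -> list[str]:
    """Return candidate database accessions found in article text."""
    return ACCESSION.findall(text)

def link_for(accession: str) -> str:
    """Turn an accession into a retrieval URL (illustrative URL scheme)."""
    return f"https://www.ncbi.nlm.nih.gov/nuccore/{accession}"

sentence = "The sequence (GenBank accession U12345) was compared with AB123456."
for acc in find_accessions(sentence):
    print(acc, "->", link_for(acc))
```

A production linker would disambiguate which database an identifier belongs to; the point here is only that fine-grained article-to-record links can be generated mechanically from the text authors already write.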
At PubMed Central,6 which is NIH's archive for the biomedical literature, there are many links and other functions as well. One can, for example, link from a full-text article to all of the referenced articles that are in PubMed, as well as to a variety of other databases. Because of this fairly fine-grained integration and linking between the literature and the factual databases, the article has a higher value: the reader not only can better understand the point the author was trying to make, but can go beyond that point and look at newer information that is available.

4This section is based on a presentation by David Lipman, director of the National Center for Biotechnology Information.
5See http://www.ncbi.nlm.nih.gov/ for additional information about the NCBI.
6For additional information about PubMed Central, see http://www.pubmedcentral.nih.gov/.

ISSUES RAISED IN THE DISCUSSION

The Need for Federal Coordination and Investment in the Cyberinfrastructure

The exponential data growth in many fields shows that the challenges and opportunities include moving to higher performance networks, higher speed computers, and greater capacity storage, but doing so with functional completeness, by providing the complete range of services. There is exciting potential for multiuse technologies: the common underlying infrastructure serves the leading edge of science while making the learning of science more vivid, more authentic, and more exciting. Although a major investment is needed to create this infrastructure, once it is created, as the astronomy example illustrates, leading-edge teams or individual amateurs can make seminal and important contributions to science, provided they are given open access to the data and to the tools.

Both the opportunities and the challenges illustrate the urgency of leadership in this area, however. If we do not get the right investments or the right synergy among domain scientists, librarians, information specialists, and technologists, we could end up with suboptimal solutions or solutions that do not scale. Worst of all, we could end up with Balkanized environments that cannot interoperate and that result in enormous lost opportunity costs.
Quality Control and Review of Data in Very Large or Complex Databases

In astronomy, the condition for putting contributed data online is that they be provided with adequate metadata in certain formats. This keeps out many contributors whose data are not of high enough quality and who have not documented them sufficiently.
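The gatekeeping rule just described, that no contribution goes online without adequate metadata, can be enforced mechanically at submission time. A minimal sketch; the required field names below are invented for illustration rather than taken from any actual archive's schema:

```python
# Sketch of a metadata gate for contributed data sets: submissions that
# lack required descriptive fields are rejected before the data go online.
# The field names below are invented for illustration.

REQUIRED_FIELDS = {"instrument", "observation_date", "wavelength_band",
                   "coordinate_system", "contact"}

def missing_metadata(metadata: dict) -> set[str]:
    """Return the required fields that are absent or empty."""
    return {f for f in REQUIRED_FIELDS
            if not str(metadata.get(f, "")).strip()}

submission = {
    "instrument": "2.5m survey telescope",
    "observation_date": "2003-04-12",
    "wavelength_band": "r",
}

gaps = missing_metadata(submission)
if gaps:
    print("Rejected; missing:", sorted(gaps))  # contact, coordinate_system
else:
    print("Accepted for publication")
```

Such a check documents nothing by itself, but it turns the archive's quality bar into an explicit, testable contract between contributors and curators.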
At the NCBI, there is a comprehensive set of curated sequences. For these molecular biology data, there are two related sets: the archived set, which represents what scientists provided at the time, and the curated set, which contains what is supposed to be the best version of the data.

An important point relates to the difference between astronomy and biology. Astronomy has a largely agreed-upon organizing principle in its space-time coordinates. The biology data do not have that. One of the difficulties of a project like the STKE, and of any of the other projects in functional genomics, is that it is much more difficult to use cross-validation to fully assess the quality of the data. Because there are high-throughput methods in biology that operate at the level of function, it is a challenge to deal adequately with quality.

Data-Mining Restrictions from Proprietary STM Information

How might we automatically download articles and build a centralized repository of them if the intent is to extract data and subsequently republish the extracted facts? There are restrictions on the access to and use of data from proprietary sources. How can data mining to extract related facts, which presumably could produce important results, be done when these information sources are still subject to the proprietary model?

One answer might be that perhaps this tension is good. It forces publishers to think about what they are doing and about their basic goals. If their purpose is to help researchers be more efficient, and to advance science and serve society, then the new technologies and new business models need to be examined much more thoroughly.

Publishing Large and Complex Data Sets for Knowledge Discovery

The methods for organizing and labeling the huge data sets reflect current knowledge in the field.
One question that arises is whether the availability of all these wonderful data to so many people enables researchers to make quantum leaps in their understanding of phenomena. Or is there a risk that, because the data are organized according to what we understand now, it might be tempting just to stay in that vineyard and not make big advances or face big changes in modes of thinking? How might this issue be addressed, particularly for disciplines that are not as far along as those that are putting together these huge data sets?
In the case of the Sloan Digital Sky Survey, which is now about 40 percent complete, after 2 nights of operation astronomers found 6 of the 10 most distant quasars in the universe. This is a small telescope by astronomy standards, but it shows that astronomers are efficient in finding new objects by organizing the data and scanning through the outliers.

For databases like GenBank, the U.S. sequence database, and for most other databases in the life sciences, there are multiple sites. GenBank collaborates with similar centers in Japan and the United Kingdom. The data are exchanged every night, but the centers have different retrieval systems and different ways of organizing access to them. Furthermore, people can download large subsets or the entire database and do whatever they want with them, including making commercial products.

With the STM literature, if we had more open-access journals, and there were multiple sites that provided comprehensive access, one would see different capabilities. PubMed Central is working with a group in France to set up a mirror archive that would organize the data in a different retrieval system. Open access to the data allows for multiple comprehensive archives and provides different ways to get at the information.

The questions posed above raise a deeper issue, however. In biology, the gene ontology is a way to make it easier for people to see what is there, to move the information around, and to understand it. This represents a trade-off between what we understand now and the kind of astronomical discoveries referred to above. Right now, there is a huge amount of interest in ontologies in biology, and some of it may be misplaced. One of the reasons researchers focused on molecular biology was that they really did not understand enough from the top down. If one looks at proteins or genes that are involved in cancer, one finds those that are part of glycolytic pathways, and so forth.
It is not clear how much these ontologies assist in finding and understanding things, and how much they obscure new discoveries. As long as we maintain access to the data in a very open way, so that people can download them and do almost whatever they want with them, then if they want to ignore something like an ontology, they can do that.

Another issue is the migration away from text surrogates to full texts that allow computation. That capability is having a radical effect in many fields. It is useful to be able to find things by doing computation and searching on the full text of scholarship, as opposed to being locked into some kind of classification structure for subject description that some particular group used in creating surrogates. That is a theme heard from scholars
in every field, from history all the way through biology. It is really quite a striking example of how things are changing.

Transformation of Archiving in the Knowledge Discovery Processes

If we are moving from publication as product to publication as process, should we make a similar transformation in archiving? Or should we still archive products that may be extracted from that process? An example in the STKE would be the "Viewpoints" from Science Online, which are snapshots in time. Archiving the best contributions could be useful, and these might even be published in print form. So far, however, the focus for the STKE Viewpoints is not archival; although they do serve that purpose, they exist more to give the authorities some recognition.

There is an enormous stream of digital objects that could be created by knowledge discovery processes mediated through technology. One may want to archive not only these objects in temporal streams, but also the relationships between the objects and the social context in which they are created.

One of the most profound ideas about this came from John Seely Brown, who posited that perhaps the most important aspect of this technology-mediated way of working is not just relaxing the constraints of distance and time, enhancing access, and so forth, but the possibility of actually archiving the process itself, not just sampling the artifacts along the way. In areas of ubiquitous computing, people could subsequently return and actually mine processes and extract new knowledge that otherwise would have been left on the table. It is an extension of the notion of data mining into knowledge process mining. It can get very abstract, but we can start to see that it is not just fanciful but something to think about.
People who are interested in long-term preservation need to consider huge repositories that take into account not only the basic information but also the social processes by which the information is disseminated, manipulated, and used.

Increasing Data and Lost Opportunities

According to statistics presented by Donald King, over the past 15 years scientists have increased the amount of time they devote to their work by about 200 hours a year. Scientists therefore are reaching the limits of how much time they can spend on their work, and most of those additional 200 hours are devoted to communicating. The number of scientists increases only about 3 percent a year, however, which means that the population of scientists doubles only about every two decades, yet some of the information we are gathering doubles every year. The limited capacity of the human intellect to work with these data may therefore be a concern. The scientific community may wish to increase the number of scientists who work with the data and the information infrastructure. It seems that there must be lost opportunities.

At the same time, of course, scientists do adapt to dealing with large data sets. If presented with more computing power and data, scientists ask different kinds of questions. It may take a long time, however, before more scientists within the community shift and start to think of different kinds of questions. A few pioneers start to think in a new way, and then it starts to take hold.

Because of this avalanche of data, if all the data get properly published, it will cause another fundamental sociological change. Today in science, many people train a lifetime to build up their own suite of tools, which they apply to the raw data that they can get. Through these new digital resources, if one can get reasonably polished data, one can think more about the discovery itself; a researcher does not have to spend so much time on the mechanics of scrubbing the raw data and converting them into useable form. People will be much less reluctant to cross disciplinary boundaries if the data are all available in a ratified and documented form.

There also is a difference between organizing data sets and making them useable to other people, which is a challenge, and finding and extracting what one thinks about a data set and getting the results published. The article represents what the scientist did; it conveys the knowledge. That can only be done so fast.
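The mismatch between growth in data and growth in the scientific workforce can be made concrete with doubling-time arithmetic. The 3 percent figure comes from the discussion above; the formula is the standard exponential-growth relation:

```python
import math

def doubling_time(annual_growth_rate: float) -> float:
    """Years for a quantity growing at a fixed annual rate to double."""
    return math.log(2) / math.log(1 + annual_growth_rate)

# A scientific workforce growing about 3 percent per year:
print(f"Scientists double every {doubling_time(0.03):.0f} years")  # ~23 years

# Data doubling roughly every year corresponds to 100 percent annual
# growth, so over a decade the workforce grows by about a third while
# the data grow roughly a thousandfold.
print(f"Workforce after 10 years: x{1.03 ** 10:.2f}")
print(f"Data after 10 years:      x{2 ** 10}")
```

The gap of several orders of magnitude per decade is the quantitative core of the "lost opportunities" concern.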
The Role of Journals in the Successful Development of Databases in Molecular Biology

The enormous success of factual databases in molecular biology, and the scientific community's reliance upon them, is largely a result of the collaborative effort the journals have made through their requirement that data be deposited as a prerequisite to publication. In the area of protein structure, for example, the data that crystallographers gathered were long held back and not deposited. Pressure from peers to a certain extent, but more importantly from journals, resulted in those data being deposited as a condition of publication, which makes the databases in this area as robust as they are.
Despite the fact that databases are useful, scientists often do not want to spend their time on them. What they are judged by is not what they put into a database, but what they publish. The Journal of Biological Chemistry, a not-for-profit journal published by the American Society for Biochemistry and Molecular Biology, was one of the pioneer journals in requiring submission to the sequence databases and in achieving essentially 100 percent compliance.