Science, Collaboration, and Information Technology
Science is changing in many ways. The specifics vary among disciplines and subdisciplines, but, in general, scientists are addressing increasingly complex problems, the instruments and facilities needed to conduct research are becoming increasingly expensive,1 and funding for scientists is becoming tighter, especially on a per capita basis. These are among the factors that are promoting broader interest in collaboration, although the nature and extent of collaboration differ across disciplines. They are also promoting broader interest in the efficiency of the scientific process, something to which more widespread use of computing and communications may contribute.
The scientific process encompasses a wide range of technical, social, and procedural activities (see Figure 1.1 for one view), each of which involves information—information is collected, combined, analyzed, derived, discussed, and distributed. Some, if not all, of these activities may and often do benefit from the application of computer and networking technology. All are subject to the knowledge, attitudes, and preferences of individual scientists, the prevailing norms or cultures in individual fields, and the constraints and incentives imposed by relevant institutions, from those that employ or educate researchers to those that fund or otherwise support them.
INCREASE IN VOLUME OF INFORMATION AND COMPLEXITY
Early in this century, the "gentleman's agreement" in astronomy was that when an individual began to publish observations of a certain class of stars, other astronomers would avoid observing that particular class. Now there are many more astronomers than there are classes of stars, and all types of observations are fair game for those who have suitable instrumentation at their disposal. Scientific activity has been growing so rapidly that the doubling time for the body of scientific information is now about 12 years, and today, at least 90 percent of all scientists who have ever lived are still alive.
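The two figures in the preceding paragraph can be checked against each other with a short calculation. The sketch below is illustrative only and rests on assumptions that are not in the text: that the cumulative number of scientists grows exponentially with the same 12-year doubling time quoted for scientific information, and that "still alive" can be approximated as "entered the field within the last `career_years` years." The function names are hypothetical.

```python
import math


def fraction_still_active(doubling_time_years: float, career_years: float) -> float:
    """Share of all scientists ever trained who entered within the last `career_years`.

    Assumes the cumulative total C(t) grows as C0 * 2**(t / T), so the recent
    entrants are C(t) - C(t - L) = C(t) * (1 - 2**(-L / T)).
    """
    return 1.0 - 2.0 ** (-career_years / doubling_time_years)


def career_needed(doubling_time_years: float, fraction: float) -> float:
    """Span of years L for which the recent-entrant share equals `fraction`."""
    return doubling_time_years * math.log2(1.0 / (1.0 - fraction))


print(round(fraction_still_active(12, 40), 3))  # 0.901
print(round(career_needed(12, 0.90), 1))        # 39.9
```

Under these assumptions, a 12-year doubling time and a span of roughly 40 years give a share of about 90 percent, so the two figures are at least mutually consistent.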
Meanwhile, the nature of data collection has changed significantly, and the opportunities for making measurements have increased dramatically. For example, the ability to field automated instruments that can monitor conditions in distant reaches of space or in remote ocean regions has produced massive amounts of data. Moreover, new fields of science have arisen because it became possible to make particular kinds of observations.2
A result of the explosive growth in all the sciences is the sheer volume of information to be accessed, stored, analyzed, and understood. Any one individual can master only a tiny fraction of the total of scientific knowledge. This is the individual who "knows more and more about less and less."
At the same time, the complexity of many of the problems now being addressed by scientists has led to an increase in interdisciplinary research, as well as to the recognition that computers and communications, or information technology, have become essential tools for handling complexity. The requirements for collecting and sharing massive amounts of data, the difficulties of developing and working with models of complex phenomena, the requirements for massive computation (achievable through shared high-performance computing systems or through interconnected distributed computing systems), and the growing understanding that many pressing scientific problems transcend the boundaries of individual disciplines are other manifestations of complexity that drive demand for information technology in science. These conditions suggest that interdisciplinary research and collaboration, and the means for facilitating both, will become increasingly important to many scientists. It is often the case that the individual who "knows less and less about more and more" is needed to bring together individuals who can collaborate in correlating facts across disciplines.
Notwithstanding the growth in volume and complexity of scientific activity, funding for research is getting tighter. That context makes stretching and leveraging available dollars a necessity.
INFORMATION TECHNOLOGY FOR RESEARCH AND COLLABORATION
Differences among disciplines explain in part observed differences in the use of information technology and also in the propensity to collaborate and the styles of collaboration chosen. For disciplines that are data-driven, databases, libraries, and access to such resources are central requirements. Just to display (or "visualize") subsets of data is often a major challenge that calls for sophisticated algorithms and software as well as high-performance computing hardware. For disciplines that are more model-driven, algorithms and software are also key resources. Computer communications and related information technology can also facilitate the automatic control and sharing of instrumentation, some of which may be local to a particular research endeavor and some of which is remote.
Many scientific projects have to devote substantial time and money to the mechanics of sharing resources and coordinating activity; that time and money could better be spent doing science if the mechanics were easier. Many scientists are deeply frustrated with a computing environment that does not adequately support the demands of their science. Hardware problems range from difficulty in justifying purchases of computing equipment in grant applications to a lack of wiring in buildings to support local area network connections. Software problems range from an inability to access previously collected data to a lack of software for data analysis. Human resources problems range from not being able to find knowledgeable technical staff to not being able to pay scientifically attuned support personnel.
Moreover, despite the importance of collaboration, when too many human minds try to collaborate meaningfully, the requirements for communication become overwhelming. Facilitating the necessary robust communication among scientists involves both technical and social considerations—researchers must have access to useful computer facilities, networks, and data sets but must also be able to work in an environment that fosters cooperation among individuals with differing academic traditions, approaches to and priorities in research, and budget constraints (NRC, 1990a). Thus, it becomes necessary to choose the kinds of collaboration and computational aids that will enable the sharing of information, instruments, and ideas needed for science to advance so that understanding of phenomena is increased and practical benefits achieved.
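The report does not quantify how communication requirements grow with group size. One common way to illustrate the point, offered here as an assumption and not as the committee's model, is to count the distinct pairs of participants who might need to talk to each other directly. The function name is hypothetical.

```python
def pairwise_channels(participants: int) -> int:
    """Number of distinct two-person channels in a group of `participants`.

    This is the count of unordered pairs, n * (n - 1) / 2; it is only a rough
    proxy for communication load, since real groups rarely need every pair.
    """
    if participants < 0:
        raise ValueError("participants must be non-negative")
    return participants * (participants - 1) // 2


for n in (3, 10, 50, 200):
    print(n, pairwise_channels(n))
# 3 3
# 10 45
# 50 1225
# 200 19900
```

The quadratic growth is one reason shared tools such as distribution lists and common databases, which replace many pairwise exchanges with a single shared channel, matter more as groups get larger.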
There are now a number of national and international initiatives whose success depends on inter- and intradisciplinary research and collaboration among scientists, including the U.S. Global Change Research Program, the World Climate Research Program, and the International Geosphere-Biosphere Program (NRC, 1990d). At the same time, the national High Performance Computing and Communications (HPCC) initiative promises to aid interdisciplinary groups of scientists, engineers, and mathematicians in applying emerging high-performance computing and communications systems to advance the solution of diverse science and engineering problems (FCCSET, 1992; OSTP, 1992).
For large interdisciplinary scientific initiatives, collaboration is a requirement of progress. It may work well or imperfectly, depending on a variety of personal and social factors (Box 1.1) as well as on the availability of appropriate technical infrastructure; lack of tools to facilitate communication, for instance, can impede the best-intentioned plans for joint work.
THE COLLABORATORY CONCEPT
The idea of using computing and networking technology to aid sciences other than computer science is not new (Box 1.2). Databases, electronic mail, and computer-based statistical packages for data analysis are all common tools across the sciences. What is new is the idea that various tools and technologies can be integrated to provide an environment that enables scientists to make more efficient use of scientific resources wherever they are located. Such an environment is termed a "collaboratory" in this report. At the highest level of abstraction, a collaboratory is a vision of a future "... 'center without walls,' in which the nation's researchers can perform their research without regard to geographical location—interacting with colleagues, accessing instrumentation, sharing data and computational resources, [and] accessing information in digital libraries" (Wulf, 1989). In operational terms a collaboratory is a distributed computer system with networked laboratory instruments and data-gathering platforms; tools that enable a variety of collaborative activities; financial and human resources for maintaining, evolving, and assisting in the use of computer-based facilities; and digital libraries that include tools for organizing, describing, and managing data, thus enabling the large-scale sharing of data. A collaboratory provides a technological base specifically created to support interaction among scientists, instruments, and data networked to facilitate research conducted independent of distance.
BOX 1.1 SHARING AND COLLABORATION IN SCIENCE
Science advances through the process of sharing data, theories, ideas, and results. The scientific paper, published in peer-reviewed journals or proceedings, is the preeminent formal mechanism for advancing collective understanding. Formal gatherings such as invited colloquia, conferences, and workshops are also important, as are informal exchanges among peers. In all cases shared information is scrutinized and critiqued. Meritorious work becomes part of the collective understanding, thereby advancing the collective enterprise.
Individual scientists, as well as the scientific enterprise, also advance through sharing, with the character of this sharing looking different within the primary work group from the way it looks outside that group. Within their primary work group, scientists engage in extensive sharing on a daily basis. Problem solving, invention, and interpretation occur in large measure through informal give and take, typically in face-to-face encounters. Two or three scientists gathered in heated discussion around an instrument display, data plot, or blackboard typify this kind of sharing, which is governed by norms of trust, reciprocity, and confidentiality. Through this mode of sharing scientists make sense of their own work and shape it for formal disclosure. Outside their primary work group, scientists' sharing occurs primarily through more stylized presentations and publications, which are governed by established norms and stand as accounts of work. Through this mode of sharing scientists receive credit for their work. Credit gets translated into reputation and other resources such as tenure, new and larger grants, better doctoral students, and scientific prizes. Because these resources are scarce, competition to secure credit can conflict with at least short-term open sharing. The need for timely publication and careful publication both militate against prepublication sharing of data with anyone outside a scientist's own work group. Strategic publication militates against publishing data (rather than papers) and publishing papers based on old data (rather than new data). In sum, the character of scientific sharing of data, theories, ideas, and results is influenced by both cooperation and competition.
Between collaboration and competition lies cooperation. Both collaboration and cooperation imply sharing data and other scientific resources. But the motivations and expected benefits are quite different. Cooperation may be impelled primarily out of narrow self-interest and may yield mutual benefit but not joint benefit. It can be construed as an exchange relationship. For instance, scientists cooperate with their peers by making their data available to them in publicly accessible databases. But they may do so primarily because they are required by third parties to make data public before they can publish or receive additional funding. Collaboration can be construed as a communal relationship that implies social trust and synergy among participants, with mutual benefit as the result.
Scientific organizations, like individual scientists, also engage in competition and collaboration. The competition for resources (funds and the best scientists) among scientific organizations is well known and intense. Yet, at the same time, cooperative agreements of different forms among organizations are also quite common today: academic and industry consortia, precompetitive industry projects, National Science Foundation science and technology centers, accelerator projects, and so on. While scientific organizations are motivated to advance scientific knowledge through these collaborations, they are also motivated to ensure their own well-being. Thus, issues of priority claims and credit can be as important to organizations as they are to individual scientists. (Some of the most complex provisions governing interorganizational agreements have to do with ownership of products resulting from the cooperation.)
A goal of the collaboratory concept is to render irrelevant the actual location of equipment and instrumentation and to make possible the creation of virtual laboratories using networked facilities. One can imagine the possibility of coordinating the capture of data by equipment on an orbiting satellite with the collection of data by ground-based instrumentation using computer networking tools to link all the facilities together. This collaboratory function may prove to be at least as important as providing for the sharing of information and support for collaborative interaction among colleagues. Where unique instruments are needed, or simultaneous data collection is necessary, a collaboratory's ability to manage and control local and remote instrumentation could easily make the difference between a successful experiment and an impossible dream.
BOX 1.2 ANTECEDENTS OF THE COLLABORATORY
The antecedents of the collaboratory date to the development of the Arpanet in 1969. One of the first examples of computer-network-supported collaboration, started in 1973, was a collaboration among Stanford University, University College in London, and Bolt, Beranek, and Newman in Boston. Using the Arpanet electronic mail and support for interprocess communication, a small number of implementors collaborated on the development of the TCP/IP protocols, the central element of internet technology. By the mid- to late 1970s, roughly 150 people linked by electronic mail were involved in developing the evolving TCP/IP suite (then numbering about 35 protocols). In the late 1980s, the Arpanet was decommissioned, superseded by the Internet, of which it had become an element. Currently about 1,500 people, scattered across 100 countries that cross all 24 time zones, are involved in some 80 working groups that constitute the Internet Engineering Task Force. Electronic mail, shared document databases, distribution lists, anonymous file transfer archives, and a cornucopia of new applications for distributed data recovery and management make this global collaboration feasible. These tools also enable a six-person staff to function as a secretariat for this rapidly paced technical standardization work. Industry participation in the work has led to the rapid development and deployment, sometimes within days, of products resulting from standards agreements.
Consisting of over 10,000 networks that link more than 1,500,000 computers, the Internet itself represents another kind of collaboration infrastructure. There is no central operating authority, and the system is funded by an international melange of private, public, for-profit, and non-profit resources. The system is doubling in size annually; it has spawned a multi-billion-dollar international equipment and service market in computer communications, and it is used worldwide for sharing scientific results and coordinating scientific research.
On-line publications are beginning to emerge from the Internet environment. The Internet Society News, for example, is published quarterly and incorporates the contributions of over 150 reporters worldwide who submit stories to the editor and to a page-layout editor over the network.
Many software development projects sponsored by (D)ARPA rely heavily on the Internet for project management and for collaboration. Common Lisp, a popular computer language for artificial intelligence, was developed by more than 60 people from universities, government, and industry who collaborated for 3 years but attended face-to-face meetings for only 2 days. According to a lead participant in the design effort, "The development of Common LISP would probably not have been possible without the electronic message system provided by the Arpanet. ... Over the course of 30 months, approximately 3,000 messages were sent (an average of 3 per day), ranging in length from one line to 20 pages. ... It would have been substantially more difficult to have conducted this discussion by any other means and would have required much more time" (Steele, 1984).
Another highly successful collaboration is exemplified by design activities using the Metal Oxide Semiconductor Implementation System (MOSIS). Developed by (D)ARPA in the late 1970s and early 1980s, this system first appeared as a prototype at Xerox PARC and was further developed by the Information Sciences Institute at the University of Southern California. It accepts very large scale integrated (VLSI) circuit designs in digital form over the Internet, combines multiple circuits where there is room on chips, produces a tape that describes the fabrication mask, arranges for fabrication of the wafers at a foundry, has the chips packaged and tested, and returns the circuits to the original designers within a few weeks at a cost ranging from a few hundred to a few thousand dollars per chip. This collaboration between circuit designers (principally, graduate students at U.S. universities) and the MOSIS project staff led to a significant increase in the number of trained VLSI designers and to the formation of some of the most successful computer hardware technologies, including the Geometry Engine used in machines produced by Silicon Graphics Inc. (SGI), systolic arrays that led to the Intel iWarp, the Connection Machine technology originally used by Thinking Machines Inc., and reduced instruction set computing (RISC) technologies such as those developed by MIPS Inc. (now part of SGI). The developing network and its collaboration-supporting electronic mail, file storage and retrieval capability, and the standards for chip design representation were all vital to the success of the effort.
In the information-rich world of scientific research today, discovery of relevant data and results is a major challenge. The information cataloging and indexing capability of digital libraries, which are part of the collaboratory concept, as well as the idea of having available the full content of reports and even the raw data and analysis programs used to process them, contributes to the appeal of collaboratories. With the proper information technology infrastructure, collaboratories could be formed quickly and flexibly to address particular problems or research opportunities.
Although variations may evolve over time and in response to the needs of different disciplines, a collaboratory may be envisioned as including up to perhaps 2,000 principal investigators, postdoctoral associates and doctoral students, scientific support personnel, and technical support personnel located at 5 to 20 home institutions. Participants engaged in joint scientific research would be linked in a system providing computerized information technology for the collection, analysis, and distribution of data and results. All data or data products and means of accessing instrumentation, as well as all analysis and modeling capabilities and results, would be immediately available at every scientist's workstation.
To achieve a collaboratory capability, a considerable amount of research, development, and experimentation is needed. Although some features of a collaboratory (e.g., the types of instruments used, if any, the nature of the data collected, and the programs needed to analyze them) will be unique to particular disciplines, others will be more generally applicable across the sciences or will even meet more general needs for collaboration that are likely to emerge in commercial markets. Developing useful collaboratories thus requires research and development partnerships among scientists and information technologists to define, refine, and stabilize disciplinary or interdisciplinary collaboratory tools.
To further explore the concept of a collaboratory as it was first articulated and discussed in 1989 (Towards a National Collaboratory, 1989), the Computer Science and Telecommunications Board of the National Research Council convened a committee in December 1991 to study the need for and benefits of collaboration in scientific research, factors determining the effectiveness of collaboration, and the ability of information technology—specifically of electronically integrated collaboratories—to support and enhance interactive scientific research. In addressing these issues, the committee focused on three discrete areas of scientific investigation—oceanography, in which the difficulty and expense of gathering data and the interdependence of modelers and experimentalists provide motivation for greater collaboration (Chapter 2); space physics, which has of necessity used extensive computational technology in the analysis of data collected by cooperatively fielded space- and ground-based instruments (Chapter 3); and gene mapping and sequencing, research that has led to construction of and reliance on massive databases (Chapter 4). Research in these fields is sponsored by a variety of agencies, including the National Science Foundation, the National Institutes of Health, the National Aeronautics and Space Administration, the (Defense) Advanced Research Projects Agency, the Office of Naval Research, and the Department of Energy.
The committee's investigations suggested technical requirements and social and practical issues (Chapter 5) that must be considered and dealt with as part of the process of initiating a national collaboratory program (Chapter 6) in support of scientific research.
In conducting this study, the committee sought to:
Identify common information technology needs that cross disciplines;
Identify specific information technology needs in three particular fields of science, using this information to synthesize and refine the collaboratory concept;
Increase awareness of the utility of information technology for the conduct of scientific research, particularly in the form of collaboratories; and
Identify goals, objectives, and costs of developing collaboratories that would achieve concrete payoff in the form of enhanced scientific output.