National Collaboratories: Applying Information Technology for Scientific Research

5 Building and Using Collaboratories

The promise of collaboratories is that—if they are thoughtfully developed to meet the needs of working scientists for handling information in its many forms—they have the potential to free researchers to concentrate on the purpose and results, rather than the mechanics, of communicating. If, in time, interactive "center[s] without walls" become a reality, then such collaboratories may further contribute to a positive and reinforcing sense of community among scientists that increases as the scale, quality, and scope of the shared information grow.

Building collaboratories that can cost-effectively facilitate scientific research requires that several technical, social, organizational, and practical issues be recognized and dealt with. Some of these, in many cases representing a distillation of discipline-specific needs and problems pointed out in Chapters 2 through 4, are outlined below.

IDENTIFYING BASIC CAPABILITIES A COLLABORATORY SHOULD SUPPORT

Of the many capabilities a collaboratory might be envisioned as providing, the following four classes address common information-related problems that have led, for the most part, to ad hoc and idiosyncratic solutions. Collaboratories would integrate some or all of these capabilities (depending on the needs of the relevant scientists). They would also foster development of better tools.

- Data sharing. The capability for scientists in different locations working on the same project to quickly and easily obtain access to data, both within and across databases.

- Software sharing. The capability for scientists in different locations to conveniently share software that supports data analysis, visualization, and modeling.

- Controlling remote instruments. The capability for scientists to control instruments located in difficult-to-access regions on Earth, or in space, for example.
- Communicating with remote colleagues. The capability for scientists in different locations to interact effectively with one another despite separation in space and/or time.

For these general classes of problems, common technologies that provide the desired capabilities and thus could be broadly useful across scientific disciplines might be made available through a public infrastructure (such as that provided today by the Internet, which enables sending messages and transferring files); in the form of general-purpose tools; or as a substrate for more specialized tools (as in the case of remote instrument control). Tools specific to particular disciplines, types of research problems, or projects will always be necessary, and they will have to be developed within the context of
individual projects or disciplines. However, more widely applicable tools and infrastructure might be the initial components of collaboratories designed to support doing science in collaboration at a distance.

PROVIDING BASIC CAPABILITIES: TECHNICAL CONSIDERATIONS

Today's computing and communications infrastructure supports rudimentary collaboration at a distance but typically is inadequately developed, deployed, and supported to sustain the quality and scope of tools and applications envisioned for collaboratories that will facilitate scientific research.

Interconnecting Data Sources

Scientists participating in this project's three workshops discussed a number of components considered essential to an enhanced capability for sharing data:

- Electronic libraries that would combine databases, literature, and software relevant to their research. Scientists considered electronic libraries a top priority because of the potential for rapid access to literature and the library's capacity to help locate information and data.

- Easily accessible archives of data, particularly in the physical sciences, for some of which large, established archives already exist, such as those at the National Space Science Data Center and the National Center for Atmospheric Research. Archives have become increasingly important as experiments and data gathering have become more complex and expensive, and as data sets have become more massive.

- A comprehensive system that would support retrieval of data from any or all sources, regardless of the data's origin or physical location. An example is the globe data catalog contemplated in Chapter 2, which would visually relate collected and archived data to the area of investigation, the type of data collected, and the time period that the data derive from.
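The kind of catalog lookup described above—relating archived holdings to an area, a data type, and a time period—can be sketched in a few lines. The entry fields, matching rules, and host names below are illustrative assumptions, not details from the report:

```python
from dataclasses import dataclass

# Hypothetical catalog entry; the field names are illustrative.
@dataclass
class CatalogEntry:
    archive: str        # holding institution, e.g. "NCAR"
    data_type: str      # e.g. "temperature", "aurora imagery"
    region: str         # area of investigation
    start_year: int     # first year the data cover
    end_year: int       # last year the data cover
    location: str       # where the data physically reside

def search_catalog(entries, data_type=None, region=None, year=None):
    """Return entries matching the requested type, region, and time period."""
    hits = []
    for e in entries:
        if data_type and e.data_type != data_type:
            continue
        if region and e.region != region:
            continue
        if year is not None and not (e.start_year <= year <= e.end_year):
            continue
        hits.append(e)
    return hits

catalog = [
    CatalogEntry("NCAR", "temperature", "North Atlantic", 1980, 1990,
                 "ncar.example.edu"),
    CatalogEntry("NSSDC", "aurora imagery", "Greenland", 1985, 1992,
                 "nssdc.example.gov"),
]
matches = search_catalog(catalog, data_type="temperature", year=1985)
```

A researcher would then follow the `location` field of each match to the archive actually holding the data; the catalog itself stores only descriptions.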
Today, data catalogs of archived holdings do exist, but the format for their presentation, the means of searching for the desired information, and the accessibility of the catalogs to researchers all vary widely from archive to archive, especially across disciplines. Currently, research prototypes exist that permit users to issue single queries that search across multiple databases or archives. One example is the Worm Community System (Schatz, 1991-1992), described in Chapter 4, which represents a specific solution to a small research community's requirements for sharing data.

Another useful tool is resource discovery software, which searches descriptions of databases or files to locate suitable sources to search and which can transfer files once they are discovered. Widely used examples of public-domain software include Archie, Gopher, and World-Wide Web, which search by name of the file, and Wide Area Information Server, which permits free-text searches by concept. These resource discovery tools were designed to be used on the Internet—an environment where it is expected that files will be shared. In a general-purpose file system—an environment in which the sharing of files may be an afterthought to their creation—resource discovery is much more difficult. Nevertheless, prototype resource discovery tools are now being developed for general-purpose file systems. One such prototype is Essence (Hardy and Schwartz, 1993). Despite current research efforts, more work needs to be done in this area.

If all or nearly all the information sources for a given subject domain can be accessed uniformly, the resulting ensemble of all information sources becomes a powerful research tool. The entire corpus of sources can be considered a single federated database consisting of multiple physical databases and
often containing different types of data, but giving the appearance of a single, logical whole. Supporting the appearance of uniform retrieval across data sources requires standard protocol interfaces and transformation programs to change the representation of each type of data into a standard format. Transformations such as those for text, graphics, and image conversions could be generic across all disciplines. Others would be subject-matter specific, such as those for maps and sequences for genome research or for temperature and currents for physical oceanography. Although the external data formats can vary considerably, federation across data types is possible with standard formats for representing the internal data. In standard database technology the structured query language (SQL) and Open-SQL interfaces provide a fundamental part of this linkage, but other structures for mapping data dictionaries and semantic values must also be developed for a federated database to be constructed.

For sources containing textual (and other) information, some progress has been made in standardizing information search and retrieval protocols. The recently developed American National Standard Z39.50, Information Retrieval Service Definition and Protocol Specifications for Library Applications, provides the means for performing queries on textual information and is being adapted by the International Organization for Standardization as an international standard. However, it is only one standard, with a modest number of applications and a multitude of data formats to search across. Consequently, while the Z39.50 standard is a good start, much more needs to be done to extend this protocol and to further develop other appropriate standards and protocols for system-independent data search and information interchange.
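The federation idea described above—one user query fanned out to several physical databases, with per-source transformation programs mapping native records into a standard internal format—can be sketched as follows. The source names, native record fields, and standard format are invented for illustration:

```python
# Stand-ins for two physical databases, each with its own native format.
def query_gene_map_db(term):
    return [{"gene_name": term, "chromosome": "III"}]

def query_sequence_db(term):
    return [{"id": term, "seq": "ATGGCC"}]

# Each source pairs a query function with a transformation program that
# rewrites its native records into one standard internal format.
ADAPTERS = {
    "gene_map": (query_gene_map_db,
                 lambda r: {"source": "gene_map", "key": r["gene_name"],
                            "payload": {"chromosome": r["chromosome"]}}),
    "sequences": (query_sequence_db,
                  lambda r: {"source": "sequences", "key": r["id"],
                             "payload": {"seq": r["seq"]}}),
}

def federated_query(term):
    """Issue one query against every registered source; return uniform records."""
    results = []
    for name, (query, to_standard) in ADAPTERS.items():
        for record in query(term):
            results.append(to_standard(record))
    return results

hits = federated_query("unc-22")
```

The caller sees a single logical database: every record comes back in the same shape, regardless of which physical source produced it. The hard problems the chapter notes—data dictionaries, semantic mappings, and cross-format transformations—live inside the adapter functions.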
In addition to conducting broad searches across many databases, scientists may wish to record logical associations they detect between items within a database or across databases. Such associative links may build on previously identified relationships or represent the exploration of new ones among the database elements. For example, genes might be represented by linking various elements in gene map and sequence databases. An ocean voyage might be represented by a set of linked oceanographic database items. Unusual and nonintuitive links among database items recorded by one researcher may well stimulate new insights or approaches to a problem by other scientists.

Implementing logical links between related items in different sources requires a standard format for representing the links and a series of methods for determining semantic relationships. These are areas for research and development. The commonest method for identifying semantic relationships relies on the use of standard terms or nomenclature, such as the well-defined names that denote particular genes described in the literature, maps, and sequences. When standard nomenclature or terms have been used, it is possible to automatically generate links. For less obvious or novel relationships, a collaboratory system that supports data sharing should be able to support user-specified links.

Sharing and Applying Programs

Analysis of collected data lies at the heart of the scientific process. Data to be analyzed can be numeric or symbolic; some data, for example, may be in the form of literature. Increasingly, scientific analysis involves the use of software. Currently, genome researchers use analysis software to locate related items, such as gene sequences similar to a sequence being studied. In physical oceanography, simulation software is used to predict the results of future experiments, such as projecting ocean currents in a particular region.
In space physics, modeling software is used to predict the behavior of observed phenomena, such as effects of the solar wind on the aurora. Community software packages such as IRAF and AIPS in the astronomy community and ORTEP and X-PLOR in the molecular biology community have proven very useful and have been widely disseminated and shared.

The committee found that sharing of software, application of external (i.e., not local) software to data, and application of local software to external data were three important capabilities sought by scientists contemplating useful collaboratory tools. Workshop participants observed that their research would be facilitated by technology that would allow them to call their specialized programs into action
easily and consistently, operate on data retrieved from their own and other data sources, and store results in network-accessible archives.

Sharing of software is becoming increasingly attractive to scientists as the sophistication of software—for example, visualization tools for displaying complex, multidimensional data—grows and as the investment required to develop complicated programs increases. One vehicle for sharing scientific software is the supercomputer centers: the National Center for Supercomputing Applications and the San Diego Supercomputer Center, for example, have become visualization centers—in part because of their major investments in high-performance computing hardware, software, and skills that can be leveraged by scientists in a variety of fields. These centers demonstrate how user/scientists can partner with developer/technologists to the benefit of both. Scientists and funders of research, such as the National Science Foundation, recognize that adapting a tool developed for one discipline (or project) for use in another may be less expensive than developing a new tool from scratch.

Despite its attractiveness, software sharing may be easier said than done. Users of borrowed programs may need to know the specifics of the numeric methods applied in an analysis tool. The proper interpretation of results may be greatly influenced by the choice of computational method and particular implementation of the software on a particular platform. These observations suggest that documenting program functionality, operation, and implementation, at least for programs made available for third-party use, is just as important as documenting the circumstances under which data may have been collected, so as to guide future analysis.
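The documentation the paragraph above calls for—functionality, numeric method, and platform-specific implementation notes attached to a shared tool—could take the form of a small machine-readable record distributed with the program. Every field name and value below is an invented example, not a format from the report:

```python
# Hypothetical documentation record accompanying a shared analysis tool,
# so that a borrower can judge how to interpret its results.
TOOL_RECORD = {
    "name": "current_model",                 # hypothetical program name
    "function": "projects ocean currents for a specified region",
    "numeric_method": "finite-difference integration, fixed time step",
    "platform": "built and tested on a Unix workstation",
    "known_limitations": [
        "results sensitive to grid resolution",
        "single-precision arithmetic on some platforms",
    ],
    "contact": "originating laboratory",
}

def describe(tool):
    """Render the documentation record as text for a prospective borrower."""
    lines = [f"{tool['name']}: {tool['function']}",
             f"  method: {tool['numeric_method']}",
             f"  platform: {tool['platform']}"]
    for lim in tool["known_limitations"]:
        lines.append(f"  caveat: {lim}")
    return "\n".join(lines)

text = describe(TOOL_RECORD)
```

Because the record travels with the program, a collaboratory could index these descriptions the same way it indexes data, letting resource discovery tools locate software as well as files.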
In the absence of adequate funding for information tool development and dissemination, documentation of home-grown tools has tended to be limited or nonexistent, contributing to the tendency for scientists to replicate efforts. By explicitly underwriting the cost of tool development and dissemination, a collaboratory initiative could help to assure that appropriate documentation is developed and made available, as well as support the development of easier-to-use software.

Applying software that may not be collocated with the data of interest is difficult today, because most scientific software is prepared for specific purposes, and the data with which it is used must be carefully formatted to match. Each program has its own calling sequences and user interface; there is little uniformity. For some applications, such as statistical analysis, libraries of subroutines have been developed to perform commonly needed computations, but even when these routines are used, data must be formatted and passed in accordance with conventions or standards established for the library routines. In a collaboratory system, applying remote, network-based software would require a calling convention that supports remote program execution with standards for passing typed objects back and forth. Research and development are needed to create better conventions, which may in turn be adopted as standards.

Some of the capabilities desired by scientists in wide-area, networked environments are in evidence on the smaller, simpler, and local scale of personal computer (PC) systems. Multifunction PC software integrates database, word-processing, spreadsheet, and electronic mail packages; in other cases, PC software is designed for ease of transfer of data between one kind of application and another.
For example, spreadsheet results can be represented in bar or pie chart form and inserted into a multifont, compound text document; spreadsheets may refer to database entries; and text may be merged with the contents of one or more databases. In general, such integrated systems must be developed around common format conventions supporting data exchange or conversion, a process that has proved to be easier in the PC environment than in the more demanding scientific computing environment. The development of multifunction PC software has also benefited from the existence of a large commercial market, and it is likely that commercially developed technology such as video conferencing, "groupware," and computer-supported cooperative work tools will benefit the development of collaboratory systems for software sharing as well.

Implicit, but not explicitly addressed in the workshops, was the notion that improved algorithms would also benefit some scientists. Collaborations between computer scientists and other scientists could advance the state of algorithms applied in scientific research—one of the objectives of the High
Performance Computing and Communications initiative. However, some software is so specific to a given discipline that the scientists involved must develop it or participate heavily in its development.

Controlling Remote Instruments

The value of controlling instruments through a computer network and collecting data regardless of the instruments' location depends on both the inaccessibility of the instruments and the difficulty of collecting data in the given environment. Remote control of instruments and remote data collection are required capabilities for space physicists, for example, who must collect data from distant reaches of space either through ground-based instruments positioned in remote locations, such as the Sondre Stromfjord Observatory in Greenland, or through space-based instruments that may need to be retargeted during the progress of a mission. Collection of data from remote instruments is also of major importance to physical oceanographers, whose data-gathering buoys and moorings are widely dispersed across the entire ocean surface and are visited only infrequently by research ships. In oceanography, the ability to better perform real-time reading of remote instruments would save time and money and would likely result in a much greater volume of higher-quality data being collected. Remote control of instruments and collection of data from remote instruments are considered relatively unimportant for genome mapping and sequencing at this time, although that situation could change.

Remote instrument control requires reliable networking with sufficient speed to support interactive responses.
Controlling an instrument across a network requires a method for capturing the output of the instrument and transmitting it to the researcher who is directing the instrument, and a method for gathering user commands and transmitting them as inputs to the instrument, so that, for example, the instrument can make a requested change in its data-gathering procedure. Assuming that the instrument can accept commands and respond to them to achieve the desired results, remote control can be implemented with an appropriate network connection. Consequently, there will be an increasing need for standards for telemetry and for remote control of instruments. The space science community is pursuing the development of standards for remote instrument control through the international Consultative Committee for Space Data Systems (CCSDS), which includes representatives from the space agencies of 26 countries, including the United States, countries in the European Community, and Japan. The CCSDS has also been responsible for developing common standards for space-ground links, multiplexing of high- and low-speed data streams, and data formatting and authentication.

It is apparent that scientists are making increasing use of remotely controlled instruments in ways not possible before the advent of computer networking. One such instrument is an Internet-based electron microscope that delivers high-quality imagery over the network and in return allows remote positioning of the viewing stage (Box 5.1). Similarly, some telescopes have been outfitted with network connections, so that they can deliver digitized views, captured in high-resolution charge-coupled-device arrays, essentially anywhere accessible by the Internet. Other sharable instruments could include particle accelerators and colliders, radio telescopes, various satellite or space-platform instruments, autonomous underwater vehicles, pilotless aircraft, and autonomous land rovers.
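The two channels that remote instrument control requires—user commands flowing to the instrument, captured output flowing back—can be sketched with a simulated instrument. Here two in-process queues stand in for the network connection, and the command names are invented for illustration; a real system would also need the reliable, sufficiently fast networking the chapter describes:

```python
import queue

commands = queue.Queue()   # researcher -> instrument
telemetry = queue.Queue()  # instrument -> researcher

class SimulatedInstrument:
    """Toy instrument that accepts commands and reports back."""
    def __init__(self):
        self.sample_rate = 1

    def service(self):
        """Accept one pending command, act on it, and acknowledge."""
        try:
            cmd, value = commands.get_nowait()
        except queue.Empty:
            return
        if cmd == "set_sample_rate":
            # The requested change in the data-gathering procedure.
            self.sample_rate = value
        telemetry.put(("ack", cmd, self.sample_rate))

instrument = SimulatedInstrument()
commands.put(("set_sample_rate", 10))   # researcher issues a command
instrument.service()                    # instrument side processes it
reply = telemetry.get()                 # researcher receives the acknowledgment
```

Replacing the queues with network sockets, and the ad hoc command tuples with a standard telemetry format such as those pursued by the CCSDS, turns this loop into the architecture the chapter describes.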
Supporting User Interaction

Collaboration requires cooperation, and cooperation implies communication. An essential component of a collaboratory is the capability to support user interaction ranging from immediate, real-time, face-to-face discussion to deferred informal messaging, and even formal exchange of refereed papers. The support of interpersonal interaction among a group of collaborators may be the most challenging aspect of collaboratory construction: it not only involves potentially all of the technical features of systems to access remote data, programs, and instruments, as well as multimedia work-group communication systems, but also requires an understanding of the complexities and vicissitudes of human behavior.

BOX 5.1 DIAL-A-MICROSCOPE

"Want to access an electron microscope without leaving the comfort of your own office? Soon you may be able to log onto a computer network and start collecting images ... on the 400,000-volt electron microscope at the University of California, San Diego.

"The concept is called the Microscopist's Workstation, which ... enables a researcher with a computer workstation and access to Internet or NSFnet to control the microscope in real time. (A technician prepares the samples and puts them under the lens.) Project leader Mark Ellisman, a neuroscientist at the University of California, San Diego, unveiled his project at SIGGraph 92, an international conference on computer graphics held ... in Chicago. One session, which included Ellisman's and 34 other projects, focused on how high-speed computer networks might bring the lab to the scientists: In addition to hopping on the La Jolla scope, attendees previewed hookups that would allow scientists to use their computers to walk through the internal organs of a 7-week-old embryo, go on a tumor safari in a human brain, and interact with a developing thunderstorm, with all the images generated on a remote supercomputer. Ellisman says, '... This is just the first step in a long-term collaboration ... to create a distributed laboratory that will make expensive national resources more widely available to the U.S. community.'"

SOURCE: "Dial-a-Microscope," Science (21 August 1992) 257:1048.
The most widely available technologies that support user interaction are electronic mail and facsimile transmission, which support asynchronous communication. Other applications include bulletin board systems, computer conferencing systems, file and document storage and retrieval systems, and the relatively new "groupware," which is now emerging in the commercial marketplace but is still strongly proprietary in nature and therefore not yet a good basis for interoperability or standardization. These systems have the advantage that they do not require the communicating parties to be linked simultaneously, thus overcoming the problems of geographic separation and time-zone differences. Anyone who has experienced either "telephone tag" or the trials of close collaboration with a colleague many time zones away can appreciate the utility of these asynchronous communication tools.

Recently, multimedia electronic messaging applications have been developed that allow complex documents, imagery, sound, and video information to be incorporated into messages to enrich the quality of the deferred communication. Technical problems still remain, however, in providing features such as automatic document format conversion between word processing systems. Even if technical solutions can be found, widespread deployment may be slow in coming, since people are often reluctant to adopt new tools if the ones they are accustomed to seem to be serving them satisfactorily.

Video conferencing has been available for some years, but most such systems have required that users go to special conferencing centers to make use of cameras, monitors, and special communications equipment. Moreover, these systems worked in either broadcast mode or two-way, two-site interactive mode. N-way multiple-party video conferencing is more difficult to support. Nevertheless, a variety of services have been available commercially, using the telephone system for transmission.
Experimental and quasi-operational systems have been built to support video conferencing in a data network environment. The (Defense) Advanced Research Projects Agency, for instance, has been using an experimental packet-switched video-audio conferencing system on its wideband network for a number of years and recently transferred it to the Defense Information Systems Agency for more operational use. Recent experiments with packet-switched video and audio on high-performance workstations indicate that desktop conferencing is possible. This new technology, called multicast, has been developed and tested on the Internet. Multicast uses a TCP/IP packet-switched network to deliver copies of the video-audio packets to terminals on the network that have been temporarily designated to be part of the multicast conference through the use of special addresses. Once the multicast conference is over, the terminal resumes using its standard network address.

Interactive video may prove to be very important not only for conferencing, but also to support remote viewing of experimental procedures. Currently deployed research networks such as NSFnet do not have sufficient transmission and switching capacity to support very much video conferencing (although the more advanced National Research and Education Network (NREN) program contemplates such capabilities), and users may need special hardware to digitize and compress the video information before sending it on the data network. Relevant research and development are already under way at nearly all of the major workstation vendors.

More sophisticated and potentially more useful will be "shared workspaces," which would enable remote conversation with simultaneous joint viewing or other remote interactions. In the most general sense, shared electronic workspaces would mimic a complete physical research environment, with all of the data, software, and instrument control available to all parties within the context of a discussion. The coordinated data analysis workshops (CDAWs), discussed in Chapter 2, are an excellent example of a physically shared workspace. One can imagine a "virtual" CDAW in which many scientists participating from various geographic locations could interact with all sources, including all the participants and their data.
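The multicast mechanism described earlier on this page—a terminal temporarily subscribing to a special group address for the duration of a conference, then resuming ordinary operation—can be sketched with the standard socket interface. The group address and port below are illustrative; a real conferencing tool would layer audio-video framing on top of this:

```python
import socket
import struct

GROUP = "224.1.1.1"   # example class D (multicast) group address
PORT = 50000          # example conference port

def join_conference():
    """Open a UDP socket and subscribe it to the conference's group address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    # ip_mreq: group address + local interface; asks the kernel to deliver
    # packets addressed to the group while the conference lasts.
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock

def leave_conference(sock):
    """Drop the group subscription; the terminal resumes unicast-only operation."""
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP, mreq)
    sock.close()
```

The key property is that a sender transmits each packet once to the group address, and the network delivers copies to every subscribed terminal, which is what makes N-way conferencing tractable compared with opening a separate connection per participant.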
Before such sophisticated computer-supported cooperative work environments can be realized, however, major technological advances must be made in areas ranging from wide-area network caching to multiuser synchronization. More immediate modes of group interaction can be supported with existing telephone and audio conferencing technology and desktop video conferencing using specially equipped terminals with video cameras, often coupled with facsimile transmission or other deferred-communication tools to provide a context for real-time discussions. In addition, some tools for supporting group work and interaction—such as shared editors with synchronized displays linking multiple authors, or multiplayer games that allow each player to view the real-time movements of the others—are already available as research prototypes and introductory products from commercial vendors. However, such technologies are not necessarily broadly applicable to the data-intensive demands of scientific collaboratories.

Although most of the commercially promising computer-supported cooperative work technology does attempt to solve the problems associated with communicating voice, text, images, and data across networks (and in most cases provides joint authoring tools), it does not address remote control of instruments, remote collection of data, accessing of archived data, or resource discovery software. Computer-supported cooperative work technologies designed for commercial application are often implemented over private corporate networks operating with protocols and software environments that differ from those common in government-supported research networks. However, research that supports the development of commercial collaboration technology, and the products that result, will be of great interest and will likely aid the development of collaboration technology and collaboratories for science.
Achieving Transparency

It is highly desirable that the architecture of a collaboratory system be transparent, i.e., that it allow scientists to treat all the different databases, programs, instruments, and participants conceptually as being part of a single system by using a uniform set of commands accessible from their desktops or laboratory benches. To achieve this, the system must hide all its real-world variability internally. Prototype collaboratories discussed in the workshops (the Worm Community System and the Sondre Stromfjord testbed) demonstrate convincingly that it is technically feasible to implement and deploy such an architecture, at least on a relatively small scale.
Achieving truly transparent data access from federated databases will require research and development. Some technology for federating diverse data sources is available in research prototypes, such as the relatively small and focused Worm Community System, but providing these capabilities on the much larger scale that appears to be required for many scientists will require a major effort. The effort to develop Knowledge Robots (Knowbots™), which are essentially network-mobile intelligent agents, involves substantive research on distributed systems architecture and control, authentication, security, semantic representations, and a host of other problems in computer science (Kahn and Cerf, 1988).

Achieving database transparency is just one application of the much more general idea of cooperating intelligent agents working together over a computer-communications network. In the case of remotely controllable instruments, programmed intelligence is needed at the instrument site to accept control and prepare and deliver captured data. The more general case is that a set of programs distributed around the network must cooperate to achieve a particular objective. Distributed database systems, distributed processing systems, networked computing systems, and digital libraries are all instances in which cooperating intelligent agents could be and often are applied. The excitement in the computing community over distributed, object-oriented, client-server systems results, in part, from recognition that these systems may help to break the constraints of time and distance in the conduct of scientific and other research (Wiederhold, 1992). Building collaboratories may be one of the most powerful ways of applying these new computing ideas in support of science.
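The "uniform set of commands" idea behind transparency can be illustrated with a small dispatch sketch: the scientist issues the same style of request whether the work happens in a remote database, a program, or an instrument, and the system routes it to the appropriate agent. The resource kinds, command names, and handlers below are invented for illustration:

```python
class Collaboratory:
    """Uniform front end hiding which back-end agent does the work."""
    def __init__(self):
        self._handlers = {}

    def register(self, kind, handler):
        # A back-end agent (database, program, instrument) joins the system.
        self._handlers[kind] = handler

    def run(self, kind, command, **args):
        """Single entry point; the caller never sees where the work happens."""
        return self._handlers[kind](command, **args)

# Stand-ins for a remote database agent and a remote instrument agent.
def database_agent(command, **args):
    if command == "query":
        return [f"record matching {args['term']}"]

def instrument_agent(command, **args):
    if command == "point":
        return f"stage moved to {args['x']},{args['y']}"

lab = Collaboratory()
lab.register("database", database_agent)
lab.register("microscope", instrument_agent)

# Same command style for very different resources.
records = lab.run("database", "query", term="unc-22")
status = lab.run("microscope", "point", x=10, y=20)
```

All of the real-world variability—network protocols, data formats, instrument telemetry—would live inside the registered agents, which is exactly where the chapter locates the hard research problems.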
ACKNOWLEDGING CONTEXT: SOCIAL AND INSTITUTIONAL CONSIDERATIONS

In addition to the technical expertise required to construct collaboratories, basic social and institutional factors must be examined and dealt with effectively to achieve successful scientific collaboration. Without sufficient attention to these potential constraints, even the best collaborative information technology will have little positive impact on the working lives of scientists. Among the many issues to be addressed are the willingness and ability of individuals and institutions to engage in large-scale efforts of the kind needed to build useful collaboratories. Underlying these issues are questions about motivations for collaborating, the prospects for achieving a working partnership between computer scientists and other scientists, and the perceived trade-offs between the rewards and risks, financial and otherwise, of participating in collaboratories. These concerns must all be factored into the design and selection of collaboratory efforts.

Issues for Individual Scientists

To use and build collaboration tools and systems, individual scientists—both the users and developers of technology—must have the motivation to do so. Of particular concern is the perceived lack of opportunity for career advancement associated with electronic data sharing and collaboration. Further, collaboratories must be designed so that the rewards of electronic collaboration and data sharing outweigh the risks. Current career paths may not gracefully accommodate scientists working in and building collaboratories. In the physical sciences and in computer science as well, fame accrues largely to the development of theories or ideas, and not to the development of tools or the gathering of data for others to use.
Given long-standing traditions of individual achievement, and the more recent increase in competition for limited resources, collaboration is viewed warily by many scientists because neither collaboration in itself nor the facilitation of collaboration represents a direct means to gain acclaim, respect within the scientific community, or funding for research. Further, the objectives of individual
scientists may conflict with larger organizational and societal objectives, which may also conflict with each other. Thus, abstract arguments about such benefits of collaboration as better handling of complexity, scale, and/or interdisciplinary research problems may not lead easily to changes in individual behavior. Similarly, top-down mandates for collaboration are not likely to be productive as long as scientists continue to be rewarded almost exclusively for publishing results of self-initiated research, having their work cited in the literature, and otherwise becoming distinguished as individuals. From a societal perspective, science advances through extensive, timely sharing of data (see Box 1.1). But to advance as individuals, scientists generally must use their own data to the fullest extent possible before sharing them with others. In addition, scientists need to be comfortable that data generated by others are of high quality and that any peculiarities associated with the data themselves or their collection are identified and understood. Given such constraints, it can be difficult for scientists to openly share data in recognition of a communal interest—that their community benefits from access to as much good data as possible. Technology cannot solve these problems, but technology can be designed to mitigate them by making more data easier to use by more scientists. By facilitating the broader use of data collected by individual scientists, technology can enable research sponsors and the scientific community to leverage individual data collection investments. For this to happen, appropriate policies and rewards need to be established that support scientists who share data.
For example, NASA-sponsored projects have well-defined "rules of the road" outlining the obligations of researchers with respect to use of spacecraft mission data (Appendix C), and NOAA has a similar set of guidelines for ocean-craft missions. Another approach to rewarding and thus encouraging data sharing is to grant appropriate recognition for a contribution to an electronic archive or database, perhaps much in the way that a publication would be recognized. For sciences with more distributed data collection, such as genome research, publishers of journal literature may require that supporting data be deposited in the archives before articles referencing them can be published. This approach ties the additional reward of publication to data sharing, and sequences are now routinely submitted to databases directly from molecular biology departments soon after discovery. Nevertheless, the performance and management of electronic data archives must be trusted by scientists if such archives are to be used effectively. This implies a need for quality assurance and security (especially data integrity) mechanisms, some of which are procedural and some of which may involve the use of computer technology.8 A third way to reward data sharing is for the scientific community and the funding agencies to explicitly support scientists who analyze or reanalyze existing data. At the same time, information scientists or collaboratory builders have their own careers to manage, and they face, in their own context, reward and advancement issues parallel to those confronting other scientists. Traditionally, the prestige in science has gone not to the "technician" who develops a significant new tool but to the "scientist" who uses the tool to discover a significant new phenomenon (although sometimes these have been the same person).
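One of the data-integrity mechanisms mentioned above can be sketched in miniature: an archive that refuses a deposit whose checksum does not match its contents. The `DataArchive` class and its interface are hypothetical illustrations, not any archive discussed here; real archives would combine such technical checks with the procedural curation described in note 8.

```python
import hashlib

class DataArchive:
    """Toy archive: a deposit must carry a checksum that matches its
    contents before it is accepted (one simple integrity mechanism)."""

    def __init__(self):
        self.records = {}

    def deposit(self, accession_id, payload, claimed_sha256):
        actual = hashlib.sha256(payload).hexdigest()
        if actual != claimed_sha256:
            return False  # reject: contents corrupted or mislabeled in transit
        self.records[accession_id] = payload
        return True

archive = DataArchive()
data = b"ATGGCCATTG"  # e.g., a sequence submission
ok = archive.deposit("X001", data, hashlib.sha256(data).hexdigest())
bad = archive.deposit("X002", data, "0" * 64)  # mismatched checksum is refused
```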
Furthermore, systems developers often have difficulty gaining tenure in academic computer science departments today; accordingly, they tend to gravitate toward industry.9 Yet systems developers participating in building a collaboratory could find in the development of such technology a means for demonstrating to their own departments the scientific merits of these complex systems. In some circumstances, collaboratories have the potential for creating a positive and reinforcing sense of community that increases as the scale, quality, and scope of the shared information in the collaboratory grow. Perhaps analogous to constructing a major, landmark building or to conducting a national project such as the space program of the 1960s, participation in the development and use of a collaboratory may confer a sense of shared purpose, teamwork, and community that can become self-sustaining. Other second-order effects are also suggested by the experiences of scientific communities that already make extensive use of networking technology. The committee emphasizes that these are early effects because today's technology is primitive compared to what it envisions. Distributed groups
supported by technology can assemble expertise independent of the physical location of the scientists who possess that expertise. Network-based communication changes the character of informal exchange that scientists use to help them make sense of their work. The tips and techniques necessary to make experimental apparatus and data sets work as advertised need not be communicated face to face but can be shared electronically among broad communities of interest. Such sharing is quite common in electronic special-interest groups. Collaboratories may lead to the creation of new electronic organizations just as the Arpanet and later the Internet led to the creation of large electronic groups. Many of the Internet groups are extraordinarily lively, with their own unique community identity and practices. However, with some notable exceptions, most of these do not produce any joint product of lasting economic or intellectual value. Their primary output is usually discussion. In light of the importance of participants' motivations to achieving success, the first collaboratories should be developed with groups of scientists already predisposed to collaborate and to use collaboration technology. Biologists sequencing and mapping the genome of the nematode worm Caenorhabditis elegans, space physicists participating in CDAWs, and oceanographers involved in the Tropical Ocean-Global Atmosphere (TOGA) program illustrate what can be achieved when scientists themselves initiate collaboratory efforts.

Costs for Individual Scientists of Using Computer-based Collaboration Technology

Designers of collaboratories must recognize the costs and risks, as well as the benefits, associated with sharing data. Even for scientists who already see how computer-based collaboration technology can advance their work, the choice to use it (assuming it exists) is not a costless one.
Based on the social history of computing to date, several kinds of costs need to be considered.

Incompatibility/critical mass of tasks and people. Unless one's entire world is on-line, there will be inconveniences of switching from one medium to another. For example, if distributed project group members can share manuscript files that include data tables but not line drawings, or line drawings but not halftones, then at various points during the process of manuscript preparation some people will be denied access to the process.

Economic costs. The history of organizational computing suggests that people continually underestimate the costs of operating, maintaining, and upgrading computer technology. These systems will require human support as well as capital and operating resources. If funds for direct personnel support are not forthcoming, they will show up as a tax on the time of scientific personnel. Doctoral students or postdoctoral researchers who are supposed to be doing science may end up spending a substantial fraction of their time doing technology support.

Dependence and vulnerability. In addition to choosing and developing features that are responsive to scientists' personal and professional concerns about data sharing, collaboratory designers must also recognize that collaboratory performance overall will cause concern. The more one comes to depend on these technologies, the greater one's vulnerability when they break. At a minimum, technology failure engenders frustration. In some organizations today that are highly dependent on their internal networks, employees report that work absolutely stops when the network goes down. More seriously, technology failure may lead to irrevocable loss of data or work. Scientists will rightly have limited patience with experimental systems that may be unreliable; assurances as to system integrity and reliability will be needed. Ease of use and flexibility are other important systems dimensions.
Other costs relate to education, as discussed below. Even "easy" systems can be surprisingly hard to learn and use. These will not be easy systems. Nor will they be stable ones. Time invested in learning today's system features will not eliminate the need to spend time learning tomorrow's improvements.

Education and Training

Scientists and technologists must learn the necessary skills to use and to build collaboration tools and systems. Effective use of new technologies and facility in making a transition from old to new techniques typically require special training. This will certainly be the case for collaboratories. Both in research and commercial settings where collaboration technology has been introduced, training has been an important factor in the success of the implementations. Even experience with the Internet, which has not been particularly user-friendly, indicates that those who make the greatest use of the Internet have had to devote time and effort to learning how to do so. In the long term, collaboration technology and its use for remote interaction with data, software, instruments, and colleagues will become commonplace for research scientists, as has the use of personal computers and workstations. Until then, an investment of time, effort, and resources for training new students and practicing scientists should be considered a part of the process of launching collaboratories. Appropriate training might be provided in the form of predoctoral or postdoctoral fellowship programs, summer studies, visiting professorships, and research group exchanges.10 The network itself may become the medium for delivering education and training to dispersed groups of scientists. Given the existing demands on scientists' time, the training burden must be minimized through the design of easy-to-use and easy-to-learn systems.
To develop genuinely useful technology, it will be necessary to train technical experts who can work closely with user/scientists. One approach is to create interdisciplinary programs of the kind started at Rice University (Box 5.2) that combine instruction in a specific scientific discipline with training and hands-on experience in computer and information science. The emergence of computational and mathematical biology as a subdiscipline provides additional insight into the need for special training. This example is explored more fully in Appendix D.

User-Developer Partnerships

A partnership between computer scientists and engineers, on the one hand, and scientists who recognize a need for better computing and communications capabilities, on the other, should provide intellectual and material benefits to both parties. The uneven support for computer-related infrastructure is a principal reason that scientists have often had to develop their own infrastructure, software tools, and applications. These systems and tools are often ingenious in their application of computing technology to science. However, designing and building tools may divert scientists from their primary area of research and may yield tools and systems that are less useful than the scientists would like. Investigators at the frontiers of knowledge must be intimately involved in the design and definition of the tools they need to do their research. But if they can work in collaboration with skilled system builders, rather than act as their own programmers or programmer managers, they should be better served by the resulting tools. Furthermore, system builders may be better positioned to design and build generalizable tools that can subsequently be used by other scientists as well. Due to the specific needs of the scientists involved and the requirements of their science, the user-developer partnership will vary for different collaboratories.
For some collaboratories it may be desirable for academic scientists to partner with industry to share resources and knowledge, and to facilitate technology transfer. For other collaboratories, computer engineers and software designers from the computer science research community may be essential. Although the mix of skills, talent, and training will thus need to be tailored
BOX 5.2 THE COMPUTATIONAL SCIENCE AND ENGINEERING GRADUATE DEGREE PROGRAM AT RICE UNIVERSITY

Rice University's Computational Science and Engineering (CSE) Graduate Degree Program is designed to provide interdisciplinary research and education in scientific computing.

The Master's Degree Program

The intent of the master's degree program is to graduate professional experts in scientific computing who will be able to work as technical specialists within an interdisciplinary research team. Degree candidates will be offered training in the use of state-of-the-art numerical methods, high-performance computer architectures, and software development tools for parallel and vector computers; application of these techniques to at least one scientific or engineering area; a curriculum consisting of topics from computer science, computational and applied mathematics, and a selected application area; and hands-on experience with leading-edge parallel supercomputers.

The Ph.D. Degree Program

The Ph.D. degree program offers the same interdisciplinary approach as the master's degree program but with greater specialization. An original thesis and, in addition, the completion of either an advanced schedule of courses or a computational project in an application area other than computer science or computational and applied mathematics, are required for the Ph.D. program.

Participating Departments
• Biochemistry and Cell Biology (expected)
• Chemical Engineering
• Computational and Applied Mathematics
• Computer Science
• Electrical and Computer Engineering
• Statistics (expected)

Admission

Students must be admitted into one of the academic departments listed above to be considered for participation in the CSE graduate degree program.
The student participates as a graduate student within that department in every way except that the curriculum and examination requirements will be set by guidelines for the CSE graduate degree program.

SOURCE: Theresa Chatman, Center for Research on Parallel Computation, Rice University, Houston, Texas.

to meet the requirements of specific programs, the goal is to have scientists and technologists working together to develop the collaboratory infrastructure so that all may benefit equally. Variations in the approach to a partnership should be seen as a strength, because adapting a collaboratory program to meet specific conditions and needs of a group of scientists will be essential to its success. User-developer partnerships will present fundamental tensions that will have to be recognized and addressed. The primary interest of scientists will be the practice of their own science, whereas the primary interest of computer scientists will be in system design, prototype systems, and theoretical studies in support of system building. Without strong countervailing incentives, many scientists will be reluctant to invest too much of their time trying to use prototype systems that may be viewed as computer science experiments. Indeed, computer scientists will have to ensure a satisfactory minimum level of performance for prototype systems that will affect other scientists' work and careers. Such issues have been faced and overcome before in the context of the Internet, in the development of expert systems for medicine such
as DENDRAL, in Stanford's SUMEX-AIM project,11 and elsewhere, but reminders and reassurances on both sides of the partnership are likely to be necessary elements of the formation of a collaboratory program. All of this will require extra effort on the part of scientists and technologists. However, as demonstrated by pioneering efforts such as SUMEX-AIM or the more contemporary cases identified in the CSTB workshops, such extra effort can bring handsome rewards.

Issues for Individual Institutions

"Science regardless of distance" will not be cost-free to the institutions within which science is managed, funded, transmitted, and legitimized. Institutions must learn how to support their scientists who are working in collaboratories. It must also be recognized, however, that institutions may not, absent other changes, be willing or able to provide such support. The role of institutions will depend in part on the nature of available public infrastructure and the support they provide for such infrastructure. In enabling the creation of new working relationships among users, collaboratories will offer both opportunities and problems for the institutions in which individual scientists are based. They are expected to result in new forms of organization that are dispersed in space and time and that may substantially affect the conduct and progress of science. Further, collaboratories offer the opportunity to produce electronic products and services with economic and intellectual value. Issues related to intellectual property rights will undoubtedly arise as collaboratories become widely used in academic and industrial laboratory settings. It is impossible to say how the legal and economic issues associated with the work done through collaboratories will evolve, but it is clear that use and development of collaboratories must take these issues into account.
Providing Local Infrastructure Support

Today's division of responsibility and labor for supporting computing and communications infrastructure for science grew out of yesterday's technology and is demonstrably inadequate as technology rapidly changes. This inadequacy is highlighted by the question, Who pays for what? For example, research grants may pay for workstations in a laboratory but not for connections to the campus network. Campus communications organizations may pay for wiring between, but not within, buildings. The federal effort has built a national backbone network for research, but the NSFNET backbone per se excludes local network domains within universities. A university library may pay for journal subscriptions but not for searches of on-line literature databases. In oceanography, research grants may pay for electronic mail services; in space physics, they may not. For an infrastructure to function, all the components must be adequately funded and supported; if any are missing, the entire system is weakened and can fail. Although the development of a national information infrastructure, for researchers and educators (e.g., through the NREN) and for the economy as a whole, is attracting attention to nationwide connectivity, it is essential also to provide for the local components—the access points, the support and maintenance capabilities and costs, local user training, and provision at each site of reference and tutorial documentation. Having a particular technology does not guarantee that it can be used effectively. Providing support extends beyond building or purchasing networks and tools to ensuring the availability of people skilled in organizing and providing support services.
Such services may range from maintaining help desks and hot lines to providing information management support that will save researchers time and effort by helping them to set up workstations, connect to the network, install the software, use the features, update the databases, and so on. For information infrastructure to become widespread and institutionalized, there must be adequate and standard funding for support and maintenance.
Discussions during the three workshops conducted for this project underscored the value of technicians and other paid professionals and paraprofessionals who provide essential technical support services but who, under current conditions and despite the best of qualifications or contributions, have a secondary status in the projects they support and often in the academic career structure of the department in which they work.12 Workshop participants acknowledged a lack of financial incentives and career advancement opportunities as an impediment to attracting and keeping talented support personnel. This is an issue for the management of institutions, and also for the proposed collaboratories, which would involve both the conventional range of support personnel and, also complementing the domain scientists, computer scientists and engineers who, as professionals in their own right, would expect to function on a par with their scientist partners.

Managing the Results of Increased Interaction

Using collaboration technology, individual scientists will have the capability to form new working relationships independent of their physical institutional homes. While these relationships may energize scientists and lead to new discoveries, they may also complicate the role of managers in scientists' home organizations who are charged with keeping track of scientific manpower and resources. Issues of control and accountability may arise for organizations whose members can form new collaborations or access new resources at will (Box 5.3). Although institutions may already have mechanisms for overseeing or engaging in projects involving many institutions, these mechanisms may be too cumbersome for the kinds of fluid, evolving relationships that will be possible in collaboratories.
BOX 5.3 FACTORING THE NINTH FERMAT NUMBER

"Two mathematicians employed by Bell Communications Research (Bellcore) and Digital Equipment Corporation used electronic mail to recruit [computing resources from] several hundred researchers from companies, universities, and government laboratories around the world. They asked them to work on solving a large and important mathematical problem, one with practical implications for cryptography. Researchers who volunteered to help were sent a piece of the problem and returned their solutions by electronic mail. All of the partial solutions were then used to construct the final solution. The electronic message announcing the final results contained a charming admission: the two mathematicians who organized the work and constructed the final solution from the pieces returned to them did not even know the names of all of the people who helped them:

We'd like to thank everyone who contributed computing cycles to this project, but I can't: we only have records of the person at each site who installed and managed the code. If you helped us, we'd be delighted to hear from you; please send us your name as you would like it to appear in the final version of the paper. (Manasse, 1990)

[This case highlights some of the limitations in conventional thinking about organization and management that may become apparent when networked organizations become more common.] ... Typically managers influence their subordinates in large measure by allocating resources to their projects and allocating credit (or blame) to their accomplishments. How will the manager's role in resource allocation change when people can reach out across the network and directly solicit resources from others to help them with their work? How will the manager's role in allocating credit or blame change when managers do not know, and perhaps cannot know, who contributed in what ways to accomplishments?"

SOURCE: Sproull and Kiesler (1991), pp. 160-161.
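The split-and-combine pattern the box describes — pieces of a search distributed to volunteers, partial results returned and merged — can be sketched as follows. The sketch is hypothetical: simple trial division stands in for the far more sophisticated factoring algorithm actually used, and the "workers" run locally rather than being recruited over the network.

```python
def worker(n, lo, hi):
    """Search one assigned sub-range for divisors of n (a worker's 'piece')."""
    return [d for d in range(max(2, lo), hi) if n % d == 0]

def coordinator(n, n_workers=4, limit=None):
    """Split the search space into pieces, farm them out, and combine
    the partial results into the final answer. In the Fermat-number
    effort the pieces traveled by electronic mail."""
    limit = limit or n
    step = (limit + n_workers - 1) // n_workers  # ceiling division
    partials = [worker(n, i * step, min((i + 1) * step, limit))
                for i in range(n_workers)]
    return sorted(d for part in partials for d in part)

divisors = coordinator(91, n_workers=3)  # 91 = 7 * 13
```

The organizational point survives the simplification: the coordinator never needs to know who ran each piece, only that the pieces cover the search space and that the partial results compose.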
Collaboratories, particularly by means of electronic-mail discussion groups, can foster the discovery of potential cooperative or collaborative partners. As a consequence of day-to-day interaction in an electronic environment, participants who might not normally meet face to face may learn of mutual interests that may lead to direct contact, laying the groundwork for more extensive cooperation or collaboration. Anecdotal evidence supports this view. In the early years of the Arpanet project sponsored by (D)ARPA, it was thought that the use of networked electronic mail would reduce the need for travel, but it later became apparent that travel budgets had actually increased substantially. One reason was that electronic mail allowed more people to interact directly and also allowed geographically larger projects to be managed; such projects became cost-effective as well as feasible because of facilitation via electronic mail. When the participants did get together, there were more of them and they typically traveled longer distances.

ISSUES IN FUNDING FOR COLLABORATORY INFRASTRUCTURE

The major High Performance Computing and Communications (HPCC) program recently initiated by the federal government provides a favorable context in which to consider a serious effort to develop collaboratories. Aimed at attacking many of the "grand challenges" of science (Federal Coordinating Council for Science, Engineering, and Technology, 1992), the program promotes the concept that HPCC technologies are fundamental to progress in many areas of science, especially advances in computational science, and articulates the vision of a national scientific information infrastructure. Collaboratories complement and build upon HPCC activities and technologies.
Specifically, a collaboratory program, while distinct from the HPCC initiative, would drive research in advanced software technology, tools, and especially scientific applications of the NSFNET, NSInet, and other constituents of the Internet engendered under the NREN program. Even in rudimentary forms, collaboratories make the existing nationwide research networks more useful. Collaboratories can thus leverage investments in HPCC technologies and applications, and they can assist in the establishment of a national information infrastructure for science. Scientists' needs for computing and communications support have not been met in part because of an absence of mechanisms for funding information infrastructure development.13 While funding for infrastructure has been inadequate at the program level in most funding agencies, the greatest obstacle is structural. Scientific funding agencies are organized along discipline or mission lines. Although units within these agencies recognize the need to support major instruments and facilities unique to a field or mission, no single unit is responsible for funding general computing and communications infrastructure within a given field. Only NSF has a mission that supports basic science across all fields, but its resources for infrastructure are limited. Precisely because infrastructure is useful to everyone, it seems to be the responsibility of no one in the research funding structure. A consequence is that if program officers set aside resources for infrastructure, they do so at the apparent expense of research in their discipline. Faced with this dilemma, an individual program officer finds it almost impossible to make the decision in favor of infrastructure—even if that decision is in the long-term interest of the field. 
This structural barrier to providing adequate ongoing funding for infrastructure must be overcome if the country is to create, maintain, and benefit from the premier information infrastructure that many now want to build.14 Funding a collaboratory program does not have to imply taking money away from individual research grants. First, in recognition of the problems that many scientists have in collaborating under any circumstances and the added problems anticipated from broadening collaboration to include computer scientists and engineers along with domain scientists, the committee envisions that collaboratory projects will be selected from the bottom up. That is, they will be launched in response to inquiries and efforts by groups of scientists who recognize a need to collaborate and who manifest an interest in applying more and better information technology. This pattern has been successful, for example, in biology research
projects associated with Los Alamos and Brookhaven National Laboratories. Second, since collaboration and the use of information technology are already elements of existing or planned projects (e.g., the HPCC program, the Sondre Stromfjord Observatory, the Solar-Terrestrial Energy Program, the World Ocean Circulation Experiment, and CDAWs), collaboratories could be incorporated into such projects more economically than starting ab initio. Third, economies from easier data sharing would free up resources otherwise spent on duplicative efforts to gather or store data. The committee is sensitive to a primary concern of many scientists that already-tight money for research not be diverted. However, it wishes to emphasize that collaboratories are most likely to succeed in those areas where research is already being transformed into an activity that cannot be done without the cooperative application of information technology. In those areas, investment in collaboratory projects, through conscious development and use of technology, should have as a payoff the more efficient and effective use of research resources.

MAKING A START

Information technology to support collaboration will not impel collaboration, but it can enable it. Scientists will use that technology if they believe it will advance their own work, and if they perceive the benefits to their work as outweighing the costs of using the technology. The likeliest cases of net benefit are those in which scientists who have already chosen to collaborate can be supplied with technology that makes it easier—in terms of cost in time and effort—for them to do what they are already doing. Moreover, some scientists will recognize that collaboratories, in some instances, can enable better research and greater exploration of specific topics. Efforts to date have hardly scratched the surface of the potential that collaboratories offer.
The most significant results will be achieved with sustained and focused efforts to develop valuable shared databases and digital libraries of specific scientific content; to develop collaboration tools for particular scientific purposes; and to put into operation institutionalized pilot collaboratory programs, building on rapidly evolving technology trends. The technology base for such an effort is ready, the programmatic environment is especially auspicious, the computer and communications research community is particularly interested in making this a high-priority item, and the national interest will be well served by developing leading-edge information infrastructure that can serve not only the scientific community but, eventually, the business sector as well. We now have sufficient knowledge to substantially improve the functionality of collaboration technology, if only the scale of the efforts can be made adequate to the task of demonstrating their effective utility.

NOTES

1. Archie—Index of FTP archives and file fetcher. "Archie.doc", firstname.lastname@example.org. Author(s): Archie—Alan Emtage, Peter Deutsch, BillWhelan@cs.mcgill.ca; Prospero—CliffordNeuman@isi.edu; Archie client—BrendanKehoe@cygnus.com.

2. Gopher—Distributed document delivery service. "Gopher/doc/client.doc & server.doc", email@example.com. Author: The Internet Gopher Team, University of Minnesota, Minneapolis.

3. World-Wide Web—Distributed hypertext information server. "The World Wide Web", firstname.lastname@example.org. Author: Tim Berners-Lee, World Wide Web Project, CERN, 1211 Geneva, Switzerland.

4. Wide Area Information Service—Distributed text search and retrieval service. "WAIS Overview", email@example.com. Authors: Harry Morris, Brewster Kahle, Jonathan Goldman, Thinking Machines Corp.

5.
For example, the Worm Community System uses a representation called an information space, in which each item of data is transformed into a uniform object called a unit of information and there is a standard representation for describing links between information units.
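An information space of this kind can be sketched as a simple data model: every data item is wrapped in a uniform unit, and links between units share one standard form. The class and field names below are illustrative assumptions for this sketch, not the Worm Community System's actual schema.

```python
# Sketch of an "information space": each data item becomes a uniform
# unit of information, and links between units use one standard format.
from dataclasses import dataclass, field

@dataclass
class InfoUnit:
    uid: str       # unique identifier within the information space
    kind: str      # e.g., "sequence", "paper", "strain" (hypothetical kinds)
    content: dict  # the wrapped data item itself
    links: list = field(default_factory=list)  # (relation, target uid) pairs

    def link(self, relation, target):
        """Record a typed link to another unit, in the standard representation."""
        self.links.append((relation, target.uid))

gene = InfoUnit("unit-1", "sequence", {"name": "unc-22"})
paper = InfoUnit("unit-2", "paper", {"title": "Muscle genes in C. elegans"})
paper.link("describes", gene)
print(paper.links)  # [('describes', 'unit-1')]
```

Because every unit and every link has the same shape, tools for browsing, searching, or annotating need to understand only one representation rather than one per data type.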
6. The calling sequence of a program is essentially the information that must be passed to the program for it to be run. For example, the calling sequence of a subroutine must specify its arguments and parameters; the calling sequence for a program must specify its location (e.g., its home directory). The calling sequence can be specified in many ways and is a matter of convention dictated by the structure of the computing environment in which the program will run.

7. One such standard, ASN.1 (Abstract Syntax Notation One), was developed by the International Organization for Standardization (ISO) for the syntactic exchange of data. At the National Library of Medicine, ASN.1 is used in the National Center for Biotechnology Information toolkit to support semantics for several common biology types, and it is being adopted in a number of analysis software packages in molecular biology as a common data interchange format (Ostell, 1992). In other contexts, notably those involving high-performance computing, ASN.1 does not offer satisfactory performance.

8. A range of quality control processes exists for different purposes, from moderating (checking for topic) conference proceedings or newsletters to editing (checking for accuracy) journals or books. For journal literature, the peer review process ensures quality. For data generated by instruments operated by a single investigator or team, as is commonly the case in space physics and oceanography, the investigator is typically responsible for quality control and thus performs both data contribution and checking.
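The distinction note 6 draws between a subroutine's calling sequence and a program's can be made concrete with a small sketch; the function name, its parameters, and the invoked command below are invented for illustration.

```python
# A subroutine's calling sequence: callers must supply `data` and `scale`,
# and may supply the optional parameter `offset`. (Names are hypothetical.)
import subprocess
import sys

def rescale(data, scale, offset=0.0):
    return [scale * x + offset for x in data]

# A whole program's calling sequence must also specify where and how the
# program is invoked: here, the interpreter's location plus its arguments.
result = subprocess.run(
    [sys.executable, "-c", "print(sum(range(4)))"],
    capture_output=True, text=True,
)

print(rescale([1, 2], 10))    # [10, 20]
print(result.stdout.strip())  # 6
```

The point of the convention is that any caller, human or machine, that honors the calling sequence can run the code without knowing anything about its internals.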
In genome projects, with more distributed data collection than in oceanography and space physics projects, contributions to the archives have traditionally been moderated but not edited, although large databases now have a curator whose responsibility it is to review submissions and maintain quality control for the information that will ultimately reside in the database.

9. A forthcoming Computer Science and Telecommunications Board report will examine this issue in the context of academic careers for experimental computer scientists.

10. See Appendix D, a reprint of Appendix 3 of the final report of an NSF-sponsored workshop, Training Computational and Mathematical Biologists, held at the Banbury Center of the Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, December 9-11, 1990.

11. The SUMEX-AIM project, begun in 1974, brought together scientists and technologists to facilitate research on the applications of artificial intelligence in medicine (AIM). The project used the Arpanet and Tymnet to link the Stanford University Medical Experimental computer (SUMEX) with a group of researchers from around the country. Although the computing resources available were primitive by today's standards, the project used electronic mail and an electronic bulletin board. The result of this collaboration was an encyclopedia of artificial intelligence tools, the AI Handbook. According to one description of the project,

[S]uch a resource offers scientists both a significant economic advantage in sharing expensive instrumentation and a greater opportunity to share ideas about their research. This is especially timely in computer science, a field whose intellectual and technological complexity tends to nurture relatively independent lines of investigation, with limited convergence on working programs available from others.
The complexity of these programs makes it difficult for one worker to understand and criticize the constructions of others unless he has direct access to the running programs. In practice, substantial effort is needed to make programs written on one machine available on others, even if they are, in principle, written in compatible languages. In this respect, computer applications have demonstrated less mutual incremental progress from diverse sources than is typical of other sciences. The SUMEX-AIM project seeks to reduce these barriers to scientific cooperation in the field of artificial intelligence applied to health research. (Lederberg, 1978)

12. Some support services are as informal as the graduate student who figures out and shares a trick for working with a particular program, system, or database.

13. Although information infrastructure is now receiving increased attention as a matter of public policy and private enterprise, current efforts provide scientists with limited access to specialized network-based computing tools. The NSF supports the supercomputing centers, and to a lesser extent the science and technology centers, to assist scientists in using specialized, state-of-the-art hardware and software. NSF has also sponsored the development of the NSFNET backbone network and of tributary, intermediate-level networks. It has maintained a modest program to assist institutions in acquiring the capital equipment needed to link to the Internet. Other government agencies, such as the DOE, NASA, (D)ARPA, and NOAA, provide assistance to their scientific communities, but overall access to network-based computing tools is still limited.

14. The anticipated interagency task force on information infrastructure might address this issue, although its focus is expected to be meeting the needs of the general public (industry, nonprofit organizations, and individuals).