Strategies for Preservation of and Open Access to Scientific Data in China: Summary of a Workshop

4 Summaries of Presentations on Cross-Disciplinary Issues

Three parallel breakout panel discussions focusing on cross-disciplinary issues in open access to and preservation of scientific data and information from the viewpoint of China were convened during the course of the workshop. These sessions were organized according to (1) legal and policy, (2) institutional and economic, and (3) management and technical issues. The objective of these thematic breakout discussions was to examine different possible models in these areas and their potential benefits and shortcomings in China.

PANEL DISCUSSION ON LEGAL AND POLICY ISSUES

Introduction [1]

[1] Paul F. Uhlir and Julie M. Esanu, U.S. National Academies.

With regard to scientific data resources, most databases and data centers in China are managed directly or funded by government ministries and are subject to a relatively restrictive state information regime based on official secrecy requirements. This is a major challenge to the adoption of an open-access model because the past policies have been based on deeply rooted political, institutional, and cultural factors. Some of the restrictions have applied generally to the overall public information regime, while others have been more specific to science and based on perceived political or economic sensitivities (e.g., domestic disease statistics or high-resolution geospatial data).

The Chinese government, however, increasingly recognizes that many types of scientific data should be made openly available and usable, especially within the country, and not just for research purposes. As discussed in Chapter 2, the recent high-level focus by the Chinese government on the laws and policies regarding access to government-produced and government-funded academic research data has made this a very propitious time to examine these issues.

The case for change in access policies to governmental scientific data can be made at many levels, both internally and externally. The most effective approach is one based on the realization of national self-interest. A comparison with the policies of other countries can be effective as well. Particularly auspicious is the trend over the past decade by many developing countries to adopt Freedom of Information laws.[2] Of course, there are legitimate public-policy reasons for limiting access to certain types of data, including appropriate national security restrictions, the protection of privacy and confidentiality, and the protection of private (as opposed to government) intellectual property rights.

A related and very significant problem exists in getting scientists to contribute the data produced in the course of their research to public repositories.
Barriers include the lack of an appropriate data center in which to deposit the data, the absence of a requirement by the funding source to deposit the data or to share them openly, insufficient recognition of the importance of data activities by the scientist’s institution, a lack of effective incentives or rewards to make the data available, the desire of researchers to sell their data at unreasonable prices despite very weak market estimates, inadequate funding to prepare the data sufficiently to make them usable by others, and a lack of training to do so.

With regard to scientific, technical, and medical journals, these too are mostly published by government or government-sponsored organizations in China. Because they are meant to be read by the research community, they do not have many of the same official constraints based on national security considerations as the underlying data. They are, however, still published almost exclusively in print form, so their open availability on digital networks raises new policy issues for the Chinese journal publishers and research establishment.

[2] For more information, see, for example, http://www.freedominfo.org.
OCR for page 37
The presentations summarized in this section focus on some of the legal and policy barriers to more open access for publicly funded scientific data at both the national and international levels, outline some of the policy arguments in favor of greater unrestricted access, and offer some policy guidelines in support of open availability.

Global Trends to Restrict Access to Data from Government-Funded Research [3]

Scientific data produced from government-funded research constitute a fundamental element of the modern research infrastructure and, if well managed, can greatly accelerate scientific progress at the national and international levels. Newly emerging possibilities for enhancing this role of scientific data resources in the digital environment truly constitute another “endless frontier.” High-level policy attention is necessary at the national and international levels in order to maximize the inherent value of data collections and to minimize the negative effects of restrictions on access and use. Indeed, many economic, legal, and technological restrictions have been placed on public-domain scientific data throughout the world.

From an economic perspective, the trends to privatize governmental public-good functions and to commercialize more of the academic sector’s research activities have been under way over the past two decades, particularly in biomedical and engineering areas. While these trends can support significant research advances and economic benefits, they are not without their own economic and social costs. A further continuation of privatization and commercialization of upstream public-sector information resources can be viewed as potentially having greater associated costs than benefits.
Recent changes to international and national intellectual property laws, such as new digital copyright protection and the adoption of exclusive property rights protection for noncopyrightable databases in many countries, as well as the adoption of licensing agreements on onerous terms for research tools (including data) in academia, are further diminishing the broad availability of public-domain data in science. Moreover, these highly protectionist legal mechanisms are increasingly enforced by more effective digital rights management technologies. Such developments are intensifying the tensions that already exist between the policies that favor sharing of scientific data and the perceived need to restrict access to and uses of data in pursuit of increased commercial opportunities.

Restrictions on the dissemination of potentially sensitive research data and information based on national security considerations are further constraining the availability of substantial amounts of material in the public domain. Finally, the recent enactment of a powerful new database protection statute in Europe and proposals for equivalent legislation in the United States and in other countries might be expected to push these tensions into other areas of public research, which up to now have been less affected by the proprietary pressures from the commercialization and privatization trends.

[3] Based on a presentation by Jerome Reichman, Duke University School of Law.

A Contractually Reconstructed Research Commons for Scientific Data in a Highly Protectionist Intellectual Property Environment [4]

If the economic, legal, and technological pressures on public-domain scientific data that were identified in the previous section continue unabated, they will impose opportunity costs across the entire research enterprise. These pressures, which are especially pronounced in biomedical and engineering research, could elicit one of two types of responses. One is essentially reactive, in which the public research community continues to adjust as best it can on an ad hoc basis, without organizing a response to the increasing encroachment of a commercial and proprietary ethos on data produced by government-funded research. The other would require a science policy response to the challenge by formulating a strategy that would enable the scientific community to take more active control of its basic data supply.
The idea is to reinforce, by voluntary means, a public space in which the data sharing ethic in public science can be promoted and insulated from some of the excessive privatization and commercialization trends, without impeding socially beneficial commercial opportunities. There are some contractual approaches that are now being considered in the United States and Europe, which the Chinese science policy community might consider as well in addressing this challenge in biomedical and other types of publicly funded research.[5]

[4] Based on a presentation by Jerome Reichman, Duke University School of Law.
[5] These approaches are examined in detail in an article by J.H. Reichman and Paul F. Uhlir. 2003. “A Contractually Reconstructed Research Commons for Scientific Data in a Highly Protectionist Intellectual Property Environment,” Law and Contemporary Problems, Duke University School of Law, vol. 66, Winter/Spring.
Balancing the General Public Interests and Copyright in Scientific Information Management [6]

In both the policy and legal arenas, there is a rising sense that research and scholarship fall within the general public’s right to know, which supports initiatives for increasing and opening access to research. There are two policy aspects to consider: government policies can determine how scientific knowledge circulates, and policies are themselves affected by the research that is consulted in their formation.

Shifts are taking place in policies affecting science, and these shifts are motivated by the basic human right to know as recognized, for example, by the United Nations Universal Declaration of Human Rights. They also are motivated by greater demands for accountability and transparency in the public administration of funding in areas such as government research grants. In Canada, for example, one of the principal granting councils for the social sciences and humanities is transforming itself into a “knowledge council,” which gives a high priority to the public impact and awareness of research.

According to a University of British Columbia study of policy makers’ actual use of research,[7] online access is having a substantial impact and is increasing the amount of research consulted, even as the policy makers are largely restricted to “open access” or free materials due to budgetary restrictions and the limited number of subscriptions held. Online access also has expanded policy makers’ circle of consultation, as they are relying less on a small set of academics to advise them. The role of research in policy making is an issue raised in many countries and in many contexts. Greater public access to the research literature and to the underlying data sources would help support more informed and rational policy making.
In terms of legal issues, two pertinent areas of law are Freedom of Information legislation and copyright. In the United States, for example, recent legislation has brought federally funded scientific data produced in universities that are used to support the formation of federal government regulations within the purview of the Freedom of Information Act (FOIA). Previously, the FOIA applied only to data produced within the federal government itself. Charges also have recently been made in New York against a major drug company for suppressing research unfavorable to its medication. These examples indicate rising public expectations of science and a growing sense that people have a right to know what is known.

Copyright protection in scholarly publishing is occasionally portrayed as a matter of protecting authors from plagiarism. Open access considerably reduces the likelihood of getting away with plagiarism. The more substantial legal issue, however, concerns the basic principle of copyright, namely, to protect the interests of the author and the public. Here a new argument can be introduced in favor of open-access scholarly publishing: it serves the interests of the author and the public better than publishing models that depend on subscriptions and copyright control, which actually reduce both the author’s and the public’s rights in scholarly publication and communication. In short, increasing access to research has much to contribute to the policy and legal considerations in scholarly publishing.

[6] Based on a presentation by John Willinsky, University of British Columbia, Canada.
[7] See Willinsky, J. 2003. “Policymakers’ online use of academic research,” Education Policy Analysis Archives, 11(2), January 11. Retrieved January 28, 2005, from http://epaa.asu.edu/epaa/v11n2/.

Borders in Cyberspace: Maximizing Social and Economic Benefit from Public Investment in Data [8]

Many nations are now embracing the concept of open and unrestricted access to public-sector information, particularly scientific, environmental, and statistical information of great public benefit. Federal information policy in the United States is based on the premise that government information is a valuable national resource and that the economic benefits to society are maximized when taxpayer-funded information is made available inexpensively and as widely as possible. This policy is expressed in the Paperwork Reduction Act of 1995 [9] and in Office of Management and Budget Circular No. A-130, “Management of Federal Information Resources.”[10] The policy actively encourages the development of a robust private sector and improved access to critical information in the academic and research sector, and it offers publishers the raw content from which new information services may be created, at no more than the cost of dissemination and without copyright or other restrictions.

[8] Based on a presentation by Peter Weiss, J.D., U.S. National Weather Service. See also Weiss. 1997. “International Information Policy in Conflict: Open and Unrestricted Access versus Government Commercialization,” in Borders in Cyberspace, Kahin and Nesson, eds., MIT Press.
[9] See http://www.cio.gov/archive/paperwork_reduction_act_1995.html.
[10] See http://www.whitehouse.gov/omb/circulars/a130/a130trans4.html.
In a number of nations, particularly in Europe and in many developing countries, publicly funded government agencies treat their information holdings as a commodity to be used to generate revenue in the short term. They assert monopoly control on certain categories of information in an attempt (almost always unsuccessful) to recover the costs of its collection or creation. Such arrangements tend to preclude other entities from developing markets for the information or otherwise disseminating the information in the public interest. The world scientific and environmental research communities, and especially developing nations, are particularly concerned that such practices have decreased the availability of critical data and information. Moreover, firms in emerging information-dependent industries seeking to utilize public-sector information find their business plans frustrated by restrictive government data policies and other anticompetitive practices.

Recent economic research and initiatives at the European Commission, the United Nations Educational, Scientific and Cultural Organization, and the Organisation for Economic Co-operation and Development, as well as in individual countries, such as China’s Scientific Data Sharing Program, are helping to create an international framework for open and global data sharing. There has been an emerging recognition in Europe as well that open access to government information is critical to the information society, environmental protection, and economic growth. A “government commercialization” policy for public information cannot succeed in the face of social and economic evidence and evenhanded application of competition policies. Conversely, open government information policies foster significant, but not easily quantifiable, social and economic benefits to society.
In order to achieve a successful international framework for open access to public scientific information, governments should:

Support full, open, and unrestricted international access to scientific data for public interest purposes, particularly statistical, scientific, geographical, environmental, and meteorological information of great public benefit. Such efforts to improve the exploitation of public-sector information contribute significantly to maximizing its commercial, research, and social values.

Allow the private sector to take an active role in using public-sector information to meet the diverse needs of citizens and users for such products and services. Meeting these needs requires entrepreneurial and publishing skills that are most evident in the private sector. Market needs are best served by open and unrestricted access to public-sector information.

Prohibit copyright protection for government information, limit fees to recouping the cost of information dissemination only, and eliminate restrictions on reuse. This will allow diverse entities to make new and innovative uses of public-sector information. However, attribution of data sources should be made, e.g., through the use of electronic watermarks or appropriate citations.

Avoid asserting a monopoly, either public or private, on public-sector information. Governments and societies both lose when governments treat their information as a commodity to be sold or allow a private-sector entity to “capture” the information on an exclusive basis.

Develop and maintain strong freedom of information laws to foster greater transparency and public trust in government.

Policy Considerations on Government Information Sharing in China [11]

Information policy research may be divided according to government, public, and commercial information. This section focuses on government information policies in China. As in the United States and other countries, government information sharing activities in China must be based on policy studies. The key issue is whether government information should be free and open or not. The opening up of government information is subject to competing policies of state security, commercial confidentiality, and personal privacy. Conflicts arise, and coordination is needed, between the state and the private sector and among different groups. Information sharing is also subject to information system security.

There are several key values and principles informing the sharing of government information. Information is a strategic national asset of potential value to society and the economy.
Government policy should seek to promote information sharing and the information sector. However, this needs to be done by balancing the costs and benefits to the providers and users of the information. The principles of open access, public-domain status, and free use of information must be balanced against the protection of rights of the producers of the information.

Government information sharing requires consideration of the following aspects:

Planning the national system as a whole;
Developing directories and catalogs for government information;
Establishing a chief information officer system for government information, and responsibility for information access and restrictions;
Developing a one-stop information service;
Understanding the social and economic benefits of access and sharing;
Marketing and exploitation of information;
Classification of information by users;
Integrating government information research across the government; and
Enhancing exploitation and usage of government information to increase information availability and benefits to government, and to drive the information industry.

[11] Based on a presentation by Jun Li, National Macro Economic Research Institute, China.

Comparative Aspects of Policies for Open Access to Scientific Data in the United States, European Union, and China [12]

Effective management of scientific data has become a vital component of the research infrastructure in the information era. Many countries have developed their own policies and mechanisms to manage and share their scientific data and, as a result, a set of laws and regulations governing these activities has been gradually formed and improved. The United States, the European Union, and China have different policies regarding access to publicly funded scientific data. The United States supports “full and open” access to many kinds of scientific data and considers publicly funded data as a public good. European dissemination policies, by contrast, are based on the market value of public-sector scientific data.
As a potentially major producer and user of scientific data, China needs to clarify its policies and establish different management mechanisms for various data produced or funded by the government. China has already begun to do so with its recent Scientific Data Sharing Program, as discussed in Chapter 2.

[12] Based on a presentation by Chuang Liu, Global Change Information and Resource Center, Institute of Geography and Natural Resources Research, Chinese Academy of Sciences. See also, Chapter 18, “Recent Developments in Environmental Data Access Policies in the Peoples’ Republic of China,” by Chuang Liu in Open Access and the Public Domain in Digital Data and Information for Science: Proceedings of an International Symposium, National Academies Press, Washington, DC, 2004.

Data Sharing in Scientific Databases of the Chinese Academy of Sciences [13]

Since its initiation in 1983, the Chinese Academy of Sciences’ Scientific Database and Applications System (SDAS) has developed quickly in its construction, technology application and development, information services, and other functions. It has become the largest scientific database cluster in China, with 45 collaborative institutions providing over 8 terabytes of data through 313 specialized databases. There are still greater challenges to scientific data sharing and services, however, requiring innovation for the traditional project management and application models. Therefore, from the beginning of the tenth Five-year Plan of the Chinese Communist Party, the SDAS has focused on research on data sharing policies with standard criteria, in addition to its data resources and system platform construction, to meet the growing external and interdisciplinary demand for data sharing through remote access, research collaboration, and information integration.

Setting up sharing policies for the SDAS is a major project to promote scientific data exchanges, enable further applications, and establish a series of fundamental standards for continuous data development. The methods for data sharing and management involve the establishment of sharing principles, classification, distribution requirements, collective management, and the protection of data owners’ rights and interests. The principle of “full and open” data access is being clarified and costs and revenues are distributed on a reasonable basis.
Based on current legal sources in China, the rights, obligations, and proper conduct relevant to data sharing are defined in three main categories: the data producers, distributors, and end users. The sharing policy of the SDAS is not merely an ordinary administrative management regulation, but the creation of a new scientific tradition based on data sharing in research developed from changing legal sources. The policy encourages data sharing, which can provide theoretical and practical guidelines for data sharing services and the continuous development of scientific databases, consistent with the trends of the knowledge economy.

[13] Based on a presentation by Yun Xiao, Computer Network Information Center, Chinese Academy of Sciences, available at http://www7.nationalacademies.org/usnc-codata/Xiao_Yun_Presentation.ppt.

The Data Sharing Policy of the Chinese Ecosystem Research Network [14]

The Chinese Ecosystem Research Network (CERN) has 36 field observation and research stations across China, and each station has produced a large amount of data through monitoring, experiments, and research. Users worldwide can share most of those data, in accordance with the CERN Data Sharing and Management Rule, which was issued in 2002 by the Chinese Academy of Sciences.[15] This rule protects the rights of the data producers and permits these data to be shared widely, following the principle of keeping a balance between rights and obligations.

The regulation divides data into two types, monitoring data (e.g., observational data from sensors) and data from research projects. In addition, it specifies five classes of users: related national departments, CERN members, members of the Chinese Academy of Sciences, domestic research and other nonprofit institutions, and others. This last category includes non-Chinese researchers. Users in each class have the same rights and obligations.

The rule established initial periods of exclusive use for data producers, typically ranging from one-half year to two years. The producers have priority to use their own data within the initial protection periods; other users can access those data following those periods, or even within the periods, if they obtain permission from the producers and provide attribution. Data producers may make data available on their own initiative.
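The access logic of the CERN rule, as summarized above, can be illustrated with a short sketch. This is only a minimal illustration in Python, not the text of the rule or any CERN software: the `Dataset` fields, the `may_access` function, and the representation of the exclusive-use period as a number of days are all assumptions made for the example. The summary does not say how rights differ among the five user classes, so the sketch does not model them.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Dataset:
    """Hypothetical record for one data set held by a station."""
    name: str
    produced_on: date
    exclusive_days: int           # assumed; the text says roughly half a year to two years
    released_early: bool = False  # producer chose to make the data available


def may_access(ds: Dataset, on: date, *, is_producer: bool = False,
               has_permission: bool = False, will_attribute: bool = False) -> bool:
    """Return True if access is allowed under the summarized rule."""
    # Producers have priority use of their own data at any time, and may
    # release the data on their own initiative.
    if is_producer or ds.released_early:
        return True
    # After the initial exclusive-use period, other users may access the data.
    if on >= ds.produced_on + timedelta(days=ds.exclusive_days):
        return True
    # Within the period, access needs the producer's permission plus attribution.
    return has_permission and will_attribute


if __name__ == "__main__":
    ds = Dataset("example-monitoring-series", date(2003, 1, 1), exclusive_days=365)
    print(may_access(ds, date(2003, 6, 1)))
    print(may_access(ds, date(2003, 6, 1), has_permission=True, will_attribute=True))
```

Under these assumptions, a request made inside the exclusive-use period is refused unless the requester has both the producer's permission and a commitment to attribute the data; the same request made after the period ends is allowed without further conditions.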
[14] Based on a presentation by Panqin Chen and Tieqing Huang, Bureau of Science and Technology for Resources and Environment, Chinese Academy of Sciences.
[15] Both CERN’s data policy and metadata are available on the CERN Web site at http://www.cern.ac.cn:8080/index.jsp.

Data Sharing Policy of the National Institutes of Health [16]

The sharing of biomedical data is essential for expedited translation of research results into knowledge, products, and procedures to improve hu-

[16] Based on a presentation by Belinda Seto, National Institute of Biomedical Imaging and Bioengineering, U.S. National Institutes of Health, available at http://www7.nationalacademies.org/usnc-codata/SetoPresentation.ppt.
OCR for page 51
access to data in order to generate a financial return and their policies are usually proprietary. Two recent National Research Council reports provide guidelines for resolving these different data policy requirements and for easing friction between the sectors.[22] The Fair Weather: Effective Partnerships in Weather and Climate Services report examined conflicts among government, academia, and the private sector dealing with weather data. It concluded that establishing rigid boundaries between the sectors and defining what each should do is counterproductive. The Resolving Conflicts Arising from the Privatization of Environmental Data report examined these issues for all environmental data and provided criteria for purchasing data from the private sector and for transferring government data collection and product development to the private sector. The report concluded that transferring government data collection and product development to the private sector can be beneficial as long as the following conditions are met:

Avoiding market conditions that will give private companies a monopoly;
Preserving full and open access to key data sets and products;
Assuring that a supply of high-quality information will continue to exist; and
Minimizing disruption of ongoing uses and applications.

For economic and data policy reasons, however, public funding for data collection and analysis should continue, focusing contributions of the private sector primarily on distribution of value-added products and collection of certain observations.

[22] National Research Council (NRC). 2003. Fair Weather: Effective Partnerships in Weather and Climate Services, National Academies Press, Washington, DC; NRC. 2001. Resolving Conflicts Arising from the Privatization of Environmental Data, National Academy Press, Washington, DC.
PANEL DISCUSSION ON MANAGEMENT AND TECHNICAL ISSUES

Introduction [23]

Although the technical aspects of digital scientific data and information activities are typically quite well understood and do not raise inordinate barriers except, perhaps, those related to costs, the proper management of such activities, especially data preservation and dissemination, poses some unique hurdles. In the discussion below we identify some of the problems that ought to be considered in properly planning scientific data center activities.

Operating a Twenty-First-Century Data Center [24]

In the past several decades, large-scale data resources have assumed an increasing role in scientific research, particularly research on Earth and its environment. There are a number of reasons for this, including advances in computational technologies, software, and observational capabilities, and a growing emphasis on empirical and interdisciplinary research. One of the consequences of the increasing dependence on data resources across fields of science is that in the coming years, the scientific community must devote a larger share of its resources and energies to data management and preservation than it has in the past.

A recent Priority Area Assessment on Scientific Data and Information, presented to the strategic planning committee of the International Council for Science (ICSU) in June 2004, emphasized that the scientific community needs to develop strategies for data management over time periods of decades to centuries.[25] Effective planning for long-term data management requires clarification of the role of data centers versus archives, obtaining regular scientific advice on data management and archiving decisions, and developing long-term financial support for data center and archival operations. The report also stressed the critical importance of professional management of data.
That is, it is no longer sufficient for the scientists who analyze scientific data to be responsible for managing those data; professional data managers, with professional expertise, are needed. Professional data managers must be knowledgeable about the technological drivers of data center operations. They have to understand the financial implications of hardware, software, and training. They must be able to manage the rapid pace of change and the timely updating of data, software, and hardware. And they should develop career incentives and rewards for effective data production and management.

Finally, the report recommended that there be a common international approach to data and information management. The benefits of an international approach to scientific data need to be properly understood and communicated so that common strategies, standards, and software interoperability can be developed.

[23] Paul F. Uhlir and Julie M. Esanu, U.S. National Academies.
[24] Based on a presentation by Roberta Balstad, Center for International Earth Science Information Network, Columbia University, United States.
[25] The final Priority Area Assessment on Scientific Data and Information report is available from ICSU at http://www.icsu.org/1_icsuinscience/DATA_Paa_1.html.

Managing the Effects of Programmatic Scale and Enhancing Incentives for Data Archiving [26]

The challenge of a digital scientific archive is to engage scientists in the process of archiving their data and provide the mechanism for archiving. The functions of an archive are to store data safely and reliably; build a catalog and structure; maintain storage across technology generations; review new data (quality assurance, metadata); “advertise” contents; find data for users with query and browse logic; and distribute data by providing access and references to documentation.

An effective scientific data archive operates on several presumptions. Information sharing is important. Multidisciplinary data access will foster more robust scientific discoveries. Archiving can always be improved. The number of permanent data archives will increase, which will increase the value of all archives connected on an interoperable basis through digital networks.

Even if an appropriate archive exists, however, there are many reasons why scientists do not archive their data.
They lack incentives and proper acknowledgement; they may be concerned about giving up publication rights; their research may have poor planning or insufficient resources for data management and preservation; they may not believe the archive will get long-term support; and they may lack training or be unsure about the metadata content they should provide.

[26] Based on a presentation by Raymond McCord, Oak Ridge National Laboratory, United States, available at http://www7.nationalacademies.org/usnc-codata/RaymondMcCordPresentation5.ppt.

Nevertheless, research managers can provide many good arguments and incentives to scientists for archiving their data, including the following points:

Recognition for archiving. Scientists need to receive some career benefits for their archiving-related work. Consider scientific journals that also provide companion “data publications.”

Emphasize good scientific practice. Promote professional development and training. Provide daily interactions between scientific and information specialists. Allow a reasonable time for initial discovery. Provide support for long-term “stewardship.”

Provide institutional incentives. Archiving should be required by the sponsor. Data archiving should be “in the plan” and resources available to support it. Interweave archiving with the planning and publication processes.

Exploit technological advances. It is technically easier now and there are more options.

Plan for managing change. Change is inherent in research, but managing change without prior planning can become consumptive. Changes may cause confusion and diminish data usefulness. (See the next section for more details about managing change in archiving.)

Perhaps the best argument is that effective archiving supports better science. Archiving extends data usefulness. Archived data increase the volume and diversity of our information base for doing research.

Managing the Effects of Change on Archiving Research Data [27]

The archiving of scientific data and information is made more difficult by the evolving changes associated with research accomplishments. Research discoveries lead to a continual series of revisions to sampling schemes, measurement methods, and scientific objectives.
All of these changes add to the scope and complexity of information that must be recorded and logically organized as part of a successful data archive. Recording and communicating these changes for future data users is facilitated by additional supporting information and an evaluation of the rules that define the information. Most of the available information technology (hardware, software, and implementation methodology) originates from business applications, which are designed to accommodate fundamentally different patterns of change. Managers of scientific data archives will need to adapt the traditional designs of information systems to meet these special features of research data and its users. Management also needs to encourage extra effort during initial design and later operations to accommodate the future changes that will occur.

[27] Based on a presentation by Raymond McCord, Oak Ridge National Laboratory, United States, available at http://www7.nationalacademies.org/usnc-codata/RaymondMcCordPresentation.ppt.

The structure and content of scientific data and information can be very complex. Successful archiving of data requires that the variation in this complexity be minimized. The efficient operation of information systems and effective communication with future data users are enhanced by minimizing the variation in the logic, concepts, and keywords used in the metadata. Some of the complexity is inherent to the variety of measurements and materials included in the research and cannot be avoided. Additional complexity occurs as archived information is aggregated into more extensive systems and accessed by broader user communities.

There are many management and institutional issues that must be considered to avoid unnecessary complexity and uncertainty in the archived information. The effects of the varying dimensions of programmatic scale (volume, diversity, longevity of data and research programs) need to be considered. Institutional impediments and incentives also affect the willingness of scientists to contribute information to archives.
The documentation and archiving of data should be integrated with the publication process as part of the “modern scientific method” and should receive similar incentives. Management should reinforce these practices by insisting on early planning for data archiving and providing specific rewards for these activities. Other management issues include protecting initial discovery opportunities, supporting long-term stewardship of data (answering questions after the project is completed), and providing “cross-training” of archive personnel in both scientific and information disciplines.

There are several key rules for creating data sets for archiving:

Unique occurrences. Each type of measurement should be represented in a consistent way and each measurement event should be represented by only one value. If multiple versions of data sets accumulate, provide version information, explain version differences, and document the effective data range for each version.

Identifiers. Each value should be associated with a parameter name and each measurement value should have a quality indicator and link to a method description. Whenever possible, remove multiple aliases for the same identifier (e.g., sample identifier, site identifier or name, measurement name, etc.).

Place and time. Each value should be associated with a unique place name with a quantitatively defined location (geographic coordinates). Each value also should be associated with a date and time. Do not confuse the date and time of a measurement with the date and time of storage revisions or with the date and time ranges for measurement or encoding methods.

Data storage and transport. Data should be stored or managed with a database management system or self-documenting data format. Include data analysis software in the data management suite. This is useful for comparing versions of data that accumulate over time. Also include data format conversion software in the data management suite, which is useful for migrating data from one storage technology to another.

Finally, there are a number of best practices for preparing ecological and ground-based data sets to share and archive: [28] Assign descriptive file names; Use consistent and stable file formats; Define the parameters; Use consistent data organization; Perform basic quality assurance; Assign descriptive data set titles; and Provide the necessary documentation.

Special Considerations for Archiving Data from Field Observations [29]

Archives depend on logical rules for information structures and consistent codes for metadata. Different types of data and information pose different types of challenges and archiving requirements, however. For example, scientific data from field investigations are fundamentally different from laboratory observations. Laboratory studies are conducted under controlled conditions, whereas field observations are collected from incompletely controlled environments. Data archives for field observations must include additional design features to accommodate, but minimize and rationalize, this additional complexity. These features include a need to address multiple schemes for location information, temporary changes in methods, unmeasurable events, evolving reference lists, and a containment strategy for exceptions.

[28] Cook et al. 2001. Bulletin of the Ecological Society of America. Available at http://www.daac.ornl.gov/DAAC/PI/bestprac.html.
[29] Based on a presentation by Raymond McCord, Oak Ridge National Laboratory, United States, available at http://www7.nationalacademies.org/usnc-codata/RaymondMcCordPresentation5.ppt.

Location information typically involves multiple geographic coordinate systems. Conversions from one system to another may not be reversible without some loss of information. It is important, therefore, to test changes before large-scale conversions are made. A visualization capability is essential. Location information also usually encounters multiple naming schemes, such as unofficial “folk” names; divergence in naming schemes at local, regional, and national scales; and connecting historical name changes.

Archives also must take account of temporary changes in data collection methods. For example, the field sampling protocol may be insufficient, resulting in a need to restructure the metadata to record the temporary change. Field observations at remote sites can experience various anomalies and instrument malfunctions. A robust scheme for missing value representation is needed and data analyses must correctly exclude missing values.

There also may be unmeasurable events that are either too small or too large. How do you record values that are below the detection limit (but not zero)?
You can set all values to the minimum detection limit, set all values to the midpoint between the detection limit and zero, set all values to zero, or retain the estimated value but include a quality flag. You will need to select one of these strategies and document the choice, realizing that the choice can have significant impacts on summary statistics. How do you record the biological population when there are too many individuals to count? You can record some arbitrary large number or flag it as unmeasurable. Similar problems can occur with wide ranges in chemical concentrations. Different schemes may have impacts on results from statistical analyses or on setting quality assurance limits.

Evolving reference lists pose further challenges. Taxonomic lists assume agreement on a single and accepted classification scheme, which may not be true. Individuals who are involved only infrequently also may not be fully identifiable. Later samples may enable fuller identification, which may require recoding earlier records to match newer identification. Chemical constituents may have similar classification problems.

A containment strategy is thus needed to deal with the exceptions. The “90/10” rule is a useful guideline: approximately 90 percent of the data can be described by a few logical rules, while approximately 10 percent of the data cannot be described by rules and contain numerous and isolated exceptions. Put the information that cannot be described by rules in an alternative structure that can be labeled as “user beware,” support detailed and varied documentation, and accommodate and communicate numerous exceptions. However, when there are “too many” logical rules, the archiving process will become inefficient and tedious, so it is important to make adjustments as needed.

Toward a Balanced Performance Appraisal System in the Digital Era for Data Archiving and Sharing in China [30]

Although data archiving and sharing are not new problems in China, they became worse with the advent of the digital era. Traditionally, the data producers and the data users were not very different in terms of their academic recognition and reputation. They collected, archived, and used data, and most of the data products were published and thus were available to the public at affordable prices. The published data sets were also regarded as academic achievements by their peers and funding agencies. This mechanism encouraged some scientists to archive and publish their research data.

In the early years of the digital era, however, a number of factors began to hinder data sharing, some of which have continued to the present. First of all, data management as a whole became more expensive and needed more investment. Second, digital publishing of research data is not the same as in the print format and the authorship of digital data products is less clear.
Third, those scientists who are both data producers and data users (as is true in most cases) have a competitive edge over their peers if they can use their data exclusively forever. Fourth, public funding agencies may not be able to make data available to other scientists, because even the agencies sometimes have obtained the data according to restrictive contracts. Finally, China’s scientific community has not been fully aware of the fundamental role that data have played in scientific inquiry. Data collection, processing, validation, archiving, and sharing all have not been included in China’s “academic performance appraisal system.” In a system in which print publishing gets nearly all the information funding and digital data archiving and sharing get much less, one has little incentive to continue working in data archiving. Therefore, it is critical to develop a more comprehensive and balanced performance evaluation system to foster and sustain digital data archiving and data sharing in China.

[30] Based on a presentation by Zhengxing Wang, Global Change Information and Research Center, Institute of Geography and Natural Resources Research, Chinese Academy of Sciences.

Earth Science Data and Information Management in Western China [31]

Over the past few decades, the western regions of China have been the focus of important earth science research and a lot of earth science data and information have been accumulated there. Recent studies have focused on the plateaus, mountains, deserts, vegetation, hydrology, and ecosystems in that region from the perspective of different fields of study, including ecology, environmental and earth sciences, sociology, and regional development. The historical data and information resources are valuable for such studies.

Based on the analysis of the distribution of information resources and users, and on user requirements and the support of data, many functions need urgent attention in the national framework of earth science data sharing in China. For example, reproducing single copies of data and information, accelerating the sharing of data and information, meeting the demands of potential users, and promoting the use of information to benefit the western region of China all require attention. There are several actions that could be taken in this regard.
One is to assist institutions that have not organized and digitized their historical data, and to encourage them to provide an index and to become members of the data sharing system. Another is to establish a special management program to reduce the costs of data collection and promote the application of the data. Finally, it would be very useful to establish a digitally networked clearinghouse for earth science data and information that involves the relevant organizations and persons in western China.

[31] Based on a presentation by Chengquan Sun, Scientific Information Center for Resources and Environment, Chinese Academy of Sciences, available at http://www7.nationalacademies.org/usnc-codata/Sun_chengquan_Presentation.ppt.
Data Integration and Management: The Protein Data Bank Perspective [32]

The Protein Data Bank (PDB) is the worldwide repository for the structures of biological macromolecules. [33] The PDB provides a rich history from which to explore the practices of biological data management, because its data set has many characteristics found in other biological data: diversity, complexity, and variable quantity and quality of annotation. A database resource is only as good as the data it contains. Data representation, acquisition, annotation, and distribution are all essential functions.

The data representation (metadata) used by PDB is the macromolecular Crystallographic Information File (mmCIF) standard dictionary. The mmCIF dictionary conforms to a subset of encoding rules embodied in the Self-defining Text Archive and Retrieval (STAR) syntax. STAR has provisions for defining scope, nesting, looping, and other aspects. Conforming to STAR is a Dictionary Definition Language (DDL) that defines how dictionaries are described. DDL has provisions for fully characterizing the terms in the domain and is relational in nature; that is, there is the notion of relations (categories), attributes (specific data names), primary and secondary keys (mandatory data items), and so on. The data defined by mmCIF consist of name-value pairs where each name must be defined in the mmCIF dictionary.

The mmCIF dictionary can be characterized as having the features of an Extensible Markup Language (XML) document type definition or schema. Although the dictionary was written in the STAR format, the ontology and its derivations are independent of STAR or any other particular file format. It can be automatically converted to other formats. It provides the foundation of integrated software systems for building robust automated data pipelines.
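The name-value structure described above can be illustrated with a minimal Python sketch. It is not the PDB's software and not a real mmCIF parser: it assumes one `_category.attribute value` pair per line, ignores data block headers, loops, and multi-line values, and checks names against `TOY_DICTIONARY`, a hypothetical three-category stand-in for the real dictionary. The function name `parse_pairs` and the sample values are likewise invented for illustration.

```python
import shlex

# Hypothetical toy dictionary: category (relation) -> allowed attributes.
# The real mmCIF dictionary is far larger and carries types, keys, and more.
TOY_DICTIONARY = {
    "entry": {"id"},
    "cell": {"length_a", "length_b", "length_c"},
    "exptl": {"method"},
}


def parse_pairs(text):
    """Parse '_category.attribute value' lines into {category: {attribute: value}}.

    Raises ValueError for a malformed line or a name missing from TOY_DICTIONARY.
    """
    record = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        # shlex handles quoted values and drops '#' comments.
        tokens = shlex.split(line, comments=True)
        if not tokens:
            continue  # blank line or comment only
        name = tokens[0]
        if len(tokens) != 2 or not name.startswith("_") or "." not in name:
            raise ValueError(f"line {line_no}: expected '_category.attribute value'")
        category, attribute = name[1:].split(".", 1)
        if attribute not in TOY_DICTIONARY.get(category, set()):
            raise ValueError(f"line {line_no}: {name} is not defined in the dictionary")
        record.setdefault(category, {})[attribute] = tokens[1]
    return record


# Sample input; the values are invented and do not describe a real entry.
SAMPLE = """
# hypothetical entry for illustration
_entry.id        DEMO
_cell.length_a   61.2
_cell.length_b   61.2
_exptl.method    'X-RAY DIFFRACTION'
"""

record = parse_pairs(SAMPLE)
print(record["cell"])
```

Rejecting any name that the dictionary does not define is the point of the sketch: the dictionary, rather than the file, decides which names are legal, and that is what allows automated pipelines to validate incoming data before accepting it.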
The PDB has been actively involved in various aspects of automated and accurate data acquisition, annotation, and distribution.

Biological data management concerns more than just the technical aspects, however. There are sociological and political issues as well. A key element for success is good communication among those running the resource, who need to have diverse skill sets, and among every member of the team and the communities they represent. Community feedback must be treated seriously and lead to a prioritized set of action items to be addressed by the resources available. The technology must take advantage of the most recent innovations in hardware and software. These technological developments, however, must be introduced so as to enable and not disrupt the users of the resource. It is critical to maintain an interactive dialogue with the user community about desired new functionalities and the feasibility of their implementation. Beyond all else is the need for good data and a robust data representation that is flexible enough to meet the needs of the changing science.

[32] Based on a presentation by Zukang Feng, Protein Data Bank, United States, available at http://www7.nationalacademies.org/usnc-codata/ZukangFengPresentation.ppt.
[33] See http://www.wwpdb.org/.