5
Building Distributed Geolibraries

Requirements

Previous sections of this report outline the vision of distributed geolibraries, discuss the problems and issues related to their social and institutional context, and define their services and functions. This chapter addresses the process of building distributed geolibraries, the steps that will need to be taken to implement the vision, and related issues. It is impossible to be precise, of course, because of uncertainties surrounding future technologies, because the outcomes of research are in principle impossible to anticipate, and because many issues can only be resolved by constructing and working with prototypes. Given these constraints, this report attempts to address a number of key questions and to find answers where possible:

  • What will it take to build distributed geolibraries?
  • What economic incentives can be put in place such that stakeholders in all sectors of the community (business, education, government) can and will participate?
  • What arrangements need to be put in place in the form of institutions, regulations, standards, protocols, committees, and so forth?
  • What research needs to be done to address problems and issues for which no methods or solutions currently exist? How long will this research take?


The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.

  • What data sets need to be constructed, and what mechanisms might be used?
  • What software needs to be written, and who is likely to write it?

At a higher level one might ask how it is possible to know the answers to these questions. Complex software systems and new institutions arise through an iterative process in which the end result may not be apparent until the process has been under way for some time. Creating a vision is part of that process, but the vision may be wrong or unachievable. Large-scale prototypes are sometimes built in part because it is difficult or impossible to know what is possible without such large-scale experimentation. Without building a distributed geolibrary prototype, it may not be possible to identify exactly what it will do successfully and what it will not do. It may be difficult to know at an early stage how much a distributed geolibrary will cost or whether its costs will be exceeded by its benefits.

The Panel's vision of distributed geolibraries views them as a primary distribution mechanism for getting geospatial data and geographic knowledge resources into the hands of all stakeholders. Traditionally, the primary source of geospatial data in the United States, as in many other countries, has been the national mapping agency. Dissemination has been predominantly a one-to-many operation, as a single source provided information to a distributed user base. The vision of the National Spatial Data Infrastructure (NSDI) is very different and reflects an increasing degree of empowerment of individuals and agencies as significant producers of geospatial data. This vision is many-to-many, replacing a single source with a much more complex array. It is also complicated by the fact that the user/producer distinction is no longer as clear. Many users of geospatial data add value and become producers, and many users serve their own networks of clients. Many users of geospatial data are producers of geographic knowledge, which they may want to publish or make available through the mechanism of distributed geolibraries.

The many-to-many paradigm is familiar to librarians, who have traditionally acted as brokers between the publishers and the users of information. Thus, the paradigm shift that is occurring in geospatial data dissemination, in part through a process of technological empowerment, provides a strong reason to look to the library as a metaphor for new dissemination models and suggests that the library is a good place to look for models of distributed geolibraries and for solutions to problems and issues that may arise in building them. On the other hand, the timescale of library operations has been far slower than is normal with digital data dissemination. It may take years for information to pass fully through the complex process of publication and cataloging until it is finally available to the traditional library user. Users of the WWW are accustomed to delays on the order of minutes, not years. Thus the library model will be useful only if its customary timescales can be compressed by many orders of magnitude.

The following sections address the needs of distributed geolibraries in terms of standards and protocols, data sets, georeferencing, cataloging, visualizations, and knowledge creation. Later sections discuss research needs and institutional arrangements. The final section of the chapter discusses the measurement and assessment of progress in building distributed geolibraries.

Standards and Protocols

Geospatial applications are already supported by a large number of standards and protocols, and many more are in various stages of development. The set of particular relevance to distributed geolibraries includes:

  • The metadata standard developed by the Federal Geographic Data Committee (FGDC) and known as the Content Standard for Digital Geospatial Metadata (http://www.fgdc.gov). This standard allows catalogs of geospatial data sets to be constructed using well-defined content. It is elaborate, and substantial effort is needed to achieve compliance. A very similar general metadata standard is in the International Organization for Standardization (ISO) review process under the ISO Technical Committee 211 (ISO-TC 211).
  • General file format standards for geospatial data. These include standards mandated under FIPS 173 and known as the Spatial Data Transfer Standard (SDTS), the scientific data standards HDF and netCDF, the imagery standards TIFF and GeoTIFF, the military standard DIGEST, and many more.
  • Interoperability specifications. The Open GIS Consortium (www.opengis.org) is developing a wide range of specifications for geospatial objects to support interoperation and is strongly supported by the GIS software industry.

Other standards of relevance to distributed geolibraries include those under discussion on intellectual property rights in digital data, standards of geospatial data quality, definitions of geographic feature types, and general mapping standards. They are being developed through a multitude of standards organizations, including, for example, the ISO, the American National Standards Institute (ANSI), the FGDC, and the International Cartographic Association.

The Internet and the WWW are built on a series of standards and protocols that have been widely accepted not because of any compulsion or mandate but because they clearly work and enable interesting applications. They include TCP/IP and HTTP. In the coming years it is likely that these standards will be extended repeatedly, and it appears that the architecture of the Next-Generation Internet will be significantly enhanced. Although none of these developments have been driven or are likely to be driven by the special needs of distributed geolibraries, as in the past we can expect them to be exploited in whatever ways are interesting, valuable, and appropriate.

Finding 11: New technological initiatives such as the Next-Generation Internet and Internet II are likely to provide extensions to Internet and WWW protocols and orders of magnitude increases in bandwidth. Many of these developments are expected to be relevant to distributed geolibraries.

Data Sets

Libraries assist their users in many ways; some of the most important are the mechanisms of abstraction employed to help users find relevant information. The process of cataloging is assisted by a number of data sets known as authorities that provide essential indices and lists. In distributed geolibraries an essential authority is the gazetteer. A distributed geolibrary's gazetteer will differ in several key respects from the traditional version found in the back pages of atlases:

  • Support for extents, defined as the bounding coordinates of place names. Traditional gazetteers, and their digital equivalents such as the Geographic Names Information System, provide only point references for most features. In contrast to point locations, extents are needed to resolve the relevant discrepancies between the given footprint of an asset and the footprint of a user query. Because there is only marginal value in a highly precise footprint (since adding precision to a boundary's location will only marginally increase the effectiveness of a search), it may be sufficient to provide only bounding coordinates (e.g., minimum and maximum latitude and longitude).
  • Extensibility, defined as the ability of a user to insert additional place names of interest into a local copy of a standard authority gazetteer.
  • Specialization, defined as the ability of a user to define gazetteers for special applications. Many application domains have their own equivalents of recognized place names. Hydrologists use standard ways of indexing watersheds, for example, and remote sensing specialists use standard numbering systems for the images derived from satellites such as Landsat. Translations from these systems to standard coordinates will be important data sets in support of the functions of distributed geolibraries.
  • Support for fuzziness. Traditional gazetteers literally provide authority only for officially recognized place names. While the footprint of a city name may vary depending on context and usage, the official footprint is most often defined by the city limits. Users of distributed geolibraries will want to be able to search based on place names that are not officially recognized but nevertheless in common usage, such as "downtown."

Finding 12: A comprehensive gazetteer, linking named places and geographic locations, would be an essential component of a distributed geolibrary. A national gazetteer would be a valuable addition to the framework data sets of the NSDI. These framework data sets are being coordinated by the FGDC, which also has the responsibility for associated standards and protocols. Production and maintenance of the national gazetteer could be through the National Mapping Division of the U.S. Geological Survey (USGS) in collaboration with other agencies and could be an extension of the USGS's Geographic Names Information System.

Another type of authority used by libraries is the thesaurus. In the geoinformation case, various kinds of authorities would be useful: lists of standard feature types, standard data themes, standard attribute definitions. For example, it would be useful if the meaning of vegetation and associated terms could be standardized, and much effort by the FGDC has been devoted over the past few years toward this end. In a world in which everyone can be a data producer, it is no longer possible to rely solely on the federal government to define essential mapping terms. At the same time it is important that distributed geolibraries reflect the contemporary social norms of their users. The very term authority suggests a command-and-control philosophy that may be orthogonal to the prevailing culture of the Internet and the WWW, which is dominated by individual empowerment and voluntary consensus.
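
The gazetteer requirements listed earlier in this section (extents stored as bounding coordinates, and extensibility through user-added place names) can be illustrated with a small sketch. Everything named here is hypothetical: the `Extent` and `Gazetteer` classes, their methods, and the convention that local names take precedence are assumptions made for illustration, not part of any gazetteer standard or of the Geographic Names Information System. The sketch also assumes that no extent crosses the 180-degree meridian.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Extent:
    """Bounding coordinates in decimal degrees (hypothetical layout)."""
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other: "Extent") -> bool:
        # Two boxes overlap unless one lies wholly to one side of the other.
        # Assumption: neither box crosses the 180-degree meridian.
        return not (
            self.east < other.west or other.east < self.west
            or self.north < other.south or other.north < self.south
        )


@dataclass
class Gazetteer:
    """Place-name authority with extents plus user-added local names."""
    standard: Dict[str, Extent] = field(default_factory=dict)
    local: Dict[str, Extent] = field(default_factory=dict)

    def add_local(self, name: str, extent: Extent) -> None:
        # Extensibility: local entries sit beside the standard authority
        # and never modify it.
        self.local[name.lower()] = extent

    def lookup(self, name: str) -> Optional[Extent]:
        # Assumption: a local (possibly unofficial) name wins over a
        # standard entry with the same spelling.
        key = name.lower()
        return self.local.get(key, self.standard.get(key))

    def names_overlapping(self, query: Extent) -> List[str]:
        # Return every place name whose extent overlaps the query footprint.
        merged = {**self.standard, **self.local}
        return sorted(n for n, e in merged.items() if e.intersects(query))
```

A user might load a standard entry for a city, add an unofficial name such as "downtown" with `add_local`, and then call `names_overlapping` with the footprint of a data set to see which names apply to it. Bounding boxes are deliberately coarse here, in line with the observation above that extra boundary precision adds little to search effectiveness.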
An authority for a distributed geolibrary is clearly something different from a traditional library authority, and digital technology must be used to serve different ends. Instead of a single authority created by a central agency and enforced top-down on the community through regulation, mandate, or incentive, digital technology should be used to support translation and interoperability between a variety of different meanings and interpretations in a bottom-up process that accommodates diverse communities and groups and their associated terminologies. If the term downtown means something different to user A than to user B, distributed geolibraries should use the power of digital technology to make the two meanings interoperable, rather than to support the imposition of a single interpretation on all users.

Georeferencing

The system of latitude and longitude has been subject to international standards since the late nineteenth century. However, the definitions of latitude and elevation are dependent on the mathematical function used to approximate the shape of the Earth, and many such functions are in use. Thus, latitude is not fully interoperable, and two points near each other on the Earth and measured from opposite sides of certain international boundaries do not converge perfectly. Additional complications occur in the use of other world coordinate systems, such as the Universal Transverse Mercator (UTM) coordinate system, and between the U.S. State Plane coordinate systems.

If distributed geolibraries are to be useful to people who do not understand the complexities of geodetic datums and cartographic projections, it will be necessary for systems to be developed that are capable of hiding such details or making them fully transparent to the user. Thus, a user ought to be able to access data sets in different projections and based on different datums and expect the system to handle the differences automatically. Such transparency is not yet available in standard geospatial software products and data sets, and its feasibility has not been demonstrated.

Other general ways of referencing the surface of the Earth are gaining popularity because of interest in global environmental change and other processes that operate at the global level. These include standard hierarchical grids such as QTM (Dutton, 1984) and the sampling grids used by the EMAP program (White et al., 1992).
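
The dependence of latitude on the function used to approximate the Earth's shape, noted in this section, can be made concrete with a short sketch. It converts the same latitude and longitude to Earth-centered Cartesian coordinates on two reference ellipsoids (WGS 84 and Clarke 1866) and measures the gap between the results. The function names are illustrative. The sketch assumes, unrealistically, that both ellipsoids share a center and orientation; real datums also differ in origin, so this is not a datum transformation and the number it produces is not a real-world datum shift.

```python
import math

# Ellipsoid parameters: (semi-major axis in meters, inverse flattening).
WGS84 = (6378137.0, 298.257223563)
CLARKE_1866 = (6378206.4, 294.9786982)


def geodetic_to_ecef(lat_deg, lon_deg, ellipsoid, height_m=0.0):
    """Convert geodetic coordinates to Earth-centered (x, y, z) in meters."""
    a, inv_f = ellipsoid
    f = 1.0 / inv_f
    e2 = f * (2.0 - f)  # first eccentricity squared
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    # Radius of curvature in the prime vertical at this latitude.
    n = a / math.sqrt(1.0 - e2 * math.sin(lat) ** 2)
    x = (n + height_m) * math.cos(lat) * math.cos(lon)
    y = (n + height_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1.0 - e2) + height_m) * math.sin(lat)
    return x, y, z


def ellipsoid_offset_m(lat_deg, lon_deg):
    """Distance between the points that one lat/lon pair denotes on each ellipsoid.

    Assumption: the two ellipsoids are concentric and aligned.
    """
    p = geodetic_to_ecef(lat_deg, lon_deg, WGS84)
    q = geodetic_to_ecef(lat_deg, lon_deg, CLARKE_1866)
    return math.dist(p, q)
```

Calling `ellipsoid_offset_m(45.0, 0.0)` gives an offset on the order of hundreds of meters under the stated assumption, which is the kind of discrepancy a system would have to reconcile silently if datum details are to be hidden from the user.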

Hierarchical grid systems of this kind may be important internally as indexing schemes for distributed geolibraries (Goodchild and Yang, 1992).

Cataloging

Reference was made earlier to the need to compress the traditional timescales of the library world. Nowhere is this more important than in cataloging, which serves the critical function of abstracting the information users need to find, examine, assess, and retrieve data. In effect, metadata are the key to the many-to-many structure that allows many users to search across many potential suppliers, and their timely creation will be crucial if distributed geolibraries are to function.

Unfortunately, the process of metadata creation for digital geospatial data can be as lengthy and labor intensive as its traditional equivalent. The task of creating a full metadata record for a geospatial data set using the FGDC metadata standard can be much greater than the task of cataloging a simple book. The geospatial data community appears to have accepted the notion that metadata creation is largely the responsibility of the producer, whereas the prevailing notion in the library community is that cataloging is the responsibility of the librarian. This reflects a distinct difference in philosophy, since the library practice is based on the notion that the librarian may be more skilled in abstracting information on behalf of the user than is the producer of the information.

If time is of the essence in the digital world of the Internet, it makes good sense to try to replace the labor-intensive cataloging process with automated methods. The Internet world's solution to this problem has been the WWW search service, exemplified by AltaVista, Yahoo, and Excite. To be successful, a search service designed to help the user of distributed geolibraries find geospatial data and geographic knowledge would have to place heaviest emphasis on the determination of an information object's geographic footprint, either by detecting or inferring coordinates or by identifying an appropriate place name, to be converted to coordinates using a gazetteer. Such tools would perform the functions of abstracting and metadata creation automatically. Such automated discovery, indexing, and abstracting tools do not yet exist and will require extensive research and development.

Three models that provide alternatives to the search service are described in Chapter 4. They are technically much simpler, but require practices that appear to be incompatible or only partially compatible with the culture of the Internet.

Visualization

One of the most powerful advantages of the concept of distributed geolibraries is the ability for the user to interact with a representation of the surface of the Earth. Information about the Earth's surface is naturally conceptualized as belonging to the surface, and globes, which are actual scaled representations of the Earth, provide a familiar and easily understood information source. The notion of doing the same in the digital world, of presenting information as if it were actually located on the surface of the globe, is termed the Digital Earth metaphor, and lies behind the idea described earlier in Chapter 2.

Some types of geoinformation closely approximate actual appearance and can be rendered by draping onto a curved surface. These include optical imagery and false-color imagery, where colors are used to render information that corresponds to some other possibly invisible part of the spectrum. Other information in distributed geolibraries is not rendered so easily. How, for example, would one portray economic information such as average household income using the Digital Earth metaphor? In some cases there may be clever ways of making visible what is normally invisible; in other cases it may be necessary to represent the presence of information using symbols that exploit some other metaphor, such as books or library shelves.

This is a novel area with no obvious guideposts, and research will be needed to determine how best to make the user of distributed geolibraries aware of the existence of information and of its important characteristics.
In particular, we know almost nothing about how to render dynamic geospatial data or how to indicate availability, yet we anticipate that such data will be increasingly available to the users of distributed geolibraries.

Knowledge Construction

Users of distributed geolibraries will need tools for analysis, modeling, simulation, decision making, and the creation of new geographic knowledge. An important component will be the workspace in which the user can process data using many of the functions found in today's GIS, along with other functions such as those described earlier in Chapter 4. Given the massive investment in GIS, the easiest way to achieve this will be through collaboration between the builders of distributed geolibraries and the developers and vendors of GIS software. Compatibility and interoperability between GIS products and distributed geolibraries will be needed. For example, the metadata used to discover, assess, and retrieve data should be processed and updated by the GIS as data are manipulated and used to create new data sets. Metadata should be generated automatically when new knowledge is created by analysis and modeling. Current software products are generally incapable of these functions, and much research remains to be done to make them generally available.

Research Needs

Many of the topics discussed in this report fall under the heading of "things we do not yet know how to do." In some cases, such as the building of a distributed geolibrary itself, there may be no obviously missing piece of theory or understanding; rather, it may be that we have not yet tried and that given sufficient resources the necessary knowledge will be available. But other items require more focused research. Among them are the following:

  • Scalability. We have no experience with building and operating data-handling systems on the massive scales envisioned here.

  • Interface design. Most information technologies are designed for skilled users. Distributed geolibraries will be used by everyone, over a wide range of levels of cognitive understanding, and will require new methods of interface design that embody sound principles, some of which have yet to be discovered.
  • Merging data. We have very little experience with the massive redundancy anticipated in distributed geolibraries, where many sources of the same data will be available. We do not have techniques for merging data from different sources, across different scales and levels of accuracy, or across different data models or ontologies, or for combining or conflating the desirable properties of sources. Distributed geolibraries will be one of a growing number of applications that depend on the ability to register multiple data sets quickly and easily and to remove obvious discrepancies.

Finding 13: The success of a distributed geolibrary will be largely dependent on the ability to integrate information available about a place. That ability is severely impeded today by differences in formats and standards, access mechanisms, and organizational structures. Removal of impediments to integration should become a high priority of government agencies that provide geospatial data.

  • Indexing. Our methods of indexing data have been developed for the flat two-dimensional world of maps and images. Distributed geolibraries will require comprehensive approaches to indexing that are capable of supporting "drilling down" over a wide range of scales.
  • Visualization. While techniques for visualizing static two-dimensional data are well understood, particularly in cartography, we do not have the same level of understanding of appropriate ways to visualize data on the curved surface of the Earth, especially when the data are time dependent. Much more research is needed into appropriate metaphors, techniques, and user responses before these will be as easy as traditional cartographic visualization.
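
The idea of an index that supports "drilling down" across scales, raised in the list above, can be sketched with a nested quadtree key. This is a deliberately simplified stand-in: it subdivides a latitude/longitude rectangle, so its cells are not equal in area, and it is not the QTM or EMAP scheme cited in this chapter. The function names and the dictionary used as an index are hypothetical.

```python
def quadkey(lat, lon, levels):
    """Return a string of digits 0-3 locating (lat, lon) in a nested grid.

    Each digit halves the current cell in both latitude and longitude,
    so a longer key names a smaller cell inside the cell named by any
    of its prefixes.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    south, north, west, east = -90.0, 90.0, -180.0, 180.0
    digits = []
    for _ in range(levels):
        mid_lat = (south + north) / 2.0
        mid_lon = (west + east) / 2.0
        digit = 0
        if lat >= mid_lat:      # northern half of the current cell
            digit += 2
            south = mid_lat
        else:
            north = mid_lat
        if lon >= mid_lon:      # eastern half of the current cell
            digit += 1
            west = mid_lon
        else:
            east = mid_lon
        digits.append(str(digit))
    return "".join(digits)


def drill_down(index, cell_prefix):
    """Return the names of indexed items whose key lies inside the given cell."""
    return sorted(name for name, key in index.items()
                  if key.startswith(cell_prefix))
```

Because every key extends the key of its parent cell, moving from a coarse view to a fine one is a matter of lengthening the prefix passed to `drill_down`. A production index would need an equal-area tessellation and a sorted or tree-based store rather than a linear scan; both are left out of this sketch.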

Finding 14: Significant research problems will have to be solved to enable the vision of distributed geolibraries. Research is needed on indexing, visualization, scaling, automated search and abstracting, and data conflation. Research on these issues targeted to improve access to integrated geoinformation might be pursued by the National Science Foundation and other agencies sponsoring basic science, as well as by the National Mapping Division of the USGS, and the National Imagery and Mapping Agency.

Many mechanisms and programs already exist to move this research agenda forward. Examples include the following:

  • The Digital Library Initiative. Funded first in 1994 by NSF, NASA, and DARPA, this program was recently reannounced (www.nsf.gov/pubs/nsf9863/nsf9863.html), and is expected to fund research through 2003. Among the six projects funded by the first round, those at the University of California's Berkeley (elib.cs.berkeley.edu) and Santa Barbara (alexandria.ucsb.edu) campuses are particularly relevant to distributed geolibraries.
  • Digital Earth. As discussed in Chapter 2, Vice-President Gore described a vision of Digital Earth that bears substantial resemblance to distributed geolibraries. In the next few years this vision may develop into a substantial funded research program.
  • Digital Government. NSF recently announced research opportunities in a new program to build stronger ties between the research community in computer and information science and engineering and various government departments with very significant investments in systems and data integration (NSF Program Announcement 98-121). This program may be a suitable vehicle for promoting the research needed to support distributed geolibraries.
  • Knowledge and Distributed Intelligence (KDI). NSF's KDI program announcement (NSF Program Announcement 98-55) has strong relevance to the vision and issues of distributed geolibraries.

  • The August 1998 Interim Report of the President's Information Technology Advisory Committee (www.ccic.gov/ac) called for substantial increases in federal information technology research and development and for a series of virtual expeditions in specific areas. An effort in distributed geolibraries seems to fit the intent of the report well.

In addition to these formal mechanisms, significant research and development activities are under way in the private sector among vendors of GIS software and among defense and intelligence contractors that can be expected to push in the direction of distributed geolibraries over the next few years. For example, the vendors of new commercial space imagery could use systems like distributed geolibraries for the dissemination of their data products to the broad user community. The FGDC is also a potential source of research initiatives in this area, given its relevance to the future dissemination mechanisms of the NSDI.

Many of the research needs identified here are basic in nature, and it may be many years before solutions can be found. On the other hand, some issues, such as the need for better methods of data integration, are so widely recognized, technical in nature, and strongly motivated that significant progress can be expected in a comparatively short period.

Institutional Needs

Although elements of a distributed geolibrary already exist in the form of prototype clearinghouses and other projects, it is easy to lose sight of the broader concept and the degree to which it represents a radical departure from current and past practices as reflected in our institutions and their accepted functions. More specifically:

  • Traditional production and dissemination of geoinformation have been centralized, as functions of the upper levels of government. These arrangements made good sense in the past, but the empowerment that has occurred as a result of the almost universal adoption of information technologies, especially geographic information technologies, over the past two decades has called them into question. Yet such institutions as the national mapping agencies still reflect this legacy. The vision of distributed geolibraries represents a broadly based restructuring of past institutional arrangements for the dissemination of geospatial data and one that is much more bottom-up, decentralized, and voluntary. The institutional arrangements of the WWW provide an excellent model. The implications of distributed geolibraries for intellectual property rights, the library as an institution, and the economics of information use are discussed at length in Chapter 3.
  • Traditional production and dissemination practices for geoinformation have emphasized the horizontal integration of information at the expense of vertical integration. Today it is much easier to obtain and make use of the same type of data for different areas than it is to obtain and make use of different types of data for the same area. A distributed geolibrary would prioritize vertical integration to obtain responses to such queries as "What have you got about there?" Producers and distributors of geospatial data could make it much easier to integrate different types of data. The USGS, for example, could make it easier to obtain digital elevation data, digital topographic data, and digital orthophoto data for the same area. Today that ability is severely impeded by differences in formats and standards, access mechanisms, and organizational structures, as well as in the basic geometric and positional problems associated with varying accuracy and varying definitions of shorelines and other features.
Finding 15: While traditional production of geospatial data has been relatively centralized, the vision of distributed geolibraries represents a broadly based restructuring of past institutional arrangements for the dissemination of geospatial data and one that is much more bottom-up, decentralized, and voluntary.
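
The vertically integrated query discussed above ("What have you got about there?") can be sketched as a search across a catalog that holds different types of data, grouped by theme, for one query footprint. The `Record` layout, the theme labels, and the `about_there` function are hypothetical and do not follow any particular metadata standard; footprints are plain bounding coordinates and are assumed not to cross the 180-degree meridian.

```python
from collections import defaultdict
from typing import NamedTuple


class Record(NamedTuple):
    """A hypothetical catalog entry: one data set and its bounding coordinates."""
    title: str
    theme: str      # for example "elevation" or "orthophoto"
    west: float
    south: float
    east: float
    north: float


def about_there(catalog, west, south, east, north):
    """Group by theme the titles of all records overlapping the query box."""
    hits = defaultdict(list)
    for rec in catalog:
        # Standard bounding-box overlap test on both axes.
        overlaps = (rec.west <= east and west <= rec.east
                    and rec.south <= north and south <= rec.north)
        if overlaps:
            hits[rec.theme].append(rec.title)
    return dict(hits)
```

The result is keyed by theme rather than by source, which is the point of vertical integration: one request about a place returns elevation, imagery, and any other available themes together. The sketch leaves untouched the harder problems named above, such as differing formats, accuracies, and feature definitions.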

Some of these issues are specific to geoinformation and geospatial data, but others are generally applicable to the emerging information society, which is being driven by technological change and by the desire for greater access to information. Lopez and Larsgaard (1998) discuss this relationship between the needs of geospatial data and the broader institutional setting of the evolving digital library. That relationship is complex, and it is clear that distributed geolibraries are part of a larger vision of the digital library of the future. But the central role they give to searches based on location makes them clearly distinct, as do the research problems identified in the previous section.

The development of distributed geolibraries will require a unique set of partnerships between developers of information technologies, geographic information scientists, application domain specialists, and user communities. It is unlikely, therefore, that the vision of distributed geolibraries will be realized through broadly based efforts to research and develop digital libraries in general; instead, efforts are needed that are directed specifically at distributed geolibraries and geoinformation. Funding and coordination are needed to develop prototypes, stimulate basic research, and build partnerships that specifically address the vision of distributed geolibraries.

Measuring Progress

The workshop convened by the Mapping Science Committee (see preface) was designed to help identify a vision of distributed geolibraries and the steps needed to realize that vision. An important element of building distributed geolibraries is, therefore, the measurement of progress: how will we know how much progress has been made and how much remains to be done? In this section we offer some possible bases for measurement.

  • Query-based. If the objective of distributed geolibraries can be expressed in the ability to issue the query "What information is available about there?", a simple measure of progress can be based on the amount of information available to a user of the WWW in response to queries of that nature. Some of the sites listed in Appendix D can already respond to that type of query. A simple measure would be complicated by the various conditions under which information is available, such as cost, intellectual property restrictions, and quality.
  • Analysis-based. Rather than base progress on the availability of data, a more sensitive and powerful measure might be one based on the ability of the user to obtain services that involve analysis. Information that involves processing in its creation from raw data, and information that represents knowledge, can be of more value than the raw data themselves. If distributed geolibraries are to involve a vision of services rather than simple data supply, measures based on the complexity of analysis will be important indicators of progress.
  • Cost-based. One way to assess a traditional library is on the basis of cost: How much does it save its users to have access to resources such as books or databases via libraries in lieu of the user purchasing them? If economics is the real driver of the library system, the same argument can be made about distributed geolibraries: specifically, how much is saved when data are shared rather than re-created in multiple archives?
  • Abstraction-based. Another view of the traditional library is that it is a successful abstraction mechanism, allowing its users to find and retrieve information objects (books) without direct knowledge of their contents, through the mechanisms used by the library to abstract and catalog. One might measure the progress of a distributed geolibrary on this basis, by developing indicators of the amount of work required on the part of the user to find a given item of information. The library has clearly failed if this can be done only by inspecting the contents of every information object in the library.

In addition, progress toward the vision of distributed geolibraries could be measured through the volume of accumulated research results, the sophistication of prototypes, and the lessons learned from each.