SHARING OF SPATIAL DATA
With the increasing availability and use of geographic information systems, many governmental organizations, private companies, and academic researchers have the capacity to greatly expand the quantity, accuracy, and type of spatially referenced data available. With this capacity also comes the potential for substantial duplication of effort or the underutilization of valuable information that often has been created at considerable cost and effort.
RATIONALE FOR A SPATIAL DATA SHARING PROGRAM
The principal objective of a spatial data sharing program is to increase benefits to society arising from the availability of spatial data. The benefits will accrue through the reduction of duplication of effort in collecting and maintaining spatial data as well as through the increased use of this potentially valuable information. The exposure of these data to a wider community of users may also result in improvements in the quality of the data, which will eventually benefit the donor and other users.
The focus of a spatial data sharing program should be on increasing access to spatial data that are collected with the direct or indirect support of public funds. A spatial data sharing program should not displace the role of the private sector in providing value-added products and services associated with the utilization of these spatial data. It should, on the contrary, result in a rich environment for developing new business opportunities and enhancing economic growth.
Examples of Data Sharing Programs
The concept of sharing spatial data is not new. Examples can be found at federal, state, and local levels. The following two examples illustrate the types of efforts and benefits that can be derived from spatial data sharing. These examples cover two important types of baseline spatial data: geodetic detail and land parcel definition. There are many other kinds of spatial data as described elsewhere in this report, and there are enormous opportunities to reduce costs and increase user benefits through joint collection and sharing of spatial data. Data describing the location of a wide variety of phenomena can be shared: soil characteristics, wetlands, wildlife, hydrology, transportation systems, land use, and demographics, to name a few. All can be improved by organized joint efforts for their collection and distribution.
National Geodetic Reference System1
A number of incentives for data sharing and other forms of cooperation appear to have worked, and in some cases, very well. In 1980, for example, the NGS published standards for the submission of geodetic information to NGS. These volumes, known as the "Blue Book" (Federal Geodetic Coordinating Committee, FGCC, 1980, 1989), provided the specific descriptive information and formats for the mandatory and optional data elements for vertical (bench marks) and horizontal control data for inclusion in the National Geodetic Reference System (NGRS). The third volume of the trilogy, covering gravity control data, was published in 1983 (FGCC, 1983).
The Blue Book has evolved over the years in response to changes in surveying technology. For example, Volume 1, Horizontal Control Data, was revised in January 1989 to include Global Positioning System (GPS) data submission and new formats for an improved, unified publication format for the descriptions that accompany published control-point values. As circumstances require, Blue Book requirements have been relaxed to accommodate unique situations. The USGS, in cooperation with the NGS, is currently converting much of its remaining third-order leveling data to computer-readable form so that these data can be incorporated into the North American Vertical Datum of 1988 (NAVD 88). This work is being accomplished with customized NGS software. The 10-year effort will eventually incorporate about 500,000 USGS bench marks into NAVD 88, thus vastly increasing the usefulness of the NGRS both in traditional leveling applications and through improved geoid modeling of regions that would otherwise be deficient.
This data sharing program works because the donors (private, county, state, and other federal organizations) want to ensure the accuracy of the points they observed (or had contractors observe) and earn NGS's stamp of approval as the nation's highest authority on geodetic control. It also provides the mechanism for the publication of officially sanctioned values, the national distribution of these values, and automatic updates of the data when future readjustments of NGRS are performed. Increasing the distribution of geodetic data in turn leads to an increased frequency of reuse of the control points by local and regional users, where each instance saves either private or public funds.
The NGS has received data from outside users for 65,000 horizontal control points since 1980. A total of 36,000 km of geodetic leveling has been submitted to NGS by other organizations since 1980. A similar effort to gather private gravity data is currently under way. These gravity data will also be used to improve geoid height modeling, an essential requirement for accurate GPS-derived orthometric heights. The cost savings to NGS for these horizontal and vertical data can conservatively be estimated at about $79.4 million (65,000 points × $1,000 per horizontal point = $65,000,000 and 36,000 km × $400/km of vertical data = $14,400,000).
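The savings estimate can be reproduced with a short calculation. The Python sketch below is illustrative only: the function and variable names are assumptions introduced here, while the quantities and unit costs are the figures quoted in the text.

```python
# Quantities and unit costs as quoted in the text above.
HORIZONTAL_POINTS = 65_000        # horizontal control points received since 1980
COST_PER_POINT_USD = 1_000        # estimated cost per horizontal point
LEVELING_KM = 36_000              # km of geodetic leveling received since 1980
COST_PER_KM_USD = 400             # estimated cost per km of vertical data


def estimated_savings_usd(points, cost_per_point, km, cost_per_km):
    """Return (horizontal, vertical, total) savings in dollars."""
    horizontal = points * cost_per_point
    vertical = km * cost_per_km
    return horizontal, vertical, horizontal + vertical


horizontal, vertical, total = estimated_savings_usd(
    HORIZONTAL_POINTS, COST_PER_POINT_USD, LEVELING_KM, COST_PER_KM_USD
)
print(f"Horizontal: ${horizontal:,}")
print(f"Vertical:   ${vertical:,}")
print(f"Total:      ${total:,}")
```

The two components sum to $79,400,000, which matches the figure of about $79.4 million given in the text.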
North Carolina Land Records Management Program2
North Carolina has had a very active land records modernization effort since the inception of the Land Records Management Program (LRMP) in 1977. The legislation provided for financial and technical assistance to local governments in the following areas: base maps, cadastral maps, a uniform system of parcel identifiers, and automation of land records.
One of the first major efforts was to develop standards for base and cadastral mapping and to arrive at a uniform parcel identifier. It was beneficial to the program that the Canadian project in the Maritime Provinces was under way and that the report from the National Research Council, The Need for a Multipurpose Cadastre (Committee on Geodesy, 1980), had been published. These two major efforts were used in developing standards. Representatives from most state agencies and local government agencies that would be involved in or benefit from the program participated in developing the standards.
Base maps (orthophoto and planimetric) were prepared on the State Plane Coordinate System, and a geo-coded parcel identification system was developed. Mapping on the State Plane Coordinate System was assisted by the fact that North Carolina had a Geodetic Survey Office in place to establish and maintain a system of horizontal and vertical control monuments across the state. This office often gave priority to local governments in establishing additional control for mapping programs. Most counties elected to develop orthophoto base maps that were used to assist in the development of the cadastral maps. The photo image proved to be quite useful in establishing parcel boundary lines in this metes-and-bounds state. The photo image was also much better understood by the general public than was the planimetric map.
The fourth area of the program—automation—presented some interesting challenges in that many local governments had already begun automating their land records. It did not seem practical or feasible to standardize hardware systems across the state. Instead of standardizing computer systems, North Carolina opted to update, verify, and improve the data base as they developed the cadastral maps. This proved to be quite beneficial because many properties had not recently or ever been surveyed, and conflicts between property boundaries were discovered. In turn, property owners were notified of a potential problem and surveys ensued. The surveys were used to update the cadastral maps and data files.
North Carolina's financial assistance program of up to 50% of the total cost was important in the success of the program but not overriding. The state was willing to provide some seed money and establish the LRMP to provide assistance, which was more important in encouraging local governments to proceed with this major effort than were the funds received. In presentations to local governments, the state always stressed that they should not proceed with a land records project because they would receive matching funds from the state, but because there was a need for more effective and efficient local government. Time has shown that North Carolina local governments have indeed benefitted from this effort.
During this process of assisting local governments in North Carolina, many side benefits were realized. Cooperation between the state and the counties began to evolve. Most of the counties had or were in the process of developing soil maps. This was a cooperative effort by the state, counties, and the SCS. As North Carolina moved into the digital mapping arena, duplication of effort in the soils mapping area was eliminated and sharing of soils files was effected.
In 1977 the state had established a Land Resource Information Service. This program was to establish and maintain a digital land based system for the state. Soils mapping was one layer that was needed and made possible because the counties were well on their way in developing soil maps. As the counties began to enter the digital environment, they agreed to provide the state copies of the soils mapping digital files, thus eliminating the cost to digitize the soils mapping. The state also agreed to furnish local governments copies of soil mapping files that had already been digitized, thus reducing the cost to local governments for this layer in the local data base. Forty-one of North Carolina's counties currently have digital mapping capabilities.
The LRMP also led to improved cooperation between state and federal agencies regarding land records. One legislative study committee learned that a federal agency and a county were having aerial photography flown on the same day at the same scale of the same area. Furthermore, the aircraft were in the same airport. This incident caused the legislative study committee to direct the LRMP to set up a meeting with all federal agencies that might be acquiring aerial photography in North Carolina to achieve greater cooperation between the state and the federal governments. This meeting did take place and proved to be most beneficial. Much greater cooperation was achieved, especially with the SCS and the USFS. This effort was greatly assisted by North Carolina's Geodetic Survey Office, which had a long-time association with the NGS.
A PROPOSED SPATIAL DATA SHARING PROGRAM
The MSC envisions a system that enables digital spatial data collected by nonfederal institutions (e.g., state and local governments and the private sector) to be integrated into the national spatial data coverage. The spatial data sharing program for the NSDI should provide access to digital geographic base data of known quality and currency through a limited number of access nodes linked to a fully decentralized communication system. We envision a system of spatial data servers, owned and operated by institutions authorized by the FGDC, to be the custodians of the data in question. These institutions are regarded as coproducers of the data as the data are produced according to previously agreed-upon standards with mechanisms in place to ensure their quality. To be successful the spatial data sharing program needs to have real benefits or incentives for both the donor and recipient of the data. A conceptual model of the program is shown in Figure 7.1; other details are given with Figure 7.2 and accompanying text.
Types of Spatial Data
The NSDI consists of geographic base data and other spatial data. Base data are a primary geographic spatial reference that is produced to a recognized standard of accuracy and is subjected to certified quality assurance programs. Typically this is the type of data produced by federal and state agencies responsible for cartographic products (see Table 4.2). Other spatial data are available from a variety of producers whose standards for spatial accuracy may not be as rigorous. Data that are of less precise locational control often contain valuable supplementary information that cannot be found from base data sources; these data would include those representing a higher degree of currency or those of a thematic nature. These two types of spatial data can be treated somewhat differently within the NSDI as proposed in Table 7.1.
TABLE 7.1 Treatment of Base Data and Other Spatial Data Within the Proposed Program
Other Spatial Data
Accuracy standards, quality assurance, and independent certification
Statement of estimated accuracy
Cost sharing between co-producers
Include metadata descriptor
Comply with Spatial Data Transfer Standard
One or a few
Spatial data servers
Quality Assurance for Base Data
In the NSDI accuracy standards will need to be set for base data. We anticipate that these standards will continue to be established by the federal agencies with specific responsibility for different geographic features (as per OMB Circular A-16). These standards should be coordinated and disseminated as NSDI standards through the FGDC.
An important aspect of data access and retrieval is knowledge of their existence, contents, and fitness for an application. This knowledge is referred to as metadata, or data about information. Metadata describe the content, ancestry and source, quality, data base schema, and accuracy of data. Metadata support data sharing by providing information on many aspects of spatial data, each aspect having meaning in particular application contexts. Metadata that describe data base contents include data dictionaries and definitions, attribute ranges, and data types. The origin or ancestry of data is critical for ascertaining the validity and suitability of data.
The metadata file descriptors are an important part of the SDTS. The development of the metadata standards and protocols will enable the creation of an easily accessible, networked data base that can be searched, preferably on-line, by the users seeking particular types of information. These metadata bases may also be used in the future to determine data gaps or duplication in the public national data base.
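The kind of searchable metadata base described above can be pictured with a small Python sketch. It is a sketch under stated assumptions, not the SDTS metadata structure: the record fields, the function names, and the sample catalog entries are all hypothetical and were invented here for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Hypothetical metadata descriptor; field names are assumptions."""
    title: str
    theme: str           # content category, e.g. "soils"
    ancestry: str        # origin of the data and how they were derived
    accuracy_m: float    # stated positional accuracy, in metres
    coverage: str        # area covered by the data set
    attributes: dict = field(default_factory=dict)  # data dictionary: name -> type


def search(catalog, theme=None, max_accuracy_m=None):
    """Return records that match the requested theme and accuracy limit."""
    results = []
    for record in catalog:
        if theme is not None and record.theme != theme:
            continue
        if max_accuracy_m is not None and record.accuracy_m > max_accuracy_m:
            continue
        results.append(record)
    return results


def duplicated_coverage(catalog):
    """Return (theme, coverage) pairs described by more than one record."""
    counts = Counter((r.theme, r.coverage) for r in catalog)
    return sorted(key for key, n in counts.items() if n > 1)


# Invented sample entries, used only to exercise the functions.
catalog = [
    MetadataRecord("County soils survey", "soils",
                   "digitized from county soil maps", 12.0, "Example County"),
    MetadataRecord("State soils layer", "soils",
                   "compiled from county digital files", 25.0, "Example County"),
    MetadataRecord("Parcel map", "cadastral",
                   "compiled on orthophoto base maps", 2.0, "Example County"),
]

for record in search(catalog, theme="soils", max_accuracy_m=15.0):
    print(record.title, "-", record.ancestry)
print(duplicated_coverage(catalog))
```

Here `search` stands in for a user query against the metadata base, and `duplicated_coverage` shows how the same records could flag possible duplication of effort; finding gaps would work the same way, by comparing the catalog against a list of required themes and areas.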
NSDI Spatial Data Sharing Program
The proposed spatial data sharing program (Figure 7.1) in many respects represents a combination of elements of both the Geodetic Advisor Program of the National Geodetic Survey System and the North Carolina LRMP (discussed above). Figure 7.2 shows how such a sharing program might be implemented.
State-level spatial data advisors (similar to the existing state-level geodetic advisor) would determine what base data being collected (or planned to be collected) within their state might be suitable for incorporation into the National Geographic Data System (NGDS). An advisor would contact the appropriate federal agency (presumably the lead agency on the FGDC for the data category of concern) to determine whether a given data set might be included in the federal data sets. If the federal agency agrees to consider these additional data, a plan would be developed for providing work sharing or partial reimbursement of costs for the collection of such data. Elements of this approach currently exist on an ad hoc arrangement between some federal agencies and states; these often include either work-share or cost-share agreements. If the data collection is planned, the federal agency would work with the nonfederal entity to build into the data collection program the appropriate standards, accuracy, and quality control of the resulting data. If the nonfederal data sets of interest currently exist, the federal agency would evaluate the potentially donated data sets.
Such data sets would be provided by those who collected the data (potential donors) to the appropriate federal agency in a standard format for a quality assurance and quality control (QA/QC) analysis. If the data set fails to meet the established QA/QC criteria, the data would be returned to the donor. If the data set meets the QA/QC criteria, the next question would be whether generalization of the data (e.g., from a large scale to a smaller standard scale) is necessary. If generalization is needed, the appropriateness of the algorithms would be determined. If these were not acceptable, the data would be returned to the donor. If the generalization algorithms are acceptable or if generalization is not necessary, the federal agency would develop the appropriate metadata files for the data set and incorporate the data set into the national base.
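The review sequence just described can be summarized as a small decision function. The Python sketch below only restates the order of checks given in the text; the names `Outcome` and `review_donated_data_set` are hypothetical, and how an agency would actually judge QA/QC criteria or generalization algorithms is outside its scope.

```python
from enum import Enum


class Outcome(Enum):
    RETURNED_FAILED_QAQC = "returned to donor: QA/QC criteria not met"
    RETURNED_GENERALIZATION = "returned to donor: generalization algorithms not acceptable"
    INCORPORATED = "metadata files developed; data set incorporated into the national base"


def review_donated_data_set(meets_qaqc, needs_generalization,
                            generalization_acceptable=False):
    """Apply the checks in the order given in the text and return the outcome."""
    # Step 1: the data set must meet the established QA/QC criteria.
    if not meets_qaqc:
        return Outcome.RETURNED_FAILED_QAQC
    # Step 2: if generalization is needed, the algorithms must be acceptable.
    if needs_generalization and not generalization_acceptable:
        return Outcome.RETURNED_GENERALIZATION
    # Step 3: otherwise the agency develops metadata and incorporates the data.
    return Outcome.INCORPORATED


print(review_donated_data_set(meets_qaqc=True, needs_generalization=False).value)
print(review_donated_data_set(meets_qaqc=True, needs_generalization=True).value)
```

Note that the generalization check is reached only after the QA/QC check has passed, so a data set that fails QA/QC is returned without its generalization algorithms being examined.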
The incentives for donors to submit their data to be considered in the national base would be threefold. First, a portion of the costs of data collection would be rebated to the collector, the amount being coordinated and negotiated by the state-level advisor with the federal agency; if the data are not yet collected, then work- or cost-sharing arrangements might be struck. Second, the donors would have the assurance that the data they collected (or had contractors collect) meet accepted national standards and have been subjected to an independent QA/QC analysis. Third, the program provides the mechanism for the broad national distribution of these data and other data in addition to updates of the data when future revisions are made from other donor sources.
A number of questions remain unanswered, all of which will require further analysis in developing a workable data sharing system. These include (1) How should the state data advisors be funded? (2) Can a single advisor handle the flow of data from the respective state? (3) What scales should be allowed? (4) What agencies should be responsible for the QA/QC? Should it be the lead agency designated by FGDC? (5) Should that same agency be responsible for developing the metadata files? (6) Who returns a portion of the data collection costs to the data donor? Is this a function of the agency that has stewardship of a given data type or layer within a broad National Geographic Data System (FGDC, 1991)?
Guidelines for System Implementation
Currently there are many federal government agencies providing their data to the public on a timely basis and at a nominal cost. The quantity and quality of information that can be obtained through federal government agencies is extremely rich and has been used in countless applications. For example, the availability of TIGER geography, national base mapping (USGS), and other national and international spatial data (through NOAA, SCS, DMA, and others) has provided many individual users and organizations with an opportunity to leverage their own activities and avoid unnecessarily duplicating effort.
Clearly all federal government agencies should be participants in a spatial data sharing program. Similarly, state agencies and local governments involved in the collection of base data (see Table 4.2) could and should be active participants. For example, local governments are most directly involved with new street names and address ranges.
Apart from government agencies funded directly by tax revenues and private companies or academic institutions that may gather spatial data with the use of direct public funding, there should be no requirement that compels any organization to become a data donor. Public-spirited private companies and academic institutions may voluntarily elect to participate in a spatial data sharing program as a result of their own belief in the value and importance of sharing certain data with the rest of society; the exchange of such data for cost-equivalent access to other shared data would be another incentive for private companies.
What Data Are Donated?
A spatial data sharing program should place special emphasis on the collection and dissemination of primary or base data. What do we mean by base data? Are TIGER data, for example, a base data set or are they partly derived (from DLG) data? These are thorny issues that will arise with any sharing program, and it may not be prudent to attempt to exclude any type of data from the program. Rather, the emphasis may need to be placed on ensuring that the ancestry of any spatial data is unambiguously documented so that the user is fully aware of its origins and limitations. In some instances there may be some justification to disseminate spatial data that are clearly secondary (i.e., derived from a primary data source). The justification for dissemination of these data may be that the process of transforming the primary data into secondary data is very time consuming and that most work is with the data in the secondary form.
There are, however, certain types of base data that help form the structural backdrop for a large number of currently collected spatial data. In defining the base data for a NSDI, the information that is basic to one user may be selective to another and vice versa. In this report we refer to base data as the information required to establish a basic reference to the Earth's surface. To this basic data set can be added features, attributes, and other intrinsic information. The basic data set itself defines a clear framework for referencing data from many sources. In defining base data it quickly becomes apparent that accuracy and detail of content vary by use and scale of operation. Therefore, we have selected four levels to be used in reference to base data (see Table 4.2). Finally, for brevity and understanding we avoided a detailed set of specifications and approached the task by relating it to map scales, accuracies, and content. We realize that certain digital data are multiscale, but accuracies and content are determined by scale.
A very important dimension of the spatial data sharing program would be the emphasis placed on adhering to standards. The program should endorse one or two standard data transfer formats as the official currency of trade in the program (e.g., SDTS, VPF). There will be some questions here as to whether large quantities of information available in previously used government standards should be or need to be converted to any new standard before becoming available through the spatial data sharing program. This may require a transition strategy. In the long term, the spatial data sharing program should encourage the use of a limited number of standards. The standards should be publicly accessible and not proprietary. GIS software companies will eventually respond through the normal forces of the market and enable their software to both read and write to the standards required by the spatial data sharing program.
In addition to a file format standard, there needs to be a metadata file describing the content, ancestry, quality, and accuracy of the data being made available. Such information is proposed to be part of SDTS. Consideration should be given to embedding the metadata file descriptor in the program's selected transfer standard. The development of the metadata standard will enable the creation of data bases that can be accessed and searched by users seeking particular types of information. These metadata bases may also be used to determine where there may be data gaps or duplication in the public national data base. Until a few years ago, strong arguments could have been made for a centralized catalog or metadata directory. Computer networks now being developed through such programs as the National Research and Education Network (NREN) of the High Performance Computing and Communications Program will enable the establishment of a distributed metadata directory system (assuming that standards and protocols are invoked). These networks can also provide a mechanism for data distribution; however, the desirability of this depends on the data transfer rates and on the slowest component on the network.
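The point about the slowest component can be made concrete with a rough estimate: on a path made up of several links, throughput is limited by the slowest one. The Python sketch below is a simplification that ignores protocol overhead, latency, and contention; the function name, file size, and link rates are hypothetical values chosen for illustration, not figures from this report.

```python
def transfer_time_seconds(size_megabytes, link_rates_mbps):
    """Estimate transfer time when the slowest link limits throughput.

    size_megabytes  -- size of the data file, in megabytes
    link_rates_mbps -- rate of each link on the path, in megabits per second
    """
    if not link_rates_mbps:
        raise ValueError("at least one link rate is required")
    bottleneck_mbps = min(link_rates_mbps)   # slowest component on the path
    size_megabits = size_megabytes * 8       # 8 bits per byte
    return size_megabits / bottleneck_mbps


# Hypothetical path: a fast backbone, a slow regional link, a local network.
seconds = transfer_time_seconds(100, [45.0, 1.5, 10.0])
print(f"Estimated transfer time: {seconds / 60:.1f} minutes")
```

In this made-up case the backbone rate is irrelevant: the regional link alone determines the estimate, so upgrading any other link would not shorten the transfer.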
The spatial data sharing program may provide access to other components (other than data) of the infrastructure. Some of these components might include people, software, and facilities. The spatial data sharing program should be designed to accommodate such components in the future. An exception to this might be the availability of data translator software, for example, which converts spatial data in a file format not supported under the spatial data sharing program into an acceptable format. This would enable organizations with very large inventories of spatial data in unacceptable format to allow users to convert this information with their own time and effort.
What Are the Incentives and Requirements to Donate and Distribute?
The incentive to donate information to the spatial data sharing program will in part have to be driven by a public sense of responsibility and a recognition that in many instances the beneficiaries of the program will be the data donors themselves. In turn the donors will be able to reduce their costs by avoiding the collection of redundant and duplicate information. Additionally, the donors might receive a rebate to help offset the costs associated with the data collection or assistance through work- or cost-share programs. An important incentive for those participating in sharing of base data might be an independent assessment of data quality.
Federal agencies have little incentive to incur the incremental expense of adhering to standards and coordinating activities with other agencies when undertaking a spatial data generation program. There is currently no means by which the potential users of those data could share in the additional incurred costs, and therefore in periods of tight budgets agencies tend to do the minimum that is necessary to perform their basic mission.
In this environment it may be helpful if the federal government adopted a general statement of policy that spatial data created by any federal agency be made available in accordance with standards. Other agencies requesting the data should be prepared to bear the additional costs of adherence to standards. To ensure compliance this policy should be made a part of each agency's annual appropriations. The benefit of the standardization of data to all governmental agencies (federal, state, and local) and to the private sector is such that this incremental cost will be recovered to the federal treasury over time as direct savings in government programs and in increased efficiency in the private sector. This assertion is borne out in the many studies done on benefits of the use of GIS (see Chapter 3).
Additionally, the FGDC should assure the OMB that proposed federal programs that will gather significant quantities of spatial data will not duplicate data that already exist. The OMB, in its budget-examining role, should rely on the FGDC's assurance that proposed federal programs for spatial data are nonduplicative.
The federal government could substantially influence participation of state and local governments by making it a requirement of its numerous grant programs that if spatial data are collected, these data are made available through the spatial data sharing program.
What Are the Rules for Usage?
Access to spatial data under the sharing program should be available to everyone. In some instances agencies may wish to restrict the usage of spatial data to particular target groups, but it can be argued that the administrative effort required to establish the credentials or appropriate conditions for exclusive access is too great and adds unnecessary costs. The essence of the spatial data sharing program should be to disseminate information that by its very nature is in the public domain.
In this respect there should also be no restriction for usage. Private companies should be able to freely access this information to build value-added products or services. Although no restrictions would apply, it may be an advantage of the program to ensure that organizations, private or public, that use data under the spatial data sharing program must acknowledge the donor (by adopting the metadata file descriptor) so that the consumer can be fully informed of its origins.
The FGDC needs to further investigate the legal liability for the quality and accuracy of donated spatial data. This is an extremely important dimension that needs to be addressed promptly so that it does not become a constraint to the success of the program.
Who Supports the Users?
The availability of spatial data will result in consumer questions about the data. Questions may be associated with the technical format of the data or relate to the data content. The increased availability could result in an initial additional burden being placed on many organizations to answer questions about their data and data-collection activities. This is an inevitable consequence of increased public scrutiny and awareness of spatial data. Some organizations participating in the program will find that they need to place more effort in improving data dictionaries and other documentation regarding their data products. Although initially there may be some difficult adjustments, in the long term a spatial data sharing program will lead to an improvement in the value and accuracy of donor data-gathering activities. In addition, for data sharing to succeed beyond the original use of the data, a mechanism for continued data maintenance must be built into any support mechanism.
In some instances, agencies may be reluctant to participate in the program because of their concern that the increased public awareness of their spatial data may disrupt their ongoing programs. Although the MSC has some sympathy for the management of these organizations, the public's right to know and use these data must be the paramount consideration.
MECHANISMS FOR IMPLEMENTING A SPATIAL DATA SHARING PROGRAM
The proposed spatial data sharing program must do more than just disseminate spatial data collected by federal agencies. The richness and utility of the program are substantially enhanced by having participation of donors from state and local governments, academia, and the private sector. Unfortunately, there is no current mechanism in place for such participation in the operation of a program of this nature. A challenge in establishing the spatial data sharing program is, therefore, to determine whether this can occur without a formal organizational structure or, if necessary, what the optimum structure would be. An additional and formidable challenge is how the spatial data sharing program should be funded to be successful.
The FGDC is the federal program with the objectives and intent most comparable with this program, albeit without funds. The MSC believes that at this time the FGDC should assume the initial leadership role to embrace the broader scope of this proposed spatial data sharing program.
The FGDC should establish a data sharing committee with the objective of providing the policy making and leadership to launch, maintain, and operate the proposed program. Membership of the committee would consist of representatives from the federal community (FGDC) as well as appointed or invited representatives from state and local governments, academia, and the private sector.
The data sharing committee would not be responsible for any operational programs other than establishing policies, monitoring and evaluating the performance of the program, and communicating the existence and value of the program. The principal policy areas that the committee should address include the following:
data standards policy,
distribution policy, and
cost sharing policy.
The data standards policy responsibilities would include defining from a technical viewpoint the proposed metadata model for describing and categorizing spatial data donated under the program. Under this umbrella the committee could also select those federal data standards that would be accepted under the program (SDTS, TIGER, etc.). Finally, and at a high level, the committee may also wish to endorse the dissemination of certain federal conversion utilities that exchange data from one federal standard to another.
The depository policy role of the committee should be to establish guidelines for how federal agencies should participate in and comply with the program. Guidelines or recommended regulations could also be drafted for how federally funded cost-shared projects and programs ought to make spatially referenced data available. Guidelines, legal conditions, and liability limitations for organizations and agencies volunteering to deposit data under the program could also be documented.
The distribution policy for the spatial data sharing program would be defined for both the metadata base and the specific geographic data files. Costs associated with fulfilling the distribution of the data should be borne by the end user (consistent with current federal policy on data distribution). Guidelines for technical user support should also be specified under the distribution policy.
The operation of the spatial data sharing program will require some financial resources, but the bulk of the operational costs should be borne by the donors and recipients of data from the program. The overhead associated with the operation of the committee and the maintenance and distribution of the metadata base should be funded initially through the FGDC.
The first few years of the implementation of such a program will undoubtedly expose many issues and difficulties, some of which may not be easy to resolve or reconcile. These difficulties should not be permitted to detract from the central theme of this initiative: that a cooperative environment can greatly benefit the nation. Although some initial financial support from federal agencies may be required to initiate this program, the MSC believes that the benefits of this investment will greatly outweigh the initial costs.
SPATIAL DATA CATALOGS
As mentioned previously, one of the needs for a robust NSDI is a mechanism for identifying the full range of spatial data collected, where the data are stored, who controls access to the data, the data content, the metadata, and the areas of data coverage. Spatial data catalogs provide an important component of the NSDI, one that can be established by using distributed computer networks.
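As an illustration only, a catalog entry carrying the items listed above might be modeled as a small record with a bounding box for the area of coverage. The field names, the sample entries, and the `find_by_area` helper are hypothetical and are not drawn from any NSDI or FGDC specification. The sketch also assumes coordinates in decimal degrees and ignores coverage areas that cross the 180° meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Geographic extent in decimal degrees (assumed convention)."""
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other: "BoundingBox") -> bool:
        # Two boxes overlap unless one lies entirely to one side of the other.
        return not (
            self.east < other.west or other.east < self.west
            or self.north < other.south or other.north < self.south
        )


@dataclass(frozen=True)
class CatalogEntry:
    title: str             # what the data set is
    storage_site: str      # where the data are stored
    custodian: str         # who controls access to the data
    content: str           # short description of the data content
    coverage: BoundingBox  # area of data coverage


def find_by_area(catalog, area):
    """Return the entries whose coverage overlaps the area of interest."""
    return [entry for entry in catalog if entry.coverage.intersects(area)]


# Hypothetical sample entries, for illustration only.
catalog = [
    CatalogEntry("County parcels", "county server", "County assessor",
                 "land parcel boundaries", BoundingBox(-90.0, 43.0, -89.0, 44.0)),
    CatalogEntry("State wetlands", "state archive", "State resources agency",
                 "wetland polygons", BoundingBox(-93.0, 42.0, -87.0, 47.0)),
    CatalogEntry("Coastal soils", "university lab", "University soils group",
                 "soil characteristics", BoundingBox(-78.0, 34.0, -76.0, 36.0)),
]

area_of_interest = BoundingBox(-89.5, 43.2, -89.2, 43.5)
hits = find_by_area(catalog, area_of_interest)
print([entry.title for entry in hits])
```

A user interested in a small area would thus learn that both a county and a state data set cover it, along with where each is stored and who controls access.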
Distributed Data Catalogs
Software can provide on-line search capabilities for catalogs of spatial and other data that are resident on a computer connected to a network of other computers. Such a capability, however, requires that the data catalogs be accessible through a standard protocol on the servers. The goal is to have information searches behave coherently across different services.
One such program is WAIS (Wide Area Information Servers), which is a public-domain software program developed jointly by Dow Jones News Service, Peat Marwick, Apple Computer, Thinking Machines, Inc., and others. WAIS uses the Z39.50 protocol and the Internet computer network of networks to scan, search, and often access existing data bases. The Z39.50 standard or protocol that allows WAIS and other software to search distributed data bases is a product of the National Information Standards Organization (NISO), which is accredited by the American National Standards Institute (ANSI). Z39.50 is fully compatible with the NISO standard for library catalogs (Z39.2), originally promulgated by the Library of Congress and known as MARC (MAchine Readable Cataloguing), and has a corresponding International Organization for Standardization (ISO) standard. Computer-to-computer interchanges, whether components of the Z39.50 protocol or of the content being delivered, are precisely represented in a standard computer language known as Abstract Syntax Notation One (ASN.1).
WAIS implements Z39.50 in a client/server mode of computer interaction. In a typical search for textual information, the client software prompts the user to select which information sources to include in the search and to enter a search request. Once the search request is entered, the client software converts the search words to the standard information retrieval protocol (Z39.50) and presents the search request in turn to each server associated with a selected source. The server software takes the words and matches them to the contents of all documents in each selected source. The client software receives search results from all of the servers and presents to the user a list of all document or data base titles found. When requested by the user, the client software requests the server to pass the full contents of the document or metadata file and presents the document to the user.
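The search sequence described above can be sketched in a few lines of code. This is a minimal in-memory illustration, not WAIS itself: the `Server` and `Client` classes, the sample sources, and the simple word-count ranking are all assumptions made for the example, and no network traffic or Z39.50 encoding is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Stand-in for an information server holding one source of documents."""
    name: str
    documents: dict = field(default_factory=dict)  # title -> full text

    def search(self, words):
        # Score each document by how many of the query words it contains.
        results = []
        for title, text in self.documents.items():
            tokens = set(text.lower().split())
            score = sum(1 for word in words if word in tokens)
            if score:
                results.append((score, title))
        return results

    def retrieve(self, title):
        return self.documents[title]


class Client:
    def __init__(self, servers):
        self.servers = {server.name: server for server in servers}

    def search(self, selected_sources, request):
        words = request.lower().split()
        merged = []
        # Present the request in turn to the server for each selected source.
        for name in selected_sources:
            for score, title in self.servers[name].search(words):
                merged.append((score, name, title))
        # Best matches first; ties are broken by source name and title.
        merged.sort(key=lambda r: (-r[0], r[1], r[2]))
        return merged

    def fetch(self, source, title):
        # Ask the server to pass the full contents of one document.
        return self.servers[source].retrieve(title)


# Hypothetical sources, for illustration only.
usgs = Server("usgs", {
    "Wetlands inventory": "wetlands hydrology survey of river basins",
    "Gravity net": "gravity control data",
})
noaa = Server("noaa", {"Coastal hydrology": "coastal hydrology and tides"})

client = Client([usgs, noaa])
results = client.search(["usgs", "noaa"], "wetlands hydrology")
full_text = client.fetch("noaa", "Coastal hydrology")
print(results)
```

The client does the merging and presentation; each server only matches words against its own documents and returns titles until the full contents are requested.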
The use of such software evokes the experience of using a library. A library user may begin by consulting a card catalogue or index or by asking a reference librarian. At this point, the user is searching for documents based on a few key words (e.g., subject, title) or names (e.g., author). The user reviews the documents found and may note other key words or names that could lead to additional relevant documents. A feedback situation develops as the user modifies subsequent searches based on results found in prior searches. Ideally, the user stops searching when all the most relevant documents are found.
Information servers using WAIS can be registered to a Directory of Servers currently maintained on the Internet by Thinking Machines. The registration entry includes text information about the contents of the sources reachable through the server, and this information is itself indexed for searching. Also listed is information that will be used by the client software to contact the server (e.g., TCP/IP node name) as well as information on what charges apply and how to pay them if use of the server is not free. Indexing of text to create an information source is fairly rapid: a 30-megabyte file was indexed in about 20 minutes on a Data General Aviion (Eliot Christian, USGS, personal communication, 1992).
Any server capable of responding to Z39.50 information retrieval requests can be an information server. Information servers can be local (on the workstation or local area network) as well as remote (accessible now via TCP/IP, in the future via X.25 networks or asynchronous dial-up). WAIS does not require any central coordination unless the server is to be advertised through the Directory of Servers. In fact, an information server registered to the Directory of Servers can itself act as a subordinate directory of servers administered locally. By describing sources under various directories of servers, it is possible to organize the sources in whatever relationships make sense and yet allow users to search as many sources as desired.
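The idea of a directory of servers that can itself list subordinate directories lends itself to a recursive structure. The following is a rough sketch under stated assumptions: the `ServerEntry` and `Directory` classes, their fields, and the `.example` host names are invented for illustration and do not reflect the actual format of the WAIS Directory of Servers.

```python
from dataclasses import dataclass, field


@dataclass
class ServerEntry:
    """Registration entry for one information server (hypothetical fields)."""
    host: str             # contact information, e.g., a TCP/IP node name
    description: str      # free text about the sources, searched by keyword
    charges: str = "free"


@dataclass
class Directory:
    """A directory of servers that may also list subordinate directories."""
    name: str
    entries: list = field(default_factory=list)
    subdirectories: list = field(default_factory=list)

    def find(self, keyword):
        """Yield (directory path, entry) for every matching registration."""
        yield from self._find(keyword.lower(), self.name)

    def _find(self, keyword, path):
        for entry in self.entries:
            if keyword in entry.description.lower():
                yield path, entry
        # Descend into locally administered subordinate directories.
        for sub in self.subdirectories:
            yield from sub._find(keyword, f"{path}/{sub.name}")


# Hypothetical registrations, for illustration only.
earth = Directory("earth-science", entries=[
    ServerEntry("esdd.example", "earth science data directory"),
    ServerEntry("glis.example", "land information and satellite data"),
])
top = Directory("top",
                entries=[ServerEntry("library.example", "library catalog records")],
                subdirectories=[earth])

matches = [(path, entry.host) for path, entry in top.find("data")]
print(matches)
```

Because each subordinate directory is searched through the same interface as the top-level one, sources can be grouped in whatever way suits their administrators without limiting what a user can reach.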
One feature of WAIS that is allowed but not required by the Z39.50 protocol is that the client/server interaction is stateless: at the application level each request from the client to the server is a separate process that is not associated with any previous request. The server does not maintain information about the client between requests. This feature is very significant for situations in which a user may want to search hundreds of sources on dozens of servers at a sitting.
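A stateless interaction of this kind can be illustrated with a request handler that relies only on the contents of each request. The request layout, the `handle_request` function, and the sample documents below are assumptions for the sake of the example and do not represent the Z39.50 message format.

```python
# Hypothetical document store, for illustration only.
DOCUMENTS = {
    "doc-1": "geodetic control data for horizontal networks",
    "doc-2": "land parcel records and cadastral maps",
    "doc-3": "vertical geodetic control data",
}


def handle_request(request):
    """Answer one request using only what the request itself contains.

    No session table or per-client memory is consulted or updated, so any
    request can be answered in isolation from those that came before it.
    """
    if request["type"] == "search":
        words = request["query"].lower().split()
        return sorted(
            doc_id for doc_id, text in DOCUMENTS.items()
            if all(word in text.split() for word in words)
        )
    if request["type"] == "retrieve":
        return DOCUMENTS[request["doc_id"]]
    raise ValueError(f"unknown request type: {request['type']}")


first = handle_request({"type": "search", "query": "geodetic control"})
# The follow-up request names the document explicitly; the server does not
# remember that a search took place.
second = handle_request({"type": "retrieve", "doc_id": first[0]})
print(first, second)
```

Because the server keeps nothing between requests, the cost of serving a client does not grow with the number of servers that client happens to be consulting in the same sitting.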
Information servers provide access to the information sources placed on them. These sources are compilations that may include a variety of formats. Such formats are known as document types, although information need not be textual. Although all Z39.50 clients and servers support search and retrieval of textual information, support for other document types that may have been registered in Z39.50 is negotiated when the client initiates a relation with a server.
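The negotiation of document types can be pictured as finding the types that both sides support, with text as the common baseline. This is a simplified sketch: the `negotiate` function and the type labels are hypothetical, and the actual Z39.50 negotiation mechanism is not reproduced here.

```python
TEXT = "text"  # assumed baseline supported by every client and server


def negotiate(client_types, server_types):
    """Return the document types both sides can handle, in client order."""
    agreed = [t for t in client_types if t in server_types]
    if TEXT not in agreed:
        agreed.insert(0, TEXT)  # text is always available as a fallback
    return agreed


# Hypothetical type labels, for illustration only.
agreed = negotiate(["text", "marc", "gif"], {"text", "marc", "postscript"})
print(agreed)
```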
When sources are created, defining the document types allows the server to use the appropriate translation between the specific query format of the source and the Z39.50 protocol. The public domain WAIS package includes assistance in creating information sources and provides indexing software for several common document types consisting of text, graphics, and bibliographic references in MARC. Source code in the C programming language is provided for adding other document types. If access to other data structures is required, the server interface routines are also designed to be customized. A typical customization would be to use search requests to access a relational data base using Structured Query Language.
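The customization mentioned last, in which a search request is translated into a Structured Query Language statement, might look roughly like the following. The sketch uses Python and an in-memory SQLite data base rather than the C interface routines of the WAIS package; the table layout, the sample rows, and the `build_query` and `search` functions are all assumptions made for illustration.

```python
import sqlite3


def build_query(words):
    """Translate search words into a parameterized SQL statement."""
    clauses = " AND ".join("lower(description) LIKE ?" for _ in words)
    sql = f"SELECT title FROM datasets WHERE {clauses} ORDER BY title"
    params = [f"%{word.lower()}%" for word in words]
    return sql, params


def search(conn, request):
    """Run a free-text request against the relational data base."""
    words = request.split()
    if not words:
        return []
    sql, params = build_query(words)
    return [row[0] for row in conn.execute(sql, params)]


# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (title TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO datasets VALUES (?, ?)",
    [
        ("Soil survey", "Soil characteristics by county"),
        ("Wetland map", "Wetlands and hydrology by county"),
        ("Road network", "Transportation systems"),
    ],
)

titles = search(conn, "county hydrology")
print(titles)
```

The search words are passed as bound parameters rather than pasted into the statement, so that a user's input cannot alter the structure of the query. A search word containing `%` or `_` would still act as a wildcard in this simple version.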
The USGS is using WAIS to enhance the Earth Science Data Directory (ESDD). The ESDD is maintained as a source for references to earth science data, including many at the state level and a comprehensive list of data holdings relevant for arctic research. WAIS is especially appropriate for that application, because the ESDD user community ranges from local citizenry to international global change researchers. The ability of WAIS to place the ESDD in the context of other USGS and external information sources is especially powerful for these users. The USGS intends to publish (and maintain) in the WAIS Directory of Servers a subordinate directory of servers focused on earth science data and information.
The USGS is adding features required for ESDD (phrase searching, location searching, and key word searching within fields), which can be accommodated within WAIS. The USGS is also including the ability for a user of the client software to drop from a WAIS session into an automated log-in to existing data systems, such as the Global Land Information System (GLIS). With this approach, users of the USGS/WAIS client software would be able to access any Z39.50 server but would have additional capabilities when accessing one of the USGS servers. The USGS is also using WAIS to access a clearinghouse of USGS spatial data holdings.
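Two of the search features named above, phrase searching and key word searching within fields, can be illustrated with simple matching functions. These are hypothetical sketches and not the USGS implementation; the record layout and function names are invented for the example, and location searching is left out.

```python
def phrase_match(text, phrase):
    """True if the words of the phrase occur consecutively in the text."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return n > 0 and any(
        words[i:i + n] == target for i in range(len(words) - n + 1)
    )


def field_keyword_match(record, field_name, keyword):
    """True if the key word occurs as a word within one named field."""
    return keyword.lower() in record.get(field_name, "").lower().split()


# Hypothetical directory record, for illustration only.
record = {
    "title": "Global land cover data",
    "abstract": "Land cover classes derived from satellite imagery",
}

print(phrase_match(record["title"], "land cover"))
print(field_keyword_match(record, "abstract", "satellite"))
```

A phrase search is stricter than a search on the same words taken separately, and a field-restricted search lets a user distinguish, for example, a data set about satellites from one merely derived from satellite imagery.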
The ESDD interfaces with the interagency Global Change Master Directory, a single interagency source for references to key global change data. The global change data management community is considering WAIS as an adjunct to the Global Change Master Directory. With WAIS as a data directory tool, it is possible to rapidly correlate the Global Change Master Directory with existing data directories relevant to global change research. For example, NOAA has a directory with about 25,000 data set references, and the Inter-university Consortium for Political and Social Research has another directory referencing about 28,000 data sets.
The ability of WAIS to handle different information sources through a single user interface makes it possible for researchers to explore publications and data sets concurrently. The federal research libraries involved in global change research (primarily NASA, NOAA, USGS, and USDA) are very interested in the potential for WAIS to bridge between the data and information worlds. Also, WAIS is seen as a useful way to connect textual information with a data system. For example, when a user is researching an existing data set, it would be useful to provide immediate access to all of the associated documentation about that data set. With the use of WAIS, the associated documentation could extend beyond the data set itself to include publications that reference the data set or engineering specifications of the instruments used.