Geospatial Databases and Data Mining
Spatiotemporal data, dynamic data, and location-aware computing present important opportunities for research in the geospatial database and data mining arenas. Current database techniques use very simple representations of geographic objects and relationships (e.g., point objects, polygons, and Euclidean distances). Data structures, queries, indexes, and algorithms need to be expanded to handle other geographic objects (e.g., objects that move and evolve over time) and relationships (e.g., non-Euclidean distances, direction, and connectivity) (Miller and Han, 2001). One of the most serious challenges is integrating time into database representations. Another is integrating geospatial data sets from multiple sources (often with varied formats, semantics, precision, coordinate systems, and so forth).
Data mining is an iterative process that attempts to extract from data useful information, patterns, and trends that were previously unknown. Although data mining is a relatively new area of research, its roots lie in several more established disciplines, including database management, machine learning, statistics, high-performance computing, and information retrieval. The main impetus behind the growth of data mining was the need to synthesize huge amounts of data into knowledge. Despite the importance and proliferation of geospatial data, most research in data mining has focused on transactional or documentary data.1
This chapter explores the current state of research and key future challenges in geospatial databases, algorithms, and geospatial data mining. Advances in these areas could have a great effect on how geospatial data are accessed and mined to facilitate knowledge discovery.
TECHNOLOGIES AND TRENDS
This section outlines key developments in database management systems and data mining technologies as they relate to geospatial data.
Database Management Systems
The ubiquity and longevity of the relational database architecture are due largely to its solid theoretical foundation, the declarative nature of the query processing language, and its ability to truly separate the structure of the data from the software applications that manipulate them. With the relational model it is possible for applications to manipulate data—query, update, add new information, and so forth—independent of the database implementation. This abstraction of the database to a conceptual model is the hallmark of all modern database technologies. By separating the application logic from the database implementation, the model makes it possible to accommodate changes—for example, in the physical organization of the data—without disturbing the application software or the users’ logical view of the data. This separation also means that efforts made to optimize performance or ensure robust recovery will immediately benefit all applications.
Over the past two decades, the relational model has been extended to support the notion of persistent software objects, which couple data structures to sets of software procedures referred to as methods. Many commercial applications rely on simple data types (e.g., integers, real numbers, date/time, and character strings) and do not require the functionality provided by software objects and their methods.2 Geodata, however,
typically require powerful software objects to implement the rich behavior demanded by geospatial applications. Typical geospatial operations include “length,” “area,” “overlap,” “within,” “contains,” and “intersects.” Geographic Information Systems (GIS) have employed relational database management systems for years and more recently have begun to use the object-relational database management system (DBMS).3 However, exchanging data between systems is difficult because of the lack of accepted standards,4 and the multitude of proprietary formats and data models used in geospatial applications.
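As a concrete illustration, two of these operations can be sketched in plain Python (a hedged sketch; an object-relational DBMS would expose them as methods on a geometry data type, typically backed by spatial indexing):

```python
# Illustrative sketch: the "area" and "contains" predicates for a simple
# polygon, using the shoelace formula and ray casting. A production system
# must also handle holes, boundary points, and degenerate geometries,
# which this sketch ignores.

def polygon_area(vertices):
    """Area of a simple polygon given as a list of (x, y) tuples."""
    n = len(vertices)
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1  # shoelace cross terms
    return abs(total) / 2.0

def polygon_contains(vertices, point):
    """Ray-casting test: does the polygon contain the point?"""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Count crossings of a horizontal ray extending to the right.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(polygon_area(square))              # 16.0
print(polygon_contains(square, (2, 2)))  # True
print(polygon_contains(square, (5, 2)))  # False
```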
Geospatial Data Mining Tasks
The goal of data mining5 is to reveal some type of interesting structure in the target data. This might be a pattern that designates some type of regularity or deviation from randomness, such as the daily or yearly temperature cycle at a given location. Data mining may be structured using a top-down or bottom-up approach. Generally, a top-down approach is used to test a hypothesis; the most challenging aspect is the development of a good model that can be used to validate the premise. For example, patterns can be described in some form of statistical model that is fitted to the data, such as a fractal dimension for a self-similar data set, a regression model for a time series, a hidden Markov model, or a belief network. A bottom-up approach, on the other hand, searches the data for frequently occurring patterns or behaviors—or, conversely, anomalous or rare patterns. Most of the examples of geospatial applications described in this report tend to follow a bottom-up approach of exploratory analysis6 (and visualization) of results from computational models.

The Open GIS Consortium (OGC), whose members are leading geospatial vendors, users, and consultants, has published a standard describing the data types and their methods that should be implemented within an object-relational database system to support geospatial applications (OGC Simple Features for SQL).

For example, the Office of Management and Budget (OMB) recently announced a revision to Circular No. A-16 (which describes the responsibilities of federal agencies with respect to coordination of surveying, mapping, and related spatial data activities) to standardize geospatial data collected by the government. OMB argues that the lack of standard definitions of terms (e.g., scientists may differ on the distinction between a brook and a creek) has become a barrier to sharing data among organizations. Features such as boundaries, hydrography, and elevation will be included in the list of standard terms (Bhambhani, 2002). For more information on Circular No. A-16, see <http://www.whitehouse.gov/omb/circulars/a016/a016.html>.

The committee notes that because there are no generally accepted standards for data mining terminology, other papers and books may use different terms for the concepts expressed in this report.
Geospatial data mining is a subfield of data mining concerned with the discovery of patterns in geospatial databases. Applying traditional data mining techniques to geospatial data can result in patterns that are biased or that do not fit the data well.7 Chawla et al. highlight three reasons that geospatial data pose new challenges to data mining tasks: “First, classical data mining…deals with numbers and categories. In contrast, spatial data is more complex and includes extended objects such as points, lines, and polygons. Second, classical data mining works with explicit inputs, whereas spatial predicates (e.g., overlap) are often implicit. Third, classical data mining treats each input to be independent of other inputs whereas spatial patterns often exhibit continuity and high auto-correlation among nearby features.”8 Chawla et al. suggest that data mining tasks be extended to deal with the unique characteristics intrinsic to geospatial data.
There are many different data mining tasks and many ways to categorize them. A thorough survey9 of geospatial data mining tasks is beyond the scope of this report; instead, the committee chose to highlight four of the most common data mining tasks: clustering, classification, association rules, and outlier detection.
“Clustering” attempts to identify natural clusters in a data set. It does this by partitioning the entities in the data such that each partition consists of entities that are close (or similar), according to some distance (similarity) function based on entity attributes. Conversely, entities in different partitions are relatively far apart (dissimilar). Because the objective is to discern structure in the data, the results of a clustering are then examined by a domain expert to see if the groups suggest something. For example, crop production data from an agricultural region may be clustered according to various combinations of factors, including soil type, cumulative rainfall, average low temperature, solar radiation, availability of irrigation, strain of seed used, and type of fertilizer applied. Interpretation by a domain expert is needed to determine whether a discerned pattern—such as a propensity for high yields to be associated with heavy applications of fertilizer—is meaningful, because other factors may actually be responsible (e.g., if the fertilizer is water soluble and rainfall has been heavy). Many clustering algorithms that work well on traditional data deteriorate when executed on geospatial data (which often are characterized by a high number of attributes or dimensions), resulting in increased running times or poor-quality clusters.10 For this reason, recent research has centered on the development of clustering methods for large, highly dimensioned data sets, particularly techniques that execute in linear time as a function of input size or that require only one or two passes through the data. Recently developed spatial clustering methods that seem particularly appropriate for geospatial data include partitioning, hierarchical, density-based, grid-based, and cluster-based analysis.11

There are also important issues on how to make decisions, using the collected and mined geospatial data. Although this topic (called “confirmatory” analysis in statistics) is very important, the committee focused on “exploratory” analysis of data mining for two reasons. First, geospatial data mining has many unsolved problems, which lie in the intersection of geospatial data and information technology. Second, this area was a key concern for the workshop participants.

From Han et al., “Spatial Clustering Methods in Data Mining,” in Miller and Han (2001).

From Chawla et al., “Modelling Dependencies for Geospatial Data,” in Miller and Han (2001).

John F. Roddick, Kathleen Hornsby, and Myra Spiliopoulou maintain an online bibliography of temporal, spatial, and spatiotemporal data mining research at <http://kdm.first.flinders.edu.au/IDM/STDMBib.html>.
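The partitioning style of clustering described above can be illustrated with a minimal k-means sketch (pure Python, Euclidean distance, illustrative data); spatially aware methods would substitute geographic distance functions and scale to far larger data sets:

```python
import random

# Minimal k-means sketch: alternate between assigning points to their
# nearest center and moving each center to its cluster's mean.

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centers[i])))
            clusters[idx].append(p)
        # Update step: move each nonempty cluster's center to its mean.
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = tuple(sum(c) / len(cluster)
                                   for c in zip(*cluster))
    return centers, clusters

# Two well-separated groups of 2-D points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (9.0, 9.1), (9.2, 9.0), (9.1, 9.2)]
centers, clusters = kmeans(points, 2)
```

On this toy input the algorithm recovers the two groups regardless of which points are drawn as initial centers; on real geospatial data, initialization and the choice of distance function matter a great deal.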
Whereas clustering is based on analysis of similarities and differences among entities, “classification” constructs a model based on inferences drawn from data on available entities and uses it to make predictions about other entities. For example, suppose the goal is to classify forest plots in terms of their propensity for landslides. Given historical data on the locations of past slides and the corresponding environmental attributes (ground cover, weather conditions, proximity to roads and streams, land use, etc.), a classification algorithm can be applied to predict which existing plots are at high risk or whether a planned series of new plots will be at risk under certain future conditions. Various classification methods have been developed in machine learning, statistics, databases, and neural networks; one of the most successful is decision trees. Spatial classification algorithms determine membership based on the attribute values of each spatial object as well as spatial dependency on its neighbors.12
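A crude stand-in for such neighborhood-dependent classification is a k-nearest-neighbor vote, sketched below with hypothetical landslide-risk labels; real spatial classifiers model the dependency structure far more carefully:

```python
import math
from collections import Counter

# Label a new plot by majority vote of its k nearest labeled plots.
# The coordinates and "stable"/"slide" labels are illustrative.

def knn_label(labeled, point, k=3):
    nearest = sorted(labeled,
                     key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

plots = [
    ((0.0, 0.0), "stable"), ((1.0, 0.5), "stable"), ((0.5, 1.0), "stable"),
    ((8.0, 8.0), "slide"), ((8.5, 9.0), "slide"), ((9.0, 8.5), "slide"),
]
print(knn_label(plots, (0.4, 0.4)))  # prints: stable
print(knn_label(plots, (8.6, 8.6)))  # prints: slide
```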
“Association rules” attempt to find correlations (actually, frequent co-occurrences) among data. For instance, the association rules method could discover a correlation of the form “forested areas that have broadleaf hardwoods and occurrences of standing water also have mosquitoes.” Spatial association rules include spatial predicates—such as topological, distance,
or directional relations—in the antecedent or consequent (Miller and Han, 2001). Several new directions have been proposed, including extensions for quantitative rules, extensions for temporal event mining, testing the statistical significance of rules, and deriving minimal rules (Han and Kamber, 2000).
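The support and confidence measures that underlie association rules can be sketched directly; the feature names below are hypothetical:

```python
# Each "transaction" is the set of features observed in one map cell.
cells = [
    {"broadleaf", "standing_water", "mosquitoes"},
    {"broadleaf", "standing_water", "mosquitoes"},
    {"broadleaf", "dry"},
    {"conifer", "standing_water", "mosquitoes"},
]

def support(itemset):
    """Fraction of cells containing every feature in the itemset."""
    return sum(itemset <= cell for cell in cells) / len(cells)

def confidence(antecedent, consequent):
    """Of the cells matching the antecedent, how many also match the consequent?"""
    return support(antecedent | consequent) / support(antecedent)

rule_conf = confidence({"broadleaf", "standing_water"}, {"mosquitoes"})
print(rule_conf)  # 1.0
```

A mining algorithm such as Apriori searches for all itemsets whose support and confidence exceed chosen thresholds; a spatial extension would let the antecedent contain predicates like "overlaps" or "within 1 km of".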
“Outlier detection” involves identifying data items that are atypical or unusual. Ng suggests that the distance-based outlier analysis method could be applied to spatiotemporal trajectories to identify abnormal movement patterns through a geographic space.13 Representing geospatial data for use in outlier analysis remains a difficult problem.
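A minimal version of the distance-based criterion (a point is an outlier when fewer than a fraction p of the remaining points lie within distance D of it) can be sketched as follows; the parameter choices are illustrative:

```python
import math

# Distance-based outlier sketch over 2-D positions.
def distance_outliers(points, D, p):
    outliers = []
    for i, a in enumerate(points):
        neighbors = sum(1 for j, b in enumerate(points)
                        if j != i and math.dist(a, b) <= D)
        if neighbors / (len(points) - 1) < p:
            outliers.append(a)
    return outliers

positions = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(distance_outliers(positions, D=2.0, p=0.5))  # [(10, 10)]
```

Applying the same idea to trajectories, as Ng suggests, requires a distance function between whole movement paths rather than between points, which is one reason representation remains the hard part.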
Typically, two or more data mining tasks are combined to explore the characteristics of data and identify meaningful patterns. A key challenge is that, as Thuraisingham (1999) argues, “Data mining is still more or less an art.” It is impossible to say with certainty that a particular technique will always be effective in obtaining a given outcome, or that certain sequences of tasks are most likely to yield results given certain data characteristics. Consequently, high levels of experience and expertise are required to apply data mining effectively, and the process is largely trial and error. Research to establish firm methodologies for when and how to perform data mining will be needed before this new technology can become mainstream for geospatial applications. The development of geospatial-specific data mining tasks and techniques will be increasingly important to help people analyze and interpret the vast amount of geospatial data being captured.
This chapter is concerned with how geospatial data can be stored, managed, and mined to support geospatial-temporal applications in general and data mining in particular. A first set of research topics stems from the nature of spatiotemporal databases. Although there has been some research on both spatial and temporal databases, relatively little research has addressed the more complex issues associated with spatiotemporal characteristics. In addition, research investments are needed in geometric algorithms to manipulate efficiently the massive amounts of geospatial data being generated and stored. Despite advances in data mining methods over the past decade, considerable work remains to be done to improve the discovery of structure (in the form of rules, patterns, regularities, or models) in geospatial databases.
Geospatial databases are an important enabling technology for the types of applications presented earlier. However, relational DBMSs are not appropriate for storing and manipulating geospatial data because of the complex structure of geometric information and the intricate topological relationships among sets of spatially related objects (Grumbach, Rigaux, and Segoufin, 1998). For example, the restriction in relational DBMSs to the use of standard alphanumeric data types forces a geospatial data object (such as a cloud) to be decomposed into simple components that must be distributed over several rows. This complicates the formulation and efficiency of queries on such complex objects. Also, geospatial data often span a region in continuous space and time, but computers can only store and manipulate finite, discrete approximations, which can cause inconsistencies and erroneous conclusions. A particularly difficult problem for geospatial data is representing both spatial and temporal features of objects that move and evolve continuously over time. To model geographic space, an ontology of geospatial objects must be developed. The final key problem is integrating geospatial data from heterogeneous sources into one coherent data set.
Moving and Evolving Objects
Objects in the real world move and evolve over time. Examples include hurricanes, pollution clouds, pods of migrating whales, and the extent and rate of shrinking of the Amazon rain forest. Objects may evolve continuously or at discrete instants. Their movement may be along a route or in a two- or three-dimensional continuum. Objects with spatial extent may split or merge (e.g., two separate forest fires may merge into one). Existing technologies for database management systems (data models, query languages, indexing, and query processing strategies) must be modified explicitly to accommodate objects that move and change shape over time (see Box 3.1). Such extensions should adhere to the recognized advantages of databases—high-level query mechanisms, data independence, optimized processing algorithms, concurrency control, and recovery mechanisms—and to the kinds of emerging applications used as examples in this report.
Although many different geospatial data models have been proposed, no commonly accepted comprehensive model exists.14

BOX 3.1 The Complexity of Spatiotemporal Data

Despite significant advances in data modeling, much geospatial information still cannot be fully represented digitally. Most of the space-time data models proposed in the past decade rely on the time-stamping of data objects or values, the same way that time is handled in nonspatial databases. Only in recent years has it been recognized that space and time should not always be seen as two orthogonal dimensions. Many researchers advocate a different approach for modeling geographic reality, using events and processes to integrate space and time. Representing events and processes is not a trivial task, however, even at the conceptual level. Complexity arises because scale in space and time affects entity identification.

Depending on the scale of observation, events and processes can be identified as individual entities or as an aggregate. For instance, a thunderstorm front can be seen as one event or as multiple convective storms whose number, geometry, location, and existence may change over time. Whereas events and processes operate at certain spatial and temporal scales, their behaviors are somewhat controlled by events and processes operating at larger scales. Similarly, their behaviors not only affect other events and processes at their scale but also somewhat control those operating at smaller scales. Associations among events and processes at different scales must be represented so they can be fully expressed. This means that in addition to retrieving objects, events, and processes, a geodatabase must support calculations that will reveal and summarize their embedded spatiotemporal characteristics.

Another important representational issue in spatial analysis is the effect of aggregating data over spatial zones. The heterogeneity of microdata patterns within a zone interacts with the zonal boundaries and size, making it difficult to determine what actually has been analyzed. Further, analysis and interpretation should consider larger-scale geographic entities that are related to the zone of interest, not just the microdata within the zone. Data structures are needed that can provide linkages among related data at different scales and enable the dynamic subdivision of zonal data.

Geospatial objects need to be structured accordingly in semantic, spatial, and temporal hierarchies. Semantically related geospatial entities (e.g., census tracts, neighborhoods, and towns) will then be easily associated in space and time, so their properties can be cross-examined at multiple scales. This approach will be increasingly important as spatial analysis is automated in response to the growing volume of spatiotemporal data.

SOURCE: Adapted from a white paper, “Research Challenges and Opportunities on Geospatial Representation and Data Structure,” prepared for the committee’s workshop by May Yuan.

One key approach is to extend traditional relational databases with geospatial data structures, types, relations, and operations. Several commercial systems are now available with spatial and/or temporal extensions; however, they are not comprehensive (i.e., they still require application-specific extensions), nor do they accurately model both spatial and temporal features of objects that move continuously. Most of the research in moving-object databases has concentrated on modeling the locations of moving objects as points (instead of regions). This is the approach used in many industrial applications, such as fleet management and the automatic location of vehicles. Wolfson notes that the point-location management method has several drawbacks, the most important being that it does not enable interpolation or extrapolation.15 Researchers are beginning to explore new data models. For example, Wolfson has proposed a new model, outlined in Box 2.1 (Chapter 2), that captures the essential aspects of the moving-object location as a four-dimensional linear function (two-dimensional space × time × uncertainty) and a set of operators for accessing databases of trajectories. Uncertainty is unavoidable because the exact position of a moving and evolving object is, at best, only accurate at the exact moment of update; between updates, the object’s location must be estimated based on previous behavior. Further, it is problematic to determine how often and under what conditions an object’s representation in the database should be changed to reflect its changing real-world attributes.16 As mentioned in Box 2.1, frequent location updates would ensure greater accuracy in the location of the object but consume more scarce resources such as bandwidth and processing power.
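The flavor of this linear-function model can be sketched as follows (a hedged sketch, not Wolfson's actual formulation; the field names and the linear uncertainty growth are illustrative assumptions):

```python
from dataclasses import dataclass

# Between database updates, a moving object's position is extrapolated
# from its last reported location and velocity, with an uncertainty
# bound that grows with elapsed time.

@dataclass
class MovingPoint:
    t0: float        # time of last update (seconds)
    x0: float        # reported position (km)
    y0: float
    vx: float        # reported velocity (km/s)
    vy: float
    err_rate: float  # uncertainty growth (km per second)

    def position(self, t):
        """Linear extrapolation of the location at time t >= t0."""
        dt = t - self.t0
        return (self.x0 + self.vx * dt, self.y0 + self.vy * dt)

    def uncertainty(self, t):
        """Radius of the disk the object is assumed to lie in at time t."""
        return self.err_rate * (t - self.t0)

truck = MovingPoint(t0=0.0, x0=0.0, y0=0.0, vx=0.02, vy=0.0, err_rate=0.001)
print(truck.position(100.0))     # (2.0, 0.0)
print(truck.uncertainty(100.0))  # 0.1
```

The update-frequency trade-off discussed above appears here directly: a smaller err_rate can be promised only if the object reports its position more often.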
Güting and his colleagues have proposed an abstract model for implementing a spatiotemporal DBMS extension. They argue that their framework has several unique aspects, including a comprehensive model of geospatial data types (beyond just topological relationships) formulated at the abstract infinite point-set level; a process that deals systematically and coherently with continuous functions as values of attribute data types; and an emphasis on genericity, closure, and consistency (Güting et al., 2000). They suggest that more research is needed to extend their model from moving objects in two-dimensional (2D) space to moving volumes and their projections into space (Güting et al., 2000). A second approach is based on the constraint paradigm. DEDALE, one example of a constraint database system for geospatial data proposed by the Chorochronos Participants
(1999), offers a linear-constraint abstraction of geometric data that allows for the development of high-level, extensible query languages with a potential for using optimal computation techniques for spatial queries.
Query languages also will need to be extended to provide high-level access to the new geospatial data types. It is important to develop consistent algebraic representations for moving and evolving objects and to use them for querying geospatial databases. Query languages must provide the ability to refer to current as well as past and anticipated future position and extent of geospatial objects. For example, it should be possible to refer to future events and specify appropriate triggers, such as “Issue an air quality alert when pollution clouds cover a city” or “Sound an alarm when the fire is within 10 km of any habitation.”
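Such a trigger ultimately reduces to evaluating a spatial predicate against current object positions. Below is a minimal sketch of the fire-proximity condition, using great-circle distance; the coordinates and names are illustrative:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def fire_alarm(fire, habitations, radius_km=10.0):
    """Trigger condition: is the fire within radius_km of any habitation?"""
    return any(haversine_km(*fire, *h) <= radius_km for h in habitations)

homes = [(45.00, -120.00), (45.50, -120.40)]
print(fire_alarm((45.05, -120.00), homes))  # True (about 5.6 km away)
```

A DBMS-level trigger would evaluate this predicate continuously, ideally via a spatial index rather than the linear scan shown here.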
Finally, because geodatabases are expected to grow very large (as, for example, environmental processes and events are tracked at a regional level), the invention of novel indexing schemes will be critical to support efficient processing. Because of the properties of continuous evolution and uncertainty, conventional indexing methods for geospatial data will not be adequate.
Ontologies for Information Exploitation
Workshop participants noted that the development of ontologies for geospatial phenomena is a critical research area.17
An ontology defines, as formally as possible, the terms used in a scientific or business domain, so that all participants who use a term will be in agreement about what it refers to, directly or indirectly (see Box 3.2). In the geospatial domain, ontologies would define geographic objects, fields, spatial relations, processes, situational aspects, and so on.18 Although disciplines that specialize in the gathering and exchange of information have recognized for some time the need for formalized ontological frameworks to support data integration, research on ontologies for geographic phenomena began only recently (Mark et al., 2000).
In the context of geospatial information, it is critical to remember that the earth is an “open” system—we cannot explain all outcomes from all known laws, and the earth scientist often “constructs” knowledge. This aspect separates the geospatial world from other domains that have relatively simple, well-understood, and established ontologies.

BOX 3.2 Ontologies

Precision of expression is crucial for any discipline. An ontology for a discipline attempts to formalize the community understanding that is traditionally developed through scientific education.1 Ontologies can be thought of as hierarchical “trees” of terms. An ontology achieves more precision by defining relationships among its terms, such as “is a,” “part of,” and “subset of.” The definitions of many terms are sure to be discipline specific. For example, although the term “continent” can be defined by enumerating the continents, the definition of “country” is problematic—a politician is likely to have a somewhat different set of countries in mind than a historian. The task can be delegated by assigning it to a trusted organization such as the United Nations, but the resulting definition is still unlikely to satisfy more than a small range of disciplines (e.g., if a “country” is defined as any current member of the UN, what about the Vatican?).

Different disciplines also are likely to use different hierarchical organizations. Although the definition of “river” can be hard to agree on, positioning that term into a hierarchy raises even more issues. In a land-use ontology, “river” will be assigned to the higher-level concept “boundaries,” but in a navigation ontology it might fall under “waterways,” along with “lake” and “canal.” The granularity of the terms also will differ among disciplines and even countries. Whereas “river” may be at the lowest level in the land-use hierarchy, the differences between “river,” “canal,” and “creek” (in the United Kingdom and Australia, the last-mentioned implies a saltwater body) are significant in environmental remediation. Forcing a set of scientists to use a strange hierarchy will be confusing, and forcing them to use a finer or coarser granularity than they need will be wasteful.

As science grows more interdisciplinary and global, we must provide clear entry points into specialized sciences. Well-defined languages are pivotal. Ontologies often are conceptualized as a series of levels from the general (domain ontology) to the specific (application ontology) (Guarino, 1998), so even disparate scientific communities might intersect at some level. People who are effective in interdisciplinary work may not have precise knowledge about every term in each discipline, but they certainly need to understand the terms at the intellectual intersection of the fields and be able to map among terms in each language. For example, mapping would allow “river [land use]” to be equated with “waterway [navigation],” providing links to “river [navigation],” “lake [navigation],” and the like. Mappings can be complex, and expressing them formally will be a challenge.

Geospatial ontologies, in particular, continue to grow and evolve, usually via subtle changes (e.g., as taxonomies for land use or geological mapping are refined) and occasionally via radical changes (as in a paradigm shift caused by a new theory—e.g., continental drift and plate tectonics). One community of scientists requires conceptual flexibility and is in the business of creating and modifying ontology; another is more concerned with using an ontology that has already been defined, to understand the concepts underlying data and use those data appropriately. One research avenue is to design several geospatial ontologies covering limited and well-defined application domains, along with a mechanism for selecting the appropriate ontology for a given context. The ontologies should be designed in such a way that they can evolve over time (e.g., common concepts in overlapping domain-specific ontologies may be discovered that need cross-linkages). Research also is needed to uncover precisely what aspects of meaning are important to different data users and how these aspects might be captured (from data, people, systems, and situations), represented (formally, in the machine), and communicated effectively to users of the data (i.e., via what type of mechanism—an ontology browser or visual navigation through concept space?). Such research would inform and expand the notions of geospatial ontologies and increase their usefulness.
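The hierarchy and mapping ideas in Box 3.2 can be made concrete with a toy sketch; the terms, links, and cross-mapping below are illustrative, not a proposed ontology:

```python
# Two small discipline-specific "is a" hierarchies, each a mapping from
# a term to its parent concept, plus an explicit cross-ontology mapping.

land_use = {
    "river": "boundary",
    "fence": "boundary",
    "boundary": "feature",
}

navigation = {
    "river": "waterway",
    "canal": "waterway",
    "lake": "waterway",
    "waterway": "feature",
}

# Cross-ontology mapping: (source ontology, term) -> (target ontology, term)
mapping = {("land_use", "river"): ("navigation", "river")}

def ancestors(ontology, term):
    """Walk 'is a' links up to the root of one hierarchy."""
    chain = []
    while term in ontology:
        term = ontology[term]
        chain.append(term)
    return chain

print(ancestors(land_use, "river"))    # ['boundary', 'feature']
print(ancestors(navigation, "river"))  # ['waterway', 'feature']
```

Even this toy shows the mapping problem: the same term "river" sits under different parents in the two hierarchies, so equating terms across disciplines requires an explicit mapping table rather than name matching.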
Geospatial Data Integration
The purpose of data integration is to combine data from heterogeneous, multidisciplinary sources into one coherent data set.19 The sources of the data typically employ different resolutions, measurement techniques, coordinate systems, spatial or temporal scales, and semantics. Some or all must be adjusted to integrate the data into an effective result set. Perhaps the most obvious problem stems from positional accuracy. The positions recorded for geospatial data vary in accuracy, depending on how the location coordinates were derived, their spatial resolution, and so forth. The degree of difference may be unimportant or critical, depending on the requirements of the application. Conflation20 refers to the integration of geospatial data sets that come from multiple sources, are at different scales, and have different positional accuracies (e.g., a digital orthophoto quadrangle image
of a road and a digital roadmap). As Nusser notes, conflation can be useful in several ways: “1) as a means of correcting or reducing errors in one data set through comparison with a second; 2) as a process of averaging, in which the product is more accurate than either input; 3) as a process of concatenation, in which the output data set preserves all of the information in the inputs; and 4) as a means of resolving unacceptable differences when data sets are overlaid.”21
Substantial research work will be required to develop techniques that can completely automate the conflation process. Unfortunately, it is highly improbable that any single process will apply across all domains. A more realistic approach would be to develop techniques automating conflation for individual application domains, and then expand those domains to be as general as possible. Note that without at least some automatic support for conflation, it will be difficult, if not impossible, to build applications that integrate information from independent sources.
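The second of Nusser's uses, averaging two estimates into a product more accurate than either input, has a simple statistical sketch (inverse-variance weighting, with illustrative coordinates and variances):

```python
# Combine two independent position estimates for the same feature,
# weighting each by the inverse of its variance so the more accurate
# source dominates.

def fuse(est1, var1, est2, var2):
    """Inverse-variance weighted average of two scalar coordinates."""
    w1, w2 = 1.0 / var1, 1.0 / var2
    value = (w1 * est1 + w2 * est2) / (w1 + w2)
    variance = 1.0 / (w1 + w2)
    return value, variance

# Orthophoto-derived easting (accurate) vs. digitized roadmap easting (coarse).
value, variance = fuse(500100.0, 4.0, 500130.0, 100.0)
```

The fused variance (about 3.85) is smaller than either input variance, which is exactly the sense in which the averaged product can be "more accurate than either input." Real conflation is far harder, because it must first decide which features in the two data sets correspond at all.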
When data are produced, either from original measurements or by integrating values from existing sources, other transformations may be performed as well, including clipping, superimposition, projection transformations (e.g., from Mercator to conic projection), imputation of null values, and interpolation. Given the nature of the transformations required to integrate data sources, it is important to track how each data set was produced and specify which transformation operations were performed. If the metadata do not track this properly, some later application might lead to an invalid and potentially fatal decision because of imprecise information (e.g., an error of just ±100 meters in the location of a bombing target could destroy a hospital instead of an arms depot). Moreover, because each domain involves implicit knowledge about its own measurement methods and the appropriateness of its data to a given problem, data integration across disciplines imposes the added constraint of requiring integration of the knowledge that underlies the data. A key issue for spatial data integration is developing a formal method that bridges disparate ontologies—by using, for example, spatial association properties to relate categories from different ontologies—to make such knowledge explicit in forms that would be useful to other disciplines. Long-term research is required to create new data models and languages specifically designed to support heterogeneous spatiotemporal data sets (see Box 3.3 for a sample application). Similarly, languages and mechanisms must be developed for expressing attributes that currently are implicit in many stored data, such as integrity constraints, scaling properties, lineage, and authenticity. Markup languages, such as XML, and resource discovery technologies may play an important role in solving some of these difficult problems.

BOX 3.3 Dataset Clipboard

Consider the difficulty of determining which households to evacuate when a truck carrying hazardous materials is involved in a serious accident. One vision is to develop a “dataset clipboard” that could integrate models and domain knowledge from relevant scientific disciplines with collections of geospatial data to assist the knowledge discovery process. In this example, the emergency response team would use the clipboard to superimpose population density data and meteorological data, then employ a plume diffusion or surface runoff model to predict where and how the contaminant might spread. The resulting knowledge could be used to determine which households to evacuate and in what sequence and to efficiently deploy the appropriate hazard cleanup. The clipboard would need to support several kinds of operations.

There are many challenges in creating such a tool: Given a collection of data resources whose only common attribute is location (possibly poorly specified), how does the tool establish the appropriate transformations and mappings to superimpose them? How does it determine which numerical models can be applied to those data? How does the tool handle representational formats for objects that can play multiple roles (e.g., a road can be a transportation network, a firebreak, or a corridor for human development)? How does it encode semantics to ensure that nonsensical operations, such as pasting a road on a lake, fail? How does the tool determine what format conversions or other data adjustments are needed as a model is launched? How can human experts represent their domain knowledge, such as integrity constraints and plume diffusion equations, so that it can be integrated into the digital clipboard?
The collection and integration in real time of data from a variety of sources (hyperspectral remote-sensing data, in situ sensors, aerial photography, lidar) are yet another challenge.22 The difficulty of communicating large amounts of data from remote locations may require that spatiotemporal data integration or other forms of processing take place in situ at common collection points. Research is required to identify efficient algorithms for doing this, as well as an understanding of the trade-offs involved. Emergency response applications will require not only real-time collection and fusion of data but also ingestion of those data into process models that can prescribe actions to mitigate the effects of the evolving emergency—e.g., by simulating flooding conditions in the field to guide the deployment of resources and schedule evacuations (NSF, 2000).
Handling different kinds of imprecision and uncertainty is an important research topic that must be addressed for geospatial databases. For data integration in particular, the data sets being combined may be characterized by different types of inaccuracy and imprecision, which seriously impedes integration. An important research topic is modeling the propagation of errors when combining data sets with different accuracy models (root mean square error, epsilon bands, and so forth). Spatiotemporal data are particularly problematic because spatiotemporal correlations need to be included in such models. As a first step, accuracy models for several common data types (field models, moving objects, and others) need to be formally described, and error propagation models for a variety of spatial operations and transformations need to be developed.23
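When a closed-form error propagation model is unavailable, a Monte Carlo simulation gives a serviceable estimate. The sketch below, with hypothetical RMSE values, propagates Gaussian positional error through one spatial operation (the distance between two points); the same pattern applies to any operation and any accuracy model from which samples can be drawn.

```python
# Monte Carlo propagation of positional error through a spatial operation.
import math
import random

def propagate_distance_error(p, q, rmse_p, rmse_q, n=20000, seed=42):
    """Estimate mean and standard deviation of the distance between two
    points whose x,y coordinates carry independent Gaussian error with
    the given RMSEs."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        px = p[0] + rng.gauss(0, rmse_p)
        py = p[1] + rng.gauss(0, rmse_p)
        qx = q[0] + rng.gauss(0, rmse_q)
        qy = q[1] + rng.gauss(0, rmse_q)
        samples.append(math.hypot(px - qx, py - qy))
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return mean, math.sqrt(var)

# Two points 1 km apart, located with 5 m and 10 m RMSE respectively.
mean_d, sd_d = propagate_distance_error((0, 0), (1000, 0), rmse_p=5.0, rmse_q=10.0)
print(f"distance ≈ {mean_d:.1f} m, ±{sd_d:.1f} m")
```

The resulting spread (roughly ±11 m here, since the two error variances add) is exactly the kind of derived-accuracy metadata the text argues should travel with integrated data sets.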
Issues in Geospatial Algorithms
The objective of early geospatial applications was to replace manual (paper) cartographic production. The algorithmic problems encountered in such applications were static and involved relatively homogeneous data sets. That is no longer the case. The increasing availability and use of massive geodata in a wide variety of applications have opened up a whole new set of algorithmic challenges. At the same time, although algorithms research traditionally has been relatively theoretical, efficiency, implementation issues (including numerical robustness), and realistic computational models have gained a more prominent position in the geometric algorithms community. As a result, further algorithms research could significantly enhance the accessibility and use of geospatial information. This section describes three broad and interrelated areas in which further development would be especially beneficial.24
Algorithms for Heterogeneous and Imperfect Data
The continued improvement in geodata capture capabilities has made available data sets of widely varying resolution, accuracy, and formats. Thus, geospatial applications have to store and manipulate very diverse and sometimes inherently imperfect (noisy and/or uncertain) data. Because most algorithm research assumes perfect data, the imperfections of real-world data mean that algorithms developed in a theoretical framework may not function correctly or efficiently in practice. One example of imperfect data could be several overlay data sets for the same terrain (containing, say, vegetation, road, and drainage information) that are inconsistent with one another because they have been acquired by different means.
The large variety of data also leads to a need for format conversion algorithms, which introduces yet another source of imperfection. Terrain representations are a prime example of this. Grid terrain representations are the most common, primarily because data from remote-sensing devices are available in gridded form and because grids typically support simple algorithms. In some applications, however, especially when handling massive terrain datasets, triangular irregular networks (TINs) are superior to grids.25 Obviously, conversions between the two formats can introduce inconsistencies. In fact, it is not even clear what it means for a conversion to be “consistent.” Traditionally, conversion methods have focused on minimizing differences in local elevation, but the preservation and consistency of global features (e.g., watershed hierarchy, the drainage network, and visibility properties) are often more important to the users of geospatial applications. Ultimately, problems like those encountered when converting between different terrain formats stem from the use of discrete representations for what are really continuous domains, and from the fact that many geospatial data representations lack explicit topological information. As is well known, algorithms that derive topological information from geometric information are vulnerable to even small measurement or calculation errors, which may significantly alter the connectivity of geometric entities.
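The traditional local-elevation criterion mentioned above can be made concrete with a toy example: drop every other row and column of a small elevation grid, reconstruct the dropped cells by linear interpolation from the kept neighbors, and measure the largest discrepancy. The grid values are invented; real grid-to-TIN conversion is far more involved and, as the text notes, this criterion says nothing about global features such as drainage networks.

```python
# Measure the worst local elevation error introduced by a naive
# 2x coarsening of a (odd-dimensioned) elevation grid.
def max_local_error(grid):
    """Keep only even-indexed rows/columns; estimate each dropped cell
    by linear interpolation from kept neighbors; return the maximum
    absolute elevation error over all dropped cells."""
    worst = 0.0
    for i in range(len(grid)):
        for j in range(len(grid[0])):
            if i % 2 == 0 and j % 2 == 0:
                continue                      # kept cell: no error
            if i % 2 == 1 and j % 2 == 0:     # dropped row: interpolate vertically
                est = (grid[i - 1][j] + grid[i + 1][j]) / 2
            elif i % 2 == 0 and j % 2 == 1:   # dropped column: horizontally
                est = (grid[i][j - 1] + grid[i][j + 1]) / 2
            else:                             # dropped both: average 4 corners
                est = (grid[i - 1][j - 1] + grid[i - 1][j + 1]
                       + grid[i + 1][j - 1] + grid[i + 1][j + 1]) / 4
            worst = max(worst, abs(grid[i][j] - est))
    return worst

dem = [[10, 12, 14],
       [11, 20, 15],    # a local peak the coarse grid cannot represent
       [12, 14, 16]]
print(max_local_error(dem))   # → 7.0 (the peak at (1,1) dominates)
```

The isolated peak is exactly the kind of feature a purely local criterion flags, while a watershed boundary shifted by the same simplification would go unnoticed.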
More algorithm research in the geospatial domain is needed, especially on problems involving heterogeneous and imperfect data. Research is needed in transformation algorithms, not just to advance the efficacy of conversion and transformation but also to establish under what conditions different formats and algorithms are most appropriate. Research investments in what could be called topology-aware algorithms would greatly improve the usability of geospatial data. Such algorithms would treat topological connectivity information as being of higher priority than geometric size and location. This would equip the algorithms to better handle uncertainties in size and location while still yielding topologically consistent results. The use of statistical methods to handle input data uncertainty also should be investigated.
Although the availability of massive geospatial data sets and of small but computationally powerful devices increases the potential of geospatial applications, it also exposes scalability problems with existing algorithms. One source of such problems is that most algorithm research has been done under models of computation in which each memory access costs one unit of time regardless of where the access takes place. This assumption is becoming increasingly unrealistic, because modern machines contain a hierarchy of memory ranging from small, fast cache to large, slow disks. One key feature of most memory hierarchies is that data are moved between levels in large, contiguous blocks. For this reason, it is becoming increasingly important to design memory-aware algorithms, in which data accessed close in time also are stored close in memory. Although operating systems use sophisticated caching and prefetching strategies to ensure that data being processed are in the fastest memory, often they cannot prevent algorithms with a pattern of access to nonlocal memory from thrashing (i.e., moving data between memory levels). Geospatial applications in particular suffer from thrashing effects because data sets often are larger than main memory. Therefore, I/O-efficient algorithms, which are designed for a two-level (external memory) model instead of the traditional flat-memory model, recently have received a lot of attention (Arge et al., 2000). Several I/O-efficient algorithms have been developed, and experimental evaluations have shown that their use can greatly improve run time in geospatial applications. Very recently, cache-oblivious algorithms have been introduced that combine the simplicity of the two-level model with the realism of more complicated hierarchical models. This approach avoids memory-specific parameterization and enables analysis of a two-level model to be extended to an unknown, multilevel memory model. Further research in the area of I/O-efficient and cache-oblivious algorithms can significantly improve the usability of geospatial data by allowing complicated problems on massive data sets to be solved efficiently.
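The two-level model can be simulated directly: store an array in blocks of B elements, hold M blocks in an LRU cache, and count block transfers. The simulation below (block and cache sizes are arbitrary choices, not drawn from the literature) shows why access order dominates I/O cost: scanning a row-major grid by rows touches each block once, while scanning it by columns faults on nearly every access.

```python
# Toy external-memory model: count block transfers for an access sequence.
from collections import OrderedDict

def io_cost(access_sequence, block_size, cache_blocks):
    """Number of block loads for a sequence of element addresses,
    given blocks of `block_size` elements and an LRU cache holding
    `cache_blocks` blocks."""
    cache = OrderedDict()                 # block id -> None, LRU order
    misses = 0
    for addr in access_sequence:
        block = addr // block_size
        if block in cache:
            cache.move_to_end(block)      # mark as most recently used
        else:
            misses += 1
            cache[block] = None
            if len(cache) > cache_blocks:
                cache.popitem(last=False) # evict least recently used
    return misses

n = 64                                    # a 64x64 grid stored row-major
row_major = [i * n + j for i in range(n) for j in range(n)]
col_major = [i * n + j for j in range(n) for i in range(n)]
print(io_cost(row_major, block_size=16, cache_blocks=4))   # → 256
print(io_cost(col_major, block_size=16, cache_blocks=4))   # → 4096
```

A sixteenfold difference in block transfers for the same 4,096 logical accesses is precisely the thrashing effect described above, and it grows with data-set size.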
Kinetic Data Structures
With the rapid advances in positioning technologies (such as the Global Positioning System and wireless communication), tracking the changing position of continuously moving objects is becoming increasingly feasible and necessary. However, creating algorithms for handling continuously moving and evolving data is one of the most significant challenges in the area of temporal data. Existing data structures are not efficient for storing and querying continuously moving objects. In most geospatial applications, motion is modeled by sampling the time axis at fixed intervals and recomputing the configuration of the objects at each time step. The problem with this method is the choice of an appropriate time-step size. If the steps are too small, the method is very inefficient (due to frequent recomputation); if they are too large, important events can be missed, leading to incorrect results. A better approach would be to represent the position of an object as a function of time, so that the position can change without any explicit change in the data structure. Recently, there has been a flurry of activity in algorithms and data structures for moving objects, most notably the concept of kinetic data structures, which alleviate many of the problems with fixed-interval sampling methods (Basch et al., 1997; Guibas, 1998). The idea of a kinetic framework is that even though objects move continuously, qualitative changes happen only at discrete moments, which must be determined. In contrast to fixed-interval methods, in which the fastest moving object determines the update step for the entire data structure, a kinetic data structure is based on events that have a natural interpretation.
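A minimal illustration of the kinetic idea, assuming points moving linearly on a line with x_i(t) = x0_i + v_i·t: rather than re-sorting at fixed time steps, compute the exact event times at which some point overtakes the current leader and update only then. This is a sketch of the principle, not of any published kinetic data structure.

```python
# Kinetic-style maintenance of the rightmost of several linearly
# moving points: process only the discrete "overtaking" events.
def leader_changes(points, t_end):
    """points: list of (x0, v) with position x0 + v*t.
    Returns [(time, index)] giving the leader after each change,
    starting at t = 0."""
    def pos(i, t):
        x0, v = points[i]
        return x0 + v * t

    t = 0.0
    leader = max(range(len(points)), key=lambda i: (pos(i, t), points[i][1]))
    history = [(t, leader)]
    while True:
        # Next event: earliest time > t at which some point passes the leader.
        best_t, best_i = None, None
        for i in range(len(points)):
            if i == leader:
                continue
            dx = points[leader][0] - points[i][0]
            dv = points[i][1] - points[leader][1]
            if dv > 0:                        # point i is gaining on the leader
                t_cross = dx / dv
                if t < t_cross <= t_end and (best_t is None or t_cross < best_t):
                    best_t, best_i = t_cross, i
        if best_t is None:
            return history
        t, leader = best_t, best_i
        history.append((t, leader))

# Point 0 starts ahead, but point 1 is faster; the single event is at t = 5.
print(leader_changes([(10.0, 1.0), (0.0, 3.0)], t_end=10.0))
# → [(0.0, 0), (5.0, 1)]
```

Note that the amount of work depends on the number of qualitative changes (here, one), not on the fastest object's speed or on any sampling rate.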
Important results already have been obtained in the kinetic framework, and its practical significance has been demonstrated through implementation work. Nevertheless, many key issues remain to be investigated. Further research in algorithms and data structures for moving objects in general, and kinetic data structures in particular, could significantly advance our ability to efficiently manipulate spatiotemporal data.
Geospatial Data Mining
Data mining is typically an interactive process. It takes a human expert to decide which data to mine and which techniques to employ to derive meaningful results. Once a mining process has been worked out for a particular application, automation via a processing pipeline is reasonable (as it is for many bioinformatics applications). On the one hand, it may be useful to look at each stage of the mining process—from exploratory analysis to processing pipeline—and provide support (in the form of infrastructures and tools) for moving applications from exploration to production. On the other hand, automation of some parts of the data mining process (at some suitable level for domain-specific applications) could be tremendously useful in improving the accessibility and usability of geospatial information. For a specific application, we can automate many common operations (e.g., database specification, loading data into databases, connecting databases with analysis and mining packages, running frequent queries and statistical analyses, and sending data or results back to databases or into visualization packages). Beyond such basic IT support for data mining, advances in the areas of languages and algorithms can improve the productivity of analysts and domain scientists. Two specific problems, the dimensionality “curse” and the mining of moving and evolving objects, remain difficult challenges and are discussed below.
Languages for Describing Data Mining Patterns
One of the most difficult data mining problems is to determine which task to perform (i.e., which class of patterns to look for) in a given data set. Are Gaussian clusters the most appropriate pattern classes to employ on a forest-fire data set? Should the interarrival times of fires be fitted to a Poisson model or something else? Because the assumptions required for the classical stochastic representations (such as Gaussian distributions and Poisson processes) do not always hold, expert users need to be able to specify new types of patterns. The language for expressing patterns would need to be extensible yet still enable efficient searches for new and frequent patterns.
For example, the probability distribution of the magnitude of earthquakes follows a power law—the Gutenberg-Richter law (Bak, 1996)—as opposed to a Gaussian distribution. Power laws, which appear in many settings—for example, income distribution (Pareto law), incoming and outgoing hypertext links in the World Wide Web (Kumar et al., 1999), and numerous other settings (Barabasi, 2002)—are closely related to fractals and self-similarity.26 These concepts have brought revolutions in many settings, from the distribution of goods throughout the world to the description of coastlines and the shape of river basins (Schroeder, 1991; Mandelbrot, 1977). U.S. Geological Survey (USGS) researchers have developed a new approach to pattern recognition based on fractal geometry (the study of fragmented patterns nested within larger copies of themselves) that allows them to quantify complex phenomena (e.g., hurricanes and earthquakes) without having to simplify them.27 Similar successes have been achieved in the time-sequence analysis of network traffic, in which it was discovered that the number of packets per unit time is not a Poisson distribution but instead remains self-similar over several scales, contrary to all previous statistical assumptions (Leland et al., 1993). The tools of chaos and nonlinear dynamics also are closely related, and they should be included in any framework that looks for patterns. Systems that obey nonlinear difference equations exhibit behaviors qualitatively different from those of linear systems. For example, nonlinear systems govern populations of species (the Lotka-Volterra equations for prey-predator systems, as well as the logistic parabola for a single-species system with limited resources). Weather also obeys nonlinear equations, which makes it chaotic (i.e., sensitive to initial conditions)—a tiny measurement error can result in a large error in the forecast.
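The simplest diagnostic for Gutenberg-Richter-style data is fitting a power-law exponent by least squares in log-log space. The sketch below uses synthetic counts generated from an exact power law (so the recovered slope is the true exponent); real magnitude data would require binning and more careful statistics.

```python
# Fit y ≈ c * x**a by linear regression of log y on log x.
import math

def fit_power_law(xs, ys):
    """Return (exponent, prefactor) for the model y = c * x**a."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    a = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    c = math.exp(my - a * mx)
    return a, c

magnitudes = [1, 2, 4, 8, 16]
counts = [1000 * m ** -2.0 for m in magnitudes]   # exact power law, exponent -2
a, c = fit_power_law(magnitudes, counts)
print(round(a, 3), round(c, 1))   # → -2.0 1000.0
```

A straight line on a log-log plot is the signature that distinguishes power-law data from Gaussian or Poisson behavior, which curve sharply in the same coordinates.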
A related issue is how to present the patterns once the data mining algorithm has detected them. For example, simple association rules of the form “land patches that have conifers and high humidity also have mosquitoes” provide an excellent first step, but ultimately the system would need to report more complicated patterns such as a fire-ant population x(t) that follows the logistic parabola: x(t) = a × x(t − 1) × (1 − x(t − 1)). A software system should be developed that can select from a set of tools and typical pattern types (e.g., Gaussian distributions, Poisson processes, and fractal dimension estimators) the most suitable types for each data set.
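Iterating the logistic parabola shows the kind of nonlinear pattern such a system would need to report. With the growth parameter a = 3.2 (chosen here for illustration), the population does not settle to a single equilibrium but alternates between two values.

```python
# Iterate the logistic map x(t) = a * x(t-1) * (1 - x(t-1)).
def logistic_series(a, x0, steps):
    xs = [x0]
    for _ in range(steps):
        xs.append(a * xs[-1] * (1 - xs[-1]))
    return xs

series = logistic_series(a=3.2, x0=0.3, steps=200)
# After transients die out, the trajectory alternates between two values,
# a period-2 cycle no Gaussian or Poisson summary would capture.
print(round(series[-2], 4), round(series[-1], 4))
```

No single mean, variance, or rate parameter describes this behavior; the pattern itself (a two-cycle of a nonlinear map) is the finding a mining system would have to express.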
Self-similarity often is observable only in natural phenomena across a constrained range of scales. See, for example, Goodchild and Mark (1987). It is argued, though, that this range is usually the range of interest. See, for example, Stoyan and Stoyan (1994). For further discussion of fractal models for geospatial data, see Hastings and Sugihara (1993) on ecological systems and Lovejoy and Mandelbrot (1985) on rain showers.
From USGS fact sheet “Natural Disasters: Forecasting Economic and Life Losses.” Available online at <http://marine.usgs.gov/fact-sheets/nat_disasters/>.
Efficient Algorithms for Computer-Aided Pattern Discovery
In a typical data mining process, an analyst tries several data mining tools and data transformations. Typical data mining tools include clustering, classification, association rules, and outlier analysis. Typical transformation operations performed on data sets include log transformations, dimensionality reductions such as Principal Component Analysis, selection of portions of the records or the attributes, and aggregation to a coarser granularity. The process of applying tools and transformations is repeated until the analyst discovers some striking regularities that were not known in advance or, conversely, detects anomalous conditions. In the wildfire scenario in Chapter 1, the analyst could attempt to identify a trend in temperatures, spot periodicities, transform the temperatures by replacing them with their deviation from the seasonally expected value, and so forth.
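The deviation-from-seasonal-expectation transformation mentioned above is simple to sketch: subtract each month's long-run mean so that anomalies, not the annual cycle, dominate the signal. The monthly temperatures below are made up for illustration.

```python
# Replace temperatures with their deviation from the monthly mean.
def seasonal_anomalies(monthly_temps):
    """monthly_temps: list of (month, temp). Returns list of
    (month, temp - mean_for_that_month)."""
    totals, counts = {}, {}
    for month, t in monthly_temps:
        totals[month] = totals.get(month, 0.0) + t
        counts[month] = counts.get(month, 0) + 1
    means = {m: totals[m] / counts[m] for m in totals}
    return [(m, round(t - means[m], 2)) for m, t in monthly_temps]

temps = [("Jan", 2.0), ("Jul", 25.0), ("Jan", 4.0), ("Jul", 31.0)]
print(seasonal_anomalies(temps))
# → [('Jan', -1.0), ('Jul', -3.0), ('Jan', 1.0), ('Jul', 3.0)]
```

After the transform, an unusually hot January and an unusually hot July become directly comparable, which is what trend and outlier detection require.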
One problem is that analysts may not be trained in the full spectrum of data mining tools, including knowledge of whether certain tools would be applicable after a nontrivial transformation of the data. The challenge is to devise efficient algorithms that can automate, as much as possible, the data mining process. For instance, an analyst might deposit the appropriate data sets and domain knowledge (constraints, process models, etc.) for a wildfire scenario into the data clipboard described in Box 3.3. If a new algorithm for classification became available, it could be dropped into the clipboard as well. Software agents28 would assist the analyst by indicating for which data sets the new algorithm is applicable, applying the algorithm, and comparing the classification (clustering, forecasting, etc.) results that are produced with previous results. In general, software agents should be able to automatically locate spatiotemporal data sets; process models and data mining algorithms; identify appropriate fits; perform conversions when necessary; apply the models and algorithms; and report the resulting patterns (e.g., correlations, regularities, and outliers). Resource discovery systems also will be needed to match algorithms to data sets and to user goals.
The Dimensionality Curse
Spatiotemporal data sets often suffer from what is known as the “dimensionality curse,” a very difficult problem. Although the problem is not specific to spatiotemporal data, advances in solving it would benefit geospatial applications. Many spatiotemporal data sets have a large number (100 or more, say) of measurements—dimensions—for each point in space and time. Most existing data mining algorithms suffer in high dimensions, with costs that grow polynomially or even exponentially with the number of dimensions. Not all of the measurements are useful, however; some may have near-constant values, while others are strongly correlated. It is essential to determine how many and which attributes really matter.

There is an emerging literature on geoagents. See the GIS Science Program on agents: <http://www.giscience.org/GIScience2000/program.html#agents>.
A well-known technique for dealing with the dimensionality curse is dimensionality reduction.29 Several algorithms are available that can perform a linear dimensionality reduction on geospatial data sets. They spot and exploit attributes that are linearly correlated. The most popular is Principal Component Analysis (closely related to the Singular Value Decomposition, or SVD, the Karhunen-Loève transform, and Latent Semantic Indexing). Unfortunately, this method can be slow (it is quadratic in the number of attributes), and it is susceptible to outliers. Newer methods use random projections (Papadimitriou et al., 1998) to achieve faster results, or robust SVD (Knorr et al., 2001) to neutralize the effect of a few outliers. All the methods look for linear correlations across attributes, however, and will not work for nonlinear correlations.30 Research is needed on scalable, robust, nonlinear methods for reducing dimensionality.
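The linear case can be sketched in a few lines: center the data matrix and take its SVD, which is the computation underlying Principal Component Analysis. The toy data below have three attributes, the third an exact linear combination of the first two, so only two singular values are (numerically) nonzero and the effective dimensionality is two.

```python
# Linear dimensionality reduction via SVD of a centered data matrix.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, b, 2 * a - b])   # third attribute is redundant

Xc = X - X.mean(axis=0)                  # center each attribute
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.round(s, 2))                    # third singular value ~ 0

k = int(np.sum(s > 1e-8))                # effective dimensionality
X_reduced = U[:, :k] * s[:k]             # data expressed in k principal axes
print(k, X_reduced.shape)                # → 2 (100, 2)
```

A nonlinear dependency (say, a third attribute equal to a² − b²) would leave all three singular values large, which is exactly the failure mode of linear methods that the text identifies.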
Mining Data When Objects Move or Evolve
Just as moving and evolving objects pose problems for geospatial data models, they also pose problems for geospatial data mining. A key problem is how to identify and categorize patterns in the trajectories of moving objects, such as migrating birds or fish. An even more difficult task is to identify and categorize patterns in objects that evolve, change, or appear and disappear over time, such as animal habitats and sporadic water resources. Few data mining algorithms can handle temporal dimensions; even fewer can accommodate spatial objects other than points (such as polygons).
The simplest case is a data set of moving point objects in which each trajectory has a number of attributes. For example, the trajectories of wild foxes may have an attribute “ear-tag identification” and certain locations may have the attribute “fox den.” One place to start is by rethinking current data mining algorithms (clustering, classification, association rules) in terms of trajectories. What is an appropriate similarity metric for clustering trajectories? How might an efficient search algorithm be devised for “projective clustering”—that is, looking for strong clusters when grouping trajectories using the associated attributes? What form might association rules take? What about rules that predict future behavior based on initial trajectory characteristics? A more complicated example is a vehicle management application, which integrates data sets containing information on weather, special events, and traffic conditions. How do typical data mining algorithms work in this type of scenario?

Feature selection is a special case of dimensionality reduction in which certain features are selected to reduce the number of dimensions.

Some nonlinear methods, such as genetic algorithms, are recognized and used for feature selection in data mining (e.g., “Use of proteomic patterns in serum to identify ovarian cancer,” available online at <http://image.thelancet.com/extras/1100web.pdf>).
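One candidate answer to the similarity-metric question is the mean pointwise Euclidean distance between trajectories sampled at the same time steps. It is a deliberately simple baseline (the fox trajectories below are invented); trajectories sampled at different rates would need alignment, for example by dynamic time warping, before any such metric applies.

```python
# A baseline similarity metric for trajectories sampled at matching times.
import math

def trajectory_distance(ta, tb):
    """ta, tb: equal-length lists of (x, y) positions at matching
    time steps. Returns the mean pointwise Euclidean distance."""
    if len(ta) != len(tb):
        raise ValueError("trajectories must be sampled at the same times")
    return sum(math.hypot(ax - bx, ay - by)
               for (ax, ay), (bx, by) in zip(ta, tb)) / len(ta)

fox_a = [(0, 0), (1, 0), (2, 0), (3, 0)]
fox_b = [(0, 3), (1, 3), (2, 3), (3, 3)]   # same heading, offset 3 units
print(trajectory_distance(fox_a, fox_b))   # → 3.0
```

Plugged into any distance-based clustering algorithm, this metric groups trajectories by co-movement; whether that is the right notion of similarity (versus, say, shared destinations or shared shape regardless of offset) is precisely the open question the text poses.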
Another direction is to create specialized algorithms for trajectories with no constraints in two- or three-dimensional space (plus time), as opposed to constrained trajectories such as movement along a network or, more restrictively, along a fixed circuit. Population densities over a sequence of time intervals are another example of an evolving “object” for which new clustering, classification, and association rules are needed. They may be particularly beneficial in the context of sensor networks (see Box 3.4).
The questions posed above illuminate the challenge of mining moving and evolving objects and are intended to inspire some directions for future research.
BOX 3.4 Mining Data From Sensor Networks
Densely deployed sensor networks soon will be generating vast amounts of geospatial data. The scale of the data alone creates problems, because communication bandwidth is a key constraint in any sensor network. Assuming current technology in processors and wireless communication, the power required to transmit 1 kilobyte a distance of 100 meters could be used instead to execute several million instructions (Pottie and Kaiser, 2000). This means that traditional centralized techniques for data mining are not directly applicable to sensor networks. Several other challenges also must be resolved to realize the potential of these networks for long-term environmental monitoring and problem detection.
Consider a field of visual or infrared sensors that will operate for a week at a time. The first day’s measurements indicate how the network might change as a function of time of day. For instance, individual sensors may behave differently depending on where they are located, because although all will experience daylight at essentially the same time, some may be in the shadow of a tree for part of the day. This baseline can be leveraged to determine if future measurements represent an expected value or an outlier.
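The baseline-and-outlier idea can be sketched directly: learn each sensor's expected reading per hour of day from the first day's measurements, then flag later readings more than k standard deviations away. The readings and threshold below are synthetic; a field deployment would also have to model per-sensor effects such as the tree shadow mentioned above.

```python
# Per-hour baseline from day one; flag later readings as outliers.
import statistics

def build_baseline(day_one):
    """day_one: {hour: [readings]} -> {hour: (mean, stdev)}."""
    return {h: (statistics.mean(v), statistics.stdev(v))
            for h, v in day_one.items()}

def is_outlier(baseline, hour, reading, k=3.0):
    """True if the reading deviates more than k standard deviations
    from that hour's baseline mean."""
    mean, sd = baseline[hour]
    return abs(reading - mean) > k * sd

day_one = {12: [20.0, 21.0, 19.0, 20.0],   # noon readings, first day
           0: [5.0, 6.0, 5.5, 5.5]}        # midnight readings
baseline = build_baseline(day_one)
print(is_outlier(baseline, 12, 20.5))      # → False (within normal range)
print(is_outlier(baseline, 12, 35.0))      # → True  (anomalous heat)
```

Crucially for the bandwidth constraint, each sensor can evaluate this test locally and transmit only the outliers, rather than streaming every reading to a central site.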
As a second example, consider a network designed to monitor a power utility system, with sensors deployed at meter intervals along every pipe. Each sensor can sample the flow rate every few seconds, producing a continuous stream of readings whose sheer volume and distribution across the network pose their own challenges for data mining.
Sensor networks introduce a new domain, in which spatially and physically distributed devices interact first with the environment and only secondarily (and in the aggregate) with human users. The exploitation of this domain will require significant long-term research investment, but it could yield immense benefits for future society.1
SOURCE: Adapted from a white paper, “Using Geospatial Information in Sensor Networks,” prepared for the committee’s workshop by John Heidemann and Nirupama Bulusu.
Arge, Lars, L. Toma, and J.S. Vitter. 2000. “I/O-Efficient Algorithms for Problems on Grid-Based Terrains.” In Proceedings of the Workshop on Algorithm Engineering and Experimentation (ALENEX).
Bak, P. 1996. How Nature Works: The Science of Self-Organized Criticality. New York: Copernicus Books.
Barabasi, Albert-Laszlo. 2002. Linked: The New Science of Networks. Cambridge, Mass.: Perseus Publishing.
Basch, Julien, Leonidas J. Guibas, and John Hershberger. 1997. “Data Structures for Mobile Data,” In Eighth ACM-SIAM Symposium on Discrete Algorithms, pp. 747-756.
Bhambhani, Dipka. 2002. “OMB Prods Agencies to Standardize Geodata.” Government Computer News, September 9.
The CHOROCHRONOS Participants. 1999. “Chorochronos: A Research Network for Spatiotemporal Database Systems.” ACM SIGMOD Record, 28(3), September. The Chorochronos Web site contains additional papers: <http://www.dbnet.ece.ntua.gr/~choros/>.
Computer Science and Telecommunications Board (CSTB), National Research Council. 2000. Information Technology Research for Federal Statistics. Washington, D.C.: National Academy Press.
Computer Science and Telecommunications Board (CSTB), National Research Council. 2001. Embedded, Everywhere: A Research Agenda for Networks of Embedded Computers. Washington, D.C.: National Academy Press.
Goodchild, M., and D. Mark. 1987. “The Fractal Nature of Geographic Phenomena.” Annals of the Association of American Geographers, 77(2):265-278.
Grumbach, Stephane, Philippe Rigaux, and Luc Segoufin. 1998. “The DEDALE System for Complex Spatial Queries.” ACM SIGMOD Record. 27(2):213-234.
Guarino, Nicola (ed.). 1998. Formal Ontology in Information Systems. IOS Press: Amsterdam.
Guibas, L.J. 1998. “Kinetic Data Structures—A State of the Art Report.” In P.K. Agarwal, L.E. Kavraki, and M. Mason (eds.), Proceedings of the Workshop on the Algorithmic Foundations of Robotics, pp. 191-209. Wellesley, Mass.: A K Peters.
Güting et al. 2000. “A Foundation for Representing and Querying Moving Objects.” ACM Transactions on Database Systems, 25(1), March.
Han, J., and M. Kamber. 2000. Data Mining: Concepts and Techniques. San Francisco, Calif.: Morgan Kaufmann Publishers.
Han, Jiawei. 2002. “Data Mining: Concepts and Techniques.” Slides for textbook. Available online at <http://www.cs.sfu.ca>.
Hastings, H.M., and S. Sugihara. 1993. Fractals: A User’s Guide for the Natural Sciences. Oxford, New York: Oxford University Press.
Knorr, Edwin, Raymond Ng, and Ruben Zamar. 2001. “Robust Space Transformations for Distance-Based Operations.” Knowledge Discovery and Data Mining Conference, San Francisco, Calif., August.
Kumar, S. Ravi, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. 1999. “Extracting Large-Scale Knowledge Bases from the Web.” In Proceedings of 25th International Conference on Very Large Data Bases, Edinburgh, Scotland, pp. 639-650.
Kumler, M.P. 1994. “An Intensive Comparison of Triangular Irregular Networks (TINs) and Digital Elevation Models (DEMs).” Cartographica, 31(2), monograph 45.
Leland, Will E., Murad S. Taqqu, Walter Willinger, and Daniel V. Wilson. 1993. “On the Self-Similar Nature of Ethernet Traffic.” In Proceedings of ACM SIGCOMM ’93, September 13-17.
Lovejoy, S., and B.B. Mandelbrot. 1985. “Fractal Properties of Rain, and a Fractal Model.” Tellus, 37A:209-232.
Mandelbrot, B.B. 1977. Fractal Geometry of Nature. New York: W.H. Freeman.
Mark, David, Max Egenhofer, Stephen Hirtle, and Barry Smith. 2000. “UCGIS Emerging Research Theme: Ontological Foundations for Geographic Information Science.” Available online at <http://www.ucgis.org/emerging/ontology_new.pdf>.
Miller, Harvey J., and Jiawei Han (eds.). 2001. Geographic Data Mining and Knowledge Discovery. London: Taylor & Francis.
Mitchell, Tom. 1999. “Machine Learning and Data Mining.” Communications of the ACM, 42(11):30-36.
Østensen, Olaf. 2001. “The Expanding Agenda of Geographic Information Standards” In ISO Bulletin. Available online at <http://www.iso.org/iso/en/commcentre/pdf/geographic0107.pdf>.
Papadimitriou, Christos H., Prabhakar Raghavan, Hisao Tamaki, and S. Vempala. 1998. “Latent Semantic Indexing: A Probabilistic Analysis.” In Proceedings of the 17th ACM Symposium on Principles of Database Systems, Seattle, Wash., June, pp. 159-168.
Pottie, G.J., and W.J. Kaiser. 2000. “Wireless Integrated Network Sensors.” Communications of the ACM, May, pp. 51-58.
Quinlan, J.R. 1986. “Induction of Decision Trees.” Machine Learning, 1:81-106.
Saalfeld, A. 1993. Conflation: Automated Map Conflation. Center for Automation Research, CAR-TR-670 (CS-TR-3066), University of Maryland, College Park.
Schroeder, Manfred. 1991. Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. New York: W.H. Freeman and Company.
Sowa, J.F. 1999. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Pacific Grove, Calif.: Brooks/Cole.
Stoyan, Dietrich, and Helga Stoyan. 1994. Fractals, Random Shapes and Point Fields. Chichester, New York: Wiley.
Thuraisingham, Bhavani. 1999. Data Mining: Technologies, Techniques, Tools, and Trends. Boca Raton, Fla.: CRC Press.
Zhang, Jingxiong, and Michael Goodchild. 2002. Uncertainty in Geographical Information. London: Taylor & Francis.