Suggested Citation:"Some Ideas About the Exploratory Spatial Analysis of Large Data Sets." National Research Council. 1996. Massive Data Sets: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/5505.

Some Ideas About the Exploratory Spatial Analysis Technology Required for Massive Databases

Stan Openshaw

Leeds University

Abstract

The paper describes an emerging problem: an explosion in spatially referenced information at a time when there is essentially only legacy geographical analysis technology. It offers some ideas about what is required and outlines some of the computationally intensive methods that would seem to offer considerable potential in this area of application.

1 A global spatial data explosion

The Geographic Information Systems revolution of the mid 1980s, in conjunction with other historically unprecedented developments in Information Technology, is creating an extremely spatial data rich world. Suddenly, most map and map related databases have increasingly fine resolution geography on them. The paper map using industries throughout the world have gone digital. In many countries this ranges from the traditional AM/FM areas of GIS (viz utilities), through the large scale mapping agencies (viz in the UK the Ordnance Survey, the Geological Survey, etc) and all manner of remotely sensed data, and, just as exciting, it extends to virtually all items of fixed infrastructure and certainly all people related databases (that is, any and all postal address data, property, road, and land information systems). At a world level there is increasing interest in global environmental databases (Mounsey and Tomlinson, 1988). There is now a vast wealth of digital backcloth material (viz the digital maps) and related attribute data covering most aspects of human life, ranging from the cradle to the grave and an increasing proportion of behaviour related events in between. Most governmental and commercial sector statistical and client information systems are well on the way to becoming geo-statistical. In fact nearly all computer information can be regarded as being broadly geographical in the sense that there are usually some kind of spatial coordinates on it or implicit in it. This may range from the traditional geographic map related data to non-traditional sources such as 2 and 3 dimensional images of virtually any kind. Indeed, it is worth observing that several existing GISs already contain sufficient numeric resolution to work down to nanometre scales! But let's stick with the more traditional geographic data. The technology needed to create, store, and manipulate these land and people related databases exists, is well developed, and is fairly mature. What is almost entirely missing are the geographical analysis and modelling technologies able to deal with the many new potential opportunities that these data rich environments now make possible.

It is not unusual for governments and commercial organisations to spend vast sums of money on building databases relevant to their activities. Many have huge amounts of capital tied up in their databases and concepts such as data warehouses are providing the necessary IT infrastructure. They know that their future depends on building and using these information resources. Most businesses and governmental agencies are becoming information industries but currently there seems to be an almost universal neglect of investment in the analysis tools needed to make the most of the databases.

2 A global data swamp

Maybe the problem is that the data holders and potential users are becoming "data swamped" and can no longer appreciate the opportunities that exist. Often their ideas for analysis on these massive databases seem mainly to relate to an earlier period in history when data were scarce, the numbers of observations were small, and the capabilities of the computer hardware limited. As a result there are seemingly increasingly large numbers of important, sometimes life critical and sometimes highly commercial, databases that are not being fully analysed, if indeed they are being spatially analysed at all. Openshaw (1994a) refers to this neglect as a type of spatial analysis crime. For some data there is already an overwhelming public imperative for analysis once the data exist in a suitable form. Is it not a crime against society if critical databases of considerable contemporary importance to the public good are not being adequately and thoroughly analysed? This applies especially when there might well be a public expectation that such analysis already occurs or when there is a strong moral imperative on the professions involved to use the information for the betterment of people's lives. Examples in the UK of spatially referenced information that goes unanalysed include most types of cancer data, mortality data, real-time crime event data, pollution data, data needed for real-time local weather forecasting, climatic change information, and personal information about people who are to be targeted for special assistance because they exist in various states of deprivation. There are many examples too involving the spatial non-analysis of major central and local government databases: for example, tax, social security payments, education performance and housing benefits.

In the commercial sector also, it is not unusual for large financial sector organisations to create massive customer databases, often containing longitudinal profiles of behaviour and increasingly being updated in real-time, and then do virtually nothing clever when it comes to analysis. Yet every single business in the IT Age knows that its long term future viability depends on making good use of its information resources. Some of the opportunities involve spatial analysis; for example, customer targeting strategies, spatial planning and re-organisation of business networks. Economic well being and growth may become increasingly dependent on making more sophisticated use of the available data resources. Of course there are dangers of intrusion into private lives but all too often the confidentiality problems are grossly exaggerated as a convenient excuse for doing nothing (Openshaw, 1994b).

To summarise, there are an increasing number of increasingly large and complex databases containing potentially useful information that for various reasons are not currently being adequately analysed from a spatial statistical geographic perspective. In many countries, if you ask the question "Who is responsible for monitoring key indicators of health and well-being for unacceptable or preventable local abnormalities?", then the answer will usually be 'no-one' because the technology needed simply does not exist. In the UK the most famous suspected leukaemia cancer cluster was uncovered by journalists in a pub rather than by any detailed spatial statistical analysis or database monitoring system. Spatial analysis is a branch of statistics that is concerned with the quantitative analysis (and modelling) of the geographic arrangement, patterns, and relationships found in and amongst map referenced data of one form or another. The GIS revolution has greatly increased the supply of spatially referenced data without any similar provision of new spatial analysis tools able to cope with even the basic analysis needs of users with very large spatial databases. In the last couple of years highly parallel computing systems such as the Cray T3D provide memory spaces just about big enough to handle all but the very largest of databases. The need now is for new data exploratory tools that can handle the largest databases and produce suggestions of geographical patterns or relationships; identify geographically localised anomalies, if any exist; and provide a basis for a subsequent more focused analysis and action. Note also that the need is for analysis but not necessarily because there are expectations of discovering anything. Analysis has a major re-assurance aspect to it. In modern societies it is surely not an unreasonable expectation that benign Big Brother surveillance systems should be continually watching for the unusual and unexpected.

3 New tools are required

In a spatial database context there is an urgent need for exploratory tools that will continually sieve and screen massive amounts of information for evidence of patterns and other potentially interesting events (if any exist) but without being told, in advance and with a high degree of precision, WHERE to look in space, WHEN to look in time, and WHAT to look for in terms of the attributes that are of interest other than in the broadest possible ways. Traditionally, spatial analysts either start with a priori theory which they then attempt to test or else they use interactive graphics procedures, perhaps linked to map displays, to explore spatial data. However, as the level of complexity and the amount of data increase this manual graphics based approach becomes inadequate. Equally, in the spatial data rich 1990s there are not many genuine and applicable a priori hypotheses that can be tested; usually, we just do not know what to expect or what might be found. Exploration is seen as a means of being creative and insightful in applications where current knowledge is deficient.

The combination of a lack of prior knowledge and the absence of spatial analysis tools sufficiently powerful to handle the task of exploratory spatial data analysis, especially in highly multivariate situations with large numbers of observations, has resulted in a situation where very little investigative and applied analysis actually occurs. The technical problems should not be underestimated as spatial data is often characterised by the following features which serve to complicate the analysis tasks:

1. non-normal frequency distributions,
2. non-linearity of relationships,
3. spatially autocorrelated observations,
4. spatially structured variations in degrees of precision, noise, and error levels,
5. large and increasing data volumes,
6. large numbers of potential variables of interest (i.e. highly multivariate),
7. data of varying degrees of accuracy (i.e. can be variable specific),
8. often misplaced confidentiality concerns (i.e. it might identify someone!),
9. non-ideal information (i.e. many surrogates),
10. mixtures of measurement scales,
11. modifiable areal unit or study region effects,
12. patterns and relationships that are localised and not global, and
13. presence of database errors.

Traditionally, people have coped by being highly selective whilst working with very few variables and relatively small numbers of observations. However, this is increasingly hard to achieve. Many spatial epidemiological studies decide in advance on the selection of disease, the coding of continuous time into discrete bands, the recoding of the data, the geographical aggregation to be used, and the form of standardisation to be applied. Then they expect to "let what is left of the data speak for itself" via exploratory analysis, after having first strangled it by the study protocols imposed on it in order to perform the analysis. This is crazy! Heaven alone knows what damage this may have done to the unseen patterns and structure lurking in the database or, indeed, what artificial patterns might have been accidentally created. No wonder exploratory spatial epidemiology has had so few successes. It is useful, therefore, to briefly review some of the developments that appear to be needed to handle the geographical analysis of massive spatial databases.

3.1 Automated map pattern detectors

One strategy is to use a brute force approach and simply look everywhere for evidence of localised patterns. The Geographical Analysis Machine (GAM) of Openshaw et al (1987) is of this type, as is the Geographical Correlates Exploration Machine (GCEM) of Openshaw et al (1990). The search requires a supercomputer but is explicitly parallel and thus well suited for the current parallel supercomputers. The problems here are the dimensionally restricted nature of the search process, being limited to geographic space; and the statistical difficulties caused by testing millions of hypotheses (even if only applied in a descriptive sense). Nevertheless, empirical tests have indicated that the GAM can be an extremely useful spatial pattern detector that will explore the largest available data sets for evidence of localised geographic patterning. The strength of the technology results from its comprehensive search strategy, the lack of any prior knowledge about the scales and nature of the patterns to expect, and its ability to handle uncertain data.
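
The brute force idea can be illustrated with a minimal sketch. It is not the published GAM: the lattice of circle centres, the Poisson tail test, the threshold alpha, and every function name below are assumptions made purely for illustration.

```python
import math


def poisson_tail(observed, expected):
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    term = math.exp(-expected)  # P(X = 0)
    cdf = 0.0
    for k in range(observed):
        cdf += term  # add P(X = k)
        term *= expected / (k + 1)
    return max(0.0, 1.0 - cdf)


def count_within(points, cx, cy, radius):
    """Count points no further than radius from (cx, cy)."""
    r2 = radius * radius
    return sum(1 for x, y in points if (x - cx) ** 2 + (y - cy) ** 2 <= r2)


def gam_scan(cases, population, radii, step, alpha=0.001):
    """Test a circle of every radius at every lattice node; return flagged circles."""
    xs = [x for x, _ in population]
    ys = [y for _, y in population]
    x0, y0 = min(xs), min(ys)
    nx = int((max(xs) - x0) / step) + 1
    ny = int((max(ys) - y0) / step) + 1
    rate = len(cases) / len(population)  # overall cases per person at risk
    flagged = []
    for r in radii:
        for i in range(nx):
            for j in range(ny):
                cx, cy = x0 + i * step, y0 + j * step
                n_pop = count_within(population, cx, cy, r)
                n_obs = count_within(cases, cx, cy, r)
                if n_pop == 0 or n_obs == 0:
                    continue
                # Assumed test: is the local count high relative to the overall rate?
                p = poisson_tail(n_obs, rate * n_pop)
                if p < alpha:
                    flagged.append((cx, cy, r, n_obs, p))
    return flagged


# Toy data: a 20 x 20 lattice of people with eight cases packed around (5, 5).
population = [(i, j) for i in range(20) for j in range(20)]
cases = [(5, 5), (5, 6), (6, 5), (6, 6), (4, 5), (5, 4), (4, 4), (6, 4),
         (15, 15), (2, 17)]
flagged = gam_scan(cases, population, radii=[1, 2], step=1)
```

Every circle is tested independently of the others, which is why a search of this kind parallelises so readily; it also shows where the multiple testing difficulty comes from, since the number of tests grows with the number of radii times the number of lattice nodes.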

3.2 Autonomous database explorer

Openshaw (1994c, 1995) outlines a further development based on a different search strategy. Borrowing ideas from Artificial Life, an autonomous pattern hunting creature can be used to search for patterns by moving around the spatial database in whatever dimensions are relevant. It operates in tri-space defined by geographic map coordinates, time coordinates, and also attribute coordinates (a dissimilarity space). In the prototype the creature is a set of hyperspheres that try to capture patterns in the data by enveloping them. The dimensions and locations of the spheres are determined by a Genetic Algorithm and performance is assessed by a sequential Monte Carlo significance test. The characteristics of the hyperspheres indicate the nature of the patterns being found. This technology is being developed further and broadened to include a search for spatial relationships and also linked to computer animation to make the analysis process visible and thus capable of being more readily understood by end-users (Openshaw and Perrie, 1995).
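
How a single candidate hypersphere might be scored can be sketched as follows. The source does not give the details of the test, so the early-stopping rule, the label-shuffling null model, and all names and parameters here are assumptions; the Genetic Algorithm that would propose centres and radii is left out.

```python
import random


def inside(point, centre, radius):
    """True if the point lies within the hypersphere (any number of dimensions)."""
    return sum((p - c) ** 2 for p, c in zip(point, centre)) <= radius ** 2


def sequential_mc_pvalue(points, labels, centre, radius,
                         max_sims=999, stop_after=10, seed=0):
    """Monte Carlo test of the case count captured by one hypersphere.

    Case labels are shuffled over the records to simulate 'no pattern'.
    Simulation stops early once stop_after shuffles have matched or beaten
    the observed count, because the sphere is then clearly unremarkable.
    """
    rng = random.Random(seed)
    member = [inside(p, centre, radius) for p in points]
    observed = sum(1 for m, lab in zip(member, labels) if m and lab)
    shuffled = list(labels)
    exceed = 0
    for sim in range(1, max_sims + 1):
        rng.shuffle(shuffled)
        simulated = sum(1 for m, lab in zip(member, shuffled) if m and lab)
        if simulated >= observed:
            exceed += 1
            if exceed == stop_after:
                return observed, exceed / sim
    return observed, (exceed + 1) / (max_sims + 1)


# Toy records with coordinates (x, y, t); an attribute axis could be appended
# in the same way. Cases are the seven records within distance 1 of (2, 2, 2).
points = [(x, y, t) for x in range(6) for y in range(6) for t in range(6)]
labels = [1 if inside(p, (2, 2, 2), 1) else 0 for p in points]
hit = sequential_mc_pvalue(points, labels, centre=(2, 2, 2), radius=1)
miss = sequential_mc_pvalue(points, labels, centre=(4, 4, 4), radius=1)
```

The early exit matters when a search procedure evaluates very many spheres: poor candidates are rejected after a handful of simulations, and only promising ones consume the full simulation budget.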

3.3 Geographic object recognition

Another strategy is that described in Openshaw (1994d) which views the problem as being one of pattern recognition. Many spatial databases are now so detailed that it is becoming increasingly difficult to abstract and identify generalisable and recurrent patterns. As the detail and resolution have increased dramatically, geographers have lost the ability to stand back and generalise or even notice recurrent patterns. It is ironic that in geography the discovery of many of the principal spatial patterns and associated theoretical speculations that exist pre-date the computer. In the early computer and data starved era geographers tested many of these pattern hypotheses using statistical methods, and looked forward to better resolution data so that new and more refined spatial patterns might be found. Now that there is such an incredible wealth of spatial detail, it is clear that here too there are no good ideas of what to do with it! Beautiful, multi-coloured maps accurate to an historically unprecedented degree show so much detail that pattern abstraction by manual means is now virtually impossible and, because this is a new state, the technology needed to aid this process still needs to be developed. Here, as in some other areas, finer resolution and more precision have not helped but hindered.

If you look at a sample of towns or cities you can easily observe broad pattern regularities in the location of good-bad-average areas etc. Each town is unique but the structure tends to repeat, especially at an abstract level. Now examine the same towns using the best available data and there is nothing that can be usefully abstracted or generalised, just a mass of data with highly complex patterns of almost infinite variation. Computer vision technology could in principle be used to search for scale and rotationally invariant two or three dimensional geographic pattern objects, with a view to creating libraries of recurrent generalisations. It is thought that knowledge of the results stored in these libraries might well contribute to the advancement of theory and concepts relating to spatial phenomena.

3.4 Very large spatial data classification tools

Classification is a very useful data reduction device able to reduce the number of cases/observations from virtually any very large number to something quite manageable such as 50 clusters (or groups) of cases/observations that share common properties. This is not a new technology; however, there is now some need to be able to classify several million (or more) cases efficiently and effectively. Scaling up conventional classifiers is not a difficult task; Openshaw et al (1985) reported the results of a multivariate classification of 22 million Italian households. However, much spatial data is imprecise, non-random, and of variable accuracy. Spatio-neural network methods based on modified Kohonen self-organising nets provide an interesting approach that can better handle these problems (Openshaw, 1994e). This version uses a modified training method that biases the net towards the most reliable data. Another variant also handles variable specific data uncertainty and has been put into a data parallel form for the Cray T3D.
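
One way a self-organising net could be biased towards the most reliable data is to present records for training with probability proportional to a reliability score. The sketch below makes that assumption; it is not the method of Openshaw (1994e), and the one-dimensional net, the decay schedules, and all names are illustrative only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def best_node(weights, record):
    """Index of the node whose weight vector is closest to the record."""
    return min(range(len(weights)), key=lambda j: sq_dist(weights[j], record))


def train_som(data, reliability, n_nodes=4, steps=2000, seed=0):
    """Train a 1-D Kohonen net, sampling records in proportion to reliability."""
    rng = random.Random(seed)
    # Start every node on a reliability-weighted draw from the data.
    weights = [list(rng.choices(data, weights=reliability, k=1)[0])
               for _ in range(n_nodes)]
    for step in range(steps):
        frac = 1.0 - step / steps              # decays from 1 towards 0
        rate = 0.5 * frac                      # learning rate
        reach = int(round(n_nodes / 2 * frac)) # neighbourhood half-width
        record = rng.choices(data, weights=reliability, k=1)[0]
        winner = best_node(weights, record)
        lo, hi = max(0, winner - reach), min(n_nodes, winner + reach + 1)
        for j in range(lo, hi):
            for d, value in enumerate(record):
                weights[j][d] += rate * (value - weights[j][d])
    return weights


# Toy data: two trustworthy groups plus two records flagged as unreliable.
reliable = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0),
            (10.0, 10.0), (11.0, 10.5), (10.5, 11.0)]
suspect = [(100.0, 100.0), (90.0, 95.0)]
data = reliable + suspect
reliability = [1.0] * len(reliable) + [0.0] * len(suspect)
weights = train_som(data, reliability)
group_of_suspect = best_node(weights, suspect[0])
```

Records with zero reliability never shape the net but can still be assigned to their nearest group afterwards with `best_node`, so doubtful data are classified without having been allowed to distort the classification.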

Another form of classification is provided by spatial data aggregation. It is not unusual for N individual records describing people or small area aggregations of them to be aggregated to M statistical reporting areas; typically, M is much less than N. Traditionally, statistical output area definitions have been fixed by a mix of tradition and historical accident, and fossilised by inertia and outmoded thinking. However, GIS removes the tyranny of users having to use fixed areas defined by others that they cannot change. User controlled spatial aggregation of very large databases is potentially very useful because: (1) it reduces data volumes dramatically but in a user controlled or application specific way; (2) it provides a mechanism for designing analytically useful zones that meet confidentiality restrictions and yield the highest possible levels of data resolution; and (3) it is becoming necessary purely as a spatial data management tool. However, if users are to be allowed to design or engineer their own zoning systems then they need special computer systems to help them. A start has been made but much remains to be done (Openshaw and Rao, 1995).
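
A minimal sketch of user controlled aggregation under a confidentiality restriction follows. It uses a simple greedy merge of neighbouring areas, which is an assumption chosen for brevity and not the zone design algorithms of Openshaw and Rao (1995); the function name, the threshold, and the toy data are hypothetical.

```python
def build_zones(population, adjacency, min_pop):
    """Merge adjacent small areas until every zone holds at least min_pop people.

    population: dict mapping area id -> population count
    adjacency:  dict mapping area id -> set of neighbouring area ids
    Returns a sorted list of zones, each a sorted list of area ids.
    """
    zones = {a: {a} for a in population}  # zone id -> member areas
    owner = {a: a for a in population}    # area id -> zone id

    def zone_pop(z):
        return sum(population[a] for a in zones[z])

    def neighbours(z):
        return {owner[n] for a in zones[z] for n in adjacency[a]} - {z}

    while True:
        small = [z for z in zones if zone_pop(z) < min_pop and neighbours(z)]
        if not small:
            break
        # Greedy rule: fix the smallest zone first, joining it to its
        # least populated neighbour so that zones stay as fine as possible.
        z = min(small, key=lambda k: (zone_pop(k), k))
        target = min(neighbours(z), key=lambda k: (zone_pop(k), k))
        for a in zones[z]:
            owner[a] = target
        zones[target] |= zones.pop(z)
    return sorted(sorted(members) for members in zones.values())


# Toy example: six small areas along a road, with a threshold of 100 people.
population = {"A": 30, "B": 80, "C": 20, "D": 120, "E": 40, "F": 70}
adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
             "D": {"C", "E"}, "E": {"D", "F"}, "F": {"E"}}
zones = build_zones(population, adjacency, min_pop=100)
# zones == [['A', 'B', 'C'], ['D'], ['E', 'F']]
```

Changing `min_pop` or the merge rule yields a different zoning system from the same N records, which is the sense in which the aggregation becomes user controlled rather than fixed by others.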

3.5 Neurofuzzy and hybrid spatial modelling systems for very large data bases

Recent developments in AI, in neural networks and fuzzy logic modelling have created very powerful tools that can be used in geographical analysis. The problem is applying them to very large spatial databases. The size and relational complexity of large databases increasingly precludes simply downloading the data in a flat file form for conventional workstation processing. Sometimes this will be possible but not always, so how do you deal with a 10 or 40 gigabyte database containing many hundred hierarchically organised tables? There are two possible approaches. Method one is to copy it all in a decomposed flat file form onto a highly parallel system with sufficient memory to hold all the data and sufficient speed to do something useful with it. The Cray T3D with 256 processors provides a 16 gigabyte memory space, albeit distributed. Large highly parallel systems with sufficient power and memory to at least load the data are clearly beginning to appear. The question of what to do then, however, still needs to be resolved.

Method two is much more subtle. Why not leave the database alone, assuming that it is located on a suitably fast parallel database engine of some kind? The problem is doing something analytical with it using only SQL commands. One approach is to re-write whatever analysis technology is considered useful so that it can be run over a network and communicate with the database via SQL instructions. The analysis machine starts off with a burst of SQL commands; it waits patiently for the answers; when they are complete, it uses the information to generate a fresh burst of SQL instructions; and so on. It is probably not too difficult to re-write most relevant analysis procedures for this type of approach.
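
The burst, wait, and re-issue cycle can be sketched with Python's standard sqlite3 module standing in for the parallel database engine. The table layout (an `events` table with columns `x`, `y`, `is_case`), the quadrant drill-down, and the rate ratio threshold are all assumptions chosen to keep the example small; any analysis that can be phrased as rounds of aggregate queries would follow the same shape.

```python
import sqlite3


def cell_counts(conn, cell):
    """Ask the database for (records, cases) inside one rectangular cell."""
    x0, y0, x1, y1 = cell
    row = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(is_case), 0) FROM events "
        "WHERE x >= ? AND x < ? AND y >= ? AND y < ?",
        (x0, x1, y0, y1),
    ).fetchone()
    return row[0], row[1]


def quadrants(cell):
    """Split a cell into four equal sub-cells."""
    x0, y0, x1, y1 = cell
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    return [(x0, y0, xm, ym), (xm, y0, x1, ym),
            (x0, ym, xm, y1), (xm, ym, x1, y1)]


def drill_down(conn, region, rounds=3, ratio=2.0):
    """Repeatedly query sub-cells, keeping those with an elevated case rate."""
    total, cases = cell_counts(conn, region)
    if total == 0:
        return []
    base_rate = cases / total
    active = [region]
    for _ in range(rounds):
        # One burst of SQL: a query for every quadrant of every active cell.
        burst = [q for cell in active for q in quadrants(cell)]
        answers = [(q, cell_counts(conn, q)) for q in burst]
        # The answers decide which cells the next burst will examine.
        keep = [q for q, (n, c) in answers if n > 0 and c / n >= ratio * base_rate]
        if not keep:
            break
        active = keep
    return active


# Toy database: 64 records on an 8 x 8 lattice with four cases in one corner block.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (x REAL, y REAL, is_case INTEGER)")
rows = [(i + 0.5, j + 0.5, int(2 <= i < 4 and 4 <= j < 6))
        for i in range(8) for j in range(8)]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
hot = drill_down(conn, (0, 0, 8, 8), rounds=2)
# hot == [(2.0, 4.0, 4.0, 6.0)]
```

Only counts travel back from the database, never the records themselves, so the data stay where they are and the volume moved over the network depends on the number of queries rather than the size of the tables.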

4 Discovering how to exploit the geocyberspace

It is increasingly urgent that new tools are developed and made available that are able to analyse and model the increasing number of very large spatial databases that exist. It is not data management or manipulation or hardware that is now so critical but analysis. It is not a matter of merely scaling existing methods designed long ago because usually the underlying technology is so fundamentally inappropriate. Nor is it a matter of being more precise about what we want to do: in fact there is often just too much data for that. Instead, there is an increasing imperative to develop the new highly computational and intelligent data analysis technologies that are needed to cope with incredibly data rich environments. In geography the world of the 21st century is perceived to be that of the geocyberspace, the world of computer information (Openshaw, 1994f). It is here where the greatest opportunities lie but it is also here where an extensive legacy of old fashioned, often early computer, philosophically inspired constraints still dominate thinking about what we can and should not do. Maybe it is different elsewhere.

References

[1] Mounsey, H., Tomlinson, R., 1988 Building databases for global science. Taylor and Francis, London

[2] Openshaw, S., Sforzi, F., Wymer, C., 1985 National classifications of individual and area census data: methodology, comparisons, and geographical significance. Sistemi Urbani 3, 283-312

[3] Openshaw, S., Charlton, M., Wymer, C., Craft, A., 1987 A Mark I Geographical analysis machine for the automated analysis of point data sets. Int J of GIS 1, 335-358

[4] Openshaw, S., Cross, A., Charlton, M., 1990 Building a prototype geographical correlates exploration machine. Int J of GIS 3, 297-312

[5] Openshaw, S., 1994a GIS crime and Spatial Analysis in Proceedings of GIS and Public Policy Conference. Ulster Business School 22-34

[6] Openshaw, S., 1994b Social costs and benefits of the Census. Proceedings of XVth International Conference of the Data Protection and Privacy Commissioners Manchester. 89-97

[7] Openshaw, S., 1994c Two exploratory space-time attribute pattern analysers relevant to GIS, in Fotheringham, S., and Rogerson, P., (eds) GIS and Spatial Analysis Taylor and Francis, London 83-104

[8] Openshaw, S., 1994d A concepts rich approach to spatial analysis, theory generation and scientific discovery in GIS using massively parallel computing, in Worboys, M.F. (ed) Innovations in GIS Taylor and Francis, London 123-138

[9] Openshaw, S., 1994e Neuroclassification of spatial data, in Hewitson, B.C., and Crane, R.G., (eds) Neural Nets: Applications in Geography Kluwer Publishers, Boston 53-70

[10] Openshaw, S., 1994f Computational human geography; exploring the geocyberspace Leeds Review 37, 201-220

[11] Openshaw, S., 1995 Developing automated and smart spatial pattern exploration tools for geographical information systems. The Statistician 44, 3-16

[12] Openshaw, S., Rao, L., 1995 Algorithms for re-engineering 1991 census geography. Environment and Planning A 27, 425-446