Data Mining and Information Fusion
This appendix addresses the science and technology of data mining and information fusion and their utility in a counterterrorism context. The use of these techniques for counterterrorist purposes has substantial implications for personal privacy and freedom. While technical and procedural measures offer some opportunities for reducing the negative impacts, there is a real tension between the use of data mining for this purpose and the resulting impact on personal privacy, as well as other consequences from false positive identification. These privacy implications are primarily addressed in other parts of this report.
THE NEED FOR AUTOMATED TECHNIQUES FOR DATA ANALYSIS
In the past 20 years, the amount of data retained by both business and government has grown to an extraordinary extent, mainly due to the recent, rapid increase in the availability of electronic storage and in computer processing speed, as well as the opportunities and competitiveness that access to information provides. Moreover, the concept of data or information has also broadened. Information that is retained for analytic purposes is no longer confined to quantitative measurements, but also includes (digitized) photographs, telephone call and e-mail content, and representations of web travels. This new view of what constitutes information that one would like to retain is inherently linked to a broader set of questions to which mathematical modeling has now been profitably
applied. For example, handwritten text can now be considered to be data, and progress in automatic interpretation of handwritten text has already reached the point that over 80 percent of handwritten addresses are automatically read and sorted by the U.S. Postal Service every day. A problem of another type on which substantial progress has also been made is how to represent the information in a photograph efficiently in digital form, since every photograph has considerable redundancy in terms of information content. It is now possible to automatically detect and locate faces in digital images and, in some restricted cases, to identify the face by matching it against a database.
This new world of greatly increased data collection and novel approaches to data representation and mathematical modeling has been accompanied by the development of powerful database technologies that provide easier access to these massive amounts of collected data. These include technologies for dealing with various nonstandard data structures, including those for representing networks between units of interest, and tools for handling the newer forms of information touched on above. A question not addressed here—but of considerable importance and a difficult challenge for the agencies responsible for counterterrorism in the United States—is how best to represent massive amounts of very disparate kinds of data in linked databases so that all relevant data elements that relate to a specific query can be easily and simultaneously accessed, contrasted, and compared.
Even with these new database management tools, the retention of data is still outpacing its effective use in many areas of application. The common concern expressed is that people are “drowning in data but starving for knowledge” (Fayyad and Uthurusamy1 refer to this phenomenon as “data tombs”). This might be the result of several disconnects, such as collecting the wrong data, collecting data with insufficient quality, not framing the problem correctly, not developing the proper mathematical models, or not having or using an effective database management and query system. Although these problems do arise, in general, more and more areas of application are discovering novel ways in which mathematical modeling, using large amounts and new kinds of information, can address difficult problems.
Various related fields, referred to as knowledge discovery in databases (KDD), data mining, pattern recognition, machine learning, and information or data fusion (and their various synonyms, such as knowledge extraction and information discovery) are under rapid development and providing new and newly modified tools, such as neural networks,
support vector machines, genetic algorithms, classification and regression trees, Bayesian networks, and hidden Markov models, to make better use of this explosion of information.
While the gains in certain applications have sometimes been overstated, these techniques have enjoyed impressive successes in many different areas.2 Data mining and related analytical tools are now used extensively to expand existing business and identify new business opportunities, to identify and prevent customer churn, to identify prospective customers, to spot trends and patterns for managing supply and demand, to identify communications and information systems faults, and to optimize business operations and performance. Some specific examples include:
In image classification, SKICAT outperformed humans and traditional computational techniques in classifying images from sky surveys comprising 3 terabytes (10^12 bytes) of image data.
In marketing, American Express reported a 10-15 percent increase in credit card use through the application of marketing using data mining techniques.
In investment, LBS Capital Management uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million, outperforming the broad stock market.
In fraud detection, PRISM systems are used for monitoring credit card fraud; more generally, data mining techniques have been dramatically successful in preventing billions of dollars of losses from credit card and telecommunications fraud.
In manufacturing, CASSIOPEE diagnosed and predicted problems for the Boeing 737, receiving the European first prize for innovative application.
In telecommunications, TASA uses a novel framework for locating frequently occurring alarm episodes from the alarm stream, improving the ability to prune, group, and develop new rules.
In the area of data cleaning, the MERGE-PURGE system was successfully applied to the identification of welfare claims for the State of Washington.
In the area of Internet search, data mining tools have been used to improve search tools that assist in locating items of interest based on a user profile.
Under their broadest definitions, data mining techniques include a
diverse set of tools for mathematical modeling, going by such names as knowledge discovery, machine learning, pattern recognition, and information fusion. The data on which these techniques operate may or may not be personally identifiable information, and indeed they may not be associated with individuals at all, although of course privacy issues are implicated when such information is or can be linked to individuals.
Knowledge discovery is a term somewhat broader than data mining; it denotes the entire process of using unprocessed data to generate information that is easy to use in a decision-making context. Machine learning is the study of computer algorithms that often form the core of data mining applications. Pattern recognition refers to a class of data mining approaches that are often applied to sensor data, such as digital photographs, radiological images, and sonar data.
Finally, data and information fusion are data mining methods that combine information from disparate sources (often so much so that it is difficult to define a formal probabilistic model to assist in summarizing the information). Information fusion seeks to increase the value of disparate but related information above and beyond the value of the individual pieces of information (“obtaining reliable indications from unreliable indicators”).
Because data mining has been useful to decision making in many diverse problem domains, it is natural and important to consider the extent to which such methodologies have utility in counterterrorism efforts, even if there is considerable uncertainty regarding the problems to which data mining can be productively applied.
One issue is whether and to what extent data mining can be effectively used to identify people (or events) that are suspicious with respect to possible engagement in activities related to terrorism; that is, whether various data sources can be used with various data mining algorithms to help select people or events that intelligence agents working in counterterrorism would be interested in investigating further. Data mining algorithms are proposed as being able to rank people and events from those of greatest interest to those of least interest, with the potential to dramatically reduce the number of cases that intelligence agents have to examine.
Of course, human beings would be still required both to set the thresholds that delineate which people would receive further review and which would not (presumably dependent on available resources) and to check the cases that were selected for further inspection prior to any actions. That is, human experts would still decide, probably on an individual basis, which cases were worthy of further investigation.
A second issue is the possibility that data mining has additional uses beyond identifying and ranking candidate people and events for intelligence agents. Specifically, data mining algorithms might also be used
as components of a data-supported counterterrorist system, helping to perform specific functions that intelligence agents find useful, such as helping to detect aliases, or combining all records concerning a given individual and his or her network of associates, or clustering events by certain patterns of interest, or logging all investigations into an individual’s activity history. Data mining could even help with such tasks as screening baggage or containers. Such tools may not specifically rank people as being of interest or not of interest, but they could contribute to those assessments as part of a human-computer system. This appendix considers these possible roles in an examination of what is currently known about data mining and its potential for contributing to the counterterrorism effort.
An important related question is the issue of evaluating candidate techniques to judge their effectiveness prior to use. Evaluation is essential, first, because it can help to identify which among several contending methods should be implemented and whether they are sufficiently accurate to warrant deployment. Second, it is also useful to continually assess methods after they have been fielded to reflect external dynamics and to enable the methods to be tuned to optimize performance. Also, assuming that these new techniques can provide important benefits in counterterrorist applications, it is important to ask about the extent to which their application might have negative effects on privacy and civil liberties and how such negative effects might be ameliorated. This topic is the focus of Appendix L.
PREPARING THE DATA TO BE MINED
It is well known by those engaged in implementing data mining methods that a large fraction of the energy expended in using these methods goes into the initial treatment of the various input data files so that the data are in a form consistent with the intended use (data correction and cleaning, as described in Section C.1.2). The goal here is not to provide a comprehensive list of the issues that arise in these efforts, but simply to mention some of the common hurdles that arise prior to the use of data mining techniques so that the entire process is better understood.
The following discussion focuses on databases containing personal information (information about many specific individuals), but much of the discussion is true for more general databases.
Several common data deficiencies need prior treatment:
Reliable linkages. Often several databases can be used to provide information on overlapping sets of individuals, and in these cases it is extremely useful to identify which data entries are for the same individuals across the various databases. This is a surprisingly difficult and
error-prone process due to a variety of complications: (1) identification numbers (e.g., Social Security numbers, SSNs) are infrequently represented in databases, and when they are, they are sometimes incorrect (SSNs, in particular, have deficiencies as a matching tool, since in some cases more than one person has the same SSN, and in others people have more than one SSN, not to mention the data files that attribute the wrong SSNs to people). (2) There are often several ways of representing names, addresses, and other characteristics (e.g., use of nicknames and maiden names). (3) Errors are made in representing names and other characteristics (e.g., misspelled names, switching first and last names). (4) Matching on a small number of characteristics, such as name and birth date, may not uniquely identify individuals. (5) People’s characteristics can change over time (e.g., people get married, move, and get new jobs). Furthermore, deduplication—that is, identifying when people have been represented more than once on the same database—is hampered by the same deficiencies that complicate record linkage.
Herzog et al. point out the myriad challenges faced in conducting record linkage.3 They point out that the ability to correctly link records is surprisingly low, given the above listed difficulties. (This is especially the case for people with common names.) The prevalence of errors for names, addresses, and other characteristics in public and commercial data files greatly increases the chances of records either being improperly linked or improperly left unlinked. Furthermore, given the size of the files in question, record linkage generally makes use of blocking variables to reduce the population in which matches are sought. Errors in such blocking variables can therefore result in two records for the same individual never being compared. Given that data mining algorithms use as a fundamental input whether the joint activities of an individual or group of individuals are of interest or not, the possibility that these joint activities are actually for different people (or that activities that are joint are not viewed as joint since the individuals are considered to be separate people) is a crucial limitation to the analysis.
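The role of blocking variables in record linkage can be illustrated with a minimal sketch. The records, fields, and blocking key below are hypothetical, and the normalization is deliberately crude; the point is only to show how blocking cuts the number of comparisons and how an error in a blocking variable can prevent a true match from ever being compared:

```python
from collections import defaultdict

def normalize(name):
    """Crude normalization: lowercase, strip punctuation and extra spaces."""
    return " ".join("".join(c for c in name.lower()
                            if c.isalpha() or c.isspace()).split())

def block_key(record):
    """Blocking variable: first letter of surname plus birth year.
    An error in either field puts the record in the wrong block,
    so a true match may never be compared."""
    surname = normalize(record["name"]).split()[-1]
    return (surname[0], record["birth_year"])

def candidate_pairs(records):
    """Compare records only within blocks, not all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    pairs = []
    for group in blocks.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                pairs.append((group[i]["id"], group[j]["id"]))
    return pairs

records = [
    {"id": 1, "name": "Robert Smith",  "birth_year": 1970},
    {"id": 2, "name": "robert smith.", "birth_year": 1970},  # same person, messy entry
    {"id": 3, "name": "Robert Smith",  "birth_year": 1907},  # transposed year: wrong block
]
print(candidate_pairs(records))  # → [(1, 2)]; record 3 is never compared
```

Record 3, whose birth year was mistyped, lands in a different block and is never considered as a match, which is precisely the failure mode described above.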
Appropriate database structure. The use of appropriate database management tools can greatly expedite various data mining methods. For example, the search for all telephone numbers that have either called a particular number or been called by that number can be carried out orders of magnitude faster when the database has been structured to facilitate such a search. The choice of the appropriate database framework can therefore be crucially important. Included in this is the ability to link
relevant data entries, to “drill down” to subsets of the data using various characteristics, and to answer various preidentified queries of interest.
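The payoff from an appropriate database structure can be sketched in a few lines. The call records below are hypothetical; the idea is simply that building forward and reverse indexes once turns each contact query into a dictionary lookup instead of a scan over every record:

```python
from collections import defaultdict

# Hypothetical raw call detail records: (caller, callee) pairs.
calls = [("555-0101", "555-0199"), ("555-0142", "555-0101"),
         ("555-0101", "555-0123"), ("555-0177", "555-0142")]

# Build two indexes once; afterward each query is a constant-time
# lookup rather than a pass over the full call stream.
outgoing = defaultdict(set)   # number -> numbers it called
incoming = defaultdict(set)   # number -> numbers that called it
for caller, callee in calls:
    outgoing[caller].add(callee)
    incoming[callee].add(caller)

def contacts(number):
    """All numbers that either called `number` or were called by it."""
    return outgoing[number] | incoming[number]

print(sorted(contacts("555-0101")))
```

On data sets with billions of call records, this difference between an indexed lookup and a linear scan is the "orders of magnitude" referred to above.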
Treatment of missing data. Nonresponse (not to mention undercoverage) is a ubiquitous feature of large databases. Missing characteristics can also result from the application of editing routines that search for joint values for variables that are extremely unlikely, which, if found, are deleted. (A canonical example is a male who reports being pregnant.) Many data mining techniques either require or greatly benefit from the use of data sets with no missing values. To create a data file with the missing values filled in, imputation techniques are used, which collectively provide the resulting database with reasonable properties, under the assumption that the missing data are missing at random. (Missing at random means that the distribution of the missing information is not dependent on unobserved characteristics. In other words, missing values have the same joint distribution as the nonmissing values, given the other nonmissing values available in the database.) If the missing data are not missing at random, the resulting bias in any subsequent analysis may be difficult to address. The generation of high-quality imputations is extremely involved for massive data sets, especially those with a complicated relational structure.
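A deliberately simple imputation can make the missing-at-random assumption concrete. The sketch below fills each missing value with the mean of its group, a strategy that is defensible only if missingness is unrelated to the unobserved value once the grouping variable is taken into account (the records and fields are hypothetical):

```python
def impute_within_groups(rows, group_field, target_field):
    """Fill missing target values with the group mean -- a crude
    imputation that is defensible only if the data are missing at
    random given the grouping variable."""
    sums, counts = {}, {}
    for r in rows:
        v = r[target_field]
        if v is not None:
            g = r[group_field]
            sums[g] = sums.get(g, 0.0) + v
            counts[g] = counts.get(g, 0) + 1
    for r in rows:
        if r[target_field] is None:
            g = r[group_field]
            r[target_field] = sums[g] / counts[g]
    return rows

rows = [
    {"region": "A", "income": 40.0},
    {"region": "A", "income": 60.0},
    {"region": "A", "income": None},   # imputed with the region-A mean, 50.0
    {"region": "B", "income": 30.0},
]
impute_within_groups(rows, "region", "income")
print(rows[2]["income"])  # → 50.0
```

If, say, high incomes were systematically more likely to be missing, this procedure would understate them, which is the bias referred to above when data are not missing at random.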
Equating of variable definitions. Very often, when merging data from various disparate sources, one finds information for characteristics that are similar, but not identical, in terms of their definition. This can result from time dynamics (such as similar characteristics that have different reference periods), differences in local administration, geographic differences, and differences in the units of data collection. (An example of differences in variable definitions is different diagnostic codes for hospitals in different states.) Prior to any linkage or other combination of information, such differences have to be dealt with so that the characteristics are made to be comparable from one person or unit of data collection to the next.
Overcoming different computing environments. Merging data from different computer platforms is a long-standing difficulty, since it is still common to find data files in substantially different formats (including some data not available electronically). While automatic translation from one format to another is becoming much more common, there still remain incompatible formats that can greatly complicate the merging of databases.
Data quality. Deficiencies in data quality are generally very difficult to overcome. Not only can there be nonresponse and data linkage problems as indicated above, but also there can be misresponse due to a number of problems, including measurement error and dated responses. (For example, misdialing a phone number might cause one to become
classified as a person of interest.) Sometimes use of multiple sources of data can provide opportunities for verification of information and can be used to update information that is not current. Also, while not a data problem per se, sometimes data (that might be of high quality) have little predictive power for modeling the response of interest. For example, data on current news magazine subscriptions might be extremely accurate, but they might also provide little help in discriminating those engaged in terrorist activities.
SUBJECT-BASED DATA MINING AS AN EXTENSION OF STANDARD INVESTIGATIVE TECHNIQUES
This appendix primarily concerns the extent to which state-of-the-art data mining techniques, by combining information in relatively sophisticated ways, may be capable of helping police and intelligence officers reduce the threat from terrorism. However, it is useful to point out that there are applications of data mining, sometimes called subject-based data mining,4 that are simply straightforward extensions of long-standing police and intelligence work, which through the benefits of automation can be greatly expedited and broadened in comparison to former practices, thereby providing important assistance in the fight against terrorism. Although the extent to which these more routine uses of data have already been implemented is not fully known, there is evidence of widespread use both at the federal level and in local police departments.
For example, once an individual is under strong suspicion of participating in some kind of terrorist activity, it is standard practice to examine that individual’s financial dealings, social networks, and comings and goings to identify coconspirators, for direct surveillance, etc. Data mining can expedite much of this by providing such information as (1) the names of individuals who have been in e-mail and telephone contact with the person of interest in some recent time period, (2) alternate residences, (3) an individual’s financial withdrawals and deposits, (4) people that have had financial dealings with that individual, and (5) recent places of travel.
Furthermore, the activity referred to as drilling down—that is, examining the subset of a dataset that satisfies certain constraints—can also be used to help with typical police and intelligence work. For example, several known characteristics of an individual of interest, such as a description of his or her automobile, a partial license plate, or partial fingerprints, might be used to narrow the field to a much smaller subset of possible suspects for further investigation.
The productivity and utility of a subject-based approach to data mining depends entirely on the rules used to make inferences about subjects of interest. For example, if the rules for examining the recent places to which an individual has traveled are unrelated to the rules for flagging the national origin of large financial transactions, inferences about activities being worthy of further investigation may be less useful than if these rules are related. Counterterrorism experts thus have the central role in determining the content of the applicable rules, and most experts can make up lists of patterns of behavior that they would find worrisome and therefore worthy of further investigation. For example, these might include the acquisition of such materials as toxins, biological agents, guns, or components of explosives (when their occupations do not involve their use) by a community of individuals in regular contact with each other. Implemented properly, rule-based systems could be very useful for reducing the workload of intelligence analysts by helping them to focus on subjects worthy of further investigation.
The committee recognizes that when some of the variables in question refer to personal characteristics rather than behavior, issues of racial, religious, and other kinds of stereotyping immediately arise. The committee is silent on whether and under what circumstances personal characteristics do have predictive value, but even if they do, policy considerations may suggest that they not be used anyway. In such a situation, policy makers would have to decide whether the value for counterterrorism added by using them would be large enough to override the privacy and civil liberties interests that might be implicated through such use.
PATTERN-BASED DATA MINING TECHNIQUES AS ILLUSTRATIONS OF MORE SOPHISTICATED APPROACHES
Originating in various subdisciplines of computer science, statistics, and operations research, a class of relevant data mining techniques for counterterrorist application includes (1) those that might be used to identify combinations of variables that are associated with terrorist activities and (2) those that might identify anomalous patterns that experts would anticipate would have a higher likelihood of being linked to terrorist activities. The identification of combinations of variables that are associated with terrorist activities essentially requires a training set—which is a set of data representing the characteristics of people (or other units) of interest and those not of interest, so that the patterns that best discriminate between these two groups can be discerned.5 This use of a training set is referred to as supervised learning.
The creation of a training set requires the existence of ground truth. That is, for a supervised learning application to learn to distinguish X (i.e., things or people or activities of interest) from not-X (i.e., things or people or activities not of interest), the training set must contain a significant number of examples of both X and not-X.
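Supervised learning in its simplest form can be sketched with a nearest-neighbor classifier. The feature vectors, labels, and query below are hypothetical; the point is only that the method needs labeled examples of both X and not-X in order to assign a label to a new case:

```python
def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_neighbor_classify(train, query):
    """train: list of (feature_vector, label) pairs with ground truth.
    Returns the label of the closest training example -- supervised
    learning in its simplest form."""
    return min(train, key=lambda ex: euclidean(ex[0], query))[1]

# Hypothetical training set containing ground truth for both classes.
train = [
    ((0.1, 0.2), "not_of_interest"),
    ((0.2, 0.1), "not_of_interest"),
    ((0.9, 0.8), "of_interest"),
    ((0.8, 0.9), "of_interest"),
]
print(nearest_neighbor_classify(train, (0.85, 0.75)))  # → of_interest
```

If the training set contained no examples labeled "of_interest"—the situation described below for terrorist activity—the classifier could never produce that label, no matter how anomalous the query.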
As an example, consider airport baggage inspections. Here, supervised learning techniques can provide an improvement over rule-based expert systems by using feedback loops and training sets to refine algorithms through continued use and evaluation. Machines that use various types of sensing to “look” inside baggage for weapons and explosives can be trained over time to discriminate between suspicious bags and nonsuspicious ones. Given the large volume of training data that can be collected from many airports, such machines might in time demonstrate greater proficiency than human inspectors.
The inputs to such a procedure could include the types of bags, the arrangement of items inside the bags, the images recorded when the bags are sensed, and information about the traveler. Useful training sets should be very easy to produce in this application for two reasons. First, many people (sometimes inadvertently) pack forbidden items in carry-on luggage, thereby providing many varied instances of data from which the system could learn. Second, ground truth is available, in the sense that bags selected for further inspection can be objectively determined to contain forbidden items or not. (It would be useful, in such an application, to randomly select bags that were viewed as uninteresting for inspection to measure the false negative rate.) Furthermore, if necessary, a larger number of examples of forbidden articles can be introduced artificially—this process would increase the number of examples from which an algorithm might learn to recognize such items.6
The requirement in supervised learning methods that a training set must contain a significant number of labeled examples of both X and not-X places certain limitations on their use. In the context of distinguishing between terrorist and nonterrorist activity, because of the relative infrequency of terrorist activity, only a few instances can be included in a training set, and thus learning to discriminate between
normal activity and preterrorist activity through use of a labeled training set will be extremely challenging. Moreover, even a labeled training set can miss unprecedented types of attacks, since the ground truth they contain (whether or not an attack occurred) is historical rather than forward-looking.
By contrast, a search for anomalous patterns is an example of unsupervised learning, which is often based on examples for which no labels are available. The definition of anomalous behavior that is relevant to terrorist activity is rather fuzzy, although it can be separated into two distinct types. First, behavior of an individual or household can be distinctly different from its own historical behavior, although such differences may not (indeed, most often will not) relate specifically to terrorist behavior. For example, credit card use or patterns of telephone calls can be distinctly different from those observed for the same individual or individuals in the past. This is referred to as signature-based anomaly detection. Second, behavior can be distinctly different cross-sectionally; that is, an individual or household’s behavior can be distinctly different from that of other comparable individuals or households. Unsupervised learning seeks to identify anomalous patterns, some of which might indicate novel forms of terrorist activity. Candidate patterns must be checked against and validated by expert judgment.
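Signature-based anomaly detection, the first of the two types, can be sketched as a comparison of a new observation against a unit's own history. The monthly call counts below are hypothetical, and a standard score is only one of many possible anomaly measures:

```python
def zscore(history, value):
    """How far `value` is from the unit's own historical behavior,
    measured in standard deviations (signature-based anomaly detection)."""
    n = len(history)
    mean = sum(history) / n
    var = sum((x - mean) ** 2 for x in history) / n
    sd = var ** 0.5 or 1.0   # guard against a constant history
    return (value - mean) / sd

# Hypothetical monthly call counts for one household.
history = [30, 28, 32, 31, 29]
print(round(zscore(history, 30), 2))   # typical month: score of 0.0
print(round(zscore(history, 90), 2))   # sharp departure from the household's own signature
```

A cross-sectional version of the same idea would compute the score of one household's value against the distribution of comparable households rather than against its own past.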
As an example, consider the simultaneous booking of seats on an aircraft by a group of unrelated individuals from the same foreign country without a return ticket. A statistical model could be developed to estimate how often this pattern would occur assuming no terrorism and therefore how anomalous this circumstance was. If it turned out that such a pattern was extremely common, possibly no further action would be taken. However, if this were an extremely rare occurrence, and assuming that intelligence analysts viewed this pattern as suspicious, further investigation could be warranted.
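One very simple null model for such a rarity estimate is a binomial tail probability. The flight size, per-booking probability, and threshold below are hypothetical; a realistic model would of course account for dependence among bookings:

```python
from math import comb

def tail_prob(n, p, k):
    """P(at least k of n independent bookings match the profile),
    under a no-terrorism null model with per-booking probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# Hypothetical numbers: a 200-seat flight, a 1-in-500 chance that any
# given booking matches the profile (same foreign country, no return
# ticket). Five such bookings on one flight would then be very rare.
print(tail_prob(200, 0.002, 5))
```

A pattern with a tiny probability under the null model is flagged as anomalous; one with a large probability is too common to be informative, which is the distinction drawn in the paragraph above.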
A more recent class of data mining techniques, which are still under development, uses relational databases as input.7 Relational databases represent linkages between units of analysis, and in a counterterrorism context the key example is social networks. Social networks are people who regularly communicate with each other, for example, by telephone or e-mail, and who might be acting in concert. Certainly, if one could produce a large relational database of individuals known to be in communication, it would be useful. One could then identify situations similar to those in which each member acquired an uninteresting amount of some chemical, but in which the total amount over all communicating individuals was capable of doing considerable harm. Of course, the vast majority of the networks would be entirely innocent, and only a few would be worthy of further investigation. However, having the potential for such assessments could be useful.
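The aggregation step described above can be sketched by finding connected components of a communication graph and summing purchases within each. The individuals, links, quantities, and threshold below are all hypothetical:

```python
from collections import defaultdict, deque

def components(edges):
    """Connected components of a communication graph (who talks to whom)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for node in adj:
        if node in seen:
            continue
        comp, queue = set(), deque([node])
        while queue:
            u = queue.popleft()
            if u in comp:
                continue
            comp.add(u)
            queue.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Hypothetical communication links and individually small purchases
# of the same chemical (kilograms).
edges = [("A", "B"), ("B", "C"), ("D", "E")]
purchases = {"A": 2, "B": 3, "C": 4, "D": 1, "E": 1}
for comp in components(edges):
    total = sum(purchases.get(p, 0) for p in comp)
    if total >= 8:   # hypothetical threshold of concern
        print(sorted(comp), total)   # only A-B-C exceeds it jointly
```

No individual purchase here is notable on its own; only the relational structure makes the A-B-C group's combined total visible, which is the point of using linked data.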
A key question then is, how useful is pattern-based data mining likely to be in counterterrorism? Without more empirical experience, it is difficult to make strong assertions, but some things are relatively clear. When training sets are available, as in the case of baggage inspection, pattern-based data mining techniques are very likely to provide substantial benefits. At this point, it is not known how prevalent such applications are likely to be, but an effort should be made to identify such situations, given the strong tools available in such cases. Also, when there is a specific initiating person(s) or event(s) known to be of interest, as argued in the previous section, subject-based techniques are certain to be very useful in helping those working in counterterrorism to expeditiously find other people and events of interest.
In the absence of training sets, and for the situation in which there are no initiating persons or events to provide initial foci of investigation, the benefits obtained from the use of pattern-based data mining techniques for counterterrorism are likely to be minimal. The reason is that ordinary people often engage in anomalous activities. Many people have the experience of having been temporarily restricted from making credit card purchases because their recent transactions have been viewed as being atypical. People travel to places they haven’t been before, make larger withdrawals of funds than they have before, buy things they haven’t bought before, and they call and e-mail people whom they have not called or e-mailed before.
A basic result from multivariate statistical analysis is that, when more characteristics are considered simultaneously, it is more likely for such joint events to be unusual relative to the remainder of the data. So, if the simultaneous actions of travel, communications, purchases, movement of funds, and so on are considered jointly, it is more likely that a joint set of characteristics will be viewed as anomalous. Therefore, searches for anomalous activities, without being trained and without using some linkage to a ground truth assessment of whether the activity is or is not terrorist-related, are much more likely to focus on innocent activity rather than activity related to terrorism.
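The compounding of false alarms across characteristics can be illustrated with a short calculation. The per-feature rate below is hypothetical, and the features are assumed independent for simplicity:

```python
def prob_flagged(p_per_feature, k):
    """Probability that an innocent person looks anomalous on at least
    one of k independently monitored features, each with false alarm
    rate p_per_feature."""
    return 1 - (1 - p_per_feature) ** k

# Even a 1 percent per-feature false alarm rate compounds quickly as
# more characteristics (travel, purchases, funds, communications) are
# monitored simultaneously.
for k in (1, 5, 20):
    print(k, round(prob_flagged(0.01, k), 3))
```

Applied to an overwhelmingly innocent population, even modest per-feature rates imply that the vast majority of flagged cases will be innocent, which is the concern expressed above.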
Data mining tools can also be useful to intelligence analysts if they can reduce the time it currently takes them to carry out their current duties, as long as their accuracies are not less than those of the analysts. (Of course, if the analysts are unable to do a good job, because of data inadequacies for example, automated data mining tools will also result
in a bad job, only faster. In such a case, the tools can’t hurt, but spending money to acquire them may not be the best use of limited resources.)
As an illustration, consider a suite of data mining tools that facilitates the detection of aliases, record linkages concerning a given individual and his or her network of associates, identification of clusters of related events by certain patterns of interest, and indexed audio/images/video from surveillance monitors. Add to this suite data mining tools that perform as well as a very good analyst in identifying patterns of interest but do so more quickly. Such a suite could improve the productivity of an analyst significantly by allowing him or her to spend less time on “grunt work” and more time on cases that do warrant further investigation. Note also that these activities are likely not to require training sets for their development.
It is not the goal of this appendix to include a description of the objectives or operations of the leading data mining techniques. (Excellent tutorials exist for most of the important methods, and software is typically readily available. Also, a number of recent texts provide excellent descriptions of the majority of the current data mining techniques.8) Some of the prominent techniques are listed in Box H.1.
Different techniques have different attributes that make any given technique more or less suitable in a given application. These attributes include whether, or the extent to which, a given technique:
Is scalable. Scalability indicates whether the technique will run efficiently on very large data sets. Scalability is important because some data sets are far too large for an inefficient technique to process in any reasonable length of time.
Easily incorporates privacy protections. If so, it will be possible to incorporate into the methodology algorithms that provide reasonable protections against disclosures.
Is easily interpretable. An easily interpretable technique is one for which the general predictive model underlying the technique can be communicated to analysts without specific training.
Is able to handle missing data. Some techniques are better than others at handling data sets with missing values.
Has effective performance with low-quality data (i.e., with a small fraction of data having widely discrepant values).
Has effective performance in the face of erroneous record linkages. The issue arises because record linkages often result in data with such values (10 percent or more of the data may have such values), and some techniques do not perform reliably when applied to such data.

Is resistant to gaming. Resistance to gaming indicates whether an adversary can take countermeasures to reduce the effectiveness of the method.

BOX H.1 Common Data Mining Techniques

Hidden Markov models
Nearest neighbor estimation
Support vector machines
THE EVALUATION OF DATA MINING TECHNIQUES
It is crucially important that analysts planning to use a data mining algorithm for counterterrorism have some objective understanding of its performance, both prior to use and continually updated while in use. Evaluation provides the basis for (1) understanding the quality of the assessments provided, which is particularly important when those assessments are to be used in conjunction with other sources of information; (2) judging quantitatively the trade-offs between the benefits derived from the use of an algorithm and its associated costs, especially including a decrease in privacy, and in particular whether those trade-offs justify its use; and (3) determining when a competing algorithm should be adopted as a replacement or when a modification should be made. Evaluation of data mining techniques can be particularly difficult in certain counterterrorism applications, for several reasons discussed below.
The Essential Difficulties of Evaluation
Evaluation of data mining methods can be carried out in two very general ways. First, internal validation can be used: an algorithm is examined step-by-step, assessing the likelihood of any assumptions obtaining, the quality of input data, the validity of any statistical models, etc. Sensitivity analyses are used in internal validation to examine the impact of divergences from the ideal. Second, external validation compares the predictions to ground truth for situations in which ground truth is available. External validation is very strongly preferred, since it is a direct assessment of the value of a data mining tool.
As mentioned above, data mining algorithms could play a number of very disparate supplementary roles in counterterrorism. In some cases, evaluation might be straightforward, such as when the data mining tool performs the same function currently performed by intelligence agents, but much faster. An example might be the logging of all investigations into an individual’s activity history.
However, in situations in which a pattern-based data mining algorithm is being used to discriminate between people or events of interest and those not of interest, a training set is not only extremely important for developing the data mining algorithm, but it is also nearly essential for carrying out an evaluation of such an algorithm when it has completed development. The difficulties in developing such algorithms therefore translate to difficulties in their evaluation.
As far as the committee knows, no data sets are available that represent the activities of a diverse group of people including both terrorists (i.e., people of interest and worthy of further investigation) and non-terrorists (i.e., those not of interest), with each correctly identified as such in the database. Also, since the development of procedures used to discriminate between two populations is greatly facilitated when substantial numbers of both types are represented in the training set, the rarity of terrorist events, and more broadly the rarity of people of interest, complicates both the development and the evaluation of data mining techniques for counterterrorism.
Even if a procedure could be evaluated on a current training set, there is always the possibility that terrorists could adjust (game) their procedures to avoid detection once a methodology is implemented.9
Even without gaming, other dynamics might impact the effectiveness of a methodology over time. So not only is there a need for evaluation, but also there is a need for constant reevaluation.
To address this situation, evaluation must be carried out as an iterative process, in which techniques are initially implemented on a research basis, followed by a period of continuous evaluation and testing. Then, only those procedures that have demonstrated their utility would be formally deployed, and, after deployment, procedures would be continuously evaluated both to monitor their performance given the dynamic nature of the threat and to tune procedures to increase their effectiveness.
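The continuous-reevaluation loop just described can be sketched as a rolling monitor that tracks a deployed procedure's precision over successive review periods and flags it for retuning or retirement when performance degrades. The window length and performance floor here are illustrative assumptions.

```python
# Sketch of continuous reevaluation: track the precision of a deployed
# procedure over successive review periods and flag it when the trailing
# average drops below a floor. Threshold and window are assumptions.

def monitor(precision_by_period, floor=0.5, window=3):
    """Return indices of periods where the trailing-window mean precision
    falls below `floor`, signaling a need for reevaluation."""
    alerts = []
    for i in range(window - 1, len(precision_by_period)):
        recent = precision_by_period[i - window + 1 : i + 1]
        if sum(recent) / window < floor:
            alerts.append(i)
    return alerts

# A procedure that degrades over time (e.g., as adversaries adapt):
history = [0.8, 0.75, 0.7, 0.55, 0.45, 0.35]
print(monitor(history))  # later periods trigger alerts
```

A trailing window, rather than a single period, keeps one noisy review cycle from prematurely retiring an otherwise effective procedure.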
The importance of evaluation here is difficult to overstate, since the use of ineffective data mining procedures represents a threefold cost. First, there is the potentially enormous cost of using a less effective algorithm for identifying terrorists and possibly failing to prevent an attack. Second, there is the serious impact each additional data mining procedure has on the freedoms and privacy of U.S. citizens. Third, investigating false leads from ineffective data mining procedures may waste substantial resources, diverting attention from real threats. For these reasons, evaluation plays an important role in the committee’s framework. It is therefore vitally important that procedures be comprehensively evaluated both in development and, if implemented, throughout the history of their use, and further that implementation be contingent on a careful assessment of a technique’s effectiveness, as well as its costs in terms of impact on privacy and the resources required for continued use. Furthermore, it is crucial, given the finite resources and the costs to privacy, that poorly performing procedures be removed from development or from use as soon as possible.
Some progress in the evaluation of data mining techniques for counterterrorism can be made without the use of training sets. In the dichotomous supervised learning case, in which one is using data mining to discriminate between terrorist activities and nonterrorist activities, two types of errors that can be made are false positives and false negatives.
While some data mining techniques do separate the cases into those of interest and those not of interest, most only rank-order the cases from those of least interest to those of greatest interest, without specifying where a line should be drawn between the two groups. In practice, however, the intelligence and police agencies are likely to draw a line at some point, based on the results of the data mining algorithm: some people will be further investigated (which may mean having an analyst look over the data and approve or decline further investigation) and some people will not be. Therefore, for evaluation purposes, it makes sense to proceed as if false positives and false negatives are the direct result of the application of data mining methods.
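Once such a line is drawn, the two error types follow directly from the rank-ordering. A minimal sketch, using made-up scores and ground-truth labels rather than data from any real system:

```python
# Sketch: once an agency draws a line through a rank-ordering, false
# positives and false negatives follow directly. Scores and labels are
# illustrative, not drawn from any real system.

def errors_at_threshold(scored_cases, threshold):
    """scored_cases: list of (score, is_actually_of_interest) pairs.
    Cases at or above `threshold` are investigated."""
    false_pos = sum(1 for s, truth in scored_cases if s >= threshold and not truth)
    false_neg = sum(1 for s, truth in scored_cases if s < threshold and truth)
    return false_pos, false_neg

cases = [(0.9, True), (0.8, False), (0.6, True), (0.4, False), (0.2, False)]
print(errors_at_threshold(cases, 0.7))   # moderate threshold
print(errors_at_threshold(cases, 0.95))  # stricter line: fewer FPs, more FNs
```

Moving the line trades one error type against the other, which is precisely why its placement is a policy judgment rather than a purely technical one.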
Even without a training set, the assessment of the false positive rate for a procedure is in some sense straightforward, because if a procedure identifies a number of people as being of interest, one can further investigate (a sample of) such people and determine whether they were, in fact, of interest. However, this procedure is clearly resource-intensive.
The assessment of the false negative rate is considerably more difficult than for the false positive rate. A number of ideas might be suggested to produce a type of training set for use in evaluation:
Have intelligence and police officers look at data on (likely) tens of thousands of individuals, identifying some as worthy of further investigation, with the remainder not of interest. However, the likely result is that a training set constructed in this way will not contain very many people of interest, given the rarity of terrorist activity.
To deal with the lack of identified people of interest, one could relax the definition of “person of interest” to include people with less direct links to terrorist activity. This will boost the number of people of interest in the training set. However, the obvious problem is that the resulting data mining procedure will then be oriented to identify many more false positive cases.
Another way of increasing the number of identified people of interest is to introduce synthetic data that represent fictitious people worth further investigation.
Any of these ideas will require assessments of which cases are and are not of interest, which will require more resource-intensive use of analysts to make the assessments. Once such a training set is created, algorithms can then be run on the data to determine the ability of a procedure to correctly discriminate between those of interest and those not of interest.
The goal is that, over time, the data mining procedures trained on such data would mimic what the intelligence officers would do if they could process millions of data records. The downside is that the data mining algorithm is then limited to mimicry and does not have the capacity to anticipate new terrorism patterns that might elude intelligence experts.
A variety of approaches can be used for evaluating data mining methodologies. Many possibilities for evaluation exist in addition to the ones described below, including various forms of sensitivity analysis and
measuring the performance of an algorithm using mixtures of real and synthetic data sets.
One should not evaluate a data mining routine on the same training set that was used to develop the procedure, since the routine will be overfit to that particular data set, and an assessment of its performance on that training set will therefore be optimistically biased. Cross-validation is one approach that can be used to counter this bias.
Cross-validation denotes an approach in which the available data are first separated into a training subset and a test subset. The system is trained using only the training subset, and the resulting trained procedure is then evaluated on the held-out test subset. One typical procedure, called k-fold cross-validation, involves randomly splitting the training sample into k equal-sized subsets, training a procedure using all but the ith subset, and evaluating that procedure by using it to predict the response of interest for the cases in the set-aside ith subset. This process is repeated so that each of the k subsets serves as the test set in one of the folds, and the accuracy is averaged over these k repetitions.
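The k-fold procedure can be sketched in a few lines. The toy one-dimensional nearest-neighbor classifier and synthetic data below are stand-ins; a real evaluation would substitute the actual data mining procedure.

```python
# Minimal k-fold cross-validation sketch using a toy 1-nearest-neighbor
# classifier on synthetic one-dimensional data.
import random

def nearest_neighbor_predict(train, x):
    """Predict the label of x from its single nearest training point."""
    return min(train, key=lambda pt: abs(pt[0] - x))[1]

def k_fold_accuracy(data, k=5, seed=0):
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]          # k roughly equal subsets
    accuracies = []
    for i in range(k):
        test = folds[i]                             # held-out ith subset
        train = [pt for j, fold in enumerate(folds) if j != i for pt in fold]
        correct = sum(1 for x, y in test if nearest_neighbor_predict(train, x) == y)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k                      # average over the k folds

# Toy data: feature value below 5 -> class 0, otherwise class 1.
data = [(x / 10, int(x >= 50)) for x in range(100)]
print(k_fold_accuracy(data))
```

Because every case is held out exactly once, the averaged accuracy estimates performance on data the procedure did not see during training.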
While cross-validation is strongly recommended as an evaluation tool, it has two limitations. First, as mentioned above, since a training set is typically not representative of time dynamics, cross-validation does not evaluate a supervised learning procedure’s value for future data sets. Second, cross-validation evaluates each procedure as a single entity. However, data mining procedures will be used as elements of a portfolio approach to counterterrorism, so what is desired is not how a procedure performs in isolation, but what it adds to an existing group of techniques. In that sense, a novel procedure may be much more valuable than one that closely duplicates an extremely useful methodology already implemented.
Finally, cross-validation is most readily applied to data sets in which the specified subsets have no relationships with each other. It is likely that cross-validation can also be applied to more complicated data structures, such as networks, but additional research may be needed to determine the best way to do this.
Another evaluation tool is face validity. Generally speaking, procedures that produce sensible outputs in response to given, often extreme, inputs (best-case and worst-case scenarios, for example) are said to have face validity. In addition, input data for fictitious individuals that are designed to provoke an investigation under current procedures, and that are subsequently ranked as being of high interest by a particular data mining algorithm, provide some degree of face validity for that algorithm and procedure. The same is true for fictitious inputs for cases that would be of no interest to counterterrorism analysts. Another way a data mining procedure used for counterterrorism could gain face validity is if counterterrorism analysts found its results useful in their work.
So if a data mining routine provides rankings of interest for people (or other units of analysis) that an analyst finds save time in deciding where to focus investigative attention, that outcome is a good initial indication of the algorithm’s potential value.
However, achieving face validity is a very limited form of evaluation. It does not help to tune or optimize a procedure; it is by its nature a small-sample assessment; and experts in the field might have difficulty agreeing on whether a particular approach has face validity or objectively comparing several competing techniques. Nevertheless, face validity is, at the least, a necessary hurdle for a methodology to clear prior to fielding.
The committee suggests that the results of expert judgment should be retained in some fashion and incorporated into the data mining procedures in use over time so that their subsequent use reflects this input. This can be done in several ways, but the basic idea is that cases that experts view differently from the data mining procedure—for example, a person clearly of interest who receives a low ranking by the data mining procedure—should result in modifications to the procedure to avoid repeating that error in the future. To support this, not only should experts examine cases identified as of interest to discover false positives, but also a sample of those identified as not of interest should be reviewed in order to have some possibility, admittedly remote, of discovering false negatives. The evaluation and improvement of data mining procedures for counterterrorism needs to be an iterative process.
Finally, one could use face validity as a method for evaluating competing algorithms. One could conduct an experiment in which investigators are given leads from two competing data mining algorithms, denoted A and B to blind the comparison. At the end of the experiment, the experts involved in the experiment could be asked whether they preferred the leads from A or B.
Gaming and Countermeasures
Another topic that needs to be considered in evaluating data mining procedures for use in counterterrorism is the extent to which these procedures can be gamed. That is, if adversaries have some general knowledge of the procedures being used, can they adjust their behavior to reduce the effectiveness of the data mining technique (or to defeat the algorithm entirely)?
Of course, specific knowledge of the precise procedures (and the specific parameter values) being used would be enormously valuable, although nearly impossible to obtain. What is more likely is that there would be a general understanding of what is being carried out. Certainly, there would be advantages to the typical actions those engaged in illegal activities already take to mask their identities, such as the use of false identifications and aliases, frequent changes of residences, etc.
However, our broad expectation is that some of the patterns that data mining would focus on would be difficult to mask. Therefore, while some gaming of the routines used would be effective, having a sufficiently diverse portfolio of algorithms might, over time, provide alternate avenues toward the discovery of terrorists engaged in many different kinds of terrorist activities. A general statement is that the question is not whether a procedure can be gamed, but how easily it can be gamed relative to competing procedures, what the impact of gaming is on its effectiveness, and how the opportunities for gaming can be reduced. Keeping the procedures and the input data sources secret reduces the opportunity for gaming, though at the same time it runs counter to the public’s right to know what the government is doing that may compromise personal privacy and other rights. Finding the appropriate middle ground is difficult.
The issue of how an adversary might take countermeasures against any data mining system or data collection effort raises an important policy question regarding the costs and benefits of greater transparency into these systems and efforts (i.e., more public knowledge about the nature of the data being collected and how the systems work). As noted above, costs could include an increased risk of adversary circumvention of these systems and efforts, and perhaps also strong negative reactions from citizens attempting to stop the loss of privacy and confidentiality. However, greater transparency is also likely to result in increased trust in government and some relief that the threat of terrorism is possibly being reduced.
Reducing Bias in Evaluation
A central issue in research and development is evaluation. It is easy to propose techniques, and vendors and other interested parties propose purportedly new techniques all the time. Government agencies such as the U.S. Department of Homeland Security (DHS) and the National Security Agency (NSA) will acquire data mining algorithms for use in counterterrorism in two ways: from outside developers (contractors) and from algorithm developers within the agency. But both internal and external developers are likely to have biases and vested interests in the outcome of any evaluation they conduct of an algorithm they have developed. Thus, before deployment and operational use, such techniques must be as carefully and comprehensively evaluated as possible using the best available evaluation techniques and methods. Many such techniques and methods are used in sophisticated commercial applications.
For these reasons, independent checks on the evaluation work of developers are necessary to minimize the possibility of bias, regardless of whether proprietary claims are asserted. Thus, those conducting the checks should have as much information as necessary to conduct the reviews involved (e.g., full access to descriptions of the algorithms, results of previous evaluations, and descriptions of adjustments that have been made in response to earlier evaluations) and work as independently as possible from the developers.
Evaluators can also build on the foundations provided by preliminary or internal evaluations, since a great deal can be learned about a system from its performance throughout development. Developers often view these evaluations as proprietary, so if DHS or NSA is at the early stage of requesting proposals for the development of such techniques, the sharing of such information must be specified in the contract before work begins.
Finally, it is also important to subject work in this area to peer evaluation to the extent possible, consistent with the need to protect classified information. Engaging the best available talent and expertise and soliciting their contributions as input to the decision-making process are important. Such expertise is generally needed to make critical judgments about vendor claims concerning new technological solutions, and it is essential to deploying effective solutions to security problems. Possible mechanisms to support such contributions include interagency professional agreements, sabbatical arrangements for academics, consulting agreements, and external advisory groups.
EXPERT JUDGMENT AND ITS ROLE IN DATA MINING
The importance of responsible expert judgment in various aspects of data mining, from research and development to field deployment, cannot be overstated. Expert judgment (of individuals with different backgrounds and experiences) is critical both in operations and in development.
From an operational standpoint, human beings are required to interpret the results of a data mining application. As noted above, data mining generally does not identify cases of interest. Instead, data mining
rank-orders cases from those of no interest to those of great interest. But it is a matter of human judgment to set thresholds (e.g., cases above one specified line are of interest, and those below a different specified line are not) and to determine exceptions (e.g., closer examination of person X, who ranked above the threshold, indicates that he is in fact not of interest). That is, human experts must decide, probably on an individual basis, which cases are worthy of further investigation or other action. Therefore, there is a need to consider the operator and the data mining algorithms as a sociotechnical system, as well as a need to determine how operators and the data mining technology can best work together.
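The two-line thresholding just described can be sketched as a simple triage over a rank-ordering; the threshold values and case scores are illustrative assumptions.

```python
# Sketch of human-set thresholds over a data mining rank-ordering: cases
# above one line are of interest, cases below a second line are not, and
# the band in between goes to an analyst for individual judgment.

def triage(ranked_scores, interest_line=0.8, dismiss_line=0.3):
    of_interest, needs_review, not_of_interest = [], [], []
    for case_id, score in ranked_scores:
        if score >= interest_line:
            of_interest.append(case_id)
        elif score < dismiss_line:
            not_of_interest.append(case_id)
        else:
            needs_review.append(case_id)       # human judgment required
    return of_interest, needs_review, not_of_interest

scores = [("A", 0.95), ("B", 0.55), ("C", 0.10), ("D", 0.85)]
print(triage(scores))
```

The middle band is where the sociotechnical design matters most: its width determines how much of the workload falls on human experts rather than the algorithm.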
As an example of a sociotechnical issue, consider a frequently held belief in the infallibility of a computer. Although in principle a human expert may be required to validate and check a computer’s conclusions or rank orderings, in practice it is all too easy for the human—especially a young and inexperienced one—to play it safe by accepting at face value a machine-generated conclusion. Procedures and incentives must be developed to shape the human’s behavior so that she or he is neither too trusting nor too skeptical of the computer’s output.
From a development standpoint, human judgment and expertise play critical roles in shaping how a given system works. In addition to the above-mentioned role for experts in deciding which cases should be further investigated, expert assessments also have other important roles:
Deciding which variables are discriminating and which values of those variables indicate whether a given case is of interest. For example, a variable (“item purchased”) and an amount may be associated with a credit card transaction; some purchases and amounts should be indicated as being of interest and some not, and this is probably best determined by experts. Additional work is needed to determine which input data sets contain potentially relevant information.
Deciding on criteria to separate anomalous patterns (i.e., patterns that are unusual in some sense) into those that are and are not potentially threatening and indicative of terrorist activity.
Deciding on the specific form of the algorithm that is evaluated for use. (For example, should one use a transformed or untransformed version of a predictor in a logistic regression model?)
Improving the robustness of data mining routines against gaming and against steps taken to “fly under the radar.” For example, a routine may be adjusted to treat many small purchases made over an extended period of time as effectively equivalent to a single large purchase.
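The small-purchase adjustment mentioned in the last item can be sketched as a rolling-window aggregation; the window length and dollar threshold are illustrative assumptions, not parameters of any real system.

```python
# Sketch of one anti-gaming adjustment: many small purchases within a
# rolling window are treated as equivalent to a single large purchase.

def flags_structured_buying(purchases, window_days=30, threshold=10_000):
    """purchases: list of (day, amount) pairs. Returns True if any rolling
    window of `window_days` accumulates at least `threshold`."""
    purchases = sorted(purchases)
    for i, (start_day, _) in enumerate(purchases):
        total = sum(amt for day, amt in purchases[i:]
                    if day < start_day + window_days)
        if total >= threshold:
            return True
    return False

# Twenty purchases of 600 over twenty days aggregate past the threshold.
many_small = [(day, 600) for day in range(20)]
print(flags_structured_buying(many_small))
```

An adversary spacing the same purchases beyond the window would evade this particular rule, which is exactly the kind of gaming-versus-robustness trade-off experts must tune.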
These multiple and significant roles for expert judgment remain even with the best of data mining technologies. Over time, it may be that more of this expertise can be represented in the portfolio of techniques used in an automated way, but there will always be substantial deficiencies that will require expert oversight to address.
ISSUES CONCERNING THE DATA AVAILABLE FOR USE WITH DATA MINING AND THE IMPLICATIONS FOR COUNTERTERRORISM AND PRIVACY
It is generally the case that the effectiveness of a data mining algorithm is much more dependent on the predictive power of the data collected for use than on the precise form of the algorithm. For example, it typically does not matter that much, in discriminating between two populations, whether one uses logistic regression, a classification tree, a neural net, a support vector machine, or discriminant analysis. Priority should therefore be given to obtaining data of sufficient quality and in sufficient quantity to have predictive value in the fight against terrorism.
The first step is to ensure that the data are of high quality, especially when they are to be linked. Data derived from record linkages tend to inherit the worst accuracy of the original data sets rather than the best. Inaccurate data, regardless of quantity, will not produce good or useful results in this counterterrorism context.
A second step is to ensure that the amount of data is adequate, although as a general rule, the collection of more data on people’s activities, movements, communications, financial dealings, and so on creates greater opportunities for a loss of privacy and the misuse of the information. Portions of the committee’s framework provide best practices to minimize the damage done to privacy when information is collected on individuals, but ultimately a policy is still needed that specifies how much additional data should be collected in pursuit of better results.
Insight into the specifics of the trade-off can be obtained through the use of synthetic data for the population at large (i.e., the haystack within which terrorist needles are hiding) without compromising privacy. At the outset, researchers would use as much synthetic data as they were able to generate in order to assess the effectiveness of a given data mining technique. Then, by removing databases one by one from the scope of the analysis, they would be able to determine the magnitude of the negative impact of such removal. With this analysis in hand, policy makers would have a basis on which to make decisions about the trade-off between accuracy and privacy.
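The database-removal analysis just described can be sketched as an ablation loop. The sources, the accuracy function, and all numbers below are hypothetical stand-ins; a real study would run the actual data mining procedure on synthetic populations with and without each source.

```python
# Sketch of the database-removal trade-off analysis: evaluate a detection
# procedure with all data sources, then drop one source at a time and
# record the loss in accuracy. Everything here is a hypothetical stand-in.

def accuracy_with_sources(sources):
    """Hypothetical evaluation: each source contributes some predictive
    power, with diminishing returns and a cap on achievable accuracy."""
    contribution = {"travel": 0.20, "financial": 0.30, "communications": 0.25}
    base = 0.50                                 # accuracy with no sources
    gain = sum(contribution[s] for s in sources)
    return min(base + gain * 0.8, 0.99)         # diminishing, capped

all_sources = ["travel", "financial", "communications"]
baseline = accuracy_with_sources(all_sources)
for removed in all_sources:
    remaining = [s for s in all_sources if s != removed]
    drop = baseline - accuracy_with_sources(remaining)
    print(f"removing {removed:14s} costs {drop:.3f} accuracy")
```

The per-source accuracy drops are the quantities a policy maker would weigh against each source's privacy cost.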
DATA MINING COMPONENTS IN AN INFORMATION-BASED COUNTERTERRORIST SYSTEM
It is too limiting a perspective to view data mining algorithms only as stand-alone procedures rather than as potential components of a data-supported counterterrorist system. Consider, for example, that data mining techniques have played an essential role as components of the algorithms underlying an Internet search engine.
A search engine, at the user level, is not a data mining system, but instead a database with a natural query language. However, the component processes of populating this database, ranking the results, and making the query language more robust are all carried out through the essential use of data mining algorithms. These component processes include (1) spell correction, (2) demoting Web sites that are trying various techniques to inflate their “page rank,” (3) identifying Web sites with duplicate content, (4) clustering web pages by concept or similarity of central topic, (5) modifying ranking functions based on the history of users’ click sequences, and (6) indexing images and video.
Without these and other features, implemented partly in response to efforts to game search engines, search results would be nearly useless compared with their current value. As these features have been added over the years, they have increased the value of search engines enormously over their initial implementations, and today search engines are an indispensable part of an individual’s online experience.
In a somewhat similar way, one can imagine a search engine, in a general sense of the term, that was designed and optimized for counterterrorist applications. Such a system could, among other things: (a) generalize/specialize the detection of aliases and/or address the ambiguity in foreign names, (b) combine all records concerning a given individual and his or her network of associates, (c) cluster related events by certain patterns of interest and other topics (such as the acquisition of materials and expertise useful for the development of explosives, toxins, and biological agents), (d) log all investigations into an individual’s activity history and develop ratings of people as to their degree of interest, and (e) index audio/images/video from surveillance monitors.
All of these are typical data mining applications that do not depend on the existence of training data, and they would seem to be critical components in any counterterrorism system that is designed to collect, organize, and make available for query information on individuals and other units of interest for possibly further data collection, investigation, and analysis. Therefore, data mining might provide many component processes of what would ideally be a large counterterrorism system, with human analysts and investigators playing an essential role alongside specific data mining tools.
Over time, as more data are acquired and different sources of data are found to be more or less useful, as attempts at gaming are continuously monitored and addressed, and as various additional unforeseen complexities arise and are addressed, a system could conceivably be developed that provides substantial assistance in reducing the risk from terrorism. Few of the necessary components of this idealized system currently exist, so this is not something that could be implemented quickly. However, in the committee’s view, the threat from terrorism is very likely to persist, and the committee therefore supports a well-funded research and development program with the goal of examining the potential effectiveness of such a system.
It is important to point out that each of the above component applications is quite nontrivial. For example, component (b), combining all records concerning a given individual and his or her network of associates, would be extremely complicated to develop in a way that is easy to access and use.
And it is useful to point out that when viewing data mining applications as part of a system, their role and therefore their evaluation changes. For example, consider a data mining algorithm that was extremely good at identifying patterns of behavior that are not indicative of terrorist activity but was not nearly as effective at identifying patterns that are. Such a component process could be useful as a filter, reducing the workload of investigators, and thereby freeing up resources to devote to a smaller group of individuals of potential interest. This algorithm would fail as a stand-alone tool, but as part of a system, it might perform a useful function.
Development of such a system would certainly be extremely challenging, and success in reducing the threat from terrorism would be a significant achievement. Therefore, research and development of such an approach requires the direct involvement of data mining experts of the first rank. What is needed is not simply the modification of commercial off-the-shelf techniques developed for various business applications, but a dedicated collaborative research effort involving both data miners and intelligence analysts with the goal of developing what are currently nonexistent techniques and tools.
Another class of data mining techniques, referred to as “information fusion,” might be useful in counterterrorism. Information fusion refers to a class of methods for combining information from disparate sources in order to make inferences that may not be possible from a single source. One possible, more limited application to counterterrorism is matching
people using a variety of sources of information, including address, name, and date of birth, as well as fingerprints, retinal scans, and other biometric information. A broader application of information fusion is identifying patterns that are jointly indicative of terrorist activity.
With respect to the narrower application of person matching, there are different ways of aggregating information to measure the degree to which the personal information matches. One can develop (a) distance metrics using sums of distances using the measured quantities themselves, (b) sums of measures of the assessment of the degree of match for each characteristic, and (c) voting rules that aggregate over whether or not there is a match for each characteristic. There may be advantages in different applications to combining information at different levels of the decision process. (A common approach to joining information at level (a) is through use of the Fellegi-Sunter algorithm.) The committee thinks that information fusion might prove helpful in this limited application. However, the problems mentioned above concerning the difficulties of record linkage will greatly reduce the effectiveness of many information fusion algorithms that are used to assist in person matching.
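The general flavor of weight-based person matching in the spirit of the Fellegi-Sunter approach can be sketched as follows. For each field one compares the probability of agreement given a true match (the m-probability) against the probability of agreement by chance among non-matches (the u-probability); the per-field log-likelihood ratios are then summed. All probabilities below are invented for illustration:

```python
import math

# Illustrative Fellegi-Sunter-style match scoring. The fields and the
# m- and u-probabilities are invented numbers, not estimates from data.
FIELDS = {
    # field: (m_prob = P(agree | match), u_prob = P(agree | non-match))
    "name":          (0.95, 0.01),
    "date_of_birth": (0.97, 0.003),
    "address":       (0.80, 0.02),
}

def match_weight(rec_a, rec_b):
    """Sum per-field log-likelihood ratios; higher means more likely a match."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log(m / u)              # agreement weight (positive)
        else:
            total += math.log((1 - m) / (1 - u))  # disagreement weight (negative)
    return total

a = {"name": "J. Smith", "date_of_birth": "1970-03-02", "address": "12 Elm St"}
b = {"name": "J. Smith", "date_of_birth": "1970-03-02", "address": "99 Oak Ave"}
print(round(match_weight(a, b), 2))
```

In practice the total weight is compared against upper and lower thresholds to declare a match, a non-match, or a case for clerical review; dirty data of the kind discussed above corrupts the field comparisons on which the whole calculation rests.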
Regarding the broader application, consider the problem of identifying whether there is a terrorist threat from the following disparate sources of information: recent meetings of known terrorists, greater than usual movement of funds from countries known to harbor terrorists, and greater than usual purchases of explosives in the United States. Information fusion uses such techniques as the Kalman filter and Bayesian networks to learn how to optimally join disparate pieces of information at different levels of the decision process, by either combining individual data elements or combining higher level assessments for the decision at hand, in order to make improved decisions in comparison to more informal use of the disparate information.
Clearly, information fusion directly addresses an obvious need that arises repeatedly in the attempt to use various data sources and types of data for counterterrorism. Intelligence agencies will have surveillance photographs, information on monetary transactions, information on the purchase of dangerous materials, communications of people with suspected terrorists, movements of suspected people into and out of the country, and so on, all of which will need to be combined in some way to make decisions as to whether to initiate further and more intrusive investigations.
To proceed, information fusion for these broader applications typically requires estimates of a number of parameters, such as conditional probabilities, that model how to link the evidence received at various levels of the decision process to the phenomenon of interest. An example might be the probability that a terrorist act is planned in country B in the next three months, given a monetary movement of more than X dollars
from a bank in country A to one in country B in the last six months, the purchase in the last two months of more than the usual amounts of explosives of a certain type, and greater than usual air travel in the last two months by individuals from country A to country B. Clearly, a conditional probability like this would be enormously useful to have, but how could one estimate it? It is possible that this conditional probability could be expressed as an arithmetic function of simpler conditional probabilities under some conditional independence assumptions, but then there is the problem of validating the assumptions that link those more primitive conditional probabilities to the desired one.
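A minimal numerical sketch of this kind of decomposition follows. The prior and the per-indicator likelihoods are invented numbers, and the conditional-independence assumption is exactly the kind of assumption whose validation the text identifies as the hard problem; the point here is only the mechanics of the combination:

```python
# Naive-Bayes-style evidence fusion under an (unvalidated) assumption that
# the indicators are conditionally independent given the threat state.
# All probabilities below are invented for illustration.

PRIOR = 0.001  # assumed prior probability that an attack is being planned

# indicator: (P(observed | threat), P(observed | no threat))
LIKELIHOODS = {
    "meeting_of_known_terrorists": (0.6, 0.01),
    "unusual_fund_movement":       (0.5, 0.05),
    "unusual_explosive_purchases": (0.4, 0.02),
}

def posterior(observed):
    """Posterior probability of a threat given the observed indicators."""
    p_threat, p_clear = PRIOR, 1.0 - PRIOR
    for ind in observed:
        p_given_threat, p_given_clear = LIKELIHOODS[ind]
        p_threat *= p_given_threat
        p_clear *= p_given_clear
    return p_threat / (p_threat + p_clear)

print(round(posterior(["meeting_of_known_terrorists"]), 4))
print(round(posterior(list(LIKELIHOODS)), 4))
```

Each additional indicator multiplies the evidence in, so the posterior climbs steeply; but the output is only as trustworthy as the invented likelihoods and the independence assumption behind them.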
More fundamentally, information fusion for the broader problem of counterterrorism requires a structure that expresses the forms in which information is received and how it should be combined. At this time, especially given the great infrequency of terrorist events, it will be extremely difficult to validate either the above assumptions or the overall structure proposed for use. Therefore, while information fusion is likely to be useful for some limited problems, it does not currently seem likely to be productive for the broad problem of identifying people and events of interest.
AN OPERATIONAL NOTE
The success of any data mining enterprise depends on the availability of relevant data in the universe of data being mined and the ability of the data mining algorithms being used to identify patterns of interest.
In the first instance (availability of data), the operational security skills of the would-be terrorists are the determining factor in whether the data are informative. For terrorists planning high-end attacks (e.g., nuclear explosions involving tens or hundreds of thousands of deaths), the means and planning needed to carry out a successful attack are complex indeed. On one hand, almost by definition, a terrorist group that could carry out such an attack would have a considerable level of sophistication, and it would take great care to minimize its database tracks. Thus, for attacks at the high end, those intending to carry them out may be better able to reduce the evidence of their activities. On the other hand, the complicated planning necessary for these attacks might provide greater opportunity for data mining to succeed. The trade-off in this case is difficult to evaluate.
In the second instance, regarding the identification of patterns of interest against a noisy background, the primary issue is that the means to carry out small-scale terrorist attacks (e.g., attacks that might result in a few to a few dozen deaths) are easily available. For example, the 2007 Virginia Tech shooter, though not a terrorist, killed a few dozen individuals with guns purchased over the counter at a gun store.
An Illustrative Compromise in Operational Security from a Terrorist Perspective
A conversation between a U.S. person and an unknown individual in Pakistan is intercepted. The call was initiated in the Detroit area from a pay phone using a prepaid phone card. The conversation was conducted in the Arabic language. The initiator is informing the recipient of the upcoming “marriage” of the initiator’s brother in a few weeks. The initiator makes reference to the “marriage” of the “dead infidel” some years ago and says this “marriage” will be “similar but bigger.” The recipient cautions the initiator about talking on the telephone and terminates the call abruptly.
The intelligence analyst’s interpretation of this conversation is that “marriage” is open code for martyrdom. Interrogation of another source indicates that the association of “marriage” and “dead infidel” is a reference to the Oklahoma City bombing. It is the analyst’s assessment that a major ANFO or ANNM attack on the continental United States is imminent. Red team analysis concludes that large quantities of ammonium nitrate can be untraceably acquired by making cash purchases that are geographically and temporally distributed.
A “tip” such as this phone conversation might well trigger a major ad hoc data mining exercise through previously unsearched databases, such as those of home improvement and gardening suppliers.
Moreover, the planning needed to carry out such an attack is fairly minimal, especially if the terrorist is willing to die. Thus, those intending to carry out relatively small-scale attacks might in principle leave a relevant database track, but the difficult (and for practical purposes, probably insoluble) problem would be the ability to identify that track and infer terrorist actions against a much larger background of innocuous activity.
For practical purposes, then, data mining tools may be most useful against the intermediate scale of terrorist attack (say, car or truck bombs using conventional explosives that might cause many tens or hundreds of deaths). Moreover, as a practical matter, terrorists must face the possibility of unknown leakages—telltale signs that a terrorist group may not know it is leaving, or human intelligence tips that cue counterterrorism authorities about what to look for (Box H.2)—and the likelihood of such leakages can be increased by a comprehensive effort that aggressively seeks relevant intelligence information from all sources. This point further underscores the importance of seeing data mining as one element of a comprehensive counterterrorist effort.
ASSESSMENT OF DATA MINING FOR COUNTERTERRORISM
Past successes in applying data mining techniques in many diverse domains have interested various government agencies in exploring the extent to which data mining could play a useful role in counterterrorism. On one hand, this track record alone is not an unreasonable basis for interest in exploring, through research and development, the potential applicability of data mining for this purpose. On the other hand, the operational differences between the counterterrorism application and other domains in which data mining has proven its value are significant, and the intellectual burden that researchers must surmount in order to demonstrate the utility of data mining for counterterrorism is high.
As an illustration of these differences, consider first the use of data mining for credit scoring. Credit scoring, as described in Hand and in Lambert,10 makes use of the history of financial transactions, current debts, income, and accumulated wealth for a given individual, as well as for similar individuals, to develop models of the behavior of people who are likely to default on a loan and of those who are not. Such histories are extensive and have been collected for many years.
Training sets are developed that contain the above information on people who were approved for loans and later paid in full, as well as on those who were approved and later defaulted. Training sets are sometimes augmented by data on a sample of those who would not normally have been approved for a loan but were granted one nonetheless, and on whether or not they later defaulted. Training sets in this application can be used to develop highly predictive models that discriminate well between those for whom granting additional credit would be a good decision on the part of the credit-granting institution and those for whom it would be a bad one.
The utility of training sets in this application benefits from the relative frequency of loan defaults. While there is great interest in reducing the number of bad loans to the extent possible, missing a small percentage of them is not a catastrophe: false negatives are to be avoided, but a few bad loans are acceptable. While there is a substantial effort to game the process of awarding credit, it has been possible to discover ways to adjust the models used to discriminate between good and bad loan applications so that they retain their utility. Finally, while applications for credit from those new to the database are problematic, it has also been possible to develop models for initial loan applicants that handle those without a credit history.11
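As a toy illustration of the supervised-learning setting that credit scoring enjoys, the following sketch fits a simple logistic model to synthetic data. The two features (debt-to-income ratio and prior delinquencies), the labeling rule, and all parameters are invented; real scorecards use far more variables and data:

```python
import math
import random

# Toy credit-scoring sketch: generate synthetic labeled examples, then fit
# a logistic regression by batch gradient descent. Everything is invented.
random.seed(0)

def make_example():
    dti = random.random()        # debt-to-income ratio in [0, 1)
    late = random.randint(0, 5)  # number of prior delinquencies
    # Synthetic "truth": high DTI plus many delinquencies means default.
    defaulted = 1 if dti + 0.2 * late > 1.0 else 0
    return (dti, late), defaulted

train = [make_example() for _ in range(500)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(2000):  # batch gradient descent on the logistic loss
    gw, gb = [0.0, 0.0], 0.0
    for x, y in train:
        err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
        gw[0] += err * x[0]
        gw[1] += err * x[1]
        gb += err
    n = len(train)
    w[0] -= lr * gw[0] / n
    w[1] -= lr * gw[1] / n
    b -= lr * gb / n

def predict(x):
    """Predicted probability of default for a feature pair."""
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

correct = sum((predict(x) > 0.5) == bool(y) for x, y in train)
print(f"training accuracy: {correct / len(train):.2f}")
```

The model learns a good boundary here precisely because labeled outcomes (repaid versus defaulted) are plentiful; it is exactly this ingredient that the counterterrorism problem lacks.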
By contrast, consider the problem of implementing a “no-fly” list. Although the details of actual programs remain secret, enough is known in the public domain to identify key differences between this problem and that of credit scoring. Some data on behavior relevant to potential terrorist activity (or, more likely, past activity) are available, but they are very incomplete, and the predictive power of the data collected and of the patterns viewed as being related to terrorist activity is quite low. (For example, having a name similar to that of a person on a terrorist watch list is itself cause for suspicion and additional screening.) Labeled training sets for supervised learning methods cannot be developed because the number of people who have attempted to initiate attacks on aircraft or other terrorist activity is extremely small. Furthermore, gaming—for example, the use of aliases and false documentation, including passports—is difficult to adjust to. Finally, as in credit scoring, there is a need for a process to deal with individuals for whom no data are available, but in this application there seems to be much less value in “borrowing information” from other people.
Given these differences, it is not surprising that the base technologies in each example have compiled vastly different track records: data mining for credit scoring is widely acknowledged as an extremely successful application of data mining, while the various no-fly programs (e.g., CAPPS II) have been severely criticized for their high rate of false positives.12 Box H.3 describes the largely unsuccessful German experience with counterterrorist profiling based on personal characteristics and backgrounds.
At a minimum, subject-based data mining (Section H.3) is clearly relevant and useful. This type of data mining—for example, structured searches for identifying those in regular contact with known terrorists
The German Experience with Profiling
In the aftermath of the September 11, 2001, terrorist attacks on the United States, German law enforcement authorities sought to explore the possibilities of using large-scale statistical profiling of entire sectors of the population with the purpose of identifying potential terrorists. An initial profile was developed, largely based on the social characteristics of the known perpetrators of 9/11 (male, 18-40 years old, current or former student, Islamic, legal resident in Germany, and originating from one of a list of 26 Muslim countries). This profile was scanned against the registers of residents’ registration offices, universities, and the Central Foreigners’ Register to identify individuals matching the defined profile—an exercise that resulted in approximately 32,000 entries.
Individuals in this database were then checked against another database of about 4 million individuals identified as possibly having the knowledge relevant to carrying out a terrorist attack, or as having familiarity with places that could constitute possible terrorist targets. This included, for example, individuals with a pilot’s license (or attending a course to obtain one), members of sporting aviation associations, employees of airports, nuclear power plants, chemical plants, the rail service, laboratories and other research institutes, and students of the German language at the Goethe Institutes.
The comparison of these two databases yielded 1,689 individuals as potential “sleepers.” These individuals were investigated at greater length by the German police, but after one year not one sleeper had been identified. Seven individuals suspected of being members of a terrorist cell in Hamburg were arrested, but they did not fit the statistical profile.
In the entire profiling exercise, data were collected and analyzed on about 8.3 million individuals—with a null result to show for it. The exercise was terminated after about 18 months (in summer 2003) and the databases deleted. (In April 2006, the German Federal Constitutional Court declared the then-terminated exercise unconstitutional.)
SOURCE: Adapted from Giovanni Capoccia, “Institutional Change and Constitutional Tradition: Responses to 9/11 in Germany,” in Martha Crenshaw (ed.), The Consequences of Counterterrorist Policies in Democracies, New York, Russell Sage, forthcoming.
or identifying those, possibly as part of a group, who are collecting large quantities of toxins, biological agents, explosive material, or military equipment—might well identify individuals of interest who warrant further investigation, especially if their professional and personal lives indicate that they have no need for such material. (Such searches could also produce a large number of false positives that would require human judgment to resolve.) Such searches are within the purview of law enforcement and intelligence analysts today, and it would be surprising if
such searches were not being conducted today as extensions of standard investigative techniques.
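A minimal sketch of one such structured search, over an entirely hypothetical association graph, is a bounded breadth-first search that returns everyone within a fixed number of hops of a known subject:

```python
from collections import deque

# Hypothetical association graph: each key lists the individuals directly
# associated with it. Names and links are invented for illustration.
GRAPH = {
    "subject_x": ["a", "b"],
    "a": ["subject_x", "c"],
    "b": ["subject_x"],
    "c": ["a", "d"],
    "d": ["c"],
}

def within_hops(graph, start, max_hops):
    """Breadth-first search: nodes reachable from start in <= max_hops edges."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen[nbr] = seen[node] + 1
                queue.append(nbr)
    seen.pop(start)  # exclude the known subject from the result
    return sorted(seen)

print(within_hops(GRAPH, "subject_x", 2))  # ['a', 'b', 'c']
```

Even this trivial query illustrates the false positive issue: most second-hop contacts of a subject will be entirely innocent, so the output is a pool for human review, not a list of suspects.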
These approaches have been criticized because they are relevant primarily to future events that have a nontrivial similarity to past events, thus providing little leverage in anticipating terrorist activities that are qualitatively different from those carried out in the past. But even if this criticism is valid (and only research and experience will provide such indications), there is definite and important benefit in being able to reduce the risk from known forms of terrorist activity. Forcing terrorists to use new approaches implies new training regimes, new operational difficulties, and new resource requirements—all of which complicate their own planning and reduce the likelihood of successful execution.
The jury is still out on whether pattern-based data mining algorithms produced without the benefits of machine learning will be similarly useful, and in particular whether such techniques could discover subtle, novel patterns of behavior, indicative of the planning of a terrorist event, that intelligence analysts would not have recognized a priori as such. Jonas and Harper (2006) refer to this kind of data mining as “pattern-based” data mining.13 The distinction between subject-based and pattern-based data mining is important. Subject-based data mining is focused on terrorist activities that are either precedented (because analysts have some retrospective understanding of them) or anticipated (because analysts have some basis for understanding the precursors to such activities), while pattern-based data mining is focused on future terrorist activities that are unanticipated and unprecedented (that is, activities that analysts are not able to predict or anticipate).
Subject-based techniques have the advantage of being based on strongly predictive models. For example, being a close associate of someone suspected of terrorist activity and having similar connections to persons or groups of interest are strong predictors that a given person will also be of interest for further investigation. By contrast, pattern-based techniques, in the absence of a training set, are likely to have substantially less predictive power than the subject-based patterns chosen by counterintelligence experts based on their experience—and consequently a very large false positive rate. (Indeed, one might expect such an outcome, since pattern-based techniques, by definition, seek to discover anomalous patterns that are not a priori associated with terrorist activity and therefore have no historical precedents to support them. Pattern-based techniques
are also, at their roots, tools for identifying correlations, and as such they do not provide insight into why a particular pattern may arise.)
Jonas and Harper (2006) identify three factors that are likely to have a bearing on the utility of data mining for counterterrorist purposes:
•	The ability to identify subtle and complex data patterns indicating likely terrorist activity,
•	The construction of training sets that facilitate the discovery of indicative patterns not previously recognized by intelligence analysts, and
•	The high false positive rates that are likely to result from the problems in the first two bullets.
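The arithmetic behind the concern about false positive rates is a base-rate problem, and it can be illustrated with invented numbers. Even a classifier with a very low per-person false positive rate overwhelms the handful of true targets when the screened population is enormous:

```python
# Back-of-the-envelope base-rate calculation. All numbers are invented
# for illustration; nothing here reflects an actual screening program.

population = 300_000_000      # people screened
true_targets = 3_000          # assumed number of actual persons of interest
sensitivity = 0.99            # P(flagged | target)
false_positive_rate = 0.001   # P(flagged | non-target)

true_alarms = sensitivity * true_targets
false_alarms = false_positive_rate * (population - true_targets)
precision = true_alarms / (true_alarms + false_alarms)

print(f"total flagged: {true_alarms + false_alarms:,.0f}")
print(f"share of flags that are real targets: {precision:.4f}")
```

With these assumptions roughly 300,000 people are flagged, of whom only about one percent are actual targets; every false alarm must then be resolved by a human investigator.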
A number of approaches might be taken to address these concerns. For example, as mentioned above, it may be possible to develop training sets by broadening the definition of what patterns of behavior are of interest for further investigation, although doing so raises the false positive rate. Also, it may be possible to reduce the rate of false positives to a manageable percentage by using a judicious mix of human analysis and different automated tools; however, this is likely to be very resource intensive. The committee does not know whether there are a large number of useful behavioral profiles or patterns that are indicative of terrorist activity.
In addition to these issues, a variety of practical considerations are relevant, including the paucity of data, the often-poor quality of primary data, and errors arising from linkage between records. (Section H.2 discusses additional issues in more detail.)