Scoping the Issue: Terrorism, Privacy, and Technology
THE NATURE OF THE TERRORIST THREAT TO THE UNITED STATES
Since September 11, 2001, the United States has faced a real and serious threat from terrorist action. Although the primary political objectives of terrorist groups vary depending on the group (e.g., the political objectives of Al Qaeda differ from those of Aum Shinrikyo), terrorist actions throughout history have nevertheless shared certain common characteristics and objectives. First, they have targeted civilians or non-combatants for political purposes. Second, they are usually violent, send a message, and have symbolic significance. The common objectives of terrorists include seeking revenge, renown, and reaction; that is, terrorists generally seek to “pay back” those they see as repressing them or their people; to gain notoriety or social or spiritual recognition and reward; and to cause those they attack to respond with fear, an escalating spiral of violence, irrational reaction and thus self-inflicted damage (e.g., reactions that strengthen the hand of the terrorists), or capitulation. Third, terrorists often blend with the targeted population—and in particular, they can exploit the fundamental values of open societies, such as the United States, to cover and conceal their planning and execution.
Despite these commonalities, today’s terrorist threat is fundamentally different from those of the past. First, the scale of damage to which modern terrorists aspire is much larger than in the past. The terrorist acts of September 11, 2001, took thousands of lives and caused hundreds of billions of dollars in economic damage. Second, the potential terrorist use of weapons of mass destruction (e.g., nuclear weapons, biological or chemical agents) poses a threat that is qualitatively different from a threat based on firearms or chemical explosives. Third, terrorists operate in a modern environment plentiful in the amount of available information and increasingly ubiquitous in its use of information technology.
Even as terrorist ambitions and actions have increased in scale, smaller bombings and attacks are also on the rise in many corners of the world. To date, all seem to have been planned and executed by groups or networks and therefore have required some level of interaction and communication to plan and execute.
Left unaddressed, this terrorist threat will create an environment of fear and anxiety for the nation’s citizens. If people come to believe that they are infiltrated by enemies that they cannot identify and that have the power to bring death, destruction, and havoc to their lives, and that preventing that from happening is beyond the capability of their governments, then the quality of national life will be greatly depreciated as citizens refrain from fully participating in their everyday lives. That scenario would constitute a failure to “establish Justice, insure domestic Tranquility, provide for the common defense, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity,” as pledged in the Preamble to the Constitution.
To address this threat, new technologies have been created and are creating dramatic new ways to observe and identify people, keep track of their location, and perhaps even deduce things about their thoughts and behaviors. The task for policy makers now is to determine who should have access to these new data and capabilities and for what purposes they should be used. The availability of these technologies, coupled with the unprecedented nature of the threat, is likely to bring great pressure to apply them and related measures, some of which might intrude on the fundamental rights of U.S. citizens.
COUNTERTERRORISM AND PRIVACY AS AN AMERICAN VALUE
In response to the mounting terrorist threat, the United States has increased its counterterrorist efforts with the aim of enhancing the ability of the government to prevent terrorist actions before they occur. These efforts have raised concerns about the potential negative impacts of counterterrorism programs on the privacy and other civil liberties of U.S. citizens, as well as the adequacy of relevant civil liberties protections. Because terrorists blend into law-abiding society, activities aimed at detecting and countering their actions before they occur inherently raise concerns that such efforts may damage a free, democratic society through well-intentioned steps intended to protect it. One such concern is that law-abiding citizens who come to believe that their behavior is watched too closely by government agencies and powerful private institutions may be unduly inhibited from participating in the democratic process, may be inhibited from contributing fully to the social and cultural life of their communities, and may even alter their purely private and perfectly legal behavior for fear that intimate details of their lives will be revealed and used against them in some manner.
Privacy is, and should continue to be, a fundamental dimension of living in a free, democratic society. An array of laws protect “government, credit, communications, education, bank, cable, video, motor vehicle, health, telecommunications, children’s, and financial information; generally carve out exceptions for disclosure of personal information; and authorize use of warrants, subpoenas, and court orders to obtain information.”1 These laws usually create boundaries between individuals and institutions (or sometimes other individuals) that may limit what information is collected (as in the case of wiretapping or other types of surveillance) and how that information is handled (such as the fair information practices that seek care and openness in the management of personal information). They may establish rules governing the ultimate use of information (such as prohibitions on the use of certain health information for making employment decisions), access to the data by specific individuals or organizations, or aggregation of these data with other data sets. The great strength of the American ideal of privacy has been its robustness in the face of new social arrangements, new business practices, and new technologies. As surveillance technologies have expanded the technical capability of the government to intrude into personal lives, the law has sought to maintain a principled balance between the needs of law enforcement and democratic freedoms.
Public attitudes, as identified in public opinion polls, mirror this delicate balance.2 For example, public support for counterterrorism measures appears to be strongly influenced by perceptions of the terrorist threat, an assessment of government effectiveness in dealing with terrorism, and perceptions as to how these measures are affecting civil liberties. Thus, one finds that since 9/11, public opinion surveys reflect a diminishing acceptance of government surveillance measures, with people less willing to cede privacy and other civil liberties in the course of increased terrorism investigation and personally less willing to give up their freedoms and more pessimistic about protection of the right to privacy. Yet recent events, such as the London Underground bombings of July 2005 and reports in August 2006 that a major terrorist attack on transatlantic airliners had been averted, appeared to influence public attitudes; support increased for such surveillance measures as expanded camera surveillance, monitoring of chat rooms and other Internet forums, and expanded monitoring of cellular phones and e-mails. However, public attitudes toward recently revealed monitoring programs are mixed, with no clear consensus.
1 U.S. Congressional Research Service, Privacy: Total Information Awareness Programs and Related Information Access, Collection, and Protection Laws (RL31730), updated March 21, 2003, by Gina Marie Stevens.
2 See Appendix M (“Public Opinion Data on U.S. Attitudes Toward Government Counterterrorism Efforts”) for more details. Among them are two caveats about the identification of public attitudes through public opinion surveys. The first one has to do with the framing of survey questions, in terms of both wording and context, which have been shown to strongly influence the opinions elicited. The second has to do with declining response rates to national sample surveys and the inability to detect or estimate nonresponse bias.
Public opinion polls also indicate that the public tends to defend civil liberties more vigorously in the abstract than in specific situations. At the same time, people seem to be concerned less about privacy in general (i.e., for others) than about protecting the privacy of information about themselves. In addition, most people are more tolerant of surveillance when it is aimed at specific racial or ethnic groups, when it concerns activities they do not engage in, or when they are not focusing on its potential personal impact. Thus the perception of threat might explain why passenger screening and searches, both immediately after September 11, 2001, and continuing through 2006, consistently receive high levels of support while, at the same time, the possibility of personal impact reduces public support for government collection of personal information about travelers. The public is also ambivalent regarding biometric identification technologies and public health uses, such as prevention of bioterrorism and the sharing of medical information. For these, support increases with assurances of anonymity and personal benefits or when they demonstrate a high degree of reliability and are used with consent.
Legal analysts,3 even courts,4 if not the larger public, have long recognized that innovation in information and communications technologies often moves faster than the protections afforded by legislation, which is usually written without an understanding of new or emerging technologies, unanticipated terrorist tactics, or new analytical capabilities. Some of these developing technologies are described in Section 1.6 and in greater detail in Appendixes C (“Information and Information Technology”) and H (“Data Mining and Information Fusion”). The state of the law and its limitations are detailed in Appendix F (“Privacy-Related Law and Regulation: The State of the Law and Outstanding Issues”). As new technologies are brought to bear in national security and counterterrorism efforts, the challenge is no different from what has been faced in the past with respect to potential new surveillance powers: identify those new technologies that can be used effectively and establish specific rules that govern their use in accordance with basic constitutional privacy principles.5
3 For example, see R.A. Pikowsky, “The need for revisions to the law of wiretapping and interception of email,” Michigan Telecommunications & Technology Law Review 10(1), 2004.
4 U.S. Court of Appeals. (No. 00-5212; June 28, 2001), p. 10. Available at http://www.esp.org/misc/legal/USCA-DC_00-5212.pdf.
THE ROLE OF INFORMATION
Information and information technology are ubiquitous in today’s environment. Massive databases are maintained by both governments and private-sector businesses that include information about each person and about his or her activities. For example, public and private entities keep bank and credit card records; tax, health, and census records; and information about individuals’ travel, purchases, viewing habits, Web search queries, and telephone calls. Merchants record what individuals look at, the books they buy and borrow, the movies they watch, the music they listen to, the games they play, and the places they visit. Other kinds of databases include imagery, such as surveillance video, or location information, such as tracking data obtained from bar code readers or RFID (radio frequency identification) tags. Through formal and informal relationships between government and private-sector entities, much of the data available to the private sector is also available to governments.
In addition, digital devices for paying tolls, computer diagnostic equipment in car engines, and global positioning services are increasingly common on passenger vehicles. Cellular telephones and personal digital assistants record not only call and appointment information, but also location, transmitting this information to service providers. Internet service providers record online activities, digital cable and satellite systems record what individuals watch and when, and alarm systems record when people enter and leave their homes. People back up personal data files online and access online photo, e-mail, and music storage services. Global positioning technologies are appearing in more and more products, and RFID tags are beginning to be used to identify consumer goods, identification documents, pets, and even people.
Modern technology offers myriad options for communication between individuals and among small groups, including cell phones, e-mail, chat rooms, text messaging, and various forms of mass media. With voice-over-IP telephone service, digital phone calls are becoming indistinguishable from digital documents: both can be stored and accessed remotely. New sensor technologies enable the tagging and tracking of information about individuals without their permission or awareness.
As noted earlier, the terrorists of today are embedded and operate in this environment. It is not unreasonable to believe that terrorists planning an attack might leave “tracks” or “signatures” in these digital databases and networks and might make use of the communications channels available to all. Extracting terrorist tracks from nonthreat tracks might be the goal, but this is nevertheless not easy. One could imagine that aspects of a terrorist signature may be information that is not easily available or easily linked to other information or that some signatures may garner suspicion but are really not threats. However, with appropriate investigative leads, the potential increases that examining these databases, monitoring the contents of terrorist communications, and using other techniques, such as tagging and tracking, may yield valuable clues to terrorist intentions.
These possibilities have not gone unnoticed by the U.S. government, which has increased the number of and investment in counterterrorism programs that collect and analyze information to protect America from terrorism and other threats to public health and safety.6 The government collects information from many industry and government organizations, including telecommunications, electricity, transportation and shipping, law enforcement, customs agents, chemical and biological industries, finance, banking, and air transportation. The U.S. government also has the technical capability and, under some circumstances, the legal right to collect and hold information about U.S. citizens both at home and abroad. To improve the overall counterterrorism effort, the government has mandated interagency and interjurisdictional information sharing.7 In short, the substantial power of the U.S. government’s capability to collect information about individuals in the United States, as well as that of private-sector corporations and organizations, and the many ways that advancing technology is improving that capability necessitate explicit steps to protect against its misuse.
If it were possible to automatically find the digital tracks of terrorists and automatically monitor only the communications of terrorists, public policy choices in this domain would be much simpler. But it is not possible to do so. All of the data contained in databases and on networks must be analyzed to attempt to distinguish between the data associated with terrorist activities and those associated with legitimate activities. Much of the analysis can be automated, a fact that provides some degree of protection for most personal information by having data manipulated within the system and restricted from human viewing. However, at some point, the outputs need to be considered and weighed, and some data associated with innocent individuals will necessarily and inevitably be examined by a human analyst—a fact that leads to some of the privacy concerns raised above. (Other privacy concerns, largely rooted in a technical definition of privacy described below, arise from the mere fact that certain individuals are singled out for further attention, regardless of whether a human being sees the data at all.)
In conceptualizing how information is used, it is helpful to consider what might be called the information life cycle. As discussed in greater detail in Appendix C, digital information typically goes through a seven-step life cycle:
Collection. Information, whether accurate or inaccurate, is collected by some means, whether in an automated manner (e.g., financial transactions at a point of sale terminal or on the Web, call data records in a telecommunications network) or a manual manner (e.g., a Federal Bureau of Investigation (FBI) agent conducting an interview with an informant). Information may often be collected or transmitted (or both) without the subject’s awareness. In some instances, the party collecting the information may not be the end user of that information. This is especially relevant in government use of databases compiled by private parties, since laws that regulate government collection of information do not necessarily place comparable restrictions on government use of such information.
Correction. Information determined to be erroneous, whether through automated or manual means, may be discarded or corrected. Information determined to be incomplete may be augmented with additional information. Under some circumstances, the person associated with the collected information can make corrections. Information correction is not trivial, especially when large volumes of data are involved. The most efficient and practical means of correcting information may reduce uncertainties but is not likely to eliminate them, and indeed error correction may itself sometimes introduce more error.
Storage. Information is stored in data repositories—databases, data warehouses, or simple files.
Analysis and processing. Information is used or analyzed, often using query languages, business intelligence tools, or analytical techniques, such as data mining. Analysis may require access to multiple data repositories, possibly distributed across the Internet.
Dissemination and sharing. Results of information analysis and processing are published or shared with the intended customer or user community (which may consist of other analysts). Disseminated information may or may not be in a format compatible with users’ applications.
Monitoring. Information and analytical results are monitored and evaluated to ensure that technical and procedural requirements have been and are continuing to be met. Examples of important requirements include security (Are specified security levels being maintained?), authorization (Are all accesses authorized?), service level agreements (Is performance within promised levels?), and compliance with applicable government regulations.
Selective retention or deletion. Information is retained or deleted on the basis of criteria (explicit or implicit) set for the information repository by the steward or by prevailing laws, regulations, or practices. The decreasing cost of storage and the increasing belief in the potential value to be mined from previously collected data are important factors enabling the increase in typical data retention periods. The benefits of retention and enhanced predictive power have to be balanced against the costs of reduced confidentiality. Data retention policies should therefore be regularly justified through an examination of this trade-off.
As described, these steps in the information life cycle can be regarded as a notional process for the handling of information. However, in practice, one or more of these steps may be omitted, or the sequencing may be altered or iterated. For example, in some instances, it may be that data are first stored and then corrected. Or the data may be stored with no correction at all or processed without being stored, which is what firewalls do.
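The point that a concrete system may omit or reorder the notional steps can be made explicit in a small sketch. The stage names below are taken from the seven steps listed above; the `describe` helper and the two example pipelines are hypothetical illustrations, not part of the report.

```python
from enum import Enum


class Stage(Enum):
    """The seven notional life-cycle steps, in the order listed in the text."""
    COLLECTION = "collection"
    CORRECTION = "correction"
    STORAGE = "storage"
    ANALYSIS = "analysis and processing"
    DISSEMINATION = "dissemination and sharing"
    MONITORING = "monitoring"
    RETENTION = "selective retention or deletion"


NOTIONAL_ORDER = list(Stage)


def describe(pipeline):
    """Return (omitted stages, whether the remaining stages are reordered)."""
    omitted = [s for s in NOTIONAL_ORDER if s not in pipeline]
    expected = [s for s in NOTIONAL_ORDER if s in pipeline]
    # A stage may be iterated; compare only the first occurrence of each.
    first_seen = []
    for stage in pipeline:
        if stage not in first_seen:
            first_seen.append(stage)
    return omitted, first_seen != expected


# Hypothetical system that stores data first and corrects it afterward.
store_then_correct = [
    Stage.COLLECTION, Stage.STORAGE, Stage.CORRECTION, Stage.ANALYSIS,
    Stage.DISSEMINATION, Stage.MONITORING, Stage.RETENTION,
]

# Hypothetical firewall-like system: data are processed without being stored.
process_only = [Stage.COLLECTION, Stage.ANALYSIS]

print(describe(store_then_correct))
print(describe(process_only))
```

Describing a system this way makes it easy to see at a glance which safeguards (for example, correction or monitoring) a given pipeline never applies.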
Additional issues arise when information is assembled or collected from a variety of storage sources for presentation to an analysis application. Assembling such a collection generally entails linking records based on data fields, such as unique identifiers if present and available (identification numbers) or less perfect identifiers (combinations of name, address, and date of birth). The challenge of accurately linking large databases should not be underestimated. In practice, it is often the case that data may be linked with little or no control for accuracy or ability to correct errors in these fields, with the likely outcome that many records will be linked improperly and that many other records that should be linked are not linked. Without checks on the accuracy of such linkages, there is no way of understanding how errors resulting from linkage may affect the quality or provenance of the subsequent analysis.
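A minimal sketch of such linkage, under stated assumptions, shows how easily a link can be missed. The field names (`id_number`, `name`, `address`, `dob`), the normalization rule, and the sample records are all hypothetical; real linkage systems typically use more tolerant probabilistic or fuzzy matching, which this sketch does not attempt.

```python
def normalize(text):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = text.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def link_key(record):
    """Prefer a unique identifier; otherwise fall back to a composite key."""
    if record.get("id_number"):
        return ("id", record["id_number"])
    return (
        "composite",
        normalize(record["name"]),
        normalize(record["address"]),
        record["dob"],
    )


def link(left, right):
    """Pair every left record with each right record sharing its key."""
    index = {}
    for rec in right:
        index.setdefault(link_key(rec), []).append(rec)
    return [(l, r) for l in left for r in index.get(link_key(l), [])]


# Hypothetical records held by two different sources.
source_a = [
    {"name": "Jane Q. Doe", "address": "12 Elm St.", "dob": "1970-01-02", "id_number": None},
    {"name": "John Smith", "address": "4 Oak Ave", "dob": "1980-05-06", "id_number": None},
]
source_b = [
    {"name": "jane q doe", "address": "12 elm st", "dob": "1970-01-02", "id_number": None},
    {"name": "Jon Smith", "address": "4 Oak Ave", "dob": "1980-05-06", "id_number": None},
]

pairs = link(source_a, source_b)
print(len(pairs))
```

In this sketch the two "Smith" records refer to the same hypothetical person but are not linked because of a one-letter spelling difference, and a record carrying an identification number would never link to one that lacks it. Loosening the composite key (for example, dropping the address) would recover some missed links at the cost of linking different people improperly.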
Finally, different entities handle information differently because of the standards and regulations imposed on them. The types of information that can be collected, corrected, stored, disseminated, and retained and by whom, when, and for how long vary across private industries and government agencies. For example, three different kinds of agencies in the United States have some responsibility for combating terrorism: agencies in the intelligence community (IC), agencies of federal law enforcement (FLE), and agencies of state, local, and tribal law enforcement (SLTLE). The information-handling policies and practices of these different types of agencies are governed by different laws and regulations. In particular, the information collection policies and practices of SLTLE agencies require the existence of a “criminal predicate” to collect and retain information that identifies individuals and organizations; a criminal predicate refers to the possession of “reliable, fact-based information that reasonably infers that a particularly described … subject has committed, is committing or is about to commit a crime.”8 No such predicate is required for the collection of similar information by agencies in the intelligence community. Some FLE agencies (in particular, the FBI and the Drug Enforcement Administration) are also members of the intelligence community, and when (and only when) they are acting in this role, they are not required to have such predicates, either. The rules for information retention and storage are also more restrictive for SLTLE agencies than for IC agencies (or FLE agencies acting in an IC role).
ORGANIZATIONAL MODELS FOR TERRORISM AND THE INTELLIGENCE PROCESS
A variety of models exists for how terrorist groups are organized, so it is helpful to consider two ends of a spectrum of organizational practices. At one end is a command-and-control model, which also characterizes traditional military organizations and multinational corporations. In this top-down structure, the leaders of the organization are responsible for planning, and they coordinate the activities of operational cells. At the other end of the spectrum is an entrepreneurial model, in which terrorist cells form spontaneously and do their planning and execution without asking anybody’s permission or obtaining external support, although they may be loosely coordinated with respect to some overall high-level objective (such as “kill Westerners in large numbers”). In practice, terrorist groups can be found at one end or the other of this spectrum, as well as somewhere in the middle. For example, a terrorist cell might form itself spontaneously but then make contact with a central organization in order to obtain some funding and technical support (such as a visit by a bomb-making expert).
The spectrum of organizational practice is important because the nature of the organization in question is closely related to the various information flows among elements of the organization. These flows are important, because they provide opportunities for disruption and exploitation in counterterrorist efforts. Exploitation in particular is important because that is what yields information that may be relevant to anticipating an attack.
Because it originates spontaneously and organically, the decentralized terrorist group, almost by definition, is usually composed of individuals who do blend very well and easily into the society in which they are embedded. Thus, their attack planning and preparation activities are likely to be largely invisible when undertaken against the background of normal, innocent activities of the population at large. Information on such activities is much more likely to come to the attention of the authorities through tips originating in the relevant neighborhoods or communities or through observations made by local law enforcement authorities. Although such tips and observations are also received in the context of many other tips and observations, some useful and others not, the amount of winnowing necessary in this case is very much smaller than the amount required when the full panoply of normal, innocent activities constitutes the background.
By contrast, the command-and-control terrorist group potentially leaves a more consistent and easily discernible information footprint in the aggregate (although the individual elements may be small, such as a single phone call or e-mail). By definition, a top-down command structure involves regular communication among various elements (e.g., between platoon leaders and company commanders). Against the background noise, such regularities are more easily detected and understood than if the communication had no such structure. In addition, such groups typically either “light up” with increased command traffic or “go dark” prior to conducting an attack. Under these circumstances, there is greater value in a centralized analysis function that assembles the elements together into a mosaic.
Although data mining techniques are defined and discussed below in Section 1.6.1, it is important to point out here that different kinds of analytical approaches are suitable in each situation. This report focuses on two general types of data mining techniques (described further in Appendix H): subject-based and pattern-based data mining. Subject-based data mining uses an initiating individual or other datum that is considered, based on other information, to be of high interest, and the goal is to determine what other persons or financial transactions or movements, etc., are related to that initiating datum. Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity; these patterns might be regarded as small signals in a large ocean of noise.
In the case of the decentralized group, subject-based data mining is likely to augment and enhance traditional police investigations by making it possible to access larger volumes of data more quickly. Furthermore, communications networks can more easily be identified and mapped if one or a few individuals in the network are known with high confidence. By contrast, pattern-based data mining may be more useful in finding the larger information footprint that characterizes centrally organized terrorist groups.
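The subject-based approach can be illustrated with a short sketch: start from one initiating subject and follow recorded links outward for a bounded number of steps. This is only one simple way to realize the idea (a breadth-first traversal), and the link data, the subject labels, and the `related_subjects` function are hypothetical.

```python
from collections import deque


def related_subjects(links, seed, max_hops):
    """Return {subject: hop distance} for subjects within max_hops of seed."""
    # Build an undirected adjacency map from the recorded links.
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)

    del hops[seed]  # report only the related subjects, not the seed itself
    return hops


# Hypothetical communication links; "S" is the initiating subject.
links = [("S", "A"), ("A", "B"), ("B", "C"), ("D", "E")]
found = related_subjects(links, "S", max_hops=2)
print(sorted(found.items()))  # [('A', 1), ('B', 2)]
```

The hop limit is the main design choice here: each additional hop widens the set of people whose records are touched, so the limit directly controls how many individuals with no connection to the initiating subject are swept into the analysis.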
Note that there is also a role for an analytical function after an attack occurs or a planned attack is uncovered and participants captured. Under these circumstances, plausible starting points are available to begin an investigation, and this kind of analytical activity follows quite closely the postincident activities in counterespionage: who were these people, who visited them, with whom were they communicating, where did the money come from, and so on. These efforts (often known as “rolling up the network”) serve both a prosecutorial function in seeking to bring the perpetrators to justice and a prophylactic function in seeking to prevent others in the network from carrying out further terror attacks.
ACTIVITIES OF THE INTELLIGENCE COMMUNITY AND OF LAW ENFORCEMENT AGENCIES
The intelligence community is responsible for protecting U.S. national security from threats that have been defined by the executive branch. When threats are defined, further information is sought (i.e., “intelligence requirements”) to understand the status and operations of the threat, from which intervention strategies are developed to prevent or mitigate the threat. The information collection and management process for the intelligence community is driven by presidential policy.
In contrast, law enforcement agencies identify threats based on behaviors that are specifically identified as criminal (i.e., with the Fourth Amendment requirement of particularity). The law enforcement approach to the threat is based on traditional criminal investigation and case building, a problem-solving intervention, or a hybrid of these two. The law enforcement agency information collection and management process is driven by crime. The parameters and policy of law enforcement activity to deal with the threat are stipulated by constitutional law (notably the law of criminal evidence and procedure) and civil rights cases (42 USC 1983), particularly those based on consent decrees related to the intelligence process in a number of cities. Two civil cases, Handschu v. Special Services Division (NYPD) and American Friends Service Committee v. Denver, have been major forces in shaping law enforcement policy on information collection for the intelligence process, notably related to First Amendment expressive activity and the inferred right to privacy.
As a matter of U.S. public policy today, the prevention of terrorist attacks against the U.S. homeland and other U.S. interests is the primary goal of the intelligence community and of federal law enforcement agencies. Prevention of terrorist attacks is necessarily a proactive and ongoing role, and thus it is not necessarily carried out in response to any particular external event. Countercrime activities are usually focused on investigation and developing the information basis for criminal prosecution. As a practical matter, most such investigations are reactive—that is, they are initiated in response to a specific occurrence of criminal activity.
These comments are not intended to imply that there is no overlap between the counterterrorist and countercrime missions. For example, law enforcement authorities are also concerned about the prevention of crimes through the perhaps difficult-to-determine deterrent effect of postattack prosecution of terrorists and their collaborators. In addition, preparation for future criminal acts can itself be a current criminal violation under the conspiracy or attempt provisions of federal criminal law or other provisions defining preparatory crimes, such as solicitation of a crime of violence or provision of material support in preparation for a terrorist crime. The standard for opening an investigation—and thus for collecting personally identifiable information—is satisfied when there is not yet a current substantive or preparatory crime but facts or circumstances reasonably indicate that such a crime will occur in the future (i.e., when there is a valid criminal predicate).9
Although most crimes do not have a direct terrorism nexus, it is not uncommon to find that terrorists engage in criminal activities that are on the surface unrelated to terrorism. For example, a terrorist group without financial resources provided from an external source may engage in fraud. One well-known case in 2002 involved cigarette smuggling in support of Hezbollah.10
9 Information on investigations and inquiries is derived from The Attorney General’s Guidelines on General Crimes, Racketeering Enterprise and Terrorism Enterprise Investigations, Attorney General John Ashcroft, U.S. Department of Justice, Washington, D.C., May 30, 2002, available at http://www.usdoj.gov/olp/generalcrimes2.pdf.
In addition, law enforcement agencies are often unable to deploy personnel and other resources if they are not being used to further active criminal investigations, so counterterrorism investigations are often part of an “all crimes” approach—that is, law enforcement agencies focus on an overall goal of public safety and stay alert for any threats to the public safety, including but not limited to terrorism.
Finally, both criminals and terrorists (foreign or domestic) operating in the United States are likely to blend very well and easily into the society in which they are embedded. That is, ordinary criminals are likely to be similar in profile to decentralized terrorist groups that also would draw their members from the ranks of disaffected Americans (or from individuals who are already familiar with each other or trusted, such as family members). Thus, both counterterrorist and countercrime efforts are likely to depend a great deal on information originating in the relevant neighborhoods or communities or observations made by local law enforcement authorities.
TECHNOLOGIES OF INTEREST IN THIS REPORT
The counterterrorist activities of the U.S. government depend heavily on many different kinds of technology. A comprehensive assessment of all technologies relevant to these efforts would be extensive and resource-intensive, not to mention highly classified at least in part, and indeed the committee was not charged with conducting such an assessment. Instead, the charge directed the committee to focus primarily on two important technologies—data mining and behavioral and physiological surveillance—and their relationship to and impact on privacy.11
10 A North Carolina-based Hezbollah cell smuggled untaxed cigarettes into North Carolina and Michigan and used the proceeds to provide financial support to terrorists in Beirut. See D.E. Kaplan, “Homegrown terrorists: How a Hezbollah cell made millions in sleepy Charlotte, N.C.,” U.S. News and World Report, March 2, 2003, available at http://www.usnews.com/usnews/news/articles/030310/10hez.htm.
11 Despite the focus on data mining and behavioral surveillance, the committee does recognize that most of the issues related to privacy and these technologies also apply more broadly to other information technologies as they might be used for counterterrorism. Nevertheless, this is mostly a report about privacy as it relates to these two specific technologies of interest.
The focus of the committee’s charge does not negate the value of other technologies or programs that generate information relevant to the counterterrorist mission, such as technologies for tagging and tracking for identity management, or even for the admittedly controversial use of so-called “national security letters” for information gathering. (A national security letter (NSL) is a demand for information from a third party issued by the FBI or by other government agencies with authority to conduct national security investigations. No judicial approval is needed for the issuance of an NSL, and many NSLs have been issued pursuant to statutory nondisclosure provisions that prevent the issuance from being made known publicly. Both of these provisions have created controversy.) Indeed, regardless of whether a given information-generating program or technology is or is not classified, it can be said openly that the purpose of the program or technology is to generate information. Mission-directed intelligence analysis is an all-source enterprise—that is, the purpose of the analytical mission is to make sense out of information coming from multiple sources, classified and unclassified. Data mining and information fusion are technologies of analysis rather than collection, and thus they are intended to help analysts find patterns of interest in all of the available data.
Under the rubric of data mining techniques fall a diverse set of tools for mathematical modeling, including machine learning, pattern recognition, and information fusion.
Machine learning is the study of computer algorithms that improve automatically through experience.
Pattern recognition addresses a broad class of problems in which a feature extractor is applied to untreated (usually image) input data to produce a reduced data set for use as an input to a classification model, which then classifies the treated input data into one of several categories.
Information fusion refers to a class of methods for combining information from different sources in order to make inferences that may not be possible from a single source alone. Some information fusion methods use formal probabilistic models, and some include ways of assessing rates of linkage error; others include only one or none of these things.
There is a continuum of sophistication in techniques that have been referred to as data mining that may provide assistance in counterterrorism. On the more routine end of the spectrum (sometimes called subject-based data mining and often so routine as to not be included in the portfolio of techniques referred to as data mining) lies the automation of typical investigative techniques, especially the searching of large databases for characteristics that have been associated with individuals of interest, that is, people who are worthy of further investigation. Through the benefits of automation, the investigative power of these traditional techniques can be greatly expedited and broadened in comparison to former practices, and therefore they can provide important assistance in the fight against terrorism.
Subject-based data mining can identify, for example, people who own cars with license plates that are discovered at the scene of a terrorist act or whose fingerprints match those of people known to be involved in terrorist activity. It might also identify people who have been in communication with other persons of interest, people who have traveled to various places recently, and people who have transferred large sums of money to others of interest. When several disparate pieces of information of this type are obtained that are associated with terrorist activity, identifying a subset of a database that matches one or more of these various pieces of information can be referred to as “drilling down.” This is a data mining technique that simply expands and automates what a police detective or intelligence analyst would carry out with sufficient time.
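The idea of "drilling down" can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the record layout, the field names, the identifiers, and the three criteria are hypothetical and do not describe any actual system.

```python
# Hypothetical records; field names and values are illustrative only.
RECORDS = [
    {"person_id": "P1", "plates": {"ABC123"}, "contacts": {"P7"}, "transfers_to": set()},
    {"person_id": "P2", "plates": {"XYZ789"}, "contacts": {"P7"}, "transfers_to": {"P7"}},
    {"person_id": "P3", "plates": set(), "contacts": set(), "transfers_to": set()},
]

# Hypothetical pieces of information already associated with an incident.
PLATES_AT_SCENE = {"XYZ789"}
PERSONS_OF_INTEREST = {"P7"}

# Each criterion is a yes/no question asked of a single record.
CRITERIA = [
    lambda r: bool(r["plates"] & PLATES_AT_SCENE),
    lambda r: bool(r["contacts"] & PERSONS_OF_INTEREST),
    lambda r: bool(r["transfers_to"] & PERSONS_OF_INTEREST),
]


def drill_down(records, criteria, min_matches=1):
    """Return (person_id, match_count) for records matching at least
    min_matches criteria, strongest matches first."""
    hits = []
    for rec in records:
        count = sum(1 for criterion in criteria if criterion(rec))
        if count >= min_matches:
            hits.append((rec["person_id"], count))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


matches = drill_down(RECORDS, CRITERIA)            # [("P2", 3), ("P1", 1)]
narrowed = drill_down(RECORDS, CRITERIA, min_matches=2)  # [("P2", 3)]
```

Raising `min_matches` narrows the subset, which mirrors how an investigator would tighten a search as more pieces of information become available.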
There are two key requisites for this use of data mining. One is the development of linkages relating data and information in the relevant databases, which facilitates response to these types of queries—for example, being able to identify all numbers that have recently called or been called by a given telephone number. Of course, attestations regarding the accuracy and provenance of such identification are also necessary for confidence in the ultimate results. The second requisite is the quality of the information collected. Individuals claimed by law enforcement officials to match prints found at a crime scene have sometimes turned out not to match upon further investigation.12 Also, matching names or other forms of record linkage are error-prone operations, generally because of data quality issues.
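The first requisite, linkages that make such queries easy to answer, can be sketched as a simple index over call records. The call log, the telephone numbers, and the use of an integer day as the time stamp are all hypothetical; a real system would also need to carry the provenance and accuracy attestations mentioned above, which this sketch omits.

```python
from collections import defaultdict

# Hypothetical call log: (caller, callee, day_of_call).
CALL_LOG = [
    ("555-0100", "555-0101", 3),
    ("555-0102", "555-0100", 9),
    ("555-0103", "555-0104", 9),
    ("555-0100", "555-0105", 1),
]


def build_call_index(call_log):
    """Index every call under both parties so that 'called or been
    called by' is a single lookup."""
    index = defaultdict(list)
    for caller, callee, day in call_log:
        index[caller].append((callee, day))
        index[callee].append((caller, day))
    return index


def linked_numbers(index, number, since_day):
    """Numbers that called, or were called by, `number` on or after since_day."""
    return sorted({other for other, day in index.get(number, []) if day >= since_day})


INDEX = build_call_index(CALL_LOG)
recent = linked_numbers(INDEX, "555-0100", since_day=2)  # ["555-0101", "555-0102"]
```

The index is built once and reused, so each later query touches only the entries for the number in question.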
12 S. Kershaw, “Spain and U.S. at odds on mistaken terror arrest,” New York Times, June 5, 2004, available at http://query.nytimes.com/gst/fullpage.html?res=9800EFDB1031F936A35755C0A9629C8B63.
Similarly, so-called rule-based techniques collect joint characteristics or data for individuals (or other units of analysis, such as networks of individuals) whom detectives or intelligence analysts view as being potentially associated with terrorist activity. This activity can include, for example, the recent purchase, possibly as a member of a group, of chemicals or biological agents that can be used to create explosives or toxins. Again, this is a simple extension of what analysts would do with sufficient resources and represents a relatively unsophisticated application of data mining. The key element is the use of analysts to identify the important rules or patterns that are indicative of or associated with terrorist activity. Given that terrorists often operate in groups, network-based methods have particular importance and should be used in concert with rule-based methods when possible. As above, the use of rule-based techniques can be greatly compromised by poor-quality data.
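A rule of this kind, applied to a group rather than to an individual, might be sketched as follows. The rule, the item names, the purchase records, and the group memberships are hypothetical placeholders; in practice the rule would come from analysts and the groups from a separate network analysis.

```python
# Hypothetical analyst-defined rule: all of these items were acquired.
RULE_ITEMS = {"item_a", "item_b"}

# Hypothetical purchase records: person -> items recently purchased.
PURCHASES = {
    "A": {"item_a"},
    "B": {"item_b"},
    "C": {"item_a"},
}

# Hypothetical groups of associated individuals.
GROUPS = {
    "g1": {"A", "B"},
    "g2": {"C"},
}


def flag_groups(purchases, groups, rule_items):
    """Return names of groups whose members jointly purchased every rule item."""
    flagged = []
    for name, members in groups.items():
        bought = set()
        for member in members:
            bought |= purchases.get(member, set())
        if rule_items <= bought:  # subset test: every rule item was bought
            flagged.append(name)
    return sorted(flagged)


flagged = flag_groups(PURCHASES, GROUPS, RULE_ITEMS)  # ["g1"]
```

No single member of `g1` satisfies the rule alone, which illustrates why using the group as the unit of analysis can surface what an individual-level rule would miss. A missing or misattributed purchase record would change the result, which is the data quality concern noted above.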
Pattern-based data mining techniques either require a feedback mechanism to generate learning over time or are more assumption-dependent than subject-based techniques. Machine learning is one such technique: in situations in which the truth of a decision process can often be made known, the feedback of knowing which results were decided correctly and incorrectly can be used to improve the decision process, which “learns” over time to become a better discriminator. For example, in scanning carry-on luggage to decide which contents are of concern and which are not, the process of simultaneously and individually searching a large number of bags, both those identified as of concern and those identified as not of concern, and feeding this information back into the decision algorithm can be used to improve the algorithm. Over time the algorithm can learn which patterns are associated with bags of concern. Situations of this kind, in which cases of interest and cases not of interest become known for a large number of instances (referred to as a training set), permit machine learning to operate. Machine learning represents a collection of techniques that might have important applicability to specific, limited components of the counterterrorism problem.
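The feedback loop described here can be illustrated with a very small learner. The sketch below uses a perceptron, chosen only because it is short; the report does not name a specific algorithm. The two numeric features per bag and the labels (1 for "of concern", 0 otherwise, as established by a hand search) are invented for illustration.

```python
# Hypothetical training set: ((feature_1, feature_2), label from hand search).
TRAINING_SET = [
    ((1.0, 1.0), 1),
    ((0.9, 0.8), 1),
    ((0.1, 0.2), 0),
    ((0.2, 0.1), 0),
]


def predict(weights, bias, features):
    """Flag a bag (1) when the weighted score is positive, else clear it (0)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train(training_set, epochs=5, learning_rate=1.0):
    """Perceptron rule: adjust the weights only when a decision was wrong."""
    weights = [0.0] * len(training_set[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in training_set:
            error = label - predict(weights, bias, features)  # -1, 0, or +1
            if error:
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
    return weights, bias


WEIGHTS, BIAS = train(TRAINING_SET)
decisions = [predict(WEIGHTS, BIAS, f) for f, _ in TRAINING_SET]  # [1, 1, 0, 0]
```

Before training, every bag is cleared because all weights are zero; each wrong decision revealed by a hand search nudges the weights, and on this small set the decisions match the labels after two passes. Without the labels there is nothing to compute `error` from, which is why the training set is essential.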
There are also a number of situations in which the identification of anomalous patterns, in comparison to a long historical pattern of behavior or use, might make it possible to ultimately discriminate between activities of interest and activities not of interest to intelligence analysts. This approach is referred to as signature-based analysis, and current successful applications of data mining in these situations include the identification of anomalous patterns of credit card use or the fraudulent use of a telephone billing account. However, in those applications, a training set is available to help evaluate the extent to which the pattern of interest is useful in discriminating the behavior of interest from that not of interest.
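A minimal sketch of comparing a new observation with a historical pattern is shown below. It uses a z-score test, which is one simple choice among many and is an assumption of this example; the transaction amounts and the threshold of 3 standard deviations are likewise hypothetical.

```python
from statistics import mean, stdev

# Hypothetical history of transaction amounts for one account.
HISTORY = [20, 25, 22, 30, 18, 27, 24]


def is_anomalous(history, value, threshold=3.0):
    """True when value lies more than `threshold` sample standard
    deviations from the historical mean (needs at least 2 history points)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


typical = is_anomalous(HISTORY, 26)    # False
unusual = is_anomalous(HISTORY, 400)   # True
```

Whether a flagged value reflects fraud or merely an unusual but legitimate purchase cannot be decided from the history alone. That judgment requires labeled cases, which is the role the training set plays in the credit card and telephone applications mentioned above.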
When a training set or some formal means for assessing predictive validity is not available (i.e., if there is no way to test predictions against some kind of ground truth), these techniques are unlikely to provide useful information for counterterrorism. Nevertheless, it may be possible to use subject-matter experts to identify discriminating patterns, and one cannot reject a priori the possibility that anomalous patterns might be identified that intelligence analysts would also view as very likely to be associated with terrorist activity. Through collaborative work of this kind, signature-based data mining techniques might be developed that could effectively discriminate in counterterrorism between patterns of interest and those not of interest. Such patterns might then provide leads for further investigation through traditional law enforcement or national security means.
This rough partitioning of data mining techniques into pattern-based and subject-based approaches is meant to describe two relatively broad classes of techniques representing two “pure types” of methods used. However, many of the approaches used in practice can be considered combinations of these pure types, and therefore the examples included here of these two approaches do not fully explore the richness of techniques that is possible. Indeed, the data mining components of a real system are likely to reflect aspects of both subject-based and pattern-based data mining algorithms, through joint use of several perspectives using different units of analysis, combining evidence in several ways.
In many cases, the unit of analysis is the individual, and the objective is discriminating between the people who are and are not of interest. However, rather than using the individual as the basic unit of analysis, many techniques may use other constructs, such as the relevant group of close associates, the date of some possible terrorist activity, or the intended target, and then tailor the information retrieval and the analysis using that perspective. To best address a given problem, it may be beneficial at times to use more than one unit of analysis (such as a group), and to combine such analyses so that mutually consistent information can be recognized and used. The unit of analysis selected has implications for the rule-based techniques that might be used, or what patterns or signatures might be seen to be anomalous and therefore of interest.13
In addition, the use of data mining procedures may occur as component parts of a counterterrorism system, in which data mining tools address specific needs, such as identifying all the financial dealings, contacts, events, travels, etc., corresponding to a person of interest. The overall system would be managed by intelligence agents, who would also have impacts on both the design of the data mining components and on the remaining components, which might involve skills that could not be automated. The precise form of such a system is only hinted at here, and both system development and deployment are likely to require a substantial investment of time and resources as well as collaboration with those with state-of-the-art expertise in data mining, database management, and counterterrorism.
Finally, no single operational system has access to all of the relevant data at the same time. In practice, the results of an analysis from any given system will often lead to queries being made of other systems exploiting different analytical techniques with access to different databases. For example, an intensive analysis on one system may be made using a limited set of records to identify a set of initial leads. In subsequent stages using different systems, progressively more extensive sets of data may be analyzed to winnow the set of initial leads. Such a practice—often known as multistage inference—may help to improve efficiency and to reduce privacy impacts.
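Multistage inference can be sketched as a chain of filters in which each later stage examines only what the earlier stage passed along. The stage names, the candidate identifiers, and the sets standing in for each system's findings are hypothetical.

```python
# Hypothetical findings of two separate systems.
SYSTEM_A_HITS = {3, 8, 15, 42}   # stage 1: limited records, whole population
SYSTEM_B_HITS = {8, 42, 77}      # stage 2: more extensive data

STAGES = [
    ("system_a", lambda candidate: candidate in SYSTEM_A_HITS),
    ("system_b", lambda candidate: candidate in SYSTEM_B_HITS),
]


def multistage(candidates, stages):
    """Apply each stage to the survivors of the previous one.

    Returns the final leads and a trail of (stage, examined, kept)."""
    remaining = sorted(set(candidates))
    trail = []
    for name, keep in stages:
        examined = len(remaining)
        remaining = [c for c in remaining if keep(c)]
        trail.append((name, examined, len(remaining)))
    return remaining, trail


LEADS, TRAIL = multistage(range(100), STAGES)
# LEADS == [8, 42]
# TRAIL == [("system_a", 100, 4), ("system_b", 4, 2)]
```

The trail makes the privacy argument visible: the more extensive second-stage data are consulted for 4 candidates rather than for all 100.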
In general, there is little doubt about the utility of subject-based data mining as an investigative technique. However, the utility of pattern-based data mining and information fusion depends on the availability of a training set and the application to which the techniques are applied. Pattern-based data mining is most likely to be useful when training sets are available; there are supplementary tasks for which data mining tools might be helpful that do not require a training set. At the same time, the utility of pattern-based data mining, without a training set, to identify patterns indicative of individuals and events worth additional investigation, is very unclear. Although there is no a priori argument that categorically eliminates pattern-based data mining as a useful tool for counterterrorism applications, considerable basic research will be necessary to develop operational systems that help to provide a prioritization of cases for experts to examine in more depth. Such research would examine the feasibility and utility of pattern-based data mining and information fusion for counterterrorism applications and subsequent development into specific applications components. That approach to the problem in question might not succeed but the potential gains are large, and for this reason such a modest program investment, structured in accordance with the framework proposed in Chapter 2, may be well worth making.
Behavioral surveillance seeks to detect physiological behaviors, conditions, or responses and the attendant biological activity that indicate that an individual is about to commit an act of terrorism. Specifically, behavioral surveillance seeks to detect patterns of behavior thought to be precursors or correlates of wrongdoing (e.g., deception, expressing hostile emotions) or that are anomalous in certain situations (e.g., identifying a person who shows much greater fidgeting and much more facial reddening than others in a security line).
If people were incapable of lying, the easiest and most accurate way to determine past, current, and future behavior would be to ask them what they have been doing, what they are doing, and what they plan to do. But people are highly capable of lying, and it is currently very difficult to detect lying with great degrees of accuracy (especially through automated means). Thus, the terrorist’s desire to avoid detection makes this verbal channel of information highly unreliable.
For this reason, behavioral surveillance focuses on biological or physiological indicators that are relatively involuntary (i.e., whose presence or absence is not subject to voluntary control) or provide detectable signs when they are being manipulated. For example, physiological indicators, such as cardiac activity, facial expressions, and voice tone, can be monitored and the readings used to make inferences about internal psychological states (e.g., “based on this pattern of physiological activity, this person is likely to be engaged in deception”). However, such indicators do not provide direct evidence of deception of any sort, let alone terrorist behavior (e.g., the deception if present at all may not relate to terrorist behavior but rather to cheating on one’s income tax or spouse), and thus the problem becomes one of inferring the specific (i.e., terrorist behavior) from more general indicators.
To illustrate the government interest in behavioral surveillance, consider Project Hostile Intent, conducted under the auspices of the U.S. Department of Homeland Security’s Human Factors Division in the Science and Technology Directorate. This project seeks to develop models of hostile intent and deception, focusing on behavioral and speech cues. These cues would be determined from experiments and derived from operationally based scenarios that reflect the screening and interviewing objectives of the department. In addition, the project seeks to develop an automated suite of noninvasive sensors and algorithms that can automatically detect and track the input cues to the models. If successful, the resulting technologies would afford capabilities to identify deception and hostile intent in real time, on the spot, using noninvasive sensors, with the goal of being able to screen travelers in an automated fashion with equal or greater effectiveness than the methods used today without impeding their flow.14
14 U.S. Department of Homeland Security, “Deception detection: Identifying hostile intent,” S&T Snapshots: Science Stories for the Homeland Security Enterprise 1(1), May 2007, available at http://www.homelandsecurity.org/snapshots/newsletter/2007-05.htm#deception.
Although behavioral methods are useful under some circumstances (such as real-life circumstances that closely approximate laboratory conditions), they are intrinsically subject to three limitations:
Many-to-one. Any given pattern of physiological activity can result from or be correlated with a number of quite different psychological or physical states.
Probabilistic. Any detected sign or pattern conveys at best a change in the likelihood of the behavior, intent, or attitude of interest and is far from an absolute certainty.
Errors. In addition to the highly desirable true positives and true negatives that are produced, there will be the very troublesome false positives (i.e., a person telling the truth is thought to be lying) and false negatives (i.e., a person lying is thought to be telling the truth). Such errors are linked to the probabilistic nature of behavioral signals and a lack of knowledge today about how to interpret such signals.
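The interaction between the probabilistic and error limitations can be illustrated with a short calculation of how often a positive signal is actually correct. The function applies Bayes' rule; the prevalence, sensitivity, and specificity values are hypothetical numbers chosen for illustration and are not measurements of any real screening method.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a person flagged by the test is a true positive.

    prevalence:  fraction of screened people who are actually of interest
    sensitivity: P(flagged | of interest)
    specificity: P(not flagged | not of interest)
    """
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical: 1 person of interest per million screened, 99% accurate both ways.
RARE = positive_predictive_value(1e-6, 0.99, 0.99)

# Same hypothetical test where half of those screened are of interest.
COMMON = positive_predictive_value(0.5, 0.99, 0.99)
```

Under these assumed numbers `RARE` is roughly 0.0001, so only about one flag in ten thousand would be a true positive, while `COMMON` is 0.99. The same test yields very different proportions of false positives depending on how rare the behavior of interest is.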
Privacy issues associated with behavioral surveillance are regarded by the committee as far more significant and far-ranging than those associated with the collection and use of electronic databases, in part because of their potential for abuse, in part because of what they may later reveal about an individual that is potentially unconnected to terrorist activities, in part because of a sense that the intrusion is greater if mental state is being probed, in part because people expect to be allowed to keep their thoughts to themselves, and in part because there is often much more ambiguity regarding interpretation of the results.
THE SOCIAL AND ORGANIZATIONAL CONTEXT
Technology is always embedded in a social and organizational context. People operate machines and devices and make decisions based on what these machines and devices tell them. In turn, these decisions are based on certain criteria that are organizationally specified. For example, a metal detector is placed at the entrance to a building. At the request of the security guard, a visitor walks through the detector. If the detector buzzes, the guard searches the visitor more closely. If the guard finds a weapon, the guard confiscates it and calls his superior to take the visitor for additional questioning. The guard carries out these procedures because they are required by the organization responsible for building security—and if the guard does not carry out these procedures, security may be compromised despite the presence of the metal detector.
Nor can the presence of the relevant machines and devices be taken for granted. There are many steps that must be taken before the relevant machine or device is actually deployed and put into use at a security checkpoint, and even when the science underpinning the relevant machines and devices is known, the science must be instantiated into artifacts. For example, a functional metal detector depends on some understanding of the science of metal detection, even if the theory is not completely known. Prototypes must be built and problems in the manufacturing process overcome. Budgets must be available to acquire the necessary devices and to train security guards in their operation. Oversight must be exercised to ensure that the processes and procedures necessary to decide on and implement a program are correctly followed.
Protecting privacy often depends on social and organizational factors as well. For example, the effectiveness of rules that prohibit agents or analysts from disclosing certain kinds of personal information about the targets of their investigations is based on the willingness and ability of those agents or analysts to follow the rules and the organization’s willingness and ability to enforce them. While encryption may provide the technical capability to protect data from being viewed by someone without the decryption key, policies and practices determine whether encryption capabilities are actually used in the proper circumstances.
The social and organizational context in which technology is embedded is important from a policy standpoint because the best technology embedded in a dysfunctional organization or operated with poorly trained human beings is often ineffective. This point goes beyond the usual concerns about a technology that is promising in the laboratory being found too unwieldy or impractical for widespread use in the field.
The Meaning of Privacy15
In both everyday discourse and the scholarly literature, a commonly agreed-on abstract definition of privacy is elusive. For example, privacy may refer to protecting the confidentiality of information; enabling a sense of autonomy, independence, and freedom to foster creativity; wanting to be left alone; or establishing enough trust that individuals in a given community are willing to disclose data under the assumption that they will not be misused. For purposes of this report, the term “privacy” is generally used in a broad and colloquial sense that encompasses the technical definitions of privacy and confidentiality commonly used in the statistical literature. That is, the statistical community’s definition of privacy is an individual’s freedom from excessive intrusion in the quest for information and an individual’s ability to choose the extent and circumstances under which his or her beliefs, behaviors, opinions, and attitudes will be shared with or withheld from others. Confidentiality is the care in dissemination of data in a manner that does not permit identification of the respondent or would in any way be harmful to him or her and that the data are immune from legal process.16 Put differently, privacy relates to the ability to withhold personal data, whereas confidentiality relates to the activities of an agency that has collected such data from others. Yet another sense of privacy to keep in mind is that of a set of restrictions on how or for how long personal information can be used. In this report, when these distinctions are important, these different senses of meaning will be explicitly addressed, but in the less technical sections, the term “privacy” will be used in a more generic fashion.
In its starkest terms, privacy is about what personal information is being kept private and which parties the information is being kept from. For example, one notion of privacy involves confidentiality or secrecy of some specific information, such as preventing disclosure of an individual’s library records to the government. A second notion of privacy involves anonymity, as reflected in, for example, an unattributable chat room discussion that threatens the use of violence against innocent parties.
These two simple examples illustrate two key points regarding privacy. First, the party against which privacy is being invoked may have a legitimate reason for wanting access to the information being denied—a government conducting a terrorist investigation may want to know what a potential suspect is reading, or a law enforcement official may need the identity of the person threatening violence in order to protect innocent people. Second, some kind of balancing of competing interests may be necessary—thus raising the question of the circumstances under which the government should have access to such information.
In practice, three other issues are also critical to understanding the privacy interests in any given situation: what the information will be used for, where the information comes from, and what the consequences are for the individual whose information is at issue. Regarding purpose, divulging personal information for one purpose may not be regarded as a violation of privacy, whereas divulging the same information for a different purpose may be regarded as a clear violation of privacy. (In other words, a “justified” violation of an individual’s privacy—that is, for a reason that is good and valid to the individual in question—is generally not viewed as a violation of his or her privacy interests by that individual.) Regarding source, government collection of personal information is often regarded as different in kind from private collection of personal information, although government is increasingly making use of personal data gathered by private parties. This point is especially significant because laws that restrict government collection of personal information often do not apply to private collection, and the government breaks no law in purchasing such information from private parties. Regarding consequences, for many people, a primary consideration in privacy is the adverse consequences they may experience if their privacy is compromised—denial of financial benefits, personal embarrassment or shame, and so on.
The notion of trust is intimately related to the meaning of privacy. Briefly put, people tend to invoke rights to privacy much more strongly when they fear the motivations or intent of the entity that is to receive their data. That is, a lack of trust in these data-receiving entities drives both the strength of people’s desires for privacy and their conceptions of privacy. This is especially true when the data-receiving entity is capable of imposing an adverse consequence on them. Box 1.1 addresses this point further.
Privacy also has a variety of more technical meanings, some of which are elaborated in Appendix L (“The Science and Technology of Privacy Protection”). The most well-defined of these meanings for scientific study is based on the intuitive notion that a system containing an individual’s information protects his or her privacy if all events, such as being singled out for additional attention at airport security, being denied medical insurance coverage, or gaining entrance to the college of his or her choice, are no more likely than if the system did not contain that information. This meaning can be formalized, as described in Appendix L (Section L.2).
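One way to write this intuition down is sketched below. Appendix L is not reproduced in this section, so the notation here is an assumption made for illustration rather than the appendix's own: $M$ stands for the system, $D$ for a database that contains the individual's information, $D'$ for the same database without it, and $S$ for any event of the kind listed above.

```latex
\Pr\bigl[\,M(D) \in S\,\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[\,M(D') \in S\,\bigr]
\qquad \text{for every event } S
```

With $\varepsilon = 0$ this states that no event is any more likely when the individual's information is present, which is the strict reading of the text. A small $\varepsilon > 0$ relaxes the requirement to "not much more likely"; that relaxed form is the condition known in the research literature as differential privacy.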
The effectiveness of a technology, a system, or a program is judged by the extent to which it directly furthers the objective being sought. Effectiveness is a measure of technical performance, and policy makers and government officials responsible for developing, purchasing, deploying, and using information-based programs must make judgments regarding whether a given level of effectiveness is sufficient to proceed with the use or deployment of a given technology, system, or program. Section 1.8.4 addresses false positives and false negatives as essential elements of judging the effectiveness of a program.
The qualification of “directly” furthering the objective being sought is an important one. From time to time, technologies, systems, or programs are admittedly ineffective from a technological point of view and yet are justified on the basis of their alleged deterrent value. That is, their mere presence and the adversary’s concern that they might work are said to help deter the adversary from taking such an action.17 The desirability of admittedly ineffective systems that might help to deter adversaries is not considered in this report.
BOX 1.1 A Relationship Between Privacy and Trust
The National Research Council report Engaging Privacy and Information Technology in a Digital Age (2007) explicitly addresses the relationship between privacy and trust. Specifically, that committee found (in Finding 4) that “privacy is particularly important to people when they believe that the entity receiving their personal information is not trustworthy and that they may be harmed by sharing that information.”
That report goes on to explain (pp. 311-312):
Trust is an important issue in framing concerns regarding privacy. In the context of an individual providing personal information to another, the sensitivities involved will depend on the degree to which the individual trusts that party to refrain from acting in a manner that is contrary to his or her interests (e.g., to pass it along to someone else, to use it as the basis for a decision with inappropriately adverse consequences). As an extreme case, consider the act of providing a complete dossier of personal information on a stack of paper—to a person who will destroy it. If the destruction is verifiable to the person providing the dossier (and if there is no way for the destroyer to read the dossier), it would be hard to assert the existence of any privacy concern at all.
But for most situations in which one provides personal information, the basis for trust is less clear. Children routinely assert privacy rights to their personal information against their parents when they do not trust that parents will not criticize them or punish them or think ill of them as a result of accessing that information. (They also assert privacy rights in many other situations.) Adults who purchase health insurance often assert privacy rights in their medical information because they are concerned that insurers might not insure them or might charge high prices on the basis of some information in their medical record. Many citizens assert privacy rights against government, although few would object to the gathering of personal information within the borders of the United States and about U.S. citizens if they could be assured that such information was being used only for genuine national security purposes and that any information that had been gathered about them was accurate and appropriately interpreted and treated. … Perversely, many people hold contradictory views about their own privacy and other people’s privacy—that is, they support curtailing the privacy of some demographic groups at the same time that they believe that their own should not be similarly curtailed. This dichotomy almost certainly reflects their views about the trustworthiness of certain groups versus their own.
In short, the act of providing personal information is almost always accompanied to varying degrees by a perceived risk of negative consequences flowing from an abuse of trust. The perception may or may not be justified by the objective facts of the situation, but trust has an important subjective element. If the entity receiving the information is not seen as trustworthy, it is likely that the individuals involved will be much more hesitant to provide that information (or to provide it accurately) than they would be under other circumstances involving a greater degree of trust.
Law and Consistency with Values
Measures of effectiveness deal with issues of feasibility. Legality and ethicality, in contrast, address issues of desirability. Not all technically feasible technologies, systems, or programs are desirable. Law provides one codification of national values that prescribes required actions and prohibits other actions. Although society expects its government to obey the law, it is also true that technologies and events outpace the rate at which law changes. Such rapid changes often leave policy makers with a difficult gray area in which certain actions are not explicitly prohibited but that nevertheless may be inconsistent with a broad notion of American values.
A good example of the impact of technological change on the law is the interpretation of the Supreme Court in 1976 in United States v. Miller18 that there can be no reasonable expectation of privacy in information held by a third party. The case involved cancelled checks, to which, the Court noted, “respondent can assert neither ownership nor possession.”19 Such documents “contain only information voluntarily conveyed to the banks and exposed to their employees in the ordinary course of business,”20 and therefore the Court found that the Fourth Amendment is not implicated when the government sought access to them:
The depositor takes the risk, in revealing his affairs to another, that the information will be conveyed by that person to the Government. This Court has held repeatedly that the Fourth Amendment does not prohibit the obtaining of information revealed to a third party and conveyed by him to Government authorities, even if the information is revealed on the assumption that it will be used only for a limited purpose and the confidence placed in the third party will not be betrayed.21
Congress reacted to the decision by enacting modest statutory protection for customer financial records held by financial institutions,22 but there is no constitutional protection for financial records or for any other personal information that has been disclosed to third parties. As a result, the government can collect even the most sensitive information from a third party without a warrant and without risk that the search may be found unreasonable under the Fourth Amendment.
The Court reinforced its holding in Miller in the 1979 case of Smith v. Maryland, involving information about (as opposed to the content of) telephone calls.23 The Court found that the Fourth Amendment is inapplicable to telecommunications “attributes” (the number dialed, the time the call was placed, the duration of the call, etc.), because that information is necessarily conveyed to, or observable by, third parties involved in connecting the call.24 “[T]elephone users, in sum, typically know that they must convey numerical information to the phone company; that the phone company has facilities for recording this information; and that the phone company does in fact record this information for a variety of legitimate business purposes.”25 As with information disclosed to financial institutions, Congress reacted to the Supreme Court’s decision by creating a statutory warrant requirement for pen registers,26 but the Constitution does not restrict government action in this area.
Some legal analysts believe that this interpretation regarding the categorical exclusion of records held by third parties from Fourth Amendment protection makes less sense today because of the extraordinary increase in both the volume and the sensitivity of information about individuals so often held by third parties. In this view, the digital transactions of daily life have become ubiquitous.27 Such transactions include detailed information about individuals’ behavior, communications, and relationships.
At the same time, people who live in modern society do not have a real choice to refrain from leaving behind such trails. Even in the 1970s when Miller and Smith were decided, individuals who wrote checks and made telephone calls did not voluntarily convey information to third parties—they had no choice but to convey the information if they wanted to make large-value payments or communicate over physical distances. And in those cases, the third parties did not voluntarily supply the records to the government. Financial institutions are required to keep records (ironically, this requirement is found in the Right to Financial Privacy Act), and telephone companies are subject to a similar requirement about billing records. In both cases, the government demanded the records. And, at the same time, the information collected and stored by banks and telephone companies is subject to explicit or implicit promises that it will not be further disclosed. Most customers would be astonished to find their checks or telephone billing records printed in the newspaper.
Today, such transactional records may be held by more private parties than ever before. For example, a handful of service providers already process, or have access to, the large majority of credit and debit card transactions, automated teller machine (ATM) withdrawals, airline and rental car reservations, and Internet access, and the everyday use of a credit card or ATM card involves the disclosure of personal financial information to multiple entities. In addition, digital networks have facilitated the growth of vigorous outsourcing markets, so information provided to one company is increasingly likely to be processed by a separate institution, and customer service may be provided by another. And all of those entities may store their data with still another. Moreover, there are information aggregation businesses in the private sector that already combine personal data from thousands of private-sector sources and public records. They maintain rich repositories of information about virtually every adult in the country, which are updated daily by a steady stream of incoming data.28
Finally, in this view, the fact that all of the data in question are in digital form means that increasingly powerful tools, such as automated data mining, can be used to analyze them, thereby reducing or eliminating privacy protections that were previously based on obscurity and difficulty of access to the data. The impact of Miller in 1976 was limited primarily to government requests for specific records about identified individuals who had already done something to warrant the government’s attention, whether or not the suspicious activity amounted to probable cause. Today, the Miller and Smith decisions allow the government to obtain the raw material on millions of individuals without any reason for identifying anyone in particular.
Thus, in this view, removing the protection of the Fourth Amendment from all of these records solely because they are held by third parties significantly reduces the constitutional protection for personal privacy, not as the result of a conscious legal decision but through the proliferation of digital technologies. In short, under current Fourth Amendment jurisprudence, all personal information in third-party custody, no matter how sensitive or how revealing of a person’s health, finances, tastes, or convictions, is available to the government without constitutional limit. The government’s demand need not be reasonable, no warrant is necessary, no judicial authorization or oversight is required, and it does not matter if the consumer has been promised by the third party that his or her data would be kept confidential as a condition of providing the information.
A contrary view is that Miller and Smith are important parts of the modern Fourth Amendment and that additional privacy protections in this context should come from Congress rather than the courts. According to this view, Miller and Smith ensure that there are some kinds of surveillance that the government can conduct without a warrant. Fourth Amendment doctrine has always left a great deal of room for unprotected activity, such as what happens in public: the fact that the police can watch in public areas for criminal activity without being constrained by the Fourth Amendment is critical to the balance of the Fourth Amendment’s rule structure. In switching from physical activity to digital activity, everything becomes a record. If all records receive Fourth Amendment protection, treating every record as private, the equivalent of something inside the home, then the government will have considerable difficulty monitoring criminal activity without a warrant. In effect, under this interpretation, the Fourth Amendment would apply much more broadly to records-based and digital crimes than it does to physical crimes, and all in a way that would make it very difficult for law enforcement to conduct successful investigations. In this view, the best way forward is for the Supreme Court to retain Smith and Miller and for Congress to provide statutory protections when needed, much as it has done with the enactment of privacy laws, such as the Electronic Communications Privacy Act.
Given these contrasting perspectives and the important issues they raise, the constitutional and policy challenges for the future are to decide, explicitly and in light of new technological developments, the appropriate boundaries of Fourth Amendment jurisprudence regarding the disposition of data held by third parties. The courts are currently hearing cases that bear on this question; so far they have indicated that noncontent information is covered by Miller but that content information receives full Fourth Amendment protection. But these cases are new and may be overturned, and it will be some years before clearer boundaries emerge definitively.
False Positives, False Negatives, and Data Quality29
False positives and false negatives arise in any kind of classification exercise.30 For example, consider a counterterrorism exercise in which it is desirable to classify each individual in a set of people as “not worthy of further investigation/does not warrant obtaining more information on these people” or “worthy of further investigation/does warrant obtaining more information on these people,” based on an examination of data associated with each individual. A false positive is someone placed in the latter category who has no terrorist connection. A false negative is someone placed in the former category who has a terrorist connection.
Consider a naïve notional system in which a computer program or a human analyst examines the data associated with each individual, searching for possible indications of terrorist attack planning. This examination results in a score for each individual that indicates the relative likelihood of him or her being “worthy of further investigation” relative to all of the others being examined.31 When all of the individuals are examined, they are sorted according to this score.
29. This section is adapted largely from National Research Council, Engaging Privacy and Information Technology in a Digital Age, The National Academies Press, Washington, D.C., 2007, Chapter 1.
30. An extensive treatment of false positives and false negatives (and the trade-offs thereby implied) can be found in National Research Council, The Polygraph and Lie Detection, The National Academies Press, Washington, D.C., 2003.
31. The score calculated by any given system may simply be an index with only ordinal (rank-ordering) properties. If more information is available and a more sophisticated analytical approach is possible, the score may be an actual Bayesian probability or likelihood that could be manipulated quantitatively in accordance with the mathematics of probability and statistics.
This rank ordering does not, in itself, determine the classification; in addition, a threshold must be established to determine what scores will correspond to each category. The critical point here is that setting this threshold is the responsibility of a human analyst; technology does not, indeed cannot, set this threshold. Moreover, it is likely that the appropriate setting of a threshold depends on the consequences for the individual being miscategorized. If the real-world consequence of a false positive for a given individual is being denied boarding of an airplane, rather than having more records relevant to that individual examined, one may wish greater certainty to reduce the likelihood of a false positive; this desire would tend to drive the threshold higher in the first instance than in the second. In addition, any real analyst will not be satisfied with a system that impedes the further investigation of someone whose score is below the threshold. That is, an analyst will want to reserve the right (have the ability) to designate for further examination an individual who may have been categorized as below threshold, to say, in effect, “That guy has a lower score than most of the others, but there’s something strange about him anyway, and I want to look at him more closely even if he is below threshold.”
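The score, threshold, and analyst override described above can be illustrated with a minimal sketch. The names (`Person`, `classify`, `analyst_overrides`), the scores, and the threshold values are all hypothetical and chosen only for illustration; the text does not prescribe any particular scoring method or threshold.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    score: float  # hypothetical score; higher means "more worth a closer look"


def classify(people, threshold, analyst_overrides=()):
    """Rank people by score, then flag those at or above the threshold.

    The threshold is an input chosen by a human analyst, and the analyst may
    also flag named individuals whose scores fall below it.
    """
    ranked = sorted(people, key=lambda p: p.score, reverse=True)
    flagged = [
        p.name
        for p in ranked
        if p.score >= threshold or p.name in analyst_overrides
    ]
    return ranked, flagged


people = [Person("A", 0.91), Person("B", 0.40), Person("C", 0.75), Person("D", 0.10)]

# A more consequential action (such as denying boarding) gets a higher
# threshold than a less consequential one (such as examining more records).
_, deny_boarding = classify(people, threshold=0.90)
_, review_records = classify(people, threshold=0.70)

# The analyst reserves the right to look at someone below the threshold.
_, with_override = classify(people, threshold=0.90, analyst_overrides={"D"})
```

The same ranked list yields different classifications depending solely on where the analyst places the threshold, which is the point the paragraph makes.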
Because the above approach is focused on individuals, any realistic setting of a threshold is likely to result in enormous numbers of false positives. One way to reduce the number of false positives significantly is to exploit the fact that terrorists—especially those with big plans in mind—are most likely to operate in small groups (also known as cells). Thus, a more sophisticated system could consider a different unit of analysis—groups of individuals rather than individuals—that might be worth further investigation. This approach, known as collective inference, focuses on analyzing large collections of records simultaneously (e.g., people, places, organizations, events, and other entities).32 Conceptually, the output of this system could be a rank ordering of all possible groups (combinations) of two individuals, another rank ordering of all possible groups of three individuals, and so on. Once again, thresholds would be set to determine groups that were worth further investigation. The rank orderings resulting from a group-oriented analysis could also be used to rule out individuals who might otherwise be classified as worthy of further investigation—if an individual with an above-threshold score was not found among the groups with above-threshold scores, that individual would be either a lone wolf or clearly seen to be a false positive and thus eliminated before the investigation went any further.
32. More detail on these ideas can be found in D. Jensen, M. Rattigan, and H. Blau, “Information awareness: A prospective technical assessment,” Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, available at http://kdl.cs.umass.edu/papers/jensen-et-al-kdd2003.pdf.
A “brute-force” search of all possible groups of two, of three, and so on when the population in question is that of the United States is daunting, to say the least. But in practice, most of those groups will be individuals with no plausible connections among them, and thus the records containing information about those groups need not be examined. Identifying such groups is a problem, but other techniques may be useful in eliminating some groups at a fairly early stage; for example, if a group does not contain individuals who have communicated with each other, that group might be eliminated from further consideration. All such criteria also run the risk of incurring false negatives, and it remains to be seen how useful such pruning efforts are in practice.
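To give a sense of scale, the following sketch counts the possible groups of two and three in a population and shows one pruning rule of the kind mentioned above. The population figure of 300 million is a hypothetical round number, and `connected_pairs` and the sample contact data are illustrative assumptions, not a description of any deployed system.

```python
from itertools import combinations
from math import comb


def count_groups(population, size):
    """Number of distinct groups of the given size (n choose k)."""
    return comb(population, size)


def connected_pairs(people, contacts):
    """Keep only pairs whose members have communicated with each other.

    `contacts` is a set of frozensets, one per observed communication.
    """
    return [pair for pair in combinations(people, 2) if frozenset(pair) in contacts]


population = 300_000_000  # hypothetical round figure
pairs = count_groups(population, 2)
triples = count_groups(population, 3)

# Pruning on a toy example: only pairs with an observed contact survive.
people = ["A", "B", "C", "D"]
contacts = {frozenset({"A", "B"}), frozenset({"C", "D"})}
kept = connected_pairs(people, contacts)
```

Under these assumptions there are about 4.5 × 10^16 possible pairs and more than 10^24 possible triples, while the toy pruning step keeps two of the six possible pairs. Any pair that communicated through a channel missing from `contacts` would be dropped, which is the false-negative risk the text notes.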
False positives and false negatives arise from two other sources. One is the validity of the model used to distinguish between terrorists and innocent individuals. A perfectly valid model of a terrorist is one in which a set of specific measurable characteristics, if correctly associated with a given individual, would correctly identify that individual as a terrorist with 100 percent accuracy, and other individuals lacking one or more of those characteristics would be correctly identified as an innocent individual. Of course, in the real world, no model is perfect, and so false positives and false negatives are inevitable from the imperfection of models.
The second and independent source of false positives and false negatives is imperfect data. That is, even if a model were perfect, in the real world the data asserted to be associated with a given individual may not in fact be associated with that individual. For example, an individual’s height may be recorded as 6.1 meters, whereas his height may in fact be 1.6 meters. Her religion may be recorded as Protestant, but in fact she may be a practicing Catholic. Such data errors arise for a wide range of reasons, including keyboarding errors, faulty intelligence, errors of translation, and so on. Improving data quality can thus reduce the rate of false positives and false negatives, but only up to the limits inherent in the imperfections of the model. Since models, for computability, abstract only some of the variables and behaviors of reality, they are by design imperfect. Model imperfections are a built-in source of error, and better data cannot compensate for a model’s inadequacies.
Model inadequacies stem from several possible sources: (1) the required data for various characteristics in the assumed causal model may not be available, (2) some variables may be left out to simplify computations, (3) some variables that are causal may be available but unknown, (4) the precise form of the relationship between the predictor variables and the assessment of degree of interest is unknown, (5) the form of the relationship may be simplified to expedite computation, and (6) the phenomenon may be dynamic in nature and therefore any datedness in the inputs could cause erroneous predictions.
Data quality is the property of data that allows them to be used effectively and rapidly to inform and evaluate decisions.33 Ideally, data should be correct, current, complete, and relevant. Data quality is intimately related to false positives and false negatives, in that it is intuitively obvious that using data of poor quality is likely to result in larger numbers of false positives and false negatives than would be the case if the data were of high quality.
Data quality is a multidimensional concept. Measurement error and survey uncertainty contribute (negatively) to data quality, as do issues related to measurement bias. Many issues arise as the result of missing data fields; inconsistent data fields in a given record, such as recording a pregnancy for a 9-year-old boy; data incorrectly entered into the database, such as that which might result from a typographical error; measurement error; sampling error and uncertainty; timeliness (or lack thereof); coverage or comprehensiveness (or lack thereof); improperly duplicated records; data conversion errors, as might occur when a database of vendor X is converted to a comparable database using technology from vendor Y; use of inconsistent definitions over time; and definitions that become irrelevant over time.
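A few of the issues listed above (missing fields, internally inconsistent fields, and implausible values) can be checked mechanically. The sketch below is only illustrative: the field names, the `REQUIRED_FIELDS` schema, and the plausibility bounds are assumptions made for the example, and checks of this kind catch only errors that happen to violate a stated rule.

```python
REQUIRED_FIELDS = ("name", "age", "sex", "height_m")  # hypothetical schema


def quality_issues(record):
    """Return a list of simple data-quality problems found in one record."""
    issues = []

    # Missing data fields.
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            issues.append(f"missing:{field}")

    # Inconsistent fields within a record, e.g., a pregnancy recorded
    # for a 9-year-old boy.
    age, sex = record.get("age"), record.get("sex")
    if record.get("pregnant") and (sex == "M" or (age is not None and age < 10)):
        issues.append("inconsistent:pregnant")

    # Implausible values, e.g., a height keyed in as 6.1 m instead of 1.6 m.
    # The bounds are an assumption chosen for illustration.
    height = record.get("height_m")
    if height is not None and not (0.3 <= height <= 2.8):
        issues.append("implausible:height_m")

    return issues


boy = {"name": "x", "age": 9, "sex": "M", "height_m": 1.3, "pregnant": True}
tall = {"name": "y", "age": 40, "sex": "F", "height_m": 6.1}
partial = {"name": "z", "sex": "F", "height_m": 1.7}
```

A height wrongly recorded as 1.9 meters instead of 1.6 meters would pass every check here, which illustrates why rule-based validation reduces but does not eliminate data errors.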
All of the foregoing discussion relates to the implications of measurement error that could easily arise in a given environment or database. However, when data come from multiple databases, they must be linked, and the methodology for performing data linkages in the absence of clear, unique identifiers is probabilistic in nature. Even in well-designed record linkage studies, such as those developed by the Census Bureau, automated matching is capable of reliably matching only about 75 percent of the people (although some appreciable fraction of the remainder are not matchable), and hand-matching of records is required to reduce the remaining number of unresolved cases.34 The difficulty of reliable matching, superimposed on measurement error, will inevitably produce much more substantial problems of false positives and false negatives than most analysts recognize.
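The following sketch shows the general shape of score-based record linkage without unique identifiers: compute a similarity score for a pair of records, accept clear matches, reject clear non-matches, and send the rest to hand-matching. It is not the Census Bureau's method. The field names, the weights (0.6 and 0.4), and the cutoffs (0.85 and 0.50) are assumptions chosen for illustration, and the string comparison uses Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b):
    """Weighted agreement on name and date of birth (weights are assumptions)."""
    name = similarity(rec_a["name"], rec_b["name"])
    dob = 1.0 if rec_a["dob"] == rec_b["dob"] else 0.0
    return 0.6 * name + 0.4 * dob


def link(rec_a, rec_b, accept=0.85, reject=0.50):
    """Classify a record pair as a match, a non-match, or a case for hand review."""
    score = match_score(rec_a, rec_b)
    if score >= accept:
        return "match"
    if score < reject:
        return "non-match"
    return "clerical review"


a = {"name": "John Smith", "dob": "1970-01-01"}
b = {"name": "Jon Smith", "dob": "1970-01-01"}   # spelling variant, same birth date
c = {"name": "John Smith", "dob": "1981-06-15"}  # same name, different birth date
d = {"name": "Xqz Wvy", "dob": "1999-12-31"}     # unrelated record
```

With these assumed settings, `a` and `b` are linked automatically, `a` and `c` fall into the middle band that needs hand-matching, and `a` and `d` are rejected. Moving either cutoff trades wrongly linked records against wrongly separated ones.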
DM Review Magazine, August 2004, available at http://www.dmreview.com/article_sub.cfm?articleId=1007211; W.W. Eckerson, “Data warehousing special report: Data quality and the bottom line,” Application Development Trends Magazine, May 1, 2002, available at http://www.adtmag.com/article.aspx?id=6303; Y. Wand and R. Wang, “Anchoring data quality dimensions in ontological foundations,” Communications of the ACM 39(11):86-95, November 1996; and R. Wang, H. Kon, and S. Madnick, “Data quality requirements analysis and modelling,” Ninth International Conference of Data Engineering, Vienna, Austria, 1993.
Data issues also arise as the result of combining databases: syntactic inconsistencies (one database records phone numbers in the form 202-555-1212 and another in the form 2025551212); semantic inconsistencies (weight measured in pounds vs. weight measured in kilograms); different provenance for different databases; inconsistent data fields for records contained in different databases on a given data subject; and lack of universal identifiers to specify data subjects.
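The syntactic and semantic inconsistencies in the examples above are the easiest of these problems to handle, because each can be resolved by converting values to a common form before records are compared. A minimal sketch follows; the function names and the unit labels `"kg"` and `"lb"` are assumptions made for the example.

```python
import re

KG_PER_LB = 0.45359237  # exact definition of the international pound


def normalize_phone(raw):
    """Strip everything except digits so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", raw)


def weight_in_kg(value, unit):
    """Convert a weight to kilograms; the unit must be recorded with the value."""
    if unit == "kg":
        return float(value)
    if unit == "lb":
        return float(value) * KG_PER_LB
    raise ValueError(f"unknown weight unit: {unit!r}")


same_number = normalize_phone("202-555-1212") == normalize_phone("2025551212")
weight = weight_in_kg(220, "lb")
```

The conversion works only when the unit is actually recorded. If one database stores a bare number with no unit, no amount of normalization can recover whether 220 means pounds or kilograms.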
Missing data are a major cause of reduction in data quality. In the situation in which network linkages are of interest and are directly represented in a database, the problem of missing data can sometimes be easier and sometimes more challenging than in the case of a rectangular file. A rectangular file usually consists of a list of individuals with their associated characteristics. In this situation, missing data can be of three general types: item nonresponse, unit nonresponse, and undercoverage. Item and unit nonresponse, while certainly problematic in the current context, are limited in impact and can sometimes be addressed using such techniques as imputation. Even undercoverage, while troubling, is at least limited to the data for the individual in question. (If such an individual is represented on another database to which one has access, merging and unduplicating operations can be helpful to identification, and estimates of the number of omissions can be developed using dual-systems estimation.)
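Dual-systems estimation, mentioned above, can be sketched in its simplest (Lincoln-Petersen) form: if two lists cover the same population independently, the share of one list that also appears on the other estimates that other list's coverage. The function name and the sample identifiers below are hypothetical, and the sketch assumes that the two lists are independent and that records are matched without error, assumptions that rarely hold exactly in practice.

```python
def dual_system_estimate(list_a, list_b):
    """Estimate total population size and the number missed by both lists.

    Uses N = (size of A) * (size of B) / (number on both lists).
    """
    a, b = set(list_a), set(list_b)
    matched = len(a & b)
    if matched == 0:
        raise ValueError("no overlap between lists; estimate is undefined")
    total = len(a) * len(b) / matched
    observed = len(a | b)
    return total, total - observed


# Hypothetical identifiers: 80 people on one list, 60 on the other, 40 on both.
list_a = range(0, 80)
list_b = range(40, 100)
total, missed = dual_system_estimate(list_a, list_b)
```

In this example the two lists together contain 100 distinct people, the estimated population is 120, and so roughly 20 people are estimated to be missing from both lists.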
On one hand, when the appropriate unit of analysis is networks of individuals (i.e., the individuals and their characteristics along with the various linkages between them are represented as being present or absent), the treatment of missing data can be easier when linkages from other individuals present in a database, such as phone calls, e-mails, or the joint issuance of plane tickets, can help inform the analyst of the existence of another individual for whom no direct information was collected.
On the other hand, treating missing data can also be a challenging problem. If the data for a person in a network is missed, not only is the information on that individual unavailable, but also the linkages between that person and others may be missing. This can have a substantial impact on the data for the missing individual, as well as the data for the other members of the group in the network and even the structure of the network, since in an extreme case it may be that the missing individual is the sole linkage between two otherwise separate groups. It is likely that existing missing data techniques can be adapted to provide some assistance in the less extreme cases, but at this point this is an area in which additional research may be warranted.
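The extreme case described above, in which the missing individual is the sole linkage between two otherwise separate groups, can be shown on a toy network. The adjacency data and the names are hypothetical; the sketch simply counts connected groups with and without one individual's records.

```python
def count_components(adjacency):
    """Count connected groups in an undirected network given as {node: set_of_neighbors}."""
    seen, count = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return count


def without(adjacency, missing):
    """Return the network as it would appear if one individual's data were missing."""
    return {
        node: neighbors - {missing}
        for node, neighbors in adjacency.items()
        if node != missing
    }


# X is the only link between the group {A, B} and the group {C, D}.
network = {
    "A": {"B", "X"},
    "B": {"A", "X"},
    "X": {"A", "B", "C"},
    "C": {"X", "D"},
    "D": {"C"},
}

groups_complete = count_components(network)
groups_missing_x = count_components(without(network, "X"))
```

With X present the analyst sees one connected group of five; with X missing the same data show two unrelated pairs, so both the individual and the structure of the network are lost.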
False positives and false negatives are in some sense complementary for any given database and given analytical approach. More precisely, for a given database and analytical approach, one can drive the rate of false positives to zero or the rate of false negatives to zero, but not simultaneously. Decreases in the false positive rate are inevitably accompanied by increases in the false negative rate and vice versa, although not necessarily in the same proportion. However, as the quality of the data is improved or if the classification technique is improved, it is possible to reduce both the false positive rate and the false negative rate, provided an accurate model for true positives and negatives is used.
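The complementarity described above can be seen by holding the data and the scoring fixed and moving only the threshold. In the sketch below the scores and the ground-truth labels are invented for illustration; in a real setting the true labels are, of course, not known in advance.

```python
def error_counts(scored, threshold):
    """Count false positives and false negatives at a given threshold.

    `scored` is a list of (score, is_actual_target) pairs; anyone scoring at
    or above the threshold is flagged.
    """
    false_pos = sum(1 for score, truth in scored if score >= threshold and not truth)
    false_neg = sum(1 for score, truth in scored if score < threshold and truth)
    return false_pos, false_neg


# Hypothetical scores with made-up ground truth (True = actual target).
scored = [
    (0.95, True), (0.80, False), (0.70, True), (0.60, False),
    (0.40, False), (0.30, True), (0.10, False),
]

# Sweep from "flag everyone" to "flag no one".
sweep = {t: error_counts(scored, t) for t in (0.0, 0.5, 0.75, 1.01)}
```

Flagging everyone gives four false positives and no false negatives; flagging no one gives no false positives and three false negatives; the intermediate thresholds give (2, 1) and (1, 2). Only better scores, that is, better data or a better model, would allow both counts to fall together.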
Both false positives and false negatives pose problems for counterterrorist efforts. In the case of false positives, a counterterrorism analyst searching for evidence of terrorist attack planning may obtain personal information on a number of individuals. All of these individuals surrender some privacy, and those who have not been involved in terrorist activity (the false positives) have had their privacy violated or their rights compromised despite the lack of such involvement. Moreover, the use of purloined identities (identity theft) has already enabled various kinds of fraud and evasion of law enforcement. If terrorists are able to assume other identities, that capability will not only enable them to evade some detection and obfuscate the data used in the models (that is, deliberately manipulate the system, resulting in the generation of false positives against innocent individuals) but also might result in extreme measures being taken against the innocent individuals whose identities have been stolen.
Every false positive also has an opportunity cost; that is, it is associated with a waste of precious investigative or analytical resources that are expended in the investigation of an innocent individual. In addition, false positives put pressure on officials to justify the expenditure of such resources, and such pressures may also lead to abuses against innocent individuals. From an operational standpoint, the key question is how many false alarms are acceptable. If one has infinite resources, it is easy to investigate every false alarm that may emerge from any system, no matter how poor its performance. But in the real world of constrained resources, it is necessary to balance the number of false alarms against the resources available to investigate them as well as the severity of the perceived threat. Furthermore, it is also important to consider other approaches that might be profitably applied to the problem, as well as other security issues in need of additional effort.
False negatives are also a problem and the nightmare of the intelligence analyst. A false negative is someone who should be under suspicion and is not. That is, the analyst simply misses the terrorist. From a political standpoint, the only truly acceptable number for false negatives is zero—but this political requirement belies the technical reality that the number of false negatives can never be zero. Moreover, identifying false negatives in any given instance may be problematic. In the case of the terrorist investigation, it is essentially impossible to know with certainty if a person is a false negative until he or she is known to have committed a terrorist act.
False positives and false negatives (and data quality, because it affects both false positives and false negatives) are important in a discussion of privacy because they are the language in which the trade-offs between privacy and other needs are often cast. One might argue that the consequences of a false negative (a terrorist plan is not detected and many people die) are in some sense much larger than the consequences of a false positive (an innocent person loses privacy or is detained). For this reason, many decision makers assert that it is better to be safe than sorry. But this argument is fallacious. There is no reason to expect that false negatives and false positives trade off against one another in a one-for-one manner. In practice, the trade-off will almost certainly entail one false negative against an enormous number of false positives, and a society that tolerates too much harm to innocent people based on a large number of false positives is no longer a society that respects civil liberties.
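The lopsided trade-off follows from simple arithmetic when the people being sought are a tiny fraction of the population examined. Every number in the sketch below is hypothetical and chosen only to show the shape of the calculation: a population of 300 million, 1,000 actual targets, a system that detects 99 percent of them, and a false positive rate of 0.1 percent.

```python
def expected_counts(population, true_targets, detection_rate, false_positive_rate):
    """Expected true positives, false negatives, and false positives."""
    innocents = population - true_targets
    true_pos = true_targets * detection_rate
    false_neg = true_targets - true_pos
    false_pos = innocents * false_positive_rate
    return true_pos, false_neg, false_pos


# All inputs are hypothetical illustration values, not estimates.
true_pos, false_neg, false_pos = expected_counts(
    population=300_000_000,
    true_targets=1_000,
    detection_rate=0.99,
    false_positive_rate=0.001,
)
false_alarms_per_hit = false_pos / true_pos
```

Even with performance this optimistic, the system would flag roughly 300,000 innocent people, about 300 for every actual target it finds, while still missing about 10 targets.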
Oversight and Prevention of Abuse
Administrators of government agencies face enormous challenges in ensuring that policies and practices established by higher authorities (e.g., Congress, the Executive Office of the President, the relevant agency secretary or director) are actually followed in the field by those who do the day-to-day work of the agency. In the counterterrorism context, one especially important oversight responsibility is to ensure that the policies and practices meant to protect citizen privacy are followed in a mission environment that is focused on ensuring transportation safety, protecting borders, and pursuing counterterrorism. Challenges in this domain arise not only from external pressures based on public concern over privacy but also from internal struggles about how to motivate high performance while adhering to legal requirements and staying within budget.
Preventing privacy abuses from occurring is particularly important in a counterterrorism context, since privacy abuses can erode support for efforts that might in fact have some effectiveness in or utility for the counterterrorist mission. In this context, abuse refers to practices that result in a dissemination of personally identifiable information and thereby violate promised, implied, or legally guaranteed confidentiality or civil liberties.35 This point implies that oversight must go beyond the enforcement of rules and procedures established to cover known and anticipated situations, to be concerned with unanticipated situations and circumstances.
Oversight can occur at the planning stage to approve intended operations, during execution to monitor performance, and retrospectively to assess previous performance so as to guide future improvements. Effective oversight may help to improve trust in government agencies and enhance compliance with stated policy.
THE NEED FOR A RATIONAL ASSESSMENT PROCESS
In the years since the September 11, 2001, attacks, the U.S. government has initiated a variety of information-based counterterrorist programs that involved data mining as an important component. It is fair to say that a number of these programs, including the Total Information Awareness program and the Computer-Assisted Passenger Prescreening System II (CAPPS II), generated significant controversy and did not meet the test of public acceptability, leaving aside issues of technical feasibility and effectiveness.
Such outcomes raise the question of whether the nature and character of the debate over these and similar programs could have been any different if policy makers had addressed in advance some of the difficult questions raised by a program. Although careful consideration of the privacy impact of new technologies is necessary even before a program seriously enters the research stage, it is interesting and important to consider questions in two categories: effectiveness and consistency with U.S. laws and values.
The threshold consideration of any privacy-sensitive technology is whether it is effective toward a clearly defined law enforcement or national security purpose. The question of effectiveness must be assessed through rigorous testing guided by scientific standards. Research on the question of how large-scale data analytical techniques, including data mining, could help the intelligence community identify potential terrorists is certainly a reasonable endeavor. Assuming that the initial scientific research justifies additional effort based on the scientific community’s standards of success, that work should continue, but it must be accompanied by a clear method for assessing the reliability of the results.
Even if a proposed technology is effective, it must also be consistent with existing U.S. law and democratic values. Addressing this issue may involve a two-part inquiry. One must assess whether the new technique and objective comply with existing law, yet the inquiry cannot end there. Inasmuch as some programs seek to enable the deployment of very large-scale data mining over a larger universe of data than the U.S. government has previously analyzed, the fact that a given program complies with existing law does not establish that such surveillance practice is consistent with democratic values.
A framework for decision making about information-based programs couched in terms of questions in these two categories is presented in Chapter 2.