This chapter builds on the use cases presented in Chapter 3 to describe in more general terms some ways in which bulk collection is used by the Intelligence Community (IC) and some of the challenges associated with alternatives that use targeted collection.
If past events become interesting in the present for understanding new events, such as the discovery of a nuclear weapons test by a previously non-nuclear nation, historical facts and the context they provide will be available for analysis only if they were previously collected. Sometimes review of a targeted collection (e.g., against leaders of the non-nuclear nation) may reveal information for a new purpose that was not in mind when the information had been collected (e.g., the intent of the nation’s leaders regarding nuclear weaponry). But sometimes useful information, such as the nexus of suppliers for the weapons technology, will be present only if previously there had been bulk collection. If it is possible to do targeted collection of similar events in the future, and they happen soon enough, then the past events might not be needed. If the past events are unique or if delay in obtaining results is unacceptable (perhaps because of press coverage or public demand), then the intelligence will not be as complete.
Chapter 3 presented several use cases illustrating the use of bulk collection for tactical intelligence. Tactical intelligence requires prompt attention to newly discovered targets and imminent threats. Collecting and saving information in bulk, without a specific set of targets, is the only way to have past information about a party on hand when that party becomes one of interest. Sometimes that information will be available because of a targeted collection in which certain uses were not yet realized. But sometimes information becomes interesting only because of new events or information, in which case previous bulk collection may be the only possible source. Targeted collection provides data only on present and future actions of parties of interest at the time of collection, but not on their past activities. For example, bulk collection may allow the identification of hostile actors and their associates because they made mistakes as their activities began, perhaps because of ineffective tradecraft or other casual interactions.
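The difference between the two collection modes for retrospective queries can be illustrated with a toy model. The sketch below is only an assumption-laden illustration, not a description of any actual system: the `Record` type, the party labels, the discovery time, and the `collect` and `history` helpers are all hypothetical. It retains a stream of call records either in bulk or under a discriminant that changes over time, and then asks what is known about a party's activity before that party became of interest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    time: int
    caller: str
    callee: str


def collect(stream, targets_at=None):
    """Return the records that are retained.

    targets_at(t) gives the set of targeted parties at time t.
    Passing None models bulk collection (everything is kept).
    """
    kept = []
    for rec in stream:
        if targets_at is None:
            kept.append(rec)
            continue
        active = targets_at(rec.time)
        if rec.caller in active or rec.callee in active:
            kept.append(rec)
    return kept


def history(store, party, before):
    """Records in the store that involve `party` and predate `before`."""
    return [r for r in store
            if r.time < before and party in (r.caller, r.callee)]


# Hypothetical stream: Y becomes a party of interest only at time 5.
stream = [
    Record(1, "A", "B"),
    Record(2, "Y", "Z"),
    Record(3, "X", "Y"),
    Record(4, "Y", "W"),
    Record(6, "Y", "Q"),
]
DISCOVERY_TIME = 5


def targets_at(t):
    # X is targeted throughout; Y is added once discovered.
    return {"X", "Y"} if t >= DISCOVERY_TIME else {"X"}


bulk_store = collect(stream)
targeted_store = collect(stream, targets_at)

bulk_history = history(bulk_store, "Y", DISCOVERY_TIME)
targeted_history = history(targeted_store, "Y", DISCOVERY_TIME)

print(len(bulk_history), "past records about Y from bulk collection")
print(len(targeted_history), "past record(s) about Y from targeted collection")
```

In this toy example the targeted store holds one past record about Y, and only because Y happened to call the existing target X. Y's earlier contacts with Z and W are recoverable only from the bulk store.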
Understanding the significance of past activities and their actors is a feature of all investigations, foreign and domestic. In contrast to domestic law enforcement, however, the world of intelligence analysis has many fewer tools available for investigation. In hostile foreign environments, personal interviews and observations and records review are much more limited. Accordingly, the role of bulk data as a way to understand the significance of past events is important, and the loss of this tool becomes more serious. Of course, bulk collection can also be useful in a domestic context.
Some kinds of targeted collection are focused on topics rather than people, and some targeting based on topics will be more specific than others. For example, a discriminant that collects all queries to Internet search engines that ask about “sarin” or “poison gas” will collect information about many people of no intelligence interest because only a handful of those making such searches will be of actual interest. However, other discriminants citing specific military code names might yield information about fewer people who are of no intelligence interest.
In strategic intelligence, information is gathered to build understanding about a topic (e.g., climate change, migration patterns), an entity or area (e.g., region, nation, subnational group), or set of activities and sometimes takes the form of statistics or trends. Some examples include the following:
• Collecting against national, military, or organizational decision makers.
• Monitoring many types of communication among the officers in an army to help understand its morale, quality of training, or location. If the collection is only against the communications of army personnel, this might be considered a targeted collection.
• Bulk collection of communications can reveal health care, electric power, or agricultural data that is not reported accurately, or at all, by a government.
• Sampling everyday communications in a region can provide insight into local sentiment about political trends that might lead to, for example, a government overthrow. For example, social networking communications during the Arab Spring reported unfolding events in real time.
Some of the data collected for strategic intelligence is analyzed using statistical techniques: rather than looking for specific persons or groups, the goal is to monitor trends or patterns in communications that might lead to intelligence insights. This is one application of analytical techniques that are known today as “big data analytics.”
Bulk collection is used to acquire reference data that supports other signals intelligence (SIGINT) collection or analysis. For example, analyzing communications data is greatly enhanced if analysts have “telephone directories” for organizations of intelligence interest—that is, a list of who’s who in the organization and their communications identifiers.
Another role for bulk collection is to guide targeted collection; the IC refers to this role as “SIGINT development.” For example, the decision about where to gather information can depend on knowing the target’s likely modes of communication. Because the target will not assist the collector in this decision, the collector will have to discover the likely modes of communication—perhaps by collecting information from all the modes of communication that the target might use—to understand their significance for national security priorities. Similarly, the National Security Agency (NSA) may have the resources to thoroughly monitor only one of several communication channels, and learning that some of them carry mostly communications of U.S. persons would make those channels less likely to be selected because they are not apt to be good sources of foreign intelligence. In addition to making NSA’s work more efficient, such decisions may reduce collection of information about people who are not of interest.
Does bulk collection overwhelm analysts with too much data, as is sometimes argued? The “needle in the haystack” metaphor is relevant here. If the needle is not found in the smaller haystack, there are two approaches—not mutually exclusive—that may result in success. One approach is to add more hay (because that additional material may contain the needle of interest). A second approach is to do a smarter search (because a smarter search may turn up a needle that was in the haystack all along), such as using techniques described by Cortes et al.1
Of course, if the needle is not in the smaller haystack, no amount of smarter searching will help. The use case category of alternate identifiers illuminates this problem. An analyst has determined that a new target is of interest, where “new” means that this target has not previously been explicitly targeted for collection. With luck, previous targeted collection may provide information on alternate identifiers that the new target has used.
Adding bulk data may help, because, by definition, bulk collection may contain alternate identifiers. But there is still no guarantee, because the bulk data might have been collected in the wrong location or through the wrong communications channel, etc. The alternate identifiers might still be missed, even though they exist.
Is a smarter search more or less likely than the use of bulk data to result in identification of the needle? Without details of the specific use case in question, this question cannot be answered in the abstract. In practice, analysts do not know if the haystack contains the needle without analyzing all the data—so they cannot know when to stop adding more hay.
Thus, collecting more data is necessary but it is not necessarily sufficient. It is true that more data may burden the analyst, while increasing the risk of intruding on parties that are not of interest, and may still fail to provide the data of interest, even when such data exists. Still, if the necessary data is not already available, collecting more is the only possible way to find the needle. This trade-off between too much data and finding the necessary information is inevitable. Although it can sometimes be reduced, it cannot be eliminated.
Below are some alternatives to present-day bulk collection practices that might mitigate some of the privacy and civil liberties concerns that such practices raise. Each also involves a variety of performance trade-offs when compared to bulk collection as currently handled.
1 C. Cortes, D. Pregibon, and C. Volinsky, Computational methods for dynamic graphs, Journal of Computational and Graphical Statistics 12(3):950-970, 2003.
• Federating business record databases by allowing them to be held by telecommunications carriers and allowing authorized queries by the U.S. government. This “federated storage” approach, which primarily applies to domestic collection, is discussed more fully in Chapter 5. By providing the U.S. government with certain access to business records stored by the telephone companies, this alternative retains the principal benefit of bulk collection by the U.S. government—access to telephone call history—but it is not as operationally effective as bulk collection. As detailed in Chapter 5, federation offers advantages for safeguarding privacy and enforcing policies. It also has disadvantages that include divergent incentives between the government and third parties, greater technical and organizational complexity, and potentially poorer performance.
• Bulk analysis. A class of alternatives extracts bulk SIGINT data from a source, applies “analysis algorithms” to all of it, saves the results of the algorithm, and then discards the SIGINT data. For example, one scheme might construct the contact network from call detail records (CDRs), store the entire network, and discard the CDRs. If a significant portion of the stored network pertains to nontargets, this technique should be viewed as a variant of bulk collection. Some proposals go even further and use algorithms to fuse data from several different intelligence sources into an annotated “hypergraph,” where the annotations retain information gleaned from intelligence data.2 These schemes are arguably more intrusive on privacy and civil liberties than bulk collection of raw SIGINT, because they analyze and store a multi-source picture of many people who are of no intelligence value. Moreover, automatic analysis seems unlikely to replace human analysis, although it may be useful as an augmentation to what humans do.
• Fast near-real-time targeting. Targeted collection is most effective when targets can be added to the discriminant quickly as they are identified in previous communications. If a call from a target X to an unknown Y is rapidly followed by a call from Y, the second call may be significant, possibly a message being passed on. If the first call quickly adds Y as a new target in the collection discriminant, the second call will be collected; otherwise, it will not, because both ends of the call are unknown identifiers. Collection software could be designed to chain targets this way only if such chaining is pre-approved. While this approach may collect a few more rapidly unfolding scenarios, it does not provide the complete view of past events afforded by bulk collection.
2 J.C. Smart, Georgetown University, briefing to the committee on September 9, 2014; see also AvesTerra Program, “The FOUR-Color Framework: A Reference Architecture for Extreme-Scale Information Sharing and Analysis: Overview,” V1.6, Georgetown University, October 2014, http://avesterra.georgetown.edu/sites/avesterra/files/4CF%20Overview%20%28V1.6%29.pdf.
• Big data analytics. It may be possible to use big data analytics to help narrow collection, even if the results from such analytical tools are not sufficiently precise to identify individual targets. That is, the government may be able to rely on the power of large private-sector databases, analytics, and machine learning to shape data collection constraints to data predicted to have high value. But even if the government collection becomes more narrowly targeted through the use of such analytic tools to develop the targeting, this is not necessarily a win for privacy. Depending on what aggregate data is used to determine the targeted government collection, use of such techniques may well raise privacy concerns. There will also be concerns that the methods used for targeting are akin to socially unacceptable profiling (e.g., targeting purchases of camping goods, males, ages 15 to 30). Thus, the use of big data analytics to provide better targeting may not be acceptable from a policy point of view, even if such techniques were to ultimately result in a narrower government collection.
• Cascaded filtering. Some of these methods may benefit from the use of cascaded filtering. One benefit of this approach is that it allows one to reduce the computing burden by first applying cheap tests, followed by more expensive filters only if earlier filters warrant. For example, if metadata indicates a civilian telephone call to a military unit under surveillance, speech recognition and subsequent semantic analysis might be applied to the voice signal, resulting in an ultimate collection decision. Richer targeting may require enhancing the ability of collection hardware and software to apply complex discriminants to real-time signals feeds. Another benefit is that it will tend to reduce the amount of data that ends up being collected through fast and early filtering.
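The cascaded filtering idea can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a description of any deployed collection system: the `cascade` helper, the watched number, the call records, and the keyword test standing in for speech recognition and semantic analysis are all hypothetical. It applies a cheap metadata test first and runs the expensive content test only on items that pass, counting how often each stage is evaluated.

```python
def cascade(items, stages):
    """Apply (name, predicate) stages in order to each item.

    An item is kept only if every stage passes. Later (more expensive)
    stages run only when all earlier stages have passed.
    Returns the kept items and the number of evaluations per stage.
    """
    counts = {name: 0 for name, _ in stages}
    kept = []
    for item in items:
        for name, predicate in stages:
            counts[name] += 1
            if not predicate(item):
                break  # stop early; later stages are never run
        else:
            kept.append(item)  # reached only if no stage failed
    return kept, counts


# Hypothetical discriminant: a number associated with a monitored unit.
WATCHED_NUMBERS = {"555-0100"}


def metadata_match(call):
    # Cheap test on metadata only.
    return call["to"] in WATCHED_NUMBERS


def content_match(call):
    # Stand-in for expensive speech recognition and semantic analysis.
    return "convoy" in call["transcript"].lower()


calls = [
    {"to": "555-0199", "transcript": "see you at dinner"},
    {"to": "555-0100", "transcript": "The convoy leaves at dawn"},
    {"to": "555-0100", "transcript": "happy birthday"},
    {"to": "555-0142", "transcript": "convoy of trucks on the highway"},
]

kept, counts = cascade(
    calls, [("metadata", metadata_match), ("content", content_match)]
)
print(counts)  # evaluations per stage
print(kept)    # calls that passed every stage
```

Here the expensive stage runs on only the two calls that pass the metadata test, and a single call is retained. The fourth call mentions the keyword but is never examined for content because it fails the cheap test, which shows both benefits noted above: less computation and less data collected.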
There is no doubt that bulk collection of SIGINT leaves many uncomfortable. Various courts have indeed questioned whether such collection is constitutional. This discomfort arises for many reasons. Some find the idea that the U.S. government collects vast amounts of communications signals information about unsuspected U.S. persons abhorrent to the very notion of democracy, while others object to this decision being made under the cover of secrecy.
This chapter has explored uses of bulk collection and technical alternatives the committee uncovered during its work that might mitigate some of the privacy and civil liberties concerns of that collection. None of these alternatives changes a fundamental point: A key value of bulk collection is its record of past SIGINT that may be relevant to subsequent investigations. If past events become interesting in the present because of new circumstances (such as the identification of a new target, indications that a nonnuclear nation is now pursuing the development of nuclear weapons, discovery that an individual is a terrorist, or emergence of new intelligence-gathering priorities), historical events and the data they provide will be available for analysis only if they were previously collected.
Conclusion 1. There is no software technique that will fully substitute for bulk collection where it is relied on to answer queries about the past after new targets become known.
This conclusion does not mean that all current bulk collection must continue. What it does mean is that a choice to eliminate all forms of bulk collection would have costs in intelligence capabilities. The analysis in this report provides a partial basis from which to make such policy choices.
Other groups, such as the President’s Review Group on Intelligence and Communications Technologies and the Privacy and Civil Liberties Oversight Board, have said that bulk collection of telephone metadata is not valuable enough to justify the loss in privacy.3 This is a policy judgment, which is not in conflict with the committee’s conclusion that there are no technical alternatives that can accomplish the same functions as bulk collection and serve as a complete substitute for it; there is no technological magic.
The committee was not asked to and did not consider whether the loss of effectiveness from reducing bulk collection would be too great, or whether the potential gain in privacy from adopting an alternative is worth the potential loss of intelligence information. Nor was it able to identify broad categories of use where substitution of alternatives might be possible or detect metrics that would inform such decisions. The Office of the Director of National Intelligence may wish to study these questions further.
Data retained from targeted SIGINT collection might be a partial substitute if the needed information was in fact collected. Bulk data held by other parties might substitute to some extent, but this relies on those parties retaining the information until it is needed, as well as the ability of intelligence agencies to collect or access it in an efficient and timely fashion. Other intelligence sources and methods might also be able to supply some of the lost information, but the committee was not charged to and did not investigate the full range of alternatives that intelligence agencies could bring to bear. Note that all of these alternatives may introduce distinct privacy and civil liberties concerns.
3 President’s Review Group on Intelligence and Communications Technologies, “Liberty and Security in a Changing World,” http://www.whitehouse.gov/sites/default/files/docs/2013-12-12_rg_final_report.pdf, and Privacy and Civil Liberties Oversight Board, Report on the Telephone Records Program Conducted under Section 215 of the USA PATRIOT Act and on the Operations of the Foreign Intelligence Surveillance Court, January 23, 2014, http://www.pclob.gov/SiteAssets/Pages/default/PCLOB-Report-on-the-Telephone-Records-Program.pdf.
Conclusion 1.1. Other sources of information might provide a partial substitute for bulk collection in some circumstances.
Because bulk collection cannot for practical reasons be truly comprehensive, it is itself inherently selective and unable to capture all relevant history. As a result, at least in some cases, it may be possible to develop techniques that would improve targeted collection to the point where it provides a viable substitute for bulk collection. Although such approaches might reduce the extent of collection against persons other than targets of interest, they might also introduce new privacy and civil liberties concerns about how such profiles are developed and used.
Rapidly updating discriminants of ongoing collections to include new targets as they are discovered will enable the collection of data that would otherwise be lost. If targeted collection can be done quickly and well enough, then there may be cases where information about past events becomes less important. But such an approach is not a substitute if the past events were unique or if the delay incurred in collecting the new information is unacceptable (because the threat is imminent or perhaps because of press or public demand for instant results).
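The effect of update speed on what is collected can be shown with a small simulation. This is a sketch under simplifying assumptions, not a model of any real collection system: the `collect_with_chaining` function, the `update_delay` parameter, the one-hop limit, and the call times are all hypothetical. A contact of a seed target is added to the discriminant a fixed delay after the contact is observed, and only when chaining has been pre-approved.

```python
def collect_with_chaining(calls, seeds, update_delay, chaining_approved=True):
    """Simulate targeted collection with one-hop target chaining.

    calls: list of (time, caller, callee) tuples sorted by time.
    seeds: identifiers targeted from the start.
    update_delay: time between observing a seed's new contact and that
        contact becoming active in the discriminant.
    Only direct contacts of seed targets are chained (an assumption
    made here to keep the discriminant from growing without bound).
    """
    targets = set(seeds)
    pending = []    # (activation_time, identifier) not yet active
    collected = []
    for time, caller, callee in calls:
        # Activate any pending identifiers whose delay has elapsed.
        still_pending = []
        for activation_time, ident in pending:
            if activation_time <= time:
                targets.add(ident)
            else:
                still_pending.append((activation_time, ident))
        pending = still_pending

        if caller in targets or callee in targets:
            collected.append((time, caller, callee))
            if chaining_approved:
                # Queue the unknown end of a call that involves a seed.
                if caller in seeds and callee not in targets:
                    pending.append((time + update_delay, callee))
                if callee in seeds and caller not in targets:
                    pending.append((time + update_delay, caller))
    return collected


# Target X calls unknown Y; Y quickly calls Z, then later calls W.
calls = [(0, "X", "Y"), (2, "Y", "Z"), (30, "Y", "W")]

fast = collect_with_chaining(calls, {"X"}, update_delay=1)
slow = collect_with_chaining(calls, {"X"}, update_delay=10)
unchained = collect_with_chaining(calls, {"X"}, update_delay=1,
                                  chaining_approved=False)

print(len(fast), len(slow), len(unchained))
```

With a fast update, the follow-on call from Y to Z is collected. With a slow update it is lost, because neither end is yet in the discriminant, and without chaining only the original call from X is collected. In no case does the simulation recover anything Y did before the first call, which is the limitation noted above.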
Conclusion 1.2. New approaches to targeting might improve the relevance of the collected information to future use and would rely on capabilities such as creating and using profiles of potentially relevant targets, possibly by using other sources of information.