The future of signals intelligence (SIGINT) may look very different from what we see today. Details of communications technologies are changing rapidly and are likely to continue to change. Encryption increasingly protects communications data both in transit and at rest. Private-sector business records may become fewer in number and less useful for intelligence purposes. On the other hand, more powerful computation can analyze raw data, such as speech and images, to extract useful intelligence information in real time. More data will certainly be available, driven by commercial, data-driven marketing as well as the spread of networked sensors of many sorts. Research to develop algorithms to proceed “from data to knowledge” may well benefit intelligence analysis.
The public policy landscape may also change. Public concerns with privacy, driven by the explosion of data as well as disclosures and misadventures in both public and private sectors, may lead to new legal frameworks. A shift from controlling collection to controlling usage is being discussed in policy circles. And, of course, the publicly acceptable trade-off between privacy and security might change immediately if the nation is attacked, if severe national security threats emerge, or if international security is further destabilized.
There are a number of trends, already under way today, that may have a deep impact on SIGINT. The net effect of these trends on SIGINT in the future is not clear today.
Declining costs of all elements of digital infrastructure continue to spur technology’s pervasive spread. Not long ago, “cloud computing” (the use of giant computer centers to assign, as needed, dozens to thousands of computers to a task) was new. Now we are experiencing the effects of “big data,” exploiting large amounts of data collected for business or scientific purposes to pursue new business opportunities or uncover new science. Just beginning is an “Internet of things,” deploying sensors of new types in many new places to control or optimize roadways, buses, trains, production lines, crop management, and countless other activities. And smart phones increasingly sense things of interest, notably location today, but also audio and video. While not all of this new data, and the communications that carry it, is likely to have intelligence value, some will surely offer new intelligence opportunities.
Increasingly, algorithms can digest raw signal data into much more useful forms. License plates can be located in images taken from roadways, and the license numbers recognized. Faces can be isolated in images captured by surveillance cameras, and databases of images can be queried to identify people.1 Audio signals of speech can be converted to text with enough reliability for dictation, making it easy to spot words of intelligence interest in communications. These algorithms all have a form that makes it technically easy to scale up the processing to handle many inputs: you can assign each of 50 computers to analyze one of 50 license plate images, or you can deploy the same 50 computers to recognize speech. Flexibly adapting and scaling these computations is easy.
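The scaling argument can be made concrete with a minimal sketch. It assumes each input can be analyzed independently of the others; `recognize_plate` is a hypothetical stand-in for a real recognizer, and a thread pool stands in for the 50 computers.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_plate(image):
    # Hypothetical stand-in for a real recognizer: here an "image" is just
    # a string, and "recognition" normalizes it.
    return image.strip().upper()


def analyze_all(images, workers=50):
    # Each input is independent, so the same function can be mapped over
    # any number of workers; results come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(recognize_plate, images))
```

Swapping `recognize_plate` for a speech recognizer would leave `analyze_all` unchanged, which is the sense in which the computation is easy to adapt and scale.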
In SIGINT applications, these algorithms can be applied either at the time of collection or later, on demand, for analyzing selected data. Today, NSA says it cannot collect any sizeable fraction of all global communications data, and it may likewise be that despite declining computing costs, NSA will not be able to automatically analyze more than a tiny bit. However, in many cases, the operators of the sensors will apply the algorithms to meet business needs, such as identifying license plates to bill parking charges. In these cases, the analyzed data may be available to NSA in the form of business records.
1 Closed-circuit TV surveillance, a form of bulk collection, has been practiced for years with relatively little complaint, despite its privacy invasion.
Business records can be very valuable for intelligence, and the proliferation of information technology (IT) in businesses of all sorts means that many more details of everyday life are recorded in this way. However, businesses that wish to minimize surveillance of their customers can arrange to reduce or eliminate the intelligence value of their records. For example, if a telephone company bills a flat monthly rate, it need not keep a record of each call, so no call data records would be available for intelligence purposes.2 Communications providers today are acutely aware of their customers’ concerns about surveillance,3 a fact that gives providers an additional incentive to refrain from keeping records that might be used against them.
Services that hold data for customers may find ways to encrypt the data with a key known only to the customer so as to evade surveillance. This technique could be used by email providers and social-networking services, among others. Some businesses are being established with exactly this objective. But today, the ability to examine customer data and use it for marketing purposes is an essential part of the hosting company’s business model, so customers are unlikely to have email that is both free and surveillance-proof.
Attempts to evade surveillance are unlikely to slow the big data trend. Businesses collect huge amounts of data not associated with individuals, which may not cause privacy concerns, and are sure to collect still more. Some of this data has a large public benefit, such as for weather prediction, crop management, or public health monitoring. Businesses may implement different levels of protection for different business records, so that customer-sensitive data is not commingled with data that has benign uses, both public and private.
One of the most imminent threats to SIGINT collection is the increasing use of strong encryption for signals in transmission. Increasingly, website servers are routinely encrypting traffic to and from the browser clients. To a lesser extent, data at rest is being encrypted. The cybersecurity vulnerabilities of the endpoints (browser, server) are becoming much greater than the vulnerability of the communications between them, a point suggesting that access may still be possible (although more difficult), even when transmission links are encrypted.
2 Other business records of such a company, however, linking customer name, address, and telephone number, might still be very valuable for intelligence purposes.
3 See, for example, Vodafone Group, “Law Enforcement Disclosure Report,” 2014, http://www.vodafone.com/content/sustainabilityreport/2014/index/operating_responsibly/privacy_and_security/law_enforcement.html, accessed January 16, 2015.
Although today’s common Internet services, such as VoIP (Voice over Internet Protocol), are not specifically designed to make surveillance difficult, they can be redesigned to evade surveillance. An important idea in many cases is “peer-to-peer” communications, which establishes an encrypted channel between two communicators without needing a third party to set up the communication. This technique means there is no third-party business that might hold business records or other data that could identify the communicators. It can be a bit tricky to design protocols that eliminate the third party, which often serves as a “directory” for a calling party to find the called party. And, of course, it is hard on the third-party business, which is trying to make money when callers communicate.
An unsurprising conclusion from the preceding subsections is that SIGINT techniques and operations will need to evolve as dynamically as the signals environment they monitor. As new protocols and businesses arise, collection methods and software must evolve. Adapting to traffic volume of different types is also essential, but it can be partially addressed by using techniques similar to those used in cloud computing.
Policy, law, and regulations will need to keep up with future SIGINT sources, which may evolve in dynamic and even surprising ways. Today, the laws governing collection of SIGINT are largely derived from legislation that applies to rotary dial telephones. Although policy and regulations have adapted to modern technologies, their pace of change does not match that of technology.
A striking change in the past few decades is the extent to which the private sector collects personal information. This trend had its origins in the 1960s with the rise of credit bureaus and has resulted in a cascade of law and regulation. In 1998, the Federal Trade Commission published a list of five core principles: (1) notice (give the consumer notice of data collection), (2) choice (give the consumer choice about whether the private data will be collected), (3) access (give the consumer the ability to access data about him- or herself), (4) integrity/security (the data collector must work to make sure data is correct and must give the consumer the right of redress if it is not), and (5) enforcement/redress. Today, these principles are known by the shortened phrase “Notice and Consent.”
The notice and consent framework is showing signs of stress. A recent President’s Council of Advisors on Science and Technology (PCAST) report ridiculed the turgid privacy terms that the public is typically asked to accept today: “Only in some fantasy world do users actually read these notices and understand their implications before clicking to indicate their consent.”4 Moreover, “consent” may imply that a person is volunteering personal data, which will mean it is afforded weaker Constitutional protection.
An alternative, which is starting to be discussed in policy circles, is to control use rather than collection of data. One variant calls for tagging all data with its origin and asking permission of its provider before using it. The data can be encrypted, so that only the provider’s grant of permission will reveal the actual data. Protecting use of data is not new; digital rights management (DRM) schemes encrypt songs and videos and only decrypt and play them when the player is given the key. Changing to protecting use of private data would be a major effort, requiring changes in laws and enforcement and, of course, a lot of software.5
Protecting use rather than limiting the collection of sensitive data would be consistent with maintaining the bulk collection of SIGINT. Perhaps if the public comes to embrace the philosophy and practice of usage controls for sensitive personal data, such as health and financial data, and comes to trust private sector IT implementations of the protection procedures, controlled-use approaches to intelligence information can find greater favor.
This section contains a collection of topics that came up during the committee’s deliberations that are potentially useful to the IC. None of these topics directly addresses ways to replace bulk collection with targeted collection. Because the main focus of this report was not to determine the full set of research areas to explore, this list is not meant to be complete.
Research is under way on all of the topics mentioned in this section. In many cases, NSA already implements some of the capabilities (e.g., certain kinds of query checking). The IC has research efforts under way in many of these areas as well. Of particular note is the Intelligence Advanced Research Projects Activity’s (IARPA’s) Security and Privacy Assurance Research (SPAR) program.6
4 President’s Council of Advisors on Science and Technology (PCAST), Big Data and Privacy: A Technological Perspective, May 1, 2014, http://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdf, p. xi.
5 Craig Mundie, Privacy pragmatism: Focus on data use, not data collection, Foreign Affairs, March/April 2014, http://www.foreignaffairs.com/articles/140741/craig-mundie/privacy-pragmatism.
This section does not delve into the many technologies that NSA and other IC organizations use to operate large, complex IT operations. It does not cover network security, operating-system security, physical security of computer systems, authentication of users, or a host of other areas that are part of making SIGINT technologies trustworthy. Research in these and other areas that affect the general state of complex IT will help the IC too.
The approaches described in Section 5.4 are not in widespread use, but they are not unexplored either. Their successful use will depend not only on choosing a sound architecture but also on developing a careful implementation: the trustworthiness of key components depends on keeping them simple to avoid mistakes that lead to vulnerabilities. And system-wide properties, such as security, will depend on many details, such as managing cryptographic keys properly, distributing them securely, changing them occasionally, ensuring that no single system administrator can penetrate security, and so on. These are not simple systems to engineer and operate.
Variants of the systems described in Chapter 5 often involve executing separate components on separate computers (often under control of separate organizations) and protecting the communications among the components. Techniques for doing this, usually based on encryption, are the topic of a research area dubbed “secure multi-party computation,” which was investigated by the IARPA SPAR program. For example, recent research shows how to protect data and communications in a three-part system: one issues queries, a second authorizes queries, and a third holds data and performs searches specified by authorized queries.7
6 Office of the Director of National Intelligence, Intelligence Advanced Research Projects Activity, “Security and Privacy Assurance Research (SPAR),” http://www.iarpa.gov/index.php/research-programs/spar, accessed January 16, 2015.
7 S. Jarecki, C. Jutla, H. Krawczyk, M. Rosu, and M. Steiner, Outsourced symmetric private information retrieval, pp. 875-888 in Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security, Association for Computing Machinery, New York, N.Y., 2013.
Although the focus of this report is signals intelligence that provides data about individual people and groups, signals intelligence can also be used to answer questions such as “What is the most common disease mentioned in Internet search requests from Yemen?” In these cases, the question is statistical, and protecting the identities of people cited in the source database can be done using techniques quite different from those prescribed for tracking specific threats. Although it might seem that statistical questions by their nature do not reveal identities, if the query specifies a sufficiently small group, identities can often be inferred using queries to different databases.
Collecting and publishing large data sets (“open data”) has spurred work on ways to benefit from the data without revealing personal information. One class of techniques attempts to “anonymize” (or de-identify) the data by transforming it to retain useful information but prevent identification of individuals. But it turns out that most anonymization schemes are easy to defeat.8 Effective anonymization remains an open problem.
Differential privacy is an active research area tackling the problem of enabling statistical queries from collections of data while preserving the privacy of individuals.9 The purpose is to permit useful information to be determined while not exposing data on specific individuals, including individuals not included in the data. This is done by adding probabilistically structured noise (small probabilistic changes to the data) to the responses to the queries. Although statistical databases have value in many domains, the type of queries relevant to this report need to produce information about individual items, so the techniques of differential privacy are not immediately applicable. There is also work on using differential privacy techniques with social networks.10
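One standard way of adding such noise is the Laplace mechanism, sketched below for a counting query. This is an illustration under stated assumptions, not a description of any deployed system: the function names and the record layout are hypothetical, and the sketch relies on the fact that adding or removing one person changes a count by at most 1, so the noise scale is 1/epsilon.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential samples with mean
    # `scale` follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng=None):
    """Answer "how many records satisfy predicate?" with Laplace noise.

    Smaller epsilon means more noise and stronger privacy. A counting
    query has sensitivity 1, so the noise scale is 1 / epsilon.
    """
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

Each individual answer is perturbed, but the noise averages out over large groups, which is why the approach suits statistical questions and not the lookup of individual items.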
8 See PCAST, Big Data, 2014, p. 38. A good view of anonymization and reidentification is in Sections 3 and 4 of “Opinion 05/2014 on Anonymisation Techniques” (European Commission, Article 29 Data Protection Working Party, adopted April 10, 2014, http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf).
9 Cynthia Dwork and Aaron Roth, The Algorithmic Foundations of Differential Privacy, Now Publishers, Boston, Mass., 2014.
10 C. Task and C. Clifton, A guide to differential privacy theory in social network analysis, pp. 411-417 in Proceedings of the 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), IEEE Computer Society, Washington, D.C.
Automatically restricting or approving a query requires automatically understanding it at a deeper level than syntax; this points to another advantage of automated decision making, namely, that it forces precision about what is being collected, which is useful both for analysts and for privacy. Automated understanding can be either static or dynamic. Static understanding tries to infer from a set of axioms whether a particular query or class of queries is allowed by policy, independently of the state of the database being queried. Taking pre-approved queries as axioms is a simple case of this. For example, “If X is an identifier for a reasonable and articulable suspicion (RAS) target, return all the identifiers that communicated with X in the last year.” The query is fixed except for some parameters, such as X in this example. It is a human decision to pre-approve it, and no automated reasoning is needed to apply it. A more powerful system could deduce that this query is OK from more general axioms such as, “X is associated with Y if X and Y communicated in the last year” and “Any identifier associated with a RAS target can be disclosed.”
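The deduction in that last example can be sketched as a tiny forward-chaining engine. This is a minimal illustration, assuming the two axioms are encoded as Horn rules; the predicate names (`communicated_last_year`, `associated`, `ras_target`, `disclosable`) and the tuple encoding are hypothetical. The query is described symbolically, with X and Y treated as placeholders, so no database is consulted.

```python
def match(pattern, fact, binding):
    # Try to extend `binding` so that `pattern` equals `fact`.
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    new = dict(binding)
    for p, f in zip(pattern[1:], fact[1:]):
        if p.startswith("?"):          # "?x" style terms are rule variables
            if new.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new


def satisfy(body, facts, binding):
    # Yield every binding that makes all patterns in `body` true.
    if not body:
        yield binding
        return
    for fact in facts:
        extended = match(body[0], fact, binding)
        if extended is not None:
            yield from satisfy(body[1:], facts, extended)


def closure(facts, rules):
    # Apply the rules repeatedly until no new fact can be derived.
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            for b in list(satisfy(body, facts, {})):
                derived = (head[0],) + tuple(b.get(t, t) for t in head[1:])
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts


# The two general axioms quoted in the text, as (head, body) rules.
AXIOMS = [
    (("associated", "?x", "?y"), [("communicated_last_year", "?x", "?y")]),
    (("disclosable", "?y"), [("associated", "?x", "?y"), ("ras_target", "?x")]),
]


def query_allowed(assumptions, returned_symbol):
    # The query is allowed if policy implies every result is disclosable.
    return ("disclosable", returned_symbol) in closure(assumptions, AXIOMS)
```

Given the assumptions `("ras_target", "X")` and `("communicated_last_year", "X", "Y")`, `query_allowed(..., "Y")` holds; drop the RAS assumption and it does not. The check depends only on the form of the query, never on stored data.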
Dynamic understanding looks at the actual results of a query, rather than considering all possible results, and asks whether policy allows them to be disclosed. A simple example is a kind of minimization: if a query returns a set of identifiers, any identifiers for U.S. persons should be removed from the results. Tags on the data that track its provenance or other properties can make dynamic understanding more powerful; the example uses a “U.S. person” tag that is added to database entries for an identifier when it is determined that the identifier refers to a U.S. person.11 This kind of understanding has been studied extensively in the context of information flow control, where the goal is to keep secrets from being disclosed to uncleared people, even if they are processed by untrusted programs. Decentralized information flow can very flexibly represent both degrees of secrecy and authorities for disclosure.12 Dynamic systems can also take account of context and history by applying the rules in force at the time a query is made, considering questions such as, Is there an emergency? Is the query part of a pattern known to need more scrutiny? Are the results being combined with other data to deanonymize the results in a way that is contrary to policy?
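The minimization example can be sketched in a few lines. The tag name `us_person` and the dictionary of tags are assumptions made for illustration; a real system would attach such labels inside the database itself.

```python
def minimize(results, tags):
    """Remove identifiers tagged as U.S. persons from actual query results.

    `results` is a list of identifiers returned by a query; `tags` maps an
    identifier to the set of tags recorded for it (both hypothetical).
    """
    return [
        identifier
        for identifier in results
        if "us_person" not in tags.get(identifier, frozenset())
    ]
```

Unlike the static check, this filter runs on each concrete result set, so its outcome can change as tags are added to the data.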
11 NSA has developed and donated to the Apache open-source community such a database. Accumulo is a scalable key/value store that allows “access labels” to be attached to each cell, which enables low-level query authorization checks (Apache Software Foundation, “Apache Accumulo,” https://accumulo.apache.org/, accessed January 16, 2015).
12 A.C. Myers and B. Liskov, A decentralized model for information flow control, pp. 129-142 in Proceedings of the 17th ACM Symposium on Operating System Principles (SOSP), 1997, Association for Computing Machinery, New York, N.Y.
There are many similarities between static and dynamic understanding and the thriving fields of static and dynamic program understanding, which suggests that there may be rich opportunities here. Not surprisingly, programs written in languages that are designed for automated understanding are much easier to understand. The same thing applies to query languages; indeed, the standard SQL database query language is designed for automatic understanding of queries, and database systems make heavy use of this to optimize their execution.13 A system that can understand a query can also rewrite it to add access control checks or calls to functions that encrypt and decrypt sensitive fields. For example, the CryptDB and Cipherbase systems do this (see the section on simulating homomorphic encryption).
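As a rough illustration of query rewriting, the sketch below wraps an analyst's SQL query in an outer query that adds an access control check. It is not how CryptDB or Cipherbase work; it is a deliberately simple stand-in that uses Python's built-in `sqlite3` module, a hypothetical `contacts` table, and the assumption that the inner query exposes a `us_person` column.

```python
import sqlite3


def rewrite_with_access_check(query):
    # Wrap the original query as a subquery; the outer WHERE clause is the
    # added check. Assumes the inner query selects a `us_person` column.
    inner = query.strip().rstrip(";")
    return f"SELECT * FROM ({inner}) WHERE us_person = 0"


def run_checked(conn, query):
    return conn.execute(rewrite_with_access_check(query)).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contacts (identifier TEXT, peer TEXT, us_person INTEGER)"
    )
    conn.executemany(
        "INSERT INTO contacts VALUES (?, ?, ?)",
        [("a", "x", 0), ("b", "x", 1), ("c", "y", 0)],
    )
    # The analyst's query matches both "a" and "b"; the rewritten query
    # drops "b" because it is tagged as a U.S. person.
    return run_checked(
        conn, "SELECT identifier, us_person FROM contacts WHERE peer = 'x'"
    )
```

A production rewriter would operate on the parsed query, not on its text, which is exactly where SQL's suitability for automatic understanding pays off.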
In most cases, a query is issued by an analyst, and the results are returned to the analyst, but there are also programs, called analytics by the IC, that issue queries and process the results themselves. Understanding these analytics programs requires combining an understanding of the queries with an understanding of the program that issues them. However, the issuing program can supplement the query itself with additional information that can be used in making a decision whether to approve the query. In other words, the program generating the query can be expected to do more work to support a decision whether to approve the query than might be practical for a human analyst.
The most likely approach to query approval is to proceed from easy cases to harder ones, reserving for human attention those that cannot be automated.
Auditing access to bulk data is essential for ensuring compliance with the rules. The first step is to ensure that every query is permanently recorded in a log; isolation makes it feasible to do this by technical means. Then the log must be reviewed. Doing this manually is feasible and, indeed, this is NSA’s current practice, but it is expensive and not transparent—outsiders must rely on the agency’s assurance that it is being done properly, because the queries are usually highly classified.
13 S. Chaudhuri, An overview of query optimization in relational systems, pp. 34-43 in Proceedings of the ACM Symposium on Principles of Database Systems, 1998, Association for Computing Machinery, New York, N.Y.
In analogous fashion, operating systems and networking equipment write voluminous logs of security-relevant events, and review of such logs usually requires a great deal of manual effort.14 It should be possible to develop much better tools that automatically review the log, highlight suspicious patterns, filter out the great majority of queries that do not raise any issues or that were vetted by automatic query approval, and present the remainder for manual review.
14 See, for example, USENIX, WASL ’08, First USENIX Workshop on the Analysis of Systems Logs, “Workshop Sessions,” https://www.usenix.org/legacy/event/wasl08/tech/ (last changed January 26, 2009) and, to infer causality, see M. Chow, D. Meisner, J. Flinn, D. Peek, and T.F. Wenisch, “The mystery machine: End-to-end performance analysis of large-scale Internet services,” pp. 217-231 in 11th USENIX Symposium on Operating Systems Design and Implementation, October 2014, https://www.usenix.org/conference/osdi14/technical-sessions/presentation/chow.
Automating the audit or overview process has much in common with automating query authorization. Because there is a lot of audit data, machine learning can also play a role, although it would probably require introducing a lot of synthetic misbehavior (that is, deliberately introduced misbehavior) to get enough true positives into the training set.
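A rule-based version of such log review might look like the following sketch. The log entry fields (`query_id`, `analyst`, `auto_approved`, `justification`) and the two heuristics are hypothetical, chosen only to show the shape of the filtering step: skip what was vetted automatically, skip what raises no issue, and pass the rest to a human reviewer with the reasons attached.

```python
from collections import Counter


def triage(log, max_queries_per_analyst=3):
    """Return (query_id, reasons) pairs that need manual review."""
    per_analyst = Counter(entry["analyst"] for entry in log)
    needs_review = []
    for entry in log:
        if entry.get("auto_approved"):
            continue  # already vetted by automatic query approval
        reasons = []
        if per_analyst[entry["analyst"]] > max_queries_per_analyst:
            reasons.append("high query volume")
        if not entry.get("justification"):
            reasons.append("missing justification")
        if reasons:  # entries with no flagged issue are filtered out
            needs_review.append((entry["query_id"], reasons))
    return needs_review
```

A learned classifier could replace the hand-written heuristics, with the caveat noted above about needing enough examples of misbehavior to train on.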
If it were possible to express the laws, policies, and rules governing SIGINT in a machine-understandable form, it might be possible to generate tools that do automatic approval and oversight for a portion of the queries. One approach would be to develop formal policy languages to represent the precise meanings of policies. These could serve as an intermediate language between the output of lawyers and the technological control of processes and computer programs. The process of formulating them would likely reveal many anomalies, ranging from ambiguities to misinterpretations to inconsistencies. NSA reported that it had looked into deontic description logic for this purpose. To the extent that the field of computational law thrives, its results would be relevant. Projects around this area would seem to be an ideal unclassified research topic, appropriate for an interdisciplinary team of experts in law, policy, and computer science.
Basing automation on formal definitions has another advantage: if the rules must change, the automation will change as a direct consequence. Formal rule expressions will change, due to new laws, policies, and regulations, or in order to adapt to emergencies. Of course, the rule expressions and the process for changing them must be controlled carefully to ensure compliance with the governing documents.
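A minimal sketch of this idea keeps the rule as data and leaves the checking code untouched when the rule changes. The `Policy` fields and the request layout are invented for illustration and do not reflect any actual rule set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    # Hypothetical machine-readable rule parameters.
    max_lookback_days: int
    require_ras_target: bool


def approve(policy, request):
    # The checker only interprets the policy; it hard-codes no rule values.
    if policy.require_ras_target and not request["seed_is_ras_target"]:
        return False
    return request["lookback_days"] <= policy.max_lookback_days
```

Replacing `Policy(max_lookback_days=365, require_ras_target=True)` with a stricter instance changes which requests are approved without any change to `approve`, which is why the process for changing the policy object itself must be tightly controlled.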
Advances in this area might lead outside organizations to gain confidence that the rules for handling personal data are being followed. If these techniques are not being used today, how might they be applied to reassure overseers that what they see is a full report of what happened? Can zero-knowledge proofs be used in some way to reassure members of the public who wish to monitor operations? Are there general ways of scanning logs and reliably picking out transactions that need to be looked at? Cybersecurity defense tries to do this, but even with specialized logs, it is an incompletely solved problem.
Simpler or more understandable rules are desired, but it is not obvious how to create them, nor how to avoid the processes that produced the existing ones. This sort of research could be done independently of the Intelligence Community (IC), at the risk of irrelevance. Some kind of cooperative research leading to unclassified results would be best.
A seemingly simple, but fundamental, problem is the lack of a common lexicon to define the technology relevant to intelligence as it is controlled by law, regulation, and policy directives. This deficiency came up in many discussions, both inside and outside the IC. The absence of such a consensus on terminology may well explain some of the misunderstandings that exist between the IC, its overseers, and the public. If it is not addressed, this confusion will likely continue and impede the effective development of a policy and legal framework. More generally, without consensus on terminology, the development of effective regulation of technology will be a continuing problem that also impedes building the necessary public trust in the IC. An interdisciplinary effort to develop common terminology for modern and emerging technology would be worthwhile.
Policy decisions might be informed by quantifying the benefits of various intelligence-gathering techniques as well as their risks. Anecdotal testimony that cites specific events doubtless understates the value of intelligence and also gives the misleading impression that the value of intelligence is in finding the single piece of evidence that thwarts an attack.15 More often, small bits of information from different sources contribute to an actionable finding.
The IT systems that produce and record intelligence, especially those used by analysts to bring together the bits and pieces gathered throughout an investigation, can track the provenance of the information. Can investigations, once completed, be mined to estimate the value of different sources of intelligence?
15 See, for example, Privacy and Civil Liberties Oversight Board, Report on the Telephone Records Program Conducted under Section 215 of the USA PATRIOT Act and on the Operations of the Foreign Intelligence Surveillance Court, January 23, 2014, http://www.pclob.gov/SiteAssets/Pages/default/PCLOB-Report-on-the-Telephone-Records-Program.pdf, p. 145 ff.
Statistical results and machine learning have a role to play. Statistical techniques allow one to estimate the value of different sources of data. Learning techniques potentially allow one to extract more information (better results) from collected data, or more confidently ignore data that do not have to be collected. As the number of data sources grows, especially from public information, it may become important to routinely assess the value of these sources. And such analysis would provide, at least in classified form to the IC, an answer to a question that Presidential Policy Directive 28,16 in effect, asks, “How valuable is bulk collection of domestic telephone metadata?”
As the committee did its work, it noted an evolving relationship between NSA and the academic research community on problems such as those addressed in this report. For many years, NSA has formally funded unclassified, basic research in mathematics (algebra, number theory, discrete mathematics, probability, and statistics) in the United States in its Mathematical Sciences Program.17 According to NSA, this program was initiated in response to a need to support mathematics research in the United States and recognizes the benefits both to academia and NSA accruing through a vigorous relationship with the academic community.
Further developing a similarly vigorous and sustained relationship between NSA and the academic computer science community could have similar benefits. Mechanisms would have to be found to translate classified problems into unclassified ones that researchers could tackle without being subject to security review—doing so would improve the coupling of the research mission with the operational mission. The IC has two mechanisms that help bridge the classification “chasm.” IARPA funds research relevant to the IC, some of which targets the future of SIGINT. Many of its research programs are predominantly unclassified, and it is working to develop unclassified “proxies” for research problems of more direct applicability to the IC. The firm In-Q-Tel acts somewhat like a venture fund for innovative technology potentially useful to the IC, supporting commercially viable technologies that might serve IC needs. Both appear to be effective, but their structures and policies are not primarily intended to build long-term and vigorous relationships with academic disciplines. Bridging the chasm would benefit both communities.
16 The White House, Presidential Policy Directive/PPD-28, “Signals Intelligence Activities,” Office of the Press Secretary, January 17, 2014, http://www.whitehouse.gov/sites/default/files/docs/2014sigint_mem_ppd_rel.pdf.
Even in a report that was intended to address primarily technical issues, the committee found it necessary to engage with a number of legal and policy issues. This point underscores the fact that it is often important for technical research to be conducted in an interdisciplinary manner cognizant of policy issues. But interdisciplinary work integrating technology, law, and policy remains the exception rather than the rule in academic research institutions. Much more of this type of collaboration is required if law and policy are to effectively manage the challenges being generated by rapidly changing technologies.
The committee has identified a number of technical areas where advances could help the IC address privacy concerns about SIGINT data. None of these topics directly addresses ways to replace bulk collection with targeted collection; rather, they represent alternatives for better targeting collection or better controls on usage after collection. Because determining the full set of research areas to explore was not the main focus of this report, this list is not meant to be complete, and it does not delve into most of the technologies that the IC uses for its IT capabilities. Nor are the topics necessarily new; research may be under way, the IC may already have implemented some of the capabilities, and the IC has research efforts under way in many of these areas as well.
Conclusion 3. Research and development can help in developing software intended to (1) enhance the effectiveness of targeted collection and (2) improve automated usage controls.
Conclusion 3.1. The use of targeted collection can be improved by enriching and streamlining methods for determining and deploying new targets rapidly and using automated processing and/or streamlined approval procedures.
Analytics, such as “big data analytics,” may help narrow collection, even if they are not sufficiently precise to identify individual targets. If the government is constrained by privacy concerns to collect less data, it may nevertheless be able to use the power of large private-sector databases, analytics, and machine learning to shape the constraints to collect only data predicted to have high value. New uses by the government of private-sector databases would also raise new privacy and civil liberties questions.
Some of these methods may require a great deal of computing, so that filters should be cascaded to first apply cheap tests, followed by more expensive filters only if earlier filters warrant. For example, if metadata indicates a civilian telephone call to a military unit under surveillance, speech recognition and subsequent semantic analysis might be applied to the voice signal, resulting in an ultimate collection decision. Richer targeting may require enhancing the ability of collection hardware and software to apply complex discriminants to real-time signals feeds.
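The cascade can be sketched as an ordered list of tests that stops at the first failure, so expensive tests run only on items that pass the cheap ones. Everything concrete here is hypothetical: the watched number, the record fields, and the keyword match that stands in for speech recognition and semantic analysis.

```python
def cascade(item, filters):
    """Apply (name, test) filters in order, cheapest first.

    Returns (True, None) if every filter passes, otherwise
    (False, name_of_first_failing_filter).
    """
    for name, test in filters:
        if not test(item):
            return False, name
    return True, None


WATCHED_NUMBERS = {"555-0100"}  # hypothetical numbers under surveillance
expensive_calls = []            # records which items reached the costly test


def metadata_test(call):
    # Cheap: a set lookup on metadata only.
    return call["callee"] in WATCHED_NUMBERS


def content_test(call):
    # Expensive stand-in: in practice this would be speech recognition
    # followed by semantic analysis of the voice signal.
    expensive_calls.append(call["id"])
    return "keyword" in call["transcript"]


FILTERS = [("metadata", metadata_test), ("content", content_test)]
```

Ordering the list by cost is the whole design: a call that fails `metadata_test` never appears in `expensive_calls`.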
Conclusion 3.2. More powerful automation could improve the precision, robustness, efficiency, and transparency of the controls, while also reducing the burden of controls on analysts.
Some of the necessary technologies exist today, although they may need further development for use in intelligence applications; others will require research and development work. This approach and others for privacy protection of data held by the private sector can be exploited by the IC. Research could also advance the ability to systematically encode laws, regulations, and policies in a machine-processable form that would directly configure the rule automation.