Heather Adkins, Google, Inc.
Heather Adkins, director of information security and privacy at Google, Inc., shared an on-the-ground perspective on recoverability based on her nearly 20 years of experience as a security practitioner. At Google, Adkins's work focuses primarily on remediation. The chief information officer, information technology administrators, and site reliability engineers (SREs) who handle tactical and strategic recovery when necessary would each offer a different perspective on these issues, she noted.
Weaknesses of Current Responses
Adkins began with the caveat that she, and the vast majority of her in-the-trenches peers, do not closely follow, and in some cases are unfamiliar with, National Institute of Standards and Technology (NIST) Special Publication (SP) 800-184, Guide for Cybersecurity Event Recovery.1 Although she believes the standard contains excellent advice for an ideal world, she asserted that the day-to-day reality is very different from the world the standard seems to assume and that implementing the guidelines would be impractical for already overworked security teams.
1 M. Bartock et al., Guide for Cybersecurity Event Recovery, NIST SP 800-184, National Institute of Standards and Technology, Gaithersburg, MD, 2016, https://csrc.nist.gov/publications/detail/sp/800-184/final.
In addition to being unable to fully implement ideal practices, Adkins posited that on-the-ground recovery teams are hampered by the fact that several common recovery strategies are actually weaker than one would like to think. For example, even seasoned first responders are known to run antivirus software to detect and remove problems, but these programs cannot detect or address advanced threats. In addition, if malicious files are spotted and removed in response, attackers can notice the change and switch tactics instead of retreating. Rebuilding a system also may be ineffective, because it merely drives an attacker to wreak havoc in another part of the network.
Quarantining a compromised network is another common recovery strategy, but it can backfire if someone unwittingly plugs it back into the active network again—a scenario Adkins has seen in practice. Similarly, shutting down a virtual machine that has been hacked seems reasonable, but avenues to these machines can remain open, so they also are at risk of being restarted. In scenarios involving a password breach, while password resetting can stop active hijacking, it might be too late to stop any processes an attacker may have set in motion while they had access.
Adkins described a well-known 2011 compromise of the systems that hosted Linux kernel development2 in which hackers eventually damaged the system so badly that it was beyond recovery and had to be rebuilt. This example is particularly worrisome because Linux code is pervasive in billions of devices. Police were able to arrest one of the hackers, a breakthrough that offered some insights into the scope of the compromise and methods used, but the technical response and recovery effort was enormous, including the painstaking verification of 15 million lines of code to identify any remaining back doors.3 In the context of this example, Adkins agreed with Butler Lampson’s (Microsoft Research) earlier assertion that selective undo is appealing (and could in theory have offered a way to streamline the recovery), but she noted that the approach requires knowing important details about how and when the compromise happened, which is not always possible.
2 See U.S. Attorney’s Office, Northern District of California, “Florida Computer Programmer Arrested for Hacking,” Press Release, September 1, 2016, https://www.justice.gov/usao-ndca/pr/florida-computer-programmer-arrested-hacking, and the indictment in the U.S. District Court for the Northern District of California, “Case 1:16-mj-03168-BLG Document 1, Entered on FLSD Docket 08/30/2016,” August 30, 2016, available at https://regmedia.co.uk/2016/09/02/linux_hack.pdf.
3 Even such painstaking verification is likely to be ineffective against sophisticated adversaries.
Adkins then shared an example of a “perfect recovery,” in which a small organization, after being breached in an espionage-related attack by a nation-state actor, removed and replaced all of its hardware, reinstalled all of its software, and verified all of its data by hand. The effort, although considered a successful example of high-assurance recovery, was enormously expensive and required shutting down the organization for a month. The example demonstrates that a “perfect recovery” is possible in certain contexts, when all the sources of the attack or breach are known and understood, but expensive and impractical in most scenarios.
Letting Go of Trust
Recovery is generally taken to mean returning systems to their normal, trustworthy state. However, Adkins posited that depending on trustworthiness in certain types of systems may be misplaced, and possibly unnecessary. With regard to Google’s services, she said, employees work on what are called Zero-Trust networks. Adkins asked, if such a network were compromised, would it even matter? If you can remove the expectation of perfect trust in certain contexts, you can also remove the need for perfect recovery.
Further removing the emphasis on trust—for example, in hardware, operating systems, or other computing layers—could allow for a greater concentration of effort regarding trust and verification in the application and cryptographic code. In the context of applications, she noted that machine learning offers important avenues for audits, detection, and recovery. She argued that research into machine learning capabilities could, for example, lead to machine-assisted alteration-discovery systems capable of flagging suspicious code far more quickly than humans can. This could perhaps be used in breach recovery as a triage approach, by helping to concentrate code review efforts.
Enabling Recovery at Scale
Google is also reducing the emphasis on trusting personnel, such as SREs, in order to support operational reliability at scale. In practice, that means quickly building systems, recovering them, and migrating them when necessary, while also conducting regular, automated tests, including for catastrophic events. These practices are detailed in Google’s book, Site Reliability Engineering.4 It is also helpful to carefully consider metrics, to determine what exactly is useful to measure, Adkins said. For example, many organizations measure how often their backup processes complete successfully as a sign of a healthy business continuity program. Google, by contrast, looks at how often the actual recovery of data (e.g., read off of tape) from backup media completes successfully, she noted.
4 B. Beyer, ed., Site Reliability Engineering, O’Reilly Media, Inc., Sebastopol, CA, 2016.
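The distinction Adkins draws between measuring backup completion and measuring restore success can be illustrated with a small sketch. This is illustrative only (the function names are invented, and a real restore test would read data back from actual backup media): the point is that the check exercises the restore path, not merely the backup job.

```python
import hashlib
from pathlib import Path


def verify_restore(source: Path, restored: Path) -> bool:
    """Compare a restored file against the original via SHA-256.

    A backup job "completing" says nothing about whether the data can
    actually be read back; this check exercises the restore path itself.
    """
    def digest(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            # Hash in 1 MiB chunks to avoid loading large files into memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    return digest(source) == digest(restored)
```

A recovery-oriented metric would then be the fraction of sampled restores for which this kind of verification succeeds, rather than the fraction of backup jobs that report success.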
Adkins noted that moving computing into the cloud, and away from end-user systems, can also help. Recent attacks like Spectre and Meltdown did not cause Google too many problems because the company had been able to seamlessly migrate customer workloads to patched systems.
Adkins pointed to a vulnerability in the Internet of Things (IoT) as another argument for machine-learning approaches. She noted that many of the millions of Internet-connected devices embed a program known as dnsmasq, which is maintained by independent developers instead of a large company. If one of those devices were to be hacked, it could be difficult to determine whether backdoors were added, or how many, or how they might affect devices using dnsmasq. She went on to observe that recovering from such a catastrophic breach could be made easier with alteration analysis, where code could be machine analyzed to detect changes or unwelcome additions at speeds far faster than humans could achieve.
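The alteration analysis Adkins describes would rely on machine assistance at scale. As a much simpler baseline, a manifest of cryptographic digests taken from a known-good source tree can already flag which files changed. The sketch below is illustrative only (the function names are invented); it detects that a file changed, not whether the change is malicious, which is the harder problem machine-assisted approaches aim at.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict:
    """Record a SHA-256 digest for every file under a known-good tree."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = str(path.relative_to(root))
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def find_alterations(baseline: dict, current: dict):
    """Flag files modified, added, or removed relative to the baseline."""
    modified = [p for p in baseline if p in current and current[p] != baseline[p]]
    added = [p for p in current if p not in baseline]
    removed = [p for p in baseline if p not in current]
    return modified, added, removed
```

For a codebase like dnsmasq, such a diff would narrow review to the handful of altered files; judging whether an alteration is a backdoor is where Adkins suggests machine learning could help.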
An important report that influenced Adkins’s security philosophy is James Anderson’s 1972 Computer Security Technology Planning Study (known colloquially as “The Anderson Report”), in which Anderson asserts that computer security is not strong enough to prevent malicious events.5 Adkins said that in a follow-up piece written a few years later, Anderson posited that the best way to improve security is to pair computer systems with humans via auditing and logging, practices that are still in use today.6
Looking at today’s capabilities and the shifts that have occurred since the 1980s when the paradigm emphasizing reference monitors and a trusted computing base was established, Adkins proposed a new solution: teaching the machines not only to read logs and discover breaches, but to actually learn to defend themselves. While this solution may still be a long way off, Adkins pointed to recent encouraging examples, such as the 2016 Defense Advanced Research Projects Agency (DARPA) Cyber Grand Challenge, which proved that computers could be designed to defend themselves, and even counterattack.7 Adkins also pointed out that by 2050, the scale of operations will make pairing humans and machines infeasible. In addition, the scale of cyberspace will continue to grow. Right now it is global, but computers are already on Mars, and people may eventually be, too. She suggested that when computers can defend themselves, these automated defenses and counterattacks could usher in a new paradigm in which recovery is seen not as exceptional or catastrophic but simply as routine.
5 J.P. Anderson, Computer Security Technology Planning Study, Volume 1, 1972, http://seclab.cs.ucdavis.edu/projects/history/papers/ande72a.pdf.
6 J.P. Anderson, Computer Security Threat Monitoring and Surveillance, 1980, http://seclab.cs.ucdavis.edu/projects/history/papers/ande80.pdf.
Adkins wrapped up with three main points. First, standards exist, but enforcing recovery practices solely through compliance to standards is impractical due to operational barriers. Second, we need to move toward automated detection and change the way we build trust in networks and systems. Finally, we need to teach machines to recover successfully themselves.
William Sanders, University of Illinois, Urbana-Champaign, asked if there were recovery practices that worked for both accidental failures and malicious attacks. Adkins said dual-use practices are ideal, but there are cases where this is not feasible. For example, an insider attacker with unknown motivation can pose a particularly significant and unique type of threat, since the attacker could have high-level capabilities such as system administrator privileges. Sanders also wondered if recovery practices were different depending on the security goal: If confidentiality were an issue, instead of availability, would the protocols be different? In response, Adkins noted that it is very difficult to tease out the borders between confidentiality and availability, which are both indicators of reliability.
Bob Blakley, Citigroup, noted that recovery can be extremely difficult, even when the root cause is not a malicious actor. For example, he pointed to the experience of a company that purchased security software that became corrupted and in turn corrupted some of the business systems it was supposed to protect. Recovery was arduous and took weeks. Building on this point, Peter Swire, Georgia Institute of Technology, asked if attacks perpetrated by nation-states fundamentally would require different recovery strategies than other attacks or accidental failures. Adkins replied that the recovery process is largely the same in most cases regardless of the root cause of the problem. SREs and security team members must work together to lessen the impact, determine the motivation, and act accordingly; while the playbook might be somewhat dynamic, the most important rubric is the end result—restoring a secure and functioning system.
Steven Lipner, SAFECode, pointed out that James Anderson believed that it was possible to eventually build a sufficiently strong system, although he recognized that there would still be the problem of holding people accountable for malicious insider attacks. Machine learning could help with accountability for malicious action by insiders, but the need in all cases is to determine what the bad actors did, and how long ago they did it.
Adkins replied that recovery requires forensic analysis with a focus on such questions as the following: What was the root cause? What did the attacker do? and Where did they go? It is very difficult to obtain all of that information, but usually a rough picture emerges. To do these analyses well today, they must be conducted manually and are time consuming, requiring a conservative approach in which every aspect of the system is assumed to be compromised. The day-to-day reality is that most companies are able to
devote only limited resources to these issues—for example, paying for outside recovery services that operate for a limited time, meaning their search for data, story-building, and recovery could be rushed, and they would not get the whole story.
Paul Kocher, an independent researcher, noted that the number of places where data is stored is quickly growing, and he asked if that growth complicates the task of inventorying, analyzing, and restoring data. Adkins replied that at Google, infrastructure is relatively monolithic and centralized, and so despite the company’s size, its operators have a fairly good handle on what is happening at all times. Outside of that specific context, though, Adkins said the answer is more complicated. For example, her own computer has more than 20 pieces of firmware, about which she knows virtually nothing, yet she must trust them to keep her data safe. Broadly speaking, she said, using devices from multiple manufacturers, with no standardization and an overreliance on trust, creates a situation that is nearly impossible to defend. But, she suggested, increasing standardization and narrowing the realm of trust to the smallest area possible can offer a way forward.
David Edelman, Citigroup
David Edelman, a director at Citigroup with in-depth incident response experience dealing with large-scale financial systems, offered a unique perspective on how recoverability is viewed and handled in the financial sector.
Edelman explained that resilience is considered a top priority in the U.S. financial sector, both because banks and retail brokerages recognize its importance to their business and because government regulations require it. Rather than a static target or a far-off goal, he said resilience in the financial sector is a real, everyday requirement best thought of as a constant process with ever-changing adversaries.
A Shared System for Recoverability
Edelman noted that to be able to give customers an additional level of confidence in the ability of their banks to provide services even in the face of very sophisticated malicious activity, the financial industry created Sheltered Harbor.8 While banks already capture every transaction every day within their own systems, Sheltered Harbor is an initiative undertaken by the sector as a whole that provides an additional layer of protection and enables rapid recovery and reconstitution of customer account status if needed.
Maintaining Data Confidentiality and Integrity
Edelman described how data confidentiality and integrity are central to Sheltered Harbor and other recoverability measures in the financial sector. The financial industry handles a huge volume of personally identifiable information. Sheltered Harbor employs sophisticated encryption and routine testing and verification to help ensure security. The system, he said, is built to be agile enough to use whatever is the most secure storage process available, whether that means publishing data onto tapes and storing it in a vault or—in the future—using a secure cloud-based system. If Sheltered Harbor were to move into the cloud, Edelman noted, the process would accommodate the necessary integrity verifications required before and after the encrypted data is uploaded. Edelman also noted that the processes and systems are designed so that data is easily auditable.
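The before-and-after integrity verification Edelman describes can be sketched with a digest check. This is illustrative only (Sheltered Harbor's actual mechanisms are not described at this level of detail, and a real system would encrypt the payload before archiving it): a digest is computed before the data is written to the vault or uploaded, then re-computed and compared after retrieval.

```python
import hashlib


def seal(data: bytes):
    """Pair an archive payload with its SHA-256 digest before storage.

    In a real vaulting scheme the payload would be encrypted first; the
    digest alone illustrates the before/after integrity check.
    """
    return data, hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_digest: str) -> bool:
    """Re-compute the digest after retrieval and compare to the recorded one."""
    return hashlib.sha256(data).hexdigest() == expected_digest
```

Any tampering or corruption of the stored payload changes the digest, so the mismatch is caught before the data is used to reconstitute customer accounts.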
Eric Grosse, an independent consultant, asked if a catastrophic event at one bank would actually affect all banks because money moves so frequently between them. Building on this point, Fred Schneider, Cornell University, asked whether the approach changes if a problem is not detected on the day it occurs but several days later. Edelman clarified that resetting a single bank’s accounts to their state from the day before is a simple task that is inherent in most banks’ business-as-usual processes and that pre-dates Sheltered Harbor. In other failure scenarios, individual transactions can be reversed or restored based on when the problem occurred. He went on to explain that reversing transactions and restoring from archives from multiple days back is also possible, but in these cases, which are exceedingly rare, other banks can be affected, necessitating a wider recovery effort. Edelman noted that this process is the subject of many industry-wide exercises.
Bob Blakley, Citigroup, added that several additional processes also come into play in these more complex scenarios. For example, banks conduct daily clearances and settlements among themselves, so that their records are reconciled. This, combined with a typical 3-day settlement window for securities transactions, means that a bad event at one institution would not necessarily ripple across the entire industry.
William Sanders, University of Illinois, Urbana-Champaign, asked Edelman how often extreme recovery operations are needed. Edelman said that full recovery events are extremely rare. Even during the 2008 financial crisis, in which institutions were acquired by other institutions on very short notice, the acquiring institutions had access to the failed institutions’ processing systems, data, and infrastructure.
Stephen Schmidt, Amazon Web Services
Stephen Schmidt, vice president of security engineering and chief information security officer at Amazon Web Services, led an open-ended session in which he invited workshop attendees to ask questions about Amazon’s resilience practices. To frame the discussion, he opened by sharing Amazon’s operational definition of resilience: “The ability of our customers to continue their business.” In this context, he said, achieving resilience requires good technology, smart decisions, and an openness to constant learning.
Response to the 2017 S3 Outage
Bob Blakley, Citigroup, asked Schmidt to describe the steps taken after the Amazon S3 storage service outage, which lasted a few hours in February 2017. Schmidt first noted that one of his 2017 security goals, set before the outage, was to drastically reduce the number of humans with access to certain data, a deliberately difficult goal that would force an increased reliance on automation. This goal proved prescient, he observed, as the outage was the result of a typographical error by an authorized administrator executing approved commands.
He went on to explain that that single error led to a cascading failure, which propelled Amazon to make several changes. For example, he said, access and command execution at that level now require two-person authentication and authorization; in addition, commands that are most critical are now automated. Schmidt said that in order to figure out what those most critical commands are, Amazon used an existing internal security tool that maintains a record of commands executed by administrators of its internal systems; managers can use this tool to see exactly what actions employees are taking.
The security team was notified shortly after the outage began, Schmidt said. The response team first checked to see who was logged in to the affected area. Seeing approved users, they concluded that it was not a deliberate attack, and began manually analyzing the most recent commands executed. They identified the typo and started the recovery process. Because S3 is so large, recovery was complicated and involved several different service tiers.
The experience, Schmidt said, highlighted the importance of what he called good throttle design. Throttling (limiting the number of requests or changes to a system that can be submitted in a given timeframe) was a focus for improvement following the outage. Amazon has included throttling in its system APIs (application programming interfaces) for many years; however, in this case certain S3 subsystems did not have throttles in place. Following the outage, Amazon also updated its capacity management tools, enhancing safeguards to prevent large-scale capacity shifts that would bring subsystems below their minimum capacity levels.
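Throttling of the kind Schmidt describes is commonly implemented as a token bucket: requests spend tokens, tokens refill at a fixed rate, and requests beyond the bucket's capacity are rejected or delayed. The sketch below is a generic illustration of that pattern, not Amazon's actual mechanism.

```python
import time


class TokenBucket:
    """Token-bucket throttle: sustain `rate` requests per second,
    with bursts up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True if a request may proceed, consuming one token."""
        now = time.monotonic()
        # Refill tokens for the time elapsed, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Placing such a limiter in front of capacity-changing commands would have bounded how fast a mistyped command could remove capacity, which is the failure mode the S3 outage exposed.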
The Role of the Cloud
Peter Swire, Georgia Institute of Technology, asked if Schmidt agreed with earlier speaker Heather Adkins, Google, Inc., that moving to the cloud improves prospects for recoverability. Schmidt said he agreed, adding that in his view there are very few workloads that would not work better in the cloud. Building on this, Swire wondered, if everything moves to the cloud, should we worry about a “monoculture” developing? Schmidt replied that monocultures in APIs should be considered differently from monocultures in the implementations underneath the APIs. Shared APIs provide a common interface format that allows for readily switching between services and moving workloads if needed. With respect to underlying implementations, there are other trade-offs to consider. Having only one version of a software stack becomes problematic if, for instance, a bug is found or introduced. Amazon keeps a small set of slightly different software versions on hand in case of such a problem, Schmidt said.
In terms of elements farther down in the stack, Schmidt added that Amazon, like many large companies, builds its own routers and switches, instead of relying on commercial products, to help ensure that the hardware works as expected and errors can be quickly caught. In fact, for items not created in-house, such as certain chips, the company intentionally uses a diversity of vendors to reduce reliance on one specific vendor or product, thus reducing their exposure if one becomes compromised.
In response to a question from Richard Danzig, Johns Hopkins University, Schmidt also pointed to the intelligence community’s evaluation of Amazon’s commercial cloud services (C2S), the results of which, he said, suggested that C2S engenders greater confidence than legacy data centers in terms of security. The root of C2S’s strength, he explained, is its high level of visibility, meaning that changes are more easily auditable and made with more accountability than with more traditional architectures. He observed that this reflects a sea change in behavior and attitudes, away from the idea that a computer or system is “owned” because data is stored there, and toward an embrace of shared resources and cloud computing.
John Manferdelli, Northeastern University, asked what has surprised Schmidt in his experience. He answered that there have been both good surprises—for example, how resilient systems can be when human errors occur or how easy it has been to scale up well-built systems—and bad ones—for example, when naïve user or developer choices have led to bad situations.
At Amazon, he said, the goal is to break problems into small enough pieces such that teams can remain small and move quickly. To maintain connections and communication among these thousands of independent teams and projects, the principal engineering staff periodically review services and make changes or recommendations, he said. He noted that problems early in design cannot be easily undone, yet frequently cause headaches down the line. As a result, Schmidt said he has learned how crucial it is to involve his team in the process as early as possible.
He also emphasized the value of security teams being seen as partners instead of police or compliance officers. Beginning early to address security in partnership with the project team means his team can say yes to (reasonable) requests while minimizing risks, rather than merely swooping in and applying the brakes after a product is farther along in the pipeline. Steven Lipner, SAFECode, built on this point, noting that even if a “no” is required, saying it early enough in the process can lead to a better collaboration. Schmidt responded that rather than saying “no,” he prefers for engineers to search for a clearer understanding of customer needs and then determine the best way to meet them.
Eric Grosse, an independent consultant, pointed to key lessons, such as the value of conducting forensics and the importance of keeping comprehensive logs, and asked Schmidt what additional practices he would consider useful industry-wide. Schmidt replied that backing up data is not as helpful as one might think, because backups inevitably decay and are not active portraits of your systems. Instead, he said, recovery systems should be built to be as active as possible. Fred Schneider, Cornell University, asked how Amazon could handle an attack that is not immediately detected if they have active backups, but not active replicas. Schmidt pointed to the importance of monitoring mechanisms and throttling as one important means to respond to anomalies. He added that it is also helpful to build systems with multiple components, or “shards,” that can maintain system functionality while some components are inactive, which is, in a sense, a system of active replicas. Schmidt said that this system of active replicas that are constantly monitored was the most effective way they had found for cloud systems, although it might not be appropriate for every business case.
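Schmidt's point that a sharded system with active replicas can keep functioning while some components are inactive can be captured in a toy availability check. This is a hypothetical model for illustration, not AWS's implementation: the service stays up as long as every shard retains at least one active replica.

```python
def service_available(replicas_by_shard: dict) -> bool:
    """Return True if every shard has at least one active replica.

    Keys are shard names; values are lists of booleans, one per
    replica, where True means the replica is currently active.
    """
    return all(any(replicas) for replicas in replicas_by_shard.values())
```

Under this model, individual replica failures (or deliberate quarantines during an incident) degrade redundancy without taking the service down, which is the sense in which Schmidt describes constantly monitored active replicas as a recovery mechanism.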
Resiliency in Practice
Susan Landau, Tufts University, asked Schmidt to expand on what Amazon means by resiliency and how that plays out in practice. Schmidt reiterated that Amazon sees its customer’s needs as paramount, so resiliency for them means ensuring that their customers can do what they want to do when they want to do it, such as making sure that every API is responsive to customer requests. In the event of a problem, he said, Amazon will first try to contain it to prevent it from spreading further. The focus then turns to pinpointing exactly what went wrong and when, and finding alternate routes to deliver the data a customer needs. Once the shape of the problem becomes clear, he said, responders try to extrapolate what the hackers might do next, what they are capable of, and what problems have not been detected yet. The systems are returned to a safe starting point from before the attack, and an effort is made to remove the access doors the hackers used to get inside. Attribution (figuring out who is responsible for the problem), Schmidt said, is on the back burner during these steps, because knowing the individual responsible is generally not helpful for the recovery process.
Of course, Schmidt said, totally destroying the compromised system and building it back up from the ground would often be ideal, but that is not possible in every case or for every system component. When a customer has access to a cloud environment, the best course of action is to replace all the hardware involved. But other software components, such as a MySQL database, cannot be so easily disposed of and reconstituted.
Landau asked if the diverse needs of Amazon’s customer base affect the company’s resilience strategy. In terms of availability, Schmidt replied that Amazon is able to keep capacity fairly level across the board, so that a problem in one area does not usually affect availability in the other areas. In terms of customer priorities, Schmidt said Amazon recognizes that different customers have different requirements and preferences relevant to resiliency, and it builds and tests to those requirements. He also emphasized that the company takes regular, rigorous testing very seriously and said its operational teams are trained with frequent recovery drills.
Danzig asked how Amazon handles the unique security needs of one particular customer—the U.S. government—and whether the company’s relationship with the government affects its resiliency practices. Schmidt said Amazon appreciates any intelligence it acquires that is specific and timely enough to be actionable. However, noting his own background in the Federal Bureau of Investigation before coming to Amazon, Schmidt said he understands that it can be challenging to strike the right balance between protecting hard-won intelligence and translating it into sharable, actionable information.
Scanning the Horizon
Manferdelli asked Schmidt to comment on promising new technologies or, on the flip side, inventive new types of threats that he anticipates in the future. Schmidt replied that breakthroughs in silicon involving instrumentation of certain behaviors in the silicon itself could be valuable for understanding processor use. As for future threats, Schmidt noted that it remains difficult to anticipate complex side-channel attacks in advance, although collecting more data on normal operations could help.
Timothy E. Roxey, North American Electric Reliability Corporation
Tim Roxey is the chief security officer and chief special operations officer for the North American Electric Reliability Corporation (NERC), a nonprofit international regulatory authority that works to assure the security and reliability of North America’s bulk power system. He explored the challenges of identifying errors and deliberate attacks in the power grid’s cyber infrastructure and the mechanisms used to provide continued service in the face of them.
Roxey said the average time between an attacker gaining access to an industrial control system component and an owner noticing the intrusion is 100 days—far longer than in other sectors, such as military organizations or highly sophisticated companies, as discussed earlier in the workshop. In some cases, he said, the intrusion can be discovered years later, or never discovered at all. In this context, the expectation in terms of recovery is not to get back to normal, but rather to create a “new normal” and adjust to it.
Roxey explained that in North America there are roughly 4,500 power companies serving about 365 million people running power through upwards of 58,000 substations. NERC runs a large-scale analytic program to identify intrusions across this vast network—24 hours a day, 7 days a week. On a daily basis, Roxey said, this program runs about 90 algorithms against several billion bytes of data collected each day and over 200 terabytes of data collected over the year. He said that this monitoring is done within a multi-layered framework for power grid cybersecurity and recovery whose key components include standards, maintenance, and information sharing.
The bottom, largest layer of the framework is the standards that govern how power systems meet various regulations, both online and in physical structures, to perform essential tasks. Because the electric grid is considered critical infrastructure, he noted, standards are explicitly designed and regulated to facilitate recovery, and companies are regularly audited to ensure compliance.
The regulatory component of NERC defines which providers and which aspects of the cyber infrastructure are critical infrastructure and oversees them accordingly. Roxey explained that roughly 2,000 of the 4,500 power companies are subject to these regulations, while the remaining 2,500 are considered sufficiently small or disconnected from each other that a security incident at one of them would not cause a cascading outage.
Standards are developed by using an open standards-setting process and many power companies are required to implement these standards. Roxey asserted that the power grid is the single most regulated sector, with regulations in this space ranging from personnel and training requirements to physical security perimeter requirements to recovery plans to vulnerability assessments, among many others. Although these are not best practices or guidelines, the regulations are enforceable, he emphasized. Because they are developed openly, however, attackers are well aware of them—as well as what they do not cover.
Another layer, maintenance, involves testing the system against potential threats in order to understand weaknesses and shore them up. Every 2 years, for example, NERC’s GridEx exercises simulate a major grid challenge to which companies and government agencies attempt to respond. The lessons learned from these large-scale exercises can sometimes inspire new standards, but they always lead to a better understanding, Roxey said, underscoring the industry’s commitment to continuous improvement. Roxey encouraged participants to study and learn from the exercises, which are publicly available. He noted that the next exercise will encompass the effects of a major outage on sectors such as water, communications, and finance, in recognition of the critical role of electricity in maintaining these sectors and our way of life.
Grid systems face numerous threats all the time. Roxey noted that electric system operators have been dealing with distributed denial of service (DDoS) attacks for nearly two decades, and in recent years these attacks have reached astounding levels. In addition, he said, what are essentially accidental denial of service attacks can occur as a result of routine maintenance and upgrades—for example, when a network engineer attempts to use a new tool to scan a decades-old server that is unable to handle the load.
Such experiences underscore the need for advanced communication and preparation for routine operations, in addition to effective responses to deliberate attacks.
Apart from the systems supporting the financial sector, the electric grid is probably the most complex system on the planet, Roxey said. It is also regulated or influenced by an astounding number of organizations, such as the National Security Council, the Nuclear Industry Assessment Committee, the Electric Power Research Institute, the Federal Energy Regulatory Commission, and NERC itself, to name a few. He explained that each such organization has different concepts of resilience, recovery, and robustness, which are important to recognize and reconcile in order to create standards that satisfy many complex stakeholder needs.
Discussing these needs and crafting regulations to meet them requires actionable information that can be shared. For example, cryptocurrency mining is a current challenge, but simply naming the problem is not enough. To address it, actionable indicators of cryptocurrency mining must be shared, so that companies know what to look for. Public-private collaboration for information sharing would improve security for everyone, Roxey said.
Roxey concluded by reiterating the implications of the typically long lag time between when an intrusion occurs in the grid and when it is noticed. The bottom line is that there is a good chance that at any given time, there is an existing intrusion that has not yet been detected. Good regulations to support testing, detection capabilities, and incident response plans can help to minimize the fallout.
Peter Swire, Georgia Institute of Technology, asked if Roxey could explain why it has taken so long to restore power in Puerto Rico after Hurricane Maria. Roxey noted a number of likely contributing factors, including the state of the infrastructure prior to the emergency, challenges in communications, and difficulties with logistics and access to affected areas.
William Sanders, University of Illinois, Urbana-Champaign, asked about enforcement of regulations. Roxey replied that audits can result in stiff fines. But perhaps more important, he said, is the establishment of a shared vocabulary in terms of compliance and protocols. In fact, problems today are found and reported much more often by companies themselves than by audits, a reversal from the situation just 10 years ago, he said, when penalties were far more common. Roxey said that compliance is a higher priority for companies today, and having common vocabulary and increased information sharing gives them the tools they need to comply.
Stephen A. Cauffman and Matthew Barrett, National Institute of Standards and Technology
Stephen Cauffman and Matthew Barrett, both of the National Institute of Standards and Technology (NIST), offered a perspective on measuring community resilience and shared NIST’s framework for improving critical infrastructure for cybersecurity.
Cauffman, a research engineer in the Community Resilience Group (CRG),9 began by noting that communities are complex “systems of systems,” making them somewhat new territory for NIST’s resilience work, which previously focused on individual buildings and their response to extreme events. By contrast, CRG focuses on the full recovery—social, economic, and infrastructural—of an entire community in the face of a natural or manmade disaster. In fact, Cauffman said, measuring the social and economic needs of a community after a disaster is essential to the process, because those needs drive the decisions and priorities relevant to creating more resilient buildings and infrastructure.
NIST is devoted to measurement science, but community resilience can be difficult to measure, a challenge compounded by the complexity of communities and all their dimensions. This led CRG to devise a new, six-step process for assessing community resilience.
The first step is to reach out to community stakeholders, including elected officials, social and economic leaders, building owners, engineers, and architects. Second, those leaders work together to gain a wider understanding of a community’s social dimensions, its built environment, and how the two are intertwined. Third, these stakeholders define their community’s priorities, establish performance goals, and anticipate building performance in a disaster.
The final three steps are to evaluate gaps in building performance and identify solutions, finalize an official plan, and implement and maintain it. After a disaster, infrastructure is carefully assessed to determine how well it met recovery expectations, in both the short and long term. Gaps indicate where further improvements and solutions are needed. Together, these tangible steps link a community’s physical infrastructure to its socioeconomic environment.
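The six steps described above can be sketched as a simple ordered checklist. The following is a hypothetical illustration, not a NIST artifact; the step wording paraphrases the summary above, and the function name is invented for this sketch:

```python
# Hypothetical sketch of the six-step community resilience planning
# process as an ordered checklist; step names paraphrase the text above.
RESILIENCE_PLANNING_STEPS = [
    "Form a collaborative planning team of community stakeholders",
    "Understand the community's social dimensions and built environment",
    "Define priorities, performance goals, and anticipated building performance",
    "Evaluate gaps in building performance and identify solutions",
    "Finalize an official community resilience plan",
    "Implement and maintain the plan",
]


def next_incomplete_step(completed):
    """Return the index of the first step not yet completed, or None
    if the community has worked through all six steps."""
    for i in range(len(RESILIENCE_PLANNING_STEPS)):
        if i not in completed:
            return i
    return None
```

The ordering matters: later steps (gap evaluation, planning, implementation) depend on the shared understanding and priorities established in the earlier ones, which is why the sketch models the process as a strict sequence.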
9 NIST’s Community Resilience Group conducts research and works with stakeholders to help improve community resilience for all hazards. They focus on the built environment, including infrastructure systems. Resilience efforts in other domains may provide useful lessons for recovery and resilience related to cybersecurity, communications, and information technology-related challenges.
NIST is strongly committed to this holistic, community-wide approach to disaster planning, Cauffman said. Although the impacts of disasters are often most visible in infrastructure, the social and economic consequences pose real challenges as well. Cauffman noted that Puerto Rico after Hurricane Maria is an unfortunate example. The island’s poor infrastructure could not survive the storm, and that failure led to closed schools and health-care disruptions, exacerbating the damage to people’s lives.
Cauffman explained that studying the socioeconomic impacts of compromised infrastructure on a community can lead to better solutions than assessing an area one building at a time. However, he said, few tools currently exist to measure or model resilience holistically. The six-step process Cauffman described is only the first layer in creating decision-support tools to aid communities. NIST is also working on creating large-scale tools, in partnership with Colorado State University’s Center for Risk-Based Community Resilience Planning. By breaking down such a complex problem piece by piece, NIST hopes to create a framework for communities to be resilient and able to recover from disasters.
NIST Framework for Improving Critical Infrastructure for Cybersecurity
Barrett leads NIST’s Framework for Improving Critical Infrastructure for Cybersecurity (the “Framework”). For the second half of the NIST presentation, Barrett explained the framework and its role in advancing resilience.
Cybersecurity is ultimately measured by outcomes, not procedures, Barrett stressed. Although individual companies may determine how much or how little security they need, NIST’s role is to specify the level needed to enable resilience. The framework, he said, is not a list of rules but rather a decision-making tool for every company or community to think about how it will define and accomplish the five core functions of cybersecurity: identify, protect, detect, respond, and recover. Although cybersecurity resilience is multidisciplinary and can be very complex, Barrett said those five terms are valuable because they create a shared language for talking about resilience.
In response to a question from Mary Ellen Zurko, MIT Lincoln Laboratory, Barrett summarized the process used to develop the framework. His team started with a blank slate and was given 1 year to create a path to reducing the cybersecurity risk to critical infrastructure. Following the standard NIST model, the team openly engaged a variety of stakeholders, requested information and feedback, held workshops, and released a draft for public comment. This open dialogue and transparent process was instrumental in developing a framework that would be applicable across a variety of environments and work in most contexts, Barrett said.
The framework is currently being used to improve the overall regulatory ecosystem, including that for the electric sector, by making it more efficient and precise, Barrett said. Focusing on cybersecurity outcomes, not procedures, can ensure that everyone’s goals are aligned. Building on the points raised earlier in the workshop with regard to the electric power sector, Barrett noted that compliance is a much more integral part of companies’ operations today. Ensuring that regulations and the steps required to comply with them become ever more efficient and laser-focused on essential security can improve compliance as well as resilience and recovery, he suggested.
Weighing Resilience Investments and Trade-Offs
Kicking off the discussion, Bob Blakley, Citigroup, asked about the economic trade-offs involved in building in resilience strategies ahead of time versus paying for recovery after an event, and how the relative likelihood of different events in different communities weighs into that calculation. Cauffman stressed that mitigation—investing in resilience ahead of time—makes recovery faster, easier, and far more cost-efficient. In fact, he said, a recent report by the National Institute of Building Sciences showed that $1 in mitigation spending saves roughly $6 in future recovery expenses.10 For rarer events, or for events where recovery costs are lower, other strategies could include mutual aid partnerships, in which organizations or regions pledge to help each other recover, such as by sending in crews to restring downed power lines. There are trade-offs to weigh in any case, and the community tools NIST is developing include an economic component to help stakeholders assign an actual dollar value to their options.
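The mitigation return Cauffman cited reduces to simple arithmetic. A minimal sketch follows, using the roughly 6:1 ratio from the cited report; the function name and default parameter are hypothetical, introduced only for illustration:

```python
def expected_recovery_savings(mitigation_spend, benefit_cost_ratio=6.0):
    """Estimate avoided future recovery costs from up-front mitigation
    spending, using the roughly 6:1 benefit-cost ratio reported by the
    National Institute of Building Sciences."""
    return mitigation_spend * benefit_cost_ratio


# Example: $1 million in mitigation avoids roughly $6 million in
# future recovery expenses under the report's average ratio.
savings = expected_recovery_savings(1_000_000)
```

The ratio is an average across hazards and building types, so in practice the `benefit_cost_ratio` argument would be replaced with a community- and hazard-specific estimate of the kind NIST's economic tools aim to provide.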
Cauffman also suggested realigning how we think about the “resilience curve” to better reflect the real-world trajectories of communities after disasters. The resilience curve is traditionally conceived as a straight line that dips down when an event occurs and then gradually returns to the starting state. In reality, Cauffman said, communities are most often starting out on either an upward or downward trajectory, and that initial trajectory affects what happens after the event. A community on the upswing, for example, is more likely to bounce back quickly and can sometimes even rebuild itself in ways that make it stronger and more resilient after the event. But communities already in decline will be much worse off after a disaster, and are unlikely to get back to the way they were before the event.
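Cauffman's point about initial trajectory can be illustrated with a toy resilience-curve model. All names and parameter values here are hypothetical, chosen only to show how a pre-event trend changes the post-event outcome:

```python
import math


def resilience_curve(t, event_time=10.0, trend=0.0,
                     shock=40.0, recovery_rate=0.2, baseline=100.0):
    """Toy model of community functionality over time: a linear
    pre-event trend, a sudden drop of size `shock` at the event,
    and an exponential return toward the (still-trending) baseline."""
    level = baseline + trend * t
    if t < event_time:
        return level
    # The unrecovered portion of the shock decays exponentially
    # after the event.
    remaining = shock * math.exp(-recovery_rate * (t - event_time))
    return level - remaining


# A community on an upward trend (trend > 0) eventually ends up above
# its pre-event level; a declining community (trend < 0) ends up below
# it, even after the shock itself has been fully absorbed.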
Cauffman pointed to New Orleans’ experience after Hurricane Katrina as a mix of these two extremes; while the city has never recovered its original state, and many residents moved away permanently, the rebuilding effort attracted heavy investment, creating a revitalized New Orleans that is very different from the city that existed before the storm.
10 National Institute of Building Sciences, “National Institute of Building Sciences Issues New Report on the Value of Mitigation,” press release, January 11, 2018, http://www.nibs.org/news/381874/National-Institute-of-Building-Sciences-Issues-New-Report-on-the-Value-of-Mitigation.htm.
Barrett added that the NIST framework enables users to define and balance the “before” (identify, protect) and “after” (detect, respond, recover) periods of a disaster to inform a resource investment strategy.
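Barrett's before/after framing of the five functions can be sketched as a small lookup table. This grouping paraphrases his remarks; the structure and function name are hypothetical, not part of the NIST framework itself:

```python
# Hypothetical grouping of the framework's five functions by disaster
# phase, following Barrett's "before"/"after" framing.
FRAMEWORK_FUNCTIONS = {
    "before": ("identify", "protect"),
    "after": ("detect", "respond", "recover"),
}


def phase_of(function_name):
    """Return which phase of a disaster a framework function
    belongs to, per the before/after framing above."""
    for phase, functions in FRAMEWORK_FUNCTIONS.items():
        if function_name in functions:
            return phase
    raise ValueError("unknown framework function: " + function_name)
```

Partitioning the functions this way makes the investment question concrete: resources put into the "before" functions are mitigation spending, while resources put into the "after" functions are response and recovery capacity.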
Defining Communities and Dealing with Interdependencies
Noting that communities are complex, interconnected, interdependent systems, Tadayoshi Kohno, University of Washington, asked how those interdependencies affect the way we assess capacity for recovery and resilience. Is it possible to ensure that assumptions about multiple, interconnected communities are satisfied all at the same time?
Cauffman answered that NIST defines communities by geographic boundaries and the presence of a local governance structure. Local, state, and federal government jurisdictions often overlap, and infrastructures can be regional, but constraining the focus by geography and governance helps to break the resilience challenge into logical and manageable pieces. Electricity may come from far away, and hospitals may be administered by external stakeholders, but the six-step process should, by and large, create an avenue to address those interdependencies and ensure assumptions are accurate, he said.
Fred Schneider, Cornell University, noted that this “divide and conquer” strategy can be a successful solution to difficult problems. It is logical to divide things geographically (akin to taking a building-by-building approach, but on a larger scale) or by sector (e.g., looking at power grid resiliency specifically). However, he said, interdependency is a different type of layer that makes the problem significantly trickier. If roads are blocked, it can be hard to restore power. If power is out, communication becomes difficult. This close coupling of resources creates dependencies. Dependencies are also logically built into cyber systems; one cannot bring up the database before the storage server is in place. He asked how we can account for such relationships in recovery planning.
To this point, Barrett pointed to two relevant concepts: essential services and secure engineering. One way of approaching dependencies in the physical world, he said, is by defining essential services and using secure engineering to build them with resilience factors in place. But it can be tricky to strike the right balance in identifying essential services without creating ever more dependencies. The classical computer science concept of common controls, in which many controls are handled by a few shared components, is meant to increase efficiency but can have the effect of increasing dependencies and potentially undermining resiliency. This conundrum is analogous to essential services in a community.
Opportunities in the Internet of Things
Noting that much of the discussion had focused on designing resiliency into systems and communities, Kohno asked about the role that commercial, off-the-shelf products could potentially play in resilience and recovery strategies. For example, IoT devices could potentially be used to determine when a community is in a state of emergency and divert power away from nonessential services.
Cauffman replied that there is potential for IoT devices to help, especially as technological capabilities grow and costs decline. For example, he pointed to the availability of low-cost smart flood gauges that provide real-time water level data to communities to inform decision making and help move people out of harm’s way. However, he underscored that in most communities resilience challenges are still incompletely defined, and at this stage it is important to focus on more fully understanding the problems before shifting focus to developing technological solutions to address them.