Richard J. Danzig, Johns Hopkins University
Richard Danzig, Johns Hopkins University, wrapped up the workshop with a summary of observations and then moderated an open discussion among forum members, speakers, and workshop attendees.
Danzig pointed to two distinct dimensions in the arena of recoverability: the human and the technological. Although we often focus on the technological problems and solutions, the human dimension affects both system requirements and system performance, he said. He went on to note that recovery and resilience are as much psychological and social issues as they are technical ones.
We are in the early stages of fully grasping what cyber recovery and resiliency mean, he continued, and our psychological expectations of what is acceptable and unacceptable will change over time. If, when cars were first invented, we had known they would eventually cause 35,000 deaths yearly, their adoption might have taken a different trajectory, he said. Danzig observed that we are now psychologically accustomed to cars and their consequences and accept those risks daily, while we remain uncomfortable with the risks of cybersecurity breaches.
Predicting future risks and attitudes toward them is difficult. Just a few years ago, Danzig said, some underestimated the impact of Internet of Things (IoT) devices and
how vulnerable IoT users would be to bad actors. And, he emphasized, it is ultimately users, and not the technology itself, who are the targets of attacks and who suffer their fallout. When Sony was hacked in 2014, business was disrupted, but he argued that the greatest pain came from the public exposure of private thoughts. Similarly, Russian interference in the 2016 election targeted voters, not voting machines.
Danzig suggested directing more attention to determining what is an acceptable length of time between an attack and its detection and recovery, a timeline that varies substantially across sectors and environments and depends on the nature of the vulnerabilities as well as the potential damage from attacks. In some contexts, it may be acceptable if it takes a long time to discover a problem or if a system has to be shut down for a month to recover, but in other contexts, such as elections, stock markets, or war, such delays could be catastrophic psychologically and socially. Here again, the importance of time is tied to the human dimensions of disruptions, not the technological ones.
Danzig observed that frequent exposure to resilience and recovery processes can lead to an environment in which adapting to problems becomes a matter of course. If resiliency were better incorporated into our day-to-day landscape, Danzig suggested, the costs to design and maintain it would become a familiar and justifiable expense, perhaps akin to national vaccination programs.
Industries and communities can improve resilience and recovery by prioritizing frequent training, both before and after security incidents, Danzig argued. Instead of trying to hide or move past those situations, we should be sharing information, learning from mistakes, and improving procedures. Danzig recommended to participants Antifragile by Nassim Nicholas Taleb, which argues that the goal of resilience is not to return to where you were, but to learn how to rebuild to become better than before.1
Addressing the technological dimension, Danzig noted that while machines do have vulnerabilities, we should keep in mind that they also possess extraordinary abilities that can be used to promote resilience and recovery. For example, every keystroke can be documented and used to bolster deterrence, attribution, and retribution practices. Machine learning and artificial intelligence (AI) capabilities could also be harnessed to tackle the challenges of complex systems, many of which are beyond individual human capabilities, he suggested.
1 N.N. Taleb, Antifragile, Penguin Books, Ltd., New York, 2012.
Danzig argued that the growth of the cloud is also promising for security, resilience, and recovery. Startups no longer need to create their own security systems, he observed, but can increasingly rely on expertly created ones. This may create a monoculture, but, Danzig added, it also provides a level of security, visibility, and technical sophistication that would otherwise not be available to smaller companies.
Danzig concluded by reminding attendees that the future, and its challenges, are still unknown to us, but the way that we design tools matters. The increased adoption of artificial intelligence, IoT, and virtual reality systems suggests that it is increasingly important that those systems be “antifragile” by design. Computing power and sophistication might be increasing exponentially along some dimensions, but human capabilities are not.
The workshop ended with an open discussion covering many topics, including the importance of learning from the past, understanding time and scale in different contexts, and sharing information. Many participants reiterated Danzig’s suggestion that the focus in security and resilience should shift from trying to predict the future and prevent problems from occurring toward making resilience and the ability to recover a part of the everyday landscape across all sectors.
Learning from the Past
Several participants discussed past events that contain lessons for recovery. Susan Landau, Tufts University, shared two examples that illustrate how varied the challenges can be. A 2008 distributed denial of service attack on the country of Georgia created an atmosphere of misinformation and chaos that Russia was able to take advantage of. Russia’s interference techniques have been particularly challenging, she added, because they target humans, not just systems and infrastructure. On the other hand, she said, other examples illustrate how resilient human beings can be. Vermont is a state subjected to far more snowstorms than hurricanes. When Hurricane Irene pummeled the state in 2011, the people proved more resilient than the infrastructure. The hurricane washed away fifteen bridges, and it took years to rebuild them all, while the communities, finding themselves isolated for days or weeks in the immediate aftermath, managed the situation admirably.
Tim Roxey, North American Electric Reliability Corporation (NERC), talked about learning from the August 2003 blackout, when the power grid shut off in much of the Northeastern United States. NERC has implemented several new standards since that
incident, including vegetation maintenance and better emergency communication between stations and operators.
Steven Lipner, SAFECode, noted that James Anderson’s 1972 report, mentioned by Adkins, actually sent computer science researchers down an ultimately unproductive path for decades. Zurko agreed, noting that we can learn from that failure and apply it to today’s expectations of users and recovery. Lipner added that learning from experience is an essential piece of this process. Root cause analyses are extremely important, but they are only helpful if they lead to changes that prevent future problems. He also stressed the importance of respecting the limits to how much we can “engineer” users.
Lampson shared an example of another past failure that contains an important lesson. More than 20 years ago, NIST issued password-creation recommendations, and Lampson recently learned that they were created by an employee who lacked the necessary user experience expertise. Two important elements were ignored: the root cause of the problem and user limitations. The unintended result was the creation of countless easy-to-steal passwords.
Peter Swire, Georgia Institute of Technology, noted that the Army constantly studies past battles. John Manferdelli, Northeastern University, agreed that learning from the past is important, but forward progress is also essential. The Army has to fight the current war, not just implement ideas that might have helped in the last one. It’s also important to remember how experimentation and learning from mistakes can help craft guiding principles, he continued, something that, for example, has helped the nuclear submarine community improve overall reliability.
Considering Time and Scale
Bob Blakley, Citigroup, reiterated Danzig’s point that time is a crucial consideration for framing recoverability. In the financial sector, in addition to the end-of-day rule, which requires that accounts be settled at the close of every business day, banks may also be placed into receivership by regulators if they are unable to conduct business operations for a number of consecutive days. These time limits help identify and shape critical recovery procedures in the industry.
Building on this, Swire added that time and scale vary greatly in different contexts, and recovery discussions must include these definitions in order to determine the best course of action. Time could mean weeks or milliseconds. Scale can be equally varied.
Blakley agreed, and noted that most large banks practice recovery procedures at various scales including the level of the application, machine, data center, national subsidiary, and entire firm.
Eric Grosse, independent consultant, expressed surprise that there can be lags of 100 days or more in between attack and detection in some sectors and environments. In his experience at a major web services company, failures are detected within minutes and systems are expected to recover within minutes of detection. An event that could persist for weeks or months would be incredibly rare, he said.
Danzig asked Roxey how long it would take to bring up a large power grid in the event of a cyberattack. Roxey replied that it depends on how large the attack and the grid are, but generally speaking, it would take somewhere between a few hours to a few days. An important question, Roxey noted, is not only whether the power is back on, but whether hackers are able to cause damage to the system.
Danzig noted another nuance affecting timelines, which is that most resilience practices were born out of situations with time and space limitations that do not neatly translate into the types of cyberattacks we must contend with today. It is obvious when a hurricane is over or a building has fallen, he said, but it is less obvious when a cyberattack is over. In many cases, the attacker may simply have changed course and remain able to attack a different area.
Sharing Information
Lipner expanded on the idea of information sharing, an issue raised by several presenters. The government does share information with industry, but commercial companies may not always see the benefit of passing their own information along. Blakley said that in his experience, there can be effective, mutual information sharing between government and industry, noting that financial institutions have received actionable information from the government.
Danzig cautioned that informal information bartering within industry tends to benefit the largest companies in a field, while the smaller ones can be shut out of the relationship. The quality of the information, regardless of whether it is shared with other companies or the government, is also important, he stressed.
Responding, Grosse pointed out that smaller companies can still be a part of the information sharing. Even if they do not have information to offer, he said, they are allowed to participate if they can be trusted not to leak. Landau added that different industries may have different conceptions of what information is considered helpful and actionable.
Lipner reiterated Danzig’s suggestion that resilience strategies must be practiced and deployed frequently to be effective. Over his years in the field, he said he has learned that it is crucial not only to make recovery plans, but also to enact and practice them frequently so they become routine and companies are in a state of readiness. Manferdelli agreed that practice is important, but it first requires data, which in turn requires funding, something that may be in short supply at different times in an organization’s lifespan. In addition, the challenges are further compounded by the fact that computing conditions are constantly changing, technology is not always transparent, and scale varies greatly. We are still getting used to the complications of a computing world, he said, much in the way that it took decades to get used to the benefits and drawbacks of the automotive world. In response to a question from Danzig, Manferdelli said that recovery discussions should balance inductive and deductive reasoning, but there is currently a lack of theory development and deduction, pointing to a need for more principle-based solutions, in addition to more data.
Tadayoshi Kohno, University of Washington, pointed out that there are resilience needs that go beyond critical infrastructure and national security. In the future, products and industries we cannot even imagine will have security and recovery needs, and he suggested a need to broaden recovery planning to cast a wider net.
Paul Kocher, independent researcher, noted that while resiliency is a key focus and recovery is working well in some areas, such as cloud attacks, there is a much larger area of technology where resiliency planning is not happening, such as the cooling systems within computers. These aspects are more mundane, but still require attention, he argued. He speculated that there are probably not many engineers today who could patch and stabilize a computer chip designed 12 years ago. Such planning is a complex problem that needs to be simplified first, he continued, especially as the number of electronic devices grows while it remains difficult and expensive to ensure they can continue to be updated.
Looking Toward the Future
Participants discussed several additional resources that are useful in thinking about resilience and recovery going forward. Blakley and Danzig pointed to Dan Geer’s keynote2 from Source Boston 2017, in which Geer argued that the fate of the future will be decided by the actions security technologists take now. Landau offered two recommendations for how to think about the future and the human element: she
reminded the audience of the 1999 paper “Users are not the enemy,”3 and mentioned Duo, a security platform that in her view incorporates the human dimension well.
Fred Schneider, Cornell University, noted that previous forum workshops had focused on different recovery aspects, including identity theft4 and cryptographic agility,5 and suggested that integrating the present discussion with those previous ones could help achieve a more complete view of the problem. At Danzig’s request, Schneider also spoke about “graceful degradation,” the ability of a machine to continue functioning even if a large part of it has been compromised. As more sectors leverage AI to increase efficiency and effectiveness, graceful degradation should be considered in that context, as well. He argued that too many new devices and technologies are built with AI but without careful recovery planning and are degrading ungracefully. Lampson pointed out that not all systems need to degrade gracefully. Schneider agreed, but noted that there is an important difference between considering and then discarding such a goal and never considering it at all, which he believes to be the case in the context of many emerging technologies.
Blakley raised an additional concern about AI: reliance on AI can de-skill the workforce to the point that if the AI breaks, not only is staff unable to fix it, but staff also cannot complete the tasks themselves in order to keep serving customers. Landau agreed that resiliency planning and AI in the workforce need to be examined carefully, and also suggested that considering computers as disposable elements could improve our ability to maintain updated software and reduce the temptation to invest undue trust in any one component.
Closing out the workshop, Danzig reiterated that recovery is a balancing act. We can argue for more investment in resilience, more graceful degradation, and better security, but these also incur substantial costs that can drag down innovation. There is also the chance that we could over-invest in pursuing the wrong ideas.
3 A. Adams and M.A. Sasse, “Users are not the enemy,” Communications of the ACM 42(12):40-46, 1999.
4 See National Academies of Sciences, Engineering, and Medicine, Data Breach Aftermath and Recovery for Individuals and Institutions: Proceedings of a Workshop, The National Academies Press, Washington, DC, 2016, https://doi.org/10.17226/23559.
5 See National Academies of Sciences, Engineering, and Medicine, Cryptographic Agility and Interoperability: Proceedings of a Workshop, The National Academies Press, Washington, DC, 2017, https://doi.org/10.17226/24636.
When writing was invented, Danzig noted, the Greeks worried that it would ruin people’s ability to memorize things. It did, but it also brought unforeseen benefits that eventually outweighed the disadvantages. We are still in the early stages of thinking about recovery, and unfortunately the early years are the hardest. Eventually, good strategies will be adopted, he observed, but it may take many years, and many attacks, to understand what the real solutions are, a trajectory similar to that of aviation, which progressed from a daring and dangerous pursuit to a safe, routine means of transportation.
A good solution, he emphasized, will require acknowledging that different industries have very different priorities and resources. In the financial world, the assets are digital and banks have the funds to secure them, Danzig suggested. In the electric sector, the assets are physical and security is less heavily resourced. Finding the right solution, he noted, will also take experimentation, investment, and learning from inevitable attacks and mistakes.