Eric Grosse, an independent consultant and forum member, moderated a discussion to explore the challenges of current cloud architecture and isolation assumptions post-Spectre, the importance of hardware isolation capabilities on shared infrastructure, and the practical implications of emerging side-channel risks in the context of other known vulnerabilities.
Grosse began with a brief overview of how cloud security has changed. Just 10 years ago, when he started leading security for Google, he recalled that the main concerns around cloud security had to do with mistakes involving users’ cookies and log-in errors. There were large data centers—Google was running on well more than 10,000 machines—but the code was all written by Google engineers, offering a sense that the company was in control of the data and any vulnerabilities.
When Google moved to create its own browser and eventually a public cloud service, a much trickier set of security concerns emerged, Grosse said. For the first time, Google’s servers would be running code from external users. Google hired expert technical advisers and built multiple layers of detection and defense, and Grosse eventually felt confident enough to launch. To date, he said, Google’s cloud system has held up well, in part because the
company has maintained sufficiently trusting relationships with its vendors and reacts quickly to deploy patches when processor errata sheets are issued.
Although Spectre and Meltdown (and Rowhammer—a security vulnerability that exploits dynamic random-access memory) are ultimately hardware issues, they have important implications for cloud computing systems such as Google’s. Public cloud managers are heavily invested in security, which, Grosse emphasized, is a shared responsibility between the cloud vendor and the user. In reconsidering the interfaces between hardware and software, it is vital to consider how these vulnerabilities affect the cloud, as well as the broader role of the cloud in helping to advance security for all.
Brandon Baker related how Google is addressing Spectre and discussed the future of resilience more broadly. Google played a significant role in the discovery of Spectre and also has a good deal at stake as a buyer of affected microprocessors and as a vendor of cloud services that depend on these processors. In these respects, Spectre represents a threat to Google’s infrastructure and its business.
In the 6-month lead-up to Spectre’s public disclosure in January 2018, Google employees worked with security specialists and with compiler and kernel teams from Microsoft and Amazon to release mitigations for each variant of the Spectre vulnerability. This partnership among competing companies was critical to understanding all of the potential variants, determining the scope of the vulnerable devices, and developing a set of robust mitigations. However, although the mitigations are successful in stopping the vulnerability from being exploited, they do not actually fix the underlying hardware problem.
For Spectre variant 1, the teams analyzed the affected code patterns and removed those most vulnerable. Not all of the vulnerable code was reachable, which is good, Baker said: reachable code needs to be isolated or modified to protect it, but unreachable
code is already protected. For Spectre variant 2, they created a code pattern that could, at critical points, prevent speculative execution so that information would not leak.
The vulnerabilities could have been mitigated by updating the hardware microcode, but this would have significantly affected performance. Instead, Google opted to modify its software to avoid the vulnerable code patterns, providing an equal level of protection. Because all CPUs behave differently, Google had to actually develop the attacks and then test the mitigations, working with CPU developers to verify that each fix worked correctly.
Google’s live migration capability points to an approach that others could use for responding to vulnerabilities in the future. With live migration, Google engineers were able to patch the physical hosts underneath virtual machines (VMs) as needed, even down to the firmware, without customers noticing any disruptions to service or even requiring any of the VMs to be rebooted, all before Spectre became public knowledge. Applying updates without requiring a restart was very helpful, Baker said, suggesting that this capability could be a promising avenue for others going forward.
Another path worth investigating further is isolation—that is, keeping data and programs on one machine isolated from the activity happening on other machines. In the current environment, most cloud setups are optimized for efficiency, such that different workloads from different users frequently share space on the same machine. Baker explained how having dedicated, isolated machines could improve security, although it also increases inefficiencies, costs, and the risk of losing work. However, it could be an attractive option for highly sensitive workloads. Google has been testing ways to isolate system applications through gVisor, open source software that provides secure isolation for all of Google’s untrusted code, including the cloud and code for all of its devices. These applications run in hardware-isolated sandboxes with multiple defenses.
Isolation can be especially inefficient and expensive when it comes to smaller machines, yet large, multisocketed dedicated machines with many cores and high thread counts cannot be easily partitioned. To increase security, Baker suggested, hardware vendors should offer smaller machines that maintain hardware isolation or larger machines that are partitionable. Then the hosting provider can decide how to further partition the architecture and what security measures it would like on shared resources. While this capability is not yet available, he said, it is a promising direction to move toward.
If partitioning is not feasible or sufficient, Baker said it might be necessary to take certain types of high-value secrets out of the shared domain altogether. For example, there are more secure approaches already in use that could potentially be expanded to protect valuable data—such as Google’s Titan (a security key), Amazon’s Nitro (next-generation cloud infrastructure), and Microsoft’s Cerberus (a standards-compliant hardware root of trust).
It is important to consider how attackers operate and plan ways to detect and respond to attacks when needed, Baker argued. It is hard to attack workloads directly, because the zones are too big and the scheduling is unpredictable. As a result, bad actors have to run attacks continuously and dig very deep to reach valuable data like private keys, authentication keys, or host credentials that could unlock sensitive information. Even when a system is under continuous attack, however, detecting those attacks can be extremely difficult, especially for side-channel attacks and especially with existing hardware, which cannot distinguish regular activity from malicious activity. This points to a need for better detection mechanisms in the hardware, Baker said. However, such mechanisms would create new interfaces that share information, and it would be important to ensure that they do not also create new side-channel vulnerabilities. In addition, it is crucial that attackers not always know how they were discovered, because that uncertainty makes it harder for them to evade detection in a later attack.
Side-channel attacks are not new issues in cybersecurity, Baker noted. The difference with Spectre and Meltdown is that the underlying hardware leaks information while the software runs on as normal, unaware of any intrusion and unable to change its own behavior
to stop it. Although the cloud remains vulnerable in some other ways, Baker argued that hardware side-channel attacks should be considered the greatest concern to the industry more broadly. He concluded with the observation that one additional positive aspect of the cloud is that cloud services tend to cycle through hardware very quickly, with some machines lasting only a few years. While these hardware issues get sorted out across subsequent generations of CPUs, refreshing hardware, he said, is good for security.
Mark Ryland spoke about his experience as a cloud service provider in the era of Spectre.
Spectre is not an easy vulnerability to exploit, and Ryland expressed his view that it is so esoteric that cybercriminals probably would not have discovered it. He argued that, given the lack of basic hygiene in the IT industry, criminals have many much easier opportunities, and so a simple cost–benefit analysis would not suggest criminal activity in this area. As an industry, we have a long way to go in terms of improving basic security, from hygiene to patches to maintenance. Ryland posited that cloud services can help scale and automate those improvements. That said, more esoteric attacks are still possible and must be dealt with effectively. Once bad actors know about them, he said, they will certainly try to use them.
Cloud vendors, Ryland said, have more incentive to build in effective security than do smaller, point-solution players in the market (which provide only a small part of an overall solution). He argued that cloud platforms help improve security throughout the whole ecosystem. AWS, like Google and Microsoft, is able to protect security in a way that smaller players, such as those in the Internet of Things (IoT) space, are not. Large cloud companies, he said, not only have the business incentives to choose more secure over less secure options, but also have the resources to compensate for the expenses or performance losses incurred in doing so.
When looking at side-channel issues in cloud infrastructure, Ryland explained that one must first take into account the wide range of cloud services, from storage to database to compute services, with many nuances in between. Customer requests to distributed storage systems, for example, have to pass through multiple load balancers, servers, erasure coding, encryption, and other defenses. Buggy code can result in data leakage independent of any side-channel issues, but Ryland said that this poses a low risk if systems are built correctly. In sum, side-channel exploits are essentially impossible when the customer cannot execute code inside the multitenanted service. The more the customer can customize the behavior of a service by uploading some kind of executable content, the more these kinds of attacks become possible. The most vulnerable cloud service is, therefore, always going to be a VM service, because there the customer has the most complete ability to run arbitrary code down to the operating system level and observe the timing characteristics of that execution on shared infrastructure.
Ryland noted an evolution in the way we think about servers. Traditionally, they were seen as “pets” that were named, housed, and cared for. In the context of the cloud, servers are viewed more like cattle—there are millions of them, and they are expendable. “Long term” is not necessarily good, he said, particularly when it comes to security.
VMs are a useful bridge from legacy computing to the abstractions of modern computing. Moving to the cloud is a chance for users to modernize their platforms and take advantage of the cloud’s features. EC2, AWS’s VM service for customers, is built with a laser focus on tenant isolation. Users have several choices when it comes to co-tenancy. They can choose “dedicated instances,” which is a placement policy that guarantees that the user’s VM will run on hardware that is not shared with any other customer. Beyond that, users can take advantage of a “dedicated host” with a host identifier, a feature popular with users of software whose licensing is tied to a particular machine for a given period of time. Ryland noted that users
can also use the dedicated hosts feature to test AWS isolation models by dividing up a dedicated host to check for noisy or nosy neighbors.
Ryland described how Amazon launched AWS while simultaneously supporting Amazon.com, but AWS was a true start-up, with all new hardware and software, Ryland said. The needs of Amazon.com did inform some features, such as the option for non–co-location (dedicated instances), but the overall effort was not solely driven by Amazon.com. At first, AWS hoped to create the notion of generic processor types for customers based on computing power measured in “elastic compute units” (ECUs), but that did not work out very well. The VM service used multiple types of actual processors for a given instance type, and once launched, customers could tell whether (for example) the processor was an Intel or an AMD processor, and also what generation of processor it was, and so on. So AWS abandoned that approach and some years ago standardized on a model whereby the processor type is explicitly noted in the documentation. But that means today customers can know the processor in use and what its features are, including the number of cores and sockets and whether and at what level it involves cache sharing.
From the beginning, Ryland said, AWS focused on providing customers with a reliable, consistent experience. This had the side effect of improving security, although it did decrease efficiency. To achieve reliability and consistency, AWS pinned its VMs to specific processor cores for their lifetimes, provided a fixed amount of non-oversubscribed memory, and did not do page coalescing. For the T family of instance types, which do provide better prices based on some degree of oversubscription, AWS devised a “CPU credits” model that still provides predictable behavior. Importantly, these efforts designed to provide consistency and predictability ended up being natural mitigations for a lot of the security issues seen on the hardware side. While AWS is able to hot-patch its hypervisors—update them without stopping the VMs—in the case of
paravirtualization (PV) instances, there was no good fix for Spectre-type issues without eliminating PV altogether, which AWS did by moving PV to run on top of hardware virtual machine (HVM) virtualization. That transition required VM reboots or stop/starts for the relatively small percentage of PV instances.
Co-locating could have opened up multitenancy risks, Ryland noted, but AWS was aware of the dangers. When, in the early days of EC2, academic researchers showed that it was possible to determine the approximate location of a VM based on its IP address (an approach known as “cloud cartography”), AWS mitigated the vulnerability before the paper was published, and the problem was eliminated entirely in 2011 via virtual networking. Although an approach known as “prime and probe” was uncovered as another potential vulnerability, Ryland expressed his view that the signal the victim would have to emit for such an attack to succeed makes it impractical and unlikely to be exploited.
Ryland next described Nitro, Amazon’s latest computing architecture for its VM service. Nitro has several security features, including hardware and firmware validation of the Intel chip and system firmware at every reboot, no interactive shell (all privileged access is done by APIs), encryption of all local storage, and cached Elastic Block Storage encryption keys. Malware or other attacks are thwarted, he said, because software validates the firmware at every reboot.
Function-as-a-service is a growing area. Ryland said many customers are skipping containerization and moving to AWS Lambda, an event-driven, “serverless” computing platform where sensitive data is stored outside the computing site. The highly dynamic and ephemeral nature of code execution in a Lambda environment makes exploration and exploitation more difficult.
A variety of other isolation models that are emerging offer customers more choices, such as the option to run apps in isolation, use short-lived code, and avoid creating a stable co-location that can be attacked. Elastic GPUs, application and desktop as services, and secure browsing technologies are also becoming more common, and there is the potential for much more innovation in this space. By continuing to divide services into microservices that give customers
choices as to how they isolate their workloads, Ryland posited that we can further improve overall security.
Participants asked the two panelists to delve deeper into their companies’ approaches to cloud security. Building on themes raised earlier in the workshop, they also discussed trust issues and potential areas for improvement.
Security at Google and AWS
Grosse asked Baker to comment on the security implications of live migration, given that operating systems or applications may never actually need to be rebooted. Baker acknowledged that not requiring reboots could make it easier for users to ignore patches and make themselves vulnerable. However, Google does force reboots internally, which means that software is constantly updating. He noted that cloud-native services lend themselves better to this model, whereas VMs remain tied to legacy software and legacy behavior that we need to move past.
Schneider asked if Google and Amazon have mechanisms for triaging attack threats. Baker replied that at Google, any threat to customer data is considered important, so triage is not an option and it must address every threat. Ryland said AWS similarly does not have a specific process to prioritize or triage, other than a mandate to think about what is best for its customers, although he noted that it was possible to prioritize. For example, if customers expose themselves to an attacker, there is not much AWS can do, but if the vulnerability is in AWS’s arena, then it is a threat to its business model of protecting customers and must be addressed. He noted that AWS does work with large customers affected by a vulnerability before a public disclosure.
Baker added that Google devotes a great deal of its resources to security engineers, which gives the company the capacity to respond effectively when problems happen. Schneider asked if it could ever be necessary—or possible—for Google to replace every processor in
all of its servers if the threat was large and pervasive enough. Such a scenario would indeed be an enormous undertaking, Baker agreed, but using a diverse array of processors helps mitigate that threat. He added that Google is also willing to suspend all services if it is necessary to protect the company’s resources or its users.
On the business side, Ryland reiterated that it is far easier for AWS to take on large-scale security challenges than for its customers to do so. When big companies like AWS and Google tackle these issues, the rising tide lifts all boats, he said. Nonetheless, constant vigilance is still needed to deal with ongoing challenges.
Revisiting the Issue of Trust
John Manferdelli, Northeastern University, asked if it was possible to be satisfied by a vendor’s security assurances. Baker replied that he is suspicious of claims that he himself cannot validate; rather than offering reassurances, he said, it would be better if manufacturers were transparent about side channels and security mechanisms.
Bob Blakley, Citigroup, asked both speakers to comment on whether cryptographic or software protections can really create a viable trusted environment. Baker replied that generalized homomorphic encryption1 is currently infeasible, but future research could focus on pushing encryption farther up the stack to increase resistance to side-channel attacks. Ryland agreed that it is a challenge, and he noted that AWS applies current technology judiciously to keep its environments as trustworthy as possible. He added that AWS itself makes its own chips for certain use cases, such as the Nitro hardware, which offers a certain level of customization.
Ideas for Improvement
Grosse pointed to the need for better mechanisms to detect intrusions. Ryland agreed, noting that hypervisors have long been used to monitor various signals, such as chip temperature. These and other “dynamic mitigations” could provide added protection, and it would be good to take advantage of such approaches moving forward.
1 Homomorphic encryption enables an operation to be performed on encrypted data without decrypting the original data or the result. See C. Gentry, 2010, Computing arbitrary functions of encrypted data, Communications of the ACM 53(3):97-105, https://doi.org/10.1145/1666420.1666444.
Kocher reflected that while it may take a lot of technical savvy for someone to launch a Spectre attack, there is a connection between scale and security. Baker replied that dedicated VMs, partitioning, and case-by-case security considerations are the best way to secure important data. Ryland noted that thieves have to work very hard even to find the data worth stealing. If we use microservices for data, it will make the attacker’s task even harder. Even with such compartmentalization, there will still be risks, he said, and users will need to choose the level of risk, cost, and speed that they feel comfortable with.
Jeremy Epstein, National Science Foundation, asked how technology might better handle security for unsophisticated users with high-value data, such as voting machines. Ryland replied that once anything is connected to the Internet, it is at risk, but clouds can enhance security because they handle many of the elements needed for security for unsophisticated users and create secure “envelopes” around business applications.