In building distributed systems and tier-0 cloud services and platforms, the code that's supposed to save the day often fails spectacularly when the time comes. A classic example is recovering from data loss. You can build an elaborate system to automatically recover data during a disaster, but it isn't exercised until a real-life disaster happens. Critical systems and processes responsible for these "once in a while" extreme events are often neglected or poorly tested because they have little visibility in everyday operations. At Oracle Cloud Infrastructure (OCI), we prevent such mishaps by continuously evolving our operational best practices, release guardrails, and automation to minimize human error.
In this blog post, we walk you through a complex change management process at OCI, the guardrails we've built to protect our customers from catastrophes, and the strategies we use to keep things running smoothly. At the end, we share how these investments improved our security, availability, resiliency, and developer velocity.
Like any hyperscaler, OCI is internally made up of thousands of microservices that talk to each other through a complex web of dependency chains. Every time you use one of our customer-facing products or services, you're using this immense infrastructure of connected services. To maintain our high level of security, communications between these internal services are always encrypted with Transport Layer Security (TLS). Our internal public key infrastructure (PKI) manages the lifecycle of TLS certificates and of the instance principal certificates available on OCI Compute instances, which users rely on to authenticate and authorize with other OCI services. PKI is responsible for renewing millions of short-lived instance principal certificates and long-lived internal TLS certificates. We gave a high-level overview of our digital certificate renewal process in a previous Behind the Scenes post.
These certificates are anchored to root certificates through a chain of trusted intermediate certificates. We rotate the root certificates and the chain of intermediates in each cloud region periodically to maintain our security and compliance bar. This complicated deployment process spans hosts, virtual machines (VMs), SmartNICs, root-of-trust devices, and several other service targets.
Figure 1: Root rotation sequence
A small mishap in any of these steps can lead to a significant outage in that region, causing a poor customer experience. The worst part about these outages is that they can be difficult to recover from: the services required for recovery might themselves be impacted if two services can't talk to each other because of a lack of trust. This complex process is exercised infrequently in a region but can lead to serious problems if the steps don't work as expected.
When we started root rotations a few years ago, we intentionally spent a lot of developer cycles validating our process to minimize any risk of outages. But we’ve come a long way since then. In 2023, we rotated the root certificates in 18 cloud regions, and we have more on the roadmap for 2024 and beyond. The latest region where we rotated the roots took only 2 developer weeks to run the entire sequence without causing disruptions to our customers, compared to 12 developer months a year ago.
This transformation is a testament to how we've evolved our operational rigor and release automation. We've built software systems, guardrails, and processes to exercise and roll back these steps periodically. We also continuously gather telemetry on potential customer impact so that we can make improvements and fix regressions.
Let’s look at the details of the safety net that we’ve built to manage and sustain this ongoing process.
Building systems and processes as a stack of carefully placed invariants that support each other makes it easier to verify their effectiveness. In this case, the root rotation mechanism is built using the following release management framework:
Let’s highlight a few of these invariants and how we validate them consistently.
Common sense suggests that periodically testing rare scenarios helps validate their effectiveness. Building and running these tests is practical if you manage a small number of services, but how do you scale that to thousands of microservices? In our example, one of the hardest steps in the root rotation process is ensuring that every service has picked up the trust bundle (root certificate authority) that contains the new root certificates. If we miss a single service, failures can cascade across the entire cloud region because that service can't trust dependencies that use certificates issued by the new root. Having developers build guardrails for each service in OCI isn't practical, so we solved the problem more generally.
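To make that concrete, here's a minimal sketch of the kind of client-side check that can confirm a target has picked up the new root before a rotation step moves on. The file path, fingerprint value, and use of Python's cryptography package are illustrative assumptions, not our actual tooling.

```python
# Illustrative check: confirm that a host's trust bundle already contains the
# new root CA before the rotation workflow marks that target as ready.
# The path and fingerprint below are assumptions, not OCI's real values.
from cryptography import x509
from cryptography.hazmat.primitives import hashes

TRUST_BUNDLE_PATH = "/etc/pki/tls/certs/ca-bundle.pem"  # hypothetical location
NEW_ROOT_SHA256 = "ab12cd34..."                          # placeholder fingerprint

def bundle_fingerprints(path: str) -> set[str]:
    """Return SHA-256 fingerprints of every certificate in a PEM bundle."""
    with open(path, "rb") as f:
        certs = x509.load_pem_x509_certificates(f.read())
    return {cert.fingerprint(hashes.SHA256()).hex() for cert in certs}

def has_new_root(path: str = TRUST_BUNDLE_PATH) -> bool:
    """True if the new root CA is present in the local trust bundle."""
    return NEW_ROOT_SHA256 in bundle_fingerprints(path)
```

A fleet-wide check of this kind can then be aggregated per service, so no single target is missed before the new root is put into use.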
We built a new realm running an entire OCI stack, called the Chaos Lab, for the purpose of practicing recovery from real-world failures. We then started building automated recipes of disruptive events and triggering those disruptions in the Chaos Lab region. While the region is under impact, we collect telemetry on how services fail under different kinds of disasters, answering questions such as whether services recover as expected, whether they have operational visibility and alarms, and whether their recovery runbooks are up to date. We continuously test the following recipes:
In these ways, we exercise the systems and processes built to handle rare scenarios across OCI and stay prepared for unfavorable outcomes.
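To illustrate the recipe idea described above, here's a hedged sketch of how a disruption recipe could be expressed and run. The structure, field names, and checks are assumptions made for the sake of example, not the actual Chaos Lab format.

```python
# Illustrative shape of a Chaos Lab "recipe". The real OCI format isn't
# public, so the fields and runner below are assumptions for the example.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DisruptionRecipe:
    name: str
    inject: Callable[[], None]   # trigger the failure (e.g., present certificates from an untrusted root)
    revert: Callable[[], None]   # undo the disruption
    checks: list[Callable[[], bool]] = field(default_factory=list)  # recovered? alarmed? runbook current?

def run_recipe(recipe: DisruptionRecipe) -> dict[str, bool]:
    """Inject the failure, gather pass/fail telemetry, then revert."""
    results: dict[str, bool] = {}
    recipe.inject()
    try:
        for check in recipe.checks:
            results[check.__name__] = check()
    finally:
        recipe.revert()
    return results
```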
In the Chaos Lab, we identified that the phase during which we forcibly renew all existing certificates had no velocity control. As a result, we ran into severe throttling from several downstream services, including the load balancers to which certificates must be distributed. The retry storm caused by those throttling failures brought down some of our dependencies, which further slowed recovery.
This is the area where we've made the most improvements, and we continue to make progress. Issuing certificates and certificate authority (CA) bundles across thousands of targets, including hosts, SmartNICs, and load balancers, requires controlling the sequence and speed at which artifacts are distributed. We built and deployed configurable controls in our certificate distribution service and agents to manage the area of damage. For example, we introduced a mechanism to control the speed of distribution with a configurable parameter determined by the number of certificates that need to be issued in a region. As a result, we started with 1% of targets receiving new certificates and gradually increased that percentage over a period of 6–8 hours. This approach allowed us to watch as the certificates were redistributed and react quickly to any sign of failure.
Figure 2: Certificate distribution velocity control
The higher, dark blue line shows the decline in outstanding certificate renewals, while the flatter, light blue line shows the number of certificates renewed each minute.
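As an illustration of this kind of velocity control, the sketch below ramps the eligible fraction of targets from 1% to 100% over a configured window. The linear ramp, parameter names, and exact window are assumptions; the real mechanism is driven by a configurable parameter based on the number of certificates to be issued in the region.

```python
# Minimal sketch of a velocity control: ramp the fraction of targets allowed to
# receive new certificates from 1% to 100% over a configured window.
# The linear ramp and parameter values are illustrative assumptions.
import time

RAMP_START_FRACTION = 0.01        # begin with 1% of targets
RAMP_DURATION_SECONDS = 7 * 3600  # roughly the 6-8 hour window described above

def allowed_fraction(elapsed_seconds: float) -> float:
    """Fraction of targets eligible for renewal after elapsed_seconds."""
    progress = min(elapsed_seconds / RAMP_DURATION_SECONDS, 1.0)
    return RAMP_START_FRACTION + (1.0 - RAMP_START_FRACTION) * progress

def eligible(target_index: int, total_targets: int, start_time: float) -> bool:
    """A target becomes eligible once the ramp reaches its position in the fleet."""
    fraction = allowed_fraction(time.time() - start_time)
    return target_index < int(total_targets * fraction)
```

Pausing the ramp at any point caps the damage to whatever fraction of the fleet has already been touched, which is what lets us react to early signs of failure.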
On top of minimizing damage, we built a mechanism to introduce new roots and certificates to specific services' targets deterministically. Because PKI sits at the bottom of the OCI stack, ensuring that platform services don't fall apart during root rotation is critical: a full-blown failure can lead to cyclic dependency issues while recovering a region. We emulated this scenario in the Chaos Lab and discovered that specific core platform services must break the cyclic dependency during the bootstrap of a region. We then built mechanisms for the following processes:
Figure 3: Deterministic distribution of certificates
Deterministically isolating the impacted targets helps us respond to unrelated OCI-wide incidents without spending developer hours on false positives, and teams can be on alert during their potential impact window.
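One way to picture deterministic distribution is to assign each target a stable rollout wave ahead of time, with core platform services pinned to an early wave so the cyclic dependency is broken first. The hashing scheme, wave count, and service names below are illustrative assumptions, not our actual implementation.

```python
# Sketch of deterministic wave assignment: hashing a target's identity yields a
# stable wave number, so each team knows its potential impact window in advance.
import hashlib

NUM_WAVES = 10

def wave_for_target(target_id: str, num_waves: int = NUM_WAVES) -> int:
    """Return a stable wave index (0..num_waves-1) for a given target."""
    digest = hashlib.sha256(target_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_waves

# Core platform services can be pinned to an explicit early wave instead of the
# hashed assignment, so they break the cyclic dependency first (names are hypothetical).
PINNED_WAVES = {"core-identity": 0, "core-dns": 0}

def rollout_wave(target_id: str) -> int:
    return PINNED_WAVES.get(target_id, wave_for_target(target_id))
```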
Change management processes often capture rollback steps with the assumption that upstream and downstream dependencies will work as expected during the rollback. For platform teams that handle PKI, it's critical to assume that everything will go wrong during root rotation and to tailor the rollback steps to that assumption. While rehearsing rollback steps in advance is easy, we took extra measures to be fully prepared to handle the most extreme failure scenarios and recover quickly, including the following examples:
The key point here is to untangle the complex web of dependencies, assume they aren't available during a rollback, and come up with simple break-glass mechanisms. We use the Chaos Lab to stay ahead of every change in the dependency chain and rigorously test our break-glass mechanisms, helping to ensure that they're always up to date. With OCI evolving rapidly, ongoing testing is what keeps our break-glass processes ready to go.
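As a rough illustration of that principle, a break-glass rollback for the trust bundle can avoid remote dependencies entirely by restoring a pre-rotation snapshot kept on each host. The paths and snapshot convention below are assumptions, not our actual procedure.

```python
# Sketch of a break-glass rollback that assumes no dependency is reachable:
# restore the pre-rotation trust bundle from a snapshot kept locally on the host.
# The paths and snapshot convention are illustrative assumptions.
import os
import shutil

BUNDLE_PATH = "/etc/pki/tls/certs/ca-bundle.pem"         # hypothetical location
SNAPSHOT_PATH = "/etc/pki/tls/certs/ca-bundle.pem.prev"  # saved before rotation

def break_glass_restore() -> None:
    """Roll the trust bundle back without calling any remote service."""
    if not os.path.exists(SNAPSHOT_PATH):
        raise RuntimeError("no pre-rotation snapshot found; local rollback not possible")
    shutil.copy2(SNAPSHOT_PATH, BUNDLE_PATH)
    # Services generally need a reload to pick up the restored bundle;
    # that step is intentionally omitted from this sketch.
```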
For the root rotation process, we rely on over 100 metrics to validate the success of each step and trigger the subsequent steps in our workflow. We've built metrics and alarms on PKI services and client-side agents to help us understand our own failures, including certificate delivery failures and latency, certificate expirations, the number of renewals, targets refreshed, host performance, and so on. In the Chaos Lab, we were surprised at how quickly we identified the root cause of some of the issues mentioned in previous sections. Understanding the area of impact, however, was extremely difficult: we had no visibility into failures across more than 100 services because teams report outages in different forums. We solved this problem with an OCI-wide initiative to raise the platform bar, building a monitoring platform that ingests key performance indicators (KPIs) from every OCI service to indicate the overall health of a cloud region at any point in time. We then integrated these live dashboards into the PKI root rotation process to actively monitor the damage and any service impact, and to dictate our workflows.
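Conceptually, the gating works like the sketch below: the workflow advances only while every key indicator stays within its threshold. The metric names, thresholds, and the fetch_metric helper are hypothetical stand-ins for the monitoring platform's real KPIs.

```python
# Sketch of a metrics gate: the rotation workflow advances to its next step only
# when every gated indicator is inside its threshold. Metric names, thresholds,
# and fetch_metric() are hypothetical stand-ins.
THRESHOLDS = {
    "certificate_delivery_failure_rate": 0.001,  # below 0.1% failures
    "certificate_delivery_p99_latency_s": 5.0,   # below 5 seconds
    "certificates_near_expiry": 0,               # no unexpected expirations
}

def fetch_metric(name: str) -> float:
    """Placeholder for querying the monitoring platform for a metric value."""
    raise NotImplementedError

def safe_to_proceed() -> bool:
    """True only if every gated metric is within its threshold."""
    return all(fetch_metric(name) <= limit for name, limit in THRESHOLDS.items())
```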
Justifying the prioritization of development effort for something that happens so rarely is often difficult. But it's important to recognize that some of these events can make or break customer trust. Our investments produced the following results:
This blog post emphasizes the rigor with which we architect our systems and processes at OCI to help prevent outages of any size and rarity. Thousands of customers around the world rely on Oracle Cloud Infrastructure to run their businesses. We take immense pride in solving hard problems for them to uphold OCI’s strict security standards.
For more information, see the following resources: