We see it every day: a digital age that has become increasingly dependent on hybrid multicloud infrastructure and microservices to quickly build and scale out applications to meet demand. Systems and services we rely on daily are running fine one minute and down for days the next. As a high-tech professional obsessed with high availability, I routinely see these outages, and my mind goes straight to how they could have been prevented or mitigated. In many cases, the underlying cause turns out to be an independent failure of infrastructure in the data center. Sometimes, however, the cause is broader, involving correlated failures such as a network outage or a natural disaster that impacts multiple data centers and services. This is particularly true in a world that is quickly moving to a multicloud (interdependent) model, where organizations optimize efficiency by making use of best-of-breed services like the Oracle Autonomous Database. To properly mitigate adverse events, we need to plan for disaster recovery and test that plan.
Regarding the disaster recovery plan, one could have the application staged in another region hundreds or even a thousand miles away, so that it is geographically isolated and ready to take over.
So, how would one go about testing that plan to ensure it accounts for all the disasters and other situations that might impact our application system? The answer is relatively simple: break it ahead of time, see what happens, and keep adapting the plan to accommodate the findings. One should introduce disruptions and break the application, keeping in mind all of the different dependencies that could fail in various situations, to gain confidence that the disaster plan is sound and that the included failover scenarios (e.g., cross-regional disaster recovery) are going to work as intended. Does terminology exist for such a strategy of artfully and proactively breaking and disrupting application systems in order to prepare for a disaster? Indeed there is: it is called Chaos Engineering, and the organizations responsible for running the services the world depends on, across industries such as financial services, transportation, aerospace, healthcare, and retail, are rapidly taking advantage of it.
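To make that idea concrete, here is a minimal sketch, not an MAA tool, just an illustration, of an availability probe one might run while deliberately disrupting the primary region. It polls a hypothetical application health endpoint once a second and reports how long the application was unreachable so the result can be compared against a recovery time objective. The endpoint URL and timing targets below are assumptions for illustration only.

```python
import time
import urllib.request
import urllib.error

# Hypothetical application health endpoint; replace with your own (assumption).
HEALTH_URL = "https://myapp.example.com/health"
PROBE_INTERVAL_SECS = 1
RTO_SECS = 300  # assumed recovery time objective for this experiment

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def run_probe() -> None:
    outage_started = None
    while True:
        if is_healthy(HEALTH_URL):
            if outage_started is not None:
                downtime = time.monotonic() - outage_started
                verdict = "PASS" if downtime <= RTO_SECS else "FAIL"
                print(f"Recovered after {downtime:.0f}s of downtime ({verdict} vs {RTO_SECS}s RTO)")
                outage_started = None
        else:
            if outage_started is None:
                outage_started = time.monotonic()
                print("Outage detected; waiting for failover...")
        time.sleep(PROBE_INTERVAL_SECS)

if __name__ == "__main__":
    run_probe()  # inject the fault (e.g., disable the primary region) while this runs
```

Run against an assumed health URL while a fault is injected, a probe like this simply quantifies the gap between disruption and recovery, so each failover scenario in the plan gets a pass/fail verdict rather than a gut feeling.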
At Oracle, many are familiar with our Maximum Availability Architecture (MAA), a tiered set of solutions comprised of integrated High Availability (HA) and Disaster Recovery (DR) technologies along with the reference architectures and operational best practices that go with them. Have you ever wondered how those best practices and reference architectures are created, and why the veteran Oracle team responsible for them is so confident they will work? The answer, once again, is Chaos Engineering: the MAA team applies Chaos Engineering to all areas of Oracle software on a routine basis. Of course, this includes the Oracle Database, Oracle Exadata, and the Oracle Cloud, but it extends to many of the applications as well. In the case of Oracle Cloud, MAA uses Chaos Engineering to provide the bedrock for the various integrated PaaS and SaaS solutions built on top of Oracle Cloud Infrastructure. This is an ever-evolving, focused effort that borders on being an art form, and the MAA team is diligent in its pursuit. It also means their job is never done: there is a nearly countless number of scenarios that can occur, and the team must balance those scenarios against a technology platform that evolves with regular releases of the software and the underlying infrastructure. The good news is that these MAA engineers are very good at their job, which includes disrupting and breaking these systems to ensure that a multitude of disasters, planned maintenance events, and countless other scenarios, including both individual and correlated failures that could impact applications depending on the Autonomous Database, Exadata Cloud Service, and Database Cloud Service, are accounted for. Let's now explore, at a high level, some of the many things that can cause downtime and that have to be considered independently or in combination with one another:
- Computer failures
- Storage failures
- Human errors
- Data corruption
- Hangs or slowdowns
- Network failures
- Power failures or site failures (e.g., Godzilla attack or hurricane)
- Server software updates
- Database software updates or upgrades
- Data reorganization or changes
- Application changes and optimizations
The list above isn't comprehensive, but it does capture quite a few scenarios that could impact application availability. So, how would one go about doing proper Chaos Engineering around these scenarios? The answer, in many cases, is of course to break things, and sometimes to break things while other planned maintenance activities are in progress, such as a power failure occurring in the middle of a software upgrade. Yes, that combination is unlikely, so it may sit further down the list of chaos engineering scenarios, but the goal is to break things and see whether you get the expected failover result. This is what the MAA team does on a regular basis so that, with the Oracle Database, you do not have to. The result is all the best practices provided on the MAA Database, MAA Exadata, and MAA Cloud best practices pages, along with the associated documentation here. None of this would be possible without the team's dedication to Chaos Engineering, which lets us provide the most robust, reliable, and available database solution possible, along with the configuration and operational best practices to go with it.
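As an illustration of what "break things and check the failover result" can look like at the database tier, here is another minimal sketch, again an illustration rather than an MAA tool, that continuously runs a trivial query through the python-oracledb driver while an operator injects a fault (for example, restarting an instance or pulling power during maintenance) and records how long sessions were unable to complete work. The connection details are placeholders.

```python
import time
import oracledb  # python-oracledb thin driver

# Placeholder connection details; point these at a test service, never production.
USER = "chaos_probe"
PASSWORD = "replace_me"
DSN = "db-test.example.com/orclpdb1"

def query_ok() -> bool:
    """Return True if a trivial query round-trips successfully."""
    try:
        with oracledb.connect(user=USER, password=PASSWORD, dsn=DSN) as conn:
            with conn.cursor() as cur:
                cur.execute("select 1 from dual")
                cur.fetchone()
        return True
    except oracledb.Error:
        return False

def main() -> None:
    outage_started = None
    while True:
        if query_ok():
            if outage_started is not None:
                print(f"Service resumed after {time.monotonic() - outage_started:.0f}s")
                outage_started = None
        elif outage_started is None:
            outage_started = time.monotonic()
            print("Queries failing; fault in progress...")
        time.sleep(1)

if __name__ == "__main__":
    main()  # inject the fault (e.g., restart an instance) while this runs
```

A fuller experiment would also distinguish new connections failing from in-flight work being replayed, which is where technologies like Application Continuity, covered in the next post, come into the picture.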
In the next installment of this blog series on MAA and Chaos Engineering, I will explore some of the chaos engineering techniques used to test the robustness and best practices of specific Oracle Database HA and DR technologies, including Real Application Clusters (RAC), Active Data Guard, and Application Continuity, followed by additional posts focused on Chaos Engineering and MAA for Exadata and the Oracle Cloud.
For more information, check out the main MAA page and resources here.
Maximum Availability Architecture (MAA) – Breaking things via Chaos Engineering to ensure your peace of mind and to build a better world.
