The art of building resilient cloud architectures: tenets of OCI

May 22, 2023 | 11 minute read
Arvind Bhope
Master Principal Solution Engineer

In today’s world, the reliability of cloud infrastructure has become a critical factor for businesses that depend on cloud-based applications and services. Any downtime or disruption can result in lost revenue, damage to brand reputation, and even loss of customers. What can cause a broad systems failure? Some scenarios include a natural disaster, cyber-attack, power outage, hardware outage, or even human error. 

Here, resilience and resilience patterns come into play. But what is resilience, and how does it play into cloud terminology? Resilience refers to an infrastructure’s ability to recover from failures or disruptions quickly and continue to operate smoothly. In cloud computing, resilience patterns are designed to ensure that applications remain available and performant even in the face of failures or disruptions. Some of the key components of cloud resilience include redundancy, availability, scalability, fault tolerance, disaster recovery planning, backup and restore mechanisms, and continuous monitoring and testing.

Resilience in OCI

Oracle Cloud Infrastructure (OCI) recognizes the importance of resilience and provides a robust set of resilience patterns. These patterns are designed to ensure that applications and services hosted on OCI are highly available, scalable, and fault-tolerant, and that they recover swiftly from failures or any form of disruption. By implementing these measures, organizations improve the likelihood that their cloud-based systems and applications can withstand disruptions and continue to function effectively in the face of adversity.

Along with providing redundancy, OCI resilience patterns offer several advanced capabilities, such as multiregion architecture, fast failover, and automatic recovery. These capabilities help your applications recover quickly and continue to function even in the face of disasters or major disruptions.

Protection against loss of service or loss of data is always a multilayered discipline. OCI’s foundational blocks (regions, availability domains, fault domains, backup and recovery, load balancers, autoscaling, security, and monitoring) help us design architectures that can face adversity and recover from it.

Cloud resilience is becoming increasingly important as more organizations adopt cloud computing and rely on it for critical business operations. By prioritizing cloud resilience, organizations can minimize the risk of downtime, data loss, and reputational damage, and can maintain and strengthen the trust of customers and stakeholders.

OCI regions

OCI’s regional architecture enables customers to achieve resilience across multiple regions, providing protection against regional disasters, such as natural disasters, power outages, and network outages. If a complete region outage occurs, the application can fail over to another region and continue to operate with zero to minimal downtime during the disruption. By deploying applications across multiple regions, customers can also benefit from geographic diversity, reduced latency, and improved performance.

OCI’s regional architecture also provides customers with the ability to implement disaster recovery solutions with ease. With such a solution in place, failover to another region can be automated, reducing delay and human intervention and helping business operations continue without interruption.

Availability domains

OCI’s availability domains provide a high level of resilience by distributing your applications across multiple, physically separate data centers within a single region. Each availability domain is designed to be completely independent, with its own power, cooling, networking, and connectivity to the internet. So, if one availability domain experiences an outage or disruption, the others continue to operate as normal, providing a high level of availability and protection against single points of failure.

OCI’s availability domains are connected to each other with low-latency, high-bandwidth links, enabling seamless failover and load balancing between availability domains. Your applications can therefore take advantage of the resilience benefits of availability domains without sacrificing performance. Moreover, OCI provides a higher availability service level agreement (SLA) for each availability domain, helping your applications maintain a high level of uptime. Overall, OCI’s availability domains provide a robust and reliable foundation for building highly available and resilient applications.
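To see why independent availability domains protect against single points of failure, a back-of-the-envelope calculation can help. The sketch below assumes that domain failures are fully independent and uses a hypothetical per-domain availability of 99%. Neither that figure nor the function name comes from OCI documentation or from any published SLA.

```python
def combined_availability(per_domain: float, domains: int) -> float:
    """Probability that at least one of `domains` independent domains is up.

    `per_domain` is a hypothetical availability between 0 and 1.
    """
    if not 0.0 <= per_domain <= 1.0:
        raise ValueError("per_domain must be between 0 and 1")
    if domains < 1:
        raise ValueError("domains must be at least 1")
    # The service is down only when every domain is down at the same time.
    return 1.0 - (1.0 - per_domain) ** domains


if __name__ == "__main__":
    for count in (1, 2, 3):
        print(count, f"{combined_availability(0.99, count):.6f}")
```

Real failures are rarely perfectly independent, so the result is best read as an upper bound rather than a guarantee.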

Fault domains

Fault domain resilience is an important aspect of OCI’s resilience patterns. A fault domain is a logical grouping of hardware and infrastructure within an availability domain, designed to minimize the risk of correlated failures. By default, OCI places instances and resources across multiple fault domains within an availability domain, ensuring that if only one fault domain experiences a failure or outage, the others can continue to operate as normal.

This approach provides an extra layer of resilience and fault tolerance within an availability domain by ensuring that your instances are distributed across multiple physical servers within a data center. Each fault domain is designed to be completely isolated from the others with its own power and networking infrastructure. This isolation helps to ensure that, if only one physical server experiences a failure or outage, the others continue to operate as normal, helping to prevent cascading failures and protect against hardware-related issues.
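The effect of spreading instances across fault domains can be illustrated with a small simulation. This is only a sketch: it assumes three fault domains per availability domain with the names shown (a detail not stated in this article), and the round-robin placement and helper names are hypothetical, not how OCI's own placement logic works.

```python
from itertools import cycle

# Assumption: three fault domains per availability domain, named as below.
FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]


def spread_instances(instance_names, fault_domains=FAULT_DOMAINS):
    """Assign each instance to a fault domain in round-robin order."""
    return dict(zip(instance_names, cycle(fault_domains)))


def survivors(placement, failed_fault_domain):
    """Return the instances that keep running when one fault domain fails."""
    return sorted(
        name for name, domain in placement.items()
        if domain != failed_fault_domain
    )


if __name__ == "__main__":
    placement = spread_instances(["app1", "app2", "app3", "app4"])
    print(placement)
    print(survivors(placement, "FAULT-DOMAIN-1"))
```

With four instances, losing any single fault domain still leaves at least two instances running, which is the point of the distribution.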

Backup and recovery

OCI provides a comprehensive, robust set of backup and recovery capabilities to help ensure that your applications and data are protected against data loss and corruption. These solutions are designed to be highly scalable and resilient, with automated backups, point-in-time recovery, and multiregion replication capabilities. They are also integrated with other OCI services, such as Object Storage and Block Storage, to provide a seamless experience for data protection and recovery.
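One step in any point-in-time recovery is choosing the backup to start from: the most recent one taken at or before the moment you want to return to. The sketch below illustrates only that selection step, with hypothetical timestamps and a hypothetical function name. It does not call any OCI API, and a real recovery would also replay changes made after the chosen backup.

```python
from datetime import datetime


def latest_backup_before(backup_times, target):
    """Return the most recent backup taken at or before `target`, or None."""
    candidates = [taken for taken in backup_times if taken <= target]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    # Hypothetical nightly backups taken at 02:00.
    nightly = [datetime(2023, 5, day, 2, 0) for day in (20, 21, 22)]
    print(latest_backup_before(nightly, datetime(2023, 5, 21, 15, 30)))
```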

Load balancing and traffic steering

The OCI Load Balancing service provides a highly available and scalable solution for distributing traffic across multiple instances and availability domains, providing a high level of resilience and protection against failures or disruptions. Load Balancing also provides advanced features, such as SSL termination, health checks, and session persistence, making it an ideal solution for modern, highly available architectures.

OCI Load Balancing continuously monitors the health of instances, using the health checks that you configure, and automatically removes unhealthy instances from the pool. This monitoring reduces the risk of sending traffic to a nonfunctional instance and isolates unhealthy instances before they affect the overall service.
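The idea of health-checked load balancing can be sketched in a few lines. The class below is a simplified, hypothetical model (its name and methods are illustrative and are not part of any OCI SDK): it rotates through backends in round-robin order and skips any backend whose latest health check failed.

```python
class BackendPool:
    """Round-robin pool that skips backends marked unhealthy."""

    def __init__(self, backends):
        self._order = list(backends)
        self._healthy = {backend: True for backend in self._order}
        self._next = 0

    def record_health_check(self, backend, healthy):
        """Store the result of the latest health check for a backend."""
        self._healthy[backend] = healthy

    def pick(self):
        """Return the next healthy backend, or raise if none is healthy."""
        for _ in range(len(self._order)):
            backend = self._order[self._next]
            self._next = (self._next + 1) % len(self._order)
            if self._healthy[backend]:
                return backend
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    pool = BackendPool(["app1", "app2", "app3"])
    pool.record_health_check("app2", False)
    print([pool.pick() for _ in range(4)])
```

When a later health check passes, calling `record_health_check` with `True` puts the backend back into rotation.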

The Oracle DNS service can distribute traffic across multiple regions, providing geographical resilience. With the multiregion traffic steering capability, you can route traffic to the region closest to each user’s location, so that users have a low-latency experience and can still access the service even if an entire region goes down.
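The steering decision described above amounts to picking the lowest-latency region among those that are still healthy. The following sketch models that decision with a hypothetical latency table and function name. The latency values are invented for illustration, and the code does not use the Oracle DNS API.

```python
def steer(user_location, latency_ms_by_location, healthy_regions):
    """Return the lowest-latency healthy region for a user location."""
    latencies = latency_ms_by_location[user_location]
    candidates = {
        region: latency
        for region, latency in latencies.items()
        if region in healthy_regions
    }
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical round-trip latencies in milliseconds.
    table = {"New York": {"Ashburn": 10, "Phoenix": 60}}
    print(steer("New York", table, {"Ashburn", "Phoenix"}))
    print(steer("New York", table, {"Phoenix"}))  # Ashburn is down
```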

Autoscaling

The OCI Autoscaling service provides a highly scalable and resilient solution for automatically scaling your applications in response to changes in demand. With Autoscaling, you can define scaling policies based on metrics, such as CPU utilization or request rate, or on a defined schedule, and OCI automatically adds or removes instances as needed to maintain the desired level of performance and availability.

Autoscaling is also integrated with other OCI services, such as Load Balancing and Compute, to provide a seamless experience. It also integrates with OCI’s instance pool feature, which allows you to create a pool of identical instances that can scale together. This integration provides a more efficient way to manage resources and ensures that all instances in the pool have the same configurations and are ready to handle traffic.
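A metric-based scaling policy can be thought of as a simple threshold rule. The sketch below is a hypothetical model of such a rule: the thresholds, limits, and function name are assumptions chosen for illustration and are not OCI defaults or OCI API calls.

```python
def desired_instance_count(
    current,
    cpu_percent,
    *,
    scale_out_above=80.0,  # hypothetical threshold
    scale_in_below=20.0,   # hypothetical threshold
    minimum=1,
    maximum=10,
    step=1,
):
    """Return the pool size a threshold-based policy would aim for."""
    if cpu_percent > scale_out_above:
        target = current + step
    elif cpu_percent < scale_in_below:
        target = current - step
    else:
        target = current
    # Never scale beyond the configured pool limits.
    return max(minimum, min(maximum, target))


if __name__ == "__main__":
    print(desired_instance_count(4, 90.0))  # high load: add an instance
    print(desired_instance_count(4, 10.0))  # low load: remove an instance
```

Keeping a gap between the scale-out and scale-in thresholds avoids adding and removing instances repeatedly when the metric hovers around a single value.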

Security

The OCI Security services provide a comprehensive security strategy that includes network security, identity and access management, and data encryption. Use OCI security services and features, such as Identity and Access Management (IAM), virtual cloud networks (VCNs) and security lists, Web Application Firewall (WAF), network security groups (NSGs), Key Management Service (KMS), and Cloud Guard, to increase resilience against security threats.

Implementing Oracle security services in your OCI architecture can help ensure that your system is resilient and protected from security threats. Designing and implementing a security strategy that covers all layers of the system is essential, from the network to the application layer. A well-designed and implemented security strategy can help ensure the resilience and availability of your system.

Resilient architecture patterns on OCI

As of April 2023, OCI offers services from 41 public cloud regions in 22 countries. Each Oracle Cloud region offers a consistent set of more than 100 cloud services designed to run any application, faster and more securely, for less. For the latest information, see Public Cloud Regions.

A graphic depicting the current and future Oracle Cloud regions across the globe.

By incorporating the foundational blocks into the OCI architecture, businesses can design a robust and flexible architecture capable of withstanding various disruptions. You can use these technologies in combination to create a comprehensive and effective architecture solution that provides a high level of protection for critical systems and data.

Let’s draw resilient architecture patterns on OCI for infrastructure, data, and application resilience.

Infrastructure resilience

This architecture includes the following features:

  • Three availability domains in one cloud region

  • Four application servers running the same code and functionality

  • One Oracle database with standby enabled

  • One active load balancer

The highly available and resilient architecture is deployed within a single OCI region. It provides protection against data center failure, hardware failure, and software failure. Redundancy is also built in by default through a standby load balancer. If a failure occurs, the standby load balancer automatically takes over the load of the primary load balancer.

All the application servers are deployed on different fault domains, which provides protection against hardware failure or power failure for application servers. Because the availability domains are spread across the region, application servers are also protected from a power outage or any availability domain-level disaster. Each of the active application servers reads from and writes to the Oracle Base Database service running on a virtual machine (VM) deployed within AD1. The database running in AD1 is protected by replicating the database blocks to a standby database in AD2. Because the standby database is kept up to date, the database can fail over to it if the primary fails.

A graphic depicting the architecture for the deployment across availability domains.

Data resilience

Oracle Maximum Availability Architecture (MAA) for the Database service provides a choice of reference architectures, including Bronze, Silver, Gold, and Platinum. Each reference architecture provides an optimal set of capabilities for data protection and data recovery during unplanned outages and planned maintenance events. The capabilities and service benefits increase as we move from Bronze to Silver to Gold to Platinum.

MAA reference architecture

A graphic depicting the tiers of the Maximum Availability Architecture for the OCI Database service.

Platinum MAA reference architecture

The Platinum reference architecture has the highest potential to provide availability and resilience for both planned events and unplanned outages. This architecture builds on the Gold reference architecture and extends it by adding Oracle GoldenGate replication technology on top. Oracle GoldenGate performs the logical replication across regions, while Oracle Data Guard applies the physical changes to the databases, which are in different availability domains. This architecture provides the highest level of resilience. You can schedule backups for local or remote sites.

A graphic depicting the Platinum level architecture, including recovery time and recovery point objectives.

Recovery time objective and recovery point objective for the Platinum MAA reference architecture

Outage type   Outage event                        Expected RTO   Expected RPO
Unplanned     Recoverable instance failure        Zero           Zero
Unplanned     Recoverable server failure          Zero           Zero
Unplanned     Data corruption, site failure       Zero           Zero to seconds
Planned       Reorganization                      Zero           Zero
Planned       Hardware or O/S maintenance         Zero           Zero
Planned       Most database patches               Zero           Zero
Planned       Database upgrades and patch sets    Zero           Zero
Planned       Platform migrations                 Zero           Zero
Planned       Application upgrades                Zero           Zero
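For readers less familiar with the two objectives: the recovery time objective (RTO) bounds how long the service may be unavailable, and the recovery point objective (RPO) bounds how much recent data may be lost. A minimal sketch of checking an outage against both objectives follows. The function name is hypothetical, and the five-second bound in the usage example is an assumed stand-in for "seconds", not a figure from Oracle.

```python
from datetime import timedelta


def meets_objectives(downtime, data_loss, rto, rpo):
    """Return True when an outage stayed within both recovery objectives.

    downtime:  how long the service was unavailable
    data_loss: time window of changes that could not be recovered
    """
    return downtime <= rto and data_loss <= rpo


if __name__ == "__main__":
    # Hypothetical site failure: no downtime, two seconds of lost changes.
    print(
        meets_objectives(
            downtime=timedelta(0),
            data_loss=timedelta(seconds=2),
            rto=timedelta(0),
            rpo=timedelta(seconds=5),  # assumed bound for "seconds"
        )
    )
```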

Application resilience

This solution architecture is deployed across the following components:

  • Two regions, such as Ashburn and Phoenix

  • Two availability domains in each cloud region

  • Two application servers running the same code and functionality and replicating to Phoenix

  • One Oracle database with an immediate standby in AD1 of the same region and a cross-region Data Guard standby in AD1 of region 2

  • One global load balancer

The highly available and resilient architecture is deployed across two regions: Ashburn and Phoenix. It provides protection against the failure of an entire region and, within a region, against data center failure, hardware failure, and software failure. Redundancy is built in by default for the global load balancer through a standby load balancer. If a failure occurs, the standby load balancer automatically takes over the load of the primary load balancer. All the application servers are deployed on different fault domains, which provides protection against hardware failure or power failure for application servers. Because the availability domains are spread across the region, application servers are also protected from a power outage or any availability domain-level disaster.

Each of the active application servers must read from and write to the Oracle Base Database service running on a VM deployed within AD2 in the Ashburn region (region 1). The database running in AD2 is protected by replicating the database blocks to an immediate standby database in AD1 and to a regional standby database in the Phoenix region, so that a copy of the database remains available if either region fails.

A graphic depicting the architecture for the deployment across two regions.
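The database failover order in this architecture can be modeled as a small decision function. The sketch below encodes the three database copies described above (primary in Ashburn AD2, immediate standby in Ashburn AD1, regional standby in Phoenix AD1) and returns the first copy that is still reachable. The class, the function names, and the way failures are expressed are hypothetical, and the code does not drive Oracle Data Guard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseCopy:
    role: str
    region: str
    availability_domain: str


# Preference order: primary, then local standby, then regional standby.
COPIES = (
    DatabaseCopy("primary", "Ashburn", "AD2"),
    DatabaseCopy("immediate standby", "Ashburn", "AD1"),
    DatabaseCopy("regional standby", "Phoenix", "AD1"),
)


def is_available(copy, failed):
    """`failed` holds region names ("Ashburn") or domains ("Ashburn/AD2")."""
    domain = f"{copy.region}/{copy.availability_domain}"
    return copy.region not in failed and domain not in failed


def serving_database(failed):
    """Return the first database copy that is still available."""
    for copy in COPIES:
        if is_available(copy, failed):
            return copy
    raise RuntimeError("no database copy is available")


if __name__ == "__main__":
    print(serving_database(set()).role)
    print(serving_database({"Ashburn/AD2"}).role)
    print(serving_database({"Ashburn"}).role)
```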

Conclusion

Cloud resilience plays a critical role in ensuring operational continuity and helping organizations achieve their business goals. With the right resilience strategy, organizations can build trust and reduce costs. You can access the resilient architectures built on OCI in the Architecture Center.

Try Oracle Cloud Infrastructure for yourself and check out the Oracle Cloud Free Tier with US$300 credits for a 30-day free trial. Free Tier also includes several “Always Free” services that are available for an unlimited time, even after your credits expire.


Arvind Bhope

Master Principal Solution Engineer

Arvind Bhope, a visionary leader and accomplished Cloud Architect at Oracle, is dedicated to transforming businesses through innovative technology solutions. With a proven track record of success, Arvind brings a unique blend of strategic thinking, technical expertise, and customer-centric approach to his role.

Arvind's passion lies in leveraging cutting-edge technologies to address complex business challenges and drive meaningful outcomes. He excels in architecture, scalable and secure solutions, enabling organisations to optimise their operations and unlock new opportunities for growth. His deep knowledge of cloud computing, data management, and digital transformation has made him a trusted advisor to numerous enterprise clients.

Arvind is known for his exceptional communication skills and ability to forge strong relationships through collaboration with cross-functional teams, including product managers, account management teams, and other critical stakeholders, to deliver exceptional results. Arvind has been recognised for his outstanding contributions, receiving accolades such as the Oracle Innovator Award and Sales Excellence Award.

Arvind's deep technology skills span creating solutions for multicloud deployments, architecture, solutioning, on-premises environments, databases, and more. In his current role, he is involved in discovery sessions, demos, POCs, workshops, customer tech days, and in-house initiatives on various Oracle technologies with customers.

Arvind's continuous pursuit of knowledge is evident through his extensive certifications in Oracle technologies, including Oracle Cloud Infrastructure (OCI) and Database management. With his unwavering commitment to excellence, Arvind is poised to shape the future of technology and empower organisations to thrive in the digital era.

