HA App Dev: Architecting for High Availability

Introduction

High availability (HA) is about keeping applications running even when something breaks. The most reliable way to achieve it is by building in redundancy at every tier of your architecture. That means avoiding single points of failure in your application servers, data stores, and even entire geographic regions.

In this post, we’ll peel back the layers of HA one by one, starting from the simplest, most fragile setup and working up to more advanced designs, with redundancy as the guiding principle:

Start with the fragile baseline of a single server and a single data store
Add resilience at the middle server tier with multiple servers behind a load balancer
Introduce redundancy at the data store tier
Strengthen your architecture by distributing resources across multiple zones and fault-isolation groups
Expand to a geographically distributed, multi-region architecture for greater protection

At each stage, we’ll discuss the trade-offs, the risks being addressed, and the improvements you might expect. Whichever platform you use, the core strategy applies across clouds and on-premises alike: build in redundancy, spread out risk, and aim for fast, reliable recovery.

Architecture at a Glance

Baseline: Non-HA Architecture

A starting point for many applications looks like this:

A single web server running the application
One data store in the same location

This setup works fine for proof-of-concept prototypes, but it’s fragile. Let’s look at what typically causes server outages:

Power failure (e.g., lost feeds or faulty power supplies)
Cooling failure (e.g., failed fans or HVAC issues)
Hardware faults (disks, memory, motherboard, etc.)
Site emergencies (fire, water damage, etc.)
Operating system crashes
Hypervisor failure, if using virtualization
Software failure
Network failure

If the server fails, whether due to hardware issues, an operating system crash, or a power outage, the application stops. If the data store becomes unavailable, the application can no longer read or write the information it needs.

With only one server and one data store, any single failure brings the entire system down.
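To put a rough number on that fragility, here is a minimal sketch of the arithmetic. It assumes the server and the data store fail independently and that both must be up for the application to work. The 99% figures are hypothetical and chosen only for illustration; they are not measurements from our testbed.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def series_availability(component_availabilities):
    """Availability of a chain in which every component must be up at once."""
    return prod(component_availabilities)


def expected_downtime_hours(availability, hours=HOURS_PER_YEAR):
    """Expected hours of downtime over the given period."""
    return (1.0 - availability) * hours


# Hypothetical availability figures, for illustration only.
server = 0.99
data_store = 0.99

baseline = series_availability([server, data_store])
print(f"baseline availability: {baseline:.4f}")
print(f"expected downtime: {expected_downtime_hours(baseline):.1f} hours/year")
```

Because the components sit in series, the combined availability is lower than that of either component alone.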

So what’s next? First up: making the application tier more reliable.

First Step: Redundancy at the Application Tier

The first step toward HA is usually the application tier. Instead of relying on a single server, deploy two or more with a mechanism to distribute requests between them.

A load balancer fills this role. It sits in front of the servers and:

Directs incoming traffic to whichever server is available
Monitors health so requests aren’t sent to failed servers
Keeps users online even if one server goes down
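The three responsibilities above can be sketched in a few lines. This is a simplified, in-memory illustration of round-robin selection with health tracking, not the OCI Load Balancer or any real product's API. The class names, the server names, and the failure_threshold parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool = True
    consecutive_failures: int = 0


class RoundRobinBalancer:
    """Toy load balancer: round-robin over backends that pass health checks."""

    def __init__(self, backends, failure_threshold=3):
        self.backends = list(backends)
        self.failure_threshold = failure_threshold
        self._next = 0

    def record_health_check(self, backend, passed):
        """Update a backend's state from the result of one health probe."""
        if passed:
            backend.consecutive_failures = 0
            backend.healthy = True
        else:
            backend.consecutive_failures += 1
            # Only mark the backend down after several failed probes in a row.
            if backend.consecutive_failures >= self.failure_threshold:
                backend.healthy = False

    def pick(self):
        """Return the next healthy backend, skipping unhealthy ones."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend.healthy:
                return backend
        raise RuntimeError("no healthy backends available")


lb = RoundRobinBalancer([Backend("app-1"), Backend("app-2")], failure_threshold=2)
lb.record_health_check(lb.backends[0], passed=False)
lb.record_health_check(lb.backends[0], passed=False)  # app-1 is now marked down
print([lb.pick().name for _ in range(3)])  # ['app-2', 'app-2', 'app-2']
```

Note the threshold: a backend is taken out of rotation only after repeated failed probes, so there is a short window in which requests can still reach a server that has just failed.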

In our testbed, where the web app resides on Oracle Cloud Infrastructure (OCI), we used the OCI Load Balancer to front multiple application servers. OCI provisions load balancer nodes across different fault domains (rack-level isolation within a data center). If one node fails, a standby takes over, and health monitors keep requests flowing to healthy servers.

It’s important to note that while failover is automatic, some brief interruption typically occurs while the system detects the outage and reroutes connections. Zero downtime can’t be guaranteed, but properly architected redundancy ensures the impact remains minimal.

At this stage, the server tier can survive individual failures. But redundancy only works if the servers themselves aren’t all vulnerable to the same underlying infrastructure.
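A quick calculation shows why that caveat matters. The sketch below assumes independent server failures for the redundant case, then adds a single shared dependency (for example, one rack's power feed) that every server relies on. All availability figures are hypothetical.

```python
def redundant_availability(single, n):
    """Availability of n interchangeable servers when at least one must be up.

    Assumes the servers fail independently of one another.
    """
    return 1.0 - (1.0 - single) ** n


def with_shared_dependency(single, n, shared):
    """Same tier, but every server also depends on one shared component."""
    return shared * redundant_availability(single, n)


# Hypothetical figures, for illustration only.
one_server = 0.99
shared_power = 0.999

print(f"two independent servers: {redundant_availability(one_server, 2):.4f}")
print(f"two servers, shared power: {with_shared_dependency(one_server, 2, shared_power):.4f}")
print(f"ten servers, shared power: {with_shared_dependency(one_server, 10, shared_power):.4f}")
```

However many servers you add, the tier can never be more available than the component they all share.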

Step Up: High Availability at the Data Store Tier

Even with multiple servers and a load balancer, the application still depends on a single data store. If it goes offline, the whole system stops. To prevent this, the data store also needs redundancy:

Running multiple instances of the data store provides redundancy
Clustering technologies allow multiple nodes to actively serve requests while sharing the same underlying storage

The key goal is simple: the data store should never be a single point of failure.
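From the application's point of view, the idea looks roughly like the sketch below: several nodes serve the same data, and a read moves on to the next node when one is unavailable. This is a generic in-memory illustration, not how any particular database or client driver implements failover. DataStoreNode, NodeUnavailable, and read_with_failover are hypothetical names.

```python
class NodeUnavailable(Exception):
    """Raised when a data store node cannot serve requests."""


class DataStoreNode:
    """Hypothetical stand-in for one data store instance."""

    def __init__(self, name, shared_storage):
        self.name = name
        self.up = True
        self._storage = shared_storage  # every node sees the same data

    def read(self, key):
        if not self.up:
            raise NodeUnavailable(self.name)
        return self._storage[key]


def read_with_failover(nodes, key):
    """Try each node in turn and return (value, name of the node that answered)."""
    last_error = None
    for node in nodes:
        try:
            return node.read(key), node.name
        except NodeUnavailable as exc:
            last_error = exc
    raise RuntimeError("all data store nodes are unavailable") from last_error


storage = {"order:1": "shipped"}
nodes = [DataStoreNode("db-1", storage), DataStoreNode("db-2", storage)]
nodes[0].up = False  # simulate losing one node
print(read_with_failover(nodes, "order:1"))  # ('shipped', 'db-2')
```

In practice this retry logic usually lives in the database client or connection pool rather than in application code.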

In our testbed on OCI, we implemented a two-node Oracle RAC (Real Application Clusters) setup. RAC allows multiple database instances to run on separate servers across different fault domains while accessing the same data, so this design provides both redundancy and load balancing at the database tier.

Alternatively, many organizations achieve high availability (and performance) with Oracle Exadata. An Exadata rack delivers strong resilience within the system itself: redundant database and storage servers, networking, and power, all designed specifically for Oracle Database workloads and tightly integrated with RAC. This intra-platform redundancy protects against many component failures.

Whether you deploy RAC on general-purpose infrastructure or leverage Exadata’s integrated platform, Oracle offers multiple paths to robust HA at the database tier.

During testing, we simulated partial outages, such as losing one database node. RAC drained connections from the failed node, and the system continued to operate without downtime. With both the server tier and the data store tier protected, the most obvious single points of failure are eliminated within a single site. We’ll dive deeper into our test scenarios, tooling, and results in a future post.

Placement Matters: Building Resilience Across Zones and Fault-Isolation Groups

Running multiple servers is only half the story. If they all share the same rack, switch, or power supply, then a single hardware failure could still take them all offline. To be resilient, servers and data stores must be spread across different failure boundaries so that no single physical issue can take down the whole tier:

Don’t place all servers on the same rack or power circuit.
Ensure they don’t depend on the same cooling unit, network switch, or physical host.
Think in layers: racks → rooms → entire data centers → regions.
The same placement principles apply to both servers and data stores.

In our deployment, which runs on OCI, these boundaries are organized into three layers:

Fault Domains (FDs): Independent sets of physical hardware within an availability domain, protecting against rack-level failures
Availability Domains (ADs): Separate data centers within a region, each with independent power, cooling, and networking
Regions: Distinct geographic areas containing multiple ADs, protecting against complete site outages

By distributing resources across FDs, ADs, and Regions, you reduce the risk of “all eggs in one basket” — whether that basket is a power circuit, a network switch, a building, or a city.
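As a small illustration of the placement idea, the sketch below spreads servers across fault domains in round-robin order and then asks how many servers survive the loss of the most heavily used domain. The function names and the server and domain labels are hypothetical. Real platforms expose placement through their own APIs, which are not shown here.

```python
from collections import Counter
from itertools import cycle


def spread(servers, fault_domains):
    """Assign servers to fault domains in round-robin order."""
    return dict(zip(servers, cycle(fault_domains)))


def survivors_if_domain_fails(placement, failed_domain):
    """Servers still running after one whole fault domain is lost."""
    return [s for s, d in placement.items() if d != failed_domain]


def worst_case_survivors(placement):
    """Number of servers left if the most heavily used domain fails."""
    per_domain = Counter(placement.values())
    return len(placement) - max(per_domain.values())


servers = ["app-1", "app-2", "app-3", "app-4"]

spread_out = spread(servers, ["FD-1", "FD-2", "FD-3"])
all_in_one = spread(servers, ["FD-1"])

print(spread_out)
print(survivors_if_domain_fails(spread_out, "FD-1"))  # ['app-2', 'app-3']
print(worst_case_survivors(spread_out))  # 2
print(worst_case_survivors(all_in_one))  # 0
```

The same check applies at every layer: substitute availability domains or regions for fault domains and the question stays the same.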

This placement strategy lays the groundwork for the next layer of resilience: protecting the application against the loss of an entire region.

Full Resilience: Multi-Region HA

Mission-critical apps often require protection beyond a single site. What happens if an entire data center or region goes offline? To ensure business continuity and data recovery, a disaster recovery (DR) strategy must span geographically separate regions and include the application/middle tier, not just the database.

A resilient multi-region deployment includes:

Application/middle tier servers in both regions, with software, configuration, and file systems replicated
Load balancers at each site
Replicated or clustered data stores in each region
A DNS or traffic manager to redirect users when the primary region is unavailable

Although our web app example focused on single-region HA, let’s summarize the Oracle technologies that support cross-region resilience:

Oracle RAC provides local high availability within a single region.
Oracle Data Guard can replicate the database to a standby region for disaster recovery.
DR failover requires updating DNS or using a traffic manager, since standby regions have different public IPs.
While automated failover is possible, there will typically be a short interruption during the switch.

Because regions don’t share a single IP address, the address behind the application URL changes when traffic moves to the standby site. That means a DNS update or traffic manager is required to redirect users. Inside one region, the load balancer hides node failures. Across regions, you need that extra step.
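The redirect step can be sketched as a priority-ordered lookup: send users to the primary region while it is healthy and to the standby otherwise. This is a toy model of what a DNS failover policy or traffic manager does, not a real service's API. The RegionEndpoint class and region names are hypothetical, and the addresses come from the IP ranges reserved for documentation.

```python
from dataclasses import dataclass


@dataclass
class RegionEndpoint:
    region: str
    address: str  # hypothetical public address of that region's load balancer
    healthy: bool = True


def resolve(endpoints):
    """Return the first healthy endpoint, in priority order."""
    for endpoint in endpoints:
        if endpoint.healthy:
            return endpoint
    raise RuntimeError("no healthy region available")


primary = RegionEndpoint("primary-region", "192.0.2.10")
standby = RegionEndpoint("standby-region", "198.51.100.10")
endpoints = [primary, standby]

print(resolve(endpoints).address)  # 192.0.2.10

primary.healthy = False  # simulate a region-wide outage
print(resolve(endpoints).address)  # 198.51.100.10
```

With plain DNS, clients keep using cached answers until the record's time-to-live expires, which is one reason a cross-region switch is rarely instantaneous.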

Final Thoughts

High availability is about more than uptime; it is about trust, continuity, and readiness for the unexpected.

In this post we peeled the onion of HA options, moving through the layers:

The fragile baseline (a single server and a single data store)
Site-level HA (servers behind a load balancer and a clustered data store), which we implemented in our web app testbed
Cross-region HA, where Oracle RAC and Data Guard extend resilience across sites with disaster recovery capabilities

Key takeaways:

Build redundancy at every layer to avoid single points of failure wherever they appear
Placement matters: spread risk across fault domains, availability domains, and regions
Disaster recovery is essential: local HA keeps you running, but cross-region failover protects you from site-wide outages
Don’t just design: test your system under real failure conditions

Authors

Irina Granat

Senior Director

Richard Exley

Consulting Member of Technical Staff, Oracle Database
