Many developers focus on making sure their code works correctly under ideal conditions, but what happens when there is an outage? If an application instance crashes, or we take it down for maintenance, how does that affect the end user? It can be more than an annoyance: even a few minutes of downtime can cost real money and do lasting damage to users' trust in your brand.

Introduction

Most developers optimize for the happy path. However, failures are inevitable in production, and a single crash or maintenance window can ripple outward, costing money, frustrating users, and eroding trust. High Availability (HA) isn’t just about servers and databases anymore; it’s about the way we design, build, and architect applications from day one.

The Oracle Maximum Availability Architecture (MAA) team has spent over 20 years helping enterprises achieve uptime at the platform level. But HA doesn’t stop at the infrastructure layer. With this project, we asked: What if developers applied the same discipline inside the application itself?
Instead of treating resilience as an afterthought, what if we designed it in from day one?

Key Takeaway: High availability is not just about being “up” — it’s about being reliably fast and responsive, even when things go wrong.

This is the first post in our HA App Development Best Practices series. Throughout this series, we will share the technical principles, real-world hurdles, and best practices we discovered while building and testing a representative application. We will not stick to theory — we’ll show how architectural choices and code patterns measurably improve availability.

Let’s jump in.

Why HA Matters for Your App: The Project Premise

Traditionally, HA was an ops problem. Today, users expect responsiveness despite failures, so developers must make HA part of application design, not just deployment.

We started this project to demonstrate that availability should be integrated from the ground up, rather than just added as an afterthought. So we developed a real application, simulated real failures, tracked everything, and continually experimented to find out what actually helps.

Availability isn’t just about being “up.” An application might return a 200 OK, but if it takes too long to respond, users still perceive it as a failure. Real availability means the app is up, it’s working as expected, and it’s responding fast enough to keep people moving.

While some downtime is inevitable, our goal was to maximize uptime and measure practical resilience beyond just “online/offline” status. By focusing on users’ real experiences — speed, correctness, and reliability — we could improve the app’s true availability.

Project Development Uncovered

To explore HA from a developer’s perspective, we built a simple RESTful service with two endpoints:

  • GET /user/{uid}: retrieve user info
  • PUT /user: insert or update user info

The app is deliberately minimal. Our goal was not complex business logic; it was to test how architecture and code choices impact availability, performance, and recovery. By keeping it simple, we aimed to make the results broadly applicable.
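
To make that concrete, here is a minimal sketch of what the two endpoints could look like in Spring Boot, one of the stacks we tested. The UserRecord and UserRepository types are illustrative placeholders rather than the project's actual code:

```java
import java.util.Optional;

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

// Hypothetical record and repository, included only to keep the sketch self-contained.
record UserRecord(String uid, String name) {}

interface UserRepository {
    Optional<UserRecord> findById(String uid);
    void upsert(UserRecord user);
}

@RestController
class UserController {

    private final UserRepository users;

    UserController(UserRepository users) {
        this.users = users;
    }

    // GET /user/{uid}: 200 with the user's data, or 404 when the id is unknown
    @GetMapping("/user/{uid}")
    ResponseEntity<UserRecord> getUser(@PathVariable("uid") String uid) {
        return users.findById(uid)
                .map(ResponseEntity::ok)
                .orElse(ResponseEntity.notFound().build());
    }

    // PUT /user: insert or update (upsert) the user, then acknowledge with 200
    @PutMapping("/user")
    ResponseEntity<Void> putUser(@RequestBody UserRecord user) {
        users.upsert(user);
        return ResponseEntity.ok().build();
    }
}
```

The contract is what matters here, not the framework syntax: a lookup either returns the user with HTTP 200 or answers with 404, and a PUT performs an upsert and acknowledges with 200.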

We defined success and failure in strict, user-facing terms:

  • HTTP 200 for valid responses
  • HTTP 404 when data isn’t found
  • No HTTP 5xx errors allowed — any 5xx was treated as downtime
  • Responses always under 50 ms — any outlier counted as a service interruption

Measuring HA in Real Time

We established strict SLAs to create real-world conditions. Success was defined as HTTP 200 responses that return a name, with HTTP 404 used for unknown IDs. Failures included any HTTP 5xx status.

For performance, we drove a load of 1,000 requests per second, 10 percent of which were PUT requests. We enforced a maximum latency of 50 ms per request, whether the request ultimately succeeded or failed, with no outliers allowed.

To test efficiency under pressure, we limited the application to a maximum of 8 database connections while still delivering 1,000 requests per second.
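
For illustration, this is roughly how such a cap could be expressed with Oracle Universal Connection Pool (UCP) in Java. The connection details are placeholders, and settings like the one-second wait timeout are example choices rather than the values we used:

```java
import java.sql.SQLException;

import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

public final class PoolConfig {

    // A rough UCP configuration that caps the application at 8 database connections.
    // The URL and credentials are placeholders; tune the settings for your own setup.
    static PoolDataSource createPool() throws SQLException {
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
        pds.setConnectionFactoryClassName("oracle.jdbc.pool.OracleDataSource");
        pds.setURL("jdbc:oracle:thin:@//db-host:1521/service_name");
        pds.setUser("app_user");
        pds.setPassword("app_password");

        pds.setInitialPoolSize(8);       // open all connections up front
        pds.setMinPoolSize(8);           // never shrink below the cap
        pds.setMaxPoolSize(8);           // hard ceiling: 8 connections total
        pds.setConnectionWaitTimeout(1); // fail fast (seconds) rather than queueing indefinitely
        return pds;
    }
}
```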

Our availability definition was strict: any 5xx response or any request with latency greater than 50 ms counted as a service outage. The goal was to achieve maximum availability during outages.
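
Expressed as code, that rule is simple. This is an illustrative sketch of the check, not the test harness itself:

```java
import java.time.Duration;

// Illustrative only: the availability verdict applied to every request.
// 200 and 404 both count as success; any 5xx or any response slower than
// 50 ms is treated as downtime.
final class AvailabilityRule {

    static final Duration LATENCY_BUDGET = Duration.ofMillis(50);

    static boolean countsAsAvailable(int httpStatus, Duration latency) {
        boolean serverError = httpStatus >= 500 && httpStatus < 600;
        boolean tooSlow = latency.compareTo(LATENCY_BUDGET) > 0;
        return !serverError && !tooSlow;
    }
}
```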

This intentionally tough SLA let us see which frameworks truly handled stress, and pushed the system to deliver not just uptime, but fast, reliable, and predictable service.

By changing how we define availability, we moved the conversation away from just asking if it is online to asking if it is responsive and reliable, especially under stress.

Stress Testing for the Real World

Once we had the app up and running, we put it to the test under heavy load, triggering both planned and unplanned outages, including:

  • Database connection dropouts
  • Server crashes
  • Network issues

We weren’t just testing for smooth sailing; we aimed to replicate everything that could go wrong, since that’s the reality in production.

We wanted to know not just if the system could survive, but how fast it detected, responded, and recovered:

  • Response time: how quickly it handled both successful and failed requests when things got busy
  • Resilience: how it behaved during those failure scenarios
  • Recovery speed: how fast the system bounced back after going down (see the sketch after this list)
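
To make the last metric concrete, here is one simplified way recovery time can be derived from an ordered per-request log: the gap between the first failed request of an outage and the next successful one. The types and names are illustrative, not taken from our harness:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;

// Illustrative only: a per-request log entry is just a timestamp plus the
// availability verdict for that request.
record RequestOutcome(Instant at, boolean available) {}

final class RecoveryTime {

    // One simplified view of "how fast we bounced back": the duration from the
    // first failed request of an outage until the next successful request.
    static Optional<Duration> firstOutageRecovery(List<RequestOutcome> log) {
        Instant outageStart = null;
        for (RequestOutcome r : log) {
            if (!r.available() && outageStart == null) {
                outageStart = r.at();                                      // outage begins
            } else if (r.available() && outageStart != null) {
                return Optional.of(Duration.between(outageStart, r.at())); // recovered
            }
        }
        return Optional.empty(); // no outage observed, or it never recovered in this window
    }
}
```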

Same App, Eight Frameworks

To make the results meaningful, we ported the same app across eight modern stacks, and held every framework to the same strict SLA:

  • Performance: 1,000 requests per second with a strict 50 ms service level agreement
  • Efficiency: limit to 8 database connections to really test connection pooling and avoid resource overloads
  • Availability: any 5xx errors or slow responses mean hard downtime, no exceptions

These constraints exposed weak spots, trade-offs, and optimizations across frameworks, providing apples-to-apples comparisons of languages, frameworks, and driver maturity. The eight stacks were:

  • Java Servlet
  • Spring Boot with UCP and HikariCP
  • Node.js (JavaScript)
  • Node.js (TypeScript)
  • .NET C#
  • Python
  • Rust
  • Go

Spoiler: not every stack handled failure gracefully. We’ll dive into those differences in upcoming posts.

Built and Tested on Oracle Cloud

We ran all of this on Oracle Cloud Infrastructure (OCI) with:

  • Oracle AI Database 26ai
  • Autonomous AI Database + APEX for storing and visualizing results

That setup gave us a real-world environment, not just lab experiments, while letting us analyze results quickly with dashboards and charts.

Although we chose Oracle Cloud for testing, the HA concepts, principles, and lessons are broadly applicable across cloud providers and deployment environments.

Oracle Cloud Infrastructure test environment with OCI Load Balancer, redundant app tier, HA database and Oracle APEX.

In upcoming blog posts, we’ll dive into the architecture of this test harness, explore how each component contributed to HA, and share real-world test results and lessons learned.

What’s Next?

This is just the foundation. Over the next posts, we’ll peel back the layers of resilience step by step, from application architecture and load balancers to connection pools and real failover tests:

  • Designing the application architecture for high availability
  • How load balancers detect and respond to outages
  • Connection pooling strategies across frameworks
  • Real failover scenarios, both planned and unplanned
  • Actual test results, logs, and graphs

We’ll also share practical tips and sample code patterns to help you design for HA from the start — regardless of your chosen tech stack.

Our aim is to make HA practical for developers. Because in the end, high availability isn’t something you sprinkle on after coding — it’s something you design into every layer from the very first line.

While our test criteria are intentionally strict, they highlight how much headroom different platforms provide before user experience suffers. The trade-off between cost, complexity, and real HA is one every engineering team must weigh.

Stay tuned!