Sagas Are Great. What’s the Problem?

August 31, 2023 | 6 minute read
Todd Little
Chief Architect, Transaction Processing Products

This blog post examines the Saga pattern for achieving eventual consistency in a microservices-based application.  Sagas are promoted as the solution to scalable distributed transactions because resources are not locked during the execution of a saga.  However, sagas give up isolation in the traditional ACID transaction sense.  This lack of isolation can cause problems that are difficult, if not impossible, to solve.  We’ll look at how sagas are often implemented and what needs to be done to compensate for the lack of isolation.

Introduction

Microservices, like most other architectural patterns, must deal with the issue of data consistency.  In fact, microservices exacerbate the problem of data consistency, as the typical microservice has its own state that is often stored in its own database.  While the microservice itself may be stateless, as that is a common microservice principle, there is almost always some state information behind it, whether that state is stored in a database, a messaging system, or the microservice itself.  Another microservice principle is that microservices don’t share state, so effectively they each have their own database to store their required state.

This is all great and awesome, but what does it have to do with me?  It matters because most microservices eventually run into the dual write problem.  A dual write problem occurs when a business transaction, as opposed to a database transaction, needs to ensure the atomic update of two or more stores of state.  The simplest example is transferring money from one account to another account.  One can imagine creating a “transfer” microservice that transfers funds from one account to another account.  If the accounts are both in the same data store, i.e., database, it might be possible to use a local database transaction.  However, in the microservices world, the accounts may be stored in different data stores.  Even if they use the same database, microservices typically can’t share a local database connection, so they can’t share a local database transaction.  In this scenario the “transfer” microservice would call the “withdraw” microservice on the first account and call the “deposit” microservice on the second account.

That’s all good and fine until an error occurs.  Let’s say the “withdraw” succeeds, but the “deposit” fails.  In this case, the first account would be out the money transferred, and the second account wouldn’t have the funds.  This is the fundamental dual write problem that distributed transactions attempt to solve: the “withdraw” microservice and the “deposit” microservice must each successfully write to their data store, or neither of them writes to their data store.
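To make the failure concrete, here is a minimal sketch in Python.  The `AccountStore` class and the `transfer` function are hypothetical stand-ins for the microservices described above, each with its own private state; nothing here reflects a real banking API.

```python
class DepositUnavailable(Exception):
    """Raised when the (hypothetical) deposit service cannot be reached."""


class AccountStore:
    """Stand-in for one microservice's private data store."""

    def __init__(self, balance, fail_deposits=False):
        self.balance = balance
        self.fail_deposits = fail_deposits

    def withdraw(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount  # committed locally, right away

    def deposit(self, amount):
        if self.fail_deposits:
            raise DepositUnavailable("deposit service is down")
        self.balance += amount


def transfer(source, target, amount):
    """Naive transfer: two independent writes with no coordination."""
    source.withdraw(amount)
    target.deposit(amount)  # if this raises, the withdrawal is not undone


source = AccountStore(balance=100)
target = AccountStore(balance=0, fail_deposits=True)

try:
    transfer(source, target, 40)
except DepositUnavailable:
    pass  # the caller sees an error, but the first write already happened

total_after_failed_transfer = source.balance + target.balance
print(total_after_failed_transfer)  # 60: the 40 withdrawn is in neither account
```

The two writes are individually correct.  What is missing is anything that ties their outcomes together.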

There are many solutions to the dual write problem, but ignoring the problem likely leads to disaster.  Imagine an application that doesn’t use a distributed transaction model and implements the above funds transfer microservice.  Sooner or later there will be failures in which the withdraw succeeds and the deposit fails.  If the amount is $5.00, maybe it doesn’t matter and can be dealt with via manual compensation.  However, if the amount is $5,000,000, someone is going to be very unhappy that their $5,000,000 is missing, and it may take some time to find.

Sagas

I want to examine the saga pattern as a solution to the dual write problem and the issues one may encounter using this pattern.  This pattern is based upon the idea of eventual consistency, meaning that at times the system may appear to be inconsistent, but over time it will become consistent.  What this pattern gives up in terms of normal ACID transaction properties is isolation.  While a saga is in progress, one or more local data stores may be updated independently.  In the saga model, as those updates are made, they are committed to the local data store and visible to others.  At the completion of the saga, all the participants have the option of performing some additional state management to ensure their state is consistent with other participants.  Likewise, if the saga is aborted, each of the participants has the option to perform some compensating actions.

Let’s examine this pattern for reserving open seating tickets at a concert.  A reservation service can reserve a number of seats or release a number of already reserved seats.  A payment service is also needed to pay for reservations; it supports operations to make a payment (withdraw/debit) or receive a refund (deposit/credit).  To ensure consistency across the services we’ll use sagas, so we don’t pay for tickets we don’t receive.  The ticketing agency will:

  1. Start a saga
  2. Call the reservation service to reserve a number of seats
  3. Call the payment service to make a payment
  4. Complete the saga

Or

  1. Start a saga
  2. Call the reservation service to release a number of seats
  3. Call the payment service to make a refund
  4. Complete the saga
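The steps above can be sketched as a tiny in-process coordinator.  This is only an illustration under stated assumptions: the `Saga` class, the `state` dictionary, and the `reserve`, `release`, `pay`, and `refund` functions are all hypothetical, and a real saga coordinator would also need durable logging and remote calls.

```python
class Saga:
    """Run steps in order; if one fails, undo the completed ones in reverse."""

    def __init__(self):
        self._compensations = []
        self.status = "active"

    def step(self, action, compensation):
        try:
            action()
        except Exception:
            self.compensate()
            raise
        self._compensations.append(compensation)

    def compensate(self):
        while self._compensations:
            self._compensations.pop()()  # most recent step is undone first
        self.status = "compensated"

    def complete(self):
        self._compensations.clear()
        self.status = "completed"


# Hypothetical participant state; in practice each lives in its own service.
state = {"seats_available": 60, "customer_funds": 100}


def reserve(seats):
    if seats > state["seats_available"]:
        raise ValueError("not enough seats")
    state["seats_available"] -= seats  # visible to everyone immediately


def release(seats):
    state["seats_available"] += seats


def pay(amount):
    if amount > state["customer_funds"]:
        raise ValueError("insufficient funds")
    state["customer_funds"] -= amount


def refund(amount):
    state["customer_funds"] += amount


def buy_tickets(seats, price_per_seat):
    cost = seats * price_per_seat
    saga = Saga()                                              # 1. start a saga
    try:
        saga.step(lambda: reserve(seats), lambda: release(seats))  # 2. reserve
        saga.step(lambda: pay(cost), lambda: refund(cost))         # 3. pay
    except ValueError:
        return saga.status                                     # already compensated
    saga.complete()                                            # 4. complete
    return saga.status


first = buy_tickets(2, 30)   # succeeds: 58 seats left, 40 in funds
second = buy_tickets(2, 30)  # payment fails, so the reservation is released
print(first, second, state)
```

Note that in the second purchase the two seats were briefly gone from inventory before being put back.  Any other client looking at the seat count in that window would have seen the intermediate value.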

A simplistic implementation of the reservation service would simply deduct or increase the number of seats available.  A slightly more sophisticated version also keeps track of how many seats someone has reserved so they can't release more seats than they have reserved.  Additionally, the service needs to track how many seats each saga has reserved or released, so that it knows what to undo if the saga must be compensated.  Let’s look at a potential flow in simplest form:

  1. Client C1 has previously reserved 50 seats
  2. The total number of seats remaining is 60
  3. Client C1 decides to cancel his reservation
  4. Ticketing agency:
    1. Starts saga S1 and calls reservation service to release the 50 seats
    2. The reservation service adds the 50 seats back into inventory so the total number of seats available is now 110. 
    3. Then also calls the payment service to refund the purchase
  5. Meanwhile, before S1 completes, another client tries to reserve 80 seats
  6. Ticketing agency:
    1. Starts saga S2 and calls the reservation service to reserve 80 seats
    2. The reservation service finding 110 seats available honors the request and reduces the number of available seats to 30
    3. Ticketing agency then calls the payment service to withdraw the necessary funds
    4. Ticketing agency then completes saga S2.
  7. Meanwhile the ticketing agency has been processing S1 when it discovers that S1 must be compensated.  Perhaps it was unsuccessful in refunding the payment or some other problem occurred.  As a result, it tries to compensate saga S1.  The reservation service can’t compensate the request because it would need to take back the 50 seats it released.  However, there are only 30 seats available, so compensation in this case is impossible.
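The flow above can be replayed with a small sketch.  `NaiveReservationService` is a hypothetical class written for this illustration; it applies every update directly to inventory and keeps a per-saga log so it knows what to undo.

```python
class CompensationError(Exception):
    """Raised when a saga's effects can no longer be undone."""


class NaiveReservationService:
    """Applies updates directly to inventory and logs them per saga."""

    def __init__(self, available, held):
        self.available = available
        self.held = dict(held)  # client -> seats currently reserved
        self.saga_log = {}      # saga id -> (client, change in client's seats)

    def release(self, saga_id, client, seats):
        if seats > self.held.get(client, 0):
            raise ValueError("cannot release more seats than reserved")
        self.held[client] -= seats
        self.available += seats  # immediately visible to other sagas
        self.saga_log[saga_id] = (client, -seats)

    def reserve(self, saga_id, client, seats):
        if seats > self.available:
            raise ValueError("not enough seats")
        self.available -= seats
        self.held[client] = self.held.get(client, 0) + seats
        self.saga_log[saga_id] = (client, seats)

    def complete(self, saga_id):
        del self.saga_log[saga_id]

    def compensate(self, saga_id):
        client, delta = self.saga_log[saga_id]
        # Undoing a release means taking seats back out of inventory.
        if delta < 0 and -delta > self.available:
            raise CompensationError(
                f"need {-delta} seats back, only {self.available} available"
            )
        self.available += delta
        self.held[client] -= delta
        del self.saga_log[saga_id]


service = NaiveReservationService(available=60, held={"C1": 50})

service.release("S1", "C1", 50)  # S1 releases 50 seats: 110 available
service.reserve("S2", "C2", 80)  # S2 sees 110 and takes 80: 30 available
service.complete("S2")

try:
    service.compensate("S1")     # S1 needs 50 seats back, only 30 exist
    outcome = "compensated"
except CompensationError:
    outcome = "stuck"

print(outcome, service.available)  # stuck 30
```

Each method is correct in isolation.  The problem only appears when the two sagas interleave, which is exactly what isolation would have prevented.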

This is just one example of how the lack of isolation can cause problems in an application.  To solve this issue, one could add more logic into the reservation service to place seats in escrow instead of applying the updates directly.  The difference in using escrow is that in the above scenario, the 50 seats that S1 was putting back into inventory wouldn’t be directly visible to anyone else.  Instead, they’d be placed in escrow associated with S1.  When S2 tries to reserve the 80 seats, the request would fail because there are only 60 seats available until S1 completes.  When S1 completes, the 50 seats in escrow would be placed back into inventory, and the new total would be 110.
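Here is a minimal sketch of that escrow idea, again using a hypothetical class written for this post.  To keep it short, only releases go through escrow; reservations deduct from inventory directly and are not tracked per saga, which a real service would also need to do.

```python
class EscrowReservationService:
    """Holds released seats in escrow until the releasing saga completes."""

    def __init__(self, available, held):
        self.available = available
        self.held = dict(held)  # client -> seats currently reserved
        self.escrow = {}        # saga id -> (client, seats awaiting completion)

    def release(self, saga_id, client, seats):
        if seats > self.held.get(client, 0):
            raise ValueError("cannot release more seats than reserved")
        self.held[client] -= seats
        self.escrow[saga_id] = (client, seats)  # not yet added to inventory

    def reserve(self, client, seats):
        if seats > self.available:
            raise ValueError("not enough seats")
        self.available -= seats
        self.held[client] = self.held.get(client, 0) + seats

    def complete(self, saga_id):
        _, seats = self.escrow.pop(saga_id)
        self.available += seats                 # now visible to everyone

    def compensate(self, saga_id):
        client, seats = self.escrow.pop(saga_id)
        self.held[client] += seats              # always possible: seats were never handed out


service = EscrowReservationService(available=60, held={"C1": 50})
service.release("S1", "C1", 50)    # 50 seats go into escrow for S1

try:
    service.reserve("C2", 80)      # only 60 visible, so this is refused
    first_attempt = "accepted"
except ValueError:
    first_attempt = "rejected"

service.complete("S1")             # escrowed seats return: 110 available
service.reserve("C2", 80)          # now succeeds: 30 available
print(first_attempt, service.available)  # rejected 30
```

Had S1 been compensated instead of completed, the 50 seats would simply have gone back to C1, because no other saga could have taken them in the meantime.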

Using an escrow approach allows the application to deal with the lack of isolation.  However, we’re now starting to mix business logic with transaction logic.  As the complexity of the updates made by a transaction increases, the difficulty of maintaining the escrow information increases.  Adding this additional logic to the reservation service increases the development costs and perhaps more importantly the testing costs.  Failures need to be introduced during testing in many different places to make sure the transaction handling is working properly.  It’s also difficult if not impossible to use an escrow approach for non-numeric values.
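To give a feel for the testing burden, here is one way failure injection might look for a two-step purchase saga.  This is a sketch under assumptions: `run_purchase`, the failure point names, and the starting values are all invented for illustration, and a real test suite would inject failures into remote calls and restarts as well.

```python
def run_purchase(fail_at=None):
    """Run a two-step purchase saga, optionally failing at a named point."""
    state = {"seats": 60, "funds": 100}
    undo = []  # compensations for the steps completed so far

    def maybe_fail(point):
        if point == fail_at:
            raise RuntimeError(f"injected failure at {point}")

    def release_seats():
        state["seats"] += 2

    def refund_payment():
        state["funds"] += 60

    try:
        maybe_fail("before_reserve")
        state["seats"] -= 2
        undo.append(release_seats)
        maybe_fail("after_reserve")
        state["funds"] -= 60
        undo.append(refund_payment)
        maybe_fail("after_payment")
    except RuntimeError:
        for compensation in reversed(undo):
            compensation()
        return "compensated", state
    return "completed", state


FAILURE_POINTS = ["before_reserve", "after_reserve", "after_payment"]

# Every failure point must leave the state exactly as it started.
results = {point: run_purchase(fail_at=point) for point in FAILURE_POINTS}
results[None] = run_purchase()

for point, (status, state) in results.items():
    print(point, status, state)
```

Even this toy saga has three places to fail.  Add escrow bookkeeping, more participants, and concurrent sagas, and the number of cases to cover grows quickly.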

Conclusion

Ignoring the dual write problem is probably not a viable option for enterprise microservices.  As a result, some form of distributed transactions will be required.  Sagas are promoted as a solution to this problem, but they push a lot of work onto the developer and the QA team.  Wouldn’t it be nice if the handling of the escrow could be automated?  Stay tuned for my next post where I show how Oracle Database can automatically handle escrowing data during a saga.

Todd Little

Chief Architect, Transaction Processing Products

I'm currently the Chief Architect for a family of transaction processing products at Oracle including Oracle Tuxedo product family, Oracle Blockchain Platform, and the new Oracle Transaction Manager for Microservices.  My main areas of focus are on security, privacy, confidentiality, performance, and scalability.  My job is to provide the technical strategy for these products to ensure they meet customer requirements.

 

Prior to being acquired by Oracle, I was Chief Architect for BEA Tuxedo at BEA Systems, Inc. While at BEA Systems, I was responsible for defining the technical strategy and direction for the Tuxedo product family. I developed the Tuxedo Control for WebLogic Workshop that greatly simplified the usage of Tuxedo services from Workshop based applications. I also received two patents for methods allowing design patterns in a UML modeling tool to control the generation of software artifacts.

 

During my more than 40 years of software architecture and development experience, I have worked on a wide range of software systems and technology. At Science Applications International I worked on microcoded plasma display systems and command, control, and communication systems for naval applications. As a senior software consultant at Digital Equipment Corporation, I was the New York Area Regional Tools Consultant and also helped develop a multi-language multi-threaded distributed object oriented runtime environment with concurrent garbage collection.

