transactions and timeout
By stp on Mar 21, 2008
First, however, I want to talk about what transactions are. I realize that last week I wrote about scheduling without really going into depth on what a transaction really is. Essentially, a transaction represents a task where some collective work is done; at the end, either all of the work is done or none of it is. If you're familiar with databases, or with the cool stuff that happens in the Transactional Memory communities, you know that it's common to think about transactions around data, probably for two reasons. First, it's nice to know that if you're updating two related values, there's no chance to change one without changing the other. Second, you often want to change one value based on an observed value, and know that the observed value hasn't changed. These are the kinds of guarantees that you get out of ACID semantics.
Darkstar involves more than just data in its guarantees about transactions (although the data part is obviously critical). For instance, any messages you want to send or any tasks you schedule for future execution are delayed until the transaction commits. To say that a transaction commits means that the block of work is completed successfully; any updates can be made globally visible, and nothing observed has changed during the lifetime of the transaction. If anything has changed, then the transaction is aborted and retried as if the first attempt had never happened.
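Here's a minimal sketch of that "delay until commit" idea in plain Java. This is not the Darkstar API, and it isn't how the server is implemented; `SketchTxn`, `SketchRunner`, and `ConflictException` are names I made up for illustration. Messages are buffered per attempt, released only when the work finishes cleanly, and dropped if the attempt aborts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for a transaction; not the Darkstar API. */
final class SketchTxn {
    private final List<String> pendingMessages = new ArrayList<>();

    /** Buffers a message instead of sending it right away. */
    void send(String message) {
        pendingMessages.add(message);
    }

    List<String> pending() {
        return pendingMessages;
    }
}

/** Hypothetical signal that an attempt hit a conflict and must abort. */
final class ConflictException extends RuntimeException {
    ConflictException(String reason) {
        super(reason);
    }
}

/** Hypothetical runner that retries work until it commits. */
final class SketchRunner {
    /** Messages that were actually released, i.e. made globally visible. */
    final List<String> delivered = new ArrayList<>();

    /** Runs the work, retrying on conflict; returns the number of attempts used. */
    int run(Consumer<SketchTxn> work, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            SketchTxn txn = new SketchTxn();      // fresh buffer for each attempt
            try {
                work.accept(txn);
                delivered.addAll(txn.pending());  // commit: release buffered messages
                return attempt;
            } catch (ConflictException e) {
                // abort: the buffer is discarded, so nothing from this attempt escapes
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }
}
```

If the first attempt sends a message and then aborts, the client never sees that message; only the attempt that commits gets its messages delivered.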
When a Darkstar application processes an event (like a message arriving from a client or a timed event starting) it does so in the context of a transaction. This is done to help the developer. It makes it much harder to make common coding mistakes around data consistency and integrity. It also means that these tasks are bounded in the amount of time they're allowed to take. By default, they're not allowed to run for more than 100 milliseconds, but in reality 10-20 milliseconds is already too long.
Why the short window for processing these tasks? There are many reasons, but probably two that are most important. First, all these transactions are executed by a pool of threads. A task is chosen, run, and then when the task is finished the next is chosen. These threads represent a limited resource, and so you want to share them as effectively as possible. There are many ways to do this, but since Darkstar is a latency-driven system, we have opted for a model where there are many short tasks so that we can respond to requests as quickly as possible, and minimize the overall jitter and delay. If these tasks start taking a long time, it will be harder to respond to all clients in a timely manner, which strongly affects the quality of any game.
Second, and perhaps more importantly, the longer a transaction runs, the more things it's likely to interact with, and the more likely it is (even if a given transaction only interacts with a few objects in the data store) for another transaction to update some state being used by the first transaction. Remember, a transaction is an all-or-nothing mechanism, so when there's any conflict only one transaction can proceed. If a transaction is very short, it's much more likely to complete, or at least it will take much less time to abort and try again. The longer a transaction runs, the more likely there will be contention and therefore wasted effort.
When a transactional system decides how to handle locking and contention, there are many strategies to take. You can be pessimistic and try locking each object as needed, trying to rush through your work before anyone else causes conflict. You can be optimistic and wait until commit time to see what happened to the rest of the system while you were working. You can version objects and try to detect compatible changes. There are many other strategies in between, but all deal with observing how transactions interact, and all are optimized in part based on some explicit notions of how long transactions are likely to run. If a transaction runs beyond its allowed window it won't get to commit, and will have to try again.
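To make the optimistic strategy a bit more concrete, here's a small sketch of commit-time validation using a version number. I'm not claiming this is what Darkstar's data store does; `Versioned` and `OptimisticCell` are hypothetical names, and the only real library call is `AtomicReference.compareAndSet`. A writer reads a snapshot, does its work, and commits only if the snapshot is still the current one.

```java
import java.util.concurrent.atomic.AtomicReference;

/** An immutable value paired with the version at which it was written. */
final class Versioned {
    final int version;
    final long value;

    Versioned(int version, long value) {
        this.version = version;
        this.value = value;
    }
}

/** Hypothetical optimistic cell; not how Darkstar's data store is implemented. */
final class OptimisticCell {
    private final AtomicReference<Versioned> current =
            new AtomicReference<>(new Versioned(0, 0L));

    /** Takes a snapshot to work against. */
    Versioned read() {
        return current.get();
    }

    /** Commits only if nobody has written since `observed` was read. */
    boolean tryCommit(Versioned observed, long newValue) {
        Versioned next = new Versioned(observed.version + 1, newValue);
        // compareAndSet compares references, so any intervening write
        // (which installs a new Versioned object) makes this fail.
        return current.compareAndSet(observed, next);
    }
}
```

When `tryCommit` returns false, the caller has to re-read and redo its work, which is exactly the wasted effort that grows as transactions get longer.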
Can you change the timeout for transactions in Darkstar? Well, yes, you can. Should you? No. It's not just about having some number set, but about all of the reasons why we set the bar where we do, and what the implications are throughout the system. I know that it's not always easy to do all of your work in one transaction, and programming with continuations isn't always clean. Still, there's a reason we model the system the way we do. Really. No, really.
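For what it's worth, here's roughly what I mean by programming with continuations: instead of one long task, you do a bounded chunk of work and hand back a task for the rest. The sketch below is plain Java with made-up names (`StepTask`, `ChunkedSum`, `StepScheduler`), not Darkstar's task API, and the shared `long[]` is just a stand-in for state you would keep in the data store.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical task type: run one short step, return the next step or null when done. */
interface StepTask {
    StepTask runStep();
}

/** Sums a large array in bounded chunks, one chunk per step. */
final class ChunkedSum implements StepTask {
    private final int[] data;
    private final int start;
    private final int chunkSize;
    private final long[] total;   // stand-in for persistent, shared state

    ChunkedSum(int[] data, int start, int chunkSize, long[] total) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.data = data;
        this.start = start;
        this.chunkSize = chunkSize;
        this.total = total;
    }

    @Override
    public StepTask runStep() {
        int end = Math.min(start + chunkSize, data.length);
        for (int i = start; i < end; i++) {
            total[0] += data[i];
        }
        // Hand back a continuation for the remaining range, or null if finished.
        return end < data.length ? new ChunkedSum(data, end, chunkSize, total) : null;
    }
}

/** Hypothetical single-threaded scheduler that runs steps in FIFO order. */
final class StepScheduler {
    private final Queue<StepTask> queue = new ArrayDeque<>();

    void schedule(StepTask task) {
        queue.add(task);
    }

    /** Runs queued steps until none remain; returns how many steps ran. */
    int drain() {
        int steps = 0;
        while (!queue.isEmpty()) {
            StepTask next = queue.poll().runStep();
            steps++;
            if (next != null) {
                queue.add(next);   // the continuation goes to the back of the line
            }
        }
        return steps;
    }
}
```

Because each continuation goes to the back of the queue, other short tasks get a turn between chunks, which is the whole point of keeping individual steps small.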
The one exception is the initialization task. This is the first task run for an application. We let this task run as long as you'd like. Why? Because it's safe to do so. Because nothing else should be running yet, and so there shouldn't be any conflict. Because until this transaction commits, we don't have to worry about sharing resources or being responsive to clients. Because you need some place to initialize your world, and we want to make some things a little easier.
What's your take on this? Do you like the trade-off of short events for the power of transactions? How would you model something like this?