Behind the Scenes: Sleeping soundly with the help of TLA+

August 8, 2022 | 14 minute read
Calvin Loncaric
Formal Verification Engineer

When your software is responsible for storing petabytes of data, how do you ensure that not a single byte gets lost? Software engineers most often test to gain confidence that their code is correct, but large, distributed systems have a swarm of challenges that thwart traditional testing approaches. For example, testing every possible thread interleaving, network fault, and power outage scenario is usually prohibitively difficult.

Oracle Cloud Infrastructure (OCI)’s verification team uses formal methods—in particular, a language and toolset called TLA+—to ensure that OCI’s systems are as robust as possible. Such techniques help catch subtle bugs before they impact customers. TLA+ enables OCI to ensure that our distributed system designs are correct under every possible thread interleaving, network fault, and power outage scenario.

The verification team works like an internal consultancy, helping service teams define correctness for their systems, identify subtle bugs, and gain confidence in their systems. From its inception, OCI has used formal methods as part of its software development process.

In this post, I show an illustrative example from the OCI Object Storage service. We have used TLA+ on 20 other services, turning error-prone whiteboard reasoning into a precise process for eliminating design defects.

Example: Drive management for Object Storage

When OCI set out to automate hard drive ingestion and decommissioning with Object Storage, we employed formal methods to ensure correctness. Automation saves money and relieves human operators, but it brings new risks: Computers act faster than humans, and decommissioning is a destructive operation. Counterintuitively, even ingestion is dangerous! The first step of ingesting a new drive is to wipe any existing filesystem. TLA+ helps us sleep soundly, knowing that we haven’t missed any cases.

The key component of the drive automation design that I focus on here is a process called the drive manager, which runs on every storage host. The drive manager accepts commands from external components, performs safety checks, and then carries them out. It supports the following major commands:

  • Provision: Ingests an empty hard drive.

  • Fence: Prevents new writes to a hard drive but doesn’t remove any data.

  • Remove: Marks a drive as eligible for decommissioning. Notably, this operation doesn’t immediately remove data. Instead, it turns on a physical locator light in the data center so that an operator can find the drive to physically remove and recycle it.

The following graphic shows the normal drive lifecycle enforced by the drive manager:

Normal Drive Lifecycle

The drive manager is a good target for formal verification because it is both critically important and difficult to test. A misbehavior in the drive manager can easily destroy data. For example, suppose the power goes out while a drive is being provisioned. When the system reboots, what’s the state of that drive? Is there any chance, even a small one, that the drive is eligible to receive customer writes, but the drive manager believes it hasn’t been fully provisioned? We can certainly test a handful of these scenarios to give ourselves confidence in the implementation, but having proof that the design doesn’t allow for these behaviors is better.

In general, using TLA+ to verify a system amounts to three tasks. First, formalize the system’s specification to precisely describe what it should and shouldn’t do. In the case of the drive manager, we describe the idea that the system must not lose data. Second, formalize the system’s design to precisely describe how the system works, which means describing all the actors and how they interact with each other: What the drive manager does, when commands can be sent to it, and how the network delivers commands to it. Finally, run the TLA+ model checker to find defects in the design. When the design can violate the specification, the model checker outputs the offending behaviors, which we then interpret and turn into meaningful insights.

Running the TLA+ Model Checker

Formalizing correctness in TLA+ (Spec.tla)

A component like the drive manager has an intuitive safety property: The system shouldn’t lose data. However, this intuitive English-language specification is difficult to formalize precisely. Decommissioning a drive means deleting the data on that drive, which is a normal operation for the drive manager. Should it count as data loss? Nothing is forever, and drives naturally fail on their own. These failures can’t be prevented, even by the best software. Does drive failure count as data loss?

To sidestep these ambiguities, we formalize the safety property as a state machine. A state-machine safety property describes which actions are allowed, in the most general and abstract way possible.

The following image shows the Object Storage state machine. Each chunk of customer data is stored as a collection of blocks on various hard drives. Some blocks might be replicas, and others might be parity information. You can reconstruct the entire chunk if enough blocks are present. So, Object Storage must maintain a given number of data blocks. Losing too many blocks means losing the data.

Replicating and Removing Object Storage Blocks

In this state machine, the only state that matters is how many replicas of each block exist. There are two kinds of allowed transitions between states: Object Storage can add a block replica whenever it wants, but it can only delete a block replica if enough other replicas of that block are available.

A large real-world system might satisfy or violate our state machine safety property. If it reformats a hard drive that has the only copy of a block, for example, that’s a safety violation. We’re interested in ensuring that the drive manager can’t take these actions.

The semi-formal diagram of our state machine is easily expressed precisely in TLA+, as shown in the following example:


CONSTANT RequiredBlockCount
VARIABLE num_blocks

Init ==
     num_blocks = RequiredBlockCount

ReReplicate ==
     num_blocks' > num_blocks

RemoveReplica ==
     /\ num_blocks > RequiredBlockCount
     /\ num_blocks' = num_blocks - 1

Next ==
     \/ ReReplicate
     \/ RemoveReplica

The TLA+ language is simple, but the syntax is confusing to newcomers, so let’s define the following concepts:

  • CONSTANT declares arbitrary constant values. We don’t care about the value of each constant; our system should function regardless of its specific value. For example, RequiredBlockCount is an integer describing how many blocks must exist at a minimum.

  • VARIABLE names something that changes over time. The variable num_blocks describes how many blocks of data are live and reachable. It can fluctuate as drives fail and the software takes steps to recover.

  • Named definitions are introduced with two equals signs, as in “Init == …” to define something called Init. By convention, the initial state of the system is called Init and the allowed transitions are called Next, but we can choose whatever names we want.

  • TLA+ uses mathematical symbols for expressions. For example, it uses the “wedge” and “vee” symbols for “and” and “or.” That is, /\ means “and” (most programming languages write it as &&), and \/ means “or” (most programming languages would write it as ||). The equals symbol = tests for equality, and the greater-than symbol > compares integers.

  • The tick mark on a variable indicates the value of the variable in the next state of the system. For instance, the expression num_blocks' > num_blocks means that the value of num_blocks in the next state is greater than its value in the current state.

In the TLA+ code, the state of the system is described by the variable num_blocks. Initially, enough data blocks are available. The definition of Next says that the system always takes one of two allowed transitions: ReReplicate or RemoveReplica, which correspond exactly to the visual diagram.

Normally, programs describe exactly how each operation should be carried out. TLA+, by contrast, only specifies what is allowed. For instance, the definition of ReReplicate allows any transition that increases num_blocks but doesn’t specify exactly how many new blocks are created. Transitions in TLA+ are written as predicates that constrain what can happen, and any transition that makes the predicate true is allowed.
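The snippet above shows the individual definitions but not how they’re combined. By convention, a TLA+ module ties the initial condition and the allowed transitions together in a single temporal formula; a minimal sketch, using SafetySpec as a name of our own choosing, looks like this:

SafetySpec == Init /\ [][Next]_num_blocks

The [][Next]_num_blocks form says that every step of the system either satisfies Next or leaves num_blocks unchanged (a stuttering step), which is the standard way such formulas are written.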

For example, the following graphic shows a behavior allowed by the specification:

Behavior Allowed by the Specification

The specification allows the number of live blocks to fluctuate up and down but never to fall below RequiredBlockCount. After we model the design, the TLA+ model checker can tell us whether the design satisfies this specification, or it can reveal a subtle bug that would allow the design to destroy a block of data when it shouldn’t.
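The lower bound in that picture can also be written as a plain state invariant. Checking an invariant is weaker than checking refinement of the full state machine, but it’s a useful companion check; a minimal sketch:

EnoughReplicas == num_blocks >= RequiredBlockCount

This invariant follows directly from the definitions above: the count starts at RequiredBlockCount, and RemoveReplica only fires when the count is strictly above it.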

What about drive failures? In this case, the specification only describes the behavior of our software. We’re interested in ensuring that the drive manager doesn’t destroy data on its own. In the real world, transitions can reduce num_blocks to zero if enough drives fail, but currently, we aren’t interested in these behaviors.

Modeling the drive manager (Design.tla)

The drive manager design is substantially more complex than the example specification. To give a sense of the scope of the design, here are the constants and variables required to describe the state of the system:


CONSTANTS
    Drives,
    Datums,
    StorageServers,
    ReplicationFactor

VARIABLES
    driveData,
    driveLocations,
    virtualDrives,
    presumedHealthy,
    hasPartitionTable,
    hasFilesystem,
    fstab,
    mountedDrives,
    megaRaidStatus,
    locatorLights,
    destinedForRemoval,
    isInAVolume,
    drivesMarkedDead,
    pc

These definitions model not only the drive manager, which controls things like the set of mounted drives on each server and the locator light for each drive, but also external components: MegaRAID, which helps detect drive faults and manage drive states; the central volume metadata; a regional database storing information about each drive; and even the fstab file on each server, a Linux configuration file that describes how filesystem paths map to physical hard drives.
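A type-correctness invariant is the usual way to pin down what such variables hold. The following is only a hypothetical sketch of plausible shapes for a few of them, inferred from the actions shown below; the real definitions are more detailed:

TypeOK ==
    /\ driveData \in [Drives -> SUBSET Datums]              \* which blocks live on each drive
    /\ mountedDrives \in [StorageServers -> SUBSET Drives]  \* mounted drives per server
    /\ destinedForRemoval \subseteq Drives                  \* drives with a pending remove message
    /\ locatorLights \subseteq Drives                       \* drives whose locator light is on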

The full specification is too complex to show in its entirety, but a few notable actions give an overview of how the design works.

Removing a drive is a three-step process. In the first step, a controller instructs the drive manager to remove a drive by sending it a message over the network:


TryToRemove(d) ==
    /\ RemovalIsSafe(d)
    /\ destinedForRemoval' = destinedForRemoval \union {d}
    /\ […]

Try to Remove Drive

The transition is only allowed if the controller’s own safety checks, captured here by RemovalIsSafe(d), have passed ahead of time. The drive then becomes destined for removal because a pending network message to remove the drive exists.

The second step happens when the drive manager receives that message and turns on the locator light for the drive:


Remove(d) ==
    /\ d \in destinedForRemoval
    /\ DriveManagerThinksRemovalIsSafe(d)
    /\ locatorLights' = locatorLights \union {d}
    /\ […]

Turn on Locator Light

When the drive manager gets a network request to remove a drive, it performs its internal safety checks, and if they pass, it turns on the locator light for the drive.

Finally, if the locator light is on, an operator is authorized to physically remove the drive from its enclosure, making the data unavailable:

OperatorClobbers(d) ==
    /\ d \in locatorLights
    /\ driveData' = [driveData EXCEPT ![d] = {}]
    /\ […]

Operator Clobbers Drive

The time gap between TryToRemove and Remove is a critically important point to understand: The controller’s safety checks can’t happen at the same time as the actual removal. They always happen first, and the removal happens only after some delay. The design must be safe no matter what happens in that gap. For example, in that gap a customer might issue a request that writes data to the drive. The safety checks done at the TryToRemove step must be sufficient to ensure that forgetting the data later is safe, no matter what happens in the meantime.

Once the locator light is on, a data center operator physically removes the drive from its enclosure. While this action doesn’t actually destroy the data on the drive, the model conservatively assumes that it does.
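With both modules in hand, checking the design means asking the model checker whether every behavior of Design.tla is also a behavior of Spec.tla. The actual refinement mapping is more involved than I can show here, but a heavily simplified, hypothetical sketch of its shape, tracking a single arbitrary block with an invented expression for num_blocks, might look like this inside Design.tla:

SomeBlock == CHOOSE b \in Datums : TRUE   \* an arbitrary fixed block, for illustration

\* Hypothetical refinement mapping: the abstract num_blocks becomes the number of
\* drives currently holding a copy of SomeBlock (Cardinality comes from FiniteSets)
Abstract == INSTANCE Spec WITH
    RequiredBlockCount <- ReplicationFactor,
    num_blocks <- Cardinality({d \in Drives : SomeBlock \in driveData[d]})

TLC can then be asked to verify the instantiated formula Abstract!SafetySpec as a property of the design.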

Catching a subtle bug

Armed with a specification of correctness and a formal description of the system, we can ask the TLA+ tools to explore all possible behaviors. The result is either an extraordinary level of confidence in our design or a subtle scenario that we hadn’t considered. In this case, the TLA+ tools uncovered two such scenarios.

The TLA+ tools report correctness violations as behaviors: sequences of states that adhere to the design but don’t satisfy the specification. A raw behavior is simply the value of every variable in the design at each point in time, which is tedious to read, so we don’t present one in this blog post. Instead, we present a clearer visualization of the misbehavior. A common job of the verification team is to interpret TLA+ behaviors in terms of the real system and produce diagrams like the following example:

Remove drive from three different hosts

In the bad behavior, the controller instructs the drive manager on three different hosts to remove a drive. If those three drives contain the only blocks of a customer’s data, that data would be lost. What went wrong? Why did the design allow this behavior?

To answer that, we need to look at the definition of RemovalIsSafe, which I omitted from the code above. What safety checks does the controller need to perform to ensure that initiating the drive decommissioning process is safe?

The original design had an informal and sensible-sounding answer: “A removal request can be issued whenever there are enough additional copies of the data on other drives.” Unfortunately, that condition isn’t sufficient, because it doesn’t prevent the concurrent removal of every copy of the data. When each removal is issued, enough copies of the data still exist, but once all the removals resolve, every copy has been deleted.
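Written as a rough TLA+ sketch (not the actual OCI definition, and with ReplicationFactor assumed as the threshold), the original check only counts copies on other drives at the moment the request is issued:

\* Hypothetical sketch of the original, insufficient check
RemovalIsSafe(d) ==
    \A block \in driveData[d] :
        Cardinality({e \in Drives \ {d} : block \in driveData[e]}) >= ReplicationFactor

Nothing in this predicate rules out several such requests being in flight for different drives at the same time, which is exactly the behavior the model checker found.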

Adding an extra condition that no removals are in progress also doesn’t work. That condition is impossible to enforce, because it requires global knowledge of the state of the network, and even with it a subtle data loss scenario remains, also discovered by TLA+:

Subtle data loss scenario

In this new scenario, two controllers try to take conflicting actions on the same drive: One tries to provision it as a new drive, and the other tries to decommission it. While unlikely, this scenario is possible: Two controllers can run at the same time and disagree about whether the drive is healthy, for example, if one controller’s information is out of date.

The controller that issues the remove request has met all the safety requirements: At the time of the request, all the data was safe, and no concurrent removals were happening. However, by the time the removal resolves, the drive does hold data!

By using TLA+, we devised a safety check that works. The corrected check is stable: once it becomes true, it never becomes false in the future, which makes it appropriate given the delay between the check and the removal.

A removal request can be issued whenever (1) the drive will never receive new writes, (2) the drive is not currently receiving writes, and (3) there are enough additional copies of the data on other drives.

Conditions (1) and (2) are new. While they sound difficult to enforce, the design of OCI Object Storage makes them easy to check internally, because there is a regional source of truth for the state of each drive, and it controls whether new requests can ever be directed to a drive. While we can’t check (2) directly, a certain marker on a drive ensures that in-flight writes aren’t acknowledged as successful to the customer.
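As a rough TLA+ sketch building on the earlier one (again with invented helper names, not the production definition), the corrected check adds the two new conjuncts in front of the replica count:

\* Hypothetical sketch of the corrected check
RemovalIsSafe(d) ==
    /\ WillNeverReceiveNewWrites(d)   \* (1) invented helper: the drive is permanently fenced off from new writes
    /\ ~CurrentlyReceivingWrites(d)   \* (2) invented helper: stands in for the marker described above
    /\ \A block \in driveData[d] :    \* (3) enough additional copies exist on other drives
           Cardinality({e \in Drives \ {d} : block \in driveData[e]}) >= ReplicationFactor

Because the corrected check is stable, it remains valid across the gap between TryToRemove and Remove.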

Conclusion

TLA+ has been a huge success. We have uncovered many subtle bugs like our example before they caused damage in production. The drive manager has also been a huge success: No data lost and thousands of operator hours saved. We believe this success is due in part to the extra time that we spent formalizing and verifying the design before deploying the system.



Calvin Loncaric

Formal Verification Engineer

Calvin Loncaric has been a verification engineer at Oracle Cloud Infrastructure since 2018. He works with multiple service teams to find subtle concurrency bugs and improve their designs using formal methods. Before joining OCI, Calvin earned his Ph.D. from the University of Washington, where he worked on program synthesis algorithms that produce runnable code from specifications.

