In an OCFS2 cluster, knowing which nodes are truly alive and writing to shared storage is foundational to preventing data consistency issues. OCFS2 maintains cluster membership by monitoring node liveness and evicting nodes that stop responding, using both network heartbeat and disk (block-device) heartbeat mechanisms. With disk heartbeat, all nodes share access to a heartbeat device/region where each node periodically writes to its own assigned slot and reads other nodes’ slots to determine whether peers are active. This post explains how the disk heartbeat layout works, what key fields inside a heartbeat block mean (e.g., timestamp/LastWrite, generation, dead timeout), and how to dump and interpret the heartbeat region in practice—especially for identifying dangerous “mess mounts,” where an unexpected host mounts and writes to the volume without being part of the intended cluster configuration.

Overview: how OCFS2 tracks node liveness

OCFS2 maintains cluster membership by monitoring which nodes are alive and evicting nodes that stop responding. It supports two heartbeat mechanisms:

  • Network heartbeat
  • Disk heartbeat (block-device heartbeat) — the focus here

Disk heartbeat requires a shared block device accessible by all nodes in the OCFS2 cluster. This device can be:

  • A dedicated global heartbeat device (a cluster may have more than one), or
  • A local heartbeat area on the same device that contains the OCFS2 filesystem.

Heartbeat region layout: one slot (block) per node

On the heartbeat device, logical blocks are assigned by node number:

  • Block 0 → node 0
  • Block 1 → node 1
  • … up to 256 nodes

Each slot is one logical block in size, which can range from 512 bytes to 4096 bytes depending on the block device itself. You can check it with:

blockdev --getss <device>

Because there are at most 256 nodes, the heartbeat region needs at most:
256 * block_size → up to 1 MiB with 4 KiB blocks.

Each node:

  • writes its own assigned heartbeat slot to indicate it is active, and
  • reads other nodes’ slots to determine whether they are active or inactive

What a heartbeat block contains (key fields)

A node writes several fields to its heartbeat slot, including:

  • Timestamp (64-bit): seconds since Unix Epoch for the last update. The key signal is whether it keeps changing.
  • Node number (8-bit): the owning node of the block; should match the block/slot number.
  • Checksum (CRC32, 32-bit): computed over the whole logical block with the checksum field cleared.
  • Generation (64-bit): generated at mount time and stays constant for the duration of the mount. A change implies the node left and rejoined.
  • Dead timeout (32-bit, ms): if the block isn’t updated within this interval, the node is treated as inactive.
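The fields above can be decoded with a short Python sketch. The struct offsets here are my reading of the kernel's o2hb on-disk format (64-bit timestamp, 8-bit node number plus 3 padding bytes, 32-bit checksum, 64-bit generation, 32-bit dead timeout); verify them against your kernel's `o2hb_disk_heartbeat_block` definition before relying on them:

```python
import struct
from datetime import datetime, timezone

# Assumed field layout at the start of each heartbeat slot, little-endian:
#   offset 0x00: timestamp (u64), 0x08: node (u8) + 3 pad bytes,
#   offset 0x0c: checksum (u32), 0x10: generation (u64), 0x18: dead_ms (u32)
HB_FORMAT = "<Q B 3x I Q I"

def parse_hb_slot(block: bytes) -> dict:
    """Decode the defined fields at the start of one heartbeat slot."""
    ts, node, cksum, generation, dead_ms = struct.unpack_from(HB_FORMAT, block)
    return {
        "node": node,
        "generation": f"{generation:016x}",
        "cksum": f"{cksum:08x}",
        "dead_ms": dead_ms,
        "last_write": datetime.fromtimestamp(ts, tz=timezone.utc),
    }
```

A slot whose timestamp is zero has never been written; a nonzero timestamp that keeps advancing marks a live node.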

How to dump and inspect the heartbeat region

You can dump the heartbeat region using debugfs.ocfs2 and then inspect it:

debugfs.ocfs2 -R "dump //heartbeat /tmp/heartbeat" <device>
hexdump /tmp/heartbeat

Notes:

  • If using local heartbeat, <device> is the OCFS2 volume device.
  • If using global heartbeat, <device> is the global heartbeat device.

An example hexdump of the heartbeat region for a two-node cluster:

0000000 0dad 699e 0000 0000 0000 0000 118d ab9f
0000010 1dc2 a601 df5e d269 f230 0000 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200 0dae 699e 0000 0000 0001 0000 ea22 a574
0000210 d773 bc45 444e 1c5d f230 0000 0000 0000
0000220 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000

which decodes to:

Node=0 Generation=d269df5ea6011dc2 DeadMs=62000 LastWrite=2026-02-24 20:44:29  
Node=1 Generation=1c5d444ebc45d773 DeadMs=62000 LastWrite=2026-02-24 20:44:30

Both LastWrite values are close to the current time, indicating that node 0 and node 1 are both active.
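Instead of reading the hexdump by eye, a small script can produce that summary directly from the dumped file, assuming 512-byte slots and the field layout described above (timestamp, node, checksum, generation, dead timeout):

```python
import struct
import sys
from datetime import datetime

SLOT = 512  # logical block size; check with `blockdev --getss <device>`

def summarize(dump: bytes) -> list[str]:
    """Print one line per slot that has ever been written."""
    lines = []
    for i in range(len(dump) // SLOT):
        ts, node, _ck, gen, dead = struct.unpack_from("<QB3xIQI", dump, i * SLOT)
        if ts == 0:
            continue  # slot never used
        when = datetime.fromtimestamp(ts).strftime("%Y-%m-%d %H:%M:%S")
        lines.append(f"Node={node} Generation={gen:016x} DeadMs={dead} LastWrite={when}")
    return lines

if __name__ == "__main__":
    # e.g. python3 hb_summary.py /tmp/heartbeat
    print("\n".join(summarize(open(sys.argv[1], "rb").read())))
```

Running this against the dump above would reproduce the two Node=0/Node=1 lines shown.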

“Mess mount” (unexpected mount): what it is and why it’s dangerous

A mess mount refers to an accidental mount of an OCFS2 volume from an unexpected machine, often caused by an improper cluster reconfiguration:

  • Some old nodes still have access to the shared block device
  • New and old nodes may not know about each other

This is particularly dangerous because it can lead to split-brain-like behavior: metadata and data are not synchronized across the “good” nodes and the unexpected node(s). OCFS2 may not immediately complain, and corruption can surface later (e.g., stale metadata).

Detecting a mess mount via heartbeat slots

Heartbeat inspection can reveal mess mounts in some cases.

Example pattern:

  • You expect only node 0 and node 1 to be active (slots 0 and 1 updating),
  • but you observe another active slot (e.g., slot 3 updating): node 3's LastWrite is recent and keeps changing.

If slot 3 is active when it shouldn’t be, it strongly suggests an unexpected node is writing to that slot—i.e., a mess mount situation.
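This check can be automated: scan the dumped region and flag any slot that looks alive (its timestamp falls within its own dead-timeout window of the current time) but is not in the set of nodes you configured. The expected-node set and slot size here are hypothetical values, and the field offsets again assume the layout described earlier:

```python
import struct

EXPECTED_NODES = {0, 1}  # hypothetical: the nodes in your cluster config
SLOT_SIZE = 512          # logical block size of the heartbeat device

def find_unexpected_slots(dump: bytes, now: float,
                          slot_size: int = SLOT_SIZE) -> list[int]:
    """Return node numbers that appear alive but are not expected."""
    suspects = []
    for slot in range(len(dump) // slot_size):
        ts, node, _ck, _gen, dead_ms = struct.unpack_from(
            "<QB3xIQI", dump, slot * slot_size)
        if ts == 0 or dead_ms == 0:
            continue  # slot never written
        # Alive = last write is within the slot's own dead-timeout window.
        if (now - ts) * 1000 <= dead_ms and node not in EXPECTED_NODES:
            suspects.append(node)
    return suspects
```

Any node number this returns is a candidate mess mount: something is actively heartbeating in a slot your configuration says should be idle.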

Suggested improvement: store hostname/IP in the heartbeat slot

One operational gap is that, even if you detect an unexpected active slot, it can be hard to identify which host is writing to it.

A practical enhancement would be to store hostname and IP address alongside the existing fields:

  • Heartbeat slots are at least 512 bytes, while only a small portion is currently used, so there is room to include additional identifiers.
  • This wouldn’t change OCFS2 cluster management semantics, but it would materially improve mess mount identification and incident response.
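As a purely illustrative prototype of the idea, identity fields could be packed into the unused tail of a slot. Nothing here is part of OCFS2's on-disk format; the offset, sizes, and field names are all invented for this sketch:

```python
import socket
import struct

# Hypothetical extension: hostname + IPv4 address stored after the
# existing heartbeat fields. These offsets are NOT defined by OCFS2.
IDENT_OFFSET = 64        # safely past the ~28 bytes of defined fields
IDENT_FORMAT = "<64s4s"  # NUL-padded hostname, packed IPv4 address

def write_identity(slot: bytearray, hostname: str, ipv4: str) -> None:
    """Stamp the writing host's identity into its own heartbeat slot."""
    struct.pack_into(IDENT_FORMAT, slot, IDENT_OFFSET,
                     hostname.encode(), socket.inet_aton(ipv4))

def read_identity(slot: bytes) -> tuple[str, str]:
    """Recover hostname and IP from a slot written by write_identity."""
    host, addr = struct.unpack_from(IDENT_FORMAT, slot, IDENT_OFFSET)
    return host.rstrip(b"\x00").decode(), socket.inet_ntoa(addr)
```

With something like this in place, an unexpected active slot would directly name the offending host instead of leaving only a node number to chase.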

Conclusion

OCFS2 disk heartbeat data provides a direct, low-level way to validate cluster reality: which slots are updating, whether LastWrite is current, and whether a node’s generation indicates a rejoin—all of which can confirm active membership or reveal non-responsive nodes. Beyond routine liveness checks, heartbeat inspection can also expose “mess mounts”: if you expect only certain node slots to update but observe another slot with a recent and changing LastWrite, it strongly suggests an unexpected machine is writing to the shared device, creating split-brain-like risk and potential metadata/data inconsistency that may only surface later as corruption. To improve incident response, a practical enhancement is to store hostname/IP information in the heartbeat slot (there is ample space per slot), making it much easier to identify the offending host when an unexpected active slot is detected.