SEEK_HOLE and SEEK_DATA for sparse files
By bonwick on Dec 12, 2005
A sparse file is a file that contains much less data than its size would suggest. If you create a new file and then do a 1-byte write to the billionth byte, for example, you've just created a 1GB sparse file. By convention, reads from unwritten parts of a file return zeroes.
File-based backup and archiving programs like tar, cpio, and rsync can detect runs of zeroes and ignore them, so that the archives they produce aren't filled with zeroes. Still, this is terribly inefficient. If you're a backup program, what you really want is a list of the non-zero segments in the file. But historically there's been no way to get this information without looking at filesystem-specific metadata.
Desperately seeking segments
As part of the ZFS project, we introduced two general extensions to lseek(2): SEEK_HOLE and SEEK_DATA. These allow you to quickly discover the non-zero regions of sparse files. Quoting from the new man page:
o If whence is SEEK_HOLE, the offset of the start of the next hole greater than or equal to the supplied offset is returned. The definition of a hole is provided near the end of the DESCRIPTION. o If whence is SEEK_DATA, the file pointer is set to the start of the next non-hole file region greater than or equal to the supplied offset. [...] A "hole" is defined as a contiguous range of bytes in a file, all having the value of zero, but not all zeros in a file are guaranteed to be represented as holes returned with SEEK_HOLE. Filesystems are allowed to expose ranges of zeros with SEEK_HOLE, but not required to. Applications can use SEEK_HOLE to optimise their behavior for ranges of zeros, but must not depend on it to find all such ranges in a file. The existence of a hole at the end of every data region allows for easy programming and implies that a virtual hole exists at the end of the file. Applications should use fpathconf(_PC_MIN_HOLE_SIZE) or pathconf(_PC_MIN_HOLE_SIZE) to determine if a filesystem supports SEEK_HOLE. See fpath- conf(2). For filesystems that do not supply information about holes, the file will be represented as one entire data region.
Any filesystem can support SEEK_HOLE / SEEK_DATA. Even a filesystem like UFS, which has no special support for sparseness, can walk its block pointers much faster than it can copy out a bunch of zeroes. Even if the search algorithm is linear-time, at least the constant is thousands of times smaller.
Sparse file navigation in ZFS
A file in ZFS is a tree of blocks. To maximize the performance of SEEK_HOLE and SEEK_DATA, ZFS maintains in each block pointer a fill count indicating the number of data blocks beneath it in the tree (see below). Fill counts reveal where holes and data reside, so that ZFS can find both holes and data in guaranteed logarithmic time.
L4 6 ----------------------------------- L3 5 0 1 ----------- ----------- ----------- L2 3 0 2 0 0 1 --- --- --- --- --- --- L1 111 101 001
An indirect block tree for a sparse file, showing the fill count in each block pointer. In this example there are three block pointers per indirect block. The lowest-level (L1) block pointers describe either one block of data or one block-sized hole, indicated by a fill count of 1 or 0, respectively. At L2 and above, each block pointer's fill count is the sum of the fill counts of its children. At any level, a non-zero fill count indicates that there's data below. A fill count less than the maximum (3L-1 in this example) indicates that there are holes below. From any offset in the file, ZFS can find the next hole or next data in logarithmic time by following the fill counts in the indirect block tree.
At this writing, SEEK_HOLE and SEEK_DATA are Solaris-specific. I encourage (implore? beg?) other operating systems to adopt these lseek(2) extensions verbatim (100% tax-free) so that sparse file navigation becomes a ubiquitous feature that every backup and archiving program can rely on. It's long overdue.
Technorati Tags: OpenSolaris Solaris ZFS