• ZFS
    December 12, 2005

SEEK_HOLE and SEEK_DATA for sparse files

Guest Author

A sparse file is a file that contains much less data than its
size would suggest. If you create a new file and then
do a 1-byte write to the
billionth byte, for example, you've just created a 1GB sparse file.
By convention, reads from unwritten parts of a file return zeroes.

File-based backup and archiving programs like tar, cpio, and rsync
can detect runs of zeroes and ignore them, so that the archives they
produce aren't filled with zeroes. Still, this is terribly inefficient.
If you're a backup program, what you really want is a list of the
non-zero segments in the file. But historically there's been no way
to get this information without looking at filesystem-specific metadata.

Desperately seeking segments

As part of the ZFS project, we introduced two general extensions to
lseek(2): SEEK_HOLE and SEEK_DATA. These allow you to quickly discover
the non-zero regions of sparse files. Quoting from the new man page:

       o  If whence is SEEK_HOLE, the offset of the start of  the
next hole greater than or equal to the supplied offset
is returned. The definition of a hole is provided near
the end of the DESCRIPTION.
o If whence is SEEK_DATA, the file pointer is set to the
start of the next non-hole file region greater than or
equal to the supplied offset.
A "hole" is defined as a contiguous range of bytes in a
file, all having the value of zero, but not all zeros in a
file are guaranteed to be represented as holes returned with
SEEK_HOLE. Filesystems are allowed to expose ranges of zeros
with SEEK_HOLE, but not required to. Applications can use
SEEK_HOLE to optimise their behavior for ranges of zeros,
but must not depend on it to find all such ranges in a file.
The existence of a hole at the end of every data region
allows for easy programming and implies that a virtual hole
exists at the end of the file. Applications should use
fpathconf(_PC_MIN_HOLE_SIZE) or pathconf(_PC_MIN_HOLE_SIZE)
to determine if a filesystem supports SEEK_HOLE. See fpath-
For filesystems that do not supply information about holes,
the file will be represented as one entire data region.

Any filesystem can support SEEK_HOLE / SEEK_DATA. Even a filesystem
like UFS, which has no special support for sparseness, can walk its
block pointers much faster than it can copy out a bunch of zeroes.
Even if the search algorithm is linear-time, at least the constant
is thousands of times smaller.

Sparse file navigation in ZFS

A file in ZFS is a tree of blocks. To maximize the performance of
SEEK_HOLE and SEEK_DATA, ZFS maintains in each block pointer a
fill count indicating the number of data blocks beneath it
in the tree (see below). Fill counts reveal where holes and data
reside, so that ZFS can find both holes and data in guaranteed
logarithmic time.

  L4                           6
L3 5 0 1
----------- ----------- -----------
L2 3 0 2 0 0 1
--- --- --- --- --- ---
L1 111 101 001

An indirect block tree for a sparse file, showing the fill count in
each block pointer. In this example there are three block pointers
per indirect block. The lowest-level (L1) block pointers describe
either one block of data or one block-sized hole, indicated by a
fill count of 1 or 0, respectively. At L2 and above, each block
pointer's fill count is the sum of the fill counts of its children.
At any level, a non-zero fill count indicates that there's data below.
A fill count less than the maximum (3L-1 in this example)
indicates that there are holes below. From any offset in the file,
ZFS can find the next hole or next data in logarithmic time by following
the fill counts in the indirect block tree.


At this writing, SEEK_HOLE and SEEK_DATA are Solaris-specific.
I encourage (implore? beg?) other operating systems to adopt
these lseek(2) extensions verbatim (100% tax-free)
so that sparse file navigation becomes a ubiquitous
feature that every backup and archiving program can rely on.
It's long overdue.

Technorati Tags:

Join the discussion

Comments ( 18 )
  • nikita Tuesday, December 13, 2005

    Do I understand correctly, that ZFS "tree of blocks" is more of a radix-tree than B-tree? I wonder why this solution was choosen, because few problems with radix-trees immediately come to mind, e.g., one cannot use extents, only individual block pointers on the leaf level, and tree for a file with a huge hole in the middle will be unnecessarily tall.

    Also, tree fanout of 3 is unbelievably small. How comes only 3 block pointers fit into whole block?

  • Mike Gerdts Tuesday, December 13, 2005

    And I encourage Sun to implement this with UFS and to make tools like cpio sparse-file aware.

    This is particularly important in an environment that uses live upgrade and large UID's. A good example of a sparse file is /var/adm/lastlog. In my environment, there are a few users clustered in the 0 - 100 range, a few more in 1,000 - 20,000, and clusters around 100,000,000, 212,000,000 and 500,000,000. This means that lastlog looks like:

    $ ls -lh /var/adm/lastlog
    -rw-r--r-- 1 root root 13G Dec 13 18:08 /var/adm/lastlog
    $ du -h /var/adm/lastlog
    104K /var/adm/lastlog

    So, if I use standard live upgrade commands or flash archive commands a file like this will balloon to the full 13G. This makes it pretty hard to maintain several boot environments on a 36 GB disk.

    FWIW, the RFE has been filed and I was told "will not fix". My suggestion in the RFE was to switch the lu/flash tools to use GNU cpio with --sparse, which is presumably a lot less work than updating Sun's cpio to be sparse-file aware.

  • Bill Sommerfeld Tuesday, December 13, 2005
    nikita: I'd assume Jeff used a tree fanout of 3 to make the picture fit the page.
    check the source -- zfs block pointers are large, but zfs's blocks can be much larger; it looks like 128 byte block pointers, up to 128K blocks so you may have a fanout as large as 1024.
  • Glenn Wednesday, December 14, 2005
    The present lseek(2) man page starts out by saying "The lseek() function sets the file pointer ...". I gather from your description that SEEK_HOLE doesn't actually do any seeking -- that is, it doesn't modify the file pointer. I'm not saying that's bad (it can be seen as a bit more convenient for programming, though not greatly so); I'm just looking for clarification since it does violate a pre-existing convention.
  • Glenn Wednesday, December 14, 2005

    Let me add to my own comment. Your implementation of SEEK_HOLE as not moving the file pointer effectively presumes what it is being used for -- namely, measuring the length of a data segment, probably for use in a backup program. But my use may be instead to find the next place to safely store a piece of data. In that case, I want SEEK_HOLE to move the file pointer, and I want SEEK_DATA to not move the file pointer, since I'll use SEEK_DATA after SEEK_HOLE to measure the size of the hole.

    All of which is just to point out the inherent bias you've assumed in your definitions of these new operations, rather than consistently following the existing conventions. A clean way out would be to have SEEK_HOLE and SEEK_DATA both move the file pointer, and FIND_HOLE and FIND_DATA return the equivalent offset without moving the file pointer.

  • Glenn Wednesday, December 14, 2005
    Having thought about this some more, I believe the absolute minimum you should do before asking other vendors to adopt your extensions is to drop the existing SEEK_HOLE symbol and use FIND_HOLE instead. I can see no good reason (more strongly, it is abhorrent) to intentionally mis-name a concept when a simple, clear alternative is readily available. You could then decide later whether it is worth implementing FIND_DATA (no pointer movement) and SEEK_HOLE (with pointer movement).
  • Jeff Bonwick Monday, December 19, 2005
    I'm not sure where you got the impression that either SEEK_HOLE or SEEK_DATA doesn't set the file pointer. They do. (If they didn't, that would be weird, and we'd call it out explicitly.)
  • Glenn Monday, December 19, 2005
    I got that impression because your man-page quote for SEEK_HOLE is missing the key phrase "the file pointer is set to"; all it mentions is returning a value. Perhaps the man page needs an edit.
  • Tim Mooney Wednesday, December 21, 2005
    This blog entry is very timely. I just brought up this issue on the Legato NetWorker mailing list last week. On a Red Hat Linux box on x86_64, the sparse /var/log/lastlog is 1.2 TB in size, even though it's probably truly using less than 1 MB of disk space.
    The "save" process (what NetWorker uses on a client to perform the backup) takes more than 1.5 hours to read the file, even though 99.999% of that time it's just reading 64K blocks of 0s. If there were an interface available like the ZFS interface and NetWorker used that, the backup time could be cut to seconds.
  • Guy Harris Friday, December 23, 2005

    Yes, the man page needs an edit to make it clearer that SEEK_HOLE moves the seek pointer.

    Also, didn't Sun's old "Backup Copilot" product have a mechanism for finding out where the holes were in files? As I remember, it had a version of "dump" that read files rather than reading the raw disk, so you could do online backups, which meant that if it was to dump holes as such rather than as blocks of zeroes, it would either have to check for regions of zeroes and assume they're holes or ask the kernel where the holes were (or whether a given region includes a hole).

    I think it might've been a kernel add-on that was later integrated into SunOS 4.1.x (presumably so that Backup Copilot would run on top of a vanilla kernel and wouldn't have to install a patched <tt>/vmunix</tt>) and, if I remember from when I last saw the source (over 11 years ago), it might've used an <tt>fcntl()</tt>; some might find that cleaner than <tt>lseek()</tt>.

    Check out the 4.1.3 or 4.1.4 sources (if Sun still has them around...), or check out the Sun paper “Engineering a Commercial Backup Program” from the proceedings of the 5th LISA conference.

  • Terry Lambert Tuesday, January 3, 2006
    I would suggest is to do an fcntl to look for a hole using an flock structure. The basic API would be to call it with a l_whence of seek_set, an l_start of 0, and an l_len of 0 (the marker for "to EOF").

    The result would be a modified flock structure with l_start set to the first byte of the hole and l_len set to the hole length. You'd iterate the file by adding the returned l_len to the returned l_start, setting l_len back to 0, and calling it again to get the next hole.
  • Bj&ouml;rn Thursday, February 2, 2006
    Other File Systems have similar functions since a long time - XFS has XFS_IOC_GETBMAP which returns the allocation information. NTFS and also CIFS protocol knows about a similar "get allocation info" call. I'm not sure what is more convenient - seeking from hole to hole or retreive the allocation info all at once.
  • kenneth topp Tuesday, February 14, 2006
    Sparse files having huge regions of sparsness is less and less frequent. Except for some p2p apps, I don't see that much sparseness around, and even there those regions are going to be filled in pretty soon. For example, the /var/log/lastlog file on linux boxes is a just unfortunate use of binary structed data keyed off of UID and better off deleted. it's actually a useless index of last login info. useless because it's redundant (wtmp has that same info) and because only keeps the most recent login entry per UID.
    That being said, it's nice to know that sun has taken then time to add this.
    I'm going to look up the backup guide, but so far I still think the best thing legato/backup software can do is use O_DIRECT in open and tell the VM not to waste it's time at all caching this stuff. Either approach taken, the VM is still going to have unneeded stuff in the cache that it needs to quickly detect, be it zero's from some software that didn't update to the new methods seek, or 1s and 0s from actually data, and no one is going to know except the user, and he'll not be able to tell the appdevs or the kerneldevs. :)
  • Ragnar Sundblad Sunday, March 5, 2006
    The man page above says:
    "For filesystems that do not supply information about holes, the file will be represented as one entire data region."
    This seems very stupid, since there then is no way for a program to decide if the file system provides the sparse searching or if the program has to do it itself.
    A filesystem that can't supply this information rather should return ENOTSUP or some other relevant error.
  • Andr&eacute; Goddard Rosa Sunday, June 18, 2006
    Hello, Jeff!
    Alexander Viro posted a question regarding your proposal for lseek(2) implementation for SEEK_HOLE and SEEK_DATA:
    "Well... Description makes sense and it isn't hard to implement.
    About the only question is about semantics for directories..."
    Do you mind replying to him at
    linux-kernel at vger kernel org
    viro at ftp linux org uk
    This is the thread:
    Thank you so much!
  • Jeff Johnson Tuesday, August 22, 2006
    I've done a preliminary SEEK_HOLE/SEEK_DATA patch for linux,
    basically a loop over FIBMAP without the root-only capability check (unneeded because no physical blocks are returned to user land).
  • guest Tuesday, August 22, 2006
    The URL fopr the patch should have been
  • Anonymous Thursday, October 19, 2006
    The semantics of allowing the FS to return a "hole" for any run of zero bytes, even from non-sparse blocks, are a bit useless because that can easily be replicated in the reader software.
    It'd be more useful to really only return holes for actual sparse blocks, so that the application can distinguish between never-been-written-to and written-to-with-all-zeroes. This is useful for validating regions of a file, for example for applications which do non-linear downloads. If the application is aborted, it can then check which blocks of the file were successfully written, and only re-download the missing blocks.
    I think it should be mandated that holes have a one-to-one correspondence to sparse blocks, and that they hence always indicate a region of a file that was never written to.
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.