SEEK_HOLE and SEEK_DATA for sparse files

A sparse file is a file that contains much less data than its size would suggest. If you create a new file and then do a 1-byte write to the billionth byte, for example, you've just created a 1GB sparse file. By convention, reads from unwritten parts of a file return zeroes.
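The one-byte-write recipe can be sketched in a few lines; the path below is illustrative, and whether the file actually occupies almost no disk space depends on the filesystem:

```python
import os

def make_sparse(path, offset):
    """Create a file whose logical size is offset+1 bytes by writing
    a single byte at `offset`; the unwritten bytes read back as zeroes."""
    fd = os.open(path, os.O_CREAT | os.O_TRUNC | os.O_WRONLY, 0o644)
    try:
        os.lseek(fd, offset, os.SEEK_SET)   # seek past end-of-file...
        os.write(fd, b"x")                  # ...and write one byte
    finally:
        os.close(fd)
    return os.stat(path)

st = make_sparse("/tmp/sparse_demo", 1_000_000_000)
# st_size is the logical length; st_blocks counts 512-byte units
# actually allocated -- tiny, on a filesystem that supports sparseness.
print(st.st_size, st.st_blocks)
os.unlink("/tmp/sparse_demo")
```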

File-based backup and archiving programs like tar, cpio, and rsync can detect runs of zeroes and ignore them, so that the archives they produce aren't filled with zeroes. Still, this is terribly inefficient: the program has to read every one of those zeroes just to discover that it can skip them. If you're a backup program, what you really want is a list of the non-zero segments in the file. But historically there's been no way to get this information without looking at filesystem-specific metadata.
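For contrast, here is roughly what those programs have to do today — read the whole file and compare each block against zeroes, paying full read bandwidth for the holes. The helper name and block size are illustrative:

```python
def scan_nonzero_regions(path, blocksize=65536):
    """Find non-zero regions the hard way: read every block and test
    it against zeroes.  Holes cost just as much I/O as data."""
    regions, offset, start = [], 0, None
    zero = b"\0" * blocksize
    with open(path, "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            if buf == zero[:len(buf)]:      # all-zero block: not data
                if start is not None:
                    regions.append((start, offset))
                    start = None
            else:                           # block contains data
                if start is None:
                    start = offset
            offset += len(buf)
    if start is not None:
        regions.append((start, offset))
    return regions
```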

Desperately seeking segments

As part of the ZFS project, we introduced two general extensions to lseek(2): SEEK_HOLE and SEEK_DATA. These allow you to quickly discover the non-zero regions of sparse files. Quoting from the new man page:

       o  If whence is SEEK_HOLE, the offset of the start of  the
          next  hole greater than or equal to the supplied offset
          is returned. The definition of a hole is provided  near
          the end of the DESCRIPTION.

       o  If whence is SEEK_DATA, the file pointer is set to  the
          start  of the next non-hole file region greater than or
          equal to the supplied offset.


     A "hole" is defined as a contiguous  range  of  bytes  in  a
     file,  all  having the value of zero, but not all zeros in a
     file are guaranteed to be represented as holes returned with
     SEEK_HOLE. Filesystems are allowed to expose ranges of zeros
     with SEEK_HOLE, but not required to.  Applications  can  use
     SEEK_HOLE  to  optimise  their behavior for ranges of zeros,
     but must not depend on it to find all such ranges in a file.
     The  existence  of  a  hole  at the end of every data region
     allows for easy programming and implies that a virtual  hole
     exists  at  the  end  of  the  file. Applications should use
     fpathconf(_PC_MIN_HOLE_SIZE) or  pathconf(_PC_MIN_HOLE_SIZE)
     to  determine if a filesystem supports SEEK_HOLE. See fpathconf(2).

     For filesystems that do not supply information about  holes,
     the file will be represented as one entire data region.
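Per the semantics above, a backup program's inner loop alternates SEEK_DATA and SEEK_HOLE. A sketch, using Python's os module, which exposes the same constants on systems that provide them (ENXIO signals that no data remains):

```python
import errno
import os

def data_regions(path):
    """Yield (start, end) offsets of the data segments of a file,
    alternating SEEK_DATA and SEEK_HOLE as the man page describes.
    On a filesystem with no hole information, the whole file comes
    back as a single data region."""
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:   # no data at or after offset
                    return
                raise
            # The virtual hole at end-of-file guarantees this succeeds.
            end = os.lseek(fd, start, os.SEEK_HOLE)
            yield (start, end)
            offset = end
    finally:
        os.close(fd)
```

A backup program would read only these regions and record the gaps between them as holes.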

Any filesystem can support SEEK_HOLE / SEEK_DATA. Even a filesystem like UFS, which has no special support for sparseness, can walk its block pointers much faster than it can copy out a bunch of zeroes. Even if the search algorithm is linear-time, at least the constant is thousands of times smaller.

Sparse file navigation in ZFS

A file in ZFS is a tree of blocks. To maximize the performance of SEEK_HOLE and SEEK_DATA, ZFS maintains in each block pointer a fill count indicating the number of data blocks beneath it in the tree (see below). Fill counts reveal where holes and data reside, so that ZFS can find both holes and data in guaranteed logarithmic time.

  L4                           6
  L3          5                0                1
         -----------      -----------      -----------
  L2     3    0    2                       0    0    1
        ---  ---  ---                     ---  ---  ---
  L1    111       101                               001

An indirect block tree for a sparse file, showing the fill count in each block pointer. In this example there are three block pointers per indirect block. The lowest-level (L1) block pointers describe either one block of data or one block-sized hole, indicated by a fill count of 1 or 0, respectively. At L2 and above, each block pointer's fill count is the sum of the fill counts of its children. At any level, a non-zero fill count indicates that there's data below. A fill count less than the maximum (3^(L-1) for a level-L block pointer in this example) indicates that there are holes below. From any offset in the file, ZFS can find the next hole or next data in logarithmic time by following the fill counts in the indirect block tree.
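The fill-count descent can be sketched with a toy model, not ZFS's actual code: a complete fanout-3 tree stored as nested lists, where each leaf is 1 (data block) or 0 (hole). Subtrees with a fill count of zero are pruned whole, which is what makes the search logarithmic (the toy recomputes fill counts on the fly; ZFS stores them in the block pointers):

```python
def fill(node):
    """Fill count of a subtree: number of data blocks beneath it."""
    if isinstance(node, int):
        return node                    # leaf: 1 = data block, 0 = hole
    return sum(fill(c) for c in node)

def leaves(node):
    """Number of leaf (L1) slots a subtree spans, holes included."""
    return 1 if isinstance(node, int) else sum(leaves(c) for c in node)

def next_data(node, blkno):
    """First data-block number >= blkno, or None if there is none.
    Any subtree whose fill count is zero is skipped without descent."""
    if isinstance(node, int):
        return 0 if (node == 1 and blkno <= 0) else None
    base = 0
    for child in node:
        width = leaves(child)
        if blkno < base + width and fill(child) > 0:
            hit = next_data(child, max(0, blkno - base))
            if hit is not None:
                return base + hit
        base += width
    return None
```

With the tree from the figure, `next_data` jumps straight from the hole at block 3 to the data at block 6 without visiting the empty middle subtree. A symmetric `next_hole` would prune subtrees whose fill count equals their capacity.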


At this writing, SEEK_HOLE and SEEK_DATA are Solaris-specific. I encourage (implore? beg?) other operating systems to adopt these lseek(2) extensions verbatim (100% tax-free) so that sparse file navigation becomes a ubiquitous feature that every backup and archiving program can rely on. It's long overdue.

Do I understand correctly that the ZFS "tree of blocks" is more of a radix tree than a B-tree? I wonder why this solution was chosen, because a few problems with radix trees immediately come to mind: e.g., one cannot use extents, only individual block pointers at the leaf level, and the tree for a file with a huge hole in the middle will be unnecessarily tall.

Also, a tree fanout of 3 is unbelievably small. How come only 3 block pointers fit into a whole block?

Posted by nikita on December 12, 2005 at 08:24 PM PST #

And I encourage Sun to implement this with UFS and to make tools like cpio sparse-file aware.

This is particularly important in an environment that uses live upgrade and large UIDs. A good example of a sparse file is /var/adm/lastlog. In my environment, there are a few users clustered in the 0 - 100 range, a few more in 1,000 - 20,000, and clusters around 100,000,000, 212,000,000 and 500,000,000. This means that lastlog looks like:

$ ls -lh /var/adm/lastlog
-rw-r--r--   1 root     root         13G Dec 13 18:08 /var/adm/lastlog
$ du -h /var/adm/lastlog
 104K   /var/adm/lastlog

So, if I use standard live upgrade commands or flash archive commands a file like this will balloon to the full 13G. This makes it pretty hard to maintain several boot environments on a 36 GB disk.

FWIW, the RFE has been filed and I was told "will not fix". My suggestion in the RFE was to switch the lu/flash tools to use GNU cpio with --sparse, which is presumably a lot less work than updating Sun's cpio to be sparse-file aware.

Posted by Mike Gerdts on December 13, 2005 at 02:13 AM PST #

nikita: I'd assume Jeff used a tree fanout of 3 to make the picture fit the page. Check the source -- ZFS block pointers are large, but ZFS's blocks can be much larger; it looks like 128-byte block pointers and blocks of up to 128K, so you may have a fanout as large as 1024.

Posted by Bill Sommerfeld on December 13, 2005 at 11:37 AM PST #

The present lseek(2) man page starts out by saying "The lseek() function sets the file pointer ...". I gather from your description that SEEK_HOLE doesn't actually do any seeking -- that is, it doesn't modify the file pointer. I'm not saying that's bad (it can be seen as a bit more convenient for programming, though not greatly so); I'm just looking for clarification since it does violate a pre-existing convention.

Posted by Glenn on December 14, 2005 at 04:43 AM PST #

Let me add to my own comment. Your implementation of SEEK_HOLE as not moving the file pointer effectively presumes what it is being used for -- namely, measuring the length of a data segment, probably for use in a backup program. But my use may be instead to find the next place to safely store a piece of data. In that case, I want SEEK_HOLE to move the file pointer, and I want SEEK_DATA to not move the file pointer, since I'll use SEEK_DATA after SEEK_HOLE to measure the size of the hole.

All of which is just to point out the inherent bias you've assumed in your definitions of these new operations, rather than consistently following the existing conventions. A clean way out would be to have SEEK_HOLE and SEEK_DATA both move the file pointer, and FIND_HOLE and FIND_DATA return the equivalent offset without moving the file pointer.

Posted by Glenn on December 14, 2005 at 05:07 AM PST #

Having thought about this some more, I believe the absolute minimum you should do before asking other vendors to adopt your extensions is to drop the existing SEEK_HOLE symbol and use FIND_HOLE instead. I can see no good reason (more strongly, it is abhorrent) to intentionally mis-name a concept when a simple, clear alternative is readily available. You could then decide later whether it is worth implementing FIND_DATA (no pointer movement) and SEEK_HOLE (with pointer movement).

Posted by Glenn on December 14, 2005 at 06:03 AM PST #

I'm not sure where you got the impression that either SEEK_HOLE or SEEK_DATA doesn't set the file pointer. They do. (If they didn't, that would be weird, and we'd call it out explicitly.)

Posted by Jeff Bonwick on December 18, 2005 at 08:53 PM PST #

I got that impression because your man-page quote for SEEK_HOLE is missing the key phrase "the file pointer is set to"; all it mentions is returning a value. Perhaps the man page needs an edit.

Posted by Glenn on December 19, 2005 at 08:47 AM PST #

This blog entry is very timely. I just brought up this issue on the Legato NetWorker mailing list last week. On a Red Hat Linux box on x86_64, the sparse /var/log/lastlog is 1.2 TB in size, even though it's probably truly using less than 1 MB of disk space. The "save" process (what NetWorker uses on a client to perform the backup) takes more than 1.5 hours to read the file, even though 99.999% of that time it's just reading 64K blocks of 0s. If there were an interface available like the ZFS interface and NetWorker used that, the backup time could be cut to seconds.

Posted by Tim Mooney on December 21, 2005 at 02:15 AM PST #

Yes, the man page needs an edit to make it clearer that SEEK_HOLE moves the seek pointer.

Also, didn't Sun's old "Backup Copilot" product have a mechanism for finding out where the holes were in files? As I remember, it had a version of "dump" that read files rather than reading the raw disk, so you could do online backups, which meant that if it was to dump holes as such rather than as blocks of zeroes, it would either have to check for regions of zeroes and assume they're holes or ask the kernel where the holes were (or whether a given region includes a hole).

I think it might've been a kernel add-on that was later integrated into SunOS 4.1.x (presumably so that Backup Copilot would run on top of a vanilla kernel and wouldn't have to install a patched <tt>/vmunix</tt>) and, if I remember from when I last saw the source (over 11 years ago), it might've used an <tt>fcntl()</tt>; some might find that cleaner than <tt>lseek()</tt>.

Check out the 4.1.3 or 4.1.4 sources (if Sun still has them around...), or check out the Sun paper “Engineering a Commercial Backup Program” from the proceedings of the 5th LISA conference.

Posted by Guy Harris on December 23, 2005 at 08:47 AM PST #

What I would suggest is to do an fcntl to look for a hole using an flock structure. The basic API would be to call it with an l_whence of SEEK_SET, an l_start of 0, and an l_len of 0 (the marker for "to EOF").

The result would be a modified flock structure with l_start set to the first byte of the hole and l_len set to the hole length. You'd iterate the file by adding the returned l_len to the returned l_start, setting l_len back to 0, and calling it again to get the next hole.

Posted by Terry Lambert on January 02, 2006 at 09:26 PM PST #

Other filesystems have had similar functions for a long time -- XFS has XFS_IOC_GETBMAP, which returns the allocation information. NTFS, and the CIFS protocol as well, know a similar "get allocation info" call. I'm not sure which is more convenient -- seeking from hole to hole or retrieving the allocation info all at once.

Posted by Björn on February 02, 2006 at 08:50 AM PST #

Sparse files with huge regions of sparseness are less and less frequent. Except for some p2p apps, I don't see that much sparseness around, and even there those regions are going to be filled in pretty soon. For example, the /var/log/lastlog file on Linux boxes is just an unfortunate use of binary structured data keyed off of UID, and is better off deleted. It's actually a useless index of last-login info: useless because it's redundant (wtmp has that same info) and because it only keeps the most recent login entry per UID. That being said, it's nice to know that Sun has taken the time to add this. I'm going to look up the backup guide, but so far I still think the best thing Legato/backup software can do is use O_DIRECT in open and tell the VM not to waste its time caching this stuff at all. Either way, the VM is still going to have unneeded stuff in the cache that it needs to detect quickly, be it zeroes from some software that didn't update to the new seek methods, or 1s and 0s from actual data, and no one is going to know except the user, and he'll not be able to tell the appdevs or the kerneldevs. :)

Posted by kenneth topp on February 14, 2006 at 01:19 PM PST #

The man page above says: "For filesystems that do not supply information about holes, the file will be represented as one entire data region." This seems very stupid, since there then is no way for a program to decide if the file system provides the sparse searching or if the program has to do it itself. A filesystem that can't supply this information rather should return ENOTSUP or some other relevant error.

Posted by Ragnar Sundblad on March 04, 2006 at 08:07 PM PST #

Hello, Jeff! Alexander Viro posted a question regarding your proposal to implement SEEK_HOLE and SEEK_DATA in lseek(2): "Well... Description makes sense and it isn't hard to implement. About the only question is about semantics for directories..." Do you mind replying to him at linux-kernel at vger kernel org and viro at ftp linux org uk? This is the thread: Thank you so much!

Posted by André Goddard Rosa on June 18, 2006 at 01:34 AM PDT #

I've done a preliminary SEEK_HOLE/SEEK_DATA patch for linux, basically a loop over FIBMAP without the root-only capability check (unneeded because no physical blocks are returned to user land).

Posted by Jeff Johnson on August 22, 2006 at 05:47 AM PDT #

The URL for the patch should have been https//

Posted by guest on August 22, 2006 at 05:48 AM PDT #

The semantics of allowing the FS to return a "hole" for any run of zero bytes, even from non-sparse blocks, are a bit useless because that can easily be replicated in the reader software. It'd be more useful to really only return holes for actual sparse blocks, so that the application can distinguish between never-been-written-to and written-to-with-all-zeroes. This is useful for validating regions of a file, for example for applications which do non-linear downloads. If the application is aborted, it can then check which blocks of the file were successfully written, and only re-download the missing blocks. I think it should be mandated that holes have a one-to-one correspondence to sparse blocks, and that they hence always indicate a region of a file that was never written to.

Posted by Anonymous on October 19, 2006 at 04:36 AM PDT #
