Thursday Dec 22, 2011

More thoughts on ZFS compression and crash dumps

Thanks to Darren Moffat for poking holes in my previous post, or more precisely for pointing out that I could have added more useful and interesting data. Darren commented that it was a shame I hadn't included the time taken to capture a crash dump alongside the size and space usage. This matters because one reason for using the compressed vmdump format from savecore is to minimize the time required to get the crash dump off the dump device and on to the file system.

The motivation for this reaches back many years, to when the default for Solaris was to use swap as the dump device. When you brought the system back up, you wanted to wait until savecore completed before letting the system come all the way up to multiuser (you can tell how old this is by the fact that we're not talking about SMF services).

So with Oracle Solaris 11 the root file system is ZFS, and the default configuration is to dump to a dedicated dump ZVOL. As that ZVOL isn't used by anything else, savecore can and does run in the background, so it isn't quite as important to make it as fast as possible. It's still interesting though; as with everything in life, it's a compromise.

One problem with the tests I wrote about yesterday is that the dumps are too small to make timing measurements easy (size is one thing, but we have fast disks now, so getting 8GB off a ZVOL on to a file system takes very little time).

So this is not a completely scientific test, but an illustration which helps me understand what the best solution for me is. My colleague Clive King wrote a driver that leaks kernel memory, which artificially increases the amount of data a crash dump contains. I used it to leak 126GB of kernel memory, set the savecore target directory to be one of "uncompressed", "gzip-9 compressed" or "LZJB compressed" (in the first case configuring savecore to use vmdump format compressed dumps), and then took a crash dump, repeating over the three configurations. The idea was to time the difference in getting the dump on to the file system.

This is a table of what I found:

 Size Leaked (GB) | Size of Crash Dump (GB) | ZFS pool space used (GB) | Compression  | Time to take dump (mm:ss) | Time from panic to crash dump available (mm:ss)
 126              | 8.4                     | 8.4                      | vmdump       | 01:48                     | 06:15
 126              | 140                     | 2.4                      | GZIP level 9 | 01:47                     | 11:39
 126              | 141                     | 8.02                     | LZJB         | 01:57                     | 07:05
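For reference, each run boiled down to repointing savecore and panicking the box. A minimal sketch of one iteration, assuming hypothetical mount points for the target data sets (the leak driver itself is internal and not shown):

```shell
# Point savecore at the target data set and choose the dump format:
#   -z on  => compressed vmdump format (the first configuration)
#   -z off => uncompressed vmcore, letting ZFS do any compression
dumpadm -s /crashdumps/vmdump -z on      # run 1: vmdump format
# dumpadm -s /crashdumps/gzip9 -z off    # run 2: gzip-9 data set
# dumpadm -s /crashdumps/lzjb  -z off    # run 3: LZJB data set

# Force a panic; on reboot savecore writes the dump in the background
reboot -d

# After reboot, compare timestamps to see when the dump became available
ls -l /crashdumps/vmdump
```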


Notice one thing: the compression ratio for gzip-9 is massive - around 70x. This is probably a side effect of the fact that it's not real data, and likely contains some very easily compressible patterns. The next step should be to populate the leaked memory with random data.

So what does this tell us? Assuming the lack of random content isn't an issue: for a modest hit in the time taken to get the dump from the dump device (7:05 vs 6:15), we get an uncompressed dump on an LZJB compressed ZFS file system while using a comparable amount of physical storage. This allows me to directly analyse the dump as soon as it's available. Great for development purposes. Is it of benefit to our customers? That's something I'd like feedback on. Please leave a comment if you see value in this being the default.

Wednesday Dec 21, 2011

Exploring ZFS options for storing crash dumps

Systems fail - for a variety of reasons. That's what keeps me in a job. When they do, you need to store data about the way the system failed if you want to be able to diagnose what happened. This is what we call a crash dump. But this data consumes a large amount of space on disk, so I thought I'd explore which ZFS technologies can help reduce the overhead of storing it.

The approach I took was to make a system (an x4600M2 with 256GB memory) moderately active using the filebench benchmarking utility, available from the Solaris 11 package repository; then take a system panic using reboot -d; then repeat the process twice more, taking live crash dumps using savecore -L. This generates three separate crash dumps, which will contain some unique and some duplicated data.
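Sketched out, the dump-generation steps look something like the following; the filebench workload file path is an assumption, as the exact workload used isn't recorded here:

```shell
# Make the system moderately active (workload file path is illustrative)
filebench -f /usr/benchmarks/filebench/workloads/fileserver.f &

# Dump 1: panic the system; savecore captures the dump after reboot
reboot -d

# Dumps 2 and 3: live crash dumps of the running system, no reboot needed
savecore -L
savecore -L
```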

There are a number of technologies available to us.

  • savecore compressed crash dumps
    • not a ZFS technology, but the default behavior in Oracle Solaris 11
    • Have a read of Steve Sistare's great blog on the subject
  • ZFS deduplication
    • Works on a block level
    • Should store only one copy of a block if it's repeated among multiple crash dumps
  • ZFS snapshots
    • If we are modifying a file, we should only save the changes
      • To make this viable, I had to modify the savecore program to create a snapshot of a filesystem on the fly, and reopen the existing crash dump and modify the file rather than create a new one
  • ZFS compression
    • either in addition to or instead of savecore compression
    • Multiple different levels of compression
      • I tried LZJB and GZIP (at level 9)

All of these can be applied to both compressed (vmdump) and non-compressed (vmcore) crash dumps. So I created multiple ZFS data sets with the relevant properties set, and repeated the crash dump creation, adjusting the savecore configuration using dumpadm(1m) to save to the various data sets, either using savecore compression or not.
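Creating those data sets is straightforward. A sketch, assuming the pool is named "space" (which matches the zdb output below) and using hypothetical data set names:

```shell
zfs create space/default                       # savecore compression only
zfs create -o dedup=on space/dedup             # block-level deduplication
zfs create space/snapshot                      # snapshots via modified savecore
zfs create -o compression=gzip-9 space/gz9     # gzip at level 9
zfs create -o compression=lzjb space/lzjb      # LZJB

# Then point savecore at one of them for each run, e.g.:
dumpadm -s /space/gz9
```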

Remember also one of the motivations of saving a crash dump compressed is to speed up the time it takes to get from the dump device to a file system, so you can send it to Oracle support for analysis.

So what do we get?

Let's look at the default case: no compression on the file system, but using the level of compression achieved by savecore (which is the same as that used by the panic process, and is either LZJB or BZIP2). Here we have three crash dumps totaling 8.86GB. If these same dumps are uncompressed we get 36.4GB of crash dumps (so we can see that savecore compression is saving us a lot of space).

Interestingly, use of dedup seems not to give us any benefit. I wouldn't have expected it to do so on vmdump format compressed dumps, as the act of compression is likely to make many more blocks unique, but I was surprised that vmcore format uncompressed dumps showed no benefit either. It's hard to see how dedup is behaving, because from a ZFS layer perspective the data is still full size, but zdb(1m) can show us the dedup table:

# zdb -D space
DDT-sha256-zap-duplicate: 37869 entries, size 329 on disk, 179 in core
DDT-sha256-zap-unique: 574627 entries, size 323 on disk, 191 in core

dedup = 1.03, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.03

The extra 0.03 dedup only came about when I started using the same pool for building Solaris kernel code.

I believe the lack of benefit is due to the fact that dedup works at a block level; as such, even the change of a single pointer in a single data structure in a block of the crash dump results in that block being unique and not being deduped.

In light of this, the fact that my modified savecore code using snapshots didn't show any benefit is not really a surprise.

So that leaves compression, and this is where we get some real benefits. By enabling both savecore and ZFS compression we get between a 20 and 50% saving in disk space. On uncompressed dumps, ZFS compression brings the data size down to between 4.63GB and 8.03GB - i.e. comparable to using savecore compression alone.

The table here shows the various usage (the final three rows, which had no names recorded, are the uncompressed-dump variants; percentages not in the original are derived from the sizes):

 Name          | Savecore compression | ZFS dedup | ZFS snapshot | ZFS compression | Size (GB) | % of default
 Default       | Yes                  | No        | No           | No              | 8.86      | 100
 DEDUP         | Yes                  | Yes       | No           | No              | 8.86      | 100
 Snapshot      | Yes                  | No        | Yes          | No              | 14.6      | 165
 GZ9compress   | Yes                  | No        | No           | GZIP level 9    | 4.46      | 50.3
 LZJBcompress  | Yes                  | No        | No           | LZJB            | 7.2       | 81.2
 Expanded      | No                   | No        | No           | No              | 36.4      | 410
 ExpandedDedup | No                   | Yes       | No           | No              | 36.4      | 410
               | No                   | No        | Yes          | No              | 37.7      | 425
               | No                   | No        | No           | GZIP level 9    | 4.63      | 52.3
               | No                   | No        | No           | LZJB            | 8.03      | 90.6

The anomaly here is the snapshot of savecore compressed data. I can only explain that by saying that, though I repeated the same process, the crash dumps created were larger in that particular case: 5GB each instead of 5GB and two lots of 2GB.

So what does this tell us? Well, fundamentally, the Oracle Solaris 11 default does a pretty good job of getting a crash dump file off the dump device and storing it in an efficient way. Block level optimisations (dedup and snapshots) don't help in minimizing the data size. And compression helps reduce the data size (big surprise there - not!).

If disk space is an issue for you, consider creating a compressed ZFS data set to store your crash dumps in.

If you want to analyse crash dumps in situ, then consider using uncompressed crash dumps, but written to a compressed ZFS data set.

Personally, as I do tend to want to look at the dumps as quickly as possible, I'll be setting my lab machines to create uncompressed crash dumps using

# dumpadm -z off

but create a zfs compressed data set, and use

# dumpadm -s /path/to/compressed/data/set

to make sure I can analyse the crash dump, but still not waste storage space.
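Putting the two together, my lab setup boils down to something like this, again assuming the pool name "space" and a hypothetical data set name:

```shell
# One-off setup: an LZJB compressed data set for uncompressed dumps
zfs create -o compression=lzjb space/crashdumps

dumpadm -z off                  # save uncompressed vmcore files
dumpadm -s /space/crashdumps    # savecore writes into the compressed data set
```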

Thursday Dec 03, 2009

zfs deduplication

Inspired by reading Jeff Bonwick's Blog I decided to give it a go on my development gates. A lot of files are shared between clones of the gate and even between builds, so hopefully I should get a saving in the number of blocks used in my project file system.

Being cautious I am using an alternate boot environment created using beadm, and backing up my code using hg backup (a useful mercurial extension included in the ON build tools)

I'm impressed. As it works at a block level rather than a file level, the saving isn't directly proportional to the number of duplicate files, but you still get a significant saving, albeit at the expense of using more CPU: it needs to do a SHA-256 checksum comparison of the blocks to ensure they're really identical.

Enabling it is simply a case of

 $ pfexec zfs set dedup=on <pool-name> 

Though obviously you can do so much more. Jeff's blog (and the comments) are a goldmine of information about the subject.
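To see how much you're actually saving, the pool reports an overall dedup ratio as a pool property (sample output omitted, as it varies with your data):

```shell
# The dedupratio property shows the overall deduplication ratio
zpool get dedupratio <pool-name>

# zpool list also reports it in the DEDUP column
zpool list <pool-name>
```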


Chris W Beal-Oracle

