Friday Apr 24, 2015

Better ways to handle crash dumps in Solaris

Better ways to handle crash dumps in Solaris

For some time now we've realised that system crash dumps in Solaris are getting larger, and taking longer to preserve. A number of us have been working on projects to help speed these up, and reduce the size.

The first of these was "Crash Dump Restructuring". Implemented by Sriman and Vlad This went in to Solaris 11.2, and allowed the administrator to remove certain portions from the crash dump via the dumpadm(1m) command. Or even if you do need all of the data, the less useful data (like zfs metadata) is stored in a separate file, and you can analyse the main core file without the suplementary ones. This should allow you to send only the required data back to Oracle for analysis. For more information have a look at the documentation here and here

Another popular one is the integration of the Autonomous Crashdump Analysis Tool (ACT) in to Solaris. This again was added in 11.2 by my colleague Chris. This allows you to run an mdb dcmd on a crash dump and get a summary of the commonly needed data in a text form. Again you could send this small(er) text file to Oracle for analysis before sending the entire crash dump.

Though act is part of the mdb package, you need to load the module to make it work

</var/crash/1> # mdb -k 1
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix zvpsm 
scsi_vhci iommu zfs rpcmod sata sd ip hook neti arp usba i915 stmf stmf_sbd 
sockfs md random idm cpc crypto ipc fcip fctl fcp zvmm lofs ufs logindmux nsmb 
ptm sppp nfs ]
> ::load act

> ::act
collector: 8.17
osrelease: 5.11.2
osversion: s11.2
arch: i86pc
target: core

[=== ACT Version: 8.17                                    <<< SUMMARY >>> ===]

[=== ACT report for core dump ===]
hostname: myhost
release: SunOS 5.11.2
architecture: i86pc
isa_list: amd64 pentium_pro+mmx pentium_pro pentium+mmx pentium i486 i386 i86
pagesize: 0t4096
hostid: 0x296e9

system booted at:0x543e6f2e  ##:2014 Oct 15 12:57:18 GMT
system crashed at:0x543e6ff3 ##:2014 Oct 15 13:00:35 GMT
dump started at:0x543e6ff8   ##:2014 Oct 15 13:00:40 GMT
panic: forced crash dump initiated at user request

<and copious stuff deleted>

And finally the big project that has really eaten all my and my colleagues time for a while now, and the motivation for this blog post, Deferred Dump. Working with Brian, Nick, and again Sriman and Vlad, we've worked to preserve the crash dump information in memory across a reboot. Thus eliminating the painfully slow process of writing to a dump device and extracting it again. Vlad has written an excellent blog about how this was possible, and why it needed doing. Please do go and take a look here

Deferred dump is now available in Solaris 11.2.8 (SRU 8 of Solaris 11 update 2), so we felt it was a good time to start talking about it. Vlad has highlighted some of the requirements as well.

Wednesday Dec 21, 2011

Exploring ZFS options for storing crash dumps

Systems fail - for a variety of reasons, That's what keeps me in a job. When they do, you need to store data about the way it failed if you want to be able to diagnose what happened. This is what we call a crash dump. But this data consumes a large amount of space on disk. So I thought I'd explore which ZFS technologies can help reduce the overhead of storing this data.

The approach I took was to make a system (x4600M2 with 256GB memory) moderately active using the filebench benchmarking utility, available from the Solaris 11 Package repository. Then take a system panic using reboot -d, then repeat the process twice more, taking live crash dumps using savecore -L. This should generate 3 separate crash dumps, which will have some unique, and some duplicated data.

There are a number of technologies available to us.

  • savecore compressed crash dumps
    • not a ZFS technology, but the default behavior in Oracle Solaris 11
    • Have a read of Steve Sistare's great blog on the subject
  • ZFS deduplication
    • Works on a block level
    • Should store only one copy of a block if it's repeated among multiple crash dumps
  • ZFS snapshots
    • If we are modifying a file, we should only save the changes
      • To make this viable, I had to modify the savecore program to create a snapshot of a filesystem on the fly, and reopen the existing crash dump and modify the file rather than create a new one
  • ZFS compression
    • either in addition to or instead of savecore compression
    • Multiple different levels of compression
      • I tried LZJB and GZIP (at level 9)

All of these can be applied to both compressed (vmdump) and non-compressed (vmcore) crash dumps.So I created multiple zfs data sets with the properties and repeated the crash dump creation, adjusting savecore configuration using dumpadm(1m) to save to the various data sets, either using savecore compression or not.

Remember also one of the motivations of saving a crash dump compressed is to speed up the time it takes to get from the dump device to a file system, so you can send it to Oracle support for analysis.

So what do we get?

Lets look at the default case, this is no compression on the file system, but using the level of compression achieved by savecore (which is the same as the panic process, and is either LZJB or BZIP2). In this we have three crash dumps, totaling 8.86GB. If these same dumps are uncompressed we get 36.4GB crash dumps (so we can see that savecore compression is saving us a lot of space)

Interestingly use of dedup seems to not give us any benefit, I wouldn't have expected it to do so on vmdump format comrpessed dumps, as the act of compression is likely to make many more block unique, but I was surprised so no vmcore format uncompressed dumps showed any benefit. It's hard to see the how dedup is behaving because from a ZFS layer perspective  the data is still full size, but use of zdb(1m) can show us the dedup table

# zdb -D space
DDT-sha256-zap-duplicate: 37869 entries, size 329 on disk, 179 in core
DDT-sha256-zap-unique: 574627 entries, size 323 on disk, 191 in core

dedup = 1.03, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.03

the extra 0.03 dedup only came about when I started using the same pool for building Solaris kernel code.

I believe the lack of benefit is due to the fact that dedup works at a block level, and as such, even the change of a single pointer in a single data structure in a block of the crash dump would result in the block being unique and not being deduped

In light of this - the fact that my modified savecore code to use snapshots didn't show any benefit, is not really a surprise.

So that leaves compression. And this is where we get some real benefits. By enabling both savecore and zfs compression get between 20 and 50% saving in disk space. On uncompressed dumps, you get data size between 4.63GB and 8.03GB - ie. comparable to using savecore compression.

The table here shows the various usage

"> Name  Savecore compression
 ZFS Snapshot
 ZFS compression
 Size (GB)
 % of default
 Default  Yes  No  No  No  8.86  100
 DEDUP  Yes  Yes  No  No  8.86  100
 Snapshot  Yes
 No  Yes  No  14.6
 GZ9compress  Yes  No  No
 GZIP level 9
 4.46  50.3
 LZJBcompress  Yes
 No  LZJB  7.2  81.2
 Expanded  No  No  No  No  36.4  410
 ExpandedDedup  No  Yes  No  No  36.4  410
 No  No  Yes  No  37.7  425
 No  No No GZIP level 9  4.63
 No  No No LZJB  8.03

The anomaly here is the Snapshot of savecore compressed data. I can only explain that by saying that, though I repeated the same process, the crashdumps created were larger in that particular case. 5Gb each in stead of 5GB and two lots of 2GB

So what does this tell us? Well fundamentally, the Oracle Solaris 11 default does a pretty good job of getting a crash dump file off the dump device and storing it in an efficient way. That block level optimisations don't help in minimizing the data size (dedup and snapshot). And compression helps us in data size (big surprise there - not!)

If disk space is an issue for you, consider creating a compressed zfs data set to store you crash dumps in.

If you want to analyse crash dumps in situ, then consider using uncompressed crash dumps, but written to compressed zfs data set.

Personally, as I do tend to want to look at the dumps as quickly as possible, I'll be setting my lab machines to create uncompressed crash dumps using

# dumpadm -z off

but create a zfs compressed data set, and use

# dumpadm -s /path/to/compressed/data/set

to make sure I can analyse the crash dump, but still not waste storage space

Tuesday Mar 15, 2011

Modeling Panic event in FMA

I haven't blogged in ages, in fact since Sun was taken over by Oracle. However I've not been idle, far from it, just working on product to get it out to market as soon as possible.

However - the release of Solaris 11 Express 2010.11 (yes I've been so busy I haven't even got round to writing this entry for 4 months!) I can tell you about one thing I've been working on with members of the FMA and  SMF teams. It's part of a larger effort to more tightly integrated software "troubles" in to FMA. This includes modeling SMF state changes in FMA, and my favorite, modeling System panic events in FMA.

I won't go in to the details, but in summary, when a system reboots after a panic, savecore is run (even if dumpadm -n is in effect) to check if a dump is present on the dump device. If there is, it raise an "Information Report" for fma to process. This becomes and FMA MSGID of  SUNOS-8000-KL. You should see a message on the console if you're looking, giving instructions on what to do next. There is a small amount of data about the crash, panicstring, stack, date etc embedded in the report. Once savecore is run to extract the dump from the dump device, another information report is raised which FMA ties to the first event, and solves the case.

One of the nice things that can then happen, is the FMA notification capabilities are open to us, so you could set up an SNMP trap or email notification for such a panic. A small thing, but it might help some sysadmins in the middle of the night.

One final thing. That small amount of information in the Ireport can be accessed using fmdump with the -V flag for the uuid of the fault (as reported in the messages on the console or fmadm faulty), for example, this was from a panic I induced by clearing the root vnode pointer.

# fmdump -Vu b2e3080a-5a85-eda0-eabe-e5fa2359f3d0 
TIME                           UUID                                 SUNW-MSG-ID
Jan 13 2011 13:39:17.364216000 b2e3080a-5a85-eda0-eabe-e5fa2359f3d0 SUNOS-8000-KL

  TIME                 CLASS                                 ENA
  Jan 13 13:39:17.1064 ireport.os.sunos.panic.dump_available 0x0000000000000000
  Jan 13 13:33:19.7888 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
        code = SUNOS-8000-KL
        diag-time = 1294925957 157194
        de = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = fmd
                authority = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        product-id = CELSIUS-W360
                        chassis-id = YK7K081269
                        server-id = tetrad
                (end authority)

                mod-name = software-diagnosis
                mod-version = 0.1
        (end de)

        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.kernel.panic
                certainty = 0x64
                asru = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = sw
                        object = (embedded nvlist)
                        nvlist version: 0
                                path = /var/crash/<host>/.b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
                        (end object)

                (end asru)

                resource = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = sw
                        object = (embedded nvlist)
                        nvlist version: 0
                                path = /var/crash/<host>/.b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
                        (end object)

                (end resource)

                savecore-succcess = 1
                dump-dir = /var/crash/tetrad
                dump-files = vmdump.1
                os-instance-uuid = b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
                panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff0015a865b0 addr=ffffff0200000000
                panicstack = unix:die+10f () | unix:trap+1799 () | unix:cmntrap+e6 () | unix:mutex_enter+b () | genunix:lookupnameatcred+97 () | genunix:lookupname+5c () | elfexec:elf32exec+a5c () | genunix:gexec+6d7 () | genunix:exec_common+4e8 () | genunix:exece+1f () | unix:brand_sys_syscall+1f5 () | 
                crashtime = 1294925154
                panic-time = January 13, 2011 01:25:54 PM GMT GMT
        (end fault-list[0])

        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x4d2f0085 0x15b57ec0

Any way, I hope you find this feature useful. I'm hoping to use the data embedded in the event for data mining, and problem resolution. However if you have any ideas of other information that could be realistically added to the ireport, then please let me know. However you have to bare in mind this information is written while the system is panicking, so what can be reliably gathered is somewhat limited


Chris W Beal-Oracle


« May 2015