Thursday Jan 24, 2013

dtrace profiling


DTrace never fails to impress me in its ability to simply find the answer to a technical question. My only problem is remembering the syntax, so I thought I'd blog it here.

I could see from mpstat that my CPU was all being consumed in userland, so this simple DTrace command helped identify the culprit.

# dtrace -n 'profile-1001 /arg1/{@[execname, ustack()]=count()}END{trunc(@,10)}'

This showed me the top 10 stacks found, aggregated by process name too. In my case it showed that pkgrecv was taking the CPU, and it was in fact the libcurl part.
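The /arg1/ predicate restricts the samples to those taken with a valid user-land program counter. For the kernel-side equivalent (a variant I find handy, not something I needed here), predicate on arg0 and aggregate kernel stacks instead:

# dtrace -n 'profile-1001 /arg0/{@[stack()]=count()}END{trunc(@,10)}'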

Useful information for me



Thursday May 31, 2012

Creating an SMF service for mercurial web server

I'm working on a project at the moment which has a number of contributors. We're managing the project gate (which is standalone) with Mercurial. We want an easy way of seeing the changelog, so we can show management what is going on.

Luckily Mercurial provides a basic web server which allows you to see the changes and drill into changesets. This can be run as a daemon, but as it was running on our build server, every time that was rebooted someone needed to remember to start the process again. This is of course a classic use case for SMF.

Now I'm not experienced at writing SMF services, so it took me half an hour or so to figure it out the first time. But going forward I should know what I'm doing a bit better. I did reference this doc extensively.

Taking a step back, the command to start the mercurial web server is

 $ hg serve -p <port number> -d

So we somehow need to get SMF to run that command for us.

In the simplest form, SMF services are really made up of two components.

  1. The manifest
    1. Usually lives in /var/svc/manifest somewhere
    2. Can be imported from any location
  2. The method
    1. Usually lives in /lib/svc/method
      1. I simply put the script straight in that directory. Not very repeatable, but it worked
    2. Can take an argument of start, stop, or refresh

Let's start with the manifest. This looks pretty complex, but all it's doing is describing the service name, the dependencies, the start and stop methods, and some properties.

The properties can be set per instance; that is to say, I could have multiple hg serve processes handling different Mercurial projects on different ports simultaneously.

Here is the manifest I wrote; I stole extensively from the examples in the documentation.

$ cat hg-serve.xml 
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='hg-serve'>

<service
name='application/network/hg-serve'
type='service'
version='1'>

<dependency 
name='network' 
grouping='require_all' 
restart_on='none' 
type='service'> 
<service_fmri value='svc:/milestone/network:default' /> 
</dependency>

<exec_method 
type='method' 
name='start' 
exec='/lib/svc/method/hg-serve %m' 
timeout_seconds='2' />

<exec_method
type='method'
name='stop'
exec=':kill'
timeout_seconds='2'>
</exec_method>

<instance name='project-gate' enabled='true'>
<method_context> 
<method_credential user='root' group='root' /> 
</method_context>
<property_group name='hg-serve' type='application'> 
<propval name='path' type='astring' value='/src/project-gate'/>
<propval name='port' type='astring' value='9998' />
</property_group> 
</instance>

<stability value='Evolving' />
<template> 
<common_name> 
<loctext xml:lang='C'>hg-serve</loctext> 
</common_name> 
<documentation> 
<manpage title='hg' section='1' /> 
</documentation> 
</template> 
</service> 
</service_bundle>

So the only things I had to decide on in this are the service name ("application/network/hg-serve"), the start and stop methods (more of which later), and the properties. The properties are the information I need to pass to the start method script: in my case the port I want to start the web server on ("9998") and the path to the source gate ("/src/project-gate"). These can be read in by the start method.

So now let's look at the method script

$ cat /lib/svc/method/hg-serve 
#!/sbin/sh
#

#
# Copyright (c) 2012, Oracle and/or its affiliates. All rights reserved.
#

# Standard prolog
#
. /lib/svc/share/smf_include.sh

if [ -z "$SMF_FMRI" ]; then
        echo "SMF framework variables are not initialized."
        exit $SMF_EXIT_ERR
fi

#
# Build the command line flags
#
# Get the port and directory from the SMF properties

port=`svcprop -c -p hg-serve/port $SMF_FMRI`
dir=`svcprop -c -p hg-serve/path $SMF_FMRI`

echo "$1"
case "$1" in
'start')
	cd $dir
	/usr/bin/hg serve -d -p $port 
;;
*) 
echo "Usage: $0 {start|refresh|stop}" 
exit 1 
;; 
esac

exit $SMF_EXIT_OK

This is all pretty self explanatory: we read the port and directory using svcprop, and use those to run the command in the start case. We don't need to implement a stop case, as the manifest specifies exec=':kill' for the stop method.

Now all we need to do is import the manifest and start the service, but first verify the manifest

# svccfg verify /path/to/hg-serve.xml

If that doesn't give an error try importing it

# svccfg import /path/to/hg-serve.xml

If, like me, you originally put the hg-serve.xml file somewhere under /var/svc/manifest, you'll get an error and be told to restart the manifest-import service

svccfg: Restarting svc:/system/manifest-import
 The manifest being imported is from a standard location and should be imported with the  command : svcadm restart svc:/system/manifest-import
# svcadm restart svc:/system/manifest-import

and you're nearly done. You can look at the service using svcs -l

# svcs -l hg-serve
fmri         svc:/application/network/hg-serve:project-gate
name         hg-serve
enabled      false
state        disabled
next_state   none
state_time   Thu May 31 16:11:47 2012
logfile      /var/svc/log/application-network-hg-serve:project-gate.log
restarter    svc:/system/svc/restarter:default
contract_id  15749 
manifest     /var/svc/manifest/network/hg/hg-serve.xml
dependency   require_all/none svc:/milestone/network:default (online)

And look at the interesting properties

# svcprop hg-serve
hg-serve/path astring /src/project-gate
hg-serve/port astring 9998

...stuff deleted....

Then simply enable the service and, if everything's gone right, you can point your browser at http://server:9998 and get a nice graphical log of project activity.

# svcadm enable hg-serve
# svcs -l hg-serve
fmri         svc:/application/network/hg-serve:project-gate
name         hg-serve
enabled      true
state        online
next_state   none
state_time   Thu May 31 16:18:11 2012
logfile      /var/svc/log/application-network-hg-serve:project-gate.log
restarter    svc:/system/svc/restarter:default
contract_id  15858 
manifest     /var/svc/manifest/network/hg/hg-serve.xml
dependency   require_all/none svc:/milestone/network:default (online)

None of this is rocket science, but it is a bit fiddly, hence the blog. It might just be that you see this in Google and it clicks with you more than one of the many other blogs or how-tos about it. Plus I can always refer back to it myself in three weeks, when I want to add another project to the server and I've forgotten how to do it.
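For that future me: one approach to adding another project is to add a second instance of the service with its own property group. This is only a sketch, using a hypothetical gate at /src/another-gate on port 9999; in practice you might prefer to add the instance to the manifest and re-import it, so the method_context comes along too.

# svccfg -s application/network/hg-serve
svc:/application/network/hg-serve> add another-gate
svc:/application/network/hg-serve> select another-gate
svc:/application/network/hg-serve:another-gate> addpg hg-serve application
svc:/application/network/hg-serve:another-gate> setprop hg-serve/path = astring: "/src/another-gate"
svc:/application/network/hg-serve:another-gate> setprop hg-serve/port = astring: "9999"
svc:/application/network/hg-serve:another-gate> exit
# svcadm refresh hg-serve:another-gate
# svcadm enable hg-serve:another-gate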


 
  


Thursday Feb 09, 2012

Email notification of FMA events

One of the projects I worked on for Solaris 11 was to record some information on system panics in FMA events. Now I want to start making it easier to gather this information and map it to known problems. So, starting internally, I plan to utilise another feature which we developed as part of the same effort: the email notifications framework. Rob Johnston described this feature in his blog here.

The nice feature I want to utilise is custom message templates, so I thought I'd share how to do this. It's pretty simple, but I got burnt by a couple of slight oddities - which we can probably fix.

First off I needed to create a template. There are a number of committed expansion tokens - these expand information from the FMA event into meaningful info in the email. The ones I care about this time are

%<HOSTNAME> : Hostname of the system which had the event
%<UUID> : UUID of the event - so you can mine more information
%<URL> : URL of the knowledge doc describing the problem

In addition I want to get some data that is panic specific. As yet these are uncommitted interfaces and shouldn't be relied upon, but for my reference they can be accessed as

Panic String of the dump is %<fault-list[0].panicstr>
Stack trace to put in to MOS is  %<fault-list[0].panicstack>

These are visible in the panic event - so I don't feel bad about revealing the names, but I stress they shouldn't be relied upon.

So create a template which contains the text you want. Make sure it's readable by the noaccess user (i.e. don't put it under /root).

The one I created for now looks like this

# cat /usr/lib/fm/notify/panic_template
%<HOSTNAME> Panicked

For more information log in to %<HOSTNAME> and run the command

fmdump -Vu %<UUID>

Please look at %<URL> for more information

Crash dump is available on %<HOSTNAME> in %<fault-list[0].dump-dir>

Panic String of the dump is %<fault-list[0].panicstr>

Stack trace to put in to MOS is  %<fault-list[0].panicstack>
 

I then need to add this to the notification for the "problem-diagnosed" event class. This is done with the svccfg command

 

# svccfg setnotify problem-diagnosed \"mailto:someone@somehost?msg_template=/usr/lib/fm/notify/panic_template\"

(Note the backslashes and quotes - they're important to get the parser to recognise the "=" correctly.)

It would be nice to tie it specifically to a panic event, but that needs a bit of plumbing to make it happen.

You can  verify it is configured correctly with the command

 

# svccfg listnotify problem-diagnosed
    Event: problem-diagnosed (source: svc:/system/fm/notify-params:default)
        Notification Type: smtp
            Active: true
            reply-to: root@localhost
            msg_template: /usr/lib/fm/notify/panic_template
            to: someone@somehost

Now when I get a panic, I get an email with some useful information I can use to start diagnosing the problem.

So what next? I think I'll try to firm up the stability of the useful members of the event, maybe create a new event we can subscribe to for panics only, then make this template an "extended support" option for panic events, and make it easily configurable.

Please do leave comments if you have any opinions on this and where to take it next.



 





Thursday Dec 22, 2011

More thoughts on ZFS compression and crash dumps

Thanks to Darren Moffat for poking holes in my previous post, or more explicitly for pointing out that I could add more useful and interesting data. Darren commented that it was a shame I hadn't included the time to take a crash dump alongside the size and space usage. The reason for this omission is that one reason for using vmdump format compression from savecore is to minimize the time required to get the crash dump off the dump device and on to the file system.

The motivation for this reaches back many years, back to when the default for Solaris was to use swap as the dump device. So when you brought the system back up, you wanted to wait until savecore completed before letting the system come all the way up to multiuser (you can tell how old this is by the fact we're not talking about SMF services).

So with Oracle Solaris 11 the root file system is ZFS, and the default configuration is to dump to a dump ZVOL. And as it's not used by anything else, the savecore can and does run in the background. So it isn't quite as important to make it as fast as possible. It's still interesting though, as with everything in life, it's a compromise.

One problem with the tests I wrote about yesterday is that the size of the dumps is too small to make measurement of time easy (size is one thing, but we have fast disks now, so getting 8GB off a zvol on to a file system takes very little time).

So this is not a completely scientific test, but an illustration which helps me understand what the best solution for me is. My colleague Clive King wrote a driver to leak memory, creating larger kernel memory segments, which artificially increases the amount of data a crash dump contains. I told this to leak 126GB of kernel memory, set the savecore target directory to be one of "uncompressed", "gzip9 compressed" or "LZJB compressed" (and in the first case set it to use vmdump format compressed dumps), then took a crash dump, repeating over the three configurations. The idea was to time the difference in getting the dump on to the file system.

This is a table of what I found

Size Leaked (GB)   Size of Crash Dump (GB)   ZFS pool space used (GB)   Compression    Time to take dump (mm:ss)   Time from panic to crash dump available (mm:ss)
126                8.4                       8.4                        vmdump         01:48                       06:15
126                140                       2.4                        GZIP level 9   01:47                       11:39
126                141                       8.02                       LZJB           01:57                       07:05

 

Notice one thing: the compression ratio for gzip-9 is massive - 70x - so this is probably a side effect of the fact that it's not real data, but contains some easily compressible content. The next step should be to populate the leaked memory with random data.

So what does this tell us? Assuming the lack of random content isn't an issue: for a modest hit in the time taken to get the dump from the dump device (7:05 vs 6:15), we get an uncompressed dump on an LZJB-compressed ZFS file system while using a comparable amount of physical storage. This allows me to directly analyse the dump as soon as it's available. Great for development purposes. Is it of benefit to our customers? That's something I'd like feedback on. Please leave a comment if you see value in this being the default.




Wednesday Dec 21, 2011

Exploring ZFS options for storing crash dumps

Systems fail for a variety of reasons - that's what keeps me in a job. When they do, you need to store data about the way they failed if you want to be able to diagnose what happened. This is what we call a crash dump. But this data consumes a large amount of space on disk, so I thought I'd explore which ZFS technologies can help reduce the overhead of storing it.

The approach I took was to make a system (an x4600 M2 with 256GB of memory) moderately active using the filebench benchmarking utility, available from the Solaris 11 package repository, then take a system panic using reboot -d, then repeat the process twice more, taking live crash dumps using savecore -L. This generates three separate crash dumps, which will have some unique and some duplicated data.

There are a number of technologies available to us.

  • savecore compressed crash dumps
    • not a ZFS technology, but the default behavior in Oracle Solaris 11
    • Have a read of Steve Sistare's great blog on the subject
  • ZFS deduplication
    • Works on a block level
    • Should store only one copy of a block if it's repeated among multiple crash dumps
  • ZFS snapshots
    • If we are modifying a file, we should only save the changes
      • To make this viable, I had to modify the savecore program to create a snapshot of a filesystem on the fly, and reopen the existing crash dump and modify the file rather than create a new one
  • ZFS compression
    • either in addition to or instead of savecore compression
    • Multiple different levels of compression
      • I tried LZJB and GZIP (at level 9)

All of these can be applied to both compressed (vmdump) and non-compressed (vmcore) crash dumps. So I created multiple ZFS data sets with these properties and repeated the crash dump creation, adjusting the savecore configuration using dumpadm(1M) to save to the various data sets, either using savecore compression or not.
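The data sets themselves are just ZFS file systems with the relevant property set. A rough sketch of what that looks like (the data set names are my invention; the pool is the one called space that appears later in the post):

# zfs create -o compression=lzjb space/crash-lzjb
# zfs create -o compression=gzip-9 space/crash-gz9
# zfs create -o dedup=on space/crash-dedup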

Remember also that one of the motivations for saving a crash dump compressed is to reduce the time it takes to get it from the dump device to a file system, so you can send it to Oracle support for analysis.

So what do we get?

Let's look at the default case: no compression on the file system, but using the level of compression achieved by savecore (which is the same as the panic process uses, either LZJB or BZIP2). Here we have three crash dumps totalling 8.86GB. If these same dumps are saved uncompressed we get 36.4GB of crash dumps (so we can see that savecore compression is saving us a lot of space).

Interestingly, use of dedup seems to give us no benefit. I wouldn't have expected it to do so on vmdump format compressed dumps, as the act of compression is likely to make many more blocks unique, but I was surprised that no vmcore format uncompressed dumps showed any benefit either. It's hard to see how dedup is behaving, because from a ZFS layer perspective the data is still full size, but use of zdb(1M) can show us the dedup table

# zdb -D space
DDT-sha256-zap-duplicate: 37869 entries, size 329 on disk, 179 in core
DDT-sha256-zap-unique: 574627 entries, size 323 on disk, 191 in core

dedup = 1.03, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.03

The extra 0.03 of dedup only came about when I started using the same pool for building Solaris kernel code.

I believe the lack of benefit is due to the fact that dedup works at a block level, and as such, even the change of a single pointer in a single data structure in a block of the crash dump would result in the block being unique and not being deduped.

In light of this - the fact that my modified savecore code to use snapshots didn't show any benefit, is not really a surprise.

So that leaves compression. And this is where we get some real benefits. By enabling both savecore and ZFS compression you get between a 20 and 50% saving in disk space. On uncompressed dumps, you get a data size between 4.63GB and 8.03GB - i.e. comparable to using savecore compression.

The table here shows the various usage figures

"> Name  Savecore compression
 ZFS DEDUP
 ZFS Snapshot
 ZFS compression
 Size (GB)
 % of default
 Default  Yes  No  No  No  8.86  100
 DEDUP  Yes  Yes  No  No  8.86  100
 Snapshot  Yes
 No  Yes  No  14.6
 165
 GZ9compress  Yes  No  No
 GZIP level 9
 4.46  50.3
 LZJBcompress  Yes
 No
 No  LZJB  7.2  81.2
 Expanded  No  No  No  No  36.4  410
 ExpandedDedup  No  Yes  No  No  36.4  410
 ExpandedSnapshot
 No  No  Yes  No  37.7  425
 ExpandedGZ9
 No  No No GZIP level 9  4.63
 52.4
 ExpandedLZJB
 No  No No LZJB  8.03
 91

The anomaly here is the snapshot of savecore-compressed data. I can only explain that by saying that, though I repeated the same process, the crash dumps created were larger in that particular case: 5GB each instead of one of 5GB and two lots of 2GB.

So what does this tell us? Well, fundamentally, the Oracle Solaris 11 default does a pretty good job of getting a crash dump file off the dump device and storing it in an efficient way. Block-level optimisations (dedup and snapshots) don't help in minimizing the data size. And compression helps us with data size (big surprise there - not!).

If disk space is an issue for you, consider creating a compressed ZFS data set to store your crash dumps in.

If you want to analyse crash dumps in situ, then consider using uncompressed crash dumps, but written to a compressed ZFS data set.

Personally, as I do tend to want to look at the dumps as quickly as possible, I'll be setting my lab machines to create uncompressed crash dumps using

# dumpadm -z off

but create a zfs compressed data set, and use

# dumpadm -s /path/to/compressed/data/set

to make sure I can analyse the crash dump, but still not waste storage space.
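Putting that together, the whole recipe is only a few commands. This is just a sketch - rpool/crash and the mountpoint are made-up names, not from my actual setup:

# zfs create -o compression=lzjb -o mountpoint=/var/crash/compressed rpool/crash
# dumpadm -z off
# dumpadm -s /var/crash/compressed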




Tuesday Nov 29, 2011

Package Version Numbers, why are they so important

One of the design goals of IPS has been to allow people to easily move forward to a supported "surface" of components. That is to say, when you

# pkg update

your system, you get the latest set of components which all work together, based on the packages you already have installed. During development, this has simply meant you update to the latest "build" of the components. (During development, we build everything and publish everything every two weeks.)

Now that we've released Solaris 11 using the IPS technologies, things are a bit more complicated. We need to be able to reflect all the types of Solaris release we are doing - for example Solaris development builds, Solaris update builds and "Support Repository Updates" (the replacement for patches) - in the version scheme. So simply saying "151" as the build number isn't sufficient to articulate what you are running, or indeed what is available to update to.

In my previous blog post I talked about creating your own package, and gave an example FMRI of

pkg://tools/mytools@1.3,0.5.11-0.0.0

But it's probably more instructive to look at the FMRI of a Solaris package. The package "core-os" contains all the common utilities and daemons you need to use Solaris.

 $ pkg info core-os
          Name: system/core-os
       Summary: Core Solaris
   Description: Operating system core utilities, daemons, and configuration
                files.
      Category: System/Core
         State: Installed
     Publisher: solaris
       Version: 0.5.11
 Build Release: 5.11
        Branch: 0.175.0.0.0.2.1
Packaging Date: Wed Oct 19 07:04:57 2011
          Size: 25.14 MB
          FMRI: pkg://solaris/system/core-os@0.5.11,5.11-0.175.0.0.0.2.1:20111019T070457Z

The FMRI is what we will concentrate on here. In this package "solaris" is the publisher. You can use the pkg publisher command to see where the solaris publisher gets its bits from

$ pkg publisher
PUBLISHER                             TYPE     STATUS   URI
solaris                               origin   online   http://pkg.oracle.com/solaris/release/

So we can see we get solaris packages from pkg.oracle.com. The package name is system/core-os; package names can have an arbitrary number of components, to allow you to group similar packages together. Now on to the interesting bit, the version: everything after the @ is part of the version, and IPS will only upgrade to a "higher" version.

core-os@0.5.11,5.11-0.175.0.0.0.2.1:20111019T070457Z

core-os = Package Name

0.5.11 = Component - in this case we're saying it's a SunOS 5.11 package

, = separator

5.11 = Built on version - to indicate what OS version you built the package on

- = another separator

0.175.0.0.0.2.1 = Branch Version

: = yet another separator

20111019T070457Z = Time stamp when the package was published

So from that we can see the Branch Version seems rather complex. It is necessarily so, to allow us to describe the hierarchy of releases we do. In this example we see the following

0.175: is known as the trunkid, and is incremented each build of a new release of Solaris. During Solaris 11 this should not change 

0: is the Update release for Solaris. 0 for FCS, 1 for update 1 etc

0: is the SRU for Solaris. 0 for FCS, 1 for SRU 1 etc

0: is reserved for future use

2: Build number of the SRU

1: Nightly ID - only important for Solaris developers

Take a hypothetical example

core-os@0.5.11,5.11-0.175.1.5.0.4.1:<something>

This would be build 4 of SRU 5 of Update 1 of Solaris 11
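If you want to see which versions of a package you have installed and which are available from your configured publishers, pkg list can show you; the VERSION column uses exactly the scheme described above (output omitted here, as it varies by system and repository):

$ pkg list -af system/core-os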

The version scheme is actually documented in MOS article 1378134.1, which you can read if you have a support contract.

Tuesday Oct 25, 2011

Adventures in Regular Expressions

I'm one of those people who will get stuck in and solve problems, even if I don't know everything about an area a problem lies in. As such I often find I'm learning new and unexpected things. Hey, that's why I love coming to work.

So the project I'm working on at the moment relates to how we build the SRUs (the Support Repository Updates), which replace patches in Solaris 11. As such I'm learning a lot about IPS - the Image Packaging System, and in particular how the tools it provides help you deliver consistent and upgradeable packages.

The trick I picked up this week is about regular expressions. I needed to change some values in the FMRI of a package. I had been doing it in a set of shell scripts until Mark Nelson of the SMF team pointed out I was rewriting pkgmogrify(1).

Reading the pkgmogrify(1) man page left me feeling less than knowledgeable; I'd sort of gathered it worked with regular expressions. Now these are well known in the industry, it's just that I've never needed to use them.

So after a bit of experimenting I found I can substitute values in a string using the "edit" directive. This is the relevant portion of the man page, which makes sense now that I know the answer

     edit      Modifies an attribute of the action.  Three arguments are
               taken.  The first is the name of the attribute, the second
               is a regular expression matching the attribute value, and
               the third is the replacement string substituted for the
               portion of the value matched by the regular expression.
               Unlike the regular expression used to match an action, this
               expression is unanchored.  Normal regular expression
               backreferences, of the form '\1', '\2', etc., are available
               in the replacement string, if there are groups defined in
               the regular expression.

The last sentence is the clincher for what I need to do. I can search a string for a pattern, load the matches into "groups", and then reference them in the replacement string. So for example, I can change the package version string from "-0.175.0.0.0" to "-0.175.1.1.1" using a transform, in a way where the 175 might change and I want that change to be reflected in the resulting manifest

eg.

$ cat trans/mytransform 
<transform set name=pkg.fmri -> edit value \
    '@*-0.([0-9])([0-9])([0-9]).+' '-0.\1\2\3.1.1.1'>

The "Groups" are defined in the round brackets eg. ([0-9]) for example. Then I can simply run pkgmogrify to get the  result

 

$ pkgmogrify -I trans -O <output file> <input manifest> mytransform

 

This performs the substitution just as I needed it to.
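To make that concrete, here's a hypothetical before and after on a made-up pkg.fmri line (the package name and versions are invented purely for illustration):

$ grep pkg.fmri mytools.mf
set name=pkg.fmri value=pkg://tools/mytools@1.3,5.11-0.175.0.0.0
$ pkgmogrify -I trans mytools.mf mytransform | grep pkg.fmri
set name=pkg.fmri value=pkg://tools/mytools@1.3,5.11-0.175.1.1.1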

Tuesday Oct 04, 2011

How to create your own IPS packages

It's been ages since I blogged, so apologies for that. I'm at Oracle OpenWorld 2011 this week, and I was helping out on the Solaris booths in the DemoGrounds. One of the people I talked to asked a question which I've also heard internally at Oracle a few times recently: "How can I package up my own app as an IPS package?" I've answered it a few times, and so it suddenly struck me as a good subject for a blog.

Most of this information is available either in the documentation or in other blogs, but this is also for me to reference back to (rather than my notes).

With IPS, packages are published to a repository (or repo). This can either be a special pkg web server or simply a directory. For simplicity I've used the latter

$ export PKGREPO=/path/to/repo

$ pkgrepo create $PKGREPO 

This populates $PKGREPO with the required repository structure; you only need to do this once.

Packages are described by manifests; you need one per package. If you have existing System V packages, you can let the tools generate it all for you, simply by passing the System V package file to a tool called pkgsend

$ pkgsend generate sysvpkg > sysvpkg.mf

However, often my own tools are just installed by hand or archived together with tar. So if I want to turn those into an IPS package, I need to create my own manifest. Again, fortunately, pkgsend can help, but you'll need to add some details.

 So let's assume I've put all my tools in /opt/mytools

$ export ROOT=/opt/mytools
$ pkgsend generate $ROOT > mytools.mf

This manifest needs a line added to it to describe the package

set name=pkg.fmri \
    value=pkg://tools/mytools@1.3,0.5.11-0.0.0

This states the publisher (tools), the package name (mytools) and the version (1.3,0.5.11-0.0.0).
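It's also worth setting a summary and description while you're editing the manifest; pkg.summary and pkg.description are standard attributes, though the wording here is just for illustration:

set name=pkg.summary value="My local tools"
set name=pkg.description value="Scripts and binaries installed under /opt/mytools"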

So once you have a working manifest - and there's a lot more detail we can add to these - it can be published (to the $PKGREPO directory we created earlier).

$ pkgsend publish -s $PKGREPO -d $ROOT mytools.mf

Note that for each package successfully published you get a "Published" message, but failures are silent (the exit code is 1 though, so they can be detected in a script).

You can then add the repo you've just populated to your test machine. You'll need to do this as a privileged user, such as root

# pkg set-publisher -O $PKGREPO tools


and add a package

# pkg install mytools
Packages to install:  1
Create boot environment: No

DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                                  1/1     716/716      2.4/2.4

PHASE                                        ACTIONS
Install Phase                                746/746

PHASE                                          ITEMS
Package State Update Phase                       1/1
Image State Update Phase                         2/2

PHASE                                          ITEMS
Reading Existing Index                           8/8
Indexing Packages                                1/1

So this is now installed on my system. There are so many interesting things you can add to the manifest that, if you're interested, I'll try to blog about them in the future. However, one thing I've glossed over here is the package version. It's surprisingly important that you understand what you set that to, so I'll make that the subject of a future blog.





Tuesday Mar 15, 2011

Modeling Panic event in FMA

I haven't blogged in ages, in fact not since Sun was taken over by Oracle. However I've not been idle, far from it, just working on product to get it out to market as soon as possible.

However, with the release of Solaris 11 Express 2010.11 (yes, I've been so busy I haven't even got round to writing this entry for 4 months!) I can tell you about one thing I've been working on with members of the FMA and SMF teams. It's part of a larger effort to more tightly integrate software "troubles" into FMA. This includes modeling SMF state changes in FMA, and my favorite, modeling system panic events in FMA.

I won't go into the details, but in summary: when a system reboots after a panic, savecore is run (even if dumpadm -n is in effect) to check if a dump is present on the dump device. If there is, it raises an "Information Report" for FMA to process. This becomes an FMA MSGID of SUNOS-8000-KL. You should see a message on the console if you're looking, giving instructions on what to do next. There is a small amount of data about the crash - panic string, stack, date etc. - embedded in the report. Once savecore is run to extract the dump from the dump device, another information report is raised, which FMA ties to the first event, and solves the case.

One of the nice things that can then happen is that the FMA notification capabilities are open to us, so you could set up an SNMP trap or email notification for such a panic. A small thing, but it might help some sysadmins in the middle of the night.
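For example, hooking an email notification up to newly diagnosed problems (which includes these panic events) is a one-liner with svccfg setnotify; the address here is obviously just a placeholder:

# svccfg setnotify problem-diagnosed mailto:admin@example.com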

One final thing. That small amount of information in the ireport can be accessed using fmdump with the -Vu flags and the UUID of the fault (as reported in the messages on the console or by fmadm faulty). For example, this was from a panic I induced by clearing the root vnode pointer.

# fmdump -Vu b2e3080a-5a85-eda0-eabe-e5fa2359f3d0 
TIME                           UUID                                 SUNW-MSG-ID
Jan 13 2011 13:39:17.364216000 b2e3080a-5a85-eda0-eabe-e5fa2359f3d0 SUNOS-8000-KL

  TIME                 CLASS                                 ENA
  Jan 13 13:39:17.1064 ireport.os.sunos.panic.dump_available 0x0000000000000000
  Jan 13 13:33:19.7888 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
        code = SUNOS-8000-KL
        diag-time = 1294925957 157194
        de = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = fmd
                authority = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        product-id = CELSIUS-W360
                        chassis-id = YK7K081269
                        server-id = tetrad
                (end authority)

                mod-name = software-diagnosis
                mod-version = 0.1
        (end de)

        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.kernel.panic
                certainty = 0x64
                asru = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = sw
                        object = (embedded nvlist)
                        nvlist version: 0
                                path = /var/crash/<host>/.b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
                        (end object)

                (end asru)

                resource = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = sw
                        object = (embedded nvlist)
                        nvlist version: 0
                                path = /var/crash/<host>/.b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
                        (end object)

                (end resource)

                savecore-succcess = 1
                dump-dir = /var/crash/tetrad
                dump-files = vmdump.1
                os-instance-uuid = b2e3080a-5a85-eda0-eabe-e5fa2359f3d0
                panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff0015a865b0 addr=ffffff0200000000
                panicstack = unix:die+10f () | unix:trap+1799 () | unix:cmntrap+e6 () | unix:mutex_enter+b () | genunix:lookupnameatcred+97 () | genunix:lookupname+5c () | elfexec:elf32exec+a5c () | genunix:gexec+6d7 () | genunix:exec_common+4e8 () | genunix:exece+1f () | unix:brand_sys_syscall+1f5 () | 
                crashtime = 1294925154
                panic-time = January 13, 2011 01:25:54 PM GMT GMT
        (end fault-list[0])

        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x4d2f0085 0x15b57ec0

Anyway, I hope you find this feature useful. I'm hoping to use the data embedded in the event for data mining and problem resolution. If you have any ideas of other information that could realistically be added to the ireport, then please let me know. However, you have to bear in mind this information is written while the system is panicking, so what can be reliably gathered is somewhat limited.



Thursday Dec 03, 2009

zfs deduplication

Inspired by reading Jeff Bonwick's blog, I decided to give it a go on my development gates. A lot of files are shared between clones of the gate, and even between builds, so hopefully I should get a saving in the number of blocks used in my project file system.

Being cautious, I am using an alternate boot environment created using beadm, and backing up my code using hg backup (a useful Mercurial extension included in the ON build tools).

I'm impressed. It works on a block level, rather than a file level, so the saving isn't directly proportional to the number of duplicate files, but you still get a significant saving, albeit at the expense of using more CPU: it needs to do a SHA-256 checksum comparison of the blocks to ensure they're really identical.

Enabling it is simply a case of

 $ pfexec zfs set dedup=on <pool-name> 

Though obviously you can do so much more. Jeff's blog (and the comments) are a goldmine of information about the subject.
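To see what the deduplication is actually buying you, the pool keeps a running dedup ratio you can query (a general ZFS property, nothing specific to my setup):

 $ zpool get dedupratio <pool-name>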



Wednesday Sep 30, 2009

CPU used up after an upgrade

Just a quick blog, so that hopefully when you google for this you will find something and not spend hours debugging it.

I just upgraded to OpenSolaris B123, and everything worked fine. However, when I logged out and back in again (something I rarely do, so it might have been present before), the machine was on its knees. As usual I started the debugging process

# prstat

showed svc.configd at the top

# truss -p `pgrep svc.configd`

 just showed it running door_return() a lot. This implies that something is talking to svc.configd.

On reflection I could have used DTrace on the door_call and door_return calls and gathered the execname of the process it was talking to, but before I got that far I spotted that my pid values were going up quickly. So clearly there was a process starting up, talking to svc.configd, exiting, and repeating.
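A quick way to see what keeps being executed in a situation like this is to watch exec events with DTrace. This one-liner isn't from the original debugging session, just the sort of thing I'd reach for:

# dtrace -n 'proc:::exec-success { trace(curpsinfo->pr_psargs); }'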

 After a few minutes digging around I found

desktop-print-management-applet calling ospm (OpenSolaris print manager), which was running svcprop.

A colleague then pointed me to http://defect.opensolaris.org/bz/show_bug.cgi?id=10863 which describes the problem and gives a workaround

Fortunately this should be fixed soon


Tuesday Sep 15, 2009

Debugging fmd plugins

This is really just a note to myself, as I keep forgetting the options. I'm developing some new plugins for fmd. When they don't work, there's loads of additional data you can get out which isn't there by default, and you can add debug print statements to your code like

fmd_hdl_debug(hdl, "Crash dump instance %ld\n",
    cdp->scd_panic_instance);

But these won't be visible unless you run fmd in debug mode. First you need to disable fmd

# svcadm disable fmd

Then you need to run fmd with the right options. Running it in the foreground also helps with starting and restarting it

# /usr/lib/fm/fmd/fmd -o fg=true -o client.debug=true

If you want to see what fmd itself is doing, add the -o debug=all flag

# /usr/lib/fm/fmd/fmd -o fg=true -o client.debug=true -o debug=all

 Then you see those lovely debug messages.
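When you've finished, kill the foreground fmd (Ctrl-C) and put the service back the way it was:

# svcadm enable fmd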

Thursday Jul 09, 2009

Virtualization Landing Page

I don't normally just post a link to a page of links, but in conversation with one of our doc writers today, she mentioned they'd put together a landing page for all our V12N products: http://docs.sun.com/source/821-0057/. I thought it had some quite interesting links, so I thought I'd share.

Chris

Wednesday Jun 10, 2009

Comparing dtrace output using meld

Comparing dtrace and other debug logs using meld

meld is a powerful open source graphical "diff" viewer. It is available from the OpenSolaris IPS repositories, so it can be installed from the Package Manager in OpenSolaris or simply by typing

   $ pfexec pkg install SUNWmeld

It is very clever at identifying the real changes within files and highlighting where the differences start and end.


This example uses the output of two dtrace runs, tracing receive network activity in a development version of the nge driver - once when it works, and once when it doesn't - trying to identify a bug.

First off the dtrace script is

   $ cat rx.d 
   #!/usr/sbin/dtrace -Fs
   fbt::nge_receive:entry
   {
       self->trace=1;
   }
   
   fbt::nge_receive:return
   /self->trace==1/
   {
       self->trace=0;
   }
   
   fbt:::entry
   /self->trace==1/
   {
       printf("%x",arg0);
   }
   
   fbt:::return
   /self->trace == 1/
   {
       printf("%x",arg0);
   }

This very simply traces all function calls from the nge_receive() function.


So I ran it twice, once when the network was working and once when it wasn't, and simply ran meld over the two files.

   $ meld rx.out rx.works.out

This throws up a GUI as seen here.

[Screenshot: meld comparing rx.out and rx.works.out]

It's worth loading full size. What you see on the right is a large area of probes that have fired which do not exist in the output on the left. That implies a lot of code is run in the working case that is missing from the failing case.

This is a picture of the source code of nge_receive()

[Screenshot: source of nge_receive()]

You can see it essentially does two things

   o Calls nge_recv_ring()
   o If that succeeds calls mac_rx()

Looking at the meld screenshot you can see the big green area starts at mac_rx. So in the failing case nge_receive() doesn't call mac_rx() (that'd explain why it fails to receive a packet).

Why doesn't it? Well, it implies that nge_recv_ring() returns NULL. nge_recv_ring() is supposed to return an mblk, and it hasn't. Why is that? Well, looking into the blue and red highlighted areas in the meld window, we see another area in the working case that is missing in the failing case. Hey presto, this bit is the call to allocb(), which is used to allocate an mblk.

So we know that in the failing case the nge_recv_ring() function fails to allocate an mblk. Now we just need to work out why.

I found this a powerful way of viewing complex data and quickly homing in on differences.

Monday Jun 01, 2009

It's Finally Here

I've decided I really need to get back to writing a blog occasionally, and what better day to choose than June 1, 2009. Why? Well, today we release OpenSolaris 2009.06, the latest open source release of our operating system, Solaris.

I know this all sounds a bit marketing, but actually there are some really good reasons for running OpenSolaris on your own machine.

First off, it is the most secure OS I know of. No need for virus protection.

Second, it just works (mostly). I've just got a new MacBook Pro, and I always find it easier to do development work on Solaris than any other platform, so I like to run OpenSolaris. It installed pretty much seamlessly (just having to change the EFI disk label using the macOS fdisk utility as described here). The only thing that doesn't work out of the box is the WiFi - which is a pain. It's a Broadcom chipset, so I've got hold of a PCI3/4 Atheros card which works well.

Third, all the development tools I need (and indeed that anyone developing for or on Solaris needs) are available within the standard repositories. I found this page which describes how I set up my laptop as a build machine.

From a day to day computing perspective it does everything I need. Mail, web and chat are all included, and OpenOffice is in the repositories for free (and simple) download. There's a new media player in Elisa (in the repo), though unfortunately you have to buy the codecs for many common video formats.

So the next question is, is it any different from 2008.11? Well it's hard for me to say, as I've been upgrading every few weeks to the latest development builds (by using the opensolaris.com/dev repository). But I did install it fresh inside a VirtualBox VM and was impressed with the speed of the install. The auto installer is now more complete and can install SPARC machines (necessary for a good proportion of our customers). There are networking improvements, but generally the speed and usability are what you'll notice.

Oh, and Fast Reboot. It makes it much quicker to shut down or reboot a machine.

Today I'm attending CommunityOne (or C1 as we call it) and much more will be discussed about OpenSolaris and all our other open source development efforts. I'll try to remember to write a blog about it (though don't hold your breath, on recent form :-).



