Monday Jun 22, 2009

Live Upgrade and TimeSlider gotcha

Tried to upgrade my workstation over the weekend to snv_117. Apart from a little tidying up I had to do as a package didn't install correctly, all appeared to be going fine. I then went to unmount /.alt.snv_117, and it failed saying that the filesystem was busy.

fuser -c showed no processes using the mount point. What could it be?

A little bit of dtracing the umount2() system call was illuminating.

  1              <- zfsctl_umount_snapshots           0                0
  1            <- zfs_umount                          0               16

Hang on, snapshots? Although it returned 0, let's just check; as I do have timeslider enabled on this box.

rootksh@vesvi:~$ zfs list -t snapshot|grep 117                                                                     
pool/ROOT/snv_116@snv_117                                   4.03M      -  8.78G  -
pool/ROOT/snv_117@zfs-auto-snap:hourly-2009-06-19-09:00     43.8M      -  7.99G  -
pool/ROOT/snv_117@zfs-auto-snap:hourly-2009-06-19-10:00     48.9M      -  8.44G  -
pool/ROOT/snv_117@zfs-auto-snap:hourly-2009-06-19-11:00     43.7M      -  8.74G  -
pool/ROOT/snv_117@zfs-auto-snap:frequent-2009-06-19-11:15   42.6M      -  8.75G  -
pool/ROOT/snv_117@zfs-auto-snap:frequent-2009-06-19-11:30   45.8M      -  8.76G  -
pool/ROOT/snv_117@zfs-auto-snap:frequent-2009-06-19-11:45   38.1M      -  8.77G  -
pool/ROOT/snv_117@zfs-auto-snap:hourly-2009-06-19-12:00     38.5M      -  8.80G  -
pool/ROOT/snv_117@zfs-auto-snap:daily-2009-06-22-00:00          0      -  8.80G  -
pool/ROOT/snv_117@zfs-auto-snap:weekly-2009-06-22-00:00         0      -  8.80G  -
pool/ROOT/snv_117@zfs-auto-snap:hourly-2009-06-22-10:00         0      -  8.80G  -
pool/ROOT/snv_117@zfs-auto-snap:frequent-2009-06-22-10:30       0      -  8.80G  -
pool/ROOT/snv_117@zfs-auto-snap:frequent-2009-06-22-10:45       0      -  8.80G  -

Oh, timeslider was taking snapshots of the filesystem while it was upgrading. Hmm, maybe we should disable that on the target of a live upgrade (RFE coming, but I don't hold out a lot of hope).

Anyway, removing them was not difficult:

rootksh@vesvi:~$ zfs list -t snapshot|grep snv_117@zfs-auto|awk '{print $1}' | xargs -L 1 zfs destroy
rootksh@vesvi:~$ luumount snv_117                                                                                  

Something to keep in mind if you are using timeslider, zfs root and live upgrade (I wonder if we would have the same issue with 'pkg image-update' in OpenSolaris).

Monday May 04, 2009

multithreaded processes and mdb

Today I had to look at a gcore of devfsadm. Most specifically I wanted to have a look at what the threads in cond_wait() were doing. I haven't done a lot with such stuff in userland before, so I thought it would make a good short blog topic on things that can be done.

First off we run up mdb

#  mdb /usr/sbin/devfsadm devfsadm.gcore
Loading modules: [ ]

Great, we got all the modules. So, what lwps have we got?

> $L
lwpids 1, 2, 3, 4, 5 and 6 are in core of process 135.

So we have six threads, let's have a look at the registers in the first one (note that this is on SPARC).

> 1::regs
%g0 = 0x00000000                 %l0 = 0x00000000 
%g1 = 0x0000001d                 %l1 = 0x00043748 
%g2 = 0x0003cb2c                 %l2 = 0xffbff8ac 
%g3 = 0x00038000                 %l3 = 0x00000001 
%g4 = 0x0003cb2c                 %l4 = 0x00000000 
%g5 = 0x00000000                 %l5 = 0x00000000 
%g6 = 0x00000000                 %l6 = 0x00000000 
%g7 = 0xff342a00                 %l7 = 0x00000001 
%o0 = 0xff342c40                 %i0 = 0x00000001 
%o1 = 0xff13b90c libc.so.1`pause+0x50 %i1 = 0x0003a2a4 
%o2 = 0xff1c3800 libc.so.1`_uberdata %i2 = 0xff342a00 
%o3 = 0x00000000                 %i3 = 0x00039954 
%o4 = 0xff342a00                 %i4 = 0x00016964 
%o5 = 0x00000000                 %i5 = 0x00000000 
%o6 = 0xffbff850                 %i6 = 0xffbff8b0 
%o7 = 0xff13b914 libc.so.1`pause+0x58 %i7 = 0x00015ce4 

 %psr = 0x00000044 impl=0x0 ver=0x0 icc=nzvc
                   ec=0 ef=0 pil=0 s=0 ps=64 et=0 cwp=0x4
  %y = 0x00000000
  %pc = 0xff14c160 libc.so.1`_pause+4
  %npc = 0xff14c164 libc.so.1`_pause+8
  %sp = 0xffbff850
  %fp = 0xffbff8b0                    

 %wim = 0x00000082
 %tbr = 0x00000000

Now to have a look at the stack we simply find the %sp value and use it with the stack dcmd.

> 0xffbff850::stack
0x15ce4(0, 43b48, 39db4, 4, 2276c, 38000)
main+0x358(0, 39f2c, ffbffdec, 398e4, 1, 38000)
_start+0x108(0, 0, 0, 0, 0, 0)

Note that this gives the stack frames above the current one, not the current frame itself. From the value of %pc above we can see where we are executing in the current frame. You can also see that the caller does not have an entry in the symbol table. Unfortunately, on Solaris 10, devfsadm has a lot of functions and variables declared static, which really does make debugging a pain. Fortunately this is not the case in Nevada/OpenSolaris.

Looking at the other lwps is as simple as listing the lwp id in front of the regs dcmd and repeating what we just did. I won't go into how I worked out which of the static routines we were executing in for the other lwps in cond_wait(), save to say that there are only a couple of places that make that call in the code, and matching up the assembly around the locations to the source (especially looking at called functions), makes this not too difficult.

Tuesday Mar 03, 2009

My CMT machine loads Oracle Databases slower than ..

This is more of an "Oh no not again" type post, ...

I am constantly amazed at the number of escalations that make it to the performance group with this as the problem.

It really is a case of unrealistic expectations and a lack of knowledge of what these machines excel at.

The most recent of these to cross my desk talks of a customer concerned that a dual core 2.5GHz x86 based box loads data into an Oracle database much faster than his shiny new T5220.

Until such time as Oracle makes their SQL Loader multi-threaded (which may bring in problems all of its own), this will always be the case.

The design of the system is such that it will run single threaded applications much slower than its x86 counterparts. These machines, however, come into their own once we enter production and start getting lots of parallel requests on the database. As we are running far more cpus, the load on the database must be much higher before we start to see any significant degradation.

The question that really should be asked here is, "Where do you want your performance? In the database load that you will do once, or in responding to production queries?"

Tuesday Nov 18, 2008

Even More DTrace Lab Answers (9 & 10)

OK, let's get the rest out.

Exercise 9

List the processes that are connecting to a specific port.

OK, we'll take that to mean we want to know the name and the process ID of connections to a particular port. This requires just a little network programming knowledge. We can see that the second argument (arg1) to connect(3SOCKET) is a sockaddr. The extra knowledge that we need is that to get a port, we need to cast it to a sockaddr_in and look at the port structure element. This bit of code expects the command line argument 1 to be a port number.

#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall::connect:entry {
	this->sock = (struct sockaddr_in *)copyin(arg1, arg2);
	self->port = this->sock->sin_port;
}

syscall::connect:entry /self->port == $1/ {
	printf("%5d %s\n", pid, execname);
}

syscall::connect:entry { self->port = 0; }

Exercise 10

Write a script to show where a specific system call, rename for example, is blocking and how long it is blocked for.

This one was a bit of fun to write, especially since the output format specifier for a stack is not documented and I had to find it in dt_printf.c.

We'll use the sched provider in this. While it would be nice to use the wakeup probe, it's not fired from the thread context so we won't have the thread local variable that we need to check and use. So, we just use on-cpu, which will tell us how long we were off cpu since we were told to block. Note that this also means that we are not tracking normal scheduling where it may have been pre-empted by another thread for whatever reason. This is what we want.

#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN { printf("Collecting...  ^C to continue\n"); }

syscall::rename:entry { self->interest = 1; }
sched:::sleep /self->interest/ { self->blocktime = timestamp; }
sched:::on-cpu /self->blocktime/ {
	this->taken = timestamp - self->blocktime;
	@[curthread, stack()] = quantize(this->taken);
	self->blocktime = 0;
}
syscall::rename:return { self->interest = 0; }

END { printa("Thread 0x%p %k %@d\n", @); }

This leaves us with only Exercise 11 to go. I'll try to get that up tomorrow.

Saturday Nov 15, 2008

More DTrace Lab Answers

OK, I promised more answers so here we go, ... (sorry, where files have copyright notices, I have to leave them there).

Exercise 2

Restrict the iosnoop.d to trace a specific process.

This solution will allow you to either specify a pid by appending '-p {pid}' or to run a command by appending '-c "command args ..."' to the command line, e.g. dtrace -s iosnoop.d -p 1234

/*
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License, Version 1.0 only
 * (the "License").  You may not use this file except in compliance
 * with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or
 *
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

/*
 * This D script is used as an example in the Solaris Dynamic Tracing Guide
 * wiki in the "io Provider" Chapter.
 *
 * The full text of this chapter may be found here:
 *
 * On machines that have DTrace installed, this script is available as
 * iosnoop.d in /usr/demo/dtrace, a directory that contains all D scripts
 * used in the Solaris Dynamic Tracing Guide.  A table of the scripts and
 * their corresponding chapters may be found here:
 *   file:///usr/demo/dtrace/index.html
 */

#pragma D option quiet

BEGIN
{
	printf("%10s %58s %2s\n", "DEVICE", "FILE", "RW");
}

io:::start
/pid == $target/
{
	printf("%10s %58s %2s\n", args[1]->dev_statname,
	    args[2]->fi_pathname, args[0]->b_flags & B_READ ? "R" : "W");
}

Exercise 3

Display the arguments for the rename(2) system call along with its return code.

This is longer than it needs to be as I like the idea of a single line of output for each time we have a rename() system call. I save the two arguments in thread local variables (self->source and self->destination). Note that the two arguments are strings and they are in user space when this probe fires, so we use copyinstr() to both get them into kernel context and make them printable strings. Note also that we clean up after ourselves in the return probe.

#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall::rename:entry {
	self->source = copyinstr(arg0);
	self->destination = copyinstr(arg1);
}

syscall::rename:return {
	printf("%s: %s -> %s returned %d\n", probefunc, self->source,
	    self->destination, arg1);
	self->source = self->destination = 0;
}

Exercise 4

Display the real and sys times (see timex(1)) for the syscall

I'm not completely happy with this as I saw some anomalous real time values running this on my notebook, but it should be about right. Note that we record the walltimestamp (nanoseconds since the epoch) and the vtimestamp (time on cpu in nanoseconds) and simply compare them in the return probe.

#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall:::entry {
	self->wall = walltimestamp;
	self->sys = vtimestamp;
}

syscall:::return /self->sys/ {
	printf("%s: %dns real, %dns system\n", probefunc,
            walltimestamp - self->wall,
            vtimestamp - self->sys);
	self->wall = self->sys = 0;
}

Exercise 5

Show the kernel function name that triggers the io:::start probe.

Many folks did this using stack(). There is a better way. This makes use of the fact that in the io provider, the probe* variables are defined.

$ pfexec dtrace -qn 'io:::start {printf("iostart:::probefunc called from %s\n",probefunc);}'
iostart:::probefunc called from bdev_strategy

You could also simply use the default action without -q as that will print the probe specification each time it fires, which will include the probefunc.

Exercise 6

Show the flow of kernel functions for a write system call.

Most folks put a -F in the arguments on line 1. The clearer (and more correct) way to do this in a script is to use the flowindent pragma.

#!/usr/sbin/dtrace -s

#pragma D option flowindent

syscall::write:entry { self->interest = 1; }
fbt::: /self->interest/ {}
syscall::write:return { self->interest = 0; }

Exercise 7

Show all lock events for mutexes that occur during a two second window.

The question hints that we may want to use an aggregation and summary rather than printing them as they happen.

#!/usr/sbin/dtrace -s

#pragma D option quiet

lockstat:::adaptive* { @[probename] = count(); }
tick-2s { exit(0); }

Exercise 8

Find the file to which most IO is being done.

I'm going to assume that the question means real IO, not IOs that hit the cache. In which case, we can use the answer to exercise 2 as a hint. Unfortunately, pretty much all of the IO that this tracks is being done by sched to flush buffers and we don't have any idea of the filename. I'll try to come up with a better solution.

#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN { printf("Collecting data ... ^C to finish.\n"); }
io:::start { @[args[2]->fi_pathname] = count(); }
END { trunc(@,1); printa(@); }

That's it for the moment, I'll finish them off later.

Friday Nov 14, 2008

DTrace Lab in the performance track at CEC

We have just been helping out in the DTrace lab for the performance track at CEC. On having a look at exercise 1, we suggested to the attendees that they attempt the other exercises and come back to this one. The exercise was:

Enhance /usr/demo/dtrace/iosnoop to be able to detect reads that are satisfied by the filesystem cache (UFS: pagecache).

We then went off to write our own versions of an answer to the problem. I started from scratch, and (referring back to a previous blog where I did some digging into this code) came up with the following:

#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN { printf("Collecting ...\n"); }

syscall::read*:entry {
        self->interest = 1;
        self->file = fds[arg0].fi_name;
        self->phys = 0;
}

/* we only increment this kstat if we have to go to disk, from a filesystem */

sysinfo:::bread /self->interest/ { self->phys = 1; }

syscall::read*:return /self->interest/ {
        this->str = self->phys
            ? "Physical reads"
            : "Cache Reads";
        @[this->str, self->file] = count();
        self->interest = 0;
}

END { printa("%@6d %s %s\n", @); }

On running this you get output like:

   209 Cache Reads jaxrpc-impl.jar
   211 Cache Reads appserv-admin.jar
   311 Cache Reads xalan.jar
   354 Cache Reads ttysrch
   372 Cache Reads jaxr-impl.jar
   373 Cache Reads auxv
   500 Cache Reads appserv-assemblytool.jar
  1216 Cache Reads 
  1324 Cache Reads appserv-rt.jar
  1392 Cache Reads clone@0:ptm
 83181 Cache Reads psinfo

Of course by the time I finished writing this, pretty much everything in the only ufs filesystem on the box had been cached :)

Wednesday Jul 23, 2008

What can you say?

I have been really slacking off with my blogging and really need to get back into it. What better way than with something amusing that happened today to a colleague.

He picked up a task today where the customer had the following issue and question. Unfortunately I don't recall the exact issue and patch number, but they are peripheral to the humour.

We've noticed that this problem occurs on a system with patch XXXXXX-02, but not on those with XXXXXX-04. Can you tell us if there is a patch or workaround to the problem?

How do you answer a question like that? An overseas colleague came up with the suggestion of "Sir, you really need a holiday".

My suggestion was more prosaic, that he simply look in the patch README for the bugs that were fixed between the -02 and -04 revision and reply "Yes, that was bug YYYYYYY which was fixed in XXXXXX-04", and try to keep a straight face.

Of course the really worrying thing about this whole incident, is that the current revision of the patch in question was -57!

Friday May 23, 2008

Yay for Sun Ray

As anyone who reads what I write here probably knows, I've spent the last 2 weeks in China, in two different offices in Beijing and one on Shanghai.

Some months back I installed Sun Ray server on my workstation in Sydney (as I prefer to run on relatively current nevada builds, rather than the Solaris 10 builds that IT Ops provides) and moved over to using a Sun Ray appliance both on my desk and at home.

I've been very happy with the ease of use of being able to transfer my work session home.

I've been even more happy with being able to transfer it to the various offices in China!

I just put my card in, it throws me to the Sydney ITops server that I would normally connect to, and I just utswitch from there to my own server. It all just works. What's more I've found working from Beijing very little different speedwise to working from home. It's all quite usable.


Tuesday Nov 06, 2007

The official melbourne cup site can't cope?

I simply have to tongue in cheek blog this. For those that don't know, Australia stops on the first Tuesday in November for the largest horse race in the country. The Melbourne Cup. Shortly after the race finished a colleague gathered the below screen shot from the official web site.

Do you think they might need to talk to Sun Australia about the new T2 boxes ( e.g. T5220 & T5120), Solaris 10 and Our Application Server ? :-)

Thursday Feb 15, 2007

more on the in.telnetd patches

As many folks have stated, Sun Alert 102802 and the patches are available on Sunsolve.

120068-02 SunOS 5.10_sparc: in.telnetd Patch
120069-02 SunOS 5.10_x86: in.telnetd Patch

I've had it pointed out to me that the patches are marked "Reboot after installation required". This is actually not the case and a bug has been logged to get the tags removed from the patch.

For what it is worth, I tested the fix by applying the patch while the systems were multiuser and the fix was immediate. I did not even have to restart the services. in.telnetd is fork/exec'd by inetd and is generally short lived. Adding the patch replaces the binary that is exec'd. You do not need to perform a reboot to get this patch installed.


Wednesday Nov 22, 2006

Some DTrace scripts I found useful last week

Last week I spent some time looking at applications that a customer was using to perform a data migration. It occurs to me that folks might be interested in a couple of the "one liner" type scripts that I found useful after turning them into 'stat' type tools. So, here they are.


#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
 * Count all user space function calls
 * $1 - time to run (eg 10s)
 * $2 - pid to monitor
 */

pid$2:::entry {
	@[probefunc] = count();}
tick-$1 {
	exit(0);}

#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
 * Count the syscalls a process is making as a stat tool
 * $1 time to wait (eg 10s)
 * $2 target pid
 */

syscall:::entry /pid == $2/ {
	@[probefunc] = count();}
tick-$1 {
	exit(0);}


#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
 * Quantize the time (in microseconds) the target process spends
 * in each syscall
 * $1 time to wait (eg 10s)
 * $2 target pid
 */

syscall:::entry /pid == $2/ {
	self->start = vtimestamp;}
syscall:::return /self->start/ {
	@[probefunc] = quantize((vtimestamp - self->start)/1000);
	self->start = 0;}
tick-$1 {
	exit(0);}


#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
 * Aggregate user stacks calling a function in user space
 * $1 - time to run (eg 10s)
 * $2 - pid to monitor
 * $3 - function to look for
 */

pid$2::$3:entry {
	@[ustack(30)] = count();}
tick-$1 {
	exit(0);}


Sunday Nov 12, 2006

Solaris ufs bug in Month of Kernel Bugs

Just noticed that Solaris has an entry in Month of Kernel bugs.

While I agree that we have an issue that needs looking at, I also believe that the contributor is making much more of it than it really deserves.

First off, to paraphrase the issue:

If I give you a specially massaged filesystem and can convince someone with the appropriate privilege to mount it, it will crash the system.

I'd hardly call this a "denial of service", let alone exploitable.

First off, in order to perform a mount operation of a ufs filesystem, you need the sys_mount privilege. In Solaris, we currently are running under the concept of "least privilege". That is, a process is given the least amount of privilege that it needs to run. So, in order to exploit this you need to convince someone with the appropriate level of privilege to mount your filesystem. This would also involve a bit of social engineering, which went unmentioned.

That being said, the system should not panic off this filesystem and I will log a bug to this effect. It is a shame that the contributor did not make the crashdump files available as it would certainly speed up any analysis.

One other thing that I should add is that anyone who tries to mount an unknown ufs filesystem without at least running "fsck -n" over it probably deserves what they get.

OK, I have copied it to a relatively current nevada system and mounted it as /dev/lofi/1. On running "fsck -n" we see:

** /dev/rlofi/1 (NO WRITE)







In the normal course of events, would you mount this filesystem? I certainly would not. This, however, is not the normal course of events and I'm playing on a lab system.

v40z-c# uname -a
SunOS v40z-c 5.11 snv_46 i86pc i386 i86pc

Let's try a read only mount first.

v40z-c# mount -r /dev/lofi/1 /mnt
v40z-c# ls /mnt
v40z-c# umount /mnt

OK, the read only mount is fine. Now the read/write, ... Bingo

v40z-c# mount /dev/lofi/1 /mnt

panic[cpu3]/thread=ffffffff9a6974e0: BAD TRAP: type=e (#pf Page fault) rp=fffffe8000c7f2c0 addr=fffffe80fe39d6c4

mount: #pf Page fault
Bad kernel fault at addr=0xfffffe80fe39d6c4
pid=2170, pc=0xfffffffffbb70950, sp=0xfffffe8000c7f3b0, eflags=0x10286
cr0: 8005003b cr4: 6f8
cr2: fffffe80fe39d6c4 cr3: 1ff76b000 cr8: c
fffffe8000c7f1b0 unix:die+b1 ()
fffffe8000c7f2b0 unix:trap+1528 ()
fffffe8000c7f2c0 unix:_cmntrap+140 ()
fffffe8000c7f440 ufs:alloccgblk+42f ()
fffffe8000c7f4e0 ufs:alloccg+473 ()
fffffe8000c7f560 ufs:hashalloc+50 ()
fffffe8000c7f600 ufs:alloc+14f ()
fffffe8000c7f6c0 ufs:lufs_alloc+f3 ()
fffffe8000c7f770 ufs:lufs_enable+261 ()
fffffe8000c7f7e0 ufs:ufs_fiologenable+63 ()
fffffe8000c7fd60 ufs:ufs_ioctl+3e0 ()
fffffe8000c7fdc0 genunix:fop_ioctl+3b ()
fffffe8000c7fec0 genunix:ioctl+180 ()
fffffe8000c7ff10 unix:sys_syscall32+101 ()

OK, so we should now have a crashdump to look at.

While the machine is rebooting, it occurs to me that if we put this ufs onto an external USB device, we might actually have an exploitable issue here, once the new hal/rmvolmgr framework is in place (nv_51) if we try to automatically mount ufs devices.

core file:      /var/crash/v40z-c/vmcore.0
release:        5.11 (64-bit)
version:        snv_46
machine:        i86pc
node name:      v40z-c
system type:    i86pc
hostid:         69e47dae
dump_conflags:  0x10000 (DUMP_KERNEL) on /dev/dsk/c1t1d0s1(517M)
time of crash:  Sun Nov 12 13:08:18 EST 2006
age of system:  34 days 1 hours 42 minutes 34.95 seconds
panic CPU:      3 (4 CPUs, 7.56G memory)
panic string:   BAD TRAP: type=e (#pf Page fault) rp=fffffe8000c7f2c0 addr=fffffe80fe39d6c4

sanity checks: settings...vmem...sysent...clock...misc...done

-- panic trap data  type: 0xe (Page fault)
  addr: 0xfffffe80fe39d6c4  rp: 0xfffffe8000c7f2c0
  savfp 0xfffffe8000c7f440  savpc 0xfffffffffbb70950
  %rbp  0xfffffe8000c7f440  %rsp  0xfffffe8000c7f3b0
  %rip  0xfffffffffbb70950  (ufs:alloccgblk+0x42f)

  0%rdi 0xffffffff8d60b000  1%rsi 0xffffffff8930c308  2%rdx               0xb5
  3%rcx               0xb5  4%r8  0xfffffe80fe39d6c0  5%r9              0x12f0

  %rax                 0x8  %rbx          0x361005a8
  %r10                   0  %r11  0xfffffffffbcd9ff0  %r12               0x5a8
  %r13  0xffffffff8930c000  %r14  0xffffffff8d60b000  %r15  0xffffffff99656c00
  %cs       0x28 (KCS_SEL)
  %ds       0x43 (UDS_SEL)
  %es       0x43 (UDS_SEL)
  %fs          0 (KFS_SEL)
  %gs      0x1c3 (LWPGS_SEL)
  %ss       0x30 (KDS_SEL)
  trapno     0xe (Page fault)
  err        0x2 (page not present,write,supervisor)
  %rfl   0x10286 (parity|negative|interrupt enable|resume)
  fsbase 0xffffffff80000000 gsbase 0xffffffff8901c800
-- switch to user thread's user stack --

The trap has occurred in alloccgblk+42f() in the ufs code.

ufs:alloccgblk+0x410            call   +0xe0a4  (ufs:clrblock+0x0)
ufs:alloccgblk+0x415            decl   0x1c(%r13)	; cgp->cg_cs.cs_nbfree--
ufs:alloccgblk+0x419            decl   0xc4(%r14)	; fs->fs_cstotal.cs_nbfree--
ufs:alloccgblk+0x420            movslq 0xc(%r13),%r8	; %r8 <- cgp->cg_cgx
ufs:alloccgblk+0x428            addq   0x2d8(%r14),%r8	; %r8 <- fs->fs_u.fs_csp[%r8]
ufs:alloccgblk+0x42f            decl   0x4(%r8) <-- panic here

We've just made a call to ufs:clrblock() and are decrementing something after a long list of pointer dereferencing. We only call clrblock() once in this routine, so that puts us at:

   428  #define fs_cs(fs, indx) fs_u.fs_csp[(indx)]

  1238          clrblock(fs, blksfree, (long)blkno);
  1239          /\*
  1240           \* the other cg/sb/si fields are TRANS'ed by the caller
  1241           \*/
  1242          cgp->cg_cs.cs_nbfree--;
  1243          fs->fs_cstotal.cs_nbfree--;
  1244          fs->fs_cs(fs, cgp->cg_cgx).cs_nbfree--;

Panicking on line 1244.

I should note at this point that the source I am quoting is from the same source tree as opensolaris.

After the macro expansion, it becomes

  1244          fs->fs_u.fs_csp[cgp->cg_cgx].cs_nbfree--;

So what is cgp->cg_cgx?

SolarisCAT(vmcore.0/11X)> sdump 0xffffffff8930c000 cg cg_cgx
   cg_cgx = 0x6f0000

This is probably a trifle on the largish side, which would explain how we have ended up in unmapped memory.

The address we end up with for the (struct csum *) is 0xfffffe80fe39d6c0

If we go back and look at fs->fs_ncg, we see that there were only two cylinder groups allocated. We have an obvious inconsistency.

Also, interestingly, this is not dying in the mount(2) system call. It's dying in a subsequent ioctl(2). This ioctl appears to be the one enabling ufs logging.

So how might we handle this?

Now, as the filesystem is already mounted and we are in a subsequent ioctl(), we can't fail the mount. Could we fail in alloccgblk() and have the error propagate back up to the process making the ioctl()?

Walking back up the stack, we see that alloccg() handles alloccgblk() returning 0. Now if alloccg() returns 0 to hashalloc(), in this instance we'll first try a quadratic rehash, which will also fail, so we'll fall through to the brute force search. As this starts at cylinder group 2 and there are only 2 cylinder groups, this will call alloccg() once and fail, falling through to return 0 to alloc(). Note that no matter how many times we end up in alloccgblk(), it has the same arguments, so it would fail the same way.

In alloc(), it notes that we did not get a block returned and assumes this is because some other thread grabbed the last block. It then treats the whole thing as if we ran out of space and returns ENOSPC to lufs_alloc(). lufs_alloc() catches this, frees up everything and returns the error (ENOSPC) to lufs_enable(), which in turn catches it, cleans up and returns it to ufs_fiologenable(), and the error is eventually passed back to user space. While not exactly the error we would have hoped for, the end result would be that logging would not be turned on and the system would not panic due to this corrupted filesystem.

I'll log this as a bug against ufs for Solaris 10 and nevada.


Update

I have logged CR 6492771 against this issue. The link for the CR should work some time in the next 24 hours, but the content as logged is pretty much a cut and paste from this blog entry.

Update 2

The bug I logged has been closed as a duplicate of

4732193 ufs will attempt to mount filesystem with blatantly-bad superblock


Tuesday Sep 05, 2006

Samba meets ZFS

I was browsing my RSS feeds today when I came across this post from my colleague Chris Gerhard.

Chris is in the process of revamping his home servers. He's already done some nice work with snapshotting on his notebook, but this one looks very nice as well.

The idea is to simply make a ZFS snapshot every time that the Samba user logs in. This way it is possible to go back to the beginning of a login session. OK, it's still a work in progress and it will almost certainly need some easy way to clean up the snapshots, but this is a very nice idea.

Good one Chris.

The application of ZFS that I'm waiting to try is the concept of running Windows under Xen, but using ZFS as the backing filesystem. The idea being that I can take a snapshot of the backing filesystem before I try applying Microsoft's patches and, when/if anything breaks, I'll have the option of backing out the entire patch set with a single zfs rollback command. The really nice thing here is that Windows shouldn't then even require a reboot, as Xen should be able to start off where I was when I took the snapshot initially.

Technorati Tags: ZFS

Tuesday Jul 11, 2006

A first look at Thumper

Jonathan has given a sneak preview of the new Thumper box. It's nice that this will be out shortly and we will be able to talk about it.

As you can see from what he writes, it's a pretty damned impressive unit. With up to 24 Terabytes (you read that correctly) of internal SATA disk and ZFS on top of that, ...

It was a nice touch apologising to the courier drivers for including (all 170 lbs of) it in the try & buy program.


Friday Jun 16, 2006

Will a faster cpu make my application faster?

I was recently involved in an escalation in which a customer had moved from one sparc platform to another and also moved to a faster release of Ultrasparc-IV than they had previously looked at.

It turns out that they actually saw their application slow down.

This is not as silly, nor as unusual as it may at first seem.

The actual platform migration was from a US-III workgroup server to a starcat class machine.

Now, there are some things to watch for in this type of migration as there are some major differences in the architecture. Most specifically, you are moving from a platform with a two tier memory architecture to one with a three tier architecture.

Now, in Solaris 9 we have some "new" bits that help us immensely here, bits that are simply not in Solaris 8 and cannot be backported.

These are Memory Placement Optimization (MPO) and Multiple Page Size Support (MPSS). MPO is the important one as it attempts to run programs on the same board that the memory exists on.

OK, that probably accounts for the slowdown. Why did we not see any improvement?

While I don't actually have the data on the previous system I have my suspicions.

On the Starcat box I am seeing large amounts of idle time. The immediate thought here is, is the application cpu bound or is something else the bottleneck?

If we already have lots of idle time on the previous box, then it's odds on that the cpu is not our bottleneck.

If, for example, we have threads that are doing a lot of I/O, and we haven't changed the I/O subsystem, then the time we spend waiting on the I/O is not going to change. If that is the limiting speed factor, then faster cpus are not going to help us.

The suggestion coming out of this is that before upgrading your hardware in order to speed up your applications, please have a look at the application to see exactly where the bottlenecks are. You may be pleasantly surprised to find that there are cheaper options to improving your application performance.

Some good *stat commands to start with would be vmstat, iostat and mpstat.

First off, try "vmstat 5". This will give us output like:

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr cd s0 -- --   in   sy   cs us sy id
 0 0 0 692416 239220 33 129 80  1  4  0 140 37 0  0  0  469 2255  996  5  7 88
 0 0 0 626860 175244  0   9  0  0  0  0  0 26  0  0  0  427  214  278  1  3 96
 0 0 0 626860 175244  0   0  0  0  0  0  0  0  0  0  0  371  160  193  0  3 97
 0 0 0 626860 175244  0   0  0  0  0  0  0  0  0  0  0  372  195  203  1  3 97

Have a look at the 'cpu' columns, specifically the user/system/idle split. Do we have idle time? Does the system time look excessive? These kinds of things are more complex to investigate, but can be looked into.
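To reduce a captured run to a single figure, the cpu columns can be averaged with a little awk. This is only a sketch over the sample output above (column positions as shown there); the first sample is skipped because it is an average since boot.

```shell
# Sketch: average the user/sys/idle columns of a captured "vmstat 5" run.
# The sample data is the vmstat output shown above.
vmstat_sample='
 0 0 0 692416 239220 33 129 80  1  4  0 140 37 0  0  0  469 2255  996  5  7 88
 0 0 0 626860 175244  0   9  0  0  0  0  0 26  0  0  0  427  214  278  1  3 96
 0 0 0 626860 175244  0   0  0  0  0  0  0  0  0  0  0  371  160  193  0  3 97
'
echo "$vmstat_sample" | awk '
NF > 0 && ++line > 1 {          # skip the first sample: average since boot
    us += $(NF-2); sy += $(NF-1); id += $NF; n++
}
END {
    if (n) printf "avg user %.1f%% sys %.1f%% idle %.1f%%\n", us/n, sy/n, id/n
}'
# prints: avg user 0.5% sys 3.0% idle 96.5%
```

Using $(NF-2) and friends rather than fixed column numbers keeps the sketch working even though the number of disk columns varies between boxes.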

It's also worth looking at the 'kthr' columns. These show:

  • r - # threads ready to run but not yet running (a count of the threads on the dispatch queues)
  • b - # threads that are blocked waiting for resources (eg I/O, paging, ...)
  • w - # swapped out lightweight processes that are waiting for processing resources to finish

Consistently high numbers in any of these are cause for concern. Specifically, consistently having threads in the dispatch queue is a sign that this box is probably cpu starved.
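That dispatch-queue observation is easy to automate. A hedged sketch follows: the sample numbers and the "more than half the samples" rule are mine, not any official guidance, and on a real box the cpu count would come from psrinfo.

```shell
# Sketch: flag probable cpu starvation from the r (run queue) column of
# captured "vmstat 5" output. The r values below are made up for
# illustration; on a real box NCPU would come from: psrinfo | wc -l
NCPU=1
runq_sample='
0
4
5
'
echo "$runq_sample" | awk -v ncpu="$NCPU" '
NF > 0 {
    samples++
    if ($1 > ncpu) starved++    # more runnable threads than cpus
}
END {
    printf "%d of %d samples had r > %d cpu(s)\n", starved, samples, ncpu
    if (starved * 2 > samples) print "probably cpu starved"
}'
# prints: 2 of 3 samples had r > 1 cpu(s)
#         probably cpu starved
```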

On to mpstat. This shows the following columns (I'm running this on a single-cpu notebook; on a multi-cpu machine you'd see a line per cpu).

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  115   6    0   460  356  920  249    0    0    0  2068    5   6   0  89
  0    2   0    0   412  310  250    2    0    0    0   160    0   3   0  96
  0    0   0    0   375  273  198    2    0    0    0   153    1   3   0  97

Again we have the user/system/idle split, but now on a per-cpu basis. High numbers of icsw (involuntary context switches) on a particular cpu indicate that threads are being forced off that cpu, often because it is servicing a lot of interrupts. This can have incredibly detrimental effects on applications trying to use that cpu. It may be worthwhile considering either processor sets, or using psradm to keep I/O interrupts off that cpu.

High numbers in 'migr' (thread migrations) can also be detrimental, as we end up invalidating cached data on one cpu and reloading it on another. Binding processes to particular cpus might help here.
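As an illustration only (the per-sample thresholds of 100 are my assumption, not a Sun recommendation), per-cpu averages of icsw and migr can be pulled out of a captured mpstat run like this:

```shell
# Sketch: average icsw (field 8) and migr (field 9) per cpu from captured
# "mpstat 5" output. The two sample lines are fabricated so that cpu 0
# shows heavy involuntary context switching and cpu 1 heavy migration.
mpstat_sample='
  0    2   0    0   412  310  250  500    0    0    0   160    0   3   0  96
  1    0   0    0   375  273  198    2  300    0    0   153    1   3   0  97
'
echo "$mpstat_sample" | awk '
NF > 0 {
    icsw[$1] += $8; migr[$1] += $9; n[$1]++
}
END {
    for (cpu = 0; cpu <= 1; cpu++) {    # cpu ids present in the sample
        if (!n[cpu]) continue
        printf "cpu %d: avg icsw %.0f, avg migr %.0f\n",
               cpu, icsw[cpu]/n[cpu], migr[cpu]/n[cpu]
        if (icsw[cpu]/n[cpu] > 100) printf "cpu %d: icsw looks hot\n", cpu
        if (migr[cpu]/n[cpu] > 100) printf "cpu %d: migr looks hot\n", cpu
    }
}'
```

On a live system the follow-up would be things like psrset (processor sets), pbind (binding a process to a cpu), or psradm -i (keeping I/O interrupts off a cpu), as mentioned above.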

iostat is a good way to see how the I/O subsystem is running. I generally use something like the following command:

$ iostat -xnz 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   20.6    8.9 1281.1   53.1  1.8  0.3   62.1   11.2   9  21 c0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   19.2    0.0   52.0  0.0  0.0    0.1    1.0   0   2 c0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device

The -z argument tells iostat not to print lines of zeros (hence the multiple headers with no data).

Most of the columns are self explanatory. I'm generally interested in 'asvc_t' (active service time), which generally equates to 'time on the wire', or how long the device takes to service a request. '%w' shows the percentage of time there were I/O requests waiting for this device, and '%b' gives an indication of how busy the device is. Note that '%b' assumes the device can only service one request at a time, which is obviously not the case for arrays.

High '%w' numbers are an indication of a lot of I/O queueing up. High 'asvc_t' numbers may indicate a problem with the storage device; on current hardware you should generally be seeing times of the order of 1-10ms in this column. The figures above were taken on my notebook, which has a slower IDE drive that had had a bit of a workout before I ran the stats.
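That asvc_t rule of thumb can be applied mechanically to a captured run. A sketch, with a cut-off of 20ms that is my own illustrative choice, to be tuned for your storage:

```shell
# Sketch: flag devices whose active service time looks high in captured
# "iostat -xnz 5" output. Data lines have 11 fields; the c1d0 line is a
# made-up second device added so that something trips the threshold.
iostat_sample='
   20.6    8.9 1281.1   53.1  1.8  0.3   62.1   11.2   9  21 c0d0
    0.0   19.2    0.0   52.0  0.0  0.0    0.1   35.0   0   2 c1d0
'
echo "$iostat_sample" | awk '
NF == 11 {
    # fields: r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
    if ($8 > 20)
        printf "%s: asvc_t %.1f ms - worth investigating\n", $11, $8
}'
# prints: c1d0: asvc_t 35.0 ms - worth investigating
```

The NF == 11 guard conveniently skips the repeated header lines that iostat emits.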

You should also be aware that in pretty much all of the *stat commands, the first output is an average since boot. While that's good for a feel of the system average, it's not much use when trying to characterise a particular time period.

Basically, a little bit of analysis can save you a considerable amount of money and angst. I'd be failing in my role as a "trusted adviser" to recommend any other course. Yes, sure, we'd like the money from customers buying more hardware, but for myself and many others it's far more important to have a happy, returning customer than an unhappy one who is likely to look elsewhere for their next purchase after such an experience.



* - Solaris and Network Domain, Technical Support Centre

Alan is a kernel and performance engineer based in Australia who tends to have the nasty calls gravitate towards him

