Friday Aug 14, 2009

Why I hate macros that make pointer dereferences look like structure elements

I have a colleague who generated an IDR patch for tcp in Solaris 10 for me to give relief to a customer for a bug while the formal fix was in progress.

As a part of the fix we had this code fragment

 18984          /\*
 18985           \* If the SACK option is set, delete the entire list of
 18986           \* notsack'ed blocks.
 18987           \*/
 18988          if (tcp->tcp_sack_info != NULL) {
 18989                  if (tcp->tcp_notsack_list != NULL)
 18990                          TCP_NOTSACK_REMOVE_ALL(tcp->tcp_notsack_list, tcp);
 18991          }

replaced with this code fragment (the fix actually has a lot more to it than this, but this is what was relevent.)

 18936          /\*
 18937           \* If the SACK option is set, delete the entire list of
 18938           \* notsack'ed blocks.
 18939           \*/
 18941          if (tcp->tcp_notsack_list != NULL)
 18942                  TCP_NOTSACK_REMOVE_ALL(tcp->tcp_notsack_list, tcp);

Now, the assembly around here reads

ip:tcp_process_shrunk_swnd+0x38:       ldx      [%i0 + 0xf8], %g1
ip:tcp_process_shrunk_swnd+0x3c:       add      %g3, %i1, %g2
ip:tcp_process_shrunk_swnd+0x40:       stw      %g2, [%i0 + 0x80]
ip:tcp_process_shrunk_swnd+0x44:       ldx      [%g1 + 0x48], %i5

and the register dump

pc:  0x7bed2918 ip:tcp_process_shrunk_swnd+0x44:   ldx  [%g1 + 0x48], %i5
npc: 0x7bed291c ip:tcp_process_shrunk_swnd+0x48:   subcc  %i5, 0x0, %g0   ( cmp   %i5, 0x0 )
  global:                       %g1                  0
        %g2             0x761c  %g3             0x68ec
        %g4      0x600216f6e6c  %g5                  0
        %g6               0x1c  %g7      0x2a101f89ca0
  out:  %o0      0x600210d8640  %o1              0x1e0
        %o2              0x5a8  %o3              0x3c8
        %o4      0x600216f6e6c  %o5      0x600216f68c4
        %sp      0x2a101f88c61  %o7         0x7bed2900
  loc:  %l0         0xc7341c85  %l1             0x2000  
        %l2      0x60010972000  %l3      0x600210d8640  
        %l4             0x1000  %l5             0x1000  
        %l6             0x1000  %l7                0x5  
  in:   %i0      0x600210d8640  %i1              0xd30  
        %i2                  0  %i3         0xc73439b5  
        %i4                  0  %i5                0x4  
        %fp      0x2a101f88d11  %i7         0x7becbed4  

The last instruction is where we paniced (yes the customer paniced [twice] as a result of this). As we can see from the register dump, %g1 is NULL, so we definitely have a NULL pointer dereference going on.

So where did this come from? It looks like a dereference of 0xf8 from %i0. %i0 is a (tcp_t \*) making %g1 a (tcp_sack_info \*), namely arg0->tcp_sack_info if we look at the structure; but hang on, the code says tcp->tcp_notsack_list, not tcp->tcp_sack_info. Indeed that element name does not exist within a tcp_t.

A light dawns when we see that:

   299  #define tcp_notsack_list	        tcp_sack_info->tcp_notsack_list

So in reality line 18941 is doing:

 18941          if (tcp->tcp_sack_info->tcp_notsack_list != NULL)

Without checking whether or not tcp->tcp_sack_info is non-NULL. The correct line should perhaps read

 18941          if (tcp->tcp_sack_info != NULL && tcp->tcp_notsack_list != NULL)

Now this would probably not have made it as far as in IDR patch delivered to a customer, if we didn't have that macro definition because alarm bells would have rung that we were doing another dereference!

Thursday Jul 30, 2009

Interim fixes for Bind Vulnerability VU#725188/CVE-2009-0696 (Updated)

Yesterday I noticed an article titled New DoS Vulnerability in All versions of BIND 9 on slashdot. The article refers to BIND Dynamic Update DoS at the ISC site describing Vulnerability Note VU#725188 - ISC BIND 9 vulnerable to denial of service via dynamic update request.

This very rapidly caused a stir on a few internal mailing lists that I'm on and work on addressing this as

        6865903 Updated, P1 network/dns CVE-2009-0696 BIND dynamic update problem

The current status of this within Sun is that the Interim Security Reliefs (ISR) are available from for the following releases:

SPARC Platform

  • Solaris 10 IDR142522-01
  • Solaris 9 IDR142524-01

x86 Platform:

  • Solaris 10 IDR142523-01
  • Solaris 9 IDR142525-01

Sun Alert 264828 is on its way to be published. When published it will be available from:

The fix is planned for build 121 for OpenSolaris/Nevada and we're attempting to get it into the next possible release Support Repository Update (SRU3).

Update 1

It turns out that the Solaris 9 ISR patches rely on an unreleased patch for Solaris 9. Work is underway to get this dependency out quickly,

Monday Jun 22, 2009

Live Upgrade and TimeSlider gotcha

Tried to upgrade my workstation over the weekend to snv_117. Apart from a little tridying up I had to do as a package didn't install correctly, all apeared to be going fine. I then went to unmount /.alt.snv_117, and it failed saying that the filesystem was busy.

fuser -c showed no processes using the mount point. What could it be?

A little bit of dtracing the umount2() system call was illuminating.

  1              <- zfsctl_umount_snapshots           0                0
  1            <- zfs_umount                          0               16

Hang on, snapshots? Although it returned 0, let's just check; as I do have timeslider enabled on this box.

rootksh@vesvi:~$ zfs list -t snapshot|grep 117                                                                     
pool/ROOT/snv_116@snv_117                                   4.03M      -  8.78G  -
pool/ROOT/snv_117@zfs-auto-snap:hourly-2009-06-19-09:00     43.8M      -  7.99G  -
pool/ROOT/snv_117@zfs-auto-snap:hourly-2009-06-19-10:00     48.9M      -  8.44G  -
pool/ROOT/snv_117@zfs-auto-snap:hourly-2009-06-19-11:00     43.7M      -  8.74G  -
pool/ROOT/snv_117@zfs-auto-snap:frequent-2009-06-19-11:15   42.6M      -  8.75G  -
pool/ROOT/snv_117@zfs-auto-snap:frequent-2009-06-19-11:30   45.8M      -  8.76G  -
pool/ROOT/snv_117@zfs-auto-snap:frequent-2009-06-19-11:45   38.1M      -  8.77G  -
pool/ROOT/snv_117@zfs-auto-snap:hourly-2009-06-19-12:00     38.5M      -  8.80G  -
pool/ROOT/snv_117@zfs-auto-snap:daily-2009-06-22-00:00          0      -  8.80G  -
pool/ROOT/snv_117@zfs-auto-snap:weekly-2009-06-22-00:00         0      -  8.80G  -
pool/ROOT/snv_117@zfs-auto-snap:hourly-2009-06-22-10:00         0      -  8.80G  -
pool/ROOT/snv_117@zfs-auto-snap:frequent-2009-06-22-10:30       0      -  8.80G  -
pool/ROOT/snv_117@zfs-auto-snap:frequent-2009-06-22-10:45       0      -  8.80G  -

Oh, timeslider was taking snapshots of the filesystem while it was upgrading. Hmm maybe we should be having that disabled on the target of a live upgrade (rfe coming, but I don't hold out a lot of hope).

Anyway, removing them was not difficult:

rootksh@vesvi:~$ zfs list -t snapshot|grep snv_117@zfs-auto|awk '{print $1}' | xargs -L 1 zfs destroy
rootksh@vesvi:~$ luumount snv_117                                                                                  

Something to keep in mind if you are using timeslider, zfs root and live upgrade (I wonder if we would have the same issue with 'pkg image-update' in OpenSolaris).

Monday May 04, 2009

multithreaded processes and mdb

Today I had to look at a gcore of devfsadm. Most specifically I wanted to have at what the threads in cond_wait() were doing. I haven't done a lot with such stuff in userland before so thought it would make a good short blog topic on things that can be done.

First off we run up mdb

#  mdb /usr/sbin/devfsadm devfsadm.gcore
Loading modules: [ ]

Great, we got all the modules. So, what lwps have we got?

> $L
lwpids 1, 2, 3, 4, 5 and 6 are in core of process 135.

So we have six threads, let's have a look at the registers in first one (note that this is on SPARC).

> 1::regs
%g0 = 0x00000000                 %l0 = 0x00000000 
%g1 = 0x0000001d                 %l1 = 0x00043748 
%g2 = 0x0003cb2c                 %l2 = 0xffbff8ac 
%g3 = 0x00038000                 %l3 = 0x00000001 
%g4 = 0x0003cb2c                 %l4 = 0x00000000 
%g5 = 0x00000000                 %l5 = 0x00000000 
%g6 = 0x00000000                 %l6 = 0x00000000 
%g7 = 0xff342a00                 %l7 = 0x00000001 
%o0 = 0xff342c40                 %i0 = 0x00000001 
%o1 = 0xff13b90c`pause+0x50 %i1 = 0x0003a2a4 
%o2 = 0xff1c3800`_uberdata %i2 = 0xff342a00 
%o3 = 0x00000000                 %i3 = 0x00039954 
%o4 = 0xff342a00                 %i4 = 0x00016964 
%o5 = 0x00000000                 %i5 = 0x00000000 
%o6 = 0xffbff850                 %i6 = 0xffbff8b0 
%o7 = 0xff13b914`pause+0x58 %i7 = 0x00015ce4 

 %psr = 0x00000044 impl=0x0 ver=0x0 icc=nzvc
                   ec=0 ef=0 pil=0 s=0 ps=64 et=0 cwp=0x4
  %y = 0x00000000
  %pc = 0xff14c160`_pause+4
  %npc = 0xff14c164`_pause+8
  %sp = 0xffbff850
  %fp = 0xffbff8b0                    

 %wim = 0x00000082
 %tbr = 0x00000000

Now to have a look at the stack we simply find the %sp value and use it with the stack dcmd.

> 0xffbff850::stack
0x15ce4(0, 43b48, 39db4, 4, 2276c, 38000)
main+0x358(0, 39f2c, ffbffdec, 398e4, 1, 38000)
_start+0x108(0, 0, 0, 0, 0, 0)

Note that this gives the stack frames above the current and not the current. From the value of %pc above we can see where we are executing in the current frame. You can also see that we the caller does not have an entry in the symbol table. Unfortunately, on Solaris 10, devfsadm has a lot of functions and variables declared as static, which really does make debugging a pain. Fortunately this is not the case in Nevada/OpenSolaris.

Looking at the other lwps is as simple as listing the lwp id in front of the regs dcmd and repeating what we just did. I won't go into how I worked out which of the static routines we were executing in for the other lwps in cond_wait(), save to say that there are only a couple of places that make that call in the code, and matching up the assembly around the locations to the source (especially looking at called functions), makes this not too difficult.

Tuesday Mar 03, 2009

My CMT machine loads Oracle Databases slower than ..

This is more of an "Oh no not again" type post, ...

I am constantly amazed at the number of escalations that make it to the performance group with this as the problem.

It really is a case of an unrealistic expectation and knowledge of what the machines excel at.

The most recent of these to cross my desk talks of a customer concerned that a dual core 2.5GHz x86 based box loads data into an Oracle database much faster than his shiny new T5220.

Until such a time as Oracle makes their SQL Loader run multi-threaded (which may bring in problems all of their own) this will always be the case.

The design of the system is such that it will run single threaded applications much slower than the x86 counterparts. These machines, however, come into their own once we enter production and start getting lots of parallel requests on the database. As we are running far more cpus, the load on the database must be much higher before we start to see any significant degradation.

The question that really should be asked here is, "Where do you want your performance? In the database load that you will do once, or in responding to production queries?"

Tuesday Nov 18, 2008

Even More DTrace Lab Answers (9 & 10)

OK, let's get the rest out.

Exercise 9

List the processes that are connecting to a specific port.

OK, we'll take that to mean we want to know the name and the process ID of connections to a particular port. This requires just a little network programming knowledge. We can see that the second argument (arg1) to connect(3SOCKET) is a sockaddr. The extra knowledge that we need is that to get a port, we need to cast it to a sockaddr_in and look at the port structure element. This bit of code expects the command line argument 1 to be a port number.

#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall::connect:entry {
	this->sock = (struct sockaddr_in \*)copyin(arg1, arg2);
	self->port = this->sock->sin_port;
syscall::connect:entry /self->port == $1/ {
	 printf("%5d %s\\n", pid, execname); }

syscall::connect:entry { self->port = 0; }

Exercise 10

Write a script to show where a specific system call, rename for example, is blocking and how long it is blocked for.

This one was a bit of fun to write, especially since the output format specifier for a stack is not documented and I had to find it in dt_printf.c.

We'll use the sched provider in this. While it would be nice to use the wakeup probe, it's not fired from the thread context so we won't have the thread local variable that we need to check and use. So, we just use on-cpu, which will tell us how long we were off cpu since we were told to block. Note that this also means that we are not tracking normal scheduling where it may have been pre-empted by another thread for whatever reason. This is what we want.

#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN { printf("Collecting...  \^C to continue\\n"); }

syscall::rename:entry {self->interest = 1; }
sched:::sleep /self->interest/ { self->blocktime = timestamp; }
sched:::on-cpu /self->blocktime/ {
	this->taken = timestamp - self->blocktime; 
	@[curthread,stack()] = quantize(this->taken);
	self->blocktime = 0;
syscall::rename:return { self->interest = 0; }

END { printa("Thread 0x%p %k %@d\\n", @); }

This leaves us with only Exercise 11 to go. I'll try to get that up tomorrow.

Saturday Nov 15, 2008

More DTrace Lab Answers

OK, I promised more answers so here we go, ... (sorry, where files have copyright notices, I have to leave them there).

Exercise 2

Restrict the iosnoop.d to trace a specific process.

This solution will allow you to either specify a pid by appending '-p {pid}' or to run a command by appending '-c "command args ..." to the command line. eg dtrace -s iosnoop.d -p 1234

 \* The contents of this file are subject to the terms of the
 \* Common Development and Distribution License, Version 1.0 only
 \* (the "License").  You may not use this file except in compliance
 \* with the License.
 \* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 \* or
 \* See the License for the specific language governing permissions
 \* and limitations under the License.
 \* When distributing Covered Code, include this CDDL HEADER in each
 \* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 \* If applicable, add the following below this CDDL HEADER, with the
 \* fields enclosed by brackets "[]" replaced with your own identifying
 \* information: Portions Copyright [yyyy] [name of copyright owner]
 \* Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
 \* Use is subject to license terms.
 \* This D script is used as an example in the Solaris Dynamic Tracing Guide
 \* wiki in the "io Provider" Chapter.
 \* The full text of the this chapter may be found here:
 \* On machines that have DTrace installed, this script is available as
 \* iosnoop.d in /usr/demo/dtrace, a directory that contains all D scripts
 \* used in the Solaris Dynamic Tracing Guide.  A table of the scripts and their
 \* corresponding chapters may be found here:
 \*   file:///usr/demo/dtrace/index.html

#pragma D option quiet

	printf("%10s %58s %2s\\n", "DEVICE", "FILE", "RW");

/pid == $target/
	printf("%10s %58s %2s\\n", args[1]->dev_statname,
	    args[2]->fi_pathname, args[0]->b_flags & B_READ ? "R" : "W");

Exercise 3

Display the arguments for the rename(2) system call along with its return code.

This is longer than it needs to be as I like the idea of a single line of output for each time the we have a rename() system call. I save the two arguments in thread local variables (self->source and self->destination). Note that the two arguments are strings and they are in user space when this probe fires, so we use copyinstr() to both get them into kernel context and make them printable strings. Note also that we clean up after ourselves in the return probe.

#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall::rename:entry {
	self->source = copyinstr(arg0);
	self->destination = copyinstr(arg1);
syscall::rename:return {
	printf("%s: %s -> %s returned %d\\n", probefunc, self->source,
	    self->destination, arg1);
	self->source = self->destination = 0;

Exercise 4

Display the real and sys times (see timex(1)) for the syscall

I'm not completely happy with this as I saw some anomalous real time values running this on my notebook, but it should be about right. Note that we record a walltimestamp (seconds since epoch) and the vtimestamp (time on cpu in nanoseconds) and simply compare them in the return probe.

#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall:::entry {
	self->seconds = walltimestamp;
	self->sys = vtimestamp;
syscall:::return /self->sys/ {
	printf("%s: %d real seconds, %dns system\\n",
            walltimestamp - self->seconds,
            vtimestamp - self->sys);
	self->seconds = self->sys = 0;

Exercise 5

Show the kernel function name that triggers the io:::start probe.

Many folks did this using stack(). There is a better way. This makes use of the fact that in the io provider, the probe\* values are defined.

$ pfexec dtrace -qn 'io:::start {printf("iostart:::probefunc called from %s\\n",probefunc);}'
iostart:::probefunc called from bdev_strategy

You could also simply use the default action without -q as that will print the probe specification each time it fires, which will include the probefunc.

Exercise 6

Show the flow of kernel functions for a write system call.

Most folks put a -F in the arguments on line 1. The clearer (and more correct) way to do this in a script is to use the pragma (highlighted).

#!/usr/sbin/dtrace -s

#pragma D option flowindent

syscall::write:entry { self->interest = 1; }
fbt::: /self->interest/ {}
syscall::write:return { self->interest = 0; }

Exercise 7

Show all lock events for mutexes that occur during two second.

The question hints that we may want to use an aggregation and summary rather than printing them as they happen.

#!/usr/sbin/dtrace -s3

#pragma D option quiet

lockstat:::adaptive\* {@[probename] = count();}
tick-2s {exit(0);}

Exercise 8

Find file to which most IO is being done.

I'm going to assume that the question means real IO, not IOs that hit the cache. In which case, we can use the answer to exercise 2 as a hint. Unfortunately, pretty much all of the IO that this tracks is being done by sched to flush buffers and we don't have any idea of the filename. I'll try to come up with a better solution.

#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN { printf("Collecting data ... \^C to finish.\\n"); }
io:::start { @[args[2]->fi_pathname] = count(); }
END { trunc(@,1); printa(@); }

That's it for the moment, I'll finish them off later.

Friday Nov 14, 2008

DTrace Lab in the performance track at CEC

We have just been helping out in the DTrace lab for the performance track at CEC. On having a look at exercise 1, we suggested to the attendees that they attempt the other exercises and come back to this one. The exercise was:

Enhance /usr/demo/dtrace/iosnoop to be able to detect reads that are satisfied by the filesystem cache (UFS: pagecache).

We then went off to write our own versions of an answer to the problem.I started from scratch, and (referring back to a previous blog where I did some digging into this code), came up with the following:

#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN { printf("Collecting ...\\n"); }

syscall::read\*:entry {
        self->interest = 1;
        self->file = fds[arg0].fi_name;
        self->phys = 0;

/\* we only increment this kstat if we have to go to disk, from a filesystem \*/

sysinfo:::bread /self->interest/ { self->phys = 1;}

syscall::read\*:return /self->interest/ {
        this->str = self->phys
            ? "Physical reads"
            : "Cache Reads";
        @[this->str, self->file] = count();
        self->interest = 0;

END { printa("%@6d %s %s\\n", @); }

On running this you get output like:

   209 Cache Reads jaxrpc-impl.jar
   211 Cache Reads appserv-admin.jar
   311 Cache Reads xalan.jar
   354 Cache Reads ttysrch
   372 Cache Reads jaxr-impl.jar
   373 Cache Reads auxv
   500 Cache Reads appserv-assemblytool.jar
  1216 Cache Reads 
  1324 Cache Reads appserv-rt.jar
  1392 Cache Reads clone@0:ptm
 83181 Cache Reads psinfo

Of course by the time I finished writing this, pretty much everything in the only ufs filesystem on the box had been cached :)

Wednesday Jul 23, 2008

What can you say?

I have been really slacking off with my blogging and really need to get back into it. What better way than with something amusing that happened today to a colleague.

He picked up an task today where the customer had the following issue and question. Unfortunately I don't recall the exact issue and patch number but they are perepheral to the humour.

We've noticed that this problem occurs on a system with patch XXXXXX-02, but not on those with XXXXXX-04. Can you tell us if there is a patch or workaround to the problem?

How do you answer a question like that? An overseas colleague came up with the suggestion of "Sir, you really need a holiday".

My suggestion was more prosaic, that he simply look in the patch README for the bugs that were fixed between the -02 and -04 revision and reply "Yes, that was bug YYYYYYY which was fixed in XXXXXX-04", and try to keep a straight face.

Of course the really worrying thing about this whole incident, is that the current revision of the patch in question was -57!

Friday May 23, 2008

Yay for Sun Ray

As anyone who reads what I write here probably knows, I've spent the last 2 weeks in China, in two different offices in Beijing and one on Shanghai.

Some months back I installed Sun Ray server on my workstation in Sydney (as I prefer to run on relatively current nevada builds, rather than the Solaris 10 builds that IT Ops provides) and moved over to using a Sun Ray appliance both on my desk and at home.

I've been very happy with the ease of use of being able to transfer my work session home.

I've been even more happy with being able to transfer it to the various offices in China!

I just put my card in, it throws me to the Sydney ITops server that I would normally connect to, and I just utswitch from there to my own server. It all just works. What's more I've found working from Beijing very little different speedwise to working from home. It's all quite usable.

Technorati Tags: , ,

Tuesday Nov 06, 2007

The official melbourne cup site can't cope?

I simply have to tongue in cheek blog this. For those that don't know, Australia stops on the first Tuesday in November for the largest horse race in the country. The Melbourne Cup. Shortly after the race finished a colleague gathered the below screen shot from the official web site.

Do you think they might need to talk to Sun Australia about the new T2 boxes ( e.g. T5220 & T5120), Solaris 10 and Our Application Server ? :-)

Thursday Feb 15, 2007

more on the in.telnetd patches

As many folks have stated. Sun Alert 102802 and the patches are available on Sunsolve.

120068-02 SunOS 5.10_sparc: in.telnetd Patch
120069-02 SunOS 5.10_x86: in.telnetd Patch

I've had it pointed out to me that the patches are marked "Reboot after installation required". This is actually not the case and a bug has been logged to get the tags removed from the patch.

For what it is worth, I tested the fix by applying the patch while the systems were multiuser and the fix was immediate. I did not even have to restart the services. in.telnetd is fork/exec'd by ineted. It's generally short lived. Adding the patch replaces the binary that is exec'd. You do not need to perform a reboot to get this patch installed.

Technorati Tags: , , ,

Wednesday Nov 22, 2006

Some DTrace scripts I found useful last week

Last week I spent some time looking at applications that a customer was using to perform a data migration. It occurs to me that folks might be interested in a couple of the "one liner" type scripts that I found useful after turning them into 'stat' type tools. So, here they are.


#!/usr/sbin/dtrace -s

#pragma D option quiet

 \* Count all user space function calls
 \* $1 - time to run (eg 10s)
 \* $2 - pid to monitor

pid$2:::entry {
	@[probefunc] = count();}
tick-$1 {


#!/usr/sbin/dtrace -s

#pragma D option quiet

 \* Count the syscalls a process is making as a stat tool
 \* $1 time to wait (eg 10s)
 \* $2 target pid

syscall:::entry /pid == $2/ {
	@[probefunc] = count();}
tick-$1 {


#!/usr/sbin/dtrace -s

#pragma D option quiet

 \* Count the syscalls a process is making as a stat tool
 \* $1 time to wait (eg 10s)
 \* $2 target pid

syscall:::entry /pid == $2/ {
	self->start = vtimestamp;}
syscall:::return /self->start/ {
	@[probefunc] = quantize((vtimestamp - self->start)/1000);}
tick-$1 {


#!/usr/sbin/dtrace -s

#pragma D option quiet
 \* Aggregate user stacks calling a function in user space
 \* $1 - time to run (eg 10s)
 \* $2 - pid to monitor
 \* $3 - function to look for

pid$2::$3:entry {
	@[ustack(30)] = count();}
tick-$1 {

Technorati Tags: ,

Sunday Nov 12, 2006

Solaris ufs bug in Month of Kernel Bugs

Just noticed that Solaris has an entry in Month of Kernel bugs.

While I agree that we have an issue that needs looking at, I also believe that the contributor is making much more of it than it really deserves.

First off, to paraphrase the issue:

If I give you a specially massaged filesystem and can convince someone with the appropriate privilege to mount it, it will crash the system.

I'd hardly call this a "denial of service", let alone exploitable.

First off, in order to perform a mount operation of a ufs filesystem, you need sys_mount privilege. In Solaris, we currently are runing under the concept of "least privilege". That is, a process is given the least amount of privilege that it needs to run. So, in order to exploit this you need to convince someone with the appropriate level of privilege to mount your filesystem. This would also invlove a bit of social engineering which went unmentioned.

That being said, they system should not panic off this filesystem and I will log a bug to this effect. It is a shame that the contributor did not make the crashdump files available as it would certainly speed up any analysis.

One other thing that I should add is that anyone who tries to mount an unknown ufs filesystem without at least running "fsck -n" over it probably deserves what they get.

OK, I have copied it to a relatively current nevada system and mounted it as /dev/lofi/1. On running "fsck -n" we see:

\*\* /dev/rlofi/1 (NO WRITE)







In the normal course of events, would you mount this filesystem. I certainly would not. This however is not the normal course of events and I'm playing on a lab system.

v40z-c# uname -a
SunOS v40z-c 5.11 snv_46 i86pc i386 i86pc

Let's try a read only mount first.

v40z-c# mount -r /dev/lofi/1 /mnt
v40z-c# ls /mnt
v40z-c# umount /mnt

OK, the read only mount is fine. Now the read/write, ... Bingo

v40z-c# mount /dev/lofi/1 /mnt

panic[cpu3]/thread=ffffffff9a6974e0: BAD TRAP: type=e (#pf Page fault) rp=fffffe8000c7f2c0 addr=fffffe80fe39d6c4

mount: #pf Page fault
Bad kernel fault at addr=0xfffffe80fe39d6c4
pid=2170, pc=0xfffffffffbb70950, sp=0xfffffe8000c7f3b0, eflags=0x10286
cr0: 8005003b cr4: 6f8
cr2: fffffe80fe39d6c4 cr3: 1ff76b000 cr8: c
fffffe8000c7f1b0 unix:die+b1 ()
fffffe8000c7f2b0 unix:trap+1528 ()
fffffe8000c7f2c0 unix:_cmntrap+140 ()
fffffe8000c7f440 ufs:alloccgblk+42f ()
fffffe8000c7f4e0 ufs:alloccg+473 ()
fffffe8000c7f560 ufs:hashalloc+50 ()
fffffe8000c7f600 ufs:alloc+14f ()
fffffe8000c7f6c0 ufs:lufs_alloc+f3 ()
fffffe8000c7f770 ufs:lufs_enable+261 ()
fffffe8000c7f7e0 ufs:ufs_fiologenable+63 ()
fffffe8000c7fd60 ufs:ufs_ioctl+3e0 ()
fffffe8000c7fdc0 genunix:fop_ioctl+3b ()
fffffe8000c7fec0 genunix:ioctl+180 ()
fffffe8000c7ff10 unix:sys_syscall32+101 ()

OK, so we should now have a crashdump to look at.

While the machine is rebooting, it occurs to me that if we put this ufs onto an external USB device, we might actually have an exploitable issue here, once the new hal/rmvolmgr framework is in place (nv_51) if we try to automatically mount ufs devices.

core file:      /var/crash/v40z-c/vmcore.0
release:        5.11 (64-bit)
version:        snv_46
machine:        i86pc
node name:      v40z-c
system type:    i86pc
hostid:         69e47dae
dump_conflags:  0x10000 (DUMP_KERNEL) on /dev/dsk/c1t1d0s1(517M)
time of crash:  Sun Nov 12 13:08:18 EST 2006
age of system:  34 days 1 hours 42 minutes 34.95 seconds
panic CPU:      3 (4 CPUs, 7.56G memory)
panic string:   BAD TRAP: type=e (#pf Page fault) rp=fffffe8000c7f2c0 addr=fffffe80fe39d6c4

sanity checks: settings...vmem...sysent...clock...misc...done

-- panic trap data  type: 0xe (Page fault)
  addr: 0xfffffe80fe39d6c4  rp: 0xfffffe8000c7f2c0
  savfp 0xfffffe8000c7f440  savpc 0xfffffffffbb70950
  %rbp  0xfffffe8000c7f440  %rsp  0xfffffe8000c7f3b0
  %rip  0xfffffffffbb70950  (ufs:alloccgblk+0x42f)

  0%rdi 0xffffffff8d60b000  1%rsi 0xffffffff8930c308  2%rdx               0xb5
  3%rcx               0xb5  4%r8  0xfffffe80fe39d6c0  5%r9              0x12f0

  %rax                 0x8  %rbx          0x361005a8
  %r10                   0  %r11  0xfffffffffbcd9ff0  %r12               0x5a8
  %r13  0xffffffff8930c000  %r14  0xffffffff8d60b000  %r15  0xffffffff99656c00
  %cs       0x28 (KCS_SEL)
  %ds       0x43 (UDS_SEL)
  %es       0x43 (UDS_SEL)
  %fs          0 (KFS_SEL)
  %gs      0x1c3 (LWPGS_SEL)
  %ss       0x30 (KDS_SEL)
  trapno     0xe (Page fault)
  err        0x2 (page not present,write,supervisor)
  %rfl   0x10286 (parity|negative|interrupt enable|resume)
  fsbase 0xffffffff80000000 gsbase 0xffffffff8901c800
-- switch to user thread's user stack --

The trap has occurred in alloccgblk+42f() in the ufs code.

ufs:alloccgblk+0x410            call   +0xe0a4  (ufs:clrblock+0x0)
ufs:alloccgblk+0x415            decl   0x1c(%r13)	; cgp->cg_cs.cs_nbfree--
ufs:alloccgblk+0x419            decl   0xc4(%r14)	; fs->fs_cstotal.cs_nbfree--
ufs:alloccgblk+0x420            movslq 0xc(%r13),%r8	; %r8 <- cgp->cg_cgx
ufs:alloccgblk+0x428            addq   0x2d8(%r14),%r8	; %r8 <- fs->fs_u.fs_csp[%r8]
ufs:alloccgblk+0x42f            decl   0x4(%r8) <-- panic here

We've just made a call to ufs:clrblock() and are decrementing something after a long list of pointer dereferencing. We only call clrblock() once in this routine, so that puts us at:

   428  #define fs_cs(fs, indx) fs_u.fs_csp[(indx)]

  1238          clrblock(fs, blksfree, (long)blkno);
  1239          /\*
  1240           \* the other cg/sb/si fields are TRANS'ed by the caller
  1241           \*/
  1242          cgp->cg_cs.cs_nbfree--;
  1243          fs->fs_cstotal.cs_nbfree--;
  1244          fs->fs_cs(fs, cgp->cg_cgx).cs_nbfree--;

Panicing on line 1244.

I should note at this point that the source I am quoting is from the same source tree as opensolaris.

After the macro expansion, it becomes

  1244          fs->fs_u.fs_csp[cgp->cg_cgx]--

So what is cgp->cg_cgx?

SolarisCAT(vmcore.0/11X)> sdump 0xffffffff8930c000 cg cg_cgx
   cg_cgx = 0x6f0000

This is probably a trifle on the largish side, which would explain how we have ended up in unmapped memory.

The address we end up with for the (struct csum \*) is 0xfffffe80fe39d6c0

If we go back to look at fs->fs_ncg, we see that there were only two cylinder groups allocated. We have an obvious inconsistancy.

Also, interestingly, this is not dieing in the mount(2) system call. It's dieing in a subsequent ioctl(2). This ioctl appears to be the one enabling ufs logging.

So how might we handle this?

Now, as the filesystem is already mounted and we are in a subsequent ioctl(), we can't fail the mount. Could we fail in alloccgblock() and have the error propogate back up to the process making the ioctl()?

Walking back up the stack, we see that alloccg() handles alloccgblk() returning 0. Now if alloccg() returns 0 to hashalloc(), in this instance, we'll first try to do a quadratic rehash, which will also fail, so we'll fall through to the brute force search. As this starts at cylinder group 2 and there are only 2 cylinder groups, this will call alloccg() once and fail, falling through to return 0 to alloc(). Note that no matter how many times we end up in alloccgblock() it has the same arguments, so it would fail the same way.

In alloc(), it notes that we did not get a block returned and assumes this is because some other thread grabbed the last block. It then treats the whole thing as if we ran out of space and returns ENOSPC to lufs_alloc(). lufs_alloc() catches this, frees up everything and returns the error (ENOSPC) to lufs_enable(), which in turn catches it cleans up and returns it to ufs_fiologenable() and the error is eventually passed back to user space. While not exactly the error we would have hoped, the end result would be that logging would not be turned on and the system would not panic due to this corrupted filesystem.

I'll log this as a bug against ufs for Solaris 10 and nevada


I have logged CR 6492771 against this issue. The link for the CR should work some time in the next 24 hours, but the content as logged is pretty much a cut and paste from this blog entry.

Update 2

The bug I logged has been closed as a duplicate of

4732193 ufs will attempt to mount filesystem with blatantly-bad superblock

Technorati Tags: , , ,

Tuesday Sep 05, 2006

Samba meets ZFS

I was browsing my RSS feeds today when I came across this post from my colleague Chris Gerhard.

Chris is in the process of revamping his home servers. He' already done some nice work with snapshotting on his notebook, but this one looks very nice as well.

The idea is to simply make a ZFS snapshot every time that the Samba user logs in. This way it is possible to go back to the beginning of a login session. OK, it's still a work in progress and it will almost certainly need some easy way to clean up the snapshots, but this is a very nice idea.

Good one Chris.

The application of ZFS that I'm waiting to try is the concept of running Windows under Xen, but using ZFS as teh backing filesystem. The idea being that I can take a snapshot of the backing filesystem before I try applying Microsofts's patches and when/if anything breaks, I'll have the option of backing out the entire patch set with a single zfs rollback command. The really nice thng here is that Windows should then even require a rebbot as Xen should be able to start off where I was when I took the snapshot initially.

Technorati Tags: , , ZFS, ,


* - Solaris and Network Domain, Technical Support Centre

Alan is a kernel and performance engineer based in Australia who tends to have the nasty calls gravitate towards him


« June 2016

No bookmarks in folder

Sun Folk

No bookmarks in folder

Non-Sun Folk
Non-Sun Folks

No bookmarks in folder