Sunday Sep 06, 2009

New Edelux to light my way home

My winter bike is ready for another dose of rain and darkness. This year I have a new headlight, the 2.4W Schmidt Edelux: a single LED that throws out more light than my old 12V setup. The old Lumotec Oval Plus sensor failed at the end of last winter, such that the only part that still worked was the lamp. Neither the sensor nor the switch would turn it off, and the standlight had also failed. While I don't set much store by a forward standlight, with so many failures it was only a matter of time before the light failed completely, something I can't really risk.

So I have joined the 21st century and have an LED. I've only test ridden it up and down the road, which has street lights, so that does not really do it justice, but it was very impressive. Like the Oval Plus sensor it comes on automatically and has a manual override.

Unlike the Oval Plus, the switch is well protected, being a reed switch operated by a magnet, so there is no way for water to get inside. Like all Schmidt lights it has a 5 year warranty and looks fantastic.

It is powered by the SON hub generator and also switches the rear light.

Photo by Robyn Gerhard

Thursday Sep 03, 2009

Improved sd.conf format

Editing sd.conf has always been somewhat difficult: it was not a documented interface, it was never intended to be exposed, and it was even architecture specific. Fortunately Micheal documented it, which meant that it was known even if the syntax remained obscure.

However, after ARC case 2008/465 was approved and the changes pushed as part of bug 6518995, you can now use a more human-readable syntax¹:

        "ATA     VBOX HARDDISK", "disksort:false";

As it turns out, the “disksort”² option along with throttle-max and throttle-min are the ones I most often want to tune.

Here is the current list of tunables lifted straight from the ARC case.

¹This reminds me of the change to /etc/printcap that allowed you to specify the terminal flags as strings rather than as a bitmap. All the mystery seemed to be removed!

²While I used disksort as an example, for this case I can't think of any reason why you would have it enabled for a virtual disk in VirtualBox.

Thursday Aug 27, 2009

Starting remote X applications

Someone has posted a script to start a remote xterm on BigAdmin which exposes a number of issues. I thought it would be better if Google stood some chance of finding a better answer, or at least an answer that does not rely on inherently insecure settings.

Remote X applications should be started using ssh -X so that the X traffic is encrypted and, if you add -C, compressed, which can be a significant performance boost. So a script to do this could be handy, although to be honest knowing the ssh options, or having them set as the default in your .ssh/config, is just as easy:

: FSS 31 $; egrep '^(Compress|ForwardX)' ~/.ssh/config
ForwardX11 yes
Compression yes
: FSS 32 $; ssh -f pearson /usr/X11/bin/xterm         
: FSS 33 $; 

or more usefully to start graphical tools:

: FSS 33 $; ssh -f pearson pfexec /usr/sadm/admin/bin/dhcpmgr
: FSS 34 $; 

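However, if you really want a script to do it, here is one that will, with no need to mess with your .ssh/config:

#!/bin/ksh -p
APP=${0##*/}                       # the name this script was invoked as, e.g. rxterm
REMOTE_PATH=${REMOTE_PATH:-$PATH}  # assumption: hand the local PATH to the remote side
if (( $# < 1 )); then
        print "USAGE: ${APP} host [args]" >&2
        exit 1
fi
host=$1; shift
exec /usr/bin/ssh -o ClearAllForwardings=yes -C -Xfn $host \
        PATH=${REMOTE_PATH} pfexec ${APP#r} "$@"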

If you save this into a file called “rxterm” then running “rxterm remotehost” will start an xterm on the system remotehost assuming you can ssh to that system.

More entertainingly you can save it as “rdhcpmgr” and it will start the dhcpmgr program on a remote system and securely display it on your current display (assuming your PATH includes /usr/sadm/admin/bin and your profile allows you access to that application). You can use it to start any application by simply naming it after the application in question with a preceding “r”.
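
The naming trick rests on two parameter expansions: ${0##*/} strips the directory from the name the script was invoked as, and ${APP#r} strips one leading “r”. Here is a minimal sketch of just that step; app_from_name is a made-up helper for illustration and is not part of the script above:

```shell
#!/bin/sh
# app_from_name is a hypothetical helper, for illustration only.
# Print the application that a wrapper invoked as $1 would start.
app_from_name() {
        name=${1##*/}              # drop any leading directory
        printf '%s\n' "${name#r}"  # drop a single leading "r", if present
}

app_from_name /usr/local/bin/rxterm    # prints xterm
app_from_name rdhcpmgr                 # prints dhcpmgr
```

Note that only one “r” is removed, so a wrapper for rsync would have to be called “rrsync”.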

Wednesday Aug 26, 2009

More OpenSolaris updates

As I have lived with OpenSolaris I've got used to the updates occurring automatically, as you would with most modern operating systems. What is a real joy is that it creates a new boot environment for the updates, so in the event that one is toxic you can always just roll back. It also gives you a handy reference as to how many updates you have done. Number 13 has just happened for me:

cjg@brompton:~$ beadm list
BE               Active Mountpoint Space  Policy Created          
--               ------ ---------- -----  ------ -------          
b4nvidia-bin-fix -      -          86.0K  static 2009-06-06 17:27 
opensolaris-10   -      -          15.68M static 2009-07-18 08:04 
opensolaris-11   -      -          33.73M static 2009-07-26 09:42 
opensolaris-12   N      /          266.5K static 2009-08-24 13:26 
opensolaris-13   R      -          16.42G static 2009-08-26 18:58 
opensolaris-4    -      -          22.19M static 2009-01-30 21:42 
opensolaris-5    -      -          21.30M static 2009-02-25 20:14 
opensolaris-6    -      -          45.87M static 2009-04-10 18:17 
opensolaris-7    -      -          37.83M static 2009-06-01 20:51 
opensolaris-8    -      -          19.03M static 2009-07-02 18:55 
opensolaris-9    -      -          11.56M static 2009-07-10 07:30 

I'm going to have to bite the bullet on my home server and switch to OpenSolaris soon, as the Nevada builds will stop. Alas, with term time approaching I don't think I will be allowed significant down time for a while.

Thursday Aug 06, 2009

Monitoring mounts

Sometimes in the course of being a system administrator it is useful to know what file systems are being mounted and when, and which mounts fail and why. While you can turn on automounter verbose mode, that only answers the question for the automounter.

Dtrace makes answering the general question a snip:

: FSS 24 $; cat mount_monitor.d                         
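#!/usr/sbin/dtrace -qs

/*
 * The probe names, the braces, the copyinstr() branch and the bare
 * newline are reconstructed; 0x8 is taken to be MS_SYSSPACE, meaning
 * the directory name is already in the kernel.
 */
fbt::domount:entry
/ args[1]->dir /
{
        self->dir = args[1]->flags & 0x8 ? stringof(args[1]->dir) :
              copyinstr((uintptr_t)args[1]->dir);
}
fbt::domount:return
/ self->dir != 0 /
{
        printf("%Y domount ppid %d, %s %s pid %d -> %s", walltimestamp, 
              ppid, execname, self->dir, pid, arg1 == 0 ? "OK" : "failed");
}
fbt::domount:return
/ self->dir != 0 && arg1 == 0/
{
        printf("\n");
        self->dir = 0;
}
fbt::domount:return
/ self->dir != 0 && arg1 != 0/
{
        printf("errno %d\n", arg1);
        self->dir = 0;
}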
: FSS 25 $; pfexec /usr/sbin/dtrace -qs  mount_monitor.d
2009 Aug  6 12:57:57 domount ppid 0, sched /share/consoles pid 0 -> OK
2009 Aug  6 12:57:59 domount ppid 0, sched /share/chroot pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/newsrc pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/build2 pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/chris_at_play pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/ws_eng pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/ws pid 0 -> OK
2009 Aug  6 12:58:03 domount ppid 0, sched /home/tx pid 0 -> OK
2009 Aug  6 12:58:04 domount ppid 0, sched /home/fl pid 0 -> OK
2009 Aug  6 12:58:05 domount ppid 0, sched /home/socal pid 0 -> OK
2009 Aug  6 12:58:07 domount ppid 0, sched /home/bur pid 0 -> OK
2009 Aug  6 12:58:23 domount ppid 0, sched /net/ pid 0 -> OK
2009 Aug  6 12:58:23 domount ppid 0, sched /net/ pid 0 -> OK
2009 Aug  6 12:58:23 domount ppid 0, sched /net/ pid 0 -> OK
2009 Aug  6 12:59:45 domount ppid 8929, Xnewt /tmp/.X11-pipe/X6 pid 8935 -> OK

In particular that last line, if repeated often, can give you a clue to things not being right.
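
To spot a mount that is being repeated you can count the lines per mount point. This is only a sketch: it assumes the output above has been saved to a file (or is piped in) and that the execname contains no spaces, so that the mount point is always the ninth field. count_mounts is a name I have made up for the example.

```shell
#!/bin/sh
# count_mounts [FILE...]: summarise saved mount_monitor.d output (or
# stdin), most frequently mounted directory first.
count_mounts() {
        # With whitespace-separated fields the mount point is field 9:
        # 2009 Aug  6 12:59:45 domount ppid 8929, Xnewt /tmp/.X11-pipe/X6 pid 8935 -> OK
        awk '$5 == "domount" { n[$9]++ }
             END { for (d in n) print n[d], d }' "$@" | sort -rn
}
```

Run as “count_mounts mounts.log | head” it puts the busiest mount points at the top.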

Tuesday Aug 04, 2009

Making a simple script faster

Many databases get backed up by simply stopping the database, copying all the data files and then restarting the database. This is fine for things that don't require 24 hour access. However, if you are concerned about the time the backup takes then don't do this:

cp /data/file1.db .
gzip file1.db
cp /data/file2.db .
gzip file2.db

Now there are many ways to improve this, using ZFS and snapshots being one of the best, but if you don't want to go there then at the very least stop doing the “cp”. It is completely pointless. The above should just be:

gzip < /data/file1.db > file1.db.gz
gzip < /data/file2.db > file2.db.gz

You can continue to make it faster by backgrounding those gzips if the system has spare capacity while the backup is running, but that is another point. Just stopping those extra copies will make life faster as they are completely unnecessary.
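
For what it is worth, backgrounding the gzips could look something like this. It is a sketch rather than a finished backup script: compress_all is a made-up name, it starts one gzip per file with no limit on how many run at once, and wait does not tell you whether any of them failed.

```shell
#!/bin/sh
# compress_all DEST FILE...: gzip each FILE straight into DEST, all in
# parallel, reading the source in place so there is no intermediate cp.
compress_all() {
        dest=$1
        shift
        for f in "$@"; do
                gzip < "$f" > "$dest/$(basename "$f").gz" &
        done
        # Block until every background gzip has finished.
        wait
}
```

With the database stopped, “compress_all . /data/file1.db /data/file2.db” leaves file1.db.gz and file2.db.gz in the current directory.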

Friday Jul 31, 2009

Adding a Dtrace provider to the kernel

Since writing scsi.d I have been pondering whether there should really be a scsi dtrace provider that allows you to do all that scsi.d does and more. The push of 6797025 both removed the main reason for not doing this and gave impetus to do it, as scsi.d needed incompatible changes to use the new return function as the return “probe”.

This is very much work in progress and may or may not see the light of day due to some other issues around scsi addressing. However, I thought I would document how I added a kernel dtrace provider so that if you want to do the same you don't have to do so much searching¹.

Adding the probes themselves is simplicity itself using the DTRACE_PROBEN() macros. Following the convention I added this macro:

#define	DTRACE_SCSI_2(name, type1, arg1, type2, arg2)			\
	DTRACE_PROBE2(__scsi_##name, type1, arg1, type2, arg2);

to usr/src/uts/common/sys/sdt.h. Then after including <sys/sdt.h> in each file I put this macro in each of the places I wanted my probes:

 	DTRACE_SCSI_2(transport, struct scsi_pkt *, pkt,
 	    struct scsi_address *, P_TO_ADDR(pkt))

The bit that took a while to find was how to turn these into a provider. To do that edit the file “usr/src/uts/common/dtrace/sdt_subr.c” and create the attribute structure²:

 static dtrace_pattr_t scsi_attr = {

and add it to the sdt_providers array:

	{ "scsi", "__scsi_", &scsi_attr, 0 },

then add the probes to the sdt_args array:

	{ "scsi", "transport", 0, 0, "struct scsi_pkt *", "scsi_pktinfo_t *"},
	{ "scsi", "transport", 1, 1, "struct scsi_address *", "scsi_addrinfo_t *"},
	{ "scsi", "complete", 0, 0, "struct scsi_pkt *", "scsi_pktinfo_t *"},
	{ "scsi", "complete", 1, 1, "struct scsi_address *", "scsi_addrinfo_t *"},

Finally you need to create a file containing the definitions of the output structures, scsi_pktinfo_t and scsi_addrinfo_t and define translators for them. That goes into /usr/lib/dtrace and I called mine scsa.d (there is already one called scsi.d).

/*
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or
 * See the License for the specific language governing permissions
 * and limitations under the License.
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

#pragma D depends_on module scsi
#pragma D depends_on provider scsi

inline char TEST_UNIT_READY = 0x0;
#pragma D binding "1.0" TEST_UNIT_READY
inline char REZERO_UNIT_or_REWIND = 0x0001;
#pragma D binding "1.0" REZERO_UNIT_or_REWIND

inline char SCSI_HBA_ADDR_COMPLEX = 0x0040;
#pragma D binding "1.0" SCSI_HBA_ADDR_COMPLEX

typedef struct scsi_pktinfo {
	caddr_t pkt_ha_private;
	uint_t	pkt_flags;
	int	pkt_time;
	uchar_t *pkt_scbp;
	uchar_t *pkt_cdbp;
	ssize_t pkt_resid;
	uint_t	pkt_state;
	uint_t 	pkt_statistics;
	uchar_t pkt_reason;
	uint_t	pkt_cdblen;
	uint_t	pkt_tgtlen;
	uint_t	pkt_scblen;
} scsi_pktinfo_t;

#pragma D binding "1.0" translator
translator scsi_pktinfo_t  < struct scsi_pkt *P > {
	pkt_ha_private = P->pkt_ha_private;
	pkt_flags = P->pkt_flags;
	pkt_time = P->pkt_time;
	pkt_scbp = P->pkt_scbp;
	pkt_cdbp = P->pkt_cdbp;
	pkt_resid = P->pkt_resid;
	pkt_state = P->pkt_state;
	pkt_statistics = P->pkt_statistics;
	pkt_reason = P->pkt_reason;
	pkt_cdblen = P->pkt_cdblen;
	pkt_tgtlen = P->pkt_tgtlen;
	pkt_scblen = P->pkt_scblen;
};

typedef struct scsi_addrinfo {
	struct scsi_hba_tran	*a_hba_tran;
	ushort_t a_target;	/* ua target */
	uchar_t	 a_lun;		/* ua lun on target */
	struct scsi_device *a_sd;
} scsi_addrinfo_t;

#pragma D binding "1.0" translator
translator scsi_addrinfo_t  < struct scsi_address *A > {
	a_hba_tran = A->a_hba_tran;
	a_target = !(A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		0 : A->a.spi.a_target;
	a_lun = !(A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		0 : A->a.spi.a_lun;
	a_sd = (A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		A->a.a_sd : 0;
};

Again, this is just enough to get going so I can see and use the probes:

jack@v4u-2500b-gmp03:~$ pfexec dtrace -P scsi -l
   ID   PROVIDER            MODULE                          FUNCTION NAME
 1303       scsi              scsi                    scsi_transport transport
 1313       scsi              scsi                 scsi_hba_pkt_comp complete

While this all works well for parallel scsi, how to get the address of devices on fibre is not clear to me. If you have any suggestions I'm all ears.

¹If there is such a document already in existence then please add a comment. I will just wish I could have found it.

²These may not be the right attributes, but they get me to the point where it compiles and can be used in a PoC.

Friday Jul 24, 2009

gethrtime and the real time of day

Seeing Katsumi Inoue blogging about Oracle 10g reporting timestamps using the output from gethrtime() reminded me that I have on occasion wished I had a log to map hrtime to the current time. As Katsumi points out, the output of gethrtime() is not absolutely tied to the current time. So there is no way to take the output from it and tell when in real time the output was generated unless you have some reference point. To make things more complex, the output is reset each time the system reboots.

For this reason it is useful to keep a file that contains a history of the hrtime and the real time so that any logs can be retrospectively coerced back into a readable format.

There are lots of ways to do this, but since on this blog we seem to be in Dtrace mode here is how using dtrace:

pfexec /usr/sbin/dtrace -o /var/log/hrtime.log -qn 'BEGIN,tick-1hour,END {
        printf("%d:%d.%9.9d:%Y\n",
        timestamp, walltimestamp/1000000000,
        walltimestamp%1000000000, walltimestamp);
}'

Then you get a nice file that contains three columns. The hrtime, the time in seconds since January 1st 1970 and a human readable representation of the time in the current timezone:

: TS 39 $; cat /var/log/hrtime.log    
5638545510919736:1248443226.350000625:2009 Jul 24 14:47:06
5642145449325180:1248446826.279995332:2009 Jul 24 15:47:06

I have to confess however that using Dtrace for this does not feel right, not least as you need to be root for this to be reliable and also the C code is trivial to write, compile and run from cron and send the output to syslog:

: FSS 39 $; cat  ./gethrtime_base.c
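#include <sys/time.h>
#include <stdio.h>
#include <time.h>

int
main(int argc, char **argv)
{
	hrtime_t hrt = gethrtime();
	struct timeval tv;
	gettimeofday(&tv, NULL);

	/* ctime() supplies the trailing newline */
	printf("%lld:%ld.%6.6ld:%s", hrt, (long)tv.tv_sec, (long)tv.tv_usec,
	    ctime(&tv.tv_sec));
	return (0);
}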
: FSS 40 $; make ./gethrtime_base
cc    -o gethrtime_base gethrtime_base.c 
: FSS 41 $;  ./gethrtime_base
11013365852133078:1248444379.163215:Fri Jul 24 15:06:19 2009
: FSS 42 $; 
./gethrtime_base | logger -p daemon.notice -t hrtime
: FSS 43 $;  tail -10 /var/adm/messages | grep hrtime
Jul 24 15:32:33 exdev hrtime: [ID 702911 daemon.notice] 11014939896174861:1248445953.109855:Fri Jul 24 15:32:33 2009
Jul 24 16:09:21 exdev hrtime: [ID 702911 daemon.notice] 11017148054584749:1248448161.131675:Fri Jul 24 16:09:21 2009
: FSS 50 $; 
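
Going the other way, from an hrtime found in some log back to wall clock time, is then just arithmetic against one reference line. The following is a sketch only: hrtime_to_epoch is a made-up helper, it takes a line in the format of the dtrace log above (nine digit nanosecond field), it assumes the hrtime comes from the same boot as the reference line, and it needs a shell with 64 bit arithmetic.

```shell
#!/bin/sh
# hrtime_to_epoch REF HRTIME (hypothetical helper)
# REF is one line of /var/log/hrtime.log: hrtime:seconds.nanoseconds:date
# Prints the wall clock time of HRTIME as seconds.nanoseconds since 1970.
hrtime_to_epoch() {
        ref_hr=${1%%:*}                 # first field: reference hrtime
        ref_wall=${1#*:}                # drop the first field
        ref_wall=${ref_wall%%:*}        # keep seconds.nanoseconds
        ref_sec=${ref_wall%%.*}
        ref_nsec=${ref_wall#*.}
        # Strip leading zeros so the value is not read as octal.
        ref_nsec=${ref_nsec#"${ref_nsec%%[!0]*}"}
        total=$(( ref_sec * 1000000000 + ${ref_nsec:-0} + $2 - ref_hr ))
        printf '%d.%09d\n' $(( total / 1000000000 )) $(( total % 1000000000 ))
}

ref='5638545510919736:1248443226.350000625:2009 Jul 24 14:47:06'
hrtime_to_epoch "$ref" 5642145449325180    # prints 1248446826.288406069
```

The hrtime used here is the one from the second line of the log, whose recorded wall clock time was 1248446826.279995332, so the answer is about 8 milliseconds out. That is a reminder that the mapping is only approximate and the nearer the reference point the better.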

Wednesday Jul 22, 2009

1,784,593 the highest load average ever?

As I cycled home I realised there was one more thing I could do while exploring the limits of threads and processes on Solaris: go for the highest load average ever. Modifying the thread creator program so that each thread, instead of sleeping once started, waits until all the threads are set up and then goes into an infinite compute loop should get me the highest load average possible on a system, or so you would think.

With 784001 threads the load stabilised at:

10:16am  up 18:07,  2 users,  load average: 22114.50, 22022.68, 21245.78

Which was somewhat disappointing. However an earlier run with just 780,000 threads managed to peak the load at 1,784,593 while it was exiting:

 7:44am  up 15:35,  2 users,  load average: 1724593.79, 477392.80, 188985.10

I'm still pondering how 780,000 threads can result in a load average of more than 1 million.

Sunday Jul 19, 2009

784972 threads in a process

After the surprise interest in the maximum number of processes on a system it seems rude not to try and see how many threads I can squeeze into a single process while I have access to a system where physical memory will not be the limiting factor. The expectation is that this will closely match the number of processes as each thread will have an LWP in the kernel which will in turn consume the segkp.

A slight modification to the forker program:

: FSS 62 $; cat thr_creater.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <thread.h>

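int
main(int argc, char **argv)
{
        pid_t pid;
        int count=0;
        /*
         * thr_create() returns 0 on success and an error number on
         * failure, so once it starts failing this loop just retries.
         */
        while(count < (argc != 2 ? 100 : atoi(argv[1])) &&
            (pid = thr_create(NULL, 0, (void * (*)(void *))pause,
            NULL, THR_DETACHED, NULL)) != -1) {
                if (pid == 0 ) {
                        /* Success */
                        count++;
                        if (count % 1000 == 0)
                                printf("%d\n", count);
                }
        }
        if (pid < 0)
                perror("thr_create");
        printf("%d\n", count);
        return (0);
}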

and this time it has to be built as a 64 bit program:

# make "CFLAGS=-m64 -mt" thr_creater

Here is how it went:

$; ./thr_creater 1000000        

Here things have stopped and for some bizarre reason attaching a debugger to see what is going on does not seem to be a good idea. I had prstat running in another window and it reported:

  2336 cg13442  7158M 7157M cpu73    0    0   1:42:59 1.6% thr_creater/784970

Which is just a few more threads than I got processes (784956) when running in multi user. However, at this point the system is pretty much a warm brick: if I exit any process, thr_creater hoovers up what it frees, so I can create no more. Fortunately I had realized this would happen and had some sleep(1) processes running, so I could pause the thr_creater and then kill one of the sleeps to allow me to run a command:

$; ps -o pid,vsz,rss,nlwp,comm -p 2336
  2336 7329704 7329248 784972 ./thr_creater

As you can see, it managed to get another two threads created since the prstat exited.

Saturday Jul 18, 2009

Cycle helmets or brakes

I've just straightened the rear wheel on the bike of a friend of my daughter and while I was at it checked over the rest of the bike. This leads to my top tip for parents of children who cycle (which IMO should be every parent of an able bodied child).

Before you start worrying about whether your child is or is not wearing a helmet¹ make sure that the brakes on their bike work. Not just when the bike is new but regularly.

¹See for whether you should worry about that at all.

Friday Jul 17, 2009

10 Steps to OpenSolaris Laptop Heaven

If you have recently come into possession of a Laptop onto which to load Solaris then here are my top tips:

  1. Install OpenSolaris. At the time of writing the release is 2009.06, so install that; parts of this advice may become obsolete with later releases. Do not install Solaris 10 or, even worse, Nevada. You should download the live CD, burn it onto a disk, boot that and let it install, but before you start the install read the next tip.

  2. Before you start the install, open a terminal so that you can turn on compression on the root pool once it is created. You have to keep running “zpool list” until you see the pool is created and then run (pfexec zfs set compression=on rpool). You may think that disk is big, but after a few months you will be needing every block you can get. Also laptop drives are so slow that compression will probably make things faster.

  3. Before you do anything after installation take a snapshot of the system so you can always go back (pfexec beadm create opensolaris@initialinstall). I really mean this.

  4. Add the extras repository. It contains virtualbox, the flash plugin for firefox, true type fonts and more. All you need is a sun online account. See and

  5. Decide whether you want to use the development or support repository. If in doubt choose the supported one. Sun employees get access to the support repository. Customers need to get a support contract. Then update to the latest bits (pfexec pkg image-update).

  6. Add any extra packages you need. Since I am now writing this retrospectively there may be things missing. My starting list is:

    • OpenOffice (pfexec pkg install openoffice)

    • SunStudio (pfexec pkg install sunstudioexpress)

    • Netbeans (pfexec pkg install netbeans)

    • Flash (pfexec pkg install flash)

    • Virtualbox (pfexec pkg install virtualbox)

    • TrueType fonts (pfexec pkg install ttf-fonts-core)

  7. If you are a Sun Employee install the punchin packages so you can access SWAN. I actually rarely use this as I have a Solaris 10 virtualbox image that I use for punchin so I can be both on and off SWAN at the same time but it is good to have the option.

  8. Add your keys to firefox so that you can browse the extras and support repositories from firefox. See

  9. Go to Fluendo and get and install the free mp3 decoder. They also sell a complete and legal set of decoders for the major video formats. I have them and have been very happy with them. They allow me to view the videos I have of cycling events.

  10. Go to Adobe and get acroread. I live in hope that at some point this will be in a repository either at Sun or one Adobe runs so that it can be installed using the standard pkg commands but until then do it by hand.
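
The waiting in tip 2 can be left to the shell rather than retyping “zpool list”. This is a sketch, not something the installer needs: wait_until is a made-up helper, and it relies on “zpool list rpool” returning an error until the pool exists.

```shell
#!/bin/sh
# wait_until CMD [ARG...] (hypothetical helper): run CMD once a second
# until it succeeds, discarding its output.
wait_until() {
        until "$@" >/dev/null 2>&1; do
                sleep 1
        done
}

# Intended use in the live CD terminal (commented out so that pasting
# this block elsewhere changes nothing):
#   wait_until zpool list rpool && pfexec zfs set compression=on rpool
```

Compression only applies to blocks written after it is turned on, which is why it is worth catching the pool as soon as it appears.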


Thursday Jul 16, 2009

784956 Processes

This week we had a customer claiming that they were unable to create more than 60,000 processes. This turned out to be due to them tuning max_nprocs, maxuprc and pidmax but not setting segkpsize, so the system would run out of “memory” before it ran into the resource limits for processes.

Tuning segkpsize to 8G resolved it but I just had to see how many processes I could get running on an M8000.

Using these settings in /etc/system:

set segkpsize=0x300000
set pidmax=999999
set maxuprc=999990
set max_nprocs=999999

and a simple forker program:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        pid_t pid;
        int count=0;
        while(count < (argc != 2 ? 100 : atoi(argv[1])) &&
            (pid = fork()) != -1) {
                if (pid != 0 ) {
                        /* Parent */
                        count++;
                        if (count % 1000 == 0)
                                printf("%d\n", count);
                } else {
                        /* Child: just sit there holding a process slot */
                        pause();
                        _exit(0);
                }
        }
        if (pid < 0)
                perror("fork");
        printf("%d\n", count);
        return (0);
}

I was slightly disappointed at the result:

$ ./forker 1000000
fork: Resource temporarily unavailable

Only 784956 processes, plus the ones already running when the system booted. Trying to count them with ps obviously fails but mdb gives the real count.

# ps -e| wc
ksh: cannot fork: too many processes
# echo nproc::print -d | mdb -k  

Someone must have managed to get more.
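
For reference, here is how big the segkpsize setting above actually is. This is a sketch resting on two assumptions that you should check for your own system: that segkpsize is counted in pages, and that the base page size is 8192 bytes (pagesize(1) will tell you). segkp_bytes is a made-up helper.

```shell
#!/bin/sh
# segkp_bytes PAGES [PAGESIZE]: size in bytes of a segkpsize setting.
# Assumptions: segkpsize counts pages, and the base page is 8192 bytes
# unless PAGESIZE says otherwise.
segkp_bytes() {
        echo $(( $1 * ${2:-8192} ))
}

segkp_bytes 0x300000    # prints 25769803776, i.e. 24 GB
```

On the same assumptions the 8G that fixed the customer's problem would be 0x100000 pages.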

Monday Jul 13, 2009

Cycling route from Farnborough Main to the Sun Office

In case anyone wanted to cycle from Farnborough Main railway station to the Sun Microsystems offices at Guillemont Park here is the route I take.


Tuesday Jun 30, 2009

Using dtrace to track down memory leaks

I've been working with a customer to try and find a memory “leak” in their application. Many things have been tried, including libumem and the mdb ::findleaks command, all with no success.

So I was, as I am sure others before me have, pondering whether you could use dtrace to do this. Well, I think you can. I have a script that puts probes into malloc et al and counts, per thread, how often they are called and how often free is called.

Then in the entry probe of the target application note how many calls there have been to the allocators and how many to free and, with a bit of care, realloc. Then in the return probe compare the number of calls to allocate and free with the saved values and aggregate the results. The principle is that you find the routines that are resulting in allocations that they don't clear up. This should give you a list of functions that are possible leakers which you can then investigate¹.

Using the same technique for getting dtrace to “follow fork” that I described here, I ran this up on diskomizer, a program that I understand well and I'm reasonably sure does not have systemic memory leaks. The dtrace script reports three sets of results.

  1. A count of how many times each routine and its descendants have called a memory allocator.

  2. A count of how many times each routine and its descendants have called free or realloc with a non NULL pointer as the first argument.

  3. The difference between the two numbers above.

Then a little bit of nawk to remove all the functions for which the counts are zero gives the following (note the TARGET_OBJ setting²):

# /usr/sbin/dtrace -Z -wD TARGET_OBJ=diskomizer -o /tmp/out-us \
	-s /tmp/followfork.d \
	-Cs /tmp/allocated.d -c \
         "/opt/SUNWstc-diskomizer/bin/sparcv9/diskomizer -f /devs -f background \
          -o background=0 -o SECONDS_TO_RUN=1800"
dtrace: failed to compile script /tmp/allocated.d: line 20: failed to create entry probe for 'realloc': No such process
dtrace: buffer size lowered to 25m
dtrace: buffer size lowered to 25m
dtrace: buffer size lowered to 25m
dtrace: buffer size lowered to 25m
# nawk '$1 != 0 { print  $0 }' < /tmp/out.3081
           1 diskomizer`do_dev_control
           1 diskomizer`set_dev_state
           1 diskomizer`set_state
           3 diskomizer`report_exit_reason
           6 diskomizer`alloc_time_str
           6 diskomizer`alloc_time_str_fmt
           6 diskomizer`update_aio_read_stats
           7 diskomizer`cancel_all_io
           9 diskomizer`update_aio_write_stats
          13 diskomizer`cleanup
          15 diskomizer`update_aio_time_stats
          15 diskomizer`update_time_stats
          80 diskomizer`my_calloc
         240 diskomizer`init_read
         318 diskomizer`do_restart_stopped_devices
         318 diskomizer`start_io
         449 diskomizer`handle_write
         606 diskomizer`do_new_write
        2125 diskomizer`handle_read_then_write
        2561 diskomizer`init_buf
        2561 diskomizer`set_io_len
       58491 diskomizer`handle_read
       66255 diskomizer`handle_write_then_read
      124888 diskomizer`init_read_buf
      124897 diskomizer`do_new_read
      127460 diskomizer`expect_signal
           1 diskomizer`expect_signal
           3 diskomizer`report_exit_reason
           4 diskomizer`close_and_free_paths
           6 diskomizer`update_aio_read_stats
           9 diskomizer`update_aio_write_stats
          11 diskomizer`cancel_all_io
          15 diskomizer`update_aio_time_stats
          15 diskomizer`update_time_stats
          17 diskomizer`cleanup
         160 diskomizer`init_read
         318 diskomizer`do_restart_stopped_devices
         318 diskomizer`start_io
         442 diskomizer`handle_write
         599 diskomizer`do_new_write
        2125 diskomizer`handle_read_then_write
        2560 diskomizer`init_buf
        2560 diskomizer`set_io_len
       58491 diskomizer`handle_read
       66246 diskomizer`handle_write_then_read
      124888 diskomizer`do_new_read
      124888 diskomizer`init_read_buf
      127448 diskomizer`cancel_expected_signal
     -127448 diskomizer`cancel_expected_signal
          -4 diskomizer`cancel_all_io
          -4 diskomizer`cleanup
          -4 diskomizer`close_and_free_paths
           1 diskomizer`do_dev_control
           1 diskomizer`init_buf
           1 diskomizer`set_dev_state
           1 diskomizer`set_io_len
           1 diskomizer`set_state
           6 diskomizer`alloc_time_str
           6 diskomizer`alloc_time_str_fmt
           7 diskomizer`do_new_write
           7 diskomizer`handle_write
           9 diskomizer`do_new_read
           9 diskomizer`handle_write_then_read
          80 diskomizer`init_read
          80 diskomizer`my_calloc
      127459 diskomizer`expect_signal


From the above you can see that there are two functions that create and free the majority of the allocations and the allocations almost match each other, which is expected as they are effectively constructor and destructor for each other. The small mismatch is not unexpected in this context.

However it is the vast number of functions that are not listed at all as they and their children make no calls to the memory allocator or have exactly matching allocation and free that are important here. Those are the functions that we have just ruled out.

From here it is now easy to drill down on the functions that are interesting to you, i.e. the ones where there are unbalanced allocations.

I've uploaded the files allocated.d and followfork.d so you can see the details. If you find it useful then let me know.

¹Unfortunately the list is longer than you want as on SPARC it includes any functions that don't have their own stack frame due to the way dtrace calculates ustackdepth, which the script makes use of.

²The script only probes particular objects, in this case the main diskomizer binary, but you can limit it to a particular library or even a particular set of entry points based on name if you edit the script.

Saturday Jun 27, 2009

Follow fork for dtrace pid provider?

There is an ongoing request to have follow fork functionality for the dtrace pid provider, but so far no one has stepped up to the plate for that RFE. In the meantime my best workaround is this:

cjg@brompton:~/lang/d$ cat followfork.d
/* The probe names and braces in both scripts are reconstructed. */
proc:::start
/ppid == $target/
{
	printf("fork %d\n", pid);
	system("dtrace -qs child.d -p %d", pid);
}
cjg@brompton:~/lang/d$ cat child.d
pid$target::malloc:entry
{
	printf("%d %s:%s %d\n", pid, probefunc, probename, ustackdepth);
}
cjg@brompton:~/lang/d$ pfexec /usr/sbin/dtrace -qws followfork.d -s child.d -p 26758
26758 malloc:entry 22
26758 malloc:entry 15
26758 malloc:entry 18
26758 malloc:entry 18
26758 malloc:entry 18
fork 27548
27548 malloc:entry 7
27548 malloc:entry 7
27548 malloc:entry 18
27548 malloc:entry 16
27548 malloc:entry 18

Clearly you can have the child script do whatever you wish.

Better solutions are welcome!

Thursday Jun 25, 2009

Why a fixie

Why a fixie¹?

A few people have asked me this so here are the reasons for a fixie:

  • They are alleged to improve your pedaling.

  • They are supposed to make your legs stronger.

  • People say you are more in touch with the bike on a fixie.

  • People say they are fun to ride. Quite why is hard to understand: why would this be significantly different from just picking a gear and sticking to it? For a single speed bike with a freewheel I would agree, except that on a single speed there is no way to give in on hills without getting off.

My additional reasons were:

  • I had noticed I generally ride in 3 gears during the winter so wondered if I could make it on a single speed.

  • Bike to work made it very affordable.

  • I wanted one.

Now having one I agree they are fun more fun than I ever expected and even though I don't think I have mastered it yet I do understand about it being in touch with the bike. Going up hill there is nowhere to hide I don't know if it is making me stronger but it feels like it.

It certainly has improved my ability to "spin".

1Obviously the answer that you can never have too many bikes will, I assume, not wash. Indeed I have that problem at home since the house rule is that I can only have three bikes. The Brompton, luckily, does not count, leaving some others. However since under UK law a bike that has no pedals is not a bike, and I only have three sets of pedals, I'm o.k.

Tuesday Jun 23, 2009

500 fixie miles

I've now commuted 500 miles on my fixie. Riding with a single gear and no freewheel has proved to be more fun than I expected. I've managed to stay on the thing despite it making a pretty good attempt to throw me off on three occasions:

  • Trying to “freewheel” as I went over a speed bump.

  • Trying to “freewheel” going round a roundabout. Very exciting as I got lifted up at the same time as the rear wheel stepped out.

  • Going downhill and letting the speed build up until my legs could not keep up; in the wet it all went very wobbly.

All these happened on the first two commutes. Since then I have mastered not freewheeling, can control the speed going downhill and can spin at a rate that I previously thought impossible, although it is much easier to use the brakes.

  • Track standing is harder on the fixie than on a freewheel bike. I think this is just lack of practice as it is getting better.

  • Getting too close to the kerb is frightening as that pedal is going to go down relentlessly, so narrow gaps are narrower.

  • Judging whether there is room for another turn of the pedals before you have to stop is more important than I realised. If you get it wrong then your foot is stuck at the bottom of a pedal stroke so starting is really hard.

  • Getting your foot clipped in once you are moving is impossible; better to get clipped in before putting the power down. Another reason to perfect that track stand.

  • I've not scraped my pedals going round corners. Yet.

Thursday Jun 18, 2009

Diskomizer Open Sourced

I'm pleased to announce the Diskomizer test suite has been open sourced. Diskomizer started life in the dark days before ZFS when we lived in a world full1 of bit flips, phantom writes, phantom reads, misplaced writes and misplaced reads.

With a storage architecture that does not use end to end data verification the best that you could hope for was that your application would spot errors quickly and allow you to diagnose the broken part or bug quickly. Diskomizer was written to be a “simple” application that could verify that all the data paths worked correctly, and worked correctly under extreme load. It has been and is used by support, development and test groups for system verification.

For more details of what Diskomizer is and how to build and install it, read these pages:

You can download the source and precompiled binaries from:

and can browse the source here:

Using Diskomizer

First, remember that in most cases Diskomizer will destroy all the data on any target you point it at. So extreme care is advised.

I will say that again.

Diskomizer will destroy all the data on any target that you point it at.

For the purposes of this explanation I am going to use ZFS volumes so that I can create and destroy them with confidence that I will not be destroying someone's data.

First let's create some volumes.

# i=0
# while (( i < 10 ))
do
zfs create -V 10G storage/chris/testvol$i
let i=i+1
done
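
When the testing is finished the same volumes have to go away again. Here is a minimal sketch of the clean up, assuming the ten volume names created above. The function names testvol_names and destroy_testvols and the RUN variable are hypothetical, made up for this example. Only "zfs destroy" is a real command, and by default the sketch just prints what it would do:

```shell
# Hypothetical helper: print the volume names, testvol0 to testvol9 by default.
testvol_names() {
    prefix=${1:-storage/chris/testvol}
    count=${2:-10}
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "${prefix}${i}"
        i=$((i + 1))
    done
}

# Dry run unless RUN=1 is set: print the zfs commands rather than run them.
# With RUN=1 the volumes and everything on them are destroyed.
destroy_testvols() {
    testvol_names "$@" | while read -r vol; do
        if [ "${RUN:-0}" = "1" ]; then
            zfs destroy "$vol"
        else
            echo "zfs destroy $vol"
        fi
    done
}
```

Run destroy_testvols on its own first and check the list before repeating it with RUN=1.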

Now write the names of the devices you wish to test into a file after the key “DEVICE=”:

# echo DEVICE= /dev/zvol/rdsk/storage/chris/testvol\* > test_opts

Now start the test. When you installed diskomizer it put the standard option files on the system and it has a search path so that it can find them. I'm using the options file “background” which will make the test go into the background, redirecting the output into a file called “stdout” and any errors into a file called “stderr”:

# /opt/SUNWstc-diskomizer/bin/diskomizer -f test_opts -f background

If Diskomizer has any problems with the configuration it will report them and exit. This is to minimize the risk to your data from a typo. Also, the default is to open devices and files exclusively, again to reduce the danger to your data (and to reduce false positives where it detects data corruption).

Once up and running it will report its progress for each process in the output file:

# tail -5 stdout
PID 1152: INFO /dev/zvol/rdsk/storage/chris/testvol7 (zvol0:a)2 write times (0.000,0.049,6.068) 100%
PID 1152: INFO /dev/zvol/rdsk/storage/chris/testvol1 (zvol0:a) write times (0.000,0.027,6.240) 100%
PID 1152: INFO /dev/zvol/rdsk/storage/chris/testvol7 (zvol0:a) read times (0.000,1.593,6.918) 100%
PID 1154: INFO /dev/zvol/rdsk/storage/chris/testvol9 (zvol0:a) write times (0.000,0.070,6.158)  79%
PID 1151: INFO /dev/zvol/rdsk/storage/chris/testvol0 (zvol0:a) read times (0.000,0.976,7.523) 100%
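
With many devices the output file grows quickly, so it can help to boil it down. Here is a minimal sketch that keeps the largest third number seen for each device and operation. It rests on an assumption: I am treating the three numbers in brackets as minimum, mean and maximum times, which the output itself does not say. The name slowest_io is hypothetical and not part of Diskomizer:

```shell
# Hypothetical helper: summarise "read times"/"write times" INFO lines.
# Assumption: the bracketed triple is (minimum, mean, maximum).
slowest_io() {
    awk '$3 == "INFO" && $7 == "times" {
        t = $8
        gsub(/[()]/, "", t)      # "(0.000,0.049,6.068)" becomes "0.000,0.049,6.068"
        split(t, v, ",")
        key = $4 " " $6          # device path and operation
        if (!(key in max) || v[3] + 0 > max[key])
            max[key] = v[3] + 0
    }
    END {
        for (key in max)
            printf "%s %.3f\n", key, max[key]
    }' "$@" | sort
}
```

Use it as "slowest_io stdout" in the directory where the test was started.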

Meanwhile all the usual tools can be used to view the IO:

# zpool iostat 5 5                                                  
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage      460G  15.9T    832  4.28K  6.49M  31.2M
storage      460G  15.9T  3.22K  9.86K  25.8M  77.2M
storage      460G  15.9T  3.77K  6.04K  30.1M  46.8M
storage      460G  15.9T  2.90K  11.7K  23.2M  91.4M
storage      460G  15.9T  3.63K  5.86K  29.1M  45.7M

1Full may be an exaggeration but we will never know thanks to the fact that the data loss was silent. There were enough cases reported where there was reason to doubt whether the data was good to keep me busy.

2The fact that all the zvols have the same name (zvol0:a) is bug 6851545 found with diskomizer.

Sunday Jun 14, 2009

Passing the Baton

When I was about 16 I used to ride around Surrey and Sussex a lot on my bike. Sometimes on my own and sometimes with friends. It allowed me to miss the Wedding of Lady Di and Prince Charles by going camping the entire weekend; little did I realise I would get complete symmetry and miss her death as well.

On one of these cycling trips I was on my own, tired and struggling up a hill out of Horsham (I assume I was returning from Worthing as I used to cycle down there quite often and since I was on my own I must have been visiting my Great Aunt who lived there). As I struggled up the hill a cyclist who was much older than me (probably in his forties) pulled alongside and asked how I was. I replied something like “knackered” at which point he put his hand on my back and pushed me up the hill. I was both thankful for his kindness and appalled that an “old git” was pushing me, a teenager, up the hill.

Well today I managed to pass the baton on. A teenager came out with us on our ride today and on the way back from the cafe blew quite spectacularly. He was not able to keep up on the flat, even in the tow of the other riders, unless we slowed to a crawl, which we did. Then we got to a hill and I saw my chance. I realised I had been waiting 28 years (or so) to pay this favour back and so I pushed him up the hill.


This is the old blog of Chris Gerhard. It has mostly moved to

