Thursday Apr 13, 2006

Last Post

This is my last day at Sun. Further blogging and Solaris hacks/tips can be found at super-user.org

Cheers,
GaryL.

Friday Mar 31, 2006

Mounting an iPod on Solaris

This worked for me... I'm using Solaris 11 / Nevada, build 36:
# uname -a
SunOS dhcp-egmp02-36-140 5.11 snv_36 i86pc i386 i86pc
I plug my iPod into the USB port of my laptop (I've not tried it over FireWire) and type devfsadm at the root prompt:
# devfsadm
# 
I then use rmformat -l to see if the device is recognised
# rmformat -l
Looking for devices...
     1. Volmgt Node: /vol/dev/aliases/cdrom0
        Logical Node: /dev/rdsk/c1t0d0s2
        Physical Node: /pci@0,0/pci-ide@1f,1/ide@1/sd@0,0
        Connected Device: TOSHIBA  DVD-ROM SD-C2612 1F27
        Device Type: DVD Reader
        Bus: IDE
        Size: 
        Label: 
        Access permissions: 
     2. Logical Node: /dev/rdsk/c5t0d0s2
        Physical Node: /pci@0,0/pci10cf,11ab@1d,7/storage@1/disk@0,0
        Connected Device: Apple    iPod             1.53
        Device Type: Removable
        Bus: USB
        Size: 19.1 GB
        Label: 
        Access permissions: Medium is not write protected.
     3. Logical Node: /dev/rdsk/c5t0d0p0
        Physical Node: /pci@0,0/pci10cf,11ab@1d,7/storage@1/disk@0,0
        Connected Device: Apple    iPod             1.53
        Device Type: Removable
        Bus: USB
        Size: 19.1 GB
        Label: 
        Access permissions: Medium is not write protected.
Now, since the iPod is formatted as PCFS, I need the following magic command; the :c suffix names the first PCFS logical drive on the device.
# mount -F pcfs /dev/dsk/c5t0d0p0:c /a
I can now see the iPod disk
# cd /a
# ls
Calendars     Contacts      iPod_Control  Notes
And I can make a file (timing it with timex):
# timex mkfile 100m ipod-disk1-100m

real     1:15.47
user        0.01
sys         5.64

# ls -l
total 204832
drwxrwxrwx   1 root     root        4096 Jan  1  1970 Calendars
drwxrwxrwx   1 root     root        4096 Jan  1  1970 Contacts
-rwxrwxrwx   1 root     root     104857600 Mar 31 11:53 ipod-disk1-100m
drwxrwxrwx   1 root     root        4096 Jan  1  1970 iPod_Control
drwxrwxrwx   1 root     root        4096 Jan  1  1970 Notes
# 
Cool!
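One tip: before unplugging the iPod, change out of the mount point and unmount it, or you risk a corrupted filesystem:

# cd /
# umount /a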

Tuesday Mar 28, 2006

Persistent Resource Controls in S10

In a previous blog entry, I used prctl to change a resource limit on a project-wide basis. It turns out that this change is only temporary and will be lost on reboot. For persistent resource changes it seems we still need to use the projmod command (or edit the /etc/project file by hand). Initially, my project file looks like this:-
bash-3.00# cat /etc/project
system:0::::
user.root:1::::
noproject:2::::
default:3::::
group.staff:10::::
user.oracle:11::::

Since there is no attribute recorded against user.oracle, my shared memory limit will be reset on reboot, which is not what we want. To make the change permanent, we use the projmod command like so:

#  projmod -s -K "project.max-shm-memory=(priv,4gb,deny)" user.oracle    
# cat /etc/project
system:0::::
user.root:1::::
noproject:2::::
default:3::::
group.staff:10::::
user.oracle:11::::project.max-shm-memory=(priv,4294967296,deny)

# bc
4*1024*1024*1024
4294967296
If you want to edit /etc/project by hand, you'll need to enter the value as a plain decimal number of bytes; it won't accept 4gb (at least not on my system, I tried). Changes made via /etc/project are only seen on reboot. To change the value dynamically, use:
prctl -n project.max-shm-memory -r -v 4gb -i project user.oracle
Then you will see the results immediately. When issuing the prctl command above, at least one process (e.g. a shell) needs to be running in the user.oracle project; the simplest way to arrange this is to log in to the machine as oracle in another terminal.
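As a sanity check, you can confirm that the dynamic value and the persistent record now agree (with oracle logged in, as above):

# prctl -n project.max-shm-memory -i project user.oracle
# grep user.oracle /etc/project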

Thursday Feb 09, 2006

UK:TV tonight - Solaris powers the universe

Apparently, tonight's Horizon (Thursday 9th February, 9pm, BBC2) will feature research work and visualisations processed by the Sun Grid on behalf of Durham University.

Friday Jan 27, 2006

Scott's Photos

Scott Macdonald has created a small gallery with some of his photography. The macro stuff is particularly good IMHO. He's going to update the site with comments and annotations to the photos soon. Looks like Adrian finally has some competition.

Thursday Jan 26, 2006

Don't bogart that file my friend...

I spent yesterday at the Sun office in the City of London at a sort of open day for our customers. We were demonstrating the new features in Solaris 10, and someone asked us how they could detect that a user had *attempted* to delete a file (the same holds true for read, write etc.). So, even though the attempt to delete the file will fail due to permissions (either legacy or RBAC), they wanted to know that it had been attempted. Such a feat *is* achievable using auditing (aka BSM), but it's more fun, and more flexible, to do it from DTrace. In the script below, we log a message to the messages file and, for fun, kill the process! I'm no expert in DTrace, but it was pretty simple thanks in large part to Chris' blog earlier this month. Anyhow, the interesting thing was that the request from the customer was pretty random, but on the spot we were able to tell them how to achieve their aim with a few lines of 'D'. In the example below, the file is /tmp/fred.
#!/usr/sbin/dtrace -s

#pragma D option destructive
#pragma D option quiet

syscall::unlink:entry
/ (copyinstr(arg0) == "fred" && cwd == "/tmp") ||
  copyinstr(arg0) == "/tmp/fred" /
{
        self->prot = 1;
        self->path = copyinstr(arg0);
        raise(9);               /* for fun: kill the offending process */
}

syscall::unlink:return
/ self->prot == 1 /
{
        system("logger -p user.err Deletion attempted of %s by user %d",
            self->path, uid);
        self->prot = 0;
}
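To try it out (the file name unlink-watch.d is just my name for it), run the script as root in one terminal and attempt the deletion as an ordinary user in another:

# chmod +x unlink-watch.d
# ./unlink-watch.d

Meanwhile, in the other terminal:

$ rm /tmp/fred
Killed

The attempt is logged via syslog, so with the default syslog.conf it should show up in /var/adm/messages.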

Monday Jan 23, 2006

Converting a ZFS pool to be mirrored

So, the ZFS syntax is quite different to that of SVM, which can lead to confusion. Ben Rockwood does a good job of explaining the difference, but does not show how to convert an un-mirrored ZFS pool into a mirrored one. So, here's how to do it.

o We start with a pool called realzfs (because it's made out of real devices rather than files)
# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
realzfs                 544G   1.17G    543G     0%  ONLINE     -

o We can see that it is made up of 4 disks
# zpool status
  pool: realzfs
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        realzfs     ONLINE       0     0     0
          c3t0d0    ONLINE       0     0     0
          c3t1d0    ONLINE       0     0     0
          c3t2d0    ONLINE       0     0     0
          c3t5d0    ONLINE       0     0     0


o The correct way is to attach a new device to each existing vdev, e.g.

# zpool attach -f realzfs c3t0d0 c3t8d0


# zpool status
  pool: realzfs
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 99.99% done, 0h0m to go
config:

        NAME        STATE     READ WRITE CKSUM
        realzfs     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t8d0  ONLINE       0     0     0  178.3 resilvered
          c3t1d0    ONLINE       0     0     0
          c3t2d0    ONLINE       0     0     0
          c3t5d0    ONLINE       0     0     0

# zpool attach -f realzfs c3t1d0 c3t9d0
# zpool attach -f realzfs c3t2d0 c3t10d0
# zpool attach -f realzfs c3t5d0 c3t11d0
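As an aside, with more disks the attach sequence is easily scripted; a minimal sketch, assuming the same existing/new device pairings as above:

#!/bin/sh
# Attach a new disk to each existing single-disk vdev,
# converting each one into a two-way mirror.
for pair in "c3t0d0 c3t8d0" "c3t1d0 c3t9d0" "c3t2d0 c3t10d0" "c3t5d0 c3t11d0"
do
        zpool attach -f realzfs $pair   # $pair expands to "existing new"
done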

o Finally, once all four attaches are done, we see all our vdevs mirrored.

# zpool status
  pool: realzfs
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Jan 23 15:26:16 2006
config:

        NAME         STATE     READ WRITE CKSUM
        realzfs      ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t0d0   ONLINE       0     0     0
            c3t8d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t1d0   ONLINE       0     0     0
            c3t9d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t2d0   ONLINE       0     0     0
            c3t10d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t5d0   ONLINE       0     0     0
            c3t11d0  ONLINE       0     0     0

o The WRONG way to do it is as follows:-

# zpool add -f realzfs mirror  c3t8d0  c3t9d0 c3t10d0 c3t11d0


# zpool status
  pool: realzfs
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Jan 23 15:26:16 2006
config:

        NAME         STATE     READ WRITE CKSUM
        realzfs      ONLINE       0     0     0
          c3t0d0     ONLINE       0     0     0
          c3t1d0     ONLINE       0     0     0
          c3t2d0     ONLINE       0     0     0
          c3t5d0     ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t8d0   ONLINE       0     0     0
            c3t9d0   ONLINE       0     0     0
            c3t10d0  ONLINE       0     0     0
            c3t11d0  ONLINE       0     0     0

This gives us 4 single-disk vdevs plus one 4-way mirrored vdev, NOT the 4 two-way mirrored vdevs we actually wanted.

Wednesday Jan 04, 2006

Allow dtrace for a regular user (RBAC)

Here's the magic command to allow a regular user to run dtrace; ideal for your own laptop/workstation. The username here is garyli:
# usermod -K defaultpriv=basic,dtrace_kernel,dtrace_proc,dtrace_user garyli
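The change applies from the user's next login; once garyli logs back in, a quick check with ppriv should show the dtrace privileges in the shell's privilege sets:

$ ppriv $$ | grep dtrace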

Monday Nov 21, 2005

Living in the future.

Listening to the radio via the interweb really feels like living in the future; here are some of my favourites. Both of these come under the heading of deep house. First, Digitally Imported, which has a pretty slick website. Second, one I found by accident, the wonderfully named Deepmix.ru. Both are playable via iTunes.

Friday Jul 22, 2005

Detecting data/file corruption

Sometimes I get escalations that go along the lines of '...I moved this application data from machine fred to machine bob and now the application won't read it. What's happened?'
Trying to debug the problem from the application down is probably going to be quite long-winded. So, my first action is to verify that the file is actually the same on both machines, i.e. did it get corrupted in the transfer? If it did, then we can forget the application-layer stuff and concentrate on the method of transfer. It seems obvious when you think of it, but sometimes in the heat of the moment the simplest things get forgotten. What follows are some examples of how to use standard Solaris tools to detect data corruption.

For a long time we've had binaries that generate a checksum against a file - which is a simple way to tell if the source and destination copies are the same. There are sum, cksum and now, in S10, digest. We also have cmp, which will do a byte-for-byte comparison of two files.
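digest supports a number of algorithms beyond md5; digest -l lists the ones available on your system:

# digest -l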

Examples

All of these tools can be used on regular files and raw devices.

Copy a raw disk slice to an image file using dd:

# dd if=/dev/rdsk/c0t0d0s3 of=/var/tmp/c0t0d0s3.img bs=1024k
41+1 records in
41+1 records out

Now we can use the comparison tools; they should all come back identical or
clean. Remember that cmp gives no output for a matching pair of files. For sum
and cksum, the first column is the checksum and the second column the size.

# cmp /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img

# sum  /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
28918 85050 /dev/rdsk/c0t0d0s3
28918 85050 /var/tmp/c0t0d0s3.img

# cksum  /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
3185788260      43545600        /dev/rdsk/c0t0d0s3
3185788260      43545600        /var/tmp/c0t0d0s3.img

# digest -a md5  /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
(/dev/rdsk/c0t0d0s3) = 0616a55e0a4e30ecf49c974f23a56255
(/var/tmp/c0t0d0s3.img) = 0616a55e0a4e30ecf49c974f23a56255

To show what happens when a file is corrupted, we will write a single byte to
the front of the file, which is currently all zeros.

The current contents of the first 10 bytes of the file (offsets are in octal)
# od -x -N 10 /var/tmp/c0t0d0s3.img
0000000 0000 0000 0000 0000 0000
0000012

Now we write the first byte of /etc/hosts (any file would do) to the front of
the image file, to simulate corruption.
# dd if=/etc/hosts of=/var/tmp/c0t0d0s3.img bs=1 count=1 conv=notrunc

We now see that the file has changed by one byte.
# od -x -N 10 /var/tmp/c0t0d0s3.img
0000000 3100 0000 0000 0000 0000
0000012

Now we will re-run the comparison commands to see what is shown for a
corrupted file.

# cmp /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
/dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img differ: char 1, line 1

# sum /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
28918 85050 /dev/rdsk/c0t0d0s3
28967 85050 /var/tmp/c0t0d0s3.img

# cksum /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
3185788260      43545600        /dev/rdsk/c0t0d0s3
1666608083      43545600        /var/tmp/c0t0d0s3.img

Again, note that for cksum and sum the second column is identical in the
original and corrupt versions, since we have not changed the file length.

Timings below compare two identical copies of the file on a single-disk
Ultra 10 running Solaris 10. The timings are dominated by waiting for IO.

# timex cmp /dev/dsk/c0t0d0s3 c0t0d0s3.img.bak

real       12.83
user        4.86
sys         1.31

# timex sum /dev/dsk/c0t0d0s3 c0t0d0s3.img.bak
28918 85050 /dev/dsk/c0t0d0s3
28918 85050 c0t0d0s3.img.bak

real       15.17
user        3.89
sys         1.15

# timex cksum /dev/dsk/c0t0d0s3 c0t0d0s3.img.bak
3185788260      43545600        /dev/dsk/c0t0d0s3
3185788260      43545600        c0t0d0s3.img.bak

real       14.57
user        2.73
sys         1.33

# timex digest -a md5  /dev/dsk/c0t0d0s3 c0t0d0s3.img.bak
(/dev/dsk/c0t0d0s3) = 0616a55e0a4e30ecf49c974f23a56255
(c0t0d0s3.img.bak) = 0616a55e0a4e30ecf49c974f23a56255

real       15.82
user        4.07
sys         1.68
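Putting this together, here's a tiny wrapper (ckcmp is just a name I made up, and I'm assuming digest prints only the hash when given a single file operand) to compare two files or raw devices:

#!/bin/sh
# ckcmp: compare the md5 digests of two files or raw devices.
# Usage: ckcmp file1 file2
[ $# -eq 2 ] || { echo "usage: ckcmp file1 file2" >&2; exit 2; }
a=`digest -a md5 "$1"` || exit 2
b=`digest -a md5 "$2"` || exit 2
if [ "$a" = "$b" ]; then
        echo MATCH
else
        echo MISMATCH
        exit 1
fi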

Tuesday Jul 05, 2005

A simple way to increase shared memory in Solaris 10

Short Version

To increase the shared memory available to a given user on Solaris 10.
  • Find out which project the user is in
  • Use prctl to raise the limit e.g. to 200mb, using the project ID returned by id -p.
arches $ id -p
uid=90712(garyli) gid=10(staff) projid=10(group.staff)
arches $ su
Password: 
# prctl -n project.max-shm-memory -r -v 200mb -i project 10

Long Version

By default, the maximum amount of shared memory that a process can use is around 25% of physical memory. If you try to create a shared memory segment larger than the allowable limit, you will see an error in the messages file, and the shmget system call will fail with EINVAL.
Jul  4 17:51:53 arches genunix: [ID 883052 kern.notice] privileged rctl project.max-shm-memory (value 195078144) exceeded by project 10
For instance, on arches we have only 512Mb
SunOS arches 5.10 s10_43 sun4u sparc SUNW,Ultra-5_10
arches $ prtdiag | head
System Configuration:  Sun Microsystems  sun4u Sun Ultra 5/10 UPA/PCI (UltraSPAR
C-IIi 300MHz)
System clock frequency: 100 MHz
Memory size: 512 Megabytes
And we can see what the default maximum shared memory segment will be by using prctl:
arches $ prctl -n project.max-shm-memory  -i project 10        

25758:  prctl -n project.max-shm-memory -i project 10
project.max-shm-memory                   [ no-basic deny ]
                    128100352 privileged deny           
         18446744073709551615 system     deny           [ max ]
	 
arches $ bc
128100352/(1024*1024)
122
	 
In the above case we have a maximum of 128100352 bytes (122Mb) that we can allocate using shmget()/shmat().

We can now demonstrate that this is the case by trying to allocate first 122Mb, then 123Mb of shared memory. The program shm_var takes a single argument: the size in Mb of the shared memory segment we want to create.

arches $ ./shm_var 122
Attempting attach of 122  Mb shm base address = F7000000 shmid = 5 shmat time = 1 sec

arches $ ./shm_var 123
Attempting attach of 123 Mb shm base address = FFFFFFFF shmid = FFFFFFFF shmat time = 0 sec
In the above example, the attach fails because shmget() returned -1 after it failed to create the shared segment we asked for (the FFFFFFFF values above are that -1). Using truss, we see shmget fail...
shmget(25851, 128974971, 0777|IPC_CREAT)        Err#22 EINVAL
So, how to change all this? It's actually quite simple, and can be done on the fly. IMHO the prctl command doesn't do us any favours with what looks to me like an overly complex syntax. However, here's a cook-book approach.

Firstly, because the shared memory resource is controlled on a project basis, we need to know which project to adjust. In the simple case, the project to change will be the project that the user belongs to. So, in the case of an Oracle install, su to oracle and issue id -p. Unless you have changed things manually, the project will be 3 (default). However, in the example below, my project ID is based on my group ID, so my project ID is 10. Your project ID can simply be found by issuing id -p.

arches $ id -p
uid=90712(garyli) gid=10(staff) projid=10(group.staff)
Then we issue the magic prctl command to raise the value
# prctl -n project.max-shm-memory -r -v 200mb -i project 10
We can now allocate 200mb, but NOT 201mb
roxy $ ./shm_var 200
Attempting attach of 200 Mb shm base address = F2800000 shmid = 2 shmat time = 1 sec

roxy $ ./shm_var 201
Attempting attach of 201 Mb shm base address = FFFFFFFF shmid = FFFFFFFF shmat time = 0 sec
Interestingly, the limit is cumulative across all of a project's segments, and so does away with the confusing shmmax, shmseg etc. tunables.
roxy $ ./shm_var 100
Attempting attach of 100 Mb shm base address = F8C00000 shmid = 3 shmat time = 0 sec
^Croxy $ ./shm_var 100
Attempting attach of 100 Mb shm base address = F8C00000 shmid = 4 shmat time = 1 sec
^Croxy $ ./shm_var 1
Attempting attach of 1 Mb shm base address = FFFFFFFF shmid = FFFFFFFF shmat time = 0 sec
Note that in the above test we did not do a shmdt() between each run of shm_var, and so in ipcs -a we see 200Mb of shared memory across two segments:
IPC status from  as of Mon Jul  4 17:59:50 BST 2005
T         ID      KEY        MODE        OWNER    GROUP  CREATOR   CGROUP CBYTES  QNUM QBYTES LSPID LRPID   STIME    RTIME    CTIME 
Message Queues:
T         ID      KEY        MODE        OWNER    GROUP  CREATOR   CGROUP NATTCH      SEGSZ  CPID  LPID   ATIME    DTIME    CTIME 
Shared Memory:
m          4   0x2f8b     --rw-rw-rw-   garyli    staff   garyli    staff      0  104857600 12171 12171 17:58:16 17:58:18 17:58:15
m          3   0x2f86     --rw-rw-rw-   garyli    staff   garyli    staff      0  104857600 12166 12166 17:58:07 17:58:12 17:58:07
m          1   0x43cb9a88 --rw-r-----   oracle      dba   oracle      dba      6   46235648   791   809 13:25:22 13:25:53 13:25:14
T         ID      KEY        MODE        OWNER    GROUP  CREATOR   CGROUP NSEMS   OTIME    CTIME 
Semaphores:
s          5   0xe49024ec --ra-r-----   oracle      dba   oracle      dba    39 17:56:57 13:25:14
s          1   0x71000b51 --ra-ra-ra-     root     root     root     root     1 13:18:56 13:18:34
s          0   0x187cf    --ra-ra-ra-     root      sys     root      sys     1 13:17:55 13:17:54
roxy $ 
Notice also that Oracle has 46Mb that is not affected by our allocation (or vice versa)
# su - oracle
Sun Microsystems Inc.   SunOS 5.10      Generic January 2005
-bash-3.00$ id -p
uid=101(oracle) gid=1001(dba) projid=11(user.oracle)
-bash-3.00$ prctl -n project.max-shm-memory -i project user.oracle
project: 11: user.oracle
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
project.max-shm-memory
        privileged       186MB      -   deny                                 -
        system          16.0EB    max   deny     
Notice also that I used the project name user.oracle rather than the project ID, although -i project 11 would have achieved the same thing.
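One caveat, covered in more detail in the 'Persistent Resource Controls in S10' entry above: the prctl change is lost on reboot. To make it stick, record it against the project with projmod, e.g.:

# projmod -s -K "project.max-shm-memory=(priv,200mb,deny)" group.staff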

Thursday Jun 23, 2005

Erin's portrait of me on Father's Day

Erin, my daughter (5), took this photo of me with her camera on Father's Day. We're having a bit of a heatwave in the UK at the moment, which explains the reckless shirtlessness displayed here. I've not cropped or otherwise edited the picture, and I love the kids-eye view of the world that her pictures give.

g.

Friday Jun 10, 2005

fsync performance

Not the most exciting title I know, but here goes anyway... OK, so fsync() is used to ensure that dirty pages that have been written to a file actually go down to disk. The same sort of thing can be done by opening the file with one of the O_SYNC options. fsync(), however, allows greater flexibility, since the programmer can specify when the synchronisation to disk takes place - perhaps in a separate thread. Anyhow, generally fsync() is goodness - and 'cheap', since it only syncs the data that is dirty. So far so good. However, there is (or rather was) a subtle problem that shows up when very large files are mapped into the memory of systems with reasonable amounts of memory. The problem is not to do with large memory systems as such, just that you need a lot of memory to really cache a large file. The problem is that the file's list of cached pages is searched linearly, from the first page that is mapped in right through to the last, and this can take quite a lot of time. Given a big enough file and a big enough physical memory, the time taken can be measured in seconds (yes, really!). Since many developers think of fsync() as a 'free' system call, it is often called quite indiscriminately, and so fsync() performance can make a BIG difference (see the test results below for a pathological case).

The good news is that this behaviour is changed in Solaris 9 (I think of this as the version of Solaris designed for large systems like the 12K/15K/25K StarCat range) so that all the dirty pages are put at the head of the list, and we need only search the list until we find the first non-dirty page. This is logged as CR 4336082, fixed in patch 112233-09 and later.

So how can you tell if your application is suffering from this problem? I would use truss -c against a running process and see if fdsync()* is using an appreciable amount of CPU time. Note that this is 'real' CPU time spent examining pages, rather than time spent sleeping while the actual pages are written to disk.

In the experiments below, I used Solaris 9 with the latest recommended patch set and the same file on UFS and VxFS on the same system; the results are quite dramatic. VxFS at the time of writing does not incorporate the fixes that are in UFS, so it serves as a good counter-example to the speed of UFS. The test was performed on a 16Gb file on a machine with 16Gb of RAM.

The initial run shows a fast time, since there are few pages mapped into the page cache. After the first run, we read in the file using dd with a block size above the threshold at which VxFS does 'discovered direct IO', which (as you would expect) does not populate the cache.
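If you want to reproduce this, a 16Gb test file of the right size can be created with mkfile (assuming enough free space in the filesystem; the name testfile matches the runs below):

# mkfile 16g testfile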
Firstly we run the test program (it reads then writes 100 blocks at a size of 4Kb):

# timex /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open  testfile blocks blocksize = 4096

real        0.91
user        0.00
sys         0.10
Then we read in the entire file in blocks of 4Mb:
# dd if=./testfile of=/dev/null bs=4096k
4000+1 records in
4000+1 records out
We see that the performance has not changed: we didn't populate the cache, because we read the file at a block size that triggers VxFS discovered direct IO.
# timex /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open  testfile blocksize = 4096

real        0.92
user        0.00
sys         0.07
Next we read the file in blocks of 4Kb (rather than 4Mb), which this time does populate the cache.
# dd if=./testfile of=/dev/null bs=4096
4096001+0 records in
4096001+0 records out
Now we see the problem. Note that we are still only writing 100 blocks of 4K, as we always have been. Note also that the time is accurately attributed to sys, as expected.
#  timex /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open  testfile blocksize = 4096

real     6:47.77
user        0.00
sys      6:45.85
Using truss -c we can see where the time has gone.
# truss -c /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open  testfile
The size of the file is -402649088 bytes -98303 application blocks blocksize = 4096
syscall               seconds   calls  errors
_exit                    .000       1
read                     .012     101
write                    .018     106
open                     .000       5      1
close                    .000       4
brk                      .000       2
stat                     .000       4
lseek                    .007     202
fstat                    .000       3
ioctl                    .000       1
fdsync                403.329     101  <--- Here's where all the time went, in fsync() as expected
execve                   .000       1
getcontext               .000       1
evsys                    .000       1
evtrapret                .000       1
mmap                     .000      11
munmap                   .000       1
getrlimit                .000       1
memcntl                  .000       1
resolvepath              .000       5
                     --------  ------   ----
sys totals:           403.371     553      1
usr time:                .006
elapsed:              404.210
In the above truss -c output, each fdsync() takes around 4 SECONDS to complete. Now, on UFS, having used dd to read the file into the cache:
# dd if=testfile of=/dev/null bs=4096

4096001+0 records in
4096001+0 records out

# timex /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open  testfile blocksize = 4096

real        1.33
user        0.00
sys         0.06
#  truss -c  /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open  testfile blocksize = 4096
syscall               seconds   calls  errors
_exit                    .000       1
read                     .005     101
write                    .006     106
open                     .000       5      1
close                    .000       4
brk                      .000       2
stat                     .000       4
lseek                    .006     202
fstat                    .000       3
ioctl                    .000       1
fdsync                   .094     101
execve                   .000       1
getcontext               .000       1
evsys                    .000       1
evtrapret                .000       1
mmap                     .000      11
munmap                   .000       1
getrlimit                .000       1
memcntl                  .000       1
resolvepath              .000       5
                     --------  ------   ----
sys totals:              .115     553      1
usr time:                .005
elapsed:                1.370
The fsync() calls above are still super-quick, despite a lot of the file being cached in RAM.

*fsync() is the call made from C, but it shows up in truss as fdsync().

I tried reducing discovered_direct_iosz to 2k, and it seems to follow that if you allow VxFS to do direct IO then the fsync issue is not hit. However, on the read side you then get no benefit from caching, whereas on UFS you get both cached data and fast fsyncs.

Monday Jun 06, 2005

Sun Studio 10 - Performance tools

We now have online documentation about some very useful tools for performance analysis included in the developer tool suite (Sun Studio). I will try to blog some examples in the near future: http://docs.sun.com/app/docs/doc/819-0493

Tuesday May 03, 2005

SunLabs on the Reg

The Register has a story about a recent Sun Labs open day.