bufhwm on large systems
By clive on Jan 04, 2008
I was asked yesterday to look at a busy system with high system time. It's Solaris 9 on a big-config 25K. This output is the top of the lockstat -C -s 50 output:
-------------------------------------------------------------------------------
Count indv cuml rcnt     spin Lock                   Hottest Caller
132614  59%  59% 1.00     199 blist_lock             bio_recycle+0x224

      spin   ------ Time Distribution ------ count     Stack
         1 |                                     186   bio_recycle+0x224
         2 |                                    2335   bio_getfreeblk+0x4
         4 |                                    4247   getblk_common+0x2bc
         8 |@                                   7190   bread_common+0x80
        16 |@@                                 11570   bmap_read+0x20c
        32 |@@@@                               18285   ufs_directio_read+0x2e
        64 |@@@@@                              25634   rdip+0x198
       128 |@@@@@@                             28613   ufs_read+0x17c
       256 |@@@@@                              22918   pread+0x28c
       512 |@@                                  9707
      1024 |                                    1761
      2048 |                                     157
      4096 |                                      11
A bit of Solaris code reading led me from the stack above to question the value of bufhwm. I checked it again on docs.sun.com to really understand what this value does. It's the high-water mark, in kilobytes, on the size of the allocated buffers used for UFS indirect blocks, directories and other bits of metadata.
I went back to check some basic assumptions (always a good plan) and did an Explorer review. The following line is set in /etc/system:

set bufhwm=8000
I have no idea why it was set to 8000 on this system. I have seen it set many times on many systems and never paid much attention to it, here or elsewhere. 8000 is proposed in many places as a reasonable value. I must admit I have never needed to suggest tuning it; my unconscious mind just assumed it was a good idea because common wisdom said so, and I never commented when other people tuned it.
By default this value would be 2% of memory. This system had more than 200 GB, which would give a default of around 4 GB. I expect 4 GB would waste some memory, but then it is a high-water mark. 8000 KB (about 8 MB) is far too small on a server of this size, given that the buffer cache is used to store indirect blocks, directories, etc. from a set of filesystems totalling nearly 2 TB!
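The arithmetic is worth spelling out. A back-of-the-envelope sketch in shell; the 200 GB figure is the approximate memory of the machine above, and the 2% rule is the documented default:

```shell
# Sketch of the bufhwm default: 2% of physical memory, expressed in KB.
# 200 GB is the approximate memory size of the 25K discussed above.
phys_mem_kb=$((200 * 1024 * 1024))          # 200 GB in KB
default_bufhwm=$((phys_mem_kb * 2 / 100))   # the 2% default
echo "default bufhwm = ${default_bufhwm} KB"   # 4194304 KB, i.e. 4 GB
echo "tuned value    = 8000 KB"                # roughly 8 MB: ~500x smaller
```

So the /etc/system setting was handing the buffer cache about one five-hundredth of what the default would have given it.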
We can observe whether buffer recycling is causing an issue using the following:
echo "bfreelist$ buf" | mdb -k echo "v::print -t struct var" | mdb -k kstat -p -n biostats
and sar -b might also give some insight.
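To put a number on it, the biostats counters can be turned into a hit rate. A sketch with canned sample values standing in for live kstat -p -n biostats output; the statistic names buffer_cache_hits and buffer_cache_lookups are my assumption from memory, so check them against the real kstat output on your release first:

```shell
# Sketch: derive a buffer cache hit rate from kstat -p -n biostats output.
# Sample values stand in for live output here; on a real system replace the
# variable with:  kstat_output=$(kstat -p -n biostats)
kstat_output='unix:0:biostats:buffer_cache_hits	9500
unix:0:biostats:buffer_cache_lookups	10000'

hit_rate=$(printf '%s\n' "$kstat_output" | awk '
    /buffer_cache_hits/    { hits = $2 }      # cache satisfied the request
    /buffer_cache_lookups/ { lookups = $2 }   # total requests
    END { if (lookups > 0) printf "%.1f", 100 * hits / lookups }')
echo "buffer cache hit rate: ${hit_rate}%"    # 95.0% with the sample numbers
```

A persistently low hit rate on a metadata-heavy UFS workload would point the same way as the lockstat output above: the cache is too small and buffers are being recycled.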
So the morals to repeat to myself include:
- Turn off your unconscious mind when examining /etc/system. Don't assume any /etc/system setting is valid
- Never carry /etc/system tunables forward
- Put a comment in /etc/system if you set a value based on an attribute, like memory size, that has the potential to change, citing the assumption
Various customers I have visited over the years put comments of the following form in /etc/system:
# firstname.lastname@example.org 4/1/2008
# bufhwm value of 8000 assumes a memory size of 4gb and 600GB of UFS
# filesystem. revisit if size changes
# Check with kstat -p -n biostats before changing
set bufhwm=8000
At least if something goes wrong, then I can be emailed in capital letters.