Wednesday Jul 29, 2009

Why large ISM pages are not as large as I expected.

I was pondering why a large SGA segment was made up of 4M pages rather than 256M pages and decided to experiment. A simple-as-can-be bit of code to create an ISM segment:
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int sz;
  int sid;
  void *a;

  sz = atoi(argv[1]);

  /* 0600 added so the owner can attach the segment; IPC_CREAT alone creates it with no permissions */
  if ((sid = shmget(getpid(), sz * (1024 * 1024), IPC_CREAT | 0600)) == -1)
    {
      perror("shmget failed");
      exit(1);
    }

  /* shmat returns (void *)-1, not -1, on failure */
  if ((a = shmat(sid, (void *)0, SHM_SHARE_MMU)) == (void *)-1)
    {
      perror("shmat failed");
      exit(1);
    }

  sleep(60);
  return (0);
}

On a system with UltraSPARC IV+ (Panther) CPUs I found that, by default, asking for a 1G ISM segment still produced 4M pages according to pmap -xs. A little kernel code reading showed that the decision is made in map_pgszism, which looks like this:

map_pgszism(caddr_t addr, size_t len)
{
	uint_t szc;
	size_t pgsz;

	for (szc = mmu_page_sizes - 1; szc >= TTE4M; szc--) {
		if (disable_ism_large_pages & (1 << szc))
			continue;

		pgsz = hw_page_array[szc].hp_size;
		if ((len >= pgsz) && IS_P2ALIGNED(addr, pgsz))
			return (pgsz);
	}

	return (DEFAULT_ISM_PAGESIZE);
}

A little poking around with mdb shows the value of disable_ism_large_pages to be 0x36. In the common code it is set to 0x2, so some platform-specific code must be resetting this value. Poking disable_ism_large_pages back to 0x2 with mdb meant the pages for the ISM segment were now 256M in size as reported by pmap. Not recommended as a spur-of-the-moment action for your production E25K running Oracle.
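
For reference, here is a minimal sketch of how the value can be read and (on a test box, at your own risk) poked with mdb; the write-back to 0x2 mirrors the experiment above and is emphatically not a production change:

echo "disable_ism_large_pages/X" | mdb -k      # read the current bitmask of disabled ISM page sizes
echo "disable_ism_large_pages/W 2" | mdb -kw   # poke the common-code default back (test systems only)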

disable_ism_large_pages gets set in hat_init_pagesize as an OR with disable_large_pages, which is set to a shifting and bitmasking perturbation of mmu_exported_pagesize_mask. A few more hops leads to bugid 6313025, which describes why 32M and 256M pages were turned off for the Panther CPU: executing application code from the larger (>4M) pages caused nasty things to happen. The bug is dated 2005 and I had a very distant memory of it, but it was worth tracking down the specifics.

ssd_max_throttle vs IT governance

Chris and I had a short IM exchange yesterday regarding a customer visit I made on Monday; it's a customer we have both worked with a lot over the years. One of the significant contributory factors to the reported problems is that a line of the form
set ssd:ssd_max_throttle=32
is missing from /etc/system across the estate attached to a particular SAN. Common problem, easy diagnosis, etc.
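
A quick, hedged way to confirm what the running kernel is actually using (assuming the ssd driver is loaded so the symbol resolves):

echo "ssd_max_throttle/D" | mdb -k   # print the in-kernel value as a decimal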

I made the comment that a co-worker of ours in a different part of the organization would have picked up on the need to address the underlying IT governance issue.

I liked this definition from Wikipedia:

   Specifying the decision rights and accountability framework to encourage desirable 
   behaviour in the use of IT.
The cause of the cause of the cause of the cause of ssd_max_throttle not being set lay in the structure of the established decision rights and accountability framework.

Still, it is far easier to stick ssd_max_throttle=32 in /etc/system and leave these battles to others this time round. However, an awareness has been sparked.

Thursday Nov 20, 2008

Computer Science Students, donuts and a T5140 SunRay server running OpenSolaris

On a typical Wednesday afternoon in the SunRay lab in the Computer Science Department at the University of Aberystwyth, a Denial of Service attack will at best result in an audience with the Head of Department and at worst exclusion from one's degree scheme. Yesterday afternoon was different in that trying to exclude your peers by panicking or hanging the SunRay server was positively encouraged!

Thanks to those students who turned up and, in exchange for cakes, did "stuff" on the department's new T5140 SunRay server, which was running build 97 of OpenSolaris. The aim of the afternoon was in part educational for the students, in part a load test to identify possible configuration improvements, and in part to see if there were any obvious performance RFEs we could chase down.

Much respect to Dafydd who managed to translate "please just log in and out, repeat until" into running this bit of code in a bash shell script

:(){ :|:& };:

I suspect he got the idea from here; however, it did make the machine hang. The lesson learned is to pay attention to project-based resource controls, and also, if/when we do this again, to be specific that there will be a couple of phases and to ask that the fork bomb/malloc bomb type activity be left until the end of the session.

The kernel tunable maxuprc would have stopped him if he got beyond 29995 processes, but each bash shell takes around 3MB, so we would need around 90GB of available virtual memory before this limit stopped him. A value of 1000 should not stop "normal" activities, but would stop a Dafydd after too much sugar.
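
A minimal sketch of the project-based alternative mentioned above, assuming the students are members of a project called user.students (the project name and the limit of 1000 are illustrative assumptions):

# deny LWP (and hence process) creation beyond 1000 for the project
projmod -s -K "project.max-lwps=(privileged,1000,deny)" user.students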

In a similar vein, Dave Barnard came up with another simple trick:

#include <stdio.h>
#include <stdlib.h>

int main()
{
	int i = 0;
	while(1)
	{
		char *t = (char *)malloc(1024 * 1024);

		if (t == NULL)
		{
			printf("end");
			return 1;
		} else {
			printf("%d MB\n", i);
			i++;
		}
	}
	return 0;
}
and proceeded to leak over 6GB of memory on an 8GB physical + 5GB swap system. While the prospect of the wrath of Prof. Price is typically more effective than resource limits, some probably need to be put in place now that the little dears have a taste for this. rcapd is probably the right way to go here, putting per-user limits in place via projects. In addition, the amount of swap has been doubled and we are going to do some memory usage monitoring to determine if more physical memory might be useful.
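
A minimal sketch of what that might look like, again assuming an illustrative user.students project and a 2GB cap:

projmod -s -K "rcap.max-rss=2GB" user.students   # per-project resident set size cap
rcapadm -E                                       # enable the resource capping daemon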

The T5140 itself was never more than 20% busy in terms of CPU utilization. Memory of various kinds was the main limitation. We also observed that NetBeans 6.1 was slow for interactive use, which needs to be followed up. NetBeans 6.5 is out, so a first step is to see if it has the same problem. We also found that, when under memory pressure, some SunRay sessions would exit, which also needs to be followed up given a bit more time.

Friday Oct 24, 2008

grep c2audit:audit_load /etc/system

I have come across quite a few customers over the last few years who have this line in /etc/system

set c2audit:audit_load = 1

Only one set of administrators knew why it was there, what it did, and how they used the output. The rest came up with a vague "we need it for security and auditing what root does" or "it is part of the standard build". Most admins did not know it was set or why, and a bit of questioning suggests that no one in the organisation has ever looked at the log files or knows what would trigger looking at them.
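
A minimal set of checks to see whether auditing is actually configured, loaded and active on a box you are reviewing:

grep c2audit /etc/system    # is it set in the boot-time configuration?
modinfo | grep c2audit      # is the module loaded?
auditconfig -getcond        # is the kernel actually auditing?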

The impact on performance and scalability is made much worse in Solaris 10 by bug 6388077 (make sure you have at least patch 127127-01, which was released over a year ago), but typically it is not doing what you think it is in terms of useful auditing and acts as an inhibitor to scalability. The more CPUs a system has, the greater the overhead.

lockstat -C -s 50 sleep 10 
can show some very interesting stacks!

Awareness of security is good, but my experience is that this feature has been enabled without consideration of how to use the output or of its impact on performance. In light of this, perhaps bsmunconv is your friend?

Sunday Aug 10, 2008

Sun Ray stress test harness anyone?

I have been asking around, but may not have yet asked in the right place, so here goes with a wider audience!

A University I do some work with wants to load test their Sun Ray setup before going live. They had some performance problems with a lab full of students logged in and want to avoid a repeat when they put in a shiny new T5xx0 series server, so a pre-term-start load test makes some sense.

Anyone got a pointer to a Sun Ray stress test harness or load generator? Comments very welcome.

Thursday Jul 24, 2008

A few days in Edinburgh

I have been working with the MIS people at Edinburgh University and a consultant from Tribal, which has been much fun indeed. I learned a number of things and was reminded of a few more my brain had chosen to put in long-term storage.

  • cron starts non-root processes with a NICE value of 2, hence they will have a lower priority than jobs started on the command line or via SMF. The queuedefs man page explains more, but the syntax is arcane!
  • It is worth snooping traffic to and from the DNS server. It often shows up errors or performance opportunities in nscd.conf and resolv.conf, such as having enable-cache set to no for ipnodes.
  • If any type of network latency is important, such as in the ping-pong of packets between two clients, map out and understand where your firewall(s) sit and benchmark without the firewall to get the scope of the impact. Firewalls are often an invisible (and hard to observe) component, so are often ignored.
  • Turning TCP Nagle off via ndd is well worthwhile if ping-pong latency is a barrier (a sketch follows below).
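
A minimal sketch of the Nagle tweak; tcp_naglim_def is the system-wide knob, while setting TCP_NODELAY per socket in the application is the finer-grained option:

ndd -get /dev/tcp tcp_naglim_def     # check the current setting
ndd -set /dev/tcp tcp_naglim_def 1   # 1 = send segments without Nagle coalescing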

The next race is on Saturday: the Snowdon International Race.

Tuesday May 27, 2008

Using Kernel Crash dumps for Performance Analysis

Kernel crash dumps are a point-in-time snapshot of the Solaris kernel state. The aim is to allow post-mortem analysis of the system state at the point the crash dump was taken. For system panics and hangs, the ability to look at the system state is the primary failure analysis tool and one of the reasons Solaris is as reliable as it is.

I think of system failures as a 2 dimensional problem. The interaction of data and code at the point in time of the failure can be analyzed with tools such as MDB which are designed for this type of post-mortem analysis.

Performance adds the 3rd dimension of time.

Autopsy is not commonly used as a tool for determining the root cause of individual productivity issues. In a small subset of cases, poor individual productivity may be the result of a medical condition requiring a CAT scan (the medical version of a live Kernel Crash Dump). However, these cases are very rare and such techniques would only be used with a significant body of supporting evidence.

Kernel Crash Dumps are useful for a very small subset of performance cases. Specific performance problems rooted in memory shortfall caused by a memory leak would be one example, but these are quite rare in the big scheme of things and would need supporting evidence to use the Kernel Crash Dump approach.

I have come across a number of cases in the last few months where a crash dump has been requested and only one was possibly valid.

Before collecting the CAT scan equivalent of your system (with the associated cost) in the hope it shows up the cause of a performance problem, check the pulse, breathing and circulation first. If you do collect a live crash dump, make sure the supporting evidence and rationale are sound.
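
For completeness, a minimal sketch of collecting that live snapshot, assuming dumpadm already points at a dedicated dump device and a savecore directory:

dumpadm        # confirm the dump device and savecore directory
savecore -L    # snapshot the live running kernel into the savecore directory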

Friday May 02, 2008

70x performance improvement in 5 minutes

A good friend of mine, who is a Systems Engineer/Engagement Architect in the UK, sent me a copy of a benchmark his customer was using to assess the performance of various types of SPARC machines. While the benchmark is simplistic, the customer had a concern over its performance on a T5220, and any customer concern is valid.

The spirit of the customer benchmark was:

#!/bin/ksh

i=0
while [ $i -lt 63 ]
do
    ./run2_slow &

    echo Starting $i

    i=`expr $i + 1`
done

time ./run2_slow

which calls

#!/bin/ksh 

loop=0

while [ $loop -lt 1000 ]
do
       bc <<E > /dev/null
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100
E
       loop=`expr ${loop} + 1`
done 

which executes in around 70 seconds.

Clive's version, which required 5 minutes of very simple coding change:

#!/bin/ksh

i=0
while [ $i -lt 63 ]
do
    time ./run2_fast &

    echo Starting $i

    i=`expr $i + 1`
done

time ./run2_fast
calls
#!/bin/ksh 

loop=0

while [ $loop -lt 1000 ]
do
n=0
((n=n+100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100
))
	((loop=loop+1))
done
and runs the same number of iterations on the same machine in 1.01 seconds.

For the slower version, dtrace -s /usr/demo/dtrace/whoexec.d shows that there is a huge number of fork/exec sequences: the script forks bc, which forks dc, and the counter using expr also requires calls to fork/exec. Less than 1% of the time spent in this script was actually calculation.

An interesting system-level bottleneck did drop out, where the text segment of libc was being faulted in as the process is created, as a result of a call to memcntl, something like this:

enoexec(5.11)$ truss -t memcntl /bin/true
memcntl(0xC4080000, 227576, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
Where you have many concurrent processes calling fork, you do end up with some lock contention in ufs_getpage, like this:
-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                   Caller                  
111484  18%  18% 0.00  1747392 0x30001293610          ufs_lockfs_begin_getpage+0xc8

      nsec ------ Time Distribution ------ count     Stack                   
       512 |                               5         ufs_getpage+0x7c        
      1024 |                               26        fop_getpage+0x90        
      2048 |                               43        segvn_faulta+0x114      
      4096 |                               61        as_faulta+0x138         
      8192 |                               131       memcntl+0x8d0           
     16384 |                               711       syscall_trap32+0xcc     
     32768 |@@@                            14042     
     65536 |@@@@                           17764     
    131072 |@@@@                           17988     
    262144 |@@                             9247      
    524288 |                               2818      
   1048576 |                               3215      
   2097152 |@                              5994      
   4194304 |@@@@@@                         24221     
   8388608 |@@                             10884     
  16777216 |                               3414      
  33554432 |                               310       
  67108864 |                               7         

It would be interesting to try this on a T5220 with ZFS as the root filesystem, but where is the bottleneck? I would argue that there may be a little room for improvement in UFS, but that the benchmark is pathological. My experience of dealing with performance issues in the field over the last 6 years is that large 15K/25K, then T2000 and now T5220 systems are very good indeed at exposing applications which don't scale. This is an example of where a simplistic benchmark can lead to incorrect, or at least very incomplete, conclusions about the underlying platform. The benchmark as the customer used it just measured fork/exec performance; you would not implement a business solution like that, or would you?

Getting a 70x improvement from changes to Solaris or the underlying hardware is going to be a significant challenge. A 70x speedup from application changes in this case was viable, and a little consulting help might not go amiss.

Tuesday Apr 29, 2008

define the problem you want to solve and you are 50% of the way there.

In the voyage to apply the rational process to performance issues, this is useful insight.

For those of us working on performance issues in Services, the biggest hurdle we have to clear is getting a good definition of the problem the customer wants solved (BTW, "my system is slow" is not a problem definition we can work with, and neither is not knowing whether an arbitrary kstat counter is incrementing too fast). Once we get a solid problem definition to work on, we are at least 50% of the way there.

As Andy points out, in a multi-vendor world which requires a hippy holistic approach, the definition of the problem is key.

It also occurred to me that by definition a true Hippy would not rant, but go with the flow!

Thursday Feb 07, 2008

Out of range LBAs stop my web business

I spent a large part of a day a few weeks back on a bridge call with a large US customer for whom serious I/O performance issues had developed overnight. They had made no changes to the application, platform, SAN or storage for months. The SAN checked out fine. They had very serious performance issues which came down to very high latency I/O across most LUNs.

At the iostat level all I could see was high service times, and the storage vendor could only see low service times. As is par for the course, it becomes a finger-pointing exercise as both sides dig deeper and deeper into their stacks of hardware and software.

With the customer's business in effect down (they are a web-based business to a large extent), the political environment starts to heat up (you can even feel the heat from 3,000 miles away). Eventually one of the storage vendor's engineers found that SCSI packets with "out of range" LBAs (Logical Block Addresses) were being sent to the array, which pointed the finger back at our platform. Some Solaris code reading followed on my part, and I concluded that if the platform was generating out of range LBAs, then it would get recorded in the Illegal Request ssderr kstat, but we did not see that counter incrementing.
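
A minimal sketch of that check, assuming ssd-attached LUNs (sd-attached disks use the sderr module instead):

kstat -p -m ssderr | grep -i "Illegal Request"   # per-LUN error counters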

As a side remark, one of the customer's admins mentioned that they had seen similar I/O issues on a set of Wintel systems since installing a new set of HBA cards. I asked what the port addresses were for the HBAs generating the errant packets, and they were not from the 15K which had the performance problem. The call went silent as the picture unfolded; Sun and the storage vendor were thanked for their time and we left the call.

I wanted to really understand how Solaris behaves when out of range LBAs get generated, so I wrote some code to generate them. Please play with the code, but remember it uses USCSI, which bypasses the checks of the filesystem, etc., so don't use it on your production server, please.

So there is a risk in a SAN environment that other hosts can impact your mission-critical business, and it becomes a real challenge to find root cause. This customer lost a number of hours of sales. The Solaris side has no visibility of the out of range LBA SCSI packets generated by other hosts. So one place it might be useful to use this code is to determine the effect on the I/O latency of a workload if the back-end storage has to handle "out of range" LBAs. In the case of our customer above, I suspect that resets were occurring within the array and this impacted performance. Should storage be robust, in terms of performance degradation, to such requests, or is it just a good idea to limit the size and scope of each SAN?

Sunday Jan 20, 2008

fuser with care

Back in August I wrote an entry Too much /proc is bad for you!. I came across another command which needs to be used with care: fuser. It is a very useful command which finds and reports the processes that have a file, or files in a file system, open. Like a malt whisky or a fine wine, it adds value in moderation; too much at one time and simple tasks start to take much longer.

Have a look at the code for the function dofusers. Simply put, it loops through every process in the process table, looking at each file in the file table of each process in turn and at the vnode backing each segment in the address space of each process.

dtrace -n 'lockstat:::{@[probefunc] = count()}' -c "fuser ."

shows in excess of 1 million locks acquired and released. While the 1 million locks are no problem at all if uncontended, running multiple copies of fuser at the same time can add considerable contention around a number of key locks, such as pidlock and various address space locks.

Running one fuser every few seconds would have little overall impact. Trying to run multiple copies of fuser, and throwing in a few calls to ps at the same time, tends to make the smtx column in mpstat light up.
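
A minimal sketch of how to see this for yourself on a test box; the four concurrent fuser instances are just an illustrative load:

for i in 1 2 3 4; do fuser . > /dev/null 2>&1 & done    # generate concurrent /proc traffic
lockstat -C -s 10 sleep 10                               # watch the resulting mutex contention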

Friday Jan 04, 2008

bufhwm on large systems

I was asked yesterday to look at a busy system with high system time. It's Solaris 9 on a big-config 25K. This output was the top of the lockstat -C -s 50 output.

-------------------------------------------------------------------------------
Count indv cuml rcnt     spin Lock                   Hottest Caller          
132614  59%  59% 1.00      199 blist_lock[8]          bio_recycle+0x224       
     spin ------ Time Distribution ------ count     Stack  
            1 |                               186       bio_recycle+0x224              
            2 |                               2335      bio_getfreeblk+0x4
            4 |                               4247      getblk_common+0x2bc
            8 |@                              7190      bread_common+0x80
           16 |@@                             11570     bmap_read+0x20c
           32 |@@@@                           18285     ufs_directio_read+0x2e
           64 |@@@@@                          25634     rdip+0x198
          128 |@@@@@@                         28613     ufs_read+0x17c 
          256 |@@@@@                          22918     pread+0x28c   
          512 |@@                             9707
         1024 |                               1761 
         2048 |                               157   
         4096 |                               11        

A bit of Solaris code reading led me from the stack above to question the value of bufhwm. I checked it out again on docs.sun.com to really understand what this value does. It's the high-water mark, in KB, of the size of allocated buffers used for UFS indirect blocks, directories and other bits of metadata.

I went back to check some basic assumptions (always a good plan) and did an Explorer review. The following line is set in /etc/system:

set bufhwm=8000

I have no idea why it was set to 8000 on this system. I have seen it set many times on many systems and have not paid much attention to it on this and other systems. 8000 is proposed in many places as a reasonable value. I must admit I have never needed to suggest this value be tuned; my unconscious just assumed it was a good idea because common wisdom said so, and I never made a comment when other people tuned it.

By default this value would be 2% of memory. So this system, with > 200GB, would default to around 4GB. I expect 4GB would waste some memory, but then it is a high-water mark. 8MB is far too small on this size of server, given that the buffer cache is used to store indirect blocks, directories, etc. from a set of filesystems near 2TB!

We can observe if buffer recycling is causing an issue using the following

echo "bfreelist$ buf" | mdb -k
echo "v::print -t struct var" | mdb -k
kstat -p -n biostats

and sar -b might also give some insight.

So the morals to repeat to myself include

  • Turn off your unconscious mind when examining /etc/system. Don't assume any /etc/system setting is valid.
  • Never carry /etc/system tunables forward.
  • Put a comment in /etc/system if you set a value based on an attribute, like memory size, which has the potential to change, citing the assumption.

A comment in /etc/system of the following form would have helped at the various customers I have visited over the years:

# clive.king@sun.com 4/1/2008 
# bufhwm value of 8000 assumes a memory size of 4GB and 600GB of UFS filesystem. Revisit if the size changes
# Check with kstat -p -n biostats before changing
set bufhwm=8000

At least if something goes wrong, then I can be emailed in capital letters.

Thursday Jan 03, 2008

6281341 RFE: ce_taskq_disable should be able to set on per instance basis

For those who run clusters and also have busy public interfaces, this RFE is a real step forward in ce performance, even though it has been a while in the brewing. Fixed in patches including 118778-11, but there are patches for S9 and x86 as well, of course. The patch has been out for over a month now.

Some fiddling around in ce.conf will be required so that only the private interfaces have ce_taskq_disable set, in addition to applying the patch. If you run a system with cluster and your networks are busy, this is a patch you need to understand to see whether you will gain from applying it. The blueprint is well worth reading to understand what the various tunables for ce do.

Tuesday Sep 11, 2007

sdxx to cxtxdx conversion for IO latency by colour DTrace script

A couple of people commented that they would like a format of c0t0d0 rather than sd0 for the script in I/O latency by colour script I posted a few weeks ago.

So, change the line

       @[args[1]->dev_pathname] = lquantize(this->elapsed / MS, 0, 200, 50);
to
       @[args[1]->dev_statname] = lquantize(this->elapsed / MS, 0, 200, 50);
The DTrace IO provider does not appear to provide the cxtxdx format (if you know different, please add a comment), so a bit of klunky post-processing is needed.
#!/usr/bin/perl -w

use strict;
my %regex= ();

system("iostat -E | awk '/Soft/ { print \$1}' > /tmp/a");
system("iostat -En | awk '/Soft/ { print \$1 }' > /tmp/b");
system("paste /tmp/a /tmp/b > /tmp/c");

open(F, "/tmp/c") || die ("no /tmp/c");
while(<F>) {
    my ($sd,$ctd) = split;
    $regex{$sd} = $ctd;
}
close(F);

while (<>) {
    for my $sd (keys %regex) {
         s/$sd\b/$regex{$sd}/g;
    }
    print;
}
Then run as
 pfexec dtrace -s ./io_latency_by_colour.d | ./sd_to_cxtxdx.pl
and enjoy the colours and the easier-to-read disk format. I shall have to talk to Brother Jon to determine if an addition to the IO provider makes sense, though the amount of work iostat goes through to get this format might explain why it's not already part of the IO provider.

Thursday Aug 30, 2007

Too much /proc is bad for you!



I have been to a couple of customer sites this year where between a quarter and a sixth of 20-40 CPU systems have been consumed by various types of monitoring. Over-monitoring contributes to the very scalability issues that cause the customer to introduce additional monitoring. If you run systems that are large by CPU/core count (this now includes T2000), please read on. We introduce a few principles along the way by way of summary.

One of the downsides of any monitoring is that it has overhead. This is often cited as the Heisenberg Principle, but should be called the Observer Effect when discussing computer systems. We shall put to one side a cat in a box and any associated philosophy. Let us agree that if you try to look at a system you will change it, and the deeper you look the more overhead you add. kstats and DTrace are a best case, but an overhead still exists. One enabled DTrace probe has a tiny overhead; 30,000 enabled probes have more overhead. Few customers buy our systems to run monitoring software; most use our systems to support their business. So, Principle 1: monitor what you care about now to solve today's business problem, not what you might care about in future.

The use of /proc often (with an event distribution measured in small numbers of seconds) is a big deal. The procfs filesystem has a goal of giving correct data at the point in time that an observation is made. When a choice needs to be made, Solaris architects choose correct and slower over faster and misleading. This trade-off means, in the case of process monitoring, you can't have both correct and performant. It's a lot of work to give a consistent picture of a process. A ps -ef on our SunRay server, with in the region of 4000 processes, causes a total of 3,226,292 kernel fbt probes to fire. /proc is not a lightweight interface, so we need to be selective about its use.
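
A minimal sketch of how a number like that can be measured; enabling every fbt entry probe is itself expensive, so this is strictly a test-box exercise:

# count kernel function entries while ps -ef runs; the aggregation total prints when ps exits
dtrace -n 'fbt:::entry{@ = count()}' -c 'ps -ef'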

/proc is also very lock heavy; it needs to be to ensure it gives a consistent picture. Every /proc operation acquires and releases proclock and pidlock, among other locks. Being lock heavy means it is very easy to write applications which don't scale if /proc is used on a regular basis. I was involved in a hot customer issue where an early SF15K did not scale beyond 28 CPUs for a particular in-house Java application. The highly threaded application used the pr_fname field of the curprpsinfo structure to get the process name for every operation. The process name never changes, and the developers had no idea a 3rd party native library used /proc in this way. The SMTX column in mpstat lit up and lockstat -C showed the errant kernel call stack very clearly. It was an easy fix to the application once it was pointed out, and there was a huge drop in system time once the library cached the program name. Which leads to Principle 2: if you need to do it often, don't do it often with procfs.

The Solaris kernel does not store the size of a process's address space at the current point in time. It is only typically needed by tools such as ps, so the /proc interface ps_procinfo??? calculates it as required, when needed. This is a non-trivial operation, as each segment you see in the pmap output for a process needs to have the appropriate vnode locked, the size taken and the vnode unlocked. It is not unusual to have processes with in excess of 100,000 mapped segments. With 30 such processes you can see that procfs, on behalf of ps, needs to do a lot of work to return the size of the address space of a process. A couple of ps instances running at the same time, along with prstat and top, and quite a bit of your very busy multi-million dollar database server gets consumed.

The most successful approach to system monitoring that I have observed our customers employ is to monitor at a business level, such as user-experienced response time. For some application types it takes a lot of work to get beyond measuring how often the help desk or CIO's phone rings and the word slow is uttered. To get useful quantitative metrics which give a useful representation of the user experience, and to provide a clear trigger when it degrades, is a highly non-trivial task. This may explain why many organisations have relied on system-level metrics such as the user/system time ratio or even I/O wt time (don't even joke about using wt)! System-level metrics typically only confuse the process of resolution if used out of context with the business problem. Taking system-level metrics outside the context of the flow of data to and from the user (be it human or silicon based) typically leads to a colourful festival of explaining incrementing values of obscure kstats, rather than solving business problems and establishing actual root cause. Which leads to Principle 3: use only measures and metrics you fully understand.

I was asked by an account manager to pay a visit to a customer where staff investigating a SAP performance issue had been flown from the UK to Hong Kong to conduct network tests, with a result of "it's not the network" [1]. A morning of understanding the business need, the system (people, software, computers, networks) [2] and the interaction between components, followed by 10 minutes of DTrace (truss would have done fine), showed the problem to be the efficiency of coding of a SAP script. I can't spell SAP, but following the flow of data between components, observing with the intent of answering the SGRT questions of where on the object and when in the lifecycle, has not let me down yet [3]. Which leads to Principle 4: strong and relevant business metrics avoid wrong turns.

There is a school of thought that suggests a system should have no system-level performance monitoring enabled on the box itself unless a baseline is being established, a problem is being pursued or data is being collected for a specific capacity planning exercise. For the most part I agree, based on the observation that continual low-level system monitoring, on balance, causes more problems than it addresses. Storage monitoring products in particular appear adept at consuming a CPU or two.

One of the Sun support tools, GUDS, shares the same concerns. It's important to get the purpose of GUDS into perspective. GUDS is intended to capture as much potentially useful data as possible in one shot, such that we get useful data for most types of performance problem. Thus we accept that there is a non-zero overhead, must allow for it in the analysis [4] and be judicious in its use. Like any tool, it's the context in which you use GUDS that counts, and it can add great value. We (the Global Performance V-team) often get asked to have a look at GUDS output and diagnose a situation where the problem definition is "system going slow". GUDS is a first-pass tool for when you don't have a decisive problem definition or you want to gather baseline data.

GUDS adds load to the system it monitors. ::memstat in mdb, for example, takes one CPU and a non-trivial number of cross-calls to walk each page in memory and determine what the page is used for. TNF and lockstat also add an overhead. GUDS, when used in the right context, with the right -X option, is highly effective. As the Welsh rugby commentator and ex-international Jonathan Davies notes in a different context, it's the top two inches that count.

This leads us to Principle 5: use system-level metrics only in the context of understanding a business-led performance issue.

I have mentioned the need for relevant business metrics, itself a huge and complex subject, to replace obtrusive system-level monitoring as the trigger to investigate when a problem that impacts the business arises. If you set out on a journey it helps to know your objective; business-level metrics assist in knowing when you are making progress and when you are done. It is also curious how often capacity planning gets confused with business metrics.

Back to /proc. Some useful one-liners for finding overhead that is just overhead.

Let's see which processes are using /proc over 60 seconds:

dtrace -n procfs::entry'{@[execname] = count()}' -n tick-60s'{exit(0)}'

For an application proc_pig, let's find the userland stack which causes procfs to be called:

dtrace -n 'procfs::entry/execname == "proc_pig"/{@[ustack()] = count()}' -n 'tick-60s{exit(0)}'

One of the DTrace demo scripts is very useful for highlighting monitoring processes which spawn many child processes:

dtrace -s /usr/demo/dtrace/whoexec.d -n 'tick-60s{exit(0)}'

ps (or pgrep) is often used in scripts to determine if a child process, identified either by name or by PID, is still running. ps is a process-monitoring lump hammer and its use in process state scripts is architecturally questionable, more so with the advent of SMF in Solaris 10.

So if you have a script that does something along the lines of

while true
do
    ps -ef | awk '{print $2}' | egrep "^${PID}$" > /dev/null
    if [ $? != 0 ] ; then
        restart process
    fi
    sleep 10
done

to restart the process whose PID is held in $PID if it dies.

A step in the right direction in reducing the overhead is to use

ps -p $PID > /dev/null

in the ps line. If the process does not exist, then a non-zero return code is given and when the process does exist, the overhead of traversing every process and calculating its address space size is avoided.

To do the proper job, let SMF take the strain and manage it for you. The underlying contracts framework detects if a process dies and gets it restarted. One of the best places to learn about writing your own services is BigAdmin.

This leads us to principle 6: Use the right architecture and tools for service management

Writing a SMF service to restart a process, while not a trivial task, is easier and less error prone than writing efficient and correct shell script!

In summary, we have touched on a number of topics which relate to monitoring and the resultant impact on overall system performance. The obvious open question is how we generate meaningful business metrics beyond how long a batch job takes to run or how often the phone rings. It's a tough subject, as most situations are unique, and I would be interested in real examples of useful non-trivial business metrics in complex environments.

  • Monitor what you care about now to solve today's business problem, not what you might care about in future

  • If you need to do it often, don't do it often with procfs

  • Use only measures and metrics you fully understand

  • Strong and relevant business metrics avoid wrong turns

  • Use system-level metrics only in the context of understanding a business-led performance issue

  • Use the right architecture and tools for service management


If you can think of any more from your experiences, please drop me an email or add a comment.

[1] 90%+ of process execution time spent in userland often suggests the root cause is not a network problem.

[2] This was enabled by a Sun Account Manager who connected people in parts of the customer operation who otherwise did not know each other.

[3] That is not quite true; it did once, at a company with "this is a no-blame culture" posters. They were just looking for someone to blame, so cooperation was thin on the ground.

[4] Typical syndrome: Solaris is broken; GUDS shows a whole CPU burning system time on an otherwise idle system.
