Too much /proc is bad for you!

I have been to a couple of customer sites this year where between a sixth and a quarter of a 20-40 CPU system was being consumed by various types of monitoring. Over-monitoring contributes to scalability problems, which in turn prompt the customer to introduce yet more monitoring. If you run large systems by CPU/core count (this now includes the T2000), please read on. We introduce a few principles along the way, and they are summarised at the end.

One of the downsides of any monitoring is that it has overhead. This is often cited as the Heisenberg Principle, but should be called the Observer Effect when discussing computer systems. We shall put to one side a cat in a box and any associated philosophy. Let us agree that if you try to look at a system you will change it, and the deeper you look the more overhead you add. kstats and DTrace are the best case, but an overhead still exists: one enabled DTrace probe has a tiny overhead, 30,000 enabled probes have more. Few customers buy our systems to run monitoring software; most use our systems to support their business. So, principle 1: Monitor what you care about now to solve today's business problem, not what you might care about in future.

Using /proc often (at intervals measured in small numbers of seconds) is a big deal. The procfs filesystem has a goal of giving correct data at the point in time that an observation is made. When a choice needs to be made, Solaris architects choose correct and slower over faster and misleading. This trade-off means that, in the case of process monitoring, you can't have both correct and performant. It's a lot of work to give a consistent picture of a process. A ps -ef on our Sun Ray server, with in the region of 4,000 processes, causes a total of 3,226,292 kernel fbt probes to fire. /proc is not a lightweight interface, so we need to be selective about its use.

/proc is also very lock heavy; it needs to be to ensure it gives a consistent picture. Every /proc operation acquires and releases proclock and pidlock, among other locks. Being lock heavy means it's very easy to write applications which don't scale if /proc is used on a regular basis. I was involved in a hot customer issue where an early SF15K did not scale beyond 28 CPUs for a particular in-house Java application. The highly threaded application used the pr_fname field of the curprpsinfo structure to get the process name for every operation. The process name never changes, and the developers had no idea a 3rd party native library used /proc in this way. The smtx column in mpstat lit up and lockstat -C showed the errant kernel call stack very clearly. It was an easy fix to the application once it was pointed out, and there was a huge drop in system time once the library cached the program name. Which leads to principle 2: If you need to do it often, don't do it often with procfs.

The Solaris kernel does not store the current size of a process's address space. It's typically only needed by tools such as ps, so the /proc interface ps_procinfo??? calculates it as and when required. This is a non-trivial operation, as each segment you see in the pmap output for a process needs to have the appropriate vnode locked, the size taken and the vnode unlocked. It is not unusual to have processes with in excess of 100,000 mapped segments. With 30 such processes, that is some three million segments to lock, size and unlock, so you can see that procfs, on behalf of ps, needs to do a lot of work to return the size of the address space of each process. Run a couple of ps instances at the same time, along with prstat and top, and quite a bit of your very busy multi-million dollar database server gets consumed.

The most successful approach to system monitoring that I have observed our customers employ is to monitor at a business level, such as user-experienced response time. For some application types it takes a lot of work to get beyond measuring how often the help desk or CIO's phone rings and the word "slow" is uttered. Getting quantitative metrics which give a useful representation of the user experience, and which provide a clear trigger when it degrades, is a highly non-trivial task. This may explain why many organisations have relied on system level metrics such as user/system time ratio or even I/O wt time (don't even joke about using wt)! System level metrics typically only confuse the process of resolution if used out of context with the business problem. Taking system level metrics outside the context of the flow of data to and from the user (be it human or silicon based) typically leads to a colourful festival of explaining incrementing values of obscure kstats, rather than solving business problems and establishing actual root cause. Which leads to principle 3: Use only measures and metrics you fully understand.

I was asked by an account manager to pay a visit to a customer where staff investigating a SAP performance issue had been flown from the UK to Hong Kong to conduct network tests, with a result of "it's not the network" [1]. A morning of understanding the business need, the system (people, software, computers, networks) [2] and the interaction between components, followed by 10 minutes of DTrace (truss would have done fine), showed the problem to be the efficiency of coding of a SAP script. I can't spell SAP, but following the flow of data between components, and observing with the intent of answering the SGRT questions of where on the object and when in the lifecycle, has not let me down yet [3]. Which leads to principle 4: Strong and relevant business metrics avoid wrong turns.

There is a school of thought that suggests a system should have no system level performance monitoring enabled on the box itself unless a baseline is being established, a problem is being pursued or data is being collected for a specific capacity planning exercise. For the most part I agree, based on the observation that continual low level system monitoring, on balance, causes more problems than it addresses. Storage monitoring products in particular appear adept at consuming a CPU or two.

One of the Sun support tools, GUDS, shares some of the same concerns. It's important to get the purpose of GUDS into perspective. GUDS is intended to capture as much potentially useful data as possible in one shot, such that we get useful data for most types of performance problem. Thus we accept that there is a non-zero overhead, must allow for it in the analysis [4] and must be judicious in its use. Like any tool, it's the context in which you use GUDS that counts and can add great value. We (the Global Performance V-team) often get asked to have a look at GUDS output and diagnose a situation where the problem definition is "system going slow". GUDS is a first-pass tool for when you don't have a decisive problem definition or you want to gather baseline data.

GUDS adds load to the system it monitors. ::memstat in mdb, for example, takes one CPU and non-trivial numbers of cross-calls to walk each page in memory and determine what the page is used for. TNF and lockstat also add an overhead. GUDS, when used in the right context, with the right -X option, is highly effective. As the Welsh rugby commentator and ex-international Jonathan Davies notes in a different context, it's the top 2 inches that count.

This leads us to principle 5: Use system level metrics only in the context of understanding a business-led performance issue.

I have mentioned the need for relevant business metrics, itself a huge and complex subject, to replace obtrusive system level monitoring as the trigger to investigate when a problem that impacts the business arises. If you set out on a journey it helps to know your objective; business level metrics help you know when you are making progress and when you are done. It is also curious how often capacity planning gets confused with business metrics.

Back to /proc. Here are some useful one-liners for finding overhead that is just overhead.

Let's see which processes are using /proc over 60 seconds:

dtrace -n procfs::entry'{@[execname] = count()}' -n tick-60s'{exit(0)}'

For an application proc_pig, let's find the userland stack which causes procfs to be called.

dtrace -n procfs::entry'/execname == "proc_pig"/{@[ustack()] = count()}' -n tick-60s'{exit(0)}'

One of the DTrace demo scripts is very useful for highlighting those monitoring processes which spawn many child processes.

dtrace -s /usr/demo/dtrace/whoexec.d -n tick-60s'{exit(0)}'

ps (or pgrep) is often used in scripts to determine whether a child process, identified either by name or by PID, is still running. ps is a process monitoring lump hammer and its use in process state scripts is architecturally questionable, more so with the advent of SMF in Solaris 10.

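So if you have a script that does something along the lines of

while true
do
    # list every process, keep the PID column, look for an exact match
    ps -ef | awk '{print $2}' | egrep "^${PID}\$" > /dev/null
    if [ $? != 0 ] ; then
        restart_process    # placeholder for whatever restarts the process
    fi
    sleep 10
done

to restart the process with process ID $PID if it dies, then every pass of the loop pays for a look at every process on the system.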

A step in the right direction in reducing the overhead is to use

ps -p $PID > /dev/null

in the ps line. If the process does not exist, then a non-zero return code is given and when the process does exist, the overhead of traversing every process and calculating its address space size is avoided.
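
As a minimal sketch of that change, the polling loop can be built around ps -p as shown here. The function names is_running and watch_pid are hypothetical and chosen only for illustration, the restart command is whatever you pass in, and the sketch assumes a POSIX shell and a ps that supports -p.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not from the original script.

# is_running PID
# Succeeds if PID exists. ps -p asks about that one process only, so ps
# does not have to examine every process the way ps -ef does.
is_running() {
    ps -p "$1" > /dev/null 2>&1
}

# watch_pid PID INTERVAL COMMAND [ARGS...]
# Polls every INTERVAL seconds until PID has gone, then runs COMMAND once.
watch_pid() {
    pid=$1
    interval=$2
    shift 2
    while is_running "$pid"; do
        sleep "$interval"
    done
    "$@"
}
```

A call such as watch_pid "$PID" 10 restart_process, where restart_process stands for your own restart command, would then replace the loop. This is still polling, just much cheaper polling.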

To do the proper job, let SMF take the strain and manage it for you. The underlying contracts framework detects when a process dies and gets it restarted. One of the best places to learn about writing your own services is BigAdmin.

This leads us to principle 6: Use the right architecture and tools for service management

Writing an SMF service to restart a process, while not a trivial task, is easier and less error prone than writing an efficient and correct shell script!

In summary, we have touched on a number of topics which relate to monitoring and the resultant impact on overall system performance. The obvious open question is how we generate meaningful business metrics beyond how long a batch job takes to run or how often the phone rings. It's a tough subject, as most situations are unique, and I would be interested in real examples of useful non-trivial business metrics in complex environments.

  • Monitor what you care about now to solve today's business problem, not what you might care about in future

  • If you need to do it often, don't do it often with procfs

  • Use only measures and metrics you fully understand

  • Strong and relevant business metrics avoid wrong turns

  • Use system level metrics only in the context of understanding a business-led performance issue

  • Use the right architecture and tools for service management

If you can think of any more from your experiences, please drop me an email or add a comment.

[1] 90%+ of process execution time spent in userland often suggests the root cause is not a network problem.

[2] This was enabled by a Sun Account Manager who connected people in parts of the customer operation who otherwise did not know each other.

[3] That is not quite true, it did once. It was a company with "this is a no blame culture" posters, so they were just looking for someone to blame, hence cooperation was thin on the ground.

[4] Typical syndrome: "Solaris is broken, GUDS shows a whole CPU burning system time on an otherwise idle system."


If you want to wait for a process ID to exit, but don't want to use contracts (e.g., because you have multiple waiters) then use pwait(1).

Posted by Nico on August 30, 2007 at 03:48 PM BST
