Monday Mar 24, 2008

What the quot

One of the joys of UNIX is that even after 16 years of using it, you can still trip over useful and interesting commands and features which have been around for a while. It is even better when you find something you were not looking for.

     quot - summarize file system ownership

     quot [-acfhnv] filesystem...

     quot -a [-cfhnv]

     quot displays the number of blocks (1024 bytes) in the named
     filesystem (one or more) currently owned by each user. There
     is a limit of 2048 blocks. Files larger than  this  will  be
     counted as a 2048 block file, but the total block count will
     be correct.

This is the output from a root filesystem on a V890 running Solaris 10. The -v option is interesting in that it gives three extra columns of blocks not accessed in 30, 60 and 90 days. I have yet to think of a use for -v, but I am sure one will turn up.

# quot -f -v /
3956446 173808  root            3679449 3491876 3301515
  901      73   uucp              503     502     467
  133      19   adm                21      21       4
  118     115   smmsp              57       4       0
   74       5   svctag              3       2       0
   48      41   noaccess           47       4       0
   28      28   lp                 24      23       2
   11       3   bin                11      11      11
    8       8   daemon              8       4       0
    8      16   nobody              8       8       0
    8       8   postgres            8       8       0
    4       4   gdm                 4       4       0
    4       4   webservd            4       4       0
    3       9   clivek              1       1       1

Friday Mar 14, 2008

Solaris on a macbook under Fusion

I use a MacBook for travelling as it is light and the battery lasts a few hours (compared to the 45 minutes for the Ferrari!). I have been using Solaris under Parallels for over a year, but the integration with OS X never quite cut it.

I got round to installing VMware Fusion yesterday and was pleased that it just worked. Once VMware Tools was installed the network just worked (apart from ssh, which was disabled on the OS X side). I used Indiana (OpenSolaris Developer Preview 2) as I wanted to play with the new packaging framework.

I don't have any great hints or tricks beyond: follow the instructions, and once the guest is installed, go to the "Virtual Machine" menu and choose the "Install VMware Tools" item.

Fusion might actually be good enough for me to fork out my own money when the evaluation period comes to an end!

Thursday Dec 27, 2007

What is the UK OS Ambassador for Education for?

I have been a Sun OS Ambassador since 1999. Most of my focus has been on maintaining the relationship Sun has with Universities in the UK, so I have taken on the unofficial role of UK OS Ambassador for Education. I worked at a University as a Systems Administrator and as a researcher, with some teaching, so it is a natural fit for me. It's really good to see Universities at the top of Sun's corporate agenda again in a meaningful way. The most visible activity which shows Sun is \*really\* serious about education again is the Campus Ambassador programme. I thought it was worth listing the range of activities I have been involved in as OS Ambassador since 1999. If you are either a UK Campus Ambassador or work in a UK University and think that I might be of use, it would be a pleasure to help/pay you a visit or "get the right person" to help/pay you a visit.

  • Solaris technology demo. Has been DTrace, ZFS, Zones over the last few years, but a wider range is possible.
  • Invited talks to 2nd year Operating Systems courses.
  • PhD external examiner in areas related to Operating or Distributed Systems.
  • Our new system is slow/does not work and we can't get this issue resolved (yes, it does happen, though not very often and working in Service really helps on this one).
  • I don't know who else to talk to in Sun, I know you are not the right person, but can you find them?
  • We need someone independent to help us with Strategy Facilitation (I can do quite a good impression of a Management Consultant without any dress sense). Most of this uses techniques from the Kepner-Tregoe Rational Process toolbox.
  • Free performance analysis via SharedShell if I can blog about it.
  • Student Mock Interviews and "how to prepare for interviews in industry" lectures.
  • Department Industrial Liaison boards.
  • Account team has a quick technical question and does not know where else to ask it.
  • Outside/Independent member of an interview panel.

UK OS Ambassador for Education is one of my hobbies, a bit like rock climbing, fell running or restoring a 1973 Muir-Hill A5000. Having a day job in Support Services means I only get to spend a day or 2 a month outside the support dungeon doing Ambassador type work.

If you work in a UK university and think any of the above might help you, drop me an email and I look forward to meeting you.

Thursday Aug 30, 2007

Too much /proc is bad for you!

I have been to a couple of customer sites this year where between 1/6 and 1/4 of a 20-40 CPU system was being consumed by various types of monitoring. Over-monitoring is a contributor to scalability issues, which in turn causes the customer to introduce additional monitoring. If you run systems that are large by CPU/core count (this now includes the T2000), please read on. I introduce a few principles along the way and summarise them at the end.

One of the downsides of any monitoring is that it has overhead. This is often cited as the Heisenberg principle, but should be called the observer effect when discussing computer systems. We shall put to one side the cat in a box and any associated philosophy. Let us agree that if you try to look at a system you will change it, and the deeper you look the more overhead you add. kstats and DTrace are a best case, but an overhead still exists: one enabled DTrace probe has a tiny overhead; 30,000 enabled probes have more. Few customers buy our systems to run monitoring software; most use our systems to support their business. So Principle 1: monitor what you care about now to solve today's business problem, not what you might care about in the future.

Using /proc often (with an event distribution measured in small numbers of seconds) is a big deal. The procfs filesystem has a goal of giving correct data at the point in time that an observation is made. When a choice needs to be made, Solaris architects choose correct and slower over faster and misleading. This trade-off means that, in the case of process monitoring, you can't have both correct and performant. It's a lot of work to give a consistent picture of a process: a ps -ef on our Sun Ray server, with in the region of 4000 processes, causes a total of 3,226,292 kernel fbt probes to fire. /proc is not a lightweight interface, so we need to be selective about its use.

/proc is also very lock heavy; it needs to be to ensure it gives a consistent picture. Every /proc operation acquires and releases proclock and pidlock, among other locks. Being lock heavy means it is very easy to write applications which don't scale if /proc is used on a regular basis. I was involved in a hot customer issue where an early SF15K did not scale beyond 28 CPUs for a particular in-house Java application. The highly threaded application used the pr_fname field of the psinfo structure to get the process name for every operation. The process name never changes, and the developers had no idea a 3rd-party native library used /proc in this way. The SMTX column in mpstat lit up and lockstat -C showed the errant kernel call stack very clearly. It was an easy fix to the application once it was pointed out, with a huge drop in system time once the library cached the program name. Which leads to Principle 2: if you need to do it often, don't do it often with procfs.
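The real application was Java calling a native library, so the following is only a hedged sketch of the caching idea in shell: pay the /proc round trip once, then reuse the answer. The variable names are invented for illustration.

```shell
#!/bin/sh
# Illustrative sketch only: fetch the process name once and cache it,
# rather than hitting /proc on every operation.
PID=$$

if [ -z "$CACHED_NAME" ]; then
    # One /proc visit: ps -p examines a single process only.
    CACHED_NAME=$(ps -p "$PID" -o comm=)
fi

# Every subsequent use reads the cached copy, not /proc.
echo "$CACHED_NAME"
```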

The Solaris kernel does not store the size of a process's address space at the current point in time. It is typically only needed by tools such as ps, so the /proc interface calculates it as and when required. This is a non-trivial operation, as each segment you see in the pmap output for a process needs to have the appropriate vnode locked, the size taken and the vnode unlocked. It is not unusual to have processes with in excess of 100,000 mapped segments. With 30 such processes you can see that procfs, on behalf of ps, needs to do a lot of work to return the size of the address space of a process. A couple of ps instances running at the same time, along with prstat and top, and quite a bit of your very busy multi-million dollar database server gets consumed.
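A quick way to feel the difference (timings are illustrative and entirely system-dependent) is to compare a full process-table walk with a single-process query; nothing beyond a standard ps is assumed:

```shell
# Full walk: procfs must visit every process, sizing each address space.
time ps -ef > /dev/null

# Targeted query: only one process (this shell) is examined.
time ps -p $$ -o pid=
```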

The most successful approach to system monitoring that I have observed our customers employ is to monitor at a business level, such as user-experienced response time. For some application types it takes a lot of work to get beyond measuring how often the help desk or CIO's phone rings and the word slow is uttered. Getting useful quantitative metrics which give a useful representation of the user experience, and which provide a clear trigger when it degrades, is a highly non-trivial task. This may explain why many organisations have relied on system-level metrics such as the user/system time ratio or even I/O wait time (don't even joke about using wt)! System-level metrics typically only confuse the process of resolution if used out of context with the business problem. Taking system-level metrics outside the context of the flow of data to and from the user (be it human or silicon based) typically leads to a colourful festival of explaining incrementing values of obscure kstats, rather than solving business problems and establishing actual root cause. Which leads to Principle 3: use only measures and metrics you fully understand.

I was asked by an account manager to pay a visit to a customer where staff investigating a SAP performance issue had been flown from the UK to Hong Kong to conduct network tests, with a result of "it's not the network" [1]. A morning of understanding the business need, the system (people, software, computers, networks) [2] and the interaction between components, followed by 10 minutes of DTrace (truss would have done fine), showed the problem to be the coding efficiency of a SAP script. I can't spell SAP, but following the flow of data between components, observing with the intent of answering the SGRT questions of where on the object and when in the lifecycle, has not let me down yet [3]. Which leads to Principle 4: strong and relevant business metrics avoid wrong turns.

There is a school of thought that suggests a system should have no system-level performance monitoring enabled on the box itself unless a baseline is being established, a problem is being pursued or data is being collected for a specific capacity-planning exercise. For the most part I agree, based on the observation that continual low-level system monitoring, on balance, causes more problems than it addresses. Storage monitoring products in particular appear adept at consuming a CPU or two.

One of the Sun support tools, GUDS, shares some of the same concerns. It is important to get the purpose of GUDS into perspective. GUDS is intended to capture as much potentially useful data as possible in one shot, so that we get useful data for most types of performance problem. Thus we accept that there is a non-zero overhead, must allow for it in the analysis [4] and must be judicious in its use. Like any tool, it is the context in which you use GUDS that counts and that can add great value. We (the Global Performance V-team) often get asked to have a look at GUDS output and diagnose a situation where the problem definition is "system going slow". GUDS is a first-pass tool for when you don't have a decisive problem definition or you want to gather baseline data.

GUDS adds load to the system it monitors. ::memstat in mdb, for example, takes one CPU and non-trivial numbers of cross-calls to walk each page in memory and determine what the page is used for. TNF and lockstat also add an overhead. GUDS, when used in the right context, with the right -X option, is highly effective. As the Welsh rugby commentator and ex-international Jonathan Davies notes in a different context, it's the top 2 inches that count.

This leads us to Principle 5: use system-level metrics only in the context of understanding a business-led performance issue.

I have mentioned the need for relevant business metrics, itself a huge and complex subject, to replace obtrusive system-level monitoring as the trigger to investigate when a problem that impacts the business arises. If you set out on a journey it helps to know your objective; business-level metrics assist in knowing when you are making progress and when you are done. It is also curious how often capacity planning gets confused with business metrics.

Back to /proc. Here are some useful one-liners for finding overhead that is just overhead.

Let's see which processes are using /proc over 60 seconds:

dtrace -n 'procfs::entry{@[execname] = count()}' -n 'tick-60s{exit(0)}'

For an application proc_pig, let's find the userland stack which causes procfs to be called:

dtrace -n 'procfs::entry/execname == "proc_pig"/{@[ustack()] = count()}' -n 'tick-60s{exit(0)}'

One of the DTrace demo scripts is very useful for highlighting those monitoring processes which spawn many child processes.

dtrace -s /usr/demo/dtrace/whoexec.d -n 'tick-60s{exit(0)}'

ps (or pgrep) is often used in scripts to determine if a child process, identified either by name or by PID, is still running. ps is a process-monitoring lump hammer and its use in process-state scripts is architecturally questionable, more so with the advent of SMF in Solaris 10.

So if you have a script that does something along the lines of

     while true
     do
         ps -ef | awk '{print $2}' | egrep "^${PID}\$" > /dev/null
         if [ $? != 0 ] ; then
             restart_the_process    # placeholder for however the process gets restarted
         fi
         sleep 10
     done

to restart a process with process ID $PID if it dies.

A step in the right direction in reducing the overhead is to use

     ps -p $PID > /dev/null

in place of the ps | awk | egrep pipeline. If the process does not exist, a non-zero return code is given; when the process does exist, the overhead of traversing every process and calculating its address-space size is avoided.
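An even lighter-weight liveness check, sketched below, avoids parsing ps output altogether: kill -0 sends no signal but reports via its exit status whether the PID exists (assuming you have permission to signal it). The PID used here is this shell, purely for illustration.

```shell
#!/bin/sh
# Sketch: test whether $PID is alive via the kernel's signal machinery
# rather than a procfs walk of the whole process table.
PID=$$

if kill -0 "$PID" 2>/dev/null; then
    echo "process $PID is alive"
else
    echo "process $PID has gone"
fi
```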

To do the job properly, let SMF take the strain and manage it for you. The underlying contracts framework detects if a process dies and gets it restarted. One of the best places to learn about writing your own services is BigAdmin.
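As a minimal sketch of what this looks like in practice, here is a cut-down SMF manifest for a hypothetical service; the service name, paths and start script are all invented for illustration, and a real manifest should follow the full service_bundle DTD.

```xml
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<!-- Hypothetical example: all names and paths are invented for illustration -->
<service_bundle type="manifest" name="myapp">
  <service name="site/myapp" type="service" version="1">
    <create_default_instance enabled="true"/>
    <single_instance/>
    <!-- The contracts framework restarts the service if the process dies -->
    <exec_method type="method" name="start"
                 exec="/opt/myapp/bin/myapp-start" timeout_seconds="60"/>
    <exec_method type="method" name="stop"
                 exec=":kill" timeout_seconds="60"/>
  </service>
</service_bundle>
```

Import it with svccfg import and enable it with svcadm enable; from then on the restarter, not a polling shell loop, keeps the process running.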

This leads us to principle 6: Use the right architecture and tools for service management

Writing an SMF service to restart a process, while not a trivial task, is easier and less error prone than writing an efficient and correct shell script!

In summary, we have touched on a number of topics which relate to monitoring and its impact on overall system performance. The obvious open question is how we generate meaningful business metrics beyond how long a batch job takes to run or how often the phone rings. It is a tough subject, as most situations are unique, and I would be interested in real examples of useful non-trivial business metrics in complex environments.

  • Monitor what you care about now to solve today's business problem, not what you might care about in future

  • If you need to do it often, don't do it often with procfs

  • Use only measures and metrics you fully understand

  • Strong and relevant business metrics avoid wrong turns

  • Use system level metrics only in the context of understanding a business lead performance issue

  • Use the right architecture and tools for service management

If you can think of any more from your experiences, please drop me an email or add a comment.

[1] 90%+ of process execution time spent in userland often suggests the root cause is not a network problem.

[2] This was enabled by a Sun Account Manager who connected people in parts of the customer operation who otherwise did not know each other.

[3] That is not quite true; it did let me down once, at a company with "this is a no-blame culture" posters. They were just looking for someone to blame, so cooperation was thin on the ground.

[4] Typical syndrome: Solaris is broken, GUDS shows a whole CPU burning system time on an otherwise idle system.

Monday Jul 23, 2007

At a very high level, performance analysis follows a cycle: we try to define the problem to be solved (a shame it is often the part paid the least attention), then run through a cycle of refining where time is being spent or identifying which resources are being used. We might also throw in a best-practice review of the configuration.

I did some work with an Italian banking customer this morning. We used SharedShell, taking about 10 minutes of analysis to establish root cause and maybe another 10 minutes for me to write up the root cause and solution. While SharedShell has been out for a few months, it is the first time I had been invited to use it by a customer, rather than asking if it was a possibility. Being able to ask in the chat window whether the customer could see the performance issue they were concerned about at a particular moment should not be underestimated in terms of removing misunderstanding and delay. The analysis would have required at least two steps beyond what a single GUDS run would give you, so the best-case turnaround time for an engineer running the same commands and emailing back the output, or putting it on the SunSolve server for me to analyse, is usually measured in hours. I guess we saved at least 3 hours on the time to resolution. For harder performance cases I would expect the time saving to be greater, but far more important to me is that the accuracy of diagnosis will be higher.

Well worth checking out, and worth the effort of getting security clearance from your organisation before you need to use it.



