Swat Trace Facility (STF) with very high IOPS workloads.
By Henk Vandenbergh on Feb 12, 2009
On Solaris, STF uses TNF to collect detailed I/O trace data. TNF uses a single kernel trace buffer of at most 128 megabytes, so this buffer overflows quickly and by itself does not allow trace data to be collected over longer periods of time.
STF therefore checks every 5 seconds to see how full the trace buffer is. When it is 80% full, STF uses tnfxtract to offload the trace buffer to disk, while allowing TNF to continue adding data to the trace buffer. Once another 80% of the buffer has been filled the same happens again, and so on.
Offloading the trace buffer when it is 80% full allows for a 20% overlap in available buffer space, but with very high IOPS that is not always enough when STF checks only every 5 seconds. In that case STF will generate the following messages:
====> Trace buffer is filling faster than we can offload to disk.
====> Trace will be cancelled after this happens 20 times.
I just noticed in one trace that some small amount of data was even lost without this message being displayed, so this logic is not 100% perfect.
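To get a feel for how tight the 20% overlap can be, here is a rough sketch of the arithmetic. It assumes the buffer is sized at the 128 megabyte maximum and fills at a constant rate, and it treats the fill rate as equal to the offload rate. The function names are made up for illustration and are not part of STF or TNF.

```python
# Sketch only: assumes a constant fill rate and a buffer at the 128 MB maximum.
BUFFER_MB = 128.0          # maximum TNF kernel trace buffer size
OFFLOAD_THRESHOLD = 0.80   # STF starts offloading at 80% full
CHECK_INTERVAL_S = 5.0     # STF checks the buffer every 5 seconds


def headroom_seconds(fill_rate_mb_s):
    """Seconds until the remaining 20% of the buffer fills at a constant rate."""
    free_mb = BUFFER_MB * (1.0 - OFFLOAD_THRESHOLD)
    return free_mb / fill_rate_mb_s


def worst_case_offload_window(fill_rate_mb_s):
    """Headroom left if the threshold was crossed just after a check.

    In that case almost a full check interval passes before STF notices.
    A result of zero or less means the buffer can overflow before STF reacts.
    """
    return headroom_seconds(fill_rate_mb_s) - CHECK_INTERVAL_S


if __name__ == "__main__":
    for rate in (1.0, 2.5, 5.0, 10.0):
        print(f"{rate:5.1f} MB/s: headroom {headroom_seconds(rate):6.2f} s, "
              f"worst-case window {worst_case_offload_window(rate):6.2f} s")
```

Under these assumptions, a trace filling the buffer at 2.5 megabytes per second has about 10 seconds of headroom, of which up to 5 can be gone before STF even notices. The offload itself then has to keep up within what is left.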
I hope to start using DTrace at some point, but the first time I tried it a few years back its overhead was so much higher than that of TNF that I had to give up on that effort. DTrace has improved since then, so time permitting I will try again.
Until then, make sure that when you collect a high IOPS STF trace, you offload your trace data to a device that gives you decent throughput. For the above-mentioned trace at 30,000 IOPS, STF needed to offload trace data at about 2.5 megabytes per second.
Note: when using SVM each I/O is traced twice (logical and physical), and with VxVM each I/O is traced three times (logical, multi-path, and physical). Be aware of the extra trace data that this creates.
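As a back-of-the-envelope aid, the sketch below scales the one data point given here (about 2.5 megabytes per second at 30,000 IOPS) to other workloads. The per-record size is inferred from that data point and is not a documented TNF value. It also assumes that the reference trace produced one trace record per I/O, which may not have been the case, and that all records are the same size. The function and table names are hypothetical.

```python
# Sketch only: the per-record size is inferred, not a documented TNF value.
REFERENCE_IOPS = 30_000    # IOPS of the trace mentioned in this post
REFERENCE_MB_S = 2.5       # offload rate observed for that trace

# Assumption: the reference trace produced one trace record per I/O.
BYTES_PER_RECORD = REFERENCE_MB_S * 1_000_000 / REFERENCE_IOPS

# Trace records per I/O, following the note above.
RECORDS_PER_IO = {
    "none": 1,   # no volume manager (assumed to be one record per I/O)
    "svm": 2,    # logical and physical
    "vxvm": 3,   # logical, multi-path, and physical
}


def estimated_offload_mb_s(iops, volume_manager="none"):
    """Rough offload rate in MB/s (decimal) needed to keep up with a workload."""
    records_per_second = iops * RECORDS_PER_IO[volume_manager]
    return records_per_second * BYTES_PER_RECORD / 1_000_000


if __name__ == "__main__":
    for vm in ("none", "svm", "vxvm"):
        rate = estimated_offload_mb_s(30_000, vm)
        print(f"{vm:5s}: about {rate:.1f} MB/s at 30,000 IOPS")
```

Treat the result as a rough planning number only. If the reference trace was itself taken on SVM or VxVM, the real per-record size is smaller and these estimates are too high.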