Solaris Performance Analysis and Monitoring Tools... at what cost?

In the area of Performance Analysis and related Monitoring tools, you'll find a plethora available for the Solaris environment. Each of them has its own intrinsic costs, listed here:

  • Monetary Costs ($$) :

    • Purchase Cost (Media, Documentation, etc..)
    • License Fees
    • Centralized or Management Server Required ? (HW Costs for System / Storage)
    • Hourly Costs of a Staff Member / SME / Consultant to Install/Configure, Correlate, Interpret, and Report findings...

  • Time / Effort Costs :

    • 3rd Party Installation / Configuration Pre-Requisites (libraries, tools Perl/Python.., etc..)
    • Server OS and Tools Design Requirements (Security, OS rev's, RAM, CPU, Storage, FS, Patches,..)
    • Server Installation (Rack/Stack, Network, Power, OS Install, Patching, Storage Cfg,..)
    • Server Toolset Installation (Installation, Configuration, License downloads, ..)
    • Client Node Agents Required for Installation / Configuration ?
    • Project/Manpower Time and Coordination Dependencies for an SME/Consultant vs. Other Resources (system, network, storage, etc...).
    • Time spent Installing, Configuring, Testing, Patching/Tweaking, Running, Correlating data, Analyzing/ Interpreting data, Reporting Findings...
    • Time spent learning the Toolset and how to interpret the raw and correlated data (thresholds, etc?)

  • System Overhead Costs :

    • CPU Consumption (% standard overhead vs. PEAK Load overhead)
    • Memory Consumption (RAM footprint.. standard vs. Peak load)
    • Storage Requirements (Toolset Installation space vs. active / historical storage vs. archiving req's)
    • RunTime Requirements (Running Constantly vs. During Specific PEAK load Intervals, Sampling Rate, ..)
    • Network Overhead (bandwidth and/or interrupt overhead due to data passed between client/server repository vs. local storage)
    • I/O Overhead (overhead of performing local IO.. generally depends on volume of data stored and sampling rates)
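
The storage side of these overhead costs can be roughed out with simple arithmetic. The following is a minimal sketch, assuming a fixed sampling interval and a hypothetical record size of 200 bytes per sample (an illustrative figure, not a measured one); the function names are made up for this example.

```python
def samples_per_day(interval_seconds):
    """Number of samples collected per day at a fixed sampling interval."""
    return 86400 // interval_seconds


def daily_storage_bytes(interval_seconds, bytes_per_sample):
    """Approximate bytes written per day for one stream of samples."""
    return samples_per_day(interval_seconds) * bytes_per_sample


# Hypothetical record size (an assumption, not a measured value).
BYTES_PER_SAMPLE = 200

# Compare 1 second sampling against 1, 5, and 10 minute sampling.
for interval in (1, 60, 300, 600):
    per_day = daily_storage_bytes(interval, BYTES_PER_SAMPLE)
    print(f"{interval:>4}s interval: {samples_per_day(interval):>6} samples/day, "
          f"{per_day / 1024:.1f} KiB/day")
```

Whatever the real record size is, 1 second sampling writes 60 times as many records per day as 1 minute sampling, which is the trade-off between detail and storage.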

The Benefits of Accurate, Detailed, and Complete Data Gathering ...

\*\* NOTE: .. a key Attribute often left out is the ACCURACY and RELEVANCE of performance data captured (based upon the time it was captured, the sampling rates, and the level of detail provided).

This in many instances requires weighing the costs of having point in time "detailed" event snapshots (where the sampling intervals are very narrow.. per second, etc.) vs. long-term historical trending data (where samples are aggregated and averaged over longer timeframes, minimizing the storage requirements but also smoothing out the Peak load visibility). For example, a toolset or individual utility that can capture performance data at 1 second intervals gives you a very granular view of system utilization and PEAK load activity (resource consumption, contention events, etc.).. vs. a historical trending toolset that can only save data as 1, 5, or 10 minute averages (due to the constraints of storage space available for the long periods of data that must be kept).

This might not seem like much would be missed, however.. even the difference between 1 second and 1 minute samples can be astronomical. If 80% of the one-second samples within a minute show 95% idle and the other 20% show 100% utilization (0% idle) with a huge run queue, they get "smoothed" out to a single one minute sample where the box "appears" only 24% utilized (76% idle: 0.80 x 5% + 0.20 x 100% = 24%).. although the system is thrashing 20% of the time.  Even within the period of a second, over a billion instructions get run on modern CPUs running at GHz+ clock rates (billions of cycles per second).. and only one aggregated sample exists for that period.
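
The smoothing effect can be checked with a few lines of arithmetic. This is a minimal sketch using 100 hypothetical one-second samples (an assumption chosen so the counts match the 80/20 split); the values are illustrative, not captured from a real system.

```python
# 100 hypothetical one-second CPU utilization samples (percent busy):
# 80 samples at 5% busy (95% idle) and 20 samples at 100% busy (0% idle).
samples = [5] * 80 + [100] * 20

average_busy = sum(samples) / len(samples)
peak_busy = max(samples)
saturated_fraction = sum(1 for s in samples if s >= 100) / len(samples)

print(f"average utilization: {average_busy:.0f}%")      # average utilization: 24%
print(f"peak utilization:    {peak_busy}%")             # peak utilization:    100%
print(f"time saturated:      {saturated_fraction:.0%}")  # time saturated:      20%
```

The single averaged figure (24%) hides both the peak and the share of time spent saturated, which only the fine-grained samples reveal.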

For complete end-to-end Capacity Planning and Performance Analysis capabilities, BOTH types of data are generally required (longer term trending for Capacity Planning purposes via graphs, etc.. vs. short term detailed drill down of system activity for point in time PEAK LOAD periods, allowing for detailed performance and utilization assessment / correlation).
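
One way to get some of both views from a single data stream is to keep the peak alongside the average whenever fine-grained samples are rolled up. The sketch below is only an illustration of that idea, not how any particular toolset works; the `summarize` helper and the sample data are hypothetical.

```python
from statistics import mean


def summarize(samples, window):
    """Collapse fine-grained samples into per-window (average, peak) pairs."""
    summaries = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        summaries.append((mean(chunk), max(chunk)))
    return summaries


# Hypothetical data: two minutes of one-second CPU utilization samples.
# The first minute is quiet; the second contains a 12 second burst at 100%.
one_second = [5] * 60 + [5] * 48 + [100] * 12

for minute, (avg, peak) in enumerate(summarize(one_second, 60)):
    print(f"minute {minute}: avg={avg:.0f}% peak={peak}%")
```

Storing the peak costs one extra value per window, yet it preserves the fact that the second minute hit 100% even though its average is only 24%.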

Without detailed and granular data during peak periods, there can be no real correlation of root causes or specific bottlenecks... and in the same regard, without long-term, historical data that shows growth rates in capacity and cycles (patterns and models) of utilization and Peak activity.. accurate Capacity Planning isn't feasible.


..  if data captured doesn't include peak activity, or the granularity of samples is too sparse.. (not reflecting peak events), ...  then that data can only be useful for defining a BASELINE of Average Utilization.


MANY, many, .. tools


A wide variety of performance tools can be found.. from the high end, using end-to-end third party products such as Teamquest (which provides a graphical, historical vantage point) that need to be purchased, installed, and trained on... to the OS built-in utilities and the freely available open source / public domain varieties.

However, either way you go, be prepared for the required learning curve.. along with the extensive manual process and time required to identify and run the utilities, before you can capture and begin the extensive correlation process on the data from several disparate utilities (before you even get to do the analysis of your findings).

Either approach has its advantages and disadvantages.. along with its strengths and weaknesses (3rd party purchased suites might save you time in graphical aggregation and correlation.. but tend to limit the level of detail and granularity available vs. what the OS utilities will provide).

The basic list of KEY "built-in" tools historically available for monitoring performance applies to nearly any Unix/Linux distribution, including the following partial list of common utilities used ... following the basic breakdown of computing subsystems :

\*\* CPU / Kernel Utilization :

--> vmstat (vm system cpu and kernel utilization metrics \*\* a great starting pt \*\*)
--> mpstat (multi processor .. per cpu performance statistics)

\*\* Memory / Kernel Utilization :

--> vmstat
--> ipcs

--> swap
--> top

\*\* I/O Performance :

--> iostat (Standard IO.. ufs, .. IO performance utility)
--> vxstat (Veritas Volume Manager, VxVM, volume IO performance statistics)

\*\* Network Utilization :

--> netstat
--> ping
--> traceroute

\*\* Process / Kernel :

--> ps
--> top
--> prstat

--> ...

\* sar (provides most basic types of high level performance metrics, assuming that system accounting is turned on, which does incur some level of system overhead when always running)
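
As a sketch of the kind of manual post-processing these utilities call for, the following parses vmstat-style text and flags saturated samples. It assumes the Solaris column layout, where the first column is the run queue ('r') and the last three columns are 'us', 'sy', and 'id'; the `parse_vmstat` helper is hypothetical and the sample text is made up for illustration, not captured from a real system.

```python
def parse_vmstat(text):
    """Extract run-queue length and CPU percentages from vmstat-style output.

    Assumes the first column is 'r' (run queue) and the last three
    columns are 'us', 'sy', 'id', as in the Solaris vmstat layout.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and header lines (non-numeric first field).
        if not fields or not fields[0].isdigit():
            continue
        rows.append({
            "runq": int(fields[0]),
            "usr": int(fields[-3]),
            "sys": int(fields[-2]),
            "idle": int(fields[-1]),
        })
    return rows


# Illustrative sample only; these numbers are invented for the example.
SAMPLE = """\
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 114560 41200   1  41 19  1  3  0  2  0  4  0  0   48  112  130  4 14 82
 9 0 0 104340 12880   0 250  0  0  0  0  0  0 31  0  0  820 9120 2310 71 29  0
"""

for row in parse_vmstat(SAMPLE):
    # A non-empty run queue with 0% idle suggests CPU saturation.
    flag = "  <-- saturated" if row["idle"] == 0 and row["runq"] > 0 else ""
    print(row, flag)
```

Indexing the CPU columns from the end of the line avoids depending on how many disk columns a given system reports.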


SOLARIS 10 ... Above and Beyond other Unix / Linux Distributions ... 


In addition to the basic toolsets available, Solaris 10 provides the following key additions, which set it apart from the other Unix / Linux variants.

\*\* DTrace (Dynamic Tracing via "D" language scripting and probe/providers)

__ DTrace is the "Electron microscope" of performance analysis for a Solaris 10 system.
See the DTraceToolkit for a long list of specific DTrace scripts (several of which are used within sys_diag, among others created).

\*\* lockstat (uses the kernel DTrace infrastructure; summarizes system lock/mutex contention)

\*\* mdb (Modular Debugger)

\* kstat (Kernel statistics .. counters, etc..)

\* cpustat / cputrack (cpu statistics, system-wide or per process)

\* intrstat, trapstat (interrupt and system trap, I/DTLB_miss statistics, ..)

\* ... & many more.. [this list will be re-done in a future blog with a more thorough breakdown.. ]


The Time Saving.. automated nature... of SYS_DIAG   :)

Over the past several years, I have created a utility called "sys_diag" that automatically captures performance statistics using nearly all available system utilities, aggregates the data, performs analysis, and generates an HTML report of findings (with a color coded dashboard and links to detailed analysis findings). Sys_diag creates a single .tar.Z compressed archive that can be emailed/ftp'd for performing system configuration and/or performance analysis off-site, from virtually anywhere.. saving a LOT of time. No 3rd party tools or agents need to be installed on a system; only the "sys_diag" ksh script itself has to be downloaded.  Virtually no learning curve is required for loading, running, and reviewing basic performance profiling, including high level subsystem bottlenecks (deeper root cause correlation might require some level of advanced sys admin knowledge).

Beyond performance analysis, sys_diag can also be used to generate a detailed configuration snapshot report, including OS, HW, Storage, SW, and 3PP configuration attributes, among several other capabilities that it provides.

\*\* See the next blog entry for more details and examples on sys_diag \*\*.
The published repository and high level description of sys_diag is always available at BigAdmin using the following URL :

(Copyright 2007 Todd A. Jobson)




This blog does not reflect the viewpoint or opinions of Oracle or Sun Microsystems. All comments are personal reflections and responsibility of Todd A. Jobson, and are copyrighted from the posted year to current year, to that effect.

