What is sys_diag? Automating Solaris Performance Profiling and Workload Characterization.
By tjobson on Jul 12, 2007
The following is an excerpt from the README_sys_diag.txt file, which gives a
high level overview of the sys_diag capabilities, command line arguments, and
usage examples. I've created this over many years to automate and reduce
the amount of time it takes to gather and correlate system data for conducting
off-site (remote) Performance and Configuration Analysis. With sys_diag,
all you need to do is download the ksh script and you're on your way. After
it's run, you get a single .tar.Z that you can upload or email for remote
analysis (-G even includes a wide variety of DTrace examination).
Read the following introduction; more specific examples will follow.
sys_diag v.7.04 Overview :
BACKGROUND / INTRODUCTION :
sys_diag is a Solaris utility (ksh script) that can perform several
functions, among them system configuration 'snapshot' and reporting
(detailed or high-level), plus workload characterization/profiling via performance
data gathering (over some specified duration, or as a point in time 'snapshot'),
high-level analysis, and reporting of findings/exceptions (based upon
perf thresholds that can be easily changed within the script header).
The output is provided in a single .tar.Z of output and corresponding
data files, and a local sub-directory where report/data files are stored.
The report format is provided in .html, .txt, and .ps as a single file
for easy review (without requiring trudging through several subdirectories
of separate files to manually correlate and review).
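As a rough illustration of what threshold-based exception reporting can look like, here is a minimal sketch. It is not taken from the sys_diag source: the variable CPU_IDLE_MIN, its value of 10, the function name, and the two-column input layout are all assumptions made for the example.

```shell
# Hypothetical threshold (sys_diag's real variable names and defaults
# live in its script header and are not shown in this excerpt).
CPU_IDLE_MIN=${CPU_IDLE_MIN:-10}

# flag_cpu_exceptions: read "<sample_no> <idle_pct>" lines on stdin and
# report every sample whose idle percentage falls below the threshold.
flag_cpu_exceptions() {
    awk -v min="$CPU_IDLE_MIN" '
        $2 < min {
            printf "EXCEPTION sample=%s idle=%s%% (< %s%%)\n", $1, $2, min
            bad++
        }
        END { printf "%d exception(s)\n", bad }
    '
}

# Four made-up samples; samples 2 and 4 are below 10% idle.
printf '%s\n' '1 85' '2 4' '3 40' '4 9' | flag_cpu_exceptions
```

Changing the threshold is then a one-line edit (or an environment override such as CPU_IDLE_MIN=20), which is the general idea behind keeping thresholds in a script header.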
sys_diag runs on any Solaris 2.6 (or above) Sun platform, including
reporting of new Solaris 10 capabilities (zones/containers, SVM,
zfspools, fmd, ipfilter/ipnat, link aggr, DTrace probing, etc.).
Beyond the Sun configuration reporting commands [System/storage HW config,
OS config, kernel tunables, network/IPMP/Trunking/LLT config, FS/VM/NFS,
users/groups, security, NameSvcs, pkgs, patches, errors/warnings, and
system/network performance metrics], sys_diag also captures relevant
application configuration details, such as Sun N1, Sun Cluster 2.x/3.x,
Veritas VCS/VM/vxfs, and Oracle .ora/listener files, along with detailed
configuration capture of key files (and tracking of changes via -t).
Of all the capabilities, the greatest benefits are found by being able
to run this single ksh script on a system and do the analysis from one single report/
file... offline/elsewhere (in addition to being capable of historically
archiving system configurations, for disaster recovery.. or to allow for
tracking system chgs over time.. after things are built/tested/certified).
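To make the idea of tracking configuration file changes over time concrete, here is a minimal sketch of one way it could be done: keep a saved copy of each listed file and compare on the next run. The function name track_changes, the snapshot directory layout, and the sample file are hypothetical and are not how sys_diag -t is necessarily implemented.

```shell
# track_changes LIST_FILE SNAP_DIR
# LIST_FILE holds one path per line (similar in spirit to the -f input file).
# Each listed file is compared with the copy saved on the previous run,
# reported if it differs, and then saved again as the new baseline.
track_changes() {
    list=$1
    snap=$2
    mkdir -p "$snap"
    while IFS= read -r f; do
        [ -f "$f" ] || continue
        saved="$snap/$(echo "$f" | tr '/' '_')"
        if [ -f "$saved" ] && ! cmp -s "$f" "$saved"; then
            echo "CHANGED: $f"
        fi
        cp "$f" "$saved"
    done < "$list"
}

# Demo with a throwaway file (all names below are made up).
work=$(mktemp -d)
echo "a=1" > "$work/app.conf"
echo "$work/app.conf" > "$work/config_files"
track_changes "$work/config_files" "$work/snap"   # first run: baseline only
echo "a=2" > "$work/app.conf"
track_changes "$work/config_files" "$work/snap"   # reports the change
```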
One nice feature for performance analysis is that the vmstat and netstat
data is exported in a text format that is easy to import into StarOffice or
Excel for creating graphs. sys_diag also creates IO and NET device
averages from iostat / netstat data (# IO's per device, AVG R/W K, etc.),
along with peak exceptions for CPU / MEM / IO / NET.
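The per-device averaging described above can be sketched in a few lines of awk. This is only an illustration: the three-column input ("device kr/s kw/s") is a simplified, assumed layout rather than the real iostat or sys_diag data file format, and avg_by_device is a hypothetical name.

```shell
# avg_by_device: read "<device> <kr/s> <kw/s>" lines on stdin and print
# the sample count and the average read/write KB per second per device.
avg_by_device() {
    awk '
        { n[$1]++; r[$1] += $2; w[$1] += $3 }
        END {
            for (d in n)
                printf "%s samples=%d avg_kr=%.1f avg_kw=%.1f\n",
                       d, n[d], r[d] / n[d], w[d] / n[d]
        }
    ' | sort
}

# Three made-up samples across two devices.
printf '%s\n' 'sd0 100 50' 'sd1 10 0' 'sd0 300 150' | avg_by_device
```

The output is one line per device, which also imports cleanly into a spreadsheet if the spaces are used as the field separator.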
Although this tool isn't meant to replace long-term historical Performance
Trending and Capacity Planning packages (Teamquest, etc..), it provides the
foundation and basis for a very robust starting point (and actually is much
better at point in time workload characterization and root cause analysis of
bottlenecks, where very granular detailed data correlation is required).
Even though I'm a Sun employee, this has been personally developed over many
years, in my spare time, in order to make my life a lot easier and
more efficient. Hopefully others will find this utility capable of
doing the same for them, also making use of its legwork to streamline
the admin/analysis activities required of them. This has been an invaluable
tool used to diagnose / analyze hundreds of performance and/or config issues.
Regarding the system overhead, sys_diag runs all commands in a serial
fashion (waiting for each command to complete before running the next)
impacting system performance the same as if an admin were typing these
commands one at a time on a console.. with the exception of the background
vmstat/mpstat/iostat/netstat that's done when gathering performance data
(-g | -G) over some interval for report/analysis (which generally has minimal
impact on a system, especially if the sample interval [-I] is not every
second, or if the lighter weight -g is run vs. -G detailed/Dtrace snapshots).
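The execution model described above (configuration commands run strictly one at a time, with only the performance samplers in the background) can be sketched as follows. This is an illustration, not sys_diag code: start_sampler is a stand-in for a background vmstat/iostat style collector, and all function and file names are assumptions.

```shell
OUT=$(mktemp -d)

# start_sampler INTERVAL COUNT FILE
# Stand-in for a background sampler: appends a timestamp every INTERVAL
# seconds, COUNT times, without blocking the caller.
start_sampler() {
    (
        i=0
        while [ "$i" -lt "$2" ]; do
            date '+%H:%M:%S' >> "$3"
            i=$((i + 1))
            if [ "$i" -lt "$2" ]; then sleep "$1"; fi
        done
    ) &
    SAMPLER_PID=$!
}

# run_serial CMD...
# Runs each command to completion before starting the next, much as an
# admin typing them one at a time on a console would.
run_serial() {
    for cmd in "$@"; do
        echo "== $cmd" >> "$OUT/config.txt"
        sh -c "$cmd" >> "$OUT/config.txt" 2>&1
    done
}

start_sampler 1 2 "$OUT/samples.txt"   # sampling runs in the background
run_serial 'uname -a' 'df -k'          # config commands run serially
wait "$SAMPLER_PID"                    # let the sampler finish
```

A longer sample interval simply means the background loop wakes up less often, which is why the interval setting matters for overhead.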
sys_diag is generally run from /var/tmp as "sys_diag -l" for creating
a detailed long report, or via "sys_diag -g -l" for gathering
performance data and generating a long/detailed config/analysis report.
However, it offers many command line parameters, documented within the header
or via "sys_diag -?". ** READ the Usage below, as well as the Performance
Parameters sections for further enlightenment.. ;)
NOTE: For the best .html viewing experience, do NOT use the MS Internet Explorer browser,
as it varies in support of HTML standards for formatting and iframe file inclusion
(ending up opening many windows vs. embedding output files within
the single .html report). ** USE Netscape, Mozilla, Firefox, etc. browsers,
ensuring that your display resolution is set to the maximum resolution, and
font sizes are defaults or not made too large (for best viewing, open full screen).
*** As is the best practice for any environment, first TEST thoroughly on a representative
TEST configuration PRIOR to running this or making any production system changes
(read the sys_diag ksh headers for disclaimer and support notes). ***
** See : http://blogs.sun.com/toddjobson/ for other entries relating to system performance,
capacity planning, and systems architecture / availability.
** For the latest release of sys_diag see either BigAdmin or SunFreeware.com at the following URLs :
Common Command Line usage and available parameters :
COMMAND USAGE :
# sys_diag [-a -A -c -C -d_ -D -f_ -g -G -H -I_ -l -L_ -n -o_ -p -P -s -S -T_ -t -u -v -V -h|-? ]
-a                        Application details (included in -l/-A)
-A                        ALL Options are turned on, except Debug and -u
-c                        Configuration details (included in -l/-A)
-C                        Cleanup Files and remove Directory if tar works
-d path                   Base directory for data directory / files
-D                        Debug mode (ksh set -x .. echo statements/variables/evaluations)
-f input_file             Used with -t to list configuration files to Track changes for
-g                        gather Performance data (2 sec intervals for 5 mins, unless -I | -T exist)
-G                        GATHER Extra Perf data (S10 DTrace, more lockstats, pmap/pfiles) vs -g
-h | -?                   Help / Command Usage (this listing)
-H                        HA config and stats
-I secs                   Perf Gathering Sample Interval (default is 2 secs)
-l                        Long Listing (most details, but not -g,-V,-A,-t,-D)
-L label_descr_nospaces   Descriptive Label For Report
-n                        Network configuration and stats (also included in -l/-A except ndd settings)
-o outfile                Output filename (stored under sub-dir created)
-p                        Generate Postscript Report, along with .txt, and .html
-P -d ./data_dir_path     Post-process the Perf data skipped with -S and finish .html rpt
-s                        Security configuration
-S                        SKIP POST PROCESSing of Performance data (use -P -d data_dir to complete)
-t                        Track configuration / cfg_file changes (Saves/Rpts cfg/file chgs *see -f)
-T secs                   Perf Gathering Total Duration (default is 300 secs = 5 mins)
-u                        unTar'ed (do NOT create a tar file)
-v                        version Information for sys_diag
-V                        Verbose Mode (adds path_to_inst, network dev's ndd settings, mdb, snoop..)
                          Longer message/error/log listings. Additionally, pmap is run if -g || -G,
                          and the probe duration for DTrace and lockstat sampling is widened
                          from 2 seconds (during -G) to 5 seconds (if -G && -V). Ping is
                          also run against the default route and google.com to gauge latency.
NOTE: NO args equates to a brief rpt (No -A,-g/I,-l,-t,-D,-V,..)
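As a quick aid for choosing -I and -T values, the number of samples in a run is roughly the total duration divided by the sample interval. The helper below is a hypothetical sketch that assumes plain integer division; how sys_diag itself rounds or bounds the count is not stated in this excerpt.

```shell
# sample_count TOTAL_SECS INTERVAL_SECS
# Prints the approximate number of samples for a -T / -I combination.
sample_count() {
    total=$1
    interval=$2
    if [ "$interval" -le 0 ]; then
        echo "interval must be greater than 0" >&2
        return 1
    fi
    echo $(( total / interval ))
}

sample_count 300 2     # the defaults: 300 secs at 2 sec intervals
sample_count 1800 1    # e.g. -I 1 -T 1800
```

With the defaults this works out to 150 samples; at a 1 second interval over 30 minutes it is 1800, which is worth keeping in mind when sizing the output directory.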
** Also, note that option/parameter ordering is flexible, as is the use of white
space before arguments to parameters (or not). The only requirement is to list
every option/parameter separately with a preceding - (-g -l , but not -gl).
BOTH of the following command line syntax examples are functionally the same :
eg. ./sys_diag -g -I 1 -T 1800 -t -f ./config_files -l
    ./sys_diag -g -l -t -f./config_files -I1 -T1800
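The equivalence of "-I 1" and "-I1", and of different option orderings, is the same behavior the standard shell getopts builtin provides. The sketch below demonstrates that with a hypothetical parser for a small subset of the options (-g, -l, -I, -T). It is not the sys_diag parser: getopts would also accept combined flags such as -gl, which sys_diag does not.

```shell
# parse_perf_opts: hypothetical parser for -g, -l, -I secs, -T secs.
# Defaults (2 sec interval, 300 sec duration) follow the usage listing.
parse_perf_opts() {
    OPTIND=1
    gather=0
    long=0
    interval=2
    total=300
    while getopts "glI:T:" opt; do
        case $opt in
            g) gather=1 ;;
            l) long=1 ;;
            I) interval=$OPTARG ;;
            T) total=$OPTARG ;;
            *) return 1 ;;
        esac
    done
    echo "gather=$gather long=$long interval=$interval total=$total"
}

# Different ordering and spacing, same parsed result.
parse_perf_opts -g -I 1 -T 1800 -l
parse_perf_opts -g -l -I1 -T1800
```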
------------------------------------------------------------------------------------------------
eg. Common Usage :
./sys_diag -l
    Creates a LONG / detailed configuration rpt (.html/.txt).
    Without -l, the config report created has basic system cfg details.
./sys_diag -g -l
    Gathers performance data at the default sampling rate of 2 secs for
    a total duration of 5 mins, adding a color coded performance header/
    Dashboard Summary section and any performance findings/
    exceptions found to the long (-l) cfg rpt. Also takes (3) starting/
    midpt/endpoint snapshots using minimal lockstat/kstat (1 sec).
    NOTE: -g is meant to gather perf data with minimal overhead, therefore
    only 1 second lockstat samples are taken. Use -G and/or -V
    for more detailed system probing (see examples and notes below).
    Using -V with -g adds pmap/pfiles snapshots, vs. using -G
    to also capture DTrace and extended lockstat probing.
    * Any time that sys_diag is run with either -g or -G, the command
    line output is appended to the file sys_diag_perflog.out, which
    gets copied and archived as part of the final .tar.Z output file.
./sys_diag -g -I 1 -T 600 -l
    Gathers perf data at 1 sec samples for 10 mins and
    creates a long config rpt as noted above. Also does
    basic start/mid/endpoint sampling using lockstat/kstat/pmap.
./sys_diag -l -C
    Creates long config rpt, and Cleans up,
    aka removes the data directory after tar.Z completes.
./sys_diag -d base_directory_path
    Changes the base dir for datafiles from curr dir.
./sys_diag -G -I 1 -T 600 -l
    Gathers DEEP performance & DTrace/lockstat/pmap data
    at 1 sec samples for 10 mins & creates a long cfg rpt
    (in addition to the standard data gathering from -g).
    * NOTE: this runs all DTrace/lockstat/pmap probing during 3 snapshot intervals
    (beginning_#0, midpoint_#1, and endpoint_#2 snapshots), limiting probing
    overhead to BEFORE/AFTER the standard data gathering
    (vmstat, mpstat, iostat, netstat, .. from -g).
    The MIDPOINT probing occurs at a known point so as not to confuse this
    activity with other system processing.
    * Because of this, standard data collection may not start for 30+ seconds,
    or until the beginning snapshot (snapshot_#0) is complete.
    (-g snapshot_#0 activities only take a couple seconds to complete, since
    they do not include any DTrace/lockstat beyond 1 sec samples.)
./sys_diag -G -V -I 1 -T 600
    Gathers DEEP, VERBOSE performance & DTrace/lockstat/pmap
    data at 1 sec samples for 10 mins (using 5 second DTrace and
    lockstat snapshots, vs. 2 second probes for only -G),
    in addition to the standard data gathering from -g.
./sys_diag -g -l -S
    Gathers perf data, runs long config rpt, and SKIPS Post-Processing
    and .html report generation.
    ** This allows for completing the post-processing/analysis activities
    either on another system, or at a later time, as long as the data_directory
    exists (which can be extracted from the .tar.Z, then referred to as
    -d data_dir_path ). ** See the next example using -P -d data_path **
./sys_diag -P -d ./data_dir_path
    Completes Skipped Post-Processing & .html rpt creation.
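The two-step -S / -P workflow above can be wrapped in a small script. The sketch below is only an illustration: the run, gather_only, and finish_report helpers and the DRY_RUN switch are assumptions added here, and by default the commands are only printed so the sketch can be read or tried on a machine without sys_diag. The data directory name is the placeholder from the usage text, not a real path.

```shell
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1, on the system being measured: gather data, skip post-processing.
gather_only() {
    run ./sys_diag -g -l -S
}

# Step 2, later or on another system: finish the report.
# $1 = the data directory (extracted from the .tar.Z if it was moved).
finish_report() {
    run ./sys_diag -P -d "$1"
}

gather_only
finish_report ./data_dir_path
```

Setting DRY_RUN=0 would execute the same two commands for real, which assumes sys_diag is present in the current directory.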
(Copyright 2007, Todd A. Jobson)