What does sys_diag v8.3 offer? The Easiest, Most Complete, Automated Solaris Performance Profiling and Workload Characterization Tool
By User12610236-Oracle on Jan 14, 2017
The following is an excerpt from the full sys_diag_Users_Guide.pdf, which gives a high-level overview of the sys_diag capabilities, command line arguments, and usage examples.
I've created this over many years to automate and reduce the amount of time it takes to gather and correlate system data for conducting off-site (remote) Performance/Configuration/Consolidation.. or other forms of Architectural Analysis.
With sys_diag, all you need to do is download the ksh script.. and you're on your way.
After it's run, you can click on the .html report to explore the automated findings, or take the single generated .tar.Z and upload or email it for remote analysis (-G even includes a wide variety of DTrace examination and the deepest probing available).
Read the following introduction, or for the complete deep dive, download the sys_diag_Users_Guide.pdf .
Even better.. download the latest version of sys_diag and try it out ! (read through the .ksh header for complete chronology of Update history, usage, and agreements, ..).
NOTE: for either download, right-click on the links above and Save-As. (then # uncompress sys_diag.Z on a Solaris system)
sys_diag v.8.3g Overview :
Each run of sys_diag creates a local sub-directory where all datafiles captured or created (analysis, reports, graphs generated) are stored. Upon completion, sys_diag packs that directory into a single compressed .tar.Z archive for examination externally.
The report format is provided in .html, and .txt as a single file for easy review (without requiring trudging through several subdirectories of separate files potentially thousands of lines long each, to manually correlate and review for hours /days.. before manually generating the assessment report and/or any graphs needed). This tool will literally save you a week of analysis for complicated configurations that require diagnosis.
sys_diag has previously been run on Solaris 2.x and later platforms, and today should be capable of running on any x86 or SPARC Solaris 8+ system. Version 8.3 includes reporting on new Solaris 11.3 capabilities (zones, LDOMs/OVM, SRM, zfspools, fmd, ipfilter/ipnat, link aggregation, DTrace probing, etc.).
Beyond the Solaris configuration reporting commands (System/storage HW config, OS config, kernel tunables, network/IPMP/Trunking config, ZFS/FS/VM/NFS, users/groups, security, NameSvcs, pkgs, patches, errors/warnings, and system/network performance metrics), sys_diag also captures relevant application configuration details, such as Sun Cluster 2.x/3.x, Veritas VCS/VM/vxfs, Oracle .ora/RAC/CRS/listener.., MySQL.., along with other detailed configuration capture of key files (and tracking of changes via -t), etc.
Of all the capabilities, the greatest benefits are found by being able to run this single ksh script on a system and do the analysis from one single report/ file offline/elsewhere.
Regarding the system overhead, sys_diag runs all commands serially (waiting for each command to complete before running the next), impacting system performance the same as if an admin were typing these commands one at a time on a console. The only exception is the background vmstat/mpstat/iostat/netstat (-g) performance gathering of metrics at the specified sampling interval (-I) and total duration (-T), which generally has negligible overhead on a "healthy" system.
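To make the relationship between -I and -T concrete, here is a minimal sketch (not taken from the sys_diag source) of how a sampling interval and a total duration translate into the number of samples each background collector takes. The function name sample_count is hypothetical; the 5 second / 300 second values are the defaults listed in the usage section.

```shell
#!/bin/sh
# Hypothetical helper, not part of sys_diag: number of samples a collector
# takes when sampling every INTERVAL seconds for TOTAL seconds.
sample_count() {
    interval=$1
    total=$2
    echo $(( total / interval ))
}

# Documented defaults: 5 second samples for 300 seconds.
count=$(sample_count 5 300)
echo "default run: $count samples per collector"

# A collector could then be started in the background along these lines
# (commented out so the sketch has no side effects; the output file name
# is a placeholder):
#   vmstat 5 "$count" > vmstat.out &
```

With "-I 1 -T 600" the same arithmetic gives 600 samples per collector, which is worth keeping in mind when estimating how large the data directory will grow.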
Workflow (order of execution) of a typical sys_diag run (with arguments “-g -I1 -l”) :
- Extract README_sys_diag.txt (note this is an older summarized version of the complete Users_Guide)
- Beginning BME (0=Begin/1=Midpt/2=EndPt) Profiling SNAPSHOT (#0), *ONLY IF Not Excluded via “-x”, & using verbosity via “-v” (to profile the system point-in-time SNAPs serially with prstat, ps, iostat, netstat, zpool, tcpstat,.. *before any background collection is started*).
- Initiate BACKGROUND Data Collection at a (“-I x”) x second sampling rate for total duration default 300 seconds (5mins) or t Total Seconds via “-T t”. [data gathered includes vmstat, mpstat, iostat, netstat, .. and if non-gz capped: also zonestat]
- WAIT until the MidPoint of performance data gathering
- Initiate BME Midpoint Profiling SNAPSHOT (#1), *ONLY IF Not Excluded via “-x”, & using Deep Verbosity via “-V”, AND IF >3mins of Total duration remains
- WAIT for Background Data Collection to Complete
- Initiate BME Endpoint Profiling SNAPSHOT (#2), *ONLY IF Not Excluded via “-x”, & using verbosity via “-v”.
- Capture System Configuration Data for report (following the TOC Table of Contents Outline)
- *Generate the complete .html report*
- Identify the Data_Directory Path, the HTML Report File link that can be opened for examination
- Create a compressed tar.Z archive of DataDirectory (all+ sys_diag & perflog)
*NOTE : See Section 12 for the actual command line output running sys_diag *
sys_diag is generally run from the same directory each time (eg. /var/tmp), one with enough available disk space for storing the data directories and archives (however, the data directory and all files can be removed after each run using -C). When it is always run from the same directory, each run appends to a single sys_diag_perflog.out file, building a chronology of system performance that can be referred to later.
NOTE: For the best .html viewing experience, Do NOT use MS Internet Explorer browser as it varies in support of HTML stds for formatting and iframe file inclusion (ending up opening many windows vs embedding output files in a single .html report)
** USE Chrome, Firefox as recommended browsers ** (for best viewing open full screen)
_____________________________________________________________________________________________________________________
3.0 Command Line Arguments & available parameters :
_____________________________________________________________________________________________________________________
COMMAND USAGE :
# sys_diag [-a -A -c -C -d_ -D -f_ -g -G -H -I_ -l -L_ -n -N -o_ -p -P -q -s -S -T_ -t -u -v -V -h|-?]

-a              Application / DB Configs (included in -l/-A, Oracle/RAC/MySQL/SunRay ..)
-A              ALL Options are turned on, except Debug and -u
-b              Generate a Performance Thresholds "Baseline" profile (see -B or default fname used)
-B (1 | 2)      Use Baseline file Threshold Analysis Calculation (1=Range HWM, 2=StdDev)
-c              Configuration details (included in -l/-A)
-C              Cleanup Files and remove Directory if tar works
-d path         Base directory for data directory / files
-D              Debug mode (ksh set -x .. echo statements/variables/evaluations)
-e email_addr   Emails sys_diag .tar.Z file upon completion (assuming sendmail is configured)
-f input_file   Used with -t to list configuration files to Track changes of
-g              gather Performance data (def: 5 sec samples for 5 mins, unless -I |-T exist)
-G              GATHER Extended Perf data (S10+ Dtrace, lockstats+, pmap/pfiles) vs -g
-h | -?         Help / Command Usage (this listing) / Version_#
-H              HA configuration and stats (Solaris Cluster, VCS, ..)
-I secs         Perf Gathering Sample Interval (default is 5 secs)
-l              Long Listing (most details, but not -g|-G,-v|-V,-A,-t,-D)
-L label_descr_nospaces   (Descriptive Label For Report)
-n              Network configuration and stats (also included in -l/-A except ndd settings)
-N              No Graph generation in HTML Reports.
-o outfile      Output filename (stored under sub-dir created)
-p cminp        Specify Individual Performance Subsystems for data capture (for -g | -G).
                [eg “-p cminp” selects All (CPU|Mem|IO|Net|Process), “-p cn” only cpu & net]
-P -d ./data_dir_path   Post-process the Perf data skipped with -S and finish .html rpt
-q              Quiet mode, disables command line output. (*not yet fully implemented*)
-s              Security configuration
-S              SKIP POST PROCESSing of Performance data (use -P -d data_dir to complete)
-t              Track configuration / cfg_file changes (Saves/Rpts cfg/file chgs *see -f)
-T secs         Perf Gathering Total Duration (default is 300 secs =5 mins)
-u              unTar ed: (do NOT create a tar file)
-v              Extended verbosity level 1 (for -g perf gathering, examines more top procs,
                Also adds pmap/pfiles/ptree, and lightweight lockstat to BME SNAPSHOTS).
-V              Deep Verbosity level 2 (adds path_to_inst, netwk dev settings, snoop..)
                Longer message/error/log listings. Additionally, the probe duration for
                Dtrace and lockstat sampling is widened from 2 seconds (during -G) to 5 seconds
                (if -G && -V). Ping is also run against the default route and google.com.
                If -g|-G & -V, then mdb memory usage is captured (page cache, kernel, anon..).
-x              Excludes lockstat, intrstat, plockstat (DTrace usage), pfiles & mdb from
                -g|-G performance data gathering, also skipping Midpt BME snapshots.
----
BOTH of the following command line syntax examples are functionally the same (order/spacing doesn’t matter):
eg. ./sys_diag -g -v -I 1 -T 600 -l
OR  ./sys_diag -g -l -I1 -T600 -v
NOTE: NO args equates to a brief rpt with NO Performance capture (No -A,-g/I,-l,-t,-D,-V,..)
** Also, note that option/parameter ordering is flexible, as is the use of white space between an option and its argument. The only requirement is to list every option/parameter separately with a preceding - (-g -l , but not -gl).
-----------------------------------------------------------------------------------------
**********************************
** EXIT Status ** (Return Code) :
**********************************
0 if OK, non-zero if an error occurred or if "Performance EXCEEDED Thresholds!" conditions were found.
IF Performance Gathering and Analysis (-g|-G) has noted EXCEEDED Thresholds, THEN a bitmask is produced from the following conditions (added together to produce a single integer exit/return code) :
RED    (Critical) CPU Alarm       : return_code = return_code + 1
RED    (Critical) Memory Alarm    : return_code = return_code + 2
RED    (Critical) StorageIO Alarm : return_code = return_code + 4
RED    (Critical) Network Alarm   : return_code = return_code + 8
YELLOW (Warning)  CPU Alarm       : return_code = return_code + 16
YELLOW (Warning)  Memory Alarm    : return_code = return_code + 32
YELLOW (Warning)  StorageIO Alarm : return_code = return_code + 64
YELLOW (Warning)  Network Alarm   : return_code = return_code + 128
Therefore, if you take the return code and start by subtracting the highest values, you can identify which subsystems (cpu/memory/storageIO/network) had alarms.
eg. root# echo $? will give you the exit code of the last run command/utility
Therefore, if sys_diag returned an exit code of 129, then that depicts :
return_code - 128 shows that Network Warnings (YELLOW) were present.. and
return_code - 1 shows CPU (RED) Critical Alarms
(essentially, start subtracting the largest exceptions, take the remainder, and go down the list.. so an exit code of 5 would have been RED_IO & RED_CPU)
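The subtraction procedure above is the same as testing each bit of the return code, so it can be automated. The following is a minimal sketch, assuming a shell with POSIX arithmetic expansion (ksh or bash; the legacy Solaris Bourne /bin/sh does not support it). The function name decode_sys_diag_rc and the short alarm labels other than RED_CPU and RED_IO are hypothetical, chosen by analogy with the table. Keep in mind that a non-zero code can also mean that an error occurred, so the decoded names are only meaningful for a run that actually completed its -g|-G threshold analysis.

```shell
#!/bin/sh
# Hypothetical helper, not part of sys_diag: decode the alarm bitmask
# documented above into short subsystem labels.
decode_sys_diag_rc() {
    rc=$1
    alarms=""
    bit=1
    # Labels are listed in ascending bit order: 1, 2, 4, ... 128.
    for name in RED_CPU RED_MEM RED_IO RED_NET \
                YELLOW_CPU YELLOW_MEM YELLOW_IO YELLOW_NET; do
        if [ $(( rc & bit )) -ne 0 ]; then
            alarms="$alarms $name"
        fi
        bit=$(( bit * 2 ))
    done
    if [ -z "$alarms" ]; then
        echo "none"
    else
        echo "${alarms# }"    # strip the leading space
    fi
}

# The two worked examples from the text:
decode_sys_diag_rc 129    # prints: RED_CPU YELLOW_NET
decode_sys_diag_rc 5      # prints: RED_CPU RED_IO
```

In practice the function would be called right after the run, for example `./sys_diag -g; decode_sys_diag_rc $?`.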
----------------------------------------------------------------------------------------------
____________________________________________________
4.0 Common Command Line Usage examples :
____________________________________________________
./sys_diag -l
Creates a LONG (detailed) configuration snapshot report in both HTML (.html) and Text (.out) formats. Without -l, the config report created has minimal system cfg details. Note that -l (as with most cmd line arguments) can be added when capturing performance data to create a more complete rpt.

./sys_diag -g
Gathers performance data at the default sampling rate of 5 secs for a total duration of 5 mins, creating a color coded HTML rpt with header/Dashboard Summary section and performance details/findings/exceptions found. Also runs the BME starting/endpoint snapshots (before/after background data gathering of vm/mp/io/netstat..).
*This example will NOT create detailed configuration report sections.
NOTE: -g is meant to gather perf data without overhead, therefore only 1 second lockstat samples are taken. Use -G and/or -V for more detailed system probing (see examples and notes below). Using -v/-V with -g adds pmap/pfiles snapshots, vs. using -G to also capture Dtrace and extended lockstat probing.
** Any time that sys_diag is run with either -g or -G, the performance dashboard/summary section of the command line output is appended to the file sys_diag_perflog.out, which gets copied and archived as part of the final .tar.Z output file.

./sys_diag -g -l -I 1 -T 600
Gathers perf data at 1 sec samples for 10 mins, also does basic BME Begin/Midpt/Endpoint sampling, and creates a long/detailed configuration report.

./sys_diag -l -g -C
Creates a long configuration snapshot report, gathers basic performance data/analysis, and Cleans up (aka removes the data directory) after data directory archive compression (.tar.Z).

./sys_diag -d base_directory_path -l …
(-d changes the data directory location to be created)

./sys_diag -G -l -T 600
Gathers DEEP performance & Dtrace/lockstat/pmap data at the default Interval (sampling rate of 5 secs) for 10 mins (including the std data gathering from -g).
*NOTE: this runs all Dtrace/Lockstat/Pmap probing during BME snapshot intervals (beginning_0/midpoint_1 w -V/ and endpoint_#2 snapshots), limiting probing overhead to BEFORE/AFTER the standard data gathering begins (vmstat, mpstat, iostat, netstat, .. from -g). The MIDPOINT probing occurs at a known point so as not to confuse this activity with other system processing.
*Because of this, standard data collection may not start for 30+ seconds, or until the beginning snapshot (snapshot_#0) is complete. (-g snapshot_#0 activities only take a couple seconds to complete, since they do not include any Dtrace/lockstat.. beyond 1 sec samples).

./sys_diag -G -V -I 1 -T 600
Gathers DEEP, VERBOSE, performance & Dtrace/lockstat/pmap data at 1 sec sample intervals for 10 mins (uses 5 second Dtrace and Lockstat snapshots, vs. 2 second probing with -G alone), in addition to the standard data gathering from -g.

./sys_diag -g -l -S
(gathers perf data, runs long config rpt, and SKIPS Post-Processing and .html report generation)
NOTE: * This allows for completing the post-processing/analysis activities either on another system, or at a later time, as long as the data_directory exists (which can be extracted from the .tar.Z, then referred to via -d data_dir_path ). ** See the next example using -P -d data_path **

./sys_diag -P -d ./data_dir_path
(Completes Skipped Post-Processing & .html rpt creation)
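The -S / -P pairing in the last two examples can be sketched as a dry run that only prints the commands involved rather than executing them. This is an illustration under stated assumptions, not sys_diag's own procedure: the function name plan_deferred_run is hypothetical, the archive and data directory names are placeholders (the real names are chosen by sys_diag at run time), and piping zcat into tar is simply one common way to unpack a .tar.Z.

```shell
#!/bin/sh
# Dry-run sketch, not part of sys_diag: print the command sequence for a
# deferred post-processing run. Both arguments are placeholder names.
plan_deferred_run() {
    archive=$1     # the .tar.Z produced by the -S run
    data_dir=$2    # the data directory contained in that archive

    # Step 1 (system under test): gather data, skip post-processing.
    echo "./sys_diag -g -l -S"
    # Step 2 (analysis system): unpack the archive that was copied over.
    echo "zcat $archive | tar xf -"
    # Step 3 (analysis system): finish post-processing and the .html rpt.
    echo "./sys_diag -P -d $data_dir"
}

plan_deferred_run sys_diag_archive.tar.Z ./data_dir_path
```

Splitting the run this way keeps the analysis work off the system being measured, which is the point of -S.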
This has been an invaluable asset used to characterize / diagnose / analyze workloads across literally hundreds of systems within many of the top Fortune 100 datacenters. As would be expected, the obligations, support, and implications of use are the sole responsibility of the user, as is documented within the header of sys_diag. As a standard “best practice”, this and/or any new workload introduced to a system should always be tested first in a non-production environment for validation and familiarity.
Enjoy, and let me know if you have any Q's or suggestions !