Tuesday Jul 17, 2007

sys_diag v.7.04 command line output ...


The following output was captured recently from running sys_diag v.7.04 on a 2-CPU Sun Ultra60 test system (Solaris 10u3) in my lab.

Note the list of utilities run and types of data captured, as well as the final performance summary (a small summary of the complete color coded HTML dashboard available in the full .html report).

sys_diag has been run on virtually every type of Sun system, running Solaris 2.6 through Solaris 10.  I have personally conducted dozens of Performance Analysis, Capacity Planning / Benchmarking, and/or Architectural Assessments using sys_diag in production environments, from x86 systems up to fully loaded E25K environments.

The latest release of sys_diag (v.7.04) is available from either BigAdmin (unpackaged ksh) or SunFreeware.com (pkg'd with the README) at the following URLs:
http://www.sun.com/bigadmin/jsp/descFile.jsp?url=descAll/sys_diag__solaris_c

or   http://www.sunfreeware.com

Realize that more than half of sys_diag's benefit is in working from the aggregated .html report file, which links and correlates all the independent data files together with findings and exceptions via a nice color-coded header, dashboard, and Table of Contents (the legwork is all done for you!).

I'll try to get a sample snapshot of a report header/dashboard for an upcoming blog... but for now, just download and test run sys_diag (v.7.04 is recommended), review the final .html report, and forward any questions/comments back to me, along with RFEs for future releases.

(Read the last sections of the README for a detailed description of all datafiles created/available...)

With a little practice, it should save you many hours.. if not days.. of effort as it does for me.


Enjoy and let me know what you think,  

Todd


The following example does the deepest level of Performance data Gathering (-G, which includes Dtrace and pmap/pfiles snapshots vs. -g for light-weight perf gathering), Verbose output (-V), in addition to creation of a long/detailed configuration report (-l). The sampling rate used is 1 second intervals (-I1) for a total duration of 298 seconds (-T298).

\*Without -I or -T, the defaults are 2-second samples for 5 minutes of total data gathering. Also note that when -G and -V are used together, the initial Dtrace and lockstat snapshots take a couple of minutes to complete before the 298 seconds of data collection begins. This is because -V expands the duration of each probe to 5 seconds, vs. 2 seconds with -G alone, or a minimal 1 second of lockstat sampling with -g (i.e., no Dtrace probing).

root@/var/tmp #  ./sys_diag -G -V -l -I1 -T298

sys_diag:0717_033209: GATHER Extra PERFORMANCE DATA (-G)
sys_diag:0717_033209: VERBOSE (-V)
sys_diag:0717_033209: INTERVAL : 1 second sampling (-I1)
sys_diag:0717_033209: TIME Duration: 298 seconds (-T298)
sys_diag:0717_033209: LONG report (-l)
sys_diag:0717_033209: # Creating ... README_sys_diag.txt ...

sys_diag: ------- Beginning Process SNAPSHOT (# 0) -------
sys_diag:0717_033209: Dtrace: TCP write bytes by process ...(_dtcp_tx Snap 0)
sys_diag:0717_033209: Dtrace: TCP read bytes by process ... (_dtcp_rx Snap 0)
sys_diag:0717_033209: Dtrace: systemwide IO / IO wait... (_diow Snap 0)
sys_diag:0717_033235: Dtrace: Syscall count by process... (_dcalls_ Snap 0)
sys_diag:0717_033243: Dtrace: Syscall count by syscall... (_dsyscall_ Snap 0)
sys_diag:0717_033251: Dtrace: Read bytes by process... (_dR_ Snap 0)
sys_diag:0717_033258: Dtrace: Write bytes by process... (_dW_ Snap 0)
sys_diag:0717_033306: Dtrace: Sysinfo counts by process... (_dsinfo_ Snap 0)
sys_diag:0717_033314: Dtrace: Sdt_counts ... (_dsdtcnt_ Snap 0)
sys_diag:0717_033321: Dtrace: Interupt Times [sdt:::intr].. (_dintrtm_ Snap 0)
sys_diag:0717_033321: # ps -e -o ...(by %CPU) ... Snapshot # 0
sys_diag:0717_033321: # ps -e -o ...(by %MEM) ... Snapshot # 0
sys_diag:0717_033332: # pmap -xs 519 ...
sys_diag:0717_033332: # pmap -S 519 ...
sys_diag:0717_033332: # pmap -r 519 ...
sys_diag:0717_033332: # ptree -a 519 ...
sys_diag:0717_033332: # pfiles 519 ...
sys_diag:0717_033333: Dtrace: IO by process 519 ... (_dpio Snap 0)
sys_diag:0717_033339: # pmap -xs 448 ...
sys_diag:0717_033339: # pmap -S 448 ...
sys_diag:0717_033339: # pmap -r 448 ...
sys_diag:0717_033339: # ptree -a 448 ...
sys_diag:0717_033339: # pfiles 448 ...
sys_diag:0717_033340: Dtrace: IO by process 448 ... (_dpio Snap 0)
sys_diag:0717_033346: # pmap -xs 90 ...
sys_diag:0717_033346: # pmap -S 90 ...
sys_diag:0717_033346: # pmap -r 90 ...
sys_diag:0717_033346: # ptree -a 90 ...
sys_diag:0717_033346: # pfiles 90 ...
sys_diag:0717_033347: Dtrace: IO by process 90 ... (_dpio Snap 0)
sys_diag:0717_033353: # pmap -xs 825 ...
sys_diag:0717_033353: # pmap -S 825 ...
sys_diag:0717_033353: # pmap -r 825 ...
sys_diag:0717_033353: # ptree -a 825 ...
sys_diag:0717_033353: # pfiles 825 ...
sys_diag:0717_033353: Dtrace: IO by process 825 ... (_dpio Snap 0)
sys_diag:0717_033353: # /usr/bin/netstat -i -a ...
sys_diag:0717_033400: # Snapshot Kernel Memory Usage.. ::memstat | mdb -k ...
sys_diag:0717_033409: # /usr/sbin/lockstat -IW -n 100000 -s 13 sleep 5 ...
sys_diag:0717_033419: # /usr/sbin/lockstat -A -n 90000 -D15 sleep 5 ...
sys_diag:0717_033431: # /usr/sbin/lockstat -A -s8 -n 90000 -D10 sleep 5 ...
sys_diag:0717_033446: # /usr/sbin/lockstat -AP -n 90000 -D10 sleep 5 ...
sys_diag:0717_033521: Dtrace: Involuntary Context Switches (icsw) by process .. (_dmpc Snap 0)
sys_diag:0717_033526: Dtrace: Cross CPU Calls (xcal) caused by process ........ (_dmpc Snap 0)
sys_diag:0717_033531: Dtrace: MUTEX try lock (smtx) by lwp/process ............ (_dmpc Snap 0)

sys_diag: --\*\*-- (Background) DATA COLLECTION FOR 298 secs STARTED --\*\*--
sys_diag:0717_033531: # /usr/bin/vmstat -q 1 298 > ./sysd_socrates_070717_0332/sysd_vm_socrates_070717_033209.out 2>&1 &
sys_diag:0717_033531: # /usr/bin/iostat -xn 1 298 > ./sysd_socrates_070717_0332/sysd_io_socrates_070717_033209.out 2>&1 &
sys_diag:0717_033531: # /usr/bin/mpstat -q 1 298 > ./sysd_socrates_070717_0332/sysd_mp_socrates_070717_033209.out 2>&1 &
sys_diag:0717_033537: # /usr/bin/netstat -i -I lo0 1 298 > ./sysd_socrates_070717_0332/sysd_net1_socrates_070717_033537.out 2>&1 &
sys_diag:0717_033537: # /usr/bin/kstat -p -T u -n lo0 1> ./sysd_socrates_070717_0332/sysd_knetb_lo0_socrates_070717_033537.out 2>&1
sys_diag:0717_033538: # /usr/bin/netstat -i -I hme0 1 298 > ./sysd_socrates_070717_0332/sysd_net2_socrates_070717_033538.out 2>&1 &
sys_diag:0717_033538: # /usr/bin/kstat -p -T u -n hme0 1> ./sysd_socrates_070717_0332/sysd_knetb_hme0_socrates_070717_033538.out 2>&1
sys_diag:0717_033538: # /usr/sbin/snoop ...

sys_diag: ------- (Foreground) Gathering System Configuration Details -------
sys_diag:0717_033539: # uname -a ...
sys_diag:0717_033539: # hostid ...
sys_diag:0717_033539: # domainname (DNS) ...
sys_diag:0717_033539: ###### SYSTEM CONFIGURATION / DEVICE INFO ######
sys_diag:0717_033539: # prtdiag ...
sys_diag:0717_033539: # prtconf | grep Memory ...
sys_diag:0717_033539: # /usr/sbin/psrinfo -v ...
sys_diag:0717_033539: # /usr/sbin/psrinfo -pv ...
sys_diag:0717_033539: # /usr/sbin/psrset -q ...
sys_diag:0717_033539: # cfgadm -l ...
sys_diag:0717_033539: # cfgadm -al ...
sys_diag:0717_033539: # cfgadm -v ...
sys_diag:0717_033539: # cfgadm -av | grep memory | grep perm ...
sys_diag:0717_033541: ###### E10K / E25K / SunFire System INFO ######
sys_diag:0717_033541: # Checking Kernel Cage settings ...
sys_diag:0717_033541: # eeprom ...
sys_diag:0717_033541: # /usr/bin/coreadm ...
sys_diag:0717_033541: # /usr/sbin/dumpadm ...
sys_diag:0717_033541: # modinfo ...
sys_diag:0717_033541: # /usr/sbin/lustatus ...
sys_diag:0717_033541: # cat /etc/path_to_inst ...
sys_diag:0717_033542: ###### WORKLOAD CHARACTERIZATION ######
sys_diag:0717_033542: # prstat -c -a 1 1 ...
sys_diag:0717_033542: # prstat -c -J 1 1 ...
sys_diag:0717_033542: # prstat -c -Z 1 1 ...
sys_diag:0717_033542: # prstat -c 1 2 ...
sys_diag:0717_033544: # prstat -c -v 1 3 ...
sys_diag:0717_033546: # ps -e -o ...(by %CPU) ...
sys_diag:0717_033546: # ps -e -o ...(by %MEM) ...
sys_diag:0717_033546: # ps -e -o ...(by LWP) ...
sys_diag:0717_033546: ###### PERFORMANCE PROFILING (System / Kernel) ######
sys_diag:0717_033547: # vmstat 1 5 ...
sys_diag:0717_033551: # /usr/bin/mpstat 1 3 ...
sys_diag:0717_033551: # /usr/bin/isainfo -v ...
sys_diag:0717_033553: # /usr/bin/ipcs -a ...
sys_diag:0717_033553: # /usr/bin/pagesize ...
sys_diag:0717_033553: # swap -l ...
sys_diag:0717_033553: # swap -s ...
sys_diag:0717_033553: # /usr/bin/vmstat -s ...
sys_diag:0717_033553: # /usr/bin/kstat -n system_pages ...
sys_diag:0717_033553: # /usr/bin/kstat -n vm ...
sys_diag:0717_033554: # /usr/sbin/trapstat 1 2 ...
sys_diag:0717_033554: # /usr/sbin/trapstat -t 1 2 ...
sys_diag:0717_033554: # /usr/sbin/trapstat -l ...
sys_diag:0717_033554: # /usr/sbin/trapstat -t 1 2 ...
sys_diag:0717_033554: # /usr/sbin/trapstat -T 1 2 ...
sys_diag:0717_033554: # /usr/sbin/intrstat 1 2 ...
sys_diag:0717_033554: # /usr/bin/vmstat -i ...
sys_diag:0717_033554: ###### KERNEL ZONES/ SRM / Acctg / TUNABLES ######
sys_diag:0717_033554: # /usr/sbin/zoneadm list -v ...
sys_diag:0717_033554: # /usr/bin/projects -l ...
sys_diag:0717_033554: # /usr/sbin/psrset -i ...
sys_diag:0717_033554: # /usr/sbin/psrset -p ...
sys_diag:0717_033554: # /usr/sbin/psrset -q ...
sys_diag:0717_033554: # /usr/sbin/rctladm -l ...
sys_diag:0717_033554: # /usr/bin/priocntl -l ...
sys_diag:0717_033554: # /usr/sbin/acctadm ...
sys_diag:0717_033554: # /usr/sbin/acctadm -r...
sys_diag:0717_033554: # tail -80 /etc/system ...
sys_diag:0717_033554: # sysdef | tail -85 ...
sys_diag:0717_033554: # tail -40 /etc/init.d/sysetup ...
sys_diag:0717_033554: # cat /etc/power.conf ...
sys_diag:0717_033612: ###### STORAGE / ARRAY INFO ######
sys_diag:0717_033612: # prtconf -pv ...
sys_diag:0717_033613: # luxadm probe ...
sys_diag:0717_033614: ###### STORAGE VOLUME MANAGEMENT INFO ######
sys_diag:0717_033614: ###### SOLARIS (SDS/SVM) VOLUME MANAGER Info ######
sys_diag:0717_033614: # /sbin/metadb ...
sys_diag:0717_033614: # /sbin/metastat ...
sys_diag:0717_033614: # /sbin/metastat -p...
sys_diag:0717_033614: ###### Sun STMS / MPxIO Info ######
sys_diag:0717_033614: # cat /kernel/drv/fp.conf ...
sys_diag:0717_033614: # cat /kernel/drv/fcp.conf ...
sys_diag:0717_033614: ###### FILESYSTEM INFO ######
sys_diag:0717_033614: # df ...
sys_diag:0717_033614: # df -k ...
sys_diag:0717_033614: # mount -v ...
sys_diag:0717_033614: # /usr/sbin/showmount -a ...
sys_diag:0717_033614: # cat /etc/vfstab ...
sys_diag:0717_033614: # /usr/bin/cachefsstat ...
sys_diag:0717_033614: ###### I/O STATS ######
sys_diag:0717_033614: # /usr/bin/iostat -nxe 3 2 ...
sys_diag:0717_033614: # /usr/bin/iostat -xcC 3 2 ...
sys_diag:0717_033614: # /usr/bin/iostat -xnE ...
sys_diag:0717_033614: ###### NFS INFO ######
sys_diag:0717_033614: # /usr/bin/nfsstat ...
sys_diag:0717_033614: # /usr/bin/nfsstat -m ...
sys_diag:0717_033614: ###### NETWORKING INFO ######
sys_diag:0717_033614: # cat /etc/hosts ...
sys_diag:0717_033614: # /usr/sbin/ifconfig -a ...
sys_diag:0717_033614: # /usr/bin/netstat -i ...
sys_diag:0717_033614: # /usr/bin/netstat -r ...
sys_diag:0717_033614: # /usr/sbin/arp -a ...
sys_diag:0717_033614: # /usr/sbin/ping -s 192.168.200.1 56 10 ...
sys_diag:0717_033614: # /usr/sbin/ping -s 192.168.200.1 1016 10 ...
sys_diag:0717_033614: # /usr/sbin/ping -s google.com 56 10 ...
sys_diag:0717_033614: # /usr/sbin/ping -s google.com 1016 10 ...
sys_diag:0717_033614: # cat /etc/hostname.hme0 ...
sys_diag:0717_033614: # cat /etc/inet/networks ...
sys_diag:0717_033614: # cat /etc/netmasks ...
sys_diag:0717_033614: # tail -30 /etc/inet/ntp.server ...
sys_diag:0717_033614: # /usr/sbin/dladm show-dev ...
sys_diag:0717_033614: # /usr/sbin/dladm show-link ...
sys_diag:0717_033614: # /usr/sbin/dladm show-aggr ...
sys_diag:0717_033614: # /usr/sbin/pntadm -L ...
sys_diag:0717_033703: # /usr/bin/kstat -c net ...
sys_diag:0717_033703: # ndd -get /dev/tcp ...
sys_diag:0717_033703: # ndd -get /dev/udp ...
sys_diag:0717_033703: # ndd -get /dev/ip ...
sys_diag:0717_033706: # ndd -set /dev/hme instance 0 ...
sys_diag:0717_033706: # ndd -get /dev/hme ...
sys_diag:0717_033706: # /usr/bin/netstat -a ...
sys_diag:0717_033711: # /usr/bin/netstat -s ...
sys_diag:0717_033711: ###### TTY / MODEM INFO ######
sys_diag:0717_033711: # /usr/sbin/pmadm -l ...
sys_diag:0717_033711: # cat /etc/remote ...
sys_diag:0717_033711: # cat /var/adm/aculog ...
sys_diag:0717_033711: ###### USER / ACCOUNT / GROUP Info ######
sys_diag:0717_033711: # w ...
sys_diag:0717_033711: # who -a ...
sys_diag:0717_033711: # cat /etc/passwd ...
sys_diag:0717_033711: # cat /etc/group ...
sys_diag:0717_033711: ###### SERVICES / NAMING RESOLUTION ######
sys_diag:0717_033711: # /usr/bin/svcs -v ...
sys_diag:0717_033711: # cat /etc/services ...
sys_diag:0717_033711: # cat /etc/inetd.conf ...
sys_diag:0717_033711: # cat /etc/inittab ...
sys_diag:0717_033711: # cat /etc/nsswitch.conf ...
sys_diag:0717_033711: # cat /etc/resolv.conf ...
sys_diag:0717_033711: # cat /etc/auto_master ...
sys_diag:0717_033711: # cat /etc/auto_home ...
sys_diag:0717_033712: # /usr/bin/ypwhich ...
sys_diag:0717_033712: # /usr/bin/nisdefaults ...
sys_diag:0717_033712: ###### SECURITY / CONFIG FILES ######
sys_diag:0717_033712: # cat /etc/syslog.conf ...
sys_diag:0717_033712: # cat /etc/pam.conf ...
sys_diag:0717_033712: # cat /etc/default/login ...
sys_diag:0717_033712: # tail -250 /var/adm/sulog ...
sys_diag:0717_033712: # /usr/bin/last reboot ...
sys_diag:0717_033712: # /usr/bin/last -200 ...
sys_diag:0717_033712: # /usr/sbin/ipf -T list ...
sys_diag:0717_033712: # cat /etc/ipf/ipf.conf ...
sys_diag:0717_033712: # cat /etc/ipf/pfil.ap ...
sys_diag:0717_033712: # /usr/sbin/ipnat -vls ...
sys_diag:0717_033713: ###### HA/ CLUSTERING INFO ######
sys_diag:0717_033713: ###### SUN N1 Configuration INFO ######
sys_diag:0717_033713: ###### APPLICATION / ORACLE CONFIG FILES ######
sys_diag:0717_033713: ###### PACKAGE INFO / SOLARIS REGISTRY ######
sys_diag:0717_033713: # /usr/bin/prodreg browse ...
sys_diag:0717_033713: # /usr/bin/pkginfo ...
sys_diag:0717_033713: # /usr/bin/pkginfo -l ...
sys_diag:0717_033713: ###### PATCH INFO ######
sys_diag:0717_033713: # /usr/bin/showrev -p ...
sys_diag:0717_033713: # /usr/sadm/bin/smpatch analyze NOT RUN, passwd required....
sys_diag:0717_033753: ###### CRONTAB FILE LISTINGS ######
sys_diag:0717_033753: ###### FMD / SYSTEM MESSAGE/LOG FILES ######
sys_diag:0717_033753: # /usr/sbin/fmadm config ...
sys_diag:0717_033753: # /usr/sbin/fmdump ...
sys_diag:0717_033753: # /usr/sbin/fmstat ...
sys_diag:0717_033753: # tail -250 /var/adm/messages ...
sys_diag:0717_033753: # /usr/bin/dmesg ...
sys_diag:0717_033753: # tail -500 /var/log/syslog ...
sys_diag:0717_033754: ...WAITING 12 seconds for midpoint data collection...

sys_diag: ------- MidPoint Process SNAPSHOT (# 1) -------
sys_diag:0717_033806: Dtrace: TCP write bytes by process ...(_dtcp_tx Snap 1)
sys_diag:0717_033806: Dtrace: TCP read bytes by process ... (_dtcp_rx Snap 1)
sys_diag:0717_033806: Dtrace: systemwide IO / IO wait... (_diow Snap 1)
sys_diag:0717_033832: Dtrace: Syscall count by process... (_dcalls_ Snap 1)
sys_diag:0717_033840: Dtrace: Syscall count by syscall... (_dsyscall_ Snap 1)
sys_diag:0717_033847: Dtrace: Read bytes by process... (_dR_ Snap 1)
sys_diag:0717_033855: Dtrace: Write bytes by process... (_dW_ Snap 1)
sys_diag:0717_033903: Dtrace: Sysinfo counts by process... (_dsinfo_ Snap 1)
sys_diag:0717_033911: Dtrace: Sdt_counts ... (_dsdtcnt_ Snap 1)
sys_diag:0717_033918: Dtrace: Interupt Times [sdt:::intr].. (_dintrtm_ Snap 1)
sys_diag:0717_033918: # ps -e -o ...(by %CPU) ... Snapshot # 1
sys_diag:0717_033918: # ps -e -o ...(by %MEM) ... Snapshot # 1
sys_diag:0717_033929: # pmap -xs 4188 ...
sys_diag:0717_033929: # pmap -S 4188 ...
sys_diag:0717_033929: # pmap -r 4188 ...
sys_diag:0717_033929: # ptree -a 4188 ...
sys_diag:0717_033929: # pfiles 4188 ...
sys_diag:0717_033929: Dtrace: IO by process 4188 ... (_dpio Snap 1)
sys_diag:0717_033935: # pmap -xs 4181 ...
sys_diag:0717_033935: # pmap -S 4181 ...
sys_diag:0717_033935: # pmap -r 4181 ...
sys_diag:0717_033935: # ptree -a 4181 ...
sys_diag:0717_033935: # pfiles 4181 ...
sys_diag:0717_033936: Dtrace: IO by process 4181 ... (_dpio Snap 1)
sys_diag:0717_033942: # /usr/bin/netstat -i -a ...
sys_diag:0717_033942: # Snapshot Kernel Memory Usage.. ::memstat | mdb -k ...
sys_diag:0717_033952: # /usr/sbin/lockstat -IW -n 100000 -s 13 sleep 5 ...
sys_diag:0717_034002: # /usr/sbin/lockstat -A -n 90000 -D15 sleep 5 ...
sys_diag:0717_034015: # /usr/sbin/lockstat -A -s8 -n 90000 -D10 sleep 5 ...
sys_diag:0717_034037: # /usr/sbin/lockstat -AP -n 90000 -D10 sleep 5 ...
sys_diag:0717_034051: Dtrace: Involuntary Context Switches (icsw) by process .. (_dmpc Snap 1)
sys_diag:0717_034056: Dtrace: Cross CPU Calls (xcal) caused by process ........ (_dmpc Snap 1)
sys_diag:0717_034101: Dtrace: MUTEX try lock (smtx) by lwp/process ............ (_dmpc Snap 1)

sys_diag: ------- EndPoint Process SNAPSHOT (# 2) -------
sys_diag:0717_034101: # /usr/bin/kstat -p -T u -n lo0 2>&1
sys_diag:0717_034101: # /usr/bin/kstat -p -T u -n hme0 2>&1
sys_diag:0717_034107: Dtrace: TCP write bytes by process ...(_dtcp_tx Snap 2)
sys_diag:0717_034107: Dtrace: TCP read bytes by process ... (_dtcp_rx Snap 2)
sys_diag:0717_034107: Dtrace: systemwide IO / IO wait... (_diow Snap 2)
sys_diag:0717_034133: Dtrace: Syscall count by process... (_dcalls_ Snap 2)
sys_diag:0717_034141: Dtrace: Syscall count by syscall... (_dsyscall_ Snap 2)
sys_diag:0717_034149: Dtrace: Read bytes by process... (_dR_ Snap 2)
sys_diag:0717_034156: Dtrace: Write bytes by process... (_dW_ Snap 2)
sys_diag:0717_034204: Dtrace: Sysinfo counts by process... (_dsinfo_ Snap 2)
sys_diag:0717_034212: Dtrace: Sdt_counts ... (_dsdtcnt_ Snap 2)
sys_diag:0717_034220: Dtrace: Interupt Times [sdt:::intr].. (_dintrtm_ Snap 2)
sys_diag:0717_034220: # ps -e -o ...(by %CPU) ... Snapshot # 2
sys_diag:0717_034220: # ps -e -o ...(by %MEM) ... Snapshot # 2
sys_diag:0717_034230: # pmap -xs 519 ...
sys_diag:0717_034230: # pmap -S 519 ...
sys_diag:0717_034230: # pmap -r 519 ...
sys_diag:0717_034230: # ptree -a 519 ...
sys_diag:0717_034230: # pfiles 519 ...
sys_diag:0717_034231: Dtrace: IO by process 519 ... (_dpio Snap 2)
sys_diag:0717_034237: # pmap -xs 448 ...
sys_diag:0717_034237: # pmap -S 448 ...
sys_diag:0717_034237: # pmap -r 448 ...
sys_diag:0717_034237: # ptree -a 448 ...
sys_diag:0717_034237: # pfiles 448 ...
sys_diag:0717_034238: Dtrace: IO by process 448 ... (_dpio Snap 2)
sys_diag:0717_034244: # pmap -xs 90 ...
sys_diag:0717_034244: # pmap -S 90 ...
sys_diag:0717_034244: # pmap -r 90 ...
sys_diag:0717_034244: # ptree -a 90 ...
sys_diag:0717_034244: # pfiles 90 ...
sys_diag:0717_034245: Dtrace: IO by process 90 ... (_dpio Snap 2)
sys_diag:0717_034251: # pmap -xs 825 ...
sys_diag:0717_034251: # pmap -S 825 ...
sys_diag:0717_034251: # pmap -r 825 ...
sys_diag:0717_034251: # ptree -a 825 ...
sys_diag:0717_034251: # pfiles 825 ...
sys_diag:0717_034251: Dtrace: IO by process 825 ... (_dpio Snap 2)
sys_diag:0717_034251: # /usr/bin/netstat -i -a ...
sys_diag:0717_034258: # Snapshot Kernel Memory Usage.. ::memstat | mdb -k ...
sys_diag:0717_034307: # /usr/sbin/lockstat -IW -n 100000 -s 13 sleep 5 ...
sys_diag:0717_034317: # /usr/sbin/lockstat -A -n 90000 -D15 sleep 5 ...
sys_diag:0717_034329: # /usr/sbin/lockstat -A -s8 -n 90000 -D10 sleep 5 ...
sys_diag:0717_034344: # /usr/sbin/lockstat -AP -n 90000 -D10 sleep 5 ...
sys_diag:0717_034358: Dtrace: Involuntary Context Switches (icsw) by process .. (_dmpc Snap 2)
sys_diag:0717_034404: Dtrace: Cross CPU Calls (xcal) caused by process ........ (_dmpc Snap 2)
sys_diag:0717_034408: Dtrace: MUTEX try lock (smtx) by lwp/process ............ (_dmpc Snap 2)

sys_diag:0717_034408: ------- Data Collection COMPLETE -------
sys_diag:0717_034408: ###### SYSTEM ANALYSIS : INITIAL FINDINGS ... ######
sys_diag:0717_034414: ###### PERFORMANCE DATA : POTENTIAL ISSUES ######
_____________________________________________________________________________________

sys_diag:0717_034414: ## Analyzing VMSTAT CPU Datafile :
	./sysd_socrates_070717_0332/sysd_vm_socrates_070717_033209.out ...

\* NOTE: 2.6936 % : 8 of 297 VMSTAT CPU entries are WARNINGS!! \*


TOTAL CPU AVGS : RUNQ= 0.1 : BThr= 0.0 : USR= 15.0 : SYS= 11.2 : IDLE= 73.5
PEAK CPU HWMs : RUNQ= 8 : BThr= 0 : USR= 51 : SYS= 96 : IDLE= 0

___________________________________________________________________________________

sys_diag:0717_034414: ## Analyzing VMSTAT MEMORY from Datafile :
./sysd_socrates_070717_0332/sysd_vm_socrates_070717_033209.out ...

\* NOTE: 0.673401 % : 2 of 297 VMSTAT MEMORY entries are WARNINGS!! \*


TOTAL MEM AVGS : SR= 0.0 : SWAP_free= 747697.4 K : FREE_RAM= 287786.6 K
PEAK MEM Usage: SR= 0 : SWAP_free= 500128.0 K : FREE_RAM= 57080.0 K


___________________________________________________________________________________

sys_diag:0717_034414: ## Analyzing MPSTAT Datafile : ./sysd_socrates_070717_0332/sysd_mp_\*.out ...


\* NOTE: 5.20134 % : 31 of 596 MPSTAT CPU entries are WARNINGS!! \*

CPU MP AVGS: Wt= 0: Xcal= 736: csw= 120: icsw= 3: migr= 5: smtx= 3: syscl= 1024
PEAK MP HWMs: Wt= 0: Xcal= 51771: csw= 14108: icsw= 32: migr= 55: smtx= 79: syscl= 25836


NOTE: 0.2% CPU cycles handling TLB MISSES (0.0% ITLB_misses: 0.2% DTLB_misses)

_____________________________________________________________________________________

sys_diag:0717_034414: ## Analyzing IOSTAT Datafile :
./sysd_socrates_070717_0332/sysd_io_\*.out ...


\* NOTE: 14.4578 % : 24 of 166 IOSTAT entries are WARNINGS!! \*

TOP 10 Slowest IO Devices (\* AVG of non-zero device entries \*) :

  r/s   w/s   kr/s   kw/s  actv  wsvc_t  asvc_t   %w   %b  device   # I/O Samples

 32.6  10.8  263.6   24.6   0.8     0.0    13.7  0.0   19  c0t0d0   164
 34.0   7.5   10.8    0.0   0.0     0.0     0.0  0.0    0  c0t1d0   2

_____________________________________________________________________________________

CONTROLLER IO : AVG and TOTAL Throughput per HBA (\*active/non-zero entries only\*) :
------------

c0 : AVG : 32.6 r/s | 10.8 w/s | 260.6 kr/s | 24.3 kw/s |
c0 : TOTAL: 5408 r | 1790 w | 43258 kr | 4037 kw | 166 entries

_____________________________________________________________________________________


sys_diag:0717_034414: ## Analyzing NETSTAT Datafiles : ...

\* lo0 : NOTE: 0 % : 0 of 297 NETSTAT entries are WARNINGS!! \*
\* hme0 : NOTE: 0 % : 0 of 297 NETSTAT entries are WARNINGS!! \*


------------ \*MAX_RX_PKTS\* AVG_RX_PKTS AVG_RX_ERRS AVG_TX_PKTS AVG_TX_ERRS AVG_COLL
NET1 : lo0 : 4 0.0 0.0 0.0 0.0 0.0

------------ \*MAX_RX_PKTS\* AVG_RX_PKTS AVG_RX_ERRS AVG_TX_PKTS AVG_TX_ERRS AVG_COLL
NET2 : hme0 : 14 0.4 0.0 0.4 0.0 0.0
: hme0 : TOT_RX_Bytes TOT_TX_Bytes TOT_RX_Packets TOT_TX_Packets TOTAL_Seconds
22210 30348 124 112 328
: hme0:1: TOT_RX_Packets TOT_TX_Packets

: hme0:1: 0 0

NOTE: \*\* 2 ESTABLISHED connections (sockets) exist\*\*

_____________________________________________________________________________________


\* NOTE: CPU=GRN : MEM=GRN : IO=YEL : NET=GRN \*

_____________________________________________________________________________________

sys_diag:0717_034417: ... gen_html_hdr ...
sys_diag:0717_034417: ... gen_html_rpt ...


sys_diag:0717_034419: ## Generating TAR file : ./sysd_socrates_070717_0332.tar ...

tar -cvf ./sysd_socrates_070717_0332.tar ./sysd_socrates_070717_0332 1>/dev/null
compress ./sysd_socrates_070717_0332.tar

Data files have been TARed and compressed in :

\*\*\* ./sysd_socrates_070717_0332.tar.Z \*\*\*

------- Sys_Diag Complete -------
#


( Copyright 2007, Todd A. Jobson )

Tuesday Jul 10, 2007

Solaris Performance Analysis and Monitoring Tools... at what cost ?

In the area of Performance Analysis and related Monitoring tools, you'll find a plethora available for the Solaris environment. Each of them has its own intrinsic costs, listed here:


  • Monetary Costs ($$) :

    • Purchase Cost (Media, Documentation, etc..)
    • License Fees
    • Centralized or Management Server Required ? (HW Costs for System / Storage)
    • Hourly Costs of a Staff / SME / Consultant to Install/Config, Correlate, Interpret, Rpt findings...


  • Time / Effort Costs :

    • 3rd Party Installation / Configuration Pre-Requisites (libraries, tools Perl/Python.., etc..)
    • Server OS and Tools Design Requirements (Security, OS rev's, RAM, CPU, Storage, FS, Patches,..)
    • Server Installation (Rack/Stack, Network, Power, OS Install, Patching, Storage Cfg,..)
    • Server Toolset Installation (Installation, Configuration, License downloads, ..)
    • Client Node Agents Required for Installation / Configuration ?
    • Project/Manpower Time and Coordination Dependencies for an SME/Consultant vs. Other Resources (system, network, storage, etc...).
    • Time spent Installing, Configuring, Testing, Patching/Tweaking, Running, Correlating data, Analyzing/ Interpreting data, Reporting Findings...
    • Time spent learning the Toolset and how to interpret the raw and correlated data (thresholds, etc?)


  • System Overhead Costs :

    • CPU Consumption (% standard overhead vs. PEAK Load overhead)
    • Memory Consumption (RAM footprint.. standard vs. Peak load)
    • Storage Requirements (Toolset Installation space vs. active / historical storage vs. archiving req's)
    • RunTime Requirements (Running Constantly vs. During Specific PEAK load Intervals, Sampling Rate, ..)
    • Network Overhead (bandwidth and/or interrupt overhead due to data passed between client/server repository vs. local storage)
    • I/O Overhead (overhead of performing local IO.. generally depends on volume of data stored and sampling rates)


The Benefits of Accurate, Detailed, and Complete Data Gathering ...


\*\* NOTE: .. a key Attribute often left out is the ACCURACY and RELEVANCE of the performance data captured (based upon the time it was captured, the sampling rates, and the level of detail provided).

This in many instances requires weighing the costs of point-in-time "detailed" snapshots (where the sampling intervals are very narrow: per second, etc.) vs. long-term historical trending data (where samples are aggregated and averaged over longer timeframes, minimizing the storage requirements but also smoothing out the Peak load visibility). For example, if you use a toolset or individual utility that can capture performance data at 1-second intervals, you will see a very granular view of system utilization and PEAK load activity (resource consumption, contention events, etc.), whereas a historical trending toolset may only save data as 1, 5, or 10 minute Averages (due to the constraints of storage space available for the long periods of data that must be kept).

This might not seem like much would be missed; however, even the difference between 1-second and 1-minute samples can be astronomical. For example, 80 one-second samples at 95% idle plus 20 samples at 100% utilization (0% idle) with a huge run queue get "smoothed" out into a single averaged sample in which the box "appears" only 24% utilized (76% idle), although the system is thrashing 20% of the time.  Even within the period of a second, over a billion instructions get run on modern CPUs at GHz+ clock rates (billions of cycles per second), and you have only one aggregated sample for that period.
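To make the arithmetic above concrete, here is a minimal Python sketch. The sample values are hypothetical, chosen only to match the 80/20 example in the text; they are not measurements from a real system, and the `smooth` helper is just an illustrative name.

```python
# Hypothetical 1-second CPU utilization samples (percent busy):
# 80 samples at 5% busy (95% idle) and 20 samples at 100% busy (0% idle).
samples = [5.0] * 80 + [100.0] * 20


def smooth(utilization):
    """Return (average, peak, fraction of samples that are fully saturated)."""
    average = sum(utilization) / len(utilization)
    peak = max(utilization)
    saturated = sum(1 for u in utilization if u >= 100.0) / len(utilization)
    return average, peak, saturated


average, peak, saturated = smooth(samples)
print(f"averaged sample : {average:.1f}% busy ({100 - average:.1f}% idle)")
print(f"peak sample     : {peak:.1f}% busy")
print(f"saturated       : {saturated:.0%} of the samples")
```

A trending tool that keeps only the averaged value would store 24% busy; the 100% peak and the fact that a fifth of the samples were saturated are visible only in the fine-grained data.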

For complete end-to-end Capacity Planning and Performance Analysis capabilities, BOTH types of data are generally required (longer term trending for Capacity Planning purposes via graphs, etc. VS. short term detailed drill down of system activity for point in time PEAK LOAD periods, allowing for detailed performance and utilization assessment / correlation).

Without detailed and granular data during peak periods, there can be no real correlation of root causes or specific bottlenecks... and in the same regard, without long-term, historical data that shows growth rates in capacity and cycles (patterns and models) of utilization and Peak activity.. accurate Capacity Planning isn't feasible.
\*\*

 

If the data captured doesn't include peak activity, or the samples are too sparse to reflect peak events, then that data can only be useful for defining a BASELINE of Average Utilization.

 


MANY, many, .. tools

 

A wide variety of performance tools can be found, from the high end (end-to-end third party products such as Teamquest, which provides a graphical, historical vantage point) that need to be purchased, installed, and trained on, to the OS built-in utilities and the freely available open source / public domain varieties.

However, either way you go, be prepared for the required learning curve, along with the extensive manual process and time required to identify and run the utilities before you can capture the data and begin the extensive process of correlating it across several disparate utilities (before you even get to do the analysis of your findings).

Either approach has its strengths and weaknesses (3rd party purchased suites might save you time in graphical aggregation and correlation, but tend to limit the level of detail and granularity available vs. what the OS utilities will provide).

The basic list of KEY "built-in" tools historically available for monitoring performance applies to nearly any Unix/Linux distribution, including the following partial list of common utilities used ... following the basic breakdown of computing subsystems :

\*\* CPU / Kernel Utilization :

--> vmstat (vm system cpu and kernel utilization metrics \*\* a great starting pt \*\*)
--> mpstat (multi processor .. per cpu performance statistics)

\*\* Memory / Kernel Utilization :

--> vmstat
--> ipcs

--> swap
--> top

\*\* I/O Performance :

--> iostat (Standard IO.. ufs, .. IO performance utility)
--> vxstat (Veritas vxfs filesystem IO performance)

\*\* Network Utilization :

--> netstat
--> ping
--> traceroute

\*\* Process / Kernel :

--> ps
--> top
--> prstat

--> ...

\* sar (provides most basic types of high level performance metrics, assuming that system accounting is turned on, which does incur some level of system overhead when always running)
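As an illustration of the kind of reduction these utilities invite, the sketch below turns vmstat-style output into averages and peak values for the run queue and CPU columns. It is a minimal Python example under stated assumptions, not how sys_diag itself is implemented: the `SAMPLE` text is made up to resemble the Solaris `vmstat` column layout, and the helper names `parse_vmstat` and `summarize` are hypothetical.

```python
# Made-up sample laid out like Solaris `vmstat 1 4` output (illustrative only).
SAMPLE = """\
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 747000 287000  1   5  0  0  0  0  0  0  0  0  0  310  420  150  3  2 95
 0 0 0 746500 286000  0   2  0  0  0  0  0  1  0  0  0  305  398  140  2  1 97
 8 0 0 500100  57000 40 900  0  0  0  0  0 30  0  0  0 2200 9000 4100 51 49  0
 1 0 0 745000 285000  0   3  0  0  0  0  0  0  0  0  0  320  450  160  4  2 94
"""

FIELDS = ("runq", "usr", "sys", "idle")


def parse_vmstat(text, skip_first=True):
    """Extract the run queue and CPU columns from vmstat-style text."""
    rows = []
    for line in text.splitlines():
        cols = line.split()
        if not cols or not cols[0].isdigit():
            continue  # skip the two header lines and any blank lines
        rows.append({
            "runq": int(cols[0]),   # first column: r (run queue)
            "usr": int(cols[-3]),   # last three columns: us sy id
            "sys": int(cols[-2]),
            "idle": int(cols[-1]),
        })
    # The first data line of vmstat reports averages since boot, so drop it.
    return rows[1:] if skip_first else rows


def summarize(rows):
    """Return (averages, peaks); the 'peak' for idle is its minimum."""
    avgs = {f: sum(r[f] for r in rows) / len(rows) for f in FIELDS}
    peaks = {f: max(r[f] for r in rows) for f in FIELDS}
    peaks["idle"] = min(r["idle"] for r in rows)
    return avgs, peaks


rows = parse_vmstat(SAMPLE)
avgs, peaks = summarize(rows)
print("AVGS :", {f: round(v, 1) for f, v in avgs.items()})
print("PEAKS:", peaks)
```

Reading the columns by position (first field, last three fields) sidesteps the fact that the vmstat header uses the label `sy` twice, once under faults and once under cpu.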

 


SOLARIS 10 ... Above and Beyond other Unix / Linux Distributions ... 

 

In addition to the basic toolsets available, Solaris 10 provides the following key additions, which set it apart from the other Unix / Linux variants.

\*\* DTrace (Dynamic Tracing via "D" language scripting and probe/providers)

   Dtrace is the "Electron microscope" of performance analysis for a Solaris 10 system. See the DTraceToolkit for a long list of specific Dtrace scripts (several of which are used within sys_diag, among others created).

\*\* lockstat (uses the kernel dtrace infrastructure) Summarizes system lock/mutex contention

\*\* Mdb (Modular Debugger)

\* kstat (Kernel statistics .. counters, etc..)

\* cpustat / cputrack (cpu statistics, system-wide or per process)

\* intrstat, trapstat (interrupt and system trap, I/DTLB_miss statistics, ..)

\* ... & many more.. [this list will be re-done in a future blog with a more thorough breakdown.. ]

___________________________________________________________________________________

The Time Saving.. automated nature... of SYS_DIAG   :)


Over the past several years, I have created a utility called "sys_diag" that automatically captures performance statistics using nearly all available system utilities, aggregates the data, performs analysis, and generates an HTML report of findings (with a color-coded dashboard and links to detailed analysis findings). Sys_diag creates a single .tar.Z compressed archive that can be emailed/ftp'd for performing system configuration and/or performance analysis off-site, from virtually anywhere, saving a LOT of time. No 3rd party tools or agents need to be installed on the system; you only download the "sys_diag" ksh script itself.  Virtually no learning curve is required for loading, running, and reviewing the basic performance profile, including high-level subsystem bottlenecks (deeper root cause correlation might require some level of advanced sys admin knowledge).

Beyond performance analysis, sys_diag can also be used to generate a detailed configuration snapshot report, including OS, HW, Storage, SW, and 3PP configuration attributes, among several other capabilities that it provides.

\*\* See the next blog entry for more details and examples on sys_diag \*\*.
The published repository and high level description of sys_diag is always available at BigAdmin using the following URL :
http://www.sun.com/bigadmin/jsp/descFile.jsp?url=descAll/sys_diag__solaris_c

(Copyright 2007 Todd A. Jobson)



Wednesday May 16, 2007

What is Performance ? .. in the Real World


When we think of "Performance", the definition can take on many connotations...

In the context of computing, the dictionary defines it as : (http://dictionary.reference.com/browse/performance)

PERFORMANCE (noun) :

"The manner in which or the efficiency with which something reacts or fulfills its intended purpose." or "the execution or accomplishment of work, acts, feats, etc."

 

From this definition, it can be readily seen that the "efficiency" and overall "utilization" of resources is a key characteristic of the "performance" of a system (also leaving room for some subjective interpretation).

 

Real World Performance.. and the holistic viewpoint

The other key aspect of assessing performance, whether in the real world or in a system, relates directly to the volume of productive OUTPUT that a system produces over a duration of TIME.
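As a minimal sketch of that idea (output over time), the snippet below computes a simple throughput rate. The function name and the sample figures are hypothetical and are only there to illustrate the arithmetic; they are not taken from any real measurement.

```python
def throughput(units_completed: float, elapsed_seconds: float) -> float:
    """Return productive output per second over the measured interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return units_completed / elapsed_seconds


# Hypothetical sample: 14,900 transactions completed during a 298 second run.
rate = throughput(14_900, 298)
print(f"{rate:.1f} transactions/sec")  # prints "50.0 transactions/sec"
```

The same ratio applies to any unit of productive work (transactions, bytes, requests..), as long as the output and the time window are measured over the same interval.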

In the arena of Information Technology.. as in real world performance (autos, economics, the human body, etc..) the entity as a whole needs to be examined, allowing for symptoms to be identified in one or more areas... aka.. "sub-systems".   Hence, the complete Integrated "system" as a whole.. comes to life with its own unique dynamics and patterns that need to be examined.

(e.g.   one analogy might be the "performance" of a race car .. dependent upon the design/architecture of the vehicle.. and all its constituent components.. the chassis [weight, flexibility, ..],  the steering [responsiveness, turn ratio..], the engine [horsepower, air/fuel intake, exhaust output, ..], the Transmission [gear ratios, latency in shifting.., MTBF of clutch,..], braking [responsiveness, 60-0 secs, ..], tires [ G's on the skidpad, wear rate, ..], .. and Overall Performance.. [0-60 acceleration, MPG, top speed, slalom speed, ..] .. individually each can be measured easily.. but as a whole.. the INTEGRATED "system dynamics" come into play ). 

The same can be said of (and is analogous to) most "systems"... hence, looking at the environment as a whole is crucial ...

 

The "Application Environment" ... 

That holistic entity in the arena of Computing is called the "Application Environment".. comprised of all the systems and underlying nested/encapsulated sub-systems. In the IT world, an Application Environment is composed of all the underlying infrastructure that together provides and supports the "Service(s)" (environments, systems, networks, storage, OS, Application Software, etc...).

 

"Perceived" Performance and Expectations :

For any system (or environment), the ultimate gauge is in the "Perception" of its performance, relating to whether or not it can fulfill the expectations of its client user community.

How efficient, proficient, and/or productive we perceive something to be, is in large part.. a product of our vantage point (perception), and how we judge or evaluate it... according to our expectations, pre-conceived notions (rules), and the means available to us for measuring it (tools, etc..).
The perception of one impatient user doesn't always accurately reflect the responsiveness, efficiency, or other attributes for evaluating the performance / workload characterization of a system.

 

Understanding , Metrics, and Measurement ...

From this vantage point, it becomes evident that in Assessing a system, there must be measurement of key attributes .. aka.. METRICS... and in order to define key metrics that can/should be monitored, we must first UNDERSTAND the system and how it works (components, mechanics, inputs / outputs, among other items that can be measured).

Hence, "If you can't understand it... you can't effectively measure it, .. and if you can't measure it.. you can't assess it..". (T.Jobson 7/2006)

 

Requirements dictate Measurements... driven by Service Level Commitments :

Of the various vantage points from which a system's performance is gauged, the following attributes (relating to specific Metrics that can be sampled) are typically those upon which Service Level Agreements (SLA's) and/or Commitments (SLC's) are based (reflecting Customer Requirements and "acceptable" Thresholds.. ) :

  • Response Time (Client GUI's, Client/Server Transactions, Service Transactions,..) Measured as "acceptable" Latency.

  • Throughput (how much Volume of data can be pushed through a specific subsystem.. IO, Net..)
  • Transaction Rates (DataBase, Application Services, Infrastructure / OS / Network.. Services, etc.).  These can be either rates per Second, Hour, or even Day... measuring various service-related transactions.
  • Failure Rates (# or Frequency of exceeding  High or Low Water Marks .. aka Threshold Exceptions)
  • Resource Utilization (CPU Kernel vs. User vs. Idle, Memory Consumption, etc..)
  • Startup Time (System HW, OS boot, Volume Mgmt Mirroring, Filesystem validation, Cluster Data Services, etc..)
  • FailOver / Recovery Time (HA clustered DataServices, Disaster Recovery of a Geographic Service, ..)  Time to recover a failed Service (includes recovery and/or startup time of restoring the failed Service)

  • etc ...

    For any exceptions to the "acceptable" thresholds listed above, SLA's typically reflect PENALTIES ($$$).
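To make the idea of "Threshold Exceptions" concrete, here is a small sketch that counts how often a sampled metric falls outside its Low / High Water Marks. The `Threshold` class, the function names, the 500 ms limit, and the sample values are all hypothetical; a real SLA would define its own metrics and limits.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str
    low: float   # Low Water Mark (hypothetical value)
    high: float  # High Water Mark (hypothetical value)


def count_exceptions(samples, threshold):
    """Count samples that fall below the low or above the high water mark."""
    return sum(1 for v in samples if v < threshold.low or v > threshold.high)


def failure_rate(samples, threshold):
    """Return the fraction of samples that are threshold exceptions."""
    if not samples:
        return 0.0
    return count_exceptions(samples, threshold) / len(samples)


# Hypothetical response-time samples in milliseconds.
response_ms = [120, 180, 95, 640, 210, 150, 720, 130]
sla = Threshold("response_time_ms", low=0, high=500)

print(count_exceptions(response_ms, sla))    # prints 2
print(f"{failure_rate(response_ms, sla):.0%}")  # prints 25%
```

Whether a count or a rate is the right measure depends on how the agreement is written (e.g. exceptions per hour vs. percentage of transactions).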

 

Latency ... the heart of a Bottleneck ...

Each of the attributes and perceived gauges of performance listed above has its own intrinsic relationships and dependencies to specific subsystems and components... in turn reflecting a type of "latency" (delay in response). It is these latencies that are investigated and examined for root cause and correlation as the basis for most Performance Analysis activities.
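As a rough illustration of examining latencies per subsystem, the sketch below takes a breakdown of one response time into component latencies and reports which subsystem contributes the most. The subsystem names and millisecond values are hypothetical, and the assumption that the components simply add up is a simplification (real latencies often overlap or queue behind each other).

```python
def latency_breakdown(component_ms):
    """Return (total, share per component, largest contributor).

    Assumes the component latencies are additive, which is a simplification.
    """
    total = sum(component_ms.values())
    if total <= 0:
        raise ValueError("total latency must be positive")
    shares = {name: ms / total for name, ms in component_ms.items()}
    bottleneck = max(component_ms, key=component_ms.get)
    return total, shares, bottleneck


# Hypothetical per-subsystem latencies (milliseconds) for one transaction.
sample = {"cpu": 40.0, "memory": 10.0, "disk_io": 120.0, "network": 30.0}
total, shares, bottleneck = latency_breakdown(sample)
print(f"total={total:.0f} ms, bottleneck={bottleneck} ({shares[bottleneck]:.0%})")
# prints "total=200 ms, bottleneck=disk_io (60%)"
```

The largest contributor is only a starting point.. root cause and correlation across subsystems still need to be investigated.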

\*\* STAY TUNED \*\*..  Look for my up-coming blog entry on "The many Flavors of system Latency..".

Future blog entries will expand upon this baseline definition of performance.. so keep your eyes peeled.. and look at the world around you.. from as many vantage points as possible... Perspective is key.. hand in hand with understanding the world around us... Don't be afraid to ask why.. and dig deeper.. there's typically a reason for everything if you look at it with an open mind.. understanding the fundamentals first !

Todd ;) :)

(Copyright 2007, Todd A. Jobson)



About

This blog does not reflect the viewpoint or opinions of Oracle or Sun Microsystems. All comments are personal reflections and responsibility of Todd A. Jobson, and are copyrighted from the posted year to current year, to that effect.
