Monday Nov 24, 2008

Pkgapp 3.0 for Solaris Available!

Hello all,

I'm pleased to announce that I have released Pkgapp 3.0 for Solaris!

So far Pkgapp 3.0 is only available for Solaris, but I will begin porting the code to Linux and HP-UX.  I wanted to get the Solaris version out first instead of waiting for all of them to be done...this way at least one Unix flavor is available for use.

You can obtain Pkgapp 3 from Big Admin: Pkgapp 3.0 Solaris

Sun Employees can also find information on Sunspace.

Included files:

  • Pkgapp version 3 Script
  • Pkgapp Reference Guide: SUN-GDD_Pkgapp_Reference_Guide3.0.0.pdf
  • Pkgapp Readme
  • GDD Readme
  • GDD License

For information on what changed, what the usage looks like and a runtime example (using mysql) see my last post - Pkgapp 3.0 code has been finalized!

Enjoy all!

Lee

Wednesday Nov 19, 2008

Pkgapp 3.0 code has been finalized!

Pkgapp3 is the latest update to the Sun GDD tool set. These tools are designed to help customers gather the data needed for the best possible support on their issues.

Pkgapp is a shell tool built by Sun Support Engineering to help gather all the OS and application libraries used by Sun Java System (and other) servers at runtime.  Pkgapp can gather these libraries from cores produced by crashes or from gcores captured manually by administrators.

Pkgapp is part of the GDD (Gathering Debug Data) suite of tools and has been used for years to help Sun Support debug tricky crashes, as well as performance and memory leak problems where cores or gcores are available.
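
For illustration, here is a rough sketch of the core idea Pkgapp automates (this is not Pkgapp itself, and the core file name below is made up): pldd lists the libraries a core references, and the libraries are then copied with their paths preserved so a debugger on another host can resolve symbols.

#!/bin/sh
# Sketch only - pkgapp does much more (proc tool output, OS info, manifest, etc.)
CORE=./core.1409                      # hypothetical core file
DEST=/var/tmp/libs-from-core
mkdir -p "$DEST"
# pldd lists the shared libraries the process had mapped when the core was written
pldd "$CORE" | tee "$DEST/pldd.out"
# copy each library, preserving its path, so symbols can be resolved later
for lib in `grep '\.so' "$DEST/pldd.out"`; do
    [ -f "$lib" ] || continue
    mkdir -p "$DEST`dirname $lib`"
    cp "$lib" "$DEST$lib"
done
(cd "$DEST" && tar cf - .) | gzip > /var/tmp/libs-from-core.tar.gz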

New major features in Pkgapp 3.0:

1) Now handles Java Cores and grabs jstack and jinfo.
 - See the -j and -J options

2) Includes all functionality that was available in both pkg_app 2.7 and pkgcore.

3) Allows the temporary files in /tmp to be retained.
 - See the -r option (which removes them)

4) Allows pkgapp3 to be placed in a $PATH location; can now be run from any location.

5) Can use relative paths for the following options:
 -c <core file OR pid of a running process>
 -p <full path to, but not including the process binary> (ns-slapd, imapd, httpd etc.)
 -s <Storage; path to store the final tar file in>

6) Retains all lib/binary paths (pkgcore) as well as the pkg_app app/ location for older pkg_app aware applications.

7) Names the final storage path as pkgapp-<date>-<index> to help administrators locate the final tar file faster.
 - Ex: pkgapp3-111708-01/

8) Gathers the output of the following Solaris-only proc tool commands from the core.
 - pldd
 - pstack
 - pmap
 - pcred
 - pflags
 - pargs

9) Checks whether the core file is truncated and checks the core file's cksum to ensure the file sizes match.

10) Gathers the following core and pkgapp3 specific data
 - core file cksum
 - core file truncation (good/bad)
 - core file size (ls -l)
 - library list (from the pldd)
 - manifest log (all files included in the final tar)
 - pkgapp3 arguments used.

11) Gathers the following OS specific data
 - coreadm info
 - date/time
 - messages (from /var/adm)
 - pkginfo
 - showrev -p
 - /etc/release (Solaris)
 - ulimit
 - uname -a

12) Creates a coreinfo.sh script to allow Sun Engineers to quickly run "coreinfo" for ns-slapd processes (only).

13) Alerts administrators they must upload 2 files (pkgapp tar & core) separately when they do not use the -i (include core) switch.

14) Will exit with a fatal error if a pldd cannot be properly retrieved from the core and alerts the administrator to run pkgapp3 using a pid as the -c option.

15) Saves the pkgapp3 arguments and runtime log into /var/tmp/pkgapp-history/.  Helps administrators and pkgapp see previous runs.

16) Updated usage and alert messages.

Here are a couple of previews of how the new code works.

 Example: Usage output

\* ----------------------------------------------------------------------------------
\* Sun Microsystems RSD pkgapp 3.0 Solaris                               [11/24/2008]
\* ----------------------------------------------------------------------------------
pkgapp 3.0, a Sun Microsystems data gathering utility.
Usage:
  pkgapp [options] -c <core file | pid> -p <full path (path only) to process binary> -s [path to write tar file]

Required parameters:
 -c <core file OR pid of a running process>
 -p <full path to, but not including the process binary> (ns-slapd, imapd, httpd etc.)

Optional parameters:
 -i (Include a previously generated core file with the final tar.gz)
 -j (Javacore; process a java core)
 -r (Remove all temp files)
 -q (Quiet)
 -d (Debug)
 -J <JstackPath; path to the jstack (jdk) commands>
        defaults to /usr/jdk/instances/...
 -s <Storage; path to store the final tar file in>

usage:  pkgapp -c <name of the core file> -p <path to process binary>
usage:  pkgapp -c <pid of the running app> -p <path to process binary>

Examples: these are examples mixing various parameters

Directory Server
pkgapp -i -r -c ./core.14740 -p /var/mps/ds52p4/bin/slapd/server/64

Messaging Server
pkgapp -c ./core.3496 -p /opt/SUNWmsgsr/lib

Web Server
pkgapp -c ./core.1092 -p /space/iws70/lib -s /var/crash

Calendar Server
pkgapp -r -c ./core -p /opt/SUNWics5/cal/lib

Sendmail
pkgapp -i -c 512 -p /usr/lib

Mysqld
pkgapp -i -r -c ./core -p /support/mysql-5.0.41/bin

Example: Runtime output

\* ----------------------------------------------------------------------------------
\* Sun Microsystems RSD pkgapp 3.0 Solaris                               [11/24/2008]
\* ----------------------------------------------------------------------------------
\* OS release                            [5.10]
\* Platform                              [SUNW,Sun-Blade-2500]
\* Checking [-c] is a core or pid        [using pid 1409]
\* Process Root                          [/support/mysql-5.0.41/data]
\* Databin parameter [-s] checks         [reset to /var/tmp/dev]
\* Databin found                         [/var/tmp/dev/test/cores/storage]
\* Databin writable check                [success]
\* Databin used/created is               [/var/tmp/dev/test/cores/storage/pkgapp-112408-05]
\* Creating temp area                    [/tmp/pkgapp.2928/]
\* Pid used, no corefile to check       
\* Process binary                        [mysqld]
\* Checking usage history                [not recently run]
\* mysqld binary bit version             [32]
\* Checking path [-p] to binary name     [success, path != binary name]
\* Checking path [-p] is a directory     [success]
\* Locating mysqld                       [success]
\* Checking located mysqld is 32 bit     [success]
\* Binary located                        [/support/mysql-5.0.41/bin/mysqld]
\* Adding binary to pkgapp.pldd          [success]
\* Grabbing pldd                         [success]
\* Grabbing pstack                       [success]
\* Grabbing pmap                         [success]
\* Grabbing pcred                        [success]
\* Grabbing pflags                       [success]
\* Grabbing pargs                        [success]
\* Provide the full path and name to the core file
\* If you do not have a core, enter "none"
\* Example - /data/cores/core.1445:      /var/tmp/dev/cores/core.1409
                                        [Answer was -> /var/tmp/dev/cores/core.1409]
\* Grabbing [-i] core/gcore              [success]
\* Javatools [-j] not set                [skipped]
\* Grabbing /var/adm/messages            [success]
\* Grabbing uname -a                     [success]
\* Grabbing date/time                    [success]
\* Grabbing showrev -p                   [success]
\* Grabbing pkginfo -l                   [success]
\* Grabbing /etc/release                 [success]
\* Grabbing coreadm                      [success]
\* Grabbing ulimit                       [success]
\* Grabbing libs                         [success]
\* Making lib paths app/                 [success]
\* Making lib paths libs/                [success]
\* Linking libraries                     [success]
\* Libraries linked                      [62 ttl]
\*                                      
\* Using hostid for naming .tar.gz       [837872d0]
\* Writing file                          [pkgapp-837872d0-s4u-2500a-brm04-081124-095809.tar.gz]
\*                                      
\* Done gathering files                 
\* Writing dbxrc & opencore.sh files     [success]
\* Writing manifest-081124-095809.log    [success]
\* Writing pkgapp-args-081124-095809     [success]
\* Creating final tarfile                [success]
\* Compressing tarfile                   [success]
\* End of runtime logging               
\* Saving history info                   [/var/tmp/pkgapp-history/history-081124-095809.log]
\* Saving runtime log                    [/var/tmp/pkgapp-history/runtime-081124-095809.log]
\* Removing [-r] temp area/files         [removed]
\*                                      
\* Operations Complete                  
----------------------------------------------------------------------------------
Upload the following file(s) to your supportfiles.sun.com Cores Directory at Sun

1) File(s) located in directory /var/tmp/dev/test/cores/storage/pkgapp-112408-05

                [ pkgapp-837872d0-s4u-2500a-brm04-081124-095809.tar.gz ]



                                Thank you.
                                Sun Software Technology Service Center (STSC)


NOTES:
1) You can check for updates to this script here:
        BigAdmin - http://www.sun.com/bigadmin/scripts/indexSjs.html
2) Release Notes and Guides located here:
        Docs - http://docs.sun.com/app/docs/doc/820-0437
3) GDD information located here:
        Docs - http://www.sun.com/service/gdd/index.xml

4) Please send all Bugs and RFE's to the following address:
        Subject "pkgapp bug/rfe" - gdd-issue-tracker@sun.com

5) Please send all other questions etc to:
        Subject "pkgapp feedback" - gdd-feedback@sun.com


I will announce Pkgapp 3.0 for Solaris's availability in the next few days...stay tuned!

Lee

Monday Aug 18, 2008

Pkg_app 3.0 coming soon.

I have finally had time to revamp the now aging pkg_app 2.7.

I have made Pkg_app 3.0 bigger, better and faster, and it will now include all the libraries gathered by the older (2006) pkgcore.sh that many have used in the past.

More later, but here is a sneak peek.

--------------------------------------------------------------------------------
\* Sun Microsystems RSD pkg_app 3.0b Solaris[08/18/2008]
--------------------------------------------------------------------------------
\* OS release                            [5.9]
\* Platform                              [SUNW,Sun-Blade-1000]
\* Using core                            [/var/tmp/cores/Pkg_app/core.28907]
\* Process Root                          [/data/sunone/ds52p5/slapd-config]
\* Process binary                        [ns-slapd]
\* ns-slapd binary bit version           [64]
\* Checking path to binary name          [success, path != binary name]
\* Checking path is a directory          [success]
\* Locating ns-slapd                     [success]
\* Checking located ns-slapd is 64 bit   [success]
\* Binary located                        [/data/sunone/ds52p5/bin/slapd/server/64/ns-slapd]
\* Adding binary to pkg_app.pldd         [success]
\* Grabbing pldd                         [success]
\* Grabbing pstack                       [success]
\* Grabbing pmap                         [success]
\* Grabbing pcred                        [success]
\* Grabbing pflags                       [success]
\* Grabbing pargs                        [success]
\* Grabbing core/gcore                   [success]
\* Grabbing /var/adm/messages            [success]
\* Grabbing uname -a                     [success]
\* Grabbing date                         [success]
\* Grabbing showrev -p                   [success]
\* Grabbing libs                         [success]
\* Making lib paths                      [success]
\* Linking libraries                     [success]
\*                                      
\* Databin parameter check 1             [success]
\* Databin parameter check 2             [success]
\* Databin Found                         [/var/tmp/mycore]
\* Databin used/created is               [/var/tmp/mycore/081808-02]
\* Using hostid for naming .tar.gz       [834d2699]
\* Writing file                          [pkg_app-834d2699-kaneda-080818-080751.tar.gz]
\*                                      
\* Done gathering files                 
\* Writing dbx files                     [success]
\* Creating final tarfile                [success]
\* Compressing tarfile                   [success]
\* End of runtime logging               
\* Operations Complete
--------------------------------------------------------------------------------
Upload this file to your supportfiles.sun.com Cores Directory at Sun
File located in directory /var/tmp/mycore/081808-02


                [ pkg_app-834d2699-kaneda-080818-080751.tar.gz ]


                                Thank you.
                                Sun Software Technology Service Center (STSC)


NOTES:
1) You can check for updates to this script here:
        BigAdmin - http://www.sun.com/bigadmin/scripts/indexSjs.html

2) Release Notes and Guides located here:
        Docs - http://docs.sun.com/app/docs/doc/820-0437

3) GDD information located here:
        Docs - http://www.sun.com/service/gdd/index.xml

4) Please send all Bugs and RFE's to the following address:
        Subject "pkg_app bug/rfe" - gdd-issue-tracker@sun.com

5) Please send all other questions etc to:
        Subject "pkg_app feedback" - gdd-feedback@sun.com


root[/var/tmp/mycore/081808-02]#ls -l
total 38798
drwxrwxrwx   3 root     other        512 Aug 18 08:08 core-lib-data
-rwxrwxrwx   1 root     other        487 Aug 18 08:08 opencore
drwxrwxrwx   3 root     other        512 Aug 18 08:08 os-info
-rwxrwxrwx   1 root     other    19828244 Aug 18 08:08 pkg_app-834d2699-kaneda-080818-080751.tar.gz
drwxrwxrwx   2 root     other        512 Aug 18 08:08 pkgapp-info
drwxrwxrwx   2 root     other        512 Aug 18 08:08 proctool-info
-rwxrwxrwx   1 root     other       1575 Aug 18 08:08 runtime-080818-080751.log

 Beta testing within my group should begin this week.

Cheers!

Lee

Wednesday Jul 09, 2008

Suggestion Box

Hello all,

Just a quick update on the Dirtracer blog front. I haven't forgotten the blog nor those who are waiting for updates...I have, however, been hit with a couple of extremely tough support cases that took a ton of my time.

I decided to set up a Suggestion Box page where readers can suggest topics for me to cover in this blog.

Making a suggestion is quick and easy.  You simply leave a Comment on this page.

Comment how-to:

  1. Click this Comments link -> Comments
  2. Scroll down to the Comments: area
  3. Type in your Name, Email and the Suggestion in the Comment area
  4. Fill in the anti-spam answer and hit Post

Simple.  I will evaluate each suggestion, and if it's a valid topic I will post a blog entry on it on Fridays.

Thanks all and I hope to see many suggestions.

LT

Tuesday Jun 10, 2008

Dirtracer + Core/Gcore = Separate upload for Pkg_app data.

Hello all,

Every now and then I run into an issue where a customer used Dirtracer to grab a Gcore (or two) or a Core file from a crash, but forgot to upload the separate pkg_app tar.gz file that was generated.

It is not a problem per se; it is simply that users miss a critical message that is displayed at the end of a Dirtracer run when a Pkg_app file has been successfully created.

Let's take a look at the messages I am talking about.

Suppose you define one of the following parameters in the dirtracer.config file:

CORE_LOCATION
GRAB_GCORE


Let's use GRAB_GCORE for a quick test; let's grab one Gcore.

GRAB_GCORE="1"


We then run Dirtracer as normal.

#./dirtracer -f ./dirtracer.config

--------------------------------------------------------------------------------
Sun Microsystems dirtracer 6.0.6 Solaris Sparc                        06/09/2008
--------------------------------------------------------------------------------


Dirtracer runs through its Main Loop and grabs one Gcore.

\* Entering Main Performance Gathering Loop
\*                                      
\* Loop 0 - 080609-131950               
\* Loop 1 - 080609-131956               
\* Loop 2 - 080609-132001               
\* Loop 3 - 080609-132007               
\* Loop 4 - 080609-132012               
\* togo[5sec]-timer[5]
\*                                      
\* Grabbing gcore 080609-132018          [success]


Because Dirtracer has noted the presence of the grab-gcore parameter, it will automatically run Pkg_app to gather the OS and DS libraries needed for debugging.
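
For context, this automation roughly replaces the manual steps an administrator would otherwise run by hand; something along these lines (syntax shown as in the Pkgapp 3.0 usage above; the pid, gcore path and -p path are only illustrative):

gcore -o /var/tmp/gcore 1409                                   # writes /var/tmp/gcore.1409
pkgapp -c /var/tmp/gcore.1409 -p /var/mps/ds52p4/bin/slapd/server/64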

By default Dirtracer waits 120 seconds for Pkg_app to complete.  If for some reason the Core/Gcore's header has an overrun (its contents cannot be determined normally) or has other errors, such as being truncated, Dirtracer will kill the Pkg_app gathering phase.  Users should check the Pkg_app tar contents to see whether all libs were gathered and take appropriate action.

\* Packaging files                      
\*   Preparing files - pkg_app           [waiting 120 sec.]      [success]


After the main Dirtracer tar file is created, Dirtracer displays the Operations Complete message and informs the user which files need to be uploaded to Sun.

As you can see below, Dirtracer informs you that there are both a Dirtracer file and a "separate" Pkg_app file that need to be uploaded.  Users should pay special attention to the end messages displayed when Dirtracer finishes; most users miss the second file (Pkg_app) requiring a separate upload.


Operations Complete
--------------------------------------------------------------------------------
1) Dirtracer capture data located in directory [ /var/tmp/data/060908-04 ]

Upload "only" this file to your supportfiles.sun.com cores directory at Sun

        [ dirtracer-834d2699-mysystem-080609-132018.tar.gz ]

2) pkg_app has captured system libs as the result of a gcore or core being found
                As this file may be large, please ftp this separately to Sun
                pkg_app file located in /var/tmp/data/060908-04/pkg_app_data

                [pkg_app834d2699-mysystem-080609-132051.tar.Z ]


You might ask why Pkg_app's tar is not simply included in the main Dirtracer tar file.  The answer is simple...size.  Dirtracer captures by themselves can be big if a customer does not archive their log files on a heavily used system, and the final size can be massive if a Core/Gcore is captured.  It simply wasn't an option to include a big Pkg_app file in an already large Dirtracer file.

[LT]

Wednesday Jun 04, 2008

Using Dirtracer for Proactive Analysis

BACKGROUND:

One of our Customers was seeing a performance issue with their Siteminder-enabled applications each day around 4:30pm.  While the Customer was using Dirtracer to capture event data, the time it took for administration staff to see the alerts made capturing the event difficult.  The event, when it occurred, took approximately 60 seconds from start to finish; by the time the Dirtracers were started, the event had already passed 9-10 minutes earlier.

The Customer set up their BMC Control M monitoring facility so they could automate a 4-way Dirtracer capture as soon as the event was noticed.

NOTE: they did not see any alerts/issues with the Directory or Siteminder Servers at this time but they launched Dirtracer via BMC Control M as a precaution only.

With this said, I analyzed the Dirtracers thinking there was a problem.

SYSTEM INFO:

1) The Customer captured data from 4 systems at the same time: systems 01, 02, 03 and 04.

01    #dirtracer Started:     080422-174012
02    #dirtracer Started:     080422-174013
03    #dirtracer Started:     080422-174127
04    #dirtracer Started:     080422-174126


2) Connection Load.  I see not all servers are equal here; 01 and 04 are taking a lot less of a load than 02 and 03.

01    91 connections average
02    185 connections constant
03    171 connections constant
04    124 connections average


Summary Connection Snippets: taken from cn=monitor searches.

               Conn      Ttl           Curr      Read    ReqQue     OpsInitiated vs.
File            Peak  Th  Cons          Cons      Waitr   BakLog     OpsCompleted
====================================================================================
01:
[080422-175815] 1984   24  525136       101       0       0          332685-332674=11
[080422-175852] 1984   24  525143       102       0       0          332906-332899=7
[080422-175928] 1984   24  525150       103       0       0          333155-333140=15
[080422-180005] 1984   24  525159       87        0       0          331583-331576=7
[080422-180043] 1984   24  525167       82        0       0          331188-331186=2

02: Summary Snippet
[080422-174222] 1984   24  502241       185       0       0          2155474-2155468=6
[080422-174227] 1984   24  502245       185       0       0          2156076-2156073=3
[080422-174232] 1984   24  502249       185       0       0          2156687-2156680=7
[080422-174238] 1984   24  502253       185       2       2          2157573-2157565=8
[080422-174243] 1984   24  502259       185       1       0          2158588-2158583=5

03:
[080422-174513] 2256   24  503749       171       0       0          722384-722378=6
[080422-174518] 2256   24  503755       171       0       0          722607-722603=4
[080422-174522] 2256   24  503759       171       1       0          722890-722886=4
[080422-174527] 2256   24  503762       171       0       0          723230-723226=4
[080422-174531] 2256   24  503767       173       0       1          723506-723498=8

04:
[080422-174633] 8111   24  497842       123       1       0          383898-383884=14
[080422-174644] 8111   24  497848       123       0       2          384516-384511=5
[080422-174655] 8111   24  497855       124       0       1          384996-384993=3
[080422-174706] 8111   24  497860       124       0       1          385318-385313=5
[080422-174717] 8111   24  497864       124       0       0          385745-385739=6


3) The CPU % of each server mostly mirrors the load differences seen above; 01 and 04 are taking a lot less of a load than 02 and 03.

01    1% average busy
02    16% average busy
03    20% average busy
04    8% average busy

CPU (pcpu) taken from the pms.sh (ps) file.

01:
04/22-[17:41:49] ------- 12489 5370536 5331048  1.1    growth:   0 k

02:
04/22-[17:41:39] ------- 6236 2216344 2193576 15.3    growth:   8 k

03:
04/22-[17:42:41] ------- 7930 2214432 2192016 21.3    growth:   -40 k

04:
04/22-[17:42:48] ------- 10554 3717880 3695216  9.1    growth:   -360 k


4) Iostats (disk performance).  Since the servers are using the Veritas File System I cannot tell which disk slices the DS runs on versus which disks show higher-than-normal iostats.

The DS is installed in the following location for all servers:

- /dev/vx/dsk/app/dsserver 1048576   18603  965640     2%    /opt/servers/slapd-master

01    Disks peak to 61% busy     (1 time)   GOOD
02    Disks peak to 58% busy     (1 time)   GOOD
03    Disks peak to 100% busy    (2 times)  FAIR
04    Disks peak to 100% busy    (8 times)  BAD

As a result of the 100% disk-busy times, the average disk I/O service time and wait times increase.

01    avg service time 8.8/ms - wait service 0/ms
02    avg service time 35/ms - wait service 0/ms
03    avg service time 199.5/ms - wait service 0/ms
04    avg service time 105.6/ms - wait service 373.5/ms
01
Tue Apr 22 17:41:07 2008
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   0.0    2.0    0.0    0.0  0.0  0.0    0.0    3.6   0   0 c1
   1.0    4.0    0.0    0.1  0.0  0.0    0.0    8.7   0   0 c2
  61.0   34.0    0.6    1.4  0.0  0.9    0.0    9.9   0  10 c4
 102.0   47.0    1.1    2.0  0.0  1.3    0.0    8.7   0  10 c5
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.5   0   0 c4t0d1
  61.0   32.0    0.6    1.4  0.0  0.9    0.0   10.1   0  60 c4t0d2
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.4   0   0 c4t0d5
 102.0   44.0    1.1    1.9  0.0  1.3    0.0    8.8   0  61 c5t1d2
   0.0    3.0    0.0    0.1  0.0  0.0    0.0    0.7   0   0 c5t1d5
   0.0    2.0    0.0    0.0  0.0  0.0    0.0    3.6   0   1 c1t2d0
   0.0    4.0    0.0    0.1  0.0  0.0    0.0    8.5   0   3 c2t2d0
   1.0    0.0    0.0    0.0  0.0  0.0    0.0    9.5   0   1 c2t0d0

02
Tue Apr 22 17:40:37 2008
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   1.0   23.0    0.0    1.2  0.0  1.0    0.0   41.0   0   2 c1
   0.0   19.0    0.0    1.1  0.0  1.2    0.0   61.1   0   2 c2
   3.0  701.0    0.0    5.8  0.0 24.6    0.0   34.9   0  20 c4
   1.0  648.0    0.0    5.3  0.0 23.5    0.0   36.3   0  20 c5
   0.0    4.0    0.0    0.0  0.0  0.0    0.0    9.6   0   1 c4t0d0
   1.0  351.0    0.0    2.9  0.0 12.4    0.0   35.1   0  58 c4t0d2
   2.0  290.0    0.0    2.3  0.0  9.7    0.0   33.3   0  50 c4t0d3
   0.0   56.0    0.0    0.5  0.0  2.5    0.0   43.9   0  10 c4t0d5
   0.0    4.0    0.0    0.1  0.0  0.0    0.0    7.7   0   1 c5t1d0
   0.0  323.0    0.0    2.6  0.0 11.7    0.0   36.4   0  57 c5t1d2
   1.0  269.0    0.0    2.2  0.0  9.1    0.0   33.8   0  54 c5t1d3
   0.0   52.0    0.0    0.4  0.0  2.6    0.0   50.2   0   9 c5t1d5
   1.0    0.0    0.0    0.0  0.0  0.0    0.0    5.4   0   1 c1t8d0
   0.0    4.0    0.0    0.1  0.0  0.0    0.0    5.9   0   2 c1t2d0
   0.0    9.0    0.0    0.6  0.0  0.6    0.0   61.6   0  13 c1t12d0
   0.0   10.0    0.0    0.6  0.0  0.4    0.0   40.0   0   9 c1t9d0
   0.0    8.0    0.0    0.5  0.0  0.6    0.0   76.2   0  13 c2t12d0
   0.0   11.0    0.0    0.6  0.0  0.6    0.0   50.1   0  12 c2t9d0

03
Tue Apr 22 17:42:06 2008
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   0.0   34.0    0.0    1.5  0.0  1.7    0.0   49.7   0   3 c1
   0.0   33.0    0.0    1.7  0.0  2.1    0.0   63.3   0   3 c2
   0.0  556.0    0.0    4.5  0.0 89.8    0.0  161.5   0  23 c4
   2.0  279.0    0.0    3.2  0.0 12.2    0.0   43.3   0  15 c5
   0.0    9.0    0.0    0.1  0.0  0.0    0.0    5.3   0   3 c4t8d0
   0.0  103.0    0.0    0.9  0.0  1.1    0.0   11.1   0  38 c4t10d0
   0.0  444.0    0.0    3.5  0.0 88.6    0.0  199.5   0 100 c4t11d0
   2.0  249.0    0.0    2.0  0.0 11.0    0.0   43.8   0  76 c5t10d0
   0.0    9.0    0.0    0.1  0.0  0.0    0.0    4.1   0   3 c5t11d0
   0.0   21.0    0.0    1.1  0.0  1.1    0.0   53.6   0  13 c5t13d0
   0.0   10.0    0.0    0.6  0.0  0.7    0.0   73.7   0  13 c2t11d0
   0.0    8.0    0.0    0.4  0.0  0.4    0.0   52.3   0  11 c2t9d0
   0.0   15.0    0.0    0.7  0.0  0.9    0.0   62.3   0  12 c2t12d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    6.8   0   1 c1t2d0
   0.0   11.0    0.0    0.6  0.0  0.5    0.0   44.1   0  13 c1t11d0
   0.0   14.0    0.0    0.6  0.0  0.8    0.0   55.2   0  12 c1t9d0
   0.0    7.0    0.0    0.3  0.0  0.4    0.0   59.3   0  12 c1t12d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0   12.0   0   1 c1t3d0

04
Tue Apr 22 17:42:23 2008
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   0.0  299.1    0.0    2.3 100.8 29.8  337.1   99.6  16  17 c1
   0.0  349.1    0.0    2.7 107.5 29.6  308.0   84.9  16  17 c2
   0.0    5.0    0.0    0.1  0.0  0.0    0.0    4.5   0   0 c4
   0.0   48.0    0.0    2.6  0.0  0.6    0.0   12.5   0   2 c5
   0.0    5.0    0.0    0.1  0.0  0.0    0.0    4.5   0   2 c4t8d0
   0.0    2.0    0.0    0.0  0.0  0.0    0.0   12.4   0   2 c5t8d0
   0.0   22.0    0.0    1.2  0.0  0.3    0.0   12.2   0   4 c5t9d0
   0.0   22.0    0.0    1.2  0.0  0.3    0.0   13.7   0   4 c5t10d0
   0.0    1.0    0.0    0.1  0.0  0.0    0.0    4.7   0   0 c5t11d0
   0.0    1.0    0.0    0.1  0.0  0.0    0.0    1.8   0   0 c5t12d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0   12.5   0   1 c2t11d0
   0.0  181.1    0.0    1.4 51.9 14.6  286.7   80.7  92 100 c2t0d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0   11.5   0   1 c2t13d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    4.1   0   0 c2t4d0
   0.0  164.0    0.0    1.3 55.6 15.0  339.0   91.4 100 100 c2t10d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    7.4   0   1 c2t5d0
   0.0    4.0    0.0    0.1  0.0  0.0    0.0    6.5   0   3 c1t11d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    7.3   0   1 c1t13d0
   0.0  142.0    0.0    1.1 53.0 15.0  373.2  105.6 100 100 c1t0d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0   12.3   0   1 c1t4d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    6.4   0   1 c1t3d0
   0.0  148.0    0.0    1.1 47.8 14.7  322.9   99.5  92 100 c1t10d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    7.7   0   1 c1t2d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    9.5   0   1 c1t5d0


5) Other ns-slapd instances running on these servers.  There are 5 total directory server instances on each of the 4 servers, so I cannot see if the other ns-slapd instances are loading the disks to the levels seen above.  Significant load on the other instances could impact the disks used by the known instances and affect performance.

6) Dirtracer itself was not able to complete a normal Main Loop check in the configured time; Dirtracer was configured to loop through each check in the Main Loop at 1-second INTERVALs.

Each loop executes a pstack, prstat, netstat, a cn=monitor search and 1 backend cache info search (1 backend in this case).  Each of these data points is very lightweight and completes in under a second.

On average, each server completes a loop in 5-41 seconds.  It is unknown why these servers cannot complete the loops as fast as configured.

See the following examples: each loop shows the date/time stamp when the loop starts...note how many seconds pass between loops (it should have been 1 second).

01
\* Loop 0 - 080422-174027             
\* Loop 1 - 080422-174106             
\* Loop 2 - 080422-174148             
\* Loop 3 - 080422-174228            
\* Loop 4 - 080422-174308           

02
\* Loop 0 - 080422-174030             
\* Loop 1 - 080422-174036             
\* Loop 2 - 080422-174041            
\* Loop 3 - 080422-174045             
\* Loop 4 - 080422-174050           

03
\* Loop 0 - 080422-174143             
\* Loop 1 - 080422-174148            
\* Loop 2 - 080422-174153            
\* Loop 3 - 080422-174158             
\* Loop 4 - 080422-174203

04
\* Loop 0 - 080422-174139             
\* Loop 1 - 080422-174150             
\* Loop 2 - 080422-174201             
\* Loop 3 - 080422-174213             
\* Loop 4 - 080422-174225 

LDAP INFO:

1) A quick check of the Dirtracer search etimes shows each is sub-second.

On 04 only a few were 0.1 to 0.2 seconds.  The following were taken from the access log snippets captured.

root[/customerdata/6xxxxxxx/Apr22/04-dirtracernxcspld4-080422/ds-logs/working]#cat conn-49\* | awk '{print $11}' | grep etime | grep -v etime=0\\.0
etime=0.246540
etime=0.118270
etime=0.110770
etime=0.143630

2) Using the logs I found a Siteminder Policy Server that was configured to use an improper search base (dc=com).  For this data set only 01 doesn't have any Siteminder searches using the incorrect base; 02, 03 and 04 all do, with 04 having the most.

01:    None
02:    4604            dc=com searches
03:    5457            dc=com searches
04:    13749          dc=com searches
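
For reference, counts like these can be pulled straight from the access logs with a simple grep; the log file glob below is a placeholder:

grep -c 'SRCH base="dc=com"' access*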


I was able to track down the IP on only one server (04), as the connections on the others were already established.  I need to check older logs to see whether there are other IP addresses these are coming from.

10.10.10.1

[22/Apr/2008:17:04:52 -0400] conn=496746 op=-1 msgId=-1 - fd=79 slot=79 LDAP connection from 10.10.10.1 to 10.10.10.181
[22/Apr/2008:17:04:52 -0400] conn=496746 op=0 msgId=1 - BIND dn="uid=siteminder,ou=service,dc=xxxxxx,dc=com" method=128 version=3
[22/Apr/2008:17:04:52 -0400] conn=496746 op=0 msgId=1 - RESULT err=0 tag=97 nentries=0 etime=0.000910 dn="uid=siteminder,ou=service,dc=xxxxxx,dc=com"
...
[22/Apr/2008:17:04:52 -0400] conn=496746 op=3 msgId=4 - SRCH base="dc=com" scope=2 filter="(&(uniqueMember=uid=userremoved,ou=People,dc=xxxxxx,dc=com)(objectClass=groupOfUniqueNames\*))" attrs="cn"
[22/Apr/2008:17:04:52 -0400] conn=496746 op=3 msgId=4 - RESULT err=32 tag=101 nentries=0 etime=0.000540
...
[22/Apr/2008:17:04:52 -0400] conn=496746 op=5 msgId=6 - SRCH base="dc=com" scope=2 filter="(&(uniqueMember=uid=userremoved,ou=People,dc=xxxxxx,dc=com)(objectClass=groupOfUniqueNames\*))" attrs="cn"
[22/Apr/2008:17:04:52 -0400] conn=496746 op=5 msgId=6 - RESULT err=32 tag=101 nentries=0 etime=0.000480
...
[22/Apr/2008:17:04:52 -0400] conn=496746 op=7 msgId=8 - SRCH base="dc=com" scope=2 filter="(&(uniqueMember=uid==userremoved,,ou=People,dc==xxxxxx,,dc=com)(objectClass=groupOfUniqueNames\*))" attrs="cn"
[22/Apr/2008:17:04:52 -0400] conn=496746 op=7 msgId=8 - RESULT err=32 tag=101 nentries=0 etime=0.000460
...


The above connection (conn=496746) alone contains 1195 searches using the incorrect base.

root[/customerdata/6xxxxxx/Apr22/04-dirtracernxcspld4-080422/ds-logs]#gc err=32 conn-496746.log
1195


3) High etimes / high-etime servers.  Overall, server 04 is the biggest high-etime server, followed by 01.

Top 5 High Etimes

01
etime=413.555070    1      
etime=300.814310 notes=F    1      
etime=115.912590    1      
etime=39.566400    1     
etime=39.553510 notes=F    1       

02
etime=8.719340    1      
etime=6.210240    1      
etime=6.101450    1      
etime=5.814920    1      
etime=5.802740    1    

03
etime=7.325220    1      
etime=5.766010    1      
etime=5.558690    1      
etime=5.186940    1      
etime=5.147010    1

04
etime=1876.043450 notes=U    1      
etime=1609.753640 notes=U    1      
etime=1275.956850 notes=U    1      
etime=1251.870940 notes=U    1 

We now know 04's highest offending etimes come from 4 cn=\* searches from a certain application at IP 10.10.10.225.

nentries=598115                 4
[22/Apr/2008:16:48:50 -0400] conn=496301 op=-1 msgId=-1 - fd=47 slot=47 LDAP connection from 10.10.10.225 to 10.10.10.181
[22/Apr/2008:16:48:50 -0400] conn=496301 op=0 msgId=1 - BIND dn="uid=appl100,ou=service,dc=xxxxxx,dc=com" method=128 version=3
[22/Apr/2008:16:48:50 -0400] conn=496301 op=0 msgId=1 - RESULT err=0 tag=97 nentries=0 etime=0.002810 dn="uid=appl100,ou=service,dc=xxxxxx,dc=com"
[22/Apr/2008:16:48:50 -0400] conn=496301 op=1 msgId=2 - SRCH base="ou=people,dc=xxxxxx
,dc=com" scope=1 filter="(cn=\*)" attrs="uid givenName sn mail region modifiers uid modifyTimestamp newOSUser OSmanaged OSstatus"
[22/Apr/2008:17:15:39 -0400] conn=496301 op=1 msgId=2 - RESULT err=0 tag=101 nentries=598115 etime=1609.753640 notes=U
[22/Apr/2008:17:15:40 -0400] conn=496301 op=2 msgId=3 - UNBIND
[22/Apr/2008:17:15:40 -0400] conn=496301 op=2 msgId=-1 - closing - U1
[22/Apr/2008:17:15:40 -0400] conn=496301 op=-1 msgId=-1 - closed.


CONFIGURATION INFO:

1) The Dirtracer captures highlighted differences in cache settings.  Servers 02 and 03 have 1 GB of cache in total, whereas server 01 has 3 GB and server 04 has 2 GB.

01
       Total DB Cache:     1,048,576 Kb  (1073741824 bytes)
       Total Entry Cache:  2,441,406 Kb  (2500000000 bytes)
       --------------------------------------------------
       Total Cache Size:   3 Gb,  3,408 Mb,  3,489,982 Kb,  (3573741824 bytes)

02
       Total DB Cache:     1,048,576 Kb  (1073741824 bytes)
       Total Entry Cache:  1,024,000 Kb  (1048576000 bytes)
       --------------------------------------------------
       Total Cache Size:   1 Gb,  2,024 Mb,  2,072,576 Kb,  (2122317824 bytes)

03
       Total DB Cache:     1,048,576 Kb  (1073741824 bytes)
       Total Entry Cache:  1,024,000 Kb  (1048576000 bytes)
       --------------------------------------------------
       Total Cache Size:   1 Gb,  2,024 Mb,  2,072,576 Kb,  (2122317824 bytes)

04
       Total DB Cache:     1,048,576 Kb  (1073741824 bytes)
       Total Entry Cache:  1,126,400 Kb  (1153433600 bytes)
       --------------------------------------------------
       Total Cache Size:   2 Gb,  2,124 Mb,  2,174,976 Kb,  (2227175424 bytes)



2) None of the four servers uses a tmpfs (tmp file system) for its db home; this affects performance on Unix systems.

"UNIX only. Used to fix a situation on UNIX platforms where the operating system endlessly flushes pages. This flushing can be so excessive that performance of the entire system is severely degraded."

   See: nsslapd-db-home-directory http://docs.sun.com/source/816-6699-10/plugattr.html

Each is using the main veritas partition for the db home.

   db-home-directory: /opt/servers/slapd-master/db
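
If they do move the db home to tmpfs, the change is a single attribute; a sketch using ldapmodify follows (verify the exact DN and procedure for your DS version against the docs linked above, and note that a server restart is typically required; host, port, password and path are placeholders):

ldapmodify -h <host> -p <port> -D "cn=Directory Manager" -w <password> <<EOF
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-db-home-directory
nsslapd-db-home-directory: /tmp/slapd-master-dbhome
EOF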

3) Each server uses VxFS (Veritas File System), which has known fsync (file sync) problems when flushing dirty pages from memory/cache to the DBs.  This happens mostly during the transaction checkpoint (txn_checkpoint).  The fsync problem results in an overall slowdown of the DS process, which affects etimes on all other operations while the checkpoint occurs.  I have seen this on Veritas 4.4, and the customer is running an even older version.

   NAME:  VERITAS File System
   VENDOR:  VERITAS Software
   PSTAMP:  VERITAS-3.4FS-2003-01-29PID=110435-08

The problem can be seen by using the cn=monitor search in the DS to show read waiters and the request queue backlog.  See the cn=monitor summary snippet at the top (above).  Anything more than 0 for either is not optimal.  Each of the four servers has between 1 and 3 read waiters or request-queue backlogs at any given time, though generally each runs with 0 most of the time.

4) I asked a while ago whether the customer had done the following.

  • Upgraded from Directory Server Patch 4 - YES
  • Used the ActivateFix script in Patch 6 - NO


I saw the following warning in the logs from servers 01, 03 and 04.

[22/Apr/2008:17:47:00 -0400] - WARNING<10307> - Replication  - conn=-1 op=-1 msgId=-1 - Configuration warning  Tombstone purging internal search is unindexed in replica dc=xxxxxx,dc=com


It is possible they have hit bug 6175472 and are seeing some of the tombstone-purging performance hit.  The allids threshold is only 4000 on these servers.  Tombstones are DELeted entries waiting to be purged; if the number of deleted entries goes above 4k, the objectclass index becomes allids and the search for tombstone entries is unindexed and takes a long time.

6175472 MMR: High performance degradation when purging tombstones

I checked the dse.ldif files to see if the fix was enabled.  Part of the fix places an nsMatchingRule OID in the index entries, as seen in the dse.ldif.  The matching rule is NOT in any index on any of the servers' dse.ldif files:

nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150


01:
root[/customerdata/6xxxxxxx/Apr22/01-dirtracernxcspld1-080422/ds-config-data]#grep -ic 'nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150' dse.ldif
0

02:
root[/customerdata/6xxxxxxx/Apr22/02-dirtracernxcspld2-080422/ds-config-data]#grep -ic 'nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150' dse.ldif
0

03
root[/customerdata/6xxxxxxx/Apr22/03-dirtracernxcspld3-080422/ds-config-data]#grep -ic 'nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150' dse.ldif
0

04
root[/customerdata/6xxxxxxx/Apr22/04-dirtracernxcspld4-080422/ds-config-data]#grep -ic 'nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150' dse.ldif
0

Even though the above data was captured when there were no Directory or Siteminder server problems, the Dirtracer data helped to highlight the following:

  1. problematic/incorrect directory server configuration/architecture
  2. possible load-balancing problems, i.e. sending more connections to two servers as opposed to being completely load balanced
  3. disk capacity/IO peak issues, possibly from other heavily loaded ns-slapd instances
  4. incorrect application configuration
  5. overall etimes of the servers

[LT]

Tuesday May 27, 2008

Pkg_app in a Clustered Environment

Hello all,

I was recently contacted by a Sun Customer Advocate regarding Pkg_app, Dirtracer and Clustered environments.

First, some up-front info on Pkg_app.  Pkg_app was originally created by Rick Zagorski, and I took over its development in 2006.  I took what Rick had coded and added to it specifically (at the time) to help with Directory Server core/gcore files.  This was released as pkg_app 2.0 and was a mash of Rick's code with some additions of mine.  As you know, I created Dirtracer (Stracer) in 2004 and continue its development to help customers gather data on their Directory Server problems.

The Customer was having issues with Pkg_app and Dirtracer when they attempted to use them in a clustered environment.  Even though the customer was running an old version of Pkg_app, Pkg_app and Dirtracer are "not" tested nor certified for use in a clustered environment.  I have only had one prior request to make pkg_app/dirtracer cluster aware, so it has not been at the top of the revision list.

I will be revamping the Pkg_app code to take into account all libraries etc. that Pkgcore gets for use with Java Cores and will look into the Cluster issue.

More news on this later.

[LT]

Wednesday May 21, 2008

Checks, Intervals and Loops Oh My - Part 3

This is my final blog on the Number of Checks and Interval settings within Dirtracer.

Let's discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and two of its governing parameters.

        NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters are totally dependent on the problem type you are gathering for.

The following are the data points normally gathered in the Main Loop; these are all, of course, configurable, so all, some or none may be gathered.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an admin can set the Number of Checks, or "data points", to gather.  Let's look at some example problem types and the settings you may want to use to capture them.

The Number of Checks and Interval parameters govern how many loops are run and how long the Main Loop runs for; however, they are also used when executing the following commands.  These commands are launched before the Main Loop is engaged and run in the background while the Main Loop is active (a short illustration follows the list).

  • Iostats
  • Vmstats
  • Mpstats
  • Pms.sh

See the man pages for iostat, vmstat and mpstat for more information on these commands.
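
As an illustration only (not the actual Dirtracer code), with a 5 x 5 configuration the background collectors amount to something like this:

INTERVAL=5
NUMBEROFCHECKS=5
iostat -xn $INTERVAL $NUMBEROFCHECKS > iostat.out &
vmstat     $INTERVAL $NUMBEROFCHECKS > vmstat.out &
mpstat     $INTERVAL $NUMBEROFCHECKS > mpstat.out &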

Please see my previous blog on pms.sh (Pmonitor) for more information regarding this script.


[LT]

Monday May 19, 2008

Checks, Intervals and Loops Oh My - Part 2

Let's discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and two of its governing parameters.

    NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters are totally dependent on the problem type you are gathering for.

The following are the data points normally gathered in the Main Loop; these are all, of course, configurable, so all, some or none may be gathered.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an admin can set the Number of Checks, or "data points", to gather.  Let's look at some example problem types and the settings you may want to use to capture them.

Example 4 - Crashing


The crashing option can be used when you know or suspect a crash will happen based on time or circumstances; see Configurator option 4.  Crash tracking with Dirtracer is as simple as setting the runtime to a long period of time and enabling the following options.

    CRASH_TRACK="1"
    PMONITOR_ONLY="1"

The point of crash tracking is to place Dirtracer into a "wait and see" mode.  It will poll for the process id, note when the crash happens, ask the user whether the crash produced a core and where the core is located, and then gather the remaining data such as logs.

The Number of Checks and Interval settings need to be configured to allow Dirtracer enough time for a crash to happen; this can vary based on the problem.  All of the following are valid; configure your settings so the run extends past the time you expect the crash to occur.

Set Number of Checks to 1000 and Interval to 30.  Total Runtime 30,000 Seconds.
Set Number of Checks to 100 and Interval to 60.  Total Runtime 6,000 Seconds.
Set Number of Checks to 60 and Interval to 60.  Total Runtime 3,600 Seconds.
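
The run shown below uses the 60 x 60 combination; in dirtracer.config terms that amounts to:

    CRASH_TRACK="1"
    PMONITOR_ONLY="1"
    NUMBEROFCHECKS="60"
    INTERVAL="60"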

\* Dirtracer mainloop will run for       [3600 sec. About 60 min.]
...
\* Entering Main Performance Gathering Loop
\*                                       
\* Loop 0 - 080519-095153                [pms.sh only]
\* Loop 1 - 080519-095253                [pms.sh only]
\* Loop 2 - 080519-095353                [pms.sh only]
\* Loop 3 - 080519-095454                [pms.sh only]
\* Loop 4 - 080519-095554                [pms.sh only]
\* Loop 5 - 080519-095654                [pms.sh only]
\* Loop 6 - 080519-095754                [pms.sh only]
\* Loop 7 - 080519-095854                [pms.sh only]
\* Loop 8 - 080519-095954                [pms.sh only]
\* Loop 9 - 080519-100054                [pms.sh only]
\* Loop 10 - 080519-100155               [pms.sh only]
\* Loop 11 - 080519-100255               [pms.sh only]
\* Loop 12 - 080519-100355               [pms.sh only]
\* Loop 13 - 080519-100455               [pms.sh only]
\* Loop 14 - 080519-100555               [pms.sh only]
\* Loop 15 - 080519-100655               [pms.sh only]
\* Loop 16 - 080519-100755               [pms.sh only]
\* Loop 17 - 080519-100855               [pms.sh only]
\* Loop 18 - 080519-100956               [pms.sh only]
\* Loop 19 - 080519-101056               [pms.sh only]
\* Loop 20 - 080519-101156               [pms.sh only]
\* Loop 21 - 080519-101256               [pms.sh only]
\* Loop 22 - 080519-101356               [pms.sh only]
\* Loop 23 - 080519-101456               [pms.sh only]
\* Loop 24 - 080519-101556               [pms.sh only]
\* Loop 25 - 080519-101656               [pms.sh only]
\* Loop 26 - 080519-101757               [pms.sh only]
expr: syntax errormer[47]
\*                                       
\*       ALERT - The ns-slapd process has died!
\*                                       
\* [ALERT 2] The ns-slapd process has died!
\*                                       
\*
\*                                       
\* Locating crash core file              [failed]
\*                                       
\* Did the server produce a core file (y/n)? y
\*                                       
\* What is the full path to the core?    
\*                                       
\* Example: /var/cores/core.10008 :      /var/tmp/cores/core.10008



Example 5 - Memory Leak


Using the Memory Leak option allows you to gather up-front Pms.sh (Pmonitor) data over a long period of time to show the progression of memory usage of the slapd process.  Setting Memory Leak Tracking automatically sets the Interval to 1800, but the user needs to set the Number of Checks accordingly.

Enabling MEM_LEAK_TRACKING="1" sets the following automatically.

    INTERVAL="1800"
    PMONITOR_INTERVAL="5"

Set the Number of Checks to allow Dirtracer to run for the time you feel it takes the ns-slapd process to show a substantial amount of leakage.  Let's say you are facing a leak that manifests itself over 24 hours.  You could set the Number of Checks to 50 and Dirtracer will capture for 1,500 minutes (approx. 25 hours).  Use Configurator to set up the dirtracer.config file, or use this simple formula.

    1800 (interval) \* N = <Number of Seconds (total run time)>
                OR
    Minutes to leak \* 60 / 1800 (Interval in seconds) = <NUMBEROFCHECKS>
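
For the 24-hour scenario above, the relevant dirtracer.config entries would simply be:

    MEM_LEAK_TRACKING="1"
    NUMBEROFCHECKS="50"
    # INTERVAL="1800" and PMONITOR_INTERVAL="5" are set automatically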

\* Dirtracer mainloop will run for       [90000 sec. About 1500 min.]
...
\* Mem Leak tracking                     [on]
...
\* Entering Main Performance Gathering Loop
\*                                       
\* Loop 0 - 080519-105751                
\* togo[90000sec]-timer[117]
...


I have eliminated the full run time information as it would be too long for a Blog post.


Example 6 - Server Down


Use of the server down option does not require the Interval and Number of Checks to be configured; in fact, they are ignored.

\* Process State                         [server down, no process]
\* Server down, pms.sh not needed        
\*                                       
\* Entering Main Performance Gathering Loop
\*                                       
\*
\* Server down, skipping main loop       
\*
\*                                       
\* Exiting Main Performance Gathering Loop


Example 7 - Basic Capture


The Basic Capture is a simple 5 x 5 check.  By default the dirtracer.config file is shipped with the following Interval and Number of Checks set.  You can also enable the Basic Capture with Option 7 when using Configurator.

    NUMBEROFCHECKS="5"
    INTERVAL="5"

Example 8 - Config Only Capture


The Config Only Capture can be enabled using the CONFIG_ONLY="1" parameter or also using Option 8 in Configurator.  This sets up a config file for a 1 x 1 (1 loop) capture.

    NUMBEROFCHECKS="1"
    INTERVAL="1"

To be Continued...

Until next time

[LT]

Thursday May 15, 2008

Checks, Intervals and Loops Oh My - Part 1

Hello all!

Let's discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and two of its governing parameters.

        NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters are totally dependent on the problem type you are gathering for.

The following are the data points normally gathered in the Main Loop; these are all, of course, configurable, so all, some or none may be gathered.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an admin can set the Number of Checks, or "data points", to gather.  Let's look at some example problem types and the settings you may want to use to capture them.

Example 1 - Hung Processes (See Borked):


Most times an ns-slapd process is not actually hung but only seems like it is.  A perceived hang could just be an ultra-busy process, caused by a series of massive db searches, all worker threads being taken and waiting on one to free a lock, or many other related issues.

Setting the Number of Checks and Interval correctly here can be critical.  Set them incorrectly and you may miss a data gathering opportunity.

Set Number of Checks to 5 and Interval to 5.  Total Runtime 25 Seconds.

This will gather 5 Pstacks/Prstats at 5-second intervals and can show whether the process is changing over time, but it does not have the granularity to show each thread's progression through the stack.

\* Dirtracer mainloop will run for       [25 sec.]
...
\* Entering Main Performance Gathering Loop
\*                                       
\* Loop 0 - 080515-092006                
\* Loop 1 - 080515-092011                
\* Loop 2 - 080515-092017                
\* Loop 3 - 080515-092022                
\* Loop 4 - 080515-092028


Example 2 - High CPU or Performance Problems:


As in Example 1, we want to see the process stack and threads change over time, but for a high-CPU or performance problem we want to see things change second by second.  A better option for this problem type is to set the Number of Checks to 25 and the Interval to 1.

Set Number of Checks to 25 and Interval to 1.  Total Runtime 25 Seconds.

This will gather 25 Pstacks/Prstats at 1-second intervals.  This way we can see the process stack change with no gaps in the captures; in Example 1 there were 5-second gaps between pstacks, and the threads can change a huge amount on a very busy server in that timeframe.

\* Dirtracer mainloop will run for       [25 sec.]
...
\* Entering Main Performance Gathering Loop
\*                                       
\* Loop 0 - 080515-092554                
\* Loop 1 - 080515-092555                
\* Loop 2 - 080515-092557                
\* Loop 3 - 080515-092558                
\* Loop 4 - 080515-092600                
\* Loop 5 - 080515-092601                
\* Loop 6 - 080515-092603                
\* Loop 7 - 080515-092604                
\* Loop 8 - 080515-092605                
\* Loop 9 - 080515-092607                
\* Loop 10 - 080515-092608               
\* Loop 11 - 080515-092610               
\* Loop 12 - 080515-092611               
\* Loop 13 - 080515-092613               
\* Loop 14 - 080515-092614               
\* Loop 15 - 080515-092616               
\* Loop 16 - 080515-092617               
\* Loop 17 - 080515-092619               
\* Loop 18 - 080515-092620               
\* Loop 19 - 080515-092621               
\* Loop 20 - 080515-092623               
\* Loop 21 - 080515-092624               
\* Loop 22 - 080515-092626               
\* Loop 23 - 080515-092627               
\* Loop 24 - 080515-092629               


Because I have increased the Number of Checks and decreased the Interval, the granularity is higher and I get 25 data points as opposed to 5 over the same time period.

Example 3 - Replication Problems:


The key to debugging a Replication Problem is Debug Logging over a period of time.  Setting the special PTS_CONFIG_LOGGING parameter to 8192 will allow Dirtracer to change the nsslapd-infolog-area logging value in the dse to 8192 (Replication Debug Logging).

PTS_CONFIG_LOGGING="8192"


Setting the Number of Checks or Interval for Granularity is not as important with Replication Problems as it is with Hangs or High CPU Problems.  The settings can vary and still achieve the same results.

Set Number of Checks to 40 and Interval to 30.  Total Runtime 20 Minutes.
Set Number of Checks to 10 and Interval to 120.  Total Runtime 20 Minutes.
Set Number of Checks to 5 and Interval to 240.  Total Runtime 20 Minutes.
Set Number of Checks to 1 and Interval to 1200.  Total Runtime 20 Minutes.
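
The run shown below used the 5 x 240 combination; in dirtracer.config terms:

    PTS_CONFIG_LOGGING="8192"
    NUMBEROFCHECKS="5"
    INTERVAL="240"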

\* Dirtracer mainloop will run for       [1200 sec. About 20 min.]
...
\* Logging level is being changed        [nsslapd-infolog-area]
\*   current level:                      [0]
\*   new level:                          [8192]
...
\* Entering Main Performance Gathering Loop
\*                                       
\* Loop 0 - 080515-100855                
\* Loop 1 - 080515-101256                
\* Loop 2 - 080515-101656                
\* Loop 3 - 080515-102057                
\* Loop 4 - 080515-102457                
\* togo[240sec]-timer[240]
\*                                       
\* Exiting Main Performance Gathering Loop


There are five other problem types we can discuss, but let's save those for another blog post.

  • Crashing
  • Memory Leak
  • Server Down
  • Basic Capture
  • Config Only Capture


To be Continued...

Ciao all!

[LT]

Tuesday May 13, 2008

Pkg_app and Dirtracer

Today I will revisit Pkg_app but will focus on its uses within Dirtracer.

Before Dirtracer 6.0.4, customers who used Dirtracer to gather cores and gcores had to run Pkg_app manually after the fact.

Since version 6.0.4, Dirtracer has included Pkg_app in the <Dirtracer Install Path>/dirtracertools/ location, and with the Quiet (-q) switch in Pkg_app 2.7 I was able to embed Pkg_app within Dirtracer so it runs automatically.

If a customer uses the following config file parameters, Pkg_app will be launched automatically.

CORE_LOCATION="<full path to the core>" + SERVER_DOWN="1"

GRAB_GCORE="1" or GRAB_GCORE="2"

Here is an example run with the following config; I used 1 check and a 1-second interval for brevity.

NUMBEROFCHECKS="1"
INTERVAL="1"
GRAB_GCORE="1"

See the runtime-<date>-<time>.log:

As you see below, Dirtracer completes a quick one-second loop, exits the Main Loop and grabs a Gcore.

<SNIP>
\*   pms.sh interval(1) x checks(1)      [pms.sh run time (1 sec.)]
\*                                       
\* Entering Main Performance Gathering Loop
\*                                       
\* Loop 0 - 080509-075120                
\*                                       
\* Grabbing gcore 080509-075122          [success]
</SNIP>

Once Dirtracer finishes with the Post Loop gathering, it executes Pkg_app to gather all libraries and the ns-slapd binary.  Note the normal Pkg_app processing information is not seen because Pkg_app has been launched with the Quiet (-q) option.

<SNIP>
\* Packaging files                       
\*   Preparing files - pkg_app           [waiting 120 sec.]      [success]
</SNIP>

In Dirtracer 6.0.4, customers grabbing large cores/gcores with Dirtracer saw what they thought was a pkg_app hang.  It was likely that the core/gcore had overflowed the core header and Pkg_app could not process the file correctly.  As a result I created a timer function to monitor processes like Pkg_app.

If Pkg_app runs for more than 120 seconds, Dirtracer will "kill" the pkg_app process and alert the Customer that they need to run Pkg_app manually.

<SNIP>
\* Packaging files                       
\*   Preparing files - pkg_app           [killed]
</SNIP>
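
The timer is conceptually just a background launch plus a countdown and a kill.  Here is a minimal sketch of the idea; this is not Dirtracer's actual code, and some_long_running_command simply stands in for the embedded pkg_app call:

# launch the monitored command in the background
some_long_running_command &
CHILD=$!
COUNT=120
# count down while the child is still alive
while [ $COUNT -gt 0 ] && kill -0 $CHILD 2>/dev/null; do
    sleep 1
    COUNT=`expr $COUNT - 1`
done
# still running after 120 seconds? kill it and warn the admin
if kill -0 $CHILD 2>/dev/null; then
    kill $CHILD
    echo "pkg_app exceeded 120 sec. - please run Pkg_app manually"
fi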

If Pkg_app is successful, Dirtracer presents the Customer with the following message; see 2) below.

<SNIP>
1) Dirtracer capture data located in directory [ /var/tmp/data/051308-01 ]

Upload "only" this file to your supportfiles.sun.com cores directory at Sun

        [ dirtracer-834d2699-kaneda-080513-090202.tar.gz ]

2) pkg_app has captured system libs as the result of a gcore or core being found
                As this file may be large, please ftp this separately to Sun
                pkg_app file located in /var/tmp/data/051308-01/pkg_app_data

                [pkg_app834d2699-kaneda-080513-090347.tar.Z ]
</SNIP>

Currently Dirtracer does not give a major alert if Pkg_app was killed; in that case the Customer should run Pkg_app manually or otherwise gather the libraries used by the process.

[LT]

Thursday May 08, 2008

All Unix's are not created equal

I was just thinking about the Dirtracer parameters I recently blogged about and realized I forgot to mention a basic fundamental of Unix...that being that all Unix's are not created equal.  Simply stated, all Unix flavors do not have the same command sets.

Working for Sun, I normally prefer Solaris, but as a Directory Engineer and the author of Dirtracer I need to understand and support both the Directory Server and Dirtracer on every platform the Directory Server is supported on; Windows is the exception.

When I created Stracer (Dirtracer's predecessor) I wrote it only for Solaris.  Solaris includes many commands that other Unix flavors don't, and vice versa.  Such commands include gcore, pstack, prstat and all of the proctool-related commands.

How are they used?

  • Gcore is a way to instantly drop a binary file that is equivalent to a process core file (much like a Solaris crash dump, but for a single process).
  • Pstack is a text-based dump of the process's threads and the functions each thread was executing when the pstack was taken.
  • Prstat shows each thread's size, run state and cpu% used.
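
For example, against a running ns-slapd these look something like the following (the pid 22241 and output paths are illustrative):

gcore -o /var/tmp/ns-slapd 22241       # writes /var/tmp/ns-slapd.22241
pstack 22241 > /var/tmp/pstack.22241   # one stack trace per thread
prstat -L -p 22241 5 5                 # per-thread size, state and cpu%, 5 samples at 5 sec.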

All of these commands are extremely useful when debugging memory leaks, high cpu issues or bottlenecks with operations.  They are also a fundamental part of the Main (data gathering) Loop in Dirtracer.

Enough about Solaris for now.

The point of this post is that while not all of my posts are Solaris specific, some parameter sets used by Dirtracer may not apply to other flavors like Linux and HP-UX.

Linux and HP-UX have open source ports of some of the above-mentioned commands, but they are untested (by me) and would have to be installed on their respective OS's by the Customer/Admin, so I cannot rely on them.

Dirtracer uses basically the same code whether it is run on Solaris, Linux or HP-UX, but if these commands are not found they are skipped.  As a result, neither the Customer nor Sun Support can benefit from this data even when it would otherwise be available.

[LT]

Wednesday May 07, 2008

Pmonitor, Pms.sh and the almighty ps command

Originally Stracer (the predecessor to Dirtracer) included a small shell script called Pmonitor written by Ben Gooley, then a Directory Server Support Engineer.  Pmonitor (Process Monitor) was a lightweight script that used the "ps" command with a series of switches to retrieve the Virtual Size (vsz), Resident Set Size (rss) and the ratio of CPU time used to CPU time available (pcpu).

man ps:
     vsz   The total size of the process in  virtual  memory,  in
           kilobytes.

     rss   The resident set size of the process, in kilobytes.

     pcpu  The ratio of CPU time used recently to CPU time avail-
           able  in  the  same period, expressed as a percentage.
           The  meaning  of  ``recently''  in  this  context   is
           unspecified.  The  CPU time available is determined in
           an unspecified manner.
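
Under the hood, a monitor like this just wraps a ps call in a loop; a one-line sketch of that call (the actual switches in Pmonitor may differ, and the pid is illustrative):

ps -o pid= -o vsz= -o rss= -o pcpu= -p 22241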

Pmonitor was great for seeing an average cpu % busy (rudimentary load) as well as tracking the process's memory footprint over time.  This helps Sun Support Engineers shed light on a possible memory leak if the process size never shrinks.  It was a bit hard to see any real trend in just the raw data, so many Engineers plotted the data to see it visually over time.

#./pmonitor 22241 1 5
DATE   -  [TIME] ------- PID   VSZ   RSS   PCPU
05/06-[14:27:50] ------- 22241 86296 56360  0.0
05/06-[14:27:51] ------- 22241 86296 56360  0.0
05/06-[14:27:52] ------- 22241 86296 56360  0.0
05/06-[14:27:53] ------- 22241 86296 56360  0.0
05/06-[14:27:54] ------- 22241 86296 56360  0.0

In walks Mark Reynolds (creator of logconv.pl) with Pms.sh, an enhanced Pmonitor script that adds a "growth" calculation shown in kilobytes (k).

#./pms.sh 22241 1 5
DATE   -  [TIME] ------- PID   VSZ   RSS   PCPU
05/06-[14:28:20] ------- 22241 86296 56360  0.0
05/06-[14:28:21] ------- 22241 86296 56360  0.0    growth:   0 k
05/06-[14:28:22] ------- 22241 86296 56360  0.0    growth:   0 k
05/06-[14:28:23] ------- 22241 86296 56360  0.0    growth:   0 k
05/06-[14:28:24] ------- 22241 86296 56360  0.0    growth:   0 k

With Pms.sh we could now see the growth in the raw data without plotting it all the time.

04/24-[17:55:39] ------- 12489 5310368 5270584  2.3    growth:   0 k
04/24-[17:55:40] ------- 12489 5318432 5277048  5.6    growth:   8064 k
04/24-[17:55:42] ------- 12489 5319104 5277600  2.9    growth:   672 k
04/24-[17:55:56] ------- 12489 5319104 5277600  2.5    growth:   0 k

Not only could we see the growth we could also see when memory usage dropped.

04/24-[17:56:14] ------- 12489 5319104 5277600  1.0    growth:   0 k
04/24-[17:56:15] ------- 12489 5317560 5276424  3.5    growth:   -1544 k
04/24-[17:56:17] ------- 12489 5317560 5276424  5.4    growth:   0 k
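
The growth column is conceptually just the delta between successive RSS samples.  A minimal sketch of that calculation follows; this is not the actual pms.sh code, and the pid and sample count are illustrative:

PID=22241
CHECKS=5
PREV=""
while [ $CHECKS -gt 0 ]; do
    # current resident set size in kilobytes
    RSS=`ps -o rss= -p $PID | awk '{print $1}'`
    if [ -n "$PREV" ]; then
        # growth = current RSS minus the previous sample
        echo "`date '+%m/%d-[%H:%M:%S]'` ------- $PID rss ${RSS}k growth: `expr $RSS - $PREV` k"
    fi
    PREV=$RSS
    CHECKS=`expr $CHECKS - 1`
    sleep 1
done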

I added Pms.sh 2.01 to the Dirtracer bundle starting with Dirtracer release 2.2 and now include Pms.sh 2.02 with Dirtracer 6.0.6.

Uses for Pms.sh:

1) Memory Leaks; see above.
2) Gauging high cpu problems.  Dirtracer has prstats but in some circumstances prstat is not usable or recommended.
3) Looking at the trends of both of these elements over a long period of time.

System Impact:

Negligible.

I tested pms.sh using pms.sh itself and top as gauges.

Top reported pms.sh using between 0.04% and 0.1% cpu on my Sunblade, and pms.sh itself shows its pcpu as 0.0.

23661 root       1  50    0 1104K  832K sleep   0:00  0.09% pms.sh
23817 root       1   0    0 1104K  832K sleep   0:00  0.04% pms.sh

# ps -aef | grep pms.sh
    root 23661 21979  0 14:48:38 pts/3    0:00 /bin/sh ./pms.sh 22241 1 1000000

#./pms.sh 23661 1 1000
DATE   -  [TIME] ------- PID   VSZ   RSS   PCPU
05/06-[14:48:53] ------- 23661 1104  832  0.0
05/06-[14:48:54] ------- 23661 1104  832  0.0    growth:   0 k
05/06-[14:48:55] ------- 23661 1104  832  0.0    growth:   0 k
05/06-[14:48:57] ------- 23661 1104  832  0.0    growth:   0 k

Dirtracer + Pms.sh:

By default Dirtracer will run Pms.sh with the same NUMBEROFCHECKS and INTERVAL as seen in the dirtracer.config file.  If the Dirtracer mainloop is configured to run for 5 Checks at 5 Second Intervals then Pms.sh will do the same.
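
In other words, with NUMBEROFCHECKS="5" and INTERVAL="5" in dirtracer.config, Dirtracer effectively launches something like this (the pid is illustrative, and the exact internal invocation may differ):

./pms.sh 22241 5 5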

\*   pms.sh found                        [/export/home/hrdwired/PTS/dirtracertools/pms.sh]
\* Access/Error Logging                  [left as is]
\* Audit Logging                         [left as is]
\* Iostat available.     Executing...    [success]
\* Vmstat available.     Executing...    [success]
\* Mpstat available.     Executing...    [success]
\* pms.sh (pmonitor) executed            [success]
\*   pms.sh interval(5) x checks(5)      [pms.sh run time (25 sec.)]
\*                                       
\* Entering Main Performance Gathering Loop

Config parameters used with Pms.sh:

PMONITOR_INTERVAL:  The Pmonitor Interval allows you to set a smaller Interval than Dirtracer is running with.

If Dirtracer's mainloop is configured to run at 30 second INTERVALs but you want to see Pms.sh output at 1 second Intervals, set PMONITOR_INTERVAL="1".

PMONITOR_ONLY:  The Pmonitor Only parameter disables the Mainloop's use of all its normal data capture points such as pstacks, prstats, cn=monitor and cache size searches.  This allows the user to gather a long-term Pms.sh data set without the overhead of tons of pstacks/prstats, ldap searches, etc.
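
As a sketch, a long, lightweight Pms.sh-focused capture might combine these parameters like this (the values are illustrative, and I am assuming PMONITOR_ONLY takes "1" to enable it):

NUMBEROFCHECKS="120"
INTERVAL="30"
PMONITOR_INTERVAL="1"
PMONITOR_ONLY="1"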

I hope this gives you a detailed view of Pms.sh (Pmonitor) and how it is used within Dirtracer and Sun Support.

[LT]

Tuesday May 06, 2008

Tracing ns-slapd process id's dynamically.

I was asked on the monthly Directory Collaborators call something to the effect of the following...

"It looks like Dirtracer is bound to one process id (pid) in the config file, but what if the server is restarted, does it know the new pid number dynamically"?

Currently, no.  Dirtracer was built to trace one pid at a time, and the pid must be changed in the dirtracer.config file before running.

I thought more on this and came to the conclusion that it is possible in a future version of Dirtracer.  If you have a dirtracer.config or dirtracer.config.last (the config from the last Dirtracer capture) but the process id has changed, I could backtrack using the pid file located in the <slapd instance name>/logs location to find the current pid.

This would be based on the dirtracer.config file parameter INSTANCENAME.

If the INSTANCENAME parameter was not set (it would normally be present in a dirtracer.config.last or dirtracer.config.x) then I would have to abort the run.
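
A rough sketch of that backtracking logic follows; the server root path and the pid file name are assumptions on my part, not Dirtracer code:

# assumed locations; real paths depend on the Directory Server install
INSTANCENAME="slapd-example"
SERVERROOT="/opt/sun/directory"
if [ -z "$INSTANCENAME" ]; then
    echo "INSTANCENAME not set - aborting" >&2
    exit 1
fi
PIDFILE="${SERVERROOT}/${INSTANCENAME}/logs/pid"
if [ -f "$PIDFILE" ]; then
    PID=`cat "$PIDFILE"`        # pick up the current ns-slapd pid
    echo "using current ns-slapd pid $PID"
else
    echo "no pid file found for ${INSTANCENAME} - aborting" >&2
    exit 1
fi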

While possible, it would have limitations (gotchas) that could prevent data capture.  If the directory server was hung/deadlocked and an old dirtracer.config file was used with BORKED="0" (Not Hung), then when launched, Dirtracer would attempt an ldapsearch and itself hang waiting on the search to return.

Note: because Dirtracer is a shell script and relies on ldapsearch as opposed to an internal search mechanism, it is limited to ldapsearch's capabilities and therefore does not have an option (currently) to time out a connection.  I have been thinking recently of using my timer function for this issue.

This dynamic pid check would mostly be useful for a couple scenarios I can think of.

1) Dirtracer runs to gather historical/config info via cron, without having to check/recheck the running pid vs the config file setting (in the event the ds was restarted).

2) Setting up Problem Type config files in advance without the current ns-slapd process's pid number embedded in the config file.  In this case the admin could just run ./dirtracer -f ./dirtracer.config.hang on very short notice and capture data quicker than having to edit the config file and change the pid parameter.

I will think this idea over some more but it has merit so it could make its way into the next release.

[LT]

Monday May 05, 2008

Presenting at the Directory Collaboration Meeting [follow up]

Quick note on today's Directory Collaboration meeting:

Thanks to all who attended.  I received some great questions and comments on Dirtracer...it is always good to have positive feedback on your product.

In tomorrow's blog I will discuss thoughts around one question asked in today's presentation: "Tracing ns-slapd process id's dynamically".

Thanks again all!

 [LT]


 

About

A Tech Blog about the Sun Java Systems Dirtracer Toolkit. Dirtracer and this blog are written and maintained by Lee Trujillo, an Oracle Senior Principal Support Engineer.
