Thursday Jul 30, 2009

DirTracer, Pkgapp and SLX

Hello all,

I received good news from the Big Admin admins today: the new DirTracer and Pkgapp versions are online!

The following versions are now available here on Big Admin. For information on the latest changes in this update, please see my previous blog - DT 6.0.7 Coming soon!

DirTracer 6.0.7

Pkgapp 3.2

I have also released the SLX 02 DirTracer Quickstart on SLX (Sun Learning Exchange) - "This video is a Quickstart guide to getting DirTracer6 running in a short amount of time using Configurator 6. I also step through a real example of how DirTracer runs using the config file created by Configurator."

Finally, I am in production on the latest DirTracer SLX video, "Server Impact".  Using Slamd, Cacti and DirTracer, I plan to show exactly what impact a Directory Server will see from DirTracer runs based on different run/problem types.  More on this later.

Cheers!

Lee

Thursday Jul 23, 2009

Multiple DirTracer iterations - What not to do.

Hey all,

Quick tech tip here. 

We recently came across a situation where a Customer ran DirTracer 3 separate times in the same second, one for each of the three instances on a single system.  Each config file had the correct configuration for its instance, but all three were launched at the same time.

Example: Using a wrapper script to launch all 3 in the same second.

#!/bin/sh
nohup ./dirtracer -f ./dirtracer.config.1 &
nohup ./dirtracer -f ./dirtracer.config.2 &
nohup ./dirtracer -f ./dirtracer.config.3 &

As a result, DirTracer is launched 3 times in the same second; each instance sees that there is no existing sub-databin inside /var, so only 1 sub-databin is created, i.e. /var/072309-01.

<snip>

\* Using config file                     [./dirtracer.config.1]
\* Using config file                     [./dirtracer.config.2]
\* Using config file                     [./dirtracer.config.3]
(each of the three runs then prints the same output)

\* Dirtracer mainloop will run for       [100 sec. Between 1 & 2 min.]
\* Databin parameter check 1             [success]
\* Databin parameter check 2             [success]
\* Databin Found                         [/var]
\* Databin used is                       [/var/072309-01]

<snip>

To ensure each DirTracer run uses a separate sub-databin (as intended), the following workaround can be used.

#!/bin/sh
nohup ./dirtracer -f ./dirtracer.config.1 &
sleep 1
nohup ./dirtracer -f ./dirtracer.config.2 &
sleep 1
nohup ./dirtracer -f ./dirtracer.config.3 &

Giving DirTracer a 1 second delay between executions gives it enough time to notice there is an existing sub-databin in the intended path and increment the sub-databin number.

Example:

myzone root[/var/tmp/data]# ls -l
drwxr-xr-x  11 root     root          30 Jul 23 14:48 072309-01
drwxr-xr-x  11 root     root          30 Jul 23 14:48 072309-02
drwxr-xr-x  11 root     root          30 Jul 23 14:48 072309-03
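If you have more than a few instances to trace, the same workaround can be wrapped in a small loop; a minimal sketch, assuming config files named dirtracer.config.1 through dirtracer.config.3:

#!/bin/sh
# Launch one DirTracer per config file, pausing so each run sees the
# sub-databin created by the previous run and increments the number.
for n in 1 2 3; do
    nohup ./dirtracer -f ./dirtracer.config.$n &
    sleep 1
done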

Hope this helps...

Cheers,

Lee


Tuesday Jul 21, 2009

DT 6.0.7 and Pkgapp 3.2 (Linux) Available!

Hi all,

Good news is always welcome :D

I have not only finalized testing of DirTracer (DT) 6.0.7 but have also finished coding and testing Pkgapp 3.2 for both Redhat and SuSE Linux.

See my previous blog DT 6.0.7 Coming soon! or the README.dirtracer for details on what has changed.

Available DirTracer and Pkgapp versions:

  • Solaris Sparc 8/9/10
  • Solaris x86 9/10
  • SuSE Linux 9 U4/10 U1
  • Redhat Linux 3/4/5

Remember: DirTracer comes with Pkgapp (see the dirtracertools path), but for non-Sun Directory Server customers, Pkgapp 3.2 can be downloaded separately.

Currently these versions are only available internally until BigAdmin can update their copies.  Until that happens, contact your Sun Customer Service Representative and have them send you the latest versions.

Enjoy!

Lee

Thursday Jul 16, 2009

DT 6.0.7 Coming soon!

Hi all,

Well, I hopped on the fix for the last update's bug so fast I decided to just release a new version of DirTracer (DT 6.0.7).

Here is a list of the current changes...

DirTracer:
  1. Fixed bug 6861010 DirTracer 6.0.6 incorrectly identifies the DS as 64bit if running in as 32bit on AMD64 (Solaris Only).
  2. Removed post-loop (not needed) pflag, pmap, pfiles gathering (Solaris Only).
  3. Replaced the older Pkg_app 2.7 with the new Pkgapp 3.2 (Solaris Sparc, Solaris x86 and Linux only at this time).
  4. Pkgapp lib tar file now included with the final DirTracer tar.gz.
  5. Also added a check to skip grabbing /etc/system on a Solaris 10 Zone when not available.
  6. Added links to my Sun Blog and Sun Learning Exchange videos for DirTracer to the links displayed when DT completes.

Configurator
  1. Fixed an issue when long path names are used for the server root: Configurator would fail to derive the correct server root and so could not locate pwdhash.  See bug 6511933 Dirtracer (GDD): greps on dse.ldif parameters fail when ldif wrapping is used.

As the changes only affect a couple of bugs, I will not be reworking the PDF doc guide to reflect the new version information.  This also allows me to quickly generate and release new fixes for current issues.

I have made the above changes to the Solaris Sparc, Solaris x86, Linux and HP-UX versions.  I have not yet coded a new version of Pkgapp 3.2 for HP-UX, so I will release that version a bit later.  I will update soon when I have released DirTracer internally for download.  Once I do, if you wish to get a copy, please contact your Sun Frontline Engineer.

Regards,

Lee

Wednesday Jul 15, 2009

Alert: DirTracer 6.0.6 incorrectly identifies the DS as 64bit if running in as 32bit on AMD64

Hi all,

Quick update on DirTracer.  I have known about this issue for a couple of months but haven't had time to deal with it.  I have created a bug (see below) and a fix.  I will try to get this out soon as DirTracer 6.0.7.  The fix is already in DT7 (DirTracer 7).

CR 6861010 DirTracer 6.0.6 incorrectly identifies the DS as 64bit if running in as 32bit on AMD64

Cheers all

Lee

Tuesday Jul 07, 2009

DirTracer Overview on SLX (Sun Learning Exchange)

Great news all,

I have released my first vidcast on Sun's Learning Exchange (SLX) site!

SLX 01 - DirTracer Overview (runtime 7min 32sec)

This is the first in what I hope to be a long series discussing DirTracer, Configurator and Pkgapp.  If you are unfamiliar with DirTracer, this Overview is an easy and fast way to get acquainted with Sun's #1 Directory support tool.

On July 24th I will be releasing the second vidcast, titled "DirTracer Quickstart".  This will walk you through getting DirTracer up and running in the shortest amount of time using Configurator.

I am currently in production on the third vidcast titled "Server Impact".  This is a hot topic when it comes to running DirTracer on production systems.

That's all for now folks,

Enjoy

Lee

Friday Jun 26, 2009

Running DirTracer in the Background

Hello all,

I was approached today by a Partner about running DirTracer in the background.  I believe I have been asked about this once before.

Currently, running DirTracer in the background is neither tested nor supported.  In fact, I just tried it and saw nothing in the databin path, nor in the /tmp area for the runtime log.  I have found a way, however.

If I background it or nohup it I get no output -> (nohup ./dirtracer -f ./dirtracer.config &).

But if I create a stub script that launches DirTracer with nohup and & and do the same with the stub...it works.

myzone root[/opt/dirtracer]#cat backgrounder
#!/bin/sh

nohup ./dirtracer -f ./dirtracer.config &
myzone root[/opt/dirtracer]#nohup ./backgrounder &
[1] 9159
myzone root[/opt/dirtracer]#Sending output to nohup.out

[1]    Done                          ./backgrounder

In most circumstances (99% of the time) you should be running DirTracer in the foreground because the runtime is generally short.  I can only see backgrounding the process when you use the MEM_LEAK_TRACKING option or if you have such a long DirTracer run that you will have to close down your terminals and leave.
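If you do background a long run this way (for example with MEM_LEAK_TRACKING), it can also help to redirect the stub's output to a known file rather than the default nohup.out; a minimal sketch (the log path is just an example):

#!/bin/sh
# backgrounder: launch DirTracer detached, keeping its console output
# in a known location instead of ./nohup.out
nohup ./dirtracer -f ./dirtracer.config > /var/tmp/dirtracer-console.log 2>&1 &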

Thanks to Brian for this great question.


Lee

Wednesday Jul 09, 2008

Suggestion Box

Hello all,

Just a quick update on the Dirtracer Blog front.  I haven't forgotten the blog nor those who are waiting for updates; I have, however, been hit with a couple of extremely tough support cases that took a ton of my time.

I decided to set up a Suggestion Box page where readers can suggest topics for me to cover in this blog.

Making a Suggestion is quick and easy.  You simply leave a Comment on this page.

Comment how-to:

  1. Click this Comments link -> Comments
  2. Scroll down to the Comments: area
  3. Type in your Name, Email and the Suggestion in the Comment area
  4. Fill in the anti-spam answer and hit Post

Simple.  I will evaluate the suggestion and if it's a valid topic I will post a blog on the topic on Fridays.

Thanks all and I hope to see many suggestions.

LT

Tuesday Jun 10, 2008

Dirtracer + Core/Gcore = Separate upload for Pkg_app data.

Hello all,

Every now and then I run into an issue where a customer used Dirtracer to grab a Gcore (or two) or a Core file from a crash.  Dirtracer users are forgetting to upload the separate pkg_app tar.gz file when one is generated.

It is not a problem per-say as a it is users missing a critical message that is displayed at the end of a Dirtracer run when a successful Pkg_app file has been created.

Let's take a look at the messages I am talking about.

Suppose you define one of the following parameters in the dirtracer.config file:

CORE_LOCATION
GRAB_GCORE


Let's use GRAB_GCORE for a quick test; let's grab one Gcore.

GRAB_GCORE="1"


We then run Dirtracer as normal.

#./dirtracer -f ./dirtracer.config

--------------------------------------------------------------------------------
Sun Microsystems dirtracer 6.0.6 Solaris Sparc                        06/09/2008
--------------------------------------------------------------------------------


Dirtracer runs through its Main Loop and grabs one Gcore.

\* Entering Main Performance Gathering Loop
\*                                      
\* Loop 0 - 080609-131950               
\* Loop 1 - 080609-131956               
\* Loop 2 - 080609-132001               
\* Loop 3 - 080609-132007               
\* Loop 4 - 080609-132012               
\* togo[5sec]-timer[5]
\*                                      
\* Grabbing gcore 080609-132018          [success]


Because Dirtracer has noted the presence of the grab gcore parameter, it will automatically run Pkg_app to gather the OS and DS libraries needed for debugging.

By default Dirtracer waits 120 seconds for Pkg_app to complete.  If for some reason the Core/Gcore's header has an overrun (contents can't be determined normally) or has other errors (for example, it is truncated), Dirtracer will kill the Pkg_app gathering phase.  Users should check the Pkg_app tar contents to see whether all libs were gathered and take appropriate action.
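One quick way to review the contents is to list the archive; a minimal sketch, using the example pkg_app file name from the end messages shown later (yours will differ):

#!/bin/sh
# List what Pkg_app collected; the archive name below is only an example.
zcat pkg_app834d2699-mysystem-080609-132051.tar.Z | tar tvf - | more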

\* Packaging files                      
\*   Preparing files - pkg_app           [waiting 120 sec.]      [success]


After the main Dirtracer tar file is created, Dirtracer displays the Operations Complete message and informs the user which files need to be uploaded to Sun.

As you can see below, Dirtracer informs you there are both a Dirtracer file and a "separate" Pkg_app file that need to be uploaded.  Users should pay special attention to the End Messages displayed when Dirtracer finishes.  Most users miss the second file (Pkg_app), which requires a separate upload.


Operations Complete
--------------------------------------------------------------------------------
1) Dirtracer capture data located in directory [ /var/tmp/data/060908-04 ]

Upload "only" this file to your supportfiles.sun.com cores directory at Sun

        [ dirtracer-834d2699-mysystem-080609-132018.tar.gz ]

2) pkg_app has captured system libs as the result of a gcore or core being found
                As this file may be large, please ftp this separately to Sun
                pkg_app file located in /var/tmp/data/060908-04/pkg_app_data

                [pkg_app834d2699-mysystem-080609-132051.tar.Z ]


You might ask why Pkg_app's tar is not simply included with the main Dirtracer tar file.  The answer is simple...size.  Dirtracer captures by themselves can be big if a customer does not archive their log files on a heavily used system, and the final size can be massive if a Core/Gcore is captured.  It simply wasn't an option to bundle a big Pkg_app file with an already large Dirtracer file.

[LT]

Wednesday Jun 04, 2008

Using Dirtracer for Proactive Analysis

BACKGROUND:

One of our Customers was seeing a performance issue with their Siteminder-enabled applications each day around 4:30pm.  While the Customer was using Dirtracer to capture event data, the time it took for Administration staff to see the alerts made capturing the event difficult.  The event, when it occurred, took approximately 60 seconds from start to finish; by the time the Dirtracers were started, the event had already ended 9-10 minutes earlier.

The Customer set up their BMC Control M monitoring facility so they could automate a 4-way Dirtracer capture the moment the event was noticed.

NOTE: they did not see any alerts/issues with the Directory or Siteminder Servers at this time but they launched Dirtracer via BMC Control M as a precaution only.

With this said, I analyzed the Dirtracers thinking there was a problem.

SYSTEM INFO:

1) The Customer captured data from 4 systems at the same time: 01, 02, 03 and 04.

01    #dirtracer Started:     080422-174012
02    #dirtracer Started:     080422-174013
03    #dirtracer Started:     080422-174127
04    #dirtracer Started:     080422-174126


2) Connection Load.  Not all servers are equal here; 01 and 04 are taking much less load than 02 and 03.

01    91 connections average
02    185 connections constant
03    171 connections constant
04    124 connections average


Summary Connection Snippets: taken from cn=monitor searches.

               Conn      Ttl           Curr      Read    ReqQue     OpsInitiated vs.
File            Peak  Th  Cons          Cons      Waitr   BakLog     OpsCompleted
====================================================================================
01:
[080422-175815] 1984   24  525136       101       0       0          332685-332674=11
[080422-175852] 1984   24  525143       102       0       0          332906-332899=7
[080422-175928] 1984   24  525150       103       0       0          333155-333140=15
[080422-180005] 1984   24  525159       87        0       0          331583-331576=7
[080422-180043] 1984   24  525167       82        0       0          331188-331186=2

02: Summary Snippet
[080422-174222] 1984   24  502241       185       0       0          2155474-2155468=6
[080422-174227] 1984   24  502245       185       0       0          2156076-2156073=3
[080422-174232] 1984   24  502249       185       0       0          2156687-2156680=7
[080422-174238] 1984   24  502253       185       2       2          2157573-2157565=8
[080422-174243] 1984   24  502259       185       1       0          2158588-2158583=5

03:
[080422-174513] 2256   24  503749       171       0       0          722384-722378=6
[080422-174518] 2256   24  503755       171       0       0          722607-722603=4
[080422-174522] 2256   24  503759       171       1       0          722890-722886=4
[080422-174527] 2256   24  503762       171       0       0          723230-723226=4
[080422-174531] 2256   24  503767       173       0       1          723506-723498=8

04:
[080422-174633] 8111   24  497842       123       1       0          383898-383884=14
[080422-174644] 8111   24  497848       123       0       2          384516-384511=5
[080422-174655] 8111   24  497855       124       0       1          384996-384993=3
[080422-174706] 8111   24  497860       124       0       1          385318-385313=5
[080422-174717] 8111   24  497864       124       0       0          385745-385739=6


3) The CPU % of each server mostly mirrors the load differences seen above; 01 and 04 are taking much less load than 02 and 03.

01    1% average busy
02    16% average busy
03    20% average busy
04    8% average busy

CPU (pcpu) taken from the pms.sh (ps) file.

01:
04/22-[17:41:49] ------- 12489 5370536 5331048  1.1    growth:   0 k

02:
04/22-[17:41:39] ------- 6236 2216344 2193576 15.3    growth:   8 k

03:
04/22-[17:42:41] ------- 7930 2214432 2192016 21.3    growth:   -40 k

04:
04/22-[17:42:48] ------- 10554 3717880 3695216  9.1    growth:   -360 k


4) Iostats (disk performance).  Since the servers are using Veritas File System, I cannot tell which disk slices the DS runs on versus which disks show higher-than-normal iostats.

The DS is installed in the following location for all servers:

- /dev/vx/dsk/app/dsserver 1048576   18603  965640     2%    /opt/servers/slapd-master

01    Disks peak to 61% busy     (1 time)   GOOD
02    Disks peak to 58% busy     (1 time)   GOOD
03    Disks peak to 100% busy    (2 times)  FAIR
04    Disks peak to 100% busy    (8 times)  BAD

As a result of the 100% disk-busy periods, the average disk I/O service time and service wait times increase.

01    avg service time 8.8/ms - wait service 0/ms
02    avg service time 35/ms - wait service 0/ms
03    avg service time 199.5/ms - wait service 0/ms
04    avg service time 105.6/ms - wait service 373.5/ms
01
Tue Apr 22 17:41:07 2008
                   extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   0.0    2.0    0.0    0.0  0.0  0.0    0.0    3.6   0   0 c1
   1.0    4.0    0.0    0.1  0.0  0.0    0.0    8.7   0   0 c2
  61.0   34.0    0.6    1.4  0.0  0.9    0.0    9.9   0  10 c4
 102.0   47.0    1.1    2.0  0.0  1.3    0.0    8.7   0  10 c5
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.5   0   0 c4t0d1
  61.0   32.0    0.6    1.4  0.0  0.9    0.0   10.1   0  60 c4t0d2
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    0.4   0   0 c4t0d5
 102.0   44.0    1.1    1.9  0.0  1.3    0.0    8.8   0  61 c5t1d2
   0.0    3.0    0.0    0.1  0.0  0.0    0.0    0.7   0   0 c5t1d5
   0.0    2.0    0.0    0.0  0.0  0.0    0.0    3.6   0   1 c1t2d0
   0.0    4.0    0.0    0.1  0.0  0.0    0.0    8.5   0   3 c2t2d0
   1.0    0.0    0.0    0.0  0.0  0.0    0.0    9.5   0   1 c2t0d0

02
Tue Apr 22 17:40:37 2008
                   extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   1.0   23.0    0.0    1.2  0.0  1.0    0.0   41.0   0   2 c1
   0.0   19.0    0.0    1.1  0.0  1.2    0.0   61.1   0   2 c2
   3.0  701.0    0.0    5.8  0.0 24.6    0.0   34.9   0  20 c4
   1.0  648.0    0.0    5.3  0.0 23.5    0.0   36.3   0  20 c5
   0.0    4.0    0.0    0.0  0.0  0.0    0.0    9.6   0   1 c4t0d0
   1.0  351.0    0.0    2.9  0.0 12.4    0.0   35.1   0  58 c4t0d2
   2.0  290.0    0.0    2.3  0.0  9.7    0.0   33.3   0  50 c4t0d3
   0.0   56.0    0.0    0.5  0.0  2.5    0.0   43.9   0  10 c4t0d5
   0.0    4.0    0.0    0.1  0.0  0.0    0.0    7.7   0   1 c5t1d0
   0.0  323.0    0.0    2.6  0.0 11.7    0.0   36.4   0  57 c5t1d2
   1.0  269.0    0.0    2.2  0.0  9.1    0.0   33.8   0  54 c5t1d3
   0.0   52.0    0.0    0.4  0.0  2.6    0.0   50.2   0   9 c5t1d5
   1.0    0.0    0.0    0.0  0.0  0.0    0.0    5.4   0   1 c1t8d0
   0.0    4.0    0.0    0.1  0.0  0.0    0.0    5.9   0   2 c1t2d0
   0.0    9.0    0.0    0.6  0.0  0.6    0.0   61.6   0  13 c1t12d0
   0.0   10.0    0.0    0.6  0.0  0.4    0.0   40.0   0   9 c1t9d0
   0.0    8.0    0.0    0.5  0.0  0.6    0.0   76.2   0  13 c2t12d0
   0.0   11.0    0.0    0.6  0.0  0.6    0.0   50.1   0  12 c2t9d0

03
Tue Apr 22 17:42:06 2008
                   extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   0.0   34.0    0.0    1.5  0.0  1.7    0.0   49.7   0   3 c1
   0.0   33.0    0.0    1.7  0.0  2.1    0.0   63.3   0   3 c2
   0.0  556.0    0.0    4.5  0.0 89.8    0.0  161.5   0  23 c4
   2.0  279.0    0.0    3.2  0.0 12.2    0.0   43.3   0  15 c5
   0.0    9.0    0.0    0.1  0.0  0.0    0.0    5.3   0   3 c4t8d0
   0.0  103.0    0.0    0.9  0.0  1.1    0.0   11.1   0  38 c4t10d0
   0.0  444.0    0.0    3.5  0.0 88.6    0.0  199.5   0 100 c4t11d0
   2.0  249.0    0.0    2.0  0.0 11.0    0.0   43.8   0  76 c5t10d0
   0.0    9.0    0.0    0.1  0.0  0.0    0.0    4.1   0   3 c5t11d0
   0.0   21.0    0.0    1.1  0.0  1.1    0.0   53.6   0  13 c5t13d0
   0.0   10.0    0.0    0.6  0.0  0.7    0.0   73.7   0  13 c2t11d0
   0.0    8.0    0.0    0.4  0.0  0.4    0.0   52.3   0  11 c2t9d0
   0.0   15.0    0.0    0.7  0.0  0.9    0.0   62.3   0  12 c2t12d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    6.8   0   1 c1t2d0
   0.0   11.0    0.0    0.6  0.0  0.5    0.0   44.1   0  13 c1t11d0
   0.0   14.0    0.0    0.6  0.0  0.8    0.0   55.2   0  12 c1t9d0
   0.0    7.0    0.0    0.3  0.0  0.4    0.0   59.3   0  12 c1t12d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0   12.0   0   1 c1t3d0

04
Tue Apr 22 17:42:23 2008
                   extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   0.0  299.1    0.0    2.3 100.8 29.8  337.1   99.6  16  17 c1
   0.0  349.1    0.0    2.7 107.5 29.6  308.0   84.9  16  17 c2
   0.0    5.0    0.0    0.1  0.0  0.0    0.0    4.5   0   0 c4
   0.0   48.0    0.0    2.6  0.0  0.6    0.0   12.5   0   2 c5
   0.0    5.0    0.0    0.1  0.0  0.0    0.0    4.5   0   2 c4t8d0
   0.0    2.0    0.0    0.0  0.0  0.0    0.0   12.4   0   2 c5t8d0
   0.0   22.0    0.0    1.2  0.0  0.3    0.0   12.2   0   4 c5t9d0
   0.0   22.0    0.0    1.2  0.0  0.3    0.0   13.7   0   4 c5t10d0
   0.0    1.0    0.0    0.1  0.0  0.0    0.0    4.7   0   0 c5t11d0
   0.0    1.0    0.0    0.1  0.0  0.0    0.0    1.8   0   0 c5t12d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0   12.5   0   1 c2t11d0
   0.0  181.1    0.0    1.4 51.9 14.6  286.7   80.7  92 100 c2t0d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0   11.5   0   1 c2t13d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    4.1   0   0 c2t4d0
   0.0  164.0    0.0    1.3 55.6 15.0  339.0   91.4 100 100 c2t10d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    7.4   0   1 c2t5d0
   0.0    4.0    0.0    0.1  0.0  0.0    0.0    6.5   0   3 c1t11d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    7.3   0   1 c1t13d0
   0.0  142.0    0.0    1.1 53.0 15.0  373.2  105.6 100 100 c1t0d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0   12.3   0   1 c1t4d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    6.4   0   1 c1t3d0
   0.0  148.0    0.0    1.1 47.8 14.7  322.9   99.5  92 100 c1t10d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    7.7   0   1 c1t2d0
   0.0    1.0    0.0    0.0  0.0  0.0    0.0    9.5   0   1 c1t5d0
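For reference, a quick way to pull the 100%-busy samples out of iostat output like the above (a sketch; the file name iostat.out is just an example of a captured iostat file):

#!/bin/sh
# Print only the device lines where %b (the second-to-last field) reached 100.
awk '$(NF-1) == 100' iostat.out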


5) Other ns-slapd instances running on these servers.  There are 5 total directory server instances on each of the 4 servers, so I cannot see whether the other ns-slapd instances are loading the disks to the levels seen above.  Significant load on the unknown instances could impact the disks used by the known instances and affect performance.

6) Dirtracer itself is not able to complete a normal MainLoop check in the time configured; Dirtracer was configured to loop through each check in the MainLoop at 1 second INTERVALs.

Each loop executes a pstack, prstat, netstat, cn=monitor search and 1 x backend cache info search (1 backend in this case).  Each of these data points is very lightweight and completes sub-second.

On average, each server completes a Loop in 5-41 seconds.  It is unknown why these servers cannot complete the loops as fast as configured.

See the following examples: each Loop shows the date/time stamp when the loop starts...note how many seconds pass between loops (it should have been 1 second between loops).

01
\* Loop 0 - 080422-174027             
\* Loop 1 - 080422-174106             
\* Loop 2 - 080422-174148             
\* Loop 3 - 080422-174228            
\* Loop 4 - 080422-174308           

02
\* Loop 0 - 080422-174030             
\* Loop 1 - 080422-174036             
\* Loop 2 - 080422-174041            
\* Loop 3 - 080422-174045             
\* Loop 4 - 080422-174050           

03
\* Loop 0 - 080422-174143             
\* Loop 1 - 080422-174148            
\* Loop 2 - 080422-174153            
\* Loop 3 - 080422-174158             
\* Loop 4 - 080422-174203

04
\* Loop 0 - 080422-174139             
\* Loop 1 - 080422-174150             
\* Loop 2 - 080422-174201             
\* Loop 3 - 080422-174213             
\* Loop 4 - 080422-174225 

LDAP INFO:

1) A quick check of the Dirtracer search etimes shows each is sub-second.

On 04 only a few were 0.1 to 0.2 seconds.  The following were taken from the access log snippets captured.

root[/customerdata/6xxxxxxx/Apr22/04-dirtracernxcspld4-080422/ds-logs/working]#cat conn-49\* | awk '{print $11}' | grep etime | grep -v etime=0\\.0
etime=0.246540
etime=0.118270
etime=0.110770
etime=0.143630

2) Using the logs I found there was a Siteminder Policy Server configured to use an improper search base (dc=com).  For this data set only 01 doesn't have any Siteminder searches using the incorrect base; 02, 03 and 04 all do, with 04 having the most.

01:    None
02:    4604            dc=com searches
03:    5457            dc=com searches
04:    13749          dc=com searches
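For reference, counts like these can be pulled directly from the captured access logs; a minimal sketch, assuming the standard access log SRCH format and a log file named access (yours will differ):

#!/bin/sh
# Count SRCH operations that used the incorrect base dc=com.
grep -c 'SRCH base="dc=com"' access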


I was able to track down the IP on only 1 server (04), as the connections on the others were already running.  I need to check old logs to see if there are other IP addresses these are coming from.

10.10.10.1

[22/Apr/2008:17:04:52 -0400] conn=496746 op=-1 msgId=-1 - fd=79 slot=79 LDAP connection from 10.10.10.1 to 10.10.10.181
[22/Apr/2008:17:04:52 -0400] conn=496746 op=0 msgId=1 - BIND dn="uid=siteminder,ou=service,dc=xxxxxx,dc=com" method=128 version=3
[22/Apr/2008:17:04:52 -0400] conn=496746 op=0 msgId=1 - RESULT err=0 tag=97 nentries=0 etime=0.000910 dn="uid=siteminder,ou=service,dc=xxxxxx,dc=com"
...
[22/Apr/2008:17:04:52 -0400] conn=496746 op=3 msgId=4 - SRCH base="dc=com" scope=2 filter="(&(uniqueMember=uid=userremoved,ou=People,dc=xxxxxx,dc=com)(objectClass=groupOfUniqueNames\*))" attrs="cn"
[22/Apr/2008:17:04:52 -0400] conn=496746 op=3 msgId=4 - RESULT err=32 tag=101 nentries=0 etime=0.000540
...
[22/Apr/2008:17:04:52 -0400] conn=496746 op=5 msgId=6 - SRCH base="dc=com" scope=2 filter="(&(uniqueMember=uid=userremoved,ou=People,dc=xxxxxx,dc=com)(objectClass=groupOfUniqueNames\*))" attrs="cn"
[22/Apr/2008:17:04:52 -0400] conn=496746 op=5 msgId=6 - RESULT err=32 tag=101 nentries=0 etime=0.000480
...
[22/Apr/2008:17:04:52 -0400] conn=496746 op=7 msgId=8 - SRCH base="dc=com" scope=2 filter="(&(uniqueMember=uid==userremoved,,ou=People,dc==xxxxxx,,dc=com)(objectClass=groupOfUniqueNames\*))" attrs="cn"
[22/Apr/2008:17:04:52 -0400] conn=496746 op=7 msgId=8 - RESULT err=32 tag=101 nentries=0 etime=0.000460
...


The above connection (conn=496746) alone contains 1195 searches using the incorrect base.

root[/customerdata/6xxxxxx/Apr22/04-dirtracernxcspld4-080422/ds-logs]#gc err=32 conn-496746.log
1195
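(gc above appears to be a local alias or helper; with standard tools the equivalent count would be along these lines.)

grep -c 'err=32' conn-496746.log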


3) High etimes / high etime servers.   Overall server 04 is the biggest high etime server, followed by 01.

Top 5 High Etimes

01
etime=413.555070    1      
etime=300.814310 notes=F    1      
etime=115.912590    1      
etime=39.566400    1     
etime=39.553510 notes=F    1       

02
etime=8.719340    1      
etime=6.210240    1      
etime=6.101450    1      
etime=5.814920    1      
etime=5.802740    1    

03
etime=7.325220    1      
etime=5.766010    1      
etime=5.558690    1      
etime=5.186940    1      
etime=5.147010    1

04
etime=1876.043450 notes=U    1      
etime=1609.753640 notes=U    1      
etime=1275.956850 notes=U    1      
etime=1251.870940 notes=U    1 

We now know 04's highest offending etimes come from 4 cn=\* searches from a certain application at IP 10.10.10.225.

nentries=598115                 4
[22/Apr/2008:16:48:50 -0400] conn=496301 op=-1 msgId=-1 - fd=47 slot=47 LDAP connection from 10.10.10.225 to 10.10.10.181
[22/Apr/2008:16:48:50 -0400] conn=496301 op=0 msgId=1 - BIND dn="uid=appl100,ou=service,dc=xxxxxx,dc=com" method=128 version=3
[22/Apr/2008:16:48:50 -0400] conn=496301 op=0 msgId=1 - RESULT err=0 tag=97 nentries=0 etime=0.002810 dn="uid=appl100,ou=service,dc=xxxxxx,dc=com"
[22/Apr/2008:16:48:50 -0400] conn=496301 op=1 msgId=2 - SRCH base="ou=people,dc=xxxxxx
,dc=com" scope=1 filter="(cn=\*)" attrs="uid givenName sn mail region modifiers uid modifyTimestamp newOSUser OSmanaged OSstatus"
[22/Apr/2008:17:15:39 -0400] conn=496301 op=1 msgId=2 - RESULT err=0 tag=101 nentries=598115 etime=1609.753640 notes=U
[22/Apr/2008:17:15:40 -0400] conn=496301 op=2 msgId=3 - UNBIND
[22/Apr/2008:17:15:40 -0400] conn=496301 op=2 msgId=-1 - closing - U1
[22/Apr/2008:17:15:40 -0400] conn=496301 op=-1 msgId=-1 - closed.
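For reference, a top-etime tally like the ones above can be pulled from an access log with a pipeline along these lines (a sketch; it assumes the standard etime= field and a log file named access):

#!/bin/sh
# Print the five largest etime values seen in the access log, with counts.
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^etime=/) print $i }' access |
    sort -t= -k2 -rn | uniq -c | head -5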


CONFIGURATION INFO:

1) The Dirtracer captures highlighted differences in cache settings.  Servers 02 and 03 each have roughly 2 GB of total cache (1 GB DB cache + 1 GB entry cache), whereas 01 has about 3.3 GB and 04 about 2.1 GB.

01
       Total DB Cache:     1,048,576 Kb  (1073741824 bytes)
       Total Entry Cache:  2,441,406 Kb  (2500000000 bytes)
       --------------------------------------------------
       Total Cache Size:   3 Gb,  3,408 Mb,  3,489,982 Kb,  (3573741824 bytes)

02
       Total DB Cache:     1,048,576 Kb  (1073741824 bytes)
       Total Entry Cache:  1,024,000 Kb  (1048576000 bytes)
       --------------------------------------------------
       Total Cache Size:   1 Gb,  2,024 Mb,  2,072,576 Kb,  (2122317824 bytes)

03
       Total DB Cache:     1,048,576 Kb  (1073741824 bytes)
       Total Entry Cache:  1,024,000 Kb  (1048576000 bytes)
       --------------------------------------------------
       Total Cache Size:   1 Gb,  2,024 Mb,  2,072,576 Kb,  (2122317824 bytes)

04
       Total DB Cache:     1,048,576 Kb  (1073741824 bytes)
       Total Entry Cache:  1,126,400 Kb  (1153433600 bytes)
       --------------------------------------------------
       Total Cache Size:   2 Gb,  2,124 Mb,  2,174,976 Kb,  (2227175424 bytes)



2) None of the four servers uses a tmpfs (tmp file system) for its db home; this affects performance on Unix systems.

"UNIX only. Used to fix a situation on UNIX platforms where the operating system endlessly flushes pages. This flushing can be so excessive that performance of the entire system is severely degraded."

   See: nsslapd-db-home-directory http://docs.sun.com/source/816-6699-10/plugattr.html

Each is using the main Veritas partition for the db home.

   db-home-directory: /opt/servers/slapd-master/db
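For reference, nsslapd-db-home-directory lives on the ldbm database configuration entry and can be pointed at a tmpfs-backed path with ldapmodify; a minimal sketch with hypothetical host, credentials and path (this attribute typically requires a Directory Server restart to take effect):

#!/bin/sh
# Hypothetical example: move the db home onto tmpfs (/tmp) via ldapmodify.
ldapmodify -h myhost -p 389 -D "cn=Directory Manager" -w "$DM_PASSWORD" <<EOF
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-db-home-directory
nsslapd-db-home-directory: /tmp/slapd-master-db
EOF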

3) Each server uses VXFS (Veritas File System), which has known fsync (file sync) problems when flushing Dirty Pages from Memory/Cache to the DBs.  This happens mostly during the transaction checkpoint (txn_checkpoint).  The resulting fsync problem causes an overall slowdown of the DS process, which will affect etimes on all other operations while the checkpoint occurs.  I have seen this on Veritas 4.4 and the customer is running an even older version.

   NAME:  VERITAS File System
   VENDOR:  VERITAS Software
   PSTAMP:  VERITAS-3.4FS-2003-01-29PID=110435-08

The problem can be seen using the cn=monitor search in the DS to show read waiters and the request queue backlog; see the cn=monitor summary snippet above.  Anything more than 0 for either is not optimal.  Each of the four servers has between 1 and 3 read waiters or request queue backlogs at any given time, though generally each is running with 0 most of the time.

4) I asked a while ago whether the customer had done the following.

  • Upgraded from Directory Server Patch 4 - YES
  • Used the ActivateFix script in Patch 6 - NO


I saw these errors in the logs from servers 01, 03 and 04.

[22/Apr/2008:17:47:00 -0400] - WARNING<10307> - Replication  - conn=-1 op=-1 msgId=-1 - Configuration warning  Tombstone purging internal search is unindexed in replica dc=xxxxxx,dc=com


It is possible they have hit bug 6175472 and are seeing some result of the tombstone purging performance hit.  The allids threshold is only 4000 on these servers.  Tombstones are DELeted entries waiting to be purged.  If the number of deleted entries has gone above 4k, the objectclass index becomes allids and the searches for tombstone entries are unindexed and take a long time.

6175472 MMR: High performance degradation when purging tombstones

I verified the dse.ldif files to see if the fix was enabled.  Part of the fix places an nsMatchingRule OID in the index entries, as seen in the dse.ldif.  The Matching Rule is NOT in any index of any of the servers' dse.ldif files:

nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150


01:
root[/customerdata/6xxxxxxx/Apr22/01-dirtracernxcspld1-080422/ds-config-data]#grep -ic 'nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150' dse.ldif
0

02:
root[/customerdata/6xxxxxxx/Apr22/02-dirtracernxcspld2-080422/ds-config-data]#grep -ic 'nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150' dse.ldif
0

03
root[/customerdata/6xxxxxxx/Apr22/03-dirtracernxcspld3-080422/ds-config-data]#grep -ic 'nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150' dse.ldif
0

04
root[/customerdata/6xxxxxxx/Apr22/04-dirtracernxcspld4-080422/ds-config-data]#grep -ic 'nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150' dse.ldif
0

Even though the above data was captured when there were no Directory or Siteminder server problems, the Dirtracer data captured helped to highlight the following:

  1. problematic/incorrect directory server configuration/architecture
  2. possible load balancing problems, i.e. sending more connections to two servers as opposed to being completely load balanced
  3. disk capacity/IO peak issues, possibly from other heavily loaded ns-slapd instances
  4. incorrect application configuration
  5. overall etimes of the servers

[LT]

Tuesday May 27, 2008

Pkg_app in a Clustered Environment

Hello all,

I was recently contacted by a Sun Customer Advocate regarding Pkg_app, Dirtracer and Clustered environments.

First, some up-front info on Pkg_app.  Pkg_app was originally created by Rick Zagorski, and I took over its development in 2006.  I took what Rick had coded and added to it specifically (at the time) to help with Directory Server core/gcore files.  This was released as pkg_app 2.0 and was a mash of Rick's code with some additions of mine.  As you know, I created Dirtracer (Stracer) in 2004 and continue its development to help customers gather data on their Directory Server problems.

The Customer was having issues with Pkg_app and Dirtracer when they attempted to use them in a Clustered Environment.  Even though the customer was running an old version of Pkg_app, Pkg_app and Dirtracer are "not" tested nor certified for use in a Clustered Environment.  I have only had one prior request to make pkg_app/dirtracer Cluster aware, so it has not been at the top of the revision list.

I will be revamping the Pkg_app code to take into account all libraries etc. that Pkgcore gets for use with Java Cores and will look into the Cluster issue.

More news on this later.

[LT]

Wednesday May 21, 2008

Checks, Intervals and Loops Oh My - Part 3

This is my final blog on the Number of Checks and Interval settings within Dirtracer.

Lets discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and  two if its governing parameters.

        NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters are totally dependent on the problem type you are gathering for.

The following are the data points normally gathered in the main loop; all of these are of course configurable, whether all, some or none are gathered.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an Admin can set the Number of Checks or "data points" to gather.  Let's look at some example problem types versus the settings you may want to capture them with.

The Number of Checks and Interval parameters govern how many loops run and how long the Main Loop runs for; however, they are also used when executing the following commands.  These commands are launched before the Main Loop is engaged and run in the background while the Main Loop is active (see the sketch after this list).

  • Iostats
  • Vmstats
  • Mpstats
  • Pms.sh
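A minimal sketch of how collectors like these run for the duration of the capture (an illustration only, not Dirtracer's actual code; it assumes the INTERVAL and NUMBEROFCHECKS values from dirtracer.config):

#!/bin/sh
# Illustration: start the system stat collectors in the background for
# the same interval x checks window the Main Loop will cover.
INTERVAL=5
NUMBEROFCHECKS=5
iostat -xn $INTERVAL $NUMBEROFCHECKS > iostat.out &
vmstat     $INTERVAL $NUMBEROFCHECKS > vmstat.out &
mpstat     $INTERVAL $NUMBEROFCHECKS > mpstat.out &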

See the man pages for iostat, vmstat and mpstat for more information on these commands.

Please see my previous blog on pms.sh (Pmonitor) for more information regarding this script.


[LT]

Monday May 19, 2008

Checks, Intervals and Loops Oh My - Part 2

Let's discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and two of its governing parameters.

    NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters are totally dependent on the problem type you are gathering for.

The following are the data points normally gathered in the main loop; all of these are of course configurable, whether all, some or none are gathered.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an Admin can set the Number of Checks or "data points" to gather.  Let's look at some example problem types versus the settings you may want to capture them with.

Example 4 - Crashing


The crashing option can be used when you know or suspect a crash will happen based on time or circumstances;  See Configurator option 4.  Crash Tracking with Dirtracer is as simple as setting the runtime for a long period of time and enabling the following options.

    CRASH_TRACK="1"
    PMONITOR_ONLY="1"

The point of Crash Tracking is to place Dirtracer into a "wait and see" mode.  It will poll for the process id, note when the crash happens, ask the user whether the crash produced a core and where the core is located, and then gather the remaining data like logs etc.

The Number of Checks and Interval settings need to be configured to allow Dirtracer enough time for a crash to happen; this can vary based on the problem.  All of the following are valid; configure your settings so the run lasts past the time you expect the crash to occur.

Set Number of Checks to 1000 and Interval to 30.  Total Runtime 30,000 Seconds.
Set Number of Checks to 100 and Interval to 60.  Total Runtime 6,000 Seconds.
Set Number of Checks to 60 and Interval to 60.  Total Runtime 3,600 Seconds.

\* Dirtracer mainloop will run for       [3600 sec. About 60 min.]
...
\* Entering Main Performance Gathering Loop
\*                                       
\* Loop 0 - 080519-095153                [pms.sh only]
\* Loop 1 - 080519-095253                [pms.sh only]
\* Loop 2 - 080519-095353                [pms.sh only]
\* Loop 3 - 080519-095454                [pms.sh only]
\* Loop 4 - 080519-095554                [pms.sh only]
\* Loop 5 - 080519-095654                [pms.sh only]
\* Loop 6 - 080519-095754                [pms.sh only]
\* Loop 7 - 080519-095854                [pms.sh only]
\* Loop 8 - 080519-095954                [pms.sh only]
\* Loop 9 - 080519-100054                [pms.sh only]
\* Loop 10 - 080519-100155               [pms.sh only]
\* Loop 11 - 080519-100255               [pms.sh only]
\* Loop 12 - 080519-100355               [pms.sh only]
\* Loop 13 - 080519-100455               [pms.sh only]
\* Loop 14 - 080519-100555               [pms.sh only]
\* Loop 15 - 080519-100655               [pms.sh only]
\* Loop 16 - 080519-100755               [pms.sh only]
\* Loop 17 - 080519-100855               [pms.sh only]
\* Loop 18 - 080519-100956               [pms.sh only]
\* Loop 19 - 080519-101056               [pms.sh only]
\* Loop 20 - 080519-101156               [pms.sh only]
\* Loop 21 - 080519-101256               [pms.sh only]
\* Loop 22 - 080519-101356               [pms.sh only]
\* Loop 23 - 080519-101456               [pms.sh only]
\* Loop 24 - 080519-101556               [pms.sh only]
\* Loop 25 - 080519-101656               [pms.sh only]
\* Loop 26 - 080519-101757               [pms.sh only]
expr: syntax errormer[47]
\*                                       
\*       ALERT - The ns-slapd process has died!
\*                                       
\* [ALERT 2] The ns-slapd process has died!
\*                                       
\*
\*                                       
\* Locating crash core file              [failed]
\*                                       
\* Did the server produce a core file (y/n)? y
\*                                       
\* What is the full path to the core?    
\*                                       
\* Example: /var/cores/core.10008 :      /var/tmp/cores/core.10008



Example 5 - Memory Leak


Using the Memory Leak option will allow you to gather up-front Pms.sh (Pmonitor) data over a long period of time to show the progression of memory usage of the slapd process.  Setting Memory Leak Tracking automatically sets the Interval to 1800, but the user needs to set the Number of Checks accordingly.

Enabling MEM_LEAK_TRACKING="1" sets the following automatically.

    INTERVAL="1800"
    PMONITOR_INTERVAL="5"

Set the Number of Checks to allow Dirtracer to run for the time you feel it takes the ns-slapd process to show a substantial amount of leakage.  Let's say you are facing a leak that manifests itself in 24 hours.  You could set the Number of Checks to 50 and Dirtracer will capture for 1500 minutes (approx. 25 hours).  Use Configurator to set up the dirtracer.config file, or use the simple formula below.

    1800 (INTERVAL) \* NUMBEROFCHECKS = total run time (seconds)
                OR
    (minutes to leak \* 60) / 1800 (INTERVAL) = NUMBEROFCHECKS
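As a quick sanity check of the 24 hour example above (the values come from the example, not a recommendation):

#!/bin/sh
# 24 hours of leak tracking at the automatic 1800-second interval:
# 24 * 3600 / 1800 = 48 checks; setting 50 adds a little margin (~25 hours).
expr 24 \* 3600 / 1800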

\* Dirtracer mainloop will run for       [90000 sec. About 1500 min.]
...
\* Mem Leak tracking                     [on]
...
\* Entering Main Performance Gathering Loop
\*                                       
\* Loop 0 - 080519-105751                
\* togo[90000sec]-timer[117]
...


I have eliminated the full run time information as it would be too long for a Blog post.


Example 6 - Server Down


Use of the server down option does not require any settings for the Interval and Number of Checks to be configured; in fact, they are ignored.

\* Process State                         [server down, no process]
\* Server down, pms.sh not needed        
\*                                       
\* Entering Main Performance Gathering Loop
\*                                       
\*
\* Server down, skipping main loop       
\*
\*                                       
\* Exiting Main Performance Gathering Loop


Example 7 - Basic Capture


The Basic Capture is a simple 5 x 5 check.  By default the dirtracer.config file is shipped with the following Interval and Number of Checks set.  You can also enable the Basic Capture with Option 7 when using Configurator.

    NUMBEROFCHECKS="5"
    INTERVAL="5"

Example 8 - Config Only Capture


The Config Only Capture can be enabled using the CONFIG_ONLY="1" parameter or by using Option 8 in Configurator.  This sets up a config file for a 1 x 1 (1 loop) capture.

    NUMBEROFCHECKS="1"
    INTERVAL="1"

To be Continued...

Until next time,

[LT]

Thursday May 15, 2008

Checks, Intervals and Loops Oh My - Part 1

Hello all!

Let's discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and two of its governing parameters.

        NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters are totally dependent on the problem type you are gathering for.

The following are the data points normally gathered in the main loop; all of these are of course configurable, whether all, some or none are gathered.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an Admin can set the Number of Checks or "data points" to gather.  Let's look at some example problem types versus the settings you may want to capture them with.

Example 1 - Hung Processes (See Borked):


Most times an ns-slapd process is not actually hung but seems like it is.  A perceived hang could just be an ultra-busy process, caused by a series of massive db searches, all worker threads being taken and waiting on one to free a lock, or many other related issues.

Setting the Number of Checks and Interval correctly here can be critical.  Set them incorrectly and you may miss a data gathering opportunity.

Set Number of Checks to 5 and Interval to 5.  Total Runtime 25 Seconds.

This will gather 5 Pstacks/Prstats at 5 Second Intervals and can show if the process is changing over time, but does not have the granularity to show each thread's progression through the stack.

\* Dirtracer mainloop will run for       [25 sec.]
...
\* Entering Main Performance Gathering Loop
\*                                       
\* Loop 0 - 080515-092006                
\* Loop 1 - 080515-092011                
\* Loop 2 - 080515-092017                
\* Loop 3 - 080515-092022                
\* Loop 4 - 080515-092028


Example 2 - High CPU or Performance Problems:


Like Example 1 we want to see the process stack and threads change over time.  But for a High CPU or Performance Problem we want to see things change second by second.  A better option for this problem type would be to set the Number of Checks to 25 and Interval to 1.

Set Number of Checks to 25 and Interval to 1.  Total Runtime 25 Seconds.

This will gather 25 Pstacks/Prstats at 1 Second Intervals.  This way we can see the process stack change with no gaps in the captures. In Example 1 there were 5 Second gaps between pstacks and the threads will change a huge amount on a very busy server in that timeframe.

\* Dirtracer mainloop will run for       [25 sec.]
...
\* Entering Main Performance Gathering Loop
\*                                       
\* Loop 0 - 080515-092554                
\* Loop 1 - 080515-092555                
\* Loop 2 - 080515-092557                
\* Loop 3 - 080515-092558                
\* Loop 4 - 080515-092600                
\* Loop 5 - 080515-092601                
\* Loop 6 - 080515-092603                
\* Loop 7 - 080515-092604                
\* Loop 8 - 080515-092605                
\* Loop 9 - 080515-092607                
\* Loop 10 - 080515-092608               
\* Loop 11 - 080515-092610               
\* Loop 12 - 080515-092611               
\* Loop 13 - 080515-092613               
\* Loop 14 - 080515-092614               
\* Loop 15 - 080515-092616               
\* Loop 16 - 080515-092617               
\* Loop 17 - 080515-092619               
\* Loop 18 - 080515-092620               
\* Loop 19 - 080515-092621               
\* Loop 20 - 080515-092623               
\* Loop 21 - 080515-092624               
\* Loop 22 - 080515-092626               
\* Loop 23 - 080515-092627               
\* Loop 24 - 080515-092629               


Because I have increased the Number of Checks and decreased the Interval the granularity is higher and I get 25 data points as opposed to 5 over the same time period.

Example 3 - Replication Problems:


The key to debugging a Replication Problem is Debug Logging over a period of time.  Setting the special PTS_CONFIG_LOGGING parameter to 8192 will allow Dirtracer to change the nsslapd-infolog-area logging value in the dse to 8192 (Replication Debug Logging).

PTS_CONFIG_LOGGING="8192"


Setting the Number of Checks or Interval for Granularity is not as important with Replication Problems as it is with Hangs or High CPU Problems.  The settings can vary and still achieve the same results.

Set Number of Checks to 40 and Interval to 30.  Total Runtime 20 Minutes.
Set Number of Checks to 10 and Interval to 120.  Total Runtime 20 Minutes.
Set Number of Checks to 5 and Interval to 240.  Total Runtime 20 Minutes.
Set Number of Checks to 1 and Interval to 1200.  Total Runtime 20 Minutes.

\* Dirtracer mainloop will run for       [1200 sec. About 20 min.]
...
\* Logging level is being changed        [nsslapd-infolog-area]
\*   current level:                      [0]
\*   new level:                          [8192]
...
\* Entering Main Performance Gathering Loop
\*                                       
\* Loop 0 - 080515-100855                
\* Loop 1 - 080515-101256                
\* Loop 2 - 080515-101656                
\* Loop 3 - 080515-102057                
\* Loop 4 - 080515-102457                
\* togo[240sec]-timer[240]
\*                                       
\* Exiting Main Performance Gathering Loop


There are five other problem types we can discuss, but let's save those for another Blog post.

  • Crashing
  • Memory Leak
  • Server Down
  • Basic Capture
  • Config Only Capture


To be Continued...

Ciao all!

[LT]

Tuesday May 13, 2008

Pkg_app and Dirtracer

Today I will revisit Pkg_app but will focus on its uses within Dirtracer.

Before Dirtracer 6.0.4, Customers who used Dirtracer to gather cores and gcores had to run Pkg_app manually after the fact.

Since version 6.0.4, Dirtracer has included Pkg_app in the <Dirtracer Install Path>/dirtracertools/ location, and with the Quiet (-q) switch in Pkg_app 2.7 I was able to embed Pkg_app within Dirtracer to run automatically.

If a Customer uses the following config file parameters, Pkg_app will be launched automatically.

CORE_LOCATION="<full path to the core>" + SERVER_DOWN="1"

GRAB_GCORE="1" or GRAB_GCORE="2"

Here is an example using the following config; I used 1 check and a 1 second interval for brevity.

NUMBEROFCHECKS="1"
INTERVAL="1"
GRAB_GCORE="1"

See the runtime-<date>-<time>.log:

As you see below, Dirtracer completes a quick one-second loop, exits the Main Loop and grabs a Gcore.

<SNIP>
\*   pms.sh interval(1) x checks(1)      [pms.sh run time (1 sec.)]
\*                                       
\* Entering Main Performance Gathering Loop
\*                                       
\* Loop 0 - 080509-075120                
\*                                       
\* Grabbing gcore 080509-075122          [success]
</SNIP>

Once Dirtracer finishes with the Post Loop gathering, it executes Pkg_app to gather all libraries and the ns-slapd binary.  Note the normal Pkg_app processing information is not seen because Pkg_app has been launched with the Quiet (-q) option.

<SNIP>
\* Packaging files                       
\*   Preparing files - pkg_app           [waiting 120 sec.]      [success]
</SNIP>

In Dirtracer 6.0.4, customers grabbing large cores/gcores with Dirtracer saw what they thought was a pkg_app hang.  It was likely the core/gcore had overflowed the core header and Pkg_app could not process the file correctly.  As a result, I created a timer function to monitor processes like Pkg_app.

If Pkg_app runs for more than 120 seconds, Dirtracer will "kill" the pkg_app process and alert the Customer that they need to run Pkg_app manually.

<SNIP>
\* Packaging files                       
\*   Preparing files - pkg_app           [killed]
</SNIP>
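The timer is essentially a "run it, but kill it after N seconds" pattern; a generic sketch of the idea (an illustration, not Dirtracer's actual code):

#!/bin/sh
# Illustration: run a command in the background and kill it if it
# has not finished within TIMEOUT seconds.
TIMEOUT=120
./pkg_app -q &            # the monitored command (example invocation)
PID=$!
COUNT=0
while kill -0 $PID 2>/dev/null; do
    if [ $COUNT -ge $TIMEOUT ]; then
        kill $PID
        echo "pkg_app killed after $TIMEOUT seconds; run it manually" >&2
        break
    fi
    sleep 1
    COUNT=`expr $COUNT + 1`
done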

If Pkg_app was successful, Dirtracer will present the Customer with the following message; see 2) below.

<SNIP>
1) Dirtracer capture data located in directory [ /var/tmp/data/051308-01 ]

Upload "only" this file to your supportfiles.sun.com cores directory at Sun

        [ dirtracer-834d2699-kaneda-080513-090202.tar.gz ]

2) pkg_app has captured system libs as the result of a gcore or core being found
                As this file may be large, please ftp this separately to Sun
                pkg_app file located in /var/tmp/data/051308-01/pkg_app_data

                [pkg_app834d2699-kaneda-080513-090347.tar.Z ]
</SNIP>

Currently Dirtracer does not give a major alert if Pkg_app was killed.  The customer should manually run Pkg_app or gather the libraries used by the process.
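If you need to gather the libraries by hand on Solaris, pldd will list what the running process has mapped; a minimal sketch (the pid is only an example):

#!/bin/sh
# List the shared libraries mapped by the running ns-slapd process;
# the pid (9159) is only an example.  The listed .so files plus the
# ns-slapd binary are what Pkg_app would normally collect.
pldd 9159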

[LT]
About

A Tech Blog about the Sun Java Systems Dirtracer Toolkit.  Dirtracer and this blog are written and maintained by Lee Trujillo, an Oracle Senior Principal Support Engineer.
