Wednesday May 21, 2008

Checks, Intervals and Loops Oh My - Part 3

This is my final blog on the Number of Checks and Interval settings within Dirtracer.

Let's discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and two of its governing parameters.

        NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters are totally dependent on the problem type you are gathering for.

The following are the data points normally gathered in the main loop; all of them are configurable, so all, some or none may be gathered on any given run.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an Admin can set the Number of Checks or "data points" to gather.  Let's look at some example problem types versus the settings you may want to capture them with.

The Number of Checks and Interval parameters govern how many loops the Main Loop completes and how long it runs.  They are also used when executing the following commands, which are launched before the Main Loop is engaged and run in the background while the Main Loop is active.

  • Iostats
  • Vmstats
  • Mpstats
  • Pms.sh

See the man pages for iostat, vmstat and mpstat for more information on these commands.
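
As a rough illustration of how the two parameters carry over to the background collectors: iostat, vmstat and mpstat all accept an interval and a count as trailing arguments, so the same pair of values can be handed to each. This is a minimal sketch only. The `stat_cmd` helper, the output path in the comment and the dry-run printing are assumptions for illustration, and the exact switches Dirtracer passes are not shown in this post.

```shell
#!/bin/sh
# Sketch only: build the background collector command lines from the two
# loop parameters. stat_cmd and the output path are hypothetical names.
NUMBEROFCHECKS="5"
INTERVAL="5"

# iostat, vmstat and mpstat all take "<interval> <count>" as trailing arguments.
stat_cmd() {
    echo "$1 $INTERVAL $NUMBEROFCHECKS"
}

for tool in iostat vmstat mpstat; do
    if command -v "$tool" >/dev/null 2>&1; then
        # A real run would launch it in the background before the main loop:
        #   $(stat_cmd "$tool") > "/var/tmp/$tool.out" 2>&1 &
        echo "would run: $(stat_cmd "$tool")"
    else
        echo "$tool not found, skipping"
    fi
done
```

Because the collectors get the same interval and count as the Main Loop, they finish at roughly the same time the loop does.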

Please see my previous blog on pms.sh (Pmonitor) for more information regarding this script.


[LT]

Monday May 19, 2008

Checks, Intervals and Loops Oh My - Part 2

Let's discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and two of its governing parameters.

    NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters are totally dependent on the problem type you are gathering for.

The following are the data points normally gathered in the main loop; all of them are configurable, so all, some or none may be gathered on any given run.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an Admin can set the Number of Checks or "data points" to gather.  Let's look at some example problem types versus the settings you may want to capture them with.

Example 4 - Crashing


The crashing option can be used when you know or suspect a crash will happen based on time or circumstances; see Configurator option 4.  Crash Tracking with Dirtracer is as simple as setting the runtime for a long period of time and enabling the following options.

    CRASH_TRACK="1"
    PMONITOR_ONLY="1"

The point of Crash Tracking is to place Dirtracer into a "wait and see" mode.  It will poll for the process id and note when the crash happens, then ask the user if the crash produced a core and where the core is located, and finally gather the remaining data such as logs.
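
The polling idea can be sketched in a few lines of shell. This is a minimal sketch, not Dirtracer's code: the `watch_pid` helper and the use of `kill -0` as the liveness probe are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: poll a pid once per interval and report when it disappears.
# watch_pid is a hypothetical helper; kill -0 sends no signal, it only
# tests whether the pid can still be signalled by the current user.
watch_pid() {
    pid=$1
    checks=$2
    interval=$3
    loop=0
    while [ "$loop" -lt "$checks" ]; do
        if ! kill -0 "$pid" 2>/dev/null; then
            echo "ALERT - process $pid has died (loop $loop)"
            return 1
        fi
        sleep "$interval"
        loop=$((loop + 1))
    done
    echo "process $pid still running after $checks checks"
    return 0
}

# Example: watch this shell itself for 2 checks at 1 second intervals.
watch_pid "$$" 2 1
```

With 60 checks at 60 second intervals the same loop would wait up to an hour for the process to go away.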

The Number of Checks and Interval settings need to be configured to allow Dirtracer enough time for a crash to happen; this can vary based on the problem.  All the following are valid; configure your settings so the run lasts past the time you expect the crash to occur.

Set Number of Checks to 1000 and Interval to 30.  Total Runtime 30,000 Seconds.
Set Number of Checks to 100 and Interval to 60.  Total Runtime 6,000 Seconds.
Set Number of Checks to 60 and Interval to 60.  Total Runtime 3,600 Seconds.

* Dirtracer mainloop will run for       [3600 sec. About 60 min.]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080519-095153                [pms.sh only]
* Loop 1 - 080519-095253                [pms.sh only]
* Loop 2 - 080519-095353                [pms.sh only]
* Loop 3 - 080519-095454                [pms.sh only]
* Loop 4 - 080519-095554                [pms.sh only]
* Loop 5 - 080519-095654                [pms.sh only]
* Loop 6 - 080519-095754                [pms.sh only]
* Loop 7 - 080519-095854                [pms.sh only]
* Loop 8 - 080519-095954                [pms.sh only]
* Loop 9 - 080519-100054                [pms.sh only]
* Loop 10 - 080519-100155               [pms.sh only]
* Loop 11 - 080519-100255               [pms.sh only]
* Loop 12 - 080519-100355               [pms.sh only]
* Loop 13 - 080519-100455               [pms.sh only]
* Loop 14 - 080519-100555               [pms.sh only]
* Loop 15 - 080519-100655               [pms.sh only]
* Loop 16 - 080519-100755               [pms.sh only]
* Loop 17 - 080519-100855               [pms.sh only]
* Loop 18 - 080519-100956               [pms.sh only]
* Loop 19 - 080519-101056               [pms.sh only]
* Loop 20 - 080519-101156               [pms.sh only]
* Loop 21 - 080519-101256               [pms.sh only]
* Loop 22 - 080519-101356               [pms.sh only]
* Loop 23 - 080519-101456               [pms.sh only]
* Loop 24 - 080519-101556               [pms.sh only]
* Loop 25 - 080519-101656               [pms.sh only]
* Loop 26 - 080519-101757               [pms.sh only]
expr: syntax errormer[47]
*
*       ALERT - The ns-slapd process has died!
*
* [ALERT 2] The ns-slapd process has died!
*
*
*
* Locating crash core file              [failed]
*
* Did the server produce a core file (y/n)? y
*
* What is the full path to the core?
*
* Example: /var/cores/core.10008 :      /var/tmp/cores/core.10008



Example 5 - Memory Leak


Using the Memory Leak option will allow you to gather Pms.sh (Pmonitor) data up front over a long period of time to show the progression of memory usage of the slapd process.  Setting Memory Leak Tracking automatically sets the Interval to 1800 but the user needs to set the Number of Checks accordingly.

Enabling MEM_LEAK_TRACKING="1" sets the following automatically.

    INTERVAL="1800"
    PMONITOR_INTERVAL="5"

Set the Number of Checks to allow Dirtracer to run for the time you feel it takes the ns-slapd process to show a substantial amount of leakage.  Let's say you are facing a leak that can manifest itself in 24 hours.  You could set the Number of Checks to 50 and Dirtracer will capture for 1500 minutes (approx. 25 hours).  Use Configurator to set up the dirtracer.config file or use one of these simple formulas.

    1800 (interval) * N = <Number of Seconds (total run time)>
                OR
    Minutes to leak * 60 / 1800 (Interval in seconds) = <NUMBEROFCHECKS>
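
The second formula is easy to script. The following is a minimal sketch, assuming you want to round up so the run always covers the whole window; the `checks_for_minutes` helper and the `LEAK_INTERVAL` variable are hypothetical names, not Dirtracer parameters.

```shell
#!/bin/sh
# Sketch only: turn "minutes until the leak shows" into NUMBEROFCHECKS,
# rounding up so the run covers the whole window. 1800 seconds is the
# Interval that MEM_LEAK_TRACKING sets; the helper name is hypothetical.
LEAK_INTERVAL=1800

checks_for_minutes() {
    seconds=$(( $1 * 60 ))
    echo $(( (seconds + LEAK_INTERVAL - 1) / LEAK_INTERVAL ))
}

checks_for_minutes 1440   # 24 hours
checks_for_minutes 1500   # 25 hours
```

A 24 hour window works out to 48 checks; setting 50 instead, as in the example above, adds an hour of margin for a total of 90,000 seconds.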

* Dirtracer mainloop will run for       [90000 sec. About 1500 min.]
...
* Mem Leak tracking                     [on]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080519-105751
* togo[90000sec]-timer[117]
...


I have eliminated the full run time information as it would be too long for a Blog post.


Example 6 - Server Down


Use of the server down option does not require any settings for the Interval and Number of Checks to be configured; in fact they are ignored.

* Process State                         [server down, no process]
* Server down, pms.sh not needed
*
* Entering Main Performance Gathering Loop
*
*
* Server down, skipping main loop
*
*
* Exiting Main Performance Gathering Loop


Example 7 - Basic Capture


The Basic Capture is a simple 5 x 5 check.  By default the dirtracer.config file is shipped with the following Interval and Number of Checks set.  You can also enable the Basic Capture with Option 7 when using Configurator.

    NUMBEROFCHECKS="5"
    INTERVAL="5"

Example 8 - Config Only Capture


The Config Only Capture can be enabled using the CONFIG_ONLY="1" parameter or also using Option 8 in Configurator.  This sets up a config file for a 1 x 1 (1 loop) capture.

    NUMBEROFCHECKS="1"
    INTERVAL="1"

To be Continued...

Until next time

[LT]

Thursday May 15, 2008

Checks, Intervals and Loops Oh My - Part 1

Hello all!

Let's discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and two of its governing parameters.

        NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters are totally dependent on the problem type you are gathering for.

The following are the data points normally gathered in the main loop; all of them are configurable, so all, some or none may be gathered on any given run.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an Admin can set the Number of Checks or "data points" to gather.  Let's look at some example problem types versus the settings you may want to capture them with.

Example 1 - Hung Processes (See Borked):


Most of the time an ns-slapd process is not actually hung but only seems to be.  A perceived hung process could just be an extremely busy process, caused by a series of massive db searches, all worker threads being taken and waiting on one to free a lock, or many other related issues.

Setting the Number of Checks and Interval correctly here can be critical.  Set them incorrectly and you may miss a data gathering opportunity.

Set Number of Checks to 5 and Interval to 5.  Total Runtime 25 Seconds.

This will gather 5 Pstacks/Prstats at 5 Second Intervals and can show if the process is changing over time, but does not have the granularity to show each thread's progression through the stack.

* Dirtracer mainloop will run for       [25 sec.]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080515-092006
* Loop 1 - 080515-092011
* Loop 2 - 080515-092017
* Loop 3 - 080515-092022
* Loop 4 - 080515-092028


Example 2 - High CPU or Performance Problems:


Like Example 1 we want to see the process stack and threads change over time.  But for a High CPU or Performance Problem we want to see things change second by second.  A better option for this problem type would be to set the Number of Checks to 25 and Interval to 1.

Set Number of Checks to 25 and Interval to 1.  Total Runtime 25 Seconds.

This will gather 25 Pstacks/Prstats at 1 Second Intervals.  This way we can see the process stack change with no gaps in the captures.  In Example 1 there were 5 Second gaps between pstacks, and on a very busy server the threads can change a huge amount in that timeframe.

* Dirtracer mainloop will run for       [25 sec.]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080515-092554
* Loop 1 - 080515-092555
* Loop 2 - 080515-092557
* Loop 3 - 080515-092558
* Loop 4 - 080515-092600
* Loop 5 - 080515-092601
* Loop 6 - 080515-092603
* Loop 7 - 080515-092604
* Loop 8 - 080515-092605
* Loop 9 - 080515-092607
* Loop 10 - 080515-092608
* Loop 11 - 080515-092610
* Loop 12 - 080515-092611
* Loop 13 - 080515-092613
* Loop 14 - 080515-092614
* Loop 15 - 080515-092616
* Loop 16 - 080515-092617
* Loop 17 - 080515-092619
* Loop 18 - 080515-092620
* Loop 19 - 080515-092621
* Loop 20 - 080515-092623
* Loop 21 - 080515-092624
* Loop 22 - 080515-092626
* Loop 23 - 080515-092627
* Loop 24 - 080515-092629


Because I have increased the Number of Checks and decreased the Interval, the granularity is higher and I get 25 data points as opposed to 5 over the same time period.

Example 3 - Replication Problems:


The key to debugging a Replication Problem is Debug Logging over a period of time.  Setting the special PTS_CONFIG_LOGGING parameter to 8192 will allow Dirtracer to change the nsslapd-infolog-area logging value in the dse to 8192 (Replication Debug Logging).

PTS_CONFIG_LOGGING="8192"


Setting the Number of Checks or Interval for Granularity is not as important with Replication Problems as it is with Hangs or High CPU Problems.  The settings can vary and still achieve the same results.

Set Number of Checks to 40 and Interval to 30.  Total Runtime 20 Minutes.
Set Number of Checks to 10 and Interval to 120.  Total Runtime 20 Minutes.
Set Number of Checks to 5 and Interval to 240.  Total Runtime 20 Minutes.
Set Number of Checks to 1 and Interval to 1200.  Total Runtime 20 Minutes.
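
The arithmetic behind these four settings is simply Number of Checks multiplied by Interval. A minimal sketch to check them follows; `total_runtime` is a hypothetical helper and not part of Dirtracer.

```shell
#!/bin/sh
# Sketch only: total main loop runtime is NUMBEROFCHECKS x INTERVAL seconds.
# total_runtime is a hypothetical helper, not part of Dirtracer.
total_runtime() {
    echo $(( $1 * $2 ))
}

# checks:interval pairs from the list above
for pair in 40:30 10:120 5:240 1:1200; do
    checks=${pair%%:*}
    interval=${pair##*:}
    secs=$(total_runtime "$checks" "$interval")
    echo "checks=$checks interval=$interval -> $secs sec (about $((secs / 60)) min)"
done
```

All four pairs come out at 1200 seconds. The timestamps in the captured logs drift by about a second per loop, so treat the figure as the configured runtime rather than exact wall clock time.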

* Dirtracer mainloop will run for       [1200 sec. About 20 min.]
...
* Logging level is being changed        [nsslapd-infolog-area]
*   current level:                      [0]
*   new level:                          [8192]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080515-100855
* Loop 1 - 080515-101256
* Loop 2 - 080515-101656
* Loop 3 - 080515-102057
* Loop 4 - 080515-102457
* togo[240sec]-timer[240]
*
* Exiting Main Performance Gathering Loop


There are five other problem types we can discuss but let's save that for another Blog post.

  • Crashing
  • Memory Leak
  • Server Down
  • Basic Capture
  • Config Only Capture


To be Continued...

Ciao all!

[LT]

Tuesday May 13, 2008

Pkg_app and Dirtracer

Today I will revisit Pkg_app but will focus on its uses within Dirtracer.

Before Dirtracer 6.0.4, Customers who used Dirtracer to gather cores and gcores had to run Pkg_app manually after the fact.

Since version 6.0.4 Dirtracer has included Pkg_app in the <Dirtracer Install Path>/dirtracertools/ location, and with the Quiet (-q) switch in Pkg_app 2.7 I was able to embed Pkg_app within Dirtracer to run automatically.

If a Customer uses either of the following config file parameter settings, Pkg_app will be launched automatically.

CORE_LOCATION="<full path to the core>" + SERVER_DOWN="1"

GRAB_GCORE="1" or GRAB_GCORE="2"

Here is an example using the following config; I used 1 check and a 1 second interval for brevity.

NUMBEROFCHECKS="1"
INTERVAL="1"
GRAB_GCORE="1"

See the runtime-<date>-<time>.log:

As you see below Dirtracer completes a quick one second loop, exits the Main Loop and grabs a Gcore.

<SNIP>
*   pms.sh interval(1) x checks(1)      [pms.sh run time (1 sec.)]
*
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080509-075120
*
* Grabbing gcore 080509-075122          [success]
</SNIP>

Once Dirtracer finishes with the Post Loop gathering, it executes Pkg_app to gather all libraries and the ns-slapd binary.  Note the normal Pkg_app processing information is not seen because Pkg_app has been launched with the Quiet (-q) option.

<SNIP>
* Packaging files
*   Preparing files - pkg_app           [waiting 120 sec.]      [success]
</SNIP>

In Dirtracer 6.0.4, customers grabbing large cores/gcores with Dirtracer saw what they thought was a pkg_app hang.  It was likely the core/gcore had overflowed the core header and Pkg_app could not process the file correctly.  As a result I created a timer function to monitor processes like Pkg_app.

If Pkg_app runs for more than 120 seconds, then Dirtracer will "kill" the pkg_app process and alert the Customer that they need to run Pkg_app manually.

<SNIP>
* Packaging files
*   Preparing files - pkg_app           [killed]
</SNIP>
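
For readers curious what such a timer can look like in plain shell, here is a minimal sketch. It is not Dirtracer's timer function, which is not shown in this post; `run_with_timer` and its watchdog approach are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: run a command, but kill it if it exceeds a time limit.
# run_with_timer is a hypothetical helper; Dirtracer's own timer function
# is not shown in this post.
run_with_timer() {
    limit=$1
    shift
    "$@" &
    cmd_pid=$!

    # Watchdog: sleeps for the limit, then kills the command if still there.
    ( sleep "$limit"; kill "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    watchdog_pid=$!

    wait "$cmd_pid"
    status=$?

    # The command has finished or been killed; the watchdog is not needed now.
    kill "$watchdog_pid" 2>/dev/null
    wait "$watchdog_pid" 2>/dev/null

    if [ "$status" -gt 128 ]; then
        # A status above 128 means the command was ended by a signal.
        echo "[killed]"
        return 1
    elif [ "$status" -ne 0 ]; then
        echo "[failed]"
        return 1
    fi
    echo "[success]"
    return 0
}

# Example with a 2 second limit instead of 120:
run_with_timer 2 sleep 1
run_with_timer 2 sleep 10 || true
```

The first call finishes inside the limit, the second is killed by the watchdog. A command that itself exits with a status above 128 would be misreported as killed, which is acceptable for a sketch but worth knowing.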

If Pkg_app was successful then it will present the Customer with the following message; see 2) below.

<SNIP>
1) Dirtracer capture data located in directory [ /var/tmp/data/051308-01 ]

Upload "only" this file to your supportfiles.sun.com cores directory at Sun

        [ dirtracer-834d2699-kaneda-080513-090202.tar.gz ]

2) pkg_app has captured system libs as the result of a gcore or core being found
                As this file may be large, please ftp this separately to Sun
                pkg_app file located in /var/tmp/data/051308-01/pkg_app_data

                [pkg_app834d2699-kaneda-080513-090347.tar.Z ]
</SNIP>

Currently Dirtracer does not give a major alert if Pkg_app was killed.  The customer should manually run Pkg_app or gather the libraries used by the process.

[LT]

Thursday May 08, 2008

All Unix's are not created equal

I just thought about the Dirtracer parameters I had recently blogged about and realized I forgot to mention a basic fundamental of Unix...that being that All Unix's are not created equal.  Simply stated, not all Unix flavors have the same command sets.

Working for Sun I normally prefer Solaris but as a Directory Engineer and Author of Dirtracer I need to understand and support the Directory and Dirtracer on the platforms the Directory Server is supported on; Windows is an exception.

When I created Stracer (Dirtracer's predecessor) I wrote it only for Solaris.  Solaris includes many commands that other Unix flavors don't and vice versa.  Such commands include gcore, pstack, prstat and all proctool related commands.

How are they used?

  • Gcore is a way to instantly drop a binary file that is equivalent to a process Core file; i.e. a Solaris Crash Dump.
  • Pstack is a text based dump of the process's threads and the functions each thread was using when the pstack was taken.
  • Prstat shows output of each thread's size, run state and cpu% used.

All these commands are extremely useful when debugging memory leaks, high cpu issues or bottlenecks with operations.  They are also a fundamental part of the Main (data gathering) Loop in Dirtracer.

Enough about Solaris for now.

The point of this blog is to mention that not all of my posts are Solaris specific, but some parameter sets used by Dirtracer may not apply to other flavors like Linux and HP-UX.

Linux and HP-UX have OpenSource ports of some of the above mentioned commands but they are untested (by me) and they would have to be installed on their respective OS's by the Customer/Admin, so I cannot rely on them.

Dirtracer has basically the same code whether it is used on Solaris, Linux or HP-UX, but if these commands are not found they are ignored.  As a result, neither the Customer nor Sun Support can benefit from data that would otherwise have been available.
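
A portable way to make that "found or ignored" decision is to probe for each command before using it. The following is a minimal sketch under that assumption; `have_cmd` is a hypothetical helper name and the message text is made up for illustration.

```shell
#!/bin/sh
# Sketch only: probe for the Solaris proc tools before using them.
# have_cmd is a hypothetical helper name.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

for tool in gcore pstack prstat; do
    if have_cmd "$tool"; then
        echo "$tool [found]"
    else
        echo "$tool [not found, data point skipped]"
    fi
done
```

On a Linux or HP-UX system without the ports installed, all three would be reported as skipped and the rest of the capture could carry on.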

[LT]

Wednesday May 07, 2008

Pmonitor, Pms.sh and the almighty ps command

Originally Stracer (predecessor to Dirtracer) included a small shell script called Pmonitor, written by Ben Gooley, then a Directory Server Support Engineer.  Pmonitor (Process Monitor) was a lightweight script that used the "ps" command with a series of switches to retrieve the Virtual Size (vsz), Resident Set Size (rss) and the ratio of cpu time used to cpu time available (pcpu).

man ps:
     vsz   The total size of the process in  virtual  memory,  in
           kilobytes.

     rss   The resident set size of the process, in kilobytes.

     pcpu  The ratio of CPU time used recently to CPU time avail-
           able  in  the  same period, expressed as a percentage.
           The  meaning  of  ``recently''  in  this  context   is
           unspecified.  The  CPU time available is determined in
           an unspecified manner.

Pmonitor was great for seeing an average cpu % busy (rudimentary load) as well as tracking the process's memory footprint over time.  This helped Sun Support Engineers shed light on a possible memory leak if the process size never shrank.  It was a bit hard to see any real trend in just the raw data, so many Engineers plotted the data to see it visually over time.

#./pmonitor 22241 1 5
DATE   -  [TIME] ------- PID   VSZ   RSS   PCPU
05/06-[14:27:50] ------- 22241 86296 56360  0.0
05/06-[14:27:51] ------- 22241 86296 56360  0.0
05/06-[14:27:52] ------- 22241 86296 56360  0.0
05/06-[14:27:53] ------- 22241 86296 56360  0.0
05/06-[14:27:54] ------- 22241 86296 56360  0.0

In walks Mark Reynolds (creator of logconv.pl) with Pms.sh; an enhanced Pmonitor script which adds a "growth" calculation shown in Kilobytes (k).

#./pms.sh 22241 1 5
DATE   -  [TIME] ------- PID   VSZ   RSS   PCPU
05/06-[14:28:20] ------- 22241 86296 56360  0.0
05/06-[14:28:21] ------- 22241 86296 56360  0.0    growth:   0 k
05/06-[14:28:22] ------- 22241 86296 56360  0.0    growth:   0 k
05/06-[14:28:23] ------- 22241 86296 56360  0.0    growth:   0 k
05/06-[14:28:24] ------- 22241 86296 56360  0.0    growth:   0 k

With Pms.sh we could now see the growth in the raw data without plotting it all the time.

04/24-[17:55:39] ------- 12489 5310368 5270584  2.3    growth:   0 k
04/24-[17:55:40] ------- 12489 5318432 5277048  5.6    growth:   8064 k
04/24-[17:55:42] ------- 12489 5319104 5277600  2.9    growth:   672 k
04/24-[17:55:56] ------- 12489 5319104 5277600  2.5    growth:   0 k

Not only could we see the growth we could also see when memory usage dropped.

04/24-[17:56:14] ------- 12489 5319104 5277600  1.0    growth:   0 k
04/24-[17:56:15] ------- 12489 5317560 5276424  3.5    growth:   -1544 k
04/24-[17:56:17] ------- 12489 5317560 5276424  5.4    growth:   0 k
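
Judging from the sample output above, the growth figure matches the change in VSZ from one sample to the next (for example 5318432 - 5310368 = 8064). Here is a minimal sketch of that calculation; it is inferred from the data shown and is not the Pms.sh source, and `print_growth` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: print the change in VSZ between successive samples.
# Inferred from the sample output; not taken from pms.sh itself.
print_growth() {
    prev=""
    while read -r vsz; do
        if [ -z "$prev" ]; then
            # First sample: nothing to compare against yet.
            echo "$vsz"
        else
            echo "$vsz growth: $((vsz - prev)) k"
        fi
        prev=$vsz
    done
}

# VSZ values taken from the samples above
printf '%s\n' 5310368 5318432 5319104 5317560 | print_growth
```

Fed with the four VSZ values from the samples, it reports 8064 k and 672 k of growth followed by a drop of 1544 k.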

I added Pms.sh 2.01 to the Dirtracer bundle starting with Dirtracer release 2.2 and now include Pms.sh 2.02 with Dirtracer 6.0.6.

Uses for Pms.sh:

1) Memory Leaks; see above.
2) Gauging high cpu problems.  Dirtracer has prstats but in some circumstances prstat is not usable or recommended.
3) Looking at the trends of both of these elements over a long period of time.

System Impact:

Negligible.

I tested pms.sh using pms.sh and top as a gauge.

Top reported that pms.sh uses between 0.04% and 0.1% cpu on my Sunblade, and pms.sh itself shows the pcpu as 0.0.

23661 root       1  50    0 1104K  832K sleep   0:00  0.09% pms.sh
23817 root       1   0    0 1104K  832K sleep   0:00  0.04% pms.sh

# ps -aef | grep pms.sh
    root 23661 21979  0 14:48:38 pts/3    0:00 /bin/sh ./pms.sh 22241 1 1000000

#./pms.sh 23661 1 1000
DATE   -  [TIME] ------- PID   VSZ   RSS   PCPU
05/06-[14:48:53] ------- 23661 1104  832  0.0
05/06-[14:48:54] ------- 23661 1104  832  0.0    growth:   0 k
05/06-[14:48:55] ------- 23661 1104  832  0.0    growth:   0 k
05/06-[14:48:57] ------- 23661 1104  832  0.0    growth:   0 k

Dirtracer + Pms.sh:

By default Dirtracer will run Pms.sh with the same NUMBEROFCHECKS and INTERVAL as seen in the dirtracer.config file.  If the Dirtracer mainloop is configured to run for 5 Checks at 5 Second Intervals then Pms.sh will do the same.

*   pms.sh found                        [/export/home/hrdwired/PTS/dirtracertools/pms.sh]
* Access/Error Logging                  [left as is]
* Audit Logging                         [left as is]
* Iostat available.     Executing...    [success]
* Vmstat available.     Executing...    [success]
* Mpstat available.     Executing...    [success]
* pms.sh (pmonitor) executed            [success]
*   pms.sh interval(5) x checks(5)      [pms.sh run time (25 sec.)]
*
* Entering Main Performance Gathering Loop

Config parameters used with Pms.sh:

PMONITOR_INTERVAL:  The Pmonitor Interval allows you to set a smaller Interval for Pms.sh than the one Dirtracer is running with.

If Dirtracer's mainloop is configured to run at 30 second Intervals but you want to see Pms.sh output at 1 second Intervals, set PMONITOR_INTERVAL="1".

PMONITOR_ONLY:  The Pmonitor Only parameter will disable the Mainloop's use of all its normal data capture points such as pstacks, prstats, cn=monitor and cache size searches.  This allows the user to gather a long term Pms.sh data set without the overhead of tons of Pstacks/Prstats, Ldap Searches etc.

I hope this gives you a detailed view of Pms.sh (Pmonitor) and how it is used within Dirtracer and Sun Support.

[LT]

Tuesday May 06, 2008

Tracing ns-slapd process id's dynamically.

I was asked on the monthly Directory Collaborators call something to the effect of the following...

"It looks like Dirtracer is bound to one process id (pid) in the config file, but what if the server is restarted, does it know the new pid number dynamically"?

Currently, no.  Dirtracer was built to trace one pid at a time, and the pid must be changed in the dirtracer.config file before running.

I thought more on this and came to the conclusion it is possible in a future version of Dirtracer.  If you have a dirtracer.config or dirtracer.config.last (the config from the last run dirtracer capture) but the process id has changed, I could back track using the pid file located in the <slapd instance name>/logs location to use the current pid.

This would be based on the dirtracer.config file parameter INSTANCENAME.

If the INSTANCENAME parameter was not set (would be in a dirtracer.config.last or dirtracer.config.x) then I would have to abort the run.
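
To make the idea concrete, here is a minimal sketch of such a lookup. It is only an illustration of the approach described above: the `current_pid` helper is hypothetical, and the pid file being named `pid` inside the instance's logs directory is an assumption.

```shell
#!/bin/sh
# Sketch only: look up the current ns-slapd pid from the instance pid file.
# The file name "pid" under <instance>/logs and the current_pid helper are
# assumptions for illustration.
current_pid() {
    pidfile="$1/logs/pid"
    if [ ! -r "$pidfile" ]; then
        echo "no readable pid file at $pidfile" >&2
        return 1
    fi
    pid=$(cat "$pidfile")
    # Only accept the pid if that process can actually be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        echo "$pid"
    else
        echo "pid $pid from $pidfile is not running" >&2
        return 1
    fi
}

# Usage (the instance path is a placeholder):
#   PID=$(current_pid "<slapd instance path>") || exit 1
```

Returning a non-zero status when the file is missing or stale gives the caller a clean point at which to abort the run.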

While possible, it would however have limitations (gotchas) that could prevent data capture.  If the directory server was hung/deadlocked and an old dirtracer.config file was used with BORKED="0" (Not Hung), then when launched, Dirtracer would attempt an ldapsearch and itself hang waiting on the search to return.

Note: because Dirtracer is a shell script and relies on ldapsearch as opposed to an internal search mechanism, it is limited to the ldapsearch capabilities and therefore does not (currently) have an option to time out a connection.  I have been thinking recently of using my timer function for this issue.

This dynamic pid check would mostly be useful for a couple scenarios I can think of.

1) Dirtracer runs to gather historical/config info using cron, without having to check/recheck the running pid vs the config file setting (in the event the ds was restarted).

2) To set up Problem Type config files in advance without the current ns-slapd process's pid number embedded in the config file.  In this case the admin could just run ./dirtracer -f ./dirtracer.config.hang with very short notice and capture data quicker than having to edit the config file and change the pid parameter.

I will think this idea over some more but it has merit so it could make its way into the next release.

[LT]

Monday May 05, 2008

Presenting at the Directory Collaboration Meeting [follow up]

Quick note on today's Directory Collaboration meeting:

Thanks to all who attended. I received some great questions and comments on Dirtracer...always good to have positive feedback on your product.

In tomorrow's blog I will discuss thoughts around one question asked in today's presentation: "Tracing ns-slapd process id's dynamically".

Thanks again all!

 [LT]


 

Bork Bork Bork

The BORKED parameter.


I realize most of our international users of Dirtracer may not get the reference but the BORKED parameter got its name in part from the US TV show called The Muppet Show; something I watched as a child.  A show made up entirely of Puppets (Muppets) whose main character is called Kermit the Frog.

The BORKED parameter is a reference to the Swedish Chef's special Bork-speak...a parody of a wacky Chef who speaks in unintelligible Swedish.  In certain tech circles "Borked" is synonymous with "Broken", and when developing this setting's purpose the name just stuck.

http://en.wikipedia.org/wiki/The_Muppet_Show
http://en.wikipedia.org/wiki/Bork_bork_bork

Think of Borked as "Hung" when it comes to its use with Dirtracer.  If a Directory Server process is thought to be Hung, then set the BORKED parameter to 1.

    BORKED="1"

Note:  I plan to rename the parameter to HUNG (Config File only) in the next release.

What does BORKED="1" do?

Normally Dirtracer will run the following set of searches and/or modifies; if Borked is set to 1 (on) then these searches etc. are skipped.  Setting Borked to 1 helps make sure Dirtracer itself doesn't hang waiting on these ldapsearches to return.  If Borked is not set to 1 when the Directory Server is suspected of being Hung, then Dirtracer will not complete its data gathering.

Searches:
    Backend Suffix names; naming contexts
    Backend Database names
    cn=monitor
    cache info searches
    nsds50ruv; replica RUVs
    cn=config info
    rootdse

Modifies; only completed if Dirtracer is configured to do so.

    PTS_CONFIG_LOGGING can set the server Logging level to the parameter value configured.  This sets the nsslapd-infolog-area or nsslapd-errorlog-level

    Examples:
        nsslapd-infolog-area: 4 sets Heavy Trace Logging
        nsslapd-infolog-area: 128 sets ACI debugging
        nsslapd-infolog-area: 8192  sets Replication Debug Logging

    See the following link for more info on Logging Levels
        http://docs.sun.com/source/816-6699-10/confattr.html#15873

Dirtracer can also set (rarely used) Logging On or Off.

    TURN_LOGGING_ON="0"             # Turn On access/error logs
    TURN_LOGGING_OFF="0"            # Turn Off access/error logs
    AUDIT_LOGGING_ON="0"            # Turn On audit logs
    AUDIT_LOGGING_OFF="0"           # Turn Off audit logs

As mentioned, Dirtracer only completes the above ldapmodify operations if configured to do so.
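
The effect of the flag can be pictured as a guard around every LDAP step. This is a minimal sketch of that pattern only; `ldap_step` is a hypothetical wrapper, and the echo stands in for the real ldapsearch or ldapmodify call.

```shell
#!/bin/sh
# Sketch only: guard every ldapsearch/ldapmodify step behind the BORKED flag.
# ldap_step is a hypothetical wrapper; the echo stands in for the real call.
BORKED="${BORKED:-0}"

ldap_step() {
    if [ "$BORKED" = "1" ]; then
        echo "$1 [skipped, server assumed hung]"
        return 0
    fi
    echo "$1 [would run]"
}

for step in "naming contexts" "cn=monitor" "cache info" "nsds50ruv" "cn=config" "rootdse"; do
    ldap_step "$step"
done
```

Run as is, every step reports that it would run; with BORKED=1 in the environment every step is skipped, so nothing can block on a hung server.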

[LT]

Thursday May 01, 2008

The DATABIN parameter

I had a question from a Front Line Engineer recently where they did not understand how to select a proper location for the DATABIN parameter.

DATABIN="<DATA OUTPUT PATH>"    # Databin main path.
                                # Sub dirs will be created beneath this path.

The DATABIN is the path where you want Dirtracer to store the data it captures.  Special care should be taken to select the right path based on the projected size of the data you need to gather.

Sun GDD Directory Dirtracer Reference Guide: Page 17

Disk Usage

Disk space used is almost entirely dependent on the following.

1. How Dirtracer is configured; i.e. what it is asked to gather.
2. How many loops Dirtracer is configured to complete.
  •     cache, monitor searches
  •     netstats, iostats, pstacks, prstats
  •     transaction log ls -l captures.
3. How many access/error and audit logs are captured.  
  •     configured from the GATHER_N_XXXXX_LOGS="N" parameters.
4. How big each of those logs is.  (/var/adm/messages logs included)
5. Shared Memory (MMAP) files
  •     how big the ns-slapd process size is.
6. Cores
  •     how big the ns-slapd process size is.
7. Gcores
  •     how big the ns-slapd process size is.
8. If Dirtracer has REMOVE_TEMP_DATA=0.
  •     saves all temp files in addition to the final tar file.
9. If Dirtracer has SKIP_TAR_GZIP=1.
  •     Skips the final tar/gz, saving half the space it normally uses; i.e. duplication of files occurs as files are tarred and gzipped.


The Engineer was also requested to setup two directory.config files to trace two separate slapd instances at the same time, on the same system.  Would the DATABIN parameters need to be different? No.

Early on I saw a problem with Stracer (the old Dirtracer): when customers used the same DATABIN for more than one run, the previously captured files would be overwritten or you would end up with multiples of the same files.

I solved this issue by having Dirtracer create a unique time/date based directory under the defined DATABIN path.  Even if Dirtracer is run multiple times on the same system a new sub databin is created to segregate the data.

Example:

1) Set the DATABIN as follows.

DATABIN="/var/tmp/data"

2) Run Dirtracer 3 times and observe the directories created in /var/tmp/data/

root[/var/tmp/data]#ls -l
total 12
drwxr-xr-x  11 root     other       1536 Apr 21 15:38 042108-01
drwxr-xr-x  11 root     other       1536 Apr 21 15:42 042108-02
drwxr-xr-x  11 root     other       1536 Apr 21 15:43 042108-03

root[/var/tmp/data]#find . -name "dirtracer-*gz" -print
./042108-01/dirtracer-834d2699-kaneda-080421-153743.tar.gz
./042108-02/dirtracer-834d2699-kaneda-080421-154144.tar.gz
./042108-03/dirtracer-834d2699-kaneda-080421-154335.tar.gz

You can clearly see that the data from each run is kept separate and should not collide.
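
The naming scheme seen above (MMDDYY plus a two-digit run counter) could be produced along the following lines. This is only a sketch under assumptions, not Dirtracer's actual code: the function name make_sub_databin is hypothetical, and the date format is inferred from the listing above.

```shell
#!/bin/sh
# Hypothetical helper (not Dirtracer's actual code): create the next free
# MMDDYY-NN directory under the given DATABIN and print its path.
make_sub_databin() {
    databin=$1
    mkdir -p "$databin" || return 1
    stamp=$(date +%m%d%y)          # e.g. 042108 for April 21, 2008
    n=1
    while [ "$n" -le 99 ]; do
        sub=$(printf '%s/%s-%02d' "$databin" "$stamp" "$n")
        # mkdir without -p fails if the directory already exists,
        # so two runs can never claim the same sub databin.
        if mkdir "$sub" 2>/dev/null; then
            echo "$sub"
            return 0
        fi
        n=$((n + 1))
    done
    return 1                       # no free slot left for today
}
```

Calling `make_sub_databin /var/tmp/data` twice on the same day would print two different paths, the second with the next counter value, which is the property that keeps the captures from colliding.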

[LT]

Configurator, the dirtracer.config.template and their uses.

I was recently asked what the differences are between the dirtracer.config.template and the Configurator script and how they are used.

The previous version of my script, Stracer, used both a config file and a full range of command line switches.  The command line switches confused many, and the config file at that time was not well documented.  As a result we had many Dirtracer runs configured to capture the wrong type of data for the problem type.

Shortly after, I decided to create the "Configurator" and released it with Stracer 1.9.3.  Configurator took the Problem Type encountered by the Customer and translated it into a working dirtracer.config file.  Originally Configurator contained 7 problem type options.  With Configurator 6.0.6 I have added Option 8 for a Configuration Only Capture.

--------------------------------------------------------------------------------
Sun Microsystems Configurator 6.0.6                                       
--------------------------------------------------------------------------------
Please choose the type of problem you are experiencing

Process Hung                            [1]
High CPU                                [2]
Replication                             [3]
Crashing                                [4]
Memory Leak                             [5]
Server Down                             [*]     DISABLED - (SLAPDPID is set)
Basic Capture                           [7]
Config Only Capture                     [8]
--------------------------------------------





NOTE: Now that the Document for Dirtracer has progressed to this point I may have to add a full section for Configurator, even though it's interactive and self-explanatory.

Configurator takes you through the following sections to create a dirtracer.config file:

1) Case Number (if available)
2) Slapd Instance selection.
3) Directory Manager Password entry
4) Data Storage location.  This is the location of the DATABIN parameter where all captured data will be stored.
5) Skip Tar Gzip question
6) Problem Type selection.
    a) Process Hung. Hang detection, Gcore selection
    b) High CPU. CPU % threshold level, Gcore selection
    c) Replication.  Sets replication debug logging (8192)
    d) Crashing.
    e) Memory Leak.
    f) Server Down. DS version [5x|6x], Instance path entry.
    g) Basic Capture
    h) Config Only Capture
7) DS Log capture selection; access, error and audit logs.
8) Dirtracer Runtime selection.
9) Pmonitor (pms.sh) Runtime selection.
10) Configuration Summary
11) Data Capture Size guesstimation.
12) Config file (dirtracer.config) creation.

The Configurator is a good way for those new to Dirtracer to quickly set up a dirtracer.config file for an event.

So what is the difference between the Configurator and the dirtracer.config.template?  Well, Configurator asks questions to set up a ready-to-use dirtracer.config.  The dirtracer.config.template is just that...a template.  It contains every parameter that would be set when creating a new dirtracer.config with the Configurator.  However, it has to be edited before it can be used with Dirtracer, and it does not have Presets for Problem Types.

Without the following parameters properly set/changed, Dirtracer will exit and alert the admin the file needs to be changed.  Likewise the template contains some default settings.

SLAPDPID="<SLAPD PID>"          # Slapd pid number
MGRPW="<PASSWORD>"              # Mgr password
DATABIN="<DATA OUTPUT PATH>"    # Databin main path.

The template can be copied, renamed and edited to contain different parameter settings for the same problem types as seen above.  The dirtracer.config.template is completely self documented so administrators can quickly look at a parameter and select its use (or not).
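
As a rough illustration of the kind of check described above, the sketch below reports which of the three required parameters still carry a template placeholder such as "<SLAPD PID>". It is not Dirtracer's own validation code; the function name unset_params is hypothetical, and it assumes each parameter sits at the start of its own line, as in the template excerpt above.

```shell
#!/bin/sh
# Hypothetical check (not Dirtracer's own validation): print the name of
# each required parameter that still holds a "<...>" template placeholder.
unset_params() {
    config=$1
    for name in SLAPDPID MGRPW DATABIN; do
        # A placeholder value begins with ="< immediately after the name.
        if grep "^${name}=\"<" "$config" >/dev/null; then
            echo "$name"
        fi
    done
}
```

Running `unset_params dirtracer.config` on a freshly copied template would list all three names, and an empty result would mean none of the placeholders remain.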

Hope this was useful.

[LT]

Presenting at the Directory Collaboration Meeting

Good news!

I will be one of the speakers in this month's Directory Collaboration Meeting this coming Monday May 02, 12pm ET.  Partners, if you normally attend this, it will be an opportunity to ask me questions on Dirtracer usage and the future of my product.

I will be giving a mini presentation on Dirtracer for those who are unfamiliar with it (is this even possible? :)), and I will cover the new features you can take advantage of in the new version 6.0.6.

Directory Collaboration:
"Regularly scheduled monthly Directory Services collaborative meetings providing information in field experiences, deployment and configuration strategies, knowledge sharing, and questions and answers."

Hope to see you there!

[LT]

Wednesday Apr 30, 2008

Dirtracer 6.0.6 Unleashed!


The latest version of Dirtracer is now available for Customer and Partner download on the BigAdmin System Administration Portal.

The major changes included in 6.0.6 are as follows:

1) Added SKIP_TAR_GZIP Code.  Skips all tar and gzip functions.
 - *   Preparing files - pstack            [skipped - skip tar gzip enabled]

2) Added Carpet Version Type Check. Gets a better DS Type from 6.x
 - * DS Version                            [6.1 - 64 bit - zip install]

 - 6658311: Dirtracer (GDD): 6.0.5 can report the wrong install type - zipinstall vs pkginstall (jes).

3) Reset the ldapsearch path ..uses locateLdapTools function
 - 5.1

 * Ldap Tools Path                       [/data/sunone/ds51sp4/shared/bin]

 - 5.2

 * Ldap Tools Path                       [/opt/ds52p6/shared/bin]

 - 6.x

 * Ldap Tools Path                       [/opt/dsee62/dsrk6/bin]


4) Added a secondary check for backends. Highlights if DS6 & No Backends exist.

 * Backends Found                        [none configured]

5) ps -ae changed to include ps -aef
 * ps -aef                             [success]

6) Add a checkPatch for Sol 8 on 108995-08
 - 108995-08
 - SunOS 5.8: /usr/lib/libproc.so.1 patch
 - http://sunsolve.sun.com/search/document.do?assetkey=1-21-108995-08-1

 - Sol 5.9
 - http://sunsolve.sun.com/search/document.do?assetkey=1-21-117125-02-1
 - http://sunsolve.sun.com/search/document.do?assetkey=1-21-117125-03-1

7) Added CONFIG_ONLY runtime option.  See Ref. Guide for Config Only Capture.

8) Added an ls -laR of the slapd instance
 * Gathering needed customer defined configuration
 *   ls -laR of slapd Instance           [success]

Stay tuned for explanations of the new features and how they can benefit you.

[LT]

Friday Apr 25, 2008

The Dirtracer Blog is here!

Welcome to the new Dirtracer Blog!


I am Lee Trujillo, an Engineer supporting the Sun Java Directory Server since December 2003.  I am also the creator of Dirtracer, Sun's number one tool for tracing issues for our Sun Directory Server.

Early in 2004, I saw that the support organization had no formalized standard for asking for and obtaining data related to issues surrounding the Directory Server, so I created Stracer (Stack Tracer), the predecessor to Dirtracer (Directory Tracer).  Stracer 1.0 consisted of 169 lines of shell script that basically gathered pstacks, prstats and top output from a running ns-slapd process.  By contrast, Dirtracer 6x (2008) is a complex 2,590-line script of functions that can be combined in many ways to gather Directory data based on problem type.

Dirtracer is a troubleshooting tool designed to help reduce resolution time on complex Directory Server problems and to ease the data-gathering process for Sun's customers.

Dirtracer is part of the GDD (Gathering Debug Data) suite of tools and has already been used for years to tackle some of the most persistent, difficult Sun Java Systems Directory Server problems faced in the field.  For problems such as server hangs, crashes, and high cpu utilization, Dirtracer simplifies the sampling of system resources and crash data in order to help identify trends.

Save yourself and your customers time and aggravation -- discover the power of Dirtracer.

I expect Dirtracer version 6.0.6 to be available later today on the external BigAdmin System Administration Portal for Customer and Partner download.

[LT]
About

A Tech Blog about the Sun Java Systems Dirtracer Toolkit. Dirtracer and this blog are written and maintained by Lee Trujillo, an Oracle Senior Principal Support Engineer.
