Thursday Jul 23, 2009

Multiple DirTracer iterations - What not to do.

Hey all,

Quick tech tip here. 

We recently came across a situation where a Customer ran DirTracer three separate times in the same second; one run for each of the three instances on a single system.  Each config file was correctly configured for its instance, but all three were launched at the same time.

Example: Using a wrapper script to launch all 3 in the same second.

#!/bin/sh

nohup ./dirtracer -f ./dirtracer.config.1 &

nohup ./dirtracer -f ./dirtracer.config.2 &

nohup ./dirtracer -f ./dirtracer.config.3 &

As a result, DirTracer is launched three times in the same second; each run sees that there is no existing sub-databin inside /var, so only one sub-databin is created, i.e. /var/072309-01.

<snip>

* Using config file                     [./dirtracer.config.1]
* Using config file                     [./dirtracer.config.2]
* Using config file                     [./dirtracer.config.3]
* Dirtracer mainloop will run for     * Dirtracer mainloop will run for      * Dirtracer mainloop will run for

[100 sec. Between 1 & 2 min.]
[100 sec. Between 1 & 2 min.]
[100 sec. Between 1 & 2 min.]
* Databin parameter check 1             * Databin parameter check 1             * Databin parameter check 1

[success]
[success]
[success]
* Databin parameter check 2             * Databin parameter check 2             * Databin parameter check 2

[success]
[success]
[success]
* Databin Found                         [/var]
* Databin Found                         [/var]
* Databin Found                         [/var]
* Databin used is                       * Databin used is                       * Databin used is

[/var/072309-01]
[/var/072309-01]

[/var/072309-01]

<snip>

To ensure DirTracer runs properly and uses a separate sub-databin for each instance (as intended), the following workaround can be used.

#!/bin/sh

nohup ./dirtracer -f ./dirtracer.config.1 &

sleep 1

nohup ./dirtracer -f ./dirtracer.config.2 &

sleep 1

nohup ./dirtracer -f ./dirtracer.config.3 &

Giving DirTracer a 1 second delay between executions gives it enough time to notice there is an existing sub-databin in the intended path and increment the sub-databin number.

Example:

myzone root[/var/tmp/data]# ls -l
drwxr-xr-x  11 root     root          30 Jul 23 14:48 072309-01
drwxr-xr-x  11 root     root          30 Jul 23 14:48 072309-02
drwxr-xr-x  11 root     root          30 Jul 23 14:48 072309-03
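
If you have more than a handful of instances, the same staggering can be wrapped in a loop.  This is only a minimal sketch, assuming config files named dirtracer.config.1 through dirtracer.config.3 as above:

#!/bin/sh

for n in 1 2 3
do
    nohup ./dirtracer -f ./dirtracer.config.$n &
    sleep 1    # stagger the launches so each run creates its own sub-databin
done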

Hope this helps...

Cheers,

Lee


Friday Jun 26, 2009

Running DirTracer in the Background

Hello all,

I was approached today by a Partner about running DirTracer in the background.  I believe I have been asked about this once before.

I had not tested running DirTracer in the background until now, and it cannot simply be backgrounded as-is: I just tried it and saw nothing in the databin path, nor in the /tmp area for the runtime log.  I have found a way, however.

If I background it or nohup it directly I get no output -> (nohup ./dirtracer -f ./dirtracer.config &).

But if I create a stub script that launches DirTracer with nohup and & and do the same with the stub...it works.

myzone root[/opt/dirtracer]#cat backgrounder
#!/bin/sh

nohup ./dirtracer -f ./dirtracer.config &
myzone root[/opt/dirtracer]#nohup ./backgrounder &
[1] 9159
myzone root[/opt/dirtracer]#Sending output to nohup.out

[1]    Done                          ./backgrounder

In most circumstances (99% of the time) you should run DirTracer in the foreground because the runtime is generally short.  I can only see backgrounding the process when you use the MEM_LEAK_TRACKING option or when a DirTracer run is so long that you will have to close down your terminals and leave.

Thanks to Brian for this great question.


Lee

Tuesday Jun 10, 2008

Dirtracer + Core/Gcore = Separate upload for Pkg_app data.

Hello all,

Every now and then I run into an issue where a Customer used Dirtracer to grab a Gcore (or two) or a Core file from a crash, and then forgot to upload the separate pkg_app tar.gz file that was generated.

It is not a problem per se; it is simply users missing a critical message that is displayed at the end of a Dirtracer run when a successful Pkg_app file has been created.

Let's take a look at the messages I am talking about.

Suppose you define one of the following parameters in the dirtracer.config file:

CORE_LOCATION
GRAB_GCORE


Let's use GRAB_GCORE for a quick test; let's grab one Gcore.

GRAB_GCORE="1"


We then run Dirtracer as normal.

#./dirtracer -f ./dirtracer.config

--------------------------------------------------------------------------------
Sun Microsystems dirtracer 6.0.6 Solaris Sparc                        06/09/2008
--------------------------------------------------------------------------------


Dirtracer runs through its Main Loop and grabs one Gcore.

* Entering Main Performance Gathering Loop
*
* Loop 0 - 080609-131950
* Loop 1 - 080609-131956
* Loop 2 - 080609-132001
* Loop 3 - 080609-132007
* Loop 4 - 080609-132012
* togo[5sec]-timer[5]
*
* Grabbing gcore 080609-132018          [success]


Because Dirtracer has noted the presence of the grab gcore parameter, it will automatically run Pkg_app to gather the OS and DS libraries needed for debugging.

By default Dirtracer waits 120 seconds for Pkg_app to complete.  If for some reason the Core/Gcore's header has an overrun (its contents cannot be determined normally) or has other errors, such as being truncated, Dirtracer will kill the Pkg_app gathering phase.  Users should check the Pkg_app tar contents to see if all libs were gathered and take appropriate action.

* Packaging files
*   Preparing files - pkg_app           [waiting 120 sec.]      [success]


After the main Dirtracer tar file is created, it will display the Operations Complete message and inform the user which files need to be uploaded to Sun.

As you can see below, Dirtracer informs you there are both a Dirtracer file and a "separate" Pkg_app file that need to be uploaded.  Users should pay special attention to the End Messages displayed when Dirtracer finishes; most users miss the second file (Pkg_app) requiring a separate upload.


Operations Complete
--------------------------------------------------------------------------------
1) Dirtracer capture data located in directory [ /var/tmp/data/060908-04 ]

Upload "only" this file to your supportfiles.sun.com cores directory at Sun

        [ dirtracer-834d2699-mysystem-080609-132018.tar.gz ]

2) pkg_app has captured system libs as the result of a gcore or core being found
                As this file may be large, please ftp this separately to Sun
                pkg_app file located in /var/tmp/data/060908-04/pkg_app_data

                [pkg_app834d2699-mysystem-080609-132051.tar.Z ]


You might ask why Pkg_app's tar is simply not included in the main Dirtracer tar file.  The answer is simple...size.  Dirtracer captures by themselves can be big if a customer does not archive their log files on a heavily used system, and the final size can be massive if a Core/Gcore is captured.  It simply wasn't an option to include a big Pkg_app file with an already large Dirtracer file.

[LT]

Wednesday May 21, 2008

Checks, Intervals and Loops Oh My - Part 3

This is my final blog on the Number of Checks and Interval settings within Dirtracer.

Let's discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and two of its governing parameters.

        NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters are totally dependent on the problem type you are gathering for.

The following are the data points normally gathered in the main loop; all of these are configurable, so all, some or none may be gathered on any given run.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an Admin can set the Number of Checks or "data points" to gather.

The Number of Checks and Interval parameters govern how many loops the Main Loop runs and for how long; however, they are also used when executing the following commands.  These commands are launched before the Main Loop is engaged and run in the background while the Main Loop is active (a sketch of a typical invocation follows the list).

  • Iostats
  • Vmstats
  • Mpstats
  • Pms.sh
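
This is not Dirtracer's actual code; it is only a sketch of how interval/count driven stats commands are typically launched in the background (the output file names are made up for the example):

NUMBEROFCHECKS=5
INTERVAL=5
iostat -x $INTERVAL $NUMBEROFCHECKS > iostat.out 2>&1 &
vmstat $INTERVAL $NUMBEROFCHECKS > vmstat.out 2>&1 &
mpstat $INTERVAL $NUMBEROFCHECKS > mpstat.out 2>&1 &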

See the man pages for iostat, vmstat and mpstat for more information on these commands.

Please see my previous blog on pms.sh (Pmonitor) for more information regarding this script.


[LT]

Monday May 19, 2008

Checks, Intervals and Loops Oh My - Part 2

Let's discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and two of its governing parameters.

    NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters are totally dependent on the problem type you are gathering for.

The following are the data points normally gathered in the main loop; all of these are configurable, so all, some or none may be gathered on any given run.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an Admin can set the Number of Checks or "data points" to gather.  Let's look at some example problem types and the settings you may want to capture them with.

Example 4 - Crashing


The crashing option can be used when you know or suspect a crash will happen based on time or circumstances; see Configurator option 4.  Crash Tracking with Dirtracer is as simple as setting the runtime for a long period of time and enabling the following options.

    CRASH_TRACK="1"
    PMONITOR_ONLY="1"

The point of Crash Tracking is to place Dirtracer into a "wait and see" mode.  It will poll for the process id, note when the crash happens, ask the user whether the crash produced a core and where the core is located, and then gather the remaining data such as logs.

The Number of Checks and Interval settings need to be configured to allow Dirtracer enough time for a crash to happen; this can vary based on the problem.  All of the following are valid; configure your settings so the run extends past the time you expect the crash to occur.

Set Number of Checks to 1000 and Interval to 30.  Total Runtime 30,000 Seconds.
Set Number of Checks to 100 and Interval to 60.  Total Runtime 6,000 Seconds.
Set Number of Checks to 60 and Interval to 60.  Total Runtime 3,600 Seconds.
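
A quick way to sanity-check a combination before a run is plain Bourne shell arithmetic; the variable names below simply mirror the config parameters:

NUMBEROFCHECKS=1000
INTERVAL=30
expr $NUMBEROFCHECKS \* $INTERVAL    # prints 30000 (seconds)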

* Dirtracer mainloop will run for       [3600 sec. About 60 min.]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080519-095153                [pms.sh only]
* Loop 1 - 080519-095253                [pms.sh only]
* Loop 2 - 080519-095353                [pms.sh only]
* Loop 3 - 080519-095454                [pms.sh only]
* Loop 4 - 080519-095554                [pms.sh only]
* Loop 5 - 080519-095654                [pms.sh only]
* Loop 6 - 080519-095754                [pms.sh only]
* Loop 7 - 080519-095854                [pms.sh only]
* Loop 8 - 080519-095954                [pms.sh only]
* Loop 9 - 080519-100054                [pms.sh only]
* Loop 10 - 080519-100155               [pms.sh only]
* Loop 11 - 080519-100255               [pms.sh only]
* Loop 12 - 080519-100355               [pms.sh only]
* Loop 13 - 080519-100455               [pms.sh only]
* Loop 14 - 080519-100555               [pms.sh only]
* Loop 15 - 080519-100655               [pms.sh only]
* Loop 16 - 080519-100755               [pms.sh only]
* Loop 17 - 080519-100855               [pms.sh only]
* Loop 18 - 080519-100956               [pms.sh only]
* Loop 19 - 080519-101056               [pms.sh only]
* Loop 20 - 080519-101156               [pms.sh only]
* Loop 21 - 080519-101256               [pms.sh only]
* Loop 22 - 080519-101356               [pms.sh only]
* Loop 23 - 080519-101456               [pms.sh only]
* Loop 24 - 080519-101556               [pms.sh only]
* Loop 25 - 080519-101656               [pms.sh only]
* Loop 26 - 080519-101757               [pms.sh only]
expr: syntax errormer[47]
*
*       ALERT - The ns-slapd process has died!
*
* [ALERT 2] The ns-slapd process has died!
*
*
*
* Locating crash core file              [failed]
*
* Did the server produce a core file (y/n)? y
*
* What is the full path to the core?
*
* Example: /var/cores/core.10008 :      /var/tmp/cores/core.10008



Example 5 - Memory Leak


Using the Memory Leak option allows you to gather Pms.sh (Pmonitor) data over a long period of time to show the progression of the slapd process's memory usage.  Setting Memory Leak Tracking automatically sets the Interval to 1800, but the user needs to set the Number of Checks accordingly.

Enabling MEM_LEAK_TRACKING="1" sets the following automatically.

    INTERVAL="1800"
    PMONITOR_INTERVAL="5"

Set the Number of Checks to allow Dirtracer to run for the time you feel it takes the ns-slapd process to show a substantial amount of leakage.  Let's say you are facing a leak that can manifest itself in 24 hours.  You could set the Number of Checks to 50 and Dirtracer will capture for 1500 minutes (approx. 25 hours).  Use Configurator to set up the dirtracer.config file, or use the simple formula below.

    1800 (interval) * N = <Number of Seconds (total run time)>
                OR
    Minutes to leak * 60 / 1800 (Interval in seconds) = <NUMBEROFCHECKS>
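
As a worked example of the second form in the shell (MINUTES_TO_LEAK is just a name for this sketch, not a Dirtracer parameter):

INTERVAL=1800                                # set automatically by MEM_LEAK_TRACKING="1"
MINUTES_TO_LEAK=1440                         # leak manifests in roughly 24 hours
expr $MINUTES_TO_LEAK \* 60 / $INTERVAL      # prints 48; round up to 50 checks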

* Dirtracer mainloop will run for       [90000 sec. About 1500 min.]
...
* Mem Leak tracking                     [on]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080519-105751
* togo[90000sec]-timer[117]
...


I have eliminated the full run time information as it would be too long for a Blog post.


Example 6 - Server Down


Use of the server down option does not require the Interval and Number of Checks to be configured; in fact, they are ignored.

* Process State                         [server down, no process]
* Server down, pms.sh not needed
*
* Entering Main Performance Gathering Loop
*
*
* Server down, skipping main loop
*
*
* Exiting Main Performance Gathering Loop


Example 7 - Basic Capture


The Basic Capture is a simple 5 x 5 check.  By default the dirtracer.config file ships with the following Interval and Number of Checks set.  You can also enable the Basic Capture with Option 7 when using Configurator.

    NUMBEROFCHECKS="5"
    INTERVAL="5"

Example 8 - Config Only Capture


The Config Only Capture can be enabled using the CONFIG_ONLY="1" parameter or by using Option 8 in Configurator.  This sets up a config file for a 1 x 1 (1 loop) capture.

    NUMBEROFCHECKS="1"
    INTERVAL="1"

To be Continued...

Until next time

[LT]

Thursday May 15, 2008

Checks, Intervals and Loops Oh My - Part 1

Hello all!

Let's discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and two of its governing parameters.

        NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters are totally dependent on the problem type you are gathering for.

The following are the data points normally gathered in the main loop; all of these are configurable, so all, some or none may be gathered on any given run.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an Admin can set the Number of Checks or "data points" to gather.  Let's look at some example problem types and the settings you may want to capture them with.

Example 1 - Hung Processes (See Borked):


Most times an ns-slapd process is not actually hung but seems like it is.  A perceived hang could just be an ultra-busy process caused by a series of massive db searches, all worker threads taken and waiting on one to free a lock, or many other related issues.

Setting the Number of Checks and Interval correctly here can be critical.  Set them incorrectly and you may miss a data gathering opportunity.

Set Number of Checks to 5 and Interval to 5.  Total Runtime 25 Seconds.

This will gather 5 Pstacks/Prstats at 5 Second Intervals and can show if the process is changing over time, but it does not have the granularity to show each thread's progression through the stack.
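
In dirtracer.config terms this is simply:

    NUMBEROFCHECKS="5"
    INTERVAL="5"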

* Dirtracer mainloop will run for       [25 sec.]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080515-092006
* Loop 1 - 080515-092011
* Loop 2 - 080515-092017
* Loop 3 - 080515-092022
* Loop 4 - 080515-092028


Example 2 - High CPU or Performance Problems:


Like Example 1 we want to see the process stack and threads change over time.  But for a High CPU or Performance Problem we want to see things change second by second.  A better option for this problem type would be to set the Number of Checks to 25 and Interval to 1.

Set Number of Checks to 25 and Interval to 1.  Total Runtime 25 Seconds.

This will gather 25 Pstacks/Prstats at 1 Second Intervals.  This way we can see the process stack change with no gaps in the captures.  In Example 1 there were 5 second gaps between pstacks, and threads can change a huge amount on a very busy server in that timeframe.
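
In dirtracer.config terms:

    NUMBEROFCHECKS="25"
    INTERVAL="1"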

* Dirtracer mainloop will run for       [25 sec.]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080515-092554
* Loop 1 - 080515-092555
* Loop 2 - 080515-092557
* Loop 3 - 080515-092558
* Loop 4 - 080515-092600
* Loop 5 - 080515-092601
* Loop 6 - 080515-092603
* Loop 7 - 080515-092604
* Loop 8 - 080515-092605
* Loop 9 - 080515-092607
* Loop 10 - 080515-092608
* Loop 11 - 080515-092610
* Loop 12 - 080515-092611
* Loop 13 - 080515-092613
* Loop 14 - 080515-092614
* Loop 15 - 080515-092616
* Loop 16 - 080515-092617
* Loop 17 - 080515-092619
* Loop 18 - 080515-092620
* Loop 19 - 080515-092621
* Loop 20 - 080515-092623
* Loop 21 - 080515-092624
* Loop 22 - 080515-092626
* Loop 23 - 080515-092627
* Loop 24 - 080515-092629


Because I have increased the Number of Checks and decreased the Interval, the granularity is higher and I get 25 data points as opposed to 5 over the same time period.

Example 3 - Replication Problems:


The key to debugging a Replication Problem is Debug Logging over a period of time.  Setting the special PTS_CONFIG_LOGGING parameter to 8192 will allow Dirtracer to change the nsslapd-infolog-area logging value in the dse to 8192 (Replication Debug Logging).

PTS_CONFIG_LOGGING="8192"


Setting the Number of Checks or Interval for Granularity is not as important with Replication Problems as it is with Hangs or High CPU Problems.  The settings can vary and still achieve the same results.

Set Number of Checks to 40 and Interval to 30.  Total Runtime 20 Minutes.
Set Number of Checks to 10 and Interval to 120.  Total Runtime 20 Minutes.
Set Number of Checks to 5 and Interval to 240.  Total Runtime 20 Minutes.
Set Number of Checks to 1 and Interval to 1200.  Total Runtime 20 Minutes.
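
For example, the first combination above expressed in dirtracer.config form (a sketch; only PTS_CONFIG_LOGGING is specific to the Replication problem type):

    PTS_CONFIG_LOGGING="8192"
    NUMBEROFCHECKS="40"
    INTERVAL="30"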

* Dirtracer mainloop will run for       [1200 sec. About 20 min.]
...
* Logging level is being changed        [nsslapd-infolog-area]
*   current level:                      [0]
*   new level:                          [8192]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080515-100855
* Loop 1 - 080515-101256
* Loop 2 - 080515-101656
* Loop 3 - 080515-102057
* Loop 4 - 080515-102457
* togo[240sec]-timer[240]
*
* Exiting Main Performance Gathering Loop


There are five other problem types we can discuss, but let's save those for another Blog post.

  • Crashing
  • Memory Leak
  • Server Down
  • Basic Capture
  • Config Only Capture


To be Continued...

Ciao all!

[LT]

Tuesday May 13, 2008

Pkg_app and Dirtracer

Today I will revisit Pkg_app but will focus on its uses within Dirtracer.

Before Dirtracer 6.0.4 Customers who would use Dirtracer to gather cores and gcores would have to run Pkg_app manually after the fact.

Since version 6.0.4, Dirtracer has included Pkg_app in the <Dirtracer Install Path>/dirtracertools/ location, and with the Quiet (-q) switch in Pkg_app 2.7 I was able to embed Pkg_app within Dirtracer to run automatically.

If a Customer uses the following config file parameters Pkg_app will be launched automatically.

CORE_LOCATION="<full path to the core>" + SERVER_DOWN="1"

GRAB_GCORE="1" or GRAB_GCORE="2"

Here is an example run using the following config; I used 1 check and a 1 second interval for brevity.

NUMBEROFCHECKS="1"
INTERVAL="1"
GRAB_GCORE="1"

See the runtime-<date>-<time>.log:

As you see below, Dirtracer completes a quick one second loop, exits the Main Loop and grabs a Gcore.

<SNIP>
*   pms.sh interval(1) x checks(1)      [pms.sh run time (1 sec.)]
*
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080509-075120
*
* Grabbing gcore 080509-075122          [success]
</SNIP>

Once Dirtracer finishes with the Post Loop gathering, it executes Pkg_app to gather all the libraries and the ns-slapd binary.  Note the normal Pkg_app processing information is not seen because Pkg_app has been launched with the Quiet (-q) option.

<SNIP>
* Packaging files
*   Preparing files - pkg_app           [waiting 120 sec.]      [success]
</SNIP>

In Dirtracer 6.0.4, customers grabbing large cores/gcores with Dirtracer saw what they thought was a pkg_app hang.  It was likely that the core/gcore had overflowed the core header and Pkg_app could not process the file correctly.  As a result I created a timer function to monitor processes like Pkg_app.

If Pkg_app runs for more than 120 seconds, Dirtracer will "kill" the pkg_app process and alert the Customer that they need to run Pkg_app manually.

<SNIP>
* Packaging files
*   Preparing files - pkg_app           [killed]
</SNIP>
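
For the curious, the timer function boils down to a watchdog loop.  Here is a minimal sketch of the idea in plain Bourne shell; it is not Dirtracer's actual code, and long_running_command is just a placeholder for the pkg_app invocation:

TIMEOUT=120
long_running_command &                      # placeholder for the pkg_app run
CHILD=$!
SECS=0
while [ $SECS -lt $TIMEOUT ]
do
    kill -0 $CHILD 2>/dev/null || exit 0    # child finished in time
    sleep 1
    SECS=`expr $SECS + 1`
done
kill $CHILD                                 # over the limit: kill it and alert the user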

If Pkg_app was successful then it will present the Customer with the following message; see 2) below.

<SNIP>
1) Dirtracer capture data located in directory [ /var/tmp/data/051308-01 ]

Upload "only" this file to your supportfiles.sun.com cores directory at Sun

        [ dirtracer-834d2699-kaneda-080513-090202.tar.gz ]

2) pkg_app has captured system libs as the result of a gcore or core being found
                As this file may be large, please ftp this separately to Sun
                pkg_app file located in /var/tmp/data/051308-01/pkg_app_data

                [pkg_app834d2699-kaneda-080513-090347.tar.Z ]
</SNIP>

Currently Dirtracer does not give a major alert if Pkg_app was killed.  The customer should manually run Pkg_app or gather the libraries used by the process.

[LT]

Friday May 09, 2008

Pkg_app and its uses

Many Customers and Engineers may wonder what pkg_app is and how it is used.

Pkg_app is a shell script originally created by Rick Zagorski of Sun Microsystems in 2003.  Rick moved on to other projects and left Pkg_app without an author so I stepped in and thanks to Rick was allowed to continue its development.

Q: What is Pkg_app?

A: Pkg_app is a script that can take a core/gcore or process id and gather all the libraries required by Sun Support to debug a core/gcore file.

Q: Why are the libraries needed in this way?

A: There are a multitude of Solaris OE versions and even more patch versions; not all Customers can keep their servers 100% up to date.  Sun Support needs a server set up exactly like the Customer's environment in which to debug core files; core files and their respective processes pre-load the libraries from the system.

The quickest way to get this custom environment up and online is to just gather the libraries listed in the core or process.  Running Pkg_app and uploading the resulting file is a thousand times faster than getting a full Solaris Explorer, locating an available server, re-OS'ing the server to the Explorer specification and debugging the core.

Q: Why use Pkg_app?

A:  Pkg_app makes it easier to grab the library and process binary files than manually locating them all and copying them into one specific path.  The number of library files gathered depends on the libraries used by the process that created the core.  In a Directory Server's case, that is approx. 68 libraries on 5.2 servers and upwards of 82 libraries on 6.2.  Gathering these files manually could be a time consuming task for a Customer when time is of the essence.

Q: How do you use Pkg_app?

A:  Executing Pkg_app is easy; if it is run without any parameters it will display its usage.

As seen from the Usage, there are really only two parameters required to run Pkg_app.

 1) the full path, including the name, to the core file or the process id (pid).
 2) the path (path only) to the process binary that created the core or that belongs to the pid.

--------------------------------------------------------------------------------

* Sun Microsystems RSD pkg_app 2.7 Solaris                      [05/09/2008]

--------------------------------------------------------------------------------

pkg_app 2.7, a Sun Microsystems data gathering utility.

Usage:

  ./pkg_app [options] -c <core file | pid> -p <full path to process binary> -s [path to write tar file]


Required parameters:

 -c <core file OR pid of a running process>

 -p <full path to process binary> (ns-slapd, imapd, httpd etc.)


Optional parameters:

 -i (Include the core file with the final tar.gz)

 -q (Quiet)

 -d (Debug)

 -s <Storage; path to store the final tar file in>


usage:  ./pkg_app -c <name of the core file> -p <path to process binary>

usage:  ./pkg_app -c <pid of the running app> -p <path to process binary>


Examples: these are examples mixing various parameters


Directory Server

./pkg_app -i -c ./core.14740 -p /var/mps/ds52p4/bin/slapd/server/64


Messaging Server

./pkg_app -c ./core.3496 -p /opt/SUNWmsgsr/lib


Web Server

./pkg_app -c ./core.1092 -p /space/iws70/lib -s /var/crash


Calendar Server

./pkg_app -i -c ./core -p /opt/SUNWics5/cal/lib


Q: How do I know the process binary name?

A: You can use the "file" command to find the process binary name.

#file core.2894

core.2894:      ELF 64-bit MSB core file SPARCV9 Version 1, from 'ns-slapd'


Q: How do I find the path to the process binary?

A: You can use pldd or "find" in the process binary's install path.

1) Using pldd.  You can see the full path to ns-slapd is /opt/dsee62/ds6/lib/64

#pldd core.2894 | head -1

core 'core.2894' of 2894:       /opt/dsee62/ds6/lib/64/ns-slapd -D /opt/dsee62/var/dscc6/dcc/ads -i /o


2) Using find.  I know the server install path is /opt/dsee62 and the path to ns-slapd is /opt/dsee62/ds6/lib/sparcv9/  

#cd /opt/dsee62


root[/opt/dsee62]#find . -name ns-slapd -print

./ds6/lib/sparcv9/ns-slapd


root[/opt/dsee62]#cd ./ds6/lib/sparcv9


root[/opt/dsee62/ds6/lib/sparcv9]#pwd

/opt/dsee62/ds6/lib/sparcv9


Please note: the path seen in the pldd output and that of the "find" are the same; Directory Server uses a link here, with 64 pointing to sparcv9.  Diffing the two binaries (below) produces no output, confirming they are identical.

#diff /opt/dsee62/ds6/lib/64/ns-slapd /opt/dsee62/ds6/lib/sparcv9/ns-slapd



Example Testrun:

1) Locate a core file, in this case a gcore from a 6.2 Directory Server
2) Locate the process binary as per above.

#./pkg_app -c ./core.2894 -p /opt/dsee62/ds6/lib/sparcv9

--------------------------------------------------------------------------------

* Sun Microsystems RSD pkg_app 2.7 Solaris                      [05/09/2008]

--------------------------------------------------------------------------------

* OS release                            [5.10]

* Platform                              [SUNW,Sun-Blade-2500]

* Using core                            [/var/tmp/cores/core.2894]

* Process Root                          [/opt/dsee62/var/dscc6/dcc/ads]

* Process binary                        [ns-slapd]

* ns-slapd binary bit version           [64]

* Checking path to binary name          [success, path != binary name]

* Checking path is a directory          [success]

* Locating ns-slapd                     [success]

* Checking located ns-slapd is 64 bit   [success]

* Binary located                        [/opt/dsee62/ds6/lib/sparcv9/ns-slapd]

* Adding binary to pkg_app.pldd         [success]

* Grabbing pldd                         [success]

* Grabbing pstack                       [success]

* Grabbing pmap                         [success]

* Grabbing pcred                        [success]

* Grabbing pflags                       [success]

*

* Databin Used                          [/var/tmp/cores]

* Using hostid for naming .tar.gz       [837872d0]

* Writing file                          [pkg_app837872d0-s4u-2500a-brm04-080509-092119.tar.Z]

*

* Processing file 82 of 82

*

* Done gathering files

* Writing dbx files                     [success]

* Creating final tarfile                [success]

* Compressing tarfile                   [success]

*


Operations Complete


Upload this file to your supportfiles.sun.com Cores Directory at Sun

File located in directory .


                [ pkg_app837872d0-s4u-2500a-brm04-080509-092119.tar.Z ]


                                Thank you.

                                Sun Software Technology Service Center (STSC)


NOTES:

1) You can check for updates to this script here:

        BigAdmin - http://www.sun.com/bigadmin/scripts/indexSjs.html


2) Release Notes and Guides located here:

        Docs - http://docs.sun.com/app/docs/doc/820-0437


3) GDD information located here:

        Docs - http://www.sun.com/service/gdd/index.xml


4) Please send all Bugs and RFE's to the following address:

        Subject "pkg_app bug/rfe" - gdd-issue-tracker@sun.com


5) Please send all other questions etc to:

        Subject "pkg_app feedback" - gdd-feedback@sun.com


Other things to note are the Optional parameters.

Optional parameters:
 -i (Include the core file with the final tar.gz)
 -q (Quiet)
 -d (Debug)
 -s <Storage; path to store the final tar file in>

1)  -i (Include the core file with the final tar.gz)

If Sun Support does not yet have the core file, the Customer can include the core/gcore with the final Pkg_app tar.gz file.

2)  -q (Quiet)

This will run Pkg_app in a completely silent mode.  No STDOUT to the terminal.  Good for use within other applications or via a cron.

3)  -d (Debug)

The Debug feature is like Verbose.  It spits out more info on what Pkg_app is doing and the internal parameters as they are built.  This can help me debug pkg_app issues.

4)  -s <Storage; path to store the final tar file in>

The Storage switch allows the Customer to give Pkg_app a different path in which to store the resulting pkg_app tar.gz file; otherwise Pkg_app creates/stores the file in the current path.


This is all for today, but next I hope to show how Dirtracer uses Pkg_app.

[LT]

Wednesday May 07, 2008

Pmonitor, Pms.sh and the almighty ps command

Originally Stracer (the predecessor to Dirtracer) included a small shell script called Pmonitor, written by Ben Gooley, then a Directory Server Support Engineer.  Pmonitor (Process Monitor) was a lightweight script that used the "ps" command with a series of switches to retrieve the virtual size (vsz), resident set size (rss) and the ratio of cpu time used to cpu time available (pcpu).

man ps:
     vsz   The total size of the process in  virtual  memory,  in
           kilobytes.

     rss   The resident set size of the process, in kilobytes.

     pcpu  The ratio of CPU time used recently to CPU time avail-
           able  in  the  same period, expressed as a percentage.
           The  meaning  of  ``recently''  in  this  context   is
           unspecified.  The  CPU time available is determined in
           an unspecified manner.

Pmonitor was great for seeing an average cpu % busy (rudimentary load) as well as tracking the process's memory footprint over time.  This helps Sun Support Engineers shed light on a possible memory leak if the process size never shrinks.  It was a bit hard to see any real trend in just the raw data, so many Engineers plotted the data to see it visually over time.

#./pmonitor 22241 1 5
DATE   -  [TIME] ------- PID   VSZ   RSS   PCPU
05/06-[14:27:50] ------- 22241 86296 56360  0.0
05/06-[14:27:51] ------- 22241 86296 56360  0.0
05/06-[14:27:52] ------- 22241 86296 56360  0.0
05/06-[14:27:53] ------- 22241 86296 56360  0.0
05/06-[14:27:54] ------- 22241 86296 56360  0.0

In walks Mark Reynolds (creator of logconv.pl) with Pms.sh, an enhanced Pmonitor script which adds a "growth" calculation shown in kilobytes (k).

#./pms.sh 22241 1 5
DATE   -  [TIME] ------- PID   VSZ   RSS   PCPU
05/06-[14:28:20] ------- 22241 86296 56360  0.0
05/06-[14:28:21] ------- 22241 86296 56360  0.0    growth:   0 k
05/06-[14:28:22] ------- 22241 86296 56360  0.0    growth:   0 k
05/06-[14:28:23] ------- 22241 86296 56360  0.0    growth:   0 k
05/06-[14:28:24] ------- 22241 86296 56360  0.0    growth:   0 k

With Pms.sh we could now see the growth in the raw data without plotting it all the time.

04/24-[17:55:39] ------- 12489 5310368 5270584  2.3    growth:   0 k
04/24-[17:55:40] ------- 12489 5318432 5277048  5.6    growth:   8064 k
04/24-[17:55:42] ------- 12489 5319104 5277600  2.9    growth:   672 k
04/24-[17:55:56] ------- 12489 5319104 5277600  2.5    growth:   0 k

Not only could we see the growth we could also see when memory usage dropped.

04/24-[17:56:14] ------- 12489 5319104 5277600  1.0    growth:   0 k
04/24-[17:56:15] ------- 12489 5317560 5276424  3.5    growth:   -1544 k
04/24-[17:56:17] ------- 12489 5317560 5276424  5.4    growth:   0 k
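
For reference, the growth column is just the delta between successive VSZ samples.  Here is a minimal sketch of that idea in plain Bourne shell; it is not Mark's actual pms.sh, and it assumes a ps that supports -o vsz= (as Solaris does):

#!/bin/sh
# usage: ./growth.sh <pid> <interval> <checks>   (hypothetical script name)
PID=$1
INTERVAL=$2
CHECKS=$3
PREV=""
i=0
while [ $i -lt $CHECKS ]
do
    VSZ=`ps -o vsz= -p $PID`                # sample the process virtual size
    if [ -n "$PREV" ]; then
        echo "`date '+%m/%d-[%H:%M:%S]'` $PID vsz: $VSZ    growth: `expr $VSZ - $PREV` k"
    else
        echo "`date '+%m/%d-[%H:%M:%S]'` $PID vsz: $VSZ"
    fi
    PREV=$VSZ
    sleep $INTERVAL
    i=`expr $i + 1`
done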

I added Pms.sh 2.01 to the Dirtracer bundle starting with Dirtracer release 2.2, and now include Pms.sh 2.02 with Dirtracer 6.0.6.

Uses for Pms.sh:

1) Memory Leaks; see above.
2) Gauging high cpu problems.  Dirtracer has prstats but in some circumstances prstat is not usable or recommended.
3) Looking at the trends of both of these elements over a long period of time.

System Impact:

Negligible.

I tested pms.sh using pms.sh itself and top as gauges.

Top reported pms.sh using between .04% and .1% cpu on my Sunblade, and pms.sh itself shows its pcpu as 0.0.

23661 root       1  50    0 1104K  832K sleep   0:00  0.09% pms.sh
23817 root       1   0    0 1104K  832K sleep   0:00  0.04% pms.sh

# ps -aef | grep pms.sh
    root 23661 21979  0 14:48:38 pts/3    0:00 /bin/sh ./pms.sh 22241 1 1000000

#./pms.sh 23661 1 1000
DATE   -  [TIME] ------- PID   VSZ   RSS   PCPU
05/06-[14:48:53] ------- 23661 1104  832  0.0
05/06-[14:48:54] ------- 23661 1104  832  0.0    growth:   0 k
05/06-[14:48:55] ------- 23661 1104  832  0.0    growth:   0 k
05/06-[14:48:57] ------- 23661 1104  832  0.0    growth:   0 k

Dirtracer + Pms.sh:

By default Dirtracer will run Pms.sh with the same NUMBEROFCHECKS and INTERVAL as seen in the dirtracer.config file.  If the Dirtracer mainloop is configured to run for 5 Checks at 5 Second Intervals then Pms.sh will do the same.

*   pms.sh found                        [/export/home/hrdwired/PTS/dirtracertools/pms.sh]
* Access/Error Logging                  [left as is]
* Audit Logging                         [left as is]
* Iostat available.     Executing...    [success]
* Vmstat available.     Executing...    [success]
* Mpstat available.     Executing...    [success]
* pms.sh (pmonitor) executed            [success]
*   pms.sh interval(5) x checks(5)      [pms.sh run time (25 sec.)]
*
* Entering Main Performance Gathering Loop

Config parameters used with Pms.sh:

PMONITOR_INTERVAL:  The Pmonitor Interval allows you to set a smaller Interval than Dirtracer is running with.

If Dirtracer's mainloop is configured to run at 30 second INTERVALs but you want to see Pms.sh output at 1 second intervals, set PMONITOR_INTERVAL="1".
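
For instance (the check count here is arbitrary, chosen just for the example):

    NUMBEROFCHECKS="10"
    INTERVAL="30"            # Main Loop samples every 30 seconds
    PMONITOR_INTERVAL="1"    # ...but pms.sh samples every second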

PMONITOR_ONLY:  The Pmonitor Only parameter disables the Main Loop's normal data capture points such as pstacks, prstats, cn=monitor and cache size searches.  This allows the user to gather a long term Pms.sh data set without the overhead of tons of Pstacks/Prstats, Ldap Searches etc.

I hope this gives you a detailed view of Pms.sh (Pmonitor) and how it is used within Dirtracer and Sun Support.

[LT]

Monday May 05, 2008

Bork Bork Bork

The BORKED parameter.


I realize most of our international users of Dirtracer may not get the reference, but the BORKED parameter got its name in part from the US TV show The Muppet Show; something I watched as a child.  A show made up entirely of puppets (Muppets) whose main character is Kermit the Frog.

The BORKED parameter is a reference to the Swedish Chef's special Bork-speak...a parody of a wacky chef who speaks in unintelligible mock Swedish.  In certain tech circles "Borked" is synonymous with "Broken", and when developing this setting's purpose the name just stuck.

http://en.wikipedia.org/wiki/The_Muppet_Show
http://en.wikipedia.org/wiki/Bork_bork_bork

Think of Borked as "Hung" when it comes to its use with Dirtracer.  If a Directory Server process is thought to be hung, then set the BORKED parameter to 1.

    BORKED="1"

Note:  I plan to rename the parameter to HUNG (Config File only) in the next release.

What does BORKED="1" do?

Normally Dirtracer will run a set of the following searches and/or modifies; if Borked is set to 1 (on), these searches etc. are skipped.  Setting Borked to 1 helps make sure Dirtracer itself doesn't hang waiting on these ldapsearches to return.  If Borked is not set to 1 when the Directory Server is suspected of being hung, Dirtracer will not complete its data gathering.  (A concrete example of this kind of search follows the list below.)

Searches:
    Backend Suffix names; naming contexts
    Backend Database names
    cn=monitor
    cache info searches
    nsds50ruv; replica ruv's
    cn=config info
    rootdse
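
To make the list concrete, here is a hand-written example of a cn=monitor probe of the kind skipped when BORKED="1"; host, port and password are placeholders, and this is not Dirtracer's exact command line:

ldapsearch -h localhost -p 389 -D "cn=Directory Manager" -w password \
    -b "cn=monitor" -s base "(objectclass=*)"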

Modifies; only completed if Dirtracer is configured to do so.

    PTS_CONFIG_LOGGING can set the server logging level to the configured parameter value.  This sets nsslapd-infolog-area or nsslapd-errorlog-level.

    Examples:
        nsslapd-infolog-area: 4 sets Heavy Trace Logging
        nsslapd-infolog-area: 128 sets ACI debugging
        nsslapd-infolog-area: 8192  sets Replication Debug Logging

    See the following link for more info on Logging Levels
        http://docs.sun.com/source/816-6699-10/confattr.html#15873

Dirtracer can also turn logging On or Off (rarely used).

    TURN_LOGGING_ON="0"             # Turn On access/error logs
    TURN_LOGGING_OFF="0"            # Turn Off access/error logs
    AUDIT_LOGGING_ON="0"            # Turn On audit logs
    AUDIT_LOGGING_OFF="0"           # Turn Off audit logs

As mentioned, Dirtracer only completes the above ldapmodify operations if configured to do so.

[LT]

Thursday May 01, 2008

The DATABIN parameter

I had a question from a Front Line Engineer recently where they did not understand how to select a proper location for the DATABIN parameter.

DATABIN="<DATA OUTPUT PATH>"    # Databin main path.
                                # Sub dirs will be created beneath this path.

The DATABIN is the path where you want Dirtracer to store the data it captures.  Special care should be taken when selecting the right path based on the projected size of the data you need to gather.

Sun GDD Directory Dirtracer Reference Guide: Page 17

Disk Usage

Disk space used is almost entirely dependent on the following.

1. How Dirtracer is configured; i.e. what it is asked to gather.
2. How many loops Dirtracer is configured to complete.
  •     cache, monitor searches
  •     netstats, iostats, pstacks, prstats
  •     transaction log ls -l captures.
3. How many access/error and audit logs are captured.  
  •     configured from the GATHER_N_XXXXX_LOGS="N" parameters.
4. How big each of those logs are.  (var/adm/messages logs included)
5. Shared Memory (MMAP) files
  •     how big the ns-slapd process size is.
6. Cores
  •     how big the ns-slapd process size is.
7. Gcores
  •     how big the ns-slapd process size is.
8. If Dirtracer has REMOVE_TEMP_DATA=0.
  •     saves all temp files in addition to the final tar file.
9. If Dirtracer has SKIP_TAR_GZIP=1.
  •     Skips the final tar/gz, saving half the space it normally uses; i.e. duplication of files occurs as files are tarred and gzipped.


The Engineer had also been asked to set up two dirtracer.config files to trace two separate slapd instances at the same time, on the same system.  Would the DATABIN parameters need to be different?  No.

Early on I saw a problem with Stracer (the old Dirtracer) when customers would use the same DATABIN to store data: the previously captured files would be overwritten, or you would have multiples of the same files.

I solved this issue by having Dirtracer create a unique time/date based directory under the defined DATABIN path.  Even if Dirtracer is run multiple times on the same system a new sub databin is created to segregate the data.

Example:

1) Set the DATABIN as follows.

DATABIN="/var/tmp/data"

2) Run Dirtracer 3 times and observe the directories created in /var/tmp/data/

root[/var/tmp/data]#ls -l
total 12
drwxr-xr-x  11 root     other       1536 Apr 21 15:38 042108-01
drwxr-xr-x  11 root     other       1536 Apr 21 15:42 042108-02
drwxr-xr-x  11 root     other       1536 Apr 21 15:43 042108-03

root[/var/tmp/data]#find . -name "dirtracer-*gz" -print
./042108-01/dirtracer-834d2699-kaneda-080421-153743.tar.gz
./042108-02/dirtracer-834d2699-kaneda-080421-154144.tar.gz
./042108-03/dirtracer-834d2699-kaneda-080421-154335.tar.gz

You can clearly see how the data is separated and should not collide.

[LT]

Configurator, the dirtracer.config.template and their uses.

I was recently asked what the differences are between the dirtracer.config.template and the Configurator script and how they are used.

The previous version of my script, Stracer, used both a config file and a full range of command line switches.  The command line switches confused many, and the config file then was not well documented.  As a result we had many Dirtracers configured to capture the wrong type of data for the problem type.

Shortly after, I decided to create the "Configurator" and released it with Stracer 1.9.3.  Configurator took the Problem Type encountered by the Customer and translated it into a working dirtracer.config file.  Originally Configurator contained 7 problem type options.  With Configurator 6.0.6 I have added Option 8 for a Configuration Only Capture.

--------------------------------------------------------------------------------
Sun Microsystems Configurator 6.0.6                                       
--------------------------------------------------------------------------------
Please choose the type of problem you are experiencing

Process Hung                            [1]
High CPU                                [2]
Replication                             [3]
Crashing                                [4]
Memory Leak                             [5]
Server Down                             [*]     DISABLED - (SLAPDPID is set)
Basic Capture                           [7]
Config Only Capture                     [8]
--------------------------------------------





NOTE: Now that the documentation for Dirtracer has progressed to this point, I may have to add a full section for Configurator, even though it is interactive and self-explanatory.

Configurator takes you through the following sections to create a dirtracer.config file:

1) Case Number (if available)
2) Slapd Instance selection.
3) Directory Manager Password entry
4) Data Storage location.  This is the location of the DATABIN parameter where all captured data will be stored.
5) Skip Tar Gzip question
6) Problem Type selection.
    a) Process Hung. Hang detection, Gcore selection
    b) High CPU. CPU % threshold level, Gcore selection
    c) Replication.  Sets replication debug logging (8192)
    d) Crashing.
    e) Memory Leak.
    f) Server Down. DS version [5x|6x], Instance path entry.
    g) Basic Capture
    h) Config Only Capture
7) DS Log capture selection; access, error and audit logs.
8) Dirtracer Runtime selection.
9) Pmonitor (pms.sh) Runtime selection.
10) Configuration Summary
11) Data Capture Size guesstimation.
12) Config file (dirtracer.config) creation.

The Configurator is a good way for those new to Dirtracer to quickly set up a dirtracer.config file for an event.

So what is the difference between the Configurator and the dirtracer.config.template?  Well, Configurator asks questions to set up a ready-to-use dirtracer.config.  The dirtracer.config.template is just that...a template.  It contains all the parameters that would be set when creating a new dirtracer.config using the Configurator, but it has to be edited by hand in order to be used with Dirtracer and has no presets for Problem Types.

Without the following parameters properly set, Dirtracer will exit and alert the admin that the file needs to be changed.  Likewise, the template ships with some default settings.

SLAPDPID="<SLAPD PID>"          # Slapd pid number
MGRPW="<PASSWORD>"              # Mgr password
DATABIN="<DATA OUTPUT PATH>"    # Databin main path.

The template can be copied, renamed and edited to contain different parameter settings for the same problem types seen above.  The dirtracer.config.template is completely self-documented, so administrators can quickly look at a parameter and decide whether or not to use it.
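
A hypothetical workflow (the copy's file name is just an example):

cp dirtracer.config.template dirtracer.config.instance1
vi dirtracer.config.instance1    # set SLAPDPID, MGRPW and DATABIN at minimum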

Hope this was useful.

[LT]

About

A Tech Blog about the Sun Java Systems Dirtracer Toolkit. Dirtracer and this blog written and maintained by Lee Trujillo, an Oracle Senior Principal Support Engineer.
