Wednesday May 21, 2008

Checks, Intervals and Loops Oh My - Part 3

This is my final blog on the Number of Checks and Interval settings within Dirtracer.

Let's discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and two of its governing parameters.

        NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters depend entirely on the type of problem you are gathering data for.

The following are the data points normally gathered in the main loop.  These are all configurable, so you can gather all, some or none of them.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an Admin can set the Number of Checks, or "data points", to gather.  Let's look at some example problem types versus the settings you may want to capture them with.

The Number of Checks and Interval parameters govern how many loops the Main Loop runs and for how long.  However, they are also used when executing the following commands.  These commands are launched before the Main Loop is engaged and run in the background while the Main Loop is active.

  • Iostats
  • Vmstats
  • Mpstats
  • Pms.sh

See the man pages for iostat, vmstat and mpstat for more information on these commands.

Please see my previous blog on pms.sh (Pmonitor) for more information regarding this script.
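
As a rough illustration of how the two settings might be handed to the system tools above, the sketch below builds command lines in the common `<command> <interval> <count>` form that iostat, vmstat and mpstat accept.  The helper name `background_commands` is hypothetical, and the mapping of Interval and Number of Checks onto those two arguments is an assumption; the exact options Dirtracer passes are not shown in this post.

```python
def background_commands(number_of_checks, interval):
    """Build argv lists of the form [tool, interval, count].

    Hypothetical helper: Dirtracer's real invocations (and any extra
    flags it adds) are not shown in the post.
    """
    tools = ["iostat", "vmstat", "mpstat"]
    return [[tool, str(interval), str(number_of_checks)] for tool in tools]


if __name__ == "__main__":
    # Print the command lines for the default 5 x 5 capture.
    for argv in background_commands(number_of_checks=5, interval=5):
        print(" ".join(argv))
```

The commands are only built here, not executed, so the sketch can be read and tried on any platform.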


[LT]

Monday May 19, 2008

Checks, Intervals and Loops Oh My - Part 2

Let's discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and two of its governing parameters.

    NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters depend entirely on the type of problem you are gathering data for.

The following are the data points normally gathered in the main loop.  These are all configurable, so you can gather all, some or none of them.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an Admin can set the Number of Checks, or "data points", to gather.  Let's look at some example problem types versus the settings you may want to capture them with.

Example 4 - Crashing


The crashing option can be used when you know or suspect a crash will happen based on time or circumstances; see Configurator option 4.  Crash Tracking with Dirtracer is as simple as setting the runtime to a long period of time and enabling the following options.

    CRASH_TRACK="1"
    PMONITOR_ONLY="1"

The point of Crash Tracking is to place Dirtracer into a "wait and see" mode.  It will poll for the process id, note when the crash happens, ask the user whether the crash produced a core file and where the core is located, and then gather the remaining data such as logs.

The Number of Checks and Interval settings need to be configured to give Dirtracer enough time for a crash to happen; this can vary based on the problem.  All of the following are valid; configure your settings so that Dirtracer runs past the time you expect the crash to occur.

Set Number of Checks to 1000 and Interval to 30.  Total Runtime 30,000 Seconds.
Set Number of Checks to 100 and Interval to 60.  Total Runtime 6,000 Seconds.
Set Number of Checks to 60 and Interval to 60.  Total Runtime 3,600 Seconds.
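
The "wait and see" polling described above can be pictured roughly as follows.  This is only a sketch in Python, not Dirtracer's own code: the function names are hypothetical, and the assumption is simply that the process id is checked once per loop until the process disappears or the checks run out.

```python
import os
import time


def process_alive(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_crash(pid, number_of_checks, interval,
                   is_alive=process_alive, sleep=time.sleep):
    """Poll once per loop.

    Returns the loop index at which the process was found dead, or
    None if it outlived every check.
    """
    for loop in range(number_of_checks):
        if not is_alive(pid):
            return loop
        sleep(interval)
    return None
```

With 60 checks at a 60 second interval, a call such as `wait_for_crash(pid, 60, 60)` would watch the process for about an hour, which is why the settings need to reach past the time you expect the crash to occur.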

* Dirtracer mainloop will run for       [3600 sec. About 60 min.]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080519-095153                [pms.sh only]
* Loop 1 - 080519-095253                [pms.sh only]
* Loop 2 - 080519-095353                [pms.sh only]
* Loop 3 - 080519-095454                [pms.sh only]
* Loop 4 - 080519-095554                [pms.sh only]
* Loop 5 - 080519-095654                [pms.sh only]
* Loop 6 - 080519-095754                [pms.sh only]
* Loop 7 - 080519-095854                [pms.sh only]
* Loop 8 - 080519-095954                [pms.sh only]
* Loop 9 - 080519-100054                [pms.sh only]
* Loop 10 - 080519-100155               [pms.sh only]
* Loop 11 - 080519-100255               [pms.sh only]
* Loop 12 - 080519-100355               [pms.sh only]
* Loop 13 - 080519-100455               [pms.sh only]
* Loop 14 - 080519-100555               [pms.sh only]
* Loop 15 - 080519-100655               [pms.sh only]
* Loop 16 - 080519-100755               [pms.sh only]
* Loop 17 - 080519-100855               [pms.sh only]
* Loop 18 - 080519-100956               [pms.sh only]
* Loop 19 - 080519-101056               [pms.sh only]
* Loop 20 - 080519-101156               [pms.sh only]
* Loop 21 - 080519-101256               [pms.sh only]
* Loop 22 - 080519-101356               [pms.sh only]
* Loop 23 - 080519-101456               [pms.sh only]
* Loop 24 - 080519-101556               [pms.sh only]
* Loop 25 - 080519-101656               [pms.sh only]
* Loop 26 - 080519-101757               [pms.sh only]
expr: syntax errormer[47]
*
*       ALERT - The ns-slapd process has died!
*
* [ALERT 2] The ns-slapd process has died!
*
*
*
* Locating crash core file              [failed]
*
* Did the server produce a core file (y/n)? y
*
* What is the full path to the core?
*
* Example: /var/cores/core.10008 :      /var/tmp/cores/core.10008



Example 5 - Memory Leak


Using the Memory Leak option will allow you to gather up front Pms.sh (Pmonitor) data over a long period of time to show the progression of memory usage of the slapd process.  Setting Memory Leak Tracking automatically sets the Interval to 1800 but the user needs to set the Number of Checks accordingly.

Enabling MEM_LEAK_TRACKING="1" sets the following automatically.

    INTERVAL="1800"
    PMONITOR_INTERVAL="5"

Set the Number of Checks to allow Dirtracer to run for the time you feel it takes the ns-slapd process to show a substantial amount of leakage.  Let's say you are facing a leak that can manifest itself in 24 hours.  You could set the Number of Checks to 50 and Dirtracer will capture for 1500 minutes (approx. 25 hours).  Use Configurator to set up the dirtracer.config file, or use the simple formula below.

    1800 (interval) * N = <Number of Seconds (total run time)>
                OR
    Minutes to leak * 60 / 1800 (Interval in seconds) = <NUMBEROFCHECKS>
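
The formula above can be checked with a few lines of Python.  This is a small sketch rather than anything Dirtracer ships: the function names are made up for illustration, and rounding the result up to a whole number of checks is an added assumption so that the run always covers the full leak window.

```python
import math

MEM_LEAK_INTERVAL = 1800  # seconds; set automatically by MEM_LEAK_TRACKING="1"


def checks_for_leak(minutes_to_leak, interval=MEM_LEAK_INTERVAL):
    """Smallest NUMBEROFCHECKS whose total runtime covers minutes_to_leak."""
    return math.ceil(minutes_to_leak * 60 / interval)


def runtime_minutes(number_of_checks, interval=MEM_LEAK_INTERVAL):
    """Total main-loop runtime in whole minutes."""
    return number_of_checks * interval // 60


print(checks_for_leak(24 * 60))  # 48 checks cover exactly 24 hours
print(runtime_minutes(50))       # 1500 minutes, about 25 hours
```

A 24 hour leak needs at least 48 checks; choosing 50, as in the example, leaves an extra hour of margin.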

* Dirtracer mainloop will run for       [90000 sec. About 1500 min.]
...
* Mem Leak tracking                     [on]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080519-105751
* togo[90000sec]-timer[117]
...


I have omitted the full run time output as it would be too long for a Blog post.


Example 6 - Server Down


The server down option does not require the Interval and Number of Checks to be configured; in fact, they are ignored.

* Process State                         [server down, no process]
* Server down, pms.sh not needed
*
* Entering Main Performance Gathering Loop
*
*
* Server down, skipping main loop
*
*
* Exiting Main Performance Gathering Loop


Example 7 - Basic Capture


The Basic Capture is a simple 5 x 5 check.  By default the dirtracer.config file is shipped with the following Interval and Number of Checks set.  You can also enable the Basic Capture with Option 7 when using Configurator.

    NUMBEROFCHECKS="5"
    INTERVAL="5"
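
To see what these two values mean for a run, the sketch below reads settings written in the KEY="value" style shown above and multiplies them out.  It assumes the config consists of simple KEY="value" lines with optional # comments; the real dirtracer.config may contain more than this, and `parse_config` is a hypothetical helper, not part of Dirtracer.

```python
def parse_config(text):
    """Parse KEY="value" lines, ignoring blank lines and # comments.

    Simplification: a '#' inside a quoted value would be treated as
    the start of a comment.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


sample = '''
NUMBEROFCHECKS="5"    # Number of checks: total number of loops
INTERVAL="5"          # Interval: seconds between loops
'''

config = parse_config(sample)
total = int(config["NUMBEROFCHECKS"]) * int(config["INTERVAL"])
print(total)  # 25 seconds for the default 5 x 5 capture
```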

Example 8 - Config Only Capture


The Config Only Capture can be enabled using the CONFIG_ONLY="1" parameter or Option 8 in Configurator.  This sets up a config file for a 1 x 1 (1 loop) capture.

    NUMBEROFCHECKS="1"
    INTERVAL="1"

To be Continued...

Hasta la vez próxima

[LT]

Thursday May 15, 2008

Checks, Intervals and Loops Oh My - Part 1

Hello all!

Let's discuss one of the most basic elements of Dirtracer...the Main Performance Gathering Loop and two of its governing parameters.

        NUMBEROFCHECKS and INTERVAL

Definitions:

    NUMBEROFCHECKS    # Number of checks: total number of loops
    INTERVAL            # Interval: seconds between loops

These two parameters tell Dirtracer how long to run and how many Main Loop data points to gather.  The settings for these two parameters depend entirely on the type of problem you are gathering data for.

The following are the data points normally gathered in the main loop.  These are all configurable, so you can gather all, some or none of them.

  • netstat info
  • ls -la from the transaction logs
  • pstacks
  • prstats
  • cn=monitor searches
  • main cache searches; 1 search per loop.
  • db cache searches; 1 search per backend per loop
  • gcore(s)


As mentioned above, an Admin can set the Number of Checks, or "data points", to gather.  Let's look at some example problem types versus the settings you may want to capture them with.

Example 1 - Hung Processes (See Borked):


Most of the time an ns-slapd process is not actually hung but only seems to be.  A perceived hung process could just be an ultra busy process, caused by a series of massive db searches, all worker threads being taken and waiting on one to free a lock, or many other related issues.

Setting the Number of Checks and Interval correctly here can be critical.  Set them incorrectly and you may miss a data gathering opportunity.

Set Number of Checks to 5 and Interval to 5.  Total Runtime 25 Seconds.

This will gather 5 Pstacks/Prstats at 5 Second Intervals and can show if the process is changing over time, but does not have the granularity to show each thread's progression through the stack.

* Dirtracer mainloop will run for       [25 sec.]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080515-092006
* Loop 1 - 080515-092011
* Loop 2 - 080515-092017
* Loop 3 - 080515-092022
* Loop 4 - 080515-092028
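
The loop lines in this output can be turned into actual gaps between captures.  The sketch below assumes the stamp after each "Loop N -" is in YYMMDD-HHMMSS form, which matches the date of this post but is not documented here; `loop_gaps` is a hypothetical helper.

```python
from datetime import datetime


def loop_gaps(log_lines):
    """Return the seconds between consecutive 'Loop N - <stamp>' lines."""
    stamps = []
    for line in log_lines:
        parts = line.split()
        # Expected shape: ['*', 'Loop', '0', '-', '080515-092006', ...]
        if len(parts) >= 5 and parts[1] == "Loop":
            stamps.append(datetime.strptime(parts[4], "%y%m%d-%H%M%S"))
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


log = [
    "* Loop 0 - 080515-092006",
    "* Loop 1 - 080515-092011",
    "* Loop 2 - 080515-092017",
    "* Loop 3 - 080515-092022",
    "* Loop 4 - 080515-092028",
]

print(loop_gaps(log))  # [5, 6, 5, 6]
```

The gaps come out at 5 to 6 seconds, so each loop takes slightly longer than the configured Interval alone.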


Example 2 - High CPU or Performance Problems:


Like Example 1 we want to see the process stack and threads change over time.  But for a High CPU or Performance Problem we want to see things change second by second.  A better option for this problem type would be to set the Number of Checks to 25 and Interval to 1.

Set Number of Checks to 25 and Interval to 1.  Total Runtime 25 Seconds.

This will gather 25 Pstacks/Prstats at 1 Second Intervals.  This way we can see the process stack change with no gaps in the captures.  In Example 1 there were 5 Second gaps between pstacks, and on a very busy server the threads can change a huge amount in that timeframe.

* Dirtracer mainloop will run for       [25 sec.]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080515-092554
* Loop 1 - 080515-092555
* Loop 2 - 080515-092557
* Loop 3 - 080515-092558
* Loop 4 - 080515-092600
* Loop 5 - 080515-092601
* Loop 6 - 080515-092603
* Loop 7 - 080515-092604
* Loop 8 - 080515-092605
* Loop 9 - 080515-092607
* Loop 10 - 080515-092608
* Loop 11 - 080515-092610
* Loop 12 - 080515-092611
* Loop 13 - 080515-092613
* Loop 14 - 080515-092614
* Loop 15 - 080515-092616
* Loop 16 - 080515-092617
* Loop 17 - 080515-092619
* Loop 18 - 080515-092620
* Loop 19 - 080515-092621
* Loop 20 - 080515-092623
* Loop 21 - 080515-092624
* Loop 22 - 080515-092626
* Loop 23 - 080515-092627
* Loop 24 - 080515-092629


Because I have increased the Number of Checks and decreased the Interval, the granularity is higher and I get 25 data points as opposed to 5 over the same configured runtime.
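
The trade-off between the two examples is simple arithmetic, sketched here in Python.  The function name `capture_plan` is made up for illustration, and the runtime is the configured value (Number of Checks times Interval), not a measured one.

```python
def capture_plan(number_of_checks, interval):
    """Return (configured runtime in seconds, number of data points)."""
    return number_of_checks * interval, number_of_checks


for checks, interval in [(5, 5), (25, 1)]:
    runtime, points = capture_plan(checks, interval)
    print(f"{checks} x {interval}: {runtime} sec, {points} data points")
```

Both settings configure a 25 second run, but the second yields five times as many Pstacks/Prstats.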

Example 3 - Replication Problems:


The key to debugging a Replication Problem is Debug Logging over a period of time.  Setting the special PTS_CONFIG_LOGGING parameter to 8192 will allow Dirtracer to change the nsslapd-infolog-area logging value in the dse to 8192 (Replication Debug Logging).

PTS_CONFIG_LOGGING="8192"


Setting the Number of Checks or Interval for Granularity is not as important with Replication Problems as it is with Hangs or High CPU Problems.  The settings can vary and still achieve the same results.

Set Number of Checks to 40 and Interval to 30.  Total Runtime 20 Minutes.
Set Number of Checks to 10 and Interval to 120.  Total Runtime 20 Minutes.
Set Number of Checks to 5 and Interval to 240.  Total Runtime 20 Minutes.
Set Number of Checks to 1 and Interval to 1200.  Total Runtime 20 Minutes.
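
The four combinations above all multiply out to the same 1200 seconds.  As a small sketch, the hypothetical helper below lists the Number of Checks that goes with each candidate Interval for a target runtime; it is an illustration only and not part of Dirtracer or Configurator.

```python
def settings_for_runtime(total_seconds, intervals):
    """Return (NUMBEROFCHECKS, INTERVAL) pairs whose product is total_seconds.

    Intervals that do not divide the runtime evenly are skipped.
    """
    return [(total_seconds // i, i) for i in intervals if total_seconds % i == 0]


print(settings_for_runtime(1200, [30, 120, 240, 1200]))
# [(40, 30), (10, 120), (5, 240), (1, 1200)]
```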

* Dirtracer mainloop will run for       [1200 sec. About 20 min.]
...
* Logging level is being changed        [nsslapd-infolog-area]
*   current level:                      [0]
*   new level:                          [8192]
...
* Entering Main Performance Gathering Loop
*
* Loop 0 - 080515-100855
* Loop 1 - 080515-101256
* Loop 2 - 080515-101656
* Loop 3 - 080515-102057
* Loop 4 - 080515-102457
* togo[240sec]-timer[240]
*
* Exiting Main Performance Gathering Loop


There are five other problem types we can discuss, but let's save that for another Blog post.

  • Crashing
  • Memory Leak
  • Server Down
  • Basic Capture
  • Config Only Capture


To be Continued...

Ciao all!

[LT]

About

A Tech Blog about the Sun Java Systems Dirtracer Toolkit. Dirtracer and this blog are written and maintained by Lee Trujillo, an Oracle Senior Principal Support Engineer.
