Enabling Debugging Output
By templedf on Apr 13, 2007
Since it came up recently on the Grid Engine mailing list, let's talk about how to get debugging output from Grid Engine. It's often useful to know what's going on behind the scenes. For example, I occasionally run into a problem where the qmaster will crash immediately after starting. Turning on debugging output enables me to see exactly why it's crashing. (Usually it's something dumb that I did.)
The first step to turning on the debugging output is to source the $SGE_ROOT/util/dl.sh file (for Bourne-style shells) or the $SGE_ROOT/util/dl.csh file (for csh and tcsh). Once you've sourced one of these files, you can set the debug level using the dl command. dl takes one argument, the debug level. There are 10 preconfigured debug levels. Each level represents a combination of layers and classes.
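As a minimal sketch of that first step for a Bourne-style shell: the helper name enable_sge_debug is my own invention for illustration; only the dl.sh file and the dl command come from Grid Engine. It assumes SGE_ROOT points at your installation and simply refuses to do anything if the script can't be read.

```shell
# Hypothetical helper (not part of Grid Engine): source dl.sh and set a level.
# Assumes SGE_ROOT points at a Grid Engine installation.
enable_sge_debug() {
    level="${1:-1}"   # default to debug level 1 if no argument is given
    if [ -z "${SGE_ROOT:-}" ] || [ ! -r "$SGE_ROOT/util/dl.sh" ]; then
        echo "cannot read \$SGE_ROOT/util/dl.sh" >&2
        return 1
    fi
    . "$SGE_ROOT/util/dl.sh"   # defines the dl command in the current shell
    dl "$level"                # hand the requested level to dl
}
```

Calling enable_sge_debug 3 in the shell you're about to start a daemon from would then be equivalent to sourcing the file and running dl 3 by hand.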
There are 8 debug layers in Grid Engine. They are:
- Top -- this is where general debugging information lives
- CULL -- debugging information specifically related to the Common User Linked List routines
- Basis -- I think this is intended for utility operation debugging information; currently only the JGDI makes much use of it
- GUI -- debugging information from qmon
- Unused -- not surprisingly, it's unused
- Commd -- essentially unused; this was for the commd from 5.3, but the commd replacement in 6.0, the comm lib, has its own multi-threaded logging facility
- GDI -- debugging information specifically related to the Grid Database Interface, the protocol that the qmaster speaks
- Pack -- debugging information about the packing and unpacking of data for network communications
Grid Engine also has 8 debugging classes. They are:
- Trace -- shows information about entering and exiting functions
- Info -- general debugging information
- Job trace -- apparently unused
- Special -- apparently unused
- Timing -- used to report job start time in the execd
- Lock -- used by the locking library to output lock information
- Free Y -- apparently unused
- Free Z -- apparently unused
As you can see, many of the debugging classes are unused. The Grid Engine debugging mechanism is intended mostly as a tool for the Grid Engine developers. The unused classes provide places for developers to put temporary debugging output during product development.
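Since several classes can be enabled on one layer at the same time, it helps to think of each class as a single bit. Here's a rough sketch of that idea. The function names are hypothetical, and the bit values are an assumption: I've assigned them in the order the classes are listed above, which agrees with Info alone being 2 and Trace + Info being 3, but check your own installation before relying on the rest. JobTrace, FreeY, and FreeZ are spelled without spaces purely for shell convenience.

```shell
# Sketch: map a class name to a bit value.
# Assumption: bits follow the order of the class list (Trace=1 ... FreeZ=128).
class_value() {
    case "$1" in
        Trace)    echo 1 ;;
        Info)     echo 2 ;;
        JobTrace) echo 4 ;;
        Special)  echo 8 ;;
        Timing)   echo 16 ;;
        Lock)     echo 32 ;;
        FreeY)    echo 64 ;;
        FreeZ)    echo 128 ;;
        *)        echo "unknown class: $1" >&2; return 1 ;;
    esac
}

# Combine any number of class names into one per-layer value.
class_mask() {
    mask=0
    for class in "$@"; do
        value=$(class_value "$class") || return 1
        mask=$((mask | value))   # OR the bits together
    done
    echo "$mask"
}

class_mask Trace Info   # prints 3
```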
Setting the debug level for Grid Engine means assigning a set of classes to each enabled debugging layer. The 10 predefined debugging levels are:
- Level 1: Top = Info
- Level 2: Top = Trace + Info
- Level 3: Top + CULL + GDI = Info
- Level 4: Top + CULL + GDI = Trace + Info
- Level 5: Top + GUI + GDI = Info
- Level 6: Top + CULL + Basis + Commd + GDI = Lock
- Level 7: Unused = Trace + Info
- Level 8: Top + Commd = Info
- Level 9: Top + Commd = Trace + Info
- Level 10: Top + CULL + Basis + Pack = Trace + Info
For general configuration debugging, levels 1, 3, and 5 are the most useful. Very experienced admins might be able to make good use of levels 2 and 4. Level 6 is only useful for debugging deadlocks and bottlenecks caused by locking. Level 7 is only for developers and has no effect by default. Because the Commd layer isn't really used anymore, levels 8 and 9 are essentially the same as levels 1 and 2. Level 10 might help in rare cases, but it's really a developer tool.
The actual effect that the dl command has is to set the SGE_DEBUG_LEVEL environment variable. The value of the variable is a list of eight numbers, one per layer in the order listed above, where each number is the sum of the class settings for that layer. For example, dl 1 will cause SGE_DEBUG_LEVEL to be set to "2 0 0 0 0 0 0 0". (A class value of 0 means that debugging for that layer is not enabled.) Info is worth 2 and Trace is worth 1, so Trace + Info is 3, and dl 4 (Top + CULL + GDI = Trace + Info) results in SGE_DEBUG_LEVEL="3 3 0 0 0 0 3 0". Setting a debug level greater than 0 results in the SGE_ND environment variable being set to "true". When SGE_ND is true, the Grid Engine daemons (sge_qmaster, sge_schedd, sge_execd, and sge_shadowd) won't daemonize. That means they won't put themselves in the background, so you get to see the debugging output instead of having it sent to /dev/null.
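To make the layout of that variable concrete, here's a small sketch that builds an SGE_DEBUG_LEVEL-style string from a per-layer class value and a list of enabled layers. The function name debug_level_string is hypothetical and is not how dl itself is implemented; the layer order is simply the order of the layer list earlier in this post.

```shell
# Sketch: build an SGE_DEBUG_LEVEL-style string.
# Usage: debug_level_string VALUE LAYER...
# VALUE is the summed class value applied to every named layer;
# layers that aren't named get 0 (debugging disabled for that layer).
debug_level_string() {
    mask="$1"; shift
    enabled=" $* "   # pad with spaces so whole layer names can be matched
    result=""
    for layer in Top CULL Basis GUI Unused Commd GDI Pack; do
        case "$enabled" in
            *" $layer "*) value="$mask" ;;
            *)            value=0 ;;
        esac
        result="${result:+$result }$value"   # space-separate the eight values
    done
    echo "$result"
}

debug_level_string 2 Top            # Top = Info: prints 2 0 0 0 0 0 0 0
debug_level_string 3 Top CULL GDI   # Top + CULL + GDI = Trace + Info: prints 3 3 0 0 0 0 3 0
```

This only prints the string. To actually turn debugging on, stick with dl, since it also takes care of SGE_ND.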
To turn off debugging, set the debug level to 0. Setting the debug level to 0 clears the SGE_DEBUG_LEVEL and SGE_ND environment variables.
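If you lose track of whether debugging is on in a particular shell, you can just look at the two variables. A quick sketch, with sge_debug_status being a made-up name for illustration:

```shell
# Sketch: report whether the debug variables are set in the current shell.
sge_debug_status() {
    if [ -n "${SGE_DEBUG_LEVEL:-}" ]; then
        echo "debugging on: SGE_DEBUG_LEVEL=\"$SGE_DEBUG_LEVEL\" SGE_ND=${SGE_ND:-unset}"
    else
        echo "debugging off"
    fi
}
```

Remember that the variables only affect daemons and commands started from the shell in which they're set.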