Monday Feb 14, 2011

Oracle Solaris Studio C/C++: Tuning iropt for inline control

It is desirable to inline as many hot routines as possible to reduce the runtime overhead of CPU intensive applications. In general, compilers go by their own rules when to inline and when to not inline a routine. This blog post is intended to introduce some of the not widely known (or used) compiler internal flags to tweak the pre-defined rules of compiler.

Consider the following trivial C code:


% cat inline.c

#include <stdio.h>
#include <stdlib.h>

inline void freememory(int \*ptr)
{
        free(ptr);
}

extern inline void swapdata(int \*ptr1, int \*ptr2)
{
        int \*temp;

        temp = (int \*) malloc (sizeof (int));
        printf("\\nswapdata(): before swap ->");

        \*temp = \*ptr1;
        \*ptr1 = \*ptr2;
        \*ptr2 = \*temp;

        printf("\\nswapdata(): after swap ->");

        free (temp);
}

inline void printdata(int \*ptr)
{
        printf("\\nAddress = %x\\tStored Data = %d", ptr, \*ptr);
}

inline void storedata(int \*ptr, int data)
{
        \*ptr = data;
}

inline int \*getintptr()
{
        int \*ptr;
        ptr = (int \*) malloc (sizeof(int));
        return (ptr);
}

inline void AllocLoadAndSwap(int val1, int val2)
{
        int \*intptr1, \*intptr2;

        intptr1 = getintptr();
        intptr2 = getintptr();
        storedata(intptr1, val1);
        storedata(intptr2, val2);
        printf("\\nBefore swapping .. ->");
        printdata(intptr1);
        printdata(intptr2);
        swapdata(intptr1, intptr2);
        printf("\\nAfter swapping .. ->");
        printdata(intptr1);
        printdata(intptr2);
        freememory(intptr1);
        freememory(intptr2);
}

inline void InitAllocLoadAndSwap()
{
        printf("\\nSnapshot 1\\n___________");
        AllocLoadAndSwap(100, 200);
        printf("\\n\\nSnapshot 2\\n___________");
        AllocLoadAndSwap(435, 135);
}

int main() {
        InitAllocLoadAndSwap();
        return (0);
}

By default auto inlining is turned off in Oracle Studio compilers; and to turn it on, the code must be compiled with -O4 or higher optimization. This example attempts to hint the compiler to inline all the routines with the help of inline keyword. Note that inline keyword is a suggestion/request for the compiler to inline the function. However there is no guarantee that compiler honors the suggestion/request. Just like everything else in the world, compiler has a pre-defined set of rules. And based on those rules, it tries to do its best as long as those rules are not violated. If the compiler chooses to inline a routine, the function body will be expanded at all the call sites (just like a macro expansion).

When this code is compiled with Oracle Studio C compiler, it doesn't print any diagnostic information on stdout or stderr - so, using nm or elfdump is one way to find what routines are inlined and what routines are not.


% cc -xO3 -c inline.c
% nm inline.o

inline.o:

[Index]   Value      Size      Type  Bind  Other Shndx   Name

[4]     |         0|       0|NOTY |LOCL |0    |3      |Bbss.bss
[6]     |         0|       0|NOTY |LOCL |0    |4      |Ddata.data
[8]     |         0|       0|NOTY |LOCL |0    |5      |Drodata.rodata
[16]    |         0|       0|NOTY |GLOB |0    |ABS    |__fsr_init_value
[14]    |         0|       0|FUNC |GLOB |0    |UNDEF  |InitAllocLoadAndSwap
[1]     |         0|       0|FILE |LOCL |0    |ABS    |inline.c
[15]    |         0|      20|FUNC |GLOB |0    |2      |main

From this output, we can see that InitAllocLoadAndSwap() is not inlined yet there is no information as to why this function is not inlined.

Compiler commentary with er_src

To get some useful diagnostic information, Oracle Studio compiler collection offers a utility called er_src. When the source code was compiled with debug flag (-g or -g0), er_src can print the compiler commentary. However since compiler does auto inlining only at O4 or later optimization levels, unfortunately compiler commentary for inlining is not available at O3 optimization level.

iropt's inlining report

"iropt" is the global optimizer in Oracle Solaris Studio compiler collection. Inlining will be taken care by iropt. It performs inlining for callees in the same file unless compiler options for cross file optimizations such as -xipo, -xcrossfile are specified on compile line. Some of the iropt options can be used to control inlining heuristics. These options have no dependency on the optimization level.

Finding the list of iropt phases and the supported options

Oracle Studio C/C++ compilers on SPARC support a variety of options for function inline control. -help displays the list of supported flags/options.


% CC -V
CC: Sun C++ 5.9 SunOS_sparc Patch 124863-01 2007/07/25

% cc -V
cc: Sun C 5.9 SunOS_sparc Patch 124867-01 2007/07/12

% iropt -help

  \*\*\*\*\*\*  General Usage Information about IROPT  \*\*\*\*\*\*

To get general help information about IROPT, use -help
To list all the optimization phases in IROPT, use -phases
To get help on a particular phase, use -help=phase
To turn on phases, use -A++...+
To turn off phases, use -R++...+
To use phase-specific flags, use -A:

% iropt -phases


  \*\*\*\*\*\* List of Optimization Phases in IROPT \*\*\*\*\*\*

    Phase Name          Description
-------------------------------------------------------------
bitfield	     Bitfield transformations
iv		     Strength Reduction
loop		     Loop Invariant Code Motion
cse		     Common Subexpression Elimination
copy		     Copy Propagation
const		     Const Propagation and Folding
reg		     Virtual Register Allocation
unroll		     Data Dependence Loop Unrolling
merge		     Merge Basic Blocks
reassoc		     Reconstruction of associative and/or distributive expressions
composite_breaker	     
tail		     Tail Recursion Optimization
rename		     Scalar Rename
reduction	     
mvl		     Two-version loops for parallelization
loop_dist	     Loop Distribution
ivsub		     Induction Variables Substitution: New Algorithm
ddint		     Loop Interchange
fusion		     Loop Fusion
eliminate	     Scalar Replacement on def-def and def-use
private		     Private Array Analysis
scalarrep	     Scalar Replacement for use-use
tile		     Cache Blocking
ujam		     Register Blocking
ddrefs		     Loop Invariant Array References Moving
invcc		     Invariant Conditional Code Motion
sprof		     Synthetic Profiling
restrict_g	     Assume global pointers as restrict
dead		     Dead code elimination
pde		     Partial dead code elimination
reassoc2	     loop invariant reassociative tranfsformations
distr		     distributive reassociative tranfsformations
height2		     tree height reassociative reduction
ansi_alias	     Apply ANSI Aliase Rules to Pointer References
perfect		     
yas		     Scalar Replacement for reduction arrays
pf		     Prefetch Analysis
cond_elim	     Conditional Code Elimination
vector		     Vectorizing Some Intrinsics Functions Calls in Loops
whole		     Whole Program Mode
bopt		     Branches Reordering based on Profile Data
invccexp	     Invariant Conditional Code Expansion
bcopy		     Memcpy and Memset Transformations
ccse		     Cross Iteration CSE
data_access	     Array Access Regions Analysis
ipa		     Interprocedual Analysis
contract	     Array Contraction Analysis
symbol		     Symbolic Analysis
ppg2		     optimistic strategy of constant propagation
parallel	     Parallelization
pcg		     Parallel Code Generator
lazy		     Lazy Code Motion
region		     Region-based Optimization
loop_peeling	     Loop Peeling
loop_reform	     Loop Reformulation
loop_shifting	     Loop Shifting
loop_collapsing	     Loop Collapsing
memopt		     Merge memory allocations
inline		     IPA-based inlining phase
clone		     Routine cloning phase
norm_ir		     clean-up and normalize ir
ipa_ppg		     interprocedural constant propagation
sr		     Strength reduction (new)
ivsub3		     Induction Variable Substitution
icall_opt	     indirect call optimization
cha		     Class Hierarchy Analysis
ippt		     Interprocedual pointer tracking
reverse_invcc	     reverse invariant condition code hoisting
crit		     Critical path optimisations
loop_norm	     loop normalization
loop_unimodular	     loop unimodular transformation
scalar_repl	     Scalar Replacement
loop_bound	     Redundant Loop Bound Checking Elimination
loop_condition	     Invariant Loop Bound Checking Hoisting
memopt_pattern	     Memory Access Optimization
loop_improvement	     Loop structure improvement by code specialization
pbranch_opt	     C++ Java Pbranch Optimizations
norm_ldst	     short ld/st normalisation
micro_vector	     Micro vectorization for x86
ipa_symbol_ppg	     interprocedural symbolic analysis
optinfo		     Compile-time information about loop and inlining transformations
vp		     Value profiling and code specialization
pass_ti		     Pass IR type trees to the backend
fully_unroll	     Fully Loop Unrolling
builtin_opt	     Builtin Optimization


% iropt -help=inline

NAME
    inline - Qoption for IPA-based inlining phase.

SYNOPSIS
    -Ainline[:][:]:...[:] - turn on inline.
    -Rinline                             - turn off inline

DESCRIPTION
    inline is on by default now. -Ainline turns it on.
    -Rinline turns it off. 
    
    NOTE: the following is a brief description of the old inliner qoptions
          1. Old inliner qoptions that do not have equivalent 
             options in the new inliner--avoid to use them later: 
             -Ml -Mi -Mm -Ma -Mc -Me -Mg -Mw -Mx -Mx -MC -MS 

          2. Old inliner qoptions that have equivalent option 
             in the new inliner--use the new options later: 
             Old options     new options 
                -Msn          recursion=n 
                -Mrn          irs=n      
                -Mtn          cs=n       
                -Mpn          cp=n       
                -MA           chk_alias  
                -MR           chk_reshape 
                -MI           chk_reshape=no 
                -MF           mi         
 
    The acceptable sub-options are:

      report[=n] - dump inlining report.
                  n=chain: 
                        show to-be-inlined call chains.
                  n=user_request: 
                        show the inlining status of user-requests.
                  n=0:  show inlined calls only.
                  n=1:  (default):  show both inlined and 
                        non-inlined calls and reasons for 
                        inlining/non-inlining.
                  n=2:  n=1 plus call id and node id
                  n=3:  show inlining summary only
                  n=4:  n=2 and iropt aborts after the 
                        inlining report is dumped out.
      cgraph     - dump cgraph.
      call_in_pragma[=no|yes]:
                 - call_in_pragma or call_in_pragma=yes: 
                   Inline a call that in the Parallel region 
		      into the original routine 
                 - call_in_pragma=no: (default) 
                   Don't inline a call that in the Parallel region
		      into the original routine 
      inline_into_mfunction[=no|yes]:(only for Fortran) 
		    - inline_into_mfunction or inline_into_mfunction=yes:(default) 
		      Inline a call into the mfunction if it is in the
		      Parallel Region
                 - inline_into_mfunction=no: 
                   Don't inline a call into the mfunction if it 
                   in the Parallel Region
NOTE: for other languages, if you specify inline_into_mfunction=yes 
	 The compiler will silently ignore this qoption. As a result, 
	 Calls in parallel region will still be inlined into pragma constructs
      rs=n       - max number of triples in inlinable routines.
                   iropt defines a routine as inlinable or not
                   based on this number. So no routines over 
                   this limit will be considered for inlining.
      irs=n      - max number of triples in a inlining routine,
                   including size increase by inlining its calls
      cs=n       - max number of triples in a callee. 
                   In general, iropt only inline calls whose 
                   callee is inlinable (defined by rs) AND 
                   whose callee size is not greater than n.
                   But some calls to inlinable routines are 
                   actually inlined because of other factors
                   such as constant actuals, etc. 
      recursion=n  
                 - max level of resursive call that is 
                   considered for inlining.
      cp=n       - minimum profile feedback counter of a call.
                   No call with counter less than this limit 
                   would be inlined.
      inc=n      - percentage of the total number of triples 
                   increased after inlining. No inlining over
                   this percentage. For instance, 'inc=30' 
                   means inlining is allowed to increase the 
                   total program size by 30%.
      create_iconf=:
      use_iconf=:
                   This creates/uses an inlining configuration.
                   The file lists calls and routines that are
                   inlined and routines that inline their calls.
                   Its format is:
                      air      /\* actual inlining routines \*/
                      r11 r12 r13 ...
                      r21 r22 r23 ...
                      .....
                      ari      /\* actual routines inlined \*/
                      r11 r12 r13 ...
                      r21 r22 r23 ...
                      .....
                      aci      /\* actual calls inlined \*/
                      (r11,c11) (r12,c12) (r13,c13) ...
                      (r21,c21) (r22,c22) (r23,c23) ...
                      .....
                   The numbers are callids (cxx) and nodeids(rxx) 
                   printed out when report=2. It is used for
                   debugging. The usual usage is to use
                   create_iconf= to create a config file.
                   then, comment (by preceding numbers line
                   with #) to disallow inlining for those 
                   calls or routines. For instance, 
                       aci
                       (2,3) (2,5) (2,6) (3,9)
                       (3,10) (6,4) (6,7) (7,6)
                       #(7,10) (8,21) (8,22)
                   with the above config file, calls whose
                   nodeids and callids are (7,10),(8,21) and 
                   (8,22) will not be inlined.

                   NOTE:for the aci part of the configure file,
                        in each pair (rij,cij), the parentheses
                        are not necessary, but the comma is necessary 
                        and there should not be any space between
                        rji and comma, comma and cij.
      do_inline=:
                 - guide inliner to do inlining for a given
                   routine only.
      mi:
                 - Do maximum inlining for given routines if do_inline
                   is used; otherwise, do maximum inlining for main routine.
                   (The inliner will not check inlining parameters.
      inline_level[=1|2|3]: 
                 - specify the level of inline: 
                     inline_level=1    basic inlining 
                     inline_level or inline_level=2    medium inlining (default) 
                     inline_level=3 or inline_level=4,5...   aggressive inlining 
      remove_ip[=no|yes]:
                 - remove_ip or remove_ip=yes:
                      removing inliningPlan after inlining.
                 - remove_ip=no [default]:
                      keep inliningPlan after inlining.
      chk_alias[=no|yes]:
                 - chk_alias or chk_alias=yes [default]:
                      Don't inline a call if inlining it causes
                      aliases among callee's formal arrays.
                 - chk_alias=no:
                      Ignore such checking.
      chk_reshape[=no|yes]:
                 - chk_reshape or chk_reshape=yes [default]:
                      Don't inline a call if its array argument
                      is reshaped between caller and callee.
                 - chk_reshape=no:
                      Ignore such checking.
      chk_mismatch[=no|yes]:
                 - chk_mismatch or chk_mismatch=yes [default]:
                      Don't inline a call if any real argument
                      mismatches with its formal in type.
                 - chk_mismatch=no:
                      Ignore such checking.
      do_chain[=no|yes]:
                 - do_chain or do_chain=yes [default]:
                      Enable inlining for call chains.
                 - do_chain=no:
                      Disable inlining for call chains.
      callonce[=no|yes]:
                 - callonce=no [default]:
                      Disable inlining a routine that is
                      called only once.
                 - callonce or callonce=yes:
                      Enable inlining a routine that is
                      called only once.
      icall_recurse[=no|yes]:
                 - icall_recurse=no [default]:
                      Disable recursive inlining of indirect
                      and virtual call sites
                 - icall_recurse=yes:
                      Enable recursive inlining of indirect
                      and virtual call sites
      formal_dbgsym[=no|yes]: (default = no)
                 - Specify to preserve the debug information for
                   formal parameter of inlined funcion

Some of these options can be used to get all the diagnostic information from compile time. Especially the sub-option (report) to -Ainline is useful in obtaining the inlining report. To pass special flags to iropt, specify -W2,<option>:<sub-option> on compile line.

Here is an example.


% cc -xO3 -c -W2,-Ainline:report=2 inline.c

INLINING SUMMARY

   inc=400: percentage of program size increase.
   irs=4096: max number of triples allowed per routine after inlining.
   rs=450: max routine size for an inlinable routine.
   cs=400: call size for inlinable call.
   recursion=1: max level for inlining recursive calls.
   Auto inlining: OFF

   Total inlinable calls: 14
   Total inlined calls: 36
   Total inlined routines: 7
   Total inlinable routines: 7
   Total inlining routines: 3
   Program size: 199
   Program size increase: 744
   Total number of call graph nodes: 11

   Notes for selecting inlining parameters

    1. "Not inlined, compiler decision":
       If a call is not inlined by this reason, try to
       increase inc in order to inline it by
          -Qoption iropt -Ainline:inc=  for FORTRAN, C++
          -W2,-Ainline:inc=  for C

    2. "Not inlined, routine too big after inlining":
       If a call is not inlined by this reason, try to
       increase irs in order to inline it by
          -Qoption iropt -Ainline:irs=  for FORTRAN, C++
          -W2,-Ainline:irs=  for C

    3. "Not inlined, callee's size too big":
       If a call is not inlined by this reason, try to
       increase cs in order to inline it by
          -Qoption iropt -Ainline:cs=  for FORTRAN, C++
          -W2,-Ainline:cs=  for C

    4. "Not inlined, recursive call":
       If a call is not inlined by this reason, try to
       increase recursion level in order to inline it by
          -Qoption iropt -Ainline:recrusion=  for FORTRAN, C++
          -W2,-Ainline:recrusion=  for C

    5. "Routine not inlined, too many operations":
       If a routine is not inlinable by this reason, try to
       increase rs in order to make it inlinable by
          -Qoption iropt -Ainline:rs=  for FORTRAN, C++
          -W2,-Ainline:rs=  for C


ROUTINES NOT INLINABLE:

 main [id=7] (inline.c)
   Routine not inlined, user requested

CALL INLINING REPORT:

 Routine: freememory [id=0] (inline.c)
  Nothing inlined.

 Routine: swapdata [id=1] (inline.c)
  Nothing inlined.

 Routine: printdata [id=2] (inline.c)
  Nothing inlined.

 Routine: storedata [id=3] (inline.c)
  Nothing inlined.

 Routine: getintptr [id=4] (inline.c)
  Nothing inlined.

 Routine: AllocLoadAndSwap [id=5] (inline.c)
   getintptr [call_id=8], line 46: Auto inlined
   getintptr [call_id=9], line 47: Auto inlined
   storedata [call_id=10], line 48: Auto inlined
   storedata [call_id=11], line 49: Auto inlined
   printdata [call_id=13], line 51: Auto inlined
   printdata [call_id=14], line 52: Auto inlined
   swapdata [call_id=15], line 53: Auto inlined
   printdata [call_id=17], line 55: Auto inlined
   printdata [call_id=18], line 56: Auto inlined
   freememory [call_id=19], line 57: Auto inlined
   freememory [call_id=20], line 58: Auto inlined

 Routine: InitAllocLoadAndSwap [id=6] (inline.c)
   AllocLoadAndSwap [call_id=22], line 64: Not inlined, compiler decision
     (inc limit reached. See INLININING SUMMARY)
   AllocLoadAndSwap [call_id=24], line 66: Auto inlined
     swapdata [call_id=15], line 53: Auto inlined
     getintptr [call_id=8], line 46: Auto inlined
     getintptr [call_id=9], line 47: Auto inlined
     printdata [call_id=13], line 51: Auto inlined
     printdata [call_id=14], line 52: Auto inlined
     printdata [call_id=17], line 55: Auto inlined
     printdata [call_id=18], line 56: Auto inlined
     freememory [call_id=19], line 57: Auto inlined
     freememory [call_id=20], line 58: Auto inlined
     storedata [call_id=10], line 48: Auto inlined
     storedata [call_id=11], line 49: Auto inlined

 Routine: main [id=7] (inline.c)
   InitAllocLoadAndSwap [call_id=25], line 70: Auto inlined
     AllocLoadAndSwap [call_id=22], line 64: Not inlined, compiler decision
       (inc limit reached. See INLININING SUMMARY)
     AllocLoadAndSwap [call_id=24], line 66: Auto inlined
       swapdata [call_id=15], line 53: Auto inlined
       getintptr [call_id=8], line 46: Auto inlined
       getintptr [call_id=9], line 47: Auto inlined
       printdata [call_id=13], line 51: Auto inlined
       printdata [call_id=14], line 52: Auto inlined
       printdata [call_id=17], line 55: Auto inlined
       printdata [call_id=18], line 56: Auto inlined
       freememory [call_id=19], line 57: Auto inlined
       freememory [call_id=20], line 58: Auto inlined
       storedata [call_id=10], line 48: Auto inlined
       storedata [call_id=11], line 49: Auto inlined

The above report shows the threshold values being used while making decisions, all the routines and information about whether a call to any function is inlined; if not, the reason for not inlining it, and some suggestions on how to make it succeed. This is cool stuff.

Going back to the example: based on the report, the compiler is trying to inline all the routines as long as the program size doesn't go beyond 400% of the original size (ie., without inlining). Unfortunately AllocLoadAndSwap() went beyond the limits and as a result, compiler decides not to inline it. Fair enough. If we don't bother about the size of the binary and if we really want this routine inlined, one solution is to increase the value for inc in such a way that AllocLoadAndSwap()'s inclusion would fit into the newer limits.

eg.,


% cc -xO3 -c -W2,-Ainline:report=2,-Ainline:inc=650 inline.c
INLINING SUMMARY

   inc=650: percentage of program size increase.
   irs=4096: max number of triples allowed per routine after inlining.
   rs=450: max routine size for an inlinable routine.
   cs=400: call size for inlinable call.
   recursion=1: max level for inlining recursive calls.
   Auto inlining: OFF

   Total inlinable calls: 14
   Total inlined calls: 60
   Total inlined routines: 7
   Total inlinable routines: 7
   Total inlining routines: 3
   Program size: 199
   Program size increase: 1260
   Total number of call graph nodes: 11

   Notes for selecting inlining parameters

    ... skip ... (see prev reports for the text that goes here)

ROUTINES NOT INLINABLE:

 main [id=7] (inline.c)
   Routine not inlined, user requested


CALL INLINING REPORT:

 Routine: freememory [id=0] (inline.c)
  Nothing inlined.

 Routine: swapdata [id=1] (inline.c)
  Nothing inlined.

 Routine: printdata [id=2] (inline.c)
  Nothing inlined.

 Routine: storedata [id=3] (inline.c)
  Nothing inlined.

 Routine: getintptr [id=4] (inline.c)
  Nothing inlined.

 Routine: AllocLoadAndSwap [id=5] (inline.c)
   getintptr [call_id=8], line 46: Auto inlined
   getintptr [call_id=9], line 47: Auto inlined
   storedata [call_id=10], line 48: Auto inlined
   storedata [call_id=11], line 49: Auto inlined
   printdata [call_id=13], line 51: Auto inlined
   printdata [call_id=14], line 52: Auto inlined
   swapdata [call_id=15], line 53: Auto inlined
   printdata [call_id=17], line 55: Auto inlined
   printdata [call_id=18], line 56: Auto inlined
   freememory [call_id=19], line 57: Auto inlined
   freememory [call_id=20], line 58: Auto inlined

 Routine: InitAllocLoadAndSwap [id=6] (inline.c)
   AllocLoadAndSwap [call_id=22], line 64: Auto inlined
     swapdata [call_id=15], line 53: Auto inlined
     getintptr [call_id=8], line 46: Auto inlined
     getintptr [call_id=9], line 47: Auto inlined
     printdata [call_id=13], line 51: Auto inlined
     printdata [call_id=14], line 52: Auto inlined
     printdata [call_id=17], line 55: Auto inlined
     printdata [call_id=18], line 56: Auto inlined
     freememory [call_id=19], line 57: Auto inlined
     freememory [call_id=20], line 58: Auto inlined
     storedata [call_id=10], line 48: Auto inlined
     storedata [call_id=11], line 49: Auto inlined
   AllocLoadAndSwap [call_id=24], line 66: Auto inlined
     swapdata [call_id=15], line 53: Auto inlined
     getintptr [call_id=8], line 46: Auto inlined
     getintptr [call_id=9], line 47: Auto inlined
     printdata [call_id=13], line 51: Auto inlined
     printdata [call_id=14], line 52: Auto inlined
     printdata [call_id=17], line 55: Auto inlined
     printdata [call_id=18], line 56: Auto inlined
     freememory [call_id=19], line 57: Auto inlined
     freememory [call_id=20], line 58: Auto inlined
     storedata [call_id=10], line 48: Auto inlined
     storedata [call_id=11], line 49: Auto inlined

 Routine: main [id=7] (inline.c)
   InitAllocLoadAndSwap [call_id=25], line 70: Auto inlined
     AllocLoadAndSwap [call_id=22], line 64: Auto inlined
       swapdata [call_id=15], line 53: Auto inlined
       getintptr [call_id=8], line 46: Auto inlined
       getintptr [call_id=9], line 47: Auto inlined
       printdata [call_id=13], line 51: Auto inlined
       printdata [call_id=14], line 52: Auto inlined
       printdata [call_id=17], line 55: Auto inlined
       printdata [call_id=18], line 56: Auto inlined
       freememory [call_id=19], line 57: Auto inlined
       freememory [call_id=20], line 58: Auto inlined
       storedata [call_id=10], line 48: Auto inlined
       storedata [call_id=11], line 49: Auto inlined
     AllocLoadAndSwap [call_id=24], line 66: Auto inlined
       swapdata [call_id=15], line 53: Auto inlined
       getintptr [call_id=8], line 46: Auto inlined
       getintptr [call_id=9], line 47: Auto inlined
       printdata [call_id=13], line 51: Auto inlined
       printdata [call_id=14], line 52: Auto inlined
       printdata [call_id=17], line 55: Auto inlined
       printdata [call_id=18], line 56: Auto inlined
       freememory [call_id=19], line 57: Auto inlined
       freememory [call_id=20], line 58: Auto inlined
       storedata [call_id=10], line 48: Auto inlined
       storedata [call_id=11], line 49: Auto inlined

From the above output we can conclude that AllocLoadAndSwap() was inlined by the compiler when we let the program size to increase by 650%.

Notes:

  • Multiple iropt options separated by a comma (,) can be specified after -W2
    eg., -W2,-Ainline:report=2,-Ainline:inc=650

  • For C++ code, use -Qoption to specify iropt options.
    eg., -Qoption iropt -Ainline:report=2
    -Qoption iropt -Ainline:report=2,-Ainline:inc=650

  • Inlining those functions whose function call overhead is large relative to the routine code improves performance. Improvement is the result of elimination of the function call, stack frame manipulation and the function return

  • Even though inlining may increase the runtime performance of an application, do not try to inline too many functions. Inline only those functions (from profiling data) that could benefit from inlining

  • In general, compiler threshold values are good enough for inlining the functions. Use iropt's options only if some very hot routines couldn't make it due to some reason. Turn on auto inlining with -xO4 option

  • Inline functions increase build time and program sizes. Sometimes it is possible that some of the very large routines (when inlined) may not fit into processor's cache and may lead to poor performance mainly due to the increased cache miss rate

ALSO SEE
Oracle Solaris Studio: Advanced Compiler Options for Performance

(Original blog post is at:
http://technopark02.blogspot.com/2005/11/sun-studio-cc-tuning-iropt-for-inline.html)

Monday Jun 22, 2009

Sun Studio: Debugging Multi-Threaded Applications with dbx

(Crossposting the three and half year old blog entry "as is" from my other blog hosted on blogger. It needs some serious editing, but I believe the content is still relevant. Source URL: http://technopark02.blogspot.com/2005/12/sun-studio-debugging-multi-threaded.html)

Multi-threading lets different tasks to run concurrently in a single process, hence multi-threaded programs would run faster [or achieve better throughput] on machines with multiple processors and on CPUs with multiple cores. On an SMP (Symmetric Multi-Processing system, where multiple processors share a single memory system) system with no CMT (Chip Multi-Threading), software threads are executed on different processors; and on an SMP system with CMT, the threads are executed on cores, and logical processors in CMP (Chip Multi-Processing) processors. As revolutionary chip designs are evolving, many important commercial applications like Oracle, SAP, Siebel, PeopleSoft are designed to be multi-threaded.

Debugging a multi-threaded (MT in short) application is a bit hard, due to the number of software threads running in parallel, compared to a single threaded program where only one task will be running per process, at any given time. Thread synchronization plays an important role when concurrently running threads have to share global resources. Improperly synchronized threads may starve, and lead to unnecessary dead locks, and race conditions. So, it is good to have an MT aware debugger handy, during development and in support phases of software life cycle, to debug threading issues.

Fortunately on Solaris, Sun Studio's debugger, dbx, has support for MT applications that are designed to use Solaris threads, and/or POSIX threads. With dbx, it is possible to get information like thread state, stack trace, locks from all threads, navigate between threads, suspend/resume threads, put break points in a thread and can do step by step execution in a function in a designated thread. Note that Solaris Modular Debugger (mdb) also has support for MT programs; but this blog post concentrates on Studio's dbx.

Siebel processes were used to show various dbx commands in the following examples. Siebel is a multi-threaded application, written in C/C++.

Core dump analysis

The following example shows some useful commands to get the stack trace in the thread, where the process crashed. For more information about dbx commands, type help or help <command> in dbx environment ie., at dbx prompt.

% ls -lh core
-rw-------   1 giri     other       273M Dec  9 16:56 core

% file core
core:           ELF 32-bit MSB core file SPARC Version 1, from 'siebprocmw'

% /opt/SS11/SUNWspro/prod/bin/dbx siebprocmw core
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.5' in your .dbxrc
Reading siebprocmw
core file header read successfully
Reading ld.so.1
Reading libsslcwsl.so
Reading libssscsci.so
Reading libssscscf.so
...
...
Reading libsbcfui.so
Reading libsbcfuiapps.so
t@1 (l@1) terminated by signal KILL (Killed)
0xfd2bc7e0: ___nanosleep+0x0008:        blu      _cerror        ! 0xfd2206a0
Since we don't know which thread crashed the process, let's list all known threads with threads command. threads -all lists all threads, including zombies.
(dbx) threads
 >    t@1  a  l@1   ?()   LWP suspended in  ___nanosleep()
      t@2  b  l@2   MwTimerThread()   LWP suspended in  __pollsys()
      t@3  b  l@3   MwAsyncSignalThread()   sleep on 0xfd874078  in  __lwp_park()
      t@4  b  l@4   MwThread()   LWP suspended in  __pollsys()
      t@5  b  l@5   MwThread()   LWP suspended in  __pollsys()
o     t@6  b  l@6   MwThread()   signal SIGABRT in  __lwp_kill()
      t@7  b  l@7   MwThread()   LWP suspended in  __pollsys()
      t@9  b  l@9   MwThread()   LWP suspended in  ___nanosleep()

In the above list, t@1 is the current thread, which is indicated by ">", and the start function is not known (indicated with a "?()").

(dbx) thread
current thread ($thread) is t@1

(dbx) where
current thread: t@1
=>[1] ___nanosleep(0x4, 0xffbfd9a8, 0x0, 0xff000000, 0x0, 0x0), at 0xfd2bc7e0
  [2] _sleep(0x64, 0x0, 0xfd2e8bc0, 0xfd0e2000, 0xfd0e2000, 0x0), at 0xfd2afaa0
  [3] thr_t::do_thr_action(0xfd86ba10, 0xc, 0x1608, 0xfd86ba20, 0x1, 0x2), at 0xfd770e14
  [4] thr_t::t_sleep(0xfb80f5c0, 0x0, 0xffbfdb0e, 0xffbfdb08, 0xfd8546cc, 0xffffffff), at 0xfd770c58
  [5] MwWaitForMultipleObjects(0xfb80f5c0, 0x2, 0xfb80f5c8, 0x2, 0xffffffff, 0x9cd48), at 0xfd774dd4
  [6] WaitForMultipleObjectsEx(0x2, 0xffbfde3c, 0x0, 0x100000, 0x0, 0x9cd48), at 0xfd77fe9c
  [7] OSDNTWait::WaitForThread(0xc, 0xffffffff, 0xffbfdecc, 0xd0108, 0x1004f, 0xff8a1d64), at 0xffa7b050
  [8] OSDWaitTid(0xc, 0xffffffff, 0xffbfe7c4, 0x0, 0xc, 0xc), at 0xff05f1c4
  [9] scfEventFacility::scfEventFac::ShutdownCmd(0xe14450, 0x1, 0x7, 0xfe4de0f4, 0xffbfe7c8, 0xff48f8d4), at 0xff819884
  [10] scfEventFacility::scfEventFac::Shutdown(0xffbfe96c, 0xff877530, 0x0, 0x5e000, 0xff874e8c, 0x5e114), at 0xff819390
  [11] ScfSisDetach(0x0, 0x0, 0x0, 0xffffffff, 0xffbfe96c, 0xfc81c), at 0xff781ed4
  [12] _shutdown(0x6479c, 0x0, 0x651a8, 0x651a8, 0x7, 0x0), at 0x49c7c
  [13] wmain(0x12a, 0x6479c, 0x0, 0x0, 0xffbfedac, 0x6479c), at 0x4995c
  [14] main(0xfd85f310, 0xc94, 0xffbfef90, 0x54, 0xfd85f310, 0xc00), at 0x4d3cc

This is not exactly what we are looking for. The above call stack shows where the current thread (t@1) is waiting. Since our interest is to find out the thread that is responsible for the process crash, we need to look for an o before the thread id. t@6 is the ill fated thread in the list of all known threads; and the process was killed because of a SIGABRT in lwp_kill method. Note that OS provides the necessary abstraction for creating, and destroying threads; and also has the freedom of killing malfunctioning threads when things go haywire. In this example, __lwp_kill() was called by the operating system, due to some event which we are going to investigate.

thread -info <tid> command provides more information like what exactly happened in application code that triggered the forcible shutdown.

(dbx) thread -info t@6
        Thread t@6 (0xfcb80c00) at priority 0
        state: bound to    l@6
        base function: 0xfd770ff4: MwThread() stack: 0xfa380000[524288]
        flags: BOUND|DETACHED|SUSPENDED
        masked signals: SEGV
        Currently active in __lwp_kill

Observe that kernel trapped an illegal memory access with a SEGV signal. The default behavior for a SEGV, is to shutdown the process with a possible core file generation (aka core dump). Let's switch to thread t@6 with thread <tid> command, and get to the instruction which raised the segmentation fault.

(dbx) thread t@6
t@6 (l@6) stopped in __lwp_kill at 0xfd2bd5ec
0xfd2bd5ec: __lwp_kill+0x0008:  bcc,a,pt  %icc,__lwp_kill+0x18  ! 0xfd2bd5fc

(dbx) thread
current thread ($thread) is t@6

(dbx) where
current thread: t@6
=>[1] __lwp_kill(0x0, 0x6, 0x0, 0x6, 0xffff0000, 0x0), at 0xfd2bd5ec
  [2] raise(0x6, 0x0, 0xfd2a1af4, 0x42770, 0xfd2e4278, 0x6), at 0xfd25d884
  [3] abort(0xe15220, 0x1, 0x0, 0xa6544, 0xfd2e7298, 0x0), at 0xfd23de38
  [4] SehScanInvokeTryList(0x44bd308, 0x108000, 0xfd8571c4, 0x0, 0x2, 0x0), at 0xfd74c9d4
  [5] Signal_Handler::raise(0xc0000005, 0xfa37cde8, 0x0, 0x2, 0xfa37cc80, 0x1800), at 0xfd74d778
  [6] Raise_Exception::operator()(0x67670, 0xb, 0xfa37d0a0, 0xfa37cde8, 0xfd86a07c, 0x2c), at 0xfd74d8dc
  [7] __sighndlr(0xb, 0xfa37d0a0, 0xfa37cde8, 0xfd74d7c8, 0x0, 0x1), at 0xfd2bc52c
  ---- called from signal handler with signal 11 (SIGSEGV) ------
  [8] CSSSqlObj::GetTrxDbConn(0x458a7d8, 0x0, 0x1394478, 0x64c00, 0x0, 0x4611290), at 0xf91de72c
  [9] CSSSqlObj::Execute(0x4611290, 0x0, 0x0, 0x0, 0x0, 0xfe4dd294), at 0xf91c7b98
  [10] CSSBusComp::SqlExecute(0x4606640, 0x0, 0x0, 0x0, 0x1, 0x4b22e84), at 0xf9a9c160
  [11] CSSBCBase::SqlExecute(0x4606640, 0x0, 0xfa37d6fc, 0x0, 0x1, 0xf57be3e8), at 0xf56c2294
  [12] CSSBusComp::Execute(0x0, 0x0, 0x0, 0x0, 0x4606640, 0xfa37d7cc), at 0xf9a6b118
  [13] CSSMsgBoardMaintSvc::UpdTaskHistory(0x44b5ae0, 0xfa37df90, 0x0, 0x4567d14, 0xf8611198, 0x489cd94), at 0xf85f2d48
  [14] CSSMsgBoardMaintSvc::HandleEventDataList(0x44b5ae0, 0x43a0018, 0xff486b38, 0x0, 0xfa37e0ac, 0xf8611198), at 0xf85f5afc
  [15] CSSMsgBoardMaintSvc::ReadTaskHistory(0x44b5ae0, 0x43a0018, 0xf85f4e60, 0x44b5ae0, 0x43a0018, 0x1), at 0xf85f53c0
  [16] scfEventFacility::scfEventFac::CallRegSub(0x2a59448, 0x4109bd8, 0x0, 0x0, 0x8, 0x2), at 0xff81ad20
  [17] scfEventFacility::scfEventFac::HandleCurrProcEvents(0xe14450, 0x7530, 0xe14450, 0xff432ef0, 0xff874e8c, 0x1), 
            at 0xff81b19c
  [18] scfEventFacility::scfEventFac::scfEventThreadMain(0x0, 0x0, 0x0, 0x7400, 0xfa37fc90, 0xd0001), at 0xff81a7dc
  [19] OSDWslThreadStart(0x101d58, 0xff81a580, 0x101d58, 0x6, 0x0, 0x101d70), at 0xff05bec8
  [20] _AfxThreadEntry(0xffbfde34, 0xe9568, 0x0, 0x1, 0x0, 0x17289c), at 0xfeb95730
  [21] MwThread(0x1, 0x0, 0x1, 0x0, 0xfd86bed0, 0xe15220), at 0xfd771230

From the above stack trace it is clear that the binary doesn't contain necessary debug information to show high level instructions; so, let's try to get the disassembly with dis command.

(dbx) dis GetTrxDbConn / 50
More than one identifier 'GetTrxDbConn'.
Select one of the following:
 0) Cancel
 1) `libsscfdm.so`#__1cPCSSModelPhysDefMGetTrxDbConn6MpkH_pnJCSSDbConn__
  [non -g, demangles to: CSSModelPhysDef::GetTrxDbConn(const unsigned short\*)]
 2) `libsscfdm.so`#__1cJCSSSqlObjMGetTrxDbConn6kM_pnJCSSDbConn__
  [non -g, demangles to: CSSSqlObj::GetTrxDbConn()const]
> 2
0xf91de6c0: GetTrxDbConn       :        save     %sp, -96, %sp
0xf91de6c4: GetTrxDbConn+0x0004:        mov      %i0, %i5
0xf91de6c8: GetTrxDbConn+0x0008:        ld       [%i0 + 388], %i0
0xf91de6cc: GetTrxDbConn+0x000c:        cmp      %i0, 0
0xf91de6d0: GetTrxDbConn+0x0010:        be,pn    %icc,GetTrxDbConn+0x60 ! 0xf91de720
0xf91de6d4: GetTrxDbConn+0x0014:        sethi    %hi(0x5b400), %l6
0xf91de6d8: GetTrxDbConn+0x0018:        call     GetTrxDbConn+0x20      ! 0xf91de6e0
0xf91de6dc: GetTrxDbConn+0x001c:        mov      %o7, %o7
0xf91de6e0: GetTrxDbConn+0x0020:        sethi    %hi(0x2d1400), %o5
0xf91de6e4: GetTrxDbConn+0x0024:        xor      %l6, 88, %l4
0xf91de6e8: GetTrxDbConn+0x0028:        inc      420, %o5
0xf91de6ec: GetTrxDbConn+0x002c:        sethi    %hi(0x1000), %l5
0xf91de6f0: GetTrxDbConn+0x0030:        add      %o5, %o7, %l3
0xf91de6f4: GetTrxDbConn+0x0034:        add      %l5, 868, %l1
0xf91de6f8: GetTrxDbConn+0x0038:        add      %l3, %l4, %l2
0xf91de6fc: GetTrxDbConn+0x003c:        ld       [%l2], %l0
0xf91de700: GetTrxDbConn+0x0040:        ld       [%l0 + %l1], %o4
0xf91de704: GetTrxDbConn+0x0044:        cmp      %o4, 0
0xf91de708: GetTrxDbConn+0x0048:        be,a,pn  %icc,GetTrxDbConn+0x68 ! 0xf91de728
0xf91de70c: GetTrxDbConn+0x004c:        ld       [%i5 + 128], %i2
0xf91de710: GetTrxDbConn+0x0050:        ld       [%o4 + 88], %l7
0xf91de714: GetTrxDbConn+0x0054:        cmp      %i5, %l7
0xf91de718: GetTrxDbConn+0x0058:        bne,a,pn  %icc,GetTrxDbConn+0x68        ! 0xf91de728
0xf91de71c: GetTrxDbConn+0x005c:        ld       [%i5 + 128], %i2
0xf91de720: GetTrxDbConn+0x0060:        ret
0xf91de724: GetTrxDbConn+0x0064:        restore  %g0, 0, %o0
0xf91de728: GetTrxDbConn+0x0068:        ld       [%i2 + 188], %i1
0xf91de72c: GetTrxDbConn+0x006c:        ld       [%i1 - 16], %i3
0xf91de730: GetTrxDbConn+0x0070:        cmp      %i3, 0
0xf91de734: GetTrxDbConn+0x0074:        bge,pn   %icc,GetTrxDbConn+0x90 ! 0xf91de750
0xf91de738: GetTrxDbConn+0x0078:        add      %i2, 188, %i4
0xf91de73c: GetTrxDbConn+0x007c:        clr      %o0
0xf91de740: GetTrxDbConn+0x0080:        call     RequiredConditionIsFalse [PLT] ! 0xf94b0684
0xf91de744: GetTrxDbConn+0x0084:        mov      84, %o1
0xf91de748: GetTrxDbConn+0x0088:        ld       [%i4], %i1
0xf91de74c: GetTrxDbConn+0x008c:        ld       [%i5 + 388], %i0
0xf91de750: GetTrxDbConn+0x0090:        call     GetTrxDbConn   ! 0xf90e0e00
0xf91de754: GetTrxDbConn+0x0094:        restore  %g0, 0, %g0
0xf91de758: GetTrxDbConn+0x0098:        unimp    0x0
...
...

To see the actual C++ instruction which seg faulted, compile the binary with -g (debug) option, and reproduce the crash. If the source code is readable from the location where you run the dbx session, you will see the actual high level instructions.


Some fun with an active process

The objective of this section is to show how to use some of the dbx commands to get some useful information, from a running MT process.

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  2754 giri      399M  302M sleep   59    0   0:00:34 2.0% siebmtshmw/21

% /opt/SS11/SUNWspro/prod/bin/dbx - 2754
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.5' in your .dbxrc
Reading -
Reading ld.so.1
Reading libsslcwsl.so
Reading libssscsci.so
Reading libssscscf.so
...
...
Reading libsscasvbc.so
Reading libswcasvfr.so
Attached to process 2754 with 21 LWPs
t@1 (l@1) stopped in __pollsys at 0xfd13d1c4
0xfd13d1c4: __pollsys+0x0004:   ta       8

(dbx) threads
 >    t@1  a  l@1   ?()   running          in  __pollsys() <- t@1 is always the default current thread under dbx
      t@2  b  l@2   MwTimerThread()   sleep on 0xfb80f4c0  in  __lwp_park()
      t@3  b  l@3   MwAsyncSignalThread()   sleep on 0xfd774078  in  __lwp_park()
      t@4  b  l@4   MwThread()   running          in  __pollsys()
      t@5  b  l@5   MwThread()   running          in  __pollsys()
      t@6  b  l@6   MwThread()   sleep on 0xf9b7eb80  in  __lwp_park()
      t@7  b  l@7   MwThread()   running          in  __pollsys()
      t@8  b  l@8   MwThread()   running          in  _so_recv()
      t@9  b  l@9   MwThread()   sleep on 0xf927fb68  in  __lwp_park()
     t@10  b l@10   MwThread()   sleep on 0xf877f500  in  __lwp_park()
     t@11  b l@11   MwThread()   sleep on 0xf867fa40  in  __lwp_park()
     t@12  b l@12   MwThread()   sleep on 0xf857fa50  in  __lwp_park()
     t@13  b l@13   MwThread()   sleep on 0xf847fa38  in  __lwp_park()
     t@14  b l@14   MwThread()   running          in  __pollsys()
     t@15  b l@15   MwThread()   sleep on 0xf827f490  in  __lwp_park()
     t@16  b l@16   MwThread()   running          in  __pollsys()
     t@17  b l@17   MwThread()   sleep on 0xf807f490  in  __lwp_park()
     t@18  b l@18   MwThread()   running          in  __pollsys()
     t@19  b l@19   MwThread()   sleep on 0xf4c7f490  in  __lwp_park()
     t@20  b l@20   MwThread()   running          in  __pollsys()
     t@21  b l@21   MwThread()   sleep on 0xf4a7f490  in  __lwp_park()

Put a break point in thread 21 (t@21) for all calls to memcpy():

(dbx) stop in memcpy -thread t@21
More than one identifier 'memcpy'.
Select one of the following:
 0) Cancel
 1) `libc.so.1`memcpy
 2) `libc_psr.so.1`memcpy
 a) All
> a
dbx: warning: 'memcpy' has no debugger info -- will trigger on first instruction
dbx: warning: 'memcpy' has no debugger info -- will trigger on first instruction
Will create handlers for all 2 hits
(2) stop in _private_memcpy -thread t@21 <- implicit break point set by dbx
(3) stop in _memcpy -thread t@21  <- implicit break point

(dbx) cont
t@21 (l@21) stopped in _memcpy at 0xfe1f04c0
0xfe1f04c0: _memcpy       :     nop

Note that dbx is synchronous -- when any thread or lightweight process (LWP) stops, all other threads and LWPs stop as well.

(dbx) thread
current thread ($thread) is t@21

(dbx) where
current thread: t@21
=>[1] _memcpy(0x5080e14, 0xff406b38, 0x2, 0x36, 0x1, 0x6c), at 0xfe1f04c0
  [2] SSstring::GetWriteBuffer(0xf4a7e6ac, 0xff406b28, 0xff874e8c, 0x32, 0x0, 0xff3b2ef0), at 0xff31ffcc
  [3] sciProcState::sciBlock::FormatLatchName(0xf4a7e6ac, 0x1, 0x7, 0x853c, 0xffa30bd8, 0x8400), at 0xffa02744
  [4] sciProcState::sciProcState(0x5ad31f8, 0xf9fc0000, 0xf4a7e644, 0xff406b3c, 0x0, 0x0), at 0xffa012c4
  [5] sciProcState::GetSciProcState(0xf4a7e7f8, 0x26fcb8, 0x5ad31f8, 0xff88db30, 0x5f5e4, 0x61e6c90), at 0xffa014f0
  [6] SciCheckShutdown(0xf4a7e8cc, 0x34151f8, 0x74, 0x26fcb8, 0x0, 0x2ef798), at 0xff9fe0e4
  [7] SciGetInterrupt(0x0, 0x6a20950, 0x0, 0xf4a7e864, 0x25cd94, 0x1da84), at 0xff9fde40
  [8] _smiMessageQ::ProcessMessage(0x15f85c0, 0x6a20950, 0x0, 0x0, 0x24a360, 0x32e18f0), at 0x2158e4
  [9] _smiMessageQ::ProcessRequest(0x3380c48, 0x6a20950, 0x191, 0x2, 0x5ae22f0, 0x15f85c0), at 0x21461c
  [10] _smiWorkQueue::ProcessWorkItem(0x15f98b8, 0x3380c48, 0x6a20950, 0x5ae2390, 0x0, 0x101f180), at 0x208d08
  [11] _smiWorkQueue::WorkerTask(0x15f98b8, 0x5b7f6b8, 0x3326338, 0x1500e0, 0x0, 0x0), at 0x208764
  [12] SmiThrdEntryFunc(0x32f72d8, 0x70000f, 0x700010, 0x0, 0x0, 0x0), at 0x1f7a0c
  [13] OSDWslThreadStart(0x3380568, 0x1f75a0, 0x3380568, 0x15, 0x0, 0x3380760), at 0xfefdbec8
  [14] _AfxThreadEntry(0xf4b7de5c, 0x3386210, 0x0, 0x1, 0x0, 0x17289c), at 0xfeb95730
  [15] MwThread(0x1, 0x0, 0x1, 0x0, 0xfd76bed0, 0x33cdc40), at 0xfd671230

Let's step into memcpy() with stepi, and observe how the thread state changes.

(dbx) stepi
t@21 (l@21) stopped in _memcpy at 0xfe1f04c4
0xfe1f04c4: _memcpy+0x0004:     nop

(dbx) threads
      t@1  a  l@1   ?()   running          in  __pollsys()
      t@2  b  l@2   MwTimerThread()   sleep on 0xfb80f4c0  in  __lwp_park()
      t@3  b  l@3   MwAsyncSignalThread()   sleep on 0xfd774078  in  __lwp_park()
      t@4  b  l@4   MwThread()   running          in  __pollsys()
      t@5  b  l@5   MwThread()   running          in  __pollsys()
      t@6  b  l@6   MwThread()   sleep on 0xf9b7eb80  in  __lwp_park()
      t@7  b  l@7   MwThread()   running          in  __pollsys()
      t@8  b  l@8   MwThread()   running          in  _so_recv()
      t@9  b  l@9   MwThread()   sleep on 0xf927fb68  in  __lwp_park()
     t@10  b l@10   MwThread()   sleep on 0xf877f500  in  __lwp_park()
     t@11  b l@11   MwThread()   sleep on 0xf867fa40  in  __lwp_park()
     t@12  b l@12   MwThread()   sleep on 0xf857fa50  in  __lwp_park()
     t@13  b l@13   MwThread()   sleep on 0xf847fa38  in  __lwp_park()
     t@14  b l@14   MwThread()   running          in  __pollsys()
     t@15  b l@15   MwThread()   sleep on 0xf827f490  in  __lwp_park()
     t@16  b l@16   MwThread()   running          in  __pollsys()
o    t@17  b l@17   MwThread()   breakpoint       in  _memcpy()
o    t@18  b l@18   MwThread()   breakpoint       in  _memcpy()
o    t@19  b l@19   MwThread()   breakpoint       in  _memcpy()
     t@20  b l@20   MwThread()   running          in  __pollsys()
\*>   t@21  b l@21   MwThread()   single stepped   in  _memcpy()

In the above example, t@17, t@18 and t@19 are stopped at calls to memcpy(); and t@21 stepped into memcpy(). Get out of memcpy() with step up command.

(dbx) step up
_memcpy returns 84413972
t@21 (l@21) stopped in SSstring::GetWriteBuffer at 0xff31ffd4
0xff31ffd4: GetWriteBuffer+0x0114:      ld       [%i1 + 4], %i2

Clear the break point (in current thread) with clear command

(dbx) cont
t@21 (l@21) stopped in _memcpy at 0xfe1f04c0
0xfe1f04c0: _memcpy       :     nop

(dbx) clear
cleared (3) stop in _memcpy -thread t@21
Locks

thread -blocks [<tid>] lists all locks held by the given thread, blocking other threads. If tid is not specified, dbx lists the locks held by the current thread. In the following example, t@21 (current thread) is not holding any locks.

(dbx) thread -blocks
Locks held by t@21:

thread -blockedby [<tid>] shows the synchronization object (monitor) on which the given thread is blocked. If tid is not specified, dbx shows this information for the current thread. Note that only sleeping threads must be in blocked state.

(dbx) thread -blockedby t@10
Thread t@10 is blocked by:
0xf877f500 (0xf877f500): thread  condition variable

(dbx) thread -blockedby t@12
Thread t@12 is blocked by:
0xf857fa50 (0xf857fa50): thread  condition variable

(dbx) thread -blockedby t@17
Thread t@17 is not asleep

syncs command lists all synchronization objects ie., locks/monitors.

(dbx) syncs
All locks currently known to libthread:
0x01020320 (0x01020320): thread  mutex(unlocked)
0x010203f8 (0x010203f8): thread  mutex(unlocked)
0xf827f490 (0xf827f490): thread  condition variable
0xf827f4a0 (0xf827f4a0): thread  mutex(unlocked)
0xf877f500 (0xf877f500): thread  condition variable
0xf877f510 (0xf877f510): thread  mutex(unlocked)
0xf927fb68 (0xf927fb68): thread  condition variable
0xf927fb78 (0xf927fb78): thread  mutex(unlocked)
0xf867fa40 (0xf867fa40): thread  condition variable
0xf867fa50 (0xf867fa50): thread  mutex(unlocked)
0xf9b7eb80 (0xf9b7eb80): thread  condition variable
0xf9b7eb90 (0xf9b7eb90): thread  mutex(unlocked)
0x015c2ed8 (0x015c2ed8): thread  mutex(unlocked)
0x015c2f38 (0x015c2f38): thread  mutex(unlocked)
0x015c2f18 (0x015c2f18): thread  mutex(unlocked)
0x015c2dd8 (0x015c2dd8): thread  mutex(unlocked)
0x015c34d8 (0x015c34d8): thread  mutex(unlocked)
0x03325fb8 (0x03325fb8): thread  mutex(unlocked)
0x033264b8 (0x033264b8): thread  mutex(unlocked)
0x033261b8 (0x033261b8): thread  mutex(unlocked)
0x017a6ce8 (0x017a6ce8): thread  mutex(locked)
0xfa4f4314 (0xfa4f4314): process mutex(locked)
0x0332c438 (0x0332c438): thread  mutex(unlocked)
0x0332c348 (0x0332c348): thread  mutex(unlocked)
0x02fcd7e8 (0x02fcd7e8): thread  mutex(unlocked)
0x0028f860 (0x0028f860): thread  mutex(unlocked)
__1cUCSSSISLocalTransSrvrKs_instLock_+0x8 (0xff1ee220): thread  mutex(unlocked)
0x034150e8 (0x034150e8): thread  mutex(unlocked)
0x034151d8 (0x034151d8): thread  mutex(unlocked)
__uberdata+0x80 (0xfd168c40): thread  mutex(unlocked)
0x01878b98 (0x01878b98): thread  mutex(unlocked)
0x01878aa8 (0x01878aa8): thread  mutex(unlocked)
0xfa4c7e9c (0xfa4c7e9c): process mutex(unlocked)
libc_malloc_lock (0xfd1676f8): thread  mutex(unlocked)
0x0179cb30 (0x0179cb30): thread  mutex(unlocked)
0x0179c830 (0x0179c830): thread  mutex(unlocked)
0xfa5c2664 (0xfa5c2664): process mutex(unlocked)
0xfa5c2c94 (0xfa5c2c94): process mutex(unlocked)
0x0161dd90 (0x0161dd90): thread  mutex(unlocked)
0x0101f6e0 (0x0101f6e0): thread  mutex(unlocked)
0x0101f718 (0x0101f718): thread  mutex(unlocked)
0x0101f770 (0x0101f770): thread  mutex(unlocked)
0x0101f508 (0x0101f508): thread  mutex(locked)
0x0101f5a8 (0x0101f5a8): thread  mutex(unlocked)
0x015bfe90 (0x015bfe90): thread  mutex(unlocked)
0x015bfe20 (0x015bfe20): thread  mutex(unlocked)
0x015bfe58 (0x015bfe58): thread  mutex(unlocked)

To get information about a synchronization object at a given address, use sync -info <address>

(dbx) sync -info 0x0028f860
0x0028f860 (0x28f860): thread  mutex(unlocked)
Lock is unowned
No threads are blocked by this lock

(dbx) sync -info 0xf877f500
0xf877f500 (0xf877f500): thread  condition variable

(dbx) sync -info 0xfd1676f8
libc_malloc_lock (0xfd1676f8): thread  mutex(unlocked)
Lock is unowned
No threads are blocked by this lock
Tracing

trace command can be used to trace the executed source lines, function calls, or variable changes. The following example traces the thread creation, and prints a message whenever a thread gets created.

(dbx) trace thr_create
(4) trace thr_create

(dbx) cont
trace: thread created t@22 on l@22
trace: thread created t@23 on l@23
Reading libsrlcver.so
Reading libsscafsbc.so
...

(dbx) threads
\*>    t@1  a  l@1   ?()   signal SIGINT in  __pollsys()
      t@2  b  l@2   MwTimerThread()   sleep on 0xfb80f4c0  in  __lwp_park()
      t@3  b  l@3   MwAsyncSignalThread()   sleep on 0xfd774078  in  __lwp_park()
 ...
 ...
     t@20  b l@20   MwThread()   running          in  __pollsys()
     t@21  b l@21   MwThread()   sleep on 0xf4a7f490  in  __lwp_park()
     t@22  b l@22   MwThread()   running          in  __pollsys() <- new thread
     t@23  b l@23   MwThread()   sleep on 0xea6ff490  in  __lwp_park() <- new thread

In the above example, there is no information about who created the threads t@22 & t@23. Even to get that information, use when command as shown below:

(dbx) when thr_create { echo "New thread $newthread was created by thread $thread"; }
(6) when thr_create { kprint "New thread ${newthread} was created by thread ${thread}"; }
(dbx) cont
New thread t@24 was created by thread t@10
New thread t@25 was created by thread t@24

$newthread and $thread are pre-defined variables of dbx, which holds the thread ID of a newly created thread, and the thread ID of the current thread, respectively.

Similarly thread exits can be traced as follows:

(dbx) trace thr_exit
(5) trace thr_exit

(dbx) cont
New thread t@26 was created by thread t@10
New thread t@27 was created by thread t@26
trace: thr_exit t@27
Suspending/Resuming threads

To suspend the execution of a thread, run the command thread -suspend <tid>; to resume the suspended thread, thread -resume <tid>

(dbx) thread -suspend t@26
Thread t@26 suspended

(dbx) thread -resume t@26
Thread t@26 unsuspended
Break point with stop command

The following example shows how to set a break point to stop the execution, when a new thread with id t@34 gets created.

(dbx) stop thr_create t@34
(9) stop thr_create t@34

(dbx) cont
t@10 (l@10) stopped in tdb_event_create at 0xfd1377e8
0xfd1377e8: tdb_event_create       :    retl
trace: thread created t@34 on l@34

(dbx) where   <- who initiated the new thread creation? entire call stack
current thread: t@10
=>[1] tdb_event_create(0x2, 0x1084, 0x3ff, 0x0, 0xfc8e1c00, 0x1000), at 0xfd1377e8
  [2] _thrp_create(0x180, 0x10f8, 0xfd1377e8, 0x1e, 0xc1, 0xfde32000), at 0xfd138c04
  [3] _pthread_create(0xf877f310, 0x0, 0xfd670ff4, 0xf877f318, 0x0, 0xfd168bc0), at 0xfd12d104
  [4] MwCreateThread(0x0, 0xfeb95630, 0xf877f414, 0x4, 0x0, 0x9383cb0), at 0xfd671460
  [5] CreateThread(0x0, 0x0, 0xfeb95630, 0xf877f414, 0x4, 0x9383cb0), at 0xfd67d124
  [6] CWinThread::CreateThread(0x9383c80, 0x4, 0x0, 0x0, 0xfd164278, 0x88cabc9), at 0xfeb95f1c
  [7] AfxBeginThread(0xffa7a420, 0x88cabc0, 0x0, 0x0, 0x4, 0x0), at 0xfeb958a4
  [8] WslCreateThread(0xfefdbe00, 0x5c135c0, 0x0, 0x88cabc0, 0xf877f584, 0x16b8c), at 0xffa7a4cc
  [9] OSDCreateThread(0x211200, 0x5b40660, 0x0, 0x0, 0x5ab1590, 0x5c135c0), at 0xfefdc16c
  [10] SmiDispatchThrdMain(0x101f180, 0x5ab1588, 0x5ab1590, 0xf877fd64, 0xf877fcec, 0xff40f8d4), at 0x1f53f4
  [11] OSDWslThreadStart(0x10b8ad0, 0x1f5240, 0x10b8ad0, 0xa, 0x0, 0x15d07e8), at 0xfefdbec8
  [12] _AfxThreadEntry(0xffbfeaac, 0x2f4948, 0x0, 0x1, 0x0, 0x17289c), at 0xfeb95730
  [13] MwThread(0x1, 0x0, 0x1, 0x0, 0xfd76bed0, 0x15cd558), at 0xfd671230
Light Weight Processes (LWPs)

Application (user) threads are not visible to the kernel. Kernel treats light weight processes (LWPs) as the only schedulable entities within a process. LWPs bridge the user level and kernel level threads. Each process contains one or more LWPs; and each LWP is associated with a kernel thread. Prior to Solaris 9, each of LWPs would run one or more user level threads (ie., 1xN). From Solaris 9 onwards, there is one LWP for every user level thread (ie., 1x1).

Use lwps command to list all LWPs in the process.

(dbx) lwps
  l@1 running          in _private_mprotect()
  l@2 running          in __lwp_park()
  l@3 running          in __lwp_park()
  l@4 running          in __pollsys()
  l@5 running          in __pollsys()
  l@6 running          in __lwp_park()
  l@7 running          in __pollsys()
  l@8 running          in _so_recv()
  l@9 running          in __lwp_park()
  l@10 running          in __lwp_park()
  l@11 running          in __lwp_park()
  l@12 running          in __lwp_park()
  l@13 running          in __time()
  l@14 running          in __pollsys()
  l@15 running          in __lwp_park()
  l@16 running          in __pollsys()
o l@17 breakpoint       in SSstring::GetWriteBuffer()
  l@18 running          in __lwp_unpark()
o l@19 breakpoint       in SSstring::GetWriteBuffer()
  l@20 running          in __pollsys()
\*>l@21 breakpoint       in SSstring::GetWriteBuffer()

lwp command displays the current LWP. To switch to a different LWP, use lwp <lwpid>. lwp -info [<lwpid>] shows some useful information for a given LWP.

(dbx) lwp
current LWP ($lwp) is l@21

(dbx) lwp -info
l@21 breakpoint       in SSstring::GetWriteBuffer()
masked signals are:

(dbx) lwp -info l@12
l@12 running          in __lwp_park()
masked signals are:

(dbx) lwp l@18
t@18 (l@18) stopped in __pollsys at 0xfd13d1c4
0xfd13d1c4: __pollsys+0x0004:   ta       8

Scalability issues

In general, MT applications that make heavy use of the standard {Solaris operating system's} memory allocator, may exhibit poor scalability. This problem occurs when multiple threads are in malloc() or free() waiting to obtain the mlock.

If the application suffers from this scalability issue, the top of the thread stacks (which can be obtained using either dbx or pstack command) will appear as below:

lwp_park
mutex_lock_queue
slow_lock
free
or
lwp_park
mutex_lock_queue
slow_lock
malloc

One such problem was described in this Solaris forum's thread slow_lock making application hang.

MT aware memory allocators

mtmalloc, umem libraries of Solaris distribution will resolve this kind of scalability problem. libmtmalloc was introduced in Solaris 7; and libumem was introduced in Solaris 9 Update 3. These userland memory allocators are packaged as a drop-in replacement to the standard malloc() and free() library calls; so, to take advantage of these allocators, link the MT application with any of these allocators.

mtmalloc, umem allocators are a redesign of the standard library; and hence results in finer grained locking. These libraries will significantly outperform the standard library in cases where multiple concurrent requests are made to the memory allocator. In the case of a single threaded application, the standard memory allocator will however provide better performance. The standard memory allocator also provides a smaller memory footprint. Note that the trade-off with mtmalloc, umem allocators is much bigger memory footprint, due to the way the memory gets allocated. For these reasons the standard memory allocator may be preferred in cases where the advantages of mtmalloc and umem, do not apply. Make sure to experiment with these memory allocators to see which one fits best for your application.

Linking with mtmalloc or umem

At compile time, the application can be linked against mtmalloc or umem library. Adding -lmtmalloc or -lumem, option to the link line results in the application being linked appropriately.

eg., % cc -mt -o my_program my_program.c -lmtmalloc or % cc -mt -o my_program my_program.c -lumem

You can check the library dependency with ldd my_program.

Quick workaround -- library interposition

If re-building the application by linking with mtmalloc or umem, is not feasible, either of these libraries can be preloaded with LD_PRELOAD environment variable, when the program is executed.

eg., % setenv LD_PRELOAD libmtmalloc.so % ./my_program or % setenv LD_PRELOAD libumem.so % ./my_program

You can verify whether the library is preloaded, with pldd `pgrep my_program`.

Resources:
  1. Debugging a Program With dbx
  2. Multithreaded Programming Guide
  3. malloc vs mtmalloc
Suggested Reading:
  1. Welcome to the CMT Era!
  2. Improving Application Efficiency Through Chip Multi-Threading
About

Benchmark announcements, HOW-TOs, Tips and Troubleshooting

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today