DTrace and the mighty Hercules


One of the advanced features of Solaris is DTrace, a powerful tool for troubleshooting system or application problems, very frequently used for performance issues. It lets you look inside a system or application - in a non-intrusive way - with the best observability of any OS I know. In fact, a company I visited just a few weeks ago is very interested in moving from Linux to Solaris because of the observability that DTrace provides, along with other Solaris features designed for production work in the enterprise.

Anyhow, that got me thinking about using DTrace to look at a workload I sometimes run for hobby purposes, the Hercules System/370, ESA/390, and z/Architecture Emulator. This lets me run a simulated mainframe, on which I can boot ("IPL") operating systems like Linux and ancient versions of MVS or VM. (Technically it's possible to run current mainframe operating systems, but they are proprietary and not made available for download or licensed operation.)

So, I booted up Hercules (which I ported to Solaris by making a few minor changes, incorporated in the Hercules 3.06 download) to see what I could learn about where time is spent when Hercules is running.

First the stethoscope, then the MRI

An important thing about DTrace is that it's not the first place to start when you're trying to understand the performance of a system. I like to say that it's like an MRI or CAT scan in medical diagnosis. You don't start with the advanced tool for a deep "look inside" - you start with the general tools (like a stethoscope) to get an overall picture of what is going on.

So, the first thing I did was run some standard "*stat" commands while running Hercules:

$ vmstat 3
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr cd s0 s1 s2   in   sy   cs us sy id
 0 0 0 2728320 1004208 54 333 0 0  0  0 54  8 -1 -1 -1 1147 3458 1700  9  2 89
 0 0 0 1867328 129140 6  26  0  0  0  0  0 36  0  0  0 4280 23438 8437 53 6 41
 1 0 0 1834364 96368  0   0  0  0  0  0  0 163 0  0  0 4314 20460 8153 52 6 42
 0 0 0 1846368 108384 0   0  0  0  0  0  0 58  0  0  0 8277 20873 8582 52 7 41
 0 0 0 1852356 114376 0   0  0  0  0  0  0  0  0  0  0 3629 20029 7300 54 4 42
 0 0 0 1848048 110092 0   0  0  0  0  0  0 57  0  0  0 4501 24527 9419 53 6 41
 0 0 0 1823660 88100 50 2613 0  0  0  0  0 61  0  0  0 3856 21135 7355 62 9 29
^C
$ mpstat 3
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  165   0    8   602  126  877   66   48   20    1  1771    9   2   0  89
  1  168   0   28   563  242  853   66   48   20    1  1769   10   2   0  88
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    9   0    1  2322  368 3990  338  143   62    0 10859   55   8   0  37
  1    3   0   21  2026  305 4478  573  142   51    0 11870   50   8   0  42
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0  5449  390 4138  438  150   98    0  9052   63   6   0  31
  1  116   0 3187  2097  317 4313  384  151   68    0 12272   42   8   0  50
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    5   0    0  2038  165 4323  535  164  122    0 11398   52   6   0  42
  1    0   0   19  1992  400 4078  438  157   90    0 11289   59   6   0  36
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0  205  2475  479 4313  443  130   67    0 11991   50   9   0  41
  1    0   0  223  2208  276 4293  487  135   85    0 10710   55   8   0  37
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0  1919  336 3847  502  143  114    0 10301   64   6   0  29
  1  181   0   19  1927  392 4269  662  143  115    0 11507   56   7   0  3

Okay: I'm not short on memory, I'm driving some CPU, and I'm spending most of that CPU time in user state, not in the kernel.

Who is making system calls?

Even though we're not spending most of our time in the kernel, we are issuing a substantial number of system calls per second (the syscl column), so for fun let's see which ones they are - this is where DTrace comes in.

First, let's see which programs are issuing system calls; there's other stuff going on and we shouldn't assume it's Hercules. I'm using one of the most frequently seen DTrace one-liners: it fires on any system call (the syscall provider) and builds an aggregate of system call counts, indexed by the name of the program issuing them (the execname built-in DTrace variable). By default, stopping the command with Control-C dumps the aggregate to the terminal. (Note: I'm running DTrace from the root userid, but that's not necessary - you can use role-based authorization to assign the privileges needed to run DTrace to a regular userid. This is my workstation, so I just used root.)

# dtrace -n 'syscall::: { @[execname] = count(); }'
dtrace: description 'syscall::: ' matched 472 probes
^C

  inetd                                                             1
  gnome-volume-man                                                  3
  iiimd                                                             3
  mdnsd                                                             5
  devfsadm                                                          7
  nwam-manager                                                     15
  sendmail                                                         19

  ... lines omitted for brevity ...  

  dtrace                                                         3170
  evolution                                                      3213
  firefox-bin                                                    3447
  hercules                                                       3951
  gnome-terminal                                                 6035
  rhythmbox                                                      7628

Which system calls is Hercules making?

Ah hah - playing music while I work is the leading cause of system calls! Well, let's disregard that and focus only on the system calls coming from Hercules, and see which calls they are. This is another well-known one-liner, this time using a predicate that restricts observation to calls issued by the Hercules binary. Let's index the calls by the name of the system call itself; I did this while running an application that reads from a simulated tape (the tape image is actually a disk file):
# dtrace -n 'syscall::: /execname == "hercules" / { @[probefunc] = count(); }'
dtrace: description 'syscall::: ' matched 472 probes
^C

  gtime                                                            16
  pollsys                                                          16
  ioctl                                                            18
  nanosleep                                                      1623
  yield                                                          6077
  write                                                         12380
  llseek                                                        62482
  lwp_park                                                     105148
  read                                                         116566

Well, that makes a lot of sense: I'm reading from a disk file, and sure enough, there are lots of read system calls. We can even use DTrace to see which file it is: when read(), write() or poll() are frequent system calls, aggregate on arg0 to the call, which is the file descriptor number, and then look up the file with pfiles.
# dtrace -n 'syscall::read:entry /execname == "hercules"/ { @[arg0] = count(); }'
dtrace: description 'syscall::read:entry ' matched 1 probe
^C

               34                2
               35            48440

The left column is the file descriptor number, and the right column is the number of times it was seen, so the vast majority of read system calls were for file descriptor 35. Using pfiles on the Hercules process I can see which file it was:
# pfiles 6376
6376:	/rpool/hercules/misc/hercules-3.06/.libs/hercules -f hercules.cnf
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0620 dev:315,0 ino:820483781 uid:103812 gid:7 rdev:24,4
      O_RDWR
      /dev/pts/4
   1: S_IFIFO mode:0000 dev:312,0 ino:2943 uid:103812 gid:10 size:0
      O_RDWR
  ...
  35: S_IFREG mode:0644 dev:182,65562 ino:53251 uid:103812 gid:10 size:1443627008
      O_RDWR|O_LARGEFILE
      /rpool/sirius/phase7/phase7.awstape
  ...

Now I know which system call was dominant, and which file it was for (the simulated tape file).

Which part of the program was making those system calls?

But wait, there's more! (just like on late night TV ads). What if I also want to know which part of the program inside the executable was issuing those system calls? Well, DTrace has access to both the kernel and user stacks, so we can issue:
# dtrace -n 'syscall::read:entry /execname == "hercules" / { @[ustack(5)] = count(); }'
dtrace: description 'syscall::read:entry ' matched 1 probe
^C

              libc.so.1`__read+0x7
              libhercd.so`cckd_read+0x78
                2

              libc.so.1`__read+0x7
              hdt3270.so`console_connection_handler+0x2f9
               28

              libc.so.1`__read+0x7
              libhercu.so`logger_thread+0x170
              0x632e7265
             1002

              libc.so.1`__read+0x7
              hdt3420.so`readhdr_awstape+0x53
            58414

              libc.so.1`__read+0x7
              hdt3420.so`read_awstape+0x120
            58414

Now I know exactly which functions (and offsets within those functions) were the most frequent issuers of reads. No surprise there: the dominant system call was read, and the reads were issued from the piece of code that handles reading from simulated tape drives. In this case I could have made an easy guess about what was going on, but DTrace made it possible to instrument this in a completely unmodified application.

Looking inside a program's entry points

But this has more implications: DTrace can see the objects within the executable, using the pid provider, so I can get even more detailed information. First, let's get a list of the probe points that are exposed when the program is running. In this case, I know that the process has process id (pid) 6376 (in later examples the pid may differ), so I can list its entry points this way:

# dtrace -l -n pid6376:::entry|more
   ID   PROVIDER            MODULE                          FUNCTION NAME
78992    pid6376          hercules                            _start entry
78993    pid6376          hercules                             __fsr entry
78994    pid6376          hercules                           _mcount entry
78995    pid6376          hercules                              main entry
...
... snip for brevity
...
79467    pid6376        libherc.so         set_or_reset_console_mode entry
79468    pid6376        libherc.so                    set_screen_pos entry
79469    pid6376        libherc.so                      erase_to_eol entry
79470    pid6376        libherc.so                      clear_screen entry
79471    pid6376        libherc.so                  set_screen_color entry
79472    pid6376        libherc.so               translate_keystroke entry
79473    pid6376        libherc.so                      console_beep entry
...
... snip for brevity
...
79484    pid6376        libherc.so                     configure_cpu entry
79485    pid6376        libherc.so                   deconfigure_cpu entry
79486    pid6376        libherc.so                        get_devblk entry
79487    pid6376        libherc.so                        ret_devblk entry
79488    pid6376        libherc.so             find_device_by_devnum entry
79489    pid6376        libherc.so                     attach_device entry
79490    pid6376        libherc.so                     detach_devblk entry
...
... snip for brevity
...
79650    pid6376        libherc.so          s370_branch_on_condition entry
79651    pid6376        libherc.so     s370_branch_on_count_register entry
79652    pid6376        libherc.so              s370_branch_on_count entry
79653    pid6376        libherc.so         s370_branch_on_index_high entry
79654    pid6376        libherc.so s370_branch_on_index_low_or_equal entry
79655    pid6376        libherc.so             s370_compare_register entry
79656    pid6376        libherc.so     s370_compare_logical_register entry
... snip

Hey, I can see the C functions that implement service routines, and the ones that implement the simulated instructions!

How many times do I invoke each function in the program?

Well, if I can do that, then I should be able to see how often they are invoked, and how much time is spent in them. So, I could issue a command that shows me when a given instruction is emulated, like:
dtrace  -n 'pid6376:libherc.so:s390_set_storage_key_extended:entry { trace(probefunc);}'
but that will run off the screen REAL FAST, as any given instruction will be executed zillions of times. Aggregates are our friend again, so instead I issue this (note: this was done on a different day, with a different pid):
# dtrace -n 'pid19284:::entry { @a[probefunc] = count(); }'
dtrace: description 'pid19284:::entry ' matched 11135 probes
^C
...
... I snipped out lines for low-frequency functions
...
  s390_shift_left_double                                          341
  s390_external_interrupt                                         343
  s390_perform_external_interrupt                                 343
  s390_store_characters_under_mask                                346
...
......................I snipped a bunch of lines.....................
...
  s390_compare_logical_register                                 11769
  memcpy                                                        12616
  s390_exclusive_or_register                                    12627
  s390_branch_and_link                                          13483
  s390_branch_and_save_register                                 14161
  s390_subtract_register                                        14351
  s390_move_immediate                                           14601
...
......................I snipped a bunch of lines.....................
...
  s390_move_character                                           40638
  s390_load_and_test_register                                   44384
  s390_branch_on_condition_register                             47965
  s390_compare_logical_immediate                                49598
  sigon                                                         49928
  ptt_pthread_trace                                             73559
  s390_compare_logical                                          76252
  s390_compare_register                                         83654
  s390_compare_logical_character                                91370
  s390_load_register                                            96943
  s390_test_under_mask                                         132845
  s390_load_address                                            158997
  s390_insert_characters_under_mask                            168542
  s390_store                                                   196300
  s390_load                                                    309268
  s390_branch_on_condition                                     725825

Wow - what this tells me is that Branch on Condition (BC) and Load (L) are the most frequently seen instructions. This will mean something to you if you program in mainframe assembly language, otherwise not. It gives a good hint that anything that speeds up processing of these instructions will make a difference.
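
If you'd rather watch the instruction mix change over time than collect one grand total, a small variation on that one-liner can periodically dump and reset the aggregate. Here's a minimal sketch of my own (not from the original session): it narrows the probes to libherc.so, where the probe listing above shows the instruction-emulation functions live, and uses the profile provider's tick probe to print a fresh tally every five seconds. The script name and interval are arbitrary; the pid is passed as an argument, as in the script shown later.

#!/usr/sbin/dtrace -s
/* instrate.d <pid> - count calls to functions in libherc.so and
 * dump/reset the tally every 5 seconds. */
pid$1:libherc.so::entry
{
        @calls[probefunc] = count();
}

tick-5sec
{
        printa(@calls);
        clear(@calls);
}

Enabling thousands of pid probes does add some overhead, so this is something to run briefly while the workload of interest is executing, not to leave in place.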

Quantifying actual time spent in functions

However, we can refine that: the above gives the number of times the most frequently used functions were invoked, but not how much time was spent in them. So let's use a different script, one that records the entry and return times of the probe points and builds two aggregates: the number of times each function is invoked, and a histogram of the time spent inside it. This is too complicated for a one-liner at the terminal, so I'm using a script I found (IIRC, in the DTrace Toolkit). It remembers the high-precision timestamp on entry to each function and computes the elapsed time spent in it on return.

#!/usr/sbin/dtrace -s

/* On entry to any function in the target process, push the
 * current high-resolution timestamp onto a per-thread stack. */
pid$1:::entry
{
        self->ts[self->stack++] = timestamp;
}

/* On return, pop the matching timestamp, count the call, and
 * add the elapsed time to a per-function histogram. */
pid$1:::return
/ self->ts[self->stack - 1] /
{
        this->elapsed = timestamp - self->ts[--self->stack];
        @[probefunc] = count();
        @a[probefunc] = quantize(this->elapsed);
        self->ts[self->stack] = 0;
}
The quantify.d DTrace script

Notice that the pid is passed as an argument to the script. Running this against pid 21080 gives the following output (with low-frequency and low-elapsed-time entries removed):

# dtrace -s quantify.d 21080 

... output snipped ...
  s390_compare_register                                         51374
  s390_subtract_logical_register                                52798
  s390_compare                                                  56810
  s390_load_and_test_register                                   59370
  s390_compare_logical                                          68099
  s390_and                                                      71375
  s390_branch_on_condition_register                             72790
  ptt_pthread_trace                                             76804
  s390_load_register                                           135094
  s390_add_logical_register                                    139985
  s390_store                                                   199027
  s390_test_under_mask                                         234316
  s390_load_address                                            293952
  s390_load                                                    393249
  s390_branch_on_condition                                     746037

The first aggregate dumped shows frequency of invocation by function, and that's similar to the earlier test: TM, LA, L, and BC are indeed frequent instructions.

But the aggregate based on time shows interestingly different results! For exposition purposes I removed the output for function calls related to sleeping or waiting on a condition, as they would be expected to have high elapsed times (a later test could be used to distinguish between elapsed time spent waiting for work to do or for I/O to complete, versus time spent waiting for a lock to be released).
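
One way to take waiting out of the picture entirely (a sketch of my own, not part of the original test) is to measure with vtimestamp, which counts only the nanoseconds the current thread has spent on CPU, instead of the wall-clock timestamp. The same script structure works; only the meaning of the histograms changes:

#!/usr/sbin/dtrace -s
/* Like quantify.d, but histograms on-CPU time per function call
 * rather than elapsed time, so blocked/waiting time is excluded. */
pid$1:::entry
{
        self->vts[self->stack++] = vtimestamp;
}

pid$1:::return
/ self->vts[self->stack - 1] /
{
        @a[probefunc] = quantize(vtimestamp - self->vts[--self->stack]);
        self->vts[self->stack] = 0;
}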

Instead, what is interesting is that while s390_branch_on_condition does get invoked a lot, it doesn't have high elapsed times - the left column represents time in nanoseconds, and this is a fast instruction in emulation. Other instructions, and a number of support routines, were invoked far less frequently, but represented a disproportionate amount of time relative to their number of invocations!

  s390_branch_on_condition                          
           value  ------------- Distribution ------------- count    
             512 |                                         0        
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  727692   
            2048 |                                         3758     
            4096 |@                                        14337    
            8192 |                                         103      
           16384 |                                         37       
           32768 |                                         39       
           65536 |                                         26       
          131072 |                                         40       
          262144 |                                         4        
          524288 |                                         1        
         1048576 |                                         0        

  s390_execute_b2xx                                 
           value  ------------- Distribution ------------- count    
            2048 |                                         0        
            4096 |@@@@@@@@@@@@@@@                          1635     
            8192 |@@@@@@@@@@@@                             1282     
           16384 |                                         14       
           32768 |                                         1        
           65536 |@@@@@@@@@                                990      
          131072 |                                         7        
          262144 |                                         7        
          524288 |@                                        150      
         1048576 |                                         22       
         2097152 |                                         1        
         4194304 |                                         0        
         8388608 |                                         0        
        16777216 |                                         6        
        33554432 |@                                        141      
        67108864 |                                         13       
       134217728 |                                         0        

  sprintf                                           
           value  ------------- Distribution ------------- count    
            4096 |                                         0        
            8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@               40       
           16384 |@@@@@@@@@                                14       
           32768 |                                         0        
           65536 |                                         0        
          131072 |                                         0        
          262144 |                                         0        
          524288 |                                         0        
         1048576 |                                         0        
         2097152 |                                         0        
         4194304 |                                         0        
         8388608 |                                         0        
        16777216 |                                         0        
        33554432 |                                         0        
        67108864 |                                         0        
       134217728 |                                         0        
       268435456 |                                         0        
       536870912 |                                         0        
      1073741824 |@@@@@                                    8        
      2147483648 |                                         0        

  s390_insert_characters_under_mask                 
           value  ------------- Distribution ------------- count    
             512 |                                         0        
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             29837    
            2048 |@                                        844      
            4096 |@                                        621      
            8192 |                                         5        
           16384 |                                         5        
           32768 |                                         8        
           65536 |@                                        675      
          131072 |@                                        749      
          262144 |@                                        601      
          524288 |@                                        569      
         1048576 |@                                        706      
         2097152 |@                                        1547     
         4194304 |@@                                       2278     
         8388608 |@                                        756      
        16777216 |                                         402      
        33554432 |@                                        866      
        67108864 |@@@                                      2776     
       134217728 |                                         0        

In this example, the long durations for some invocations of s390_insert_characters_under_mask make it disproportionately more expensive than its frequency of use would indicate. Optimizing the "tail" of long-duration calls, and for that matter reducing the number of calls to sprintf, could make a difference in performance.
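
A natural next step (again a sketch of my own, not something I ran in the original session) would be to find out who is responsible for those slow outliers, by capturing a user stack only when an invocation of s390_insert_characters_under_mask exceeds some threshold - the 1 millisecond cutoff and the script name below are arbitrary choices:

#!/usr/sbin/dtrace -s
/* slowicm.d <pid> - show the call stacks behind invocations of
 * s390_insert_characters_under_mask that take longer than 1 ms. */
pid$1::s390_insert_characters_under_mask:entry
{
        self->ts = timestamp;
}

/* Record the user stack only for the slow calls. */
pid$1::s390_insert_characters_under_mask:return
/ self->ts && timestamp - self->ts > 1000000 /
{
        @slow[ustack(10)] = count();
}

/* Always clear the per-thread timestamp on return. */
pid$1::s390_insert_characters_under_mask:return
/ self->ts /
{
        self->ts = 0;
}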

Summary

In this (slightly contrived) example, I've shown a little bit about how DTrace can be used for performance measurement. We were able to use standard Solaris tools to see how busy the system was, and then used DTrace to get counts of system calls by application and by system call. After that, we figured out which functions in the application were making those system calls and which file they were reading, and then we got the frequency counts and durations of the "hot" entry points in the application. This illustrates performance observability of an application not otherwise set up for instrumentation: no special compiler options were used and no code was changed, but we were able to see where we were spending our time.

This just scratches the surface of what can be done - it's possible to get a lot more information, such as the flow of control from function to function. With a little investment of effort (especially if you borrow existing DTrace scripts available on the 'net!) you can get incredible insight into the behavior of your applications or entire Solaris systems. There's really nothing remotely like this elsewhere, and I can't recommend it strongly enough.
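
As one example (a sketch, not something from the session above), DTrace's -F option indents output by function-call depth, so tracing both entry and return probes gives a readable call-flow trace. Assuming Hercules is still running as pid 6376, and restricting the trace to libherc.so to keep the volume down, it might look like this:

# dtrace -F -p 6376 -n 'pid$target:libherc.so::entry, pid$target:libherc.so::return {}'

Even restricted this way, the output scrolls by very quickly on a busy emulator, so it's best aimed at a single function or module of interest.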

For a great reference for learning more, go to the DTrace User Guide or OpenSolaris DTrace page.
