70x performance improvement in 5 minutes

A good friend of mine, a Systems Engineer/Engagement Architect in the UK, sent me a copy of a benchmark his customer was using to assess the performance of various types of SPARC machine. The benchmark is simplistic, but the customer was concerned about its performance on a T5220, and any customer concern is worth taking seriously.

The spirit of the customer benchmark was:

#!/bin/ksh

i=0
while [ $i -lt 63 ]
do
    ./run2_slow &

    echo Starting $i

    i=`expr $i + 1`
done

time ./run2_slow

which calls

#!/bin/ksh 

loop=0

while [ $loop -lt 1000 ]
do
       bc > /dev/null << E
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100
E
       loop=`expr ${loop} + 1`
done 

which executes in around 70 seconds.

Clive's version, which required 5 minutes of very simple coding changes:

#!/bin/ksh

i=0
while [ $i -lt 63 ]
do
    time ./run2_fast &

    echo Starting $i

    i=`expr $i + 1`
done

time ./run2_fast

which calls

#!/bin/ksh 

loop=0

while [ $loop -lt 1000 ]
do
        n=0
        ((n=n+100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100))
        ((loop=loop+1))
done

which runs the same number of iterations on the same machine in 1.01 seconds.

For the slower version, dtrace -s /usr/demo/dtrace/whoexec.d shows a huge number of fork/exec sequences: the script forks bc, which in turn forks dc, and the counter using expr adds another fork/exec per iteration. Less than 1% of the time spent in this script was actually calculation.
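
If you want to see that directly, counting successful exec()s by the name of the program being started tells the same story. This one-liner is my own illustration rather than the demo script itself:

# count new program images started while the benchmark runs
dtrace -n 'proc:::exec-success { @[execname] = count(); }'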

An interesting system-level bottleneck did drop out of this: the text segment of libc is faulted in as each process is created, as a result of a call to memcntl, something like this:

enoexec(5.11)$ truss -t memcntl /bin/true
memcntl(0xC4080000, 227576, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
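
To put a number on that cost during the benchmark, you could sum the time spent in memcntl(2) across all processes. This is a rough sketch of my own, not what was actually run here:

# total nanoseconds spent in memcntl(2), broken down by program name
dtrace -n 'syscall::memcntl:entry { self->ts = timestamp; }
           syscall::memcntl:return /self->ts/ {
                   @[execname] = sum(timestamp - self->ts);
                   self->ts = 0;
           }'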

Where you have many concurrent processes calling fork, you do end up with some lock contention in ufs_getpage, like this:

-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                   Caller                  
111484  18%  18% 0.00  1747392 0x30001293610          ufs_lockfs_begin_getpage+0xc8

      nsec ------ Time Distribution ------ count     Stack                   
       512 |                               5         ufs_getpage+0x7c        
      1024 |                               26        fop_getpage+0x90        
      2048 |                               43        segvn_faulta+0x114      
      4096 |                               61        as_faulta+0x138         
      8192 |                               131       memcntl+0x8d0           
     16384 |                               711       syscall_trap32+0xcc     
     32768 |@@@                            14042     
     65536 |@@@@                           17764     
    131072 |@@@@                           17988     
    262144 |@@                             9247      
    524288 |                               2818      
   1048576 |                               3215      
   2097152 |@                              5994      
   4194304 |@@@@@@                         24221     
   8388608 |@@                             10884     
  16777216 |                               3414      
  33554432 |                               310       
  67108864 |                               7         
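
Lock contention profiles of that form can be gathered with lockstat with stack traces enabled; the flags below are an assumption on my part, not necessarily the invocation used here:

# record contention events with 8-deep stacks while the benchmark runs
lockstat -C -s 8 -D 20 sleep 60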

It would be interesting to try this on a T5220 with ZFS as the root filesystem, but where is the bottleneck really? I would argue that there may be a little room for improvement in UFS, but that the benchmark is pathological. My experience of dealing with performance issues in the field over the last six years is that the large 15K/25K machines, and then the T2000 and now the T5220, are very good indeed at exposing applications which don't scale. This is an example of where a simplistic benchmark can lead to incorrect, or at least very incomplete, conclusions about the underlying platform. The benchmark as the customer ran it just measured fork/exec performance; you would not implement a business solution like that, or would you?

Getting a 70x improvement from changes to Solaris or the underlying hardware is going to be a significant challenge. A 70x speedup from application changes was viable in this case, and a little consulting help might not go amiss.

Comments:

-- snip ---
loop=0

while [ $loop -lt 1000 ]
do
n=0
-- snip --
... it will run even faster if you declare "loop", "n" and "i" as "integer" since otherwise these variables are stored as _strings_ and converted to integers and back.
And "while [ $loop -lt 1000 ]" should be while (( loop < 1000 )) for the same reason.

BTW: See http://www.opensolaris.org/os/project/shell/shellstyle

Posted by Roland Mainz on May 02, 2008 at 02:30 PM BST #

Good comment. About 30% faster with your suggestions. The link to the shellstyle guide has some good stuff I had not come across before :-)

Posted by Clive King on May 05, 2008 at 03:51 AM BST #

BTW: ksh93 has "for(( expr1 ; expr2 ; expr3 )) ; do bla ; done"-style loops, too.
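
A sketch of the counter loop written that way (assuming ksh93):

for (( loop=0 ; loop < 1000 ; loop++ ))
do
        ((n=n+100/100*100/100))    # body as before, shortened
done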

Posted by Roland Mainz on May 06, 2008 at 07:50 PM BST #
