A "Performance" issue on a T2000

For the last week or so, I've been troubleshooting a performance issue on a T2000. Not quite there yet.

Background

The box that went out to the customer was a beta version of the T2000 that only had four cores (i.e. 16 virtual CPUs). It was also running an earlier release of the kernel (Solaris 10 HW 2 build 3) than what is currently shipping on T2000s (Solaris 10 HW 2 build 5), and that build had some nice little gotchas in it (the ipge hang-on-write problem, and another that would stop DTrace from working).

The customer here was running an MQ Series (v5.3) test and was interested in the number of packets per second that could be processed. The baseline was 4000/second on a v440, and they expected to be able to do 16000 on a T2000. The problem was that they were only seeing about 2000.

So, what happened?

First off, I tried to address the DTrace issue by bringing the beta box up to KU-20. While installing the patch, I noticed that a number of packages were missing, and thus not patched. When we tried to reboot this box, it complained about missing files and refused to boot.

OK, I thought, I asked these guys to make a flash archive of the full box before I started playing with it, we'll just re-install.

It appears that the people working with this box (which is in another country from me) did not have installation media. OK, I pointed them at where they could get CD images of the version currently shipping on T2000s, which they could use to bootstrap the flash image that they had taken.

Guess what? There is a known issue with booting this version on those beta boxes, which is addressed by adding a few lines to /etc/system. Unfortunately, last time I looked, most installation media is read-only.

It also turns out that the "flash archive" that was taken was ufsdumps of root and var.

By this time I had gained access to a released version of a T2000 in the Sydney lab. Fortunately it had a second disk that I could drop the ufsdump images onto, and after a bit of fiddling (mainly getting the IP address right and fixing vfstab to point at the correct disk) I was able to get it up and running locally. *phew*
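
For anyone who ends up doing the same dance, restoring a ufsdump image onto a spare disk looks roughly like this (a sketch only; the slices and the location of the dump image here are assumptions, not what was actually on this box):

# newfs /dev/rdsk/c0t1d0s0
# mount /dev/dsk/c0t1d0s0 /mnt
# cd /mnt; ufsrestore rf /path/to/root.ufsdump
# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t1d0s0

The "bit of fiddling" is then mostly editing /mnt/etc/vfstab (and the hostname/IP configuration) to match the new disk before booting from it.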

I applied a kludge to enable DTrace to work (commenting out some lines in sched.d) and ran up the server and the client.

mpstat shows a pretty idle box (95-98% idle). iostat shows very little disk activity. Time to crank out DTrace.
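
For the record, the mpstat and iostat invocations were nothing fancier than the likes of:

# mpstat 5
# iostat -xnz 5

(5 second samples; -xnz gives extended per-device statistics with descriptive device names and skips idle devices.)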

First off, who is doing read(2) system calls?

# dtrace -q -n 'syscall::read:entry { @[execname] = count();}
        tick-10s { printa(@); clear(@); }'
  nscd                                                              1
  fmd                                                               3
  java                                                          12284
  amqrmppa                                                      24566

  fmd                                                               0
  nscd                                                              2
  nfsmapid                                                          2
  java                                                          11938
  amqrmppa                                                      23869

  nscd                                                              0
  nfsmapid                                                          0
  fmd                                                               0
  ttymon                                                            2
  sac                                                               2
  java                                                          12306
  amqrmppa                                                      24611

OK, we have java and amqrmppa. The client is the java process, so we'll leave that alone, as we're interested in the server. Let's have a look at the number of reads per second that each thread of this process is doing.

# dtrace -q -n 'syscall::read:entry
        /execname == "amqrmppa"/ { @[tid] = count();}
        tick-10s { normalize(@,10); printa(@); clear(@); }'

        5             2563

        5             2462

        5             2557

There are two things of interest here.

  1. We are seeing around 2500 reads per second, which gels with the customer seeing about 2000 messages/second. This is probably a pretty good gauge of messages/second.
  2. Only one thread is doing any of the reading. The server is running single threaded!

Running single threaded might be good on a box that has a small number of very fast cpus, but is about the worst possible thing that you could do on a T2000.
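
A quick way to double check that only one thread is doing the work (a sketch; use whatever pid(s) the server is actually running under) is per-LWP microstate accounting, where only one LWP should show any real USR or SYS time:

# prstat -mLp `pgrep -d, amqrmppa`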

Out of interest, let's see which stacks are doing the reads, just to make sure we are in the right place.

# dtrace -q -n 'syscall::read:entry
        /execname == "amqrmppa"/ { @[ustack(20)] = count();}
        tick-10s { normalize(@,10); printa(@); clear(@); }'

              libc.so.1`_read+0x8
              amqcctca`cciTcpReceive+0xc24
              libmqmr.so`ccxReceive+0x1d0
              libmqmr.so`rriMQIServer+0x2f4
              libmqmr.so`rrxResponder+0x52c
              libmqmr.so`ccxResponder+0x14c
              libmqmr.so`cciResponderThread+0xac
              libmqmcs.so`ThreadMain+0x890
              libc.so.1`_lwp_start
             2534

Which kind of looks like we have a server receiving packets.

One other thing that I noticed is that each time I killed and restarted the client, it looks like it attaches to a new single thread in the server.

I spent quite some time going through the mqm trees and Google, but so far I have been unable to come up with a way to increase the number of threads in the server.

For all intents and purposes we are running single threaded. If we can increase the number of server threads to match the platform, then I would expect to see an incredible increase in the #packets/second processed.

If any of my readers have any suggestions on how to increase the number of server threads, I'd love to hear from you. MQ Series is not something that I've spent a lot of time with.

An Aside

I should mention one other thing that is incredibly useful if you happen to have a machine running a relatively current Nevada or OpenSolaris.

As Bryan mentioned when he addressed SOSUG in Sydney, the output from the dtrace command when given the -l and -v flags has been enhanced to also give you the types of the arguments. I used this a bit while looking at other things to get a feel for the system; it saved me having to dig up the reference manual. For example:

$ dtrace -v -l -n io:::wait-start
   ID   PROVIDER            MODULE                          FUNCTION NAME
  514         io           genunix                           biowait wait-start

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: bufinfo_t *
                args[1]: devinfo_t *
                args[2]: fileinfo_t *

Updates

Update 1

I suspect that what we are seeing here is a client that does a connect and then spawns all of its threads. It looks like the way the server works is that we get a thread per connection.

I'm currently looking at a way to verify this suspicion.
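
One approach (a sketch only; the probe names below are the obvious candidates rather than something I've confirmed against this workload) is to watch connections being accepted by the server alongside LWPs being created in it. A new LWP for every accept would confirm the thread-per-connection model:

# dtrace -q -n 'syscall::accept*:return
        /execname == "amqrmppa"/ { printf("%Y  accept on tid %d\n", walltimestamp, tid); }
        proc:::lwp-create
        /execname == "amqrmppa"/ { printf("%Y  created lwp %d\n", walltimestamp, args[0]->pr_lwpid); }'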

Update 2

Just for kicks, I modified MQSender.properties back to 1 client thread and then proceeded to start up 8 instances of the client.

This looks much better: we are hovering just under 16000/second on the server side. What is peculiar is that about every 38 seconds we see the count drop to 0 for about 4-6 seconds. At the same time we see idle jump to 100%, and, more interestingly, iostat shows a lot of IO to /var with active service times blowing out to half a second.
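
The io provider (the same one mentioned in the aside above) is a handy way of seeing who and what is behind those bursts to /var. Something along these lines (a sketch, not something I've run against this particular test):

# dtrace -q -n 'io:::start
        { @[execname, args[2]->fi_pathname] = sum(args[0]->b_bcount); }
        tick-5s { printa("%-12s %-48s %@d bytes\n", @); trunc(@); }'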

Playing with the SCSI write cache and forcing the filesystem to forcedirectio does not appear to help us.
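
For the record, "playing with" amounted to the likes of enabling the write cache via format -e (select the disk, then cache -> write_cache -> enable) and remounting with forcedirectio, along these lines (the device and slice here are assumptions):

# mount -F ufs -o remount,forcedirectio /dev/dsk/c0t0d0s5 /var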

Looking through the Java source for the client, it looks like the connection is shared between all of the threads created. It would probably be more useful (and more likely to reflect real life) if each thread had its own connection.

Update 3

I've got to say that I'm also a bit suspicious of the relevance (to reality) of "benchmarking" an application server platform with the client and server both living on the one machine. Do people actually do this in real life where the server is likely to be pushed to its limit? I would have thought it much more reasonable (and likely) to run the application server separately from its clients.

It might be interesting to try splitting the client and server onto two different boxes. Unfortunately I've only got one T2000 to play with.
