Thursday Jan 21, 2010

Some findings with the Amber Road

Recently I had the good fortune to do some testing on an Amber Road, more officially known as the Sun Storage 7410 Unified Storage System.

The machine had 64 GB of internal memory and two quad-core CPUs, plus a single JBOD holding 24 disks, one of which was an SSD optimized for write access (a "writezilla"). The other 23 disks were "normal" 7200 RPM disks of 1 TB each. For connectivity, some of the four gigabit Ethernet NICs were used.
Besides the above mentioned write-optimized SSD, this system also had two other SSDs optimized for reading ("readzillas").
This appliance used ZFS as its file system. The writezilla held the ZFS intent log; the readzillas were used as a second-level ZFS file system cache (L2ARC).

For more technical information about this product please check the website at:


The original trigger to execute these tests was a remark from an ISV that they had problems with applications using their Amber Road while an rsync was creating a backup of their email archive(s) on a Linux box to a volume on their appliance.

And indeed, the throughput numbers we received were not what we expected.

In order to get a better understanding a 7410 was setup together with some load generating equipment in one of our labs. The ISV was able to send us a copy of a part of their email archive we could use for testing. The Amber Road was configured in a comparable way to the one at the ISV location. The initial tests were indeed below expectations. The quest to find the bottleneck began.


There were quite a few elements that were "unknown" to me, one of which was the internals of the Amber Road: the hardware as well as the software (operating system, ZFS pool and file systems, network interfaces, etc.)
The NICs: were they somehow dependent on each other? How much traffic can a single NIC handle, both in throughput and in number of packets?
The volume(s) on the Amber Road were accessed with NFS over gigabit Ethernet. The Linux box was used to simulate the mail machine. Could it handle the required load? It used the Linux volume manager with ext3 as the file system. Could that combination deliver the required load? And was rsync, the tool used to make the backup, able to drive all these resources?

I started with the front end: the Linux box. Especially the volume manager. Using

find <mail archive on lvm volume> -print | cpio -oc >/dev/null

as a way to simulate the reading part. I had a second machine on which Solaris 10 was installed, with exactly the same hardware as the Linux box, and of course I couldn't resist doing the same reading tests on that one. I was not impressed with the load that could be generated from a couple of internal disks. Fortunately I had a couple of disk arrays "lying around".
I used a Sun StorageTek 6140 array holding 12 spindles spinning at 15K rpm. The array has a gigabyte of cache and two controllers.

The result of the above command:

1: 6140 Linux lvm (2 6140 volumes): 33 min
2: 6140 Solaris UFS on 2 6140 Disk Suite volumes: 16 min
3: 6140 Linux nolvm (1 6140 volume): 34 min
4: 6140 Linux nolvm (4 6140 vols in parallel): 24 min

Tests 1 and 2 were performed with the same 6140 configuration: same hardware, different operating systems and volume managers. To double-check the Linux result I used half the disk capacity in the array without lvm (test 3), and four volumes without lvm (test 4). In this last test the first volume was mounted under /data1, the second volume under /data2, etc. To parallelize, multiple find /dataX -print | cpio -o >/dev/null sessions were run, one per file system, in parallel.
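The parallel read load of test 4 can be sketched as a small shell helper. The function name and the closing echo are my own additions, not part of the original test setup:

```shell
#!/bin/sh
# Launch one sequential reader per file system, all in parallel.
# Example (as in test 4): parallel_read /data1 /data2 /data3 /data4
parallel_read() {
    for fs in "$@"; do
        # each stream walks its own tree and discards the archive
        find "$fs" -print | cpio -oc > /dev/null 2>&1 &
    done
    wait                          # block until every reader is done
    echo "all readers finished"
}
```

The elapsed time reported for such a test is then determined by the slowest of the streams.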

After this result we decided to look into the other elements of the black box using Solaris only. Just to ensure we could generate load once we started testing the Amber Road.


The next part to check was the Amber Road network, or better: how much load can the Amber Road NICs sustain? A single volume on the Amber Road was NFS mounted with the following options:


Since an rsync backup includes many file system operations (read, write, stat, etc.) and is optimized to not copy a file if it already resides on the backup volume, I replaced the rsync with a set of 'find ... | cpio' sequences run in parallel. Each find command (see above) used its own private mail archive. After some analysis it was clear NFS was not a bottleneck. Using the analytics from the Amber Road, the screen dump showed the three NICs used were all running close to the theoretical speed of gigabit Ethernet. In total "pumping" 300 MB/sec over three interfaces looked perfect to me, with each interface handling close to 15,000 NFS ops/sec.


With these results I concluded that both client and network were running fine, and I needed to focus on the 7410 itself. I had 24 disks in this appliance. The GUI was not my favorite tool for looking at disk utilization and other statistics. Now I must admit that I have been in this business for quite some years and still prefer the command line. Although it might be good to understand that the punch card period was even before my time.....


The ISV remark included a statement about other applications suffering while an rsync was running. The rsync runs themselves also took a remarkably long time, which would translate to an average transfer rate of 50 MB/sec.
I changed the approach a little bit and first checked what could be seen as reasonable for the Amber Road. Fortunately there is an open source load generating tool called vdbench available. It has many possibilities, one of which is simulating file system tests. For this it creates a directory structure with as many files as specified, all driven by a parameter file. Here is one example, for a pure read-only test:
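The parameter file itself is missing from this copy of the post. Based on the description that follows, it may have looked roughly like the sketch below; the width, xfersize and interval values are my guesses, while the anchor, depth, file count, size distribution, thread count, fwdrate and elapsed time follow the text:

```
fsd=fsd1,anchor=/data_remote/test,depth=2,width=10,files=10000,
    sizes=(8k,60,32k,25,51k,10,4m,5)
fwd=fwd1,fsd=fsd1,operation=read,fileio=random,fileselect=random,
    xfersize=8k,threads=96
rd=rd1,fwd=fwd1,fwdrate=10000,format=yes,elapsed=72000,interval=60
```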


The generator runs on a client machine. The Amber Road volume is NFS mounted under /data_remote/test. There is a directory structure two levels deep, and each directory at the end of this tree holds 10,000 files. The files are of different sizes: 60% of them are 8 KB, 25% are 32 KB, 10% are 51 KB and 5% are 4 MB. A total of 96 threads is used to generate the load. The test runs for 72,000 seconds.

The example above tries to create a continuous load of 10,000 IOPS (fwdrate).

Tests had to be done over a long time. The file system cache (ZFS ARC) was approximately 60 GB, and it had already been shown that the behavior of the Amber Road was perfect as long as capacity was available in the ARC. Only once the ARC was fully utilized did the real steady-state behavior show. For my tests this steady state was needed in order to be comparable to the ISV situation.

The load test was done with a RAID-Z2 default setup, a RAID-Z2 narrow setup and a mirrored setup. Here's a picture with the output: time versus number of IOPS. After the volume was created, the directory structure and the files were created, and then a load run with 50% reads was started.


Besides some spikes there is not much difference between these three: they all cope with a 10,000 IOPS load very well. However, this was during the first 10 minutes after all the files were created. The next graph shows the behavior over a ten times longer interval.


These data are from the same three tests. However, here we see that at 700 seconds the default setup drops down to approximately 6,000 IOPS. The vdbench log output (not shown here) also shows that the read share is no longer 50% on average.
The RAID-Z2 narrow setup can handle the 10,000 IOPS load longer. At 2700 seconds its behavior starts to change: it still averages around 10,000 IOPS, but far less steadily.
The mirrored setup sustains the 10,000 IOPS nicely over the whole test period.

The tests described were executed to create a background load for running rsync. The bursts in the blue graph are the points where the rsync is started; these were scheduled at 1800, 3600, and 5400 seconds. No ill effects were seen during the test on the mirrored setup.

Here is a close up around the 900 second period.

One of the things vdbench tries to do is make up for missed IOs. The first big dip after T=700 is followed by a much bigger load request. The final result is that the system seems to throttle down to a 6,000 IOPS level. During the first 600 seconds the average response time was around 1 millisecond; 200 seconds later it had become 20 milliseconds. During the first 600 seconds the system was able to push 80 MB/sec; after the 800 second mark only 40-50 MB/sec was left. Finally, the share of read operations settled around 85% of the total.

Finally, to make this comparison complete, a close-up of the graph around T=2900.



The above mentioned test was redone for all three setups, now with a read:write ratio of 75:25, and again with a 100% read test.

In order to see the effect on the run time of an rsync command creating a backup of a 2 GB "mail archive", this command was executed at 1800, 3600 and 5400 seconds (rsync --delete -a <mail archive> <destination directory>). The completion times are in the following tables. The rsync times during the default setup test were so long that this test had to be redone over a much longer period (14,400 seconds).
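The scheduled, timed rsync runs can be sketched as follows. The timed_backup helper and the paths are placeholders of mine, not the original harness:

```shell
#!/bin/sh
# Run one rsync backup after a given delay and report how long it took.
#   $1      delay in seconds before the backup starts
#   $2, $3  rsync source and destination
timed_backup() {
    sleep "$1"
    start=$(date +%s)
    rsync --delete -a "$2" "$3"
    echo "backup after ${1}s took $(( $(date +%s) - start ))s"
}

# As in the test: fire the backups at 1800, 3600 and 5400 seconds.
# for t in 1800 3600 5400; do
#     timed_backup "$t" /mail_archive/ /data_remote/backup/ &
# done
# wait
```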

Table 1: 50% read 50% write load
Type\\Time 1800 3600 5400 7200 14400
RAID-Z2 default 5161 - - 4542 8987
RAID-Z2 narrow 124 1074 1059 - -
RAID mirror 96 174 141 - -

Table 2: 75% read 25% write load
Type\\Time 1800 3600 5400 7200 14400
RAID-Z2 default - 1842 - - 11548
RAID-Z2 narrow 91 91 210 134 -
RAID mirror 99 100 101 116 -

Table 3: 100% read load
Type\\Time 1800 3600 5400 7200 14400
RAID-Z2 default 114 111 111 111 -
RAID-Z2 narrow 76 76 76 76 -
RAID mirror 88 84 85 88 -


The first thing to mention is that although the default setup seems to perform unexpectedly poorly, you should reconsider what is actually happening. The appliance is being stressed by a 10,000 IOPS load. Internally there are two times eleven disks handling this load. Standard SATA disks can handle a load of, say, 70 IOPS each (a Google search returns many articles, with this number ranging from 50 to 100). This translates to approximately 1,540 IOPS for the available disks.
In a mirrored setup this appliance was able to handle 10,000 IOPS: over six times the capacity of those disks alone.... If there are barely any writes this is even possible with the default setup, and with the narrow RAID configuration too. After realizing this, I must admit, I was impressed!
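The back-of-the-envelope numbers above are quickly checked; the 70 IOPS per disk is, as noted, just a commonly quoted rough figure:

```shell
#!/bin/sh
disks=22            # two RAID groups of eleven data disks each
iops_per_disk=70    # rough figure for a 7200 RPM SATA disk
raw=$(( disks * iops_per_disk ))
echo "raw back-end capacity: $raw IOPS"        # 1540
echo "overcommit factor: $(( 10000 / raw ))x"  # just over 6x
```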

Sunday Jul 15, 2007

Profiling jdbc related method calls with DTrace

The Sun Hotspot VM has built-in facilities for method tracing using DTrace: each Java method entry and exit now has a probe that can fire. Any non-trivial Java program makes many, many method calls, so the expected overhead is a factor to consider. While running a benchmark on a Glassfish application server environment with PostgreSQL as the database, we were trying to understand the time spent in the JDBC driver being used.

Here I try to describe how a package specific profiler can be implemented with DTrace. In order to catch these firing probes, access to the Java pid is needed. The generic way to create clauses around these probes is:

hotspot$1:::method-entry
hotspot$1:::method-return

Here $1 is the pid of the Java process of interest. These probes are specific to a process, so the Java process needs to be started before the DTrace script can run (successfully). In order to not miss any method calls, the Java process is best started with the following options in place:

-XX:+ExtendedDTraceProbes -XX:+UnlockDiagnosticVMOptions -XX:+PauseAtStartup

ExtendedDTraceProbes is needed to enable the two mentioned method probes. UnlockDiagnosticVMOptions is needed to enable the PauseAtStartup parameter. PauseAtStartup stops the VM before the Java program is started, but after the VM has been initialized. The subtlety here is that all the VM probes are part of the JVM library (libjvm.so). Only once this library is loaded and initialized do the probes exist, and can we enable the probes specified by our clauses.

The directory from where the VM is started will contain an (empty) file called vm.paused.<pid>, where <pid> is the process ID of the VM. Once this file exists the VM is in its paused state and your DTrace script can be started. When your DTrace script is compiled and active, remove the vm.paused.<pid> file. This automagically resumes the VM, after which it will "just" run to completion.

More details about the probes the hotspot provider gives you can be found on Keith's blog:

Here I will describe my attempt to create a "profiler" that can focus on the use (that is, (CPU) time consumption) of the methods in a certain package. A note of caution is called for: the code presented below causes a measurable performance degradation. I believe it should not be used in a production environment.

We had an interest in understanding which parts of a JDBC driver were consuming most of the time. The JDBC driver packages of interest were org/postgresql/jdbc2 and org/postgresql/jdbc3, as used from a Glassfish installation. This installation was stressed in a benchmark environment where we needed to understand where time was spent.

In general a method will call other methods, which most of the time will call yet other methods. Not only did we need an understanding of which method consumed most of the time, but also whether this was due to the method itself or due to one or more other methods called by it. This separates the data into two groups: inclusive and exclusive numbers. I think the following idea should be used during the implementation:

on method entry:
    time[depth] = timestamp;        /* remember when we started           */
    depth++;                        /* one extra step in the call tree    */
    includeMeasurement[depth] = 0;  /* time consumption due to calls to   */
                                    /* other methods originating from this*/
                                    /* method is set to zero. Hey we just */
                                    /* started                            */

on method return, when /depth > 0/:
    depth--;                        /* back to the frame we entered at    */
    delta = timestamp - time[depth];/* amount of time we spent in this    */
    inclusive += delta;             /* method: all the time spent in this */
                                    /* method (every invocation)          */
    includeMeasurement[depth]       /* total time spent, including methods*/
        += delta;                   /* called by me; that must be sub-    */
                                    /* tracted to get the exclusive time  */
                                    /* of the method that called me       */
    delta -=                        /* all time on my account reduced with*/
        includeMeasurement[depth+1];/* the total time spent by others that*/
                                    /* I called will give me my exclusive */
    includeMeasurement[depth+1] = 0;/* time; reset to prevent this amount */
                                    /* being used once more in the future */
    exclusive += delta;             /* my exclusive time                  */
There are some gotchas here. Any VM process is multithreaded. To ensure minimal overhead but proper handling of multiple competing threads there is a need to use the DTrace "self->" construct. The variables inclusive and exclusive are fine candidates to be implemented as aggregations. The variable delta is typically a clause-local variable, while most of the others should be thread-local variables. In order to implement this we also need to specify the types of the variables used.

I hope the following is a correct implementation:

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option destructive
#pragma D option defaultargs
#pragma D option bufsize=16m
#pragma D option aggrate=100ms
#pragma D option dynvarsize=16m

self string package;

/* class, method and signature together uniquely define a method
   (assuming a single namespace). Concatenating these into a
   single string would have my preference, but strjoin() is only
   valid in clause context. Looks to me I need to keep them
   separately. If you know a better way, please do inform me. */

/* all these arrays are indexed by calling depth. This to not
   miscalculate when recursive calls are made. */
self string class[int];
self string method[int];
self string signature[int];

self int    started[int];
self int    vstarted[int];
self int    incl[int];
self int    vincl[int];
self int    depth;

BEGIN
{
        startedAt = timestamp;
}

/* the probe arguments specify the class, method, and signature.
   Here we are only interested in the package name, which is the
   base name of the class. Fired on entry and return, so the
   predicates below can filter both directions. */
hotspot$1:::method-entry,
hotspot$1:::method-return
{
        this->str_ptr = (char*) copyin(arg1, arg2+1);
        this->str_ptr[arg2] = '\0';
        self->class[self->depth] = (string) this->str_ptr;

        self->package = dirname(self->class[self->depth]);
}

/*  The clause specified above gets us our package
    name so we can filter. Only for the stuff of
    interest do we need the class, method and
    signature. And the time (vtime) everything
    started for this method at this call depth. */
hotspot$1:::method-entry
/  self->package == "org/postgresql/jdbc2"
|| self->package == "org/postgresql/jdbc3"/
{
        this->str_ptr = (char*) copyin(arg3, arg4+1);
        this->str_ptr[arg4] = '\0';
        self->method[self->depth] = (string) this->str_ptr;

        this->str_ptr = (char*) copyin(arg5, arg6+1);
        this->str_ptr[arg6] = '\0';
        self->signature[self->depth] = (string) this->str_ptr;

        self->started[self->depth] = timestamp;
        self->vstarted[self->depth]= vtimestamp;

        self->incl[ self->depth ] = 0;
        self->vincl[ self->depth ] = 0;

        self->depth++;
}

hotspot$1:::method-return
/(  self->package == "org/postgresql/jdbc2"
 || self->package == "org/postgresql/jdbc3")
 && self->depth > 0/
{
        self->depth--;

        /* always nice to know how much time elapsed in
           context of the amount of calls being handled.
           Puts things in a certain perspective. */
        @calls[ self->class[self->depth]
              , self->method[self->depth]
              , self->signature[self->depth]
              ] = count();

        this->delta  = timestamp  - self->started [self->depth];

        /* This one is easy. Just accumulate it all */
        @inclusive[ self->class[self->depth]
                  , self->method[self->depth]
                  , self->signature[self->depth]
                  ] = sum(this->delta);
        self->incl[self->depth] += this->delta;

        /* I used everything except what has been used by those
           called by me */
        this->delta -= self->incl[ self->depth + 1 ];

        /* it's been used, and must be reset to zero */
        self->incl[ self->depth + 1 ] = 0;

        @exclusive[ self->class[self->depth]
                  , self->method[self->depth]
                  , self->signature[self->depth]
                  ] = sum(this->delta);

        /* and the same story for the vtimestamp measurements */

        this->delta = vtimestamp - self->vstarted[self->depth];
        @vinclusive[ self->class[self->depth]
                   , self->method[self->depth]
                   , self->signature[self->depth]
                   ] = sum(this->delta);

        self->vincl[self->depth] += this->delta;
        this->delta -= self->vincl[ self->depth + 1 ];

        self->vincl[ self->depth + 1 ] = 0;

        @vexclusive[ self->class[self->depth]
                   , self->method[self->depth]
                   , self->signature[self->depth]
                   ] = sum(this->delta);
}

/* dump everything once the traced process exits */
proc:::exit
/pid == $1/
{
        printf("Total time (sec): %d\n", (timestamp - startedAt)/1000000000 );
        normalize(@inclusive, 1000);
        normalize(@exclusive, 1000);
        normalize(@vinclusive, 1000);
        normalize(@vexclusive, 1000);
        printa("C  %8@d %s/%s %s\n", @calls);
        printa("TI %8@d %s/%s %s\n", @inclusive);
        printa("TE %8@d %s/%s %s\n", @exclusive);
        printa("VI %8@d %s/%s %s\n", @vinclusive);
        printa("VE %8@d %s/%s %s\n", @vexclusive);
        exit(0);
}
It might take some time to grasp everything. I tested it with the following method call flow:

A() {
    consume(45);
    B();
    C();
}

B() {
    consume(40);
    C();
    C();
    C();
    C();
}

C() {
    consume(10);
}

Each consume(n) is an inline chunk of code that consumes n units of time. This should lead to the following:

     Incl   Excl
A:    135     45
B:     80     40
C:     50     50    (5 times called, each 10)


Below are some tables with the findings of a benchmark run. These tables show the Top-5 rankings by Count, Ela delta and CPU delta. The Ela columns show the results of the timestamp measurements, the CPU columns the results of the vtimestamp measurements. Incl stands for inclusive, Excl for exclusive. The method signatures are shortened by removing the package name part "org/postgresql/" from them.

Top-5 Count
Count    Ela Incl Ela Excl Ela delta CPU Incl CPU Excl CPU delta Method signature
532551   23311    23311    0         3590     3590     0         jdbc2/AbstractJdbc2ResultSet/toInt(Ljava/lang/String;)I
578140   241200   106666   134534    22108    9831     12277     jdbc2/AbstractJdbc2ResultSet/getFixedString(I)Ljava/lang/String;
1144370  43396    43396    0         3725     3725     0         jdbc2/AbstractJdbc2ResultSet/trimString(ILjava/lang/String;)Ljava/lang/String;
1145565  263167   213991   49176     24355    20621    3734      jdbc2/AbstractJdbc2ResultSet/getString(I)Ljava/lang/String;
1198749  50701    50701    0         3872     3872     0         jdbc2/AbstractJdbc2ResultSet/checkColumnIndex(I)V

Top-5 Elapsed Inclusive
Count Ela Incl Ela Excl Ela delta CPU Incl CPU Excl CPU delta Method signature

Top-5 Elapsed Delta
Count Ela Incl Ela Excl Ela delta CPU Incl CPU Excl CPU delta Method signature

Top-5 CPU Elapsed
Count Ela Incl Ela Excl Ela delta CPU Incl CPU Excl CPU delta Method signature

Top-5 CPU Delta
Count Ela Incl Ela Excl Ela delta CPU Incl CPU Excl CPU delta Method signature

From these top-5 lists it can be concluded that, in the context of this benchmark, the three methods of interest are in the org/postgresql/jdbc2 package: getString and getFixedString, both from the AbstractJdbc2ResultSet class, and prepareStatement from the AbstractJdbc2Connection class. The next step in this quest is to check only what is happening with respect to these three methods.

Sunday Jun 17, 2007

Green Computing

Green computing and performance tuning are closely related. Long ago (from an IT point of view) we had to be very careful how we designed and implemented our software. My first Apple II only had 48 KB (yes, that is a K from kilo) available to make things work. Its 6502 processor ran at 1 MHz (yes, that's an M). You had to think about how to use these resources. Carefully. Consciously.
Then software creation became mainstream. The old way was considered too complicated. 4GLs were invented. The (system) complexities were hidden, but definitely still present. Machines got more resources. Nowadays most of us do not care (or are not allowed to care) and we just use these resources.
Green computing: is this then the return of making conscious decisions about the resources we use? Through benchmarks we can understand our software. A little profiling and we find the code paths that are used most and probably consume most of the CPU cycles. Indeed, a good performance tuning session makes the application faster.
Look at this from another angle: we can do the same with less. Less power and less memory translates into less energy consumption, meaning less heat, which saves the energy needed to control the environment. Indeed Green Computing should become the rule. Performance tuning now is not only fun, it is a must indeed.


Tuesday Oct 24, 2006

Thinking about performance tuning

Performance tuning is not asking questions like "Are all patches installed? Is the deployment done in an acceptable way? What decision process was used to determine that this application can run on the given hardware?"
Yet questions like these pop up, and too often the answers are not available... while at the same time the engineers want to tune; want to show they master this craft.
I believe a sanity check is the first thing that needs to be done when there is a perceived problem. It is this kind of work that might give you huge improvements. But at least should give you a sound and solid baseline to be used for a realistic setting of expectations.
Once an application is deployed in a proper way --and limitations are known, stated and documented-- we are able to set a baseline. Benefits from performance tuning should then be related to this baseline.
Any change in the stated limitations presents us with a new situation. If, for example, you are able to buy more disks for a disk-bound application, then indeed you can expect a nice improvement. However, running an IO intensive application in an environment severely deprived of IO capacity was not a sane thing to begin with... Indeed it can be difficult to see where sanity checking ends and performance tuning begins.




