Thursday Jan 21, 2010

Some findings with the Amber Road

Recently I had the good fortune to do some testing on an Amber Road, more officially known as the Sun Storage 7410 Unified Storage System.

The machine had 64 GB of internal memory and two quad core CPUs. A single JBOD held 24 disks, one of which was an SSD optimized for write access (a writezilla). The other 23 disks were "normal" 7200 RPM disks of one TB each. For connectivity some of the four gigabit NICs were used.
Besides the above mentioned SSD, the system also had two other SSDs optimized for reading (readzillas).
This appliance uses ZFS as its file system. The writezilla was used to hold the ZFS intent log; the readzillas were used as a level 2 ZFS file system cache (L2ARC).

For more technical information about this product please check the website at:

http://www.sun.com/storage/disk_systems/unified_storage/7410


Introduction

The original trigger to execute these tests was a remark from an ISV that they had problems with applications using their Amber Road while an rsync was creating a backup of their email archive(s) on a Linux box to a volume on their appliance.

And indeed, the throughput numbers we received were not what we expected.

In order to get a better understanding, a 7410 was set up together with some load generating equipment in one of our labs. The ISV was able to send us a copy of part of their email archive that we could use for testing. The Amber Road was configured in a way comparable to the one at the ISV location. The initial tests were indeed below expectations. The quest to find the bottleneck began.


Approach

There were quite a few elements that were "unknown" to me, one of which was the internals of the Amber Road, including the hardware and the software (operating system, ZFS pool and file system, network interfaces, etc.)
Were the NICs somehow dependent on each other? How much traffic can a single NIC handle, both in throughput and in number of packets?
The volume(s) on the Amber Road were used with NFS over gigabit ethernet. The Linux box was used to simulate the mail machine. Could it handle the load required? The Linux volume manager was used, with ext3 as the file system. Could it deliver the load required? And was rsync, the tool used to make the backup, able to drive all these resources?

I started with the front end: the Linux box, especially the volume manager, using

find <mail archive on lvm volume> -print | cpio -oc >/dev/null

as a way to simulate the reading part. I had a second machine on which Solaris 10 was installed, with exactly the same hardware as the Linux box, and of course I couldn't resist doing the same reading tests on that one. I was not impressed with the load that could be generated from a couple of internal disks. Fortunately I had a couple of disk arrays "lying around".
I used a Sun StorageTek 6140 array holding 12 spindles spinning at 15K rpm. The array has a gigabyte of cache and two controllers.

The result of the above command:

1: 6140 Linux lvm (2 6140 volumes): 33 min
2: 6140 Solaris UFS on 2 6140 Disk Suite volume: 16 min
3: 6140 Linux nolvm (1 6140 volume): 34 min
4: 6140 Linux nolvm (4 6140 vols in parallel): 24 min

Tests 1 and 2 are performed with the same 6140 configuration. Same hardware, different OSes and volume managers. To check the Linux result I used half the disk capacity in the array (test 3) without lvm and 4 volumes without lvm (test 4). This last one had one volume mounted under /data1, the second volume on /data2, etc. To parallelize, multiple find /dataX -print | cpio -o >/dev/null sessions were run, one on each file system, as sketched below.
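
Such a run can be scripted in a straightforward way; a minimal sketch, assuming the four mount points mentioned above:

for fs in /data1 /data2 /data3 /data4; do
    ( find $fs -print | cpio -oc >/dev/null ) &    # one sequential read stream per file system
done
wait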

After this result we decided to look into the other elements of the black box using Solaris only, just to ensure we could generate enough load once we started testing the Amber Road.


Network

The next part to check was the Amber Road network. Or better, how much load can the Amber Road NICs sustain. A single volume on the Amber Road was NFS mounted with the following options:

rw,vers=3,rsize=8192,wsize=8192,soft,intr,proto=tcp,timeo=600,retrans=2,sec=sys
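
For reference, a mount using these options would look roughly like this on the Solaris client (server name and paths are made up):

mount -F nfs -o rw,vers=3,rsize=8192,wsize=8192,soft,intr,proto=tcp,timeo=600,retrans=2,sec=sys \
    amberroad:/export/mailtest /data_remote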

Since the rsync backup includes many file system operations (read, write, stat, etc.) and is optimized to not copy a file if it already resides on the backup volume, I changed the rsync workload into a set of 'find ... | cpio' runs executed in parallel. Each find command (see above) used its own private mail archive. After some analysis it was clear NFS was not a bottleneck. Using the analytics from the Amber Road, the following screen dump shows the three NICs used were all running close to the theoretical speed of gigabit ethernet. In total "pumping" 300MB/sec looked perfect to me when using three interfaces, with each interface handling close to 15,000 NFSOPS/sec.


AR.png

With these results I concluded that both client and network were running fine. I needed to focus on the 7410 now. I had 24 disks in this appliance. The GUI was not my favorite tool for looking at disk utilization and other statistics. Now I must admit that I have been in this business for quite some years and still prefer the command line. Although it might be good to understand that the punch card period was even before my time.....


Method

The ISV remark included a statement about other applications suffering while an rsync was running. The rsync runs themselves also took a remarkably long time; this would translate to an average transfer rate of 50MB/sec.
I changed the approach a little bit. I first checked what could be seen as reasonable for the Amber Road. Fortunately there is a load generating tool called vdbench available. This open source tool can be downloaded from http://sourceforge.net/projects/vdbench. It has many possibilities, one of which is simulating file system workloads. For this it creates a directory structure with as many files as specified. This is all driven by a parameter file. Here is one example, for a pure read-only test:

fsd=fsd1,anchor=/data_remote/test,depth=2,width=8,files=10000,sizes=(8k,60,32k,25,512k,10,4m,5)
fwd=fwd1,fsd=fsd1,operation=read,skew=50,xfersize=8k,fileio=sequential,fileselect=random,threads=48
fwd=fwd2,fsd=fsd1,operation=read,skew=50,xfersize=8k,fileio=random,fileselect=random,threads=48
rd=rd1,fwd=(fwd1,fwd2),fwdrate=10000,format=no,elapsed=72000,interval=1

The generator runs on a client machine. The Amber Road volume is NFS mounted under /data_remote/test. There is a directory structure 2 levels deep. Each directory at the end of this tree holds 10,000 files, of different sizes: 60% of them are 8KB, 25% are 32KB, 10% are 512KB and 5% are 4MB. A total of 96 threads are used to generate the load. The test runs for 72,000 seconds.

The example above tries to create a continuous load of 10,000 IOPS (fwdrate).
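
Running it is then a matter of pointing vdbench at the parameter file; a sketch, assuming vdbench was unpacked in /opt/vdbench and the parameters above were saved as readonly.parm:

/opt/vdbench/vdbench -f readonly.parm -o /var/tmp/vdbench_readonly

The -o directory receives the interval-by-interval logs referred to further down.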

Tests had to be done over a long time. The file system cache (ZFS ARC) was approximately 60 GB. It had already been shown that the behavior of the Amber Road was perfect as long as capacity was available in the ARC. Once the ARC was fully utilized the problematic behavior started to show. For my tests this steady state behavior was needed in order to be comparable to the ISV situation.

The load test was done with a RAID-Z2 default setup, a RAID-Z2 narrow setup and a mirrored setup. Here's a graph with the output: time versus number of IOPS. After the volume was created, the directory structure and the files were created. After this a load run was started with 50% read.


tst.png

Besides some spikes there is not much difference between these three: they all cope with a 10,000 IOPS load very well. However this was during the first 10 minutes after all the files were created. The next graph shows the behavior over a ten times longer interval.


tst2.png

These data are from the same three tests. However, here we see that at 700 seconds the default setup drops down to approximately 6,000 IOPS. Looking at the data in the vdbench log output (not shown here) also shows that the average is no longer 50% read.
The RAID-Z2 narrow setup can handle the 10,000 IOPS load longer. At 2700 seconds its behavior starts to change: it still averages around 10,000 IOPS but the behavior is far less constant.
The mirrored setup sustains the 10,000 IOPS nicely over the whole test period.

The tests described were executed to create a background load for running rsync. The bursts in the blue graph are points where the rsync is started. These were scheduled at 1800, 3600, and 5400 seconds. No ill effects were seen during the test on the mirrored setup.

Here is a close up around the 900 second period.
tst3.png

One of the things vdbench tries to do is make up for missed IOs. The first big dip after T=700 is followed by a much bigger load request. The final result is that the system seems to throttle down to a 6,000 IOPS level. During the first 600 seconds the average response time was around 1 millisecond; 200 seconds later this became 20 milliseconds. During the first 600 seconds the system was able to push 80 MB/sec. After the 800 second mark there was only 40-50 MB/sec left. Finally the share of read ops settled around 85% of the total.

Finally, to make this comparison complete, a close-up of the graph around T=2900:


tst4.png


Results

The above mentioned test was redone for the three setups, but now with a read/write ratio of 75:25 and again with a 100% read load.

In order to see the effect on the run time of an rsync command that would create a backup of a 2 GB "mail-archive", this command was executed at 1800, 3600 and 5400 seconds (rsync --delete -a <mail archive> <destination directory>.) The completion times are in the tables below. The times for the rsync during the default setup test were so long that this test needed to be redone over a much longer period (14400 seconds.) A sketch of how these runs were scheduled follows here, before the tables.
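
A minimal version of that scheduling, with the archive and destination paths being placeholders:

for start in 1800 3600 5400; do
    ( sleep $start; time rsync --delete -a /mail_archive/ /data_remote/backup/ ) &
done
wait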


Table 1: 50% read 50% write load
Type\Time 1800 3600 5400 7200 14400
RAID-Z2 default 5161 - - 4542 8987
RAID-Z2 narrow 124 1074 1059 - -
RAID mirror 96 174 141 - -



Table 2: 75% read 25% write load
Type\Time 1800 3600 5400 7200 14400
RAID-Z2 default - 1842 - - 11548
RAID-Z2 narrow 91 91 210 134 -
RAID mirror 99 100 101 116 -



Table 3: 100% read load
Type\Time 1800 3600 5400 7200 14400
RAID-Z2 default 114 111 111 111 -
RAID-Z2 narrow 76 76 76 76 -
RAID mirror 88 84 85 88 -



Conclusion

The first thing to mention is that although the default setup seems to perform unexpectedly poorly, you should reconsider what is actually happening. The appliance is being stressed by a 10,000 IOPS load. Internally there are two times eleven disks handling this load. These standard SATA disks are able to handle a load of, say, 70 IOPS each (google returns many articles, where this number ranges from 50 to 100). This translates to approximately 1540 IOPS for the available disks.
In a mirrored setup this appliance was able to handle 10,000 IOPS: over 6 times the capacity of those disks alone.... If there are barely any writes this is even possible with the default setup, and with the narrow RAID configuration too. After realizing this, I must admit, I was impressed!

Tuesday Jul 22, 2008

Waving Commit Delay

In the previous entries many graphs were shown. The ones showing throughput via the number of commands handled had this typical sinusoidal pattern. I now think I have found the cause of it, although I definitely do not understand it (yet).

Since, as usual, a picture tells us more than a thousand words, here is one with that pattern. The vertical axis shows the cumulative time spent by all PostgreSQL processes in a certain state:


mail_commitdelay.png

Ok, what is shown here? The red graph depicts the time spent sleeping as a direct result of the commit_delay setting. Currently the average number of processes with active transactions when a flush is called is around 85. Therefore that part of the equation is true, so it is indeed the commit_delay setting that determines whether the sleep is done: all other prerequisites are true (read the source if you do not understand :-) .)

Now when setting commit_delay to zero, the result becomes:


mail_nocommitdelay.png

The time for XLogFlush now fully determines the time spent in the routine that handles RecordTransactionCommit. It is also approximately 60% of the time spent handling commands.
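
For completeness, switching between the two runs shown above is just a configuration change; a minimal sketch, assuming the usual $PGDATA layout (the earlier runs had a non-zero value; commit_delay is specified in microseconds):

echo "commit_delay = 0" >> $PGDATA/postgresql.conf    # 0 disables the pre-flush sleep
pg_ctl -D $PGDATA reload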

The downside of disabling commit_delay is total throughput: a little bit of delay reduced the amount of (WAL related) IO significantly, as was already shown in an earlier blog entry. However the behavior is much more constant without the delay. Would it be beneficial if the same constancy could be realized in the commit_delay case?

To answer that one, some extra investigations are called for.

Saturday Jul 19, 2008

Throughput spikes explained

In previous posts a typical pattern in handling commands was seen. The cumulative time consumption of handling versus reading commands showed a pattern like the one in the next graph:

mediumheap.png

I had a hunch that this could be related to the client (load generator) behaviour. The client program is written in Java. It runs 128 application threads that continuously communicate with the database. Lots of rows are inserted and retrieved, so the number of objects created should be considerable. Garbage collection to clean up the Java heap might be the driving force behind the above mentioned pattern.

My standard configuration runs with a 2 GB heap. I decided to increase this to 3.2GB, and also to increase the Eden size to 1.5GB. For those a little familiar with Java, Eden is the space in which new objects are created. The test resulted in the following graph:


largeheap.png

Indeed it is the client that is responsible for these spikes. Using a larger heap means more objects can be created in the Java heap before a garbage collection is forced. A larger heap also holds many more objects, translating into longer GC times; indeed the time spent reading commands has increased. Please note that garbage collections in Eden are "stop the world", meaning the client does not respond to the database process: during a GC the readcommand does take significantly more time.
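
The exact flags used are not shown here; a sketch of the kind of change described above (flag spelling and the jar name are assumptions, the sizes come from the text):

# Before: 2 GB heap. After: a 3.2 GB heap with a 1.5 GB young generation,
# so the load generator collects less often (but each collection takes longer).
java -Xms3200m -Xmx3200m -Xmn1536m -jar loadgenerator.jar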

Wednesday Jul 16, 2008

Finding the throttle

In my previous blog entry I showed a behavioral pattern in PostgreSQL that needs attention. In this entry I will try to explain what I did to find the culprit.

The only statistics discussed so far were the ones in tcop/postgres.c. These gave an impression of the global activity. During a checkpoint lots of time is spent handling commands. Although per call this time seems to be close to zero, it accumulates to figures as high as 1600 seconds. By looking at the other statistics during those checkpointing intervals, hopefully some of them show this accumulation too; that would make it easy to find the root cause. At 1218 seconds this pattern shows nicely. The next table shows the top time-consumers during that interval:

NAME COUNT TOTAL AVERAGE MAX
SLRUPhysicalWritePage::Part6 16 8368 523 8368
SLRUPhysicalWritePage::call 16 8370 523 8368
SLRUselectLRUPage::WritePage 1 8370 8370 8370
SLRUwritePage::call 16 8372 523 8370
postgresqlMain::readchar::B 9799 19100 1 4297
LWLock::Acquire::08::semaphoreTime 1711 21303 12 411
xlogflush::go::acq 1724 21349 12 411
SLRUselectLRUPage::Return3 768 344920 449 10574
SLRUselectLRUPage::Return1 2162 879268 406 10568
LWLock::Acquire::59::semaphoreTime 676 1059722 1567 8368
SimpleLRUWaitIO::bufferlock::Acquire 768 1060120 1380 8368
SLRUselectLRUPage::WaitIO 622 1061593 1706 8381
SimpleLRUWaitIO::call 747 1061932 1421 8381
SLRUselectLRUPage::Part3 623 1069964 1717 8381
SLRUselectLRUPage::call 2931 1224187 417 10574
SLRUreadPageReadOnly::SimpleReadPageCalls 1022 1224899 1198 10575
SLRUreadPage::call 2799 1224903 437 10575
SLRUreadPageReadOnly::call 1525 1225559 803 10576
postgresqlMain::handleCommand::E 9824 1267569 129 10577

Ok, what are these? These are the statistics. I try to use a convention: a name consists of one or more tags separated by double colons. The first part is the name of the function in which the statistic is used. The next part indicates the part of the function that it means to time, and, if present, the final part tells more about the detail being measured. Most statistics measure wall clock time, however it can be that another metric is used (the number of elements in a list, for example). Rows in the table are sorted by total time.

The postgresqlMain::handleCommand::E statistic counts the time needed to handle an Execute command. During the last second it was called 9824 times, and the total time it took to do this was a little more than 1267 seconds. On average it took 129 msec. However there once was a maximum handling time of a little over 10 seconds. (If time is the metric, the resolution is msec.)

The functions in my copy of backend/access/transam/slru.c are now heavily instrumented; statistic names starting with SLRU belong there. It looks like most of the command handling time comes from SLRUreadPageReadOnly: SLRUreadPageReadOnly::call depicts the total time spent in this function. However, from the count value it is obvious that not all handleCommand calls visit this function. 1022 of these ...ReadOnly function calls (SLRUreadPageReadOnly::SimpleReadPageCalls) call the SimpleLruReadPage function. Almost all of those 1267 seconds are spent in there. This function has 6 statistics built in. None of the others pop up besides the one for the time consumed by SLRUselectLRUPage::call.

Since it was not obvious to me at that time what was going on in that function, I split the routine into three parts and measured the time spent in each of them. Only part 3 is visible in this Top-N table. Looking at the code I see this part contains an if-else construct in which either SimpleLruWaitIO or SimpleLruWritePage is called. The first one takes 1069 seconds for 747 calls; the second one takes only 8 seconds. However, that is for just one call! This is a single write that takes forever. Why? And is it related to those 1069 seconds spent in the WaitIO part? The names used do point in that direction. The SimpleLruWaitIO routine too has 6 statistics of its own. One of them is SimpleLRUWaitIO::bufferlock::Acquire.

The buffer locks are assigned dynamically. This, then, could be an explanation for the accumulation of time in LWLock 59. As you might recall, in this one the statistics for all dynamic locks in the system are stored. The numbers do match. And indeed the maximum waiting time seen for LWLock::59::semaphoreTime is very similar to that of the WriteIO. It looks like all my client related processes are waiting for this single write that takes so long, since 128 * 8 seconds translates to 1024 seconds total. Since this maximum time is for one single occurrence, this also means that a single block (I did not actually look at the block size, so it is an assumption really) is used by all processes. If this is the case then there is something out there behaving as a classic scalability issue.


Time consuming write

Now why is that single write so terribly slow? Can I get more information out of the statistics I have? Before the SlruPhysicalWritePage routine is called one of the buffer locks is acquired in EXCLUSIVE mode. After this routine returns and some extra housekeeping is done the lock is released again. The PhysicalWritePage routine, which of course also has a fair amount of statistics, is the one where the time is consumed, meaning that the lock being held in EXCLUSIVE mode is our scalability issue! The PhysicalWritePage code was also "magic" to me, so a simple separation by adding "PartX" statistics here and there taught me that Part6 is the time consumer. Since I know by traversing the code that the last parameter passed to this routine has a value of null, it must be the pg_fsync(fd) in this routine that causes the long wait. This raises the next question: what is being synced? It must be data, since we are in a checkpoint.


DTrace to the rescue
The little script I used for this basically triggers on the fdsync system call. Before this happens it keeps track of which file descriptors and file names belong together. For each fdsync that takes longer than 2 seconds the filename that belongs to the file descriptor is printed. The script also counts all writes against the file descriptors; the assumption is that all these writes are out there waiting to be synced. During the test three different filenames were shown, as follows:
fsync 24637: Took 8 secs to write 122588 entries to "pg_tblspc/16384/16386/16387.1"
fsync 24637: Took 8 secs to write 128169 entries to "pg_tblspc/16384/16386/16387"
fsync 24698: Took 7 secs to write 1 entries to "pg_clog/000C"

Here 24637 is the process id of the bgwriter. It is flushing single-gigabyte portions at each flush, to all files of interest (the ones above are application specific tablespaces/files). However, writing a single block to the clog also takes a very long time. Mmm, that one is easy to resolve. Just wondering if it would make a difference: I moved pg_clog (including its content) to yet another disk, created a softlink, and redid the test.
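
The relocation itself is trivial; a sketch, with $PGDATA and the target disk path being placeholders:

pg_ctl -D $PGDATA stop
mv $PGDATA/pg_clog /otherdisk/pg_clog        # put the commit log on its own spindle
ln -s /otherdisk/pg_clog $PGDATA/pg_clog     # PostgreSQL keeps using the old path
pg_ctl -D $PGDATA start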


The CLOG solution
The next two graphs tell the story. These are the same graphs as shown in the previous blog entry. The first one presents the overall picture, the second one zooms in for a little more detail.
handlingcommandsclog.png
handlingcommandsclogzoom.png

This one looks far better. However, that pattern of "increased readcommand time / decreased handle time" is interesting too. And of course there are questions:
  • CLOG is flushed by the fdsync. However I do not think this originates from the code paths inspected so far. It just happened to be a side effect noticed through the DTrace script, and it was relatively easy to remove. So where did it come from?
  • Is it just the huge load of IOs on a ZFS file system that causes these waits, or can this be seen on UFS too, for example?
  • Still there is a huge burst of writes (1GB) to each tablespace. Would it be worthwhile to sync in between checkpoints to spread the IO load over time? A single disk can sustain only so many IOs/sec. Spreading should indeed be advantageous.

This screams for further inspection.


Checkpoint throttled throughput

In previous entries the checkpoint effect on throughput was mentioned but not explained. While looking at the IO behaviour comparing UFS to ZFS when running a PostgreSQL database instance, I see huge bursts of IO that also seem to throttle throughput in a significant way.

The original tests used a load generation tool that reported its measured throughput every 100 seconds. Since we all know that averaging removes the interesting information, I redid a test with a 5 second reporting interval. Indeed this showed the effect on throughput much better: 0 TPM while checkpointing. A complete standstill of the application. Wow, this needs further attention.

Instrumentation effect

Initially I used DTrace probes to instrument the code. However I found that for high frequency probes (firing over 1000 times a second in many processes) the instrumentation effect was not negligible. Magne Mahre wrote a little library we started using to instrument in "a lightweight way". In my situation LWLock 25 was used to handle some synchronization issues. Every process counted locally and at commit time the global pool of counters was synced. This works rather well. However it is unfortunate that dbwriter and logwriter commit at a very low rate: I had to call a "copy local to global" routine to ensure the information gathered in these processes did also show up in the final reports.

Unfortunately the final reports did show a new number one contender: lock 25. So this method too did not turn out to be the lightweight instrumentation method I was looking for. Although I must admit that in my attempt to understand things I put quite a number of counters into the code (currently I have over 550 new counters). Lots of these turn out never to increment.

I wrote a quick library that uses the atomic_xxx routines provided in the libc library that comes with Solaris. This now completely mutex- and lock-free way of instrumenting seems to provide me with the information I was looking for, although I am still struggling to understand it completely. The library creates statistic objects in a new segment of shared memory. The advantage is that any other process can attach to it to display a real time view of the counters in there. I currently sample with a frequency of 1 Hz, and this adequately shows me the information.

A single statistic holds a running total, a counter of "times updated", and the minimum and maximum values used so far to update that statistic. Maybe a future version will also provide a histogram facility so that a better understanding of the dynamics of a certain statistic becomes possible. Right now it is not implemented.

Setup

I am currently using one of the early PostgreSQL 8.4 sources. Everything is built 64-bit in an optimized way with Sun Studio 12 compilers and tools. Most tests run for an hour, after which I start mining the data. The current focus is on activities, or the lack thereof, during checkpoints. Checkpoints are parameterized to happen every 300 seconds and should finish in 15% of that time. All PostgreSQL related files reside on two ZFS pools. A WAL pool holds 4 internal disks. The other pool consists of 5 internal disks and holds all the other files. Each pool has its own memory-backed cache array for the ZFS intent log (see the earlier post on UFS and ZFS).
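
For reference, a layout along these lines could be created as follows (device names are placeholders; the log devices are the cached array LUNs mentioned above):

zpool create walpool c1t0d0 c1t1d0 c1t2d0 c1t3d0 log c3t0d0
zpool create pgpool  c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 log c3t1d0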

Since the checkpoints are "tuned" to happen as fast as possible the concentration of IO seems to be high. However the IO subsystem can cope with this.

The next graph shows the throughput. A clear pattern is visible. I had a hunch that it might be related to garbage collection in the client: not seeing throughput can also mean less activity on the load generator. Since this is a Java program, GC needed to be investigated. But ok, using jstat -gc it was very obvious this application behaves as it should. No prolonged periods of GC-takes-over-the-world here. Once more the database was the "prime suspect."
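
Checking this is a one-liner on the load generating machine (the pid is of course a placeholder):

jstat -gcutil $LOADGEN_PID 5000     # heap occupancy and GC counts, sampled every 5 seconds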

All tests are done with 128 clients.

typicaltpm.png

Mapping this to the IO happening is clearly shown in the following graph. Please note these TPM numbers are per 100 second interval and are plotted after that interval has passed. Also the actual point "Time=zero seconds" is different on the load generating machine and on the system under test.


combinedinfo.png

The IO graphs do not include the traffic to the ZIL holding devices; this is database related IO only. It seems obvious that each sudden burst in IO is related to the next dip in throughput, a dip that is about 20-25%.

This graph gave me the idea to do some finer grained measurements, to see whether all throughput suffered during those high IO peaks. I redid the test, but now with a 5 second interval for reporting TPM.

Same information as in the previous graph, but now with a 5 second interval for TPM. Same overall pattern as the one above, but now focusing on a smaller, arbitrarily selected, interval.


combinedfastsampling.png

Indeed the checkpoint seems to be the driving force for this sudden, steep, but not long lasting, drop in throughput. However, mapping the results from two applications running on different machines is a little tricky. As the saying goes: 'Use the source, Luke'.

Using the source

In order to be able to exclude the data from the client I implemented some counters in tcop/postgres.c. In the routine PostgresMain there is a "for ( ; ; )" loop in which data is read from the client and acted upon. With these counters I am now able to tell how much time (wall clock) has elapsed handling data in different contexts (execute, parse, bind, sync, describe) and how much time the database process spends waiting for the client. And of course it tells me the number of times this has been done. The counters accumulate over time. However perl is quite happy to translate this into deltas per second.

Besides these counters I enthusiastically overloaded many source modules with too many statistics. In storage/lmgr/lwlock.c I placed multiple statistics per non-dynamic lock, and one in which I accumulate all the information for the dynamic locks; in my source this one is numbered 59. The next table is a snippet where the tcop/postgres.c stats are shown together with those from the lwlock.c source module (only the most time consuming lwlocks during the time a checkpoint is happening). BTW the tests are now reduced to 1800 or so seconds, since these patterns seem not to be affected by warm up phases while mimicking real life.

Here is the graph. I put the ceiling at 400M to make things a little more visible; the actual peaks are around 1600M microseconds. The third graph depicts the total time spent reading and handling commands.


cummtimeelapsed.png

Zooming in a little shows a bit more detail:

cummtimeelapsedzoom.png

Those short interruptions directly followed by a peak of the same magnitude are measurement artifacts: if a timer stops at the beginning of the next interval, there is an extra accumulation in that next interval instead of the current one. However the peaks at 910 and 1210 seconds are way beyond these artifacts. The total time is quite nice: 128 seconds. Recall we have 128 processes, all measured every single second! The extra peaks/dips in reading/handling every 22 seconds or so are not yet explained. These might be due to short garbage collection interruptions; it is indeed more time waiting for a command to arrive. Need to look into that one too some day soon...

Looking at the blue 'Totals' graph shows a couple of seconds of "nothingness": not reading, but also not executing! This is followed by an extreme peak of handling time. BTW, if a statistic timer starts in interval A and ends in interval A+5 seconds, this adds zeros to A, A+1, ..., A+4, and 5 to interval A+5. There is indeed a repeating pattern of total zero throughput (I am sure; I read the numbers.)

Thursday Jun 19, 2008

Running PostgreSQL on ZFS file system. Some considerations

Up till now I compared ZFS to UFS when deploying the PostgreSQL database files. Only these database processes were running on the system, and all the PostgreSQL related files were on the file system being compared.


The file system was created on either a stripe set of nine disks using Solaris Volume Manager or a zpool using these same nine disks. The last blog entry reported ZFS outperforming UFS.


I did some extra tests to see if performance (read: database throughput) could be increased by fiddling with some elements that are known to influence the IO part when using ZFS.


The ZFS Intent Log is known to consume bandwidth. The effect of this element can be seen by disabling it altogether. Of course this is definitely NOT advised for production environments. (See http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ZIL for some extra information.)
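
At the time this amounted to setting a tunable along the lines of the one below (see the Evil Tuning Guide linked above for the details and caveats; again, not for production use):

echo "set zfs:zil_disable=1" >> /etc/system     # takes effect after a reboot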


I did some tests with the ZIL disabled, using 5, 15, 50 and 85% for the completion_target of PostgreSQL. The following graph shows the database transaction throughput:




tpmNoZIL.png


Recalling the baseline I presented in an earlier blog entry, using seven disk arrays with caches, which showed a throughput of around 90K TPM, this one comes very close. Only nine internal disks this time...


The previous throughput graph (... the saga continues) showed a rate of ~ 50K TPM: transactional integrity does not come for free.


I did some more tests where the ZIL was enabled, but now using a dedicated disk array (with 1GByte of write cache) for the intent log. The throughput results of this setup, the disabled-ZIL setup and the 'all on nine disks' tests are in the next graph (all have a completion_target setting of 15%):


tpmDifferentZIL.png


The ZIL element obviously takes time, time during which the throughput is clearly throttled while a database checkpoint is in progress. Using a dedicated fast device pays off: it increases throughput and makes it more constant over time. The IO (write part) was measured both for the ZIL device and for the data on the "nine disk" ZFS file system. Here is the result:




zlogvsdataIO.png


The checkpoint at 1800 seconds resulted in a 77 MByte/sec write rate for the ZIL alone. As already suggested by Mikael Gueck I would surely like to do some extra tests with SSD disks. Guess I need to hold my breath a while, since right now I "just" do not have them.
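
For reference, attaching such a dedicated log device to an existing pool is a one-liner (the device name is a placeholder; the pool name matches the zfs command quoted in an earlier entry):

zpool add pg log c4t0d0     # the cached array LUN becomes a dedicated intent log device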


However right now I feel ZFS is already a realistic alternative for a PostgreSQL setup.


Will do some more tests where I'll play with some of the other features of ZFS to get a feeling for their effect on my PostgreSQL environment.

Tuesday Jun 17, 2008

UFS versus ZFS with PostgreSQL, the saga continues

The previous posts described some tests I did using a setup that was "tuned" to focus on CPU consumption. The idea was to minimize IO by caching as much of the needed data as possible and by using plenty of disk arrays with write caches to minimize write latency.


In my current tests I am interested in the IO components and how they influence throughput when using two different file systems. In order to create a significant IO component the number of disks was limited to nine internal ones: no arrays, no write caches. In addition the load generator was changed to significantly increase the working set. Indeed my 16GB of shared buffers are not enough to buffer this working set.


The initial configuration was "tuned" to perform a checkpoint once every 15 minutes. Here I will describe the results when using a smaller checkpoint interval and when influencing the time "available" to write out all data due to a checkpoint. The PostgreSQL parameters in effect here are:



checkpoint_timeout
checkpoint_completion_target

The checkpoint_timeout was set to do a checkpoint every 300 seconds. The checkpoint_completion_target used to be 50%. This causes PostgreSQL to write out all the dirty buffers in a period of time that is 50% of the timeout value. Indeed the old graphs show this nicely: during the first half of each checkpoint period a lot of IO can be seen that is not present during the second half.
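
In postgresql.conf terms the variations tested below boil down to something like this (a sketch; only these two parameters change between runs):

checkpoint_timeout = 300                # seconds, i.e. a checkpoint every 5 minutes
checkpoint_completion_target = 0.85     # or 0.15: spread the writes over 85% (or 15%) of the interval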


The idea behind the tests described here was to spread the IO over time. This was done by reducing the timeout from 15 to 5 minutes and increasing the completion_target to 85%. This turned out not to be optimal: probably too many blocks were re-dirtied, so the same blocks were written over and over.


A smaller checkpoint_timeout seems fine for the time being. Increasing this parameter ensures that rewriting of blocks is minimized, but the number of blocks to be written increases, resulting in a much more extreme IO workload once the checkpoint starts. With that same reduced checkpoint_timeout setting the completion_target was set to 15% to see what happens. The results of the 85% and 15% tests can be seen in the graphs below. First the throughput graph.




tpm.png


The nice thing here is that ZFS shows the best throughput (on average about 50K TPM), and indeed with a 15% completion_target. The same target for UFS causes huge fluctuations in the throughput graph. Now let's look at the IOPS graph.




writeio.png


Although the pattern is clear, let's focus on a smaller time interval to see what happens during one checkpoint interval.




writeio2.png


The 15% setting seems too much for the UFS case: it needs to continue writing data for the whole period. ZFS, on the other hand, puts out a burst of IOs in a 50 second period and from then on the IO settles to a constant amount of "background" IO for the rest of the checkpoint interval (most likely with a high content of WAL related output.) 15% of 5 minutes is 45 seconds; indeed ZFS seems to be able to cope with this setting.


The above shows the write IOs per second. How much data is actually written is shown in the following graph.




writeoutp2.png


ZFS is clearly pushing out much more data: during those initial 50 seconds up to 60MB/sec, and after that a linear reduction of this amount can be seen. Although the IOPS are rather constant, the amount is not; ZFS seems to reduce its write size. UFS pushes significantly less data out to the disks. However it uses so many IO calls that this reduces the overall TPM throughput.


Since ZFS is a logging file system the data written contains both the application (PostgreSQL) data and the logging data. I'll do some extra tests to see how this affects the TPM results. I need to "sacrifice" a 1GB cache disk array for this to emulate a fast SSD-like disk, and will make sure that this one is dedicated to ZFS's so-called intent log.

Sunday Jun 15, 2008

UFS vs ZFS: commandline editing

My previous post on UFS vs ZFS showed too much of a difference. I looked at all my tests and they all showed this same difference. Still I wanted to make sure...


So I redid the test and .... gone were those 85MB ZFS writes. Doing these tests I rely on command line editing and command line history, trying to use the same command lines as before. So I walked through my history and did not spot anything. On the first pass, that is.


Then I "just" spotted it. Once more the devil is in the details. The command line used was


#zfs set recordsize=8k pg


And that # character was not the prompt but a comment marker... Well, this explains a lot. The record size that was in fact used was the default setting, 128K. Each 8KB write was in fact a 128KByte write.
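
A simple way to avoid the same trap is to verify the property right after setting it, for example:

zfs set recordsize=8k pg
zfs get recordsize pg       # confirm the value that is actually in effect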


So I redid the ZFS based tests and here are the results for the 'all on one volume of nine disks.'


First the Writes/sec:


writeio.png


And the corresponding IO load:




writekb.png


Finally here is the TPM throughput graph for the 'all on nine' tests.




tpm2.png


Once more ZFS writes out more data. Looking at the TPM numbers, the ZFS test pushes slightly fewer transactions through the system. Therefore the extra writing must be related to ZFS "overhead." The "good" thing is that ZFS uses fewer but much larger IOs.


Although the maximum throughput is a little better for UFS, the ZFS behaviour is more constant. Overall the averages look very similar. No clear winner. But then it was not a contest, just a comparison. And surely I would like to understand those very noticeable highs and lows in throughput. Of course the writes are related to PostgreSQL's checkpoint configuration. This configuration was "tuned" for a certain set of tests where the focus was on CPU consumption. In order to do this the checkpoints were moved to once per 900 seconds. Also, PostgreSQL attempts to finish all checkpoint related IO in half of the checkpoint time. Another test is called for that causes a more constant and less extreme IO load.


And btw, the UFS tests were done without directio mounting: the ramp up is almost as fast as with ZFS.

Thursday Jun 12, 2008

PostgreSQL on UFS versus ZFS

Introduction


Up till now I had the good fortune to do my PostgreSQL tests using a Sun Fire V890 with 16 SPARC IV cores @ 1350MHz. Attached were seven disk arrays: four of these with 256MB of write through cache, the other three with 1GB of write through cache. All spindles were set up as a stripe.


I always used UFS, where WAL was located on a stripe of the four smaller arrays; all application data related tables on one array; all application indexes on another array; and all other PostgreSQL related data on the third array. Both file systems that held application specific data were mounted with the directio option in place, as was the WAL holding file system.


I always used an internal toolkit to generate an OLTP-ish load with 128 concurrent users that were constantly sending their queries and DML (no think times), set up in such a way that there would be quite some IO (a rather large working set).


Next to these seven arrays I had nine internal disks free. This blog entry describes the findings when just using these internal disks. They were configured in different ways to hold the data mentioned above. Next to the (array based) baseline, different configurations using both UFS and ZFS were tested. Below the findings are presented.


Results


The load generation tool reports its results through a TPM metric. The next graph shows this throughput for the baseline, for a single stripe over the nine internal disks holding a UFS file system, and for a single pool (of these same nine internal disks) on which a ZFS file system was configured.


The /etc/system settings used for all these tests are:



set rlim_fd_max=260000
set rlim_fd_cur=260000
set maxphys=1048576
set ip:ip_squeue_fanout=1
set ip:ip_squeue_bind=0
set tune_t_fsflushr=1
set autoup=900
set ufs:ufs_WRITES=0
set segmap_percent=25

My system has 48GByte of internal memory. PostgreSQL shared buffers take 16GByte, so I do not want the file systems to use more than 25% of my memory for caching. By default UFS uses 12%; ZFS however is a little more greedy by default. This setting at least ensures the database processes have an adequate amount of memory.


The ZFS file system was created with an 8KByte record size and compression turned off. No further settings were changed.
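
A sketch of the commands behind that description (disk names are placeholders; the pool name matches the zfs command quoted in the Jun 15 entry):

zpool create pg c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0
zfs set recordsize=8k pg
zfs set compression=off pg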





fs_tpm.png



Before each test a backup of the freshly created database was restored. The caching behavior of ZFS shows quite nicely. Both UFS tests need some time to get started (remember that directio mounted file systems do not cache, so in the beginning the data needs to be read from the disks instead of from the file system cache). The dips in throughput correspond nicely with checkpoint activity.

Checkpoints are definitely handled way better when using disk arrays with memory-backed cache. At about 2000 seconds the throughput seems to become stable in the array case, while in both other cases the checkpoint load strongly affects the throughput.

Comparing the UFS and ZFS tests done using internal disks, it seems that during throughput peak times they are quite comparable, although ZFS shows a little less throughput. The data above "just" presents the TPM; let's look at the IO graphs. To make things "easier" I'll only show UFS vs ZFS using the internal disks. The graphs above come from a test where all database files are in the same file system, so iostat cannot discriminate between WAL, data, or index based IOs.

writerate.png

Yes indeed, there is a difference. The read rate is not depicted here: ZFS shows no reads during the interval, so indeed the file system cache works perfectly. For UFS there is an initial rate of 13MByte/sec, which reduces rapidly. At 1500 seconds there is 200KByte/sec left; after 2500 seconds (from the start) this has been further reduced to less than 40 KB/sec.

However the difference in write IO is significant. Using ZFS the periods of write IO are longer (from ~200 sec with UFS to ~380 sec with ZFS) and the amount of write IO is up to five times as big (18MB/sec vs 80MB/sec). Now let's look at the actual IOPS in the following graph.



writeiorate.png



This one shows ZFS is handling the disk devices much better: the average number of IOs is down by a factor of 3 when using ZFS.

Finally looking at CPU consumption it shows the ZFS test uses a little less of this resource.





Next step



Some extra tests were done where two pools are used: one for WAL and the other for the rest of the database files. This was to check whether the behavior could result from the way the database uses its storage. The WAL pool has four disks, the other pool has five.

Since a picture tells us more than a thousand words, here are the results. First the WAL output.



walwrite.png



Although they seem to differ, it looks more or less equal when considering the amounts. Still, UFS is "faster," meaning the frequency of those spikes (checkpoints) is higher. Once more, since everything is in the ZFS file system cache, the ZFS test starts at once, while the UFS one needs a warm up period. However the huge differences seen in the previous graphs do not show here. Now let's look at the file system holding the other PostgreSQL data.



allwrite.png

There it is! There is a striking difference between the UFS and ZFS file systems when doing non-WAL related write IO with PostgreSQL. And by the way, the difference is bigger compared to the 'all on one file system' tests: approximately 7MB for UFS versus 85MB for ZFS. The ZFS numbers are obtained through the zpool iostat command. I double checked the iostat numbers, and indeed the 85 MB number is correct. No hidden undocumented features that could "explain" this difference. Need to dig deeper.





Conclusion



Although the actual performance from a client perspective is just a little less when comparing ZFS with UFS, the amount of IO (KB/sec) is too different to be ignored. Although ZFS seems to optimize by issuing large sequential writes, the overall effect in an OLTP environment seems not to be optimal. I hope to be able to do some further tests that could help shed some light on this remarkable finding.

Friday Apr 18, 2008

Scalability issues due to waits

Previous entries showed lots of detail around different types of LWLocks, and the time spent waiting for them. In order to get a "helicopter" view, all lock waiting times are combined here. Also, some timing is done in PostgresMain to understand the amount of time spent getting (waiting for) a new command, handling a command, and doing pg_stat reporting.

In the previous post all tests were done from 64 to 1280 concurrent users. The findings in this post are all from 384 to 1024 users, in steps of 64. Each run took one hour; measurement time was approx. 30 seconds less.

First a rather big table that holds all the numbers of interest :-) Let's explain the columns by position (all times are in seconds unless specified otherwise):

  1. number of concurrent users;
  2. number of times the pg_stat updates were executed;
  3. total amount in time taken by these pg_stat updates;
  4. average time (msec) a single pg_stat update takes;
  5. number of commands read;
  6. total time for reading these commands;
  7. average time (msec) for reading a command;
  8. total time for handling the commands read;
  9. average time (msec) for handling a command;
  10. total number of times a LWLock was waited for;
  11. total amount of time waited for LWLocks;
  12. average time (msec) waiting for a LWLock;
  13. throughput per minute;
  14. average response time from client perspective;
  15. total amount of time (cols 3 + 6 + 8) explained.

   1        2       3     4         5         6       7        8       9       10          11     12     13     14    15              
384 30,774,906 1,498 0.05 134,478,799 1,257,700 9.35 111,570 0.83 30,583,924 47,298 1.55 41,771 0.017 3569
448 35,254,626 2,708 0.08 154,052,468 1,447,983 9.40 148,952 0.97 35,416,188 72,604 2.05 47,931 0.019 3570
512 39,472,668 1,966 0.05 172,484,080 1,628,241 9.44 197,524 1.15 41,839,911 109,362 2.61 54,144 0.022 3569
576 41,647,430 2,148 0.05 181,986,937 1,724,056 9.47 330,029 1.81 47,182,321 233,743 4.95 57,251 0.034 3569
640 41,903,717 2,260 0.05 183,106,564 1,740,012 9.50 543,114 2.97 48,511,160 444,870 9.17 57,943 0.057 3570
704 41,537,271 2,290 0.06 181,505,836 1,729,515 9.53 782,201 4.31 48,491,749 684,688 14.12 57,042 0.084 3571
768 41,011,402 2,279 0.06 179,208,259 1,710,304 9.54 1,030,160 5.75 47,853,511 933,159 19.50 56,190 0.113 3571
832 40,331,868 2,299 0.06 176,237,327 1,685,163 9.56 1,284,115 7.29 46,607,735 1,188,499 25.50 55,407 0.143 3571
896 39,857,972 2,268 0.06 174,167,499 1,668,546 9.58 1,529,312 8.78 46,385,722 1,434,691 30.93 55,103 0.172 3571
960 39,120,264 2,216 0.06 170,943,881 1,641,102 9.60 1,784,889 10.44 44,893,231 1,691,997 37.69 53,655 0.206 3571
1024 38,916,412 2,219 0.06 170,053,697 1,633,946 9.61 2,017,717 11.87 44,316,238 1,924,754 43.43 53,323 0.234 3568

The nice thing is to know that almost all time is explained (column 15). Tests took a little less than 3600 seconds.
Lets look at some of these numbers in a graphical way. First what is happening from a client side perspective when increasing the concurrency?

resptimeandthroughput.png

Maximum throughput is reached with 640 concurrent users. The initially linear looking increase in response time is already clearly not linear at this concurrency level. Once the 704 mark is passed it looks linear once more, although it rises much more steeply. The transaction throughput numbers are in TPM (per minute.)
The next graph shows what is happening in the database back-end. Here the number of commands read, the time spent handling these commands and the time spent waiting for LWLocks are shown. All these numbers are totals for all the back-ends in the test. Please note the concurrency numbers do not include the standard Postgres processes (dbwriter, wal log writer, etc).

timespend.png
Once more the number of commands read (and handled) tops out at 640 users. This also seems to be the place where both "time needed to handle" and "time needed to acquire a lock" start rising in a linear way. The actual CPU consumption (the difference between the red and the blue graph) looks rather constant in the third figure:

cpuUsage.png

In this graph the average CPU usage to handle a single command is shown. There seems to be a marginal increase in CPU consumption with increasing concurrency. The CPU usage increase when going from 384 to 640 users is remarkable. It could be that -- but this is not proven by proper tests and instrumentation -- the processes spin more and more, in order to acquire a LWLock, with increasing concurrency. At 640 then, most of this spinning is no longer adequate and the processes start to wait in order to acquire the lock. These lock waits take more and more time due to increased concurrency. This leads to a reduction in throughput (or in commands handled).
Knowing that all these tests indeed took an hour, it can be concluded that these 3600 seconds were basically used for waiting. For now I assume that readCommand is -- as its name implies -- spending most of its time waiting for commands and data to arrive, in order to process them. The actual command handling is constant in its CPU consumption (around 0.5 msec.) The waiting time lost acquiring LWLocks is however increasing when concurrency goes up. The amount of time spent reading a command is rather constant. The total amount of CPU is 0.47 msec per command in the 384 user case. This translates to 64,272 sec (111570 - 47298). The total amount of time waited is 1,304,998 sec (1257700 + 47298.) Fortunately most of this is time waiting for the clients to send a new command. However at 640 users (the maximum throughput) the wait time for locks has grown to 444,870 seconds, something that escalated to 1,924,754 seconds in the 1024 user case.
The amount of CPU used is definitely an overestimate. The assumption here is "time not waited for a LWLock is CPU time". The following calculation shows this cannot be true :-)
If, in the 384 user case, 64,272 seconds are used by CPUs in a period of 3569 seconds, it must be that the system has at least 18 CPUs. Also, knowing the amount of idle time was approx. 37%, this should mean these tests were run on a 28 CPU machine. For the 1024 user case there were 92,963 seconds consumed: 26 CPUs. Idle time during this test was 7%, meaning 28 CPUs. The system used had just 16 CPUs! So there is wait time in other areas too. However, compared to the wait time for the LWLocks this is barely worth any attention.
A significant amount of wait time is, however, lost in reading a command: approx. 9.5 msec are used per command. This could be due to the protocol used for communication with the client, or to deficiencies in the benchmark tool. At least it motivates some extra attention. Currently the energy spent on SMP scalability is best used in the areas where wait time is observed. The above mentioned numbers show two areas worthy of more attention to obtain a more scalable and performing solution.

Wednesday Apr 09, 2008

PostgreSQL Lightweight Locks: Queue Length under high load

The previous blog entry (Using DTrace to measure Lightweight lock contention) discussed some findings with respect to contention on the lightweight locks in PostgreSQL. Using more extreme loads the ProcArrayLock turned out to be number one on the "waiting for locks" list.

Thursday Apr 03, 2008

Using DTrace to measure Light Weight Lock contention

In an earlier post a table was shown with timing information for light weight locks. Here I will try to show a little more detail on the instrumentation effect. Please note these tests were done with PostgreSQL 8.4 devel.

Monday Mar 10, 2008

Light Weight Lock Contention

After the IO effect had been removed the tests continued, now able to focus on CPU related elements. As usual the quest is for finding where time flies by: either using the CPU or waiting for a resource. When using higher loads the CPUs are not used to their fullest capacity: the processes must be waiting for resources. Discarding network waits (it is here that commands come from), the next source is PostgreSQL's so-called LWLocks.

Sunday Mar 09, 2008

PostgreSQL performance study: introduction

While trying to improve the performance of PostgreSQL many tests are done. Currently the focus is on CPU related elements. In order to clean up the picture some IO related optimizations were done. This post shows some graphs depicting the write side of the IO. This post is the first in a (possibly) long series discussing findings, thoughts, etc.

Sunday Jul 15, 2007

Profiling jdbc related method calls with DTrace

The Sun Hotspot VM has built-in facilities for method tracing using DTrace. Each Java method entry and exit now has a probe that can fire. Any non-trivial Java program most likely makes many, many method calls, therefore the expected overhead is a factor to consider. While running a benchmark on a Glassfish application server environment using PostgreSQL as the database, we were trying to understand the time spent in the JDBC driver being used.

Here I try to describe how a package specific profiler is implemented through DTrace. In order to catch these firing probes access to the Java pid is needed. The generic way to create clauses around these probes is:

hotspot$1:::method-entry
and
hotspot$1:::method-return
Here $1 is the pid of the Java process of interest. These probes are specific to a process. Therefore the Java process needs to be started before the DTrace script can run (successfully). In order not to miss any method calls the Java process is best started with the following options in place:
-XX:+UnlockDiagnosticVMOptions
-XX:+PauseAtStartup
-XX:+ExtendedDTraceProbes
The ExtendedDTraceProbes option is needed to enable the two mentioned method probes. The UnlockDiagnosticVMOptions option is needed to enable the PauseAtStartup parameter handling. PauseAtStartup is needed to stop the VM before the Java program is started, but after the VM has been initialized. The little thing here is that all the VM probes are part of the libjvm.so library. Only once this library is loaded and .init-ted will the probes exist and can we enable the probes specified by our clauses.

The directory from which the VM is started will contain an (empty) file called vm.paused.<pid>, where <pid> is the process ID of the VM. Once this file exists the VM is in a stopped state and your DTrace script can be started. Once your DTrace script is compiled and active, remove the vm.paused.<pid> file. This will automagically activate the VM, after which it will "just" run to completion.
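
Put together, the start-up sequence looks roughly like this (the class and script names are placeholders):

java -XX:+UnlockDiagnosticVMOptions -XX:+PauseAtStartup \
     -XX:+ExtendedDTraceProbes MyBenchmarkClient &
JPID=$!
while [ ! -f vm.paused.$JPID ]; do sleep 1; done   # VM is initialized and paused
./jdbc_profile.d $JPID &                           # compile and enable the probes
sleep 5                                            # give DTrace time to attach
rm vm.paused.$JPID                                 # releases the VM; the run starts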

More details about the probes the hotspot provider gives you can be found at Keith his blog:
http://blogs.sun.com/kamg/category/Hotspot

Here I will describe my attempt to create a "profiler" that can focus on the use (that is, the (CPU) time consumption) of the methods in a certain package. A note of caution is called for: the code presented below will give a measurable performance degradation. I believe it should not be used in a production environment.

We had an interest in understanding which parts of a JDBC driver were consuming most of the time. The JDBC driver packages of interest were org/postgresql/jdbc2 and org/postgresql/jdbc3, as used from a Glassfish installation. This installation was stressed in a benchmark environment where we had a need to understand where time was spent.

In general a method will call other methods, which most of the time will call yet other methods. Not only did we need an understanding of which method consumed most of the time, but also whether this was due to the method itself or due to one or more other methods called by it. This separates the data into two groups: inclusive and exclusive numbers. I think the following idea should be used during the implementation:

target$1:::method-entry
{
    time[depth] = timestamp;        /* remember when we started           */
    depth++;                        /* one extra step in the call tree    */
    includeMeasurement[depth] = 0;  /* time consumption due to calls to   */
                                    /* other methods originating from this*/
                                    /* method is set to zero. Hey we just */
                                    /* started */
}

target$1:::method-return
/depth > 0/
{
    depth--;
    delta = timestamp - time[depth];/\* amount of time we spend in this    \*/
    inclusive += delta;             /\* method all the time spend in this  \*/
                                    /\* method (every invocation)          \*/
    includeMeasurement[depth]       /\* total time spend, including methods\*/
        += delta;                   /\* called by me, that must be sub-    \*/
                                    /\* stracted to get the exclusive time \*/
                                    /\* of the method that called me       \*/
    delta -=                        /\* all time on my account reduced with\*/
        includeMeasurement[depth+1];/\* the total time spend by others that\*/ 
                                    /\* I called will give me my exclusive \*/
    includeMeasurement[depth+1] = 0;/\* time to prevent this amount be used\*/
                                    /\* once more in some future           \*/
    exclusive += delta;             /\* My exclusive time                  \*/
}
There are some gotchas here. Any VM process is multithreaded. To ensure minimal overhead but proper handling of multiple competing threads, the DTrace "self->" construct is needed. Both the inclusive and exclusive variables are fine candidates to be implemented as aggregations. The delta variable is typically a clause-local variable, while most of the others should be thread-local variables. In order to implement this we also need to specify the types of the variables used.

I hope the following is a correct implementation:

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option destructive
#pragma D option defaultargs
#pragma D option bufsize=16m
#pragma D option aggrate=100ms
#pragma D option dynvarsize=16m

self string package;

/* class, method and signature together uniquely define a method
   (assuming a single namespace). Concatenating these into a single
   string would have my preference, but strjoin() is only valid in
   clause context. Looks to me like I need to keep them separately.
   If you know a better way, please do inform me.
*/

/* all these arrays are indexed by calling depth. This to not
   miscalculate when recursive calls are made.
*/
self string class[int];
self string method[int];
self string signature[int];

self int    started[int];
self int    vstarted[int];
self int    incl[int];
self int    vincl[int];
self int    depth;

BEGIN
{
        startedAt = timestamp;
}

/* the arguments specify the class, method, and signature.
   Currently only interested in the package name. This
   is the base name of the class.
*/
hotspot$1:::method-entry
{
        this->str_ptr = (char*) copyin(arg1, arg2+1);
        this->str_ptr[arg2] = '\0';
        self->class[self->depth] = (string) this->str_ptr;

        self->package = dirname(self->class[self->depth]);
}

/*  The clause specified above gets us our package
    name so we can filter. Only for the stuff of
    interest do we need the class, method and
    signature. And the time (vtime) everything
    started for this method at this call depth.
*/
hotspot$1:::method-entry
/  self->package == "org/postgresql/jdbc2"
|| self->package == "org/postgresql/jdbc3"
/
{
        this->str_ptr = (char*) copyin(arg3, arg4+1);
        this->str_ptr[arg4] = '\0';
        self->method[self->depth] = (string) this->str_ptr;

        this->str_ptr = (char*) copyin(arg5, arg6+1);
        this->str_ptr[arg6] = '\0';
        self->signature[self->depth] = (string) this->str_ptr;

        self->started[self->depth] = timestamp;
        self->vstarted[self->depth]= vtimestamp;

        self->depth++;
        self->incl[ self->depth ] = 0;
        self->vincl[ self->depth ] = 0;
}

hotspot$1:::method-return
/self->depth > 0/
{
        self->depth--;

        /* always nice to know how much time elapsed in
           the context of the amount of calls being handled.
           Puts things in a certain perspective.
        */
        @calls[ self->class[self->depth]
              , self->method[self->depth]
              , self->signature[self->depth]
              ] = count();

        this->delta  = timestamp  - self->started [self->depth];

        /* This one is easy. Just accumulate it all */
        @inclusive[ self->class[self->depth]
                  , self->method[self->depth]
                  , self->signature[self->depth]
                  ] = sum(this->delta);

        self->incl[self->depth] += this->delta;

        /* I used everything except what has been used by those
           called by me */
        this->delta -= self->incl[ self->depth + 1 ];

        /* it has been used, and must be reset to zero */
        self->incl[ self->depth + 1 ] = 0;

        @exclusive[ self->class[self->depth]
                  , self->method[self->depth]
                  , self->signature[self->depth]
                  ] = sum(this->delta);

        /* and the same story for the vtimestamp measurements */

        this->delta = vtimestamp - self->vstarted[self->depth];
        @vinclusive[ self->class[self->depth]
                   , self->method[self->depth]
                   , self->signature[self->depth]
                   ] = sum(this->delta);

        self->vincl[self->depth] += this->delta;
        this->delta -= self->vincl[ self->depth + 1 ];

        self->vincl[ self->depth + 1 ] = 0;

        @vexclusive[ self->class[self->depth]
                   , self->method[self->depth]
                   , self->signature[self->depth]
                   ] = sum(this->delta);
}

syscall::rexit:entry
/pid == $1/
{
        exit(0);
}

:::END
{
        printf("Total time (sec): %d\n", (timestamp - startedAt)/1000000000 );
        normalize(@inclusive, 1000);
        normalize(@exclusive, 1000);
        normalize(@vinclusive, 1000);
        normalize(@vexclusive, 1000);
        printa("C  %8@d %s/%s %s\n", @calls);
        printa("TI %8@d %s/%s %s\n", @inclusive);
        printa("TE %8@d %s/%s %s\n", @exclusive);
        printa("VI %8@d %s/%s %s\n", @vinclusive);
        printa("VE %8@d %s/%s %s\n", @vexclusive);
}
It might take some time to grasp everything. I tested it with the following method call flow:

A() {
    consume(15);
    B();
    consume(20);
    C();
    consume(5);
    B();
    consume(5);
}
B() {
    consume(5);
    C();
    consume(5);
    C();
    consume(10);
}
C() {
    consume(10);
}
Each consume() is an inline chunk of code that consumes that amount of time. A calls B twice, and C is called five times in total: once directly from A and twice in each of the two B invocations. So A's exclusive time is 15+20+5+5 = 45 and its inclusive time is 45 + 2x40 (B) + 10 (C) = 135; B's exclusive time is 2x(5+5+10) = 40 and its inclusive time is 40 + 4x10 = 80. This should lead to the following:
     Incl   Excl
A:    135     45
B:     80     40
C:     50     50    (called 5 times, 10 each)

Findings

Below are some tables with the findings of a benchmark run. They show the Top-5 by Count, Elapsed Inclusive, Elapsed delta, CPU Elapsed and CPU delta. The Ela columns show the results of the timestamp measurements, the CPU columns those of the vtimestamp measurements. Incl stands for inclusive, Excl for exclusive, and delta is the difference between the two. The method signatures are shortened by removing the package name part "org/postgresql/" from them.

Top-5 Count
  Count  Ela Incl  Ela Excl  Ela delta  CPU Incl  CPU Excl  CPU delta  Method signature
 532551     23311     23311          0      3590      3590          0  jdbc2/AbstractJdbc2ResultSet/toInt(Ljava/lang/String;)I
 578140    241200    106666     134534     22108      9831      12277  jdbc2/AbstractJdbc2ResultSet/getFixedString(I)Ljava/lang/String;
1144370     43396     43396          0      3725      3725          0  jdbc2/AbstractJdbc2ResultSet/trimString(ILjava/lang/String;)Ljava/lang/String;
1145565    263167    213991      49176     24355     20621       3734  jdbc2/AbstractJdbc2ResultSet/getString(I)Ljava/lang/String;
1198749     50701     50701          0      3872      3872          0  jdbc2/AbstractJdbc2ResultSet/checkColumnIndex(I)V

Top-5 Elapsed Inclusive
  Count  Ela Incl  Ela Excl  Ela delta  CPU Incl  CPU Excl  CPU delta  Method signature
 134422     73961     25529      48432      4462      1827       2635  jdbc3/AbstractJdbc3Connection/prepareStatement(Ljava/lang/String;II)Ljava/sql/PreparedStatement;
 105526     84984     38674      46310      2810      2048        762  jdbc2/AbstractJdbc2Statement$StatementResultHandler/handleResultRows(Lcore/Query;[Lcore/Field;Ljava/util/Vector;Lcore/ResultCursor;)V
 134422    144412     41973     102439      9674      2511       7163  jdbc2/AbstractJdbc2Connection/prepareStatement(Ljava/lang/String;)Ljava/sql/PreparedStatement;
 578140    241200    106666     134534     22108      9831      12277  jdbc2/AbstractJdbc2ResultSet/getFixedString(I)Ljava/lang/String;
1145565    263167    213991      49176     24355     20621       3734  jdbc2/AbstractJdbc2ResultSet/getString(I)Ljava/lang/String;

Top-5 Elapsed Delta
  Count  Ela Incl  Ela Excl  Ela delta  CPU Incl  CPU Excl  CPU delta  Method signature
 105526     84984     38674      46310      2810      2048        762  jdbc2/AbstractJdbc2Statement$StatementResultHandler/handleResultRows(Lcore/Query;[Lcore/Field;Ljava/util/Vector;Lcore/ResultCursor;)V
 134422     73961     25529      48432      4462      1827       2635  jdbc3/AbstractJdbc3Connection/prepareStatement(Ljava/lang/String;II)Ljava/sql/PreparedStatement;
1145565    263167    213991      49176     24355     20621       3734  jdbc2/AbstractJdbc2ResultSet/getString(I)Ljava/lang/String;
 134422    144412     41973     102439      9674      2511       7163  jdbc2/AbstractJdbc2Connection/prepareStatement(Ljava/lang/String;)Ljava/sql/PreparedStatement;
 578140    241200    106666     134534     22108      9831      12277  jdbc2/AbstractJdbc2ResultSet/getFixedString(I)Ljava/lang/String;

Top-5 CPU Elapsed
  Count  Ela Incl  Ela Excl  Ela delta  CPU Incl  CPU Excl  CPU delta  Method signature
 134422     73961     25529      48432      4462      1827       2635  jdbc3/AbstractJdbc3Connection/prepareStatement(Ljava/lang/String;II)Ljava/sql/PreparedStatement;
 105499     52074     17963      34111      4806      1470       3336  jdbc2/AbstractJdbc2Statement/executeQuery()Ljava/sql/ResultSet;
 134422    144412     41973     102439      9674      2511       7163  jdbc2/AbstractJdbc2Connection/prepareStatement(Ljava/lang/String;)Ljava/sql/PreparedStatement;
 578140    241200    106666     134534     22108      9831      12277  jdbc2/AbstractJdbc2ResultSet/getFixedString(I)Ljava/lang/String;
1145565    263167    213991      49176     24355     20621       3734  jdbc2/AbstractJdbc2ResultSet/getString(I)Ljava/lang/String;

Top-5 CPU Delta
  Count  Ela Incl  Ela Excl  Ela delta  CPU Incl  CPU Excl  CPU delta  Method signature
 134422     73961     25529      48432      4462      1827       2635  jdbc3/AbstractJdbc3Connection/prepareStatement(Ljava/lang/String;II)Ljava/sql/PreparedStatement;
 105499     52074     17963      34111      4806      1470       3336  jdbc2/AbstractJdbc2Statement/executeQuery()Ljava/sql/ResultSet;
1145565    263167    213991      49176     24355     20621       3734  jdbc2/AbstractJdbc2ResultSet/getString(I)Ljava/lang/String;
 134422    144412     41973     102439      9674      2511       7163  jdbc2/AbstractJdbc2Connection/prepareStatement(Ljava/lang/String;)Ljava/sql/PreparedStatement;
 578140    241200    106666     134534     22108      9831      12277  jdbc2/AbstractJdbc2ResultSet/getFixedString(I)Ljava/lang/String;

From these top-5 lists it can be concluded that, in the context of this benchmark, the three methods of interest are all in the org/postgresql/jdbc2 package: getString and getFixedString, both from the AbstractJdbc2ResultSet class, and prepareStatement from the AbstractJdbc2Connection class. The next step in this quest is to only check what is happening with respect to these three methods.
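
The same hotspot provider probes can be used for that: one way, sticking to the structure of the script above, is to tighten the entry predicate so that the time-stamping and aggregation only happen for those three methods. A hedged sketch of such a predicate (my own, not the script actually used; it assumes a preceding clause has already copied in the class and method names for every entry, just like the package name is derived above):

/* only start a measurement for the three methods that dominate the top-5 lists */
hotspot$1:::method-entry
/  (self->class[self->depth]  == "org/postgresql/jdbc2/AbstractJdbc2ResultSet"
    && (self->method[self->depth] == "getString"
     || self->method[self->depth] == "getFixedString"))
|| (self->class[self->depth]  == "org/postgresql/jdbc2/AbstractJdbc2Connection"
    && self->method[self->depth] == "prepareStatement")
/
{
        self->started[self->depth]  = timestamp;
        self->vstarted[self->depth] = vtimestamp;
        self->depth++;
}

The clause that copies in the class name for every method entry is still needed; the predicate only limits the bookkeeping and the aggregations to the three methods of interest.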
