It's Over 9000!
By Jacob Kessler on Nov 25, 2008
over 9000 operations per second, that is.
This week's blog is a story, told in the manner of a fairy tale. Don't worry, I've put all of the actual important tips down at the bottom if you don't feel like reading the whole story.
Once upon a time, there was an issue with JRuby support on Glassfish: it wasn't scaling well. Throwing high-concurrency loads at it made it very, very sad and slow, and scaling was very important for it to do. The sages ran benchmarks on it and announced that with a hello-world app, Glassfish could do 4000 operations per second, but they weren't sure where bottlenecks were being introduced in larger apps. A call went out to find the bottleneck, and many brave heroes from across the land pledged their support to the quest to make Glassfish the fastest JRuby framework deployment solution on the planet.
Yehuda Katz, developer of the Merb framework, kindly provided a slightly more complex hello-world app to test on. It had three different routes. The first was simply a hello-world response. The second introduced a sleep delay of .1 seconds to simulate a database access. The third introduced an even longer delay of .5 seconds. The task of divining the problem was given to Jacob, the youngest sage (still a sage-in-training, really), to see what he could come up with.
Jacob went forth, armed with a benchmarking tool (Faban) and Yehuda's app, to test how well Glassfish was scaling under high-concurrency loads. He sat down and ran his first set of benchmarks with ten simultaneous users (c=10), and received the following results, in operations per second:
No delay, c=10: 3861.833
.1 delay, c=10: 76.43
.5 delay, c=10: 18.47
These he used as a baseline when he ran his second round of tests, using c=100:
No delay, c=100: 438
.1 delay, c=100: 116
.5 delay, c=100: 33
These numbers were worrying indeed. While the delayed routes were getting slightly more operations per second, the enormous hit to performance with no delay (from over 3800 operations per second down to 438) was completely unacceptable. But Jacob remembered the wise words of Jean-Francois Arcand, as dispensed in these two blog entries, and went forth to tune Grizzly for performance.
And what glorious performance it was!
No delay, c=10: 7652 (average time .001 seconds)
No delay, c=100: 7823 (average time .01 seconds)
No delay, c=300: 9053 (average time .033 seconds)
.1 delay, c=10: 99 (average time .100 seconds)
.1 delay, c=100: 979.8 (average time .102 seconds)
.1 delay, c=300: 1462 (average time .204 seconds)
.5 delay, c=10: 19.6 (average time .500 seconds)
.5 delay, c=100: 196.3 (average time .501 seconds)
.5 delay, c=300: 289.8 ops/sec (average time 1.001 seconds)
These numbers were much better. Even the c=300 numbers were encouraging, since they showed that even at 150 threads, the bottleneck was still the number of worker threads that Grizzly was using: the delayed c=300 requests were taking almost exactly twice as long as the c=100 requests, which is what you would expect when 300 simultaneous requests have to share 150 threads. Sadly, Jacob couldn't experiment with larger numbers of threads because his ulimit was too low, but the evidence was clear. The no-delay c=300 benchmark also gave him a horrible name to use as the title of his blog post.
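For the curious, the worker-thread reasoning can be checked with a little arithmetic. The Python sketch below is my own illustration, not part of the original benchmark: it assumes a closed-loop load with no think time, that each request holds exactly one worker thread for the full sleep delay, and that everything else costs nothing. The function name `predict` is hypothetical; the 150-thread figure and the measured numbers come from the tuned run above.

```python
# Idealized model: each request occupies one worker thread for `delay` seconds.
WORKER_THREADS = 150  # thread count used in the tuned run


def predict(concurrency, delay, threads=WORKER_THREADS):
    """Return (ops_per_sec, avg_response_seconds) under the idealized model."""
    busy = min(concurrency, threads)     # requests being served at any instant
    throughput = busy / delay            # each busy thread completes 1/delay ops per second
    avg_time = concurrency / throughput  # Little's law: concurrency = throughput * time
    return throughput, avg_time


# (concurrency, delay) -> (measured ops/sec, measured average time in seconds)
measured = {
    (10, 0.1): (99, 0.100),
    (100, 0.1): (979.8, 0.102),
    (300, 0.1): (1462, 0.204),
    (10, 0.5): (19.6, 0.500),
    (100, 0.5): (196.3, 0.501),
    (300, 0.5): (289.8, 1.001),
}

for (c, delay), (ops, secs) in measured.items():
    p_ops, p_secs = predict(c, delay)
    print(f"c={c:3d} delay={delay}: predicted {p_ops:7.1f} ops/s in {p_secs:.3f} s, "
          f"measured {ops:7.1f} ops/s in {secs:.3f} s")
```

Under these assumptions the model caps throughput at 150 threads divided by the delay, so at c=300 each request spends roughly half its time waiting for a free thread, which lines up with the doubled average times in the table.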
And so, with Glassfish performance shown to be even greater than previously supposed, the sages were able to report the problem solved, and everyone lived happily ever after. And by "Happily ever after" I mean "went back to making Glassfish even better". Which makes me happy, so that's fine.
Now, the less-story part: The defaults for Grizzly in domain.xml are, to quote JF Arcand, "not really appropriate when Glassfish is used in production". Changing them can, as shown here, greatly increase the performance of the app server. If your application doesn't seem to be scaling well to large numbers of concurrent requests, you should try experimenting with those values until you find the ones that work well for the system that you are running on. At the very least, I would suggest:
1) Increasing the number of worker and acceptor threads until you see as much CPU usage as you want (probably 100%). As shown by the high-thread, no-delay and the low-concurrency, high-delay tests, having "too many" threads doesn't hurt performance, so going too high isn't much of an issue, short of hitting the ulimit.
2) Using the parallel old garbage collector (-XX:ParallelGCThreads=N -XX:+UseParallelOldGC). This runs full garbage collections on several threads at once. The application still pauses during a full collection, but the pauses are shorter, which means shorter periods when requests are waiting and can't be served because the GC is running.
 All benchmarks were performed with fhb -r 20/30/10 in a totally unscientific way on a quad-core Intel Q6850 that was running both Faban and Glassfish.
Not quite according to those blogs, actually. I was only using 3 GB of heap, and I wasn't using AggressiveHeap. (My father keeps telling me that AggressiveHeap, which puts in some high-performance but resource-intensive defaults rather than letting the GC figure things out on its own, used to be better, but that the collection heuristics have since improved and can now consistently out-perform it.) I was also using 150 threads rather than 130.