Four Guidelines for sizing JRuby apps

 I warned that there would be many blogs this week, didn't I?

 Once upon a time, on the Jruby mailing list, a question was asked. The question was roughly "What effect do heap sizes etc. have on memory usage and performance, and how can I run my Rails app in the minimum possible space?" That's what I'll be trying to answer today, hopefully in a way that is applicable to more than just Rails apps and hopefully in a way that removes some of the feeling that memory usage and heap sizing are a black art.

For the purposes of this blog, I'll be using the Hotspot Java virtual machine, Jruby 1.1.2, and Glassfish Gem v0.9.3. Your command line options may vary slightly if you're using different software, but the theory should remain the same.

First, the easy one. Memory usage of any Java program on the will grow to a bit over the maximum heap size that you set at startup. If you don't set a maximum heap size, JRuby will start with 512MB of heap, giving you maximum memory usage of around 550MB. It will start out much smaller and only grow as it fills the space it has already. With a Rails app, however, you will fill it, so it's safe to assume that your app will end up taking up as much space as you let it. You can adjust the maximum heap size with -J-Xmx{size}, for example -J-Xmx128m to have a maximum size heap size of 128MB, which (after the non-heap memory is included) will be a total memory use of about 160MB. The heap won't shrink once it's grown, either, so however much you tell it it can use is how much it will use.

So, that's how heap size affects memory usage. How does heap size affect performance? To do that, we'll need to learn a bit about the Hotspot Garbage Collector.

The basic assumption of the garbage collector is that most objects will be extremely short-lived, by which I mean less than a single collection. Hotspot creates three different areas of the heap. The first is the "Eden space", where all new objects are created. The second is the "Survivor space", where objects that have stayed live through a collection live. Those two together form the "Young Generation", and account for most of the space in the heap. Finally, there is the "Old Generation", where objects that have lived through a collection in the survivor space get moved to. Once an object has reached the old generation, it's assumed that it's probably sticking around (because it's something like the JRuby interpreter). An object is considered "live" if it can be reached from any of the roots (things like currently running threads), or from any other live objects.

Mostly, though, you don't care about the gritty details of the garbage collector. The important thing to remember is that every request that you serve will generate garbage, and that that's completely fine. That garbage slowly piles up until the eden space is full, and then a young generation collection happens. That moves all the live objects out of the eden space, and reclaims the space from all the dead ones. It's also really, really fast, and scales based on how much stuff is still alive (rather than the size of the heap). But, stuff from the young generation collections slowly piles up too, and eventually a full garbage collection comes in. That collection stops all running threads, checks everything, moves things around as appropriate, and only then allows things to continue. Those clean out all the accumulated garbage, but are slower and (more importantly) pause your application while they are running. There are low-pause collectors included with Hotspot, but I'm not going to go into them right now since the basic theory is the same. What you care about, performance-wise, is that you don't spend too much of your processor time collecting garbage.

Obviously, the more heap space you give to your app, the less time you will spend collecting garbage. This leads to Guideline #1:

    Always give your applications as much heap space as you can

"As you can" is a bit of a vague term, though. The general theory is that, on a computer that is tasked only with running your app, you should get 100% memory usage. After all, if you buy 4GB of memory and only use 1.5GB, you're wasting however much that extra 2.5GB cost. You don't get any performance boost from having unused memory.
This, of course, doesn't quite work in reality, since if you ever try to use more memory than you have RAM, you'll start using your hard drives as memory and things will get really slow really quickly. That doesn't change the fact that you want to be using as much of your computer's memory as you can when you are running your app, though.

So, that sounds easy, right? You have an app and a computer with 4GB of RAM. You decide to play it safe, run with 3GB of heap, and everything works great, right? Unfortunately not, since people have an unfortunate tendency to try to run more than one app on the same computer, for whatever reason. Maybe they are a hosting provider, and want to serve as many customers as they can on each of their computers. Maybe they are using non-multithreaded Rails, and want to run multiple copies of Rails to overcome that. Maybe the computer has other things to be doing while it's serving apps.

So now you need to compromise. Sure, more memory is always better, but how much does it actually /need/? To find the answer out, we're going to go diving into the garbage collection logs. Get out your wetsuits. If you want to follow along, go ahead and start glassfish serving the app of your choice with

jruby -J-verbose:gc -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDetails -J-Xloggc:./gc.log -J-XX:+DisableExplicitGC -J-Xmx512m -S glassfish -e production [1]

And then hit it with a few thousand requests to generate some garbage. You'll want to make sure you get at least two full collections after it has started up. Lets look at some logs, shall we? Here we've got an excerpt from one of my test applications under an HTTP load test.

    1017.433: [GC [PSYoungGen: 172128K->2240K(172288K)] 311942K->142094K(312320K), 0.0112010 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
    1018.222: [GC [PSYoungGen: 172096K->2304K(172288K)] 311950K->142215K(312320K), 0.0099330 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 
    1019.011: [GC [PSYoungGen: 172160K->2336K(172288K)] 312071K->142295K(312320K), 0.0113770 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
    1019.803: [GC [PSYoungGen: 172192K->2336K(172288K)] 312151K->142347K(312320K), 0.0094030 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 
    1019.813: [Full GC [PSYoungGen: 2336K->0K(172288K)] [PSOldGen: 140011K->91659K(133184K)] 142347K->91659K(305472K) [PSPermGen: 41801K->41801K(80576K)], 0.3587540 secs] [Times: user=0.35 sys=0.00, real=0.36 secs] 
    1020.957: [GC [PSYoungGen: 169856K->2432K(171136K)] 261515K->95585K(304320K), 0.0115010 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] 
    1021.734: [GC [PSYoungGen: 171136K->2448K(171712K)] 264289K->96737K(304896K), 0.0112320 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 
    1022.516: [GC [PSYoungGen: 171152K->2336K(171648K)] 265441K->96969K(304832K), 0.0091780 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]

Looks kind of messy, doesn't it? The first column of numbers is the time (in seconds) since the VM started, here about a thousand seconds into the run. We'll be using those a lot.
The first three lines are the young generation collections that I was talking about earlier. You can see that I'm getting one approximately every .8 seconds, but that each one is only taking .01 seconds or so. That means that I'm spending roughly 1% of my time doing young generation collections. The third and fourth lines are a single full collection happening (a young generation collection always happens immediately before a full collection), and you can see that it took roughly a third of a second. However, my next full collection will be a bit over 10 minutes in the future, so my total time in full GCs is negligible. After the full collection, the young generation collections pick up again, and things are normal. I can also see that I've devoted about 90MB to storing my app, and 80MB (of which 40MB are used) to store class defintions and things like that (the Perm Gen). From a performance standpoint, if I were to run in an infinite amount of heap space I could see at most 1% improvement, since that's all that the garbage collector is dealing with. I can also see that I could shave around 30MB off by bringing the permgen setting down to 50mb or so (running out of permgen is really bad, so you want to leave a bit of a buffer). That 1% GC time, especially since it's coming almost exclusively from the young generation collections, means that the 512MB of heap that we gave our app was likely too much if we were trying to conserve memory. Lets trim it down, shall we? For those of you following along, restart the gem with the following command:

jruby -J-verbose:gc -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDetails -J-Xloggc:./gc_small.log -J-XX:+DisableExplicitGC -J-Xmx95m -S glassfish -e production

Let's see what that log file looks like...

    138.993: [GC [PSYoungGen: 31808K->960K(29760K)] 92492K->63593K(95296K), 0.0061510 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
    139.128: [GC [PSYoungGen: 29760K->1984K(30784K)] 92393K->65590K(96320K), 0.0055150 secs] [Times: user=0.01 sys=0.01, real=0.01 secs] 
    139.268: [GC [PSYoungGen: 30784K->1984K(27968K)] 94390K->67254K(93504K), 0.0052070 secs] [Times: user=0.02 sys=0.01, real=0.01 secs] 
    139.273: [Full GC [PSYoungGen: 1984K->0K(27968K)] [PSOldGen: 65270K->57022K(65536K)] 67254K->57022K(93504K) [PSPermGen: 40896K->40653K(57280K)], 0.3145120 secs] [Times: user=0.31 sys=0.00, real=0.31 secs] 
    139.710: [GC [PSYoungGen: 25984K->1568K(29376K)] 83006K->58590K(94912K), 0.0056690 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
    139.843: [GC [PSYoungGen: 27552K->1440K(28928K)] 84574K->58462K(94464K), 0.0029340 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]

And now we get into part of why heap size tuning is a dark art. Here, the app that was previously living in 90MB has shrunk to around 60MB, probably because of how Ruby builds stuff based on heap size. Anyway: our young generation collections are happening every .2 seconds now, but they only take .005 seconds. That means that we are spending about 2.5% of our time collecting garbage (the next full GC is in two minutes and still takes .3 seconds, so we're up to .2% of our time in full GCs. Still negligible). The practical side of this is that we just traded 1.5% performance for 80% of our memory footprint. If we were trying to run multiple apps, we'd now be able to run five times as many, which (assuming that memory and not processing power was the limiting factor) would more than make up for the performance loss. A big win for us! So, Guideline #2 is

    When an application has plenty of memory, large decreases can be made without noticably affecting the performance

Lets try bringing it even lower.

jruby -J-verbose:gc -J-XX:+PrintGCTimeStamps -J-XX:+PrintGCDetails -J-Xloggc:./gc_verysmall.log -J-XX:+DisableExplicitGC -J-Xmx64m -S glassfish -e production

    7.335: [GC [PSYoungGen: 15120K->6144K(14016K)] 50014K->41638K(56384K), 0.0084020 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
    7.384: [GC [PSYoungGen: 13504K->6096K(14592K)] 48998K->42070K(56960K), 0.0074700 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
    7.454: [GC [PSYoungGen: 13456K->5056K(14592K)] 49430K->42326K(56960K), 0.0068940 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
    7.502: [GC [PSYoungGen: 12416K->5376K(14592K)] 49686K->43150K(56960K), 0.0108550 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 
    7.552: [GC [PSYoungGen: 12736K->4416K(14592K)] 50510K->43839K(56960K), 0.0067770 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 
    7.644: [GC [PSYoungGen: 11776K->2720K(14592K)] 51199K->44099K(56960K), 0.0062080 secs] [Times: user=0.02 sys=0.01, real=0.01 secs] 
    7.650: [Full GC [PSYoungGen: 2720K->1598K(14592K)] [PSOldGen: 41379K->42367K(43712K)] 44099K->43966K(58304K) [PSPermGen: 31861K->31861K(50688K)], 0.2799870 secs] [Times: user=0.28 sys=0.00, real=0.28 secs] 
    7.977: [Full GC [PSYoungGen: 7360K->551K(14592K)] [PSOldGen: 42367K->43711K(43712K)] 49727K->44263K(58304K) [PSPermGen: 31907K->31907K(50112K)], 0.2675740 secs] [Times: user=0.26 sys=0.00, real=0.27 secs] 
    8.316: [Full GC [PSYoungGen: 7360K->831K(14592K)] [PSOldGen: 43711K->43711K(43712K)] 51071K->44543K(58304K) [PSPermGen: 31925K->31925K(49024K)], 0.2659910 secs] [Times: user=0.26 sys=0.00, real=0.27 secs]

That doesn't look happy at all, does it? Not only do we start out doing a .01 second young generation collection every .05 seconds (for 20% garbage collector load), once we put actual load on it we start thrashing. Full collections start running without freeing up space, .28 second full collections are running every .33 seconds (for 85% garbage collector load!), and the app starts refusing requests. What happened there? We traded 82% performance for 30% of our memory footprint: Not a good deal at all. In fact, the lack of ability to process requests is probably the only thing that saved us from getting out of memory errors and bringing the whole server down. This brings us to Guideline #3.

    When an application has too little memory, even small changes can enormously hurt performance.

Of course, the heap sizes that you see each of the above behaviors will depend on which app, and which framework, you are running. Larger apps need more memory, smaller apps can get away with less. If this were a "hello world" app, for example, it would probably run fine in 64MB of memory (and less than that if I wasn't using a 64-bit machine. Memory sizes are larger on 64-bit machines than 32-bit machines). You can also play around with the different collectors (-J-XX:+UseParallelGC and -J-XX:+UseParallelOldGC for the high-throughput garbage collector, -XX:+UseConcMarkSweepGC for the low-pause collector), especially if you are running on a multiprocessor machine, since they can further reduce your garbage collection times.

The last bit about minimizing memory usage is to remember that sometimes, memory isn't your limiting factor. If I take a single-processor Pentium III, put 2GB of RAM into it, and tried to run ten instances of my test app on it, they would all fit in memory but be running quite poorly, since the CPU couldn't keep up with them all. CPU is a bit harder to provision (since it varies more), but in general you should aim for about 95% CPU usage under your full load (the extra 5% is a buffer so that smaller spikes don't degrade performance). It's important to remember that that is your _full_, day-after-thanksgiving load, not your typical load or your typical peak load, which means that you will normally have significant chunks of CPU idle. In this case, that's worth it to avoid degrading performance at your peak times. To sum that up, we have Guideline #4

    Make sure that your server is using all (with some reasonable buffer) of its memory and CPU when it is at full load.

In short, you want the largest heaps that you can get for each of as many apps as it takes to use all of the available processor when fully loaded, as long as your GC times are reasonable in those conditions. If they aren't, either add memory to the server or reduce the number of apps to allow you to increase the heap size on the remaining ones until full GC times are reasonable. You'll have to play around a bit to find that point, though, and remember that each app will be different. So, go out there and experiment!

[1] A note about -J-XX:+DisableExplicitGC: I almost always run with this flag, which NOOPs calls to System.gc() (and the Ruby equivelents GC.start and GC.garbage_collect). This prevents poor programming from forcing the garbage collector to run more often than it is needed. Originally, the idea was that in some apps there might be times that the app knew were good times to collect garbage even if one wasn't strictly needed, but the garbage collection behavior has gotten a lot more sophisticated now and forcing garbage collections, except for debugging purposes (like detecting memory leaks) is generally a bad idea.


Post a Comment:
  • HTML Syntax: NOT allowed

Jacob Kessler


« December 2016