Taming GC for SIP Applications... part2

In this entry, we will touch on the crux of GC tuning for SIP applications.

GC Tuning Tips:

Apart from the regular CMS GC tunings for a server, the following tunings make CMS run in a more predictable manner. Let's dive into some of them and see their effect on predictability.

Young Generation Size:

The size of the young generation is controlled through the '-Xmn' flag. For SIP workloads, keeping this size as small as practical yields better average response times and a greater probability of meeting your 95th percentile response time requirements.

-Xmn<size>
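One way to check what a given -Xmn setting actually produced is to ask the running JVM for its heap pools. The sketch below uses the standard java.lang.management API. The class name HeapPools is made up for illustration, and the pool names it prints depend on the collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of SailFin): lists the heap pools of the running JVM.
final class HeapPools {

    // Returns one line per heap pool: name, committed bytes, max bytes.
    // A max of -1 means the JVM reports no defined maximum for that pool.
    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            lines.add(String.format("%s: committed=%d max=%d",
                    pool.getName(), usage.getCommitted(), usage.getMax()));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe()) {
            System.out.println(line);
        }
    }
}
```

Running it once with and once without the -Xmn setting shows how the Eden and survivor pools change.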


Parallel young generation:

Enable parallel young generation collection with the following options. Also, set the number of parallel GC threads to the number of CPUs or cores less one. The idea is to leave one core free for handling network interrupts and similar work.

-XX:+UseParNewGC -XX:ParallelGCThreads=N (where N = number of cores - 1)
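The "cores less one" rule can be sketched as a small helper that builds the flags for a launch script. The class and method names are made up for illustration. Note that Runtime.availableProcessors() reports logical processors, which may be more than the number of physical cores.

```java
// Hypothetical helper: derives the parallel young generation flags from a core count.
final class ParNewFlags {

    // Leaves one core free for interrupts, but never drops below one GC thread.
    static int gcThreads(int cores) {
        return Math.max(1, cores - 1);
    }

    // Builds the two options as they would appear on the java command line.
    static String flags(int cores) {
        return "-XX:+UseParNewGC -XX:ParallelGCThreads=" + gcThreads(cores);
    }

    public static void main(String[] args) {
        // Assumption: the machine running this has the same core count as the target server.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(flags(cores));
    }
}
```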

Maximum Object Tenure Threshold:

In the SIP world, object life spans depend on the timers associated with the transactions. Holding these objects longer in the young generation just causes overhead from copying them or checking them for liveness. In such cases, we are better off promoting them to the old generation. One needs to study the promotion rate after setting this attribute. If many objects still end up promoted even with a higher tenure threshold, then set this value to zero.

-XX:MaxTenuringThreshold=N
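To get a feel for the copying overhead, here is a deliberately simplified model, not a description of the actual collector. It assumes a fixed fraction of a cohort of objects survives each young collection, that survivors are copied once per collection, and that an object is promoted on the collection after it has survived N collections in the survivor spaces. It ignores adaptive tenuring and survivor space overflow. All names are made up for illustration.

```java
// Hypothetical model of young generation copy cost for one cohort of objects.
final class TenuringModel {

    // Total bytes copied by young collections until the cohort is promoted.
    // cohortBytes: bytes allocated in Eden; survivalRate: fraction alive after each collection.
    static double bytesCopied(double cohortBytes, double survivalRate, int maxTenuringThreshold) {
        double copied = 0.0;
        double live = cohortBytes;
        // N copies between survivor spaces plus one final copy into the old generation.
        for (int gc = 1; gc <= maxTenuringThreshold + 1; gc++) {
            live *= survivalRate;
            copied += live;
        }
        return copied;
    }

    public static void main(String[] args) {
        // Assumed numbers: 100 MB cohort, 90% survival because SIP timers keep objects alive.
        for (int threshold = 0; threshold <= 4; threshold++) {
            System.out.printf("MaxTenuringThreshold=%d -> %.1f MB copied%n",
                    threshold, bytesCopied(100.0, 0.9, threshold));
        }
    }
}
```

When most objects survive anyway, each extra unit of threshold adds close to another full copy of the cohort in this model, which is the argument for a low or zero threshold.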

Survivor Spaces:

Set the survivor spaces according to your tenure threshold. If you decided to set the tenure threshold to zero, then set the survivor ratio to a high value, which makes the survivor spaces very small relative to Eden. The idea is to promote objects to the old generation directly.

-XX:SurvivorRatio=N
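SurvivorRatio is the ratio of Eden to one survivor space, and there are two survivor spaces, so each one gets roughly young / (N + 2). The sketch below does only that arithmetic. The names are made up for illustration, and a real JVM also rounds these sizes for alignment.

```java
// Hypothetical helper: approximate Eden and survivor sizes for a given young generation.
final class SurvivorSizing {

    // Size of ONE survivor space; the young generation holds two of them.
    static long survivorBytes(long youngBytes, int survivorRatio) {
        return youngBytes / (survivorRatio + 2);
    }

    // Eden is whatever remains after both survivor spaces.
    static long edenBytes(long youngBytes, int survivorRatio) {
        return youngBytes - 2 * survivorBytes(youngBytes, survivorRatio);
    }

    public static void main(String[] args) {
        long young = 256L * 1024 * 1024; // assumed -Xmn256m
        for (int ratio : new int[] {6, 32, 128}) {
            System.out.printf("SurvivorRatio=%d -> eden=%d survivor=%d%n",
                    ratio, edenBytes(young, ratio), survivorBytes(young, ratio));
        }
    }
}
```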

CMS Occupancy Fraction:

The CMS initiating occupancy fraction plays a critical role in controlling what happens in the old generation. Setting this value low lets CMS run frequently and keep the old generation in check all the time. The benefits include better control over CPU spikes (an important factor for Telco carriers with regard to overload protection) and a somewhat better handle on CMS fragmentation failures. But this benefit comes at a price: more CPU utilization. Because CMS runs almost continuously, it uses more CPU than is strictly required to clean up the objects in the old generation. Most of a CMS cycle runs concurrently with the application threads, but some parts of it stop the application threads. Refer to the CMS GC documentation for more details.

-XX:CMSInitiatingOccupancyFraction=N
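The fraction is a percentage of the old generation, so the point at which a CMS cycle starts can be estimated with simple arithmetic. The sketch below assumes the old generation is the total heap minus the young generation. The heap sizes and names are made up for illustration.

```java
// Hypothetical helper: estimates the old generation occupancy at which CMS starts a cycle.
final class CmsTrigger {

    // occupancyFraction is the N in -XX:CMSInitiatingOccupancyFraction=N (a percentage).
    static long triggerBytes(long oldGenBytes, int occupancyFraction) {
        if (occupancyFraction < 0 || occupancyFraction > 100) {
            throw new IllegalArgumentException("fraction must be between 0 and 100");
        }
        return oldGenBytes * occupancyFraction / 100;
    }

    // Space left for promotions while the concurrent cycle is running.
    static long headroomBytes(long oldGenBytes, int occupancyFraction) {
        return oldGenBytes - triggerBytes(oldGenBytes, occupancyFraction);
    }

    public static void main(String[] args) {
        long heap = 2048L * 1024 * 1024;  // assumed 2 GB heap
        long young = 256L * 1024 * 1024;  // assumed -Xmn256m
        long old = heap - young;
        for (int fraction : new int[] {40, 60, 80}) {
            System.out.printf("N=%d -> starts at %d bytes, headroom %d bytes%n",
                    fraction, triggerBytes(old, fraction), headroomBytes(old, fraction));
        }
    }
}
```

A lower N starts the cycle earlier and leaves more headroom for promotions, at the cost of running CMS more often.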

Future Work:

Even though we have solved some of the problems related to 'Predictability' and achieved the goal, some of our observations need more study, including the GC pause spikes in the young generation shown in the graph above. We have identified two areas for further study.

The first area is the effect of the 'UseMembar' JVM option on the young generation. This option was suggested as a way to gain some control over these spikes, but our meticulous SailFin QA stress team reported regressions in some of their tests, while the performance team observed some good data with it.

The second area that requires more attention is the 'UseTLAB' JVM option, which controls Thread Local Allocation Buffers. For the current work we left the ergonomics for this option enabled, meaning we let the JVM study the usage patterns and self-correct the buffer sizes in the Eden region. In my opinion, this option's ergonomics work pretty well. To investigate the young generation spikes, one can dig further into this option and study its impact. Even though this is just one option in the grand scheme of CMS JVM options, a lot of work takes place behind the scenes for TLABs. If you are interested, refer to the great blog post by Jon Masamitsu of the GC team on this topic; you will realize how much work went into this option.

Another area that might be interesting to study for SIP workloads is the use of the throughput collector with response time goals.

Conclusion:

Requirements like response time predictability and time budgeting can be met, to the degree Telco vendors require, with existing JVM technologies. There is a new trend in the Telco arena of embracing 'Java' based infrastructure and moving away from legacy 'C' based infrastructure, and a very good example of this is the open source Project SailFin. The benefits to Telcos are enormous: more leverage from existing Java technologies like web services, EJB and JMS; continuous innovation in and around Java based technologies; and strong support from the surrounding community. This embrace of 'Java' by Telcos for such low level protocol implementations will spur a new level of requirements on Java and JVM technologies, and that is good for the community. The benefit you and I get is a better and more efficient daily communication infrastructure for staying in touch with kith and kin across the world.

In addition to the above, the Telco aspects of Project SailFin, such as performance, scalability and predictability, are stunning. We, the performance team, think it's a continuous process and there will always be room for improvement. At Sun Microsystems, we take these enterprise aspects as seriously as can be. Project SailFin has a great, knowledgeable community around it, and if you would like to make suggestions or send us your observations, we are always open to and welcome community contributions.

Acknowledgments:

I would like to sincerely thank Y.S. Ramakrishna (Ramki) from the GC team for his invaluable suggestions and his relentless efforts, at times building a custom VM for our analysis, to give us more visibility into the GC generations based on our requirements. I would also like to thank Brian Doherty from the JVM team for his initial suggestions on CMS internals and for sharing his prior work on CMS. Scott Oaks is our overall GlassFish performance lead, and I have captured some of his findings on the effect of the CMS occupancy fraction on CPU utilization. Thanks to Robert Handle and the other Ericsson members who helped us look at the system from a Telco perspective. Finally, I would like to thank Madhu Konda (Manager) and Sreeram Duvur (SailFin Architect) for their continued support and for asking thought provoking questions that kept us on track in solving these problems.
