Taming GC for SIP Applications

I plan to write a two-part series on this topic. Here is the gist of what you can expect.

Part 1: SailFin GC Predictability
Part 2: GC Tuning Tips to Achieve Low Latency

Let us dive into these topics individually now.


For quite some time now, we have been looking at Project SailFin with a focus on Telco production deployment features such as Predictability, High Performance, Scalability and Longevity. In this entry, I will focus on one such aspect of Project SailFin, 'Response Time Predictability', and set the stage for some more thoughts to be explored in the near future. Getting your SIP applications to behave in a time-constrained and budgeted way is a time-consuming and iterative process. I will highlight certain areas and share our experience to save you effort and time in this process.

To set expectations right, the aim of this entry is to bootstrap your thought process about 'Predictability' for SIP workloads. One could also extrapolate this information and apply it to any other Java-based workload where 'Predictability' and 'Time Budgeting' are critical. Exact values for the tuning attributes are not given in this entry, for the simple reason that each application is different and needs specific attention to various other areas of the system. Instead, the options will be explained in enough depth to give you a fair idea of what the values could be. We will see in a moment what these attributes are and how you can tune SIP workloads to your specific needs.

First, let's define certain terms to establish common ground.

What is Workload?

What is Predictability?

Note: In real-time systems terminology, predictability has a stricter meaning. For the scope of this article, we are focusing only on soft real-time systems, where a response that misses its time criteria is not treated as a failure. We still achieve the required degree of predictability for SIP workloads.

In the context of Project SailFin, the study of predictability requires a detailed analysis of the Grizzly Network Manager, JVM JIT compilation, GC, the SIP load generator and the application itself for any randomness. If such randomness exists, you should either minimize it through tuning or account for it in your predictability calculations.

A series of white papers is planned around SailFin, covering performance in terms of throughput and response time characteristics as well as instance and core scalability. The scope of this article is limited to GC predictability.

GC Predictability:

    One of the challenges you will come across for SIP applications in the Java world is tuning garbage collection. The SIP container comes with many predefined timers for message transactions. A small delay in processing your requests, whether due to GC or other events, can cause these timers to expire and trigger transaction failure messages. So your first step is selecting the right garbage collector. The best suited collector for SIP-style applications is Concurrent Mark Sweep (CMS). CMS is a low-pause collector, and it can be tuned so that GC pause times stay in the range of 30ms to 50ms. Controlling the GC pause duration is very important for converged applications; otherwise you will see distortions in your video applications and hear them in your voice applications. The pause duration can be minimized by tuning CMS for a controlled time budget, depending on the requirements.
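Before tuning anything, it helps to confirm which collectors the JVM is actually running. The following is a minimal sketch using the standard `java.lang.management` API; the class name `CollectorCheck` and the helper `describeCollectors` are made up for this illustration and are not part of SailFin. On HotSpot JVMs that still ship CMS (it is enabled with `-XX:+UseConcMarkSweepGC`), the collectors are typically reported under names such as ParNew and ConcurrentMarkSweep, but treat the exact names as an assumption to verify on your own JVM.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class CollectorCheck {

    // Returns one summary line per collector registered in the running JVM.
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", total time (ms)=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Collector names depend on the JVM version and the GC flags in use.
        for (String line : describeCollectors()) {
            System.out.println(line);
        }
    }
}
```

Running this inside the application server's JVM (or against it over JMX) is a quick sanity check that the GC flags you passed actually took effect.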

     We ran the test under a very high load, simulating the busy hour for over 5 hours, and measured the GC pause durations under that load. The following graph shows the controlled and more predictable GC pause times for an INVITE Proxy test.

The graph shown here plots every GC pause duration in milliseconds on the Y-axis against the running time of the system on the X-axis. As you can see, the GC pauses are well controlled, with only a few spikes. I talk a little about these spikes in the coming section and give tips on how to minimize them. Exploring this randomness in the system is also part of our future work.
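A plot like this can be backed up with a few summary numbers checked against the pause budget. The sketch below is only an illustration: the class `PauseBudget`, its methods, and the sample pause values are hypothetical and are not the measurements from the test described above. It reports the largest pause, a nearest-rank percentile, and how many pauses exceeded the budget.

```java
import java.util.Arrays;

public class PauseBudget {

    // Largest pause in the sample, in milliseconds (0.0 for an empty sample).
    static double max(double[] pausesMs) {
        return Arrays.stream(pausesMs).max().orElse(0.0);
    }

    // Number of pauses strictly longer than the budget.
    static long countOver(double[] pausesMs, double budgetMs) {
        return Arrays.stream(pausesMs).filter(p -> p > budgetMs).count();
    }

    // Nearest-rank percentile (0 < pct <= 100); the sample must be non-empty.
    static double percentile(double[] pausesMs, double pct) {
        double[] sorted = pausesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical pause durations in ms, with one spike.
        double[] pauses = {32, 35, 41, 38, 47, 36, 120, 33, 44, 39};
        double budgetMs = 50; // upper end of the 30ms to 50ms range
        System.out.println("max = " + max(pauses));
        System.out.println("p90 = " + percentile(pauses, 90));
        System.out.println("over budget = " + countOver(pauses, budgetMs));
    }
}
```

Looking at a high percentile alongside the maximum separates the occasional spike from the typical pause, which is the distinction the graph makes visually.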

There are two parts to GC. The first is the young generation collection, where the scavenges are stop-the-world and halt the application threads. The second is the old generation collection, which runs in parallel with the application threads, though not completely: certain phases of a CMS collection (the initial mark and the remark) are stop-the-world. If you look at the CMS phases in the GC log, the phases marked with the word 'concurrent' execute concurrently with the application threads, and the phases not marked with that word do not. CMS collections also happen between scavenges, i.e. between any two consecutive young generation scavenges. If you are curious about the CMS internals, refer to the GC documentation. In the next part, we will see the tunings required for CMS to get the predictability and level of control we want from the GC pauses.
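The rule of thumb for reading CMS phases in the GC log can be sketched in a few lines. This is a rough illustration, assuming the phase tokens used by HotSpot's classic GC log output (`CMS-concurrent-...` for concurrent phases, `CMS-initial-mark` and `CMS-remark` for the stop-the-world ones); the class `CmsPhase` is hypothetical, and the inputs are bare phase tokens rather than full log lines, since the exact line layout varies with JVM version and logging flags.

```java
public class CmsPhase {

    enum Kind { CONCURRENT, STOP_THE_WORLD, OTHER }

    // Classifies a GC log line by the CMS phase token it contains.
    static Kind classify(String logLine) {
        if (logLine.contains("CMS-concurrent-")) {
            return Kind.CONCURRENT;      // runs alongside application threads
        }
        if (logLine.contains("CMS-initial-mark") || logLine.contains("CMS-remark")) {
            return Kind.STOP_THE_WORLD;  // application threads are paused
        }
        return Kind.OTHER;               // e.g. young generation scavenges
    }

    public static void main(String[] args) {
        // Phase tokens only; real log lines carry sizes and timings as well.
        String[] samples = {
            "CMS-initial-mark",
            "CMS-concurrent-mark",
            "CMS-remark",
            "CMS-concurrent-sweep",
            "ParNew"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Filtering a log down to the stop-the-world entries gives the set of old generation pauses that count against the time budget, alongside the young generation scavenges.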


