By timatworkhomeandinbetween on Jul 29, 2008
The customer's brand-new machines were having problems with a JVM stalling in garbage collection. The JVM was configured with a 3.1 Gbyte heap, but after a few hours of running, the application would grind to a halt; the garbage collector stats showed collections taking an increasing amount of time, up into the tens of seconds.
prstat -m 1 on the box with the application showed the JVM using a lot of user and system time. The system time was odd, but more telling, at the top of the active processes was rcapd. A quick look at /etc/project showed that someone had set a resource control capping the RSS of the user.root project at 1.7 Gbytes.
The line looked something like..
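The original entry isn't reproduced here, but a generic /etc/project entry imposing such a cap would take roughly this form (the project id shown is the Solaris default for user.root, and the byte value is simply 1.7 Gbytes expressed in bytes; both are my illustration, not the customer's actual line):

```
user.root:1::::rcap.max-rss=1825361100
```

The fields are project name, project id, comment, user list, group list, and attributes; rcap.max-rss is the attribute rcapd reads for its per-project RSS cap.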
So that was bad. Every time the processes belonging to root exceeded a combined RSS of 1.7 Gbytes, rcapd would pick on them, page out their pages, and then set them running again. For the JVM, as soon as the garbage collector had to manage a heap bigger than 1.7 Gbytes, scanning the objects would page all those heap pages back in, and rcapd would promptly page them all back out. The garbage collector was now limited by the speed of the swap device and was fighting a losing battle.
iostat -xnz 1 showed hundreds of Mbytes/sec going out to swap and then hundreds of Mbytes/sec coming back in from swap.
vmstat -p 1 just showed a massive number of page-ins.
rcapd was disabled using rcapadm -D and the new machines flew along. I did some investigations back in the office using a small test case that malloc'ed a large array and then looped around touching its pages; it was crippled as soon as the array was much bigger than the cap. Interestingly, since rcapd is procfs based, a procfs command like pfiles or truss would sometimes either fail with "the process is traced" or would stop rcapd getting at the process's proc files, so the process would escape capping for a short while...