By Alan Hargreaves-Oracle on Aug 28, 2015
Last Friday I got pulled in to a very hot customer call.
The issue was best summarised as
Since migrating our WebLogic and database services from AIX to Solaris, at random times we are seeing the WebLogic server pause for a few minutes at a time. This takes down the back office and client services part of the business for these periods and is causing increasing frustration on the part of staff and customers. This only occurs when a particular module is enabled in WebLogic.
Before going on, I should note that there were other parts to this call which required option changes to WebLogic that also addressed major performance issues (such as the field iPads timing out talking to the WebLogic service) that were being seen, but it was these pauses that were the great concern to the customer.
I'd been given data from a GUDS run which initially made me concerned that we were having pstack(1M) run on the WebLogic Java service. Pstack stops the process while it walks all of the thread stacks, which could certainly have a nasty effect on accessibility to the service.
Unfortunately it was not to be that simple. The pstack collection was actually part of the data gathering process that the WebLogic folks were running. A great example of the Heisenberg effect while looking at a problem: the data gathering itself masked out the possibility of seeing anything else.
I should also mention that in order to keep business running the customer had disabled the particular module, so we were very limited in when we could get it enabled. Data gathering also required them to send someone out to site with an iPad (which was the field interface that seemed to be an enabler of the problem). We were pretty much getting one shot at data gathering in any given 24 hour period.
The next day we gathered data with the pstack command commented out.
This was a little more interesting; however, the issue was present for only a short time and we were gathering only short lockstat profiles, so it was hit and miss whether a profile coincided with the problem. I did notice that we seemed to be spending more time in page faulting than I would have expected (about 25% of available CPU at one point!), and about half of that time was being spent spinning on a cross-call mutex used to flush the newly mapped addresses from all other CPU caches.
With the data from the next day's run I also noticed that the kflt_user_evict() thread was fighting for the same mutex. My thought at this time was to disable that thread and, for good measure, also disable page coalescing by adding the following lines to /etc/system and rebooting:
set kflt_disable=1
set mpss_coalesce_disable=1
set pg_contig_disable=1
It still felt like we were looking at addressing symptoms but not the cause.
We had our breakthrough on Tuesday when we got the following update from the customer:
The iPad transaction which triggers this issue has label printing as part of the transaction. The application uses the Solaris print spooling mechanism to spool prints to a label printer. The application code spawns an lp process to do this spooling. The code used is something like:

Runtime.getRuntime().exec("lpr -P <destination> <file-name>");

We have observed that the CPU sys% spike behaviour seems not to occur once we have disabled this print spooling functionality. Are there any known issues in Oracle Java 1.7 with spawning multiple processes from within the JVM? Note that this functionality as such has always been working fine on an IBM JVM on AIX.
This suddenly made the page-faulting make sense.
On chasing up the Java folks, we learned that the default mechanism for this kind of operation is fork()/exec().
Now, fork will completely clone the address space of the parent process. This is what was causing all of the page-faults. The WebLogic Java process had a huge memory footprint and more than 600 threads.
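On Unix JVMs this fork()/exec() path is what sits underneath Runtime.exec() and its modern replacement, ProcessBuilder. Here is a minimal sketch of the spawning pattern, with echo standing in for the customer's lpr command (the class and method names are mine, for illustration):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class SpawnDemo {
    // Spawn a child process the way the application spawned lp/lpr, and
    // return the first line of its output. With the default launch
    // mechanism, each such call forks the entire parent address space
    // before exec() replaces it with the child image - on a JVM with a
    // huge footprint and 600+ threads, that is an expensive operation.
    static String runAndReadFirstLine(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line = r.readLine();
            p.waitFor();
            return line;
        }
    }

    public static void main(String[] args) throws Exception {
        // "echo" stands in for the real "lpr -P <destination> <file-name>" call.
        System.out.println(runAndReadFirstLine("echo", "label-job"));
    }
}
```

The spawn itself is cheap to write but not cheap to execute; it was the per-call address-space duplication, not the lp work, that generated the page-fault storm.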
Further discussion with the Java folks revealed that later Java versions have an option we could have them use to force Java to use posix_spawn() rather than fork()/exec(), which would stop the address-space duplication. The customer needed to start Java with that launch-mechanism option.
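For reference, on HotSpot JDKs the launch mechanism is selected via the jdk.lang.Process.launchMechanism system property (accepted value spellings vary between JDK releases, so verify against your JDK version); an invocation along these lines is what was needed:

```shell
# Force Runtime.exec()/ProcessBuilder to use posix_spawn() instead of
# fork()/exec(). The property name is from the JDK's Unix process code;
# the jar name here is purely a placeholder.
java -Djdk.lang.Process.launchMechanism=posix_spawn -jar app.jar
```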
They implemented this along with the other application changes and it looks to have been running acceptably now for a few days.
The most striking part of this call was that without all three support groups (WebLogic, Java and Solaris) looking at it together, it is a virtual certainty that we would not have gotten to root cause and found a solution.
Well done everyone.