memory leaks in "rgmd -z global" process

A memory leak occurs in the "rgmd -z global" process on Sun Cluster 3.2 1/09 Update2. The global zone instance of the rgmd process leaks memory in most situations such as "scstat" or "cluster show" and other basic commands. The problem is severe and the rgmd heap grows to a large size and crashes the Sun Cluster node.

The issue only happen if one of the following Sun Cluster core patches are active.
126106-27 or -29 or -30 Sun Cluster 3.2: CORE patch for Solaris 10
126107-28 or -30 or -31 Sun Cluster 3.2: CORE patch for Solaris 10_x86
Due to the fact that this patches are also part of the Sun Cluster 3.2 1/09 Update2 release the issue occur also on fresh installed Sun Cluster 3.2 1/09 Update2 systems.

The error can look as follows:
Analyze the grow of memory allocation with (or similar tools)
# prstat
3942 root 61M 11M sleep 101 - 0:00:02 0.7% rgmd/41
sometime later the increase of the memory allocation is visible.
3942 root 61M 20M sleep 101 - 0:01:15 0.7% rgmd/41
or
# pmap -x <pid_of_rgmd-z_global> | grep heap
00022000 47648 6992 6984 - rwx-- [ heap ]
sometime later the increase of the memory allocation is visible.
00022000 47648 15360 15352 - rwx-- [ heap ]

When the memory is full the Sun Cluster node panics with the following message:
Feb 25 07:59:23 node1 RGMD[1843]: [ID 381173 daemon.error] RGM: Could not allocate 1024 bytes; node is out of swap space; aborting node.
...
Feb 25 08:10:05 node1 cl_dlpitrans: [ID 624622 kern.notice] Notifying cluster that this node is panicking
Feb 25 08:10:05 node1 unix: [ID 836849 kern.notice]
Feb 25 08:10:05 node1 \^Mpanic[cpu0]/thread=2a100047ca0:
Feb 25 08:10:05 node1 unix: [ID 562397 kern.notice] Failfast: Aborting zone "global" (zone ID 0) because "globalrgmd" died 30 seconds ago.
Feb 25 08:10:06 node1 unix: [ID 100000 kern.notice]
...

Update 20.Mar.2009:
Available now:
Alert 1020253.1 Memory Leak in the "rgmd" Process of Solaris Cluster 3.2 may Cause a failfast Panic

Update 17.Jun.2009:
The -33 revision of the Sun Cluster core patch is the first released version which fix this issue.
126106-33 Sun Cluster 3.2: CORE patch for Solaris 10
126107-33 Sun Cluster 3.2: CORE patch for Solaris 10_x86


Workaround: Use previous version -19 to prevent issue.
126106-19 Sun Cluster 3.2: CORE patch for Solaris 10
126107-19 Sun Cluster 3.2: CORE patch for Solaris 10_x86

The issue is reported in bug 6808508 (description: scalable services coredump during the failover due to network failure). A fix is in progress. This blog will be updated when the fix is available.

Comments:

This issue has been resolved and an IDR is available.
For more information check CR 6808508.

Posted by Harish Mallya on March 30, 2009 at 05:06 AM CEST #

Post a Comment:
  • HTML Syntax: NOT allowed
About

I'm still mostly blogging around Solaris Cluster and support. Independently if for Sun Microsystems or Oracle. :-)

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today