Monday Mar 09, 2009

memory leaks in "rgmd -z global" process

A memory leak occurs in the "rgmd -z global" process on Sun Cluster 3.2 1/09 Update2. The global zone instance of the rgmd process leaks memory in most situations, for example when running "scstat", "cluster show", or other basic commands. The problem is severe: the rgmd heap keeps growing until the Sun Cluster node crashes.

The issue only happens if one of the following Sun Cluster core patches is installed:
126106-27 or -29 or -30 Sun Cluster 3.2: CORE patch for Solaris 10
126107-28 or -30 or -31 Sun Cluster 3.2: CORE patch for Solaris 10_x86
Because these patches are also part of the Sun Cluster 3.2 1/09 Update2 release, the issue also occurs on freshly installed Sun Cluster 3.2 1/09 Update2 systems.
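
The revision check can be scripted. The following is only a sketch: is_affected_patch is a hypothetical helper (not a Sun Cluster command) that encodes the affected revisions listed above, and the commented usage assumes that "showrev -p" prints the patch ID in the second field of each line.

```shell
# Return 0 if the given patch ID is one of the affected revisions
# named in this post, 1 otherwise.
# 126106 = Solaris 10 (SPARC), 126107 = Solaris 10_x86.
is_affected_patch() {
    case "$1" in
        126106-27|126106-29|126106-30) return 0 ;;
        126107-28|126107-30|126107-31) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible usage on a cluster node (assumption: patch ID is field 2):
#   showrev -p | awk '{ print $2 }' | grep '^12610[67]-' |
#   while read -r p; do
#       is_affected_patch "$p" && echo "affected: $p"
#   done
```

Revisions that are not in the list, such as -19, are reported as not affected.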

The problem can be observed as follows:
Monitor the growth of the memory allocation with prstat, pmap, or similar tools:
# prstat
3942 root 61M 11M sleep 101 - 0:00:02 0.7% rgmd/41
Some time later the increase in memory allocation is visible:
3942 root 61M 20M sleep 101 - 0:01:15 0.7% rgmd/41
or
# pmap -x <pid_of_rgmd-z_global> | grep heap
00022000 47648 6992 6984 - rwx-- [ heap ]
Some time later the increase in memory allocation is visible:
00022000 47648 15360 15352 - rwx-- [ heap ]
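
To compare two samples without reading the numbers by eye, the heap line can be parsed. This is a minimal sketch, assuming the "pmap -x" column layout shown above, where the third field is the RSS in KB. heap_rss_kb and heap_growth_kb are hypothetical helper names.

```shell
# Read `pmap -x` output on stdin and print the summed RSS (KB)
# of all [ heap ] segments. Assumes RSS is the third column.
heap_rss_kb() {
    awk '/\[ heap \]/ { sum += $3 } END { print sum + 0 }'
}

# Print the difference NEW - OLD between two RSS samples (KB).
heap_growth_kb() {
    echo $(( $2 - $1 ))
}

# Possible usage (the pid of the "rgmd -z global" process must be known):
#   old=$(pmap -x "$pid" | heap_rss_kb)
#   sleep 600
#   new=$(pmap -x "$pid" | heap_rss_kb)
#   echo "heap grew by $(heap_growth_kb "$old" "$new") KB"
```

With the two sample lines above, the heap RSS goes from 6992 KB to 15360 KB, a growth of 8368 KB.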

When the memory is exhausted, the Sun Cluster node panics with the following messages:
Feb 25 07:59:23 node1 RGMD[1843]: [ID 381173 daemon.error] RGM: Could not allocate 1024 bytes; node is out of swap space; aborting node.
...
Feb 25 08:10:05 node1 cl_dlpitrans: [ID 624622 kern.notice] Notifying cluster that this node is panicking
Feb 25 08:10:05 node1 unix: [ID 836849 kern.notice]
Feb 25 08:10:05 node1 \^Mpanic[cpu0]/thread=2a100047ca0:
Feb 25 08:10:05 node1 unix: [ID 562397 kern.notice] Failfast: Aborting zone "global" (zone ID 0) because "globalrgmd" died 30 seconds ago.
Feb 25 08:10:06 node1 unix: [ID 100000 kern.notice]
...

Update 20.Mar.2009:
Available now:
Alert 1020253.1 Memory Leak in the "rgmd" Process of Solaris Cluster 3.2 may Cause a failfast Panic

Update 17.Jun.2009:
The -33 revision of the Sun Cluster core patch is the first released revision that fixes this issue.
126106-33 Sun Cluster 3.2: CORE patch for Solaris 10
126107-33 Sun Cluster 3.2: CORE patch for Solaris 10_x86


Workaround: Use the previous revision -19 to avoid the issue.
126106-19 Sun Cluster 3.2: CORE patch for Solaris 10
126107-19 Sun Cluster 3.2: CORE patch for Solaris 10_x86

The issue is reported in bug 6808508 (description: scalable services coredump during the failover due to network failure). A fix is in progress. This blog will be updated when the fix is available.

Wednesday Aug 22, 2007

Memory leak in scdpmd

The scdpmd (Sun Cluster disk path monitor daemon) has a memory leak when the reboot_on_path_failure flag is enabled. This is a known issue, reported in bug 6563949, which will be fixed soon. The workaround is to keep reboot_on_path_failure at its default setting, which is disabled. Only Sun Cluster 3.2 is affected because the flag is a new feature of Sun Cluster 3.2.
For details, see Administering Disk-Path Monitoring.
Update 8.Feb.2008: Bug 6563949 is now fixed in the following patches:
126106-04 Sun Cluster 3.2: CORE patch for Solaris 10
126107-04 Sun Cluster 3.2: CORE patch for Solaris 10_x86
126105-04 Sun Cluster 3.2: CORE patch for Solaris 9
Update 30.Jun.2008: Bug 6682663 can prevent the reboot. This is fixed in revision -15 of the Sun Cluster 3.2 CORE patches mentioned above.



How to identify if reboot_on_path_failure is enabled?

t2000d# scdpm -p all:all
t2000e:reboot_on_path_failure enabled
t2000e:/dev/did/rdsk/d1 Ok
t2000e:/dev/did/rdsk/d2 Ok
t2000e:/dev/did/rdsk/d4 Ok
t2000e:/dev/did/rdsk/d5 Ok
t2000e:/dev/did/rdsk/d6 Ok
t2000e:/dev/did/rdsk/d7 Ok
t2000d:reboot_on_path_failure enabled
t2000d:/dev/did/rdsk/d10 Ok
t2000d:/dev/did/rdsk/d11 Ok
t2000d:/dev/did/rdsk/d13 Ok
t2000d:/dev/did/rdsk/d14 Ok
t2000d:/dev/did/rdsk/d6 Ok
t2000d:/dev/did/rdsk/d7 Ok
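
On clusters with many disk paths, the flag lines can be filtered out of this output. The following sketch assumes the "node:reboot_on_path_failure enabled" line format shown above. nodes_with_reboot_enabled is a hypothetical helper name.

```shell
# Read `scdpm -p all:all` output on stdin and print the name of
# every node whose reboot_on_path_failure flag is enabled.
nodes_with_reboot_enabled() {
    awk -F: '$2 ~ /^reboot_on_path_failure[ \t]+enabled[ \t]*$/ { print $1 }'
}

# Possible usage on a cluster node:
#   scdpm -p all:all | nodes_with_reboot_enabled
```

The disk path lines and nodes with the flag disabled produce no output.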



How to find out if scdpmd consumes too much memory?

t2000d# ps -ef | grep scdpmd
root 5355 1 0 Aug 20 ? 390:26 /usr/cluster/lib/sc/scdpmd
t2000d# prstat
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
5355 root 952M 8520K sleep 59 0 6:29:59 4.4% scdpmd/14
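
To compare the SIZE column against a limit in a script, the K/M/G suffix has to be converted first. This is a rough sketch only: size_to_kb is a hypothetical helper, and it assumes that prstat reports sizes as a number followed by K, M, or G, as in the output above.

```shell
# Convert a prstat size such as 952M or 8520K to KB.
# Prints nothing and returns 1 for an unknown suffix.
size_to_kb() {
    echo "$1" | awk '{
        n = $0 + 0                      # leading numeric part
        u = substr($0, length($0))      # last character = unit
        if (u == "K")      f = 1
        else if (u == "M") f = 1024
        else if (u == "G") f = 1024 * 1024
        else exit 1
        printf "%d\n", n * f
    }'
}

# Possible usage: warn if SIZE exceeds an arbitrary example limit of 512 MB.
#   [ "$(size_to_kb 952M)" -gt $(( 512 * 1024 )) ] && echo "scdpmd is large"
```

For the scdpmd process above, 952M corresponds to 974848 KB of virtual size, while the resident set is only 8520 KB.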



How to disable the reboot_on_path_failure flag?

t2000d# clnode set -p reboot_on_path_failure=disabled t2000d t2000e
t2000d# scdpm -p all:all
t2000e:reboot_on_path_failure disabled
t2000e:/dev/did/rdsk/d1 Ok
t2000e:/dev/did/rdsk/d2 Ok
t2000e:/dev/did/rdsk/d4 Ok
t2000e:/dev/did/rdsk/d5 Ok
t2000e:/dev/did/rdsk/d6 Ok
t2000e:/dev/did/rdsk/d7 Ok
t2000d:reboot_on_path_failure disabled
t2000d:/dev/did/rdsk/d10 Ok
t2000d:/dev/did/rdsk/d11 Ok
t2000d:/dev/did/rdsk/d13 Ok
t2000d:/dev/did/rdsk/d14 Ok
t2000d:/dev/did/rdsk/d6 Ok
t2000d:/dev/did/rdsk/d7 Ok



How to restart the scdpm service to free the leaked memory?

t2000d# svcadm restart svc:/system/cluster/scdpm:default


Additional information and best practices are available in
Infodoc 1004119.1: Sun[TM] Cluster 3.2 Disk Path Monitoring and how to test for losing path to storage

About

I'm still mostly blogging about Solaris Cluster and support, regardless of whether it is for Sun Microsystems or Oracle. :-)
