1. Problem Description
Environment: Oracle Clusterware 11.2.0.3 on AIX.
The OS error log shows that osysmond.bin produced a core dump:
Probable Causes
SOFTWARE PROGRAM
User Causes
USER GENERATED SIGNAL
Recommended Actions
CORRECT THEN RETRY
Failure Causes
SOFTWARE PROGRAM
Recommended Actions
RERUN THE APPLICATION PROGRAM
IF PROBLEM PERSISTS THEN DO THE FOLLOWING
CONTACT APPROPRIATE SERVICE REPRESENTATIVE
Detail Data
SIGNAL NUMBER
6
USER'S PROCESS ID:
2687534
FILE SYSTEM SERIAL NUMBER
1
INODE NUMBER
2
CORE FILE NAME
//core
PROGRAM NAME
osysmond.bin
STACK EXECUTION DISABLED
0
COME FROM ADDRESS REGISTER
??
PROCESSOR ID
hw_fru_id: 4
hw_cpu_id: 38
ADDITIONAL INFORMATION
pthread_k B0
??
_p_raise 48
sclssutl_ 88
??
Symptom Data
REPORTABLE
1
INTERNAL ERROR
0
SYMPTOM CODE
PCSS/SPI2 FLDS/osysmond. SIG/6 FLDS/pthread_k VALU/b0
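When the error log holds many entries, it can help to condense each core dump to one line. The following is a minimal sketch, not part of the original case: the function name `summarize_core_dumps` is hypothetical, and it assumes the layout shown in the excerpt above, where each label sits on its own line, its value is on the next line, and every entry contains a "Symptom Data" section.

```shell
#!/bin/sh
# Hypothetical helper: condense errpt-style detail text (read on stdin)
# to one line per core dump. Assumes "label on one line, value on the
# next", as in the excerpt above.
summarize_core_dumps() {
    awk '
        want != ""          { val[want] = $0; want = ""; next }   # line after a label is its value
        /^SIGNAL NUMBER/    { want = "sig";  next }
        /^CORE FILE NAME/   { want = "core"; next }
        /^PROGRAM NAME/     { want = "prog"; next }
        /^Symptom Data/     {                                     # treated as end of one entry
            printf "%s signal=%s core=%s\n", val["prog"], val["sig"], val["core"]
            split("", val)                                        # portable way to clear the array
        }
    '
}
```

On AIX this would typically be fed from the detailed error report, for example `errpt -a | summarize_core_dumps`.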
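2. Analysis
osysmond.bin is a component of Cluster Health Monitor (CHM). It monitors and collects operating-system-level statistics and sends them to ologgerd, which records them. For background on CHM, see:
11gR2 New Feature: Introduction to Oracle Cluster Health Monitor (CHM)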
https://blogs.oracle.com/Database4CN/entry/11gr2_%E6%96%B0%E7%89%B9%E6%80%A7_oracle_cluster_health
The CRF logs (<GRID_HOME>/log/<nodename>/crfmond/*) contain the following messages:
<<<crfmond.log
2016-01-12 09:28:28.806: [ CRFM][1031]crfm_answer:gipcWait failed with 1
2016-01-12 09:28:28.806: [ CRFM][1031]crfmctx dump follows
2016-01-12 09:28:28.806: [ CRFM][1031]****************************
2016-01-12 09:28:28.806: [ CRFM][1031]crfm_dumpctx: connection local name: ipc://sbr0n03gridcorpnewCRFM_SIPC
2016-01-12 09:28:28.806: [ CRFM][1031]crfm_dumpctx: connection peer name:
2016-01-12 09:28:28.806: [ CRFM][1031]crfm_dumpctx: connaddr: ipc://sbr0n03gridcorpnewCRFM_SIPC
2016-01-12 09:28:28.806: [ CRFM][1031]crfm_dumpctx: ctype: 1
2016-01-12 09:28:28.806: [ CRFM][1031]crfm_dumpctx: mytype: 1
2016-01-12 09:28:28.806: [ CRFM][1031]crfm_dumpctx: hostname br0n03
2016-01-12 09:28:28.806: [ CRFM][1031]crfm_dumpctx: myport: 0
2016-01-12 09:28:28.806: [ CRFM][1031]crfm_dumpctx: rhostname
2016-01-12 09:28:28.806: [ CRFM][1031]crfm_dumpctx: rport:
2016-01-12 09:28:28.806: [ CRFM][1031]crfm_dumpctx: flags: 5
2016-01-12 09:28:28.806: [ CRFM][1031]****************************
2016-01-12 09:28:33.808: [ CRFM][1031]SYNC crfm_send: send fail datasize 118 sbuf 163acba90, msglen 154, olen 0, ret 1
2016-01-12 09:28:33.808: [ CRFMOND][1031]ipcmsghdlr: crfm_send failed
2016-01-12 09:28:48.797: [ CRFM][1031]SYNC crfm_send: send fail datasize 118 sbuf 164361d10, msglen 154, olen 0, ret 1
2016-01-12 09:28:48.798: [ CRFMOND][1031]ipcmsghdlr: crfm_send failed
2016-01-12 09:28:58.803: [ CRFMOND][1031]crfmond_conn_cbkipc: lstner failed wait for message from client 5
2016-01-12 09:29:28.805: [ CRFM][1031]crfm_answer: Wait failed2 timeout 2147483645. GIPC return values is 1
<<<crfmondOUT.log
2016-01-12 09:33:59
Dumping crfmond stack trace
2016-01-12 09:33:59
----- Call Stack Trace -----
sclssutl_sigdump <- 440 <- sclssutl_signalhand <- ler <- 48b4
<- ler <- crfmond_compute_dis <- k_stats <- crfmond_compute_sta <- crfmond_loop
<- ts <- clsmond_main <- main <- start
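To check whether another node shows the same symptoms, the two log files can be scanned for the messages quoted above. This is a rough sketch under stated assumptions: `scan_crfmond_logs` is a hypothetical helper, and the directory argument is expected to be the crfmond log directory mentioned earlier.

```shell
#!/bin/sh
# Hypothetical helper: count the symptoms quoted above in a crfmond log directory.
# $1 = log directory, e.g. <GRID_HOME>/log/<nodename>/crfmond
scan_crfmond_logs() {
    logdir=$1
    # Lines such as "ipcmsghdlr: crfm_send failed" in crfmond.log
    sends=$(cat "$logdir"/*.log 2>/dev/null | grep -c 'crfm_send failed')
    # "Dumping crfmond stack trace" marks a stack dump in crfmondOUT.log
    dumps=$(cat "$logdir"/*.log 2>/dev/null | grep -c 'Dumping crfmond stack trace')
    echo "crfm_send_failed=$sends stack_dumps=$dumps"
}
```

A non-zero `stack_dumps` count only shows that a dump occurred; the call stack itself still has to be compared by hand.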
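The call stack matches Bug 14703646 - OSYSMOND.BIN DUMP CORE (Suspended, Req'd Info not Avail). According to the internal description of that bug, this is default behavior: after a hardware change (for example, expanding disks or adding CPUs), CHM cannot detect the new hardware online, and this leads to the core dump.
The customer confirmed that the core dump did appear after a hardware change.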
3. Solution
Restart the ora.crf resource so that CHM picks up the hardware change:
$GRID_HOME/bin/crsctl stop res ora.crf -init
$GRID_HOME/bin/crsctl start res ora.crf -init
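The restart can be wrapped so that the resource state is printed afterwards. This is only a sketch: `restart_crf` and the `CRSCTL` override variable are hypothetical names introduced here, and the default path assumes `GRID_HOME` is set in the environment.

```shell
#!/bin/sh
# Hypothetical wrapper: restart ora.crf and show its state afterwards.
# CRSCTL may be set to override the crsctl path (an assumption for
# illustration); otherwise $GRID_HOME/bin/crsctl is used.
restart_crf() {
    crsctl_bin=${CRSCTL:-$GRID_HOME/bin/crsctl}
    # A failed stop is reported but not fatal: the resource may already be offline.
    "$crsctl_bin" stop res ora.crf -init || echo "stop returned non-zero" >&2
    "$crsctl_bin" start res ora.crf -init || return 1
    "$crsctl_bin" stat res ora.crf -init
}
```

After calling `restart_crf`, the output of the final `stat` command should report the resource as online again.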
Alternatively, consider disabling the ora.crf resource (only if you are sure CHM monitoring is not needed):
Disable:
# <GRID_HOME>/bin/crsctl modify resource ora.crf -attr "AUTO_START=never" -init
Enable:
# <GRID_HOME>/bin/crsctl modify resource ora.crf -attr "AUTO_START=always" -init
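To confirm which setting is currently in effect, the resource profile can be inspected. The sketch below assumes that `crsctl stat res ... -p` prints the profile as `NAME=value` lines; `crf_auto_start` and the `CRSCTL` override variable are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the current AUTO_START value of ora.crf.
# Assumes the profile is printed as NAME=value lines.
crf_auto_start() {
    crsctl_bin=${CRSCTL:-$GRID_HOME/bin/crsctl}
    "$crsctl_bin" stat res ora.crf -init -p | sed -n 's/^AUTO_START=//p'
}
```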