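Case studies from front-line database engineers, introductions to new features, diagnostic tools and methods, and commonly used test cases. Welcome to the official Oracle Database Chinese technical support microblog.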

Hardware Changes Cause osysmond.bin to Core Dump

1. Problem Description

Version: 11.2.0.3, Clusterware, AIX platform

The OS log shows that osysmond.bin produced a core dump:

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
6
USER'S PROCESS ID:
2687534
FILE SYSTEM SERIAL NUMBER
1
INODE NUMBER
2
CORE FILE NAME
//core
PROGRAM NAME
osysmond.bin
STACK EXECUTION DISABLED
0
COME FROM ADDRESS REGISTER
??
PROCESSOR ID
hw_fru_id: 4
hw_cpu_id: 38

ADDITIONAL INFORMATION
pthread_k B0
??
_p_raise 48
sclssutl_ 88
??

Symptom Data
REPORTABLE
1
INTERNAL ERROR
0
SYMPTOM CODE
PCSS/SPI2 FLDS/osysmond. SIG/6 FLDS/pthread_k VALU/b0
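In a report like the one above, the two fields that matter most for triage are PROGRAM NAME and SIGNAL NUMBER (signal 6 is SIGABRT on AIX and most Unix systems). The following is only a minimal sketch of pulling those fields out of a saved copy of the report. The helper name `errpt_field` is hypothetical, and it assumes each value sits on the first non-blank line after its label, as in the excerpt.

```shell
# Hypothetical helper: print the first non-blank line that follows
# an exact label line in errpt-style text read from stdin.
errpt_field() {
  awk -v label="$1" '
    { line = $0; gsub(/^[ \t]+|[ \t]+$/, "", line) }   # trim whitespace
    found && line != "" { print line; exit }            # value line
    line == label { found = 1 }                         # label line seen
  '
}

# Example with an inline excerpt (values copied from the report above).
report='SIGNAL NUMBER
6
PROGRAM NAME
osysmond.bin'

sig=$(printf '%s\n' "$report" | errpt_field 'SIGNAL NUMBER')
prog=$(printf '%s\n' "$report" | errpt_field 'PROGRAM NAME')

# kill -l <number> asks the shell for the signal name.
echo "$prog received signal $sig ($(kill -l "$sig"))"
```

Against a real report, the text would be fed in from a file instead of the inline variable.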

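2. Analysis

osysmond.bin is a component of Cluster Health Monitor (CHM). It monitors and collects operating-system-level statistics and sends them to ologgerd, which records them. For more on CHM, see:

11gR2 New Feature: Introduction to Oracle Cluster Health Monitor (CHM)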

https://blogs.oracle.com/Database4CN/entry/11gr2_%E6%96%B0%E7%89%B9%E6%80%A7_oracle_cluster_health

The crf logs (<GRID_HOME>/log/<nodename>/crfmond/*) contain the following:

<<<crfmond.log

2016-01-12 09:28:28.806: [    CRFM][1031]crfm_answer:gipcWait failed with 1

2016-01-12 09:28:28.806: [    CRFM][1031]crfmctx dump follows

2016-01-12 09:28:28.806: [    CRFM][1031]****************************

2016-01-12 09:28:28.806: [    CRFM][1031]crfm_dumpctx: connection local name: ipc://sbr0n03gridcorpnewCRFM_SIPC

2016-01-12 09:28:28.806: [    CRFM][1031]crfm_dumpctx: connection peer name: 

2016-01-12 09:28:28.806: [    CRFM][1031]crfm_dumpctx: connaddr:  ipc://sbr0n03gridcorpnewCRFM_SIPC

2016-01-12 09:28:28.806: [    CRFM][1031]crfm_dumpctx: ctype:  1

2016-01-12 09:28:28.806: [    CRFM][1031]crfm_dumpctx: mytype:  1

2016-01-12 09:28:28.806: [    CRFM][1031]crfm_dumpctx: hostname  br0n03

2016-01-12 09:28:28.806: [    CRFM][1031]crfm_dumpctx: myport:  0

2016-01-12 09:28:28.806: [    CRFM][1031]crfm_dumpctx: rhostname 

2016-01-12 09:28:28.806: [    CRFM][1031]crfm_dumpctx: rport: 

2016-01-12 09:28:28.806: [    CRFM][1031]crfm_dumpctx: flags:  5

2016-01-12 09:28:28.806: [    CRFM][1031]****************************

2016-01-12 09:28:33.808: [    CRFM][1031]SYNC crfm_send: send fail  datasize 118 sbuf 163acba90, msglen 154, olen 0, ret 1

2016-01-12 09:28:33.808: [ CRFMOND][1031]ipcmsghdlr: crfm_send failed

2016-01-12 09:28:48.797: [    CRFM][1031]SYNC crfm_send: send fail  datasize 118 sbuf 164361d10, msglen 154, olen 0, ret 1

2016-01-12 09:28:48.798: [ CRFMOND][1031]ipcmsghdlr: crfm_send failed

2016-01-12 09:28:58.803: [ CRFMOND][1031]crfmond_conn_cbkipc: lstner failed wait for message from client 5

2016-01-12 09:29:28.805: [    CRFM][1031]crfm_answer: Wait failed2 timeout 2147483645. GIPC return values is 1

<<<crfmondOUT.log

2016-01-12 09:33:59 

Dumping crfmond stack trace

2016-01-12 09:33:59 

----- Call Stack Trace -----

    sclssutl_sigdump <- 440 <- sclssutl_signalhand <- ler <- 48b4

       <- ler <- crfmond_compute_dis <- k_stats <- crfmond_compute_sta <- crfmond_loop

        <- ts <- clsmond_main <- main <- start

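The call stack matches Bug 14703646 - OSYSMOND.BIN DUMP CORE (Suspended, Req'd Info not Avail). According to the bug's internal description, this is default behavior: after a hardware change (for example, disk expansion or adding CPUs), CHM cannot recognize the new hardware online, and that leads to the core dump.

The customer confirmed that the core dump did appear after a hardware change.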
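To check whether another node shows the same pattern, one option is to count the two signatures seen in this case in the crfmond logs: the `crfm_send failed` message and the `sclssutl_sigdump` frame in the stack dump. This is only a rough sketch. The function name `crf_signature_counts` is hypothetical, and it assumes the log files are named `crfmond.log` and `crfmondOUT.log` as in the excerpts above.

```shell
# Hypothetical helper: count lines matching the signatures from this case.
# $1 = directory holding crfmond.log and crfmondOUT.log
crf_signature_counts() (
  dir=$1
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the
  # function usable under "set -e".
  send_fail=$(cat "$dir"/crfmond*.log 2>/dev/null | grep -c 'crfm_send failed' || true)
  sigdump=$(cat "$dir"/crfmond*.log 2>/dev/null | grep -c 'sclssutl_sigdump' || true)
  echo "crfm_send_failed=$send_fail sclssutl_sigdump=$sigdump"
)

# Usage (substitute the real values for the placeholders):
#   crf_signature_counts "<GRID_HOME>/log/<nodename>/crfmond"
```

Non-zero counts only indicate that the log text is similar. Whether the cause is the same still has to be confirmed against the hardware change history.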

3. Solution

Restart the ora.crf resource so that CHM picks up the hardware change:

$GRID_HOME/bin/crsctl stop res ora.crf -init

$GRID_HOME/bin/crsctl start res ora.crf -init
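The two commands above can be wrapped so that the start runs only if the stop succeeds, followed by a status check with `crsctl stat res`. This is a sketch, not a tested procedure. The function name `restart_crf` and the `CRSCTL` override variable are assumptions added here so the sequence can be dry-run with `echo` before touching a real cluster.

```shell
# Hypothetical wrapper around the restart sequence.
# CRSCTL can be set to "echo" for a dry run; otherwise the crsctl
# binary under $GRID_HOME is used.
restart_crf() {
  crsctl=${CRSCTL:-$GRID_HOME/bin/crsctl}
  "$crsctl" stop res ora.crf -init &&
  "$crsctl" start res ora.crf -init &&
  "$crsctl" stat res ora.crf -init   # show the resource state afterwards
}

# Dry run: print the commands instead of executing them.
CRSCTL=echo restart_crf
```

Chaining with `&&` stops at the first failing step, so a failed stop is not followed by a start.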

Alternatively, if you are sure CHM monitoring is not needed, consider disabling the ora.crf resource:

Disable:

#<GRID_HOME>/bin/crsctl modify resource ora.crf -attr "AUTO_START=never" -init

Enable:

#<GRID_HOME>/bin/crsctl modify resource ora.crf -attr "AUTO_START=always" -init
