一、问题描述
Read only方式运行的Dataguard备库上以非常高的频率产生大量trace文件(2秒产生一个)。Trace文件内容基本一致:
*** 2016-02-25 09:59:48.318
*** SESSION ID:(1442.65013) 2016-02-25 09:59:48.318
*** CLIENT ID:() 2016-02-25 09:59:48.318
*** SERVICE NAME:() 2016-02-25 09:59:48.318
*** MODULE NAME:(sqlplus@xxxxx (TNS V1-V3)) 2016-02-25 09:59:48.318
*** ACTION NAME:() 2016-02-25 09:59:48.318
KQRCMT: Write failed with error=604 po=70000065bd6e4b8 cid=13
diagnostics : cid=13 hash=6e8b1016 flag=2a
Alert log中没有任何报错记录。
二、分析
1、从KQRCMT: Write failed with error=604 po=70000065bd6e4b8 cid=13来看,应该是kqrcmt函数尝试写入时报了error=604(ORA-00604)错误,从而联想到可能客户在只读备库上尝试了写操作,但与客户初步确认并没有这样的操作。而error=604本身也确定了是递归操作做的写入:
$ oerr ora 604
00604, 00000, “error occurred at recursive SQL level %s”
// *Cause: An error occurred while processing a recursive SQL statement
// (a statement applying to internal dictionary tables).
// *Action: If the situation described in the next error on the stack
// can be corrected, do so; otherwise contact Oracle Support.
建议客户设置如下event来打印出详细的trace信息:
ALTER SYSTEM SET event=’604 trace name errorstack level 3‘,’16000 trace name errorstack level 3’ SCOPE=SPFILE;
注:ORA-16000是dataguard只读模式不允许写入的错误号:
$ oerr ora 16000
16000, 00000, “database open for read-only access”
// *Cause: The database was opened for read-only access. Attempts to
// modify the database using DML or DDL statements generate this
// error.
// *Action: In order to modify the database, it must first be shut down and
// reopened for read/write access.
然而客户反馈该库比较重要,实在是不方便重启实例。
于是进一步想,有没有不用重启实例的办法呢?
后来在一个bug上找到了这种只对某一个函数的报错来做errorstack的方法,这种诊断方法应该在很多情况下都有用:
alter system set events ‘604 {callstack:fname=kqrcmt} errorstack(3)’;
这种方式设置的event在show parameter event中看不到,但设置之后,kqrcmt函数产生报错ORA-604时都会在trace里面打印出详细的errorstack信息。
2、很快客户上传了最新产生的errorstack trace,从trace中可以清楚地看到该问题是序列的fetch引起的seq$表update操作:
*** 2016-03-02 17:57:36.950
*** SESSION ID:(1767.55761) 2016-03-02 17:57:36.950
*** CLIENT ID:() 2016-03-02 17:57:36.950
*** SERVICE NAME:(SYS$USERS) 2016-03-02 17:57:36.950
*** MODULE NAME:(sqlplus@xxxxx (TNS V1-V3)) 2016-03-02 17:57:36.950
*** ACTION NAME:() 2016-03-02 17:57:36.950
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=3, mask=0x0)
—– Error Stack Dump —–
ORA-00604: error occurred at recursive SQL level 4
ORA-16000: database open for read-only access
—– Current SQL Statement for this session (sql_id=cna1abuq78rgp) —–
SELECT UPPER(NVL(PROGRAM, ‘null’)), UPPER(MODULE), TYPE, NEWSESS.NEXTVAL, SERIAL#, LOGON_TIME, DECODE(NVL(INSTR(PROCESS, ‘:’), 0), 0, NVL(PROCESS, 1234), SUBSTR(PROCESS, 1, INSTR(PROCESS, ‘:’) – 1)) FROM SYS.V_$SESSION WHERE SID = :B1 《====NEWSESS.NEXTVAL是序列fetch。
—– PL/SQL Stack —–
—– PL/SQL Call Stack —–
object line object
handle number name
7000001829ced38 871 package body xxxxx.TLOGON<====从名称上看是登录时执行的存储过程
7000001829ced38 1233 package body HZMCASSET.TLOGON
7000001829ced38 2372 package body HZMCASSET.TLOGON
700000189a5a150 1 anonymous block
7000001b8e8b2e0 5 anonymous block
—– Call Stack Trace —–
skdstdst <- ksedst1 <- ksedst <- dbkedDefDump <- ksedmp
<- dbkdaKsdActDriver <- dbgdaExecuteAction <- dbgdaRunAction <- dbgdRunActions <- dbgdProcessEventAct
<- ions <- dbgdChkEventKgErr <- dbkdChkEventRdbmsEr <- ksfpec <- dbgePostErrorKGE
<- 1176 <- dbkePostKGE_kgsf <- kgeade <- kgesev <- ksesec1
<- rpidrr <- kprball <- kqdsnu <- kqrcmt <- ktcCommitTxn<============================kqrcmt前一个函数是ktcCommitTxn,说明有事务提交
<- kdnwor <- kdnAllocN <- kdnnxt <- qersqPopulate <- qersqRowProcedure
<- qersoFetch <- qerjotFetch <- qersqFetch <- opifch2 <- opiefn0
<- opipls <- opiodr <- rpidrus <- skgmstack <- rpidru
<- rpiswu2 <- rpidrv <- psddr0 <- psdnal <- pevm_EXECC
<- pfrinstr_EXECC <- pfrrun_no_tool <- pfrrun <- plsql_run <- peicnt
<- kkxexe <- opiexe <- opipls <- opiodr <- rpidrus
<- skgmstack <- rpidru <- rpiswu2 <- rpidrv <- psddr0
<- psdnal <- pevm_EXIM <- pfrinstr_EXIM <- pfrrun_no_tool <- pfrrun
<- plsql_run <- peicnt <- kkxexe <- opiexe <- opiall0
<- opikpr <- opiodr <- rpidrus <- skgmstack <- rpidru
<- rpiswu2 <- kprball <- kktextrg <- kkttrex <- kktexeevt0
<- kpoauth <- opiodr <- ttcpip <- opitsk <- opiino
<- opiodr <- opidrv <- sou2o <- opimai_real <- ssthrdmain
<- main <- start
LibraryHandle: Address=7000001b8e2d248 Hash=9d1660bd LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
ObjectName: Name=update seq$ set increment$=:2,minvalue=:3,maxvalue=:4,cycle#=:5,order$=:6,cache=:7,highwater=:8,audit$=:9,flags=:10 where obj#=:1 《===序列fetch对seq$进行了update
FullHashValue=3d5d234d25a38dc4499e60c99d1660bd Namespace=SQL AREA(00) Type=CURSOR(00) Identifier=2635489469 OwnerIdn=0
Statistics: InvalidationCount=226 ExecutionCount=6670331 LoadCount=228 ActiveLocks=1 TotalLockCount=6670331 TotalPinCount=1
Counters: BrokenCount=1 RevocablePointer=1 KeepDependency=1 BucketInUse=51194 HandleInUse=51194 HandleReferenceCount=0
Concurrency: DependencyMutex=7000001b8e2d2f8(0, 1, 0, 0) Mutex=7000001b8e2d378(0, 26684282, 1, 0)
Flags=RON/PIN/TIM/PN0/DBN/[10012841]
WaitersLists:
Lock=7000001b8e2d2d8[7000001b8e2d2d8,7000001b8e2d2d8]
Pin=7000001b8e2d2b8[7000001b8e2d2b8,7000001b8e2d2b8]
LoadLock=7000001b8e2d330[7000001b8e2d330,7000001b8e2d330]
Timestamp: Current=11-21-2015 10:14:45
HandleReference: Address=7000001b8e2d468 Handle=0 Flags=[00]
LibraryObject: Address=70000018cfe30b0 HeapMask=0000-0001-0001-0000 Flags=EXS[0000] Flags2=[0000] PublicFlags=[0000]
ChildTable: size=’16’
Child: id=’0′ Table=70000018cfe3f60 Reference=70000018cfe3998 Handle=7000001b8f2e8f8
NamespaceDump:
Parent Cursor: sql_id=4m7m0t6fjcs5x parent=70000018cfe3150 maxchild=1 plk=y ppn=n
三、解决方案
备库的数据虽然来自主库,但由于备库工作在只读模式下,删掉序列NEWSESS显然不现实。
而由于问题出现在登录时执行的存储过程上,那么该存储过程的调用肯定是通过登录触发器来进行,所以禁用系统级触发器是最好的选择:
alter system set “_system_trig_enabled”=false;