"My Customer" has a ADG db that is used for reporting. Upgraded from 11gR2 to 12c.
Some queries fail with an ORA-29771 process USER (OSID 335621) blocks LGWR (OSID 314841) for more than 70 seconds
The underlying problem has been there forever, it was a time bomb. here is why
First what is the same in 11g and 12c
A SQL when it parses grabs a shared lock in DBINSTANCE state object.
When the SCN is advanced by the recovery in an ADG the LGWR locks the DBINSTANCE state object in Exclusive mode.
As a consequence the LGWR can block the parsing of a SQL which is bad but the parsing of a SQL can block the LGWR which is really bad.
and here comes Bug 17018214 - ORA-600 [krdrsb_end_qscn_2] ORA-4021 in Active Dataguard Standby Database with fix for bug 16717701 present - Instance may crash (Doc ID 17018214.8)
Before the fix for bug 17018214 (pre-220.127.116.11), when a query blocked the LGWR for so long an ora-600 use to be raised and the instance crashed.
Now with the fix for bug 17018214 (18.104.22.168 & 22.214.171.124) then the query is aborted with the ORA-29771.
So the longer parsing the longer the chances to raise the ora-29771.
Then there are 2 strategies to prevent the error.
A) Prevent the timeout
B) Reduce the parsing time of the SQL
For A) The fix adds the parameter _adg_parselock_timeout
The value of the event encodes two things:
For B) there are as many of reasons why the parsing is slow as SQLs.
Now comes 12c
For A) "My Customer" tried 550 ms but still the error is raised.
There are other recommendations in note:17018214.8 like
LGWR will wait for up to 30 seconds to acquire the X-lock.
If it fails, then it will retry after sleeping for 100 milliseconds.
value = (S * 1048576) + T = (100 * 1048576) + 30000 = 104887600
where S = sleep duration in milliseconds
T = timeout in milliseconds
but still it is possible a SQL would parse longer than any timeout.
For B) The parsing delay was caused by dynamic sampling queries.
The DS queries are doing single block reads.
I have seen this many times so I had a hunch that many of these single block reads were over BitMap Blocks (BMB) and checked the segments for out-of-sync HWM using
Note:726653.1 Dbms_Space_Admin.Assm_Segment_Synchwm Definition
and there it was.
"My Customer" fixed the HWM following Note:726653.1 and the parsing was fast and no more error but still it was not enough as they keep doing loads and the error eventually came back.
It is easier to identify if the blocks are being read are bitmap blocks if they are of class 8,9,10,11, or 12 listed (usually p3) in the waitevent that read it like "gc current block 2-way" or
WAITEVENT: "read by other session" Reference Note (Doc ID 732891.1)
So we are back to 11g
It is expected behavior *in all post-9i RDBMS versions* that the Low and High HWMs to become separated due to direct-data loads,
used to improve performance.
This introduces unformatted blocks between the Low and High HWMs for a segment.
A Table scan on ASSM segments can be slowed if there are a lot of unformatted blocks between the Low and High HWMs for a segment.
When Low and High HWMs are not in sync, then it becomes necessary to build a more detailed extent map at SELECT time,
which could increase the number of CR gets on L1/L2/L3 blocks by a factor of how many extents are between the HWMs.
As unformatted blocks continue to grow and grow, due to the continued direct loads
this can cause the fetch for these blocks to start to run too long, which can result in LGWR blocks.
And here comes Bug 4067168 - Excessive CR blocks / Slow table scan for ASSM segments with UNIFORM extent allocation (DocID: 4067168.8)
The enhancement of bug 4067168 (covered in this document Note:4067168.8 for all post-9i RDBMS versions),
introduces code to keep the high and low HWMs closer in sync for direct load operations:
The problem with "My Customer" is this ratio.
The larger the table the larger the number of blocks that percentage represents and eventually reaches the threshold where
it is too much burden to read the BMBs and the data blocks individually.
That is why for "My Customer" the problem did not happened immediately that they upgrade to 12c.
It took time while the table partitions kept growing.
it is not a common practice to set because for most customers that percentage is small enough that the cost of reading
the Bit Map Blocks and data blocks has not reached a level that they even notice or they do other DMLs that format the blocks later.
This enhancement also adds a procedure DBMS_SPACE_ADMIN.ASSM_SEGMENT_SYNCHWM to help check for the HWMs that might be out of sync,
and fixes them.
When fixing, it is recommended to set _HWM_SYNC_THRESHOLD=100, at the session level, to guarantee complete sync.
In 11g mostly the user queries where slow because the constructed the extent map during execution time
but if a query used Dynamic Sampling they were sure to have the same behavior causing slow parsing and the most extreme cases
raise the ORA-29771.
"My Customer" did not came aware of the out-of-sync in 11g because the error was not raised. They only worried that the queries ran.
So back again to 12c
In 12c with the explosion of dynamic sampling queries increase the likelihood of encountering the HWM issue during parse time.
For customers using data guard the Slower parse causes the ORA-29771.
For most customers using non-dataguard DBs this issue only manifest as slow parsing.
and the most common answer from support is disable dynamic sampling and transferring the problem to execution time
as the customer most probably had in 11g but as long as the performance is restored to 11g level no ones digs deeper.
Solutions to avoid the error ORA-29771 then are (in order of strategy and my preference)
On A) Prevent the timeout
On B) Reduce the parsing time of the SQL