An Application Tuning Case Study: The Deadly Deadlock
By gaurav.verma on Mar 01, 2008
Lets face it. Deadlock and Latching issues have always been hard to debug. Many a times, we don't focus too well on application tuning, even though, it always advisable to look at the application design and user access trends first. These things don't come to the mind too readily.
Many a times, we do face situations where the users are deadlocked, sitting and watching blue screens or the application was freezing up. While there are many reasons for this to happen: Poor OS implementation, (swapping/disk/Memory issues) , RDBMS bugs, RDBMS initialization parameter values, having a lot of non sharable SQL, a good share of cases can also be caused due to Poor application coding, without undertanding the sequence in which resources should be accessed or acted upon (updated/deleted etc).
This is one such case study in which a deadlock was traced back to a deadlock situation un-aware code and the solution was to enhance the code to not cause deadlocks in the future. While the change was really simple, picturizing the deadlock in action was really difficult.
This article also goes into the mechanics of contructing the deadlock situation in slow motion. This makes it an instructive case study to read through for future reference.
Getting the Bad Press...
The client had a big OLTP implementation of Order management (Order to cash business flow) 11i E Business suite and the main bread and butter was to take orders for their configurable products.
Many a times during any business day, the customer service representatives (~200 people) would get blue screens or get hung up, while trying to work on Orders. Getting frantic helpdesk calls for angry customer service reps was normal and obviously this was under the management's scanner since it was highly visible.
Customer service reps would keep making small talk to their clients who would call in to order chairs etc and eventually, the clients would give up on the wait. As a fallback of this, the client was irate (they were losing their customers) and frustrated at implementing order management.
Everybody in customer service would say: Why is Oracle so slow?
Scratching the surface...
This continued on for months and it was normal for several myths about bad oracle performance to float around. More so, no one had any idea what was happening. Someone blamed workflow, someone blamed the network and so on..
The customer had done a lot of customizations. Preliminary investigations revealed that often there was a deadlock detected between a concurrent program session and a forms user session.
As a corollary, the custom "Order Acknowledgment" program would sometimes hang inordinately for > half hour, whereas otherwise, it would "normally" finish in 1-2 mins.
Analyzing wait events (using v$session_wait) and 10046 event session tracing didnt lead anywhere. The pattern was not predictable too. It was really very puzzling behaviour.
Understanding the dead-lock in slow motion...
The "XX Sales Order Acknowledgement XML publishing" program used to run every minute. There had to be some co-relation between the programs code (table/object access/locking pattern) and the activity being done on Order entry forms.
After analyzing the code and seeing how underlying tables were being accessed or updated, It became clear that there was a problem in the "XX Sales Order Acknowledgement XML publishing" code.
The complete picture of the deadlock pattern was available now. Lets see it in slow motion here:
Deadlock Situation 1:
From the entire bunch of qualifying Order header records to be processed for Order Acknowledgements, a batch of 25 Orders were updated at a time. After the updation of 25 orders, a commit was issued, which then freed the locks.
Now, out of these 25 records, if "XX Sales Order Acknowledgement XML publishing" concurrent program (say A) was trying to update the first Order, and there were be a Customer Rep (say B) who was working on the very same order in OE screen, "A" HAD TO WAIT till B freed the record by either commit/rollback action in Order Entry form.
TILL THE TIME B freed the same record, "A" was not be able to proceed with the remaining 24 Orders in the batch and kept on waiting for the row level exclusive lock.
It eventually either timed out or acquired the lock on the row. Similarly, "A" was not be able to proceed till "B" was done with the updates on the Order Header in OEXOETEL (Order Entry) form.
The same applied for any other Order which A had to update. Thus the situation when a single run of "XX Sales Order Acknowledgement XML publishing" concurrent program would be taking 20 mins or so. During this time, one would witness a high Order Entry udpating activity at the Customer Support site.
Now, lets reverse the situation.
Deadlock Situation 2:
At anytime, there may be 20 (say) Customer Reps (say "B"), who are working on 25 or 30 orders. Therefore, in theory, around 25-20 Order Entry Headers rows may be locked in exclusive mode.
Now, every next minute, the "XX Sales Order Acknowledgement XML publishing" concurrent program was trying to update these very Order Entry Headers rows and was having a hard time waiting for the row level exclusive locks to get released.
When "B" saved its changes, the commit released the row level exclusive locks. Only then was "A" able to proceed with its job. The real time deadlocks seen during the day were caused by a mix of Deadlock situation 1 and 2.
Both the above situations were happening at the same time and vying for the same resource, the same row in the Order Entry headers table, to become free.
This was a vicious, endless cycle and caused locking grief for people working on Order Entry.
Breaking the cycle
Once the nature of the application design and sequence of accessing resources were understood, it was clear that the vicious cycle had to be broken.
It was best to make the custom program more intelligent by checking first if the Order header record was locked (can be done by using the SELECT ..... FOR UPDATE NOWAIT clause).
Any already locked record was skipped for update. The logic used was that the skipped Order header record would be picked up for processing by the custom program next time AFTER the custom service rep would finally commit or rollback on the Order Entry form (the program was running every minute, remember).
The problem was with the custom application design after analyzing the table/object access/locking pattern & understanding the code. Essentially, in this case, what was happening was that the Custom order acknowledgment program used to update a batch of 25 Order header records, in a sequential fashion, and only then issue a commit.
The update statement in culprit concurrent program never checked if the Order header record was already locked or not by another session and hence, as per default behavior, the update statement waited till the lock timeout happened -- this is a long time until NOWAIT clause is specified.
This could happen for any of the 25 Order header records it would try to update. When a customer service rep wanted to update the same Order which was being processed by the order acknowledgment program, the person would hang since the form session couldnt acquire a lock on the Order header record till the time a commit was issued by order acknowledgment program.
On the other hand, when a customer service rep would update an Order through the Order Entry screen, they would lock up an order for a good 10-20 mins or so (they kept doing stuff to it, while talking to the customer on the phone) and the custom order acknowledgment program would keep waiting for that exclusive lock on the order header record.
Again, this could happen for any of those 25 records in the batch, as per the design. So the deadlock was happening both ways.
And it was really subtle. NO amount of tracing or latch analysis or metalink research would have brought the troubleshooter to this stage of understanding the real issue.
Other Performance Related Articles