Oracle Support Master Note for Troubleshooting Managed Distributed Transactions (Doc ID 100664.1)
By user793088 on Nov 17, 2010
Master Note for Troubleshooting Oracle Managed Distributed Transactions (Doc ID 100664.1)
Copyright (c) 2010, Oracle Corporation. All Rights Reserved.
In this Document
What is being announced?
What do you need to do?
Oracle Distributed Transactions
Two-Phase Commit (2PC)
Recoverer Process (RECO)
V$DBLINK / GV$DBLINK
Testing Recovery of Failed Distributed Transactions
Failures in Distributed Transactions
Steps to Troubleshoot an In-Doubt Transaction
Purging the Views
SCN Recovery Steps
Who to contact for more information?
Oracle Server - Enterprise Edition - Version: 184.108.40.206 to 220.127.116.11 - Release: 9.2 to 11.2
Information in this document applies to any platform.
What is being announced?This master document discusses Oracle Managed Distributed Transactions; by this, we refer to transactions that span two or more databases and involve Oracle Databases only.
The information contained in this document targets DBAs involved in environments that use Oracle Distributed Transactions. Distributed database concepts and troubleshooting steps are included in the note and, where appropriate, references to additional notes and documentation which provide further information on the relevant topic. You can use this information when investigating and troubleshooting in-doubt transactions; a section on SCN based recovery steps is also included.
The following Oracle Documentation references discuss Oracle Distributed Transactions:
Oracle9i Database Administrator's Guide
Release 2 (9.2)
Part Number A96521-01
Part VI: Distributed Database Management
Oracle10g Database Administrator's Guide
10g Release 2 (10.2)
Part Number B14231-02
Part VII: Distributed Database Management
Oracle11gR1 Database Administrator's Guide
11g Release 1 (11.1)
Part Number B28310-04
Part V: Distributed Database Management
Oracle11gR2 Database Administrator's Guide
11g Release 2 (11.2)
Part Number E17120-05
Part V: Distributed Database Management
Note the difference between a distributed and a remote transaction; a remote transaction contains one or more DML statements that are executed on the SAME remote node, consider the examples below:
- Distributed transaction:insert into table@remotesite;insert into mytable: --local tablecommit;-Remote transaction:insert into table@remotesite;commit;
Database links are used to communicate between the databases performing distributed transactions. Location transparency can be achieved via synonyms, database views or PL/SQL procedures.
Oracle Distributed Transactions are Operating System and Database version independent; they can expand databases running different Operating Systems and RDBMS releases as long as their combination is certified by Oracle and supported. For more information, refer to: Note 207303.1 Client / Server / Interoperability Support Between Different Oracle Versions
Global Database NameThe global name of the database is composed of DB_NAME.DB_DOMAIN, where DB_DOMAIN specifies the network domain.
In a distributed environment, it is fundamental that the global name of a database is unique in the network so that each database can be unambiguously identified.
Note: The DB_DOMAIN parameter is only relevant at database creation time; once the database is created, changing this parameter in the database initialization file will not take effect in itself, though it is recommended to include it there for documentation purposes.
The following sql will allow to check the current name of the database and to change it if necessary:
Refer to Managing Global Names in a Distributed System from the Oracle Database Administrator's Guide for further information on the above.
sqlplus> select * from global_name;
sqlplus> ALTER DATABASE RENAME GLOBAL_NAME TO orcl.net;
sqlplus> select * from global_name;
In short, there are three phases for the 2PC:
All the above phases take place quickly and transparently to the application where the transaction originated.
- PREPARE: The initiating node ask each of its referenced nodes to promise to perform a commit or rollback when told to do so. The preparing node will flush the redo log buffer to the online redo log. It converts locks to in-doubt transaction locks on the data blocks and passes its highest SCN value to the initiating node.
- COMMIT: The initiating node commits and writes to its redo log the committed SCN. The Data Block locks are released.
- FORGET: Pending transactions tables are related database views are cleared (dba_2pc_pending/dba_2pc_neighbors)
A crash during the PREPARE Phase results in a ROLLBACK
A crash during the COMMIT Phase results in either COMMIT or ROLLBACK
Node Relations During 2-PC OperationsEach Database involved in a two-phase commit operation performs one or more of the following roles:
For more detailed information regarding distributed transactions and Oracle implementation of the 2PC mechanism, refer to Note 13229.1 Distributed Database, Transactions and Two Phase Commit
- Client: Database node that requests information from another db.
- Server: Node which receives a request for information from another database involved in a distributed transaction.
- Global Coordinator (GC): Node where the distributed transaction originates. This node is responsible for sending messages to the other nodes to prepare and commit (or roll back).
- Local Coordinator: Node that needs to access data on other nodes in order to complete its part of the transaction. It is also responsible for coordinating the transaction among the nodes with which it communicates directly.
- Commit Point Site: Node which commits or rolls back the transaction first, as instructed by the Global Coordinator (GC). This site determines the outcome of the transaction. This node never enters the prepare state. Note: The commit point site is chosen based on the highest COMMIT_POINT_STRENGTH value of all the nodes involved.
- Session Tree: The communication topology for the Oracle server is fundamentally a tree structured topology. As the statements in a distributed transaction are issued, the nodes and edges of the session tree are defined. There is a certain amount of recursiveness inherent in the tree structure. A session tree is the hierarchical model that describes the relationships between sessions and their roles in the distributed transaction. The nodes which represent the sessions in the session tree have different roles in the 2PC protocol with regard to transaction management.
In-Doubt TransactionsIf a distributed transaction fails, it may leave in-doubt transactions on one or more databases.
An in-doubt transaction is normally automatically resolved when the database or network is restored, this is done by the RECO database background process.
An in-doubt distributed transaction occurs when a two-phase commit was interrupted by any type of system or network failure. For example, two databases report to the coordinating database that they were prepared to commit, but the coordinating database instance fails immediately after receiving the messages. The two databases who are prepared to commit are now left awaiting notification out of the outcome.
Recoverer Process (RECO)RECO is a mandatory background process that, in a distributed database, automatically resolves failures in distributed transactions. The RECO process of a node automatically connects to other databases involved in an in-doubt distributed transaction; it can use an existing connection or establish a new connection to other nodes involved in the failed transaction.. When RECO reestablishes a connection between the databases, it automatically resolves all in-doubt transactions, removing from each database's pending transaction table any rows that correspond to the resolved transactions.
At exponentially growing time intervals, the RECO background process of a node attempts to recover the local portion of an in-doubt distributed transaction.
Disabling and Enabling RECOYou can enable and disable RECO using the ALTER SYSTEM statement with the ENABLE/DISABLE DISTRIBUTED RECOVERY options. For example, you can temporarily disable RECO to force the failure of a two-phase commit and manually resolve the in-doubt transaction.
-To disable RECO:ALTER SYSTEM DISABLE DISTRIBUTED RECOVERY-To enable RECO and let it automatically resolve indoubt transactionsALTER SYSTEM ENABLE DISTRIBUTED RECOVERY;DBA_2PC_PENDING Provides Information on distributed transactions awaiting recovery, ie in-doubt transactions. Relevant columns from this view are:
GLOBAL_NAMES: (default false): It is strongly recommended to set this parameter to TRUE, to enforce the database name uniqueness in the network.
Setting it to TRUE enforces that any database link has the same name than the global_name of the database that it connects to.NOTE: changing this parameter from false to true may render already existing database links unusable.
OPEN_LINKS: (default 4) Specifies the maximum number of concurrent open connections to remote databases in one session. These connections include database links, as well as external procedures and cartridges, each of which uses a separate process.. It should at least be the same than the maximum number of databases that any transaction can reference to.
DISTRIBUTED_LOCK_TIMEOUT: (default 60 seconds): Defines the number of seconds that a distributed transaction waits for a lock. If a session waits longer than this for the lock then ORA-2049 is signalled.
note This parameter was hidden in 8i and 9.0 and then made available again in 9.2 onwards.
COMMIT_POINT_STRENGTH (default 1). Range of values is any integer from 0 to 255. Its value determines the commit point site in a distributed transaction. The node in the transaction with the highest value for COMMIT_POINT_STRENGTH will be the commit point site. The commit point strength should be set relative to the amount of critical shared data in the database. If no commit point site is set, Oracle will determine which site becomes the commit point site.
LOCAL_TRAN_IDLocal Id of the transaction,in the form of n.n.n; where n is a numberGLOBAL_TRAN_IDGlobal Transaction ID, unique to all sitesSTATE:One of the following: Collecting/Prepared/Committed/Forced Commit/Forced RollbackMIXED:
YES indicates that portions of the transaction have been committed and portions rolled back (forcibly)ADVICE
C indicates CommitTRAN_COMMENT
R indicates Rollback
This field is populated only if the application has issued one of the following statements prior to beginning the distributed transaction:ALTER SESSION ADVICE COMMIT or ALTER SESSION ADVICE ROLLBACK
Commit comment text. This column is populated only if the application has issued a COMMIT with a comment, for example:FAIL_TIME
COMMIT COMMENT 'comment detail'
FORCE_TIMETime that the row has been inserted in the view
RETRY_TIMEIf the transaction has been forced the time will be displayed; NULL otherwise
OS_USERTime that the RECO process last attempted to resolve the transaction
OS_TERMINALOS USerID of the local user that created the transaction
HOSTTerminal from where the local portion of the transaction originated
DB_USERName of the machine where the local transaction originated
COMMIT#Database username who originated the distributed transaction
If the transaction is committed this column represents the global commit number
DBA_2PC_NEIGHBORS Displays information on incoming and outgoing connections for pending transactions within an Oracle distributed transaction. Relevant columns:LOCAL_TRAN_ID
Local identifier of the transactionIN_OUT
- IN for incoming
- OUT for outgoing
- For incoming connections, this is the client db global_name
- For outgoing connections, this value represents the database link
- For incoming connections, the Oracle Username
- For outgoing connections, the owner of the database link
Used to locate the Global Commit Point site:
- For incoming links, C indicates that this site or one of the descendants on an outgoing link is the commit point site.
- For outgoing links, C indicates that the destination database DBID is the commit point site.
- If we are in-doubt, INTERFACE is N and then the top-level database either is the commit point site or can locate the commit point site.
The global name of the remote databaseSESS#
Local session number for the connection at this database. Sessions are numbered consequently, starting by 1BRANCHTransaction branch ID of the connection at this database. Branch IDs for incoming connections are two byte hexadecimal numbers; the first byte is the remote parent's session ID, and the second byte is its branch ID.
V$DBLINK / GV$DBLINKV$DBLINK Lists all open database links in your session, that is, all database links with the IN_TRANSACTION column set to YES.GV$DBLINK Lists all open database links in your session along with their corresponding instances. This view is useful in an Oracle Real Application Clusters configuration.
It might be useful to determine which database link connections are currently open in your own session. Note that if you connect as SYSDBA, you cannot query a view to determine all the links open for all sessions; you can only access the link information in the session within which you are working.V$GLOBAL_TRANSACTION provides information on the currently active distributed transactions.
Issuing a COMMIT with a comment such as ORA-2PC-CRASH-TEST-n, -where n is a number from 1 to 10-.
You can test a variety of scenarios, according to the value of n, as shown below,
For example, the following statement returns the following messages if the local commit point strength is greater than the remote commit point strength and both nodes are updated:1 Crash commit point after collect2 Crash non-commit-point site after collect
3 Crash before prepare (non-commit-point site)
4 Crash after prepare (non-commit-point site)
5 Crash commit point site before commit
6 Crash commit point site after commit
7 Crash non-commit-point site before commit
8 Crash non-commit-point site after commit
9 Crash commit point site before forget
10 Crash non-commit-point site before forget
At this point, the in-doubt distributed transaction appears in the DBA_2PC_PENDING view. If enabled, RECO automatically resolves the transaction.
COMMIT COMMENT 'ORA-2PC-CRASH-TEST-7';
ORA-02054: transaction 1.93.29 in-doubt
ORA-02059: ORA_CRASH_TEST_7 in commit comment
Refer to Note 126069.1 Manually Resolving In-Doubt Transactions: Different Scenarios, for detailed test results regarding each of the situations above.
Error messages related to distributed transactions fail in the range ORA-02040 - ORA-02099, refer to Oracle11g Database Error Messages for details on those.
Often, abnormal conditions that occur during 2PC are caused by either a network or a server failure whilst the transaction is in the prepare and commit phases.
Some errors that you might see in the alert.log when working in environments that use distributed transactions are shown below:ORA-02053: transaction <txnId> committed, some remote DBs may be in-doubt
The transaction has been locally committed, however we have lost communication with one or more local coordinators.
ORA-02054: transaction <txnId> in-doubt
The transaction is neither committed or rolled back locally, and we have lost communication with the global coordinator.
ORA-02050: transaction <txnId> rolled back, some remote DBs may be in-doubt
Indicates that a communication error ocurred during the two-phase commit
ORA-01591: lock held by in-doubt distributed transaction <txnId>
Encountering the above error and users/applications unable to proceed with their work. In this case, Oracle automatically rolls back the user attempted transaction and the DBA has now to manually commit or rollback the in-doubt transaction.
NOTE: Reads are blocked because, until the transaction is resolved, Oracle does not assume which version of the data to display for a query user.
- Distributed transactions time out waiting to acquire locks or hold locks themselves for an excessive amount of time. If a distributed transaction cannot obtain a required lock after
DISTRIBUTED_LOCK_TIMEOUT seconds then the following error is returned:
ORA-02049 timeout: distributed transaction waiting for lock
Increasing the value of the distributed_lock_timeout/retry the issued SQL, are typical approaches when this error is encountered. If it becomes a recurrent problem and certain transactions appear to be hanging for ever or causing contention in the database, then further investigation is required, and there is a need determine what other transaction is holding the lock and what type of problem exists.
To further analyze the above, refer to Note 789517.1 ORA-02049: timeout: distributed transaction waiting for lock' Error: How to Obtain a System State Trace BEFORE the Error Occurs, While Still Experiencing the Contention
- Additionally , there are some other rare circumstances in which further action needs to be taken by the DBA in order to resolve the in-doubt transaction as there is underlying metadata corruption. Please refer to the following note for a discussion on how to investigate and overcome those: Note 401302.1 How To Resolve Stranded DBA_2PC_PENDING Entries
1- Problem reported:Users querying a local table in database DB102C encounter the following errors and notify the DBA :
In the alert.log :ORA-01591: lock held by in-doubt distributed transaction 10.24.340
You note down the Local Transaction ID,10.24.340, and query DBA_2PC_PENDING:DISTRIB TRAN DB102D.UK.ORACLE.COM.ea27958e.9.46.338is local tran 10.24.340 (hex=0a.18.154)insert pending prepared tran, scn=5017243 (hex=0.004c8e9b)
2- Query DBA_2PC_PENDING
select * from dba_2pc_pending where local_tran_id='10.24.340';
LOCAL_TRAN_ID : 10.24.340GLOBAL_TRAN_ID : DB102D.UK.ORACLE.COM.ea27958e.9.46.338STATE : preparedMIXED : noADVICE :TRAN_COMMENT :FAIL_TIME : 02-nov-2010 02:06:02FORCE_TIME :RETRY_TIME : 02-nov-2010 02:06:02OS_USER : NT AUTHORITY\SYSTEMOS_TERMINAL : USER33-lapHOST : USER33-lapDB_USER : TESTUSERCOMMIT# : 5017243
From the above output the following can be determined:
- This node is not the Global Coordinator as the local_tran_id is different than the last portion of the Global_Tran_ID; ie the distributed transaction did not originate at this node.
- GLOBAL_TRANSACTION_ID represents the common transaction ID that will be the same on the every node for a distributed transaction. It is of the form global_database_name.hhhhhhhh.local_transaction_id where global_database_name is the database name of the global coordinator and hhhhhhhh represents in hexadecimal the internal database identifier of the global coordinator.
- The transaction in this node is in PREPARED State
- Look at the ADVICE and TRAN_COMMENT columns for information about this transaction, if any of those columns are populated, they could help you decide whether the local portion of the transaction should be rolled back or committed. (COMMIT COMMENT... or SET TRANSACTION... NAME populates tran_comment column)
3 - Query DBA_2PC_NEIGHBORSWe climb the session tree so that we find coordinators, until we eventually reach the global coordinator. Along the way, you might find a coordinator that has resolved the transaction. If not, you can eventually work your way to the commit point site, which will have always have resolved the in-doubt transaction.
To trace the session tree, query DBA_2PC_NEIGHBORS on each node
- On the local node:
select * from dba_2pc_neighbors;
LOCAL_TRAN_ID : 10.24.340IN_OUT : inDATABASE : DB102D.UK.ORACLE.COMDBUSER_OWNER : TESTUSERINTERFACE : NDBID : ea27958eSESS# : 1BRANCH : 09002E00520100000104
- As discussed, DBA_2PC_NEIGHBORS provides information about connections associated with an in-doubt transaction. Information for each connection is different, based on whether the connection is inbound (IN_OUT = in) or outbound (IN_OUT = out).
- In our example we see that IN_OUT = in, indicating that our db (db102c) is a server of the DB102D client as specified in the database column.
- DBUSER_OWNER shows the connecting user to our db102c database
- INTERFACE = N indicates that neither DB102C nor any of its dependent sites is the commit point site.- On the remote node referenced in the transaction:
select * from dba_2pc_neighbors;
LOCAL_TRAN_ID : 9.46.338
IN_OUT : in
DBUSER_OWNER : TESTUSER
INTERFACE : N
SESS# : 1
BRANCH : 0000
LOCAL_TRAN_ID : 9.46.338
IN_OUT : out
DATABASE : DB102C.UK.ORACLE.COM
DBUSER_OWNER : TESTUSER
INTERFACE : N
DBID : bae6b90f
SESS# : 1
BRANCH : 4
For the OUT connection :
The local txn ID matches the end portion of the global_transaction_id indicating that this site is the global coordinator, where the transaction originated.
- column DBUSER_OWNER shows the owner of the dblink
- column DATABASE shows the database link name that has accessed the remote server (db102c).
This site is the commit point site and has already committed the local portion of the transaction as seen by the state column:
Conclusion: This node already committed the local transaction and afterwards it crashed leaving the GC still waiting for the 'commit' response from the commit point site.
4 - Resolve the in-doubt transactionGiven that the commit point site has already committed this transaction, then manually commit it at the DB102C site; to do so the following syntax can be used:
COMMIT FORCE 'DB102D.UK.ORACLE.COM.ea27958e.9.46.338','5017245';
COMMIT FORCE '10.24.340','5017245';
The COMMIT FORCE clause lets you manually commit an in-doubt distributed transaction. The transaction is identified by the 'string' containing its local or global transaction ID. You can use integer to specifically assign the transaction a system change number (SCN). If you omit integer, then the transaction is committed using the current SCN.In order to ensure global integrity, we use the commit# of the commit point site (highest global commit#)
By specifying the SCN for the transaction when forcing a transaction to commit, the in-doubt transaction is committed with the SCN assigned when it was committed at other nodes.
In this way, we can maintain the synchronized commit time of the distributed transaction even if there is a failure.
We only specify an SCN only when we can determine the SCN of the same transaction already committed at another node.
In cases where the in-doubt transaction is to be ROLLBACK you would use syntax ROLLBACK FORCE <local_txn_ID> to set the state to forced rollback.
Purging the ViewsDBMS_TRANSACTION.PURGE_LOST_DB_ENTRY is used to manually purge the details from the views once the in-doubt transaction has been resolved:
execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY ('10.24.340');
In the case of a mixed transaction, where portions of the database have already committed and other portions have been rolled back forced, useexecute DBMS_TRANSACTION.PURGE_MIXED ('txnID');
Refer to DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY for further details regarding purging Pending Rows from the Data DictionaryNOTE: If using UNDO_MANAGEMENT=auto and DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY fails with
ORA-30019: Illegal rollback Segment operation in Automatic Undo mode, use the
-- check the current value of _smu_debug_mode (default 0):
SQL> show parameter debug -- if default 0, it will show no entry
-- set it temporarily to 4:
SQL> alter system set "_smu_debug_mode" = 4; -- in 9.2x alter session can be used instead.
SQL> alter system set "_smu_debug_mode" = <original value>;
Complete Recovery:1. At the down site recover completely if possible (treat as regular recovery).
2. SCN will appear in the alert.log after recover is done on the crashed Node.
Incomplete Recovery:1. If time-based or cancel-based recovery was used on the crashed node, other sites must be placed back to the same point in time for global consistency. Get last SCN from the alert.log of the crashed node.
2. At each node perform a shutdown normal or immediate.
3. Take a cold backup.
4. Restore the control file if necessary.
5. Restore the last backup of all datafiles along with archived redo logs.
6. Choose which tool to use to perform SCN recovery - either SQLPLUS or RMAN.
7. Connect as sysdba, startup mount, check status from v$datafile to make sure all datafiles are online. Issue 'Alter database datafile '?/?/?' online;', for each datafile with status offline, to bring it online.
8. Issue the following command using the latest SCN from alert.log on the Node that had to be recovered:RECOVER DATABASE UNTIL CHANGE '1.13.5197';NOTE: If for some reason (e.g.when issuing commit force command) automatic recovery (RECO process) needs to be disabled, we can disable it and reenable via alter system as mentioned before:ALTER SYSTEM DISABLE DISTRIBUTED RECOVERY;Use this command to wake RECO up after that:ALTER SYSTEM ENABLE DISTRIBUTED RECOVERY;For further diagnosis and help troubleshooting distributed transactions, contact Oracle Support or login to Streams and Distributed Database Community to share your experiences with fellow peers.
NOTE:1012842.102 - ORA-2019 ORA-2058 ORA-2068 ORA-2050: Failed Distributed Transactions
NOTE:126069.1 - Manually Resolving In-Doubt Transactions: Different Scenarios
NOTE:13229.1 - Distributed Database, Transactions and Two Phase Commit
NOTE:159377.1 - How to Purge a Distributed Transaction from a Database
NOTE:207303.1 - Client / Server / Interoperability Support Between Different Oracle Versions
NOTE:401302.1 - How To Resolve Stranded DBA_2PC_PENDING Entries
Oracle9i Database Administrator's Guide - Part VI: Distributed Database Management
Oracle10g Database Administrator's Guide - Part VII: Distributed Database Management
Oracle11g Database Administrator's GuidePart V: Distributed Database Management