Solving a disk outage situation during an online backup in DS5.2
By marcos on Jun 10, 2009
Problem description

The R3 replica ran out of disk space on the 32GB transaction logs partition /DB/txlogs while the nightly online backup was taking place, during the night of May 15th. More than 3200 transaction logs existed at the time the system ran out of disk. The disk outage did not impact the DB partition, which had enough room for both the DB and the backup to be stored.
This is the start of the backup at 2:00am... So far so good! The first transaction log recorded after 2:00am is log.0000241693:
[15/May/2009:02:00:05 +0200] - Backup starting (/DATA/backup/DB_TO_BAK/last_backup)
[15/May/2009:02:00:06 +0200] - Multiple pool memory management enabled.
[15/May/2009:02:00:07 +0200] - Backing up file 1 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_id2entry.db3)
[15/May/2009:02:00:07 +0200] - Backing up file 2 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_entrydn.db3)
At 03:59, the server should have finished the backup. Unfortunately, a transaction log that was present but not yet processed when the backup was executed, log.0000241676, has in the meantime been processed by the DS, compromising the backup; the backup therefore needs to be restarted (this is a known behavior, and an RFE is open to change it):
[15/May/2009:03:59:08 +0200] - Backing up file 92 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/cldb/ou_puro_o_siux47e852bf000002bd0000.db3)
[15/May/2009:03:59:08 +0200] - Backing up file 93 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/cldb/ou_uva_o_siux486973b7000003230000.db3)
[15/May/2009:03:59:08 +0200] - WARNING&lt;20510&gt; - Backend Database - conn=-1 op=-1 msgId=-1 - WARNING: Log log.0000241676 has been swiped out from under me! (retrying)
[15/May/2009:03:59:08 +0200] - Backing up file 94 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/UVA/uva_id2entry.db3)
[15/May/2009:03:59:08 +0200] - Backing up file 95 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/UVA/uva_entrydn.db3)
At 05:50, once again at the end of the second try for a good backup, we hit another issue while copying one of the transaction logs created since 2:00. (I cannot deduce WHY the copy of log.0000242125 fails, but it is clearly the 433rd transaction log since the backup started, the first one being log.0000241693.) The backup is restarted once again with the message "Backing up file 620" (186 data files + 433 transaction logs + 1):
[15/May/2009:05:50:38 +0200] - Backing up file 186 (/DATA/backup/DB_TO_BAK/last_backup/cldb/ou_uva_o_aux486973b7000003230000.db3)
[15/May/2009:05:52:31 +0200] - WARNING&lt;20511&gt; - Backend Database - conn=-1 op=-1 msgId=-1 - Error copying file '/BD/txlog/log.0000242125' (err=-1). Starting over...
[15/May/2009:05:52:31 +0200] - Backing up file 620 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_id2entry.db3)
[15/May/2009:05:52:31 +0200] - Backing up file 621 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_entrydn.db3)
All these restarts were too much for the transaction logs partition... Only 2 minutes later, the transaction logs partition runs out of disk space, and from that moment onwards nothing more will be written to the DB:
[15/May/2009:05:52:31 +0200] - Backing up file 631 (/DATA/backup/DB_TO_BAK/last_backup/userRoot/userRoot_id2entry.db3)
[15/May/2009:05:54:54 +0200] - WARNING&lt;20993&gt; - conn=-1 op=-1 msgId=-1 - disk space under /BD/txlog is low
[15/May/2009:05:55:14 +0200] - ERROR&lt;20994&gt; - conn=-1 op=-1 msgId=-1 - disk full under /BD/txlog
[15/May/2009:05:55:16 +0200] - WARNING&lt;10249&gt; - Incremental Protocol - conn=-1 op=-1 msgId=-1 - Internal error Failed to replay LDAP operation with csn 4a0cf65a000000150000 for replica o=aux, got error: DSA is unwilling to perform (53).
From the data above, we can draw some conclusions:
- R3 ran out of disk space on the transaction logs partition, and that has left the replica completely unable to process any further update operation (as proven by the replication error messages from 05:55:16 onwards). It is therefore a smart move to stop R3 immediately and proceed with a database recovery
- this is the first time in one year that this problem (an online backup needing to be restarted several times until the transaction logs partition fills up) has been hit. The maintenance task executed only one hour before this online backup started may have influenced this event, although that remains to be proven

What follows is the sequence of recovery actions, in chronological order:
- stop R3 (Friday 15, 14:25)
- remove the DB cache and region files under nsslapd-db-home-directory (/tmp/JESdb) to force a database recovery at the next startup
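That cleanup step can be sketched as a tiny script. This is my own illustration, not a supported procedure: the __db.* naming is the standard Berkeley DB convention for environment region files, and the path is the one from this incident's nsslapd-db-home-directory.

```shell
# Sketch only: force a full DB recovery at the next ns-slapd startup by
# removing the Berkeley DB environment (region) files.  The __db.* naming
# is the usual Berkeley DB convention; the directory comes from the
# instance's nsslapd-db-home-directory attribute.
force_db_recovery() {
  db_home=$1                     # e.g. /tmp/JESdb in this incident
  # With the shared-memory region files gone, libdb cannot re-attach to
  # the old environment and replays the transaction logs instead.
  rm -f "$db_home"/__db.*
}
```

In the incident this amounted to removing /tmp/JESdb/__db.* while the server was stopped.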
- restart R3 (Friday 15, 14:40) => the DB recovery starts and we begin monitoring how it progresses over time. For this, we iteratively run pfiles on the ns-slapd PID and watch the inodes the process has open inside the transaction logs partition, verifying that the recovery runs through each and every existing transaction log, from the newest to the oldest, at a rate close to 5-6 tx logs/minute at the beginning of the recovery, degrading over time to 1 tx log per minute at the end, about 21 hours later
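The same monitoring idea can be sketched portably. This is my own hypothetical variant: the incident used Solaris pfiles(1) against the ns-slapd PID, while the sketch below expresses the equivalent on Linux via /proc/&lt;pid&gt;/fd. Sampling it periodically shows which log file the recovery currently holds open, hence how fast it advances.

```shell
# Sketch only (Linux equivalent of the Solaris pfiles technique): list
# the transaction logs a process currently has open by resolving its
# /proc/<pid>/fd symlinks and keeping only files under the tx-log dir.
open_txlogs() {
  pid=$1; txdir=$2
  for fd in /proc/"$pid"/fd/*; do
    # each fd entry is a symlink to the file the process has open
    readlink "$fd" 2>/dev/null
  done | grep "^$txdir/log\." | sort
}
```

Running something like `open_txlogs <ns-slapd pid> /BD/txlog` once a minute gives the per-minute recovery rate described above.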
- Saturday 16, 11:30 => the DB recovery ends the iteration above and starts a second iteration on the transaction log files, this time from the oldest to the youngest, at a much faster rate of 50-70 tx logs/minute.
- Saturday 16, 13:14 => the DB recovery's second iteration phase ends and the errors log stops writing any further status message (the last one written is "[15/May/2009:14:37:25 +0200] - Database recovery is 97% complete."); the CPU stabilizes at 25% usage, with ns-slapd still running a single thread
- Saturday 16, 14:30 => as the status is still the same (no further messages in the errors log since 13:14, CPU stable at 25%), we start collecting evidence on the process itself. truss shows nothing at all, not even "sleeping" messages. pstack always shows the following type of stack, most of the time looping over several different addresses in __memp_close, a bucket of addresses that over time we find to be always the same ones:
18:03:56 [root@PERA3]: /BD/userRoot # pstack 24623
24623:  ./ns-slapd -D /var/opt/mps/serverroot/slapd-pera -i /var/opt/mps/serve
 0000000100036bd8 __memp_close (10076e3b0, 0, 0, 0, 0, 0) + 94
 0000000100021fc0 ???????? (10076e3b0, 100750070, 1004d03d0, 0, 5ade9b, 478)
 0000000100021adc __dbenv_open (10076e3b0, 0, 4000, 0, 7821, 40020) + 55c
 ffffffff7653760c dblayer_start (1003b6ea0, c00, f78, 1006b9080, 1003b5d18, 100021160) + 63c
 ffffffff76530840 ldbm_back_start (100785030, 1003b6ea0, 1003b6ea0, 0, 1001c6570,...
 ffffffff7ec86bf0 ???????? (1c00, 1f30, 100785030, 5400, 57b8, 10000)
 ffffffff7ec865c8 ???????? (57b8, ffffffff7ededa88, 35, ffffffff7edfe2a0, 2, 1f00)
 ffffffff7ec86a38 plugin_startall (5, ffffffff7ffffaf8, 1, 1, c80, 7d00) + 30
 00000001000195b4 main (0, 0, 185, ffffffff7ffffaf8, 1001bf550, 1001c1c48) + bf0
 0000000100010e7c _start (0, 0, 0, 0, 0, 0) + 17c
- Saturday 16, 23:30 => the situation is still the same one: same pstacks, same CPU usage, same absence of log messages. We are in a comfortable position to deduce that an infinite loop in __memp_close is in place. stop-slapd
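The "same stack over and over" deduction can be automated with a small sampling helper. This is a hypothetical sketch of my own; in this incident the sampling command would have been `pstack 24623`.

```shell
# Hypothetical helper, not from the original post: run a sampling
# command N times and succeed only if every sample is identical to the
# first.  An unchanged stack across many spaced pstack samples strongly
# suggests a tight loop rather than slow progress.
all_samples_equal() {
  cmd=$1; samples=$2; pause=${3:-1}
  first=$(eval "$cmd")
  i=1
  while [ "$i" -lt "$samples" ]; do
    sleep "$pause"
    [ "$(eval "$cmd")" = "$first" ] || return 1   # output changed: progress
    i=$((i + 1))
  done
  return 0                                        # all identical: likely looping
}
```

Something like `all_samples_equal "pstack 24623" 10 60` would have flagged the __memp_close loop after ten minutes of identical stacks.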
- Saturday 16, 23:32 => without removing any DB cache or region files, start R3 again. We decide to try our luck by running a new recovery, to verify whether it would start the 23-hour-long transaction log iteration from scratch or go directly into the __memp_close loop. And we are lucky: the DB recovery goes directly into the __memp_close loop without iterating through the transaction logs. This is excellent news; it means that, whatever the problem behind the __memp_close loop is, it has nothing to do with database integrity, which seems already ensured (no need to run over the transaction logs any longer). Could the __memp_close loop instead be related to the lack of disk space on the transaction log partition?
- Saturday 16, 23:35 => stop the server once again and remove the first 3000 transaction logs to free disk space on the transaction logs partition. We can safely do this now that we have proved, via a second DB recovery, that there is no longer a need to iterate through them. We leave the last 200 transaction log files untouched, however, to be sure the server will be able to find its last checkpoint.
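The pruning step can be sketched as follows; a hypothetical illustration of my own, where keep=200 would reproduce the manual cleanup. It relies on the fact that libdb numbers the logs sequentially, so their names sort chronologically (log.0000241693 &lt; log.0000242125).

```shell
# Sketch only: remove the oldest transaction logs but keep the newest
# $keep files so the server can still locate its last checkpoint.
prune_txlogs() {
  dir=$1; keep=$2
  total=$(ls "$dir" | grep -c '^log\.')
  [ "$total" -le "$keep" ] && return 0
  # sequential log names sort chronologically, so plain sort + head
  # selects the oldest (total - keep) files for deletion
  ls "$dir" | grep '^log\.' | sort | head -n "$((total - keep))" |
  while read -r f; do
    rm -f "$dir/$f"
  done
}
```

With the server stopped, `prune_txlogs /BD/txlog 200` would correspond to what was done here by hand.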
- Saturday 16, 23:40 => restart the server once again; it enters DB recovery mode for the third time, but this time it leaves that mode within a couple of seconds and switches to regular mode immediately afterwards. A dozen seconds later, R3 starts successfully receiving replicated operations, and the remaining 200 old transaction logs are automatically trimmed by the DS instance within the next minute. Problem solved.