Solving a disk outage during an online backup in DS 5.2

Problem description
The R3 replica ran out of disk space on the 32GB transaction logs partition /BD/txlog while a nightly online backup was taking place, during the night of May 15th. More than 3,200 transaction logs existed at the time the system ran out of disk. The disk outage did not impact the DB partition, which had enough room for both the DB and the backup to be stored.

Analysis
An analysis was first performed on the R3 errors log output to understand why the online backup in R3 had failed:

This is the start of the backup at 2:00am... So far so good! The very first transaction log recorded at 2:00am is log.0000241693:

[15/May/2009:02:00:05 +0200] - Backup starting (/DATA/backup/DB_TO_BAK/last_backup)
[15/May/2009:02:00:06 +0200] - Multiple pool memory management enabled.
[15/May/2009:02:00:07 +0200] - Backing up file 1 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_id2entry.db3)
[15/May/2009:02:00:07 +0200] - Backing up file 2 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_entrydn.db3)

At 03:59, the server should have finished the backup. Unfortunately, a transaction log that was present but not yet copied at the time the backup was executed, log.0000241676, has in the meantime been processed by the DS, compromising the backup; hence the backup needs to be restarted (this is a known behavior, and an RFE is open to change it):

[15/May/2009:03:59:08 +0200] - Backing up file 92 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/cldb/ou_puro_o_siux47e852bf000002bd0000.db3)
[15/May/2009:03:59:08 +0200] - Backing up file 93 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/cldb/ou_uva_o_siux486973b7000003230000.db3)
[15/May/2009:03:59:08 +0200] - WARNING<20510> - Backend Database - conn=-1 op=-1 msgId=-1 -  
                               WARNING: Log log.0000241676 has been swiped out from under me! (retrying)
[15/May/2009:03:59:08 +0200] - Backing up file 94 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/UVA/uva_id2entry.db3)
[15/May/2009:03:59:08 +0200] - Backing up file 95 (/DATA/backup/BINARIO/DB_TO_BAK/last_backup/UVA/uva_entrydn.db3)

At 05:50, once again at the end of the second attempt at a good backup, we hit another issue while copying one of the transaction logs created since 2:00. I can't deduce WHY we get an error copying log.0000242125, but it is clearly transaction log #433 since the backup started (the first transaction log being log.0000241693, and 242125 - 241693 + 1 = 433), and the backup is restarted once again with the message "Backing up file 620" (186 + 433 + 1):

[15/May/2009:05:50:38 +0200] - Backing up file 186 (/DATA/backup/DB_TO_BAK/last_backup/cldb/ou_uva_o_aux486973b7000003230000.db3)
[15/May/2009:05:52:31 +0200] - WARNING<20511> - Backend Database - conn=-1 op=-1 msgId=-1 -  
                               Error copying file '/BD/txlog/log.0000242125' (err=-1). Starting over...
[15/May/2009:05:52:31 +0200] - Backing up file 620 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_id2entry.db3)
[15/May/2009:05:52:31 +0200] - Backing up file 621 (/DATA/backup/DB_TO_BAK/last_backup/UVA/uva_entrydn.db3)

All these restarts have been too much for the transaction logs partition... Only two minutes later, we run out of disk in the transaction logs partition, and from that moment onwards nothing will ever get written to the DB:

[15/May/2009:05:52:31 +0200] - Backing up file 631 (/DATA/backup/DB_TO_BAK/last_backup/userRoot/userRoot_id2entry.db3)
[15/May/2009:05:54:54 +0200] - WARNING<20993>  - conn=-1 op=-1 msgId=-1 - disk space under /BD/txlog is low
[15/May/2009:05:55:14 +0200] - ERROR<20994>  - conn=-1 op=-1 msgId=-1 -  disk full under /BD/txlog
[15/May/2009:05:55:16 +0200] - WARNING<10249> - Incremental Protocol - conn=-1 op=-1 msgId=-1 - 
Internal error  Failed to replay LDAP operation with csn 4a0cf65a000000150000 for replica o=aux, 
got error: DSA is unwilling to perform (53).

By looking at the data above, some conclusions can be drawn:

  • R3 ran out of disk space in the transaction logs partition, and that rendered the replica completely unable to process any further update operation (as proven by the replication error messages from 05:55:16 onwards). It is therefore a smart move to stop R3 immediately and proceed with a database recovery. A minimal disk-watch sketch that can flag this kind of situation early is shown right after this list
  • this is the first time this problem (an online backup needing to be restarted several times until the transaction logs partition fills up) has been hit in one year. Possibly the execution of the maintenance task only one hour before this online backup started had an influence on this event, although that remains to be proven
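
Given how quickly the repeated backup restarts filled the partition, an external watch on the transaction logs partition is cheap insurance. Below is a minimal sketch, assuming /BD/txlog is the partition to watch; the 85% threshold, the 60-second interval and the use of logger are illustrative choices, not part of the original setup:

    #!/bin/sh
    # Minimal disk-watch for the transaction logs partition (sketch).
    TXLOG_DIR=/BD/txlog
    THRESHOLD=85

    while true; do
      # On Solaris, the 5th column of `df -k` is the capacity percentage.
      USED=`df -k $TXLOG_DIR | tail -1 | awk '{print $5}' | tr -d '%'`
      if [ "$USED" -ge "$THRESHOLD" ]; then
        logger -p daemon.warning "txlog partition at ${USED}% full"
      fi
      sleep 60
    done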

Resolution
The customer allowed our P+ team to stop R3 and run the recovery work immediately, on Friday the 15th, right at the start of a long weekend window. For this, the following was planned and executed successfully:
  • stop R3 (Friday 15, 14:25)
  • remove the DB cache and region files under nsslapd-db-home-directory (/tmp/JESdb) to force a database recovery at the next startup (see the cleanup sketch after this list)
  • restart R3 (Friday 15, 14:40) => the DB recovery starts and we begin monitoring how the recovery progresses over time. For this, we iteratively run pfiles on the ns-slapd PID and monitor the inodes the process has open inside the transaction logs partition, verifying that the recovery runs through each and every existing transaction log, from the newest to the oldest, at a rate close to 5-6 tx logs/minute at the beginning of the recovery, degrading over time to 1 tx log per minute at the end, about 21 hours later (a monitoring sketch is shown after this list)
  • Saturday 16, 11:30 => the DB recovery ends the iteration above and starts a second iteration over the transaction log files, this time from the oldest to the newest, at a much faster rate of 50-70 tx logs/minute.
  • Saturday 16, 13:14 => the second iteration phase of the DB recovery ends and the errors log stops writing any further status messages (the last one written is "[15/May/2009:14:37:25 +0200] - Database recovery is 97% complete."), while the CPU stabilizes at 25% usage with ns-slapd still single-threaded
  • Saturday 16, 14:30 => as the status is still the same (no further messages in the errors log since 13:14, CPU stable at 25%), we start collecting evidence on the process itself. truss shows nothing at all, not even "sleeping" messages. pstack always shows the following type of stack, most times looping over several different addresses in __memp_close, a set of addresses that over time we observe to be always the same ones:
        18:03:56 [root@PERA3]: /BD/userRoot # pstack 24623
        24623:  ./ns-slapd -D /var/opt/mps/serverroot/slapd-pera -i /var/opt/mps/serve
         0000000100036bd8 __memp_close (10076e3b0, 0, 0, 0, 0, 0) + 94
         0000000100021fc0 ???????? (10076e3b0, 100750070, 1004d03d0, 0, 5ade9b, 478)
         0000000100021adc __dbenv_open (10076e3b0, 0, 4000, 0, 7821, 40020) + 55c
         ffffffff7653760c dblayer_start (1003b6ea0, c00, f78, 1006b9080, 1003b5d18, 100021160) + 63c
         ffffffff76530840 ldbm_back_start (100785030, 1003b6ea0, 1003b6ea0, 0, 1001c6570,...
         ffffffff7ec86bf0 ???????? (1c00, 1f30, 100785030, 5400, 57b8, 10000)
         ffffffff7ec865c8 ???????? (57b8, ffffffff7ededa88, 35, ffffffff7edfe2a0, 2, 1f00)
         ffffffff7ec86a38 plugin_startall (5, ffffffff7ffffaf8, 1, 1, c80, 7d00) + 30
         00000001000195b4 main (0, 0, 185, ffffffff7ffffaf8, 1001bf550, 1001c1c48) + bf0
         0000000100010e7c _start (0, 0, 0, 0, 0, 0) + 17c
  • Saturday 16, 23:30 => the situation is still the same: same pstacks, same CPU usage, same absence of log messages. We are in a comfortable position to deduce that an infinite loop on __memp_close is in place. stop-slapd
  • Saturday 16, 23:32 => without removing any DB cache or region files, start R3 again. We decide to try our chances with a new recovery, to verify whether it would restart the 21-hour-long transaction logs iteration from scratch or go directly into memp_close loop mode. And we are lucky: the DB recovery goes directly into the memp_close loop without iterating through the transaction logs. This is excellent news; it means that, whatever the problem behind the memp_close loop is, it has nothing to do with database integrity, which seems already ensured (no need to run over the transaction logs any longer). Could the memp_close loop be related to the lack of disk space in the transaction logs partition instead?
  • Saturday 16, 23:35 => stop the server once again and remove the first 3000 transaction logs to free disk space in the transaction logs partition. We can safely do this now that we have proved, via the second DB recovery, that there is no longer any need to iterate through them. We leave, however, the last 200 transaction log files untouched, to be sure the server will be able to find its last checkpoint (see the trimming sketch after this list).
  • Saturday 16, 23:40 => restart the server once again; it enters DB recovery mode for the third time, leaves that mode within a couple of seconds, and switches to regular mode immediately afterwards. A dozen seconds later, R3 starts successfully receiving replicated operations, and the remaining 200 old transaction logs are automatically trimmed by the DS instance within the next minute. Problem solved.
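
For reference, the region-file cleanup used to force the recovery (second step above) looks roughly as follows. This is a sketch only, assuming the instance lives under /var/opt/mps/serverroot/slapd-pera (as the pstack output suggests) and that nsslapd-db-home-directory points at /tmp/JESdb as stated above:

    # Force a database recovery at the next startup (sketch; paths assumed).
    # Stop the instance first -- never touch region files on a live server.
    /var/opt/mps/serverroot/slapd-pera/stop-slapd

    # The __db.* files under nsslapd-db-home-directory hold the DB cache
    # and locking regions only; removing them makes libdb run a full
    # recovery from the transaction logs at the next start.
    rm /tmp/JESdb/__db.*

    /var/opt/mps/serverroot/slapd-pera/start-slapd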
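
The recovery-progress monitoring described in the restart step can be scripted along these lines; a minimal sketch, assuming Solaris (pgrep, pfiles) and /BD/txlog as the transaction logs partition. The 60-second interval is an illustrative choice:

    #!/bin/sh
    # Follow DB recovery progress by matching the inodes ns-slapd has open
    # against the transaction log files (pfiles reports an "ino:" field
    # for every open file descriptor).
    PID=`pgrep ns-slapd`

    while true; do
      date
      # Inode numbers of every file the process currently has open.
      OPEN=`pfiles $PID 2>/dev/null | sed -n 's/.*ino:\([0-9]*\).*/\1/p'`
      # Print each transaction log whose inode is in that set.
      ls -i /BD/txlog/log.* 2>/dev/null | while read ino name; do
        for i in $OPEN; do
          [ "$i" = "$ino" ] && echo "$name"
        done
      done
      sleep 60
    done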
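
Finally, the transaction-log trimming performed at 23:35 looked roughly like this. To be very clear (see also the comments below): removing transaction logs by hand is dangerous, and it was only safe here because a second full recovery had already proved the server no longer needed them. A sketch, assuming /BD/txlog and relying on the fixed-width log.* names sorting chronologically:

    # Server must be STOPPED, and recovery must already be proven complete.
    cd /BD/txlog
    KEEP=200                      # newest logs kept for the last checkpoint
    TOTAL=`ls log.* | wc -l`
    N=`expr $TOTAL - $KEEP`
    # head picks the oldest logs, since lexicographic order matches
    # chronological order for the fixed-width log.* file names.
    ls log.* | head -$N | xargs rm
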
Comments:

Hi Marcos,
I see the following errors on 5.2 P6; however, Directory works fine with no issues. How do I avoid this error? Can I stop slapd and remove this file? If I do, will it break replication or cause any other issue? Please advise. I am afraid that if I simply remove these tx logs I may need to reinitialize the db.

[17/Jul/2009:16:14:20 -0400] - DEBUG - conn=-1 op=-1 msgId=-1 - libdb: Ignoring log file: /opt/netscape/server4/directory/slapd-userstore/db/log.0000225575: magic number 0, not 40988
[17/Jul/2009:16:16:31 -0400] - DEBUG - conn=-1 op=-1 msgId=-1 - libdb: Ignoring log file: /opt/netscape/server4/directory/slapd-userstore/db/log.0000225575: magic number 0, not 40988
[17/Jul/2009:16:19:19 -0400] - DEBUG - conn=-1 op=-1 msgId=-1 - libdb: Ignoring log file: /opt/netscape/server4/directory/slapd-userstore/db/log.0000225575: magic number 0, not 40988

Posted by Sutha on July 17, 2009 at 06:23 AM PDT #

Manually removing transaction logs is never a good idea, and this blog entry should show that fact (it took more than 36 hours of active monitoring and troubleshooting before we laid a hand on the transaction logs, since we did not have any other way to free disk space in that situation). Hence I strongly recommend against removing transaction logs manually. My recommendation would be to contact Sun Support instead for an in-depth analysis. After all, as you seem to confirm, the server is still up and available despite that message.

Posted by Marcos on July 22, 2009 at 05:42 PM PDT #
