Setting up a comprehensive, safe backup policy on a multi-master topology

Problem description
One of our customers wanted (or should I say needed) to improve the backup policies of their 5.2patch5 replicated topology. Until now, a couple of db2ldif -r exports were executed on a weekly basis on one consumer of the topology. This was definitely not good enough, both in terms of quantity (not frequent enough) and quality (not spread uniformly across the topology). The analysis below was aimed at identifying whether binary copy was an option worth considering for their topology.

Analysis
Considering that a complete export/re-import of the userRoot database for this customer (27 million entries) currently takes over 17 hours on average, the fastest way to both back up and restore data in this topology is a binary copy procedure.

In 5.2patch5, binary copies are in theory portable between 2 replicas provided that both run on exactly the same OS, the same DS version, the same replication role (master/hub/consumer) and the same index configuration. Unfortunately, our customer is running a 2 master <-> 2 consumer deployment, which means that binary copies can only be exchanged between the 2 masters and between the 2 consumers. Another 2 consumers will be added soon to the platform, but they will run on different hardware (T2000), so the backups extracted from those 2 new replicas will likewise only be exchangeable between those 2 replicas themselves. To complete this set of unlucky circumstances, each half of these 3 pairs of replicas is physically located in a different site (one half in one city, the other in a different city, 600 km away), which adds a non-negligible transfer time to the equation (backups will be around 80 GB each).
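
As a quick illustration of the "same index configuration" requirement, the index definitions of two candidate replicas can be dumped and compared before exchanging a binary copy. This is only a sketch: the host names, the $PASSWORD placeholder and the userRoot backend name are assumptions, and it relies on the standard cn=ldbm database layout of DS 5.2.

#!/usr/bin/ksh
# Sketch: dump the index definitions of the userRoot backend on two replicas and diff them.
cd /var/opt/mps/serverroot/shared/bin
./ldapsearch -h master1 -p 389 -D cn=root -w "$PASSWORD" \
  -b "cn=index,cn=userRoot,cn=ldbm database,cn=plugins,cn=config" \
  "(objectclass=nsIndex)" nsSystemIndex nsIndexType nsMatchingRule > /tmp/master1-indexes.ldif
./ldapsearch -h master2 -p 389 -D cn=root -w "$PASSWORD" \
  -b "cn=index,cn=userRoot,cn=ldbm database,cn=plugins,cn=config" \
  "(objectclass=nsIndex)" nsSystemIndex nsIndexType nsMatchingRule > /tmp/master2-indexes.ldif
diff /tmp/master1-indexes.ldif /tmp/master2-indexes.ldif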

Considering all these drawbacks, the bottom line is that each replica will have to deal with its own backup policy, as the likelihood of using a backup from one replica to initialize another one is very low.

With that in mind, and considering that the preliminary pre-production tests show an average time of 90 minutes per backup, my suggestion would be to launch backups on all 6 replicas every night, sequentially, so that only one backup is running in the whole topology at any given time. As nights are typically much less busy in terms of DS traffic in this deployment, such an activity would go almost unnoticed. Here is a potential timeframe for the activity described above (each backup run through db2bak.pl):

  • 21:00 Start back-up on master1 (City1)
  • 23:00 Start back-up on master4 (City2)
  • 01:00 Start back-up on consumer1 (City1)
  • ===> GO/NO GO
  • 03:00 Start back-up on master2 (City2)
  • 05:00 Start back-up on master3 (City1)
  • 07:00 Start back-up on consumer2 (City2)

The cycle described above tries to place the backups of the masters in the lowest-traffic periods while ensuring that backups alternate between the two sites, to minimize the possibility of having 2 backups running on the same site at the same time. It also provides a GO/NO GO checkpoint at which a decision should be made whether we can proceed with the backups scheduled for 03:00 AM, 05:00 AM and 07:00 AM, or whether we need to abort the current backup campaign due to unexpected errors during the first 3 backups of the night (especially if such a failure left a master replica in an outage state).
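
For illustration, each replica would get one entry in its own crontab for its slot in the sequence above. The sketch below gathers the six entries in one place for readability; the wrapper script name and log path are hypothetical (the wrapper is assumed to call db2bak.pl with that instance's credentials and backup directory, or the tar-based script shown later in the Resolution).

# Hypothetical crontab entries; in practice each one lives in its own host's crontab.
# master1 (City1)
0 21 * * * /opt/scripts/ds_backup.ksh >> /var/log/ds_backup.log 2>&1
# master4 (City2)
0 23 * * * /opt/scripts/ds_backup.ksh >> /var/log/ds_backup.log 2>&1
# consumer1 (City1): GO/NO GO reviewed once this run completes
0 1 * * * /opt/scripts/ds_backup.ksh >> /var/log/ds_backup.log 2>&1
# master2 (City2)
0 3 * * * /opt/scripts/ds_backup.ksh >> /var/log/ds_backup.log 2>&1
# master3 (City1)
0 5 * * * /opt/scripts/ds_backup.ksh >> /var/log/ds_backup.log 2>&1
# consumer2 (City2)
0 7 * * * /opt/scripts/ds_backup.ksh >> /var/log/ds_backup.log 2>&1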

Once this policy is applied, we should make sure that the local disks dedicated to backup storage can hold the 2 backups corresponding to the previous 2 nights. In other words, at any point in time, each replica should have 2 backups available: the one from last night and the one from the night before. This covers situations such as losing one backup or skipping one backup because of a GO/NO GO abort.
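
A small pre-flight check can enforce this sizing constraint before each nightly run; a minimal sketch, assuming a dedicated backup filesystem and roughly 80 GB per backup (the mount point and headroom threshold are hypothetical).

#!/usr/bin/ksh
# Sketch: abort the nightly run if the backup filesystem cannot hold one more ~80 GB copy.
BACKUP_FS="/backup"                  # CUSTOMIZE: filesystem holding current/ and last_backup/
REQUIRED_KB=$((90 * 1024 * 1024))    # ~90 GB of headroom, expressed in KB
free_kb=`df -k ${BACKUP_FS} | awk 'NR==2 {print $4}'`
if [ ${free_kb} -lt ${REQUIRED_KB} ]; then
    echo "[`date '+%H:%M:%S'`] Not enough space on ${BACKUP_FS} (${free_kb} KB free), aborting backup"
    exit 1
fi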

As a consequence of this new policy, there are two important parameters affecting overall performance that can be reconfigured to improve the behaviour of the DS deployment even further:

  • nsslapd-changelogmaxage: Reduce it from 1 week to 50 hours (i.e., 2 days and 2 hours)
  • nsds5ReplicaPurgeDelay: Reduce it from 1 week to 50 hours (i.e., 2 days and 2 hours)

Indeed, these two parameters can be reduced to 50 hours because we are sure to have at least 2 backups generated within that interval to recover from in case of need.
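
For reference, the two values could be lowered with an ldapmodify like the one below; this is just a sketch, assuming the replication changelog lives at the default cn=changelog5,cn=config entry and reusing the $SUFFIX1 and $PASSWORD placeholders. nsds5ReplicaPurgeDelay is expressed in seconds (50 hours = 180000 seconds), while nsslapd-changelogmaxage takes an integer plus a time unit.

# cd /var/opt/mps/serverroot/shared/bin
# ./ldapmodify -h master1 -p 389 -D cn=root -w "$PASSWORD"
dn: cn=changelog5,cn=config
changetype: modify
replace: nsslapd-changelogmaxage
nsslapd-changelogmaxage: 50h

dn: cn=replica,cn="$SUFFIX1",cn=mapping tree,cn=config
changetype: modify
replace: nsds5ReplicaPurgeDelay
nsds5ReplicaPurgeDelay: 180000

The same modification would then be repeated on the other master and for the remaining replicated suffixes.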

By reducing the 2 parameters above we will dramatically:

  • reduce the changelog size => the changelog becomes easier to maintain
  • reduce the historical replication information kept inside the entries => the DB grows less over time and the memory footprint of the LDAP entries loaded into memory shrinks

Resolution
Finally, after verifying average times of around 70 minutes per replica in pre-production (sometimes more, due to spurious transaction log replays caused by old historical bugs such as 6303171, which no longer affect us), here are the details of the final backup policy that has been in place since October 26th, 2007:

  • 1. Nightly timetable:
    • 0:00am master2 City2-Master
    • 1:30am consumer1 City1-Consumer
    • 3:00am consumer2 City2-Consumer
    • 4:30am master1 City1-Master
    • 6:00am master2 (db2ldif export, without -r)
  • 2. Backup script. After a long discussion, we convinced the customer not to compress the backups, as their lifetime is very short (only 2 days per backup) and the compression step is very costly (around 110 minutes):

#!/usr/bin/ksh
date=`date '+%Y%m%d%H%M'`  # timestamp used to tag the backup tarball (e.g. 200710262100)
DS_HOME_DIR="$INSTANCEDIR" # CUSTOMIZE!!!
BACKUP_DIR="$BACKUPDIR/current" # CUSTOMIZE!!!
BACKUP_FILE="$BACKUPDIR/slapd-$INSTANCE-bin.${date}.bak.tar" # CUSTOMIZE!!!
LAST_BACKUP="$BACKUPDIR/last_backup" # CUSTOMIZE!!!
# do the backup (binary copy of the databases into BACKUP_DIR)
cd ${DS_HOME_DIR}
echo "[`date '+%H:%M:%S'`] Starting the backup process... \c"
./db2bak ${BACKUP_DIR} > /dev/null 2>&1
if [ $? -ne 0 ]; then
    echo "\n\t[`date '+%H:%M:%S'`] Error while performing the backup"
    rm -r ${BACKUP_DIR}/* > /dev/null 2>&1
    exit 1
else
    # if the current backup is ok, we delete the previous one
    rm -rf ${LAST_BACKUP}/slapd*tar > /dev/null 2>&1
    echo "Ok"
fi
# pack the binary copy into an uncompressed tar (compression deliberately skipped, see above)
( cd ${BACKUP_DIR} && tar -cf ${BACKUP_FILE} . )
# move the tarball next to the previous night's backup
mv ${BACKUP_FILE} ${LAST_BACKUP}/
# delete the raw binaries of the current backup
rm -r ${BACKUP_DIR}/* > /dev/null 2>&1
echo "[`date '+%H:%M:%S'`] DONE"
exit 0
  • 3. Restore procedure:
      
# /var/opt/mps/serverroot/slapd-$INSTANCE/stop-slapd
# cd /var/opt/mps/serverroot/slapd-$INSTANCE
# ./bak2db <binary-backup-directory>
# /var/opt/mps/serverroot/slapd-$INSTANCE/start-slapd
# In the 2 master replicas, all suffixes will be in referral on update:
# ./ldapsearch -D cn=root -w -b "cn=config" "objectclass=nsMappingTree" nsslapd-state
version: 1
dn: cn=o=NetscapeRoot, cn=mapping tree, cn=config
nsslapd-state: referral on update
...
...
# We need to check replication synchronicity, then set the suffixes to read-write again in both masters:
# cd /var/opt/mps/serverroot/shared/bin
# ./ldapmodify -h master1 -p 389 -D cn=root -w
dn: cn=replica, cn=o=NetscapeRoot, cn=mapping tree, cn=config
changetype: modify
add: ds5BeginReplicaAcceptUpdates
ds5BeginReplicaAcceptUpdates: start

dn: cn=replica, cn="$SUFFIX1",cn=mapping tree,cn=config
changetype: modify
add: ds5BeginReplicaAcceptUpdates
ds5BeginReplicaAcceptUpdates: start

dn: cn=replica, cn="$SUFFIX2",cn=mapping tree,cn=config
changetype: modify
add: ds5BeginReplicaAcceptUpdates
ds5BeginReplicaAcceptUpdates: start

dn: cn=replica,cn="$SUFFIX3",cn=mapping tree,cn=config
changetype: modify
add: ds5BeginReplicaAcceptUpdates
ds5BeginReplicaAcceptUpdates: start
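
To check replication synchronicity before switching the suffixes back to read-write, one option is to compare the replica update vector (RUV) of each master; a sketch, reusing the $SUFFIX1 and $PASSWORD placeholders from above.

# cd /var/opt/mps/serverroot/shared/bin
# ./ldapsearch -h master1 -p 389 -D cn=root -w "$PASSWORD" -b "$SUFFIX1" "(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))" nsds50ruv
# ./ldapsearch -h master2 -p 389 -D cn=root -w "$PASSWORD" -b "$SUFFIX1" "(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))" nsds50ruv

When the maxCSN reported for each replica ID is the same on both masters (or differs only by changes still in flight), it is safe to apply the ds5BeginReplicaAcceptUpdates modifications shown above.
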
Comments:

The script looks fine, but what is the proper way to make online backups from DS 6? "dsadm export" and "dsadm backup" require the instance to be stopped.

Posted by Fabian on March 26, 2008 at 10:20 PM PDT #

The script contained in this blog relates to a DS5.2patch6 topology.

For DS6, dsadm is used for offline backups/restores; that is correct. But DS6 also comes with dsconf, which provides similar backup/restore capabilities while the server is online. Moreover, thanks to the brand new frozen mode feature, you can even back up a complete filesystem image while the server is up.
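
From memory, the DS6 equivalents look roughly like this; a sketch only, so check dsconf(1M) on your exact version, and note that the backup paths below are just examples.

# Online binary backup while the instance keeps running (dsconf prompts for the Directory Manager password)
dsconf backup -h localhost -p 389 /ds6/backups/20080401
# Online LDIF export of a single suffix
dsconf export -h localhost -p 389 dc=example,dc=com /ds6/backups/example.ldif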

Posted by Marcos on April 01, 2008 at 11:18 PM PDT #

Hi,

I have problem using dsmig, from 5.2 to 6.3.

Is there a way I can communicate with you?

Thanks,

Simon

Posted by Simon Lee on May 29, 2008 at 12:40 AM PDT #

This comment does not apply to this article. Nevertheless, I can be contacted through the regular Sun support channels. Please contact your Sun support representative for this.

Posted by Marcos on May 29, 2008 at 01:23 AM PDT #
