Side effect of bug 6310880 - replication loop (halt) with err=67
By marcos on Jun 24, 2007
DS bug #6310880 (modRDN of entry with multi-valued attr causes data inconsistency when replacing those attr) causes bogus state information to be stored inside the consumer replicas under very specific conditions (you can check the bug for further details on those conditions). This bug has been fixed both in the DS5.2patch5 official release and in several Sun Support hotfixes built on top of 5.2patch4. However, the fix only protects entries created or updated after it was applied; it does not repair the bogus state information inside entries that already existed in the database before the fix was put in place. This can lead to undesired results, among them potential err=67 replication halts.
In other words, the fix for 6310880 protects future data, but it does not protect existing data that was already wrongly constructed because of the bug.
Let me be more specific with a real-life example of one of these halts, as experienced at one of our customers' sites:
Recently, replication broke between the hubs and 6 of the 26 consumers of our customer's topology. All of the 26 consumers were supposedly protected by a fix for 6310880. All of the 6 broken consumers were looping with err=67 on the same exact replicated update (same CSN). Here are the details around the exact entry and the exact operation involved in the replication loop:
1- State information maintained inside the hub replicas (ldapsearch ... nscpentrywsi) for the involved entry:
2- Exact update being replicated from the hubs to the consumers (thank you audit log!):
3- State information maintained inside the 6 failing consumer replicas (ldapsearch ... nscpentrywsi) for the involved entry:
4- State information maintained inside the 20 working consumer replicas (ldapsearch ... nscpentrywsi) for the involved entry: Exactly the same as for the hubs (see point 1 above)
The update from the hubs to the 6 consumers can NOT proceed because the nscpentrywsi state information on the bad consumers is bogus: it does not contain the mdcsn element (mdcsn-4612e3a3002900030000) that the Update Resolution Procedures (URP) code requires to verify the integrity of the replicated state. As a result, the consumer fails to process the update and returns err=67 to the supplier.
So, if this information in the consumer was bogus, how did it get there in the first place?
The bogus information (lack of mdcsn state info) was caused by bug 6310880, and it surely reached the consumer for one of the following 2 reasons:
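To make the check concrete, here is a minimal Python sketch that scans nscpentrywsi output for values of a given attribute that carry no mdcsn element. The line layout it assumes (`attr;vucsn-...;mdcsn-...: value`, optionally prefixed with `nscpentrywsi: `) and the sample lines are illustrative assumptions, not output taken from the customer case, and the function name is hypothetical.

```python
from typing import Iterable, List


def values_missing_mdcsn(wsi_lines: Iterable[str], attr: str = "uid") -> List[str]:
    """Return the state lines for `attr` that carry no mdcsn-<csn> option.

    Assumption: each line looks like 'attr;opt1;opt2: value', optionally
    prefixed with 'nscpentrywsi: '. Check this against your own output.
    """
    prefix = "nscpentrywsi: "
    missing = []
    for line in wsi_lines:
        body = line[len(prefix):] if line.startswith(prefix) else line
        description, sep, _value = body.partition(":")
        if not sep:
            continue  # not an 'attribute: value' line
        name, *options = description.split(";")
        if name.lower() != attr.lower():
            continue  # some other attribute (or the dn line)
        if not any(opt.startswith("mdcsn-") for opt in options):
            missing.append(body)
    return missing


# Hypothetical sample lines, for illustration only.
sample = [
    "nscpentrywsi: dn: uid=jdoe,ou=people,dc=example,dc=com",
    "nscpentrywsi: uid;vucsn-45de33c6000700030000: jdoe",
]
for suspect in values_missing_mdcsn(sample):
    print("no mdcsn:", suspect)
```

A check like this only flags candidates; the hub's state for the same entry remains the reference to compare against.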
- either the consumer was not protected at the time when the entry was created/updated by a bug fix for 6310880
- or it was protected but was later restored from a backup or exported file that had been extracted from another replica which was not protected by a bug fix for 6310880 at the time when the entry was created/updated
In our real-life example above, our customer was able to confirm the first of the 2 reasons above as being the root cause. Indeed, for the entry analyzed above, the missing mdcsn 45de33c6000700030000 had been generated on 23/Feb/2007:01:22:30, whereas the 6 failing consumers had only been protected with a hotfix for 6310880 since 10/Apr/2007...
Whatever the case, and independently of the protection already in place for 6310880 inside the consumer: if a subset of the data in its DB still contains wrong mdcsn states on other entries, this event is likely to repeat as soon as a mod replace of the uid attribute happens on one of the entries containing the wrong mdcsn state. This could potentially happen next week, next year, next century...
So what can we do to prevent this problem from happening in the future?
Unfortunately, the only safe answer to this question is a complete clean re-init of all the replicas in the topology, to remove every bogus entry from the topology. After all, all the replicas are potentially affected by the bug. Even the masters (when acting as consumers of other masters) can be affected by the err=67 return on a bogus entry.
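The generation date of a CSN can be read from the CSN itself. The sketch below assumes the usual 20-hex-digit layout (8 digits of seconds since the Unix epoch, then 4 digits each for sequence number, replica ID and sub-sequence number); that layout is my assumption here, not something stated in the bug report. It is at least consistent with the date quoted above: the result is in UTC, one hour behind the local time shown in the logs.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Csn(NamedTuple):
    timestamp: datetime  # UTC time at which the change was generated
    seqnum: int
    replica_id: int
    subseqnum: int


def parse_csn(csn: str) -> Csn:
    """Split a 20-hex-digit CSN into its fields (assumed layout: 8+4+4+4)."""
    if len(csn) != 20:
        raise ValueError("expected 20 hex digits, got %d" % len(csn))
    seconds = int(csn[0:8], 16)
    return Csn(
        timestamp=datetime.fromtimestamp(seconds, tz=timezone.utc),
        seqnum=int(csn[8:12], 16),
        replica_id=int(csn[12:16], 16),
        subseqnum=int(csn[16:20], 16),
    )


csn = parse_csn("45de33c6000700030000")
print(csn.timestamp)   # 2007-02-23 00:22:30+00:00
print(csn.replica_id)  # 3
```

Comparing this timestamp with the date a hotfix was installed on a given replica is a quick way to tell whether an entry's state could predate the protection.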
A clean re-init means the following:
1- db2ldif (without option -r) from your authoritative replica (say, for example, master1)
2- ldif2db of the exported output in step 1 inside the SAME authoritative replica (master1 in our example)
3- db2ldif (with -r) from your authoritative replica (master1 in our example)
4- ldif2db of the exported output in step 3 inside all other replicas in the topology
Steps 1 and 2 ensure that we get rid of all the replication state information of the topology since its origin.
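As a rough sketch, the four steps can be written down as an ordered command plan. The Python below only builds and prints the commands, it does not run anything. The `-r` option comes from the procedure above; the `-n` (backend), `-a` (export file) and `-i` (import file) options, the backend name `userRoot` and the file paths are assumptions to verify against the documentation of your DS version.

```python
from typing import List


def clean_reinit_plan(backend: str, plain_ldif: str, repl_ldif: str) -> List[List[str]]:
    """Return the four commands of a clean re-init, in execution order."""
    return [
        # Step 1, on the authoritative replica: export WITHOUT replication state.
        ["db2ldif", "-n", backend, "-a", plain_ldif],
        # Step 2, on the SAME replica: re-import that export.
        ["ldif2db", "-n", backend, "-i", plain_ldif],
        # Step 3, on the authoritative replica: export WITH replication state (-r).
        ["db2ldif", "-r", "-n", backend, "-a", repl_ldif],
        # Step 4, on every other replica: import the step 3 export.
        ["ldif2db", "-n", backend, "-i", repl_ldif],
    ]


# Hypothetical backend name and file paths, for illustration only.
plan = clean_reinit_plan("userRoot", "/tmp/master1-plain.ldif", "/tmp/master1-repl.ldif")
for number, command in enumerate(plan, start=1):
    print("step %d: %s" % (number, " ".join(command)))
```

Keeping the two export files under distinct names helps avoid importing the step 1 file (no replication state) into the other replicas by mistake.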
I understand that such an action may need to be planned in advance, especially because write service is unavailable during steps 1 and 2, so until such a plan takes place the recommended workarounds would be to:
- avoid replace: uid operations for the time being
- if that is not possible, fix any future repl err=67 halt with replck as soon as it occurs. The good thing about replck is that it fixes the bogus state on the consumer entry immediately: it patches the consumer entry with the contents of the supplier entry, which have to be good, otherwise the replication towards that supplier would be broken
- finally, another possible compromise would be to protect all servers in the topology with a fix for the bug, then pick one of the replicas as the authoritative replica and use it to initialize all other replicas in the environment (i.e., all other masters/hubs/consumers). Whatever old bogus entries exist inside that authoritative replica will then also exist in the other masters/hubs/consumers. As a consequence, any MOD-replace-RDNattribute operation arriving in the topology for such an entry will be refused already at the master level, without making its way downstream. Masters will act as "firewalls"... The advantage of this procedure is that it does not halt write service. The downside is that MOD-replace-RDNattribute operations will no longer be possible inside the topology for the bogus entries, so it will be necessary to verify that the client applications do not depend on the ability to execute these operations