Long response times and spurious deadlock messages: checkpointing analysis required

1) Problem Description

A customer contacted us about "database error -30996" messages on their primary master since Wednesday, March 26th, 2008:

[26/Mar/2008:17:15:35 +0100] - ERROR<20775> - Backend Database - conn=9708 op=173478 msgId=173479 -  database error -30996

According to them, these messages started appearing a couple of weeks after the last re-initialization of their DBs (5 suffixes). Their internal DS users are complaining about response times as well.


Initial inspection of the provided data shows that the two problems (the "database error -30996" messages and the high response times for some searches) are not necessarily observed at exactly the same times. Moreover, data from that same master two days before the first "database error -30996" message already shows evident response time problems. This clearly indicates that the problem did not appear suddenly; we are facing a progressive problem. Put another way: if there is an underlying problem, it did not start suddenly on Wednesday, March 26th, 2008. The more time passes, the higher its incidence and visibility seem to be.

Another fact supporting this progressive misbehavior: while the customer initially complained only about this particular master instance, analysis of the data collected on one of their consumers also shows evidence of long-lasting operations, definitely not as long as on the primary master, but hard to accept anyway.

In other words, all systems seem to be handling operations sub-optimally; it is simply that the customer's acceptance threshold is high enough to cope with this sub-optimal behavior in most situations.

The following are examples of observed behavior that cannot be considered correct (despite the customer's ability to accept it):

A) Data extracted (via logconv.pl) from a consumer replica (c1):

Start of Log:  01/Apr/2008:14:50:11
End of Log:    01/Apr/2008:16:20:53

----- Top 20 Longest etimes -----

etime=35.757370    1
etime=34.587640 csn=47f23c12000000290000    1
etime=30.438510    1
etime=30.368480    1
etime=29.553650 csn=47f2377f000000290000    1
etime=29.237990 csn=47f24329000000290000    1
etime=23.175030    1
etime=20.303570 csn=47f23512000000290000    1
etime=19.344710    1
etime=19.204840 csn=47f240e2000100290000    1
etime=18.989240    1
etime=18.977770    1
etime=18.545800    1
etime=18.521340    1
etime=18.521180 csn=47f23c18000000290000    1
etime=18.519000 csn=47f24331000100290000    1
etime=17.968120    1
etime=17.885240    1
etime=17.883980 csn=47f23bf80000002c0000    1
etime=17.577830    1


=> Many modifications are taking too much time. Many searches as well, despite being indexed and retrieving a small number of entries (after a close look at the access log from which these long-lasting operations were extracted, rarely more than 100 entries are returned per search).

The top 3 modifications above clearly show that the long MODs concern GROUP ENTRY modifications. The first MOD above, csn=47f23c12000000290000, took so long that two replication attempts were needed to replicate it (a supplier replication session aborts after 30 seconds by default):
 
mares $ grep 47f23c12000000290000 access
[01/Apr/2008:15:41:05 +0200] conn=273450 op=0 msgId=4116 - RESULT err=0 tag=48 nentries=0 etime=34.587640 csn=47f23c12000000290000
[01/Apr/2008:15:41:17 +0200] conn=285358 op=0 msgId=179 - RESULT err=0 tag=48 nentries=0 etime=11.183230 csn=47f23c12000000290000

mares $ grep "conn=273450 op=0 msgId=4116 " access
[01/Apr/2008:15:40:30 +0200] conn=273450 op=0 msgId=4116 - MOD dn="cn=group1,o=customer"
[01/Apr/2008:15:41:05 +0200] conn=273450 op=0 msgId=4116 - RESULT err=0 tag=48 nentries=0 etime=34.587640 csn=47f23c12000000290000

mares $ grep "conn=285358 op=0 msgId=179 " access
[01/Apr/2008:15:41:06 +0200] conn=285358 op=0 msgId=179 - MOD dn="cn=group1,o=customer"
[01/Apr/2008:15:41:17 +0200] conn=285358 op=0 msgId=179 - RESULT err=0 tag=48 nentries=0 etime=11.183230 csn=47f23c12000000290000

mares $ grep conn=273450 access | tail
[01/Apr/2008:15:40:30 +0200] conn=273450 op=4113 msgId=4114 - RESULT err=0 tag=120 nentries=0 etime=0.002070
[01/Apr/2008:15:40:30 +0200] conn=273450 op=4114 msgId=4115 - EXT oid="1.3.6.1.4.1.42.2.27.9.6.4"
[01/Apr/2008:15:40:30 +0200] conn=273450 op=4114 msgId=4115 - RESULT err=0 tag=120 nentries=0 etime=0.000200
[01/Apr/2008:15:40:30 +0200] conn=273450 op=4115 msgId=4116 - EXT oid="1.3.6.1.4.1.42.2.27.9.6.8"
[01/Apr/2008:15:40:30 +0200] conn=273450 op=0 msgId=4116 - MOD dn="
cn=group1,o=customer"
[01/Apr/2008:15:40:45 +0200] conn=273450 op=4116 msgId=4117 - UNBIND
[01/Apr/2008:15:40:45 +0200] conn=273450 op=4116 msgId=-1 - closing - U1
[01/Apr/2008:15:41:05 +0200] conn=273450 op=0 msgId=4116 - RESULT err=0 tag=48 nentries=0 etime=34.587640 csn=47f23c12000000290000
[01/Apr/2008:15:41:06 +0200] conn=273450 op=4115 msgId=4116 - RESULT err=0 tag=120 nentries=0 etime=35.757370
[01/Apr/2008:15:41:06 +0200] conn=273450 op=-1 msgId=-1 - closed.

mares $ grep "conn=272109 op=0 msgId=1978 " access
[01/Apr/2008:15:21:01 +0200] conn=272109 op=0 msgId=1978 - MOD dn="cn=group2,o=customer"
[01/Apr/2008:15:21:30 +0200] conn=272109 op=0 msgId=1978 - RESULT err=0 tag=48 nentries=0 etime=29.553650 csn=47f2377f000000290000

mares $ grep "conn=285358 op=0 msgId=4101 " access
[01/Apr/2008:16:10:49 +0200] conn=285358 op=0 msgId=4101 - MOD dn="cn=group3,o=customer"
[01/Apr/2008:16:11:18 +0200] conn=285358 op=0 msgId=4101 - RESULT err=0 tag=48 nentries=0 etime=29.237990 csn=47f24329000000290000




B) Data extracted from the primary master for a period of time prior to their problem report:

Start of Log:  25/Mar/2008:15:05:55
End of Log:    25/Mar/2008:16:50:21

----- Top 20 Longest etimes -----

etime=29.302850    1
etime=26.852590    1
etime=23.413390    1
etime=19.315180    1
etime=18.433210    1
etime=18.089670    1
etime=16.530610    1
etime=13.679230    1
etime=12.851240    1
etime=12.199620    1
etime=12.169680 csn=47e90b17000000290000    1
etime=12.042780    1
etime=11.878220 csn=47e90b15002800290000    1
etime=11.875240 csn=47e90a4b0000002b0000    1
etime=11.688760    1
etime=11.048990 csn=47e90a4c0000002b0000    1
etime=11.029940    1
etime=10.125640 csn=47e90a570000002b0000    1
etime=9.956060    1
etime=9.950260    1
etime=9.817410    1
etime=9.785880    1
etime=9.328110    1
etime=9.309880    1
etime=9.175540    1


Again, many searches are taking too much time, despite being indexed and returning few entries. Many modifications as well. It is important to note that none of these modifications is group-entry related.


2) Analysis

Analysis of the data collected on Wednesday, March 26th on the primary master shows that response times are definitely much worse than those observed before that date and on the other replicas, which explains why this particular system's behavior became visible and unacceptable.

Start of Log:  26/Mar/2008:15:59:00
End of Log:    26/Mar/2008:18:04:17

----- Top 50 Longest etimes -----

etime=621.012380 notes=U,F    1
etime=512       1
etime=501.717580 notes=U,F    1
etime=494       1
etime=460       1
etime=288.370250    1
etime=269.209370    1
etime=265.228540    1
etime=256.852960    1
etime=235.999590    1
etime=232.924120 notes=F    1
etime=231.772890    1
etime=230.150610    1
etime=227.940970    1
etime=226.256950 notes=F    1
etime=226.149760 notes=F    1
etime=225.666500    1
etime=223.651170 notes=F    1
etime=221.149540 notes=F    1
etime=220.817520 notes=F    1
etime=220.554860 notes=F    1
etime=220.404400 notes=F    1
etime=219.228350 notes=F    1
etime=216.946460    1


==> Note that all of the top-20 etimes in this log period (except #1 and #3) correspond to indexed searches; not a single modification appears. Or... are we sure there isn't any modification taking place?

When analyzing the pstacks obtained during this period, we find a huge, abnormal amount of checkpointing activity taking place. In the example below, we can easily observe that one single checkpoint lasts around 4 minutes.

pstack.080326:172525: NO CHECKPOINTING TAKING PLACE

pstack.080326:172606: 0000000100071eb0 memp_sync (100d49400, 0, ffffffff6e61fce8, fffffffff1016e88, ffffffff6bffff20, 1003100e0) + 408
pstack.080326:172642: 0000000100071eb0 memp_sync (100d49400, 0, ffffffff6e61fce8, fffffffff1016e88, ffffffff6bffff20, 1003100e0) + 408
pstack.080326:172718: 0000000100071eb0 memp_sync (100d49400, 0, ffffffff6e61fce8, fffffffff1016e88, ffffffff6bffff20, 1003100e0) + 408
pstack.080326:172759: 0000000100071eb0 memp_sync (100d49400, 0, ffffffff6e61fce8, fffffffff1016e88, ffffffff6bffff20, 1003100e0) + 408
pstack.080326:172835: 0000000100071eb0 memp_sync (100d49400, 0, ffffffff6e61fce8, fffffffff1016e88, ffffffff6bffff20, 1003100e0) + 408

pstack.080326:172913: NO CHECKPOINTING TAKING PLACE

A checkpoint lasting almost 4 minutes blocks all threads in the DS process, all of them waiting on a DB lock that cannot be acquired ==> the blocked searches are hit by the checkpoint's enormous delay, causing client applications to complain about high response times:

[26/Mar/2008:17:29:44 +0100] conn=38954 op=238 msgId=239 - RESULT err=0 tag=101 nentries=2 etime=220.554860 notes=F
[26/Mar/2008:17:29:44 +0100] conn=38154 op=3189 msgId=3190 - RESULT err=0 tag=101 nentries=2 etime=220.404400 notes=F
[26/Mar/2008:17:29:45 +0100] conn=38299 op=2317 msgId=2318 - RESULT err=0 tag=101 nentries=2 etime=220.817520 notes=F
[26/Mar/2008:17:29:45 +0100] conn=38254 op=3084 msgId=3085 - RESULT err=0 tag=101 nentries=2 etime=221.149540 notes=F
[26/Mar/2008:17:29:47 +0100] conn=9997 op=25893 msgId=25894 - RESULT err=0 tag=101 nentries=2 etime=223.651170 notes=F
[26/Mar/2008:17:29:47 +0100] conn=39071 op=1 msgId=2 - RESULT err=0 tag=101 nentries=6 etime=227.940970
[26/Mar/2008:17:29:49 +0100] conn=39045 op=19 msgId=64 - RESULT err=0 tag=101 nentries=0 etime=225.666500
[26/Mar/2008:17:30:56 +0100] conn=14666 op=112401 msgId=112402 - RESULT err=0 tag=101 nentries=1 etime=288.370250

=> All of the searches above appear among the top-20 longest etimes listed earlier.

Why could this checkpoint be so big and so long? Which modifications could be causing such a checkpointing impact when no modification is listed among the top-20 longest etimes?

What if the modifications involved in this checkpointing were not visible in the access log? That would indeed explain such checkpointing activity. So what can generate internal modifications capable of massively affecting checkpointing? A very good candidate is the referential integrity plugin.

When analyzing the primary master's Referential Integrity configuration, we indeed see a number of interesting facts:

  • the referential integrity plugin is configured for 6 attributes: owner, seeAlso, uniqueMember, parent, domain, jobmember
  • three of these attributes are not indexed in a 6th suffix, o=counter. Fortunately, that suffix only contains 2 entries, so this point is not that important (this said, it would not hurt to index them anyway to close any potential future rathole)
  • when running dbscan on all of the attributes configured on the 5 important suffixes using the following script:
#!/bin/ksh
BASEDIR=/users/nds52/slapd-instance/db

for j in domain jobmember owner parent seeAlso uniquemember
do
  for i in `ls $BASEDIR/*/*_$j.db3`
  do
    echo "dbscan -n -i -f $i | grep allids"
    dbscan -n -i -f $i | grep allids
    echo ""
    echo "dbscan -G 20 -n -i -f $i | grep '^='"
    dbscan -G 20 -n -i -f $i | grep "^="
    echo ""
  done
done
we observe that there are 199645 attribute values that are present in more than 20 entries inside the databases. What this basically means is that there are 199645 entries across all the databases that, when deleted, will each generate more than 20 referential integrity internal modifications.

This is enormous! It also explains why the longest MOD operations observed on the consumers are all GROUP ENTRY modifications while such modifications are not visible in the primary master access log: the referential integrity plugin creates those modifications internally (so they are not logged in the access log), writes them into the local changelog, and then sends them to the consumer replicas via replication. That also explains why the access log MOD/DEL ratio is much bigger on the consumer systems than on the primary master.

Let's prove this with an example. The referential integrity plugin is configured with a 3-minute delay (nsslapd-pluginarg0: 180), which means that the referential integrity MODs generated after a DEL operation will be executed 3 minutes after the DEL has taken place. This can be perfectly observed on the consumer replicas, as all these operations (the DEL and the subsequent MODs) are replicated and logged in the access log. The following example is extracted from the consumer replica already analyzed above, c1:

  • 2 DEL operations are replicated:
[01/Apr/2008:15:01:05 +0200] conn=249223 op=0 msgId=4255 - DEL dn="uid=d015897,o=customer"
[01/Apr/2008:15:01:46 +0200] conn=249223 op=0 msgId=4283 - DEL dn="uid=d015898,o=customer"
  • 3 minutes later, the MOD operations removing the group memberships of the 2 entries deleted above start:
[01/Apr/2008:15:04:34 +0200] conn=249197 op=0 msgId=16618 - MOD dn="cn=role1,o=customer"
[01/Apr/2008:15:04:34 +0200] conn=249197 op=0 msgId=16624 - MOD dn="cn=role2,o=customer"
[01/Apr/2008:15:04:37 +0200] conn=249197 op=0 msgId=16648 - MOD dn="cn=role3,o=customer"
[01/Apr/2008:15:04:39 +0200] conn=249197 op=0 msgId=16662 - MOD dn="cn=role4,o=customer"
[01/Apr/2008:15:04:45 +0200] conn=249197 op=0 msgId=16683 - MOD dn="cn=role5,o=customer"
[01/Apr/2008:15:04:48 +0200] conn=249197 op=0 msgId=16690 - MOD dn="cn=role6,o=customer"
[01/Apr/2008:15:04:48 +0200] conn=249197 op=0 msgId=16691 - MOD dn="cn=role7,o=customer"

[01/Apr/2008:15:04:49 +0200] conn=249197 op=0 msgId=16701 - MOD dn="cn=role1,o=customer"
[01/Apr/2008:15:04:49 +0200] conn=249197 op=0 msgId=16703 - MOD dn="cn=role2,o=customer"
[01/Apr/2008:15:04:51 +0200] conn=249197 op=0 msgId=16712 - MOD dn="cn=role3,o=customer"
[01/Apr/2008:15:04:51 +0200] conn=249197 op=0 msgId=16718 - MOD dn="cn=role4,o=customer"
[01/Apr/2008:15:04:58 +0200] conn=249197 op=0 msgId=16733 - MOD dn="cn=role5,o=customer"
[01/Apr/2008:15:05:04 +0200] conn=249197 op=0 msgId=16740 - MOD dn="cn=role6,o=customer"
[01/Apr/2008:15:05:04 +0200] conn=249197 op=0 msgId=16741 - MOD dn="cn=role7,o=customer"


Note that while the DEL operations are instantaneous (executed in 0 seconds each), the group of MOD operations that has to be executed for each of these DELs takes around 15 seconds. And this is only one example.
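
For reference, the plugin delay and enablement can be double-checked directly against the primary master configuration. The following is only a sketch: the plugin entry name below is the usual default one, and the host, port and bind credentials are placeholders to adapt to the real environment.

ldapsearch -h primary-master -p 389 -D "cn=Directory Manager" -w - \
  -s base -b "cn=referential integrity postoperation,cn=plugins,cn=config" \
  "(objectclass=*)" nsslapd-pluginenabled nsslapd-pluginarg0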

The incidence of this behaviour on the primary master will be at least as big as the one observed on the consumer system; we simply cannot track it through the log files. Nevertheless, one can easily imagine that if a series of DEL operations arrives, each of them requiring 15 seconds of MOD activity shortly afterwards, there is huge potential for a massive, sudden burst of database write access on the primary master DBs. If the primary master is also serving complicated group-based SRCHes at those times (as is usually the case), the chances of a complete DB access collapse causing temporary deadlocks or high response times are pretty high.

To prove this fact (or possibility, for the non-believers), let's check whether the primary master received a burst of costly DEL operations roughly 3 minutes or more before the memp_sync checkpointing issue described above was observed:

pstack.080326:172606: 0000000100071eb0 memp_sync (100d49400, 0, ffffffff6e61fce8, fffffffff1016e88, ffffffff6bffff20, 1003100e0) + 408


And, not surprisingly, the following burst of DEL operations happened around 3 minutes (or more) before that checkpoint:

[26/Mar/2008:17:22:06 +0100] conn=38974 op=4 msgId=6 - DEL dn="cn=adm1,o=customer"
[26/Mar/2008:17:22:06 +0100] conn=38974 op=7 msgId=9 - DEL dn="cn=adm2,o=customer"
[26/Mar/2008:17:22:07 +0100] conn=38974 op=10 msgId=12 - DEL dn="cn=adm3,o=customer"
[26/Mar/2008:17:22:10 +0100] conn=38974 op=13 msgId=15 - DEL dn="cn=adm4,o=customer"
[26/Mar/2008:17:22:12 +0100] conn=38974 op=18 msgId=22 - DEL dn="cn=adm5,o=customer"
[26/Mar/2008:17:22:55 +0100] conn=38974 op=21 msgId=25 - DEL dn="cn=adm6,o=customer"
[26/Mar/2008:17:22:56 +0100] conn=38974 op=24 msgId=28 - DEL dn="cn=adm7,o=customer"
[26/Mar/2008:17:22:56 +0100] conn=38974 op=27 msgId=31 - DEL dn="cn=adm8,o=customer"
[26/Mar/2008:17:22:56 +0100] conn=38974 op=30 msgId=34 - DEL dn="cn=adm9,o=customer"
[26/Mar/2008:17:22:56 +0100] conn=38974 op=33 msgId=37 - DEL dn="cn=adm10,o=customer"
[26/Mar/2008:17:22:56 +0100] conn=38974 op=38 msgId=44 - DEL dn="cn=adm11,o=customer"
[26/Mar/2008:17:26:02 +0100] conn=9511 op=39688 msgId=39689 - DEL dn="uid=D019070,o=customer"

Here is the evidence we were looking for! Analysis of other long checkpointing periods, like the following one:

2603_1955_ERR_SLAPD/pstack.080326:200122: 0000000100071eb0 memp_sync (100d49400, 0, ffffffff6e61fce8, ffffffff6bab24e8, ffffffff6bffff20, 1003100e0) + 408
2603_2115_ERR_SLAPD/pstack.080326:211808: 0000000100071eb0 memp_sync (100d49400, 0, ffffffff6e61fce8, ffffffff6bab24e8, ffffffff6bffff20, 1003100e0) + 408
2603_2115_ERR_SLAPD/pstack.080326:211846: 0000000100071eb0 memp_sync (100d49400, 0, ffffffff6e61fce8, ffffffff6bab24e8, ffffffff6bffff20, 1003100e0) + 408


also leads to similar conclusions. In conclusion, we can comfortably affirm that every burst of DEL operations is a potential root cause of a checkpointing bottleneck. Whether such a burst becomes a real bottleneck or remains just a threat depends on the exact entries being deleted: if the entries are such that they cause a huge number of internal MOD operations, then they will indeed be a threat; otherwise, they will not cause any pain. As an example, let's consider the following:

Between March 25th and March 30th, 127975 DEL operations were executed on the primary master. 95433 of them happened within the same suffix, cmdo=calendra. However, not a single entry under that suffix contains any of the 6 attributes managed by the referential integrity plugin; as a consequence, none of those 95433 DEL operations was an enemy of checkpointing. The remaining 32542 DEL operations were potential killers, because they belong to suffixes that are massively populated with values for those 6 attributes. But the impact of each particular DEL ultimately depends on the entry being deleted, the number of groups it was subscribed to and the size of those groups.
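
For the record, these per-suffix DEL counts can be approximated directly from the access logs with a couple of greps. This is only a rough sketch: the log file names are examples, and DNs wrapped across two log lines (as seen in the excerpts above) may need to be rejoined first for the counts to be exact.

# total DEL operations in the analyzed period
cat access* | grep -c " - DEL dn="
# DELs under cmdo=calendra (harmless from a referential integrity point of view)
cat access* | grep " - DEL dn=" | grep -c "cmdo=calendra"
# DELs under the remaining suffixes (the potentially costly ones)
cat access* | grep " - DEL dn=" | grep -vc "cmdo=calendra"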

3) Resolution

The resolution comes in two parts: the first is spontaneous, coming from the customer themselves; the second is based on the analysis above.

1) Let's start with the first part:
On March 31st, the customer informed us they had just realized that on March 26th they had introduced an application error: two groups referring to each other both ways. This double reference was responsible for doubling the number of SRCH operations per day since then, generating in particular an impressive number of SRCH operations of this kind:

SRCH base="ou=appgroups,o=customer" scope=2 filter="(&(objectClass=dominomodel)(|(seeAlso=cn=somegroup,o=customer)
(seeAlso=cn=sothergroup,
o=customer)))" attrs="cn"
 
This traffic increase surely contributed to making the high response times and the "database error -30996" messages even more evident than before.

Fixing this double cross-reference between the groups should bring the SRCH load back to its previous level.

2) And now the second part:
Based on the analysis above, there is a clear need to reduce the checkpointing times on this primary master. Reducing them will, as a first stage, lay the groundwork for improving the overall read/write behaviour of the instance. It may not solve every issue, since a large share of the operations on this instance are very costly by definition, and there may be room later for further tuning of each individual costly operation type. But until we fix the sub-optimal effect that the current checkpointing mechanism has on the overall DS behavior, we will not be able to achieve any further individual optimization.

Let's illustrate with an example what we will be pursuing by removing the current checkpointing inefficiencies. We saw in the analysis section above that inefficiencies were observed everywhere, even on the consumer side, and that one specific MOD operation took more than 30 seconds to be processed by a consumer replica:

[01/Apr/2008:15:40:30 +0200] conn=273450 op=0 msgId=4116 - MOD dn="cn=group1,o=customer"
[01/Apr/2008:15:41:05 +0200] conn=273450 op=0 msgId=4116 - RESULT err=0 tag=48 nentries=0 etime=34.587640 csn=47f23c12000000290000


Is such a long time due to a long checkpoint, or simply due to the nature of the MOD operation itself (its size, the size of the multi-valued attributes being modified, etc.)? With the data currently in our hands we can NOT answer this question, but we can make a good first guess: if we look at all the other MOD operations on the same exact entry in the same exact consumer instance across the whole log period analyzed, we easily notice that every other MOD on this entry takes around 3 seconds to complete:

mares $ grep "conn=285358 op=0 msgId=409 " access
[01/Apr/2008:15:42:12 +0200] conn=285358 op=0 msgId=409 - MOD dn="
cn=group1,o=customer"
[01/Apr/2008:15:42:15 +0200] conn=285358 op=0 msgId=409 - RESULT err=0 tag=48 nentries=0 etime=2.746930 csn=47f23c77000000290000
mares $ grep " conn=294355 op=0 msgId=576 " access
[01/Apr/2008:16:07:33 +0200] conn=294355 op=0 msgId=576 - MOD dn="
cn=group1,o=customer"
[01/Apr/2008:16:07:37 +0200] conn=294355 op=0 msgId=576 - RESULT err=0 tag=48 nentries=0 etime=3.319820 csn=47f2426e000000290000
mares $ \^C
mares $ grep " conn=294355 op=0 msgId=1300 " access
[01/Apr/2008:16:13:22 +0200] conn=294355 op=0 msgId=1300 - MOD dn="
cn=group1,o=customer"
[01/Apr/2008:16:13:25 +0200] conn=294355 op=0 msgId=1300 - RESULT err=0 tag=48 nentries=0 etime=2.059830 csn=47f243ca000000290000
mares $ \^C
mares $ grep "conn=297933 op=0 msgId=1804 " access
[01/Apr/2008:16:18:59 +0200] conn=297933 op=0 msgId=1804 - MOD dn="
cn=group1,o=customer"
[01/Apr/2008:16:19:02 +0200] conn=297933 op=0 msgId=1804 - RESULT err=0 tag=48 nentries=0 etime=3.039680 csn=47f24519000000290000


Considering that all MODs of this kind are generated by the referential integrity plugin, there is a high probability that the content of all these MOD operations is very similar. If that is confirmed, then such a difference in times (34 versus 3 seconds) can only be explained by something that does not depend on the nature of the MOD operation itself, something external that cannot be controlled, like checkpointing.

This reasoning further justifies the need to improve the checkpointing mechanism immediately on the primary master and also, ideally, throughout the topology. Such an improvement will remove all the noise from the current deployment and allow other, finer-grained improvements to be applied. For example: can a group entry modification like the one highlighted above be improved further, reducing its etime from 3 seconds to 0 or 1 second? Can we remove deadlocks? Can we improve response times even further? Can we verify the exact time required by the referential integrity plugin to finalize a particular deletion for a particular user or admin entry? Answering such questions with checkpointing repaired will be much easier (and, unfortunately, the answer may not always be positive).

So how can we reduce the impact of the current checkpointing on the instances' behavior?


There are several recommendations to consider in order to achieve optimal checkpointing. All of them are important, but each one affects or governs a different aspect of the checkpointing mechanism. A wise combination of them will ultimately yield the BEST checkpointing behavior, but for some of them in-house testing will be almost a must (for example, to choose between advice #2 and advice #3); you will understand why after reading the rest of this section.

One last consideration before going through each individual piece of advice: before writing them, it was necessary to analyze the current I/O and memory subsystems of the primary master machine to identify any device bottleneck that could explain such checkpointing underperformance. That analysis was performed by inspecting a good collection of iostat, vmstat and memory footprint data. The conclusions are clear: no device bottleneck was identified, the disk I/O and memory reports are clean, and the machine is architected to cope with more I/O and memory capacity than is currently being used. For the record, here are some of the data sets collected:

  • typical iostat for c0t60060480000290100720533030464444d0 (Device where the DBs are stored):
Maximum usage (%b) rarely hits 75%, there are almost no waits, and reads and writes average 8KB (which corresponds to the currently configured DS DB page size), but there is clearly room for more if required. Reads and writes are executed in bursts, most likely driven by checkpointing.

                  extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.1  169.3    0.0    1.6  0.0  0.1    0.0    0.4   0   5 c0t60060480000290100720533030464444d0
    0.0  141.2    0.0    1.2  0.0  0.1    0.0    0.4   0   4 c0t60060480000290100720533030464444d0
   56.6   39.4    0.5    0.4  0.0  0.4    0.0    3.9   0  32 c0t60060480000290100720533030464444d0
   24.6  162.8    0.2    1.7  0.0  0.3    0.0    1.5   0  26 c0t60060480000290100720533030464444d0
    0.0  153.2    0.0    1.3  0.0  0.1    0.0    0.4   0   5 c0t60060480000290100720533030464444d0
   52.9   64.6    0.4    0.5  0.0  0.4    0.0    3.3   0  36 c0t60060480000290100720533030464444d0
  121.2   21.6    1.0    0.2  0.0  0.8    0.0    5.5   0  77 c0t60060480000290100720533030464444d0
    0.1   21.7    0.0    0.2  0.0  0.0    0.0    0.5   0   1 c0t60060480000290100720533030464444d0
    0.0    2.6    0.0    0.0  0.0  0.0    0.0    0.8   0   0 c0t60060480000290100720533030464444d0
   45.6    1.7    0.4    0.0  0.0  0.3    0.0    6.6   0  31 c0t60060480000290100720533030464444d0
   17.2    1.3    0.1    0.0  0.0  0.1    0.0    7.8   0  14 c0t60060480000290100720533030464444d0
    0.0    2.4    0.0    0.0  0.0  0.0    0.0    0.6   0   0 c0t60060480000290100720533030464444d0
    0.0    1.8    0.0    0.0  0.0  0.0    0.0    0.7   0   0 c0t60060480000290100720533030464444d0
   32.3    1.7    0.3    0.0  0.0  0.2    0.0    6.3   0  21 c0t60060480000290100720533030464444d0
  131.3    1.7    1.0    0.0  0.0  0.8    0.0    5.7   0  75 c0t60060480000290100720533030464444d0
    0.2    2.3    0.0    0.0  0.0  0.0    0.0    1.4   0   0 c0t60060480000290100720533030464444d0
    0.0    1.7    0.0    0.0  0.0  0.0    0.0    0.8   0   0 c0t60060480000290100720533030464444d0
   98.1    1.7    0.8    0.0  0.0  0.6    0.0    6.2   0  62 c0t60060480000290100720533030464444d0
   89.4    3.4    0.7    0.0  0.0  0.5    0.0    5.2   0  48 c0t60060480000290100720533030464444d0
    0.0    3.2    0.0    0.0  0.0  0.0    0.0    0.5   0   0 c0t60060480000290100720533030464444d0
    0.0    3.6    0.0    0.0  0.0  0.0    0.0    0.6   0   0 c0t60060480000290100720533030464444d0
  327.5   62.5    2.6    0.7  0.0  0.8    0.0    2.0   0  75 c0t60060480000290100720533030464444d0
    0.2  128.4    0.0    1.4  0.0  0.1    0.0    0.5   0   5 c0t60060480000290100720533030464444d0
    0.0   33.6    0.0    0.3  0.0  0.0    0.0    0.4   0   1 c0t60060480000290100720533030464444d0


  • typical vmstat on the primary master
No processes blocked, tons of free memory, the page scanner rarely runs, and CPU usage is fairly high but consistent with the circumstances: it matches the type of SRCH requests received by the DS and the fact that most of the DB contents can be mmapped into memory without constant page swapping:
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 31065408 287800 27 187 2 0  0  0  0  0  0  0  0  428  877  563 10  6 84
 0 0 0 31065592 287792 15 31 1  0  0  0  0  0  0  0  0  406  583  558  8  6 86
 0 0 0 31119088 342776 61 252 771 0 0 0  0  0  0  0  0  720  743  842  1  7 92
 0 0 0 31103728 322712 134 507 750 0 0 0 0  0  0  0  0  719 1355  843  2 15 83
 0 0 0 31065536 288544 32 89 12 0  0  0  0  0  0  0  0  404  679  570 12  3 85
 0 0 0 31065536 291912 152 1356 105 2 2 0 0 0  0  0  0  466 3124  732  6 12 82
 1 0 0 31133576 347608 278 1012 2681 0 0 0 0 0 0  0  0 2213 24221 1761 5 11 84
 0 0 0 31087552 279744 13 63 4 170 173 0 45 0  0  0  0 1120 112578 729 7 17 76
 0 0 0 31071728 267432 6 25 10  0  0  0  0  0  0  0  0  504  546  546 13  2 85
 1 0 0 31087064 286864 22 176 17 3 3  0  0  0  0  0  0 1483 275079 1054 8 24 68
 1 0 0 31138032 337968 39 922 62 0 0  0  0  0  0  0  0 2289 410655 1946 24 34 43
 0 0 0 31065776 274600 527 4708 118 21 23 0 6 0 0 0  0  631 7167 1037  9 17 74
 0 0 0 31060632 269608 181 1405 110 5 5 0 0 0  0  0  0 1200 4046 1080 17 18 65
 0 0 0 31062040 276952 550 6754 58 6 6 0 0  0  0  0  0 1408 11403 2071 9 25 66
 0 0 0 31062936 283320 88 646 58 0 0  0  0  0  0  0  0  510 2699  850  4 14 82
 0 0 0 31062480 267152 247 918 734 0 0 0 0  0  0  0  0  890 2495  929  2 16 82
 0 0 0 31086024 290040 106 809 92 0 0 0  0  0  0  0  0  786 26881 1027 5 17 77
 0 0 0 31138208 336936 164 934 178 0 0 0 0  0  0  0  0  782 2393 1099  3 17 80
 0 0 0 31076792 288912 108 643 14 0 0 0  0  0  0  0  0  674 2075  687  2 17 81

With all of the above in mind, here are the recommendations the customer can consider in order to modify the internal behavior of the DS checkpointing mechanism:

ADVICE #1: reducing nsslapd-db-checkpoint-interval (currently 60)

Reason: DB checkpointing currently happens every 60 seconds. When a checkpoint takes place, all of the pages in the DB cache must be scanned to determine whether they need to be flushed to disk. The higher the checkpointing interval, the bigger the amount of pages to be flushed.
Recommendation: Reducing the nsslapd-db-checkpoint-interval parameter to 15 will allow for shorter checkpoints, as the number of pages to be flushed to disk will be reduced to roughly a quarter of the previous level. This will allow for more predictable behaviour. Overall, the big difference in operation length that exists today, depending on whether an operation arrives during a long checkpoint or while no checkpoint is running, will be reduced: short operations may not be quite as short on average, and long operations will not be as long. We would be substituting one long lock taken every 60 seconds with 4 shorter locks taken every 15 seconds. Considering the customer's currently rather high acceptance threshold for response times, I strongly believe this change can only do good, because they seem to cope much better with 100 good operations each taking 0.2 seconds more (etime increase from 0.3 to 0.5) than with one single operation taking 20 seconds. But ultimate proof of the real improvement can only be delivered through in-house tests.
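
A minimal sketch of the change, assuming the checkpoint interval lives on the usual ldbm database configuration entry and that standard ldapmodify syntax is used (host, port and credentials are placeholders; depending on the DS version, a restart of the instance may be needed for the new value to take effect):

ldapmodify -h primary-master -p 389 -D "cn=Directory Manager" -w - <<EOF
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-db-checkpoint-interval
nsslapd-db-checkpoint-interval: 15
EOF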

ADVICE #2: setting the whole contents of the db cache in memory

IMPORTANT NOTE: This advice is NOT COMPATIBLE with ADVICE #3 below.

The current DB cache settings are not optimal. The customer's DB cache backing files are currently stored under the following directory:

/users/nds52/slapd-instance/db


This directory belongs to the same FS and sits on the same physical device (c0t60060480000290100720533030464444d0) as all of the 6 suffixes defined in the instance, which does not help I/O performance. Fortunately, considering both:

- the current MEDIUM size of all the customer DBs (6.4 GB of DB files in total: 4.9 GB for the DB files themselves and 1.5 GB for the index files)
- and the huge amount of memory in the system (16 GB of physical memory plus 45 GB of swap, 30 GB of which are still free while the DS is running),

there is an interesting option to consider: storing the DB cache backing files entirely on a swap-backed filesystem like /tmp. We could, for example, keep the currently configured 2.5 GB DB cache size (which should be good enough to hold all of the index files in the DBs) but store it under /tmp instead (this can be achieved by stopping the DS, editing the DS configuration to set nsslapd-db-home-directory: /tmp/$some-directory-of-your-wish, and restarting the DS). By moving the database cache backing files to a tmpfs file system, they are actually stored in memory, so updating the backing files is much less expensive than it would be on even the fastest disk subsystem.
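
A sketch of that procedure, assuming the usual instance layout (stop/start scripts and config/dse.ldif under the instance directory); the tmpfs directory name is just an example:

cd /users/nds52/slapd-instance
./stop-slapd
mkdir -p /tmp/slapd-instance-dbhome
# edit config/dse.ldif and, under "dn: cn=config,cn=ldbm database,cn=plugins,cn=config",
# add or modify the attribute:
#   nsslapd-db-home-directory: /tmp/slapd-instance-dbhome
./start-slapd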

Note: It is a common misconception that putting the backing files into /tmp will consume even more memory on the system. That is not true. Because the files are mmapped, they will only consume that amount of memory once, no matter how many things are mapping them. Given that the tmpfs file system will already be mapping them, there won't be any additional memory consumption when the directory server process maps them as well.

Therefore, moving the files won't consume any additional memory and will remove what could be a significant bottleneck to the overall performance of the server.

The most important drawback / undesired effect of this setting is the instance start-up time after a machine reboot, as the whole DB cache will need to be recreated from scratch.

ADVICE #3: reducing the db cache to the minimum and trust on the File System cache instead

IMPORTANT NOTE: This advice is NOT COMPATIBLE with ADVICE #2 above

As we saw in the previous advice, the current DB cache settings are not optimal. This advice tries to fix that underperforming setup with a radical approach: instead of improving the DB cache by changing its storage medium, remove the DB cache from the equation almost entirely and rely on the File System cache instead.

Indeed, this advice, which may seem weird at first sight, is actually the most common setting among customers like big telcos who run huge DBs that cannot be loaded completely into memory. Tests have proven that under such conditions the SleepyCat DB cache underperforms with respect to the File System cache provided by the most recent Solaris OS versions.

Reducing the DB cache to its minimum setting (10 MB, I believe; even though a cache that small will rarely be of much use, placing it on a swap FS as recommended in advice #2 will of course still be a plus) will make checkpointing much faster, as the number of DB pages to scan becomes much lower. And with such a small DB cache, this strategy does not suffer from the start-up-time-after-reboot drawback described in advice #2.

The only drawback I can think of for this advice is that, by delegating to the OS the job of loading index files into memory and having them ready when the application needs them, that ability moves out of our application environment's hands. The good news is that this ability is effectively guaranteed by the amount of memory currently sitting unused in the system, so there is very little to worry about regarding the FS cache's capacity to do the job. Personally, I believe this is a very wise strategy, as it completely removes the DB cache's dependence on DB size growth: the FS cache will always be a best-effort, dynamic solution, and the most optimal one for whatever DB size we have to cope with.
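
A sketch of what the minimal setting could look like, assuming the DB cache size is governed by the nsslapd-dbcachesize attribute on the same ldbm configuration entry used in advice #1 (10 MB expressed in bytes; host, port and credentials are placeholders, and a restart is most likely required):

ldapmodify -h primary-master -p 389 -D "cn=Directory Manager" -w - <<EOF
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-dbcachesize
nsslapd-dbcachesize: 10485760
EOF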


ADVICE #4: Increasing the DB page size from 8KB to XKB

We have seen throughout this analysis that the customer's application architecture relies on massive usage of group entries. This makes me guess that the average size of the most common entries in this DS deployment is bigger than 2KB. I may be wrong on this assumption, so this is something the customer will need to test and verify themselves. Whatever the case, the principle is the following: take all of the entries in the DB and check their sizes, discard the 5% biggest entries as well as the 10% smallest ones, and establish an average size for the remaining 85%. If that average size is higher than 2KB, then increasing the DB page size may be a wise thing to do. The ideal DB page size should be such that an average entry fits in 25% of the DB page size; in other words, if the average entry size is between 2KB and 4KB, a good DB page size guess is 16KB; if it is between 4KB and 8KB, a good guess is 32KB; beyond that, a good DB page size is 64KB. Whatever the case, do not configure a DB page size beyond 64KB; I have never seen a deployment with a bigger setting, so we would be touching the twilight zone.
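
A rough sketch of that size estimation, working on an LDIF export of the DBs (in LDIF, entries are separated by blank lines, so their character counts give a reasonable size approximation; the export path is just an example):

#!/bin/ksh
LDIF=/tmp/export.ldif
# entry sizes in bytes, one per line, sorted ascending
nawk 'BEGIN { RS=""; FS="\n" } { print length($0) }' $LDIF | sort -n > /tmp/entry-sizes
# drop the 10% smallest and the 5% biggest entries, then average the remaining 85%
nawk 'NR==FNR { n++; next }
      FNR > int(n*0.10) && FNR <= int(n*0.95) { sum += $1; cnt++ }
      END { printf("trimmed average entry size: %d bytes\n", sum/cnt) }' /tmp/entry-sizes /tmp/entry-sizes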

The biggest advantage of tuning the DB page size is that it directly reduces overflow pages and hence index fragmentation. It requires a bigger memory footprint (for the DB cache, in case advice #2 is implemented/preferred over advice #3), but the performance gains are pretty awesome, as the number of required DB locks, and therefore the overall contention, can be massively reduced. We have observed that following this single advice was enough for another of our customers with a similar profile to completely remove all of their high response times and DB contention.

The biggest drawback of tuning the DB page size is that it requires a complete DB re-initialization after the nsslapd-db-page-size configuration attribute is changed. Another side effect is that increasing the page size increases the required I/O bandwidth (reads and writes to/from disk always happen in page-size blocks). However, as noted earlier in this section, I/O bandwidth does not currently seem to be an issue for this customer; there is still plenty of unused bandwidth.
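
For completeness, here is a hedged outline of what the change would involve. The backend name userRoot and the 16KB value are placeholders, db2ldif/ldif2db have to be run for every suffix, and the exact procedure (in particular whether the import must be done offline) should be confirmed against the documentation of the deployed DS version:

cd /users/nds52/slapd-instance
./db2ldif -n userRoot -a /tmp/userRoot.ldif     # export every backend first
./stop-slapd
# in config/dse.ldif, under "dn: cn=config,cn=ldbm database,cn=plugins,cn=config", set:
#   nsslapd-db-page-size: 16384
./ldif2db -n userRoot -i /tmp/userRoot.ldif     # re-import every backend with the new page size
./start-slapd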


ADVICE #5: Playing with dopageflush

As mentioned earlier in this section, the memory sub-system seems to be in pretty good health. A rather low-level tuning trick that benchmark experts use under such circumstances is to play with the page scanner, disabling it when it seems to be of little use (which appears to be the case for this customer, as vmstat reports very little activity in the sr column). Under such conditions, disabling the page scanner can be an option for improving fsflush performance and, for the same reason, DS checkpointing performance.

To disable the page scanner, you need to set the Solaris operating system dopageflush kernel parameter to 0. Information on all fsflush related parameters can be found here:

http://docs.sun.com/app/docs/doc/817-0404/chapter2-3?a=view.

Information on how to tune the Solaris kernel can be found here:

http://docs.sun.com/app/docs/doc/817-0404/chapter1-9?a=view

One of the easiest ways to tune this parameter is to launch "mdb -kw", as described in the latter link.
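
A minimal sketch of that procedure on a live system, using standard mdb syntax (the change takes effect immediately but does not survive a reboot; test it on a non-production machine first):

# check the current value
echo "dopageflush/D" | mdb -k
# set dopageflush to 0 on the running kernel
echo "dopageflush/W 0" | mdb -kw
# to make the setting persistent across reboots, add the following line to /etc/system:
#   set dopageflush = 0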
