Improving the I/O rates on a massive DS deployment

I was recently led to investigate an interesting I/O performance issue in a Solaris 9, SPARC, UFS, 4-replica DS 5.2 patch 5 deployment hosting a 27M-entry DB.

As far as the topology is concerned, it was perfectly synced and all 4 replicas answered all read and write operations in under a second (etime=0). All 4 replicas were in theory configured identically and running the exact same OS. The 2 read-only replicas answered the same kind of SRCH requests, each SRCH targeting one single entry through indexed attributes. The dbcache is not configured on any of the replicas (the 60GB DB is loaded directly through a 20GB FS cache) and the DB page size on all replicas was set to 32KB, which means that each read from disk into the file system cache should be expected to be of that size, in the worst case.
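
Both settings live in the ldbm database configuration entry, so they can be double-checked across replicas over LDAP. A minimal sketch only: the DN and attribute names below are the standard DS 5.2 ones as far as I recall, and the host, port, bind DN and $BINDPW are placeholders for the deployment's own values:

# Read the DB cache size and DB page size from the ldbm plugin configuration
ldapsearch -h localhost -p 389 -D "cn=Directory Manager" -w "$BINDPW" \
    -b "cn=config,cn=ldbm database,cn=plugins,cn=config" -s base \
    "(objectclass=*)" nsslapd-dbcachesize nsslapd-db-page-size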

Despite the above, there was one read-only replica which showed an I/O read rate 4 times higher than the other 3 for the same amount and type of read traffic. This is the initial iostat output extracted from the bad replica when the problem was evident:

Fri Aug 24 12:52:03 2007
extended device statistics
    r/s    w/s   Mr/s   Mw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
  371.7   29.4   41.8    0.7   0.0   9.6     0.0    24.0   0  100  c7t600C0FF000000000083D437AA64B8900d0
  282.4   36.6   29.7    0.9   0.0   3.8     0.0    11.9   0   94  c7t600C0FF000000000083D437AA64B8900d0
  287.2   41.4   31.0    1.1   0.0   4.4     0.0    13.4   0   97  c7t600C0FF000000000083D437AA64B8900d0
  282.8   44.6   31.5    1.1   0.0   3.9     0.0    11.8   0   93  c7t600C0FF000000000083D437AA64B8900d0

As we can see in the iostat output above, for what should be at most a 32KB read, we are actually loading from disk into the FS cache an average of (31000 KB/s) / (290 r/s) ≈ 105 KB per read.
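
The same per-read figure can be derived directly from the iostat columns. A minimal sketch, assuming the output of the iostat command used later in the action plan has been captured to a file (iostat.out is just a placeholder name); with the -M flag, column 3 is Mr/s and column 1 is r/s:

# Average KB transferred from disk per read request, one line per device sample
awk '$1+0 > 0 && NF == 11 {printf "%s  %.1f KB/read\n", $11, ($3 * 1024) / $1}' iostat.out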

After carefully analyzing possible root causes such as differences in traffic load, differences in the type of searches, the kernel/UFS readahead settings and the storage, the root cause of the high I/O on the bad replica was located in the kernel/UFS readahead settings (and was therefore completely outside the scope of DS).

Indeed, decreasing the maxcontig setting from 128 blocks (the default value) to 16 blocks made the problem go away: maxcontig drives the UFS read-ahead cluster size, so with the default 8KB block size a value of 128 allows up to 1MB to be pulled in for what is logically a random 32KB read. The following plan was executed on the bad replica during a maintenance window. The plan was a success and fixed all of the I/O issues on the replica (a consolidated sketch of the commands follows the list):

1) Stop the DS (stop-slapd)
2) Set maxcontig to 16:
   tunefs -a 16 /dev/dsk/c7t600C0FF000000000083D437AA64B8900d0s0
3) Reboot
4) Verify maxcontig is set to 16 after the reboot:
   fstyp -v /dev/dsk/c7t600C0FF000000000083D437AA64B8900d0s0 | grep maxcontig
5) start-slapd and put the traffic back in place
6) Wait for cache priming and replication sync-up (10 minutes)
7) Run iostat -xnMCz -T d 5 for a couple of minutes
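
For reference, the same plan as a single shell session. This is a sketch only: the stop-slapd/start-slapd paths assume a default serverroot instance layout (slapd-myinstance is a placeholder), and init 6 stands in for whatever reboot procedure the site mandates:

# 1) Stop the DS instance (instance path is an assumption, adjust to the real one)
/var/opt/mps/serverroot/slapd-myinstance/stop-slapd

# 2) Lower maxcontig from the default 128 blocks to 16 on the DB file system
tunefs -a 16 /dev/dsk/c7t600C0FF000000000083D437AA64B8900d0s0

# 3) Reboot
init 6

# 4) After the reboot, confirm the new value
fstyp -v /dev/dsk/c7t600C0FF000000000083D437AA64B8900d0s0 | grep maxcontig

# 5) Restart the DS and put the traffic back in place
/var/opt/mps/serverroot/slapd-myinstance/start-slapd

# 6-7) After ~10 minutes of cache priming and replication sync-up,
#      sample the disks again for a couple of minutes
iostat -xnMCz -T d 5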

Once the action plan above was executed, here are the iostat results with this new maxcontig setting:
    r/s    w/s   Mr/s   Mw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
  277.6   16.2    6.2    0.3   0.0   1.9     0.0     6.6   0   80  c7t600C0FF000000000083D437AA64B8900d0
  249.8   15.2    5.4    0.3   0.0   1.7     0.0     6.5   0   75  c7t600C0FF000000000083D437AA64B8900d0
  343.0   12.4    7.5    0.3   0.0   2.6     0.0     7.2   0   84  c7t600C0FF000000000083D437AA64B8900d0
  389.6   16.4    8.9    0.3   0.0   3.1     0.0     7.5   0   87  c7t600C0FF000000000083D437AA64B8900d0
  424.4   25.2    9.7    0.6   0.0   3.6     0.0     8.1   0   89  c7t600C0FF000000000083D437AA64B8900d0
  391.8   25.0    9.2    0.5   0.0   3.3     0.0     7.9   0   89  c7t600C0FF000000000083D437AA64B8900d0
  324.2   20.8    7.2    0.5   0.0   2.4     0.0     7.1   0   86  c7t600C0FF000000000083D437AA64B8900d0
  307.4   21.0    6.8    0.5   0.0   2.2     0.0     6.7   0   82  c7t600C0FF000000000083D437AA64B8900d0
  317.2   33.6    7.1    0.8   0.0   2.3     0.0     6.6   0   85  c7t600C0FF000000000083D437AA64B8900d0
  330.4   37.0    7.1    0.9   0.0   2.5     0.0     6.7   0   89  c7t600C0FF000000000083D437AA64B8900d0

Now, for the same kind of 32KB read, we are loading from disk into the FS cache an average of (7200 KB/s) / (324 r/s) ≈ 22 KB per read.

This is excellent: these numbers are now even better than the ones on the other 3 replicas and correspond to an FS cache hit ratio of roughly 33%, which makes complete sense for a 60GB DB being served through a 20GB FS cache.
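
The hit ratio estimate can be sanity-checked from the same figures, taking 32KB (the DB page size) as the amount that would have to come from disk on a complete miss:

# Rough FS cache hit ratio: 1 - (average KB fetched from disk per read / DB page size)
awk 'BEGIN { printf "%.0f%%\n", (1 - 22 / 32) * 100 }'   # prints 31%, consistent with the ~33% figure above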

Summarizing: lowering maxcontig from 128 down to 16 has allowed us to reduce the amount of data read from disk from 105 KB/read to 22 KB/read, almost 5 times less. This maxcontig change makes sense in a huge directory server DB deployment such as the one addressed here, since access to the data in this kind of environment is typically random, not sequential.
