ZIL block size selection policy

As I mentioned in my previous blog entry, the ZIL (ZFS Intent Log) operates with block sizes between ZIL_MIN_BLKSZ (4k) and ZIL_MAX_BLKSZ (128k). Let us take a closer look at this.

The ZIL has to allocate a new ZIL block before it commits the current one, because the block being committed has to contain a link to the next ZIL block. If you do not preallocate, you have to update the next pointer in the previous block whenever you write a new ZIL block. That means you have to read in the previous block, update its next pointer, and write it back out. Obviously this is quite expensive (and quite complicated).
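
To make the cost difference concrete, here is a minimal toy model of the two approaches. It is only a sketch: the toy_blk_t structure, the allocator, and every toy_* function are hypothetical names invented for illustration, and none of this is the real ZIL code or on-disk layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified log block; not the real on-disk ZIL layout. */
typedef struct toy_blk {
    uint64_t addr;      /* where this block lives */
    uint64_t next_addr; /* link to the successor, 0 if not yet known */
    int      writes;    /* how many times this block has been written out */
} toy_blk_t;

static uint64_t toy_next_free = 1;

/* Hand out a fresh block address (stand-in for a real allocator). */
static uint64_t toy_alloc(void)
{
    return toy_next_free++;
}

/* Create an empty, not-yet-written block at a newly allocated address. */
static toy_blk_t toy_blk_new(void)
{
    toy_blk_t b = { toy_alloc(), 0, 0 };
    return b;
}

/*
 * Commit cur with preallocation: allocate the successor first, embed its
 * address in cur, then write cur exactly once. Returns the successor.
 */
static toy_blk_t toy_commit_prealloc(toy_blk_t *cur)
{
    toy_blk_t next = toy_blk_new();

    assert(cur->writes == 0);     /* cur must not have gone out yet */
    cur->next_addr = next.addr;   /* link is in place before the write */
    cur->writes++;                /* the one and only write of cur */
    return next;
}

/*
 * Append cur without preallocation: cur is written with no link, and the
 * previous block (if any) has to be patched and written a second time.
 */
static void toy_append_lazy(toy_blk_t *prev, toy_blk_t *cur)
{
    cur->writes++;                /* write cur, successor still unknown */
    if (prev != NULL) {
        prev->next_addr = cur->addr;
        prev->writes++;           /* extra read-modify-write of prev */
    }
}
```

With preallocation every block in the chain ends up with writes == 1; in the lazy variant every block except the tail is written twice.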

The current block size selection strategy is to choose either the combined size of all outstanding ZIL blocks or, if no ZIL blocks are outstanding, the size of the last ZIL block that was committed. If the combined size of the outstanding ZIL blocks is greater than 128k, it is capped at 128k.
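
The rule can be written down as a small function. This is a sketch of the policy as described in this entry, not the actual ZFS source: toy_next_blksz and its parameters are hypothetical names, and clamping small requests up to ZIL_MIN_BLKSZ is my assumption based on the 4k lower bound mentioned earlier.

```c
#include <assert.h>
#include <stdint.h>

#define ZIL_MIN_BLKSZ (4ULL * 1024)
#define ZIL_MAX_BLKSZ (128ULL * 1024)

/*
 * Pick the size of the next log block.
 *   outstanding_bytes: combined size of everything still waiting to be logged
 *   last_blksz:        size of the last block that was committed
 */
static uint64_t toy_next_blksz(uint64_t outstanding_bytes, uint64_t last_blksz)
{
    /* Prefer the outstanding size; fall back to the last committed size. */
    uint64_t sz = (outstanding_bytes != 0) ? outstanding_bytes : last_blksz;

    if (sz < ZIL_MIN_BLKSZ)
        sz = ZIL_MIN_BLKSZ;       /* assumed lower clamp */
    if (sz > ZIL_MAX_BLKSZ)
        sz = ZIL_MAX_BLKSZ;       /* cap described in the text */
    return sz;
}
```

For example, toy_next_blksz(2048, 4096) yields 4096, while toy_next_blksz(0, 65536) yields 65536: with nothing outstanding, the previous size simply carries over.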

The above strategy works in most cases, but behaves badly for certain edge cases.

Let us examine the ZIL block size for the sequence of actions described below
(dtrace -n ::zil_lwb_commit:entry'{@[1]=quantize(((lwb_t *)args[2])->lwb_sz);}')

  1. Bunch of 2k O_DSYNC writes -- zil block size was 4k (ZIL_MIN_BLKSZ)
  2. Bunch of 128-byte O_DSYNC writes -- zil block size was 4k (ZIL_MIN_BLKSZ)
  3. Bunch of non-O_DSYNC writes -- no zil blocks written
  4. Bunch of 128-byte O_DSYNC writes -- zil block size was 64k !!

Oops! Why did the zil block size suddenly jump to 64k in step 4?

When the first O_DSYNC write was initiated in (4), the ZIL coalesced all outstanding log operations (the records left behind by the non-O_DSYNC writes in (3)) into big blocks (in my case a 128k block and a 64k block) and then did a zil_commit. The next O_DSYNC write then chose 64k as the ZIL block size, as that was the size of the last block committed. The following O_DSYNC writes then continued to use 64k as the ZIL block size.
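
The sticky behaviour can be reproduced with a toy simulation of the four steps. Treat it as a rough sketch under stated assumptions: toy_zil_t and the toy_* functions are hypothetical, the model only tracks byte counts, and the backlog size of 196k is picked by hand so that the drain produces the same 128k and 64k blocks I observed.

```c
#include <assert.h>
#include <stdint.h>

#define ZIL_MIN_BLKSZ (4ULL * 1024)
#define ZIL_MAX_BLKSZ (128ULL * 1024)

/* Hypothetical in-memory state of a toy intent log. */
typedef struct toy_zil {
    uint64_t cur_blksz;   /* size of the block preallocated for the next commit */
    uint64_t outstanding; /* bytes of log records not yet written */
} toy_zil_t;

/* Size policy from the text: outstanding size, else last size, clamped. */
static uint64_t toy_next_blksz(uint64_t outstanding, uint64_t last_blksz)
{
    uint64_t sz = (outstanding != 0) ? outstanding : last_blksz;

    if (sz < ZIL_MIN_BLKSZ)
        sz = ZIL_MIN_BLKSZ;
    if (sz > ZIL_MAX_BLKSZ)
        sz = ZIL_MAX_BLKSZ;
    return sz;
}

/* Queue a log record; nothing is written until the next commit. */
static void toy_log(toy_zil_t *z, uint64_t bytes)
{
    z->outstanding += bytes;
}

/*
 * Synchronous commit: drain all outstanding records. Each iteration fills
 * the preallocated block, then sizes its successor before moving on.
 * Returns the size of the last block written (0 if nothing was pending).
 */
static uint64_t toy_commit(toy_zil_t *z)
{
    uint64_t written = 0;

    while (z->outstanding > 0) {
        uint64_t used = (z->outstanding < z->cur_blksz)
            ? z->outstanding : z->cur_blksz;

        z->outstanding -= used;
        written = z->cur_blksz;
        /* Preallocate the successor before this block goes out. */
        z->cur_blksz = toy_next_blksz(z->outstanding, written);
    }
    return written;
}
```

Starting from a 4k preallocated block, small synchronous writes keep committing 4k blocks. After 196k of records pile up, one commit writes 4k, 128k and 64k blocks, and from then on each 128-byte synchronous write is committed in a 64k block, because nothing is outstanding when the successor is sized.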

Neil Perrin filed CR 6354547, "sticky log buf size", to fix this issue. His proposed fix is to use the size of the last block as the basis for the size of the new block. This should work optimally for most cases, but there is a possibility of empty log writes. I need to investigate this issue with "real" workloads.

About

realneel