Tuesday Jul 17, 2007

slog blog (or blogging on slogging)

I've been slogging for a while on support for separate intent logs (slogs) for ZFS. Without slogs, an intent log is allocated dynamically from the main pool. It consists of a chain of blocks of varying sizes, anchored in fixed objects. Specifying a separate log device enables the use of limited-capacity but fast block devices such as NVRAM and Solid State Drives (SSDs).

Using chained logs (clogs?) can also lead to pool fragmentation. This is because log blocks are allocated and then freed as soon as the pool transaction group has committed, so we get a swiss-cheesing effect.


Interface

        zpool create <pool> <pool devices> log <log devices>

Creates a pool with separate intent log device(s). If more than one log device is specified then writes are load-balanced between the devices. It's also possible to mirror log devices. For example, a log consisting of two mirrors of two devices each could be created thus:

                zpool create whirl <pool devices> \
                    log mirror c1t8d0 c1t9d0 \
                    mirror c1t10d0 c1t11d0

        zpool add <pool> log <log devices>

Creates a log device if it doesn't exist, or adds extra log devices if it does.
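
For example, to add one more log device to the whirl pool created above (the device name is made up for illustration):

        zpool add whirl log c1t12d0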

        zpool replace <pool> <old device> <new device>

Replaces an old log device with a new log device.
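
For example, to swap out one of the hypothetical log devices above for a fresh one:

        zpool replace whirl c1t8d0 c1t12d0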

        zpool attach <pool> <log device> <new log device>

Attaches a new log device to an existing log device. If the existing device is not a mirror then a two-way mirror is created. If the existing device is part of a two-way log mirror, attaching the new device creates a three-way log mirror, and so on.
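
For example, attaching a second device (again hypothetical) to the unmirrored log device added earlier creates a two-way log mirror:

        zpool attach whirl c1t12d0 c1t13d0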

        zpool detach <pool> <log device>

Detaches a log device from a mirror.
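
For example, to break that two-way log mirror back down to a single log device:

        zpool detach whirl c1t13d0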

        zpool status

Additionally displays the log devices.

        zpool iostat

Additionally shows IO statistics for log devices.
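
For example, the -v option gives a per-device breakdown in which the log devices appear separately (pool name from the earlier example, 5 second sampling interval):

        zpool iostat -v whirl 5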

When a slog is full, or if a non-mirrored log device fails, ZFS falls back to using chained logs within the main pool.

Performance

The performance of databases and NFS is dictated by the latency of making data stable: they need assurance that their transactions are not lost on power or system failure, so they are heavily dependent on the speed of the intent log devices.
Here are some database performance test results:

  • Test program creates 32 threads, each performing random 8K O_DSYNC writes to a 400MB file (a minimal sketch of such a program appears after this list).
  • Test hardware was a Sun X4500 (aka thumper) with 48 x 500GB disks.
  • The NVRAM device is a battery-backed PCI Micro Memory card (pci1332,5425).
  • Table values are MB/s
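
Below is a minimal sketch of that kind of test program - for illustration only, not the original benchmark. The file path and per-thread write count are made up; only the shape (32 threads doing random 8K O_DSYNC writes to a 400MB file) follows the description above.

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NTHREADS  32                      /* writer threads */
    #define BLKSIZE   8192                    /* 8K writes */
    #define FILESIZE  (400ULL * 1024 * 1024)  /* 400MB file */
    #define NWRITES   1000                    /* per thread; arbitrary */

    static int fd;

    static void *
    writer(void *arg)
    {
        char buf[BLKSIZE];
        unsigned int seed = (unsigned int)(uintptr_t)arg;

        memset(buf, 'z', sizeof (buf));
        for (int i = 0; i < NWRITES; i++) {
            /* pick a random 8K-aligned offset within the file */
            off_t off = (off_t)(rand_r(&seed) %
                (FILESIZE / BLKSIZE)) * BLKSIZE;
            if (pwrite(fd, buf, BLKSIZE, off) != BLKSIZE) {
                perror("pwrite");
                exit(1);
            }
        }
        return (NULL);
    }

    int
    main(void)
    {
        pthread_t tid[NTHREADS];

        /* O_DSYNC makes every write synchronous - that's what
           hammers the intent log */
        fd = open("/whirl/fs/testfile", O_RDWR | O_CREAT | O_DSYNC, 0644);
        if (fd == -1) {
            perror("open");
            return (1);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, writer,
                (void *)(uintptr_t)(i + 1));
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        close(fd);
        return (0);
    }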

                  Main pool disks
           1     2     4     8    16    32
0 slogs   11    14    17    15    16    13
1 slog    12    12    12    12    12    11
2 slogs   17    17    17    19    19    16
4 slogs   17    16    15    15    16    16
8 slogs   18    19    20    18    16    18
NVRAM    221   221   218   217   215   217


I also ran the same test with disk write cache flushing disabled:

        echo zfs_nocacheflush/W 1 | mdb -kw

Note, this should not be done on a real system unless the device cache is non-volatile.


                  Main pool disks
           1     2     4     8    16    32
0 slogs   33    83   123   136   142   143
1 slog    45    46    44    45    45    46
2 slogs   97    99    90    94    94    95
4 slogs  124   125   127   124   127   127
8 slogs  135   137   134   138   138   138
NVRAM    225   220   226   226   226   227

Note, these tables can be a bit misleading. If you had 2 disks you'd have a choice of 2 main pool devices, or 1 slog and 1 main pool device. So, looking at the second table, you should compare the following entries:
  • 2 main pool: 83MB/s
  • 1 slog, 1 main pool: 45MB/s
The first table highlights some scaling issues which will be investigated further.

Perf summary

For this micro-benchmark, and from limited other perf testing, it makes sense to use only fast devices for the slog. However, there may be some cases where using regular disks as slog disks is faster than putting the same disks in the main pool.

Status/Bugs

This support was recently putback into Solaris Nevada build snv_68. Here's a list of slog bugs - fixed and to be fixed.

6574298 "slog still uses main pool for dmu_sync()" - now fixed in snv_69
6574286 "removing a slog doesn't work" - now fixed
6575965 "panic/thread=2a1016b5ca0: BAD TRAP: type=9 ...:" - panic when no main pool devices are present - now fixed in snv_83


Wednesday Nov 16, 2005

ZFS: The Lumberjack

I'm a lumberjack and I'm ok,
I sleep all night and I work all day.

... (Monty Python)

Well, just recently I've been working much of the night as well! For many years now I've been working on logging: first improving UFS logging (technically journalling) performance, and more recently on the ZFS Intent Log. Hence - "The Lumberjack".

So, you may well ask, what is this ZFS Intent Log? ZFS is always consistent on disk due to its transaction model. Unix system calls can be considered as transactions, which are aggregated into a transaction group for performance and committed together periodically. Either everything commits or nothing does; that is, if the power goes out, the transactions in the pool are never partial. These commits happen fairly infrequently - typically a few seconds between each transaction group commit.

Some applications, such as databases, need assurance that, say, the data they wrote or the mkdir they just executed is on stable storage, and so they request synchronous semantics such as O_DSYNC (when opening a file), or execute fsync(fd) after a series of changes to a file descriptor. Obviously waiting seconds for the transaction group to commit before returning from the system call is not a high-performance solution. Thus the ZFS Intent Log (ZIL) was born.
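
For illustration (not from the original post), here's how an application typically asks for those two flavours of synchronous semantics; the file names are made up and error checking is omitted:

    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
        char buf[8192] = { 0 };     /* dummy transaction data */

        /* 1. O_DSYNC at open time: every write() returns only once
           the data is on stable storage. */
        int fd = open("/tank/db/datafile", O_WRONLY | O_CREAT | O_DSYNC, 0644);
        (void) write(fd, buf, sizeof (buf));

        /* 2. Normal writes, then an explicit fsync() to force all
           outstanding changes for the file descriptor out. */
        int fd2 = open("/tank/db/logfile", O_WRONLY | O_CREAT, 0644);
        (void) write(fd2, buf, sizeof (buf));
        (void) fsync(fd2);

        (void) close(fd);
        (void) close(fd2);
        return (0);
    }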

The ZIL gathers system call transactions together in memory. If a system call needs to be pushed out immediately (e.g. O_DSYNC) or an fsync() later arrives, then the in-memory list is pushed out to a per-filesystem on-disk log. In the event of a crash or power failure the log is examined and those system calls that did not get committed are replayed.

The on-disk log consists of a header and log blocks. Inside the log blocks are records containing the system call transactions. Log blocks are chained together dynamically; there is no fixed location for the blocks, they are just grabbed and released as needed from the pool. This dynamic allocation makes it easy to change the log block size according to demand. When a transaction group commits, all log blocks containing those transactions are released. On replay we determine the validity of the next block in the chain using a strong checksum and further checking of the log record fields.

[Diagram of the on-disk log format]
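As a rough textual stand-in for that diagram, here's a highly simplified sketch of the chained-block idea. This is illustrative only; the real on-disk structures in the ZFS source carry more fields and differ in layout.

    #include <stdint.h>

    /* Illustrative only - not the actual ZFS on-disk definitions. */
    typedef struct log_block {
        uint64_t lb_next_blk;   /* pool address of the next log block */
        uint64_t lb_seq;        /* sequence number checked on replay */
        uint64_t lb_cksum[4];   /* strong checksum validating the chain */
        uint8_t  lb_records[];  /* variable-sized system call records */
    } log_block_t;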

So that's the ZIL in a nutshell. Its design was guided by previous years of frustration with UFS logging. We knew it had to be very fast, as synchronous semantics are critical to database performance. We certainly learnt from previous mistakes.

There's a whole bunch more intricate details I could talk about, but they can wait for another blog entry. You can also get details about the ZIL in Neel's blog. There's also more work to do. Still, it's very functional now and bloody fast!
