Tuesday Sep 18, 2007

ZFS ARC Statistics

ARC statistics are exported via kstat(1M) (CR 6510807) in Solaris 10 U4 and Solaris Nevada (build 57+). A simple way to see them is to use kstat -m zfs.

I wrote a simple perl script to print out the ARC statistics. Sample output is shown below. You can either print the default set of fields, or specify which fields you want printed (see arcstat -v for details).

Download arcstat.pl

In later blogs, I will describe how to make sense of these numbers. Sample output follows.

Cached random read

thumper1[bash]$ ~/arcstat.pl
    Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
18:15:21  985M  100M     10   47M   14   53M    8    1M    3    11G   11G
18:15:22  137K     0      0     0    0     0    0     0    0    11G   11G
18:15:23  138K     0      0     0    0     0    0     0    0    11G   11G
18:15:24  138K     0      0     0    0     0    0     0    0    11G   11G
18:15:25  138K     0      0     0    0     0    0     0    0    11G   11G
18:15:26  138K     0      0     0    0     0    0     0    0    11G   11G
18:15:27  139K     0      0     0    0     0    0     0    0    11G   11G
18:15:28  140K     0      0     0    0     0    0     0    0    11G   11G
18:15:29  139K     0      0     0    0     0    0     0    0    11G   11G
18:15:30  140K     0      0     0    0     0    0     0    0    11G   11G
18:15:31  139K     0      0     0    0     0    0     0    0    11G   11G
18:15:32   33K     0      0     0    0     0    0     0    0    11G   11G

Uncached sequential write

    Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
18:17:48     0     0      0     0    0     0    0     0    0    11G   11G
18:17:49     0     0      0     0    0     0    0     0    0    11G   11G
18:17:50   664   372     56     7    2   365  100     5    2    11G   11G
18:17:51    1K   534     36     5    0   529  100     5    0    11G   11G
18:17:52    2K   774     33     6    0   768  100     6    0    10G   10G
18:17:53    2K   645     31     5    0   640  100     5    0    10G   10G
18:17:54    1K   645     35     5    0   640  100     5    0    10G   10G
18:17:55    2K   645     31     5    0   640  100     5    0    10G   10G
18:17:56    1K   646     35     6    0   640  100     6    0    10G   10G
18:17:57    2K   645     31     5    0   640  100     5    0    10G   10G
18:17:58    2K   774     33     6    0   768  100     6    0    10G   10G
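The per-interval columns in the tables above are simple deltas of the cumulative kstat counters. A minimal sketch of the computation (the counter names are assumed to match kstat -m zfs output; this is illustrative, not how arcstat.pl is necessarily written):

```python
def interval_stats(prev, curr):
    """Per-interval ARC read and miss rates from two cumulative snapshots."""
    d = {k: curr[k] - prev[k] for k in prev}            # deltas over the interval
    reads = d["hits"] + d["misses"]                     # total ARC accesses
    missp = 100 * d["misses"] // reads if reads else 0  # miss%, integer like arcstat
    return reads, d["misses"], missp

# Two hypothetical one-second snapshots of the cumulative counters:
prev = {"hits": 1000, "misses": 100}
curr = {"hits": 1900, "misses": 200}
print(interval_stats(prev, curr))  # -> (1000, 100, 10)
```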

Monday Sep 25, 2006

ZFS and Databases

Databases and ZFS

Comparing UFS and ZFS out-of-the-box, we find that ZFS performs slightly better than buffered UFS. We also demonstrate that it is possible to get performance improvements with ZFS by following a small set of recommendations. We have also identified a couple of tunings that help performance. These tunings will be on by default in future releases of ZFS.

We (PAE - Performance Engineering) recently completed a study to understand database performance with ZFS. Read on for the details and recommendations. You can also read Roch's blog on the same study.

Databases stress the filesystem in unique ways. Depending on the workload and configuration, you can see thousands of IO operations per second. These IOs are usually small (the database block size). All writes are synchronous writes. Reads can be random or sequential, and some writes are more critical than others. Depending on the configuration, reads are cached by the database program or by the filesystem (if supported/requested). In many cases where filesystems are used, the IO is spread over a few files, which makes the single-writer lock very hot under certain configurations such as buffered UFS.

Since IO is so important for databases, it is not surprising that there are a lot of heavyweight players in this arena. UFS, QFS, and VxFS are quite popular with customers as the underlying filesystem. So how does the new kid on the block (ZFS) do?

We used an internally developed benchmark called OLTP/Net to study database performance with ZFS. OLTP/Net (O-L-T-P slash Net) is an OLTP benchmark that simulates an online store. The major feature of the benchmark is a set of tuning knobs that control the ratio of network IO to disk IO, the read/write mix of the transactions, the number of new connects/disconnects to the database, etc. This makes it quite easy to simulate customer situations in our labs. We use it quite extensively inside Sun to model real-world database performance, and have found and fixed quite a few performance issues using this workload.

For our ZFS study, we used the default settings for OLTP/Net. In this scenario, we have a read/write ratio of 2:1 and a network/disk IO ratio of 10:1. Since our goal is to run like most customers, we controlled the number of users (load generators) such that the box was 60% utilized.

The hardware configuration consisted of a T2000 with 32x1200MHz CPUs and 32GB RAM, connected to 140 Fibre Channel JBODs. We used both Solaris 10 Update 2 and Solaris Nevada build 43 for the analysis. We created one big dynamically striped pool with all the disks and set the recordsize of this pool to 8K. Each disk was divided into 2 slices. These slices were allocated to UFS and ZFS in round-robin fashion to ensure that each filesystem got an equal number of inner and outer slices.

Normally for OLTP benchmarks, we try to use the smallest database block size for best performance. When we started our study, we used a block size of 2048, as that gives the best performance on other filesystems. But since we are trying to do what most customers would do, we switched to a block size of 8192. We ran two kinds of tests: a cached database and a large (uncached) database. Details follow in the sections below.

Recommendations for ZFS and Databases

Most customers use buffered UFS filesystems, and ZFS already performs better than buffered UFS! Since we want to test performance, and we want ZFS to be super fast, we decided to compare ZFS with UFS directio. We noticed that UFS directio performs better than what we get with ZFS out-of-the-box. With ZFS, not only was the throughput much lower, but we used more than twice the amount of CPU per transaction and did twice the IO. The disks were also more heavily utilized.
We noticed that we were not only reading in more data, but also doing more IO operations than needed. A little bit of dtracing quickly revealed that these reads were originating from the write code path! More dtracing showed that these were level 0 blocks being read in for the read-modify-write cycle. This led us to the FIRST recommendation:
Match the database block size with the ZFS recordsize.
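To see why a mismatch hurts, consider a simplified model of the read-modify-write cost (a sketch that ignores caching and alignment; the function is hypothetical, not ZFS code):

```python
def rmw_read_bytes(write_size, recordsize):
    """Bytes ZFS must read back for one uncached write in this model:
    a write smaller than the record forces the whole record to be read
    in first (read-modify-write); a full-record write needs no read."""
    return 0 if write_size >= recordsize else recordsize

# An 8K database block written into 128K records reads 128K per write;
# matching recordsize to the database block size eliminates the read.
print(rmw_read_bytes(8192, 131072))  # -> 131072
print(rmw_read_bytes(8192, 8192))    # -> 0
```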
A look at the DBMS statistics showed that "log file sync" was one of the biggest wait events. Since the log files were in the same filesystem as the data, we noticed higher latency for log file writes. We then created a different filesystem (in the same pool), but set the record size to 128K, as log writes are typically large. We noticed a slight improvement in our numbers, but not the dramatic improvement we wanted to achieve. We then created a separate pool and used that pool for the database log files. We got quite a big boost in performance. This performance boost can be attributed to the decrease in write latency. Latency of database log writes is critical for OLTP performance. When we used one pool, the extra IOs to the disks increased the latency of the database log writes, and thus impacted performance. Moving the logs to a dedicated pool improved the latency of the writes, giving a performance boost. This leads us to our SECOND recommendation:
If you have a write-heavy workload, you are better off separating the log files onto a separate pool.
Looking at the extra IO being generated by ZFS, we noticed that the reads from disk were 64K in size. This was puzzling, as the ZFS recordsize is 8K. More dtracing, and we figured out that the vdev_cache (or software track buffer) reads in quite a bit more than we request. The default read size is 64K (8x more than our 8K request). Not surprisingly, the ZFS team is aware of this, and there are quite a few change requests (CRs) on the issue:

4933977: vdev_cache could be smarter about prefetching
6437054: vdev_cache: wise up or die
6457709: vdev_knob values should be determined dynamically

Tuning the vdev_cache to read in only 8K at a time decreased the amount of extra IO by a big factor and, more importantly, improved the latency of the reads too. This leads to our THIRD recommendation:
Tune down the vdev_cache using ztune.sh1 until 6437054 is fixed.
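The amplification is easy to quantify. With the default vdev_cache inflating each read to 64K while only 8K was requested:

```python
VDEV_CACHE_READ = 64 * 1024   # default inflated read size
RECORDSIZE = 8 * 1024         # what we actually asked for

# Bytes fetched per cache miss, relative to the request size:
amplification = VDEV_CACHE_READ // RECORDSIZE
print(amplification)  # -> 8, i.e. 8x the bytes per miss
```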
Ok, we achieved quite a big boost from all the above tunings, but we were still seeing high latency for our IOs. We saw that the disks were busier during spa_sync. Having read Eric Kustarz's blog about 'vq_max_pending', we tried playing with that value. We found that setting it to 5 gives us the best performance (for our disks and our workload). Finding the optimal value involves testing multiple values -- a time-consuming affair. Luckily, the fix is in the works:

6457709: vdev_knob values should be determined dynamically

So, future releases of ZFS will have this auto-tuned. This leads us to our FOURTH recommendation:
Tune vq_max_pending using ztune.sh1 until 6457709 is fixed.
We tried various other things. For example, we tried changing the frequency of the spa_sync. The default is once every 5 seconds. We tried once every second, once every 30 seconds, and even once every hour. While in some cases we saw marginal improvements, we also noticed higher CPU utilization or heavy spinning on mutexes. Our belief is that the default is good out of the box, and we recommend you do not change it. We also tried changing the behaviour of the ZIL by modifying the zfs_immediate_write_sz value. Again, we did not see improvements. This leads to our FINAL recommendation:

Let ZFS auto-tune. It knows best. In cases where tuning helps, expect ZFS to incorporate that fix in future releases of ZFS.

In conclusion, you can improve out-of-the-box performance of databases with ZFS by doing simple things. We have demonstrated that it is possible to run high-throughput workloads with the current release of ZFS. We have also shown that it is quite possible to get huge improvements in database performance in future versions of ZFS. Given that ZFS is around a year old, this is amazing!

1ztune.sh is Roch's script.

Monday Nov 28, 2005

ZIL block size selection policy

ZIL block size selection policy

As I mentioned in my previous blog entry, the ZIL (ZFS Intent Log) operates with block sizes between ZIL_MIN_BLKSZ(4K) and ZIL_MAX_BLKSZ(128k).  Let us take a closer look at this.

The ZIL has to allocate a new zil block before it commits the current zil block. This is because the zil block being committed has to have a link to the next zil block. If you do not preallocate, you have to update the next pointer in the previous block whenever you write a new zil block. That means reading in the previous block, updating the next pointer, and rewriting it. Obviously this is quite expensive (and quite complicated).

The current block size selection strategy is to choose either the sum of all outstanding ZIL blocks or, if no outstanding zil blocks are present, the size of the last zil block that was committed. If the sum of the outstanding zil blocks is greater than 128K, the size is capped at 128K.
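The policy can be sketched as follows (constants from zil.h; the function itself is illustrative, not the actual code):

```python
ZIL_MIN_BLKSZ = 4 * 1024      # 4K
ZIL_MAX_BLKSZ = 128 * 1024    # 128K (SPA_MAXBLOCKSIZE)

def next_zil_blksz(outstanding, last_committed):
    """Pick the next zil block size: the sum of all outstanding zil
    blocks, or the size of the last committed block if none are
    outstanding; clamp to [ZIL_MIN_BLKSZ, ZIL_MAX_BLKSZ]."""
    want = sum(outstanding) if outstanding else last_committed
    return max(ZIL_MIN_BLKSZ, min(want, ZIL_MAX_BLKSZ))

print(next_zil_blksz([128], 0))       # tiny writes -> 4096 (ZIL_MIN_BLKSZ)
print(next_zil_blksz([], 64 * 1024))  # nothing outstanding -> sticks at 65536
```

The second call illustrates the "sticky" edge case discussed below: once a large block has been committed, small follow-on writes keep inheriting its size.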

The above strategy works in most cases, but behaves badly for certain edge cases.

Let us examine the zil block size for the set of actions described below
(dtrace -n '::zil_lwb_commit:entry{@[1] = quantize(((lwb_t *)args[2])->lwb_sz);}')

  1. Bunch of 2K O_DSYNC writes -- zil block size: 4K (ZIL_MIN_BLKSZ)
  2. Bunch of 128-byte O_DSYNC writes -- zil block size was 4K (ZIL_MIN_BLKSZ)
  3. Bunch of non-O_DSYNC writes -- no zil blocks written
  4. Bunch of 128-byte O_DSYNC writes -- zil block size was 64K !!
Oops! Why did the zil block size suddenly jump to 64K above?

When the first O_DSYNC write was initiated in (4), the zil coalesced all outstanding log operations into a big block (in my case a 128k block and a 64k block) and then did a zil_commit. The next O_DSYNC write then chose 64k as the zil block size as that was the size of the last zil_commit. The following O_DSYNC writes then continued to use 64K as the zil block size.

Neil Perrin filed CR 6354547: sticky log buf size to fix this issue. His proposed fix is to use the size of the last block as the basis for the size of the new block. This should work optimally for most cases, but there is a possibility of empty log writes. We need to investigate this with "real" workloads.

Wednesday Nov 16, 2005

The ZFS Intent Log

A quick guide to the ZFS Intent Log (ZIL)

I am not a ZFS developer. However, I am interested in ZFS performance, and am intrigued by ZFS logging. I figure a good way to learn about something is to blog about it ;-). What follows are my notes as I made my way through the ZIL.


Most modern file systems include a logging feature to ensure faster synchronous writes and faster crash recovery (no lengthy fsck). UFS has supported logging since Solaris 2.7, and logging is the default on Solaris 10. Our internal tests have shown that logging file systems perform as well as (and sometimes better than) non-logging file systems.

Logging is implemented via the ZFS Intent Log (ZIL) module in ZFS, in the zil.c file. Here is a brief walk-through of the logging implementation in ZFS. All of this knowledge can be found in the zil.[c|h] files in the ZFS source code. I also recommend you check out Neil's blog -- he is one of the ZFS developers who works on the ZIL.

All file system related system calls are logged as transaction records by the ZIL. These transaction records contain sufficient information to replay them back in the event of a system crash.

ZFS operations are always a part of a DMU (Data Management Unit) transaction. When a DMU transaction is opened, there is also a ZIL transaction that is opened. This ZIL transaction is associated with the DMU transaction, and in most cases discarded when the DMU transaction commits. These transactions accumulate in memory until an fsync or O_DSYNC write happens in which case they are committed to stable storage. For committed DMU transactions, the ZIL transactions are discarded (from memory or stable storage).

The ZIL consists of a zil header, zil blocks, and a zil trailer. The zil header points to a list of records. Each log record is a variable-sized structure whose format depends on the transaction type. Each log record consists of a common structure of type lr_t followed by structures/fields specific to each transaction type. Log records can reside either in memory or on disk. The on-disk format is described in zil.h. ZIL records are written to disk in variable-sized blocks. The minimum block size is ZIL_MIN_BLKSZ, currently 4096 (4K) bytes. The maximum block size is ZIL_MAX_BLKSZ, which is equal to SPA_MAXBLOCKSIZE (128K). The zil block size written to disk is chosen to be either the size of all outstanding zil blocks (up to a maximum of ZIL_MAX_BLKSZ) or, if there are no outstanding ZIL transactions, the size of the last zil block that was committed.

ZIL and write(2)
The zil behaves differently for different write sizes. For small writes, the data is stored as part of the log record. For writes greater than zfs_immediate_write_sz (64KB), the ZIL does not store a copy of the write; instead it syncs the write to disk, and only a pointer to the synced data is stored in the log record. We can examine the write(2) system call on ZFS using dtrace.

230  -> zfs_write                                         21684
230    -> zfs_prefault_write                              28005
230    <- zfs_prefault_write                              35446
230    -> zfs_time_stamper                                69932
230      -> zfs_time_stamper_locked                       72893
230      <- zfs_time_stamper_locked                       74813
230    <- zfs_time_stamper                                76893
230    -> zfs_log_write                                   81054
230    <- zfs_log_write                                   89855
230  <- zfs_write                                         96257
230  <= write

As you can see, there is a log entry associated with every write(2) call. If the file was opened with the O_DSYNC flag, writes are synchronous. For synchronous writes, the ZIL has to commit the zil transaction to stable storage before returning. For non-synchronous writes, the ZIL holds on to the transaction in memory until the DMU transaction commits or an fsync or O_DSYNC write occurs.
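Putting the two rules together, here is a toy model of how a write(2) is treated (the function and names are hypothetical; the threshold comparison is a simplification of the real code):

```python
ZFS_IMMEDIATE_WRITE_SZ = 64 * 1024  # threshold described above

def zil_log_write(size, o_dsync):
    """How this model treats a write(2): small writes are copied into
    the log record, writes above zfs_immediate_write_sz are synced to
    disk with only a pointer logged; O_DSYNC forces an immediate commit,
    otherwise the transaction is held in memory."""
    payload = "copy" if size <= ZFS_IMMEDIATE_WRITE_SZ else "pointer"
    action = "commit now" if o_dsync else "hold in memory"
    return payload, action

print(zil_log_write(2048, True))         # -> ('copy', 'commit now')
print(zil_log_write(128 * 1024, False))  # -> ('pointer', 'hold in memory')
```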

zil.c walk-through

There are several zil functions that operate on zil records. What follows is a very brief description of their functionality.

  • zil_create() creates a dmu transaction, allocates the first log block, and commits it.
  • zil_itx_create() creates a new zil transaction.
  • zil_itx_assign() associates an intent log transaction with a dmu transaction.
  • zil_itx_clean() cleans up all in-memory log transactions. Clearing in-memory zil transactions implies that they are not flushed to disk. zil_itx_clean() is called via the zil_clean() function, which dispatches a work request to a dispatch thread.
  • zil_commit() commits zil transactions to stable storage.
  • zil_sync() cleans (deletes) ZIL transactions when the DMU transactions they are assigned to are committed to disk (perhaps as the result of an fsync). It is mostly called from the txg_sync_thread every txg_time (5 seconds).


ZFS Mount

During file system mount, ZFS checks to see if there is an intent log. If there is an intent log, this implies that the system crashed (as the ZIL is deleted at umount(2) time). This intent log is converted to a replay log and is replayed to update the file system to a stable state. If both a replay log and an intent log are present, it implies that the system crashed while replaying the replay log, in which case it is OK to ignore/delete the replay log and replay the intent log.
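The mount-time decision described above can be sketched as a small function (a model of the logic, not actual ZFS code):

```python
def mount_log_action(intent_log, replay_log):
    """Mount-time logic: an intent log on disk means the system crashed;
    a leftover replay log means the crash happened during a replay, so
    the replay log is discarded and the intent log replayed instead."""
    if not intent_log:
        return "clean mount"
    if replay_log:
        return "delete replay log, replay intent log"
    return "convert intent log to replay log, then replay"

print(mount_log_action(False, False))  # -> 'clean mount'
print(mount_log_action(True, True))    # -> 'delete replay log, replay intent log'
```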

ZIL Tunables
I am almost tempted to mention some tunables here but the truth is that ZFS is intended to not require any tuning.  ZFS should (and will) perform optimally "Out of the Box". You might find some switches in the code, but they are only for internal development and will be yanked out soon!

ZIL Performance
As you must have figured out by now, ZIL performance is critical for the performance of synchronous writes. A common application that issues synchronous writes is a database. This means that all of these writes run at the speed of the ZIL. The ZIL is already quite optimized, and I am sure ongoing efforts will optimize this code path even further. As Neil mentions, using nvram/solid state disks for the log would make it scream! I also recommend that you check out Roch's work on ZFS performance for details of other performance studies in progress.

Dtrace scripts for use with zil

  • To see ZIL activity 
    • dtrace -n 'zil*:entry{@[probefunc]=count();}' -n 'zio_free_blk:entry{@[probefunc]=count();}'
  • To see blocksize of log writes
    • dtrace -n '::zil_lwb_commit:entry{@[1] = quantize(((lwb_t *)args[2])->lwb_sz);}'


Congratulations to the ZFS team for delivering such a world class product. You folks rock!.
