Tuning ZFS recordsize
By User13278091-Oracle on juin 07, 2006
One important performance parameter of ZFS is the recordsize which govern the size of filesystem blocks for large files. This is the unit that ZFS validates through checksums. Filesystem blocks are dynamically striped onto the pooled storage, on a block to virtual device (vdev) basis.
It is expected that for some loads, tuning the recordsize will be required. Note that, in traditional Filesytems such a tunable would govern the behavior of all of the underlying storage. With ZFS, tuning this parameter only affects the tuned Filesystem instance; it will apply to newly created files. The tuning is achieved using
zfs set recordsize=64k mypool/myfs
In ZFS all files are stored either as a single block of varying sizes (up to the recordsize) or using multiple recordsize blocks. Once a file grows to be multiple blocks, it's blocksize if definitively set to the FS recordsize at the time.
Some more experience will be required with the recordsize tuning. Here are some elements to guide along the way.
If one considers the input of a FS block typically in response to an application read, the size of the I/O in question will not basically impact the latency by much. So, as a first approximation, the recordsize does not matter (I'll come back to that) to read-type workloads.
For FS block outputs, those that are governed by the recordsize, actually occur mostly asynchronously with the application; and since applications are not commonly held up by those outputs, the delivered throughput is, as for read-type loads, not impacted by the recordsize.
So the first approximation is that recordsize does not impact performance much. To service loads that are transient in nature with short I/O bursts (< 5 seconds) we do not expect records tuning to be necessary. The same can be said for sequential type loads.
So what about the second approximation ? A problem that can occur with using an inflated recordsize (128K) compared to application read/write sizes, is early storage saturation. If an application requests 64K of data, then providing a 128K record doesn't change the latency that the application sees much. However if the extra data is discarded from the cache before ever being read, we see that the extra occupation of the data channel was occupied for no good reason. If a limiting factor to the storage is, for instance, a 100MB/sec channel, I can handle 700 times 128K records per second onto that channel. If I halves the recordsize that should double the number of small records I can input.
On the small record output loads, the system memory creates a buffer that defer the direct impact to applications. For output, if the storage is saturated this way for tens of seconds, ZFS will eventually throttle applications. This means that, in the end, when the recordsize leads to sustained storage overload on output, there will be an impact as well.
There is another aspect to the recordsize. A partial write to an uncached FS block (a write syscall of size smaller than the recordsize) will have to first input the corresponding data. Conversely, when individual writes are such that they cover full filesystem recordsize blocks, those writes can be handled without the need to input the associated FS blocks. Other consideration (metadata overhead, caching) dictates however that the recordsize not be reduced below a certain point (16K to 64K; do send-in your experience).
So, one advice is to keep an eye on the channel throughput and tune recordsize for random access workloads that saturate to storage. Sequential type workloads should work quite well with the current default recordsize. If the applications' read/write sizes can be increased, that should also be considered. For non-cached workloads that overwrites file data in small aligned chunks , then matching the recordsize with the write access size may bring some performance gains.