2010/Q1 delivers 200% more Exchange performance and 50% extra transaction processing
One of the great advances present in the ZFS Appliance 2010/Q1
software update relates to the block allocation strategy. It's been one the most complex performance investigation I've ever had to deal with because of the very strong impact previous history of block allocation had on future performance. It was maddening experience littered with dead end leads. During that whole time it was very hard to make sense of the data and segregate what was due to a problem in block allocation from author causes that leads customer to report performance issues.
A series of changes to ZFS metaslab code lead to 50% improved OLTP performance
and 70% reduced variability from run to run. We also saw a full 200% improvement on MS Exchange performance from these changes.
Excruciating Details for aspiring developer "Abandon hope all ye who enter here"
At some point we started to look at random synchronous file rewrite (a la DB writer) and it seemed clear that the performance was not what we expected for this workload. Basically, independent DB block synchronous writes were not aggregating into larger I/Os in the vdev queue. We could not truly assert a point where a regression had set in, so rather than threat this as a performance regression, we just decided to study what we had in 2009/Q3 and see how we could make our current scheme work better. And that lead us on the path of the metaslab allocator:
As Jeff explains, when a piece of data needs to be stored on disk, ZFS will first select a top level vdev (a raid-z group, a mirrored set, a single disk) for it. Within that top level vdev, a metaslab (slab for short) will be chosen and within the slab a block of Data Virtual Address (DVA) space will be selected. We didn't have a problem with the vdev selection process, but we thought we have an issue with the block allocator. What we were seeing was that for random file rewrite the aggregation factor was large (say 8 blocks or more) when performance was good but dropped to 1 or 2 when performance was low. So we tried to see if we could do a better job at selecting blocks that would lead to better I/O aggregation down the pipeline. We kept looking at the effect of block allocation but it turned out the source of problem was in the slab selection process.
So a slab is a portion of DVA space within a metaslab group (aka a top level vdev). We currently divide VDEV space into approximately 200 slabs (see vdev_metaslab_set_size
). Slabs can be either loaded in memory or not. When loaded, the associated spacemaps are active meaning we can allocate space from them. When slabs are not loaded, we can't allocated space but we can still free space from them (ZFS being copy-on-write or COW, a block rewrite frees up the old space). In this case we just log to disk the freed range information. As load and unload of spacemaps are not cheap and we insure we minimize such operation.
So each slab is weighted according to a few criteria and the slab with the highest weight is selected for allocation on a vdev. The first criteria for slab selection is to reuse the same one as the last one used: basically don't change a winner. We refer to this as the PRIMARY
slab. The second criteria for slab selection is the amount of free space. The more the better. However, lower LBA (logical block addresses) which maps to outer cylinders will generally give better performance. So we weight lower LBA more than inner ones at equivalent free space. Finally, a slab that has already been used in the past, even if currently unloaded, is preferred to opening up a fresh new slab. This is the SMO
bonus (because primed slabs have a Space Map Object associated). We do want to favor previously used slabs in order to limit the span of head seeks : we only move inwards when outer space is filled up.
The purpose of the slabs is to service a block allocation, say for a 128K record. So when a request comes in, the highest weighted slab is chosen as we ask for a block of the proper size using an AVL tree of free/allocated space. There was a problem we had to deal with in previous releases which occurred when such allocation failed because of free space fragmentation. Then the AVL tree was then not able to find a span of the requested size and was consuming CPU only to figure out there was no free block present to satisfy an allocation. When space was really tight in a pool we walked every slab before deciding that the allocation needed to be split into small chunks and a gang block (a block of blocks) created. So the spacemaps were augmented with another structure that allowed ZFS to immediately know how large an allocation could be serviced in a slab (the so called picker private tree organized by size of free space).
At that point we had 2 ways to select a block, either find one in sequence of previous allocation (first fit) or use one that fills in exactly a hole in the allocated space: so called best fit allocator. We also decided then to switch from best fit to first fit as a slab became 70% full. The problem that this created, we now realize, is that while it helped the compactness of the on-disk layout, it created a headache for writes. Each new allocation, got a taylored-fit disk area and this lead to much less write aggregation than expected. We would see that write workloads to a slab slowed down as it transitioned to 70% full (note this occurred when a slab was 70% full not the full vdev nor the pool). Eventually, the degraded slab became fully used and it would transition to a different slab with better performance characteristic. Performance could then fluctuate from an hour to the next.
So to solve this problem, what went in 2010/Q1
software release is multifold. The most important thing is: we increased the threshold at which we switched from 'first fit' (go fast) to 'best fit' (pack tight) from 70% full to 96% full. With TB drives, each slab is at least 5GB and 4% is still 200MB plenty of space and no need to do anything radical before that. This gave us the biggest bang. Second, instead of trying to reuse the same primary slabs until it failed an allocation we decided to stop giving the primary slab this preferential threatment as soon as the biggest allocation that could be satisfied by a slab was down to 128K (metaslab_df_alloc_threshold
). At that point we were ready to switch to another slab that had more free space. We also decided to reduce the SMO bonus. Before, a slab that was 50% empty was preferred over slabs that had never been used. In order to foster more write aggregation, we reduced the threshold to 33% empty. This means that a random write workload now spread to more slabs where each one will have larger amount of free space leading to more write aggregation. Finally we also saw that slab loading was contributing to lower performance and implemented a slab prefetch mechanism to reduce down time associated with that operation.
The conjunction of all these changes lead to 50% improved OLTP and 70% reduced variability from run to run (see David Lutz's post on OLTP performance
) . We also saw a full 200% improvement on MS Exchange performance from these changes. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.