Wednesday May 26, 2010

Let's talk about LUN Alignment for 3 minutes

Recall that I had LUN alignment on my mind a few weeks ago. There is nothing special here about the ZFS storage appliance compared to any other storage. Pay attention to how you partition your LUNs; it can have a great impact on performance. Right, Roch?

Thursday March 11, 2010

Dedup Performance Considerations

One of the major milestones for the ZFS Storage Appliance with 2010/Q1 is the ability to dedup data on disk. The open question is then: what performance characteristics should we expect from dedup? As Jeff says, this is the ultimate gaming ground for benchmarks. But let's have a look at the fundamentals.

ZFS Dedup Basics

The dedup code is, simplistically, a large hash table (the DDT). It uses a 256-bit (32-byte) checksum along with other metadata to identify data content. On a hash match, we only need to increase a reference count instead of writing out duplicate data. The dedup code is integrated in the I/O pipeline and is done on the fly as part of the ZFS transaction group (see Dynamics of ZFS, The New ZFS Write Throttle). A ZFS zpool typically holds a number of datasets: either block-level LUNs, which are based on ZVOLs, or NFS and CIFS file shares based on ZFS filesystems. So while the dedup table is a construct associated with an individual zpool, enabling the deduplication feature is controlled at the dataset level. Enabling dedup on a dataset has no impact on existing data, which stays outside of the dedup table. However, any new data stored in the dataset will then be subject to the dedup code. To have existing data become part of the dedup table, one can run a variant of "zfs send | zfs recv" on the datasets.

Dedup works at the ZFS block or record level. For an iSCSI or FC LUN, i.e. objects backed by ZVOL datasets, the default blocksize is 8K. For filesystems (NFS, CIFS or direct attach ZFS), objects smaller than 128K (the default recordsize) are stored as a single ZFS block, while objects bigger than the default recordsize are stored as multiple records. Each record is the unit which can end up deduplicated in the DDT. Whole files which are duplicated in many filesystem instances are expected to dedup perfectly. For example, whole DBs copied from a master file are expected to fall in this category. Similarly for LUNs, virtual desktop users which were created from the same virtual desktop master image are also expected to dedup perfectly.

An interesting topic for dedup concerns streams of bytes such as a tar file. For ZFS, a tar file is actually a sequence of ZFS records with no identified file boundaries. Therefore, identical objects (files captured by tar) present in 2 tar-like byte streams might not dedup well unless the objects actually start on the same alignment within the byte stream. A better dedup ratio would be obtained by expanding the byte stream into its constituent file objects within ZFS. If possible, the tools creating the byte stream would be well advised to start new objects on identified boundaries such as 8K.
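To make the alignment point concrete, here is a minimal sketch, not ZFS code: it hashes fixed-size records of a byte stream and counts how many records two streams share. The 8K record size, the helper name record_hashes and the sample payloads are all assumptions made for the illustration.

```python
import hashlib

RECORD = 8 * 1024  # assumed record size for this illustration


def record_hashes(stream, record=RECORD):
    """Hash each fixed-size record of a byte stream, as record-level dedup would see it."""
    return {
        hashlib.sha256(stream[i:i + record]).hexdigest()
        for i in range(0, len(stream), record)
    }


# One "file" made of 4 records, each with distinct content.
payload = b"".join(bytes([n]) * RECORD for n in range(1, 5))

aligned_a = payload
aligned_b = b"\xff" * RECORD + payload   # same file, preceded by one full record
shifted = b"\xff" * 512 + payload        # same file, preceded by a 512-byte header

# Records shared with the original stream in each case.
shared_aligned = len(record_hashes(aligned_a) & record_hashes(aligned_b))
shared_shifted = len(record_hashes(aligned_a) & record_hashes(shifted))
print(shared_aligned, shared_shifted)
```

When the file starts on a record boundary in both streams, all 4 records match; a 512-byte shift leaves none of them matching, even though the file content is identical.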

Another interesting topic is backups of active databases. Since databases often interact with their constituent files using an identified block size, it is rather important for deduplication effectiveness that the backup target be set up with a block size that matches the source DB block size. Using a larger block on the deduplication target has the undesirable consequence that modifications to small blocks of the source database will cause the large blocks containing them in the backup target to appear unique and, somewhat artificially, not dedup. By using an 8K block size in the dedup target dataset instead of 128K, one could conceivably see up to a 10X better deduplication ratio.
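A rough way to see the block size effect is to count how much data looks new to the backup target after a few small DB blocks change. This is a toy model under stated assumptions (8K DB blocks, every touched target block becomes fully unique); the function name new_unique_bytes is made up for the sketch.

```python
KIB = 1024


def new_unique_bytes(changed_offsets, target_block):
    """Bytes that appear unique on the target after the given 8K DB blocks changed.

    changed_offsets: byte offsets of modified DB blocks in the source file.
    target_block:    block size of the dedup target dataset, in bytes.
    """
    touched = {offset // target_block for offset in changed_offsets}
    return len(touched) * target_block


# 100 modified 8K blocks, each one landing in a different 128K region.
scattered = [i * 128 * KIB for i in range(100)]
# 16 modified 8K blocks that all sit inside the same 128K region.
clustered = [i * 8 * KIB for i in range(16)]

scattered_ratio = new_unique_bytes(scattered, 128 * KIB) / new_unique_bytes(scattered, 8 * KIB)
clustered_ratio = new_unique_bytes(clustered, 128 * KIB) / new_unique_bytes(clustered, 8 * KIB)
print(scattered_ratio, clustered_ratio)
```

In this model the penalty of the larger block is bounded by the 128K/8K size ratio and shrinks toward 1 as changes cluster together, so real workloads would land somewhere in between.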

Performance Model and I/O Pipeline Differences

What is the effect of dedup on performance? First, when dedup is enabled, the checksum used by ZFS to validate the disk I/O is changed to the cryptographically strong SHA256. Darren Moffat shows in his blog that SHA256 actually runs at more than 128 MB/sec on a modern CPU. This means that less than 1 ms is consumed to checksum a 128K block and less than 64 usec for an 8K unit. This cost is only incurred when actually reading or writing data to disk, an operation that is expected to take 5-10 ms; therefore the checksum generation or validation is not a source of concern.
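The arithmetic behind those two figures can be checked in a few lines. This sketch assumes the 128 MB/sec figure is taken as 128 * 2^20 bytes per second and that checksum time scales linearly with block size; checksum_time_us is a name chosen for the example.

```python
def checksum_time_us(block_bytes, throughput_bytes_per_s=128 * 2**20):
    """Microseconds needed to checksum one block at the given throughput."""
    return block_bytes / throughput_bytes_per_s * 1e6


t_128k = checksum_time_us(128 * 1024)  # a bit under 1000 usec, i.e. under 1 ms
t_8k = checksum_time_us(8 * 1024)      # a bit under 64 usec
print(t_128k, t_8k)
```

Both values stay well below the 5-10 ms expected for the disk I/O itself.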

For the read code path, very little modification should be observed. The fact that a read happens to hit a block which is part of the dedup table is not relevant to the main code path. The biggest effect will be that we use a stronger checksum function invoked after a read I/O: at most an extra 1 ms is added to a 128K disk I/O. However, if a subsequent read is for a duplicate block which happens to be in the pool ARC cache, then instead of having to wait for a full disk I/O, only a much faster copy of the duplicate block will be necessary. Each filesystem can then work independently on its copy of the data in the ARC cache, as is the case without deduplication.

Synchronous writes are also unaffected in their interaction with the ZIL. The blocks written in the ZIL have a very short lifespan and are not subject to deduplication. Therefore the path of synchronous writes is mostly unaffected unless the pool itself ends up not being able to absorb the sustained rate of incoming changes for tens of seconds. Similarly, for asynchronous writes which interact with the ARC caches, the dedup code has no effect unless the pool's transaction group itself becomes the limiting factor.

So the effect of dedup will take place during the pool transaction group updates. This is where we take all modifications that occurred in the last few seconds and atomically commit a large transaction group (TXG). While a TXG is running, applications are not directly affected except possibly for the competition for CPU cycles. They mostly continue to read from disk, do synchronous writes to the ZIL, and do asynchronous writes to memory. The biggest effect will come if the incoming flow of work exceeds the capability of the TXG to commit data to disk. Then eventually the reads and writes will be held up by the necessary write throttling code, which prevents ZFS from consuming all of memory.

Looking into the ZFS TXG, we have 2 operations of interest: the creation of a new data block and the simple removal (free) of a previously used block. Since ZFS operates under a copy-on-write (COW) model, any modification to an existing block actually represents both a new data block creation and a free of a previously used block (unless a snapshot was taken, in which case there is no free). For file shares, this concerns existing file rewrites; for block LUNs (FC and iSCSI), this concerns most writes except the initial one (the very first write to a logical block address or LBA actually allocates the initial data; subsequent writes to the same LBA are handled using COW).

For the creation of a new application data block, ZFS will run the checksum of the block, as it does normally, and then look up the dedup table for a match based on that checksum and a few other bits of information. On a dedup table hit, only a reference count needs to be increased, and such changes to the dedup table will be stored on disk before the TXG completes. Many DDT entries are grouped in a disk block and compression is involved. A big win occurs when many entries in a block are subject to a write match during one TXG: a single 16K I/O can then replace tens of larger I/Os.

As for free operations, the internals of ZFS actually hold the referencing block pointer, which contains the checksum of the block being freed. Therefore there is no need to read the data being freed nor to recompute its checksum. ZFS, with checksum in hand, looks up the entry in the dedup table and decrements the reference counter. If the counter is still non-zero then nothing more is necessary (just the dedup table sync). If the freed block ends up without any reference then it will be freed.
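The create and free paths just described can be sketched as a reference-counted table. This is a toy model, not the ZFS DDT: the class ToyDedupTable and its counters are hypothetical, and the key here is the SHA256 digest alone, whereas the real table also uses a few other bits of information.

```python
import hashlib


class ToyDedupTable:
    """Reference-counted table keyed by block checksum (illustration only)."""

    def __init__(self):
        self.refcount = {}       # checksum -> number of references
        self.blocks_written = 0  # blocks that actually had to be written
        self.blocks_freed = 0    # blocks whose space was actually released

    def write(self, data):
        """Create a block: bump the refcount on a hit, write the data on a miss."""
        key = hashlib.sha256(data).hexdigest()
        if key in self.refcount:
            self.refcount[key] += 1
        else:
            self.refcount[key] = 1
            self.blocks_written += 1
        return key  # stands in for the checksum kept in the block pointer

    def free(self, key):
        """Free a block using the checksum already held by the caller."""
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            del self.refcount[key]
            self.blocks_freed += 1


ddt = ToyDedupTable()
keys = [ddt.write(b"same 8K block") for _ in range(3)]  # one write, two hits
ddt.free(keys[0])
ddt.free(keys[1])
freed_before_last = ddt.blocks_freed
ddt.free(keys[2])  # last reference gone: the block itself is freed
print(ddt.blocks_written, freed_before_last, ddt.blocks_freed)
```

Note that free() never touches the data: the caller passes the checksum it already holds, which mirrors the role of the block pointer described above.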

The dedup table itself is an object managed by ZFS at the pool level. The table is considered metadata and its elements will be stored in the ARC cache. Up to 25% of memory (zfs_arc_meta_limit) can be used to store metadata. When the dedup table actually fits in memory, enabling dedup is expected to have a rather small effect on performance. But when the table is many times greater than the allotted memory, the lookups necessary to complete the TXG can cause write throttling to be invoked earlier than for the same workload running without dedup. If using an L2ARC, the DDT represents a prime candidate for the secondary cache. Note that independent of the size of the dedup table, read-intensive workloads in highly duplicated environments are expected to be serviced using fewer IOPS at lower latency than without dedup. Also note that whole filesystem removal or large file truncation are operations that can free up large quantities of data at once; when the dedup table exceeds the allotted memory, those operations, which are more complex with deduplication, can impact the amount of data going into every TXG and the write throttling behavior.

So how large is the dedup table ?

The command zdb -DD on a pool shows the size of DDT entries. In one of my experiments it reported about 200 bytes of core memory per table entry. If each unique object is associated with 200 bytes of memory, then 32GB of RAM could reference 20TB of unique data stored in 128K records, or more than 1TB of unique data in 8K records. So if there is a need to store more unique data than these ratios provide, strongly consider allocating some large read-optimized SSD to hold the DDT. The DDT lookups are small random I/Os, which are handled very well by current generation SSDs.
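The sizing estimate above is a one-line calculation. In this sketch the 200 bytes per entry comes from the experiment quoted in the text, while the function name and the usable_fraction parameter (to model memory actually available to the table) are assumptions added for the example.

```python
GIB = 2**30
TIB = 2**40


def ddt_reach_tib(ram_bytes, record_bytes, entry_bytes=200, usable_fraction=1.0):
    """Unique data (in TiB) that a DDT held entirely in memory could reference."""
    entries = ram_bytes * usable_fraction / entry_bytes
    return entries * record_bytes / TIB


reach_128k = ddt_reach_tib(32 * GIB, 128 * 1024)  # roughly 20 TiB
reach_8k = ddt_reach_tib(32 * GIB, 8 * 1024)      # a little over 1 TiB
print(round(reach_128k, 2), round(reach_8k, 2))
```

Passing a smaller usable_fraction shows how quickly the reach shrinks when only part of memory is available to the table.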

The first motivation to enable dedup is actually when dealing with duplicate data to begin with. If possible, procedures that generate duplication should be reconsidered. The use of ZFS Clones is actually a much better way to generate logically duplicate data for multiple users, in a way that does not require a dedup hash table.

But when the operating conditions do not allow the use of ZFS Clones and data is highly duplicated, then the ZFS deduplication capability is a great way to reduce the volume of stored data.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Proper Alignment for extra performance

Because of disk partitioning software on your storage clients (keywords: EFI, VTOC, fdisk, DiskPart,...) or a mismatch between storage configuration and application request pattern, you could be suffering a 2-4X performance degradation....

Many of the I/O performance problems I see end up being the result of a mismatch between request sizes, or their alignment, and the natural block size of the underlying storage. While raw disk storage works using a 512-byte sector and performs at the same level independent of the starting offset of I/O requests, this is not the case for more sophisticated storage, which will tend to use larger block units. Some SSDs today support 512B-aligned requests but will work much better if you give them 4K-aligned requests, as described in Aligning on 4K boundaries Flash and Sizes. The Sun Oracle 7000 Unified Storage line supports different block sizes between 4K and 128K (it can actually go lower but I would not recommend that in general). Having proper alignment between the application's view, the initiator partitioning and the backing volume can have a great impact on the end performance delivered to applications.

When is alignment most important ?

Alignment problems are most likely to have an impact with
  • running a DB on file shares or block volumes
  • write streaming to block volumes (backups)
Also impacted at a lesser level :
  • large file rewrites on CIFS or NFS shares
In each case, adjusting the recordsize to match the workload and ensuring that partitions are aligned on a block boundary could have an important effect on your performance.

Let's review the different cases.

Case 1: running a Database (DB) on file shares or block volumes

Here the DB is a block-oriented application. General ZFS Best Practices warrant that the storage use a record size equal to the DB natural block size. At the logical level, the DB issues I/Os which are aligned on block boundaries. When using file semantics (NFS or CIFS), the alignment is guaranteed to be observed all the way to the backend storage. But when using block device semantics, the alignment of requests on the initiator is not guaranteed to be the same as the alignment on the target side. Misalignment of the LUN will cause two pathologies. First, an application block read will straddle 2 storage blocks, creating storage IOPS inflation (more backend reads than application reads). But a more drastic effect will be seen for block writes which, when aligned, could be serviced by a single write I/O. Those will now require a Read-Modify-Write (R-M-W) of 2 adjacent storage blocks. This type of I/O inflation leads to additional storage load and degrades performance during high demand.
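Both pathologies follow from simple offset arithmetic, sketched below. The helper names are made up for the example, and the 63-sector partition offset is only an illustrative value (a start offset long used by legacy MS-DOS style partition tables), not something taken from the cases above.

```python
def storage_blocks_touched(offset, size, blocksize):
    """Number of storage blocks covered by an I/O of `size` bytes at `offset`."""
    first = offset // blocksize
    last = (offset + size - 1) // blocksize
    return last - first + 1


def write_needs_rmw(offset, size, blocksize):
    """A write that does not cover whole storage blocks forces a read-modify-write."""
    return offset % blocksize != 0 or size % blocksize != 0


BLOCK = 8192                 # storage block size, matching an 8K DB block
partition_start = 63 * 512   # illustrative misaligned partition offset, in bytes

aligned = (storage_blocks_touched(0, BLOCK, BLOCK), write_needs_rmw(0, BLOCK, BLOCK))
misaligned = (
    storage_blocks_touched(partition_start, BLOCK, BLOCK),
    write_needs_rmw(partition_start, BLOCK, BLOCK),
)
print(aligned, misaligned)
```

The aligned 8K request maps to one storage block with no R-M-W; the same request behind the misaligned partition covers 2 blocks and needs an R-M-W on each.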

To avoid such I/O inflation, ensure that the backing store uses a block size (LUN volblocksize or share recordsize) compatible with the DB block size. If using a file share such as NFS, ensure that the filesystem client passes I/O requests directly to the NFS server using a mount option such as directio, or use Oracle's dNFS client. (Note that with the directio mount option, for memory management reasons independent of alignment concerns, the server will behave better when the client specifies rsize,wsize options not exceeding 128K.) To avoid LUN misalignment, prefer the use of full LUNs as opposed to sliced partitions. If disk slices must be used, prefer a partitioning scheme in which one can control the sector offset of individual partitions, such as EFI labels. In that case, start partitions on a sector boundary which aligns with the volume's blocksize. For instance, an initial block for a partition which is a multiple of 16 * 512B sectors will align on an 8K boundary, the default LUN blocksize.
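When choosing a partition start sector by hand, the check is a modulo on the byte offset. A small sketch, assuming 512-byte sectors; the function names are hypothetical and the sample start sectors are just test values.

```python
SECTOR = 512  # bytes per sector, as assumed throughout this example


def partition_is_aligned(start_sector, volblocksize=8192, sector=SECTOR):
    """True when the partition's first byte falls on a volume block boundary."""
    return (start_sector * sector) % volblocksize == 0


def next_aligned_sector(start_sector, volblocksize=8192, sector=SECTOR):
    """Smallest sector >= start_sector that sits on a volume block boundary."""
    sectors_per_block = volblocksize // sector
    return -(-start_sector // sectors_per_block) * sectors_per_block  # ceiling division


for start in (16, 34, 63):
    print(start, partition_is_aligned(start), next_aligned_sector(start))
```

For an 8K volblocksize any multiple of 16 sectors works; for a 128K volblocksize the same helpers round up to a multiple of 256 sectors.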

Case 2: write streaming to block volumes (backups)

The other important case to pay attention to is stream writing to a raw block device. Block devices by default commit each write to stable storage. This path is often optimized through the use of acceleration devices such as write-optimized SSDs. Misalignment of the LUNs due to partitioning software implies that application writes, which could otherwise be committed to SSD at low latency, will instead be delayed by disk reads caught in R-M-W. Because the writes are synchronous in nature, the application running on the initiator will thus be considerably delayed by disk reads. Here again one must ensure that partitions created on the client system are aligned with the volume's blocksize, which typically defaults to 8K. For pure streaming workloads, a large blocksize, up to the maximum of 128K, can lead to greater streaming performance. One must take good care that the block size used for a LUN does not exceed the size of application writes to the raw volume, or risk being hit by the R-M-W penalty.

Case 3: large file rewrites on CIFS or NFS shares

For file shares, large streaming writes will be of 2 types: they will either be the more common file creation (write allocation) or they will correspond to a streaming overwrite of an existing file. The more common write allocation would not greatly suffer from misalignment since there is no pre-existing data to be read and modified. But the less common streaming rewrite of files can definitely be impacted by misalignment and R-M-W cycles. Fortunately, file protocols are not subject to LUN misalignment, so one must only take care that the write sizes reaching the storage are a multiple of the recordsize used to create the file share in the storage. The Solaris NFS client often issues 32K writes for streaming applications, while CIFS has been observed to use 64K from clients. If streaming asynchronous rewrite of existing files is an important component of your I/O workload (a rare set of conditions), it might well be that setting the share recordsize accordingly will provide a boost to delivered performance.

In summary

The problem with alignment is most generally seen with fixed-record-oriented applications (such as Oracle Database or Microsoft Exchange) with random access patterns and synchronous I/O semantics. It can be caused by partitioning software (fdisk, diskpart) which creates disk partitions not aligned with the storage blocks. It can also be caused, to a lesser extent, by streaming file overwrites when the application write size does not match the file share's blocksize. The Sun Storage 7000 line offers great flexibility in selecting different blocksizes for different uses within a single pool of storage. However, it has no control over the offset that could be selected during disk partitioning of block devices on client systems. Care must be taken when partitioning disks to avoid misalignment and degraded performance. Using full LUNs is preferred.

Doubling Exchange Performance

2010/Q1 delivers 200% more Exchange performance and 50% extra transaction processing

One of the great advances present in the ZFS Appliance 2010/Q1 software update relates to the block allocation strategy. It's been one of the most complex performance investigations I've ever had to deal with, because of the very strong impact that the previous history of block allocation had on future performance. It was a maddening experience littered with dead-end leads. During that whole time it was very hard to make sense of the data and to separate what was due to a problem in block allocation from the other causes that lead customers to report performance issues.

Executive Summary

A series of changes to the ZFS metaslab code led to 50% improved OLTP performance and 70% reduced variability from run to run. We also saw a full 200% improvement on MS Exchange performance from these changes.

Excruciating Details for aspiring developers: "Abandon hope all ye who enter here"

At some point we started to look at random synchronous file rewrite (a la DB writer) and it seemed clear that the performance was not what we expected for this workload. Basically, independent DB block synchronous writes were not aggregating into larger I/Os in the vdev queue. We could not truly identify a point where a regression had set in, so rather than treat this as a performance regression, we just decided to study what we had in 2009/Q3 and see how we could make our current scheme work better. And that led us onto the path of the metaslab allocator:

As Jeff explains, when a piece of data needs to be stored on disk, ZFS will first select a top level vdev (a raid-z group, a mirrored set, a single disk) for it. Within that top level vdev, a metaslab (slab for short) will be chosen, and within the slab a block of Data Virtual Address (DVA) space will be selected. We didn't have a problem with the vdev selection process, but we thought we had an issue with the block allocator. What we were seeing was that for random file rewrite the aggregation factor was large (say 8 blocks or more) when performance was good but dropped to 1 or 2 when performance was low. So we tried to see if we could do a better job at selecting blocks that would lead to better I/O aggregation down the pipeline. We kept looking at the effect of block allocation, but it turned out the source of the problem was in the slab selection process.

So a slab is a portion of DVA space within a metaslab group (aka a top level vdev). We currently divide vdev space into approximately 200 slabs (see vdev_metaslab_set_size). Slabs can be either loaded in memory or not. When loaded, the associated spacemaps are active, meaning we can allocate space from them. When slabs are not loaded, we can't allocate space but we can still free space from them (ZFS being copy-on-write or COW, a block rewrite frees up the old space). In this case we just log the freed range information to disk. As loading and unloading spacemaps is not cheap, we make sure to minimize such operations.

So each slab is weighted according to a few criteria, and the slab with the highest weight is selected for allocation on a vdev. The first criterion for slab selection is to reuse the same slab as the last one used: basically, don't change a winner. We refer to this as the PRIMARY slab. The second criterion for slab selection is the amount of free space: the more the better. However, lower LBAs (logical block addresses), which map to outer cylinders, will generally give better performance. So we weight lower (outer) LBAs more than higher (inner) ones at equivalent free space. Finally, a slab that has already been used in the past, even if currently unloaded, is preferred to opening up a fresh new slab. This is the SMO bonus (because primed slabs have a Space Map Object associated). We do want to favor previously used slabs in order to limit the span of head seeks: we only move inwards when outer space is filled up.
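The selection criteria can be pictured with a toy weighting function. This is only an illustration of how such criteria might combine, not the actual ZFS metaslab code: the Slab class, the function toy_weight, the linear LBA scaling and both bonus values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Slab:
    index: int             # position within the vdev; 0 is the lowest LBA
    free_bytes: int
    has_smo: bool          # already used in the past (has a Space Map Object)
    primary: bool = False  # the slab used for the last allocation


def toy_weight(slab, slab_count, smo_bonus=1.5, primary_bonus=2**62):
    """Combine the criteria into one number; every constant here is made up."""
    # Free space, scaled from 2x at the lowest LBA down toward 1x at the highest.
    weight = slab.free_bytes * (2.0 - slab.index / slab_count)
    if slab.has_smo:
        weight *= smo_bonus      # prefer slabs that were used before
    if slab.primary:
        weight += primary_bonus  # don't change a winner
    return weight


GIB = 2**30
slabs = [
    Slab(index=0, free_bytes=GIB, has_smo=True),
    Slab(index=1, free_bytes=GIB, has_smo=False),
    Slab(index=150, free_bytes=GIB, has_smo=True),
]

best = max(slabs, key=lambda s: toy_weight(s, slab_count=200))
slabs[2].primary = True
best_with_primary = max(slabs, key=lambda s: toy_weight(s, slab_count=200))
print(best.index, best_with_primary.index)
```

With equal free space, the previously used slab at the lowest LBA wins; once a slab is marked primary it wins regardless of position.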

The purpose of the slabs is to service a block allocation, say for a 128K record. So when a request comes in, the highest weighted slab is chosen and we ask it for a block of the proper size using an AVL tree of free/allocated space. There was a problem we had to deal with in previous releases which occurred when such an allocation failed because of free space fragmentation. The AVL tree was then not able to find a span of the requested size and was consuming CPU only to figure out that there was no free block present to satisfy the allocation. When space was really tight in a pool, we walked every slab before deciding that the allocation needed to be split into small chunks and a gang block (a block of blocks) created. So the spacemaps were augmented with another structure that allowed ZFS to immediately know how large an allocation could be serviced in a slab (the so-called picker private tree organized by size of free space).

At that point we had 2 ways to select a block: either find one in sequence after the previous allocation (first fit), or use one that exactly fills a hole in the allocated space (the so-called best fit allocator). We also decided then to switch from first fit to best fit as a slab became 70% full. The problem that this created, we now realize, is that while it helped the compactness of the on-disk layout, it created a headache for writes. Each new allocation got a tailored-fit disk area, and this led to much less write aggregation than expected. We would see that write workloads to a slab slowed down as it transitioned to 70% full (note this occurred when a slab was 70% full, not the full vdev nor the pool). Eventually, the degraded slab became fully used and allocation would transition to a different slab with better performance characteristics. Performance could then fluctuate from one hour to the next.

So to solve this problem, what went into the 2010/Q1 software release is multifold. The most important thing is: we increased the threshold at which we switch from 'first fit' (go fast) to 'best fit' (pack tight) from 70% full to 96% full. With TB drives, each slab is at least 5GB and 4% is still 200MB: plenty of space, and no need to do anything radical before that. This gave us the biggest bang. Second, instead of trying to reuse the same primary slab until it failed an allocation, we decided to stop giving the primary slab this preferential treatment as soon as the biggest allocation that could be satisfied by a slab was down to 128K (metaslab_df_alloc_threshold). At that point we were ready to switch to another slab that had more free space. We also decided to reduce the SMO bonus. Before, a slab that was 50% empty was preferred over slabs that had never been used. In order to foster more write aggregation, we reduced the threshold to 33% empty. This means that a random write workload now spreads to more slabs, where each one will have a larger amount of free space, leading to more write aggregation. Finally, we also saw that slab loading was contributing to lower performance and implemented a slab prefetch mechanism to reduce the down time associated with that operation.
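The difference between the two strategies, and the effect of the fullness threshold, can be sketched over a simple list of free segments. This is a generic first fit / best fit illustration rather than the ZFS allocator; the segment layout, the function names and the way fullness is computed are assumptions made for the example.

```python
def first_fit(free_segments, size, cursor=0):
    """Start of the first free segment at or after `cursor` that is large enough."""
    for start, length in sorted(free_segments):
        if start >= cursor and length >= size:
            return start
    return None


def best_fit(free_segments, size):
    """Start of the smallest free segment that is large enough."""
    candidates = [(length, start) for start, length in free_segments if length >= size]
    return min(candidates)[1] if candidates else None


def allocate(free_segments, size, slab_size, best_fit_threshold=0.96):
    """Use first fit until the slab is fuller than the threshold, then best fit."""
    used = 1 - sum(length for _, length in free_segments) / slab_size
    if used >= best_fit_threshold:
        return best_fit(free_segments, size)
    return first_fit(free_segments, size)


# (start, length) of free space in a slab of 1000 units; 49% of it is in use.
segments = [(0, 300), (400, 10), (600, 200)]

go_fast = allocate(segments, size=10, slab_size=1000)
pack_tight = allocate(segments, size=10, slab_size=1000, best_fit_threshold=0.4)
print(go_fast, pack_tight)
```

First fit places the block at the front of the large free run, where the next allocations can land right behind it and aggregate; best fit tucks it into the exact-size hole at 400, which packs the slab tightly but scatters consecutive writes.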

The conjunction of all these changes led to 50% improved OLTP performance and 70% reduced variability from run to run (see David Lutz's post on OLTP performance). We also saw a full 200% improvement on MS Exchange performance from these changes.
