Tuesday, February 28, 2012

Sun ZFS Storage Appliance : can do blocks, can do files too!

Last October, we demonstrated storage leadership in block protocols with our stellar SPC-1 result showcasing our top of the line Sun ZFS Storage 7420.

As a benchmark, SPC-1's profile is close to what a fixed-block-size database would actually be doing. See Fast, Safe, Cheap: Pick 3 for more details on that result. For an encore, we're showing today how the ZFS Storage Appliance performs in a totally different environment: generic NFS file serving.

We're announcing that the Sun ZFS Storage 7320 has reached 134,140 SPECsfs2008_nfs.v3 Ops/sec with a 1.51 ms ORT running the SPEC SFS 2008 benchmark!

Does price/performance matter? It does, doesn't it? See what Darius has to say about how we compare to NetApp: Oracle posts Spec SFS.

This is one step further in the direction of bringing to our customers true high-performance unified storage capable of handling blocks and files on the same physical media. It's worth noting that provisioning of space between the different protocols is entirely software based and fully dynamic, that every stored element is fully checksummed, that all stored data can be compressed with a number of different algorithms (including gzip), and that both filesystems and block-based LUNs can be snapshotted and cloned at their own granularity. All these manageability features are available to you in this high-performance storage package.

Way to go ZFS !

SPEC and SPECsfs are registered trademarks of Standard Performance Evaluation Corporation (SPEC). Results as of February 22, 2012, for more information see www.spec.org.

Monday, October 03, 2011

Fast, Safe, Cheap : Pick 3

Today, we're making performance headlines with Oracle's ZFS Storage Appliance.

SPC-1: Twice the performance of NetApp at the same latency; half the $/IOPS.



I'm proud to say that yours truly, along with a lot of great teammates at Oracle, is not a total stranger to this milestone.

We are announcing that Oracle's 7420C cluster achieved 137,000 SPC-1 IOPS with an average latency of less than 10 ms. That is double the result of NetApp's 3270A while delivering the same latency. Compared to the NetApp 3270 result, this is a 2.5X improvement in $/SPC-1 IOPS ($2.99/IOPS vs $7.48/IOPS). We're also showing that when the ZFS Storage Appliance runs at the rate posted by the 3270A (68,034 SPC-1 IOPS), our latency of 3.26 ms is almost 3X lower than theirs (9.16 ms). Moreover, our result was obtained with 23,700 GB of user-level capacity (internally mirrored) for $17.3/GB, while NetApp's, even using a space-saving RAID scheme, can only deliver $23.5/GB. This is the price per GB of application data actually used in the benchmark. On top of that, the 7420C still had 40% of space headroom whereas the 3270A was left with only 10% of free blocks.

These great results were at least partly made possible by the availability of 15K RPM hard disk drives (HDDs). Those are great for running the most demanding databases because they combine a large IOPS capability with generally smaller capacity. That IOPS/GB ratio makes them ideal to store the high-intensity database workload modeled by SPC-1. On top of that, this concerted engineering effort led to improved software, and not just for systems running 15K RPM drives. We actually used this benchmark to seek out ways to increase the quality of our products. During the preparation runs, after an initial diagnosis of an issue, we were committed to finding solutions that did not target the idiosyncrasies of SPC-1 but were based on sound design decisions. So instead of changing the default value of some internal parameter to a new static default, we actually changed the way the parameter worked so that our storage systems of all types and sizes would benefit.

So not only are we getting great SPC-1 results, but all existing customers will benefit from this work even if they are operating outside of the intense conditions created by the benchmark.

So what is SPC-1? It is one of the few benchmarks that count for storage. It is maintained by the Storage Performance Council (SPC). SPC-1 simulates multiple databases running on a centralized storage system or storage cluster. But even though SPC-1 is a block-based benchmark, within the ZFS Storage Appliance a block-based FC or iSCSI volume is handled very much the same way as a large file subject to synchronous operations. And by combining modern network technologies (InfiniBand or 10GbE Ethernet), the CPU power packed in the 7420C storage controllers and Oracle's custom dNFS technology for databases, one can truly achieve very high database transaction rates on top of the more manageable and flexible file-based protocols.

The benchmark defines three Application Storage Units (ASUs): ASU1 with a heavy 8KB block read/write component, ASU2 with a much lighter 8KB block read/write component, and ASU3 which is subject to hundreds of write streams. As such it is not too far from a simulation of running hundreds of Oracle databases on a single system: ASU1 and ASU2 for datafiles and ASU3 for redo log storage.

The total size of the ASUs is constrained such that all of the stored data (including mirror protection and disks used as spares) must exceed 55% of all configured storage. The benchmark team is then free to decide how much total storage to configure. From that figure, 10% is given to ASU3 (redo log space) and the rest is divided equally between the heavily used ASU1 and the lightly used ASU2.

The benchmark team also has to select the SPC-1 IOPS throughput level it wishes to run. This is not a light decision given that you want to balance high IOPS, low latency and $/user GB.

Once the target IOPS rate is selected, there are multiple criteria needed to pass a successful audit; one of the most critical is that you have to run at the specified IOPS rate for a whole 8 hours. Note that the previous specification of the benchmark, used by NetApp, called for a 4-hour run. During that 8-hour run delivering a solid 137,000 SPC-1 IOPS, the average latency must be less than 30 ms (we did much better than that).

After this brutal 8 hour run, the benchmark then enters another critical phase: the workload is restarted (using a new randomly selected working set) and performance is measured for a 10 minute period. It is this 10 minute period that decides the official latency of the run.

When everything is said and done, you pull the trigger, go to sleep and wake up to the result. As you can guess, we were ecstatic that morning. Before that glorious day, for lack of a stronger word, a lot of hard work had been done during the extensive preparation runs. With little time, and normally not all of the hardware, one runs through a series of runs at incremental loads, making educated guesses as to how to improve the result. As you get more hardware you scale up the result, tweaking things more or less until the final hour.

SPC-1, with its requirement of less than 45% of unused space, is designed to trigger many disk-level random read IOPS. Despite the inherently random pattern of the workload, we saw that our extensive caching architecture was as helpful for this benchmark as it is in real production workloads. While a 15K RPM HDD normally levels off at slightly above 300 random IOPS, our 7420C, as a whole, could deliver almost 500 user-level SPC-1 IOPS per HDD.

In the end, one of the most satisfying aspects was to see that the data being managed by ZFS was stored rock solid on disk and properly checksummed, that all data could be snapshotted and compressed on demand, and that the system delivered impressively steady performance.

2X the absolute performance, 2.5X cheaper per SPC-1 IOPS, almost 3X lower latency, 30% cheaper per user GB, with room to grow... So, if you have a storage decision coming and you need FAST, SAFE, CHEAP: pick 3, and take a fresh look at the ZFS Storage Appliance.





SPC-1, SPC-1 IOPS and $/SPC-1 IOPS are registered trademarks of the Storage Performance Council (SPC). More info: www.storageperformance.org.

Oracle Sun ZFS Storage Appliance 7420: http://www.storageperformance.org/results/benchmark_results_spc1#a00108 (as of October 3, 2011). NetApp FAS3270A: http://www.storageperformance.org/results/benchmark_results_spc1#ae00004 (as of October 3, 2011).

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Wednesday, May 26, 2010

Let's talk about LUN Alignment for 3 minutes

Recall that I had LUN alignment on my mind a few weeks ago. There is nothing special about the ZFS Storage Appliance here compared to any other storage: pay attention to how you partition your LUNs, as it can have a great impact on performance. Right, Roch?:

Thursday, March 11, 2010

Dedup Performance Considerations

One of the major milestones for the ZFS Storage Appliance with 2010/Q1 is the ability to dedup data on disk. The open question then is: what performance characteristics are we expected to see from dedup? As Jeff says, this is the ultimate gaming ground for benchmarks. But let's have a look at the fundamentals.

ZFS Dedup Basics

Dedup code is simplistically a large hash table (the DDT). It uses a 256-bit (32 Byte) checksum along with other metadata to identify data content. On a hash match, we only need to increase a reference count instead of writing out duplicate data. The dedup code is integrated in the I/O pipeline and runs on the fly as part of the ZFS transaction group (see Dynamics of ZFS, The New ZFS Write Throttle). A ZFS zpool typically holds a number of datasets: either block-level LUNs, which are based on ZVOLs, or NFS and CIFS file shares based on ZFS filesystems. So while the dedup table is a construct associated with an individual zpool, enabling the deduplication feature is controlled at the dataset level. Enabling the dedup feature on a dataset has no impact on existing data, which stays outside of the dedup table. However, any new data stored in the dataset will then be subject to the dedup code. To actually have existing data become part of the dedup table, one can run a variant of "zfs send | zfs recv" on the datasets.
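For illustration, on a generic ZFS system (pool and dataset names below are hypothetical), enabling dedup per dataset and bringing existing data into the DDT might look like this sketch; the appliance exposes the same capability as a share property in its BUI:

   # Enable deduplication; setting it on the parent lets new child datasets inherit it.
   # The DDT itself is a pool-level structure.
   zfs set dedup=on tank/shares

   # New writes under tank/shares are now deduplicated. To bring pre-existing data
   # into the DDT, rewrite it, for instance with a local send/receive:
   zfs snapshot tank/shares/projectA@prededup
   zfs send tank/shares/projectA@prededup | zfs recv tank/shares/projectA_dedup

   # Observe the pool-wide dedup ratio.
   zpool get dedupratio tank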

Dedup works at the ZFS block, or record, level. For an iSCSI or FC LUN, i.e. objects backed by ZVOL datasets, the default blocksize is 8K. For filesystems (NFS, CIFS or direct-attach ZFS), objects smaller than 128K (the default recordsize) are stored as a single ZFS block, while objects bigger than the default recordsize are stored as multiple records. Each record is the unit which can end up deduplicated in the DDT. Whole files which are duplicated in many filesystem instances are expected to dedup perfectly. For example, whole databases copied from a master file are expected to fall into this category. Similarly for LUNs, virtual desktop images which were created from the same virtual desktop master image are also expected to dedup perfectly.

An interesting topic for dedup concerns streams of bytes such as a tar file. For ZFS, a tar file is actually a sequence of ZFS records with no identified file boundaries. Therefore, identical objects (files captured by tar) present in 2 tar-like byte streams might not dedup well unless the objects actually start at the same alignment within the byte stream. A better dedup ratio would be obtained by expanding the byte stream into its constituent file objects within ZFS. If possible, the tools creating the byte stream would be well advised to start new objects on identified boundaries such as 8K.

Another interesting topic is backups of active databases. Since databases often interact with their constituent files using an identified block size, it is rather important for deduplication effectiveness that the backup target be set up with a block size that matches the source DB block size. Using a larger block on the deduplication target has the undesirable consequence that modifications to small blocks of the source database will cause those large blocks in the backup target to appear unique and, somewhat artificially, not dedup. By using an 8K block size in the dedup target dataset instead of 128K, one could conceivably see up to a 10X better deduplication ratio.
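As a minimal sketch (dataset name hypothetical), a backup target for an 8K-block database could be created with a matching recordsize before any data lands in it:

   # Match the backup dataset's record size to the source DB block size (8K here)
   # and enable dedup; recordsize only applies to files written afterwards.
   zfs create -o recordsize=8k -o dedup=on tank/backup/oradb
   zfs get recordsize,dedup tank/backup/oradb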

Performance Model and I/O Pipeline Differences

What is the effect of dedup on performance? First, when dedup is enabled, the checksum used by ZFS to validate the disk I/O is changed to the cryptographically strong SHA256. Darren Moffat shows in his blog that SHA256 actually runs at more than 128 MB/sec on a modern CPU. This means that less than 1 ms is consumed to checksum a 128K unit and less than 64 usec for an 8K unit. This cost is only incurred when actually reading or writing data to disk, an operation that is expected to take 5-10 ms; therefore the checksum generation or validation is not a source of concern.

For the read code path, very little modification should be observed. The fact that a read happens to hit a block which is part of the dedup table is not relevant to the main code path. The biggest effect will be that we use a stronger checksum function invoked after a read I/O: at most an extra 1 ms is added to a 128K disk I/O. However, if a subsequent read is for a duplicate block which happens to be in the pool's ARC cache, then instead of having to wait for a full disk I/O, only a much faster copy of the duplicate block will be necessary. Each filesystem can then work independently on its copy of the data in the ARC cache, as is the case without deduplication. Synchronous writes are also unaffected in their interaction with the ZIL. The blocks written in the ZIL have a very short lifespan and are not subject to deduplication. Therefore the path of synchronous writes is mostly unaffected unless the pool itself ends up not being able to absorb the sustained rate of incoming changes for tens of seconds. Similarly for asynchronous writes, which interact with the ARC caches, dedup code has no effect unless the pool's transaction group itself becomes the limiting factor.

So the effect of dedup will take place during the pool transaction group updates. Here is where we take all modifications that occurred in the last few seconds and atomically commit a large transaction group (TXG). While a TXG is running, applications are not directly affected except possibly by the competition for CPU cycles. They mostly continue to read from disk, do synchronous writes to the ZIL, and do asynchronous writes to memory. The biggest effect will come if the incoming flow of work exceeds the capability of the TXG to commit data to disk. Then, eventually, reads and writes will be held up by the necessary write throttling code preventing ZFS from consuming all of memory.

Looking into the ZFS TXG, we have 2 operations of interest: the creation of a new data block and the simple removal (free) of a previously used block. ZFS operating under a copy-on-write (COW) model, any modification to an existing block actually represents both a new data block creation and a free of a previously used block (unless a snapshot was taken, in which case there is no free). For file shares, this concerns existing file rewrites; for block LUNs (FC and iSCSI), this concerns most writes except the initial one (the very first write to a logical block address or LBA actually allocates the initial data; subsequent writes to the same LBA are handled using COW). For the creation of a new application data block, ZFS will run the checksum of the block, as it does normally, and then look in the dedup table for a match based on that checksum and a few other bits of information. On a dedup table hit, only a reference count needs to be increased, and such changes to the dedup table will be stored on disk before the TXG completes. Many DDT entries are grouped in a disk block and compression is involved. A big win occurs when many entries in a block are subject to a write match during one TXG: a single 16K I/O can then replace tens of larger IOPS. As for free operations, the internals of ZFS actually hold the referencing block pointer, which contains the checksum of the block being freed. Therefore there is no need to read nor recompute the checksum of the data being freed. ZFS, with checksum in hand, looks up the entry in the dedup table and decrements the reference counter. If the counter is non-zero then nothing more is necessary (just the dedup table sync). If the freed block ends up without any references then it is actually freed.

The DDT itself is an object managed by ZFS at the pool level. The table is considered metadata and its elements will be stored in the ARC cache. Up to 25% of memory (zfs_arc_meta_limit) can be used to store metadata. When the dedup table actually fits in memory, then enabling dedup is expected to have a rather small effect on performance. But when the table is many times greater than the allotted memory, then the lookups necessary to complete the TXG can cause write throttling to be invoked earlier than the same workload running without dedup. If using an L2ARC, the DDT represents a prime candidate for the secondary cache. Note that independent of the size of the dedup table, read-intensive workloads in highly duplicated environments are expected to be serviced using fewer IOPS at lower latency than without dedup. Also note that whole filesystem removal or large file truncation are operations that can free up large quantities of data at once; when the dedup table exceeds the allotted memory, those operations, which are more complex with deduplication, can then impact the amount of data going into every TXG and the write throttling behavior.
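To get a feel for how much room the DDT has in the ARC, one can look at the ARC metadata kstats on a generic Solaris/ZFS system (a sketch; exact statistic names can vary between releases):

   # Compare ARC metadata usage (which includes the DDT) against its limit,
   # roughly 25% of the ARC by default.
   kstat -p zfs:0:arcstats:arc_meta_used zfs:0:arcstats:arc_meta_limit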

So how large is the dedup table ?

The command zdb -DD on a pool shows the size of DDT entries. In one of my experiments it reported about 200 Bytes of core memory per table entry. If each unique object is associated with 200 Bytes of memory, then 32GB of RAM could reference 20TB of unique data stored in 128K records, or a bit more than 1TB of unique data in 8K records. So if there is a need to store more unique data than these ratios provide, strongly consider allocating some large read-optimized SSDs to hold the DDT. The DDT lookups are small random I/Os which are handled very well by current-generation SSDs.
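The arithmetic behind those figures, along with the zdb check, goes roughly as follows (pool name hypothetical; the ~200 Byte per-entry figure is the one observed above):

   # Display DDT statistics, including the in-core size per entry, for pool 'tank'.
   zdb -DD tank

   # Back-of-the-envelope sizing with ~200 Bytes of RAM per unique block:
   #   32 GB RAM / 200 B per entry   ~= 170 million DDT entries
   #   170 million x 128K records    ~= 20 TB of unique data referenced
   #   170 million x 8K records      ~= 1.3 TB of unique data referenced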

The first motivation to enable dedup is actually when dealing with duplicate data to begin with. If possible procedures that generate duplication could be reconsidered. The use of ZFS Clones is actually a much better way to generate logically duplicate data for multiple users in a way that does not require a dedup hash table.

But when the operating conditions do not allow the use of ZFS clones and data is highly duplicated, then the ZFS deduplication capability is a great way to reduce the volume of stored data.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.






Proper Alignment for extra performance

Because of disk partitioning software on your storage clients (keywords: EFI, VTOC, fdisk, DiskPart, ...) or a mismatch between the storage configuration and the application request pattern, you could be suffering a 2-4X performance degradation...

Many I/O performance problems I see end up being the result of a mismatch between request size or alignment and the natural block size of the underlying storage. While raw disk storage works using a 512 Byte sector and performs at the same level independent of the starting offset of I/O requests, this is not the case for more sophisticated storage, which will tend to use larger block units. Some SSDs today support 512B-aligned requests but will work much better if you give them 4K-aligned requests, as described in Aligning on 4K boundaries Flash and Sizes. The Sun Oracle 7000 Unified Storage line supports different block sizes between 4K and 128K (it can actually go lower but I would not recommend that in general). Having proper alignment between the application's view, the initiator's partitioning and the backing volume can have a great impact on the end performance delivered to applications.

When is alignment most important ?

Alignment problems are most likely to have an impact with
  • running a DB on file shares or block volumes
  • write streaming to block volumes (backups)
Also impacted at a lesser level :
  • large file rewrites on CIFS or NFS shares
In each case, adjusting the recordsize to match the workload and ensuring that partitions are aligned on a block boundary can have an important effect on your performance.

Let's review the different cases.

Case 1: running a Database (DB) on file shares or block volumes

Here the DB is a block-oriented application. General ZFS best practices warrant that the storage use a record size equal to the DB's natural block size. At the logical level, the DB is issuing I/Os which are aligned on block boundaries. When using file semantics (NFS or CIFS), that alignment is guaranteed to be observed all the way to the backend storage. But when using block device semantics, the alignment of requests on the initiator is not guaranteed to match the alignment on the target side. Misalignment of the LUN will cause two pathologies. First, an application block read will straddle 2 storage blocks, creating storage IOPS inflation (more backend reads than application reads). But a more drastic effect will be seen for block writes which, when aligned, could be serviced by a single write I/O: those will now require a read-modify-write (R-M-W) of 2 adjacent storage blocks. This type of I/O inflation adds storage load and degrades performance during high demand.

To avoid such I/O inflation, ensure that the backing store uses a block size (LUN volblocksize or Share recordsize) compatible with the DB block size. If using a file share such as NFS, ensure that the filesystem client passes I/O requests directly to the NFS server using a mount option such as directio, or use Oracle's dNFS client (note that with the directio mount option, for memory management considerations independent of alignment concerns, the server will behave better when the client specifies rsize,wsize options not exceeding 128K). To avoid LUN misalignment, prefer the use of full LUNs as opposed to sliced partitions. If disk slices must be used, prefer partitioning schemes in which one can control the sector offset of individual partitions, such as EFI labels. In that case, start partitions on a sector boundary which aligns with the volume's blocksize. For instance, an initial sector for a partition which is a multiple of 16 x 512B sectors will align on an 8K boundary, the default LUN blocksize.
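As a quick check on a Solaris initiator (device name hypothetical), one can verify that an EFI partition's first sector is a multiple of 16, i.e. aligned for the default 8K LUN blocksize:

   # Print the partition table; the "First Sector" column shows each partition's start.
   prtvtoc /dev/rdsk/c0t600144F0ABCD1234d0

   # A start sector divisible by 16 (16 x 512B = 8K) is aligned with an 8K LUN
   # blocksize; e.g. a partition starting at sector 256 is aligned, one at 34 is not.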

Case 2: write streaming to block volumes (backups)

The other important case to pay attention to is streaming writes to a raw block device. Block devices by default commit each write to stable storage. This path is often optimized through the use of acceleration devices such as write-optimized SSDs. Misalignment of the LUNs due to partitioning software implies that application writes, which could otherwise be committed to SSD at low latency, will instead be delayed by disk reads caught in R-M-W. Because the writes are synchronous in nature, the application running on the initiator will thus be considerably delayed by those disk reads. Here again one must ensure that partitions created on the client system are aligned with the volume's blocksize, which typically defaults to 8K. For pure streaming workloads, a large blocksize up to the maximum of 128K can lead to greater streaming performance. One must take good care that the block size used for a LUN does not exceed the application's write sizes to the raw volume, or risk being hit by the R-M-W penalty.
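On a generic ZFS target, the LUN's block size is fixed at volume creation time and should be chosen to match the application's write size (names and sizes below are hypothetical; on the appliance the equivalent setting is the LUN block size in the BUI):

   # Create a 1 TB backup LUN whose block size matches 128K streaming writes.
   # volblocksize cannot be changed after creation, so choose it up front.
   zfs create -o volblocksize=128k -V 1t tank/luns/backup01
   zfs get volblocksize tank/luns/backup01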

Case 3: large file rewrites on CIFS or NFS shares

For file shares, large streaming writes will be of 2 types: they will either be the more common file creation (write allocation) or they will correspond to streaming overwrites of existing files. The more common write allocation does not greatly suffer from misalignment since there is no pre-existing data to be read and modified. But the less common streaming rewrite of existing files can definitely be impacted by misalignment and R-M-W cycles. Fortunately, file protocols are not subject to LUN misalignment, so one must only take care that the write sizes reaching the storage are multiples of the recordsize used to create the file share on the storage. The Solaris NFS client often issues 32K writes for streaming applications, while CIFS has been observed to use 64K from clients. If streaming asynchronous file rewrite is an important component of your I/O workload (a rare set of conditions), it might well be that setting the share's recordsize accordingly will provide a boost to delivered performance.

In summary

The problem with alignment is most generally seen with fixed-record-oriented applications (such as Oracle Database or Microsoft Exchange) with random access patterns and synchronous I/O semantics. It can be caused by partitioning software (fdisk, diskpart) which creates disk partitions that are not aligned with the storage blocks. It can also be caused, to a lesser extent, by streaming file overwrites when the application write size does not match the file share's blocksize. The Sun Storage 7000 line offers great flexibility in selecting different blocksizes for different uses within a single pool of storage. However, it has no control over the offset that could be selected during disk partitioning of block devices on client systems. Care must be taken when partitioning disks to avoid misalignment and degraded performance. Using full LUNs is preferred.



The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.


Doubling Exchange Performance

2010/Q1 delivers 200% more Exchange performance and 50% extra transaction processing

One of the great advances present in the ZFS Appliance 2010/Q1 software update relates to the block allocation strategy. It's been one of the most complex performance investigations I've ever had to deal with because of the very strong impact that previous block allocation history had on future performance. It was a maddening experience littered with dead-end leads. During that whole time it was very hard to make sense of the data and segregate what was due to a problem in block allocation from other causes that lead customers to report performance issues.

Executive Summary

A series of changes to the ZFS metaslab code led to 50% improved OLTP performance and 70% reduced variability from run to run. We also saw a full 200% improvement in MS Exchange performance from these changes.

Excruciating Details for aspiring developer "Abandon hope all ye who enter here"


At some point we started to look at random synchronous file rewrites (a la DB writer) and it seemed clear that the performance was not what we expected for this workload. Basically, independent DB block synchronous writes were not aggregating into larger I/Os in the vdev queue. We could not truly assert a point where a regression had set in, so rather than treat this as a performance regression, we just decided to study what we had in 2009/Q3 and see how we could make our current scheme work better. And that led us down the path of the metaslab allocator:

As Jeff explains, when a piece of data needs to be stored on disk, ZFS will first select a top-level vdev (a raid-z group, a mirrored set, a single disk) for it. Within that top-level vdev, a metaslab (slab for short) will be chosen, and within the slab a block of Data Virtual Address (DVA) space will be selected. We didn't have a problem with the vdev selection process, but we thought we had an issue with the block allocator. What we were seeing was that for random file rewrites the aggregation factor was large (say 8 blocks or more) when performance was good but dropped to 1 or 2 when performance was low. So we tried to see if we could do a better job at selecting blocks that would lead to better I/O aggregation down the pipeline. We kept looking at the effect of block allocation, but it turned out the source of the problem was in the slab selection process.

So a slab is a portion of DVA space within a metaslab group (aka a top-level vdev). We currently divide vdev space into approximately 200 slabs (see vdev_metaslab_set_size). Slabs can either be loaded in memory or not. When loaded, the associated spacemaps are active, meaning we can allocate space from them. When slabs are not loaded, we can't allocate space but we can still free space from them (ZFS being copy-on-write, or COW, a block rewrite frees up the old space); in that case we just log the freed range information to disk. Loads and unloads of spacemaps are not cheap, so we make sure to minimize such operations.

So each slab is weighted according to a few criteria and the slab with the highest weight is selected for allocation on a vdev. The first criterion for slab selection is to reuse the same one that was last used: basically, don't change a winner. We refer to this as the PRIMARY slab. The second criterion is the amount of free space: the more the better. However, lower LBAs (logical block addresses), which map to outer cylinders, will generally give better performance, so at equivalent free space we weight lower LBAs more than inner ones. Finally, a slab that has already been used in the past, even if currently unloaded, is preferred to opening up a fresh new slab. This is the SMO bonus (because primed slabs have a Space Map Object associated with them). We do want to favor previously used slabs in order to limit the span of head seeks: we only move inwards when outer space is filled up.

The purpose of the slabs is to service block allocations, say for a 128K record. So when a request comes in, the highest-weighted slab is chosen and we ask it for a block of the proper size using an AVL tree of free/allocated space. There was a problem we had to deal with in previous releases which occurred when such an allocation failed because of free space fragmentation. The AVL tree was then unable to find a span of the requested size and was consuming CPU only to figure out that there was no free block present to satisfy the allocation. When space was really tight in a pool, we walked every slab before deciding that the allocation needed to be split into small chunks and a gang block (a block of blocks) created. So the spacemaps were augmented with another structure that allowed ZFS to immediately know how large an allocation could be serviced by a slab (the so-called picker-private tree, organized by size of free space).

At that point we had 2 ways to select a block: either find one in sequence after the previous allocation (first fit) or use one that fills in exactly a hole in the allocated space, the so-called best-fit allocator. We had also decided back then to switch from first fit to best fit as a slab became 70% full. The problem that this created, we now realize, is that while it helped the compactness of the on-disk layout, it created a headache for writes. Each new allocation got a tailored-fit disk area, and this led to much less write aggregation than expected. We would see write workloads to a slab slow down as it transitioned to 70% full (note that this occurred when a slab was 70% full, not the full vdev nor the pool). Eventually, the degraded slab became fully used and we would transition to a different slab with better performance characteristics. Performance could then fluctuate from one hour to the next.

So to solve this problem, what went into the 2010/Q1 software release is multifold. The most important thing is that we increased the threshold at which we switch from 'first fit' (go fast) to 'best fit' (pack tight) from 70% full to 96% full. With TB drives, each slab is at least 5GB and 4% is still 200MB, plenty of space, with no need to do anything radical before that. This gave us the biggest bang. Second, instead of trying to reuse the same primary slab until it failed an allocation, we decided to stop giving the primary slab this preferential treatment as soon as the biggest allocation that could be satisfied by the slab was down to 128K (metaslab_df_alloc_threshold). At that point we were ready to switch to another slab that had more free space. We also decided to reduce the SMO bonus. Before, a slab that was 50% empty was preferred over slabs that had never been used. In order to foster more write aggregation, we reduced that threshold to 33% empty. This means that a random write workload now spreads over more slabs, each of which has a larger amount of free space, leading to more write aggregation. Finally, we also saw that slab loading was contributing to lower performance and implemented a slab prefetch mechanism to reduce the down time associated with that operation.

The conjunction of all these changes led to 50% improved OLTP performance and 70% reduced variability from run to run (see David Lutz's post on OLTP performance). We also saw a full 200% improvement in MS Exchange performance from these changes.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Thursday, October 08, 2009

CMT, NFS and 10 GbE

Now that we have Gigabytes/sec-class Network Attached OpenStorage and highly threaded CMT servers to attach to it, you would figure that just connecting the two would be enough to open the pipes for immediate performance. Well ... almost.

Our OpenStorage systems can deliver great performance, but we often find limitations on the client side. Now that NAS servers can deliver so much power, their NAS clients can themselves be powerful servers trying to deliver GB/sec-class services to the internet.

CMT servers are great throughput engines for that; however, they only deliver the goods when the whole stack is threaded. So in a recent engagement, my colleague David Lutz found that we needed one tunable at each of 4 levels in Solaris: IP, TCP, RPC and NFS.

Service   Tunable
IP        ip_soft_rings_cnt
TCP       tcp_recv_hiwat
RPC       clnt_max_conns
NFS       nfs3_max_threads
NFS       nfs4_max_threads


ip_soft_rings_cnt requires tuning up to Solaris 10 update 7. The default value of 2 is not enough to sustain high throughput in a CMT environment. A value of 16 proved beneficial.

In /etc/system :
   * To drive 10GbE on CMT in Solaris 10 update 7 : see blogs.sun.com/roch
   set ip:ip_soft_rings_cnt=16


The receive socket buffer size is critical to TCP connection performance. The buffer is not preallocated, and memory is only used if and when the application is not reading the data it has requested. The default of 48K is from the age of 10MB/s network cards and 1GB/sec systems. Having a larger value allows the peer to not throttle its flow while pending TCP ACKs return. This is especially critical in high-latency environments, urban-area networks or other long fat networks, but it's also critical in the datacenter to reach a reasonable portion of the 10GbE available in today's NICs. It turns out that NFS connections inherit the system's TCP default, so it's interesting to run with a value between 400K and 1MB:

	ndd -set /dev/tcp tcp_recv_hiwat 400000


But even with this, a single TCP connection is not enough to extract the most out of 10GbE on CMT. And the Solaris RPC client will establish a single connection to each server it connects to. The code underneath is highly threaded but did suffer from a few bugs when trying to tune that number of connections, notably 6696163 and 6817942, both of which are fixed in S10 update 8.

With that release, it becomes interesting to tune the number of RPC connections for instance to 8.

In /etc/system :
   * To drive 10GbE on CMT in Solaris 10 update 8 : see blogs.sun.com/roch
   set rpcmod:clnt_max_conns=8


And finally, above the RPC layer, NFS implements a pool of threads per mount point to service asynchronous requests. These are mostly used in streaming workloads (readahead and writebehind), while other, synchronous requests are issued within the context of the application thread. The default number of asynchronous requests is likely to limit performance in some streaming scenarios. So I would experiment with:

In /etc/system :
   * To drive 10GbE on CMT in Solaris 10 update 7 : see blogs.sun.com/roch
   set nfs:nfs3_max_threads=32
   set nfs:nfs4_max_threads=32


As usual, YMMV; use these with the usual circumspection. Remember that tuning is evil, but it's better to know about these factors than to be in the dark and stuck with lower-than-expected performance.

Thursday, September 17, 2009

iSCSI unleashed



One of the much anticipated features of the 2009.Q3 release of the fishworks OS is a complete rewrite of the iSCSI target implementation, known as the Common Multiprotocol SCSI Target, or COMSTAR. The new target code is an in-kernel implementation that replaces what was previously known as the iSCSI target daemon, a user-level implementation of iSCSI.

Should we expect huge performance gains from this change ? You Bet !

But like most performance questions, the answer is often: it depends. The measured performance of a given test is gated by the weakest link triggered. iSCSI is just one component among many others that can end up gating performance. If the daemon was not a limiting factor, then that's your answer.

The target daemon was a userland implementation of iSCSI: some daemon threads would read data from a storage pool and write data to a socket, or vice versa. Moving to a kernel implementation opens up options to bypass at least one of the copies, and that is being considered as a future option. But extra copies, while undesirable, do not necessarily contribute to small-packet latency or large-request throughput. For small-packet requests, the copy is small change compared to the request handling. For large-request throughput, the important thing is that the data path establishes a pipelined flow in order to keep every component busy at all times.

But the way threads interact with one another can be a much greater factor in delivered performance. And there lies the problem. The old target daemon suffered from one major flaw in that each and every iSCSI request would require multiple trips through a single queue (shared between every LUN), and that queue was being read and written by 2 specific threads. Those 2 threads would end up fighting for the same locks. This was compounded by the fact that user-level threads can be put to sleep when they fail to acquire a mutex, and that going to sleep is a costly operation for a user-level thread, implying a system call and all the accounting that goes with it.

So while the iSCSI target daemon gave reasonable service for large requests, it was much less scalable in terms of the number of IOPS that could be served and the CPU efficiency with which it could serve them. IOPS is of course a critical metric for block protocols.

As an illustration, with 10 client initiators and 10 threads per initiator (so 100 outstanding requests) doing 8K cache-hit reads, we observed:

Old Target Daemon    COMSTAR     Improvement
31K IOPS             85K IOPS    2.7X


Moreover, the target daemon was consuming 7.6 CPUs to service those 31K IOPS, while COMSTAR could handle 2.7X more IOPS consuming only 10 CPUs, a 2X improvement in IOPS-per-CPU efficiency.

On the write side, with a disk pool that had 2 striped write-optimized SSDs, COMSTAR gave us 50% more throughput (130 MB/sec vs 88 MB/sec) and 60% more CPU efficiency.

Immediatedata

During our testing we noted a few interesting contributors to delivered performance. The first is the setting of the iSCSI ImmediateData parameter; see iscsiadm(1M). On the write path, that parameter causes the iSCSI initiator to send up to 8K of data along with the initial request packet. While this is a good idea, we found that for certain write sizes it would trigger a condition in the zil that caused ZFS to issue more data than necessary through the logzillas. The problem is well understood and remediation is underway, and we expect to reach a point where keeping the default value of immediatedata=yes is best. But as of today, for those attempting world-record data transfer speeds through logzillas, setting immediatedata=no and using a 32K or 64K write size might yield positive results depending on your client OS.

Interrupt Blanking

Interested in low-latency request response? Interestingly, a chunk of that response time is lost in the obscure settings of network card drivers. Network cards will often delay pending interrupts in the hope of coalescing more packets into a single interrupt. The extra efficiency often results in more throughput at high data rates, at the expense of small-packet latency. For 8K requests we managed to get 15% more single-threaded IOPS by tweaking one such client-side parameter. Historically such tuning has always been hidden in the bowels of each driver and is specific to every client OS, so that's too broad a topic to cover here. But for Solaris clients, the Crossbow framework aims, among other things, to make the latency vs throughput decision much more adaptive to operating conditions, relaxing the need for per-workload tunings.

WCE Settings

Another important parameter to consider with COMSTAR is the 'write cache enable' (WCE) bit. By default, every write request to an iSCSI LUN needs to be committed to stable storage, as this is what is expected by most consumers of block storage. That means that each individual write request to a disk-based storage pool will take minimally a disk rotation, or 5 ms to 8 ms, to commit. This is also why a write-optimized SSD is quite critical to many iSCSI workloads, often yielding 10X performance improvements. Without such an SSD, iSCSI performance will appear quite lackluster, particularly for lightly threaded workloads, which are more affected by latency characteristics.

One could then feel justified in setting the write cache enable bit on some LUNs in order to recoup some spunk in their engine. One piece of good news here is that in the new 2009.Q3 release of fishworks the setting is now persistent across reboots and reconnection events, fixing a nasty condition of 2009.Q2. However, one should be very careful with this setting, as the end consumer of the block storage (Exchange, NTFS, Oracle, ...) is quite probably operating under an unexpected set of conditions. This setting can lead to application corruption in case of an outage (there is no risk to the storage's internal state).

There is one exception to this caveat, and it is ZFS itself. ZFS is designed to safely and correctly operate on top of devices that have their write cache enabled. That is because ZFS will flush write caches whenever application semantics or its own internal consistency require it. So a zpool created on top of iSCSI LUNs would be well justified to set WCE on the LUNs to boost performance.
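As a sketch of that setup on a generic Solaris COMSTAR target and initiator (the LU GUID, pool and device names below are hypothetical; the appliance exposes WCE as a per-LUN setting in its BUI):

   # On the target: enable the LUN's write cache (wcd = write-cache disabled).
   stmfadm modify-lu -p wcd=false 600144F0ABCD1234ABCD1234ABCD1234

   # On the initiator: a pool built on such LUNs remains consistent because ZFS
   # issues cache flushes whenever application semantics or its own internal
   # consistency require it.
   zpool create appdata c0t600144F0ABCD1234ABCD1234ABCD1234d0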

Synchronous write bias

Finally, as described in my blog entry about Synchronous write bias, we now have the option to bypass the write-optimized SSDs for a LUN if the workload it receives is less sensitive to latency. This would be the case for a highly threaded workload doing large data transfers. Experimenting with this new property is warranted at this point.

Synchronous write bias property


With the 2009.Q3 release of fishworks, along with a new iSCSI implementation, we're coming up with a very significant new feature for managing the performance of Oracle databases: the new dataset Synchronous write bias property, or logbias for short. In a nutshell, this property takes the default value of Latency, signifying that the storage should handle synchronous writes with urgency, the historical default handling. See Brendan's comprehensive blog entry on the Separate Intent Log and synchronous writes. However, for datasets holding Oracle datafiles, the logbias property can be set to Throughput, signifying that the storage should avoid using log device acceleration and instead try to optimize the workload's throughput and efficiency. We definitely expect to see a good boost to Oracle performance from this feature for many types of workloads and configs: workloads that generate tens of MB/sec of DB writer traffic and have no more than 1 logzilla per tray/JBOD.

The property is set in the Share Properties just above the database recordsize. You might need to unset the 'Inherit from project' checkbox in order to modify the settings on a particular share:



The logbias property addresses a peculiar aspect of Oracle workloads: namely, that DB writers issue a large number of concurrent synchronous writes to Oracle datafiles, writes which individually are not particularly urgent. In contrast to other types of synchronous write workloads, the important metric for DB writers is not individual latency. The important metric is that the storage keep up with the throughput demand in order to have database buffers always available for recycling. This is unlike redo log writes, which are critically sensitive to latency as they hold up individual transactions and thus users.

ZFS and the ZIL

A little background: with ZFS, synchronous writes are managed by the ZFS Intent Log (ZIL). Because synchronous writes are typically holding up applications, it's important to handle those writes with some level of urgency, and the ZIL does an admirable job at that.

In the OpenStorage hybrid storage pool, the ZIL itself is sped up using low-latency write-optimized SSD devices: the logzillas. Those devices are used to commit a copy of the in-memory ZIL transaction and retain the data until an upcoming transaction group commits the in-memory state to the on-disk pooled storage (Dynamics of ZFS, The New ZFS Write Throttle).

So while the ZIL speeds up synchronous writes, logzillas speed up the ZIL. Now, SSDs can serve IOPS at a blazing 100 µs but also have their own throughput limits: currently around 110 MB/sec per device. At that throughput, committing, for example, 40K of data will take minimally 360 µs. The more data we can divert away from log devices, the lower the latency response of those devices will be.

It's interesting to note that other types of RAID controllers are hostage to their NVRAM and require, for consistency, that data be committed through some form of acceleration in order to avoid the RAID write hole (Bonwick on Raid-Z). ZFS, however, does not require that data pass through its SSD commit accelerator; it can manage the consistency of commits either using disks or using SSDs.

Synchronous write bias : Throughput

With this newfound ability for storage administrators to signify to ZFS that some datasets will be subject to highly threaded synchronous writes for which global throughput is more critical than individual write latency, we can enable a different handling mode. By setting logbias=Throughput, ZFS is able to divert writes away from the logzillas, which are then preserved for servicing latency-sensitive operations (e.g. redo log operations).

  • A setting of Synchronous write bias : Throughput for a dataset allows synchronous writes to files in other datasets to have lower latency access to SSD log devices.
But that's not all. Data flowing through a logbias=Throughput dataset is still handled by the ZIL. It turns out that the ZIL has different internal options for the way it can commit transactions, one of which is tagged WR_INDIRECT. WR_INDIRECT commits issue an I/O for the modified file record and record a pointer to it in the ZIL chain (see WR_INDIRECT in zil.c, zvol.c, zfs_log.c).

ZIL transactions of type WR_INDIRECT might use more disk I/Os and slightly higher latency immediately, but fewer I/Os and fewer total bytes during the upcoming transaction group update. Up to this point, the heuristics that lead to using WR_INDIRECT transactions were not triggered by DB writer workloads. But armed with the knowledge that comes with the new logbias property, we're now less concerned about the slight latency increase that WR_INDIRECT can have. So, for efficiency reasons, logbias=Throughput datasets are now set to use this mode, leading to more level latency distributions for transactions.

  • Synchronous write bias : Throughput is a dataset mode that reduces the number of I/Os that need to be issued on behalf of the dataset during the regular transaction group updates, leading to more level response times.
A reminder that this kind of improvement can sometimes go unnoticed in sustained benchmarks if the downstream transaction group destage is not given enough resources. Make sure you have enough spindles (or total disk KRPM) to sustain the level of performance you need. A pool with 2 logzillas and a single JBOD might have enough SSD throughput to absorb DB writer workloads without adversely affecting redo log latency, and so would not benefit from the special logbias setting; however, with 1 logzilla per JBOD the situation might be reversed.

While the DB Record Size property is inherited by files in a dataset and is immutable for existing files, the logbias setting is totally dynamic and can be toggled on the fly during operations. For instance, during database creation or lightly threaded write operations to datafiles, it's expected that logbias=Latency will perform better.
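On a generic ZFS system, that toggle is a one-liner (dataset name hypothetical); on the appliance the same switch is the Synchronous write bias share property:

   # Favor low latency while creating or loading the database ...
   zfs set logbias=latency tank/oracle/datafiles

   # ... then switch back to throughput handling for steady-state DB writer traffic.
   zfs set logbias=throughput tank/oracle/datafiles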

Logbias deployments for Oracle

As of the 2009.Q3 release of fishworks, the current wisdom around deploying an Oracle DB on an OpenStorage system with SSD acceleration is to segregate, at the filesystem/dataset level but within a single storage pool, the Oracle datafiles, index files and redo log files. Having each type of file in a different dataset allows better observability into each one using the great analytics tool. But also, each dataset can then be tuned independently to deliver the most stable performance characteristics. The most important parameter to consider is the ZFS internal recordsize used to manage the files. For Oracle datafiles, the established ZFS Best Practice is to match the recordsize to the DB block size. For redo log files, using the default 128K records means that fewer file updates will be straddling multiple filesystem records. With 128K records we expect fewer transactions to need to wait for redo log input I/Os, leading to a more level latency distribution for transactions. As for index files, using smaller blocks of 8K offers better cacheability for both the primary and secondary caches (only cache what is used from indexes), but using larger blocks offers better index-scan performance. Experimenting is in order, depending on your use case, but an intermediate block size of maybe 32K might also be considered for mixed-usage scenarios.

For Oracle datafiles specifically, using the new setting of Synchronous write bias : Throughput has potential to deliver more stable performance in general and higher performance for redo log sensitive workloads.

Dataset     Recordsize        Logbias
Datafiles   8K                Throughput
Redo Logs   128K (default)    Latency (default)
Index       8K-32K?           Latency (default)


Following these guidelines yielded a 40% boost in our transaction processing testing, in which we had 1 logzilla for a 40-disk pool.
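As a sketch, on a generic ZFS system the layout in the table above translates to something like the following (pool and dataset names hypothetical; on the appliance these are per-share properties):

   # One dataset per file type, each tuned independently within the same pool.
   zfs create -o recordsize=8k  -o logbias=throughput tank/oracle/datafiles
   zfs create                   -o logbias=latency    tank/oracle/redo       # 128K recordsize by default
   zfs create -o recordsize=32k -o logbias=latency    tank/oracle/index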

Thursday, June 11, 2009

Compared Performance of Sun 7000 Unified Storage Array Line

The Sun Storage 7410 Unified Storage Array provides high performance for NAS environments. Sun's product can be used for a wide variety of applications. The Sun Storage 7410 Unified Storage Array with a _single_ 10 GbE connection delivers 10 GbE line speed.

  • The Sun Storage 7410 Unified Storage Array delivers 1 GB/sec throughput performance.
  • The Sun Storage 7310 Unified Storage Array delivers over 500 MB/sec on streaming writes for backups and imaging applications.
  • The Sun Storage 7410 Unified Storage Array delivers over 22,000 8K synchronous writes per second, combining great DB performance and the ease of deployment of Network Attached Storage while delivering the economic benefits of inexpensive SATA disks.
  • The Sun Storage 7410 Unified Storage Array delivers over 36,000 random 8K reads per second from a 400GB working set for great mail application responsiveness. This corresponds to an enterprise of 100,000 people with every employee accessing new data every 3.6 seconds, consolidated on a single server.


All those numbers characterize a single head of the clusterable 7410 technology. The 7000 clustering technology stores all data in dual-attached disk trays and no state is shared between cluster heads (see Sun 7000 Storage clusters). This means that an active-active cluster of 2 healthy 7410s will deliver 2X the performance posted here.

Also note that the performance posted here represents what is achieved under a very tightly defined, constrained workload (see Designing 11 Storage metric) and does not represent the performance limits of the systems. This is testing 1 x 10 GbE port only; each product can have 2 or 4 10 GbE ports, and by running load across multiple ports the server can deliver even higher performance. Achieving maximum performance is a separate exercise done extremely well by my friend Brendan:

Measurement Method

To measure our performance we used the open-source Filebench tool, available from SourceForge (Filebench on solarisinternals.com). Measuring the performance of NAS storage is not an easy task. One has to deal with the client-side cache, which needs to be bypassed, the synchronization of multiple clients, and the presence of client-side page-flushing daemons which can turn asynchronous workloads into synchronous ones. Because our Storage 7000 line can have such large caches (up to 128GB of RAM and more than 500GB of secondary cache) and we wanted to test disk responses, we needed to find backdoor ways to flush those caches on the servers. Read Amithaba's Filebench Kit entry on the topic, in which he posts a link to the toolkit used to produce the numbers.
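For a flavor of what driving one of these tests looks like, a minimal interactive Filebench session on a client might resemble the sketch below (mount point and parameter values are hypothetical; the actual runs used the toolkit linked above):

   # Start filebench and drive one of the stock personalities against an NFS mount.
   filebench
   filebench> load randomread
   filebench> set $dir=/mnt/share1
   filebench> set $nthreads=128
   filebench> run 600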

We recently released our first major software update, 2009.Q2, and along with it a new lower-cost clusterable 96 TB storage system, the 7310.

We report here the numbers of a 7310 with the latest software release compared to those previously obtained for the 7410, 7210 and 7110 systems, each attached to a pool of 18 to 20 clients over a single 10GbE interface with regular frame Ethernet (1500 bytes). By the way, looking at Brendan's results above, I encourage you to upgrade to Jumbo Frame Ethernet for even more performance, and note that our servers can drive two 10GbE ports at line speed.

Tested Systems and Metrics

The tested setup are :
        Sun Storage 7410, 4 x quad core: 16 cores @ 2.3 Ghz AMD.
        128GB of host memory.
        1 dual port 10Gbe Network Atlas Card. NXGE driver. 1500 MTU
        Streaming Tests:
        2 x J4400 JBOD,  44 x 500GB SATA drives 7.2K RPM, Mirrored pool, 
        3 Write optimized 18GB SSD, 2 Read Optimized 100GB SSD.
        IOPS tests:
        12 x J4400 JBOD, 280 x 500GB SATA drives 7.2K RPM, Mirrored pool,
        272 Data drives + 8 spares.
        8-Mirrored Write Optimised 18GB SSD, 6 Read Optimized 100GB SSD.
        FW OS : ak/generic@2008.11.20,1-0

        Sun Storage 7310, 2 x quad core: 8 cores @ 2.3 Ghz AMD.
        32GB of host memory.
        1 dual port 10Gbe Network Atlas Card (1 port used). NXGE driver. 1500 MTU
        4 x J4400 JBOD for a total of 92 SATA drives 7.2K RPM
        43 mirrored pairs
        4 Write Optimised 18GB SSD, 2 Read Optimized 100GB SSD.
        FW OS : Q2 2009.04.10.2.0,1-1.15

        Sun Storage 7210, 2 x quad core: 8 cores @ 2.3 Ghz AMD
        32 GB of host memory.
        1 dual port 10Gbe Network Atlas Card (1 port used). NXGE driver. 1500 MTU
        44  x 500 GB SATA drives  7.2K RPM, Mirrored pool,
        2 Write Optimised 18 GB SSD.
        FW OS : ak/generic@2008.11.20,1-0

        Sun Storage 7110, 2 x quad core opteron: 8 cores @ 2.3 Ghz AMD
        8 GB of host memory.
        1 dual port 10Gbe Network Atlas Card (1 port used). NXGE driver. 1500 MTU
        12 x 146 GB SAS drives, 10K RPM, in 3+1 Raid-Z pool.
        FW OS : ak/generic@2008.11.20,1-0


The newly released 7310 was tested with the most recent software revision, and that certainly gives the 7310 an edge over its peers. The 7410, on the other hand, was measured here managing a much larger contingent of storage, including mirrored logzillas and 3 times as many JBODs, and that is expected to account for some of the performance delta being observed.

    Metric                                                Short Name
    1 thread per client streaming cached reads            Stream Read light
    1 thread per client streaming cold cache reads        Cold Stream Read light
    10 threads per client streaming cached reads          Stream Read
    20 threads per client streaming cold cached reads     Cold Stream Read
    1 thread per client streaming write                   Stream Write light
    20 threads per client streaming write                 Stream Write
    128 threads per client 8k synchronous writes          Sync write
    128 threads per client 8k random read                 Random Read
    20 threads per client 8k random read on cold caches   Cold Random Read
    8 threads per client 8k small file create             Filecreate


There are 6 read tests, 2 write tests and 1 synchronous write test which overwrites its data files as a database would. A final filecreate test completes the metrics. Tests execute against a 20GB working set _per client_ times 18 to 20 clients. There are 4 sets in total, running over independent shares, for a total of 80GB per client. So before actual runs are taken, we create all working sets, or 1.6 TB of precreated data. Then before each run, we clear all caches on the clients and the server.

In each of the 3 groups of 2 read tests, the first one benefits from no caching at all and the throughput delivered to the clients over the network is observed to come from disk. That test runs for N seconds, priming data in the storage caches. A second run (non-cold) is then started after clearing the client side caches. Those tests will see 100% of the data delivered over the network link, but not all of it comes off the disks. Streaming tests will race through the cached data and then finish off reading from disks. The random read test can also benefit from increasingly cached responses as the test progresses. The exact caching characteristics of a 7000 series system will depend on a large number of parameters, including your application's access pattern. Numbers here reflect the performance of a fully randomized test over 20GB per client x 20 clients, or a 400GB working set. Upcoming studies will include more data (showing even higher performance) for workloads with higher cache hit ratios than those used here.

In a Storage 7000 server, disks are grouped together in one pool and then individual shares are created. Each share has access to all disk resources, subject to any reservation (a guaranteed minimum) and quota (a maximum) that might be set. One important setup parameter associated with each share is the database record size. It is generally better for IOPS tests to use 8K records and for streaming tests to use 128K records. The recordsize can be set dynamically based on expected usage; a rough sketch of the equivalent commands on a plain ZFS system is shown below.
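
On the appliance these settings are exposed through each share's properties; on a generic ZFS system the same knobs look roughly like the sketch below (pool and share names are hypothetical, values purely illustrative):

        # create one share tuned for 8K random I/O and one for 128K streaming
        zfs create -o recordsize=8k   tank/db_share
        zfs create -o recordsize=128k tank/stream_share

        # reservation guarantees the share a minimum amount of pool space,
        # quota caps the maximum it may consume
        zfs set reservation=100G tank/db_share
        zfs set quota=500G       tank/db_share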

The tests shown here were obtained with NFSv4, the default for Solaris clients (NFSv3 is expected to come out slightly better). The clients were running Solaris 10, with tcp_recv_hiwat tuned to 400K and dopageflush set to 0 to prevent buffered writes from being converted into synchronous writes; a sketch of that client tuning follows.
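
For reference, here is a hedged sketch of that client side tuning on a Solaris 10 client; the values are the ones quoted above, not general recommendations, and the /etc/system change only takes effect after a reboot:

        # enlarge the TCP receive buffer (takes effect for new connections)
        ndd -set /dev/tcp tcp_recv_hiwat 400000

        # stop the fsflush daemon from pushing dirty pages behind the
        # application's back, which would turn buffered writes synchronous
        echo 'set dopageflush = 0' >> /etc/system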

Compared Results of the 7000 Storage Line

    NFSv4 Test                7410 Head        7310 Head        7210 Head        7110 Head
                              Mirrored Pool    Mirrored Pool    Mirrored Pool    3+1 Raid-Z
    Throughput
    Cold Stream Read light    915 MB/sec       685 MB/sec       719 MB/sec       378 MB/sec
    Stream Read light         1074 MB/sec      751 MB/sec       894 MB/sec       416 MB/sec
    Cold Stream Read          959 MB/sec       598 MB/sec       752 MB/sec       329 MB/sec
    Stream Read               1030 MB/sec      620 MB/sec       792 MB/sec       386 MB/sec
    Stream Write light        480 MB/sec       507 MB/sec       490 MB/sec       226 MB/sec
    Stream Write              447 MB/sec       526 MB/sec       481 MB/sec       224 MB/sec
    IOPS
    Sync write                22383 IOPS       8527 IOPS        10184 IOPS       1179 IOPS
    Filecreate                5162 IOPS        4909 IOPS        4613 IOPS        162 IOPS
    Cold Random Read          28559 IOPS       5686 IOPS        4006 IOPS        1043 IOPS
    Random Read               36478 IOPS       7107 IOPS        4584 IOPS        1486 IOPS
    Per Spindle IOPS          272 Spindles     86 Spindles      44 Spindles      12 Spindles
    Cold Random Read          104 IOPS         76 IOPS          91 IOPS          86 IOPS
    Random Read               134 IOPS         94 IOPS          104 IOPS         123 IOPS






Analysis



The data shows that the entire Sun Storage 7000 line is a throughput workhorse, delivering 10 Gbps-level NAS service per cluster head node, using a single network interface and a single IP address for easy integration into your existing network.

As with other storage technologies, streaming writes require more involvement from the storage controller (with a mirrored pool every byte is written twice to disk), and this leads to about 50% less write throughput compared to read throughput.

The use of write optimized SSDs in the 7410, 7310 and 7210 also gives this storage line very high synchronous write capabilities. This is one of the most interesting results as it maps to database performance. The ability to sustain 24000 O_DSYNC writes at 192 MB/sec of synchronized user data using only 48 inexpensive SATA disks and 3 write optimized SSDs is one of the many great performance characteristics of this novel storage system.

Random read tests generally map directly to individual disk capabilities and are a measure of total disk rotations. The cold runs show that all our platforms are delivering data at the expected 100 IOPS per spindle for those SATA disks. Recall that our offering is based on economical, energy efficient 7.2K RPM disk technology. For cold random reads, a mirrored pair of 7.2K RPM drives offers the same total disk rotations (and IOPS) as a single expensive, power hungry 15K RPM disk, but in a much more economical package.

Moreover, the difference between the warm and cold random read runs shows that the Hybrid Storage Pool (HSP) provides a 30% boost even on this workload, which randomly addresses a 400GB working set against 128GB of controller cache. The effective boost from the HSP can be much greater depending on the cacheability of the workload.

If we consider an organisation in which the average mail message is 8K in size, our results show that we could consolidate 100000 employees on a single 7410, with each employee accessing new data every 3.6 seconds at a 70 ms response time (see the quick check below).
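
As a quick sanity check of that consolidation claim, using only numbers from the table above:

        # 100000 employees, each issuing one 8K read every 3.6 seconds
        echo "scale=0; 100000 / 3.6" | bc   # ~27777 IOPS, vs the 28559 cold random reads measured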

Messaging systems are also big consumers of file creations. I've shown in the past how efficient ZFS can be at creating small files (Need Inodes ?). For the NFS protocol, file creation is a straining workload, but the 7000 storage line comes out not too badly with more than 5000 file creates per second per storage controller.

Conclusion

Performance can never be summarised with a few numbers, and we have just begun to scratch the surface here. The numbers presented here, along with the disruptive pricing of the Hybrid Storage Pool, will, I hope, go a long way toward showing the incredible power of the Open Storage architecture being proposed. And keep in mind that this performance is achieved using less expensive, less power hungry SATA drives, and that every data service (NFS, CIFS, iSCSI, FTP, HTTP, etc.) offered by our Sun Storage 7000 servers is available at no additional software cost to you.

Disclosure Statement: Sun Microsystems generated results using Filebench. Results reported 10/11/2008 and 26/05/2009. Analysis done on June 6, 2009.

vendredi févr. 13, 2009

Need Inodes ?

It seems that some old school filesystems still need to statically allocate inodes to hold pointers to individual files. Normally this should not cause too many problems, as default settings account for an average file size of 32K. Or will it ?

If the average file size you need to store on the filesystem is much smaller than this, then you are likely to eventually run out of inodes even if the space consumed on the storage is far from exhausted.

In ZFS, inodes are allocated on demand, and so the question came up : how many files can I store on a piece of storage ? I managed to scrape up an old 33GB disk, created a pool, and wanted to see how many 1K files I could store on it.

ZFS stores files with the smallest number of sectors possible, and so 2 sectors were enough to store the data. Then of course one also needs to store some amount of metadata, indirect pointers, directory entries and so on to complete the story. There I didn't know what to expect. My program would create 1000 files per directory, with a maximum depth of 2; nothing sophisticated was attempted here, something along the lines of the sketch below.
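
The original program isn't reproduced here; a minimal shell sketch of that kind of loop (hypothetical path, and likely much slower than the original program) would be:

        # create 1K files under /space/create, 1000 files per directory
        i=0
        while [ $i -lt 1000000 ]; do
            d=/space/create/dir_$(( i / 1000 ))
            [ -d "$d" ] || mkdir -p "$d"
            dd if=/dev/urandom of="$d/file_$i" bs=1024 count=1 2>/dev/null
            i=$(( i + 1 ))
        done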

So I let my program run for a while and eventually interrupted it at 86% of disk capacity :

        Filesystem  size  used avail capacity Mounted on
	space          33G  27G  6.5G  81%  /space
Then I counted the files.

        #ptime find /space/create | wc
	real  51:26.697
	user   1:16.596
	sys   25:27.416
	23823805 23823805 1405247330


So 23.8M files consuming 27GB of data, basically less than 1.2K of used disk space per KB of file. A legacy type filesystem that allocates one inode per 32K of space would have run out of inodes after a meager 1M files, but ZFS managed to store 23X more on the disk without any tuning.

The find command here is mostly gated on fstat performance, and we see that we did the 23.8M fstat calls in about 3060 seconds, or 7777 fstat per second.

But here is the best part : how long did it take to create all those files ?

   real 1:09:20.558
   user   9:20.236
   sys  2:52:53.624


This is hard to believe, but it took a little over an hour for 23.8 million files. This is on a single direct attached drive :

   3. c1t3d0 <FUJITSU-MAP3367N SUN36G-0401-33.92GB>


ZFS created on average 5721 files per second. Now obviously such a drive cannot do 5721 IOPS, but with ZFS it didn't need to. File creation is actually more of a CPU benchmark because the application is interacting with the host cache. It's the task of the filesystem to then create the files on disk in the background. With ZFS, the combination of the allocate-on-write policy and the sophisticated I/O aggregation in the I/O scheduler (dynamics) means that the I/O for multiple independent file creates can be coalesced. Using DTrace I counted the number of I/Os issued and file creates per minute; typical samples show more than 200K files created per minute using about 3000 I/Os per minute, or 3300 files per second using a mere 50 IOPS !!! (A sketch of that kind of DTrace one-liner follows the samples below.)

Per Minute
	   Sample Create  IOs
	   #1	  214643  2856
	   #2	  215409  3342
	   #3	  212797  2917
	   #4	  211545  2999
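
The exact script isn't included here; a rough sketch of the kind of DTrace one-liner that produces per-minute counts like these (the probe choice is my assumption, not the original script) is:

        # count ZFS file creations and disk I/Os per 60-second interval
        dtrace -n '
            fbt::zfs_create:entry { @creates = count(); }
            io:::start            { @ios = count(); }
            tick-60s { printa("creates/min %@d   IOs/min %@d\n", @creates, @ios);
                       trunc(@creates); trunc(@ios); }'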


Finally, with all these files, is scrubbing a problem ? It took 1h34m to scrub that many files, at a pace of about 4200 scrubbed files per second. No sweat.

pool: space
 state: ONLINE
 scrub: scrub completed after 1h34m with 0 errors on Wed Feb 11 12:17:20 2009


If you need to create, store and otherwise manipulate lots of small files efficiently, ZFS has got to be the filesystem of choice for you.

lundi déc. 15, 2008

Decoding Bonnie++



I've been studying the popular Bonnie++ load generator to see if it is a suitable benchmark to use with network attached storage such as the Sun Storage 7000 line. At this stage I've looked at single client runs, and it doesn't appear that Bonnie++ is an appropriate tool in this environment because, as we'll see here, for many of the tests it stresses either the networking environment or the strength of the client side CPU.

The first interesting thing to note is that Bonnie++ will work on a data set that is double the client's memory. This does address some of the client side caching concerns one could otherwise have. In a NAS environment, however, the amount of memory present on the server is not considered by a default Bonnie++ run. My client had 4GB, so the working set was 8GB, while the server had 128GB of memory. Bonnie++'s output looks like :
  Writing with putc()...done
  Writing intelligently...done
  Rewriting...done
  Reading with getc()...done
  Reading intelligently...done
  start 'em...done...done...done...
  Create files in sequential order...done.
  Stat files in sequential order...done.
  Delete files in sequential order...done.
  Create files in random order...done.
  Stat files in random order...done.
  Delete files in random order...done.

  Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
		      -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
  Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
  v2c01            8G 81160  92 109588  38 89987  67 69763  88 113613  36  2636  67
		      ------Sequential Create------ --------Random Create--------
		      -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
		files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
		   16   687  10 +++++ +++  1517   9   647  10 +++++ +++  1569   8
  v2c01,8G,81160,92,109588,38,89987,67,69763,88,113613,36,2635.7,67,16,687,10,+++++,+++,1517,9,647,10,+++++,+++,1569,8

Method

I have used a combination of Solaris truss(1), reading the Bonnie++ code, looking at AmberRoad's Analytics data, as well as a custom Bonnie d-script, in order to understand how each test triggered system calls on the client and how those translated into load on the NAS server. In the d-script, I characterise the system calls by their average elapsed time as well as by the time spent waiting for a response from the NAS server. The time spent waiting is the operational latency that one should be interested in when characterising a NAS, while the additional time relates to the client CPU strength along with the client NFS implementation. Here is what I found while trying to explain how each test performed.

Writing with putc()

So easy enough : this test creates a file using the single character putc() stdio library call.

This test is clearly a client CPU test, with most of the time spent in user space running putc(). Every 8192 putc() calls, the stdio library issues a write(2) system call. That syscall is still a client CPU test since the data is absorbed in the client cache. What we test here is the client single CPU performance and the client NFS implementation. On a 2 CPU / 4GB V20z running Solaris, we observed on the server, using Analytics, a network transfer rate of 87 MB/sec.

Results : 87 MB/sec of writes. Limited by single CPU speed.



Writing intelligently...done

Here it's more clever since it writes a file using sequential 8K write system calls.

In this test the CPU is much relieved. Here the application issues 8K write system calls to the client NFS. These are absorbed by memory on the client. With an OpenSolaris client, no over the wire request is sent for such an 8K write. However, after 4 such 8K writes we reach the natural 32K chunk advertised by the server, and that causes the client to asynchronously issue a write request to the server. The asynchronous nature means that this will not cause the application to wait for the response, and the test will keep going on CPU. The process will now race ahead generating more 8K writes and 32K asynchronous NFS requests. If we manage to generate requests at a greater rate than responses arrive, we will consume all allocated asynchronous threads. On Solaris this maps to nfs4_max_threads (8 by default). When all 8 asynchronous threads are waiting for a response, the application finally blocks, waiting for a previously issued request to get a response and free an async thread.

Since generating 8K writes to fill the client cache is faster than the network connection between the client and the server, we will eventually reach this point. The steady state of this test is that Bonnie++ is waiting for data to transfer to the server. This happens at the speed of a single NFS connection, which for us saturated the 1Gbps link we had. We observed 113MB/sec, which is network line rate considering protocol overheads.

To get more throughput on this test, one could use Jumbo Frame ethernet instead of the default 1500 byte frame size, as this would reduce the protocol overhead slightly. One could also configure the server and client to use 10Gbps ethernet links.

One could also use LACP link aggregation of 1Gbps network ports to increase throughput. LACP increases the throughput of multiple network connections but not of a single socket. By default a Solaris client will establish a single connection (clnt_max_conns = 1) to a server (1 connection per target IP). So using multiple aggregated links _and_ tuning clnt_max_conns could yield extra throughput here; a sketch of the relevant client tunables follows.
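
For illustration, here is a hedged sketch of those Solaris 10 client tunables, added to /etc/system (values are illustrative only, and a reboot is required for them to take effect):

        # allow more asynchronous I/O threads per NFSv4 mount
        echo 'set nfs:nfs4_max_threads = 32' >> /etc/system
        # open more than one TCP connection per target server IP
        echo 'set rpcmod:clnt_max_conns = 4' >> /etc/system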

With a single connection, one could also use a faster network link between client and server to reach additional throughput.

More commonly, we expect to saturate the client 1Gbps connectivity here, not much of a stress for a Sun Storage 7000 server.

Results : 113 MB/sec of writes. Network limited.



Rewriting...done

This one gets a little more interesting. It reads 8K, lseeks back to the start of the block, overwrites the 8K with new data, and loops.

So here we read, lseek back, and overwrite. For the NFS protocol, lseek is a no-op since every over the wire write is tagged with the target offset. In this test we are effectively stream-reading the file from the server and stream-writing it back. The stream write behaviour is much like the previous test : we never need to block the process unless we consume all 8 asynchronous threads. Similarly, 8K sequential reads are recognised by our client NFS as streaming access, which deploys asynchronous readahead requests; we use 4 (nfs4_nra) requests for 32K blocks ahead of the point currently being read. What we observed here was that of 88 seconds of elapsed time, 15 were spent in writes and 20 in reads, but only a small portion of that was spent waiting for responses; it was mostly CPU time spent interacting with the client NFS. This implies that readahead and asynchronous writeback behaved without becoming bottlenecks. The Bonnie++ process took 50 sec of the 88 sec, and a big chunk of this, 27 sec, was spent waiting off-CPU. I struggle somewhat with this interpretation, but I do know from the Analytics data on the server that the network sees 100 MB/sec of data flowing in each direction, which must also be close to network saturation. The wait time attributed to Bonnie++ in this test seems to be related to kernel preemption : as Bonnie++ comes out of its system calls we see such events in DTrace.

              unix`swtch+0x17f
              unix`preempt+0xda
              genunix`post_syscall+0x59e
              genunix`syscall_exit+0x59
              unix`0xfffffffffb800f06
            17570


This must be to service the kernel threads of higher priority, likely the asynchronous threads being spawned by the reads and writes.

This test is thus a stress test of bidirectional flow of 32K data transfers. Just like the previous test, to improve the numbers one would need to improve the network throughput between client and server. It could also potentially benefit from faster and more client CPUs.

Results : 100MB/sec in each direction, network limited.



Reading with getc()...done

Reads the file one character at a time.

Back to a test of the client CPU, much like the first one. We see that the readaheads are working great since little time is spent waiting (0.4 of 114 seconds). Given that this test does 1 million reads in 114 seconds, the average latency can be evaluated at 114 usec.

Results : 73MB/sec, single CPU limited on the client.



Reading intelligently...done start 'em...done...done...done...

Reads with 8k system calls, sequential.

This test seems to use the 3 spawned Bonnie++ processes to read files. The reads are of size 8K, and we needed 1M of them to read our 8GB working set. We observed with Analytics no I/O on the server, since it had 128GB of cache available to it. The network, on the other hand, is saturated at 118 MB/sec.

The DTrace script shows that the 1M read calls collectively spend 64 seconds waiting (mostly on NFS responses). That implies a 64 usec read response time for this sequential workload.

Results : 118MB/sec, limited by Network environment.



start 'em...done...done...done...

Here it seems that Bonnie++ starts 3 helper processes used to read the files in the "Reading intelligently" test.

Create files in sequential order...done.

Here we see 16K files being created (with creat(2)) then closed.

This test creates and closes 16K files and took 22 seconds in our environment. 19 seconds were used for the creates, 17.5 of them waiting for responses. That means roughly a 1 ms response time per file create. The test appears to be single threaded. Using Analytics, we observe 13500 NFS ops per second while handling those file creates. We do see some activity on the write biased SSD, although very modest at 2.64 MB/sec. Given that the test is single threaded, we can't tell whether this metric is representative of the NAS server's capability. More likely it is representative of the single thread capability of the whole environment : client CPU, client NFS implementation, client network driver and configuration, network environment including switches, and the NAS server.

Results : 744 filecreate per second per thread. Limited by operational latency.

Here is the Analytics view captured for this test and the following 5 tests.



Stat files in sequential order...done.

This test was too elusive to observe, possibly working against cached stat information.

Delete files in sequential order...done.

Here we unlink(2) the 16K files.

The run takes 10.294 seconds, showing 1591 unlinks per second. Each call goes off-CPU, waiting about 600 usec for a server response.

Much like the file create test above, while this gives us information about the single threaded unlink time of the environment, it's obviously not representative of the server's capabilities.

Results : 1591 unlinks per second per thread. Limited by operational latency.

Create files in random order...done.

We recreate 16K files, closing each one but also running a stat() system call on each.

Stat files in random order...done.

Elusive as above.

Delete files in random order...done.

We remove the 16K files.

I could not discern any meaningful differences between the "random order" tests and the sequential order ones.

Analytics screenshot of Bonnie++ run

Here is the full screen shot from analytics including Disk and CPU data



The takeaway here is that a single instance of Bonnie++ does not generally stress a Sun Storage 7000 NAS server, but it will stress the client CPU and 1Gbps network connectivity. There is no multi-client support in Bonnie++ (that I could find).

One can certainly start multiple clients simultaneously, but since the different tests would not be synchronized, the output of Bonnie++ would be very questionable. Bonnie++ does have a multi-instance synchronisation mode, but it is based on semaphores and only works if all instances are running within the same OS environment.

So in a multi client test, only the total elapsed time would be of interest, and that would be dominated by streaming performance, as each client reads and writes its working set 3 times over the wire. File create and unlink times would also contribute to the total elapsed time of such a test.

For a single node, multi-instance Bonnie++ run, one would need a large client with at least 16 x 2GHz CPUs and about 10Gbps worth of network capability in order to properly test one Sun Storage 7410 server. Otherwise, Bonnie++ is more likely to show client and network limits, not server ones. As for unlink capabilities, the topic is a pretty complex and important one that certainly cannot be captured with simple commands. The interaction with snapshots and the I/O load generated on the server during large unlink storms need to be studied carefully in order to understand the competitive merits of different solutions.

In summary, here is what governs the performance of the individual Bonnie++ tests :
    Writing with putc()...               87 MB/sec      Limited by client's single CPU speed
    Writing intelligently...             113 MB/sec     Limited by network conditions
    Rewriting...                         100 MB/sec     Limited by network conditions
    Reading with getc()...               73 MB/sec      Limited by client's single CPU speed
    Reading intelligently...             118 MB/sec     Limited by network conditions
    start 'em...done...done...done...
    Create files in sequential order...  744 create/s   Limited by operational latency
    Stat files in sequential order...    not observable
    Delete files in sequential order...  1591 unlink/s  Limited by operational latency
    Create files in random order...      same as sequential
    Stat files in random order...        same as sequential
    Delete files in random order...      same as sequential


So Bonnie++ won't tell you much about our server's capabilities. Unfortunately, the clustered mode of Bonnie++ won't coordinate multiple client systems and so cannot be used to stress a server. Bonnie++ could be used to stress a NAS server from a single large multi-core client with very strong networking capabilities, but in the end I don't expect to learn much about our servers over and above what is already known. For that, please check out our links here :

  • Low Level Performance of Sun Storage
  • Analyzing the Sun Storage 7000
  • Designing Performance Metrics...
  • Sun Storage 7xxx Performance Invariants


  • Here is the bonnie.d d-script used and the output generated bonnie.out.

    lundi nov. 10, 2008

    Blogfest : Performance and the Hybrid Storage Pool

    Today Sun is announcing a new line of Unified Storage designed by a core of the most brilliant engineers. For starters, Mike Shapiro provides a great introduction to this product, the new economics behind it and the killer App in Sun Storage 7000.

    The killer App is of course Bryan Cantrill's brainchild, the already famous Analytics. As a performance engineer, it's been a great thrill to give this tool an early test drive. Working a full ocean (the Atlantic) plus a continent (the USA) away from my system running Analytics, I was skeptical at first that I would be visualizing in real time all that information : the NFS/CIFS ops, the disk ops, the CPU load and network throughput, per client, per disk, per file. ARE YOU CRAZY ! All that information available IN REAL TIME; I just have to say a big thank you to the team that made it possible. I can't wait to see our customers put this to productive use.

    Also check out Adam Leventhal's great description of the HSP, the Hybrid Storage Pool, and read my own perspective on this topic : ZFS as a Network Attach Storage Controller.

    Lest we forget the immense contribution of the boundless energy bubble that is Brendan Gregg, the man who brought the DTraceToolkit to the semi-geek; he must be jumping with excitement as we now see the power of DTrace delivered to each and every system administrator. He talks here about the Status Dashboard. And Brendan's contribution does not stop there : he is also the parent of this wonderful component of the HSP known as the L2ARC, which is how the readzillas become activated. See his own previous work on the L2ARC along with Jing Zhang's more recent studies. Quality assurance people don't often get into the spotlight, but check out Tim Foster's post on how he tortured the zpool code adding and removing L2ARC devices from pools.

    For myself, it's been very exciting to see performance improvement ideas turned into product improvements from week to week. Those interested should read how our group influenced the product that is shipping today; see Alan Chiu's and my own Delivering Performance Improvements.

    Such a product has a strong price/performance appeal, and given that we fundamentally did not think there were public benchmarks that captured our value proposition, we had to come up with a third millennium, participative way to talk about performance. Check out how we designed our Metrics, or maybe go straight to our numbers obtained by Amitabha Banerjee : a concise entry backed up by an immense, intense and careful data gathering effort over the last few weeks. bmseer is putting his own light on the low level data (data to be updated with numbers from a grander config).

    I've also posted here a few performance guiding lights to be used when thinking about this product; I call them Performance Invariants. Further numbers can be found here about raid rebuild times.

    On the application side, we have the great work of Sean (Hsianglung Wu) and Arini Balakrishnan showing how a 7210 can deliver more than 5000 concurrent video streams at an aggregate of, you're kidding : WOW ZA, 750MB/sec. More details on how this was achieved in cdnperf.

    Jignesh Shaw shows step by step instructions setting up PostgreSQL over iSCSI.

    See our Vice President, Solaris Data, Availability, Scalability & HPC Bob Porras trying to tame this beast into a nutshell and pointing out code bits reminding everyone of the value of the OpenStorage proposition.

    See also what bmseer has to say on Web 2.0 Consolidation and get from Marcus Heckel a walkthrough of setting up Olio Web 2.0 kit with nice Analytics performance screenshots. Also get the ISV reaction (a bit later) from Georg Edelmann. Ryan Pratt reports on Windows Server 2003 WHQL certification of the Sun Storage 7000 line.

    And this just in : Data about what to expect from a Database perspective.

    We can talk all we want about performance but as Josh Simons points out, these babies are available to you for your own try and buy. Or check out how you could be running the appliance within the next hour really : Sun Storage 7000 in VMware.

    It seems I am in competition with another, less verbose aggregator. Finally, you can capture the whole stream of related postings to Sun Storage 7000.

    Delivering Performance Improvements to Sun Storage 7000


    I describe here the effort I spearheaded studying the performance characteristics of the OpenStorage platform, and the ways in which our team of engineers delivered real out of the box improvements to the product that is shipping today.

    One of the joys of working on the OpenStorage NAS appliance was that solutions we found to performance issues could be immediately transposed into changes to the appliance without further process.

    The first big wins

    We initially stumbled on 2 major issues, one for NFS synchronous writes and one for the CIFS protocol in general. The NFS problem was a subtle one involving the distinction between O_SYNC and O_DSYNC writes in the ZFS intent log, and it was impacting our threaded synchronous write test by up to a 20X factor. Fortunately I had a history of studying that part of the code and could quickly identify the problem and suggest a fix. This was tracked as 6683293: concurrent O_DSYNC writes to a fileset can be much improved over NFS.

    The following week, turning to CIFS studies, we were seeing a great scalability limitation in the code. Here again I was fortunate to be the first one to hit it. The problem was that to manage CIFS requests, the kernel code was using simple kernel allocations sized to accommodate the largest possible request. Such large allocations and deallocations cause what is known as a storm of TLB shootdown cross-calls, limiting scalability.

    Incredibly though, after implementing the trivial fix, I found that the rest of the CIFS server was beautifully scalable code with no other barriers. So with one quick and simple fix (using kmem caches) I could demonstrate a great scalability improvement to CIFS. This was tracked as 6686647 : smbsrv scalability impacted by memory

    Since those 2 protocol problems were identified early on, I must say that no serious protocol performance problems have come up. While we can always find incremental improvements to any given test, our current implementation has held up to our testing so far.

    In the next phase of the project, we did a lot of work on improving network efficiency at high data rates. In order to deliver the throughput that the server is capable of, we must use 10Gbps network interfaces, and the ones available on the NAS platforms are based on the Neptune networking interface running the nxge driver.

    Network Setup

    I collaborated on this with Alan Chiu, who already knew a lot about this network card and its driver tunables, so we could quickly hash out the issues. We had to decide on a proper out of the box setup involving
    	- how many MSI-X interrupts to use
    	- whether to use networking soft rings or not
    	- what bcopy threshold to use in the driver as opposed to
    	  binding dma.
    	- Whether to use or not the new Large Segment Offload (LSO)
    	  technique for transmits.
    
    We knew basically where we wanted to go here. We wanted many interrupts on the receive side so as to not overload any CPU, and to avoid the use of layered soft rings, which reduce efficiency. We wanted a low bcopy threshold so that DMA binding would be used more frequently, as the default value was too high for this x64 based platform. And LSO was providing a nice boost to efficiency. That got us to a proper efficiency level.

    However we noticed that under stress and a high number of connections, our efficiency would drop by 2 or 3X. After much head scratching, we rooted this to the use of too many TX DMA channels. It turns out that with this driver and architecture, using a few channels leads to more stickiness in the scheduling and much greater efficiency. We settled on 2 TX rings as a good compromise. That got us to a level of 8-10 CPU cycles per byte transferred in network code (more on Performance Invariants).

    Interrupt Blanking

    Studying an open source alternative controller, we also found that on 1 of 14 metrics we were slower. That was rooted in the interrupt blanking parameter that NICs use to gain efficiency. What we found here was that by reducing our blanking to a small value we could leapfrog the competition (from 2X worse to 2X better) on this test while preserving our general network efficiency. We were then on par or better for every one of the 14 tests.

    Media Streaming

    When we ran thousands of 1 Mb/s media streams from our systems, we quickly found that the file level software prefetching was hurting us. So we initially disabled that code in our lab to run our media studies, but at the end of the project we had to find an out of the box setup that could preserve our media results without impairing maximum read streaming. At some point we realized that we were hitting 6469558: ZFS prefetch needs to be more aware of memory pressure. It turns out that the zfetch code is set up to manage 8 concurrent streams per file and can read ahead up to 256 blocks or records, in this case 128K each. So when we realized that with thousands of streams we could read ahead ourselves out of memory, we knew what we needed to do. We decided on 2 streams per file reading ahead up to 16 blocks, and that seems quite sufficient to retain our media serving throughput while keeping some prefetching capability. I note here also that the NFS client code will itself recognize streaming and issue its own readahead; the backend code is then reading ahead of the client's readahead requests, so we were kind of getting ahead of ourselves here. Read more about it @ cdnperf.
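
    On the appliance this became the out of the box behaviour. On a generic OpenSolaris build of that era, the corresponding zfetch limits were kernel variables adjustable through /etc/system; the variable names below are my assumption of those era tunables, and the values simply mirror the ones discussed above:

        # limit ZFS file-level prefetch : 2 streams per file, 16 blocks of readahead
        echo 'set zfs:zfetch_max_streams = 2' >> /etc/system
        echo 'set zfs:zfetch_block_cap = 16' >> /etc/system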

    To slog or not to slog

    One of the innovative aspects of this OpenStorage server is the use of read and write optimized solid state devices; see for instance The Value of Solid State Devices.

    Those SSDs are beautiful devices designed to help latency, not throughput. A massive commit is actually better handled by regular storage, not SSDs. It turns out that it was dead easy to instruct the ZIL to recognize massive commits and divert its block allocation strategy away from the SSD toward the common pool of disks. We see two benefits here : the massive commits are sped up (preventing the SSD from becoming the bottleneck), but more importantly the SSD remains available as a low latency device to handle workloads that rely on low latency synchronous operations. One should note here that the ZIL is a "per filesystem" construct, and so while one filesystem might be working on a large commit, another filesystem from the same pool might still be running a series of small transactions and benefit from the write optimized SSD.

    In a similar way, when we first tested the read optimized SSDs, we quickly saw that streamed data would install itself in this caching layer and could slow down processing later. Again, the beauty of working on an appliance and closely with developers meant that by the following build, those problems had been solved.

    Transaction Group Time

    ZFS operates by issuing regular transaction groups in which modifications since the last transaction group are recorded on disk and the ueberblock is updated. This used to happen at a 5 second interval, but with the recent improvement to the write throttling code it became a 30 second interval (on light workloads) which aims to not generate more than 5 seconds of I/O per transaction group. Targeting 5 seconds of I/O per txg was meant to maximize the ratio of data to metadata in each txg, delivering more application throughput. These Storage 7000 servers, however, typically have lots of I/O capability on the storage side, and the data-to-metadata ratio is not as much of a concern as it is for a small JBOD setup. What we found was that we could reduce the target from 5 seconds of I/O down to 1 while still preserving good throughput, and the smaller value smoothed out operation.
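
    For readers on generic OpenSolaris builds of that period, the "seconds of I/O per txg" target was exposed as a kernel variable; the name below is my assumption of that era's tunable (it was later renamed), and the value simply mirrors the change described above:

        # aim for roughly 1 second of I/O per ZFS transaction group
        echo 'set zfs:zfs_txg_synctime = 1' >> /etc/system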

    IT JUST WORKS

    Well, that is certainly the goal. In my group, we spent the last year performance testing these OpenStorage systems, finding and fixing bugs, suggesting code improvements, and looking for better compromises for common tunables. At this point, we're happy with the state of the systems, particularly for mirrored configurations with write optimized SSD accelerators. Our code is based on a recent OpenSolaris (from August) that already has a lot of improvements over Solaris 10, particularly for ZFS, to which we've added specific improvements relevant to NAS storage. We think these systems will at times deliver great performance (see Amitabha's results) but will almost always shine in the price/performance categories.

    Sun Storage 7000 Performance invariants



    I see many reports about running campaigns of tests measuring performance over a test matrix. One problem with this approach is of course the matrix : that matrix is never big enough for the consumer of the information ("can you run this instead ?").

    A more useful approach is to think in terms of performance invariants. We all know that a 7.2K RPM disk drive can do 150-200 IOPS as an invariant, and disks will have a throughput limit such as 80 MB/sec. Thinking in terms of those invariants helps in extrapolating performance data (with caution), and observing a breakdown in an invariant is often a sign that something else needs to be root caused.

    So using 11 metrics and our performance engineering effort, what can be our guiding invariants ? Bear in mind that these are expected to be rough estimates. For real measured numbers, check out Amitabha Banerjee's excellent post on Analyzing the Sun Storage 7000.

    Streaming : 1 GB/s on server and 110 MB/sec on client

    For read streaming, we're observing that 1GB/s is somewhat our guiding number. This can be achieved with a fairly small number of clients and threads, but will be easier to reach if the data is prestaged in the server caches. A client running a 1GbE network card is able to extract 110 MB/sec rather easily. Read streaming will be easier to achieve with larger 128K records, probably due to the lesser CPU demand. While our results are with regular 1500 byte ethernet frames, using jumbo frames will make this limit easier to reach or even break. For a mirrored pool, data needs to be sent twice to the storage, and we see a reduction of about 50% for write streaming workloads.

    Random Read I/Os per second : 150 random read IOPS per mirrored disks

    This is probably a good guiding light also. When going to disks, that is a reasonable expectation, but caching can radically change it. Since we can configure up to 128GB of host RAM and 4 times that much secondary cache, there are opportunities to break this barrier. But when going to spindles, that number needs to be kept under consideration. We also know that RAID-Z spreads records to all disks, so the 150 IOPS limit basically applies per RAID-Z group; do plan to have many groups to service random reads. A back-of-the-envelope sizing example follows.
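
    For example, a quick back-of-the-envelope sizing from this invariant (the target below is hypothetical, purely for illustration):

        # how many mirrored data disks for a disk-bound random read target ?
        target_iops=20000
        iops_per_disk=150
        echo $(( (target_iops + iops_per_disk - 1) / iops_per_disk ))   # ~134 data disks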

    Random Read I/Os per second using SSDs : 3100 Read IOPS per Read Optimized SSD

    In some instances, data evicted from main memory will be kept in the secondary cache. Small files and filesystems with a tuned recordsize are good target workloads for this. Those read optimized SSDs can restitute this data at a rate of 3100 IOPS each (L2 ARC). More importantly, they can do so at much reduced latency, meaning that lightly threaded workloads will be able to achieve high throughput.

    Synchronous writes per second : 5000-9000 Synchronous write per Write Optimized SSD

    Synchronous writes can be generated by O_DSYNC writes (databases) or just as part of the NFS protocol (such as the tar extract open/write/close workload). Those will reach the NAS server and be coalesced in a single transaction with the separate intent log. The SSD devices are great latency accelerators but are still devices with a maximum throughput of around 110 MB/sec. However, our code actually detects when the SSD devices become the bottleneck and will divert some of the I/O requests to the main storage pool. The net of all this is a complex equation, but we've easily observed 5000-8000 synchronous writes per second per SSD, up to 3 devices (or 6 in mirrored pairs). Using a smaller working set, which creates less competition for CPU resources, we've even observed 48K synchronous writes per second.



    Cycles per Bytes : 30-40 cycles per byte for NFS and CIFS

    Once we include the full NFS or CIFS protocol, the efficiency was observed to be in the 30-40 cycles per byte range (8 to 10 of those coming from the pure network component at the regular 1500 byte MTU). More studies are required to figure out the extent to which this is valid, but it's an interesting way to look at the problem. Having to run disk I/O, versus being serviced directly from cached data, is expected to exert an additional 10-20 cycles per byte. Obviously for metadata tests, in which a small number of bytes is transferred per operation, we probably need to come up with a cycles/MetaOp invariant, but that is still TBD. A rough cross-check of this invariant against the streaming numbers is sketched below.
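
    As a rough cross-check (my own arithmetic, assuming all cycles go to protocol and network code), the cycles per byte invariant lines up with the streaming invariant above:

        # 2.3 GHz core at ~35 cycles/byte ~= 65 MB/sec per core
        echo "scale=1; 2300000000 / 35 / 1000000" | bc            # MB/sec per core
        # 16 such cores ~= 1 GB/sec, in line with the 7410 streaming numbers
        echo "scale=1; 16 * 2300000000 / 35 / 1000000000" | bc    # GB/sec per head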

    Single Client NFS throughput : 1 TCP Window per round trip latency.

    This is one fundamental rule of network throughput, but it's a good occasion to refresh it in everyone's mind. Clients, at least Solaris clients, will establish a single TCP connection to a server. On that connection there can be a large number of unrelated requests, as NFS is a very scalable protocol. However, a single connection will transport data at a maximum speed of the socket buffer size divided by the round trip latency. Since today's network speeds, particularly in wide area networks, have grown somewhat faster than default socket buffers, this can become a performance bottleneck. Given that I work in Europe but my test systems are often located in California, I might be a little more sensitive than most to this fact. So one important change we made early on in this project was to simply bump up the default socket buffers in the 7000 line to 1MB. For read throughput under similar conditions, we can only advise you to do the same to your client infrastructure.
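
    To make the rule concrete, here is the arithmetic plus a hedged sketch of the corresponding Solaris client adjustment (buffer sizes are illustrative; pick values appropriate to your own bandwidth-delay product):

        # single connection ceiling ~= socket buffer / round-trip time
        #   64 KB buffer over a 10 ms link :   65536 / 0.010 ~= 6.5 MB/sec
        #    1 MB buffer over the same link : 1048576 / 0.010 ~= 100 MB/sec
        ndd -set /dev/tcp tcp_recv_hiwat 1048576    # client receive buffer
        ndd -set /dev/tcp tcp_xmit_hiwat 1048576    # client transmit buffer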