The Wonders of ZFS Storage
Performance for your Data

Recent Posts

The Flashy ZFSSA OS8.7 : The Best Getting Better

It's time for an update on the ZFS Storage Appliance performance landscape. Moore's law is just relentless and a lot of things are happening in ZFS land as a consequence. First, the major vehicle to achieve better performance is our new March 2017 software release OS 8.7 (full name 2013.1-1.23). For those with a MOS account, the release notes are packed with information and worth a read, and not just if you are an insomniac. But let me give you a few spoilers right here.

News Flash

The OS 8.7 release comes with a variety of improvements to unleash the power of our 2-socket ZS5-2 and 4-socket ZS5-4 servers, starting with the much anticipated All Flash Pool (AFP) and SAS-3 disk trays (to complement the PCI gen-3 HBAs). Just for grins, a ZS5-4 2-node storage cluster has 8 x 18-core 2.3GHz Xeons (that's a lot) and holds up to 3TB of RAM (also a lot). That much CPU punch gives unmatched compression and encryption capabilities, and the large RAM provides incredibly high filesystem cache hit ratios. At the high end, we're talking multiple GB/sec per SAS-3 port and tens of GB/sec per controller. As a demonstration of this power, the ZS5-2 cluster was showcased using SPC-2 benchmarks, delivering an aggregate mark of 24,397.12 MB/sec. More about the ZS-5 line is available here.

The ZS5-4 line will also quench your thirst for all-flash storage, as a ZS5-4 cluster is capable of hosting 2.4PB of flash SSD. Even a ZS5-2 cluster with 2 trays of flash devices can approach or top 1 million (not dollars, Dr. Evil...) IOPS, and can do so on a variety of workloads. Moreover, flash devices reduce the importance of cache hit ratio, allowing good performance even for data working sets that don't fit into cache. Deployment using AFP is clearly the solution of choice for your datasets with the greatest I/O intensity.
If on the other hand you need more space for data that is less I/O intensive, the hybrid storage pool (HSP) economically serves up to 9PB of hard disk storage while also dynamically auto-tiering hot data into up to 614TB of L2ARC flash. The flash-based L2ARC is much more powerful than in the past and is now available using devices that fail over between storage controllers, allowing much faster recovery after a reboot. A single L2ARC device commonly handles 10K IOPS, basically serving the equivalent of about 50 HDDs worth of IOPS. Moreover, today's devices have great ingest throughput. Our improved feeding code supports feeding at a rate in excess of 100 MB/sec per L2ARC device, which allows the L2ARC to never miss out on "soon to be evicted" data. The improved feeding was the key to boosting the L2ARC hit ratio, which in turn delivers SSD-based low latency IOPS for your warm data while hot data is served from RAM. A good hit ratio in the L2ARC means the HDD devices in the hybrid pool are less burdened by expensive random reads and are therefore available for writes and for streaming workloads, where HDDs excel.

Scalability and Features

Our OS 8.7 release comes with a lot of other goodies; just to quickly mention a few:

- Trendable capacity analytics to monitor space consumption of the pool over time (the crowd goes wild!), making it easier to stay within the current capacity recommendations.
- Metadevices to store important metadata, including the new dedup table.
- LUN I/O throttling, which provides the ability to set throughput limits on targeted LUNs, preventing a noisy neighbour effect where one heavy consumer impacts all others.
- Asynchronous dataset deletions, which run in the background for quicker failover.
- LZ4: compresses and decompresses compressible data quickly (and efficiently) and handles uncompressible data quickly (bails out).
- More work on the scalability of the ARC cache.
With today's memory sizes and ZS5 controllers, we are constantly improving our OS scalability: taking advantage of improved RW locks in more places, boosting the ARC with concurrent eviction code, and better tracking ARC metadata. With all these goodies in place, read on about our updated best practices.



Block Picking

In the final article of this series, I walk through the fascinating topic of block picking. As is well known, ZFS is an allocate-on-write storage system: every time it updates the on-disk structure, it writes the new data wherever it chooses. Within reasonable bounds, ZFS also controls the timing of when to write data out to devices (outside of ZIL blocks, whose timing is governed by applications). The current bounds are set about 5 seconds apart, and when that clock ticks, we bundle up all recent changes into a transaction group (TXG) handled by spa_sync. Armed with this herd of pending data blocks, ZFS issues a highly concurrent workload dedicated to running all CPU intensive tasks. For any individual I/O, after going through compression, encryption and checksumming, we move on to the allocation task and finally device level I/O scheduling. The first task of allocation involves selecting a device in the pool. Then, within that device, we select a sub-region called a metaslab and finally, within that metaslab, a block where the data is stored. Our guiding principles through the process are to:

- Foster write I/O aggregation
- Ensure devices are used efficiently
- Avoid fragmenting the on-disk space
- Limit the core memory required
- Serve concurrent allocations quickly
- Do all this with as little CPU resource as possible

Let's see how ZFS solves this tough equation.

Device Selection

When a TXG is triggered, we want devices to receive a set of I/Os such that the I/Os aggregate and stream on the media. I/O aggregation occurs downstream when 2 I/Os, which are ready to issue, have just been allocated on the same device in adjacent locations. The per-device I/O pipeline is then able to merge both of them (N of them, actually) into a single device I/O. This is an incredible feature as it applies to items that have no relationship to each other, other than their allocation proximity, as first described in Need Inodes?. Streaming the media is a close cousin to aggregation.
I/Os aggregate when they are adjacent, but if they are not, we still want to avoid seeking to a far away area of the disk on every I/O. A disk seek is, after all, an eternity. So, while setting data onto the media, we like to keep logical block addresses (LBA) as packed as possible, leading to streaming efficiency. Similarly for SSDs, doing so avoids fragmenting the internal space mapping done by the Flash Translation Layer (FTL). Even logical volume software welcomes this model. ZFS allocates from a device until about 1MB of data is handled. The first 1MB of blocks that reach the allocation stage (after the CPU-heavy transformations) is directed to a first device. The following 1MB of blocks moves on to the next one, and so on. After iterating round-robin through every device, we are now in a state where every device in the pool is busy with I/O. At the same time, other blocks are being processed through the CPU intensive stages and are building up an I/O backlog on every device. At that moment, even if CPUs are heavily used, the system is issuing I/O to devices that are all 100% busy.

Metaslab Selection

Once we have directed 1MB of allocations to a specific device, we still want the I/Os to target a localized subarea of the device. A metaslab has an in-core structure representing the free space of a portion of a device (very roughly 1%). At any one time, ZFS only allocates from a single metaslab of a device, ensuring dense packing of all writes and therefore avoiding long disk head excursions. The other benefit is that, for the purpose of allocation, we strictly only need to keep in core a structure representing the free space of that subarea. This is the active metaslab. During a given TXG, we thus only write to the active metaslab of every device.

Block Picking

And now, going for the kill. We have the device and the metaslab subarea within it to service our allocation of size X. We finally have to choose a specific device location to fit this allocation.
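Before picking individual blocks, it may help to recap the device fan-out described under Device Selection with a sketch. This is purely illustrative Python, not the ZFS implementation; the ~1MB fill target comes from the text, everything else is an assumption.

```python
# Illustrative sketch of TXG allocation fan-out: direct ~1MB of
# allocations to one device, then rotate round-robin to the next.
DEVICE_FILL_TARGET = 1 << 20  # ~1MB per device before moving on

def assign_blocks(device_names, block_sizes):
    """Round-robin blocks across devices, ~1MB per device per visit."""
    assignments = []
    i, filled = 0, 0
    for size in block_sizes:
        assignments.append(device_names[i % len(device_names)])
        filled += size
        if filled >= DEVICE_FILL_TARGET:   # device has its ~1MB, rotate
            i, filled = i + 1, 0
    return assignments

# 24 x 128K blocks = 3MB total -> each of 3 devices receives ~1MB (8 blocks)
out = assign_blocks(["d0", "d1", "d2"], [128 * 1024] * 24)
print(out[0], out[8], out[16])  # d0 d1 d2
```

With enough dirty blocks in flight, every device ends up with a backlog at roughly the same time, which is what keeps them all 100% busy during the TXG.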
A simple strategy would be to allocate all blocks back-to-back regardless of size. That would lead to maximum aggregation, but we must be considerate of the space fragmentation implications. Blocks we are allocating together at this instant may well be freed later on different schedules from each other. Frankly, we have no way to know when a block free occurs, since that is entirely driven by the workload. In ZFS, our bias is to consider that blocks of similar size have a better chance of having similar life expectancy. We exploit this by maintaining a set of separate pointers within our metaslab and allocating blocks of similar size near each other.

Historical Behavior

When ZFS first came of age, it had 2 strategies to pick blocks. The regular one led to good performance through aggregation, while the other strategy was aimed at defragmenting and led to terrible performance. We would switch strategies when a metaslab started to have less than 30% free space within it. Customers voiced their discontent loudly. The 30% parameter was later reduced to 4%, but that didn't reduce the complaints1. The other problem we had in the past was that when we needed to switch to a metaslab whose structure was not yet in core, we would block all allocating threads waiting on the in-core loading of the metaslab data. If that took too much time, we could leave devices idling in the process. Finally, we would switch only when an allocation could not be satisfied, usually because a request was larger than the largest free block available. Forced to switch, we would then select the metaslab that had the best2 free space. This meant that we would keep using metaslabs past their prime capacity to foster aggregated and streaming I/Os.

Today is a Better World

Fast forward to today: we have evolved this whole process in very significant ways. The thread blocking problem was simple enough; when we do in fact load a metaslab, we quickly direct allocating threads to other devices.
We therefore keep feeding the other devices more smoothly. But the most important advance is that we no longer use an allocator that switches strategy based on available space in the metaslab. Allocations of a given size are serviced from a chunk chosen such that the I/Os aggregate 16-fold: 4K allocations tend to consume 64K chunks, while 128K allocations look for 2MB chunks. Allocations of different sizes do not compete for the same sized chunks. Finally, as the maximum size available in a metaslab is reduced, we simply and gradually scale down our potential for aggregation from this metaslab. Alongside this change, we decided to switch away from a metaslab as soon as it starts to show signs of fatigue. As long as a metaslab is able to serve blocks of approximately 1MB, we keep allocating from it; as soon as its biggest block size drops below this threshold, we go and pick the metaslab with the best free space. Finally, to account for more frequent switches, we also decided to unload metaslabs less aggressively than before. This policy allows us to reuse a metaslab without incurring the cost of loading, since that comes with both a CPU and an I/O cost.

Conclusion

With these changes in, we have an allocator that fosters aggregation very effectively and leads to performance that degrades gracefully as the pool fills up. This allocator has served us well over the many years it's been in place. ZFS gives you great performance by handling writes in a way that streams the data to devices. This is effective and delivers maximum performance as long as there is large sequential free space on the devices. For users, the real judge is the average I/O size for writes: if, for the same workload mix, the write size starts to creep down, then it's time to consider adding storage space.
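As a recap, the size-class scheme described above can be sketched as follows. This is an illustrative model only: the 16-fold factor comes from the text, but the class names and the bump-allocator stand-in for a metaslab are assumptions, not the actual ZFS code.

```python
# Sketch of the size-class block picker: allocations of size X are carved
# back-to-back from chunks sized for 16-fold aggregation, so same-sized
# blocks land near each other and their I/Os can merge.
AGGREGATION_FACTOR = 16

class SizeClass:
    """Serve same-sized allocations back-to-back from one chunk."""
    def __init__(self, alloc_size, take_chunk):
        self.alloc_size = alloc_size
        self.chunk_size = alloc_size * AGGREGATION_FACTOR
        self.take_chunk = take_chunk   # grabs a fresh chunk from the metaslab
        self.offset = 0
        self.remaining = 0

    def alloc(self):
        if self.remaining < self.alloc_size:          # chunk exhausted
            self.offset = self.take_chunk(self.chunk_size)
            self.remaining = self.chunk_size
        addr = self.offset
        self.offset += self.alloc_size
        self.remaining -= self.alloc_size
        return addr

# Stand-in metaslab: a simple bump allocator handing out chunks in order.
next_free = [0]
def take_chunk(size):
    addr = next_free[0]
    next_free[0] += size
    return addr

sc4k = SizeClass(4 * 1024, take_chunk)   # 4K allocations -> 64K chunks
addrs = [sc4k.alloc() for _ in range(17)]
print(addrs[1] - addrs[0])  # 4096: adjacent, so these I/Os can aggregate
print(addrs[16])            # 65536: the 17th allocation starts a new chunk
```

Because 4K and 128K allocations draw from differently sized chunks, they never interleave within a chunk, which is what preserves the aggregation potential for each size.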
La Fin

1 Some still remember this 30% factor and use it as a rule of thumb not to exceed, even though we have not used this allocator in years; tuning metaslab_df_free_pct has no effect on our systems.

2 I say best free space and not most free space since we actually boost the desirability of metaslabs with low addresses. For physical disks, outer tracks fly faster under the disk head, and that translates into more throughput. Even for SSDs we see a benefit: favoring one side of the address range means we reuse freed space more aggressively. Overwrites of an SSD LBA range mean that the flash cells holding the overwritten data can be recycled quickly by the FTL, which greatly simplifies its operation.



Concurrent Metaslab Syncing

As hinted in my previous article, spa_sync() is the function that runs whenever a pool needs to update its internal state. That thread is the master of ceremonies for the whole TXG syncing process. As such, it is the most visible of threads. At the same time, it's the thread we actually want to see idling. The spa_sync thread is set up to generate work for taskqs and then wait for the work to happen. That's why we often see spa_sync waiting in zio_wait or taskq_wait; this is what we expect that thread to be doing. Let's dig into this process a bit more. While we do expect spa_sync to mostly be waiting, waiting is not all it does. Before it waits, it has to farm out work to those taskqs. Every TXG, spa_sync wakes up and starts to create work for the zio taskq threads. Those threads immediately pick up the initial tasks posted by spa_sync and just as quickly generate load for the pool devices. Our goal is to keep the taskqs, and more importantly the devices, fed with work. And so we have this single spa_sync thread quickly posting work to zio taskqs, whose threads work on checksum computation and other CPU intensive tasks. This model ensures that the disk queues are non-empty for the duration of the data update portion of a TXG. In practice, that single spa_sync thread is able to generate the tasks to service the most demanding environment. When we hit some form of pool saturation, we typically see spa_sync waiting on a zio, and that is just the expected sign that something at the I/O level below ZFS is the current limiting factor. But, not too long ago, there was a grain of sand in this beautiful clockwork. After spa_sync was all done with... well, waiting... it had a final step to run before updating the uberblock. It would walk through all the devices and process all the space map updates, keeping track of all the allocs and frees. In many cases, this was a quick on-CPU operation done by the spa_sync thread.
But when dealing with a large amount of deletion, it could show up as significant. It was definitely something that spa_sync was tackling itself as opposed to farming out to workers. A project was spawned to fix this, and during the evaluation the ZFS engineer figured out that a lot of the work could be handled in the earlier stages of zio processing, further reducing the amount of work we would have to wait on in the later stages of spa_sync. This fix was a very important step in making sure that the critical thread running spa_sync spends most of its time... waiting.
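The division of labor described here, one coordinating thread posting CPU-heavy work and then waiting, can be sketched in a few lines. This is a conceptual model in Python, not kernel code; the function names and the use of SHA-256 as a stand-in checksum are assumptions.

```python
# Sketch of the spa_sync model: a single coordinator posts per-block
# CPU-intensive work (checksumming here) to a pool of workers, then
# simply waits on the results -- the coordinator itself mostly sleeps,
# just as spa_sync is seen waiting in zio_wait/taskq_wait.
from concurrent.futures import ThreadPoolExecutor
import hashlib

def zio_transform(block):
    """Stand-in for per-block checksum/compress/encrypt work."""
    return hashlib.sha256(block).hexdigest()

def spa_sync_sketch(dirty_blocks, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as taskq:
        futures = [taskq.submit(zio_transform, b) for b in dirty_blocks]
        # The coordinator now just waits, like zio_wait/taskq_wait.
        return [f.result() for f in futures]

checksums = spa_sync_sketch([b"block-%d" % i for i in range(8)])
print(len(checksums))  # 8
```

The point of the model: the coordinator's own CPU time is tiny compared to the work it fans out, so seeing it blocked is the healthy state.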



System Duty Cycle Scheduling Class

It's well known that ZFS uses a bulk update model to maintain the consistency of information stored on disk. This is referred to as a transaction group (TXG) update, or internally as a spa_sync(), which is the name of the function that orchestrates this task. This task ultimately updates the uberblock between consistent ZFS states. Today these tasks are expected to run on a 5-second schedule with some leeway. Internally, ZFS builds up the data structures such that when a new TXG is ready to be issued, it can do so in the most efficient way possible. That method turned out to be a mixed blessing. The story is that when ZFS is ready, it uses zio taskqs to execute all of the heavy lifting, CPU intensive jobs necessary to complete the TXG. This process includes the checksumming of every modified block and possibly compressing and encrypting them. It also does on-disk allocation and issues I/O to the disk drivers. This means there is a lot of CPU intensive work to do when a TXG is ready to go. The zio subsystem was crafted in such a way that when this activity does show up, the taskqs that manage the work never need to context switch out. The taskq threads can run on CPU for seconds on end. That created a new headache for the Solaris scheduler. Things would not have been so bad if ZFS was the only service being provided, but our systems, of course, deliver a variety of services, and non-ZFS clients were being short-changed by the scheduler. It turns out that before this use case, most kernel threads had short spans of execution. Therefore, kernel threads were never made preemptable, and nothing would prevent them from continuous execution (seconds are the same as infinity for a computer). With ZFS, we now had a new type of kernel thread, one that frequently consumed significant amounts of CPU time. A team of Solaris engineers went on to design a new scheduling class specifically targeting this kind of bulk processing.
Putting the zio taskqs in this class allowed those threads to become preemptable when they used too much CPU. We also changed our model to limit the number of CPUs dedicated to these intensive taskqs: today, each pool may use at most 50% of nCPUs to run these tasks. This is managed by the kernel parameter zio_taskq_batch_pct, which was reduced from 100% to 50%. Using these 2 features, we are now much better equipped to allow the TXG to proceed at top speed without starving applications of CPU access, and in the end, running applications is all that matters.
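The sizing rule above is simple enough to spell out. This is back-of-envelope arithmetic only; the rounding and the always-at-least-one-thread floor are my assumptions, not the kernel's exact code.

```python
# Each pool may use at most zio_taskq_batch_pct percent of nCPUs for its
# CPU-intensive zio taskqs (50% after the change described above).
def zio_taskq_threads(ncpus, batch_pct=50):
    return max(1, (ncpus * batch_pct) // 100)

print(zio_taskq_threads(144))  # 72: half of a 144-CPU system
print(zio_taskq_threads(1))    # 1: always keep at least one worker
```

The effect is that even during the busiest phase of a TXG, half the CPUs remain free for applications and other services.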



Scalable Reader/Writer Locks

ZFS is designed as a highly scalable storage pool kernel module. Behind that simple idea are a lot of subsystems, internal to ZFS, which are cleverly designed to deliver high performance for the most demanding environments. But as computer systems grow in size and as demand for performance follows that growth, we are bound to hit scalability limits (at some point) that we had not anticipated at first. ZFS easily scales in capacity by aggregating 100s of hard disks into a single administration domain. From that single pool, 100s or even 1000s of filesystems can be trivially created for a variety of purposes. But then people got crazy (rightly so) and we started to see performance tests running on a single filesystem. That scenario raised an interesting scalability limit for us... something had to be done. Filesystems are kernel objects that get mounted once at some point (often at boot). Then, they are used over and over again, millions or even billions of times. To simplify: each read/write system call uses the filesystem object for a few milliseconds. And then, days or weeks later, a system administrator wants this filesystem unmounted, and that's that. Filesystem modules, ZFS or other, need to manage this dance in which the kernel object representing a mount point is in use for the duration of a system call and so must be prevented from disappearing. Only when there are no more system calls using a mountpoint can a request to unmount be processed. This is implemented simply using a basic reader/writer lock, rwlock(3C): a read or write system call acquires a read lock on the filesystem object and holds it for the duration of the call, while a umount(2) acquires a write lock on the object. For many years, individual filesystems from a ZFS pool were protected by a standard Solaris rwlock. And while this could handle 100s of thousands of read/write calls per second through a single filesystem, eventually people wanted more.
Rather than depart from the basic kernel rwlock, the Solaris team decided to tackle the scalability of the rwlock code itself. By taking advantage of visibility into a system's architecture, Solaris is able to use multiple counters in a way that scales with the system's size. A small system can use a simple counter to track readers, while a large system can use multiple counters, each stored on a separate cache line, for better scaling. As a bonus, they were able to deliver this feature without changing the rwlock function signature. For the ZFS code, just a simple rwlock initialization change was needed to open up the benefit of this scalable rwlock. We also found that, in addition to the filesystem object itself, another structure called a ZAP object, used to manage directories, was also hitting the rwlock scalability limit, and that was changed too. Since the new locks have been put into action, they have delivered scalable performance into single filesystems that is absolutely superb. The French explorer Jean-Louis Etienne claims that "On ne repousse pas ses limites, on les découvre" (you don't push your limits, you discover them). From the comfort of my air-conditioned office, I conclude that we are pushing the limits out of harm's way.
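The core idea of the scalable rwlock, spreading the reader count over per-CPU counters so read traffic doesn't contend on one word, can be sketched conceptually. This is an illustrative Python model, not the Solaris implementation: real per-CPU counters live on separate cache lines and use atomic operations, and all names here are made up.

```python
# Conceptual sketch of a distributed reader count: each "CPU" increments
# its own slot, so readers rarely touch the same counter. A writer
# (think umount) must observe zero readers across ALL slots.
import threading

class ScalableRWCount:
    def __init__(self, ncounters):
        self.counters = [0] * ncounters
        self.locks = [threading.Lock() for _ in range(ncounters)]

    def read_enter(self, cpu):
        slot = cpu % len(self.counters)   # each CPU hits its own slot
        with self.locks[slot]:
            self.counters[slot] += 1

    def read_exit(self, cpu):
        slot = cpu % len(self.counters)
        with self.locks[slot]:
            self.counters[slot] -= 1

    def writer_may_proceed(self):
        # The writer pays the cost of summing every slot; readers stay cheap.
        return sum(self.counters) == 0

rw = ScalableRWCount(ncounters=8)
rw.read_enter(cpu=3)
print(rw.writer_may_proceed())  # False: a reader holds the lock
rw.read_exit(cpu=3)
print(rw.writer_may_proceed())  # True
```

The trade-off is deliberate: read lock and unlock stay cheap and contention-free, while the rare writer (an unmount) does the expensive cross-slot scan.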



Zero Copy I/O Aggregation

One of my favorite features of ZFS is the I/O aggregation done in the final stage of issuing I/Os to devices. In this article, I explain in more detail what this feature is and how we recently improved it with a new zero copy feature. It is well known that ZFS is a copy-on-write storage technology. That doesn't mean that we constantly copy data from disk to disk. More to the point, it means that when data is modified, we store that data in a fresh on-disk location of our own choosing. This is primarily done for data integrity purposes and is managed by the ZFS transaction group (TXG) mechanism that runs every few seconds. But an important side benefit of this freedom given to ZFS is that I/Os, even unrelated I/Os, can be allocated in physical proximity to one another. Cleverly scheduling those I/Os to disk then makes it possible to detect contiguous I/Os and issue a few large ones rather than many small ones. One consequence of I/O aggregation is that the final I/O sizes used by ZFS during a TXG, as observed by ZFSSA Analytics or iostat(1), depend more on the availability of contiguous on-disk free space than on the individual application write(2) sizes. To a new ZFS user or storage administrator, it can certainly be baffling that 100s of independent 8K writes can end up being serviced by a single disk I/O. The timeline of an asynchronous write goes like this: the application issues a write(2) of N bytes to a file stored using ZFS records of size R; initially the data is stored in the ARC cache. ZFS notes the M dirty blocks needing to be issued in the next TXG: if R=128K, a small write(2), say of 10 bytes, means 1 dirty block (of 128K); if R=8K, a single 128K write(2) implies 16 dirty blocks (of 8K). Within the next few seconds, multiple dirty blocks get associated with the upcoming TXG. Then the TXG starts: ZFS gathers all of the dirty blocks and starts I/Os1. Individual blocks get checksummed and, as necessary, compressed and encrypted.
Then and only then, knowing the compressed size and the actual data that needs to be stored on disk, a device is selected and an allocation takes place. The allocation engine finds a chunk in proximity to recent allocations (a future topic of its own). The I/O is maintained by ZFS using 2 structures, one ordered by priority and another ordered by device offset. As soon as there is at least one I/O in these structures, the device level ZIO pipeline gets to work. When a slot is available, the highest priority I/O for that device is selected to be issued. And here is where the magic occurs. With this highest priority I/O in hand, the ZIO pipeline doesn't just issue that I/O to the device. It first checks for other I/Os which could be physically adjacent to this one and gathers all such I/Os together until hitting our upper limit for disk I/O size. Because of the way this process works, if there are contiguous chunks of free space available on the disk, we're nearly guaranteed that ZFS finds pending I/Os that are adjacent and can be aggregated. This also explains why one sees regular bursts of large I/Os whose sizes are mostly unrelated to the sizes of the writes issued by the applications. And I emphasize that this is totally unrelated to the random or sequential nature of the application workload. Of course, for hard disk drives (HDDs), managing writes this way is very efficient. Those HDDs are therefore less busy and stay available to service the incoming I/Os that applications are waiting on. And this brings us to the topic du jour. Until recently, there was a cost to doing this aggregation in the form of a memory copy. We would take the buffers coming from the ZIO pipeline (after compression and encryption) and copy them to a newly allocated aggregation buffer. Thanks to a new Solaris mvector feature, we can now run the ZIO aggregation pipeline without incurring this copy.
That, in turn, allows us to boost the maximum aggregation size from 128K up to 1MB for extra efficiency. The aggregation code also limits itself to aggregating 64 buffers together: when working with 8K blocks we can see up to 512K I/Os during a TXG, and 1MB I/Os with bigger blocks. Now, a word about the ZIL. In this article, I focus on the I/Os issued by the TXG, which happens every 5 seconds. In between TXGs, if disk writes are observed, those would have to come from the ZIL. The ZIL also does its own grouping of write requests that hit a given dataset (share, zvol or filesystem). Then, once the ZIL gets to issue an I/O, it uses the same I/O pipeline as just described. Since ZIL I/Os are of high priority, they tend to issue straight away. And because they issue quickly, there are generally not a lot of them around for aggregation. So it is common for ZIL I/Os not to aggregate much, if at all. However, under a heavy synchronous write load, when the underlying device becomes saturated, a queue of ZIL I/Os forms and they become subject to ZIO level aggregation. When observing the I/Os issued to a pool with iostat, it's nice to keep all this in mind: synchronous writes don't really show up with their own size. The ZIL issues I/O for a set of synchronous writes that may further aggregate under heavy load. Then, with 5-second regularity, the pool issues I/O for every modified block, usually with large I/Os whose size is unrelated to the application I/O size. It's a really efficient way to do all this, but it does require some time to get used to.

1 Application write size is not considered during a TXG.
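The adjacency walk at the heart of the aggregation described in this article can be sketched as follows. This is illustrative only: the real pipeline works on zio structures and can also extend backward; the dictionary model and names are my assumptions, with the 1MB cap taken from the text.

```python
# Sketch of ZIO-level write aggregation: starting from the highest
# priority pending I/O, gather I/Os that are physically adjacent on the
# device, up to the maximum aggregate size (1MB after the mvector work).
MAX_AGG = 1 << 20  # 1MB upper limit on one aggregated device I/O

def aggregate(pending, start):
    """pending: {device_offset: size}; start: offset of the chosen I/O."""
    total = pending[start]
    while start + total in pending:          # walk forward over neighbors
        nxt = pending[start + total]
        if total + nxt > MAX_AGG:            # respect the aggregation cap
            break
        total += nxt
    return start, total

# Three adjacent 8K writes and one far-away write:
ios = {0: 8192, 8192: 8192, 16384: 8192, 1 << 30: 8192}
print(aggregate(ios, 0))  # (0, 24576): one 24K device I/O instead of three
```

Note how the three 8K blocks, which may belong to completely unrelated files, merge into a single device I/O purely because they were allocated next to each other.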



It is the Dawning of the Age of the L2ARC

One of the most exciting things to go into ZFS in recent history is the overhaul of the L2ARC code. We fundamentally changed the L2ARC so that it would: reduce its own memory footprint, survive reboots, be managed using a better eviction policy, be compressed on SSD, and finally allow feeding at much greater rates than ever achieved before. Let's review these elements one by one.

Reduced Footprint

We already saw in this ReARC article that we dropped the amount of in-core header information from 170 bytes to 80 bytes. This means we can track more than twice as much L2ARC data as before using a given memory footprint. In the past, the L2ARC had trouble building up in size due to its feeding algorithm, but we'll see below that the new code allows us to grow the L2ARC and use up available SSD space in its entirety. So much so that initial testing revealed a problem: for small memory configs with large SSDs, the L2ARC headers could actually end up filling most of the ARC cache, and that didn't deliver good performance. So we had to put in place a memory guard for L2 headers, currently set to 30% of the ARC. As the ARC grows and shrinks, so does the maximum space dedicated to tracking the L2ARC. So on a system with 1TB of ARC cache, up to 300GB could, if necessary, be devoted to tracking the L2ARC. With the 80-byte headers, this means we could track a whopping 30TB of data, assuming an 8K blocksize. If you use a 32K blocksize, currently the largest blocks we allow in the L2ARC, then that grows to 120TB of SSD based auto-tiered L2ARC. Of course, if you have a small L2ARC, the tracking footprint of the in-core metadata is smaller.

Persistent Across Reboots

With that much tracked L2ARC space, you would hate to see it washed away on a reboot as the previous code did. Not so anymore: the new L2ARC has an on-disk format that allows it to be reconstructed when a pool is imported.
That new format tracks the device space in 8MB segments, in which each ZFS block (DVA for the ZFS geeks) consumes 40 bytes of on-SSD space. So, reusing the example of an L2ARC made up of only 8K-sized blocks, each 8MB segment can store about 1000 of those blocks while consuming just 40K of on-SSD metadata. The key thing here is that to rebuild the in-core L2ARC state after a reboot, you only need to read back 40K from the SSD itself in order to discover and start tracking 8MB worth of data. We found that we could start tracking many TBs of L2ARC within minutes after a reboot. Moreover, we made sure that as segment headers were read in, they would immediately be made available to the system and start to generate L2ARC hits, even before the L2ARC was done importing every segment. I should mention that this L2ARC import is done asynchronously with respect to the pool import and is designed not to slow down pool import or concurrent workloads. Finally, the initial L2ARC import mechanism was made scalable, with many import threads per L2ARC device.

Better Eviction

One of the benefits of using an L2ARC segment architecture is that we can now weigh segments individually and use the least valued segment as the eviction candidate. The previous L2ARC would manage L2ARC space using a ring buffer architecture: first in, first out. That's not a terrible solution for an L2ARC, but the new code allows us to use a weight function to optimize the eviction policy. The current algorithm puts segments that are hit (an L2ARC cache hit) at the top of the list, such that a segment with no hits gets evicted sooner.

Compressed on SSD

Another great new feature delivered is the addition of compressed L2ARC data. The new L2ARC stores data on SSDs the same way it is stored on disk. Compressed datasets are captured in the L2ARC in compressed format, which provides additional virtual capacity.
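Before going further, the footprint and rebuild arithmetic above can be spelled out. All figures come from the article (80-byte in-core headers, a 30% ARC guard, 40 bytes of on-SSD metadata per block, 8MB segments); decimal TB units are used to match its round numbers.

```python
# Back-of-envelope for the L2ARC tracking capacity and rebuild cost.
HEADER_BYTES = 80            # in-core header per L2ARC block
GUARD_PCT = 30               # at most 30% of the ARC may hold L2 headers
SEGMENT_BYTES = 8 * 2**20    # 8MB on-SSD segments
ON_SSD_META_PER_BLOCK = 40   # on-SSD metadata per block (DVA)
TB = 10**12

def trackable_l2arc(arc_bytes, blocksize):
    headers = (arc_bytes * GUARD_PCT // 100) // HEADER_BYTES
    return headers * blocksize

print(trackable_l2arc(1 * TB, 8 * 1024) // TB)   # 30: the article's 30TB at 8K
print(trackable_l2arc(1 * TB, 32 * 1024) // TB)  # 122: the article's ~120TB at 32K

# Rebuild cost per segment: 8MB of 8K blocks -> 1024 headers discovered
# by reading only ~40K of on-SSD metadata.
blocks = SEGMENT_BYTES // (8 * 1024)
print(blocks, blocks * ON_SSD_META_PER_BLOCK)    # 1024 40960
```

That 200:1 ratio of data tracked to metadata read is what lets many TBs of L2ARC come back online within minutes of a reboot.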
We often see a 2:1 compression ratio for databases, and that is becoming more and more the standard way to deploy our servers. Compressed data now uses less SSD real estate in the L2ARC: a 1TB device holds 2TB of data if the data compresses 2:1. This benefit helps absorb the extra cost of flash based storage. For the security minded readers, be reassured that the data stored in the persistent L2ARC is kept in its encrypted format.

Scalable Feeding

There is a lot to like about what I just described, but what gets me the most excited is the new feeding algorithm. The old one was suboptimal in many ways: it didn't feed well, disrupted the primary ARC, had self-imposed obsolete limits, and didn't scale with the number of L2ARC devices. All gone. Before I dig in, it should be noted that a common misconception about L2ARC feeding is assuming that the process handles data as it gets evicted from L1. In fact the two processes, feeding and evicting, are separate operations, and it is sometimes necessary under memory pressure to evict a block before being able to install it in the L2ARC. The new code is much, much better at avoiding such events; it does so by keeping its feed point well ahead of the ARC tail. Under many conditions, by the time data is evicted from the primary ARC, the L2ARC has already processed it. The old code also had a self-imposed throughput limit which meant that N L2ARC devices in one pool would not be fed at the proper aggregate throughput. Given the strength of the new feeding algorithm, we were able to remove such limits, and feeding now scales with the number of L2ARC devices in use. We also removed an obsolete constraint in which read I/Os would not be sent to devices while they were being fed. With these in place, if you have enough bandwidth in the L2ARC devices, then there are few constraints in the feeder to prevent capturing 100% of eligible L2ARC data1. And capturing 100% of the data is the key to actually delivering a high L2ARC hit rate in the future.
By hitting in L2, of course, you delight end users waiting for such reads. More importantly, an L2ARC hit is a disk read I/O that doesn't have to be done. Moreover, that saved HDD read is a random read, one that would have led to a disk seek, the real weakness of HDDs. We therefore reduce utilization of the HDDs, which is of paramount importance when some unusual job mix arrives and those HDDs become the resource gating performance: a.k.a. crunch time. With a large L2ARC hit count, you get out of this crunch time quicker and restore a proper level of service to your users.

Eligibility

The L2ARC eligibility rules were affected by the compression feature. The maximum blocksize considered for eligibility is unchanged at 32K, but the check is now done on the compressed size when compression is enabled. As before, the idea behind an upper limit on eligible size is two-fold. First, for larger blocks, the latency advantage of flash over spinning media is reduced. Second, the SSD will eventually fill up with data; at that point, any block we insert in the L2ARC requires an equivalent amount of eviction. A single large block can thus cause the eviction of a large number of small blocks. Without an upper cap on block size, we could face a situation of inserting a large block for a small gain, with a large potential downside if many of the small evicted blocks become the subject of future hits. To paraphrase Yogi Berra: "Caching decisions are hard."2. The second important eligibility criterion is that blocks must not have been read through prefetching. The idea is fairly simple: prefetching applies to sequential workloads, and for such workloads flash storage offers little advantage over HDDs. This means that data that comes in through ZFS level prefetching is not eligible for the L2ARC. These criteria leave two pitfalls to avoid during an L2ARC demo: first, configuring all datasets with a 128K recordsize; second, trying to prime the L2ARC using dd-like sequential workloads.
Both of those are, by design, workloads that bypass the L2ARC. The L2ARC is designed to help you with disk crunching real workloads, which are those that access small blocks of data in random order.

Conclusion : A Better HSP

In this context, the Hybrid Storage Pool (HSP) model refers to our ZFSSA architecture where data is managed in 3 tiers: a high capacity, TB scale, super fast RAM cache; a PB scale pool of hard disks with RAID protection; and a layer of SSD based cache devices that automatically capture an interesting subset of the data. And since data is captured in an L2ARC device only after it has been stored in the main storage pool, those L2ARC SSDs do not need to be managed by RAID protection. A single copy of the data is kept in the L2ARC, knowing that if any L2ARC device disappears, the data is guaranteed to be present in the main pool. Compared to a mirrored all-flash storage solution, this ZFSSA auto-tiering HSP means that you get 2X the bang for your SSD dollar by avoiding mirroring of SSDs, and with ZFS compression that easily becomes 4X or more. This great performance comes along with the simplicity of storing all of your data, hot, warm or cold, in this incredibly versatile, high performance and cost effective ZFS based storage pool.

1It should be noted that ZFSSA tracks L2ARC eviction as "Cache: ARC evicted bytes per second broken down by L2ARC state", with subcategories of "cached," "uncached ineligible," and "uncached eligible." Having this last one at 0 implies a perfect L2ARC capture.

2For non-Americans, this famous baseball coach is quoted as having said, "It's tough to make predictions, especially about the future."
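The eligibility rules above are simple enough to sketch in code. This is an illustrative model only, not the actual ZFS implementation; the function and constant names are mine:

```python
L2ARC_MAX_ELIGIBLE = 32 * 1024   # upper cap on block size, per the text

def l2arc_eligible(physical_size, was_prefetched):
    """physical_size: the compressed size when compression is enabled,
    the logical size otherwise. Prefetched reads are sequential and
    are served well enough by HDDs, so they are excluded."""
    if was_prefetched:
        return False
    return physical_size <= L2ARC_MAX_ELIGIBLE

# A 128K record that compresses 2:1 is still 64K physical: not eligible.
print(l2arc_eligible(64 * 1024, False))   # False
# An 8K block read at random: the L2ARC sweet spot.
print(l2arc_eligible(8 * 1024, False))    # True
# The same block brought in by prefetch (a dd-like streaming read): no.
print(l2arc_eligible(8 * 1024, True))     # False
```

The two demo pitfalls fall straight out of this predicate: all-128K datasets fail the size cap, and dd-style priming fails the prefetch test.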



ZIL Pipelining

The third topic on my list of improvements since 2010 is ZIL pipelining: allowing the ZIL to carve up smaller units of work for better pipelining and higher log device utilization. So let's remind ourselves of a few things about the ZIL and why it's so critical to ZFS. The ZIL stands for ZFS Intent Log and exists to speed up synchronous operations such as O_DSYNC writes or fsync(3C) calls. Since most database operations involve synchronous writes, it's easy to understand that good ZIL performance is critical in many environments. It is well understood that a ZFS pool updates its global on-disk state at a set interval (5 seconds these days). The ZIL is what keeps information in between those transaction groups (TXGs). The ZIL records what is committed to stable storage from a user's point of view: the last committed TXG plus a replay of the ZIL is the valid storage state from a user's perspective. The on-disk ZIL is a linked list of records which is actually only useful in the event of a power outage or system crash. As part of a pool import, the on-disk ZIL is read and its operations replayed, such that the ZFS pool contains the exact information that had been committed before the disruption. While we often think of the ZIL as its on-disk representation (its committed state), the ZIL is also an in-memory representation of every POSIX operation that needs to modify data. For example, a file creation, even though it is an asynchronous operation, needs to be tracked by the ZIL. This is because any asynchronous operation may, at any point in time, be required to commit to disk, often due to an fsync(3C) call. At that moment, every pending operation on a given file needs to be packaged up and committed to the on-disk ZIL. Where is the on-disk ZIL stored? Well, that's more complex than it sounds. ZFS manages devices specifically geared to store ZIL blocks; those separate log devices, or slogs, are very often flash SSDs.
However, the ZIL is not constrained to using only blocks from slog devices; it can also store data on main (non-slog) pool devices. When storing ZIL information on non-slog pool devices, the ZIL has a choice of recording data inside ZIL blocks, or recording full file records inside pool blocks and storing a reference to them inside the ZIL. This second method has the benefit of offloading work from the upcoming TXG sync, at the expense of higher latency, since the ZIL I/Os are being sent to rotating disks. This mode is the one used with logbias=throughput. More on that below. Net net: the ZIL records data in stable storage in a linked list, and user applications have synchronization points at which they choose to wait on the ZIL to complete its operation. When things are not stressed, operations show up at the ZIL, wait a little bit while the ZIL does its work, and are then released. The latency of the ZIL is then coherent with the underlying device used to capture the information. In this rosy picture we would not have done this train project. At times, though, the system can get stressed. The older mode of operation of the ZIL was to issue a ZIL transaction (implemented by the ZFS function zil_commit_writer) and, while that was going on, build up the next ZIL transaction with everything that showed up at the door. Under stress, when a first operation was serviced with a high latency, the next transaction would accumulate many operations, growing in size and thus leading to an even longer latency transaction, and this would spiral out of control. The system would automatically divide into 2 ad-hoc sets of users: one set of operations would commit together as a group, while all other threads in the system formed the next ZIL transaction, and vice-versa. This led to bursty activity on the ZIL devices, which meant that, at times, they would go unused even though they were the critical resource.
This 'convoy' effect also meant disruption for servers, because when those large ZIL transactions did complete, 100s or 1000s of user threads might see their synchronous operation complete, and all would end up flagged as 'runnable' at the same time. Often those threads would want to consume the same resource, run on the same CPU, or use the same lock, etc. This led to thundering herds, a source of system inefficiency. Thanks to the ZIL train project, we now have the ability to break down convoys into smaller units and dispatch them as smaller ZIL level transactions which are then pipelined through the entire data path. With logbias set to throughput, the new code attempts to group ZIL transactions into sets of approximately 40 operations, which is a compromise between efficient use of the ZIL and reduction of the convoy effect. Other types of synchronous operations are grouped into sets representing about 32K of data to sync. That means that a single sufficiently large operation may run by itself, but more threads will group together if their individual commit sizes are small. The ZIL train is thus expected to handle bursts of synchronous activity with a lot less stress on the system.

The THROUGHPUT vs LATENCY Debate

As we just saw, the ZIL provides 2 modes of operation: the throughput mode and the default latency mode. The throughput mode is named as such not so much because it favors throughput, but because it doesn't care too much about individual operation latency. The implied corollary of throughput friendly workloads is that they are highly concurrent (100s or 1000s of independent operations) and therefore able to reach high throughput even when served at high latency. The goal of providing a ZIL throughput mode is to free up slog devices from having to handle such highly concurrent workloads, and to allow those slog devices to concentrate on serving other low-concurrency operations that are highly sensitive to latency.
For Oracle DB, we therefore recommend the use of logbias set to throughput for DB files which are subject to highly concurrent DB writer operations, while we recommend the default latency mode for other latency sensitive files such as the redo logs. This separation is particularly important when redo log latency is critical and the slog device is itself under stress. When using Oracle 12c with dNFS and OISP, this best practice is automatically put in place. In addition to proper logbias handling, DB data files are created with a ZFS recordsize matching the established best practice: ZFS recordsize matching the DB blocksize for datafiles; ZFS recordsize of 128K for redo logs. When setting up a DB, with or without OISP, there is one thing that storage administrators must enforce: they must segregate redo log files into their own filesystems (also known as shares or datasets). The reason for this is that the ZIL is a single linked list of transactions maintained by each filesystem (other filesystems run their own ZIL independently). And while the ZIL train allows multiple transactions to be in flight concurrently, there is a strong requirement for completion of the transactions and notification of waiters to be handled in order. If one were to mix data files and redo log files in the same ZIL, some redo transactions would be linked behind some DB writer transactions. Those critical redo transactions, committing in latency mode to a slog device, would see their I/O complete quickly (100us timescale) but nevertheless have to wait for an antecedent DB writer transaction committing in throughput mode to regular spinning disks (ms timescale). To avoid this situation, one must ensure that redo log files are stored in their own shares. Let me stop here, I have a train to catch...
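The grouping policy described above, roughly 40 operations per transaction under logbias=throughput and roughly 32K of data to sync otherwise, can be sketched as follows. This is my own illustrative model of the idea, not the actual zil_commit code:

```python
def zil_trains(pending_ops, logbias):
    """Break a burst of pending synchronous operations into smaller ZIL
    transactions ('train cars') instead of one ever-growing convoy.
    pending_ops: list of commit payload sizes in bytes."""
    MAX_OPS, MAX_BYTES = 40, 32 * 1024
    trains, car, car_bytes = [], [], 0
    for size in pending_ops:
        car.append(size)
        car_bytes += size
        full = (len(car) >= MAX_OPS if logbias == "throughput"
                else car_bytes >= MAX_BYTES)
        if full:
            trains.append(car)
            car, car_bytes = [], 0
    if car:
        trains.append(car)
    return trains

# 100 DB-writer style ops group into cars of ~40 in throughput mode...
print(len(zil_trains([8192] * 100, "throughput")))   # 3
# ...while 100 small redo-style commits share ~32K cars in latency mode.
print(len(zil_trains([512] * 100, "latency")))       # 2
```

A single large commit fills a car by itself, while many small commits share one, which is exactly the grouping behavior the text describes.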



Sequential Resilvering

In the initial days of ZFS, some pointed out that ZFS resilvering was metadata driven and therefore super fast: after all, we only had to resilver data that was in use, compared to traditional storage that has to resilver entire disks even if no actual data is stored on them. And indeed, on newly created pools, ZFS was super fast at resilvering. But of course storage pools rarely stay empty. So what happened when pools grew to store large quantities of data? Well, we basically had to resilver most blocks present on a failed disk. So the advantage of only resilvering what is actually present is not much of an advantage, in real life, for ZFS. And while ZFS based storage grew in importance, so did disk sizes. The disk sizes that people put in production are growing very fast, showing the appetite of customers for storing vast quantities of data. This is happening despite the fact that those disks are not delivering significantly more IOPS than their ancestors. As time goes by, a trend that has lasted forever, we have fewer and fewer IOPS available to service a given unit of data. Here, ZFSSA storage arrays with TB class caches are certainly helping the trend: disk IOPS don't matter as much as before because all of the hot data is cached inside ZFS. So customers gladly trade off IOPS for capacity, given that ZFSSA delivers tons of cached IOPS and ultra cheap GBs of storage. And then comes resilvering... When a disk goes bad, one has to resilver all of the data on it. It is assured at that point that we will be accessing all of the data from the surviving disks in the raid group, and that this is not a highly cached set. And here was the rub with old style ZFS resilvering: the metadata driven algorithm was actually generating small random IOPS. The old algorithm went through all of the blocks, file by file, snapshot by snapshot. When it found an element to resilver, it would issue the IOPS necessary for that operation.
Because of the nature of ZFS, the layout of those blocks didn't lead to a sequential workload on the resilvering disks. So, in a worst case scenario, we would have to issue small random reads covering 100% of what was stored on the failed disk and issue small random writes to the new disk coming in as a replacement. With big disks and very low IOPS ratings come ugly resilvering times. That effect was compounded by a voluntary design balance that was strongly biased towards protecting the application load. The compounded effect was month-long resilvering.

The Solution

To solve this, we designed a subtly modified version of resilvering, splitting the algorithm into two phases: the populating phase and the iterating phase. The populating phase is mostly unchanged from the previous algorithm, except that when encountering a block to resilver, instead of issuing the small random IOPS, we append it to a new on-disk log. After having iterated through all of the metadata and discovered all of the elements that need to be resilvered, we can sort these blocks by physical disk offset and issue the I/Os in ascending order. This in turn allows the ZIO subsystem to aggregate adjacent I/Os more efficiently, leading to fewer, larger I/Os issued to the disk. And by virtue of issuing I/Os in physical order, it allows the disks to serve these IOPS at their streaming limit (say 100MB/sec) rather than being IOPS limited (say 200 IOPS). So we now have a strategy that allows us to resilver nearly as fast as physically possible on the given disk hardware. With that newly acquired capability of ZFS comes the requirement to service application load with limited impact from resilvering; we therefore have a mechanism to limit resilvering load in the presence of application load. Our stated goal is to be able to run through resilvering at 1TB/day (1TB of data reconstructed on the replacing drive) even in the face of an active workload.
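The two-phase algorithm can be sketched in a few lines: the populating phase logs (offset, size) pairs as the metadata traversal finds them, and the iterating phase sorts by physical offset and merges adjacent extents, which is what lets the replacement disk run at streaming speed. Illustrative code only, not the actual ZFS implementation:

```python
def resilver_plan(logged_blocks):
    """logged_blocks: (offset, size) pairs recorded during the populating
    phase, in metadata-traversal (i.e. roughly random) order.
    Returns merged extents in ascending physical order for the iterating
    phase, mimicking what ZIO-level aggregation does with adjacent I/Os."""
    plan = []
    for off, size in sorted(logged_blocks):
        if plan and plan[-1][0] + plan[-1][1] == off:
            last_off, last_size = plan[-1]
            plan[-1] = (last_off, last_size + size)   # aggregate adjacent I/O
        else:
            plan.append((off, size))
    return plan

# Traversal discovers blocks out of order; the issued plan is sequential.
found = [(4096, 4096), (0, 4096), (81920, 4096), (8192, 4096)]
print(resilver_plan(found))   # [(0, 12288), (81920, 4096)]
```

Four small random I/Os collapse into two sequential extents; at disk scale, that is the difference between a seek-bound and a bandwidth-bound rebuild.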
As disks get bigger and bigger, all storage vendors will see increasing resilvering times. The good news is that, since Solaris 11.2, and on ZFSSA since 2013.1.2, ZFS is now able to run resilvering with much the same disk throughput limits as the rest of non-ZFS based storage. The sequential resilvering performance on a RAIDZ pool is particularly noticeable to this happy Solaris 11.2 customer: "It is really good to see the new feature work so well in practice."




The initial topic from my list is reARC. This is a major rearchitecture of the code that manages the ZFS in-memory cache, along with its interface to the DMU. The ARC is of course a key enabler of ZFS high performance. As the scale of systems grows, in memory size, CPU count and frequency, some major changes were required for the ARC to keep pace. reARC is such a major body of work that I can only talk about a few aspects of it here. In this article, I describe how the reARC project had an impact on at least these 7 important aspects of its operation:

Managing metadata
Handling ARC accesses to cloned buffers
Scalability of cached and uncached IOPS
Steadier ARC size under steady state workloads
Improved robustness for more reliable code
Reduction of the L2ARC memory footprint
Finally, a solution to the long standing issue of I/O priority inversion

The diversity of topics covered serves as a great illustration of the incredible work handled by the ARC and a testament to the importance of ARC operations to all other ZFS subsystems. I'm truly amazed at how a single project was able to deliver all this goodness in one swoop.

No Meta Limits

Previously, the ARC claimed to use a two-state model: "most recently used" (MRU) and "most frequently used" (MFU). But it further subdivided these states into data and metadata lists. That model, using 4 main memory lists, created a problem for ZFS. The ARC algorithm gave us only 1 target size for each of the 2 MRU and MFU states. The fact that we had 2 lists (data and metadata) but only 1 target size for the aggregate meant that, when we needed to adjust a list down, we just didn't have the necessary information to perform the shrink. This led to the presence of an ugly tunable, arc_meta_limit, which was impossible to set properly and was a source of problems for customers. This problem raises an interesting point, and a pet peeve of mine.
Many people I've interacted with over the years have defended the position that metadata is worth special protection in a cache. After all, metadata is necessary to get to data, so it has intrinsically higher value and should be kept around longer. The argument is certainly sensible on the surface, but I was on the fence about it. ZFS manages every access through a least recently used scheme (LRU). A new access to some block, data or metadata, puts that block back at the head of the LRU list, very much protected from eviction, which happens at the tail of the list. When considering special protection for metadata, I've always stumbled on this question: if some buffer, be it data or metadata, has not seen any accesses for a sufficient amount of time, such that the block is now at the tail of an eviction list, what is the argument that says I should protect that block based on its kind? I came up blank on that question. If it hasn't been used, it can be evicted, period. Furthermore, even after taking this stance, I was made aware of an interesting fact about ZFS: indirect blocks, the blocks that hold a set of block pointers to the actual data, are non-evictable inasmuch as any of the block pointers they reference are currently in the ARC. In other words, if some data is in the cache, its metadata is also in the cache and, furthermore, is non-evictable. This fact really reinforced my position that in our LRU cache handling, metadata doesn't need special protection from eviction. And so the reARC project took the same path: no more separation of data and metadata, and no more special protection. This improvement led to fewer lists to manage and simpler code, with benefits such as shorter lock hold times during eviction. If you are tuning arc_meta_limit for legacy reasons, I advise you to try without this special tuning. It might be hurting you today and should be considered obsolete.
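The stance taken here amounts to a single recency-ordered list with no metadata carve-out and no arc_meta_limit analogue: whatever reaches the tail is evicted, regardless of kind. A toy model of that policy (not the real multi-list ARC):

```python
from collections import OrderedDict

class ToyARCList:
    """One eviction list for data AND metadata. Accesses move a block
    to the head; eviction always takes the tail, kind ignored."""
    def __init__(self, capacity):
        self.capacity, self.lru = capacity, OrderedDict()

    def access(self, key, kind="data"):
        if key in self.lru:
            self.lru.move_to_end(key)        # back to the head, protected
        else:
            self.lru[key] = kind
        while len(self.lru) > self.capacity:
            self.lru.popitem(last=False)     # evict the tail, whatever it is

arc = ToyARCList(2)
arc.access("indirect-block", kind="metadata")
arc.access("file-data-1")
arc.access("file-data-2")              # the metadata block was coldest: out
print(list(arc.lru))                   # ['file-data-1', 'file-data-2']
```

Note the argument from the text is preserved: if the "indirect-block" above still pointed at cached data, it would have been re-accessed and would never have reached the tail in the first place.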
Single Copy ARC: Dedup of Memory

Yet another truly amazing capability of ZFS is its infinite snapshot capability. There are just no limits, other than hardware, to the number of (software) snapshots that you can have. What is magical here is not so much that ZFS can manage a large number of snapshots, but that it can do so without reference counting the blocks that are referenced through a snapshot. You might need to read that sentence again... and check the blog entry. Now fast forward to today, where there is something new for the ARC. While we've always had the ability to read a block referenced from N different snapshots (or clones), the old ARC actually had to manage separate in-memory copies of each block. If the accesses were all reads, we'd needlessly instantiate the same data multiple times in memory. With the reARC project and the new DMU to ARC interfaces, we don't have to keep multiple data copies. Multiple clones of the same data share the same buffers for read accesses, and new copies are only created for write accesses. It has not escaped our notice that this N-way sharing has immense consequences for virtualization technologies. The use of ZFS clones (or writable snapshots) is just a great way to deploy a large number of virtual machines. ZFS has always been able to store N clone copies with zero incremental storage cost. But reARC takes this one step further: as VMs are used, the in-memory caches that serve multiple VMs no longer need to inflate, allowing the space savings to be used to cache other data. This improvement allows Oracle to boast the amazing technology demonstration of booting 16000 VMs simultaneously.

Improved Scalability of Cached and Uncached OPs

The entire MRU/MFU list insert and eviction processes have been redesigned.
One of the main functions of the ARC is to keep track of accesses, such that the most recently used data is moved to the head of a list while the least recently used buffers make their way towards the tail and are eventually evicted. The new design allows eviction to be performed using a separate set of locks from the set used for insertion, thus delivering greater scalability. Moreover, through a very clever algorithm, we're able to move buffers from the middle of a list to the head without acquiring the eviction lock. These changes were very important in removing long pauses in ARC operations that hampered the previous implementation. Finally, the main hash table was modified to use more locks, placed on separate cache lines, improving the scalability of ARC operations. This led to a boost in the cached and uncached maximum IOPS capabilities of the ARC.

Steadier Size, Smaller Shrinks

The growth and shrink model of the ARC was also revisited. The new model grows the ARC less aggressively when approaching memory pressure and instead recycles buffers earlier. This recycling leads to a steadier ARC size and fewer disruptive shrink cycles. If a changing environment nevertheless requires the ARC to shrink, the amount by which we shrink each time has been reduced, making each shrink cycle less stressful. Along with the reorganization of the ARC list locking, this has led to a much steadier, more dependable ARC at high loads.

ARC Access Hardening

A new ARC reference mechanism was created that allows the DMU to signify read or write intent to the ARC. This, in turn, enables more checks to be performed by the code, catching bugs earlier in the process. A better separation of function between the DMU and the ARC is critical for ZFS robustness, or hardening. In the new reARC mode of operation, the ARC now actually has the freedom to relocate kernel buffers in memory between DMU accesses to a cached buffer.
This new capability proves invaluable as we scale to large memory systems.

L2ARC Memory Footprint Reduction

Historically, buffers were tracked in the L2ARC (the SSD based secondary ARC) using the same structure that was used by the main primary ARC. This represented about 170 bytes of memory per buffer. The reARC project was able to cut this amount by more than 2X, down to a bare minimum that now requires only about 80 bytes of metadata per L2ARC buffer. With the arrival of larger SSDs for the L2ARC and a better feeding algorithm, this reduced footprint is a very significant change for the Hybrid Storage Pool (HSP) storage model.

I/O Priority Inversion

One nagging behavior of the old ARC and ZIO pipeline was the so-called I/O priority inversion. This behavior mostly affected prefetch I/Os, which were handled by the ZIO pipeline at a lower priority than, for example, a regular read issued by an application. Before reARC, the behavior was that after an I/O prefetch was issued, a subsequent read of the data that arrived while the prefetch was still pending would block waiting on the low priority prefetch's completion. While it sounds simple enough to just boost the priority of the in-flight prefetch, the ARC/ZIO code was structured in such a way that this turned out to be much trickier than it sounds. In the end, the reARC project and subsequent I/O restructuring changes put us on the right path regarding this particular quirkiness. Fixing the I/O priority inversion meant that fairness between different types of I/O was restored.

Conclusion

The key points that we saw in reARC are as follows:

Metadata doesn't need special protection from eviction; arc_meta_limit has become an obsolete tunable.
Multiple clones of the same data share the same buffers, for great performance in virtualization environments.
ARC scalability is boosted for cached and uncached IOPS.
The ARC size is now steadier and more dependable.
Protection from creeping memory bugs is better.
The L2ARC uses a smaller memory footprint.
I/Os are handled with more fairness in the presence of prefetches.

All of these improvements are available to customers of Oracle's ZFS Storage Appliance in any AK-2013 release and in recent Solaris 11 releases. And this is just topic number one. Stay tuned as we go about describing further improvements we're making to ZFS.
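Among those key points, the clone buffer sharing is easy to picture with a toy model: reads of the same underlying block from any number of clones return one shared in-memory buffer, and only a write triggers a private copy. All names here are illustrative, not the real DMU/ARC interfaces:

```python
class ToyCloneCache:
    """Sketch of single-copy caching for clones: one shared buffer per
    physical block for reads, copy-on-write for modifications."""
    def __init__(self):
        self.shared = {}                     # physical block id -> buffer

    def read(self, block_id, load):
        if block_id not in self.shared:
            self.shared[block_id] = load(block_id)   # first reader pays the I/O
        return self.shared[block_id]         # every clone gets the same object

    def write(self, block_id):
        # a writer gets its own private, mutable copy (copy-on-write)
        return bytearray(self.shared.get(block_id, b""))

cache = ToyCloneCache()
load = lambda b: b"vm-golden-image-block"
a = cache.read("blk#7", load)    # clone A (VM 1)
b = cache.read("blk#7", load)    # clone B (VM 2): no second copy in memory
print(a is b)                    # True: one buffer serves both clones
priv = cache.write("blk#7")      # clone C modifies: private copy
print(priv == a, priv is a)      # True False
```

With N VM clones booting from one golden image, the read cache holds one copy instead of N, which is the memory dedup effect described above.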



ZFS Performance boosts since 2010

Well, look who's back! After years of relative silence, I'd like to put my blogging hat back on and update my patient readership about the significant ZFS technological improvements that have been integrated since Sun and ZFS became Oracle brands. Since there is so much to cover, I tee up this series of articles with a short description of 9 major performance topics that have evolved significantly over the last years. Later, I will describe each topic in more detail in individual blog entries. Of course, these selected advancements represent nowhere near an exhaustive list: there have been over 650 changes to the ZFS code in the last 4 years. My personal performance bias has selected the topics that I know best. The designated topics are:

reARC : Scales the ZFS cache to TB class machines and CPU counts in the thousands.
Sequential Resilvering : Converts a random workload to a sequential one.
ZIL Pipelining : Allows the ZIL to carve up smaller units of work for better pipelining and higher log device utilisation.
It is the dawning of the age of the L2ARC : Not only did we make the L2ARC persistent on reboot, we made the feeding process so much more efficient we had to slow it down.
Zero Copy I/O Aggregation : A new tool delivered by the Virtual Memory team allows the already incredible ZFS I/O aggregation feature to do its thing using one less copy.
Scalable Reader/Writer Locks : Reader/writer locks, used extensively by ZFS and Solaris, had their scalability greatly improved on large systems.
New Thread Scheduling Class : ZFS transaction groups are now managed by a new type of taskq which behaves better when managing bursts of CPU activity.
Concurrent Metaslab Syncing : The task of syncing metaslabs is now handled with more concurrency, boosting ZFS write throughput capabilities.
Block Picking : The task of choosing blocks for allocation has been enhanced in a number of ways, allowing us to work efficiently at a much higher pool capacity percentage.

There you have it.
I'm looking forward to reinvigorating my blog so stay tuned.



Sun ZFS Storage Appliance : can do blocks, can do files too!

Last October, we demonstrated storage leadership in block protocols with our stellar SPC-1 result showcasing our top of the line Sun ZFS Storage 7420. As a benchmark, SPC-1's profile is close to what a fixed block size DB would actually be doing. See Fast Safe Cheap : Pick 3 for more details on that result. Here, for an encore, we're showing today how the ZFS Storage Appliance can perform in a totally different environment: generic NFS file serving. We're announcing that the Sun ZFS Storage 7320 reached 134,140 SPECsfs2008_nfs.v3 Ops/sec with a 1.51 ms ORT running the SPEC SFS 2008 benchmark. Does price performance matter? It does, doesn't it? See what Darius has to say about how we compare to NetApp: Oracle posts Spec SFS. This is one step further in the direction of bringing to our customers true high performance unified storage, capable of handling blocks and files on the same physical media. It's worth noting that the provisioning of space between the different protocols is entirely software based and fully dynamic, that every stored element is fully checksummed, that all stored data can be compressed with a number of different algorithms (including gzip), and that both filesystems and block based LUNs can be snapshotted and cloned at their own granularity. All these manageability features are available to you in this high performance storage package. Way to go ZFS! SPEC and SPECsfs are registered trademarks of Standard Performance Evaluation Corporation (SPEC). Results as of February 22, 2012; for more information see www.spec.org.



Fast, Safe, Cheap : Pick 3

Today, we're making performance headlines with Oracle's ZFS Storage Appliance. SPC-1: twice the performance of NetApp at the same latency, and half the $/IOPS. I'm proud to say that yours truly, along with a lot of great teammates at Oracle, is not totally foreign to this milestone. We are announcing that Oracle's 7420C cluster achieved 137,000 SPC-1 IOPS with an average latency of less than 10 ms. That is double the result of NetApp's 3270A while delivering the same latency. Compared to the NetApp 3270 result, this is a 2.5x improvement in $/SPC-1-IOPS ($2.99/IOPS vs $7.48/IOPS). We're also showing that when the ZFS Storage Appliance runs at the rate posted by the 3270A (68,034 SPC-1 IOPS), our latency of 3.26ms is almost 3X lower than theirs (9.16ms). Moreover, our result was obtained with 23,700 GB of user level capacity (internally mirrored) for $17.3/GB, while NetApp, even using a space saving raid scheme, could only deliver $23.5/GB. This is the price per GB of application data actually used in the benchmark. On top of that, the 7420C still had 40% of space headroom, whereas the 3270A was left with only 10% of free blocks. These great results were at least partly made possible by the availability of 15K RPM hard disk drives (HDDs). Those are great for running the most demanding databases because they combine a large IOPS capability with a generally smaller capacity; the ratio of IOPS/GB makes them ideal for the high intensity database load modeled by SPC-1. On top of that, this concerted engineering effort led to improved software, and not just for systems running on 15K RPM drives. We actually used this benchmark to seek out ways to increase the quality of our products. During the preparation runs, after an initial diagnosis of some issue, we were committed to finding solutions that were not targeting the idiosyncrasies of SPC-1 but were based on sound design decisions.
So instead of changing the default value of some internal parameter to a new static default, we actually changed the way the parameter worked so that our storage systems of all types and sizes would benefit. So not only are we getting a great SPC-1 result, but all existing customers will benefit from this effort even if they are operating outside of the intense conditions created by the benchmark.

So what is SPC-1? It is one of the few benchmarks which counts for storage. It is maintained by the Storage Performance Council (SPC). SPC-1 simulates multiple databases running on a centralized storage system or storage cluster. But even if SPC-1 is a block-based benchmark, within the ZFS Storage Appliance a block-based FC or iSCSI volume is handled very much the same way as would be a large file subject to synchronous operations. And by combining modern network technologies (InfiniBand or 10GbE Ethernet), the CPU power packed in the 7420C storage controllers, and Oracle's custom dNFS technology for databases, one can truly achieve very high database transaction rates on top of the more manageable and flexible file-based protocols.

The benchmark defines three Application Storage Units (ASUs): ASU1 with a heavy 8KB-block read/write component, ASU2 with a much lighter 8KB-block read/write component, and ASU3 which is subject to hundreds of write streams. As such it is not too far from a simulation of running hundreds of Oracle databases on a single system: ASU1 and ASU2 for datafiles and ASU3 for redo log storage. The total size of the ASUs is constrained such that all of the stored data (including mirror protection and disks used as spares) must exceed 55% of all configured storage. The benchmark team is then free to decide how much total storage to configure. From that figure, 10% is given to ASU3 (redo log space) and the rest is divided equally between the heavily used ASU1 and the lightly used ASU2. The benchmark team also has to select the SPC-1 IOPS throughput level it wishes to run.
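The sizing rules can be sketched numerically. The configured capacity below is a made-up example; only the 55% rule, the 2x mirror factor and the 10/45/45 split come from the rules described above:

```shell
# Illustrative ASU sizing for a hypothetical mirrored configuration.
awk 'BEGIN {
  configured = 100000             # GB of configured raw storage (hypothetical)
  stored_min = configured * 0.55  # ASU data + mirror copies + spares must exceed this
  asu_total  = stored_min / 2     # with simple mirroring, user data is half of stored data
  printf "ASU3 (logs)      : %.0f GB\n", asu_total * 0.10
  printf "ASU1 (heavy I/O) : %.0f GB\n", asu_total * 0.45
  printf "ASU2 (light I/O) : %.0f GB\n", asu_total * 0.45
}'
```

With 100 TB configured, this floor works out to roughly 2,750 GB of redo log space and 12,375 GB for each of the two datafile ASUs.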
This is not a light decision given that you want to balance high IOPS, low latency, and $/user GB. Once the target IOPS rate is selected, there are multiple criteria to meet in order to pass a successful audit; one of the most critical is that you have to run at the specified IOPS rate for a whole 8 hours. Note that the previous specification of the benchmark, used by NetApp, called for a 4-hour run. During that 8-hour run delivering a solid 137,000 SPC-1 IOPS, the average latency must be less than 30 ms (we did much better than that). After this brutal 8-hour run, the benchmark then enters another critical phase: the workload is restarted (using a new randomly selected working set) and performance is measured for a 10-minute period. It is this 10-minute period that decides the official latency of the run. When everything is said and done, you pull the trigger, go to sleep, and wake up to the result. As you can guess, we were ecstatic that morning.

Before that glorious day (for lack of a stronger word), a lot of hard work had been done during the extensive preparation runs. With little time, and normally not all of the hardware, one goes through series of runs at incremental loads, making educated guesses as to how to improve the result. As you get more hardware you scale up the result, tweaking things more or less until the final hour. SPC-1, with its requirement of less than 45% of unused space, is designed to trigger many disk-level random read IOPS. Despite this inherently random pattern of the workload, we saw that our extensive caching architecture was as helpful for this benchmark as it is in real production workloads. While a 15K RPM HDD normally levels off at a random-operation rate slightly above 300 IOPS, our 7420C, as a whole, could deliver almost 500 user-level SPC-1 IOPS per HDD.
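Those two per-drive numbers give a rough way to estimate the contribution of the caching hierarchy; a sketch (both figures are the approximate values quoted above):

```shell
# If each 15K RPM HDD tops out near 300 random IOPS at the device level but the
# system delivered ~500 user-level SPC-1 IOPS per drive, the fraction of user
# reads absorbed by the cache hierarchy can be approximated as:
awk 'BEGIN {
  hdd_iops  = 300   # device-level random IOPS per 15K RPM drive (approximate)
  user_iops = 500   # user-level SPC-1 IOPS delivered per drive (from the run)
  printf "cache-served fraction : %.0f%%\n", (1 - hdd_iops / user_iops) * 100
}'
```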
In the end, one of the most satisfying aspects was to see that the data being managed by ZFS was stored rock solid on disk and properly checksummed; all data could be snapshotted or compressed on demand, while delivering impressively steady performance. 2X the absolute performance, 2.5X cheaper per SPC-1 IOPS, almost 3X lower latency, 30% cheaper per user GB with room to grow... So, if you have a storage decision coming and you need FAST, SAFE, CHEAP: pick 3, and take a fresh look at the ZFS Storage Appliance.

SPC-1, SPC-1 IOPS, and $/SPC-1 IOPS are registered trademarks of the Storage Performance Council (SPC). More info at www.storageperformance.org. Sun ZFS Storage 7420 Appliance and Oracle Sun ZFS Storage Appliance 7420: http://www.storageperformance.org/results/benchmark_results_spc1#a00108 (as of October 3, 2011). NetApp FAS3270A: http://www.storageperformance.org/results/benchmark_results_spc1#ae00004 (as of October 3, 2011). The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.



Dedup Performance Considerations

One of the major milestones for the ZFS Storage Appliance with 2010/Q1 is the ability to dedup data on disk. The open question is then: what performance characteristics are we expected to see from dedup? As Jeff says, this is the ultimate gaming ground for benchmarks. But let's have a look at the fundamentals.

ZFS Dedup Basics

The dedup code is, simplistically, a large hash table (the DDT). It uses a 256-bit (32-byte) checksum along with other metadata to identify data content. On a hash match, we only need to increase a reference count instead of writing out duplicate data. The dedup code is integrated in the I/O pipeline and runs on the fly as part of the ZFS transaction group (see Dynamics of ZFS, The New ZFS Write Throttle). A ZFS zpool typically holds a number of datasets: either block-level LUNs, which are based on ZVOLs, or NFS and CIFS file shares based on ZFS filesystems. So while the dedup table is a construct associated with an individual zpool, enabling the deduplication feature is controlled at the dataset level. Enabling the dedup feature on a dataset has no impact on existing data, which stays outside of the dedup table; however, any new data stored in the dataset will be subject to the dedup code. To actually have existing data become part of the dedup table one can run a variant of "zfs send | zfs recv" on the datasets.

Dedup works at the ZFS block or record level. For an iSCSI or FC LUN, i.e. objects backed by ZVOL datasets, the default blocksize is 8K. For filesystems (NFS, CIFS or direct-attach ZFS), objects smaller than 128K (the default recordsize) are stored as a single ZFS block, while objects bigger than the default recordsize are stored as multiple records. Each record is the unit which can end up deduplicated in the DDT. Whole files which are duplicated in many filesystem instances are expected to dedup perfectly; for example, whole databases copied from a master file are expected to fall into this category.
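As a command-line sketch of the enabling workflow described above (dataset names are placeholders, and this shows standard ZFS commands rather than the appliance UI):

```shell
# Enable dedup on a dataset; only data written afterwards enters the DDT.
zfs set dedup=on tank/projects

# Existing data joins the dedup table only if it is rewritten, e.g. by
# replicating the dataset to a new name (the receiving dataset must also
# have dedup enabled, here inherited from the parent):
zfs snapshot tank/projects@migrate
zfs send tank/projects@migrate | zfs recv tank/projects-dedup
```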
Similarly for LUNs, virtual desktop users which were created from the same virtual desktop master image are also expected to dedup perfectly. An interesting topic for dedup concerns streams of bytes such as a tar file. For ZFS, a tar file is actually a sequence of ZFS records with no identified file boundaries. Therefore, identical objects (files captured by tar) present in 2 tar-like byte streams might not dedup well unless the objects actually start at the same alignment within the byte stream. A better dedup ratio would be obtained by expanding the byte stream into its constituent file objects within ZFS. If possible, the tools creating the byte stream would be well advised to start new objects on identified boundaries such as 8K.

Another interesting topic is backups of active databases. Since databases often interact with their constituent files using an identified block size, it is rather important for deduplication effectiveness that the backup target be set up with a block size that matches the source DB block size. Using a larger block on the deduplication target has the undesirable consequence that modifications to small blocks of the source database will cause those large blocks in the backup target to appear unique, somewhat artificially defeating dedup. By using an 8K block size in the dedup target dataset instead of 128K, one could conceivably see up to a 10X better deduplication ratio.

Performance Model and I/O Pipeline Differences

What is the effect of dedup on performance? First, when dedup is enabled, the checksum used by ZFS to validate the disk I/O is changed to the cryptographically strong SHA256. Darren Moffat shows in his blog that SHA256 actually runs at more than 128 MB/sec on a modern CPU. This means that less than 1 ms is consumed to checksum a 128K unit and less than 64 usec for an 8K unit.
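Those checksum-cost figures follow directly from the throughput number; a quick check:

```shell
# Checksum-cost arithmetic from the ~128 MB/sec SHA256 figure quoted above.
awk 'BEGIN {
  rate = 128 * 1024^2   # bytes/sec of SHA256 throughput on a modern CPU
  printf "128K block : %.0f usec\n", 128 * 1024 / rate * 1e6
  printf "8K block   : %.1f usec\n", 8 * 1024 / rate * 1e6
}'
```

At that rate a 128K block costs about 977 usec (just under 1 ms) and an 8K block about 61 usec, matching the bounds in the text.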
This cost is only incurred when actually reading or writing data to disk, an operation that is expected to take 5-10 ms; therefore the checksum generation or validation is not a source of concern. For the read code path, very little modification should be observed. The fact that a read happens to hit a block which is part of the dedup table is not relevant to the main code path. The biggest effect will be that we use a stronger checksum function invoked after a read I/O: at most an extra 1 ms is added to a 128K disk I/O. However, if a subsequent read is for a duplicate block which happens to be in the pool's ARC cache, then instead of having to wait for a full disk I/O, only a much faster copy of the duplicate block will be necessary. Each filesystem can then work independently on its copy of the data in the ARC cache, as is the case without deduplication.

Synchronous writes are also unaffected in their interaction with the ZIL. The blocks written to the ZIL have a very short lifespan and are not subject to deduplication. Therefore the path of synchronous writes is mostly unaffected, unless the pool itself ends up not being able to absorb the sustained rate of incoming changes for tens of seconds. Similarly for asynchronous writes, which interact with the ARC cache: the dedup code has no effect unless the pool's transaction group itself becomes the limiting factor.

So the effect of dedup will take place during the pool transaction group updates. This is where we take all modifications that occurred in the last few seconds and atomically commit a large transaction group (TXG). While a TXG is running, applications are not directly affected, except possibly through competition for CPU cycles; they mostly continue to read from disk, do synchronous writes to the ZIL, and do asynchronous writes to memory. The biggest effect will come if the incoming flow of work exceeds the capability of the TXG to commit data to disk.
Then, eventually, reads and writes will be held up by the necessary write throttling code that prevents ZFS from consuming all of memory. Looking into the ZFS TXG, we have 2 operations of interest: the creation of a new data block, and the simple removal (free) of a previously used block. Since ZFS operates under a copy-on-write (COW) model, any modification to an existing block actually represents both a new data block creation and a free of a previously used block (unless a snapshot was taken, in which case there is no free). For file shares, this concerns existing file rewrites; for block LUNs (FC and iSCSI), this concerns most writes except the initial one (the very first write to a logical block address, or LBA, actually allocates the initial data; subsequent writes to the same LBA are handled using COW).

For the creation of a new application data block, ZFS will run the checksum of the block, as it does normally, and then look up in the dedup table for a match based on that checksum and a few other bits of information. On a dedup table hit, only a reference count needs to be increased, and such changes to the dedup table will be stored on disk before the TXG completes. Many DDT entries are grouped in a disk block and compression is involved. A big win occurs when many entries in a block are subject to a write match during one TXG: a single 16K I/O can then replace tens of larger IOPS.

As for free operations, the internals of ZFS actually hold the referencing block pointer, which contains the checksum of the block being freed. Therefore there is no need to read nor recompute the checksum of the data being freed. ZFS, with checksum in hand, looks up the entry in the dedup table and decrements the reference counter. If the counter is non-zero then nothing more is necessary (just the dedup table sync). If the freed block ends up without any reference then it will be freed. The DDT itself is an object managed by ZFS at the pool level.
The table is considered metadata and its elements will be stored in the ARC cache. Up to 25% of memory (zfs_arc_meta_limit) can be used to store metadata. When the dedup table actually fits in memory, enabling dedup is expected to have a rather small effect on performance. But when the table is many times greater than the allotted memory, the lookups necessary to complete the TXG can cause write throttling to be invoked earlier than for the same workload running without dedup. If using an L2ARC, the DDT entries represent prime objects for the secondary cache. Note that, independent of the size of the dedup table, read-intensive workloads in highly duplicated environments are expected to be serviced using fewer IOPS at lower latency than without dedup. Also note that whole-filesystem removal or large file truncation are operations that can free up a large quantity of data at once; when the dedup table exceeds the allotted memory, those operations, which are more complex with deduplication, can impact the amount of data going into every TXG and the write throttling behavior.

So how large is the dedup table? The command zdb -DD on a pool shows the size of DDT entries. In one of my experiments it reported about 200 bytes of core memory per table entry. If each unique object is associated with 200 bytes of memory, that means 32GB of RAM could reference 20TB of unique data stored in 128K records, or a bit more than 1TB of unique data in 8K records. So if there is a need to store more unique data than these ratios provide, strongly consider allocating some large read-optimized SSDs to hold the DDT. The DDT lookups are small random I/Os which are handled very well by the current generation of SSDs.

The first motivation to enable dedup is actually when dealing with duplicate data to begin with. If possible, procedures that generate duplication could be reconsidered.
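To make the sizing arithmetic above concrete (a sketch; the ~200 bytes/entry figure came from one experiment and varies by pool):

```shell
# DDT sizing sketch using the ~200 bytes/entry figure reported by 'zdb -DD'.
awk 'BEGIN {
  ram        = 32 * 1024^3      # bytes of RAM devoted to DDT entries
  entry_size = 200              # approximate in-core bytes per DDT entry
  entries    = ram / entry_size
  printf "unique data @ 128K records : %.1f TB\n", entries * 128 * 1024 / 1024^4
  printf "unique data @ 8K records   : %.1f TB\n", entries * 8 * 1024 / 1024^4
}'
```

32GB at 200 bytes/entry gives roughly 20.5 TB of referenced unique data with 128K records, or about 1.3 TB with 8K records, consistent with the figures above.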
The use of ZFS clones is actually a much better way to generate logically duplicate data for multiple users, in a way that does not require a dedup hash table. But when the operating conditions do not allow the use of ZFS clones and data is highly duplicated, the ZFS deduplication capability is a great way to reduce the volume of stored data. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle. Referenced links: ZFS Dedup, Bobn's First Look, Dedup Community, Dynamics of ZFS, Write Throttle, Recordsize, SHA256 Performance



Proper Alignment for extra performance

Because of disk partitioning software on your storage clients (keywords: EFI, VTOC, fdisk, DiskPart, ...) or a mismatch between the storage configuration and the application request pattern, you could be suffering a 2-4X performance degradation. Many I/O performance problems I see end up being the result of a mismatch between request sizes, or their alignment, and the natural block size of the underlying storage. While raw disk storage works using a 512-byte sector and performs at the same level independent of the starting offset of I/O requests, this is not the case for more sophisticated storage, which will tend to use larger block units. Some SSDs today support 512B-aligned requests but will work much better if you give them 4K-aligned requests, as described in Aligning on 4K boundaries and Flash and Sizes. The Sun Oracle 7000 Unified Storage line supports different block sizes between 4K and 128K (it can actually go lower, but I would not recommend that in general). Having proper alignment between the application's view, the initiator partitioning and the backing volume can have a great impact on the end performance delivered to applications.

When is alignment most important? Alignment problems are most likely to have an impact when:

running a DB on file shares or block volumes

write streaming to block volumes (backups)

Also impacted, at a lesser level: large file rewrites on CIFS or NFS shares. In each case, adjusting the recordsize to match the workload and ensuring that partitions are aligned on a block boundary can have an important effect on your performance. Let's review the different cases.

Case 1: running a Database (DB) on file shares or block volumes

Here the DB is a block-oriented application. General ZFS best practices warrant that the storage use a record size equal to the DB's natural block size. At the logical level, the DB is issuing I/Os which are aligned on block boundaries.
When using file semantics (NFS or CIFS), the alignment is guaranteed to be observed all the way to the backend storage. But when using block device semantics, the alignment of requests on the initiator is not guaranteed to be the same as the alignment on the target side. Misalignment of the LUN will cause two pathologies. First, an application block read will straddle 2 storage blocks, creating storage IOPS inflation (more backend reads than application reads). But a more drastic effect will be seen for block writes which, when aligned, could be serviced by a single write I/O: those will now require a read-modify-write (R-M-W) of 2 adjacent storage blocks. This type of I/O inflation adds storage load and degrades performance during high demand.

To avoid such I/O inflation, ensure that the backing store uses a block size (LUN volblocksize or share recordsize) compatible with the DB block size. If using a file share such as NFS, ensure that the filesystem client passes I/O requests directly to the NFS server using a mount option such as directio, or use Oracle's dNFS client. (Note that with the directio mount option, for memory management considerations independent of alignment concerns, the server will behave better when the client specifies rsize and wsize options not exceeding 128K.) To avoid LUN misalignment, prefer the use of full LUNs as opposed to sliced partitions. If disk slices must be used, prefer a partitioning scheme in which one can control the sector offset of individual partitions, such as EFI labels, and start partitions on a sector boundary which aligns with the volume's blocksize. For instance, a partition whose initial sector is a multiple of 16 (16 x 512B sectors) will align on an 8K boundary, the default LUN blocksize.

Case 2: write streaming to block volumes (backups)

The other important case to pay attention to is stream writing to a raw block device. Block devices by default commit each write to stable storage.
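The starting-sector rule from Case 1 can be checked mechanically; a minimal sketch (the starting sector of 34, a common EFI default, is just an illustrative value):

```shell
# An EFI partition aligns with the default 8K volblocksize when its starting
# sector is a multiple of 16 (16 * 512B = 8K).
start_sector=34   # hypothetical partition start; sector 34 is a common EFI default

if [ $((start_sector % 16)) -eq 0 ]; then
  echo "aligned on an 8K boundary"
else
  echo "misaligned: next aligned start is sector $(( (start_sector / 16 + 1) * 16 ))"
fi
```

For sector 34 this reports a misalignment and suggests sector 48 as the next 8K-aligned starting point.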
This path is often optimized through the use of acceleration devices such as write-optimized SSDs. Misalignment of the LUNs due to partitioning software implies that application writes, which could otherwise be committed to SSD at low latency, will instead be delayed by disk reads caught in R-M-W cycles. Because the writes are synchronous in nature, the application running on the initiator will thus be considerably delayed by those disk reads. Here again one must ensure that partitions created on the client system are aligned with the volume's blocksize, which typically defaults to 8K. For pure streaming workloads, a large blocksize up to the maximum of 128K can lead to greater streaming performance; but take good care that the block size used for a LUN does not exceed the application's write sizes to the raw volume, or risk being hit by the R-M-W penalty.

Case 3: large file rewrites on CIFS or NFS shares

For file shares, large streaming writes will be of 2 types: they will either be the more common file creation (write allocation), or they will correspond to streaming overwrites of existing files. The more common write allocation does not greatly suffer from misalignment since there is no pre-existing data to be read and modified. But the less common streaming rewrite of existing files can definitely be impacted by misalignment and R-M-W cycles. Fortunately, file protocols are not subject to LUN misalignment, so one must only take care that the write sizes reaching the storage be a multiple of the recordsize used to create the file share. The Solaris NFS client often issues 32K writes for streaming applications, while CIFS has been observed to use 64K from clients. If streaming asynchronous file rewrites are an important component of your I/O workloads (a rare set of conditions), it might well be that setting the share's recordsize accordingly will provide a boost to delivered performance.
In summary

The problem with alignment is most often seen with fixed-record-oriented applications (such as Oracle Database or Microsoft Exchange) with random access patterns and synchronous I/O semantics. It can be caused by partitioning software (fdisk, diskpart) which creates disk partitions not aligned with the storage blocks. It can also be caused, to a lesser extent, by streaming file overwrites when the application write size does not match the file share's blocksize. The Sun Storage 7000 line offers great flexibility in selecting different blocksizes for different uses within a single pool of storage. However, it has no control over the offset that gets selected during disk partitioning of block devices on client systems. Care must be taken when partitioning disks to avoid misalignment and degraded performance; using full LUNs is preferred. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle. Referenced links: ESX, Exchange #1, Exchange #2, 4K Disk Sectors



Doubling Exchange Performance

2010/Q1 delivers 200% more Exchange performance and 50% extra transaction processing. One of the great advances present in the ZFS Appliance 2010/Q1 software update relates to the block allocation strategy. It's been one of the most complex performance investigations I've ever had to deal with, because of the very strong impact that the previous history of block allocation had on future performance. It was a maddening experience littered with dead-end leads. During that whole time it was very hard to make sense of the data and separate what was due to a problem in block allocation from the other causes that lead customers to report performance issues.

Executive Summary

A series of changes to the ZFS metaslab code led to 50% improved OLTP performance and 70% reduced variability from run to run. We also saw a full 200% improvement in MS Exchange performance from these changes.

Excruciating Details for the aspiring developer: "Abandon hope all ye who enter here"

At some point we started to look at random synchronous file rewrites (a la DB writer) and it seemed clear that the performance was not what we expected for this workload. Basically, independent DB-block synchronous writes were not aggregating into larger I/Os in the vdev queue. We could not truly assert a point where a regression had set in, so rather than treat this as a performance regression, we decided to study what we had in 2009/Q3 and see how we could make our current scheme work better. And that led us down the path of the metaslab allocator:



CMT, NFS and 10 Gbe

Now that we have gigabytes-per-second class network-attached OpenStorage, and highly threaded CMT servers to attach to it, you would figure that just connecting the two would be enough to open the pipes for immediate performance. Well... almost. Our OpenStorage systems can deliver great performance, but we often find limitations on the client side. Now that NAS servers can deliver so much power, their NAS clients can themselves be powerful servers trying to deliver GB/sec-class services to the internet. CMT servers are great throughput engines for that; however, they deliver the goods when the whole stack is threaded. So in a recent engagement, my colleague David Lutz found that we needed one tuning at each of 4 levels in Solaris: IP, TCP, RPC and NFS.

Service: tunable
IP: ip_soft_rings_cnt
TCP: tcp_recv_hiwat
RPC: clnt_max_conns
NFS: nfs3_max_threads, nfs4_max_threads

ip_soft_rings_cnt requires tuning up to Solaris 10 update 7. The default value of 2 is not enough to sustain high throughput in a CMT environment; a value of 16 proved beneficial. In /etc/system:

* To drive 10GbE on CMT in Solaris 10 update 7 : see blogs.sun.com/roch
set ip_soft_rings_cnt=16

The receive socket buffer size is critical to TCP connection performance. The buffer is not preallocated, and memory is only used if and when the application is not reading the data it has requested. The default of 48K is from the age of 10MB/s network cards and 1GB/sec systems. A larger value allows the peer to keep sending without throttling its flow while waiting for the returning TCP ACKs. This is especially critical in high-latency environments, urban-area networks or other long fat networks, but it's also critical in the datacenter to reach a reasonable portion of the 10GbE available in today's NICs.
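The required buffer size is essentially the bandwidth-delay product of the link; a sketch with illustrative round-trip times (the RTT values are assumptions, not measurements):

```shell
# Bandwidth-delay product: the receive buffer needed to keep a 10GbE link
# full is link rate (bytes/sec) times round-trip time.
awk 'BEGIN {
  rate = 10e9 / 8   # bytes/sec on a 10GbE link
  printf "0.1 ms RTT : %.0f KB\n", rate * 0.0001 / 1024
  printf "0.5 ms RTT : %.0f KB\n", rate * 0.0005 / 1024
}'
```

Even at a 0.1 ms datacenter round trip the product is about 122 KB, far beyond the 48K default, and at 0.5 ms it reaches roughly 610 KB, which is why values in the hundreds of kilobytes make sense here.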
It turns out that NFS connections inherit the TCP default for the system, so it is interesting to run with a value between 400K and 1MB:

ndd -set /dev/tcp tcp_recv_hiwat 400000

But even with this, a single TCP connection is not enough to extract the most out of 10GbE on CMT, and the Solaris RPC client will establish a single connection to any given server it connects to. The code underneath is highly threaded, but it did suffer from a few bugs when trying to tune the number of connections, notably 6696163 and 6817942, both of which are fixed in S10 update 8. With that release, it becomes interesting to tune the number of RPC connections, for instance to 8. In /etc/system:

* To drive 10GbE on CMT in Solaris 10 update 8 : see blogs.sun.com/roch
set clnt_max_conns=8

And finally, above the RPC layer, NFS implements a pool of threads per mount point to service asynchronous requests. These are mostly used in streaming workloads (readahead and writebehind), while synchronous requests are issued within the context of the application thread. The default number of asynchronous request threads is likely to limit performance in some streaming scenarios, so I would experiment with, in /etc/system:

* To drive 10GbE on CMT in Solaris 10 update 7 : see blogs.sun.com/roch
set nfs3_max_threads=32
set nfs4_max_threads=32

As usual YMMV; use these with the usual circumspection. Remember that tuning is evil, but it's better to know about these factors than to be in the dark and stuck with lower-than-expected performance.
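For convenience, here are the tunings from this post gathered in one place. The same era-specific caveats apply; these are starting points, not universal defaults:

```shell
# /etc/system entries (reboot required):
#   set ip_soft_rings_cnt=16   # up to Solaris 10 update 7
#   set clnt_max_conns=8       # Solaris 10 update 8 and later
#   set nfs3_max_threads=32
#   set nfs4_max_threads=32

# TCP receive buffer, applied live with ndd (not persistent across reboots):
ndd -set /dev/tcp tcp_recv_hiwat 400000
```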



iSCSI unleashed

One of the much anticipated features of the 2009.Q3 release of the Fishworks OS is a complete rewrite of the iSCSI target implementation, known as the Common Multiprotocol SCSI Target or COMSTAR. The new target code is an in-kernel implementation that replaces what was previously known as the iSCSI target daemon, a user-level implementation of iSCSI. Should we expect huge performance gains from this change? You bet! But like most performance questions, the answer is often: it depends. The measured performance of a given test is gated by the weakest link triggered, and iSCSI is just one component among many others that can end up gating performance. If the daemon was not a limiting factor for your test, then that's your answer: expect little change.

The target daemon was a userland implementation of iSCSI: daemon threads would read data from a storage pool and write data to a socket, or vice versa. Moving to a kernel implementation opens up options to bypass at least one of the copies, and that is being considered as a future option. But extra copies, while undesirable, do not necessarily contribute to small-packet latency or large-request throughput. For small requests, the copy is small change compared to the request handling; for large-request throughput, the important thing is that the data path establishes a pipelined flow in order to keep every component busy at all times. The way threads interact with one another can be a much greater factor in delivered performance. And there lies the problem: the old target daemon suffered from one major flaw in that each and every iSCSI request required multiple trips through a single queue (shared between every LUN), and that queue was being read and written by 2 specific threads. Those 2 threads would end up fighting for the same locks.
This was compounded by the fact that user-level threads can be put to sleep when they fail to acquire a mutex, and going to sleep for a user-level thread is a costly operation implying a system call and all the accounting that goes with it. So while the iSCSI target daemon gave reasonable service for large requests, it was much less scalable in terms of the number of IOPS that could be served and the CPU efficiency with which it could serve them, IOPS being of course a critical metric for block protocols. As an illustration, with 10 client initiators and 10 threads per initiator (so 100 outstanding requests) doing 8K cache-hit reads, we observed:

Old target daemon: 31K IOPS
COMSTAR: 85K IOPS
Improvement: 2.7X

Moreover, the target daemon was consuming 7.6 CPUs to service those 31K IOPS, while COMSTAR could handle 2.7X more IOPS consuming only 10 CPUs, a 2X improvement in IOPS-per-CPU efficiency. On the write side, with a disk pool that had 2 striped write-optimized SSDs, COMSTAR gave us 50% more throughput (130 MB/sec vs 88 MB/sec) and 60% more CPU efficiency.

ImmediateData

During our testing we noted a few interesting contributors to delivered performance. The first is the setting of the iSCSI ImmediateData parameter (see iscsiadm(1M)). On the write path, this parameter causes the initiator to send up to 8K of data along with the initial request packet. While this is a good idea, we found that for certain write sizes it would trigger a condition in the ZIL that caused ZFS to issue more data than necessary through the logzillas. The problem is well understood and remediation is underway, and we expect to get to a situation in which keeping the default value of immediatedata=yes is best. But as of today, for those attempting world-record data transfer speeds through logzillas, setting immediatedata=no and using a 32K or 64K write size might yield positive results depending on your client OS.

Interrupt Blanking

Interested in low-latency request response?
Interestingly, a chunk of that response time is lost in the obscure settings of network card drivers. Network cards will often delay pending interrupts in the hope of coalescing more packets into a single interrupt. The extra efficiency often results in more throughput at high data rates, at the expense of small-packet latency. For 8K requests we managed to get 15% more single-threaded IOPS by tweaking one such client-side parameter. Historically such tuning has been hidden in the bowels of each driver and is specific to every client OS, so that's too broad a topic to cover here. But for Solaris clients, the Crossbow framework aims, among other things, to make the latency-vs-throughput decision much more adaptive to operating conditions, relaxing the need for per-workload tuning.

WCE Settings

Another important parameter to consider with COMSTAR is the 'write cache enable' (WCE) bit. By default, every write request to an iSCSI LUN must be committed to stable storage, as this is what is expected by most consumers of block storage. That means each individual write request to disk-based pool storage will minimally take a disk rotation, or 5 ms to 8 ms, to commit. This is also why a write-optimized SSD is quite critical to many iSCSI workloads, often yielding 10X performance improvements. Without such an SSD, iSCSI performance will appear quite lackluster, particularly for lightly threaded workloads, which are more affected by latency characteristics. One could then feel justified in setting the write cache enable bit on some LUNs in order to recoup some spunk in their engine. One piece of good news here is that in the new 2009.Q3 release of Fishworks the setting is now persistent across reboots and reconnection events, fixing a nasty condition of 2009.Q2. However, one should be very careful with this setting, as the end consumer of the block storage (Exchange, NTFS, Oracle, ...) is quite probably operating under an unexpected set of conditions.
This setting can lead to application corruption in case of an outage (there is no risk to the storage's internal state). There is one exception to this caveat, and it is ZFS itself. ZFS is designed to operate safely and correctly on top of devices that have their write caches enabled, because ZFS flushes write caches whenever application semantics or its own internal consistency require it. So for a zpool created on top of iSCSI LUNs, setting WCE on the LUNs to boost performance is well justified.

Synchronous write bias

Finally, as described in my blog entry about Synchronous write bias, we now have the option to bypass the write-optimised SSDs for a LUN if the workload it receives is less sensitive to latency. This would be the case for a highly threaded workload doing large data transfers. Experimenting with this new property is warranted at this point.
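The latency arithmetic behind the WCE discussion above can be sketched quickly. The 5-8 ms rotation cost and the ~100 μs SSD commit figures come from the text; the single-threaded ceiling model is a simplifying assumption.

```python
# Why a write-optimized SSD matters for lightly threaded iSCSI sync
# writes: without one, each committed write costs roughly a disk
# rotation. Figures (5-8 ms rotation, ~100 us SSD commit) are from
# the text; the one-commit-at-a-time model is an assumption.

def max_sync_writes_per_sec(commit_seconds):
    """Single-threaded ceiling: one commit must finish before the next."""
    return 1.0 / commit_seconds

hdd = max_sync_writes_per_sec(6.5e-3)   # mid-range of the 5-8 ms rotation
ssd = max_sync_writes_per_sec(100e-6)   # write-optimized SSD (logzilla)

print(round(hdd))        # ~154 sync writes/sec per stream on disk
print(round(ssd))        # ~10000/sec with the SSD log
print(round(ssd / hdd))  # ~65X headroom, so 10X end-to-end gains are plausible
```

This is why the 10X claim above concentrates on lightly threaded workloads: with enough concurrency the disks can overlap commits, and the gap narrows.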



Synchronous write bias property

With the 2009.Q3 release of fishworks, along with a new iSCSI implementation, we're coming up with a very significant new feature for managing the performance of Oracle databases: the new dataset Synchronous write bias property, or logbias for short. In a nutshell, this property takes the default value of Latency, signifying that the storage should handle synchronous writes with urgency, the historical default handling. See Brendan's comprehensive blog entry on the Separate Intent Log and synchronous writes. However, for datasets holding Oracle datafiles, the logbias property can be set to Throughput, signifying that the storage should avoid using log device acceleration and instead try to optimize the workload's throughput and efficiency. We definitely expect to see a good boost to Oracle performance from this feature for many types of workloads and configs: workloads that generate 10s of MB/sec of DB writer traffic and have no more than 1 logzilla per tray/JBOD. The property is set in the Share Properties, just above database recordsize. You might need to unset the 'Inherit from project' checkbox in order to modify the settings on a particular share.

The logbias property addresses a peculiar aspect of Oracle workloads: namely, that DB writers issue a large number of concurrent synchronous writes to Oracle datafiles, writes which individually are not particularly urgent. In contrast to other types of synchronous write workloads, the important metric for DB writers is not individual latency; it is that the storage keeps up with the throughput demand, so that database buffers are always available for recycling. This is unlike redo log writes, which are critically sensitive to latency as they hold up individual transactions, and thus users.

ZFS and the ZIL

A little background: with ZFS, synchronous writes are managed by the ZFS Intent Log (ZIL).
Because synchronous writes are typically holding up applications, it's important to handle them with some level of urgency, and the ZIL does an admirable job of that. In the OpenStorage hybrid storage pool, the ZIL itself is sped up using low-latency, write-optimized SSD devices: the logzillas. Those devices are used to commit a copy of the in-memory ZIL transaction and retain the data until an upcoming transaction group commits the in-memory state to the on-disk pooled storage (Dynamics of ZFS, The New ZFS write throttle). So while the ZIL speeds up synchronous writes, logzillas speed up the ZIL. Now, SSDs can serve IOPS at a blazing 100μs but also have their own throughput limits: currently around 110 MB/sec per device. At that throughput, committing, for example, 40K of data will take minimally about 360μs. The more data we can divert away from log devices, the lower the latency response of those devices will be. It's interesting to note that other types of RAID controllers are hostage to their NVRAM and require, for consistency, that data be committed through some form of acceleration in order to avoid the RAID write hole (Bonwick on RAID-Z). ZFS, however, does not require that data pass through its SSD commit accelerator; it can manage the consistency of commits either using disks or using SSDs.

Synchronous write bias : Throughput

With this newfound ability of storage administrators to signify to ZFS that some datasets will be subject to highly threaded synchronous writes for which global throughput is more critical than individual write latency, we can enable a different handling mode. By setting logbias=Throughput, ZFS is able to divert writes away from the logzillas, which are then preserved for servicing latency-sensitive operations (e.g. redo log operations). A setting of Synchronous write bias : Throughput for a dataset allows synchronous writes to files in other datasets to have lower latency access to SSD log devices. But that's not all.
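Before going further, the log device arithmetic above is worth rechecking. The 110 MB/sec ingest figure is from the text; the 1 MB burst used to illustrate queueing is a hypothetical example.

```python
# Back-of-the-envelope check of the logzilla figures quoted above:
# ~110 MB/sec ingest per device, so a 40K commit occupies the device
# for roughly 360 us. The 1 MB burst below is an illustrative
# assumption, not a number from the text.

def commit_us(nbytes, mbps=110):
    """Microseconds to push nbytes through one log device."""
    return nbytes / (mbps * 1e6) * 1e6

print(round(commit_us(40_000)))  # ~364 us, in line with the ~360 us above

# If a DB-writer burst of 1 MB lands on the same device, a redo-log
# write queued behind it waits roughly this long (in ms):
print(round(commit_us(1_000_000) / 1000, 1))  # ~9.1 ms
```

That queueing effect is exactly what diverting bulk synchronous writes away from the logzillas avoids.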
Data flowing through a logbias=Throughput dataset is still served by the ZIL. It turns out that the ZIL has different internal options in the way it can commit transactions, one of which is tagged WR_INDIRECT. WR_INDIRECT commits issue an I/O for the modified file record and record a pointer to it in the ZIL chain (see WR_INDIRECT in zil.c, zvol.c, zfs_log.c). ZIL transactions of type WR_INDIRECT might use more disk I/Os, and slightly higher latency, immediately, but fewer I/Os and fewer total bytes during the upcoming transaction group update. Up to this point, the heuristics that lead to using WR_INDIRECT transactions were not triggered by DB writer workloads. But armed with the knowledge that comes with the new logbias property, we're now less concerned about the slight latency increase that WR_INDIRECT can cause. So, for efficiency considerations, logbias=Throughput datasets are now set to use this mode, leading to more leveled latency distributions for transactions. Synchronous write bias : Throughput is a dataset mode that reduces the number of I/Os that need to be issued on behalf of the dataset during the regular transaction group updates, leading to more leveled response times. A reminder that this kind of improvement can sometimes go unnoticed in sustained benchmarks if the downstream transaction group destage is not given enough resources; make sure you have enough spindles (or total disk KRPM) to sustain the level of performance you need. A pool with 2 logzillas and a single JBOD might have enough SSD throughput to absorb DB writer workloads without adversely affecting redo log latency, and so would not benefit from the special logbias setting; however, with 1 logzilla per JBOD the situation might be reversed. While the DB Record Size property is inherited by files in a dataset and is immutable, the logbias setting is totally dynamic and can be toggled on the fly during operations.
For instance, during database creation or some lightly threaded write operations to datafiles, it's expected that logbias=Latency will perform better.

Logbias deployments for Oracle

As of the 2009.Q3 release of fishworks, the current wisdom around deploying an Oracle DB on an OpenStorage system with SSD acceleration is to segregate, at the filesystem/dataset level but within the single storage pool, Oracle datafiles, index files and redo log files. Having each type of file in a different dataset allows better observability into each one using the great analytics tool. But also, each dataset can then be tuned independently to deliver the most stable performance characteristics. The most important parameter to consider is the internal ZFS recordsize used to manage the files. For Oracle datafiles, the established ZFS Best Practice is to match the recordsize to the DB block size. For redo log files, using the default 128K records means that fewer file updates will be straddling multiple filesystem records. With 128K records we expect fewer transactions to wait on redo log input I/Os, leading to a more leveled latency distribution for transactions. As for index files, using smaller blocks of 8K offers better cacheability for both the primary and secondary caches (only cache what is used from indexes), but using larger blocks offers better index-scan performance. Experimenting is in order depending on your use case, but an intermediate block size of maybe 32K might also be considered for mixed-usage scenarios. For Oracle datafiles specifically, the new setting of Synchronous write bias : Throughput has the potential to deliver more stable performance in general and higher performance for redo-log-sensitive workloads.
Dataset      Recordsize         Logbias
Datafiles    8K                 Throughput
Redo Logs    128K (default)     Latency (default)
Index        8K-32K?            Latency (default)

Following these guidelines yielded a 40% boost in our transaction processing testing, in which we had 1 logzilla for a 40-disk pool.
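The recommendations above can be captured as a small lookup, handy when scripting share setup. The dataset names (datafiles, redo, index) and the pool name are hypothetical examples; recordsize and logbias are the actual ZFS dataset property names.

```python
# Per-dataset Oracle tuning from the table above, as a lookup.
# Dataset and pool names below are hypothetical; the property names
# (recordsize, logbias) are the real ZFS dataset properties.

ORACLE_TUNING = {
    "datafiles": {"recordsize": "8k",   "logbias": "throughput"},
    "redo":      {"recordsize": "128k", "logbias": "latency"},  # both defaults
    "index":     {"recordsize": "8k",   "logbias": "latency"},  # 8k-32k: experiment
}

def zfs_set_commands(pool, dataset):
    """Emit the 'zfs set' commands for one dataset (a sketch)."""
    props = ORACLE_TUNING[dataset]
    return ["zfs set %s=%s %s/%s" % (k, v, pool, dataset)
            for k, v in sorted(props.items())]

for cmd in zfs_set_commands("tank", "datafiles"):
    print(cmd)
```

On the appliance itself these properties are set through the Share Properties screen rather than a command line, so treat the emitted commands as documentation of intent.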



Compared Performance of Sun 7000 Unified Storage Array Line

The Sun Storage 7410 Unified Storage Array provides high performance for NAS environments and can be used for a wide variety of applications. With a _single_ 10 GbE connection, the 7410 delivers 10 GbE line speed: 1 GB/sec of throughput. The Sun Storage 7310 Unified Storage Array delivers over 500 MB/sec on streaming writes for backup and imaging applications. The 7410 delivers over 22,000 8K synchronous writes per second, combining great DB performance with the ease of deployment of Network Attached Storage, while delivering the economic benefits of inexpensive SATA disks. The 7410 also delivers over 36,000 random 8K reads per second from a 400GB working set, for great mail application responsiveness; this corresponds to an enterprise of 100,000 people with every employee accessing new data every 3.6 seconds, consolidated on a single server. All these numbers characterise a single head of a clusterable 7410. The 7000 clustering technology stores all data in dual-attached disk trays, and no state is shared between cluster heads (see Sun 7000 Storage clusters); this means that an active-active cluster of 2 healthy 7410s will deliver 2X the performance posted here. Also note that the performance posted here represents what is achieved under a very tightly defined, constrained workload (see Designing Performance Metrics) and does not represent the performance limits of the systems. This is testing 1 x 10 GbE port only; each product can have 2 or 4 10 GbE ports, and by running load across multiple ports the server can deliver even higher performance.
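The mail-consolidation claim above is easy to sanity-check from the quoted figures (36,000 random 8K reads/sec; 100,000 employees; one access every 3.6 seconds):

```python
# Sanity-check of the mail-consolidation arithmetic above, using only
# the figures quoted in the text.

MEASURED_IOPS = 36_000
employees = 100_000
seconds_between_accesses = 3.6

required_iops = employees / seconds_between_accesses
print(round(required_iops))             # ~27778 IOPS needed
print(required_iops <= MEASURED_IOPS)   # True: fits, with ~30% headroom

# Data rate implied by 36,000 x 8K reads:
print(round(MEASURED_IOPS * 8 * 1024 / 1e6))  # ~295 MB/sec of 8K reads
```

So the 100,000-employee scenario leaves the single head with comfortable headroom before the measured limit.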
Achieving maximum performance is a separate exercise done extremely well by my friend Brendan: 7410 Perf, 7310 Perf, 2 GB/sec, Quarter Million IOPS.

Measurement Method

To measure our performance we used the open source Filebench tool accessible from SourceForge (



Need Inodes ?

It seems that some old-school filesystems still need to statically allocate inodes to hold pointers to individual files. Normally this should not cause too many problems, as default settings account for an average file size of 32K. Or will it ? If the average file size you need to store on the filesystem is much smaller than this, then you are likely to eventually run out of inodes even though the space consumed on the storage is far from exhausted. In ZFS, inodes are allocated on demand, and so the question came up: how many files can I store onto a piece of storage? I managed to scrape up an old disk of 33GB, created a pool, and wanted to see how many 1K files I could store on that storage. ZFS stores files with the smallest number of sectors possible, and so 2 sectors was enough to store the data. Then of course one also needs to store some amount of metadata, indirect pointers, directory entries etc. to complete the story. There I didn't know what to expect. My program would create 1000 files per directory; max depth level is 2, nothing sophisticated attempted here. So I let my program run for a while and eventually interrupted it at 86% of disk capacity:

Filesystem    size  used  avail  capacity  Mounted on
space          33G   27G   6.5G  81%       /space

Then I counted the files:

# ptime find /space/create | wc
real    51:26.697
user     1:16.596
sys     25:27.416
23823805 23823805 1405247330

So 23.8M files consuming 27GB of data: basically less than 1.2K of used disk space per KB of files. A legacy-type filesystem that allocates one inode per 32K would have run out of inodes after a meager 1M files, but ZFS managed to store 23X more on the disk without any tuning. The find command here is mostly gated on fstat performance, and we see that we did the 23.8M fstats in roughly 3090 seconds, or about 7700 fstats per second. But here is the best part: how long did it take to create all those files?
real  1:09:20.558
user     9:20.236
sys   2:52:53.624

This is hard to believe, but it took about 1 hour to create 23.8 million files, on a single direct-attach drive:

3. c1t3d0 <FUJITSU-MAP3367N SUN36G-0401-33.92GB>

ZFS created on average 5721 files per second. Now obviously such a drive cannot do 5721 IOPS, but with ZFS it didn't need to. File creation is actually more of a CPU benchmark, because the application is interacting with the host cache; it's the task of the filesystem to then create the files on disk in the background. With ZFS, the combination of the allocate-on-write policy and the sophisticated I/O aggregation in the I/O scheduler (dynamics) means that the I/Os for multiple independent file creates can be coalesced. Using DTrace I counted the number of I/Os issued and the files created per minute; typical samples show more than 200K files created per minute using about 3000 I/Os per minute, or 3300 files per second using a mere 50 IOPS!

Per-Minute Sample   Creates   IOs
#1                   214643   2856
#2                   215409   3342
#3                   212797   2917
#4                   211545   2999

Finally, with all these files, is scrubbing a problem? It took 1h34m to scrub that many files, at a pace of 4200 scrubbed files per second. No sweat.

  pool: space
 state: ONLINE
 scrub: scrub completed after 1h34m with 0 errors on Wed Feb 11 12:17:20 2009

If you need to create, store and otherwise manipulate lots of small files efficiently, ZFS has got to be your filesystem of choice.
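The small-file arithmetic above can be rechecked from the figures in the text (33GB disk, one inode per 32K for the legacy case, 23,823,805 files counted, and one of the per-minute DTrace samples):

```python
# Rechecking the small-file arithmetic above from the figures quoted
# in the text.

disk_bytes   = 33 * 2**30          # the 33GB disk
files_stored = 23_823_805          # counted with find | wc

# Legacy filesystem: one statically allocated inode per 32K of capacity.
legacy_max_files = disk_bytes // (32 * 1024)
print(legacy_max_files)                   # ~1.08M files, then out of inodes
print(files_stored // legacy_max_files)   # ~22X more on ZFS (23X quoted,
                                          # using a round 1M-inode figure)

# I/O coalescing during the create run (per-minute sample #1):
creates_per_min, ios_per_min = 214_643, 2_856
print(round(creates_per_min / ios_per_min))  # ~75 file creates per disk I/O
```

That last ratio is the whole story: the allocate-on-write policy turns tens of file creates into a single aggregated write.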



Decoding Bonnie++

I've been studying the popular Bonnie++ load generator to see if it is a suitable benchmark to use with network attached storage such as the Sun Storage 7000 line. At this stage I've looked at single-client runs, and it doesn't appear that Bonnie++ is an appropriate tool in this environment because, as we'll see here, for many of the tests it stresses either the networking environment or the strength of the client-side CPU. The first interesting thing to note is that Bonnie++ works on a data set that is double the client's memory. This does address some of the client-side caching concerns one could otherwise have, but in a NAS environment the amount of memory present on the server is not considered by a default Bonnie++ run: my client had 4GB, leading to a working set of 8GB, while the server had 128GB of memory. Bonnie++'s output looks like:

Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
v2c01            8G 81160  92 109588 38 89987  67 69763  88 113613 36  2636  67
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   687  10 +++++ +++  1517   9   647  10 +++++ +++  1569   8
v2c01,8G,81160,92,109588,38,89987,67,69763,88,113613,36,2635.7,67,16,687,10,+++++,+++,1517,9,647,10,+++++,+++,1569,8

Method

I have used a combination of Solaris truss(1), reading the Bonnie++ code, looking at AmberRoad's Analytics data, as well as a custom Bonnie d-script, in order to understand how each test triggers system calls on the client and how those translate into a NAS server load. In the d-script, I characterise the system calls by their average elapsed time as well as by the time spent waiting for a response from the NAS server. The time spent waiting is the operational latency that one should be interested in when characterising a NAS, while the additional time relates to the client's CPU strength along with the client NFS implementation. Here is what I found trying to explain how performant each test was.

Writing with putc()

Easy enough: that test creates a file using single-character putc() stdio library calls. This is clearly a client CPU test, with most of the time spent in user space running putc(). Every 8192 putc calls, the stdio library issues a write(2) system call. That syscall is still a client CPU test, since the data is absorbed by the client cache. What we test here is the client's single-CPU performance and the client NFS implementation. On a 2-CPU / 4GB V20z running Solaris, we observed on the server, using Analytics, a network transfer rate of 87 MB/sec.

Results : 87 MB/sec of writes. Limited by single CPU speed.
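As an aside, the comma-separated line at the bottom of the report is Bonnie++'s machine-readable summary. A small parser pulls out the headline numbers; the field layout below is as observed in this 1.03 output and should be treated as an assumption for other Bonnie++ versions.

```python
# Parsing the machine-readable CSV line Bonnie++ appends to its report.
# Field layout matches the 1.03d run shown above; treat it as an
# assumption for other Bonnie++ versions.

csv = ("v2c01,8G,81160,92,109588,38,89987,67,69763,88,113613,36,"
       "2635.7,67,16,687,10,+++++,+++,1517,9,647,10,+++++,+++,1569,8")

f = csv.split(",")
summary = {
    "host":               f[0],
    "size":               f[1],
    "putc_KBps":          int(f[2]),    # per-char sequential output
    "block_write_KBps":   int(f[4]),    # "intelligent" block writes
    "rewrite_KBps":       int(f[6]),
    "getc_KBps":          int(f[8]),
    "block_read_KBps":    int(f[10]),
    "seeks_per_sec":      float(f[12]),
    "seq_create_per_sec": int(f[15]),
}
print(summary["block_write_KBps"] // 1024)  # ~107 MB/sec of block writes
print(summary["seq_create_per_sec"])        # 687 sequential creates/sec
```

The "+++++" entries are Bonnie++'s way of saying the test completed too quickly to report a meaningful rate, which is why a parser has to leave those fields as strings.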
Writing intelligently...done

Here it's more clever, since the file is written using sequential 8K write system calls, and the CPU is much relieved. The application issues 8K write system calls to the client NFS, which are absorbed by memory on the client. With an OpenSolaris client, no over-the-wire request is sent for such an 8K write. However, after 4 such 8K writes we reach the natural 32K chunk size advertised by the server, and that causes the client to asynchronously issue a write request to the server. The asynchronous nature means that this does not cause the application to wait for the response, and the test keeps going on CPU. The process then races ahead, generating more 8K writes and 32K asynchronous NFS requests. If we manage to generate requests at a greater rate than responses come back, we will consume all allocated asynchronous threads; on Solaris this maps to nfs4_max_threads (8) threads. When all 8 asynchronous threads are waiting for a response, the application finally blocks, waiting for a previously issued request to get a response and free an async thread. Since generating 8K writes to fill the client cache is faster than the network connection between the client and the server, we eventually reach this point. The steady state of this test is Bonnie++ waiting for data to transfer to the server. This happens at the speed of a single NFS connection, which for us saturated the 1Gbps link we had: we observed 113 MB/sec, which is network line rate considering protocol overheads. To get more throughput on this test, one could use jumbo frame Ethernet instead of the default 1500-byte frame size, as this would reduce the protocol overhead slightly. One could also configure the server and client to use 10Gbps Ethernet links, or use LACP link aggregation of 1Gbps network ports; LACP increases the throughput of multiple network connections, but not of a single-socket protocol.
By default a Solaris client will establish a single connection (clnt_max_conns = 1) to a server (1 connection per target IP). So using multiple aggregated links _and_ tuning clnt_max_conns could yield extra throughput here. With a single connection, one could use a faster network between client and server to reach additional throughput. More commonly, we expect to saturate the client's 1Gbps connectivity here, which is not much of a stress for a Sun Storage 7000 server.

Results : 113 MB/sec of writes. Network limited.

Rewriting...done

This one gets a little interesting. It actually reads 8K, lseeks back to the start of the block, overwrites the 8K with new data, and loops. For the NFS protocol, lseek is a no-op, since every over-the-wire write is tagged with the target offset. In this test we are effectively stream-reading the file from the server and stream-writing the file back to the server. The stream-write behavior is much like the previous test: we never need to block the process unless we consume all 8 asynchronous threads. Similarly, 8K sequential reads will be recognised by the client NFS as streaming access, which deploys asynchronous readahead requests; we will use 4 (nfs4_nra) requests for 32K blocks ahead of the point currently being read. What we observed here was that of 88 seconds of elapsed time, 15 were spent in writes and 20 in reads; however, only a small portion of that was spent waiting for responses. It was mostly CPU time spent interacting with the client NFS. This implies that readahead and asynchronous writeback were behaving without becoming bottlenecks. The Bonnie++ process took 50 sec of the 88 sec, and a big chunk of this, 27 sec, was spent waiting off-CPU. I struggle somewhat with this interpretation, but I do know from the Analytics data on the server that the network is seeing 100 MB/sec of data flowing in each direction. This must also be close to network saturation.
The wait time attributed to Bonnie++ in this test seems to be related to kernel preemption. As Bonnie++ is coming out of its system calls, we see such events in DTrace:

  unix`swtch+0x17f
  unix`preempt+0xda
  genunix`post_syscall+0x59e
  genunix`syscall_exit+0x59
  unix`0xfffffffffb800f06
  17570

This must be to service kernel threads of higher priority, likely the asynchronous threads spawned by the reads and writes. This test is then a stress test of bidirectional flow of 32K data transfers. Just like the previous test, to improve the numbers one would need to improve the network throughput between client and server; it could then also potentially benefit from faster and more client CPUs.

Results : 100 MB/sec in each direction, network limited.

Reading with getc()...done

Reads the file one character at a time. Back to a test of the client CPU, much like the first one. We see that the readaheads are working great, since little time is spent waiting (0.4 of 114 seconds). Given that this test does 1 million reads in 114 seconds, the average latency can be evaluated at 114 usec.

Results : 73 MB/sec, single CPU limited on the client.

Reading intelligently...done

Reads with 8K system calls, sequential. This test uses 3 spawned Bonnie++ processes to read files. The reads are of size 8K and we needed 1M of them to read our 8GB working set. We observed with Analytics no I/O done on the server, since it had 128GB of cache available to it. The network, on the other hand, is saturated at 118 MB/sec. The dtrace script shows that the 1M read calls collectively spend 64 seconds waiting (most of that on NFS responses). So that implies a 64 usec read response time for this sequential workload.

Results : 118 MB/sec, limited by network environment.

start 'em...done...done...done...

Here it seems that Bonnie++ starts the 3 helper processes used to read the files in the "Reading intelligently" test.

Create files in sequential order...done.
Here we see 16K files being created (with creat(2)) then closed. This test creates and closes 16K files, and took 22 seconds in our environment. 19 seconds were used for the creates, 17.5 of them waiting for responses; that means a 1ms response time for file creates. The test seems single-threaded. Using Analytics we observe 13500 NFS ops per second to handle those file creates. We do see some activity on the write-biased SSD, although a very modest 2.64 MB/sec. Given that the test is single-threaded, we can't tell whether this metric is representative of the NAS server's capability. More likely it is representative of the single-thread capability of the whole environment, made of: client CPU, client NFS implementation, client network driver and configuration, network environment including switches, and the NAS server.

Results : 744 file creates per second per thread. Limited by operational latency.

Here is the analytics view captured for this test and the following 5 tests.

Stat files in sequential order...done.

This test was too elusive, possibly working against cached stat information.

Delete files in sequential order...done.

Here we call the unlink(2) system call for the 16K files. The run takes 10.294 seconds, showing 1591 unlinks per second. Each call goes off-CPU, waiting for a server response for 600 usec. Much like the file create test above, while this measures the single-threaded unlink time present in the environment, it's obviously not representative of the server's capabilities.

Results : 1591 unlinks per second per thread. Limited by operational latency.

Create files in random order...done.

We recreate 16K files, closing each one but also running a stat() system call on each.

Stat files in random order...done.

Elusive as above.

Delete files in random order...done.

We remove the 16K files. I could not discern in the "random order" tests any meaningful differences from the sequential-order ones.
Analytics screenshot of Bonnie++ run

Here is the full screenshot from Analytics, including disk and CPU data. The takeaway here is that a single-instance Bonnie++ does not generally stress a Sun Storage 7000 NAS server, but will stress the client CPU and 1Gbps network connectivity. There is no multi-client support in Bonnie++ (that I could find). One can certainly start multiple clients simultaneously, but since the different tests would not be synchronized, the output of Bonnie++ would be very questionable. Bonnie++ does have a multi-instance synchronisation mode, but it is based on semaphores, which only work if all instances are running within the same OS environment. So in a multi-client test, only the total elapsed time would be of interest, and that would be dominated by the streaming performance, as each client would read and write its working set 3 times over the wire. File create and unlink times would also contribute to the total elapsed time of such a test. For a single-node multi-instance Bonnie++ run, one would need a large client, with at least 16 x 2Ghz CPUs and about 10Gbps worth of network capability, in order to properly test one Sun Storage 7410 server. Otherwise, Bonnie++ is more likely to show client and network limits, not server ones. As for unlink capabilities, the topic is a pretty complex and important one that certainly cannot be captured with simple commands; the interaction with snapshots and the I/O load generated on the server during large unlink storms need to be studied carefully in order to understand the competitive merits of different solutions.

In summary, here is what governs the performance of the individual Bonnie++ tests:

Writing with putc()...
        87 MB/sec       Limited by client's single CPU speed
Writing intelligently...
       113 MB/sec       Limited by network conditions
Rewriting...
       100 MB/sec       Limited by network conditions
Reading with getc()...
        73 MB/sec       Limited by client's single CPU speed
Reading intelligently...
       118 MB/sec       Limited by network conditions
start 'em...done...done...done...
Create files in sequential order...
       744 creates/s    Limited by operational latency
Stat files in sequential order...
       not observable
Delete files in sequential order...
      1591 unlinks/s    Limited by operational latency
Create files in random order...
       same as sequential
Stat files in random order...
       same as sequential
Delete files in random order...
       same as sequential

So Bonnie++ won't tell you much about our server's capabilities. Unfortunately, the clustered mode of Bonnie++ won't coordinate multiple client systems, and so cannot be used to stress a server. Bonnie++ could be used to stress a NAS server using a single large multi-core client with very strong networking capabilities, but in the end I don't expect to learn much about our servers over and above what is already known. For that, please check out our links here: Low Level Performance of Sun Storage, Analyzing the Sun Storage 7000, Designing Performance Metrics..., Sun Storage 7xxx Performance Invariants. Here is the bonnie.d d-script used, and the output generated, bonnie.out.
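The per-operation figures in the summary can be rechecked from the raw counts quoted in the text (16K files means 16384, the 8GB working set means 1M 8K reads):

```python
# Rechecking the per-operation figures in the Bonnie++ summary from
# the raw counts quoted in the text.

def per_second(ops, seconds):
    return ops / seconds

# File creates: 16384 files in 22 s, 17.5 s of that waiting on the server.
creates = per_second(16 * 1024, 22)
print(round(creates))                    # ~745/s (744 quoted)
print(round(17.5 / (16 * 1024) * 1e6))   # ~1068 us of wait per create (~1 ms)

# Unlinks: 16384 files in 10.294 s.
print(round(per_second(16 * 1024, 10.294)))  # ~1592/s (1591 quoted)

# Sequential 8K reads: 1M calls collectively wait 64 s on the server.
reads, wait_s = 1_000_000, 64
print(round(wait_s / reads * 1e6))       # 64 us mean response time
```

Small rounding differences aside, the published per-thread rates fall straight out of the counts, which is why they say more about single-thread operational latency than about the server.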



Blogfest : Performance and the Hybrid Storage Pool

Today Sun is announcing a new line of Unified Storage designed by a core of the most brilliant engineers. For starters, Mike Shapiro provides a great introduction to this product, the new economics behind it, and the killer app in Sun Storage 7000. The killer app is of course Bryan Cantrill's brainchild, the already famous Analytics. As a performance engineer, it's been a great thrill to have given this tool an early test drive. Working a full ocean (the Atlantic) plus a continent (the USA) away from my system running Analytics, I was skeptical at first that I would be visualizing all that information in real time: the NFS/CIFS ops, the disk ops, the CPU load and network throughput, per client, per disk, per file. ARE YOU CRAZY! All that information available IN REAL TIME; I just have to say a big thank you to the team that made it possible. I can't wait to see our customers put this to productive use. Also check out Adam Leventhal's great description of the HSP, the Hybrid Storage Pool, and read my own perspective on this topic, ZFS as a Network Attach Storage Controller. Lest we forget the immense contribution of the boundless energy bubble that is Brendan Gregg, the man that brought the DTraceToolkit to the semi-geek; he must be jumping with excitement as we now see the power of DTrace delivered to each and every system administrator. He talks here about the Status Dashboard. And Brendan's contribution does not stop there: he is also the parent of that wonderful component of the HSP known as the L2ARC, which is how the readzillas become activated. See his own previous work on the L2ARC, along with Jing Zhang's more recent studies. Quality assurance people don't often get into the spotlight, but check out Tim Foster's post on how he tortured the zpool code, adding and removing L2ARC devices from pools. For myself, it's been very exciting to see performance improvement ideas get turned into product improvements from week to week.
Those interested should read how our group influenced the product that is shipping today; see Alan Chiu's and my own Delivering Performance Improvements. Such a product has a strong price/performance appeal, and given that we fundamentally did not think there were public benchmarks that captured our value proposition, we had to come up with third-millennium, participative ways to talk about performance. Check out how we designed our metrics, or maybe go straight to our numbers obtained by Amitabha Banerjee: a concise entry backed by an immense, intense and careful data gathering effort over the last few weeks. bmseer is putting his own light on the low-level data (data to be updated with numbers from a grander config). I've also posted here a few performance guiding lights to be used when thinking about this product; I call them Performance Invariants. Further numbers can be found here about RAID rebuild times. On the application side, we have the great work of Sean (Hsianglung Wu) and Arini Balakrishnan showing how a 7210 can deliver more than 5000 concurrent video streams at an aggregate of, you're kidding: WOW ZA, 750 MB/sec. More details on how this was achieved are in cdnperf. Jignesh Shah shows step-by-step instructions for setting up PostgreSQL over iSCSI. See our Vice President, Solaris Data, Availability, Scalability & HPC, Bob Porras, trying to tame this beast into a nutshell and pointing out code bits, reminding everyone of the value of the OpenStorage proposition. See also what bmseer has to say on Web 2.0 consolidation, and get from Marcus Heckel a walkthrough of setting up the Olio Web 2.0 kit, with nice Analytics performance screenshots. Also get the ISV reaction (a bit later) from Georg Edelmann. Ryan Pratt reports on the Windows Server 2003 WHQL certification of the Sun Storage 7000 line. And this just in: data about what to expect from a database perspective.
We can talk all we want about performance, but as Josh Simons points out, these babies are available to you for your own try and buy. Or check out how you could be running the appliance within the next hour, really: Sun Storage 7000 in VMware. It seems I am in competition with another, less verbose aggregator. Finally, capture the whole stream of related postings to Sun Storage 7000.



Designing Performance Metrics for Sun Storage 7000

One of the necessary checkpoints before launching a product is being able to assess its performance. With Sun Storage 7xxx we had a challenge in that the only NFS benchmark of any notoriety was SPEC SFS. Now, this benchmark has its supporters, and some customers might be attached to it, but it's important to understand what a benchmark actually says. The SFS benchmark is largely about "cache busting" the server: this is interesting, but at Sun we think that caches are actually helpful in real scenarios. Data goes in cycles in which it becomes hot at times. Retaining that data in cache layers allows much lower latency access, and much better human interaction with storage engines. Being a cache-busting benchmark, SFS numbers end up as a measure of the number of disk rotations attached to the NAS server. So a good SFS result requires hundreds or thousands of expensive, energy-hungry 15K RPM spindles. To get good IOPS, layers of caching are more important to the end-user experience and to the cost efficiency of the solution. So we needed another way to talk about performance. Benchmarks tend to test the system in peculiar ways that do not necessarily reflect the workloads each customer is actually facing. There are very many workload generators for I/O, but one interesting one that is open source and extensible is Filebench, available in source form. So we used Filebench to gather basic performance information about our system, with the hope that customers will then use Filebench to generate profiles that map to their own workloads. That way, different storage options can be tested on hopefully more meaningful tests than benchmarks. Another challenge is that a NAS server interacts with client systems that themselves keep a cache of the data. Given that we wanted to understand the back-end storage, we had to set up the tests to avoid client-side caching as much as possible.
So, for instance, between the phase of file creation and the phase of actually running the tests, we needed to clear the client caches, and at times the server caches as well. These possibilities are not readily accessible with the simplest load generators, and we had to do this in rather ad hoc fashion. One validation of our runs was to ensure that the amount of data transferred over the wire, observed with Analytics, was consistent with the aggregate throughput measured at the clients. Still another challenge was that we needed to test a storage system designed to interact with a large number of clients. Again, load generators are not readily set up to coordinate multiple clients and gather global metrics. During the course of the effort Filebench did come up with a clustered mode of operation, but we were too far along our path to take advantage of it. This coordination of clients is important because the performance information we want to report is the performance actually delivered to the clients. Each client will report its own value for a given test and our tool will sum up the numbers; but such a sum is only valid inasmuch as the tests ran on the clients in the same timeframe. The possibility of skew between tests is something that needs to be monitored by the person running the investigation. One way we increased this coordination was by dividing our tests into two categories: those that required precreated files, and those that created files during the timed portion of the runs. If not handled properly, file creation would cause significant result skew. The option we pursued was to have a pre-creation phase of files that was done once. From that point, our full set of metrics could be run and repeated many times with much less human monitoring, leading to better reproducibility of results. Another goal of this effort was to be able to run our standard set of metrics in a relatively short time.
Say, less than one hour. In the end we got that to about 30 minutes per run to gather 10 metrics. Having a short run time is important because there are lots of possible ways such tests can be misrun. Having someone watch over the runs is critical to the value of the output and to its reproducibility. So after having run the pre-creation of files offline, one could run many repeated instances of the tests, validating the runs with Analytics, and through general observation of the system gain some insight into the meaning of the output. At this point we were ready to define our metrics. Obviously we needed streaming reads and writes. We needed random reads. We needed small synchronous writes, important to database workloads and to the NFS protocol. Finally, small file creation and stat operations completed the mix. For random reading we also needed to distinguish between operating from disks and from storage-side caches, an important aspect of our architecture. Now, another thing that was on my mind was that this is not a benchmark. That means we would not be trying to fine-tune the metrics in order to find out just exactly what the optimal number of threads and request size is that leads to the best possible performance from the server. This is not the way your workload is set up. Your number of running client threads is not elastic at will. Your workload is what it is (threading included); the question is how fast it is being serviced. So we defined precise per-client workloads with a preset number of threads running the operations. We came up with this set just as an illustration of what could be representative loads:

1- 1 thread streaming reads from 20G uncached set, 30 sec.
2- 1 thread streaming reads from same set, 30 sec.
3- 20 threads streaming reads from 20G uncached set, 30 sec.
4- 10 threads streaming reads from same set, 30 sec.
5- 20 threads 8K random read from 20G uncached set, 30 sec.
6- 128 threads 8K random read from same set, 30 sec.
7- 1 thread streaming write, 120 sec.
8- 20 threads streaming write, 120 sec.
9- 128 threads 8K synchronous writes to 20G set, 120 sec.
10- 20 threads metadata (fstat) IOPS from pool of 400k files, 120 sec.
11- 8 threads 8K file create IOPS, 120 sec.

For each of the 11 metrics, we could propose a mapping to relevant industries:

1- Backups, database restoration (source), data mining, HPC
2- Financial/risk analysis, video editing, HPC
3- Media streaming, HPC
4- Video editing
5- DB consolidation, mail server, generic file serving, software development
6- DB consolidation, mail server, generic file serving, software development
7- User data restore (destination)
8- Financial/risk analysis, backup server
9- Database/OLTP
10- Web 2.0, mail server/mail store, software development
11- Web 2.0, mail server/mail store, software development

We managed to get all these tests running except the fstat test (test 10), due to a technicality in Filebench. Filebench insisted on creating the files up front, and this test required thousands of them; moreover, Filebench used a method that ended up single-threaded to do so, and in the end the stat information was mostly cached on the client. While we could have plowed through some of these issues, the conjunction of all of them made us put the fstat test aside for now. Concerning thread counts, we figured that the single-stream read test was at times critical (for administrative purposes) and an interesting measure of latency. Tests 1 and 2 were defined this way, with test 1 starting with cold client and server caches and test 2 continuing the runs after having cleared the client cache (but not the server), thus showing the boost from server-side caching. Tests 3 and 4 are similarly defined with more threads involved, for instance to mimic a media server. Tests 5 and 6 did random read tests, again with test 5 starting with a cold server cache and test 6 continuing with some of the data precached by test 5.
Here, we did have to deal with client caches, trying to ensure that we didn't hit the client cache too much as the run progressed. Tests 7 and 8 showcased streaming writes for single and 20 streams (per client). Reproducibility of tests 7 and 8 is more difficult, we believe, because of a client-side fsflush issue. We found that we could get more stable results by tuning fsflush on the clients. Test 9 is the all-important synchronous write case (for instance, a database). This test truly showcases the benefit of our write-side SSDs and also shows why tuning the recordsize to match ZFS records with DB accesses is important. Test 10 was inoperative as mentioned above, and test 11, file create, completes the set. Given that these were predefined test definitions, we're very happy to see that our numbers actually came out really well, particularly for the mirrored configs with write-optimized SSDs. See for instance the results obtained by Amitabha Banerjee. I should add that these can now be used to give ballpark estimates of the capability of the servers. They were not designed to deliver the topmost numbers from any one config. The variability of the runs is at times greater than we'd wish, and so your mileage will vary. Using Analytics to observe the running system can be quite informative and a nice way to actually demo that capability. So use the output with caution and use your own judgment when it comes to performance issues.
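To make the per-client workload definitions and the aggregation rule concrete, here is a minimal sketch in Python. The table transcribes the list above; the `aggregate` helper illustrates the skew check described in the text (sum per-client throughput only when the clients ran in the same timeframe). The function names, dictionary layout and 5% skew tolerance are illustrative assumptions, not part of the actual tooling.

```python
# Sketch of the per-client workload matrix described above.
# Each entry: (threads, operation, working set, timed duration in seconds).
# Test 10 (fstat) was set aside for the Filebench issues noted in the text.
WORKLOADS = {
    1:  (1,   "stream_read",   "20G uncached", 30),
    2:  (1,   "stream_read",   "same set",     30),
    3:  (20,  "stream_read",   "20G uncached", 30),
    4:  (10,  "stream_read",   "same set",     30),
    5:  (20,  "randread_8k",   "20G uncached", 30),
    6:  (128, "randread_8k",   "same set",     30),
    7:  (1,   "stream_write",  "-",            120),
    8:  (20,  "stream_write",  "-",            120),
    9:  (128, "syncwrite_8k",  "20G",          120),
    11: (8,   "filecreate_8k", "-",            120),
}

def aggregate(client_results, max_skew=0.05):
    """Sum per-client throughput, but only if the clients ran in the
    same timeframe: reject the run when start times diverge by more
    than a fraction of the shortest timed duration."""
    starts = [r["start"] for r in client_results]
    durations = [r["duration"] for r in client_results]
    skew = (max(starts) - min(starts)) / min(durations)
    if skew > max_skew:
        raise ValueError(f"client skew {skew:.1%} exceeds {max_skew:.0%}")
    return sum(r["throughput"] for r in client_results)
```

A run whose clients started half a second apart over a 30-second test would pass the check; one whose clients drifted by several seconds would be rejected, matching the "same timeframe" caveat above.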



Sun Storage 7000 Performance invariants

I see many reports about running campaigns of tests measuring performance over a test matrix. One problem with this approach is, of course, the matrix: it is never big enough for the consumer of the information ("can you run this instead?"). A more useful approach is to think in terms of performance invariants. We all know that a 7.2K RPM disk drive can do 150-200 IOPS as an invariant, and disks will have a throughput limit such as 80 MB/sec. Thinking in terms of those invariants helps in extrapolating performance data (with caution), and observing a breakdown in an invariant is often a sign that something else needs to be root-caused. So, using our 11 metrics and our performance engineering effort, what can our guiding invariants be? Bear in mind that these are expected to be rough estimates. For real measured numbers check out Amitabha Banerjee's excellent post on Analyzing the Sun Storage 7000. Streaming: 1 GB/s on the server and 110 MB/sec on the client. For read streaming, we're observing that 1 GB/s is somewhat our guiding number. This can be achieved with a fairly small number of clients and threads, but will be easier to reach if the data is prestaged in server caches. A client running a normal 1 GbE network card is able to extract 110 MB/sec rather easily. Read streaming will be easier to achieve with the larger 128K records, probably due to the lower CPU demand. While our results are with regular 1500-byte Ethernet frames, using jumbo frames will make this limit easier to reach or even break. For a mirrored pool, data needs to be sent twice to the storage, and we see a reduction of about 50% for write streaming workloads. Random read I/Os per second: 150 random read IOPS per mirrored disk. This is probably a good guiding light also. When going to disks, that is a reasonable expectation. But here caching can radically change this.
Since we can configure up to 128 GB of host RAM and four times that much secondary cache, there are opportunities to break this barrier. But when going to spindles, that limit needs to be kept in mind. We also know that RAID-Z spreads records to all disks, so the 150 IOPS limit basically applies per RAID-Z group. Do plan to have many groups to service random reads. Random read I/Os per second using SSDs: 3100 read IOPS per read-optimized SSD. In some instances, data evicted from main memory will be kept in the secondary caches. Small files and filesystems with a tuned recordsize are good target workloads for this. Those read-optimized SSDs can serve this data back at a rate of 3100 IOPS (the L2ARC). More importantly, they can do so at much reduced latency, meaning that lightly threaded workloads will be able to achieve high throughput. Synchronous writes per second: 5000-9000 synchronous writes per write-optimized SSD. Synchronous writes can be generated by an O_DSYNC write (database) or just as part of the NFS protocol (such as the tar extract: open, write, close workloads). Those will reach the NAS server and be coalesced in a single transaction with the separate intent log. Those SSD devices are great latency accelerators but are still devices with a max throughput of around 110 MB/sec. However, our code actually detects when the SSD devices become the bottleneck and will divert some of the I/O requests to the main storage pool. The net of all this is a complex equation, but we've easily observed 5000-8000 synchronous writes per SSD, up to 3 devices (or 6 in mirrored pairs). Using a smaller working set, which creates less competition for CPU resources, we've even observed 48K synchronous writes per second. Cycles per byte: 30-40 cycles per byte for NFS and CIFS. Once we include the full NFS or CIFS protocol, the efficiency was observed to be in the 30-40 cycles per byte range (8 to 10 of those coming from the pure network component at a regular 1500-byte MTU).
More studies are required to figure out the extent to which this is valid, but it's an interesting way to look at the problem. Having to run disk I/O, versus being serviced directly from cached data, is expected to cost an additional 10-20 cycles per byte. Obviously, for metadata tests in which only a small number of bytes is transferred per operation, we probably need to come up with a cycles/metaop invariant, but that is still TBD. Single-client NFS throughput: 1 TCP window per round-trip latency. This is one fundamental rule of network throughput, but it's a good occasion to refresh it in everyone's mind. Clients, at least Solaris clients, will establish a single TCP connection to a server. On that connection there can be a large number of unrelated requests, as NFS is a very scalable protocol. However, a single connection will transport data at a maximum speed of a "socket buffer" divided by the round-trip latency. Since today's network speeds, particularly in wide area networks, have grown somewhat faster than default socket buffers, we can see such things becoming a performance bottleneck. Now, given that I work in Europe but my test systems are often located in California, I might be a little more sensitive than most to this fact. So one important change we made early in this project was to simply bump up the default socket buffers in the 7000 line to 1 MB. For read throughput under similar conditions, we can only advise you to do the same to your client infrastructure.
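These invariants lend themselves to back-of-the-envelope arithmetic, so here is a small sketch that turns the rules of thumb above into calculators. The constants (150 IOPS per mirrored disk or RAID-Z group, 3100 IOPS per read-optimized SSD, 30-40 cycles per byte) are the rough estimates quoted in the text, not measurements, and the function names are mine.

```python
# Back-of-the-envelope calculators for the invariants above.
# All constants are rough rules of thumb from the text, not measurements.

def pool_random_read_iops(mirrored_disks=0, raidz_groups=0, read_ssds=0,
                          iops_per_disk=150, iops_per_ssd=3100):
    """Random-read IOPS estimate: ~150 per mirrored disk, ~150 per
    RAID-Z group (records span the group, so it behaves like one
    spindle), ~3100 per read-optimized SSD once the L2ARC is primed."""
    return (mirrored_disks * iops_per_disk
            + raidz_groups * iops_per_disk
            + read_ssds * iops_per_ssd)

def single_stream_throughput(socket_buffer_bytes, rtt_seconds):
    """A single TCP connection moves at most one socket buffer's worth
    of data per round trip."""
    return socket_buffer_bytes / rtt_seconds

def cpu_ghz_for_throughput(mb_per_sec, cycles_per_byte=35):
    """GHz of CPU needed at ~30-40 cycles/byte of full NFS/CIFS service."""
    return mb_per_sec * 1e6 * cycles_per_byte / 1e9
```

For example, the 1 MB default socket buffer at a 1 ms LAN round trip gives a ceiling around 1 GB/s per connection, while a 100 ms transatlantic round trip caps the same connection near 10 MB/s, which is exactly why the buffer bump matters for remote clients.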



Delivering Performance Improvements to Sun Storage 7000

I describe here the effort I spearheaded studying the performance characteristics of the OpenStorage platform, and the ways in which our team of engineers delivered real out-of-the-box improvements to the product that is shipping today. One of the joys of working on the OpenStorage NAS appliance was that solutions we found to performance issues could be immediately transposed into changes to the appliance without further process. The first big wins We initially stumbled on two major issues, one for NFS synchronous writes and one for the CIFS protocol in general. The NFS problem was a subtle one involving the distinction between O_SYNC and O_DSYNC writes in the ZFS intent log, and it was impacting our threaded synchronous write tests by up to a factor of 20. Fortunately, I had a history of studying that part of the code and could quickly identify the problem and suggest a fix. This was tracked as 6683293: concurrent O_DSYNC writes to a fileset can be much improved over NFS. The following week, turning to CIFS studies, we saw a great scalability limitation in the code. Here again I was fortunate to be the first one to hit this. The problem was that, to manage CIFS requests, the kernel code was using simple kernel allocations sized to accommodate the largest possible request. Such large allocations and deallocations cause what is known as a storm of TLB shootdown cross-calls, limiting scalability. Incredibly, though, after implementing the trivial fix, I found that the rest of the CIFS server was beautifully scalable code with no other barriers. So with one quick and simple fix (using kmem caches) I could demonstrate a great scalability improvement to CIFS. This was tracked as 6686647: smbsrv scalability impacted by memory. Since those two protocol problems were identified early on, I must say that no serious protocol performance problems have come up. While we can always find incremental improvements to any given test, our current implementation has held up to our testing so far.
In the next phase of the project, we did a lot of work on improving network efficiency at high data rates. In order to deliver the throughput the server is capable of, we must use 10 Gbps network interfaces, and the ones available on the NAS platforms are based on the Neptune networking interface running the nxge driver. Network Setup I collaborated on this with Alan Chiu, who already knew a lot about this network card and its driver tunables, so we could quickly hash out the issues. We had to decide on a proper out-of-the-box setup involving: how many MSI-X interrupts to use; whether or not to use networking soft rings; what bcopy threshold to use in the driver, as opposed to binding DMA; and whether or not to use the new Large Segment Offload (LSO) technique for transmits. We knew basically where we wanted to go here. We wanted many interrupts on the receive side so as not to overload any CPU, and to avoid the use of layered soft rings, which reduce efficiency. A low bcopy threshold, so that DMA binding would be used more frequently, as the default value was too high for this x64-based platform. And LSO was providing a nice boost to efficiency. That got us to a proper efficiency level. However, we noticed that under stress and a high number of connections our efficiency would drop by 2 or 3X. After much head scratching, we rooted this in the use of too many TX DMA channels. It turns out that with this driver and architecture, using a few channels leads to more stickiness in the scheduling and much, much greater efficiency. We settled on 2 TX rings as a good compromise. That got us to a level of 8-10 CPU cycles per byte transferred in network code (more on Performance Invariants). Interrupt Blanking Studying an open-source alternative controller, we also found that on 1 of 14 metrics we were slower. That was rooted in the interrupt blanking parameter that NICs use to gain efficiency.
What we found here was that by reducing our blanking to a small value we could leapfrog the competition (from 2X worse to 2X better) on this test while preserving our general network efficiency. We were then on par or better for every one of the 14 tests. Media Streaming When we ran thousands of 1 Mb/s media streams from our systems, we quickly found that the file-level software prefetching was hurting us. So we initially disabled that code in our lab to run our media studies, but at the end of the project we had to find an out-of-the-box setup that could preserve our media results without impairing maximum read streaming. At some point we realized that what we were hitting was 6469558: ZFS prefetch needs to be more aware of memory pressure. It turns out that the internals of the zfetch code are set up to manage 8 concurrent streams per file, and each can read ahead up to 256 blocks or records: in this case, 128K each. So when we realized that with thousands of streams we could read ahead ourselves out of memory, we knew what we needed to do. We settled on 2 streams per file, reading ahead up to 16 blocks, and that seems quite sufficient to retain our media-serving throughput while keeping some prefetching capability. I note here also that NFS client code will itself recognize streaming and issue its own readahead. The backend code is then reading ahead of the clients' readahead requests. So we were kind of getting ahead of ourselves here. Read more about it @ cdnperf. To slog or not to slog One of the innovative aspects of this OpenStorage server is the use of read- and write-optimized solid state devices; see for instance The Value of Solid State Devices. Those SSDs are beautiful devices designed to help latency, but not throughput. A massive commit is actually better handled by regular storage, not SSDs. It turns out that it was actually dead easy to instruct the ZIL to recognize massive commits and divert its block allocation strategy away from the SSDs toward the common pool of disks.
We see two benefits here: the massive commits will be sped up (preventing the SSD from becoming the bottleneck), but more importantly, the SSDs will now be available as low-latency devices to handle workloads that rely on low-latency synchronous operations. One should note here that the ZIL is a per-filesystem construct, and so while one filesystem might be working on a large commit, another filesystem from the same pool might still be running a series of small transactions and benefiting from the write-optimized SSDs. In a similar way, when we first tested the read-optimized SSDs, we quickly saw that streamed data would install itself in this caching layer and could slow down processing later. Again, the beauty of working on an appliance, and closely with developers, meant that by the following build those problems had been solved. Transaction Group Time ZFS operates by issuing regular transaction groups, in which modifications since the last transaction group are recorded on disk and the uberblock is updated. This used to be done at a 5-second interval, but with the recent improvements to the write throttling code this became a 30-second interval (on light workloads), which aims to generate no more than 5 seconds' worth of I/O per transaction group. Using 5 seconds of I/O per txg was meant to maximize the ratio of data to metadata in each txg, delivering more application throughput. Now, these Storage 7000 servers will typically have lots of I/O capability on the storage side, and the data/metadata ratio is not as much of a concern as for a small JBOD setup. What we found was that we could reduce the target from 5 seconds of I/O down to 1 while still preserving good throughput. Having this smaller value smoothed out operations. IT JUST WORKS Well, that is certainly the goal. In my group, we spent the last year performance-testing these OpenStorage systems: finding and fixing bugs, suggesting code improvements, and looking for better compromises for common tunables.
At this point, we're happy with the state of the systems, particularly for mirrored configurations with write-optimized SSD accelerators. Our code is based on a recent OpenSolaris (from August) that already has a lot of improvements over Solaris 10, particularly for ZFS, to which we've added specific improvements relevant to NAS storage. We think these systems will at times deliver great performance (see Amitabha's results) and will almost always shine in the price/performance categories.
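The zfetch arithmetic behind the Media Streaming fix above is worth making concrete: with the quoted defaults (8 streams per file, up to 256 blocks of readahead, 128K records), a thousand open media files can commit an enormous amount of memory to prefetch. A small sketch of the worst-case footprint, using only the numbers quoted in the text (the function name is mine):

```python
# Worst-case file-level prefetch footprint, from the zfetch numbers
# quoted above. Defaults: 8 streams per file, up to 256 blocks of
# readahead per stream; the tuned appliance setting: 2 streams,
# 16 blocks; records of 128K in this workload.

def prefetch_footprint_gb(open_files, streams_per_file, blocks_ahead,
                          recordsize=128 * 1024):
    """Upper bound, in GiB, on memory the prefetcher could consume
    if every stream of every open file read fully ahead."""
    return open_files * streams_per_file * blocks_ahead * recordsize / 2**30

# With 1000 media streams the default can outrun memory:
default_gb = prefetch_footprint_gb(1000, 8, 256)   # 250.0 GiB worst case
tuned_gb   = prefetch_footprint_gb(1000, 2, 16)    # ~3.9 GiB worst case
```

Against a 128 GB controller, the default worst case is roughly twice the installed RAM, while the tuned setting stays comfortably small, which matches the "read ahead ourselves out of memory" observation above.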



Using ZFS as a Network Attached Storage Controller and the Value of Solid State Devices

So Sun is coming out today with a line of Sun Storage 7000 systems that have ZFS as the integrated volume and filesystem manager, using both read- and write-optimized SSDs. What is this Hybrid Storage Pool, and why is it a good performance architecture for storage? A write-optimized SSD is a custom-designed device for the purpose of accelerating operations of the ZFS intent log (ZIL). The ZIL is the part of ZFS that manages the important synchronous operations, guaranteeing that such writes are acknowledged quickly to applications while guaranteeing persistence in case of an outage. Data stored in the ZIL is also kept in memory until ZFS issues the next transaction group (every few seconds). The ZIL is what stores data urgently (when an application is waiting), but the TXG is what stores data permanently. The ZIL's on-disk blocks are only ever re-read after a failure such as a power outage. So the SSDs that are used to accelerate the ZIL are write-optimized: they need to handle data at low latency on writes; reads are unimportant. The TXG is an operation that is asynchronous to applications: apps are generally not waiting for transaction groups to commit. The exception here is when data is generated at a rate that exceeds the TXG rate for a sustained period of time. In that case, we become throttled by the pool throughput. In NAS storage this will rarely happen, since network connectivity, even at GB/s, is still much less than what the storage is capable of, and so we do not generate the imbalance. The important thing now is that in a NAS server, the controller is also running a file-level protocol (NFS or CIFS) and so is knowledgeable about the nature (synchronous or not) of the requested writes. As such, it can use the accelerated path (the SSDs) only for the necessary component of the workload. Less competition for these devices means we can deliver both high throughput and low latency together in the same consolidated server. But here is where it gets nifty.
At times, a NAS server might receive a huge synchronous request. We've observed this, for instance, due to fsflush running on clients, which will turn non-synchronous writes into one massive synchronous one. I note here that a way to reduce this effect is to tune up fsflush (to, say, 600). This is commonly done to reduce the CPU usage of fsflush, but will be welcome in the case of clients interacting with NAS storage. We can also disable page flushing entirely by setting dopageflush to 0. But that is a client issue. From the server's perspective, as a NAS we still need to manage large commit requests. When subject to such a workload, say a 1 GB commit, ZFS, being fully aware of the situation, can now decide to bypass the SSD devices and issue requests straight to the disk-based pool blocks. It does so for two reasons. One is that the pool of disks in its entirety has more throughput capability than the few write-optimized SSDs, and so we will service the request faster. But more importantly, the value of the SSDs is in their latency-reduction aspect. Leaving the SSDs available to service many low-latency synchronous writes is considered valuable here. Another way to say this is that large writes are generally well served by regular disk operations (they are throughput-bound), whereas small synchronous writes (latency-bound) can and will get help from the SSDs. Caches at work On the read path we also have custom-designed read-optimized SSDs to fit in these OpenStorage platforms. At Sun, we just believe that many workloads will naturally lend themselves to caching technologies. In a consolidated storage solution, we can offer up to 128 GB of primary memory-based caching and approximately 500 GB of SSD-based caching. We also recognized that the latency delta between a memory-cached response and a disk response was just too steep.
By inserting a layer of SSDs between memory and disk, we have an intermediate step providing lower-latency access than disk to a working set that is now many times greater than memory. It's important here to understand how and when these read-optimized SSDs will work. The first thing to recognize is that the SSDs have to be primed with data. They feed off data being evicted from the primary caches, so their effect will not be seen immediately at the start of a benchmark. Second, one of the values of read-optimized SSDs is truly in low-latency responses to small requests. Small requests here means things on the order of 8K in size. Such requests will occur either when dealing with small files (~8K) or when dealing with larger files accessed by a fixed-record application, typically a database. For those applications it is customary to set the recordsize, and this will allow these new SSDs to become more effective. Our read-optimized SSDs can service up to 3000 read IOPS (see Brendan's work on the L2ARC), and this is close to or better than what a 24 x 7.2K RPM disk JBOD can do. But the key point is that the low-latency response means the SSD can do so using much fewer threads than would be necessary to reach the same level on a JBOD. Brendan demonstrated here that the response time of these devices can be 20 times faster than disks, and 8 to 10 times faster from the client's perspective. So once data is installed in the SSDs, users will see their requests serviced much faster, which means we are less likely to be subject to queuing delays. The use of read-optimized SSDs is configurable in the appliance. Users should learn to identify the parts of their datasets that end up gated by lightly threaded read response time. For those workloads, enabling the secondary cache is one way to deliver the value of the read-optimized SSDs.
For those filesystems, if the workload contains small files (such as 8K) there is no need to tune anything; however, for large files accessed in small chunks, setting the filesystem recordsize to 8K is likely to produce the best response time. Another benefit of these SSDs will be in the $/IOPS case. Some workloads are just IOPS-hungry while not necessarily huge block consumers. SSD technology offers great advantages in this space, where a single SSD can deliver the IOPS of a full JBOD at a fraction of the cost. So with workloads that are more modestly sized but IOPS-hungry, a test drive of the SSDs will be very interesting. It's also important to recognize that these systems are used in consolidation scenarios. It can be that some parts of the applications will be sped up by read- or write-optimized SSDs, or by the large memory-based caches, while other consolidated workloads exercise other components. There is another interesting implication of using SSDs in the storage with regard to clustering. The read-optimized SSDs, acting as caching layers, never actually contain critical data. This means those SSDs can go into disk slots of the head nodes, since there is no data to be failed over. On the other hand, the write-optimized SSDs will store data associated with the critical synchronous writes. But since those are located in dual-ported backend enclosures, not the head nodes, it implies that, during clustered operations, storage head nodes do not have to exchange any user-level data. So by using ZFS and read- and write-optimized SSDs, we can deliver low-latency writes for applications that rely on them, and good throughput for synchronous and non-synchronous cases, using cost-effective SATA drives. Similarly, on the read side, the large amount of primary and secondary cache enables delivering high IOPS at low latency (even if the workload is not highly threaded), and it can do so using the more cost- and energy-efficient SATA drives.
Our architecture allows us to take advantage of the latency accelerators while never being gated by them.
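The large-commit behavior described above (small synchronous writes go to the write-optimized SSD, huge commits bypass it toward the pool disks) amounts to a threshold decision. Here is a deliberately simplified sketch of that policy; the threshold name and its 1 MB value are made-up illustrations, not the actual ZFS tunables or code.

```python
# Illustrative sketch of the log-device decision described above:
# small synchronous writes go to the write-optimized SSD (lowest
# latency acknowledgment), while large commits go straight to the
# pool disks (more aggregate throughput, and the SSD stays free for
# latency-sensitive work). The threshold name and value are
# hypothetical, for illustration only.

LARGE_COMMIT_THRESHOLD = 1 << 20   # hypothetical: 1 MiB

def pick_log_target(commit_bytes, has_slog):
    """Choose where a synchronous commit's log blocks are allocated."""
    if not has_slog or commit_bytes > LARGE_COMMIT_THRESHOLD:
        return "pool"   # throughput-bound: the whole disk pool is faster
    return "slog"       # latency-bound: the SSD acknowledges fastest

# An 8K database write lands on the slog; a 1 GB fsflush-style commit
# is diverted to the pool, leaving the SSD free for small writes.
```

The point of the sketch is the asymmetry: the decision is per commit, so one filesystem's giant commit never blocks another filesystem's stream of small, latency-sensitive transactions on the same pool.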



People ask: where are we with ZFS performance?

The standard answer to any computer performance question is almost always "it depends", which is semantically equivalent to "I don't know". The better answer is to state the dependencies. I would certainly like to see every performance issue studied with a scientific approach. OpenSolaris and DTrace are just incredible enablers when trying to reach root cause, and finding those causes is really the best way to work toward delivering improved performance. More generally though, people use common wisdom or possibly faulty assumptions to match their symptoms with those of other similar reported problems. And, as human nature has it, we'll easily blame the component we're least familiar with for problems. So we often end up with a lot of reports of ZFS performance problems that, once drilled down, turn out to be totally unrelated to ZFS (say, HW problems), or misconfigurations, departures from Best Practices or, at times, unrealistic expectations. That does not mean there are no issues. But it's important that users can more easily identify known issues, schedule fixes, workarounds etc. So anyone deploying ZFS should really be familiar with these 2 sites: the ZFS Best Practices and Evil Tuning guides. That said, what are the real, commonly encountered performance problems I've seen, and where do we stand?

Writes overrunning memory

That is a real problem that was fixed last March and is integrated in the Solaris U6 release. Running out of memory causes many different types of complaints and erratic system behavior. This can happen anytime a lot of data is created and streamed at a rate greater than that which can be synced into the pool. Solaris U6 will be an important shift for customers running into this issue. ZFS will still try to use memory to cache your data (a good thing) but the competition this creates for memory resources will be much reduced. The way ZFS is designed to deal with this contention (ARC shrinking) will need a new evaluation from the community.
The lack of throttling was a great impairment to the ability of the ARC to give back memory under pressure. In the meantime lots of people are capping their ARC size with success, as per the Evil Tuning guide. For more on this topic check out: The new ZFS write throttle.

Cache flushes on SAN storage

This is a common issue we hit in the enterprise. Although it will cause ZFS to be totally underwhelming in terms of performance, it's interestingly not a sign of any defect in ZFS. Sadly this touches the customers that are the most performance minded. The issue is somewhat related to ZFS and somewhat to the storage. As is well documented elsewhere, ZFS will, at critical times, issue "cache flush" requests to the storage elements on which it is layered. This is to take into account the fact that storage can be layered on top of _volatile_ caches that do need to be set on stable storage for ZFS to reach its consistency points. Enterprise storage arrays do not use _volatile_ caches to store data and so should ignore the request from ZFS to "flush caches". The problem is that some arrays don't. This misunderstanding between ZFS and storage arrays leads to underwhelming performance. Fortunately we have an easy workaround that can be used to quickly identify if this is indeed the problem: setting zfs_nocacheflush (see the Evil Tuning guide). The best fix is to configure the storage with the setting to indeed ignore "cache flush". We also have the option of tuning sd.conf on a per array basis. Refer again to the Evil Tuning guide for more detailed information.

NFS slow over ZFS (Not True)

This is just not generally true and often a side effect of the previous cache flush problem. People have used storage arrays to accelerate NFS for a long time but failed to see the expected gains with ZFS. Many sightings of NFS problems are traced to this. Other sightings involve common disks with volatile caches.
Here the performance deltas observed are rooted in the stronger semantics that ZFS offers to this operational model. See NFS and ZFS for a more detailed description of the issue. While I don't consider ZFS as generally slow serving NFS, we did identify in recent months a condition that affects high thread counts of synchronous writes (such as a DB). This issue is fixed in Solaris 10 Update 6 (CR 6683293). I would encourage you to become familiar with where we stand regarding ZFS and NFS, because I know of no big gaping ZFS over NFS problems (if there were one, I think I would know). People just need to be aware that NFS is a protocol that needs some type of acceleration (such as NVRAM) in order to deliver a user experience close to what a direct attached filesystem provides.

ZIL is a problem (Not True)

There is a wide perception that the ZIL is the source of performance problems. This is just a naive interpretation of the facts. The ZIL serves a very fundamental purpose in the filesystem and does so admirably well. Disabling the synchronous semantics of a filesystem will necessarily lead to higher performance in a way that is totally misleading to the outside observer. So while we are looking at further ZIL improvements for large scale problems, the ZIL is just not today the source of common problems. So please don't disable it unless you know what you're getting into.

Random reads from RAID-Z

RAID-Z is a great technology that allows storing blocks on top of common JBOD storage without being subject to the raid-5 write hole corruption (see: http://blogs.sun.com/bonwick/entry/raid_z). However the performance characteristics of RAID-Z depart significantly enough from raid-5 to surprise first time users. RAID-Z as currently implemented spreads blocks across the full width of the raid group and so creates extra IOPS during random reads. At lower loads, the latency of operations is not impacted, but sustained random read loads can suffer.
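The random read penalty can be captured with a tiny model. Because RAID-Z spreads each block across the full width of the group, every uncached random read occupies every data disk, so the group sustains roughly the IOPS of a single spindle, whereas a raid-5 group serving small reads from individual disks scales with its width. This is only a sketch, with an assumed 200 IOPS per spindle:

```python
# Model the sustained random-read IOPS of a RAID-Z group vs a raid-5 group.
# Assumes small uncached reads and 200 IOPS per spindle (illustrative figure).

DISK_IOPS = 200

def raidz_random_read_iops(ndisks):
    # Every block spans the whole group: all spindles seek for each read,
    # so the group as a whole sustains about one disk's worth of IOPS.
    return DISK_IOPS

def raid5_random_read_iops(ndisks):
    # A small read lands on a single spindle, so concurrent reads spread
    # across the group and aggregate IOPS scale with its width.
    return ndisks * DISK_IOPS

print(raidz_random_read_iops(6))  # 200: one disk's worth, regardless of width
print(raid5_random_read_iops(6))  # 1200: scales with the 6 spindles
```

The gap only shows up for uncached reads, which is why workloads with good cache hit ratios are largely unaffected, as the next paragraph notes.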
However, workloads that end up with frequent cache hits will not be subject to the same penalty as workloads that access vast amounts of data more uniformly. This is where one truly needs to say "it depends". Interestingly, the same problem does not affect RAID-Z streaming performance and won't affect workloads that commonly benefit from caching. That said, both random and streaming performance are perfectible and we are looking at a number of different ways to improve on this situation. To better understand RAID-Z, see one of my very first ZFS entries on this topic: Raid-Z

CPU consumption, scalability and benchmarking

This is an area where we will need to make more studies. With today's very capable multicore systems, there are many workloads that won't suffer from the CPU consumption of ZFS. Most systems do not run 100% CPU bound (being more generally constrained by disks, networks or application scalability) and the user visible latency of operations is not strongly impacted by extra cycles spent in, say, the ZFS checksumming. However, this view breaks down when it comes to system benchmarking. Many benchmarks I encounter (the most crafted ones to boot) end up as host CPU efficiency benchmarks: how many operations can I do on this system, given large amounts of disk and network resources, while preserving some level X of response time? The answer to this question is purely the inverse of the cycles spent per operation. This concern is more relevant when the CPU cycles spent in managing direct attached storage and the filesystem are in direct competition with cycles spent in the application. This is also why database benchmarking is often associated with using raw devices, a practice much less encountered in common deployments. Root causing scalability limits and efficiency problems is just part of the never ending performance optimisation of filesystems.

Direct I/O

Directio has been a great enabler of database performance in other filesystems.
The problem for me is that Direct I/O is a group of improvements, each with its own contribution to the end result. Some want the concurrent writes, some want to avoid a copy, some want to avoid double caching, some don't know but see performance gains when it's turned on (some also see a degradation). I note that concurrent writes have never been a problem in ZFS, and that the extra copy used when managing a cache is generally cheap considering common DB rates of access. Achieving greater CPU efficiency is certainly a valid goal and we need to look into what is impacting this in common DB workloads. In the meantime, ZFS in OpenSolaris got a new feature to manage the cacheability of data in the ZFS ARC. The per filesystem "primarycache" property will allow users to decide if blocks should actually linger in the ARC cache or just be transient. This will allow DBs deployed on ZFS to avoid any form of double caching that might have occurred in the past. ZFS performance is and will be a moving target for some time in the future. Solaris 10 Update 6, with a new write throttle, will be a significant change, and then OpenSolaris offers additional advantages. But generally, just be skeptical of any performance issue that is not root caused: the problem might not be where you expect it.



The new ZFS write throttle

A very significant improvement is coming soon to ZFS: a change that will increase the general quality of service delivered by ZFS. Interestingly, it's a change that might also slow down your microbenchmark, but nevertheless it's a change you should be eager for.

Write throttling

For a filesystem, write throttling designates the act of blocking applications for some amount of time, as short as possible, waiting for the proper conditions to allow the write system calls to succeed. Write throttling is normally required because applications can write to memory (dirty memory pages) at a rate significantly faster than the kernel can flush the data to disk. Many workloads dirty memory pages by writing to the filesystem page cache at near memory copy speed, possibly using multiple threads issuing high rates of filesystem writes. Concurrently, the filesystem is doing its best to drain all that data to the disk subsystem. Given the constraints, the time to empty the filesystem cache to disk can be longer than the time required for applications to dirty the cache. Even if one considers storage with fast NVRAM, under sustained load that NVRAM will fill up to a point where it needs to wait for a slow disk I/O to make room for more data to get in. When committing data to a filesystem in bursts, it can be quite desirable to push the data at memory speed and then drain the cache to disk during the lapses between bursts. But when data is generated at a sustained high rate, lack of throttling leads to total memory depletion. We thus need at some point to try and match the application data rate with that of the I/O subsystem. This is the primary goal of write throttling. A secondary goal of write throttling is to prevent massive data loss. When applications do not manage I/O synchronization (i.e. don't use O_DSYNC and fsync), data ends up cached in the filesystem and the contract is that there is no guarantee that the data will still be there if a system crash were to occur.
So even if the filesystem cannot be blamed for such data loss, it is still a nice feature to help prevent such massive losses.

Case in point: UFS write throttling

For instance, UFS would use the fsflush daemon to try to keep data exposed for no more than 30 seconds (the default value of autoup). Also, UFS would keep track of the amount of I/O outstanding for each file. Once too much I/O was pending, UFS would throttle writers for that file. This was controlled through ufs_HW and ufs_LW, and their values were commonly tuned (a bad sign). Eventually the old default values were updated and seem to work nicely today. UFS write throttling thus operates on a per file basis. While there are some merits to this approach, it can be defeated as it does not manage the imbalance between memory and disks at a system level.

ZFS previous write throttling

ZFS is designed around the concept of transaction groups (txg). Normally, every 5 seconds an _open_ txg goes to the quiesced state. From that state the quiesced txg will go to the syncing state, which sends dirty data to the I/O subsystem. For each pool, there is at most 1 txg in each of the 3 states: open, quiescing, syncing. Write throttling used to occur when the 5 second txg clock would fire while the syncing txg had not yet completed. The open group would wait on the quiesced one, which waits on the syncing one. Application writers (write system call) would block, possibly a few seconds, waiting for a txg to open. In other words, if a txg took more than 5 seconds to sync to disk, we would globally block writers, thus matching their speed with that of the I/O. But if a workload had a bursty write behavior that could be synced during the allotted 5 seconds, applications would never be throttled.

The issue

But ZFS did not sufficiently control the amount of data that could get into an open txg. As long as the ARC cache was no more than half dirty, ZFS would accept data.
For a large memory machine or one with weak storage, this was likely to cause long txg sync times. The downsides were many:
- if we did end up throttled, long sync times meant the system behavior would be sluggish for seconds at a time.
- long txg sync times also meant that the granularity at which we could generate snapshots would be impacted.
- we ended up with lots of pending data in the cache, all of which could be lost in the event of a crash.
- the ZFS I/O scheduler, which prioritizes operations, was also negatively impacted.
- by not throttling, we had the possibility that sequential writes on large files could displace from the ARC a very large number of smaller objects. Refilling that data meant a very large number of disk I/Os. Not throttling can paradoxically end up as very costly for performance.
- the previous code also could, at times, not issue I/Os to disk for seconds even though the workload was critically dependent on storage speed.
- and foremost, lack of throttling depleted memory and prevented ZFS from reacting to memory pressure.
That ZFS is considered a memory hog is most likely the result of the previous throttling code. Once a proper solution is in place, it will be interesting to see if we behave better on that front.

The solution

The new code keeps track of the amount of data accepted in a TXG and the time it takes to sync. It dynamically adjusts that amount so that each TXG sync takes about 5 seconds (txg_time variable). It also clamps the limit to no more than 1/8th of physical memory. And to avoid the system wide and seconds long throttle effect, the new code will detect when we are dangerously close to that situation (7/8th of the limit) and will insert 1 tick delays for applications issuing writes. This prevents a write intensive thread from hogging the available space, starving out other threads. This delay should also generally prevent the system wide throttle.
So the new steady state behavior of write intensive workloads is that, starting with an empty TXG, all threads will be allowed to dirty memory at full speed until a first threshold of bytes in the TXG is reached. At that time, every write system call will be delayed by 1 tick, thus significantly slowing down the pace of writes. If the previous TXG completes its I/Os, then the current TXG will be allowed to resume at full speed. But in the unlikely event that a workload, despite the per write 1-tick delay, manages to fill up the TXG to the full threshold, we will be forced to throttle all writes in order to allow the storage to catch up. This should make the system much better behaved and generally more performant under sustained write stress. If you are the owner of an unlucky workload that ends up slowed down by more throttling, do consider the other benefits that you get from the new code. If that does not compensate for the loss, get in touch and tell us what your needs are on that front.
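The thresholds described above can be summarized as a small decision function. This is only a conceptual sketch of the behavior (the 1/8th-of-memory clamp, the 7/8th delay point, blocking at the limit), not the actual kernel code:

```python
# Conceptual sketch of the new ZFS write throttle, per the description above:
# a txg accepts dirty data up to a limit (clamped at physmem/8); past 7/8 of
# that limit each write is delayed by one tick; at the limit, writes block.

def txg_limit(target_bytes, physmem_bytes):
    """Bytes a txg may accept: the dynamic target, clamped to 1/8 of RAM."""
    return min(target_bytes, physmem_bytes // 8)

def write_action(dirty_bytes, limit_bytes):
    """What happens to an incoming write at this txg fill level."""
    if dirty_bytes >= limit_bytes:
        return "block"            # storage must catch up: global throttle
    if dirty_bytes >= limit_bytes * 7 // 8:
        return "delay-1-tick"     # slow writers before we get there
    return "full-speed"

limit = txg_limit(10 << 30, 32 << 30)       # 4 GB: the RAM clamp wins over 10 GB
print(write_action(1 << 30, limit))         # full-speed
print(write_action(limit * 7 // 8, limit))  # delay-1-tick
print(write_action(limit, limit))           # block
```

The 1-tick delay zone is what usually keeps the system out of the hard "block" state: writers are slowed early enough that the syncing txg normally finishes before the limit is hit.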



NFS and ZFS, a fine combination

No doubt there is still a lot to learn about ZFS as an NFS server, and this post will not delve deeply into that topic. What I'd like to dispel here is the notion that ZFS can cause some NFS workloads to exhibit pathological performance characteristics.

The sightings

Since there have been a few perceived 'sightings' of such slowdowns, a little clarification is in order. Large slowdowns would typically be reported when looking at a single threaded load, probably doing small file creation such as 'tar xf many_small_files.tar'. For instance, I've run one such small test over a 72G SAS drive:

tmpfs : 0.077 sec
ufs : 0.25 sec
zfs : 0.12 sec
nfs/zfs : 12 sec

There are a few things to observe here. Local filesystem services have a huge advantage for this type of load: in the absence of a specific request by the application (e.g. tar), local filesystems can lose your data and no one will complain. This is data loss, not data corruption, and this generally accepted data loss will occur in the event of a system crash. The argument is that if you need a higher level of integrity, you need to program it into applications using O_DSYNC, fsync etc. Many applications are not that critical and avoid such a burden.

NFS and COMMIT

On the other hand, the nature of the NFS protocol is such that the client _must_ at some specific point request that the server place previously sent data onto stable storage. This is done through an NFSv3 or NFSv4 COMMIT operation. The COMMIT operation is a contract between clients and servers that allows the client to forget about its previous historical interaction with the file. In the event of a server crash/reboot, the client is guaranteed that previously committed data will be returned by the server. Operations since the last COMMIT can be replayed after a server crash in a way that ensures a coherent view between everybody involved. But this all topples over if the COMMIT contract is not honored.
If a local filesystem does not properly commit data when requested to do so, there is no longer any guarantee that the client's view of files will be what it would otherwise normally expect. Despite the fact that the client has completed the 'tar x' with no errors, it can happen that some of the files are missing in full or in part. With local filesystems, a system crash is plainly obvious to users and requires applications to be restarted. With NFS, a server crash is not obvious to users of the service (the only sign being a lengthy pause), and applications are not notified. The fact that files or parts of files may go missing in the absence of errors can be considered plain corruption of the client's side view. When the underlying filesystem serving NFS ignores COMMIT requests, or when the storage subsystem acknowledges I/Os before they reach stable storage, what is potential data loss on the server becomes corruption of the client's point of view. It turns out that in NFSv3/NFSv4 the client will request a COMMIT on close; moreover, the NFS server itself is required to commit on metadata operations; for NFSv3 that is on SETATTR, CREATE, MKDIR, SYMLINK, MKNOD, REMOVE, RMDIR, RENAME and LINK, and a COMMIT may be required on the containing directory.

Expected performance

Let's imagine we find a way to run our load at 1 COMMIT (on close) per extracted file. The COMMIT means the client must wait for at least a full I/O latency, and since 'tar x' processes the tar file from a single thread, that implies that we can run our workload at the maximum rate (assuming infinitely fast networking) of one extracted file per I/O latency, or about 200 extracted files per second (on modern disks). If the files to be extracted are 1K in average size, the tar x will proceed at a pace of 200K/sec. If we are required to issue 2 COMMIT operations per extracted file (for instance due to a server-side COMMIT on file create), that would further halve that throughput number.
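The single-thread ceiling works out directly from the COMMIT latency. A small sketch, assuming a 5 ms server-side I/O latency (the figure behind "about 200 extracted files per second"):

```python
# Upper bound on single-threaded 'tar x' over NFS: each extracted file costs
# one or more COMMITs, and each COMMIT waits a full server-side I/O latency.
# The 5 ms latency used below is an assumed figure for a modern disk.

def max_files_per_sec(io_latency_sec, commits_per_file=1):
    """Best case: the single thread waits commits_per_file latencies per file."""
    return 1.0 / (io_latency_sec * commits_per_file)

def throughput_bytes_per_sec(avg_file_bytes, io_latency_sec, commits_per_file=1):
    """Data rate implied by the per-file COMMIT cost."""
    return avg_file_bytes * max_files_per_sec(io_latency_sec, commits_per_file)

print(max_files_per_sec(0.005))                      # 200 files/sec
print(throughput_bytes_per_sec(1024, 0.005))         # ~200 KB/sec for 1K files
print(max_files_per_sec(0.005, commits_per_file=2))  # 100: two COMMITs halve it
```

Note that nothing in this bound involves the server-side filesystem: it is purely a consequence of a single thread synchronously waiting on COMMIT.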
However, if we had lots of threads extracting individual files concurrently, the performance would scale up nicely with the number of threads. But tar is single threaded, so what is actually going on here? The need to COMMIT frequently means that our thread must frequently pause for a full server side I/O latency. Because our single threaded tar is blocked, nothing is able to process the rest of our workload. If we allow the server to ignore COMMIT operations, then NFS responses will be sent earlier, allowing the single thread to proceed down the tar file at greater speed. One must realise that the extra performance is obtained at the risk of causing corruption from the client's point of view in the event of a crash. Whether or not the client or the server needs to COMMIT as often as it does is a separate issue. The existence of other clients that would be accessing the files needs to be considered in that discussion. The point being made here is that this issue is not particular to ZFS, nor does ZFS necessarily exacerbate the problem. The performance of single threaded writes to NFS will be throttled as a result of the NFS-imposed COMMIT semantics.

ZFS relevant controls

ZFS has two controls that come into this picture: the disk write caches and the zil_disable tunable. ZFS is designed to work correctly whether or not the disk write caches are enabled. This is achieved through explicit cache flush requests, which are generated (for example) in response to an NFS COMMIT. Enabling the write caches is then a performance consideration, and can offer performance gains for some workloads. This is not the same with UFS, which is not aware of the existence of a disk write cache and is not designed to operate with such a cache enabled. Running UFS on a disk with write cache enabled can lead to corruption of the client's view in the event of a system crash. ZFS also has the zil_disable control. ZFS is not designed to operate with zil_disable set to 1.
Setting this variable (before mounting a ZFS filesystem) means that O_DSYNC writes, fsync as well as NFS COMMIT operations are all ignored! We note that, even without a ZIL, ZFS will always maintain a coherent local view of the on-disk state. But by ignoring NFS COMMIT operations, it will cause the client's view to become corrupted (as defined above).

Comparison with UFS

In the original complaint, there was no comparison between a semantically correct NFS service delivered by ZFS and a similar NFS service delivered by another filesystem. Let's gather some more data:

Local and memory based filesystems:
tmpfs : 0.077 sec
ufs : 0.25 sec
zfs : 0.12 sec

NFS service with risk of corruption of client's side view:
nfs/ufs : 7 sec (write cache enabled)
nfs/zfs : 4.2 sec (write cache enabled, zil_disable=1)
nfs/zfs : 4.7 sec (write cache disabled, zil_disable=1)

Semantically correct NFS service:
nfs/ufs : 17 sec (write cache disabled)
nfs/zfs : 12 sec (write cache disabled, zil_disable=0)
nfs/zfs : 7 sec (write cache enabled, zil_disable=0)

We note that with most filesystems we can easily produce an improper NFS service by enabling the disk write caches. In this case, a server-side filesystem may think it has committed data to stable storage, but the presence of an enabled disk write cache causes this assumption to be false. With ZFS, enabling the write caches is not sufficient to produce an improper service. Disabling the ZIL (setting zil_disable to 1 using mdb and then mounting the filesystem) is one way to generate an improper NFS service. With the ZIL disabled, commit requests are ignored, with potential corruption of the client's view.

Intelligent storage

A different topic is running ZFS on intelligent storage arrays. One known pathology is that some arrays will _honor_ the ZFS request to flush the write caches despite the fact that their caches are qualified as stable storage. In this case, NFS performance will be much worse than otherwise expected.
On this topic and ways to work around this specific issue, see Jason's .Plan: Shenanigans with ZFS.

Conclusion

In many common circumstances, ZFS offers a fine NFS service that complies with all NFS semantics, even with write caches enabled. If another filesystem appears much faster, I suggest first making sure that this other filesystem complies in the same way. This is not to say that ZFS performance cannot be perfected, as clearly it can. The performance of ZFS is still evolving quite rapidly. In many situations, ZFS provides the highest throughput of any filesystem. In others, ZFS performance is highly competitive with other filesystems. In some cases, ZFS can be slower than other filesystems -- while in all cases providing end-to-end data integrity, ease of use and integrated services such as compression, snapshots etc.

See also: Eric's fine entry on zil_disable




ZFS and Databases

Given that we started to have enough understanding of the internal dynamics of ZFS, I figured it was time to tackle the next hurdle: running a database management system (DBMS). Now I know very little myself about DBMSes, so I teamed up with people that have tons of experience with them, my colleagues from Performance Engineering (PAE), Neelakanth (Neel) Nadgir and Sriram Gummuluru, getting occasional words of wisdom from Jim Mauro as well. Note that UFS (with DIO) has been heavily tuned over the years to provide very good support for DBMS. We are just beginning to explore the tweaks and tunings necessary to achieve comparable performance from ZFS in this specialized domain. We knew that running a DBMS would be a challenge, since a database tickles filesystems in ways that are quite different from other types of loads. We had 2 goals. Primarily, we needed to understand how ZFS performs in a DB environment and in what specific areas it needs to improve. Secondly, we figured that whatever would come out of the work could be used as blog material, as well as best practice recommendations. You're reading the blog material now; also watch this space for Best Practices updates. Note that it was not a goal of this exercise to generate data for a world record press release. (There is always a metric where this can be achieved.)

Workload

The workload we use in PAE to characterize DBMSes is called OLTP/Net. This benchmark was developed inside Sun for the purpose of engineering performance into DBMSes. Modeled on common transaction processing benchmarks, it is OLTP-like but with a higher network-to-disk ratio. This makes it more representative of real world applications. Quoting from Neel's prose: "OLTP/Net, the New-Order transaction involves multi-hops as it performs Item validation, and inserts a single item per hop as opposed to block updates". I hope that means something to you; Neel will be blogging on his own, if you need more info.
Reference point

The reference performance point for this work would be UFS (with VxFS also being an interesting data point, but I'm not tasked with improving that metric). For DB loads we know that UFS directio (DIO) provides a significant performance boost, and that would be our target as well.

Platform & configuration

Our platform was a Niagara T2000 (8 cores @ 1.2Ghz, 4 HW threads or strands per core) with 130 x 36GB disks attached in JBOD fashion. Each disk was partitioned into 2 equal slices, with half of the surface given to a Solaris Volume Manager (SVM) volume onto which UFS would be built, and the other half given to a ZFS pool. The benchmark was designed to not fully saturate either the CPU or the disks. While we know that performance varies between the inner & outer disk surface, we don't expect the effect to be large enough to require attention here.

Write Cache Enabled (WCE)

ZFS is designed to work safely whether or not a disk write cache is enabled (WCE). This stays true if ZFS is operating on a disk slice. However, when given a full disk, ZFS will turn _ON_ the write cache as part of the import sequence; it won't enable the write cache when given only a slice. So, to be fair to ZFS' capabilities, we manually turned on WCE when running our tests over ZFS. UFS is not designed to work with WCE and will put data at risk if WCE is set, so we needed to turn it off for the UFS runs. We needed to do this to get around the fact that we did not have enough disks to give a full set to each filesystem. The performance we measured is therefore what would be expected when giving full disks to either filesystem. We note that, for the FC devices we used, WCE does not provide ZFS a significant performance boost on this setup.

No redundancy

For this initial effort we also did not configure any form of redundancy for either filesystem. ZFS RAID-Z does not really have an equivalent feature in UFS, so we settled on a simple stripe.
We could eventually configure software mirroring on both filesystems, but we don't expect that would change our conclusions. Still, this will be interesting for follow-up work.

DBMS logging

Another thing we know already is that a DBMS's log writer latency is critical to OLTP performance. So in order to improve on that metric, it's good practice to set aside a number of disks for the DBMS's logs. With this in hand, we managed to run our benchmark and get our target performance number (in relative terms; the higher the better):

UFS/DIO/SVM : 42.5 (separate data/log volumes)

Recordsize

OK, so now we're ready. We load up Solaris 10 Update 2 (S10U2/ZFS), build a log pool and a data pool and get going. Note that log writers actually generate a pattern of sequential I/O of varying sizes. That should map quite well to ZFS out of the box. But for the DBMS's data pool, we expect a very random pattern of reads and writes to DB records. A commonly known ZFS best practice when servicing fixed record access is to match the ZFS recordsize property to that of the application. We note that UFS, by chance or by design, also works (at least on sparc) using 8K records.

2nd run, ZFS/S10U2

So for a fair comparison, we set the recordsize to 8K for the data pool and run our OLTP/Net and... gasp!:

ZFS/S10U2 : 11.0 (data pool with 8K recordsize, log pool untuned)

So that's no good and we have our work cut out for us.

The role of prefetch in this result

To some extent we already knew of a subsystem that commonly misbehaves (which is being fixed as we speak), the vdev level prefetch code (which I also refer to as the software track buffer). In this code, whenever ZFS issues a small read I/O to a device, it will, by default, go and fetch quite a sizable chunk of data (64K) located at the physical location being read.
In itself, this should not increase the I/O latency, which is dominated by the head seek, and since the data is stored in a small fixed size buffer we don't expect it to eat up too much memory either. However, in a heavy-duty environment like we have here, every extra byte that moves up or down the data channel occupies valuable space. Moreover, for a large DB, we really don't expect the speculatively read data to be used very much. So for our next attempt we'll tune down the prefetch buffer to 8K.

The role of the vq_max_pending parameter

But we don't expect this to be quite sufficient here. My DBMS savvy friends tell me that the I/O latency of reads was quite large in our runs. Now, ZFS prioritizes reads over writes, so we thought we should be OK. However, during a pool transaction group sync, ZFS will issue quite a number of concurrent writes to each device. This is the vq_max_pending parameter, which defaults to 35. Clearly during this phase the read latency, even if prioritized, will take somewhat longer.

3rd run, ZFS/S10U2, tuned

So I wrote up a script to tune those 2 ZFS knobs. We could then run with a vdev prefetch buffer of 8K and a vq_max_pending of 10. This boosted our performance almost 2X:

ZFS/S10U2 : 22.0 (data pool with 8K recordsize, log pool untuned, vq_max_pending = 10, vdev prefetch = 8K)

But not quite satisfying yet.

ZFS/S10U2 known bug

We know of something else about ZFS. In the last few builds before S10U2, a little bug made its way into the code base. The effect of this bug was that for a full record rewrite, ZFS would actually read in the old block even though the data is not needed at all. Shouldn't be too bad; perfectly aligned block rewrites of uncached data are not that common... except for databases. Bummer. So S10U2 is plagued with this issue affecting DB performance, with no workaround. Our next step was thus to move on to the latest ZFS bits.
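Why a smaller vq_max_pending helps read latency in the 3rd run can be seen with a crude queueing estimate: a read arriving during a txg sync waits, on average, behind about half of the outstanding writes on the device. Both the model and its 5 ms per-operation figure are illustrative assumptions, not measurements:

```python
# Crude estimate of read latency when a read lands behind a queue of pending
# writes during a txg sync. The 5 ms per disk operation is an assumed figure,
# and "wait behind half the queue" is a rough average-case simplification.

def expected_read_latency(pending_writes, op_latency_sec=0.005):
    """Average read latency: queue behind ~half the outstanding writes,
    then pay the read's own service time."""
    return (pending_writes / 2) * op_latency_sec + op_latency_sec

print(expected_read_latency(35))  # ~0.0925 s with the default of 35
print(expected_read_latency(10))  # ~0.030 s with vq_max_pending set to 10
```

Even this rough model shows a ~3x reduction in average read latency when the per-device write queue is capped at 10 instead of 35, at the cost of some write throughput, which is the right trade for a latency-sensitive DB load.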
4th run ZFS/Build 44

Build 44 of our next Solaris version has long had this particular issue fixed. There we topped our past performance with:

ZFS/B44 : 33.0 Data pool (8K record on FS) Log pool (no tuning) vq_max_pending : 10 vdev prefetch : 8K

Compare that to umpty-years of super-tuned UFS:

UFS/DIO/SVM : 42.5 Separate data/log volumes

Summary

I think at this stage of ZFS, the results are neither great nor bad. We have achieved:

UFS/DIO : 100%
UFS : xx no directio (to be updated)
ZFS Best : 75% best tuned config with latest bits.
ZFS S10U2 : 50% best tuned config.
ZFS S10U2 : 25% simple tuning.

To achieve acceptable performance levels:

The latest ZFS code base. ZFS improves fast these days; we will need to keep tracking releases for a little while. The current OpenSolaris release, as well as the upcoming Solaris 10 Update 3 (this fall), should perform on these tests as well as the Build 44 results shown here.

1 data pool and 1 log pool: it is common practice to partition HW resources when we want proper isolation. Going forward, I think we will eventually get to the point where this is not necessary, but it seems an acceptable constraint for now.

Tuned vdev prefetch: the code is being worked on. I expect that in the near future this will not be necessary.

Tuned vq_max_pending: that may take a little longer. In a DB workload, latency is key and throughput secondary. There are a number of ideas that need to be tested which will help ZFS improve on both average latency and latency fluctuations. This will help both the intent log (O_DSYNC write) latency and reads.

Parting words

As those improvements come out, they may well allow ZFS to catch or surpass our best UFS numbers. When you match that kind of performance with all the usability and data integrity features of ZFS, that's a proposition that becomes hard to pass up.
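For reference, the separate data and log pools used in these runs can be set up along these lines. This is a minimal sketch only: the pool and device names are made up, and the actual benchmark rig surely differed.

```shell
# Hypothetical layout only: pool and device names are invented for illustration.
zpool create datapool c1t0d0 c1t1d0 c1t2d0 c1t3d0   # pool for DB data files
zpool create logpool  c2t0d0 c2t1d0                 # separate pool for DBMS logs
zfs set recordsize=8k datapool      # match the ZFS recordsize to the DB block size
# logpool is left untuned: log writers stream sequential I/O of varying
# sizes, which maps well to ZFS out of the box.
```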

ZFS and Databases


Tuning the knobs

As experience builds up, we're finding a few knobs of ZFS that we want to experiment with. As we gain a better understanding of them, our aim is that tuning them will not be necessary in the future, and there is already work in progress to offset the need to tune them. But for those ZFS users who live on the bleeding edge of performance, I figured this ztune script could come in handy.

The script works by tuning the in-kernel values of some internal parameters. It then runs an export/import sequence on the specified pool, which becomes tuned. After that, the script resets the in-kernel values to the ZFS defaults. This means that the tunings will not be persistent across reboots, or even across a subsequent export/import sequence.

We've seen the need to tune 2 parameters: the vdev-level prefetch size and the maximum number of pending I/Os per vdev. The low-level prefetch causes problems when it occupies the I/O channel for no benefit, i.e. if we end up never using the prefetched data. The default value is 64K, but 16K or even less appears to be a good value when a workload is read-intensive to non-cacheable data (working set bigger than memory).

The max-pending parameter can cause some problems also. When working with volumes that map to multiple spindles, it is possible that the default value is too low for write-throughput scenarios. Although I am a bit skeptical it would help, on those types of volumes, if faced with disappointing throughput, one could try to increase the value. The default is 35, but one could try 100 or so. It would be very interesting to hear about it if you stumble on an occasion where this helps. More likely, in my mind, the default value can cause extra latency for critical I/O (log writes and reads). During a transaction group sync, each device becomes saturated with that number of I/Os, causing the service time to be fairly large; this occurs in cyclic fashion, commonly on a 5-second beat.
When latency is more important than throughput, tuning down the value to 10 or less should bring better performance. Be warned that the script mucks around (mdb -kw) with unstable kernel definitions; there is potential to crash the OS, so be extra prudent and don't use it in production. http://blogs.sun.com/roch/resource/ztune.sh And remember, "You can tune a filesystem, but you can't tune a fish".
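For those who'd rather see the gist than read the script, here is a hand-written sketch of what ztune.sh does, not the script itself. The tunable names shown (zfs_vdev_max_pending, zfs_vdev_cache_bshift) are kernel globals from the OpenSolaris builds of that era and may differ or not exist on your release; treat every name here as an assumption.

```shell
# WARNING: mdb -kw patches live kernel memory; this can crash the OS.
# Do not use in production. Names and values are illustrative.
mdb -kw <<'EOF'
zfs_vdev_max_pending/W 0t10
zfs_vdev_cache_bshift/W 0t13
EOF
# Lowered the per-vdev queue to 10 and the vdev read inflation to 2^13 = 8K.
# The pool picks up the values across an export/import cycle:
zpool export mypool
zpool import mypool
# Restore the ZFS defaults (35 pending I/Os, 2^16 = 64K reads):
mdb -kw <<'EOF'
zfs_vdev_max_pending/W 0t35
zfs_vdev_cache_bshift/W 0t16
EOF
```

Because the defaults are restored right after the import, the tuning applies only to that pool and only until the next export/import or reboot, exactly the non-persistence described above.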



ZFS and Directio

ZFS AND DIRECTIO In view of the great performance gains that UFS gets out of the 'Directio' (DIO) feature, it is interesting to ask ourselves where exactly those gains come from, and whether ZFS can be tweaked to benefit from them in the same way.

UFS Directio

UFS Directio is actually a set of things bundled together that improves the performance of very specific workloads, most notably databases. Directio is a performance hint to the filesystem; apart from relaxing POSIX requirements, it does not carry any change in filesystem semantics. Users of directio assert the condition at the level of the full filesystem or of individual files, and the filesystem code is then given extra freedom to run the tuned DIO codepath or not. What does that tuned code path get us? A few things:

- output goes directly from the application buffer to disk, bypassing the filesystem's core memory cache.
- the FS is no longer constrained to strictly obey the POSIX write ordering. The FS is thus able to allow multiple threads to concurrently issue I/Os to a single file.
- on input, UFS DIO refrains from doing any form of readahead.

In a sense, by taking out the middleman (the filesystem cache), UFS/DIO causes files to behave a lot like a raw device. Application reads and writes map one to one onto individual I/Os. People often consider that the great gains DIO provides come from avoiding the CPU cost of the copy into system caches, and from avoiding the double buffering (once in the DB, once in the FS) that one gets in the non-directio case. I would argue that while the CPU cost associated with a copy certainly exists, the copy runs very quickly compared to the time the ensuing I/O takes. So the impact of the copy would only appear on systems that have their CPUs quite saturated, notably in industry-standard benchmarks. Real systems, however, which are more likely to be I/O constrained than CPU constrained, should not pay a huge toll for this effect.
As for double buffering, I note that databases (or applications in general) are normally set up to consume a given amount of memory, and the FS operates using the remaining portion. Filesystems cache data in memory for lack of a better use of that memory, and give up their hold whenever necessary. So the data is not double buffered so much as 'free' memory keeps a hold on recently issued I/O. Buffering data in 2 locations does not look like a performance issue to me.

Anything for ZFS?

So what does that leave us with? Why is DIO so good? This tells me that we gain a lot from these 2 mantras: don't do any more I/O than requested; allow multiple concurrent I/Os to a file. I note that UFS readahead is particularly bad for certain usage: when UFS sees access to 2 consecutive pages, it will read a full cluster, and those are typically 1MB in size today. So avoiding UFS readahead has probably contributed greatly to the success of DIO.

As for ZFS, there are 2 levels of readahead (a.k.a. prefetching): one that is file based and one that is device based. Both are being reworked at this stage. I note that the file-based readahead code has not and will not behave like UFS. On the other hand, device-level prefetching is probably overly aggressive for DB-type loads and should be avoided. While I have not given up hope that this can be managed automatically, watch this space for tuning scripts to control the device prefetching behavior. DIO for input does not otherwise appear an interesting proposition, since if the data is cached, I don't really see the gain in bypassing it (apart from slowing down the reads).

As for writes, ZFS, out of the box, does not suffer from the single-writer lock that UFS needs to implement the POSIX ordering rules. The transaction groups (TXG) are sufficient for that purpose (see The Dynamics of ZFS). This leaves us with the amount of I/O needed by the 2 filesystems when running many concurrent O_DSYNC writers issuing small writes to random file offsets.
UFS actually handles this load by overwriting the data in its preallocated disk locations. Every 8K page is associated with a set place on the storage, and a write to that location means a disk head movement and an 8K output I/O. This load should scale well with the number of disks in the storage and the 'random' IOPS capability of each drive. If a drive handles 150 random IOPS, then we can handle about 1 MB/s/drive of output.

Now ZFS will behave quite differently. ZFS does not preallocate file blocks and will not, ever, overwrite live data. The handling of O_DSYNC writes in ZFS occurs in 2 stages.

The 2 stages of ZFS

First, at the ZFS Intent Log (ZIL) level, we need to I/O the data in order to release the application blocked in a write call. Here the ZIL has the ability to aggregate data from multiple writes and issue fewer/larger I/Os than UFS would. Given the ZFS strategy of block allocation, we also expect those I/Os to be able to stream to the disk at high speed. We don't expect to be restrained by the random-IOPS capabilities of disks but rather by their streaming performance. Next, at the TXG level, we clean up the state of the filesystem, and here again the block allocation should allow a high rate of data transfer.

At this stage there are 2 things we have to care about. With the current state of things, we probably will see the data sent to disk twice, once to the ZIL and once to the pool. While this appears suboptimal at first, the aggregation and streaming characteristics of ZFS make the current situation probably better already than what UFS can achieve. We're also looking to see if we can make this even better by avoiding the 2 copies while preserving the full streaming performance characteristics. For pool-level I/O we must take care not to inflate the amount of data sent to disk, which could otherwise cause early storage saturation. ZFS works out of the box with 128K records for large files.
However, for DB workloads, we expect this will be tuned such that the ZFS recordsize matches the DB block size. We also expect the DB blocksize to be at least 8K in size. Matching the ZFS recordsize to the DB block size is a recommendation in line with what UFS DIO has taught us: don't do any more I/O than necessary. Note also that with ZFS, because we don't overwrite live data, every block output needs to bubble up into metadata block updates, etc. So there is some extra I/O that ZFS has to do, and depending on the exact test conditions the gains of ZFS can be offset by the extra metadata I/Os.

ZFS Performance and DB

Despite all the advantages of ZFS, the reason that performance data has been hard to come by is that we have to clear up the road and bypass the few side issues that currently affect performance on large DB loads. At this stage, we do have to spend some time and apply magic recipes to get ZFS performance on databases to behave the way it's intended to. But when the dust settles, we should be right up there in terms of performance compared to UFS/DIO, and improvement ideas are still plenty; if you have some more, I'm interested....
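The UFS side of the comparison above is easy to sanity-check with back-of-envelope arithmetic: with in-place overwrites, random 8K O_DSYNC writes are bounded by each drive's random IOPS, which is where the ~1 MB/s/drive figure comes from. The 150 IOPS and 8K numbers below are just the illustrative ones used in the text.

```shell
# Random in-place 8K writes are seek-bound: throughput = IOPS x write size.
iops_per_drive=150
write_kb=8
echo "$(( iops_per_drive * write_kb )) KB/s per drive"   # 1200 KB/s, about 1 MB/s
```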



The Dynamics of ZFS

The Dynamics of ZFS ZFS has a number of identified components that govern its performance. We review the major ones here.

Introducing ZFS

A volume manager is a layer of software that groups a set of block devices in order to implement some form of data protection and/or aggregation of devices, exporting the collection as a storage volume that behaves as a simple block device. A filesystem is a layer that manages such a block device, using a subset of system memory, in order to provide filesystem operations (including POSIX semantics) to applications and provide a hierarchical namespace for storage - files. Applications issue reads and writes to the filesystem, and the filesystem issues input and output (I/O) operations to the storage/block device.

ZFS implements those 2 functions at once. It thus typically manages sets of block devices (leaf vdevs), possibly grouping them into protected devices (RAID-Z or N-way mirror) and aggregating those top-level vdevs into a pool. Top-level vdevs can be added to a pool at any time. Objects stored in a pool are dynamically striped onto the available vdevs.

Associated with pools, ZFS manages a number of very lightweight filesystem objects. A ZFS filesystem is basically just a set of properties associated with a given mount point. Properties of a filesystem include the quota (maximum size) and reservation (guaranteed size) as well as, for example, whether or not to compress file data when storing blocks. The filesystem is characterized as lightweight because it does not statically associate with any physical disk blocks, and any of its settable properties can simply be changed dynamically.

Recordsize

The recordsize is one of those properties of a given ZFS filesystem instance. ZFS files smaller than the recordsize are stored using a single filesystem block (FSB) of variable length in multiples of a disk sector (512 bytes).
Larger files are stored using multiple FSBs, each of recordsize bytes, with a default value of 128K. The FSB is the basic file unit managed by ZFS and the one to which a checksum is applied. After a file grows to be larger than the recordsize (and gets to be stored with multiple FSBs), changing the filesystem's recordsize property will not impact the file in question; a copy of the file, however, will inherit the tuned recordsize value. An FSB can be mirrored onto a vdev or spread onto a RAID-Z device. The recordsize is currently the only performance tunable of ZFS.

The default recordsize may lead to early storage saturation: for many small updates (much smaller than 128K) to large files (bigger than 128K), the default value can cause extra strain on the physical storage or on the data channel (such as a fibre channel) linking it to the host. For those loads, if one notices a saturated I/O channel, then tuning the recordsize to smaller values should be investigated.

Transaction Groups

The basic mode of operation for write operations that do not require synchronous semantics (no O_DSYNC, fsync(), etc.) is that ZFS will absorb the operation in a per-host system cache called the Adaptive Replacement Cache (ARC). Since there is only one host system memory but potentially multiple ZFS pools, cached data from all pools is handled by a single ARC. Each file modification (e.g. a write) is associated with a certain transaction group (TXG). At a regular interval (default of txg_time = 5 seconds) each TXG will shut down and the pool will issue a sync operation for that group. A TXG may also be shut down when the ARC indicates that there is too much dirty memory currently being cached. As a TXG closes, a new one immediately opens, and file modifications then associate with the new active TXG. If the active TXG shuts down while a previous one is still in the process of syncing data to the storage, then applications will be throttled until the running sync completes.
In this situation, where we are syncing a TXG while TXG + 1 is closed (due to memory limitations or the 5-second clock) and waiting to sync itself, applications are throttled waiting to write to TXG + 2. We need sustained saturation of the storage, or a memory constraint, in order to throttle applications. A sync of the storage pool will involve sending all level-0 data blocks to disk; when done, all level-1 indirect blocks, etc., until eventually all blocks representing the new state of the filesystem have been committed. At that point we update the uberblock to point to the new consistent state of the storage pool.

ZFS Intent Log (ZIL)

For file modifications that come with some immediate data integrity constraint (O_DSYNC, fsync, etc.), ZFS manages a per-filesystem intent log, or ZIL. The ZIL marks each FS operation (say a write) with a log sequence number. When a synchronous command is requested for an operation (such as an fsync), the ZIL will output blocks up to that sequence number. When the ZIL is in the process of committing data, further commit operations will wait for the previous ones to complete. This allows the ZIL to aggregate multiple small transactions into larger ones, thus performing commits using fewer, larger I/Os.

The ZIL works by issuing all the required I/Os and then flushing the write caches, if those are enabled. This use of the disk write cache does not artificially improve a disk's commit latency, because ZFS ensures that data is physically committed to storage before returning. However, the write cache allows a disk to hold multiple concurrent I/O transactions, and this acts as a good substitute for drives that do not implement tagged queuing.

CAVEAT: The current state of the ZIL is such that if there is a lot of pending data in a filesystem (written to the FS, not yet output to disk) and a process issues an fsync() for one of its files, then all pending operations will have to be sent to disk before the synchronous command can complete.
This can lead to unexpected performance characteristics. Code is under review.

I/O Scheduler and Priorities

ZFS keeps track of pending I/Os but only issues a certain number (35 by default) to the disk controllers. This allows the controllers to operate efficiently while never overflowing their queues. By limiting the I/O queue size, the service times of individual disks are kept to reasonable values. When one I/O completes, the I/O scheduler decides the next most important one to issue. The priority scheme is time based; so, for instance, an input I/O servicing a read call will be prioritized over any regular output I/O issued in the last ~0.5 seconds. The fact that ZFS limits each leaf device's I/O queue to 35 is one of the reasons to suggest that zpools should be built using vdevs that are individual disks, or at least volumes that map to a small number of disks. Otherwise this self-imposed limit could become an artificial performance throttle.

Read Syscalls

If a read cannot be serviced from the ARC cache, ZFS will issue a 'prioritized' I/O for the data. So even if the storage is handling a heavy output load, there are only 35 I/Os outstanding, all with reasonable service times. As soon as one of the 35 completes, the I/O scheduler will issue the read I/O to the controller. This ensures good service times for read operations in general. However, to avoid starvation, when there is a long-standing backlog of output I/Os, those eventually regain priority over the input I/O. ZIL synchronous I/Os have the same priority as synchronous reads.

Prefetch

The prefetch code, which allows ZFS to detect sequential or strided access to a file and issue I/O ahead of phase, is currently under review. To quote the developer: "ZFS prefetching needs some love".

Write Syscalls

ZFS never overwrites live data on disk and always outputs full records validated by a checksum. So in order to partially overwrite a file record, ZFS first has to have the corresponding data in memory.
If the data is not yet cached, ZFS will issue an input I/O before allowing the write(2) to partially modify the file record. With the data now in cache, more writes can target the blocks. On output, ZFS will checksum the data before sending it to disk. For a full record overwrite, the input phase is not necessary.

CAVEAT: Simple write calls (not O_DSYNC) are normally absorbed by the ARC cache and so proceed very quickly. A sustained dd(1)-like load can quickly overrun a large amount of system memory and cause transaction groups to eventually throttle all applications for large amounts of time (10s of seconds). This is probably what underlies the notion that ZFS needs more RAM (it does not). Write throttling code is under review.

Soft Track Buffer

An input I/O is serious business. While a filesystem can decide where to write stuff out on disk, inputs are requested by applications; this means a necessary head seek to the location of the data. The time to issue a small read will be totally dominated by this seek. So ZFS takes the stance that it might as well amortize those operations, and so, for uncached reads, ZFS normally issues a fairly large input I/O (64K by default). This helps loads that input data using an access pattern similar to the output phase. The data goes into a per-device cache holding 20MB. This cache can be invaluable in reducing the I/Os necessary to read in data. But just like the recordsize, if the inflated I/O causes storage channel saturation, the soft track buffer can act as a performance throttle.

The ARC Cache

The most interesting caching occurs at the ARC layer. The ARC manages the memory used by blocks from all pools (each pool servicing many filesystems). ARC stands for Adaptive Replacement Cache and is inspired by a paper by Megiddo/Modha presented at the FAST'03 Usenix conference. The ARC manages its data keeping a notion of Most Frequently Used (MFU) and Most Recently Used (MRU) blocks, balancing intelligently between the two.
One of its very interesting properties is that a large scan of a file will not destroy most of the cached data. On a system with free memory, the ARC will grow as it starts to cache data. Under memory pressure, the ARC will return some of its memory to the kernel until low-memory conditions are relieved.

We note that while ZFS has behaved rather well under 'normal' memory pressure, it does not appear to behave satisfactorily under swap shortage. The memory usage pattern of ZFS is very different from that of other filesystems such as UFS, and so exposes VM-layer issues in a number of corner cases. For instance, a number of kernel operations fail with ENOMEM without even attempting a reclaim operation. If they did, ZFS would respond by releasing some of its own buffers, allowing the initial operation to then succeed. The fact that ZFS caches data in the kernel address space does mean that the kernel size will be bigger than when using traditional filesystems. For heavy-duty usage it is recommended to use a 64-bit kernel, i.e. any SPARC system or an AMD system configured in 64-bit mode. Some systems that have managed in the past to run without any swap configured should probably start to configure some. The behavior of the ARC in response to memory pressure is under review.

CPU Consumption

Recent enhancements to ZFS have improved its CPU efficiency by a large factor. We don't expect to deviate much from other filesystems in terms of cycles per operation. ZFS checksums all disk blocks, but this has not proven to be costly at all in terms of CPU consumption. ZFS can be configured to compress on-disk blocks, and we do expect to see some extra CPU consumption from that compression. While it is possible that compression could lead to some performance gain due to the reduced I/O load, the emphasis of compression should be to save on-disk space, not performance.

What About Your Test?

This is what I know about the ZFS performance model today.
My performance comparison on different types of modelled workloads made last fall already had ZFS ahead on many of them; we have improved the biggest issues highlighted then, and there are further performance improvements in the pipeline (based on UFS, we know this will never end). Best practices are being spelled out. You can contribute by comparing your actual usage and workload pattern with the simulated workloads. But nothing will beat having reports from real workloads at this stage; your results are therefore of great interest to us. And watch this space for updates...



Tuning ZFS recordsize

One important performance parameter of ZFS is the recordsize, which governs the size of filesystem blocks for large files. This is the unit that ZFS validates through checksums. Filesystem blocks are dynamically striped onto the pooled storage on a block-to-virtual-device (vdev) basis. It is expected that for some loads, tuning the recordsize will be required. Note that in traditional filesystems such a tunable would govern the behavior of all of the underlying storage. With ZFS, tuning this parameter only affects the tuned filesystem instance, and it applies to newly created files. The tuning is achieved using:

zfs set recordsize=64k mypool/myfs

In ZFS, all files are stored either as a single block of varying size (up to the recordsize) or using multiple recordsize blocks. Once a file grows to be multiple blocks, its blocksize is definitively set to the FS recordsize at the time. Some more experience will be required with recordsize tuning; here are some elements to guide you along the way.

If one considers the input of an FS block, typically in response to an application read, the size of the I/O in question basically will not impact the latency by much. So, as a first approximation, the recordsize does not matter (I'll come back to that) for read-type workloads. FS block outputs, those that are governed by the recordsize, actually occur mostly asynchronously with the application; and since applications are not commonly held up by those outputs, the delivered throughput is, as for read-type loads, not impacted by the recordsize. So the first approximation is that the recordsize does not impact performance much. To service loads that are transient in nature, with short I/O bursts (< 5 seconds), we do not expect recordsize tuning to be necessary. The same can be said for sequential-type loads.

So what about the second approximation? A problem that can occur with using an inflated recordsize (128K) compared to the application read/write sizes is early storage saturation.
If an application requests 64K of data, then providing a 128K record doesn't change the latency the application sees much. However, if the extra data is discarded from the cache before ever being read, we see that the data channel was occupied by the extra data for no good reason. If a limiting factor of the storage is, for instance, a 100MB/sec channel, I can handle about 700 128K records per second on that channel. If I halve the recordsize, that should double the number of small records I can input. On small-record output loads, the system memory creates a buffer that defers the direct impact to applications. For output, if the storage is saturated this way for tens of seconds, ZFS will eventually throttle applications. This means that, in the end, when the recordsize leads to sustained storage overload on output, there will be an impact as well.

There is another aspect to the recordsize. A partial write to an uncached FS block (a write syscall of size smaller than the recordsize) will have to first input the corresponding data. Conversely, when individual writes cover full filesystem recordsize blocks, those writes can be handled without the need to input the associated FS blocks. Other considerations (metadata overhead, caching) dictate, however, that the recordsize not be reduced below a certain point (16K to 64K; do send in your experience).

So, one piece of advice is to keep an eye on the channel throughput and tune the recordsize for random-access workloads that saturate the storage. Sequential-type workloads should work quite well with the current default recordsize. If the applications' read/write sizes can be increased, that should also be considered. For non-cached workloads that overwrite file data in small aligned chunks, matching the recordsize to the write access size may bring some performance gains.
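The channel-saturation argument above reduces to simple division. A quick sketch, using the illustrative 100 MB/sec figure from the text (counted here in binary KB, which lands in the same ballpark as the ~700 records/sec quoted above):

```shell
# Records/sec a saturated channel can carry at two recordsizes.
channel_kb_per_sec=$(( 100 * 1024 ))      # a 100 MB/sec data channel, in KB/sec
echo "$(( channel_kb_per_sec / 128 )) x 128K records/sec"   # 800
echo "$(( channel_kb_per_sec / 64 )) x 64K records/sec"     # 1600: halving the recordsize doubles it
```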




DOES ZFS REALLY USE MORE RAM? I'll touch on 3 aspects of that question here:

- reported freemem
- syscall writes to mmaped pages
- application write throttling

Reported freemem will be lower when running with ZFS than with, say, UFS. The UFS page cache is counted as freemem; ZFS returns its 'cache' only when memory is needed. So you will operate with lower freemem but won't normally suffer from it. It has been wrongly feared that this mode of operation puts us back in the days of Solaris 2.6 and 7, where we saw a roller-coaster effect on freemem leading to sub-par application performance. We actually DO NOT have this problem with ZFS. The old problem came about because the memory reaper could not distinguish between a useful application page and a UFS cached page. That was bad. ZFS frees up its cache in a way that does not cause this problem: it is designed to release some of its memory when kernel modules exert back pressure on the kmem subsystem. Some kernel code that did not properly exert that pressure was recently fixed (short description here: 4034947).

There is one peculiar workload that does lead ZFS to consume more memory: writing (using syscalls) to pages that are also mmaped. ZFS does not use the regular paging system to manage data that passes through read and write syscalls. However, mmaped I/O, which is closely tied to the virtual memory subsystem, still goes through the regular paging code. So syscall-writing to mmaped pages means we will keep 2 copies of the associated data, at least until we manage to get the data to disk. We don't expect that type of load to commonly use large amounts of RAM.

Finally, one area where ZFS behaves quite differently from UFS is in throttling writers. With UFS, up until not long ago, we throttled a process trying to write to a file as soon as that file had 0.5 MB of I/O pending associated with it. This limit has recently been upped to 16 MB.
The gain of such throttling is that we prevent an application working on a single file from consuming an inordinate amount of system memory. The downside is that we may throttle an application unnecessarily when memory is plentiful. ZFS does not throttle individual apps like this. The scheme is mutualized between all writers: when the global load of application data overflows the I/O subsystem for 5 to 10 seconds, then we throttle the applications, allowing the I/O to catch up. Applications thus have a lot more RAM to play with before being throttled. This is probably what's behind the notion that ZFS likes more RAM. By and large, to cache some data, ZFS just needs the equivalent amount of RAM as any other filesystem. But currently, ZFS lets applications run a lot more decoupled from the I/O subsystem. This can speed up some loads by a very large factor but will, at times, appear as extra memory consumption.




WHEN TO (AND NOT TO) USE RAID-Z RAID-Z is the technology used by ZFS to implement a data-protection scheme that is less costly than mirroring in terms of block overhead. Here, I'd like to go over, from a theoretical standpoint, the performance implications of using RAID-Z.

The goal of this technology is to allow a storage subsystem to deliver the stored data in the face of one or more disk failures. This is accomplished by joining multiple disks into an N-way RAID-Z group. Multiple RAID-Z groups can be dynamically striped to form a larger storage pool. To store file data onto a RAID-Z group, ZFS spreads a filesystem (FS) block onto the N devices that make up the group. So for each FS block, (N - 1) devices hold file data and 1 device holds parity information. This information would eventually be used to reconstruct (or resilver) the data in the face of any device failure. We thus have 1/N of the available disk blocks used to store parity information: a 10-disk RAID-Z group has 9/10ths of its blocks effectively available to applications.

A common alternative for data protection is the use of mirroring. In this technology, a filesystem block is stored on 2 (or more) mirror copies. Here again, the system will survive a single disk failure (or more, with N-way mirroring). A 2-way mirror thus delivers similar data protection at the expense of providing applications access to only one half of the disk blocks.

Now let's look at this from the performance angle, in particular that of delivered filesystem blocks per second (FSBPS). An N-way RAID-Z group achieves its protection by spreading a ZFS block onto the N underlying devices. That means that a single ZFS block I/O must be converted into N device I/Os. To be more precise, in order to access a ZFS block, we need N device I/Os for output and (N - 1) device I/Os for input, as the parity data need not generally be read in.
Now after a request for a ZFS block has been spread this way, the I/O scheduling code takes control of all the device I/Os that need to be issued. At this stage, the ZFS code is capable of aggregating adjacent physical I/Os into fewer ones. Because of the ZFS Copy-On-Write (COW) design, we actually expect this reduction in the number of device-level I/Os to work extremely well for just about any write-intensive workload. We also expect it to help streaming input loads significantly.

The situation of random inputs is one that needs special attention when considering RAID-Z. Effectively, as a first approximation, an N-disk RAID-Z group will behave as a single device in terms of delivered random input IOPS. Thus a 10-disk group of devices, each capable of 200 IOPS, will globally act as a 200-IOPS-capable RAID-Z group. This is the price to pay to achieve proper data protection without the 2X block overhead associated with mirroring.

With 2-way mirroring, each FS block output must be sent to 2 devices. Half of the available IOPS are thus lost to mirroring. However, for inputs each side of a mirror can service read calls independently of the other, since each side holds the full information. Given a proper software implementation that balances the inputs between the sides of a mirror, the FS blocks delivered by a mirrored group are actually no fewer than what a simple non-protected RAID-0 stripe would give. 
So looking at a random-access input load, in terms of FS blocks per second (FSBPS), given N devices to be grouped either in RAID-Z, 2-way mirrored or simply striped (a.k.a RAID-0, no data protection!), the equation would be (where dev represents the capacity, in terms of blocks or IOPS, of a single device):

            Blocks Available    Random FS Blocks / sec
            ----------------    ----------------------
    RAID-Z  (N - 1) * dev       1 * dev
    Mirror  (N / 2) * dev       N * dev
    Stripe  N * dev             N * dev

Now let's take 100 disks of 100 GB, each capable of 200 IOPS, and look at different possible configurations. In the table below, the configuration labeled "Z 5 x (19+1)" refers to a dynamic striping of 5 RAID-Z groups, each group made of 20 disks (19 data disks + 1 parity). M refers to a 2-way mirror and S to a simple dynamic stripe.

    Config        Blocks Available    Random FS Blocks / sec
    ------        ----------------    ----------------------
    Z 1 x (99+1)       9900 GB           200
    Z 2 x (49+1)       9800 GB           400
    Z 5 x (19+1)       9500 GB          1000
    Z 10 x (9+1)       9000 GB          2000
    Z 20 x (4+1)       8000 GB          4000
    Z 33 x (2+1)       6600 GB          6600
    M 2 x (50)         5000 GB         20000
    S 1 x (100)       10000 GB         20000

So RAID-Z gives you at most 2X the number of blocks that mirroring provides, but hits you with much fewer delivered IOPS. That means that, as the number of devices in a group N increases, the expected gain over mirroring (in disk blocks) is bounded (to at most 2X) but the expected cost in IOPS is not bounded (cost in the range of [N/2, N] times fewer IOPS).

Note that for wide RAID-Z configurations, ZFS takes into account the sector size of devices (typically 512 bytes) and dynamically adjusts the effective number of columns in a stripe. So even if you request a 99+1 configuration, the actual data will probably be stored on much fewer data columns than that. Hopefully this article will contribute to steering deployments away from those types of configuration. 
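The two tables above follow directly from the per-configuration formulas. Here is a minimal Python sketch (function name and signature are mine, not from the post) that regenerates the RAID-Z rows from the device count, group width, and per-device capacity and IOPS:

```python
def raidz_stripe(n_devices, group_size, dev_gb=100, dev_iops=200):
    """Model a dynamic stripe of RAID-Z groups, per the formulas above.

    Each group gives up one device's worth of blocks to parity, and
    delivers roughly one device's worth of random-read IOPS.
    """
    groups = n_devices // group_size
    blocks_gb = groups * (group_size - 1) * dev_gb   # (N - 1) * dev per group
    fsbps = groups * dev_iops                        # 1 * dev per group
    return blocks_gb, fsbps

# Reproduce a few rows of the 100-disk table:
assert raidz_stripe(100, 20) == (9500, 1000)   # Z 5 x (19+1)
assert raidz_stripe(100, 10) == (9000, 2000)   # Z 10 x (9+1)
assert raidz_stripe(100, 5)  == (8000, 4000)   # Z 20 x (4+1)
# Mirror and stripe, for comparison, give (N/2)*dev and N*dev blocks,
# both delivering N*dev random-read IOPS.
```

The trade-off is visible in the assertions: halving the group width costs blocks linearly but doubles the delivered random-read IOPS.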
In conclusion, when preserving IOPS capacity is important, the size of RAID-Z groups should be restrained to smaller sizes and one must accept some level of disk block overhead. When performance matters most, mirroring should be highly favored. If mirroring is considered too costly but performance is nevertheless required, one could proceed like this:

    Given N devices each capable of X IOPS.
    Given a target of Y delivered FS blocks per second for the storage pool.
    Build your storage using dynamically striped RAID-Z groups of (Y / X) devices.

For instance:

    Given 50 devices each capable of 200 IOPS.
    Given a target of 1000 delivered FS blocks per second for the storage pool.
    Build your storage using dynamically striped RAID-Z groups of (1000 / 200) = 5 devices.

In that system we would then have 20% block overhead lost to maintaining RAID-Z parity.

RAID-Z is a great technology not only when disk blocks are your most precious resource but also when your available IOPS far exceed your expected needs. But beware that if you get your hands on fewer, very large disks, the IOPS capacity can easily become your most precious resource. Under those conditions mirroring should be strongly favored, or alternatively a dynamic stripe of RAID-Z groups each made up of a small number of devices.
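The sizing recipe can be written down directly. A small sketch (the function name is invented for illustration) applying the rule exactly as stated: group width = Y / X, with one parity device per group:

```python
def raidz_group_size(target_fsbps, dev_iops):
    """Group width per the rule above: (Y / X) devices per RAID-Z group.

    Returns the group width and the parity overhead as a percentage
    (one parity device out of each group's width).
    """
    width = target_fsbps // dev_iops
    parity_overhead_pct = 100 // width
    return width, parity_overhead_pct

# The worked example: devices of 200 IOPS, target 1000 FS blocks/sec.
width, overhead = raidz_group_size(1000, 200)
assert width == 5         # RAID-Z groups of 5 devices
assert overhead == 20     # 20% of blocks lost to parity
```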



128K Suffices

(Throughput numbers to the raw device were corrected since my initial post.)

The question put forth is whether the ZFS 128K blocksize is sufficient to saturate a regular disk. There is a great body of evidence showing that bigger write sizes and a matching large FS cluster size lead to more throughput. The counterpoint is that ZFS schedules its I/O like nothing else seen before and manages to saturate a single disk using enough concurrent 128K I/Os.

I first measured the throughput of a write(2) to a raw device using, for instance, this:

    dd if=/dev/zero of=/dev/rdsk/c1t1d0s0 bs=8192k count=1024

On Solaris we would see some overhead of reading the block from /dev/zero and then issuing the write call. The tightest function that fences the I/O is default_physio(). That function will issue the I/O to the device then wait for it to complete. If we take the elapsed time spent in this function and count the bytes that are I/O-ed, this should give a good hint as to the throughput the device is providing. The above dd command will issue a single I/O at a time (the d-script to measure this is attached).

Trying different blocksizes I see:

    Bytes Sent    Elapse of phys I/O    Avg I/O Size    Throughput
      8 MB        3576 ms                  16 KB         2 MB/s
      9 MB        1861 ms                  32 KB         4 MB/s
     31 MB        3450 ms                  64 KB         8 MB/s
     78 MB        4932 ms                 128 KB        15 MB/s
    124 MB        4903 ms                 256 KB        25 MB/s
    178 MB        4868 ms                 512 KB        36 MB/s
    226 MB        4824 ms                1024 KB        46 MB/s
    226 MB        4816 ms                2048 KB        54 MB/s (was 46)
     32 MB         686 ms                4096 KB        58 MB/s (was 46)
    224 MB        4741 ms                8192 KB        59 MB/s (was 47)
    272 MB        4336 ms               16384 KB        58 MB/s (new data)
    288 MB        4327 ms               32768 KB        59 MB/s (new data)

Data was corrected after it was pointed out that physio will be throttled by 
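The throughput column is just bytes sent over elapsed time; assuming the d-script's integer arithmetic, a quick cross-check of a few rows (helper name is mine):

```python
def mb_per_sec(mb_sent, elapsed_ms):
    """Throughput the way the d-script reports it: MB sent / elapsed, in MB/s.

    Integer division, matching the script's printf of whole MB/s.
    """
    return (mb_sent * 1000) // elapsed_ms

# Spot-check rows of the raw-device table above:
assert mb_per_sec(78, 4932) == 15    # 128 KB writes
assert mb_per_sec(124, 4903) == 25   # 256 KB writes
assert mb_per_sec(226, 4824) == 46   # 1 MB writes
```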
maxphys. New data was obtained after setting, in /etc/system:

    set maxphys=8388608

and in /kernel/drv/sd.conf:

    sd_max_xfer_size=0x800000

and in /kernel/drv/ssd.conf:

    ssd_max_xfer_size=0x800000

and setting un_max_xfer_size in "struct sd_lun". That address was figured out using dtrace and knowing that sdmin() calls ddi_get_soft_state (details available upon request). And of course disabling the write cache (using format -e).

With this in place I verified that each sdwrite() up to 8M would lead to a single biodone interrupt using this:

    dtrace -n 'biodone:entry,sdwrite:entry{@a[probefunc, stack(20)]=count()}'

Note that for 16M and 32M raw device writes, each default_physio will issue a series of 8M I/Os, and so we don't expect any more throughput from that.

The script used to measure the rates (phys.d) was also modified, since I was counting the bytes before the I/O had completed, and that made a big difference for the very large I/O sizes.

If you take the 8M case, the above rates correspond to the time it takes to issue and wait for a single 8M I/O to the sd driver. So this time certainly does include 1 seek and ~0.13 seconds of data transfer, then the time to respond to the interrupt, and finally the wakeup of the thread waiting in default_physio(). Given that the data transfer rate using 4 MB is very close to the one using 8 MB, I'd say that at 60 MB/sec all the fixed-cost elements are well amortized. So I would conclude from this that the limiting factor is now at the device itself or on the data channel between the disk and the host.

Now let's see what ZFS gets. I measure using a single dd process. ZFS will chunk up data in 128K blocks. Now the dd command interacts with memory, but the I/Os are scheduled under the control of spa_sync(). So in the d-script (attached) I check for the start of an spa_sync and time it based on elapsed time. At the same time I gather the number of bytes and keep a count of the I/Os (bdev_strategy) that are being issued. When the spa_sync completes we are sure that all those are on stable storage. 
The script is a bit more complex because there are 2 threads that issue spa_sync, but only one of them actually becomes activated. So the script will print out some spurious lines of output at times. I measure I/O with the script while this runs:

    dd if=/dev/zero of=/zfs2/roch/f1 bs=1024k count=8000

And I see:

    1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
    1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
    2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
    1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
    1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s

OK, I cheated. Here, ZFS is given a full disk to play with. In this case ZFS enables the write cache. Note that even with the write cache enabled, when the spa_sync() completes, it will be after a flush of the cache has been executed. So the 60 MB/sec does correspond to data sent to the platter. I just tried disabling the cache (with format -e) but I am not sure if that is taken into account by ZFS; results are the same 60 MB/sec. This will have to be confirmed.

With the write cache enabled, the physio test reaches 66 MB/s as soon as we are issuing 16KB I/Os. Here, clearly though, data is not on the platter when the timed function completes. Another variable not fully controlled is the physical (cylinder) location of the I/Os. It could be that some of the differences come from that.

What do I take away?

    A single 2MB physical I/O will get 46 MB/sec out of my disk.

    35 concurrent 128K I/Os sustained, followed by metadata I/O, followed by a flush of the write cache, allow ZFS to get 60 MB/sec out of the same disk.

This is what underwrites my belief that the 128K blocksize is sufficiently large. Now, nothing here proves that 256K would not give more throughput; so nothing is really settled. 
But I hope this helps put us on common ground.

--------------phys.d-------------------
#!/usr/sbin/dtrace -qs
/* Measure throughput going through physio (dd to raw) */
BEGIN
{
	b = 0;     /* Byte count */
	cnt = 0;   /* phys io count */
	delta = 0; /* time delta */
	tt = 0;    /* timestamp */
}

default_physio:entry
{
	tt = timestamp;
	self->b = (args[5]->uio_iov->iov_len);
}

default_physio:return
/tt != 0/
{
	cnt++;
	b += self->b;
	delta += (timestamp - tt);
}

tick-5s
/delta != 0/
{
	printf("%d MB; %d ms of phys; avg sz : %d KB; throughput %d MB/s\n",
	    b / 1048576, delta / 1000000, b / cnt / 1024,
	    (b * 1000000000) / (delta * 1048576));
}

tick-5s
{
	b = 0; delta = 0; cnt = 0; tt = 0;
}
--------------phys.d-------------------

--------------spa_sync.d-------------------
#!/usr/sbin/dtrace -qs
/*
 * Measure I/O throughput as generated by spa_sync.
 * Between the spa_sync entry and return probes
 * I count all I/Os and bytes going through bdev_strategy.
 * This is a lower bound on what the device can do since
 * some aspects of spa_sync are non-concurrent I/Os.
 */
BEGIN
{
	tt = 0;  /* timestamp */
	b = 0;   /* Byte count */
	cnt = 0; /* io count */
}

spa_sync:entry
/(self->t == 0) && (tt == 0)/
{
	b = 0; /* reset the I/O byte count */
	cnt = 0;
	tt = timestamp;
	self->t = 1;
}

spa_sync:return
/(self->t == 1) && (tt != 0)/
{
	this->delta = (timestamp - tt);
	this->cnt = (cnt == 0) ? 1 : cnt; /* avoid divide by 0 */
	printf("%d MB; %d ms of spa_sync; avg sz : %d KB; throughput %d MB/s\n",
	    b / 1048576, this->delta / 1000000, b / this->cnt / 1024,
	    (b * 1000000000) / (this->delta * 1048576));
	tt = 0;
	self->t = 0;
}

/* We only count I/O issued during an spa_sync */
bdev_strategy:entry
/tt != 0/
{
	cnt++;
	b += (args[0]->b_bcount);
}
--------------spa_sync.d-------------------



Beware of the Performance of RW Locks

In my naive little mind a RW lock represents a performant, scalable construct inasmuch as WRITERS do not hold the lock for a significant amount of time. One figures that the lock would be held for short WRITER times followed by concurrent execution of the RW_READERS. What I recently found out is quite probably well known to seasoned kernel engineers, but it was new to me. So I figured it could be of interest to others.

The SETUP

Reader/Writer locks (RW) can be used in kernel and user-level code to allow multiple READERS of, for instance, a data structure, to access the structure while allowing only a single WRITER at a time within the bounds of the rwlock(). A RW lock (rwlock(9F), rwlock(3THR)) is more complex than a simple mutex, so acquiring such a lock will be more expensive. This means that if the expected hold time of a lock is quite small (say, to update or read 1 or 2 fields of a structure) then regular mutexes can usually do the job very well. A common programming mistake is to expect faster execution from RW locks in those cases. However, when READ hold times need to be fairly long, RW locks represent an alternative construct. With those locks we expect to have multiple READERS executing concurrently, thus leading to performant code that scales to large numbers of threads. As I said, if WRITERS are just quick updates to the structure, we can naively believe that our code will scale well.

Not So

Let's see how it goes. A WRITER cannot get into the protected code while READERS are executing it. The WRITER must then wait at the door until the READERS release their hold. If the implementation of RW locks didn't pay attention, there would be cases in which at least one READER is always present within the protected code and WRITERS would get starved of access. To prevent such starvation, a RW lock must block READERS as soon as a WRITER has requested access. 
But no matter, our WRITERS will quickly update the structure and we will get concurrent execution most of the time. Won't we? Well, not quite. As just stated, a RW lock will block readers as soon as a WRITER has hit the door. This means that the construct does not allow parallel execution at that point. Moreover, the WRITER will stay at the door while READERS are executing. So the construct stays fully serializing from the time a WRITER hits the door until all current READERS are done, followed by the WRITER's own hold time. For instance:

    - a RW_READER gets in and will hold the lock a long time.  ---
    - a RW_WRITER hits the lock; is put on hold.                |
    - other RW_READERS now also block.                          |
      .... time passes                                          |
    - the long RW_READER releases.                             ---

That whole interval runs serialized and can exceed (N * avg read hold times).

Roundup

In the end, from a performance point of view, RW locks should be used only when the average hold time is significant, in order to justify the use of this more complex type of lock: for instance, calling a function of unknown latency or issuing an I/O while holding the lock represent good candidates. But the construct will be scalable to N threads if and only if WRITERS are very infrequent.

[T]: Niagara CMT Solaris Sun
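The admission rule described above (readers are turned away as soon as a writer is waiting) can be modeled with a tiny state machine. This is an illustrative sketch in Python, not the Solaris implementation; the class and method names are invented for the example, and it tracks only the admission decisions, not the actual parking and waking of threads:

```python
class WriterPreferenceRWLock:
    """Toy model of a writer-preference RW lock's admission rules.

    try_read/try_write return True when the caller would be admitted
    immediately, False when it would block at the door.
    """

    def __init__(self):
        self.active_readers = 0
        self.writer_active = False
        self.writers_waiting = 0

    def try_read(self):
        # A reader is admitted only if no writer holds the lock AND
        # no writer is waiting -- the starvation-avoidance rule.
        if self.writer_active or self.writers_waiting > 0:
            return False
        self.active_readers += 1
        return True

    def try_write(self):
        # A writer is admitted only when the lock is completely idle;
        # otherwise it queues up and starts blocking new readers.
        if self.writer_active or self.active_readers > 0:
            self.writers_waiting += 1
            return False
        self.writer_active = True
        return True

    def read_release(self):
        self.active_readers -= 1

    def write_release(self):
        self.writer_active = False


lock = WriterPreferenceRWLock()
assert lock.try_read()        # first reader gets in
assert not lock.try_write()   # writer must wait for the reader
assert not lock.try_read()    # new readers now blocked: serialization begins
lock.read_release()           # only once the long reader leaves can the writer go
```

The third assertion is the whole story: the moment the writer queues up, reader concurrency is gone, even though the writer itself is not yet running.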



Showcasing UltraSPARC T1 with Directory Server's searches

So my friend and Sun's Directory Server (DS) developer Gilles Bellaton recently got his hands on an early-access Niagara (UltraSPARC T1) system, something akin to a SunFire(TM) T2000. The chip in the system only had 7 active cores and thus 28 hardware threads (a.k.a. strands), but we wanted to check how well it would perform on DS. The results here are a little anecdotal: we just ran a few quick tests with the aim of showcasing Niagara, but nevertheless the results were beyond expectations. If you consider the throughput-engine architecture that Niagara provides, we can expect it to perform well on highly multithreaded loads such as a directory search test.

Since we had limited disk space on the system, the slapd instance was created on /tmp. We realize that these are not at all proper deployment conditions; however, the nature of the test is such that we would expect the system to operate mostly from memory (database fully cached). The only data that would need to go to disk in a real deployment would be the access log, and this is typically not a throughput-limiting subsystem. So we can prudently expect that a real on-disk deployment of a read-mostly workload in which the DB can be fully cached could perform perhaps close to our findings.

This showcase test is a base search over a tiny 1000-entry database using a slapd configured with 50 threads. Slapd was not tuned in any way before the test. For simplicity, the client was run on the same system as the server. This means that, on the one hand, the client is consuming some CPU away from the server, but on the other it reduces the need to run the network adapter driver code. All in all, this was not designed as a realistic DS test but only to see, in a few hours of access time to the system, if DS was running acceptably well on this new cool hardware. The results were obtained with Gilles' workspace of DS 6.0, optimized build of August 29th 2005. The number of CPUs was adjusted by creating processor sets (psrset). 
    Number of Strands           Search/sec    Ratio
     1                             920         1 X
     3 (1 core; 3 str/core)       2260         2.45 X
     4 (1 core; 4 str/core)       2650         2.88 X
     4 (4 cores; 1 str/core)      4100         4.45 X
    14 (7 cores; 2 str/core)     12500        13.59 X
    21 (7 cores; 3 str/core)     16100        17.5 X
    28 (7 cores; 4 str/core)     18200        19.8 X

Those are pretty good scaling numbers straight out of the box. While other, more realistic investigations will be produced, this test at least showed us early on that Niagara-based systems were not suffering from any flagrant deficiencies when running DS searches.

[T]: Niagara CMT Solaris Sun
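The Ratio column is just each search rate normalized to the single-strand baseline; a quick arithmetic check of the table (one of the two 4-strand configurations is omitted since the dict is keyed by strand count):

```python
# Scaling ratios from the table: searches/sec normalized to the
# single-strand baseline of 920 searches/sec.
baseline = 920
rates = {1: 920, 3: 2260, 4: 2650, 14: 12500, 21: 16100, 28: 18200}

ratios = {strands: rate / baseline for strands, rate in rates.items()}
assert round(ratios[28], 1) == 19.8   # 28 strands: ~19.8X
assert round(ratios[21], 1) == 17.5   # 21 strands: 17.5X

# Per-strand efficiency at 28 strands is still ~70% of the 1-strand rate,
# despite all 4 strands of each core sharing one pipeline.
efficiency = ratios[28] / 28
```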



ZFS to UFS Performance Comparison on Day 1

With special thanks to Chaoyue Xiong for her help in this work.

In this paper I'd like to review the performance data we have gathered comparing this initial release of ZFS (Nov 16 2005) with the Solaris legacy, optimized-beyond-reason, UFS filesystem. The data we will be reviewing is based on 14 unit tests that were designed to stress some specific usage patterns of filesystem operations. Working with these well-contained usage scenarios greatly facilitates subsequent performance engineering analysis. Our focus was to issue a fair head-to-head comparison between UFS and ZFS, not to try to produce the biggest, meanest marketing numbers. Since ZFS is also a volume manager, we actually compared ZFS to a UFS/SVM combination. In cases where ZFS underperforms UFS, we wanted to figure out why and how to improve ZFS. We are currently also focusing on data-intensive operations. Metadata-intensive tests are being developed and we will report on those in a later study.

Looking ahead to our results, we find that of our 12 filesystem unit tests that were successfully run:

    ZFS outpaces UFS in 6 tests by a mean factor of 3.4
    UFS outpaces ZFS in 4 tests by a mean factor of 3.0
    ZFS equals UFS in 2 tests.

In this paper, we will take a closer look at the tests where UFS is ahead and try to make propositions toward improving those numbers.

THE SYSTEM UNDER TEST

Our testbed is a hefty V890 with 8 x 1200 MHz US-IV CPUs (16 cores). At this point we are not yet monitoring the CPU utilization of the different tests, although we plan to do so in the future. The storage is an insanely large 300-disk array; the disks are rather old technology, small & slow 9 GB disks. None of the tests currently stresses the array very much, and the idea was mostly to take the storage configuration out of the equation.  
Working with old-technology disks, the absolute throughput numbers are not necessarily of interest; they are presented in an appendix. Every disk in our configuration is partitioned into 2 slices, and a simple SVM or zpool striped volume is made across all spindles. We then build a filesystem on top of the volume. All commands are run with default parameters. Both filesystems are mounted and we can run our test suite on either one. Every test is rerun multiple times in succession; the tests are defined and developed to avoid variability between runs. Some of the current test definitions require that file data not be present in the filesystem cache. Since we currently do not have a convenient way to control this for ZFS, the results for those tests are omitted from this report.

THE FILESYSTEM UNIT TESTS

Here is the definition of the 14 data-intensive tests we have currently identified. Note that we are very open to new test definitions; if you know of a data-intensive application that uses a filesystem in a very different pattern, and there must be tons of them, we would dearly like to hear from you.

Test 1

This is the simplest way to create a file; we open/creat a file then issue 1MB writes until the file size reaches 128 MB; we then close the file.

Test 2

In this test, we also create a new file, although here we work with a file opened with the O_DSYNC flag. We work with 128K write system calls. This maps to some database file creation schemes.

Test 3

This test also relates to file creation, but with writes that are much smaller and of varying sizes. In this test, we create a 50MB file using writes of a size picked randomly between [1K,8K]. The file is opened with default flags (no O_*SYNC), but every 10 MB of written data we issue an fsync() call for the whole file. This form of access can be used for log files that have data integrity requirements. 
Test 4

Moving now to a read test: we read a 1 GB file (assumed in cache) with 32K read system calls. This is a rather simple test to keep everybody honest.

Test 5

This is the same test as Test 4, but with the file assumed not present in the filesystem cache. We currently have no control over this for ZFS and so we will not be reporting performance numbers for this test. This is a basic streaming read sequence that should test the readahead capacity of a filesystem.

Test 6

Our previous write tests were allocating writes. In this test we verify the ability of a filesystem to rewrite over an existing file. We look at 32K writes to a file opened with O_DSYNC.

Test 7

Here we also test the ability to rewrite existing files. The sizes are randomly picked in the [1K,8K] range. No special control over data integrity (no O_*SYNC, no fsync()).

Test 8

In this test we create a very large file (10 GB) with 1MB writes, followed by 2 full-pass sequential reads. This test is still evolving, but we want to verify the ability of the filesystem to work with files of a size close to or larger than available free memory.

Test 9

In this test, we issue 8K writes at random 8K-aligned offsets in a 1 GB file. When 128 MB of data has been written we issue an fsync().

Test 10

Here, we issue 2K writes at random (unaligned) offsets to a file opened O_DSYNC.

Test 11

Same test as 10 but using 4 cooperating threads all working on a single file.

Test 12

Here we attempt to simulate a mixed read/write pattern. Working with an existing file, we loop through a pattern of 3 reads at 3 randomly selected 8K-aligned offsets followed by an 8K write to the last read block.

Test 13

In this test we issue 2K pread() calls (at random unaligned offsets). The file is asserted not to be in the cache. Since we currently have no such control, we won't report data for this test. 
Test 14

We have 4 cooperating threads (working on a single file) issuing 2K pread() calls at random unaligned offsets. The file is present in the cache.

THE RESULTS

We have a common testing framework to generate the performance data. Each test is written as a simple C program, and the framework is responsible for creating threads and files, timing the runs, and reporting. We are currently discussing merging this test framework with the Filebench suite. We regret that we cannot easily share the test code; however, the above descriptions should be sufficiently precise to allow someone to reproduce our data. In my mind a simple 10- to 20-disk array and any small server should be enough to generate similar numbers. If anyone finds very different results, I would be very interested in knowing about it.

Our framework reports all timing results as a throughput measure. Absolute values of throughput are highly test-case dependent. A 2K O_DSYNC write will not have the same throughput as a 1MB cached read. Some tests would be better described in terms of operations per second. However, since our focus is a relative ZFS to UFS/SVM comparison, we will focus here on the delta in throughput between the 2 filesystems (for the curious, the full throughput data is posted in the appendix).

Drumroll....

    Task ID  Description                                            Winning FS / Delta
    1        open() and allocation of a 128.00 MB file with         ZFS / 3.4X
             write(1024K) then close().
    2        open(O_DSYNC) and allocation of a 5.00 MB file         ZFS / 5.3X
             with write(128K) then close().
    3        open() and allocation of a 50.00 MB file with          UFS / 1.8X
             write() of size picked uniformly in [1K,8K],
             issuing fsync() every 10.00 MB.
    4        Sequential read(32K) of a 1024.00 MB file, cached.     ZFS / 1.1X
    5        Sequential read(32K) of a 1024.00 MB file, uncached.   no data
    6        Sequential rewrite(32K) of a 10.00 MB file,            ZFS / 2.6X
             O_DSYNC, uncached.
    7        Sequential rewrite() of a 1000.00 MB cached file,      UFS / 1.3X
             size picked uniformly in the [1K,8K] range,
             then close().
    8        Create a file of size 1/2 of freemem using             ZFS / 2.3X
             write(1MB) followed by 2 full-pass sequential
             read(1MB). No special cache manipulation.
    9        128.00 MB worth of random 8K-aligned writes to a       UFS / 2.3X
             1024.00 MB file; followed by fsync(); cached.
    10       1.00 MB worth of 2K writes to a 100.00 MB file,        draw (UFS == ZFS)
             O_DSYNC, random offset, cached.
    11       1.00 MB worth of 2K writes to a 100.00 MB file,        ZFS / 5.8X
             O_DSYNC, random offset, uncached; 4 cooperating
             threads each writing 1 MB.
    12       128.00 MB worth of 8K-aligned read&write to a          draw (UFS == ZFS)
             1024.00 MB file, pattern of 3 X read, then write
             to last read page, random offset, cached.
    13       5.00 MB worth of pread(2K) per thread within a         no data
             shared 1024.00 MB file, random offset, uncached.
    14       5.00 MB worth of pread(2K) per thread within a         UFS / 6.9X
             shared 1024.00 MB file, random offset, cached,
             4 threads.

As stated in the abstract:

    ZFS outpaces UFS in 6 tests by a mean factor of 3.4
    UFS outpaces ZFS in 4 tests by a mean factor of 3.0
    ZFS equals UFS in 2 tests.

The performance differences can be sizable; let's have a closer look at some of them.

PERFORMANCE DEBRIEF

Let's look at each test to try to understand the cause of the performance differences.

Test 1 (ZFS 3.4X)

    open() and allocation of a 128.00 MB file with write(1024K) then close().

This test is not fully analyzed. We note that in this situation UFS will regularly kick off some I/O from the context of the write system call. This occurs whenever a cluster of writes (typically of size 128K or 1MB) has completed. The initiation of I/O by UFS slows down the process. On the other hand, ZFS can zoom through the test at a rate much closer to a memcopy. The ZFS I/Os to disk are actually generated internally by the ZFS transaction group mechanism: every few seconds a transaction group will come and flush the dirty data to disk, and this occurs without throttling the test.

Test 2 (ZFS 5.3X)

    open(O_DSYNC) and allocation of a 5.00 MB file with write(128K) then close().
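The original C harness is not shared, but the test definitions are precise enough to re-create. As an illustration, here is a hedged, scaled-down Python sketch of Test 9's pattern (random 8K-aligned writes followed by fsync); the function name and the small sizes are mine, not the framework's:

```python
import os, random, tempfile, time

def run_test9(path, file_mb=16, write_mb=4, bs=8192):
    """Random 8K-aligned writes followed by one fsync(), timed.

    Scaled-down stand-in for Test 9 (the real harness uses a 1 GB file
    and 128 MB of writes); returns MB/s including the fsync cost.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    os.ftruncate(fd, file_mb * 1024 * 1024)    # pre-size the target file
    nwrites = (write_mb * 1024 * 1024) // bs
    buf = b"z" * bs
    start = time.time()
    for _ in range(nwrites):
        # pick a random 8K-aligned offset within the file
        off = random.randrange(file_mb * 1024 * 1024 // bs) * bs
        os.pwrite(fd, buf, off)
    os.fsync(fd)                               # data must reach stable storage
    elapsed = time.time() - start
    os.close(fd)
    return write_mb / elapsed

with tempfile.NamedTemporaryFile(delete=False) as f:
    rate = run_test9(f.name)
os.remove(f.name)
```

Timing the fsync() inside the measured window is the whole point of the test: a filesystem that defers all I/O looks fast until the sync bill comes due.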
Here ZFS shows an even bigger advantage. Because of its design and complexity, UFS is actually somewhat limited in its capacity to write-allocate files in O_DSYNC mode. Every new UFS write requires some disk block allocation, which must occur one block at a time when O_DSYNC is set. ZFS easily outperforms UFS on this test.

Test 3 (UFS 1.8X)

    open() and allocation of a 50.00 MB file with write() of size picked uniformly in [1K,8K], issuing fsync() every 10.00 MB.

Here ZFS pays for the advantage it had in Test 1. In this test, we issue very many writes to a file. Those are cached as the process races along. When the fsync() hits (every 10 MB of outstanding data, per the test definition) the FS must now guarantee that all the data is on stable storage. Since UFS kicks off I/O more regularly, when the fsync() hits, UFS has a smaller amount of data left to sync up. What saves the day for ZFS is that, for that leftover data, UFS slows down to a crawl. On the other hand, ZFS has accumulated a large amount of data in the cache by the time the fsync() hits. Fortunately ZFS is able to issue much larger I/Os to disk and catches up some of the lag that has built up. But the final result shows that UFS wins the horse race (at least in this specific test); details of the test will influence the final result here. However, the ZFS team is working on ways to make fsync() much better.

We actually have 2 possible avenues of improvement. We can borrow from the UFS behavior and kick off some I/Os when too much outstanding data is cached. UFS does this at a very regular interval, which does not look right either. But clearly, if a file has many MB of outstanding dirty data, sending them off to disk might be beneficial. 
On the other hand, keeping the data in cache is interesting when the pattern of writing is such that the same file offsets are written and rewritten over and over again. Sending the data to disk is wasteful if the data is subsequently rewritten shortly after. Basically the FS must place a bet on whether a future fsync() will occur before a new write to the block. We cannot win this bet on all tests all the time. Given that fsync() performance is important, I would like to see us asynchronously kick off I/O when we reach many MB of outstanding data to a file. This is nevertheless debatable.

Even if we don't do this, we have another area of improvement that the ZFS team is looking into. When the fsync() finally hits the fan, even with a lot of outstanding data, the current implementation does not issue disk I/Os very efficiently. The proper way to do this is to kick off all required I/Os and then wait for them all to complete. Currently, in the intricacies of the code, some I/Os are issued and waited upon one after the other. This is not yet optimal, but we certainly should see improvements coming in the future and I truly expect ZFS fsync() performance to be ahead all the time.

Test 4 (ZFS 1.1X)

    Sequential read(32K) of a 1024.00 MB file, cached.

Rather simple test, mostly close to memcopy speed between the filesystem cache and the user buffer. The contest is almost a wash, with ZFS slightly on top. Not yet analyzed.

Test 5 (N/A)

    Sequential read(32K) of a 1024.00 MB file, uncached.

No results due to lack of control over ZFS file-level caching.

Test 6 (ZFS 2.6X)

    Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC, uncached.

Due to the write-anywhere file layout of ZFS, a rewrite is not very different from an initial write, and ZFS seems to perform very well on this test. 
Presumably, UFS performance is hindered by the need to synchronize the cached data. Result not yet analyzed.

Test 7 (UFS 1.3X)
    Sequential rewrite() of a 1000.00 MB cached file, size picked uniformly in the [1K,8K] range, then close().

In this test we are not timing any disk I/O; this is merely a test of unrolling the filesystem code for 1K to 8K cached writes. The UFS codepath wins on simplicity and years of performance tuning; the ZFS codepath here somewhat suffers from its youth. Understandably, the current ZFS implementation is very well layered, and we can easily imagine that the locking strategies of the different layers are independent of one another. We have found (thanks, DTrace) that a small ZFS cached write uses about 3 times as many lock acquisitions as an equivalent UFS call. Mutex rationalization within or between layers certainly seems to be an area of potential improvement for ZFS that would help this particular test. We also realised that the very clean, layered code causes the callstack to take very many elevator rides up and down between layers. On a SPARC CPU, going 6 or 7 layers deep in the callstack causes a spill/fill trap, with one additional trap for every additional floor travelled. Fortunately, there are very many places where ZFS will be able to merge different functions into a single one, or possibly exploit the technique of tail calls, to regain some of the lost performance. All in all, we find the performance difference small enough not to be worrisome at this point, especially in view of the possible improvements we have already identified.

Test 8 (ZFS 2.3X)
    Create a file of size 1/2 of freemem using write(1MB), followed by 2 full-pass sequential read(1MB). No special cache manipulation.

This test needs to be analyzed further.
We note that UFS will proactively free-behind read blocks. While this is a very responsible use of memory (give it back after use), it potentially impacts UFS re-read performance. While we're happy to see ZFS performance on top, some investigation is warranted to make sure that ZFS does not overconsume memory in some situations.

Test 9 (UFS 2.3X)
    128.00 MB worth of random 8K-aligned write to a 1024.00 MB file, followed by fsync(); cached.

In this test we expect a rationale similar to that of test 3 to take effect; the same cure should also apply.

Test 10 (draw)
    1.00 MB worth of 2K write to a 100.00 MB file, O_DSYNC, random offset, cached.

Both filesystems must issue and wait for a 2K I/O on each write, and both do so as efficiently as possible.

Test 11 (ZFS 5.8X)
    1.00 MB worth of 2K write to a 100.00 MB file, O_DSYNC, random offset, uncached; 4 cooperating threads each writing 1 MB.

This test is similar to the previous one except for the 4 cooperating threads. ZFS being on top highlights a key feature of ZFS: the absence of a single-writer lock. UFS can only allow a single writing thread per file; the only exception is when directio is enabled, and then only under rather restrictive conditions. UFS with directio allows concurrent writers, with the implied restriction that it does not honor full POSIX semantics regarding write atomicity. ZFS, out of the box, allows concurrent writers without requiring any special setup and without giving up full POSIX semantics. All great news for simplicity of deployment and great database performance.

Test 12 (draw)
    128.00 MB worth of 8K-aligned read&write to a 1024.00 MB file, with a pattern of 3 reads then a write to the last read page; random offset, cached.

Both filesystems perform appropriately. The test still requires analysis.
Test 13 (N/A)
    5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, uncached.

No results, due to the lack of control over ZFS file-level caching.

Test 14 (UFS 6.9X)
    5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, cached, 4 threads.

This test inexplicably shows UFS on top. The UFS code can perform rather well here given that the FS cache is stored in the page cache, and servicing reads from cache can be made very scalable. We are just starting our analysis of the ZFS performance characteristics for this test. We have identified a serialization construct in the buffer management code: reclaiming the buffers into which to put the cached data acts as a serial throttle. This is truly the only test where ZFS performance disappoints, although there is no doubt that we will find a cure for this implementation issue.

THE TAKEAWAY

ZFS is on top in very many of our tests, often by a significant factor. Where UFS is ahead, we have a clear view of how to improve the ZFS implementation; the case of shared readers of a single file will be the test that requires special attention. Given the youth of the ZFS implementation, the performance outline presented in this paper shows that the ZFS design decisions are validated from a performance perspective.

FUTURE DIRECTIONS

Clearly, we should now expand the unit test coverage. We would like to study more metadata-intensive workloads, and we would also like to see how ZFS features such as compression and RAID-Z perform. Other interesting studies could focus on CPU consumption and memory efficiency. We also need to find a solution for running the existing unit tests that require files not to be cached in the filesystem.
APPENDIX: THROUGHPUT MEASURES

Here are the raw throughput measures for each of the 14 unit tests.

Task  Description                                          ZFS latest+nv25  UFS+nv25   Result
                                                           (MB/s)           (MB/s)
 1    open() and allocation of a 128.00 MB file with         486.01572      145.94098  ZFS 3.4X
      write(1024K) then close().
 2    open(O_DSYNC) and allocation of a 5.00 MB file           4.5637         0.86565  ZFS 5.3X
      with write(128K) then close().
 3    open() and allocation of a 50.00 MB file with           27.3327        50.09027  UFS 1.8X
      write() of size picked uniformly in [1K,8K],
      issuing fsync() every 10.00 MB.
 4    Sequential read(32K) of a 1024.00 MB file, cached.     674.77396      612.92737  ZFS 1.1X
 5    Sequential read(32K) of a 1024.00 MB file,            1756.57637       17.53705  N/A
      uncached.
 6    Sequential rewrite(32K) of a 10.00 MB file,              2.20641        0.85497  ZFS 2.6X
      O_DSYNC, uncached.
 7    Sequential rewrite() of a 1000.00 MB cached file,      204.31557      257.22829  UFS 1.3X
      size picked uniformly in the [1K,8K] range, then
      close().
 8    Create a file of size 1/2 of freemem using             698.18182      298.25243  ZFS 2.3X
      write(1MB), followed by 2 full-pass sequential
      read(1MB). No special cache manipulation.
 9    128.00 MB worth of random 8K-aligned write to a         42.75208      100.35258  UFS 2.3X
      1024.00 MB file, followed by fsync(); cached.
10    1.00 MB worth of 2K write to a 100.00 MB file,            0.117925       0.116375  draw
      O_DSYNC, random offset, cached.
11    1.00 MB worth of 2K write to a 100.00 MB file,            0.42673        0.07391  ZFS 5.8X
      O_DSYNC, random offset, uncached; 4 cooperating
      threads each writing 1 MB.
12    128.00 MB worth of 8K-aligned read&write to a          264.84151      266.78044  draw
      1024.00 MB file, pattern of 3 reads then a write
      to the last read page; random offset, cached.
13    5.00 MB worth of pread(2K) per thread within a          75.98432        0.11684  N/A
      shared 1024.00 MB file, random offset, uncached.
14    5.00 MB worth of pread(2K) per thread within a          56.38486      386.70305  UFS 6.9X
      shared 1024.00 MB file, random offset, cached,
      4 threads.

(Tests 5 and 13 report no comparison because of the lack of control over ZFS file-level caching.)

OpenSolaris, ZFS

With special thanks to Chaoyue Xiong for her help in this work.