Delivering Performance Improvements to Sun Storage 7000
By user13278091 on nov. 10, 2008
I describe here the effort I spearheaded studying the performance characteristics of the OpenStorage platform and the ways in which our team of engineers delivered real out of the box improvements to the product that is shipping today.
One of the Joy of working on the OpenStorage NAS appliance was that solutions we found to performance issues could be immediately transposed into changes to the appliance without further process.
The first big wins
We initially stumble on 2 major issues, one for NFS synchronous writes and one for the CIFS protocol in general. The NFS problem was a subtle one involving the distinction of O_SYNC vs O_DSYNC writes in the ZFS intent log and was impacting our threaded synchronous writes test by up to a 20X factor. Fortunately I had an history of studying that part of the code and could quickly identify the problem and suggest a fix. This was tracked as 6683293: concurrent O_DSYNC writes to a fileset can be much improved over NFS.
The following week, turning to CIFS studies, we were seeing great scalability limitation in the code. Here again I was fortunate to be the first one to hit this. The problem was that to manage CIFS request the kernel code was using simple kernel allocations that could accommodate the largest possible request. Such large allocations and deallocations causes what is known as a storm of TLB shootdown cross-calls limiting scalability.
Incredibly though after implementing the trivial fix, I found that the rest of the CIFS server was beautifully scalable code with no other barriers. So in one quick and simple fix (using kmem caches) I could demonstrate a great scalability improvements to CIFS. This was tracked as 6686647 : smbsrv scalability impacted by memory
Since those 2 protocol problems were identified early on, I must say that no serious protocol performance problems have come up. While we can always find incremental improvements to any given test, our current implementation has held up to our testing so far.
In the next phase of the project, we did a lot of work on improving network efficiency at high data rate. In order to deliver the throughput that the server is capable of, we must use 10Gbps network interface and the one available on the NAS platforms are based on the Neptune networking interface running the nxge driver.
I collaborated on this with Alan Chiu that already new a lot about this network card and driver tunables and so we quickly could hash out the issues. We had to decide for a proper out of the box setup involving
- how many MSI-X interrupts to use - whether to use networking soft rings or not - what bcopy threshold to use in the driver as opposed to binding dma. - Whether to use or not the new Large Segment Offload (LSO) technique for transmits.We new basically where we wanted to go here. We wanted many interrupts on receive side so as to not overload any CPU and avoid the use of layered softrings which reduces efficiency. A low bcopy threshold so that dma binding be used more frequently as the default value was too high for this x64 based platform. And LSO was providing a nice boost to efficiency. That got us to some proper efficiency level.
However we noticed that under stress and high number of connections our efficiency would drop by 2 or 3 X. After much head scratching we rooted this to the use of too many TX dma channels. It turns out that with this driver and architecture using a few channels leads to more stickyness in the scheduling and much much greater efficiency. We settled on 2 tx rings as a good compromise. That got us to a level of 8-10 cpu cycles per byte transfered in network code (more on Performance Invariants). Interrupt Blanking
Studying a Opensource alternative controller, we also found that on 1 of 14 metrics we where slower. That was rooted in the interrupt blanking parameter that NIC use to gain efficiency. What we found here was that by reducing our blanking to a small value we could leapfrog the competition (from 2X worse to 2X better) on this test while preserving our general network efficiency. We were then on par or better for every one of the 14 tests.
When we ran thousand or 1 Mb/s media streams from our systems we quickly found that the file level software prefetching was hurting us. So we initially disabled the code in our lab to run our media studies but at the end of the project we had to find an out of the box setup that could preserve our Media result without impairing maximum read streaming. At some point we realized that what we were hitting 6469558: ZFS prefetch needs to be more aware of memory pressure. It turns out that the internals of zfetch code is setup to manage 8 concurrent streams per file and can readahead up to 256 blocks or records : in this case 128K. So when we realized that with 1000s of streams we could readahead ourself out of memory, we knew what we needed to do. We decided on setting up 2 streams per file reading ahead up to 16 blocks and that seems quite sufficient to retain our media serving throughput while keeping so prefetching capabilities. I note here also is that NFS client code will themselve recognize streaming and issue their own readahead. The backend code is then reading ahead of client readahead requests. So we kind of where getting ahead of ourselves here. Read more about it @ cndperf
To slog or not to slog
One of the innovative aspect of this Openstorage server is the use of read and write optimized solid state devices; see for instance The Value of Solid State Devices.
Those SSD are beautiful devices designed to help latency but not throughput. A massive commit is actually better handled by regular storage not ssd. It turns out that it was actually dead easy to instruct the ZIL to recognize massive commits and divert it's block allocation strategy away from the SSD toward the common pool of disks. We see two benefits here, the massive commits will sped up (preventing the SSD from becoming the bottleneck) but more importantly the SSD will now be available as low latency devices to handle workloads that rely on low latency synchronous operations. One should note here that the ZIL is a "per filesystem" construct and so while a filesystem might be working on a large commit another filesystem from the same pool might still be running a series of small transaction and benefit from the write optimized SSD.
In a similar way, when we first tested the read-optimized ssds , we quickly saw that streamed data would install in this caching layer and that it could slow down the processing later. Again the beauty of working on an appliance and closely with developers meant that the following build, those problems had been solved.
Transaction Group Time
ZFS operates by issuing regular transaction groups in which modifications since last transaction group are recorded on disk and the ueberblock is updated. This used to be done at a 5 second interval but with the recent improvement to the write throttling code this became a 30 second interval (on light workloads) which aims to not generate more than 5 seconds of I/O per transaction groups. Using 5 seconds of I/O per txg was used to maximize the ratio of data to metadata in each txg, delivering more application throughput. Now these Storage 7000 servers will typically have lots of I/O capability on the storage side and the data/metadata is not as much a concern as for a small JBOD storage. What we found was that we could reduce the the target of 5 second of I/O down to 1 while still preserving good throughput. Having this smaller value smoothed out operation.
IT JUST WORKS
Well that is certainly the goal. In my group, we spent the last year performance testing these OpenStorage systems finding and fixing bugs, suggesting code improvements, and looking for better compromise for common tunables. At this point, we're happy with the state of the systems particularly for mirrored configuration with write optimized SSD accelerators. Our code is based on a recent OpenSolaris (from august) that already has a lot of improvements over Solaris 10 particularly for ZFS, to which we've added specific improvements relevant to NAS storage. We think these systems will at times deliver great performance (see Amithaba's results ) but almost always shine in the price performance categories.