Thursday, Sept. 17, 2009

iSCSI unleashed

One of the most anticipated features of the 2009.Q3 release of the fishworks OS is a complete rewrite of the iSCSI target implementation, known as the Common Multiprotocol SCSI Target, or COMSTAR. The new target code is an in-kernel implementation that replaces what was previously known as the iSCSI target daemon, a user-level implementation of iSCSI.

Should we expect huge performance gains from this change? You bet!

But as with most performance questions, the answer is often: it depends. The measured performance of a given test is gated by the weakest link that test triggers, and iSCSI is just one component among many that can end up gating performance. If the old daemon was not the limiting factor for your workload, then that's your answer.

The target daemon was a userland implementation of iSCSI: daemon threads would read data from a storage pool and write it to a socket, or vice versa. Moving to a kernel implementation opens up options to bypass at least one of those copies, and that is being considered as a future improvement. But extra copies, while undesirable, do not necessarily hurt small-packet latency or large-request throughput. For small requests, the copy is small change compared to the request handling; for large-request throughput, what matters is that the data path establishes a pipelined flow that keeps every component busy at all times.

But the way threads interact with one another can be a much greater factor in delivered performance, and there lies the problem. The old target daemon suffered from one major flaw: each and every iSCSI request required multiple trips through a single queue (shared between all LUNs), and that queue was read and written by two specific threads. Those two threads ended up fighting for the same locks. This was compounded by the fact that user-level threads can be put to sleep when they fail to acquire a mutex, and going to sleep is a costly operation for a user-level thread, implying a system call and all the accounting that goes with it.

So while the iSCSI target daemon gave reasonable service for large requests, it was much less scalable in the number of IOPS it could serve and in the CPU efficiency with which it could serve them. IOPS is, of course, a critical metric for block protocols.

As an illustration, with 10 client initiators and 10 threads per initiator (so 100 outstanding requests) doing 8K cache-hit reads, we observed:

Old Target Daemon    COMSTAR     Improvement
31K IOPS             85K IOPS    2.7X

Moreover, the target daemon was consuming 7.6 CPUs to service those 31K IOPS, while COMSTAR could handle 2.7X more IOPS while consuming only 10 CPUs, a 2X improvement in IOPS-per-CPU efficiency.
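In round numbers, that works out to 31K / 7.6 ≈ 4.1K IOPS per CPU for the old target daemon versus 85K / 10 = 8.5K IOPS per CPU for COMSTAR, hence the roughly 2X gain in per-CPU efficiency.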

On the write side, with a disk pool that had two striped write-optimised SSDs, COMSTAR gave us 50% more throughput (130 MB/sec vs. 88 MB/sec) and 60% better CPU efficiency.


During our testing we noted a few interesting contributors to delivered performance. The first is the iSCSI ImmediateData parameter, set through iscsiadm(1M). On the write path, that parameter causes the iSCSI initiator to send up to 8K of data along with the initial request packet. While this is generally a good idea, we found that for certain write sizes it would trigger a condition in the ZIL that caused ZFS to issue more data than necessary through the logzillas. The problem is well understood and remediation is underway; we expect to reach a point where keeping the default value of immediatedata=yes is best. But as of today, for those attempting world-record data transfer speeds through logzillas, setting immediatedata=no and using a 32K or 64K write size might yield positive results depending on your client OS.
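As an illustration only, here is what that knob looks like on a Linux open-iscsi initiator (the target IQN and portal below are placeholders; Solaris clients expose the equivalent parameter through iscsiadm(1M)):

    # Turn off ImmediateData in this target's node record, then log the
    # session out and back in so the login parameter is renegotiated.
    iscsiadm -m node -T iqn.1986-03.com.sun:02:example-target \
        -p 192.168.1.10:3260 -o update \
        -n node.session.iscsi.ImmediateData -v No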

Interrupt Blanking

Interested in low-latency request response? Interestingly, a chunk of that response time is lost in the obscure settings of network card drivers. Network cards will often delay pending interrupts in the hope of coalescing more packets into a single interrupt. The extra efficiency often yields more throughput at high data rates, at the expense of small-packet latency. For 8K requests we managed to get 15% more single-threaded IOPS by tweaking one such client-side parameter. Historically such tuning has been hidden in the bowels of each driver and is specific to every client OS, so it is too broad a topic to cover here. But for Solaris clients, the Crossbow networking framework aims, among other things, to make the latency-versus-throughput decision much more adaptive to operating conditions, relaxing the need for per-workload tuning.
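Just to give a flavor of the kind of knob involved, on a Linux client whose NIC driver supports it, interrupt coalescing can be relaxed with ethtool; the interface name and values here are placeholders rather than a recommendation:

    # Fire an interrupt as soon as a packet arrives instead of waiting to
    # coalesce several; this trades CPU efficiency for lower request latency.
    ethtool -C eth0 rx-usecs 0 rx-frames 1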

WCE Settings

Another important parameter to consider with COMSTAR is the 'write cache enable' (WCE) bit. By default, every write request to an iSCSI LUN must be committed to stable storage, as this is what most consumers of block storage expect. That means each individual write request to a disk-based storage pool will take at minimum a disk rotation, or 5 ms to 8 ms, to commit. This is also why a write-optimised SSD is quite critical to many iSCSI workloads, often yielding 10X performance improvements. Without such an SSD, iSCSI performance will appear quite lackluster, particularly for lightly threaded workloads, which are more affected by latency characteristics.

One could then feel justified in setting the write cache enable bit on some LUNs in order to recoup some spunk in their engine. One piece of good news here is that in the new 2009.Q3 release of fishworks the setting is now persistent across reboots and reconnection events, fixing a nasty condition of 2009.Q2. However, one should be very careful with this setting, as the end consumer of the block storage (Exchange, NTFS, Oracle, ...) is quite probably operating under an unexpected set of conditions. This setting can lead to application corruption in case of an outage (there is no risk to the storage's internal state).
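For reference, on a generic OpenSolaris COMSTAR host (the appliance exposes this per LUN in its own UI), the bit corresponds, if memory serves, to the sbd logical-unit property wcd, for 'write cache disabled'; the GUID below is a placeholder:

    # wcd=false means the write cache is enabled for this logical unit;
    # wcd=true restores the safe, write-through default.
    stmfadm modify-lu -p wcd=false 600144F0C8E9D21B00004AB2A8760001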

There is one exception to this caveat, and it is ZFS itself. ZFS is designed to operate safely and correctly on top of devices that have their write cache enabled, because ZFS flushes write caches whenever application semantics or its own internal consistency require it. So a zpool created on top of iSCSI LUNs is well justified in setting WCE on the LUNs to boost performance.
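For instance, a client-side pool layered over two such LUNs (hypothetical device names below, as they might appear on a Solaris initiator) remains safe with WCE set, since ZFS issues the cache flushes itself:

    # The c#t#d# names are placeholders for the iSCSI LUNs on the client.
    zpool create itank c2t600144F0AAAA0001d0 c2t600144F0AAAA0002d0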

Synchronous write bias

Finally, as described in my blog entry about Synchronous write bias, we now have the option to bypass the write-optimised SSDs for a LUN if the workload it receives is less sensitive to latency. This would be the case for a highly threaded workload doing large data transfers. Experimenting with this new property is warranted at this point.
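On the appliance this surfaces as the per-share 'Synchronous write bias' property; on a raw ZFS system the analogous dataset property is logbias, shown here with a placeholder volume name:

    # Send this volume's synchronous writes to the main pool devices
    # rather than to the log devices (logzillas).
    zfs set logbias=throughput tank/iscsi_vol01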



