Monday Nov 10, 2008

Network Performance on Open Storage Appliance

Sun Storage 7000 Unified Storage Systems launched this Monday. It is the industry's first open storage appliance and will deliver breakthrough performance, speed, and simplicity. I will share an inside view of how network performance was optimized for it. Most of the optimizations are generic and can be applied to other storage systems running Solaris.

1 GBytes/sec and Beyond

Early on, the question was whether networking performance work should focus on 10GbE or GigE. If the system can sustain 1 GBytes/sec throughput or more, then it makes sense to focus on 10GbE performance first; good performance over multiple GigE links will then follow easily. 1 GBytes/sec is a nice round number to remember. Can it be done? Using uperf as a micro-benchmark tool to model the NFS workload, throughput reaches 900 MBytes/sec on a 2-socket quad-core AMD Barcelona system and 10GbE line rate on a 4-socket quad-core Barcelona system, so the answer is 'yes'. The uperf results were later validated with NFS cached read throughput that exceeds 1 GBytes/sec. Later work focused on reducing CPU utilization and latency while still hitting the 1 GBytes/sec mark.
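For a rough sense of how 1 GBytes/sec compares with 10GbE line rate, the arithmetic can be sketched as below. This is a back-of-the-envelope illustration, not a measurement: it assumes standard Ethernet framing overhead (38 bytes per frame, counting preamble and inter-frame gap) and 40 bytes of TCP/IPv4 headers with no options. The function name is made up for this example.

```python
# Back-of-the-envelope TCP goodput on a 10GbE link.
# Assumptions (not taken from the measurements above): IPv4 + TCP headers
# with no options, and standard Ethernet framing overhead per frame.

LINE_RATE_BITS_PER_SEC = 10e9
ETHERNET_OVERHEAD = 14 + 4 + 8 + 12   # header + FCS + preamble + inter-frame gap
TCP_IP_HEADERS = 20 + 20              # IPv4 header + TCP header


def goodput_mbytes_per_sec(mtu):
    """Upper bound on TCP payload throughput in MBytes/sec (10^6 bytes) for an MTU."""
    payload_bytes = mtu - TCP_IP_HEADERS
    wire_bytes = mtu + ETHERNET_OVERHEAD
    return LINE_RATE_BITS_PER_SEC / 8 * payload_bytes / wire_bytes / 1e6


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: about {goodput_mbytes_per_sec(mtu):.0f} MBytes/sec of payload")
```

Under these assumptions the ceiling at a 1500-byte MTU is a little under 1.2 GBytes/sec, so a 1 GBytes/sec target sits close to what a single 10GbE link can carry.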

A lot of hard work went into reaching 1 GBytes/sec on one NAS node. The key metrics from a network performance perspective are throughput, CPU efficiency, and latency. Packet rate is not a key metric because storage IOPS are much lower than network packets/sec. I'll explain how the tuning decisions were made.

CPU Efficiency

Networking is CPU intensive, and leaving enough CPU cycles to do other work is important for NAS. CPU efficiency is the amount of work done divided by the amount of CPU used to do it. If the same amount of work is done using less CPU, then it's more CPU efficient and more desirable. CPU efficiency for networking can be measured in bytes/cycle, i.e. the number of bytes transferred divided by the CPU cycles spent. Roch has written a DTrace script that measures the inverse of bytes/cycle, namely cycles/byte (lower is better), and it is used to help make tuning decisions.
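The cycles/byte metric itself is simple arithmetic. The sketch below is not Roch's DTrace script; it only shows how the number could be derived from coarse figures such as CPU utilization and throughput. The function names and all the input values in the example are hypothetical.

```python
# Minimal sketch of the cycles/byte arithmetic (not the DTrace script
# mentioned above). All example inputs are hypothetical.

def cycles_per_byte(cpu_utilization, n_cpus, clock_hz, bytes_per_sec):
    """CPU cycles consumed per byte transferred; lower means more efficient."""
    if bytes_per_sec <= 0:
        raise ValueError("bytes_per_sec must be positive")
    busy_cycles_per_sec = cpu_utilization * n_cpus * clock_hz
    return busy_cycles_per_sec / bytes_per_sec


def bytes_per_cycle(cpu_utilization, n_cpus, clock_hz, bytes_per_sec):
    """The inverse metric; higher means more efficient."""
    return 1.0 / cycles_per_byte(cpu_utilization, n_cpus, clock_hz, bytes_per_sec)


if __name__ == "__main__":
    # Hypothetical system: 8 cores at 2.3 GHz, 50% busy, moving 1e9 bytes/sec.
    cost = cycles_per_byte(0.5, 8, 2.3e9, 1e9)
    print(f"{cost:.1f} cycles/byte")
```

Comparing two tunings at the same throughput then comes down to comparing their cycles/byte figures.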

Large Segment Offload

Large Segment Offload works by sending large buffers (up to 64 KBytes in Solaris) down the protocol stack and letting a lower layer split them into packets, instead of fragmenting at the IP layer. Hardware LSO relies on the network interface card to split the buffer into separate packets, while software LSO relies on the network driver to do so. The 10GbE driver for Sun Storage 7000 Unified Storage Systems, called nxge, supports software LSO, which helps CPU efficiency and throughput for transmit. LSO is enabled by default for nxge, and may be incorporated later for GigE (e1000g).
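To make the saving concrete, here is a minimal sketch of the split that LSO defers to the driver or NIC. It only illustrates the idea; it is not nxge code, and the 1460-byte segment size is an assumption (a 1500-byte MTU minus 40 bytes of TCP/IPv4 headers).

```python
# Illustrative sketch of the splitting step that LSO pushes down to the
# driver (software LSO) or the NIC (hardware LSO). Not actual driver code.

def segment(buffer, mss):
    """Split one large buffer into MSS-sized payload chunks, preserving order."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [buffer[i:i + mss] for i in range(0, len(buffer), mss)]


if __name__ == "__main__":
    big_buffer = bytes(64 * 1024)        # one 64 KByte buffer handed down the stack
    chunks = segment(big_buffer, 1460)   # assumed MSS for a 1500-byte MTU
    print(f"1 trip down the stack instead of {len(chunks)}")
```

With these numbers the upper layers handle one buffer where they would otherwise handle dozens of packets, which is where the transmit-side CPU saving comes from.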

Jumbo Frames 

If your network infrastructure allows jumbo frames, using them improves CPU efficiency, because each packet carries more data for roughly the same per-packet processing cost. Jumbo frames are enabled by default, as auto-negotiation for MTU between client and server should just work.
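The effect on per-packet work can be estimated with simple arithmetic. The sketch below assumes 40 bytes of TCP/IPv4 headers per packet with no options and a 9000-byte jumbo MTU; both are assumptions for illustration, and the function name is hypothetical.

```python
# Rough estimate of how many packets per second are needed to carry a given
# payload rate at different MTUs. Assumes 40 bytes of TCP/IPv4 headers.

TCP_IP_HEADERS = 40


def packets_per_sec(payload_bytes_per_sec, mtu):
    """Packets/sec needed to move the payload when every packet is full."""
    payload_per_packet = mtu - TCP_IP_HEADERS
    if payload_per_packet <= 0:
        raise ValueError("mtu too small to carry any payload")
    return payload_bytes_per_sec / payload_per_packet


if __name__ == "__main__":
    target = 1e9  # 1 GBytes/sec of payload
    standard = packets_per_sec(target, 1500)
    jumbo = packets_per_sec(target, 9000)
    print(f"MTU 1500: {standard:,.0f} packets/sec")
    print(f"MTU 9000: {jumbo:,.0f} packets/sec ({standard / jumbo:.1f}x fewer)")
```

Since much of the networking cost is paid per packet rather than per byte, cutting the packet count by roughly a factor of six frees a meaningful share of CPU.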

Interrupts vs. Soft Rings 

At 10GbE speed a NIC can generate more interrupts than a single CPU can handle. Solaris provides two mechanisms to offload the interrupt CPU so that it does not become a performance bottleneck: interrupting multiple CPUs using MSI/X (Message Signaled Interrupt and its successor, MSI-X), and soft rings. Interrupting multiple CPUs relies on the NIC having multiple DMA channels: the interrupts from each DMA channel can be directed to a different CPU, so packet receive processing is done in interrupt context on multiple CPUs. Soft rings do not rely on the NIC hardware having multiple DMA channels; instead, the interrupt service routine fans packets out to kernel threads for receive processing. Three combinations were compared for the NFS workload: MSI-X only, soft rings only, and MSI-X + soft rings. The results are as expected: MSI-X only provides the best CPU efficiency (~8% better) at a similar throughput.
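The fan-out idea behind soft rings can be sketched as follows. This is only an illustration of the concept, not how Solaris implements it: the hash (CRC32 over the connection 4-tuple), the function name, and the addresses in the example are assumptions chosen for clarity.

```python
# Conceptual sketch of fanning packets out to rings by connection.
# The hash and names are illustrative assumptions, not the Solaris code.
import zlib


def pick_ring(src_ip, src_port, dst_ip, dst_port, n_rings):
    """Map a connection to a ring index; one connection always maps to one ring."""
    if n_rings <= 0:
        raise ValueError("n_rings must be positive")
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_rings


if __name__ == "__main__":
    # Hypothetical clients talking to an NFS server on port 2049.
    for port in range(50000, 50004):
        ring = pick_ring("10.0.0.1", port, "10.0.0.2", 2049, n_rings=4)
        print(f"client port {port} -> ring {ring}")
```

Keeping each connection on a single ring preserves packet order within that connection while still spreading different connections across CPUs.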

Number of Transmit DMA Channels

One anomaly that stood out in our performance study is that CPU efficiency for nxge drops significantly as the number of clients increases. mpstat shows a significant increase in context switches, and CPU performance counters show HyperTransport utilization above the recommended 60%. Although multiple transmit DMA channels (aka TX rings, as they are ring buffers) are needed for maximum packet rate, reducing the number of TX rings reduces context switches and thread migration. Experiments show that 1 TX ring limits throughput, while 2 TX rings reach maximum throughput and improve CPU efficiency by up to 50%.

Configuration   conn   Gbits/s
all TX rings    12     4.9       11.1
                40     5.8       15.7
2 TX rings      12     6.3       9.0
                40     8.2       10.8
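The throughput side of this comparison can be worked out directly from the Gbits/s figures above. The snippet is only arithmetic on those four numbers; the function and variable names are made up for the example.

```python
# Relative throughput gain from limiting nxge to 2 TX rings, computed from
# the Gbits/s figures in the table above.

def relative_gain(before, after):
    """Fractional improvement of `after` over `before`."""
    return (after - before) / before


# connections -> (Gbits/s with all TX rings, Gbits/s with 2 TX rings)
throughput_gbits = {12: (4.9, 6.3), 40: (5.8, 8.2)}

if __name__ == "__main__":
    for conn, (all_rings, two_rings) in throughput_gbits.items():
        gain = relative_gain(all_rings, two_rings)
        print(f"{conn} connections: {gain:.0%} more throughput with 2 TX rings")
```

The gain is larger at 40 connections than at 12, which matches the observation that the problem grows with the number of clients.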


Latency matters for storage clients because it can become a limiting factor for client throughput. One important component of network latency is interrupt blanking: instead of interrupting the CPU immediately, a NIC can delay generating an interrupt so that if more packets arrive during that interval, they can be processed together, which is more CPU efficient. Reducing the interrupt blanking interval to the minimum value for nxge (~25us) improves throughput by up to 30% for 8 or fewer connections; CPU cost (cycles/byte) can increase by up to 10%.
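Why latency caps throughput when there are few connections can be seen with a toy model. It assumes each connection issues one request at a time and waits for the reply, so its throughput is the I/O size divided by the round-trip latency. Apart from the ~25us minimum mentioned above, every number here is hypothetical, and the model ignores everything except a fixed base latency plus the blanking delay.

```python
# Toy model: synchronous request/response clients, where added interrupt
# delay directly stretches each round trip. All inputs except the ~25us
# minimum blanking interval are hypothetical.

def sync_throughput(io_bytes, base_latency_s, blanking_s, connections=1):
    """Bytes/sec when each connection keeps exactly one I/O in flight."""
    round_trip = base_latency_s + blanking_s
    if round_trip <= 0:
        raise ValueError("round-trip latency must be positive")
    return connections * io_bytes / round_trip


if __name__ == "__main__":
    io_size = 8192      # hypothetical 8 KByte reads
    base = 200e-6       # hypothetical latency excluding blanking
    for blanking in (100e-6, 25e-6):
        rate = sync_throughput(io_size, base, blanking, connections=8)
        print(f"blanking {blanking * 1e6:.0f}us: {rate / 1e6:.0f} MBytes/sec")
```

With many connections the link saturates regardless, which is consistent with the benefit showing up mainly at low connection counts.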



