Wednesday May 27, 2009

Out of box Network Performance on Nehalem servers

With the OpenSolaris 2009.06 release coming up, how does its networking perform on Intel Nehalem servers compared to Linux? Customers who have characterized their workload and found that networking may be the limiting factor often ask "Can a system achieve X Gbit/s?" or "Can a system send/receive Y packets/sec?" I'll take a look at out-of-box performance using uperf, a micro-benchmark tool that shows the server's capabilities.

Benchmark Considerations

Intel Nehalem servers have sufficient CPU cycles and I/O bandwidth to drive one 10GbE port at line rate for large packet sizes. To see a meaningful difference between Solaris and Linux, I chose a four-port configuration that can show a difference in throughput. When PCIe gen. 2 10 Gigabit Ethernet is widely available, the four ports can come from two NICs and take only two PCIe slots. Since I use PCIe gen. 1 10 Gigabit Ethernet, the four ports come from four NICs so that PCIe bandwidth does not become the bottleneck.
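
To make the PCIe reasoning concrete, here is a rough back-of-the-envelope sketch. It assumes each NIC sits in a x8 slot (the lane width is my assumption, it is not stated above) and it ignores PCIe packet overhead, so real usable bandwidth is lower than these figures. The function names are made up for illustration.

```python
# Usable PCIe bandwidth per direction after 8b/10b line encoding.
# Assumption: x8 slots; the lane width is not given in the post.
GT_PER_LANE = {"gen1": 2.5, "gen2": 5.0}  # gigatransfers/s per lane
ENCODING_EFFICIENCY = 8 / 10              # 8b/10b: 8 data bits per 10 line bits


def pcie_gbits(gen: str, lanes: int = 8) -> float:
    """Data bandwidth in Gbit/s, one direction, before PCIe packet overhead."""
    return GT_PER_LANE[gen] * ENCODING_EFFICIENCY * lanes


def ports_at_line_rate(gen: str, lanes: int = 8, port_gbits: float = 10.0) -> int:
    """Number of 10GbE ports one slot could feed at line rate (optimistic)."""
    return int(pcie_gbits(gen, lanes) // port_gbits)


if __name__ == "__main__":
    for gen in ("gen1", "gen2"):
        print(gen, pcie_gbits(gen), "Gbit/s ->", ports_at_line_rate(gen), "port(s)")
```

Under these assumptions a gen. 1 x8 slot cannot carry two 10GbE ports at line rate, while a gen. 2 x8 slot can, which is consistent with using one port per NIC here.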

Which Linux?

SUSE Linux Enterprise 11 is chosen because SUSE is widely used and it is based on a more recent Linux kernel than the latest Red Hat distro. Compared to SLES 10 or RHEL 5.3, SLES 11 includes TX multiqueue support, which helps TX scale better with more connections. For simplicity, no tuning is done for either Linux or OpenSolaris. Tuning Linux will improve some results.

Test Results

Nehalem TX throughput

Nehalem RX throughput

Engineering Notes

S10 Update 7 throughput drops significantly with more connections because of a high rate of interrupts. Tuning to interrupt more CPUs:

set ddi_msix_alloc_limit=4
set pcplusmp:apic_intr_policy=1
set pcplusmp:apic_msix_max=4
set pcplusmp:apic_multi_msi_max=4


improves performance to be more in line with OpenSolaris. TCP RX throughput more than doubles to 20.6 Gbit/s, and TCP TX throughput is 26.6 Gbit/s at 1000 connections.

Uperf Profiles

uperf uses XML to describe the workload. The profile that describes TCP TX throughput is here. The profile that describes TCP RX throughput is here. The profiles are written for four clients; if a different number of clients is used, modify the number of groups to match the number of clients.

Test Set-up

  • SUT: Sun Fire X4270, dual socket Intel Nehalem @ 2.66 GHz. Hyper-threading is on (i.e. the BIOS default was changed). 24 GB of memory.
  • OS: OpenSolaris 2009.06; Solaris 5/09; SUSE Linux Enterprise 11 ( kernel)
  • Four X1106A-Z Sun 10 GbE with Intel 82598EB 10 GbE Controller installed on SUT. ixgbe driver version compiled on SUSE 11.
  • Clients: Four Sun Fire X4150, dual socket Intel Xeon @ 3.16 GHz, 8 GB main memory. Three of the X4150s run Solaris Nevada 109 with the Sun Multithreaded 10 GbE Networking Card. One X4150 runs SUSE 10 SP2 with X1106A-Z Sun 10 GbE with Intel 82598EB 10 GbE Controller. Each 10 GbE port is connected back-to-back with one 10GbE port on the SUT.
    • Tuning for Solaris clients:
      set ddi_msix_alloc_limit=8
      set pcplusmp:apic_multi_msi_max=8
      set pcplusmp:apic_msix_max=8
      set pcplusmp:apic_intr_policy=1
      set nxge:nxge_msi_enable=2
    • /kernel/drv/nxge.conf:
      soft-lso-enable = 1;

Monday Nov 10, 2008

Network Performance on Open Storage Appliance

The Sun Storage 7000 Unified Storage Systems launched this Monday. This is the industry's first open storage appliance and will deliver breakthrough performance, speed and simplicity. I will share an inside view of how network performance is optimized for it. Most of the optimizations are generic and can be applied to other storage systems running Solaris.

1 GBytes/sec and Beyond

Early on there was the question of whether networking performance work should focus on 10GbE or GigE. If the system can sustain 1 GBytes/sec throughput or more, then it makes sense to focus on 10GbE performance first, and multiple GigE performance will be easy. 1 GBytes/sec is a nice round number to remember. Can it be done? Using uperf as a micro-benchmark tool to model the NFS workload, throughput on a 2-socket quad-core AMD Barcelona system can reach 900 MBytes/s, and a 4-socket quad-core Barcelona system can reach 10GbE line rate, so the answer is 'yes'. Results from uperf were later validated with NFS cached read throughput that exceeds 1 GBytes/sec. Later work focused on reducing CPU utilization and latency while not missing the 1 GBytes/sec mark.

A lot of hard work goes into reaching 1 GBytes/sec on one NAS node. The key metrics from a network performance perspective are throughput, CPU efficiency, and latency. Packet rate is not a key metric because IOPS for storage are much lower than packets/sec for the network. I'll explain how the tuning decisions are made.

CPU Efficiency

Networking is CPU intensive, and leaving enough CPU cycles to do other work is important for NAS. CPU efficiency measures the amount of work done divided by the amount of CPU used to do the work. If the same amount of work is done using less CPU, then it's more CPU efficient and more desirable. CPU efficiency for networking can be measured in bytes/cycle, i.e. the number of bytes transferred divided by the CPU cycles used. Roch has written a DTrace script that measures the inverse, cycles/byte, and it is used to help make tuning decisions.
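
As a rough illustration of the metric (this is not Roch's DTrace script, just a coarse estimate from throughput and CPU utilization), cycles/byte can be approximated as below. The function name and the numbers in the usage line are hypothetical.

```python
def cycles_per_byte(throughput_bytes_per_s: float,
                    cpu_busy_fraction: float,
                    n_cpus: int,
                    clock_hz: float) -> float:
    """Approximate CPU cost of moving one byte.

    cpu_busy_fraction is the average busy fraction across all CPUs (0.0 to 1.0),
    for example derived from mpstat. Lower results mean better CPU efficiency.
    """
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    busy_cycles_per_s = cpu_busy_fraction * n_cpus * clock_hz
    return busy_cycles_per_s / throughput_bytes_per_s


if __name__ == "__main__":
    # Hypothetical system: 1 GByte/s with 8 CPUs at 2.3 GHz, 50% busy on average.
    print(cycles_per_byte(1e9, 0.5, 8, 2.3e9))
```

Bytes/cycle is simply the reciprocal, so either form can be used to compare two tunings at similar throughput.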

Large Segment Offload

Large Segment Offload works by sending large buffers (up to 64 KBytes in Solaris) down the protocol stack and letting a lower layer split them into packets, instead of fragmenting at the IP layer. Hardware LSO relies on the network interface card to split the buffer into separate packets, while software LSO relies on the network driver to do so. The 10GbE interface for Sun Storage 7000 Unified Storage Systems (the 10GbE driver is called nxge) supports software LSO, which helps CPU efficiency and throughput for transmit. LSO is enabled by default for nxge, and may be incorporated later for GigE (e1000g).
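
The splitting step itself is simple; the benefit comes from doing it once at the bottom of the stack rather than walking the stack once per packet. A minimal sketch, assuming a 1460-byte segment size (standard 1500-byte MTU minus 40 bytes of IPv4 and TCP headers) and ignoring header construction and checksums; `segment` is an illustrative name, not a driver function:

```python
def segment(buffer: bytes, mss: int = 1460) -> list[bytes]:
    """Split one large send buffer into MSS-sized payloads, as LSO does late in the stack."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [buffer[i:i + mss] for i in range(0, len(buffer), mss)]


if __name__ == "__main__":
    big = bytes(64 * 1024)  # one 64 KByte buffer handed down the stack
    pieces = segment(big)
    print(len(pieces), "segments, last one", len(pieces[-1]), "bytes")
```

With these assumptions one trip down the stack replaces dozens of per-packet trips, which is where the transmit CPU savings come from.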

Jumbo Frames 

If your network infrastructure allows jumbo frames, using them helps improve CPU efficiency. Jumbo frames are enabled by default, as auto-negotiation for MTU between client and server should just work.
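
A quick calculation shows why larger frames help. The sketch below assumes a 9000-byte jumbo MTU (a common value, not stated above), IPv4 and TCP headers without options, and standard Ethernet framing overhead; the function names are illustrative.

```python
ETH_OVERHEAD = 38    # bytes per frame: preamble+SFD 8, header 14, FCS 4, inter-frame gap 12
IP_TCP_HEADERS = 40  # bytes per packet: IPv4 20 + TCP 20, no options


def frames_per_second(mtu: int, link_bits_per_s: float = 10e9) -> float:
    """Frames per second needed to fill the link with full-sized frames."""
    return link_bits_per_s / 8 / (mtu + ETH_OVERHEAD)


def payload_efficiency(mtu: int) -> float:
    """Fraction of wire bytes that carry TCP payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(mtu, round(frames_per_second(mtu)), round(payload_efficiency(mtu), 4))
```

Fewer frames per second for the same data means less per-packet work in the stack, which is the CPU efficiency gain; the payload share of the wire also rises slightly.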

Interrupts vs. Soft Rings 

At 10GbE speed a NIC can generate more interrupts than a single CPU can handle. Solaris provides two mechanisms to offload the interrupt CPU so that it does not become a performance bottleneck: interrupting multiple CPUs using MSI/X (Message Signaled Interrupts and its successor, MSI-X), and soft rings. Interrupting multiple CPUs relies on the multiple DMA channels on a NIC to generate interrupts; the interrupts from each DMA channel can be directed to a different CPU, so packet receive processing is done in interrupt context on multiple CPUs. Soft rings don't rely on the NIC hardware having multiple DMA channels; instead, the interrupt service routine fans out packets to kernel threads for receive processing. Three combinations are compared for the NFS workload: MSI-X only, soft rings only, and MSI-X + soft rings. The results are as expected: MSI-X only provides the best CPU efficiency (~8% better) at a similar throughput.
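
The fan-out idea behind soft rings can be sketched as follows. This is only an illustration of the concept, not the Solaris implementation: the hash function (CRC32), the use of the connection 4-tuple as the key, and the names `pick_ring` and `fan_out` are all my assumptions. Keying on the connection keeps packets from one connection on the same ring, so they are processed in order.

```python
import zlib
from collections import defaultdict


def pick_ring(src_ip: str, src_port: int, dst_ip: str, dst_port: int, n_rings: int) -> int:
    """Map a connection to one ring; the same 4-tuple always gives the same ring."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_rings


def fan_out(packets, n_rings: int):
    """Distribute (src_ip, src_port, dst_ip, dst_port, payload) tuples across rings."""
    rings = defaultdict(list)
    for pkt in packets:
        rings[pick_ring(*pkt[:4], n_rings)].append(pkt)
    return rings


if __name__ == "__main__":
    pkts = [("10.0.0.%d" % (i % 4), 1000 + i % 4, "10.0.1.1", 2049, b"x") for i in range(12)]
    for ring, queued in sorted(fan_out(pkts, 4).items()):
        print("ring", ring, "holds", len(queued), "packets")
```

In this picture each ring would be drained by its own kernel thread, whereas with MSI-X the NIC's DMA channels do the spreading in hardware and processing stays in interrupt context.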

Number of Transmit DMA Channels

One anomaly that stood out in our performance study is that CPU efficiency for nxge drops significantly as the number of clients increases. mpstat shows a significant increase in context switches, and CPU performance counters show HyperTransport utilization above the recommended 60%. Although multiple transmit DMA channels (aka TX rings, as they are ring buffers) are needed for maximum packet rate, reducing the number of TX rings reduces context switches and thread migrations. Experiments show that 1 TX ring limits throughput, while 2 TX rings reach maximum throughput and improve CPU efficiency by up to 50%.

Configuration   conn   Gbits/s
all TX rings    12     4.9       11.1
all TX rings    40     5.8       15.7
2 TX rings      12     6.3       9.0
2 TX rings      40     8.2       10.8
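
To put the throughput column in relative terms, a small sketch that computes the gain from moving to 2 TX rings at each connection count. The values are copied from the Gbits/s column above; the dictionary layout and function name are just for illustration.

```python
# (configuration, connections) -> throughput in Gbit/s, from the table above
THROUGHPUT_GBITS = {
    ("all TX rings", 12): 4.9,
    ("all TX rings", 40): 5.8,
    ("2 TX rings", 12): 6.3,
    ("2 TX rings", 40): 8.2,
}


def percent_gain(conns: int) -> float:
    """Throughput gain of 2 TX rings over all TX rings, in percent."""
    base = THROUGHPUT_GBITS[("all TX rings", conns)]
    tuned = THROUGHPUT_GBITS[("2 TX rings", conns)]
    return (tuned - base) / base * 100.0


if __name__ == "__main__":
    for conns in (12, 40):
        print(conns, "connections:", round(percent_gain(conns), 1), "% more throughput")
```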


Latency matters for storage clients because it can become a limiting factor for client throughput. One important component of network latency is interrupt blanking: instead of interrupting the CPU immediately, a NIC can delay generating an interrupt so that packets arriving during that interval are processed together, which is more CPU efficient. Reducing the interrupt interval to the minimum value for nxge (~25 us) improves throughput by up to 30% for 8 or fewer connections; CPU cost (cycles/byte) can increase by up to 10%.
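
The trade-off can be seen with a simplified model: the NIC raises at most one interrupt per blanking interval, so a longer interval batches more packets per interrupt but can hold a packet for up to one interval before the CPU sees it. This is a sketch of that idealized model only, not of how nxge actually schedules interrupts, and the function names are made up.

```python
def interrupts_per_second(packet_rate: float, blanking_interval_s: float) -> float:
    """Upper bound on interrupt rate: one per interval, never more than one per packet."""
    if blanking_interval_s <= 0:
        return packet_rate
    return min(packet_rate, 1.0 / blanking_interval_s)


def packets_per_interrupt(packet_rate: float, blanking_interval_s: float) -> float:
    """Average batch size handled per interrupt under the same model."""
    if packet_rate <= 0:
        return 0.0
    return packet_rate / interrupts_per_second(packet_rate, blanking_interval_s)


if __name__ == "__main__":
    interval = 25e-6  # the ~25 us minimum mentioned above
    for rate in (1_000, 100_000, 800_000):
        print(rate, "pkt/s ->",
              round(interrupts_per_second(rate, interval)), "intr/s,",
              round(packets_per_interrupt(rate, interval), 1), "pkt/intr")
```

At low packet rates, as with a handful of connections, each packet gets its own interrupt either way, so the interval mostly adds latency; that matches the observation that a short interval helps most with few connections.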



