NFS Streaming over 10 GbE


NFS Tuning for HPC Streaming Applications


Overview:

I was recently working in a lab environment with the goal of setting up a Solaris 10 Update 8 (s10u8) NFS server that would be able to stream data to a small number of s10u8 NFS clients with the highest possible throughput for a High Performance Computing (HPC) application.  The workload varied over time: at some points it was read-intensive, while at other times it was write-intensive.  Regardless of direction, the application's I/O pattern was always "large block sequential I/O", which was easily modeled with a "dd" stream from one or several clients.

Due to business considerations, 10 gigabit Ethernet (10 GbE) was chosen for the network infrastructure.  It was necessary not only to install appropriate server, network and I/O hardware, but also to tune each subsystem.  I wish it were more obvious whether a gigabit implies 1024^3 or 1000^3 bits.  In either case, one might naively assume that the connection should be able to reach NFS speeds of 1.25 gigabytes per second; my goal was to achieve NFS end-to-end throughput close to 1.0 gigabytes per second.
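
As a quick back-of-the-envelope check (my own arithmetic, assuming the decimal definition of a gigabit), the conversion from line rate to bytes per second is simply:

$ # 10 Gb/sec line rate divided by 8 bits per byte = 1.25 GB/sec (decimal units)
$ echo 'scale=2; (10 * 10^9) / 8 / 10^9' | bc
1.25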

Hardware:

A network with the following components worked well:

  • Sun Fire X4270 servers
    • Intel Xeon X5570 CPU's @ 2.93 GHz
    • Solaris 10 Update 8
  • a network, consisting of:
    • Force10 S2410 Data Center 10 GbE Switch
    • 10-Gigabit Ethernet PCI Express Ethernet Controller, either:
      • 375-3586 (aka Option X1106A-Z) with the Intel 82598EB chip (ixgbe driver), or
      • 501-7283 (aka Option X1027A-Z) with the "Neptune" chip (nxge driver)
  • an I/O system on the NFS server:
    • 375-3487 Sun StorageTek PCI Express SAS 8-Channel HBAs (SG-XPCIE8SAS-E-Z, LSI-based disk controllers)
    • Storage, either
      • Sun Storage J4400 Arrays, or
      • Sun Storage F5100 Flash Array

This configuration, based on recent hardware, was able to reach close to full line speed.  In contrast, a slightly older server with Intel Xeon E5440 CPUs @ 2.83 GHz was not able to reach full line speed.

The application's I/O pattern is large block sequential and known to be 4K aligned, so the Sun Storage F5100 Flash Array is a good fit.  Sun does not recommend this device for general purpose NFS storage.

Network

When the hardware was initially installed, rather than immediately measuring NFS performance, the individual network and I/O subsystems were tested.  To measure the network performance, I used netperf. I found that the "out of the box" s10u8 performance was below my expectations; it seems that the Solaris "out of the box" settings are better suited to a web server with a large number of potentially slow (WAN) connections.  To get the network humming for a large block LAN workload I made several changes:

a) The TCP Sliding Window settings in /etc/default/inetinit

ndd -set /dev/tcp tcp_xmit_hiwat  1048576
ndd -set /dev/tcp tcp_recv_hiwat  1048576
ndd -set /dev/tcp tcp_max_buf    16777216
ndd -set /dev/tcp tcp_cwnd_max    1048576
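
To confirm that the settings took effect, the values can be read back with "ndd -get" (a quick sanity check; the output assumes the settings above were applied):

# ndd -get /dev/tcp tcp_xmit_hiwat
1048576
# ndd -get /dev/tcp tcp_recv_hiwat
1048576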


b) The network interface card "NIC" settings, depending on the card:
/kernel/drv/ixgbe.conf
default_mtu=8150;
tx_copy_threshold=1024;


/platform/i86pc/kernel/drv/nxge.conf
accept_jumbo = 1;
soft-lso-enable = 1;
rxdma-intr-time=1;
rxdma-intr-pkts=8;

/etc/system
* From http://www.solarisinternals.com/wiki/index.php/Networks
* For ixgbe or nxge
set ddi_msix_alloc_limit=8
* For nxge
set nxge:nxge_bcopy_thresh=1024
set pcplusmp:apic_multi_msi_max=8
set pcplusmp:apic_msix_max=8
set pcplusmp:apic_intr_policy=1
set nxge:nxge_msi_enable=2

c) Some seasoning :-)

* Added to /etc/system on S10U8 x64 systems based on
*   http://www.solarisinternals.com/wiki/index.php/Networks (Nov 18, 2009)
* For few TCP connections
set ip:tcp_squeue_wput=1
* Bursty
set hires_tick=1

d) Make sure that you are using jumbo frames. I used an MTU of 8150, which I know made both the NICs and the switch happy.  Maybe I should have tried a slightly more aggressive setting of 9000.

/etc/hostname.nxge0
192.168.2.42 mtu 8150

/etc/hostname.ixgbe0
192.168.1.44 mtu 8150
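
To confirm that an interface actually came up with the larger MTU, check it with ifconfig:

# Look for "mtu 8150" in the flags line of the output
ifconfig nxge0
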
e) Verifying the MTU with ping and snoop.  Some ping implementations include a flag that lets the user set the "don't fragment" (DF) bit, which is very useful for verifying that the MTU is properly set.  With the ping implementation that ships with s10u8, you can't set the DF bit.  To verify the MTU, use snoop to see whether large pings are fragmented:

server# snoop -r -d nxge0 192.168.1.43
Using device nxge0 (promiscuous mode)

// Example 1: An 8000 byte ping is not fragmented
client% ping -s 192.168.1.43 8000 1
PING 192.168.1.43: 8000 data bytes
8008 bytes from server10G-43 (192.168.1.43): icmp_seq=0. time=0.370 ms


192.168.1.42 -> 192.168.1.43 ICMP Echo request (ID: 14797 Sequence number: 0)
192.168.1.43 -> 192.168.1.42 ICMP Echo reply (ID: 14797 Sequence number: 0)

// Example 2: A 9000 byte ping is broken into 2 packets in both directions
client% ping -s 192.168.1.43 9000 1
PING 192.168.1.43: 9000 data bytes
9008 bytes from server10G-43 (192.168.1.43): icmp_seq=0. time=0.383 ms


192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32355 Offset=0    MF=1 TOS=0x0 TTL=255
192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32355 Offset=8128 MF=0 TOS=0x0 TTL=255
192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49788 Offset=0    MF=1 TOS=0x0 TTL=255
192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49788 Offset=8128 MF=0 TOS=0x0 TTL=255

// Example 3: A 32000 byte ping is broken into 4 packets in both directions
client% ping -s 192.168.1.43 32000 1
PING 192.168.1.43: 32000 data bytes
32008 bytes from server10G-43 (192.168.1.43): icmp_seq=0. time=0.556 ms


192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32356 Offset=0    MF=1 TOS=0x0 TTL=255
192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32356 Offset=8128 MF=1 TOS=0x0 TTL=255
192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32356 Offset=16256 MF=1 TOS=0x0 TTL=255
192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32356 Offset=24384 MF=0 TOS=0x0 TTL=255
192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49789 Offset=0    MF=1 TOS=0x0 TTL=255
192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49789 Offset=8128 MF=1 TOS=0x0 TTL=255
192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49789 Offset=16256 MF=1 TOS=0x0 TTL=255
192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49789 Offset=24384 MF=0 TOS=0x0 TTL=255



f) Verification: after the network tuning was complete, the network had very impressive performance with either the nxge or the ixgbe driver.  The end-to-end measurement reported by netperf of 9.78 Gb/sec is very close to full line speed and indicates that the switch, network interface cards, drivers and Solaris system call overhead are minimally intrusive.

$ /usr/local/bin/netperf -fg -H 192.168.1.43 -tTCP_STREAM -l60
TCP STREAM TEST from ::ffff:0.0.0.0 (0.0.0.0) port 0 AF_INET to ::ffff:192.168.1.43 (192.168.1.43) port 0 AF_INET

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^9bits/sec

1048576 1048576 1048576    60.56       9.78

$ /usr/local/bin/netperf -fG -H 192.168.1.43 -tTCP_STREAM -l60
TCP STREAM TEST from ::ffff:0.0.0.0 (0.0.0.0) port 0 AF_INET to ::ffff:192.168.1.43 (192.168.1.43) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    GBytes/sec

1048576 1048576 1048576    60.00       1.15

Both tests above show about the same throughput, expressed in different units.  With the network performance tuning above, the system can achieve slightly less than 10 gigabits per second, which is slightly more than 1 gigabyte per second.


g) Observability: I found "nicstat" (download link at http://www.solarisinternals.com/wiki/index.php/Nicstat) to be a very valuable tool for observing network performance.  To compare the network performance with the running application against synthetic tests, I found that it was useful to graph the "nicstat" output. (see http://blogs.sun.com/taylor22/entry/graphing_solaris_performance_stats)  You can verify that the Jumbo Frames MTU is working as expected by checking that the average packet payload is big by dividing Bytes-per-sec / Packets-per-sec to get average bytes-per-packet.
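
As a rough sketch of that bytes-per-packet check (assuming nicstat's default column order of rKB/s, wKB/s, rPk/s and wPk/s after the time and interface columns), an awk one-liner like this can post-process the output; some nicstat versions also print the average packet size (rAvs/wAvs) directly:

nicstat -i nxge0 5 | awk '$3 ~ /^[0-9.]+$/ {
    if ($5 > 0) printf "%s  read: %.0f bytes/pkt  ", $2, $3 * 1024 / $5;
    if ($6 > 0) printf "write: %.0f bytes/pkt", $4 * 1024 / $6;
    print "" }'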


Network Link Aggregation for load spreading


I initially hoped to use link aggregation to get a 20GbE NFS stream using 2 X 10GbE ports on the NFS server and 2 X 10GbE ports on the NFS client.  I hoped that packets would be distributed to the link aggregation group in a "per packet round robin" fashion.  What I found is that, regardless of whether the negotiation is based on L2, L3 or L4, LACP will negotiate port pairs based on a source/destination mapping, so that each stream of packets will only use one specific port from a link aggregation group. 

Link aggregation can be useful in spreading multiple streams over ports, but the streams will not necessarily be evenly divided across ports.  The distribution of data over the ports in a link aggregation group can be viewed with "nicstat".
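
For reference, on s10u8 an aggregation is built and inspected with dladm, along these lines (a sketch with hypothetical device names and key 1; the -P policy only changes how streams are hashed onto ports, not the one-port-per-stream behavior described above):

# dladm create-aggr -P L4 -d nxge0 -d nxge1 1
# dladm show-aggr 1
# dladm show-aggr -s -i 5 1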

After reviewing the literature, I concluded that it is best to use IPMP for failover and link aggregation for load spreading; "link aggregation" has finer control over load spreading than IPMP.  Comparing IPMP and link aggregation:
  • IPMP can be used to protect against switch/router failure because each NIC can be connected to a different switch and therefore can protect against either NIC or switch failure.
  • With Link Aggregation, all of the ports in the group must be connected to a single switch/router, and that switch/router must support Link Aggregation Control Protocol (LACP), so there is no protection against switch failure.
With the Force10 switch that I tested, I was disappointed that the LACP algorithm was not doing a good job of spreading inbound packets to my hot server.  Again, once the switch mapped a client to one of the ports in the hot server's link group, it stuck, so it was not unusual for several clients to be banging hard on one port while another port sat idle.

Multiple subnets for network load spreading

After trying several approaches to load spreading with s10u8, I chose to use multiple subnets.  (If I had been using Nevada 107 or newer, which includes Project Clearview, I might have come to a different conclusion.)  In the end, I decided that the best solution was an old-fashioned approach using a single common management subnet combined with multiple data subnets:

  • All of the machines were able to communicate with each other on a slower "management network"; specifically, Sun Grid Engine jobs were managed on the slower network.
  • The clients were partitioned into a small number of "data subnets".
  • The "hot" NFS data server had multiple NIC's, with each NIC on a separate "data subnet".
  • A limitation of this approach is that clients in one subnet partition only have a low bandwidth connection to clients in different subnet partitions. This was OK for my project.
  • The advantage of manually preallocating the port distribution was that my benchmark was more deterministic.  I did not get overloaded ports in a seemingly random pattern.


Disk IO: Storage tested

Configurations that had sufficient bandwidth for this environment included:

  • A Sun Storage F5100 Flash Array using 80 FMods in a ZFS RAID 0 stripe to create a 2TB volume. 
  • A "2 X Sun Storage J4400 Arrays" JBOD configuration with a ZFS RAID 0 stripe
  • A "4 X Sun Storage J4400 Arrays" configuration with a ZFS RAID 1+0 mirrored and striped


Disk I/O: SAS HBA's

The Sun Storage F5100 Flash Array was connected to the Sun Fire X4270 server using 4 PCIe 375-3487 Sun StorageTek PCI Express SAS 8-Channel HBAs (SG-XPCIE8SAS-E-Z, LSI-based disk controllers) so that each F5100 domain of 20 FMods used an independent HBA.  Using 4 SAS HBAs gives 20% better theoretical throughput than using 2 SAS HBAs:

  • A fully loaded Sun Storage F5100 Flash Array with 80 FMods has 4 domains with 20 FMods per domain
  • Each  375-3487 Sun StorageTek PCI Express SAS 8-Channel HBA (SG-XPCIE8SAS-E-Z, LSI-based disk controller) has
    • two 4x wide SAS ports
    • an 8x wide PCIe bus connection
  • Each F5100 domain to SAS HBA connection, over a single 4x wide SAS port, will have a maximum half duplex speed of (3 Gb/sec * 4) = 12 Gb/sec =~ 1.2 GB/sec per F5100 domain
  • PCI Express x8 (half duplex) = 2 GB/sec
  • A full F5100 (4 domains) connected using 2 SAS HBA's would be limited by PCIe to 4.0 GB/Sec
  • A full F5100 (4 domains) connected using 4 SAS HBA's would be limited by SAS to 4.8 GB/Sec.
  • Therefore a full Sun Storage F5100 Flash Array has 20% theoretically better throughput when connected using 4 SAS HBA's rather than 2 SAS HBA's.

The "mirrored and striped" configuration using 4 X Sun Storage J4400 Arrays was connected using 3 PCIe 375-3487 Sun StorageTek PCI Express SAS 8-Channel HBAs (SG-XPCIE8SAS-E-Z, LSI-based disk controllers):

[Figure: SAS wiring for 4 JBOD trays]

Multipathing (MPXIO)

MPXIO was used in the "mirrored and striped" 4 X Sun Storage J4400 Array configuration so that, for every disk in the JBOD configuration, I/O could be requested by either of the 2 SAS cards connected to the array.  To eliminate any single point of failure, I chose to mirror all of the drives in one tray with drives in another tray, so that any tray could be removed without losing data.

The command for creating a ZFS RAID 1+0 "mirrored and striped" volume out of MPXIO devices looks like this:

zpool create -f jbod \
mirror c3t5000C5001586BE93d0 c3t5000C50015876A6Ed0 \
mirror c3t5000C5001586E279d0 c3t5000C5001586E35Ed0 \
mirror c3t5000C5001586DCF2d0 c3t5000C50015825863d0 \
mirror c3t5000C5001584C8F1d0 c3t5000C5001589CEB8d0 \
...
It was a bit tricky to figure out which disks (i.e. "c3t5000C5001586BE93d0") were in which trays.  I ended up writing a surprisingly complicated Ruby script to choose devices to mirror.  This script worked for me.  Your mileage may vary.  Use at your own risk.

#!/usr/bin/env ruby -W0

# Build a map from each multipathed (STMS) device to its underlying paths.
all=`stmsboot -L`
Device = Struct.new( :path_count, :non_stms_path_array, :target_ports, :expander, :location)
Location = Struct.new( :device_count, :devices )

my_map=Hash.new
all.each_line {
  |s|
  if s =~ /\/dev/ then
    s2=s.split
    if !my_map.has_key?(s2[1]) then
      ## puts "creating device for #{s2[1]}"
      my_map[s2[1]] = Device.new(0,[],[])
    end
    ## puts my_map[s2[1]]
    my_map[s2[1]].path_count += 1
    my_map[s2[1]].non_stms_path_array.push(s2[0])
  else
    puts "no match on #{s}"
  end
}

my_map.each {
  |k,v|
  ##puts "key is #{k}"
  mpath_data=`mpathadm show lu #{k}`
  in_target_section=false
  mpath_data.each_line {
    |line|
    if !in_target_section then
      if line =~ /Target Ports:/ then
        in_target_section=true
      end
      next
    end
    if line =~ /Name:/ then
      my_map[k].target_ports.push(line.split[1])
      ##break
    end
  }
  ##puts "key is #{k} value is #{v}"
  ##puts k v.non_stms_path_array[0],  v.non_stms_path_array[1]
}

# Group devices by physical slot: use the first 14 characters of the target
# port name as the expander and the next two hex digits (mod 64) as the slot.
location_array=[]
location_map=Hash.new
my_map.each {
  |k,v|
  my_map[k].expander = my_map[k].target_ports[0][0,14]
  my_map[k].location = my_map[k].target_ports[0][14,2].hex % 64
  if !location_map.has_key?(my_map[k].location) then
    puts "creating entry for #{my_map[k].location}"
    location_map[my_map[k].location] = Location.new(0,[])
    location_array.push(my_map[k].location)
  end
  location_map[my_map[k].location].device_count += 1
  location_map[my_map[k].location].devices.push(k)

}

location_array.sort.each {
  |location|
  puts "mirror #{location_map[location].devices[0].gsub('/dev/rdsk/','')} #{location_map[location].devices[1].gsub('/dev/rdsk/','')} \\\\"
}


Separate ZFS Intent Logs?

Based on http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on , I ran some tests comparing separate intent logs on either disk or SSD (slogs) vs the default chained logs (clogs).  For the large block sequential workload, "tuning" the configuration by adding separate ZFS Intent Logs actually slowed the system down slightly. 
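
For completeness, a separate intent log is attached with "zpool add ... log"; a minimal sketch of the kind of configuration that was compared (device name hypothetical):

zpool add jbod log c4t2d0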

ZFS Tuning


/etc/system parameters for ZFS
* For ZFS
set zfs:zfetch_max_streams=64
set zfs:zfetch_block_cap=2048
set zfs:zfs_txg_synctime=1
set zfs:zfs_vdev_max_pending = 8
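
These /etc/system settings take effect at boot; the live values can be spot-checked on the running kernel with mdb, for example (a read-only check):

# echo 'zfs_vdev_max_pending/D' | mdb -k
zfs_vdev_max_pending:
zfs_vdev_max_pending:           8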


NFS Tuning

a) Kernel settings

Solaris /etc/system

* For NFS
set nfs:nfs3_nra=16
set nfs:nfs3_bsize=1048576
set nfs:nfs3_max_transfer_size=1048576

* Added to /etc/system on S10U8 x64 systems based on
*  http://www.solarisinternals.com/wiki/index.php/Networks (Nov 18, 2009)
* For NFS throughput
set rpcmod:clnt_max_conns = 8


b) Mounting the NFS filesystem

/etc/vfstab

192.168.1.5:/nfs - /mnt/nfs nfs - no vers=3,rsize=1048576,wsize=1048576
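
The equivalent one-off mount from the client command line, handy when experimenting with rsize/wsize before committing the options to /etc/vfstab, looks like:

# mount -F nfs -o vers=3,rsize=1048576,wsize=1048576 192.168.1.5:/nfs /mnt/nfs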

c) Verifying the NFS mount parameters

# nfsstat -m
/mnt/ar from 192.168.1.7:/export/ar
 Flags:         vers=3,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=1048576,wsize=1048576,retrans=5,timeo=600
 Attr cache:    acregmin=3,acregmax=60,acdirmin=30,acdirmax=60


Results:

With tuning, the Sun Storage J4400 Arrays via NFS achieved write throughput of 532 MB/Sec and read throughput of 780 MB/sec for a single 'dd' stream.

$ /bin/time dd if=/dev/zero of=/mnt/jbod/test-80g bs=2048k count=40960; umount /mnt/jbod; mount /mnt/jbod; /bin/time dd if=/mnt/jbod/test-80g of=/dev/null bs=2048k
40960+0 records in
40960+0 records out

real     2:33.7
user        0.1
sys      1:30.9
40960+0 records in
40960+0 records out

real     1:44.9
user        0.1
sys      1:04.0
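
The quoted rates follow directly from the elapsed times: 40960 records of 2 MiB is 81920 MiB, so (my own arithmetic):

$ echo 'scale=0; 81920 / 153.7' | bc    # write, 2:33.7 elapsed
532
$ echo 'scale=0; 81920 / 104.9' | bc    # read, 1:44.9 elapsed
780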


With tuning, the Sun Storage F5100 Flash Array via NFS achieved write throughput of 496 MB/Sec and read throughput of 832 MB/sec for a single 'dd' stream.

$ /bin/time dd if=/dev/zero of=/mnt/lf/test-80g bs=2048k count=40960; umount /mnt/lf; mount /mnt/lf; /bin/time dd if=/mnt/lf/test-80g of=/dev/null bs=2048k
40960+0 records in
40960+0 records out

real     2:45.0
user        0.2
sys      2:17.3
40960+0 records in
40960+0 records out

real     1:38.4
user        0.1
sys      1:19.6

To reiterate, this testing was done in preparation for work with an HPC application that is known to have large block sequential I/O aligned on 4K boundaries.  The Sun Storage F5100 Flash Array would not be recommended for general purpose NFS storage that is not known to be 4K aligned.

References:


  - http://blogs.sun.com/brendan/entry/1_gbyte_sec_nfs_streaming
  - http://blogs.sun.com/dlutz/entry/maximizing_nfs_client_performance_on

Comments:

hello,
very interesting article, but i'm surprised you used "dd" to measure perfs, did you take a look at:
http://www.c0t0d0s0.org/archives/6269-dd-is-not-a-benchmark.html

gerard

Posted by gerard henry on January 31, 2010 at 08:49 PM EST #

Thanks Gerard. Lisa's blog criticizing 'dd' as a benchmark is absolutely accurate. (1) The 'dd' default blocksize is small and is unlikely to match the actual production I/O blocksize, and (2) 'dd' is single threaded, which is also unlikely to match the actual production workload. However, in my case, (1) I adjusted the blocksize (bs=2048k) to exactly match the application and (2) a very critical component of the workload is a single threaded I/O stream between the NFS server and the NFS client which occurs while other application components are not producing a significant I/O load. This is an example where an HPC workload can have vastly different requirements as compared to a traditional web workload, database workload, or a general purpose file server workload.

Posted by Jeffrey Taylor on February 01, 2010 at 03:30 AM EST #

Hello Jeff,
We are really interested on those great NFS performances as we are also demanding high NFS write performances from a single client over 10GbE for the new storage system we are in the process of acquiring. You got really impressive numbers. It's also great to have all your configuration details available.

I am wondering if you have been able to test the performance of your system using smaller file sizes? In our case, clients will write groups of around 1000 files with sizes from 5MB to even 50MB, with some rest gaps...

This is a test command we have been planning to use to test our new system candidate designs (opensuse bash command line)

1000 x 50MB files write:
# cd /nfsfolder ; export start=`date +'%s'`; for i in {1..1000}; do dd if=/dev/zero of=./$i bs=32k count=1600; done; end=`date +'%s'`; echo $end-$start | bc;echo seconds; echo 50*1000/\($end-$start\) | bc ; echo Mbytes/s

We've been also using Iozone as it tests on a wider range of both file and block sizes. (http://www.iozone.org/)

For us it would be great to see how the NFS protocol overhead affects (if it does) the performance when writing a lot of smaller files on a system as performant as yours.

Thanks a lot for publishing your helpful results!
Toni

Posted by Toni Perez i Font on March 03, 2010 at 11:25 PM EST #

Link aggregation using a round-robin packet scheduling algorithm *will* reorder the segments of a TCP connection. While TCP "deals with it" in so far as maintaining ordering as seen by the application, each out-of-order TCP segment arriving at the TCP receiver triggers an immediate ACKnowledgement to state which segment was expected next. That will trash the ACK avoidance heuristics in the stack and increase the CPU utilization per KB of data transferred (eg "service demand" in netperf terms). If enough TCP segments get out of order then it will trigger spurious fast retransmissions which will do unpleasant things to the congestion window.

Posted by rick jones on August 03, 2010 at 10:21 AM EDT #
