
Jeff Taylor's Weblog

  • Sunday, January 31, 2010

NFS Streaming over 10 GbE


NFS Tuning for HPC Streaming Applications


Overview:

I was recently working in a lab environment with the goal of setting
up a Solaris 10 Update 8 (s10u8) NFS server application that would be
able to stream data to a small number of s10u8 NFS clients with the
highest possible throughput for a High Performance Computing (HPC)
application.  The workload varied over time: at some points it was
read-intensive, while at other times it was write-intensive.
Regardless of read or write, the application's I/O pattern was always
"large block sequential I/O", which was easily modeled with a "dd"
stream from one or several clients.

Due to business considerations, 10 gigabit ethernet (10GbE) was
chosen for the network infrastructure.  It was necessary not only to
install appropriate server, network and I/O hardware, but also to tune
each subsystem.  I wish it were more obvious whether a gigabit implies
1024^3 or 1000^3 bits.  In either case, one might naively assume that
the connection should be able to reach NFS speeds of 1.25 gigabytes per
second; my goal, however, was to achieve NFS end-to-end throughput
close to 1.0 gigabytes per second.

Hardware:

A network with the following components worked well:

  • Sun Fire X4270 servers
    • Intel Xeon X5570 CPUs @ 2.93 GHz
    • Solaris 10 Update 8
  • a network, consisting of:
    • Force10 S2410 Data Center 10 GbE Switch
    • 10-Gigabit Ethernet PCI Express Ethernet Controller, either:
      • 375-3586 (aka Option X1106A-Z) with the Intel 82598EB chip
        (ixgbe driver), or
      • 501-7283 (aka Option X1027A-Z) with the "Neptune" chip (nxge
        driver)
  • an I/O system on the NFS server:
    • 375-3487 Sun StorageTek PCI Express SAS 8-Channel HBAs (SG-XPCIE8SAS-E-Z, LSI-based disk controllers)
    • Storage, either
      • Sun Storage J4400 Arrays, or
      • Sun Storage F5100 Flash Array

This configuration, based on recent hardware, was able to reach close
to full line speed.  In contrast, a slightly older server with Intel
Xeon E5440 CPUs @ 2.83 GHz was not able to reach full line speed.

The application's I/O pattern is large block sequential and known to
be 4K aligned, so the Sun Storage F5100 Flash Array is a good fit.  (Sun
does not recommend this device for general purpose NFS storage.)

Network

When the hardware was initially installed, rather than immediately
measuring NFS performance, the individual network and I/O subsystems
were tested.  To measure the network performance, I used netperf.  I
found that the "out of the box" s10u8 performance was below my
expectations; the Solaris default settings seem better suited to a web
server with a large number of potentially slow (WAN) connections.  To
get the network humming for a large block LAN workload I made several
changes:

a) The TCP sliding window settings, applied with ndd and added to
/etc/default/inetinit so that they persist across reboots:

ndd -set /dev/tcp tcp_xmit_hiwat  1048576
ndd -set /dev/tcp tcp_recv_hiwat  1048576
ndd -set /dev/tcp tcp_max_buf    16777216
ndd -set /dev/tcp tcp_cwnd_max    1048576
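
As a quick sanity check, the values can be read back with "ndd -get"
(a minimal sketch using the same parameter names set above):

# read back the tuned TCP window settings to confirm they took effect
ndd -get /dev/tcp tcp_xmit_hiwat
ndd -get /dev/tcp tcp_recv_hiwat
ndd -get /dev/tcp tcp_max_buf
ndd -get /dev/tcp tcp_cwnd_max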



b) The network interface card ("NIC") settings, depending on the card:

/kernel/drv/ixgbe.conf
default_mtu=8150;
tx_copy_threshold=1024;

/platform/i86pc/kernel/drv/nxge.conf
accept_jumbo = 1;
soft-lso-enable = 1;
rxdma-intr-time=1;
rxdma-intr-pkts=8;


/etc/system
* From http://www.solarisinternals.com/wiki/index.php/Networks

* For ixgbe or nxge
set ddi_msix_alloc_limit=8

* For nxge
set nxge:nxge_bcopy_thresh=1024
set pcplusmp:apic_multi_msi_max=8
set pcplusmp:apic_msix_max=8
set pcplusmp:apic_intr_policy=1
set nxge:nxge_msi_enable=2
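
The MSI-X settings above are meant to spread NIC interrupts across
several CPUs.  A rough way to confirm that this is happening (a sketch,
not specific to either driver) is to watch interrupt activity while a
netperf stream is running:

# show which CPUs are servicing interrupts for each device, every 5 seconds
intrstat 5

# or dump the interrupt-to-CPU assignments from the running kernel
echo ::interrupts | mdb -k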


c) Some seasoning :-)

* Added to /etc/system on S10U8 x64 systems based on
*   http://www.solarisinternals.com/wiki/index.php/Networks (Nov 18, 2009)

* For few TCP connections
set ip:tcp_squeue_wput=1

* Bursty
set hires_tick=1

d) Make sure that you are using jumbo frames.  I used mtu 8150, which
I know made both the NICs and the switch happy.  Maybe I should have
tried a slightly more aggressive setting of 9000.

/etc/hostname.nxge0
192.168.2.42 mtu 8150

/etc/hostname.ixgbe0
192.168.1.44 mtu 8150
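
Before reaching for snoop, a quick check that an interface actually
came up with the jumbo MTU (interface names as configured above):

# the output for each interface should include "mtu 8150"
ifconfig nxge0
ifconfig ixgbe0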

e) Verifying the MTU with ping and snoop.  Some ping implementations
include a flag that lets the user set the "don't fragment" (DF) bit,
which is very useful for verifying that the MTU is properly set.  With
the ping implementation that ships with s10u8, you can't set the DF
bit.  To verify the MTU, use snoop to see whether large pings are
fragmented:

server# snoop -r -d nxge0 192.168.1.43

Using device nxge0 (promiscuous mode)

// Example 1: An 8000 byte packet is not fragmented
client% ping -s 192.168.1.43 8000 1

PING 192.168.1.43: 8000 data bytes

8008 bytes from server10G-43 (192.168.1.43): icmp_seq=0. time=0.370 ms



192.168.1.42 -> 192.168.1.43 ICMP Echo request (ID: 14797 Sequence
number: 0)

192.168.1.43 -> 192.168.1.42 ICMP Echo reply (ID: 14797 Sequence
number: 0)

// Example 2: A 9000 byte ping is broken into 2 packets in both directions

client% ping -s 192.168.1.43 9000 1

PING 192.168.1.43: 9000 data bytes

9008 bytes from server10G-43 (192.168.1.43): icmp_seq=0. time=0.383 ms



192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32355 Offset=0   
MF=1 TOS=0x0 TTL=255

192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32355 Offset=8128
MF=0 TOS=0x0 TTL=255

192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49788 Offset=0   
MF=1 TOS=0x0 TTL=255

192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49788 Offset=8128
MF=0 TOS=0x0 TTL=255

// Example 3: A 32000 byte ping is broken into 4 packets in both directions

client% ping -s 192.168.1.43 32000 1

PING 192.168.1.43: 32000 data bytes

32008 bytes from server10G-43 (192.168.1.43): icmp_seq=0. time=0.556 ms



192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32356 Offset=0   
MF=1 TOS=0x0 TTL=255

192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32356 Offset=8128
MF=1 TOS=0x0 TTL=255

192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32356 Offset=16256
MF=1 TOS=0x0 TTL=255

192.168.1.42 -> 192.168.1.43 ICMP IP fragment ID=32356 Offset=24384
MF=0 TOS=0x0 TTL=255

192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49789 Offset=0   
MF=1 TOS=0x0 TTL=255

192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49789 Offset=8128
MF=1 TOS=0x0 TTL=255

192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49789 Offset=16256
MF=1 TOS=0x0 TTL=255

192.168.1.43 -> 192.168.1.42 ICMP IP fragment ID=49789 Offset=24384
MF=0 TOS=0x0 TTL=255




f) Verification: after network tuning was complete, the network showed
very impressive performance with either the nxge or the ixgbe driver.
The end-to-end throughput of 9.78 gigabits per second reported by
netperf is very close to full line speed and indicates that the switch,
network interface cards, drivers and Solaris system call overhead are
minimally intrusive.





$ /usr/local/bin/netperf -fg -H 192.168.1.43 -tTCP_STREAM -l60
TCP STREAM TEST from ::ffff:0.0.0.0 (0.0.0.0) port 0 AF_INET to ::ffff:192.168.1.43 (192.168.1.43) port 0 AF_INET

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^9bits/sec

1048576 1048576 1048576    60.56       9.78

$ /usr/local/bin/netperf -fG -H 192.168.1.43 -tTCP_STREAM -l60
TCP STREAM TEST from ::ffff:0.0.0.0 (0.0.0.0) port 0 AF_INET to ::ffff:192.168.1.43 (192.168.1.43) port 0 AF_INET

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    GBytes/sec

1048576 1048576 1048576    60.00       1.15

Both tests above show about the same throughput, just in different
units (9.78 x 10^9 bits/sec is roughly 1.14 GBytes/sec).  With the
network performance tuning above, the system can achieve slightly less
than 10 gigabits per second, which is slightly more than 1 gigabyte per
second.



g) Observability: I found "nicstat" (download link at
http://www.solarisinternals.com/wiki/index.php/Nicstat) to be a very
valuable tool for observing network performance.  To compare the
network performance of the running application against synthetic
tests, I found it useful to graph the "nicstat" output (see
http://blogs.sun.com/taylor22/entry/graphing_solaris_performance_stats).
You can also verify that the Jumbo Frames MTU is working as expected by
checking that the average packet payload is large: divide
bytes-per-second by packets-per-second to get average bytes-per-packet.
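
For example, a stream moving about 1 GB/sec in 125,000 packets/sec
averages roughly 8,000 bytes per packet, consistent with an 8150-byte
MTU; averages near 1,500 bytes would suggest jumbo frames are not in
effect.  The raw counters can also be sampled directly with kstat (a
sketch; the statistic names obytes64/opackets64 are typical for NIC
drivers, but check "kstat -m nxge" on your system first):

# two 10-second samples of the outbound byte and packet counters
kstat -p -m nxge -s obytes64 10 2
kstat -p -m nxge -s opackets64 10 2
# (delta bytes) / (delta packets) = average bytes per packet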


Network Link Aggregation for load spreading


I initially hoped to use link aggregation to get a 20GbE NFS stream
using 2 X 10GbE ports on the NFS server and 2 X 10GbE ports on the NFS
client.  I hoped that packets would be distributed to the link
aggregation group in a "per packet round robin" fashion.  What I found
is that, regardless of whether the hashing policy is based on L2, L3 or
L4 headers, LACP maps each source/destination pair to one port, so that
each stream of packets will only use one specific port from a link
aggregation group.


Link aggregation can be useful in spreading multiple streams over
ports, but the streams will not necessarily be evenly divided across
ports.  The distribution of data over the ports in a link aggregation
group can be viewed with "nicstat".


After reviewing the literature, I concluded that it is best to use IPMP
for failover and link aggregation for load spreading; "link
aggregation" has finer control for load spreading than IPMP.  Comparing
IPMP and Link Aggregation:

  • IPMP can be used to protect against switch/router failure because
    each NIC can be connected to a different switch, and therefore it can
    protect against either NIC or switch failure.
  • With Link Aggregation, all of the ports in the group must be
    connected to a single switch/router, and that switch/router must
    support the Link Aggregation Control Protocol (LACP), so there is no
    protection against switch failure.

With the Force10 switch that I tested, I was disappointed that the LACP
algorithm did not do a good job of spreading inbound packets to my hot
server.  Again, once the switch mapped a client to one of the ports in
the "hot server's link group", it stuck, so it was not unusual for
several clients to be banging hard on one port while another port sat
idle.

Multiple subnets for network load spreading


After trying several approaches for load spreading with s10u8, I chose
to use multiple subnets.  (If I had been using Nevada build 107 or
newer, which includes Project Clearview, I might have come to a
different conclusion.)  In the end, I decided that the best solution
was an old-fashioned approach using a single common management subnet
combined with multiple data subnets (a minimal configuration sketch
follows the list below):

  • All of the machines were able to communicate with each other on a
    slower "management network"; specifically, Sun Grid Engine jobs were
    managed on the slower network.
  • The clients were partitioned into a small number of "data
    subnets".
  • The "hot" NFS data server had multiple NICs, with each NIC on a
    separate "data subnet".
  • A limitation of this approach is that clients in one subnet
    partition only have a low bandwidth connection to clients in different
    subnet partitions.  This was OK for my project.
  • The advantage of manually preallocating the port distribution was
    that my benchmark was more deterministic.  I did not get overloaded
    ports in a seemingly random pattern.
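
On the hot NFS server this amounted to one /etc/hostname.* file per
data subnet, plus one on the management network.  A minimal sketch (the
interface names and addresses here are illustrative, not the exact lab
values):

/etc/hostname.e1000g0        (management network)
10.10.10.5

/etc/hostname.nxge0          (data subnet 1)
192.168.1.5 mtu 8150

/etc/hostname.nxge1          (data subnet 2)
192.168.2.5 mtu 8150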


Disk I/O: Storage tested

Configurations that had sufficient bandwidth for this environment included:

  • A Sun Storage F5100 Flash Array using 80 FMods in a ZFS RAID 0
    stripe to create a 2TB volume.

  • A "2 X Sun Storage J4400 Arrays" JBOD configuration with a ZFS
    RAID 0 stripe.

  • A "4 X Sun Storage J4400 Arrays" configuration with ZFS RAID 1+0
    (mirrored and striped).

Disk I/O: SAS HBAs

The Sun Storage F5100 Flash Array was connected to the Sun Fire X4270
server using 4 PCIe 375-3487 Sun StorageTek PCI Express SAS 8-Channel
HBAs (SG-XPCIE8SAS-E-Z, LSI-based disk controllers) so that each F5100
domain with 20 FMods used an independent HBA.  Using 4 SAS HBAs has 20%
better theoretical throughput than using 2 SAS HBAs:

  • A fully loaded Sun Storage F5100 Flash Array with 80 FMods has 4
    domains with 20 FMods per domain.
  • Each 375-3487 Sun StorageTek PCI Express SAS 8-Channel HBA
    (SG-XPCIE8SAS-E-Z, LSI-based disk controller) has
    • two 4x wide SAS ports
    • an 8x wide PCIe bus connection
  • Each F5100 domain to SAS HBA connection, over a single 4x wide SAS
    port, has a maximum half duplex speed of (3 Gb/sec * 4) = 12 Gb/sec,
    or roughly 1.2 GB/sec per F5100 domain.
  • PCI Express x8 (half duplex) = 2 GB/sec.
  • A full F5100 (4 domains) connected using 2 SAS HBAs would be
    limited by PCIe to 4.0 GB/sec.
  • A full F5100 (4 domains) connected using 4 SAS HBAs would be
    limited by SAS to 4.8 GB/sec.
  • Therefore a full Sun Storage F5100 Flash Array has 20% better
    theoretical throughput when connected using 4 SAS HBAs rather than 2
    SAS HBAs.

The "mirrored and striped" configuration using 4 X Sun Storage J4400
Arrays was connected using 3 PCIe
375-3487 Sun StorageTek PCI Express SAS 8-Channel HBAs (SG-XPCIE8SAS-E-Z, LSI-based disk controllers):

SAS Wiring for 4 JBOD trays

Multipathing (MPXIO)

MPXIO was used in the "mirrored and striped" 4 X Sun Storage J4400
Array configuration so that, for every disk in the JBOD configuration,
I/O could be requested by either of the 2 SAS cards connected to the
array.  To eliminate any "single point of failure", I chose to mirror
all of the drives in one tray with drives in another tray, so that any
tray could be removed without losing data.

The command for creating a ZFS RAID 1+0 "mirrored and striped" volume out of MPXIO devices looks like this:

zpool create -f jbod \
    mirror c3t5000C5001586BE93d0 c3t5000C50015876A6Ed0 \
    mirror c3t5000C5001586E279d0 c3t5000C5001586E35Ed0 \
    mirror c3t5000C5001586DCF2d0 c3t5000C50015825863d0 \
    mirror c3t5000C5001584C8F1d0 c3t5000C5001589CEB8d0 \
    ...
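
Once the pool is created, a quick check that every top-level vdev
really is a two-way mirror (pool name as above):

# each top-level vdev should be listed as "mirror" with two devices under it
zpool status jbod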

It was a bit tricky to figure out which disks (e.g.
"c3t5000C5001586BE93d0") were in which trays.  I ended up writing a
surprisingly complicated Ruby script to choose devices to mirror.  This
script worked for me.  Your mileage may vary.  Use at your own risk.

#!/usr/bin/env ruby -W0
#
# Pair MPXIO device names for "zpool create ... mirror" statements by
# matching drives that sit in the same slot location of different JBOD
# trays.

all = `stmsboot -L`

Device = Struct.new(:path_count, :non_stms_path_array, :target_ports, :expander, :location)
Location = Struct.new(:device_count, :devices)

# Build a map keyed by the STMS (MPXIO) device name; count the paths and
# remember the non-STMS path names reported by "stmsboot -L".
my_map = Hash.new
all.each_line { |s|
  if s =~ /\/dev/ then
    s2 = s.split
    if !my_map.has_key?(s2[1]) then
      my_map[s2[1]] = Device.new(0, [], [])
    end
    my_map[s2[1]].path_count += 1
    my_map[s2[1]].non_stms_path_array.push(s2[0])
  else
    puts "no match on #{s}"
  end
}

# For each logical unit, collect the target port names from the
# "Target Ports:" section of "mpathadm show lu".
my_map.each { |k, v|
  mpath_data = `mpathadm show lu #{k}`
  in_target_section = false
  mpath_data.each_line { |line|
    if !in_target_section then
      in_target_section = true if line =~ /Target Ports:/
      next
    end
    my_map[k].target_ports.push(line.split[1]) if line =~ /Name:/
  }
}

# The first 14 characters of the target port name identify the expander
# (tray); the next 2 hex digits, modulo 64, give the slot location.
# Group the devices by slot location.
location_array = []
location_map = Hash.new
my_map.each { |k, v|
  my_map[k].expander = my_map[k].target_ports[0][0, 14]
  my_map[k].location = my_map[k].target_ports[0][14, 2].hex % 64
  if !location_map.has_key?(my_map[k].location) then
    puts "creating entry for #{my_map[k].location}"
    location_map[my_map[k].location] = Location.new(0, [])
    location_array.push(my_map[k].location)
  end
  location_map[my_map[k].location].device_count += 1
  location_map[my_map[k].location].devices.push(k)
}

# Emit one "mirror" clause per slot location, pairing the drive in one
# tray with the drive in the same slot of another tray.
location_array.sort.each { |location|
  puts "mirror #{location_map[location].devices[0].gsub('/dev/rdsk/', '')} #{location_map[location].devices[1].gsub('/dev/rdsk/', '')} \\"
}


Separate ZFS Intent Logs?

Based on http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on,
I ran some tests comparing separate intent logs on either disk or SSD
(slogs) vs. the default chained logs (clogs).  For the large block
sequential workload, "tuning" the configuration by adding separate ZFS
Intent Logs actually slowed the system down slightly.

ZFS Tuning


/etc/system parameters for ZFS

* For ZFS
set zfs:zfetch_max_streams=64
set zfs:zfetch_block_cap=2048
set zfs:zfs_txg_synctime=1
set zfs:zfs_vdev_max_pending = 8
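
These settings take effect at boot.  To confirm that the running kernel
actually picked them up, the variables can be read back with mdb (a
sketch; variable names as set above):

# print the live values of two of the ZFS tunables as decimal integers
echo "zfs_vdev_max_pending/D" | mdb -k
echo "zfs_txg_synctime/D" | mdb -k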


NFS Tuning

a) Kernel settings

Solaris /etc/system

* For NFS
set nfs:nfs3_nra=16
set nfs:nfs3_bsize=1048576
set nfs:nfs3_max_transfer_size=1048576

* Added to /etc/system on S10U8 x64 systems based on
*   http://www.solarisinternals.com/wiki/index.php/Networks (Nov 18, 2009)
* For NFS throughput
set rpcmod:clnt_max_conns = 8


b) Mounting the NFS filesystem

/etc/vfstab

192.168.1.5:/nfs - /mnt/nfs nfs - no vers=3,rsize=1048576,wsize=1048576
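
The equivalent one-off mount from the command line (a sketch using the
same server, path and options as the vfstab entry above):

# mount the export by hand with 1 MB NFS read and write transfer sizes
mount -F nfs -o vers=3,rsize=1048576,wsize=1048576 192.168.1.5:/nfs /mnt/nfs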

c) Verifying the NFS mount parameters

# nfsstat -m

/mnt/ar from 192.168.1.7:/export/ar

 Flags:         vers=3,proto=tcp,sec=sys,hard,intr,link,symlink,acl,rsize=1048576,wsize=1048576,retrans=5,timeo=600
 Attr cache:    acregmin=3,acregmax=60,acdirmin=30,acdirmax=60


Results:


With tuning, the Sun Storage J4400 Arrays via NFS achieved write
throughput of 532 MB/sec and read throughput of 780 MB/sec for a single
'dd' stream.

$ /bin/time dd if=/dev/zero of=/mnt/jbod/test-80g bs=2048k count=40960; umount /mnt/jbod; mount /mnt/jbod; /bin/time dd if=/mnt/jbod/test-80g of=/dev/null bs=2048k

40960+0 records in

40960+0 records out


real     2:33.7

user        0.1

sys      1:30.9

40960+0 records in

40960+0 records out


real     1:44.9

user        0.1

sys      1:04.0



With tuning, the Sun Storage F5100 Flash Array via NFS achieved write
throughput of 496 MB/sec and read throughput of 832 MB/sec for a single
'dd' stream.

$ /bin/time dd if=/dev/zero of=/mnt/lf/test-80g bs=2048k count=40960; umount /mnt/lf; mount /mnt/lf; /bin/time dd if=/mnt/lf/test-80g of=/dev/null bs=2048k

40960+0 records in

40960+0 records out


real     2:45.0

user        0.2

sys      2:17.3

40960+0 records in

40960+0 records out


real     1:38.4

user        0.1

sys      1:19.6

To reiterate, this testing was done in preparation for work with an HPC
application that is known to have large block sequential I/O that is
aligned on 4K boundaries.  The Sun Storage F5100 Flash Array would not
be recommended for general purpose NFS storage that is not known to be 4K
aligned.

References:

  - http://blogs.sun.com/brendan/entry/1_gbyte_sec_nfs_streaming
  - http://blogs.sun.com/dlutz/entry/maximizing_nfs_client_performance_on

Comments (4)
  • gerard henry Monday, February 1, 2010

    hello,

    very interesting article, but i'm surprised you used "dd" to measure perfs, did you take a look at:

    http://www.c0t0d0s0.org/archives/6269-dd-is-not-a-benchmark.html

    gerard


  • Jeffrey Taylor Monday, February 1, 2010

    Thanks Gerard. Lisa's blog criticizing 'dd' as a benchmark is absolutely accurate. (1) The 'dd' default blocksize is small and is unlikely to match the actual production I/O blocksize, and (2) 'dd' is single threaded, which is also unlikely to match the actual production workload. However, in my case, (1) I adjusted the blocksize (bs=2048k) to exactly match the application and (2) a very critical component of the workload is a single threaded I/O stream between the NFS server and the NFS client which occurs while other application components are not producing a significant I/O load. This is an example where an HPC workload can have vastly different requirements compared to a traditional web workload, database workload, or a general purpose file server workload.


  • Toni Perez i Font Thursday, March 4, 2010

    Hello Jeff,

    We are really interested in those great NFS performance numbers, as we also need high NFS write performance from a single client over 10GbE for the new storage system we are in the process of acquiring. You got really impressive numbers. It's also great to have all your configuration details available.

    I am wondering if you have been able to test the performance of your system using smaller file sizes? In our case clients will write groups of around 1000 files with sizes from 5MB up to 50MB, with some rest gaps...

    This is a test command we have been planning to use to test our new system candidate designs (opensuse bash command line)

    1000 x 50MB files write:

    # cd /nfsfolder ; export start=`date +'%s'`; for i in {1..1000}; do dd if=/dev/zero of=./$i bs=32k count=1600; done; end=`date +'%s'`; echo $end-$start | bc; echo seconds; echo 50*1000/\($end-$start\) | bc ; echo Mbytes/s

    We've also been using Iozone, as it tests a wider range of both file and block sizes. (http://www.iozone.org/)

    For us it would be great to see how the NFS protocol overhead affects (if it does) the performance when writing a lot of smaller files on a system as performant as yours.

    Thanks a lot for publishing your helpful results!

    Toni


  • rick jones Tuesday, August 3, 2010

    Link aggregation using a round-robin packet scheduling algorithm *will* reorder the segments of a TCP connection. While TCP "deals with it" insofar as maintaining ordering as seen by the application, each out-of-order TCP segment arriving at the TCP receiver triggers an immediate ACKnowledgement to state which segment was expected next. That will trash the ACK avoidance heuristics in the stack and increase the CPU utilization per KB of data transferred (e.g. "service demand" in netperf terms). If enough TCP segments get out of order then it will trigger spurious fast retransmissions, which will do unpleasant things to the congestion window.

