Maximizing NFS client performance on 10Gb Ethernet

I generally agree with the opening statement of the ZFS Evil Tuning Guide, which says "Tuning is often evil and should rarely be done." That said, tuning is sometimes necessary, especially when you need to push the envelope. At the moment, achieving peak performance for NFS traffic between a single client and a single server over 10 Gigabit Ethernet (10GbE) is one of those cases. Below I outline the tunings I used to achieve a ~3X improvement in NFS IOPS over 10GbE on a Chip Multithreading (CMT) system running Solaris 10 Update 7 (S10u7).

The default values for the tunables outlined below are all either being reviewed, or have already changed since the release of S10u7. Some of these tunings are unnecessary if you are running S10u8, and they should all be unnecessary in the future. Consider these settings a workaround to achieve maximum performance, and plan to revisit them in the future. A good place to monitor for future developments is the Networks page on the Solaris Internals site. You can also review the NFS section of the Solaris Tunable Parameters Reference Manual.

If you want to fine-tune these settings beyond what is outlined here, a reasonable technique is to start from your current default settings and double each value until you see no further improvement.
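
As a minimal sketch of that technique applied to the TCP transmit window, something like the following could drive the loop; run_my_benchmark is a placeholder for whatever workload you use to measure throughput, and the commands must be run as root:

	# Walk the transmit window up from the current default, measuring at each step.
	# Values above 1MB also require raising tcp_max_buf and tcp_cwnd_max (see Step 1).
	for size in 49152 98304 196608 393216 786432; do
		ndd -set /dev/tcp tcp_xmit_hiwat $size
		./run_my_benchmark        # placeholder for your own throughput test
	done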

For the time being, consider the following settings if you plan to run NFS between a single client and a single server over 10GbE:

Step 1 - TCP window sizes

The TCP window size defines how much data a host is willing to send/receive without an acknowledgment from its communication partner. Window size is a central component of the TCP throughput formula, which can be simplified to the following if we assume no packet loss:

max throughput (per second) = window size / round trip time (in seconds)

For example, with 1ms RTT and the current default window size of 48k, we have:

49152 / 0.001 = ~50 MB/sec per communication partner

This is obviously too low for NFS over 10GbE, so the send and receive window sizes should be increased. A setting of 1MB provides a maximum bandwidth of ~1 GB/sec with an RTT of 1ms.
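
If you are not sure of your round trip time, you can estimate it with ping and size the window from there (nfs-server below is a placeholder for your server's host name):

	# Send 5 probes of 56 data bytes and read the avg RTT from the summary line
	ping -s nfs-server 56 5
	# window (bytes) >= desired throughput (bytes/sec) x RTT (sec)
	# e.g. ~1 GB/sec at 1ms RTT: 1073741824 x 0.001 = ~1MB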

Solaris 10 Update 8 and earlier

	ndd -set /dev/tcp tcp_xmit_hiwat 1048576
	ndd -set /dev/tcp tcp_recv_hiwat 1048576

TCP window size has been the subject of a number of CRs, has changed several times over the years, and the default is likely to change again in the near future. Use a command like
	ndd -get /dev/tcp tcp_xmit_hiwat
on your system to check the current default value before tuning, to make sure that you do not inadvertently lower the values.

Note: if you want to increase TCP window sizes beyond 1MB, you should also increase tcp_max_buf and tcp_cwnd_max, which currently default to 1MB.
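
Keep in mind that ndd settings are lost at reboot. One common approach on Solaris 10 is to reapply them from a boot-time script; a minimal sketch, with an illustrative script name, would be:

	#!/sbin/sh
	# Example: save as /etc/rc2.d/S99tcp-nfs-tuning (name is illustrative) and make it executable
	# Reapply TCP window tuning for NFS over 10GbE after each boot
	ndd -set /dev/tcp tcp_xmit_hiwat 1048576
	ndd -set /dev/tcp tcp_recv_hiwat 1048576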

Step 2 - IP software rings

A general heuristic for network performance is that approximately 1GHz of CPU capacity is needed to handle 1Gb (gigabit) per second of network traffic. That means we need multiple CPUs to match the bandwidth of a 10GbE interface. Soft rings are the mechanism Solaris uses to spread the incoming load from a network interface across multiple CPU strands, so that the aggregate CPU capacity matches the interface bandwidth. The default number of soft rings in Solaris 10 Update 7 and earlier is too low for 10GbE, and must be increased:

Solaris 10 Update 7 and earlier on Sun4v

In /etc/system
	set ip:ip_soft_rings_cnt=16

Solaris 10 Update 7 and earlier on Sun4u, x86-64, etc.

In /etc/system
	set ip:ip_soft_rings_cnt=8

Solaris 10 Update 8 and later

Thanks to the implementation of CR 6621217 in S10u8, the default value for the number of soft rings should be fine for network interface speeds up to and including 10GbE, so no tuning should be necessary.

The changes introduced by CR 6621217 highlight why tuning is often evil. It proved difficult to find an optimal, system-wide setting for the number of soft rings when a system contains multiple network interfaces of different types. This resulted in the addition of a new tunable, ip_soft_rings_10gig_cnt, which applies to 10GbE interfaces, while the old tunable, ip_soft_rings_cnt, applies to 1GbE interfaces. Both tunables have good defaults at this point, so it is best not to tune either on S10u8 and later.
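
Whichever update you are running, one way to confirm the value currently in effect is to read the kernel variable directly with mdb (requires root; on S10u8 and later, ip_soft_rings_10gig_cnt can be inspected the same way):

	# Print the current number of IP soft rings as a decimal
	echo 'ip`ip_soft_rings_cnt/D' | mdb -k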

Step 3 - RPC client connections

Now that we have enough IP software rings to handle the network interface bandwidth, we need to have enough IP consumer threads to handle the IP bandwidth. In our case the IP consumer is NFS, and at the time of this writing, its default behavior is to open a single network connection from an NFS client to a given NFS server. This results in a single thread on the client that handles all of the data coming from that server. To maximize throughput between a single NFS client and server over 10GbE, we need to increase the number of network connections on the client:

Solaris 10 Update 8 and earlier

In /etc/system
	set rpcmod:clnt_max_conns=8

Note: for this to be effective, you must have the fix for CR 2179399, which is available in snv_117, s10u8, or s10 patch 141914-02.

A new default value for rpcmod:clnt_max_conns is being investigated as part of CR 6887770, so it should be unnecessary to tune this value in the future.
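
To confirm the setting took effect after a reboot, and to see the client actually spreading load across multiple connections, something like the following can help (2049 is the standard NFS port; both commands are a sketch and require root for mdb):

	# Read the value of clnt_max_conns currently in effect
	echo 'rpcmod`clnt_max_conns/D' | mdb -k
	# While an NFS workload is running, count TCP connections to the server's NFS port
	netstat -an | grep 2049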

Step 4 - Allow for multiple pending I/O requests

The IOPS rate of a single thread issuing synchronous reads or writes over NFS will be bound by the round trip network latency between the client and server. To get the most out of the available bandwidth you should have a workload that generates multiple pending I/O requests. This can be from multiple processes each generating an individual I/O stream, a multi-threaded process generating multiple I/O streams, or a single or multi-threaded process using asynchronous I/O calls.
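
As a quick illustration, a handful of concurrent dd readers against files on the NFS mount will keep multiple requests in flight (the mount point and file names below are placeholders):

	# Start 8 concurrent 4k-block readers; adjust paths to your own NFS mount
	for i in 1 2 3 4 5 6 7 8; do
		dd if=/mnt/nfs/testfile$i of=/dev/null bs=4k &
	done
	wait     # wait for all background readers to finish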

Conclusion

Once you have verified or tuned the TCP window sizes, IP soft rings, and RPC client connections, and you have a workload that can capitalize on the available bandwidth, you should see excellent NFS throughput on your 10GbE network interface. There are a few more tunings that might add a few percentage points of performance, but the tunings shown above should suffice for the majority of systems.

As I mentioned at the start, these tunables are all either under investigation or already adjusted in Solaris 10 Update 8. Our goal is always to provide excellent performance out of the box, and these tunings should be unnecessary in the near future.

Comments:

Very useful! What is your opinion on LSO?

Posted by Tom Shaw on November 30, 2009 at 04:51 PM PST #

@Tom -

Good question! I tested LSO and I did not see a performance improvement for this NFS workload. I believe that is because I had plenty of CPU to spare, and the individual I/O streams used fairly small block size (4k). With a CPU bound server, or with larger I/O sizes, LSO might have shown an improvement.

Posted by David Lutz on December 01, 2009 at 12:35 AM PST #

Since the ss7000 is an appliance, did you do any TCP tuning or /etc/system tuning for the ss7000?

Posted by Hung-Sheng Tsao on December 16, 2009 at 01:05 AM PST #

@Hung-Sheng,

No, I didn't do any tuning on the ss7000 end, since it is already highly tuned for NFS. The one thing I could have tried, but haven't yet, is to enable Jumbo Frames on the NFS client, the switch, and the ss7000. The appliance interface exposes that through a simple on/off setting, and I would expect to see some benefit for this NFS workload.

Posted by David Lutz on December 16, 2009 at 11:00 PM PST #

Research question: what was the throughput you achieved post-tuning on Solaris 10, using 10GbE for NFS?

Posted by matt on March 08, 2010 at 01:05 AM PST #

@matt,

I was primarily concerned with small block IOPS, and on a 1.2GHz CMT system, default MTU, going through a switch, I maxed out at roughly 65000 NFSv4 reads/sec with 64 readers. Glenn Fawcett also showed on his blog that with a DSS type workload, you can exceed 1GB/sec. See http://glennfawcett.wordpress.com/2009/12/17/ and http://glennfawcett.wordpress.com/2009/12/14/

Posted by David Lutz on March 10, 2010 at 12:21 AM PST #
