Thursday Jun 17, 2010

TCP Fusion and improved loopback traffic

In the past, when two processes on the same system communicated using TCP, much of the TCP and IP protocol processing was performed just as it is for traffic to and from another system. A significant amount of CPU time is spent in the protocol layers, on both the sending and receiving sides, to ensure data is delivered successfully, completely, in order, without duplication, and re-routed around network failures. All of that is unnecessary for data that never leaves the system, so there is considerable performance benefit in providing a short circuit for the data.

In Solaris 10 6/06 a feature called TCP Fusion was delivered, which removes the stack processing when both ends of the TCP connection are on the same system, and now with IP Instances, in the same IP Instance (between the global zone and all shared-IP zones, or within a single exclusive-IP zone). There are some exceptions, including when IPsec, IPQoS, raw sockets, kernel SSL, or other non-simple TCP/IP conditions are in use, or when the two endpoints are on different squeues. A fused connection will also revert to unfused if an IP Filter rule would drop a packet. In the general case, however, TCP Fusion applies.

So why do I bring this up? With TCP Fusion enabled (which it is by default in Solaris 10 6/06 and later, and in OpenSolaris), when a TCP connection is created between two processes on the same system, the connection is set up to transfer data from the sender to the receiver without sending it down and back up the stack. The typical flow control of filling a send buffer (48K by default, or the value of tcp_xmit_hiwat, unless changed via a socket operation) still applies.

With TCP Fusion on there is a second check: the number of writes to the socket without an intervening read. The reason for this counter is to let the receiver get CPU cycles, since the sender and receiver are on the same system and may be sharing one or more CPUs. The default value of the counter is eight (8), as determined by tcp_fusion_rcv_unread_min. The value per TCP connection is calculated as

MAX(sndbuf >> 14, tcp_fusion_rcv_unread_min);
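With the default 48K send buffer, that works out to MAX(49152 >> 14, 8) = MAX(3, 8) = 8, so the eight-write default dominates unless the send buffer is raised well above 128K (8 << 14).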
Some details of the reasoning and implementation are in Change Request 4821256.

When doing large writes, or when the receiver is actively reading, the buffer flow control dominates. When doing smaller writes, however, it is easy for the sender to exceed the allowed number of consecutive writes without a read, at which point the writer blocks or, if using non-blocking I/O, gets an EAGAIN error.
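
To make that failure mode concrete, here is a minimal sketch in C of how a non-blocking writer typically has to handle this. It is illustrative only (the write_fully() helper is mine, and the socket setup is omitted); the point is that on a fused loopback connection the EAGAIN can come from the consecutive-write counter, not just from a full send buffer:

#include <sys/types.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Write all of buf to a non-blocking socket fd, retrying on EAGAIN.
 * On a fused loopback connection, EAGAIN can mean the peer has not
 * read recently enough, not only that the send buffer is full.
 */
static ssize_t
write_fully(int fd, const char *buf, size_t len)
{
        size_t done = 0;

        while (done < len) {
                ssize_t n = write(fd, buf + done, len - done);

                if (n >= 0) {
                        done += (size_t)n;
                        continue;
                }
                if (errno == EAGAIN || errno == EWOULDBLOCK) {
                        /* Wait until the socket is writable again. */
                        struct pollfd pfd;
                        pfd.fd = fd;
                        pfd.events = POLLOUT;
                        pfd.revents = 0;
                        (void) poll(&pfd, 1, -1);
                        continue;
                }
                return (-1);    /* a real error; errno is set */
        }
        return ((ssize_t)done);
}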

The latter was the case at a customer of mine. An ISV application was reporting EAGAIN errors on a new installation, something that had not been seen before. More importantly, the ISV was not seeing the errors elsewhere, including in their test environment.

After some investigation using DTrace, including reproducing the problem on a slightly different system configuration, it became clear that the sending application was getting the error after a burst of writes. The application has both local and remote (on other systems) receivers, and the EAGAIN errors were happening only on the local connection.

I also saw that the application was repeatedly doing a pair of writes, one of 12 bytes and the second of 696 bytes. Thus it would be easy to hit the consecutive write counter before the send buffer was ever filled.

To test this, I suggested the customer change tcp_fusion_rcv_unread_min on their running system using mdb(1). I suggested they increase the counter by a factor of four (4), just to be safe:

# echo "tcp_fusion_rcv_unread_min/W 32" | mdb -kw
tcp_fusion_rcv_unread_min:      0x8            =       0x20

Here is how you check the current value:

# echo "tcp_fusion_rcv_unread_min/D" | mdb -k
tcp_fusion_rcv_unread_min:
tcp_fusion_rcv_unread_min:      32

After running several hours of tests, the EAGAIN error did not return.

Since then I have suggested they set tcp_fusion_rcv_unread_min to 0, turning the check off completely. This allows the buffer size and the total volume of outstanding write data to determine whether the sender blocks, as is the case for remote connections. Since an mdb change lasts only until the next reboot, I suggested the customer make the setting permanent in /etc/system.

* Set TCP fusion to allow unlimited outstanding writes up to the TCP
* send buffer set by default or by the application.
* The default value is 8.
set ip:tcp_fusion_rcv_unread_min=0
To turn TCP Fusion off altogether, something I have not tested, the variable do_tcp_fusion can be set from its default of 1 to 0.
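
The equivalent /etc/system entry would presumably be the following (untested, and assuming do_tcp_fusion, like the tunable above, lives in the ip module):

* Disable TCP Fusion altogether (the default is 1, enabled).
set ip:do_tcp_fusion=0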

I hope this helps someone who might be trying to understand why errors, or perhaps lower than expected throughput, are being seen on local connections.

And I would like to note that in OpenSolaris only the do_tcp_fusion setting is available: with the delivery of CR 6826274, the consecutive write counting has been removed. The TCP Fusion code has also been moved into its own file.

Thanks to Jim Eggers, Jim Fiori, Jim Mauro, Anders Parsson, and Neil Putnam for their help as I was tracking all this stuff down!

Steffen

PS. After publishing this, I wrote the following DTrace script to show what the per-connection outstanding write counter tcp_fuse_rcv_unread_hiwater is set to.

# more tcp-fuse.d
#!/usr/sbin/dtrace -qs

/* Remember the tcp_t whose maximum fused packet size is being set. */
fbt:ip:tcp_fuse_maxpsz_set:entry
{
        self->tcp = (tcp_t *) arg0;
}

/* On return, report the fused peer's unread-writes high-water mark. */
fbt:ip:tcp_fuse_maxpsz_set:return
/self->tcp > 0/
{
        this->peer = (tcp_t *) self->tcp->tcp_loopback_peer;
        this->hiwat = this->peer->tcp_fuse_rcv_unread_hiwater;

        printf("pid: %d tcp_fuse_rcv_unread_hiwater: %d \n", pid, this->hiwat);

        self->tcp = 0;
        this->peer = 0;
        this->hiwat = 0;
}
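
Run it in one window while establishing a local TCP connection in another; output should appear as each fused connection is set up (a usage sketch):

# chmod +x tcp-fuse.d
# ./tcp-fuse.d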

Wednesday Jun 17, 2009

ssh and friends scp, sftp say "hello crypto!"

Solaris includes the SunSSH toolset (ssh, scp, and sftp) in Solaris 9 and later. Solaris 10 comes with the Solaris Cryptographic Framework that provides an easy mechanism for applications that use PKCS #11, OpenSSL, Java Security Extensions, or the NSS interface to take advantage of cryptographic hardware or software on the system.

Separately, the UltraSPARC® T2 processor in the T-series (CMT) systems has built-in cryptographic processors (one per core, so typically eight per socket) that accelerate secure one-way hashes, public key session establishment, and symmetric key bulk data encryption. The latter is useful for long-standing connections and for larger data operations, such as a file transfer.

Prior to Solaris 10 5/09, an scp or sftp file transfer had its encryption and decryption done by the CPU. Usually this is not a big deal, as most CPUs do bulk crypto reasonably fast, but on the CMT systems these operations are relatively slow. Now, with SunSSH With OpenSSL PKCS#11 Engine Support in 5/09, the SunSSH server and client will use the cryptographic framework when an UltraSPARC® T2 processor's n2cp cryptographic unit is available.
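
A quick way to confirm that a hardware provider is available is cryptoadm(1M); on a T2 system you would look for an n2cp entry among the kernel hardware providers (treat the exact provider name as illustrative):

# cryptoadm list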

To demonstrate this, I used a T5120 with Logical Domains (LDoms) 1.1 configured, running Solaris 10 5/09. Using LDoms helps, as I can assign or remove crypto units on a per-LDom basis. (Since the crypto units are not yet supported with dynamic reconfiguration, a reboot of the LDom instance is required. In general, though, I don't see making that kind of change very often.)
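
For reference, the crypto unit (MAU) count of a domain is changed with the set-mau subcommand in LDoms 1.x. A sketch (the count and domain name here match my configuration below; remember the reboot noted above):

primary# ldm set-mau 0 primary
primary# ldm set-mau 2 primary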

I did all the work in the 'primary' control and service LDom, where I have direct access to the network devices and can see the LDom configuration. I am listing parts of the configuration here, although this post is really about Solaris, SunSSH, and the crypto hardware.

medford# ldm list-bindings primary
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    8G       0.1%  22h 16m

MAC
    00:14:4f:ac:57:c4

HOSTID
    0x84ac57c4

VCPU
    VID    PID    UTIL STRAND
    0      0      0.6%   100%
    1      1      1.9%   100%
    2      2      0.0%   100%
    3      3      0.0%   100%
    4      4      0.0%   100%
    5      5      0.1%   100%
    6      6      0.0%   100%
    7      7      0.0%   100%
    8      8      0.7%   100%
    9      9      0.1%   100%
    10     10     0.0%   100%
    11     11     0.0%   100%
    12     12     0.0%   100%
    13     13     0.0%   100%
    14     14     0.0%   100%
    15     15     0.0%   100%

MAU
    ID     CPUSET
    0      (0, 1, 2, 3, 4, 5, 6, 7)
    1      (8, 9, 10, 11, 12, 13, 14, 15)

MEMORY
    RA               PA               SIZE
    0x8000000        0x8000000        8G

The 'system' has 16 CPUs (hardware strands), two MAUs (the crypto units), and 8 GB of memory. I am using e1000g0 for the network, and the remote system is a V210 running Solaris Express Community Edition snv_113 SPARC (OK, I am a little behind). The network is 1 GbE.

The command I run is:

source# /usr/bin/time scp -i /.ssh/destination /large-file destination:/tmp

source# du -h /large-file
 1.3G   /large-file

My results with the crypto units were:

real     1:13.6
user       32.2
sys        34.5

while without the crypto units:

real     2:28.2
user     2:10.9
sys        26.8

The transfer took half the time and used considerably less CPU with the crypto units in place (I have two, although I think only one is being used, since this is a single transfer).

So, SunSSH benefits from the built-in cryptographic hardware in the UltraSPARC® T2 processor!

Steffen

Saturday May 12, 2007

Network performance differences within an IP Instance vs. across IP Instances

When consolidating or co-locating multiple applications on the same system, inter-application network traffic typically stays within the system, since the shared IP stack in the kernel recognizes that the destination address is on the same system and loops the data back up the stack without ever putting it on a physical network. This has introduced some challenges for customers deploying Solaris Containers (specifically zones) where different Containers are on different subnets and traffic between them is expected to leave the system (perhaps through a router or firewall, to restrict or monitor inter-tier traffic).

With IP Instances, in Solaris Nevada build 57 and targeted for Solaris 10 7/07, there is the ability to configure zones with exclusive IP Instances, forcing all traffic leaving a zone out onto the network. This introduces additional network stack processing on both the transmit and the receive sides. Prompted by some customer questions about this, I performed a simple test to measure the difference.
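
For reference, here is a sketch of how a zone is configured as exclusive-IP with zonecfg, using the x1 zone and bge1 interface from my setup below (the zone is otherwise created the usual way):

v210# zonecfg -z x1
zonecfg:x1> set ip-type=exclusive
zonecfg:x1> add net
zonecfg:x1:net> set physical=bge1
zonecfg:x1:net> end
zonecfg:x1> commit
zonecfg:x1> exit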

On two systems, a V210 with two 1.336 GHz CPUs and 8 GB of memory, and an x4200 with two dual-core Opteron XXXX processors and 8 GB of memory, I ran FTP transfers between zones. My switch is a Netgear GS716T Smart Switch with 1 Gbps ports. The V210 has four bge interfaces and the x4200 has four e1000g interfaces.

I created four zones. Zones x1 and x2 have eXclusive IP Instances, while zones s1 and s2 have Shared IP Instances (IP is shared with the global zone). Both systems are running Solaris 10 7/07 build 06.

Relevant zonecfg info is as follows (all zones are sparse):


v210# zonecfg -z x1 info
zonename: x1
zonepath: /localzones/x1
...
ip-type: exclusive
net:
        address not specified
        physical: bge1

v210# zonecfg -z s1 info
zonename: s1
zonepath: /localzones/s1
...
ip-type: shared
net:
        address: 10.10.10.11/24
        physical: bge3
 
As a test user in each zone, I created a file using 'mkfile 1000m /tmp/file1000m'. Then I used ftp to transfer it between zones. No tuning was done whatsoever.

The results are as follows.

V210: (bge)

Exclusive to Exclusive
x1# /usr/bin/time ftp x2 << EOF
cd /tmp
bin
put file1000m
EOF

real       17.0
user        0.2
sys        11.2

Exclusive to Shared
x1# /usr/bin/time ftp s2 << EOF
cd /tmp
bin
put file1000m
EOF

real       17.3
user        0.2
sys        11.6

Shared to Shared
s2# /usr/bin/time ftp s1 << EOF
cd /tmp
bin
put file1000m
EOF

real        6.6
user        0.1
sys         5.3


X4200: (e1000g)

Exclusive to Exclusive
x1# /usr/bin/time ftp x2 << EOF
cd /tmp
bin
put file1000m
EOF

real        9.1
user        0.0
sys         4.0

Exclusive to Shared
x1# /usr/bin/time ftp s2 << EOF
cd /tmp
bin
put file1000m
EOF

real        9.1
user        0.0
sys         4.1

Shared to Shared
s2# /usr/bin/time ftp s1 << EOF
cd /tmp
bin
put file1000m
EOF

real        4.0
user        0.0
sys         3.5

I ran each test several times and picked a result that seemed average across the runs. Not very scientific, so here is the summary as a table (times in seconds):

                  V210 (bge)            X4200 (e1000g)
                  real  user   sys      real  user  sys
Excl -> Excl      17.0   0.2  11.2       9.1   0.0  4.0
Excl -> Shared    17.3   0.2  11.6       9.1   0.0  4.1
Shared -> Shared   6.6   0.1   5.3       4.0   0.0  3.5

Something I noticed that surprised me was that the time spent in IP and the driver is clearly measurable on the V210 with bge, and much less so on the x4200 with e1000g.
