Using Solaris and SPARC Networking and Virtualization

TCP Fusion and improved loopback traffic

In the past, when two processes were communicating using TCP on the same system, a lot of the TCP and IP protocol processing was performed just as it was for traffic to and from another system. Since a significant amount of CPU is spent on the protocol layers both sending and receiving to insure successful, complete, in order, non-duplicated, re-routed around network failures for data that never left the system, there is considerable performance benefit in providing a short circuit for the data.

In Solaris 10 6/06 a feature called TCP Fusion was delivered, which removed all the stack processing when both ends of the TCP connection are in the same system, and now with IP Instances, in the same IP Instance (between the global zone and all shared IP zones, or within an exclusive zone). There are some exceptions to this, including when using IPsec, IPQoS, raw-socket, kernel SSL, non-simple TCP/IP conditions. or the two end points are on different squeues. A fused connect will revert to unfused if an IP Filter rule will drop a packet. However TCP fusion is done in the general case.

So why do I bring this up? With TCP fusion enabled (which it is by default in Solaris 10 6/06 and later, and in OpenSolaris), when a TCP connection is created between processes on a system, the necessary things are set up to transfer data from the sender to the receiver without sending it down and back up the stack. The typical flow control of filling a send buffer (defaults to 48K or the value of tcp_xmit_hiwat, unless changed via a socket operation) still applies. With TCP Fusion on, there is a second check, which is the number of writes to the socket without a read. The reason for the counter is to allow the receiver to get CPU cycles, since the sender and receiver are on the same system and may be sharing one or more CPUs. The default value of this counter is eight (8), as determined by tcp_fusion_rcv_unread_min. The value per TCP connection is calculated as

MAX(sndbuf >> 14, tcp_fusion_rcv_unread_min);

Some details of the reasoning and implementation are in Change Request 4821256.

When doing large writes, or when the receiver is actively reading, the buffer flow control dominates. However, when doing smaller writes, it is easy for the sender to end up with a condition where the number of consecutive writes without a read is exceeded, and the writer blocks, or if using non-blocking I/O, will get an EAGAIN error.

The latter was a case at a customer of mine. An ISV application was reporting EAGAIN errors on a new installation, something that hadn't been seen before. More importantly, the ISV was also not seeing it elsewhere or in their test environment.

After some investigation using DTrace, including reproduction on slightly different system configuration, it became clear that the sending application was getting the error after a burst of writes. The application has both local and remote (on other systems) receivers, and the EAGAIN errors were only happening on the local connection.

I also saw that the application was repeatedly doing a pair of writes, one of 12 bytes and the second of 696 bytes. Thus it would be easy to hit the consecutive write counter before the write buffer is ever filled.

To test this I suggested the customer change the tcp_fusion_rcv_unread_min on their running system using mdb(1). I suggested they increase the counter by a factor of four (4), just to be safe.

# echo "tcp_fusion_rcv_unread_min/W 32" | mdb -kw
tcp_fusion_rcv_unread_min: 0x8 = 0x20

Here is how you check what the current value is.
# echo "tcp_fusion_rcv_unread_min/D" | mdb -k
tcp_fusion_rcv_unread_min: 32

After running several hours of tests, the EAGAIN error did not return.

Since then I have suggested they set tcp_fusion_rcv_unread_min to 0, to turn the check off completely. This will allow the buffer size and total outstanding write data volume to determine whether the sender is blocked, as it is for remote connections. Since the mdb is only good until the next reboot, I suggested the customer change the setting in /etc/system.

\* Set TCP fusion to allow unlimited outstanding writes up to the TCP send buffer set by default or the application.
\* The default value is 8.
set ip:tcp_fusion_rcv_unread_min=0

To turn TCP Fusion off all together, something I have not tested with, the variable do_tcp_fusion can be set from its default 1 to 0.

I hope this helps someone who might be trying to understand why errors, or maybe less than expected throughput, is being seen on local connections.

And I would like to note that in OpenSolaris only the do_tcp_fusion setting is available. With the delivery of CR 6826274, the consecutive write counting has been removed. The TCP Fusion code has also been moved into its own file

Thanks to Jim Eggers, Jim Fiori, Jim Mauro, Anders Parsson, and Neil Putnam for their help as I was tracking all this stuff down!


PS. After publishing, I wrote this DTrace script to show what the per connection outstanding write counter tcp_fuse_rcv_unread_hiwater is set to.

# more tcp-fuse.d
#!/usr/sbin/dtrace -qs
self->tcp = (tcp_t \*) arg0;
/self->tcp > 0/
this->peer = (tcp_t \*) self->tcp->tcp_loopback_peer;
this->hiwat = this->peer->tcp_fuse_rcv_unread_hiwater;
printf("pid: %d tcp_fuse_rcv_unread_hiwater: %d \\n", pid, this->hiwat);
self->tcp = 0;
this->peer = 0;
this->hiwat = 0;

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.