In Solaris 10 6/06 a feature called TCP Fusion was delivered, which removes all the stack processing when both ends of a TCP connection are on the same system and, now with IP Instances, in the same IP Instance (between the global zone and all shared IP zones, or within an exclusive zone). There are some exceptions to this, including when using IPsec, IPQoS, raw sockets, or kernel SSL, under non-simple TCP/IP conditions, or when the two end points are on different squeues. A fused connection will revert to unfused if an IP Filter rule will drop a packet. However, TCP fusion is done in the general case.
So why do I bring this up? With TCP fusion enabled (which it is by default in Solaris 10 6/06 and later, and in OpenSolaris), when a TCP connection is created between processes on a system, the necessary things are set up to transfer data from the sender to the receiver without sending it down and back up the stack. The typical flow control of filling a send buffer (defaults to 48K or the value of tcp_xmit_hiwat, unless changed via a socket operation) still applies. With TCP Fusion on, there is a second check, which is the number of writes to the socket without a read. The reason for the counter is to allow the receiver to get CPU cycles, since the sender and receiver are on the same system and may be sharing one or more CPUs. The default value of this counter is eight (8), as determined by tcp_fusion_rcv_unread_min. The value per TCP connection is calculated as
MAX(sndbuf >> 14, tcp_fusion_rcv_unread_min);
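To make the expression concrete, here is a minimal sketch in C of the same calculation. The function name `fuse_unread_limit` and the local variable standing in for the tunable are my own illustration, not kernel code, and the sketch models only the formula as quoted.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel tunable; 8 is the default cited above. */
static uint32_t tcp_fusion_rcv_unread_min = 8;

/*
 * Illustrative helper (not a kernel function): mirrors
 * MAX(sndbuf >> 14, tcp_fusion_rcv_unread_min).
 * Shifting right by 14 allows one unread write per 16 KB of send buffer.
 * The special handling of a tunable value of 0 is not modeled here.
 */
static uint32_t fuse_unread_limit(uint32_t sndbuf)
{
    uint32_t scaled = sndbuf >> 14;

    return scaled > tcp_fusion_rcv_unread_min
        ? scaled
        : tcp_fusion_rcv_unread_min;
}
```

With the default 48K send buffer, 49152 >> 14 is 3, so the floor of 8 applies. A 1 MB send buffer would give 64.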
When doing large writes, or when the receiver is actively reading, the buffer flow control dominates. However, when doing smaller writes, it is easy for the sender to exceed the allowed number of consecutive writes without a read, at which point the writer blocks or, if using non-blocking I/O, gets an EAGAIN error.
The latter was the case at a customer of mine. An ISV application was reporting EAGAIN errors on a new installation, something that hadn't been seen before. More importantly, the ISV was also not seeing it elsewhere or in their test environment.
After some investigation using DTrace, including reproduction on a slightly different system configuration, it became clear that the sending application was getting the error after a burst of writes. The application has both local and remote (on other systems) receivers, and the EAGAIN errors were only happening on the local connection.
I also saw that the application was repeatedly doing a pair of writes, one of 12 bytes and the second of 696 bytes. Thus it would be easy to hit the consecutive write counter before the write buffer is ever filled.
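The arithmetic can be checked with a small model in C. This is a simplified sketch, not the kernel logic: it assumes no reads happen at all, and the names `simulate_stall` and `struct stall`, as well as the exact comparisons, are my own assumptions.

```c
#include <stddef.h>

enum block_reason { BLOCKED_BY_COUNT, BLOCKED_BY_BUFFER };

struct stall {
    unsigned writes;          /* writes accepted before the sender stalls */
    size_t bytes;             /* bytes queued at that point */
    enum block_reason reason; /* which limit tripped first */
};

/*
 * Hypothetical model: repeat the write pattern in sizes[] with no reader
 * and report which limit stops the sender first. Requires nsizes > 0.
 */
static struct stall simulate_stall(const size_t *sizes, size_t nsizes,
                                   size_t sndbuf, unsigned limit)
{
    struct stall s = { 0, 0, BLOCKED_BY_COUNT };

    for (;;) {
        size_t next = sizes[s.writes % nsizes];

        if (s.writes >= limit) {          /* consecutive-write counter */
            s.reason = BLOCKED_BY_COUNT;
            return s;
        }
        if (s.bytes + next > sndbuf) {    /* send buffer would overflow */
            s.reason = BLOCKED_BY_BUFFER;
            return s;
        }
        s.bytes += next;
        s.writes++;
    }
}
```

For the 12 and 696 byte pair with a 49152 byte buffer and a limit of 8, the model stalls on the counter after 8 writes with only 2832 bytes queued. With 16384 byte writes it stalls on the buffer after 3 writes instead.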
To test this I suggested the customer change the tcp_fusion_rcv_unread_min on their running system using mdb(1). I suggested they increase the counter by a factor of four (4), just to be safe.
# echo "tcp_fusion_rcv_unread_min/W 0t32" | mdb -kw
tcp_fusion_rcv_unread_min: 0x8 = 0x20
# echo "tcp_fusion_rcv_unread_min/D" | mdb -k
Since then I have suggested they set tcp_fusion_rcv_unread_min to 0, to turn the check off completely. This allows the buffer size and the total volume of outstanding write data to determine whether the sender is blocked, as it does for remote connections. Since a change made with mdb lasts only until the next reboot, I suggested the customer also change the setting in /etc/system.
* Set TCP fusion to allow unlimited outstanding writes up to the TCP send buffer set by default or the application.
* The default value is 8.
I hope this helps someone who might be trying to understand why errors, or maybe less than expected throughput, are being seen on local connections.
And I would like to note that in OpenSolaris only the do_tcp_fusion setting is available. With the delivery of CR 6826274, the consecutive write counting has been removed. The TCP Fusion code has also been moved into its own file.
Thanks to Jim Eggers, Jim Fiori, Jim Mauro, Anders Parsson, and Neil Putnam for their help as I was tracking all this stuff down!
PS. After publishing, I wrote this DTrace script to show what the per connection outstanding write counter tcp_fuse_rcv_unread_hiwater is set to.
# more tcp-fuse.d
self->tcp = (tcp_t *) arg0;
/self->tcp > 0/
this->peer = (tcp_t *) self->tcp->tcp_loopback_peer;
this->hiwat = this->peer->tcp_fuse_rcv_unread_hiwater;
printf("pid: %d tcp_fuse_rcv_unread_hiwater: %d \n", pid, this->hiwat);
self->tcp = 0;
this->peer = 0;
this->hiwat = 0;