Thursday Oct 08, 2009

TCP and real-time

We had two systems, each running a real-time Java application under Java RTS; the RT applications on both sides communicated by sending messages back and forth. We wondered whether messages could be exchanged over a standard TCP connection with a bounded transmit time close enough to the mean transmit time (achieving what I'll call in the following a "deterministic" communication).

The systems we considered were two single-cpu systems with four gigabit ethernet network ports. We dedicated a network port on each side to the "deterministic" communication: messages are sent back and forth over a TCP connection between two real-time Java threads (one on each system) through the dedicated ports, and no other traffic goes through these ports. In our experiments the other ports see moderate traffic: two standard Java threads (again, one on each side) continuously send small 1K messages back and forth over another TCP connection.
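The message exchange on the dedicated connection can be sketched as below. This is a minimal loopback sketch with hypothetical names (PingPong, MSG_SIZE): the real application runs these loops in real-time threads under Java RTS and binds the sockets to the addresses of the dedicated network ports rather than the loopback interface used here.

```java
import java.io.*;
import java.net.*;

// Minimal sketch of two threads exchanging fixed-size messages over TCP.
// Hypothetical example, not the actual application code: the real setup
// uses javax.realtime.RealtimeThread instances and dedicated NIC ports.
public class PingPong {
    static final int MSG_SIZE = 1024;

    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        Thread echo = new Thread(() -> {
            try (Socket s = server.accept()) {
                byte[] buf = new byte[MSG_SIZE];
                DataInputStream in = new DataInputStream(s.getInputStream());
                OutputStream out = s.getOutputStream();
                for (int i = 0; i < 10; i++) {   // echo each message back
                    in.readFully(buf);
                    out.write(buf);
                }
            } catch (IOException e) { throw new UncheckedIOException(e); }
        });
        echo.start();

        try (Socket s = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
            s.setTcpNoDelay(true);               // push small messages out immediately
            DataInputStream in = new DataInputStream(s.getInputStream());
            OutputStream out = s.getOutputStream();
            byte[] msg = new byte[MSG_SIZE];
            for (int i = 0; i < 10; i++) {
                long t0 = System.nanoTime();
                out.write(msg);                  // send a message
                in.readFully(msg);               // wait for the reply
                System.out.println("round trip " + (System.nanoTime() - t0) / 1000 + " us");
            }
        }
        echo.join();
        System.out.println("done");
    }
}
```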

The two dedicated network ports are back to back: they are connected by a crossover cable.

The systems are equipped with e1000g NICs and running Solaris 10 update 4.

TCP is notoriously not designed for deterministic communication but, in the controlled system configuration described above:
  • a packet corruption/loss is very unlikely because the systems are back to back (no network equipment in between).
  • the application threads are running in the real-time scheduling class and are guaranteed to be ready to drain the network when required (assuming a correctly designed application).

So the TCP congestion control, flow control and retransmission mechanisms should have limited impact on transmission times. Whether bounded-time transmission can be achieved should mostly be a matter of whether the Solaris 10 TCP implementation allows it, rather than whether the TCP protocol supports it.

A side note on packet corruption/loss: it should be very rare in this configuration, but if it does happen, transmission delays can occur. The application would then have to be designed to detect these very rare delays, treat them as a form of soft link failure and recover from them.
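One possible detection scheme (an assumption on my part, the article does not prescribe one) is to bound the time a receive may take with a socket read timeout and treat a timeout as a soft link failure:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.*;

// Sketch of delay detection via SO_TIMEOUT (a hypothetical scheme, not
// from the original setup): a read that exceeds the expected transmit
// time bound is treated as a soft link failure.
public class DelayDetect {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        // Peer connects but deliberately sends nothing, simulating a
        // transmission stuck well past its expected bound.
        new Thread(() -> {
            try {
                Socket peer = server.accept();
                Thread.sleep(1000);
                peer.close();
            } catch (Exception ignored) {}
        }).start();

        try (Socket s = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
            s.setSoTimeout(200);                 // expected max transmit time + margin, in ms
            InputStream in = s.getInputStream();
            try {
                in.read();                       // blocks until data arrives or the timeout fires
            } catch (SocketTimeoutException e) {
                System.out.println("soft link failure: no message within bound");
                // recovery would go here: resynchronize, reopen the connection, ...
            }
        }
    }
}
```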

Indeed, we've found that the behavior of the Solaris 10 TCP implementation plays a large part in achieving bounded-time transmissions. The rest of this entry explains what we've discovered, the configuration steps we took and how the TCP implementation would need to be adjusted to accommodate our requirements.

The architecture of the Solaris 10 TCP stack and its specific features is documented elsewhere; here is a summary of what matters for this discussion.

Solaris uses an abstraction called an squeue. As of Solaris 10, squeues are only used for TCP processing. Several squeues are statically created at boot time (at least one per processor), and more can be created by the system depending on the load. A TCP connection is bound to a single squeue, and a single squeue can be shared by many connections. The squeue acts as a serialization mechanism: processing on different squeues can proceed in parallel, with several threads on several cpus, but processing for a single squeue only happens in a single thread at a time (having multiple threads work on a single TCP connection is known not to be efficient). This architecture targets efficiency for a massive number of connections spread across several squeues, themselves handled by different cpus.

When a thread has work to do on a connection, it first retrieves the squeue associated with the connection. It then needs to acquire ownership of the squeue, which can only happen if no other thread is already working on it (and has already acquired ownership). If the thread succeeds in acquiring ownership, it proceeds with the TCP work it has to do and, once done, releases ownership of the squeue. If grabbing the squeue fails, rather than waiting for the squeue to become available, the thread enqueues a work element on the squeue and moves on. The work elements are processed in order later on.
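The enter/drain discipline can be modeled with a toy Java class (a sketch of the idea only, nothing like the actual kernel code): a thread that cannot become the owner enqueues its work element and moves on, and whoever owns the squeue drains the elements in FIFO order, one at a time.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of the squeue serialization discipline (an illustrative
// sketch, not the Solaris kernel implementation).
public class ToySqueue {
    private final AtomicBoolean owned = new AtomicBoolean(false);
    private final ConcurrentLinkedQueue<Runnable> q = new ConcurrentLinkedQueue<>();

    public void enter(Runnable work) {
        q.add(work);                        // publish the work element
        drainIfOwner();                     // process inline only if we can own the squeue
    }

    void drainIfOwner() {
        // If the CAS fails, another thread owns the squeue: leave the work
        // element behind and move on, as the kernel threads do.
        while (!q.isEmpty() && owned.compareAndSet(false, true)) {
            try {
                Runnable w;
                while ((w = q.poll()) != null) w.run();   // serialized, in-order processing
            } finally {
                owned.set(false);           // release ownership
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ToySqueue sq = new ToySqueue();
        final int[] count = {0};            // mutated only while the squeue is owned
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) sq.enter(() -> count[0]++);
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        sq.drainIfOwner();                  // drain anything left behind by failed CASes
        System.out.println("processed " + count[0]);
    }
}
```

Despite four threads hammering the queue, every element runs under exactly one owner at a time, so the count is exact without any per-element locking.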

A major determinism problem that we've seen repeatedly occurs when the squeue is in use (for non-deterministic TCP traffic) and work needs to be performed on the deterministic TCP connection. For instance, an RT thread wakes up and preempts a non-RT thread that is doing some network processing and thus owns an squeue. The RT thread starts sending on a TCP connection; it needs the TCP connection's squeue, which happens to be the one owned by the non-RT thread it just preempted, so it can't get it. A TCP work element is enqueued on the squeue for later processing and the RT thread continues its work.

The squeue will be drained sometime later and the work element will be processed either by:

  • some other thread that happens to do some TCP processing using the same squeue,
  • an squeue worker thread: a thread running at system priority that is dedicated to the processing of one squeue and that is woken up when the squeue is not empty,
  • or an interrupt thread for a network interface.

Draining of an squeue was tuned for maximum out-of-the-box overall throughput and is paced to achieve the best possible overall system behavior. One thread will drain for no more than a few milliseconds, even if the squeue is not empty after that delay. An squeue worker thread will pause for a few milliseconds after some time spent draining and only then go back to draining: this keeps the system balanced and allows it to handle other tasks.

So here are the two problems as far as we are concerned:

  • TCP processing for a "deterministic" TCP connection may be handled by some thread at some unrelated non real-time priority causing a priority inversion
  • TCP processing for a "deterministic" TCP connection may be arbitrarily delayed by the pacing scheme

We've seen these cause delays of up to a few hundred milliseconds.

On a multi-core system, one could perhaps arrange for the deterministic connection to be bound to an squeue that is isolated on a core using a processor set. This requires a good understanding of exactly how a connection is bound to a particular squeue, and it certainly does not work the same way on the machine where the application connects and on the machine where it accepts the connection. In any case, we were restricted to single-cpu systems.

The TCP implementation offers some control over when exactly a work element is enqueued rather than processed inline, when draining occurs and how draining is paced. By changing some undocumented /etc/system tunables, one can mostly disable the pacing (the squeue will be drained until it is empty), always enqueue even if the squeue is not in use, and have only worker threads perform the draining.

Here is how we've solved the priority inversion:

  • All threads always enqueue a work element into an squeue rather than attempt to do the processing themselves even if the squeue can be acquired right away.
  • Only squeue worker threads will drain the squeue.
  • Pacing is disabled. The squeue worker thread will run as soon as an element is enqueued in the squeue and it will work until the squeue is empty.

Here is what we've added to /etc/system for this:

set ip:ip_squeue_worker_wait=0
set ip:squeue_workerdrain_ms=1000
set ip:tcp_squeue_wput=3
set ip:ip_squeue_enter=3
set ip:squeue_intrdrain_ms=1

How is that supposed to help determinism? It's not. TCP processing is now always done by a thread running at a non-RT priority (an squeue worker thread running at system priority). There is a missing bit: the squeue worker threads need to run at an RT priority.

Sadly, this cannot be controlled with any undocumented tunable. We've done experiments where we force the priority of the squeue worker threads to an RT priority from a device driver (squeue worker threads are referenced by per-cpu kernel data structures, so a reference to their data structures can be retrieved). Those experiments show that this setting indeed offers a guaranteed max transmit time.

With all these changes, TCP processing occurs as soon as the RT squeue worker threads can run. Only other RT threads from the application and the interrupt threads can delay the TCP processing. If the application is well designed and allows the squeue worker threads to get enough cpu cycles in time, bounded transmit time can be achieved.

Note that all TCP processing is now done at an RT priority, even that of non-"deterministic" connections. We can assume that the amount of TCP work done at an RT priority is bounded by the cpu cycles the non-RT part of the application gets and by the size of the pipes, so this does not make transmit time for the deterministic connection unbounded.

There are a few other settings that we found help determinism:

  • For large transfers, make sure plenty of per-connection buffer space is available on both sides. This is done with:

ndd -set /dev/tcp tcp_xmit_hiwat 2097152
ndd -set /dev/tcp tcp_recv_hiwat 2097152
ndd -set /dev/tcp tcp_max_buf 8388608

  • Entries in the ARP cache (mapping from an IP address to a hardware address) are flushed every 20 minutes. Outgoing packets are dropped when that happens; timer-based retransmission is then triggered, causing transmit time spikes. This can be solved by setting static ARP entries with the arp command.
  • Jumbo frames (large ethernet packets) improve maximum transmit time for large messages (I used 16k frames rather than the standard 1.5k).
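Large per-connection buffers can also be requested from the application side through the socket API. This is a sketch under the assumption that the application sizes its buffers to mirror the tcp_xmit_hiwat/tcp_recv_hiwat settings; note that the OS may clamp the values it actually grants, and that the receive buffer must be set on the listening socket before accepting for a large window to be advertised during connection establishment.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Requesting 2MB per-connection buffers from Java (illustrative sketch;
// the size mirrors the ndd hiwat settings and may be clamped by the OS).
public class Buffers {
    static final int BUF = 2 * 1024 * 1024;

    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket();
        // Set before bind/accept so accepted connections can advertise
        // a large TCP window during the handshake.
        server.setReceiveBufferSize(BUF);
        server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));

        Socket client = new Socket();
        client.setSendBufferSize(BUF);
        client.setReceiveBufferSize(BUF);
        client.connect(new InetSocketAddress(InetAddress.getLoopbackAddress(),
                                             server.getLocalPort()));

        Socket accepted = server.accept();
        System.out.println("send buffer requested: " + client.getSendBufferSize());
        accepted.close();
        client.close();
        server.close();
    }
}
```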

A side note on retransmission: if a packet loss occurs, as long as more than 3 packets are sent after the lost packet, TCP's fast retransmit mechanism will detect the loss and retransmit almost immediately (those packets cause ACKs from the receiver, and the sender uses 3 duplicate ACKs as a hint that the packet was lost, so it can resend it immediately). So for large messages made of a large number of packets, the probability that a packet loss will cause a timer-based retransmission is low. Add to that the fact that the probability of a loss between two correctly connected back-to-back systems is very low. For small messages, or when a loss hits the last packets of a message, timer-based retransmission is triggered and a 500ms delay in the transmission occurs. One trick that can be played for small messages is to pad all outgoing messages with extra data so that enough (3) more packets are sent and fast retransmit can be triggered. The receiver only waits for the useful data; the padding is received as part of the next communication and is discarded by the receiver. This increases min and mean transmit times but offers guarantees on max transmit time.
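The padding trick above can be sketched as follows. The framing (a 4-byte length header) and the MSS value are assumptions for illustration; the idea is only that every message occupies at least 4 segments, so any lost segment is followed by enough traffic to generate 3 duplicate ACKs.

```java
import java.io.*;
import java.net.*;

// Sketch of the message padding trick (hypothetical framing; the MSS
// value is an assumption): pad every message so at least 3 more segments
// follow any given one, letting fast retransmit, rather than the 500ms
// retransmit timer, recover from a loss near the end of a message.
public class PaddedMessages {
    static final int MSS = 1448;               // assumed segment payload size
    static final int MIN_BYTES = 4 * MSS;      // force at least 4 segments per message

    static void sendPadded(DataOutputStream out, byte[] payload) throws IOException {
        int pad = Math.max(0, MIN_BYTES - (4 + payload.length));
        out.writeInt(payload.length);          // length header: useful bytes only
        out.write(payload);
        out.write(new byte[pad]);              // padding, ignored by the receiver
        out.flush();
    }

    static int pendingPad = 0;                 // padding left over from the last message
    static byte[] receivePadded(DataInputStream in) throws IOException {
        byte[] scratch = new byte[MIN_BYTES];
        in.readFully(scratch, 0, pendingPad);  // discard the previous message's padding
        int len = in.readInt();
        byte[] payload = new byte[len];
        in.readFully(payload);                 // return as soon as the useful data is in
        pendingPad = Math.max(0, MIN_BYTES - (4 + len));
        return payload;
    }

    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        new Thread(() -> {
            try (Socket s = server.accept()) {
                DataOutputStream out = new DataOutputStream(s.getOutputStream());
                sendPadded(out, "hello".getBytes());
                sendPadded(out, "real-time".getBytes());
            } catch (IOException e) { throw new UncheckedIOException(e); }
        }).start();

        try (Socket s = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
            DataInputStream in = new DataInputStream(s.getInputStream());
            System.out.println(new String(receivePadded(in)));
            System.out.println(new String(receivePadded(in)));
        }
    }
}
```

Note how the receiver hands back "hello" as soon as the 5 useful bytes arrive; the ~5KB of padding is only consumed, and thrown away, at the start of the next receive.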

So it's encouraging to see that the Solaris TCP architecture is flexible enough to offer good temporal guarantees (we had tests running for several days on tuned systems, sending data back and forth continuously, with no noticeable outlier). The solution presented here would require changes to the Solaris kernel to make the priority of the squeue worker threads tunable, either at boot time or ideally at runtime: the priority of the squeue worker threads becomes part of the RT application's configuration, so the user will want to be able to experiment with different values.

The Solaris TCP implementation is evolving with the introduction of virtualization support, so what I describe here may not apply to later Solaris versions.

Once again, DTrace proved invaluable in helping us understand the system's behavior (with limited prior understanding of the Solaris TCP implementation), even when we had to get to the bottom of a single outlier in hour-long runs.

Thanks to Erik for his priceless insights on TCP and Solaris.


Roland Westrelin

