Thursday Oct 08, 2009

TCP and real-time

We had two systems each running a real-time Java application under Java RTS, the RT applications on both sides communicated by sending messages back and forth. We wondered whether messages could be exchanged using a standard TCP connection with a bounded transmit time close enough to the mean transmit time (achieving what I'll call in the following a "deterministic" communication).

The systems we considered were two single cpu systems with four giga-ethernet network ports. We dedicated a network port on each side for the "deterministic" communication: messages are sent back and forth over a TCP connection between two real-time Java threads (one on each system) through the dedicated ports, no other traffic goes through these dedicated ports. In our experiments the other ports see moderate traffic: two standard Java threads (again one on each side) continuously send small 1K messages back and forth over another TCP connection.

The two dedicated network ports are back to back: they are connected by a crossover cable.

The systems are equipped with e1000g NICs and running Solaris 10 update 4.

TCP is notoriously not designed for deterministic communication but, in the controlled system configuration described above:
  • a packet corruption/loss is very unlikely because the systems are back to back (no network equipment in between).
  • the application threads are running in the real-time scheduling class and are guaranteed to be ready to drain the network when required (assuming a correctly designed application).

So the TCP congestion control, TCP flow control, TCP retransmission mechanisms should have limited impact on transmission times. Whether bounded time transmission can be achieved should be mostly a matter of whether the Solaris 10 TCP implementation allows it rather than whether the TCP protocol supports it.

A side note on packet corruption/loss: it should be very rare in this configuration, but still if it happens, delays in the transmission could be encountered. Then the application would have to be designed to detect the very rare delays, treat them as a form of soft link failure and recover from it.

Indeed we've found that the behavior of the Solaris 10 TCP implementation has a lot to do in achieving bounded time transmissions. The rest of this entry explains what we've discovered, the configuration steps we took and how the TCP implementation would need to be adjusted to accommodate our requirements.

The architecture of the Solaris 10 TCP stack and its specificities is covered here.

Solaris uses an abstraction called an squeue. As of Solaris 10, squeues are only used for TCP processing. There are several squeues statically created at boot time (one per processor at least). Some more can be created by the system depending on the load. A TCP connection is bound to a single squeue. A single squeue can be shared by many connections. The squeue acts as a serialization mechanism: processing on different squeues can proceed in parallel with several threads/on several cpus. Processing for a single squeue will only happen in a single thread at a time: having multiple threads work on a single TCP connection is known not to be efficient. This architecture targets efficiency for a massive number of connections spread across several squeues themselves handled by different cpus.

When some thread has work to do on a connection, it first retrieves the squeue associated with the connection. Then it needs to acquire ownership of the squeue and that can only happen if no other thread is already working on the squeue (and already acquired ownership of the squeue). If a thread succeeds in acquiring ownership of the squeue then it proceeds with the TCP work it has to do and, once done releases ownership of the squeue. If grabbing the squeue fails, rather than waiting for the squeue to become available, the thread enqueues a work element in the squeue and moves on. The work elements will be processed in order later on.

A major determinism problem that we've seen repeatedly occurs when the squeue is in use (for non deterministic TCP traffic) and work needs to be performed on the deterministic TCP connection. For instance, a RT thread wakes up, preempts a non RT thread that is doing some network processing and thus owns an squeue. The RT thread starts sending on a TCP connection, it needs the TCP connection's squeue which happens to be the one owned by the non RT thread that it just preempted so it can't get it. A TCP work element is enqueued on the squeue for later processing and the RT thread continues its work.

The squeue will be drained sometime later and the work element will be processed either by:

  • some other thread that happens to do some TCP processing using the same squeue.
  • or an squeue worker thread: a thread running at system priority that is dedicated to the processing of one squeue and that is woken up when the squeue is not empty
  • or an interrupt thread for a network interface

Draining of an squeue was tuned for maximum out of the box overall throughput and is paced to achieve the best possible overall system behavior. One thread will be draining for no more than a few milliseconds, even if after that delay the squeue is not empty. An squeue worker thread will pause for a few milliseconds after some time draining and then only go back to draining: this keeps the system balanced and allows it to handle other tasks.

So here are the two problems as far as we are concerned:

  • TCP processing for a "deterministic" TCP connection may be handled by some thread at some unrelated non real-time priority causing a priority inversion
  • TCP processing for a "deterministic" TCP connection may be arbitrarily delayed by the pacing scheme

We've seen these cause delays of up to a few hundreds of milliseconds.

On a multi-core systems, one could maybe try to arrange for the deterministic connection to be bound to an squeue that is isolated on a core using a processor set. This requires a good understanding of how exactly one connection is bound to one particular squeue. This certainly does not work the same way on the machine where the application connects and on the machine where the application accepts the connection. Anyway, we were restricted to single cpu systems.

The TCP implementation offers some control over when exactly a work element is enqueued rather than processed inline, when draining occurs and how draining is paced. By using some undocumented /etc/system tunable change, one can mostly disable the pacing (the squeue will be drained until it is empty), always enqueue even if the squeue is not in use and only have worker threads perform the draining.

Here is how we've solved the priority inversion:

  • All threads always enqueue a work element into an squeue rather than attempt to do the processing themselves even if the squeue can be acquired right away.
  • Only squeue worker threads will drain the squeue.
  • Pacing is disabled. The squeue worker thread will run as soon as an element is enqueued in the squeue and it will work until the squeue is empty.

Here is what we've added to /etc/system for this:

set ip:ip_squeue_worker_wait=0
set ip:squeue_workerdrain_ms=1000
set ip:tcp_squeue_wput=3
set ip:ip_squeue_enter=3
set ip:squeue_intrdrain_ms=1

How is that supposed to help determinism? It's not. TCP processing is now always done by a thread running at non RT priority (a squeue worker thread running at system priority). There is a missing bit: the squeue worker threads need to run at an RT priority.

Sadly, this cannot be controlled with any undocumented tunable. We've done experiments where we force the priority of the squeue worker threads to a RT priority from a device driver (squeue worker threads are referenced by per cpu kernel data structures so a reference to their data structure can be retrieved). Those experiments show that this setting indeed offer guaranteed max transmit time.

With all these changes, TCP processing will occur as soon as the RT squeue worker threads can run. Only other RT threads from the application and the interrupt threads can delay the TCP processing. If the application is well designed and allows for the squeue worker threads to get enough cpu cycles in time, bounded transmit time can be achieved.

Note that all TCP processing is now done at a RT priority, even the one of non "deterministic" connections. We can assume that the amount of TCP work done at a RT priority is bounded by the cpu cycles the non RT part of the application gets and the size of the pipes. Thus this does not make transmit time for the deterministic connection unbounded.

There are a few other settings that we found help determinism:

  • For large transfer make sure plenty of per connection buffer is available on both sides. This is done with:

ndd -set /dev/tcp tcp_xmit_hiwat 2097152
ndd -set /dev/tcp tcp_recv_hiwat 2097152
ndd -set /dev/tcp tcp_max_buf 8388608

  • Entries in the arp cache (mapping from an IP to a hardware address) are flushed every 20 minutes. Outgoing packets are dropped when that happens. Timer based retransmission is then triggered. This causes transmit time spikes. This can be solved by setting static arp entries with the arp command.
  • Jumbo frames (large ethernet packets) help maximum transmit time (I used 16k rather than the standard 1.5k) for large messages.

A side-note on retransmission: if a packet loss occurs, as long as there are more than 3 packets sent after the packet was lost, TCP's fast retransmit mechanism will detect the loss and retransmit almost immediately (the packets result in an ACK from the receiver and the sender uses the 3 duplicate acks as a hint that the packet got lost, so it can send it immediately). So for large messages with a large number of packets, the probability that a packet loss will cause a timer based retransmission is low. Add to that, that the probability of a loss for two back to back systems correctly connected is very low. For small messages, or in the case of a packet loss in the last packets of a message, timer based retransmission is triggered and a 500ms delay in the transmission occurs. One trick that can be played for small messages is to pad all outgoing messages with extra data so that more (3) packets are sent and the fast retransmit can be triggered. The receiver only waits for the useful data. The padding is received as part of the next communication and is discarded by the receiver. This would increase min and mean transmit time but offer guarantees on max transmit time.

So it's encouraging to see that the Solaris TCP architecture is flexible enough to offer good temporal guarantees (we had tests running for several days on tuned systems, sending data back and forth continuously with no noticeable outlier). The solution we present here would require changes to the Solaris kernel to allow the priority of the squeue worker threads to be tunable, either at boot time or ideally at runtime: the priority of the squeue worker threads will become a part of the RT application configuration and thus the user will want to be able to play with different values.

The Solaris TCP implementation is evolving with the introduction of virtualization support, so what I describe here may not apply to later Solaris versions.

Again Dtrace proved invaluable to help understand the system's behavior (with limited prior understanding of Solaris TCP implementation) even when we must get to the bottom of a single outlier in hour long runs.

Thanks to Erik for his priceless insights on TCP and Solaris.

Monday Jun 29, 2009

64 bit Java RTS

We are progressing on our new release of the Sun Java Real-Time System (Java RTS) and are making the first early access builds available for download. One of the big change with this release is the availability of a 64 bit Java RTS (on our supported platforms: Solaris sparc and x86, real-time Linux x86) that should, in practice, remove all limits on the heap size (a 32 bit Java RTS process is typically limited to 2 to 3 GBytes of heap size).

For Java RTS,  going to 64 bits was challenging because we only support hotspot's client compiler and our code base is built from a JDK5 hotspot VM.

Why does RTS only support the client compiler?

A standard Java SE implementation (such as the hotspot VM) first runs methods interpreted and only when a method is identified as hot is it compiled. The VM runtime takes advantage of the interpreted execution to gather profile data and uses this data to perform more aggressive optimizations. Also, the VM runtime further optimizes for throughput by relying on optimistic optimizations: at compile time, the runtime uses the current state of the VM to perform aggressive optimizations that might later be invalidated. When that happens, the method is deoptimized (resumes execution interpreted) and a recompilation may only happen later on.

A real-time Java implementation must offer guarantees on maximum execution time. The lower the maximum execution time, the better, even when it results in an higher average execution time (lower throughput). What that means is that from the time the application enters its steady state, all code in the critical path must be compiled. A real-time VM has to compile code early and cannot benefit from extended interpreted execution and profile driven optimizations. It cannot limit compilation to hot methods but has to compile all critical methods. Finally, to keep the bound on execution time low, compiled methods cannot return to interpreted execution so no optimistic optimization can be used in the critical path.

Not all the performance benefit of hotspot's server compiler comes from its use of profile driven and optimistic optimizations but it is built with the assumption that these mechanisms are available and it is hard to isolate and disable them. On the other hand, The client compiler is a simpler compiler and it is relatively straightforward to have it produce very steady code. It is an example of trading throughput for higher determinism and lower latency.

Why is RTS still based on a JDK5 hotspot VM?

While the RTS VM is based on the hotspot's VM, it really is a different VM. The hotspot VM is optimized for throughput. The RTS VM is optimized for hard real-time and low latency. Implementing the RTSJ APIs and guaranteeing that all VM subsystems guarantee determinism requires substansial changes to the hotspot VM code. So updating the RTS code to follow hotspot changes and auditing new code to guarantee determinism is a time consuming task. So far, we found that sticking with the older code base and focusing on improving the real-time features of RTS is the best way for us to improve the product.

Why is it tricky to support 64 bit in RTS?

The hotspot VM runtime is largely written in C++ but its interpreter and compiled code are generated by an embedded assembler. On sparc, 64 bit native code is largely similar to 32 bit code (same registers, very similar instructions and ABI). On x86, 64 bit native code (so called amd64) is significantly different from 32 bit native code: more registers, extra instructions and extended instruction encoding, a different ABI. So for hotspot, going from a 32 bit VM to a 64 bit VM on x86 requires a large amount of changes to the compilers and interpreter. This work had only been done in hotspot on 64 bit x86 for the server compiler until recently: the JDK7 hotspot VM now has support for a 64 bit client on both sparc and x86. The work is still in progress.

So when we started the work to make JRTS 64 bit, we had the choice of converting our JDK5 client compiler to generate 64 bit code or to bring the client compiler from the JDK7 hotspot VM into Java RTS. Both options represent significant code changes and a lot of work. On one hand, as described above, going to 64 bit x86 is a complex task. On the other hand, a JDK7 VM has accumulated several years of code changes and is substantially different from a JDK5 VM. Anyway we chose this second option as, from JDK6, hotspot has an improved client compiler with better code generation. So getting the new client compiler largely solves the 64 bit problem and has the added benefit of improved performance.

Backporting the JDK7 client compiler into Java RTS, reconciling the code from the newer VM with the older one, enhancing it with RTSJ features, making sure that it generates deterministic code and correct 64 bit code is what has been keeping me busy over the 2.2 release cycle.

Monday Apr 21, 2008

Performance tools and realtime Java

Performance is usually expressed in terms of mean execution time (often designated as throughput). By contrast, in the realtime world, it is defined by the worst case execution. If an application must respond with a latency of 500 microseconds when a stop button is pressed, then the worst case time it takes for the application to detect that the button is pressed and to take some action is what defines performance. This time must be less that 500 microseconds.

When a developer studies realtime performance, he often has to focus on a few data points and understand why a delay is observed for those data points: if from time to time the button is pressed but the application fails to respond in less than 500 microseconds, then the developer needs to study the chain of events for those few outliers. Performance tools usually work by taking samples (oprofile on Linux for instance). Such statical tools are useful to study throughput issues but don't help much in the realtime world: a statistical tool is unlikely to catch a rare event.

One tool that is very efficient for realtime performance investigation on Solaris, in particular with java RTS, is Dtrace. It allows the instrumentation of full chain of events and to detect when something unexpected happens. Dtrace allows the developer to observe:

  1. system events. If the application is delayed after a press of the stop button, it could be that the application thread is descheduled. The sched provider will let you detect when that happens.
  2. user-level activity. Once you know your application thread lost the cpu, if it was not preempted, you are likely to want to know what it was doing at that time that made the thread block. User process tracing is here to help.
  3. any Java programs. If your application is a java application, then you'll want a java stack at the point where it lost the cpu. jstack()  does that for you.
  4. JVM events. The same way Solaris has static probes to observe scheduling activities, Java RTS has static probes to observe Java RTS specific activities.
  5. instrumented Java applications. Java RTS allows the application programmer to insert in its Java RTS application Dtrace probes.

What about Linux? Linux has systemtap. It is supposed to be for Linux what Dtrace is for Solaris. In the list of 5 points above where Dtrace helps realtime performance investigation, systemtap only covers 1- as of today. User-level support, while planned, does not appear available.

I had the opportunity of using systemtap to study a few realtime performance issues on Linux and I had some success with it. However, the lack of user-level support makes things challenging. On a Linux kernel with RT patches, I found that it can very easy to make the system unstable with simple systemtap scripts. Systemtap does not have static probes as Solaris does. It uses tapsets  instead, a way to alias a specific event (the tasks gets the cpu: probe scheduler.cpu_on) with some specific code level locations (kernel function finish_task_switch). I found that the tapsets provided with systemtap are often out of sync with the kernel you are running. I guess we can expect those issues to improve as systemtap matures.

Another issue I hit is that on a RT kernel, systemtap scripts can induce a large latency on the application (commonly several tens of milliseconds). I found that this is caused by how systemtap uses locks to protect every global variables in the systemtap script. Whenever a global variable is read in a probe definition, it is protected read-only by a lock. Whenever the global variable is written to in a probe definition, it is protected by a read/write lock. Systemtap works by converting the systemtap script to C, then compiling the C code to a binary kernel module and then loading the kernel module. By inspecting the C code generated by systemtap for one of the scripts, it is clear that locks acquisition is one of the first things done when the probe code is run. This should not be much of a problem as long as a variable is not commonly written to. On a RT kernel, because of PIP  read-only lock operations and read-write lock operations are all the same lock operations. So even if writes to global variables are uncommon in the systemstap script, contentions on the locks happen and greatly disrupt the application being traced, to the point where systemtap is unuseable because it causes a higher latency that the problem being debugged.

 I've tried several work-arounds for the global variable locks problems including first converting the systemtap script to C code using the stap command, then editing the C code to either remove or move lock operations that I knew were useless and then building and loading the kernel module. That's a lot of work to do by hands for every change to the script. So I ended up switching to systemtap guru mode and embedding C code in the systemtap script. That proved very efficient to remove the lock contention problem but defeats a lot of the advantages of using a framework such as systemtap.

Beyond systemtap, kernel markers appear promising as they should bring real static probe points to the linux kernel. The framework has just been integrated in kernel 2.6.24 and probes still need to be added.


Monday Feb 04, 2008

Real-time Java and futexes on Linux

We've just made available an early access release of Sun's Java RTS 2.1 on Linux. Java RTS (Real-Time System) is Sun's implementation of the Real-Time Specification for Java (RTSJ) based on Java SE and the hotspot virtual machine. The product has been available on Solaris (sparc and x86) for some time and we've been working on making it available on Linux. Our development and test systems run pretty standard Linux distributions with RT enabled kernels.

I contributed to this porting effort and one Linux subsystem I had to take a closer look at is the futex subsystem. Java RTS on Linux only relies on the POSIX APIs. We make use of pthread mutexes (with the priority inheritance protocol) and condition variables. On recent Linuxes, pthread synchronization primitives are built on top of the clever futex mechanism.

A futex keeps part of its state in user-land and part at the kernel level. This allows synchronization primitives to be implemented with a fast-path entirely in user-land when possible (such as a lock operation on an uncontended mutex) and a fall back that uses the futex system call: the Linux kernel provides a single system call for the futex support. A command is passed as an argument to the futex syscall to designate what operation to perform on the futex. Four commands are of interest to us: FUTEX_LOCK_PI/FUTEX_UNLOCK_PI to implement the priority inheritance mutex lock/unlock operations and FUTEX_WAIT/FUTEX_WAKE to implement pthread condition variable wait (optionally with a timeout) and signal operations.

The user-land part of the futex is a 32 bit integer, its value is the user-land part of its state. When the application calls the futex syscall to operate on the futex, the futex "object" is identified with the virtual address in the process address space of the 32 bit integer. As an example, a PI mutex is implemented with a single futex. The 32 bit integer is initialized with a value of zero: mutex unlocked. The locking operation on the futex consists in atomically setting the futex's value to the thread id of the new owner if the the futex's value is zero. This is performed with a compare-and-exchange  or similar operation. Unlocking the futex is similarly performed by atomically reseting the futex's value to zero if the value is still the thread id of the current thread. If, because the futex is already owned by another thread, the fast path locking fails, then the thread calls the futex syscall with the FUTEX_LOCK_PI command. The futex syscall then takes care of changing the user-land futex value (so that the owner thread cannot unlock with the fast path but is forced through the futex syscall with the FUTEX_UNLOCK_PI command), suspending the thread until the futex is available and finally of updating the user-land value to the thread id of the new owner.

A futex can be shared between processes. If it's not then the command passed to the futex syscall should be or'ed with the FUTEX_PRIVATE constant.

Interestingly, and probably because a Java VM and the RT VM in particular puts an unusual load on the system, pretty much all the problems we have had so far with Linux itself were related to the futex mechanism. One of them is a performance issue on one of our own real-time Java benchmark. The benchmark programs a real-time thread so that it is woken up at a particular absolute time in the future using RTSJ APIs. Internally the VM uses condition variables and mutexes in the process of waking up the thread when the absolute time is reached. We measure the absolute time at which the thread is effectively woken up and back to executing Java code. The difference between the measured time and the requested time is called the latency. Worse case execution time (and not mean execution time) is what defines performance.

The benchmark is run with and without some load, including some load that triggers the garbage collector. Without the load, the latency is always very low. With the load, the latency is sometimes ten times higher than without. Against, worst case execution is all that matters to us in this case. What we've found is that the extra latency happens in the futex syscall and that the extra delay is caused by a non-real-time thread in the VM performing calls to mmap. Indeed, in the kernel sources we found that the futex syscall synchronizes with the mmap calls. This should not happen with process private futex operations (commands that are marked private). A pthread mutex or condition variable can be marked either shared between processes or private to a single one. We use only process private mutexes and condition variables so we expects them to be handled only by private futex commands. Inspecting the code of the glibc, we discovered that the process private attribute of the futex is not always taken advantage of, even in the most recent glibc releases. Doing some more investigation, we found that VM memory locking (through the mlock(2) call to prevent indeterminism caused by paging activities) exacerbates the problem. On a simple C test case, we measured that mmap'ing a 100MB in a non real-time thread could cause a latency of up to 1 second in a real-time thread.

So why do we have so much mmap activity in our real-time VM? When run with one of the standard hotspot GC and lots of garbage produced, the GC repetitively grows and shrinks the heap which is done with mmap calls. In any case, mmap is widely used (for instance, malloc is implemented with mmap) so, even if we use our real-time GC (which does not grow/shrink the heap) instead of one of hotspot's GCs, suffering extra latency in a real-time thread caused by a non real-time thread's mmaps is still possible. Fixing the process-private attribute of pthread mutexes/condition variables is thus important for improved and guaranteed latency of complex real-time systems on Linux.

Related, two bugs in the futex subsystem we reported: futex: Prevent stale futex owner when interrupted/timeout, futex: fix for futex_wait signal stack corruption

Roland Westrelin


Top Tags
« July 2016