Optimizing the ORB in GlassFish
By kcavanaugh on May 08, 2007
I've been away for a while. Keeping up a blog while working on a busy project is difficult. This entry discusses one of our recent efforts: ORB optimization.
The major goal of the ORB team for GlassFish v2 is to significantly improve the performance of the ORB. This is part of the larger goal of improving GlassFish's SpecJAppServer 2004 score. SpecJ is a throughput benchmark: it measures how many transactions per second can be performed by a J2EE application server while still maintaining a specified latency in processing each request. SpecJ is a simulation of a typical business scenario: car dealers ordering cars from the manufacturer. SpecJ uses remote EJBs for some of the business operations. GlassFish uses RMI-IIOP for remote EJB communications, and the ORB provides the RMI-IIOP implementation.
SpecJ is the most important overall influence on our ORB performance work, but not the only one. Because SpecJ covers a broad range of J2EE technologies, ORB performance has a significant but indirect impact on the SpecJ score. SpecJ only performs very simple EJB requests with a small amount of data: a typical request might return a couple of Strings and an int or two, so what matters for the score is how fast the ORB can process such small requests. But there are other cases where performance is important.
A good example of this is the problem of marshalling lists of objects of the same type. This case is common in applications that need to query a database and return a number of rows (often up to a few hundred) that result from the query. This is usually represented as an ArrayList where each element of the ArrayList is an instance of a simple serializable Java object. In order to send this data, RMI-IIOP must encode the ArrayList and each element of the ArrayList as RMI-IIOP value types. Each value type is represented as a value type header followed by the marshalled representation of each field according to the usual Java serialization rules. Among other things, the value type header contains a String containing a space-separated list of codebase URLs, which indicate possible places where the implementation of the value type could be downloaded. The codebase strings are often identical across many value types, so the RMI-IIOP specification supports an indirection mechanism that allows each repeated occurrence of the same codebase to be encoded as [0xFFFFFFFF indirection], where the indirection is a negative 32-bit integer that gives the offset to a previous occurrence of the string.
Poor ORB performance in this case was caused by two problems: the high cost of computing the codebase, and a bug in the encoding of repeated codebase strings in the ORB. The codebase is computed by the RMIClassLoader.getClassAnnotation method, and the result is not cached inside the ClassLoaders used in GlassFish (this problem is reported in GlassFish issue 955). The length of the classpath in GlassFish makes this call expensive. The bug in handling indirections caused the very large codebase string to be marshalled in every value type header, rather than just the first one. This caused affected operations to slow down by a factor of 5 to 10 when a large number of objects was returned.
The fixes were quite simple: the getClassAnnotation method simply needed a cache for the data. The problem with repeated codebase strings was caused by using an identity-based cache instead of an equality-based cache. Interestingly, neither problem was an issue with standalone ORB tests, because the class codebase was usually empty in standalone tests, whereas in the full app server the codebase was a very long string containing everything in the app server's classpath. These fixes are available in any recent GlassFish v2 build.
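To make the identity versus equality distinction concrete, here is a minimal sketch. It is not the ORB's code: the CodebaseWriter class and its simulate method are hypothetical, and the byte accounting is simplified (a 4-byte length plus the characters for a first occurrence, 8 bytes for an indirection, ignoring CDR alignment and terminators). It only illustrates why equal strings that are distinct objects never hit an identity-based cache.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical sketch: counts the bytes needed to write codebase strings with indirection. */
class CodebaseWriter {
    private final Map<String, Integer> seen; // codebase -> position of its first occurrence
    private int position = 0;                // bytes written so far (simplified accounting)

    CodebaseWriter(boolean identityBased) {
        if (identityBased) {
            seen = new IdentityHashMap<>();  // matches only the very same String object
        } else {
            seen = new HashMap<>();          // matches any String with equal content
        }
    }

    void write(String codebase) {
        Integer first = seen.get(codebase);
        if (first != null) {
            position += 8;                   // 0xFFFFFFFF tag + 32-bit negative offset
        } else {
            seen.put(codebase, position);
            // Simplified: 4-byte length followed by the characters.
            position += 4 + codebase.getBytes(StandardCharsets.UTF_8).length;
        }
    }

    int bytesWritten() {
        return position;
    }

    /** Writes count equal-but-distinct copies of the codebase and returns the byte total. */
    static int simulate(boolean identityBased, String codebase, int count) {
        CodebaseWriter writer = new CodebaseWriter(identityBased);
        for (int i = 0; i < count; i++) {
            // new String(...) yields equal content with a different object identity,
            // which is assumed here to mimic a codebase recomputed for each value.
            writer.write(new String(codebase));
        }
        return writer.bytesWritten();
    }
}
```

With a codebase of several thousand characters and a few hundred list elements, the identity-based variant writes the full string every time, while the equality-based variant writes it once and then 8 bytes per element.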
The biggest code change we've made is to improve how the ORB reads incoming messages using NIO. Until recently the ORB read each individual GIOP message using two read calls: the first call always read 12 bytes (the size of the fixed GIOP header). This header contains the length of the rest of the message, which was read by the second call. If most messages are large, this isn't too bad, but in fact most messages the ORB handles in SpecJ are fairly small (around 100-200 bytes each). For 200-byte messages, we could fit 20 in a single 4K message buffer. Reading 20 messages used to take 40 read calls. Because SpecJ is a throughput-oriented benchmark, it sends a lot of messages to the app server concurrently. This results in some accumulation of GIOP Request messages in the ORB.
A large improvement in read efficiency is possible if the ORB reads everything that is available from the connection in a single read call. We have a microbenchmark that simulates the ORB usage in SpecJ. This benchmark shows that we can typically read 10 or so small messages in a single read. Clearly this is much more efficient. The savings come from a reduced number of system calls, as well as a reduction in the number of context switches needed to read the messages. The read optimization and other improvements have led to about a 70% improvement in the ORB throughput, and a noticeable impact on the SpecJ score.
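The idea of reading once and then carving the buffer into messages can be sketched as follows. This is an illustration only, not the ORB's reader: GiopSplitter and split are hypothetical names. It assumes the standard 12-byte GIOP header layout (magic, version, flags octet at offset 6, message type, and a 4-byte body size at offset 8) and the GIOP 1.1 and later convention that bit 0 of the flags octet selects little-endian byte order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split whatever one read() returned into complete GIOP messages. */
class GiopSplitter {
    static final int HEADER_SIZE = 12; // magic(4) + version(2) + flags(1) + type(1) + size(4)

    /**
     * Extracts every complete message from a buffer that is in read mode.
     * On return, the buffer's position is at the start of any trailing partial message.
     */
    static List<ByteBuffer> split(ByteBuffer in) {
        List<ByteBuffer> messages = new ArrayList<>();
        while (in.remaining() >= HEADER_SIZE) {
            int start = in.position();
            // Bit 0 of the flags octet: 1 means little-endian, 0 means big-endian.
            boolean littleEndian = (in.get(start + 6) & 1) != 0;
            in.order(littleEndian ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
            int bodySize = in.getInt(start + 8);      // size excludes the 12-byte header
            long total = HEADER_SIZE + (long) bodySize;
            if (bodySize < 0 || in.remaining() < total) {
                break;                                // incomplete: wait for more data
            }
            ByteBuffer view = in.duplicate();         // shares content, independent position/limit
            view.limit(start + (int) total);
            messages.add(view.slice());               // one message, header included
            in.position(start + (int) total);
        }
        return messages;
    }
}
```

A caller would typically read from the channel into the buffer, flip it, call split, and then compact the buffer to keep any partial message for the next read. Because the returned slices share memory with the input buffer, they would have to be consumed or copied before compacting.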
There are a number of smaller performance improvements as well:
- doPrivileged calls are only necessary when there is a non-null SecurityManager in the app server. The ORB has been modified to call doPrivileged only if the SecurityManager is not null.
- A very common operation in the ORB is to determine whether a particular operation is targeted to an EJB instance. This has been improved, but the current cost of two string comparisons is still noticeable in the benchmarks. We plan further improvements here.
- Some ORB features are never used for RMI-IIOP in the app server, primarily features for backward compatibility. We now have an isAppServerMode flag which can be used to reduce the cost of such features (such as support for stream format version 1).
- The ORB uses generated exception wrapper methods to handle all throwing of system exceptions and the associated logger calls. Many improvements have been made here to reduce the cost of obtaining the appropriate log wrapper instance. The final solution is to generate an interface with a method for each possible log wrapper, and an implementation of this interface that lazily initializes the log wrapper instance on the first call.
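The last item in the list, lazy initialization of log wrappers behind a generated interface, might look roughly like the sketch below. All names here (OrbUtilWrapper, LogWrapperTable, LazyLogWrapperTable) are hypothetical stand-ins rather than the ORB's generated classes, and only a single wrapper type is shown. The double-checked locking on a volatile field is one common way to do this, not necessarily the one the ORB uses.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a generated exception/log wrapper class. */
class OrbUtilWrapper {
    // Counts constructions so that the laziness can be observed in a demo.
    static final AtomicInteger created = new AtomicInteger();

    OrbUtilWrapper() {
        created.incrementAndGet();
    }
}

/** Generated interface: one accessor per wrapper type (only one shown). */
interface LogWrapperTable {
    OrbUtilWrapper orbUtil();
}

/** Creates each wrapper on first use and returns the same instance afterwards. */
class LazyLogWrapperTable implements LogWrapperTable {
    private volatile OrbUtilWrapper orbUtil;

    @Override
    public OrbUtilWrapper orbUtil() {
        OrbUtilWrapper wrapper = orbUtil;      // fast path: a single volatile read
        if (wrapper == null) {
            synchronized (this) {
                wrapper = orbUtil;             // re-check under the lock
                if (wrapper == null) {
                    wrapper = new OrbUtilWrapper();
                    orbUtil = wrapper;
                }
            }
        }
        return wrapper;
    }
}
```

After the first call, obtaining a wrapper costs one volatile read and a null check, with no map lookup.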
There are also many improvements that involve caching of the results of data transformations. For example, the Object key must be marshalled on every request from the client, and unmarshalled on every request in the server. Caching the mapping between the marshalled and unmarshalled form saves a lot of time on each request, in the common case where many requests are made on the same object reference.
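A cache of this kind on the server side could be sketched as below. ObjectKeyCache is a hypothetical class, and the unmarshaller function is a placeholder for whatever actually decodes the key. The sketch relies on ByteBuffer comparing by content, so two requests carrying the same key bytes map to the same entry.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical sketch: cache the unmarshalled form of an object key by its marshalled bytes. */
class ObjectKeyCache<K> {
    // ByteBuffer.equals and hashCode are content-based, unlike those of byte[].
    private final ConcurrentMap<ByteBuffer, K> cache = new ConcurrentHashMap<>();
    private final Function<byte[], K> unmarshaller;

    ObjectKeyCache(Function<byte[], K> unmarshaller) {
        this.unmarshaller = unmarshaller;
    }

    K get(byte[] marshalled) {
        // Copy the bytes so later changes to the caller's array cannot corrupt the map key.
        ByteBuffer key = ByteBuffer.wrap(marshalled.clone());
        // The unmarshaller runs only when these bytes have not been seen before.
        return cache.computeIfAbsent(key, k -> unmarshaller.apply(k.array()));
    }
}
```

This version is unbounded and copies the key bytes on every lookup. A production cache would presumably need an eviction policy and a cheaper lookup path.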
More work is planned for performance improvements. I'll discuss some of this in a later post.