Thursday Apr 23, 2009

Sockets Direct Protocol

Solaris has had support for the InfiniBand (IB) Sockets Direct Protocol (SDP) since Solaris 10 5/08. SDP is a wire protocol to support stream sockets networking over IB. It utilizes features such as RDMA for high-performance data transfers and provides higher performance and lower latency compared to IP encapsulation over IB (IPoIB). A long-standing request has been to add implementation-specific support to the JDK to make it easy to use SDP in Java applications. As SDP provides a stream sockets interface, it can be made transparent to applications using java.net.Socket and, of course, the stream-oriented SelectableChannels and AsynchronousChannels in the channels package.

The simple solution in jdk7 uses a configuration file with rules to indicate the endpoints on the IB fabric for which SDP should be used. When a java.net.Socket or java.nio.channels.SocketChannel binds or connects to an address that is an endpoint on the IB fabric, the socket is converted to use the SDP protocol. The solution can be thought of as a kind of built-in interposer library (using an actual interposer library is an alternative way of doing this). In the future it might be interesting to support factory methods that allow the SDP protocol to be selected when creating channels and sockets.

The $JAVA_HOME/lib/sdp directory contains a template configuration file. The format is relatively simple: each non-comment line is a rule indicating when the SDP transport protocol should be used. The configuration file is specified via the com.sun.sdp.conf system property. A second property (com.sun.sdp.debug) can be set to enable debug messages to be printed to System.out, which is useful for checking the configuration. That property's value can also be set to the location of a file so that the debug messages are directed to a log file instead. SDP needs to be enabled on Solaris via the sdpadm(1M) command, and of course the interfaces need to be plumbed with IP addresses.

Our lab machines have the InfiniBand adapters plumbed with addresses on the 192.168.1.* network, so a simple configuration might be:

$ cat << EOF > sdp.conf
bind 192.168.1.* *
bind 0.0.0.0 5000
connect 192.168.1.* 1024-*
connect 192.168.1.* 1521
EOF

This basically says that SDP should be used for applications that bind to an address on the 192.168.1.* network, or bind to the wildcard address on port 5000. All connections to application services on ports 1024 and above on the 192.168.1.* network should use SDP, and all Net8 connections to the Oracle database on port 1521 should use SDP.

Once the configuration file is created, we simply specify it when running the application, e.g.:

$ java -Dcom.sun.sdp.conf=sdp.conf -Djava.net.preferIPv4Stack=true MyApplication

Note that this example also sets the java.net.preferIPv4Stack property. The Java runtime always uses IPv6 sockets if IPv6 is enabled. The Solaris implementation of SDP supports IPv6 addresses but currently doesn't support IPv4-mapped IPv6 addresses (addresses of the form ::ffff:a.b.c.d). For now this means setting the java.net.preferIPv4Stack property so that all sockets are IPv4 sockets.

One final thing to mention is that although SDP has been available since Solaris 10 5/08, there are a couple of bugs that the JDK runs into. Thanks to Lida Horn, all of these issues have been fixed for Solaris 10 Update 8. They should also be in the next OpenSolaris release, or build 113 for those downloading the Solaris Express Community Edition.

Thursday Apr 09, 2009

JSR-203/NIO2 update

The implementation of JSR-203/NIO2 has been brewing in the OpenJDK New I/O project over the last year or so. The bulk of the socket-channel changes were pushed to the jdk7 repository back in build 37, and most of the rest went into jdk7 build 50. I've had way too much on my plate recently, so I haven't had cycles to write about it. That needs to change as there is so much to talk about!

For those who haven't been tracking this project, the one-line summary is that it brings a new API to the file system, a new API for asynchronous I/O to both sockets and files, and completes the socket-channel API that was originally defined by JSR-51 back in Java SE 1.4.

The java.nio.file package is where the new API to the file system resides. The API is provider-based, so the adventurous can deploy their own file system implementations. Out of the box, there is a default file system that provides access to the regular file system that both Java and native applications see.

Most developers will likely use the Path class and little else. Think of Path as the equivalent of java.io.File in the new API. Interoperability with java.io.File and existing code is achieved using the toPath method, so existing code can be retrofitted to use the new API without too many changes. There is a long list of major shortcomings and behavioral issues with java.io.File that can never be fixed for compatibility reasons, so using the toPath method gets you an object in the new API that accesses the same file as the File object.
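As a minimal sketch of this bridge, using the toPath method as it appears in the final Java 7 API (the file name here is just for illustration):

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ToPathDemo {
    // Bridge from legacy java.io.File code into the new API.
    static Path retrofit(File legacy) {
        return legacy.toPath();   // same file, new API
    }

    public static void main(String[] args) {
        File f = new File("/tmp/data.txt");
        Path p = retrofit(f);
        System.out.println(p.equals(Paths.get("/tmp/data.txt")));  // true
    }
}
```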

A Path is created from a path string or URI. It defines various syntactic operations to access its components, supports comparison, supports testing whether a Path starts or ends with another Path, allows Paths to be combined by resolving one Path against another, supports relativization, etc.
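A quick sketch of these syntactic operations, using the Paths factory from the final Java 7 API (the /u/alan paths are purely illustrative):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathOps {
    static String demo() {
        Path base = Paths.get("/u/alan");
        Path file = base.resolve("src/Main.java");   // combine paths
        Path rel  = base.relativize(file);           // back to "src/Main.java"
        boolean starts = file.startsWith(base);      // true: /u/alan is a prefix
        return rel + " " + starts + " " + file.getFileName();
    }

    public static void main(String[] args) {
        System.out.println(demo());   // src/Main.java true Main.java
    }
}
```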

A Path may also be used to access the file that it locates. Both stream and channel I/O are supported. Using InputStream and OutputStream allows for good interoperability with the java.io package and existing code. Channel I/O is via the new SeekableByteChannel or any of its super-types. A SeekableByteChannel is a ByteChannel that maintains a file position, so it can be used for single-threaded random access as well as sequential access. In the case of the default provider, the channel can be cast to a FileChannel for more advanced operations like file locking, positional read/write to support concurrent threads doing I/O to different parts of the file, and memory-mapped I/O. When opening files, a set of options allows applications to indicate how the file should be opened or created. There are quite a few, and these can be extended further with implementation-specific options where required. Speaking of extensibility, much of the API allows for implementation-specific extensions where needed.
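A sketch of channel I/O with open options, written against the final Java 7 API (the factory ended up as a static method on Files; names shifted slightly from the draft described here):

```java
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.EnumSet;

public class ChannelIO {
    static String writeThenRead(Path file) throws Exception {
        EnumSet<StandardOpenOption> opts = EnumSet.of(
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE);
        try (SeekableByteChannel ch = Files.newByteChannel(file, opts)) {
            ch.write(ByteBuffer.wrap("hello".getBytes("UTF-8")));
            ch.position(0);                          // random access: rewind
            ByteBuffer dst = ByteBuffer.allocate(5);
            while (dst.hasRemaining())
                ch.read(dst);
            return new String(dst.array(), "UTF-8");
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("demo", ".bin");
        System.out.println(writeThenRead(tmp));      // hello
        Files.delete(tmp);
    }
}
```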

In addition to I/O, there are methods to do all the usual things like create directories, delete files, check access to files, copy and move, etc. The copyTo/moveTo methods do the right thing and know how to copy file meta-data, named streams, special files, etc. The important thing is that they work in a cross-platform manner and also throw useful exceptions when they fail. In this API all methods that access the file system throw the checked IOException. There are specific IOExceptions defined for cases where recovery from specific errors is useful. All other exceptions in this API are unchecked.
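For illustration, here is a copy-then-move round trip. Note this uses the final Java 7 API, where the draft's copyTo/moveTo methods became static Files.copy and Files.move:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CopyMove {
    static boolean roundTrip(Path dir) throws Exception {
        Path src = Files.createFile(dir.resolve("a.txt"));
        // Copy with attributes preserved, then move (rename) the copy.
        Path copy = Files.copy(src, dir.resolve("b.txt"),
                               StandardCopyOption.COPY_ATTRIBUTES);
        Path moved = Files.move(copy, dir.resolve("c.txt"));
        return Files.exists(src) && Files.exists(moved) && Files.notExists(copy);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip(Files.createTempDirectory("demo"))); // true
    }
}
```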

Accessing directories is a bit different to java.io.File. Instead of returning an array or List of the entries, a DirectoryStream is used to iterate over the entries in the directory. This scales to large directories, uses fewer resources, and can smooth out the response time when accessing remote file systems. Another difference is that the platform representation of the file names in the directory is preserved so that the files can be accessed again (this is very important where the file names are stored as sequences of bytes, for example). A further advantage of having a handle to an open directory is that it is possible to do operations relative to the directory, something that is important for security-sensitive applications. When iterating over a directory the entries can be filtered. The API has built-in support for glob and regex patterns to filter by name, or arbitrary filters can be developed.
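A sketch of directory iteration with a glob filter, again using the final Java 7 spelling (the factory became Files.newDirectoryStream):

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ListDir {
    // Iterate a directory, keeping only entries that match a glob pattern.
    static List<String> javaFiles(Path dir) throws Exception {
        List<String> names = new ArrayList<String>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.java")) {
            for (Path entry : stream)
                names.add(entry.getFileName().toString());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("demo");
        Files.createFile(dir.resolve("Main.java"));
        Files.createFile(dir.resolve("notes.txt"));
        System.out.println(javaFiles(dir));   // [Main.java]
    }
}
```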

The API has full support for symbolic links, based on the long-standing semantics of Unix symbolic links. This works on Windows Vista and newer Windows releases as well. By default symbolic links are followed, with a couple of exceptions such as move and delete. There are also a few cases where the application can specify an option to follow or not follow links. This is important when reading file attributes or walking file trees, for example.
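The follow/no-follow choice can be sketched like this (final Java 7 API, Unix-style file system assumed; NOFOLLOW_LINKS examines the link itself rather than its target):

```java
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

public class Links {
    static boolean[] check(Path link) throws Exception {
        // Default: follow the link, so we see the target's attributes.
        BasicFileAttributes followed =
            Files.readAttributes(link, BasicFileAttributes.class);
        // With NOFOLLOW_LINKS: examine the link itself.
        BasicFileAttributes nofollow =
            Files.readAttributes(link, BasicFileAttributes.class,
                                 LinkOption.NOFOLLOW_LINKS);
        return new boolean[] { followed.isRegularFile(), nofollow.isSymbolicLink() };
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("demo");
        Path target = Files.createFile(dir.resolve("target"));
        Path link = Files.createSymbolicLink(dir.resolve("link"), target);
        boolean[] r = check(link);
        System.out.println(r[0] + " " + r[1]);   // true true
    }
}
```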

Speaking of walking file trees, the Files.walkFileTree utility method is invoked with a starting directory and a FileVisitor that is invoked for each of the directories and files in the file tree. Think of walkFileTree as a simple-to-use internal iterator for file trees. File tree traversal is depth-first with pre-visit and post-visit for directories. By default symbolic links are not followed, which is exactly what you want when developing find, chmod -R, rm -r, and other operations. An option can be used to cause symbolic links to be followed (say find -L or cp -r). When following symbolic links, cycles are detected and reported to avoid infinite loops or stack-overflow issues. Another thing to mention about FileVisitor is that each of its methods returns a result to control whether the iteration should continue, terminate, or skip parts of the file tree. Many of the samples in the samples directory use walkFileTree.
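The traversal described above can be sketched as a tiny "find"-style file counter; SimpleFileVisitor supplies sensible defaults so only the callback of interest needs overriding:

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class Walker {
    // Depth-first traversal counting non-directory entries.
    static int countFiles(Path start) throws IOException {
        final int[] count = {0};
        Files.walkFileTree(start, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                count[0]++;
                return FileVisitResult.CONTINUE;   // keep walking
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("demo");
        Files.createFile(dir.resolve("a"));
        Files.createDirectory(dir.resolve("sub"));
        Files.createFile(dir.resolve("sub").resolve("b"));
        System.out.println(countFiles(dir));   // 2
    }
}
```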

The java.nio.file package also has a WatchService to support file change notification. This maps to inotify or the equivalent facility to detect changes caused by non-communicating entities. The main goal here is to help with the performance issues of applications that are forced to poll the file system today. It's a relatively low-level/advanced API, arguably niche, but it is straightforward to build on. The WatchDir example in the samples directory demonstrates using it to snoop a directory or file tree.
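A minimal sketch of the notification loop, in final Java 7 spelling (the kinds class ended up as StandardWatchEventKinds); here we register for creations and wait for one event:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

public class Watch {
    // Watch a directory for a newly created entry; returns its name.
    static String awaitCreate(Path dir, Path toCreate) throws Exception {
        try (WatchService watcher = dir.getFileSystem().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            Files.createFile(toCreate);
            WatchKey key = watcher.poll(10, TimeUnit.SECONDS); // wait for an event
            if (key == null)
                return null;
            for (WatchEvent<?> event : key.pollEvents())
                return event.context().toString();  // name, relative to dir
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("demo");
        System.out.println(awaitCreate(dir, dir.resolve("new.txt"))); // new.txt
    }
}
```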

The java.nio.file.attribute package provides access to file attributes. File attributes, or meta-data, are an awkward area because they are very file-system specific. The approach we've taken in this API is to group related attributes and define a “view” of the commonly used attributes. An implementation is required to support a “basic” view that defines attributes common to most file systems (file type, size, timestamps, etc.). An implementation may support additional views. The package defines a number of views for other common groups of attributes, including a subset of the POSIX attributes. An implementation isn't required to support these, of course, and may instead support implementation-specific views. In most cases, developers won't need to be concerned with all this but instead will use the static methods defined by the Attributes class for the common cases.

A couple of other things to say about file attributes are that, where possible, attributes are read in bulk (think stat or equivalent). This will help with the performance of applications that need several attributes of the same file. Dynamic access is also supported. This allows attributes to be treated as name/value pairs, which is useful to avoid compile-time dependencies on implementation-specific classes, at the expense of type-safety. The API also allows initial file attributes to be set when creating files. This is very important to eliminate the attack window that would otherwise exist if file permissions or ACLs are set after the file is created.
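Both points can be sketched together. Note this uses the final Java 7 API, where the bulk-read convenience methods ended up on Files rather than the Attributes class mentioned above, and assumes a POSIX file system for the creation-time permissions:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class Attrs {
    static String describe(Path file) throws Exception {
        // One bulk read (think stat) rather than one call per attribute.
        BasicFileAttributes attrs =
            Files.readAttributes(file, BasicFileAttributes.class);
        return attrs.isRegularFile() + " " + attrs.size();
    }

    public static void main(String[] args) throws Exception {
        // Permissions set atomically at creation time - no attack window.
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString("rw-------");
        Path file = Files.createTempFile("demo", ".txt",
                PosixFilePermissions.asFileAttribute(perms));
        System.out.println(describe(file));   // true 0
    }
}
```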

The other new package is java.nio.file.spi. This is for the provider mechanism mentioned above and is really only interesting to the few who are developing their own file system implementations. One thing to mention is that in addition to custom providers, it is possible to replace or interpose on the default provider. If someone wants to extend the default provider, they can install their own provider that delegates to the otherwise-default provider. This is useful to those interested in developing virtual file systems, for example. One thing that isn't complete yet is that the existing classes (File, FileInputStream, etc.) don't yet redirect when the default provider is replaced. I hope to get this fixed soon.

Moving on from the file system API, the second major topic is a new set of AsynchronousChannels for asynchronous I/O. The new channels provide an asynchronous programming model for both sockets and files. The API is designed to map well to operating systems with a highly scalable mechanism to demultiplex events to pools of threads (think Solaris 10 event ports or Windows completion ports for example).

Asynchronous I/O operations take one of two forms. In the first form, I/O operations return a java.util.concurrent.Future that represents the pending result. The Future interface defines methods to poll the status or wait for the I/O operation to complete. The second form is where the I/O operation is initiated specifying a CompletionHandler that is invoked when the I/O operation completes (think callbacks).
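Here is a sketch of the Future form using the final Java 7 classes, looping back over a socket on 127.0.0.1 (the CompletionHandler form would instead pass a handler as the last arguments to read/write, e.g. peer.read(dst, attachment, handler)):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousServerSocketChannel;
import java.nio.channels.AsynchronousSocketChannel;
import java.util.concurrent.Future;

public class AsyncDemo {
    // Future form: initiate each I/O operation, then wait on its pending result.
    static String echoOnce(String msg) throws Exception {
        AsynchronousServerSocketChannel server = AsynchronousServerSocketChannel
                .open().bind(new InetSocketAddress("127.0.0.1", 0));
        Future<AsynchronousSocketChannel> accepted = server.accept();

        AsynchronousSocketChannel client = AsynchronousSocketChannel.open();
        client.connect(server.getLocalAddress()).get();   // wait for connect
        client.write(ByteBuffer.wrap(msg.getBytes("UTF-8"))).get();

        AsynchronousSocketChannel peer = accepted.get();
        ByteBuffer dst = ByteBuffer.allocate(msg.length()); // ASCII assumed
        while (dst.hasRemaining())
            peer.read(dst).get();                         // one Future per read
        client.close(); peer.close(); server.close();
        return new String(dst.array(), "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(echoOnce("ping"));             // ping
    }
}
```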

Completion handlers immediately raise the question as to what threads invoke the handlers. In this API, asynchronous channels are bound to a group that encapsulates a thread pool and the other resources shared between channels in the group. Groups are constructed with a thread pool so the application gets to control the number of threads, thread identity, and other policy matters. Combining thread pools and I/O event demultiplexing is a complex area. The project page has further information on this topic.
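Group construction can be sketched as follows (final Java 7 API); the supplied thread pool is what runs completion handlers for channels bound to the group:

```java
import java.nio.channels.AsynchronousChannelGroup;
import java.nio.channels.AsynchronousServerSocketChannel;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Groups {
    static boolean demo() throws Exception {
        // The application controls the number of threads and thread identity.
        AsynchronousChannelGroup group = AsynchronousChannelGroup
                .withFixedThreadPool(2, Executors.defaultThreadFactory());
        AsynchronousServerSocketChannel ch =
                AsynchronousServerSocketChannel.open(group); // bound to the group
        ch.close();
        group.shutdown();
        return group.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());   // true
    }
}
```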

The other big update in the channels package is the completion of the socket channel API. Each of the network oriented channels now implement NetworkChannel and so define methods to bind the channel's socket, set and query socket options, etc. This should eliminate the need to use the troublesome socket adapter. Furthermore, the support for socket options is extensible to allow for operating system specific options, important for fine tuning some high performance servers. Multicast support has also been added.
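A small sketch of the NetworkChannel methods in the final Java 7 API: the socket is bound and its options set directly on the channel, with no socket adapter involved:

```java
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.ServerSocketChannel;

public class Options {
    static boolean bindWithOptions() throws Exception {
        ServerSocketChannel ch = ServerSocketChannel.open();
        // Set and query socket options directly on the channel.
        ch.setOption(StandardSocketOptions.SO_REUSEADDR, true);
        ch.bind(new InetSocketAddress(0));   // bind to an ephemeral port
        boolean reuse = ch.getOption(StandardSocketOptions.SO_REUSEADDR);
        ch.close();
        return reuse;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(bindWithOptions());   // true
    }
}
```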

The original early draft proposed a new set of buffer classes that were indexed by long and so could support more than 2^31 elements. In the I/O area the main use case is contiguous mapping of file regions larger than 2GB. On further examination, this was an ugly solution so it has been buried for now. If support for big arrays is added to the collections API in the future, then with appropriate factories it should be possible to have such arrays backed by a large mapped region. In the meantime, applications needing to map massive files need to do so in chunks of 2GB or less.
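Chunked mapping can be sketched like this; each MappedByteBuffer covers one region (here an artificially tiny 2-byte chunk so the example is self-contained, but the same loop works with a chunk size just under 2GB):

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChunkedMap {
    // Map a file region-by-region; each mapping stays under the 2GB limit.
    static long sumBytes(Path file, long chunkSize) throws Exception {
        long sum = 0;
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = fc.size();
            for (long pos = 0; pos < size; pos += chunkSize) {
                long len = Math.min(chunkSize, size - pos);
                MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (map.hasRemaining())
                    sum += map.get() & 0xff;
            }
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        Path f = Files.createTempFile("demo", ".bin");
        Files.write(f, new byte[] {1, 2, 3, 4, 5});
        System.out.println(sumBytes(f, 2));   // 15, mapped in 2-byte chunks
    }
}
```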

One other update to mention is BufferPoolMXBean. This is a new management interface for pools of buffers so that tools such as VisualVM, jconsole, or other JMX clients can monitor the resources associated with direct or mapped buffers.
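A sketch of reading the same statistics in-process via the final Java 7 platform management API:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class BufferPools {
    static List<BufferPoolMXBean> pools() {
        return ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
    }

    public static void main(String[] args) {
        // Typically reports a "direct" and a "mapped" pool.
        for (BufferPoolMXBean pool : pools())
            System.out.println(pool.getName() + ": count=" + pool.getCount()
                    + " capacity=" + pool.getTotalCapacity());
    }
}
```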

So, that's a brief run through the project. In the introductory paragraph I mentioned that the bulk of the implementation was pushed to jdk7 a few builds back. Overall, I'm relatively happy with it, but there are a few areas that need to be revisited. For now, development will continue at the New I/O project and we will hopefully sync up again with jdk7 in a couple of builds' time.



