By user12610620 on Jul 13, 2009
Sometimes you find something that sounds like it should be fairly straightforward but ends up being a bit more complicated than you thought.
Our intern Frank commented that he'd really like to have a little more agility in trying out the code that he's working on. You see, we run our data store on a grid. The grid is physically somewhere else, and more importantly, it runs our code in a virtual network (you create a network to run your processes in so everything is isolated). Every process you start runs in a Solaris zone with its own IP address(es). So in order to run some test code, you need to build it, upload it, then restart your process. This doesn't take a lot of time, but it adds several steps between "change a line of code" and "try it out against our live system". To get the results, you need to either look at an output file that your process creates, or you need to connect to some server providing it (probably a web server that can talk to the process you ran). Fairly cumbersome.
This has been a problem for a while, but we hadn't really gotten to a point yet where we couldn't just test our code by running against a small dataset running on a "distributed" instance of our data store that actually runs all in one JVM on our development machines. Frank is working on stuff that really needs "big data" so that isn't going to cut it anymore. The external interface to our live datastore is based on web-services, but for heavy computation, we want to speak directly to the store over RMI. So since we seem to be letting our interns dictate what we should work on, I went to work providing a solution.
One of the many nice features of the research grid we're running on is that we can use a VPN client to connect in to our virtual network. Once I figured out how to configure this and allocate the resources for it, it worked like a charm. This should be easy now, right? Just get your client machine on the network and you can use RMI to talk directly to our data store! Except, it didn't work. We use Jini to self-assemble the pieces of our data store and also to discover the "head" nodes that allow client access. Turns out the VPN client wasn't handling the multicast packets needed to do discovery. No problem - I specified the IP address that was assigned to our RMI registry process. Still didn't work. It would attempt to load our services but it would hang and eventually time out. This was getting tricky.
One problem I noticed from the start was that I wasn't getting any name resolution to work. Perhaps Java was trying to look up the names of hosts and timing out. Enough host name timeouts, and our own connection timeout would be reached and it would fail. Now the problem was that while all of the processes running on the grid itself (inside Solaris zones) had access to the name server for the grid, the VPN clients connecting in could not actually route packets to the DNS server. (Remember, everybody gets their own private network, but name service is provided by the grid -- you don't need to (or want to) run your own). Now we could have actually started our own name server, or we could have made a hosts file that has all the names of the machines we use. The problem is, this is a dynamic environment. From one start of the system to another, the actual addresses may change. (Note that we don't care what the addresses are for the system to work -- Jini lets us self-discover all the components.) The solution I chose to go with was to start an on-grid process that would relay DNS requests to the actual name server and back.
Of course, I initially thought that even this would be easy because there are any number of port-forwarding Java programs out there, and I've even written one or two myself for various tasks. I started by deploying one of these and was surprised to discover that it still didn't work. No name resolution. Fortunately, I happened to look at the little network monitor I was running and noticed (with a loud slap of my forehead) that of course name resolution usually uses UDP, not TCP (which is what all these port forwarders are written for). I hoped that maybe I could configure my name resolver to use only TCP (yeah, much less efficient, but still tiny in the grand scheme of things), but no such luck.So, I looked around and was (I guess not very) surprised to find that there wasn't any Java code out there for running a simple UDP port forwarder. Fortunately, the code turns out to be relatively simple - at least, it was once I reminded myself how UDP works. I wrote a simple process that receives datagrams on port 53 then chooses an unused outgoing port to resend the datagram (after re-writing the source and destination addresses) to the actual DNS. It starts a thread that waits for the response on that port then sends the response datagram packet back (again, with the addresses re-written) to the original requesting machine (the VPNed client machine). I deployed the code as a grid process and sure enough, it worked like a charm!
All told, this probably should have been easier. On the other hand, we can now connect our machines into the network that runs our live system and run code directly against it. What I didn't mention is that Frank isn't actually running Java code. He wants to run Python code that uses a native number crunching library that starts a JVM that allows him to pull Java objects from the data store and pass them into the python library. He's running that in his virtual linux box running in VMWare on his Mac (which is connected to the virtual network on the grid via VPN). We don't know yet just how slow this is all going to be.