A slight digression with Logical Domains and Apache
By jsavit on Nov 16, 2006
Right now I'm at a Sun internal conference for USA technical staff, where I presented sessions on virtualization, including a demo of the Logical Domains (LDoms) capability coming soon on our T1000 and T2000 servers. In some ways LDoms take the opposite approach from traditional hypervisors, which multiplex a CPU by time-slicing it among virtual machines. That's something I'll discuss later.
For my session, I wanted to demonstrate Apache in one domain being driven by a client program in another domain - a simple test that demonstrates inter-domain network connectivity. For this demo, I used 'ab' - the Apache benchmark tool that lets you repeatedly hammer a specified URL N times, with C concurrent requests. I've used this before, as Apache is easy to set up (of course, 'ab' can be used with other web servers) and is pre-installed with Solaris 10. I used 'perfbar' and 'cpubar' tools to draw nice CPU utilization charts in graphical format, since eye-candy is fun for everyone, and these tools show system loads very dramatically in real time.
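For readers who have not used 'ab', here is a minimal Python sketch of the same idea: send N requests to one URL with at most C in flight at a time, then report hits per second and mean latency. It is not how 'ab' is implemented, and the names `fetch` and `hammer` are made up for illustration.

```python
import http.client
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10.0):
    """Issue one GET and return (succeeded, seconds taken)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            ok = 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses; count them as failures.
        ok = False
    return ok, time.perf_counter() - start


def hammer(url, total, concurrency, fetch_one=fetch):
    """Send `total` GETs to `url`, at most `concurrency` in flight at once."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(fetch_one, [url] * total))
    wall = time.perf_counter() - wall_start
    latencies = [secs for _, secs in results]
    return {
        "requests": total,
        "succeeded": sum(1 for ok, _ in results if ok),
        "hits_per_second": total / wall if wall > 0 else 0.0,
        "mean_latency": sum(latencies) / len(latencies) if latencies else 0.0,
    }
```

Calling something like `hammer("http://webdomain.example/", total=1000, concurrency=10)` (the host name is a placeholder) would roughly correspond to N=1000 and C=10. The `fetch_one` parameter is only there so the request function can be swapped out when experimenting.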
What happened when I tested this surprised me: I was getting really high latencies (and hence low hits/second), yet CPU utilization was extremely low in both the client instance of Solaris and the instance running Apache. If anything was doing work, it was the service domain providing virtualized disks, which showed a little bit of CPU activity. Hmmm, why is this happening? The logical domain running the web server looked almost completely idle!
I suppose I could have used DTrace to find out where the applications were spending their time - there are a number of scripts that would have been helpful here, such as running profiles and seeing scheduler state, system calls, or stacks. Instead, I resorted to traditional, low-tech methods. What solved the problem was the Apache error log, which had zillions of error messages (warnings, actually) about an unreachable network. Well, that didn't make any sense to me (the web hits actually did proceed, even if not at the speed I expected).
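When a log is flooded with one repeated message, a quick tally makes that obvious. The following is a rough Python sketch, assuming the log lines begin with bracketed fields such as a timestamp and a severity level, which is how Apache error logs of this era are laid out. The function name `tally_messages` is made up for illustration.

```python
import re
from collections import Counter

# Strip every leading "[...]" field (timestamp, level, and so on) so that
# identical messages group together regardless of when they were logged.
LEADING_FIELDS = re.compile(r"^(?:\[[^\]]*\]\s*)+")


def tally_messages(lines, top=5):
    """Return the `top` most frequent messages as (message, count) pairs."""
    counts = Counter()
    for line in lines:
        message = LEADING_FIELDS.sub("", line.strip())
        if message:
            counts[message] += 1
    return counts.most_common(top)
```

Passing an open file object for the error log (its location depends on how Apache was installed) would show at a glance whether one warning dominates.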
I then did something that has become a habit: I cut and pasted the error message into a Google search (why look up an error message myself when Google can look it up - with commentary - for me?). Lo and behold, there were several hits on this, including one I found in under a minute that essentially said "In httpd.conf replace Listen 80 with Listen 0.0.0.0:80".
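The change itself is a one-line edit to httpd.conf. If the same edit had to be made on several Solaris instances, it could be scripted. Here is a small Python sketch that rewrites any port-only Listen directive to name the IPv4 wildcard address explicitly. The helper `pin_listen_to_ipv4` is hypothetical, and it deliberately leaves alone commented-out lines and directives that already name an address.

```python
import re

# Matches a Listen directive that gives only a port, e.g. "Listen 80".
# Lines that already name an address ("Listen 0.0.0.0:80") do not match,
# and neither do commented-out lines such as "#Listen 80".
PORT_ONLY_LISTEN = re.compile(
    r"^([ \t]*)Listen([ \t]+)(\d+)([ \t]*)$",
    re.IGNORECASE | re.MULTILINE,
)


def pin_listen_to_ipv4(conf_text):
    """Rewrite port-only Listen directives to bind 0.0.0.0 explicitly."""
    return PORT_ONLY_LISTEN.sub(r"\g<1>Listen\g<2>0.0.0.0:\g<3>\g<4>", conf_text)
```

The function works on the configuration text as a string, so reading the file, keeping a backup, and writing the result back are left to the caller.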
That really shouldn't be necessary, but let's give it a try. I did, and all of a sudden the test took off: thousands of web hits flashed by, and perfbar and cpubar showed pretty CPU load charts in the logical domains they were supposed to (primarily, in the server's domain). Appropriate oohs and aahs from the audience - very gratifying. I then went to another instance of Solaris with an unmodified httpd.conf, and I was able to reproduce the original effect - which pretty much was that the rate of web hits was gated by the rate at which Apache could write error messages to disk! This also explained the slight CPU load in the service domain, as it was doing the disk writes.
So, a nice little demonstration worked, and it illustrated a few more points than I had originally planned: low-latency, high-throughput web serving across domains (the web server is also reachable from outside that server box, of course, so I wrote a tiny CGI script that displays the number of CPUs the web server thinks it has), and the effect of disk writes where only network access was expected. Logical domains can be expanded or shrunk on the fly with simple commands, so I removed virtual CPUs from the web server domain, re-ran the test (the remaining CPUs were saturated), and then put the virtual CPUs back (the web server again had ample CPU capacity and returned web hits faster). A nice example of changing resource distribution under logical domains.
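The CGI script mentioned above is not reproduced here, so the following is only a guess at its shape, written in Python: it prints a plain-text response containing the CPU count the operating system reports to the process. The function name `cpu_report` and the wording of the output are made up for illustration.

```python
#!/usr/bin/env python3
import os


def cpu_report():
    """Build a plain-text CGI response stating the CPU count the OS reports."""
    count = os.cpu_count()  # may be None if the count cannot be determined
    shown = "unknown" if count is None else str(count)
    # A CGI response is headers, a blank line, then the body.
    return "Content-Type: text/plain\n\nCPUs visible to this web server: " + shown + "\n"


if __name__ == "__main__":
    print(cpu_report(), end="")
```

Reloading such a page before and after changing a domain's virtual CPUs is a simple way to show the audience that the guest's view of its hardware has changed.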
Now, of course, I have a little exploration to do when I have a few free minutes. Why does Apache require this alteration to its Listen directive, which should not be necessary? Does this happen only with an 'ab' command, or does it occur from a real browser as well? Does this happen in non-LDom situations (ditto with Solaris Containers versus the global zone, for symmetry)? That will be easy enough to determine in a test, and I will probably use DTrace to dig deeper and see what Apache binds to with both the original and the altered Listen. There might be a bug to file somewhere, but first I need to find out who is doing what. That sounds like a fun exercise to do when I get home.