Saturday Oct 03, 2009

A Dashboard Like No Other: The OpenDS Weather Station


Rationale

Doing so many benchmarks, profiling runs and various other performance-related activities, I had to find a way to "keep an eye" on things while fetching emails, chatting on IM and the like. Having some experience with microcontrollers in past projects, albeit on Windows, I figured I could put together a little gizmo to help me keep tabs on my Directory Server.

Bird's Eye View

This is basically a simple setup: a USB Bit Whacker controlled by a Python script, which feeds it data crunched from various sources, mainly the Directory Server access log, the garbage collection log and kstats. The result is a useful dashboard where I can see things happen at a glance.

The Meat

Everything starts with the USB Bit Whacker. It's a long story, but to cut it short, a couple of years ago Kohsuke Kawaguchi put together an orb that could be used to monitor the status of a build / unit tests in Hudson. Such devices are also known as eXtreme Feedback Devices, or XFDs. Kohsuke chose the USB Bit Whacker (UBW) because it is a USB-aware microcontroller that also draws power from the bus, which makes it very versatile while remaining affordable ($25 soldered and tested from SparkFun, but you can easily assemble your own). A quick search will tell you that this is a widely popular platform for hobbyists.

On the software side, going all Java would have been quite easy except for the part where you need platform-specific libraries for serial communication. Sun's javacomm library and rxtx each have pros and cons, but in my case the cons were just too much of a hindrance. What's more, I am not one to inflict pain on myself unless it is absolutely necessary. For that reason, I chose to go with Python. While apparently not as strong on cross-platform support as Java, installing the Python library for serial communication with the UBW is trivial and has worked for me right off the bat on every platform I have tried, namely Mac OS, Linux and Solaris. For example, on OpenSolaris all there is to it is:

 $ pfexec easy_install-2.4 pySerial
Searching for pySerial
Reading http://pypi.python.org/simple/pySerial/
Reading http://pyserial.sourceforge.net/
Best match: pyserial 2.4
Downloading http://pypi.python.org/packages/source/p/pyserial/pyserial-2.4.tar.gz#md5=eec19df59fd75ba5a136992897f8e468
Processing pyserial-2.4.tar.gz
Running pyserial-2.4/setup.py -q bdist_egg --dist-dir /tmp/easy_install-Y8iJv9/pyserial-2.4/egg-dist-tmp-WYKpjg
setuptools
zip_safe flag not set; analyzing archive contents...
Adding pyserial 2.4 to easy-install.pth file

Installed /usr/lib/python2.4/site-packages/pyserial-2.4-py2.4.egg
Processing dependencies for pySerial
Finished processing dependencies for pySerial

That's it! Of course, having easy_install is a prerequisite. If you don't have it, simply install setuptools for your Python distribution, which is about a 400kB download. You'll be glad you have it anyway.

Then, communicating with the UBW is mind-bogglingly easy. But let's not get ahead of ourselves; first things first:

Plugging The USB Bit Whacker Into OpenSolaris For The First Time

The controller appears as a modem of the old days, and communicating with it amounts to sending AT-style commands. For those of you who are used to accessing load balancers or other network equipment through the serial port, this is no big deal.

In the screenshot below, the first ls command shows that nothing in /dev/term is an actual link; the second, issued after plugging the UBW into the USB port, shows that a new '0' link has been created by the operating system.


Remember which link your UBW appeared as; you will need it for our next step: talking to the board.
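
If you would rather not eyeball the ls output, a tiny Python sketch along the following lines can spot the new device link for you. This is purely illustrative and simply assumes the /dev/term layout described above:

import os
import time

# Snapshot /dev/term before the board is plugged in, then look for the new link.
before = set(os.listdir("/dev/term"))
raw_input("Plug in the UBW, then press Enter... ")
time.sleep(2)  # give the OS a moment to create the device node
after = set(os.listdir("/dev/term"))
new_links = sorted(after - before)
print "New device link(s):", ", ".join(new_links) or "none found"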

Your First Python Script To Talk To The UBW

I will show below how to send the UBW the 'V' command which instructs it to return the firmware version, and we'll see how to grab the return value and display it. Once you have that down, the sky is the limit. Here is how:

from serial import Serial

ubw = Serial("/dev/term/0")  # the port is opened as soon as a device name is given
print "Requesting UBW Firmware Version"
ubw.write("V\n")
print "Result=[" + ubw.readline().strip() + "]"
ubw.close()

Below is the output for my board:

Voila!

That really is all there is to it; you are now one step away from your dream device, and it really is only a matter of imagination. Check out the documentation of the current firmware to see what commands the board supports and you will realize all the neat things you can use it for: driving LEDs, servos, LCD displays, acquiring data, ...
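
To keep experiments tidy, you can wrap the open/write/read/close dance from the snippet above in a tiny helper. This is just a convenience sketch of mine, not part of the UBW distribution, and the command strings themselves must still come from the firmware documentation:

from serial import Serial

def ubw_command(port, command):
    """Send one firmware command (e.g. "V") to the UBW and return its reply."""
    ubw = Serial(port)
    try:
        ubw.write(command + "\n")
        return ubw.readline().strip()
    finally:
        ubw.close()

print ubw_command("/dev/term/0", "V")  # same firmware version query as above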

Concrete Example: The OpenDS Weather Station

As I said at the beginning of this post, my initial goal was to craft a monitoring device for OpenDS. Now you have a good idea of how I dealt with the hardware part, but an image is worth a thousand words so here is a snap...

On the software front, well, being a software engineer by trade, that was the easy part, so it's almost no fun and I won't go into as much detail, but here is a 10,000ft view:

  • data is collected in a matrix of hash tables
  • each hash table represents a population of data points for one sampling period
  • a dedicated timer thread pushes a fresh list of hash tables into the matrix so as to reset the counters for a new sampling period

So for example, if we want to track CPU utilization, we only need to keep one metric: the hash table will only have one key/value pair. Easy. Slightly overkill, but easy. Now if you want to keep track of transaction response times, the hash table keeps the response time (in ms) as the key and the number of transactions processed with that particular response time as the associated value. Therefore, if within one sampling period you have 10,000 operations processed, with 6,000 in 0 ms, 3,999 in 1 ms and 1 in 15 ms, your hash table will only have 3 entries, as follows: [ 0 => 6000; 1 => 3999; 15 => 1 ]

This allows for a dramatic compression of the data compared to logging a single line with an etime for each operation, which would result in 10,000 lines of about 100 bytes each.

What's more, this representation of the same information makes it easy to compute the average, extract the maximum value and calculate the standard deviation.
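
As a minimal sketch of the idea (using the made-up numbers from the example above), here is how such a per-period hash table can be reduced to the usual statistics:

import math

# etime in ms => number of operations completed with that etime, for one period
sample = {0: 6000, 1: 3999, 15: 1}

count = sum(sample.values())
mean = float(sum(etime * n for etime, n in sample.items())) / count
peak = max(sample)  # worst etime observed during the period
variance = sum(n * (etime - mean) ** 2 for etime, n in sample.items()) / count
print "ops=%d avg=%.3f ms max=%d ms stddev=%.3f ms" % (count, mean, peak, math.sqrt(variance))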

All that said, the weather station is only sent the latest sample, so it always shows the current state of the server. And as it turns out, it is very useful; I like it very much just the way it worked out.

 Well, I'm glad to close down the shop, it's 7:30pm .... another busy Saturday

Tuesday Sep 29, 2009

OpenDS in the cloud on Amazon EC2


Rationale

Why not run your authentication service in the cloud? This is the first step towards a proper cloud IT. There are numerous efforts under way to ease deploying your infrastructure in the cloud, from Sun and others, from OpenSSO to GlassFish, from SugarCRM to Domino, and the list goes on. Here is my humble contribution for OpenDS.

Bird's Eye View

Tonight I created my EC2 account and got OpenDS going on the Amazon infrastructure in about half an hour. I will retrace my steps here and point out some of the gotchas.

The Meat

Obviously, some steps must be taken prior to installing software.

First, you need an AWS (Amazon Web Services) account with access to EC2 (Elastic Compute Cloud) and S3 (Simple Storage Service). I will say this about EC2: it is so jovially intoxicating that I would not be surprised to be surprised by my first bill when it comes... but that's good, right? At least for Amazon it is, yes.

Then you need to create a key pair, trivial as well. Everything is explained in the email you receive upon subscription.

Once that's done, you can cut to the chase and log on to the AWS management console right away to get used to the concepts and terms used in Amazon's infrastructure. The two main things are an instance and a volume. The names are rather self-explanatory: the instance is a running image of an operating system of your choice. The caveat is that if you shut it down, the next time you start this image you will be back to the vanilla image. Think of it as a LiveCD: you can't write persistent data to it, and if you do, it won't survive a power cycle.

To persist data between cycles, we'll have to rely on volumes for now. Volumes are just what they seem to be, only virtual. You can create and delete volumes at will, of whatever size you wish. Once a volume is created and becomes available, you need to attach it to your running instance in order to be able to mount it in the host operating system. CAUTION: look carefully at the "availability zone" where your instance is running; the volume must be created in the same zone or you won't be able to attach it.
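
For the script-minded, the same create-and-attach dance can also be done outside the console. The following is a hypothetical sketch using the boto Python library (not something covered in this post); the 10 GB size, the /dev/sdf device name and the use of the first running instance are all illustrative assumptions:

import boto

# Hypothetical sketch: create a volume in the SAME availability zone as the
# instance, then attach it so the host OS can see and mount it.
conn = boto.connect_ec2()  # reads AWS credentials from the environment
instance = conn.get_all_instances()[0].instances[0]
zone = instance.placement  # e.g. "us-east-1a"

volume = conn.create_volume(10, zone)  # 10 GB volume in the instance's zone
conn.attach_volume(volume.id, instance.id, "/dev/sdf")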

 Here's a quick overview of the AWS management console with two instances of OpenSolaris 2009.06 running. The reason I have two instances here is that one runs OpenDS 2.0.0 and the other runs DSEE 6.3 :) -the fun never ends-. I'll use it later on to load OpenDS.

My main point of interest was to see how OpenDS performs in this wildly virtualized environment. As I described in my previous article on OpenDS on the Acer Aspire One, virtualization brings an interesting trend to the market, one rather orthogonal to the traditional perception that performance evolves through mere hardware improvements...

In one corner, the heavyweight telco/financial/pharmaceutical company weighing in at many millions of dollars for a large server farm dedicated to high-performance authentication/authorization services. Opposite these folks, curled up in the other corner, the ultra-small company looking at every way to minimize cost, simply to run the house while allowing the supporting infrastructure to grow as business ramps up.

That used to be quite the headache. I mean, it's pretty easy to throw indecent amounts of hardware at meeting crazy SLAs. Architecting a small, nimble deployment that is still able to grow later? Not so much. If you've been in this business for some time, you know that every sizing iteration requires going back to capacity planning and benchmarking, which is too long and too costly most of the time. That's where elastic approaches can help. The "cloud" (basically, hyped-up managed hosting) is one of them.

Our team also has its own, LDAP-specific, approach to elasticity; I will talk about that in another article. Let's focus on our "cloud" for now.

Once your instance is running, follow these simple steps to mount your volume, and we can start talking about why EC2 is a great idea that needs to be developed further for our performance-savvy crowd.

In this first snapshot, I am running a stock OpenDS 2.0.0 server with 5,000 standard MakeLDIF entries. This is to keep it comparable to the database I used on the netbook. Same searchrate, sub scope, return the whole entry, across all 5,000.

If this doesn't ring a bell, check out the Acer article. Your basic EC2 instance has about as much juice as a netbook. Now the beauty of it all is that all it takes on my part to improve the performance of that same OpenDS server is to stop my "small" EC2 instance and start a medium one.

Voila!

I've got 2.5 times the initial performance. I did not change ONE thing on OpenDS; this took 3 minutes to do, I simply restarted the instance with more CPU. I already hear you cry out that it's a shame we can't do this live -it is virtualization after all- but I'm sure it'll come in due course. It is worth noting that even though OpenDS could use 80+% of the CPU on the small instance, in this case I was only using about 60%, so the benefit would likely be even greater, but I would need more client instances. This imperfect example still proves the point about ease of use and elasticity.

The other thing that you can see coming is an image of OpenDS for EC2. I'm thinking it should be rather easy to script two things:

1) self-discovery of an OpenDS topology and automatic hook-up into the multi-master mesh, and

2) snapshot -> copy -> restore of the db, with almost no catch-up to do data-wise. If you need more power, just spawn a number of new instances: no setup, no config, no tuning. How about that?

Although we could do more with additional features from the virtualization infrastructure, there are already a number of unexplored options with what is there today. So let's roll up our sleeves and have a serious look. Below is a snapshot of OpenDS modrate on the same medium instance as before, at about 25% CPU utilization. As I said before, this thing has had NO fine-tuning whatsoever, so these figures are with the default, out-of-the-box settings.

  I would like to warmly thank Sam Falkner for his help and advice and most importantly for teasing me into trying EC2 with his FROSUG lightning talk! That stuff is awesome! Try it yourself.

Tracking Down All Outliers From Your LDAP Servers ...


Rationale

I was recently faced with the challenge of tracking down and eliminating outliers from a customer's traffic, and I had to come up with some sort of tool to help diagnose where these long-response-time transactions originated. Not really rocket science -hardly anything IS rocket science, even rocket science isn't all that complicated, but I digress- yet nothing that I had in the toolbox would quite serve the purpose. So I sat down and wrote a tool that would allow me to visually correlate events in real time. At least that was the idea.

Bird's Eye View

This little tool is only meant for investigations; we are working on delivering something better and more polished (code name Gualicho, shhhhhhh) for production monitoring. The tool I am describing in this article simply correlates the server throughput, peak etime, I/O, CPU, network and garbage collection activity (for OpenDS). It is all presented as sliding line graphs, stacked on top of each other, making visual identification and correlation easy. Later on I will adapt the tool to work with DPS, since it is the other product I like to fine-tune for my customers.

The Meat

When pointed to the access log and the GC log, here is the text output you get. One line is displayed per second, with the aggregated information collected from the access log and the garbage collection log, as well as kstats for network, I/O and CPU.


If you look at it closely, you'll see I represented the garbage collection in %, which is somewhat unusual, but after debating how to make this metric available, I decided that all I was interested in was a relative measure of the time spent in stop-the-world GC operations versus the time the application itself is running. As I will show in the snapshot below, this is quite effective for spotting correlations with high etimes in most cases. To generate this output in the GC log, all you have to do is add the following to your set of JAVA_ARGS for start-ds.java-args in /path/to/OpenDS/config/java.properties:

 -Xloggc:/data/OpenDS/logs/gc.log -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
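
As a rough sketch (not the actual tool), the stopped-time percentage can be derived from the resulting gc.log along these lines, assuming the standard HotSpot messages those two flags produce:

import re

stopped = running = 0.0
for line in open("/data/OpenDS/logs/gc.log"):
    m = re.search(r"Total time for which application threads were stopped: ([\d.]+) seconds", line)
    if m:
        stopped += float(m.group(1))
        continue
    m = re.search(r"Application time: ([\d.]+) seconds", line)
    if m:
        running += float(m.group(1))

if stopped + running > 0:
    print "%.1f%% of the time spent in stop-the-world GC" % (100.0 * stopped / (stopped + running))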

And then my GUI will show something like:


Don't hesitate to zoom in on this snapshot. The image is barely legible due to blog formatting constraints.

Excuse me if I have not waited 7+ days to take the snapshot for this article, but I think this simple snap serves the purpose. You can see that most of the time we spend 2% of the time blocked in GC, but occasionally we have spikes up to 8%, and when this happens, even though it has little impact on the overall throughput over one second, the peak etime suddenly jumps to 50ms. I will describe in another article what we can do to mitigate this issue; I simply wanted to share this simple tool here since I think it can serve some of our expert community.

OpenDS on Acer Aspire One: Smashing!


Rationale

As far-fetched as it may seem, with the growing use of virtualization and cloud computing, the average image instance that LDAP authentication systems have to run on looks more like your average netbook than a supercomputer. With that in mind, I set out to find a reasonable netbook to test OpenDS on. I ended up with an Acer Aspire One with 1GB of RAM. Pretty slim on memory. Let's see what we can get out of that thing!

Bird's Eye View

In this quick test, I loaded OpenDS (2.1b1) with 5,000 entries (the stock MakeLDIF template delivered with it), hooked the netbook up to a closed GigE network and loaded it from a Core i7 machine with searchrate. Result: 1,300+ searches per second. Not bad for a machine that only draws around 15 Watts!

The Meat 

As usual, some more details about the test, but first a quick disclaimer: this is not a proper test or benchmark of the Atom as a platform, it is merely kicking the tires. I have not measured any metric other than throughput, and only for a search workload at that. It is only to get a "feel" for OpenDS on such a lightweight sub-notebook.

In short:

  • Netbook: Acer Aspire One ZG5 - Atom N270 @1.6GHz, 1GB RAM, 100GB HDD
  • OS: OpenSolaris 2009.06
  • FS: ZFS
  • OpenDS: all stock; I did not even touch the Java options, which I usually do
  • Java: 1.6 Update 13

The little guy in action: perfbar shows the CPU is all the way up there with little headroom...

