Monday Dec 21, 2009

Twice Faster: Import Tips And Tricks


We understand. Really, we do.

You want to import your data as fast possible. Not only nobody likes to wait but there are time constraints on everyone of us. In particular, we're constrained by maintenance windows.

So, what does give you the best return on investment? Higher CPU Clock speed? More CPUs? More memory? We're here to discover just that together.

Irrespective of the litterature on the topic, you must make sure that import will not have to be done in multiple passes or you'll get killed on index merging time. To do so, shoot high and dial down gradually to find the right cache size for your data set.

Bird's Eye View

In essence, what I did was to start an import on a vanilla system and Directory Server with a customer's actual data, 7.9M entries. The system I am testing with isn't a server. The system doesn't matter, it is a constant. The variable are the amount of import cache, the number of CPUs active (1-8) and the CPU clock speed (from 3.3 to 3.7GHz). In short, memory matters most.

The Meat

The Setup

The instance I am doing this with is an actual telco setup, with 7.9M entries in the LDIF file. The LDIF weighs in at 12GiB. There are 10 custom indexes configured for equality only.The 8 system indexes are there as well.

On the physical side of things, the machine is my desktop, an Intel Corei7 975EE @ 3.3GHz. It has a 64GB Intel X25 and a 10,000 rpm SATA drive. The disk layout is described in more detail here.

Sensivity To Import Cache 

Despite what the documentation says, there are huge gains to be reaped from increasing the import cache size, and depending on your data set, this may make  a world of difference.

This is the first thing that I tweaked during this test first phase and bumping import cache from 2GiB to 4GiB chopped import time in half. Basically, if your import has to occur in more than a single pass, then your import cache isn't big enough, try to increase it if your system can take it.

Sensivity To Clock Speed

Ever wondered if a system with CPUs twice faster would buy you time on import? Not really. Why? Well, if the CPUs are waiting on the disks or locks then higher clock speeds isn't going to move the needle at all. That's what's going on here. Check this out...

The reason the 3.7GHz import isn't as fast as the 3.3GHz is because my overclocking might have thrown off the balance between the core clock and the bus clock, so the CPU is spinning its wheels ,waiting to access memory and IO...

I officially declare this one moot. I'll try again later with underclocking.

Sensivity To Number Of CPUs

Scalability is an interesting challenge. Ideally, you'd want half the import time given twice the resources. In reality, import is very lock intensive to avoid collisions/corruptions so it isn't quite that linear. Here's what I got on my system, all other things being equal.

So even though the scalability isn't linear, the good thing is the more CPUs the better your performance is going to be.

<script type="text/javascript"> var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www."); document.write(unescape("%3Cscript src='" + gaJsHost + "' type='text/javascript'%3E%3C/script%3E")); </script> <script type="text/javascript"> try { var pageTracker = _gat._getTracker("UA-12162483-1"); pageTracker._trackPageview(); } catch(err) {}</script>

Friday Dec 18, 2009

More ZFS Goodness: The OpenSolaris Build Machine

<script type="text/javascript"> var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www."); document.write(unescape("%3Cscript src='" + gaJsHost + "' type='text/javascript'%3E%3C/script%3E")); </script> <script type="text/javascript"> try { var pageTracker = _gat._getTracker("UA-12162483-1"); pageTracker._trackPageview(); } catch(err) {}</script>

Apart from my usual LDAP'ing, I also -try to- help the opensolaris team with anything I can.

Lately, I've helped build their new x64 build rig, for which I carefully selected the best components out there while trying to keep the overall box budget on a leash. It came out at about $5k. Not on the cheap side, but cheaper than most machines in most data centers.

The components:

  • 2 Intel Xeon E5220 HyperThreaded QuadCores@2.27GHz. 16 cpus in solaris
  • 2 32GB Intel X25 SSD
  • 2 2TB WD drives
  • 24GB ECC DDR2

I felt compelled to follow up my previous post about making the most out your SSD because some people commented that non mirrored pools were evil. Well, here's how this is set up this time: in order to avoid using either of the relatively small SSDs for the system, I have partitioned the big 2TB drives with exactly the same layout, one 100GB partition for the system, the rest of the disk is going to be holding our data. This leaves our SSD available for the ZIL and the L2ARC. But thinking about it, the ZIL is never going to take up the entire 32GB SSD. So I partitioned one of the SSDs with a 3GB slice for the ZIL and the rest for L2ARC.

The result is a system with 24GB of RAM for the Level 1 ZFS cache (ARC) and 57GB for L2ARC in combination with a 3GB ZIL. So we know it will be fast. But the icing on the cache ... the cake sorry, is that the rpool is mirrored. And so is the data pool.

Here's how it looks: 

admin@factory:~$ zpool status
  pool: data
 state: ONLINE
 scrub: none requested

    data        ONLINE       0     0     0
      c5d1p2    ONLINE       0     0     0
      c6d1p2    ONLINE       0     0     0
      c6d0p1    ONLINE       0     0     0
      c5d0p1    ONLINE       0     0     0
      c6d0p2    ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
 scrub: none requested

    rpool       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        c5d1s0  ONLINE       0     0     0
        c6d1p1  ONLINE       0     0     0

errors: No known data errors

 This is a really good example of how to setup a real-life machine designed to be robust and fast without compromise. This rig achieves performance on par with $40k+ servers. And THAT is why ZFS is so compelling.

Monday Dec 14, 2009

Make The Most Of Your SSD With ZFS

<script type="text/javascript"> var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www."); document.write(unescape("%3Cscript src='" + gaJsHost + "' type='text/javascript'%3E%3C/script%3E")); </script> <script type="text/javascript"> try { var pageTracker = _gat._getTracker("UA-12162483-1"); pageTracker._trackPageview(); } catch(err) {}</script>


 If you're anything like me, you're lucky if you have a single SSD in your workstation or server. I had a dilemma in the past, I couldn't quite find a way to make the most of my one flash drive: I had to chose between making it a ZIL or use it as L2ARC. I dreaded having to make a definitive choice for one or the other. When I installed my workstation with OpenSolaris 2009.06, I had an idea in mind, so I installed the system on the SSD in a small partition (10GB) and left the rest of the drive unallocated if you catch my drift...

Bird's Eye View

Simple! Just partition the SSD to be able to use it as both L2ARC and ZIL in whatever proportions you think is going to suit  your needs. Note however that the IOs are shared between your partitions on the same drive. From my testing though, I can tell you that with this setupu you're still coming out on top in most situations.

The Meat

It's all pretty simple really, when you install solaris, you have a choice of installing on "whole disk" or to use the tool to make a smaller custom partition. I cut out a 36GB partition which allows ample room for the system and swap. The rest of my 64GB SSD is left unallocated at install time, we'll take care of everything later.

The second disk in my system is a 300GB 10,000 rpm SATA drive which, being fast but small, I wanted to leave whole for my data pool (keep in mind that the rpool is a little different than your regular pool, so make sure to treat it accordingly). That is why I decided to compromise and use some of the SSD space for the system. You don't have, you could partition your spindle and have the system on there.

Now, that you have opensolaris up and running, install GParted to be able to edit your disks partitions. You can either use the opensolaris package manager or

pfexec pkg install SUNWGParted

It's all downhill from here. Open GParted. If you just installed it, you will need to log out and back in to see in the GNnome menu. It will be in Applications->System tools->GParted Partition Editor

Select your flash drive and carve out a 2GB partition for your ZIL and assign the remaining space for L2ARC. Apply the changes and keep the window open.

Note the two devices path in /dev/dsk because that's what we'll use to add these two SSD partitions as performance enhancing tools in our existing pool.

arnaud@ioexception:/data/dsee7.0/instances$ pfexec zpool add data log /dev/dsk/c9d0p2 cache /dev/dsk/c9d0p3

Let's check how our pool looks now...

arnaud@ioexception:/data/dsee7.0/instances$ zpool status data
  pool: data
 state: ONLINE
 scrub: none requested

    data        ONLINE       0     0     0
      c8d0      ONLINE       0     0     0
      c9d0p2    ONLINE       0     0     0
      c9d0p3    ONLINE       0     0     0

errors: No known data errors

Et voila!

You've got the best of both worlds, making the absolute most of whatever little hardware you had at your disposal!


Saturday Oct 03, 2009

A Dashboard Like No Other: The OpenDS Weather Station

<script type="text/javascript"> var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www."); document.write(unescape("%3Cscript src='" + gaJsHost + "' type='text/javascript'%3E%3C/script%3E")); </script> <script type="text/javascript"> try { var pageTracker = _gat._getTracker("UA-12162483-1"); pageTracker._trackPageview(); } catch(err) {}</script>


 Doing so many benchmarks, profiling and other various performance related activities, I had to find a way to "keep an eye" on things while fetching emails, chatting on IM and the like. Having some experience in past projects with microcontrollers, although on Windows, I figured I could put together a little gizmo to help me keep tabs on my Directory Server.

Bird's Eye View

This is basically a simple setup with a USB Bit Whacker controlled by a Python script, feeding it data crunched from various sources, mainly the Directory Server access log, the garbage collection log and kstats... the result is a useful dashboard where I can see things happen at a glance.

The Meat

Everything starts with the USB Bit Whacker. It's a long story, but to cut short, a couple a years ago, Kohsuke Kawaguchi put together an orb that could be used to monitor the status of a build / unit tests in Hudson. Such devices are also know as eXtreme Feedback Devices or XFDs. Kohsuke chose to go with the USB Bit Whacker (UBW) for it is a USB 'aware' microcontroller that also draws power from the bus, and is therefore very versatile while remaining affordable ($25 soldered and tested from sparkfun but you can easily assemble your own). A quick search will tell you that this is a widely popular platform for hobbyists.

 On the software side, going all java would have been quite easy except for the part where you need platform specific libraries from the serial communication. Sun's javacomm library or rxtx have pros and cons but in my case, the cons were just too much of a hindrance. What's more, I am not one to inflict myself pain unless it is absolutely necessary. For that reason, I chose to go with Python. While apparently not as good on cross-platformedness compared to Java, installing the Python libraries for serial communication with the UBW is trivial and has worked for me right off the bat on every platform I have tried, namely: Mac OS, Linux and Solaris. For example, on OpenSolaris all there is to it is:

 $ pfexec easy_install-2.4 pySerial
Searching for pySerial
Best match: pyserial 2.4
Processing pyserial-2.4.tar.gz
Running pyserial-2.4/ -q bdist_egg --dist-dir /tmp/easy_install-Y8iJv9/pyserial-2.4/egg-dist-tmp-WYKpjg
zip_safe flag not set; analyzing archive contents...
Adding pyserial 2.4 to easy-install.pth file

Installed /usr/lib/python2.4/site-packages/pyserial-2.4-py2.4.egg
Processing dependencies for pySerial
Finished processing dependencies for pySerial

 that's it! Of course, having easy_install is a prerequisite. If you don't, simply install setuptools for your python distro, which is a 400kB thing to install. You'll be glad you have it anyway.

Then, communicating with the UBW is mind boggingly easy. But let's not get ahead of ourselves, first things first:

Pluging The USB Bit Whacker On OpenSolaris For The First Tim

The controller will appear as a modem of the old days and communicating with equates to sending AT commands. For those of you who are used to accessing Load Balancers or other network equipment through the serial port, this is no big deal.

In the screenshot below, the first ls command output shows that nothing in /dev/term is an actual link, however, the second -which I issued after plugging the UBW on the usb port- shows a new '0' link has been created by the operating system.

Remember which link your ubw appeared as for our next step: talking to the board.

Your First Python Script To Talk To The UBW

I will show below how to send the UBW the 'V' command which instructs it to return the firmware version, and we'll see how to grab the return value and display it. Once you have that down, the sky is the limit. Here is how:

from serial import \*
ubw = Serial("/dev/term/0")
print "Requesting UBW Firmware Version"
print "Result=["+ubw.readline().strip() + "]\\n"

Below is the output for my board:


That really is all there is to it, you are now one step away from your dream device. And it really is only a matter of imagination. Check out the documentation of current firmware to see what commands the board supports and you will realize all the neat things you can use it for: driving LEDs, Servos, LCD displays, acquiring data, ...

Concrete Example: The OpenDS Weather Station

As I said at the beginning of this post, my initial goal was to craft a monitoring device for OpenDS. Now you have a good idea of how I dealt with the hardware part, but an image is worth a thousand words so here is a snap...

On the software front, well, being a software engineer by trade, that was the easy part so that's almost not fun and I won't go inot as much detail but here is a 10,000ft view:

  • data is collected in a matrix of hash tables.
  • each hash table represent a population of data points for a sampling period
  • an individual time thread pushes a fresh list of hash tables in the matrix so as to reset the counters for a new sampling period

So for example, if we want to track CPU utilization, we only need to keep one metric. The hash table will only have one key pair. Easy. Slightly overkill but easy. Now if you want to keep track of transactions response times, the hash table will keep the response time (in ms) as a key and the number of transactions that were processed in that particular response time as the associated value. Therefore, if you have within one sampling period, 10,000 operations processed with 6,000 in 0 ms, 3,999 in 1ms and 1 in 15 ms, your hashtable will only have 3 entries as follows: [ 0 => 6000; 1=>3999; 15=>1 ]

This allows for a dramatic compression of the data compared to having a single line with etime for each operation, which would result in 10,000 lines of about 100 bytes.

What's more is that this representation of the same information allows to easily compute the average, extract the maximum value and calculate the standard deviation.

All that said, the weather station is only sent the last of the samples, so it always shows the current state of the server. And as it turns out, it is very useful, I like it very much just the way it worked out.

 Well, I'm glad to close down the shop, it's 7:30pm .... another busy Saturday

Tuesday Sep 29, 2009

Tracking Down All Outliers From Your LDAP Servers ...

<script type="text/javascript"> var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www."); document.write(unescape("%3Cscript src='" + gaJsHost + "' type='text/javascript'%3E%3C/script%3E")); </script> <script type="text/javascript"> try { var pageTracker = _gat._getTracker("UA-12162483-1"); pageTracker._trackPageview(); } catch(err) {}</script>


 I was recently faced with the challenge to track down and eliminate outliers from a customer's traffic and I had to come up with some some of tool to help in diagnosing where these long response time transactions originated from. Not really rocket science -hardly anything IS rocket science, even rocket science isn't all that complicated, but I digress- yet nothing that I had in the tool box would quite serve the purpose. So I sat down and wrote a tool that would allow me to visually correlate events in real time. At least that was the idea.

Bird's Eye View

This little tool is only meant for investigations and we are working on delivering something better and more polished (code name Gualicho, shhhhhhh) for production monitoring. The tool I am describing in this article simply correlates the server throughput, peak etime, I/O, CPU, Network and Garbage Collection activity (for OpenDS). It is all presented in a sliding line metric, stacked on top of each other, making visual identification and correlation easy. Later on I will adapt the tool to work on DPS, since it is the other product I like to fine tune for my customers.

The Meat

When pointed to the access log and the GC log, here is the text output you get. There is one line per second that is displayed with the aggregated information collected from the access log and garbage collection as well as kstats for network, I/O, CPU.

If you looked at it closely, I represented the garbage collection in % which is somewhat unsual but after debating on how to make this metric available, I decided that all I was interested was a relative measure of the time spent in stop-the-world GC operations over the time the application itself is running. As I will show in the snapshot below, this is quite effective to spot correlations with high etimes in most cases. To generate this output in the GC log, all you have to do is add the following to your set of JAVA_ARGS for in /path/to/OpenDS/config/

 -Xloggc:/data/OpenDS/logs/gc.log -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime

And then my GUI will show something like:

Don't hesitate to zoom in on this snapshot. The image is barely legible due to blog formatting constraints.

Excuse me if I have not waited 7+ days to take the snapshot for this article but I think this simple snap serves the purpose. You can see that most of the time we spend 2% of the time blocked in GC but sometimes we have spikes up to 8% and when this happens, even though it has little impact on the overall throughput over one second, the peak etime suddenly jumps to 50ms. I will describe in another article what we can do to mitigate this issue, I simply wanted to share this simple tool here since I think it can serve some of our expert community.

OpenDS on Acer Aspire One: Smashing!

<script type="text/javascript"> var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www."); document.write(unescape("%3Cscript src='" + gaJsHost + "' type='text/javascript'%3E%3C/script%3E")); </script> <script type="text/javascript"> try { var pageTracker = _gat._getTracker("UA-12162483-1"); pageTracker._trackPageview(); } catch(err) {}</script>


As far fetched as it may seem,  with the growing use of virtualization and cloud computing, the average image instance that LDAP authentication systems are having to run on look more like your average netbook than a supercomputer. With that in mind, I set out to find a reasonable netbook to test OpenDS on. I ended up with an Acer Aspire ONE with 1GB of RAM. Pretty slim on memory. Let's see what we can get out of that thing!

Bird's Eye View

In this rapid test I have done, I loaded OpenDS (2.1b1) with 5,000 entries (stock MakeLdif template delivered with it), hooked up the netbook to a closed GigE network and loaded it from a corei7 machine with searchrate. Result: 1,300+ searches per second. Not bad for a machine that only draws around 15 Watts!

The Meat 

As usual, some more details about the test but first a quick disclaimer: this is not a proper test or benchmark of the Atom as a platform, it is merely a kick in the tires. I have not measured other metrics than the throughput and only for a search workload at that. It is only to get a "feel" of it on such a lightweight sub-notebook.

In short:

  • Netbook: Acer Aspire One ZG5 - Atom N270 @1.6GHz, 1GB RAM, 100GB HDD
  • OS: OpenSolaris 2009.05
  • FS: ZFS
  • OpenDS: all stock, I did not even touch the JAVA options which I usually do
  • JAVA: 1.6 Update 13

The little guy in action, perfbar shows the CPU is all the way up there with little headroom...


Directory Services Tutorials, Utilities, Tips and Tricks


« April 2014