Twice Faster: Import Tips And Tricks
By arnaud on Dec 21, 2009
We understand. Really, we do.
You want to import your data as fast possible. Not only nobody likes to wait but there are time constraints on everyone of us. In particular, we're constrained by maintenance windows.
So, what does give you the best return on investment? Higher CPU Clock speed? More CPUs? More memory? We're here to discover just that together.
Irrespective of the litterature on the topic, you must make sure that import will not have to be done in multiple passes or you'll get killed on index merging time. To do so, shoot high and dial down gradually to find the right cache size for your data set.
Bird's Eye View
In essence, what I did was to start an import on a vanilla system and Directory Server with a customer's actual data, 7.9M entries. The system I am testing with isn't a server. The system doesn't matter, it is a constant. The variable are the amount of import cache, the number of CPUs active (1-8) and the CPU clock speed (from 3.3 to 3.7GHz). In short, memory matters most.
The instance I am doing this with is an actual telco setup, with 7.9M entries in the LDIF file. The LDIF weighs in at 12GiB. There are 10 custom indexes configured for equality only.The 8 system indexes are there as well.
On the physical side of things, the machine is my desktop, an Intel Corei7 975EE @ 3.3GHz. It has a 64GB Intel X25 and a 10,000 rpm SATA drive. The disk layout is described in more detail here.
Sensivity To Import Cache
Despite what the documentation says, there are huge gains to be reaped from increasing the import cache size, and depending on your data set, this may make a world of difference.
This is the first thing that I tweaked during this test first phase and bumping import cache from 2GiB to 4GiB chopped import time in half. Basically, if your import has to occur in more than a single pass, then your import cache isn't big enough, try to increase it if your system can take it.
Sensivity To Clock Speed
Ever wondered if a system with CPUs twice faster would buy you time on import? Not really. Why? Well, if the CPUs are waiting on the disks or locks then higher clock speeds isn't going to move the needle at all. That's what's going on here. Check this out...
The reason the 3.7GHz import isn't as fast as the 3.3GHz is because my overclocking might have thrown off the balance between the core clock and the bus clock, so the CPU is spinning its wheels ,waiting to access memory and IO...
I officially declare this one moot. I'll try again later with underclocking.
Sensivity To Number Of CPUs
Scalability is an interesting challenge. Ideally, you'd want half the import time given twice the resources. In reality, import is very lock intensive to avoid collisions/corruptions so it isn't quite that linear. Here's what I got on my system, all other things being equal.
So even though the scalability isn't linear, the good thing is the more CPUs the better your performance is going to be.