Wednesday Oct 26, 2011

VDEV - What is a VDEV and why should you care?

Ok, so we can finally talk VDEVs. Going back to my blog on disk calculations, I told you how the calculator works, and the way you can see how many drive spindles you would have for any particular RAID layout. Let's use an example of nine trays of 24 drives each, using 1TB drives.

Yes, I know we no longer offer 1TB drives, but this is the graphic I had, so just roll with me. Now, if we were setting this up in the ZFSSA BUI, it would look like this:

So that's all great and it all lines up, right? Well, the one thing the BUI doesn't show very well is the VDEVs. You can figure it out in your head if you know what you're doing, but the calculator can do it for you if you just add the "-v" option right after the .py script name in the Python command. Doing that for the above example will give you back this:

Notice the new column for VDEVs. Cool. So now I can see the breakdown of Virtual Devices that each type of RAID will create out of my physical devices (spindles). In this case, my nine trays of 24 spindles is 216 physical devices.
-If I do something silly and make that a 'Stripe', then I would get 1 big virtual device made up of 216 physical devices.
-I could also make it a 'Mirror', which will give me 106 virtual devices, each made up of 2 physical devices.
-A RAIDz1 pool will give me 53 virtual devices, each with 4 physical devices to make my 3+1 stripes.
-Finally, for the sake of this conversation, a RAIDz2 choice will give me only 15 VDEVs, each with 14 physical drives that make 12+2 stripes. You don't get 14 data drives; you get 14 drives per stripe, and 2 of those are parity drives in a RAIDz2 stripe, so remember to leave them out when you calculate your usable space.
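
Here's a quick sketch of that arithmetic in Python. It is not the calculator itself, just the VDEV counts and widths from the list above plugged into a small, hypothetical `Layout` class. Calling the drives left over after the VDEVs are built "leftover" (presumably spares) is an assumption; the breakdown above only gives the VDEV counts and widths. With 1TB drives, the data-drive count is roughly your raw TB before any ZFS overhead.

```python
from dataclasses import dataclass

TOTAL_DRIVES = 9 * 24  # nine trays of 24 drives each = 216 spindles


@dataclass(frozen=True)
class Layout:
    """Hypothetical helper, not part of the sizing calculator."""
    name: str
    vdevs: int       # number of virtual devices in the pool
    width: int       # physical drives per VDEV
    redundancy: int  # drives per VDEV holding parity or a mirror copy

    @property
    def drives_in_vdevs(self) -> int:
        return self.vdevs * self.width

    @property
    def data_drives(self) -> int:
        # Only the non-redundant drives count toward usable space.
        return self.vdevs * (self.width - self.redundancy)

    @property
    def leftover(self) -> int:
        # Assumption: drives not placed in any VDEV (likely spares).
        return TOTAL_DRIVES - self.drives_in_vdevs


# VDEV counts and widths taken from the list above.
LAYOUTS = [
    Layout("Stripe", vdevs=1, width=216, redundancy=0),
    Layout("Mirror", vdevs=106, width=2, redundancy=1),
    Layout("RAIDz1", vdevs=53, width=4, redundancy=1),
    Layout("RAIDz2", vdevs=15, width=14, redundancy=2),
]

for layout in LAYOUTS:
    print(
        f"{layout.name:7} vdevs={layout.vdevs:3} "
        f"data_drives={layout.data_drives:3} leftover={layout.leftover}"
    )
```

Notice that RAIDz2 has the fewest VDEVs of the redundant layouts but the most data drives, which is exactly the capacity-versus-VDEV-count trade-off this post is about.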

Now, why do you care how many VDEVs you have? It's all about throughput. Very simply stated, the more VDEVs you have, the more users can push data into the system at the same time. Now, that's very simplistic, and it really depends on your workload. There are exceptions, as I have found, but for the most part, more VDEVs will equal better throughput for small, random IO. This is why a Mirrored pool is almost always the best way to set up a high-throughput pool for small, random IO such as a database. Look at all the VDEVs a mirrored pool gives you.

Think of it this way: Say you have 256K of data you want to write to the system, using a 128K record size. With a mirrored pool, ZFS will split your 256K into 2 blocks of 128K each, and send them down to exactly 2 of the VDEVs to write out to 4 physical drives. Now, you still have a whopping 104 other VDEVs not doing anything, and they could all be handling other users' workloads at the same time. Take the same example using a RAIDz1 pool. ZFS will again have to break up your 256K into two 128K chunks and send them to 2 VDEVs, each with 4 physical drives, with each data drive of the 3+1 stripe getting about 43K. That's all fine, but while those 8 physical drives are working on that data, they can't do anything else, and you only have 51 other VDEVs to handle everyone else's workload.
As an extreme example, let's check out a RAIDz3 False pool. You only get 4 VDEVs, each with 53 drives in a 50+3 stripe. Writing that same 256K with a 128K record size will still split it over 2 VDEVs, and you only have 2 left for others to use at the same time. In other words, it will take the IOPS of 106 physical spindles to write that one stupid 256K block, while in the Mirrored pool, it would have only taken the IOPS of 4 physical spindles, leaving you with tons of other IOPS.
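
If you want to play with those numbers yourself, here is a rough Python sketch of the simplified model used in the examples above: one record goes to one VDEV, and every drive in that VDEV is busy while it is written. The `write_footprint` function and `Footprint` tuple are hypothetical names made up for this illustration, and the model deliberately ignores caching, write aggregation, and everything else ZFS really does, so treat it as back-of-the-napkin math rather than a description of how ZFS allocates writes.

```python
import math
from typing import NamedTuple


class Footprint(NamedTuple):
    busy_vdevs: int           # VDEVs tied up by this write
    busy_spindles: int        # physical drives tied up by this write
    idle_vdevs: int           # VDEVs still free for other users
    kb_per_data_drive: float  # share of one record per data drive


def write_footprint(write_kb, record_kb, vdevs, width, redundancy):
    """Simplified model: each record lands on exactly one VDEV."""
    records = math.ceil(write_kb / record_kb)
    busy_vdevs = min(records, vdevs)
    data_drives = width - redundancy  # drives per VDEV that hold data
    return Footprint(
        busy_vdevs=busy_vdevs,
        busy_spindles=busy_vdevs * width,
        idle_vdevs=vdevs - busy_vdevs,
        kb_per_data_drive=record_kb / data_drives,
    )


# (vdevs, drives per VDEV, redundant drives per VDEV) from the post.
POOLS = {
    "Mirror": (106, 2, 1),
    "RAIDz1": (53, 4, 1),
    "RAIDz3 False": (4, 53, 3),
}

for name, (vdevs, width, redundancy) in POOLS.items():
    fp = write_footprint(256, 128, vdevs, width, redundancy)
    print(
        f"{name:13} busy_vdevs={fp.busy_vdevs} "
        f"busy_spindles={fp.busy_spindles:3} idle_vdevs={fp.idle_vdevs:3} "
        f"kb_per_data_drive={fp.kb_per_data_drive:.1f}"
    )
```

Change the write size or record size in the loop and you can see how quickly a wide-stripe pool runs out of idle VDEVs compared to a mirrored one.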

Make sense?

Like I said, Mirroring is not always the best way to go. I've seen plenty of examples where we chose other pools over Mirrored after testing. That is the key. You need to test your workload with multiple pool types before picking one. If you don't have that luxury, make your best educated guess based on the knowledge that, in general, high-throughput random IO does better with more VDEVs, and large, sequential files can do very well with the larger stripes found in RAIDz2.

As a side note, we recommend the RAIDz1 pool for our Exadata backups to a ZFSSA. After testing, we found that, yes, the mirrored pool did go a bit faster, but not enough to justify the drop in capacity. We also found that the RAIDz1 pool was about 20% faster for backups and restores than the RAIDz2 pool, so the extra capacity of RAIDz2 didn't justify that drop in speed. Now, some people may disagree and say they don't care about capacity, they want the fastest no matter what, and go with Mirrored even in this scenario. That's fine, and that's the beauty of the ZFSSA, where you are allowed to experiment with many choices and options and choose the right balance for your company and your workload.

Have fun. Steve