Cluster tricks & tips
By Steve Tunstall-Oracle on Dec 12, 2013
Most of us have clustered ZFSSAs, and have been frustrated at one time or another with getting the proper resource to be owned by the proper controller.
I feel your pain, and believe me, I have to deal with it as much or even more than you do. There are, however, some cool things you can do here and it will make your life easier if you fully understand how this screen works.
First, understand this: you almost never want to push the 'Takeover' button. The 'Takeover' button actually sends a signal to instantly reboot the OTHER controller, in a non-graceful way. More on that below.
We have two heads in this picture and they're both in the "Active" state, as you see here. This means you cannot click the "Failback" button, which is how we move resources to the head you wish to own them. You are allowed only ONE Failback, and only while a head is in the "Ready for Failback" state, as it is when it first comes up. We have already hit Failback on this system, so both heads are now Active. That's it. You're done until one reboots.
Do NOT hit the 'Takeover' button. That button should be labeled "Ungracefully shut down the other controller". Those were just too many words to fit on the button, so they called it Takeover. Sure, since the other head is being instantly rebooted, this head will now take over all of the resources. But this is one of the worst ways to reboot the other head. It's not nice. It does not flush the cache first. It's actually slower than the other way. When and why would you ever hit it? There are a few reasons. Perhaps the other head is in a failed state that is not allowing you to log in and shut it down correctly. Perhaps you are just setting the controllers up on day one, you know there's no workload at all, and you really don't care how the other head gets rebooted. If that's the case, then go for it.
Instead, for a clean and faster reboot, log into the controller you want to reboot, and click the power button:
This allows you to reboot it gracefully, flushing the cache first, and it almost always comes up faster than the 'Takeover' way.
Now that it has rebooted, which may take 5-15 minutes, the good controller's cluster screen should show that it's "Ready for Failback". Be certain all of your resources are set to the proper owner, and then hit the "Failback" button to move the resources and change both controllers to the "Active" state. REMEMBER--- You only get to hit the Failback button ONCE!!! So take your time and do all of your config and setup and get the ownership right before you hit it. Otherwise, you will be rebooting one of your controllers again. Not a huge deal, but another 15 minutes of your life, and perhaps a production slowdown for your clients.
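If it helps to see the "one Failback only" rule as logic, here is a toy sketch in Python. It is NOT appliance code and does not talk to a ZFSSA. The `Cluster` class, its method names, and the `failback_ready` flag are all made up for illustration, and the model assumes ownership changes only take effect at the next Failback.

```python
class Cluster:
    """Toy model of a two-head cluster (hypothetical, for illustration only)."""

    def __init__(self, owners):
        self.owners = dict(owners)    # configured owner of each resource
        self.serving = dict(owners)   # head currently serving each resource
        self.failback_ready = False   # True only after a head has rebooted

    def set_owner(self, resource, head):
        # Assumption: this only records the wish; nothing moves until failback().
        self.owners[resource] = head

    def reboot(self, head):
        # The surviving head takes over every resource while the other reboots.
        survivor = "B" if head == "A" else "A"
        for resource in self.serving:
            self.serving[resource] = survivor
        self.failback_ready = True

    def failback(self):
        # Both heads Active means the button is grayed out.
        if not self.failback_ready:
            raise RuntimeError("both heads are Active; reboot one to fail back again")
        self.serving = dict(self.owners)
        self.failback_ready = False   # the single Failback is now used up


cluster = Cluster({"Rotation": "A", "Bob": "A"})
cluster.reboot("B")            # graceful reboot of head B
cluster.set_owner("Bob", "B")  # get the ownership right BEFORE failing back
cluster.failback()             # Bob is now served by B, both heads Active
```

Calling `cluster.failback()` a second time raises the error, which is the toy version of the grayed-out button.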
Now for a trick. There's nothing I can do to help you with the network resources. If they are on the wrong controller, you may have to reboot one and fix it and do a failback. However, if you have a storage pool on the wrong controller, I may be able to show you something cool. The best thing to remember and do is this: Create the resource (network or pool) ON the controller you wish to be the owner in the first place!!! Then, it will already be owned by the proper one, and you don't have to do a failback at all. However, what if, for whatever reason, you need to move a pool to the other controller and you MUST NOT reboot a controller in order to move it using the Failback process? In other words, you have an Active-Active setup, the Failback button is grayed out, and it's very important that you change the ownership of a storage pool but you are not allowed to reboot one of the controllers?
Bummer, right? Not so fast, check this out.
So here I have a system with two pools, Rotation and Bob, both on controller A. The Bob pool is supposed to be on controller B. They are both Active, so I cannot click Failback. I would normally have to reboot head B to fix this. But I don't want to.
So I'm going to unconfigure the Bob pool here on controller A. That's right, unconfigure. This does NOT hurt your data. Your data is safe as long as you do NOT create a new pool in that space. We're not going to create a new pool. We're going to IMPORT the Bob pool on controller B. All of your shares, LUNs, and their properties will be perfectly fine. There is only one hiccup, which we will talk about.
Go to Configuration-->Storage, select the correct pool (Bob), and then click "Unconfig".
But first, I want you to look carefully at the info below the pie chart here. Note that Bob currently has 2 Readzilla cache drives in it. This is important.
You will get this screen. Take a deep breath and hit apply.
No more Bob. Bob gone. Not really. It's still there and can be imported into another controller. This is how we safely move disk trays to new controllers, anyway. No big deal.
So, now go log into the OTHER controller. Don't do this on the same one or else you'll have to start all over again.
Here we are on B. DO NOT click the Plus Sign!!!! That will destroy your data!!!!
Click the IMPORT button.
The Import button will go out and scan your disk trays for any valid ZFS pools not already listed. Here, it finds one called "bob".
Select it and hit "Commit". There, Bob Pool is back. All of its shares and LUNs will be there too. The "Rotation" pool shows Exported because it's owned by the "A" controller, and the Bob Pool is owned here on B.
We can go to Configuration-->Cluster and see all is well and Bob Pool is indeed owned by the controller we wanted, and we never had to reboot!
However, we have one big problem.... Did you notice when you Imported the Bob Pool into controller B, the Cache drives did NOT come over?
It now has zero cache drives. What did you expect? The cache drives are the readzillas inside the controller, itself. They can't move over just because you changed the owner.
I have 2 extra Readzillas in my B controller not being used. So all I have to do is add them to the Bob Pool.
Go back to Configuration-->Storage on the B controller. Select the Bob pool and click "ADD". Do NOT click the plus sign. This is different.
I can now add any extra drives to the Bob pool. In this case, I don't have anything I could possibly add other than these two Readzillas inside controller B. So pretty easy.
Once added, I'm all good. I now have the Bob pool, with cache drives, being serviced on controller B with no reboot necessary.
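To recap the whole move, here is a small Python sketch of what just happened. Again, this is a toy model and not anything that runs on the appliance. The `Pool` and `Controller` classes, the function names, and the device names are all hypothetical. It also assumes that unconfiguring a pool frees the Readzillas inside the old head.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    name: str
    shares: list                               # shares and LUNs live on the disk trays
    cache: list = field(default_factory=list)  # Readzillas live inside the owning head


@dataclass
class Controller:
    name: str
    pools: dict = field(default_factory=dict)
    spare_cache: list = field(default_factory=list)


def unconfigure(head, pool_name, trays):
    """Drop the pool from this head; the data stays on the trays."""
    pool = head.pools.pop(pool_name)
    head.spare_cache.extend(pool.cache)  # assumption: the old head's Readzillas are freed
    pool.cache = []                      # cache devices cannot follow the pool
    trays[pool_name] = pool


def import_pool(head, pool_name, trays):
    """Import an unowned pool found on the trays (NOT the plus sign)."""
    head.pools[pool_name] = trays.pop(pool_name)


def add_cache(head, pool_name):
    """Add this head's unused Readzillas to the pool."""
    head.pools[pool_name].cache.extend(head.spare_cache)
    head.spare_cache.clear()


trays = {}
a = Controller("A", pools={
    "Rotation": Pool("Rotation", ["share-r"]),
    "Bob": Pool("Bob", ["share-b", "lun-b"], ["A-readzilla-0", "A-readzilla-1"]),
})
b = Controller("B", spare_cache=["B-readzilla-0", "B-readzilla-1"])

unconfigure(a, "Bob", trays)   # step 1, on controller A
import_pool(b, "Bob", trays)   # step 2, on controller B: shares intact, zero cache
add_cache(b, "Bob")            # step 3, on controller B: cache is back
```

The order matters in the sketch for the same reason it does on the real system: the pool has to be released by A before B can find it on the trays.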
By the way, you know you cannot remove drives from a pool, right? We can only add. This includes SSDs like Logzillas and Readzillas.
Well, I kind of just showed you a way you CAN remove readzillas from a pool, didn't I? Hmmmmmm.....