Thursday Dec 12, 2013

Cluster tricks & tips

Most of us have clustered ZFSSAs, and have been frustrated at one time or another with getting the proper resource to be owned by the proper controller.

I feel your pain, and believe me, I have to deal with it as much or even more than you do. There are, however, some cool things you can do here and it will make your life easier if you fully understand how this screen works. 

First, understand this: never push the 'Takeover' button. That's right, I said never. That button is mislabeled. Now, yes, we have two heads here and they're both in the "Active" state as you see here. This means you cannot click the "Failback" button, which is how we move resources to the head you wish to own them. You are only allowed ONE Failback when a head is in the "Ready for Failback" state, as it is when it first comes up. We have already hit Failback on this system, so both heads are now Active. That's it. You're done until one reboots. Do NOT hit the 'Takeover' button.


Do NOT hit the 'Takeover' button. That button should be labeled "Panic the other controller". Those were just too many words to fit on the button, so they called it Takeover. Panic is exactly what it does. Sure, that means that since the other head is now in panic, this head will take over all of the resources and the other head will reboot. This is one of the worst ways to reboot the other head. It's not nice. It does not flush the cache first. It's actually slower than the other way. Don't do it.

Instead, for a clean and faster reboot, log into the controller you want to reboot, and click the power button:

This allows you to reboot it gracefully, flushing the cache first, and it actually comes up faster than after a panic.

Now that it has rebooted, which may take 5-15 minutes, the good controller's cluster screen should show that it's "Ready for Failback". Be certain all of your resources are set to the proper owner, and then hit the "Failback" button to move the resources and change both controllers to the "Active" state. REMEMBER--- You only get to hit the Failback button ONCE!!! So take your time and do all of your config and setup and get the ownership right before you hit it. Otherwise, you will be rebooting one of your controllers again. Not a huge deal, but another 15 minutes of your life, and perhaps a production slowdown for your clients.
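If it helps to see the one-shot Failback rule as logic, here is a minimal Python sketch. It is a toy model only, not the appliance's real code or API: the `ClusterModel` class and its method names are made up for illustration, and the cluster state is boiled down to a single "is Failback clickable" flag.

```python
class ClusterModel:
    """Toy model of the failback rule; not the appliance's real logic."""

    def __init__(self, owners, running_on):
        self.owners = dict(owners)          # resource -> head that SHOULD own it
        self.running_on = dict(running_on)  # resource -> head serving it now
        self.failback_available = False     # both heads start out Active

    def reboot(self, head):
        """Gracefully reboot one head; the survivor serves everything."""
        survivor = "B" if head == "A" else "A"
        self.running_on = {res: survivor for res in self.running_on}
        self.failback_available = True      # survivor shows "Ready for Failback"

    def set_owner(self, resource, head):
        """Change the configured owner; nothing moves until failback."""
        self.owners[resource] = head

    def failback(self):
        """Move every resource to its configured owner. One shot only."""
        if not self.failback_available:
            raise RuntimeError("Failback is grayed out: both heads are Active")
        self.running_on = dict(self.owners)
        self.failback_available = False     # both heads are Active again


cluster = ClusterModel(
    owners={"Rotation": "A", "Bob": "A"},
    running_on={"Rotation": "A", "Bob": "A"},
)
cluster.reboot("B")            # the clean reboot, not Takeover
cluster.set_owner("Bob", "B")  # get ownership right BEFORE failback
cluster.failback()
print(cluster.running_on)
```

Calling `cluster.failback()` a second time raises an error in this model, which is the whole point: once both heads are Active, the only way back to a clickable Failback button is another reboot.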


Now for a trick. There's nothing I can do to help you with the network resources. If they are on the wrong controller, you may have to reboot one and fix it and do a failback. However, if you have a storage pool on the wrong controller, I may be able to show you something cool.  The best thing to remember and do is this: Create the resource (network or pool) ON the controller you wish to be the owner in the first place!!! Then, it will already be owned by the proper one, and you don't have to do a failback at all. However, what if, for whatever reason, you need to move a pool to the other controller and you MUST NOT reboot a controller in order to move it using the Failback process? In other words, you have an Active-Active setup, the Failback button is grayed out, and it's very important that you change the ownership of a storage pool but you are not allowed to reboot one of the controllers?

Bummer, right? Not so fast, check this out. 

So here I have a system with two pools, Rotation and Bob, both on controller A. The Bob pool is supposed to be on controller B. They are both Active, so I can not click Failback. I would normally have to reboot head B to fix this. But I don't want to.

So I'm going to unconfigure the Bob pool here on controller A. That's right, unconfigure. This does NOT hurt your data. Your data is safe as long as you do NOT create a new pool in that space. We're not going to create a new pool. We're going to IMPORT the Bob pool on controller B. All of your shares, LUNs, and their properties will be perfectly fine. There is only one hiccup, which we will talk about.

Go to Configuration-->Storage, select the correct pool (Bob), and then click "Unconfig". 
But first, I want you to look carefully at the info below the pie chart here. Note that Bob currently has 2 Readzilla cache drives in it. This is important.

You will get this screen. Take a deep breath and hit apply.

No more Bob. Bob gone. Not really. It's still there and can be imported into another controller. This is how we safely move disk trays to new controllers, anyway. No big deal.

So, now go log into the OTHER controller. Don't do this on the same one or else you'll have to start all over again. 
Here we are on B. DO NOT click the Plus Sign!!!! That will destroy your data!!!!
Click the IMPORT button.

The Import button will go out and scan your disk trays for any valid ZFS pools not already listed. Here, it finds one called "bob". 

Select it and hit "Commit". There, Bob Pool is back. All of its shares and LUNs will be there too. The "Rotation" pool shows Exported because it's owned by the "A" controller, and the Bob Pool is owned here on B.

We can go to Configuration-->Cluster and see all is well and Bob Pool is indeed owned by the controller we wanted, and we never had to reboot!

However, we have one big problem.... Did you notice that when you imported the Bob Pool into controller B, the cache drives did NOT come over?
It now has zero cache drives. What did you expect? The cache drives are the Readzillas inside the controller itself. They can't move over just because you changed the owner.
No problem.
I have 2 extra Readzillas in my B controller not being used. So all I have to do is add them to the Bob Pool.
Go back to Configuration-->Storage on the B controller. Select the Bob pool and click "ADD". Do NOT click the plus sign. This is different.

I can now add any extra drives to the Bob pool. In this case, I don't have anything I could possibly add other than these two Readzillas inside controller B. So pretty easy.

Once added, I'm all good. I now have the Bob pool, with cache drives, being serviced on controller B with no reboot necessary.

That's it.

By the way, you know you can not remove drives from a pool, right? We can only add. This includes SSDs like Logzillas and Readzillas.
Well, I kind of just showed you a way you CAN remove readzillas from a pool, didn't I? Hmmmmmm.....
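To recap the whole move as logic, here is a small Python sketch of the unconfigure, import, and add sequence. Everything in it is hypothetical: the `Controller` class and the three helper functions are illustration only, not anything the appliance exposes, and the Rotation pool's cache count is a made-up placeholder. It also assumes the Readzillas go back to being unused spares on controller A once Bob is unconfigured there.

```python
class Controller:
    """Toy stand-in for one head; not a real appliance object."""

    def __init__(self, name, spare_cache=0):
        self.name = name
        self.spare_cache = spare_cache  # unused Readzillas inside this head
        self.pools = {}                 # pool name -> cache drives in the pool


def unconfigure(ctrl, pool, trays):
    """Drop the pool from this head. The data stays on the shared trays."""
    cache = ctrl.pools.pop(pool)
    ctrl.spare_cache += cache  # assumption: internal cache drives are freed
    trays.add(pool)            # pool is now importable by either head


def import_pool(ctrl, pool, trays):
    """Import an unconfigured pool found on the trays. Cache does NOT follow."""
    if pool not in trays:
        raise LookupError(f"no importable pool named {pool!r}")
    trays.remove(pool)
    ctrl.pools[pool] = 0


def add_cache(ctrl, pool, count):
    """Add this head's spare Readzillas to a pool it owns."""
    if count > ctrl.spare_cache:
        raise ValueError("not enough spare cache drives in this controller")
    ctrl.spare_cache -= count
    ctrl.pools[pool] += count


a = Controller("A")
a.pools = {"Rotation": 0, "Bob": 2}  # Rotation's count is a placeholder
b = Controller("B", spare_cache=2)
trays = set()                        # pools on the trays with no owner

unconfigure(a, "Bob", trays)         # on controller A
import_pool(b, "Bob", trays)         # on controller B: Import, NOT the plus sign
cache_after_import = b.pools["Bob"]  # zero, the hiccup from above
add_cache(b, "Bob", 2)               # the "ADD" button on B
print(b.pools, a.spare_cache)
```

Note that in this model controller A ends up with two spare cache drives it didn't have before, which is the "removal" hinted at above.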

Tuesday Feb 07, 2012

Tip- Setting up a new cluster

I haven’t given out a real tip for a while now, but this issue popped up on me last week, so I thought I would pass it along. I had a horrible time setting up a new 7320 cluster, for the sole reason that I screwed it up by not doing it in the right order. This caused my install, which should have been done in 1 hour, to take me over 3 hours to complete.

So let me tell you what I did wrong, and then I'll tell you the way I should have done it.

Out of the box, my client's two new 7320 controller heads were one software revision behind, at 2010.Q3.4.2, so I wanted to upgrade them to the newest version, 2011.Q1.1.1. So far, so good, right? Well, here was my mistake. I configured controller A via the serial interface, gave it IP addresses, went into the BUI, and did the upgrade to 2011.Q1.1.1. No problem. Now, I wanted to bring the other one up and do the same thing. However, I knew that controller B in a cluster must be in the initial, factory-reset state in order to be joined to a cluster. You can't configure it first, or if you do, you must factory-reset it before it can join a cluster. So I bring controller B up, but I don't configure it, and I go to controller A to start the cluster setup process. Big mistake. The process starts, but because the two controllers are on two different software versions, the cluster process cannot continue. This hoses me (that's southern California slang for "messes me up"), because now controller B has started the cluster setup process, and going to the serial connection just shows it hung up in a "configuring cluster" state. Rebooting it does not help, as it's still in the "configuring cluster" state once it comes back up.

So.... now I have 2 choices. I can downgrade controller A back to 2010.Q3.4.2, or I can factory-reset controller B, bring it up as a single controller, upgrade it to 2011.Q1.1.1, and then factory reset again, and then finally be able to add it to the cluster via controller A's cluster setup process. I opt for the second choice, as I do not want to downgrade controller A, which is working just fine. Remember, controller B is currently hosed, messed up, or wanked, depending on how you want to say it.
It's stuck. So to get it back to a state I can work with, I need to do the trick I talked about way back in this blog on May 31, 2011 (http://blogs.oracle.com/7000tips/entry/how_to_reset_passwords_on). I had to use the GRUB menu, use the -c trick on the kernel line, and reset the machine and erase all configuration on it. Now I could bring it up as a single controller, upgrade it, factory reset it, and then have it join the cluster. That all worked fine; it just took me two hours to do it all.

Here's what I should have done.

Bring up controller A, config it, and log into the BUI. Now bring up controller B. Do NOT config it in any way. Using controller A, set up clustering in the cluster menu.

Once the two controllers are clustered and all is well, NOW go ahead and upgrade controller A to the latest code. Once it reboots, go ahead and upgrade controller B. Everything's fine. You see, if the cluster has already been made, it's perfectly fine to upgrade one controller at a time. The software lets you do that. The software does NOT let you set up a NEW cluster if the controllers are not on the same software level.
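The rule boils down to a pre-flight check you can run in your head before starting cluster setup. Here is a minimal Python sketch of it. The `Head` record and the `cluster_setup_problems` function are hypothetical names for illustration, not part of the appliance software, and the version strings are just compared as plain text.

```python
from dataclasses import dataclass


@dataclass
class Head:
    """What you know about one controller before clustering (illustrative)."""
    name: str
    version: str      # software release string, e.g. "2010.Q3.4.2"
    configured: bool  # False means still in the factory-reset state


def cluster_setup_problems(primary, peer):
    """Return the reasons a NEW cluster can't be formed; empty list means go."""
    problems = []
    if not primary.configured:
        problems.append(f"{primary.name} must be configured first")
    if peer.configured:
        problems.append(f"{peer.name} must be factory-reset before joining")
    if primary.version != peer.version:
        problems.append(
            f"versions differ: {primary.version} vs {peer.version}"
        )
    return problems


# What I did: upgraded A first, then tried to cluster.
wrong = cluster_setup_problems(
    Head("A", "2011.Q1.1.1", configured=True),
    Head("B", "2010.Q3.4.2", configured=False),
)

# What I should have done: cluster first, on matching code, upgrade after.
right = cluster_setup_problems(
    Head("A", "2010.Q3.4.2", configured=True),
    Head("B", "2010.Q3.4.2", configured=False),
)
print(wrong, right)
```

The check only applies to forming a new cluster. Once the cluster exists, upgrading one controller at a time is fine, so there is no version test after that point.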

So that is the cluster setup safety tip of the day, kids. Have fun. 

About

This blog is a way for Steve to send out his tips, ideas, links, and general sarcasm, almost all related to the Oracle 7000, code named ZFSSA, or Amber Road, or Open Storage, or Unified Storage. You are welcome to contact Steve.Tunstall@Oracle.com with any comments or questions.
