Wednesday Oct 16, 2013

OS8- AK8- The bad news...

Ok I told you I would give you the bad news of AK8 to go along with all the cool new stuff, so here it is. It's not that bad, really, just things you need to be aware of.

First, the 2013.1 code is being called OS8, AK8 and 2013.1 by different people. I mean different people INSIDE Oracle!! It was supposed to be easy, but it never is. So for the rest of this blog entry, I'm calling it AK8.

AK8 is not compatible with the 7x10 series. Ever. The 7x10 series is not supported with AK8, and if you try to upgrade one, it will fail at the health check.

All 7x20 series, all of them regardless of age, are supported with AK8.

Drive trays. Let's talk about drive trays and SAS cards. The older drive trays for the 7x20 series were called the "Riverwalk 2" or "DS2" trays. They were technically the "J4410" series JBODs that Sun used to sell a la carte before we stopped selling JBODs. Don't get me started on that, it still makes me mad. We used these for many years, and you can still buy them right now until December 15th, 2013, when they will no longer be sold. The DS2 tray only came as a 4U, 24-drive shelf. It held 3.5" drives, and you had a choice of 2TB, 3TB, 300GB or 600GB drives.

The SAS HBA in the 7x20 series was called a "Thebe" card, with a part # of 7105394. The 7420, for example, came standard with two of these "Thebe" cards for connecting to the disk trays. Two Thebe cards could handle up to 12 trays, so one would add two more cards to go to 24 trays, or have up to six Thebe cards to handle 36 trays. This card was for external SAS only. It did not connect to the internal OS drives or the Readzillas, both of which used the internal SCSI controller of the server.

These Riverwalk 2 trays ARE supported with AK8. You can upgrade your older 7420 or 7320, no problem, as-is. The much older Riverwalk 1 trays or J4400 trays are NOT supported by AK8. However, they were only used by the 7x10 series, and we already said that the 7x10 series was not supported.
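If it helps to see the support rules so far in one place, here is a rough Python sketch. The function name and the lookup table are my own shorthand for what the post says, not anything that ships with the appliance, and models the post does not mention are deliberately left out.

```python
import re

# Tray support as described in the post. The keys are labels chosen for
# this sketch, not identifiers the appliance software uses.
AK8_TRAY_SUPPORT = {
    "J4400": False,   # Riverwalk 1, only ever used with the 7x10 series
    "DS2": True,      # Riverwalk 2 (J4410), the older 7x20 tray
    "DE2-24P": True,  # newer 2.5" performance tray
    "DE2-24C": True,  # newer 3.5" capacity tray
}


def ak8_supports_controller(model: str) -> bool:
    """Return True if the post says AK8 supports this controller model."""
    if re.fullmatch(r"7\d10", model):
        return False  # 7x10 series: the upgrade fails at the health check
    if re.fullmatch(r"7\d20", model):
        return True   # every 7x20, regardless of age
    # Anything else is outside what the post covers, so refuse to guess.
    raise ValueError(f"no support statement for model {model!r}")
```

For example, `ak8_supports_controller("7420")` returns `True` and `ak8_supports_controller("7410")` returns `False`.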

Here's where it gets tricky. Since last January, we have been selling the new style disk trays. We call them the "DE2-24P" and the "DE2-24C" trays. The "C" tray is for capacity drives, which are 3.5" 3TB or 4TB drives. The "P" trays are for performance drives, which are 2.5" 300GB and 900GB drives. These trays are NOT Riverwalk 2 trays, even though the "C" series may kind of look like it. Different manufacturer and different firmware. They are not new. Like I said, we've been selling them with the 7x20 series since last January. They are the only disk trays we will be selling going forward. Of course, AK8 supports them.

So what's the problem? The problem is going to be for people who have to mix drive trays.

Remember, your older 7x20 series has Thebe SAS2 HBAs. These have 2 SAS ports per card. The new ZS3-2 and ZS3-4 systems, however, have the new "Thebe2" SAS2 HBAs. These Thebe2 cards have 4 ports per card. This is very cool, as we can now do more SAS channels with fewer cards. Instead of needing 4 SAS cards to grow to 24 trays like we did with the old Thebe cards, I can now do 24 trays with only 2 Thebe2 cards. This means more IO slots for fun things like InfiniBand and 10G. So far, so good, right? These Thebe2 cards work with any disk tray. You can even mix older DS2 trays with the newer DE2 trays in the same system, as long as you have Thebe2 cards.
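The card math works out like this. The sketch below is only back-of-the-envelope arithmetic in Python: the trays-per-card figures are derived from the numbers in this post (2 Thebe cards for 12 trays, 2 Thebe2 cards for 24), and the "cards go in pairs" rule is my assumption, based on every example here using an even number of cards.

```python
import math

# Derived from the post: 12 trays / 2 Thebe cards, 24 trays / 2 Thebe2 cards.
TRAYS_PER_CARD = {"Thebe": 6, "Thebe2": 12}


def cards_needed(trays: int, card: str) -> int:
    """Estimate how many SAS HBAs of one type are needed for `trays` trays."""
    cards = math.ceil(trays / TRAYS_PER_CARD[card])
    # Assumption: HBAs are installed in pairs, so round odd counts up.
    return cards + (cards % 2)
```

So `cards_needed(24, "Thebe")` gives 4 while `cards_needed(24, "Thebe2")` gives 2, which is where the two freed-up IO slots come from.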

Ah, there's your problem. You don't have Thebe2 cards in your old 7420, do you? Well, I told you the bad news wasn't that bad, right? We can take out your Thebe cards and replace them with Thebe2. You can then plug your older DS2 trays right back in, and also now get newer DE2 trays going forward. However, it's important that the trays are on different SAS channels. You can mix them in the same system, but not on the same channel. Ask your local SC if you need help with the new cable layout. By the way, the new ZS3-2 and ZS3-4 systems also include a new IO card called the "Erie" card. These are for INTERNAL SAS to the OS drives and the Readzillas. So those are now SAS2 instead of SATA like the older models. Yes, the Erie card uses an IO slot, but that's OK, because the Thebe2 cards allow us to use fewer SAS HBAs to grow the system, right?
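The two mixing rules (DS2 and DE2 never on the same SAS channel, and mixing them in one system only with Thebe2 cards) are easy to sanity-check on paper before you recable. Here is a small Python sketch of that check. The channel names and the function are made up for illustration; your SC will still have the real cabling diagram.

```python
def check_tray_layout(hba: str, channels: dict[str, list[str]]) -> list[str]:
    """Return a list of problems with a proposed tray layout (empty if none).

    `channels` maps a hypothetical channel label to the trays cabled on it.
    """
    problems = []
    families_in_system = set()
    for channel, trays in channels.items():
        # "DE2-24P" and "DE2-24C" both reduce to "DE2"; "DS2" stays "DS2".
        families = {tray.split("-")[0] for tray in trays}
        families_in_system |= families
        if {"DS2", "DE2"} <= families:
            problems.append(f"{channel}: DS2 and DE2 trays share a SAS channel")
    if {"DS2", "DE2"} <= families_in_system and hba != "Thebe2":
        problems.append("mixing DS2 and DE2 trays needs Thebe2 HBAs")
    return problems
```

A layout like `{"chain0": ["DS2", "DS2"], "chain1": ["DE2-24P"]}` comes back clean with `"Thebe2"` and flags the HBA with `"Thebe"`.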

That's it. Not too much bad news and really not that bad. AK8 does not support the 7x10 series, and you may need new Thebe2 cards in your older systems if you want to add on newer DE2 trays. I think we can all agree that there are worse things out there. Like our Congress.  

Next up.... More good news and cool AK8 tricks. Such as virtual NICs.

Friday Oct 11, 2013

Do you want to upgrade to AK8 (2013.1) right now?

Ok, so you will hear some great stuff about AK8, but are you going to upgrade your production system to a new major release right after it comes out? Probably not. If you have a test system or a lab system you can play with, then I highly recommend upgrading it so you can start to see the new performance features that AK8 can give you. If you only have one system, or they're all in production, then of course you're going to wait for the first minor release of the new code, aren't you? I would too. I'm told the first minor is coming out in just a few weeks. It is the release they used for the public benchmark performance testing. So you can feel more confident in that release. You may also be able to talk to your local sales team about getting a demo unit. Then, you can play with the new code in a safe lab area before upgrading your production system.

Next up... The negative aspects of upgrading to AK8. It's not too bad, but you will need to know which older systems can't do it, how to work with older disk trays, and whether or not you can replicate newer systems with older systems. 

Hey, I told you I wasn't just going to blow sunshine on you all the time, right? I can spit out the kool-aid as well as drink it!  :)

Thursday Oct 10, 2013

Upgrading to OS8 - AK8- 2013.1

The upgrade to OS8, AK8 or whatever we are calling it this week was pretty straightforward. It will take some extra time, as it has to perform some one-time jobs the first time it reboots, but it wasn't more than 15 minutes. Your mileage may vary, it's possible on larger systems that it takes longer. There is also a deferred update I will show you down below that you can choose to do right away or later. Once you do that deferred update, you do NOT want to roll back to the previous version, so be warned. 

It's been over 1.5 years since the last major update, so many of you probably have never done one before. The process is just like a minor update, it just takes longer. 

Get the update from MOS and unzip it to a folder. Go ahead and upload it and unpack it like normal from your Maintenance-->System screen. I did like how it tried to tell me how much time was left, but the numbers were all over the place, and it was over by the time it was correct.

Now, when you click the arrow to apply the update, the normal health check window appears, but you will notice something extra. That's the 'Deferred Update' choice. You can make it apply as soon as it reboots, or you can manually apply it later. Remember, you do NOT want to roll back after this is applied. I chose "Upon Request", clicked the "Check" button, and once all was well, clicked "Apply".

After it installs and reboots, you can look at the command line via serial port or SSH. You will notice a few things are different during this boot-up.

Right after the "Updating ####" section you can see it actually upgrading various services and the SMF repository. This can take around 3 minutes, but if you have a lot of aggregations or IPMP then it could take longer. So relax. You can see mine, below, which went 290 seconds, and then continued upgrading other stuff.

 The upgrade continues, and the screen is pretty obvious.

 When you see it configuring network devices, you're almost done. You can see the new code level, and it's about to go to the login prompt. At that point, you should be able to log back into the BUI.

 Log back into the BUI, and you will see the new version is the current version in Maintenance-->System

Now, let's do the deferred update on the same screen.

You can read about the deferred updates here, and click apply when ready to add them. In this case, it's for the ability to associate multiple initiator groups with a LUN, something we have wanted for some time now, so very cool. Note that ANY other deferred updates you have not applied yet will also apply, as there is no way to pick and choose. Either they all apply or none do. Remember I said not to roll back to a previous version of the code after you do this? It will let you, but if you do, your LUN operations will fail. No bueno. Don't do it. The deferred upgrades are one-way.

Note that the deferred update does NOT force a reboot. 

Once you apply the deferred updates, the whole deferred update area goes away, and the screen now looks like this. 

Do you want to see something cool right away now in OS8 that you could not do before? There's a lot I will talk about later, but for now, since you're so excited, go to Configuration-->Alerts, and create a new Threshold Alert. Notice the new Capacity threshold alerts, where you can now get emails or create an action when a pool, a project, or a share goes over, say, 80% full. Sweet.
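The idea behind a capacity threshold is nothing fancy. Here is the comparison as a few lines of Python, purely to show the logic; this is not how the appliance implements it, and the function and parameter names are mine.

```python
def over_capacity_threshold(used_bytes: int, size_bytes: int,
                            threshold_pct: float = 80.0) -> bool:
    """Return True when usage is strictly above the threshold percentage."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return 100.0 * used_bytes / size_bytes > threshold_pct
```

With the default of 80, a share at exactly 80% does not trip the alert, and one at 81% does.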

Monday Dec 03, 2012

My error with upgrading 4.0 to 4.2- What NOT to do...

Last week, I was helping a client upgrade from the 2011.1.4.0 code to the newest 2011.1.4.2 code. We downloaded the 4.2 update from MOS, uploaded and unpacked it on both controllers, and upgraded one of the controllers in the cluster with no issues at all. As this was a brand-new system with no networking or pools made on it yet, there were not any resources to fail back and forth between the controllers. Each controller had its own, private, management interface (igb0 and igb1) and that's it. So we took controller 1 as the passive controller and upgraded it first. The first controller came back up with no issues and was now on the 4.2 code. Great. We then did a takeover on controller 1, making it the active head (although there were no resources for it to take), and then proceeded to upgrade controller 2.

Upon upgrading the second controller, we ran the health check with no issues. We then ran the update and it ran and rebooted normally. However, something strange then happened. It took longer than normal to come back up, and when it did, we got the "cluster controllers on different code" error message that one gets when the two controllers of a cluster are running different code. But we just upgraded the second controller to 4.2, so they should have been the same, right???

Going into the Maintenance-->System screen of controller 2, we saw something very strange. The "current version" was still on 4.0, and the 4.2 code was there but was in the "previous" state with the rollback icon, as if it was the OLDER code and not the newer code. I have never seen this happen before. I would have thought it was a bad 4.2 code file, but it worked just fine with controller 1, so I don't think that was it. Other than the fact the code did not update, there was nothing else going on with this system. It had no yellow lights, no errors in the Problems section, and no errors in any of the logs. It was just out of the box a few hours ago, and didn't even have a storage pool yet.

So.... We deleted the 4.2 code, uploaded it from scratch, ran the health check, and ran the upgrade again. Once again, it seemed to go great, rebooted, and came back up to the same issue, where it came to 4.0 instead of 4.2. See the picture below.... HERE IS WHERE I MADE A BIG MISTAKE....

I SHOULD have instantly called support and opened a Sev 2 ticket. They could have done a shared shell and gotten the correct Fishworks engineer to look at the files and the code and determine what file was messed up and fixed it. The system was up and working just fine, it was just on an older code version, not really a huge problem at all.

Instead, I went ahead and clicked the "Rollback" icon, thinking that the system would roll back to the 4.2 code. Ouch... What happened was that the system said, "Fine, I will delete the 4.0 code and boot to your 4.2 code"... Which was stupid on my part because something was wrong with the 4.2 code file here and the 4.0 was just fine.

So now the system could not boot at all, and the 4.0 code was completely missing from the system, and even a high-level Fishworks engineer could not help us. I had messed it up good. We could only get to the ILOM, and I had to re-image the system from scratch using a hard-to-get-and-use FishStick USB drive. These are tightly controlled and difficult to get, almost always handcuffed to an engineer who will drive out to re-image a system. This took another day of my client's time. 

So.... If you see a "previous version" of your system code which is actually a version higher than the current version... DO NOT ROLL IT BACK.... It did not upgrade for a very good reason.
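If you want that rule as a quick sanity check, here it is in Python. It assumes the version strings are plain dotted numbers like 2011.1.4.0 and 2011.1.4.2, as in this post; the function names are mine, and this is a sketch of the comparison, not anything the BUI does for you.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '2011.1.4.2' into (2011, 1, 4, 2)."""
    return tuple(int(part) for part in version.split("."))


def rollback_looks_normal(current: str, previous: str) -> bool:
    """Return False when the 'previous' image is newer than the running one.

    That is the red-flag situation from this post: stop and call support.
    """
    return parse_version(previous) < parse_version(current)
```

In my case, `rollback_looks_normal("2011.1.4.0", "2011.1.4.2")` would have returned `False`, which is the cue to leave the Rollback icon alone.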

In my case, after the system was re-imaged to a code level just 3 back, we once again tried the same 4.2 code update and it worked perfectly the first time and is now great and stable.  Lesson learned. 

By the way, our buddy Ryan Matthews wanted to point out the best practice and supported way of performing an upgrade of an active/active ZFSSA, where both controllers are doing some of the work. These steps would not have helped me for the above issue, but it's important to follow the correct procedure when doing an upgrade.


1) Upload software to both controllers and wait for it to unpack
2) On controller "A" navigate to configuration/cluster and click "takeover"
3) Wait for controller "B" to finish restarting, then log in to it, navigate to maintenance/system, and roll forward to the new software.
4) Wait for controller "B" to apply the update and finish rebooting
5) Log in to controller "B", navigate to configuration/cluster and click "takeover"
6) Wait for controller "A" to finish restarting, then log in to it, navigate to maintenance/system, and roll forward to the new software.
7) Wait for controller "A" to apply the update and finish rebooting
8) Log in to controller "B", navigate to configuration/cluster and click "failback"
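
The eight steps above have a nice symmetry: the same takeover-then-upgrade pattern runs twice with the roles swapped, followed by one failback. Here is that pattern as a small Python sketch that just builds the checklist. It does not talk to an appliance, and the action strings are paraphrases of Ryan's steps.

```python
def rolling_upgrade_plan(first: str = "A", second: str = "B") -> list[tuple[str, str]]:
    """Build the (controller, action) checklist for a two-node rolling upgrade."""
    plan = [("both", "upload software and wait for it to unpack")]
    # Round 1: `first` takes over while `second` upgrades; round 2 swaps roles.
    for active, passive in ((first, second), (second, first)):
        plan.append((active, "configuration/cluster: takeover"))
        plan.append((passive, "after restart, maintenance/system: roll forward"))
        plan.append((passive, "wait for the update to apply and the reboot to finish"))
    plan.append((second, "configuration/cluster: failback"))
    return plan
```

Calling `rolling_upgrade_plan()` gives back eight entries in the same order as the list above, which makes a handy printout for a maintenance window.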

Tuesday Feb 07, 2012

Tip- Setting up a new cluster

I haven’t given out a real tip for a while now, but this issue popped up on me last week, so I thought I would pass it along. I had a horrible time setting up a new 7320 cluster, for the sole reason that I screwed it up by not doing it in the right order. This caused my install, which should have been done in 1 hour, to take me over 3 hours to complete.

So let me tell you what I did wrong, and then I'll tell you the way I should have done it.

Out of the box, my client's two new 7320 controller heads were one software revision behind, at 2010.Q3.4.2, so I wanted to upgrade them to the newest version of 2011.Q1.1.1. So far, so good, right? Well here was my mistake. I configured controller A via the serial interface, gave it IP numbers, went into the BUI, and did the upgrade to 2011.Q1.1.1. No problem. Now, I wanted to bring the other one up and do the same thing. However, I knew that controller B in a cluster must be in the initial, factory-reset state in order to be joined to a cluster. You can't configure it first, or if you do, you must factory-reset it in order to join a cluster. So I bring controller B up, but I don't configure it, and I go to controller A to start the cluster setup process. Big mistake. The process starts, but because the two controllers are on two different software versions, the cluster process cannot continue. This hoses me (that's southern California slang for "messes me up"), because now controller B has started the cluster setup process, and going to the serial connection just has it hung up in a "configuring cluster" state. Rebooting it does not help, as it's still in the "configuring cluster" state once it comes back up.

So.... now I have 2 choices. I can downgrade controller A back to 2010.Q3.4.2, or I can factory-reset controller B, bring it up as a single controller, upgrade it to 2011.Q1.1.1, and then factory reset again, and then finally be able to add it to the cluster via controller A's cluster setup process. I opt for the second choice, as I do not want to downgrade controller A, which is working just fine. Remember, controller B is currently hosed, messed up, or wanked, depending on how you want to say it. It's stuck.

So to get it back to a state I can work with, I need to do the trick I talked about way back in this blog on May 31, 2011 (http://blogs.oracle.com/7000tips/entry/how_to_reset_passwords_on). I had to use the GRUB menu, use the -c trick on the kernel line, and reset the machine and erase all configuration on it. Now I could bring it up as a single controller, upgrade it, factory reset it, and then have it join the cluster. That all worked fine, it just took me two hours to do it all.

Here's what I should have done.

Bring up controller A, configure it and log into the BUI. Now bring up controller B. Do NOT configure it in any way. Using controller A, set up clustering in the cluster menu.

Once the two controllers are clustered and all is well, NOW go ahead and upgrade controller A to the latest code. Once it reboots, go ahead and upgrade controller B. Everything's fine. You see, if the cluster has already been made, it's perfectly fine to upgrade one controller at a time. The software lets you do that. The software does NOT let you set up a NEW cluster if the controllers are not on the same software level.
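
Put as a pre-flight check, the two conditions for starting a NEW cluster look like this. This is a Python sketch of the rule as I ran into it, with names I made up; the appliance does its own checking, just later than I would have liked.

```python
def cluster_setup_blockers(version_a: str, version_b: str,
                           b_is_unconfigured: bool) -> list[str]:
    """List the reasons a new cluster setup should not be started yet."""
    blockers = []
    if version_a != version_b:
        blockers.append(f"software mismatch: A is {version_a}, B is {version_b}")
    if not b_is_unconfigured:
        blockers.append("controller B must be factory-reset before it can join")
    return blockers
```

My mistake corresponds to `cluster_setup_blockers("2011.Q1.1.1", "2010.Q3.4.2", True)`, which returns one blocker for the version mismatch. An empty list means go ahead.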

So that is the cluster setup safety tip of the day, kids. Have fun. 

Wednesday Jun 08, 2011

Upgrade to Q3.3.1 notes -

Ok, so there is a good reason why you folks want to upgrade. These upgrades fix some great bugs that other clients may have found, and you have just been lucky not to have had them affect you yet. Another reason is that at some point when you DO want to upgrade, you may be too far behind to upgrade directly to the newest version. Check out this screenshot. In trying to upgrade to Q3.3.1, the update informs me that I won't be able to do this until I upgrade to Q3.2.0.

Just something to be aware of, so you can plan for additional time if you need to upgrade twice during your maintenance window.
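
To plan the window, it can help to write the hops down ahead of time. The Python sketch below does that with a tiny prerequisite table. The only entry is the one from my screenshot (Q3.3.1 wants Q3.2.0 first); treating it as a minimum version is my assumption, and the version parsing only handles the short "Q3.x.y" form within the same release year.

```python
# target release -> release you must already be on (assumed to be a minimum)
MINIMUM_BEFORE = {"Q3.3.1": "Q3.2.0"}


def version_key(version: str) -> tuple[int, ...]:
    """Turn 'Q3.2.0' into (3, 2, 0) so releases can be compared."""
    return tuple(int(part.lstrip("Q")) for part in version.split("."))


def upgrade_hops(current: str, target: str) -> list[str]:
    """Return the releases to install, in order, to get from current to target."""
    hops = []
    required = MINIMUM_BEFORE.get(target)
    if required and version_key(current) < version_key(required):
        # Too far behind: get to the prerequisite release first.
        hops.extend(upgrade_hops(current, required))
    hops.append(target)
    return hops
```

From a hypothetical Q3.1.0 system, `upgrade_hops("Q3.1.0", "Q3.3.1")` returns two hops, Q3.2.0 then Q3.3.1, so budget time for two upgrades and two reboots.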

 

 

About

This blog is a way for Steve to send out his tips, ideas, links, and general sarcasm. Almost all of it is related to the Oracle 7000, code named ZFSSA, or Amber Road, or Open Storage, or Unified Storage. You are welcome to contact Steve.Tunstall@Oracle.com with any comments or questions.
