By Steve Tunstall on Dec 03, 2012
Last week, I was helping a client upgrade from the 2011.1.4.0 code to the newest 2011.1.4.2 code. We downloaded the 4.2 update from MOS, upload and unpacked it on both controllers, and upgraded one of the controllers in the cluster with no issues at all. As this was a brand-new system with no networking or pools made on it yet, there were not any resources to fail back and forth between the controllers. Each controller had it's own, private, management interface (igb0 and igb1) and that's it. So we took controller 1 as the passive controller and upgraded it first. The first controller came back up with no issues and was now on the 4.2 code. Great. We then did a takeover on controller 1, making it the active head (although there were no resources for it to take), and then proceeded to upgrade controller 2.
Upon upgrading the second controller, we ran the health check with no issues. We then ran the update and it ran and rebooted normally. However, something strange then happened. It took longer than normal to come back up, and when it did, we got the "cluster controllers on different code" error message that one gets when the two controllers of a cluster are running different code. But we just upgraded the second controller to 4.2, so they should have been the same, right???
Going into the Maintenance-->System screen of controller 2, we saw something very strange. The "current version" was still on 4.0, and the 4.2 code was there but was in the "previous" state with the rollback icon, as if it was the OLDER code and not the newer code. I have never seen this happen before. I would have thought it was a bad 4.2 code file, but it worked just fine with controller 1, so I don't think that was it. Other than the fact the code did not update, there was nothing else going on with this system. It had no yellow lights, no errors in the Problems section, and no errors in any of the logs. It was just out of the box a few hours ago, and didn't even have a storage pool yet.
So.... We deleted the 4.2 code, uploaded it from scratch, ran the health check, and ran the upgrade again. once again, it seemed to go great, rebooted, and came back up to the same issue, where it came to 4.0 instead of 4.2. See the picture below.... HERE IS WHERE I MADE A BIG MISTAKE....
I SHOULD have instantly called support and opened a Sev 2 ticket. They could have done a shared shell and gotten the correct Fishwork engineer to look at the files and the code and determine what file was messed up and fixed it. The system was up and working just fine, it was just on an older code version, not really a huge problem at all.
Instead, I went ahead and clicked the "Rollback" icon, thinking that the system would rollback to the 4.2 code. Ouch... What happened was that the system said, "Fine, I will delete the 4.0 code and boot to your 4.2 code"... Which was stupid on my part because something was wrong with the 4.2 code file here and the 4.0 was just fine.
So now the system could not boot at all, and the 4.0 code was completely missing from the system, and even a high-level Fishworks engineer could not help us. I had messed it up good. We could only get to the ILOM, and I had to re-image the system from scratch using a hard-to-get-and-use FishStick USB drive. These are tightly controlled and difficult to get, almost always handcuffed to an engineer who will drive out to re-image a system. This took another day of my client's time.
So.... If you see a "previous version" of your system code which is actually a version higher than the current version... DO NOT ROLL IT BACK.... It did not upgrade for a very good reason.
In my case, after the system was re-imaged to a code level just 3 back, we once again tried the same 4.2 code update and it worked perfectly the first time and is now great and stable. Lesson learned.
By the way, our buddy Ryan Matthews wanted to point out the best practice and supported way of performing an upgrade of an active/active ZFSSA, where both controllers are doing some of the work. These steps would not have helpped me for the above issue, but it's important to follow the correct proceedure when doing an upgrade.
1) Upload software to both controllers and wait for it to unpack
2) On controller "A" navigate to configuration/cluster and click "takeover"
3) Wait for controller "B" to finish restarting, then login to it, navigate to maintenance/system, and roll forward to the new software.
4) Wait for controller "B" to apply the update and finish rebooting
5) Login to controller "B", navigate to configuration/cluster and click "takeover"
6) Wait for controller "A" to finish restarting, then login to it, navigate to maintenance/system, and roll forward to the new software.
7) Wait for controller "A" to apply the update and finish rebooting
8) Login to controller "B", navigate to configuration/cluster and click "failback"