Monday Dec 10, 2007

Flashupdating the stand-by XSCFU

I got a really good comment on my entry "setupplatform and other new XSCF features" from Paul Liong, asking:
    Our 'xscf" is currently running at XCP1050. We hope to get it upgraded to XCP1060 according to the Chapter 8 of XSCF User’s Guide. However, it is noticed that there is no permission to run the 'flashupdate' command on the Redundant XSCF Unit. So, how can we perform the firmware upgrade on the XSCF Unit on the standby side first and then on the active side?
An excellent question indeed! I checked the documentation (User Guide, Administrator Guide, man pages) and don't see it described in any detail there. So let me take a stab at it.

First, some background. The Sun SPARC-Enterprise M8000 and M9000 support two service processors, XSCFU#0 and XSCFU#1. The two work in a dual-redundant fashion. One unit is always the "active" unit, and can fully monitor and control the platform. The second unit, if present, is the "stand-by" unit, and has very limited functionality; mostly, the stand-by XSCFU is a slave to the active unit, receiving database updates so that it's current and ready to take over if the active XSCFU fails, is physically removed, or the user runs switchscf.

Back to Paul's question. It is true that you cannot run flashupdate on the stand-by XSCFU. Instead, you run flashupdate on the active XSCFU. This causes the XSCFU to check the flash image, install it, and reboot. Upon reboot, if all goes well, the active XSCFU then communicates with the stand-by, tells its partner that a new version of firmware has been installed, and copies the firmware image to the stand-by XSCFU. At this point, the stand-by XSCFU installs the firmware image and reboots. If the upgrade is successful, the stand-by XSCFU will request to become the active XSCFU in order to finish the upgrade process. When you're done, both XSCFUs will be running the same version of firmware.

One side effect of this process is that the active XSCFU will switch. In other words, if XSCFU#0 was active and XSCFU#1 was stand-by when you started, then XSCFU#1 will be active and XSCFU#0 will be stand-by when you're done. We had a heated debate about this during development. Someone filed a high-priority bug that the transition was unexpected and should be considered a bug. On the other hand, switching the active XSCFU back to the original active unit would require a second transition; that second transition would add another couple of minutes to the upgrade process (minimizing firmware upgrade times was an important requirement for the SPARC-Enterprise service processor, so an extra two minutes is a lot of time). In the end we decided that it doesn't matter which unit is active, since they are dual redundant, so we should adopt the approach that allowed the firmware upgrade to finish as quickly as possible. If there are customers who strongly feel that, for example, XSCFU#0 should always be the active unit, then they can use switchscf when the firmware upgrade is complete.

So I'm sure some people are out there now wondering what happens if the stand-by XSCFU is absent when you upgrade the active XSCFU. Well, the active XSCFU will hold on to the firmware image. When the active XSCFU sees the stand-by inserted, the user can run flashupdate -c sync to update the stand-by XSCFU from the active unit. The same command can be used when you replace the stand-by XSCFU with a new unit.
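To make the sequence concrete, here is a toy Python model of the behavior described above. It is only a sketch: the class and method names are hypothetical, it is not how the firmware is implemented, and it assumes that a later sync simply copies the active unit's version to the stand-by without switching roles (the description above does not say either way).

```python
from dataclasses import dataclass


@dataclass
class Xscfu:
    """One service processor unit (hypothetical model, not firmware code)."""
    name: str
    version: str
    present: bool = True


class Platform:
    """Tracks which unit is active and which is stand-by."""

    def __init__(self, active: Xscfu, standby: Xscfu):
        self.active = active
        self.standby = standby

    def flashupdate(self, new_version: str) -> None:
        """Models running flashupdate on the active unit."""
        # The active unit checks the image, installs it, and reboots.
        self.active.version = new_version
        if self.standby.present:
            # The image is copied to the stand-by, which installs it and reboots.
            self.standby.version = new_version
            # The upgraded stand-by requests the active role to finish up.
            self.active, self.standby = self.standby, self.active
        # If the stand-by is absent, the active unit just keeps the image.

    def sync(self) -> None:
        """Models 'flashupdate -c sync' after a stand-by unit is inserted.

        Assumption: no role switch happens here.
        """
        if self.standby.present:
            self.standby.version = self.active.version


platform = Platform(Xscfu("XSCFU#0", "XCP1050"), Xscfu("XSCFU#1", "XCP1050"))
platform.flashupdate("XCP1060")
print(platform.active.name, platform.standby.name)  # XSCFU#1 XSCFU#0
```

The swap inside flashupdate is the single transition discussed above; switching back would be a second, separate step.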

Thursday Nov 15, 2007

Behind the panel

I got a question from the field... We know the Sun SPARC Enterprise M-class servers have a serial EEPROM in the panel, but what's stored in it?

A good question, and after digging through various documents to see what Sun says about the panel, I see exactly why the question is being asked. The only reference I could find to the panel was in the Sun SPARC Enterprise Server Family Architecture white paper, which says:

    Operator Panel
    Mid-range and high-end models of Sun SPARC Enterprise servers feature an operator panel to display server status, store server identification and user setting information, change between operator and maintenance modes, and turn on power supplies for all domains. [Emphasis added.]
But what "server identification and user setting information" is stored in the panel, and what data isn't stored there? Luckily, I happen to know.

The panel contains a small SEEPROM which can be accessed from the service processor (the XSCFU). OK, maybe not so small. I forget the exact size, but it's much larger than the 256-byte SEEPROM used for some FRU identification, and much, much smaller than a 73GB SAS disk. Let's split the difference logarithmically and assume it's in the range of a few dozen kilobytes.

In an ideal world, the panel SEEPROM would contain all of the non-volatile data stored on the XSCFU, so if the XSCFU fails and a new unit is installed, it can fully recover its state from the panel SEEPROM. Sadly, due to space limitations, that's simply not feasible. The XSCFU reserves in excess of 10 MB for error and fault logs alone, which could never fit in the SEEPROM.

Instead, the critical configuration data is stored in the panel SEEPROM. This includes hardware configuration (how XSBs are assigned to domains, network setup, etc.), software settings (whether ssh is enabled, the email address for email notifications, etc.) and locally created user accounts and privileges. Pretty much the result of all 'set*' commands ends up in the panel SEEPROM.

The things that are not stored in the panel SEEPROM include error and fault log files, and FRU information.

If the XSCFU fails and is replaced, the new XSCFU on power-on will recognize that it's installed in a new chassis. It will then go out and read the panel SEEPROM, and rebuild its configuration data from the panel. It can then read the FRU ID SEEPROM from each FRU to rebuild its FRU inventory information. So the only real data that is lost is log files.
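The recovery steps above can be sketched in a few lines of Python. This is an illustrative model only: the type, the function, and the sample keys are all hypothetical, and the real data layout in the SEEPROMs is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class XscfuState:
    """Non-volatile state held by a service processor (hypothetical model)."""
    config: dict = field(default_factory=dict)         # results of set* commands
    fru_inventory: dict = field(default_factory=dict)  # per-FRU identification
    logs: list = field(default_factory=list)           # error and fault logs


def rebuild_on_replacement(panel_seeprom: dict, fru_seeproms: dict) -> XscfuState:
    """Rebuild a replacement unit's state from what survives in the chassis."""
    state = XscfuState()
    state.config = dict(panel_seeprom)         # configuration comes from the panel
    state.fru_inventory = dict(fru_seeproms)   # re-read each FRU's ID SEEPROM
    # state.logs stays empty: the logs existed only on the failed unit.
    return state


# Sample data with made-up keys and values, for illustration only.
panel = {"ssh": "enabled", "hostname": "example-xscf"}
frus = {"CMU#0": "fru-id-0", "CMU#1": "fru-id-1"}
new_unit = rebuild_on_replacement(panel, frus)
print(new_unit.config["ssh"], len(new_unit.fru_inventory), new_unit.logs)
```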

Monday Nov 05, 2007

New Features in XCP1050

As I mentioned in my 01-Nov-2007 posting, the Sun SPARC Enterprise M-class server service processor firmware version XCP1050 and beyond have several new features that I wanted to blog about.

Servicetags

Servicetags is part of the Sun Connection infrastructure. The basic idea is to enable customers to better track their Sun assets, and by communicating with Sun, determine what updates are available, what needs patching, etc.

Servicetags were introduced in Solaris 10. It's essentially a piece of software that runs on a server and can communicate the list of software products installed on the server, the product versions, patch levels, and so forth. Customers can then run a Java application on their workstation to discover Sun products throughout their datacenter, and at their discretion, send that list to Sun to register the products and/or check for updates.

In XCP1050, the servicetags software now also runs on the service processor. This allows customers to discover the hardware assets in their datacenter, including the machine type, part number, and serial number.

On new machines, servicetags are enabled by default; if you upgrade from XCP1041 or earlier, you'll need to enable servicetags manually. The commands to manage servicetags on the service processor are setservicetag and showservicetag. The usage is very straightforward, for example:

    XSCF> setservicetag -c disable
    XSCF> showservicetag
    Disabled
You can download the discovery application here.

Browser User Interface

Anyone who used the Browser User Interface (called BUI, or Web User Interface) in XCP1041 or earlier probably found there were many tasks that could not be accomplished through the BUI, but required you to use the command line interface. In XCP1050, that all changed. Now, just about everything you could do through the command line can be done through your web browser. A lot of hard work went into these BUI updates, and I think it really shows.

Fault LEDs and clearfault

It might be surprising, but the most difficult aspects of collaborating with another company on a new product were things that seem the most trivial: bezel color, whether buttons in the Browser UI should have square or rounded corners, and when and how should LEDs blink. Of all these, I think LEDs were the most contentious.

Sun adheres to the ANSI/VITA 40-2003 Service Indicator Standard (SIS) for most of its products. Fujitsu, however, adheres to a different standard. The differences between the two standards are minor, but when a customer is managing a large number of systems, any variation in indicator standards can be a source of confusion. In XCP1040, we shipped with a compromise, meaning both Sun and Fujitsu were unhappy with the solution.

For XCP1050, we reworked the fault indicator policies for both companies. In fact, we added the ability of the firmware to tell if the server was Sun-branded or Fujitsu-branded, and based on the branding, it adhered to the respective company's fault LED standards. Now, on a Sun branded system for example, the fault LEDs adhere to a simple policy:

  • If a FRU's (field replaceable unit) fault LED is on, then there is a fault in the chassis and it has been isolated to that specific FRU with very high confidence. In other words, if the fault LED is on, then we know for a fact the FRU is broken.
  • If the chassis fault LED (on the front panel) is on, then there is a fault in the chassis somewhere.
Note that it's possible that the chassis fault LED is on, but no FRU LEDs are on; that can happen if the server cannot isolate the fault to a single FRU. Commands such as showstatus and fmadm faulty will identify the list of suspected FRUs.
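As a rough illustration of the Sun-branded policy, the following Python sketch derives the LED states from a list of faults. It is a simplified model, not the firmware's logic: the function name is hypothetical, and "isolated with very high confidence" is approximated as "the fault has exactly one suspect FRU".

```python
def fault_leds(faults, all_frus):
    """Return (chassis_led, fru_leds) for a list of faults.

    faults:   list of sets; each set holds the suspect FRUs for one fault.
    all_frus: names of every FRU in the chassis.
    """
    # The chassis LED is on whenever any fault exists in the chassis.
    chassis_led = len(faults) > 0
    fru_leds = {fru: False for fru in all_frus}
    for suspects in faults:
        # Assumption: a single suspect means the fault was isolated to it.
        if len(suspects) == 1:
            (fru,) = suspects
            fru_leds[fru] = True
    return chassis_led, fru_leds


frus = ["CMU#0", "CMU#1", "CMU#2"]
# One fault isolated to CMU#0, one that could not be narrowed down.
chassis, leds = fault_leds([{"CMU#0"}, {"CMU#1", "CMU#2"}], frus)
print(chassis, leds)
```

With only the second fault present, the chassis LED would be on while every FRU LED stays off, which is the case described in the note above.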

Furthermore, on Sun branded systems, cycling chassis power can no longer be used to clear the fault LEDs for FRUs or the chassis. FRUs don't magically become "better" just because you cycled power; if the FRU was faulty, it's still faulty, so the fault LED shouldn't magically turn off. For Sun branded systems, the fault LEDs will remain on until the customer or service engineer actively clears the fault condition, by removing/replacing the faulty FRU, or by running the clearfault command.

In XCP1040, clearfault could be used to mark a FRU as not faulty; however, almost all FRUs still required a chassis power cycle. This was so that the chassis could perform a power-on self test of the FRU before reconfiguring it into a running server. The last thing you want is someone to manually type clearfault /CMU#0 and then discover that CMU#0 really was faulty and bring down the server.

XCP1050 was enhanced so that clearfault could selectively initiate self test on many FRUs, without requiring a chassis power cycle. When you run clearfault now, it will check to see if it is possible to run self test without disturbing the running system. Some FRUs cannot be tested during operation; other FRUs may be in use in such a way that self test cannot be performed. If the FRU is safe to be tested, clearfault will initiate the self test, and if successful, the fault condition will be cleared. If clearfault cannot test the FRU, it will be marked to be cleared at the next chassis power cycle.
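The decision clearfault makes can be summarized with a small Python sketch. The names are hypothetical and this is not the actual implementation; in particular, what happens when the self test fails is not spelled out above, so the sketch assumes the fault simply remains.

```python
from enum import Enum


class Outcome(Enum):
    CLEARED = "fault cleared"
    STILL_FAULTY = "self test failed, fault remains"  # assumed behavior
    DEFERRED = "marked to be cleared at next chassis power cycle"


def clearfault(testable_while_running, busy, run_self_test):
    """Model the XCP1050 clearfault decision for one FRU.

    testable_while_running: False for FRUs that can never be tested live.
    busy: True if the FRU is in use in a way that prevents a self test.
    run_self_test: callable returning True when the FRU passes.
    """
    if not testable_while_running or busy:
        # Cannot test without disturbing the running system.
        return Outcome.DEFERRED
    return Outcome.CLEARED if run_self_test() else Outcome.STILL_FAULTY


print(clearfault(True, False, lambda: True).value)
print(clearfault(False, False, lambda: True).value)
```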

A great deal of effort went into improving the fault detection, isolation, reporting, and service interface, to make it more accurate and more consistent with other Sun products.

Thursday Nov 01, 2007

setupplatform and other new XSCF features

I see my last posting on this blog was back in July. After a busy August, I started a new position within Sun working on the x64 server family. But I see that XCP1060 firmware for the M-class service processor has been released, so let me clue you in on a few new features that made it into XCP1050 or XCP1060.

Some of the key new features in XCP1050 and XCP1060 are:

  • setupplatform
  • servicetags
  • Signed XCP releases
  • Web UI (aka Browser UI, or BUI)
  • Fault LEDs and clearing faults

setupplatform

My personal favorite new feature is the setupplatform command. As a development engineer, I've set up literally dozens if not hundreds of SPARC Enterprise M-class servers, and it's not easy -- you have to create users, set up the network, the hostname, enable optional services, and so forth. The XSCF User Guide has an entire chapter on it. And every time I had to set up a platform, I had to refer back to the user guide.

So a couple of us decided to "wizard-ize" the setup process. Other Sun service processors had a command called 'setupplatform', so we started there. We figured the first thing someone does when they turn on a new machine is log into the service processor and create a user account with useradm (the ability to create more user accounts) and platadm (i.e., platform administrator, to set up the rest of the platform) privileges. So the first thing setupplatform does is prompt for a user name and password for a new account with platadm and useradm privileges.

The next thing you normally do is set up the network -- service processor host name, IP address, netmask, default gateway, domain name server, and so forth. On service processors with multiple network interfaces, you need to set up each interface. The setupplatform command leads you through the process. Finally, you need to enable optional services, like the Web UI (aka, Browser UI or BUI), ssh, ntp, and whether you want email notification of important events. That was provided in XCP1050, and in XCP1060 we added support for setting the datacenter altitude, and selecting the local timezone. When the setupplatform command is done and you've answered all the prompts, you're pretty much ready to configure your domain and boot Solaris.

If you find you've made a mistake, you don't have to re-do everything. For example, if you forgot to set up ssh, you could do 'setupplatform -p network' and answer "no" to everything until it prints "Do you want to set up ssh?" Answer "yes" and you're prompted with "Enable ssh service? [y|n]:", and a simple "yes" or "no" will enable or disable ssh. I know from memory that 'setssh -c enable' will enable ssh, but other tasks require more than a simple enable/disable argument (and I'm way too lazy to go read the man page).

Here's a quick sample, creating a new platform administrator user account:

    XSCF> setupplatform -p user
    Do you want to set up an account? [y|n]: y
    Username: johndoe
    User id in range 100 to 65533 or leave blank to let the system
    choose one: 
         Username: johndoe
         User id: 
    Are these settings correct? [y|n]: y
    XSCF> adduser johndoe
    XSCF> setprivileges johndoe useradm platadm platop
    XSCF> password johndoe
    New XSCF password: 
    Retype new XSCF password: 
Note that the last few lines of output, while they start with the "XSCF>" prompt, are actually generated by the setupplatform command itself; these are the actual commands that setupplatform executed on your behalf. As a result, you always know exactly what setupplatform did to the system. And in the process, you might learn the syntax of the commands yourself so you can skip setupplatform for simple changes. I think it was a great idea, and I hope others appreciate that as well.
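The "show every command you run" idea is easy to sketch. The Python below is a hypothetical illustration, not the setupplatform source: it builds the same three commands shown in the sample session and echoes each one with the prompt before handing it to an executor (here a stand-in that just records the command).

```python
from typing import Callable, List


def user_setup_commands(username: str) -> List[str]:
    """Commands matching the sample session above for a new admin account."""
    return [
        f"adduser {username}",
        f"setprivileges {username} useradm platadm platop",
        f"password {username}",
    ]


def run_echoed(commands: List[str], execute: Callable[[str], None]) -> None:
    """Print each command with the prompt, then execute it."""
    for command in commands:
        print(f"XSCF> {command}")  # the user sees exactly what is run
        execute(command)


executed: List[str] = []
run_echoed(user_setup_commands("johndoe"), executed.append)
```

Because the echo and the execution share one list of commands, what is displayed cannot drift from what is actually run.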

I have to admit, the prompts are pretty pedantic. For example, setupplatform always asks if you want to set up something, asks you how you want it set up, then summarizes your answers and asks you again if they're correct. But we wanted to make sure people wouldn't hit the wrong key and really screw up their system. Since this is all text-based, it's not like you can hit the "back" button.

I'll write about some of the other new features in coming posts...

Wednesday Jul 11, 2007

Getting help for XSCF commands

The Sun SPARC Enterprise M-class server service processor (XSCF) has several commands that are new or different from previous enterprise-class servers. Most users are probably aware that the 'man' facility is available for all commands; however, as someone once said (I think it was Scott McNealy), "man has the answer to any question, conveniently organized by answer." In other words, if you don't know what command to use, man isn't going to help. Luckily there are a couple of things that can help...

The XSCF man facility does include a standard Intro(8) topic, which provides a complete list of commands and a short synopsis. For example:

    XSCF> man intro

    System Administration                                    Intro(8)

    NAME
         Intro - eXtended System Control Facility (XSCF) man pages

    DESCRIPTION
         This manual contains XSCF man pages.

    LIST OF COMMANDS
         The following commands are supported:

         Intro, intro            eXtended  System  Control   Facility
                                 (XSCF) man pages

         addboard                configure   an    eXtended    System
                                 Board(XSB)  into  the  domain confi-
                                 guration or assigns it to the domain
                                 configuration

         addcodlicense           add  a  Capacity  on  Demand   (COD)
                                 right-to-use  (RTU)  license  key to
                                 the COD license database

         addfru                  add a Field Replaceable Unit (FRU)

         adduser                 create an XSCF user account
    ...

On the other hand, you may sort of know the command you want to run, but aren't sure of the specifics. For example, to set up the network I sometimes forget if the command is 'setnet' or 'setnetwork'. Like a standard bash shell, the Tab key can be used to complete the command. For example, 'setnet<TAB>' will expand to 'setnetwork'.

Also, like the bash shell, you can use the double-tab to display a list of possible completions. This comes in handy when you know you need a 'set' command, but you don't recall specifically which command; you can do 'set<TAB><TAB>' and get a complete list of all 'set*' commands. For example (using 'setn<TAB><TAB>' which is a little less verbose):

    XSCF> setn<TAB><TAB>
    setnameserver  setnetwork     setntp         
    XSCF> setn
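For anyone curious how this style of completion works, here is a minimal Python sketch of prefix completion. It is a generic illustration, not the XSCF shell's code; the function names are hypothetical and the command list is just a handful of names mentioned in these posts.

```python
import os

# A small subset of command names, for illustration only.
COMMANDS = ["addboard", "adduser", "setnameserver", "setnetwork",
            "setntp", "setservicetag", "setssh"]


def candidates(prefix, commands=COMMANDS):
    """What a double-tab would list: every command starting with prefix."""
    return sorted(c for c in commands if c.startswith(prefix))


def complete(prefix, commands=COMMANDS):
    """What a single tab would do: extend to the longest unambiguous prefix."""
    matches = candidates(prefix, commands)
    if not matches:
        return prefix  # nothing to complete
    # commonprefix compares strings character by character.
    return os.path.commonprefix(matches)


print(complete("setnet"))   # setnetwork
print(candidates("setn"))   # ['setnameserver', 'setnetwork', 'setntp']
print(complete("setn"))     # setn
```

The last call shows why the prompt comes back as 'setn' after the list is printed: the three candidates share nothing beyond those four characters.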

Finally, all of the XSCF commands consistently implement the -h option to display the command's synopsis. This isn't the full man page, just the synopsis. For example:

    XSCF> setnetwork -h
    usage: setnetwork [-m addr] interface address
           setnetwork -c {up | down} interface
           setnetwork -h
That's usually enough to help you figure out what arguments you need to provide. And when it isn't, you always have man to provide all the details.

About

Bob Hueston
