Thursday May 01, 2008

FX1: SPARC64-VII Supercomputer

I haven't heard any announcements from Sun about shipping the new SPARC64-VII processor, but apparently Fujitsu is already selling SPARC64-VII processor-based servers. Fujitsu announced an order from the Japan Aerospace Exploration Agency (JAXA) for a supercomputer consisting of 3,392 single-CPU FX1 servers with the SPARC64-VII processor.

The announcement describes the system:

    At the core of the new system is a massively
    parallel computer system comprised of 3,392 FX1
    computing nodes. Compared to the existing system,
    the new system delivers peak theoretical calculating
    performance of 135 TFLOPS, approximately a 15-fold
    increase, 100 TB of total memory, a roughly 30-fold
    increase, and total storage of 11 petabytes (PB),
    approximately a 16-fold increase.
    

The FX1, available only in Japan, appears to be targeted specifically at the HPC market -- a single-CPU server node with 32GB of RAM but only one hard disk, sporting an InfiniBand HCA, and designed to scale to thousands of nodes. The same need for small servers with excellent single-threaded performance may also be reflected in the Ikkaku.

The FX1 datasheet contains some specs on the SPARC64-VII, most notably the clock speed. The FX1 includes a SPARC64-VII with four cores running at 2.5GHz, compared to the two cores running at 2.1GHz to 2.4GHz in the SPARC64-VI that currently ships in the Sun SPARC Enterprise M-class servers.
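
As a quick sanity check on the 135 TFLOPS figure: assuming the SPARC64-VII, like its predecessors, can retire four floating-point operations per core per cycle (two fused multiply-add pipes), each FX1 node peaks at 4 cores x 2.5GHz x 4 FLOPs = 40 GFLOPS, and 3,392 nodes x 40 GFLOPS works out to about 135.7 TFLOPS -- right in line with the announcement.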

Friday Apr 25, 2008

XCP 1070 Now Available, with SPARC64-VII Jupiter Support

I see on Sun's Software Download Center that XCP 1070 firmware is now available for download. XCP is the firmware that runs on the Sun SPARC Enterprise M-class service processor. According to the Sun SPARC Enterprise M8000/M9000 Servers Product Notes, the one major new feature in XCP 1070 is:
    In XCP Version 1070, the following new feature is introduced:
    • Support for SPARC64® VII processors
The product notes refer to Solaris 10 5/08 (also available for download on the Software Download Center) for SPARC64-VII support.

One interesting limitation noted in the product notes is:

    For Solaris domains that include SPARC64 VII processors, a single
    domain of 256 threads or more might hang for an extended period of
    time under certain unusual situations. Upon recovery, the uptime
    command will show extremely high load averages.
    
The above limitation references Change Request CR6619224, "Tick accounting needs to be made scalable", which describes the problem in detail:
    Solaris performs some accounting and bookkeeping activities every
    clock tick. To do this, a cyclic timer is created to go off every
    clock tick and call a clock handler (clock()). This handler performs,
    among other things, tick accounting for active threads.
    
    Every tick, the tick accounting code in clock() goes around all the
    active CPUs in the system, determines if any user thread is running
    on a CPU and charges it with one tick. This is used to measure the
    number of ticks a user thread is using of CPU time. This also goes
    towards the time quantum used by a thread. Dispatching decisions are
    made using this. Finally, the LWP interval timers (virtual and profiling
    timers) are processed every tick, if they have been set.
    
    As the number of CPUs increases, the tick accounting loop gets larger.
    Since only one CPU is engaged in doing this, this is also single-threaded.
    This makes tick accounting not scalable. On a busy system with many CPUs,
    the tick accounting loop alone can often take more than a tick to process
    if the locks it needs to acquire are busy. This causes the invocations of
    the clock() handler to drift in time. Consequently, the lbolt drifts. So,
    any timing based on the lbolt becomes inaccurate. Any computations based
    on the lbolt (such as load averages) also get skewed.
    

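To make the scaling problem concrete, here's a grossly simplified sketch of what a single-threaded tick accounting loop looks like. This is illustrative pseudocode of my own, not the actual Solaris clock() source (the real code also deals with locks, LWP interval timers, and much more):

        /*
         * Illustrative sketch only -- not the Solaris implementation.
         */
        #define NCPU 512        /* e.g., a 256-core, 512-strand domain */

        struct thread {
            long t_ticks;       /* CPU time charged, in ticks */
            long t_quantum;     /* remaining dispatch quantum */
        };

        struct cpu {
            struct thread *cpu_thread;  /* user thread running here, or NULL */
        };

        struct cpu cpus[NCPU];

        /*
         * Called from the clock handler, once per tick, on ONE cpu.  As
         * NCPU grows, this loop (plus the lock contention omitted here)
         * can itself take longer than a tick, so invocations of clock()
         * -- and anything based on the lbolt -- start to drift.
         */
        void
        tick_accounting(void)
        {
            int i;

            for (i = 0; i < NCPU; i++) {
                struct thread *t = cpus[i].cpu_thread;

                if (t != NULL) {
                    t->t_ticks++;       /* charge one tick of CPU time */
                    t->t_quantum--;     /* consume dispatch quantum */
                }
            }
        }

The point is that a single CPU pays for the whole loop, every tick, no matter how many hardware strands the domain contains.
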
The issue of tick scalability has been around for a while. Eric Saxe mentions the issue in his May 21, 2007 blog entry tick, tick, tick.... The change request does say that the problem is already fixed in Solaris Nevada (OpenSolaris) build 81, so hopefully this limitation will be removed with an upcoming patch or release of Solaris 10.

Monday Apr 21, 2008

Jupiter Processor (SPARC64-VII) for Sun SPARC Enterprise M-Class Servers

In my last post, OpenSolaris support for Ikkaku: 2U Single-CPU SPARC Enterprise Server, some people noticed the line:
        Ikkaku is a 2U, single CPU version of SPARC Enterprise
        M-series (sun4u) server utilizing the SPARC64-VII
        (Jupiter) processor.
I've been asked by a few people about the new Jupiter processor for the SPARC Enterprise M-Class servers. Of course, I can't divulge any Sun or Fujitsu proprietary information. But anyone can go to OpenSolaris.org and learn quite a bit...

Cores and Strands

The Jupiter CPU was announced even before the SPARC Enterprise servers began shipping in April 2007. The Jupiter processor's product name is going to be SPARC64-VII. I did a search for "sparc64-vii" and got a hit for FWARC 2007/411 Jupiter Device binding update, which tells us:
        Jupiter CPU is a 4 core variant of the current shipping
        Olympus-C CPU, which has 2 cores.  Both Olympus-C and
        Jupiter has 2 CPU strands per CPU core.
According to the Sun SPARC Enterprise Server Family Architecture white paper (published by Sun in April 2007), current SPARC Enterprise M-Class servers will be upgradable to the new Jupiter CPU modules. Among other things, this means that a fully-loaded SPARC Enterprise M9000-64 with 64 CPU chips would have 256 cores, capable of running 512 concurrent threads in a single Solaris image. With 512 DIMMs, and assuming 4GB DIMMs are available soon, the system would max out at 2TB of RAM, enough to keep 512 threads pretty happy.

Shared Contexts

Another search uncovered this email notification that OpenSolaris now supports Shared Contexts for SPARC64-VII. Steve Sistare's blog describes Shared Context for the UltraSPARC T2 processor:
    In previous SPARC implementations, even when processes share physical memory, they still have private translations from process virtual addresses to shared physical addresses, so the processes compete for space in the TLB. Using the shared context feature, processes can use each other's translations that are cached in the TLB, as long as the shared memory is mapped at the same virtual address in each process. This is done safely - the Solaris VM system manages private and shared context identifiers, assigns them to processes and process sharing groups, and programs hardware context registers at thread context switch time. The hardware allows sharing only amongst processes that have the same shared context identifier. In addition, the Solaris VM system arranges that shared translations are backed by a shared TSB, which is accessed via HWTW, further boosting efficiency. Processes that map the same ISM/DISM segments and have the same executable image share translations in this manner, for both the shared memory and for the main text segment.
Presumably, Shared Context on Jupiter is similar if not identical.
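
Shared Context mostly benefits applications that already map large shared memory segments at a common virtual address, which on Solaris usually means ISM or DISM. Just to illustrate what that looks like from user land, here's a minimal sketch of attaching an ISM segment via the SHM_SHARE_MMU flag to shmat(2) (the key and segment size are made up for the example):

        #include <sys/types.h>
        #include <sys/ipc.h>
        #include <sys/shm.h>
        #include <stdio.h>
        #include <stdlib.h>

        int
        main(void)
        {
            size_t size = 64UL * 1024 * 1024;   /* 64MB segment (illustrative) */
            key_t key = 0x4a50;                 /* arbitrary key (illustrative) */
            int shmid;
            void *addr;

            /* Create (or look up) the System V shared memory segment. */
            if ((shmid = shmget(key, size, IPC_CREAT | 0600)) == -1) {
                perror("shmget");
                exit(1);
            }

            /*
             * SHM_SHARE_MMU requests Intimate Shared Memory, so cooperating
             * processes share translation resources for the segment.  The
             * same-virtual-address mappings that ISM encourages are exactly
             * the case the shared context feature can optimize.
             */
            if ((addr = shmat(shmid, NULL, SHM_SHARE_MMU)) == (void *)-1) {
                perror("shmat");
                exit(1);
            }

            (void) printf("ISM segment attached at %p\n", addr);
            return (0);
        }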

Integer Multiply Add Instruction

If you go into the OpenSolaris source browser and search for "Jupiter" in the kernel source, you get about a half dozen C file hits. One of the files, opl_olympus.c, includes the following code, and more importantly, comments:

        /*
         * Set to 1 if booted with all Jupiter cpus (all-Jupiter features enabled).
         */
        int cpu_alljupiter = 0;
        ...
        /*
         * Enable features for Jupiter-only domains.
         */
        void
        cpu_fix_alljupiter(void)
        {
            if (!prom_SPARC64VII_support_enabled()) {
                /*
                 * Do not enable all-Jupiter features and do not turn on
                 * the cpu_alljupiter flag.
                 */
                return;
            }

            cpu_alljupiter = 1;

            /*
             * Enable ima hwcap for Jupiter-only domains.  DR will prevent
             * addition of Olympus-C to all-Jupiter domains to preserve ima
             * hwcap semantics.
             */
            cpu_hwcap_flags |= AV_SPARC_IMA;
        }
The comments in the above snippets imply that a kernel can either be in "all-Jupiter" mode, or "not all-Jupiter" mode. From the Sun SPARC Enterprise M4000/M5000/M8000/M9000 Servers Administration Guide we get the following description of the two modes:
    A SPARC Enterprise M4000/M5000/M8000/M9000 server domain runs in one of the following CPU operational modes:
    • SPARC64 VI Compatible Mode - All processors in the domain - which can be SPARC64 VI processors, SPARC64 VII processors, or any combination of them - behave like and are treated by the OS as SPARC64 VI processors. The new capabilities of SPARC64 VII processors are not available in this mode.
    • SPARC64 VII Enhanced Mode - All boards in the domain must contain only SPARC64 VII processors. In this mode, the server utilizes the new features of these processors.
Based on the source code, the main difference with SPARC64 VII Enhanced Mode appears to be the addition of the AV_SPARC_IMA hardware capability, which is an Integer Multiply-Add (IMA) instruction (see also CR6591339).

Integer Multiply-Add is important to cryptographic algorithms, and presumably Jupiter's ima instruction is similar to the xma instruction of the Itanium and other processors. The xma instruction takes three operands A, B and C, and produces the result A*B+C in a single instruction. Integer multiply-add instructions have a significant impact on RSA key generation and other cryptographic algorithms.
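
Since the ima hardware capability only shows up in SPARC64 VII Enhanced Mode domains, code that wants to use it should probe for it at run time. Here's a minimal sketch using getisax(2), which reports the AV_SPARC_* capability bits the kernel advertises (the messages are just illustrative):

        #include <sys/types.h>
        #include <sys/auxv.h>
        #include <stdio.h>

        int
        main(void)
        {
            uint32_t av = 0;

            /* Fetch the first word of AV_SPARC_* capability bits. */
            (void) getisax(&av, 1);

            if (av & AV_SPARC_IMA)
                (void) printf("ima available (all-Jupiter domain)\n");
            else
                (void) printf("no ima; use separate multiply and add\n");

            return (0);
        }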

Summary

In summary, what I've learned by searching Sun's own web page:
  • Jupiter has four cores, two threads per core.
  • It supports Shared Context, for improved performance on applications that use a lot of shared memory.
  • It supports a new Integer Multiply-Add instruction to improve cryptographic algorithms.
  • It can be installed on SPARC Enterprise M-class servers.
I haven't seen anything about improved clock speeds, but all in all, a pretty significant improvement to what is already a great product.

That's about all I can say. We'll all have to keep waiting, and watching, for more Jupiter info as Sun decides to release it.

Thursday Mar 27, 2008

OpenSolaris support for Ikkaku: 2U Single-CPU SPARC Enterprise Server

I saw this on the opensolaris.org OS/Net flag day announcement alias:
    With the putback of
    
         6655597  Support for SPARC Enterprise Ikkaku
    
    Solaris Nevada now supports the new Ikkaku model of
    SPARC Enterprise M-series (OPL) family of servers.
    
    Official product name for Ikkaku is not yet finalized.
    Ikkaku is a 2U, single CPU version of SPARC Enterprise
    M-series (sun4u) server utilizing the SPARC64-VII
    (Jupiter) processor.
    
All I know is what I've read on opensolaris.org, but if you'd like to learn more, try searching for Ikkaku on opensolaris.org.

By the way, according to the announcement, Ikkaku is Japanese for Narwhal.

Wednesday Jan 16, 2008

DSCP IP Addresses

On the subject of Sun SPARC Enterprise M-Class server Domain-to-SCF Communication Protocol (I wrote about it yesterday), I received two questions about the IP addresses reserved for the internal DSCP network. The first has to do with reusing the IP addresses in multiple machines; the second has to do with the DSCP netmask.

DSCP IP Address Reuse

One customer emailed me to ask if it was OK to use the same DSCP IP address range on multiple machines in the same datacenter. The short answer is yes.

The only requirement for the DSCP addresses is that they not be used elsewhere on the external networks, either the network used by the SCF or by the Solaris domains. Clearly if you use the same IP addresses for some machines in the datacenter as are used for either the SCF or the Solaris domains, then the SCF and Solaris domains would not be able to connect to those other machines; requests to those IP addresses would be routed to the internal DSCP network, not to the external Ethernet ports.

It is perfectly acceptable to use the same DSCP addresses on every SPARC Enterprise machine in the datacenter. In our development lab, we had well over a dozen machines on a single subnet (both the SCF and the Solaris domains), and they all shared a common set of DSCP IP addresses. We operated that way for over a year without a problem.

We could have hard-coded the DSCP addresses, but it always seems that whatever address you select, at least one customer is using that same address in their datacenter. So after consulting with manufacturing and Sun Service, we decided to leave the DSCP IP addresses entirely up to the customer.

DSCP Network Address and Netmask

Another customer raised an interesting question. They had a SPARC Enterprise machine which is capable of supporting multiple Solaris domains, but they were only configuring it into a single domain. When they tried to use setdscp on the SCF, they provided a network address and netmask that only contained two IP addresses, one for the SCF and one for the Solaris domain. setdscp doesn't allow that; it insists that the netmask be large enough to handle the maximum number of domains that the chassis can support. This is so that setdscp can compute IP addresses for the SCF and all possible Solaris domains.

But this customer was adamant that they could only spare two IP addresses for DSCP.

In actuality, setdscp can be invoked in three ways. The first method:

        setdscp -i address -m netmask
is the one recommended by the user documentation. It takes an IP network address and a netmask and computes IP addresses for the SCF and all of the possible domains. This method is provided purely as a convenience; the network address and netmask are not used by DSCP at all, other than to compute the individual IP addresses.

You can also invoke setdscp with no arguments, and it will prompt you for a network address and netmask and compute IP addresses for the SCF and all domains. In effect, this is the same as the first method, except it's interactive.

The third method of using setdscp is to manually assign the SCF and domain DSCP IP addresses one at a time using the format:

        setdscp -s -i address
        setdscp -d domain_id -i address
The first line sets the SCF's DSCP interface to a specific IP address. The second line sets a domain's DSCP interface IP address; you can invoke it once for each domain you plan on configuring.
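
For example (the addresses below are purely illustrative), a site that can only spare two IP addresses could configure just the SCF and domain 0:

        setdscp -s -i 10.1.1.1
        setdscp -d 0 -i 10.1.1.2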

Caveat: With this third, manual approach, you can configure just some of the domains you plan on creating. For example, you may have an M9000, capable of 24 domains, but you only plan on using it as a single system, so you configure the SCF and domain 0's DSCP IP addresses. Later, though, you may decide to create a new domain 1. You can add boards to domain 1, but when you try to power it on, the poweron command will fail, because there's no DSCP address for domain 1. Last time I checked (this was in XCP1040), the failure occurred late in the boot process, after the poweron command had returned to the user saying the domain was being powered on. As a result, the only way to see why the domain didn't power on was to inspect the error logs. (Note: This behavior may have changed in later versions of the SCF firmware.)

Summary

In summary, you can use the same DSCP IP addresses for all Sun SPARC Enterprise M-Class machines in your datacenter. Given that, it should be rare that you run low on IP addresses for DSCP, but if you do, know that you can manually configure only the DSCP IP addresses you really need.

Tuesday Jan 15, 2008

DSCP: Policy Failure for the incoming packet

As a comment to my post DSCP: Domain to Service Processor Communication Protocol, Mike Beach asked:
        We are seeing ipsec messages in the system log regarding the DSCP addresses.    
        Is there an ipsec configuration option on the host that we should check?

        # dmesg |tail -1
        Jan 14 14:38:42 sco01b ip: [ID 372019 kern.error] ipsec_check_inbound_policy:   
        Policy Failure for the incoming packet (not secure); Source 192.168.037.026,    
        Destination 192.168.037.028.
My response to Mike was, "I think, in general, you can ignore the ipsec_check_inbound_policy messages. They are probably happening whenever the SCF is rebooted." But I wanted to provide a more complete response, and the tiny comment block wasn't the right place.

On a SPARC Enterprise M-class server, the SCF and Solaris use IPsec to authenticate each end of the connection when they set up a domain-to-SCF communication (DSCP) link. If for some reason the SCF reboots (during an SCF failover, when the SCF firmware is upgraded, or when it is manually rebooted, for example), the SCF sends a reset message to Solaris to reset the TCP connection. The reset message is sent in the clear. When Solaris sees the message in the clear (that is, "not secure"), it logs the policy failure.

If you see this message only when Solaris or the SCF reboots, then it can safely be ignored. If you're seeing it at other times, or logged continuously every second, then contact your service engineer and escalate the problem to Sun.

There was a bug filed against Solaris about these messages (technically it was an RFE -- request for enhancement). In part, the bug says:

        The cause of the messages is IPsec, this system recieved a clear text packet
        but the IPsec policy on this system only allows IPsec encrypted packets, so the
        system discards the packet. To stop the system logging itself to death, there
        is a rate limiting function which will only log a message every
        ipsec_policy_log_interval milliseconds.

        The default value for ipsec_policy_log_interval is 1000 ( one second ).

        In certain configurations, messages like this are expected and after a while 
        too many messages will start to become anoying and fill up the messages file.

        This value can be tunned with ndd upto 999999 milliseconds ( just over 16 minutes )
        but can't actually be disabled. This is a request to allow the systems administrator
        to turn off these messages should they wish.
This was fixed in OpenSolaris build 37 and Solaris 10 Update 4, and allows you to specify an ipsec_policy_log_interval of 0 to turn off the logging altogether. (Caveat: I haven't actually tried the fix myself.)
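
With that fix in place, the tunable can be changed like any other ip ndd parameter. I believe turning the logging off entirely looks like this (again, untested by me):

        # ndd -set /dev/ip ipsec_policy_log_interval 0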

Hope this provides enough information for you to understand what's going on when you see these messages.

Wednesday Jan 09, 2008

Fault Management Coordination Between Solaris and SCF

Accurately diagnosing faults on a large and complex system like a Sun SPARC Enterprise M-class server is critical to maintaining server availability. But fault diagnosis is complicated by the fact that Solaris sees a limited view of the universe, while the service processor (SP) sees a different, limited view. In the M-class server line, things are further complicated because two Solaris instances may be sharing the same hardware, and as a result, seeing different errors due to the same hardware fault.

Solaris basically sees three types of hardware: CPUs, memory and I/O devices. A real machine, though, is composed of many more components: interconnect ASICs, cables and connectors, power supplies and voltage regulators, and fans, to name just a few. The SP sees these other components. A fault may manifest itself as different errors depending on one's point of view. For example, a fault in an ASIC that connects a CPU to memory might be seen by the SP as protocol errors on the data interconnect, while the effect that Solaris sees is an uncorrectable memory error. Accurate diagnosis and recovery requires the SP and Solaris to coordinate.

In the M-class family of servers, the SP/Solaris coordination basically comes in three areas: Memory, I/O and CPU.

Memory Errors

For the most part, Solaris handles correctable errors in memory. Single-bit, correctable errors in DIMMs are a natural result of the physics of dynamic RAMs, and do not, in general, reflect faulty hardware. A stray cosmic ray may hit the device and flip a bit. Thanks to ECC (error correction code) bits, the single-bit error is corrected, and the data is rewritten to memory.

There are, however, several memory errors that may imply faulty hardware:

  • Multi-bit uncorrectable errors (UEs).
  • Permanent correctable errors (also called PCEs, or "Sticky" CEs). These are correctable errors that are not cleared by rewriting to memory. They typically represent a stuck bit.
  • Excessive correctable errors in a short period of time, which could indicate a faulty DIMM.
Solaris detects all of these errors, and handles each slightly differently.

When a UE is detected, the memory controller notifies Solaris and the SCF at the same time. The Solaris Fault Management Architecture (FMA) will retire the page containing the UE, to avoid hitting the same error in the future. On the SCF, the DIMM is immediately considered faulty, and the next time Solaris is rebooted, POST will map-out a 64k chunk of memory around the UE to ensure that OBP and Solaris do not use that memory again. This allows a system to continue running as long as possible, using memory that's safe to use, until a service action can be scheduled to replace the DIMM.

When a CE is detected, only Solaris gets notified. At first, Solaris FMA may decide that the DIMM is slightly degraded and will retire a page of memory (issuing a fault.memory.page event), but after enough PCEs have been collected for a single DIMM, Solaris FMA will declare the DIMM faulted and issue a fault.memory.dimm event. The fault event gets sent to the SCF over the internal network. On the SCF, the fault.memory.dimm is processed by the SCF FMA, and produces a fault.chassis.SPARC-Enterprise.memory.block.pce event. At this point, the SCF will consider the entire DIMM faulty, and the next time Solaris is rebooted, the SCF will isolate this DIMM.

I/O Errors

Errors in the PCI-Express fabric are detected by PCI-Express devices, and handled by Solaris device drivers. Generally, the errors are reported to the Solaris FMA, which may decide an I/O device is faulty and issue one of the fault.io.* events. If possible, Solaris will retire the I/O device, or prevent it from being used in the future. The fault.io.* event is also forwarded to the SCF.

For memory, the SCF was able to remember that a DIMM was faulty, and POST could map-out a chunk of memory to prevent the faulty pages from being used. In the case of I/O, the SCF really can't do either.

First, POST can't map-out a PCI device. POST presents to OBP a list of PCI-Express root complexes. Once OBP takes over, it can probe the PCI-Express fabric to discover PCI devices. POST could map-out an entire root complex, but that would be a "big hammer" solution to the problem. Instead, the SCF allows Solaris to handle mapping-out the faulty devices, since Solaris can map-out devices at a much finer granularity.

Second, while the SCF could remember that an I/O device is faulty, as I've just written, it wouldn't do much good since the SCF doesn't map-out PCI devices. Furthermore, unlike memory DIMMs, PCI cards do not have FRU ID PROMs and serial numbers. So if you powered off your machine and replaced a faulty PCI-Express card, the SCF would not know that the faulty device was replaced. This would require another manual step when you replace a PCI card, i.e., logging into the SCF and telling the SCF the card has been replaced.

So instead, the SCF just logs that Solaris detected the I/O fault, and it relies on Solaris to handle mapping-out the device in the future. The fault event is visible using fmdump on the SCF.

CPU Errors

CPU errors are reported to both the SCF and Solaris. In some cases, it won't do Solaris much good; if a CPU is faulty, it may result in a panic, or even a complete hardware reset. In many other cases, Solaris is able to identify a CPU exhibiting excessive correctable errors, and offline the CPU.

The SCF, however, sees a superset of errors from the CPU, the support ASICs, memory controllers, and I/O controllers. Errors could be reported by ASICs that don't belong to a single Solaris instance, for example, a crossbar ASIC which is routing data for all Solaris domains in the chassis.

In some cases, the SCF may decide that a CPU chip is generating excessive correctable errors on the interconnect between chips, errors that are completely transparent to Solaris. In this case, the SCF will diagnose the CPU chip as faulty, and the next time Solaris boots, the SCF will map-out that CPU chip.

If nothing else is done, eventually the faulty CPU chip will probably emit an uncorrectable error, resulting in a complete domain stop and reset. To minimize that likelihood, the SCF issues a fault.chassis.SPARC-Enterprise.cpu.SPARC64-VI.core.ce-offlinereq fault event. This event gets forwarded to Solaris FMA over the internal network. In Solaris, this fault event is treated like any other CPU fault event that Solaris might have diagnosed itself -- it gets logged to /var/adm/messages and the console, results in an snmp trap, and the CPU is offlined.

One peculiarity about the ce-offlinereq event is that the Solaris FMA stack received the event from the SCF, and did not generate the event itself. As a result, the ce-offlinereq does not show up using fmdump in Solaris.

But wait, there's one more case... The switch ASIC can detect excessive correctable errors in the L2 cache tags for a specific CPU cache way. The SCF can handle this on its own; it deconfigures the L2 cache way and the CPU does not need to be offlined. Solaris is unaffected, except for a slight performance degradation. In this case, however, the Solaris administrator should know that a CPU is being degraded. So the SCF emits the event fault.chassis.SPARC-Enterprise.asic.sc.ce-l2tagcpu, which is forwarded to the affected Solaris domain. Solaris logs the fault in /var/adm/messages, on the console, and through an snmp trap, but otherwise, does nothing to offline any CPU.

All The Rest

All of the rest of the faults fall into two broad categories: Solaris-only and SCF-only. Solaris can detect things like SCSI errors, zfs errors, and software errors. These are handled in Solaris, and do not involve the SCF at all. The SCF, on the other hand, can detect a wide range of errors -- power and voltages, over temperature and fan speeds, crossbar and switch ASICs, and many more. In these cases, the SCF diagnoses the fault and performs appropriate fault recovery on its own, and there's nothing that Solaris needs to do.

Summary

The Sun SPARC Enterprise M-class servers employ the same Fault Management Architecture on the SCF that has existed in Solaris since S10 first shipped. Having both the SCF and Solaris running the same FMA stack enables the two entities to communicate using a common event protocol, and coordinate their activities in handling errors, diagnosing faults, and recovering from those faults.

Tuesday Jan 08, 2008

Working with Fujitsu: Travel to Japan

I work in Sun's Burlington, MA office, about ten miles north of Boston. Three years ago when I was asked if I was willing to work on the APL project, my first question was, "Would I have to travel to Japan?"

Flying to Japan

It's not that I had anything against Japan, or travel in general. But Boston to Japan is a long flight! I hated just having to fly six hours to the West Coast. Depending on the aircraft and the stops, Japan can mean 18 hours on an airplane. The "best" arrangement I've found is flying down to New York (less than an hour), then taking a Boeing 777 direct to Narita International Airport (just outside Tokyo). Fourteen and a half hours in your seat, inside a thin metal tube, above the clouds. You leave Boston on Saturday morning, and with the time difference you're lucky to be in Tokyo for dinner Sunday night.

I did discover that, remarkably, my iPod battery could last almost the entire flight. I also learned that sharing your playlists can be a way to break the ice with the person next to you on a plane. You can see right away that they have an iPod, and no one seems to mind if you say, "Nice iPod! What kind of music do you listen to?" My iPod has a fair mix of oldies like Simon and Garfunkel (is Billy Joel an "oldie" yet?) through current bands like Guster and Barenaked Ladies (are they still "current"?). I was actually surprised that a 13-year-old girl sitting next to me on one flight had many of the same artists on her iPod.

Hotel New Otani and Akasaka

Once at Narita, it's an hour-long train ride into Tokyo on the Narita Express. We'd normally stay at the Hotel New Otani, near the Akasaka district with its many restaurants and easy access to rail. Whenever a large group from Sun was arriving on different flights from different cities, we'd arrange to meet for dinner on the first night at a Korean barbeque. Since you cook the food yourself right at the table, and this particular restaurant had large tables with several barbeque pits, it could handle a large crowd of weary travelers, and it didn't matter if you wanted to eat a little or a lot, or came late, or left early to get a good night's sleep. And I think everyone gets a good night's sleep the first night.

The New Otani served an "American" breakfast buffet. While they serve normal fare like fruit, muffins, and cereal, the two most interesting items at the buffet are french fries and salad. Thinking about it more, french fries are really nothing but home fries cut the other way. But salad? Of course, after a few days in Japan you realize that most Japanese restaurants don't serve a tossed salad with dinner. Usually by the third day or so, I'd find myself craving lettuce and was more than willing to eat it for breakfast.

The Commute

The main Fujitsu facility we were usually visiting is in Kawasaki, although occasionally we would meet in Kamata. We all traveled around using Japan Rail (JR) or the Tokyo Metro. The rail stations are fairly close to the New Otani, an easy walk in the morning, and a welcome walk after a long day of meetings. Traveling with the rest of the Tokyo commuters was a great (if a bit scary) experience. On my first trip to Tokyo, I momentarily got separated from the group during a transfer. I was quite lucky that I found them again quickly, since I didn't have a clue which train to transfer to, or even what my next stop was. To this day, that common recurring nightmare -- the one where you're back at high school, it's final exams, you failed to attend any of the classes or read the book, you don't have a pencil, and you forgot your pants -- has been replaced by the nightmare where I'm in an unfamiliar JR Rail station, alone, and I can't figure out which train to take back to the hotel.

Of course, getting around Tokyo isn't as bad as my nightmare. Almost all the signs have English subtitles, and most people speak at least a little English. Those that don't speak English are usually friendly and try hard to communicate with hand gestures and charades. I suppose the one thing that bothered me most about the train system was that I didn't have a complete map of the system; I couldn't figure out how to get from point A to point B if it involved changing lines. But unlike Boston with its four T lines (Red, Orange, Blue and Green lines), there are more than a dozen subway and JR Rail lines around Tokyo, and far more stations. I can't imagine the size of a map large enough to encompass the entire system.

On the Metro I was introduced to "pushers". These were very different from the pushers I knew from the New York City subway. These pushers dressed in uniforms, almost like police, and wore white gloves. At the platform, everyone lined up according to marks on the floor, in double file, one line on either side of the subway car door. When the subway train arrived, it stopped exactly on its mark, the doors opened, and a stream of people poured out between the two lines of awaiting travelers. Once the car had expunged its passengers, the two lines of people would stream into the car. As the car became full, the pushers went to work. They would gently, and very politely, push you further into the car, and push the next person in behind you. When the car was nearly full to overflowing, the pushers would push the next person in, keeping their hand firmly on the back of the last passenger until the doors started to close. Just as the door was about to close on their arm, the pusher would pull their arm out. Not unlike the way my wife packs a suitcase.

As you can imagine, the Metro trains are crammed with people. At one point, two of us missed a stop because we were pushed to the middle of the car, pinned in the crowd, and unable to make our way to the doors for our stop. Of course, I never felt claustrophobic in the subway cars; most of my American coworkers were over 6 feet (1.8m) tall and their heads poked up above the crowd.

I very much appreciated the daily commute. I've done a lot of traveling, and for some cities the only places I've been are the airport, the hotel, and the meeting room. You never get to know a city or its people doing that. But going from the hotel to the meeting room by subway and train, with all of the other commuters, really made me feel like I was in Tokyo, seeing the sides of Tokyo even most tourists don't see.

Meetings with Fujitsu

Meetings were like meetings are. Long days. Small conference rooms. Presentations. Whiteboards. But in this case, they were far more productive because of the time shifting. Normally, Kawasaki is 13 or 14 hours ahead of Boston (depending on daylight savings time). I could send an email today, Fujitsu would read it tonight, and I wouldn't see the response until tomorrow. But when we were all there, in the same room, at the same time, we could discuss things, interactively. We could ask a question, discuss the answer, and come to agreement in a matter of minutes, instead of days.

When you're traveling 10 (from the West Coast) or 18 (from the East Coast) hours to get to Japan, you try to arrange as many meeting topics as you can, to get the most out of the travel time. We'd usually have at least three very long days of meetings on various topics. All in all, trips to Japan were packed with as much work as we could physically accomplish.

Evenings

After a long day of meetings, it was back to the subway station to head back in to Tokyo. If we were tired, we'd get off the train near the hotel, and eat dinner in the Akasaka district, past the many pachinko parlors to a shabu-shabu or tempura restaurant. Other nights we'd head down to the Ginza. If things were going well, typically on the last night we'd treat ourselves to Kobe beef at a little hole-in-the-wall a couple of blocks off the Ginza called Gyu An (my coworker Roman Zajcew always ate at Gyu An when he was in Tokyo, and traveled to Japan so much that the hostess there thought he lived in Tokyo).

One night our Fujitsu hosts took us to a local dining district in Kamata, and we ate at a small Korean barbeque restaurant with only three tables, and one person who was the hostess, waitress and "chef". A very nice, and authentically Japanese place -- they didn't even have menus in English or with pictures you could point at. It was quite humorous at the end of the meal when both Sun and Fujitsu people pulled out their credit cards and started a friendly argument over who should pay, when the hostess came over and explained they didn't take credit cards; embarrassed, we had to pass the hat around the table and collect cash from everyone to cover the check.

After dinner it was usually late and time to head back to the hotel room. Unfortunately, the New Otani had high-speed internet service in the rooms, I always had my laptop, and people in the US were just starting their day. On more than one occasion I was online well past 2am reading and writing emails with my colleagues back home, catching up on the issues of the day, and letting them know about the progress we were making. This arrangement also meant that I could email my action items from the day's meetings, people back home could work on the tasks while I slept, and I could attend the next day of meetings with several action items already resolved.

Heading Home

There is no better feeling after a business trip than settling in to your seat, hearing the aircraft door close, and feeling the plane back away from the airport. You're on your way home! All I could think about was getting back to my own house, my own bed, and my waiting wife and daughter. My little girl was just two when I made my first trip to Japan; one week away from a two-year-old means missing 1% of her short life. But thanks to my travels, and the little "Hello Kitty" doll I brought home on one trip (which we've named "Edo"), my now-four-year-old has no problem finding Japan on her globe, and knows a few words of Japanese as well.

Monday Jan 07, 2008

Working with Fujitsu: Building Bridges

I was about as surprised as anyone in 2004 when Sun announced the APL agreement -- we were going to partner with Fujitsu to develop the next generation of midrange and high-end servers based on the Olympus CPU chip. I was even more surprised when I was asked to work on the project.

Bridging Timezones

I'm based out of Sun's Burlington, MA office, near Boston. Initially, the I/O team was drawn from Burlington, while the OS and Service Processor teams were from offices in California. Fujitsu is located in Kawasaki, Japan. That's 13 or 14 hours ahead of Boston (depending on daylight savings time). This was going to take some changes to my work habits in order to get some workday overlap with my West Coast colleagues, and attend phone meetings with Fujitsu in Japan.

I used to be an early bird. In my previous job, I'd be at my desk by 7am every day; home for dinner by 5pm. When I joined Sun nine years ago, I found that schedule tough since most people didn't start until 10am, but worked well into the evening. Later when I worked on a project with Sun's West Coast offices, I found the West Coasters also came in at 10am -- 10am Pacific Time! If I started at 7am and quit at 4pm, I only had three hours of overlap with the West Coast (and one of those hours was their lunch time). So my hours started to shift. I started working more 8am to 5pm, then 9am to 6pm.

But now working with Fujitsu, there was more of a timezone challenge. Initially we planned to have monthly face-to-face meetings with Fujitsu to exchange technical information. But soon it became clear that once-a-month meetings were not often enough, and the few people who could travel were not always the right people for the discussions. We had to find a way to meet more regularly, weekly, and involve the right people. Fujitsu generally starts at 9amJT, which is 7pmET.

But the first rule I laid down was: I will not work between 6pm and 9pm. That's family time. If I can't eat dinner with my family, play with my daughter after dinner, and tuck her in to bed every night, then that's not the job I want to have. The rest of the Sun team was very understanding, and we scheduled conference calls at 9pmET/6pmPT/11amJT (during daylight savings time, that changed to 10amJT).

There was still an advantage to being an early bird. Often I would get up at 6:30 or 7am, check email from my home office and find that my colleagues at Fujitsu were still working. We could exchange email and information in real-time. Then when my daughter got up around 8am, I would play with her for a while, then after breakfast, when my wife and daughter headed out to dance class or swim class or whatever, I'd shower and go back to work, usually around 11am. I'd break for dinner at 6pm, and spend the evening with my family until my daughter was asleep in bed. At 9pm, I'd head back to the home office for a conference call, or just exchange email with the West Coast and Japan. Most work days ended at 10pm or 11pm.

In the end I worked hard, but not as hard as it seemed to others. They'd see me responding to emails at 7am, and still working past 10pm. It gave the illusion I was working 7am to 11pm (I think that motivated everyone else to put in extra hours themselves).

I'm really fortunate that Sun promotes an environment where employees can work from home. While I could have gone into the office every day, I would get there at 11:30 and have to leave at 5:30; hardly worth the traffic and travel time. And since most of my peers were in California or Japan, I would sit alone in my office all day, on the phone with people in other states. Initially, I tried to go into the office three or four days a week. By the end of the project, there were periods of entire months when I didn't see the inside of my office. And the time I did not spend commuting, I spent working. So I think Sun is really fortunate, too.

Bridging Companies

Working with another company to develop a product is difficult. I've done it in the past. But with Fujitsu, we were working with a competitor to develop a product. That's an order of magnitude more difficult.

And Fujitsu was a real competitor at the time. The Sun Fire 6900 and 25000 systems were up against PrimePower in many areas. We needed to work together on the Olympus platforms, but work being done to improve the Sun Fire servers, or pave the way for Rock-based servers, was not part of the contract; sharing that with Fujitsu before SPARC Enterprise platforms shipped meant giving away proprietary information to a competitor. Of course, improvements to support more cores, higher thread counts, larger memory images, and PCI-Express were all related to SPARC Enterprise as well as Sun Fire and Rock platforms. Figuring out what could be shared and what was taboo was a constant chore in the beginning. This was the first project I've worked on, at Sun or anywhere, that we had a lawyer from the Legal department assigned to the project team from day one. Working with another company meant understanding what was OK to share, and what must be kept proprietary.

This of course cut both ways. There were many technical questions we had that the Fujitsu engineers just couldn't answer; they weren't allowed to answer. We at Sun were not used to working like this. If we have a question about a new microprocessor, we're used to walking down the hall and talking to the engineers designing the chip. Working with another company meant trying to make forward progress with limited access to certain information.

Then there were the technical documents that were under Japanese export control -- they could be read in the US, Canada and the UK, but nowhere else. Sun is a multi-national company, where you're just as likely to have someone on your engineering team from India, Ireland, or elsewhere, and data is usually shared without regard to national borders. Sun largely ignores geography, but on this project, geography became a key factor. I think this was hard at first, but we quickly learned the routine; what was OK and what wasn't. We had to educate others within Sun (for example, the Sun engineer in India who was writing the SunVTS diagnostic tests wasn't allowed to read the technical documents unless he flew to California).

There was also a need for education within Sun about APL. Many Sun employees assumed APL was a co-branding agreement. While this was largely true of the T2000 machines (they were entirely Sun designed, originally Sun Fire T2000 rebranded as SPARC Enterprise T2000), the SPARC Enterprise M-class servers were not simply re-branded PrimePower machines; they were jointly designed, jointly developed, and jointly manufactured by both companies. Fujitsu provided the CPU chips and support ASICs, but Sun was responsible for the M4000 and M5000 hardware design (boards, power supplies, chassis), while Fujitsu owned the M8000 and M9000 hardware design, and both companies contributed equally to the platform-specific Solaris components and service processor firmware that ran across the product line.

Bridging Cultures

It's no secret that Japan has a very different culture from America. Before starting the project, many of us took a short class on doing business in Japan, and we all received a little handbook. But while the cultural differences were present, they were not significant. We Americans tried to be very respectful of the Japanese culture and traditions, but I'm sure we screwed up on many an occasion (I know I did). And in response, the Fujitsu engineers were respectful of the American culture, and forgiving of our faux pas.

We quickly learned the appropriate way to address our peers at Fujitsu, and they us. For example, I would refer to my Japanese counterparts by their last name with the suffix "-san", and they quickly learned that "Bob" was how I preferred to be addressed (and whenever anyone said "Mr. Hueston", I'd jump up and look for my father!). As we got to know our peers better, and were exposed to less formal situations such as dinners or one-on-one phone calls, using first names became acceptable. But out of respect, we'd never call someone by their first name in front of their superior.

Language was also a bit of a challenge. In Japan, most children study English in elementary school, so every professional we dealt with at Fujitsu spoke some English. We always had to be aware that the person we were talking to did not speak English as their first language. We had to speak a little slower, articulate a little more clearly, and try to avoid using colloquialisms. If a sentence wasn't understood and we were asked to repeat ourselves, the natural instinct is to paraphrase what we just said, but that only added confusion. We had to learn to repeat exactly what we had just said, giving our Japanese counterpart the opportunity to re-listen and re-translate the words in their head. If they had a question or needed further explanation, we had to learn to listen carefully to their question, and answer the question that was asked. At first, talking face to face (or worse, phone to phone) was a real challenge, but after just a little while, it became second nature -- you talked more slowly using simpler phrases, and listened more closely to maximize understanding in both directions.

Looking back at the people that Sun selected for this team, I've noticed that they are all calm, respectful individuals. There were no hot-heads or name-callers (Sun has its fair share, but not on this team). Perhaps someone higher up realized that in order to bridge the cultures, we needed a team of people who were patient, and would respect the people and culture of Japan. And I think mutual respect was the one key thing that helped us overcome any cultural differences.

Bridging The Distance

I was going to write about traveling to Japan and meeting with Fujitsu, but I think that can be a posting all on its own.

Building Bridges

Working on the APL program was all about building bridges. In a certain respect, the Sun engineering team was a bridge between Fujitsu and the rest of Sun. We bridged Sun's customers, service engineers, manufacturing engineers, and Sun's business to Fujitsu. In turn, the Fujitsu engineers we dealt with bridged Fujitsu's customers and business interests to Sun.

Monday Dec 10, 2007

Flashupdating the stand-by XSCFU

I got a really good comment on my entry setupplatform and other new XSCF features from Paul Liong asking:
    Our 'xscf' is currently running at XCP1050. We hope to get it upgraded to XCP1060 according to the Chapter 8 of XSCF User’s Guide. However, it is noticed that there is no permission to run the 'flashupdate' command on the Redundant XSCF Unit. So, how can we perform the firmware upgrade on the XSCF Unit on the standby side first and then on the active side?
An excellent question indeed! I checked the documentation (User Guide, Administrator Guide, man pages) and don't see it described in any detail there. So let me take a stab at it.

First, some background. The Sun SPARC Enterprise M8000 and M9000 support two service processors, XSCFU#0 and XSCFU#1. The two work in a dual-redundant fashion. One unit is always the "active" unit, and can fully monitor and control the platform. The second unit, if present, is the "stand-by" unit, and has very limited functionality; mostly, the stand-by XSCFU is a slave to the active unit, receiving database updates so that it's current and ready to take over if the active XSCFU fails, is physically removed, or the user runs switchscf.

Back to Paul's question. You cannot run flashupdate on the stand-by XSCFU, that is true. Instead, you run flashupdate on the active XSCFU. This causes the XSCFU to check the flash image, install it, and reboot. Upon reboot, if all goes well, the active XSCFU communicates with the stand-by, tells its partner that a new version of firmware has been installed, and copies the firmware image to the stand-by XSCFU. At that point, the stand-by XSCFU installs the firmware image and reboots. If the upgrade is successful, the stand-by XSCFU will request to become the active XSCFU in order to finish the upgrade process. When you're done, both XSCFUs will be running the same version of firmware.
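
From memory, the whole upgrade is driven from the active XSCFU with something like the following (check the XCP release notes for the exact syntax for your firmware version):

    XSCF> flashupdate -c check -m xcp -s 1060
    XSCF> flashupdate -c update -m xcp -s 1060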

One side effect of this process is that the active XSCFU will switch. In other words, if XSCFU#0 was active and XSCFU#1 was stand-by when you started, then XSCFU#1 will be active and XSCFU#0 will be stand-by when you're done. We had a heated debate about this during development. Someone filed a high-priority bug that the transition was unexpected and should be considered a bug. On the other hand, switching the active XSCFU back to the original active unit would require a second transition; that second transition would add another couple of minutes to the upgrade process (minimizing firmware upgrade times was an important requirement for the SPARC Enterprise service processor, so an extra two minutes is a lot of time). In the end we decided that it doesn't matter which unit is active, since they are dual redundant, so we should adopt the approach that allowed the firmware upgrade to finish as quickly as possible. If there are customers who strongly feel that, for example, XSCFU#0 should always be the active unit, then they can use switchscf when the firmware upgrade is complete.
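
If memory serves, switching back afterwards is a one-liner from the active unit (it prompts for confirmation):

    XSCF> switchscf -t Standby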

So I'm sure some people are out there now wondering what happens if the stand-by XSCFU is absent when you upgrade the active XSCFU. Well, the active XSCFU will hold on to the firmware image. When the active XSCFU sees the stand-by inserted, the user can run flashupdate -c sync to update the stand-by XSCFU from the active unit. The same command can be used when you replace the stand-by XSCFU with a new unit.

Thursday Nov 15, 2007

Behind the panel

I got a question from the field... We know the Sun SPARC Enterprise M-class servers have a serial EEPROM in the panel, but what's stored in it?

A good question, and after digging through various documents to see what Sun says about the panel, I see exactly why the question is being asked. The only reference I could find to the panel was in the Sun SPARC Enterprise Server Family Architecture white paper, which says:

    Operator Panel
    Mid-range and high-end models of Sun SPARC Enterprise servers feature an operator panel to display server status, store server identification and user setting information, change between operator and maintenance modes, and turn on power supplies for all domains. [Emphasis added.]
But what "server identification and user setting information" is stored in the panel, and what data isn't stored there? Luckily, I happen to know.

The panel contains a small SEEPROM which can be accessed from the service processor (the XSCFU). OK, maybe not so small. I forget the exact size, but it's much larger than the 256-byte SEEPROM used for some FRU identification, and much, much smaller than a 73GB SAS disk. Let's split the difference logarithmically and assume it's in the range of a few dozen kilobytes.

In an ideal world, the panel SEEPROM would contain all of the non-volatile data stored on the XSCFU, so if the XSCFU fails and a new unit is installed, it can fully recover its state from the panel SEEPROM. Sadly, due to space limitations, that's simply not feasible. The XSCFU reserves in excess of 10MB for error and fault logs alone, which could never fit in the SEEPROM.

Instead, the critical configuration data is stored in the panel SEEPROM. This includes hardware configuration (how XSBs are assigned to domains, network setup, etc.), software settings (whether ssh is enabled, the email address for email notifications, etc.), and locally created user accounts and privileges. Pretty much the result of every 'set*' command ends up in the panel SEEPROM.

The things that are not stored in the panel SEEPROM include error and fault log files, and FRU information.

If the XSCFU fails and is replaced, the new XSCFU on power-on will recognize that it's installed in a new chassis. It will then go out and read the panel SEEPROM, and rebuild its configuration data from the panel. It can then read the FRU ID SEEPROM from each FRU to rebuild its FRU inventory information. So the only real data that is lost is log files.

Tuesday Nov 06, 2007

Sun SPARC Enterprise M-class going strong

In yesterday's earnings announcement, Jonathan Schwartz noted, "We saw particular strength in our high-end systems lineup..." The high-end systems lineup he's referring to is the Sun SPARC Enterprise M-class server line, jointly developed by Sun and Fujitsu. It's a great feeling to work on a product for almost three years, watch as it enters the market, and see incredible demand. And it gives a great sense of pride that the product line is notable in its contribution to revenue and margin.

Since the M-class server line shipped in April, I've been blogging about the neat features and amazing capabilities of these servers. The CPU chip was developed by Fujitsu, but a lot of the usability, serviceability and functionality that makes it a great server (not just a great chip) came from Sun, and especially from my software team.

Sadly, most of the M-class software development work is done, and it's time to move on. I'm hoping to help bring other great features to our volume systems and x64 product line in my new role.

I have a couple of posts I've started that still need a little cleanup, but after that, I think I'll be putting this blog in mothballs. Maybe I'll start another blog... The Secrets of Thumper... :-)

Monday Nov 05, 2007

New Features in XCP1050

As I mentioned in my 01-Nov-2007 posting, the Sun SPARC Enterprise M-class server service processor firmware version XCP1050 and beyond have several new features that I wanted to blog about.

Servicetags

Servicetags is part of the Sun Connection infrastructure. The basic idea is to enable customers to better track their Sun assets, and by communicating with Sun, determine what updates are available, what needs patching, etc.

Servicetags were introduced in Solaris 10. It's essentially a piece of software that runs on a server and can communicate the list of software products installed on the server, the product versions, patch levels, and so forth. Customers can then run a Java application on their workstation to discover Sun products throughout their datacenter, and at their discretion, send that list to Sun to register the products and/or check for updates.

In XCP1050, the servicetags software now also runs on the service processor. This allows customers to discover the hardware assets in their datacenter, including the machine type, part number, and serial number.

On new machines, servicetags are enabled by default; if you upgrade from XCP1041 or earlier, you'll need to enable servicetags manually. The commands to manage servicetags on the service processor are setservicetag and showservicetag. The usage is very straightforward, for example:

    XSCF> setservicetag -c disable
    XSCF> showservicetag
    Disabled
You can download the discovery application here.

Browser User Interface

Anyone who used the Browser User Interface (called BUI, or Web User Interface) in XCP1041 or earlier probably found there were many tasks that could not be accomplished through the BUI, but required you to use the command line interface. In XCP1050, that all changed. Now, just about everything you could do through the command line can now be done through your web browser. A lot of hard work went into these BUI updates, and I think it really shows.

Fault LEDs and clearfault

It might be surprising, but the most difficult aspects of collaborating with another company on a new product were the things that seem the most trivial: bezel color, whether buttons in the Browser UI should have square or rounded corners, and when and how LEDs should blink. Of all these, I think LEDs were the most contentious.

Sun adheres to the ANSI/VITA 40-2003 Service Indicator Standard (SIS) for most of its products. Fujitsu, however, adheres to a different standard. The differences between the two standards are minor, but when a customer is managing a large number of systems, any variation in indicator standards can be a source of confusion. In XCP1040, we shipped with a compromise, meaning both Sun and Fujitsu were unhappy with the solution.

For XCP1050, we reworked the fault indicator policies for both companies. In fact, we added the ability for the firmware to tell whether the server is Sun-branded or Fujitsu-branded and, based on the branding, adhere to the respective company's fault LED standard. Now, on a Sun-branded system, for example, the fault LEDs adhere to a simple policy:

  • If the fault LED on a FRU (field-replaceable unit) is on, then there is a fault in the chassis and it has been isolated to that specific FRU with very high confidence. In other words, if the fault LED is on, then we know for a fact the FRU is broken.
  • If the chassis fault LED (on the front panel) is on, then there is a fault in the chassis somewhere.
Note that it's possible that the chassis fault LED is on, but no FRU LEDs are on; that can happen if the server cannot isolate the fault to a single FRU. Commands such as showstatus and fmadm faulty will identify the list of suspected FRUs.

Furthermore, on Sun-branded systems, cycling chassis power can no longer be used to clear the fault LEDs for FRUs or the chassis. FRUs don't magically become "better" just because you cycled power; if the FRU was faulty, it's still faulty, so the fault LED shouldn't magically turn off. On Sun-branded systems, the fault LEDs will remain on until the customer or service engineer actively clears the fault condition, either by removing and replacing the faulty FRU or by running the clearfault command.

In XCP1040, clearfault could be used to mark a FRU as not faulty; however, almost all FRUs still required a chassis power cycle. This was so that the chassis could perform a power-on self test of the FRU before reconfiguring it into a running server. The last thing you want is for someone to manually type clearfault /CMU#0 and then discover that CMU#0 really was faulty, bringing down the server.

XCP1050 was enhanced so that clearfault could selectively initiate self test on many FRUs, without requiring a chassis power cycle. When you run clearfault now, it will check to see if it is possible to run self test without disturbing the running system. Some FRUs cannot be tested during operation; other FRUs may be in use in such a way that self test cannot be performed. If the FRU is safe to be tested, clearfault will initiate the self test, and if successful, the fault condition will be cleared. If clearfault cannot test the FRU, it will be marked to be cleared at the next chassis power cycle.
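
For example, if showstatus has isolated a fault to a CMU and that FRU can be safely tested, clearing it might look like this (an illustrative sketch; the exact showstatus output format and any self-test progress messages will differ):

    XSCF> showstatus
    *   CMU#0 Status:Faulted;
    XSCF> clearfault /CMU#0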

A great deal of effort went into improving the fault detection, isolation, reporting, and service interface, to make it more accurate and more consistent with other Sun products.

Friday Nov 02, 2007

XCP firmware signed images

The Sun SPARC Enterprise M-class service processors excel in the area of security. Out of the box, all network services are disabled, and customers can decide which secure services, like https and ssh, to enable. There are no "well known" password accounts. And when users are given access, they are assigned privileges which range from domain operator (able to power a single domain on and off but not change the configuration) to platform administrator (able to set up and reconfigure the platform). But even platform administrators can't change their own privileges; that's reserved for users with user administrator privileges only. All this combines to ensure that the service processor is as secure as the Solaris domain. After all, if the service processor is compromised, the domains would be imminently at risk. In order for the server to be secure, the service processor must be secure.

But when you upgrade the firmware on the service processor, how do you know it hasn't already been compromised? Perhaps the image you thought you were downloading from sun.com was really a spoof. Or perhaps someone modified the image after it was downloaded to insert a Trojan Horse that would give them unrestricted access to the service processor. Even checking MD5 checksums isn't very reliable, since you typically download those from the same place you get the image, so they are equally suspect.

The XCP1050 release of the M-class service processor firmware introduced signed images. When Sun produces a service processor image, it adds a signature based on a private key known only to a select few people at Sun (I wrote the software, and even I don't have access to the private key). Then when you attempt to update the firmware, the flashupdate command will check the signature in the XCP image. If the check fails, flashupdate will prevent the install.

The getflashimage command will also check the signature when you initially download the image to the service processor. (Also, as mentioned in my posting about getflashimage back on 04-Jun-2007, I modified getflashimage to print the MD5 checksum of the image after it was downloaded, so you can verify that the download completed without corruption.) However, note that getflashimage just warns you if the signature check fails; it's flashupdate that enforces the signature check.
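
Putting the two commands together, downloading an image and verifying its signature before installing might look like this (the server name and file name are placeholders, and the messages each command prints will vary):

    XSCF> getflashimage http://imageserver/firmware/XCP1070.tar.gz
    XSCF> flashupdate -c check -m xcp -s 1070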

So how does this prevent hackers from inserting a trojan horse? The signature of the XCP image is checked against the public key already stored on the service processor. In other words, it's the public key that was contained in the previous XCP image. Without access to the private key used to create XCP version N-1, it's virtually impossible to create a new XCP image version N with a valid signature. Even if you knew the public key (you could, for example, disassemble the service processor hardware and read the bytes in the Flash PROMs using a PROM programmer to discover the public key), it's still unlikely that you could sign an image correctly ("unlikely" as in a probability less than 1 in 2^100).

In case you're wondering, yes, we did allow for changing the private/public key pair on a regular basis, and we also ensured that when customers upgrade non-sequentially (for example, from XCP1050 to XCP2010), they would not have to upgrade through every intermediate release. How? Well, let's just keep that a secret for now.

Thursday Nov 01, 2007

setupplatform and other new XSCF features

I see my last posting on this blog was back in July. After a busy August, I started a new position within Sun working on the x64 server family. But I see that XCP1060 firmware for the M-class service processor has been released, so let me clue you in on a few new features that made it into XCP1050 or XCP1060.

Some of the key new features in XCP1050 and XCP1060 are:

  • setupplatform
  • servicetags
  • Signed XCP releases
  • Web UI (aka Browser UI, or BUI)
  • Fault LEDs and clearing faults

setupplatform

My personal favorite new feature is the setupplatform command. As a development engineer, I've set up literally dozens if not hundreds of SPARC Enterprise M-class servers, and it's not easy -- you have to create users, set up the network and hostname, enable optional services, and so forth. The XSCF User Guide has an entire chapter on it. And every time I had to set up a platform, I had to refer back to the user guide.

So a couple of us decided to "wizard-ize" the setup process. Other Sun service processors had a command called 'setupplatform', so we started there. We figured the first thing someone does when they turn on a new machine is log into the service processor and create a user account with useradm (the ability to create more user accounts) and platadm (i.e., platform administrator, the ability to set up the rest of the platform) privileges. So the first thing setupplatform does is prompt for a user name and password for a new account with platadm and useradm privileges.

The next thing you normally do is set up the network -- service processor host name, IP address, netmask, default gateway, domain name server, and so forth. On service processors with multiple network interfaces, you need to set up each interface. The setupplatform command leads you through the process. Finally, you need to enable optional services, like the Web UI (aka Browser UI, or BUI), ssh, and ntp, and choose whether you want email notification of important events. That was provided in XCP1050; in XCP1060 we added support for setting the datacenter altitude and selecting the local timezone. When the setupplatform command is done and you've answered all the prompts, you're pretty much ready to configure your domain and boot Solaris.

If you find you've made a mistake, you don't have to re-do everything. For example, if you forgot to set up ssh, you could do 'setupplatform -p network' and answer "no" to everything until it prints "Do you want to set up ssh?" Answer "yes" and you're prompted with "Enable ssh service? [y|n]:", and a simple "yes" or "no" will enable or disable ssh. I know from memory that 'setssh -c enable' will enable ssh, but other tasks require more than a simple enable/disable argument (and I'm way too lazy to go read the man page).
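
And for simple changes like that, the direct commands really are one-liners. For instance (setssh is quoted above; sethttps is from memory, so check the man page before relying on it):

    XSCF> setssh -c enable
    XSCF> sethttps -c enable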

Here's a quick sample, creating a new platform administrator user account:

    XSCF> setupplatform -p user
    Do you want to set up an account? [y|n]: y
    Username: johndoe
    User id in range 100 to 65533 or leave blank to let the system
    choose one: 
         Username: johndoe
         User id: 
    Are these settings correct? [y|n]: y
    XSCF> adduser johndoe
    XSCF> setprivileges johndoe useradm platadm platop
    XSCF> password johndoe
    New XSCF password: 
    Retype new XSCF password: 
Note that the last few lines of output, while they start with the "XSCF>" prompt, are actually generated by the setupplatform command itself; these are the actual commands that setupplatform executed on your behalf. As a result, you always know exactly what setupplatform did to the system. And in the process, you might learn the syntax of the commands yourself so you can skip setupplatform for simple changes. I think it was a great idea, and I hope others appreciate that as well.

I have to admit, the prompts are pretty pedantic. For example, setupplatform always asks if you want to set up something, asks you how you want it set up, then summarizes your answers and asks you again if they're correct. But we wanted to make sure people wouldn't hit the wrong key and really screw up their system. Since this is all text-based, it's not like you can hit the "back" button.

I'll write about some of the other new features in coming posts...
