Friday Apr 25, 2008

XCP 1070 Now Available, with SPARC64-VII Jupiter Support

I see on Sun's Software Download Center that XCP 1070 firmware is now available for download. XCP is the firmware that runs on the Sun SPARC Enterprise M-class service processor. According to the Sun SPARC Enterprise M8000/M9000 Servers Product Notes, the one major new feature in XCP 1070 is:
    In XCP Version 1070, the following new feature is introduced:
    • Support for SPARC64® VII processors
The product notes refer to Solaris 10 5/08 (also available for download on the Software Download Center) for SPARC64-VII support.

One interesting limitation noted in the product notes is:

    For Solaris domains that include SPARC64 VII processors, a single
    domain of 256 threads or more might hang for an extended period of
    time under certain unusual situations. Upon recovery, the uptime
    command will show extremely high load averages.
The above limitation references Change Request CR6619224, "Tick accounting needs to be made scalable," which describes the problem in detail:
    Solaris performs some accounting and bookkeeping activities every
    clock tick. To do this, a cyclic timer is created to go off every
    clock tick and call a clock handler (clock()). This handler performs,
    among other things, tick accounting for active threads.
    Every tick, the tick accounting code in clock() goes around all the
    active CPUs in the system, determines if any user thread is running
    on a CPU and charges it with one tick. This is used to measure the
    number of ticks a user thread is using of CPU time. This also goes
    towards the time quantum used by a thread. Dispatching decisions are
    made using this. Finally, the LWP interval timers (virtual and profiling
    timers) are processed every tick, if they have been set.
    As the number of CPUs increases, the tick accounting loop gets larger.
    Since only one CPU is engaged in doing this, this is also single-threaded.
    This makes tick accounting not scalable. On a busy system with many CPUs,
    the tick accounting loop alone can often take more than a tick to process
    if the locks it needs to acquire are busy. This causes the invocations of
    the clock() handler to drift in time. Consequently, the lbolt drifts. So,
    any timing based on the lbolt becomes inaccurate. Any computations based
    on the lbolt (such as load averages) also get skewed.
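To make the scaling argument concrete, here's a toy model of the problem (my own Python sketch with made-up per-CPU costs, not Solaris code): one CPU walks every active CPU each tick, and once that walk takes longer than a tick, each clock() invocation slips a little further behind.

```python
# Toy model of single-threaded tick accounting; the costs are invented.
TICK_US = 10_000        # 100Hz clock: one tick every 10,000 microseconds
PER_CPU_US = 25         # assumed cost to account one busy CPU

def drift_after(n_ticks, n_cpus):
    """Microseconds of lbolt drift accumulated after n_ticks."""
    work = n_cpus * PER_CPU_US          # time spent inside clock() per tick
    overrun = max(0, work - TICK_US)    # how far the handler runs past the tick
    return n_ticks * overrun

print(drift_after(100, 64))    # 64 CPUs: the loop fits in a tick -> 0
print(drift_after(100, 512))   # 512 strands: steady lbolt drift -> 280000
```

With 512 strands in this model, every invocation of the handler overruns, so lbolt (and anything derived from it, such as load averages) drifts steadily, just as the change request describes.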

The issue of tick scalability has been around for a while. Eric Saxe mentions the issue in his May 21, 2007 blog entry tick, tick, tick.... The change request does say that the problem is already fixed in Solaris Nevada (OpenSolaris) build 81, so hopefully this limitation will be removed with an upcoming patch or release of Solaris 10.

Monday Apr 21, 2008

Jupiter Processor (SPARC64-VII) for Sun SPARC Enterprise M-Class Servers

In my last post, OpenSolaris support for Ikkaku: 2U Single-CPU SPARC Enterprise Server, some people noticed the line:
        Ikkaku is a 2U, single CPU version of SPARC Enterprise
        M-series (sun4u) server utilizing the SPARC64-VII
        (Jupiter) processor.
I've been asked by a few people about the new Jupiter processor for the SPARC Enterprise M-Class servers. Of course, I can't divulge any Sun or Fujitsu proprietary information. But anyone can go to and learn quite a bit...

Cores and Strands

The Jupiter CPU was announced even before the SPARC Enterprise servers began shipping in April 2007. The Jupiter processor's product name is going to be SPARC64-VII. I did a search for "sparc64-vii" and got a hit for FWARC 2007/411, Jupiter Device binding update, which tells us:
        Jupiter CPU is a 4 core variant of the current shipping
        Olympus-C CPU, which has 2 cores.  Both Olympus-C and
        Jupiter has 2 CPU strands per CPU core.
According to the Sun SPARC Enterprise Server Family Architecture white paper (published by Sun in April 2007), current SPARC Enterprise M-Class servers will be upgradable to the new Jupiter CPU modules. Among other things, this means that a fully loaded SPARC Enterprise M9000-64 with 64 CPU chips would have 256 cores, capable of running 512 concurrent threads in a single Solaris image. With 512 DIMMs, and assuming 4GB DIMMs are available soon, the system would max out at 2TB of RAM, enough to keep 512 threads pretty happy.
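Spelling out that arithmetic (a trivial Python check of the numbers in the paragraph above):

```python
chips = 64                # fully loaded M9000-64
cores = chips * 4         # Jupiter: 4 cores per chip
threads = cores * 2       # 2 strands per core
ram_gb = 512 * 4          # 512 DIMM slots at 4GB per DIMM

print(cores, threads, ram_gb)   # -> 256 512 2048 (i.e., 2TB)
```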

Shared Contexts

Another search uncovered this email notification that OpenSolaris now supports Shared Context for SPARC64-VII. Steve Sistare's blog describes Shared Context for the UltraSPARC T2 processor:
    In previous SPARC implementations, even when processes share physical memory, they still have private translations from process virtual addresses to shared physical addresses, so the processes compete for space in the TLB. Using the shared context feature, processes can use each other's translations that are cached in the TLB, as long as the shared memory is mapped at the same virtual address in each process. This is done safely - the Solaris VM system manages private and shared context identifiers, assigns them to processes and process sharing groups, and programs hardware context registers at thread context switch time. The hardware allows sharing only amongst processes that have the same shared context identifier. In addition, the Solaris VM system arranges that shared translations are backed by a shared TSB, which is accessed via HWTW, further boosting efficiency. Processes that map the same ISM/DISM segments and have the same executable image share translations in this manner, for both the shared memory and for the main text segment.
Presumably, Shared Context on Jupiter is similar if not identical.

Integer Multiply Add Instruction

If you go into the OpenSolaris source browser and search for "Jupiter" in the kernel source, you get about a half dozen C file hits. One of the files, opl_olympus.c, includes the following code, and more importantly, comments:

    /*
     * Set to 1 if booted with all Jupiter cpus (all-Jupiter features enabled).
     */
    int cpu_alljupiter = 0;

    /*
     * Enable features for Jupiter-only domains.
     */
    void
    cpu_fix_alljupiter(void)
    {
            if (!prom_SPARC64VII_support_enabled()) {
                    /*
                     * Do not enable all-Jupiter features and do not turn on
                     * the cpu_alljupiter flag.
                     */
                    return;
            }

            cpu_alljupiter = 1;

            /*
             * Enable ima hwcap for Jupiter-only domains.  DR will prevent
             * addition of Olympus-C to all-Jupiter domains to preserve ima
             * hwcap semantics.
             */
            cpu_hwcap_flags |= AV_SPARC_IMA;
    }
The comments in the above snippets imply that a kernel can either be in "all-Jupiter" mode, or "not all-Jupiter" mode. From the Sun SPARC Enterprise M4000/M5000/M8000/M9000 Servers Administration Guide we get the following description of the two modes:
    A SPARC Enterprise M4000/M5000/M8000/M9000 server domain runs in one of the following CPU operational modes:
    • SPARC64 VI Compatible Mode - All processors in the domain - which can be SPARC64 VI processors, SPARC64 VII processors, or any combination of them - behave like and are treated by the OS as SPARC64 VI processors. The new capabilities of SPARC64 VII processors are not available in this mode.
    • SPARC64 VII Enhanced Mode - All boards in the domain must contain only SPARC64 VII processors. In this mode, the server utilizes the new features of these processors.
Based on the source code, the main difference with SPARC64 VII Enhanced Mode appears to be the addition of the AV_SPARC_IMA hardware capability, which is an Integer Multiply-Add (IMA) instruction (see also CR6591339).

Integer Multiply-Add is important to cryptographic algorithms, and presumably Jupiter's ima instruction is similar to the xma instruction of the Itanium and other processors. The xma instruction takes three operands A, B and C, and produces the result A*B+C in a single instruction. Integer Multiply-Add instructions have a significant impact on RSA key generation and other cryptographic algorithms.
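To see where a multiply-add shows up in RSA-style math, consider schoolbook multi-precision multiplication: the inner loop does exactly one integer multiply-add per digit pair. A Python sketch of the idea (my own illustration; the 64-bit digit width and names are assumptions, not Jupiter's actual ima semantics):

```python
# Schoolbook multiplication over 64-bit "digits" (little-endian digit order).
# The inner loop is one multiply-add step per digit pair, which is the
# operation a hardware multiply-add instruction accelerates.
BASE = 1 << 64

def bignum_mul(a_digits, b_digits):
    out = [0] * (len(a_digits) + len(b_digits))
    for i, a in enumerate(a_digits):
        carry = 0
        for j, b in enumerate(b_digits):
            t = a * b + out[i + j] + carry   # the multiply-add step
            out[i + j] = t % BASE
            carry = t // BASE
        out[i + len(b_digits)] += carry
    return out

# (2**64 + 5) * 3, expressed in base-2**64 digits:
print(bignum_mul([5, 1], [3]))   # -> [15, 3, 0]
```

RSA key generation spends most of its time in loops like this, so collapsing the multiply and add into one instruction directly shortens the critical path.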


In summary, what I've learned by searching Sun's own web page:
  • Jupiter has four cores, two threads per core.
  • It supports Shared Context, for improved performance on applications that use a lot of shared memory.
  • It supports a new Integer Multiply-Add instruction to improve cryptographic algorithms.
  • It can be installed on SPARC Enterprise M-class servers.
I haven't seen anything about improved clock speeds, but all in all, a pretty significant improvement to what is already a great product.

That's about all I can say. We'll all have to keep waiting, and watching, for more Jupiter info as Sun decides to release it.

Thursday Mar 27, 2008

OpenSolaris support for Ikkaku: 2U Single-CPU SPARC Enterprise Server

I saw this on the OS/Net flag day announcement alias:
    With the putback of
         6655597  Support for SPARC Enterprise Ikkaku
    Solaris Nevada now supports the new Ikkaku model of
    SPARC Enterprise M-series (OPL) family of servers.
    Official product name for Ikkaku is not yet finalized.
    Ikkaku is a 2U, single CPU version of SPARC Enterprise
    M-series (sun4u) server utilizing the SPARC64-VII
    (Jupiter) processor.
All I know is what I've read on, but if you'd like to learn more, try searching for Ikkaku on

By the way, according to the announcement, Ikkaku is Japanese for Narwhal.

Tuesday Jul 10, 2007

SPARC Enterprise M4000/M5000 Power Calculator

Chris Kevlahan here at Sun has done extensive power measurements of the SPARC Enterprise M4000/M5000 servers, and put together a spreadsheet to estimate power usage based on machine configuration. I turned that into Javascript so I could embed it in this blog.

The purpose of this tool is:

  • Estimate and calculate the power consumption of a planned configuration.
  • Estimate and calculate the cooling requirements of a planned configuration.
Due to proprietary agreements with Fujitsu, the power calculator has been removed from this blog. A few notes that accompanied the calculator:
  • Total number of memory boards must be 2 or more.
  • Total number of memory boards must not exceed 4 (M4000) or 8 (M5000).
  • Power usage for PCI/PCI-Express cards is estimated using 14 Watts average; however, 25 Watts is the maximum.

Tuesday Jun 12, 2007

Mapping LSBs to physical system boards

In a prior posting, SBs, XSBs, and LSBs, I wrote how Sun SPARC-Enterprise M-class servers use logical system board (LSB) numbers, not physical system board (SB) numbers, to assign CPU IDs and I/O bus addresses in Solaris. One question that commonly arises is: in Solaris, how can you figure out which physical system board a given CPU ID or I/O device path is on?

For I/O, the easiest method happens to be the fmtopo command. fmtopo will list all of the I/O devices, the device path, the FRU (Field Replaceable Unit) and FRU Label. Here's a snippet of the output showing the device node /devices/pci@90,600000:

    # /usr/lib/fm/fmd/fmtopo -p
            ASRU: dev:////pci@90,600000
            FRU: hc:///component=iou#0
            Label: iou#0
The PCI-Express root complex pci@90,600000 belongs to the logical system board (LSB) 9 (you can tell from the "90" after the @ sign). But the output of fmtopo shows that the device is actually on the physical FRU iou#0, which is part of SB#0.

Now, knowing that LSB#9 is really SB#0, one can infer that the cpuids associated with LSB#9 (i.e., cpuids 9*32 through 10*32-1, or 288-319) are also on SB#0.
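That cpuid arithmetic is easy to capture in a helper (a Python sketch of the 32-cpuids-per-LSB mapping described above; the function names are my own):

```python
CPUIDS_PER_LSB = 32   # cpuid range of LSB n is n*32 through (n+1)*32 - 1

def cpuid_to_lsb(cpuid):
    """Which logical system board a cpuid belongs to."""
    return cpuid // CPUIDS_PER_LSB

def lsb_cpuid_range(lsb):
    """First and last cpuid assigned to a given LSB."""
    return (lsb * CPUIDS_PER_LSB, (lsb + 1) * CPUIDS_PER_LSB - 1)

print(lsb_cpuid_range(9))   # -> (288, 319), matching LSB#9 above
print(cpuid_to_lsb(300))    # -> 9
```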

So, how does fmtopo figure out how to map LSBs to SBs? Turns out that there is one memory controller per LSB, and the memory controller node has two properties of interest, board# and physical-board#. The board# is the LSB number, while the physical-board# is the SB number. Other nodes in the device tree have the board# property (I/O hostbridges, CPUs, etc), but only the memory controller node has the physical-board# property.

To see what I mean, you can use prtconf, for example:

    # /usr/sbin/prtconf -pv | grep "board#"
        physical-board#:  00000000
        board#:  00000009
        board#:  00000009
        board#:  00000009
        board#:  00000009
        board#:  00000009
        board#:  00000009
        board#:  00000009
        board#:  00000009
        board#:  00000009
        board#:  00000009
If you look at the full output of prtconf, you'll see that the first two lines belong to the memory controller node (pseudo-mc) with an LSB board# of 9 and an SB physical-board# of 0. The other board# properties belong to the CPUs and I/O host bridges.
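If you wanted to script this mapping, you could pair each physical-board# with the board# that follows it, since both properties live on the same pseudo-mc node. A rough Python sketch (the parsing logic is my own and assumes physical-board# precedes board# in the output, as it does above):

```python
# Hypothetical parser for `prtconf -pv | grep "board#"` style output:
# pair each physical-board# (SB) with the next board# (LSB) seen.
def lsb_to_sb_map(prtconf_lines):
    mapping, sb = {}, None
    for line in prtconf_lines:
        line = line.strip()
        if line.startswith("physical-board#:"):
            sb = int(line.split()[-1], 16)
        elif line.startswith("board#:") and sb is not None:
            mapping[int(line.split()[-1], 16)] = sb
            sb = None   # only the memory controller's board# pairs up
    return mapping

sample = [
    "physical-board#:  00000000",
    "board#:  00000009",
]
print(lsb_to_sb_map(sample))   # -> {9: 0}, i.e. LSB#9 is SB#0
```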

Monday Jun 04, 2007


The SPARC Enterprise M-class XSCF User's Manual chapter 8.1.10, "Firmware Update Procedure" will tell you: "XCP import is done by using the XSCF Web", indicating that you need to use your web browser and the Browser User Interface (BUI) feature of the XSCF (the service processor) to upload an XCP firmware image (the software that runs on the service processor). But that really isn't the only way.

XCP 1040 (with which most systems have shipped) includes another, undocumented command, called getflashimage(8M). XCP 1041 is now available, and upgrading using getflashimage and flashupdate is easy -- it takes just one minute to download the image with getflashimage, and 15 minutes to install the new version with flashupdate.

Since getflashimage didn't make the XSCF User's Manual, I thought I'd give you a brief overview.

getflashimage allows you to log on to the XSCF and download an XCP image to it. getflashimage works much like wget: you provide a URL and getflashimage downloads the file. It supports the http, https and ftp protocols, and will even allow you to download the XCP image from a USB flash device (which is useful if your XSCF is not connected to the network where the XCP images are).

The synopsis for getflashimage is:

   getflashimage [-v] [-q -{y|n}] [-u username] [-p proxy [-t proxy_type]] URL
   getflashimage -l
   getflashimage [-q -{y|n}] -d
   getflashimage -h
I think most of the options in the first synopsis are pretty straightforward (including the standard '-v' for "verbose", '-q' for "quiet", and "-{y|n}" for "yes" or "no"). For example, I can download a flash image from an https server that requires user authentication by doing:
    XSCF> getflashimage -u rjh https://imageserver/images/FFXCP1041.tar.gz
where "rjh" is my webserver user name. getflashimage will prompt for my password, and in about a minute, the image will be downloaded and ready for flashupdate(8M). If you're having problems accessing the server (and the error messages aren't sufficiently clear), you can use the -v option to view the protocol exchange with the server, and see the exact error codes returned by the server (if you find yourself having to use "-v", please let me know what the problem was, and I'll try to improve the error messages). And of course the -p and -t can be used to access the web through a proxy, where -p is the proxy name or IP address, and -t can be used to specify the proxy type (http, socks4 or socks5; the default proxy type is http).

In hindsight, it probably would have made sense to do some integrity checking on the image file, maybe verifying its checksum during the download process. Maybe I'll add that to the next release.

The second synopsis line, 'getflashimage -l' (small L), allows you to list the image file that was downloaded, just in case you forgot whether you finished downloading. It will also display the image file size and download date.

The third synopsis line, 'getflashimage -d', lets you delete any and all image files previously downloaded. If you've done the download and flashupdate, you can delete the image at any time, but it doesn't hurt to leave the file around (the space is reserved and can't be used for anything else). On the other hand, if you downloaded an image file and decided you did not want to flashupdate it (perhaps you downloaded the wrong version), you might want to delete it immediately so you don't accidentally flashupdate it later.

getflashimage can be a real lifesaver if you have a slow connection to the XSCF. For example, when I connect from my home (in the Boston area) to Sun's lab in San Diego, using the Browser User Interface can be slow because I have to ftp the XCP image from San Diego to my home workstation, then upload it using Firefox back to the XSCF in San Diego -- a 6,000 mile journey. But with getflashimage, I can ssh to the XSCF, then use getflashimage to very quickly load an XCP image from a file server to the XSCF on the same subnet, without the bits needing to travel cross-country to me on the East Coast. I also tend to prefer command lines over GUIs.

A couple of definitions for readers new to these servers:
  • XSCF: The service processor (called a system controller on Sun Fire systems).
  • XCP: The firmware that runs on the XSCF (similar to SMS or SCAPP on Sun Fire systems). XCP also includes POST and OBP.

Friday Jun 01, 2007

XCP 1041 Now Available

Sun SPARC Enterprise M-class service processor firmware release XCP 1041 is now available for download. I was looking at the product notes for XCP 1041, and it's not obvious what new features are in XCP 1041, so here's a quick summary of what's new:
  • Capacity on demand: If you're not familiar with COD, let me summarize by saying that you can get a CPU/Memory Unit (CMU) at a low up-front cost, but not pay the rest until you actually need and use the CPUs and memory. For example, you could buy an M5000 server with four CPUs and get four extra COD CPUs. Normally, you'd use the system with four CPUs. But if you needed more compute capacity, you could call up Sun, buy a license to add another CPU, enter the license key, and then power on the CPU and use it. No boards to install. No downtime. Just enable the CPU and go. We've had this feature for the Sun Fire 3800-6900 and 12K-25K, and now it's available on the M-class servers. I don't think the SPARC Enterprise COD info is posted on yet, but here's a link to the Sun Fire COD web page.
  • External I/O Expansion Unit: XCP 1041 includes full support for the Sun External I/O Expansion Unit (during development, we just called it "IO Box" for short). This 4RU chassis can connect to a host PCI-Express slot, and gives you six additional PCI-Express or PCI-X slots. Since the link connecting the IO Box and the host is fibre optic, the IO Box can be in the next rack, or across the room. The IO Box product isn't shipping yet, so I'll probably do a full blog posting when it does ship.
In addition to the new features, we fixed a small number of bugs.

If you're going to download and install XCP 1041, you'll want to read the product notes about getflashimage. I'll write a separate posting about that.

Wednesday May 30, 2007

PCI-Express Relaxed Ordering and the Sun SPARC Enterprise M-class Servers

PCI-Express supports the relaxed ordering mechanism originally defined for PCI-X. Relaxed ordering allows certain transactions to violate the strict-ordering rules of PCI; that is, a transaction may be completed prior to other transactions that were already enqueued. In PCI-Express, if the transaction layer protocol (TLP) header has the Relaxed Ordering attribute set to 1, the transaction may use relaxed ordering. In particular, a memory write transaction with the RO attribute set to 1 is allowed to complete before prior write transactions enqueued in the hostbridge ahead of it.

The Jupiter Interconnect

Relaxed Ordering is important to the Sun SPARC Enterprise M-series server I/O architecture. The SPARC Enterprise servers use a network of switches and crossbars to connect CPUs, memory access controllers (MACs), and I/O controllers (IOCs). This internal network, called the Jupiter interconnect (it is sometimes called the Jupiter bus, although it's not a "bus" at all), employs error detection and correction mechanisms, and will retry transactions if a protocol error occurs between two nodes in the network. As a result, it is possible for one transaction to "pass" another transaction, such that the agent issuing the transactions sees them complete in the opposite order from the order in which they were issued.

For example, consider an M9000 system with two system boards shown in the following figure:

[Figure: System Board #1 and System Board #2, each connected to the XBU (Crossbar Unit).]

Each system board has four I/O hostbridges, four MACs (memory access controllers), and four CPU chips. The IOCs, MACs and CPU chips on a system board are interconnected by four SC chips (system controller), and the SCs connect the system boards to the crossbar unit (XBU). [If you're curious, each SC has direct connections to all four MACs, all four CPU chips, and one of the hostbridges.]

Now let's take two simple transactions: Transaction TA and Transaction TB. Transaction TA is a write from a hostbridge on system board #1 to memory on system board #2. The transaction must go from the hostbridge to the SC on system board #1 (SC#1), then to the XBU, then to the SC on system board #2 (SC#2), then to the MAC on system board #2 (MAC#2). Transaction TB is a write from the same hostbridge to memory on the same system board. This transaction must go from the hostbridge to the SC on system board #1 (SC#1), then directly to the MAC on system board #1 (MAC#1). The following scenario shows how the transactions could get reordered while in flight:

  1. Hostbridge issues transaction TA to SC#1 on the same system board.
  2. Hostbridge issues TB to SC#1.
  3. SC#1 issues TA to XBU.
  4. SC#1 issues TB to MAC#1 on same system board.
  5. XBU issues TA to SC#2 on destination system board.
  6. MAC#1 commits data to RAM, sends acknowledge back to hostbridge that TB is complete.
  7. SC#2 issues TA to MAC#2.
  8. MAC#2 commits data to RAM, sends acknowledge back to hostbridge that TA is complete.
Since the path from the hostbridge to MAC#1 is shorter (2 hops) than from the hostbridge to MAC#2 (4 hops), transaction TB completes prior to transaction TA.
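A toy timing model makes the reordering easy to see: give each hop a unit cost and complete each transaction at its issue time plus its hop count. The hop costs are illustrative, not real Jupiter interconnect latencies:

```python
# Hop counts from the scenario above (each hop costs one time unit).
HOPS = {"TA": 4,   # hostbridge -> SC#1 -> XBU -> SC#2 -> MAC#2
        "TB": 2}   # hostbridge -> SC#1 -> MAC#1

def completion_order(issue_order):
    """Order in which transactions complete, given the order issued."""
    done = {t: i + HOPS[t] for i, t in enumerate(issue_order)}
    return sorted(issue_order, key=lambda t: done[t])

print(completion_order(["TA", "TB"]))   # -> ['TB', 'TA']: TB passes TA
```

Even though TA is issued first, its longer path means TB's acknowledgment arrives first, which is exactly the reordering the scenario walks through.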

In order for the hostbridge to maintain the strict PCI ordering rules, it is necessary for the hostbridge to wait until the first transaction completes before issuing the next transaction. Using the above example, if TA and TB must adhere to the PCI strict ordering rules, the scenario would look very different:

  1. Hostbridge waits for all outstanding writes to be acknowledged.
  2. Hostbridge issues TA to SC#1.
  3. SC#1 issues TA to XBU.
  4. XBU issues TA to SC#2 on destination system board.
  5. SC#2 issues TA to MAC#2.
  6. MAC#2 commits data to RAM, sends acknowledge back to hostbridge that TA is complete.
  7. Hostbridge issues TB to SC#1.
  8. SC#1 issues TB to MAC#1.
  9. MAC#1 commits data to RAM, sends acknowledge back to hostbridge that TB is complete.
In the above scenario, the hostbridge is unable to initiate transaction TA until all outstanding write transactions are complete. Then, after initiating transaction TA, the hostbridge cannot start transaction TB until it receives confirmation that TA has completed. And until TB is acknowledged, the hostbridge is unable to initiate any other writes, whether strictly ordered or relaxed ordered. This means the hostbridge is unable to pipeline writes, which can limit write bandwidth: the bandwidth is limited by the latency from the IOC to memory and back. On an M4000, the latency is very low, so the effect of strictly ordered writes is small. However, on an M9000-64, where the hostbridge and MAC can be in different cabinets, the latency can be very large; if relaxed ordering is not enabled, the hostbridge write-to-memory bandwidth can be significantly affected.
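A back-of-the-envelope model shows why latency dominates when writes can't be pipelined: with only one strictly ordered write in flight at a time, bandwidth is one payload per round trip. The payload size and latencies below are my assumptions for illustration, not measured values:

```python
def strict_write_bw(payload_bytes, rtt_us):
    """Non-pipelined write bandwidth: one payload per IOC-to-memory round trip.
    bytes per microsecond is numerically MB/s."""
    return payload_bytes / rtt_us

print(strict_write_bw(64, 0.5))   # short on-board path       -> 128.0 MB/s
print(strict_write_bw(64, 2.0))   # cross-cabinet path, 4x RTT -> 32.0 MB/s
```

Quadrupling the round-trip latency cuts the strictly ordered write bandwidth by a factor of four, while relaxed-ordered (pipelined) writes would be unaffected by the extra latency.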

When to use Relaxed Ordering

In most cases, relaxed ordering cannot be enabled on every transaction. Take, for example, a typical network interface card (NIC) architecture. The NIC might write a large number of data blocks, followed by an update of a descriptor block indicating that the data is available. When the driver sees the descriptor updated, it goes and processes the data. It doesn't matter in what order the data blocks get committed to RAM. But the descriptor must be written after all of the data is in RAM; otherwise, the driver might see the descriptor get updated, and read the partially-updated data buffer.

Therefore, the data writes can employ relaxed ordering; the descriptor must be strictly ordered so that it will not pass the data writes. Assuming the number and size of data transactions are much larger than descriptor updates, the system will see high write-to-memory performance when relaxed ordering is enabled on the data transactions.

An I/O device should only set the relaxed ordering bit in the TLP header if the device is smart enough to know which transactions can be reordered without causing data corruption. Unfortunately, we've encountered some devices which set the relaxed ordering bit incorrectly.

Enabling Relaxed Ordering in Solaris

The Sun SPARC Enterprise servers are the first SPARC servers from Sun that support relaxed ordering. When we first started testing with hardware, we found that several cards did not support relaxed ordering, or did not support it correctly. The SAS controller used in the M4000/M5000 servers, for example, does not support relaxed ordering. The Gigabit Ethernet controller, on the other hand, incorrectly set the relaxed ordering bit in the TLP header for all transactions, including control block updates. To deal with this, we had to turn off relaxed ordering in the GBE controller itself.

Even though the device hardware did not support (or did not enable) relaxed ordering, good throughput from these devices required that they allow relaxed ordering on the Jupiter Interconnect. To deal with this, Sun added a new flag, DDI_DMA_RELAXED_ORDERING, which allows a device driver to specify which DMA buffers may be relaxed ordered. We also modified the SAS and GBE drivers to tag data buffers with the DDI_DMA_RELAXED_ORDERING bit; control buffers were not tagged.

To enable relaxed ordering, a device driver must set DDI_DMA_RELAXED_ORDERING in the dma_attr_flags field of the ddi_dma_attr_t(9S) structure passed to ddi_dma_alloc_handle(9F). Per the ddi_dma_attr_t man page:


         This optional flag can be set if  the  DMA  transactions
         associated  with this handle are not required to observe
         strong DMA write ordering among each other, nor with DMA
         write transactions of other handles.

         It allows the host bridge to transfer data to  and  from
         memory  more  efficiently  and  may result in better DMA
         performance on some platforms.
For an example of driver code which uses the DDI_DMA_RELAXED_ORDERING flag to enable relaxed ordering on data buffers, see the bge driver on
    /*
     * Enable PCI relaxed ordering only for RX/TX data buffers
     */
    if (bge_relaxed_ordering)
            dma_attr.dma_attr_flags |= DDI_DMA_RELAXED_ORDERING;

System Considerations

If you're going to deploy a system with a mix of I/O devices that support relaxed ordering (either in the TLP header, or using the DDI_DMA_RELAXED_ORDERING flag) and I/O devices that do not support relaxed ordering, you should consider the system impact.

Take, for example, the I/O architecture of a system board on a Sun SPARC Enterprise M9000 server:

Architecture for the M8000/M9000 I/O Unit
         |  IOC 0   |
         |  ______  | x8
Jupiter  | | Host |-+---------------[ PCI-E Slot 0
---------+-|Bridge| | x8
Interface| |______|-+---------------[ PCI-E Slot 1
         |  ______  | x8
Jupiter  | | Host |-+---------------[ PCI-E Slot 2
---------+-|Bridge| | x8
Interface| |______|-+---------------[ PCI-E Slot 3
         |  IOC 1   |
         |  ______  | x8
Jupiter  | | Host |-+---------------[ PCI-E Slot 4
---------+-|Bridge| | x8
Interface| |______|-+---------------[ PCI-E Slot 5
         |  ______  | x8
Jupiter  | | Host |-+---------------[ PCI-E Slot 6
---------+-|Bridge| | x8
Interface| |______|-+---------------[ PCI-E Slot 7
[Forgive the ASCII art -- I'm in Engineering, not Marketing.]

The above diagram shows that an M9000 I/O Unit has two I/O controller chips, each IOC has two hostbridges, and each hostbridge contains the root complexes for two PCI-Express slots.

The ideal situation is for all the cards to enable relaxed ordering. On the other hand, let's say you have one legacy card that does not support relaxed ordering (perhaps it's a low-performance card whose vendor did not feel throughput, and therefore relaxed ordering, was important). If you put this low-performance card in, for example, slot 0, along with a high-performance card that supports relaxed ordering in slot 1, both cards will share a single hostbridge and therefore a single Jupiter Interconnect interface. If the hostbridge has some strictly-ordered writes to memory from card 0, the relaxed-ordered writes from card 1 may queue up behind the strictly-ordered writes.

For comparison, here is the I/O Unit for an M4000/M5000:

Architecture for the M4000/M5000 I/O Unit
                                        |      |-- SAS Controller
                                        |PCI-X | 
                               ______   |Bridge|-- Gigabit Ethernet
          __________          |      |--|      |
         |   IOC    |         |PCI-E |  |______|--[ PCI-X Slot 0
         |  ______  | x8      |Switch|
Jupiter  | | Host |-+---------|      |------------[ PCI-E Slot 1
---------+-|Bridge| | x8      |______|
Interface| |______|-+-----------------------------[ PCI-E Slot 2
         |  ______  | x8
Jupiter  | | Host |-+-----------------------------[ PCI-E Slot 3
---------+-|Bridge| | x8
Interface| |______|-+-----------------------------[ PCI-E Slot 4

In this case, PCI-X slot 0 and PCI-E slots 1 and 2 all share a hostbridge, while PCI-E slots 3 and 4 share the other hostbridge (there is only one IOC with two hostbridges on an M4000/M5000 I/O Unit). While the hostbridge-to-memory latency is not as large on the M4000/M5000 systems, mixing cards that support relaxed ordering under the same hostbridge as cards that require strict ordering can impact I/O throughput. Note that the SAS controller and the Gigabit Ethernet controller already have relaxed ordering enabled using the DDI_DMA_RELAXED_ORDERING flag in their respective drivers.

To maximize write-to-memory throughput, it is best to group cards that do not enable relaxed ordering together below the same set of hostbridges, and group high-performance cards that enable relaxed ordering together below a different set of hostbridges. At the same time, you don't want to oversubscribe the hostbridge. The hostbridge can easily handle a single x8 PCI-Express link writing at its top bandwidth of about 1.7 GB/s; however, two high-performance x8 cards could be limited by the hostbridge's Jupiter interface bandwidth of 2.1 GB/s. Of course, the best arrangement of I/O cards may depend on other factors as well; relaxed ordering is just one thing to keep in mind when building a system.
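Plugging in the bandwidth figures quoted above gives a quick oversubscription check (Python; only the 1.7 GB/s and 2.1 GB/s numbers come from this paragraph, the rest is my sketch):

```python
X8_PEAK_GBS = 1.7       # top write bandwidth of one x8 PCI-Express link
JUPITER_IF_GBS = 2.1    # hostbridge's Jupiter interface bandwidth

def hostbridge_oversubscribed(n_fast_cards):
    """True if n high-performance x8 cards could exceed the Jupiter link."""
    return n_fast_cards * X8_PEAK_GBS > JUPITER_IF_GBS

print(hostbridge_oversubscribed(1))  # -> False: one card fits comfortably
print(hostbridge_oversubscribed(2))  # -> True: 3.4 GB/s demand vs 2.1 GB/s
```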

Tuesday May 22, 2007

SBs, XSBs, and LSBs

The Sun SPARC Enterprise M-series servers introduce several new configuration options compared to the Sun Fire 6900/25K family. In my posting eXtended System Boards I explained XSBs -- how a single physical system board (SB) can be partitioned and configured into domains at the granularity of a CPU. The Sun SPARC Enterprise servers also support a concept called Logical System Boards, or LSBs. LSBs add a new dimension of configuration.

Physical System Boards and the Sun Fire 6800

In the past (using the Sun Fire 6800 as an example, since I happen to have one handy), an SB could be configured into a domain, and the resources on that SB were identified to Solaris based on the board number; similarly, if you knew the resource id, you could infer the physical system board it is on. For example, if psrinfo in Solaris showed

    % psrinfo
    0       on-line   since 03/01/2007 12:16:43
    1       on-line   since 03/01/2007 12:16:44
    2       on-line   since 03/01/2007 12:16:44
    3       on-line   since 03/01/2007 12:16:44
    8       on-line   since 03/01/2007 12:16:44
    9       on-line   since 03/01/2007 12:16:44
    10      on-line   since 03/01/2007 12:16:44
    11      on-line   since 03/01/2007 12:16:44
you could infer that your domain consisted of system boards 0 and 2 (the CPU IDs on an SB start at the SB number times 4, so SB0 contains CPU IDs 0 through 3, while SB2 contains CPU IDs 8 through 11). The PCI hostbridge bus addresses are assigned in a similar fashion. For example:
    % ls -1d /devices/ssm@0,0/pci@\*000
shows the hostbridges on I/O boards 6 and 8. (The math here is a bit more complex. I/O board numbers start at 6 with bus address 0x18, and each I/O board has two host bridges, so IB6 has pci@18 and pci@19, while IB8 has pci@1c and pci@1d.)
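The physical addressing rules above can be captured in a couple of helpers. This is just a sketch of the math described in the text (four CPU IDs per SB; I/O boards starting at 6 with bus address 0x18 and two hostbridges each); the function names are mine:

```python
# Sun Fire 6800 physical addressing rules, as described above.

def sb_for_cpuid(cpuid):
    """Physical system board hosting a given CPU ID (4 CPU IDs per SB)."""
    return cpuid // 4

def hostbridges_for_ib(ib):
    """Bus addresses of the two hostbridges on I/O board `ib` (ib >= 6)."""
    base = 0x18 + 2 * (ib - 6)
    return [f"pci@{base:x}", f"pci@{base + 1:x}"]

# CPU IDs 0-3 and 8-11 imply system boards 0 and 2:
print(sorted({sb_for_cpuid(c) for c in [0, 1, 2, 3, 8, 9, 10, 11]}))  # [0, 2]
print(hostbridges_for_ib(6))  # ['pci@18', 'pci@19']
print(hostbridges_for_ib(8))  # ['pci@1c', 'pci@1d']
```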

If a system board experienced a fault and needed to be replaced, or worse, the system board slot was at fault so you could not simply replace the system board, you could reconfigure the system from the System Controller to add CPUs, memory or IO from a different system board to restore the domain to full power. You could, for example, configure SB0 out of the domain, and configure SB1 into the domain. At that point, the domain would be running with CPU IDs 4 through 11 (4 through 7 on SB1, and 8 through 11 on SB2). Similarly, you could replace IB6 with IB7, and the PCI hostbridges would change from pci@18 and pci@19 to pci@1a and pci@1b.

That's all fine, unless your boot device was hanging off IB6. Even if you moved the boot device to IB7, the device paths would all be different. The boot device that was "/devices/ssm@0,0/pci@18,700000/pci@1/SUNW,isptwo@4/sd@0,0:a" would change to "/devices/ssm@0,0/pci@1a,700000/pci@1/SUNW,isptwo@4/sd@0,0:a".

In effect, the CPU IDs and hostbridge bus addresses are physical addresses -- they are calculated based on the physical location of the board.

Logical System Boards and the Sun SPARC Enterprise Servers

The Sun SPARC Enterprise Servers introduce a new concept called Logical System Boards, or LSBs. The LSB number defines the way the CPUs and I/O on an extended system board (or XSB) are identified by a domain.

When an XSB is assigned to a domain, it is given a logical system board number. In effect, the LSB number is a virtual address. And as the Sun Fire 6800 assigns CPU IDs based on physical system board number, the Sun SPARC Enterprise assigns CPU IDs based on logical system board number. The same is true for hostbridge bus addresses. The CPUs on LSB 0 are assigned CPU IDs from the range 0 through 31, and the CPUs on LSB 1 are assigned CPU IDs from the range 32 through 63, regardless of the physical system board hosting the CPU chips. Similarly, the first hostbridge on LSB 0 is pci@0,600000, while the first hostbridge on LSB 1 is pci@10,600000, and so forth.
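The logical addressing can be sketched the same way, using the ranges given above (32 CPU IDs per LSB, and hostbridge bus addresses stepping by 0x10 per LSB); again, the helper names are mine:

```python
# LSB-based (logical) addressing on the Sun SPARC Enterprise servers,
# as described above.

def cpuid_range_for_lsb(lsb):
    """CPU IDs assigned to CPUs on a given logical system board."""
    return range(lsb * 32, lsb * 32 + 32)

def first_hostbridge_for_lsb(lsb):
    """Device name of the first hostbridge on a given LSB."""
    return f"pci@{lsb * 0x10:x},600000"

print(list(cpuid_range_for_lsb(0))[:4])  # [0, 1, 2, 3]
print(list(cpuid_range_for_lsb(1))[:4])  # [32, 33, 34, 35]
print(first_hostbridge_for_lsb(0))       # pci@0,600000
print(first_hostbridge_for_lsb(1))       # pci@10,600000
```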

The mapping from LSB-to-XSB is user configurable; you can choose the LSB number for any XSB almost entirely at will, and LSB numbers can be re-used for different XSBs in different domains. As a result, it is possible that every domain in a chassis could have a CPU with cpuid 0. And every domain could have its boot device below /devices/pci@0,600000. You could have a domain that includes SB 0, 1 and 2 assigned as LSBs 0, 1 and 2, but for some reason it is necessary to replace SB 0. You could then assign SB 3 to the domain as LSB 0. The domain would continue to have cpuid 0 and /devices/pci@0,600000. If you move your boot device over to SB 3's I/O unit (either move the PCI-Express card, or simply move the internal SAS disk), you could boot the domain, and device paths and processor sets would remain unaffected.

If we use the analogy of virtual memory, the domain is the context, the LSB is the virtual address, and the SB (or more specifically, the XSB) is the physical address.

Configuring LSBs

With the added flexibility of XSBs and LSBs, the process of configuring a domain requires some extra steps. The Sun SPARC Enterprise M4000/M5000/M8000/M9000 Servers Administration Guide, Chapter 4 explains the process. The following is an example using a Sun SPARC Enterprise M5000, configured into two domains, each with one system board mapped to LSB 0.


The first step is to configure the system boards as either uni-XSB or quad-XSB mode using the setupfru command:
    XSCF> setupfru -x 1 sb 0
    XSCF> setupfru -x 1 sb 1
The above example places SB 0 and SB 1 in uni-XSB mode, so all of the resources on a system board are assigned to domains in a single configuration unit. At this point, SB 0 is referred to as the single XSB 00-0; SB 1 is referred to as the single XSB 01-0.


The next step is to establish the mapping from LSB to XSB. Just for illustration, I chose the following mapping:
  • Domain 0:
    • LSB 0 => XSB 00-0 (system board 0)
    • LSB 1 => XSB 01-0 (system board 1)
  • Domain 1:
    • LSB 15 => XSB 00-0 (system board 0)
    • LSB 0 => XSB 01-0 (system board 1)
Note that both domains have an LSB 0 (and they refer to different physical system boards). Also note that the domains have different mappings for the system boards. To define the mapping from LSB to XSB you use setdcl, which stands for "set domain component list". Here are the commands to set up the domains:
    # For domain 0, map LSB 0 to XSB 00-0 and LSB 1 to XSB 01-0
    XSCF> setdcl -d 0 -a 0=00-0 1=01-0
    # For domain 1, map LSB 15 to XSB 00-0 and LSB 0 to XSB 01-0
    XSCF> setdcl -d 1 -a 15=00-0 0=01-0
The fact that there's an LSB-to-XSB mapping for a system board in a domain's component list does not mean that the XSB is assigned to the domain. It only means that, once the XSB is assigned to the domain, this is the LSB number it will get.
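One way to picture the domain component lists configured above is as a per-domain map from LSB number to XSB. This is purely a mental model of the setdcl commands, not XSCF internals:

```python
# Per-domain component lists (DCLs) from the example above,
# modeled as {domain: {lsb: xsb}}.

dcl = {
    0: {0: "00-0", 1: "01-0"},   # setdcl -d 0 -a 0=00-0 1=01-0
    1: {15: "00-0", 0: "01-0"},  # setdcl -d 1 -a 15=00-0 0=01-0
}

# Both domains may have an LSB 0, but they point at different XSBs:
print(dcl[0][0])  # 00-0
print(dcl[1][0])  # 01-0

# An XSB may appear in several domains' DCLs; only the later addboard
# step actually assigns it, and to at most one domain at a time.
```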


So obviously the next step is to assign real XSBs to domains. We only have two SBs (and they're in uni-XSB mode, so we only have two XSBs), and two domains, so give each domain one XSB:
    XSCF> # Assign XSB 00-0 to domain 0
    XSCF> addboard -c assign -d 0 00-0
    XSB#00-0 will be assigned to DomainID 0. Continue?[y|n] :y
    XSCF> # Assign XSB 01-0 to domain 1
    XSCF> addboard -c assign -d 1 01-0
    XSB#01-0 will be assigned to DomainID 1. Continue?[y|n] :y
Once an XSB has been assigned to a domain, that domain owns the XSB; the XSB cannot be assigned to more than one domain. For example, if I tried to give XSB 00-0 to domain 1 after it has been assigned to domain 0:
    # Try to assign XSB 00-0 to domain 1 also
    XSCF> addboard -c assign -d 1 00-0
    XSB#00-0 is already assigned to another domain.

We can use showboards to see what we've done:

    XSCF> showboards -a
    XSB  DID(LSB) Assignment  Pwr  Conn Conf Test    Fault
    ---- -------- ----------- ---- ---- ---- ------- --------
    00-0 00(00)   Assigned    n    n    n    Unknown Normal
    01-0 01(00)   Assigned    n    n    n    Unknown Normal
The above shows that XSB 00-0 is assigned to domain 0 as LSB 0, and XSB 01-0 is assigned to domain 1, also as LSB 0.


Domain 0 should power on with SB 0 as LSB 0, and should have cpuids starting at 0 and hostbridges starting at pci@0,600000. Domain 1 should power on with SB 1 as LSB 0, also with cpuids starting at 0 and hostbridges starting at pci@0,600000. Just to prove that this worked as expected, I powered on the two domains. If I connect to the console for each domain, I get:
    XSCF> console -yq -d 0

    {0} ok show-disks
    a) /pci@0,600000/pci@0/pci@8/pci@0/scsi@1/disk
    Enter Selection, q to quit: q
    {0} ok
    {0} ok exit from console.

    XSCF> console -yq -d 1

    {0} ok show-disks
    a) /pci@0,600000/pci@0/pci@8/pci@0/scsi@1/disk
    Enter Selection, q to quit: q
    {0} ok
The {0} shows that both domains have a cpuid 0. And the show-disks shows that both domains have a hostbridge pci@0,600000; in fact, both domains have the exact same device path to completely different SCSI controllers. QED.

Thursday May 17, 2007

Compiler Optimizations for SPARC64-VI

Rupert Brauch, Staff Engineer, SPARC Code Generator for the Sun Studio Compilers (and extreme cyclist), in his blog Getting Peak Olympus Performance Using the Studio 12 Compilers writes about the new Sun Studio 12 Compilers optimizations for the SPARC64-VI CPU architecture. Here's a reprint of his blog in case you missed it:

    Getting Peak Olympus Performance Using the Studio 12 Compilers

    The new Sun Studio 12 compilers have been optimized to produce the best performance for SPARC64-VI binaries. It is possible to achieve gains of 30% or more over binaries compiled with the Studio 11 compilers.

    We recommend using the following options when compiling for SPARC64-VI:

    -xchip=sparc64vi
    Generate code that is tuned for SPARC64-VI. The binary will run on any SPARC processor, but will perform best on a SPARC64-VI system.

    -xarch=sparcfmaf -fma=fused
    Both of these options are necessary to enable the use of the new fused multiply-add instructions. The binary will only run on SPARC64-VI, and any future SPARC systems that support the fused multiply-add instructions. These instructions improve the performance of some floating point programs.

    -xtarget=sparc64vi
    A combination of -xchip=sparc64vi and -xarch=sparcfmaf. It is still necessary to use -fma=fused to enable fused multiply-add instructions.

Thanks Rupert!

Monday May 14, 2007

eXtended System Boards

Like the Sun Fire midrange (6800/6900) and high-end (15K/25K) servers, the Sun SPARC Enterprise M-series servers allow you to organize system boards (SBs) into hardware domains (called "Dynamic System Domains" by marketing). Hardware domains contain CPUs, memory, and I/O which are isolated from each other; one hardware domain may be powered on or off regardless of the other hardware domains. As on Sun Fire, SPARC Enterprise system boards consist of four CPU chip sockets, 32 DIMM sockets, and I/O.

The Sun SPARC Enterprise midrange and high-end servers, however, take system boards and hardware domains one step further. Physical system boards can be partitioned into four eXtended System Boards (XSBs).

The Sun SPARC Enterprise M4000 can have up to four CPU chips and is organized as a single system board. The M5000 can have up to eight CPU chips and is organized as two system boards. The M8000 and M9000 "system board" consists of a CPU/Memory Unit (CMU) plus an I/O Unit (IOU), which together form a system board. The M8000 can have up to four SBs, while the M9000-64 can have up to 16. When all of the resources on a system board are assigned to domains as a single group, the system board is said to be in "Uni-XSB" mode. The following tables show the CPU, memory, and I/O resources on the Sun SPARC Enterprise system boards in Uni-XSB mode:

M4000 in Uni-XSB Mode
    SB  CPUs           Memory    I/O
    00  CPU#0 - CPU#3  32 DIMMs  2 SAS Disks, 2 GBE Ports, PCI-X Slot#0, PCI-E Slots #1-#4

M5000 in Uni-XSB Mode
    SB  CPUs           Memory    I/O
    00  CPU#0 - CPU#3  32 DIMMs  2 SAS Disks, 2 GBE Ports, PCI-X Slot#0, PCI-E Slots #1-#4
    01  CPU#0 - CPU#3  32 DIMMs  2 SAS Disks, 2 GBE Ports, PCI-X Slot#0, PCI-E Slots #1-#4

M8000/M9000 SB in Uni-XSB Mode
    SB  CPUs           Memory    I/O
    XX  CPU#0 - CPU#3  32 DIMMs  PCI-E Slots #1-#8

Normally on a Sun Fire system you would only be able to create as many domains as you have system boards. However, with the Sun SPARC Enterprise servers, you can configure each system board into four XSBs (quad-XSB mode). This allows you to create domains as small as a single CPU, 8 DIMMs, and I/O. To make it easier to map XSBs back to the physical SB, the number used for XSBs is xx-y where xx is the physical system board, and y is the XSB on that system board. For example, 01-2 would refer to the XSB containing CPU#2 on physical system board #1. The next table shows how the various resources are partitioned among the four XSBs per SB.
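The xx-y naming convention described above is trivial to decode; here's a tiny illustrative parser (the function name is mine):

```python
# Parse the xx-y XSB naming convention: xx = physical system board,
# y = XSB index on that board.

def parse_xsb(name):
    """Split an XSB name like '01-2' into (physical SB, XSB index)."""
    sb, xsb = name.split("-")
    return int(sb), int(xsb)

print(parse_xsb("01-2"))  # (1, 2): CPU#2's XSB on physical system board 1
print(parse_xsb("00-0"))  # (0, 0)
```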

M4000 in Quad-XSB Mode
    SB  XSB   CPUs   Memory   I/O
    00  00-0  CPU#0  8 DIMMs  2 SAS Disks, 2 GBE Ports, PCI-X Slot#0, PCI-E Slots #1-#2
        00-1  CPU#1  8 DIMMs  PCI-E Slots #3-#4
        00-2  CPU#2  8 DIMMs  No I/O
        00-3  CPU#3  8 DIMMs  No I/O

M5000 in Quad-XSB Mode
    SB  XSB   CPUs   Memory   I/O
    00  00-0  CPU#0  8 DIMMs  2 SAS Disks, 2 GBE Ports, PCI-X Slot#0, PCI-E Slots #1-#2
        00-1  CPU#1  8 DIMMs  PCI-E Slots #3-#4
        00-2  CPU#2  8 DIMMs  No I/O
        00-3  CPU#3  8 DIMMs  No I/O
    01  01-0  CPU#0  8 DIMMs  2 SAS Disks, 2 GBE Ports, PCI-X Slot#0, PCI-E Slots #1-#2
        01-1  CPU#1  8 DIMMs  PCI-E Slots #3-#4
        01-2  CPU#2  8 DIMMs  No I/O
        01-3  CPU#3  8 DIMMs  No I/O

M8000/M9000 in Quad-XSB Mode
    SB  XSB   CPUs   Memory   I/O
    XX  XX-0  CPU#0  8 DIMMs  PCI-E Slots #1-#2
        XX-1  CPU#1  8 DIMMs  PCI-E Slots #3-#4
        XX-2  CPU#2  8 DIMMs  PCI-E Slots #5-#6
        XX-3  CPU#3  8 DIMMs  PCI-E Slots #7-#8

Note in the above table that on M4000 and M5000 servers, XSB 0 gets the internal disks, DVD, Gigabit Ethernet, PCI-X slot, and two PCI-Express slots. XSB 1 gets two PCI-Express slots. XSBs 2 and 3 have no I/O. This is a physical limitation -- the M4000 and M5000 I/O Units only have two PCI-Express hostbridges. So, while in theory you could create eight domains with a single CPU each, in reality a domain needs I/O, so you can only create four hardware domains in an M5000, and two in an M4000.

The M8000 and M9000 system boards, on the other hand, have symmetric XSBs -- each system board has four CPUs, 32 DIMMs, four PCI-Express hostbridges and 8 PCI-Express slots. When they're placed in quad-XSB mode, each XSB has one CPU, 8 DIMMs, one PCI-Express hostbridge and two PCI-Express slots. So a SPARC Enterprise M8000 with 4 system boards can effectively be split into 16 domains.

For example, with an M5000, you could place system board 00 in quad-XSB mode, and system board 01 in uni-XSB mode. Then you can create one domain with XSBs 00-0 and 00-1 (call this the "green" domain), and a second domain with 00-2, 00-3 and all of 01 (call this the "blue" domain). Here's what that would look like:

Example M5000 With Two Domains
    SB  XSB   CPUs           Memory    I/O                                                        Domain
    00  00-0  CPU#0          8 DIMMs   2 SAS Disks, 2 GBE Ports, PCI-X Slot#0, PCI-E Slots #1-#2  green
        00-1  CPU#1          8 DIMMs   PCI-E Slots #3-#4                                          green
        00-2  CPU#2          8 DIMMs   No I/O                                                     blue
        00-3  CPU#3          8 DIMMs   No I/O                                                     blue
    01  01-0  CPU#0 - CPU#3  32 DIMMs  2 SAS Disks, 2 GBE Ports, PCI-X Slot#0, PCI-E Slots #1-#4  blue

The green domain could have 2 CPUs, 16 DIMMs, and lots of I/O, while the blue domain could have 6 CPUs, 48 DIMMs, and lots of I/O.

There are some downsides to using quad-XSB mode. The primary issue is availability in the face of hardware failures. On an M4000 or M5000 there are two SC chips (officially, these are called "system controller" ASICs; however, due to potential confusion with the Sun Fire System Controllers, I like to just call them SC chips); the M8000/M9000 system board has four SC chips. The SC chips connect the CPUs, memory, and I/O on the system board, and connect the system board to the system crossbar (or, in the case of the M5000, the SCs on one system board connect directly to the SCs on the other system board). The SC chips are shared by all XSBs on a system board. If a system board is in uni-XSB mode and there's a fault internal to an SC chip, the system board (and the domain using that system board) may take a fatal error and be reset. If a system board is in quad-XSB mode, an SC fault may require the entire system board to be reset, which would reset all domains using XSBs on that system board.

Using the M5000 example above, if the system experienced a fatal error in CPU#0, only the green domain would be reset. However, if one of the SC chips on system board 00 experiences a fault, then all XSBs on system board 00 are affected; both the blue and the green domains would be reset as a result.

On the other hand, XSBs do offer a great deal of flexibility. With an M4000 which only has one system board, you can create two domains, something you could never do with a Sun Fire 6900/25K with only one system board. On larger systems, you have the flexibility of configuring domains down to the CPU level, rather than at a system board level. If the impact of losing two domains due to a hardware failure is acceptable, then quad-XSB mode offers unprecedented flexibility and configurability.

Friday May 11, 2007

Deep Blue v SPARC Enterprise

I like to read the "This Day In History" column in the paper, and this week I've been re-living the infamous chess match between the supercomputer Deep Blue and Garry Kasparov. Today in history, on May 11, 1997, Deep Blue defeated Kasparov in the sixth and final game of their match. I recall, as a young (OK, perhaps not young, but at least younger) computer engineer and amateur chess player, following the story in the newspapers and on the evening news. A computer had defeated a reigning world chess champion in regulation play.

Ten years later, I was wondering how Deep Blue stacked up to modern supercomputers, like the Sun SPARC Enterprise machines.

I've seen estimates that Deep Blue ran at 11 GFLOPS. I don't know what it cost, but I've seen figures like $5M in capital (not counting the roughly $100M in research and development costs in the decade it took to develop Deep Blue). Deep Blue wasn't a single system; it was a cluster of 32 RS/6000 servers, plus nearly 500 processors specifically designed for playing chess.

Last month the Sun SPARC Enterprise M9000 achieved 1.032 TFLOPS with 64 CPU chips, making it the fastest single-system supercomputer in the world. One hundred times the performance of Deep Blue, at a fraction of the price. A single SPARC64-VI CPU chip would be about 16 GFLOPS, still besting Deep Blue by a good margin. Of course, I don't have specific numbers handy but I'm pretty sure you also can get that kind of performance from a couple of dual-core Opterons in a 1U server (like the Sun Fire X4100) for around $5K.
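The back-of-the-envelope arithmetic above, spelled out (the inputs are the rounded figures quoted in the text, so the outputs are estimates, not benchmark results):

```python
# Deep Blue vs. SPARC Enterprise M9000, using the figures quoted above.

m9000_tflops = 1.032    # M9000 result with 64 CPU chips
cpu_chips = 64
deep_blue_gflops = 11   # commonly cited Deep Blue estimate

per_chip_gflops = m9000_tflops * 1000 / cpu_chips
speedup = m9000_tflops * 1000 / deep_blue_gflops

print(f"~{per_chip_gflops:.0f} GFLOPS per SPARC64-VI chip")  # ~16
print(f"~{speedup:.0f}x Deep Blue")                          # ~94, i.e. roughly 100x
```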

So, for what a good PC cost in 1997, you can now buy a server the size of a pizza box with the performance of Deep Blue. And for the cost of Deep Blue in 1997, you can buy a dozen SPARC Enterprise M9000's, each with 100 times the performance. That's about three orders of magnitude improvement in the price/performance ratio in a decade. At that rate, in 2017 you'll be able to buy an 11 GFLOPS "supercomputer" for $5, and it will probably be powering your cell phone.

OK, I'm clearly not comparing apples to apples. Deep Blue had the chess processors that probably didn't factor into the 11 GFLOPS rating, ground-breaking algorithms and software, and the genius of Feng-hsiung Hsu behind it. I doubt anyone is going to defeat a chess grandmaster with kchess.

But it is interesting to see how far we've come in ten short years.

Tuesday May 08, 2007

Any info on Cache Coherency?

I have been asked a couple of times by trainers and OPL training writers
about OPL cache coherency, but I don't have any info myself.

Here is a summary of the emails on the question:


In the Serengeti, it's the Sun Fire snoopy coherency.
In the Starcat, it's memory management between
the CPU memory and the L2. In the Starcat, it's
scalable shared memory, where cache coherency is
maintained within the board set and referenced out to
the L3 when needed.

What's the cache coherency model for OPL?


Does anyone have info to share?


Friday Apr 27, 2007

Introducing the SPARC Enterprise M-series

Last week Sun and Fujitsu announced the new SPARC Enterprise line of servers, featuring four servers based on the Olympus SPARC64-VI CPU.

Three years ago, I was shocked (as was much of the industry) to learn that Sun and Fujitsu were cooperating to develop the next generation of midrange and high-end servers. A couple of weeks later, I learned that I would be part of the Sun engineering team working on this ground-breaking new product. The last three years have been very interesting, and I believe we've produced a fantastic product, and I'm eager to blog about it.

I hope to use this space to write about the great features, and little secrets, of the new SPARC-Enterprise M-series machines. And I hope to encourage my fellow engineers to contribute a few postings themselves.

Caveat: We're all engineers here, so we can't make promises about future delivery of products or features, and we probably can't tell you too much about new things in the works. But we certainly can tell you about the stuff that's already shipping, and I'm hoping to get input from you about things you'd like to see added or changed in the product.

If there's something in particular you're interested in reading about, post a comment and let me know. If I can't answer your questions, I'll see if I can find an expert who can.


Bob Hueston

