Tuesday Sep 15, 2009

Exadata 2 - Software By Oracle, Hardware By Sun

I just finished watching Larry Ellison and John Fowler launch Oracle's Exadata 2 product. Software by Oracle, hardware by Sun. This thing is a monster - and I mean that in a good way. I'm sure there'll be plenty of press dissecting the speeds, capacities, and so forth; the Oracle press release is a good starting point. What perked my ears up is that fault tolerance (redundancy, mirroring) is a key design premise of the system - something near and dear to me.

As for the OS running under the covers, it wasn't discussed during the launch. One would expect that it's Oracle Enterprise Linux. I wonder about getting OpenSolaris as the host OS. I'd expect performance to get a boost. And on the fault tolerance front, Solaris FMA would provide a benefit: the systems use Sun's Nehalem-based X4170 servers as the compute building blocks, and those are fully supported by FMA.


Monday Aug 31, 2009

Hot-Adding Nehalem Sockets on Solaris

In a Nehalem whitepaper dated roughly to mid-2008, there's a bullet point on page 7, under the Intel® QuickPath Architecture Performance heading, that while brief speaks volumes. It reads:

Intel® QuickPath Architecture Performance
- Hot plug capability to support hot plugging of nodes, such as processor cards.

The implication here is far-reaching. Systems will be designed that allow for dynamically adding processor nodes (and presumably the memory and IOH that go with the socket, too). I can see such a feature being used for dynamically growing capacity as application demands rise. Or as a RAS feature, bringing in new hardware to backfill for components that have been faulted and/or isolated (e.g. via Solaris FMA's CPU retire functions).

As Intel's engagement in the OpenSolaris community has continued, the first step toward readying Solaris for the hot-add capabilities of Nehalem has landed in build 123. This first round of changes lays down the ACPI infrastructure that future phases of support will rely upon. Kudos to Gerry Liu at Intel for getting the code into OpenSolaris. Places to get more details:

  • PSARC/2009/104 Hot-Plug Support for ACPI-based Systems
  • 6846944 Device tree creation and acpi virtual nexus driver for acpi based x86 systems
  • 6849408 Device matching rule in ppm.conf is not flexible enough

And of course, in my FMA world, I'll get more excited as Solaris extends its fault management to newly added resources. More to come on that, I'm sure...


Tuesday Apr 14, 2009

Fault Management on Sun's Xeon 5500-based Systems

Sun announced a suite of Intel Xeon 5500 based systems today. The features of the Xeon 5500, aka Nehalem EP, have been covered extensively in the press following Intel's launch a couple of weeks ago, so I won't review the highlights or innards of the chip. Rather, I'll focus on the fault management features of the Sun systems built around the Xeon 5500. The platforms covered below include the X4170, X4270, X4275 and X2270 rackmount servers and the X6270 and X6275 blades. The one platform I'm not covering here is the Ultra 27 Workstation, as it does not include a service processor. When running Solaris, all Xeon 5500-based systems reap the benefits of Solaris FMA, which has been updated for Nehalem (see footnote 1). But the rackmount and blade servers provide additional fault management capabilities not present in the Ultra 27 workstation.

Sun's Xeon 5500 servers and blades have two fault management precincts - the service processor (SP) and the host operating system. This new line of systems has SP-based fault detection and diagnosis for several subsystems, providing a solid base level of fault management irrespective of the host operating system (but you're all running Solaris or OpenSolaris, right? :) A quick rundown of the subsystems that are fault managed:

  • Nehalem CPUs: The SP detects and diagnoses all processor uncorrectable errors (UEs). Correctable errors (CEs) are remanded to the host operating system. If the OS is Solaris, processors can be offlined by Solaris FMA if CEs occur too frequently. Also, Solaris FMA will capture error state on CPU UEs and report them upon the next Solaris restart.
  • Memory: As with CPUs, memory UEs are diagnosed and reported within the SP. Unlike CPU errors, memory UEs are not visible to the host operating system.
  • IO Subsystem: Errors in the IO Hub (Tylersburg) itself are detected and reported via the SP. PCI/PCIE fabric errors are handled by the host operating system (via AER).
  • Power, Cooling & Environmentals: The SP detects and diagnoses problems with the bulk power supplies and fan trays. It also monitors the various component sensors (temp, voltage, etc.), reporting components that have gone out of tolerance.
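
The "too frequently" rule for CPU correctable errors in the list above is, in spirit, a rate threshold: a strand becomes a candidate for offlining once some number of CEs land within a time window. Here is a minimal Python sketch of that idea. The names (CeRateEngine, strands_to_offline) and the threshold of 3 CEs in 60 seconds are made up for illustration; this is not Solaris FMA's actual diagnosis code or its real tunings.

```python
from collections import deque


class CeRateEngine:
    """Hypothetical N-events-in-T-seconds threshold for one CPU strand."""

    def __init__(self, n, t):
        self.n = n              # events needed to trip the engine (assumed value)
        self.t = t              # sliding window length in seconds (assumed value)
        self.events = deque()   # timestamps of CEs still inside the window
        self.fired = False

    def record(self, timestamp):
        """Record one CE; return True once the threshold has been reached."""
        self.events.append(timestamp)
        # Forget CEs that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.t:
            self.events.popleft()
        if len(self.events) >= self.n:
            self.fired = True   # stays tripped once the strand is flagged
        return self.fired


def strands_to_offline(ce_events, n=3, t=60.0):
    """ce_events is an iterable of (timestamp, strand_id) pairs.

    Returns the strand ids that tripped their engine, in the order they tripped.
    """
    engines = {}
    flagged = []
    for timestamp, strand in sorted(ce_events):
        if strand not in engines:
            engines[strand] = CeRateEngine(n, t)
        if engines[strand].record(timestamp) and strand not in flagged:
            flagged.append(strand)
    return flagged
```

Feeding it three CEs on strand 0 within 20 seconds and two widely spaced CEs on strand 1 flags only strand 0. An occasional CE is tolerated; a burst is treated as a sign the strand is going bad.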

The service processor provides coverage for a good portion of the errors in the system, yet the host operating system can augment the SP, notably in the area of recovery and/or isolation of problematic resources. If you're at all familiar with Solaris, you'll know that Solaris has the capacity to offline individual processor strands (no further software threads are scheduled on the affected strand), retire individual pages of memory (8KB granularity), and cease using problematic IO devices (configurations & active usage permitting).

Another interesting look at Sun's new systems is the fault management functionality for various host operating systems - what diagnosis and recovery features are available straight out of the box:

                                        Host Operating System
Subsystem Diagnosis                     Solaris  Windows  Linux
CPU Correctable                         Yes      No       No
CPU Uncorrectable                       Yes      Yes      Yes
Memory Correctable                      Yes      Yes      Yes
Memory Uncorrectable                    Yes      Yes      Yes
IO Hub (Tylersburg)                     Yes      Yes      Yes
PCI/PCIE Fabric (assuming AER support)  Yes      Yes      Yes 1

Recovery/Isolation                      Solaris  Windows  Linux
CPU Strand Offline                      Yes      No       No
Memory Page Retire                      Yes      No       No
IO Device Retire                        Yes      No       No
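
For anyone who wants to query the table rather than eyeball it, the same Yes/No values can be written down as data. This is only a sketch: the names SUPPORT and unsupported are hypothetical, the values are copied from the table above, and the footnote marker on the Linux PCI/PCIE entry is recorded as a plain Yes.

```python
OSES = ("Solaris", "Windows", "Linux")

# Each row holds (Solaris, Windows, Linux) support flags, in the table's order.
SUPPORT = {
    "CPU Correctable":      (True, False, False),
    "CPU Uncorrectable":    (True, True, True),
    "Memory Correctable":   (True, True, True),
    "Memory Uncorrectable": (True, True, True),
    "IO Hub (Tylersburg)":  (True, True, True),
    "PCI/PCIE Fabric":      (True, True, True),   # assumes AER support
    "CPU Strand Offline":   (True, False, False),
    "Memory Page Retire":   (True, False, False),
    "IO Device Retire":     (True, False, False),
}


def unsupported(os_name):
    """List the rows marked No for the given host operating system."""
    column = OSES.index(os_name)
    return [row for row, flags in SUPPORT.items() if not flags[column]]
```

Calling unsupported("Linux") returns the CPU correctable diagnosis row plus the three recovery/isolation rows, which is the gap the next paragraph is pointing at.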

Irrespective of your choice of host operating system, Sun's Xeon 5500 suite of systems provides solid fault detection and reporting capabilities. And when coupled with Solaris, fault resilience is improved thanks to its recovery and isolation capabilities.

Click here for more Sun blogs about Sun's new Xeon 5500 systems.


1 If running OpenSolaris on an Ultra 27, it is recommended to install build 111 or later to include the fix to CR 6804867. This fix is also planned for a future Solaris 10 Update release.



