Wednesday Aug 20, 2008

Dependability Benchmarking for Computer Systems

Over the past few years, a number of people have been working to develop benchmarks for dependability of computer systems. After all, why should the performance guys have all of the fun? We've collected a number of papers on the subject in a new book, Dependability Benchmarking for Computer Systems, available from the IEEE Computer Society Press and Wiley.

The table of contents includes:

  1. The Autonomic Computing Benchmark
  2. Analytical Reliability, Availability, and Serviceability Benchmarks
  3. System Recovery Benchmarks
  4. Dependability Benchmarking Using Environmental Test Tools
  5. Dependability Benchmark for OLTP Systems
  6. Dependability Benchmarking of Web Servers
  7. Dependability Benchmark of Automotive Engine Control Systems
  8. Toward Evaluating the Dependability of Anomaly Detectors
  9. Vajra: Evaluating Byzantine-Fault-Tolerant Distributed Systems
  10. User-Relevant Software Reliability Benchmarking
  11. Interface Robustness Testing: Experience and Lessons Learned from the Ballista Project
  12. Windows and Linux Robustness Benchmarks with Respect to Application Erroneous Behavior
  13. DeBERT: Dependability Benchmarking of Embedded Real-Time Off-the-Shelf Components for Space Applications
  14. Benchmarking the Impact of Faulty Drivers: Application to the Linux Kernel
  15. Benchmarking the Operating System against Faults Impacting Operating System Functions
  16. Neutron Soft Error Rate Characterization of Microprocessors

Wow, you can see that a lot of people have done a lot of work to measure system dependability and improve system designs.

The work described in Chapter 2, Analytical Reliability, Availability, and Serviceability Benchmarks, is already visible, as we are beginning to publish these benchmark results in various product white papers.

Performance benchmarks have proven useful in driving innovation in the computer industry, and I think dependability benchmarks can do likewise. If you feel that these benchmarks are valuable, then please drop me a note, or better yet, ask your computer vendors for some benchmark results.

I'd like to thank all of the contributors to the book, the IEEE, and Wiley. Karama Kanoun and Lisa Spainhower worked tirelessly to get all of the works compiled (herding the cats) and interfaced with the publisher, great job! Ira Pramanick, Jim Mauro, William Bryson, and Dong Tang collaborated with me on Chapters 2 & 3, thanks team!

Wednesday Feb 20, 2008

Big Clusters and Deferred Repair

When we build large clusters, such as high performance clusters or any cluster with a large number of computing nodes, we begin to look in detail at the repair models for the system. You are probably aware of the need to study power usage, air conditioning, weight, system management, networking, and cost for such systems. You are also aware that multiplying the environmental needs of one computing node by the number of nodes quickly produces a large number. This is intuitive for most folks. But availability isn't quite so intuitive, and deferred repair models can challenge the intuition further. So, I thought that a picture would help show how we analyze the RAS characteristics of such systems and why we always look to deferred repair models in their design.

To begin, we have to make some assumptions:

  • The availability of the whole is not interesting.  The service provided by a big cluster is not dependent on all parts being functional. Rather, we look at it like a swarm of bees. Each bee can be busy, and the whole swarm can contribute towards making honey, but the loss of a few bees (perhaps due to a hungry bee eater) doesn't cause the whole honey producing process to stop. Sure, there may be some components of the system which are more critical than others, like the queen bee, but work can still proceed forward even if some of these systems are temporarily unavailable (the swarm will create new queens, as needed). This is a very different view than looking at the availability of a file service, for example.
  • The performability might be interesting. How many dead bees can we have before the honey production falls below our desired level? But for very, very large clusters, the performability will generally be good, so a traditional performability analysis is also not very interesting. It is more likely that a performability analysis of the critical components, such as networking and storage, will be interesting. But the performability of thousands of compute nodes will be less interesting.
  • Common root cause failures are not considered. If a node fails, the root cause of the failure is not common to other nodes. A good example of a common root cause failure is loss of power -- if we lose power to the cluster, all nodes will fail. Another example is software -- a software bug which causes the nodes to crash may be common to all nodes.
  • What we will model is a collection of independent nodes, each with their own, independent failure causes.  Or just think about bees.

For a large number of compute nodes, even using modern, reliable designs, we know that the probability of all nodes being up at the same time is quite small. This is obvious if we look at the simple availability equation:

Availability = MTBF / (MTBF + MTTR)

where MTBF (mean time between failures) is MTBF[compute node] / N[nodes]
and MTTR (mean time to repair) is > 0

The killer here is N. When N becomes large (thousands) and MTTR depends on people, the availability becomes quite small. The time required to repair a machine is included in the MTTR. So as N becomes large, there is more repair work to be done. I don't know about you, but I'd rather not spend my life in constant repair mode, so we need to look at the problem from a different angle.
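To get a feel for how quickly N takes over, here is a minimal sketch of the availability equation above. The per-node MTBF of 50,000 hours and the 8 hour MTTR are made-up values for illustration only, not measurements from any real system, and the function name is my own.

```python
def cluster_availability(node_mtbf_hours, n_nodes, mttr_hours):
    """Availability of the 'all nodes up' state, using
    Availability = MTBF / (MTBF + MTTR) with MTBF = MTBF[node] / N."""
    mtbf = node_mtbf_hours / n_nodes  # MTBF of the whole collection shrinks with N
    return mtbf / (mtbf + mttr_hours)


# Hypothetical inputs, chosen only to show the trend.
NODE_MTBF_HOURS = 50_000
MTTR_HOURS = 8

for n in (1, 10, 100, 1000, 2000):
    a = cluster_availability(NODE_MTBF_HOURS, n, MTTR_HOURS)
    print(f"{n:>5} nodes: availability of 'all up' = {a:.4f}")
```

Try replacing MTTR_HOURS with 168 (one week) to see how a deferred repair pushes the all-nodes-up availability down even further.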

If we make MTTR large, then the availability will drop to near zero. But if we have some spare compute nodes, then we might be able to maintain a specified service level. Or, from a practical perspective, we could ask the question, "how many spare compute nodes do I need to keep at least M compute nodes operational?" The next, related question is, "how often do we need to schedule service actions?" To solve this problem, we need a model.

Before I dig into the model results, I want to digress for a moment and talk about Mean Time Between Service (MTBS) and Mean Time Between System Interruption (MTBSI). I've blogged in detail about these before, but to put their use in context here, we will actually use MTBSI and not MTBF for the model. Why? Because if a compute node has any sort of redundancy (ECC memory, mirrored disks, etc.) then the node may still work after a component has failed. But we want to model our repair schedule based on how often we need to fix nodes, so we need to look at how often things break for two cases. The models will show us those details, but I won't trouble you with them today.

The figure below shows a proposed 2000+ node HPC cluster with two different deferred repair models. For one solution, we use a one week (168 hour) deferred repair time. For the other solution, we use a two week deferred repair time. I could show more options, but these two will be sufficient to provide the intuition for solving such mathematical problems.

Deferred Repair Model Results 

We build a model showing the probability that some number of nodes will be down. The OK state is when all nodes are operational. It is very clear that the longer we wait to repair the nodes, the less probable it is that the cluster will be in the OK state. I would say that with a two week deferred maintenance model, there is nearly zero probability that all nodes will be operational. Looking at this another way, if you want all nodes to be available, you need to have a very, very fast repair time (MTTR approaching zero). Since fast MTTR is very expensive, accepting a deferred repair and using spares is usually a good cost trade-off.

OK, so we're convinced that a deferred repair model is the way to go, so how many spare compute nodes do we need? A good way to ask that question is, "how many spares do I need to ensure that there is a 95% probability that I will have a minimum of M nodes available?" From the above graph, we would accumulate the probability until we reached the 95% threshold. Thus we see that for the one week deferred repair case, we need at least 8 spares and for the two week deferred repair case we need at least 12 spares. Now this is something we can work with.
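The "accumulate the probability until we reach the threshold" step can be sketched in a few lines. This is not the model behind the figure; it is a simplified stand-in that assumes independent nodes with exponentially distributed times between interruptions, and that a failed node stays down until the next scheduled service action. The 100,000 hour MTBSI is a hypothetical value, so don't expect the output to match the 8 and 12 spares quoted above.

```python
import math


def spares_needed(n_nodes, node_mtbsi_hours, repair_interval_hours, confidence=0.95):
    """Smallest number of down nodes k such that P(nodes down <= k) >= confidence.

    Assumptions (mine, for illustration): node failures are independent, and a
    node is down at the end of the interval with probability
    p = 1 - exp(-interval / MTBSI). The number of nodes down is then binomial.
    """
    p = 1.0 - math.exp(-repair_interval_hours / node_mtbsi_hours)
    pmf = (1.0 - p) ** n_nodes  # probability of the OK state (zero nodes down)
    cumulative = pmf
    k = 0
    while cumulative < confidence and k < n_nodes:
        # Binomial recurrence: P(k+1) = P(k) * (n - k) / (k + 1) * p / (1 - p)
        pmf *= (n_nodes - k) / (k + 1) * p / (1.0 - p)
        k += 1
        cumulative += pmf
    return k


# Hypothetical cluster: 2000 nodes, 100,000 hour MTBSI per node.
for label, hours in (("one week", 168), ("two weeks", 336)):
    print(f"{label} deferred repair: {spares_needed(2000, 100_000, hours)} spares")
```

The recurrence avoids computing huge binomial coefficients directly, which keeps the arithmetic well behaved for clusters with thousands of nodes.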

The model results will change based on the total number of compute nodes and their MTBSI. If you have more nodes, you'll need more spares. If you have more reliable or redundant nodes, you need fewer spares. If we know the reliability of the nodes and their redundancy characteristics, we have models which can tell you how many spares you need.

This sort of analysis also lets you trade off the redundancy characteristics of the nodes to see how that affects the system, too. For example, we could look at the effect of zero, one, or two disks (mirrored) per node on the service levels. I personally like the zero disk case, where the nodes boot from the network, and we can model such complex systems quite easily, too. This point should not be underestimated: as you add redundancy to increase the MTBSI, you also add components that need service, which shortens the MTBS and raises your service costs. The engineer's life is a life full of trade-offs.


In conclusion, building clusters with lots of nodes (red shift designs) requires additional analysis beyond what we would normally use for critical systems with few nodes (blue shift designs). We often look at service costs using a deferred service interval and how that affects the overall system service level. We also look at the trade-offs between per-node redundancy and the overall system service level. With proper analysis, we can help determine the best performance and best cost for large, red shift systems.



Monday Jul 30, 2007

Solaris Cluster Express is now available

As you have probably already heard, we have begun to release Solaris Cluster source at the OpenSolaris website. Now we are also releasing a binary version for use with Solaris Express. You can download the bits from the download center.

Share and enjoy!

Monday Apr 23, 2007

Mainframe inspired RAS features in new SPARC Enterprise Servers

My colleague, Gary Combs, put together a podcast describing the new RAS features found in the Sun SPARC Enterprise Servers. The M4000, M5000, M8000, and M9000 servers have very advanced RAS features, which put them head and shoulders above the competition. Here is my list of favorites, in no particular order:

  1. Memory mirroring. This is like RAID-1 for main memory. As I've said many times, there are 4 types of components which tend to break most often: disks, DIMMs (memory), fans, and power supplies. Memory mirroring brings the fully redundant reliability techniques often used for disks, fans, and power supplies to DIMMs.
  2. Extended ECC for main memory.  Full chip failures on a DIMM can be tolerated.
  3. Instruction retry. The processor can detect faulty operation and retry instructions. This feature has been available on mainframes, and is now available for the general purpose computing markets.
  4. Improved data path protection. Many improvements here, along the entire data path.  ECC protection is provided for all of the on-processor memory.
  5. Reduced part count from the older generation Sun Fire E25K.  Better integration allows us to do more with fewer parts while simultaneously improving the error detection and correction capabilities of the subsystems.
  6. Open-source Solaris Fault Management Architecture (FMA) integration. This lets system administrators see which faults the system has detected, and the system will automatically heal itself.
  7. Enhanced dynamic reconfiguration.  Dynamic reconfiguration can be done at the processor, DIMM (bank), and PCI-E (pairs) level of granularity.
  8. Solaris Cluster support.  Of course Solaris Cluster is supported including clustering between Solaris containers, dynamic system domains, or chassis.
  9. Comprehensive service processor. The service processor monitors the health of the system and controls system operation and reconfiguration. This is the most advanced service processor we've developed. Another welcome feature is the ability to delegate responsibilities to different system administrators with restrictions so that they cannot control the entire chassis.  This will be greatly appreciated in large organizations where multiple groups need computing resources.
  10. Dual power grid. You can connect the power supplies to two different power grids. Many people do not have the luxury of access to two different power grids, but those who have been bitten by a grid outage will really appreciate this feature.  Think of this as RAID-1 for your power source.

I don't think you'll see anything revolutionary in my favorites list. This is due to the continuous improvements in the RAS technologies.  The older Sun Fire servers were already very reliable, and it is hard to create a revolutionary change for mature technologies.  We have goals to make every generation better, and we've made many advances with this new generation.  If the RAS guys do their job right, you won't notice it - things will just keep working.

Wednesday Feb 07, 2007

New white paper on Solaris Cluster and Oracle RAC

My colleagues Tim Read, Gia-Khan Nguyen, and Bob Bart have recently released an excellent white paper which cuts through the fog surrounding the complementary functions of Oracle Clusterware, RAC, and Solaris Cluster. The paper is titled Sun™ Cluster 3.2 Software: Making Oracle Database 10g R2 RAC even More “Unbreakable.”

We have people ask us about why they should use Solaris Cluster when Oracle says they could just use Clusterware. Daily. Sometimes several times per day. The answer is not always clear, and we view the products as complementary rather than exclusionary. This white paper shows why you should consider both products which work in concert to provide very highly available services. I would add that Sun and Oracle work very closely together to make sure that the Solaris platform running Oracle RAC is the best Oracle platform in the world. There is an incredible team of experts working together to make this happen.

Kudos to Tim, Gia-Khan, and Bob for making this readily available for Sun and Oracle customers. 


Thursday Jan 18, 2007

Solaris Cluster (nee Sun Cluster)

Just a quick note for those who might get confused. It seems that marketing has decided to rename the technologies formerly known as "Sun Cluster" to "Solaris Cluster." Old habits die hard, so forgive me if I occasionally use the former name.

Branding is very important and, as you've probably seen over the years, not what I would consider to be Sun's greatest strength. But in the long run, I think this is a good change.



