Tuesday May 09, 2006

Analysis of Memory Page Retirement

My colleague Dong Tang has recently placed a copy of a paper for the Dependable Systems and Networks conference on the OpenSparc website. The paper is Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults.

This paper gives a very brief description of the Memory Page Retirement (MPR) self-healing features in Solaris 10 for SPARC and AMD based systems. The paper goes into detail on how we analyzed MPR, measured field data, and effectively reduced downtime and service costs for Sun customers.

A High-level view of MPR

One of the most important services that an OS provides to applications is the management of memory. The Solaris OS allows processes to allocate memory as pages. These pages can be of varying sizes, depending on the hardware's capabilities, and are backed by main memory, disk space, or a combination of the two. The actual content of the pages may be copied in multiple places, such as processor cache, main memory, swap space, or a file.

If a correctable fault occurs in main memory, then we don't have to perform any recovery action. However, if we continue to see correctable faults, then perhaps the memory is going bad. To avoid possible future uncorrectable faults in the same page, we can copy the data to a different page and mark the original page as unusable (retired). The policy surrounding this analysis is part of the Solaris 10 Fault Management Architecture (FMA). Very cool stuff.

If an uncorrectable fault occurs in main memory, there are several things we can do to recover from the fault:

  • if the page hasn't been updated (is clean):
    • retire the bad page and reload from the backing storage (e.g., reload a text page from a file)
  • if the page has been updated (is dirty):
    • if the data is in processor cache and is being written to the page, then we mark the page as having an uncorrectable fault and defer action until a subsequent access (we don't want to make extra work if the page will subsequently be freed)
    • if the data is being accessed, then the process is forcibly terminated, the page is retired, and (hopefully) the Service Management Facility (SMF) will restart the process
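The decision tree above can be sketched in code. This is an illustrative model of the policy as described, not Solaris source; the type and function names are hypothetical.

```python
# Illustrative sketch of the uncorrectable-fault handling policy described
# above. Names and return values are invented for illustration.

from dataclasses import dataclass

@dataclass
class Page:
    dirty: bool          # has the page been modified since it was loaded?
    being_accessed: bool # is a process actively reading or writing it?

def handle_uncorrectable_fault(page: Page) -> str:
    """Return the recovery action for an uncorrectable fault in this page."""
    if not page.dirty:
        # Clean page: retire it and reload the data from backing storage.
        return "retire-and-reload"
    if not page.being_accessed:
        # Dirty but idle: mark the fault and defer action until the next
        # access, since the page may simply be freed and never need recovery.
        return "mark-and-defer"
    # Dirty and in use: terminate the process, retire the page, and let
    # SMF restart the service if it is managed.
    return "terminate-retire-restart"
```

The key property is that only the last branch, a dirty page under active access, costs an application its life; every other case is handled transparently.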

In short, the effect of an uncorrectable main memory error is now dramatically reduced. Only non-relocatable pages, such as those in some parts of the kernel, cannot be retired. Successful page retirements do not cause a reboot or Solaris outage. For applications, only memory which is dirty is susceptible to uncorrectable errors that would cause an application to be terminated. Applications which are designed to restart automatically or are managed by SMF will restart and keep going. The ultimate result is that Solaris systems can continue to operate in the face of main memory faults. This will become more important as the amount of main memory in systems continues to increase.

As I look into my crystal ball, I see system designs with hundreds of processing elements all connected to terabytes of main memory. If you count silicon area, we'll see much more area devoted to memory than to processing elements. So it makes really good sense to efficiently and cost-effectively add fault recovery techniques to the memory subsystems. Since we are Sun Microsystems, we can use our systems knowledge of hardware and software to provide a highly available platform for running applications.

Wednesday Apr 05, 2006

Sun's new ATCA blade system

Today Sun announced a new Advanced Telecom Computing Architecture (ATCA) blade server system. This system has some of the best high-RAS features of any server system on the market. If you have a moment, read the Advanced TCA and Sun Microsystems: A Technical Overview of Sun's AdvancedTCA white paper; it describes the design in good detail.

I'd like to point out that to truly provide a highly available platform for running your applications, the platform itself must comprise both hardware and software. There is a reason why our name is Sun Microsystems. The combination of a well-designed blade chassis, outstanding blade board-level design, excellent processors (SPARC and Opteron), operating systems (Solaris and Carrier Grade Linux), and HA software (Netra High Availability Suite) comprises a system platform which offers truly outstanding reliability, availability, and serviceability. I'm very proud of this system and give it a big thumbs up for its high RAS features.

While ATCA systems are very robust and represent good high availability design practices, they are not necessarily used in the enterprise data center environment. The Telecom market has environmental and servicing requirements which are somewhat specific to their environment. There are also very specific size and power requirements which lead to server designs which are, by enterprise standards, somewhat limited in computing capacity. This is ok for many Telecom applications, but perhaps not the best combination for, say, technical compute grids or large database servers. Not to worry, you'll notice that many of the robustness techniques used in Sun's ATCA systems are leveraged across the rest of Sun's product line. We aren't designing our systems by keeping the engineers pigeonholed in their cubicles. We are designing systems while leveraging the expertise across the company: hardware, software, services, sales, everybody! All of Sun can be proud of this, and I hope this is very evident when you look at the products in detail.

Monday Dec 12, 2005

UltraSPARC CoolThreads Server parts count

Previously, I mentioned how parts count affects reliability. I was happy when David Yen used one of the slides we developed in the Sun Fire™ CoolThreads Server launch last week. It is always good to see your work get widespread exposure.

When we do a parts count, it is quite simple for our own products. Obviously, when you do the design, it is just a simple matter of having the CAD tools print the list of components used as a bill of materials (BOM). This information gets put into various databases and is used for procurement, manufacturing, and service throughout the life cycle of the product. In the RAS Engineering (Reliability, Availability, and Serviceability) group we use this data for making reliability projections. Early in the design cycle, we might make many reliability projections as the design trade-offs are being made. We also use the reliability projections for more complex RAS models and benchmarks which are used to improve our designs and compare to other designs.

For competitive products, we rarely get the component-level BOM, and we have to do it the old-fashioned way: purchase a product and count the parts. We then build reliability projections using the same methods as we use for our own products. This gives us a common baseline for product reliability comparison. Before you ask, no, I won't share the detailed results of these comparisons with you. Suffice it to say, the Sun Fire™ CoolThreads Servers kick serious butt in performance, price, and especially reliability.

When I present this slide I often notice that many people are surprised that there are so many parts in a modern server. In truth, some components, like capacitors, are everywhere. In general, capacitors are very reliable and are often used to filter unwanted signals – a good thing. A modern, enterprise-class server may have hundreds or thousands of capacitors. Over time, more and more functions become integrated into fewer parts. At one extreme is the UltraSPARC™ T1 processor itself which is essentially 8 (or 32, depending on how you count) processors and 4 memory controllers integrated into one chip. But integration is occurring everywhere – including the new I/O ASIC, integrated RAID controllers, system controllers, and network interfaces. A quick browse through the pictures in the Sun Fire™ T1000 and T2000 Server Architecture white paper (see pages 20 and 23) should give you an indication of how highly integrated these servers really are. Or, just get your hands on one, and open the cover. You might glean some value from the full components list, though that is really just the FRU components list – not the component-level BOM we use for reliability projections. We've also reduced the parts count in the power supplies, as I have blogged about previously. We can't quite put everything into one chip yet, but we're getting closer, and you can expect even more integration in the next version of the UltraSPARC T1 processor, code-named Niagara-2. I have no doubts that we will continue to drive high reliability and high integration.


RAS Benchmarking white papers

Here are some citations for our work on Reliability, Availability, and Serviceability (RAS) benchmarks.

R-Cubed(R3): Rate, Robustness, and Recovery – An Availability Benchmark Framework by Ji Zhu, James Mauro, and Ira Pramanick, July 2002.

A System Recovery Benchmark for Clusters by Ira Pramanick, James Mauro, and Ji Zhu. IEEE International Conference on Cluster Computing (CLUSTER 2003), 2003.

The System Recovery Benchmark by James Mauro, Ira Pramanick, and Ji Zhu. 10th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC'04).

Robust Benchmarking for Hardware Maintenance Events by Ji Zhu, James Mauro, and Ira Pramanick, 2003 International Conference on Dependable Systems and Networks (DSN'03).

Availability Benchmarking by Richard Elling, Sun User's Performance Group (SUPerG), 2005.

Thursday Dec 08, 2005

Niagara, Viagra

Now that you've heard all about the UltraSPARC™ T1 processor and the Sun Fire™ CoolThreads servers, you'll also hear a lot of our competitors making snide comments about the Niagara codename. I've read several articles and quotes which try to play Niagara off of Viagra. As a RAS guy, I get a kick out of it, because we always want servers to stay up...

Tuesday Dec 06, 2005

CoolThreads, T2000, and RAS

Now that we've officially announced the Sun Fire™ CoolThreads servers, I can talk about some of the work we've been doing in the Reliability, Availability, and Serviceability (RAS) Engineering group for the past year. We are very excited because these servers offer some of the best RAS features ever. Of course, you'll hear from many folks about the price, performance, power, space, and other advantages of these new servers. I'll concentrate on why these servers will keep running and running for a long, long time.

Heat Kills

When designing data centers, we often spend a lot of time designing the heating, ventilation, and cooling (HVAC) systems. The laws of thermodynamics dictate that we move heat or energy around and that doing so isn't free. We begin the design with an enclosure which provides good cooling paths, including thermal isolation between the power supplies, disk drives, and motherboard. For the Sun Fire™ T2000 server we use basically the same enclosure as the Sun Fire™ X4200 servers, which I've discussed previously. By designing this enclosure for the heat generation envelope of Opteron servers, we have plenty of margin for the lower power UltraSPARC™ T1 processor based servers. The laws of semiconductor physics say that as the temperature rises, the reliability decreases. Thus the physical design of the T2000 server leads to increased reliability.
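The temperature dependence mentioned above is usually modeled with the Arrhenius equation in reliability engineering. A minimal sketch, assuming an activation energy of 0.7 eV (a common rule-of-thumb value, not a T2000 measurement):

```python
# Hedged sketch of the Arrhenius model relating component temperature to
# failure rate. The activation energy and temperatures are assumed example
# values, not measurements of any Sun product.

import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def acceleration_factor(t_cool_c: float, t_hot_c: float,
                        ea_ev: float = 0.7) -> float:
    """Ratio of failure rates at the hot vs. cool temperature (Arrhenius)."""
    t_cool_k = t_cool_c + 273.15  # convert Celsius to Kelvin
    t_hot_k = t_hot_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_cool_k - 1.0 / t_hot_k))
```

With these assumptions, a part running at 65 C instead of 55 C fails roughly twice as often, which is why isolating the motherboard airflow from the heat generators is worth real engineering effort.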

In some of the competitive comparisons you'll see today, people will compare one T2000 server against competing products which use multiple processors. From a reliability perspective, we know that heat kills, and more processors generate more heat.

High Integration Leads to High Reliability

We perform reliability projections for new products early in the design cycle. In RAS-industry lingo, we use the methods described by MIL-HDBK-217 and Telcordia for reliability modeling. In general, the more parts you have, the lower your reliability. This is really common sense: if I have some widgets, then the probability that any widget will fail is a function of the number of widgets. Moore's law says that over time, we can integrate more widgets into one widget. Think of it this way: we've taken half of a Sun Enterprise™ 10000 server and put it in a single chip, effectively reducing hundreds of widgets into one UltraSPARC™ T1 processor.

In some of the performance comparisons you'll see today, people will compare a single T1 processor against several competing processors. A funny thing happens when you look at the failure rate of the processors. To a large degree, complex, contemporary semiconductors have the same failure rate under the same environmental conditions. If you compare a T2000 against a server with four processor chips, the reliability of the T2000 processor is approximately four times that of the competing system's processors. This sort of analysis extends to multiple servers also. For example, if one T2000 has equivalent performance to a set of two competing servers, then the failure rate of the T2000 will be approximately half (or less :-) that of the competing solution because there are half as many power supplies, disks, semiconductors, etc. Not only do we gain a power, space, and cost advantage, but we get higher reliability as well.
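The reasoning above can be sketched as simple series-system arithmetic: if the failure of any part fails the system, the part failure rates add. The part names and FIT values below are invented for illustration, not Sun's actual projections:

```python
# Hedged back-of-the-envelope sketch of the parts-count argument.
# FIT = failures per 10^9 device-hours; all numbers are made up.

def system_fit(parts: dict[str, tuple[int, float]]) -> float:
    """Total failure rate (FIT) of a series system: sum of count * FIT."""
    return sum(count * fit for count, fit in parts.values())

# One T2000-like box vs. two competing boxes doing the same work.
one_server = {"cpu": (1, 500.0), "psu": (2, 1000.0), "disk": (2, 2000.0)}
two_servers = {"cpu": (4, 500.0), "psu": (4, 1000.0), "disk": (4, 2000.0)}
```

Halving the number of power supplies, disks, and processors roughly halves the aggregate failure rate, which is the point of the comparison in the text.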

But Reliability Is Only One Letter in RAS...

So, what about availability and serviceability? We do the usual things to make the T2000 more available and serviceable. The power supplies, fans, and disk drives are all redundant and hot swappable. Main memory has ECC and chip kill which complement the Solaris Fault Management Architecture (FMA) which provides memory page retirement. The UltraSPARC™ T1 processor has built-in reliability and redundancy features. Automatic System Recovery (ASR) provides automatic recovery from failures in PCI cards or memory (though ASR is disabled by default). There is an ALOM system controller which allows remote management of the platform. In other words, all of the availability and serviceability methods and processes for this class of server are carried forward – there is no sacrificing of these important features.

More on Redundancy

Above I mentioned that we basically shrunk half of a Sun Enterprise™ 10000 server (aka Starfire) and put it in a single chip. One feature that we lose when we do that is the granularity of the redundancy. We also lose Dynamic System Domains (DSD), which provide some allocation and deployment flexibility, but for now, I'll stick to the basic RAS concepts. The Starfire has many redundant and hot-pluggable features which are lost when we integrate onto a single chip. In other words, we are replacing a highly redundant system with a system containing many single points of failure (SPOFs). In general, reliability trumps redundancy, so this trade-off isn't always a bad thing. But perhaps more specifically, for the target market and price range, we intentionally designed the T2000 server to have such SPOFs. There is more good news here: the actual parts count for the motherboard is still lower than competing systems in the same market and price range. It doesn't really make sense to replace a Starfire with a T2000, for RAS purposes, but it makes excellent sense to replace a pile of Xeon- or PowerPC-based servers with a T2000 or two. Ok, to be fair, it also makes sense to replace a pile of older 1-2 RU UltraSPARC-based servers, with the added benefit of maintaining the software binary interface.

Of course, another approach is to use the no-cost (!) Solaris Enterprise System, which includes the Sun Java Availability Suite and its Sun Cluster software, to achieve full redundancy between two or more T2000 servers. From a cluster perspective, the T2000 is an excellent choice for a small form factor cluster node.


The new Sun Fire T2000 server, based on CoolThreads technology, is a breakthrough in RAS as well as price, performance, space, and power consumption. If you remember only three things:

  • Sun Fire CoolThreads servers offer extraordinarily high reliability due to high integration

  • Best-in-class RAS features

  • CoolThreads servers are very cool!


Monday Sep 12, 2005

RAS and the X4100/X4200 Servers

Here's a peek inside the reliability, availability, and serviceability (RAS) features of the new Sun Fire X4100 and Sun Fire X4200 servers. In this post, I'll talk about heat flow and power distribution, two important design constraints for computer system designs. You will see why the new X4100 and X4200 servers are not just run-of-the-mill x64 servers.

Heat costs money and kills

By now, you've probably seen the evidence and talk about power consumption and heat. It has long been known that cranking up the number of components and the speed of their operation generates more heat. You can do the math and figure out how the heat affects your pocketbook. From a RAS perspective, I'll add that heat kills. The ambient temperature has a direct effect on the reliability of electronics. Cool servers don't break as often as hot servers.

Mechanical systems are the most seriously affected and in modern computer systems that usually means disk drives. The X4100/X4200 designs use new 2.5" serial attached SCSI (SAS) disks, which draw about 40% less power than the equivalent 3.5" disk drives. The performance gurus will also note that the average seek times also drop by about 15% for the smaller drives. For power estimation purposes, plan on about 8W for a reasonably busy 2.5" SAS disk. The reduced form factor and power consumption means that we can offer 4 disks in the space and power budget formerly needed for two 3.5" disks. Ok, so most thin servers don't need more than 2 disks, and I fully expect that people will deploy many more X4100/X4200 servers with zero, one, or two disks than four disks.

But the reduced form factor also allows us to improve overall system RAS because we regain a bunch of space from the front bezel area. In older thin servers such as the wildly popular Netra t1 series, the disks basically consume all of the front bezel area. Consider what this means to the airflow in the system. You will have some heat generators sitting in front of the other electronics. Airflow is front-to-back, so the air that passes over your motherboard is already hotter than the ambient air. By using the 2.5" disk drives, we were able to move the disks out of the way and fully isolate the airflow for the disks from the motherboard. I'll use the X4100 to demonstrate. The configuration is such that the disks and DVD are located on the right side of the front of the server. Behind the drives are the hot swappable power supplies. There is a wall between the drives/power supplies and the motherboard to keep the air separate. The air flowing over the motherboard comes directly from the exterior and flows out the back. The orientation of the CPUs and memory is such that they all get clean (cool) airflow directly. Pushing air over the motherboard are two rows of hot-pluggable, redundant fans. The bezel in front of the fans is not blocked by bulky disk drives, further ensuring good, cool airflow into the server.

By contrast, the Sun Fire V60x and Sun Fire V65x chassis designs were done by "another company" and Sun took that design and re-badged it. The problem is that the other company wasn't used to designing data center class systems. The V60x has a series of holes in the side of the chassis. When you put a bunch of them into a rack, hot air circulates through the rack and back into the chassis. The result is that the systems run very hot. In the front, the disk drives further block the airflow, such that the motherboard sees inconsistent, pre-warmed air flow. These systems also use Xeon processors, which tend to run hot. The net result is that the environmental requirements are de-rated to adjust for the additional cooling requirements of the system. The elegant and clean design of the X4100 and X4200 is vastly superior to the older design in this respect.

Power conversion

Any discussion about power would not be complete without a discussion of power conversion. The Sun Fire X4100 and Sun Fire X4200 servers have RAS improvements there as well. The AC/DC power supplies are a new, remarkably simple design. Only two DC voltage levels are provided: 3.3 and 12 VDC. The 3.3 V level is used for control logic and the iLOM controller. Most of the power is internally distributed as 12 V, which is converted to the various logic levels elsewhere in the system using reliable DC-DC converters. This design decision allowed Sun to simplify the power supply, and any such simplification improves reliability.

Power supplies operate in a hostile environment. We added a metal-oxide varistor (MOV, aka surge suppressor) into the power supply to help protect the system from unwanted power surges. This protection, the simplicity, low power consumption, and low parts count will result in a highly reliable power supply subsystem. This is systems engineering at its finest.

By contrast, the ATX-style power supplies used in systems such as the new Sun Fire X2100 server provide +3.3, +5, -5, +12, and -12 V. This is an old-school design and is significantly more complex than the new, simpler design. Simple designs tend to use fewer parts and thus have higher reliability. While browsing the power supplies at a local computer store last week, I noticed that most ATX power supplies were rated at around 80,000 hours MTBF. The power supplies in the X4100 and X4200 are projected to have more than twice that MTBF.
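For a rough feel of what those MTBF figures mean, MTBF converts to an annualized failure rate (AFR) under a constant-failure-rate assumption. A sketch, using the 80,000-hour figure from the text:

```python
# Hedged arithmetic: converting MTBF to an approximate annualized failure
# rate, assuming a constant failure rate and continuous operation.

HOURS_PER_YEAR = 8766  # average year, including leap years

def afr_percent(mtbf_hours: float) -> float:
    """Approximate percentage of units expected to fail per year."""
    return 100.0 * HOURS_PER_YEAR / mtbf_hours

# An 80,000 h supply works out to roughly 11% of units failing per year of
# continuous use; doubling the MTBF halves that to roughly 5.5%.
```

That difference, multiplied across a data center full of servers with two supplies each, translates directly into fewer service calls.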


The new Sun Fire X4100 and Sun Fire X4200 servers are definitely not just another repackaging of some old-school ATX design. The entire design was approached from a data center perspective with the intent of providing a highly reliable, high performance, small form factor server. The good RAS design should translate directly into long life, fewer service calls, and happy customers. Do you want to be happy?

Thursday Aug 18, 2005

Diskless redux?

NFS is alive and well. No surprise there, but the traditional uses of NFS are coming full circle.

I first used diskless clients 20 years ago, back when disks were expensive, slow, and relatively unreliable. Today disks are inexpensive, relatively slow, but still relatively unreliable. 20 years ago it was very possible to get transfer rates of 1 MByte/s and latency of 30 ms for local disks or network disks. Since then, the performance improvements of disks outpaced those of networks. This is largely due to the fact that once deployed, networks tend to have long lives. Disks don't live very long to begin with, and they tend to be easier to replace than network infrastructure. In many enterprises, networks are much more important than disks – the network is the computer.

Many early implementers of diskless workstations migrated towards a dataless model where the OS and swap space were on a local disk, but home directories and many applications were stored on NFS servers. This solved the problem of network infrastructure utilization as it was much easier to add a node than to redesign the network infrastructure to be more performant for large numbers of nodes. By keeping the mundane OS and swap activity local, you could add many more nodes onto the relatively slow and shared network.

Virtual memory systems are very good for allowing us to trade off performance for storage costs. If you think of the Solaris virtual memory system as multiple layers of cache layered atop the processor and main memory cache, then it becomes very apparent that you'd really like to have huge caches closer to the processor. The reason you can't is mostly economic and slightly technical. Putting memory very close to the processor costs more. If you look at cost/bit of L1-L3 cache, main memory, and disks, you'll see that the farther you get from the processor, the more dramatically the cost/bit drops. The trade-off is that the latency increases dramatically, too. 15 years ago, when DRAM prices went through the roof, we had systems which swapped a lot – mostly because software was bloating faster than our wallets. Simultaneously, processors were getting much faster and disks were much more affordable than DRAM. During this time, many people moved away entirely from the concept of diskless clients.

But today, DRAM prices are quite reasonable and the amount of memory typically available on a system is greater than the need (yes, we have data that proves this :-). The latency of swapping to disks hasn't improved significantly, and now the feeling is that if you have to swap, the solution is to buy more DRAM. Networks have also improved dramatically. 10 Gbit Ethernet has 1000 times the bandwidth of 10 Mbit Ethernet, with correspondingly lower latency for a given transfer. Even the ubiquitous 1 Gbit Ethernet offers much lower latency and higher bandwidth than a small pile of disks. So, why aren't diskless clients more popular? Tim Marsland blogs about doing kernel development with diskless clients. He makes a very good case for ease of management, debugging, and rapid development. These features have always been part of the allure of diskless systems.

The objection I often hear against diskless systems is that the reliability isn't as good. Disk reliability has improved over time, but not nearly as much as computing system reliability. These same people tend to also mirror all boot disks and use fancy RAID arrays for the important data. However, when pressed, nobody has given me any quantitative data showing that their overall system availability is better by having thousands of disks spread out everywhere versus dozens of disks in a highly reliable RAID array. And I know that the total cost of ownership for managing disks everywhere is much worse than for centralized, planned storage services. Someone needs to revisit the basic premises and do a quantitative analysis of the diskfull and diskless models. So, I think I'll try. I've got some interesting new ways to model such systems and will give it a shot. Don't be surprised if diskless is in again.

Monday Jun 06, 2005

Availability Benchmarking - SUPerG paper

I apologize for the delay, but here is my SUPerG 2005 paper on Availability Benchmarking. Right now we are working on measurements and analysis for availability, performability, and RAS benchmarks for a wide variety of systems. The comparative results are interesting and will be the topic of some later published works. I'll keep you posted as to our progress.

Wednesday May 04, 2005

Power hunger

Today I read an article on power demands in the datacenter. The issues discussed in the article are similar to those I was able to discuss with Sun customers at the Sun User's Performance Group (SUPerG) conference in Washington, DC recently. But it reminded me also of my early career...

In the summer of 1981 I worked for NASA at the Marshall Space Flight Center in the Spacelab Systems Division. One of my jobs was to analyze the Spacelab-1 mission timeline for periods of maximum power usage.

Spacelab is a collection of laboratory modules which fit in the shuttle orbiter payload bay. The shuttle has three auxiliary power units (APUs) (fuel cells), each of which can produce 7 kilowatts (kW) of power. Two APUs are available to the payload bay. Though a single APU provided enough power to return the shuttle safely to earth, mission rules required that all three APUs be functional to run Spacelab. In other words, if we lost an APU, we would sacrifice the scientific experiments rather than endanger the crew. With our power budget set at 7 kW, we needed to know whether we would break the budget during a Spacelab mission.

Spacelab was a multinational effort to create a reusable laboratory in space. This was during the lull between Skylab and what became known as the International Space Station. The basic idea was to put a bunch of 19" racks in a climate controlled environment and allow scientists to manage and run experiments. Prior to Spacelab, spacecraft were very much custom built and experiments were crammed into corners and crevices as room was available. Thus, each experiment was custom built for the mission it flew on. Incidentally, we did fly one electrophoresis experiment on the shuttle which originally flew on an Apollo mission; we basically strapped it in a crevice in the shuttle mid-deck. The promise of using standard 19" racks for housing experiments theoretically reduced the cost of creating experiments and thus would allow many more scientists to fly experiments than ever before. To further extend the opportunity, Spacelab was a joint effort between NASA and the European Space Agency (ESA). Thus Spacelab was a globally accessible space-borne laboratory. Cool vision.

Of course, the details of making all of this work were left to us engineers. The very first Spacelab-1 mission was quite a diverse collection of experiments from all over the world. And it all needed to be integrated. My division was responsible for this integration, and we performed much of the analysis needed to ensure overall mission success.

From a power perspective, some of the experiments didn't consume much juice. Others were ovens (literally) and required lots of juice. Given the 7 kW power budget, we had to ensure that all of the power-hungry ovens were not turned on at the same time. Once we solved that relatively easy scheduling problem, we had to prove that at no time during the mission would the dozens of active experiments cause us to break the power budget. This was a perfect job for a computer. Herman Hight, another Mississippi State grad, and I wrote a FORTRAN-4+ program which would analyze the total power consumption of a Spacelab mission at all points in the mission timeline. This is mildly difficult and involves some additional fudging because some of the experiments were started by an astronaut flipping a switch rather than by an automated process. Hence, we identified times when an astronaut could be early or late to start an experiment and blow the budget. These risky times were fed into the mission experiment tests performed by the actual crew while still on the ground, just to work out any potential issues.
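The core of the timeline analysis might be sketched like this (the original was FORTRAN; this is a Python equivalent, with experiments and wattages invented for illustration):

```python
# Sketch of a mission power-budget check: each experiment draws a constant
# load over an interval [start, end) in mission hours. Total power can only
# change at an interval boundary, so sweeping the start times finds the peak.
# All experiments and wattages below are made up.

def peak_power(experiments: list[tuple[float, float, float]],
               budget_w: float) -> tuple[float, bool]:
    """Return (peak total watts over the timeline, whether it fits the budget)."""
    peak = 0.0
    for t in {start for start, _, _ in experiments}:
        total = sum(w for s, e, w in experiments if s <= t < e)
        peak = max(peak, total)
    return peak, peak <= budget_w

mission = [(0, 10, 3000.0),   # oven A: hours 0-10, 3 kW
           (8, 20, 2500.0),   # oven B: overlaps oven A for two hours
           (0, 24, 1200.0)]   # housekeeping load for the whole day
```

With this mission, the peak of 6.7 kW during the oven overlap fits under the 7 kW budget; shifting oven B two hours earlier would break it, which is exactly the kind of "astronaut flips the switch early" risk the analysis had to flag.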

7 kW today seems like such a little amount of power. A rack full of 1U servers could easily consume 20 kW. A "power desktop" may approach 1.2 kW of real power. The art and science of power budgeting was largely lost during the 1980s-1990s in the computer business. Today, power analysis is perhaps the most important factor in designing datacenters. And it is for this reason that Sun is putting so much emphasis on reducing the power needed by making more use of the computing resources. In particular, chip multithreading (CMT) allows significantly better utilization of the pipeline. If we can replace a dozen dual-core 120 W processors with a single 80 W processor, then the power usage goes way down. And I think that is a good thing.

Friday Apr 15, 2005

I'm off to SUPerG

Soon I'll pack my bags and head off to the Sun User's Performance Group (SUPerG) conference in Washington, DC. If you've never heard of SUPerG, it is one of the most technical conferences Sun offers to the public. JavaOne is a larger and very technical conference, but the intended audiences is different. SUPerG is specifically intended for high performance and datacenter topics whereas JavaOne caters more to developers and people supporting Java infrastructure.

For various reasons, I've missed the past few SUPerG conferences, but I'm happy to be back. One of the things I enjoy most about SUPerG is that customers from all over the world can get lots of face-to-face time with the top performance specialists, hardware and software architects, and availability experts in Sun. I'm not talking about marketeers here (though they tend to lurk in the halls, too) I'm talking about real, senior engineers who take the time to write papers and discuss what they are working on with customers. This is a high signal-to-noise ratio conference.

My paper is on Availability Benchmarking, a topic which has kept me very busy over the past few months and the subject of another blog for later. After the conference, I'll post my paper so that you can read it. If you want it before then, be in Washington, DC next week and we can talk about it directly.

I hope to be able to give a day-by-day account of what I see at the conference on this blog. But be forewarned, there are three parallel tracks and I am but one attendee, so I will certainly miss some good presentations.

Wednesday Feb 09, 2005

What is going on?

I'm collecting detailed boot data from a variety of machines. I plan to calculate some System Recovery Benchmark version A (SRB-A) results for a wide variety of machines... more on that later. I notice that some systems spend significantly more time early in the boot cycle for Solaris than others. By significant I mean on the order of 30-40 seconds versus, say, 5 seconds. While this might not seem like much, for an availability guy this is huge. It is a crucial part of closing the gap between four-nines and six-nines. With our significant improvements in the Solaris 10 boot times, we already knock a lot of the time out of the boot sequence, so gaining another 25 seconds is becoming a big deal. Back to the labor-atory...
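The stakes are easy to quantify with the downtime arithmetic behind the nines (illustrative only):

```python
# Hedged sketch: yearly downtime allowed by an availability level of
# "n nines", i.e. availability = 1 - 10**-n.

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def allowed_downtime_s(nines: int) -> float:
    """Seconds of downtime per year permitted at the given number of nines."""
    return SECONDS_PER_YEAR * 10.0 ** -nines

# Four nines permits about 3,156 s (~53 minutes) of downtime per year;
# six nines permits only about 32 s per year.
```

At six nines, a single reboot that wastes an extra 25 seconds early in the boot cycle consumes most of the entire year's downtime budget, which is why those seconds are worth chasing.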



