By relling on May 09, 2006
My colleague Dong Tang has recently placed a copy of a paper for the Dependable Systems and Networks conference on the OpenSPARC website. The paper is "Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults."
This paper gives a very brief description of the Memory Page Retirement (MPR) self-healing features in Solaris 10 for SPARC- and AMD-based systems. The paper goes into detail on how we analyzed MPR, measured field data, and effectively reduced downtime and service costs for Sun customers.
A High-level view of MPR
One of the most important services that an OS provides to applications is the management of memory. The Solaris OS allows processes to allocate memory as pages. These pages can be of varying sizes, depending on the hardware's capabilities, and are backed by a combination of main memory and disk space. The actual content of a page may be held in multiple places, such as processor cache, main memory, swap space, or a file.
If a correctable fault occurs in main memory, then we don't have to perform any recovery action. However, if we continue to see correctable faults, then perhaps the memory is going bad. To avoid possible future uncorrectable faults in the same page, we can copy the data to a different page and mark the original page as unusable (retired). The policy surrounding this analysis is part of the Solaris 10 Fault Management Architecture (FMA). Very cool stuff.
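To make the idea concrete, here is a minimal sketch of a "too many correctable errors, retire the page" policy. It is not the actual FMA diagnosis logic: the class name, the threshold of 3 errors, and the one-hour window are all hypothetical values chosen for illustration.

```python
from collections import defaultdict, deque


class CorrectableErrorTracker:
    """Toy policy (hypothetical, not FMA's real rules): retire a page once it
    has seen `threshold` correctable errors within `window_s` seconds."""

    def __init__(self, threshold=3, window_s=3600.0):
        self.threshold = threshold
        self.window_s = window_s
        self.events = defaultdict(deque)  # page number -> recent error timestamps
        self.retired = set()              # pages already taken out of service

    def record_error(self, page, now):
        """Record one correctable error on `page` at time `now` (seconds).

        Returns True if this error pushes the page over the threshold,
        meaning its data should be copied elsewhere and the page retired.
        """
        if page in self.retired:
            return False  # already retired; nothing more to do
        times = self.events[page]
        times.append(now)
        # Forget errors that have aged out of the observation window.
        while times and now - times[0] > self.window_s:
            times.popleft()
        if len(times) >= self.threshold:
            self.retired.add(page)
            del self.events[page]
            return True
        return False
```

An isolated correctable error never triggers retirement here; only a burst of errors on the same page inside the window does, which matches the "perhaps the memory is going bad" reasoning above.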
If an uncorrectable fault occurs in main memory, there are several things we can do to recover from the fault:
- if the page hasn't been updated (is clean):
  - retire the bad page and reload from the backing storage (e.g. reload a text page from a file)
- if the page has been updated (is dirty):
  - if the data is in processor cache and is being written to the page, then we mark the page as having an uncorrectable fault and defer action until a subsequent access (we don't want to make extra work if the page will subsequently be freed)
  - if the data is being accessed, then the process will be forcibly terminated, the page is retired, and (hopefully) the Service Management Facility (SMF) will restart the process
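The decision list above can be sketched as a small function. This is only an illustration of the cases as described in this post, not Solaris kernel code; the enum and function names are hypothetical.

```python
from enum import Enum, auto


class Access(Enum):
    WRITEBACK = auto()  # data in processor cache is being written to the page
    READ = auto()       # a process is accessing the data in the page


class Action(Enum):
    RETIRE_AND_RELOAD = auto()     # retire the page, reload from backing storage
    MARK_AND_DEFER = auto()        # flag the page, act on a later access
    TERMINATE_AND_RETIRE = auto()  # kill the process, retire the page


def handle_uncorrectable(dirty: bool, access: Access) -> Action:
    """Pick a recovery action for an uncorrectable main memory fault."""
    if not dirty:
        # Clean page: an intact copy still exists in the backing storage.
        return Action.RETIRE_AND_RELOAD
    if access is Access.WRITEBACK:
        # Dirty page being written from cache: the page may be freed before
        # anyone reads it, so avoid doing extra work now.
        return Action.MARK_AND_DEFER
    # Dirty page being accessed: the only copy of the data is bad.
    return Action.TERMINATE_AND_RETIRE
```

For example, `handle_uncorrectable(True, Access.READ)` yields `Action.TERMINATE_AND_RETIRE`, the one case where an application is actually lost and has to be restarted (by SMF or otherwise).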
In short, the effect of an uncorrectable main memory error is now dramatically reduced. Only non-relocatable pages, such as those in some parts of the kernel, cannot be retired. Successful page retirements do not cause a reboot or Solaris outage. For applications, only dirty memory is susceptible to uncorrectable errors that would cause the application to be terminated. Applications that are designed to restart automatically, or that are managed by SMF, will restart and keep going. The ultimate result is that Solaris systems can continue to operate in the face of main memory faults. This will become more important as the amount of main memory in systems continues to increase.
As I look into my crystal ball, I see system designs with hundreds of processing elements all connected to terabytes of main memory. If you count silicon area, we'll see much more area devoted to memory than to processing elements. So it makes really good sense to add efficient, cost-effective fault recovery techniques to the memory subsystems. Since we are Sun Microsystems, we can use our systems knowledge of hardware and software to provide a highly available platform for running applications.