Top Solaris 10 features for fault management

A number of Sun bloggers have posted top-N lists of cool new features in Solaris 10 (e.g., in Adam's blog). I thought I'd have a go at a Solaris 10 top 10 from the error handling, fault management, and "Predictive Self-Healing" point of view, and then go into each item in a bit more detail in future entries.
  1. Sun has adopted a fault management architecture, and Solaris 10 delivers the first (of many planned) offerings implementing this architecture in the fault management daemon (part of the svc:/system/fmd:default service).
  2. Error event handlers now propagate structured error reports to the fault manager, where they are logged in perpetuity.
  3. Diagnosis engines now exist to automate much of the diagnosis.
  4. Agent software can implement various policies given a diagnosis, e.g., to offline and blacklist a CPU or to retire some memory.
  5. Only diagnoses will appear on consoles etc., and they reference web-based knowledge articles.
  6. The contract filesystem ctfs provides a mechanism by which we can communicate hardware errors to groups of affected processes.
  7. The Service Management Facility is available to manage services affected by errors.
  8. Error trap handlers, now that there is a clear separation of responsibilities, are more robust.
  9. Getting error telemetry out of the kernel is dead easy now.
  10. Fault management is no longer an afterthought! And it is set to grow and grow.
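
To make the flow in items 2 to 5 a little more concrete, here is a toy Python model of it: error reports are posted to a fault manager and logged, a simple diagnosis engine turns repeated reports against one resource into a single diagnosis, and a subscribed agent applies a policy (here, retiring the resource). This is only a sketch and has nothing to do with the real fmd interfaces. The class names, the count-based threshold rule, the "ereport.example..." and "fault.example..." class strings, and the "EXAMPLE-0001" article identifier are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass
class ErrorReport:
    """A structured error report (hypothetical shape, not fmd's real format)."""
    error_class: str  # e.g. "ereport.example.cpu-ce" (made-up class name)
    resource: str     # e.g. "cpu0"


@dataclass
class Diagnosis:
    """The result of a diagnosis: what is faulty and where to read about it."""
    resource: str
    fault_class: str
    article: str  # placeholder knowledge-article identifier


class FaultManager:
    """Toy fault manager: logs every report and diagnoses on a simple count."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold                 # assumed rule: N reports => fault
        self.error_log: List[ErrorReport] = []     # every report is kept
        self.counts: Dict[str, int] = {}
        self.diagnosed: Set[str] = set()
        self.agents: List[Callable[[Diagnosis], None]] = []

    def subscribe(self, agent: Callable[[Diagnosis], None]) -> None:
        """Register an agent to be called with each new diagnosis."""
        self.agents.append(agent)

    def post(self, report: ErrorReport) -> Optional[Diagnosis]:
        """Log a report; return a Diagnosis only when one is newly made."""
        self.error_log.append(report)
        if report.resource in self.diagnosed:
            return None  # already diagnosed, nothing new to announce
        count = self.counts.get(report.resource, 0) + 1
        self.counts[report.resource] = count
        if count < self.threshold:
            return None  # not enough evidence yet
        self.diagnosed.add(report.resource)
        diagnosis = Diagnosis(
            resource=report.resource,
            fault_class="fault.example." + report.resource,
            article="EXAMPLE-0001",
        )
        for agent in self.agents:
            agent(diagnosis)
        return diagnosis


retired: List[str] = []


def retire_agent(diagnosis: Diagnosis) -> None:
    """Example policy: take the diagnosed resource out of service."""
    retired.append(diagnosis.resource)


fm = FaultManager(threshold=3)
fm.subscribe(retire_agent)
results = [fm.post(ErrorReport("ereport.example.cpu-ce", "cpu0")) for _ in range(4)]
```

In this run, four reports are logged but only the third produces a diagnosis, and the agent acts on it once. That mirrors the split described in the list: raw error reports go to the log, while only the diagnosis is surfaced and acted upon.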
About

I work in the Fault Management core group; this blog describes some of the work performed in that group.
