Friday Jan 29, 2010

FMA's Historical Diagnosis

[A new company now, so time to give the weblog a new look.]

And time to talk about a very cool FMA feature I've been meaning to blog about for a while. It's been termed "historical diagnosis". The crux is having FMD reduce the number of convicted suspects in multi-entry suspect lists by comparing them against the historical record of other faults diagnosed on that system. When certain correlations are found, FMD auto-acquits some of the suspects in the affected suspect lists.

This is probably best explained via an example. Suppose a suspect list is issued indicting two components, FRUA and FRUB. Some time later, another suspect list is issued, this time indicting only FRUA. FMD will discard the newer, single-entry suspect list as a duplicate and acquit FRUB in the original suspect list. The end result: only FRUA is indicted. (And yes, component serial numbers are used to ensure the "old" FRUA is the same as the "new" FRUA.)
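To make the FRUA/FRUB case concrete, here is a rough Python sketch of just that one scenario. It is not FMD's code: the `Fru` and `SuspectList` types and the `apply_history` function are names I made up for illustration, and the only rule modeled is "a newer list that is a strict subset of an older list's suspects gets absorbed as a duplicate".

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fru:
    """Hypothetical FRU identity: name plus serial number."""
    name: str
    serial: str


@dataclass
class SuspectList:
    """Hypothetical suspect list: who is still indicted, who was acquitted."""
    suspects: set
    acquitted: set = field(default_factory=set)


def apply_history(history, new_frus):
    """Record a new suspect list, narrowing an older one when possible.

    Returns True if the new list was discarded as a duplicate of an
    older list (which is narrowed in place), False if it was recorded
    as a new entry. Only the strict-subset case is modeled here.
    """
    new_frus = set(new_frus)
    for old in history:
        if new_frus and new_frus < old.suspects:  # strict subset
            old.acquitted |= old.suspects - new_frus
            old.suspects = set(new_frus)
            return True
    history.append(SuspectList(suspects=new_frus))
    return False


history = []
frua = Fru("FRUA", "SN-0001")  # made-up serial numbers
frub = Fru("FRUB", "SN-0002")
apply_history(history, [frua, frub])      # original two-entry list
absorbed = apply_history(history, [frua])  # later single-entry list
```

Because `Fru` compares on both name and serial, a FRUA with a different serial number would not match the old FRUA and would simply be recorded as a new suspect list.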

There are of course many different use cases with variations when the FRUs in the suspect list are in different states (faulty|isolated|replaced|acquitted), but you get the idea. And what's very nice is that all of this is managed by FMD itself. No changes to diagnosis rules or response agents are needed. The changes were incorporated into build 125.

:wq

Thursday Jan 14, 2010

FMD Core Dumps and Filling /

snv_132 has a nice fix to avoid a scenario I've seen plenty of times in the lab (and not so much in the field, thankfully :). FMD is a service within SMF and as a service it will be restarted if it crashes. And all things being equal, that's a good thing. FMD also produces a core dump in /var/fm/fmd upon crashing, so we can go figure out what's wrong.

SMF's restarter algorithm includes protection to avoid restarting a service too rapidly. But the protection parameters are slanted toward a service that starts very rapidly. FMD can take tens of seconds to start, sometimes longer. If there's a nasty bug in FMD or one of its modules that causes FMD to crash on start, SMF's restarter algorithm doesn't detect this and continually restarts the service. (The same holds true for any service that takes "a while" to start.) And with each restart comes a core file in /var/fm/fmd, which is typically part of the / filesystem. From there, it's only a matter of time before the filesystem fills up.
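A toy Python model shows how a slow starter can slip past this kind of protection. This is not svc.startd's actual algorithm: the sliding-window check, the function names, and the numbers (5 failures within 10 seconds) are all assumptions picked purely to illustrate the gap.

```python
from collections import deque


def detects_restart_loop(crash_times, max_failures, window):
    """Flag a loop once max_failures crashes fall within window seconds.

    Hypothetical detector: keeps only the crashes that are at most
    `window` seconds older than the latest one and counts them.
    """
    recent = deque()
    for t in crash_times:
        recent.append(t)
        while t - recent[0] > window:
            recent.popleft()
        if len(recent) >= max_failures:
            return True
    return False


def crash_times(seconds_per_attempt, attempts):
    """Crash timestamps for a service that dies at the end of every start."""
    return [seconds_per_attempt * (i + 1) for i in range(attempts)]


# A service that crashes 1 second into each start trips the detector.
fast = detects_restart_loop(crash_times(1, 100), max_failures=5, window=10)

# A service that takes 30 seconds to crash never has 5 crashes inside
# any 10 second window, so the same detector never fires.
slow = detects_restart_loop(crash_times(30, 100), max_failures=5, window=10)
```

With these made-up parameters the slow service crashes 100 times without ever being flagged, which means 100 core files.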

The restart gap FMD slips through is closed with the integration of:
6219078 svc.startd's algorithm for detecting restart loops should be sensible.

:wq

Friday Dec 11, 2009

Rainbow Falls Integration

Several months ago, Rainbow Falls got press. And today, the processor support was integrated into OpenSolaris. While I played only a small role in the specifics of the Rainbow Falls project, I'm personally excited about this integration, as Rainbow Falls is the first SPARC processor to utilize the Platform Independent FMA work I spearheaded roughly one year ago. From an implementation standpoint, Rainbow Falls required few changes to topo enumeration code or diagnosis rules.

Looking forward to the forthcoming products!

:wq

About

user9148476
