Welcome to Monday!
I'm jumpstarting an Ultra 60 in our lab so I can test a bugfix when I see this message:
WARNING: /pci@1f,4000/scsi@3/sd@0,0 (sd2):
Error for Command: load/start/stop Error Level: Informational
Requested Block: 0 Error Block: 0
Vendor: SEAGATE Serial Number: 9808500387
Sense Key: Soft Error
ASC: 0x5d (drive operation marginal, service immediately (failure prediction threshold exceeded)), ASCQ: 0x0, FRU: 0x45
That's a pretty serious-looking message, so why is it only a "WARNING" rather than an "ERROR"?
The answer comes from the routine gda_errmsg(..), which is in usr/src/uts/common/io/dktp/dcdev/gda.c starting at line 247. This routine calls gda_log(..), which is a wrapper around cmn_err(..). One of the parameters we pass to cmn_err(..) is the error level: CE_CONT, CE_NOTE, CE_WARN, CE_PANIC and CE_IGNORE (defined in usr/src/uts/common/sys/cmn_err.h). The gda_errmsg(..) routine passes CE_WARN (that's the first part of the message above) and CE_CONT (the rest of the message).
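To make the header-plus-continuation pattern concrete, here is a rough user-space sketch. It is not the kernel code: mock_cmn_err() and log_disk_warning() are hypothetical names, the numeric values of the levels are assumed, and the prefixes other than "WARNING: " are only illustrative. It just shows how one CE_WARN call followed by CE_CONT calls produces a single message whose first line alone carries the "WARNING:" label.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Levels named in the post; the numeric values here are an assumption. */
enum { CE_CONT, CE_NOTE, CE_WARN, CE_PANIC, CE_IGNORE };

/*
 * Hypothetical stand-in for cmn_err(): appends an (illustrative) severity
 * prefix and the formatted text to buf. CE_CONT gets no prefix, so it reads
 * as a continuation of whatever was logged before it.
 */
static void mock_cmn_err(char *buf, size_t len, int level, const char *fmt, ...)
{
    static const char *prefix[] = { "", "NOTICE: ", "WARNING: ", "", "" };
    size_t used = strlen(buf);
    va_list ap;
    int n;

    if (level < CE_CONT || level >= CE_IGNORE || used >= len)
        return;                         /* ignored level or no room left */

    n = snprintf(buf + used, len - used, "%s", prefix[level]);
    if (n < 0 || used + (size_t)n >= len)
        return;                         /* prefix did not fit */
    used += (size_t)n;

    va_start(ap, fmt);
    vsnprintf(buf + used, len - used, fmt, ap);
    va_end(ap);
}

/* Hypothetical caller: one CE_WARN header, then CE_CONT detail lines. */
static void log_disk_warning(char *buf, size_t len)
{
    buf[0] = '\0';
    mock_cmn_err(buf, len, CE_WARN, "%s (%s):\n",
                 "/pci@1f,4000/scsi@3/sd@0,0", "sd2");
    mock_cmn_err(buf, len, CE_CONT, "Sense Key: %s\n", "Soft Error");
}
```

Calling log_disk_warning() on a large enough buffer yields the "WARNING:" line followed by an unprefixed detail line, which is the same shape as the console message above.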
So what should I do about this message?
Replace the disk immediately.
There is no other option. The message says that the drive's failure prediction threshold has been exceeded, so the drive's internal electronics are telling you that it's about to die. In my case this is a rather old 4 GB Seagate disk, so I'm more than happy to get a new one in instead.
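If you want to pick this condition out of saved console output, a small check along these lines might do. It is only a sketch: is_failure_prediction() is a hypothetical helper, and it assumes the line format shown in the message above, where the additional sense code appears as "ASC: 0x..". ASC 0x5d is the code the drive reported here for "failure prediction threshold exceeded".

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if a log line reports ASC 0x5d
 * (failure prediction threshold exceeded), 0 otherwise.
 * Searching for "ASC: " deliberately skips the "ASCQ: " field.
 */
static int is_failure_prediction(const char *line)
{
    const char *p = strstr(line, "ASC: ");
    unsigned int asc;

    if (p == NULL)
        return 0;                       /* no ASC field on this line */
    if (sscanf(p, "ASC: %x", &asc) != 1)
        return 0;                       /* field present but unparsable */
    return asc == 0x5d;
}
```

Feeding it each line of the message would flag only the "ASC: 0x5d" line, regardless of whether the message was labelled a warning or an error.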
We don't pass CE_PANIC as an argument to gda_log(..) because we do not want to take out the system due to a (generally) online-resolvable issue. Of course if this is your boot disk you'd better take action right away, but Solaris isn't going to panic on you from this incident.
Moral of the story: don't ignore "WARNING" messages just because they're only "WARNING"s, and always read the full text of the message. It could really be an error.