Managing Fault Management Log Files

Sun is working on a Solaris System Administration Series of books. The first book planned in the series centers around Security. The follow-on book is entitled "Solaris 10 Essentials", for which I've been working on a Fault Management chapter. Graciously, the folks running the book are allowing me to post a portion of that chapter on my blog.

I chose to post the section on FMA log files. Hopefully you'll find this interesting, both from an FMA perspective, and as a taste of some good material to come in the new book series.

:wq


The Fault Manager daemon (fmd) maintains two persistent log files of events: the error log and the fault log. The error log persistently records inbound telemetry information (ereports), and the fault log persistently records diagnosis and repair events. Both log files are in the Extended Accounting format associated with libexacct(3LIB). The log files reside in /var/fm/fmd. These log files are viewed by using the fmdump command.

 # fmdump -? 
 fmdump: illegal option -- ? 
 Usage: fmdump [-efvV] [-c class] [-R root] [-t time] [-T time] [-u uuid] 
        [-n name[.name]*[=value]] [file]
       -c  select events that match the specified class 
       -e  display error log content instead of fault log content 
       -f  follow growth of log file by waiting for additional data 
       -R  set root directory for pathname expansions 
       -t  select events that occurred after the specified time 
       -T  select events that occurred before the specified time 
       -u  select events that match the specified uuid 
       -n  select events containing named nvpair (with matching value) 
       -v  set verbose mode: display additional event detail 
       -V  set very verbose mode: display complete event contents

With no options, fmdump displays the contents of the fault log. The -e option instructs fmdump to examine the error log. Various options provide more detailed and granular scrutiny of the log files. However, the commonly used options are -v for more verbose output and -u to list only those events associated with a UUID.

Automatic Log Rotation
Both the error and fault log files have historical recording, similar to /var/adm/messages. By default, up to 10 historical error and fault log files are kept. Historical logging in turn requires log rotation, which for the fmd log files is managed by the logadm command. By default, logadm runs from the root user's crontab each day at 3:10 a.m. The logadm.conf entries for fmd log files are as follows:

 # grep /var/fm/fmd /etc/logadm.conf
 /var/fm/fmd/errlog -M '/usr/sbin/fmadm -q rotate errlog && mv /var/fm/fmd/errlog.0- $nfile' -N -s 2m
 /var/fm/fmd/fltlog -A 6m -M '/usr/sbin/fmadm -q rotate fltlog && mv /var/fm/fmd/fltlog.0- $nfile' -N -s 10m

The errlog file is rotated when the active file grows larger than 2 megabytes. The fltlog log threshold for rotation is 10 megabytes. Also note the use of -A on the fltlog file, which means that fault log files older than 6 months are deleted, irrespective of size.
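
The size test behind these entries can be sketched in a few lines of shell. This is only an illustration and not how logadm is implemented: the needs_rotation helper is hypothetical, the limit is passed in bytes, and both the inclusive comparison and the reading of 2m as 2 x 1024 x 1024 bytes are assumptions of the sketch.

```shell
# Hypothetical helper, for illustration only (not part of logadm or fmd).
# Succeeds when FILE exists and is at least LIMIT bytes long.
needs_rotation() {
    file=$1
    limit=$2
    [ -f "$file" ] || return 1
    # wc -c pads its output on some systems, so strip the blanks.
    size=$(wc -c < "$file" | tr -d ' ')
    [ "$size" -ge "$limit" ]
}

# Assumed byte values for the 2m and 10m thresholds.
ERRLOG_LIMIT=$((2 * 1024 * 1024))
FLTLOG_LIMIT=$((10 * 1024 * 1024))

# Example use:
#   needs_rotation /var/fm/fmd/errlog "$ERRLOG_LIMIT" && echo "errlog is due"
```

The arithmetic expansion requires a POSIX shell such as ksh or bash.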

Also note that after the fmadm rotate command, an mv command renames the file to a final archived name. So, automatic rotation is a two-step process:

  1. fmadm rotate creates a *log.0- file.
  2. logadm renames the *log.0- file to *log.[0-9].

The following example shows output indicating a system with automatically rotated error log files:

 # cd /var/fm/fmd ; ls -l errlog*
 -rw-r--r--   1 root     root     2014185 Jun 25 16:32 errlog 
 -rw-r--r--   1 root     root     2049327 Jun 10 16:30 errlog.0 
 -rw-r--r--   1 root     root     3123843 May 28 16:30 errlog.1 
 -rw-r--r--   1 root     root     2174873 May 19 16:30 errlog.2 
 -rw-r--r--   1 root     root     2049173 May  7 16:30 errlog.3 
 -rw-r--r--   1 root     root     2293094 Apr 22 16:30 errlog.4 
 -rw-r--r--   1 root     root     2583748 Apr  9 16:30 errlog.5 
 -rw-r--r--   1 root     root     2867374 Mar 10 16:30 errlog.6 
 -rw-r--r--   1 root     root     2187465 Feb  8 16:30 errlog.7 
 -rw-r--r--   1 root     root     2211937 Jan 25 16:30 errlog.8 
 -rw-r--r--   1 root     root     2328587 Jan  2 16:30 errlog.9

Manual Log Rotation
The Fault Manager daemon error and fault log files can also be rotated manually. The logadm.conf entries show that the fmadm rotate <logname> command is used for an on-demand log rotation, followed by some post-processing. The following output shows what happens if only the fmadm rotate <logname> command is used:

 # ls -l /var/fm/fmd
 total 54 
 drwx------   3 root     sys          512 May 13 14:55 ckpt 
 -rw-r--r--   1 root     root       13049 May 13 15:00 errlog 
 -rw-r--r--   1 root     root       11013 May 13 15:01 fltlog 
 drwx------   2 root     sys          512 May 13 15:01 rsrc 
 drwx------   2 root     sys          512 May 13 02:04 xprt 
 # fmadm rotate errlog 
 fmadm: errlog has been rotated out and can now be archived 
 # fmadm rotate fltlog 
 fmadm: fltlog has been rotated out and can now be archived 
 # ls -l /var/fm/fmd 
 total 58 
 drwx------   3 root     sys          512 May 13 14:55 ckpt 
 -rw-r--r--   1 root     root         330 May 13 15:01 errlog 
 -rw-r--r--   1 root     root       13049 May 13 15:00 errlog.0- 
 -rw-r--r--   1 root     root         330 May 13 15:01 fltlog 
 -rw-r--r--   1 root     root       11013 May 13 15:01 fltlog.0- 
 drwx------   2 root     sys          512 May 13 15:01 rsrc 
 drwx------   2 root     sys          512 May 13 02:04 xprt

Note that manual rotation leaves a *log.0- file. When rotated automatically, logadm summarily renames this file to the next historical log file. Manual rotation executes the rotation steps only within fmd, which creates the *log.0- file. The result is that the next manual rotation will overwrite the previous *log.0- file. For example:

 # ls -l /var/fm/fmd/errlog*
 -rw-r--r--   1 root     root         330 May 18 11:01 errlog
 -rw-r--r--   1 root     root       13049 May 13 15:00 errlog.0-
 # fmadm rotate errlog
 fmadm: errlog has been rotated out and can now be archived
 # ls -l /var/fm/fmd/errlog*
 -rw-r--r--   1 root     root         329 Jul 25 18:35 errlog 
 -rw-r--r--   1 root     root         330 May 18 11:01 errlog.0-

Note that errlog.0- has been overwritten. Any information in the log file from May 13 15:00 is gone. Recall that automatic log rotation is a two-step process. Using the fmadm rotate command directly only performs the first step.
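
To make the missing second step concrete, the rename that follows fmadm rotate can be imitated by hand. The sketch below is an assumption-laden illustration, not logadm's actual implementation: the archive_rotated function is hypothetical, and the default of 10 retained files simply mirrors the default mentioned earlier.

```shell
# Illustration only: imitate the rename step that follows "fmadm rotate".
# archive_rotated DIR NAME [KEEP] moves NAME.0- to NAME.0 after shifting
# the existing NAME.0 .. NAME.(KEEP-2) up by one.
archive_rotated() {
    dir=$1          # directory holding the logs, e.g. /var/fm/fmd
    name=$2         # log name, e.g. errlog
    keep=${3:-10}   # assumed number of historical files to retain

    pending="$dir/$name.0-"
    [ -f "$pending" ] || return 1   # nothing has been rotated out

    # Work from the oldest slot down so no file is clobbered early;
    # the file in the last slot is overwritten and so ages out.
    i=$((keep - 1))
    while [ "$i" -gt 0 ]; do
        prev=$((i - 1))
        if [ -f "$dir/$name.$prev" ]; then
            mv "$dir/$name.$prev" "$dir/$name.$i"
        fi
        i=$prev
    done

    mv "$pending" "$dir/$name.0"
}
```

Running archive_rotated /var/fm/fmd errlog immediately after each manual fmadm rotate errlog would keep the rotated-out file from being overwritten by the next manual rotation.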

A cleaner on-demand log rotation method is to use logadm to process the logadm.conf file, but to override the default rotation periods and sizes. This method has the advantage of ensuring that the historical log files are preserved. For example:

 # ls -l errlog*
 -rw-r--r--   1 root     root         330 May 13 15:01 errlog
 -rw-r--r--   1 root     root       13049 May 13 15:00 errlog.0-
 # logadm -p now -s 1b /var/fm/fmd/errlog
 # ls -l errlog*
 -rw-r--r--   1 root     root         330 Sep 11 10:17 errlog 
 -rw-r--r--   1 root     root         330 May 13 15:01 errlog.0

And similarly for the fault log:

 # ls -l fltlog*
 -rw-r--r--   1 root     root         330 May 13 15:01 fltlog
 -rw-r--r--   1 root     root       11013 May 13 15:01 fltlog.0-
 # logadm -p now -s 1b /var/fm/fmd/fltlog
 # ls -l fltlog*
 -rw-r--r--   1 root     root         330 Sep 11 10:22 fltlog 
 -rw-r--r--   1 root     root         330 May 13 15:01 fltlog.0

Log Rotation Failures
The rotation of a log file can fail. If a rotation request is made while an ereport is being written to the log file, fmd will wait 200 milliseconds and then retry the rotation. If after 10 attempts the rotation is still not successful, fmd will abandon the operation and report the following error:

 # fmadm rotate errlog 
 fmadm: failed to rotate errlog: log file is too busy to rotate (try again later)

Such a condition can persist if a steady stream of errors is occurring on a system, such as a “storm” of correctable errors. Even with rotation failures, ereports are still persistently logged to the errlog file.
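
Because the message suggests trying again later, a script that rotates logs on demand may want to repeat the request a few times before giving up. The following is a minimal sketch: the retry helper, the attempt count, and the delay are all hypothetical, and it assumes the wrapped command exits non-zero when the rotation fails.

```shell
# Illustration only: run a command up to ATTEMPTS times, sleeping DELAY
# seconds between tries. Returns 0 on the first success, 1 otherwise.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        "$@" && return 0
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example use (the values are arbitrary):
#   retry 5 60 /usr/sbin/fmadm rotate errlog
```

A rotation that succeeds through fmadm rotate still leaves a *log.0- file that must be archived separately.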

Examining Historical Log Files
Once log files have been rotated, you can examine historical information by passing the name of a rotated log file as the final argument to the fmdump command. (The -f option follows the growth of a log file; it does not select a file.) For example:

 # fmdump -v -u 04837324-f221-e7dc-f6fa-dc7d9420ea76
 TIME                 UUID             SUNW-MSG-ID
 fmdump: /var/fm/fmd/fltlog is empty

 # fmdump -v -u 04837324-f221-e7dc-f6fa-dc7d9420ea76 fltlog.0
 TIME                 UUID             SUNW-MSG-ID 
 May 13 15:00:02.2409 04837324-f221-e7dc-f6fa-dc7d9420ea76 AMD-8000-AV 
   100%  fault.cpu.amd.dcachedata

         Problem in: hc://:product-id=Sun-Ultra-20-Workstation:chassis-id=0604FK401F:server-id=hexterra/motherboard=0/chip=0/cpu=0
            Affects: cpu:///cpuid=0
                FRU: hc://:product-id=Sun-Ultra-20-Workstation:chassis-id=0604FK401F:server-id=hexterra/motherboard=0/chip=0
           Location: CPU 0

The fmdump command displays any events in the fltlog.0 file associated with UUID 04837324-f221-e7dc-f6fa-dc7d9420ea76.
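
When it is not known which historical file holds an event, the same lookup can be repeated over every rotated fault log. The sketch below is one way to do that under stated assumptions: the search_history function and the FMDUMP and FMD_DIR variables are hypothetical names introduced here, and fmdump is assumed to be on the PATH.

```shell
# Illustration only: run fmdump over each rotated fault log in turn.
# FMDUMP and FMD_DIR can be overridden to try the loop elsewhere.
FMDUMP=${FMDUMP:-fmdump}
FMD_DIR=${FMD_DIR:-/var/fm/fmd}

search_history() {
    uuid=$1
    # fltlog.[0-9] matches the archived files but not fltlog.0-.
    for f in "$FMD_DIR"/fltlog.[0-9]; do
        [ -f "$f" ] || continue     # the pattern matched nothing
        echo "== $f"
        "$FMDUMP" -v -u "$uuid" "$f"
    done
}

# Example use:
#   search_history 04837324-f221-e7dc-f6fa-dc7d9420ea76
```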

