SMF/Predictive Self Healing Overview
By tobin on Sep 13, 2004
Predictive Self Healing is an architectural framework made up of several pieces. The one that I've been working on is SMF, or Service Management Facility. It's an infrastructure that provides several functions:
1) Defining services for Solaris, which can be the state of a device, a running application, or a set of other services. Each service is referred to by a unique identifier.
2) A formal relationship between services, with explicit dependencies.
3) Automatic starting and restarting of services.
4) A repository to store service state and configuration properties (negating the need for dozens of configuration files scattered throughout the system).
The "thousand mile view" of SMF is that the system is managed by a master "restarter" named svc.startd. This daemon enforces dependencies, starts and stops services, and generally keeps an eye on the running of the machine. All configuration is stored in a repository on the system, which is itself managed by a daemon, named svc.configd. There are also one or more "delegated restarters", each given a subset of services to manage and written specifically to deal with that subset. For example, inetd manages most networking services as a delegated restarter.
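To make "enforces dependencies" a little more concrete, here is a minimal Python sketch of the ordering problem a restarter has to solve. It is not how svc.startd is implemented. The dependency edges in the map below are invented for illustration, and `start_order` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Illustrative only: service -> set of services it depends on.
# These edges are assumptions for the example, not real manifest data.
DEPENDENCIES = {
    "svc:/network/telnet:default": {"svc:/network/inetd:default"},
    "svc:/network/inetd:default": {"svc:/milestone/network:default"},
    "svc:/milestone/network:default": set(),
}


def start_order(dependencies):
    """Return an order in which each service comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = start_order(DEPENDENCIES)
for fmri in order:
    print(fmri)
```

A service with no unmet dependencies comes first, and anything that depends on it follows. A real restarter also has to react to failures and restarts at run time, which this sketch ignores.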
Let's look at the pieces of SMF a little closer.
A service is the fundamental unit of SMF. Each service can have one or more instances, which is a specific configuration of a service. For example, Apache is a service. An Apache daemon configured to serve www.sun.com on port 80 would be an instance of that service. Apache could have several instances, all with different configurations. The service holds basic configuration properties that are inherited by each of its instances, but each instance can override configuration properties, as needed.
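The inheritance rule can be sketched in a few lines of Python. This is only a model of the lookup order described above (instance first, then service), not the repository's real storage format. The property names and values are hypothetical, as is the `effective_properties` helper.

```python
from collections import ChainMap

# Hypothetical properties set at the service level.
service_props = {"port": 80, "document_root": "/var/apache/htdocs"}

# Hypothetical overrides set on one instance of that service.
instance_props = {"port": 8080}


def effective_properties(instance, service):
    """Merge properties so that instance values win over service values."""
    return dict(ChainMap(instance, service))


effective = effective_properties(instance_props, service_props)
print(effective["port"])           # value overridden by the instance
print(effective["document_root"])  # value inherited from the service
```

Here the instance overrides `port` but inherits `document_root` unchanged from the service.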
There are also special services called milestones. These are services that correspond to a specific system state, such as "basic networking" or "local filesystems available". A milestone is basically a list of other services, and it is considered online when each of its component parts is online.
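As a rough model of that rule, a milestone's state can be derived from the states of its components. The sketch below assumes a plain dictionary of service states. The component names and the `milestone_online` helper are hypothetical.

```python
def milestone_online(components, states):
    """A milestone is online only when every component service is online."""
    return all(states.get(name) == "online" for name in components)


# Hypothetical component list and current states.
components = ["network/loopback", "network/physical"]
states = {"network/loopback": "online", "network/physical": "offline"}

print(milestone_online(components, states))  # one component is still offline

states["network/physical"] = "online"
print(milestone_online(components, states))  # every component is online now
```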
Each service is identified with an FMRI, or Fault Management Resource Identifier. It's the unique identifier representing a service, or instance. For example, the telnet service is represented by svc:/network/telnet:default, where "svc:/network/telnet" describes the service, and "default" describes a specific instance.
FMRIs can be a bit of a handful to type, so you'll find that most SMF commands will accept a "shortened" version of a service's FMRI, provided the service has only one instance. For example, most utilities will accept network/telnet as the FMRI for telnet, since it comes installed with only one instance.
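The structure of an FMRI, and the idea behind the shortened form, can be sketched in Python. This is a simplified model: the matching rules of the real SMF tools are not reproduced here, and `Fmri`, `parse_fmri`, and `resolve` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Fmri(NamedTuple):
    service: str             # e.g. "network/telnet"
    instance: Optional[str]  # e.g. "default", or None if not given


def parse_fmri(text):
    """Split an FMRI, with or without the "svc:/" prefix, into its parts."""
    if text.startswith("svc:/"):
        text = text[len("svc:/"):]
    service, sep, instance = text.partition(":")
    return Fmri(service, instance if sep else None)


def resolve(short, known):
    """Expand a shortened name if exactly one known instance matches it."""
    wanted = parse_fmri(short)
    matches = [
        full for full in known
        if parse_fmri(full).service == wanted.service
        and wanted.instance in (None, parse_fmri(full).instance)
    ]
    if len(matches) != 1:
        raise LookupError(f"{short!r} matches {len(matches)} instances")
    return matches[0]


known = ["svc:/network/telnet:default"]
print(parse_fmri(known[0]))
print(resolve("network/telnet", known))
```

If a service had two instances, `resolve` would refuse the shortened name, which mirrors why the short form only works for single-instance services.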
You will have noticed that telnet is preceded by the word network. SMF groups services into several categories, to provide organization and uniqueness of naming. The standard categories are:
Each service on the machine is always in one of seven discrete states, observable by the SMF CLI tools. The possible states of each service are:
- degraded - The service is running, but something is wrong, or its capacities are limited in some way.
- disabled - The service has been disabled and is not running.
- legacy_run - A legacy rc.X script has been started by the system, and is running. We'll talk more about legacy services later.
- maintenance - The instance has encountered some sort of error, and it needs to be repaired by an administrator.
- offline - The service is enabled, but not running yet, usually because a service it depends on is not online yet.
- online - The service is both enabled and running successfully.
- uninitialized - svc.startd has not yet read this service's configuration.
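If you wanted to work with these states in a script, one way to model them is a Python enumeration. This is only a sketch built from the list above. The `ServiceState` class and the `needs_administrator` helper are hypothetical names, not part of SMF.

```python
from enum import Enum


class ServiceState(Enum):
    """The seven service states listed above."""
    DEGRADED = "degraded"
    DISABLED = "disabled"
    LEGACY_RUN = "legacy_run"
    MAINTENANCE = "maintenance"
    OFFLINE = "offline"
    ONLINE = "online"
    UNINITIALIZED = "uninitialized"


def needs_administrator(state):
    """Only the maintenance state calls for repair by an administrator."""
    return state is ServiceState.MAINTENANCE


# Look a state up by the name the CLI tools would show.
state = ServiceState("maintenance")
print(state.name, needs_administrator(state))
```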
That's about enough typing for me tonight. Next time we'll start to look at how services are described, and how you administer the system using SMF. As usual, if you have any questions, please feel free to ask them in the comments section.