smf(5) fault/retry models

Someone asked on an internal Sun alias how svc.startd(1M) determines whether there was a fault in the service and when to put the service in maintenance. Unfortunately, this is not well-described in our existing manpages, but I'm working on that. Still, it takes a little while for manpage changes to propagate into the mainline Solaris release so I figured I'd include my description from email here. A more formal version will be coming, and I'll update this post if subsequent questions yield a better description.

It is important to mention first that svc.startd(1M) offers three separate service models: contract, transient, and wait. These are described in the Service Developer Introduction. I'll only touch on the fault/retry models for the common ones, contract and transient here.

Next, I'd like to point out that there's a distinction between method failures and service failures, from svc.startd(1M)'s point of view. So, I'll go over each type of failure and how it is handled.

svc.startd(1M) believes a method has failed if it returns a non-zero exit code. Method failures cause a service to go into the maintenance state immediately if the exit code is $SMF_EXIT_ERR_CONFIG or $SMF_EXIT_ERR_FATAL. All other failures will cause the service to go back to offline. Remember, as smf(5) describes, if a service is offline and its dependencies are satisfied, we try to start the service. But, if 3 method failures happen in a row, or if the service is restarting too quickly, that service will go into maintenance.

A service failure is determined by a combination of the service model (transient or contract) and the value of the startd/ignore_error property.

A contract type service is considered to have failed if any of the following conditions occur:

  • all processes in the service exit

  • any processes in the service coredump

  • a process outside the service sends a service process a fatal signal (e.g. an admin pkills a service process)

The latter two of these conditions may be ignored by the service by specifying core and/or signal in startd/ignore_error. All of these service failures are detected by contract events. I've talked earlier about contracts and fault isolation in smf(5) too.

Defining a service as transient means that svc.startd(1M) doesn't track processes for that service, so none of the service errors above matter. Thus, a transient service only goes to maintenance if a method failure occurs.

Technorati Tags: , , and .

Comments:

The problem I'm having is when my gnome looks like this "maintenance 12:51:17 svc:/application/gdm2-login:default" Would I need to just blow it away and start from scratch? I'm having problems with power management, login keeps resetting and my gnome doesn't work. I'll keep scouring the web to look for ideas. thanks

Posted by Mr. Sarccastik on September 15, 2006 at 02:56 AM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed
About

Liane Praza

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today