Monday Sep 27, 2004

SMF/Predictive Self Healing: svcadm(1M)

Today we'll take a look at SMF's main administrative tool, svcadm(1M).  With this tool, you will enable, disable, and maintain your services.


Let's start with a commonly used service, ssh.

[straylight] % svcs network/ssh
STATE          STIME    FMRI
online         Sep_24   svc:/network/ssh:default

From another machine, we can verify that everything is running fine:

[proxima-centauri] % ssh straylight

As you can see, ssh is enabled and running on my machine.  Now let's disable it.

[straylight] % svcadm disable network/ssh
[straylight] % svcs network/ssh
STATE          STIME    FMRI
disabled       0:56:48 svc:/network/ssh:default

Now the service reads as disabled, and we can see:

[proxima-centauri] % ssh straylight
ssh: connect to host straylight port 22: Connection refused

...that it actually is.  Turning the service on is just as easy.

[straylight] % svcadm enable network/ssh
[straylight] % svcs network/ssh
STATE          STIME    FMRI
online         0:58:07  svc:/network/ssh:default

Note that it's now online.  (The STIME value updates each time an administrative action changes the service's state.)

Enable and disable have some extra options that are quite useful.  Say we enabled ssh, but that we noticed that it was offline:

[straylight] % svcs network/ssh
STATE          STIME    FMRI
offline        1:00:47 svc:/network/ssh:default

We know from my previous post that offline means that the service is enabled, but something it depends on is missing (either disabled, or offline).  Let's see what is wrong:

[straylight] % svcs -d network/ssh:default
STATE          STIME    FMRI
disabled       1:00:24 svc:/system/cryptosvc:default
online         Sep_24   svc:/network/loopback:default
online         Sep_24   svc:/system/filesystem/usr:default

Ah hah, cryptosvc isn't enabled.  You might have a service with lots of dependencies that are disabled, or you might have dependencies disabled many levels deep.
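When the dependency list is long, you can let a small filter pick out the ones that aren't online instead of reading the table by eye.  This is only a sketch: deps_not_online is a name I've made up, and it assumes the three-column STATE/STIME/FMRI layout shown above.  Keep in mind that not everything it prints is necessarily what's holding your service back.

```shell
#!/bin/sh
# deps_not_online: hypothetical helper, not part of SMF.
# Reads `svcs -d <service>` output on stdin, skips the header line,
# and prints the FMRI (third column) of every dependency whose
# state (first column) is anything other than "online".
deps_not_online() {
    awk 'NR > 1 && $1 != "online" { print $3 }'
}

# On a Solaris 10 machine this would be used as:
#   svcs -d network/ssh:default | deps_not_online
```

Fed the listing above, it would print only svc:/system/cryptosvc:default.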

Do you want to walk through all those services, find out why they're not on, and enable every dependency by hand?  Of course you don't.  So svcadm has a "recursive enable" option that goes through and enables everything that your service depends on.

[straylight] % svcadm enable -r network/ssh

[straylight] % svcs network/ssh
STATE          STIME    FMRI
online         1:02:23 svc:/network/ssh:default

[straylight] % svcs -d network/ssh:default
STATE          STIME    FMRI
online         Sep_24   svc:/network/loopback:default
online         Sep_24   svc:/system/filesystem/usr:default
online         1:02:22 svc:/system/cryptosvc:default

As you can see, we recursively enabled not only ssh, but everything it depended on, allowing it to come online.

One last option of note for enable/disable is the "temporary" option.  Say that you want to enable/disable a service just for this session, but have it revert to its previous state on reboot, in case there are problems.  If ssh is disabled and you issue:

[straylight] % svcadm enable -t network/ssh 

The enable will only be temporary.  If you reboot the machine, the service will once again be disabled.


Refresh serves two purposes.  First, if you've changed any of your service's properties (say you've added a dependency or changed the start timeout), refreshing the service makes the new properties active.  Second, in addition to "start" and "stop", you can define an optional method called "refresh".  If your daemon re-reads its configuration file when it's sent a HUP signal, you put that in the refresh method, and the method is called whenever you refresh the service.

A good example of this is DHCP.  If you change one of the parameters in dhcpsvc.conf, you issue:

[straylight] % svcadm refresh network/dhcp-server 

... and your changes become active.


Restart is pretty self-evident.  Restarting a service means that you stop it and start it again.  Where in the past you might have issued /etc/init.d/sendmail stop followed by /etc/init.d/sendmail start, now you would use:

[straylight] % svcadm restart network/smtp:sendmail 

... which will restart sendmail.

mark (degraded | maintenance)

Mark is used to force a service into a certain state.  (The states are listed in my overview post, if you've forgotten them.)  An administrator might want to force a service into the maintenance state to let other administrators know that there's something wrong with it that needs to be addressed before it's started again.  You can force a service into either maintenance (which will shut the service down) or degraded (which will leave it running, but let others know that it's running in a degraded state).

Keeping with our earlier example of ssh:

[straylight] % svcadm mark maintenance network/ssh

[straylight] % svcs network/ssh
STATE          STIME    FMRI
maintenance    1:12:47 svc:/network/ssh:default


Clear is used to "reset" the state of a service, and have it be re-evaluated.  For example, say that syslog is in maintenance:

[straylight] % svcs system/system-log 
STATE          STIME    FMRI 
maintenance    1:15:33  svc:/system/system-log:default

You debug the problem, and realize that syslog failed to start because someone had accidentally deleted syslog.conf, which syslog needs to start.  It attempted to start, saw that the conf file was missing, and fell into maintenance.  You repair the file, and issue a clear:

[straylight] % svcadm clear system/system-log

[straylight] % svcs system/system-log
STATE          STIME    FMRI
online         1:25:07  svc:/system/system-log:default


So now you know how to perform basic maintenance on a Solaris 10 machine using SMF.  I hope it's clear that this system of administration is quite easy, and incredibly powerful.  No longer do you have to hunt around for daemons and init scripts: every service is given a unique FMRI and administered through a unified framework.  This, combined with explicit states and dependencies, gives administrators flexibility and power that are unavailable in other Unix distributions.

My next post will be about manifests, which are the XML files used to describe each service.  We'll examine a manifest in depth, and take a look at the properties and the dependencies that make it up.  As always, questions and suggestions are welcome.

Tuesday Sep 21, 2004

SMF/Predictive Self Healing: svcs(1)

Perhaps the most often used tool in the SMF world is svcs. It's the master "observational" tool, used for querying the state and properties of all the services on your machine.

Used with no options, it simply lists all services that are enabled. Enabled means that the administrator wishes these services to be running. They may not be, because their dependencies are not met, because they failed to start correctly, or for some other reason. But they're the services that should be running.

# svcs
STATE          STIME    FMRI
legacy_run     Sep_17   lrc:/etc/rcS_d/S10pfil
legacy_run     Sep_17   lrc:/etc/rcS_d/S29wrsmcfg
legacy_run     Sep_17   lrc:/etc/rc2_d/S72autoinstall
legacy_run     Sep_17   lrc:/etc/rc2_d/S72directory
legacy_run     Sep_17   lrc:/etc/rc3_d/S84patchserver
legacy_run     Sep_17   lrc:/etc/rc3_d/S90samba
online         Sep_17   svc:/system/svc/restarter:default
online         Sep_17   svc:/network/loopback:default
online         Sep_17   svc:/network/physical:default
online         Sep_17   svc:/system/filesystem/root:default
online         Sep_17   svc:/network/ssh:default
online         Sep_17   svc:/system/coreadm:default
online         Sep_17   svc:/milestone/single-user:default
online         Sep_17   svc:/system/system-log:default
online         Sep_17   svc:/system/utmp:default
online         Sep_17   svc:/system/filesystem/local:default
online         Sep_17   svc:/milestone/name-services:default
online         Sep_17   svc:/network/inetd:default

Now, I've edited this list quite a bit. As it stands now, a freshly installed Solaris machine will come up with 108 running services. I'm sure this number will change a bit before we ship.

What can we see above? First, we see the legacy services, which I've mentioned in a previous post. These are the scripts that still exist in the rcX directories. For example, you can see above that /etc/rc3.d/S90samba was started. Since these services haven't been converted to SMF services, they won't be automatically restarted, but they'll be started, just like they were in previous incarnations of Solaris. If you have your own personal scripts, this is where you'll see them.

Below this you start to see online SMF services, such as ssh and inetd. You'll also see "milestones", which are services that are simply lists of dependencies representing a system state, such as "single-user" or "local filesystems available".

Let's take a look at some of the command line options that make svcs so powerful and useful.


The -a option means "all". It shows all the services on a machine, whether they're enabled or not.

# svcs -a
STATE          STIME    FMRI
legacy_run     Sep_17   lrc:/etc/rcS_d/S10pfil
legacy_run     Sep_17   lrc:/etc/rcS_d/S29wrsmcfg
disabled       Sep_17   svc:/application/print/server:default
disabled       Sep_17   svc:/network/nfs/server:default
disabled       Sep_17   svc:/network/time:default
disabled       Sep_17   svc:/network/talk:default
online         Sep_17   svc:/system/svc/restarter:default
online         Sep_17   svc:/network/loopback:default
online         Sep_17   svc:/network/physical:default

This listing will show you all of the services seen in svcs, plus all the disabled services. Solaris 10 ships with most networking services turned off, for security purposes. You'll see above that things like the NFS server and the print server are also off by default. They can be enabled simply using svcadm, which will be the subject of a later post.
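With a full listing that runs to a hundred-odd lines, a quick tally by state can be a handy first look. The following is a minimal sketch, not an SMF tool: count_by_state is a made-up name, and it assumes the STATE column comes first, as in the listing above.

```shell
#!/bin/sh
# count_by_state: hypothetical helper, not part of SMF.
# Reads `svcs -a` output on stdin, skips the header line, and prints
# one "count state" line per state, sorted by state name.
count_by_state() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' | sort -k2
}

# On a Solaris 10 machine this would be used as:
#   svcs -a | count_by_state
```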

Now we'll start to take a look at how using svcs can be useful in analyzing a service.


The -d option to svcs shows which services this service depends on. Let's take a real world example. Say that for some reason, inetd isn't running on your machine, and you want to look at what it depends on.

# svcs -d network/inetd
STATE          STIME    FMRI
online         Sep_17   svc:/network/loopback:default
online         Sep_17   svc:/network/physical:default
disabled       Sep_17   svc:/network/rpc/bind:default
online         Sep_17   svc:/milestone/single-user:default
online         Sep_17   svc:/system/filesystem/local:default
online         Sep_17   svc:/milestone/name-services:default

You can see above that inetd depends on networking, being in single user mode, local files, and name services. And also on rpcbind, which is disabled. You'll know that you need to enable rpcbind to get inetd to run. And since we have dynamic dependency checking with SMF, as soon as you enable rpcbind, inetd will come online.


The -D option to svcs shows all the services that depend ON a given service. Again, for a real world example, let's take a look at rpcbind. Say you were considering disabling rpcbind, and you wanted to know what effect this would have on your system.

# svcs -D network/rpc/bind
STATE          STIME    FMRI
disabled       Sep_17   svc:/network/nis/server:default
disabled       Sep_17   svc:/network/nfs/client:default
disabled       Sep_17   svc:/network/rpc/bootparams:default
disabled       Sep_17   svc:/network/nfs/server:default
online         Sep_17   svc:/network/rpc/keyserv:default
online         Sep_17   svc:/network/inetd:default
online         Sep_17   svc:/network/nis/client:default
online         Sep_17   svc:/milestone/multi-user:default
online          0:42:34 svc:/network/nfs/cbd:default
online          0:42:34 svc:/network/nfs/mapid:default
online          1:26:47 svc:/network/nfs/nlockmgr:default
online          1:26:47 svc:/network/nfs/status:default

Again, I've removed several lines for brevity, but you can see that there are several disabled services that depend on rpcbind, and quite a few online services. Among the dependents are inetd, your ability to be a NIS server, the multi-user milestone, and others. So you can see that disabling rpcbind would have a significant effect on your machine, and you're able to know *exactly* how your system will be affected.


Say you want a general "overview" of one of your services. svcs -l (list) shows the relevant stats on a service, including who its restarter is, who it depends on, what its state is, etc.

# svcs -l network/inetd:default
fmri         svc:/network/inetd:default
enabled      true
state        online
next_state   none
restarter    svc:/system/svc/restarter:default
contract_id  43 
dependency   require_all/none svc:/milestone/single-user (online)
dependency   optional_all/error svc:/network/rpc/bind (online)
dependency   optional_all/error svc:/network/physical (online)
dependency   require_all/error svc:/system/filesystem/local (online)
dependency   require_all/error svc:/network/loopback (online)

# svcs -l network/telnet:default
fmri         svc:/network/telnet:default
enabled      true
state        online
next_state   none
restarter    svc:/network/inetd:default
contract_id  128

The two examples above show inetd and telnet. You can see that they're both enabled and online. You can also see that inetd is restarted by the master restarter, but telnet is restarted by inetd. inetd has several dependencies, while telnet has none (other than inetd running).

You'll notice that inetd has two different types of dependencies, optional and require. While I'll go into types of dependencies in another post, it's helpful to note that "require" means that that dependency must be online, while "optional" means that it has to be online only if it's enabled.
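To make the require/optional distinction concrete, here is a rough sketch of a filter that reads svcs -l output and reports only the require_all dependencies that aren't online. The name unmet_required is my own invention, and the filter assumes one FMRI per dependency line, as in the inetd example above.

```shell
#!/bin/sh
# unmet_required: hypothetical helper, not part of SMF.
# Reads `svcs -l <service>` output on stdin. For each line of the form
#   dependency   <grouping>/<restart_on> <fmri> (<state>)
# it prints the FMRI when the grouping is require_all and the state is
# not online. Optional dependencies are deliberately ignored.
# Assumption: one FMRI per dependency line, as in the sample output.
unmet_required() {
    awk '$1 == "dependency" && index($2, "require_all/") == 1 && $4 != "(online)" { print $3 }'
}

# On a Solaris 10 machine this would be used as:
#   svcs -l network/inetd:default | unmet_required
```

For the inetd listing above, where every dependency is online, it prints nothing.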


Sometimes you might want to know what processes are controlled by a service. The -p (process) option gives you the pid, start time, and name of all the processes started by a service. For example, sendmail:

# svcs -p network/smtp:sendmail
STATE          STIME    FMRI
online         Sep_17   svc:/network/smtp:sendmail
               Sep_17        452 sendmail
               Sep_17        453 sendmail

You can see above that sendmail has two processes on my machine, both started on Sept 17th.
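If you only want the pids, say to hand them to another tool, you can pull them out of the listing. This is a sketch under one assumption: that process lines are the indented ones, with the pid in the second column, as in the output above. service_pids is a made-up name.

```shell
#!/bin/sh
# service_pids: hypothetical helper, not part of SMF.
# Reads `svcs -p <service>` output on stdin and prints one pid per line.
# Process lines start with whitespace; the header and the service line
# start in column one and are skipped.
service_pids() {
    awk '/^[ \t]/ { print $2 }'
}

# On a Solaris 10 machine this would be used as:
#   svcs -p network/smtp:sendmail | service_pids
```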

One last example I'll show is how svcs can be useful for scripting purposes.

-H, -o

The -H option means "don't show column headings", and -o is used to pick output columns. Say you wanted to write a perl script that took services in a certain state, and performed an action upon them.

[straylight] % svcs -H -o state,fmri
legacy_run     lrc:/etc/rcS_d/S10pfil
legacy_run     lrc:/etc/rcS_d/S29wrsmcfg
legacy_run     lrc:/etc/rcS_d/S55fdevattach
online         svc:/system/svc/restarter:default
online         svc:/network/loopback:default
online         svc:/network/physical:default
online         svc:/system/filesystem/root:default

This way, you're telling svcs to output all the enabled services, showing only their state and FMRI, without the header. This can be fed into a script that looks for services that are "degraded" or in "maintenance" mode, and emails the administrator, for example.
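As a rough sketch of such a script (in shell rather than perl, and with a made-up function name, needs_attention), the filter below reads that two-column output and prints the FMRI of anything that is degraded or in maintenance. What you do with the result, such as mailing it to someone, is left out.

```shell
#!/bin/sh
# needs_attention: hypothetical helper, not part of SMF.
# Reads "STATE FMRI" pairs (as printed by `svcs -H -o state,fmri`) on
# stdin and prints the FMRI of every service whose state is degraded
# or maintenance.
needs_attention() {
    awk '$1 == "degraded" || $1 == "maintenance" { print $2 }'
}

# On a Solaris 10 machine this would be used as:
#   svcs -H -o state,fmri | needs_attention
```

Because -H drops the header, the filter doesn't need to skip a first line.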

Next time, we'll take a look at svcadm, and how to administer a system by enabling and disabling services. As always, comments and questions are welcome.

Thursday Sep 16, 2004

SMF/Predictive Self Healing: Graphing service dependencies

I don't have the next tutorial entry ready yet, but Stephen posted something so cool on his blog that I had to show it to you all.

Now that every service on a system is an entity with dependencies, one of the side benefits is that you can actually chart what your system looks like, graphically.  Below is a chart from Stephen's machine, representing not only all of his running services and their dependencies, but a right-to-left timeline of the boot sequence of his machine:

I really can't get over how cool this is.

Stephen gives a nice overview of the features of this graph in his blog entry, so I won't reproduce them here.  But go check it out.  SMF is letting us do some amazing things.

Wednesday Sep 15, 2004

SMF/Predictive Self Healing Overview: Part 2

Continuing on with the overview, we're going to cover how services actually get on your Solaris machine, a few more basic concepts, and give a brief outline of what system administration is like under SMF.  For those of you just joining, part 1 of the overview is in my previous post.


An SMF manifest is an XML file describing a service.  All of the manifests in the system are stored in /var/svc/manifest, under categorical subdirectories.  If you're not planning on converting any of your custom services over to the SMF model, you won't ever need to edit these files, but they can be a helpful reference.

On boot, svc.configd looks in the manifest directory, and if there are any new manifests, it imports them into the repository.  This can also be done manually by the administrator, as I'll describe in later sections.  The entire system is run using information in the repository, not the manifests; they're simply a delivery mechanism for the descriptions of services.  An active system is administered using SMF's command line tools, which I'll briefly outline at the end of this post.

In order to create an SMF service, a user simply needs to create an XML file describing it, and import it.  We've labored to make these manifests incredibly simple to use.  In most cases, all you need to do is determine what your service depends on, and how to start and stop it, cut and paste that into an XML file, and you're finished.  For a few minutes of work, you get all the benefits of SMF, including parallel booting of your service, dynamic dependency checking, and automatic restarting on failure.  I'll be dedicating an entire post later on to the process of converting a service to SMF, since it's critical that users understand how simple it can be.  This leads us to:


Take a deep breath, and read this a few times:

          *Everything still works*

Most users of Solaris have their pet scripts and services that they've carefully honed over time, and don't want to part with.  While we'd like you to take advantage of the benefits of SMF and convert your services, you by no means have to.  All of the scripts in /etc/rc*.d continue to be executed on run-level transitions, just like you're used to.

If you look at the states I mentioned in my previous post, you'll see one called legacy_run.  This way, you can use the SMF observational tools to observe your "legacy" services as well as those that have been converted to SMF.  Any service in the legacy_run state is a script in an /etc/rc*.d directory that was run upon successful transition to a run level.

On the development end, we've converted a great number of the standard Solaris services already.  Once you install Solaris 10, you'll notice that the /etc/init.d and /etc/rc*.d directories are a lot emptier than they used to be.  However, if you upgrade a machine, your pet scripts and services will still be there, and usually still run without a hitch.

Administrative Interfaces

Ah, the heart and soul of it.  We've put a lot of time and effort into making the administration of a Solaris machine with SMF as painless as possible.  Once you start to play with our tools and see what's really possible, I think you'll be a convert as well.  No longer will you have to be grepping for processes in lists, wondering if they're running or not, hunting for configuration files, et cetera.  Administration of SMF services is all done through a central interface, allowing you to observe the state of services, their dependencies, their properties, and make changes to your services quite easily.

The SMF CLI tools are as follows:

  • inetadm - observe and configure services that are controlled by inetd.
  • svcadm - service administration, including enabling, disabling, and restarting services
  • svccfg - manipulate the contents of the repository, usually properties in a service
  • svcprop - observe property values in a read-only manner.  outputs are formalized for easy use in shell scripts.
  • svcs - observe the state of all the service instances on the system, and detailed views of their dependencies, processes, etc.

Each of these is so commonly used and important that I'll be dedicating a post to each of them in the coming days, including real-world examples from administering my own personal Solaris 10 desktop.

There's also a set of programming interfaces in a library called libscf that developers can use to interact with the repository.  There are two levels to this library: the really nitty-gritty "low level" interface, for those of you who will need to arbitrate transactions and develop delegated restarters, and the "high level" interface, which provides convenience functions, such as reading a single property value, or enabling a service.

I plan to move on from here to descriptions and examples of the administrative tools, and then how to convert a legacy service to SMF.  As always, any requests or questions should be posted to the comments section, or emailed to me.

Monday Sep 13, 2004

SMF/Predictive Self Healing Overview

Predictive Self Healing is an architectural framework made up of several pieces.  The one that I've been working on is SMF, or Service Management Facility.  It's an infrastructure that provides several functions:

1) Defining services for Solaris, which can be the state of a device, a running application, or a set of other services.  Each service is referred to by a unique identifier.

2) A formal relationship between services, with explicit dependencies.

3) Automatic starting and restarting of services.

4) A repository to store service state and configuration properties (negating the need for dozens of configuration files scattered throughout the system).

The "thousand mile view" of SMF is that the system is managed by a master "restarter" named svc.startd.  This daemon enforces dependencies, starts and stops services, and basically keeps an eye on the running of the machine.  All configuration is stored in a repository on the system, which is managed by a daemon as well, named svc.configd.  There are also one or more "delegated restarters", each given a subset of services to manage and written specifically to deal with that subset.  For example, inetd manages most networking services as a delegated restarter.

Let's look at the pieces of SMF a little closer.


A service is the fundamental unit of SMF.  Each service can have one or more instances, which is a specific configuration of a service.  For example, Apache is a service.  An Apache daemon configured to serve on port 80 would be an instance of that service.  Apache could have several instances, all with different configurations.  The service holds basic configuration properties that are inherited by each of its instances, but each instance can override configuration properties, as needed.

There are also special services called milestones.  These are services that correspond to a specific system state, such as "basic networking" or "local filesystems available".  They are basically a list of other services, and they're considered to be online when each of their component parts is online.

Each service is identified with an FMRI, or Fault Management Resource Identifier.  It's the unique identifier representing a service, or instance.  For example, the telnet service is represented by svc:/network/telnet:default, where "svc:/network/telnet" describes the service, and "default" describes a specific instance.
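If you ever need to pull an FMRI apart in a script, plain shell parameter expansion will do.  This is a minimal sketch with a made-up function name, fmri_parts; it only understands FMRIs of the svc:/service:instance form shown here and nothing more elaborate.

```shell
#!/bin/sh
# fmri_parts: hypothetical helper, not part of SMF.
# Given an FMRI such as svc:/network/telnet:default, prints the service
# part on the first line and the instance name on the second.
# Assumption: the argument is either svc:/<service>:<instance> or
# svc:/<service>; in the latter case the second line is empty.
fmri_parts() {
    rest=${1#svc:/}                     # e.g. network/telnet:default
    case $rest in
        *:*) printf '%s\n%s\n' "svc:/${rest%%:*}" "${rest#*:}" ;;
        *)   printf '%s\n%s\n' "svc:/$rest" "" ;;
    esac
}

# Example:
#   fmri_parts svc:/network/telnet:default
```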

FMRIs can be a bit of a handful to type, so you'll find that most SMF commands will accept the "shortened" versions of a service's FMRI, given that it only has one instance.  For example, most utilities will accept network/telnet as the FMRI for telnet, since it comes installed with only one instance.

You will have noticed that telnet is preceded with the word network.  SMF contains several categories for services, to provide organization and uniqueness of naming.  The standard categories are:

  • application
  • device
  • milestone
  • network
  • platform
  • site
  • system


Each service on the machine is always in one of seven discrete states, observable by the SMF CLI tools.  The possible states of each service are:

  • degraded - The service is running, but something is wrong, or its capacities are limited in some way.
  • disabled - The service has been disabled and is not running.
  • legacy_run - A legacy rcX.d script has been started by the system, and is running.  We'll talk more about legacy services later.
  • maintenance - The instance has encountered some sort of error, and it needs to be repaired by an administrator.
  • offline - The service is enabled, but not running yet, usually because a service it depends on is not online yet.
  • online - The service is both enabled and running successfully.
  • uninitialized - svc.startd has not yet read this service's configuration.

That's about enough typing for me tonight.  Next time we'll start to look at how services are described, and how you administer the system using SMF.  As usual, if you have any questions, please feel free to ask them in the comments section.


Thursday Sep 09, 2004

The time has come, the walrus said...

Right, then.

This blog will primarily be a space for me to discuss the work I'm doing at Sun, hopefully spread some information with examples of powerful but potentially confusing new technologies, and occasionally ramble on about things that may or may not be of interest.

I've been at Sun since July of 2000, when I started in a group called Internet Engineering, which has, through several permutations, become the more general Solaris Networking and Security Technologies. So I guess I'm a Network Engineer(tm). I worked on DHCP primarily at first, and then moved on to a short-lived project titled EIC, which mutated into Greenline, which mutated into smf(5), the Service Management Facility, part of Solaris's new Predictive Self Healing offering.

I've been working on Greenline/smf/Predictive Self Healing for the last two years. I'm a firm believer in the power and potential of it, and you'll be hearing me evangelize about it quite a bit in the days to come.

SMF, the Service Manager side of Predictive Self Healing, basically objectifies and defines every service on a Solaris machine, charting its dependencies and storing its configuration properties, and allows the administrator to observe and manage the services much more effectively. It also provides automatic restarting of services upon failure. It's going to change (for the better) the system administrator's experience in administering Solaris machines, provide a (much faster) parallelized boot, and greatly assist in error diagnosis and fault recovery.

As you can see, I'm a big fan. And I plan to convert as many of you as humanly possible.

Anyway, for the boring background bits, I was an Air Force brat who's lived in 13 different states, and I got my degree in Computer Science with a minor in Engineering Studies from Carnegie Mellon, in Pittsburgh. Unlike nearly everyone else who's lived there, I loved Pittsburgh, with all its ugly little warts. I read books voraciously, I'm a libertarian (possibly the only one in California, much to my chagrin), and I really, really like pirates.

Soon I'll begin picking random bits of SMF to highlight, but if you're reading this and you have any questions about how this will affect your private cache of scripts, or inetd.conf modifications, or your life post-SMF in general, please feel free to post them in the comments section, and I'll do my best to answer them.



