Wednesday Dec 05, 2007

Lighttpd SMF troubles

We came across an issue recently when running Lighttpd with /dev/poll on Solaris under SMF. You would start the service and immediately the CPU would peg at 100% and the Lighttpd error log would fill up with the message "(server.c.1429) fdevent_poll failed: Invalid argument". 

SMF (the Solaris Service Management Facility) lets the deployer of a service specify which user and group the service's processes run as. In this case Lighttpd was being started as user webservd with group webservd, which is much the same as logging on to the system as webservd and running the lighttpd executable by hand. When we did exactly that, we saw the same problem as under SMF. When we started Lighttpd as root with the same config file, it ran fine and no errors were logged. So the problem came down to starting Lighttpd as webservd with /dev/poll specified as the event handler in the Lighttpd config file.

The workaround is to start Lighttpd as root and specify the user name and group for Lighttpd to run under in the Lighttpd config file. This is fairly standard practice for starting both Lighttpd and Apache. If you've run into this problem, it may be because you've obtained a Service Manifest file that specifies "webservd" as the user and group. The easy way to modify the service so that Lighttpd is started as root is to create a copy of the current manifest and, in the copy, remove the entire <method_credential> element that you'll see here:

...
...

<exec_method
  type='method'
  name='start'
  exec='/opt/coolstack/lib/svc/method/svc-csklighttpd start'
  timeout_seconds='60'>
  <method_context>
    <method_credential
      user='webservd' group='webservd'
      privileges='basic,!proc_session,!proc_info,!file_link_any,net_privaddr' />
  </method_context>
</exec_method>
...
...

 
You can leave the <method_context> and </method_context> tags with nothing between them, or you can delete the closing tag and use an empty tag, i.e. <method_context />. Just don't remove it, as it's a useful marker. The above snippet is from an example that I saw when I first came across this issue. Yours may be different, in which case hopefully you wrote it and understand how to change it.

What you are left with is:

...
...
<exec_method
  type='method'
  name='start'
  exec='/opt/coolstack/lib/svc/method/svc-csklighttpd start'
  timeout_seconds='60'>
  <method_context />
</exec_method>
...
...
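
With the <method_credential> gone, SMF starts Lighttpd as root, so the user and group now have to come from the Lighttpd config file. As a rough sketch, and assuming you want the same webservd user and group that the manifest used, the relevant lines would look something like this (solaris-devpoll is Lighttpd's name for the /dev/poll event handler):

```
# Assumed values: adjust the user and group to suit your installation.
server.username      = "webservd"
server.groupname     = "webservd"

# Use /dev/poll for event handling.
server.event-handler = "solaris-devpoll"
```

Lighttpd only switches to that user and group after it has started as root, which is the whole point of the workaround.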


Once you've changed the copy of the manifest, import it using svccfg as follows:

svccfg -v import <manifest filename>

This takes a snapshot of the current state of the service and names it previous, then deletes all of the entries that you removed from the copy of the manifest. Those are the properties start/group, start/user and start/privileges, plus a few others that would have been set to their default values. It then takes another snapshot of the service and calls it last-import. Finally it "refreshes" the service, which means pushing the changes out to the running service. If the Lighttpd service was running, it will probably go into the "Maintenance" state at this point. It's best to disable and then enable the service after a refresh (see the man page for svcadm), so do that now. Lighttpd should then be running correctly.
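
The disable and enable step can be wrapped up in a small shell function. This is only a sketch: bounce_service is a made-up name, and the FMRI is passed in as an argument because it depends on the manifest you imported (svcs -a | grep -i lighttpd should show yours).

```shell
# Sketch only: bounce_service is a hypothetical helper, not part of SMF.
# usage: bounce_service <service FMRI>
bounce_service() {
    fmri=$1
    # Disabling also takes the service out of the maintenance state.
    svcadm disable "$fmri" || return 1
    svcadm enable "$fmri" || return 1
    # Report the service state and, if it is not online, the reason why.
    # The start method may still be running, so re-run this if needed.
    svcs -x "$fmri"
}
```

Call it as bounce_service followed by the service FMRI. If svcs -x still reports the service as being in maintenance, the Lighttpd error log is the next place to look.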

I'll post some example manifests in another blog entry.

Root Cause

It turns out that when Solaris 10 came along, this same problem was seen when using /dev/poll and starting Lighttpd as root. Lighttpd bases its maximum number of connections on the number of File Descriptors available to the process. The result is that all of the available File Descriptors are reserved for connections and none are left for /dev/poll to use, so every call to /dev/poll results in an error. A more detailed discussion is available on this thread on the Sun forums. A workaround was added to Lighttpd that effectively sets the max connections to 10 less than the max File Descriptors. Unfortunately it only works for the root user, as the number of connections for a non-root user is set in a different code path. See Lighttpd ticket 1465. We are working on getting a workaround added for non-root users, so watch this space.
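
To make the arithmetic concrete, here is a tiny shell illustration of the headroom idea. It is not Lighttpd's code: max_connections is a hypothetical name, and the only figure taken from the workaround is the reserve of 10.

```shell
# Illustration of the arithmetic only, not Lighttpd's source.
# usage: max_connections FD_LIMIT [RESERVE]
max_connections() {
    fd_limit=$1
    reserve=${2:-10}    # 10 is the headroom quoted for the root workaround
    # Whatever is not handed to connections stays free for /dev/poll etc.
    echo $((fd_limit - reserve))
}
```

For example, max_connections 65535 prints 65525, leaving 10 descriptors spare. With a reserve of 0 the connection cap equals the descriptor limit and nothing is left over, which is the situation described above. To feed in the real limit for your shell, use max_connections "$(ulimit -n)".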

Oh, and also, if the process has its max File Descriptors set to, say, 65535 and you specify server.max-fds = 1000 in the Lighttpd config file, Lighttpd will reset the max number of File Descriptors available to it to 1000. So you can't get around the problem simply by setting server.max-fds lower than the number that should be available to the process (according to ulimit -n in the shell from which you start Lighttpd).
