Wednesday Nov 11, 2009

Time for a new Alfresco on Glassfish walkthrough

My Alfresco on Glassfish Walkthrough is over a year old now and is creaking at the seams a little, so I plan to update it. Feel free to comment here on what the new walkthrough should cover: Glassfish v2.1 or v3? Solaris or OpenSolaris? Linux or Windows? Which DB (just MySQL or some other OSDB)?

In fact, feel free to comment on anything you would like to see about running Alfresco (or any other (E)CMS app) on Sun gear.

Original blog entry:



Friday Jun 19, 2009

Building Ruby 1.9.1 on Solaris SPARC

We've been working on building Ruby 1.9.1 (p129) packages on Solaris Nevada (the platform that OpenSolaris is mostly built from). We hit a couple of problems on the way: one was easy to fix, the other not so.

The first issue was Sun Studio borking when it found a function declared with a return type of void which actually contained a return statement with a value. gcc only warns about this, which seems odd to me; maybe it's just our version of gcc though. When gcc hits this it says:

pty.c:425: warning: `return' with a value, in a function returning void

But with Sun Studio cc you get:

pty.c:425: void function cannot return a value
cc: acomp failed for pty.c

The file causing the error is ext/pty/pty.c; the offending statement is on line 425, in function getDevice() (declared as 'static void'):

return Qnil;

If you comment out this line it will build ok. There is a bug for this issue......
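To make the difference concrete, here is a minimal sketch of the pattern in plain C. It is not the Ruby source: reset_name() is a hypothetical helper made up for illustration, and it assumes the simplest fix, which is to use a bare return (or drop the statement) in a function declared void.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from ext/pty/pty.c: it only exists to
 * show what a void function may and may not do with `return`. */
static void reset_name(char *name, size_t len)
{
    if (name == NULL || len == 0) {
        /* Writing `return 0;` (or `return Qnil;`) here is "return with a
         * value in a function returning void": the gcc quoted above only
         * warns, Sun Studio cc refuses to compile it. A bare return is
         * accepted by both. */
        return;
    }
    name[0] = '\0'; /* truncate the name to an empty string */
}
```

Calling reset_name(buf, sizeof buf) on a non-empty buffer leaves buf as an empty string; calling it with a NULL pointer simply returns.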

The other problem initially only occurred on SPARC. We were using one of the public build systems that we have for the SFW consolidation (the part that delivers most of the F/OSS into Solaris Nevada and OpenSolaris), and that was running Solaris Nevada build 114. For x64 we were building on Solaris Nevada build 116, which was the latest version available at the time of writing. x64 built fine.

The error seen on SPARC was:

Undefined                       first referenced
symbol                             in file
rb_cFalseClass                      enc/emacs_mule.o  (symbol scope specifies local binding)
rb_cTrueClass                       enc/emacs_mule.o  (symbol scope specifies local binding)
rb_cFixnum                          enc/emacs_mule.o  (symbol scope specifies local binding)
rb_cSymbol                          enc/emacs_mule.o  (symbol scope specifies local binding)
rb_cNilClass                        enc/emacs_mule.o  (symbol scope specifies local binding)
ld: fatal: symbol referencing errors. No output written to .ext/sparc-solaris2.11/enc/  

when linking enc/ It's not a well-documented error, but the implication was obvious: the required symbols had been found but they had been declared as local and so couldn't be used to build this shared object. Using nm on emacs_mule.o and on seemed to indicate that the symbols were needed by emacs_mule.o (UNDEF) and were available in We asked the compiler experts and they thought that perhaps the symbols were declared as HIDDEN. We tried elfdump on both emacs_mule.o and and guess what, when analyzing, elfdump threw up loads of errors of the type:

"bad symbol entry: <address> lies outside of containing section"

This suggested that the shared library was broken in some way.

We isolated the linker lines from the build for and ran them individually (after touching ruby.c). There were two lines, the first was the linker line which actually built the shared library. That ran ok and when we ran elfdump on the resultant library there were no errors. The second line was:

/usr/sfw/bin/gobjcopy -w -L "Init_\*"

After running this manually we saw the same error when using elfdump on the resultant library.

At the same time, we were running a build on a SPARC system that we'd had upgraded to Nevada build 116, and that completed OK. A check on the version of gobjcopy on the two systems showed that we had gobjcopy 2.15 on the build 114 system and 2.19 on the build 116 system. Further checking showed that gobjcopy was delivered into Solaris Nevada in SUNWbinutils and that this package had been updated in Nevada build 116. So the problem wasn't that we were building on SPARC but that we were building on different OS revs; the problem also exists on x64.

At the moment we haven't looked into what was going wrong when gobjcopy tried to make the Init_\* symbols local, but it was apparently corrupting the library.

For now this makes it tough to build Ruby 1.9 on OpenSolaris, which is based on Nevada build 111b, and we are looking at how best to get around this. Maybe we'll make the packages available from the /webstack repository. In the meantime we'll file a bug against OpenSolaris and come up with a workaround.

Wednesday Dec 05, 2007

Lighttpd SMF troubles

We came across an issue recently when running Lighttpd with /dev/poll on Solaris under SMF. You would start the service and immediately the CPU would peg at 100% and the Lighttpd error log would fill up with the message "(server.c.1429) fdevent_poll failed: Invalid argument". 

 SMF (The Solaris Service Management Facility) allows the deployer of a service to specify which user and group the processes that belong to the service should run under. In this case Lighttpd was being started as user webservd with group webservd. This would be similar to logging on to a system as webservd and then running the lighttpd executable. When we did exactly that we saw the same problem as we did when running under SMF. If we started Lighttpd as root with the same config file it ran fine and no errors were logged. So the problem came down to starting Lighttpd as webservd with /dev/poll specified as the event handler in the Lighttpd config file.

The workaround is to start Lighttpd as root and specify the user name and group for Lighttpd to run under through the Lighttpd config file. This is fairly standard practice for starting both Lighttpd and Apache. If you've run into this problem then it may be because you've somehow obtained a Service Manifest file that specifies "webservd" as the user and group. The easy way to modify the service so that Lighttpd is started as root is to create a copy of the current manifest and, in the copy, remove the entire <method_credential> element that you'll see here:


  exec='/opt/coolstack/lib/svc/method/svc-csklighttpd start'
  <method_context>
    <method_credential
      user='webservd' group='webservd'
      privileges='basic,!proc_session,!proc_info,!file_link_any,net_privaddr' />
  </method_context>

You can leave the <method_context> and </method_context> tags with nothing between, or you can delete the closing tag and use an empty tag, i.e. <method_context />. Just don't remove it, as it's a useful marker. The above snippet is from an example that I saw when I first came across this issue; yours may be different, in which case hopefully you wrote it and understand how to change it.

What you are left with is:

  exec='/opt/coolstack/lib/svc/method/svc-csklighttpd start'
  <method_context />

Once you've changed the copy of the manifest, import it using svccfg as follows:

svccfg -v import <manifest filename>

This will take a snapshot of the current state of the service and name it previous then delete all of the entries that you removed from the copy of the manifest. They will be named start/group, start/user, start/privileges plus a few others that would have been set to their default values. It will then take another snapshot of the service and call it last-import. Finally it will "refresh" the service, which means pushing out the changes to the running service. If the Lighttpd service was running it will probably go to the state called "Maintenance" at this point. It's best to disable and enable the service after a refresh (see the man page for svcadm) so you should do that now. Lighttpd should then be running correctly.

I'll post some example Manifests in another blog entry.

Root Cause

It turns out that when Solaris 10 came along, this same problem was seen when using /dev/poll and starting Lighttpd as root. Lighttpd is written such that it bases its maximum number of connections on the number of File Descriptors available to the process. The result is that all of the available File Descriptors are locked away for use when creating connections and none are left for /dev/poll to use, so every call to /dev/poll results in an error. A more detailed discussion is available on this thread on the Sun forums. A workaround was added to Lighttpd that effectively sets the max connections to 10 less than the max File Descriptors. Unfortunately it only works for the root user, as the number of connections for a non-root user is set in a different code path. See Lighttpd ticket 1465. We are working on getting a workaround added for non-root users, so watch this space.
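The arithmetic behind that workaround can be sketched in C as follows. This is not Lighttpd's code: FD_HEADROOM, max_connections_for() and current_fd_limit() are hypothetical names, and the only facts taken from the description above are the headroom of 10 and that the limit comes from the File Descriptors available to the process. getrlimit() with RLIMIT_NOFILE is the standard POSIX call for reading that limit.

```c
#include <sys/resource.h>

/* Assumption for illustration: keep 10 descriptors out of the
 * connection pool, matching the "10 less than" workaround described
 * above. */
#define FD_HEADROOM 10

/* Connections allowed for a given descriptor limit, leaving headroom
 * for the event handler, log files and so on. */
static long max_connections_for(long max_fds)
{
    if (max_fds <= FD_HEADROOM) {
        return 0; /* not enough descriptors to serve anything safely */
    }
    return max_fds - FD_HEADROOM;
}

/* Soft limit on open files for this process, or -1 if it cannot be
 * read or is unlimited. */
static long current_fd_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return -1;
    }
    if (rl.rlim_cur == RLIM_INFINITY) {
        return -1;
    }
    return (long)rl.rlim_cur;
}
```

With a limit of 65535, max_connections_for() gives 65525, which leaves 10 descriptors free for things like /dev/poll.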

Oh, and also, if the process has its max File Descriptors set to say 65535 and you specify server.max-fds = 1000 in the Lighttpd config file, Lighttpd will reset the max number of File Descriptors available to it to 1000. So you can't get around the problem simply by specifying a lower number for server.max-fds than what should be available to the process (according to ulimit -n in the shell from which you start Lighttpd).

