sirius - OpenSolaris on System z - an update

Some months ago I reported on the prototype port of OpenSolaris to the IBM mainframe under z/VM ("sirius"). Some things have transpired since then, and I figure it's time for an update (also, some people asked me off-line, so why answer separately?).

Why we're helping - it's a Community thing

First, let me make Sun's position clear, as there's been a little confusion. We are obviously not advocating that people do more computing on mainframes! (This doesn't come as a surprise, right guys? :-) ) Remember, we make our own computers which we believe are far superior computing platforms, hosting the fully-supported, complete implementations of Solaris and OpenSolaris with a massive installed base and ISV partner product portfolio. We have our own highly effective, scalable, widely adopted virtualization, providing the best possible platforms for virtualized, consolidated Solaris operation. Solaris also runs on many SPARC, Intel and AMD systems from other vendors, insulating customers from lock-in or inflated prices.

Instead, our purpose is to grow the OpenSolaris community, and if that includes people who are traditionally mainframe customers, that's fine too. Maybe those folks will get experience with Solaris using excess capacity on their mainframes, see the light :-) and then deploy Solaris on SPARC and x86/x64 platforms, where the implementations are complete, fully supported, and widely deployed.

To that end, we've loaned SPARC server hardware to SNA for their cross-compile port work, set up an OpenSolaris project page, and contributed time installing and testing the prototype and providing bug reports and feedback. The last part is where I come in (in addition to having made introductions within Sun).

Note: You can get started with OpenSolaris right now on your own Intel or AMD desktop or laptop computer: download it at http://www.opensolaris.com/get/index.jsp. You can run it from a LiveCD image so you don't have to overwrite the OS you have installed on your computer, or you can run it in a virtual machine: download VirtualBox from the same URL, and have your cake and eat it too!

Status update - things fixed and pending

So, with that out of the way, some notes on the prototype. A lot of work has been done since my last post on this topic, though the implementation is still far from complete.

Last time, for example, there were no man or prstat commands, both of which are now available. There was no hostid command, but one has now been written (it didn't make the image I'm now testing, but I'm told it will be on the next one). Problems that made getconf -a and 64-bit ls fail have been resolved. There was a bug where you couldn't change your password - that's been resolved too.

One step in the right direction relates to the Solaris Containers (frequently referred to as "zones") feature of Solaris 10 and OpenSolaris (also see Wikipedia). Back in autumn, attempting to create a zone produced the following error:

# zonecfg -z zone1
zone1: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:zone1> create
Segmentation Fault (core dumped)
The good news is that zonecfg now apparently works. Unfortunately, zone installation fails with a complaint that a script related to Solaris Live Upgrade (LU) is missing: "/usr/lib/lu/lucreatezone: not found". This is confusing: LU doesn't actually exist per se on OpenSolaris, and that file shouldn't be present! OpenSolaris manages boot environments in a completely different way, so Solaris Containers need to be implemented in a way that is compatible with the OpenSolaris boot and packaging systems, both of which have yet to be ported. Still, that's a step in the right direction, and one that I'm sure will be straightened out in due time.

This does open the big issue of differences between OpenSolaris and Solaris 10. For example, the prototype boots off the Unix File System (UFS), whereas OpenSolaris boots off ZFS (see also ZFS at Wikipedia). This is a substantial difference, because OpenSolaris uses ZFS snapshots and clones for managing boot environments. This entire infrastructure is part of what makes OpenSolaris, and needs to be implemented.

A number of important features are still missing. DTrace (see also DTrace at Wikipedia), for example, is not implemented. That's a key feature of Solaris 10 and OpenSolaris, and is even available on FreeBSD and Mac OS X. There's also a pressing need for a package/patch management system: currently there's no way to apply a patch, though the Image Packaging System is now going to be ported over.

One of the obvious things you do with a prototype system is debug and diagnose things. Annoyingly, if you use commands like pstack or pfiles to diagnose what a program is up to, the tool can die or kill the target process, as that set of tools isn't quite working yet. (Guess what happened to me one time when I tried to run pstack against sshd while logged in via ssh! One time is all it took to teach me not to do that again!) More frustrating, using one of those commands to look at a core file could make the tool crash and overwrite the core file with its own dump. Ow.
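
Until those tools behave, one defensive habit is to copy a core file aside before pointing anything at it. Here is a minimal sketch, assuming a POSIX shell. The preserve_core helper and its backup naming are made up for illustration and are not part of sirius or Solaris, and it only helps if the tool's own dump lands on the original file name rather than on the copy.

```shell
# preserve_core: hypothetical helper, not a Solaris command.
# Copies a core file to "<name>.orig.<pid>" and prints the copy's path,
# so a crashing diagnostic tool can't clobber the only copy.
preserve_core() {
    src=$1
    backup="$src.orig.$$"
    cp -p "$src" "$backup" && printf '%s\n' "$backup"
}

# Typical use (the ./core path is only an example):
#   saved=$(preserve_core ./core)
#   ...then run the diagnostic tool, knowing "$saved" is a spare copy.
```

Keeping the copy under a different name is the whole point: whatever ends up overwriting "core" leaves the spare alone.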

Other bits and pieces are missing: If you issue the format command, it still says "No disks found" (even though there are definitely disks available and mounted). The kstat and prtconf -vp commands don't work.

In general, there are things you bump into now and then. This should be expected - this is a prototype, and that's what prototypes are like. The developer is outstanding, but this is a Herculean task (there's a pun here - and I'll mention Hercules again in a moment) and things like this should be expected. It's been my task to look for errata - so it shouldn't be a surprise that I find them.

In the queue now

In my latest testing I came up with new problems - or at the very least, newly discovered ones. ping didn't work from a non-root userid, which was quickly solved.

$ ping 10.80.63.130
ping: socket Permission denied
$ ppriv -De ping 10.80.63.130
ping[104270]: missing privilege "net_icmpaccess" (euid = 103812, syscall = 230) for "devpolicy" needed at common_specvp+0x5e
ping: socket Permission denied
Adding the setuid bit to the binary fixed that problem. Ain't Solaris privilege bracketing neat?
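
For anyone who hasn't played with the setuid bit, here is a minimal sketch of checking and setting it, assuming a POSIX shell with mktemp available. It works on a scratch file rather than the real ping binary, and has_setuid is a hypothetical helper name, not an existing command.

```shell
# has_setuid: hypothetical helper that reports whether a file
# carries the setuid bit, using the standard "test -u" check.
has_setuid() {
    if [ -u "$1" ]; then
        echo "setuid"
    else
        echo "not setuid"
    fi
}

# Demonstrate on a scratch file instead of a real system binary.
scratch=$(mktemp)
has_setuid "$scratch"    # state of a freshly created file
chmod u+s "$scratch"     # set the setuid bit
has_setuid "$scratch"    # state after chmod
rm -f "$scratch"
```

On a real system the same chmod would be applied, as root, to the binary in question; the sketch avoids doing that so it is safe to run anywhere.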

Right now, there are a few new problems that have been head-scratchers. One is that programs like ping or ftp dump core if I use DNS name resolution (they work fine if you just put in the target IP address). Previously I hadn't set up /etc/nsswitch.conf to use DNS, so the issue never arose. This provoked a problem where sshd would refuse to accept new sessions: any previously established session was fine, but attempts to establish new connections failed. Eventually (smack forehead!) I figured out that the forked instance of the ssh daemon was dying while trying to do a reverse-DNS lookup of the client, so I just turned off DNS resolution altogether. DNS is important, so I'll be looking into this again soon.
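
A quick way to see whether a system is set up to hit this path is to check whether the hosts line of /etc/nsswitch.conf lists dns. Here is a minimal sketch, assuming a POSIX awk (nawk on Solaris) and the usual "hosts: files dns" layout. The hosts_uses_dns name is a hypothetical helper, not an existing command.

```shell
# hosts_uses_dns: hypothetical helper, not a Solaris command.
# Prints "yes" if the hosts: line of an nsswitch.conf-style file
# lists dns as a lookup source, otherwise prints "no".
# Commented-out lines are ignored because their first field is not "hosts:".
hosts_uses_dns() {
    awk '
        $1 == "hosts:" {
            for (i = 2; i <= NF; i++)
                if ($i == "dns") found = 1
        }
        END {
            if (found) print "yes"; else print "no"
        }
    ' "$1"
}

# Typical use:
#   hosts_uses_dns /etc/nsswitch.conf
```

Taking the file as an argument makes it easy to try against a copy before touching the live configuration.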

Around this time I started having recursive crashes in the Service Management Facility (SMF) daemons; every time I looked, svc.configd and svc.startd were running and chewing up as much z9 CPU time as they could get. Back in the summer, I was getting error messages about a corrupt SMF repository - maybe that error persists and has just bitten me. Even more odd, I was getting crashes rebooting the system!

So, for the sake of starting afresh, I'm reloading the test system from the disk restore image (you install this by doing an image restore of a disk volume, and manually configuring network identity). That way, we can start debugging again (or find that the above problems never recur) from a system where the tracks haven't been muddied.

Please take note: As I said before, this is a prototype. Errors and odd situations are what happen in the laboratory when working with experimental software, in this case a port of a complex operating environment. Please keep your expectations consistent with reality, and don't expect this software to have the properties of the production-grade Solaris OS, which has the benefit of development teams, test organizations, QA groups, and years of baking in.

Speaking of Hercules

I previously alluded to the Hercules System/370, ESA/390, and z/Architecture Emulator. Early in the sirius project I advocated that sirius run only under VM, as that simplifies the porting effort.

A consequence of that, which I regret, is that it makes it impossible to run sirius on commodity, developer personal computers. Hercules provides a virtual mainframe, but sirius makes use of hypervisor functions provided by VM. Fortunately, the Hercules community is looking into fixing this - either by modifying the sirius port code so it can run without the hypervisor functions, or by adding them to the Hercules emulator. That will make it possible for a lot more people to use this prototype port, as not everybody has access to a mainframe! That will help grow the OpenSolaris community, which is why Sun is participating.

(It may be possible in future to run OpenSolaris in a Hercules virtual mainframe, running inside a Solaris Container underneath OpenSolaris or Solaris 10, running inside a VirtualBox or VMware Virtual Machine. Hm, maybe at that time we can bring up Hercules underneath sirius, for an arbitrary number of recursive levels of virtualization!)

Comments:

Nice for you to report progress, but some notes are in order.

It's only fair to note that many of the complaints you're commenting on are in fact results of the state of the supplied OpenSolaris source tree from Sun -- we don't make this stuff up, we just get it to compile and work. 8-) We are submitting patches to the mainline source tree to correct some of these little oversights in the OpenSolaris source management process, but it's a two way street...the zonecfg command is a good example of that. You shipped it that way..8-)

We also can't do much about the parts that are supplied only in binary form -- after repeated requests, we still don't even have a response how to get a compiled binary set for those commands and features. The missing binary for Sun's tail variant causes some of those shutdown problems you observed, the missing PKCS routines in sshd aren't helping find the problem, etc. We did ask... silence is getting pretty deafening, folks.

Re: ZFS boot. The build 95 source for the first release of the binaries did not support ZFS boot on *any* platform. We haven't bothered to change the image from UFS because it's not necessary to getting the work done right now. It is possible to create and use ZFS in the more current builds.

Re: DTrace. The entry point hooks are implemented. The user space tools require Java, which is not yet available, so they aren't available. See point about binaries.

Re: prtconf. Some features of this command use features of the Sun PROMs. Those are obviously not available on non-Sun hardware, and the command will be modified to return something sensible. It hasn't exactly been high on the list to fix given other issues -- like Java, getting a complete set of binaries, etc.

Re: packaging. Source to the pkg* commands was added after build 95. We have ported IPS and will be using it for future work. Honestly, did this industry really need ANOTHER package format and delivery system? The Nexenta guys had the right idea with using APT.

We appreciate the help in hunting these things down. It'd be helpful if we could start with a complete set of functions and then start hunting the misfunctioning pieces -- too often, the issue has come down to having to craft a replacement for a missing piece in the source, and having to substitute something where there is source available. Any progress you can make on that front would do wonders.

It's also interesting that we seem to be the only non-Sun port of the OS to actually get this far. We think there's value in getting Solaris everywhere, not just on Sun hardware. It's one of Sun's most valuable software assets, and there's a lot of people that would pay for such a thing if Sun were to be cooperative in helping create that opportunity. It's up to you whether you'd like to create it.

Posted by dboyes on February 17, 2009 at 12:31 AM MST #

I provided a "start of the port" report that describes its progress and results from testing. This is valuable because there has been confusion, including in the trade press, about what is implemented and working. Some reports have said that Solaris features like DTrace and Solaris Containers are working, and it's appropriate to correct such widely-repeated misimpressions.

In my blog post I refer to issues with DTrace, Solaris Containers, kstat, hostid, IPS, boot administration, problems with pfiles/pstack, SMF daemons, networking, patching, init, ZFS boot, etc. All of these relate to software that is open-sourced, so encumbered source or binary-only distribution has nothing to do with them. They all can be corrected by port-specific modifications to source code available to the OpenSolaris community.

The problems with sshd had nothing to do with missing PKCS components - they were a consequence of a bug in networking where DNS resolution crashes the client program - as I reported with "ping" and "ftp" dumping core when using a hostname resolved by DNS - turning off DNS "fixed" sshd (I'll see if I can reproduce that later on.) The appropriate thing to do with programs like "prtconf -vp" is to modify it so it does something reasonable on a machine to which the OS is ported (as has been done with hostid), rather than generate an error message. DTrace has no dependency on Java whatsoever. There's nothing wrong with zonecfg now - it previously dumped core but it's fixed in the new build; the problem now is in zoneadm which doesn't work in sirius, but has been working in real OpenSolaris since at least last summer. Clearly there's a way to generate the zoneadm command in a way compatible with OpenSolaris. I've done "init 6" on my OpenSolaris systems many times without tripping over a missing flag in "tail", so obviously it can be made to work (my OpenSolaris desktop doesn't have that tail flag either.) The issues I cited relate to code that is available in source form.

For what it's worth, I've been using OpenSolaris as my standard desktop since it shipped last May, and it has used ZFS boot and the Image Packaging System (IPS) since day 1. I can understand not making ZFS boot a priority purely for the purpose of booting, but since OpenSolaris boot administration and software update management depend on ZFS functions (like snapshots and clones), ZFS boot and IPS are essential features of OpenSolaris that will have to be ported. Using DDR disk restore images and manually reinstating system configuration is no fun.

Make no mistake about what I'm saying. This is extremely impressive work and deserves commendation - you should be justly proud of the accomplishment. Sun also thinks "there's value in getting Solaris everywhere, not just on Sun hardware", which is why Solaris already runs on 1031 x86/x64 systems and 102 SPARC systems, the vast majority of which we don't make. I referred to that with the URL under "many" in the top of the blog. Obviously we at Sun believe in growing the OpenSolaris community, which is why we've provided time and machine resources. I'm sure that in the fullness of time, the community can resolve the issues I've described and whatever issues are yet to be discovered.

Posted by Jeffrey Savit on February 17, 2009 at 09:03 AM MST #
