Saturday Aug 04, 2012

Recovering a Totally Full ZFS Filesystem

If your ZFS filesystem is completely full, it can be difficult to free up space.  Most people's first impulse is to delete files, but that often fails because ZFS can require free space to record the deletion.  A colleague of mine ran into this a few weeks ago.  I advised him to try truncating a large file using shell redirection (e.g., "cat /dev/null > my_large_file"), but that didn't work.  He was able to free up space using the truncate command, which is available in Solaris 11.

 truncate -s 0 my_large_file
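
Why the redirection fails while truncate(1) succeeds isn't obvious: redirection goes through open(2) with O_TRUNC, while truncate(1) issues a single ftruncate(2), and on a completely full dataset the two paths can behave differently. A minimal demonstration on an ordinary filesystem (on a truly full dataset, only the truncate line would be expected to work):

```shell
# Create a file, then zero it two ways.  On a 100%-full ZFS dataset
# the redirection may fail with ENOSPC, but truncate(1) usually works.
printf 'pretend this is a huge log file' > my_large_file

: > my_large_file || echo "redirection failed"   # may hit ENOSPC when full
truncate -s 0 my_large_file                      # frees the blocks

wc -c < my_large_file    # 0
```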

Tuesday Nov 29, 2011

IPS Facets and Info files

One of the unusual things about IPS is its "facet" feature. For example, if you're a developer using the foo library, you don't install a libfoo-dev package to get the header files. Instead, you install the libfoo package, and your facet.devel setting controls whether you get header files.

I was reminded of this recently when I tried to look at some documentation for Emacs Org mode. I was surprised when Emacs's Info browser said it couldn't find the top-level Info directory. I poked around in /usr/share but couldn't find any info files.

  $ ls -l /usr/share/info
  ls: cannot access /usr/share/info: No such file or directory

Was I missing a package?

  $ pkg list -a | egrep "info|emacs"
  editor/gnu-emacs                                  23.1-     i--
  editor/gnu-emacs/gnu-emacs-gtk                    23.1-     i--
  editor/gnu-emacs/gnu-emacs-lisp                   23.1-     ---
  editor/gnu-emacs/gnu-emacs-no-x11                 23.1-     ---
  editor/gnu-emacs/gnu-emacs-x11                    23.1-     i--
  system/data/terminfo                              0.5.11-     i--
  system/data/terminfo/terminfo-core                0.5.11-     i--
  text/texinfo                                      4.7-      i--
  x11/diagnostic/x11-info-clients                   7.6-     i--

Hmm. I didn't have the gnu-emacs-lisp package. That seemed an unlikely place to stick the Info files, and pkg(1) confirmed that the info files were not there:

  $ pkg contents -r gnu-emacs-lisp | grep info

Well, if I have what look like the right packages but don't have the right files, the next thing to check is the facets.

The first check is whether there is a facet associated with the Info files:

  $ pkg contents -m gnu-emacs | grep usr/share/info
  dir group=bin mode=0755 owner=root path=usr/share/info
  file [...] chash=[...] group=bin mode=0444 owner=root path=usr/share/info/mh-e-1 [...] 
  file [...] chash=[...] group=bin mode=0444 owner=root path=usr/share/info/mh-e-2 [...]

Yes, they're associated with a facet.

Now let's look at the facet settings on my desktop:

  $ pkg facet
  FACETS           VALUE
  facet.locale.en* True
  facet.locale*    False
  facet.doc.man*   True
  facet.doc*       False

Oops. I've got man pages and various English documentation files, but not the Info files. Let's fix that:

  # pkg change-facet facet.doc.info=True
  Packages to update: 970
  Variants/Facets to change:   1
  Create boot environment:  No
  Create backup boot environment: Yes
  Services to change:   1
  DOWNLOAD                                  PKGS       FILES    XFER (MB)
  Completed                              970/970     181/181      9.2/9.2
  PHASE                                        ACTIONS
  Install Phase                                226/226
  PHASE                                          ITEMS
  Image State Update Phase                         2/2
  PHASE                                          ITEMS
  Reading Existing Index                           8/8
  Indexing Packages                            970/970
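
As a toy illustration (not pkg(5)'s actual algorithm), facet matching can be thought of as glob patterns over facet names, with an unmatched facet defaulting to true. The facet.doc.man entry below is an assumption, inferred from the man pages having been present:

```shell
# Toy model of facet matching -- NOT the real pkg(5) logic.  Each file
# action carries a facet tag (e.g. facet.doc.info); the image's facet
# patterns decide whether the file gets installed.
facet_allows() {
  case "$1" in
    facet.locale.en*) echo true  ;;  # facet.locale.en* True
    facet.locale.*)   echo false ;;  # facet.locale*    False
    facet.doc.man*)   echo true  ;;  # assumed: man pages were installed
    facet.doc.*)      echo false ;;  # facet.doc*       False
    *)                echo true  ;;  # unset facets default to true
  esac
}

facet_allows facet.doc.info   # false -> Info files were dropped
facet_allows facet.doc.man    # true  -> man pages were kept
```

Listing the more specific globs first approximates the "most specific pattern wins" behavior that settings like facet.locale.en* rely on.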

Now we have the info files:

  $ ls -F /usr/share/info
  dir@      dired-x   groff-2   groff-3   remember

Tuesday Jul 12, 2011

Percussive Maintenance

On a flight home from Colorado, I had the opportunity to watch Rango, which I had missed when it was playing in theaters. Partway through the movie, the image occasionally took on a dark reddish hue. Then it got stuck in that mode, making it difficult to watch the movie. I pressed the button to get assistance from the cabin crew, and two attendants came over.

One of the attendants asked if I had changed the brightness setting. Well, I had, but I had also reset it. More fiddling with the brightness setting wasn't helping.

The second attendant suggested that I kick the box. "The box?" I asked. Yes, the metal box mounted underneath the seat in front of me. The attendant explained that the box contained hardware for the in-flight video. Because of its location, it competed with carry-on bags--like my backpack--for space. Sometimes it got pushed in ways that could mess up the video.

I was skeptical that kicking the box would help, but I gave it a sharp whack with my toe. The image regained the correct colors and kept them for the rest of the movie. I don't know who was more surprised: me, or the first attendant.

Wednesday Oct 27, 2010

Deja Vu

My first blog entry was just about 6 years ago. I'd been part of the project to open up the Solaris code for a couple months when Jim Grisanzio encouraged me to start a blog. I wasn't sure how much time I'd have to write, but blogging seemed a good way to get information out into the community, so I agreed to give it a go. I wrote my first entry while helping staff the guarded bicycle parking that the Silicon Valley Bicycle Coalition provides at Stanford home football games.

This past Saturday I was once again at Stanford to help with the bike parking. I was too busy to write, but I was reminded of that earlier game. Another echo from 2004 is that I'll be changing jobs soon. Before joining the OpenSolaris project I worked in the NFS group at Sun, and I'll be returning to that group on November 1st.

One of the things I liked about working on OpenSolaris infrastructure was the breadth of code that I got to touch. Besides hacking the custom code that we run on, I got to do multiple programming-in-the-large projects inside ON. I'm sure there are a few dark corners of the ON source tree that I haven't looked at, but I suspect that there aren't many.

Working on NFS has its own rewards, of course. With a foot in both networking and file systems, NFS requires its developers to be conversant with a fair chunk of the kernel. Because NFS is a general-purpose file system, I'm sure I'll get exposed to a wide variety of applications as a side effect of troubleshooting work. And I still find distributed systems fascinating. The challenges of making autonomous systems cooperate in the face of unreliability and ambiguity (e.g., timeouts) make distributed systems a very cool place to be.

Tuesday Oct 12, 2010

Watch This Space

With the latest XWiki update, the site now has working watchpoints. The UI for watching a page is pretty obvious--there's a watch/unwatch link in the upper right of the page. But I couldn't figure out how to watch an entire space (e.g., the Tools community space). Today I stumbled across it. The second menu in the upper left shows the space that the current page belongs to. One of the operations you can invoke via that menu is "watch space".

Sunday Jul 25, 2010

Finding ON Flag Day Announcements

ON flag day emails are archived automatically to an internal server and then mirrored externally. Unfortunately, folks sometimes give the internal URL when referring to a particular flag day message, which doesn't do you much good if you don't have access to the Oracle/Sun internal network.

While it would be preferable to have the external URL to start with, the mapping from the internal URL to the external URL is pretty simple: replace "" with "".
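
The substitution itself is mechanical; with hypothetical hostnames standing in for the elided internal and external hosts, it looks like this:

```shell
# Hypothetical hostnames -- the real internal and external hosts are
# omitted in the post above.  The rewrite is one string substitution.
internal_url='http://mail.internal.example.com/on-flag-day/msg01234.html'
echo "$internal_url" |
  sed 's|mail\.internal\.example\.com|mail.external.example.org|'
# -> http://mail.external.example.org/on-flag-day/msg01234.html
```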


Tuesday Jul 20, 2010

"make all" in usr/src/uts

I've started work to fix a problem with Install, which is a script that kernel developers can use to help install test kernels. So I needed to build a kernel for testing, but I didn't need to build anything outside the kernel. I thought I'd just set up an environment using bldenv, then build using dmake, i.e.,

$ bldenv -d
$ cd $SRC/uts
$ dmake all

Notice I used "dmake all", not "dmake install". This is because Install extracts kernel modules from the source tree, not the proto area.

When my build finished, I noticed that there were build errors.

/usr/ccs/bin/ld -o ../cheetah/debug64/ -G -znoreloc -h 'cpu/$CPU' debug64/cpu_module.o debug64/mach_cpu_module.o
ld: fatal: file ../cheetah/debug64/ open failed: No such file or directory
*** Error code 1
dmake: Fatal error: Command failed for target `../cheetah/debug64/'
Current working directory /builds/kupfer/6952783/usr/src/uts/sun4u/serengeti/unix
*** Error code 1
The following command caused the error:
BUILD_TYPE=DBG64 VERSION='6952783:66c93397e15b' dmake  symcheck.targ
dmake: Fatal error: Command failed for target `symcheck.debug64'
Current working directory /builds/kupfer/6952783/usr/src/uts/sun4u/serengeti/unix
*** Error code 1
The following command caused the error:
(cd ../../../sun4u/serengeti/unix; pwd; \
CPU_DIR=../cheetah SYM_MOD=../cheetah/obj64/unix.sym dmake symcheck)
dmake: Fatal error: Command failed for target `obj64/unix.sym'
Current working directory /builds/kupfer/6952783/usr/src/uts/sun4u/serengeti/cheetah
*** Error code 1
The following command caused the error:
BUILD_TYPE=OBJ64 VERSION='6952783:66c93397e15b' dmake  all.targ
dmake: Fatal error: Command failed for target `all.obj64'
Current working directory /builds/kupfer/6952783/usr/src/uts/sun4u/serengeti/cheetah
*** Error code 1
The following command caused the error:
cd cheetah; pwd; dmake  all
dmake: Fatal error: Command failed for target `cheetah'
Current working directory /builds/kupfer/6952783/usr/src/uts/sun4u/serengeti
*** Error code 1
The following command caused the error:
cd serengeti; pwd; THISIMPL=serengeti dmake  all
dmake: Fatal error: Command failed for target `serengeti'
Current working directory /builds/kupfer/6952783/usr/src/uts/sun4u
Waiting for 3 jobs to finish

This was odd for a couple reasons.

First, I used bldenv -d, which sets up an environment for building debug kernel modules. Yet the "obj64" directory names indicated that non-debug modules were being built.

Second, the log seemed to indicate that the non-debug build had some sort of dependency on the debug build. That couldn't be right--the nightly script, which most of us use to drive our builds, does the non-debug build before it does the debug build. So any such dependency would break nightly.

Just what was going on here?

Well, the first thing you have to understand is that the kernel makefiles make heavy use of macros to determine what exactly to build. Do you want 32-bit binaries or 64-bit? Debug or non-debug?

After some poking around, I noticed that for the kernel, the "install" target uses DEF_BUILDS, which is going to specify debug or non-debug (one or the other). The "all" target uses ALL_BUILDS, which specifies both debug and non-debug.
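
A toy model of that macro scheme (the real uts makefiles are considerably more involved) shows why "dmake all" under "bldenv -d" still produced non-debug objects:

```shell
# Sketch only -- not the real uts makefile logic.  DEF_BUILDS names one
# variant, chosen by the build environment; ALL_BUILDS always names both.
pick_builds() {
  # $1 = make target (install|all), $2 = environment (debug|nondebug)
  case "$1" in
    install) if [ "$2" = debug ]; then echo debug64; else echo obj64; fi ;;
    all)     echo 'obj64 debug64' ;;
  esac
}

pick_builds install debug   # debug64 only
pick_builds all debug       # obj64 debug64 -- non-debug gets built too
```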

So that answers my first question. I was building a non-debug module because I had used "dmake all" instead of "dmake install".

But why was a non-debug build depending on a debug binary?

Some tracing through the makefiles shows that the symcheck target in uts/sun4u/serengeti/unix/Makefile uses the SYM_BUILDS macro, which that makefile sets to $(DEF_BUILDSONLY64) (i.e., just the 64-bit variant of $(DEF_BUILDS)). Because I had used "bldenv -d", SYM_BUILDS was thus set to "debug64", even though it was being invoked for a non-debug module.

What can we conclude from this? Well, if you're building the kernel in a bldenv environment, it's better to use "make install" than "make all", at least on SPARC systems. (It looks like the x86 builds don't use the symcheck target.) It might be possible to restructure the kernel makefiles to avoid this issue, but I don't understand enough of what's going on here to say for sure.

Friday Dec 18, 2009

The Compiz Grid Plug-In

I tried Compiz shortly before it became the default in OpenSolaris, but I went back to Metacity because the fancy graphics plugins don't do much for me. But I found a plugin last month that I actually find useful: Grid. With a single gesture, I can put windows into any quadrant of the display, or two adjacent quadrants (e.g., top half of the display or left side of the display). I think it gives me the best of tiling and overlapping window management.

Wednesday Nov 25, 2009

Signed Crypto Gets Its Own Tarball

The OS/Net (ON) component of OpenSolaris has some closed-source code. The binaries for this code (well, the binaries that are redistributable) are made available to non-Sun developers in the form of a compressed tar file, which the build tools incorporate into BFU archives or packages. These closed-bins tarballs also contain binaries for open-source cryptographic code. To satisfy US government regulations, the OpenSolaris cryptography framework requires that certain crypto binaries be signed. Most external developers don't have the necessary key and certificate to sign their binaries, so we provide a working set for them.

This setup has worked okay since the launch of OpenSolaris in 2005, but it's got a couple problems. First, bindrop, the script that puts the crypto binaries into the closed-bins tarball, works off a hard-coded list. As with any manually-maintained list, this introduces a risk that it will not be updated when a new crypto module is added to the system. Second, bindrop gets the crypto binaries by extracting them from the SVR4-format packages that are generated from the ON gate. With the upcoming move to IPS, those packages will go away, and it will be much harder to extract the binaries from the IPS packages that will replace them.

So I'm working on changes to the way we deliver the signed crypto binaries. First, we'll be splitting the crypto out into its own tarball. This gives us more flexibility about when we deliver the crypto binaries. Second, instead of using a hard-coded list, we'll scan the proto area, which is the staging area before files are packaged. Any properly signed binaries will be included in the crypto tarball. If you're interested, CR 6855998 has more details.
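
A rough sketch of that proto-area scan, with is_signed standing in for the real signature check (elfsign verify on Solaris; the marker-based stub here is purely illustrative):

```shell
# Sketch of the proto-area scan described above -- not the real bindrop
# code.  is_signed is a stub; the real check would be something like
# "elfsign verify -e $f" on Solaris.
is_signed() {
  grep -q 'SIGNED' "$1" 2>/dev/null
}

collect_crypto() {
  # Emit every "signed" binary under the given proto area; these are
  # the files that would go into the crypto tarball.
  find "$1" -type f | while IFS= read -r f; do
    is_signed "$f" && printf '%s\n' "$f" || :
  done
}
```

Scanning the proto area this way is what removes the hard-coded list: any new, properly signed crypto module is picked up automatically.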

The code for this has been written, though it still needs a little more polishing, like making sure that error messages are handled correctly. I'm hoping to get this into build 130, but it might slip into 131.

Wednesday Nov 04, 2009

OSCON 2009: FreeBSD

The last "people" talk that I went to at OSCON was Marshall Kirk McKusick's talk "Building and Running an Open-Source Community". I was interested in this talk for a couple reasons. First, I don't know much about how the BSD communities work, and I'm always interested in how large open-source communities do things. Second, I was at Berkeley during the time of the Computer Systems Research Group (CSRG). And while I got to know some of the CSRG staff, I was not directly involved with the development of Berkeley Unix. So I wanted to find out more about what was going on while I was there.

Kirk mentioned that the CSRG started up in the 1970s, after Bill Joy was already at Berkeley. At first, they didn't use a source code control system. Then around 1980 they started using SCCS. There are various reasons for using a source code control system, such as making it easy to review changes if a regression is discovered. For the CSRG, introducing SCCS enabled better productivity for the CSRG staff. Although they still reviewed all the changes that were checked in prior to a release, they could hand off some of the mechanics, such as merging patches and testing, to trusted committers.

This basic structure, with a core team, a group of committers, and a group of developers, is still used today for FreeBSD development. Kirk mentioned a couple details that I thought were interesting. In particular, he said that most developers don't want to be committers. This is usually because they don't want to be involved that much; they just have a change or two that they want to see made. Kirk also mentioned that committers are held to higher standards for things like email etiquette. And all changes must be reviewed by at least one other committer.

The FreeBSD core team is 9 people who are nominated from and elected by the committers every 2 years. They maintain the FreeBSD roadmap, they resolve conflicts between committers, and they admit and remove committers.

Kirk pointed out that people can contribute to FreeBSD in ways other than writing code. They can write documentation, they can do testing, they can do release engineering, and so forth. These people can be committers, too, and there's no relative weighting between code committers and other committers. This also means that these other folks can be elected to the core team. In fact, the latest election brought an advocacy/marketing committer onto the core team. At the time of the talk, there were 390 committers and around 6,000 developers.

FreeBSD does a stable .0 release every couple years. The stable branch has a 5-year lifetime and a binary compatibility guarantee. Minor (dot) releases happen on the stable branch every 6 months or so. Development happens on the trunk, and important bug fixes are merged into the stable branch. They use Subversion and provide a CVS mirror.

The pre-release freeze times vary: for a new stable branch it's about a month, and for a dot release it's about a week. I'm not sure how to compare those times with the times for OpenSolaris. For example, OpenSolaris freezes are gradual: first there's a period where only bug fixes are allowed in (no new features), then there's a period where only fixes for stopper bugs are allowed in. I wish I'd thought to ask Kirk for more details on how FreeBSD manages freezes.

One of the ways the BSD license is different from the GPL or the CDDL is that it lets people make proprietary changes to the code. Kirk said that this does happen, but those changes are usually specific to a particular product. Because they aren't generally interesting, the FreeBSD project probably wouldn't take the changes even if they were offered.

Saturday Oct 10, 2009

OSCON 2009: Building Belonging

Ubuntu is widely known for its community-building efforts, so I was very interested to hear what Jono Bacon, the Ubuntu community manager, had to say at OSCON.

Jono suggested that people want to be in a community (any community) because it gives them a sense of belonging. This makes sense to me--we are, after all, social creatures. And Ubuntu seems to do a good job at this. Jono showed a photo from the recent Ubuntu Developer Summit in Prague, in which all the volunteers were highlighted. I didn't have time to count, but it appeared to me that over half the people there were volunteers, not Canonical employees.

So how does a community foster this sense of belonging? Jono had several suggestions. One was to treat all contributions as gifts. That doesn't mean accepting every contribution. But if the contribution can't be accepted for some reason, be constructive and gentle in your feedback.

Jono suggested some ways that structure can help foster belonging. For example, Ubuntu structures everything in teams, even user groups. This makes everything simple and uniform: to get involved, find a team that you want to join. And because the teams are smaller than the community as a whole, it's easier to get started and make connections.

Structure can also be used to encourage socializing. By this Jono didn't mean the structured team-building exercises that one sees in the corporate world. Rather, the idea is to budget time just for chatting, sharing stories, etc. Besides encouraging social bonding, this promotes information exchange that might not happen otherwise.

Jono also noted the importance of the virtual environment. First, he drew an analogy with physical neighborhoods. People are more inclined to hang around someplace that is kept-up, safe, and inviting. Second, the virtual environment can help build belonging by making contributions (and contributors) visible.

Community members can reinforce a sense of belonging by framing questions or issues in a way that leads to progress. Jono's example was to change the thought "this sucks" to "I won't live life this way".

SCM Mounts: Done (Almost)

I've finished the workaround for the sshd privileges issue. I ended up writing a simple setuid C program so that our PAM module could unmount the loopback filesystems. I had been using an RBAC-based approach, but that requires that the user own the mount point for each loopback mount. The more I worked on it, the more failure scenarios I ran into because of that requirement. The setuid approach had none of those issues, and it turned out to be much simpler to code than I had been expecting.

So the changes have been committed to the repository for the SCM infrastructure, and the new bits have been deployed on the backup SCM server. The only thing left is to deploy on the primary SCM server.

Unfortunately, this doesn't mean I'll now have time to finish off the OSCON trip report. Instead, I'll be focusing on a change to the way we deliver crypto binaries to ON developers.

Friday Sep 25, 2009

Progress with SCM Mounts

I've been busy implementing a workaround to the sshd privileges issue that I mentioned a couple months ago. Unfortunately, this has meant some delays in my OSCON trip report, but I hope to post the next installment of that sometime in the next week.

Besides implementing the actual workaround, I've been beefing up the scripts that we use for setting up a test environment. Even though we have a backup production system that I could use for testing, it's safer to test a change of this size on a separate system. I sleep a lot better knowing that if I break something during development, I definitely won't inconvenience users.

I've also been learning about the privilege-management facilities that are available in (Open)Solaris. We had some problems finding a concise but sufficiently detailed writeup of how the mount/unmount privileges work. While the information is present in the umount(2) man page, a much clearer explanation is given in the output from "ppriv -lv".

Saturday Sep 05, 2009

OSCON 2009: No Unicorns

Between the site migration, a code review for Darren Moffat [1], and catching a cold, I'm still having to write my OSCON 2009 trip report in dribs and drabs.

The next "people" talk on my list was a talk by Rolf Skyberg of eBay called "There Are No Unicorns: And Other Lessons Learned While Running an Innovation Team". Having worked in a few research settings, but not being a researcher myself, I was hoping the talk would give me a better understanding of how innovation happens. And Rolf did have some things to say here. In particular, he pointed out that innovation is by nature disruptive, which can lead to resistance. So be sure to focus on the cultural changes that will be needed for your innovation to be successful, don't just focus on the technology or specific product.

The bulk of the talk seemed more about navigating corporate politics. Rolf's points included the importance of knowing how your work and your team's work will be evaluated, and how your project fits into the larger organizational picture. Understanding how your project fits in is particularly important when times are tough. If your project isn't in a core area for your organization, and it doesn't have someone high-up to defend it, it's more likely to get cancelled when funding runs low.

Rolf also talked about how traumatic events in a company's history can affect its behavior for a long time. In 1998 eBay was off-line for 3 days, which led to a very strong emphasis in the company on reliability and scalability. But, Rolf argued, the outage also led to an inflexible organization that has a hard time incorporating new innovations.

I've seen this never-again effect in other contexts. When I was working for Xerox, I wanted to look at the source code for Pilot, which was the operating system we were using. I was told I couldn't do that, so I asked why. I was told that sometime in the past, some application code had been written that depended on an interface that was private to Pilot. I doubt that this was accidental usage, because all this stuff was written in Mesa, which made it easy to keep public interfaces separate from private ones. Anyway, in the normal course of development, the Pilot group made some incompatible changes to that interface, which broke the application--right before the application was about to ship. The Pilot group wanted the application programmers to fix their code, even if it meant a schedule slip. But they were overruled by management and forced to revert the interface change. They vowed that they would not let this happen again, and from then on they kept their source squirreled away on private servers, accessible only to people in the group.

I've also seen this effect at Sun. I started at Sun in 1992, not too long after Solaris 2.0 had shipped. Solaris 2.0 introduced a lot of changes that were incompatible with SunOS 4.x. This caused considerable disruption for customers, and I gather that many were not shy in voicing their displeasure. I remember being told that all the pain around Solaris 2.0 was a big factor in the decision not to do any more major (.0) releases of Solaris.

Of course, that policy has borne considerable fruit over the years. The ISVs that I've spoken with have praised Solaris because new Solaris releases rarely broke their code. Other platforms that their code ran on tended to cause breakage pretty regularly.

On the other hand, this compatibility doesn't come for free. Project teams must design new interfaces carefully, so that they can evolve in a compatible fashion. The interface review and cross-checking that is done by the ARCs is largely manual. Standard (cross-platform) interfaces are not always stable, so project teams must sometimes do extra engineering to support the new interface without breaking code that depended on the old interface. There is a mechanism for removing old, obsolete code, but again, this requires supporting the old and new interfaces for some time.

Rolf compared companies to creatures when he talked about the effect of traumatic events. That seems to imply some sort of organizational memory of the traumas. But I wonder if that's really how it works. Is it really an organizational memory, or is it more the memory of a few key individuals? If it's the memory of key individuals, then I think it's not until these individuals move on (or relax around the trauma) that the never-again imperative starts to lose its hold.

[1] Darren is removing 92,000 lines of closed-source code--the obsolete smartcard support--from ON.

Saturday Aug 22, 2009

Still Reviewing Web Pages

I'm still reviewing web pages in preparation for the migration to XWiki. So far I've finished reviewing the ON Developer Reference and the SCM-related pages in the Tools web space.

The only thing left for (my) review is the SCM Migration project web space. Since that project is no longer active, I don't plan to look too carefully. But a sanity check does seem in order, and maybe there are some obsolete pages or attachments that we can delete.

The ON Developer Reference (DevRef) is a particularly tough case for the migration software because of its extensive use of anchored links. I had been planning to retire the XML (Docbook) source that the DevRef currently uses, and keep everything in XWiki markup, but I'm not looking forward to fixing all the cross-references. So I'm having second thoughts about that strategy.



