A new look at an old SA practice: separating /var from /

An old school SA practice...

This is probably the geekiest blog title I've used - but today's blog is a short look at two variations on the old sysadmin practice of separating /var from /, inspired by recent "how do I do this?" calls.

Why do it? How was it done before?

Separating /var was traditionally done to ensure that growing space consumption in /var - from core files, log files, or package data - didn't exhaust critical parts of your file system. A program that kept dumping core or spewing log entries could fill the disk, and running out of space would add insult to injury by causing other failures.

Several techniques can prevent such problems. One method is to use coreadm to put core files somewhere else, and logadm with /etc/logadm.conf to rotate log files on a schedule consistent with your disk space and retention policy. But the biggest hammer, and the most complete solution, was to keep /var in a separate file system by giving it a dedicated UFS file system on its own disk slice. That way, even if something ran amok and filled /var, it had no effect on other file systems.
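As a quick sketch of the first two knobs (the core-file directory, log file name, sizes, and counts below are arbitrary examples of mine, not recommendations):

# mkdir -p /var/cores
# coreadm -g /var/cores/core.%f.%p -e global
# logadm -w /var/log/myapp.log -C 8 -s 10m -z 1

The coreadm line sends all global core dumps to /var/cores, naming them by program and process ID; the logadm line adds an entry to /etc/logadm.conf that rotates the (hypothetical) /var/log/myapp.log once it passes 10 MB, keeps 8 old copies, and compresses all but the most recent rotated copy.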

The disadvantage of the slice approach, of course, was the hassle of creating and sizing separate disk slices. You had to plan how many slices you needed and how big they were, and if you got them wrong it was really inconvenient to change them. You might end up with one slice and file system too big, wasting space you really needed in another slice, and reallocating space was a drag. Having storage carved into little islands was a real time-waster, especially with the itty-bitty disk drive capacities we used to live with.

ZFS, and in this case ZFS boot, pretty much eliminated this inconvenience - as I'll discuss in a moment.

But now, an old joke...

Before I go into the two examples that came up, a classic joke from mathematics or science class.

The professor is in the front of the classroom and writes an equation on the blackboard (I'm picturing the professor I had when studying Fourier transforms in EE class, but I won't try to do his accent.) Pointing to it, he tells the class, "As you can see, this theorem is clearly trivial."

Turning back to the blackboard he pauses for a moment, puts his hand on his chin and says "Hmmm.... just a moment." Now he starts working on the equation's derivation, covering blackboard after blackboard with equations - everything from α to ω. He fills all the blackboards in the classroom, mumbles "excuse me, I'll be right back," and then goes into an adjacent empty classroom to use its blackboards.

Twenty minutes pass. Finally, the professor returns to the classroom. He beams at the students with a big smile and says "I was right. It is trivial!"

I think this may be relevant to the rest of the post! :-)

The trivial case, with a ZFS root file system

I'll start with the straightforward case. I was contacted by a long-time friend (who has exceptional knowledge of Solaris and other operating systems, but is new to ZFS) who wanted to limit the size of /var on a fresh Solaris 10 installation he had just done. He had used ZFS boot and selected the option that allocates a separate ZFS dataset for /var, and wanted to know if there was an easy way to control its size.

I thought that this should be easy with a ZFS quota, but to be sure, I brought up a new instance of Solaris 10 under VirtualBox to run through the steps and get the right ZFS dataset name. I allocated the separate /var (it's an option you specify during install), and after installation completed I logged in and issued the following commands:
# zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
rpool  15.9G  4.16G  11.7G    26%  ONLINE  -
# zfs list
NAME                            USED  AVAIL  REFER  MOUNTPOINT
rpool                          4.62G  11.0G    34K  /rpool
rpool/ROOT                     3.12G  11.0G    21K  legacy
rpool/ROOT/s10x_u8wos_08a      3.12G  11.0G  3.06G  /
rpool/ROOT/s10x_u8wos_08a/var  65.6M  11.0G  65.6M  /var
rpool/dump                     1.00G  11.0G  1.00G  -
rpool/export                    265K  11.0G    23K  /export
rpool/export/home               242K  11.0G   242K  /export/home
rpool/swap                      512M  11.5G  42.0M  -

Right - all I should need to do is set a quota on rpool/ROOT/s10x_u8wos_08a/var, so let's do that. I picked a quota slightly larger than the amount of space already consumed so I could easily test filling it up by creating dummy files with random data. I did that once to make sure I didn't mess up the syntax, and once more in earnest to exceed the quota:

# zfs set quota=80m rpool/ROOT/s10x_u8wos_08a/var
# zfs get quota rpool/ROOT/s10x_u8wos_08a/var
NAME                           PROPERTY  VALUE  SOURCE
rpool/ROOT/s10x_u8wos_08a/var  quota     80M    local

# dd if=/dev/urandom of=/var/XX1 bs=1024 count=10000
10000+0 records in
10000+0 records out
# zfs list rpool/ROOT/s10x_u8wos_08a/var
NAME                            USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/s10x_u8wos_08a/var  75.3M  4.67M  75.3M  /var
# dd if=/dev/urandom of=/var/XX2 bs=1024 count=10000
write: Disc quota exceeded
4737+0 records in
4737+0 records out
#
# ls -l XX*
-rw-r--r--   1 root     root     10240000 Mar 17 14:15 XX1
-rw-r--r--   1 root     root     4849664 Mar 17 14:16 XX2
# zfs list rpool/ROOT/s10x_u8wos_08a/var
NAME                            USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/s10x_u8wos_08a/var  80.1M      0  80.1M  /var

Mission accomplished: the second file was cut off when the dataset hit its quota, exactly as intended. The only oddity (in my opinion) is the spelling "Disc" instead of "Disk" in the message write: Disc quota exceeded. So, if I'm building a Solaris system and want to keep /var from exhausting disk space, all I need is one command to set the quota. Sweet.
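A related property, if you also want to guarantee /var a minimum share of the pool rather than just cap it, is a ZFS reservation. I didn't need one here, and the 60 MB figure below is just an arbitrary illustration:

# zfs set reservation=60m rpool/ROOT/s10x_u8wos_08a/var
# zfs get reservation rpool/ROOT/s10x_u8wos_08a/var

One of the comments below mentions setting a quota and a reservation together.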

A less trivial case, with zones

Shortly after the preceding example, I was contacted by a customer who wanted to do something similar to control /var within Solaris Containers. He had tried to create the zone with /var defined as a delegated ZFS file system using legacy mounts, but there seems to be a chicken-and-egg problem: parts of the zone's file system must already be mounted before the zone can boot, yet a delegated dataset isn't available until the zone is up, so /var can't be delegated to the zone. Instead, I created a ZFS dataset in the global zone and loopback-mounted (lofs) it as the zone's /var:

# zfs create rpool/zones/vartest
# zfs list rpool/zones/vartest
# cat varzone.cfg
create
set zonepath=/zones/varzone
set autoboot=false
add net
set physical=e1000g0
set address=192.168.56.164
end
add fs
set dir=/var
set special=/zones/vartest
set type=lofs
end
add inherit-pkg-dir
set dir=/opt
end
verify
commit
# zonecfg -z varzone -f varzone.cfg
# zoneadm -z varzone install
A ZFS file system has been created for this zone.
Preparing to install zone <varzone>.
Creating list of files to copy from the global zone.
Copying <2899> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1062> packages on the zone.
Initialized <1062> packages on zone.
Zone <varzone> is initialized.
Installation of <2> packages was skipped.
The file </zones/varzone/root/var/sadm/system/logs/install_log> contains a log of the zone installation.

So far so good. After booting the zone without incident, I set a quota and filled it up. (Note: this is a much bigger /var because I'm building the zone in a Solaris instance with a bunch of additional software in /var/sadm/pkg.)

# zfs list  rpool/zones/vartest
NAME                  USED  AVAIL  REFER  MOUNTPOINT
rpool/zones/vartest   274M  9.31G   274M  /zones/vartest
# zfs set quota=300m rpool/zones/vartest

Within the zone, I exhausted the allocated space using the same method as before:

# dd if=/dev/urandom of=/var/xx1 bs=1024 count=100000
write: Disc quota exceeded
26369+0 records in
26369+0 records out

So, I was able to create a separate /var for the zone and manage its space independently from the zone's root. WARNING: I do not know if this is a supported or recommended procedure, even though it seems to work. My recommendation: it's more important to impose a quota on the zone's ZFS-based zone root, in order to control the zone's total accumulation of disk space. That protects other zones and other applications that may be using the same ZFS pool.
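A sketch of that recommendation, assuming the zone root created by the install above landed in a dataset named rpool/zones/varzone (both that name and the 8 GB cap are my assumptions here - check the actual dataset name first):

# zfs list -r rpool/zones
# zfs set quota=8g rpool/zones/varzone
# zfs get quota rpool/zones/varzone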

Conclusions

Separating /var was especially important with the small boot disk capacities we had to work with in Ye Olde Days, and perhaps became less important with the large disks we have now. However, it becomes relevant again with the arrival of relatively low-capacity Solid State Disk (SSD) boot drives, used for fast local boot with low power consumption, and with virtual environments in which a single Oracle Solaris instance might host many containers, each with its own /var and its own pattern of space consumption.

So, maybe this is a useful Old School idea with new, and slightly different, relevance today.

Comments:

Yes, this is awesome as I am a big believer in putting a quota on /var and I do like the fact that one "zfs set quota" command is all that is required to cap it, and Solaris is my favorite operating system. However, there is one major problem here: HOW THE HECK DO I DO THIS ON OPENSOLARIS 2009.06 !?!?!?!

Why is it that Solaris 10 and SXCE always get the features that system administrators really need (like flash archive, IPMI, zfs quotas for /var, serial console boot, and a text installer) while OpenSolaris gets left out? But OpenSolaris has all the features that make it a better GUI desktop (like time slider).

If we could just combine all the features of Solaris 10 and OpenSolaris into one operating system that had IPMI, flash archive, zfs quota on var, etc. etc. I think it would be very beneficial for Sun's capability to market and generate revenue from Solaris because OpenSolaris, like I said, is missing a lot of important features that real world sysadmins care about.

Posted by system5 on March 23, 2010 at 11:52 AM MST #

I agree with you that "combine the best of both" is the right direction. I think you can see on opensolaris.org how the OS is moving forward, including instances where OpenSolaris has data center features (like later versions of ZFS) or replaces a traditional Solaris feature with something better. I recommend keeping an eye on future announcements on Solaris as things progress!

Posted by Jeffrey Savit on March 24, 2010 at 09:50 AM MST #

Jeff, I definitely like having a separate /var in the Solaris 10 global zone, mostly because of patching. We can end up with gigs of undo files in /var/sadm/pkg. I haven't felt the need to set quotas, though.

To a lesser extent, I also care about /var/tmp (I don't separate that from /var). Since S10U6, I've come across a few issues with separate ZFS root and /var. Most of these are fixed now.

First, earlier patch levels of LiveUpgrade didn't always handle the separate dataset correctly during lumount. This was fixed a while back, and LU seems to work fine now with this combination in S10U8.

One time, I somehow managed to lodge some debris under /var in my root dataset. After that, I wasn't able to boot, as ZFS doesn't use -O (overlay mount). I had to wanboot in order to fix this (zpool import -R /a rpool).

Finally, another early LU with ZFS root issue: It was adding lines to /etc/vfstab, when it should have been using the ZFS properties to mount. That led to /var not getting mounted, until the offending lines were commented out. Also fixed now.

As for containers, I tend to leave them in one filesystem, just for simplicity's sake. If we add another FS, it's usually for application data. Often there are many containers in one big UFS. Of course, ZFS zonepaths could help us with this.

The only outstanding issue I still see with LU and ZFS zonepaths is that the newly activated BE doesn't promote the zonepath dataset, so the active zonepath is still a clone of the old one. A simple zfs promote fixes that, no problemo.

Thanks Jeff... -cheers, CSB

Posted by Craig S. Bell on March 26, 2010 at 04:00 AM MST #

Craig - thanks for the insightful comment!

regards, Jeff

Posted by Jeffrey Savit on March 26, 2010 at 04:13 AM MST #

Jeff mate; Anybody who understands Fourier transforms is clearly a geek :-)

system5: I've not tried it, but you should be able to boot OpenSolaris single-user, create a new dataset for /var, copy the data across, and then set the mount point. Just like you would with UFS if you wanted to put /var onto a separate file system, even though the ZFS commands are quite different. Of course, with ZFS you don't have the problem of finding a spare slice.
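An untested sketch of those steps (the dataset name assumes the default 2009.06 boot environment, rpool/ROOT/opensolaris, and the scratch mountpoint is just something I made up):

# zfs create -o mountpoint=/mnt/newvar rpool/ROOT/opensolaris/var
# cd /var && find . -print | cpio -pdm /mnt/newvar
# zfs set mountpoint=/var rpool/ROOT/opensolaris/var

The old copy of /var still sits in the root dataset underneath the new mountpoint, so it would have to be cleaned out (from another boot environment or a failsafe boot) before the new dataset can actually mount at /var - ZFS won't mount over a non-empty directory without an overlay mount.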

Although IMHO with the size of modern disks this is a complete waste of time and has been for about 10 years.

Posted by trevor pretty on May 02, 2010 at 01:54 PM MST #

Hi Trevor! It would be an exaggeration to say I *understood* Fourier transforms - let's just say I had an acquaintance with them at one time! :-)

I agree that the approach you suggest would probably work, just as there is explicit support for separate /var already in Solaris 10 - and that the entire topic needs to be reevaluated (is it really needed anymore?) in light of modern disk capacities.

Posted by Jeffrey Savit on May 04, 2010 at 03:09 AM MST #

In my system I have a pool named uxwks205a, a zonepath=uxwks205a/os
If I want to create a /var for the zone prior to install, what would I need to create in the global before installing?

add fs
set dir=/var
set special=/uxwks205a/????
set type=lofs
end

Once booted I will set a quota and reservation.

Posted by Jeff Bast on July 20, 2010 at 02:14 AM MST #

Jeff B,

That's right - you would have an 'add fs' stanza just like that, similar to what I did in the body of this blog entry. You can replace ???? with any name that makes sense, and before you run zonecfg, create the corresponding ZFS dataset. For example, if the zone is called 'myzone':

# zfs create uxwks205a/myzone-var
# zfs set compression=on uxwks205a/myzone-var

and then go do the zone definition with 'myzone-var' where you have '????'. However, you MAY NOT WANT TO DO THAT - see the WARNING above where I said it may not be supported, and the text I'm about to add below:

By coincidence a few of my Oracle colleagues were discussing this topic today. Steffen Weiberle and Bob Netherton made valuable observations I want to share:

(1) the important use case for quotas on zones' disk use is to let the global zone administrator protect the system from a runaway zone that is filling its /var - or anything else. You can do that at the zone root level without separating the zone's /var. In fact, Bob Netherton pointed out that a separate dataset for /var might compromise Live Upgrade, which may not expect /var to be separate. So, just set the quota on the zone root dataset - and any other datasets you add to the zone (for data, not parts of the OS configuration).

(2) the best use for SSD is for high I/O rate data where low latency is important, such as the ZFS ZIL or a DB transaction log. The point I was grasping at (perhaps unclearly) is that there are now systems, such as some blades, shipping with SSD boot drives for heat and power reduction. Those small disk capacities make some of the traditional space conserving techniques relevant again. In the usual situation, though, SSD devices should be reserved for high performance disk applications.

My thanks to Steffen and Bob for their valuable comments!

Posted by Jeffrey Savit on July 20, 2010 at 04:09 AM MST #

Jeffrey,

Are there any public documents available on estimating the mainframe MIPS equivalencies of systems such as SPARC M-Series, HP Superdome, P-Series? I seem to remember Sun publishing something around 2003 - 2005 that placed an E15K at about 2000 - 3000 MIPS.

Posted by Richard Fichera on February 03, 2011 at 08:27 PM MST #

Hi Richard,

Maybe I should create a new blog entry on mainframes just so people could have a place to post comments :-)

Years ago we did have a MIPS equivalency document, but I don't think there is a specific document like that now. If you look elsewhere in my blog entries and comments you'll see that this is a somewhat controversial issue. Even what people mean by MIPS is problematic, as the very same mainframe will have different MIPS rates based on factors like instruction mix and level of multiprogramming. I've run instruction kernels with a wide range of MIPS values depending on whether I was doing integer arithmetic or doing memory-memory operations (RR and SS instructions, for those who follow such things).

Since IBM still refuses to publish benchmark results on mainframes, even for the most relevant application workload types (database, Java application server) that they actually *run* anyway, we have to resort to indirect measurements. I recommend looking at http://www.oracle.com/us/solutions/benchmark/apps-benchmark/peoplesoft-167486.html. That gives results on real, commercial workloads on mainframe, SPARC, POWER, and Intel. I think that if you download some of the results you'll be able to make deductions about relative performance, and draw even more interesting conclusions about price:performance.

I hope that helps.
Jeff

PS: In researching my response, I came up with links like http://www.osnews.com/thread?419352 and http://www.realworldtech.com/beta/forums/index.cfm?action=detail&id=115092&threadid=111444&roomid=2 and the surrounding posts. If you want to see people arguing a lot on this topic you'll find plenty to read. If only IBM actually published real performance figures - but it's pretty clear why they don't.

Posted by Jeffrey Savit on February 04, 2011 at 04:19 AM MST #
