Tuesday Feb 24, 2009

Eating our own dog food

You have heard the saying "practice what you preach", and here at the Solaris Cluster Oasis we often talk about how important high availability is for your critical applications. Beyond the good sense of using our own products, there is no substitute for actually using your own product day in and day out. It gives us engineers a very important dose of reality, in that any problems with the product have a direct impact on our own daily work. That raises the question: how is the Solaris Cluster group dealing with its own high availability needs?

In this blog entry we teamed up with our Solaris Community Labs team to give our regular Oasis visitors a peek into how Solaris Cluster plays a role in running key pieces of our own internal infrastructure. While a lot of Sun internal infrastructure runs on Solaris Cluster, for this blog entry we ended up choosing one of the internal clusters used directly by the Solaris Cluster engineering team for their own home directories (yes, that is right: our home directories, where all of our stuff lives, are on Solaris Cluster) and developer source code trees.

See below for a block diagram of the cluster; continue after the diagram for more details about the configuration.


Here are some more specifications of the Cluster:

- Two T2000 servers

- Storage consists of four 6140 arrays presenting RAID-5 LUNs. We chose the 6140s to provide RAID partly because they were available, and partly to leverage the disk cache on these boxes to improve performance

- Two zpools configured as RAID 1+0, one for home directories and another for workspaces (workspace is engineer-speak for a source code tree)

- Running Solaris 10 5/08 (S10U5) and Solaris Cluster 3.2 2/08 (SC3.2U1)
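As a sketch of how such pools might be laid out (the pool and device names below are invented for illustration, not taken from the actual cluster), a RAID 1+0 zpool stripes across mirrored pairs of 6140 LUNs:

```shell
# Illustrative only: device names are hypothetical multipath LUN IDs.
# Each mirror pairs LUNs from two different 6140 arrays, so a pool
# survives the loss of an entire array; ZFS stripes writes across
# the mirrors, giving the RAID 1+0 layout.
zpool create homepool \
    mirror c4t600A0B800011AA11d0 c4t600A0B800022BB22d0 \
    mirror c4t600A0B800011AA12d0 c4t600A0B800022BB23d0

zpool create wspool \
    mirror c4t600A0B800011AA13d0 c4t600A0B800022BB24d0 \
    mirror c4t600A0B800011AA14d0 c4t600A0B800022BB25d0
```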

High availability was a key requirement for this deployment, as downtime for a home directory server with a large number of users was simply not an option. For the developer source code too, downtime would mean that long-running source code builds would have to be restarted, leading to a costly loss of time; not to mention that having lots of very annoyed developers roaming the corridors of your workplace is never a good thing :-)

Note that it is not sufficient to merely move the NFS services from one node to the other during the failover; one has to make sure that any client state (including file locks) is failed over as well. This ensures that the clients truly don't see any impact (apart from perhaps a momentary pause). Additionally, deploying the two zpools on different cluster nodes means that the compute power of both nodes is utilized when both are up, while services continue when one of them is down.

Not only did the users benefit from the high availability, but the cluster administrators gained maintenance flexibility. Recently, the SAN fabric connected to this cluster was migrated from 2 Gbps to 4 Gbps, and a firmware update (performed in single-user mode) was needed on the Fibre Channel host bus adapters (FC-HBAs). The work was completed without impacting services, and the users never noticed. This was achieved simply by moving one of the zpools (along with the associated NFS shares and HA IP addresses) from one node to the other (with a simple click in the GUI) and upgrading the FC-HBA firmware. Once the update was complete, we repeated the same steps on the other node, and the work was done!
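For readers who prefer the command line over the GUI, the same rolling maintenance can be sketched with the Solaris Cluster 3.2 CLI (the resource group and node names below are hypothetical):

```shell
# Move the home-directory resource group (zpool, NFS shares, HA IP)
# off node1 so its FC-HBA firmware can be upgraded
clresourcegroup switch -n node2 home-rg

# ... boot node1 to single-user mode, flash the FC-HBA firmware, reboot ...

# Bring the group back, then repeat the procedure for node2
clresourcegroup switch -n node1 home-rg
```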

While the above sounds useful for sure, we think there is a subtler point here: that of confidence in the product. Allow us to explain: while doing a hardware upgrade on a live production system as described above is interesting and useful, what is really important is the ability of the system administrator to do this without taking a planned outage. That is only possible if the administrator has full confidence that, no matter what, the applications will keep running and the end users will not be impacted. That is the TRUE value of having a rock-solid product.

We hope the readers found this example useful. I am happy to report that the cluster has been performing very well, and we haven't (yet) had episodes of angry engineers roaming our corridors. Touch wood!

During the course of writing this blog entry, I got curious about the origins of the phrase "eating one's own dog food". Some googling led me to this page; apparently the phrase has its origins in TV advertising and came over into IT jargon via Microsoft. Interesting...

Ashutosh Tripathi - Solaris Cluster Engineering

Rob Lagunas - Solaris Community Labs

Sunday Aug 26, 2007

Why is SUNW.nfs required to configure HA-NFS over ZFS in Solaris Cluster?

The typical way of configuring a highly available NFS file system in a Solaris Cluster environment is by using the HA-NFS agent (SUNW.nfs). The SUNW.nfs agent handles sharing the file systems that have to be exported.

With support for ZFS as a failover file system in SUNW.HAStoragePlus, there are two possible ways to configure a highly available NFS file system with ZFS as the underlying file system:

  1. Enabling the ZFS sharenfs property (i.e., sharenfs=on) for the file systems of the zpool, without using SUNW.nfs.
  2. Disabling the ZFS sharenfs property (i.e., sharenfs=off) for the file systems of the zpool, and letting SUNW.nfs do the actual share.

Of these two approaches, HA-NFS works correctly only when the SUNW.nfs agent is used (i.e., option 2). This blog explains the rationale behind requiring SUNW.nfs to configure a highly available NFS file system with ZFS in a cluster environment.
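As a minimal sketch of option 2 (all names here, such as tank, hanfs-rg, and hanfs-res, are invented for illustration; consult the HA-NFS documentation for your release before relying on the details):

```shell
# Let the SUNW.nfs agent, not ZFS, do the sharing
zfs set sharenfs=off tank/home

# Register the resource types (once per cluster)
clresourcetype register SUNW.HAStoragePlus SUNW.nfs

# Pathprefix points at a directory on the HA file system where
# SUNW.nfs keeps its state (dfstab file, lock-monitor data)
clresourcegroup create -p Pathprefix=/tank/nfs-admin hanfs-rg

# Logical hostname that the NFS clients mount from
clreslogicalhostname create -g hanfs-rg -h nfs-server nfs-lh-res

# HAStoragePlus imports/exports the zpool on failover
clresource create -g hanfs-rg -t SUNW.HAStoragePlus \
    -p Zpools=tank hsp-res

# Share options live in dfstab.<resource> under <Pathprefix>/SUNW.nfs
mkdir -p /tank/nfs-admin/SUNW.nfs
echo 'share -F nfs -o rw /tank/home' \
    > /tank/nfs-admin/SUNW.nfs/dfstab.hanfs-res

# The SUNW.nfs resource performs the actual share
clresource create -g hanfs-rg -t SUNW.nfs \
    -p Resource_dependencies=hsp-res hanfs-res

clresourcegroup online -M hanfs-rg
```

Because Pathprefix lives on the zpool managed by HAStoragePlus, the SUNW.nfs state follows the pool from node to node, which is what makes the lock and state reclamation described below work after a failover.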

Lock reclaiming by clients (NFSv2/v3)

The statd(1M) daemon keeps track of the clients and processes holding locks on the server. The server can use this information to allow clients to reclaim their locks after an NFS server reboot or failover.

When a file system is shared by setting the ZFS sharenfs property to on, without SUNW.nfs, the lock information is kept under /var/statmon, which is on a local file system and specific to a host. So in the case of a failover, the stored information is not available on the machine to which the service fails over, and the server is unable to ask the clients to reclaim their locks.

The SUNW.nfs agent overcomes this problem by keeping the monitor information on stable storage (on multi-ported disks) that is accessible from all cluster nodes.

State information of clients (NFSv4)

NFSv4 is a stateful protocol in which nfsd(1M) keeps track of client state, such as opened and locked files, in stable storage.

When a file system is shared by setting the ZFS sharenfs property to on, the stable storage is under /var/nfs, which is not accessible from all nodes of the cluster. In a server failover scenario, the clients' reclaim requests will then fail, which might result in client applications exiting (unless they catch the SIGLOST signal).

The SUNW.nfs agent overcomes this problem by keeping the state information on stable storage that is shared among the cluster nodes, which allows the server to have clients reclaim their state.

The pictorial difference is shown below.

[Diagram: HA-NFS without SUNW.nfs]

[Diagram: HA-NFS using SUNW.nfs]

To put it more precisely: the ZFS sharenfs property is not meant to work in a Solaris Cluster environment, and hence using the SUNW.nfs agent is a must for HA-NFS over ZFS.

The stable storage where SUNW.nfs keeps this information is on the highly available ZFS file system (given by the value of the Pathprefix property configured for the resource group containing the SUNW.nfs resource).

Venkateswarlu Tella (Venku)
Solaris Cluster Engineering


Oracle Solaris Cluster Engineering Blog

