Introduction to PxFS and insight on global mounting

If you are using Sun Cluster software, you are using the Proxy File System (PxFS). Global devices are made possible by PxFS, and global devices are central to device management in a cluster. Now that the source is out there, it is a good time to explain some of the PxFS magic. I will give an overview of the PxFS architecture, with source references, over multiple blog entries. In this entry I will introduce PxFS and explain global mounting.

PxFS is a protocol layer that distributes a disk-based file system among cluster nodes in a POSIX-compliant and highly available manner. POSIX-compliant simultaneous access from multiple nodes is possible without requiring file-level locking in applications. The only requirement on the administrator for a global mount is to make sure the mount point exists on all cluster nodes. After that, add "-g" to the mount command and the mount becomes global. The rest of this entry explains the terminology.

First let me show how easy creating and mounting a UFS file system globally is, without even using a dedicated physical device. I will create a lofi device, format it as UFS, and mount it globally.

Note: Do not try this in Solaris 9, as that Solaris version has an lofs bug that can panic the system.

# mkfile 100m /var/tmp/100m
# LOFIDEV=`lofiadm -a /var/tmp/100m`
# yes | newfs ${LOFIDEV}

Let us mount the above lofi device cluster wide (make sure the target directory exists on all nodes).

# mount -g ${LOFIDEV} /mnt

Done! You can now access /mnt from any node of the cluster and transparently reach the UFS file system on the lofi device on node1, the node where the mount was issued.
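To convince yourself that the mount really is global, check it from another node: the entry appears in that node's mount table as well. The session below is a sketch; the exact device path and the formatting of the "global" option in mnttab vary by release.

```shell
# On another cluster node (node2, say), the mount is already visible:
ls /mnt                  # shows lost+found from the freshly created UFS
grep /mnt /etc/mnttab    # the entry should carry the "global" option

# When done, unmount from any node, then tear down the lofi device
# on the node where it was created:
umount /mnt
lofiadm -d /var/tmp/100m
rm /var/tmp/100m
```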

We will now get into the details of global mounting, using the example of globally mounting a file system on shared storage. We have a three-node cluster, with node2 and node3 directly connected to the shared storage. The SVM metadevice "/dev/md/mydg/dsk/d42" is mounted globally on the directory "/global/answer" from node1.
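As a concrete sketch, the administrative side of this example looks just like the lofi case: create the mount point everywhere, then mount with "-g" from one node. The /etc/vfstab line is an assumption based on the common Sun Cluster convention of a "global" mount option; check your release's documentation for the exact syntax.

```shell
# On every node (node1, node2, node3): create the mount point.
mkdir -p /global/answer

# From node1: mount the SVM metadevice globally.
mount -g /dev/md/mydg/dsk/d42 /global/answer

# To make the mount persistent across reboots, an /etc/vfstab entry
# on every node can use the "global" mount option instead of "-g":
# device to mount       device to fsck         mount point    FS  fsck at-boot options
# /dev/md/mydg/dsk/d42  /dev/md/mydg/rdsk/d42  /global/answer ufs 2    yes     global
```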

For code reference, startup of PxFS services happens here.

The mount subsystem is an HA service. In cluster parlance, an HA service is one with failover capability: it has one primary and one or more secondaries, and any secondary can become the primary if the current primary dies. This promotion of a secondary to primary is transparent to applications.

For any cluster setup, there is always only one mount-service primary. All other nodes will have mount-service secondaries. Every cluster node will also have a mount client created when global mounts are first enabled for the node.

The mount primary and secondary are two faces of the mount replica object, which is created when the node joins the cluster. This is the code that creates the mount replica server. The replica framework ensures that there is only one primary at a time and promotes a secondary to primary when needed.

Now for the sequence of operations during a global mount. The steps below are numbered in order; each corresponds to one stage of the mount protocol.

  1. The global mount command, mount -g, can be issued from any cluster node. The request enters the kernel, where the generic mount code redirects the call to PxFS. At this point, the directory to be mounted on is locked.

  2. The PxFS client tells the mount server about this global mount request via the mount client on that node. The mount client will have the server reference.

  3. The mount server in turn asks every client except the originating node, in this case node1, to lock the mount point.

  4. For shared devices, the mount server creates a PxFS primary and a secondary. The node on which the device is primaried becomes the PxFS primary. For local devices, the mount is non-HA and an unreplicated PxFS server is created; the lofi device example above results in an unreplicated PxFS server on node1.

    The PxFS server does a hidden mount of the device. Details of the mount are contained in the PxFS server object.

  5. The mount server passes a reference to the newly created server to all mount clients and asks the clients to do a user-visible PxFS mount.

  6. The mount client creates and adds a vfs_t entry of the same type as the underlying file system.
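The externally observable result of steps 1 through 6 is that every node ends up with an identical entry in its mount table. A quick sanity check, using the /global/answer example above (a sketch; output formatting varies by release):

```shell
# Run on each node after the global mount completes; every node
# should report the same file system, with the "global" option:
grep /global/answer /etc/mnttab
df -k /global/answer
```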

Now the mount is visible on all clients. There is some more magic in the mount subsystem, such as starting a file-system replica when a node joins the cluster, or creating a new PxFS secondary or primary when a node connected to the storage joins the cluster. The next installment will be about how regular file access works in PxFS.

Thanks to Walter Zorn for the javascript library which made tooltips so much easier.

Binu Philip
Solaris Cluster Engineering

Any news on ZFS would be interesting as well.

Posted by Bernd Eckenfels on July 13, 2008 at 06:28 AM PDT #

At present there are no plans to layer PxFS over ZFS.
It is technically feasible and the code is out there.

Posted by Binu on July 13, 2008 at 04:47 PM PDT #

When we try to remount a global file system with any mount parameter, for instance "mount -g -o remount,noxattr /global/fs", mnttab is updated with the change, but the new behavior takes effect only on the node where the command was executed; on the other nodes in the cluster it does not.
If you then run the same remount command on another node, it works there, but again not on the remaining nodes.

Posted by Renil Thomas on August 19, 2009 at 12:13 PM PDT #


Oracle Solaris Cluster Engineering Blog

