A rave for PxFS

Sacrilege. I realized only recently that I haven't blogged nice things
about the technology that I work on. While I am at it, may I also
point your attention to the side bar on the left^H^H^H^H right, which is full of
possibilities and not much realization of those possibilities. That
side bar led me to CSS and hours of wonderful time in front of the
monitor creating rectangles of various colors and sizes and overlaps.
The psychedelics started with a visit to http://www.csszengarden.com

I still remember the plans and grand designs I had for the next web
creation of mine. Don't you fret, those thoughts and designs are still
locked up somewhere in there. But what I actually put in place is what
you see here, a div that doesn't break or justify lines and a side bar
full of defaults. What fun to create vaporware, eh? Speaking of
vaporware, I still haven't said a single thing about PxFS. Aha, the
cat is out of the bag and the probability wave hasn't collapsed yet.
So, about PxFS.

PxFS is the general purpose distributed filesystem used internally by
Solaris Cluster nodes. More cats out of the bag now. By this time next
year you will have heard much more about Solaris Cluster in the Open.
Haha .. in the open, all of the code for Solaris Cluster will be open.
As of today, http://opensolaris.org/os/community/ha-clusters/ohac
will tell you what is open in Solaris Cluster and what is not. PxFS is
not open yet, but I can talk about it.

What does the big "Distributed HA Filesystem" suit-speak really mean?
PxFS is a Highly Available, Distributed and *POSIX compliant* file
system layer above some disk based file system. The disk based file
system can be UFS or VxFS for now. Layering it above something better
like ZFS is technically feasible. Okay. Now for details about what
each of the above terms really means. A note before I go into the
explanation: I am covering only the real basics of PxFS here, so total
newbies can follow along and I can pretend I know much, much more than
what I talk about here.

Distributed. PxFS is a distributed file system internal to cluster
nodes. To explain distributed, take the analogy of electric supply to
a house. If you have only one socket in the house, then the supply at
your house is not distributed. If you add more outlets then the supply
becomes distributed. Similarly, a Solaris Cluster can have 1 to
err.. 8 or 16 nodes. No, I am not going to quote an exact number, I
like vague. PxFS allows the filesystem hosted on one of the nodes to
be accessible from any of the other nodes. *Any* of the other nodes. It
is like NFS in that the accessing node does not need a disk path of its
own, so yeah, maybe it is just a file access protocol. To restate, if
you globally mount a UFS
or VxFS filesystem on a cluster node and the mount directory exists on
all cluster nodes, you can access that file system on all cluster
nodes at the mount point. Distributed.

Now for the Highly Available part. Let's go back to the analogy of
vibrating electrons in a linear conductor. If your house's electricity
supply has an inverter to back it up then your electricity supply is
highly available. If the main line goes down, the inverter (battery)
kicks in and you don't notice any downtime. To be exact, there are the
few milliseconds the inverter needs to cut in when the power fails.
Similarly, in a Solaris Cluster setup, if you have more than one node
with a path to storage hosting the underlying filesystem for PxFS, you
have a highly available PxFS file system. If the node hosting PxFS
goes down, the other node with a path to storage will automatically
take over and your applications will not notice any downtime. Similar
to the inverter takeover delay, there will be a brief period when your
fs operations are delayed, but there will be no errors or retries. And
that is the highly available part.
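
If you want to see that "delayed, but no errors" behaviour from an
application's point of view, a small probe like the one below is one way
to look at it. This is only a sketch of mine, not a PxFS tool: the
function name, the payload and the sample count are made up, and it uses
nothing but ordinary write() and fsync() calls. Pointed at a file on a
globally mounted filesystem (an assumption, you supply the path), the
expectation from the paragraph above is a latency spike during takeover
and no exception.

```python
import os
import tempfile
import time


def probe_write_latency(path, samples=20, payload=b"heartbeat\n"):
    """Append payload `samples` times, fsync after each write, and
    return the per-operation latency in seconds."""
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        for _ in range(samples):
            start = time.monotonic()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to stable storage
            latencies.append(time.monotonic() - start)
    finally:
        os.close(fd)
    return latencies


if __name__ == "__main__":
    # Hypothetical target: a temp directory stands in for a global mount.
    with tempfile.TemporaryDirectory() as directory:
        results = probe_write_latency(os.path.join(directory, "probe.log"))
        print("slowest write+fsync: %.6f s" % max(results))
```

Any error during a takeover would surface here as an OSError from
os.write or os.fsync; a mere delay only shows up as a bigger number.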

What about POSIX compliant? Take writes to any POSIX compliant single
node filesystem. There is a guarantee that every write is atomic. If
there are multiple writes to the same file without synchronization
between the writers, you have the guarantee that no writes will
overlap. The only unknown is the order of writes. Similarly, in a PxFS
filesystem, writers from the same node or multiple nodes can do writes
with the guarantee that their writes will not get corrupted. That is
one example of POSIX compliance. Other guarantees, like space for async
writes and fsync semantics, everything POSIX (as far as I know), are
guaranteed on PxFS as well. And that is POSIX compliance.
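
The "no overlapping writes" guarantee is easy to poke at on any single
node filesystem. The sketch below is my own illustration, not anything
from PxFS: several unsynchronized writers append fixed-size records to
one file, and afterwards every record is checked to be intact. Using
O_APPEND, the record size and all the names are my assumptions. The
same program pointed at a global mount, with writers spread across
nodes, would be the PxFS version of the experiment.

```python
import os
import tempfile
import threading

RECORD_SIZE = 64          # bytes per record, trailing newline included
WRITERS = 4
RECORDS_PER_WRITER = 200


def make_record(writer_id):
    """Build a fixed-size record filled with one letter per writer."""
    letter = bytes([ord("A") + writer_id])
    return letter * (RECORD_SIZE - 1) + b"\n"


def append_records(path, writer_id):
    """Append records, one write() call each, with no locking."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    try:
        record = make_record(writer_id)
        for _ in range(RECORDS_PER_WRITER):
            if os.write(fd, record) != RECORD_SIZE:
                raise OSError("short write")
    finally:
        os.close(fd)


def run_writers(path):
    """Run all writers concurrently; return the file split into records."""
    threads = [
        threading.Thread(target=append_records, args=(path, i))
        for i in range(WRITERS)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    with open(path, "rb") as handle:
        data = handle.read()
    return [data[i:i + RECORD_SIZE] for i in range(0, len(data), RECORD_SIZE)]


def count_intact(records):
    """Count records that match exactly what some writer wrote."""
    valid = {make_record(i) for i in range(WRITERS)}
    return sum(1 for record in records if record in valid)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        records = run_writers(path)
        print(len(records), "records,", count_intact(records), "intact")
    finally:
        os.remove(path)
```

The order of the letters in the file changes from run to run, which is
the "only unknown" mentioned above; a record mixing two letters would
be an overlapped write.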

And the administration overhead? .. adding a "-g" to your mount
command and making sure there are mount directories on each node.
"man mount" will tell you about "-g". That part, the administrative
simplicity is worth many paragraphs of prose. The value of simplicity
has already been proven by 
"zpool create tank mirror c1d0 c2d0 mirror c3d0 c4d0"
and all of ZFS's other possibilities, which save you a lot of wear and
tear on fingertips and neurons compared to using SVM and metaxxxx.
