Friday Jul 11, 2008

Nostalgia and a lesson

Recently, I happened to stumble upon my old, old home page, created
10 or more years ago. In it, I had a list of software I like.
Seeing how the list has changed over time was an educational experience.

My original list:
- Linux
- Emacs
- Tcl/Tk
- Windowmaker

My current list:
- Emacs
- Windowmaker
- GNU (to a certain extent)

Emacs, no  uncertainties there.  Number one.  Learned more
about it, wrote more for it and converted a few unbelievers.
Switched to Gnus meanwhile and I spend horrendous amounts of
time there.

Windowmaker, tried xfce and gnome and kde. After a little
while everything else started getting in the way. Got halfway
to writing a couple of applets. The applet menagerie for
Windowmaker on Solaris isn't anything to write home about.

GNU, even though not an absolute must, not having some of
its tools would cause pain in the wrong place. Having
switched to Solaris, and spending the rest of my time in
Windows, means that, other than for things like Gimp, I
really don't need most of it.

Linux, not in the preferred list anymore. The last happy
experience I had was installing DSL on a no-good laptop and
finding the laptop come alive. Nothing against and nothing
for. As I mentioned, spending all my Unix time on Solaris is
a big reason. Still, I don't find anything driving me back
to Linux, even for use at home.

Tcl/Tk, absolutely not in the list. I can't for the life of
me think why it was in the list in the first place. I remember
being impressed with expect, but that is no excuse. I use
multixterm regularly, but that is about it for Tcl.

Trying to explain the above to myself...

A few months back, a colleague and I were trying to burn a
DVD. He had just got his MacBook. The legendary usability of
the Mac was put to the test. After a futile 10 minutes with
the Mac, a few seconds with Google put us right. Once we found
how to do it, it wasn't even worth asking "How do I ...". The
lesson learned was that usability is as much about how much
you use something as it is about how well it is designed.

Saturday Jun 28, 2008

Happiness and patching

A few weeks ago I picked up bug#6713370 and it made me happy. 
This entry is about how it made me happy.

First, about ioctls and PxFS. Assume you are on a PxFS client
node and call, say, the _FIOSATIME ioctl. The file system
primary is on another node, so the copyin for the ioctl
happens on another machine, where the user address passed to
the ioctl is not valid. You are deep in the kernel in
non-cluster code and must find a way to fix up this
address-space mismatch. How is this achieved? Enter t_copyops.

Every kernel thread has a field for storing a pointer to an
array of functions to call when a fault is incurred during a
copyxxx routine. If this field is non-zero, the appropriate
vector from the array will be called.

Before calling any ioctl, the cluster registers its own
copyops vector, mc_copyops. In addition, the thread issuing
the ioctl is given a tsd entry holding the pid and nodeid of
the node from which the call originated.

If there is a fault during copyxxx, the appropriate vector
from mc_copyops is called after the original fault handler is
restored by copyxxx. User address space access will always
result in an mc_copyops vector getting called. Put in simple
terms, the mc_copyops routine goes to the node where the
ioctl was issued, does a copyin from the process address
space there, copies the data over to memory on the server,
and allows the copyin to complete locally.

mc_copyops factors in the case where a failover or node
death caused the thread which issued the ioctl to die. There
is then no process or node to access the user address from.
All such ioctls will first copy their parameters over to the
kernel and issue the ioctl after setting the tsd entry to
point to pid 0. The mc_copyops vectors identify such cases
and treat the passed address as a local kernel address.
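The dispatch decision above can be sketched in a few lines. This is a toy model in Python, not the actual Solaris Cluster C code; the function and parameter names are invented for illustration:

```python
# Toy model of an mc_copyops-style fault handler deciding where the
# "user" address really lives. All names here are made up.
def resolve_copyin(caller_pid, caller_node, local_node, addr):
    if caller_pid == 0:
        # Failover/node-death retry: parameters were already copied
        # into the kernel, so treat addr as a local kernel address.
        return ("kernel", addr)
    if caller_node == local_node:
        # The fault happened on the node that issued the ioctl.
        return ("local-user", addr)
    # Normal PxFS case: fetch the bytes from the originating process's
    # address space on the remote node, then satisfy the copyin locally.
    return ("remote-user", (caller_node, caller_pid, addr))

print(resolve_copyin(0, 1, 2, 0x1000)[0])   # kernel
print(resolve_copyin(42, 1, 2, 0x1000)[0])  # remote-user
```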

The amazing thing is, the above is not necessary to explain
why bug#6713370 made me happy. But it is good to know.

If you look at the stack in the bug you can see we panicked
while enabling logging on the underlying UFS filesystem via
_FIOLOGENABLE.  This is done whenever a new pxfs primary for
a filesystem is created.

Trap pc shows '0'! Much badness. That could happen only if
the deep dark internals of copyin and the fault handlers
messed things up. My suspicion was ASI problems or something
similar happening in new code for sun4v. After getting all
down and dirty trawling through the core, I had the
brainwave that I should verify the parameters passed to the
ioctl call. That is the first thing to do for any core dump;
better late than never. Sure enough, for many, many years we
had been calling VOP_IOCTL() with DATAMODEL_NATIVE as the
flag instead of FKIOCTL.

FKIOCTL tells ddi_copyin() to use kcopy() instead of
bcopy(). The ioctl is called by a kernel thread and the
argument is a kernel address, so passing DATAMODEL_NATIVE
should be a surefire way to mess everything up. Until now
this code worked because mc_copyops identifies the failover
or node death retry by a caller pid of '0', and that code
kicks in here since for kernel ioctls we set the caller pid
to zero via the above-mentioned tsd entry.
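A minimal sketch of the flag mix-up, using the two constant values from this post; the selector function is my own toy, not the actual ddi_copyin() implementation:

```python
# The two flag values discussed in the post.
FKIOCTL = 0x80000000           # kernel-internal ioctl: use kcopy()
DATAMODEL_NATIVE = 0x00200000  # 64-bit data model: plain bcopy() path

def pick_copy_routine(flag):
    # Toy stand-in for the ddi_copyin() decision: kcopy() handles
    # kernel addresses safely, while bcopy() relies on the lofault
    # handler, which bug 6673119 had left invalid.
    return "kcopy" if flag & FKIOCTL else "bcopy"

print(pick_copy_routine(DATAMODEL_NATIVE))  # bcopy -> the panic path
```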

So why didn't it work this time?

There was another bug, 6673119, which messed up the lofault
handler. This is the error recovery handler for bcopy and
must be valid. For a user<->kernel address copy, lofault
should be valid after a fault recovery. The copyops vector is
accessed only from the error handler, but due to 6673119 we
had an invalid handler. That is where we panicked. kcopy
should not have this problem, so passing FKIOCTL to the ioctl
as intended should make everything work.

Amazingly, even the above explanation is not necessary to
explain my happiness due to bug#6713370.

I had a probable root cause. Testing it should be easy
enough. But the tests were running on a different patch
level than the code I was working on, and it was too much
trouble to build the whole thing after finding the correct
date, and thus the right source gate. I thought of solutions.

FKIOCTL is a constant: 0x80000000. DATAMODEL_NATIVE for
64-bit is also a constant, with the value 0x00200000. The
code that loads this into a register should be easily visible
as a sethi instruction. It should be easy to edit the binary
and change the instruction to load 0x80000000 instead of
0x00200000. Binary patching. The last time I did this was 12
years ago. Minor thrill developing!
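The SPARC sethi encoding (op=00, rd, op2=100, imm22 = value>>10) is simple enough to check mechanically. A small sketch that reproduces both instruction words, with %o3 encoded as register 11:

```python
def sethi(value, rd):
    """Encode SPARC 'sethi %hi(value), rd': op=00 in bits 31:30,
    rd in bits 29:25, op2=0b100 in bits 24:22, and
    imm22 = value >> 10 (the low 10 bits are dropped by %hi)."""
    return (rd << 25) | (0b100 << 22) | ((value >> 10) & 0x3FFFFF)

O3 = 11  # %o3: output registers %o0-%o7 are encoded as 8-15
old = sethi(0x00200000, O3)  # DATAMODEL_NATIVE
new = sethi(0x80000000, O3)  # FKIOCTL
print(f"{old:08x} -> {new:08x}")  # 17000800 -> 17200000
```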

First, ask dis to disassemble the appropriate routine.

kernel_ioctl+0xbc:  9a 07 a7 f3  add       %fp, 0x7f3, %o5
kernel_ioctl+0xc0:  a9 2d 70 0c  sllx      %l5, 0xc, %l4
kernel_ioctl+0xc4:  17 00 08 00  sethi     %hi(0x200000), %o3	<<< DATAMODEL_NATIVE
kernel_ioctl+0xc8:  93 3e a0 00  sra       %i2, 0x0, %o1
kernel_ioctl+0xfc:  9a 07 a7 f3  add       %fp, 0x7f3, %o5
kernel_ioctl+0x100: af 2e 30 0c  sllx      %i0, 0xc, %l7
kernel_ioctl+0x104: 17 00 08 00  sethi     %hi(0x200000), %o3	<<< DATAMODEL_NATIVE
kernel_ioctl+0x108: 93 3e a0 00  sra       %i2, 0x0, %o1

17 00 08 00 is the sequence we are looking for. Now open the
pxfs module in Emacs and M-x hexl-mode. Search for
a92d 700c 1700 0800

0005b8e0: 4000 0000 9010 0007 8090 0008 1240 0014  @............@..
0005b8f0: 3b00 0000 4000 0000 2d00 0000 aa15 a000  ;...@...-.......
0005b900: d05f a7e7 9a07 a7f3 a92d 700c 1700 0800  ._.......-p.....
                          here it is ---^^^^^^^^^
0005b910: 933e a000 d85d 2000 4000 0000 9410 001b  .>...] .@.......
0005b920: b610 0008 4000 0000 d05f a7e7 4000 0000  ....@...._..@...
0005b930: 9010 0007 1080 000f d006 6000 b017 6000  ..........`...`.
0005b940: d05f a7e7 9a07 a7f3 af2e 300c 1700 0800  ._........0.....
                            and here ---^^^^^^^^^
0005b950: 933e a000 d85d e000 4000 0000 9410 001b  .>...]..@.......

Now change and disassemble till you get the correct bits for
"sethi 0x80000000, %o3". This took a few iterations since I
couldn't be bothered to learn the binary layout for sethi.

0005b8f0: 3b00 0000 4000 0000 2d00 0000 aa15 a000  ;...@...-.......
0005b900: d05f a7e7 9a07 a7f3 a92d 700c 1720 0000  ._.......-p.. ..
                          here it is ---^^^^^^^^^
0005b910: 933e a000 d85d 2000 4000 0000 9410 001b  .>...] .@.......
0005b920: b610 0008 4000 0000 d05f a7e7 4000 0000  ....@...._..@...
0005b930: 9010 0007 1080 000f d006 6000 b017 6000  ..........`...`.
0005b940: d05f a7e7 9a07 a7f3 af2e 300c 1720 0000  ._........0.. ..
                            and here ---^^^^^^^^^
0005b950: 933e a000 d85d e000 4000 0000 9410 001b  .>...]..@.......

1700 0800 became 1720 0000. Disassemble and check.
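The hexl-mode edit could also be done programmatically. A sketch using the byte sequences from the dumps above; matching on the preceding sllx word keeps the patch from touching any unrelated sethi %hi(0x200000) elsewhere in the module:

```python
def patch_sethi(data: bytearray) -> int:
    """Rewrite 'sethi %hi(0x200000), %o3' (17 00 08 00) to
    'sethi %hi(0x80000000), %o3' (17 20 00 00), but only where it
    follows one of the two sllx instruction words from the
    disassembly. Returns the number of sites patched."""
    patched = 0
    for ctx in ("a92d700c", "af2e300c"):  # the two sllx instructions
        needle = bytes.fromhex(ctx + "17000800")
        i = data.find(needle)
        while i != -1:
            data[i + 4:i + 8] = bytes.fromhex("17200000")
            patched += 1
            i = data.find(needle, i + 8)
    return patched

# Demo on one instruction group from the hex dump.
buf = bytearray(bytes.fromhex("9a07a7f3a92d700c17000800933ea000"))
n = patch_sethi(buf)
print(n, buf.hex())  # 1 9a07a7f3a92d700c17200000933ea000
```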

kernel_ioctl+0xbc:  9a 07 a7 f3  add       %fp, 0x7f3, %o5
kernel_ioctl+0xc0:  a9 2d 70 0c  sllx      %l5, 0xc, %l4
kernel_ioctl+0xc4:  17 20 00 00  sethi     %hi(0x80000000), %o3  <<< FKIOCTL
kernel_ioctl+0xc8:  93 3e a0 00  sra       %i2, 0x0, %o1
kernel_ioctl+0xfc:  9a 07 a7 f3  add       %fp, 0x7f3, %o5
kernel_ioctl+0x100: af 2e 30 0c  sllx      %i0, 0xc, %l7
kernel_ioctl+0x104: 17 20 00 00  sethi     %hi(0x80000000), %o3   <<< FKIOCTL
kernel_ioctl+0x108: 93 3e a0 00  sra       %i2, 0x0, %o1

Run the tests. Everything hunky dory. I had successfully
binary patched something after eons. And that, that made
me happy ;-)

Friday Oct 12, 2007

A rave for PxFS

Sacrilege. I realized only recently that I haven't blogged nice things
about the technology that I work on. While I am at it, may I also
point your attention to the side bar on the left^h^h^h^h right, which
is full of possibilities and not much in realization of those
possibilities. That side bar has led me to CSS and hours of wonderful
time in front of the monitor creating rectangles of various colors and
sizes and overlaps.
The psychedelics started with a visit to

I still remember the plans and grand designs I had for the next web
creation of mine. Don't you fret, those thoughts and designs are still
locked up somewhere in there. But what I actually put in place is what
you see here, a div that doesn't break or justify lines and a side bar
full of defaults. What fun to create vaporware, eh? Talking about
vaporware, I still haven't talked a single thing about PxFS. Aha, the
cat is out of the bag and the probability wave hasn't collapsed yet.
So, about PxFS.

PxFS is the general purpose distributed filesystem used internally by
Solaris Cluster nodes. More cats out of the bag now. By this time next
year you will have heard much more about Solaris Cluster in the open.
Haha .. in the open, all of the code for Solaris Cluster will be open.
As of today
will tell you what is open in Solaris Cluster and what is not. PxFS is
not open yet, but I can talk about it.

What does the big "Distributed HA Filesystem" suit-speak really tell?
PxFS is a Highly Available, Distributed and *POSIX compliant* file
system layer above some disk based file system. The disk based file
system can be UFS or VxFS for now. Layering it above something better
like ZFS is technically feasible. Okay. Now for details about what
each of the above terms really means. I am explaining the real basics
of PxFS here, so total newbies can also understand and I can pretend
I know much much more than what I talked about here.

Distributed. PxFS is a distributed file system internal to cluster
nodes. To explain distributed, take the analogy of electric supply to
a house. If you have only one socket in the house, then the supply at
your house is not distributed. If you add more outlets then the supply
becomes distributed. Similarly, a Solaris Cluster can have 1 to
err.. 8 or 16 nodes. No, I am not going to quote an exact number, I
like vague. PxFS allows the filesystem hosted on one of the nodes to
be accessible on any of the other nodes. *Any* of the other nodes. It
is like NFS in that it does not need a disk path, yeah so maybe it is
just a file access protocol. To restate, if you globally mount a UFS
or VxFS filesystem on a cluster node and the mount directory exists on
all cluster nodes, you can access that file system on all cluster
nodes at the mount point. Distributed.

Now for the Highly Available part. Let's go back to the analogy of
vibrating electrons in a linear conductor. If your house's electricity
supply has an inverter to back it up then your electricity supply is
highly available. If the main line goes down, the inverter (battery)
kicks in and you don't notice any downtime. To be exact, there are
the few milliseconds the inverter needs to cut in when there is no
power. Similarly, in a Solaris Cluster setup, if you have more than
one node with a path to the storage hosting the underlying filesystem
for PxFS, you have a highly available PxFS file system. If the node
hosting PxFS goes down, the other node with a path to storage will
automatically take over and your applications will not notice any
downtime. Similar
to the inverter takeover delay, there will be a brief period when your
fs operations are delayed, but there will be no errors or retries. And
that is the highly available part.

What about POSIX compliant? Take writes to any POSIX compliant single
node filesystem. There is a guarantee that every write is atomic. If
there are multiple writes to the same file without synchronization
between the writers, you have the guarantee that no writes will
overlap. The only unknown is the order of writes. Similarly, in a PxFS
filesystem, writers from the same node or multiple nodes can do writes
with the guarantee that their writes will not get corrupted. That is
one example of POSIX compliance; other guarantees, like space for
async writes and fsync semantics, everything in POSIX (as far as I
know), hold on PxFS too. And that is POSIX compliance.
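The write-atomicity guarantee is easy to see on any POSIX filesystem. A small Python demonstration (not PxFS-specific): two threads append fixed-size records through separate write() calls, and no record comes out torn:

```python
import os
import tempfile
import threading

# POSIX promises each write(2) call is atomic, so concurrent O_APPEND
# writers never interleave bytes within a single call.
fd0, path = tempfile.mkstemp()
os.close(fd0)
fd = os.open(path, os.O_WRONLY | os.O_APPEND)

def writer(ch):
    for _ in range(200):
        # One 65-byte record (64 chars + newline) per write() syscall.
        os.write(fd, (ch * 64 + "\n").encode())

threads = [threading.Thread(target=writer, args=(c,)) for c in "AB"]
for t in threads:
    t.start()
for t in threads:
    t.join()
os.close(fd)

lines = open(path).read().splitlines()
torn = [l for l in lines if len(l) != 64 or len(set(l)) != 1]
print(len(lines), len(torn))  # expect: 400 0
os.unlink(path)
```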

And the administration overhead? .. adding a "-g" to your mount
command and making sure there are mount directories on each node.
"man mount" will tell you about "-g". That part, the administrative
simplicity is worth many paragraphs of prose. The value of simplicity
has already been proven by 
"zpool create tank mirror c1d0 c2d0 mirror c3d0 c4d0"
and all of ZFS's other possibilities, which save you a lot of the wear
and tear on fingertips and neurons you would incur with SVM and metaxxxx.


