Tuesday Jul 15, 2008

PxFS mount architecture entry in Sun Cluster Oasis

The content behind this blog title actually lives on the official Sun
Cluster blog site, as it is more of a technical tutorial than a blog entry.

Writing it was fun. I put my foot down - on myself. I decided on
the method of presentation, and read what was needed to get there.
Last time I tried learning anything webbish was when I played with
CSS. Then I didn't have a precise aim in mind and it all turned into
vaporware very soon.

This time the aim was simple: learn JavaScript and image maps. Make
a clickable image with tooltips that has active links. "Hah! That
is not strictly learning," you say. "Nowadays kids are born knowing
all of that!" True. There are some things I never got around to
learning when the rest of the world did. Web 2.0 is one such
thing. Some ajaxing and maybe some JavaFX-ing and I should be Web 2.0
enabled, eh?

Friday Jul 11, 2008

Nostalgia and a lesson

Recently, I happened to stumble upon my old old home page, created
10 or more years ago. In it, I had a list of software I like.
Finding how the list changed over time was an educational experience.

My original list:
- Linux
- Emacs
- Tcl/Tk
- Windowmaker

My current list:
- Emacs
- Windowmaker
- GNU (to a certain extent)

Emacs, no  uncertainties there.  Number one.  Learned more
about it, wrote more for it and converted a few unbelievers.
Switched to Gnus meanwhile and I spend horrendous amounts of
time there.

Windowmaker, tried xfce and gnome and kde. After a little
while everything else started getting in the way. Got halfway
to writing a couple of applets. The Windowmaker applet
menagerie on Solaris isn't anything to write home about.

GNU, even though not an absolute must, not having some of
its tools would cause pain in the wrong place. Having
switched to Solaris, and spending the rest of my time in
Windows, means that, other than for things like Gimp, I
really don't need most of it.

Linux, not in the preferred list anymore. The last happy
experience I had was installing DSL on a no-good laptop and
finding the laptop come alive. Nothing against and nothing
for. As I mentioned, spending all my Unix time on Solaris is
a big reason. Still, I don't find anything driving me back
to Linux, using it at home for example.

Tcl/Tk, absolutely not in the list. I can't for the life of
me think why it was in the list in the first place. I remember
being impressed with expect, but that is no excuse. I use
multixterm regularly but that is about it for Tcl.

Trying to explain the above to myself...

A few months back, a colleague and I were trying to burn a
DVD. He had just got his MacBook. The legendary usability of
the Mac was put to the test. After a futile 10 minutes with the
Mac, a few seconds with Google put us right. It wasn't even
worth asking "How do I ..." after we found how to do it. The
lesson learned was: usability is as much about how much you
use it as it is about how well it is designed.

Saturday Jun 28, 2008

Happiness and patching

A few weeks ago I picked up bug#6713370 and it made me happy. 
This entry is about how it made me happy.

First, about ioctls and PxFS. Assume you are on a PxFS client
node and issue, say, the _FIOSATIME ioctl. The file system
primary is on another node, so the copyin for the ioctl happens
on another machine, where the user address passed to the ioctl
is not valid. You are deep in the kernel, in non-cluster code,
and must find a way to resolve the wrongness in address
space. How is this achieved? Enter t_copyops.

Every kernel thread has provision for storing a pointer
to an array of functions to call when a fault is incurred
during a copyxxx routine. If this field is non-zero, the
appropriate vector from the array will be called.

Before calling any ioctl, the cluster registers its own
copyops vector, mc_copyops. In addition, the thread issuing
the ioctl is given a tsd entry holding the pid and nodeid
of the node from which the call originated.

If there is a fault during copyxxx, the appropriate vector
from mc_copyops is called after the original fault handler is
restored by copyxxx. User address space access will always
result in an mc_copyops vector getting called. Put in simple
terms, the mc_copyops routine goes to the node where the ioctl
was issued, does a copyin from the process address space there,
copies this over to memory on the server and allows the copyin
to complete.

mc_copyops factors in the case where a failover or node
death caused the thread which issued the ioctl to die. There
is then no process or node to access the user address from. All
such ioctls will first copy their parameters over to the kernel
and issue the ioctl after setting the tsd entry to point to
pid 0. The mc_copyops vectors identify such cases and treat
the passed address as a local kernel address.
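The flow above can be sketched outside the kernel. A toy model in
Python (every name and structure here is illustrative; the real
mc_copyops is kernel C operating on address spaces, not dictionaries):

```python
# Per-thread tsd entry: (pid, nodeid) of the ioctl's origin.
# pid 0 means the argument already lives in local kernel memory.
thread_tsd = {}

local_kernel_memory = {}    # addr -> data, valid on this (server) node
remote_address_spaces = {}  # (nodeid, pid) -> {addr: data}


def mc_copyin_fault(thread, addr):
    """Stand-in for the mc_copyops vector invoked when a plain
    copyin faults on an address that is not valid here."""
    pid, nodeid = thread_tsd[thread]
    if pid == 0:
        # Failover / kernel-issued retry: treat addr as kernel memory.
        return local_kernel_memory[addr]
    # Normal case: go to the originating node and copy the data
    # out of the issuing process's address space.
    return remote_address_spaces[(nodeid, pid)][addr]


def copyin(thread, addr):
    """Plain copyin; on a 'fault' the registered vector takes over."""
    try:
        return local_kernel_memory[addr]
    except KeyError:
        return mc_copyin_fault(thread, addr)
```

A client-issued ioctl takes the remote branch, while the failover
retry path (pid 0) reads local kernel memory, matching the two cases
described above.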

The amazing thing is, to talk about why bug#6713370 made me
happy, the above is not necessary. But it is good to know.

If you look at the stack in the bug you can see we panicked
while enabling logging on the underlying UFS filesystem via
_FIOLOGENABLE.  This is done whenever a new pxfs primary for
a filesystem is created.

The trap pc shows '0'! Much badness. That could happen only if
the deep dark internals of copyin and the fault handlers messed
things up. My suspicion was ASI problems or something
similar happening in new code for sun4v. After getting all
down and dirty trawling through the core, I had the
brainwave that I should verify the parameters passed to the
ioctl call. That is the first thing to do for any core
dump; better late than never. Sure enough, for the past many
years we had been calling VOP_IOCTL() with DATAMODEL_NATIVE as
the flag instead of FKIOCTL.

FKIOCTL tells ddi_copyin() to use kcopy() instead of
bcopy(). The ioctl is called by a kernel thread and the
argument is at a kernel address, so passing DATAMODEL_NATIVE
should be a sure-fire way to mess everything up. Until now
this code worked because mc_copyops identifies the failover or
node death retry by a caller pid of '0'. That code kicks in
here, since for kernel ioctls we set the caller pid to zero
via the above-mentioned tsd entry.

So why didn't it work this time?

There was another bug, 6673119, which messed up the lofault
handler. This is the error recovery handler for bcopy and
must be valid. For a user<->kernel address copy, lofault should
be valid after a fault recovery, as the copyops vector is
accessed only from the error handler. But due to 6673119 we had
an invalid handler. That is where we panicked. kcopy does not
have this problem. Thus passing FKIOCTL to the ioctl as
intended should make everything work.

Amazingly, even the above explanation is not necessary to
explain my happiness over bug#6713370.

I had a probable root cause. Testing it should be easy
enough. But the tests were running on a different patch
level than the code I was working on, and it is too much
trouble to build the whole thing after finding the correct
date and thus source gate. I thought of solutions.

FKIOCTL is a constant: 0x80000000. DATAMODEL_NATIVE for 64
bit is also a constant, with the value 0x00200000. The code
that loads this into a register should be easily visible as
a sethi instruction. It should be easy to edit the binary
and change the instruction to load 0x80000000 instead of
0x00200000. Binary patching. The last time I did this was 12
years ago. Minor thrill developing!

First ask dis to disassemble the appropriate routine.

kernel_ioctl+0xbc:  9a 07 a7 f3  add       %fp, 0x7f3, %o5
kernel_ioctl+0xc0:  a9 2d 70 0c  sllx      %l5, 0xc, %l4
kernel_ioctl+0xc4:  17 00 08 00  sethi     %hi(0x200000), %o3	<<< DATAMODEL_NATIVE
kernel_ioctl+0xc8:  93 3e a0 00  sra       %i2, 0x0, %o1
kernel_ioctl+0xfc:  9a 07 a7 f3  add       %fp, 0x7f3, %o5
kernel_ioctl+0x100: af 2e 30 0c  sllx      %i0, 0xc, %l7
kernel_ioctl+0x104: 17 00 08 00  sethi     %hi(0x200000), %o3	<<< DATAMODEL_NATIVE
kernel_ioctl+0x108: 93 3e a0 00  sra       %i2, 0x0, %o1
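Those two instruction words can be checked arithmetically. A quick
sketch of the SPARC format-2 encoding (op bits 00, the destination
register in bits 29-25, op2 100, the top 22 bits of the immediate in
the low 22 bits; %o3 is register 11):

```python
def sethi(imm, rd):
    """Encode SPARC "sethi %hi(imm), rd" as a 32-bit word.
    Format 2: op=00 | rd (5 bits) | op2=100 | imm22."""
    return (rd << 25) | (0b100 << 22) | ((imm >> 10) & 0x3FFFFF)

O3 = 11  # %o0..%o7 are registers 8..15

assert sethi(0x00200000, O3) == 0x17000800  # DATAMODEL_NATIVE: 17 00 08 00
assert sethi(0x80000000, O3) == 0x17200000  # FKIOCTL:          17 20 00 00
```

The two asserts reproduce exactly the byte sequences in the
disassembly above.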

17 00 08 00 is the sequence we are looking for. Now open
pxfs module in emacs and M-x hexl-mode. Search for 
a92d 700c 1700 0800

0005b8e0: 4000 0000 9010 0007 8090 0008 1240 0014  @............@..
0005b8f0: 3b00 0000 4000 0000 2d00 0000 aa15 a000  ;...@...-.......
0005b900: d05f a7e7 9a07 a7f3 a92d 700c 1700 0800  ._.......-p.....
                          here it is ---^^^^^^^^^
0005b910: 933e a000 d85d 2000 4000 0000 9410 001b  .>...] .@.......
0005b920: b610 0008 4000 0000 d05f a7e7 4000 0000  ....@...._..@...
0005b930: 9010 0007 1080 000f d006 6000 b017 6000  ..........`...`.
0005b940: d05f a7e7 9a07 a7f3 af2e 300c 1700 0800  ._........0.....
                            and here ---^^^^^^^^^
0005b950: 933e a000 d85d e000 4000 0000 9410 001b  .>...]..@.......

Now change and disassemble till you get the correct bits for
"sethi %hi(0x80000000), %o3". This took a few iterations since
I couldn't be bothered to learn the binary layout of sethi.

0005b8f0: 3b00 0000 4000 0000 2d00 0000 aa15 a000  ;...@...-.......
0005b900: d05f a7e7 9a07 a7f3 a92d 700c 1720 0000  ._.......-p.. ..
                          here it is ---^^^^^^^^^
0005b910: 933e a000 d85d 2000 4000 0000 9410 001b  .>...] .@.......
0005b920: b610 0008 4000 0000 d05f a7e7 4000 0000  ....@...._..@...
0005b930: 9010 0007 1080 000f d006 6000 b017 6000  ..........`...`.
0005b940: d05f a7e7 9a07 a7f3 af2e 300c 1720 0000  ._........0.. ..
                            and here ---^^^^^^^^^
0005b950: 933e a000 d85d e000 4000 0000 9410 001b  .>...]..@.......

1700 0800 became 1720 0000. Disassemble and check.

kernel_ioctl+0xbc:  9a 07 a7 f3  add       %fp, 0x7f3, %o5
kernel_ioctl+0xc0:  a9 2d 70 0c  sllx      %l5, 0xc, %l4
kernel_ioctl+0xc4:  17 20 00 00  sethi     %hi(0x80000000), %o3  <<< FKIOCTL
kernel_ioctl+0xc8:  93 3e a0 00  sra       %i2, 0x0, %o1
kernel_ioctl+0xfc:  9a 07 a7 f3  add       %fp, 0x7f3, %o5
kernel_ioctl+0x100: af 2e 30 0c  sllx      %i0, 0xc, %l7
kernel_ioctl+0x104: 17 20 00 00  sethi     %hi(0x80000000), %o3   <<< FKIOCTL
kernel_ioctl+0x108: 93 3e a0 00  sra       %i2, 0x0, %o1
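The same surgery could be scripted instead of done in hexl-mode. A
sketch (the anchoring on the preceding sllx instructions follows the
byte sequences shown above; reading and writing the module file is
left to the caller):

```python
def patch_fkioctl(data):
    """Swap "sethi %hi(0x200000), %o3" (17 00 08 00) for
    "sethi %hi(0x80000000), %o3" (17 20 00 00) at the two call
    sites, anchored on the sllx instruction preceding each so
    that stray 17 00 08 00 words elsewhere are left alone."""
    old = bytes.fromhex("17000800")
    new = bytes.fromhex("17200000")
    for sllx in ("a92d700c", "af2e300c"):  # the two sllx encodings above
        anchor = bytes.fromhex(sllx)
        data = data.replace(anchor + old, anchor + new)
    return data
```

Run it over the module image and re-run dis to confirm, exactly as
done by hand above.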

Run the tests. Everything hunky dory. I had successfully
binary patched something after eons. And that, that made
me happy ;-)

Wednesday Jun 18, 2008

raftman vs walle

I had sat down to find out if I could extract part of an elf
section, transpose it to another section, and still keep the
object file functional. As usual one thing led to another, and
before I knew it I was watching a short movie called "The
Raftman's Razor". I lovvvved it. From "how uninteresting this
is", I went to rapt attention and a contented sigh at the end.

Elf editing was a receding dot on the horizon by now. I forget
the association, but from Raftman I reached WALL.E. Fantastic
characterization. Now I need to watch some sort of WALL.E excerpt
every day. My daughter too has declared she likes WALL.E and goes
"Walleeeeee..", this from someone who cried while watching Donald
Duck get hit on the head. Can't wait for WALL.E to hit the screen.

Tuesday Jun 03, 2008

Deciding culture

For the last few weeks I was obsessing about decisions. I had to
think about culture. I had to think about our responsibilities to
our kid. I am in the US now and we have to go back soon. There are
no un-thwartable reasons to go back. The reasons I quote are
taking care of parents, being at a place which is really home and
belongs to us, and allowing our kid to grow up immersed in our
culture and education. She is happy here, enjoying the parks and
play school and long trips.

We have friends here who are questioning our sanity about going back.
We have our parents, relatives and friends in India not believing we
will come back. I have colleagues who counsel me about opportunities
and exploiting them and giving my kid the best possible education. I
have other friends who tell me how much more money I can make and how
stress-free and easy life is here.

Am I really denying my daughter a good education and a better life
only because I value something as ephemeral and shifting as culture?
Would she be a better person because of "our" culture? Do my
responsibilities have an inflated perception factor and a very small
reality factor? I thought a lot, rather I obsessed a lot, and went
around with a dour face and an irritable disposition.

Finally, I have decided. I absolved myself of all sins, so to say.
Thinking back on how I grew up, the memories I cherish, where I
studied, what I enjoyed, I don't see it being very different for my
daughter. For every luxury I may have missed during childhood
there is an enjoyable story, some learning, an event, a reason
making it more special and enriching than any lack could. From
mosquito bites to crowded open markets, from overloaded
public buses to the lack of shopping malls, from dirty
rest-rooms to inadequate infrastructure, everything has left
an impression on me. Whether good or bad or inconsequential,
all that is what has made me.

The "could have been"s are  endless, I have decided for her.
I know I am giving her what she deserves, only the best.

Friday May 02, 2008

JES and another installer

The JES installer in text mode. The ugliest, most useless of
all installers of all time. If there were a user-surly
award, JES should surely get it.

Yes and Y and Enter and 1 and A; it is a menagerie of
choices with the same meaning. But none of them can
be intermixed.

While I am on the topic of installers: recently I had to make
a choice, against my wishes, where the installer could use a
new feature, and using this feature would allow an existing
error to be eliminated. The install would use the new feature
and proceed. That would hide one more question from the user,
which I thought a good thing.

But surprise surprise. It was decided that the existing
error must be replaced by a screen to inform the user of
the alternative *and* get his confirmation to use the new
feature. If he says "No" the install would fail. The screen
would say:

"Error, can't install unless I use feature 'X'.
Use feature? Yes/No"

I translate the second line as:

Fail install so I can throw an error "Yes/No"?

The logic? Existing customers could depend on the install
failing and hence we should not change the behavior.
All newcomers are lemmings anyway. *We* design only
for existing users!

With this excellent logic in place, I can foresee a
future where every new feature will need multiple menu
choices which say only "Fail/Succeed".

Wednesday Mar 19, 2008

Long live Clarke

The man who set me on the course of loving Science Fiction is dead.
I still love the completeness and total lack of gross violence in
his work. His short stories have inspired me as much as O. Henry's
or Steinbeck's. He was a man who knew how to tell a story and also
knew more than most of us about science.

Rendezvous with Rama is still one of my favorites. In my eyes that
book epitomizes what science fiction should be. I am hoping I have
missed reading at least some of his short stories, so I can discover
and enjoy them someday.

Thank you, Clarke.

Tuesday Mar 18, 2008

How to kill software

This is yet another rantish blog entry. Too many of them recently. Well..

Why does some software die while other software survives? I have
convinced myself quality, usability and applicability have nothing
to do with it. I use Emacs; you may use vi or notepad or brief.
brief? brief from UnderWare Inc. No longer out there. Notepad?
Alive; even my 2-year-old has accidentally launched it while
"exploring". Windowmaker, dying. I always come back to it because
it gives me the functionality I need and nothing more unless I
ask. You would have your own examples. I am not talking about open
software alone. TruCluster. I work on Solaris Cluster and couldn't
help hearing tales about how good it was - going, going. OS/2 -
gone. Windows - very much around, eh?

Bad management. From inside Sun I have seen some wonderful products
die a horrible and prolonged death because of internal politics. I
have seen that happen outside Sun too. The usual reason - even a
failed new project helps in increased visibility. Old wonderful
products just bring in money, not promotions. Usually new projects
cannot happen unless an existing something is killed. Wait for an
ambitious enough person with sufficient lack of understanding to come
along and you can see the death of a fine piece of software. That will
not be a long wait usually.

Laziness and lack of neurons. It is highly unlikely that the same
person(s) who conceived and coded the first cut will continue to work
on some software forever. Lisp, vi and Emacs anyone? The new person(s)
working on it have to be motivated and maybe *better* skilled than the
author(s). It takes skill to understand a piece of software well
enough to play with it. More than that, it takes dedication and
hard work to stick with it long enough to start understanding it
that well.

Lack of courage. The simple belief "if it breaks I will fix it" is
very hard to come by. Tweaks and blobs and nudges are all well and
good. I worked in sustaining for 5.5 years. The written and unwritten
law there is "the fewer lines changed, the better the fix". That works
only to a certain extent. Every once in a while you have to rip out
huge gobs of code and replace them with something better, more
elegant, more scalable, more tuned to the present designs. Software
must change to survive. If you want proof, read this blog entry by
Charles Suresh (BTW, he was my mentor when I joined sustaining).

One last reason: lack of vision. If the software is conceived with a
limited scope it is guaranteed not to survive long. Over-reaching and
over-engineering aside, when you are talking about surviving 20 or
30 years or more, the initial idea has to be grand enough and solid
enough to survive and kick the butts of the new kids on the block for
all that time. Take Emacs. Survives and kicks butt.

The above is not halfway to a complete list. Still, all said and done,
it hurts when a perfectly fine piece of software you like is put on
the chopping block for some of the above reasons and you have
to watch it die.

Thursday Mar 06, 2008

Making money and panicking

Last week I was going through a gargantuan depression triggered by
the economic depression and the realization that life is not fair and
that everyone and everything is mortal. I lurked around like an
unhappy pumpkin till I got Terry Pratchett's latest - "Making Money".
It put a smile on my face the moment I felt it in my hand.

I finished the book cackling, laughing and giggling while maintaining
the unhappy pumpkin look. "Recursive premonition" caused me some
thought. There were some almost page-turning moments towards the end.
Overall not as good as even Thud!, but enjoyable if you are a fan.
Finished it sans unhappy pumpkin face. Then it hit me that I should be
further depressed due to the "embuggerance" Terry was diagnosed
with. That should have tickled the mortality factor of my depression.

That didn't happen, and once again a Pratchett creation worked for me
as the perfect anti-depressant. I am now fit enough to tackle bugs
that should not exist in the first place. The current one is that
Solaris tries its utmost to write dirty pages after a panic. The
comment in zfs_sync tells the problem as it is.

zfs_sync(vfs_t *vfsp, short flag, cred_t *cr)
{
	/*
	 * Data integrity is job one.  We don't want a compromised kernel
	 * writing to the storage pool, so we never sync during panic.
	 */
	if (panicstr)
		return (0);
	...
}

That is not the only problem. After a panic there is only one cpu
running threads and there is no pre-emption. Perfectly normal calls
like mutex_enter() and delay() will behave differently after a panic.
Understandable. But does panic code account for all that? No!


It first asks filesystems to sync() by calling vfs_sync on all mounted
file systems. I can work around that by returning immediately if a
panic is in progress, as ZFS does. Added that to PxFS. But panic is
not done. It then calls pageout on every dirty page. To work around
that I had to add the same panic check in PxFS's putpage. Did that
and the bug is fixed.

But why? Why would you want pages from a compromised and hobbled
system to be written out? With a non-local filesystem almost nothing
works. You can't trust the data any more. System behavior after a
panic is different enough not to trust locks and timeouts. My
conclusion is that this is a throwback to the age of no-logging UFS,
when pushing out pages at panic was needed to avoid file system
corruption.

Wednesday Feb 13, 2008

Where goes python

I was catching up on the PEPs for Python 3.0. The very first non-meta
PEP is for better string formatting. I get the feeling complicating
core language features is not the best way forward. What has made me
do more Python than anything else is that, once learned, most of it
stays with me. The core of the language is simple. Modules are
different. I expect them to differ in complexity and interfaces.
But the language itself has a flow. Some of the recent core language
PEPs did not stand up to this view of mine. This format method is one
of those. I feel it belongs in a module and not in the core
language. However, everyone gets an opinion I guess.

Python modules remind me. A long time back I was working on a
configparser implementation based on XML. I liked the configparser
object itself, but not the backend limitation of not having nested
sections. I had done a quick implementation of an XMLConfigParser
which had a backend that would read and write XML instead of the ini
format. This was before new-style classes, and when I didn't know as
much about Python as I do now. I'll have to re-write it and get it
out so someone somewhere needn't re-invent it.
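For what it's worth, the core of such a thing is small. A fresh
sketch, not the old code (class and method names are mine): a
ConfigParser-style object whose backend is an XML tree, so sections
can nest via "/"-separated paths.

```python
import xml.etree.ElementTree as ET

class XMLConfigParser:
    """ConfigParser-ish interface over an XML backend.  Sections are
    element paths like "network/cluster", so they nest -- the
    limitation the ini backend imposes goes away."""

    def __init__(self):
        self.root = ET.Element("config")

    def _section(self, path, create=False):
        # Walk (or build) the element chain named by the section path.
        node = self.root
        for name in path.split("/"):
            child = node.find(name)
            if child is None:
                if not create:
                    raise KeyError(path)
                child = ET.SubElement(node, name)
            node = child
        return node

    def set(self, section, option, value):
        self._section(section, create=True).set(option, value)

    def get(self, section, option):
        return self._section(section).get(option)

    def write(self):
        """Serialize back out as XML instead of ini."""
        return ET.tostring(self.root, encoding="unicode")
```

cfg.set("network/cluster", "nodes", "4") followed by
cfg.get("network/cluster", "nodes") round-trips through a nested
section that the plain ini backend cannot express.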

Monday Feb 04, 2008

Being human

I saw Hotel Rwanda and was immediately depressed. I know worse has
happened in history. Right from almost 2800 BC there are records of
human sacrifice. Organized persecution has happened regularly. With
that as the background, I thought about my arguments with my sister.
She is vehement in her view that humans are unworthy of being
considered the pinnacle of evolution. We humans are not humane.

I argue with her about nature being "inhuman" and killing and fighting
being part of raw nature. Humans are nothing if not a product of
nature. Everything from stealing to cheating to gangs and wars abounds
in nature in many other species. Humans are only following natural
instincts. Everything that humans do is thus fine.

But after watching this movie I have changed my point of view. If we
humans can't control our actions beyond what nature dictates, what use
is self-awareness? Being self-aware is more than saying "I think
therefore I am." It also means acknowledging that everyone else has
the right to be as different as they want to be; tolerance and mercy.
Mercy being the keyword.

Wednesday Dec 12, 2007

Free Flow PxFS

I drove an Ikon 1.6 in Bangalore. It is a poor man's driver's car and
handled beautifully. After 3 years of great fun, I appealed to the
home ministry, composed entirely of my wife Sapna, for more fun with
the car. I got sanction for alloys and a performance filter, also
called a free flow filter. It's not that the power was inadequate; it
had all the power needed to wait powerfully in traffic and at the
frequent signals. This was similar to removing white space, or
refactoring, or rewriting code that works perfectly. It is difficult
for some of us to leave something that works as it is. There has to
be change. Alloys and tubeless tires were easy. Even my wife
complimented the difference and how maybe, just maybe, there was an
iota of sense left in me. The air filter, a K&N, was another matter,
although it didn't earn home ministry ire. For the money I paid, the
gain in power was small. To hear the subtle whoosh of it sucking air
I would have to open the bonnet and listen carefully. Not a good
posture while driving. Well, I did get some increase in power and a
lot more mental satisfaction by looking at the beautiful cut-off cone
once in a while.

What I gave the engine was the ability to suck in a lot more air and
thus burn fuel better. But the engine does not face forward alone;
there is a rear to it. The increased air flow and the larger volume
of exhaust were still going through the old rear of the engine, the
'exhaust sub-system'. If I could allow the exhaust to flow free, the
engine would become a much more efficient air pump. I would get more
power and better mileage. Home ministry personnel realized that for
every sane decision there is an equal and opposite decision. Thus,
the alloys had to be balanced by a free flow exhaust, which I
demanded in a manly manner by groveling and sniveling and composing
long sentences consisting entirely of nonsensical words.

With a free flow exhaust, the standard exhaust manifold is replaced
with custom headers. Each header pipe is of the same length and
(hopefully) tuned such that the exhaust pulses from each cylinder help
the others along. Here's some theory. The catalytic converter may go,
the standard muffler does go, and the tail-pipe and the exhaust pipes
become a little bigger. With all this, from the moment the exhaust
valves open, the exhaust "pulse" has a much freer path to the outside
world. For me, behind the wheel, all this means 15-20% more power and
much better throttle response. For a 950kg car, an increase of 15-20
bhp is significant. To quote a like-minded friend, there is no
restriction between flooring the pedal and red-lining the engine.
Wow! The moral of the story: free inflow *and* free outflow were
required for better performance. Oh, and yes, it gives a nice deep
throaty exhaust note too.

Now let me get to PxFS, the crime of working on which I have pleaded
guilty to earlier. PxFS is slow. PxFS is slow for large files. PxFS is
slow for small files. PxFS is better used read-only. Many Solaris
Cluster customers and developers have heard this or said this. Can
it be made faster? It is, after all, a data pump. Yes, it can be made
faster. Before I expound on the easy modifications, I should tell you
about Ellard Roush and the huge and substantial modifications he
made to PxFS. That is the equivalent of designing a better car. Till
Solaris Cluster 3.1u4, PxFS was a dog and a glutton for memory.
Ellard held it by the neck and shook it till it started behaving
itself; that effort was called the Object Consolidation project. It
was a rewrite of PxFS and made most of the work that followed easier.
Somewhat like how customizing an Evo is easier compared to almost
every other car on the road.

One of the main bottlenecks for allocating writes is making sure there
is backing store, i.e. blocks on disk. This typically means executing
a stateful operation. The allocation should survive a failover and be
guaranteed to be available. In PxFS's case, the allocation is
requested from a client and gets executed on the server. Even more
overhead. Asking for and getting a page, by contrast, is easy and
lightweight. To get around the allocation bottleneck, we implemented
a space cache reservation strategy for PxFS: the "Fastwrite" project.
The idea itself is not new, and not unique to filesystems either. You
allocate a local backing store pool, and when it is exhausted you
request more. The server, or primary, knows how much space is free
and retains global control. The clients write merrily to pages after
reserving space from the local cache. It is the pageout, which
operates on filesystem meta-data, that in turn affects on-disk
structures. In real world terms, applications doing extending writes
will see big speedups. Solaris Cluster 3.2 introduced fastwrites.
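The client side of that reservation scheme is, in outline, tiny. A
sketch with made-up numbers and names (the real code deals in
checkpointed state and failover, not Python objects):

```python
class Primary:
    """The server: retains global control of free space."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks

    def reserve(self, nblocks):
        granted = min(nblocks, self.free_blocks)
        self.free_blocks -= granted
        return granted


class SpaceCache:
    """Client-side pool: reserve a chunk of backing store from the
    primary up front, hand it out locally, refill when dry."""

    CHUNK = 1024  # blocks pulled from the primary per refill (made up)

    def __init__(self, primary):
        self.primary = primary
        self.local_free = 0

    def reserve(self, nblocks):
        """Reserve nblocks for an allocating write; usually no
        round trip to the primary is needed at all."""
        while self.local_free < nblocks:
            want = max(self.CHUNK, nblocks - self.local_free)
            granted = self.primary.reserve(want)
            if granted == 0:
                return False        # filesystem genuinely out of space
            self.local_free += granted
        self.local_free -= nblocks
        return True
```

Most reservations hit only the local pool; the occasional refill is
the one remote, stateful operation, which is the whole point.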

What I described above is the equivalent of the performance air filter
for my car. Applications can now write unhindered, but they can also
starve the system of memory. One of the key requirements of a
filesystem is how fast it can get data into stable storage. Runaway
memory use can be fixed by throttling. Getting data to stable
storage fast requires a better, "freer flowing", back end. With
Solaris Cluster 3.2u1, or 3.2 and the latest patches, you get that. We
made write chunks bigger and gave more threads to write with. Where
data used to be written serially, we parallelised. We split locks and
reduced hold times. We also introduced a semi-heuristic flow control
for clients. All of the above is the equivalent of my tuned headers
and straighter exhaust for the car. Unlike the car, I can take this
for a spin from my desk. Let me do so.

Tests are with 3.2 with the latest patches (or 3.2u1, to be released).
For all these tests I mounted the metadevice as non-global to test
UFS, so disk speed concerns can be put to rest.

PxFS's strength is its ease of use. If you create mount directories on
all nodes, global mounting is as easy as "mount -g <device> <mountpoint>".
Similarly, if I take off the "-g" I get a local mount. Feel free to
over-estimate my efforts.

I'll do the scientificest of all file system tests: "mkfile!"

Writing a 2G file to UFS takes this long on Solaris 10.

-bash-3.00# timex mkfile 2g /mnt/kuntham

real        1:45.79
user           0.22
sys           11.32

Doing the same on a PxFS mount will take this long.

-bash-3.00# timex mkfile 2g /global/xxxx/kuntham

real        1:50.87
user           0.21
sys            8.27

It is comparable! To hammer home the significance: not only are PxFS
writes going through two complete file system layers (like NFS), it is
also checkpointing metadata changes and making sure every page and
transaction related to the file hits the disk before close() returns.

Let's repeat this with dd, the scientificestest of fs tests.

For UFS:
-bash-3.00# timex dd if=/dev/zero of=/mnt/kuntham bs=524288 count=4096
4096+0 records in
4096+0 records out

real        1:47.01
user           0.02
sys           12.50

For PxFS:
-bash-3.00# timex dd if=/dev/zero of=/global/xxxx/kuntham bs=524288 count=4096
4096+0 records in
4096+0 records out

real        1:46.71
user           0.01
sys            7.87

Ooooo, PxFS is faster. Knowing the insides of PxFS, I gave dd a block
size the same as the default page kluster size for PxFS. All is fair
in love and tuning.

Now I'll brew my own slightly unscientific test to quantify the
statement that PxFS makes sure everything is on disk before closing.
This is a small python script that does the same as the scientific dd
test above, but breaks up the time for open, write, sync and close.

Here is the script and here are the results.
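The link to the script hasn't survived, so this is a reconstruction
rather than the original: it writes the same 4096 blocks of 512K as
the dd test and reports per-phase times in the format of the results
below.

```python
#!/usr/bin/env python
# Reconstruction of chunk.py (the original is lost): time each phase
# of a large sequential write separately.  Defaults match the dd test
# above: 4096 blocks of 512K, i.e. a 2G file.
import os
import sys
import time

def timed(label, fn, times):
    """Run fn(), record (label, elapsed seconds), return fn's result."""
    t0 = time.time()
    result = fn()
    times.append((label, time.time() - t0))
    return result

def chunk(path, blocks=4096, bs=524288):
    buf = b"\0" * bs
    times = []
    fd = timed("open", lambda: os.open(path, os.O_WRONLY | os.O_CREAT), times)
    timed("write", lambda: [os.write(fd, buf) for _ in range(blocks)], times)
    timed("fsync", lambda: os.fsync(fd), times)
    timed("close", lambda: os.close(fd), times)
    print("\nTime in seconds\n")
    for label, elapsed in times:
        print("%-5s\t%f" % (label, elapsed))
    print("\nTotal time: %f seconds" % sum(e for _, e in times))

if __name__ == "__main__" and len(sys.argv) > 1:
    chunk(sys.argv[1])
```

Run as "chunk.py <mountpoint>/kuntham", as in the transcripts below.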


-bash-3.00# /cal/chunk.py /mnt/kuntham

Time in seconds

open  	0.228416
write 	106.314831
fsync 	0.704131
close 	0.000030

Total time: 107.247408 seconds


-bash-3.00# chunk.py /global/xxxx/kuntham

Time in seconds

open  	0.012415
write 	80.609731
fsync 	30.116318
close 	0.272743

Total time: 111.011207 seconds

Notice how the writes are much faster than for UFS, but fsync and
close contribute significantly to the overall time? Overall, in spite
of having to honor its data integrity guarantees, PxFS performs quite
well.

Am I done with the car? Absolutely not! Don't tell my wife, but there
are iridium spark plugs, porting and polishing, maybe ECU tuning.

Similarly, there are huge tuning opportunities for PxFS too, and there
must be someone who should not be told. Directory operations, small
files, checkpoints and so on. Since directory operations and other
metadata operations require checkpoints, they are still slow. In
another blog I'll explain checkpoints and recovery in PxFS. You will
be able to see the internals of PxFS and the rest of Solaris Cluster
very soon; I did mention it is going to be opensourced. Maybe one of
you is a master tuner and can then do much more.

Tuesday Dec 04, 2007


"""I will not carry my camera or other electronic items in my
   checkin baggage as it is not covered if baggage is lost.""" \* 10

I already sat in my time out chair.

Tuesday Nov 06, 2007

saving the e-mail tag by tag

Is it only me who is stuck behind a not-very HTML friendly e-mail
mentality? Everyday I feel so. Many a time I have risked the enamel of
my teeth grinding away than ask the sender to please reduce HTML tags
to less than the message content. I confess I use XEmacs and gnus and
w3m to render HTML. While it can render HTML it is not as HTML
friendly as any of the holy browser based e-mail clients. Any time I
raise this issue the answer I get is "loose the looser e-mail client
looser!". After hearing that I get the feeling the sender didn't know
there is an option to disable HTML in his or her e-mail client.

Every so often, instead of reading only the sentence "LOL, you *are*
an idiot" as a reply to a mail of mine, I end up reading pages of CSS
and HTML tags and the ancestry of the HTML e-mail composer which was
used to type in "LOL" in a large red underlined font, oh, and it is
centered on the line and maybe justified too. Is anyone else irked by
the large amount of non-information e-mails contain? If I am reading
my mail on a poor man's terminal and the client couldn't get the
colors right, I end up looking at black letters on a black background.

For me, ASCII is rich enough to convey emphasis and rage and tantrums.
If holy wars could be waged on much poorer terminals, why use these
bells and whistles?

When conveyed in an e-mail, the message:

[image of a plain ASCII warning, missing from this archive]

is as emphatic or more emphatic than

[image of the same warning in large coloured fonts, missing from this
archive]
It is much easier on the eyes, and I can also understand that I
really, really should not have deleted the file. Contrary to the
psychedelics shocking me into obeying, I know not deleting was
important because the sender didn't take time to change font colours
and size.

Even more irksome is when, deep into an e-mail thread with many
participants contributing, some of them insist on quoting with some
HTML something-or-other and indentation. I am attuned to reading this:

+---------------------                  +---------------------
| >> them                               |       them         
|                                       |                    
| > him                     and not     |    him                     
|                                       |                    
| " me now"                             |" me now"           
+---------------------                  +---------------------

I can understand indentation for delineating code blocks. But e-mail,
as I remember it, did not depend on indentation alone for quoting. I
still don't believe e-mail is a web product. Leveraging web-provided
services and features is fine. Over-use and mis-use are not the same
as leveraging.
That brings me to another story my father narrates...

The priest and his helper are walking along a river. This is a long
time ago, and you should know that the priest is of a higher caste
and has rights which the helper does not. They are walking along and
they reach a place where the river stinks like rotten eggs.

The priest pinches his nose and walks on ahead. That is when he
notices his helper also pinching his nose.

The priest asks: "You Idiot, I am pinching my nose to keep the smell
out. Who gave you permission to pinch your nose?"

Helper: "Oh exalted one. It is below me to keep the smell from
entering my unworthy nostrils. I am pinching them to keep the smell
that came in from going out.".

The moral of the story is to not quote non-contextual stories.

Friday Oct 12, 2007

A rave for PxFS

Sacrilege. I realized only recently that I haven't blogged nice
things about the technology that I work on. While I am at it, may I
also point your attention to the side bar on the left^h^h^h^h right,
which is full of possibilities and not much realization of those
possibilities. That side bar has led me to CSS and hours of wonderful
time in front of the monitor creating rectangles of various colors
and sizes and overlaps. The psychedelics started with a visit to
http://www.csszengarden.com

I still remember the plans and grand designs I had for the next web
creation of mine. Don't you fret, those thoughts and designs are still
locked up somewhere in there. But what I actually put in place is what
you see here, a div that doesn't break or justify lines and a side bar
full of defaults. What fun to create vaporware, eh? Talking about
vaporware, I still haven't talked a single thing about PxFS. Aha, the
cat is out of the bag and the probability wave hasn't collapsed yet.
So, about PxFS.

PxFS is the general purpose distributed filesystem used internally by
Solaris Cluster nodes. More cats out of the bag now. By this time
next year you will have heard much more about Solaris Cluster in the
open. Haha .. in the open, all of the code for Solaris Cluster will
be open. As of today, http://opensolaris.org/os/community/ha-clusters/ohac
will tell you what is open in Solaris Cluster and what is not. PxFS
is not open yet, but I can talk about it.

What does the big "Distributed HA Filesystem" suit-speak really tell
you? PxFS is a Highly Available, Distributed and *POSIX compliant*
file system layer above some disk based file system. The disk based
file system can be UFS or VxFS for now. Layering it above something
better like ZFS is technically feasible. Okay. Now for details about
what each of the above terms really means. Before I go into the
explanation: I am explaining the real basics of PxFS here, so total
newbies can also understand and I can pretend I know much much more
than what I talked about here.

Distributed. PxFS is a distributed file system internal to cluster
nodes. To explain distributed, take the analogy of electric supply to
a house. If you have only one socket in the house, then the supply at
your house is not distributed. If you add more outlets, then the
supply becomes distributed. Similarly, a Solaris Cluster can have 1
to err.. 8 or 16 nodes. No, I am not going to quote an exact number;
I like vague. PxFS allows a filesystem hosted on one of the nodes to
be accessible on any of the other nodes. *Any* of the other nodes. It
is like NFS in that it does not need a disk path, yeah, so maybe it
is just a file access protocol. To restate: if you globally mount a
UFS or VxFS filesystem on a cluster node and the mount directory
exists on all cluster nodes, you can access that file system on all
cluster nodes at the mount point. Distributed.

Now for the Highly Available part. Let's go back to the analogy of
vibrating electrons in a linear conductor. If your house's
electricity supply has an inverter to back it up, then your
electricity supply is highly available. If the main line goes down,
the inverter (battery) kicks in and you don't notice any down time.
For exactness, there is the few milliseconds the inverter needs to
cut in when there is no power. Similarly, in a Solaris Cluster setup,
if you have more than one node with a path to the storage hosting the
underlying filesystem for PxFS, you have a highly available PxFS file
system. If the node hosting PxFS goes down, the other node with a
path to storage will automatically take over and your applications
will not notice any down time. Similar to the inverter takeover
delay, there will be a brief period when your fs operations are
delayed, but there will be no errors or retries. And that is the
highly available part.

What about POSIX compliant? Take writes to any POSIX compliant single
node filesystem. There is a guarantee that every write is atomic. If
there are multiple writes to the same file without synchronization
between the writers, you have the guarantee that no writes will
overlap. The only unknown is the order of the writes. Similarly, on a
PxFS filesystem, writers on the same node or on multiple nodes can do
writes with the guarantee that their writes will not get corrupted.
That is one example of POSIX compliance; guarantees like space for
async writes and fsync semantics, everything POSIX (as far as I know)
is guaranteed on PxFS. And that is POSIX compliance.
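The single-node version of that guarantee is easy to see in action.
Here is a small sketch of my own (not from the original post; the
function and file names are made up) that interleaves appends from
two independent file descriptors on one node. Under POSIX O_APPEND
semantics each write() positions at end-of-file and writes atomically,
so no record tramples another; PxFS extends the same kind of
guarantee to writers on different nodes.

```python
import os

def demo(path, records=100, size=64):
    """Interleave appends from two descriptors; verify no corruption."""
    fd_a = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT |
                   os.O_TRUNC, 0o644)
    fd_b = os.open(path, os.O_WRONLY | os.O_APPEND)
    rec_a = b"a" * (size - 1) + b"\n"
    rec_b = b"b" * (size - 1) + b"\n"
    for _ in range(records):       # alternate the two writers
        os.write(fd_a, rec_a)      # O_APPEND: seek-to-EOF + write
        os.write(fd_b, rec_b)      # happen as one atomic step
    os.close(fd_a)
    os.close(fd_b)
    # Every record must be intact: all 'a's or all 'b's, never mixed.
    with open(path, "rb") as f:
        lines = f.read().splitlines()
    assert len(lines) == 2 * records
    for line in lines:
        assert set(line) in ({ord("a")}, {ord("b")})
    return len(lines)
```

The same check run with the two writers on two different cluster
nodes against a globally mounted file is what the POSIX compliance
claim above is about.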

And the administration overhead? Adding a "-g" to your mount command
and making sure there are mount directories on each node. "man mount"
will tell you about "-g". That part, the administrative simplicity,
is worth many paragraphs of prose. The value of simplicity has
already been proven by
"zpool create tank mirror c1d0 c2d0 mirror c3d0 c4d0"
and all of ZFS's other possibilities, which save you a lot of wear
and tear on fingertips and neurons compared to using SVM and the
metaxxxx commands.
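To make that concrete, a global mount is a sketch like the following.
The device path and mount point here are made-up example names, not
from the original post; check your own cluster's device naming before
trying any of it.

```
# The mount point must exist on every cluster node.
mkdir -p /global/web

# One extra flag compared to a plain local mount: -g for global.
mount -g /dev/global/dsk/d10s0 /global/web

# Or make it permanent in /etc/vfstab with the "global" mount option:
# /dev/global/dsk/d10s0 /dev/global/rdsk/d10s0 /global/web ufs 2 yes global
```

After this, the same files are visible at /global/web on all nodes,
which is the whole "Distributed" story from the admin's chair.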


