Wednesday Dec 12, 2007

Free Flow PxFS

I drove an Ikon 1.6 in Bangalore. It is a poor man's driver's car and
handled beautifully. After 3 years of great fun, I appealed to the home
ministry, composed entirely of my wife Sapna, for more fun with the
car. I got sanction for alloys and a performance filter, also called a
free flow filter. It's not that the power was inadequate; it had all
the power needed to wait powerfully in traffic and at the frequent
signals. This was similar to removing white spaces or refactoring or
rewriting code that works perfectly. It is difficult for some of us to
leave something that works as it is. There has to be change. Alloys
and tubeless tires were easy. Even my wife complimented me on the
difference and on how maybe, just maybe, there was an iota of sense
left in me. The air filter, a K&N, was another matter, although it
didn't earn home ministry ire. For the money I paid, the gain in power was
small. To hear the subtle whoosh of it sucking air I would have to
open the bonnet and listen carefully. Not a good posture while
driving. Well, I did get some increase in power and a lot more mental
satisfaction by looking at the beautiful cut-off cone once in a while.

What I gave the engine was the ability to suck in a lot more air and
thus burn fuel better. The engine does not face forward alone. There
is a rear to it. The increased air flow and the larger volume of
exhaust were still going through the old rear of the engine, the
'exhaust sub-system'. If I could allow the exhaust to flow free, the
engine would become a much more efficient air pump. I would get more
power and better mileage. Home ministry personnel realized that for
every sane decision there is an equal and opposite decision. Thus, the
alloys had to be balanced by a free flow exhaust, which I demanded in
a manly manner by groveling and sniveling and composing long sentences
consisting entirely of nonsensical words.

With a free flow exhaust, the standard exhaust manifold is replaced
with custom headers. Each header manifold is of the same length and
(hopefully) tuned such that the exhaust pulses from each cylinder help
the others along. Here's some theory. The catalytic converter may go,
the standard muffler does go, and the tail-pipe and the exhaust pipes
become a little bigger. With all this, from the moment the exhaust
valves open, the exhaust "pulse" has a much freer path to the outside
world. For me, behind the wheel, all this means 15-20% more power and
much better throttle response. For a 950kg car, an increase in power
of 15-20 bhp is significant. To quote a like-minded friend, there is
no restriction between flooring the pedal and red-lining the engine.
Wow! The moral of the story was that free inflow *and* free outflow
were required for better performance. Oh, and yes, it also gives a
nice deep throaty exhaust note.

Now let me get to PxFS. I have already pleaded guilty to the crime of
working on it. PxFS is slow. PxFS is slow for large files. PxFS is
slow for small files. PxFS is better used read-only. Many Solaris
Cluster customers and developers have heard or said this. Can it be
made faster? It is after all a data-pump. Yes, it can be made faster.
Before I expound on the easy modifications, I should tell you about
Ellard Roush and the huge, substantial modifications he made to PxFS.
That is the equivalent of designing a better car. Till Solaris Cluster
3.1u4, PxFS was a dog and it was a glutton for memory. Ellard held it
by the neck and shook it till it started behaving itself. That project
was called the Object Consolidation project. It was a rewrite of PxFS
and made most of the work that followed easier. Somewhat like how
customizing an Evo is easier compared to almost every other car on the
road.

One of the main bottlenecks for allocating writes is making sure there
is backing store, i.e. blocks on disk. This typically means executing a
stateful operation: the allocation should survive a failover and be
guaranteed to be available. In PxFS's case, the allocation is
requested from a client and executed on the server, which is even more
overhead. Asking for and getting a page, by contrast, is easy and
lightweight. To get around the allocation bottleneck, we implemented a
space cache reservation strategy for PxFS: the "Fastwrite" project.
The idea itself is not new and not unique to filesystems either. You
allocate a local backing store pool and, when it is exhausted, you
request more. The server or primary knows how much space is free and
retains global control. The clients write merrily to pages after
reserving space from the local cache. It is the page-out that operates
on filesystem metadata, which in turn affects on-disk structures. In
real-world terms, applications doing extending writes will see big
speedups. Solaris Cluster 3.2 introduced fastwrites.
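
For the curious, here is a toy model of the reservation idea in Python.
It is only a sketch and not PxFS code: the names SpaceServer and
ClientSpaceCache, the block counts and the chunk size are all made up
for illustration. The point it shows is that the client talks to the
server once per chunk instead of once per allocation.

```python
class SpaceServer:
    """Hypothetical primary: keeps the global count of free blocks."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks
        self.requests = 0  # round trips from clients, the expensive part

    def reserve(self, wanted):
        """Grant up to `wanted` blocks, never more than what is free."""
        self.requests += 1
        granted = min(wanted, self.free_blocks)
        self.free_blocks -= granted
        return granted


class ClientSpaceCache:
    """Hypothetical client: allocates from a local reservation first."""

    def __init__(self, server, chunk=1024):
        self.server = server
        self.chunk = chunk  # illustrative refill size, in blocks
        self.local = 0      # blocks reserved locally and not yet used

    def allocate(self, blocks):
        """Reserve space for a write, refilling from the server if needed."""
        while self.local < blocks:
            granted = self.server.reserve(max(self.chunk, blocks - self.local))
            if granted == 0:
                raise OSError("ENOSPC: server has no free blocks left")
            self.local += granted
        self.local -= blocks


server = SpaceServer(free_blocks=10_000)
client = ClientSpaceCache(server, chunk=1024)
for _ in range(4096):  # 4096 one-block extending writes
    client.allocate(1)
print(server.requests, server.free_blocks, client.local)  # 4 5904 0
```

With a chunk of 1024 blocks, the 4096 allocations above cost 4 trips to
the server rather than 4096. A real implementation also has to return
unused reservations and survive failover, which this sketch ignores.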

What I described above is the equivalent of the performance air filter
for my car. Applications can now write unhindered, but they can also
starve the system of memory. One of the key requirements of a
filesystem is how fast it can get data into stable storage. Runaway
memory use can be fixed by throttling. Getting data to stable
storage fast requires a better, "freer flowing" back end. With
Solaris Cluster 3.2u1, or 3.2 and the latest patches, you get that. We
made write chunks bigger and gave them more threads to write with. Where
data used to be written serially, we parallelised. We split locks and
reduced hold times. We also introduced a semi-heuristic flow control
for clients. All of the above is the equivalent of my tuned headers
and straighter exhaust for the car. Unlike the car, I can take this
for a spin from my desk. Let me do so.

The tests below use 3.2 with the latest patches (or 3.2u1, to be
released). To test UFS, I mounted the same metadevice as non-global,
so concerns about differing disk speeds can be put to rest.

PxFS's strength is its ease of use. If you create mount directories on
all nodes, global mounting is as easy as "mount -g" followed by the
usual mount arguments. Similarly, if I take off the "-g" I get a local
mount. Feel free to over-estimate my efforts.

I'll do the scientificest of all file system tests: "mkfile"!

Writing a 2G file to UFS takes this long in Solaris 10.

-bash-3.00# timex mkfile 2g /mnt/kuntham

real        1:45.79
user           0.22
sys           11.32

Doing the same on a PxFS mount will take this long.

-bash-3.00# timex mkfile 2g /global/xxxx/kuntham

real        1:50.87
user           0.21
sys            8.27

It is comparable! To hammer home the significance: not only do PxFS
writes go through two complete file system layers (like NFS), PxFS is
also checkpointing metadata changes and making sure every page and
transaction related to the file hits the disk before close() returns.

Let's repeat this with dd, the scientificestest of fs tests.

For UFS:
-bash-3.00# timex dd if=/dev/zero of=/mnt/kuntham bs=524288 count=4096
4096+0 records in
4096+0 records out

real        1:47.01
user           0.02
sys           12.50

For PxFS:
-bash-3.00# timex dd if=/dev/zero of=/global/xxxx/kuntham bs=524288 count=4096
4096+0 records in
4096+0 records out

real        1:46.71
user           0.01
sys            7.87

Ooooo, PxFS is faster. Knowing the insides of PxFS, I gave dd a block
size the same as the default page kluster size for PxFS. All is fair
in love and tuning.
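
To put the dd numbers in more familiar units, here is a small Python
sketch that turns the "real" times above into throughput. The helper
name timex_seconds is mine, and the figures fed to it are simply the
ones printed by timex above.

```python
def timex_seconds(real):
    """Convert a timex 'real' value such as '1:47.01' to seconds."""
    seconds = 0.0
    for part in real.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


total_bytes = 524288 * 4096  # bs * count from the dd command line, 2 GiB
runs = {"UFS": "1:47.01", "PxFS": "1:46.71"}  # 'real' times from above

for name, real in runs.items():
    mib_per_s = total_bytes / 2**20 / timex_seconds(real)
    print(f"{name}: {mib_per_s:.1f} MiB/s")
```

Both come out at roughly 19 MiB/s, which is another way of saying the
two are neck and neck on this box.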

Now I'll brew my own slightly unscientific test to quantify the
statement that PxFS makes sure everything is on disk before closing.
This is a small Python script that does the same as the scientific dd
test above, but breaks up the time into open, write, fsync and close.

Here is the script and here are the results.
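
The script itself is not reproduced on this page. A minimal sketch of
what such a timing script could look like is below. It is an
assumption on my part, not the script that produced the numbers that
follow: the function names timed_write and report and the defaults
(taken from the dd run above) are illustrative.

```python
import os
import sys
import time


def timed_write(path, block_size=524288, count=4096):
    """Write `count` blocks of zeros to `path`; return seconds per phase."""
    block = bytes(block_size)
    timings = {}

    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    timings["open"] = time.perf_counter() - start

    try:
        start = time.perf_counter()
        for _ in range(count):
            view = memoryview(block)
            while len(view) > 0:  # os.write may write less than asked
                written = os.write(fd, view)
                view = view[written:]
        timings["write"] = time.perf_counter() - start

        start = time.perf_counter()
        os.fsync(fd)  # force data to stable storage
        timings["fsync"] = time.perf_counter() - start
    finally:
        start = time.perf_counter()
        os.close(fd)
        timings["close"] = time.perf_counter() - start
    return timings


def report(timings):
    """Print the per-phase times and their total."""
    print("Time in seconds\n")
    for phase in ("open", "write", "fsync", "close"):
        print(f"{phase:<6}\t{timings[phase]:.6f}")
    print(f"\nTotal time: {sum(timings.values()):.6f} seconds")


if __name__ == "__main__" and len(sys.argv) == 2:
    report(timed_write(sys.argv[1]))
```

Run it with the target file as the only argument. The total here is
just the sum of the four phases.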


-bash-3.00# /cal/ /mnt/kuntham

Time in seconds

open  	0.228416
write 	106.314831
fsync 	0.704131
close 	0.000030

Total time: 107.247408 seconds


-bash-3.00# /global/xxxx/kuntham

Time in seconds

open  	0.012415
write 	80.609731
fsync 	30.116318
close 	0.272743

Total time: 111.011207 seconds

Notice how writes are much faster than for UFS, but fsync and close
contribute significantly to the overall time? Overall, in spite of
having to uphold its data integrity guarantees, PxFS performs quite well.

Am I done with the car? Absolutely not! Don't tell my wife, but there
are iridium spark plugs, porting and polishing, maybe ECU tuning.

Similarly, there are huge tuning opportunities for PxFS too, and there
must be someone who should not be told. Directory operations, small
files, checkpoints and so on. Since directory operations and other
metadata operations require checkpoints, they are still slow. In
another blog I'll explain checkpoints and recovery of PxFS. You can see
the internals of PxFS and the rest of Solaris Cluster very soon. I did
mention it is going to be open-sourced. Maybe one of you is a master
tuner who can then do much more.


