Wednesday Mar 23, 2011

So long and thanks for all the fish

I joined Sun over ten years back almost right out of
school. I started out working on Solaris Security where
one of my first projects was to help integrate SSH into
Solaris 9. Over the years I moved around a bunch and
worked on a myriad of technologies, including Solaris Volume
Manager, NFS and Install.

During these years I've had the pleasure of working 
with some of the finest engineers in the industry and
made a lot of friends here (some of whom are more like
family at this point). Thank you guys for making this
such a fun place to work!

Besides the people, I have had a chance to experience and
work on truly one of the best Operating Systems on the
planet. Even though the rigor imposed by Solaris Engineering
on developers seemed daunting and frustrating at times, 
looking back -- it is one of the primary reasons why the 
OS is so rock solid. I am forever grateful for having
gotten a chance to work on Solaris.

Now the time has come for me to venture into something new --
outside Oracle. While I'm excited about the opportunity that
beckons me, it is tough for me to leave behind colleagues with
whom I truly loved working.

This is my last blog entry here. In the future, you can
find my blog at its new home. My new
email will be aggarwaa at cs dot pdx dot edu. I'm on LinkedIn
and Facebook as well should you wish to connect that way.

Thanks again!

Tuesday Dec 15, 2009

Automated Installer from media

The holiday season, the most fun part of the year, is
here. It is also the season to give. I just finished
giving (okay, delivering) another way to do an (automated) 
install of OpenSolaris.

Starting with OpenSolaris build 130, the automated
installer (AI) x86/SPARC iso and USB images will be
bootable standalone. That is, the AI media can now be
booted without needing to set up an AI server or
services. The machine will boot from such media and do an
Automated Install from an IPS repo. Either a default
manifest or a custom manifest can be used to specify the
installation parameters.

This opens up a few avenues:

a) SPARC machines that are not capable of wanboot (a requirement
   for doing AI over the network), can now be directly booted
   and an OpenSolaris install done using that media.

b) The AI media can also be used as a rescue disk.

   On SPARC, the following will simply boot the AI media and
   not start an install.

   ok boot cdrom

   On x86, edit the GRUB menu entry to have 'aimanifest='. This
   will boot the AI media without starting an install.

   A user can log in thereafter as root/opensolaris and perform
   the necessary repair/diagnostic procedures.

c) An xVM PV install can now be done without needing to set up
   an AI server/services, which greatly simplifies the install.

   The following command line does a PV install from the AI media:

   # virt-install -n domU-mediaboot-ai --paravirt -r 2048 --disk
     subdriver-vdisk,format=vmdk -l 
     --autocf aimanifest=default,auto-shutdown=enable 
     --nographics 0:16:36:20:a2:7d

Using a Custom Manifest

If you wish to customize the installation parameters, a custom manifest
must be specified.

On x86, the first choice (also the default choice) in the GRUB menu allows
one to specify a custom manifest.

On SPARC, the following allows one to specify a custom manifest:

ok boot cdrom - install prompt

('ok boot cdrom - install' uses a default manifest located on the
media instead)

If a custom manifest is specified as above, the user is presented with the
following prompt during boot up:

Enter the URL for the AI manifest [HTTP, default]: 

Currently, only an HTTP path to the manifest can be specified. Once enhancement
request 13201 is implemented, it may be possible to specify other sorts of paths as well.

NOTE: If you plan to use the AI media to do an install before OpenSolaris 2010.03
ships, you must install using a custom manifest that specifies ''
instead of '' as the IPS repo to install from. Otherwise the 
installed system may not be bootable.

Monday Mar 09, 2009

Serving Up Lzma

I just pushed the changes that add LZMA to (Open)Solaris
and also allow lofi(7D) to use LZMA as one of the
supported compression algorithms.

On an snv_111 machine, here's what you will see -

Usage: lofiadm -a file [ device ]
       [-c aes-128-cbc|aes-192-cbc|aes-256-cbc|des3-cbc|blowfish-cbc]
       [-e] [-k keyfile] [-T[token]:[manuf]:[serial]:key]
       lofiadm -d file | device
       lofiadm -C [gzip|gzip-6|gzip-9|lzma] [-s segment_size] file
       lofiadm -U file
       lofiadm [ file | device ]

So, if you take a largish file and compress it with gzip
and with lzma, the size difference is quite noticeable.

contraption# du -h solaris.orig
 2.2G   solaris.orig
contraption# lofiadm -C lzma solaris.orig
contraption# du -h solaris.orig
 555M   solaris.orig
contraption# lofiadm -U solaris.orig
contraption# lofiadm -C gzip solaris.orig
contraption# du -h solaris.orig
 702M   solaris.orig

With LZMA support now available for both userland and
kernel consumers, it should be very easy for other Solaris
utilities (zfs?) to provide support for it.
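
A rough way to try this kind of size comparison in userland, without lofi, is to compress the same buffer with both algorithms. The sketch below uses Python's standard zlib and lzma modules; the function name and the synthetic sample data are assumptions made for illustration, and the numbers it prints will not match the lofiadm figures above.

```python
import lzma
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the size of data after deflate (levels 6 and 9) and LZMA."""
    return {
        "original": len(data),
        # zlib implements deflate, the algorithm used by gzip.
        "gzip-6": len(zlib.compress(data, 6)),
        "gzip-9": len(zlib.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }


if __name__ == "__main__":
    # Synthetic, repetitive sample; a real filesystem image behaves differently.
    sample = b"".join(b"/usr/lib/lib%04d.so.1\n" % (i % 500) for i in range(200000))
    for name, size in compressed_sizes(sample).items():
        print(f"{name:>8}: {size} bytes")
```

Swapping in the contents of a real archive for the sample data would give a more meaningful comparison.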

Monday May 12, 2008

Compressed lofi for LiveCD - why

The OpenSolaris LiveCD contains hsfs filesystems that
are compressed with lofi compression. Primary among
these are solaris.zlib, which maps to /usr, and solarismisc.zlib,
which maps to /mnt/misc.

The /usr filesystem contains essential components to
allow for the LiveCD to boot into a desktop. As a result
the layout of this filesystem is carefully ordered such
that accesses are sequential as opposed to being completely
random. This careful ordering of contents allows for the
LiveCD to boot into a desktop in a reasonable amount of
time (~3 minutes on most systems).

Since hsfs is the only OpenSolaris filesystem that allows
files to be ordered a certain way, by passing the
'-sort' flag to mkisofs(8), it was the obvious choice for
the /usr filesystem. That is also the primary reason why compressed
lofi is used for the LiveCD as opposed to, say, ZFS or dcfs(7FS).

More details can be found in Moinak's slides here.

Wednesday Apr 30, 2008

Lzma Numbers

I recently wrote that LZMA has been used to pack more languages
onto the LiveCD. Here are some charts that show how LZMA
stacks up against some of the other popular compression algorithms.
(apologies for the poor image quality, open in another window 
for a clearer image)

These tests were run on a LiveCD archive using 7za(1). As you'll note, the compression ratio provided by LZMA is about 35% better than gzip-9. However, LZMA is more CPU intensive, and as a result its compression and decompression speeds are slower than the alternatives. So, for some use cases the CPU versus compression tradeoff might make LZMA unsuitable, but for the LiveCD use case it is reasonable, provided we architect our solution such that the decompression speed isn't a bottleneck. (Compression speed isn't a problem for the LiveCD architecture.)
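
To get a feel for the CPU side of this tradeoff, one can time a compress/decompress round trip for each algorithm. This is a minimal sketch using Python's standard library rather than 7za(1); the helper name and the synthetic sample are assumptions, and timings depend entirely on the machine and the data.

```python
import lzma
import time
import zlib


def measure(compress, decompress, data: bytes) -> dict:
    """Time one compress/decompress round trip and report the size ratio."""
    start = time.perf_counter()
    packed = compress(data)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    unpacked = decompress(packed)
    decompress_seconds = time.perf_counter() - start

    if unpacked != data:
        raise ValueError("round trip did not reproduce the input")
    return {
        "ratio": len(packed) / len(data),
        "compress_seconds": compress_seconds,
        "decompress_seconds": decompress_seconds,
    }


if __name__ == "__main__":
    # Synthetic sample data; substitute a real archive for meaningful numbers.
    sample = b"".join(
        b"locale/%05d/LC_MESSAGES\n" % (i * 7919 % 30011) for i in range(300000)
    )
    print("gzip-9", measure(lambda d: zlib.compress(d, 9), zlib.decompress, sample))
    print("lzma  ", measure(lzma.compress, lzma.decompress, sample))
```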

Thursday Apr 24, 2008

Lzma on OpenSolaris

The OpenSolaris 2008.05 release that is going to come
out sometime in May is going to have two versions of
the same LiveCD, one with a limited set of languages and
locales and another one with a fuller set of languages.

One of the big challenges with creating a LiveCD with a
full set of languages was that there was a limited amount
of free space on the CD to allow for including
all the languages. How do you cram more stuff on the CD?
Compress it harder, I say! Even better, compress it with
LZMA.
The OpenSolaris kernel did not have an in-kernel implementation
of LZMA that could be taken advantage of (why do we need an
in-kernel implementation, I'll answer that in a separate blog entry). 
So, in our quest to provide one, we started looking at the LZMA SDK. 
Some of the challenges with porting the source from this SDK to the
OpenSolaris kernel were that our lawyers were not amenable to the licensing
terms and the compression code was all written in C++ (which,
for the uninitiated, is strongly discouraged in the kernel).

If you've ever dealt with lawyers you'll be quick to spot
that the licensing can be particularly troublesome. It was.
But only until we contacted the author of LZMA, Igor Pavlov.
Igor was not only willing to relicense the source code under
CDDL (which would of course be agreeable to the lawyers) but
also willing to rewrite the compression code in C. And, he
did that in just a couple of weeks, which is truly outstanding.
That, to me, is the power behind open source and the sharing
opportunities it provides for the broader good.

So, thank you Igor for an excellent compression algorithm
in LZMA and thanks for all your assistance in making the
OpenSolaris 2008.05 release what it is. We look forward to
working with you in the future too.

Thursday Aug 30, 2007

Multiboot - Solaris and Ubuntu

I've recently been futzing with getting my laptop
to run both Solaris and Ubuntu. Ubuntu mostly
because I want to run VMware, which does not
support Solaris as the host operating system (yet?).
I wanted to run VMware mostly to cut down my
development time (I'll save the answer to how I do
that for another day).

I failed miserably in trying to get Ubuntu grub to
boot Solaris. I later found out that it doesn't
work because the required changes to Solaris grub haven't
gone back to the mainstream grub code.

I also realized that the order in which the two operating
systems are installed is important, primarily because
of this deficiency in grub - Ubuntu must be installed first
and Solaris second. This results in Solaris grub being
installed in the master boot record, which can then be
taught where to find Ubuntu by adding an entry such
as this to /boot/grub/menu.lst -

title           Ubuntu, kernel 2.6.20-15-generic
root            (hd0,1)
kernel         /boot/vmlinuz-2.6.20-15-generic root=UUID=91647296-9aca-4d1f-bdfd-7894ff9f0807 ro quiet splash
initrd          /boot/initrd.img-2.6.20-15-generic

Having said this, I also found by trial and error that
if you do install Solaris first and Ubuntu second, with
the result that Ubuntu grub lands in the MBR, you can salvage
the situation by manually slamming Solaris grub into the MBR.

In order to do this, boot off of the Solaris media and
get a shell. Then utter the following incantation -

# /sbin/installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/cNdNsN

where cNdNsN is the root slice. This restores sanity and
you can now add the lines for Ubuntu to the menu.lst.
Please note that the Solaris release on the media should be
as close as possible to the installed Solaris release (if not
the same).

Wednesday Aug 29, 2007

Marvell Ethernet on Solaris

I've got a Sony Vaio that has a Marvell 88E8055 gigabit
ethernet card that doesn't work out of the box on
Solaris.

The bundled SK98sol driver is old and dated. The new
driver must be downloaded either from Sysconnect in
the case of 64-bit or from Marvell in the case of 32-bit.
Update -- if you're doing this on a laptop, you want to
download the driver for "PCI Express Desktop Adapter"
from the Sysconnect website.

After downloading the driver, the pre-existing SK98sol
package needs to be removed prior to adding the downloaded
SKGEsol package (remember to also remove the 'sk98sol'
entries from /etc/driver_aliases). Once the SKGEsol package
has been added Solaris needs to be informed about the
new driver by doing the following -

- Get the PCI ID for the ethernet card by running either
  'prtconf -v' or '/usr/X11/bin/scanpci'. The
  PCI ID for my machine was 'pciex11ab,4363'

- Either use 'add_drv' or 'update_drv' to associate
  that pci id with the skge driver. Something like this -
  # rem_drv skge
  # add_drv -m '* 0660 root sys' -i '"pciex11ab,4363"' skge

The driver should now attach and be ready to be plumbed.
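
For anyone scripting this step, the alias string can be split into its vendor and device halves before building the add_drv line. The sketch below is only an illustration: the function names are hypothetical, and it assumes the alias has the same shape as the one shown above (a pci or pciex prefix, a hex vendor ID, a comma, a hex device ID).

```python
import re

# Matches aliases shaped like "pciex11ab,4363": bus prefix, hex vendor ID,
# comma, hex device ID. Other alias forms are deliberately not handled.
_ALIAS_RE = re.compile(r"^(pciex|pci)([0-9a-f]{1,4}),([0-9a-f]{1,4})$")


def parse_pci_alias(alias: str) -> dict:
    """Split an alias string into bus prefix, vendor ID and device ID."""
    match = _ALIAS_RE.match(alias.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized PCI alias: {alias!r}")
    bus, vendor, device = match.groups()
    return {"bus": bus, "vendor_id": int(vendor, 16), "device_id": int(device, 16)}


def add_drv_command(alias: str, driver: str) -> str:
    """Build the add_drv command line used in this post for a given alias."""
    parse_pci_alias(alias)  # validate before building the command
    return f"add_drv -m '* 0660 root sys' -i '\"{alias}\"' {driver}"


if __name__ == "__main__":
    print(parse_pci_alias("pciex11ab,4363"))
    print(add_drv_command("pciex11ab,4363", "skge"))
```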

Tuesday Aug 28, 2007

NFS Namespace Extensions

So, for those of you that haven't kept up with
projects going on in the NFS space, one of them
is NFSv4 namespace extensions. The two namespace
extensions are "mirror-mounts" and "referrals".

I just noticed that a demo based on a prototype
that we did earlier this year was posted a few
weeks back here. Avid viewers will note
how the referral functionality can be leveraged
to create a very basic global namespace.

Once the code is back in OpenSolaris, it will
be available for anyone interested in extending
this code in interesting ways.

The timing of these OpenSolaris projects is
quite nice considering the renewed momentum
at the IETF NFSv4 WG with respect to Federated
File Systems.

Friday May 11, 2007

A Global Namespace with NFSv4

The NFSv4 specification has provisions in it that allow
for constructing a "Global Namespace" for files.

Let's start by defining what is meant by a Global Namespace.
A Wikipedia search doesn't quite yield what we're looking
for but it results in a link to "Global filesystem" which
oddly enough sort of captures the essence of a global namespace.

So, to rephrase what Wikipedia has to say, a Global Namespace
is a flat namespace where filesystems hosted on a number of
different servers can be aggregated such that they appear as
a single filesystem to all the clients. Okay, that sounds
rather dry, just what good is that?

Consider a typical enterprise, for example, Sun. The enterprise
spans multiple countries across multiple geographies. This
brings about a need for separating the IT network that takes
into account the location affinity -- the US west coast users
associate with the .sfbay domain, US central with .central,
UK with .uk, China with .prc and so on. Each location has to
know the names of the servers that host the relevant filesystems
such that those filesystems can be mounted either a priori or
as they're accessed (via the automounter and such). Additionally,
there's obviously the administrative overhead relating to
configuring the mounts or the automounts as well as maps for
the latter.

What if I were to replace this with, say, a single server
(appropriately replicated across .sfbay, .central, .uk, etc)
that acts as the "root" of the namespace? All the clients
across the enterprise need to know just about this single
server (even that might not be needed in the presence of
something like this) from which they mount the root filesystem.
And, subsequently as they access the directories which don't
exist in, say the .sfbay domain, they get appropriately
redirected to the server in the .central domain that hosts
these directories (or filesystems to be accurate). The clients
automagically mount the absent filesystem(s) from the .central
server and allow access -- all transparently, without any
user intervention and without the need to configure any
automounter maps.

This is, in essence, a Global Namespace for files (grossly
oversimplified but conveys the gist nevertheless).

The NFSv4 protocol allows for such a facility via the
use of Referrals, Replication and Migration. All the
gory details of this facility can be found in RFC 3530 Section 6
as well as the latest internet draft for NFSv4.1. From
a high level this facility allows for an NFSv4 capable server
to indicate to an equally capable client that a particular
filesystem does not exist on the server in question. The
client can subsequently query the server as to where that
filesystem actually resides to which the server replies with
a list of locations. The client can then initiate a mount
from any of those locations.
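
The exchange described above can be mimicked with a toy, in-memory model. In RFC 3530 the "not here" answer is the NFS4ERR_MOVED error and the list of locations comes from the fs_locations attribute; everything else in this sketch (the class, the function names, the server names and paths) is made up for illustration and is not real NFS client or server code.

```python
class ToyServer:
    """In-memory stand-in for an NFSv4 server (not a real NFS implementation)."""

    def __init__(self, name, files, referrals=None):
        self.name = name
        self.files = files                  # path -> file contents
        self.referrals = referrals or {}    # filesystem root -> server names

    def _referral_root(self, path):
        for root in self.referrals:
            if path == root or path.startswith(root + "/"):
                return root
        return None

    def lookup(self, path):
        """Return ("ok", data), ("moved", None) or ("noent", None)."""
        if self._referral_root(path) is not None:
            return ("moved", None)
        if path in self.files:
            return ("ok", self.files[path])
        return ("noent", None)

    def fs_locations(self, path):
        """Return the servers that host the filesystem containing path."""
        root = self._referral_root(path)
        return list(self.referrals[root]) if root is not None else []


def resolve(servers, start, path, max_hops=4):
    """Follow referrals from the start server until the file is found."""
    current = start
    for _ in range(max_hops):
        status, data = servers[current].lookup(path)
        if status == "ok":
            return current, data
        if status == "noent":
            raise FileNotFoundError(path)
        # "moved": ask where the filesystem lives and try the first location.
        locations = servers[current].fs_locations(path)
        if not locations:
            raise FileNotFoundError(path)
        current = locations[0]
    raise RuntimeError("too many referrals")
```

A client that only knows the hypothetical "root" server could then call resolve(servers, "root", "/export/central/data.txt") and end up reading the file from whichever server the root refers it to.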

The NFSv4.1 spec allows for the primary server to return a
much richer set of location information than that
supported by NFSv4.0. The richer location information allows
the client to ascertain which of the locations will be
better equipped, for example, to deliver a high QoS.

So, ultimately this functionality enables us to tie together
a number of disjoint servers such that they appear as a
single server. Did I mention single? And, the fact that
NFSv4 is a standard protocol helps
immensely -- you can construct a Global Namespace that comprises
heterogeneous servers and clients so long as they support NFSv4
in general and referrals/replication-migration in particular.

The logical next question is - does OpenSolaris support this
NFSv4 feature? No, not yet. But, follow the details here.

Thursday May 10, 2007


At the last ATLOSUG meeting a couple of days ago, Ryan
talked about iSCSI in OpenSolaris. He even did a demo.
I never realized it was just *that* easy to set up the target
and the initiator, except of course for setting ACLs, which
seemed like a royal PITA.

Ryan's presentation can be found here.

Friday Apr 06, 2007


A few weeks back Project Blackbox was doing roadshows and as part of
that it visited Atlanta as well. So, I went over and checked it out mostly
'cause I was curious to see how it had been engineered.

On the way out, I noticed this guy listening to music on his Powerbook
and tinkering with a little gizmo almost a third of the size of a credit card.
Upon asking what it was, he said it was a Sun SPOT - a battery powered device
that he was using to gather environmental data on the Blackbox. The device had
been attached to the Blackbox and it was gathering data such as temperature and
humidity. And, this dude was sucking data from it by connecting to it wirelessly
from his Powerbook.

It turns out these devices on the Blackbox also serve as GPS tracking

So, what are Sun SPOTs? Pictures can be found here but these are
basically ARM powered devices that run Java and can be programmed
in a variety of ways using the SDK that comes with the kit. More information
on the ways this toy can be used can be found here and here.

Cool stuff!

Friday Mar 02, 2007

Sun Rays in Classrooms

A couple of days back I happened to notice an article on the Sun internal webpage, the summary
of which said something like,

"Sun Provides Sun Ray Solution to Chitkara Institute

At Chitkara, Sun has installed four high end SUN servers with the best in class Solaris
operating system. 300 thin clients are spread all over the campus connected on Nortel
switches through a Fibre backbone.
The Sun Ray architecture is comprised of Sun Ray thin clients and Sun servers. By
moving resources to a central location on the network and removing complexity, a much
more flexible and easy-to-maintain environment is achieved."

(Apparently, the actual article appeared in Chandigarh, India, Tribune News Service.)

And, the first thought that came to my mind was, "Chitkara? Is it the same as the Chitkara I
knew when I was in high school in India? If so, there's an Institute by that name?"
Once I read the entire article, it was clear to me that it was the Chitkara I used to know:
the same guy who taught Math classes out of his house (a couple of miles from my parents'
house) when I was about to get into engineering school (around 1994). And, yes, it had
been turned into a full scale Institute or rather a group of Institutions. Wow!

This was amazing news from my perspective on a couple of counts.

First, one of the well known educational institutions in the city of Chandigarh, where most
of the population has typically been Microsoft savvy (that is true for the majority of India as well),
had adopted a Sun solution running Solaris! This is a huge acknowledgement of the fact that
Solaris has come a long way in terms of being more approachable and user friendly. Mind
you, there is still a lot of catching up left to do, but it is approachable enough for students in
this case.

Secondly, there were going to be Sun Rays in classrooms, just as a bunch of other educational
institutions around the world have adopted them. I've always thought of the Sun Ray technology as
one of the best pieces of technology created at Sun. And, where better to put them than in the
classrooms -- no need to have 300 different workstations that need to be maintained on a regular
basis and draw a significant amount of power, the students can't do crazy stuff on 'em and they
can't even physically break the damn thing 'cause it's just a dumb thin client (well, for the most part).
Just ideal! Plus, you get all the benefits of a carrier grade OS in Solaris.

And, thinking about it for a moment - it doesn't even have to be Solaris. You could very well be
serving up Windows or even Linux sessions (how long will it be before a Mac OS X virtual client
is available?). So, you could very well bring up a screen on the Sun Ray that gives the users the
ability to choose from the available operating environments -- they could choose whichever they
like, potentially based upon the kind of activity they wish to undertake.

Having been a long time user of the Sun Ray at Home solution, I wonder how long it will be
before service providers like Comcast, in addition to providing you internet access, also serve up
virtual desktops for an additional charge. The users benefit because they don't have to be sysadmins
anymore and the service provider benefits because it gets to deliver value added services.
So, really, how long will it be before we hit a tipping point for the Sun Ray technology?

Wednesday Oct 26, 2005

NAS Industry Conference '05

The NAS Industry Conference concluded last week in Santa Clara. There
were a lot of good talks from various vendors; all of these should be
available later this week on the nasconf website.

Some of the talks I liked included the EMC Keynote and the Filebench talk by Eric.
The EMC Keynote on "Global File Virtualization" gave a very good overview of the
various components needed in order to provide file virtualization and what standards
based technologies (pNFS, NFSv4 referrals, CIFS) can be pieced together to provide
such a solution. Eric, who's a pretty entertaining speaker, talked about the use of a new
filesystem benchmarking tool, Filebench, and how that can be used to determine
performance bottlenecks in NFSv4.

Other talks from Sun included the ACLs tutorial by Sam and Lisa, Filebench tutorial by
Richard McDougall, Observability talk by Bill Baker and IETF NFSv4 Minor Version Update
by Spencer. And, oh yeah, yours truly gave a talk on Checksums for NFSv4 (yeah they
solve a real purpose because the TCP checksum *is* very weak :) ) as well. In case the
pdf version on the conference website renders screwy, here's a StarOffice version that
should work.
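
One concrete way to see the weakness: the TCP checksum is the 16-bit one's complement Internet checksum (RFC 1071), and because it is just a sum, it cannot notice 16-bit words changing places. The sketch below is an illustration written for this post, not material from the talk; the function name and sample payloads are arbitrary.

```python
def internet_checksum(data: bytes) -> int:
    """Compute the 16-bit one's complement Internet checksum of data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    original = b"AABBCCDD"
    reordered = b"CCBBAADD"  # first and third 16-bit words swapped
    print(hex(internet_checksum(original)), hex(internet_checksum(reordered)))
    print("payloads differ:", original != reordered)
```

Both payloads produce the same checksum even though the data differs, which is the kind of corruption a stronger end-to-end checksum is meant to catch.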

Overall, it was a good conference. And, thank you Audrey for doing such a good job
of organizing this once more.


Tuesday Jun 14, 2005

Diskset Import - An Introduction to the source

One of my most significant contributions (along with Steve Peng) to
Solaris 10 was to add support for import/export of disksets to SVM. So, what
is import/export of disksets? Simply put, you've got a bunch of disks
encapsulated in an SVM diskset and you want to disconnect them from one
host, connect them to a different host, and get your SVM configuration
back. Why might you want to do this? You might want to consolidate your storage,
or, in case of a disaster, move your storage from one server to
another.

SVM stores its configuration information for the local set in a regular
metadb (the one that can be seen by metadb(1M) without arguments). The diskset
related configuration is stored in a diskset metadb (one that can be seen by
'metadb -s <diskset>' command) that resides on most (if not all) of the disks that are
a part of that diskset. Additionally, the local set metadb has knowledge about
the disksets including information on where to find the diskset metadbs.

The problem with moving storage from one server to another is that you lose
the local metadb and thus don't know where to find the diskset metadbs (and the
associated configuration). In order to implement diskset import we needed to
figure out which of the recently connected disks in the target system have a diskset
metadb on them, read the configuration in from that metadb and populate the
kernel structures with the read-in configuration information. That was the
scope of the problem in a nutshell.

We started out by writing the code to scan the disks for diskset metadbs
(entirely in userland). If you want to follow the conversation with code
references, pull up metaimport.c. This is essentially the source of
metaimport(1M). The code starts out by scanning the available set of disks,
pruning the disks that are in use and then for each drive that's left it
calls meta_get_set_info - this is the heart of the scanning code. It checks
to see if a diskset metadb exists on the passed in disk and if one exists, it
reads it in and does a sanity check on the metadata information read in. It
also does the work figuring out the new disk names, i.e. a disk named c1t1d1
in the source system might be named c2t2d22 in the target system and you need
to correct the related metadata information in the diskset metadb to reflect
the fresh state of affairs. Upon its return, meta_get_set_info has a list
of disks that comprise a diskset.
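
The shape of that scan can be sketched in a few lines. This is a simplified model, not the real code (which is C, in metaimport.c and meta_get_set_info); the function name, the read_metadb callback and the "set" and "old_name" fields are all hypothetical stand-ins.

```python
def scan_for_disksets(disks, in_use, read_metadb):
    """Group importable disks by diskset, pairing old and new disk names.

    disks:       disk names visible on the target host
    in_use:      set of disk names to prune from the scan
    read_metadb: callable returning a dict such as
                 {"set": "blue", "old_name": "c1t1d1"}, or None when the
                 disk carries no diskset metadb
    """
    found = {}
    for disk in disks:
        if disk in in_use:
            continue  # prune disks that are already in use
        info = read_metadb(disk)
        if info is None:
            continue  # no diskset metadb on this disk
        # Record the name the disk had on the source host next to its new name.
        found.setdefault(info["set"], []).append((info["old_name"], disk))
    return found


if __name__ == "__main__":
    metadbs = {
        "c2t2d22": {"set": "blue", "old_name": "c1t1d1"},
        "c2t3d0": {"set": "blue", "old_name": "c1t2d0"},
    }
    print(scan_for_disksets(["c0t0d0", "c2t2d22", "c2t3d0"], {"c0t0d0"}, metadbs.get))
```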

Once we've built up the list of disksets and the disks that comprise each
of those disksets, we pass all of this information to meta_imp_set that does
the real work of populating the information in the kernel via ioctls. The
MD_DB_USEDEV ioctl creates the kernel structures (akin to what happens when
creating the initial configuration). The MD_IOCIMP_LOAD ioctl then snarfs in
the detailed configuration; the heart of this code is in md_imp_snarf_set.
The ops vector for each of the modules (stripe, mirror, etc) was expanded to
include an import op. So, for example, the stripe ops vector now looked
something like this -

md_ops_t stripe_md_ops = {
	stripe_open,		/* open */
	stripe_close,		/* close */
	md_stripe_strategy,	/* strategy */
	NULL,			/* print */
	stripe_dump,		/* dump */
	NULL,			/* read */
	NULL,			/* write */
	md_stripe_ioctl,	/* stripe_ioctl, */
	stripe_snarf,		/* stripe_snarf */
	stripe_halt,		/* stripe_halt */
	NULL,			/* aread */
	NULL,			/* awrite */
	stripe_imp_set,		/* import set */
};

The import op for each of the modules handled creating detailed configuration
as well as updating out-of-date information.

md_imp_snarf_set calls the import op for each of the modules that appear in
the diskset configuration. So, if there's a stripe in the diskset configuration
stripe_imp_set gets called and so on. Subsequently, the code does exactly
what it says :)

	/*
	 * Fixup
	 * (1) locator block
	 * (2) locator name block if necessary
	 * (3) master block
	 * (4) directory block
	 * calls appropriate writes to push changes out
	 */
	if ((err = md_imp_db(setno)) != 0)
		goto cleanup;

	/*
	 * Create set in MD_LOCAL_SET
	 */
	if ((err = md_imp_create_set(setno)) != 0)
		goto cleanup;

It fixes up another set of out-of-date information and creates the appropriate
structures in the local set to inform the local set about the diskset
configuration and where to find it. That's it, we're done with our job in the
kernel and we return to userland.
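
The per-module dispatch done by md_imp_snarf_set, as described earlier, boils down to looking up an import op by module and calling it for each piece of the configuration. The following is only a model of that pattern: the table, the record layout and the Python functions are hypothetical, borrowing the stripe_imp_set name from the ops vector purely as a label.

```python
def stripe_imp_set(record):
    """Hypothetical stand-in for the stripe module's import op."""
    return f"stripe:{record['name']}"


def mirror_imp_set(record):
    """Hypothetical stand-in for the mirror module's import op."""
    return f"mirror:{record['name']}"


# Maps a module name to its import op, loosely mirroring the "import set"
# slot that was added to each module's ops vector.
IMPORT_OPS = {"stripe": stripe_imp_set, "mirror": mirror_imp_set}


def import_records(records):
    """Call the matching import op for every record in the configuration."""
    imported = []
    for record in records:
        op = IMPORT_OPS.get(record["module"])
        if op is None:
            raise KeyError(f"no import op for module {record['module']!r}")
        imported.append(op(record))
    return imported


if __name__ == "__main__":
    config = [{"module": "stripe", "name": "d10"}, {"module": "mirror", "name": "d20"}]
    print(import_records(config))
```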

In the userland, the only other thing that needs to be done is to inform the
rpc daemon that stores the knowledge about disksets (rpc.metad) about the
existence of this imported set. This is accomplished via the clnt_resnarf_set routine.

So there you have it - a 15,000 ft overview of the implementation of diskset
import.