Sunday Sep 23, 2007

What's new in Solaris Install?

Solaris Express Developer Edition 9/07: Introducing the New Solaris Installer

Today is the Solaris Express Developer Edition 9/07 launch! With Solaris Express Developer Edition 9/07 you will see the first in a series of planned projects for the "New Solaris Installer". You can download the image at Solaris Express Developer Edition 9/07.

The intended audience for this first New Solaris Installer project was novice Solaris users who want to download and try Solaris. The basic requirement for this project was that we needed to provide a streamlined, easy-to-use installer with a whole new look and feel. As a result you will note that there are not many knobs in this version. That was intentional. An install takes 5 clicks. That's it! An upgrade takes 3 clicks, which is even better!

It is a great start for Solaris Install. But, it is just the start. We have lots planned for Solaris Install. To see all that we have planned in the next 6 months, take a look at Caiman.

The best way to show you the new look and feel is with pictures!

Welcome screen for New Solaris Installer

Disk Partitioning for New Solaris Installer

Installation Confirmation Page

The new look and feel is not all we offer with the New Solaris Installer. Check out this list:
  • Install or Upgrade question asked before additional configuration questions
  • Disk entries shown with current status for use during installation
  • Upgrade instances also show current status with regard to upgradeability
  • User account creation
  • New slicing algorithm that gives a larger root slice, and an alternate root for Solaris Live Upgrade
  • NetBeans and Sun Studio tools installation integrated seamlessly
  • No unnecessary system identification questions

We hope you enjoy the first of many enhancements to the New Solaris Installer! Try it out and send your comments and questions to

Friday Nov 18, 2005

ZFS zpool device in use checking feature

Device in use checking for ZFS zpool and other filesystem utilities

ZFS zpool feature

As many of you have read and seen, ZFS is live in OpenSolaris. Congratulations to the whole ZFS team for this major accomplishment!

You may have noticed by now a feature that is built into the ZFS zpool command: the ability to check a device's 'in use' status prior to executing the command. This is one of the things I love most about ZFS: the thought and care that went into providing a whole, well thought out and robust user experience. The ZFS CLIs are easy to use and intuitive and, with this 'in use' feature, more user friendly than those of any other filesystem on Solaris today.

I worked on UFS for a few years, so this innovation in ZFS is really wonderful. I had a small part in the delivery of this device in use detection feature for ZFS, so today's blog talks about this capability in ZFS and the work I did to add this capability to other Solaris filesystem utilities.

ZFS and libdiskmgt

To gather the device in use details, ZFS uses a library called libdiskmgt, which is also open for viewing on OpenSolaris. This library was originally written by Jerry Jelinek, and I extended it with additional features for ZFS and the other filesystem utilities.

The intent of this library is for applications, like the ZFS zpool utility, to have a common place to go to get device data. Not just what is in use, but what devices are on a system and what are their attributes. libdiskmgt recognizes the following types of devices:
  • Drives
  • Controllers
  • Media (including removable media)
  • Slices
  • Partitions (x86)
  • Paths to devices
  • Device aliases
  • Buses
These are represented by a generic descriptor to the caller of libdiskmgt which allows opaque handling of these device types by the application. A model of how these device types are associated is shown below:

The details about the interfaces available for libdiskmgt are in libdiskmgt.h

Device in use checking for ZFS zpool

What constitutes a device in use for a specific application? Well, the answer is that it depends on the application. For ZFS, some in use devices are OK to use without a forced override, others are not OK to use even with a forced override, and yet others are OK for use in a zpool only if the user specifically overrides the check. Today, the ZFS zpool utility uses the following rules for devices found to be in use.

The following in use devices can never be used in a ZFS zpool:
  • Mounted filesystems
  • Device with an entry in /etc/vfstab
  • Device used as a dedicated dump device
  • Device in an active ZFS pool
All other in use scenarios as detected by libdiskmgt can be forced via -f with the zpool command.

libdiskmgt reports the following in use statistics:
  • Slices with mounted filesystems
  • Slices with unmounted, but potential filesystems found using the fstyp(1M) Solaris utility.
  • Slices that are part of an SVM volume configuration, or that contain an SVM metadb.
  • Slices that are part of a VxVM volume configuration.
  • Slices that are in a current ZFS pool, active or not.
  • Slices that are current live upgrade partitions.
  • Slices that are configured to be dump devices.
  • Slices that are configured to be swap devices.
  • Slices that are in an /etc/vfstab entry.
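
The zpool rules above can be sketched as a small decision function. This is only an illustration of the policy as described in this post: the enum names and zpool_may_use() are hypothetical and are not libdiskmgt or zpool identifiers, and the sketch assumes that every in use state not on the "never" list needs -f.

```c
#include <stdbool.h>

/*
 * Hypothetical in-use reasons, one per item in the libdiskmgt list above.
 * These names are illustrative only; they are not libdiskmgt identifiers.
 */
typedef enum {
    INUSE_NONE,
    INUSE_MOUNTED,
    INUSE_UNMOUNTED_FS,
    INUSE_SVM,
    INUSE_VXVM,
    INUSE_ACTIVE_ZPOOL,
    INUSE_INACTIVE_ZPOOL,
    INUSE_LIVE_UPGRADE,
    INUSE_DUMP,
    INUSE_SWAP,
    INUSE_VFSTAB
} inuse_reason_t;

/* True for the states that can never go into a zpool, even with -f. */
static bool
never_usable(inuse_reason_t why)
{
    switch (why) {
    case INUSE_MOUNTED:
    case INUSE_VFSTAB:
    case INUSE_DUMP:
    case INUSE_ACTIVE_ZPOOL:
        return (true);
    default:
        return (false);
    }
}

/* Assumed policy: anything else that is in use requires the force flag. */
bool
zpool_may_use(inuse_reason_t why, bool force)
{
    if (why == INUSE_NONE)
        return (true);
    if (never_usable(why))
        return (false);
    return (force);
}
```

With this model, a swap slice is rejected by default but accepted with force, while a mounted filesystem is rejected either way.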

Other filesystem utilities that use device in use checking

As part of the work that was done in preparation for ZFS, other Solaris filesystem utilities were modified to know about devices in use. These changes were integrated in to the same OpenSolaris release as ZFS. The utilities modified were:
  • newfs
  • mkfs
  • swap
  • dumpadm
  • format

These utilities now check for devices being in use prior to issuing the command. These utilities will now fail if a device is found to be in use. As with the ZFS zpool utility, each of these utilities has its own rules with regard to what is considered in use. For example, it is perfectly legal to use a swap device as a dump device.

The difference between the changes to these utilities and the ZFS zpool utility is that these utilities were not modified to have a -f, or forced override, flag. They either succeed or fail, and require the user to clean up any 'in use' details prior to retrying the command. The idea is that if one of these utilities fails due to finding a device in use, the user must be able to clean up the in use state in a supported way in Solaris. For example, if the command fails due to a device being configured as a swap device, the user is instructed to use swap(1M) to change the device state prior to proceeding. These changes are only the beginning. ZFS is leading the way in usability in Solaris. The plan is to continue to enhance libdiskmgt and other utilities to prevent users from stepping on themselves if possible.

Thanks for listening. Enjoy ZFS!

Friday Aug 05, 2005

What Makes A Good User Experience?

Does the right arm know what the left arm is doing?

Today in my blog, I digress from my usual Solaris technical talks to something a bit more difficult to quantify but nonetheless critical to system users: what constitutes a good user experience?

I don't normally talk about personal experiences in my blogs but this one example was just too good (well bad really...) to not share. This example isn't related to using any software at all, but it is an example that I think will show what a bad user experience is.

I recently traveled to China for work.1 I was in China for approximately two weeks and during the time I was there I had my usual cell phone number which is enabled for use in China, as well as my personal and work long distance calling cards. My long distance carrier for my home number is also the provider for my personal and business long distance calling card access. They have my cell phone number as an alternate phone with which to reach me (This is an important piece of data for later in this little saga...). During non-work hours, while in China, I used my personal calling card a lot to call home; specifically to call my son who was at home by himself and as a result I was secretly worried the house would burn down while I was gone :-). Luckily, nothing major happened at home while I was away. (Well at least that my son would tell me about).

I travel a fair amount out of the United States, and have always used my personal calling card while traveling. It should not have been surprising to my long distance carrier that I was using my card, yet again, from a foreign country. But, apparently it was surprising to them, and that is where my bad user experience began. Also, just for another tidbit of data...I use my business calling card to call China for at least one meeting every week in the evenings from my home. Yet another clue that I actually could have been in China, from where my personal calling card was being used.

When I got home from my trip, I started to wade through two weeks of mail, and much to my surprise I had several letters from my long distance carrier. Each letter described concern about my recent calling card use, basically asking "was it really me who was using the card to make calls from China"? It was me obviously, but I didn't know there was an issue since I was in China and was not reading my mail at home. Since I did not respond to the letters, again because I was actually in China, they decided to suspend my usual long distance plan, for both my home phone and personal calling card, and charge my account ~$700 for the calls I made while in China using my personal calling card. I am not sure what the expected outcome of this decision was but needless to say I was surprised to receive this information upon arriving home. 2 The company did try one time, the day before I got home, to call me on my home phone requesting that I call them. Of course, I was not there to receive the call.

I called the company the day after I got home from my trip. I was told that because they could not verify it was me using the calling card from China, and their subsequent change to my long distance plan, which caused the large bill, they had my account flagged because it was considered a 'high balance' account. Although, they caused the 'high balance' in the first place. I let them know that it was indeed me using the calling card from China and asked them to reinstate my usual long distance plan and to modify my account balance accordingly and remove the high balance flag from my account. Getting this all fixed was not as straightforward as you might think. They had to route me first to the long distance folks, have them get me 'signed' up again for my 'original' plan, then I was routed back to the folks who apparently put the high balance flag on my account to have them retroactively adjust my charges based on my 'new' plan. I was put on hold for over 10 minutes during the routing between departments and in the end had to answer every question about my 'new' plan (which was my old plan until all this happened) with a verbal yes for acceptance of the 'new' plan. But, wasn't it my original long distance plan in the first place I asked? Yes, it was, but they still had to go through the defined procedure for setting up a 'new' plan. Then, after all of that, they still had to call me back at my home phone number to verify it was really me that had just called and asked for the 'changes' to my account. Sigh...

You might be thinking right about now.... if my long distance carrier for my home, my personal and business calling card services is the same company, and if they had my cell phone number on file, why couldn't they call me on my cell phone, or email me at work(data they also have since my business contact email is on file with them as well), or even call me at my work extension which I had forwarded to my office in China, to really verify it was me that was using my personal calling card? They obviously have all of this information about me in their database(s) somewhere, right? Does it make sense that they sent me a letter at home to inquire about my usage from China, really expecting if it was me who was using the calling card from China to respond? On top of them stopping my usual long distance plan as a way to stop any illegal use of my calling card even though I never responded to them regarding their inquiries?

I was thinking the exact same thing. This was a horrible user experience. I was penalized for using the service I signed up for, and apparently, within the same company that provides me this service and the other services, they couldn't get enough data on me to actually try to contact me except via a letter. The long distance department couldn't figure out my cell phone number or my work number, even though I am sure my name appears on all three of the services I use with this company. Why wouldn't the department that had concerns about my calling card usage from China have access to my other contact information? The data is obviously stored somewhere, but somehow is not accessible together? This kind of thing, having to move from one data source to another to get or set information, is just silly. And very frustrating for the end user.

This experience got me to thinking about Solaris users. It has to be just as frustrating for users to move from one application to another to configure or get data regarding your system. Why, if it is all on the same system, should you have to use one application to get the data regarding, say for example, what devices are available on your system, and then go to another application to configure these devices for a filesystem or use in a volume manager? You shouldn't have to do that. It should all be integrated for seamless interaction and execution.

In Solaris 10 we have made many strides toward integration of what used to be disparate utilities and services in to one coherent and cohesive user experience. There are many examples of the enhanced usability, and thus the enhanced user experiences, of Solaris 10. Here are a few:

  • Solaris Volume Manager (SVM) integration with Solaris Install and Upgrade makes it much easier to configure volumes during installation. SVM is also integrated with Flash and the Solaris Dynamic Reconfiguration (DR) utility.

  • Service Management Facility (SMF) in Solaris 10. This utility enables users to view system-wide service status, manage services rather than just individual processes, and specify dependencies between services, which allows Solaris to self-heal when a service with dependencies stops.

  • Fault Management Architecture (FMA) in Solaris 10. FMA provides structured log files for telemetry data. It also provides live diagnosis updates without reboots. Fault messaging is now standardized. No longer does the user have to sift through the system syslog to find system messages that indicate symptoms of a problem.

Obviously, there is more we can do with Solaris regarding the user experience. We are continually working on this and are open to real life user experiences with Solaris. Have thoughts or ideas? Feel free to comment on this posting. Soon, we will have an OpenSolaris Solaris approachability community available as well for your comments and suggestions.

Thanks for listening....

1Just to prove I was really in China, here are a few photos from my trip...

* This first one is a gazebo in the Forbidden City, Imperial Gardens in Beijing

* This one is a picture of the 'mask' dance, a traditional Chinese dance I went to see at the Lao She Tea House in Beijing. In this dance the performer changes masks almost instantaneously before your eyes. It was really amazing...

2 I know what you are thinking... that this was just a ploy to get me to upgrade my long distance plan. That thought crossed my mind too. But, if this is true, this is the worst way to get someone to sign up for more service.

Tuesday Jun 14, 2005

Debugging a UFS file truncation bug

I decided for my first OpenSolaris blog I would talk about a particularly difficult to find UFS bug that I worked on about a year ago. This bug came along early in my new career in UFS and the Solaris kernel. It was a bug that taught me a huge amount about UFS, Solaris VM paging, delayed writes, DTrace and other miscellaneous kernel things. The fix was trivial, a one parameter change in a function call, but tracking it down was difficult. The trivial fix bugs are sometimes some of the most interesting to work on. The evaluation is where it gets interesting.

The bug is 5047967 ufs_itrunc does not account for pagesize < blocksize. This was originally known as the 'bmap' bug. Some of the details below show why it was named this. For more gory details you can read the bug report. While working on it I found that it was not a bmap_write bug but in fact a ufs_itrunc bug, and I changed the synopsis to reflect this.

This bug was found running a filesystem stress test called fsx.c. I cannot give you source to this test, but the basic things that this test did were to create a file, truncate it, grow it, write to it, write past the end to create and populate the holes. Then it validated that the data at every block was indeed what was expected. The beginning output from the test showed all was OK. Later the output will show where the error occurred.
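
Since I cannot share the test itself, here is a minimal sketch of the kind of check it performs, written against plain POSIX calls. It is not fsx.c; the function name hole_reads_as_zero() and the /tmp path are assumptions made for illustration, and the sizes are simply borrowed from the failing run shown below. It fills a file with non-zero data, truncates it down, writes past the new end to create a hole, and then verifies that the hole reads back as zeros.

```c
#define _XOPEN_SOURCE 700   /* for mkstemp, pread, pwrite, ftruncate */

#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if the hole reads back as zeros, 0 if stale data shows
 * through, and -1 if any system call fails.
 */
int
hole_reads_as_zero(void)
{
    char path[] = "/tmp/holecheckXXXXXX";   /* assumed scratch location */
    unsigned char buf[4096];
    const off_t grown = 0x2ad8a, cut = 0x1c4f8, past = 0x34119;
    off_t off;
    int fd, ok = 1;

    if ((fd = mkstemp(path)) < 0)
        return (-1);
    (void) unlink(path);

    /* Fill the file with a non-zero pattern up to the larger size. */
    (void) memset(buf, 0xea, sizeof (buf));
    for (off = 0; off < grown; off += (off_t)sizeof (buf)) {
        size_t n = sizeof (buf);

        if (grown - off < (off_t)n)
            n = (size_t)(grown - off);
        if (pwrite(fd, buf, n, off) != (ssize_t)n)
            goto fail;
    }

    /* Truncate down, then write past the new end to create a hole. */
    if (ftruncate(fd, cut) != 0 || pwrite(fd, buf, 16, past) != 16)
        goto fail;

    /* Every byte in [cut, past) must now read back as zero. */
    for (off = cut; off < past && ok; ) {
        size_t i, n = sizeof (buf);

        if (past - off < (off_t)n)
            n = (size_t)(past - off);
        if (pread(fd, buf, n, off) != (ssize_t)n)
            goto fail;
        for (i = 0; i < n; i++) {
            if (buf[i] != 0) {
                ok = 0;
                break;
            }
        }
        off += (off_t)n;
    }
    (void) close(fd);
    return (ok);
fail:
    (void) close(fd);
    return (-1);
}
```

On a correct filesystem this returns 1. A single pass like this is far less likely to hit the bug than the real stress test, which repeats random operations many times.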

So far, so good:

141(141 mod 256): TRUNCATE UP from 0xc0f9 to 0x2ad8a ******WWWW
142(142 mod 256): TRUNCATE DOWN from 0x2ad8a to 0x1c4f8 ******WWWW
143(143 mod 256): READ 0xe4c3 thru 0xfa85 (0x15c3 bytes)
144(144 mod 256): WRITE 0x34119 thru 0x3c577 (0x845f bytes) HOLE ***WWWW
145(145 mod 256): READ 0x1c9ef thru 0x1e0e9 (0x16fb bytes) ***RRRR***

So, I was pondering how I could possibly see what was happening with this bug, and what values we were actually passing in to the appropriate function calls. Enter the miracle known as DTrace.

The output from my DTrace script, which checks the data passed to these functions, is:

length ufs_itrunc: 1c4f8 
inode: 7 
inode orig size: 2ad8a 
boff: 4f8 
bmask: ffffe000
0 14386      bmap_write:entry 
off: 1c4f7 
i size: 4f8 

0 9098         fbzero:entry
fbzero off: 1c000 
fbzero size: 2000 
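
The boff, bmask and fbzero values in this output are plain block arithmetic. A minimal sketch, assuming an 8k filesystem block (which is what the bmask of ffffe000 implies); the helper names are hypothetical and are not the UFS macros:

```c
#include <stdint.h>

#define FS_BSIZE    0x2000u     /* assumed 8k filesystem block */

/* Mask that clears the offset-within-block bits. */
uint32_t
block_mask(void)
{
    return (~(FS_BSIZE - 1));
}

/* Offset of a byte within its filesystem block. */
uint32_t
block_offset(uint32_t off)
{
    return (off & (FS_BSIZE - 1));
}

/* Starting byte offset of the filesystem block containing off. */
uint32_t
block_start(uint32_t off)
{
    return (off & block_mask());
}
```

For the truncate length 0x1c4f8 this gives a block offset of 0x4f8, and the block containing 0x1c4f7 starts at 0x1c000 and ends at 0x1e000, which matches the fbzero offset and size.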

You can see that the length passed in to ufs_itrunc equals the new end point as shown in the last truncate down. This is all good. Then, the call to bmap_write is correct, since we pass in length-1 and get 1c4f7. All good so far. Then the call to fbzero should have zeroed the block between 1c000 and 1e000, which covers the truncation point at 1c4f8. However, the failure output from fsx.c shows:

READ BAD DATA: offset = 0x1c9ef, size = 0x16fb 
0x1d000 0x0000 0x80ea 0x ff4 

That is, starting at 0x1d000 we have bad data. The call for the truncate up, prior to the truncate down, looked like it also had accurate data passed in (DTrace output not shown for this). The initial thought on this bug was that we had a race, since the file was not opened with O_DSYNC. This turned out not to be the case. Read the bug report for more details. In short, ftruncate does not use the O_DSYNC flag at all, so calls coming in to ufs_itrunc from there are never synchronous.

So, what was the real problem? Well it was thought to be a problem when we were trying to fill in a hole in the file when creating the hole. It is, indirectly as you will see with the details below. We had truncated the file down, then written it again, past the new end point, thus creating a hole. The output of fsx.c for this process shows:

142(142 mod 256): TRUNCATE DOWN from 0x2ad8a to 0x1c4f8 ******WWWW
143(143 mod 256): READ 0xe4c3 thru 0xfa85 (0x15c3 bytes)
144(144 mod 256): WRITE 0x34119 thru 0x3c577 (0x845f bytes) HOLE ***WWWW <--right here we should zero the pages from 1c4f9->34119
145(145 mod 256): READ 0x1c9ef thru 0x1e0e9 (0x16fb bytes) ***RRRR***

From the DTrace output below it is clear we were not doing that. It is kind of long output, but it is an important piece of the debugging story, so I am including it here. There is a bit of extra data in my output, partly because I had just started learning DTrace and hadn't quite mastered the script writing skills to narrow down my output. I am much better now :-). The data I got from this DTrace script made me *think* we were having a problem in ufs_getpage. Again, we were, kind of, but in an indirect way...

DTrace output:

2  15344                       rdip:entry
2  11884               segmap_getmapflt:entry 
off + mapon: 1c4f9
n: 1b07
!pagecreate: 1

2  15061               ufs_getpage:entry offset: 1c000
length: 2000
plsz: 2000
rw: 1

2   1463               page_lookup:entry pgoff: 1c000

2   1464               page_lookup:return page_lookup return: c6388058<--we find the 1c000 offset page

2   1463               page_lookup:entry pgoff: 1d000<--have incremented offset by 4k, and look again

2   1464               page_lookup:return page_lookup return: 0<--don't find this page

2  15065               ufs_getpage_miss:entry offset: 1d000<--get a miss so go searching
length: 1000
plsz: 1000
rw: 1

2  15650               bmap_read:entry offset: 1d000
lbn: e
boff: 1000

2   8248               bread_common:entry blkno: 2000

2   8249               bread_common:return
2  14933               findextent:entry bshift: d<--should never have gotten here from bmap_read. This whole block is a hole.
n: 11

2  14934               findextent:return len: 1

2  15651               bmap_read:return
2  10888               pvn_read_kluster:entry
2   1373               page_create_va:entry offset: 1d000

2   1374               page_create_va:return pp: c63736a8<--we create the page here, but do not zero it.
tid: 1
2  10889               pvn_read_kluster:return
3  15062               ufs_getpage:return

So, we return from ufs_getpage without zeroing the 1d000 page. Why? The code snippet below shows how we were getting in to this situation. The code in ufs_getpage_miss that zeroes pages that have no backing store, i.e. UFS_HOLE, is:

        /*
         * Figure out whether the page can be created, or
         * must be read from the disk.
         */
        if (rw == S_CREATE)
                crpage = 1;
        else {
                contig = 0;
                if (err = bmap_read(ip, off, &bn, &contig))
                        return (err);
                /* bmap_read should have returned bn == UFS_HOLE */
                crpage = (bn == UFS_HOLE);
        }

        if (crpage) {   /* we don't come through here, since crpage == 0 */
                if ((pp = page_create_va(vp, off, PAGESIZE, PG_WAIT, seg,
                    addr)) == NULL) {
                        return (ufs_fault(vp,
                                    "ufs_getpage_miss: page_create == NULL"));
                }

                if (rw != S_CREATE)
                        pagezero(pp, 0, PAGESIZE);
        } else {
                u_offset_t      io_off;
                uint_t  xlen;
                struct buf      *bp;
                ufsvfs_t        *ufsvfsp = ip->i_ufsvfs;

                /*
                 * If access is not in sequential order, we read from disk
                 * in bsize units.
                 *
                 * We limit the size of the transfer to bsize if we are reading
                 * from the beginning of the file. Note in this situation we
                 * will hedge our bets and initiate an async read ahead of
                 * the second block.
                 */
                if (!seq || off == 0)
                        contig = MIN(contig, bsize);

                pp = pvn_read_kluster(vp, off, seg, addr, &io_off,
                    &io_len, off, contig, 0);   /* read the data from disk */

However, as seen in the DTrace output above, bmap_read does not report that 1d000 is a hole in the file. The disk has garbage on it, which was not recognized as a hole and which we therefore treat as good data, so we return the garbage we read. Remember the old adage, 'Garbage In, Garbage Out'. That is exactly what was happening here.
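
Stripped of the kernel details, the decision in the ufs_getpage_miss excerpt comes down to three outcomes. The following is a simplified model only, not the UFS code: MODEL_HOLE is a stand-in for UFS_HOLE (its value here is an assumption), and the enum and function names are made up for illustration.

```c
#define MODEL_HOLE  (-1L)   /* stand-in for UFS_HOLE; value is an assumption */

typedef enum {
    ACT_CREATE_ONLY,        /* rw == S_CREATE: create page, caller fills it */
    ACT_CREATE_AND_ZERO,    /* block is a hole: create page and zero it */
    ACT_READ_FROM_DISK      /* block is allocated: read whatever is on disk */
} miss_action_t;

/*
 * creating: non-zero when the request is a create (rw == S_CREATE).
 * bn:       block number that the block-map lookup returned for the offset.
 */
miss_action_t
getpage_miss_action(int creating, long bn)
{
    if (creating)
        return (ACT_CREATE_ONLY);
    if (bn == MODEL_HOLE)
        return (ACT_CREATE_AND_ZERO);
    return (ACT_READ_FROM_DISK);
}
```

In the failing run the lookup for 1d000 returned a real block number rather than a hole, so the model lands in ACT_READ_FROM_DISK and the page is filled with whatever the disk holds.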

I modified fsx.c to output when it found a hole that did not have zero data. This is the output:

1c4f9 - 34119 ******
 byte #: 1d000 = 80ea
 byte #: 1d001 = ea80
 byte #: 1d002 = 806b
 byte #: 1d003 = 6b80
 byte #: 1d004 = 8065
 byte #: 1d005 = 6580
 byte #: 1d006 = 80ba
 byte #: 1d007 = ba80
 byte #: 1d008 = 80a2
 byte #: 1d009 = a280
 byte #: 1d00a = 801c
 byte #: 1d00b = 1c80
 byte #: 1d00c = 80d6
 byte #: 1dfff = 5700

1 page worth of data.

So, why is the read to the disk blocks at this address not showing a hole? And, why was it always the 2nd page in a filesystem block that had this problem? And why didn't it happen every time we truncated to a non filesystem block boundary? These questions and their answers were the key to finding the true problem.

Remember from above, this bug only happened when pagesize < filesystem blocksize. A filesystem blocksize is usually 8k, but can be modified to be 4k. When pagesize was 4k and filesystem block size was 8k we saw the bug. In the 8k block/4k page size scenario, truncate a file to a boundary such as the one in the picture below:

Now, re-grow the file to include the whole block. ufs_itrunc will zero the whole block from the end of the truncate which will zero a partial page and a whole page, when pagesize < filesystem blocksize. This was happening as shown by this code snippet from ufs_itrunc:

line 1346:
                bsize = (int)lblkno(fs, length - 1) >= NDADDR ?
                    fs->fs_bsize : fragroundup(fs, boff);
                pvn_vpzero(ITOV(oip), length, (size_t)(bsize - boff));
               (void) pvn_vplist_dirty(ITOV(oip), length, ufs_putapage,
                   B_INVAL | B_TRUNC, CRED());

We zero (bsize - boff) bytes worth of data with the call to pvn_vpzero. This is the value passed in for the 3rd parameter, called zbytes in the pvn_vpzero function. It is important to note that this size is a function of filesystem block size, not page size. So, we are zeroing the difference between where we last wrote data and the end of a filesystem block.

The call to pvn_vplist_dirty should mark the pages associated with this zeroing as dirty for later write. This is where we were getting in to trouble. To keep from writing useless data beyond the EOF, the pvn_vplist_dirty call is passed the file length and B_INVAL | B_TRUNC. Since it only knows page boundaries, it will of course throw away the second page of the block, as it was asked to by the ufs_itrunc code. Note we passed in 'length' as the 2nd parameter of the pvn_vplist_dirty function call. This parameter represents the offset from which to process the vnode's pages, that is, the pages whose offset >= off. We were telling it only to worry about up to the end of the file, and in the case where pagesize == blocksize, we marked the page covering the offset to the next filesystem block size boundary and it works.

But, when pagesize < filesystem blocksize, we did not mark the last page in the filesystem block as dirty (because the offset is within the first page), therefore never pushed it, and therefore had bad data on disk. Now, when the file grows to encompass the whole block containing the thrown away page, there will be garbage in the file where that page resided.

The fix was trivial and is shown:

                bsize = (int)lblkno(fs, length - 1) >= NDADDR ?
                    fs->fs_bsize : fragroundup(fs, boff);
                pvn_vpzero(ITOV(oip), length, (size_t)(bsize - boff));
                /*
                 * Ensure full fs block is marked as dirty.
                 * The 2nd parameter (offset) was modified to account
                 * for the fs block size boundary.
                 */
                (void) pvn_vplist_dirty(ITOV(oip), length + (bsize - boff),
                    ufs_putapage, B_INVAL | B_TRUNC, CRED());
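
To see the effect of that one parameter, here is a small model of which page offset the invalidation starts from before and after the fix. It is a sketch under stated assumptions, not kernel code: it assumes a full 8k block (bsize == fs_bsize), a length that is not block aligned (boff != 0), and that pages whose offset is >= the offset passed in are the ones thrown away. All names are hypothetical.

```c
#include <stdint.h>

#define MODEL_FS_BSIZE  0x2000u     /* assumed 8k filesystem block */

/* Offset of the first page whose starting offset is >= off. */
uint32_t
first_page_at_or_after(uint32_t off, uint32_t pagesize)
{
    return ((off + pagesize - 1) & ~(pagesize - 1));
}

/* Before the fix: invalidation starts from the truncate length. */
uint32_t
old_inval_start(uint32_t length, uint32_t pagesize)
{
    return (first_page_at_or_after(length, pagesize));
}

/* After the fix: start from the end of the zeroed filesystem block. */
uint32_t
fixed_inval_start(uint32_t length, uint32_t pagesize)
{
    uint32_t boff = length & (MODEL_FS_BSIZE - 1);

    return (first_page_at_or_after(length + (MODEL_FS_BSIZE - boff),
        pagesize));
}
```

For length 0x1c4f8 with 4k pages, the old call starts throwing pages away at 0x1d000, which is the freshly zeroed second page of the block, while the fixed call starts at 0x1e000. With 8k pages both start at 0x1e000, which is consistent with the bug only showing up when pagesize < blocksize.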


Wednesday Jun 01, 2005

UFS and the Solaris VM Subsystem

More UFS technical tidbits in anticipation of OpenSolaris. Today's talk is about UFS I/O. It is a complicated beast and has many different parts and paths it can take.

Overview of file system I/O in Solaris:

The interaction of UFS and the VM subsystem has been the cause of numerous bugs and hard-to-find problems. Today's blog is an overview of UFS I/O, with particular attention paid to the VM subsystem interaction. Details on the paths taken when a read() system call is initiated are included to show the interaction of UFS and the VM subsystem. I am assuming that readers of this blog have some basic Solaris file system knowledge, or at a minimum understand some of the basic Solaris file system terminology.

Basic Solaris VM facts

Solaris virtual memory is demand paged, and globally managed. There is integrated file caching and it is layered to allow VM to describe multiple memory types. The paging vnode cache is the unification of file and memory management by use of a vnode object. 1 page of memory == <vnode, offset> tuple. The UFS file system uses this relationship to implement caching for vnodes. The paging vnode cache provides a set of functions for cache management and I/O for vnodes.
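
The <vnode, offset> identity can be pictured with a tiny model. This is only a sketch of the idea, not the kernel's page structure: page_id_t, page_id() and same_page() are hypothetical names, a plain pointer stands in for the vnode, and a 4k page size is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGESIZE  0x1000u     /* assumed 4k pages */

typedef struct {
    const void  *vp;    /* stands in for the vnode pointer */
    uint64_t    off;    /* page-aligned byte offset within the file */
} page_id_t;

/* Identity of the cached page that holds byte_off of the given file. */
page_id_t
page_id(const void *vp, uint64_t byte_off)
{
    page_id_t id;

    id.vp = vp;
    id.off = byte_off & ~(uint64_t)(MODEL_PAGESIZE - 1);
    return (id);
}

/* Two lookups hit the same cached page only if both parts match. */
bool
same_page(page_id_t a, page_id_t b)
{
    return (a.vp == b.vp && a.off == b.off);
}
```

Any two byte offsets of the same file that fall within one page map to the same identity, while the same offset in two different files does not.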

The paging vnode cache functions are named with a pvn_<xxx> prefix. The source code for this is located at: xxxx. Some of the more important paging vnode functions are listed below, with basic function descriptions. Also shown are pointers to the code so you can get more detailed data about each of these.

Some important paging vnode cache functions:


  • Finds range of continuous pages within the supplied address/length that fit within the <vnode, offset> values that do not already exist.

  • Caller should call pagezero() on any part of last page that is not read from disk.


  • Finds dirty pages within the offset and length. Returns a list of locked pages ready to be written.

  • Caller then sets up write call with pageio_setup().

  • Write is initiated via a call to bdev_strategy().

  • Synchronous writes require the caller to call pvn_write_done(). Otherwise io_done() will call this when write is complete.


  • Finds all pages in page cache >= offset and pushes these pages.

  • Will cluster pages with adjacent pages if it can.

What is a seg_map and why do you care?

The seg_map segment maintains mappings of pieces of files into kernel address space. It is only used by file systems and it allows copying of data to or from user to kernel address space. At any given time, seg_map segment has some portion of total file system cache mapped in to the kernel address space. The seg_map segment driver divides the segment in to file system block sized slots.

Some important seg_map functions:

segmap_getmap() && segmap_getmapflt():

  • Retrieves or creates mapping

  • getmapflt allows for creation of segment if not found, calls ufs_getpage()


  • Releases the mapping for a file segment


  • Creates new pages of memory and slots in the seg_map for a given file

  • Used for extending files or writing holes to a file

Important in the mapping and getting data from the segmap driver is the fbuf structure. It is defined as follows:

struct fbuf {
        caddr_t fb_addr;
        uint_t  fb_count;
};


This structure is used to get a mapping to part of a file via the segkmap interfaces. It is also used by the pseudo bio functions (shown below) for reading and writing of data. Directory-reading code uses fbuf to get at the UFS on-disk contents via a call to blkatoff().

seg_vn and UFS and memory mapped I/O:

Memory mapping allows for a file to be mapped in the a processes address space. This mapping is done via the VOP_MAP call and the seg_vn memory driver. File pages are read when a fault occurs in the address space. The seg_vn driver enables I/O's without process initiated system calls. I/O is performed ,,in units of pages, upon reference to the pages mapped into the address space. Reads are initiated by a memory access, writes are initiated as the VM subsystem finds dirty pages in the mapped address space.

So, why not use the seg_vn driver for non-mmap'd I/O as well? It could be used for mapping the file into the kernel's address space, but seg_vn is a complex segment driver that manages the mapping of protections, copy-on-write fault handling, shared memory, etc. This is too heavyweight for what is needed for read and write system calls, so the seg_map driver was developed. Read and write system calls only require a few basic mapping functions since they do not map files into a process's address space. seg_map reduces locking complexity and gives better performance.

Pseudo bio functions:

Solaris has a set of interfaces which are considered buffered I/O interfaces, but which are used only to read and write buffers containing directory entries. These interfaces all use the seg_map driver to map and address file data. The functions are fbread(), fbwrite(), fbrelse(), fbdwrite(), fbiwrite(), and fbzero(). Although these are not directly shown in the picture above, they are important enough to be worth mentioning.

A UFS/VM example, read() system call - non mmap'd:

Note: In general UFS caches the pages for write, but will also cache pages for reads if they are frequently reusable.


  • Checks for directio[1] enabled, if so tries to bypass page cache

  • If cache_read_ahead is set, sets appropriate flags for placement of pages on the cache list (used in freebehind[2])

  • Calculates whether we need to free pages behind our read (freebehind); this will come in later

  • If i_contents (reader)[3] is held, drops it to avoid deadlock in ufs_getpage().

  • Calls segmap_getmapflt() which transitions to ufs_getpage() since we are forcing a fault via S_READ

  • ufs_getpage():

    • If calling thread is thread owning the current i_contents lock no need to acquire the lock. Also checks to see if the vfs_dqrwlock is required.

    • Checks to see if the file has holes via bmap_has_holes(), this will be important later

    • For a read in ufs_getpage() loop through all the pages in the range off, off + len:

    • Call ufs_getpage_ra() to initiate an asynchronous read ahead of the current page. This helps us in page_lookup() process later.

    • Check if we should initiate a read ahead of the next cluster of bytes; the cluster size is determined from the UFS maxcontig[4] value. Read ahead is true if:

    seqmode[5] && pgoff + cluster size >= i_nextrio (start of next cluster) && pgoff <= i_nextrio && i_nextrio < current file size

    • Call page_lookup() to see if page is in page cache

    • if yes, update appropriate pointers, continue

    • If no, call ufs_getpage_miss():

      • Page is either read from disk or created. It is created, without disk read if we call it with S_CREATE or there is a hole in the file at this offset(not backed by a real disk block) in case of read()

  • Calls uiomove() to move the data from the mapped pages to the caller's buffer

  • We start freeing pages behind the current read if i_nextr (the next byte offset, which was set after reading in the pages) > smallfile offset (32k), because we are reading in sequential mode so we know we won't need them

  • Calls segmap_release() regardless; if cachemode is set to freebehind (SM_FREE|SM_DONTNEED|SM_ASYNC), the pages are put at the head of the page cache
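
The freebehind decision at the end of the list can be sketched in a few lines of C. This is a user-space toy, not the UFS code: the TOY_ flag values and the function name are made up for illustration, and only the 32k smallfile threshold and the three-flag combination come from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values; only the names echo the flags in the text. */
#define TOY_SM_ASYNC    0x02u
#define TOY_SM_FREE     0x04u
#define TOY_SM_DONTNEED 0x10u

/* smallfile offset from the text: 32k. */
#define TOY_SMALLFILE   (32u * 1024u)

/*
 * Pick the flags to hand to the release step after a read.
 * 'next_off' plays the role of i_nextr after the read completed.
 */
unsigned int toy_release_flags(int freebehind, int seqmode, uint64_t next_off)
{
    assert(freebehind == 0 || freebehind == 1);
    if (freebehind && seqmode && next_off > TOY_SMALLFILE) {
        /* Sequential read past the small-file range: free behind us. */
        return TOY_SM_FREE | TOY_SM_DONTNEED | TOY_SM_ASYNC;
    }
    /* Otherwise just release the mapping and leave the pages cached. */
    return 0;
}
```

So a sequential reader that is already 64k into a file asks for its pages to be freed, while a reader still inside the first 32k does not.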



[1] UFS directio will be saved for a later post.

[2] freebehind is always set to 1.

[3] The i_contents lock is a krwlock_t which is part of the ufs inode data structure. It protects most of the inode's contents. See my previous blog posting on UFS locking for more details.

[4] See my previous blog post regarding the use of maxcontig in UFS.

[5] seqmode is determined from the i_nextr field in the current working inode. i_nextr represents the next byte offset for reads. If i_nextr == current offset and we are not creating a page, then we set seqmode == 1.
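
Putting the seqmode note together with the read ahead condition from the walkthrough, a minimal sketch might look like the following. The toy_inode struct and function names are hypothetical, and treating seqmode as a boolean gate on the rest of the condition is my reading of the description rather than a quote from the source.

```c
#include <assert.h>
#include <stdint.h>

/* Toy inode carrying only the fields the condition needs. */
struct toy_inode {
    uint64_t i_nextr;    /* next byte offset expected for a sequential read */
    uint64_t i_nextrio;  /* start of the next cluster to read ahead */
    uint64_t i_size;     /* current file size in bytes */
};

/* Sequential if this read starts where the last one ended and no page is being created. */
int toy_seqmode(const struct toy_inode *ip, uint64_t off, int creating)
{
    return ip->i_nextr == off && !creating;
}

/* Should we kick off a read ahead of the next cluster? */
int toy_should_read_ahead(const struct toy_inode *ip, uint64_t pgoff,
    uint64_t clustsz, int seqmode)
{
    assert(clustsz > 0);
    return seqmode &&
        pgoff + clustsz >= ip->i_nextrio &&  /* we are within a cluster of it */
        pgoff <= ip->i_nextrio &&            /* and have not passed it */
        ip->i_nextrio < ip->i_size;          /* and it is still inside the file */
}
```

Note how a random read (seqmode of 0) never triggers read ahead, no matter where the offsets fall.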

Thursday Apr 07, 2005

More of what you always wanted to know about UFS, but were afraid to ask :-)

So... recently the UFS team for Solaris 10 (of which I am a member) went through a very big exercise to create UFS technical documentation. This exercise proved to be immensely fruitful, for both the current Solaris UFS team and any future Solaris UFS teams. I know I learned a lot personally from this exercise and there is so much great data we amassed I think it can only be useful to the broader community. Thus my desire to share. My hope is that when OpenSolaris is primetime all of this data will be available to anyone who is interested in working on UFS.

I love working in the UFS source. I want to share that love with you all as well, because although UFS is very mature, it has many interesting features and quirks that are just fun to learn. UFS is a complicated beast and requires a lot of deep thought and many nights of grinding through code.

UFS is the keeper of the data for many folks running Solaris. As such, it matters a lot. It is one of those things that nobody notices, but everybody notices when it goes bad. The challenge in keeping UFS running smoothly and being the best keeper of the data is what makes all of the hard work worthwhile. Working in UFS provides a good overall understanding of so many parts of the Solaris kernel since it interacts with so many subsystems: page cache, buffer cache, I/O subsystem,... It isn't the most glamorous code on the planet, but well worth the effort for learning fundamental Solaris kernel technology.

Since OpenSolaris is going to happen soon, I thought I would start blogging about UFS technical data, as evidenced by my first blog. I hope that this blogging might prove useful to those of you out there interested in Solaris UFS filesystem technology. I cannot share code just yet, but can share technical concepts and later give pointers to the code.

Locking in UFS
This blog post will give an introduction to locking in UFS. In the future this will be important data for developers to have and understand, as any feature additions will require an understanding of the locking. Many bug fixes require this understanding as well.

Today's topic will cover the following things:
  • Basics about some Solaris kernel locks
  • UFS inode locks
  • UFS inode queue locks
  • Generic VNODE layer locks
  • Generic VFS layer locks
  • General Lock ordering
  • A Directory lookup locking pseudo code example

There is an implicit assumption on my part that you have some basic knowledge of the VNODE and VFS layers, basic locking principles such as reader/writer locks and mutual exclusion locks in Solaris, and basic UFS inode knowledge. If not, it would be good to do some pre-reading on these topics. A good place to start is with the "Solaris Internals Core Kernel Architecture" book, by Jim Mauro and Richard McDougall.

Solaris Kernel Locks
In this section I will cover some very basic technical details of Solaris kernel locks, specifically those types used in UFS.
  • krwlock_t reader/writer lock, allows multiple readers or 1 writer at a time.
  • kmutex_t Mutual exclusion lock, allows one operator at a time.
    Solaris implements an adaptive mutex lock:
    • If holder is running, spin.
    • If holder is sleeping, sleep.
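
The spin-or-sleep rule above boils down to a tiny decision function. This is only a sketch of the rule as stated; the toy_mutex struct and its fields are hypothetical and say nothing about how the kernel actually tracks the owner.

```c
#include <assert.h>
#include <stddef.h>

/* What a thread that wants the lock should do next. */
enum toy_action { TOY_TAKE, TOY_SPIN, TOY_SLEEP };

/* Hypothetical snapshot of a mutex as seen by the waiting thread. */
struct toy_mutex {
    int held;           /* nonzero if some thread owns the lock */
    int owner_running;  /* nonzero if the owner is currently on a CPU */
};

enum toy_action toy_adaptive_action(const struct toy_mutex *m)
{
    assert(m != NULL);
    if (!m->held)
        return TOY_TAKE;
    /* A running holder will probably release soon, so spinning is cheap. */
    /* A sleeping holder may take a long time, so sleep rather than burn CPU. */
    return m->owner_running ? TOY_SPIN : TOY_SLEEP;
}
```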

UFS Inode locks
There are four locks associated with UFS inodes:
  • i_rwlock(krwlock_t)
    • Serializes write requests. Allows reads to proceed in parallel. Serializes directory reads and updates.
    • Does not protect inode fields.
    • Indirectly protects block lists since it serializes allocations/deallocations in UFS.
    • Must be taken prior to starting UFS logging transactions if operating on a file, otherwise taken after starting the logging transaction.
  • i_contents(krwlock_t)
    • Protects most fields in the inode.
    • When held as a writer protects all the fields protected by the i_tlock as well.
  • i_tlock(kmutex_t)
    • When held with the i_contents reader lock it protects the following inode fields:
      • i_utime, i_ctime, i_mtime, i_flag, i_delayoff, i_delaylen, i_nextrio, i_writes, i_writer, i_mapcnt
    • Also used as mutex for write throttling in UFS.
    • i_contents and i_tlock held together allows parallelism in updates.
  • ih_lock
    • UFS inode hash lock
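
The relationship between i_contents and i_tlock is easier to see as a predicate. The sketch below is a toy model of the rules in the list; the struct, the enum, and the rule that the remaining inode fields need i_contents held as writer are my assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum toy_rw { TOY_RW_NONE, TOY_RW_READER, TOY_RW_WRITER };

/* Which of the two locks the calling thread currently holds. */
struct toy_held {
    enum toy_rw i_contents;  /* how the caller holds i_contents */
    int         i_tlock;     /* nonzero if the caller holds i_tlock */
};

/*
 * Fields such as i_mtime or i_flag: i_contents as writer covers them,
 * and so does i_contents as reader combined with i_tlock.
 */
int toy_may_update_tlock_fields(const struct toy_held *h)
{
    assert(h != NULL);
    if (h->i_contents == TOY_RW_WRITER)
        return 1;
    return h->i_contents == TOY_RW_READER && h->i_tlock != 0;
}

/* Other inode fields: modelled here as needing i_contents as writer. */
int toy_may_update_other_fields(const struct toy_held *h)
{
    assert(h != NULL);
    return h->i_contents == TOY_RW_WRITER;
}
```

The reader-plus-i_tlock case is what buys the parallelism: several threads can hold i_contents as readers and take turns on the short i_tlock instead of each needing the writer lock.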

UFS Inode Queue locks
  • ufs_scan_lock(kmutex_t)
    • Synchronizes ufs_scan_inodes threads.
    • Needed because UFS has global inodes lists
  • ufs_q->uq_mutex(kmutex_t)
    • Used to protect idle queues
      • ufs_junk_iq, ufs_useful_iq These are the two inode idle queues, and as you can guess from their names, one holds still potentially useful inodes, the other holds inodes known to not contain valid data.
  • ufs_hlock
    • Used by the hlock thread. For more information see man lockfs(1M), hardlock section.
  • ih_lock
    • Protects the inode hash. The inode hash is global, per system, not per filesystem.

VNODE locks
  • v_lock(kmutex_t)
    • Protects VNODE fields.
      • Uses v_lock
      • Increments/decrements reference count on VNODE by 1.
      • Used to ensure that the VNODE/inode does not go away while in use.

VFS locks
  • vfs_lock(kmutex_t)
    • Locks contents of filesystem and cylinder groups.
    • Also protects the vfs_dio (delayed I/O) field.
      • vfs_dio is delayed io bit
      • set via ioctl _FIOSDIO
    • vfs_dqrwlock
      • Manages quota subsystem quiescence.
      • Writer held means that the UFS quota subsystem can have major changes going on:
        • Disabling quotas, enabling quotas, setting new quota limits.
      • Protects d_quot structure as well. This structure is used to keep track of all the enabled quotas per filesystem.
      • It is important to note that UFS shadow inodes which are used to hold ACL data and extended attribute directories are not counted against user quotas. Thus this lock is not held for updates to these.
      • Reader held for this lock indicates to quota subsystem that major changes should not be occurring during that time.
      • Held when the i_contents writer lock is held, as described above, indicating that changes are occurring that affect user quotas.
      • Since UFS quotas can be enabled/disabled on the fly, this lock must be taken in all appropriate situations. It is not sufficient to check if the UFS quota subsystem is enabled prior to taking the lock.
      • Protects access to the list that links together all UFS filesystem instances.
      • Lists are updated as a part of the mount operation.
      • Also allows syncing of all UFS filesystems.

UFS Inode Updates Lock Ordering
This pictorial representation of the ordering/weighting of UFS locks is intended to show 1) What each of the locks protects 2) The order in which the locks must be taken if you need to protect the fields relevant to the specific lock. This does not mean that you must always take every lock shown, simply that you must take these in the order shown in the picture based on the fields you are trying to protect.
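
One way to think about "take them in the order shown" is as a rank check: every lock gets a rank, and a thread may only acquire a lock whose rank is higher than the last one it took. The sketch below is a toy checker. The particular ranks (i_rwlock, then vfs_dqrwlock, then i_contents, then i_tlock) are an assumption inferred from the descriptions in this post, not a copy of the picture.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed ranks, lowest first; skipping a lock is always allowed. */
enum toy_lock {
    TOY_I_RWLOCK = 0,
    TOY_VFS_DQRWLOCK = 1,
    TOY_I_CONTENTS = 2,
    TOY_I_TLOCK = 3
};

/*
 * 'seq' lists locks in the order one thread acquires them, with all
 * earlier locks still held. Returns the index of the first acquisition
 * that breaks the ordering, or -1 if the whole sequence is fine.
 */
int toy_first_order_violation(const enum toy_lock *seq, int n)
{
    assert(seq != NULL && n >= 0);
    for (int i = 1; i < n; i++) {
        /* Each new lock must rank strictly above the previous one. */
        if (seq[i] <= seq[i - 1])
            return i;
    }
    return -1;
}
```

Taking i_contents and then going back for i_rwlock is flagged, which is exactly the kind of pattern that forces the drop-and-reacquire dance in the lookup example.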

Example of how locks are used:

Directory Lookups and locking
Doing a directory lookup....
dp is the directory inode in which we are searching for an entry.
    rw_enter(&dp->i_rwlock, RW_READER); <---taken to avoid races with a dirremove in the dnlc directory cache. Not needed for the standard dnlc cache.
    RW_READER is taken in this case as we want to prevent a write from coming in and changing data out from underneath us.

    If found in dnlc directory cache && "." or ".." then we have to do the following:

      /*
       * release the lock on the dir we are searching
       * to avoid a deadlock when grabbing the
       * i_contents lock in getting the allocated inode.
       */
      rw_exit(&dp->i_rwlock); <---drop this lock on the directory in which we are searching. Can deadlock with i_contents (on the directory above) in the call to the function getting the allocated inode.
      rw_enter(&dp->i_ufsvfs->vfs_dqrwlock, RW_READER);
      get allocated inode;
      /* must recheck as we dropped dp->i_rwlock */
      rw_enter(&dp->i_rwlock, RW_READER);
      Now do rechecks here to ensure that data has not changed on the dp (directory inode) during the time we dropped the lock.

    Otherwise if not "." or ".." then proceed as normal for directory lookups:
      rw_enter(&dp->i_ufsvfs->vfs_dqrwlock, RW_READER);
      get allocated inode; <---i_contents taken in here, no possible contention with the i_rwlock taken above...

      No need to recheck anything since we did not drop the i_rwlock.

The important takeaway from this example is that there are times when you must release and reacquire locks in UFS. If this is necessary, then it is important to recheck the assumptions about the data you are working on, since it could have changed during the time the lock was released and reacquired.
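
The drop, reacquire, recheck pattern can be shown with a single-threaded toy. To be clear about what is swapped in here: the change counter (gen) is a generic technique I am using to make the recheck visible, not what UFS itself does, and the toy_ lock helpers only count holders so the sketch can assert on them.

```c
#include <assert.h>
#include <stddef.h>

/* Toy directory: a reader count standing in for i_rwlock, plus a change counter. */
struct toy_dir {
    int           rwlock_readers;  /* readers currently "holding" the lock */
    unsigned long gen;             /* hypothetical counter bumped on every update */
};

static void toy_rd_enter(struct toy_dir *dp) { dp->rwlock_readers++; }

static void toy_rd_exit(struct toy_dir *dp)
{
    assert(dp->rwlock_readers > 0);
    dp->rwlock_readers--;
}

/* Simulates another thread updating the directory while we are unlocked. */
void toy_concurrent_update(struct toy_dir *dp) { dp->gen++; }

/*
 * Returns 1 if the directory was unchanged across the unlocked window,
 * 0 if the caller must redo its checks. 'while_unlocked' stands in for
 * "get allocated inode" plus anything else that runs in that window.
 */
int toy_lookup_with_recheck(struct toy_dir *dp,
    void (*while_unlocked)(struct toy_dir *))
{
    assert(dp != NULL);
    toy_rd_enter(dp);
    unsigned long seen = dp->gen;   /* remember what we based decisions on */
    toy_rd_exit(dp);                /* drop the lock to avoid the deadlock */

    if (while_unlocked != NULL)
        while_unlocked(dp);

    toy_rd_enter(dp);               /* reacquire ... */
    int unchanged = (dp->gen == seen);  /* ... and recheck */
    toy_rd_exit(dp);
    return unchanged;
}
```

Whatever mechanism is used for the recheck, the shape is the same: anything learned before the lock was dropped is only a hint once the lock is taken again.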

There are many more locks used in UFS. This blog covers only the portion that I felt made a good introduction to UFS locking. Perhaps I will expand on this topic in a future blog.

Tuesday Apr 05, 2005

Inquiring Minds Want To Know: UFS

UFS in Solaris 10
I spent the better part of the last year getting to know UFS. I think we are on a first name basis now :-). Thus, I begin my blog debut with some interesting UFS bugs and how they were fixed.

UFS had many improvements integrated into Solaris 10 and Solaris 9 9/04: bug fixes, logging on by default, and general robustness improvements. In this post I will talk about three specific bug fixes which affect the UFS tunable maxcontig and therefore aspects of UFS performance.

4639871 Logging ufs fails to boot from ATA drive on Ultra-10 if maxphys is too large
4638166 Ultra 5/10 panics with simba and pci errors if logging enabled and maxphys > 1MB
4349828 Inconsiderate tuning of maxcontig causes scsi bus to hang

As a result of these bugs, UFS in Solaris 10 and Solaris 9 9/04 was modified to change the values that could be used to set maxcontig and subsequently the value used for the maximum transfer size when I/O was issued.

Previously, an inconsiderate value set for either maxcontig or maxphys (in /etc/system) could hang the system. This was because the filesystem I/O request size was calculated using the value set for maxcontig. The maximum transfer size of the underlying device was never considered when calculating the size of the I/O transfer in UFS.

In UFS, the filesystem cluster size, for both reads and writes, is set to the value set for maxcontig. The filesystem cluster size is used to determine:

  • The maximum number of logical blocks contiguously laid out on disk for a UFS filesystem before inserting a rotational delay.
  • When, and how much, to read ahead and/or write behind if the sequential I/O case is found. The algorithm that determines sequential read ahead in UFS is broken, so system administrators use the maxcontig value to tune their filesystems to achieve better random I/O performance.
  • The UFS filesystem cluster size also indicates how many pages to attempt to push out to disk at a time. It also determines the frequency of pushing pages because in UFS pages are clustered for writes, based on the filesystem cluster size.
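
For a feel of the numbers involved, here is a small sketch that turns a maxcontig setting into a cluster size in bytes and a page count per push. Multiplying maxcontig by the filesystem block size is my assumption about how the block count maps to bytes, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cluster size in bytes, assuming maxcontig counts filesystem blocks. */
uint64_t toy_cluster_bytes(uint32_t maxcontig, uint32_t fs_bsize)
{
    assert(maxcontig > 0 && fs_bsize > 0);
    return (uint64_t)maxcontig * fs_bsize;
}

/* Pages needed to cover one cluster, rounding a partial page up. */
uint64_t toy_pages_per_push(uint64_t cluster_bytes, uint32_t pagesize)
{
    assert(pagesize > 0);
    return (cluster_bytes + pagesize - 1) / pagesize;
}
```

Under these assumptions, maxcontig of 16 with 8K blocks gives a 128K cluster, which is 16 pages at an 8K page size or 32 pages at 4K.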

How These Bugs Were Fixed:
1) The UFS filesystem cluster size (maxcontig) and I/O transfer size were separated, removing the dependency that was causing systems to hang. UFS will no longer allow a setting of maxcontig to interrupt or hang any I/O requests to the device. UFS will always issue I/O requests that are <= the maximum transfer size of the device hosting the filesystem.

The UFS filesystem cluster size is still set using the value indicated for maxcontig. The I/O transfer size will be set in UFS as shown below.

2) The value for rotational delay (gap in mkfs(1M), -d in tunefs(1M)) no longer makes sense. Today's devices are very sophisticated and do not need a delay artificially built in via software. As noted above, the value of maxcontig determines the number of contiguous blocks placed on disk before space is inserted to account for rotational delay. The rotational delay value has been obsoleted in Solaris 10 and Solaris 9 9/04 and now defaults to 0, ensuring contiguous allocation.

Transfer size of I/O requests in UFS:
The device that hosts the filesystem is queried for the maximum transfer size it can handle, and the UFS I/O transfer size defaults to this value if it is obtainable. If the device does not support reporting its maximum transfer size, the maximum transfer size is set using:

  • min(maxphys, ufs_maxmaxphys).

  • ufs_maxmaxphys is currently set to 1MB.

If, however the user sets the value of maxcontig to be less than the maximum device transfer size, UFS will honor the value of maxcontig as the maximum value for data transfers on this device.
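
The selection rules above fit in one small function. This is a sketch of the rules as described in this post, not the UFS source: the function name is hypothetical, a dev_max of 0 is my convention for "the device could not report a size", and passing maxcontig already converted to bytes is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* ufs_maxmaxphys as stated in the text: 1MB. */
#define TOY_UFS_MAXMAXPHYS (1024u * 1024u)

/*
 * dev_max:         device's maximum transfer size in bytes, 0 if unknown
 * maxphys:         system maxphys setting in bytes
 * maxcontig_bytes: maxcontig expressed in bytes, 0 if not set by the user
 */
uint64_t toy_io_xfer_size(uint64_t dev_max, uint64_t maxphys,
    uint64_t maxcontig_bytes)
{
    assert(maxphys > 0);
    uint64_t limit;

    if (dev_max != 0) {
        /* Prefer what the device says it can handle. */
        limit = dev_max;
    } else {
        /* Fall back to min(maxphys, ufs_maxmaxphys). */
        limit = (maxphys < TOY_UFS_MAXMAXPHYS) ? maxphys : TOY_UFS_MAXMAXPHYS;
    }

    /* A smaller user-set maxcontig is honored as the ceiling. */
    if (maxcontig_bytes != 0 && maxcontig_bytes < limit)
        limit = maxcontig_bytes;

    return limit;
}
```

The key property is that maxcontig can only lower the transfer size, never push it above what the device (or the fallback) allows, which is what removes the old hang.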

The default value is determined from the disk drive's maximum transfer size as noted above. Any positive integer value is acceptable when setting this parameter, via tunefs(1M) or mkfs(1M).



