Inside pfiles with pathnames

I'm finally back from vacation, and I'm here to help out with Adam's Top 11-20 Solaris Features. I'll be going into some details regarding one of the features I integrated into Solaris 10, pfiles with pathnames (which was edged out by libumem for the #11 spot by a lean at the finish line). This will be a technical discussion; for a good overview of why it's so useful, see my previous entry.

There were several motivations for this project:

  1. Provide path information for MDB and DTrace.
  2. Make pathname information available for pfiles(1).
  3. Improve the performance of getcwd(3C).

First of all, we needed to record the information in the kernel somewhere. In Solaris, we have what's known as the Virtual File System (VFS) layer. This is an abstract interface, where each file system fills in the implementation details so that no other consumer has to know them. Each file is represented by a vnode, which can be thought of as a superclass if you're familiar with inheritance. The end result is that we can open a UFS file in the same way we open a /proc file, and the only one who knows the difference is the underlying filesystem. We can also change things at the VFS layer and not have to worry about each individual filesystem.

To address concerns over performance and the difficulty of bookkeeping, it was necessary to adjust the constraints of the problem appropriately. It is extremely difficult, if not impossible, to ensure that the path is always correct (consider hard links, unlinked files, and directory restructuring). To make the problem easier, we make no claim that the path is currently correct, only that it was correct at one time. Whenever we translate from a path to a vnode (known as a lookup) for the first time, we store the path information within the vnode. The performance hit is negligible (a memory allocation and a few string copies) and it only occurs when first looking up the vnode. We must be prepared for situations where no pathname is available, as some files have no meaningful path (sockets, for example).
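The "correct at one time" rule can be illustrated with a toy model. This is only a sketch in Python, not the kernel code: the `Vnode` and `ToyNamespace` classes and the `cached_path` field are hypothetical names invented for the example, and a plain dictionary stands in for the real filesystem lookup.

```python
class Vnode:
    """Toy stand-in for a kernel vnode; cached_path is a hypothetical field."""

    def __init__(self):
        # Stays None until the first successful lookup. Objects that are
        # never reached through a path (sockets, for example) keep None.
        self.cached_path = None


class ToyNamespace:
    """A dictionary pretending to be a filesystem namespace."""

    def __init__(self):
        self.entries = {}  # path -> Vnode

    def link(self, path, vnode):
        # Several paths may name the same vnode (hard links).
        self.entries[path] = vnode

    def rename(self, old_path, new_path):
        # Moving the name does not touch the path stored in the vnode.
        self.entries[new_path] = self.entries.pop(old_path)

    def lookup(self, path):
        vnode = self.entries[path]
        if vnode.cached_path is None:
            # Only the first lookup pays for the copy.
            vnode.cached_path = path
        return vnode
```

In this model, looking up a hard-linked file through a second name, or renaming it afterwards, leaves the stored path untouched. The stored path was valid when it was recorded, and that is the only guarantee, so any consumer has to treat it as a hint and be ready for `None`.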

With the magic of CTF, MDB and DTrace need no modification. Crash dumps now have pathnames for every open file, and with a little translator magic we end up with a stable DTrace interface like the io provider. We also use this to improve getcwd performance. Normally, we would have to look up "..", iterate over each entry until we find the matching vnode, record the entry name, lather, rinse, repeat. Now, we take a first stab at it by doing a forward lookup of the cached pathname, and if it's the same vnode, we simply return the pathname. getcwd has very stringent correctness requirements, so we have to fall back to the old method when our shortcut fails.
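The getcwd shortcut described above can be approximated from userland. The following is a rough sketch, not the kernel implementation: `getcwd_with_hint` is a made-up name, comparing `(st_dev, st_ino)` is used here as a stand-in for "it's the same vnode", and `os.getcwd` stands in for the slow walk through "..".

```python
import os


def getcwd_with_hint(cached_path, slow_getcwd=os.getcwd):
    """Return the working directory, trying a cached path first.

    cached_path is a hint that was correct at one time, or None.
    slow_getcwd is whatever authoritative method we fall back to.
    """
    if cached_path is not None:
        try:
            hint = os.stat(cached_path)  # forward lookup of the cached path
            here = os.stat(".")          # the directory we are really in
        except OSError:
            hint = None                  # hint is stale or inaccessible
        if hint is not None and \
                (hint.st_dev, hint.st_ino) == (here.st_dev, here.st_ino):
            return cached_path           # same object, so the hint is good
    # Shortcut failed: correctness comes first, so use the old method.
    return slow_getcwd()
```

The point of the design is that a wrong hint costs one extra lookup and never a wrong answer, because the hint is only returned after it has been checked against the directory itself.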

The only remaining difficulty was exporting this information to userland for programs like pfiles to use. For those of you familiar with /proc, this is exactly the type of problem it was designed to solve. We added symbolic links in /proc/<pid>/path for the current working directory, the root directory, each open file descriptor, and each object mapped in the address space. This allows you to run ls -l in the directory and see the pathname for each file. More importantly, the modifications to pfiles become trivial. The only tricky part is security. Because a vnode can have only one name, and there can be hard links to files or changing permissions, it's possible for the user to be unable to access the path as it was originally saved. To avoid this, we do the equivalent of a resolvepath(2) in the kernel, and reject any paths which cannot be accessed or do not map to the same vnode. The end result of this is that we may lose this information in some exceptional circumstances (the directory layout of a filesystem is relatively static) but as Bart is fond of reminding us: performance is a goal, correctness is a constraint.
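To make the two halves of this concrete, here is a small userland sketch. It is an illustration under assumptions, not the pfiles or kernel source: `read_path_links` and `validated_path` are hypothetical helpers, `os.path.realpath` plus `os.stat` approximates the in-kernel resolvepath step, and `(st_dev, st_ino)` again stands in for vnode identity.

```python
import os


def read_path_links(path_dir):
    """Read every symbolic link in a directory such as /proc/<pid>/path.

    Returns {entry name: link target}, with None where no path could be
    read (the consumer must cope with missing pathnames).
    """
    links = {}
    for name in sorted(os.listdir(path_dir)):
        try:
            links[name] = os.readlink(os.path.join(path_dir, name))
        except OSError:
            links[name] = None
    return links


def validated_path(cached_path, fd):
    """Return the resolved path only if it still names the object behind fd."""
    try:
        resolved = os.path.realpath(cached_path)  # resolve symlinks, "..", etc.
        by_path = os.stat(resolved)               # fails if we cannot reach it
    except OSError:
        return None
    by_fd = os.fstat(fd)
    if (by_path.st_dev, by_path.st_ino) != (by_fd.st_dev, by_fd.st_ino):
        return None                               # path now names something else
    return resolved
```

On a system that provides the directory described above, something like `read_path_links("/proc/%d/path" % os.getpid())` would be the equivalent of the ls -l mentioned earlier. `validated_path` shows why a renamed or replaced file yields no path at all rather than a misleading one.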

Comments:

Ah, so you are doing things similar to these horrible VNAME() and VPARENT() macros in FreeBSD. oh. I don't want to sound overly critical, but if the `dnlc' cache were a master of the inode cache (as it is in, say, DragonFly or Linux), the problem would not exist in the first place. And getcwd() would not require IO at all.

Posted by nikita on July 20, 2004 at 11:30 PM PDT #

I'm not familiar with the FreeBSD implementation, but I think I know what you're getting at. The Solaris DNLC is one-way: you can go from (parent, name) to (inode), but not vice-versa (at least not in any reasonable amount of time). I thought about trying to make the DNLC have a fast reverse-lookup, but avoided it for two reasons:

  1. The DNLC is an incredibly (some say overly) complex system
  2. The DNLC is only used by a few filesystems. In particular, pseudo filesystems such as tmpfs, procfs, and devfs, don't use the DNLC at all

No matter what we do, we have to validate whatever path we came up with. In the current solution, we hardly ever do I/O. Because it's a forward lookup, we assume (if the DNLC is doing its job) that every lookup is entirely in memory, either through the DNLC or a pseudo-filesystem. Only in rare cases do we actually have to go to disk to do a getcwd() call.

Posted by Eric Schrock on July 21, 2004 at 10:24 AM PDT #
