In my previous entry, I described the overall architecture of shadow migration. This post will dive into the details of how it's actually implemented, and the motivation behind some of the original design decisions.
A very early desire was that we wanted something that could migrate data from many different sources. And while ZFS is the primary filesystem for Solaris, we also wanted to allow for arbitrary local targets. For this reason, the shadow migration infrastructure is implemented entirely at the VFS (Virtual FileSystem) layer. At the kernel level, there is a new 'shadow' mountpoint option, which is the path to another filesystem on the system. The kernel has no notion of whether a source filesystem is local or remote, and doesn't differentiate between synchronous access and background migration. Any filesystem access, whether it is local or over some other protocol (CIFS, NFS, etc) will use the VFS interfaces and therefore be fed through our interposition layer.
The only other work the kernel does when mounting a shadow filesystem is check to see if the root directory is empty. If it is empty, we create a blank SUNWshadow extended attribute on the root directory. Once set, this will trigger all subsequent migration as long as the filesystem is always mounted with the 'shadow' attribute. Each VFS operation first checks to see if the filesystem is shadowing another (a quick check), and then whether the file or directory has the SUNWshadow attribute set (slightly more expensive, but cached with the vnode). If the attribute isn't present, then we fall through to the local filesystem. Otherwise, we migrate the file or directory and then fall through to the local filesystem.
In order to migrate a directory, we have to migrate all the entriest. When migrating an entry for a file, we don't want to migrate the complete contents until the file is accessed, but we do need to migrate enough metadata such that access control can be enforced. We start by opening the directory on the remote side whose relative path is indicated by the SUNWshadow attribute. For each directory entry, we create a sparse file with the appropriate ownership, ACLs, system attributes and extended attributes.
Once the entry has been migrated, we then set a SUNWshadow attribute that is the same as the parent but with "/
Once the diretory has been completely migrated, we remove the SUNWshadow attribute so that future accesses all use the native filesystem. If the process is interrupted (system reset, etc), then the attribute will still be on the parent directory so we will migrate it again when the user (or background process) tries to access it.
Migrating a plain file is much simpler. We use the SUNWshadow attribute to locate the file on the source, and then read the source file and write the corresponding data to the local filesystem. In the current software version, this happens all at once, meaning the first access of a large file will have to wait for the entire file to be migrated. Future software versions will remove this limitation and migrate only enough data to satisfy the request, as well as allowing concurrent accesses to the file. Once all the data is migrated, we remove the SUNWshadow attribute and future accesses will go straight to the local filesystem.
One issue we knew we'd have to deal with is hard links. Migrating a hard link requires that the same file reference appear in multiple locations within the filesystem. Obviously, we do not know every reference to a file in the source filesystem, so we need to migrate these links as we discover them. To do this, we have a special directory in the root of the filesystem where we can create files named by their source FID. A FID (file ID) is a unique identifier for the file within the filesystem. We create the file in this hard link directory with a name derived from its FID. Each time we encounter a file with a link count greater than 1, we lookup the source FID in our special directory. If it exists, we create a link to the directory entry instead of migrating a new instance of the file. This way, it doesn't matter if files are moved around, removed from the local filesystem, or additional links created. We can always recreate a link to the original file. The one wrinkle is that we can migrate from nested source filesystems, so we also need to track the FSID (filesystem ID) which, while not persistent, can be stored in a table and reconstructed using source path information.
A key feature of the shadow migration design is that it treats all accesses the same, and allows background migration of data to be driven from userland, where it's easier to control policy. The downside is that we need the ability to know when we have finished migrating every single file and directory on the source. Because the local filesystem is actively being modified while we are traversing, it's impossible to know whether you've visited every object based only on walking the directory hierarchy. To address this, we keep track of a "pending" list of files and directories with shadow attributes. Every object with a shadow attribute must be present in this list, though this list can contain objects without shadow attributes, or non-existant objects. This allows us to be synchronous when it counts (appending entries) and lazy when it doesn't (rewriting file with entries removed). Most of the time, we'll find all the objects during traversal, and the pending list will contain no entries at all. In the case we missed an object, we can issue an ioctl() to the kernel to do the work for us. When that list is empty we know that we are finished and can remove the shadow setting.
The ability to specify the shadow mount option for arbitrary filesystems is useful, but is also difficult to manage. It must be specified as an absolute path, meaning that the remote source of the mount must be tracked elsewhere, and has to be mounted before the filesystem itself. To make this easier, a new 'shadow' property was added for ZFS filesystems. This can be set using an abstract URI syntax ("nfs://host/path"), and libzfs will take care of automatically mounting the shadow filesystem and passing the correct absolute path to the kernel. This way, the user can manage a semantically meaningful relationship without worrying about how the internal mechanisms are connected. It also allows us to expand the set of possible sources in the future in a compatible fashion.
Hopefully this provides a reasonable view into how exactly shadow migration works, and the design decisions behind it. The goal is to eventually have this available in Solaris, at which point all the gritty details should be available to the curious.