Saturday Nov 01, 2008

ZFS Crypto update

It has been a little while since I gave an update on the status of the ZFS Crypto project. A lot has happened recently and I've been really "heads down" writing code.

We had believed we were development complete and had even started code review in preparation for integration. All our new encryption tests passed and I only had one or two small regression test issues to reconfirm/resolve. However a set of changes (and very good and important ones for performance, I might add) was integrated into the onnv_100 build that caused the ZFS Crypto project some serious remerge work. It took me just over two weeks to get almost back to where we were. In doing that I discovered that the code I had written for ZFS Intent Log (ZIL) encryption/decryption wasn't going to work.

The ZIL encryption was "safe" from a crypto viewpoint because all the data written to the ZIL was encrypted. However, because of how the ZIL is claimed and replayed after a system failure (the only time the ZIL is actually read, since it is mostly write-only), the claim happened before the pool had any encryption keys available to it. This resulted in the ZFS code thinking that no dataset had a ZIL needing to be claimed, so ZFS stayed consistent on disk, but anything in the ZIL that hadn't been committed in a transaction was lost - equivalent to running without a ZIL. OOPS!

So I had to redesign how the ZIL encryption/decryption works. The ZIL is dealt with in two main stages: first there is the zil_claim, which happens during pool import (before the pool starts doing any new transactions and long before userland is up and running). The second stage happens much, much later when the filesystems (datasets really, because the ZIL applies to ZVOLs too) are mounted - this is good because we have crypto keys available by then.

This meant I needed to really learn how the ZIL gets written to disk. The ZIL has 0 or more log blocks, with each log block holding 1 or more log records. There is a different type of log record (each of a different size) for the different types of sync operations that can come through the ZIL. Each log record has a common part at the start that says how big it is - which is good from a crypto view. So I left the common part in the clear and encrypted the rest of the log record. This is needed so that the claim can "walk" all the log records in all the log blocks, sanity check the ZIL and do the claim IO. So far this is similar to what was done for the DNODEs. Unfortunately it wasn't quite that simple (it never is, really!).
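The claim-time walk that the clear common header enables can be sketched roughly as follows. This is a minimal Python sketch; the header fields and sizes are illustrative stand-ins, not the real on-disk lr_t layout:

```python
import struct

# Illustrative clear-text common header: (record type, total record length).
# Field names and sizes are made up for this sketch, not the real ZFS lr_t.
LR_COMMON = struct.Struct('<QQ')

def walk_log_records(block: bytes):
    """Yield (txtype, record_bytes) for each log record in a log block.

    Only the common header needs to be readable, so this walk works at
    claim time even though the rest of each record is still ciphertext."""
    off = 0
    while off + LR_COMMON.size <= len(block):
        txtype, reclen = LR_COMMON.unpack_from(block, off)
        if reclen < LR_COMMON.size or off + reclen > len(block):
            break  # malformed length or end of valid records
        yield txtype, block[off:off + reclen]
        off += reclen

# Toy block: two records whose payloads could just as well be ciphertext.
rec1 = LR_COMMON.pack(9, LR_COMMON.size + 8) + b'payload1'
rec2 = LR_COMMON.pack(4, LR_COMMON.size + 4) + b'data'
recs = list(walk_log_records(rec1 + rec2))
```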

All the log records other than the TX_WRITE type fell into the nice simple case described above. The TX_WRITE records are "funky" and, from a crypto viewpoint, a real pain in how they are structured. Like all the other log records there is a common section at the start that says what type it is and how big the log record is. What is different about the TX_WRITE log records is that they have a blkptr_t embedded in them. That blkptr_t needs to be in the clear because it may point off to where the data really is (especially if it is a "big" write). The blkptr_t is at the end of the log record. So no problem: just leave the common part at the start and the blkptr_t at the end in the clear, right? Well, that only deals with some of the cases. There is another TX_WRITE case, where the log record has the actual write data tacked on after the blkptr_t inside the log record, and this is of variable size. Even in this case there is a blkptr_t embedded inside the log record; the problem is that the blkptr_t is now "inside" the area we want to encrypt. So I ended up with a clear-text log common area, encrypted TX_WRITE content, a clear-text blkptr_t, and maybe encrypted data. Good job the OpenSolaris Cryptographic Framework supports passing in a uio_t for scatter/gather!
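The resulting split into clear and encrypted segments might look roughly like this. A Python sketch with assumed offsets (COMMON_LEN is made up; BLKPTR_LEN matches the 128-byte blkptr_t), modelling the segment list a uio-style scatter/gather crypto call would be handed:

```python
COMMON_LEN = 16    # clear-text common header (illustrative size)
BLKPTR_LEN = 128   # sizeof(blkptr_t) in ZFS

def txwrite_segments(record: bytes, data_len: int = 0):
    """Split a TX_WRITE-style record into (encrypt?, bytes) segments in
    on-disk order, ready for a scatter/gather (uio-style) crypto call.

    data_len > 0 models the case where the write data is embedded in the
    record after the blkptr_t."""
    end = len(record) - data_len
    segs = [
        (False, record[:COMMON_LEN]),                  # clear: common header
        (True,  record[COMMON_LEN:end - BLKPTR_LEN]),  # encrypt: record body
        (False, record[end - BLKPTR_LEN:end]),         # clear: blkptr_t
    ]
    if data_len:
        segs.append((True, record[end:]))              # encrypt: inline data
    return segs
```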

So with all that done, it worked, right? Sadly no, there was still more to do. It turns out I still had two other special ZIL cases to resolve - not with the log records though. Remember I said there are multiple log records in a log block? Well, at the end of the log block there is a special trailer record that says how big the sum of all the log records is (a good sanity check!), but this also has an embedded checksum for the log block. This is quite unlike how checksums are normally done in ZFS: normally the checksum is in the blkptr_t, not with the data. The reason it is done this way for the ZIL is write performance, and the risk is acceptable because the ZIL is very rarely ever read, and only then in a recovery situation. For ZFS Crypto we not only encrypt the data but also have a cryptographically strong keyed authentication tag (the AES-CCM MAC) that is stored in 16 bytes of the blkptr_t checksum field. The problem with the ZIL log blocks is that we don't have a blkptr_t we can put that 16-byte MAC into, because the checksum for the log block is inside the log block in the trailer record, and the blkptr_t checksum for the ZIL is used for record sequencing. So were there any reserved or padding fields? Luckily yes, but there were only 8 bytes available, not 16. That is enough though, since I could just change the params for CCM mode to output an 8-byte MAC instead of a 16-byte one. A little less security but still plenty sufficient - especially since we expect to never need to read the ZIL back - and it is still cryptographically sound and strong enough for the size of the data blocks we are encrypting (less than 4k at a time). So with that resolved (this took a lot of time in DTrace and MDB to work out!) I could decrypt the ZIL records after a (simulated) failure.
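The truncated-tag idea can be illustrated with a short sketch. HMAC-SHA256 is used here purely as a stand-in keyed MAC (the real code gets its tag directly from AES-CCM, whose tag length is a parameter); the point is only that an 8-byte tag fits the reserved trailer space:

```python
import hashlib
import hmac

MAC_LEN = 8  # only 8 bytes of reserved space in the log-block trailer

def log_block_mac(key: bytes, block: bytes) -> bytes:
    """Keyed MAC for a ZIL log block, truncated to the 8 bytes that fit.

    HMAC-SHA256 is a stand-in for this sketch; the real implementation
    asks AES-CCM for an 8-byte tag instead of the usual 16."""
    return hmac.new(key, block, hashlib.sha256).digest()[:MAC_LEN]
```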

Still one last problem to go though: the records weren't decrypting properly. I verified, using DTrace and kmdb, that the data was correct and the CCM MAC was correct. So what was wrong, why wouldn't they decrypt? That only left the key and the nonce. Verifying the key was correct was easy, and it was. So what was wrong with the nonce?

We don't actually store the nonce on disk for ZFS Crypto; instead we calculate it from other things that are stored on disk. The nonce for a normal block is made up from the block birth transaction (txg: a monotonically increasing unsigned 64-bit integer), the object, the level, and the blkid. Those are manipulated (via a truncated SHA-256 hash) into 12 bytes and used as the nonce. For a ZIL write the txg is always 0 because it isn't being written in a txg (that is the whole point of it!), but the blkid is actually the ZIL record sequence number, which has the same properties as the txg. The problem is that when we replay the ZIL (not when we claim it though) we do have a txg number. This meant we had a different nonce on decrypt than on encrypt. The solution? Remove the txg from the nonce for ZIL records - no big loss there, since on a ZIL write it is 0 anyway and the blkid (the ZIL sequence number) has the security properties we want to keep AES-CCM safe.
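The two nonce derivations can be sketched like this. The field packing and ordering are assumptions made for the sketch; the real code differs in detail, but the shape (truncated SHA-256 over identifying fields, with txg dropped for ZIL records) is the point:

```python
import hashlib
import struct

NONCE_LEN = 12  # 96-bit nonce for AES-CCM

def block_nonce(txg: int, obj: int, level: int, blkid: int) -> bytes:
    """Nonce for a normal block: truncated SHA-256 over its identity.
    The little-endian packing here is an assumption for the sketch."""
    h = hashlib.sha256(struct.pack('<QQQQ', txg, obj, level, blkid))
    return h.digest()[:NONCE_LEN]

def zil_nonce(obj: int, level: int, blkid: int) -> bytes:
    """ZIL variant: txg omitted, since it is 0 at write time but nonzero
    at replay; blkid is the ZIL sequence number, which is still unique
    and monotonic, keeping the nonce safe for CCM."""
    h = hashlib.sha256(struct.pack('<QQQ', obj, level, blkid))
    return h.digest()[:NONCE_LEN]
```

With the txg excluded, the claim/replay path computes the same nonce the write path did, which is exactly the fix described above.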

With all that done I managed to get the ZIL test suite to pass. I have a couple of minor follow-on issues to resolve so that zdb doesn't get its knickers in a twist (SEGV really) when it tries to display encrypted log blocks (which it can't decrypt, since it is running in userland without the keys available).

That turned out to be more than I expected to write up. So where are we with the schedule? I expect us to start code review again shortly. I've just completed the resync to build 102.

Friday Sep 05, 2008

ZFS Crypto Codereview starts today

Preliminary codereview for the OpenSolaris ZFS Crypto project starts today (Friday 5th September 2008 at 1200 US/Pacific) and is scheduled to end on Friday 3rd October 2008 at 2359 US/Pacific. Comments received after this time will still be considered, but unless they are serious in nature (data corruption, a security issue, or a regression from existing ZFS) they may have to wait until after onnv-gate integration to be addressed; however, every comment will be looked at and assessed on its own merit.

For the rest of the pointers to the review materials and how to send comments see the project codereview page.

Thursday Aug 14, 2008

Making files on ZFS Immutable (even by root!)

First let's look at the normal POSIX file permissions and show who we are and what privileges our shell is running with:

# ls -l /tank/fs/hamlet.txt 
-rw-rw-rw-   1 root     root      211179 Aug 14 13:00 /tank/fs/hamlet.txt

# pcred $$
100618: e/r/suid=0  e/r/sgid=0
        groups: 0 1 2 3 4 5 6 7 8 9 12

# ppriv $$
100618: -zsh
flags = 
        E: all
        I: all
        P: all
        L: all

So we are running as root, have all privileges in our process, and are passing all of them on to our children. We also own the file (and it is on a local ZFS filesystem, not over NFS), and it is writable by us and our group - everyone, in fact. So let's try to modify it:

# echo "SCRIBBLE" > /tank/fs/hamlet.txt 
zsh: not owner: /tank/fs/hamlet.txt

That didn't work. Let's try to delete it instead, but first check the permissions of the containing directory:

# ls -ld /tank/fs
drwxr-xr-x   2 root     root           3 Aug 14 13:00 /tank/fs

# rm /tank/fs/hamlet.txt
rm: /tank/fs/hamlet.txt: override protection 666 (yes/no)? y
rm: /tank/fs/hamlet.txt not removed: Not owner

That is very strange, so what is going on here?

Before I started this I made the file immutable. That means that regardless of what privileges(5) the process has and what POSIX permissions or NFSv4/ZFS ACL the file has, we can't delete it or change it, nor can we even change the POSIX permissions or the ACL. So how did we do that? With our good old friend chmod:

# chmod S+ci /tank/fs/hamlet.txt
Or more verbosely:
# chmod S+v immutable /tank/fs/hamlet.txt

See chmod(1) for more details. For those of you running the OpenSolaris 2008.05 release, you need to change the default PATH to have /usr/bin in front of /usr/gnu/bin, or use the full path /usr/bin/chmod. This is because these extensions are only part of the OpenSolaris chmod command, not the GNU version. The same applies to my previous posting on the extended output from ls.

Heaps of info available on files via good old ls(1) [ But not encryption status ]

In "compact" form:

ls -V@ -/c -% all /tank/fs/hamlet.txt
-rw-r--r--+  1 root     root      211179 Aug 14 12:20 /tank/fs/hamlet.txt
                timestamp: atime         Aug 14 12:37:37 2008 
                timestamp: ctime         Aug 14 12:32:58 2008 
                timestamp: mtime         Aug 14 12:20:08 2008 
                timestamp: crtime        Aug 14 12:19:41 2008 

In verbose form:

ls -v@ -/v -% all /tank/fs/hamlet.txt
-rw-r--r--+  1 root     root      211179 Aug 14 12:20 /tank/fs/hamlet.txt
                timestamp: atime         Aug 14 12:21:12 2008 
                timestamp: ctime         Aug 14 12:32:58 2008 
                timestamp: mtime         Aug 14 12:20:08 2008 
                timestamp: crtime        Aug 14 12:19:41 2008 

One interesting thing it doesn't tell me about this file is that all of that information is encrypted on disk. For that I have to use zfs(1):

# zfs get encryption tank/fs
tank/fs  encryption  on           local

Or a little more verbosely:

# zfs list -r -o name,encryption,keyscope,keystatus,mounted tank 
NAME             ENCRYPTION  KEYSCOPE  KEYSTATUS  MOUNTED
tank             off         pool      undefined  yes
tank/fs          on          pool      available  yes

I wonder if it is worth having the verbose ls(1) output indicate that the file was encrypted on "disk" by the filesystem.

What would people do with that info if they had it? If you have any ideas, let me know.

Thursday Oct 04, 2007

ZFS Crypto Alpha Release

ZFS Crypto (Phase 1) Alpha Release binaries are now available. At the moment this is x86/x64 only; I'm debugging a very strange (non-crypto) problem with the SPARC binaries and will make them available when I can.

Monday Jul 02, 2007

ZFS Crypto Design Review

The design review for phase one of the OpenSolaris ZFS Crypto Project starts now, details on how to participate are here.


Wednesday May 02, 2007

ZFS under GPLv2 already exists - no kidding!

I'm getting really fed up with the constant ranting on all sides about what Sun should do about the license on the ZFS code so that Linux can use it. Apparently Sun is the bad guy because ZFS is under the CDDL and not the GPLv2, and we are purposely doing that so Linux does not get ZFS. Personally I don't agree, but to each their own opinion - and licensing is worse than religion in open software development.

There is already a port to FreeBSD, and rumours abound that it is in a future release of MacOS; without the CDDL those might not have happened.

There is also a port of ZFS to FUSE, which means Linux users can use it that way. Performance won't be great with FUSE, but it is probably acceptable. FUSE is a great tool and I can't wait until the Solaris port is ready - because then Solaris can read Linux ext-based filesystems that way!

Now about that headline: yes, I really did say that ZFS code is already available under the GPLv2. I will be completely honest though and make it clear that it isn't all of the ZFS source. It is, however, a sufficient amount to be able to boot an OpenSolaris-based system from GRUB; that means support for mirroring, checksums and compression is there, but raidz isn't, nor are the userland commands. It is possible that this might be enough to get someone started. Still don't believe me? Check out the updated GRUB source, specifically all the files with zfs in their name - every single one of them under the GPLv2 or later.

Update: While I appreciate some of the comments posted, I'm not going to let my blog be a place to post other people's opinions on CDDL vs GPL. So I've deleted some comments; if that annoys you because I deleted your comment, tough luck - this is my blog and my policy and that's how it is. Comments are now closed.


Darren Moffat-Oracle