ZFS Crypto update
By DarrenMoffat on Nov 01, 2008
It has been a little while since I gave an update on the status of the ZFS Crypto project. A lot has happened recently and I've been really "heads down" writing code.
We had believed we were development complete and had even started code review ready for integration. All our new encryption tests passed and I only had one or two small regression test issues to reconfirm/resolve. However, a set of changes (very good and important ones for performance, I might add) were integrated into the onnv_100 build that caused the ZFS Crypto project some serious remerge work. It took me just over 2 weeks to get almost back to where we were. In doing that I discovered that the code I had written for the ZFS Intent Log (the ZIL) encryption/decryption wasn't going to work.
The ZIL encryption was "safe" from a crypto viewpoint because all the data written to the ZIL was encrypted. However, because of how the ZIL is claimed and replayed after a system failure (the only time the ZIL is actually read, since it is mostly write only), the claim happened before the pool had any encryption keys available to it. This resulted in the ZFS code just thinking that none of the datasets had a ZIL that needed claiming, so ZFS stayed consistent on disk but anything that was in the ZIL that hadn't been committed in a transaction was lost - so it was equivalent to running without a ZIL. OOPS!
So I had to redesign how the ZIL encryption/decryption works. The ZIL is dealt with in two main stages. First there is the zil_claim, which happens during pool import (before the pool starts doing any new transactions and long before userland is up and running). The second stage, the replay, happens much later when the filesystems (datasets really, because the ZIL applies to ZVOLs too) are mounted - this is good because we have crypto keys available by then.
This meant I needed to really learn how the ZIL gets written to disk. The ZIL has 0 or more log blocks, with each log block having 1 or more log records in it. There is a different type of log record (each of a different size) for each of the different types of sync operations that can come through the ZIL. Each log record has a common part at the start of it that says how big it is - this is good from a crypto view. So I left the common part in the clear and I encrypt the rest of the log record. This is needed so that the claim can "walk" all the log records in all the log blocks, sanity check the ZIL and do the claim IO. So far this is similar to what was done for the DNODES. Unfortunately it wasn't quite that simple (it never is really!).
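To make the "walk" concrete, here is a minimal sketch of why a clear text common part is enough for the claim to step through a log block without any keys. The header layout used here (a record type and a total record length, both 64-bit) is a simplified assumption for illustration and not the real on-disk format; make_record and walk_records are hypothetical names, and the payload bytes simply stand in for encrypted content.

```python
import struct

# Hypothetical, simplified common header: record type and total record
# length, both unsigned 64-bit little-endian. Not the real on-disk layout.
HEADER = struct.Struct("<QQ")


def make_record(rec_type, opaque_payload):
    """Build a record: clear text header followed by an opaque payload."""
    reclen = HEADER.size + len(opaque_payload)
    return HEADER.pack(rec_type, reclen) + opaque_payload


def walk_records(log_block):
    """Yield (type, offset, length) per record, reading only clear headers."""
    offset = 0
    while offset < len(log_block):
        if len(log_block) - offset < HEADER.size:
            raise ValueError("truncated record header")
        rec_type, reclen = HEADER.unpack_from(log_block, offset)
        # Sanity check the length before trusting it to find the next record.
        if reclen < HEADER.size or offset + reclen > len(log_block):
            raise ValueError("bad record length")
        yield rec_type, offset, reclen
        offset += reclen


# Two records whose payloads stand in for ciphertext the walker cannot read.
block = make_record(1, b"\xaa" * 40) + make_record(9, b"\xbb" * 24)
records = list(walk_records(block))
```

The walker never touches the payload bytes, which is the property the claim stage relies on: it only needs the type and length to hop from one record to the next.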
All the log records other than the TX_WRITE type fell into the nice simple case described above. The TX_WRITE records are "funky" and, from a crypto viewpoint, a real pain in how they are structured. Like all the other log records there is a common section at the start that says what type it is and how big the log record is. What is different about the TX_WRITE log records though is that they have a blkptr_t embedded in them. That blkptr_t needs to be in the clear because it may point off to where the data really is (especially if it is a "big" write). The blkptr_t is at the end of the log record. So no problem, just leave the common part at the start and the blkptr_t at the end in the clear, right? Well, that only deals with some of the cases. There is another TX_WRITE case, where the log record has the actual write data tagged on after the blkptr_t inside the log record, and this is of variable size. Even in this case there is a blkptr_t embedded inside the log record. The problem is that the blkptr_t is now "inside" the area we want to encrypt. So I ended up having to have a clear text log common area, encrypted TX_WRITE content, clear text blkptr_t and maybe encrypted data. Good job the OpenSolaris Cryptographic Framework supports passing in a uio_t for scatter gather!
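The four-part TX_WRITE layout can be sketched as a list of segments, with the encrypted ones gathered into a single buffer for the cipher and scattered back afterwards, which is roughly the job a uio_t does in the kernel. This is only an illustration: the header and fixed-field sizes are made-up parameters, the function names are hypothetical, and the only real constant is the 128-byte size of a blkptr_t.

```python
BLKPTR_SIZE = 128  # sizeof(blkptr_t) in ZFS


def tx_write_segments(reclen, header_len, fixed_len):
    """Return (offset, length, encrypted) tuples for one TX_WRITE record.

    header_len and fixed_len are illustrative sizes for the common part and
    the TX_WRITE fields that precede the blkptr_t, not the real values.
    """
    bp_off = header_len + fixed_len
    data_off = bp_off + BLKPTR_SIZE
    if reclen < data_off:
        raise ValueError("record too short to hold a blkptr_t")
    segs = [
        (0, header_len, False),          # common part: clear text
        (header_len, fixed_len, True),   # TX_WRITE content: encrypted
        (bp_off, BLKPTR_SIZE, False),    # blkptr_t: clear text
    ]
    if reclen > data_off:                # optional trailing write data
        segs.append((data_off, reclen - data_off, True))
    return segs


def gather(record, segs):
    """Concatenate only the regions that should pass through the cipher."""
    return b"".join(record[o:o + n] for o, n, enc in segs if enc)


def scatter(record, segs, processed):
    """Write processed bytes back into the encrypted regions only."""
    out = bytearray(record)
    pos = 0
    for o, n, enc in segs:
        if enc:
            out[o:o + n] = processed[pos:pos + n]
            pos += n
    return bytes(out)


# A record with 32 bytes of write data tagged on after the blkptr_t.
segments = tx_write_segments(reclen=216, header_len=16, fixed_len=40)
```

For this example record, 72 of the 216 bytes (40 bytes of TX_WRITE content plus 32 bytes of data) would go through the cipher, while the common part and the blkptr_t stay readable for the claim.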
So with all that done, it worked, right? Sadly no, there was still more to do. It turns out I still had two other special ZIL cases to resolve. Not with the log records though. Remember I said there are multiple log records in a log block? Well, at the end of the log block there is a special trailer record that says how big the sum of all the log records is (a good sanity checker!), but this also has an embedded checksum for the log block in it. This is quite unlike how checksums are normally done in ZFS: normally the checksum is in the blkptr_t, not with the data. The reason it is done this way for the ZIL is write performance, and the risk is acceptable because the ZIL is very rarely read, and only then in a recovery situation. For ZFS Crypto we not only encrypt the data but we have a cryptographically strong keyed authentication as well (the AES_CCM MAC) that is stored in 16 bytes of the blkptr_t checksum field. The problem with the ZIL log blocks is that we don't have a blkptr_t for them that we can put that 16 byte MAC into, because the checksum for the log block is inside the log block in the trailer record, and the blkptr_t checksum for the ZIL is used for record sequencing. So were there any reserved or padding fields? Luckily yes, but there were only 8 bytes available, not 16. That is enough though, since I could just change the params for CCM mode to output an 8 byte MAC instead of a 16 byte one. A little less security but still plenty sufficient - especially so since we expect to never need to read the ZIL back - and it is still cryptographically sound and strong enough for the size of the data blocks we are encrypting (less than 4k at a time). So with that resolved (this took a lot of time in DTrace and MDB to work out!) I could decrypt the ZIL records after a (simulated) failure.
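As a quick check on the "8 instead of 16" choice, the sketch below encodes the parameter rules for CCM mode as given in NIST SP 800-38C: the MAC may be 4, 6, 8, 10, 12, 14 or 16 bytes, the nonce may be 7 to 13 bytes, and the two together determine the largest payload one invocation can cover. The function name is hypothetical, and the 12 byte nonce used in the example is the nonce size this post mentions for ZFS Crypto.

```python
VALID_TAG_LENGTHS = (4, 6, 8, 10, 12, 14, 16)   # MAC sizes allowed by CCM


def ccm_params(nonce_len, tag_len):
    """Validate CCM parameters; return (length_field_bytes, max_payload)."""
    if tag_len not in VALID_TAG_LENGTHS:
        raise ValueError("CCM MAC length must be an even value from 4 to 16")
    if not 7 <= nonce_len <= 13:
        raise ValueError("CCM nonce length must be 7 to 13 bytes")
    q = 15 - nonce_len                  # bytes used to encode payload length
    max_payload = 2 ** (8 * q) - 1      # largest payload, in bytes
    return q, max_payload


# An 8 byte MAC with a 12 byte nonce is a valid CCM configuration.
q, max_payload = ccm_params(nonce_len=12, tag_len=8)
fits = 4096 <= max_payload              # a sub-4k log block fits easily
```

Both the 8 byte and the 16 byte MAC are valid settings of the same mode, so shrinking the MAC is a parameter change rather than a different algorithm; the trade-off is that a blind forgery attempt succeeds with probability of roughly 2^-64 instead of 2^-128.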
Still one last problem to go though: the records weren't decrypting properly. I verified, using dtrace and kmdb, that the data was correct and the CCM MAC was correct. So what was wrong? Why wouldn't they decrypt? That only left the key and the nonce. Verifying the key was correct was easy, and it was. So what was wrong with the nonce?
We don't actually store the nonce on disk for ZFS Crypto; instead we calculate it from other things that are stored on disk. The nonce for a normal block is made up from: the block birth transaction (txg: a monotonically increasing unsigned 64 bit integer), the object, the level, and the blkid. Those are manipulated (via a truncated SHA256 hash) into 12 bytes and used as the nonce. For a ZIL write the txg is always 0 because it isn't being written in a txg (that is the whole point of it!), but the blkid is actually the ZIL record sequence number, which has the same properties as the txg. The problem is that when we replay the ZIL (not when we claim it though) we do have a txg number. This meant we had a different nonce on the decrypt to the one used on the encrypt. The solution? Remove the txg from the nonce for ZIL records - no big loss there since on a ZIL write it is 0 anyway, and the blkid (the ZIL sequence number) has the security properties we want to keep AES CCM safe.
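The mismatch is easy to reproduce in a small sketch. The post only says the four values are turned into 12 bytes via a truncated SHA256 hash, so the packing used here (four little-endian 64-bit integers, hashed, first 12 bytes kept) is an assumption for illustration, as are the function names and the sample values.

```python
import hashlib
import struct


def block_nonce(txg, obj, level, blkid):
    """12 byte nonce for a normal block (the input packing is an assumption)."""
    material = struct.pack("<QQQQ", txg, obj, level, blkid)
    return hashlib.sha256(material).digest()[:12]


def zil_nonce(obj, level, zil_seq):
    """12 byte nonce for a ZIL record: same idea, with the txg left out."""
    material = struct.pack("<QQQ", obj, level, zil_seq)
    return hashlib.sha256(material).digest()[:12]


# With the txg included, the write (txg 0) and the replay (a real txg)
# derive different nonces, so the decrypt cannot succeed.
write_nonce = block_nonce(0, 5, 0, 42)
replay_nonce = block_nonce(1234, 5, 0, 42)
mismatch = write_nonce != replay_nonce

# With the txg removed, both sides derive the same nonce, and the ZIL
# sequence number still makes it different for every record.
same_on_replay = zil_nonce(5, 0, 42) == zil_nonce(5, 0, 42)
unique_per_record = zil_nonce(5, 0, 42) != zil_nonce(5, 0, 43)
```

The point of the sketch is only that every input to the nonce has to be identical at encrypt and decrypt time, and that dropping the txg still leaves a per-record value (the sequence number) to keep nonces from repeating under the same key.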
With all that done I managed to get the ZIL test suite to pass. I have a couple of minor follow-on issues to resolve so that zdb doesn't get its knickers in a twist (SEGV really) when it tries to display encrypted log blocks (which it can't decrypt since it is running in userland and without the keys available).
That turned out to be more than I expected to write up. So where are we with the schedule? I expect us to start code review again shortly. I've just completed the resync to build 102.