ZFS encryption: what is on disk?

This article is about what is and isn't stored encrypted on disk for encrypted ZFS datasets, and how we do the actual encryption. It does require some understanding of Solaris and ZFS debugging tools.

The first important thing to understand about ZFS is that it does not provide "full disk" encryption: you will be able to tell that a disk containing data encrypted by ZFS is part of a ZFS pool.

This is in part because one of the requirements for adding encryption support to ZFS was that a given ZFS pool be able to contain a mix of encrypted and cleartext datasets, and that the encrypted datasets be able to use different algorithms/key lengths and different encryption keys.

We also require that the key material does not need to have been made available in order for pool wide operations and certain dataset operations (such as zfs destroy) to succeed.  One of the most important pool wide operations is scrub/resilver; we need to ensure that hotspare, disk replacement and self healing work even if the key material has never been made available to this running instance of the system. We must also be able to claim (but not necessarily replay) the log blocks (ZIL) on reboot after power loss or panic without requiring the key material (ZFS must remain consistent on disk at all times).

What this means is that even in a pool where all of the datasets are marked as being encrypted (e.g. zpool create -O encryption=on tank ...) there is some ZFS metadata that is always in the clear.

What is always in the clear even for encrypted datasets?
  • ZFS pool layout
  • Dataset names
  • Pool and dataset properties, including user defined properties
    • compression, encryption, share, etc.
  • Dataset level quotas (zfs set quota)
  • Dataset delegations (zfs allow)
  • The pool history (zpool history)
  • All dnode blocks
    • Needed to traverse the pool for resilver/scrub
  • Block pointer
    • The blkptr_t contains the MAC/AuthTag from AES-CCM or AES-GCM in the top 96 bits of the checksum field. The SHA256 checksum is truncated to provide this 96 bits of space (see the sketch after this list).
      • The checksum for an encrypted block is always sha256-mac
    • The 96-bit IV for the block is in dva[2] of the blkptr_t
      • This means that an encrypted block can have a maximum of two copies, not three
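
To make the sizes concrete, here is a minimal Python sketch (not the actual ZFS code, which is C) of how the 256-bit checksum field can hold both a truncated SHA256 and the 96-bit MAC; the field ordering shown here is illustrative only.

    import hashlib

    def pack_checksum_field(ciphertext: bytes, mac_96bit: bytes) -> bytes:
        # 160 bits of truncated SHA256 plus a 96-bit AES-CCM/GCM AuthTag = 256 bits
        assert len(mac_96bit) == 12
        truncated_sha256 = hashlib.sha256(ciphertext).digest()[:20]
        return truncated_sha256 + mac_96bit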

What is encrypted when a dataset is encrypted?

  • All file content written to a ZFS filesystem via the ZPL/VFS interfaces (i.e. POSIX interfaces)
    • open(2), write(2), mmap(2), etc.
  • All POSIX (and ZFS filesystem) metadata: ACLs, file and directory names, permissions, system and extended attributes on files and all file timestamps
    • ZPL metadata is normally contained in the bonusbuf area of a dnode_phys_t, but the dnode is in the clear on disk. For encrypted datasets the bonusbuf is always empty and the content that would normally have been there is pushed out to an encrypted "spill" block, called a System Attribute block.  Normally, for ZPL filesystems, spill blocks are only used for files with large ACLs.
  • System Attribute (spill) blocks (used for any purpose)
  • All data written to a ZVOL
  • User/group quota information for ZFS filesystems, both the policy and space accounting (zfs set userquota@ | groupquota@)
  • FUID mappings for UNIX <-> CIFS user identities
  • All of the above if it is in a ZIL (ZFS Intent Log) record.
    • Note that the actual ZIL blocks have block pointers and a record header (which includes sizing information) that are in the clear.
  • Data encryption keys
    • These are stored in an on disk keychain referenced from the dsl_dir_phys_t. 

The on-disk keychain

The on-disk keychains are ZAP objects whose entries are indexed by the transaction they were created in. The entries are individually wrapped by the dataset's wrapping key, each with their own IV and an indicator of which wrapping key algorithm was used (at this time the wrapping key crypto algorithm always matches the encryption property).  Every encrypted dataset has at least one keychain entry.  Clones have their own keychain and do not reference the one of their origin, because the clone may have a different wrapping key and may have different keychain entries from its origin.
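
As a rough illustration of the structure (not the real ZAP layout, and the entry field names here are my own), a keychain can be thought of as a mapping from creation txg to a wrapped data encryption key:

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # third-party 'cryptography' package

    def wrap_entry(wrapping_key: bytes, data_key: bytes) -> dict:
        iv = os.urandom(12)   # each entry is wrapped with its own IV
        return {"iv": iv,
                "wrapped_key": AESGCM(wrapping_key).encrypt(iv, data_key, None),
                "wrap_alg": "aes-256-gcm"}   # indicator of the wrapping algorithm used

    wrapping_key = AESGCM.generate_key(bit_length=256)
    keychain = {                                        # entries indexed by creation txg
        5:  wrap_entry(wrapping_key, os.urandom(32)),   # from dataset creation
        85: wrap_entry(wrapping_key, os.urandom(32)),   # added by a later 'zfs key -K'
    }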

Encrypting a block

Each ZFS on disk block (smallest size is 512 bytes, largest is 128k) is encrypted using AES in either CCM or GCM mode, as indicated by the encryption property. Even though CCM and GCM provide the ability to have additional authenticated data that isn't encrypted, this isn't used, because (with the exception of the ZIL blocks) all data in the block is encrypted.  A 96-bit IV per disk block is used, and both CCM and GCM are requested to provide a 96-bit MAC/AuthTag in addition to the ciphertext.  While we could use a larger MAC, space in the ZFS on disk blkptr_t is very tight and we need to leave some of it available for future features.  After encryption each block is also checksummed by the ZIO pipeline using SHA256 (fletcher is not available for encrypted datasets).
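
The following Python sketch shows that per-block flow in simplified form; it is not the ZFS implementation (which is C, uses the kernel crypto framework and also supports AES-CCM), and truncating a full 16-byte GCM tag down to 96 bits here is just a stand-in for requesting a 96-bit AuthTag from the crypto provider.

    import hashlib
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def encrypt_block(data_key: bytes, iv_96bit: bytes, plaintext: bytes):
        ct_and_tag = AESGCM(data_key).encrypt(iv_96bit, plaintext, None)
        ciphertext, tag = ct_and_tag[:-16], ct_and_tag[-16:]
        mac_96bit = tag[:12]                             # stored in the blkptr_t checksum field
        checksum = hashlib.sha256(ciphertext).digest()   # ZIO pipeline checksum, truncated on disk
        return ciphertext, mac_96bit, checksum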

IV generation for encrypted blocks

Every encrypted on disk block has its own IV (stored in dva[2] of the blkptr_t).  The IV is generated by taking the first 96 bits of a SHA256 hash of the contents of the zbookmark_t and the transaction the block was first written in.  We actually have all of this information available at both read and write time, so in the simplest case we wouldn't need to store the IV. However, snapshots, clones and deduplication, as well as some (non encryption related) future features, complicate this, so we do store the IV.
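
Roughly, the derivation looks like this (the exact set and packing of zbookmark_t fields that ZFS hashes is an implementation detail, so the struct packing below is illustrative only):

    import hashlib
    import struct

    def block_iv(objset: int, obj: int, level: int, blkid: int, birth_txg: int) -> bytes:
        bookmark_and_txg = struct.pack(">5Q", objset, obj, level, blkid, birth_txg)
        return hashlib.sha256(bookmark_and_txg).digest()[:12]   # first 96 bits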

If dedup=on for the dataset, the per block IVs are generated differently.  They are generated by taking an HMAC-SHA256 of the plaintext and using the leftmost 96 bits of that as the IV.  The key used for the HMAC-SHA256 is different from the one used by AES for the data encryption, but it is stored (wrapped) in the same keychain entry and, just like the data encryption key, a new one is generated when doing 'zfs key -K <dataset>'.  Obviously we couldn't calculate this IV when doing a read, so it has to be stored.
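
A sketch of the dedup case, assuming the dedicated HMAC key described above has already been unwrapped from the keychain:

    import hashlib
    import hmac

    def dedup_block_iv(hmac_key: bytes, plaintext: bytes) -> bytes:
        return hmac.new(hmac_key, plaintext, hashlib.sha256).digest()[:12]   # leftmost 96 bits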

ZIL blocks

The ZIL log blocks are written in exactly the same way regardless of whether the ZIL is in the main pool disks or a separate intent log (slog) is being used.  The ZIL blocks are encrypted differently from blocks going through the "normal" write path; this is because log blocks are formatted differently on disk anyway.  The log blocks are chained together and have a header (zil_chain_t) that indicates what size the log block is and the blkptr_t of the next block, as well as an embedded checksum that chains the blocks together.  For encrypted log blocks the MAC from AES CCM/GCM is also stored in this header (zil_chain_t).   It is log blocks rather than log records that are encrypted.  Within a given log block there may be multiple log records.  Some of these log records may contain pointers to blocks that were written directly (via dmu_sync); in order for us to claim the ZIL when the pool is imported, these embedded block pointers need to be readable even if the encryption keys are not available (which they won't be in most cases during the claim phase).  This means that we don't encrypt whole log blocks: the log record headers and any blkptr_t embedded in a log record are in the clear, and the rest of the log block content is encrypted.
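
As a conceptual model only (the real zil_chain_t and log record formats differ, and splitting a log block into one "clear" region and one "secret" region is my own simplification), the selective encryption can be pictured like this:

    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def encrypt_log_block(data_key: bytes, block_iv: bytes, clear_parts: bytes, secret_part: bytes):
        # clear_parts: chain header, record headers and any embedded blkptr_ts,
        # which must stay readable so the ZIL can be claimed without the keys.
        ct_and_tag = AESGCM(data_key).encrypt(block_iv, secret_part, None)
        ciphertext, mac_96bit = ct_and_tag[:-16], ct_and_tag[-16:][:12]
        return clear_parts, ciphertext, mac_96bit   # the MAC goes in the zil_chain_t header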

How is the passphrase turned into a wrapping key (keysource=passphrase,prompt)?

When the dataset 'keysource' property indicates that a passphrase should be used, we have to derive a wrapping key from it.  The wrapping key is derived from the passphrase provided and a per dataset salt (which is stored as a hidden property of the dataset) using PKCS#5 PBKDF2_HMAC_SHA1 with 1000 iterations.  The wrapping key is not stored on disk.  The salt is randomly generated when the dataset is created (with keysource=passphrase,prompt) and changed each time 'zfs key -c' is run; even if the passphrase the user provides is the same, the salt, and thus the actual wrapping key, will be different.
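
Using only the Python standard library, the derivation described above looks roughly like this (the 32-byte output length is an assumption for illustration; in practice it matches the AES wrapping key length):

    import hashlib

    def derive_wrapping_key(passphrase: str, salt: bytes) -> bytes:
        return hashlib.pbkdf2_hmac("sha1", passphrase.encode("utf-8"), salt, 1000, dklen=32)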


Looking at the on disk structures

Using mdb macros and zdb we can actually look at some of this.  Remember that mdb and zdb are debugging tools only; use of mdb on a live kernel without understanding what you are doing can corrupt data.  The interfaces used below are not committed interfaces and are subject to change.

First, using mdb on the live kernel (of an x86 machine), I've placed a breakpoint on the zio_decrypt function; let's look at the block pointer using the mdb ::blkptr dcmd:

[2]> <rdi::print zio_t io_bp | ::blkptr
DVA[0]=<0:204200:20000:STD:1>
[L0 PLAIN_FILE_CONTENTS] SHA256_MAC OFF LE contiguous unique encrypted 1-copy
size=20000L/20000P birth=10L/10P fill=1
cksum=a585cb5cce997c21:c79825e93b16d5aa:28df6742e24913e:6b94fbd569cf3cd9

This blkptr_t is for the contents of a file; we can see that it is encrypted and that we only have one copy of it - so only one DVA entry. The checksum is SHA256_MAC, so the actual MAC value is 2e24913e6b94fbd569cf3cd9.  The blkptr macro doesn't show us the IV that is stored in DVA[2], but we can see it if we print the raw structure using ::print:

[2]> <rdi::print zio_t io_bp->blk_dva[2]
blk_dva[2] = {
    blk_dva[2].dva_word = [ 0x521926d500000000, 0x3b13ba46ab9f8a51 ]
}

Now let's use zdb to look at some things (the output is trimmed slightly for the sake of this article):

# zdb -dd -e tank

Dataset mos [META], ID 0, cr_txg 4, 311K, 54 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         0    1    16K    16K  96.0K    32K   84.38  DMU dnode
         1    1    16K     1K  1.50K     1K  100.00  object directory
         2    1    16K    512      0    512    0.00  DSL directory
         3    1    16K    512  1.50K    512  100.00  DSL props

...

        26    1    16K   128K  18.0K   128K  100.00  SPA history

...

        36    1    16K   128K      0   128K    0.00  bpobj
        37    1    512    512  3.00K     1K  100.00  DSL keychain
        38    1    16K     4K  12.0K     4K  100.00  SPA space map
...

This pool (tank) currently has 3 datasets, one of which is encrypted.  We can see from the above zdb output that the keychains are kept in the special "mos" dataset along with some other pool wide metadata.  Now let's look at those keychains in a bit more detail by asking zdb to be more verbose (again the output is trimmed to show relevant information only):

    # zdb -dddd -e tank
    ...
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        37    1    512    512  3.00K     1K  100.00  DSL keychain
        dnode flags: USED_BYTES 
        dnode maxblkid: 1
        Fat ZAP stats:
                Pointer table:
                        32 elements
                        zt_blk: 0
                        zt_numblks: 0
                        zt_shift: 5
                        zt_blks_copied: 0
                        zt_nextblk: 0
                ZAP entries: 2
                Leaf blocks: 1
                Total blocks: 2
                zap_block_type: 0x8000000000000001
                zap_magic: 0x2f52ab2ab
                zap_salt: 0x16c6fb
                Leafs with 2^n pointers:
                        5:      1 *
                Blocks with n*5 entries:
                        0:      1 *
                Blocks n/10 full:
                        9:      1 *
                Entries with n chunks:
                        9:      2 **
                Buckets with n entries:
                        0:     14 **************
                        1:      2 **
        Keychain entries by txg:
                txg 5 : wkeylen = 136
                txg 85 : wkeylen = 136

The above keychain object shows it has two entries in it: the lowest numbered one (5) is from when the dataset was initially created, and the second one (85) is because I had run 'zfs key -K tank/fs' on the dataset a little later.  Now let's illustrate with zdb what I discussed in the previous article about assured delete, where clones are able to have a different set of entries in the keychain from their origin.

To illustrate this I ran the following:

# zfs snapshot tank/fs@1
# zfs clone -K tank/fs@1 tank/fsc1
# zfs key -K tank/fs

First let's look at keychain object 37, which is for tank/fs, and then at the keychain object for the clone (I've trimmed the output a little more this time):

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        37    2    512    512  7.50K     2K  100.00  DSL keychain
      ...
        Keychain entries by txg:
                txg 5 : wkeylen = 136
                txg 85 : wkeylen = 136
                txg 174 : wkeylen = 136


     Object  lvl   iblk   dblk  dsize  lsize   %full  type
        101    1    512    512  4.50K  1.50K  100.00  DSL keychain
      ...
        Keychain entries by txg:
                txg 5 : wkeylen = 136
                txg 85 : wkeylen = 136
                txg 152 : wkeylen = 136

What we see above is that the original tank/fs dataset now has an additional entry from the 'zfs key -K tank/fs' that was run.  The keychain for the clone (object 101) also has three entries in it: it shares the same entries as tank/fs for txg 5 and txg 85 (though they may be encrypted differently on disk depending on where the wrapping key is inherited from) and it has a unique entry created at txg 152.  We can see similar information by looking at the 'zpool history -il' output:

2010-11-19.05:58:25 [internal encryption key create txg:85] rekey succeeded dataset = 33 [user root on borg-nas]
2010-11-19.06:05:59 [internal encryption key create txg:152] rekey succeeded dataset = 96 from dataset = 77 [user root on borg-nas]
2010-11-19.06:06:40 [internal encryption key create txg:174] rekey succeeded dataset = 33 [user root on borg-nas]

What is decrypted in memory?

As already mentioned, the data encryption keys are stored wrapped (encrypted) on disk, but they are stored in memory in the clear, along with the wrapping key (we need the wrapping key to stay around for 'zfs key -K' and for 'zfs create' where the keysource property is inherited).  They are stored only in non swappable kernel memory (though remember you can swap on an encrypted ZVOL).  They are accessible to someone with all privileges who is able to use mdb on the live kernel or on a crash dump - but so is your plaintext data.  A suitable hardware keystore could be used so that key material is only ever inside its FIPS 140 boundary, but that support is not yet complete (note this is not a commitment from Oracle to provide this support in any future release of ZFS or Solaris) - there would be no on disk change required to support it though.

Any data or metadata blocks that are encrypted on disk are in the in-memory cache (ARC) in the clear. This is required because the in memory ARC buffers are sometimes "loaned" using zero copy to other parts of the system - including other file systems such as NFS and CIFS.  If this is too much of a risk for you then you can force the system to always go back to disk and decrypt blocks only when needed, but note you will not benefit from the caching and this will have a significant performance penalty: zfs set primarycache=metadata <dataset>.

The L2ARC is not currently available for use by encrypted datasets (note this is not a commitment from Oracle to provide this support in any future release of ZFS or Solaris); it is equivalent to having done 'zfs set secondarycache=none <dataset>'. The DDT for deduplication is not encrypted data and is pool wide metadata (stored in the MOS), so it is still able to be stored in the L2ARC.

All of the above article content could have been discovered by reading the zfs(1M) man page and using mdb, DTrace and zdb while experimenting on a live system, which is actually how I wrote the article.  There is a lot more you can examine about the on disk and in memory state of Solaris, not just ZFS, by using mdb and DTrace - neither of which you can hide from, since the kernel modules have CTF data in them with full structure definitions. Note though that unless the interfaces/structures are documented in the Solaris DDI, or other official documentation from Oracle, you are looking at implementation details that are subject to change - often even in an update/patch.

Comments:

As you say the L2ARC is currently not available for encrypted datasets, I wonder if this is true for both data and metadata in the cache.

Afaik dedup needs a lot of memory to keep the DDT in fast memory (ARC or L2ARC).

Now if I have a huge, encrypted dataset with dedup enabled, only the ARC and not the L2ARC is used?

Would be nice if you could clarify the impact on dedup.

Robert

Posted by Robert on November 19, 2010 at 08:39 AM GMT #

Thanks Robert, that is a very good question. I've updated the L2ARC section to clarify what happens with the DDT.

Posted by Darren Moffat on November 19, 2010 at 08:53 AM GMT #

So, does this mean that it would be possible (assuming 2 servers of the requisite version of Solaris) to zfs send an encrypted volume and subsequent snapshots from one host to another, without the destination ever having knowledge of the key?

Posted by Colin on November 22, 2010 at 05:46 AM GMT #

Colin, not at this time, because the send stream is the decrypted and decompressed data. See the 'Encrypting ZFS File Systems' section of the Oracle Solaris documentation: http://docs.sun.com/app/docs/doc/821-1448/gkkih?l=en&a=view for information on send|recv and encryption.

Posted by Darren Moffat on November 22, 2010 at 05:56 AM GMT #
