Tuesday Jul 07, 2015

Solaris new system calls: getentropy(2) and getrandom(2)

The traditional UNIX/Linux method of getting random numbers (bit streams really) from the kernel to a user space process was to open(2) the /dev/random or /dev/urandom pseudo devices and read(2) an appropriate amount from it, remembering to close(2) the file descriptor if your application or library cached it.  Solaris 11.3 adds two new system calls, getrandom(2) and getentropy(2), for getting random bit streams or raw entropy. These are compatible with APIs recently introduced in OpenBSD and Linux.

OpenBSD introduced a getentropy(2) system call that reads a maximum of 256 bytes from the kernel entropy pool and returns it to user space for use in user space random number generators (such as in the OpenSSL library). The getentropy(2) system call returns 0 on success and always returns the amount of data requested or fails completely, setting errno.  It is an error to request more than 256 bytes of entropy and doing so causes errno to be set to EIO. 

On Solaris the output of getentropy(2) is raw entropy and should not be used where randomness is needed; in particular it must not be used where an IV or nonce is needed when calling a cryptographic operation.  It is intended only for seeding a user space RBG (Random Bit Generator) system. More specifically, the data returned by getentropy(2) has not had the required FIPS 140-2 processing for the DRBG applied to it.

Recent Linux kernels have a getrandom(2) system call that reads between 1 and 1024 bytes of randomness from the kernel.  Unlike getentropy(2) it is intended to be used directly for cryptographic use cases such as an IV or nonce.  The getrandom(2) call can be told whether to use the kernel pool usually used for /dev/random or the one for /dev/urandom by using the GRND_RANDOM flag to request the former. If GRND_RANDOM is specified then getrandom(2) will block until sufficient randomness can be generated if the pool is low; if non-blocking behaviour is required then the GRND_NONBLOCK flag can be passed.

#include <sys/random.h>
int getrandom(void *buf, size_t buflen, unsigned int flags);
int getentropy(void *buf, size_t buflen);

On Solaris if GRND_RANDOM is not specified then getrandom(2) is always a non blocking call. Note that this differs slightly from Linux but not in a way that impacts its usage.  The other difference is that on Solaris getrandom(2) will either fail completely or will return a buffer filled with the requested size, whereas the Linux implementation can return partial buffers.  In order to ensure code portability developers must check the return value of getrandom(2) every time it is called, for example:

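#include <sys/random.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
size_t bufsz = 128;
char *buf;
int ret;
buf = malloc(bufsz);
if (buf == NULL) {
    perror("malloc failed");
    exit(1);
}
errno = 0;
ret = getrandom(buf, bufsz, GRND_RANDOM);
/* Treat an error or a short (partial) buffer as failure. */
if (ret < 0 || (size_t)ret != bufsz) {
    perror("getrandom failed");
    exit(1);
}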

The output of getrandom(2) on Solaris has been through a FIPS 140-2 approved DRBG function as defined in NIST SP 800-90A.

In addition to the above two system calls, the OpenBSD arc4random(3C), arc4random_buf(3C) and arc4random_uniform(3C) functions are also now provided from libc; these are available by including <stdlib.h>.

Wednesday Apr 16, 2014

Is FIPS 140-2 Actively harmful to software?

Solaris 11 recently completed a FIPS 140-2 validation for the kernel and userspace cryptographic frameworks.  This was  a huge amount of work for the teams and it is something I had been pushing for since before we wrote a single line of code for the cryptographic framework back in 2000 during its initial design for Solaris 10.

So you would imagine I'm happy right ?  Well not exactly, I'm glad I won't have to keep answering questions from customers as to why we don't have a FIPS 140-2 validation but I'm not happy with the process or what it has done to our code base.

FIPS 140-2 is an old standard that doesn't deal well with modern systems and especially doesn't fit nicely with software implementations.  It is very focused on standalone hardware devices, and plugin hardware security modules or similar physical devices.  My colleague Josh over in Oracle Seceval has already posted a great article on why we only managed to get FIPS 140-2 @ level 1 instead of level 2.  So I'm not going to cover that but instead talk about some of the technical code changes we had to make in order to "pass" our validation of FIPS 140-2.

There are two main parts to completing a FIPS 140-2 validation. The first part is CAVP (Cryptographic Algorithm Validation Program), which is about proving your implementation of a given algorithm is correct using NIST assigned test vectors.  This part went relatively quickly and easily and has the potential to find bugs in crypto algorithms that otherwise appear to be working correctly.  The second part is CMVP (Cryptographic Module Validation Program); this part looks at the security model of the whole "FIPS 140-2 module". In our case we had a separate validation for the kernel crypto framework and the userspace crypto framework.

CMVP requires we draw a boundary around the delivered software components that make up the FIPS 140-2 validation boundary - so files in the file system.  Ideally you want to keep this as small as possible so that non crypto relevant libraries and tools are not part of the FIPS 140-2 boundary. We certainly made some mistakes drawing our boundary in userspace since it was a little larger than it needed to be.  We ended up with some "utility" libraries inside the boundary, so good software engineering practice of factoring out code actually made our FIPS 140-2 boundary bigger.

Why does the FIPS 140-2 boundary matter ?  Well, unlike in Common Criteria with flaw remediation, in the FIPS 140-2 validation world you can't make any changes to the compiled binaries that make up the boundary without potentially invalidating the existing validation. That means having to go through some or all of the process again, and importantly this costs real money and a significant amount of elapsed time.

It isn't even possible to fix "obvious" bugs such as memory leaks, or even things that might lead to vulnerabilities without at least engaging with a validation lab.  This is bad for overall system security, after all isn't FIPS 140-2 supposed to be a security standard ?  I can see, with a bit of squinting, how this can maybe make some sense in a hardware module world but it doesn't make any sense for software.

We also had to add POST (Power On Self Test) code that runs known answer tests for all the FIPS 140-2 approved algorithms that are implemented inside the boundary at "startup time" and before any consumer outside of the framework can use the crypto interfaces. 

For our kernel framework we implemented this using the module init hooks and also leveraged the fact that the kcf module itself starts very early in boot (long before we even mount the root file system from inside the kernel).  Since kernel modules are generally only unloaded to be updated the impact of having to do this self test on every startup isn't a big deal.

However in userspace we were forced, because of "Implementation Guidance" (I'll get back later to why it isn't really guidance), to do this in every process that directly or indirectly causes the cryptographic framework libraries to be loaded.  This is really bad and is counter to sensible software engineering practice. On general purpose modern operating systems (well anything from the last 15+ years really) like Solaris, shared library pages are mapped shared so the same readonly pages of code are being used by all the processes that start up.  So this just wastes CPU resources and causes performance problems for short lived processes.  We measured the impact this had on Solaris boot time and it was, if I'm remembering correctly, about a 9% increase in the time it takes to boot to multi-user.

I've actually spoken with NIST about the "always on POST" and we tried hard to come up with an alternative solution but so far we can't seem to agree on a method that would allow this to be done just once at system boot and only do it again if the on disk binaries are actually changed (which we can easily detect!).

Now let's combine these last two things: we had to add code that runs every time our libraries load and we can't make changes to the delivered binaries without possibly causing our validation to become invalid.  Solaris actually had a bug in some of the new FIPS 140-2 POST code in userspace that had a risk of a file descriptor leak (it wasn't something that was an exploitable security vulnerability and it was only one single fd) but we couldn't have changed that without revising which binaries were part of the FIPS 140-2 validation.  This is bad for customers that are forced by their governments or other security standards to run with FIPS 140-2 validated crypto modules, because sometimes they might have to miss out on critical fixes.

I promised I'd get back to "Implementation Guidance". This is really a roundabout way of updating the standard with new interpretations that often look to developers like whole new requirements (that we were supposed to magically know about) without the standard being revised.  While the approved validation labs do get pre-review of these new or updated IGs the impact for vendors is huge.   A module that passes FIPS 140-2 (which is a specific revision, the current one as of this time, of the standard) today might not pass FIPS 140-2 in the future - even if nothing was changed.

In fact we are potentially in this situation with Solaris 11.  We have completed and passed a FIPS 140-2 validation but due to changes in the Implementation Guidance we aren't sure we would be able to submit the identical code again and pass. So we may have to make changes just to pass new or updated FIPS 140-2 IGs that have no functional benefit to our customers.

This has serious implications for software implementations of cryptographic modules.  I can understand that if we change any of the core crypto algorithm code we should re run the CAVP test vectors again - and in fact we do that internally using our test suite for all changes to the crypto framework anyway (our test suite is actually much more comprehensive than what FIPS 140 required), but not being able to make simple bug fixes or changes to non algorithm code is not good for software quality.

So what do we do in Solaris ?  We make the bug fixes and add new non FIPS 140-2 relevant algorithms (such as Camellia) anyway because most of our customers don't care about FIPS 140-2 and even many of those that do only care to "tick the box" that the vendor has completed the validation.

In Solaris the kernel and userland cryptographic frameworks always contain the FIPS 140-2 required code but it is only enabled if you run 'cryptoadm enable fips-140' .  This turns on the FIPS 140-2 POST checking and a few other runtime checks.

So should I run Solaris 11 with FIPS 140-2 mode enabled ?

My personal opinion is that unless you have a very hard requirement to do so I wouldn't - it is the same crypto algorithm and key management code you are running anyway but you won't have the pointless POST code running that will hurt the start up time of short lived processes. Now having said that my day to day Solaris workstation (which runs the latest bi weekly builds of the Solaris 12 development train) does actually run in FIPS 140-2 mode so that I can help detect any possible issues in the FIPS 140-2 mode of operating long before a code release gets to customers.  We also run our test suites with it enabled and disabled.

I really hope that when a revision to FIPS 140 finally does come around (it is already many years behind schedule) it will deal better with software implementations. When FIPS 140-3 was first in public review I sent in a lot of comments on that area.   I really hope that the FIPS 140 program can adopt a sensible approach to allowing vendors to provide bugfixes without having to redo validations - in particular it should not cost the vendor any time or money beyond what they normally do themselves.

In the mean time the Solaris Cryptographic Framework team are hard at work: fixing bugs, improving performance, adding features and new algorithms, and (grudgingly) adding what we think will allow us to pass a future FIPS 140 validation based on the currently known IGs.

-- Darren

Monday Oct 29, 2012

Solaris 11.1: Encrypted Immutable Zones on (ZFS) Shared Storage

Solaris 11 brought both ZFS encryption and the Immutable Zones feature and I've talked about the combination in the past.  Solaris 11.1 adds a fully supported method of storing zones in their own ZFS pool using shared storage so let's update things a little and put all three parts together.

When using an iSCSI (or other supported shared storage target) for a Zone we can either let the Zones framework setup the ZFS pool or we can do it manually before hand and tell the Zones framework to use the one we made earlier.  To enable encryption we have to take the second path so that we can setup the pool with encryption before we start to install the zones on it.

We start by configuring the zone and specifying a rootzpool resource:

# zonecfg -z eizoss
Use 'create' to begin configuring a new zone.
zonecfg:eizoss> create
create: Using system default template 'SYSdefault'
zonecfg:eizoss> set zonepath=/zones/eizoss
zonecfg:eizoss> set file-mac-profile=fixed-configuration
zonecfg:eizoss> add rootzpool
zonecfg:eizoss:rootzpool> add storage \
zonecfg:eizoss:rootzpool> end
zonecfg:eizoss> verify
zonecfg:eizoss> commit

Now let's create the pool and specify encryption:

# suriadm map \
mapped-dev	/dev/dsk/c10t600144F09ACAACD20000508E64A70001d0
# echo "zfscrypto" > /zones/p
# zpool create -O encryption=on -O keysource=passphrase,file:///zones/p eizoss \
# zpool export eizoss

Note that the keysource above is just for this example; realistically you should probably use an Oracle Key Manager or some other better key storage, but that isn't the purpose of this example.  Note however that it does need to be one of file://, https:// or pkcs11: and not prompt for the key location.  Also note that we exported the newly created pool.  The name we used here doesn't actually matter because it will get set properly on import anyway. So let's go ahead and do our install:

# zoneadm -z eizoss install -x force-zpool-import
Configured zone storage resource(s) from:
Imported zone zpool: eizoss_rpool
Progress being logged to /var/log/zones/zoneadm.20121029T115231Z.eizoss.install
    Image: Preparing at /zones/eizoss/root.

 AI Manifest: /tmp/manifest.xml.ujaq54
  SC Profile: /usr/share/auto_install/sc_profiles/enable_sci.xml
    Zonename: eizoss
Installation: Starting ...

              Creating IPS image
Startup linked: 1/1 done
              Installing packages from:
                      origin:  http://pkg.oracle.com/solaris/release/
              Please review the licenses for the following packages post-install:
                consolidation/osnet/osnet-incorporation  (automatically accepted,
                                                          not displayed)
              Package licenses may be viewed using the command:
                pkg info --license <pkg_fmri>
DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                            187/187   33575/33575  227.0/227.0  384k/s

PHASE                                          ITEMS
Installing new actions                   47449/47449
Updating package state database                 Done 
Updating image state                            Done 
Creating fast lookup database                   Done 
Installation: Succeeded

         Note: Man pages can be obtained by installing pkg:/system/manual


        Done: Installation completed in 929.606 seconds.

  Next Steps: Boot the zone, then log into the zone console (zlogin -C)

              to complete the configuration process.

Log saved in non-global zone as /zones/eizoss/root/var/log/zones/zoneadm.20121029T115231Z.eizoss.install

That was really all we had to do, when the install is done boot it up as normal.

The zone administrator has no direct access to the ZFS wrapping keys used for the encrypted pool the zone is stored on.  Due to how inheritance works in ZFS he can still create new encrypted datasets that use those wrapping keys (without them ever being inside a process in the zone) or he can create encrypted datasets inside the zone that use keys of his own choosing. The output below shows the two cases:

rpool is inheriting the key material from the global zone (note we can see the value of the keysource property but we don't use it inside the zone nor does that path need to be (or is) accessible inside the zone). Whereas rpool/export/home/bob has set keysource locally.


# zfs get encryption,keysource rpool rpool/export/home/bob
NAME                   PROPERTY    VALUE                       SOURCE
rpool                  encryption  on                          inherited from $globalzone
rpool                  keysource   passphrase,file:///zones/p  inherited from $globalzone
rpool/export/home/bob  encryption  on                          local
rpool/export/home/bob  keysource   passphrase,prompt           local



Tuesday Nov 22, 2011

HOWTO Turn off SPARC T4 or Intel AES-NI crypto acceleration.

Since we released hardware crypto acceleration for SPARC T4 and Intel AES-NI support we have had a common question come up: 'How do I test without the hardware crypto acceleration?'.

Initially this came up just for development use so developers can do unit testing on a machine that has hardware offload but still cover the code paths for a machine that doesn't (our integration and release testing would run on all supported types of hardware anyway).  I've also seen it asked in a customer context, so that we can show that there is a performance gain from the hardware crypto acceleration (not just from the fact that SPARC T4 is a much faster performing processor than T3) and measure what it is for their application.

With SPARC T2/T3 we could easily disable the hardware crypto offload by running 'cryptoadm disable provider=n2cp/0'.  We can't do that with SPARC T4 or with Intel AES-NI because in both of those classes of processor the encryption doesn't require a device driver; instead it uses unprivileged instructions that are callable from user land.

Turns out there is a way to do this by using features of the Solaris runtime loader (ld.so.1). First I need to expose a little bit of implementation detail about how the Solaris Cryptographic Framework is implemented in Solaris 11.  One of the new Solaris 11 features of the linker/loader is the ability to have a single ELF object that has multiple different implementations of the same functions that are selected at runtime based on the capabilities of the machine.  The alternative to this is having the application coded to call getisax() and make the choice itself.  We use this functionality of the linker/loader when we build the userland libraries for the Solaris Cryptographic Framework (specifically libmd.so, and the unfortunately misnamed, due to historical reasons, libsoftcrypto.so).

The Solaris linker/loader allows control of a lot of its functionality via environment variables, and we can use that to control the version of the cryptographic functions we run.  To do this we simply export the LD_HWCAP environment variable with values that tell ld.so.1 to not select the HWCAP section matching certain features even if isainfo says they are present.

For SPARC T4 that would be:

export LD_HWCAP="-aes -des -md5 -sha256 -sha512 -mont -mpmul" 

and for Intel systems with AES-NI support:

export LD_HWCAP="-aes"

This will work for consumers of the Solaris Cryptographic Framework that use the Solaris PKCS#11 libraries or use libmd.so interfaces directly.  It also works for the Oracle DB and Java JCE.  However it does not work for the default enabled OpenSSL "t4" or "aes-ni" engines (unfortunately) because they do explicit calls to getisax() themselves rather than using multiple ELF cap sections.

However we can still use OpenSSL to demonstrate this by explicitly selecting the "pkcs11" engine, using only a single process and thread.

$ openssl speed -engine pkcs11 -evp aes-128-cbc
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      54170.81k   187416.00k   489725.70k   805445.63k  1018880.00k

$ LD_HWCAP="-aes" openssl speed -engine pkcs11 -evp aes-128-cbc
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      29376.37k    58328.13k    79031.55k    86738.26k    89191.77k

We can clearly see the difference this makes in the case where AES offload to the SPARC T4 was disabled. The "t4" engine is faster than the pkcs11 one because there is less overhead (again on a SPARC T4-1 using only a single process/thread - using -multi you will get even bigger numbers).

$ openssl speed -evp aes-128-cbc
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc      85526.61k    89298.84k    91970.30k    92662.78k    92842.67k

Yet another cool feature of the Solaris linker/loader, thanks Rod and Ali.

Note that the openssl speed output above is not intended to show the actual performance of any particular benchmark, just that there is a significant improvement from using hardware acceleration on SPARC T4. For cryptographic performance benchmarks see the http://blogs.oracle.com/BestPerf/ postings.

Thursday Nov 17, 2011

OpenSSL Versions in Solaris

Those of you who have installed Solaris 11 or have read some of the blogs by my colleagues will have noticed Solaris 11 includes OpenSSL 1.0.0, this is a different version to what we have in Solaris 10.  I hope the following explains why that is and how it fits with the expectations on binary compatibility between Solaris releases.

Solaris 10 was the first release where we included OpenSSL libraries and headers (part of it was actually statically linked into the SSH client/server in Solaris 9).  At the time we were building and releasing Solaris 10 the current train of OpenSSL was 0.9.7.

The OpenSSL libraries at that time were known to not always be completely API and ABI (binary) compatible between releases (some times even in the lettered patch releases) though mostly if you stuck with the documented high level APIs you would be fine.   For this reason OpenSSL was classified as a 'Volatile' interface and in Solaris 10 Volatile interfaces were not part of the default library search path which is why the OpenSSL libraries live in /usr/sfw/lib on Solaris 10.  Okay, but what does Volatile mean ?

Quoting from the attributes(5) man page description of Volatile (which was called External in older taxonomy):

         Volatile interfaces can change at any time and  for  any
         reason.

         The Volatile interface stability level allows  Sun  pro-
         ducts to quickly track a fluid, rapidly evolving specif-
         ication. In many cases, this is preferred  to  providing
         additional  stability to the interface, as it may better
         meet the expectations of the consumer.

         The most common application of this taxonomy level is to
         interfaces that are controlled by a body other than Sun,
         but unlike specifications controlled by standards bodies
         or Free or Open Source Software (FOSS) communities which
         value interface compatibility, it can  not  be  asserted
         that  an incompatible change to the interface specifica-
         tion would be exceedingly rare. It may also  be  applied
         to  FOSS  controlled  software  where  it is deemed more
         important to track the community  with  minimal  latency
         than to provide stability to our customers.

         It also common  to  apply  the  Volatile  classification
         level  to  interfaces in the process of being defined by
         trusted or  widely  accepted  organization.   These  are
         generically  referred  to  as draft standards.  An "IETF
         Internet draft"  is  a  well  understood  example  of  a
         specification under development.

         Volatile can also be applied to experimental interfaces.

         No assertion is made regarding either source  or  binary
         compatibility  of  Volatile  interfaces  between any two
         releases,  including  patches.  Applications  containing
         these  interfaces might fail to function properly in any
         future release.

Note that last paragraph!  OpenSSL is only one example of the many interfaces in Solaris that are classified as Volatile.  At the other end of the scale we have Committed (Stable in Solaris 10 terminology) interfaces, these include things like the POSIX APIs or Solaris specific APIs that we have no intention of changing in an incompatible way.  There are also Private interfaces and things we declare as Not-an-Interface (eg command output not intended for scripting against only to be read by humans).

Even if we had declared OpenSSL as a Committed/Stable interface in Solaris 10 there are allowed exceptions, again quoting from attributes(5):

         4.   An interface specification which  isn't  controlled
              by  Sun  has been changed incompatibly and the vast
              majority of interface consumers  expect  the  newer
              interface.

         5.   Not  making  the  incompatible  change   would   be
              incomprehensible  to our customers. 

In our opinion and that of our large and small customers keeping up with the OpenSSL community is important, and certainly both of the above cases apply.

Our policy for dealing with OpenSSL on Solaris 10 was to stay at 0.9.7 and add fixes for security vulnerabilities (the version string includes the CVE numbers of fixed vulnerabilities relevant to that release train).  The last release of OpenSSL 0.9.7 delivered by the upstream community was more than 4 years ago in Feb 2007.

Now let's roll forward to just before the release of Solaris 11 Express in 2010. By that point in time the current OpenSSL release was 0.9.8 with the 1.0.0 release known to be coming soon.  Two significant changes to OpenSSL were made between Solaris 10 and Solaris 11 Express.  First, in Solaris 11 Express (and Solaris 11) we removed the requirement that Volatile libraries be placed in /usr/sfw/lib, which means OpenSSL is now in /usr/lib. Secondly, we upgraded it to the then current version stream of OpenSSL (0.9.8) as was expected by our customers.

In between Solaris 11 Express in 2010 and the release of Solaris 11 in 2011 the OpenSSL community released version 1.0.0.  This was a huge milestone for a long standing and highly respected open source project.  It would have been highly negligent of Solaris not to include OpenSSL 1.0.0e in the Solaris 11 release. It is the latest, best supported and best performing version.


In fact Solaris 11 isn't 'just' OpenSSL 1.0.0 but we have added our SPARC T4 engine and the AES-NI engine to support the on chip crypto acceleration. This gives us 4.3x better AES performance than OpenSSL 0.9.8 running on AIX on an IBM POWER7. We are now working with the OpenSSL community to determine how best to integrate the SPARC T4 changes into the mainline OpenSSL.  The OpenSSL 'pkcs11' engine we delivered in Solaris 10 to support the CA-6000 card and the SPARC T1/T2/T3 hardware is still included in Solaris 11.

When OpenSSL 1.0.1 and 1.1.0 come out we will assess what is best for Solaris customers. It might be an upgrade or it might be parallel delivery of more than one version stream.  At this time Solaris 11 still classifies OpenSSL as a Volatile interface, it is our hope that we will be able at some point in a future release to give it a higher interface stability level.

Happy crypting! and thank-you OpenSSL community for all the work you have done that helps Solaris.

Friday Nov 19, 2010

ZFS encryption what is on disk ?

This article is about what is and isn't stored encrypted on disk for ZFS datasets that are encrypted and how we do the actual encryption. It does require some understanding of Solaris and ZFS debugging tools.

The first important thing to understand about ZFS is that it is not providing "full disk" encryption and you will be able to tell that a disk that has data on it that was encrypted by ZFS is part of a ZFS pool.

This is in part because one of the requirements for adding encryption support to ZFS was that a given ZFS pool be able to contain a mix of encrypted and cleartext datasets and those that are encrypted be able to use different algorithms/keylengths and different encryption keys.

We also require that the key material does not need to have been made available in order for pool wide operations and certain dataset operations (such as zfs destroy) to succeed.  One of the most important pool wide operations is scrub/resilver; we need to ensure that hotspare, disk replacement and self healing work even if the key material has never been made available to this running instance of the system. We must also be able to claim (but not necessarily replay) the log blocks (ZIL) on reboot after power loss or panic without requiring the key material (ZFS must remain consistent on disk at all times).

What this means is that even in a pool where all of the datasets are marked as being encrypted (eg zpool create -O encryption=on tank ...) there is some ZFS metadata that is always in the clear.

What is always in the clear even for encrypted datasets?
  • ZFS pool layout
  • Dataset names
  • Pool and dataset properties, including user defined properties
    • compression, encryption, share, etc.
  • Dataset level quotas (zfs set quota)
  • Dataset delegations (zfs allow)
  • The pool history (zpool history)
  • All dnode blocks
    • Needed to traverse the pool for resilver/scrub
  • Block pointer
    • The blkptr_t contains the MAC/AuthTag from AES-CCM or AES-GCM in the top 96 bits of the checksum field. The SHA256 checksum is truncated to provide this 96 bits of space.
      • The checksum for an encrypted block is always sha256-mac
    • The 96bit IV for the block is in dva[2] of the blkptr_t
      • This means that an encrypted block can have a maximum of 2 copies not 3

What is encrypted when a dataset is encrypted?

  • All file content written to a ZFS filesystem via the ZPL/VFS interfaces (ie POSIX interfaces)
    • open(2), write(2), mmap(2), etc.
  • All POSIX (and ZFS filesystem) metadata: ACLs, file and directory names, permissions, system and extended attributes on files and all file timestamps
    • ZPL metadata is normally contained in the bonusbuf area of a dnode_phys_t but the dnode is in the clear on disk. For encrypted datasets the bonusbuf is always empty and the content that would normally have been there is pushed out to an encrypted "spill" block, called a System Attribute block.  Normally for ZPL filesystems spill blocks are only used for files with large ACLs.
  • System Attribute (spill) blocks (used for any purpose)
  • All data written to a ZVOL
  • User/group quota information for ZFS filesystems, both the policy and space accounting (zfs set userquota@ | groupquota@)
  • FUID mappings for UNIX <-> CIFS user identities
  • All of the above if it is in a ZIL (ZFS Intent Log) record.
    • Note that the actual ZIL blocks have block pointers and a record header that includes the sizing information that is in the clear.
  • Data encryption keys
    • These are stored in an on disk keychain referenced from the dsl_dir_phys_t. 

The ondisk keychain

The keychain entries are ZAP objects that are indexed by the transaction they were created in. The entries are individually wrapped by the dataset's wrapping key each with their own IV and an indicator of what wrapping key algorithm was used (at this time the wrapping key crypto algorithm always matches the encryption property).  Every encrypted dataset has at least one keychain entry.  Clones have their own keychain and do not reference the one of their origin, because the clone may have a different wrapping key and the clone may have different keychain entries to its origin.

Encrypting a block

Each ZFS on disk block (smallest size is 512 bytes, largest is 128k) is encrypted using AES in either CCM or GCM mode as indicated by the encryption property. Even though CCM and GCM provide the ability to have additional authenticated data that isn't encrypted, this isn't used because (with the exception of the ZIL blocks) all data in the block is encrypted.  A 96 bit IV per disk block is used and both CCM and GCM are requested to provide a 96 bit MAC/AuthTag in addition to the ciphertext.  While we could get a larger MAC, space in the ZFS on disk blkptr_t is very tight and we need to leave some of it available for future features.  After encryption each block is also checksummed by the ZIO pipeline using SHA256 (fletcher is not available for encrypted datasets).

IV generation for encrypted blocks

Every encrypted on disk block has its own IV (stored in dva[2] of the blkptr_t).  The IV is generated by taking the first 96 bits of a SHA256 hash of the contents of the zbookmark_t and the transaction the block was first written in.  We actually have all this information available both at read and write time so we don't need to store the IV in the simplest case. However, snapshots, clones and deduplication, as well as some (non encryption related) future features, complicate this so we do store the IV.
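The default derivation described above can be sketched in a few lines of Python.  This is only an illustration: the four 64 bit bookmark fields (objset, object, level, blkid), their order and the big endian packing are assumptions made for the example, not the actual layout ZFS hashes.

```python
import hashlib
import struct

def derive_block_iv(objset, obj, level, blkid, txg):
    """Return a 96 bit IV: the first 12 bytes of SHA256(bookmark || txg).

    The field order and big endian 64 bit packing are assumptions for
    this sketch; the real serialisation inside ZFS may differ.
    """
    material = struct.pack(">5Q", objset, obj, level, blkid, txg)
    return hashlib.sha256(material).digest()[:12]

# Same block location, written in two different transactions.
iv_a = derive_block_iv(objset=33, obj=8, level=0, blkid=0, txg=10)
iv_b = derive_block_iv(objset=33, obj=8, level=0, blkid=0, txg=11)
print(iv_a.hex())
print(iv_b.hex())
```

Because the transaction number is part of the hashed material, rewriting the same logical block in a later transaction yields a different IV.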

If dedup=on for the dataset the per block IVs are generated differently.  They are generated by taking an HMAC-SHA256 of the plaintext and using the leftmost 96 bits of that as the IV.  The key used for the HMAC-SHA256 is different to the one used by AES for the data encryption, but is stored (wrapped) in the same keychain entry.  Just like the data encryption key, a new one is generated when doing a 'zfs key -K <dataset>'.  Obviously we couldn't calculate this IV when doing a read so it has to be stored.
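A minimal sketch of the dedup=on derivation, using Python's standard hmac module.  The 32 byte key used here is a made-up placeholder; in ZFS the HMAC key is randomly generated and lives wrapped in the keychain entry.

```python
import hashlib
import hmac

def derive_dedup_iv(iv_key, plaintext):
    """Return the leftmost 96 bits of HMAC-SHA256(iv_key, plaintext)."""
    return hmac.new(iv_key, plaintext, hashlib.sha256).digest()[:12]

iv_key = bytes(range(32))            # hypothetical key, for illustration only
block = b"\0" * 512                  # one 512 byte block of plaintext
changed = b"\x01" + block[1:]        # same block with its first byte changed

iv_same_1 = derive_dedup_iv(iv_key, block)
iv_same_2 = derive_dedup_iv(iv_key, block)
iv_other = derive_dedup_iv(iv_key, changed)
print(iv_same_1 == iv_same_2, iv_same_1 == iv_other)
```

The IV now depends only on the key and the plaintext, so identical plaintext under the same keys gives an identical IV (and therefore identical ciphertext), which is what lets the dedup stage find a match.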

ZIL blocks

The ZIL log blocks are written in exactly the same way regardless of whether the ZIL is in the main pool disks or a separate intent log (slog) is being used.  The ZIL blocks are encrypted in a different way to blocks going through the "normal" write path; this is because log blocks are formatted on disk differently anyway.  The log blocks are chained together and have a header (zil_chain_t) that indicates what size the log block is and the blkptr_t to the next block as well as an embedded checksum that chains the blocks together.  For encrypted log blocks the MAC from AES CCM/GCM is also stored in this header (zil_chain_t).  It is log blocks rather than log records that are encrypted.  Within a given log block there may be multiple log records.  Some of these log records may contain pointers to blocks that were written directly (via dmu_sync).  In order for us to claim the ZIL when the pool is imported these embedded block pointers need to be readable even if the encryption keys are not available (which they won't be in most cases during the claim phase).  This means that we don't encrypt whole log blocks: the log record headers and any blkptr_t embedded in a log record are in the clear, and the rest of the log block content is encrypted.

How is the passphrase turned into a wrapping key (keysource=passphrase,prompt)?

When the dataset 'keysource' property indicates that a passphrase should be used we have to derive a wrapping key from it.  The wrapping key is derived from the passphrase provided and a per dataset salt (which is stored as a hidden property of the dataset) by using PKCS#5 PBKDF2_HMAC_SHA1 with 1000 iterations.  The wrapping key is not stored on disk.  The salt is randomly generated when the dataset is created (with keysource=passphrase,prompt) and changed each time 'zfs key -c' is run, so even if the passphrase the user provides is the same, the salt and thus the actual wrapping key will be different.
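The derivation can be reproduced with Python's hashlib for illustration.  The 8 byte salt, the 32 byte output length and the UTF-8 encoding of the passphrase are assumptions made for this sketch; only the algorithm (PBKDF2 with HMAC-SHA1) and the 1000 iterations come from the description above.

```python
import hashlib
import os

def derive_wrapping_key(passphrase, salt, key_len=32):
    """Derive a wrapping key with PBKDF2-HMAC-SHA1 and 1000 iterations.

    key_len and the passphrase encoding are assumptions for this sketch.
    """
    return hashlib.pbkdf2_hmac(
        "sha1", passphrase.encode("utf-8"), salt, 1000, dklen=key_len
    )

old_salt = os.urandom(8)   # hypothetical salt size
new_salt = os.urandom(8)   # a fresh salt, as after 'zfs key -c'

key_before = derive_wrapping_key("correct horse", old_salt)
key_after = derive_wrapping_key("correct horse", new_salt)
print(key_before.hex())
print(key_after.hex())
```

With the same passphrase and a new salt the derived keys differ, which is why re-entering the same passphrase on a key change still produces a new wrapping key.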

Looking at the on disk structures

Using mdb macros and zdb we can actually look at some of this.  Remember that mdb and zdb are debugging tools only, use of mdb on a live kernel without understanding what you are doing can corrupt data.  The interfaces used below are not committed interfaces and are subject to change.

Firstly using mdb on the live kernel (of an x86 machine) I've placed a breakpoint on the zio_decrypt function; let's look at the block pointer using the mdb blkptr dcmd:

[2]> <rdi::print zio_t io_bp | ::blkptr
[L0 PLAIN_FILE_CONTENTS] SHA256_MAC OFF LE contiguous unique encrypted 1-copy
size=20000L/20000P birth=10L/10P fill=1

This blkptr_t is for the contents of a file, we can see that it is encrypted and we only have one copy of it - so only one DVA entry. The checksum is SHA256_MAC so the actual MAC value is 2e24913e6b94fbd569cf3cd9.  The blkptr macro doesn't show us the IV that is stored in DVA[2], but we can see that if we print the raw structure using ::print

[2]> <rdi::print zio_t io_bp->blk_dva[2]
blk_dva[2] = {
    blk_dva[2].dva_word = [ 0x521926d500000000, 0x3b13ba46ab9f8a51 ]

Now let's use zdb to look at some things (the output is trimmed slightly for the sake of this article):

# zdb -dd -e tank

Dataset mos [META], ID 0, cr_txg 4, 311K, 54 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         0    1    16K    16K  96.0K    32K   84.38  DMU dnode
         1    1    16K     1K  1.50K     1K  100.00  object directory
         2    1    16K    512      0    512    0.00  DSL directory
         3    1    16K    512  1.50K    512  100.00  DSL props

        26    1    16K   128K  18.0K   128K  100.00  SPA history

        36    1    16K   128K      0   128K    0.00  bpobj
        37    1    512    512  3.00K     1K  100.00  DSL keychain
        38    1    16K     4K  12.0K     4K  100.00  SPA space map
...

This pool (tank) currently has 3 datasets, one of which is encrypted.  We can see from the above zdb output that the keychains are kept in the special "mos" dataset along with some other pool wide metadata.  Now let's look at those keychains in a bit more detail by asking zdb to be more verbose (again the output is trimmed to show relevant information only):

    # zdb -dddd -e tank
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        37    1    512    512  3.00K     1K  100.00  DSL keychain
        dnode flags: USED_BYTES 
        dnode maxblkid: 1
        Fat ZAP stats:
                Pointer table:
                        32 elements
                        zt_blk: 0
                        zt_numblks: 0
                        zt_shift: 5
                        zt_blks_copied: 0
                        zt_nextblk: 0
                ZAP entries: 2
                Leaf blocks: 1
                Total blocks: 2
                zap_block_type: 0x8000000000000001
                zap_magic: 0x2f52ab2ab
                zap_salt: 0x16c6fb
                Leafs with 2^n pointers:
                        5:      1 *
                Blocks with n*5 entries:
                        0:      1 *
                Blocks n/10 full:
                        9:      1 *
                Entries with n chunks:
                        9:      2 **
                Buckets with n entries:
                        0:     14 **************
                        1:      2 **
        Keychain entries by txg:
                txg 5 : wkeylen = 136
                txg 85 : wkeylen = 136

The above keychain object shows it has two entries in it: the lowest numbered one (5) is from when the dataset was initially created and the second one (85) is because I had run 'zfs key -K tank/fs' on the dataset a little later.  Now let's illustrate with zdb what I discussed in the previous article about assured delete, where I described how clones can have a different set of keychain entries to their origin.

To illustrate this I ran the following:

# zfs snapshot tank/fs@1
# zfs clone -K tank/fs@1 tank/fsc1
# zfs key -K tank/fs

First let's look at the keychain object 37 which is for tank/fs, and then at the keychain object for the clone (I've trimmed the output a little more this time):

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        37    2    512    512  7.50K     2K  100.00  DSL keychain
        Keychain entries by txg:
                txg 5 : wkeylen = 136
                txg 85 : wkeylen = 136
                txg 174 : wkeylen = 136

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        101    1    512    512  4.50K  1.50K  100.00  DSL keychain
        Keychain entries by txg:
                txg 5 : wkeylen = 136
                txg 85 : wkeylen = 136
                txg 152 : wkeylen = 136

What we see above is that the original tank/fs dataset now has an additional entry from the 'zfs key -K tank/fs' that was run.  The keychain for the clone (object 101) also has three entries in it: it shares the same entries as tank/fs for txg 5 and txg 85 (though they may be encrypted differently on disk depending on where the wrapping key is inherited from) and it has a unique entry created at txg 152.  We can see similar information by looking at the 'zpool history -il' output:

2010-11-19.05:58:25 [internal encryption key create txg:85] rekey succeeded dataset = 33 [user root on borg-nas]
2010-11-19.06:05:59 [internal encryption key create txg:152] rekey succeeded dataset = 96 from dataset = 77 [user root on borg-nas]
2010-11-19.06:06:40 [internal encryption key create txg:174] rekey succeeded dataset = 33 [user root on borg-nas]

What is decrypted in memory?

As already mentioned the data encryption keys are stored wrapped (encrypted) on disk but they are stored in memory in the clear along with the wrapping key (we need the wrapping key to stay around for 'zfs key -K' and for 'zfs create' where the keysource property is inherited).  They are stored only in non swappable kernel memory (though remember you can swap on an encrypted ZVOL).  They are accessible to someone with all privileges who is able to use mdb on the live kernel or on a crash dump - but so is your plaintext data.  A suitable hardware keystore could be used so that key material is only ever inside its FIPS 140 boundary but that support is not yet complete (note this is not a commitment from Oracle to provide this support in any future release of ZFS or Solaris) - there would be no on disk change required to support it though.

Any data or metadata blocks that are encrypted on disk are in the in-memory cache (ARC) in the clear, this is required because the in memory ARC buffers are sometimes "loaned" using zero copy to other parts of the system - including other file systems such as NFS and CIFS.  If this is too much of a risk for you then you can force the system to always go back to disk and decrypt blocks only when needed but note you will not benefit from the caching and this will have a significant performance penalty: zfs set primarycache=metadata <dataset>.

The L2ARC is not currently available for use by encrypted datasets (note this is not a commitment from Oracle to provide this support in any future release of ZFS or Solaris) it is equivalent to having done 'zfs set secondarycache=none <dataset>'. The DDT for deduplication is not encrypted data and is pool wide metadata (stored in the MOS) so it is still able to be stored in the L2ARC.

All of the above article content could have been discovered by reading the zfs(1M) man page and using mdb, DTrace and zdb while experimenting on a live system, which is actually how I wrote the article.  There is a lot more you can examine about the on disk and in memory state of Solaris, not just ZFS, by using mdb and DTrace - neither of which you can hide from, since the kernel modules have CTF data in them giving full structure definitions.  Note though that unless the interfaces/structures are documented in the Solaris DDI, or other official documentation from Oracle, you are looking at implementation details that are subject to change - often even in an update/patch.

Monday Nov 15, 2010

Having my secured cake and Cloning it too (aka Encryption + Dedup with ZFS)

The main goal of encryption is to make the (presumably sensitive) cleartext data indistinguishable from random data.  Good file system encryption usually aims to have the same plaintext encrypt to different ciphertext at least when written at a different "location" even if the same key is used.  One way to achieve that is to derive the initialisation vector (IV) in some way from where the blocks of the files are stored on disk.  In this respect the encryption support in ZFS is no different: by default we derive the IV from a combination of what dataset / object the block is for and also when (in which transaction) it was written.  This means that the same block of plaintext data written to a different file in the same filesystem will get a different IV and thus different ciphertext.  Since ZFS is copy-on-write and we use the transaction identifier it also means that if we "overwrite" the same block of a file at a later time it still ends up having a different IV and thus will be different ciphertext.  Each encrypted dataset in ZFS has a different set of data encryption keys (see my earlier post on assured delete for more details on that), so there both the IV and the encryption key change, and we have a really high level of confidence of getting different ciphertext when written to different datasets.

The goal of deduplication in storage is to coalesce matching disk blocks into a smaller number of copies (ideally 1, but in ZFS that number depends on the value of the copies property on the dataset and the pool wide dedupditto property so it could be more than 1).  Given the above description of how we do encryption it would seem that encryption and deduplication are fundamentally at odds with each other - and usually that is true.

When we write a block to disk in ZFS it goes through the ZIO pipeline and in doing so a number of transforms are optionally applied to the data:  compress -> encryption -> checksum -> dedup -> raid.

The deduplication step uses the checksums of the blocks to find suitable matches. This means it is acting on the already compressed and encrypted data.  Also in ZFS deduplication matches are searched for in all datasets in the pool with dedup=on.
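To make the ordering concrete, here is a toy model of the compress -> checksum -> dedup part of the pipeline.  It is a sketch only: zlib stands in for ZFS compression, the encryption and raid stages are left out, and the dictionary used as a dedup table is a hypothetical stand-in for the real DDT.

```python
import hashlib
import zlib

def write_blocks(blocks):
    """Run each block through compress -> checksum -> dedup.

    Returns the list of per block checksums ("block pointers") and the
    dedup table mapping checksum -> [stored bytes, reference count].
    """
    ddt = {}
    pointers = []
    for data in blocks:
        stored = zlib.compress(data)              # stand-in for the compress stage
        cksum = hashlib.sha256(stored).digest()   # checksum of what would hit disk
        if cksum in ddt:
            ddt[cksum][1] += 1                    # dedup hit: just bump the refcount
        else:
            ddt[cksum] = [stored, 1]              # first copy: store the block
        pointers.append(cksum)
    return pointers, ddt

blocks = [b"\0" * 4096, b"\0" * 4096, b"\1" * 4096]
pointers, ddt = write_blocks(blocks)
print(len(pointers), "blocks written,", len(ddt), "unique blocks stored")
```

The match is made on the checksum of the transformed data, so anything that makes the transformed bytes differ (such as a different IV or key in an encryption stage) prevents a dedup hit even when the plaintext is identical.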

So we have very little chance of getting any deduplication hits with encrypted datasets because of how the IV is generated and the fact that each dataset has its own set of encryption keys.  In fact not getting hits with deduplication is actually a good test that we are using different keys and IVs and thus getting different ciphertext for the same plaintext.

So encryption=on + dedup=on is pointless, right ?

Not so with ZFS, I wasn't happy about giving up on deduplication for encrypted datasets, so we found a solution, it has some restrictions but I think they are reasonable and realistic ones.

Within what I'll call a "clone family", ie all datasets are clones of the same original dataset or are clones of those clones, we would be sharing data encryption keys in the default case, because they share data (again see my earlier post on assured delete for info on the data encryption keys). So I found a method of generating the IV such that within the "clone family" we will get dedup hits for the same plaintext.  For this to work you must not run 'zfs key -K' on any of the clones and you must not pass '-K' to 'zfs clone' when you create your clones.  Note that dedup does not apply to child datasets only to the snapshots/clones, and by that I mean it doesn't break you just won't get deduplication matches.

So no, it isn't pointless, and what's more for some configurations it will actually work really well.  A common use case for a configuration that does work well is a set of virtualisation images (maybe filesystems for local Zones or ZVOLs shared over iSCSI for OVM or similar) where they are all derived from the same original master by using zfs clones and all get patched/updated with pretty much the same set of patches/updates.  This is a case where clones+dedup work well for the unencrypted case, and one which as shown above can still work well even when encryption is enabled.

The usual deployment caveats with ZFS deduplication still apply, ie it is block based and it works best when you have lots of available DRAM and/or L2ARC for caching the DDT.  ZFS Encryption doesn't add any additional requirements to this. 

So we can happily do this type of thing, and have it "work as expected":

$ zfs create -o compression=on -o encryption=on -o dedup=on tank/builds
$ zfs create tank/builds/master
$ zfs snapshot tank/builds/master@1
$ zfs clone tank/builds/master@1 tank/builds/project-one
$ zfs clone tank/builds/master@1 tank/builds/project-two 

General documentation for ZFS support of encryption is in the Oracle Solaris ZFS Administration Guide in the Encrypting ZFS File Systems section.

Assured delete with ZFS dataset encryption

Need to be assured that your data is inaccessible after a certain point in time ?

Many government agency and private sector security policies allow you to achieve that if the data is encrypted and you can show with an acceptable level of confidence that the encryption keys are no longer accessible.  The alternative is overwriting all the disk blocks that contained the data, which is time consuming, very expensive in IOPS and, in a copy-on-write filesystem like ZFS, actually very difficult to achieve.  So often this is only done on full disks as they come out of production use for recycling/repurposing, but this isn't ideal in a complex RAID layout.

In some situations (compliance or privacy are common reasons) it is desirable to have an assured delete of a subset of the data on a disk (or whole storage pool). Having the encryption policy / key management at that ZFS dataset (file system / ZVOL) level allows us to provide assured delete via key destruction at a much smaller granularity than full disks, it also means that unlike full disk encryption we can do this on a subset of the data while the disk drives remain live in the system.

If the subset of data matches a ZFS file system (or ZVOL) boundary we can provide this assured delete via key destruction; remember ZFS filesystems are relatively cheap.

Let's start with a simple case of two encrypted file systems, each with its own wrapping key:

$ zfs create -o encryption=on -o keysource=raw,file:///media/keys/g projects/glasgow
$ zfs create -o encryption=on -o keysource=raw,file:///media/keys/e projects/edinburgh

After some time we decide we want to make projects/glasgow completely inaccessible.  The simplest way is to just destroy the wrapping key, in this case the one in /media/keys/g, and then destroy the projects/glasgow dataset.  The data on disk will still be there until ZFS starts using those blocks again but since we have destroyed /media/keys/g (which I'm assuming here is on some separate file system) we have a high level of assurance that the encrypted data can't be recovered even by reading "below" ZFS by looking at the disk blocks directly.

I'd recommend a tiny additional step just to make sure that the last version of the data encryption keys (which are stored wrapped on disk in the ZFS pool) are not encrypted by anything the user/admin knows:

$ zfs key -c -o keysource=raw,file:///dev/random projects/glasgow
$ zfs key -u projects/glasgow
$ zfs destroy projects/glasgow

While the step of re-wrapping the keys with a key the user/admin doesn't know doesn't provide a huge amount of additional security/assurance it makes the administrative intent much clearer and at least allows the user to assert that they did not know the wrapping key at the point the dataset was destroyed.

If we have clones this situation is slightly more complex since clones share their data encryption key with their origin - since they share data written before the clone was branched off the clone needs to be able to read the shared and unique data as if it was its own.

We can make sure that the unique data in a clone uses a different data encryption key than the origin does from the point the clone was taken:

... time passes data is placed in projects/glasgow
$ zfs snapshot projects/glasgow@1
$ zfs clone -K projects/glasgow@1 projects/mungo

By passing '-K' to 'zfs clone' we ensure that any unique data in projects/mungo is using a different dataset encryption key from projects/glasgow, this means we can use the same operations as above to provide assured delete for the unique data in projects/mungo even though it is a clone.

Additionally we could also do 'zfs key -K projects/glasgow' and have any new data written to projects/glasgow after the projects/mungo clone was taken use a different data encryption key as well.  Note however that this is not atomic, so I would recommend making projects/glasgow read-only before taking the snapshot even though normally this isn't necessary.  The full sequence then becomes:

$ zfs set readonly=on projects/glasgow
$ zfs snapshot projects/glasgow@1
$ zfs clone -K projects/glasgow@1 projects/mungo
$ zfs set readonly=off projects/mungo
$ zfs key -K projects/glasgow
$ zfs set readonly=off projects/glasgow

If you don't have projects/glasgow marked as read-only then there is a risk that data could be written to projects/glasgow  after the snapshot is taken and before we get to the 'zfs key -K'.  This may be more than is necessary in some cases but it is the safest method.

General documentation for ZFS support of encryption is in the Oracle Solaris ZFS Administration Guide in the Encrypting ZFS File Systems section.


Introducing ZFS Crypto in Oracle Solaris 11 Express

Today Oracle Solaris 11 Express was released and is available for download, this release includes on disk encryption support for ZFS.

Using ZFS encryption support can be as easy as this:

# zfs create -o encryption=on tank/darren
Enter passphrase for 'tank/darren':
Enter again:

If you don't wish to use a passphrase then you can use the new keysource property to specify the wrapping key is stored in a file instead.  See the zfs(1M) man page or ZFS documentation set for more details, but here are a few simple examples:

# zfs create -o encryption=on -o keysource=raw,file:///media/stick/mykey tank/darren

# zfs create -o encryption=aes-256-ccm -o keysource=passphrase,prompt tank/tony

If encryption is enabled and keysource is not specified we default to keysource=passphrase,prompt.  I plan to have other keysource locations available at some point in the future, for example retrieving the wrapping key from an https:// location or from a PKCS#11 accessible hardware keystore or key management system.

There are multiple different keys used in ZFS.  The one the user/admin manages (or that is derived from the entered passphrase) is a wrapping key; this means it is used to encrypt other keys and is not used for data encryption itself.  The data encryption keys are randomly generated at the time the dataset is created (using the kernel interfaces for /dev/random), or when a user/admin explicitly requests a new data encryption key for newly written data in that ZVOL or file system (eg zfs key -K tank/darren).

Back to our simple example of 'tank/darren', let's do a little more. If we now create a filesystem below 'tank/darren', eg 'tank/darren/music', we won't be prompted for an additional passphrase/key since the encryption and keysource properties are inherited by the child dataset.  Note that encryption must be set at create time and it cannot be changed on existing datasets.  Inheriting the keysource means that the child dataset inherits the same wrapping key but we generate new data encryption keys for that dataset.

If you create a clone of an encrypted file system then the clone is always encrypted as well, but the wrapping key for a clone need not be the same as the origin, for example:

# zfs snapshot tank/darren@1
# zfs clone tank/darren@1 tank/darren/sub

# zfs clone tank/darren@1 tank/tony
Enter passphrase for 'tank/tony':
Enter again:

In the first clone above the 'tank/darren/sub' dataset inherits all the encryption properties and wrapping key from 'tank/darren'.  In the second case there was no encrypted dataset at 'tank' to inherit the keysource property from (and thus the wrapping key) so we take the default keysource value of passphrase,prompt.

We can also change the wrapping key at any time using the 'zfs key -c <dataset>' command.  Note that in the case of passphrase this does not prompt for the existing passphrase, this is intentional as the filesystem will already be mounted (and maybe shared) and the files accessible.  It also gives us the ability via ZFS allow delegations to distinguish between which users can load/unload the keys (and thus have the filesystem mount) via the 'key' delegation and those users that can change the wrapping keys with the 'keychange' delegation.

At the time the wrapping key is changed you can also choose to use a different style of wrapping key, say switching from a prompted for passphrase to a key in a file.

The easiest way to create the wrapping keys is to use the existing Solaris pktool(1) command, eg:

$ pktool genkey keystore=file keytype=aes keylen=128 outkey=/media/stick/mykey

ZFS uses the Oracle Solaris Cryptographic Services APIs, as such it automatically benefits from the hardware acceleration of AES available on the SPARC T series processors and on Intel processors supporting AES instructions.

General documentation for ZFS support of encryption is in the Oracle Solaris ZFS Administration Guide in the Encrypting ZFS File Systems section.

For more about Oracle Solaris 11 Express see the articles and white papers site. Particularly the security and storage software articles.

Updated 2010-01-17 to change links for documentation to new location.


Wednesday Dec 16, 2009

Can we improve ZFS dedup performance via SHA256 ?

Today I integrated the fix for 6344836 ZFS should not have a private copy of SHA256.  The original intent of this was to avoid having a duplicate implementation of the SHA256 source code in the ONNV source tree.   The hope was that for some platforms there would be an improvement in the performance of SHA256 over the private ZFS copy and that would have some impact on ZFS performance.   Until deduplication support arrived in ZFS the SHA256 wasn't heavily used by default since the default data checksum is fletcher not SHA256.  However I had been running a variant of this fix in the ZFS crypto project gate for almost 2 years now since when encryption is enabled on a ZFS dataset we force the use of sha256 as the checksum for data/metadata of a dataset.

As part of approving my RTI Jeff Bonwick rightly wanted to know that the change wouldn't regress the performance of the deduplication support that he had just integrated.   So he asked for a ZFS micro test based on deduplication and also for some micro benchmark numbers comparing the time to do a SHA256 digest over the various ZFS block sizes 1k through 128k (in powers of two).  The micro benchmark uses an all zeros block (malloc followed by bzero) of the appropriate size and averages the time to do SHA2Init,SHA2Update,SHA2Final or the equivalent and put it into a zio_cksum_t (a big endian layout of 4 unsigned 64 bit ints); these are the averages in nanoseconds over a run of 10,000 iterations for each block size.

The micro benchmark was run in userland using 64 bit processes but the SHA256 code used is identical to that used by the zfs kernel module and the misc/sha2 kernel module from the cryptographic framework.
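For readers who want to try something similar, the shape of such a micro benchmark can be sketched in Python.  This is not the benchmark I ran (that was C code calling the SHA2 functions directly); it uses hashlib's SHA256, and the function name and the reduced iteration count in the demo loop are choices made just for this example.

```python
import hashlib
import time

def time_sha256(block_size, iterations=10_000):
    """Average nanoseconds to SHA256 one all zeros block of block_size bytes."""
    block = bytes(block_size)                 # all zeros, like malloc + bzero
    start = time.perf_counter_ns()
    for _ in range(iterations):
        hashlib.sha256(block).digest()        # init, update and final in one call
    return (time.perf_counter_ns() - start) / iterations

# Block sizes 1k through 128k in powers of two; fewer iterations than the
# 10,000 default so that the example finishes quickly.
for shift in range(10, 18):
    size = 1 << shift
    print(f"{size // 1024:>4}k  {time_sha256(size, iterations=1_000):>12.0f} ns")
```

The absolute numbers depend heavily on the machine and on the SHA256 implementation behind hashlib, so only the relative growth across block sizes is meaningful.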

Note these are micro benchmarks and may not be indicative of real world performance. I selected two modern machines from Sun's hardware portfolio: an X4140 (the head unit of a Sun Unified Storage 7310) and an UltraSPARC T2 based T5120.  Note that the fix I integrated only uses a software implementation of SHA256 on the T5120 (UltraSPARC T2) and is not (yet) using the on CPU hardware implementation of SHA256.  The reason for this is to do with boot time availability of the Solaris Cryptographic Framework and the need to have ZFS as the root filesystem.  I know how to fix this but it was a slightly more detailed fix and one I didn't think appropriate to introduce in build 131 of OpenSolaris - which means it needs to wait until post 134.

All very nice but as I said that is a micro benchmark; what does this actually look like at the ZFS level?  Again this is still a benchmark and is not necessarily indicative of any possible real world performance improvements; the goal here is to show whether a change in the SHA256 implementation used by ZFS, improvement or not, will be noticed for dedup.   This test was suggested by Jeff as a quick way to determine if there is any impact to dedup in changing the SHA256 implementation - it uses data that will obviously dedup.  Note here the actual numbers aren't important because the goal wasn't to show how fast ZFS can do IO but to show the relative difference between the before and after implementations of SHA256.  As such I didn't build a pool config with the purpose of doing the best possible IO, I just used a single disk pool using one of the available internal drives in each of my machines (again an X4140 and a T5120).

# zpool create mypool -O dedup=on c1t1d0 
# rm -f 100g; sync; ptime sh -c 'mkfile 100g 100g; sync' 

X4140 running build 129 with private ZFS copy of SHA256

real     3:34.386051924
user        0.341710806
sys      1:02.587268898

X4140 running build with misc/sha2

real     2:25.386560294
user        0.317230220
sys        56.600231785

T5120 running build 129 with private ZFS copy of SHA256

real     8:40.703912346
user        2.704046212
sys      4:06.518025697

T5120 running build with misc/sha2

real     5:40.593874259
user        2.704308565
sys      3:59.648897024
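The real times above are easier to compare once converted to seconds.  A small Python sketch that does the arithmetic on the four 'real' values from the ptime output (the helper name is just for this example):

```python
def to_seconds(ptime_value):
    """Convert ptime output such as '3:34.386051924' or '56.600231785' to seconds."""
    seconds = 0.0
    for part in ptime_value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

# (before, after) real times copied from the runs above.
results = {
    "X4140": ("3:34.386051924", "2:25.386560294"),
    "T5120": ("8:40.703912346", "5:40.593874259"),
}

for machine, (before, after) in results.items():
    b, a = to_seconds(before), to_seconds(after)
    print(f"{machine}: {b:.1f}s -> {a:.1f}s ({(b - a) / b:.1%} less real time)")
```

On both machines this works out to roughly a third less real time for this particular test.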

So for both the X4140 and the T5120 there is a noticeable decrease in the real time taken to run the test. In each case I ran the test 6 times and picked the lowest time result - the variance was actually very small anyway (usually under a second).  Given that mkfile produces blocks of all zeros, and compression was not on, there would be massive amounts of dedup hits going on. How much?

# zpool get dedupratio mypool
mypool  dedupratio  819200.00x  -
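That ratio is exactly what the arithmetic predicts, assuming the file system was using the default 128k record size (an assumption; the pool was created without setting it):

```python
file_size = 100 * 2**30       # mkfile 100g
record_size = 128 * 2**10     # assumed default 128k recordsize

blocks_written = file_size // record_size
unique_blocks = 1             # every block is all zeros, so one copy is stored

print(blocks_written // unique_blocks)  # 819200
```

Every one of the 819200 blocks is identical, so they all dedup down to a single stored block.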

Big hits for dedup so lots of exercising of SHA256 in this "test" case.

Again these are tests primarily done to show no regression in performance from switching to the common copy of the SHA256 code, but they have shown that there is actually a significant improvement.  Whether or not you will see that improvement with your real world data on your real systems depends on many factors - not least of which is whether your data is dedupable and whether or not SHA256 was anywhere near being the critical factor in your observed performance.

My RTI advocate was happy I'd done the due diligence and approved my RTI so this is integrated into ONNV during build 131.

Happy deduplicating.

Monday Jun 01, 2009

Encrypting ZFS pools using lofi crypto

I'm running OpenSolaris 2009.06 on my laptop. Soon I'll be running my own development bits of ZFS Crypto, but I can't do that yet because OpenSolaris 2009.06 is based on build 111 and the ZFS crypto source is already at build 116.  Once the /dev repository catches up to 116 (or later) then I can put ZFS Crypto onto my laptop and run it "real world" rather than just pounding on it with the test script.

In the meantime I had a need to store some confidential and also some highly personal data on my laptop - not something I generally do as it is mostly used for remote access or the sources I have on it are all open source.

I also really wanted to take advantage of the ZFS auto-snapshot service for this data.  So I really needed ZFS Crypto but I was also constrained to 2009.06.   I could have gone back and used an older build of ZFS crypto that was in sync with 111 but that would have taken me a few days to backport my current sources - not something that I'd consider useful work given the only person that would benefit from this was me and for a short period of time.

It was also only a relatively small amount of data and performance of access wasn't critical.  I should also mention that storing it on a USB mass storage device of any kind was a complete non starter given where I was going with this laptop at the time - let's just say one can't take USB disks there.

I wrote previously about the crypto capability we have in the lofi(7D) driver - this is "block" device based crypto and knows nothing about filesystems.  As I mentioned before it has a number of weaknesses but by running ZFS on top of it some of those are mitigated for me in the situation I outlined above.  Particularly since I'd end up with "ZFS  - lofi - ZFS".

What I decided to do was use lofi as a "shim" and create a ZFS pool on a lofi device which was backed by a ZVOL.  As I mentioned above, the reason for using ZFS as the end filesystem was to take advantage of the auto-snapshot feature. Using ZFS below lofi means that I can expand the size of the encrypted area by adjusting the size of the ZVOL if necessary.

darrenm$ export PVOL=rpool/export/home/darrenm/pvol
# zfs create -V 1g $PVOL
darrenm$ pktool genkey keystore=pkcs11 label=$PVOL keylen=256 keytype=aes
Enter PIN for Sun Software PKCS#11 softtoken: 
darrenm$ pfexec lofiadm -a /dev/zvol/rdsk/$PVOL -T :::$PVOL -c aes-256-cbc
# zpool create darrenm -O canmount=off  -O checksum=sha256 -O mountpoint=/export/home/darrenm darrenm /dev/lofi/1
# zfs allow darrenm create,destroy,mount darrenm 
darrenm$ zfs create -o canmount=off darrenm/Documents
darrenm$ zfs create darrenm/Documents/Private
That sets things up the first time. This won't be automatically available after a reboot; in fact the "darrenm" pool will be known about but will be in a FAULTED state since its devices won't be present.  To make it available again after reboot do this:
darrenm$ pfexec lofiadm -a /dev/zvol/rdsk/$PVOL -T :::$PVOL -c aes-256-cbc
darrenm$ pfexec zpool import -d /dev/lofi darrenm
darrenm$ zfs mount -a
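
The re-attach steps above are a natural candidate for scripting. Here is a minimal sketch in Python that builds the same three commands from the ZVOL name and the pool name. The function names and the dry-run flag are hypothetical and only for illustration; the command lines themselves are the ones shown above.

```python
import subprocess


def reattach_commands(pvol, pool):
    """Return the three post-reboot commands as argv lists."""
    return [
        # Attach the ZVOL as an encrypted lofi device, key looked up by label.
        ["pfexec", "lofiadm", "-a", f"/dev/zvol/rdsk/{pvol}",
         "-T", f":::{pvol}", "-c", "aes-256-cbc"],
        # Import the pool that lives on the lofi device.
        ["pfexec", "zpool", "import", "-d", "/dev/lofi", pool],
        # Mount the datasets of the imported pool.
        ["zfs", "mount", "-a"],
    ]


def reattach(pvol, pool, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for argv in reattach_commands(pvol, pool):
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failure
```

Calling reattach("rpool/export/home/darrenm/pvol", "darrenm") with the default dry_run=True only prints the commands, which makes it easy to check them before running anything for real.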

A few notes on the choice of naming.  I deliberately named the pool after my username and set the mountpoint of the top level dataset to match where my home dir is mounted (NOT the automounted /home/ entry but the actual underlying one) and set that as canmount=off so that my default home dir is still unencrypted but I can have subdirs below it that are actually ZFS datasets mounted encrypted from the encrypted pool.

The lofi encryption key is stored (encrypted) in my keystore managed by pkcs11_softtoken.  (As it happens my laptop has a TPM so I could have stored it in there - except for the fact that the TPM driver stack isn't fully operational until after build 111).  I chose to name the key using the name of the ZVOL just to keep things simple and to remind me what it is for. The label is arbitrary but using the ZVOL name will help when this is scripted.

I'm planning on polishing this off a bit and making some scripts available, but first I want to try out some more complex use cases including having the "guest" pool be mirrored.  I also want to get the appropriate ZFS delegation and RBAC details correct.  For now though the above should be enough to show what is possible here until the "real" ZFS encrypted dataset support appears.  This isn't designed to be a replacement for the ZFS crypto project but is, for me at least, a usable workaround using what we have in 2009.06.

Saturday Nov 01, 2008

ZFS Crypto update

It has been a little while since I gave an update on the status of the ZFS Crypto project. A lot has happened recently and I've been really "heads down" writing code.

We had believed we were development complete and had even started code review ready for integration. All our new encryption tests passed and I only had one or two small regression test issues to reconfirm/resolve. However a set of changes (and very good and important ones for performance I might add) were integrated into the onnv_100 build that caused the ZFS Crypto project some serious remerge work. It took me just over 2 weeks to get almost back to where we were. In doing that I discovered that the code I had written for the ZFS Intent Log (the ZIL) encryption/decryption wasn't going to work.

The ZIL encryption was "safe" from a crypto view point because all the data written to the ZIL was encrypted. However the way the ZIL is claimed and replayed after a system failure (the only time the ZIL is actually read since it is mostly write only) meant that the claim happened before the pool had any encryption keys available to it. This resulted in the ZFS code just thinking that all the datasets had no ZIL needing to be claimed, so ZFS stayed consistent on disk but anything that was in the ZIL that hadn't been committed in a transaction was lost - so it was equivalent to running without a ZIL. OOPS!

So I had to redesign how the ZIL encryption/decryption works. The ZIL is dealt with in two main stages, first there is the zil_claim which happens during pool import (before the pool starts doing any new transactions and long before userland is up and running). The second stage happens much much later when the filesystems (datasets really because the ZIL applies to ZVOLs too) are mounted - this is good because we have crypto keys available by then.

This meant I needed to really learn how the ZIL gets written to disk. The ZIL has 0 or more log blocks with each log block having 1 or more log records in it. There is a different type of log record (each of a different size) for the different types of sync operations that can come through the ZIL. Each log record has a common part at the start of it that says how big it is - this is good from a crypto view. So I left the common part in the clear and I encrypt the rest of the log record. This is needed so that the claim can "walk" all the log records in all the log blocks and sanity check the ZIL and do the claim IO. So far this is similar to what was done for the DNODES. Unfortunately it wasn't quite that simple (never is really!).
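
To illustrate why the clear common part matters, here is a minimal sketch in Python of walking the records in a log block without any keys. The header layout (four little-endian 64 bit fields: type, record length, txg, sequence) and both function names are assumptions made for this example, not a statement of the real on-disk format.

```python
import struct

# Assumed clear-text common header: type, record length, txg, sequence.
HDR = struct.Struct("<QQQQ")


def make_record(txtype, seq, body):
    """Build a record: clear header followed by an opaque (encrypted) body."""
    return HDR.pack(txtype, HDR.size + len(body), 0, seq) + body


def walk_records(block):
    """Yield (offset, txtype, reclen) for each record in a log block buffer."""
    off = 0
    while off + HDR.size <= len(block):
        txtype, reclen, _txg, _seq = HDR.unpack_from(block, off)
        # Sanity check the length before trusting it to find the next record.
        if reclen < HDR.size or off + reclen > len(block):
            raise ValueError(f"bad record length {reclen} at offset {off}")
        yield off, txtype, reclen
        off += reclen
```

The walker never looks past the header of each record, so the body can stay encrypted while the record boundaries are still found and checked.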

All the log records other than the TX_WRITE type fell into the nice simple case described above. The TX_WRITE records are "funky" and from a crypto view point a real pain in how they are structured. Like all the other log records there is a common section at the start that says what type it is and how big the log record is. What is different about the TX_WRITE log records though is that they have a blkptr_t embedded in them. That blkptr_t needs to be in the clear because it may point off to where the data really is (especially if it is a "big" write). The blkptr_t is at the end of the log record. So no problem, just leave the common part at the start and the blkptr_t at the end in the clear, right ? Well that only deals with some of the cases. There is another TX_WRITE case, where the log record has the actual write data tagged on after the blkptr inside the log record and this is of variable size. Even in this case there is a blkptr_t embedded inside the log record. Problem is that blkptr_t is now "inside" the area we want to encrypt. So I ended up having to have a clear text log common area, encrypted TX_WRITE content, clear text blkptr_t and maybe encrypted data. Good job the OpenSolaris Cryptographic Framework supports passing in a uio_t for scatter gather!
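
The four-region layout described above can be sketched as a small Python function that returns which byte ranges of a TX_WRITE record stay in the clear and which get encrypted. The sizes used here (a 32 byte common header and a 128 byte blkptr_t) and the function name are assumptions for illustration only.

```python
COMMON_LEN = 32    # assumed size of the clear common header
BLKPTR_LEN = 128   # assumed size of the embedded blkptr_t


def tx_write_segments(reclen, fixed_len):
    """Split a TX_WRITE record into (offset, length, encrypted) segments.

    fixed_len is the assumed size of the fixed part of the record, which
    ends with the embedded blkptr_t. Any bytes after it are write data
    carried inside the record itself.
    """
    blkptr_off = fixed_len - BLKPTR_LEN
    if blkptr_off < COMMON_LEN or reclen < fixed_len:
        raise ValueError("record too short for the assumed layout")
    segs = [
        (0, COMMON_LEN, False),                       # common header, clear
        (COMMON_LEN, blkptr_off - COMMON_LEN, True),  # TX_WRITE fields
        (blkptr_off, BLKPTR_LEN, False),              # blkptr_t, clear
    ]
    if reclen > fixed_len:
        segs.append((fixed_len, reclen - fixed_len, True))  # embedded data
    return segs
```

The segments flagged True are the ones that would be handed to the crypto operation as separate scatter/gather entries, which is where a uio_t style interface helps.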

So with all that done, it worked, right ? Sadly no, there was still more to do. Turns out I still had two other special ZIL cases I had to resolve. Not with the log records though. Remember I said there are multiple log records in a log block ? Well at the end of the log block there is a special trailer record that says how big the sum of all the log records is (a good sanity checker!) but this also has an embedded checksum for the log block in it. This is quite unlike how checksums are normally done in ZFS; normally the checksum is in the blkptr_t not with the data. The reason it is done this way for the ZIL is for write performance and the risk is acceptable because the ZIL is very very rarely ever read and only then in a recovery situation. For ZFS Crypto we not only encrypt the data but we have a cryptographically strong keyed authentication as well (the AES_CCM MAC) that is stored in 16 bytes of the blkptr_t checksum field. Problem with the ZIL log blocks is that we don't have a blkptr_t for them we can put that 16 byte MAC into, because the checksum for the log block is inside the log block in the trailer record and the blkptr checksum for the ZIL is used for record sequencing. So were there any reserved or padding fields ? Luckily yes there were, but there were only 8 bytes available, not 16. That is enough though since I could just change the params for CCM mode to output an 8 byte MAC instead of a 16 byte one. A little less security but still plenty sufficient - and especially so since we expect to never need to read the ZIL back - it is still cryptographically sound and strong enough for the size of the data blocks we are encrypting (less than 4k at a time). So with that resolved (this took a lot of time in DTrace and MDB to work out!) I could decrypt the ZIL records after a (simulated) failure.
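
In CCM the MAC length is just a mode parameter, so going from 16 bytes to 8 is a parameter change rather than a different algorithm. The following Python sketch checks a nonce/MAC length pair against the limits in the CCM specification and returns the largest payload those parameters allow. The function name is hypothetical, and the 12 byte nonce in the usage note is the nonce size ZFS Crypto uses.

```python
# MAC (tag) lengths in bytes that the CCM specification allows.
VALID_TAG_LENS = {4, 6, 8, 10, 12, 14, 16}


def ccm_max_payload(nonce_len, tag_len):
    """Validate CCM parameters and return the max payload size in bytes."""
    if not 7 <= nonce_len <= 13:
        raise ValueError("CCM nonce must be 7 to 13 bytes")
    if tag_len not in VALID_TAG_LENS:
        raise ValueError("CCM MAC must be an even length from 4 to 16 bytes")
    # The nonce and the payload length field share 15 bytes of the first block.
    length_field = 15 - nonce_len
    return 2 ** (8 * length_field) - 1
```

With a 12 byte nonce, ccm_max_payload(12, 8) and ccm_max_payload(12, 16) give the same payload limit, far above the sub-4k log blocks discussed here; only the strength of the authentication changes with the shorter MAC.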

Still one last problem to go though, the records weren't decrypting properly. I verified, using DTrace and kmdb, that the data was correct and the CCM MAC was correct. So what was wrong, why wouldn't they decrypt ? That only left the key and the nonce. Verifying the key was correct was easy, and it was. So what was wrong with the nonce ?

We don't actually store the nonce on disk for ZFS Crypto but instead we calculate it based on other stuff that is stored on disk. The nonce for a normal block is made up from: the block birth transaction (txg: a monotonically increasing unsigned 64 bit integer), the object, the level, and the blkid. Those are manipulated (via a truncated SHA256 hash) into 12 bytes and used as the nonce. For a ZIL write the txg is always 0 because it isn't being written in a txg (that is the whole point of it!) but the blkid is actually the ZIL record sequence number which has the same properties as the txg. The problem is that when we replay the ZIL (not when we claim it though) we do have a txg number. This meant we had a different nonce on the decrypt to the encrypt. The solution ? Remove the txg from the nonce for ZIL records - no big loss there since on a ZIL write it is 0 anyway and the blkid (the zil sequence number) has the security properties we want to keep AES CCM safe.
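
The nonce mismatch is easy to reproduce with a minimal sketch in Python. The field packing (little-endian 64 bit integers in the order txg, object, level, blkid) and the function name are assumptions for illustration; only the inputs and the truncation of SHA256 to 12 bytes come from the description above.

```python
import hashlib
import struct


def block_nonce(txg, obj, level, blkid, is_zil=False):
    """Derive a 12 byte nonce from block identity via truncated SHA256."""
    if is_zil:
        # ZIL records: leave the txg out so write and replay agree.
        fields = struct.pack("<QqQ", obj, level, blkid)
    else:
        fields = struct.pack("<QQqQ", txg, obj, level, blkid)
    return hashlib.sha256(fields).digest()[:12]
```

With the txg included, block_nonce(0, 5, 0, 42) at write time and block_nonce(1234, 5, 0, 42) at replay time differ, so the decrypt fails. With is_zil=True both calls return the same 12 bytes, and the sequence number passed as blkid still makes each record's nonce distinct.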

With all that done I managed to get the ZIL test suite to pass. I have a couple of minor follow-on issues to resolve so that zdb doesn't get its knickers in a twist (SEGV really) when it tries to display encrypted log blocks (which it can't decrypt since it is running in userland and without the keys available).

That turned out to be more than I expected to write up. So where are we with schedule ? I expect us to start codereview again shortly. I've just completed the resync to build 102.

Friday Sep 05, 2008

ZFS Crypto Codereview starts today

Prelim codereview for the OpenSolaris ZFS Crypto project starts today (Friday 5th September 2008 at 1200 US/Pacific) and is scheduled to end on Friday 3rd October 2008 at 2359 US/Pacific. Comments received after this time will still be considered but unless they are serious in nature (data corruption, security issue, regression from existing ZFS) they may have to wait until post onnv-gate integration to be addressed; however every comment will be looked at and assessed on its own merit.

For the rest of the pointers to the review materials and how to send comments see the project codereview page.

Thursday Oct 04, 2007

ZFS Crypto Alpha Release

ZFS Crypto (Phase 1) Alpha Release binaries are now available. At the moment this is x86/x64 only. I'm debugging a very strange (non crypto) problem on the SPARC binaries and will make them available when I can.

Tuesday Jul 10, 2007

Niagara Random Number Driver source opened

"6543566 RNG does not need to be in closed source" got integrated recently but for some reason I forgot to post this (it has been sitting as a draft for a while, oops).  
This means that the OpenSolaris Crypto Framework driver for the on-chip random number generator of the Niagara 2 CPU is now available from OpenSolaris under the CDDL as another good example of how to write a plugin for the crypto framework.

When OpenGrok gets updated on the next pass you can see it at this link.

Darren Moffat-Oracle

