So, what makes ZFS so cool? (Part I: a high-level overview)

I have the privilege of doing Solaris 11 tech updates and demos at customer sites. It always amazes me how amazed they are by ZFS. Don't get me wrong, ZFS really is cool. But it isn't exactly new technology; it has been around for a while now, with the first implementations in 2003, and it has been included in Solaris 10 since Update 2 in 2006. Everyone has heard that it is awesome, but every now and then I still get asked for details: so tell me, what really makes ZFS so cool? 

Let me tell you about it. 

First and foremost: What were Sun's motivations to go and implement a new data management technology? Let's see. 

  • They had had enough of the storage capacity limitations of existing filesystems 
  • They were fed up with the complexity and the static nature of existing data management systems 
  • They considered data loss due to silent data corruption unacceptable 
  • They could not bear the thought of partially written I/Os endangering data consistency
  • They wanted consistent rollback functionality to previous states
  • They considered tools external to the filesystem an unreliable and unintegrated way to provide data services
  • They wanted performance through hybrid storage elements and transparent caching/fetching within the pool. 
So, how did they address these issues? 

  • To remove capacity limitations, they made ZFS a 128-bit filesystem. This makes ZFS capable of addressing 256 quadrillion zettabytes. With this address space you could practically store all the digital data ever generated on Earth. That is, you will probably never run into the problem of having to create another zpool because the existing one cannot address more storage. [1] 
  • To reduce administration complexity, they did the following:
    • They moved RAID functionality from external tools (SVM, VxVM, hardware RAID controllers) into the zpool itself.
    • They eliminated semi-static logical volume management completely, and defined filesystems not with a static size, but as simple, hierarchical management points in the pool. 
    • That is, there is no need to grow/shrink volumes, for they do not exist, and no need to grow/shrink filesystems, for they are dynamically growing and shrinking in the zpool with the amount of data changing within them.
    • Also, all you need are two commands, zpool and zfs, with their intuitive subcommands (create/list/destroy/get/...) to manage your data structures (see the short sketch after this list). 
  • To avoid data corruption: they understood that physical bit rot, that is, silent data corruption on the disks, cannot be prevented, so they decided to checksum every single block written into the pool. These checksums are verified at read time, and bad blocks are self-healed from redundant copies (mirror, raidz). 
  • To keep partially completed writes from ruining data consistency, they implemented ZFS as a transactional filesystem. That is, a write either completes entirely or not at all. Also, data is changed in a copy-on-write fashion: the relevant blocks are read, the blocks being changed are not modified in place, and the changed content is written to an unused area, leaving the original blocks untouched. Both the original and the changed state exist at the end of the modification write; the original blocks are then marked as free space (their metadata references released).
  • To be able to roll back to previous states, they snapshot the metadata and, after a copy-on-write modification, simply do not throw the old state away. See? Snapshots, done by not releasing metadata and blocks. It is sometimes easier to snapshot than not to :) 
  • They implemented ZFS-internal data services like encryption, deduplication, compression, snapshots, cloning and more. 
  • To achieve both read and write performance, they implemented a hierarchical and configurable cache mechanism, using main memory, a second-level cache (optionally on SSDs) and disks, with transparent tiering between them. 
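
To make the "two commands" point concrete, here is a minimal sketch of the administration model. The pool name, device names and dataset names below are made up for illustration, and your devices will of course differ:

    # create a mirrored pool and a few filesystems in it
    zpool create tank mirror c1t0d0 c1t1d0
    zfs create tank/home
    zfs create tank/home/alice                 # hierarchical, no size specified anywhere

    # data services are simply properties on a dataset
    zfs set compression=on tank/home
    zfs set dedup=on tank/home/alice

    # snapshot, then roll back if needed
    zfs snapshot tank/home/alice@before_change
    zfs rollback tank/home/alice@before_change

    # verify every block's checksum in the pool and self-heal from redundancy
    zpool scrub tank
    zpool status tank

    # hybrid storage pool: an SSD as a second-level read cache, another as a log device
    zpool add tank cache c2t0d0
    zpool add tank log c2t1d0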

So the next time anyone asks you why ZFS is cool, tell them: 

  • amazing storage addressing capability
  • built-in RAID, no need for volume management, dynamic hierarchical filesystems 
  • no silent data corruption, since everything is checksummed and self-healed
  • transactional and copy-on-write features 
  • snapshots as a natural capability
  • built-in data services 
  • hybrid storage pool performance 

 ...and then of course we haven't even talked about replication, shadow migration, shares, hierarchical filesystems, delegation, cache policies, dynamic property settings, dynamic striping and autoexpand, online version upgrades, etc. Should you want me to write about those in a "Part II: a deeper dive" post, let me know in the comments. 

wbr, 

charlie



[1] Although, who knows. Remember working with 180 KB "large" floppies? Now my phone has a storage capacity 200,000 times larger than that. My current phone could cover the data storage needs of a smaller country back in the '80s. According to my quick exponential estimation, if data keeps growing at this speed, then in around 100 years we will need something beyond 128-bit filesystems :) 

Comments:

What does ZFS stand for? Know anything about data deduplication?

Posted by guest on May 24, 2012 at 03:33 PM CEST #

[1] Regarding capacity limits of ZFS:
To populate a 128-bit filesystem would require a lot of disks. How many? If we use 3 TB disks, then you need the equivalent mass of 10 moons. The moon weighs quite a lot. And you need ten of them.

Or, to populate a 128-bit filesystem, you need to move a lot of electrons. That takes energy. You need more energy than it would take to boil all the water on Earth. Oceans, rivers, everything would boil away when you move that many electrons. And where do you store all those electrons? On disks?

The entire universe consists of 10^78 atoms. That is much more than 2^128, but still, 128 bits is more than mankind will ever need.

Posted by kebabbert on May 27, 2012 at 11:39 PM CEST #

ZFS originally stood for "Zettabyte File System", but since it can address way more than that, and that name focuses only on its storage capacity, leaving its other great features out in the cold (see above :) ), it does not stand for anything anymore.

It is simply ZFS, although some creative minds did suggest that it should stand for "Zen for Storage" or, in the traditional recursive way, "ZFS File System" ;)

Posted by Karoly Vegh on June 22, 2012 at 03:40 PM CEST #

About deduplication: with Solaris 11 this ZFS capability has been enabled. It is inline deduplication, meaning it happens at write time. Also, deduplication can be enabled on a per-dataset basis. You can read more about it here, in the Solaris 11 Documentation Library: http://docs.oracle.com/cd/E23824_01/html/821-1448/gbscy.html#gjhbo , and on OTN: http://www.oracle.com/technetwork/server-storage/solaris11/technologies/zfs-338092.html
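
For illustration, enabling it really is just a property switch; the pool/dataset name here is made up:

    zfs set dedup=on tank/data        # new writes to this dataset are deduplicated inline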

Posted by Karoly Vegh on June 22, 2012 at 03:45 PM CEST #

That was an excellent overview about ZFS. Please do write "Part II: a deeper dive" post.

Regards,
Satish

Posted by Satish Gummadi on June 29, 2012 at 09:30 AM CEST #

I created a disk with tracks and sectors that ended with checksums that barely cost any performance and saw it was good.
I implemented smart technology in the firmware of the disks that barely costs any performance and saw it was good.
I grouped multiple disks together into a RAID with some redundancy and a slight bit of performance loss and saw it was good.
I took multiple raid systems, grouped them together in redundant clusters with a tolerable overhead and saw it was good.
I created a journaling filesystem on this clustered filesystem which logged its changes with only a slight overhead and saw it was good.
I installed an RDBMS like Oracle on this filesystem which has its own internal block formatting and checksums which are checked at every access with only a slight overhead and saw it was good.
I back this RDBMS up daily while checking the check-sums of every individual block to prevent silent corruption from going undetected and restore/recover any block if it would ever show up with only a bit of overhead and saw it was good.
I am a happy camper.

Posted by guest on August 10, 2012 at 10:47 PM CEST #

Yes, let's dive into Part II, III, ...
We need to know a bit more about this great file-system!

Cheers,
Carlos.

Posted by guest on September 03, 2012 at 04:22 PM CEST #

Please explain how to:

"dynamically growing and shrinking in the zpool with the amount of data changing within them"

Is it like BTRFS, where you add/remove a disk and do btrfs balance <pool> and it automatically grows/shrinks the raid?

Posted by guest on September 12, 2012 at 09:42 PM CEST #

About the ZFS filesystems' sizes growing and shrinking:

The definition above isn't perfect, I admit. Zpools are created on fixed-size disks/LUNs (with RAID levels, and they can grow dynamically though). But ZFS filesystems are not created with a fixed size. They are created in a hierarchical fashion within a zpool. They do not have specific sizes. At any time they consume as much space in the zpool as the data within them (plus the metadata) uses.
In a 100G test_zpool you can have two (or gazillions more) ZFS filesystems next to (or hierarchically above/below) each other, where, let's say, /test_zpool/zfs_A has 80G of data and /test_zpool/zfs_B contains 10G. Now, you can remove all the data in zfs_A and suddenly zfs_B has 80G more free space too, because they are in the very same zpool.

No fixed filesystem sizes; ZFS filesystems are "merely" points of administration with different properties.
Of course you can set space reservations and quotas on these filesystems, but those are different - though related - features.
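
A quick, made-up illustration of the above (the pool, filesystem and device names are invented):

    zpool create test_zpool c3t0d0
    zfs create test_zpool/zfs_A
    zfs create test_zpool/zfs_B
    zfs list -r test_zpool                     # both filesystems see the whole pool's free space
    zfs set quota=20g test_zpool/zfs_B         # optional: cap how much zfs_B may use
    zfs set reservation=10g test_zpool/zfs_A   # optional: guarantee space to zfs_A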

wbr,

charlie

Posted by Karoly Vegh on October 25, 2012 at 08:56 PM CEST #

Part II: a deeper dive required

Posted by guest on January 11, 2013 at 02:56 AM CET #
