In a previous post I scratched the surface of how ZFS uses the ZFS
Intent Log (ZIL), and how the 7000 Series uses Solid State Disk (SSD)
to accelerate its performance. After having presented the Hybrid
Storage Pool to more than a hundred customers, I can say that
questions about how the 7000 leverages SSDs and how it handles SSD
failure are among the most frequently asked. I hope that I can
expand on my previous entry here and explain things in clear detail. I apologize in advance that my artwork is not nearly what it could be, but I wanted to share the information I have.
Before we can cover the detail of how
the file system leverages SSD and handles SSD failure, we need to
understand the basic components of ZFS, and how the data flows
between them. The ZFS file system is made up of a number of modules
and layers. The interfaces that we use to store data run as modules
at the top of the stack. In the 7000, we know these as Filesystems
and LUNs in the Shares section of the BUI.
Both user-level interfaces connect
via transactions to a layer called the Data Management Unit (DMU).
The DMU manages the storage and retrieval of data independent of its
structure (the structure is implemented above by the modules that
give us Filesystems and LUNs); it is the coordinator, orchestrating
the movement of data between the various components below.
One of the key components it manages is
called the Adaptive Replacement Cache (ARC). ARC is used as cache
for both read and write operations as well as key file system data
and metadata. With the exception of the cached copy of the ZIL (more
on that later), and the actual write data cache, anything that can
live in the ARC can also live in the Level 2 ARC (L2ARC) which is a
'disk' based extension of primary cache designed to operate as a
second tier in the system storage model. I will cover the L2ARC as
it relates to the 7000 in more detail later, but if you're itching
for details now, check out Brendan's blog entry on it here.
Another component managed by the DMU
is the ZIL. As I discussed in my previous entry, the ZIL is the
journal that allows the file system to recover from system failures.
The ZIL must always exist on non-volatile storage to ensure it is
available for recovery. By default, the ZIL is stored inside the
storage pool; however, it can also be stored on a dedicated disk
device called a log device. Regardless of how the system is
configured to store the ZIL, it is always cached in system memory
while running in order to improve performance.
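The journaling idea above can be sketched in a few lines. This is a toy model, not ZFS's actual log format; the class and method names are hypothetical:

```python
# Illustrative sketch of intent-log journaling and replay.
# This is NOT ZFS code; the record format and names are hypothetical.

class IntentLog:
    """Append-only journal of writes that have not yet been
    committed to the main pool."""

    def __init__(self):
        self.records = []          # stands in for non-volatile log storage

    def log_write(self, offset, data):
        # Each write is recorded in the log before it is acknowledged.
        self.records.append((offset, data))

    def replay(self, pool):
        # After a crash, uncommitted records are replayed into the pool.
        for offset, data in self.records:
            pool[offset] = data
        self.records.clear()       # log space can now be reclaimed

zil = IntentLog()
zil.log_write(0, b"hello")
zil.log_write(8, b"world")
pool = {}                          # simulated pool, empty after a "crash"
zil.replay(pool)                   # recovery reconstructs the writes
```

The point of the sketch is that a crash before commit loses nothing: replay reconstructs the pool state from the journal.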
Below all of the caching tiers is the
disk pool itself. It is built from groups of disk devices. In the
Hybrid Storage Pool, this is where the data protection happens.
Translating This to the 7000 Series
In the Sun Storage 7000 Series, we use
SSD to accelerate some of the components of the storage
infrastructure. First, we use Write-Optimized SSDs to store the
non-volatile copy of the ZIL. For our use case, the devices we are
shipping with the system today are capable of about 10,000 operations
per second and use a supercapacitor to ensure that the device can
stay powered long enough to write all data to the flash chips.
Second, we use Read-Optimized SSDs to store the L2ARC. The devices
we are shipping with the system today vary in read performance
depending on the size of the operation being used, but are somewhere
between 16 and 64 times faster than a standard disk device for read
operations.
Q: How does data get into the Write Optimized SSD?
A: First, either a filesystem or LUN
receives the new data to be written. That module then creates a
transaction to add the new data to the currently open transaction
group (TXG) in the DMU. As part of the transaction, the data is sent
to the ARC while the Write Optimized SSD containing the ZIL is
updated to reflect the changes. As new transactions continue to
happen, they are logged sequentially to the ZIL.
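The flow just described can be sketched as follows. All names here are illustrative stand-ins, not actual ZFS interfaces:

```python
# Sketch of the write path described above: new data joins the open
# transaction group, is cached in the ARC, and is logged sequentially
# to the ZIL on the write-optimized SSD. Purely illustrative.

class TransactionGroup:
    def __init__(self, txg_id):
        self.txg_id = txg_id
        self.pending = []          # transactions awaiting commit

class WritePath:
    def __init__(self):
        self.arc = {}              # in-memory cache (DRAM)
        self.zil = []              # non-volatile log (write-optimized SSD)
        self.open_txg = TransactionGroup(1)

    def write(self, offset, data):
        # 1. Add the transaction to the currently open TXG.
        self.open_txg.pending.append((offset, data))
        # 2. Send the data to the ARC.
        self.arc[offset] = data
        # 3. Append the change sequentially to the ZIL.
        self.zil.append((self.open_txg.txg_id, offset, data))

wp = WritePath()
wp.write(0, b"a")
wp.write(512, b"b")
```

Note that each write lands in two places at once: the ARC (volatile, fast) and the ZIL (non-volatile, recoverable).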
Q: I've heard that SSDs can "wear out" if you write to them too many times. How do you prevent that from happening to the Write Optimized SSD?
A: The system treats
the SSD as a circular buffer, starting to write at the beginning of
the disk, continuing in order until it reaches the end, and then
resuming again at the beginning of the disk. This sequential pattern
helps to minimize the risk of 'wearing out' the SSD over time. Some people I have explained this to
express concern that the system could overwrite data required for
recovery in this model; however, the system is very aware of which parts of the disk contain active data and which contain inactive data, and it only reuses the inactive parts.
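A toy circular log makes the pattern concrete: writes proceed sequentially, wrap at the end, and refuse to reuse a slot that still holds active data. This is a simplified model, not the actual ZIL allocator:

```python
# Toy circular log illustrating the wear-friendly write pattern
# described above. Illustrative only; not ZFS's log allocator.

class CircularLog:
    def __init__(self, slots):
        self.buf = [None] * slots
        self.active = [False] * slots  # which slots hold live data
        self.head = 0

    def append(self, record):
        if self.active[self.head]:
            raise RuntimeError("log full: would overwrite active data")
        self.buf[self.head] = record
        self.active[self.head] = True
        self.head = (self.head + 1) % len(self.buf)

    def retire(self, slot):
        # Called once a transaction group commit makes this record
        # recoverable from the pool itself.
        self.active[slot] = False

log = CircularLog(4)
for r in ("r0", "r1", "r2", "r3"):
    log.append(r)
log.retire(0)       # slot 0's data is now committed to the pool
log.append("r4")    # head wrapped to slot 0, which is safe to reuse
```

Retiring committed slots is what makes the wrap safe: the head only ever lands on regions whose contents are already recoverable from the pool.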
Q: So how does the
data get from the Write Optimized SSD to disk?
A: The answer is that it doesn't -- at least not in the way most people
think. The trick here is that the ZIL is actually cached in the ARC
for performance reasons. So, every few seconds, when the system
begins a commit cycle for the current transaction group, it reads the
copy of the ZIL in memory. This is the point at which the data will
be integrated into the pool. If the data requires compression, it
will be compressed and then a checksum for the data is generated.
The system decides where the data should live, and then finally the
data is synchronized from the ARC to disk.
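The commit cycle above can be sketched as a small pipeline. CRC32 and zlib stand in here for ZFS's own checksums and compression, which are different algorithms; the names are illustrative:

```python
# Sketch of the transaction group commit described above: read the
# in-memory ZIL, compress each record, generate a checksum, place the
# block, and sync it to disk. Illustrative only; real ZFS uses its
# own checksum and compression algorithms and block allocator.
import zlib

def commit_txg(zil_in_memory, pool, compress=True):
    for offset, data in zil_in_memory:
        blob = zlib.compress(data) if compress else data   # compression step
        checksum = zlib.crc32(blob)                        # checksum generation
        # "Allocation" here is just choosing a key; real ZFS picks an
        # actual block location in the pool.
        pool[offset] = (blob, checksum)                    # sync to disk
    zil_in_memory.clear()   # these entries are now recoverable from the pool

pool = {}
mem_zil = [(0, b"hello" * 10), (64, b"world" * 10)]
commit_txg(mem_zil, pool)
```

Once the records are in the pool, the corresponding ZIL entries are no longer needed for recovery, which is why the log can be treated as a circular buffer.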
Q: What happens if a Write Optimized SSD fails?
A: From my previous post:
"If the ZIL is stored on a single
SSD, and that device fails, the system has a window to flush the ZIL
from memory to disk (the Transaction Group Commit I mentioned
earlier). Typically in the 7000 Series, this flush happens every 1-5
seconds, but it can take up to 30 seconds on an extremely busy
system. Once the data is flushed from memory to disk, the system
will use the disk pool to store the ZIL for the next transaction
group. This window is the only time in a 7000 series where there is a
chance for data loss. We mitigate this risk by mirroring the Write
Optimized SSD's in the system."
Q: How does data get into the Read Optimized SSD?
A: As I mentioned earlier, the Read
Optimized SSD is used in the 7000 Series to hold the L2ARC. Since we
would prefer to return the most popular data directly from our first
level cache in DRAM, we use L2ARC to hold data that has a history of
being useful, but hasn't been accessed as recently or as frequently
as other data. As the ARC fills up, the system begins to scan the
cache for the data that has been accessed least frequently or
recently. After finding enough candidates, it begins to copy those
blocks from ARC to L2ARC. While this process is happening, the data
is still active in the ARC, so if a client requested it, it could be
returned from memory. The process that fills the L2ARC operates in
batches so that it issues a few larger writes rather than many
frequent smaller ones, which improves performance.
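A minimal sketch of that feed process, with the ARC modeled as an ordered dict whose oldest entries are the eviction candidates. The real ARC policy blends recency and frequency; this simplification is purely illustrative:

```python
# Sketch of the L2ARC feed described above: scan the ARC for the
# coldest entries and copy a batch of them to the L2ARC, leaving
# them active in the ARC. Policy and names are simplified.
from collections import OrderedDict

def feed_l2arc(arc, l2arc, batch_size):
    """arc: OrderedDict ordered oldest-first; l2arc: plain dict."""
    candidates = list(arc.items())[:batch_size]   # coldest entries
    for key, value in candidates:                 # one batched write
        l2arc[key] = value                        # data stays in the ARC too

arc = OrderedDict([("a", 1), ("b", 2), ("c", 3), ("d", 4)])
l2arc = {}
feed_l2arc(arc, l2arc, batch_size=2)
```

Note that the feed copies rather than moves: nothing is evicted from the ARC here, so a hit on a fed block is still served from memory.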
Q: How do you prevent the Read Optimized SSD from "wearing out"?
A: Similar to the ZIL, the system
writes to the L2ARC in a circular fashion to reduce the risk of wear.
Q: When does the system read from the Read Optimized SSD instead of Memory or Disk?
A: When the system starts to run out of
space in the ARC, it will attempt to evict the data that has been
accessed least recently or frequently, the same data we copied to the
L2ARC earlier. Now that the data has been evicted from the ARC, the
lowest latency copy is living in L2ARC. When the next read request
comes for that data, the system will find that the data is no longer
available in the ARC, and will check the L2ARC to see if it has a
copy. If a copy does exist in L2ARC, the checksum will be compared
to ensure that there has been no corruption, and then the data will
be returned at microsecond latencies. If, during the checksum
comparison, the system finds that the data has for some reason
become corrupt in the L2ARC, it releases that copy of the data
and reads the correct data from the disk pool.
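The tiered lookup just described can be sketched as a single function: check the ARC, then the L2ARC with checksum verification, and fall back to the pool. CRC32 stands in for ZFS's own block checksums; everything here is illustrative:

```python
# Sketch of the read path described above: ARC first, then L2ARC
# (verifying the checksum), then the disk pool. Illustrative only.
import zlib

def read_block(key, arc, l2arc, disk):
    if key in arc:                          # fastest path: DRAM hit
        return arc[key]
    if key in l2arc:                        # next: read-optimized SSD
        data, checksum = l2arc[key]
        if zlib.crc32(data) == checksum:    # verify before returning
            return data
        del l2arc[key]                      # corrupt copy: release it
    return disk[key]                        # the pool always has good data

disk = {"blk": b"payload"}
good = (b"payload", zlib.crc32(b"payload"))
bad = (b"garbled", -1)                      # checksum can never match
result_cached = read_block("blk", {}, {"blk": good}, disk)
l2 = {"blk": bad}
result_fallback = read_block("blk", {}, l2, disk)   # falls back to disk
```

The corrupt-copy branch is the key safety property: because every L2ARC block carries a checksum, a bad cached copy can never be returned to a client, only discarded.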
Q: What about the Read Optimized SSD? What happens if it fails?
A: The L2ARC is what we call a clean
cache, meaning that all of the data stored in the L2ARC is also
available somewhere on disk. So if an L2ARC device fails, the system
continues to operate, serving read requests that would have been
cached by that device directly from disk.