Q&A on Hybrid Storage and SSDs
By user12674320 on Apr 08, 2009
In a previous post I scratched the surface of how ZFS uses the ZFS Intent Log (ZIL), and how the 7000 Series uses Solid State Disk (SSD) to accelerate its performance. After having presented the Hybrid Storage Pool to more than a hundred customers, I can say that questions around how the 7000 leverages SSDs, and how it handles SSD failure are among the most frequently asked. I hope that I can expand on my previous entry here and explain things in clear detail. I apologize in advance that my artwork is not nearly what it could be, but I wanted to share the information I have.
Before we can cover the detail of how the file system leverages SSD and handles SSD failure, we need to understand the basic components of ZFS, and how the data flows between them. The ZFS file system is made up of a number of modules and layers. The interfaces that we use to store data run as modules at the top of the stack. In the 7000, we know these as Filesystems and LUNS in the Shares section of the BUI.
Both user level interfaces connect using transactions to a layer called the Data Management Unit (DMU). The DMU manages the storage and retrieval of data independent of its structure (the structure is implemented above by the modules that give us Filesystems and LUNs); it is the coordinator, orchestrating the movement of data between the various components below.
One of the key components it manages is called the Adaptive Replacement Cache (ARC). The ARC is used as a cache for both read and write operations, as well as for key file system data and metadata. With the exception of the cached copy of the ZIL (more on that later) and the actual write data cache, anything that can live in the ARC can also live in the Level 2 ARC (L2ARC), which is a 'disk' based extension of primary cache designed to operate as a second tier in the system storage model. I will cover the L2ARC as it relates to the 7000 in more detail later, but if you're itching for details, check out Brendan's blog entry on it here.
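To make the tiering concrete, here is a minimal sketch of how a read might flow through the layers just described: first-tier cache, then the SSD-backed second tier, then the disk pool. The class and method names are purely illustrative, not actual ZFS code.

```python
# Hypothetical two-tier cache lookup: DRAM first, SSD second, disk last.
class TieredCache:
    def __init__(self):
        self.arc = {}     # first tier: primary cache in DRAM
        self.l2arc = {}   # second tier: read-optimized SSD

    def read(self, block_id, disk):
        if block_id in self.arc:          # fastest path: DRAM hit
            return self.arc[block_id], "arc"
        if block_id in self.l2arc:        # second tier: SSD hit
            data = self.l2arc[block_id]
            self.arc[block_id] = data     # promote back into the first tier
            return data, "l2arc"
        data = disk[block_id]             # slowest path: the disk pool
        self.arc[block_id] = data
        return data, "disk"
```

The point of the sketch is only the ordering of the tiers; real ARC behavior (adaptive sizing, eviction policy) is far richer.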
Another component managed by the DMU is the ZIL. As I discussed in my previous entry, the ZIL is the journal that allows the file system to recover from system failures. The ZIL must always exist on non-volatile storage in order to ensure it will be there to recover from. By default, the ZIL is stored inside the storage pool, however it can also be stored on a dedicated disk device called a log device. Regardless of how the system is configured to store the ZIL, it is always cached in system memory while running in order to improve performance.
Below all of the caching tiers is the disk pool itself. It is built from groups of disk devices. In the Hybrid Storage Pool, this is where the data protection happens.
Translating This to the 7000 Series
In the Sun Storage 7000 Series, we use SSDs to accelerate some of the components of the storage infrastructure. First, we use Write-Optimized SSDs to store the non-volatile copy of the ZIL. For our use case, the devices we are shipping with the system today are capable of about 10,000 operations per second and use a supercapacitor to ensure that the device can stay powered long enough to write all data to the flash chips. Second, we use Read-Optimized SSDs to store the L2ARC. The devices we are shipping with the system today vary in read performance depending on the size of the operation being used, but are somewhere between 16 and 64 times faster than a standard disk device for read operations.
Q: How does data get into the Write Optimized SSD?
A: First, either a filesystem or LUN receives the new data to be written. That module then creates a transaction to add the new data to the currently open transaction group (TXG) in the DMU. As part of the transaction, the data is sent to the ARC while the Write Optimized SSD containing the ZIL is updated to reflect the changes. As new transactions continue to happen, they are logged sequentially to the ZIL.
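The write path above can be sketched in a few lines: the data lands in the in-memory cache while a record of the transaction is appended, in order, to the intent log. This is an illustrative model only; the names and structures are mine, not ZFS internals.

```python
# Hedged sketch of the write path: each write goes to the in-memory
# cache and is simultaneously logged sequentially to the ZIL on the
# write-optimized SSD, tagged with its transaction group (TXG).
class WritePath:
    def __init__(self):
        self.arc = {}   # in-memory copy of the data
        self.zil = []   # sequential intent log on the SSD

    def write(self, txg, block_id, data):
        self.arc[block_id] = data
        # transactions are appended strictly in order
        self.zil.append((txg, block_id, data))
```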
Q: I've heard that SSDs can "wear out" if you write to them too many times. How do you prevent that from happening to the Write Optimized SSD?
A: The system treats the SSD as a circular buffer: it starts writing at the beginning of the disk, continues in order until it reaches the end, and then resumes again at the beginning. This sequential pattern helps to minimize the risk of 'wearing out' the SSD over time. Some people I have explained this to express concern that the system could overwrite data required for recovery in this model; however, the system is very aware of which parts of the disk contain active data and which parts contain inactive data.
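The circular-log idea can be sketched as a ring buffer that tracks which slots still hold active records and refuses to reuse them. This is a toy model under my own assumptions, not the actual on-disk layout.

```python
# Illustrative circular log: writes wrap around the device, but slots
# still holding active (not-yet-committed) records are never reused.
class CircularLog:
    def __init__(self, slots):
        self.buf = [None] * slots
        self.head = 0          # next slot to write
        self.active = set()    # slots whose records are still needed

    def append(self, record):
        if self.head in self.active:
            raise RuntimeError("log full: would overwrite active data")
        self.buf[self.head] = record
        self.active.add(self.head)
        slot = self.head
        self.head = (self.head + 1) % len(self.buf)
        return slot

    def retire(self, slot):
        # called once a transaction group commit makes a record inactive
        self.active.discard(slot)
```

Retiring slots as transaction groups commit is what lets the head wrap around safely instead of clobbering data still needed for recovery.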
Q: So how does the data get from the Write Optimized SSD to disk?
A: Surprisingly, the answer is that it doesn't -- at least not in the way most people think. The trick here is that the ZIL is actually cached in the ARC for performance reasons. So, every few seconds, when the system begins a commit cycle for the current transaction group, it reads the copy of the ZIL in memory. This is the point at which the data is integrated into the pool. If the data requires compression, it is compressed, and then a checksum for the data is generated. The system decides where the data should live, and finally the data is synchronized from the ARC to disk.
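The commit cycle described above (compress, checksum, place, sync) can be outlined as follows. Everything here is a simplification under my own naming; real ZFS chooses on-disk locations and checksum algorithms quite differently.

```python
import zlib
import hashlib

# Hypothetical sketch of a transaction group commit: each block in the
# open TXG is optionally compressed, checksummed, and written to the pool.
def commit_txg(txg_blocks, pool, compress=True):
    for block_id, data in txg_blocks.items():
        payload = zlib.compress(data) if compress else data
        checksum = hashlib.sha256(payload).hexdigest()
        # "placement" is just a dict insert here; the real system
        # allocates an on-disk location before issuing the writes
        pool[block_id] = (payload, checksum)
    return len(txg_blocks)
```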
Q: What happens if a Write Optimized SSD fails?
A: From my previous post:
"If the ZIL is stored on a single SSD, and that device fails, the system has a window to flush the ZIL from memory to disk (the Transaction Group Commit I mentioned earlier). Typically in the 7000 Series, this flush happens every 1-5 seconds, but it can take up to 30 seconds on an extremely busy system. Once the data is flushed from memory to disk, the system will use the disk pool to store the ZIL for the next transaction group. This window is the only time in a 7000 series where there is a chance for data loss. We mitigate this risk by mirroring the Write Optimized SSD's in the system."
Q: How does data get into the Read Optimized SSD?
A: As I mentioned earlier, the Read Optimized SSD is used in the 7000 Series to hold the L2ARC. Since we would prefer to return the most popular data directly from our first level cache in DRAM, we use the L2ARC to hold data that has a history of being useful, but hasn't been accessed as recently or as frequently as other data. As the ARC fills up, the system begins to scan the cache for the data that has been accessed least frequently or recently. After finding enough candidates, it begins to copy those blocks from the ARC to the L2ARC. While this process is happening, the data is still active in the ARC, so if a client requested it, it could still be returned from there. The process that fills the L2ARC operates in batches, so that there are a few larger writes rather than many frequent smaller ones, which improves performance.
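The batched feed can be sketched like this: take the least recently used candidates from the ARC and copy them to the L2ARC in one pass. The function name and the simple LRU ordering are my own illustrative assumptions.

```python
# Hedged sketch of the L2ARC feed: copy a batch of the least recently
# used ARC entries to the L2ARC. Note copy, not move -- the blocks
# remain readable from the ARC while the batch is written out.
def feed_l2arc(arc_lru_order, arc, l2arc, batch_size):
    # arc_lru_order lists block ids from least to most recently used
    candidates = arc_lru_order[:batch_size]
    for block_id in candidates:
        l2arc[block_id] = arc[block_id]
    return candidates
```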
Q: How do you prevent the Read Optimized SSD from "wearing out"?
A: Similar to the ZIL, the system writes to the L2ARC in a circular fashion to reduce the risk of wear over time.
Q: When does the system read from the Read Optimized SSD instead of Memory or Disk?
A: When the system starts to run out of space in the ARC, it will attempt to evict the data that has been accessed least recently or frequently -- the same data we copied to the L2ARC earlier. Once the data has been evicted from the ARC, the lowest-latency copy lives in the L2ARC. When the next read request arrives for that data, the system will find that the data is no longer available in the ARC and will check the L2ARC for a copy. If a copy does exist in the L2ARC, its checksum is compared to ensure that there has been no corruption, and then the data is returned at microsecond latencies. If during the checksum comparison the system finds that the data has for some reason become corrupt in the L2ARC, it releases that copy of the data and reads the correct data from the disk pool.
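The read fallback just described can be condensed into a short sketch: miss the ARC, try the L2ARC with checksum verification, and fall back to the authoritative copy on disk if the cached copy is corrupt. Names and the SHA-256 checksum are illustrative assumptions, not the actual ZFS implementation.

```python
import hashlib

# Hypothetical read path: ARC hit, else verified L2ARC hit, else disk.
def read_block(block_id, arc, l2arc, disk):
    if block_id in arc:
        return arc[block_id], "arc"
    if block_id in l2arc:
        data, checksum = l2arc[block_id]
        if hashlib.sha256(data).hexdigest() == checksum:
            return data, "l2arc"
        del l2arc[block_id]      # corrupt cached copy: release it
    return disk[block_id], "disk"  # authoritative copy in the pool
```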
Q: What about the Read Optimized SSD, what happens if it fails?
A: The L2ARC is what we call a clean cache, meaning that all of the data stored in the L2ARC is also available somewhere on disk. So if an L2ARC device fails, the system continues to operate, returning read requests that would have been cached by that device directly from disk.