The initial topic from my list is reARC, a major rearchitecture of the code that manages the ZFS in-memory cache along with its interface to the DMU. The ARC is, of course, a key enabler of ZFS's high performance. As systems grow in memory size, CPU count, and frequency, major changes were required for the ARC to keep pace. reARC is such a large body of work that I can only cover a few of its aspects in this installment of the Wonders of ZFS Storage series.
In this article, I describe how the reARC project improved at least
these seven important aspects of its operation:
- Managing metadata
- Handling ARC accesses to cloned buffers
- Scalability of cached and uncached IOPS
- Steadier ARC size under steady state workloads
- Improved robustness for more reliable code
- Reduction of the L2ARC memory footprint
- Finally, a solution to the long-standing issue of I/O priority inversion
The diversity of topics covered is a great illustration of the incredible work handled by the ARC and a testament to the importance of ARC operations to all other ZFS
subsystems. I'm truly amazed that a single project was able to deliver all this goodness in one fell swoop.
No Meta Limits
Previously, the ARC claimed to use a two-state model:
- "most recently used" (MRU)
- "most frequently used" (MFU)
But it further subdivided these states into data and metadata lists.
That model, with four main memory lists, created a problem for ZFS. The ARC algorithm
gave us only one target size for each of the two states, MRU and MFU. Having two lists (data and metadata) but only one target size for the aggregate meant that when we needed to shrink a state, we simply didn't have the information required to decide which list to trim. This led to the presence of an ugly tunable, arc_meta_limit, which was impossible to set properly and was a source of problems for customers.
This problem raises an interesting point and a pet peeve of mine. Many people I've interacted with over the years defended the position that metadata was worth special protection in a cache. After all, metadata is necessary to get to data, so it has intrinsically higher value and should be kept around more. The argument is certainly
sensible on the surface, but I was on the fence about it.
ZFS manages every access through a least-recently-used (LRU) scheme. A new access to a block, data or metadata alike, moves that block back to the head of the LRU list, well protected from eviction, which happens at the tail of the list.
When considering special protection for metadata, I've always stumbled on this question:

    If some buffer, be it data or metadata, has not seen any access for a
    sufficient amount of time, such that the block is now at the tail of an
    eviction list, what is the argument that says I should protect
    that block based on its type?
I came up blank on that question. If it hasn't been used, it can be evicted, period.
Furthermore, even after taking this stance, I was made aware of an interesting fact about ZFS. Indirect blocks, the blocks that hold a set of block pointers to the actual data, are non-evictable as long as any of the blocks they reference are currently in the ARC. In other words, if some data is in the cache, its metadata is also in the cache and, furthermore, is non-evictable. This fact really reinforced my position that, in our LRU cache handling, metadata doesn't need special protection from eviction.
And so, the reARC project took the same path: no more separation of data and metadata, and no more special protection. This improvement led to fewer lists to manage and simpler code, with shorter lock hold times during eviction. If you are still setting arc_meta_limit for legacy reasons, I advise you to try running without it. It might be hurting you today and should be considered obsolete.
Single-Copy ARC: Dedup of Memory
Yet another truly amazing capability of ZFS is its unlimited snapshots. There are just no limits, other than hardware, to the number of (software) snapshots that you can have. What is magical here is not so much that ZFS can manage a large number of snapshots, but that it can do so without reference counting the blocks that are referenced through a snapshot. You might need to read that sentence again... and check the blog entry.
Now fast forward to today, where there is something new for the ARC. While we've always had the ability to read a block referenced from N different snapshots (or clones), the old ARC actually had to manage a separate in-memory copy of the block for each. If the accesses were all reads, we'd needlessly instantiate the same data multiple times in memory. With the reARC project and the new DMU-to-ARC interfaces, we don't have to keep multiple data copies. Multiple clones of the same data share the same buffers for read accesses, and new copies are created only for write accesses.
It has not escaped our notice that this N-way sharing has immense
consequences for virtualization technologies. The use of ZFS clones
(or writable snapshots) is just a great way to deploy a large number
of virtual machines. ZFS has always been able to store N clone copies
with zero incremental storage cost. But reARC takes this one step
further. As VMs are used, the in-memory caches that serve
multiple VMs no longer need to inflate, allowing the space savings to
be used to cache other data. This improvement allows Oracle to boast
the amazing technology demonstration of booting 16,000 VMs.
Improved Scalability of Cached and Uncached IOPS
The entire MRU/MFU list insertion and eviction process has been
redesigned. One of the main functions of the ARC is to keep track of
accesses, such that the most recently used data is moved to the head of
the list while the least recently used buffers make their way toward
the tail and are eventually evicted. The new design allows eviction to be performed using a separate set of locks from the set used for insertion, thus delivering greater scalability. Moreover, through a very clever algorithm, we're able to move buffers from the middle of a list to the head without acquiring the eviction lock.
These changes were very important in removing the long pauses in ARC
operations that hampered the previous implementation. Finally, the
main hash table was modified to use more locks, placed on separate
cache lines, improving the scalability of ARC operations. This led
to a boost in the maximum cached and uncached IOPS capabilities of the
system.
Steadier Size, Smaller Shrinks
The growth and shrink model of the ARC was also revisited. The new
model grows the ARC less aggressively when approaching memory pressure
and instead recycles buffers earlier. This recycling leads to a
steadier ARC size and fewer disruptive shrink cycles. If a changing
environment nevertheless requires the ARC to shrink, the amount by
which we shrink each time is reduced, making each shrink cycle less
stressful. Along with the reorganization of the ARC list locking, this
has led to a much steadier, more dependable ARC under high load.
ARC Access Hardening
A new ARC reference mechanism was created that allows the DMU to
signal read or write intent to the ARC. This, in turn, enables more
checks to be performed by the code, catching bugs earlier in the
process. A better separation of function between the DMU and the ARC
is critical for ZFS robustness, or hardening. In the new reARC mode of
operation, the ARC actually has the freedom to relocate kernel buffers
in memory between DMU accesses to a cached buffer. This new feature
proves invaluable as we scale to large memory configurations.
L2ARC Memory Footprint Reduction
Historically, buffers were tracked in the L2ARC (the SSD-based secondary ARC) using the same structure that was used by the main primary ARC. This represented about 170 bytes of memory per buffer. The reARC project was able to cut this amount by more than 2X, down to a bare minimum of about 80 bytes of metadata per L2ARC buffer. With the arrival of larger SSDs for the L2ARC and a better feeding algorithm, this reduced memory footprint is a very significant change for the Hybrid Storage Pool (HSP) storage model.
I/O Priority Inversion
One nagging behavior of the old ARC and ZIO pipeline was the so-called I/O priority inversion. This behavior was present mostly for prefetch I/Os, which were handled by the ZIO pipeline at a lower priority than, for example, a regular read issued by an application. Before reARC, once a prefetch had been issued, a subsequent read of the same data arriving while the prefetch was still pending would block waiting on the low-priority prefetch's completion. While it sounds simple enough to just boost the priority of the in-flight prefetch, the ARC/ZIO code was structured in such a way that this turned out to be much trickier than it sounds. In the end, the reARC project and subsequent I/O restructuring changes put us on the right path regarding this particular quirkiness. Fixing the I/O priority inversion restored fairness between the different types of I/O.
The key points that we saw in reARC are as follows:
- Metadata doesn't need special protection from eviction; arc_meta_limit has
become an obsolete tunable.
- Multiple clones of the same data share the same buffers for great performance
in a virtualization environment.
- We boosted ARC scalability for cached and uncached IOPS.
- The ARC size is now steadier and more dependable.
- Protection from creeping memory bugs is better.
- L2ARC uses a smaller footprint.
- I/Os are handled with more fairness in the presence of prefetches.
All of these improvements are available to customers of Oracle's ZFS Storage Appliances in any AK-2013 release and in recent Solaris 11 releases. And this is just topic number one. Stay tuned as we describe further improvements we're making to ZFS.