Monday Sep 23, 2013

Using Analytics (Dtrace) to Troubleshoot a Performance Problem, Realtime

I was recently on a call with one of the largest financial retirement investment firms in the USA.  They were using a very small 7320 ZFS Storage Appliance (ZFSSA) with 24GB of DRAM and 18 x 3TB HDDs.  This system was nothing in terms of specifications compared to the latest shipping systems, the ZS3-2 and ZS3-4, with 512GB and 2TB of DRAM respectively.  Nevertheless, I was more than happy to see what help I, and mostly the ZFSSA Analytics (Dtrace), could offer.

The Problem: Umm, Performance aka Storage IO Latency is less than acceptable...

This customer was/is using this very small ZFSSA as an additional Data Guard target from their production Exadata.  In this case it was causing issues and pain.  Below you can see a small sample screen shot we got from their Enterprise Manager Console.

Discovery: Let's see what is going on right now! It's time to take a little Dtrace tour...

The next step was to fire up the browser and take a look at the real-time analytics.  Our first stop on the dtrace train is to look at IOPS by protocol.  In this case the protocol is NFSv3, but we could easily see the same thing for NFSv4, SMB, HTTP, FTP, FC, iSCSI, etc.  Quickly we see that this little box with 24GB of DRAM and 18 x 7200RPM HDDs was being pounded! Averaging 13,000 IOPS with 18 drives isn't bad.  I should note that this box had zero Read SSD drives and 2 x Write SSD drives.

Doing a quick little calculation, we see that this little system is doing about 650 IOPS per drive. Holy Cow SON! Wait, is that even possible? There is something else at play here. Could it be the large ZFSSA cache (a measly 24GB in our case) at work?  Hold your horses: IOPS are not everything. In fact, you could argue that they really don't matter at all; what really matters is latency.  This is how we truly measure performance: how long does it take to do one thing, versus how many things can I do at the same time regardless of how fast they each get done.  To best understand how much effect DRAM has on latency, see our World Record SPECSFS Benchmark information here.  Here you see that, sure enough, some of the read latency is pathetic for just about any application in the world.
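For anyone who wants to redo that back-of-the-envelope math, here is a minimal sketch. The 13,000 IOPS, 18 HDDs and 2 write SSDs come from the numbers above; the 100 random IOPS per 7200RPM spindle is only a common rule of thumb that I am assuming here, not something measured on this box. Dividing by the 18 HDDs alone gives roughly 722, while counting all 20 devices gives the 650 quoted above.

```python
def iops_per_device(total_iops: float, devices: int) -> float:
    """Average front-end IOPS attributed to each device."""
    if devices <= 0:
        raise ValueError("devices must be positive")
    return total_iops / devices


TOTAL_IOPS = 13_000   # average NFSv3 IOPS quoted in the post
HDDS = 18             # 7200RPM data drives
WRITE_SSDS = 2        # write-optimized SSDs

# Assumption: rule-of-thumb random IOPS for one 7200RPM drive, not a measurement.
ASSUMED_HDD_RANDOM_IOPS = 100

per_hdd = iops_per_device(TOTAL_IOPS, HDDS)
per_device = iops_per_device(TOTAL_IOPS, HDDS + WRITE_SSDS)

print(f"per HDD only:                {per_hdd:.0f} IOPS")
print(f"per device incl. write SSDs: {per_device:.0f} IOPS")
print(f"multiple of assumed spindle capability: {per_hdd / ASSUMED_HDD_RANDOM_IOPS:.1f}x")
```

Either way, the per-drive figure is several times what a single spindle could plausibly deliver on its own, which is why the cache has to be part of the story.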

There are many ways to solve this problem. The average SAN/NAS vendor would tell you to simply add more disk. With dtrace we can get more granular, ask many other questions, and see if there is perhaps a better or more efficient way to solve this problem.

This leads us to our next stop on the dtrace discovery train, ARC accesses (Adaptive Replacement Cache).  Here we quickly find that even with our lackluster read latency, we have an amazing read cache hit ratio.  About 60% of our IO is coming from our 24GB of DRAM.  On this particular system the DRAM is upgradable to 144GB.  Do ya think that would make a small dent in those 6033 data misses below?
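To see why the hit ratio matters so much for latency, here is a rough sketch of the usual weighted-average model. The 60% hit ratio and the 6033 data misses are from the analytics above; the 0.1 ms DRAM hit latency, the 10 ms disk miss latency and the 90% "after upgrade" hit ratio are purely illustrative assumptions of mine, not measurements from this system.

```python
def hits_from_misses(misses: float, hit_ratio: float) -> float:
    """Cache hits implied by a miss count and an overall hit ratio."""
    if not 0.0 <= hit_ratio < 1.0:
        raise ValueError("hit_ratio must be in [0, 1)")
    return misses * hit_ratio / (1.0 - hit_ratio)


def mean_read_latency_ms(hit_ratio: float, hit_ms: float, miss_ms: float) -> float:
    """Average read latency when hits and misses have different service times."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms


DATA_MISSES = 6033        # from the ARC accesses chart
OBSERVED_HIT_RATIO = 0.60 # roughly what the analytics showed

# Assumptions for illustration only.
ASSUMED_DRAM_HIT_MS = 0.1
ASSUMED_DISK_MISS_MS = 10.0
HYPOTHETICAL_HIT_RATIO = 0.90  # what a bigger cache *might* reach

print(f"implied hits: {hits_from_misses(DATA_MISSES, OBSERVED_HIT_RATIO):.0f}")
for ratio in (OBSERVED_HIT_RATIO, HYPOTHETICAL_HIT_RATIO):
    latency = mean_read_latency_ms(ratio, ASSUMED_DRAM_HIT_MS, ASSUMED_DISK_MISS_MS)
    print(f"hit ratio {ratio:.0%}: mean read latency ~{latency:.2f} ms")
```

Under these assumed service times the misses dominate the average almost entirely, so every miss you convert into a hit pays off directly in latency.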

This nicely leads into the next stop on the dtrace train, which is to ask dtrace how many of those 6033 data misses would be eligible to be read from L2ARC (Read SSD drives in the Hybrid Storage Pool).  We quickly noticed that indeed they would have made a huge difference.  Sometimes 100% of the misses were eligible.  This means that after missing the soon-to-be-upgraded 6x DRAM-based cache, the rest of the read IOs of this workload would likely be served from high performance MLC Flash SSD right in the controller itself.
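Here is a small sketch of what that eligibility number means for the spinning disks. The 6033 misses and 18 HDDs are from above. The big assumption, labeled in the code, is that an eligible read actually gets served by a warm L2ARC; eligibility is an upper bound, so treat this as a best case.

```python
def reads_reaching_disk(data_misses: float,
                        eligible_fraction: float,
                        l2arc_hit_fraction: float = 1.0) -> float:
    """ARC misses that still have to be read from HDD.

    Assumption: l2arc_hit_fraction of the eligible misses are really served
    from read SSD. The default of 1.0 is the best case with a warm L2ARC.
    """
    for name, value in (("eligible_fraction", eligible_fraction),
                        ("l2arc_hit_fraction", l2arc_hit_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    served_by_l2arc = data_misses * eligible_fraction * l2arc_hit_fraction
    return data_misses - served_by_l2arc


DATA_MISSES = 6033
HDDS = 18

for eligible in (0.0, 0.5, 1.0):
    remaining = reads_reaching_disk(DATA_MISSES, eligible)
    print(f"{eligible:>4.0%} eligible: {remaining:7.1f} reads to disk, "
          f"{remaining / HDDS:6.1f} per HDD")
```

With no read SSDs (the 0% row, which is this system today), every miss lands on the 18 drives.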

Conclusion: Analytics on the ZFSSA are amazing, the Hybrid Storage Pool of the ZFSSA is amazing, the large DRAM based cache on the ZFSSA is very amazing...

At this point I recommended that they take a two-phase approach to the workload.  First, upgrade the DRAM cache 6x and add 2 x L2ARC Read SSD drives.  After that, they could evaluate whether they still needed to add more disk.

Extra Credit Stop:  One last stop I made was to look at their NFS share settings and see if they had compression turned on, as I have recommended in a previous post.  I noticed that they did not have it enabled and that CPU utilization was very low, averaging less than 10%.  I then explained to the customer how they would benefit even now, without any upgrades, by enabling compression, and I asked if I could enable it that second, at 12pm during a busy work day.  They trusted me and we enabled it on the fly.  Then, for giggles, I decided to see if it made a difference on disk IO.  Sure enough, it made an immediate impact: disk IO dropped because every new write now took less disk space and less IO to move through the subsystem, since we compress in real time in memory.  Remember that this system was a Data Guard target, so it was constantly being written to.  You can clearly see in the image below how compression lowered the amount of IO on the actual drives.  Try doing that with your average storage array compression.

Special Thanks goes out to my teammates Michael Dautle and Hugh Shannon who helped with this adventure and capturing of the screenshots.

Monday Jul 15, 2013

ZFS Storage Appliance Compression vs Netapp, EMC and IBM in a Real Production Environment

Dear Oracle ZFSSA Customer, Please enable LZJB compression on every file system and LUN you have by default
I have been involved in many large Proof of Concepts during my time here at Oracle and I have come to realize that there is a quite compelling and honestly very important feature in the ZFS Storage Appliance that every customer and potential customer should review and plan on using.  This feature is inline data compression for just about any storage need, including production database storage.  Many storage vendors today list and possibly tout their compression technologies but rarely do they tell you to turn it on for production storage and never would they tell you to turn it on for every File System and LUN.  This is where the ZFS Storage Appliance is somewhat unique.  Real-world experience over a wide range of production systems has shown that Oracle storage customers should run LZJB compression unless there is a compelling reason not to run compression, such as storage for uncompressible data.  Even in the case of mixed data, where some data does not compress, the benefit of LZJB compression is still strong enough and system CPU capacity is still large enough to accommodate running compression in cases where not all of the data in a share will compress.  

Why Compress Everything?
There are many reasons to compress data such as saving space and money which are great reasons to compress, but with the ZFSSA you may actually gain performance as well.  ZFS compression reduces the amount of data written to the physical disks over the back-end network channels.  Finite physical limits on disk drive bandwidth and channel bandwidth put fundamental limitations on application data transfer rates.  With compression, the amount of data written to the drives and channels is reduced compared to the amount of data written by the application, so using CPU to perform compression increases the effective drive and channel bandwidth and opens up an important bottleneck that constrains application throughput.
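The bandwidth argument above boils down to simple arithmetic, sketched here. The 1000 MB/s back-end figure and the 2.0x ratio are made-up illustration values, and the model deliberately ignores the CPU cost of compressing, so it is an upper bound rather than a prediction.

```python
def bytes_on_disk(application_bytes: float, compression_ratio: float) -> float:
    """Bytes that actually cross the back-end channel and land on the drives."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return application_bytes / compression_ratio


def effective_bandwidth(physical_mb_s: float, compression_ratio: float) -> float:
    """Application-visible bandwidth when the back end is the bottleneck.

    Assumption: CPU is not the limiting factor, so compression is 'free'.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return physical_mb_s * compression_ratio


# Hypothetical values for illustration only.
PHYSICAL_MB_S = 1000.0
RATIO = 2.0

print(f"1 GB written by the app -> {bytes_on_disk(1e9, RATIO) / 1e6:.0f} MB on disk")
print(f"{PHYSICAL_MB_S:.0f} MB/s of back end looks like "
      f"{effective_bandwidth(PHYSICAL_MB_S, RATIO):.0f} MB/s to the application")
```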

What does this cost in terms of performance and is there a license fee?
In terms of performance it costs very, very little. The currently shipping ZFSSA 7420, for example, has either 32 or 40 cores of Intel Westmere horsepower under the covers, coupled with up to 1TB of DRAM and a Solaris kernel that is exceptional at doing many things at the same time and using all the cores it can.  Most of our customers have CPU cores and MHz sitting around waiting for something to do.  Turning on LZJB compression by default turns out to be a pretty good win-win in terms of using a few cores, gaining performance and saving space.  There is no license fee or cost to use ZFSSA compression; you simply turn it on where you want to use it. It is granular enough that you can turn it on at the file system or LUN level.

There are 4 different built-in levels of compression to choose from with the ZFSSA: LZJB, GZIP-2, GZIP (GZIP-6) and GZIP-9.  A customer could easily create 4 file systems, each with a different compression level, and then copy the same test data to each share at different times.  You could then very easily correlate the CPU usage (using Dtrace Analytics) with the compression ratio and make an educated decision on which level of compression you want to use for a particular data type.  I tell my customers to, at a minimum, start with LZJB pretty much everywhere.  Below is a real world example I pulled from one of my customers.  Here you see a ZFS pool that basically has multiple Oracle Databases running on it.  They have LZJB turned on everywhere and are seeing a very nice 2.79x compression ratio.
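If you want a feel for the CPU-versus-ratio trade-off before touching an appliance, something like the sketch below works on a laptop. It uses Python's zlib at levels 2, 6 and 9 as a stand-in for the GZIP-2, GZIP-6 and GZIP-9 settings; LZJB is not in the Python standard library, so it is not covered. The sample data is synthetic and made up for illustration, so the ratios will not match your real data or the appliance's results.

```python
import random
import time
import zlib
from typing import Tuple


def make_sample(rows: int = 20_000, seed: int = 0) -> bytes:
    """Synthetic, record-like test data (hypothetical; use your own files)."""
    rng = random.Random(seed)
    states = ["NEW", "OPEN", "CLOSED", "PENDING"]
    lines = [
        f"{i:08d},customer{rng.randint(0, 999)},{rng.choice(states)},{rng.random():.4f}\n"
        for i in range(rows)
    ]
    return "".join(lines).encode("ascii")


def measure(data: bytes, level: int) -> Tuple[float, float]:
    """Return (compression ratio, seconds spent compressing) for one level."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Sanity check: the data must survive a round trip.
    assert zlib.decompress(packed) == data
    return len(data) / len(packed), elapsed


sample = make_sample()
results = {level: measure(sample, level) for level in (2, 6, 9)}

for level, (ratio, seconds) in results.items():
    print(f"level {level}: {ratio:.2f}x in {seconds * 1000:.1f} ms")
```

Swap `make_sample()` for a read of your own test file to compare levels on data you actually care about.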

The Other Guys Compression Story: 
So at this point you may be wondering what is so unique about Oracle ZFSSA compression versus, say, Netapp Data Compression, EMC VNX compression, or IBM's v7000/SVC compression.  The basic issue is that pretty much all the other big players have issues scaling CPU/threads and processes, and therefore turning on compression in these environments can create a CPU bottleneck pretty fast. Let's take a minute to examine each.

Netapp Data Compression:
Netapp's own whitepapers are riddled with warnings about using compression in production environments (especially inline compression for Oracle OLTP).  Most notably, they say that compression can chew up a lot of the meager amount of CPU/threads available.  I tried to find a detailed document that explains how Netapp uses its cores and found it next to impossible to find, even with Google!  I did find a few blogs and Netapp forums where people frequently mentioned something about the Kahuna Domain and that compression was part of this Domain (CPU) and shared with many other services.  Maybe someone else can share how all the cores on a Netapp box are used?  It appears that if a Netapp box has over 50% CPU utilized, you could get into trouble turning on compression.

"On workloads such as file services, systems with less than 50% CPU utilization have shown an increased CPU usage of ~20% for datasets that were compressible. For systems with more than 50% CPU utilization, the impact may be more significant."

"When data is read from a compressed volume, the impact on the read performance varies depending on the access patterns, the amount of compression savings on disk, and how busy the system resources are (CPU and disk). In a sample test with a 50% CPU load on the system, read throughput from a dataset with 50% compressibility showed decreased throughput of 25%. On a typical system the impact could be higher because of the additional load on the system. Typically the most impact is seen on small random reads of highly compressible data and on a system that is more than 50% CPU busy. Impact on performance will vary and should be tested before implementing in production." Reference:

The following conclusions can be drawn based on the available evidence from industry analyst reviews as well as Netapp’s own documentation: In the best possible scenario, NetApp’s compression technology occurs asynchronously for active, transactional workloads, meaning that the data is initially written in an uncompressed form. Therefore, capacity must always be over-provisioned and later reclaimed, most likely with some degree of fragmentation, as compared to the synchronous or in-line compression architecture of the ZFS Storage Appliance. As a result, much of the potential cost-saving value of compression cannot be effectively realized as the data must always initially reside in an uncompressed state on the media.

EMC VNX Compression:
EMC is very similar to Netapp in that they give lots of warnings and say that compression is only recommended for static data.  EMC also has very small limits on the number of compressed LUNs you can have per controller.  The document that details this is a few years old, originally written for the Clariion CX4, but apparently still relevant for the newer controllers, as it is directly linked in the newer VNX compression datasheet.

"Compression’s strength is improved capacity utilization. Therefore, compression is not recommended for active database or messaging systems, but it can successfully be applied to more static datasets like archives, clones of database, or messaging-system volumes." Reference:
"VNX file deduplication and compression should be used exclusively for file data as more granular control is available for file data rather than block, so the system can identify inactive data to process versus active data.”
“Block data compression is intended for relatively inactive data that requires the high availability of the VNX system. Consider static data repositories or copies of active data sets that users want to keep on highly available storage."

In conclusion, like Netapp, EMC’s compression technology is best suited for infrequently accessed data, and similarly requires post process functions to compress data; the usage cases for any data that will be accessed beyond an archival access frequency are highly limited.  Because data is initially written in the uncompressed form, compression must occur asynchronously by means of a time-consuming compression post-process. In fact, compression post-processing for newly written data often must run for time periods measured in days before the data is stored in compressed format. As a result, much of the space-saving value of compression is moot as the data must reside in an uncompressed state on the media, meaning capacity must always be over-provisioned relative to implementations that feature in-line compression such as the ZFS Storage Appliance.

IBM Real-Time Compression:

IBM is slightly different from the above in that they actually claim to support running their compression for production databases.  But as you dig into the bowels of the Redbook, you quickly find there are some severe limitations and concerns to beware of.  First off, the v7000 has only 4 cores and 8GB of cache memory.  That is pretty small versus the 32-40 cores and 1TB of cache memory in the Oracle ZFSSA 7420.  Apparently, when you enable compression on a single volume/LUN, 3 of the 4 available cores and 2GB of the 8GB of memory become dedicated solely to compression operations.  So with the IBM v7000, if you turn on compression, kiss away 75% of your storage CPU resources on the particular node/controller this LUN lives on.  IBM goes on to affirm this by saying that if you have more than 25% CPU used before compression is enabled, then you should not turn it on.  They also have a hard limit of 200 compressed volumes within a 2-node I/O group.  They also say not to mix compressed and uncompressed LUNs/volumes in the same storage pool.  Probably one of the least desirable aspects of the IBM compression is cost $$$.  IBM is the only vendor here that charges extra for compression.  With SVC they charge per TB and with the v7000 they charge per enclosure, so either way, every time you add disk to an I/O Group/Controller Pair that has compression enabled you are going to be hit with more software licensing as well.  Another major SVC and v7000 viability concern arises for customers that would like to exploit IBM’s Easy Tier auto-tiering technology in addition to Real-Time Compression, because RTC and Easy Tier are currently mutually exclusive: Easy Tier is automatically disabled for compressed volumes and cannot be enabled.

"An I/O Group that is servicing at least one compressed volume dedicates certain processor and memory resources for exclusive use by the compression engine."

Special Thanks

I want to especially thank Jeff Wright (Oracle ZFSSA Product Management) and  Mark Kremkus (Fellow Oracle Storage Sales Consultant) who both contributed to this entry.

