Using Analytics (Dtrace) to Troubleshoot a Performance Problem, Realtime
By Darius Zanganeh on Sep 23, 2013
I was recently on a call with one of the largest financial retirement investment firms in the USA. They were using a very small 7320 ZFS Storage Appliance (ZFSSA) with 24GB of DRAM and 18 x 3TB HDDs. This system was nothing in terms of specifications compared to the latest shipping systems, the ZS3-2 and ZS3-4 storage systems with 512GB and 2TB of DRAM respectively. Nevertheless, I was more than happy to see what help I, and mostly the ZFSSA Analytics (Dtrace), could offer.
The Problem: Umm, performance, aka storage IO latency, is less than acceptable... This customer was/is using this very small ZFSSA as an additional Data Guard target from their production Exadata. In this case it was causing issues and pain. Below you can see a small sample screenshot we got from their Enterprise Manager Console.
Discovery: Let's see what is going on right now! It's time to take a little Dtrace tour...
This leads us to our next stop on the Dtrace discovery train, ARC (Adaptive Replacement Cache) accesses. Here we quickly find that even with our lackluster performance in terms of read latency, we have an amazing read cache hit ratio. About 60% of our IO is coming from our 24GB of DRAM. On this particular system the DRAM is upgradable to 144GB. Do ya think that would make a small dent in those 6033 data misses below?
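For anyone who wants to redo the cache arithmetic, here is a minimal Python sketch. The function names `hit_ratio` and `implied_hits` are my own for illustration and are not part of Analytics. Pairing the 6033 data misses with the roughly 60% hit ratio is an assumption too, since the two numbers may not cover exactly the same set of accesses.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of cache accesses that were served from the cache."""
    total = hits + misses
    if total == 0:
        return 0.0
    return hits / total


def implied_hits(misses: int, ratio: float) -> int:
    """Hit count implied by a miss count and a hit ratio (0 <= ratio < 1)."""
    if not 0 <= ratio < 1:
        raise ValueError("ratio must be in the range [0, 1)")
    return round(misses * ratio / (1 - ratio))


DATA_MISSES = 6033        # from the post
APPROX_HIT_RATIO = 0.60   # from the post ("about 60%"); pairing it with
                          # DATA_MISSES is an assumption

hits = implied_hits(DATA_MISSES, APPROX_HIT_RATIO)
print("implied hits:", hits)
print("hit ratio check:", round(hit_ratio(hits, DATA_MISSES), 3))

# DRAM upgrade factor: 24GB installed today, 144GB maximum on this system.
print("DRAM upgrade factor:", 144 / 24)
```

A bigger ARC does not guarantee a proportionally higher hit ratio, so treat this as back-of-the-envelope math only.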
This nicely leads into the next stop on the Dtrace train, which is to ask Dtrace how many of those 6033 data misses would be eligible to be read from L2ARC (read SSD drives in the Hybrid Storage Pool). We quickly noticed that L2ARC would indeed have made a huge difference. Sometimes 100% of the misses were eligible. This means that after missing the soon-to-be-upgraded (6x larger) DRAM-based cache, the rest of the read IOs of this workload would likely be served from high-performance MLC flash SSD right in the controller itself.
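To get a feel for why L2ARC eligibility matters, here is a rough sketch of a three-tier read model (DRAM, read SSD, disk). The helper `effective_read_latency_ms` is hypothetical, and the per-tier latencies are placeholder assumptions, not measurements from this customer. It also treats every L2ARC-eligible miss as an L2ARC hit, which is a best case.

```python
def effective_read_latency_ms(arc_hit: float, l2arc_eligible: float,
                              arc_ms: float, l2arc_ms: float,
                              disk_ms: float) -> float:
    """Average read latency across DRAM (ARC), read SSD (L2ARC), and disk.

    arc_hit        -- fraction of reads served from the ARC
    l2arc_eligible -- fraction of ARC misses assumed to be served from L2ARC
                      (best case: every eligible miss becomes a hit)
    """
    miss = 1.0 - arc_hit
    from_l2arc = miss * l2arc_eligible
    from_disk = miss * (1.0 - l2arc_eligible)
    return arc_hit * arc_ms + from_l2arc * l2arc_ms + from_disk * disk_ms


# Placeholder latencies in milliseconds; these are assumptions for illustration.
ARC_MS, L2ARC_MS, DISK_MS = 0.1, 1.0, 10.0

# About 60% ARC hit ratio (from the post), no read SSDs installed.
without_l2arc = effective_read_latency_ms(0.60, 0.0, ARC_MS, L2ARC_MS, DISK_MS)

# Same ARC hit ratio, with 100% of misses eligible for L2ARC
# (the post saw this "sometimes").
with_l2arc = effective_read_latency_ms(0.60, 1.0, ARC_MS, L2ARC_MS, DISK_MS)

print(f"without L2ARC: {without_l2arc:.2f} ms")
print(f"with L2ARC:    {with_l2arc:.2f} ms")
```

Plug in your own latencies and hit ratios. The point is only that misses landing on SSD instead of spinning disk dominate the average.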
Conclusion: Analytics on the ZFSSA are amazing, the Hybrid Storage Pool of the ZFSSA is amazing, the large DRAM-based cache on the ZFSSA is very amazing... At this point I recommended that they take a two-phase approach to the workload. First, they upgrade the DRAM cache 6x and add 2 x L2ARC read SSD drives. After that, they could evaluate whether they still needed to add more disks.
Extra Credit Stop: One last stop I made was to look at their NFS share settings and see if they had compression turned on, like I have recommended in a previous post. I noticed that they did not have it enabled and that CPU was very low, at less than 10% average utilization. I then explained to the customer how they would benefit, even now without any upgrades, by enabling compression, and I asked if I could enable it that second, at noon during a busy work day. They trusted me and we enabled it on the fly. Then, for giggles, I decided to see if it made a difference on disk IO. Sure enough, it made an immediate impact and disk IO dropped, because every new write now took less disk space and less IO to move through the subsystem, since we compress in real time in memory. Remember that this system was a Data Guard target, so it was constantly being written to. You can clearly see in the image below how compression lowered the amount of IO on the actual drives. Try doing that with your average storage array compression.
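Here is a tiny Python sketch of the idea that compressing in memory means fewer bytes ever reach the disks. It uses Python's zlib purely as a stand-in, since this post does not say which compression algorithm the share used, and `bytes_to_disk` is a hypothetical helper, not an appliance API. The sample block is made-up, repetitive data.

```python
import zlib


def bytes_to_disk(block: bytes, compress: bool) -> int:
    """Bytes that would be written for one block, with or without compression.

    Compression happens in memory before the write. If it does not shrink
    the block, this sketch keeps the original size.
    """
    if not compress:
        return len(block)
    compressed = zlib.compress(block, 1)  # level 1: cheap on CPU
    return min(len(compressed), len(block))


# Made-up, highly repetitive sample data; real data will compress differently.
block = b"redo log record " * 512

raw = bytes_to_disk(block, compress=False)
packed = bytes_to_disk(block, compress=True)
print("uncompressed bytes:", raw)
print("compressed bytes:  ", packed)
print("fewer bytes written:", packed < raw)
```

How much you save depends entirely on the data, and the CPU cost is why I checked that utilization was low before turning it on.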
Special thanks go out to my teammates Michael Dautle and Hugh Shannon, who helped with this adventure and with capturing the screenshots.