Why your Netapp is so slow...

Have you ever wondered why your Netapp FAS box is slow and doesn't perform well at large block read workloads?  In this blog entry I will give you some information that should help you understand why it's so slow and why you shouldn't use it for applications that READ in large blocks like 64k, 128k, 256k and up.  And of course, since I work for Oracle, I will show you why the ZS3 storage boxes are excellent choices for these types of workloads.

Netapp’s Fundamental Problem

Update 10/21/14:  Well, it turns out that the Netapp whitepaper referenced below is wrong/incomplete and doesn't tell the whole story.  The Netapp will try to lay the blocks out contiguously and retrieve multiple blocks per IO using a method they call chaining, so they are not as bad at large block as it may seem.  They are still slow; we just do not know the real reason why.  Could it be their CPU-bound architecture?  A lack of SMP multithreading like the Solaris kernel has?  The answer is: I don't know.  I don't know why they won't post an SPC-2 benchmark, and I am not sure why the FAS box is slow at such a workload.  I would bet anyone that if performance were even modestly OK, you would see Netapp posting a benchmark for the 8080EX.

So fundamentally the issue still remains that the FAS box is not built for these types of workloads and will run out of gas much faster than a ZS3, therefore costing the customer more in $/IOP and $/GB.

The fundamental problem you have running these workloads on Netapp is the backend block size of their WAFL file system.  Every application block on a Netapp FAS ends up in a 4k chunk on a disk. Reference:  Netapp TR-3001 Whitepaper

Netapp has effectively acknowledged this large block performance gap in at least two different ways.

  1. They have NEVER posted an SPC-2 benchmark, yet they have recently posted both SPC-1 and SPECsfs results.
  2. In 2011 they purchased Engenio to try to fill this gap in their portfolio.

Block Size Matters

So why does block size matter anyway?  Many applications move data in large block chunks, especially in the Big Data movement.  Some examples are SAS Business Analytics, Microsoft SQL Server, and Hadoop, whose HDFS block size is 64MB!  Now let me boil this down for you.

Update 9-10-14: I have added more detail to the write section. The facts haven't changed; Netapp was upset about the way I made their writes look, even though the original entry focused on READS!  So to be fair and friendly I have expanded the write section with more detail.

Writes

If an application such as MS SQL writes data in a 64k chunk, then before Netapp actually puts it on disk it will have to split it into 16 different 4k writes and 16 different DISK IOPS.  Now, to be fair, these WAFL writes will mostly be hidden or masked from the application via the write cache, like all storage arrays have; in Netapp's case they have a couple of NVRAM pools that get flushed every 10 seconds or when they start getting full.

Now, it is quite possible a customer could run into a huge problem here as well, with the NVRAM buffers not being able to flush as fast as they fill.  These events are usually called "back-to-back CPs" (consistency points), and Netapp even has a sysstat counter to measure this potential problem.  Netapp will tell you they have good sequential write performance: "the main reason ONTAP can provide good sequential performance with randomly written data is the blocks are organized contiguously on disk".  For writes this is mostly true, unless you blow through the relatively small amount of NVRAM.  If you hit that bottleneck you cannot grow your NVRAM; you have to buy a bigger Netapp FAS filer.  With the Oracle ZS3 we use a different method to handle write cache.  ZFS uses something called the ZFS Intent Log, or ZIL.  The ZIL lives on write-optimized SLC SSDs and can be expanded on the fly by simply hot-adding more log devices to the given ZFS pool.
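To make the back-to-back CP idea concrete, here is a toy model of a fixed-size write buffer that fills at the application's write rate and drains at whatever rate the disk back end can sustain.  All of the numbers are made up purely for illustration; this is not a NetApp or Oracle tool, just the arithmetic of the bottleneck.

```python
# Toy model: writes land in a fixed-size write buffer (think NVRAM pool) and
# are destaged to disk at whatever rate the back end can sustain. All numbers
# are invented for illustration; they are not NetApp or Oracle specifications.

def stalled_seconds(ingest_mb_s, destage_mb_s, buffer_mb, duration_s=300):
    """Seconds during which new writes must wait because the buffer is full."""
    fill = 0.0
    stalls = 0
    for _ in range(duration_s):
        fill += ingest_mb_s                 # application writes arrive
        fill -= min(fill, destage_mb_s)     # back end drains what it can
        if fill >= buffer_mb:               # buffer is full: writers stall
            stalls += 1
            fill = buffer_mb                # excess has to wait at the host
    return stalls

# Disks drain faster than writes arrive: the cache hides write latency.
print(stalled_seconds(ingest_mb_s=400, destage_mb_s=600, buffer_mb=4096))  # 0
# Sustained ingest exceeds the destage rate: the buffer pins at full and
# every second of the overload shows up as stalled writes.
print(stalled_seconds(ingest_mb_s=900, destage_mb_s=600, buffer_mb=4096))  # large
```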

But this post is much less about writes and more about reads.  The only reason I even wrote about writes is because Netapp has a fixed Back-end block size of 4k which can be very limiting on large block read workloads.  So it seemed natural to explain that.

Reads

When the application later goes back to read that 64k chunk, the Netapp will again have to do 16 different disk IOPS for any data that is not chained together on disk.  In comparison, the ZS3 Storage Appliance can write in variable block sizes ranging from 512 bytes to 1MB.  So if you put the same MSSQL database on a ZS3 you can set the specific LUNs for that database to 64k, and then an application read or write requires only a single disk IO.  That is 16x fewer disk IOs!  Back to the problem with your Netapp: you will VERY quickly run out of disk IO and hit a wall.  Now, all arrays have some fancy prefetch algorithm and some nice cache, maybe even flash-based cache such as a PAM card in your Netapp, but with large block workloads you will usually blow through the cache and still need significant disk IO.  Also, because these datasets are usually very large and usually not dedupable, they are usually not good candidates for an all-flash system.  You can do some simple math in Excel and very quickly see why it matters.  Here are a couple of READ examples using SAS and MSSQL.  Assume these are the READ IOPS the application needs even after all the fancy cache, prefetch/read-ahead algorithms and chaining.  Netapp will tell you they get around this via read-ahead, chaining and the fact that the data is laid out contiguously on disk, but at some point with these large block reads even that is going to fail.  If it didn't fail, I guarantee you Netapp would be posting SPC-2 throughput numbers and bragging about them...
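Here is the kind of simple math I mean, sketched in Python instead of Excel.  The per-drive IOPS number is an assumed ballpark for a 10k-15k rpm SAS drive, not a measured vendor figure, and the whole exercise assumes, as stated above, that these reads really do miss cache, read-ahead and chaining.

```python
# Back-of-the-envelope read math: how many disk IOPS (and therefore drives)
# you need to satisfy an application read rate, given the array's back-end
# block size. The 200 IOPS-per-drive figure is an assumed ballpark for a
# 10k-15k rpm SAS drive, not a measured vendor number.
import math

def drives_needed(app_read_iops, app_block_kb, backend_block_kb, iops_per_drive=200):
    backend_ios_per_read = math.ceil(app_block_kb / backend_block_kb)  # e.g. 64k / 4k = 16
    total_backend_iops = app_read_iops * backend_ios_per_read
    return math.ceil(total_backend_iops / iops_per_drive)

# Example: an app (SAS or MSSQL style) that needs 20,000 uncached 64k reads per second.
app_iops, block_kb = 20_000, 64

# Fixed 4k back-end blocks (the WAFL premise of this post): 16 back-end IOs per read.
print(drives_needed(app_iops, block_kb, backend_block_kb=4))   # 1600 drives
# 64k back-end blocks (ZFS record size matched to the app): 1 back-end IO per read.
print(drives_needed(app_iops, block_kb, backend_block_kb=64))  # 100 drives
```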

Here is an example with 128k blocks.  Notice the number of drives on the Netapp!


Here is an example with 64k blocks


You can easily see that the Oracle ZS3 can do dramatically more work with dramatically fewer drives.  This doesn't even take into account that the ONTAP system will likely run out of CPU well before you get to these drive counts, so you would be buying many more controllers.  So with all that said, let's look at the ZS3 and why you should consider it for any workload you're running on Netapp today.

ZS3 World Record Price/Performance in the SPC-2 benchmark 
ZS3-2 is #1 in Price-Performance at $12.08 per SPC-2 MBPS
ZS3-2 is #3 in Overall Performance at 16,212 SPC-2 MBPS

Note: The number one overall spot in the world is held by an AFA at 33,477 MBPS, but at a Price-Performance of $29.79 per SPC-2 MBPS.  A customer could purchase 2 x ZS3-2 systems as benchmarked, get roughly the same performance, and walk away with about $600,000 in their pocket.
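If you want to see where the roughly $600,000 comes from: SPC-2 total tested price is just throughput times price-performance, so a quick back-of-the-envelope check with the published numbers looks like this.

```python
# Rough check of the savings math using the published SPC-2 numbers above.
# By definition, SPC-2 total tested price = throughput (MBPS) x price-performance ($/MBPS).

zs3_2_price = 16_212 * 12.08   # ~$195,841 for one ZS3-2 result
afa_price   = 33_477 * 29.79   # ~$997,280 for the #1 overall AFA result

two_zs3_2 = 2 * zs3_2_price    # ~$391,682 for roughly comparable throughput
savings   = afa_price - two_zs3_2

print(round(two_zs3_2), round(savings))   # ~391682 and ~605598 -> "about $600,000"
```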



Comments:

I've never responded to a blog in my life, but this time I feel compelled to do so.

Your statement that a 64k write requires 16 disk writes to 16 different locations is not true. The 4k block size is a minimum physical block size. Block operations in larger sizes will result in larger IO's, just like any storage array.

The remainder of your posting appears to be based on that false premise.

It would take more space than just a blog comment to properly explain how WAFL works, but irrespective of the details, I have Oracle AWR reports from real customers showing multiple GB/sec of sequential IO using NetApp storage systems, and certainly not with 5000 drives. Nobody would buy such a system.

Note: I am a NetApp employee, but my comments are my own.

Posted by Jeffrey Steiner on August 31, 2014 at 08:09 AM MDT #

Hi Jeff,
Yes, I know your writes get queued up in NVRAM and then written sequentially to disk. They still end up as essentially 4k chunks of data on each disk, each of which requires a separate disk IO to READ. If you read the rest of the post, it's all focused on READs and NOT writes… The weakness of your limited NVRAM write cache would take a whole post of its own, where we would discuss your NVRAM pools filling up faster than they can empty and creating consistency point problems (aka back-to-back CPs) https://communities.netapp.com/thread/14228

For this post and comments please focus on the LARGE BLOCK READs. Please tell me how you READ 64k in a single disk IO the way ZFS does.

Posted by Darius Zanganeh on September 03, 2014 at 09:21 AM MDT #

One other comment. Your Oracle DB example is not really relevant because it's not a large block application like my post discusses; it is mostly 8k blocks. So, I guess your box would only be degraded at a 2:1 ratio for reads, and probably most of that could be masked with cache and PAM cards. But on large, READ-heavy DW workloads it would be a different story, and I think the ZS3 would smoke any Netapp FAS in a bake-off. Even more so if you start talking about HCC and OISP.

Posted by Darius Zanganeh on September 03, 2014 at 09:32 AM MDT #

You need to learn how WAFL works. It's not what you're describing.

An Oracle and NetApp customer

Cheers
Irwin

Posted by guest on September 04, 2014 at 04:04 PM MDT #

Hello Irwin,
First of all, Thank you for your business.

RE: WAFL, Please let me know or show me something that shows my data is incorrect on LARGE BLOCK READs that are NOT coming from cache or prefetch etc... and I will happily update my post.

I want to know how you do large block 64k+ reads from disk when clearly all the data is out on disk in 4k chunks.

Thanks
Darius

Posted by Darius Zanganeh on September 04, 2014 at 05:20 PM MDT #

Sorry about the delay following up from my earlier comment. It wasn't going to be productive to try to explain myself in the comment fields of this blog, so I went ahead and wrote up a more comprehensive response over on http://recoverymonkey.org/

Again, the statement that a 64k write results in 16 different 4k writes is factually false. Leave that uncorrected if you wish. I would hope potential customers would realize the claims here are obviously false, but if a customer asks about it, I'll direct them to my response. That should resolve their questions more than adequately.

I also responded to the questions about SPC-2 benchmarks and the Engenio purchase.

Posted by Jeffrey Steiner on September 09, 2014 at 11:33 PM MDT #

Hi Jeff,
I posted a reply on your blog and also updated my blog. Here is what I posted on your blog:

Hey Guys,
Thanks for the conversation. My blog clearly says the example is about LARGE READs, not writes.

OPENLY FALSE?
WAFL clearly writes the data down in 4k chunks, even if it does so contiguously, which I have now acknowledged on my blog. If my 4k block claim is openly FALSE, then why do your own white papers, written by Dave H., explain WAFL exactly that way, with 4k data blocks and lots of inodes...

Yes, you mask the 4k writes from the application via NVRAM, and you try to mask the 4k reads via read-ahead, which EVERY storage array has and which I clearly acknowledged in my blog post. You must admit that they are still 4k writes, just written contiguously...

The point of my post, which you clearly missed, is that Netapp WAFL is incredibly inefficient when it comes to large block sequential reads. Yes, some of the time the customer may be completely happy with the system, because the workload is small enough and the read-ahead of the 4k blocks on the WAFL back end works well enough to mask any degradation in performance.

The point of the SPC-2, which Netapp can certainly afford (give me a break, they just posted new SPC-1 and SPECsfs benchmarks), is that you can compare two systems running the EXACT same throughput workload. It allows you to compare the efficiency of a storage subsystem: how much bang for your buck do you get with vendor A vs vendor B.

Regarding your Oracle AWR report:
You are looking at a report from the client, not the storage. Of course Netapp is not going to hand the DB a bunch of 4k blocks. What the AWR does NOT show is how many individual DISK IOPS the WAFL system did to get that data. Only the filer could tell you that. Or can it? With DTrace on the ZS3 we can show you individual DISK IOPS broken down by size, both in real time and historically. I have no idea whether ONTAP can do that as easily as DTrace can.

I doubt your customer will let us run a bake-off between the ZS3 and your current Netapp model, so instead we can look at benchmarks. We could look at the SPC-1 benchmark, which is the closest public and OPEN benchmark to an Oracle DB workload. The SPC states the following: "SPC-1 consists of a single workload designed to demonstrate the performance of a storage subsystem while performing the typical functions of business critical applications. Those applications are characterized by predominately random I/O operations and require both queries as well as update operations. Examples of those types of applications include OLTP, database operations, and mail server implementations."

So let’s compare the efficiency of a nice new Netapp 8040 to a 3-year-old Oracle 7420.

http://www.storageperformance.org/benchmark_results_files/SPC-1/NetApp/A00141_NetApp_FAS8040_2node-cluster/a00141_NetApp_FAS8040_2-node-cluster_SPC-1_executive-summary.pdf
http://www.storageperformance.org/benchmark_results_files/SPC-1/Oracle/A00108_Oracle-Sun-ZFS-7420c/a00108_Oracle-Sun_ZFS-7420c%20_SPC1_executive-summary-r1.pdf

The Netapp 8040 did roughly 86,000 IOPS at $5.76 per IOP.
The Oracle 7420 did roughly 137,000 IOPS at $2.99 per IOP, and that result is almost 3 years old.

So even against 3-year-old Oracle ZFS technology you are looking at almost 2x the performance for almost half the cost per IOP. Which system is more efficient? And stay tuned, because eventually Oracle ZFS engineering will release a new SPC-1 result that will make the 7420 look as slow and expensive as the Netapp 8040.
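If you want to check the math yourself, SPC-1 total tested price is simply IOPS times price-per-IOP, so a quick sketch with the rounded numbers above looks like this.

```python
# Quick check of the SPC-1 comparison above using the rounded numbers.
# SPC-1 total tested price ~= IOPS x price-per-IOP.

fas8040_iops, fas8040_dollars_per_iop = 86_000, 5.76
zfs7420_iops, zfs7420_dollars_per_iop = 137_000, 2.99

print(fas8040_iops * fas8040_dollars_per_iop)             # ~$495,360 tested price
print(zfs7420_iops * zfs7420_dollars_per_iop)             # ~$409,630 tested price
print(zfs7420_iops / fas8040_iops)                        # ~1.6x the IOPS
print(zfs7420_dollars_per_iop / fas8040_dollars_per_iop)  # ~0.52x the cost per IOP
```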

I could talk about the actual Oracle benefits of running the Oracle DB on Oracle ZS3 with HCC (Hybrid Columnar Compression) and OISP (Oracle Intelligent Storage Protocol) + many other reasons Oracle DBs run best on Oracle Storage but that is another topic and another post some other time.

So please, let’s focus on large block reads that cannot be satisfied by read-ahead, and explain how your disk IOs will be larger than 4k each. I promise you that if you put a large block app like SAS or a BIG data warehouse on it, you will run out of read-ahead and cache performance and at some point be relying on disk. Even the ZS3, with its HUGE caches and WORLD RECORD performance, hits this with large throughput workloads. You’re talking about a couple of measly GB/s. Try 16GB/s on a Netapp FAS when the data is not in cache.

Thanks,
Darius Zanganeh
Oracle Storage

Posted by Darius Zanganeh on September 10, 2014 at 12:35 PM MDT #

Hi Darius,

Can you explain how the SPC results prove your point that you need more disks on NetApp for comparable performance? The 7420 was using 280 x 15K HDDs vs. 192 x 10K HDDs in the NetApp 8040. So 50% more (and faster) HDDs resulted in a 60% increase in IOPS. This contradicts what you say in the article, that NetApp needs more disks to achieve similar IOPS...

P.S. It also looks like the 7420 used 4 times more SSD cache (4TB vs 1TB).

Posted by guest on September 18, 2014 at 08:34 AM MDT #

Hi Mikhail,
Sorry for the confusion.
1. The SPC-1 benchmark is out of context for this post, which is mainly about large block reads such as SAS/BI/data warehouse workloads. I was simply responding to the Netapp blog post refuting my original post.
2. The 7420 SPC-1 result is almost 3 years old, so you can rest assured that a new benchmark with 10k drives will likely come eventually. What is interesting is that we are still about half the cost of the Netapp price today.
3. If you want a more apples-to-apples comparison you should look at the SPECsfs benchmarks, though again SPECsfs isn’t about throughput the way SPC-2 is. The REAL question you should be asking is: WHY DOESN’T Netapp EVER post an SPC-2 result with FAS?

Oracle ZS3-2 with 172 x 10k drives = 210,535 SPECsfs2008 ops/sec at 1.12ms overall response time
http://www.spec.org/sfs2008/results/res2013q3/sfs2008-20130819-00227.html

Netapp 8020 with 144 x 10k drives = 110,281 SPECsfs2008 ops/sec at 1.18ms overall response time
http://www.spec.org/sfs2008/results/res2014q1/sfs2008-20140120-00235.html

The Oracle system is likely ½ the price of the Netapp as well.

Thanks,
Darius

Posted by Darius Zanganeh on September 18, 2014 at 10:13 AM MDT #

Why, in your last comment, are you comparing SFS results (NFS-based tests) in an originally block-based posting? This seems very confused to me: mixing the discussion around large reads, talking about our NVRAM (writes only), and trying to make a point about SPC-2 figures by using SPC-1 results.
You do know that SPC-1 is not only about 4K blocks (otherwise look it up in the SPC-1 workload description), and that delivering 86,000 IOPS according to your math would need a significantly larger number of disks than the 192 disks @ 10K rpm, certainly at the response times delivered.

I don't see any recent SPC-1 from Oracle to take the reverse approach; is it because your systems are bad at non-sequential workloads?

The whole reasoning here is actually flawed, because most of our customers are only interested in the end results: what problem we help them solve and how we improve their business. Under current budget constraints they will always challenge all their vendors, so I don't see the point of this kind of FUD.

If Oracle wants to prove its value, then it should do so in the field and not on blogs, pretending to know our technology better than we do or insinuating that our customers are stupid for selecting such a bad vendor as NetApp.

Sell your value instead of trying to break others.

Serge ( NetApp Employee)

Posted by guest on May 04, 2015 at 07:47 AM MDT #

Hi Serge,
We would gladly compete with you on ANY workload, head-to-head. I am confident we would provide superior performance and value in almost any workload. In a head-to-head, unbiased bake-off between Netapp FAS and Oracle ZS Storage, very few customers would select Netapp. That has been my personal experience here in the USA. These benchmarks underscore that and give customers a reason to test Oracle versus their existing Netapp or EMC storage. Why would you pay more for less?

Darius

Posted by Darius Zanganeh on May 04, 2015 at 08:17 AM MDT #

First of all, benchmarks are nowhere near real-life workloads, so they represent just a snapshot measure in a specific context. They also focus only on performance and not functionality, which is often where the value is. Lots of newcomer all-flash arrays provide good performance, but what about resiliency, replication, scaling and data protection?

Performance is a selection criterion but is typically not sufficient outside of specific use cases (where any data management is done higher in the stack).
We address both performance and high-value features, and for customers that just need brute-force performance and density we extended our offering with the E-Series.

If you were right, Darius, and we were as bad as you try to depict, we would be long gone and bankrupt.
Considering the size of Oracle as an organization (almost 10 times bigger than NetApp) and its opportunity to bundle and cross-sell full ranges of products, you must admit that your market share does not reflect your assertions.
The reality is that you don't appear in the top ten of IDC's market share rankings, and Gartner still positions Oracle in the challenger quadrant.
This tells me that NetApp must be doing something right to keep convincing customers to buy from us, and that Oracle must not be doing as well as it pretends, since it remains outside the market leaders in storage.

Posted by guest on May 04, 2015 at 09:15 AM MDT #

So now benchmarks don't matter? Then why post them? I would argue that they give a customer a level playing field to see how storage solutions from many companies compare while running the EXACT same workload. Yes, that may differ from what a customer does in production, but that is what bake-offs are for... Many years ago Netapp was one of the biggest proponents of benchmarks like SPECsfs, and I am pretty sure they are still on the board of that organization as well.

Functionality? We have everything you list and much more, especially with DTrace-based analytics and our Oracle integrations like OISP and HCC.

I agree you have more market share currently, but Oracle is not sitting idle and neither is the rest of the industry. Compare the stock price of Oracle over 5 years with Netapp's over 5 years: Oracle has grown its stock 72% and Netapp 6%... Notice this is AFTER the Sun acquisition.

I didn't say Netapp doesn't do anything right. They have obviously built a very large business and market share based on the NFS protocol that Sun invented (LOL, how ironic). What I am saying and arguing is that, TB for TB, IOPS for IOPS, feature for feature, Oracle ZS is most likely a far better value than Netapp FAS on many fronts. Math cannot lie.

Posted by Darius Zanganeh on May 04, 2015 at 09:50 AM MDT #
