A closer look at the new T5 TPC-H result
By Stefan Hinker on Jun 18, 2013
You've probably all seen the new TPC-H benchmark result for the SPARC T5-4 submitted to TPC on June 7. Our benchmark guys over at "BestPerf" have already pointed out the major takeaways from the result. However, I believe there's more to make note of.
TPC doesn't promote the comparison of TPC-H results with different storage sizes. So let's just look at the 3000GB results:
- SPARC T4-4 with 4 CPUs (that's 32 cores at 3.0 GHz) delivers 205,792 QphH.
- SPARC T5-4 with 4 CPUs (that's 64 cores at 3.6 GHz) delivers 409,721 QphH.
That's just a little short of 100% scalability, if you'd expect a doubling of cores to deliver twice the result. Of course, one could expect to see a factor of 2.4, taking the increased clockrate into account. Since the TPC does not permit estimates and other "number games" with TPC results, I'll leave all the arithmetic to you. But let's look at some more details that might offer an explanation.
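If you want to do that arithmetic yourself, here is a minimal Python sketch. It only forms ratios of the published numbers quoted above; the "naive" expectation assumes performance scales perfectly with both core count and clock rate, which is an idealization rather than a TPC-sanctioned estimate. The variable names are mine.

```python
# Published TPC-H @3000GB figures quoted above, plus core counts and clock rates.
t4 = {"qphh": 205_792, "cores": 32, "ghz": 3.0}  # SPARC T4-4
t5 = {"qphh": 409_721, "cores": 64, "ghz": 3.6}  # SPARC T5-4

def ratio(key):
    """Ratio of the T5-4 value to the T4-4 value for one attribute."""
    return t5[key] / t4[key]

result_ratio = ratio("qphh")                # what was actually measured
core_ratio = ratio("cores")                 # doubling of cores
clock_ratio = ratio("ghz")                  # 3.0 GHz -> 3.6 GHz
naive_ratio = core_ratio * clock_ratio      # assumption: perfect scaling with both

print(f"measured result ratio : {result_ratio:.3f}")
print(f"core count ratio      : {core_ratio:.1f}")
print(f"cores x clock ratio   : {naive_ratio:.1f}")
```

The measured ratio lands just under the core ratio, and well short of the cores-times-clock figure.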
The report on BestPerf and the full disclosure report (FDR) provide some interesting insight into the storage configuration. For the SPARC T4-4 run, they used 12 2540-M2 arrays, each delivering around 1.5 GB/s for a total of 18 GB/s. These were obviously directly connected to the 24 8GBit FC ports of the SPARC T4-4, using two cables per storage array. Given the 8GBit ports of the 2540-M2, this setup would be good for a theoretical maximum of 2GB/sec per array. With 1.5GB/sec actual throughput, they were pretty much maxed out.
In the SPARC T5-4 run, they report twice the number of disks (via expansion trays for the 2540-M2 arrays) for a total of 33GB/s peak throughput, which isn't quite 2x the number achieved on the SPARC T4-4. To actually reach 2x the throughput (36GB/s), each array would have had to deliver 3 GB/sec over its 4 8GBit ports. The FDR only lists 12 dual-port FC HBAs, which explains the use of Brocade FC switches: connecting all 4 8GBit ports of each storage array and using the FC switches to bundle that into 24 16GBit HBA ports delivers the full 48x8GBit FC bandwidth of the arrays to the 24 FC ports of the server. Again, the theoretical maximum of 4 8GBit ports on each storage array would be 4 GB/sec, but considering all the protocol and "reality overhead", the 2.75 GB/sec they actually delivered isn't bad at all. Given this, reaching twice the overall benchmark performance is good, and the storage limit is a possible explanation for not going all the way to 2.4x. Of course, other factors like software scalability might also play a role here.
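The storage numbers for both runs can be checked with a few lines of Python. This is only a sketch: it uses the same rule of thumb as the text above (roughly 1 GB/sec of theoretical throughput per 8GBit FC port), and the function and variable names are made up for illustration.

```python
ARRAYS = 12              # 2540-M2 arrays in both runs
GBYTE_PER_PORT = 1.0     # assumption: ~1 GB/sec theoretical per 8GBit FC port

def storage(ports_per_array, measured_per_array):
    """Aggregate theoretical and measured throughput in GB/sec."""
    theoretical_per_array = ports_per_array * GBYTE_PER_PORT
    return {
        "theoretical_total": ARRAYS * theoretical_per_array,
        "measured_total": ARRAYS * measured_per_array,
        "utilization": measured_per_array / theoretical_per_array,
    }

t4_storage = storage(ports_per_array=2, measured_per_array=1.5)
t5_storage = storage(ports_per_array=4, measured_per_array=2.75)
throughput_gain = t5_storage["measured_total"] / t4_storage["measured_total"]

for name, s in (("T4-4", t4_storage), ("T5-4", t5_storage)):
    print(f"{name}: {s['measured_total']:.0f} of {s['theoretical_total']:.0f} GB/sec "
          f"({s['utilization']:.0%} of theoretical)")
print(f"storage throughput gain: {throughput_gain:.2f}x")
```

With these inputs the storage throughput grows by less than 2x between the two runs, which is the point made above.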
By the way - neither the SPARC T4-4 nor the SPARC T5-4 used any flash in these benchmarks.
Ever since the T4s have been on the market, our competitors have done their best to assure everyone that the SPARC core still lacks performance, and that large caches and high clockrates are the only key to real server performance. Now, when I look at public TPC-H results, I see this:
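|TPC-H @3000GB, Non-Clustered Systems|
|System|CPU|CPUs/Cores|Memory|QphH@3000GB|
|---|---|---|---|---|
|SPARC T5-4|3.6 GHz SPARC T5|4/64|2048 GB|409,721.8|
|SPARC T4-4|3.0 GHz SPARC T4|4/32|1024 GB|205,792.0|
|IBM Power 780|4.1 GHz POWER7|8/32|1024 GB|192,001.1|
|HP ProLiant DL980 G7|2.27 GHz Intel Xeon X7560|8/64|512 GB|162,601.7|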
So, in short, the 32 core SPARC T4-4 (which is 3 GHz and 4MB L3 cache) delivers more QphH@3000GB than IBM with their 32 core Power7 (which is 4.1 GHz and 32MB L3 cache) and also more than HP with the 64 core Intel Xeon system (2.27 GHz and 24MB L3 cache). So where exactly is SPARC lacking?
Right, one could argue that both competing results aren't exactly new. So let's do some speculation:
IBM's current Performance Report lists the above mentioned IBM Power 780 with an rPerf value of 425.5. A successor to the above Power 780 with P7+ CPUs would be the Power 780+ with 64 cores, which is available at 3.72 GHz. It is listed with an rPerf value of 690.1, which is a factor of 1.62. So based on IBM's own performance estimates, and assuming that storage will not be the limiting factor (IBM did test with 177 SSDs in the submitted result; they're welcome to increase that to 400), they would not be able to double the performance of the Power7 system. And they'd need more than double to beat the SPARC T5-4 result. This is even more challenging in the "per core" metric that IBM values so highly.
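For those who want to follow along, here is a small Python sketch of that comparison. It deliberately does not project a QphH figure for any unpublished system; it only compares the rPerf gain with the ratio of the two published TPC-H results listed at the end of this post. The variable names are mine.

```python
# rPerf values quoted above from IBM's Performance Report.
rperf_power780 = 425.5        # Power 780, POWER7, 32 cores
rperf_power780_plus = 690.1   # Power 780+, P7+, 64 cores

# Published TPC-H @3000GB results (see the disclosure at the end of the post).
qphh_power780 = 192_001.1
qphh_t5_4 = 409_721.8

rperf_gain = rperf_power780_plus / rperf_power780
required_factor = qphh_t5_4 / qphh_power780   # ratio of two published results

print(f"rPerf gain, Power 780 -> Power 780+ : {rperf_gain:.2f}x")
print(f"factor between the published results: {required_factor:.2f}x")
print(f"rPerf gain covers the gap           : {rperf_gain >= required_factor}")
```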
For x86, the story isn't any better. Unfortunately, Intel doesn't have such handy rPerf charts, so I'll have to fall back to SPECint_rate2006 for this one. (Note that I am not a big fan of using one benchmark to estimate another. SPECcpu in particular is not well suited to estimating database performance, as there is almost no I/O involved.) The above HP system is listed with 1580 CINT2006_rate. The best result as of 2013-06-14 for an 8-CPU system with the new Intel Xeon E7-4870 is 2180 CINT2006_rate. That's an improvement of 1.38x. (If we just take the increase in clockrate and core count, it would give us 1.32x.) I'll stop here and let you do the math yourself - it's not very promising for x86...
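The two x86 factors can be reproduced with a short Python sketch. One input is an assumption on my side and not stated in the text: the E7-4870 system is taken to have 80 cores (8 CPUs with 10 cores each), which is what makes the clock-and-cores factor come out at 1.32x. The clock rates are the ones given in the SPEC disclosure at the end of the post.

```python
# SPECint_rate2006 results quoted above for the HP ProLiant DL980 G7.
x7560 = {"rate": 1580, "cores": 64, "ghz": 2.27}   # 8 x Xeon X7560
e7_4870 = {"rate": 2180, "cores": 80, "ghz": 2.4}  # assumption: 8 CPUs x 10 cores

measured_gain = e7_4870["rate"] / x7560["rate"]

# Naive expectation: assumes perfect scaling with core count and clock rate.
naive_gain = (e7_4870["cores"] / x7560["cores"]) * (e7_4870["ghz"] / x7560["ghz"])

print(f"measured SPECint_rate2006 gain: {measured_gain:.2f}x")
print(f"cores x clock gain            : {naive_gain:.2f}x")
```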
So what have we learned?
- There's some evidence that storage might have been the limiting factor that prevented the SPARC T5-4 from scaling beyond 2x.
- The myth that SPARC cores don't perform is just that - a myth. Next time you meet your IBM sales rep, ask when they'll publish TPC-H for Power7+.
- Cache memory isn't the magic performance switch some people think it is.
- Scaling a CPU architecture (and the OS on top of it) beyond a certain limit is hard. It seems to be a little harder in the x86 world.
What did I miss? Well, price/performance is something I'll let you discuss with your sales reps ;-)
And finally, before people ask - no, I haven't moved to marketing. But sometimes I just can't resist...
The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
TPC-H, QphH, $/QphH are trademarks of Transaction Processing Performance Council (TPC). For more information, see www.tpc.org, results as of 6/7/13. Prices are in USD. SPARC T5-4 409,721.8 QphH@3000GB, $3.94/QphH@3000GB, available 9/24/13, 4 processors, 64 cores, 512 threads; SPARC T4-4 205,792.0 QphH@3000GB, $4.10/QphH@3000GB, available 5/31/12, 4 processors, 32 cores, 256 threads; IBM Power 780 192,001.1 QphH@3000GB, $6.37/QphH@3000GB, available 11/30/11, 8 processors, 32 cores, 128 threads; HP ProLiant DL980 G7 162,601.7 QphH@3000GB, $2.68/QphH@3000GB, available 10/13/10, 8 processors, 64 cores, 128 threads.
SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Results as of June 18, 2013 from www.spec.org. HP ProLiant DL980 G7 (2.27 GHz, Intel Xeon X7560): 1580 SPECint_rate2006; HP ProLiant DL980 G7 (2.4 GHz, Intel Xeon E7-4870): 2180 SPECint_rate2006.