Donnerstag Mrz 27, 2014
Dienstag Jun 18, 2013
By Stefan Hinker on Jun 18, 2013
You've probably all seen the new TPC-H benchmark result for the SPARC T5-4 submitted to TPC on June 7. Our benchmark guys over at "BestPerf" have already pointed out the major takeaways from the result. However, I believe there's more to make note of.
TPC doesn't promote the comparison of TPC-H results with different storage sizes. So let's just look at the 3000GB results:
- SPARC T4-4 with 4 CPUs (that's 32 cores at 3.0 GHz) delivers 205,792 QphH.
- SPARC T5-4 with 4 CPUs (that's 64 cores at 3.6 GHz) delivers 409,721 QphH.
That's just a little short of 100% scalability, if you'd expect a doubling of cores to deliver twice the result. Of course, one could expect to see a factor of 2.4, taking the increased clockrate into account. Since the TPC does not permit estimates and other "number games" with TPC results, I'll leave all the arithmetic to you. But let's look at some more details that might offer an explanation.
Looking at the report on BestPerf as well as the full disclosure report, they provide some interesting insight into the storage configuration. For the SPARC T4-4 run, they had used 12 2540-M2 arrays, each delivering around 1.5 GB/s for a total of 18 GB/s. These were obviously directly connected to the 24 8GBit FC ports of the SPARC T4-4, using two cables per storage array. Given the 8GBit ports of the 2540-M2, this setup would be good for a theoretical maximum of 2GB/sec per array. With 1.5GB/sec actual throughput, they were pretty much maxed out.
In the SPARC T5-4 run, they report twice the number of disks (via expansion trays for the 2540-M2 arrays) for a total of 33GB/s peak throughput, which isn't quite 2x the number achieved on the SPARC T4-4. To actually reach 2x the throughput (36GB/s), each array would have had to deliver 3 GB/sec over its 4 8GBit ports. The FDR only lists 12 dual-port FC HBAs, which explains the use of Brocade FC switches: Connecting all 4 8GBit ports of the storage arrays and using the FC switch to bundle that into 24 16GBit HBA ports. This delivers the full 48x8GBit FC bandwidth of the arrays to the 24 FC ports of the server. Again, the theoretical maximum of 4 8GBit ports on each storage array would be 4 GB/sec, but considering all the protocol and "reality overhead", the 2.75 GB/sec they actually delivered isn't bad at all. Given this, reaching twice the overall benchmark performance is good. And a possible explanation for not going all the way to 2.4x. Of course, other factors like software scalability might also play a role here.
By the way - neither the SPARC T4-4 nor the SPARC T5-4 used any flash in these benchmarks.
Ever since the T4s are on the market, our competitors have done their best to assure everyone that the SPARC core still lacks in performance, and that large caches and high clockrates are the only key to real server performance. Now, when I look at public TPC-H results, I see this:
|TPC-H @3000GB, Non-Clustered Systems|
3.6 GHz SPARC T5
4/64 – 2048 GB
3.0 GHz SPARC T4
4/32 – 1024 GB
|IBM Power 780
4.1 GHz POWER7
8/32 – 1024 GB
|HP ProLiant DL980 G7
2.27 GHz Intel Xeon X7560
8/64 – 512 GB
So, in short, with the 32 core SPARC T4-4 (which is 3 GHz and 4MB L3 cache), SPARC T4-4 delivers more QphH@3000GB than IBM with their 32 core Power7 (which is 4.1 GHz and 32MB L3 cache) and also more than HP with the 64 core Intel Xeon system (2.27 GHz and 24MB L3 cache). So where exactly is SPARC lacking??
Right, one could argue that both competing results aren't exactly new. So let's do some speculation:
IBM's current Performance Report lists the above mentioned IBM Power 780 with an rPerf value of 425.5. A successor to the above Power 780 with P7+ CPUs would be the Power 780+ with 64 cores, which is available at 3.72 GHz. It is listed with an rPerf value of 690.1, which is 1.62x more. So based on IBM's own performance estimates, and assuming that storage will not be the limiting factor (IBM did test with 177 SSDs in the submitted result, they're welcome to increase that to 400) they would not be able to double the performance of the Power7 system. And they'd need more than that to beat the SPARC T5-4 result. This is even more challenging in the "per core" metric that IBM values so highly.
For x86, the story isn't any better. Unfortunately, Intel doesn't have such handy rPerf charts, so I'll have to fall back to SPECint_rate2006 for this one. (Note that I am not a big fan of using one benchmark to estimate another. Especially SPECcpu is not very suitable to estimate database performance as there is almost no IO involved.) The above HP system is listed with 1580 CINT2006_rate. The best result as of 2013-06-14 for the new Intel Xeon E7-4870 with 8 CPUs is 2180 CINT2006_rate. That's an improvement of 1.38x. (If we just take the increase in clockrate and core count, it would give us 1.32x.) I'll stop here and let you do the math yourself - it's not very promising for x86...
So what have we learned?
- There's some evidence that storage might have been the limiting factor that prevented the SPARC T5-4 to scale beyond 2x
- The myth that SPARC cores don't perform is just that - a myth. Next time you meet one, ask your IBM sales rep when they'll publish TPC-H for Power7+
- Cache memory isn't the magic performance switch some people think it is.
- Scaling a CPU architecture (and the OS on top of it) beyond a certain limit is hard. It seems to be a little harder in the x86 world.
What did I miss? Well, price/performance is something I'll let you discuss with your sales reps ;-)
And finally, before people ask - no, I haven't moved to marketing. But sometimes I just can't resist...
The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
TPC-H, QphH, $/QphH are trademarks of Transaction Processing Performance Council (TPC). For more information, see www.tpc.org, results as of 6/7/13. Prices are in USD. SPARC T5-4 409,721.8 QphH@3000GB, $3.94/QphH@3000GB, available 9/24/13, 4 processors, 64 cores, 512 threads; SPARC T4-4 205,792.0 QphH@3000GB, $4.10/QphH@3000GB, available 5/31/12, 4 processors, 32 cores, 256 threads; IBM Power 780 QphH@3000GB, 192,001.1 QphH@3000GB, $6.37/QphH@3000GB, available 11/30/11, 8 processors, 32 cores, 128 threads; HP ProLiant DL980 G7 162,601.7 QphH@3000GB, $2.68/QphH@3000GB available 10/13/10, 8 processors, 64 cores, 128 threads.
SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Results as of June 18, 2013 from www.spec.org. HP ProLiant DL980 G7 (2.27 GHz, Intel Xeon X7560): 1580 SPECint_rate2006; HP ProLiant DL980 G7 (2.4 GHz, Intel Xeon E7-4870): 2180 SPECint_rate2006.
Montag Mai 14, 2012
By Stefan Hinker on Mai 14, 2012
There's a rather thorough performance comparison between an M5000 and a T4-2 that I can only recommend to anyone still wondering if those TPC-H world records are really possible:
And of course, check out the TPC-H results, too. Look for 1000GB and 3000GB ;-)
(No, I didn't transfer to marketing. I just think this test is worth being mentioned on a blog that's about performance, among other things.)
Dienstag Apr 17, 2012
By Stefan Hinker on Apr 17, 2012
One of the first questions that typically comes up when I talk to customers about virtualization is the overhead involved. Now we all know that virtualization with hypervisors comes with an overhead of some sort. We should also all know that exactly how big that overhead is depends on the type of workload as much as it depends on the hypervisor used. While there have been attempts to create standard benchmarks for this, quantifying hypervisor overhead is still mostly hidden in the mists of marketing and benchmark uncertainty. However, what always raises eyebrows is when I come to Solaris Zones (called Containers in Solaris 10) as an alternative to hypervisor virtualization. Since Zones are, greatly simplyfied, nothing more than a group of Unix processes contained by a set of rules which are enforced by the Solaris kernel, it is quite evident that there can't be much overhead involved. Nevertheless, since many people think in hypervisor terms, there is almost always some doubt about this claim of zero overhead. And as much as I find the explanation with technical details compelling, I also understand that seeing is so much better than believing. So - look and see:
The Oracle benchmark teams are so convinced of the advantages of Solaris Zones that they actually use them in the configurations for public benchmarking. Solaris resource management will also work in a non Zones environment, but Zones make it just so much easier to handle, especially with some of the more complex benchmark configurations. There are numerous benchmark publications available using Solaris Containers, dating back to the days of the T5440. Some recent examples, all of them world records, are:
The use of Solaris Zones is documented in all of these benchmark publications.
The benchmarking team also published a blog entry detailing how they make use of resource management with Solaris Zones to actually increase application performance. That almost asks for calling this "negative overhead", if the term weren't somewhat misleading.
So, if you ever need to substantiate why Solaris Zones have no virtualization overhead, point to these (and probably some more) published benchmarks.
Neuigkeiten, Tipps und Wissenswertes rund um SPARC, CMT, Performance und ihre Analyse sowie Erfahrungen mit Solaris auf dem Server und dem Laptop.
This is a bilingual blog (most of the time). Please select your prefered language:
The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
- Ein paar Gedanken zu Single Thread Performance
- A few Thoughts about Single Thread Performance
- What's up with LDoms: Part 8 - Physical IO
- CPU-DR fuer Zonen
- CPU-DR for Zones
- Memory-DR for Zones
- Memory-DR fuer Zonen
- What's up with LDoms: Part 7 - Layered Virtual Networking
- What's up with LDoms: Eine Artikel-Reihe zum Thema Oracle VM Server for SPARC
- Das T5-4 TPC-H Ergebnis naeher betrachtet