Thursday, March 27, 2014

A few Thoughts about Single Thread Performance



Tuesday, June 18, 2013

A closer look at the new T5 TPC-H result

You've probably all seen the new TPC-H benchmark result for the SPARC T5-4 submitted to TPC on June 7.  Our benchmark guys over at "BestPerf" have already pointed out the major takeaways from the result.  However, I believe there's more to make note of.

Scalability

TPC doesn't sanction the comparison of TPC-H results at different database sizes (scale factors).  So let's just look at the 3000GB results:

  • SPARC T4-4 with 4 CPUs (that's 32 cores at 3.0 GHz) delivers 205,792 QphH.
  • SPARC T5-4 with 4 CPUs (that's 64 cores at 3.6 GHz) delivers 409,721 QphH.

That's just a little short of 100% scalability, if you expect a doubling of cores to deliver twice the result.  Of course, one could even expect a factor of 2.4, taking the increased clock rate into account.  Since the TPC does not permit estimates and other "number games" with TPC results, I'll leave all the arithmetic to you.  But let's look at some more details that might offer an explanation.
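If you'd like to do that arithmetic at a shell prompt, it's a one-liner each (plain bc; strictly a reader's exercise, not a TPC metric):

$ echo "scale=2; 409721.8/205792.0" | bc     # actual T5-4 vs. T4-4 ratio
1.99
$ echo "scale=2; 2*(3.6/3.0)" | bc           # 2x cores times the clock rate increase
2.40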

Storage

The report on BestPerf, as well as the full disclosure report, provides some interesting insight into the storage configuration.  For the SPARC T4-4 run, they used 12 2540-M2 arrays, each delivering around 1.5 GB/s, for a total of 18 GB/s.  These were obviously connected directly to the 24 8GBit FC ports of the SPARC T4-4, using two cables per storage array.  Given the 8GBit ports of the 2540-M2, this setup would be good for a theoretical maximum of 2 GB/s per array.  At 1.5 GB/s of actual throughput, they were pretty much maxed out.

In the SPARC T5-4 run, they report twice the number of disks (via expansion trays for the 2540-M2 arrays) for a total of 33 GB/s peak throughput, which isn't quite 2x the 18 GB/s achieved on the SPARC T4-4.  To actually reach 2x the throughput (36 GB/s), each array would have had to deliver 3 GB/s over its four 8GBit ports.  The FDR only lists 12 dual-port FC HBAs, which explains the use of Brocade FC switches: all four 8GBit ports of each storage array are connected, and the switches bundle them onto the 24 16GBit HBA ports.  This delivers the full 48x8GBit FC bandwidth of the arrays to the 24 FC ports of the server.  Again, the theoretical maximum of four 8GBit ports per storage array would be 4 GB/s, but considering all the protocol and "reality" overhead, the 2.75 GB/s they actually delivered isn't bad at all.  Given this, reaching twice the overall benchmark performance is good - and the storage ceiling is a possible explanation for not going all the way to 2.4x.  Of course, other factors like software scalability might also play a role here.
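A quick sanity check of those numbers (my arithmetic, using the rule of thumb of roughly 1 GB/s usable per 8GBit FC port):

$ echo "12 * 2 * 1" | bc        # T4-4 run: 12 arrays x 2 ports x ~1 GB/s, theoretical ceiling in GB/s
24
$ echo "scale=2; 33/12" | bc    # T5-4 run: 33 GB/s over 12 arrays, per-array throughput in GB/s
2.75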

By the way - neither the SPARC T4-4 nor the SPARC T5-4 used any flash in these benchmarks. 

Competition

Ever since the T4 systems came to market, our competitors have done their best to assure everyone that the SPARC core still lacks performance, and that large caches and high clock rates are the only key to real server performance.  Yet when I look at public TPC-H results, I see this:

TPC-H @3000GB, Non-Clustered Systems

System                 CPU                          Chips/Cores – Memory   QphH
SPARC T5-4             3.6 GHz SPARC T5             4/64 – 2048 GB         409,721.8
SPARC T4-4             3.0 GHz SPARC T4             4/32 – 1024 GB         205,792.0
IBM Power 780          4.1 GHz POWER7               8/32 – 1024 GB         192,001.1
HP ProLiant DL980 G7   2.27 GHz Intel Xeon X7560    8/64 – 512 GB          162,601.7

So, in short: with 32 cores, the SPARC T4-4 (3.0 GHz, 4MB L3 cache) delivers more QphH@3000GB than IBM's 32 core POWER7 system (4.1 GHz, 32MB L3 cache), and also more than HP's 64 core Intel Xeon system (2.27 GHz, 24MB L3 cache).  So where exactly is SPARC lacking?

Right, one could argue that both competing results aren't exactly new.  So let's do some speculation:

IBM's current Performance Report lists the above-mentioned IBM Power 780 with an rPerf value of 425.5.  A successor to that Power 780 with POWER7+ CPUs would be the Power 780+ with 64 cores, which is available at 3.72 GHz.  It is listed with an rPerf value of 690.1 - 1.62x the older system.  So based on IBM's own performance estimates, and assuming that storage will not be the limiting factor (IBM tested with 177 SSDs in the submitted result; they're welcome to increase that to 400), they would not even be able to double the performance of the POWER7 system.  And they'd need more than that to beat the SPARC T5-4 result.  This is even more challenging in the "per core" metric that IBM values so highly.
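The back-of-the-envelope version of that argument, using only the numbers quoted above:

$ echo "scale=2; 690.1/425.5" | bc          # Power 780+ rPerf vs. Power 780 rPerf
1.62
$ echo "scale=2; 409721.8/192001.1" | bc    # factor IBM would need to match the T5-4 result
2.13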

For x86, the story isn't any better.  Unfortunately, Intel doesn't have such handy rPerf charts, so I'll have to fall back on SPECint_rate2006 for this one.  (Note that I am not a big fan of using one benchmark to estimate another.  SPECcpu in particular is not very suitable for estimating database performance, as there is almost no IO involved.)  The above HP system is listed with 1580 SPECint_rate2006.  The best result as of 2013-06-14 for the newer Intel Xeon E7-4870 with 8 CPUs is 2180 SPECint_rate2006.  That's an improvement of 1.38x.  (If we just take the increase in clock rate and core count, we'd expect 1.32x.)  I'll stop here and let you do the math yourself - it's not very promising for x86...
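For reference, the arithmetic behind those two factors (the E7-4870 runs 10 cores per chip at 2.4 GHz versus 8 cores at 2.27 GHz for the X7560):

$ echo "scale=3; 2180/1580" | bc              # SPECint_rate2006 improvement
1.379
$ echo "scale=3; (2.4/2.27)*(10/8)" | bc      # clock rate increase x core count increase
1.321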

Of course, IBM and others are welcome to prove me wrong - but as of today, I'm still waiting for more recent publications at this database size.

So what have we learned?

  • There's some evidence that storage might have been the limiting factor that prevented the SPARC T5-4 from scaling beyond 2x.
  • The myth that SPARC cores don't perform is just that - a myth.  The next time you hear it, ask your IBM sales rep when they'll publish a TPC-H result for POWER7+.
  • Cache memory isn't the magic performance switch some people think it is.
  • Scaling a CPU architecture (and the OS on top of it) beyond a certain limit is hard.  It seems to be a little harder in the x86 world.

What did I miss?  Well, price/performance is something I'll let you discuss with your sales reps ;-)

And finally, before people ask - no, I haven't moved to marketing.  But sometimes I just can't resist...


Disclosure Statements

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

TPC-H, QphH, $/QphH are trademarks of Transaction Processing Performance Council (TPC). For more information, see www.tpc.org, results as of 6/7/13. Prices are in USD. SPARC T5-4 409,721.8 QphH@3000GB, $3.94/QphH@3000GB, available 9/24/13, 4 processors, 64 cores, 512 threads; SPARC T4-4 205,792.0 QphH@3000GB, $4.10/QphH@3000GB, available 5/31/12, 4 processors, 32 cores, 256 threads; IBM Power 780 192,001.1 QphH@3000GB, $6.37/QphH@3000GB, available 11/30/11, 8 processors, 32 cores, 128 threads; HP ProLiant DL980 G7 162,601.7 QphH@3000GB, $2.68/QphH@3000GB, available 10/13/10, 8 processors, 64 cores, 128 threads.

SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Results as of June 18, 2013 from www.spec.org. HP ProLiant DL980 G7 (2.27 GHz, Intel Xeon X7560): 1580 SPECint_rate2006; HP ProLiant DL980 G7 (2.4 GHz, Intel Xeon E7-4870): 2180 SPECint_rate2006.

Thursday, April 4, 2013

A few remarks about T5

By now, most of you will have seen the announcement of the T5 and M5 systems.  I don't intend to repeat any of this, but I would like to share a few early thoughts.  Keep in mind, those thoughts are mine alone, not Oracle's.

It was rather obvious during the launch event that we will enjoy the competition with IBM even more than before.  I will not join the battle of words here, but leave you with a very nice summary (of the first few skirmishes) found on Forbes.  It is worth two minutes of reading - I find it very interesting how IBM seems to be losing interest in performance...

Since much of the attention we are getting is based on performance claims, I thought it would be nice to have a short and clearly arranged overview of the more commonly used benchmark results that were posted.  I will not compare the results to any other systems here, but leave this as a very entertaining exercise for you ;-)

There are more performance publications, especially on the BestPerf blog.  Some of these are interesting because they compare T5 to x86 CPUs, something I recommend doing if you don't shy away from reconsidering your view of the world from time to time.  But the ones listed here are more likely to be accepted as "independent" benchmarks than some others.  Now, we all know that benchmarking is a leap-frogging game - I wonder who will jump next?  (We've leap-frogged our own systems a couple of times, too...)  And to finish this entry off, I'd like to remind you that performance is only one part of the equation.  What usually matters just as much, if not more, is price performance.  In the benchmarking game, we can usually only compare list prices - have a go at that!  To quote Larry here:  “You can go faster, but only if you are willing to pay 80% less than what IBM charges.”

Competition is an interesting thing, don't you think?

Monday, May 14, 2012

Benchware Test of T4

There's a rather thorough performance comparison between an M5000 and a T4-2 that I can only recommend to anyone still wondering if those TPC-H world records are really possible:

Find the test report on the Benchware website - look for T4 in the "Benchmark" section.

And of course, check out the TPC-H results, too.  Look for 1000GB and 3000GB ;-)

(No, I didn't transfer to marketing.  I just think this test is worth a mention on a blog that's about performance, among other things.)

Tuesday, April 17, 2012

Solaris Zones: Virtualization that Speeds up Benchmarks

One of the first questions that typically comes up when I talk to customers about virtualization is the overhead involved.  Now, we all know that virtualization with hypervisors comes with an overhead of some sort.  We should also all know that exactly how big that overhead is depends on the type of workload as much as on the hypervisor used.  While there have been attempts to create standard benchmarks for this, quantifying hypervisor overhead is still mostly hidden in the mists of marketing and benchmark uncertainty.  However, what always raises eyebrows is when I come to Solaris Zones (called Containers in Solaris 10) as an alternative to hypervisor virtualization.  Since Zones are, greatly simplified, nothing more than a group of Unix processes contained by a set of rules which are enforced by the Solaris kernel, it is quite evident that there can't be much overhead involved.  Nevertheless, since many people think in hypervisor terms, there is almost always some doubt about this claim of zero overhead.  And as much as I find the explanation with technical details compelling, I also understand that seeing is so much better than believing.  So - look and see:

The Oracle benchmark teams are so convinced of the advantages of Solaris Zones that they actually use them in the configurations for public benchmarking.  Solaris resource management also works in a non-Zones environment, but Zones make it just so much easier to handle, especially with some of the more complex benchmark configurations.  There are numerous benchmark publications using Solaris Containers, dating back to the days of the T5440, several of the more recent ones world records.

The use of Solaris Zones is documented in all of these benchmark publications.

The benchmarking team has also published a blog entry detailing how they make use of resource management with Solaris Zones to actually increase application performance.  That almost calls for the term "negative overhead", if it weren't somewhat misleading.

So, if you ever need to substantiate why Solaris Zones have no virtualization overhead, point to these (and probably some more) published benchmarks.
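And if you want to show, rather than argue, how lightweight those kernel-enforced rules are, here's roughly what capping a zone's CPU and memory looks like (a minimal sketch; the zone name and limits are invented):

# zonecfg -z testzone
zonecfg:testzone> add capped-cpu
zonecfg:testzone:capped-cpu> set ncpus=4
zonecfg:testzone:capped-cpu> end
zonecfg:testzone> add capped-memory
zonecfg:testzone:capped-memory> set physical=8g
zonecfg:testzone:capped-memory> end
zonecfg:testzone> commit

The zone's workload is still just a set of Solaris processes; the kernel enforces these limits directly, with no hypervisor in the path.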

Friday, December 17, 2010

Solaris knows Hardware - pgstat explains it

When Sun's engineering teams observed large differences in memory latency on the E25K, they introduced the concept of locality groups (lgrp) into Solaris 9 9/02.  These describe the hierarchy of system components, which can differ greatly between hardware systems.  When creating processes and scheduling them onto CPUs for execution, Solaris tries to minimize the distance between CPU and memory for optimal latency.  This feature, known as Memory Placement Optimization (MPO), can, depending on hardware and application, significantly enhance performance.

There are, among many other things, thousands of counters in the Solaris kernel.  They can be queried using kstat, cpustat, or more widely used tools like mpstat or iostat.  Especially the counters made available with cpustat depend heavily on the underlying hardware.  Yet it hasn't always been easy to analyze the performance benefit of MPO and the utilization of individual parts of the hardware using these counters.  For cpustat, there was only a perl script called corestat to help understand T1/T2 core utilization.  This has finally changed with Solaris 11 Express.


There are now three new commands: lgrpinfo, pginfo and pgstat.

lgrpinfo shows the hierarchy of the lgroups - the NUMA architecture of the hardware.  This can be useful when configuring resource groups (for Containers or standalone) to select the right CPUs.
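A quick way to put this to work (the CPU IDs are invented; take the real ones from the lgrpinfo output):

# lgrpinfo                      # show the lgroup hierarchy with CPUs and memory per lgroup
# psrset -c 8 9 10 11           # create a processor set from the CPUs of one lgroup
# psrset -e 1 ./myapp           # run the application bound to that set (assuming set id 1)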

pginfo shows a different view of this information: a tree of the hardware hierarchy.  The leaves of this tree are the individual integer and floating-point units of each core.  Here's a little example from a T2 LDom configured with 16 strands from different cores:


# pginfo -v
0 (System [system]) CPUs: 0-15
|-- 3 (Data_Pipe_to_memory [chip]) CPUs: 0-7
|   |-- 2 (Floating_Point_Unit [core]) CPUs: 0-3
|   |   `-- 1 (Integer_Pipeline [core]) CPUs: 0-3
|   `-- 5 (Floating_Point_Unit [core]) CPUs: 4-7
|       `-- 4 (Integer_Pipeline [core]) CPUs: 4-7
`-- 8 (Data_Pipe_to_memory [core,chip]) CPUs: 8-15
    `-- 7 (Floating_Point_Unit [core,chip]) CPUs: 8-15
        |-- 6 (Integer_Pipeline) CPUs: 8-11
        `-- 9 (Integer_Pipeline) CPUs: 12-15

As you can see, the mapping of strands to pipelines and cores is immediately obvious.

pgstat, finally, is a worthy successor to corestat.  It gives you a good overview of the utilization of all components.  Again an example, on the same LDom - which also happens to show almost 100% core utilization, something I don't find very often...


# pgstat -Apv 1 2
 PG RELATIONSHIP                        HW   UTIL    CAP      SW     USR    SYS   IDLE  CPUS
  0 System [system]                      -      -      -  100.0%   99.6%   0.4%   0.0%  0-15
  3 Data_Pipe_to_memory [chip]           -      -      -  100.0%   99.1%   0.9%   0.0%  0-7
  2 Floating_Point_Unit [core]        0.0%   179K   1.3B  100.0%   99.1%   0.9%   0.0%  0-3
  1 Integer_Pipeline [core]          80.0%   1.3B   1.7B  100.0%   99.1%   0.9%   0.0%  0-3
  5 Floating_Point_Unit [core]        0.0%    50K   1.3B  100.0%   99.1%   0.9%   0.0%  4-7
  4 Integer_Pipeline [core]          80.2%   1.3B   1.7B  100.0%   99.1%   0.9%   0.0%  4-7
  8 Data_Pipe_to_memory [core,chip]      -      -      -  100.0%  100.0%   0.0%   0.0%  8-15
  7 Floating_Point_Unit [core,chip]   0.0%    80K   1.3B  100.0%  100.0%   0.0%   0.0%  8-15
  6 Integer_Pipeline                 76.4%   1.3B   1.7B  100.0%  100.0%   0.0%   0.0%  8-11
  9 Integer_Pipeline                 76.4%   1.3B   1.7B  100.0%  100.0%   0.0%   0.0%  12-15
 PG RELATIONSHIP                        HW   UTIL    CAP      SW     USR    SYS   IDLE  CPUS
  0 System [system]                      -      -      -  100.0%   99.7%   0.3%   0.0%  0-15
  3 Data_Pipe_to_memory [chip]           -      -      -  100.0%   99.5%   0.5%   0.0%  0-7
  2 Floating_Point_Unit [core]        0.0%    76K   1.2B  100.0%   99.5%   0.5%   0.0%  0-3
  1 Integer_Pipeline [core]          79.7%   1.2B   1.5B  100.0%   99.5%   0.5%   0.0%  0-3
  5 Floating_Point_Unit [core]        0.0%    42K   1.2B  100.0%   99.5%   0.5%   0.0%  4-7
  4 Integer_Pipeline [core]          79.8%   1.2B   1.5B  100.0%   99.5%   0.5%   0.0%  4-7
  8 Data_Pipe_to_memory [core,chip]      -      -      -  100.0%   99.9%   0.1%   0.0%  8-15
  7 Floating_Point_Unit [core,chip]   0.0%    80K   1.2B  100.0%   99.9%   0.1%   0.0%  8-15
  6 Integer_Pipeline                 76.3%   1.2B   1.5B  100.0%  100.0%   0.0%   0.0%  8-11
  9 Integer_Pipeline                 76.4%   1.2B   1.5B  100.0%   99.8%   0.2%   0.0%  12-15

SUMMARY: UTILIZATION OVER 2 SECONDS

                                     ------------HARDWARE------------  ------SOFTWARE------
 PG RELATIONSHIP                      UTIL    CAP    MIN    AVG    MAX     MIN     AVG     MAX  CPUS
  0 System [system]                      -      -      -      -      -  100.0%  100.0%  100.0%  0-15
  3 Data_Pipe_to_memory [chip]           -      -      -      -      -  100.0%  100.0%  100.0%  0-7
  2 Floating_Point_Unit [core]         76K   1.2B   0.0%   0.0%   0.0%  100.0%  100.0%  100.0%  0-3
  1 Integer_Pipeline [core]           1.2B   1.5B  79.7%  79.7%  80.0%  100.0%  100.0%  100.0%  0-3
  5 Floating_Point_Unit [core]         42K   1.2B   0.0%   0.0%   0.0%  100.0%  100.0%  100.0%  4-7
  4 Integer_Pipeline [core]           1.2B   1.5B  79.8%  79.8%  80.2%  100.0%  100.0%  100.0%  4-7
  8 Data_Pipe_to_memory [core,chip]      -      -      -      -      -  100.0%  100.0%  100.0%  8-15
  7 Floating_Point_Unit [core,chip]    80K   1.2B   0.0%   0.0%   0.0%  100.0%  100.0%  100.0%  8-15
  6 Integer_Pipeline                  1.2B   1.5B  76.3%  76.3%  76.4%  100.0%  100.0%  100.0%  8-11
  9 Integer_Pipeline                  1.2B   1.5B  76.4%  76.4%  76.4%  100.0%  100.0%  100.0%  12-15

The exact meaning of these values is nicely described in the man page for pgstat, so I'll leave the interpretation to the reader.  With this little tool, performance analysis, especially on T2/T3 systems, will be even more fun ;-)

Monday, June 7, 2010

prstat and microstate accounting

You never stop learning.  As a reply to my last blog entry, it was pointed out to me that with Solaris 10, microstate accounting is always enabled, and prstat supports this with the option "-m".  This option removes the moving average lags from the values displayed, and is much more accurate.  I wanted to know more about the background.  Eric Schrock was kind enough to provide it on his blog.  Here's a short summary.


The legacy output of prstat (and some of the other monitoring commands) represents moving averages based on regular samples.  With higher CPU frequencies, it is more and more likely that some scheduling events will be missed completely by these samples, which makes the reports increasingly unreliable.  Microstate accounting, in contrast, collects statistics for every event, at the time the event happens.  Thanks to some implementation tricks introduced with Solaris 10, this is now efficient enough to be turned on all the time.  If you use this more precise data with prstat, a CPU hog will show up immediately, at 100% CPU on all threads involved.  This is much more precise, and you needn't convert from the number of CPUs in the system to the corresponding percentage as in the example in my blog entry.  A single-threaded process will be visible instantly.  This is easier to do, easier to understand, less error-prone and more exact.
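In practice, this boils down to something like the following (flags as documented in prstat(1M)):

$ prstat -mL 5        # microstate columns, one line per thread (LWP), 5 second interval

A thread that is really burning CPU shows USR plus SYS close to 100%, while a high LAT value means it is runnable but waiting for a CPU - no sampling artifacts, no guessing.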


I've also updated the presentation to reflect this.


Thanks for the hint - you know who you are!  It's from things like this that I notice I've been using prstat and the like (successfully) for too long.  It's just like Eric mentioned in his blog: this great feature slipped past me, alongside all the more prominent stuff like Containers, ZFS, SMF, etc.  Thanks again!

Wednesday, March 10, 2010

Some thoughts about scalability


In a recent email thread about scalability and why Solaris is especially good at it, some long-time performance gurus summarized the subject matter so well that I thought it worth sharing with a broader community.  They agreed, so here it is:

What is scalability, and why is Solaris so good at not preventing applications from scaling?

Good scalability is a classic observation about systems that have been profiled for multiple years: they not only perform well at high load, they also degrade less on overload.


The cause is usually described mathematically: the slope of the response time (degradation) curve is dominated by the service time of the single slowest component.  A new product usually has a few large bottlenecks, and because they're large, the response time curve takes off for infinity early and goes almost straight up.  Overloading the system even a little bit causes it to "hit the wall" and seem to hang.  Picture X as load and Y as response time:

That's called the "hockey-stick" curve in the trade ;-) Response time is fairly flat until an inflection point, then heads up like a homesick angel.
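You can reproduce the numbers behind that shape with the classic M/M/1 queueing approximation R = S/(1-U) - my illustration, not from the original thread - where S is the service time (10ms here) and U the utilization:

$ for u in 0.5 0.8 0.9 0.95 0.99; do echo "scale=1; 10/(1-$u)" | bc; done
20.0
50.0
100.0
200.0
1000.0

Response time merely doubles between idle and 50% load, but grows without bound as utilization approaches 100%.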

A well-profiled, mature product has lots of little bottlenecks, the largest of which sets the inflection point and the slope.  With a small bottleneck, the slope is gentle, and during an overload the users see the system as somewhat slow, not hung.  It's the same curve, but with a later inflection point and a much gentler climb.


The reason you get bad performance at high loads on unprofiled programs is that above 80% load, there is a good chance that multiple users will make requests at the same time and momentarily drive the system past its inflection point into degradation.  As the system is un-optimized, the degradation at that point is large and user-visible.  This usually hits at around 70 or 80% load, sometimes even less.

We've been hunting down and fixing the slow bits for a long time, and have a very very gentle degradation curve.  PCs, on the other hand, tend to hit the wall really easily, and often.  Some of their legendary unreliability is really bogus: users overload their machines, assume they've hung, and then reboot.

In particular, the fine-grained spin-locking in Solaris is often celebrated as being responsible for a lot of its superior scaling.  In contrast, coarse-grained locks inflate the time spent in inherently serial sections, with the resulting impact just as Amdahl's Law would dictate.  A large set of evolved, architecturally aware features makes the Solaris scheduler itself a huge factor in the superior scaling of Solaris.  Other features, such as the evolved AIO options and preemption control, which have been well integrated by Oracle, provide even more reasons for better scaling.
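Amdahl's Law makes the cost of those serial sections concrete: the speedup on N CPUs is 1/((1-p) + p/N), where p is the parallel fraction of the work.  A quick example (numbers picked purely for illustration):

$ echo "scale=4; 1/((1-0.95) + 0.95/64)" | bc    # 95% parallel, 64 CPUs
15.4320

Even a workload that is 95% parallel gets barely a 15x speedup from 64 CPUs - which is why shrinking the serial parts with fine-grained locking matters so much.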

I should add that superior scaling is not all about peak throughput and the average response-time curve as a function of load; it also tends to manifest as reduced variance in response times, as well as the "graceful degradation" on the far side of peak throughput that you mentioned.  Those are factors I'd like to see characterized more frequently - but the habit in the benchmarking world is often to simply celebrate the peak results.

A last thing I'd mention is that the foundation for all this was laid when Sun's version of SVR4 was defined.  We pretty much threw out AT&T's implementation and did our own, with the idea of full preemption, multi-threading and all the rest.  It's much, much easier to deliver things on top of a solidly built foundation.  If you also need to rebuild the foundation, it's far harder to make things work.  One could make a pretty solid argument that, without that foundation, the rest would have been much, much harder.

Big hardware on top of a great foundation leads to customers who throw more work at the boxes, which exposes problems, which we fix, which leads to customers throwing even more work at the boxes...

To see all this in action, here's what you need for a live demo:

Start the old Gnome perfmeter (or perfbar or mpstat) on a customer system running an interactive load.  You can use the Java2D demo that comes with any JDK.  Then fire up dummy CPU loads and push the %CPU higher and higher in front of the customer's eyes, until it finally starts to feel slow. They'll be amazed at how close to 100% they are before they see any actual, user-visible degradation.

And if you want to make their brain explode, use SRM to grant their app 80% of the CPU and then start dozens of dummy CPU loads in another zone to force the CPU to pin at 100%, while their performance stays fine.  Of course, this is incredible enough that you may just convince them that you're faking it ;-)
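Under the hood, that demo is little more than the Fair Share Scheduler plus per-zone CPU shares - a minimal sketch, with invented zone names:

# dispadmin -d FSS                  # make the Fair Share Scheduler the system default
# zonecfg -z appzone
zonecfg:appzone> set cpu-shares=80
zonecfg:appzone> commit
zonecfg:appzone> exit
# zonecfg -z loadzone
zonecfg:loadzone> set cpu-shares=20
zonecfg:loadzone> commit
zonecfg:loadzone> exit

With only the app zone busy, it can use the whole machine; as soon as the dummy loads pile up in loadzone, FSS guarantees appzone its 80% - which is exactly what the audience sees.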


We did this in a demo for techies one immersion week, and even though they knew what we were doing, there were a lot of jaws left on the demo-room floor when they saw the theory in practice.

This article has been compiled from several email messages by

David Collier-Brown
James Litchfield
Bob Sneed

Thank you!
(A German version of this article is available in the German part of this blog.)

About

News, tips and other worthwhile bits about SPARC, CMT, performance and performance analysis, as well as experiences with Solaris on servers and laptops.

This is a bilingual blog (most of the time).
The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.
