By Jsavit-Oracle on Oct 02, 2011
I was eagerly awaiting the announcement made last week of the new SPARC T4 processor and servers. The T4 provides landmark performance (see the BestPerf blog), with world records beating systems based on IBM POWER7, IBM mainframes, and Intel Westmere. The T4 adds world-class single-thread performance to the throughput computing that T-series systems are known for. It has a 2.85 or 3.0 GHz clock rate, branch prediction, longer pipelines, and out-of-order execution, for up to 5x better per-CPU performance than its predecessor. Forget bogus old clichés like "SPARC is slow" or "T-series is slow"!
Product evolution
The first-generation T1 chip provided up to 8 cores, each with 4 CPU threads (hence the name CMT, for Chip MultiThreading). Each core ran at the same time as the others (a chip could retire 8 instructions per clock cycle), providing round-robin service to its CPU threads. On a cache miss, or after a quantum of clock cycles, the core would switch to the next CPU thread. This technique is extremely effective because most workloads spend a lot of time - often estimated at 2/3 to 3/4 of the time - suffering from cache misses. During a cache miss, an instruction "stalls" until RAM responds with a cache line of data. T-series uses otherwise-wasted stall time in one thread to run a different CPU thread's instructions. You can contrive instruction kernels whose working sets always fit in cache, but that's not the Real World.
The T1 effectively provided a 32-way multiprocessor. No individual processor was particularly fast, because transistors were spent on creating more (simple) threads rather than faster clocks and deeper pipelines. In aggregate, the many CPUs provided excellent throughput. Subsequent designs had 8 cores with 8 CPU threads per core (T2 and T2+) for 64 threads per chip, or 16 cores with 8 threads per core (T3) for 128 threads per chip. These dramatically increased compute density but offered only modest improvements for single-threaded applications - except for floating point and crypto, which were dramatically sped up.
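The latency-hiding idea behind CMT can be sketched with a toy probability model. This is my own simplification for illustration, not anything from the chip documentation: assume each thread is independently stalled on memory some fraction of the time, and the core does useful work whenever at least one of its threads is ready.

```python
# Toy model (an illustrative assumption, not real hardware behavior):
# each thread is stalled on a cache miss with probability `stall_fraction`,
# independently; the core does useful work whenever at least one thread
# on it is ready to run.
def core_utilization(stall_fraction: float, threads: int) -> float:
    """Fraction of cycles the core spends doing useful work."""
    return 1.0 - stall_fraction ** threads

# With threads stalled ~70% of the time (in the 2/3 to 3/4 range):
for n in (1, 2, 4):
    print(f"{n} thread(s): {core_utilization(0.7, n):.0%} busy")
```

With one thread the core is busy only ~30% of the time; with four threads to switch among, busy time rises to roughly 76% - the stall time of one thread is soaked up running the others.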
Now, the T4 has 8 cores with 8 threads, but with much faster per-thread performance.
T-series products always provided great throughput performance and price/performance, but you had to select applications that matched the machines' characteristics. Ideally that meant multi-threaded applications with good parallelism. Fortunately, a lot of workloads fit that thread-rich profile: web servers, messaging servers, Java application servers, and some database and middleware applications. Another approach is consolidation of multiple (even non-threaded) workloads, using T-series' built-in virtualization. Applications requiring single-CPU performance were better suited for M-series, which is designed for vertical scaling but lacks T-series' hardware crypto and built-in hypervisor. A trade-off.
The T4 removes the constraint on single-CPU performance: T-series can now be used for parallel applications that use many CPUs, for consolidation workloads, and for applications requiring hot single-CPU performance.
Measurement pitfall #1
A common situation: somebody would say, "My application isn't going fast enough, but vmstat says the CPU is almost completely idle. What's happening?" Closer inspection would reveal that average CPU utilization was indeed very low - 1% to 3% - but mpstat would show that one or two of the CPU threads were working as hard as they could. Consider a 128-thread T3-1 with only 1 active thread: vmstat will show average CPU utilization of 1/128, which is about 0.8%, even when that thread is 100% busy. The answer: run more threads! The box is almost completely idle, and adding more compute load won't slow down the existing application.
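The arithmetic behind that misleading average is simple enough to sketch (the function name here is mine, for illustration):

```python
# Sketch of how system-wide averages (as reported by tools like vmstat)
# dilute a single busy hardware thread across all of them.
def avg_utilization(busy_threads: int, total_threads: int) -> float:
    """System-wide CPU utilization (%), with each busy thread at 100%."""
    return 100.0 * busy_threads / total_threads

print(avg_utilization(1, 128))   # one 100%-busy thread on a T3-1 -> 0.78125
```

One saturated thread out of 128 shows up as under 1% utilization, which looks like an idle machine unless you drill down with a per-CPU view like mpstat.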
Measurement pitfall #2
Another pitfall happens when people measure performance of a single transaction on an empty system. Sometimes developers even compare response time on their laptops to the production servers. This gives a distorted view of performance unless your production systems are idle at peak load!
Consider this hypothetical (and rather simplified) example. Let's assume that CPU service time for a transaction on a 1.65GHz T3 chip is twice the time of a product with a deep pipeline and 2 CPUs running at 3GHz, and that response time is solely due to CPU service time. If response time on the T3 is 0.6 seconds, response time for a single transaction on the faster clock machine is 0.3 seconds. If the service level agreement requires 1 second response time, then both products are acceptable even though the faster clock produced faster response time.
What happens if we add concurrent transactions, as would happen in a real workload? Under our simple assumptions, the 2-CPU machine will still have 0.3-second response time with 2 concurrent transactions (each gets 100% of one of the two CPUs). But at 40 concurrent transactions, each transaction has the equivalent of only 5% of a CPU (2 CPUs divided by 40), and CPU service time grows to 6 seconds. On the T3 server, each of the 40 concurrent transactions still has 100% of a CPU, and response time remains 0.6 seconds, all the way up to 128 concurrent transactions. At 100 transactions, the 2-CPU system's response time is 15 seconds - 50x slower than its own unloaded 0.3 seconds - while the T3 would still be subsecond. That's the scalability of throughput computing: under load, the T-series system performs much better. (Yes, I know I'm oversimplifying, but at a crude level that's how it works.) Don't measure single transactions on idle systems!
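The example above is a simple processor-sharing model, and it can be written down directly (my own sketch, mirroring the stated assumptions: response time is pure CPU service time, split evenly among concurrent transactions):

```python
# Simplified processor-sharing model from the example above: response time
# is pure CPU service time, and concurrent transactions split the available
# CPUs evenly, capped at one full CPU per transaction.
def response_time(service_s: float, cpus: int, concurrent: int) -> float:
    """Response time when `concurrent` transactions share `cpus` CPUs."""
    cpu_share = min(1.0, cpus / concurrent)   # fraction of a CPU each
    return service_s / cpu_share

# 2 fast CPUs (0.3 s service time) vs. 128 T3 threads (0.6 s service time):
for n in (1, 2, 40, 100):
    print(f"{n:3d} concurrent: fast-clock {response_time(0.3, 2, n):5.1f} s, "
          f"T3 {response_time(0.6, 128, n):.1f} s")
```

Running this reproduces the numbers above: the fast-clock machine degrades from 0.3 s to 6 s at 40 transactions and 15 s at 100, while the T3 holds 0.6 s until its 128 threads are exhausted.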
What has changed
The big difference with the T4 is that it provides both the throughput of the earlier T-series chips (with networking, crypto, and virtualization enhancements I'll discuss at a later time) and the single-CPU performance that wasn't previously available on T-series. No more need to carefully select multi-threaded workloads - the T4 chip is a powerhouse for a very broad range of applications.
Which server should I pick?
A natural question for SPARC and Solaris customers would be "should I use a T4, a T3, or an M-series product?" Now that T-series has a broader range of applicability, there's more choice in platform selection: a T4 can be used in cases where M-series would have been the only answer. There's more overlap.
In general, the M-series will still have the advantage for vertically scaled workloads that need massive CPU, memory, and I/O capacity, that need its higher redundancy and reliability features, and that depend on the ability to add capacity to a running system by inserting CPU boards when needed. The T3 will still find use in pure throughput computing applications because it has higher core density and a lower software license core factor (0.25 instead of 0.5).
So, there's still room for the different models - but the best news is that they all remain completely compatible SPARC and Solaris systems, so applications can be deployed (and redeployed) across them without concerns about compatibility.