Database scaling on Sun Fire T2000
By travi on Dec 07, 2005
1. Where does time go?
In their analysis of DBMS performance on modern processors, researchers from the University of Wisconsin looked at CPU utilization of a DBMS by asking where the time goes, i.e., how processor cycles actually get used. Their conclusion clearly states that for OLTP workloads, 60% to 80% of the time is spent in memory-related stalls, and the stall breakdown is dominated by data and instruction misses at the L2 cache level. Because of this high proportion of memory stalls, OLTP workloads exhibit a high CPI (Cycles Per Instruction). As stalls increase, processor core utilization drops, lowering overall efficiency.
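To make the stall arithmetic concrete, here is a minimal sketch in Python; the 60% to 80% stall figure is from the study, while the base CPI of 1.0 is an illustrative assumption:

```python
# Illustrative sketch: how memory-stall time inflates CPI.
# The base CPI of 1.0 (one cycle per instruction when never stalled)
# is an assumption for illustration, not a measured value.

def effective_cpi(base_cpi, stall_fraction):
    """CPI when stall_fraction of all cycles are memory stalls.

    If stalls take fraction s of total cycles, only (1 - s) of the
    cycles do useful work, so observed CPI = base_cpi / (1 - s).
    """
    return base_cpi / (1.0 - stall_fraction)

print(effective_cpi(1.0, 0.60))  # 60% of time stalled -> CPI 2.5
print(effective_cpi(1.0, 0.80))  # 80% of time stalled -> CPI 5.0
```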
The UltraSPARC T1 processor with CoolThreads technology is fundamentally designed to take advantage of the stall component in the workload. UltraSPARC T1 hides memory stalls in one thread by letting other threads from the same core use the pipeline. Where a thread on a conventional processor would stall and still occupy the pipeline, UltraSPARC T1's hardware threads can continue to execute even if one or more threads are stalled. This greatly improves core efficiency.
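A toy probability model (my simplification, not the actual T1 microarchitecture) illustrates why interleaving threads hides stalls: if each thread is stalled independently with some probability in a given cycle, the shared pipeline stays busy whenever at least one thread can issue.

```python
# Toy model (an assumption, not how the T1 actually schedules): each
# of k hardware threads is memory-stalled in a given cycle with
# independent probability p; the shared pipeline does useful work
# whenever at least one thread is ready to issue.

def pipeline_utilization(p_stalled, threads):
    """Probability the pipeline has at least one runnable thread."""
    return 1.0 - p_stalled ** threads

for k in (1, 2, 4):
    print(k, round(pipeline_utilization(0.70, k), 3))
# With p = 0.70: 1 thread keeps the pipeline ~30% busy,
# 4 threads keep it ~76% busy in this toy model.
```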
2. What did we observe?
Soon after early prototypes of the Sun Fire servers based on the UltraSPARC T1 processor arrived, we were curious to see how CMT works for a database, how the shared L2 cache behaves for OLTP, and how commercial databases benefit from all the large-page performance projects in Solaris.
So we configured a database of about 1.5 TB using a commercial DBMS on a Sun Fire T2000 with 32 GB of memory and ran a number of performance tests. Here is what we found:
Scaling characteristics:
Initially, by sizing the database scale, we controlled the amount of I/O activity so we could simply understand scaling with an increasing number of hardware threads. We noticed excellent scaling:

(table: # of Virtual processors vs. relative throughput)
We did two sets of experiments to understand scaling further. By appropriately sizing the database cache and the database scale, we kept the disk I/O per transaction more or less the same throughout these tests. For these tests, threads within a core were disabled as needed using the psradm(1M) command in Solaris.
- Scaling across cores (always using all 4 threads/core)

(table: # of Cores, # of Virtual processors, Relative throughput)

- Scaling the # of hardware threads per core

(table: # of hardware Threads/core, # of Virtual Processors, Relative throughput)

\* We observed ~10% idle time for this configuration
Comparing the throughput results shows that, for the same number of hardware threads, it is beneficial to use more cores. The performance gap closes as we increase the number of threads per core.
|1 thread/core vs. 4 threads in 2 cores||33%|
|2 threads/core vs. 4 threads in 4 cores||16%|
|3 threads/core vs. 4 threads in 6 cores||2%|
This shows that, for the same number of hardware threads, DBMS performance benefits from using more cores. Certain resources, such as the Level 1 caches and TLBs, are per-core, so using more cores gives the software more of these resources. However, at around 24 threads, the difference between choosing all 8 cores and only 6 cores almost vanishes.
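The percentages above are relative throughput gaps between two configurations with the same total number of hardware threads. As a sketch, with hypothetical throughput numbers (the post does not give absolute figures), the gap works out as:

```python
def relative_gap_percent(more_cores_tps, fewer_cores_tps):
    """Percent throughput advantage of the more-cores configuration."""
    return 100.0 * (more_cores_tps - fewer_cores_tps) / fewer_cores_tps

# Hypothetical values: the same thread count spread 1 thread/core
# versus packed 4 threads/core into fewer cores. Both numbers are
# made up purely to illustrate the calculation.
spread_tps, packed_tps = 1330.0, 1000.0
print(round(relative_gap_percent(spread_tps, packed_tps), 1))  # -> 33.0
```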
DBMS and large page support in Solaris:
We also characterized the large-pagesize selection features in Solaris 10 that were developed specifically for UltraSPARC T1.
Individual feature tests covered the following Large Page (LP) features:

- LP for text and libraries
- LP for ISM (database shared memory)
- LP for kernel heap
- LP for heap, stack and anon pages

(table: Large Page (LP) feature vs. observed gain)
All of this works right out of the box!
Other observations:
- We also tried different scheduling classes, FX as well as RT, but noticed that even at high load the default TS scheduling class performs best.
- Understanding processor utilization on CMT systems can be tricky. Low-CPI applications tend to saturate a core using fewer threads, whereas high-CPI applications tend to run out of available threads before they can saturate the core. Since the CPI of the OLTP workload on the Sun Fire T2000 is quite high, it does not saturate the core capacity of 1.2 billion instructions/sec/core (at 1200 MHz frequency). This means we can use vmstat and mpstat to get a true idea of the available headroom. Scaling tests with varying load on the system have also shown that the performance variation pattern closely follows the variation in CPU utilization.
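The headroom arithmetic in the point above can be sketched with a simplified single-issue model (an assumption: at best one instruction retires per cycle, so a 1200 MHz core tops out at 1.2 billion instructions/sec, and a workload with aggregate CPI c sustains clock/c):

```python
CLOCK_HZ = 1.2e9  # UltraSPARC T1 core at 1200 MHz, modeled as single-issue

def core_ips(aggregate_cpi):
    """Instructions/sec one core sustains at a given aggregate CPI."""
    return CLOCK_HZ / aggregate_cpi

def core_utilization(aggregate_cpi):
    """Fraction of the 1.2e9 instr/sec/core ceiling actually used."""
    return core_ips(aggregate_cpi) / CLOCK_HZ

# A high-CPI OLTP workload leaves the issue slots far from saturated:
print(core_utilization(4.0))  # aggregate CPI 4 -> 0.25 of peak issue rate
print(core_ips(1.0))          # CPI 1 -> the 1.2e9 instr/sec ceiling
```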
- As throughput scaled, we did notice a slight increase in response time. It remained reasonably low and within acceptable limits.
A number of factors contribute to the overall good performance and scaling; fundamentally, the CoolThreads technology is working well. [We validated this by analyzing hardware performance counter data collected using cpustat.]
- We used cpustat to analyze cache misses per transaction and to analyze the code path. Cache misses increase only marginally as we scale up, which validates that the 12-way associativity of the L2 cache is working well. We also noticed that the code path, i.e., instructions executed per transaction, remains almost constant even as we increase the number of threads. This also shows good software scaling.
- Floating-point usage is extremely low in the case of a DBMS. During the tests, cpustat data showed that floating point unit (FPU) usage was less than 0.01% of instructions.
- All the large-page features in Solaris help reduce the number of TLB misses, and they require no special /etc/system tuning.
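The benefit comes from TLB reach: larger pages let the same number of TLB entries map far more memory, so misses drop. A back-of-the-envelope sketch (the 64-entry TLB size here is hypothetical, though 8 KB, 4 MB and 256 MB are real UltraSPARC page sizes):

```python
def tlb_reach_bytes(entries, page_size_bytes):
    """Memory covered by a fully populated TLB at one page size."""
    return entries * page_size_bytes

KB, MB = 1024, 1024 * 1024
ENTRIES = 64  # hypothetical TLB entry count, not the T1's actual size

print(tlb_reach_bytes(ENTRIES, 8 * KB) // KB)    # 8 KB pages   -> 512 (KB)
print(tlb_reach_bytes(ENTRIES, 4 * MB) // MB)    # 4 MB pages   -> 256 (MB)
print(tlb_reach_bytes(ENTRIES, 256 * MB) // MB)  # 256 MB pages -> 16384 (MB)
```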
- Along with all of this, low memory latency, ample I/O connectivity with 3 PCI-E and 2 PCI-X slots, and 4 onboard GigE ports providing good network connectivity make the Sun Fire T2000 a balanced server architecture for databases.