Database scaling on Sun Fire T2000

Database scalability on Sun Fire T2000

With the launch of the Sun Fire T2000 server, I would like to share three aspects of database performance on these CoolThreads servers. First, let us see what researchers have found about the CPU utilization of DBMSs; then we will look at what we observed in our tests; and finally I will share my analysis.

1. Where does time go?

Researchers from the University of Wisconsin, in their analysis of DBMS performance on modern processors, looked at the CPU utilization of DBMSs by asking where the time goes, that is, how processor cycles are actually spent. Their conclusion is that for OLTP workloads 60% to 80% of the time is spent in memory-related stalls. The breakdown of memory stalls shows that data and instruction stalls at the L2 cache level dominate. Because of this high proportion of memory stalls, OLTP workloads exhibit a high CPI (cycles per instruction). As stalls increase, processor core utilization drops, which lowers overall efficiency.

The UltraSPARC T1 processor with CoolThreads technology is fundamentally designed to take advantage of the stall component in the workload. It hides memory stalls in one thread by allowing other threads on the same core to use the pipeline. Where a thread on a conventional processor would stall and still occupy the pipeline, the UltraSPARC T1 has hardware threads that can continue to execute even if one or more threads are stalled. This greatly improves core efficiency.
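To make the stall-hiding idea concrete, here is a minimal toy simulation in Python. It is only a sketch under stated assumptions, not a model of the real UltraSPARC T1 pipeline: each hardware thread issues one instruction and then waits a fixed number of cycles on memory, the core issues at most one instruction per cycle, and ready threads are picked round-robin. The 1 busy cycle to 3 stall cycles ratio (75% stall) is simply a value inside the 60% to 80% range quoted above.

```python
def core_utilization(num_threads, busy_cycles, stall_cycles, total_cycles=100_000):
    """Toy model of one single-issue core shared by several hardware threads.

    Each thread issues `busy_cycles` instructions, then stalls on memory for
    `stall_cycles` cycles, and repeats. Every cycle the core issues from at
    most one ready thread, chosen round-robin (an assumption of this sketch).
    Returns the fraction of cycles in which an instruction was issued.
    """
    remaining = [busy_cycles] * num_threads  # instructions left before the next stall
    wait = [0] * num_threads                 # cycles left in the current stall
    issued = 0
    next_thread = 0

    for _ in range(total_cycles):
        # Find the first thread that is not stalled, starting round-robin.
        chosen = None
        for k in range(num_threads):
            t = (next_thread + k) % num_threads
            if wait[t] == 0:
                chosen = t
                break

        # Threads that are already stalled get one cycle closer to their data.
        for t in range(num_threads):
            if wait[t] > 0:
                wait[t] -= 1

        if chosen is not None:
            issued += 1
            remaining[chosen] -= 1
            if remaining[chosen] == 0:
                # The thread now misses in cache and stalls.
                wait[chosen] = stall_cycles
                remaining[chosen] = busy_cycles
            next_thread = (chosen + 1) % num_threads

    return issued / total_cycles


# 1 busy cycle followed by 3 stall cycles: each thread alone is stalled 75% of the time.
for n in (1, 2, 3, 4):
    print(n, "thread(s): core utilization", core_utilization(n, busy_cycles=1, stall_cycles=3))
```

In this toy model a single thread keeps the core busy only a quarter of the time, while four such threads fill the pipeline. Real workloads have irregular stalls, so the actual gain is smaller than this idealized case.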

2. What did we observe?

Soon after the arrival of early prototypes of Sun Fire servers based on the UltraSPARC T1 processor, we were curious to know how CMT works for databases, how the shared L2 cache behaves for OLTP, and how commercial databases benefit from all the large page performance projects in Solaris.

So we configured a database of about 1.5 TB using a commercial DBMS on a Sun Fire T2000 with 32 GB of memory and ran a number of performance tests. Here is what we found:

Scaling characteristics:

Initially, by sizing the database scale, we controlled the amount of I/O activity so that we could simply understand the scaling with an increasing number of hardware threads. We saw excellent scaling:

| # of hardware threads/core | # of virtual processors | Relative performance (throughput) |
|---|---|---|
| 1 | 8 | 1.0 |
| 2 | 16 | 1.95 |
| 3 | 24 | 2.98 |
| 4 | 32 | 3.9 |


We ran two sets of experiments to understand scaling further. By appropriately sizing the database cache and the database scale, we kept the disk I/O per transaction more or less the same throughout these tests. For these tests, threads within a core were disabled as needed using the Solaris psradm(1M) command.

  • Scaling across cores (always using all 4 threads per core)

| # of cores | # of virtual processors | Relative performance (throughput) |
|---|---|---|
| 2 | 8 | 1.0 |
| 4 | 16 | 1.91 |
| 6 | 24 | 2.91 |
| 8 | 32 | 3.68* |


  • Scaling the number of hardware threads per core

| # of hardware threads/core | # of virtual processors | Relative performance (throughput) |
|---|---|---|
| 1 | 8 | 1.0 |
| 2 | 16 | 1.84 |
| 3 | 24 | 2.47 |
| 4 | 32 | 3.10* |

     * We observed ~10% idle time for this configuration.

As shown above, this commercial DBMS scaled quite well in both dimensions. Because of the high amount of I/O, we saw idle time at 32 threads.
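One way to read the three result tables is as scaling efficiency: measured relative throughput divided by the ideal (linear) speedup for the same number of virtual processors. The short Python sketch below does that arithmetic on the figures reported above; the function and variable names are my own, and nothing here is a new measurement.

```python
def scaling_efficiency(rows):
    """rows: list of (virtual_processors, relative_throughput); first row is the baseline.

    Returns a list of (virtual_processors, efficiency_percent), where 100%
    means throughput grew exactly in proportion to the virtual processors.
    """
    base_vp, base_perf = rows[0]
    result = []
    for vp, perf in rows:
        ideal_speedup = vp / base_vp
        measured_speedup = perf / base_perf
        result.append((vp, round(100.0 * measured_speedup / ideal_speedup, 1)))
    return result


# Figures copied from the three tables above.
initial_test = [(8, 1.0), (16, 1.95), (24, 2.98), (32, 3.9)]
across_cores = [(8, 1.0), (16, 1.91), (24, 2.91), (32, 3.68)]
threads_per_core = [(8, 1.0), (16, 1.84), (24, 2.47), (32, 3.10)]

for name, rows in [("initial test", initial_test),
                   ("across cores", across_cores),
                   ("threads per core", threads_per_core)]:
    print(name, scaling_efficiency(rows))
```

Each series is normalized to its own 8-virtual-processor baseline, so the efficiencies describe how each series scales and should not be compared across series as absolute throughput.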

There are two ways to select hardware threads from the cores of the UltraSPARC T1. For example, to use 8 hardware threads, we can either use all 4 threads in each of 2 cores or use 1 thread in each of the 8 cores.
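As a rough illustration of the two selection strategies, the Python sketch below builds the list of virtual processor IDs to take offline with `psradm -f` for a given layout. It only prints the command and does not run it. The ID numbering is an assumption of this sketch (strand s of core c is taken to be virtual processor 4*c + s); check the actual layout with psrinfo on your own system before using anything like this.

```python
def processors_to_offline(active_cores, threads_per_core,
                          total_cores=8, strands_per_core=4):
    """Return the virtual processor IDs to disable for a given layout.

    ASSUMPTION (not from the original post): strand s of core c has the
    virtual processor ID c * strands_per_core + s.
    """
    offline = []
    for core in range(total_cores):
        for strand in range(strands_per_core):
            keep = core < active_cores and strand < threads_per_core
            if not keep:
                offline.append(core * strands_per_core + strand)
    return offline


def psradm_command(ids):
    """Format a psradm invocation that takes the given processors offline."""
    return "psradm -f " + " ".join(str(i) for i in ids)


# 8 hardware threads as 4 threads in each of 2 cores:
print(psradm_command(processors_to_offline(active_cores=2, threads_per_core=4)))

# 8 hardware threads as 1 thread in each of 8 cores:
print(psradm_command(processors_to_offline(active_cores=8, threads_per_core=1)))
```

Both layouts leave 8 virtual processors online and 24 offline; they differ only in how those 8 are spread across cores.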

A comparison of the throughput results shows that, for the same number of hardware threads, it is beneficial to use more cores. The performance gap closes as we increase the number of threads per core.


| Configuration | Performance difference |
|---|---|
| 1 thread/core vs. 4 threads/core in 2 cores | 33% |
| 2 threads/core vs. 4 threads/core in 4 cores | 16% |
| 3 threads/core vs. 4 threads/core in 6 cores | 2% |

This shows that, for the same number of hardware threads, DBMS performance benefits from using more cores. Certain resources, such as the Level 1 caches and TLBs, are provided per core, so using more cores gives the software more of these resources. However, at around 24 threads the difference between using all 8 cores and using only 6 cores almost vanishes.

DBMS and large page support in Solaris:

We also characterized the large page size selection features in Solaris 10 that were developed specifically for the UltraSPARC T1.

The UltraSPARC T1 processor has 64-entry instruction and data TLBs per core, which support 8 KB, 64 KB, 4 MB and 256 MB page sizes. The Solaris 10 kernel on the Sun Fire T2000 has been optimized to use large pages for various segments in the address space of a process. Solaris provides an optimum page size selection algorithm out of the box and requires no special tuning.
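To see why page size matters so much with a 64-entry TLB, it helps to compute the TLB reach, that is, how much memory the TLB can map at once. The Python sketch below does this arithmetic for the four supported page sizes. It assumes, for simplicity, that all 64 entries map pages of the same size, which a real workload will not do.

```python
TLB_ENTRIES = 64  # per-core TLB entries, from the text above

# Supported page sizes in bytes.
PAGE_SIZES = {
    "8 KB": 8 * 1024,
    "64 KB": 64 * 1024,
    "4 MB": 4 * 1024 ** 2,
    "256 MB": 256 * 1024 ** 2,
}


def tlb_reach_bytes(page_size_bytes, entries=TLB_ENTRIES):
    """Memory covered when every TLB entry maps one page of the given size."""
    return page_size_bytes * entries


def human(n_bytes):
    """Format a byte count using binary units."""
    for unit in ("bytes", "KB", "MB", "GB"):
        if n_bytes < 1024 or unit == "GB":
            return f"{n_bytes:g} {unit}"
        n_bytes /= 1024


for label, size in PAGE_SIZES.items():
    print(f"{label:>6} pages -> TLB reach {human(tlb_reach_bytes(size))}")
```

With 8 KB pages the TLB covers only 512 KB, whereas with 256 MB pages the same 64 entries cover 16 GB, which is on the scale of the database shared memory on a 32 GB machine.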

Individual feature tests showed the following results:

| Large page (LP) feature | Performance gain (%) |
|---|---|
| LP for text and libraries | 6.4% |
| LP for ISM (database shared memory) | 9.4% |
| LP for kernel heap | 6.8% |
| LP for heap, stack and anon | 1.8% |

We have seen OLTP performance improvements of up to 30% from the combined effect of all the large page projects in Solaris. When running a commercial DBMS on the Sun Fire T2000, we see most of the database cache allocated using 256 MB pages, text allocated on 4 MB pages, and heap, stack and anonymous memory segments allocated using 64 KB pages.

All of this works out of the box!

Other observations:

  • We also tried different scheduling classes, FX as well as RT, but found that even at high throughput the default TS scheduling class performs best.

  • Understanding processor utilization on CMT systems can be tricky. Low-CPI applications tend to saturate the core using fewer threads, whereas high-CPI applications tend to run out of available threads before they can saturate the core. Since the CPI of the OLTP workload on the Sun Fire T2000 is quite high, it does not saturate the core capacity of 1.2 billion instructions/sec/core (at 1200 MHz). This means we can use vmstat and mpstat to get a true idea of the available headroom. Scaling tests with varying load on the system have also shown performance closely following the variation in CPU utilization.

  • As the throughput scaled, we did notice a slight increase in response time, but it remained reasonably low and within acceptable limits.
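The point above about CPI and core saturation can be put into numbers. The Python sketch below assumes a core that issues at most one instruction per cycle (consistent with the 1.2 billion instructions/sec/core at 1200 MHz quoted above) and treats each thread's CPI as fixed. The CPI values used are hypothetical illustrations, not measurements from these tests.

```python
import math

CLOCK_HZ = 1_200_000_000   # 1200 MHz, from the text above
THREADS_PER_CORE = 4


def core_instructions_per_sec(thread_cpi, threads=THREADS_PER_CORE, clock_hz=CLOCK_HZ):
    """Estimated instructions/sec for one core.

    Simplifying assumption: each thread on its own retires clock_hz / thread_cpi
    instructions per second, and the core as a whole cannot exceed one
    instruction per cycle.
    """
    peak = clock_hz
    demand = threads * clock_hz / thread_cpi
    return min(peak, demand)


def threads_needed_to_saturate(thread_cpi):
    """Smallest number of threads whose combined demand reaches the core's peak."""
    return math.ceil(thread_cpi)


# Hypothetical CPI values: one low-CPI and one high-CPI workload.
for cpi in (2.0, 8.0):
    rate = core_instructions_per_sec(cpi)
    print(f"CPI {cpi}: core runs at {rate / CLOCK_HZ:.0%} of peak; "
          f"would need {threads_needed_to_saturate(cpi)} threads to saturate")
```

Under these assumptions a workload with a per-thread CPI above 4 runs out of hardware threads before it saturates the core, which is why thread-level idle time reported by vmstat and mpstat remains a reasonable indicator of headroom for such workloads.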

3. So why does database OLTP performance scale so well on Sun Fire T2000?

A number of factors contribute to the good overall performance and scaling. Basically, CoolThreads technology works really well. (We validated this by analyzing hardware performance counter data collected using cpustat.)

  • We used cpustat to analyze the cache misses per transaction and the code path. Cache misses increase only marginally as we scale up, which indicates that the 12-way associativity of the L2 cache is working well. We also noticed that the code path, i.e. the number of instructions executed per transaction, remains almost constant even as we increase the number of threads. This also shows good software scaling.
  • Floating point usage is extremely low in a DBMS. During the tests, cpustat data showed that floating point unit (FPU) usage accounted for less than 0.01% of instructions.
  • All the large page features in Solaris help reduce the number of TLB misses, and they require no special /etc/system tuning.
  • Along with all of these, low memory latency, ample I/O connectivity with 3 PCI-E and 2 PCI-X slots, and 4 on-board GigE ports providing good network connectivity make the Sun Fire T2000 a balanced server architecture for databases.

Comments:

So when (or if) can we expect some numbers on TPC (-C and/or -H) or other widely used benchmarks?

Posted by Igor on December 07, 2005 at 11:48 AM IST #

What have you used to allow the IO to be fast enough to test the full cores? tmpfs? multiple io boards?

Posted by Bernd Eckenfels on December 11, 2005 at 01:11 AM IST #

About database proof points: Sun has already published a SAP-SD benchmark, which has a database component, and we have also published a SPECjAppServer2004 benchmark, which has a DB component too. We will have some more database performance proof points on UltraSPARC T1 over the coming months.

Posted by Ravindra Talashikar on December 12, 2005 at 11:41 AM IST #

About the I/O configuration and connectivity: we didn't use anything special like tmpfs during these database performance and scaling tests. We used 2 PCI-E cards and 2 PCI-X cards to connect Sun's 3510 storage over Fibre Channel host bus adapters. That worked great! If needed, we had the option of using an I/O expansion box, but we didn't have to go there.

Posted by Ravindra Talashikar on December 12, 2005 at 11:45 AM IST #

Very nicely written. However, can you give us a baseline that relates to something we are already familiar with? For instance, what relative performance would a machine like, say, a V440 or a V880 achieve, given identical memory and I/O?

Posted by Mike on December 13, 2005 at 09:31 PM IST #

Has anybody tested an Oracle database on this machine with the OLAP option? There seems to be a bug in the hardware or software. The JVM does not run. It hangs without an error message.

Posted by Heinz-Josef Wrobel on January 19, 2006 at 02:19 PM IST #

Did you have to do something unique to get the Oracle installation to work? I tried to install it and the DBCA (Database Creation Assistant) portion of the install just hangs. Were there any required patches, /etc/system settings, etc.? I am at a loss, and I saw that there was a problem with Java 1.5, which is what Solaris 10 installs by default, and the Niagara chip. No workaround was listed. On a V240 the DBCA portion lasts about 5 minutes.

Posted by Tom Zurita on May 23, 2006 at 04:37 PM IST #

Well, currently the Oracle Database 10gR2 installation on the T2000 has a problem: use of the Oracle JVM involves an uncommon software trap that was not implemented correctly for sun4v platforms (T1000/T2000), which causes the calling process/thread to enter an infinite loop and appear to be hung. There is a temporary patch available. Look for all the details at: http://www.sun.com/servers/coolthreads/tnb/applications_oracle.jsp

Posted by Ravindra Talashikar on June 09, 2006 at 09:57 AM IST #

How does it perform when doing DB imports? Especially when importing BLOBs, Oracle seems to use only 1 of the 32 threads, so huge BLOB tables take ages to import. (Ora 9207)

Posted by Christian on June 16, 2006 at 07:43 AM IST #

Hello, We have implemented Oracle 9206 on our new Sun T2000 server. When we compare the query performance with the current 8i on v880 we are surprised to see that 8i outperforms 9i on T2000. Can you let us know what specific config is required from the OS or Database side so that we can exploit the true potential of this T2000? Thanks.

Posted by Deepak Jaisingh on August 25, 2006 at 10:36 PM IST #

@Deepak: Most likely the performance of a single query is slower because of the slow CPU clock. You need to measure multiple accesses in parallel, or at least allow some parallelism in the SELECTs. I think it is typical that business applications which already perform badly (due to missing parallelism) behave even worse on the T1.

Posted by Bernd Eckenfels on August 26, 2006 at 04:02 PM IST #

Great tool. What cpustat syntax will produce a file suitable for input to corestat? I would like to run cpustat for extended periods and run the output through corestat.

Posted by Alan Wilson on September 08, 2006 at 02:39 PM IST #

hola

Posted by guest on February 18, 2007 at 09:07 AM IST #

I'm having a problem with Oracle 10g on the Sun T2000. The performance is terribly slow. Even a laptop running Solaris on Intel performs better than that.

Posted by dean leong on March 15, 2007 at 12:48 AM IST #

Hi, you mentioned testing with different scheduling classes and that TS is better. Has any testing been done on Oracle or any other DBMS with FSS? We need to understand the performance implications of using FSS with a database on the T2000. Please give your suggestions on this.

Posted by Gowrishankar on May 10, 2007 at 10:03 AM IST #

If the SQL takes a very short time, then the T2000 might be nice and scalable. If the SQL takes more than a few seconds and does a lot of buffer gets, then it will be much slower than on a traditional machine. And from our tests, the T2000 is slower on data import, alter table move, batch jobs, etc. Thanks, Bin

Posted by yumianfeilong on June 20, 2007 at 07:17 AM IST #

We're trying to port an Oracle9i-based OLTP database from a V1280 (12 single-core CPUs) to a T2000 and are seeing terrible results.

Suggestions to turn parallelism on haven't helped, since this is not a DW or DSS architecture. In fact, a majority of the app runs slower.

Are there any references of successful porting to the T2000?

Posted by Ramon Martinez on October 19, 2007 at 10:29 PM IST #

For Oracle 9i or 10g R2 on the Sun T2000, does
set consistent_coloring=2
need to be set in the kernel to take advantage of the high-speed memory L2 cache? Your help is much appreciated.

Posted by Raghu on November 09, 2007 at 03:34 PM IST #

I too am seeing horrible performance. 2xT2000 8 core with Sun 3510. Oracle 10gR2. Same setup with 2xv240 is 4 times faster for the exact same database. Does anyone have any tricks for speeding up the t2000?

Posted by Todd Helfter on April 02, 2008 at 05:42 PM IST #
