Running Batch Workloads on Sun's CMT Servers

Ever since Sun introduced Chip Multi-Threading (CMT) hardware in the form of the UltraSPARC T1 based T1000/T2000 servers, our internal mail aliases have been inundated with a variety of customer stories, the majority of which go like this: 'batch jobs take 12+ hours on a T2000, whereas they take only 3 or 4 hours on a US-IV+ based v490'. Even two and a half years after the introduction of this revolutionary CMT hardware, it appears that the majority of Sun customers are still under the impression that Sun's CMT systems like the T2000 and T5220 are not capable of handling CPU intensive batch workloads. That is not a valid concern. CMT processors like the UltraSPARC T1, T2 and T2 Plus can handle batch workloads just as well as any traditional/conventional processor, viz. UltraSPARC-IV+, SPARC64-VI, AMD Opteron, Intel Xeon, IBM POWER6. However, a little CMT awareness and effort are required at the customer's end to achieve good throughput on CMT systems.

First of all, end users must realize that the maximum clock speed of the existing CMT processor line-up (UltraSPARC T1, UltraSPARC T2, UltraSPARC T2 Plus) is only 1.4 GHz; and on top of that, each strand (individual hardware thread) within a core shares the CPU cycles with the other strands that operate on the same core (Note: each core operates at the speed of the processor). Given these facts, it is no surprise to see batch jobs taking longer to complete when only one or a very few single-threaded batch jobs are submitted to the system. In such cases, the system resources remain largely under-utilized, in addition to the longer elapsed times. One possible trick to achieve the required throughput in the expected time frame is to split up the workload into multiple jobs. For example, if an EDU customer needs to generate 1000 transcripts, the customer should consider submitting 4 individual jobs with 250 transcripts each, or 8 jobs with 125 transcripts each, rather than submitting one job for all 1000 transcripts. Ideally the customer should observe the resource utilization (CPU%, for example) and experiment with the number of jobs submitted until the system achieves the desired throughput within the expected time frame.
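To make the idea concrete, here is a minimal Bourne shell sketch of this divide-and-conquer approach. Note that the gen_transcripts command and its -start/-end options are hypothetical placeholders; the real batch application would provide its own way to scope a job:

    #!/bin/sh
    # Hypothetical sketch: split 1000 transcripts into 4 parallel jobs
    # of 250 each. 'gen_transcripts' and its -start/-end options are
    # placeholders for whatever the batch application actually provides.

    TOTAL=1000
    JOBS=4
    CHUNK=`expr $TOTAL / $JOBS`        # 250 transcripts per job

    i=0
    while [ $i -lt $JOBS ]; do
        START=`expr $i \* $CHUNK + 1`
        END=`expr $START + $CHUNK - 1`
        # run each chunk in the background so the jobs execute in
        # parallel and keep more hardware strands busy
        gen_transcripts -start $START -end $END &
        i=`expr $i + 1`
    done

    wait                               # block until all jobs complete

While the jobs are running, the CPU utilization can be observed with tools like mpstat(1M) or prstat(1M), and the number of jobs adjusted accordingly.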

Case study: Oracle E-Business Suite Payroll 11i workload on Sun SPARC Enterprise T5220

To demonstrate beyond a reasonable doubt that the aforementioned methodology works, let's take Oracle's E-Business Suite 11.5.10 Payroll workload as an example. On a single T5220 with one 1.4 GHz UltraSPARC T2 processor, acting as the batch, application and database server, 4 payroll threads generated 5,000 paychecks in 31.53 minutes, consuming only 6.04% CPU on average; the projected hourly throughput is ~9,500 paychecks. This is a classic example of what the majority of Sun's CMT customers are experiencing today, i.e., long batch processing times with little resource consumption. Keep in mind that each UltraSPARC T2 or UltraSPARC T2 Plus processor can execute up to 64 jobs in parallel (on a side note, the UltraSPARC T1 processor can execute up to 32 jobs in parallel). So, to put the idling resources to effective use and thereby improve the elapsed times and the overall throughput, a few experiments were conducted with 64 payroll threads, and the results are very impressive. With a maximum of 64 payroll threads, it took only 4.63 minutes to process the 5,000 paychecks, at an average of 40.77% CPU utilization. In other words, a similarly configured T5220 can process ~64,700 paychecks per hour using less than half of the available CPU cycles. Here is a word of caution: just because the processor can execute 64 threads in parallel doesn't mean it is always optimal to submit 64 parallel jobs on systems like the T5220. A very high number of batch jobs (payroll threads in this particular scenario) might be overkill for simple tasks like NACHA in the Payroll process.

The following white paper has more detailed information about the nature of the workload and the results from the experiments with various numbers of threads for the different components of the Oracle Applications Payroll batch workload. Refer to the same white paper for the exact tuning information as well.

Link to the white paper:
     E-Business Suite Payroll 11i (11.5.10) using Oracle 10g on a Sun SPARC Enterprise T5220

Here is the summary of the results that were extracted from the white paper:

Hardware configuration
          1x Sun SPARC Enterprise T5220 for running the application, batch and the database servers
              Specifications: 1x 1.4 GHz 8-core UltraSPARC T2 processor with 64 GB memory

Software configuration
          Oracle E-Business Suite 11.5.10
          Oracle 10g R1 10.1.0.4 RDBMS
          Solaris 10 8/07

Results
Oracle E-Business Suite 11i Payroll - Number of employees: 5,000
    Component             #Threads   Time (min)   Avg. CPU%   Hourly Throughput
    Payroll process          64         1.87        90.56           160,714
    PrePayments              64         0.20        46.33         1,500,000
    Ext. Proc. Archive       64         1.90        90.77           157,895
    NACHA                     8         0.05         2.52         6,000,000
    Check Writer             24         0.38         9.00           782,609
    Costing                  48         0.23        32.50         1,285,714
    Total / Average          NA         4.63        40.77            64,748

It is evident from the average CPU% that the Payroll process and the External Process Archive components are extremely CPU intensive, and hence take longer to complete. That is why 64 threads were configured for those components: to run the system at its full potential. Lightweight components like NACHA need fewer threads to complete the job efficiently. Configuring 64 threads for NACHA would have a negative impact on the throughput; in other words, we would be wasting CPU cycles for no apparent improvement.

It is the responsibility of the customers to tune the application and the workload appropriately. One size doesn't fit all.

The Payroll 11i results on the T5220 clearly demonstrate that Sun's CMT systems are capable of handling batch workloads well. It would be interesting to see how they perform against systems equipped with traditional processors running at higher clock speeds. For this comparison, we can use a couple of results published by UNISYS and IBM with the same workload. The table below summarizes the results from the two white papers listed under the source URLs; for the sake of completeness, Sun's CMT results are included as well.

Source URLs:
  1. E-Business Suite Payroll 11i (11.5.10) using Oracle 10g on a UNISYS ES7000/one Enterprise Server
  2. E-Business Suite Payroll 11i (11.5.10) using Oracle 10g for Novell SUSE Linux on IBM eServer xSeries 366 Servers
Oracle E-Business Suite 11i Payroll - Number of employees: 5,000
    UNISYS
        OS: Linux, RHEL 4 Update 3
        DB/App/Batch server: 1x Unisys ES7000/one Enterprise Server
            (4x 3.0 GHz dual-core Intel Xeon 7041 processors, 32 GB memory)
        #Threads: 12 [1]    Time: 5.18 min    Avg. CPU%: 53.22    Hourly throughput: 57,915

    IBM
        OS: Novell SUSE Linux Enterprise Server 9 SP1
        DB, App servers: 2x IBM eServer xSeries 366 4-way servers
            (4x 3.66 GHz Intel Xeon MP (EM64T) processors, 32 GB memory)
        #Threads: 12    Time: 8.42 min    Avg. CPU%: 50+ [2]    Hourly throughput: 35,644

    Sun
        OS: Solaris 10 8/07
        DB/App/Batch server: 1x Sun SPARC Enterprise T5220
            (1x 1.4 GHz 8-core UltraSPARC T2 processor, 64 GB memory)
        #Threads: 8 to 64    Time: 4.63 min    Avg. CPU%: 40.77    Hourly throughput: 64,748

The results speak for themselves: one 1.4 GHz UltraSPARC T2 processor outperformed four 3.0 GHz / 3.66 GHz Xeon processors in average CPU utilization and, most importantly, in hourly throughput (the hourly throughput calculation relies on the total elapsed time).

Before we conclude, let us reiterate a few things, purely based on the factual evidence presented in this blog post:

  • Sun's CMT servers like the T2000, T5220 and T5240 (a two-socket system with UltraSPARC T2 Plus processors) are well suited to running batch workloads like Oracle Applications Payroll 11i,

  • Sun's CMT servers like the T2000, T5220 and T5240 are well suited to running the Oracle 10g RDBMS when the DML/DDL/SQL statements that make up the majority of the workload are not very complex, and

  • When the application is tuned appropriately, CMT processors can outperform some of the traditional processors that are touted to deliver the best single-thread performance.

Footnotes

1. There is a note in the UNISYS/Payroll 11i white paper that says "[...] the gains {from running increased numbers of threads} decline at higher numbers of parallel threads." This is quite contrary to what Sun observed in its Payroll 11i experiments on the UltraSPARC T2 based T5220: a higher number of parallel threads (maximum: 64) improved the throughput on the T5220, whereas UNISYS' observation is based on experiments with a maximum of 12 parallel threads. Moral of the story: do NOT treat all hardware alike.

2. IBM's Payroll 11i white paper does not state the average CPU numbers; the 50+% figure was derived from "Figure 3: Average CPU Utilization".

Comments:

Very interesting ! BTW: in the first white paper it is written that "the database was on UFS with asynchronous I/O enabled ...". Do you mean 'forcedirectio' on ufs side or filesystemio_options=setall on Oracle side ?

Posted by przemol on April 08, 2008 at 11:51 PM PDT #

'forcedirectio' on UFS, Przemol.
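For example, direct I/O can be enabled on a UFS file system with an /etc/vfstab entry along these lines (the device names and mount point below are placeholders, not the actual configuration from the benchmark):

    # example /etc/vfstab entry enabling direct I/O on a UFS file system
    # device names and mount point are placeholders
    /dev/dsk/c1t1d0s6   /dev/rdsk/c1t1d0s6   /oradata   ufs   2   yes   forcedirectio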

Posted by Giri Mandalika on April 09, 2008 at 01:10 AM PDT #

One of the problems we have on CMT servers running Oracle 10g is that db stats collection (which is run once a day in our case) takes a very long time compared to on a v490, as do backups (using RMAN) -- is there some magic in parallelizing those so that they run faster on CMT machines?

Posted by R.P. Aditya on April 09, 2008 at 01:47 AM PDT #

Thank you, this is really great information.

Unfortunately, 11i Payroll seems to be the only oracle e-biz module that can leverage CMT machines.

Please correct me if I'm wrong, but the other apps (PA,GL,AR,AP,FA) don't appear to have specific threading capabilities.

I'd love to run our whole oracle e-biz suite on a T5240, but there's a much broader need to leverage CMT across all modules, not just PAYroll. In my opinion, the apps need a major re-write to capitalize on the divide and conquer CMT model.

Thanks again for Payroll information
-s3

Posted by stall3 on April 09, 2008 at 11:45 PM PDT #

The numbers look great for Sun, but are potentially discredited by the fact that the T5220 was loaded with twice as much memory as the competition. Why not rerun with 32 GB like the others and prove that the CPU efficiency is making the difference -- not avoiding swapping to disk.

Posted by rk on April 10, 2008 at 03:59 AM PDT #

You could allocate more than one channel to speed up RMAN backup.
Parallelizing db stats collection requires a little bit of programming.

If you use monitoring option for tables you could query user_tab_modifications table and build one stat job per each table. Then submit jobs to Oracle job queues.
We have 6 job queues on 4 UltraSPARC-IV dual-core V890 server housing 180GB OLTP database.
Our db stats collection time is 20min.
T2 has 8 cores and 64 threads.
It should scale well.

Posted by Ljubomir Zivkovic on April 10, 2008 at 05:59 AM PDT #

Aditya:

You can speed up the stats collection by submitting multiple 'dbms_stats' jobs. For example, if you are currently submitting one simple "execute dbms_stats.gather_schema_stats / dbms_stats.gather_database_stats / ..", why not submit multiple "execute dbms_stats.gather_table_stats / dbms_stats.gather_index_stats / .. "? You can adjust the degree of parallelism based on the number of jobs being submitted to Oracle and the CPU load on the system.
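As a rough illustration only (the schema, table names, credentials and degree of parallelism below are placeholders), the per-table jobs could be launched from a shell script, one background sqlplus session per table:

    #!/bin/sh
    # Hypothetical sketch: gather optimizer statistics for several large
    # tables in parallel, one background sqlplus session per table.
    # Adjust the table list, credentials and 'degree' to the actual
    # environment and the CPU load on the system.

    for TAB in ORDERS ORDER_LINES INVOICES PAYMENTS
    do
        sqlplus -s scott/tiger <<EOF &
    exec dbms_stats.gather_table_stats(ownname => 'SCOTT', tabname => '$TAB', degree => 4);
    exit
    EOF
    done

    wait    # block until every stats job has completed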

Please check Ljubomir Zivkovic's comment as well.
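For the RMAN backups, Ljubomir's multiple-channel suggestion might look something like the following sketch (the channel count of 4 is an arbitrary placeholder; tune it against the available CPU threads and I/O bandwidth):

    #!/bin/sh
    # Hypothetical sketch: allocate several RMAN channels so the backup
    # pieces are written in parallel.
    rman target / <<EOF
    run {
      allocate channel ch1 device type disk;
      allocate channel ch2 device type disk;
      allocate channel ch3 device type disk;
      allocate channel ch4 device type disk;
      backup database;
    }
    exit
    EOF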
________________

stall3:

I'm no Oracle Applications expert. But I do know that it is possible to split up the workload into multiple jobs even in the case of other modules like General Ledger, Auto Invoice, High Volume Order Processing.

Be aware that it might not be as simple as a database UPDATE statement. It may require some effort. Please consult the application software vendor (Oracle Corporation, in this case) for some help.
________________

rk:

I'm not sure why you are under the impression that there is some swapping involved in this workload. In any case, I can assure you that the system didn't swap at all while the Oracle Apps Payroll processes (PYUGEN) and the Oracle shadow processes were actively working to generate the paychecks for 5,000 employees. In fact, only 9 GB out of 64 GB of memory was used by the application and the database. Payroll 11i is a CPU-intensive, but not memory-intensive, workload. Most of the work is done in the db tier by the database processes. Perhaps that's the reason other vendors didn't even bother to reveal the actual memory utilization. However, we (Sun) want the results to be as transparent as possible. Did you see the little note that says "~9 used" next to the memory configuration under the "HARDWARE CONFIGURATION" section of Sun's Payroll 11i white paper?
________________

Ljubomir:

Thank you for stepping in and sharing the ideas.

- Giri

Posted by Giri Mandalika on April 10, 2008 at 08:25 AM PDT #

I remember performing very similar benchmarks when T2000 servers just came out. Based upon those benchmarks, we have replaced E6500 (28x400MHz, 28GB RAM) with one T2000 and gained 30%+ performance as a result.

Posted by Leonid Roodnitsky on April 12, 2008 at 03:04 PM PDT #
