Saturday Oct 25, 2008

Oracle parallel query performance on a T5140

Being an engineer it's always a good feeling getting to the bottom of a problem and none so as this one. Take a T5140, create a 3 way RAID 0 LUN using the internal disks and stick Oracle on top so you can do something useful with it for your application...........and what do you get......... a problem. I suspect after that opening some of you are thinking "where's he going with this?" .... the answer nowhere, I'm not picking holes in either the T5140 or Oracle.....good, I'm glad we got that one clear! :)

Anyhow, so a customer came to us complaining that their application wasn't running as expected on this platform and really wanted to know if there was a hardware fault /  bug with platform or operating system running on it. From the tests that the customer had been doing themselves they believed that the bottleneck was the underlying I/O subsystem and in this case the LSI H/W RAID. Essentially,  the customer had configured a 3 disk RAID 0 stripe using the default 64k stripe width like thus:

bash-3.00# raidctl -l c1t1d0
Volume                  Size    Stripe  Status   Cache  RAID
       Sub                     Size                    Level
               Disk
----------------------------------------------------------------
c1t1d0                  409.9G  64K     OPTIMAL  OFF    RAID0
               0.1.0   136.6G          GOOD
               0.2.0   136.6G          GOOD
               0.3.0   136.6G          GOOD

They had then created a single slice for which Oracle was installed and configured for Direct I/O (which is a good thing anyway if you've a UFS filesystem) so we were avoiding the filesystem buffer cache and double buffering. So for a 64k stripe per disk and three disks gives us a total stripe width of 192k. The throughput performance of each of this disks is between 50-60mb per second which means we have a theoretical throughput on all stripes of 150->180mb per second for reads. We can forget writes as Oracle is pwrite()'ing in 8k synchronous chunks to a non-write enabled volume and only hits one disk (because 8K is less than the 64k stripe size) and hence why we saw a 1Gb tablespace creation take 18 seconds and an average through put of 56mb per second which is what we would have expected for a single disk.

SQL> set timing on
SQL> create tablespace andy3
 2  datafile '/u01/oracle/oradata/SUN/andy03.dbf'
 3  size 1g;

Tablespace created.

Elapsed: 00:00:18.12

and iostat -xnz 1 shows us

                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  13.0  128.0  104.0 56369.0  0.0  1.6    0.0   11.2   1  93 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  14.0   78.0  112.0 56281.3  0.0  1.2    0.1   13.5   0  93 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  13.0   93.0  112.0 53734.0  0.0  1.4    0.0   13.4   1  93 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  13.0   95.0  104.0 58397.6  0.0  1.4    0.0   12.7   1  92 c1t1d0

This result was the same as the customers but things then get interesting when we start looking at full table scan parallel queries. The customer ended up with these results:

 Parallelism
Time
Throughput (Mb per second)
db_file_multiblock_read_count
 0 719
13.8
16 (16 x 8k = 128k)
 2 195
50.9
16
 4 208
47.7
16
 0 420
23.6
32 (32 x 8k = 256k)
 2 163
60.9
32
 4 208
53.9
32

Now, they look bad especially if you think that theoretically we should be able to achieve 150mb ->180mb based on a three disk stripe (3 x 60mb).

Using the same parallel test plan as the customer:

oracle@v4v-t5140a-gmp03~$more para.sql

set timing on;

select /\*+ FULL(t) \*/ count(\*) from contact_methods t;

select /\*+ FULL(t) PARALLEL(t,2) \*/ count(\*) from contact_methods t;

select /\*+ FULL(t) PARALLEL(t,4) \*/ count(\*) from contact_methods t;

select /\*+ FULL(t) PARALLEL(t,64) \*/ count(\*) from contact_methods t;

oracle@v4v-t5140a-gmp03~/oradata/SUN$ls -alh test01.dbf
-rw-r-----   1 oracle   dba         9.7G Oct 24 08:25 test01.dbf

I got these:

SQL> @para

 COUNT(\*)
----------
 15700000

Elapsed: 00:00:47.85

 COUNT(\*)
----------
 15700000

Elapsed: 00:00:32.53

 COUNT(\*)
----------
 15700000

Elapsed: 00:00:34.68

 COUNT(\*)
----------
 15700000

Elapsed: 00:00:42.17

whilst the first full table scan is running I see the following in iostat -xnz 1:

                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 108.0    0.0 93122.4    0.0  0.0  0.4    0.1    4.0   1  35 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 151.0    3.0 95796.9   48.0  0.0  0.5    0.1    3.0   1  34 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 115.0    0.0 103367.6    0.0  0.0  0.3    0.1    2.6   1  28 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 116.0    0.0 102232.7    0.0  0.0  0.3    0.1    3.0   1  29 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 122.0    3.0 105326.4   48.0  0.0  0.3    0.1    2.5   1  29 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 116.0    0.0 96467.2    0.0  0.0  0.5    0.1    4.1   1  34 c1t1d0

and then

                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 193.0    0.0 159383.3    0.0  0.0  8.0    0.1   41.4   1 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 195.5    3.0 163681.0   48.1  0.0  8.1    0.1   40.8   1 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 220.1    0.0 188770.3    0.0  0.0  7.7    0.1   34.8   3 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 192.0    0.0 168156.9    0.0  0.0  7.2    0.1   37.8   1 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 191.0    3.0 162361.2   48.0  0.0  7.4    0.1   38.1   1 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 190.0    0.0 162776.0    0.0  0.0  7.3    0.1   38.7   1 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 192.0    0.0 162737.6    0.0  0.0  6.9    0.1   35.9   1 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 186.0    3.0 153754.2   48.0  0.0  8.4    0.1   44.4   1 100 c1t1d0
                   extended device statistics                r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 191.0    1.0 160412.4    8.0  0.0  7.7    0.1   40.1   1 100 c1t1d0

when the parallel jobs are running.

This is because I changed the db_file_multiblock_read_count to 128 (128 \* 8k = 1M) and in fact I saw improvements using 192k (64k\*3) to match the stripe width. I also went with some recommendations from here which also helped along with running the latest T5140 firmware and latest KJP 137111-08 to avoid some known performance issues.

It's amazing that tuning one little option can have such a dramatic effect on the results and also shows that just because you don't get the results you're expecting not to assume that there is a fault with the hardware or software. For me personally, it's always good to understand the results and how you got there, although as in all performance related issues you can sometimes get sidetracked from what's gone on before and end going down the wrong path. To avoid that,  make sure you chat with you're colleagues when you feel like you're not going anywhere as a fresh set of eyes can bring you back on the path and closer to resolution.




About

I'm an RPE (Revenue Product Engineering) Engineer supporting Solaris on Exadata, Exalogic and Super Cluster. I attempt to diagnose and resolve any problems with Solaris on any of the Engineered Systems.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today