Saturday Oct 17, 2009

Solaris resource controls and Oracle CRS/RAC

In an earlier blog entry about performance considerations when upgrading Solaris, I discussed using Oracle with resource controls instead of /etc/system settings, but forgot to include details about using them with Oracle CRS (Cluster Ready Services). Oracle CRS is required for Oracle RAC configurations, and you can run into bug CR 5000940 - "project 0's rctls are not applied from /etc/project" when migrating from /etc/system entries to resource controls. The impact is that you won't be able to start Oracle and will see something like the following in the alert log:

WARNING: EINVAL creating segment of size 0x000000021e000000

fix shm parameters in /etc/system or equivalent

There are workarounds described both in the bug report and in MetaLink document 435464.1 - "ENABLING SOLARIS PROJECT SETTINGS FOR CRS".
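
If you want to confirm whether this bug is biting you, a quick check is to see which project the CRS-spawned Oracle processes actually run under and what shared memory limit is being enforced on them. This is just a minimal sketch; the grep pattern and the pid are illustrative:

# which project are the Oracle background processes running in?
ps -eo project,user,pid,args | grep '[o]ra_'

# what shared memory limit is actually being enforced on one of them?
prctl -n project.max-shm-memory -i process <pid_of_ora_pmon>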

Monday Sep 28, 2009

Problems with Solaris and a DISM enabled Oracle Database

Resolving problems becomes ever more challenging in today's complex application environments, and sometimes error messages can be misleading or confusing to the system administrator / DBA trying to resolve the issue. Through my involvement with the VOSJEC (Veritas Oracle Sun Joint Escalation Center), we've seen a steady increase in the number of cases being raised about customers using Oracle with DISM enabled on Solaris systems. In the majority of cases disabling DISM appears to resolve the issue, but with DISM use becoming more popular and making DBAs' lives easier, I personally don't feel that's the right course of action. DISM essentially gives the DBA the ability to resize the SGA on the fly without taking down the database, maximising availability and avoiding downtime. See the following page from the Administering Oracle Database on Solaris documentation for further details.

For other DISM-related issues, have a read of Solution 230653 : DISM Troubleshooting For Oracle9i and Later Releases, as it gives some background on DISM and what can go wrong.

How do I know if the customer is using DISM or not...

* There should be an ora_dism process running on the system, so you can check the ps listing (explorer / guds).

* Check the Oracle alert log, as that should also mention the DISM process starting. It should also log the fact that the DISM process has died, which means your Oracle shared memory segments aren't locked and can cause other problems; see Solution 230653 : DISM Troubleshooting For Oracle9i and Later Releases. A quick shell check for both is shown just after this list.
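
As a first pass, both checks can be done from the shell. A minimal sketch; the alert log path and <SID> placeholder are illustrative and will vary with your installation:

# is the DISM helper process running?
ps -ef | grep '[o]ra_dism'

# has the alert log ever mentioned DISM starting or dying?
grep -i dism /u01/app/oracle/admin/<SID>/bdump/alert_<SID>.log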

So, what's the issue, I hear you asking? Well, several customers complained about Oracle batch jobs crashing or the entire database falling over with something like:

Sun Mar 15 14:53:28 2009
Errors in file /u01/app/oracle/admin/JSPSOL0/udump/jspsol03_ora_23220.trc:
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
SVR4 Error: 12: Not enough space
Additional information: 4
Additional information: 938866512
Additional information: -1

OR

Mon Aug 24 09:01:17 2009
Errors in file /opt/oracle/admin/MW/bdump/mw1_lgwr_18124.trc:
ORA-00340: IO error processing online log 1 of thread 1
ORA-00345: redo log write error block 167445 count 11
ORA-00312: online log 1 thread 1: '/dev/vx/rdsk/mwdbdg/redo11'
ORA-27070: async read/write failed
SVR4 Error: 12: Not enough space
Additional information: 1
Additional information: 167445
LGWR: terminating instance due to error 340

The first jump-to cause would be that this is some kind of I/O issue and that perhaps we've run out of space on the backing storage layer. Unfortunately, in this case that wasn't the reason, as all filesystems had enough free space available. The next hit on the list of possible causes is that we don't have enough swap space available: when using DISM you have to have as much swap space configured as there are DISM segments in use, see Solution 214947 : Solaris[TM] Operating System: DISM double allocation of memory for details. Again, the customer appeared to have the correct swap space configuration for the DISM segment currently in use.
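
A quick sanity check of that swap sizing can be done from the shell; a rough sketch (the SGA will show up as the large shared memory segment in the ipcs output):

# summary of allocated / reserved / available swap
swap -s

# shared memory segments and their sizes - the SGA is the big one
ipcs -mb
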
Now, what I actually suspected was that CR 6559612 - "multiple softlocks on a DISM segment should decrement availrmem just once" was coming into play here, which, if you read engineering's comments on it, can only happen if DISM isn't locked properly and we end up going down a different code path due to seg_pdisable being set to 1. There is another reason for seg_pdisable being set, and that's when we're trying to handle page retirement. Coincidentally, both of the situations mentioned above had pending page retirements due to CR 6587140 - "page_retire()/page_trycapture should not try to retire non relocatable kernel pages" (fixed in KJP 141414-06, the latest Solaris 10 kernel patch), so there was a window of opportunity for CR 6559612 to happen every once in a while.

From what I can currently work out, CR 6559612 - "multiple softlocks on a DISM segment should decrement availrmem just once" is a duplicate of CR 6603296 - "Multiple writes into dism segment reduces available swap", which also mentions the duplicate CR 6644570 - "Application using large page sizes with DISM on Huron got pread to return error ENOMEM".

There seems to be a family of related issues which are all fixed as part of CR 6423097 - "segvn_pagelock() may perform very poorly", which includes the above:

6526804 DR delete_memory_thread, AIO, and segvn deadlock
6557794 segspt_dismpagelock() and segspt_shmadvise(MADV_FREE) may deadlock
6557813 seg_ppurge_seg() shouldn't flush all unrelated ISM/DISM segments
6557891 softlocks/pagelocks of anon pages should not decrement availrmem for memory swapped pages
6559612 multiple softlocks on a DISM segment should decrement availrmem just once
6562291 page_mem_avail() is stuck due to availrmem overaccounting and lack of seg_preap() calls
6596555 locked anonymous pages should not have assigned disk swap slots
6639424 hat_sfmmu.c:hat_pagesync() doesn't handle well HAT_SYNC_STOPON_REF and HAT_SYNC_STOPON_MOD flags
6639425 optimize checkpage() optimizations
6662927 page_llock contention during I/O

These are all fixed in the Solaris Nevada development release, build 91, but haven't been backported as yet. From looking at the bug reports it does seem likely that CR 6423097 and the associated CRs will get fixed in Solaris 10 U8 (scheduled for release 9th October 2009).

So, that would lead us to the following suggestions:

1/ CR 6423097 mentions disabling large page support completely to work around the issue, which you could do and still continue to use DISM:

Workaround for nevada and S10U4 is to add to /etc/system:

set max_uheap_lpsize = 0x2000
set max_ustack_lpsize = 0x2000
set max_privmap_lpsize = 0x2000
set max_shm_lpsize = 0x2000
set use_brk_lpg = 0
set use_stk_lpg = 0
and reboot.

for S10U3 and earlier S10 releases, the workaround is to add to /etc/system:

set exec_lpg_disable = 1
set use_brk_lpg = 0
set use_stk_lpg = 0
set use_zmap_lpg = 0
and reboot

NOTE: this is *not* the same as disabling large page coalescing, as this completely disables large page support. Disabling large page coalescing is sometimes required depending on the type of application workload; see 215536 : Oracle(R) 10g on Solaris[TM] 10 may run slow for further details.
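
If you do go down this route, it's worth confirming after the reboot that the Oracle processes really are mapping memory with the base page size only. A minimal sketch, where the pid is any Oracle server process:

# pmap -s shows the hardware page size (Pgsz) used for each mapping;
# with the workaround in place the mappings should report 8K only
pmap -s <oracle_pid> | more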

2/ Disable DISM entirely for the short term and use ISM with Oracle whilst waiting for the release of Solaris 10 update 8.

Conclusion / Actions
-------------------------------------

The above is essentially a hit list of things to be aware of when seeing errors of this nature. If the customer isn't using DISM at all then that might take you down another avenue. If they are, then remember that "Applications Drive System Behavior", so you need to understand the architecture, and when and where things run, to help tie the loose ends of any problem together and to be able to prove root cause or not. See my previous blog entry about dealing with performance issues, as it essentially talks about the same troubleshooting process I use all the time in diagnosing complex problems.

Monday Jan 19, 2009

Performance considerations when upgrading Solaris

The biggest piece of advice I can give to those of you about to upgrade with lots of custom tunables in /etc/system........read the manual (RTFM if you're feeling particularly vocal), no seriously, I mean it! :) You only have to read the Solaris Tunable Parameters Reference Manual, as it actually discusses upgrading to newer releases with older /etc/system tunables:

"We recommend that you start with an empty /etc/system file when moving to a new Solaris
release. As a first step, add only those tunables that are required by in-house or third-party
applications. Any tunables that involve System V IPC (semaphores, shared memory, and
message queues) have been modified in the Solaris 10 release and should be changed in your
environment. For more information, see “System V IPC Configuration” on page 21. After
baseline testing has been established, evaluate system performance to determine if additional
tunable settings are required."


So, the advice is: move it out of the way and start from scratch. :) Obviously speak to your application vendors about anything that is required to run the application, but other than that, see how things go and only change settings when and where necessary, otherwise you could run into other problems.
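
In practice that's as simple as parking the old file somewhere safe before the upgrade; a minimal sketch:

# keep a copy of the old tunables for reference
cp -p /etc/system /etc/system.pre-upgrade
# then strip /etc/system back to just the entries your
# application vendors insist on, and reboot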

The only application which I'll make specific points about is Oracle, as with Solaris 10 we've introduced resource controls so the shared memory / semaphore settings no longer need to be defined in /etc/system. See the Oracle installation guide or Solution 208623 : Solaris[TM] 10 Operating System: System V Inter-Process Communication (IPC) resource controls for further details.
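
As a rough illustration of what replaces the old shm/sem lines, you can hang the limits off a project for the oracle user. The project name and the 8GB value below are just examples, not recommendations:

# create a project for the oracle user and set the shared memory cap
projadd -c "Oracle instance" -U oracle user.oracle
projmod -s -K "project.max-shm-memory=(privileged,8G,deny)" user.oracle

# verify what processes running in that project will actually get
prctl -n project.max-shm-memory -i project user.oracle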

Friday Jan 16, 2009

Capturing data on Oracle "WARNING: aiowait timed out" events

Oracle "WARNING: aiowait timed out" error messages in the alert log are always something that needs further investigation as it can indicate some kind of I/O or system resource issue that stopping valuable I/O's down to storage. Now there could be a whole bunch of reasons why our asynchronous I/O as got stuck. Oracle reports "WARNING: aiowait timed out" messages after 10 mins of non-response since our I/O was queued by aioread(3) or aiowrite(3). It's also worth being aware of the two method of by which Asynchronous I/O is handled by Solaris. This is either via kernel asynchronous I/O (KAIO) if the underlying filesystem supports it (raw or filesystems with added s/w which give raw access) or the standard userland library implementation. For further background reading then have a look at the Solaris™ Internals: Core Kernel Components book by Jim Mauro; Richard McDougall, chapter 11.4. Asynchronous I/O.

So, to help understand where in the I/O life cycle we're stuck, we'll need to collect some data. This also might help us define a better problem statement, which is obviously vitally important in our rational troubleshooting process. See my previous post about the importance of defining a good problem statement.

Enabling the trigger environment

A full crash dump is much preferred over a live core, as things tend to change on the fly whilst a live dump is being taken, corrupting some of the structures in the dump, so we can get strange results. In this example I'm going to assume that we don't really want to take down the entire box, as that would affect all services rather than just Oracle. Obviously, try the live method first and if that doesn't yield results then try a full dump. So, here are the steps for enabling a triggered live savecore on an "aiowait timed out" error message in an Oracle alert log.

1/ You'll need to set up a dedicated dump device to collect a live savecore. You'll either need a spare raw partition or have to create a large file using mkfile; see the dumpadm(1M) man page for details on how to do this.
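
For the spare-partition case that's a one-liner; a minimal sketch, where the slice name is purely illustrative and must not be in use for anything else:

# point the dump configuration at a dedicated raw partition so
# "savecore -L" has somewhere to write the live dump
dumpadm -d /dev/dsk/c0t1d0s3

# review the resulting configuration (dump device, savecore directory)
dumpadm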

2/ Download livecore_aio (or livecore_aio.x86) and guds script.

3/ Create collect_data.sh:

root@hippy-1 # more collect_data.sh
#!/bin/sh
/var/tmp/guds_2.8.4 -q -c30 -i1 -n5  -w0 -s<case ref> -X2 -D /var/tmp &
/usr/bin/savecore -L
echo "aiowait coredump event - please guds output and crash dump to Sun" | mailx -s "AIOWAIT TIMED OUT EVENT" root

In my case I'm going to fire off guds to gather some performance stats on the box, but you could add anything else you want to run here, including a "reboot -d" to take a full crash dump before rebooting the box.
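
For completeness, a variant of collect_data.sh that takes the full-crash-dump route instead might look like this; only use it if rebooting the box at the moment of failure is acceptable:

#!/bin/sh
# variant: run guds in the foreground so it completes, then take a
# full crash dump of the box (it will reboot afterwards)
/var/tmp/guds_2.8.4 -q -c30 -i1 -n5 -w0 -s<case ref> -X2 -D /var/tmp
/usr/sbin/reboot -d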

4/ Change perms on binary and script before copying to /var/tmp

chmod +x livecore_aio collect_data.sh
cp livecore_aio collect_data.sh /var/tmp

note: livecore_aio expects collect_data.sh to be in /var/tmp for it to work correctly

Test run

Test the program and script (as root):

# touch dummy_file
#./livecore_aio dummy_file &
# echo "WARNING: aiowait timed out" >> dummy_file

This should produce a live savecore dump and kick off guds.

Deploy

* Execute the livecore_aio binary:
#./livecore_aio <full path of alert_log> &

When the issue happens, upload the live core dump and guds data to Sun for analysis.

Saturday Oct 25, 2008

Oracle parallel query performance on a T5140

Being an engineer it's always a good feeling getting to the bottom of a problem, and none more so than this one. Take a T5140, create a 3-way RAID 0 LUN using the internal disks and stick Oracle on top so you can do something useful with it for your application...........and what do you get......... a problem. I suspect after that opening some of you are thinking "where's he going with this?" .... the answer is nowhere, I'm not picking holes in either the T5140 or Oracle.....good, I'm glad we got that one clear! :)

Anyhow, a customer came to us complaining that their application wasn't running as expected on this platform and really wanted to know if there was a hardware fault or bug with the platform or the operating system running on it. From the tests the customer had been doing themselves, they believed that the bottleneck was the underlying I/O subsystem, in this case the LSI H/W RAID. Essentially, the customer had configured a 3-disk RAID 0 stripe using the default 64k stripe width, like this:

bash-3.00# raidctl -l c1t1d0
Volume                  Size    Stripe  Status   Cache  RAID
       Sub                     Size                    Level
               Disk
----------------------------------------------------------------
c1t1d0                  409.9G  64K     OPTIMAL  OFF    RAID0
               0.1.0   136.6G          GOOD
               0.2.0   136.6G          GOOD
               0.3.0   136.6G          GOOD

They had then created a single slice on which Oracle was installed and configured for Direct I/O (which is a good thing anyway if you've got a UFS filesystem), so we were avoiding the filesystem buffer cache and double buffering. A 64k stripe unit per disk across three disks gives us a total stripe width of 192k. The throughput of each of these disks is between 50-60MB per second, which means we have a theoretical throughput across the whole stripe of 150-180MB per second for reads. We can forget writes, as Oracle is pwrite()'ing in 8k synchronous chunks to a volume with the write cache disabled, so each write only hits one disk (because 8k is less than the 64k stripe unit); hence why we saw a 1GB tablespace creation take 18 seconds at an average throughput of 56MB per second, which is what we would have expected from a single disk.
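
Just to make the arithmetic explicit, here's the back-of-the-envelope version (the 50-60MB/s per-disk figure is the assumption everything else hangs off):

# per-disk stripe unit and disk count from the raidctl output above
STRIPE_UNIT_KB=64
DISKS=3
echo "full stripe width : $((STRIPE_UNIT_KB * DISKS))k"          # 192k
echo "theoretical reads : $((50 * DISKS))-$((60 * DISKS)) MB/s"  # 150-180 MB/s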

SQL> set timing on
SQL> create tablespace andy3
 2  datafile '/u01/oracle/oradata/SUN/andy03.dbf'
 3  size 1g;

Tablespace created.

Elapsed: 00:00:18.12

and iostat -xnz 1 shows us

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   13.0  128.0  104.0 56369.0  0.0  1.6    0.0   11.2   1  93 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   14.0   78.0  112.0 56281.3  0.0  1.2    0.1   13.5   0  93 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   13.0   93.0  112.0 53734.0  0.0  1.4    0.0   13.4   1  93 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   13.0   95.0  104.0 58397.6  0.0  1.4    0.0   12.7   1  92 c1t1d0

This result was the same as the customer's, but things then get interesting when we start looking at full table scan parallel queries. The customer ended up with these results:

Parallelism   Time (s)   Throughput (MB/s)   db_file_multiblock_read_count
     0          719          13.8            16 (16 x 8k = 128k)
     2          195          50.9            16
     4          208          47.7            16
     0          420          23.6            32 (32 x 8k = 256k)
     2          163          60.9            32
     4          208          53.9            32

Now, they look bad, especially if you think that theoretically we should be able to achieve 150-180MB/s based on a three disk stripe (3 x 60MB/s).

Using the same parallel test plan as the customer:

oracle@v4v-t5140a-gmp03~$more para.sql

set timing on;

select /*+ FULL(t) */ count(*) from contact_methods t;

select /*+ FULL(t) PARALLEL(t,2) */ count(*) from contact_methods t;

select /*+ FULL(t) PARALLEL(t,4) */ count(*) from contact_methods t;

select /*+ FULL(t) PARALLEL(t,64) */ count(*) from contact_methods t;

oracle@v4v-t5140a-gmp03~/oradata/SUN$ls -alh test01.dbf
-rw-r-----   1 oracle   dba         9.7G Oct 24 08:25 test01.dbf

I got these:

SQL> @para

  COUNT(*)
----------
  15700000

Elapsed: 00:00:47.85

  COUNT(*)
----------
  15700000

Elapsed: 00:00:32.53

  COUNT(*)
----------
  15700000

Elapsed: 00:00:34.68

  COUNT(*)
----------
  15700000

Elapsed: 00:00:42.17

Whilst the first full table scan is running, I see the following in iostat -xnz 1:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  108.0    0.0 93122.4    0.0  0.0  0.4    0.1    4.0   1  35 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  151.0    3.0 95796.9   48.0  0.0  0.5    0.1    3.0   1  34 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  115.0    0.0 103367.6    0.0  0.0  0.3    0.1    2.6   1  28 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  116.0    0.0 102232.7    0.0  0.0  0.3    0.1    3.0   1  29 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  122.0    3.0 105326.4   48.0  0.0  0.3    0.1    2.5   1  29 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  116.0    0.0 96467.2    0.0  0.0  0.5    0.1    4.1   1  34 c1t1d0

and then

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  193.0    0.0 159383.3    0.0  0.0  8.0    0.1   41.4   1 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  195.5    3.0 163681.0   48.1  0.0  8.1    0.1   40.8   1 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  220.1    0.0 188770.3    0.0  0.0  7.7    0.1   34.8   3 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  192.0    0.0 168156.9    0.0  0.0  7.2    0.1   37.8   1 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  191.0    3.0 162361.2   48.0  0.0  7.4    0.1   38.1   1 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  190.0    0.0 162776.0    0.0  0.0  7.3    0.1   38.7   1 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  192.0    0.0 162737.6    0.0  0.0  6.9    0.1   35.9   1 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  186.0    3.0 153754.2   48.0  0.0  8.4    0.1   44.4   1 100 c1t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  191.0    1.0 160412.4    8.0  0.0  7.7    0.1   40.1   1 100 c1t1d0

when the parallel jobs are running.

This is because I changed db_file_multiblock_read_count to 128 (128 * 8k = 1M), and in fact I saw improvements using 192k (64k * 3) to match the stripe width. I also went with some recommendations from here, which also helped, along with running the latest T5140 firmware and the latest KJP 137111-08 to avoid some known performance issues.
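
For reference, changing that parameter is just a dynamic instance setting; a minimal sketch run as the DBA, where the value of 128 is the one used above, not a universal recommendation:

sqlplus -s "/ as sysdba" <<'EOF'
alter system set db_file_multiblock_read_count = 128;
show parameter db_file_multiblock_read_count
EOF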

It's amazing that tuning one little option can have such a dramatic effect on the results, and it also shows that just because you don't get the results you're expecting, you shouldn't assume that there is a fault with the hardware or software. For me personally, it's always good to understand the results and how you got there, although as with all performance related issues you can sometimes get sidetracked by what's gone on before and end up going down the wrong path. To avoid that, make sure you chat with your colleagues when you feel like you're not getting anywhere, as a fresh set of eyes can bring you back on the path and closer to resolution.




About

I'm an RPE (Revenue Product Engineering) engineer supporting Solaris on Exadata, Exalogic and SuperCluster. I attempt to diagnose and resolve any problems with Solaris on any of the Engineered Systems.
