Saturday Oct 17, 2009

Solaris resource controls and Oracle CRS/RAC

In an earlier blog entry about performance considerations when upgrading Solaris, I discussed using Oracle with resource controls instead of /etc/system settings, but forgot to include details about using them with Oracle CRS (Cluster Ready Services). Oracle CRS is required for Oracle RAC configurations, and you can run into bug CR 5000940 - "project 0's rctls are not applied from /etc/project" when migrating from /etc/system entries to resource controls. The impact is that you won't be able to start Oracle and will get something like the following in the alert log:

WARNING: EINVAL creating segment of size 0x000000021e000000

fix shm parameters in /etc/system or equivalent

There are workarounds described both in the bug report and in metalink document 435464.1 - "ENABLING SOLARIS PROJECT SETTINGS FOR CRS".
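As a sketch of what the resource control migration itself looks like (the project id and the 8GB limit below are illustrative values, not taken from the bug report):

```
# /etc/project entry giving the oracle user's default project a shared
# memory ceiling, replacing shmsys:shminfo_shmmax in /etc/system:
user.oracle:100::oracle::project.max-shm-memory=(privileged,8589934592,deny)
```

The same attribute can be set with projmod -s -K 'project.max-shm-memory=(privileged,8589934592,deny)' user.oracle rather than editing the file directly, and you can check what a running process actually inherited with prctl -i process <pid>.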

Monday Sep 28, 2009

Problems with Solaris and a DISM enabled Oracle Database

Resolving problems becomes ever more challenging in today's complex application environments, and sometimes error messages can be misleading or confusing to the system administrator or DBA trying to resolve the issue. Through my involvement with the VOSJEC (Veritas Oracle Sun Joint Escalation Center), we've seen a steady increase in the number of cases raised about customers using Oracle with DISM enabled on Solaris systems. In the majority of cases disabling DISM appears to resolve the issue, but with DISM use becoming more popular and making DBAs' lives easier, I personally don't feel that's the right course of action. DISM essentially gives the DBA the ability to resize the SGA on the fly without taking down the database, maximizing availability and avoiding downtime. See the following page from the Administering Oracle Database on Solaris documentation for further details.

For background on other DISM-related issues, have a read of Solution 230653: DISM Troubleshooting For Oracle9i and Later Releases, as it covers what DISM is and what can go wrong.

How do I know if the customer is using DISM or not...

* There should be an ora_dism process running on the system, so you can check the ps listing (explorer / guds).

* Check the Oracle alert log, as that should also mention the DISM process starting. It should also log if the DISM process has died, which means your Oracle shared memory segments aren't locked and can cause other problems; see Solution 230653: DISM Troubleshooting For Oracle9i and Later Releases.
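As a toy illustration of the first check (the ps output below is fabricated; on a real system you'd grep live ps -ef output or the copy in the explorer/guds data):

```shell
# Fabricated ps -ef sample for illustration; on a live system you would
# pipe real ps -ef output into the same grep.
ps_output="oracle  1234     1  0 09:00:00 ?        0:01 ora_dism_PROD
oracle  1240     1  0 09:00:00 ?        0:05 ora_pmon_PROD"

if echo "$ps_output" | grep ora_dism_ >/dev/null 2>&1; then
    echo "DISM in use"
else
    echo "DISM not in use"
fi
```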

So, what's the issue, I hear you asking? Well, several customers complained about Oracle batch jobs crashing or the entire database falling over with something like:

Sun Mar 15 14:53:28 2009
Errors in file /u01/app/oracle/admin/JSPSOL0/udump/jspsol03_ora_23220.trc:
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
SVR4 Error: 12: Not enough space
Additional information: 4
Additional information: 938866512
Additional information: -1

OR

Mon Aug 24 09:01:17 2009
Errors in file /opt/oracle/admin/MW/bdump/mw1_lgwr_18124.trc:
ORA-00340: IO error processing online log 1 of thread 1
ORA-00345: redo log write error block 167445 count 11
ORA-00312: online log 1 thread 1: '/dev/vx/rdsk/mwdbdg/redo11'
ORA-27070: async read/write failed
SVR4 Error: 12: Not enough space
Additional information: 1
Additional information: 167445
LGWR: terminating instance due to error 340

The first jump-to cause would be that this is some kind of I/O issue and that perhaps we've run out of space on the backing storage layer. Unfortunately that wasn't the reason in this case, as all filesystems had enough free space available. The next hit in our list of possible causes is that we don't have enough swap space available: when using DISM you have to have at least as much swap space configured as DISM segments in use, see Solution 214947: Solaris[TM] Operating System: DISM double allocation of memory for details. Again, the customer appeared to have the correct swap space configuration for the DISM segment currently in use.
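That sizing rule is easy to sanity-check with a little arithmetic; the figures below are made up for illustration (on a real system the segment size comes from the SGA settings and the swap figure from swap -s):

```shell
# Hypothetical figures in MB, purely for illustration.
dism_segment_mb=8192
swap_configured_mb=6144

# Per Solution 214947, configured swap must cover the DISM segment in use.
if [ "$swap_configured_mb" -lt "$dism_segment_mb" ]; then
    echo "swap too small: need at least ${dism_segment_mb}MB"
else
    echo "swap sizing OK"
fi
```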
Now, what I actually suspected was that CR 6559612 - "multiple softlocks on a DISM segment should decrement availrmem just once" - was coming into play here, which, if you read engineering's comments on it, can only happen if DISM isn't locked properly and we end up going down a different code path due to seg_pdisable being set to 1. There is another reason for seg_pdisable being set, and that's when we're trying to handle page retirement. Coincidentally, the two situations mentioned above both had pending page retirements happening due to CR 6587140 - "page_retire()/page_trycapture should not try to retire non relocatable kernel pages" (fixed in KJP 141414-06, latest Solaris 10), so there was a window of opportunity for CR 6559612 to happen every once in a while.

From what I can currently work out CR 6559612 - "multiple softlocks on a DISM segment should decrement availrmem just once" is a duplicate of CR 6603296 - "Multiple writes into dism segment reduces available swap" and also mentions duplicate CR 6644570 - "Application using large page sizes with DISM on Huron got pread to return error ENOMEM".

There seems to be a family of related issues which are all fixed as part of CR 6423097 - "segvn_pagelock() may perform very poorly" which includes the above:

6526804 DR delete_memory_thread, AIO, and segvn deadlock
6557794 segspt_dismpagelock() and segspt_shmadvise(MADV_FREE) may deadlock
6557813 seg_ppurge_seg() shouldn't flush all unrelated ISM/DISM segments
6557891 softlocks/pagelocks of anon pages should not decrement availrmem for memory swapped pages
6559612 multiple softlocks on a DISM segment should decrement availrmem just once
6562291 page_mem_avail() is stuck due to availrmem overaccounting and lack of seg_preap() calls
6596555 locked anonymous pages should not have assigned disk swap slots
6639424 hat_sfmmu.c:hat_pagesync() doesn't handle well HAT_SYNC_STOPON_REF and HAT_SYNC_STOPON_MOD flags
6639425 optimize checkpage() optimizations
6662927 page_llock contention during I/O

These are all fixed in the development release of Solaris Nevada build 91 but haven't been backported as yet. From looking at the bug reports it does seem likely that CR 6423097 and associated CR's will get fixed in Solaris 10 U8 (scheduled for release 9th October 2009).

So, that would lead us to the following suggestions:

1/ CR 6423097 mentions disabling large page support completely to work around the issue, which you could do and still continue to use DISM:

Workaround for nevada and S10U4 is to add to /etc/system:

set max_uheap_lpsize = 0x2000
set max_ustack_lpsize = 0x2000
set max_privmap_lpsize = 0x2000
set max_shm_lpsize = 0x2000
set use_brk_lpg = 0
set use_stk_lpg = 0
and reboot.

for S10U3 and earlier S10 releases, the workaround is to add to /etc/system:

set exec_lpg_disable = 1
set use_brk_lpg = 0
set use_stk_lpg = 0
set use_zmap_lpg = 0
and reboot

NOTE: this is *not* the same as disabling large page coalescing, as this completely disables large page support. Disabling large page coalescing is sometimes required depending on the application workload, so see Solution 215536: Oracle(R) 10g on Solaris[TM] 10 may run slow for further details.

2/ Disable DISM entirely for the short term and use ISM with Oracle whilst waiting for the release of Solaris 10 update 8.
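For option 2, my understanding (worth confirming with Oracle support for your release) is that Oracle only uses DISM when the SGA is resizable, i.e. when sga_max_size is larger than the configured SGA, so pinning the two to the same value is one way to fall back to ISM. A sketch of the init.ora change (the 8g figure is just an example):

```
# Example only: with sga_max_size equal to sga_target the SGA is not
# resizable, so the instance attaches its shared memory with ISM, not DISM.
sga_target=8g
sga_max_size=8g
```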

Conclusion / Actions
-------------------------------------

The above is essentially a hit list of things to be aware of when seeing errors of this nature. If the customer isn't using DISM at all then that might take you down another avenue. If they are, then remember that "Applications Drive System Behavior": you need to understand the architecture, and when and where things run, to tie the loose ends of any problem together and be able to prove root cause or not. See my previous blog entry about dealing with performance issues, as it talks about the same troubleshooting process I use all the time in diagnosing complex problems.

Saturday Mar 28, 2009

Don't forget truss for diagnosing issues

Oracle came to us via the VOSJEC channel for help in diagnosing a customer getting the following error in the alert log during various DML queries:

ORA-27505: IPC error destroying a port
ORA-27300: OS system dependent operation:close failed with status: 9
ORA-27301: OS failure message: Bad file number
ORA-27302: failure occurred at: skgxpdelpt1
ORA-03135: connection lost contact

and

ORA-27509: IPC error receiving a message
ORA-27300: OS system dependent operation:recvmsg failed with status: 95
ORA-27301: OS failure message: Socket operation on non-socket
ORA-27302: failure occurred at: sskgxp_rcvms
Non critical error ORA-00001 caught while writing to trace file
"/opt/oracle/diag/rdbms/tacor3p/tacor3p2/trace/tacor3p2_ora_5758.trc"
Error message: SVR4 Error: 9: Bad file number

Interestingly enough, Oracle commented that the customer was not running out of file descriptors and also had enough UDP ports available for their load. Oracle were also unable to replicate this in house, but it appeared to be consistent at the customer site and only happened when the customer enabled Oracle auditing with the syslog audit trail option. So why did Oracle believe this was an interoperability issue? Well, they provided the following truss data:

27873: open("/var/run/syslog_door", O_RDONLY) = 0
->the bad code now opens the syslog file again, at free fd 0
<snip>
27873: door_info(0, 0xFFFFFFFF7FFF78D8) = 0
27873: getpid() = 27873 [27340]
27873: door_call(0, 0xFFFFFFFF7FFF78A8) = 0
27873: close(0) = 0
->the bad code now closes 0 since it opened the file at 0
<snip>
27873: close(0) Err#9 EBADF
->again, closes fd 0 incorrectly, it is no longer open to anyone
<snip>
27873: recvmsg(0, 0xFFFFFFFF7FFF6F88, 0) Err#95 ENOTSOCK
->SKGXP now tries to use its socket without knowing someone else closed it and fails
<snip>

So Oracle believed that the open("/var/run/syslog_door") call seemed suspicious, as it didn't originate from the rdbms code, and hence some help was required from Sun to progress. Unfortunately that limited truss output didn't actually show us where in the process lifecycle we were going wrong, so we needed to collect some more data from the customer to work it all out. With truss you can use the -u option to do user-level function call tracing, i.e.:



truss -faeo /var/tmp/out -u a.out:: -u libc:: ./a.out

So, what did this new data actually show us:

5389: execve("/opt/oracle/product/11.1.0/bin/sqlplus", 0xFFFFFFFF7FFFFB38, 0xFFFFFFFF7FFFFB48) argc = 1
5389: argv: sqlplus
6085/1@1: -> libc:_open(0xffffffff7b3e7498, 0x0, 0x0, 0x0)
6085/1@1: -> libc:__open(0xffffffff7b3e7498, 0x0, 0x0, 0x0)
6085/1: open("/var/run/syslog_door", O_RDONLY) = 0
6085/1@1: <- libc:__open() = 0
6085/1@1: <- libc:_open() = 0
6085/1@1: -> libc:_door_info(0x0, 0xffffffff7fff7978, 0x0, 0x2000)
6085/1: door_info(0, 0xFFFFFFFF7FFF7978) = 0
6085/1@1: <- libc:_door_info() = 0
6085/1@1: -> libc:getpid(0x0, 0xffffffff7fff7978, 0x0, 0x2000)
6085/1: getpid() = 6085 [5389]
6085/1@1: <- libc:getpid() = 6085
6085/1@1: -> libc:_door_call(0x0, 0xffffffff7fff7948, 0x0, 0x2000)
6085/1: door_call(0, 0xFFFFFFFF7FFF7948) = 0
6085/1@1: <- libc:_door_call() = 0
6085/1@1: -> libc:_close(0x0, 0xffffffff7fff7948, 0x0, 0x2000)
6085/1: close(0) = 0
6085/1@1: <- libc:_close() = 0
6085/1@1: <- libc:syslogd_ok() = 1
6085/1@1: <- libc:vsyslog() = 0xffffffff7fff83c8
6085/1@1: -> libc:gettimeofday(0xffffffff7fff8540, 0x0, 0x0, 0x0)
6085/1@1: -> libc:_cancelon(0x0, 0x0, 0x0, 0x0)
6085/1@1: <- libc:_cancelon() = -19
6085/1@1: -> libc:_close(0x0, 0x1, 0x0, 0x0)
6085/1: close(0) Err#9 EBADF
6085/1@1: -> libc:_cerror(0x9, 0x1, 0x0, 0x0)
6085/1@1: -> libc:___errno(0x0, 0x0, 0x0, 0x0)
6085/1@1: <- libc:___errno() = 0xffffffff7b4fc9bc
6085/1@1: <- libc:_close() = -1
6085/1@1: -> libc:_canceloff(0xffffffffffffffff, 0x1, 0x0, 0x0)
6085/1@1: <- libc:_canceloff() = 0
6085/1@1: -> libc:times(0xffffffff7fff9460, 0x108c65000, 0x108c65, 0x108c00)
6085/1: times(0xFFFFFFFF7FFF9460) = 86657683
6085/1@1: -> libc:gethrtime(0x0, 0x108c00, 0x10a0c7, 0x400)
6085/1@1: -> libc:setjmp(0xffffffff7fffa7e0, 0x10a1025b0, 0x10a0bc4b0, 0x46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0x10a1025b0, 0x10a0bc4b0, 0x46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0x10a1025b0, 0x10a0bc4b0, 0x46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0x10a1025b0, 0x10a0bc4b0, 0x46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0x10a1025b0, 0x10a0bc4b0, 0x46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0x10a1025b0, 0x10a0bc4b0, 0x46100)
6085/1@1: -> libc:gettimeofday(0xffffffff7fffa620, 0x0, 0xffffffffffffffff, 0xfffffffffffffff8)
6085/1@1: -> libc:localtime_r(0xffffffff7fffa618, 0xffffffff7fffa5f4, 0x0, 0xec23d)
6085/1@1: -> libc:getsystemTZ(0xffffffff7fffa528, 0x0, 0x0, 0x0)
6085/1@1: -> libc:getenv(0xffffffff7b3e5b68, 0x0, 0x0, 0x0)

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/syslog.c

384 /*
385  * Use a door call to syslogd to see if it's alive.
386  */
387 static int
388 syslogd_ok(void)
389 {
390 int d;
391 int s;
392 door_arg_t darg;
393 door_info_t info;
394
395 if ((d = open(DOORFILE, O_RDONLY)) < 0)
396 return (0);
397 /*
398  * see if our pid matches the pid of the door server.
399  * If so, syslogd has called syslog(), probably as
400  * a result of some name service library error, and
401  * we don't want to let syslog continue and possibly
402  * fork here.
403  */
404 info.di_target = 0;
405 if (__door_info(d, &info) < 0 || info.di_target == getpid()) {
406 (void) close(d);
407 return (0);
408 }
409 darg.data_ptr = NULL;
410 darg.data_size = 0;
411 darg.desc_ptr = NULL;
412 darg.desc_num = 0;
413 darg.rbuf = NULL;
414 darg.rsize = 0;
415 s = __door_call(d, &darg);
416 (void) close(d);
417 if (s < 0)
418 return (0); /* failure - syslogd dead */
419 else
420 return (1);
421 }

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/syslog.c

340 /* output the message to the local logger */
341 if ((putmsg(LogFile, &ctl, &dat, 0) >= 0) && syslogd_ok())
342 return;
343 if (!(LogStat & LOG_CONS))
344 return;

so syslogd_ok returns 1 and then we return out of vsyslog():

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/syslog.c

158 void
159 syslog(int pri, const char *fmt, ...)
160 {
161 va_list ap;
162
163 va_start(ap, fmt);
164 vsyslog(pri, fmt, ap);
165 va_end(ap);
166 }


So we've returned from vsyslog() then into gettimeofday() and then our second close() returns EBADF.

6085/1@1: -> libc:gettimeofday(0xffffffff7fff8540, 0x0, 0x0, 0x0)
6085/1@1: -> libc:_cancelon(0x0, 0x0, 0x0, 0x0)
6085/1@1: <- libc:_cancelon() = -19
6085/1@1: -> libc:_close(0x0, 0x1, 0x0, 0x0)
6085/1: close(0) Err#9 EBADF
6085/1@1: -> libc:_cerror(0x9, 0x1, 0x0, 0x0)
6085/1@1: -> libc:___errno(0x0, 0x0, 0x0, 0x0)
6085/1@1: <- libc:___errno() = 0xffffffff7b4fc9bc
6085/1@1: <- libc:_close() = -1
6085/1@1: -> libc:_canceloff(0xffffffffffffffff, 0x1, 0x0, 0x0)
6085/1@1: <- libc:_canceloff() = 0
6085/1@1: -> libc:times(0xffffffff7fff9460, 0x108c65000, 0x108c65, 0x108c00)
6085/1: times(0xFFFFFFFF7FFF9460) = 86657683

So we've returned from syslog() and are now somewhere else in the code, so nothing to do with the door_calls, from my understanding. With that, Oracle development did some further digging and then announced they'd found a new defect in the Oracle code, logged as Oracle bug 7519558. If Oracle hadn't come back, I would probably have used the following DTrace script to confirm where the EBADF was actually being returned from close():

root@hippy-1 # more badclose.d
syscall::close:return
/pid == $target && errno == EBADF/
{
        @badf_ustack[execname, ustack(20)] = count();
        stop_pids[pid] = 1;
}

syscall::rexit:entry
/stop_pids[pid] != 0/
{
        printf("stopping pid %d", pid);
        stop();
        system("prun %d", pid);
        stop_pids[pid] = 0;
}

END
{
        printa(@badf_ustack);
}

root@hippy-1 # dtrace -ws badclose.d -c ./p
dtrace: script 'badclose.d' matched 3 probes
dtrace: allowing destructive actions
andy1
CPU ID FUNCTION:NAME
0 28735 rexit:entry stopping pid 27953
dtrace: pid 27953 exited with status 6
1 2 :END
p libc.so.1`_close+0x4
p`poo+0x8
p`main+0x4c
p`_start+0x5c
1

So, another happy ending for all, in that VOSJEC managed to get to the bottom of it using plain old truss rather than DTrace. It's worth remembering that although DTrace is a fantastic diagnosis tool, don't forget the other great tools available in Solaris for observing and diagnosing the problems you're investigating! :)

Friday Jan 16, 2009

Capturing data on Oracle "WARNING: aiowait timed out" events

Oracle "WARNING: aiowait timed out" error messages in the alert log always need further investigation, as they can indicate some kind of I/O or system resource issue that is stopping valuable I/Os down to storage. There could be a whole bunch of reasons why our asynchronous I/O has got stuck. Oracle reports "WARNING: aiowait timed out" after 10 minutes of non-response since the I/O was queued by aioread(3) or aiowrite(3). It's also worth being aware of the two methods by which asynchronous I/O is handled by Solaris: either kernel asynchronous I/O (KAIO), if the underlying filesystem supports it (raw devices, or filesystems with added software that gives raw access), or the standard userland library implementation. For further background reading, have a look at the Solaris™ Internals: Core Kernel Components book by Jim Mauro and Richard McDougall, chapter 11.4, Asynchronous I/O.
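A quick way to tell which flavour a given process is using is to truss it briefly: with KAIO you see kaio() system calls, with the userland implementation you see the library's LWPs issuing plain pread()/pwrite() calls. A toy classification along those lines (the truss output here is fabricated; real data would come from something like truss -t kaio -p <pid> against a database writer):

```shell
# Fabricated truss snippet for illustration only.
truss_output='/1:     kaio(AIOWRITE, 259, 0x38E4A3C0, 8192) = 0
/1:     kaio(AIOWAIT, 0xFFBFF2A8)             = 0'

if echo "$truss_output" | grep kaio >/dev/null 2>&1; then
    echo "KAIO in use"
else
    echo "userland aio implementation in use"
fi
```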

So, to help understand where in the life cycle we're stuck, we'll need to collect some data. This might also help us define a better problem statement, which is vitally important in a rational troubleshooting process. See my previous post about the importance of defining a good problem statement.

Enabling the trigger environment

A full crash dump is much preferred over a live core, as things tend to change on the fly whilst a live dump is being taken, which corrupts some of the structures in the dump and can give strange results. In this example I'm going to assume that we don't want to take down the entire box, as that would affect all services rather than just Oracle. Obviously, try the live method first, and if that doesn't yield results then try a full dump. So, here are the steps for enabling a triggered live savecore on an "aiowait timed out" error message in an Oracle alert log.

1/ You'll need to setup a dedicated dump device to collect a live savecore. You'll either need a spare raw partition or have to create a large file using mkfile, see the man page dumpadm(1M) for details on how to do this.
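As a sketch of step 1 (the slice name below is an example; any spare raw partition will do):

```
# Point the dump configuration at a dedicated slice so savecore -L has
# somewhere to write (example device name):
dumpadm -d /dev/dsk/c0t1d0s1

# and confirm the resulting configuration:
dumpadm
```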

2/ Download livecore_aio (or livecore_aio.x86) and guds script.

3/ Create collect_data.sh:

root@hippy-1 # more collect_data.sh
#!/bin/sh
/var/tmp/guds_2.8.4 -q -c30 -i1 -n5  -w0 -s<case ref> -X2 -D /var/tmp &
/usr/bin/savecore -L
echo "aiowait coredump event - please send guds output and crash dump to Sun" | mailx -s "AIOWAIT TIMED OUT EVENT" root

In my case I'm going to fire off collecting guds to gather some performance stats on the box, but you could add anything else you want to run here including a "reboot -d" to take a full crash dump before rebooting the box.

4/ Change perms on binary and script before copying to /var/tmp

chmod +x livecore_aio collect_data.sh
cp livecore_aio collect_data.sh /var/tmp

note: livecore_aio expects collect_data.sh to be in /var/tmp for it to work correctly

Test run

Test the program and script (as root):

# touch dummy_file
#./livecore_aio dummy_file &
# echo "WARNING: aiowait timed out" >> dummy_file

This should produce a live savecore dump and kick off guds.

Deploy

* execute the livecore_aio binary
#./livecore_aio <full path of alert_log> &

When issue happens, upload the live core dump and guds data to Sun for analysis.

Saturday Oct 25, 2008

Oracle parallel query performance on a T5140

Being an engineer, it's always a good feeling getting to the bottom of a problem, and none more so than this one. Take a T5140, create a 3-way RAID 0 LUN using the internal disks, and stick Oracle on top so you can do something useful with it for your application........... and what do you get......... a problem. I suspect after that opening some of you are thinking "where's he going with this?" .... the answer: nowhere, I'm not picking holes in either the T5140 or Oracle..... good, I'm glad we got that one clear! :)

Anyhow, a customer came to us complaining that their application wasn't running as expected on this platform, and really wanted to know if there was a hardware fault or bug with the platform or the operating system running on it. From the tests the customer had been doing themselves, they believed that the bottleneck was the underlying I/O subsystem, in this case the LSI H/W RAID. Essentially, the customer had configured a 3 disk RAID 0 stripe using the default 64k stripe width, like this:

bash-3.00# raidctl -l c1t1d0
Volume                  Size    Stripe  Status   Cache  RAID
       Sub                     Size                    Level
               Disk
----------------------------------------------------------------
c1t1d0                  409.9G  64K     OPTIMAL  OFF    RAID0
               0.1.0   136.6G          GOOD
               0.2.0   136.6G          GOOD
               0.3.0   136.6G          GOOD

They had then created a single slice on which Oracle was installed, configured for Direct I/O (a good thing anyway on a UFS filesystem), so we were avoiding the filesystem buffer cache and double buffering. A 64k stripe per disk across three disks gives us a total stripe width of 192k. The throughput of each of these disks is between 50-60MB per second, which gives a theoretical throughput across all stripes of 150-180MB per second for reads. We can forget writes, as Oracle is pwrite()'ing in 8k synchronous chunks to a non-write-cache-enabled volume, and each write only hits one disk (because 8k is less than the 64k stripe size); hence why we saw a 1GB tablespace creation take 18 seconds with an average throughput of 56MB per second, which is what we would have expected from a single disk.

SQL> set timing on
SQL> create tablespace andy3
 2  datafile '/u01/oracle/oradata/SUN/andy03.dbf'
 3  size 1g;

Tablespace created.

Elapsed: 00:00:18.12

and iostat -xnz 1 shows us

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   13.0  128.0  104.0 56369.0  0.0  1.6    0.0   11.2   1  93 c1t1d0
   14.0   78.0  112.0 56281.3  0.0  1.2    0.1   13.5   0  93 c1t1d0
   13.0   93.0  112.0 53734.0  0.0  1.4    0.0   13.4   1  93 c1t1d0
   13.0   95.0  104.0 58397.6  0.0  1.4    0.0   12.7   1  92 c1t1d0

This result was the same as the customer's, but things then got interesting when we started looking at full table scan parallel queries. The customer ended up with these results:

Parallelism   Time (s)   Throughput (MB/s)   db_file_multiblock_read_count
0             719        13.8                16 (16 x 8k = 128k)
2             195        50.9                16
4             208        47.7                16
0             420        23.6                32 (32 x 8k = 256k)
2             163        60.9                32
4             208        53.9                32

Now, those results look bad, especially when you consider that theoretically we should be able to achieve 150-180MB/s based on a three-disk stripe (3 x 60MB/s).
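The expected numbers quoted above can be reproduced with a bit of arithmetic (figures as given in the post):

```shell
# Figures from the post: 3-disk RAID 0, 64k stripe unit, ~60MB/s per disk.
stripe_unit_kb=64
disks=3
per_disk_mbs=60

stripe_width_kb=$((stripe_unit_kb * disks))  # full stripe width
max_read_mbs=$((per_disk_mbs * disks))       # theoretical sequential read rate

echo "stripe width: ${stripe_width_kb}k"
echo "theoretical read throughput: ${max_read_mbs}MB/s"

# Writes are 8k synchronous pwrite()s: 8k < 64k stripe unit, so each write
# lands on a single disk and write throughput stays at single-disk speed.
```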

Using the same parallel test plan as the customer:

oracle@v4v-t5140a-gmp03~$more para.sql

set timing on;

select /*+ FULL(t) */ count(*) from contact_methods t;

select /*+ FULL(t) PARALLEL(t,2) */ count(*) from contact_methods t;

select /*+ FULL(t) PARALLEL(t,4) */ count(*) from contact_methods t;

select /*+ FULL(t) PARALLEL(t,64) */ count(*) from contact_methods t;

oracle@v4v-t5140a-gmp03~/oradata/SUN$ls -alh test01.dbf
-rw-r-----   1 oracle   dba         9.7G Oct 24 08:25 test01.dbf

I got these:

SQL> @para

 COUNT(*)
----------
 15700000

Elapsed: 00:00:47.85

 COUNT(*)
----------
 15700000

Elapsed: 00:00:32.53

 COUNT(*)
----------
 15700000

Elapsed: 00:00:34.68

 COUNT(*)
----------
 15700000

Elapsed: 00:00:42.17

whilst the first full table scan is running I see the following in iostat -xnz 1:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  108.0    0.0 93122.4    0.0  0.0  0.4    0.1    4.0   1  35 c1t1d0
  151.0    3.0 95796.9   48.0  0.0  0.5    0.1    3.0   1  34 c1t1d0
  115.0    0.0 103367.6   0.0  0.0  0.3    0.1    2.6   1  28 c1t1d0
  116.0    0.0 102232.7   0.0  0.0  0.3    0.1    3.0   1  29 c1t1d0
  122.0    3.0 105326.4  48.0  0.0  0.3    0.1    2.5   1  29 c1t1d0
  116.0    0.0 96467.2    0.0  0.0  0.5    0.1    4.1   1  34 c1t1d0

and then

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  193.0    0.0 159383.3    0.0  0.0  8.0    0.1   41.4   1 100 c1t1d0
  195.5    3.0 163681.0   48.1  0.0  8.1    0.1   40.8   1 100 c1t1d0
  220.1    0.0 188770.3    0.0  0.0  7.7    0.1   34.8   3 100 c1t1d0
  192.0    0.0 168156.9    0.0  0.0  7.2    0.1   37.8   1 100 c1t1d0
  191.0    3.0 162361.2   48.0  0.0  7.4    0.1   38.1   1 100 c1t1d0
  190.0    0.0 162776.0    0.0  0.0  7.3    0.1   38.7   1 100 c1t1d0
  192.0    0.0 162737.6    0.0  0.0  6.9    0.1   35.9   1 100 c1t1d0
  186.0    3.0 153754.2   48.0  0.0  8.4    0.1   44.4   1 100 c1t1d0
  191.0    1.0 160412.4    8.0  0.0  7.7    0.1   40.1   1 100 c1t1d0

when the parallel jobs are running.

This is because I changed db_file_multiblock_read_count to 128 (128 * 8k = 1M), and in fact I saw improvements using 192k (64k * 3) to match the stripe width. I also went with some recommendations from here, which also helped, along with running the latest T5140 firmware and the latest KJP 137111-08 to avoid some known performance issues.
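For reference, the parameter change sketched as an init.ora/spfile fragment (the value matches the test above; as with all such tuning, benchmark against your own workload before committing to it):

```
# 128 blocks x 8k block size = 1M per multiblock read
db_file_multiblock_read_count=128
```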

It's amazing that tuning one little option can have such a dramatic effect on the results. It also shows that just because you don't get the results you're expecting, you shouldn't assume there is a fault with the hardware or software. For me personally, it's always good to understand the results and how you got there, although, as with all performance-related issues, you can sometimes get sidetracked and end up going down the wrong path. To avoid that, make sure you chat with your colleagues when you feel like you're not getting anywhere, as a fresh set of eyes can bring you back on the path and closer to resolution.




Wednesday Oct 01, 2008

Capturing Snoop Data On An Oracle Event

Following on from my previous article about capturing TNF data on an Oracle event, I thought I'd also mention that you can do something similar with snoop data. This really came about from trying to troubleshoot ORA-29740 issues with Oracle RAC. Most of the time you really want to have a look at these Sunsolve documents for hints:

  • Solution 230303 : SunCluster[TM] 3.x: Troubleshooting ORA-29740 in a RAC environment
  • Solution 205145 : Oracle error ORA-29740 (reason 3) resolved with network tuning

But, if you think the network is perhaps worth chasing and the physical switches, hubs etc. don't hint at a fault, then we can enable the lantracer script to snoop on an interconnect and terminate when Oracle receives an ORA-29740. To do this you'll need to do the following:

1) in $ORACLE_HOME/bin, place a script called oradbg containing:

#!/bin/sh
echo "stopping Lantracer" >> /tmp/oradbg.$$
/usr/bin/rm /var/tmp/lan/running
echo `/bin/date` >> /tmp/oradbg.$$
echo "done\." >> /tmp/oradbg.$$

2) in an sqlplus "/ as sysdba" session run the following:

alter system set "_oradbg_pathname"="<$ORACLE_HOME>/bin/oradbg" scope=spfile;
alter system set event="29740 debug" scope=spfile;

where $ORACLE_HOME is the expanded path of the $ORACLE_HOME env var

3) add execute to the oradbg script

4) restart the instance

5/ create /var/tmp/lan and copy lantracer script to it. chmod 755 lantracer

6/ start lantracer

cd /var/tmp/lan; ./lantracer >/dev/null 2>&1 & 

7/ check oracle can stop lantracer by using it to detect a 942 event:

 sqlplus <user>/<password>
 alter session set events '942 debug';
 select * from <a table that doesn't exist>;

This should remove /var/tmp/lan/running and leave us with some snoop scripts which should hopefully have captured data during the event.

Capturing TNF Data On An Oracle Event

Recently, I got involved in an interesting VOSJEC problem in which Oracle was core dumping at intermittent times with ORA-7445 on a large Solaris 8 system, with no known root cause. Unfortunately, neither the core dump from Oracle nor the trace files helped either Oracle or us to work out the issue, so a colleague of mine, Tim Uglow, created a TNF package to collect TNF data constantly and then stop and dump the TNF kernel buffer out to disk at the time the Oracle event happened. That way we'd be able to see what led up to the event.

To allow oracle processes running as user "oracle" to stop the tnf kernel tracing we need to change the permissions on the tnf driver. Currently you have ..

# ls -alL /dev/tnf*
crw-------   1 root     sys      114,  0 Apr 10 22:02 /dev/tnfctl
crw-------   1 root     sys      114,  1 Apr 10 22:02 /dev/tnfmap

and we need to change it to ...

# ls -alL /dev/tnf*
crw-rw-rw-   1 root     sys      114,  0 Jun 27 15:57 /dev/tnfctl
crw-rw-rw-   1 root     sys      114,  1 Jun 27 15:57 /dev/tnfmap

This is done by the following two commands...

# rem_drv tnf
# add_drv -m '* 0666 root sys' tnf

The install script for the JAVAtnfd.datastream.pkg package will do this for you, and the uninstall script will put it back to its original state. This change allows ordinary users to affect the kernel tracing, but most users don't know about the subsystem, and the only information they can gather is kernel events; there is no actual data visible.

Installing the scripts


1) copy JAVAtnfd.datastream.pkg to /tmp on your machine
2) pkgadd -d  /tmp/JAVAtnfd.datastream.pkg

You may get warning messages like ..

"Device busy
Cannot unload module: tnf
Will be unloaded upon reboot."

These can be ignored; devfsadm will have changed the permissions.

To start the tnf tracing..

as root please run ...

# cd /opt/JAVAtnfd/data 
# /usr/bin/nohup /opt/JAVAtnfd/bin/tnfscript &

then check /opt/JAVAtnfd/data/nohup.out for errors; every time you start it, it will append to nohup.out.

If you needed to stop the tracing ..

look in /opt/JAVAtnfd/data/nohup.out for the last line like "tnfscript pid XXXX starting" and then run ...

# kill XXXX

should stop it.

When the oracle debug executes, it will touch /opt/JAVAtnfd/data/stop, which stops the tnf tracing and causes the tnfscript to exit at the end of its 10 minute loop - look at the end of /opt/JAVAtnfd/data/nohup.out for "tnfscript pid XXXX exiting as requested by Oracle". Before it exits, it will extract the required files into /opt/JAVAtnfd/data/*.'date' and email root the location of the four files that need to be sent to support, together with the Oracle alert log and the trace file that will be mentioned in the alert.log file, something like....

Errors in file <path to trace file>
ORA-7445...
<path to trace file> is also required.

Instructions from Oracle to get Oracle processes to stop the tnf on a ORA-7445 error.


1) in init.ora define the following 2 parameters:

  event="7445 debug"
  _oradbg_pathname=<$ORACLE_HOME>/bin/oradbg

or you can set them via a sqlplus session:

SQL> alter system set "_oradbg_pathname"="/u01/oracle/product/10.2.0/db/bin/oradbg" scope=both;
System altered.
SQL> alter system set event="7445 debug" scope=spfile;
System altered.

where $ORACLE_HOME is the expanded path of the $ORACLE_HOME environment variable

2) in $ORACLE_HOME/bin, place a script called oradbg containing:

  #!/bin/sh
  echo "stopping TNF collection using prex" >> /tmp/oradbg.$$
  echo "ktrace off" | /usr/bin/prex -k
  echo $$ > /opt/JAVAtnfd/data/stop
  echo `/bin/date` >> /tmp/oradbg.$$
  echo "done\." >> /tmp/oradbg.$$

3) add execute to the oradbg script

4) restart the instance

Upon a 7445 error being encountered (whereby Oracle's signal handler spots a SIGSEGV/core dump), we disable the tracing via the built-in Oracle 'debug' event, and a file will be written to /tmp containing the date and time and a note that we stopped prex. It will also touch the file /opt/JAVAtnfd/data/stop, which signals the Sun tnf script to stop. Obviously, if the customer has any path differences for any of the binaries, they should modify accordingly. This can easily be tested after restarting the instance by running:

sqlplus <user>/<password>
 alter session set events '942 debug';
 select * from <a table that doesn't exist>;

This will generate an error along the lines of:

ERROR at line 1:
ORA-00942: table or view does not exist

and the debug event will fire on this temporary session-level dummy event and run the prex command. They can then check /tmp for a new oradbg file with the details. Obviously, you can trace any other Oracle event by just changing the event number in the sqlplus 'debug' setting, which might make problem resolution a little easier on Solaris releases prior to Solaris 10 and the world of DTrace! :)
About

I'm an RPE (Revenue Product Engineering) Engineer supporting Solaris on Exadata, Exalogic and Super Cluster. I attempt to diagnose and resolve any problems with Solaris on any of the Engineered Systems.
