Saturday Mar 28, 2009

Don't forget truss for diagnosing issues

Oracle came to us via the VOSJEC channel for help in diagnosing a customer getting the following error in the alert log during various DML queries:

ORA-27505: IPC error destroying a port
ORA-27300: OS system dependent operation:close failed with status: 9
ORA-27301: OS failure message: Bad file number
ORA-27302: failure occurred at: skgxpdelpt1
ORA-03135: connection lost contact

and

ORA-27509: IPC error receiving a message
ORA-27300: OS system dependent operation:recvmsg failed with status: 95
ORA-27301: OS failure message: Socket operation on non-socket
ORA-27302: failure occurred at: sskgxp_rcvms
Non critical error ORA-00001 caught while writing to trace file
"/opt/oracle/diag/rdbms/tacor3p/tacor3p2/trace/tacor3p2_ora_5758.trc"
Error message: SVR4 Error: 9: Bad file number

Interestingly enough Oracle commented that the customer was not running out of file descriptors and also had enough UDP ports available for their load. Oracle were also unable to replicate this in house but appeared to be consistent at the customer site and only happened when the customer enabled Oracle auditing with the syslog audit trail option. So why did Oracle believe this was an interoperability issue? Well they provided the following truss data:

27873: open("/var/run/syslog_door", O_RDONLY) = 0
->the bad code now opens the syslog file again, at free fd 0
<snip>
27873: door_info(0, 0xFFFFFFFF7FFF78D8) = 0
27873: getpid() = 27873 [27340]
27873: door_call(0, 0xFFFFFFFF7FFF78A8) = 0
27873: close(0) = 0
->the bad code now closes 0 since it opened the file at 0
<snip>
27873: close(0) Err#9 EBADF
->again, closes fd 0 incorrectly, it is no longer open to anyone
<snip>
27873: recvmsg(0, 0xFFFFFFFF7FFF6F88, 0) Err#95 ENOTSOCK
->SKGXP now tries to use its socket without knowing someone else closed it and fails
<snip>

So Oracle believed that the open("/var/run/syslog_door" seemed suspicious as the call didn't originate from the rdbms code and hence some help was required from Sun to progress. Unfortunately that limited truss output didn't actually show us the where in the process lifecycle we were going wrong so we needed to collect some more data from the customer to work it all out. With truss you can use the -u option to do user level function call tracing i.e:



truss -faeo /var/tmp/out -u a.out:: -u libc:: ./a.out

So, what did this new data actually show us:

5389: execve("/opt/oracle/product/11.1.0/bin/sqlplus", 0xFFFFFFFF7FFFFB38, 0xFFFFFFFF7FFFFB48) argc = 1
5389: argv: sqlplus
6085/1@1: -> libc:_open(0xffffffff7b3e7498, 0x0, 0x0, 0x0)
6085/1@1: -> libc:__open(0xffffffff7b3e7498, 0x0, 0x0, 0x0)
6085/1: open("/var/run/syslog_door", O_RDONLY) = 0
6085/1@1: <- libc:__open() = 0
6085/1@1: <- libc:_open() = 0
6085/1@1: -> libc:_door_info(0x0, 0xffffffff7fff7978, 0x0, 0x2000)
6085/1: door_info(0, 0xFFFFFFFF7FFF7978) = 0
6085/1@1: <- libc:_door_info() = 0
6085/1@1: -> libc:getpid(0x0, 0xffffffff7fff7978, 0x0, 0x2000)
6085/1: getpid() = 6085 [5389] 6085/1@1: <- libc:getpid() = 6085
6085/1@1: -> libc:_door_call(0x0, 0xffffffff7fff7948, 0x0, 0x2000)
6085/1: door_call(0, 0xFFFFFFFF7FFF7948) = 0
6085/1@1: <- libc:_door_call() = 0
6085/1@1: -> libc:_close(0x0, 0xffffffff7fff7948, 0x0, 0x2000)
6085/1: close(0) = 0
6085/1@1: <- libc:_close() = 0
6085/1@1: <- libc:syslogd_ok() = 1
6085/1@1: <- libc:vsyslog() = 0xffffffff7fff83c8
6085/1@1: -> libc:gettimeofday(0xffffffff7fff8540, 0x0, 0x0, 0x0)
6085/1@1: -> libc:_cancelon(0x0, 0x0, 0x0, 0x0)
6085/1@1: <- libc:_cancelon() = -19
6085/1@1: -> libc:_close(0x0, 0x1, 0x0, 0x0)
6085/1: close(0) Err#9 EBADF
6085/1@1: -> libc:_cerror(0x9, 0x1, 0x0, 0x0)
6085/1@1: -> libc:___errno(0x0, 0x0, 0x0, 0x0)
6085/1@1: <- libc:___errno() = 0xffffffff7b4fc9bc
6085/1@1: <- libc:_close() = -1
6085/1@1: -> libc:_canceloff(0xffffffffffffffff, 0x1, 0x0, 0x0)
6085/1@1: <- libc:_canceloff() = 0
6085/1@1: -> libc:times(0xffffffff7fff9460, 0x108c65000, 0x108c65, 0x108c00)
6085/1: times(0xFFFFFFFF7FFF9460) = 86657683
6085/1@1: -> libc:gethrtime(0x0, 0x108c00, 0x10a0c7, 0x400)
6085/1@1: -> libc:setjmp(0xffffffff7fffa7e0, 0x10a1025b0, 0x10a0bc4b0, 0x46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0x10a1025b0, 0x10a0bc4b0, 0x46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0x10a1025b0, 0x10a0bc4b0, 0x46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0x10a1025b0, 0x10a0bc4b0, 0x46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0x10a1025b0, 0x10a0bc4b0, 0x46100)
6085/1@1: -> libc:setjmp(0xffffffff7fffa5f0, 0x10a1025b0, 0x10a0bc4b0, 0x46100)
6085/1@1: -> libc:gettimeofday(0xffffffff7fffa620, 0x0, 0xffffffffffffffff, 0xfffffffffffffff8)
6085/1@1: -> libc:localtime_r(0xffffffff7fffa618, 0xffffffff7fffa5f4, 0x0, 0xec23d)
6085/1@1: -> libc:getsystemTZ(0xffffffff7fffa528, 0x0, 0x0, 0x0)
6085/1@1: -> libc:getenv(0xffffffff7b3e5b68, 0x0, 0x0, 0x0)

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/syslog.c

384 /\*
385 \* Use a door call to syslogd to see if it's alive.
386 \*/
387 static int
388 syslogd_ok(void)
389 {
390 int d;
391 int s;
392 door_arg_t darg;
393 door_info_t info;
394
395 if ((d = open(DOORFILE, O_RDONLY)) < 0)
396 return (0);
397 /\*
398 \* see if our pid matches the pid of the door server.
399 \* If so, syslogd has called syslog(), probably as
400 \* a result of some name service library error, and
401 \* we don't want to let syslog continue and possibly
402 \* fork here.
403 \*/
404 info.di_target = 0;
405 if (__door_info(d, &info) < 0 || info.di_target == getpid()) {
406 (void) close(d);
407 return (0);
408 }
409 darg.data_ptr = NULL;
410 darg.data_size = 0;
411 darg.desc_ptr = NULL;
412 darg.desc_num = 0;
413 darg.rbuf = NULL;
414 darg.rsize = 0;
415 s = __door_call(d, &darg);
416 (void) close(d);
417 if (s < 0)
418 return (0); /\* failure - syslogd dead \*/
419 else
420 return (1);
421 }

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/syslog.c

340 /\* output the message to the local logger \*/
341 if ((putmsg(LogFile, &ctl, &dat, 0) >= 0) && syslogd_ok())
342 return;
343 if (!(LogStat & LOG_CONS))
344 return;

so syslogd_ok returns 1 and then we return out of vsyslog():

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/syslog.c

158 void
159 syslog(int pri, const char \*fmt, ...)
160 {
161 va_list ap;
162
163 va_start(ap, fmt);
164 vsyslog(pri, fmt, ap);
165 va_end(ap);
166 }


So we've returned from vsyslog() then into gettimeofday() and then our second close() returns EBADF.

6085/1@1: -> libc:gettimeofday(0xffffffff7fff8540, 0x0, 0x0, 0x0)
6085/1@1: -> libc:_cancelon(0x0, 0x0, 0x0, 0x0)
6085/1@1: <- libc:_cancelon() = -19
6085/1@1: -> libc:_close(0x0, 0x1, 0x0, 0x0)
6085/1: close(0) Err#9 EBADF
6085/1@1: -> libc:_cerror(0x9, 0x1, 0x0, 0x0)
6085/1@1: -> libc:___errno(0x0, 0x0, 0x0, 0x0)
6085/1@1: <- libc:___errno() = 0xffffffff7b4fc9bc
6085/1@1: <- libc:_close() = -1
6085/1@1: -> libc:_canceloff(0xffffffffffffffff, 0x1, 0x0, 0x0)
6085/1@1: <- libc:_canceloff() = 0
6085/1@1: -> libc:times(0xffffffff7fff9460, 0x108c65000, 0x108c65, 0x108c00)
6085/1: times(0xFFFFFFFF7FFF9460) = 86657683

So we've returned from syslog() and are now somewhere else in the code so nothing to do with the door_calls from my understanding. With that, Oracle development did some further digging and then announced they'd found a new defect in the Oracle code and was logged as Oracle bug 7519558. If Oracle hadn't have come back then I would have probably used the following Dtrace script to confirm where the EBADF was actually being returned from close():

root@hippy-1 # more badclose.d
syscall::close:return
/pid == $target && errno == EBADF/
{
@badf_ustack[execname,ustack(20)] = count();
stop_pids[pid] = 1;
}

syscall::rexit:entry
/stop_pids[pid] != 0/
{
printf("stopping pid %d", pid);
stop();
system("prun %d", pid);
stop_pids[pid] = 0;
}

END
{
printa(@badf_ustack);
}

root@hippy-1 # dtrace -ws badclose.d -c ./p
dtrace: script 'badclose.d' matched 3 probes
dtrace: allowing destructive actions
andy1
CPU ID FUNCTION:NAME
0 28735 rexit:entry stopping pid 27953
dtrace: pid 27953 exited with status 6
1 2 :END
p libc.so.1`_close+0x4
p`poo+0x8
p`main+0x4c
p`_start+0x5c
1

So, another happy ending for all, in that VOSJEC managed to get to the bottom of it using plain old truss rather than Dtrace. It's worth remembering that although Dtrace is a fantastic diagnosis tool, don't forget the other great tools available in Solaris in trying to observe and diagnose problems you're investigating! :)

About

I'm an RPE (Revenue Product Engineering) Engineer supporting Solaris on Exadata, Exalogic and Super Cluster. I attempt to diagnose and resolve any problems with Solaris on any of the Engineered Systems.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today