Monday Sep 22, 2008

Jumpstarting problems

I ran into this problem a while back whilst trying to jumpstart one of my VOSJEC lab systems which caused a little bit of grief for a while:

Sun Fire V210, No Keyboard
Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.22.33, 2048 MB memory installed, Serial #73138818.
Ethernet address 0:14:4f:5c:2:82, Host ID: 845c0282.



ChassisSerialNumber UNDEFINED
Rebooting with command: boot net -sv                                  
Boot device: /pci@1f,700000/network@2  File and args: -sv
100 Mbps FDX Link up
Requesting Internet Address for 0:14:4f:5c:2:82
Requesting Internet Address for 0:14:4f:5c:2:82
100 Mbps FDX Link up
boot: cannot open kernel/sparcv9/unix
Enter filename [kernel/sparcv9/unix]:
boot: cannot open kernel/sparcv9/unix
Enter filename [kernel/sparcv9/unix]:
boot: cannot open kernel/sparcv9/unix


[andharr@e2big:~]$ snoop -i snoop_v210.out
  1   0.00000 OLD-BROADCAST -> (broadcast)  RARP C Who is 0:14:4f:5c:2:82 ?
  2   0.01679        e2big -> vos-v210-n0-gmp03 RARP R 0:14:4f:5c:2:82 is 129.156.214.198, vos-v210-n0-gmp03
  3   0.00025 vos-v210-n0-gmp03 -> BROADCAST    TFTP Read "819CD6C6" (octet)
  4   0.00169 vos-v210-n0-gmp03 -> (broadcast)  ARP C Who is 129.156.214.22, vos-x4100-gmp03 ?
  5   0.00479 vos-v210-n0-gmp03 -> (broadcast)  ARP C Who is 129.156.215.14, vos-jet-gmp03 ?
  6   0.00297        e2big -> vos-v210-n0-gmp03 TFTP Data block 1 (512 bytes)
  7   0.00182 vos-v210-n0-gmp03 -> e2big        TFTP Error: unknown transfer ID

bummer, so what if we do it again, accept this time specify the boot server:

Sun Fire V210, No Keyboard
Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.22.33, 2048 MB memory installed, Serial #73138818.
Ethernet address 0:14:4f:5c:2:82, Host ID: 845c0282.



ChassisSerialNumber UNDEFINED
Rebooting with command: boot net:129.156.210.175 -sv                  
Boot device: /pci@1f,700000/network@2:129.156.210.175  File and args: -sv
100 Mbps FDX Link up
Requesting Internet Address for 0:14:4f:5c:2:82
Requesting Internet Address for 0:14:4f:5c:2:82
100 Mbps FDX Link up
module /platform/sun4u/kernel/sparcv9/unix: text at [0x1000000, 0x10c8b2d] data at 0x1800000
module /platform/sun4u/kernel/sparcv9/genunix: text at [0x10c8b30, 0x12b9b17] data at 0x1867440
module /platform/SUNW,Sun-Fire-V210/kernel/misc/sparcv9/platmod: text at [0x12b9b18, 0x12ba84f] data at 0x18bbc60
module /platform/sun4u/kernel/cpu/sparcv9/SUNW,UltraSPARC-IIIi: text at [0x12ba880, 0x12cd2d7] data at 0x18bc340
SunOS Release 5.11 Version snv_90 64-bit
Copyright 1983-2008 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Ethernet address = 0:14:4f:5c:2:82
<snip>

So some other tftp server is responding and even though I've disabled the tftp & bootparams on the other host and it does look like the correct server, I'm still picking up that "unknow transfer ID" error as part of the tftpboot. Mmm, something is still responding.... :( so I'll have to workaround the issue by specifying the jumpstart server using "boot net:129.156.210.175".

Thursday Sep 21, 2006

Oracle redo logs and dd

I recently had a case with a customer concerned over the time taken to dd an Oracle redo log file even though Oracle had been shutdown. Finding this strange I thought I'd use TNF probes to see what was happening.
#!/bin/sh -x

# Script to collect TNF data whilst dd is happening on Oracle log file
#
# Andy H 19/09/06
#

TMPDIR=${TMPDIR:-/var/tmp/tnf}
output_file=$TMPDIR/ktrace.tnf
command="dd if=/u01/oracle/oradata/TEST/redo01.log of=/u01/oracle/oradata/TEST/tmp.tmp"

start_prex()
{
        echo "Enabling kernel tracing"
prex -k < $TMPDIR/pid.out
                wait $pid
        fi
}

# Extract kernel trace buffer
ktrace_xtract() {
        echo "Extracting kernel trace buffer"
        tnfxtract $output_file || fail "Could not extract kernel trace buffer"
        ls -l $output_file
}

# Get kernel symbols
ktrace_sym() {
        echo "Collecting kernel symbol information"
        /usr/ccs/bin/nm -fx /dev/ksyms | tr "|" " " > $TMPDIR/nm.out
}

# tar and gzip data
ktrace_gzip() {
        echo "Packaging collected data"
        /usr/sbin/tar cf /tmp/tnf.tar $TMPDIR
        /usr/bin/gzip /tmp/tnf.tar
        echo "Collection complete, please send in /tmp/tnf.tar.gz for analysis"
}

start_prex
run_command
ktrace_xtract
ktrace_sym
stop_prex
ktrace_gzip
Following using some clever sorting awk script from Tim Uglow I compared the TNF output from the normal dd timing and then the slow one.
64.922000  -  2562 1 0x300042edc20 0 page_unmap vnode: 0x600062ee440 offset: 157253632
64.924500  -  2562 1 0x300042edc20 0 page_unmap vnode: 0x600062ee440 offset: 157261824
64.927100  -  2562 1 0x300042edc20 0 page_unmap vnode: 0x600062ee440 offset: 157270016
64.929200  -  2562 1 0x300042edc20 0 page_unmap vnode: 0x600062ee440 offset: 157278208
64.931500  -  2562 1 0x300042edc20 0 page_unmap vnode: 0x600062ee440 offset: 157286400
101.904300  -  2562 1 0x300042edc20 0 syscall_end rval1: 4 rval2: 0 errno: 0 creat64 98.822400 No error
101.908200  -  2562 1 0x300042edc20 0 thread_queue tid: 0x300042edc20 cpuid: 0 priority: 0 queue_length: 0x0
101.908500  -  2562 1 0x300042edc20 0 thread_state tid: 0x300042edc20 state: 8 waiting for CPU
101.924000  -  2562 1 0x300042edc20 0 syscall_start sysnum: 137 sysconfig
101.924500  -  2562 1 0x300042edc20 0 syscall_end rval1: 8192 rval2: 18466816 errno: 0 sysconfig 0.000500 No error
101.928700  -  2562 1 0x300042edc20 0 syscall_start sysnum: 17 brk
101.931200  -  2562 1 0x300042edc20 0 syscall_end rval1: 0 rval2: 134217728 errno: 0 brk 0.002500 No error
101.932500  -  2562 1 0x300042edc20 0 syscall_start sysnum: 17 brk
101.941300  -  2562 1 0x300042edc20 0 syscall_end rval1: 0 rval2: 134217728 errno: 0 brk 0.008800 No error
101.943200  -  2562 1 0x300042edc20 0 thread_state state: 4 asleep in user data fault
101.945400  -  2562 1 0x300042edc20 0 address_fault address: 0x28000 fault_type: 0 access: 1
101.947900  -  2562 1 0x300042edc20 0 anon_zero address: 0x28000
101.967700  -  2562 1 0x300042edc20 0 thread_state state: 0 in userland 
101.978500  -  2562 1 0x300042edc20 0 syscall_start sysnum: 98 sigaction
101.980600  -  2562 1 0x300042edc20 0 syscall_end rval1: 0 rval2: 0 errno: 0 sigaction 0.002100 No error
101.985800  -  2562 1 0x300042edc20 0 syscall_start sysnum: 3 read
102.000000  -  2562 1 0x300042edc20 0 syscall_end rval1: 512 rval2: 8193 errno: 0 read 0.014200 No error
102.009800  -  2562 1 0x300042edc20 0 syscall_start sysnum: 4 write
102.073000  -  2562 1 0x300042edc20 0 syscall_end rval1: 512 rval2: 8962 errno: 0 write 0.063200 No error
102.074500  -  2562 1 0x300042edc20 0 syscall_start sysnum: 3 read
102.079700  -  2562 1 0x300042edc20 0 syscall_end rval1: 512 rval2: 8193 errno: 0 read 0.005200 No error
102.082300  -  2562 1 0x300042edc20 0 syscall_start sysnum: 4 write
102.088200  -  2562 1 0x300042edc20 0 syscall_end rval1: 512 rval2: 8962 errno: 0 write 0.005900 No error
102.089200  -  2562 1 0x300042edc20 0 syscall_start sysnum: 3 read
102.093600  -  2562 1 0x300042edc20 0 syscall_end rval1: 512 rval2: 8193 errno: 0 read 0.004400 No error
102.096200  -  2562 1 0x300042edc20 0 syscall_start sysnum: 4 write
102.129600  -  2562 1 0x300042edc20 0 syscall_end rval1: 512 rval2: 8962 errno: 0 write 0.033400 No error
102.130900  -  2562 1 0x300042edc20 0 syscall_start sysnum: 3 read
102.136100  -  2562 1 0x300042edc20 0 syscall_end rval1: 512 rval2: 8193 errno: 0 read 0.005200 No error
102.138600  -  2562 1 0x300042edc20 0 syscall_start sysnum: 4 write
So each of the read() / writes() happen one after each other.....but what about on the so called broken one:
59.433900  -  2635 1 0x300041792c0 3 page_unmap vnode: 0x600062ee440 offset: 157253632
59.436200  -  2635 1 0x300041792c0 3 page_unmap vnode: 0x600062ee440 offset: 157261824
59.438300  -  2635 1 0x300041792c0 3 page_unmap vnode: 0x600062ee440 offset: 157270016
59.440300  -  2635 1 0x300041792c0 3 page_unmap vnode: 0x600062ee440 offset: 157278208
59.442600  -  2635 1 0x300041792c0 3 page_unmap vnode: 0x600062ee440 offset: 157286400
95.378500  -  2635 1 0x300041792c0 3 syscall_end rval1: 4 rval2: 0 errno: 0 creat64 91.600000 No error
95.380600  -  2635 1 0x300041792c0 3 thread_queue tid: 0x300041792c0 cpuid: 3 priority: 0 queue_length: 0x0
95.380800  -  2635 1 0x300041792c0 3 thread_state tid: 0x300041792c0 state: 8 waiting for CPU
95.398100  -  2635 1 0x300041792c0 3 syscall_start sysnum: 137 sysconfig
95.398600  -  2635 1 0x300041792c0 3 syscall_end rval1: 8192 rval2: 18466816 errno: 0 sysconfig 0.000500 No error
95.404300  -  2635 1 0x300041792c0 3 syscall_start sysnum: 17 brk
95.406100  -  2635 1 0x300041792c0 3 syscall_end rval1: 0 rval2: 134217728 errno: 0 brk 0.001800 No error
95.407300  -  2635 1 0x300041792c0 3 syscall_start sysnum: 17 brk
95.415700  -  2635 1 0x300041792c0 3 syscall_end rval1: 0 rval2: 134217728 errno: 0 brk 0.008400 No error
95.417700  -  2635 1 0x300041792c0 3 thread_state state: 4 asleep in user data fault
95.419500  -  2635 1 0x300041792c0 3 address_fault address: 0x28000 fault_type: 0 access: 1
95.421600  -  2635 1 0x300041792c0 3 anon_zero address: 0x28000
95.440600  -  2635 1 0x300041792c0 3 thread_state state: 0 in userland 
95.453200  -  2635 1 0x300041792c0 3 syscall_start sysnum: 98 sigaction
95.454800  -  2635 1 0x300041792c0 3 syscall_end rval1: 0 rval2: 0 errno: 0 sigaction 0.001600 No error
95.460900  -  2635 1 0x300041792c0 3 syscall_start sysnum: 3 read
95.483700  -  2635 1 0x300041792c0 3 strategy device: 118,22  block: 2262752 size: 0x200 buf: 0x60008105278 flags: 0x2200061
95.522200  -  2635 1 0x300041792c0 3 thread_block reason: 0x60008105338 stack: 
95.523200  -  2635 1 0x300041792c0 3 thread_state state: 7 asleep for other reason
95.523400  -  2635 1 0x300041792c0 3 thread_state tid: 0x2a10001fcc0 state: 1 in syscall/pagefault
95.939700  -  2635 1 0x300041792c0 17 syscall_end rval1: 512 rval2: 8193 errno: 0 read 0.478800 No error
95.972400  -  2635 1 0x300041792c0 17 syscall_start sysnum: 4 write
96.070000  -  2635 1 0x300041792c0 17 syscall_end rval1: 512 rval2: 8962 errno: 0 write 0.097600 No error
96.072000  -  2635 1 0x300041792c0 17 syscall_start sysnum: 3 read
96.078800  -  2635 1 0x300041792c0 17 strategy device: 118,22  block: 2262753 size: 0x200 buf: 0x60008105278 flags: 0x2200061
96.105100  -  2635 1 0x300041792c0 17 thread_block reason: 0x60008105338 stack: 
96.106100  -  2635 1 0x300041792c0 17 thread_state state: 7 asleep for other reason
96.106300  -  2635 1 0x300041792c0 17 thread_state tid: 0x2a1005d1cc0 state: 1 in syscall/pagefault

Interesting we appear to be blocking for a reason....ahhh following from a quick chat with John Alderson and reading his blog entry this now becomes clear.

Basically directio is still persistant on the file after Oracle is shutdown. DIRECTIO_ON allows you to force directio per-file and is used potentially by databases if their app tunable is so set (which in our case it is). The flag is set in the inode's i_flag and hangs around cached within the inode in the DNLC and will effect files until the inode is flushed from the DNLC.

About

I'm an RPE (Revenue Product Engineering) Engineer supporting Solaris on Exadata, Exalogic and Super Cluster. I attempt to diagnose and resolve any problems with Solaris on any of the Engineered Systems.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today