Sunday Nov 14, 2010

This blog has moved

This blog has moved to http://work.lewiz.org/, please update your bookmarks and RSS/Atom feeds.

Friday Jun 04, 2010

Obtaining AMD64 function arguments

Debugging AMD64 crash dumps is slightly trickier than debugging SPARC ones because AMD64 lacks register windows. To determine what a value was when it was initially passed into a function we cannot look at a register in the previous register window; we must use the stack instead.

This topic comes up frequently, and since I don't get enough practise at it (and therefore tend to forget) it's worthy of a blog entry. If you're really interested, all of this (and a lot more) is covered in Frank Hofmann's excellent book The Solaris Operating System on x86 Platforms: Crashdump Analysis, Operating System Internals, which I can't recommend highly enough.

> ffffffff9bc9cc60::findstack -v
stack pointer for thread ffffffff9bc9cc60: fffffe800145d890
[ fffffe800145d890 _resume_from_idle+0xf8() ]
fffffe800145d8c0 swtch+0x12a()
fffffe800145d8e0 cv_wait+0x68()
fffffe800145d910 pr_p_lock+0x79()
fffffe800145d960 pr_lookup_piddir+0x7e()
fffffe800145d9c0 prlookup+0xd4()
fffffe800145da10 fop_lookup+0x35()
fffffe800145dbe0 lookuppnvp+0x1bf()
fffffe800145dc50 lookuppnat+0xf9()
fffffe800145dd10 lookupnameat+0x86()
fffffe800145de40 vn_openat+0x2aa()
fffffe800145def0 copen+0x1e5()
fffffe800145df00 open+0x19()
fffffe800145df10 sys_syscall+0x17b()

In the above stack we are interested in finding the first argument to pr_lookup_piddir(), which is a vnode_t pointer. We know that prlookup() makes a call to pr_lookup_piddir(); therefore it must pass one of its registers to the input of pr_lookup_piddir(). A callee expects to find its input arg0 in register %rdi (this is part of the AMD64 ABI; more details are discussed in Frank's book and in the Solaris 64-bit Developer's Guide: AMD64 ABI Features). Therefore, by disassembling the calling function we can check where %rdi comes from:

> prlookup+0xd4::dis
prlookup+0xae: orl %edx,%eax
prlookup+0xb0: testb $0x1,%al
prlookup+0xb2: jne +0xbf
prlookup+0xb8: cmpl $0x24,%r12d
prlookup+0xbc: je +0xb5
prlookup+0xc2: movl %r12d,%edx
prlookup+0xc5: xorl %eax,%eax
prlookup+0xc7: movq %r14,%rsi
prlookup+0xca: movq %rbx,%rdi
prlookup+0xcd: call \*0xfffffffffbd0e460(,%rdx,8)
prlookup+0xd4: cmpq $0x1,%rax

At prlookup+0xca (just prior to calling pr_lookup_piddir) we see that the contents of register %rbx are moved to the callee's input register, %rdi. We now know that at the time we enter pr_lookup_piddir() both %rdi and %rbx contain the same value (a vnode_t pointer). If pr_lookup_piddir() is to use %rbx for scratch it must save the value so it can subsequently restore it when it returns control to prlookup().

We can disassemble pr_lookup_piddir() to get an idea of what it's doing (truncated for this example):

> pr_lookup_piddir::dis     
pr_lookup_piddir: pushq %rbp
pr_lookup_piddir+1: movq %rsp,%rbp
pr_lookup_piddir+4: pushq %r15
pr_lookup_piddir+6: movq %rdi,%r15
pr_lookup_piddir+9: pushq %r14
pr_lookup_piddir+0xb: xorl %r14d,%r14d
pr_lookup_piddir+0xe: pushq %r13
pr_lookup_piddir+0x10: movq %rsi,%r13
pr_lookup_piddir+0x13: pushq %r12
pr_lookup_piddir+0x15: pushq %rbx

Above we are saving the caller's frame pointer (pushq %rbp) and setting our frame pointer (movq %rsp,%rbp) before we begin to push the registers that we wish to reuse onto the stack (the remaining pushq instructions).

Of particular interest is pr_lookup_piddir+0x15 where we push %rbx onto the stack. From the top of the function this is the sixth pushq instruction and therefore the sixth register that we have stored to the stack. We can use this knowledge to find the vnode_t pointer we passed into pr_lookup_piddir().

Looking back at the ::findstack output we can see the function names on the right and the frame pointers on the left. pr_lookup_piddir() is the function that is pushing to the stack, so we'll start with pr_p_lock()'s fp (fffffe800145d910) and print down the stack, including pr_lookup_piddir()'s fp (fffffe800145d960):

> fffffe800145d910,10/naP             
0xfffffe800145d910:
0xfffffe800145d910: 0xfffffe800145d960
0xfffffe800145d918: pr_lookup_piddir+0x7e
0xfffffe800145d920: 0xffffffff80037008
0xfffffe800145d928: 0xc
0xfffffe800145d930: 0xffffffffb0939700
0xfffffe800145d938: 0xffffffffb297e240 <-- (2)
0xfffffe800145d940: 2
0xfffffe800145d948: 0xffffffffb0939700
0xfffffe800145d950: 0xfffffe800145dab0
0xfffffe800145d958: 0xfffffe800145da88
0xfffffe800145d960: 0xfffffe800145d9c0 <-- (1)
0xfffffe800145d968: prlookup+0xd4
0xfffffe800145d970: 0xfffffe800145d990
0xfffffe800145d978: 0x19c691b80
0xfffffe800145d980: 0xffffffff816b6440
0xfffffe800145d988: 5

At 0xfffffe800145d960 we have prlookup()'s fp (1); this was the first register that we pushed to the stack. Counting five values up the stack we get to 0xfffffe800145d938 (2), which is the sixth value pushed to pr_lookup_piddir()'s stack. This value, 0xffffffffb297e240, is the value of prlookup()'s %rbx register when pr_lookup_piddir() was called. As we've shown above, this is also the register we sourced %rdi from and is therefore a vnode_t pointer:

> 0xffffffffb297e240::print vnode_t v_path
v_path = 0xffffffff91155060 "/proc/21391"

> 0t21391::pid2proc|::ps -f
S PID PPID PGID SID UID FLAGS ADDR NAME
R 21391 21379 21350 21265 41311 0x4a004000 ffffffff836021a0
/app/common/java/jdk1.5.0_14/bin/amd64/java -server -Xms1g -Xmx1g -Duser.langua

Just as expected! Furthermore, since we were waiting for a CV we were able to determine from the vnode what path we were waiting on and, since this was in /proc, we could even look up the process.
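
The slot arithmetic used above is easy to get wrong by one, so here is a minimal sketch of it in Python (not mdb). It assumes the callee's prologue is the usual pushq %rbp; movq %rsp,%rbp followed only by further pushq instructions up to the register of interest; saved_slot() is a helper name I made up for illustration.

```python
# Sketch: locate a callee-saved register on the AMD64 stack.
# Assumes "pushq %rbp; movq %rsp,%rbp" followed only by pushq instructions
# until the register we care about has been saved.

WORD = 8  # bytes per stack slot on AMD64


def saved_slot(frame_pointer: int, push_index: int) -> int:
    """Return the address holding the value of the Nth pushq in the prologue.

    push_index is 1-based: 1 is the pushq of %rbp itself, whose slot is the
    frame pointer. Each later push lands one word lower in memory.
    """
    if push_index < 1:
        raise ValueError("push_index is 1-based")
    return frame_pointer - WORD * (push_index - 1)


# pr_lookup_piddir()'s frame pointer, from the ::findstack output.
PR_LOOKUP_PIDDIR_FP = 0xFFFFFE800145D960

# %rbx is the sixth pushq in pr_lookup_piddir()'s prologue.
rbx_slot = saved_slot(PR_LOOKUP_PIDDIR_FP, 6)
print(hex(rbx_slot))
```

The printed address is the one to hand to /J (or to look for in the ,10/naP dump) to read the saved %rbx.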

Monday May 10, 2010

Failfast for Solaris IO

Introduction

Recently I have been trying to better understand Solaris IO, specifically what goes on once a process enters the biowait(\*buf) function. As part of this investigation I found a need to learn about failfast for Solaris IO which I will discuss in this blog post. Failfast was first presented as PSARC/2002/126: Buf Flag for Faster Failover. If you do not already have a general knowledge of Solaris IO internals I strongly recommend reading General Flow of Control from the Writing Device Drivers book available for free on docs.sun.com.


At a later date I would like to expand this article with more discussion of the interaction between ZFS & SVM and the failfast flag, along with a more general Solaris IO entry.


Functionality

Data buffers (buffer, buf or bp from now on) passed into the (s)sd driver ("the driver") are encapsulated within a scsi_pkt structure (packet or pkt) in sd_initpkt_for_buf(\*buf, \*\*scsi_pkt) before they are passed to the transport (glm, qlc, etc.) via scsi_transport(\*scsi_pkt).


sd_initpkt_for_buf() sets the scsi_pkt's command completion routine, pkt_comp, to sdintr(\*scsi_pkt). When the transport has finished processing a packet (e.g. due to a completion, timeout or error) it sets pkt_reason as required and then calls pkt_comp to pass the packet back to the driver. Commands time out if no response is received from the target within pkt_time, set to SD_IO_TIME (60s) by sd_initpkt_for_buf(). In case of timeout the driver will attempt to retry the packet up to SD_RETRY_COUNT times (3 for fibre channel, otherwise 5). This means that without failfast it can take up to 5 minutes for, e.g., a read to return an error in the case of a non-responsive disk.
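
As a quick sanity check of that figure, here is the arithmetic as a small Python sketch. It assumes every attempt runs to its full timeout and counts attempts the same way as the paragraph above (timeout multiplied by the retry count); worst_case_seconds() is just an illustrative name.

```python
SD_IO_TIME = 60  # seconds; pkt_time as set by sd_initpkt_for_buf()

# Retry counts quoted above for SD_RETRY_COUNT.
SD_RETRY_COUNT = {"fibre channel": 3, "other": 5}


def worst_case_seconds(retries: int, io_time: int = SD_IO_TIME) -> int:
    """Time before an error surfaces if every attempt runs to its timeout."""
    return retries * io_time


for transport, retries in SD_RETRY_COUNT.items():
    secs = worst_case_seconds(retries)
    print(f"{transport}: {secs}s ({secs // 60} min)")
```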


Failfast is a process that takes place within the driver to more expediently fail a pending buf and inform the upper layer Volume Manager (ZFS, SVM, VxVM). Co-operation is required from the VM, which must set B_FAILFAST in the buf b_flags mask to enable the behaviour (the VM can check the driver's ddi-failfast-supported property to know whether B_FAILFAST can be used).


Most VMs tend to round-robin read IO when multiple copies of the data exist. In the case of a mirror where one disk has gone away we ultimately expect all of our read IOs to be serviced by the working disk. In order for this to happen it is necessary for the driver to return a failure code (EIO) to the VM so that it can retry with the working disk. When B_FAILFAST is set we can return EIO faster thereby reducing the overall average IO time. The B_FAILFAST flag was initially proposed as B_ALTDATASRC as this accurately describes the conditions that need to be true for us to want to use failfast behaviour.


Implementation

Within sd each physical target LUN is represented as an sd instance (sd_lun or un), each of which tracks its internal failfast state in un_failfast_state and un_failfast_bp. The instance may be in one of three states: SD_FAILFAST_INACTIVE, failfast pending (an inferred state where un_failfast_bp != NULL) and SD_FAILFAST_ACTIVE.


When any packet (i.e. regardless of B_FAILFAST) returned to sd via pkt_comp qualifies for a retry due to a timeout condition specified in pkt_reason (these are CMD_TIMEOUT, and CMD_INCOMPLETE where the incomplete reason is a selection timeout), we call into sd_retry_command(\*sd_lun, \*buf, int retry_check_flag, ...) with the buf and with SD_RETRIES_FAILFAST set in retry_check_flag. sd_retry_command() and sd_return_command(\*sd_lun, \*buf) change the instance failfast state. Every instance begins in SD_FAILFAST_INACTIVE.


Transition to failfast pending: The first buf to enter sd_retry_command() with SD_RETRIES_FAILFAST set will take the sd instance into the failfast pending state by registering itself as the un_failfast_bp. The buf is then retried normally. Subsequent SD_RETRIES_FAILFAST bufs will be retried without changing any failfast state.


Transition to SD_FAILFAST_ACTIVE: When the un_failfast_bp buf returns to sd_retry_command() it transitions the instance to SD_FAILFAST_ACTIVE by setting un_failfast_state and clearing un_failfast_bp. sd_failfast_flushq(\*sd_lun) is called, which arranges for all B_FAILFAST bufs on the wait queue to be returned to the caller with a suitable error set (this is done via a thread). This buf is also returned with an error set if it has B_FAILFAST set, otherwise it is retried.


Transition to SD_FAILFAST_INACTIVE: Any buf that either completes successfully (via sd_return_command()) or requires a retry for any reason other than those that take us into failfast pending will transition us into SD_FAILFAST_INACTIVE by updating un_failfast_state and clearing un_failfast_bp. It should now be clear from above that only B_FAILFAST bufs are affected by the failfast state which means any subsequent buf without B_FAILFAST (or indeed any buf currently in the transport) can allow the transition back to SD_FAILFAST_INACTIVE.


Any buf passed into a SD_FAILFAST_ACTIVE sd instance with B_FAILFAST set is immediately failed in sd_core_iostart(int index, \*sd_lun, \*buf).
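
To make the transitions easier to follow, here is a toy model of them in Python. It is only a sketch of the behaviour described above, not the sd driver code: the class and method names are invented, the wait queue is a plain list, and the handling of a timeout retry while already in SD_FAILFAST_ACTIVE is my assumption because the text above does not spell it out.

```python
class Buf:
    def __init__(self, name, failfast=False):
        self.name = name
        self.failfast = failfast  # stands in for B_FAILFAST in b_flags
        self.error = None


class SdInstance:
    def __init__(self):
        self.active = False      # stands in for un_failfast_state
        self.failfast_bp = None  # stands in for un_failfast_bp
        self.waitq = []          # bufs queued, not yet in the transport

    @property
    def state(self):
        if self.active:
            return "SD_FAILFAST_ACTIVE"
        if self.failfast_bp is not None:
            return "failfast pending"
        return "SD_FAILFAST_INACTIVE"

    def iostart(self, bp):
        """New buf arrives; B_FAILFAST bufs fail at once while active."""
        if self.active and bp.failfast:
            bp.error = "EIO"
            return False
        self.waitq.append(bp)
        return True

    def timeout_retry(self, bp):
        """A buf came back with a timeout reason. True means it is retried."""
        if not self.active:
            if self.failfast_bp is None:
                self.failfast_bp = bp  # INACTIVE -> pending
                return True
            if self.failfast_bp is not bp:
                return True            # some other buf: plain retry
            self.active = True         # pending -> ACTIVE
            self.failfast_bp = None
            self._flush_waitq()
        # Active, either just now or already (the latter is an assumption).
        if bp.failfast:
            bp.error = "EIO"
            return False
        return True

    def completed_or_other_retry(self):
        """Success, or a retry for a non-timeout reason: back to INACTIVE."""
        self.active = False
        self.failfast_bp = None

    def _flush_waitq(self):
        """Fail every queued B_FAILFAST buf, keep the rest queued."""
        keep = []
        for bp in self.waitq:
            if bp.failfast:
                bp.error = "EIO"
            else:
                keep.append(bp)
        self.waitq = keep


demo = SdInstance()
first = Buf("first", failfast=True)
demo.timeout_retry(first)
print(demo.state)
demo.timeout_retry(first)
print(demo.state, first.error)
demo.completed_or_other_retry()
print(demo.state)
```

The point the model makes is the same one as above: only B_FAILFAST bufs are ever failed early, and any good completion is enough to drop the instance back to SD_FAILFAST_INACTIVE.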


Wednesday Jan 13, 2010

mdb: calculating thread idle time

In SolarisCAT the thread output shows the thread idle time. In mdb I've not found a dcmd that provides this same information (although if you are using the excellent ACT module the ::act_thread dcmd does solve this problem).

Fortunately the answer is very simple:

> 2a100499ca0::findstack -v
stack pointer for thread 2a100499ca0: 2a100498e41
[ 000002a100498e41 sema_p+0x138() ]
000002a100498ef1 biowait+0x6c(600d3d31240, 0, 18ba800, 30008670000, 2080201, 600d3d31240)
000002a100498fa1 bwrite_common+0x1ac(0, 600d3d31240, 1, 0, 0, 1)
000002a100499051 ldl_savestate+0x88(600d2c5da40, 14e5757ef, a72ba5c0, 600de7f8080, 600de7f8280, 0)
000002a100499101 logmap_sethead+0x78(600d2c5db60, 600d2c5da40, f555, 6ed48, 600ccb7f6b0, 600ccb7f5c0)
000002a1004991b1 trans_roll+0x354(600d2c5da40, 10, 2000, 10, 600df0f3180, 600ccb7f6ba)
000002a100499291 thread_start+4(600d2c5da40, 0, 6269665f7365, 745f62696c6c5f68, 6f6c645f636f6465, 2e70726f63000000)

> 2a100499ca0::print -t kthread_t t_disp_time
clock_t t_disp_time = 0x9a22a2
> \*panic_lbolt64-0x9a22a2=D
65537



panic_lbolt64 is set to lbolt64 when panic() is called. lbolt64 is incremented each time the clock thread runs: 100 times/second by default, or 1000 times/second if hires_tick is set in /etc/system (see usr/src/uts/common/conf/param.c for details). On a default system the difference printed above is therefore in ticks of 1/100 of a second. On a live system panic_lbolt64 can be replaced by lbolt64.

Converting ticks to minutes and seconds is left as a (really simple) exercise for the reader.
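
For the impatient, a minimal Python sketch of that conversion follows. It assumes the default 100 ticks per second unless told otherwise, and ticks_to_min_sec() is just a name I picked.

```python
def ticks_to_min_sec(ticks: int, hz: int = 100) -> tuple[int, int]:
    """Convert clock ticks to whole (minutes, seconds), dropping the remainder.

    hz is the clock rate: 100 by default, 1000 if hires_tick is set.
    """
    seconds = ticks // hz
    return divmod(seconds, 60)


idle_ticks = 65537  # *panic_lbolt64 - t_disp_time from the example above

minutes, seconds = ticks_to_min_sec(idle_ticks)
print(f"default clock: {minutes}m {seconds}s")

minutes, seconds = ticks_to_min_sec(idle_ticks, hz=1000)
print(f"hires_tick:    {minutes}m {seconds}s")
```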

Monday Jan 11, 2010

mdb: biowait(buf_t \*bp) to (s)sd softstate

How to get the sd_lun structure from a buf (e.g. in biowait()).

> 0x2a1002efca0::findstack -v
stack pointer for thread 2a1002efca0: 2a1002ee1a1
[ 000002a1002ee1a1 sema_p+0x138() ]
000002a1002ee251 biowait+0x6c(46bb44a9d00, 0, 18bac00, 30024b12000, 1a, 46bb44a9d00)
000002a1002ee301 default_physio+0x388(12ebf74, 24, 0, 46bb44a9d40, 12ddc10, 46bb44a9d38)
000002a1002ee431 scsi_uscsi_handle_cmd+0x1b8(2000000010, 1, 338a8de2c50, 12ebf74, 46bb44a9d00, 3005a3d1d70)
000002a1002ee521 sd_send_scsi_cmd+0x114(2000000010, 1970800, 3005a3d1d70, 1, 3000111ecc0, 2a1002eeeb0)
000002a1002ee5e1 sd_send_scsi_MODE_SENSE+0x110(3000240be40, 6, 3389353b680, 24, 4, 1)
000002a1002ee701 sd_get_physical_geometry+0x9c(3389353b680, 2a1002ef06c, 43d5bd5, 200, 1, 3000111ecc0)
000002a1002ee7b1 sd_resync_geom_caches+0xb4(3000111ecc0, 43d5bd5, 200, 1, 3ec1, ff)
000002a1002ee881 sd_validate_geometry+0xb4(3000111ecc0, 1, 60, 1, 7, fa000050)
000002a1002ee941 sd_ready_and_valid+0x2d4(3000111ecc0, 2a1002efca0, 0, 3000240be40, 3000240be40, c1)
000002a1002eea51 sdopen+0x248(1, 3000111ecc0, 0, 1978108, 3000111eda0, 0)
000002a1002eeb01 spec_open+0x4f8(2a1002ef528, 224, 3000410be48, a21, 430043ae440, 0)
000002a1002eebc1 fop_open+0x78(2a1002ef528, 2, 3000410be48, 40000003, 301ac8b1ec0, 301ac8b1ec0)
000002a1002eec71 dev_lopen+0x34(2a1002ef5e0, 3, 4, 3000410be48, ffffffff, ffffffffffffffff)
000002a1002eed31 md_layered_open+0x120(13, 2a1002ef6c8, 3, 30003e9d580, 2000000010, 3000410be48)
000002a1002eedf1 stripe_open_all_devs+0x188(58, 3, 0, 0, 0, dc)
000002a1002eeed1 stripe_open+0xa0(dc, 3, 4, 3000113a628, 30003e88b70, 3)
000002a1002eef81 md_layered_open+0xb8(0, 2a1002ef908, 3, 3000113a628, dc, 3000410be48)
000002a1002ef041 mirror_probe_dev+0x98(3000113a078, 19be608, 0, 1, 30003e8b2b0, 0)
000002a1002ef111 md_probe_one+0x84(49976e5ee40, 3000113a078, 0, 68c965b92c0, 14, 7ba0730c)
000002a1002ef1c1 md_daemon+0x21c(0, 19bf478, 33864030100, 19bf478, 2a1002efa88, 19bf4a0)
000002a1002ef291 thread_start+4(19bf478, 0, 0, 0, 0, 0)

The first argument to biowait() is a pointer to a buf_t structure.

> 46bb44a9d00::print -t buf_t
{
int b_flags = 0x200067
struct buf \*b_forw = 0
struct buf \*b_back = 0
struct buf \*av_forw = 0
struct buf \*av_back = 0
o_dev_t b_dev = 0
size_t b_bcount = 0x24
union b_un = {
caddr_t b_addr = 0x3389353b680
struct fs \*b_fs = 0x3389353b680
struct cg \*b_cg = 0x3389353b680
struct dinode \*b_dino = 0x3389353b680
daddr32_t \*b_daddr = 0x3389353b680
}
lldaddr_t _b_blkno = {
longlong_t _f = 0
struct _p = {
int32_t _u = 0
int32_t _l = 0
}
}
char b_obs1 = '\\0'
size_t b_resid = 0x24
clock_t b_start = 0
struct proc \*b_proc = 0
struct page \*b_pages = 0
clock_t b_obs2 = 0
size_t b_bufsize = 0
int (\*)() b_iodone = 0
struct vnode \*b_vp = 0
struct buf \*b_chain = 0
int b_obs3 = 0
int b_error = 0x5
void \*b_private = 0x3005a3d1d70
dev_t b_edev = 0x2000000010
ksema_t b_sem = {
void \* [2] _opaque = [ 0, 0 ]
}
ksema_t b_io = {
void \* [2] _opaque = [ 0x2a1002efca0, 0 ]
}
struct buf \*b_list = 0
struct page \*\*b_shadow = 0x338a79b81c0
void \*b_dip = 0x30003961b90
struct vnode \*b_file = 0
offset_t b_offset = 0xffffffffffffffff
}

We are interested in getting the sd_lun so we'll take the b_edev which is a dev_t. The DDI getminor(dev_t dev) and getmajor(dev_t dev) functions allow us to extract the major and minor numbers from a dev_t.

So to get the major number we shift right by NBITSMINOR64 (32) on 64-bit or NBITSMINOR (18) on 32-bit. We then AND with MAXMAJ64 (0xffffffff) on 64-bit or MAXMAJ (0x3fff) on 32-bit:

> (0x2000000010>>0t32)&0xffffffff=D
32

And for the minor number we AND with MAXMIN64 (0xffffffff) on 64-bit or MAXMIN (0x3ffff) on 32-bit:

> 0x2000000010&0xffffffff=D
16

Alternatively if your genunix module provides the ::devt dcmd, this can be used:

> 0x2000000010::devt
MAJOR MINOR
32 16

The ::major2name dcmd converts the major number to a name. Alternatively we could check in /etc/name2major from an explorer or on the host itself.

> 0t32::major2name
sd

In this case the device is sd. If it had returned ssd all of the following commands that mention sd should be replaced with ssd.

Converting the minor number to an sd instance is slightly more tricky. The driver's DDI getinfo(9E) function is called; in the case of sd this is sdinfo(). This in turn uses the SDUNIT(dev_t dev) macro:

#define SDUNIT(dev) (getminor((dev)) >> SDUNIT_SHIFT)

So we need to shift the minor number right by SDUNIT_SHIFT (3 on my system):

> 0t16>>0t3=D
2

We now know that this thread is waiting on a buffer which is being serviced by sd2.
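
The three steps so far (major, minor, instance) can be put together in a few lines of Python. This is only a sketch of the 64-bit case using the constants quoted above; SDUNIT_SHIFT is the value from my system and may differ on yours, and the function names simply mirror the DDI ones.

```python
# 64-bit dev_t layout, constants as quoted above.
NBITSMINOR64 = 32
MAXMAJ64 = 0xFFFFFFFF
MAXMIN64 = 0xFFFFFFFF

SDUNIT_SHIFT = 3  # value on the author's system; check your own sd headers


def getmajor(dev: int) -> int:
    return (dev >> NBITSMINOR64) & MAXMAJ64


def getminor(dev: int) -> int:
    return dev & MAXMIN64


def sdunit(dev: int) -> int:
    """Mirror of the SDUNIT() macro: minor number to sd instance."""
    return getminor(dev) >> SDUNIT_SHIFT


b_edev = 0x2000000010  # from the buf_t dump above
print(getmajor(b_edev), getminor(b_edev), sdunit(b_edev))
```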

The next stage is to get this sd instance's sd_lun structure. These are held in an array pointed to by sd's DDI softstate pointer, sd_state (or ssd_state for ssd). For more information see ddi_soft_state(9F) in the Solaris 10 man page collection.

> \*sd_state::print -t struct i_ddi_soft_state
{
void \*\*array = 0x3000111a640
kmutex_t lock = {
void \* [1] _opaque = [ 0 ]
}
size_t size = 0x558
size_t n_items = 0x40
struct i_ddi_soft_state \*next = 0x30003b7f900
}

We're after sd instance 2 so that will be entry 3 in the array (remember before sd2 are sd0 and sd1). There are a number of different ways to do this so I'll cover a few.

Getting the softstate #1, without any helper dcmds:

The array is a list of pointers, so we'll need to know the size of a uintptr_t. We then multiply this by the sd_state instance that we want and add it to the array, e.g.:

> ::sizeof uintptr_t
sizeof (uintptr_t) = 8
> 0x3000111a640+8\*2/J
0x3000111a650: 3000111ecc0 <-- sd2

Getting the softstate #2, walking the array up to the state we want:

Here we tell mdb to print out the address (a) and the contents (P) of the first 3 (,3) values from array:

> 0x3000111a640,3/naP
0x3000111a640:
0x3000111a640: 0x3000111e680 <-- sd0
0x3000111a648: 0x30003ea80c0 <-- sd1
0x3000111a650: 0x3000111ecc0 <-- sd2
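
Both of the methods above boil down to the same pointer arithmetic, sketched here in Python. The dictionary stands in for kernel memory and holds only the three slots printed above; softstate_slot() and softstate() are names made up for this example.

```python
PTR_SIZE = 8  # ::sizeof uintptr_t on this 64-bit kernel

ARRAY_BASE = 0x3000111A640  # i_ddi_soft_state.array

# Stand-in for kernel memory: the three slots dumped above.
memory = {
    0x3000111A640: 0x3000111E680,  # sd0
    0x3000111A648: 0x30003EA80C0,  # sd1
    0x3000111A650: 0x3000111ECC0,  # sd2
}


def softstate_slot(array_base: int, instance: int) -> int:
    """Address of the array slot for a given instance number."""
    return array_base + PTR_SIZE * instance


def softstate(array_base: int, instance: int) -> int:
    """The sd_lun pointer stored in that slot."""
    return memory[softstate_slot(array_base, instance)]


print(hex(softstate_slot(ARRAY_BASE, 2)))
print(hex(softstate(ARRAY_BASE, 2)))
```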

Getting the softstate #3, using the ::array dcmd (probably the worst way but the one I somehow always try and use):

We get the first three (0t3) elements from the start of the array. We specify that each element of the array is a uintptr_t. This might make more sense if the array was not an array of pointers but of a real structure.

> 0x3000111a640::array uintptr_t 0t3
3000111a640
3000111a648
3000111a650
> 3000111a650/J
0x3000111a650: 3000111ecc0 <-- sd2

Getting the softstate #4, with the helpful ::softstate dcmd (definitely the best way):

> \*sd_state::softstate 0t2
3000111ecc0

Getting all softstates using the softstate walker:

> \*sd_state::walk softstate
3000111e680
30003ea80c0
3000111ecc0
3000111e040
30003e9d900
3001d4ab980

We're now done. We have the sd_lun pointer for sd2 and we can do whatever we want with it.

Below are a few helpful things that can be dumped. This is in no way exhaustive.

> \*sd_state::softstate 0t2|::print -t struct sd_lun
{
struct scsi_device \*un_sd = 0x3000240be40
struct buf \*un_rqs_bp = 0x300010a1340
struct scsi_pkt \*un_rqs_pktp = 0x300039a7e90
int un_sense_isbusy = 0
int un_buf_chain_type = 0x1
int un_uscsi_chain_type = 0x8
int un_direct_chain_type = 0x8
int un_priority_chain_type = 0x9
struct buf \*un_waitq_headp = 0
struct buf \*un_waitq_tailp = 0
struct buf \*un_retry_bp = 0
int (\*)() un_retry_statp = 0
void \*un_xbuf_attr = 0x30003ba0200
uint32_t un_sys_blocksize = 0x200
uint32_t un_tgt_blocksize = 0x200
uint64_t un_blockcount = 0x43d5bd5
uchar_t un_ctype = 0x2
char \*un_node_type = 0x1972e48 "ddi_block:channel"
[...]

> 0x3000240be40::print -t struct scsi_device
{
struct scsi_address sd_address = {
struct scsi_hba_tran \*a_hba_tran = 0x300024290c0
ushort_t a_target = 0x2
uchar_t a_lun = 0
uchar_t a_sublun = 0
}
dev_info_t \*sd_dev = 0x30003961b90
kmutex_t sd_mutex = {
void \* [1] _opaque = [ 0 ]
}
void \*sd_reserved = 0x300024290c0
struct scsi_inquiry \*sd_inq = 0x3000393d788
struct scsi_extended_sense \*sd_sense = 0
caddr_t sd_private = 0x3000111ecc0
}

> 0x30003961b90::devinfo
30003961b90 sd, instance #2
System properties at 30003b59810:
name='lun' type=int items=1
value=00000000
name='target' type=int items=1
value=00000002
name='class_prop' type=string items=1
value='atapi'
name='class' type=string items=1
value='scsi'
Driver properties at 30003b594f0:
name='ddi-no-autodetach' type=int items=1
value=00000001
name='inquiry-serial-no' type=string items=1
value='00N0A2TH '
name='pm-components' type=string items=3
value='NAME=spindle-motor' + '0=off' + '1=on'
name='pm-hardware-state' type=string items=1
value='needs-suspend-resume'
name='ddi-failfast-supported' type=any items=0
name='ddi-kernel-ioctl' type=any items=0
name='device-nblocks' type=int64 items=1
value=00000000043d671f
Hardware properties at 30003b59518:
name='devid' type=string items=1
value='id1,sd@SFUJITSU_MAP3367N_SUN36G_00N0A2TH____'
name='inquiry-revision-id' type=string items=1
value='0401'
name='inquiry-product-id' type=string items=1
value='MAP3367N SUN36G'
name='inquiry-vendor-id' type=string items=1
value='FUJITSU'
name='inquiry-device-type' type=int items=1
value=00000000

I hope this was helpful. I'm going to try and put up a few more posts in a similar style. I'm happy to take requests but can't guarantee results!

Tuesday Jan 05, 2010

Twitter and RSS feeds for Solaris ARC cases

Within Sun the Architectural Review Committee (ARC) is involved in certain changes to software (including Solaris) and firmware.  I don't want to get into too much detail on the exact role of the various ARCs but those interested can visit the OpenSolaris ARC Community pages.


You can keep up-to-date with new cases as they are logged by following ARCbot either via Twitter or the RSS feed.  Data is fetched from a handy csv file hosted on opensolaris.org.


Without further ado, the two feeds:



Any comments/suggestions, leave a comment, email me, or send a tweet to @lewiz

Saturday Nov 14, 2009

64-bit Windows VPN to SWAN

This is a short entry for those other Sun employees who have 64-bit Windows machines at home who wish to be able to connect to SWAN via the VPN without resorting to running OpenSolaris or Linux inside a VM.


Cisco do not provide a 64-bit compatible VPN client for free.  However, a third party, Shrew Soft, do.  Shrew Access Manager can be downloaded from http://www.shrew.net/download/vpn - I can confirm that development build 2.1.5-rc4 definitely connects to SWAN on Windows 7 release.


The steps to configure this are very basic:



  1. Download and install Shrew Access Manager (http://www.shrew.net/download/vpn)

  2. Visit the Sun download library (https://downloadlibrary.central.sun.com/) and download vpn_client_profiles.tar.gz.  This file can be found by following the download trail for VPN 3000 and selecting Linux as the client

  3. Extract the relevant pcf file for your GEO (in my case this is EMEA_UK.pcf)

  4. From Shrew Access Manager select File->Import to import the relevant pcf file

  5. Connect and enter your Sun VPN userid and provide a token generated with button 8 on the token card


It's also worth pointing out that I also have Punchin configured, so ITOps have a public certificate for me on file.


Good luck!

Thursday Aug 13, 2009

Solaris RPE Kernel rotation

This month I am on a rotation into the Solaris RPE (Revenue Product Engineering) Kernel team where I have picked a bug in the Solaris kernel which I am attempting to diagnose and fix.  Thanks to Mita Solanky, Chris Beal, Bill Watson & Rob Harris for helping me arrange this.


Normally I work as a part of the TSC (Technical Solutions Centre) Kernel team as an engineer who diagnoses kernel-related issues.  This could be system panics (i.e. crash dump analysis), performance problems, errors, queries from customers, etc.  The RPE organisation is the direct escalation path for us engineers in the TSC, and they are required to understand and provide fixes for bugs logged by the TSC.


By spending time in RPE I am getting to know better how this organisation functions (e.g. process), the exact role the engineers play, what external pressures they have and more about how code changes go back into the Solaris product.  RPE have some communication with NRE (New Revenue Engineering) which is interesting for me as I do not usually encounter NRE engineers as part of my day-to-day job.


Once the rotation has finished I'll blog again about my experience here as compared to the TSC role I normally play.  For now this serves as a brief introduction to my next blog entry where I will discuss the bug I am working on.

Wednesday Apr 22, 2009

Anonymous FTP in a Solaris 8 Branded Zone

The Problem

Recently one of my customers migrated a legacy SPARC system to a Solaris 8 branded zone on a new T5220 platform. This system provided an anonymous FTP service which no longer worked correctly within the branded Zone. More specifically, files could be uploaded & downloaded but directory listings always returned empty. Non-anonymous users reported no problems with any of the functionality

The Investigation

Initial investigation began by trussing in.ftpd within the Solaris 8 zone to see what was happening. Very quickly it was possible to see that in.ftpd forks and execs /bin/ls to generate the directory listing and this was failing. It's worth pointing out that on Solaris 8 anonymous FTP requires a chroot() environment (as per in.ftpd(1M)) and this was configured correctly

19710/1: execve("/bin/ls", 0x000353F0, 0xFFBFFE0C) Err#2 ENOENT

Very odd, as ENOENT indicates 'no such file or directory' and /bin/ls definitely exists. The next step was to look more closely at the execve() syscall... I knocked up a very quick fbt DTrace script that would do this for me before realising that this was Solaris 8! Fortunately we can DTrace from the global zone, so after a few changes I came up with the following to trace the Solaris 8 exec calls:

#pragma D option flowindent

fbt::s8_elfexec:entry
/ execname == "in.ftpd" /
{ printf("s8_exec: execname: %s", execname); self->follow = 1; tracing = 1; }

fbt:::
/ self->follow /
{}

fbt:::return
/ self->follow /
{ trace(arg1); }

fbt::s8_elfexec:return
/ self->follow /
{ trace(arg1); self->follow = 0; exit(0); }

Running this from the global zone at the same time as issuing an LS in the anonymous FTP client, I could see that ENOENT was being returned from a call to lookupnameat(). The DTrace script was updated with the following to print out the full filename of the file that was being looked up:

fbt::lookupnameat:entry
/ self->follow /
{ printf("%s", stringof(args[0])); }

Another LS from the client and the problem becomes clear -- we are trying to load /.SUNWnative/usr/lib/s8_brand.so.1. Until now I was not familiar with the /.SUNWnative directory but this exists on all Solaris 8 & 9 zones. Within this directory are three lofs mounts that present /lib, /usr and /platform to the local zone from the global:

/zones/s8_lt203398/root/.SUNWnative/lib on /lib read only/setuid/nodevices/dev=4010002 on Tue Apr 14 16:59:39 2009
/zones/s8_lt203398/root/.SUNWnative/platform on /platform read only/setuid/nodevices/dev=4010002 on Tue Apr 14 16:59:39 2009
/zones/s8_lt203398/root/.SUNWnative/usr on /usr read only/setuid/nodevices/dev=4010002 on Tue Apr 14 16:59:39 2009

Those that have kept up will now know why things are failing -- everything here hinges around anonymous FTP using a chroot() environment and those Solaris 8 zone-specific mounts not existing (why would they? in.ftpd(1M) was written years before Zones were around)

The Fix

Despite all of this investigation the real fix is far more obvious: don't use the legacy Solaris 8 FTP daemon when your system is running Solaris 10 (or OpenSolaris)!

The same functionality can be achieved by creating a minimal Solaris 10 local zone & setting up anonymous FTP there. The newer Solaris 10 ftpd provides additional functionality over Solaris 8. The last step in this method is to set up a loopback mount in the global zone to present the relevant directory into the Solaris 8 zone

The Workaround

Not everybody will have the option of adding an additional Solaris 10 zone to deliver "The Fix" so here is a workaround to get Solaris 8 anonymous FTP working. The good news is that it isn't at all hacky, it's just a few extra steps that are missing from in.ftpd(1M)

Armed with the knowledge that we are missing the /.SUNWnative mounts within the chroot() environment we simply add these in and things will begin to work. The updated mount output will now include:

/export/home/ftp/.SUNWnative on /.SUNWnative read/write/setuid/zone=s8_lt203398/dev=4010009 on Wed Apr 15 02:20:15 2009

Here my chroot environment exists under /export/home/ftp. And a more permanent way to achieve this is by updating your zone's vfstab with:

/.SUNWnative    -    /export/home/ftp/.SUNWnative    lofs    -    yes    ro

As a final note, in.ftpd(1M) didn't seem to mention that /lib/ld.so.1 should also be included in the chroot() environment. It should, and things won't work unless it is there. Don't forget to ensure the permissions are all set correctly

Saturday Mar 14, 2009

Remap middle mouse button

I have a three button mouse, but I mainly use two buttons: left and middle click. From time to time I use the right button

These days most mice have a scroll-wheel that doubles as middle click. They're not bad, but I prefer the left and right mouse buttons. So why not remap them?

On my Sun Ray I run the following:

% xmodmap -e "pointer = 1 3 2 4 5"

I think the syntax is very clear... 5 logical buttons are assigned by listing the physical buttons in order. i.e. the default is "1 2 3 4 5" but as I want to swap buttons 2 & 3, I transpose them in the list

Thursday Feb 05, 2009

Highlighting CDA comments in VIM

I spend the larger part of my day collating outputs from customer systems (be they crash dumps, LDOM outputs, messages, etc.) and adding notes where appropriate. For example I might dump a vnode structure in mdb and want to draw particular attention to a specific member of that structure

Most CDAs I've read through use !! or <-- to draw attention to notes, or distinguish notes from output. I've followed this pattern and find it very useful day to day... but this can be improved

As a (g)vim user I added the following to my .vimrc file:

highlight CDANote ctermbg=red
match CDANote /<--.\*$/
highlight CDANote2 ctermbg=red
2match CDANote2 /!!.\*$/

And an obligatory screenshot:

Almost certainly there are better ways to achieve the same, but for day-to-day use, this makes reading notes and comments a great deal easier

Sunday Feb 01, 2009

Understanding udp_get_next_priv_port()

udp_get_next_priv_port() from http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/inet/udp/udp.c
Assuming the system is unlabeled, what port will be returned when we're called (hint: it isn't IPPORT_RESERVED-1)?
No doubt this will be trivial for the C guru, but it had me scratching my head for a little while. It's probably time that I went back and read Programming 101...
/\*
 \* Return the next anonymous port in the privileged port range for
 \* bind checking.
 \*/
static in_port_t
udp_get_next_priv_port(udp_t \*udp)
{
        static in_port_t next_priv_port = IPPORT_RESERVED - 1;
        in_port_t nextport;
        boolean_t restart = B_FALSE;
        udp_stack_t \*us = udp->udp_us;

retry:
        if (next_priv_port < us->us_min_anonpriv_port ||
            next_priv_port >= IPPORT_RESERVED) {
                next_priv_port = IPPORT_RESERVED - 1;
                if (restart)
                        return (0);
                restart = B_TRUE;
        }

        if (is_system_labeled() &&
            (nextport = tsol_next_port(crgetzone(udp->udp_connp->conn_cred),
            next_priv_port, IPPROTO_UDP, B_FALSE)) != 0) {
                next_priv_port = nextport;
                goto retry;
        }

        return (next_priv_port--);
}
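
A minimal user-space sketch of the unlabeled path may help when working through the question above. It is not the kernel code: sim_next_priv_port() and MIN_ANONPRIV_PORT are hypothetical names, 512 is an assumed value for us_min_anonpriv_port, and the restart and labeled-system logic is left out because it is never reached when the system is unlabeled.

```c
#include <stdint.h>

#define IPPORT_RESERVED         1024    /* same value as <netinet/in.h> */
#define MIN_ANONPRIV_PORT       512     /* assumption: stands in for us_min_anonpriv_port */

typedef uint16_t in_port_t;

/*
 * Simplified stand-in for udp_get_next_priv_port() on an unlabeled system.
 * The static counter is initialised once and then persists across calls.
 */
in_port_t
sim_next_priv_port(void)
{
        static in_port_t next_priv_port = IPPORT_RESERVED - 1;

        /* Out of range: wrap back to the top of the privileged range */
        if (next_priv_port < MIN_ANONPRIV_PORT ||
            next_priv_port >= IPPORT_RESERVED)
                next_priv_port = IPPORT_RESERVED - 1;

        /* Post-decrement: return the current value, then lower the counter */
        return (next_priv_port--);
}
```

Calling sim_next_priv_port() repeatedly yields 1023, 1022, 1021 and so on: the static initialiser runs only once, and the post-decrement hands back the current value before lowering it for the next caller. Once the counter drops below the minimum it resets to IPPORT_RESERVED - 1. One reading of the hint, then, is that the value any given caller sees depends on how many calls came before it.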

Wednesday Jan 28, 2009

Introduction to Solaris CIFS & VSCAN (Cardiff presentation)

Thanks to an invitation from Clive King I spent the day at Cardiff University, presenting on Solaris CIFS and VSCAN as part of a set of tech talks for academic customers.

As promised to those guests, I'm providing a copy of my slides (thanks Jarod & our lab interns), at the end of which there is a list of references along with links to the relevant sites. You can also check out the sites I visited while gathering the information for the slides, as well as my notes.

If there were any other questions, just drop a comment and I'll endeavour to give you an answer.

Saturday May 17, 2008

Compiling MPlayer on OpenSolaris 2008.05

Video playback (with codecs) is an absolute must to make OpenSolaris a workable alternative to Ubuntu for my home machine.

What follows is a very quick guide to compiling MPlayer v1.0rc2 on OpenSolaris.

Fetch gcc, gmake and gawk to allow us to compile MPlayer with minimum fuss; we also pull SUNWxorg-headers to allow us to compile the Xv video-out plugin (I also have FSWxorg-headers installed; if things don't work out, you may want to add that to the end of the list). As IPSgawk is in the Blastwave IPS repository, we add that also:

$ pfexec pkg set-authority -O http://blastwave.network.com:10000/ Blastwave
$ pfexec pkg install SUNWgcc SUNWgmake IPSgawk SUNWxorg-headers

Compiling MPlayer is now very straightforward; the only special thing we need to do is promote /opt/csw/gnu in our PATH at compile time. This is because various parts of the build fail with the standard Solaris awk, so we override it with GNU awk:

$ echo $PATH
/usr/gnu/bin:/usr/bin:/usr/X11/bin:/usr/sbin:/sbin
$ export PATH=/usr/gnu/bin:/opt/csw/gnu:/usr/bin:/usr/X11/bin:/usr/sbin:/sbin
$ wget http://www8.mplayerhq.hu/MPlayer/releases/MPlayer-1.0rc2.tar.bz2
$ tar jxf MPlayer-1.0rc2.tar.bz2 && cd MPlayer-1.0rc2
$ ./configure
$ gmake

The key thing to watch out for is the list of enabled audio and video plugins printed when configure completes (you almost certainly want Xv).

If all went well, you should be able to run MPlayer with a simple ./mplayer -vo xv /path/to/video

In the meantime, I'm looking into the best way to get MPlayer packages available via one of the standard IPS repositories.

Wednesday Sep 05, 2007

Growing a mirrored volume with Solaris Volume Manager / DiskSuite

I recently had a customer contact the Solutions Centre requesting a procedure to grow a Solaris Volume Manager (previously known as Solstice DiskSuite) RAID1 mirror using an old, unwanted volume.

In order to answer the customer's question authoritatively I turned to Sun's extensive 'Global Labs', where I booked a suitable host and installed the same version of Solaris as the customer. In this circumstance the global lab I picked was actually in the same building as me, but the procedure was done remotely using the lab tools. For all intents and purposes, the system could very well have been in the global lab in Singapore.

With Solaris installed I quickly duplicated the customer's configuration. He was using two disks in his mirror, giving him good redundancy:

d10 (RAID 1 mirror, 20GB; mounted on /export/home)
|- d11 (c1t0d0s0, 20GB)
\- d12 (c2t0d0s0, 20GB)

d20 (RAID 1 mirror, 30GB; mounted on /export/backup)
|- d21 (c1t0d0s1, 30GB)
\- d22 (c2t0d0s1, 30GB)

In this example, the customer is now using a remote host for backup, so the d20 metadevice is no longer being used. He wanted to grow d10 by the capacity of d20.

The following simple steps achieve this with no downtime. It is not even necessary to unmount the volume that is being grown:

  1. Take a backup of /export/home. Verify this backup.

  2. Unmount d20:
     # umount /export/backup

  3. Remove all references to the d20 mirror and all sub-mirrors with metaclear. In this step the -r flag indicates that d20, d21 and d22 will be removed:
     # metaclear -r d20

  4. Attach the slices previously used in the d20 mirror to each of the d10 sub-mirrors:
     # metattach d11 c1t0d0s1
     # metattach d12 c2t0d0s1

  5. Tell d10 that the sub-mirrors have changed and that the size of the metadevice has changed:
     # metattach d10

  6. Finally, we need to use growfs to grow the UFS filesystem that lives on the d10 metadevice:
     # growfs -M /export/home /dev/md/rdsk/d10

Hopefully this simple six-step procedure will be of use to those new to Volume Manager.
