Friday Aug 15, 2008

Today's colour is

#FF35C5. That is all.

Tuesday Jun 17, 2008

Trap for the unwary

I did a bios upgrade on my laptop the other day - from A05 to A08. Thought nothing of it until I re-installed the beast with build 91 to get some ZFS root goodness. (Note that currently you have to use the text-mode installer to do this).

xVM told me, none too politely, that it couldn't find any virtualization capabilities in my cpus, so it wasn't going to be my friend any more.

I logged 6714698 snv_91 xVM spurious failure on VT-enabled hardware and provided what I thought was enough info (prtpicl -v and prtconf -v output). Turns out I should have also provided the output from xm info and xm dmesg. When I did, I noticed these lines:
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p 

(xVM) Processor #0 6:15 APIC version 20
(xVM) Processor #1 6:15 APIC version 20
(xVM) IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
(xVM) Enabling APIC mode:  Flat.  Using 1 I/O APICs
(xVM) Using scheduler: SMP Credit Scheduler (credit)
(xVM) Detected 2194.558 MHz processor.
(xVM) VMX disabled by Feature Control MSR.
(xVM) CPU0: Intel(R) Core(TM)2 Duo CPU     T7500  @ 2.20GHz stepping 0b
(xVM) Booting processor 1/1 eip 90000
(xVM) VMX disabled by Feature Control MSR.
(xVM) CPU1: Intel(R) Core(TM)2 Duo CPU     T7500  @ 2.20GHz stepping 0b
(xVM) Total of 2 processors activated.

What the...?

Quick jump into the bios revealed that there was a new option - Virtualization support. It was, of course, turned off by default. Turning it on and booting the xVM kernel showed me some much nicer output from those commands:
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p
                             hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 

(xVM) Processor #0 6:15 APIC version 20
(xVM) Processor #1 6:15 APIC version 20
(xVM) IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
(xVM) Enabling APIC mode:  Flat.  Using 1 I/O APICs
(xVM) Using scheduler: SMP Credit Scheduler (credit)
(xVM) Detected 2194.555 MHz processor.
(xVM) HVM: VMX enabled
(xVM) VMX: MSR intercept bitmap enabled
(xVM) CPU0: Intel(R) Core(TM)2 Duo CPU     T7500  @ 2.20GHz stepping 0b
(xVM) Booting processor 1/1 eip 90000
(xVM) CPU1: Intel(R) Core(TM)2 Duo CPU     T7500  @ 2.20GHz stepping 0b
(xVM) Total of 2 processors activated.
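
The check above (look for hvm-* entries in xen_caps, and for the "VMX disabled" line in xm dmesg) can be sketched in a few lines of Python. This is only an illustration working on saved command output. The function names are made up for this example, and the parser assumes xen_caps wraps onto an indented continuation line the way it does in the output quoted in this post.

```python
def xen_caps(xm_info_text):
    """Collect the capability strings from saved `xm info` output.

    Assumption: a wrapped xen_caps value continues on indented lines,
    as in the output quoted in this post.
    """
    caps = []
    collecting = False
    for line in xm_info_text.splitlines():
        if line.startswith("xen_caps"):
            # Everything after the first colon is the capability list.
            caps.extend(line.partition(":")[2].split())
            collecting = True
        elif collecting and line[:1].isspace() and line.strip():
            # Indented continuation of the xen_caps value.
            caps.extend(line.split())
        else:
            collecting = False
    return caps


def hvm_available(xm_info_text):
    """True if any hvm-* capability is listed."""
    return any(cap.startswith("hvm-") for cap in xen_caps(xm_info_text))


def vmx_disabled_by_bios(xm_dmesg_text):
    """True if the hypervisor logged the message seen before the bios fix."""
    return "VMX disabled by Feature Control MSR" in xm_dmesg_text
```

Fed the "before" output, hvm_available() returns False and vmx_disabled_by_bios() returns True. Fed the "after" output, it is the other way around.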

Now as soon as I get a spare cycle or three, I can go and see about building an S10 domU for backport builds. That'll be fun!

Friday Dec 07, 2007

Bios bugs annoy the heck out of me

In the last few days I've been kinda-sorta prevented from successfully LiveUpgrading due to a freakin' annoying bug in my Ultra20-M2 system bios:

6636511 u20m2 bios version 1.45.1 still can't distinguish disks on the same sata channel

(It's in a closed prod/cat/subcat, sorry).

The gist of the bug is that I've got two identical Seagate 320Gb disks (ST3320620AS, 320072933376 bytes) in my system, providing /, /zroot (for my zones, it's ufs), and sink - my zpool. No matter which two SATA ports I plug those two disks into, Shidokht's /sbin/biosdev util cannot do anything but report either no disks found, or (if run with -d) that the matchcount for the devices is greater than 1.

This means that /usr/lib/lu/lumkboot, which is called as part of lucreate and friends, cannot do the needful. Hence LU fails.

Yesterday I finally cracked and went off to purchase two new 320Gb disks (one Western Digital, the other a Samsung) in order to see how deep the bug goes. This became particularly important after JanD attempted to reproduce

6628268 u20 and u20m2 + snv_75a with non-global zones refuses to allow LU (lucreate)

with an u20m2 and two identical Hitachi 250Gb disks. He wasn't able to, despite having the same model disk, with the same firmware version in each slot.

At the moment my box is having a grand old time, 1hr10 into a zpool replace:

farnarkle:jmcp $ zpool status sink
pool: sink
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress, 67.52% done, 0h28m to go

sink DEGRADED 0 0 0
mirror DEGRADED 0 0 0
c2t0d0s7 ONLINE 0 0 0
replacing DEGRADED 0 0 0
c3t0d0s7/old FAULTED 0 0 0 corrupted data
c3t0d0s7 ONLINE 0 0 0

errors: No known data errors

To get to the point where zpool could replace the device, I made sure the slices on the new disk were in order, then ran zpool replace sink c3t0d0s7. That's it - it's really nifty.
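
For keeping an eye on a resilver without re-reading the whole status screen, here is a minimal sketch that pulls the progress figures out of saved zpool status output. It assumes the "scrub:" line looks exactly like the one quoted above. Other builds may word that line differently, and resilver_progress is a name invented for this example.

```python
import re

# Matches the progress line as it appears in the status output quoted above.
RESILVER_RE = re.compile(
    r"resilver in progress, ([\d.]+)% done, (\d+)h(\d+)m to go"
)


def resilver_progress(zpool_status_text):
    """Return (percent_done, minutes_remaining), or None if no resilver line is found."""
    match = RESILVER_RE.search(zpool_status_text)
    if match is None:
        return None
    percent = float(match.group(1))
    minutes = int(match.group(2)) * 60 + int(match.group(3))
    return percent, minutes
```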

I've got one more thing to try (swapping the cables around for c3t0 and c3t1), which I think I'll have a go at in about 40 minutes. Whatever the results of that test, it's not looking good for the bios when it's got Seagate-branded disks attached.

Monday Oct 10, 2005

More on bioses

When I was building my current desktop box, I knew I was close to the bleeding edge. Today, with a lot of help from sethg, I finally found out that I was over the edge with the built-in AC97 audio device.

To start with, I built the box around the Gigabyte K8-NSPro motherboard. I had some initial issues with the chipset (nForce3-250) and not being able to see the plugin pci bus. Seth helped me with that as well (this is back when Solaris 10 build 63 was new), and I did have a soundblaster card installed so the audio seemed quite ok.

Then a few months ago I ripped out the soundblaster card to rebuild the box from the ground up for a fresh Nevada install. And the audio didn't really work too well. Scratchy and interrupted --- with Sun's audio810(7D) driver and also with Jurgen Keil's audioi810(7D) driver.

Jurgen suggested I look at the interrupt mappings, which I did, and found that the AC97 and USB devices were using the same IRQ. Since I don't know how to remap this without going to the bios (there's magic to do it, which I don't know and don't really care to know), I did that... So now I had AC97 and USB using different IRQs. Still no good with the output.

Last night I tried running intrstat(1M) in an effort to see what was going on. Something of the order of 85000 interrupts/second for audio810#0. Surely that couldn't be correct? Then I fired up dtrace with /usr/demo/dtrace/intr.d. BIG MISTAKE!!! I encountered what could easily be mistaken for a hard hang, but was probably just my desktop box being kicked and pummelled and punched and ... resulted in it getting the one-finger salute.

So after I posted to dtrace-discuss-AT-opensolaris-DOT-org asking for some assistance, sethg got in touch with me and poked around inside my kernel's acpi tables for a few hours. The end result is that we've discovered a new failure mode, and it looks like my motherboard ignores the laws of physics! To quote Seth's response on dtrace-discuss-AT-opensolaris-DOT-org:
The wrong interrupt controller input was programmed with the wrong polarity, causing continuous interrupts to be sent to the CPU. Hey, at least now James knows the maximum theoretical # of interrupts / seconds his system can process ;) ... Specifically, the ACPI tables were lying about the interrupt polarity for a particular set of interrupts.
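
To see why a wrong polarity setting turns into an interrupt storm, here is a toy model in Python. It is not how the real interrupt controller or the Solaris kernel works, just a sketch of a level-triggered input. The device, the sample counts and all names are made up, and which polarity the hardware actually used is an assumption for the sake of illustration.

```python
ACTIVE_HIGH = "active-high"
ACTIVE_LOW = "active-low"


def is_asserted(line_level, polarity):
    """True if the controller treats this electrical level (0 or 1) as a request."""
    return line_level == (1 if polarity == ACTIVE_HIGH else 0)


def count_interrupts(levels, polarity):
    """Count the samples at which a level-triggered input would interrupt the CPU."""
    return sum(1 for level in levels if is_asserted(level, polarity))


# Hypothetical active-low device: the line rests at 1 and is pulled to 0
# only for the 3 samples in which the device really wants service.
samples = [1] * 97 + [0] * 3

correct = count_interrupts(samples, ACTIVE_LOW)   # controller told the truth
wrong = count_interrupts(samples, ACTIVE_HIGH)    # controller told the opposite
```

With the correct polarity the controller interrupts only while the device asks for service (3 samples here). With the polarity inverted, the idle line itself looks like a permanent request (97 samples), which is the "continuous interrupts" Seth describes.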

At least with ACPI we can provide the system with a new table for the kernel to blast in and make use of. So now I'm happily listening to non-scratchy and uninterrupted audio, and thanking the dedication of a fellow engineer on the other side of the world. Thanks Seth!

Wednesday Aug 31, 2005

AMD64 is very very popular in South Korea!

I'm always interested in Sun's stock price --- not only does it have an effect on how our customers perceive us, it has an effect on me too because I'm a shareholder via our Employee Stock Purchase Plan. And one day, when the stock price goes up enough, I might be able to cash in some options. I doubt that will happen for a long time though --- my options are at whole-number multiples of the current price :-(

Anyway, with the StorageTek acquisition approved in Europe and finalised in the USA, the stock went up. Which is always nice to see. And then I noticed this news release about the South Korea Ministry of Education's National Education Information System buying 1200 Sun Fire V40z servers to seed middle and high schools around the nation. I see this as a poke in the eye for those people who reckon Sun can't get any traction in the educational arena, let alone the non-Sparc market. The best part for me is that these servers (they're buying some Sun Fire v440s and Sun Fire v240s as well) will all be running Solaris 10.

Tuesday Aug 30, 2005

My laptop is an engineering miracle

Mike Shapiro recently integrated some x86/x64 RFEs that allow us to make use of the System Management Bios facility which is in newer PC bioses.

Amongst other things, this finally gives us prtdiag(1M) support on the x86 and x64 platforms.

So of course I had to try it out. This is what I get on my laptop:

$ su root -c ' /usr/sbin/prtdiag -v'
System Configuration: MiTAC
BIOS Configuration: Insyde Software O1.06  02/20/2004

==== Processor Sockets ====================================

Version                          Location Tag
-------------------------------- --------------------------

==== Memory Device Sockets ================================

Type    Status Set Device Locator      Bank Locator
------- ------ --- ------------------- --------------------
DRAM    in use 0   [snip bogosity]     **nks 0/1
DRAM    in use 0   DRAM Slot 1         Banks 2/3
DRAM    in use -   DRAM Slot 2         Banks 4/5

==== On-Board Devices =====================================
ECP Port
16550 UART
IrDA Port
CardBus Bridge
IDE Controller

==== Upgradeable Slots ====================================

ID  Status    Type             Description
--- --------- ---------------- ----------------------------
4   available PCI              MiniPCI

So, I've got 2x512mb dimms in this beast --- but all the slots are in use.... strange. But how many cpus do I have? Oooh none! It's a miracle!

I pinged Mike Shapiro about this --- I've got a non-DMTF-compliant bios because there are no CPU records and no cache records.
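
Spotting the missing records by eye is easy enough on one laptop, but the same check can be sketched in Python against saved prtdiag -v output. This is only a sketch: it assumes the section layout shown above ("==== Name ====" headers, an optional dashed separator under the column titles), and the function names are made up for this example.

```python
import re

# A section header looks like: ==== Processor Sockets ==========
SECTION_RE = re.compile(r"^==== (.+?) =+$")


def section_rows(prtdiag_text):
    """Map each section name to its data rows (column titles and dashes excluded)."""
    sections = {}
    current = None
    for line in prtdiag_text.splitlines():
        header = SECTION_RE.match(line.strip())
        if header:
            current = header.group(1)
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line)

    rows = {}
    for name, lines in sections.items():
        # A line made only of dashes and spaces separates titles from data.
        separator = next(
            (i for i, text in enumerate(lines)
             if set(text.replace(" ", "")) == {"-"}),
            None,
        )
        rows[name] = lines[separator + 1:] if separator is not None else lines
    return rows


def empty_sections(prtdiag_text):
    """Names of sections that contain no data rows at all."""
    return [name for name, rows in section_rows(prtdiag_text).items() if not rows]
```

Run against the output above, empty_sections() would flag "Processor Sockets", which is the symptom of the missing CPU records.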

So who should I hassle about this? I kinda think it should be Insyde Software, but they could very well blame MiTAC, who could blame Insyde....

When I get my home workstation up to current I'll just have to see how good Gigabyte is with their bios.

The story continues......

Monday Aug 29, 2005

Where oh where is my laptop's bios update?

Each time I boot my laptop now that I've got the latest nightly build on it, I see a big scary warning message like this:
Aug 26 08:56:37 broken unix: [ID 950921] cpu0: AMD Athlon(tm) 64 Processor 3000+
Aug 26 08:56:37 broken unix: [ID 101328 kern.warning] WARNING: BIOS microcode patch for AMD Athlon(tm) 64/Opteron(tm) processor
Aug 26 08:56:37 broken erratum 109 was not detected; updating your system's BIOS to a version
Aug 26 08:56:37 broken containing this microcode patch is HIGHLY recommended or erroneous system
Aug 26 08:56:37 broken operation may occur.

Wow --- that looks serious! A quick check of AMD's website provides me with this document which lists that particular erratum on page 68:
109 Certain Reverse REP MOVS May Produce Unpredictable Behavior

Description
In certain situations a REP MOVS instruction may lead to incorrect results. An incorrect address size, data size or source operand segment may be used or a succeeding instruction may be skipped. This may occur under the following conditions:
· EFLAGS.DF=1 (the string is being moved in the reverse direction).
· The number of items being moved (RCX) is between 1 and 20.
· The REP MOVS instruction is preceded by some microcoded instruction that has not completely retired by the time the REP MOVS begins execution. The set of such instructions includes BOUND, CLI, LDS, LES, LFS, LGS, LSS, IDIV, and most microcoded x87 instructions.

Potential Effect on System
Incorrect results may be produced or the system may hang.

Suggested Workaround
Contact your AMD representative for information on a BIOS update.

Fix Planned
Yes

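
The three trigger conditions in that erratum boil down to a simple conjunction, which can be written out as a Python predicate. This is only a restatement of the quoted text, not anything from AMD or from the Solaris kernel. The function and parameter names are invented, and treating "between 1 and 20" as inclusive is an assumption.

```python
# Instructions the erratum names explicitly. It also mentions
# "most microcoded x87 instructions", which are not enumerated here.
NAMED_MICROCODED = {"BOUND", "CLI", "LDS", "LES", "LFS", "LGS", "LSS", "IDIV"}


def erratum_109_conditions_met(direction_flag, rcx, unretired_microcoded_before):
    """True if all three conditions listed in the erratum text hold.

    direction_flag: value of EFLAGS.DF (1 means reverse direction)
    rcx: number of items the REP MOVS will move
    unretired_microcoded_before: whether a microcoded instruction that has
        not completely retired precedes the REP MOVS
    """
    return (
        direction_flag == 1
        and 1 <= rcx <= 20            # assumed inclusive
        and bool(unretired_microcoded_before)
    )
```

All three conditions have to hold at once. A forward copy (DF=0) or a count above 20 falls outside the erratum as quoted.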
Well that's a relief ... now where do I go to find the bios update the boot message talks about? Let's try Mitac. No joy there. Not even a "search" field or button. Googling produces a site which does have a search field. So I put in my model number (8355) and there are some hits. There is one for a "bios update", but nothing that really stands out as relevant or matches up with the erratum date of June 2004. And now when I go back to the site I keep getting ye olde "connection refused" from the referred-to website. Not happy!

I'll keep trying, but I don't like my chances. You know, if Sun was this lax in providing support we'd have died a long time ago. I guess the PC manufacturing industry makes up in volume what it loses in satisfied customers. Seems a waste, really.

I work at Oracle in the Solaris group. The opinions expressed here are entirely my own, and neither Oracle nor any other party necessarily agrees with them.

