Sunday May 24, 2009

Mild annoyance for snv_110 to snv_115

When LU'ing a buildbox from snv_110 to snv_115, I saw that 71 packages failed to add correctly to the new BE. The failure on pkgadd came with a message like this:



Doing pkgadd of SUNWgtk2 to /
28453 blocks
/a/var/sadm/pkg/SUNWgtk2/install/postinstall: /a/usr/share/desktop-cache/restart_fmri: not found
pkgadd: ERROR: postinstall script did not complete successfully

Installation of <SUNWgtk2> failed.
pkgadd return code = 1




The workaround for this was to pkgadd the SUNWdesktop-cache package to my new BE's root (/.alt.snv_115 in my case), and then run


yes |pkgadd -R /.alt.snv_115 -d /net/installserver/export/nv/x/115/Solaris_11/Product `cat var/sadm/system/data/upgrade_failed_pkgadds`

Wednesday Oct 31, 2007

Today is a very good day

 

 

Today I'm ecstatic to be able to announce that the S10 patches for our backport are finally available on sunsolve.sun.com. We've delivered PSARC 2006/703 MPxIO extension for Serial Attached SCSI, and (my personal favourite) PSARC 2007/046 stmsboot(1M) extension for mpt(7D).


The patches that you need to install are

sparc:: 125081-10
(On sparc we recommend that you also install 127747-01, due to 6466248)

and

x86/x64:: 125082-10

 

The full list of rfes and bugs is as follows:


6443044 add mpxio support to SAS mpt driver
6502231 stmsboot needs to support SAS devices
6544226 mpt needs mdb module

6242789 primary path comes up as standby instead online even if auto-failback is enabled
6442215 mpt.conf maybe overwritten because filetype within SUNWckr package is 'f'
6449836 stmsboot -d failed to boot if several LUNs or targets map to same partition
6510425 properties "flow_control" and "queue" in mpt.conf are useless
6525558 untagged command unlikely to be sent to HBA during heavy I/O
6541750 CAM5.1.1b2: 2530, MPT2: Vdbench bailed out after I pull ctlr-A out
6545198 build should allow architecture-dependent class action scripts
6546164 stmsboot does not remove sun4u SMF service, erroneously lists parallel SCSI HBAs
6548867 mpxio-upgrade script has fatally mis-defined variable
6550585 mpt driver has a memory leak in mpt_send_tur
6550591 mpt should not print unnecessary messages
6550849 WARNING: mpt TEST_UNIT_READY failure
6554029 mpt should get maxdevice from portfacts, not IOCfacts
6554556 stmsboot's privilege message is not quite correct
6556832 after ctlr brought online, some paths failed to come back
6560371 mpt hangs during ST2530 firmware upgrade
6566097 mpt: sd targets under mpt are not power-manageable
6566815 changes for 6502231 broke g11n in stmsboot
6531069 SCSI2 (tc_mhioctkown test cases) testing are showing UNRESOLVED results for ST2530
6546465 mpt: kernel panic due to NULL pointer reference in an error code path
6556852 mpt needs to support Sun Fire x4540 platform
6588204 mpt_check_scsi_io_error() incorrectly tests IOCStatus register
6588278 mpt driver doesn't check GUID of LUN when the path online
6591973 panic in mdi_pi_free() when remapping devices
6613189 T125082-09 and T125081-09 don't work - missing misc/scsi module from deliverables



As an interesting side note, during the development process we stumbled across

6566270 Seagate Savvio 10k1 disks do not enumerate under scsi_vhci

You'll probably see this if you have a Galaxy or T2000/T1000 system. (Unfortunately you need a service contract to view the bug report due to its category).

 

And on a personal note, I'd like to thank the other members of our team for working so well together - with Greg in Melbourne, Javen and Dolpher up in Beijing, test teams in Beijing, Menlo Park, Broomfield and San Diego and yours truly in Sydney (and now Brisbane) - we have truly been a virtual team. I reckon we've demonstrated that physical distance does not get in the way of designing, developing, testing and (most importantly) delivering good software that provides solutions for our customers.

 



Saturday Sep 22, 2007

A load off my mind

This morning I awoke and saw that the gatekeepers had given me the ok to putback our wad of changes to the Solaris 10 patch and feature gates. The gates opened less than a day ago, US/Pacific time, so I'm very happy that we were amongst the first to get in.

So after a few hours on the phone with the rest of our team, making absolutely darned sure that we had everything correct and doing a last-minute merge+test with the feature gate, I putback. I don't think I've been this nervous since the day I got married!

We expect to see patches for this backport in about a month - a coupla weeks after the close of the patch gate for this build. I'm not sure what the patch IDs will be; when I find out I'll mention them here.

The PSARC fasttracks we integrated are

PSARC 2006/703 MPxIO extension for Serial Attached SCSI (SAS) on mpt(7D)
PSARC 2007/046 stmsboot(1M) extension for mpt(7D)

And the primary bugids are

6443044 add mpxio support to SAS mpt driver
6502231 stmsboot needs to support SAS devices
6544226 mpt needs mdb module

(bugs.opensolaris.org won't show the updates until tomorrow).

Our little project team is very happy and despite the fact that we're in Beijing, Brisbane and Melbourne I think we might all go off to a pub to celebrate.

Wednesday Jul 05, 2006

nge, ultra20, nevada .... no packets!


One problem that I've been having with my bleeding-edge commitment has been with the nge driver. I noticed that after my bfu gave me nge version 1.4, I couldn't get any packet responses when I pinged.

With v1.3 I could, so I logged a bug against the nge driver.

That was all well and good, except that I have absolutely *no* idea what to look for when debugging network issues. If snoop can't give me a clue, I'm stuffed.

Yesterday I bfu'd to the 2nd July nightly bits (which contained nge v1.6), rebooted, saw the same lack of packet response and did my nge shuffle. Update the boot archive, reboot, kaboom!

Turns out that there were some putbacks for GLD v3 which nge v1.3 isn't compatible with.

Slight problem for me then, because I couldn't get my network..... it was more of a notwork. Not good.

With the aid of Murayama's nfo driver (yay for usb storage!) I was able to determine that the problem wasn't actually with the nic driver, but either above that in the stack, or below it, in the hardware.

I have Brendan Gregg's DTrace Toolkit installed, so I ran dtruss on a ping to my gateway. That showed me that everything seemed to be working ok from the above-the-nic part of the issue. So that left the hardware itself.

Since I knew that nge v1.3 worked just fine, I was left to poke around in the hardware........ and the only thing I could find was in the bios for this box.

It turns out that there's a setting in one of the Advanced Settings pages, called MAC Media Interface. Somewhere in my futzing around (I can't help myself, you know how it is), I'd set that particular item to "MII".

That's the wrong thing.

I actually needed that set to "RGMII", which stands for "Reduced Gigabit Media Independent Interface".

Once I'd done that, all my network stuff came good.

That's one setting I won't be playing with again!



Wednesday May 24, 2006

CVS Pserver... once more with feeling

A few weeks ago I received a blog comment that my manifest to enable CVS Pserver operation had disappeared. Of course with the large number of emails I get every day I filed it in my todo list... which quickly got covered over with other stuff.

Anyway, I couldn't recall where I'd stashed it, so I ran


$ svccfg export svc:/network/pserver/tcp > /tmp/cvspserver-tcp.xml

(I was surprised I could do that not as root, but anyway....) and I've uploaded the manifest to my b.s.c. resources area.

Sorry for the delay in getting the resource back. I'm really uncertain why or how it could have disappeared.

Wednesday Apr 05, 2006

Updated again - the CVS pserver manifest

I just got an email from a colleague within Sun regarding my CVS Pserver entry. He noticed that I had written

# svcadm import /var/svc/manifest/network/pserver-tcp.xml
# svcadm enable svc:/network/pserver/tcp:default


Now the keen-eyed amongst you will notice that svcadm doesn't have an "import" sub-command. The commands that we should be using here are in fact
# svccfg import /var/svc/manifest/network/pserver-tcp.xml
# svcadm enable svc:/network/pserver/tcp:default


which makes a world of difference!

I've updated the original entry to reflect this correction. I'm still getting referrer hits for it more than 6 months after posting.

Thursday Nov 17, 2005

How do I find out what my network device is?

I've been hanging out on #opensolaris-AT-irc.freenode-DOT-net a lot recently, doing my bit to help people get over the initial hump of installing Solaris and OpenSolaris. This evening we've been talking about devices, specifically NICs, and figuring out what driver they need.

So how does one go about this if one has no idea what the driver should be? Well, start by running prtpicl -v and either pipe the output through /usr/bin/less or dump it to a file. Then you need to know what you're looking for: search for "Ethernet" or "Network" and you can't get too far off the track. That will appear in a stanza like this:
                 pci1458,e000 (obp-device, 187220000034b)
                  :DeviceID      0xb
                  :UnitAddress   2
                  :device-id     17184
                  :vendor-id     4523
                  :revision-id   19
                  :class-code    131072
                  :unit-address  b
                  :subsystem-id  57344
                  :subsystem-vendor-id   5208
                  :min-grant     23
                  :max-latency   31
                  :interrupts    1
                  :devsel-speed  1
                  :fast-back-to-back
                  :66mhz-capable
                  :power-consumption     01  00  00  00  01  00  00  00
                  :model         Ethernet controller
                  :compatible   (1872200000357TBL)
                   | pci11ab,4320.1458.e000.13 |
                   | pci11ab,4320.1458.e000 |
                   | pci1458,e000 |
                   | pci11ab,4320.13 |
                   | pci11ab,4320 |
                   | pciclass,020000 |
                   | pciclass,0200 |
                  :reg
 00  58  02  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
 10  58  02  02  00  00  00  00  00  00  00  00  00  00  00  00  00  40  00  00
 14  58  02  01  00  00  00  00  00  00  00  00  00  00  00  00  00  01  00  00
 30  58  02  02  00  00  00  00  00  00  00  00  00  00  00  00  00  00  02  00
                  :assigned-addresses
 10  58  02  82  00  00  00  00  00  00  00  f5  00  00  00  00  00  40  00  00
 14  58  02  81  00  00  00  00  00  94  00  00  00  00  00  00  00  01  00  00
 30  58  02  82  00  00  00  00  00  00  00  00  00  00  00  00  00  00  02  00
                  :pm-hardware-state     needs-suspend-resume
                  :devfs-path    /pci@0,0/pci10de,ed@e/pci1458,e000@b
                  :driver-name   skge
                  :binding-name  pci1458,e000
                  :bus-addr      b
                  :instance      0
                  :_class        obp-device
                  :name  pci1458,e000
See, up there at :model Ethernet controller? Now the next stanza or property is the very important :compatible part. It's so important to this blog that I'll excerpt it:
                  :compatible   (1872200000357TBL)
                   | pci11ab,4320.1458.e000.13 |
                   | pci11ab,4320.1458.e000 |
                   | pci1458,e000 |
                   | pci11ab,4320.13 |
                   | pci11ab,4320 |
                   | pciclass,020000 |
                   | pciclass,0200 |
These strings are PCI Consortium identifiers. Let's walk through them one by one.

identifier                   which is what?
pci11ab,4320.1458.e000.13    vendor,device.subvendor.subdevice.revision
pci11ab,4320.1458.e000       vendor,device.subvendor.subdevice
pci1458,e000                 subvendor,subdevice
pci11ab,4320.13              vendor,device.revision
pci11ab,4320                 vendor,device
pciclass,020000              PCI Consortium device class, specific
pciclass,0200                PCI Consortium device class, general
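
The decimal properties in the prtpicl stanza (:vendor-id, :device-id, :subsystem-vendor-id, :subsystem-id, :revision-id and :class-code) are the same numbers that show up in hex in the :compatible list. Here's a minimal sketch that rebuilds the list from those decimal values. The function name pci_compatible is made up for illustration; it is not a Solaris utility, and it assumes the ordering shown in the stanza above.

```shell
#!/bin/sh
# pci_compatible VENDOR DEVICE SUBVENDOR SUBDEVICE REVISION CLASSCODE
# Hypothetical helper: all arguments are decimal, as printed by prtpicl -v.
# Prints the identifiers from most specific to most general.
pci_compatible() {
    printf 'pci%x,%x.%x.%x.%x\n' "$1" "$2" "$3" "$4" "$5"   # vendor,device.subvendor.subdevice.revision
    printf 'pci%x,%x.%x.%x\n'    "$1" "$2" "$3" "$4"        # vendor,device.subvendor.subdevice
    printf 'pci%x,%x\n'          "$3" "$4"                  # subvendor,subdevice
    printf 'pci%x,%x.%x\n'       "$1" "$2" "$5"             # vendor,device.revision
    printf 'pci%x,%x\n'          "$1" "$2"                  # vendor,device
    printf 'pciclass,%06x\n'     "$6"                       # full class code
    printf 'pciclass,%04x\n'     "$(( $6 >> 8 ))"           # class code without the low byte
}

# Values taken from the skge stanza above.
pci_compatible 4523 17184 5208 57344 19 131072
```

Run with the values from the stanza, this should reproduce the seven :compatible strings, which is a handy sanity check that you've read the right properties.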
Ok, that's all well and good, but how do I use that information? Well let's assume for a second that you want to find a network device in general. So searching through your prtpicl -v output you'll look for pciclass,0200. That will give you a pointer to the pci vendor,deviceid information, which you can then check /etc/driver_aliases for:
$ grep pci1458,e000 /etc/driver_aliases
skge "pci1458,e000"
This tells me that this particular pci identifier (pci1458,e000) is a device alias for the skge driver from SysKonnect. The example we came across on #opensolaris this evening was
                  :compatible   (1e4000001e7TBL)
                   | pci14e4,1677.1028.179.1 |
                   | pci14e4,1677.1028.179 |
                   | pci1028,179 |
                   | pci14e4,1677.1 |
                   | pci14e4,1677 |
                   | pciclass,020000 |
                   | pciclass,0200 |
which a quick search of /etc/driver_aliases reveals is actually a Broadcom nic which we supply a driver for. As it happens, for this particular system we had to specify more than just the vendor,deviceid:
# update_drv -a -i '"pci14e4,1677.1028.179"' bge
This reported
warning: driver (bge) successfully added to the system but failed to attach
which was a pain, but progress. So we asked this new user to run
# svcadm clear svc:/network/physical:default
because it was showing as 'maintenance'.... and suddenly all that we had left was a piddly little routing problem. Joy!

Now if we didn't supply a driver, the thing to do would be to Goooooooogle for the pci vendor,deviceid string and "solaris driver" -- for NICs you'll frequently come up with a hit for Masayuki Murayama's collection of drivers.

There is a really nifty utility available for linux called lspci which will let you see what you've got installed in your system and on your motherboard. It makes use of a file of known PCI IDs. If you're running XOrg you can run /usr/X11/bin/scanpci instead.... I'm not sure whether the device mappings are hard-coded though.
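
The lookup described above (walk the :compatible list from most specific to most general and stop at the first alias that matches) can be sketched as a small shell function. This is only an illustration: lookup_driver is a hypothetical name, and the sketch assumes the aliases file has the two-column layout shown in the grep output above (driver name, then the quoted identifier).

```shell
#!/bin/sh
# lookup_driver ALIASES_FILE ID...
# Hypothetical helper: prints "<id> <driver>" for the first ID that has an
# entry in ALIASES_FILE, and returns non-zero if none of the IDs match.
lookup_driver() {
    aliases=$1
    shift
    for id in "$@"; do
        drv=$(awk -v id="$id" '
            { alias = $2; gsub(/"/, "", alias) }   # strip the quotes around the alias
            alias == id { print $1; exit }         # first column is the driver name
        ' "$aliases")
        if [ -n "$drv" ]; then
            printf '%s %s\n' "$id" "$drv"
            return 0
        fi
    done
    return 1
}

# Example, passing the identifiers in the order prtpicl lists them:
# lookup_driver /etc/driver_aliases pci11ab,4320.1458.e000.13 \
#     pci11ab,4320.1458.e000 pci1458,e000 pci11ab,4320.13 pci11ab,4320
```

If nothing matches, that's the cue to go hunting for a third-party driver or to add an alias with update_drv.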

Saturday Nov 05, 2005

The dangers of BFU

I pulled across build 26 and the latest nightly bfu archives to upgrade my laptop and workstation. The laptop bfu went swimmingly... once I remembered to remove my existing SUNWzfs package installation and install the binaries from the bfu archive.

Unfortunately I had a problem with my workstation when I updated it this morning. For some reason my conflict resolution procedure hadn't quite worked. I was getting panics on boot before being able to mount /..... so my type-6 usb keyboard driver wasn't being loaded and I couldn't do anything in kmdb like look at the panic string because the stack was more than 24 lines long! Bit of a problem there.

So I booted off my installation dvd, mounted my root partition under /mnt and had a trawl through /mnt/etc. There were two files listed in the bfu conflicts report which I thought I had fixed up correctly: /etc/name_to_major and /etc/driver_aliases.

Normally I use a procedure something like this to munge the new version and my installed version of those files:


# awk '{print $2,$1}' etc/name_to_major /etc/name_to_major |sort -n |uniq > /tmp/name_to_major
[ fire up vi on /tmp/name_to_major, check that everything looks correct ]
# mv /tmp/name_to_major /etc/name_to_major

And that's been successful for months.

Not this time, and the "fault" (if any should be ascribed) is in the package defaults for SUNWzfs. The bfu archive has the zfs driver use major number 182. Guess what existing driver used major number 182?

*** drum roll please ***

pci-ide

Yeeeeouuuch!

I figured this out by running


# awk '{print $2}' etc/name_to_major /etc/name_to_major |sort -n |uniq -c |grep -v "^ *1 "

which immediately showed me that there were two instances of 182. A quick edit of /etc/name_to_major followed by


# /sbin/bootadm update-archive -v -R /mnt

got me back to my happy place.

Lesson to be learnt: always double-check your /etc/name_to_major file for major number conflicts. Save yourself downtime and keep some of that hair on your head!
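
That double-check is easy to script. Here's a minimal sketch, assuming both files use the usual "driver-name major-number" layout; the function name find_major_conflicts is made up for illustration and isn't part of bfu or Solaris. Unlike a plain uniq -c, it ignores a driver that appears with the same major number in both files, and only reports a major number that is claimed by two different drivers.

```shell
#!/bin/sh
# find_major_conflicts FILE...
# Hypothetical helper: prints "<major> <driver>,<driver>..." for every major
# number that is claimed by more than one distinct driver name.
find_major_conflicts() {
    awk '
        !/^#/ && NF >= 2 {
            key = $2 SUBSEP $1
            if (!(key in seen)) {            # count each (major, driver) pair once
                seen[key] = 1
                names[$2] = (names[$2] == "" ? $1 : names[$2] "," $1)
                count[$2]++
            }
        }
        END {
            for (m in count)
                if (count[m] > 1)
                    print m, names[m]
        }
    ' "$@" | sort -n
}

# Example: compare the bfu archive copy with the installed copy.
# find_major_conflicts etc/name_to_major /etc/name_to_major
```

No output means no conflicts; any line it prints is a major number to sort out by hand before rebooting.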

Thursday Sep 29, 2005

usb-attached storage and a handy hint

A few weeks ago the disk in my laptop died on me. Started making this harsh clicking sound, everything totally locked up. Now I did have backups for a lot of my stuff, but not all of it.... and my procedure is to backup at the end of the day. This happened right after lunch. Doh!

I dashed across the road from the office to get a replacement disk from the local bits-n-pieces shop, along with an external usb-attached enclosure, and that night did my damnedest to get any data whatsoever off that disk. No luck at all. A veritable piece of rotating rust it surely was.

So I installed build 20 (current at the time), and bfu'd to whatever the nightly was, and tried to get my config and data back to some semblance of useful state. I also tried to mount the usb-attached disk (the dodgy one now in the external casing), but no dice.

It wasn't until I read a post from FritS that I remembered a setting to use with scsa2usb(7D), which is what usb storage attaches with. In /kernel/drv/scsa2usb.conf is a lot of documentation about how to work around the various, um, "implementation details" that we come across with consumer-grade hardware like usb enclosures. The bit that made it all work for me was this:
#       reduced-cmd-support     - "true" if the device cannot handle
#               mode sense, start/stop, and doorlock.
#               This is the only legal value for this parameter.
#
So I need to have the following line in my .conf file:
attribute-override-list="vid=0x402 pid=0x5642 reduced-cmd-support=true";
Then a quick disconnect, modunload, update_drv
# echo yes | cfgadm -c disconnect usb2/3
# modunload -i `modinfo |awk '/scsa2usb/ {print $1}'`
# update_drv -v scsa2usb
followed by a re-connect of the device, and joy oh joy I've got 60gb of usb-attached storage available for me. If there's a point to be made, it's this --- always read the manpage for your device driver, and check out its driver.conf file as well --- you might just learn something :-)

Sunday Aug 28, 2005

wireless on the train --- security.... what's that?

A few weeks ago I got an alpha version of our iwi driver --- for my Intel Centrino 2200bg adapter. There's a GPL'd driver available via SourceForge btw. While the developers only call it alpha, I've been using it with our Sun-on-Sun vpn solution for quite some time with absolutely no problems.

Anyway, this morning on the train up to the office I figured I'd see whether I could see any hotspots. Lo and behold, I could! One looked like it was attached to the mac user sitting 5 rows away and had no security on it; the other was, I think, downstairs in the train but did have security. [We have double-decker trains in Sydney :-( ]

It got me thinking --- why do people ignore security, especially with wireless? I was able to connect to that person's network (not that I did anything apart from that of course!) and theoretically I could have purloined that person's files. If that person didn't know about security logfiles or event logs, they could have absolutely no idea what had happened.

Do yourself a favour: if you're going to set up your own little wireless AP, at least set up some password control on connections, and preferably start using 128bit WEP as a minimum. Do you know who has access to your data? Are you sure?

Sunday Aug 21, 2005

An update on the cvs pserver manifest

Well somewhere along the great BFU way my manifest for the CVS pserver stopped working. It's been quite some time since I've had it working and what with work and uni (oh, and a holiday in Europe...) I didn't get back to it until today. I keep getting errors like these:
Aug 22 06:31:25 broken inetd[100282]: [ID 702911 daemon.error] Property 'name' of instance svc:/network/cvspserver/tcp:default is missing, inconsistent or invalid
Aug 22 06:31:25 broken inetd[100282]: [ID 702911 daemon.error] Property 'proto' of instance svc:/network/cvspserver/tcp:default is missing, inconsistent or invalid
Which was darned annoying, because my /etc/inet/inetd.conf and /etc/services looked just fine. So I plugged "inetconv invalid inconsistent fields" into sunsolve and got back an infodoc on Samba (contract-only unfortunately) and found a CR involving the libinetsvc.so library which inetconv(1M) uses. Joining the dots together I ran a series of trusses:
# truss -f -a -topen -u libsocket -u libinetsvc /usr/sbin/inetconv -n -i /tmp/pserver.conf
108712: execve("/usr/sbin/inetconv", 0x08047CD4, 0x08047CE8)  argc = 4
108712:  argv: /usr/sbin/inetconv -n -i /tmp/pserver.conf
108712: open("/var/ld/ld.config", O_RDONLY)             Err#2 ENOENT
108712: open("/lib/libscf.so.1", O_RDONLY)              = 3
108712: open("/usr/lib/libinetsvc.so.1", O_RDONLY)      = 3
108712: open("/lib/libc.so.1", O_RDONLY)                = 3
108712: open("/lib/libuutil.so.1", O_RDONLY)            = 3
108712: open("/lib/libsocket.so.1", O_RDONLY)           = 3
108712: open("/lib/libnsl.so.1", O_RDONLY)              = 3
108712: open("/lib/libmd5.so.1", O_RDONLY)              = 3
108712/1:       open("/tmp/pserver.conf", O_RDONLY)             = 3
108712/1:       open64("/var/run/name_service_door", O_RDONLY)  = 4
108712/1@1:     -> libinetsvc:get_prop_table(0x803fba8)
108712/1@1:     <- libinetsvc:get_prop_table() = 0xfef752a0
108712/1@1:     -> libinetsvc:put_prop_value(0x806b6b8, 0x80664e8, 0x8068768)
108712/1@1:     <- libinetsvc:put_prop_value() = 0
108712/1@1:     -> libinetsvc:put_prop_value(0x806b6b8, 0x80664e0, 0x8068764)
108712/1@1:     <- libinetsvc:put_prop_value() = 0
108712/1@1:     -> libinetsvc:put_prop_value(0x806b6b8, 0x80664ac, 0x80687a8)
108712/1@1:     <- libinetsvc:put_prop_value() = 0
108712/1@1:     -> libinetsvc:put_prop_value(0x806b6b8, 0x80664a4, 0x8068798)
108712/1@1:     <- libinetsvc:put_prop_value() = 0
108712/1@1:     -> libinetsvc:put_prop_value(0x806b6b8, 0x806649c, 0x80687b8)
108712/1@1:     <- libinetsvc:put_prop_value() = 0
108712/1@1:     -> libinetsvc:valid_props(0x806b6b8, 0x0, 0x0, 0x0)
108712/1@1:       -> libsocket:getservbyname_r(0x8068828, 0x8068848, 0x803eef0, 0x803ef00)
108712/1:       open("/etc/netconfig", O_RDONLY|O_LARGEFILE)    = 5
108712/1:       open("/dev/udp", O_RDONLY)                      = 5
108712/1:       open("/dev/udp", O_RDONLY)                      = 5
108712/1:       open("/etc/nsswitch.conf", O_RDONLY|O_LARGEFILE) = 5
108712/1:       open("/lib/nss_files.so.1", O_RDONLY)           = 5
108712/1:       open("/etc/services", O_RDONLY|O_LARGEFILE)     = 5
108712/1@1:       <- libsocket:getservbyname_r() = 0
108712/1@1:       -> libsocket:getservbyname_r(0x8068828, 0x0, 0x803eef0, 0x803ef00)
108712/1:       open("/etc/services", O_RDONLY|O_LARGEFILE)     = 5
108712/1@1:       <- libsocket:getservbyname_r() = 0
108712/1@1:     <- libinetsvc:valid_props() = 0
inetconv: Error /tmp/pserver.conf line 1 invalid or inconsistent fields: service-name protocol
108712/1@1:     -> libinetsvc:free_instance_props(0x806b6b8)
108712/1@1:     <- libinetsvc:free_instance_props() = 0xfed62000
There it is again - that error which just looked out of place. At this point I thought that perhaps I should check out /etc/services. Sure enough, I had a different service name (cvs). Changing that to match my inetd.conf-like file and re-running gave me the manifest I wanted. Here's the tail end of the truss:
108718/1:       open("/etc/services", O_RDONLY|O_LARGEFILE)     = 5
108718/1@1:       <- libsocket:getservbyname_r() = 0x803eef0
108718/1@1:       -> libsocket:getservbyname_r(0x8068828, 0x0, 0x803eef0, 0x803ef00)
108718/1:       open("/etc/services", O_RDONLY|O_LARGEFILE)     = 5
108718/1@1:       <- libsocket:getservbyname_r() = 0x803eef0
108718/1@1:     <- libinetsvc:valid_props() = 1
108718/1@1:     -> libinetsvc:free_instance_props(0x806b6b8)
108718/1@1:     <- libinetsvc:free_instance_props() = 0xfed62000
108718/1:       open("/var/svc/manifest/network/pserver-tcp.xml", O_WRONLY|O_CREAT|O_EXCL, 0644) = 5
pserver -> /var/svc/manifest/network/pserver-tcp.xml
So with a quick flick of the wrist I had a new service imported and enabled:
# svccfg import /var/svc/manifest/network/pserver-tcp.xml
# svcadm enable svc:/network/pserver/tcp:default
Here's a link to the manifest
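
The root cause above was a service name in the inetd.conf-style file that didn't exist under that name in /etc/services. A quick pre-flight check before running inetconv might look like the sketch below. It's an illustration only: service_defined is a hypothetical helper, and it assumes the standard services layout of "name port/protocol [aliases...]" with # comments.

```shell
#!/bin/sh
# service_defined SERVICES_FILE NAME PROTO
# Hypothetical helper: succeeds if NAME (or an alias) is listed for PROTO.
service_defined() {
    awk -v name="$2" -v proto="$3" '
        { sub(/#.*/, "") }                  # strip comments
        NF < 2 { next }                     # skip blank or malformed lines
        {
            split($2, pp, "/")              # "2401/tcp" -> port, protocol
            if (pp[2] != proto) next
            for (i = 1; i <= NF; i++)       # field 1 is the name, 3+ are aliases
                if (i != 2 && $i == name) { found = 1; exit }
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Example: check the name used in the inetd.conf-style file.
# service_defined /etc/services pserver tcp || echo "pserver/tcp not in /etc/services"
```

This only reads the local file, so if your nsswitch.conf sends services lookups to NIS or LDAP it won't tell the whole story.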

Sunday Jul 24, 2005

Why is this only a warning?

Welcome to Monday! I'm jumpstarting an ultra-60 in our lab so I can test a bugfix when I see this message:
WARNING: /pci@1f,4000/scsi@3/sd@0,0 (sd2):
        Error for Command: load/start/stop         Error Level: Informational
        Requested Block: 0                         Error Block: 0
        Vendor: SEAGATE                            Serial Number: 9808500387
        Sense Key: Soft Error
        ASC: 0x5d (drive operation marginal, service immediately (failure prediction threshold exceeded)), ASCQ: 0x0, FRU: 0x45
That's a pretty serious-looking message, so why is it only a "WARNING" rather than an "ERROR"?

The answer comes from the routine gda_errmsg(..) which is in usr/src/uts/common/io/dktp/dcdev/gda.c starting at line 247. This routine calls gda_log(..) which is a wrapper around cmn_err(..). One of the parameters we pass to cmn_err(..) is the error level: CE_CONT, CE_NOTE, CE_WARN, CE_PANIC and CE_IGNORE (defined in usr/src/uts/common/sys/cmn_err.h). The gda_errmsg(..) routine passes CE_WARN (that's the first part of the message above) and CE_CONT (the rest of the message).

So what should I do about this message? Replace the disk immediately. There is no other option you can take. The message is that the drive's failure prediction threshold has been exceeded, so the drive's internal electronics are telling you that it's about to die. In my case this is a rather old 4gb Seagate disk, so I'm more than happy to get a new one in instead.

We don't pass CE_PANIC as an argument to gda_log(..) because we do not want to take out the system due to a (generally) online-resolvable issue. Of course if this is your boot disk you'd better take action right away, but Solaris isn't going to panic on you from this incident.

Moral of the story: don't ignore "WARNING" messages because they're only "WARNING"s and always read the full text of the message. It could really be an error.

Thursday Jul 14, 2005

When should you clean a DLT or LTO tape drive?

The answer to the question is only when the tape drive cleaning light is on.

Yesterday I was having a discussion with a frontline engineer about cleaning tape drives. Apparently a customer had had several replacements of their DLT tape drive and wanted to know what proactive measures they could take to avoid needing to replace their drive in the future. So I had a look at the messages which they had used to justify replacement, and in each case the sense key was "media error." This set alarm bells ringing, because if a DLT or LTO drive reckons there is a problem with the media then you either have a problem with that tape, or an environmental problem which makes the tape media a carrier (like a bacterium in a way).

The technology and intelligence which is designed into the DLT and LTO families is such that you should never need to use a cleaning tape. And if you do, it is only because the drive itself has detected that it needs a clean.

I remember a performance escalation a few years back where it turned out that the customer was running cleaning tapes through their DLT drives twice a week. This was completely unnecessary and had the decidedly unwanted effect of killing the drive's read and write performance as the heads were degraded. Replacing the drives was their only option.

Other customers have configured frequency-based cleaning in NetBackup or Solaris Backup / Legato Networker --- this is a complete no-no. Use the TapeAlert function of the drive and configure cleaning for "on-demand" only. If you keep your DLT and LTO drives in a minimally-dusty environment and don't throw your tapes around you should never need to use a cleaning tape.

Data Integrity, or, raid+cache+battery backup

In the last two weeks I've had two enquiries from customers about caching on their raid arrays. In both cases the customers said (words to the effect of) "we want to get the speed benefits from the array's cache, but we don't want to pay for battery backup." (or, "we don't want to replace the battery on our T3/6x20" ) In each case (and in any case like this) the answer is a resounding NO! Why do you think that Sun and other hardware raid vendors design in battery-backup? We worry about the availability and integrity of your data. You should too.

Thursday Apr 21, 2005

gpart saved my laptop

On Tuesday I bfu'd to the latest nightly build of Solaris next so I could take advantage of the boot re-architecture project integration. This went quite well except that I managed to corrupt my boot-archive through not paying attention at the right time and forgetting a step.... grrr. Once I'd fixed that problem (boot cdrom -s, mount -F ufs -o rw,logging /dev/dsk/c0d0s0 /mnt ; /mnt/sbin/bootadm update-archive -R /mnt ; sync ; umount /mnt ; reboot) I felt confident enough to go to the next stage, booting the competition's OS on my laptop.

I figured I should boot it to see what it thought was going on. That was ok, but running Partition Magic was when things went downhill fast. PM decided that my partition table had errors, and would I like it to fix them? I was really stupid at this point, and clicked yes.

BAD mistake.

Not only could I not boot back to MS-Windows, but I was unable to boot Solaris either...

Fortunately my desktop Solaris box was unaffected, so with a bit of digging I was able to find the System Rescue CD iso, pull it down, burn it and boot from it. That was great, but sfdisk and cfdisk both told me I had a bodgy partition table (duh! I knew that already!) and refused to help. By this point I was getting quite frantic, and googled again and again, eventually coming up with a hit on gpart.

I am very pleased to say that gpart saved my laptop. It was included on the linux System Rescue CD as /usr/bin/gpart.

Gpart has a scan option where it looks at where your partition table should be, and tries to interpret the data which it finds. I used this first, and wrote down exactly what it produced. Fortunately for me it matched what I remembered of my disk layout, so I re-ran it with the "-W" option to write the corrected partition table to disk.

Then deep breaths, sync, sync, sync, reboot..... grub menu.... YAY!!! I'm back to life!

Of course MS-Windows still won't boot properly -- gets to a certain point and hard-hangs, or just reboots the laptop entirely.... but that's a topic for another day.

Now I'm doing another backup of my data to a workstation in the office..... because you never know.

I'm also emailing the author of gpart to thank him for his utility, and request that he enhance the list of known partition types to include Solaris2 (== 0xbf by the way) which is what Solaris10 installations use now.

About

I work at Oracle in the Solaris group. The opinions expressed here are entirely my own, and neither Oracle nor any other party necessarily agrees with them.
