Wednesday Jul 08, 2015

Oracle Instant Client: now available in IPS

Over the last few years I've spent a fair amount of time working deep inside the ON (OS/Networking) consolidation, improving the build system (https://blogs.oracle.com/jmcp/entry/my_own_private_crash_n) and enhancing some general gate plumbing. One significant aspect of that plumbing was migrating our automated bug update system from Sun's Bugster to Oracle's bug database.

When Oracle's acquisition of Sun took effect we were still developing Solaris 11, so we got an exemption from the "thou shalt migrate to the one true bug tracking tool" edict. Once we had shipped Solaris 11, however, we had to get cracking on that migration. My small part in that process was writing a Python script to provide a gate tools interface. We needed to give engineers a way to check that their bugids, synopses and pre-integration state were correct, and to update bug states automatically on integration ("fix available"), backout ("fix failed") and build close (when the gate staff mark a bug as "fix delivered").

This was a surprisingly large amount of work, even though it resulted in only about 1500 lines of code (95% Python, 5% shell). The majority of the effort came from learning the database schema and its APIs - and for that I needed to use sqlplus.

Those of you who need to interact directly with Oracle databases will be familiar with this tool, and a great many people use the version that comes with their full Oracle database installation. There is another way of obtaining this tool - the Oracle Instant Client. Until now you needed to download the Instant Client from https://www.oracle.com/technetwork/database/features/instant-client/index-100365.html in zipped format, and unpack the bits you needed into a convenient location. Some mucking around with LD_LIBRARY_PATH was necessary, too.

As a follow-on from this bug service migration project I developed a hankering to see the Oracle Instant Client made available in IPS form, and I am delighted to announce that this is now possible with the Oracle Solaris 11.3 beta release for the 12.1.0.2.0 release of the Oracle Instant Client.

If you've downloaded the Oracle Instant Client from OTN, you will be aware that the zipfiles are split up into 32- and 64-bit versions of the basic libraries, sqlplus, the ODBC and JDBC supplements and the software development kit (sdk). What we are providing with our delivery is slightly different from what you'll find on OTN, because we've combined a few logically aligned packages into one:

pkg:/database/oracle/instantclient
pkg:/database/oracle/instantclient/jdbc-supplement
pkg:/database/oracle/instantclient/odbc-supplement
pkg:/developer/oracle/instantclient/sdk

There is also a pkg:/consolidation/instantclient/instantclient-incorporation which ties them all together. The contents of pkg:/database/oracle/instantclient almost completely match the OTN 'basic', 'sqlplus' and 'wrc' zipfiles - in both 32- and 64-bit versions. The sdk, odbc-supplement and jdbc-supplement packages match what is provided in the OTN zipfiles.

To install these packages, once you have set your solaris publisher to the Beta release repo, just utter


# pkg install pkg:/consolidation/instantclient/instantclient-incorporation

As newer versions of the Instant Client are released, we will update the version in https://pkg.oracle.com to match, and you will notice that the package FMRI tracks the Database release version (12.1) rather than the Solaris release.

We have also updated the runpaths in the libraries and binaries so there is no need for you to set LD_LIBRARY_PATH in a wrapper script - though you might find it useful to add /usr/oracle/instantclient/12.1/bin to your $PATH.

As a side note, you might find it useful to set ORACLE_HOME if you are going to build bindings such as cx_Oracle for Python or DBD::Oracle for Perl.

Finally, I could not have done this without the assistance of Chris Jones, the Instant Client program manager - thank you Chris!

Tuesday Apr 29, 2014

My own private crash-n-burn farm: using kernel zones for speedy testing

I've spent most of the last two years working on a complete rewrite of the ON consolidation build system for Solaris 12. (We called it 'Project Lullaby' because we were putting nightly to sleep). This was a massive effort for our team of 5, and when I pushed the changes at the end of February we wound up with about 121k lines of change over close to 6000 files. Most of those were Makefiles (so you can understand why I'm now scarred!).

We had to do an incredible amount of testing for this project. Introducing new patterns and paradigms for the complete Makefile hierarchy meant that we had to be very careful to ensure that we didn't break the OS. To accomplish this we used (and overloaded) the resources of the DIY test group and also made use of a feature which is now available in 11.2 - kernel zones.

Kernel zones are a type-2 hypervisor, so you can run a separate kernel in them. If you've used non-global zones (ngz) on Solaris in the past, you'll recall the niggle of having to have those ngz in sync with the global when it comes to SRUs and releases. A kernel zone runs its own kernel, so it does not have to stay in step with the global zone.

Using kernel zones offered several advantages to us: I could run tests whenever I wanted on my desktop system (a newish quad-core Intel Core i5 system with 32gb ram), I could quickly test updates of the newly built bits, I could keep the zone at the same revision while booting the global zone with a new build, and (this is my favourite) I could suspend the zone while rebooting the global zone.

Our testing of Lullaby in kernel zones had two components: #1 does it actually boot? and #2 assuming I can boot the kz with Lullaby-built bits, can I then build the workspace in the kz and then boot those new bits in that same kernel zone?

Creating a kernel zone is very, very easy:


limoncello: # zonecfg -z crashs12 create -t SYSsolaris-kz
limoncello: # zoneadm -z crashs12 install -x install-size=40g
limoncello: # zoneadm -z crashs12 boot

I could have used one of the example templates (eg /usr/share/auto_install/sc_profiles/sc_sample.xml) but for this use-case I just logged in and created the necessary users, groups, automount entries and installed compilers by hand. (Meaning pkg install rather than tar xf).

To start with, I ensured that crashs12 was running the same development build as my global zone, but I removed the various hardware drivers I had no need for.

The very first test I ran in crashs12 was a test of libc and the linker subsystem. Building libc is rather tricky from a make(1s) point of view, due to having several generated (rather than source-controlled) files as part of the base. The linker is even more complex - there's a reason that we refer to Rod and Ali as the 'linker aliens'! Once I had my fresh kz configured appropriately, I created a new BE, mounted it, then blatted the linker and libc bits onto it and rebooted. I was really, really happy to see the kz come up and give me a login prompt.

Several weeks after that we got to the point of successful full builds, so I installed the Lullaby-built bits and rebooted:


root@crashs12:~# pkg publisher
PUBLISHER                        TYPE     STATUS P LOCATION
nightly                          origin   online F file:///net/limoncello/space/builds/jmcp/lul-jmcp/packages/i386/nightly-nd//repo.osnet/
extra     (non-sticky, disabled) origin   online F file:///space/builds/jmcp/test-lul-lul/packages/i386/nightly/repo.osnet-internal/
solaris   (non-sticky)           origin   online T http://internal/repo
root@crashs12:~# pkg update --be-name lul-test-1
root@crashs12:~# reboot

This booted, too, but I couldn't get any network-related tests to work. Couldn't ssh in or out. Couldn't for the life of me work out what I'd done wrong in the build, so I asked the linker aliens and Roger for help - they were quick to realise that in my changes to the libsocket Makefiles, I'd missed the filter option. Once I fixed that, things were back on track.

Now that Lullaby is in the gate and I'm working on my next project, I'm still using crashs12 for spinning up a quick test "system" and I'm migrating my 11.1 VirtualBox environment to an 11.2 kernel zone. The 11.2 zone, incidentally, was configured and installed in about 4 minutes using an example AI profile (see above) and a unified archive.

Kernel zones: you know you want them.

Sunday May 24, 2009

Mild annoyance for snv_110 to snv_115

When LU'ing a buildbox from snv_110 to snv_115, I saw that 71 packages failed to add correctly to the new BE. The failure on pkgadd came with a message like this:



Doing pkgadd of SUNWgtk2 to /
28453 blocks
/a/var/sadm/pkg/SUNWgtk2/install/postinstall: /a/usr/share/desktop-cache/restart_fmri: not found
pkgadd: ERROR: postinstall script did not complete successfully

Installation of <SUNWgtk2> failed.
pkgadd return code = 1




The workaround for this was to pkgadd the SUNWdesktop-cache package to my new BE's root (/.alt.snv_115 in my case), and then run


yes |pkgadd -R /.alt.snv_115 -d /net/installserver/export/nv/x/115/Solaris_11/Product `cat var/sadm/system/data/upgrade_failed_pkgadds`

Wednesday Oct 31, 2007

Today is a very good day

 

 

Today I'm ecstatic to be able to announce that the S10 patches for our backport are finally available on sunsolve.sun.com. We've delivered PSARC 2006/703 MPxIO extension for Serial Attached SCSI, and (my personal favourite) PSARC 2007/046 stmsboot(1M) extension for mpt(7D).


The patches that you need to install are

sparc: 125081-10
(On sparc we recommend that you also install 127747-01, due to 6466248)

and

x86/x64: 125082-10

 

The full list of RFEs and bugs is as follows:


6443044 add mpxio support to SAS mpt driver
6502231 stmsboot needs to support SAS devices
6544226 mpt needs mdb module

6242789 primary path comes up as standby instead online even if auto-failback is enabled
6442215 mpt.conf maybe overwritten because filetype within SUNWckr package is 'f'
6449836 stmsboot -d failed to boot if several LUNs or targets map to same partition
6510425 properties "flow_control" and "queue" in mpt.conf are useless
6525558 untagged command unlikely to be sent to HBA during heavy I/O
6541750 CAM5.1.1b2: 2530, MPT2: Vdbench bailed out after I pull ctlr-A out
6545198 build should allow architecture-dependent class action scripts
6546164 stmsboot does not remove sun4u SMF service, erroneously lists parallel SCSI HBAs
6548867 mpxio-upgrade script has fatally mis-defined variable
6550585 mpt driver has a memory leak in mpt_send_tur
6550591 mpt should not print unnecessary messages
6550849 WARNING: mpt TEST_UNIT_READY failure
6554029 mpt should get maxdevice from portfacts, not IOCfacts
6554556 stmsboot's privilege message is not quite correct
6556832 after ctlr brought online, some paths failed to come back
6560371 mpt hangs during ST2530 firmware upgrade
6566097 mpt: sd targets under mpt are not power-manageable
6566815 changes for 6502231 broke g11n in stmsboot
6531069 SCSI2 (tc_mhioctkown test cases) testing are showing UNRESOLVED results for ST2530
6546465 mpt: kernel panic due to NULL pointer reference in an error code path
6556852 mpt needs to support Sun Fire x4540 platform
6588204 mpt_check_scsi_io_error() incorrectly tests IOCStatus register
6588278 mpt driver doesn't check GUID of LUN when the path online
6591973 panic in mdi_pi_free() when remapping devices
6613189 T125082-09 and T125081-09 don't work - missing misc/scsi module from deliverables



As an interesting side note, during the development process we stumbled across

6566270 Seagate Savvio 10k1 disks do not enumerate under scsi_vhci

You'll probably see this if you have a Galaxy or T2000/T1000 system. (Unfortunately you need a service contract to view the bug report due to its category).

 

And on a personal note, I'd like to thank the other members of our team for working so well together - with Greg in Melbourne, Javen and Dolpher up in Beijing, test teams in Beijing, Menlo Park, Broomfield and San Diego and yours truly in Sydney (and now Brisbane) - we have truly been a virtual team. I reckon we've demonstrated that physical distance does not get in the way of designing, developing, testing and (most importantly) delivering good software that provides solutions for our customers.

 



Saturday Sep 22, 2007

A load off my mind

This morning I awoke and saw that the gatekeepers had given me the ok to putback our wad of changes to the Solaris 10 patch and feature gates. The gates opened less than a day ago, US/Pacific time, so I'm very happy that we were amongst the first to get in.

So after a few hours on the phone with the rest of our team, making absolutely darned sure that we had everything correct and doing a last-minute merge+test with the feature gate, I putback. I don't think I've been this nervous since the day I got married!

We expect to see patches for this backport in about a month - a coupla weeks after the close of the patch gate for this build. I'm not sure what the patch IDs will be; when I find out I'll mention them here.

The PSARC fasttracks we integrated are

PSARC 2006/703 MPxIO extension for Serial Attached SCSI (SAS) on mpt(7D)
PSARC 2007/046 stmsboot(1M) extension for mpt(7D)

And the primary bugids are

6443044 add mpxio support to SAS mpt driver
6502231 stmsboot needs to support SAS devices
6544226 mpt needs mdb module

(bugs.opensolaris.org won't show the updates until tomorrow).

Our little project team is very happy and despite the fact that we're in Beijing, Brisbane and Melbourne I think we might all go off to a pub to celebrate.

Wednesday Jul 05, 2006

nge, ultra20, nevada .... no packets!


One problem that I've been having with my bleeding-edge commitment has been with the nge driver. I noticed that after my bfu gave me nge version 1.4, I couldn't get any packet responses when I pinged.

With v1.3 I could, so I logged a bug against the nge driver.

That was all well and good, except that I have absolutely *no* idea what to look for when debugging network issues. If snoop can't give me a clue, I'm stuffed.

Yesterday I bfu'd to the 2nd July nightly bits (which contained nge v1.6), rebooted, saw the same lack of packet response and did my nge shuffle. Update the boot archive, reboot, kaboom!

Turns out that there were some putbacks for GLD v3 which nge v1.3 isn't compatible with.

Slight problem for me then, because I couldn't get my network..... it was more of a notwork. Not good.

With the aid of Murayama's nfo driver (yay for usb storage!) I was able to determine that the problem wasn't actually with the nic driver, but either above that in the stack, or below it, in the hardware.

I have Brendan Gregg's DTrace Toolkit installed, so I ran dtruss on a ping to my gateway. That showed me that everything seemed to be working ok from the above-the-nic part of the issue. So that left the hardware itself.

Since I knew that nge v1.3 worked just fine, I was left to poke around in the hardware........ and the only thing I could find was in the bios for this box.

It turns out that there's a setting in one of the Advanced Settings pages, called "MAC Media Interface". Somewhere in my futzing around (I can't help myself, you know how it is), I'd set that particular item to "MII".

That's the wrong thing.

I actually needed that set to "RGMII", which stands for "Reduced Gigabit Media Independent Interface".

Once I'd done that, all my network stuff came good.

That's one setting I won't be playing with again!



Wednesday May 24, 2006

CVS Pserver... once more with feeling

A few weeks ago I received a blog comment that my manifest to enable CVS Pserver operation had disappeared. Of course with the large number of emails I get every day I filed it in my todo list... which quickly got covered over with other stuff.

Anyway, I couldn't recall where I'd stashed it, so I ran


$ svccfg export svc:/network/pserver/tcp > /tmp/cvspserver-tcp.xml

(I was surprised I could do that not as root, but anyway....) and I've uploaded the manifest to my b.s.c. resources area.

Sorry for the delay in getting the resource back. I'm really uncertain why or how it could have disappeared.

Wednesday Apr 05, 2006

Updated again - the CVS pserver manifest

I just got an email from a colleague within Sun regarding my CVS Pserver entry. He noticed that I had written

# svcadm import /var/svc/manifest/network/pserver-tcp.xml
# svcadm enable svc:/network/pserver/tcp:default


Now the keen-eyed amongst you will notice that svcadm doesn't have an "import" sub-command. The commands that we should be using here are in fact
# svccfg import /var/svc/manifest/network/pserver-tcp.xml
# svcadm enable svc:/network/pserver/tcp:default


which makes a world of difference!

I've updated the original entry to reflect this correction. I'm still getting referrer hits for it more than 6 months after posting.

Thursday Nov 17, 2005

How do I find out what my network device is?

I've been hanging out on #opensolaris-AT-irc.freenode-DOT-net a lot recently doing my bit to help people get over the initial hump of installing Solaris and OpenSolaris. This evening we've been talking about devices, specifically NICs, and figuring out what driver they need.

So how does one go about this if one has no idea what the driver should be? Well, start by running prtpicl -v and either pipe the output through /usr/bin/less or dump it to a file. Then you need to know what you're looking for: search for "Ethernet" or "Network" and you can't get too far off the track. That will appear in a stanza like this:
                 pci1458,e000 (obp-device, 187220000034b)
                  :DeviceID      0xb
                  :UnitAddress   2
                  :device-id     17184
                  :vendor-id     4523
                  :revision-id   19
                  :class-code    131072
                  :unit-address  b
                  :subsystem-id  57344
                  :subsystem-vendor-id   5208
                  :min-grant     23
                  :max-latency   31
                  :interrupts    1
                  :devsel-speed  1
                  :fast-back-to-back
                  :66mhz-capable
                  :power-consumption     01  00  00  00  01  00  00  00
                  :model         Ethernet controller
                  :compatible   (1872200000357TBL)
                   | pci11ab,4320.1458.e000.13 |
                   | pci11ab,4320.1458.e000 |
                   | pci1458,e000 |
                   | pci11ab,4320.13 |
                   | pci11ab,4320 |
                   | pciclass,020000 |
                   | pciclass,0200 |
                  :reg
 00  58  02  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
 10  58  02  02  00  00  00  00  00  00  00  00  00  00  00  00  00  40  00  00
 14  58  02  01  00  00  00  00  00  00  00  00  00  00  00  00  00  01  00  00
 30  58  02  02  00  00  00  00  00  00  00  00  00  00  00  00  00  00  02  00
                  :assigned-addresses
 10  58  02  82  00  00  00  00  00  00  00  f5  00  00  00  00  00  40  00  00
 14  58  02  81  00  00  00  00  00  94  00  00  00  00  00  00  00  01  00  00
 30  58  02  82  00  00  00  00  00  00  00  00  00  00  00  00  00  00  02  00
                  :pm-hardware-state     needs-suspend-resume
                  :devfs-path    /pci@0,0/pci10de,ed@e/pci1458,e000@b
                  :driver-name   skge
                  :binding-name  pci1458,e000
                  :bus-addr      b
                  :instance      0
                  :_class        obp-device
                  :name  pci1458,e000
See it, up there at :model Ethernet controller? Now the next property is the very important :compatible part. It's so important to this blog entry that I'll excerpt it:
                  :compatible   (1872200000357TBL)
                   | pci11ab,4320.1458.e000.13 |
                   | pci11ab,4320.1458.e000 |
                   | pci1458,e000 |
                   | pci11ab,4320.13 |
                   | pci11ab,4320 |
                   | pciclass,020000 |
                   | pciclass,0200 |
These strings are PCI Consortium identifiers. Let's walk through them one by one.

identifier                   which is what?
pci11ab,4320.1458.e000.13    vendor,device.subvendor.subdevice.revision
pci11ab,4320.1458.e000       vendor,device.subvendor.subdevice
pci1458,e000                 subvendor,subdevice
pci11ab,4320.13              vendor,device.revision
pci11ab,4320                 vendor,device
pciclass,020000              PCI Consortium device class, specific
pciclass,0200                PCI Consortium device class, general
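
To see how those strings relate to the numeric properties in the stanza, here is a minimal Python sketch that rebuilds the :compatible list from the decimal values that prtpicl -v prints. The function name compatible_names is hypothetical (it is not a Solaris interface), and the ordering simply mirrors the list shown earlier; treat it as an illustration of the naming pattern rather than the code the system uses.

```python
def compatible_names(vendor, device, subvendor, subdevice, revision, class_code):
    """Rebuild the :compatible entries from prtpicl's decimal properties.

    Hypothetical helper for illustration; not part of Solaris.
    """
    vd = "pci%x,%x" % (vendor, device)
    return [
        "%s.%x.%x.%x" % (vd, subvendor, subdevice, revision),
        "%s.%x.%x" % (vd, subvendor, subdevice),
        "pci%x,%x" % (subvendor, subdevice),
        "%s.%x" % (vd, revision),
        vd,
        "pciclass,%06x" % class_code,         # specific class
        "pciclass,%04x" % (class_code >> 8),  # general class
    ]


# Values taken from the skge stanza above (all decimal in prtpicl -v):
# vendor-id, device-id, subsystem-vendor-id, subsystem-id, revision-id, class-code.
names = compatible_names(vendor=4523, device=17184, subvendor=5208,
                         subdevice=57344, revision=19, class_code=131072)
for name in names:
    print(name)
```

The point of the exercise is that every entry is just the same handful of numbers in hex, combined from most specific to least specific.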
Ok, that's all well and good, but how do I use that information? Well let's assume for a second that you want to find a network device in general. So searching through your prtpicl -v output you'll look for pciclass,0200. That will give you a pointer to the pci vendor,deviceid information, which you can then check /etc/driver_aliases for:
$ grep pci1458,e000 /etc/driver_aliases
skge "pci1458,e000"
This tells me that this particular pci identifier (pci1458,e000) is a device alias for the skge driver from SysKonnect. The example we came across on #opensolaris this evening was
                  :compatible   (1e4000001e7TBL)
                   | pci14e4,1677.1028.179.1 |
                   | pci14e4,1677.1028.179 |
                   | pci1028,179 |
                   | pci14e4,1677.1 |
                   | pci14e4,1677 |
                   | pciclass,020000 |
                   | pciclass,0200 |
which a quick search of /etc/driver_aliases reveals is actually a Broadcom nic which we supply a driver for. As it happens, for this particular system we had to specify more than just the vendor,deviceid:
# update_drv -a -i ' "pci14e4,1677.1028.179" ' bge
This reported
warning: driver (bge) successfully added to the system but failed to attach
which was a pain, but progress. So we asked this new user to run
# svcadm clear svc:/network/physical:default
because it was showing as 'maintenance'.... and suddenly all that we had left was a piddly little routing problem. Joy!

Now if we didn't supply a driver, the thing to do would be to Goooooooogle for the pci vendor,deviceid string and "solaris driver" -- for NICs you'll frequently come up with a hit for Masayuki Murayama's collection of drivers.

There is a really nifty utility available for linux called lspci which will let you see what you've got installed in your system and on your motherboard. It makes use of a file of PCI identifiers. If you're running XOrg you can run /usr/X11/bin/scanpci instead.... I'm not sure whether the device mappings are hard-coded though.
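
The driver_aliases lookup from earlier in this entry can also be sketched in a few lines of Python. This is only an illustration under a couple of assumptions: find_driver is a hypothetical name, the sample text is made up in the same shape as /etc/driver_aliases (one driver followed by a quoted alias per line), and the compatible names are tried from most specific to least specific, in the order prtpicl lists them.

```python
def find_driver(aliases_text, compatible):
    """Return (alias, driver) for the first compatible name with an alias.

    aliases_text is the content of a driver_aliases-style file, one
    'driver "alias"' pair per line. Hypothetical helper for illustration.
    """
    aliases = {}
    for line in aliases_text.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2 or fields[0].startswith("#"):
            continue  # skip blank, malformed or comment lines
        driver, alias = fields
        aliases[alias.strip().strip('"')] = driver
    for name in compatible:  # most specific name first
        if name in aliases:
            return name, aliases[name]
    return None


# Made-up sample data in the same shape as /etc/driver_aliases.
sample = 'skge "pci1458,e000"\nbge "pci14e4,1677"\n'
match = find_driver(sample, ["pci11ab,4320.1458.e000.13",
                             "pci11ab,4320.1458.e000",
                             "pci1458,e000",
                             "pci11ab,4320"])
print(match)
```

If nothing matches, the function returns None, which is the cue to go searching for a third-party driver.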

Saturday Nov 05, 2005

The dangers of BFU

I pulled across build 26 and the latest nightly bfu archives to upgrade my laptop and workstation. The laptop bfu went swimmingly... once I remembered to remove my existing SUNWzfs package installation and install the binaries from the bfu archive.

Unfortunately I had a problem with my workstation when I updated it this morning. For some reason my conflict resolution procedure hadn't quite worked. I was getting panics on boot before being able to mount /..... so my type-6 usb keyboard driver wasn't being loaded and I couldn't do anything in kmdb like look at the panic string because the stack was more than 24 lines long! Bit of a problem there.

So I booted off my installation dvd, mounted my root partition under /mnt and had a trawl through /mnt/etc. There were two files listed in the bfu conflicts report which I thought I had fixed up correctly: /etc/name_to_major and /etc/driver_aliases.

Normally I use a procedure something like this to munge the new version and my installed version of those files:


# awk '{print $1,$2}' etc/name_to_major /etc/name_to_major |sort -k2,2n |uniq > /tmp/name_to_major
[ fire up vi on /tmp/name_to_major, check that everything looks correct ]
# mv /tmp/name_to_major /etc/name_to_major

And that's been successful for months.

Not this time, and the "fault" (if any should be ascribed) is in the package defaults for SUNWzfs. The bfu archive has the zfs driver use major number 182. Guess what existing driver used major number 182?

*** drum roll please ***

pci-ide

Yeeeeouuuch!

I figured this out by running


# awk '{print $2}' etc/name_to_major /etc/name_to_major |sort -n |uniq -c |grep -v "^ *1 "

which immediately showed me that there were two instances of 182. A quick edit of /etc/name_to_major followed by


# /sbin/bootadm update-archive -v -R /mnt

got me back to my happy place.

Lesson to be learnt: always double-check your /etc/name_to_major file for major number conflicts. Save yourself downtime and keep some of that hair on your head!
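
For the double-check itself, here is a minimal Python sketch as an alternative to the awk pipeline above. It assumes the usual "driver-name major-number" layout of name_to_major; the function name major_conflicts and the sample data are hypothetical. Unlike a plain count of major numbers, it only reports a number when two different driver names claim it, so identical entries appearing in both files are not flagged.

```python
from collections import defaultdict


def major_conflicts(text):
    """Map each major number claimed by more than one driver to those drivers.

    text is the concatenated content of one or more name_to_major files.
    Hypothetical helper for illustration.
    """
    owners = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("#") or not fields[1].isdigit():
            continue  # skip blank, comment or malformed lines
        name, major = fields[0], int(fields[1])
        owners[major].add(name)
    return {major: sorted(names) for major, names in owners.items()
            if len(names) > 1}


# Made-up sample: the installed file and the bfu archive's file, concatenated.
installed = "sd 32\npci-ide 182\n"
from_bfu = "sd 32\nzfs 182\n"
conflicts = major_conflicts(installed + from_bfu)
print(conflicts)
```

An empty result means that no major number is shared between two different drivers.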

Thursday Sep 29, 2005

usb-attached storage and a handy hint

A few weeks ago the disk in my laptop died on me. Started making this harsh clicking sound, everything totally locked up. Now I did have backups for a lot of my stuff, but not all of it.... and my procedure is to backup at the end of the day. This happened right after lunch. Doh!

I dashed across the road from the office to get a replacement disk from the local bits-n-pieces shop, along with an external usb-attached enclosure, and that night did my damnedest to get any data whatsoever off that disk. No luck at all. A veritable piece of rotating rust it surely was.

So I installed build 20 (current at the time), and bfu'd to whatever the nightly was, and tried to get my config and data back to some semblance of useful state. I also tried to mount the usb-attached disk (the dodgy one now in the external casing), but no dice.

It wasn't until I read a post from FritS that I remembered a setting to use with scsa2usb(7D), which is what usb storage attaches with. In /kernel/drv/scsa2usb.conf is a lot of documentation about how to work around the various, um, "implementation details" that we come across with consumer-grade hardware like usb enclosures. The bit that made it all work for me was this:
#       reduced-cmd-support     - "true" if the device cannot handle
#               mode sense, start/stop, and doorlock.
#               This is the only legal value for this parameter.
#
So I need to have the following line in my .conf file:
attribute-override-list="vid=0x402 pid=0x5642 reduced-cmd-support=true";
Then a quick disconnect, modunload, update_drv
# echo yes | cfgadm -c disconnect usb2/3
# modunload -i `modinfo |awk '/scsa2usb/ {print $1}'`
# update_drv -v scsa2usb
followed by a re-connect of the device, and joy oh joy I've got 60gb of usb-attached storage available for me. If there's a point to be made, it's this --- always read the manpage for your device driver, and check out its driver.conf file as well --- you might just learn something :-)

Sunday Aug 28, 2005

wireless on the train --- security.... what's that?

A few weeks ago I got an alpha version of our iwi driver --- for my Intel Centrino 2200bg adapter. There's a GPL'd driver available via SourceForge btw. While the developers only call it alpha, I've been using it with our Sun-on-Sun vpn solution for quite some time with absolutely no problems.

Anyway, this morning on the train up to the office I figured I'd see whether I could see any hotspots. Lo and behold, I could! One looked like the one attached to the mac user sitting 5 rows away and had no security on it; the other was, I think, downstairs in the train but did have security. [We have double-decker trains in Sydney :-( ]

It got me thinking --- why do people ignore security, especially with wireless? I was able to connect to that person's network (not that I did anything apart from that of course!) and theoretically I could have purloined that person's files. If that person didn't know about security logfiles or event logs, they could have absolutely no idea what had happened.

Do yourself a favour: if you're going to set up your own little wireless AP, at least set up some password control on connections, and preferably start using 128bit WEP as a minimum. Do you know who has access to your data? Are you sure?

Sunday Aug 21, 2005

An update on the cvs pserver manifest

Well somewhere along the great BFU way my manifest for the CVS pserver stopped working. It's been quite some time since I've had it working and what with work and uni (oh, and a holiday in Europe...) I didn't get back to it until today. I keep getting errors like these:
Aug 22 06:31:25 broken inetd[100282]: [ID 702911 daemon.error] Property 'name' of instance svc:/network/cvspserver/tcp:default is missing, inconsistent or invalid
Aug 22 06:31:25 broken inetd[100282]: [ID 702911 daemon.error] Property 'proto' of instance svc:/network/cvspserver/tcp:default is missing, inconsistent or invalid
Which was darned annoying, because my /etc/inet/inetd.conf and /etc/services looked just fine. So I plugged "inetconv invalid inconsistent fields" into sunsolve and got back an infodoc on Samba (contract-only unfortunately) and found a CR involving the libinetsvc.so library which inetconv(1M) uses. Joining the dots together I ran a series of trusses:
# truss -f -a -topen -u libsocket -u libinetsvc /usr/sbin/inetconv -n -i /tmp/pserver.conf
108712: execve("/usr/sbin/inetconv", 0x08047CD4, 0x08047CE8)  argc = 4
108712:  argv: /usr/sbin/inetconv -n -i /tmp/pserver.conf
108712: open("/var/ld/ld.config", O_RDONLY)             Err#2 ENOENT
108712: open("/lib/libscf.so.1", O_RDONLY)              = 3
108712: open("/usr/lib/libinetsvc.so.1", O_RDONLY)      = 3
108712: open("/lib/libc.so.1", O_RDONLY)                = 3
108712: open("/lib/libuutil.so.1", O_RDONLY)            = 3
108712: open("/lib/libsocket.so.1", O_RDONLY)           = 3
108712: open("/lib/libnsl.so.1", O_RDONLY)              = 3
108712: open("/lib/libmd5.so.1", O_RDONLY)              = 3
108712/1:       open("/tmp/pserver.conf", O_RDONLY)             = 3
108712/1:       open64("/var/run/name_service_door", O_RDONLY)  = 4
108712/1@1:     -> libinetsvc:get_prop_table(0x803fba8)
108712/1@1:     <- libinetsvc:get_prop_table() = 0xfef752a0
108712/1@1:     -> libinetsvc:put_prop_value(0x806b6b8, 0x80664e8, 0x8068768)
108712/1@1:     <- libinetsvc:put_prop_value() = 0
108712/1@1:     -> libinetsvc:put_prop_value(0x806b6b8, 0x80664e0, 0x8068764)
108712/1@1:     <- libinetsvc:put_prop_value() = 0
108712/1@1:     -> libinetsvc:put_prop_value(0x806b6b8, 0x80664ac, 0x80687a8)
108712/1@1:     <- libinetsvc:put_prop_value() = 0
108712/1@1:     -> libinetsvc:put_prop_value(0x806b6b8, 0x80664a4, 0x8068798)
108712/1@1:     <- libinetsvc:put_prop_value() = 0
108712/1@1:     -> libinetsvc:put_prop_value(0x806b6b8, 0x806649c, 0x80687b8)
108712/1@1:     <- libinetsvc:put_prop_value() = 0
108712/1@1:     -> libinetsvc:valid_props(0x806b6b8, 0x0, 0x0, 0x0)
108712/1@1:       -> libsocket:getservbyname_r(0x8068828, 0x8068848, 0x803eef0, 0x803ef00)
108712/1:       open("/etc/netconfig", O_RDONLY|O_LARGEFILE)    = 5
108712/1:       open("/dev/udp", O_RDONLY)                      = 5
108712/1:       open("/dev/udp", O_RDONLY)                      = 5
108712/1:       open("/etc/nsswitch.conf", O_RDONLY|O_LARGEFILE) = 5
108712/1:       open("/lib/nss_files.so.1", O_RDONLY)           = 5
108712/1:       open("/etc/services", O_RDONLY|O_LARGEFILE)     = 5
108712/1@1:       <- libsocket:getservbyname_r() = 0
108712/1@1:       -> libsocket:getservbyname_r(0x8068828, 0x0, 0x803eef0, 0x803ef00)
108712/1:       open("/etc/services", O_RDONLY|O_LARGEFILE)     = 5
108712/1@1:       <- libsocket:getservbyname_r() = 0
108712/1@1:     <- libinetsvc:valid_props() = 0
inetconv: Error /tmp/pserver.conf line 1 invalid or inconsistent fields: service-name protocol
108712/1@1:     -> libinetsvc:free_instance_props(0x806b6b8)
108712/1@1:     <- libinetsvc:free_instance_props() = 0xfed62000
There it is again - that error which just looked out of place. At this point I thought that perhaps I should check out /etc/services. Sure enough, I had a different service name (cvs). Changing that to match my inetd.conf-like file and re-running gave me the manifest I wanted. Here's the tail end of the truss:
108718/1:       open("/etc/services", O_RDONLY|O_LARGEFILE)     = 5
108718/1@1:       <- libsocket:getservbyname_r() = 0x803eef0
108718/1@1:       -> libsocket:getservbyname_r(0x8068828, 0x0, 0x803eef0, 0x803ef00)
108718/1:       open("/etc/services", O_RDONLY|O_LARGEFILE)     = 5
108718/1@1:       <- libsocket:getservbyname_r() = 0x803eef0
108718/1@1:     <- libinetsvc:valid_props() = 1
108718/1@1:     -> libinetsvc:free_instance_props(0x806b6b8)
108718/1@1:     <- libinetsvc:free_instance_props() = 0xfed62000
108718/1:       open("/var/svc/manifest/network/pserver-tcp.xml", O_WRONLY|O_CREAT|O_EXCL, 0644) = 5
pserver -> /var/svc/manifest/network/pserver-tcp.xml
So with a quick flick of the wrist I had a new service imported and enabled:
# svccfg import /var/svc/manifest/network/pserver-tcp.xml
# svcadm enable svc:/network/pserver/tcp:default
Here's a link to the manifest
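The mismatch described above (the service name in the inetd.conf-style file not matching an entry in /etc/services) can be checked before running inetconv. The following is a minimal Python sketch, not anything inetconv itself does: the function names are hypothetical, the sample services entry is illustrative only, and it assumes that matching on the official name or an alias for the given protocol is a fair approximation of the getservbyname_r() lookup seen in the truss output.

```python
def service_known(services_text, name, proto):
    """Return True if name/proto appears in services(4)-format text.

    Assumption: an entry matches if `name` is the official name or an
    alias and the protocol after the slash equals `proto`.
    """
    for raw in services_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # drop comments, tokenize
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        entry_proto = fields[1].split("/", 1)[1]
        names = [fields[0]] + fields[2:]  # official name plus aliases
        if entry_proto == proto and name in names:
            return True
    return False


def check_inetd_line(inetd_line, services_text):
    """Check the service-name and protocol fields of one inetd.conf-style line.

    Field order assumed: service-name endpoint-type protocol wait-status ...
    """
    fields = inetd_line.split()
    name, proto = fields[0], fields[2]
    return service_known(services_text, name, proto)


if __name__ == "__main__":
    # Illustrative data only: a services file that names the service "cvs".
    sample_services = "cvs 2401/tcp   # sample entry\n"
    line = "pserver stream tcp nowait root /usr/bin/cvs cvs"
    print(check_inetd_line(line, sample_services))
```

With the sample data the check fails, which corresponds to the "invalid or inconsistent fields: service-name protocol" error; renaming the services entry to match makes it pass.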

Sunday Jul 24, 2005

Why is this only a warning?

Welcome to Monday! I was jumpstarting an Ultra 60 in our lab so I could test a bugfix when I saw this message:
WARNING: /pci@1f,4000/scsi@3/sd@0,0 (sd2):
        Error for Command: load/start/stop         Error Level: Informational
        Requested Block: 0                         Error Block: 0
        Vendor: SEAGATE                            Serial Number: 9808500387
        Sense Key: Soft Error
        ASC: 0x5d (drive operation marginal, service immediately (failure prediction threshold exceeded)), ASCQ: 0x0, FRU: 0x45
That's a pretty serious-looking message, so why is it only a "WARNING" rather than an "ERROR"?
The answer comes from the routine gda_errmsg(..) which is in usr/src/uts/common/io/dktp/dcdev/gda.c starting at line 247. This routine calls gda_log(..) which is a wrapper around cmn_err(..). One of the parameters we pass to cmn_err(..) is the error level: CE_CONT, CE_NOTE, CE_WARN, CE_PANIC and CE_IGNORE (defined in usr/src/uts/common/sys/cmn_err.h). The gda_errmsg(..) routine passes CE_WARN (that's the first part of the message above) and CE_CONT (the rest of the message).
So what should I do about this message? Replace the disk immediately. There is no other option. The message says that the drive's failure prediction threshold has been exceeded, so the drive's internal electronics are telling you that it's about to die. In my case this is a rather old 4GB Seagate disk, so I'm more than happy to get a new one in instead.
We don't pass CE_PANIC as an argument to gda_log(..) because we do not want to take out the system due to a (generally) online-resolvable issue. Of course if this is your boot disk you'd better take action right away, but Solaris isn't going to panic on you from this incident.
Moral of the story: don't ignore "WARNING" messages just because they're only "WARNING"s, and always read the full text of the message. It could really be an error.
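Reading the full text of the message can be automated. The sketch below is only an illustration in Python: it pulls the sense key and ASC out of a message shaped like the one above and flags ASC 0x5d, the failure prediction code shown in that message. The function names and the regular expressions are my own assumptions about the message layout, not part of any Solaris tool.

```python
import re

# ASC value reported in the message above for
# "failure prediction threshold exceeded".
ASC_FAILURE_PREDICTION = 0x5D


def parse_sense(message):
    """Return (sense_key, asc) from an sd-style warning, or None for missing fields."""
    key = re.search(r"Sense Key:\s*(.+)", message)
    # "ASC:" does not match "ASCQ:", so only the ASC value is captured.
    asc = re.search(r"ASC:\s*(0x[0-9a-fA-F]+)", message)
    return (
        key.group(1).strip() if key else None,
        int(asc.group(1), 16) if asc else None,
    )


def predicts_failure(message):
    """True when the message carries the failure prediction ASC."""
    _, asc = parse_sense(message)
    return asc == ASC_FAILURE_PREDICTION


if __name__ == "__main__":
    sample = (
        "WARNING: /pci@1f,4000/scsi@3/sd@0,0 (sd2):\n"
        "        Sense Key: Soft Error\n"
        "        ASC: 0x5d (drive operation marginal, service immediately "
        "(failure prediction threshold exceeded)), ASCQ: 0x0, FRU: 0x45\n"
    )
    print(parse_sense(sample), predicts_failure(sample))
```

A scan like this treats the message by its content rather than by its "WARNING" label, which is the point of the moral above.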

Thursday Jul 14, 2005

When should you clean a DLT or LTO tape drive?

The answer to the question is: only when the tape drive's cleaning light is on.
Yesterday I was having a discussion with a frontline engineer about cleaning tape drives. Apparently a customer had had several replacements of their DLT tape drive and wanted to know what proactive measures they could take to avoid needing to replace their drive in the future. So I had a look at the messages which they had used to justify replacement, and in each case the sense key was "media error." This set alarm bells ringing, because if a DLT or LTO drive reckons there is a problem with the media then you either have a problem with that tape, or an environmental problem which makes the tape media a carrier (like a bacterium in a way).
The technology and intelligence designed into the DLT and LTO families is such that you should never need to use a cleaning tape. And if you do, it is only because the drive itself has detected that it needs a clean.
I remember a performance escalation a few years back where it turned out that the customer was running cleaning tapes through their DLT drives twice a week. This was completely unnecessary and had the decidedly unwanted effect of killing the drive's read and write performance as the heads were degraded. Replacing the drives was their only option.
Other customers have configured frequency-based cleaning in NetBackup or Solaris Backup / Legato Networker: this is a complete no-no. Use the TapeAlert function of the drive and configure cleaning for "on-demand" only. If you keep your DLT and LTO drives in a minimally-dusty environment and don't throw your tapes around you should never need to use a cleaning tape.
About

I work at Oracle in the Solaris group. The opinions expressed here are entirely my own, and neither Oracle nor any other party necessarily agrees with them.
