Tuesday Oct 03, 2006


I'll leave the write ups of the main sessions to those who are competing for the best CEC blog. I could run the risk of being accused of “brown nosing” if I wrote what I thought.

I managed to get to three excellent breakout sessions, the first of which was about SGRT, specifically how to use SGRT when dealing with performance escalations. Clive King and Venkat Ramani explained how to answer the questions:

“Where on the Object” and “Where in the life cycle” for a performance problem during situation appraisal. The simplicity and power of the tools and concepts used were stunning.

For the “Where on the Object” question, Clive demonstrated the difficulty of applying it to a software system where you have nothing physical to point to or touch. He simply had a diagram of how IO is done in Solaris showing all the levels from the user application down to the disk drive. The original diagram is in the Solaris Internals book. Using this you get the possible places for the answer to the “Where on the Object” question. By then explaining how different tools give you different timing points within that system, you can narrow down where the issue is on that diagram. So iostat(1M) can be used to see if the problem is below the target driver, typically sd(7D), and tnf probes to see if the problem is below the system call interface. Immediately you can narrow down the “Where on the Object” question to “above the system call interface” or “between the system call interface and the target driver” or “below the target driver”. Using dtrace(1M) on Solaris 10 you can dive in deeper.
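
As a rough illustration of that narrowing, here is a minimal sketch in Python. It assumes you have already decided, from iostat(1M) and the tnf probes, whether the delay is visible at each of the two timing points. The function and argument names are invented for this example and are not part of any tool.

```python
def where_on_the_object(slow_at_syscall: bool, slow_at_target_driver: bool) -> str:
    """Map two timing observations onto a layer of the IO diagram.

    slow_at_syscall: the delay is visible at the system call interface
        (for example from tnf probes).
    slow_at_target_driver: the delay is visible at the target driver,
        typically sd(7D) (for example from iostat service times).
    """
    if slow_at_target_driver:
        # The time is being lost below the lowest timing point we have.
        return "below the target driver"
    if slow_at_syscall:
        # Slow at the top timing point but not at the bottom one.
        return "between the system call interface and the target driver"
    # Neither timing point sees the delay, so it is in the application.
    return "above the system call interface"


print(where_on_the_object(slow_at_syscall=True, slow_at_target_driver=False))
```

Each extra timing point, such as one added with dtrace, would split one of these three answers further.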

All in all an enlightening talk.


Friday Jun 02, 2006

Sun Ray Keyboard problem solved.

Proper keyboards have the escape key next to the 1, the control key next to the 'A', shift next to the 'Z' and caps lock under the shift that is next to the 'A'. Unfortunately not everyone realises this so they have keyboards that result in much swearing if I hot desk to an incorrect keyboard.

The solution is so blindingly obvious that I am ashamed not to have worked it out before.

Take your keyboard with you. Sun Ray copes perfectly with having two keyboards and on the newer Sun Rays the USB port is on the front or near the front making it easy to plug in.

Plus if you are really clever you can use both keyboards at the same time!


Tuesday May 30, 2006

Another Sun Ray convert.


I could not agree more.


Wednesday May 03, 2006

Testing using remote power

I've blogged about the remote power systems we have in our labs before, here and here, however over the last few weeks I've been investigating how long it takes for the system to recover from a disk failure. The customer has a test case that involves pulling the drive and then seeing how long the application stalls for. Not a perfect test but a reasonable simulation for a drive failing. The goal is to have no more than a 30 second pause when the drive fails.

The trouble with this test is I need to pull the drive so I have to be in the lab and I'm not even in the same country as the test case.

If however I put the drive in a unipack and then arrange for that to be on remote power I can power off the drive remotely and automatically. This helps me as the test case is in Germany so I don't have to move the systems. It helps even more as I can now write a script that runs the test automatically in a loop. By doing this I can get this graph by running the test overnight while I sleep:

The 5 cases where we are over 30 seconds are a bit of a worry but the others show a nice curve, giving some confidence that in the usual run of things the failure time is actually less than 20 seconds. On inspection of the logs, the outlying results are caused by a disconnected command timeout on another target on the bus when the target is failed.
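
The arithmetic behind each point on such a graph is simple. This is a minimal sketch in Python, not the script that produced the graph: it assumes the application logs a timestamp (in seconds) each time an IO completes, takes the longest gap as the stall, and compares it with the 30 second goal. The function names and the sample numbers are made up for illustration.

```python
GOAL_SECONDS = 30.0  # the goal from the customer's test case


def longest_stall(completion_times):
    """Return the longest gap, in seconds, between consecutive IO completions."""
    ordered = sorted(completion_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps, default=0.0)


def runs_over_goal(stalls, goal=GOAL_SECONDS):
    """Return the stalls that exceeded the goal."""
    return [stall for stall in stalls if stall > goal]


# Invented sample: IO completes every second, then pauses when the drive loses power.
sample = [0.0, 1.0, 2.0, 20.5, 21.5]
print(longest_stall(sample))                 # 18.5
print(runs_over_goal([18.5, 31.0, 12.0]))    # [31.0]
```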


Wednesday Mar 15, 2006

SGRT Epiphany

I keep getting asked when engineers should use or should not use SGRT.

Well I've had an SGRT Epiphany, well perhaps not quite an Epiphany but a realisation. The realisation is so simple that once written down you wonder why anyone would bother. The realisation is this:

Once you have the solution, stop doing troubleshooting.

See I told you it was obvious.

When applied to SGRT this means if at the end of Situation Appraisal the solution is clear, stop. Only move on if you need to. Once you accept this it is absolutely clear we don't need to ask the question: “When should I use SGRT to solve a problem?” Instead the question becomes: “At what point do I stop using SGRT to solve a problem?” to which the answer is either when you have the solution or when you no longer need a solution.

So something that I do often, like crash dump analysis, is just situation appraisal. If the problem drops out I'll stop; if it does not I can move on.


Thursday Nov 17, 2005

Switching Sun Ray servers

I've been on a course so have been away from my normal desk and Sun Ray server. Since our Sun Ray server runs nevada and is not part of the main IT setup when I plug my card in I have to login to the IT system and then use the utswitch command to switch to my server.

Then each time I use a new Sun Ray I have to run utswitch again.

However there is a better way.

On logging into the IT system I do:

utaction -c "/opt/SUNWut/bin/utswitch -h enoexec"

Now when I plug my card in the IT system automatically switches me to enoexec, just like it was my local system. A small success.


Saturday Nov 05, 2005

Twice in two weeks

The end of the week was sent askew by me working from home on Thursday and failing to un-redirect my office phone when I should have stopped work.

So at 7:30pm the phone rings and it is an engineer in the US who is having a problem with running the AMD 64 binary of a program I look after. Fortunately it turned out the 32 bit version was all that was required for the particular test. However I then had the problem of my binary to deal with. It took a very late night, and the problem was resolved by not using gcc but instead using the Sun Compiler, which was not available when I did the original port.

That however means testing all the options, which turned up another bug (not in my code) that had to be investigated and reported.

So finally I got around to doing the testing I was supposed to be doing on Friday morning late on Friday afternoon, not a problem as the code is perfect, or not. Yet again I have to stop work on a Friday with a bug hanging over me until Monday. While She was out, taperty tap tap tap on the Sun Ray, found it and fixed it or so I hope.

I'm going to recover some of the "overtime" on Tuesday as She has booked me a personal shopper in an attempt to make me wear something that does not embarrass. I'm looking forward to it.


Tuesday Jun 21, 2005

Tim Uglow

Congratulations Tim. Tim has been promoted to be a Principal Engineer which in Sun is something special, being the first time you are faced with a peer review of your work and contribution to Sun.

Tim's contribution is huge and mostly hidden from the outside world in the bug tracking database as one of those engineers who given any problem will get to the bottom of it no matter what. No crash dump's secrets are safe from Tim. No obscure timing problem is obscure enough for Tim not to find it.

At the same time Tim tirelessly educates and mentors anyone who wants to know, and some who don't but should.

So congratulations Tim.

Monday May 09, 2005

Sun Ray hot keys

If you are an end user on a Sun Ray you may not have bothered with reading all the manuals, I know I have not. Someone just emailed me this link to a table of special key sequences that the Sun Ray understands.

Really useful.


Wednesday May 04, 2005

More Sun Ray questions

Another email about Sun Ray:

do you know how to set it so it doesn't automatically lock the screen when you take your card out (just coz makes for a much more impressive 'session portability' demo if you move the card to another sunray and the same screen instantly appears without any keyboard intervention)?

You are right, as a demo it certainly does. The answer is in the utxlock man page:

Note - Although some users may find screen locking an inconvenience, overriding it has security implications that should be obvious. Override at your own risk. A user may disable any screen lock behavior by setting the environment variable SUN_SUNRAY_UTXLOCK_PREF to NULL. Any other value will be used as a command line to use for invoking a screen lock command instead of the default behavior.

Remember to set your MANPATH to include /opt/SUNWut/man and run “catman -w -M /opt/SUNWut/man” so that “man -k” will do its stuff. If the keepers of my Sun Ray server are reading this can you please run “catman -w -M /opt/SUNWut/man” on our Sun Ray servers. Actually can you just install the new SRSS software please.


Friday Apr 22, 2005

Debugging programs that catch SEGV

This is just plain bizarre. I spent an hour or so this morning discussing a problem with a colleague via IM (and no Two Ronnies moments) with init in a loop taking a SEGV. This is strange as init has been around for a while and is generally well behaved and does not normally do this. Then this afternoon another colleague comes around to my desk (face to face contact without any technology to help, amazing) and asks me about how to debug a problem where, you guessed it, init is in a loop taking SEGV. The second looks less interesting as there is a race condition that can result in this failure and there are patches.

The first customer however has the patch and now has a multithreaded init program which threw us for a while. It looks like it becomes multi threaded when it pulls in some libraries that are multi threaded via the name service switch, nice.

Anyway back on topic. The top tip for debugging programs that catch SIGSEGV (like they are going to be able to recover.... (I know there are cases where catching SEGV is the right thing to do, but they are few and far between and prone to not producing the desired results)) is this:

Use “truss -S SEGV -t !all -p PID” to get the process to stop when it gets the signal. Then use gcore to collect your core file and use that to work out what has gone wrong.

I'm now waiting for the third question about init so that I can say “Init problems are like buses. None for months then three come along together”.

Hardening the ISP driver

The ISP card and driver work by having the driver poke a 32 bit quantity, the token, into a mailbox register on the card which uniquely identifies the particular IO that is being done. Then when the card has completed the IO it loads the same 32 bit token into a mailbox register that the driver can then read and map back to the IO that was being done so that it can be completed.

For some years I have been taking escalations where the system would crash and the crash dump showed various problems with the value that the driver had read back from that mailbox register. It would very occasionally be 0 or sometimes it did not map to a known IO that was outstanding. Invariably the problems were a one off. The systems would fail just once and sometimes the card would be replaced. However, since the failures were never seen again even where the card was not replaced, I began to doubt that replacing it was a good course of action.

The initial attempt to fix this used the fact that in the 64 bit port of Solaris a routine had been used to map the 32 bit values put into the mailbox registers into 64 bit values which then pointed to the real data structures. These routines were hardened to include parity bits so that any corruption could be detected, including the return of a 0. Then for my sins I got to port these changes back to Solaris 7 and 8.
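
To show why parity catches a returned 0, here is a minimal sketch in Python. The layout is an assumption made for the example (31 bits of payload plus one odd parity bit) and is not taken from the driver source. The function names are invented too.

```python
PAYLOAD_BITS = 31
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1


def add_parity(payload: int) -> int:
    """Return a 32 bit token: payload in the low 31 bits, odd parity bit on top."""
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload does not fit in 31 bits")
    ones = bin(payload).count("1")
    parity_bit = 0 if ones % 2 == 1 else 1   # make the total number of 1 bits odd
    return (parity_bit << PAYLOAD_BITS) | payload


def parity_ok(token: int) -> bool:
    """True if the 32 bit token has an odd number of 1 bits."""
    return bin(token & 0xFFFFFFFF).count("1") % 2 == 1


token = add_parity(0x1234)
print(parity_ok(token))          # True
print(parity_ok(0))              # False: an all zero mailbox value is rejected
print(parity_ok(token ^ 0x10))   # False: a single flipped bit is detected
```

A single parity bit only catches an odd number of flipped bits, and it says nothing about a token that is valid but stale.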

However the problems persisted. A very low volume of systems crashing but now we had a much better diagnostic view. We now knew that the token that was being returned was “good” in that it had good parity, but the systems still crashed because the driver thought that the IO had already been completed. Sometimes good programming can make diagnosis harder and this was another case where it did. Looking back through the internal structures of the ISP driver you can see all the tokens that it has used in the past so perhaps if it has used the token before either the driver or the ISP card could have been confused and there might be some smoking gun which would lead us to the solution.

Alas when I looked I saw the same tokens being used over and over again, sometimes there would just be one token! I had been knocked back by the excellent kmem_cache_alloc: the tokens were generated in the constructor when the structure was first allocated, so when the structure was freed it was returned to the cache and the token was not touched.
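
The effect is easy to reproduce with a toy object cache. This Python sketch is only a model of the idea that a constructor runs when an object is first created and not on every allocation. It is not the kmem implementation, and all the names in it are invented.

```python
import itertools


class ToyObjectCache:
    """Toy object cache: the constructor runs only when nothing can be reused."""

    def __init__(self, constructor):
        self._constructor = constructor
        self._free = []

    def alloc(self):
        if self._free:
            return self._free.pop()      # reused as-is, constructor not run again
        return self._constructor()

    def free(self, obj):
        self._free.append(obj)


_next_token = itertools.count(1)


def construct_command():
    # The token is assigned here, once, for the life of the cached object.
    return {"token": next(_next_token)}


cache = ToyObjectCache(construct_command)
first = cache.alloc()
cache.free(first)
second = cache.alloc()
print(first["token"], second["token"])   # 1 1
```

With one IO in flight at a time the history shows a single token, which is why it held no smoking gun.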

I discussed this with colleagues and we mulled over various solutions before realising there was a much better way to map the tokens to the structures. As with so many solutions once we had it it was obvious.

The driver has to keep a list of IO that are in the chip so that if the chip is reset it can complete the IO with an error, since that list is a fixed size it is in an array. If we use the array offset as part of the token we would then have some spare bits in the token where we can write a 16 bit, per array offset, sequence number making each token close to unique. Now we can spot a token that gets replayed or corrupted (we keep an XOR of the token as well as there happens to be another 8 bit register available on the ISP card).
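
A minimal sketch of that scheme in Python follows. The bit layout (slot in the low 16 bits, sequence number in the high 16 bits), the way the 8 bit check value is folded, and every name are assumptions made for the example. The driver's real layout is not described here.

```python
SLOT_BITS = 16
SLOT_MASK = 0xFFFF
SEQ_MASK = 0xFFFF


def xor_check(token: int) -> int:
    """Fold the four bytes of a 32 bit token into one 8 bit check value."""
    return (token ^ (token >> 8) ^ (token >> 16) ^ (token >> 24)) & 0xFF


class TokenTable:
    def __init__(self, size: int):
        self.commands = [None] * size   # IO currently in the chip, by slot
        self.sequence = [0] * size      # per slot sequence number

    def issue(self, slot: int, command):
        """Record the command in a free slot and return (token, check)."""
        if self.commands[slot] is not None:
            raise ValueError("slot busy")
        self.sequence[slot] = (self.sequence[slot] + 1) & SEQ_MASK
        self.commands[slot] = command
        token = (self.sequence[slot] << SLOT_BITS) | slot
        return token, xor_check(token)

    def complete(self, token: int, check: int):
        """Return the command for a good token, else raise ValueError."""
        if xor_check(token) != check:
            raise ValueError("corrupt token")
        slot, seq = token & SLOT_MASK, token >> SLOT_BITS
        if slot >= len(self.commands) or self.commands[slot] is None:
            raise ValueError("no outstanding IO for token")
        if seq != self.sequence[slot]:
            raise ValueError("stale or replayed token")
        command, self.commands[slot] = self.commands[slot], None
        return command


table = TokenTable(4)
token, check = table.issue(2, "io-a")
print(hex(token), table.complete(token, check))   # 0x10002 io-a
```

Reusing slot 2 gives a different token, so a late replay of the old one is rejected instead of completing the wrong IO. The sketch does not handle the sequence number wrapping back to a zero token, which a real driver would probably want to avoid.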

All of that made it into 10 and was back ported into the latest patches for 8 and 9.

For some reason the qus driver is almost identical to the ISP driver so the same changes had to be made to that driver as well. Last night I finally finished the putback so with luck patches for 10, 9 and 8 will be available soon.

Wednesday Apr 20, 2005

Alan's rules

I just read Alan's rules for conference calls with customers and the only thing I would say is that these are not rules for conference calls with customers but rules for all your technical dealings, internal or external.

This is why I love working at Sun. I suspect a lot of the people I work with abide by these rules, I know I do, but Alan blogged them so they are his.

Tuesday Apr 19, 2005

Real Player on my laptop

I now have the UNIXware realplayer 8 running on Solaris 10 using the instructions from here. So I can now listen to Scott as well as read the slides. Getting it to compile using the version of gcc that is bundled with 10 was trivial: I just had to get the declarations in the file to match those in the headers. However the plugin for mozilla will not link as the original one from real was built with gcc 2.x. So it was off to blastwave to get the 2.95.3 gcc compiler via the very excellent pkg-get command, then following the instructions again, and all works. Next I think I will pull the helix community sources and see if I can get that built for Solaris 10 x86.

One strange thing I now need to investigate is why on the Toshiba M2 the speakers are actually controlled by the line out setting.


Monday Apr 18, 2005

I'm Tuned in and Network the Dog

Thank you to Dave for posting this link to the analysts presentations. I did not get beyond page 10 of Scott's presentation before I was completely tuned in. “Sun DB” I wonder what that will be, alas I have no inside knowledge of this but the rumours are now external.

Then on page 16 is the list of Sun misses, which includes “Network the Dog”, which I'm glad is recognised as a miss. The rough translation into English of the Network being a dog was that it was slow as a dog. Though it did give one of the best T-shirts that was made at Sun: the “Buy our computers or the Dog gets it” shirt, with a gun pointing to the head of the Dog. Alas I no longer have mine. Hopefully the person who made it still has the artwork and will share it on his blog.

Saturday Apr 16, 2005

I want a new Sun Ray

I want one of these. The coolest feature is that the smart card will work either way up.

Thursday Apr 14, 2005

400 days and you are history

As Calum mentioned we are about to have all our email over 400 days old deleted unless we archive it. After my initial “Oh my God this is mad”, the idea is growing on me. Rather than deleting any email at all (which I never really do as it just goes in a huge Trash folder that I never empty) I can now just let it run into the 400 day limit. If it is really important slot it straight into the archive, but only if it is really important, like an email from Jonathan saying: “Great Blog”, well I can dream.

However like Calum I have a bit of a problem with the backlog of old email. There is 17 years' worth and my attempt to count them is still running. I'm tempted just to take them all and throw them on a tape. Then when the tape is 400 days old and I've not looked at it, bin it.

For stuff I write, since it is bound to be of Earth shattering importance, or more likely technical, so its relevance could exceed 400 days, I think I will just put it on my blog.

Tuesday Apr 12, 2005

More group aggregation

Our internal departmental blog aggregator is up and running thanks to the planetplanet.org software, a twiki running on the Sun webserver and a very short nawk script. I know it should have been perl, but as is so often the case nawk will do. Specifically this nawk will do:

nawk '$0 == "%META:TOPICPARENT{name=\"TWikiUsers\"}%" {
        user = 1        # this page is a user home page
}
$1 == "*" && $2 == "Name:" {
        sub("[ \t]*\\*[ \t]Name:[ \t]", "")
        n = $0          # remember the user name
}
user && /\* PlanetPts [Ff]eed/ {
        printf("\n[%s]\nname = %s\n", $NF, n)}'

Users just have to put an entry in their home page on the twiki, the nawk script sucks those entries out and writes the config file for the planet software that then aggregates them all. They then all get displayed within the same twiki thanks to the %INCLUDE% variable so that they appear to be part of the main web.

I'm sure there are better ways to do this and with a bit more infrastructure help it would be cool to be able to register your blog feed in LDAP and then have an aggregator use that to produce planets based on reporting structures. However for us I think this will do for now.

Monday Apr 11, 2005

kmem_flags 7f on Sun Ray server

The quest to run the latest development build of Solaris on our Sun Ray server continues, this time with a twist. Just to try and flush some more bugs out of the system we have set kmem_flags to 7f. If you look in /usr/include/sys/kmem_impl.h you will see what this gets us (it has not changed since Solaris 10). We now have each allocation fitting in its own page and the Sun Ray server is running but is slower than is really acceptable. Looks like we have shown that there are no obvious kmem cock ups. I think we will DR in a new board, which we have to get the fastest parts in the lab warm, and if it survives that unset the kmem_flags (I wonder if we can do that dynamically), if not then we can see if it survives until the end of the day. We can't really leave it like this as, as Tim just said: “It is a bit sedate”.
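
For anyone without the header to hand, this Python sketch decodes a kmem_flags value into flag names. The bit values are written from memory of kmem_impl.h and should be treated as assumptions to check against the header on your own system. The function name is invented.

```python
# Flag values as recalled from /usr/include/sys/kmem_impl.h.
# These are assumptions: check the header on your own system.
KMF_FLAGS = {
    0x01: "KMF_AUDIT",
    0x02: "KMF_DEADBEEF",
    0x04: "KMF_REDZONE",
    0x08: "KMF_CONTENTS",
    0x10: "KMF_STICKY",
    0x20: "KMF_NOMAGAZINE",
    0x40: "KMF_FIREWALL",
    0x100: "KMF_LITE",
}


def decode_kmem_flags(value: int):
    """Return the names of the flag bits set in value, lowest bit first."""
    return [name for bit, name in sorted(KMF_FLAGS.items()) if value & bit]


for name in decode_kmem_flags(0x7f):
    print(name)
```

If those values are right, 7f turns on everything below KMF_LITE, and the firewall bit is the one that spreads allocations across their own pages.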

Anyway top marks to the lab staff for giving this a go.

Friday Apr 08, 2005

New Printer

Got a new printer for the home PC. Finally went for a Canon Pixma iP4000 bringing an end to my relationship with Epson which seemed to involve large amounts of cash from me in return for printer cartridges and rapidly deteriorating print quality. Just for the record the Epson Stylus C70 worked well for many years and my only complaint is that I felt like I was Epson's cash cow so when it started to play up Epson were at a disadvantage during product selection. HP did not figure on the list. Canon won mainly as no one I spoke to had a bad word to say about them and that I find the IXUS 500 camera is excellent. So am hopeful that the printer will be of the same high standard.

It all connected up and worked fine on the PC but when I added it to the Qube for some reason the print queue would not come online until I stopped the printer server and restarted it, all from the web interface.

My remaining problem is that there does not seem to be a driver for this printer model for anything other than windoze, despite the documentation saying there is one for Mac OS X. So if anyone has a driver for the above that will work with cups or gimp or the native Solaris print system on Solaris 10 please let me know.


This is the old blog of Chris Gerhard. It has mostly moved to http://chrisgerhard.wordpress.com

