Tuesday Nov 06, 2007

Three strikes, you're out (but it's only the first inning of a preseason game)

The Indiana prototype was released with the name "opensolaris developer preview".  Having installed it and (briefly) attempted to get work done on it, I'm even more convinced the name was vastly premature.

 opensolaris. nope.  it's the prototype output of a small implementation team and hasn't been through the rigor and wringer of the development process.  It's a fascinating and impressive start, but.. it's not opensolaris yet, and I'm skeptical of some of the design choices (admittedly, I always am..)

 developer.  I'm a developer so it should be for me to use?  I installed it.  I tried to run bugster to file a bug.  Nope, no java.  Tried to build some simple C programs.  Nope, no C compiler, no make.  Tried to add the binary nvidia driver so the display didn't look crappy -- got that working but it was an adventure -- pkgadd(1M) is busted so I had to pkgadd -R on a different system, rsync the bits over, and re-run the package's postinstall script a few times before it "took".

If you want to play with the new packaging system, make sure you have another system running SXCE handy to actually do builds.  And if you see something unexpected, please file bugs early and often.   If you actually want to get work done, you're better off with SXDE/SXCE until further notice.

preview.  When I hear "preview" I think "movie sneak preview" .. getting a look at the almost-final production before it comes out.  SXCE & SXDE are previews.  The Indiana prototype is like the raw footage viewed by the production team the day after it's shot, before the bloopers have been edited out.

Tim Foster's critique misses the point of the engineering objections.  Attaching this name to a pre-alpha snapshot demo which is simply not even close to done is a mistake.  It might make sense when Indiana is a little better baked, but IMHO, the current name attached to the current bits hurts the "brand".  IMHO the name should have been held in reserve until it was ready for community endorsement. 


Friday Nov 02, 2007

Accusations of "stop energy" are "stop energy"

So, there's this hot new newage-y (I think that rhymes with sewage-y) concept being bandied about by certain people called "Stop Energy".   It seems that it's been used primarily as a non-rebuttal rebuttal to project review comments that the reviewee would rather not deal with.

Many desirable properties of large systems are not localized to any one part of the system.  Security is a key example, but performance and availability are others.  Senior engineers in organizations which produce such systems often have an obsessive-compulsive streak -- because they have to.  In order to preserve or enhance that property, you need to get all the details right, and this often results in a long list of "stuff you got wrong" coming from reviewers.  It's easy to misread this as a message to just give up.

But it seems that the new thing is to instead accuse the reviewer of applying Stop Energy to the project. 

Based on what appears to be the canonical definition, it occurs to me that these accusations of "Stop Energy" are .. an exertion of Stop Energy against the reviewer.  The reviewer actually is trying to help, it's just that there's a breakdown of communication such that constructive criticism is interpreted as an attempt at stonewalling.  So, the frustrated reviewee counter-stonewalls, perhaps with this accusation.

A more constructive response is to honestly ask "okay, so what should I do?".  And then listen, and change your proposal accordingly.  Maybe the requirement you just learned about overlaps with your requirements in such a way as to produce a null solution set, so maybe you need to go back and adjust your requirements.

 UPDATE: Perhaps I should have been clearer.   My non-snarky view is that "Stop Energy", if it exists, only exists in the mind of the person who stops in response to criticism.  In a large project there isn't a single frame of reference in which you can declare some action unambiguously as "forward progress".   Reviewers often point out things where, in certain frames of reference, a proposed change is a big (or small) step backwards.  In the large, those reviewers are themselves charged with (say) improving the overall security of the system; and in that scope, one or more proposals that introduce more insecurity form a barrier to progress.  Reviewees often hear the "no, don't do it that way" part of the message and then tune out, missing the message about the requirement they overlooked, and so the subsequent conversation about alternatives fails.  As a result, a message intended by its speaker as "do it a little differently" is received as "don't do it at all".

Thursday Nov 01, 2007

Looking good, save for the name.

Ran into a few bugs installing the Indiana prototype.  

1) the installer got confused when I attempted to add the user "sommerfeld".  (an 8-character username limit is a figment of useradd's imagination).    I had to reboot and try again. 

 2) the lack of the nvidia binary driver in the distribution meant that it didn't cope with a 1920x1200 display.

but otherwise it installed with a zfs root in almost no time flat from CD (system refused to boot from a USB key).

It still needs a name change, though..

Premature naming.

So, a preview of the new packaging & install technology produced by Project Indiana was just released. I'm shortly going to be installing it on a spare system in my office just to give it a shot.

Unfortunately, it's being called the "OpenSolaris Developer Preview" and is being portrayed as a distinctly special binary distribution on the opensolaris home page. The name is unfortunate for a number of reasons:

  1. The vast majority of the changes have not yet received the typical design and architecture review received by Solaris components
  2. There is not yet community consensus that OpenSolaris should have a reference binary distribution
  3. There is certainly not yet consensus that the Indiana technology is the right tool for the job.

I hope the folks who chose this name despite ample warning that it would cause trouble quickly reconsider. And I hope that the poor choice of name doesn't deter people from giving it a try. But the choice of names is forcing something of a constitutional crisis within opensolaris.

Wednesday Feb 07, 2007

When a favorite restaurant closes

Valerie asks what she can do about a favorite restaurant which has lost its lease and will most likely need to move.

Don't Panic.

A while ago (must be over a decade ago by now), the canonical Chinese restaurant at the MIT end of Cambridge, Mary Chung's, lost its lease and was shut down for about a year before they found new space on the other side of Massachusetts Avenue. Mary's was open every day but Tuesday, though she took an annual one-week summer vacation (which was known as "the week of Tuesdays" to some of her regular patrons).

The Year of Tuesdays was painful for some but they came back from it stronger than ever in a better, larger space. Recently they were even one of the five Boston-area restaurants featured in an episode of The Hungry Detective on the Food Network.

There's not a heck of a lot you can do unless you've got connections in the commercial real estate arena, but there are a few things which come to mind:

  • Keep patronizing them until the bitter end.
  • Stay in touch with the proprietor during the shutdown period (easier with FdM than it was with Mary's since they have the secondary location).
  • Most likely there will be some amount of town-level zoning/licensing involved in the move. Generally the only people who comment on such matters are concerned abutters; statements in support of the applicants from satisfied customers will typically make a big impression on the licensing authority.

Tuesday Feb 06, 2007

Signs the DRM house of cards is collapsing.

I'm happy to see Steve Jobs' open letter to the music industry where he calls for the end of DRM on downloadable music.

I'm happy to say that I have on the order of 5200 tracks on my iPod, none of which were purchased from iTunes. I have a legitimate fair-use right to all of them. The vast majority were ripped from CDs I own and which I still possess. Some of the rest are podcasts (offered freely to all); some were MP3s of performances I participated in. None were downloaded from file sharing services.

Steve's open letter refers to "secrets" being the key to security. General principles of cryptography say that in secure systems, the only secrets should be changeable and limited in scope. The nature of DRM is such that you'll typically end up with the same set of secrets in every device/player which needs access to the plaintext content, which is what led to the collapse of the DVD CSS scheme and its followons for HD DVDs.

Time after time people learn the hard way that you can't effectively hide secrets in binary object code -- given enough time and digging it will be possible to dig any keys and algorithms out of the blob of code.

Monday Nov 20, 2006

if you thought lost bombs were bad, consider lost mustard gas..

Writing about the "Windows Genuine Advantage" program, Simon Phipps mentions the recent discovery of explosives underneath a British airfield and draws an analogy to anti-piracy "kill switches" embedded in software. While not directly analogous to a "kill switch", a couple of years ago I heard of a somewhat more astonishing case of leftover lurking horrors: in 1993, World War I-era mustard gas shells were discovered in what is now an affluent residential neighborhood of Washington, DC. As of this summer, the cleanup was still in progress.

Returning to the real target: I share Tim Bray's concerns. License enforcement by intentional denial of service has no business going into mission-critical software; we have a hard enough time coping with denial of service from unintentionally introduced "features".

Wednesday Nov 16, 2005

The End-to-end argument meets ZFS

I'm really a networking&security type at heart.  Why am I excited about ZFS?

Back when I was studying for a degree in computer science, I took what was then (and probably still is) the best undergraduate course in MIT's CS department: Computer Systems Engineering, better known as "6.033" or just "'033".

A major part of the course was a series of case studies -- we would read an important paper on a system, write a short analysis, and then discuss the system in class.

One of the key papers presented was Saltzer, Reed, and Clark's "End to End Arguments in System Design".

I'll quote the abstract:

This paper presents a design principle that helps guide placement of
functions among the modules of a distributed computer system.  This
principle, called the end-to-end argument, suggests that functions
placed at low levels of a system may be redundant or of little value
when compared with the cost of providing them at that low level.
Examples discussed in the paper include bit error recovery, security
using encryption, duplicate message suppression, recovery from system
crashes, and delivery acknowledgement.  Low level mechanisms to
support these functions are justified only as performance
enhancements.

The paper has spawned a lot of debate and more than a few followups over the years, and interminable arguments about what counts as an end, but overall I think it's held up pretty well.

Fast forward to a couple years ago when I first saw a high level overview of the ZFS design.  I immediately thought of this paper.

ZFS applies the end-to-end principle to filesystem design.  

End-to-end is normally applied to distributed systems, where two distinct "ends" are communicating with each other, often in real time or with relatively short delays.

Here, the "ends" are separated mainly by time: one "end" writes data to the filesystem, and the other "end" expects to get the exact same data back in the future.  (And the "middle" is the storage subsystem, which these days is itself a complex distributed system).
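As a rough sketch of what an end-to-end check across time could look like: the writing "end" records a checksum of the data, and the reading "end" recomputes it before trusting whatever came back from the storage "middle".  The function names and the simple Fletcher-style checksum are assumptions made purely for illustration; this is not ZFS code and says nothing about its on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fletcher-style running checksum over a byte buffer.  A stand-in chosen
 * for brevity; it is not the checksum any particular filesystem uses.
 */
static uint32_t
e2e_checksum(const uint8_t *buf, size_t len)
{
	uint32_t a = 0, b = 0;

	assert(buf != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		a = (a + buf[i]) % 65535u;
		b = (b + a) % 65535u;
	}
	return ((b << 16) | a);
}

/*
 * The "read end": accept the data only if it still matches the checksum
 * the "write end" computed before handing it to the storage stack.
 * Returns 1 on a match, 0 otherwise.
 */
static int
e2e_verify(const uint8_t *buf, size_t len, uint32_t expected)
{
	return (e2e_checksum(buf, len) == expected);
}
```

The point of the sketch is only where the check lives: at the two ends, so that nothing in between has to be trusted to get every bit right.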

By placing the functionality required for robustness at a relatively high layer within the storage stack, ZFS can perform these functions with reduced overall system cost; you can use a much simpler disk subsystem to get a desired level of performance, availability and reliability.

For instance, the filesystem knows for sure which disk blocks are in use.  The disk doesn't.  If you replace a disk in a mirror or RAID-Z group, ZFS only needs to copy the blocks that are currently in use to the new disk; when lower layers are responsible for redundancy, you have to copy the whole disk.  With the upper layer responsible for redundancy, the repair takes less time, and your window of exposure to an additional failure can be significantly shorter.
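To make the difference concrete, here is a toy model of a rebuild that copies only live blocks.  The in_use[] array and the function name are hypothetical: a flat per-block flag is just the simplest way to stand in for "the filesystem knows which blocks are in use", and is not meant to describe how ZFS actually tracks that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy rebuild: copy only the blocks marked in use from the surviving
 * disk image (src) to the replacement (dst).  in_use[i] is nonzero if
 * block i holds live data.  Returns the number of blocks copied.
 */
static size_t
rebuild_in_use(const uint8_t *src, uint8_t *dst, const uint8_t *in_use,
    size_t nblocks, size_t blksz)
{
	size_t copied = 0;

	assert(blksz > 0);
	for (size_t i = 0; i < nblocks; i++) {
		if (in_use[i]) {
			memcpy(dst + i * blksz, src + i * blksz, blksz);
			copied++;
		}
	}
	return (copied);
}
```

A layer underneath the filesystem has no such map to consult, so its only safe choice is to copy all nblocks blocks; on a mostly empty disk the gap between the two is most of the disk.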

I'm hoping this leads to simpler (and cheaper) storage hardware in the long run -- JBODs seem to be ideal for ZFS, and you can take the battery-backed NVRAM out of the raid controllers and give it to the lumberjacks.


Thursday Oct 20, 2005

packaging svk

So, Adam, never fear..

I have two bits of tech in hand which will make deploying svk on solaris for development purposes pretty painless.

1) NetBSD's pkgsrc will build packages on solaris and handle chasing down the dozens of dependencies.  Currently it has SVK 1.00, but I've got diffs to the pkgsrc config to take it to 1.05 under review right now (three packages needed to be upgraded and two more needed to be added.  took me about an hour and a half last night).

2) there's a "gensolpkg" inside pkgsrc which will create solaris/SVR4-format packages.   it's a little rusty as it still assumes the 9-character package name limit, but that's easily repaired, and I should probably commit that fix as well..

toss them all into a single packagestream-format blob and we're all set.

Only real misfeature at this point is that pkgsrc insists on building its own copy of perl.   But given that we lock down most aspects of the build/development environment, and occasionally get hurt when we don't, this might be another case where we should just take the hit of another copy.


Tuesday Oct 18, 2005

Creative Hash Functions

Take a quick look at this macro definition.   Did you spot the bug?

Because of poor paren placement, the OUTBOUND_HASH_V6 macro in sadb.h computes a hash value other than the one that was intended, with the result that only a small number of outbound hash buckets are ever used.  Half end up in bucket 0.  All hash values have two low order bits of zero, then (going upwards) zero or more 1 bits, and then all zeros until the top of the word.  Distribution looks like:

value:         occurrences
       0         2147483648
       4         1073741824
       c          536870912
      1c          268435456
     ...                ...
1ffffffc                 16
3ffffffc                  8
7ffffffc                  4
fffffffc                  4

Needless to say, this distribution is awful: only 31 unique hash values, 50% of entries in one bucket, and 99% of hits in only 7 buckets.
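A distribution like this is easy to catch with a few lines of test code that count how many buckets a hash actually touches.  The sketch below is an illustration only: skewed_hash() is a stand-in I picked because it produces the same kind of pattern described above (two zero low bits, a run of ones, then zeros), not the sadb.h macro, and fold_hash(), buckets_used() and MAX_BUCKETS are likewise hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	MAX_BUCKETS	1024

/*
 * Stand-in for a badly skewed hash (NOT the real macro): the result
 * depends only on the number of trailing zero bits in x.
 */
static uint32_t
skewed_hash(uint32_t x)
{
	return ((uint32_t)(((x ^ (x - 1u)) >> 1) << 2));
}

/* A simple fold of the high half into the low half, for comparison. */
static uint32_t
fold_hash(uint32_t x)
{
	return (x ^ (x >> 16));
}

/*
 * Feed the inputs 0 .. ninputs-1 through hash() and count how many of
 * the nbuckets buckets (a power of two, at most MAX_BUCKETS) get hit.
 */
static size_t
buckets_used(uint32_t (*hash)(uint32_t), uint32_t ninputs, size_t nbuckets)
{
	unsigned char seen[MAX_BUCKETS] = { 0 };
	size_t used = 0;

	assert(nbuckets != 0 && nbuckets <= MAX_BUCKETS);
	assert((nbuckets & (nbuckets - 1)) == 0);	/* power of two */

	for (uint32_t x = 0; x < ninputs; x++) {
		size_t b = hash(x) & (nbuckets - 1);

		if (!seen[b]) {
			seen[b] = 1;
			used++;
		}
	}
	return (used);
}
```

With 256 buckets and the inputs 0 through 65535, the skewed stand-in touches 7 buckets while the simple fold touches all 256.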

Discovered this shortly before 10:30 this morning; filed bug 6338289; tested fix on x86 and sparc, code reviewed, and integrated into the development sources by 5:50pm this afternoon. 

UPDATE: In response to a comment: Yes, inline functions would be better here, but the compiler version we used during solaris 9 development didn't support them in C.    If we're going to revisit this code, a more likely mini-project here is to find all the various places within IP where we compute hash functions based on a protocol address, find the best one, and make that a common one used by all the address-based hash functions, possibly tossing in a key or equivalent as a defense against hash-bucket-clogging attacks.
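For what it's worth, here is roughly what an inline-function version could look like.  This is a sketch under my own assumptions, not the Solaris code: the name addr_hash_v6, the word-array argument and the folding steps are all made up for illustration, and it makes no attempt at the keyed variant.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical replacement for a hash macro: fold a 128-bit address,
 * passed as four 32-bit words, into a bucket index.  nbuckets must be
 * a power of two.  Because this is a function, each step is its own
 * statement and the mask visibly applies to the whole folded value.
 */
static inline uint32_t
addr_hash_v6(const uint32_t w[4], uint32_t nbuckets)
{
	uint32_t h;

	assert(nbuckets != 0 && (nbuckets & (nbuckets - 1)) == 0);

	h = w[0] ^ w[1] ^ w[2] ^ w[3];	/* combine all four words */
	h ^= h >> 16;			/* fold high bits downward */
	h ^= h >> 8;
	return (h & (nbuckets - 1));	/* mask last, over everything */
}
```

Compared with a macro, there are no argument parentheses to misplace, and the compiler type-checks the arguments.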

Thursday Sep 29, 2005

And something resembling a root cause analysis.

The Prius saga continues. Toyota sent the NHTSA a complete reply on August 26th.

The meat is in Responses 8 and 12.  It appears that Toyota released a patch in October 2004 which fixed a firmware bug - apparently the stall occurred when the firmware thought the engine wasn't taking in enough air, but the "not enough air" threshold was set too high.  Some of the details are in attachments that were not made public, but it's now clear that they're confident they understand the cause of the stall:

"Under certain circumstances, the engine ECM incorrectly determines that the gas engine is experiencing a failure to start when the engine intake air volume is lower than the ECM's programming criteria.  In this condition, the gasoline engine will not start (because the ECM believes it cannot) and the vehicle will go into a fail-safe mode of electric-only operation.  In conjunction with the ECM misjudgement, the warning lights ... will be illuminated when this occurs."

and there are two relevant fixes.  The first one was released as part of "Special Service Campaign 40A" in October 2003:

"Due to a programming error, if the vehicle is restarted in the "fail-safe" mode, a secondary condition may occur where the vehicle transmission may not operate smoothly."

Subsequently, they released TSB EG047-04:

 "Toyota discovered a software error within the engine intake air volume criteria ... Toyota developed a revised software version and introduced this software along with reprogramming methodology in a TSB in the middle of October 2004"

What's perhaps a bit strange is that the first bug and a third unrelated (and seemingly trivial) defect were the subject of two different "special service campaigns" where they actively asked customers to bring in their cars for a firmware upgrade, but the seemingly more critical bug (the apparent proximate cause of the stalls) is only subject to a TSB, which appears to be a "fix it if the customer complains" reactive patch.  If I buy a Prius I guess I'll feel obligated to check for TSBs on a regular basis...

Tuesday Sep 06, 2005

How not to sell me a firmware-driven product...

Well, start off your sales pitch by describing how easy it is to reboot the product, and by talking about how I can avoid trips to the repair shop by rebooting it.

I was pretty close to being willing to put down a deposit on a Prius to replace my Saturn, but now I'm off doing a "due diligence" of a sort.  What I've learned so far: there's a software defect which causes the gasoline engine to shut off, which may have been fixed in a firmware upgrade.  The NHTSA's Office for Defect Investigation is on the case (investigation PE05029) but hasn't yet released a final report.  Some of the documents filed by Toyota in response to the ODI's request for investigation have been made available, but there's not that much "meat" in the main document of July 22nd -- which promises follow-on updates on August 5th and/or 26th which don't seem to be available from ODI just yet.

One friend of mine who has a Prius has experienced this stall condition, and then had the firmware upgrade which may -- or may not -- fix it.  He hasn't had a stall since the firmware upgrade but, well, anecdotes are not data.

I'm not so much worried that there are bugs in the firmware.   Of *course* there will be bugs in any software system of nontrivial complexity.  But are they set up to diagnose and fix defects found in the field by customers?  Instructing customers to "just hit ctrl-alt-del and drive on" doesn't sound consistent with an attitude towards software quality which will get those defects fixed.  I hope this particular sales guy is an outlier.

Given the limitations of repair shops, perhaps software-controlled cars like the Prius should be equipped to "phone home" with the moral equivalent of a crash dump whenever anything odd happens....

Thursday Aug 25, 2005

How to destroy a brand: Saturn is dead.

As far as I'm concerned, GM's Saturn line is dead.

Some years ago, my parents' (non-GM) car caught fire in their garage due to a defective cruise control switch.   The fire went out but there was substantial smoke damage elsewhere.  They had been on vacation at the time, and a recall notice for the defect was in their held mail when they returned from vacation.  So I tend to take recall notices and the like as high urgency issues, worthy of immediate action.

My current car is a Saturn.

Today, in my mail, I received a plain white envelope with a Saturn return address and the ominous notices "Important Vehicle Information Enclosed" and  "Open Immediately Do Not Discard".   I was suspicious, but given the past family experience with recalls, I opened it immediately just in case.

Was it a recall notice? 

Nope, just a slimy marketing trick.  When I called the dealer to complain, they denied that it was a deceptive practice and then hung up on me.

It used to be that Saturn tried to be a brand for people who just wanted reliable transportation without the slimy behavior so common among auto dealers.  My experience buying in 1996 was good.  But now it seems they're no different from the rest.  For all practical purposes, they're dead.

Thursday Aug 11, 2005

Symphony and Release Numbering

So, there I was last Sunday in rehearsal, minding my own business in the middle of the trombone section, and I look up and I see sheet music entitled "Symphony in E Minor (No. 5 Opus 95) / From the New World".   But wait, isn't the "New World" Dvořák's 9th Symphony?  Err, well, yes it is, at least in all the concert programs and liner notes I've ever seen....  the musicologists and the sheet music publishers seem to disagree..

This is more confusing than our release numbering scheme for SunOS/Solaris ...

Wednesday Jun 08, 2005

On the conversion of working systems into warm bricks...

Operating systems development communities wind up inventing and using a fair bit of slang.  The existing Solaris development community within Sun tends to use one particular metaphor a fair bit: the brick.  That's what you get when you take your test machine, add your latest test bits, and, well, something goes wrong in a big way and your system (whether a low end PC or high end multiprocessor) winds up having all of the capability of a Warm Brick, at least until you get  a chance to reinstall it. 

Typical usage: "Oops, I bricked it."   "Hey, when you brickify a test machine, at least reinstall a good build on it before you move on..", and "Bugs in the packaging scripts may still result in brickification".

(Note: members of another OS development community have been known to use "brick" as short for  "throw a brick at".   As far as I can tell, these usages are completely unrelated).


