Unsolved Developer Mysteries

I love when customers play "stump the geek" and ask really insightful, serious questions. It's partly what makes being a systems engineer at Sun challenging and fun (and yes, I consider myself an SE within my own group, but I'll pass on the is-a has-a polymorphism jokes, thank you). Yesterday's question scored an 8 for style and 9 for terseness (usually a difficult combination to execute):

"What are the top developer problems we haven't run into yet?"

I gave an answer in three parts.

1. Unstructured data management and non-POSIX semantics. Increasingly, data reliability is taking the shape of replication handled by a data management layer, using RESTful syntax to store, update, and delete items with explicit redundancy control. If you're thinking of moving an application into a storage cloud, you're going to run into this. Applications thriving on read()/write() syntax are wonderful when you have a highly reliable POSIX environment against which to run them. And no, don't quote me as saying POSIX filesystem clusters are dead - the Sun Storage 7310C is an existence proof to the contrary. Filesystems we loved as kids are going to be around as adults, and probably with the longevity of the mainframe and COBOL: they'll either engineer or survive the heat death of the universe. There is an increasing trend, however, toward WebDAV, Mogile, SimpleDB, HDFS and other data management systems that mediate between the block level and the application. New platforms, not at the expense of old ones.
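To make the contrast concrete, here's a minimal sketch (all names hypothetical, not any particular vendor's API) of the whole-object, explicit-redundancy semantics these data management layers expose, as opposed to byte-range read()/write() against a single reliable filesystem:

```python
# Hypothetical sketch: a toy object store with PUT/GET/DELETE semantics and
# redundancy as an explicit, application-visible knob -- the shape of the
# interface, not a real service's API.

class ObjectStore:
    def __init__(self, replicas=3):
        self.replicas = replicas                    # redundancy is explicit
        self.nodes = [dict() for _ in range(replicas)]

    def put(self, key, value):
        # No seeks, no partial writes: the whole object is replaced on
        # every replica.
        for node in self.nodes:
            node[key] = value

    def get(self, key):
        # Any surviving replica can satisfy the read.
        for node in self.nodes:
            if key in node:
                return node[key]
        raise KeyError(key)

    def delete(self, key):
        for node in self.nodes:
            node.pop(key, None)

store = ObjectStore(replicas=3)
store.put("invoice/42", b"total=99.95")
store.nodes[0].clear()                              # lose one replica
print(store.get("invoice/42"))                      # read still succeeds
```

The point is the interface: the application names objects and a redundancy policy, and never sees blocks, offsets, or POSIX file descriptors.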

2. Software reliability trumps hardware replacement. An application analog to the first point. Historically, we've used high availability clusters, RAID disk configurations and redundant networks to remove single points of failure, and relied on an active/active or active/passive application cluster to fail users over from one node to a healthier one. But what if the applications are highly distributed, recognize failure, and simply restart a task or request as needed, routing around failure? IP networks work (quite well) in that sense. It requires writing applications that package up their state, so that the recovery phase doesn't involve recreating, copying or otherwise precipitating state information on the new application target system. There's a reason REST is popular - the ST stands for "state transfer". And yes, this worked really well for NFS for a long time. Can I get an "idempotent" from the crowd?
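A sketch of that retry-and-route-around pattern (the node and request names here are invented for illustration): because the request carries all of its state, sending it twice is harmless, and "failover" collapses to "try the next node":

```python
# Hypothetical sketch: requests are self-contained (REST's "state transfer"),
# so recovery is just an idempotent retry against another node -- no HA
# cluster machinery, no state to rebuild on the new target.

def submit_with_retry(request, nodes):
    for node in nodes:
        try:
            return node(request)        # safe to re-send: handler is idempotent
        except ConnectionError:
            continue                    # route around the failed node
    raise RuntimeError("all nodes failed")

def down_node(request):
    raise ConnectionError("node unreachable")

def up_node(request):
    # The request itself contains everything needed to serve it.
    return {"stored": request["key"], "ok": True}

result = submit_with_retry({"key": "cart/7", "qty": 3}, [down_node, down_node, up_node])
print(result)                           # served by the third node
```

Contrast this with a stateful session: if the handler kept per-client state between calls, the retry would first have to reconstruct that state on the new node, which is exactly the recovery cost the post argues against.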

3. Parallelism. If you weren't bound by a single thread, what would you waste, pre-compute, or do in parallel? This isn't about parallelizing loops or using multi-threaded libraries; it's about analyzing large-scale compute tasks to determine which tasks could be partitioned and done in parallel. I call this "lemma computing" -- in pure mathematics, a lemma is a partial result you can take as true because someone already spent the time proving it, letting you leverage the intermediate proof point. When you have a surfeit of threads in a single processor, you need to consider what sidebar computation can be done with those threads that will speed up the eventual result bound by single-thread performance. This isn't the way we "think" computer science; we either think single threaded or in multiple copies of the same single thread.
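One way to picture "lemma computing" (the function names below are made up for the sketch): spare threads speculatively compute several intermediate results before the single-threaded critical path knows which one it will need, so the answer is already waiting when it asks:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of speculative pre-compute: idle threads work on candidate "lemmas"
# (intermediate results) in parallel with the critical path.

def expensive_lemma(n):
    # Stand-in for a costly partial result -- a factor table, a pre-matched
    # parts bundle, a sub-proof.
    return sum(i * i for i in range(n))

def main_result(lemma_value, x):
    # The single-threaded critical path, which consumes one lemma.
    return lemma_value + x

with ThreadPoolExecutor(max_workers=4) as pool:
    # Kick off the likely-needed lemmas before we know which one wins.
    lemmas = {n: pool.submit(expensive_lemma, n)
              for n in (10_000, 100_000, 1_000_000)}

    # ...critical path runs, then discovers it needs the n=100_000 lemma,
    # which has (probably) already finished on a spare thread...
    answer = main_result(lemmas[100_000].result(), 7)

print(answer)
```

The speculative work on the losing candidates is "wasted", but it cost only otherwise-idle threads; the result the critical path was bound on arrives sooner.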

That was my somewhat top-of-mind list, based partly on the talk I gave at CloudSlam 09, which will be updated for SIFMA in New York later this month.


Number three is the killer "app", or killer "app change" IMHO. Pre-fetch along with pre-compute takes "free" idle cycles/cores/threads and leverages them. We did this at Sun ages ago by pre-computing sets of parts for machine configs so that future customer quotes could have "chunks" of pre-matched parts that would naturally fit together. Made quoting in Q4 bearable...


Posted by Bill Walker on May 28, 2009 at 08:49 AM EDT #

Intriguing. We recently did a construction project in our backyard, and because of the economy, we ended up getting twice or more the number of workers showing up each day than we would have had in a normal year, because the contractor wanted to keep all his guys employed and he didn't have much else going on. The foreman said it was a new challenge for him to keep that many guys moving productively, and he had to get creative about chunks of work that could be usefully done ahead and in parallel that normally would have been sequential. With 8 or 9 guys, there were often three different activities going on -- some laying rebar for concrete, some precutting wood for a pergola, some doing masonry. It was exactly that "surfeit of threads" environment you describe, and illustrates how you need to get more creative about finding new ways to do multithreading effectively.

Posted by Tom Chatt on May 29, 2009 at 07:13 AM EDT #


Hal Stern's thoughts on software, services, cloud computing, security, privacy, and data management

