Cloud Serviceability and Architecture

Composite services and “clouds” are architectural in nature. We can no longer mediate system events at an individual element or “server” level; event management must operate more broadly and intelligently carry architectural context at nearly every level.

A Server Perspective

- A trouble ticket is filed, possibly automatically
- Someone researches the problem and confers with the admins, developers, business units, etc.
- The service limps along, or is down completely, during this process
- Eventually the troubled system is brought back online; services (applications) are reloaded and things are back up


A Service Perspective

- The event management system analyzes the outage: are other services functioning, and is the desired SLO (service level objective) still being met?
- If other services are functioning and the SLO is still met, the server is placed in a “to replace” status pool within the POD
- A VM or other operational management (policy) system brings up additional workloads if required to replace the lost capacity
- Someone eventually replaces the system (quarterly?) when the “replace” threshold is reached; new systems are added to the “spare”/unallocated pool
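The service-perspective flow above can be sketched as a simple policy loop. Everything here (the `Pod` class, the threshold value, the method names) is an illustrative assumption, not any real management product's API:

```python
# Sketch of the service-level event policy described above; all names
# (Pod, slo_met, handle_failure) are hypothetical, not a vendor API.

REPLACE_THRESHOLD = 3  # batch hardware swaps once this many servers are queued

class Pod:
    def __init__(self, servers, slo_min_servers):
        self.active = set(servers)
        self.slo_min = slo_min_servers   # servers needed to meet the SLO
        self.to_replace = []             # failed boxes awaiting a tech
        self.tickets = []                # batched hardware-replacement tickets

    def slo_met(self):
        return len(self.active) >= self.slo_min

    def handle_failure(self, server):
        self.active.discard(server)
        self.to_replace.append(server)   # queue it for replacement either way
        if not self.slo_met():
            # Policy engine restores capacity from the spare pool immediately
            self.active.add(f"spare-for-{server}")
        if len(self.to_replace) >= REPLACE_THRESHOLD:
            # Only now does a human get involved, and only for hardware
            self.tickets.append(list(self.to_replace))
            self.to_replace.clear()

pod = Pod(["s1", "s2", "s3", "s4"], slo_min_servers=3)
pod.handle_failure("s2")   # SLO still met: no action, just queue the box
pod.handle_failure("s3")   # SLO now broken: a spare is brought up automatically
print(len(pod.active), len(pod.to_replace))  # 3 2
```

The point is the inversion: the failed server is not the event of interest; the SLO and the replacement-threshold are.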

While simplified, this illustrates a next-generation practice that most large-scale system providers already follow: they don't care when a server goes down; in fact the humans may not even know. The event is trapped by a content delivery network (e.g. Akamai, where it may be someone else's problem) or by the “management” systems, a sort of architectural meta-cognition.

The difference here is like the classic question: “If a tree falls in the forest and no one is around to hear it, does it make a sound?” Is that tree architecturally significant? Maybe to a couple of squirrels, but it's very likely that the forest survives and the animals find a new home (unless it was the last tree!). Was the event architecturally significant? If not, wait until we hit that threshold.

The key to correctly applying architecture is abstraction. We want to be able to specify a workload or process to run without specific implementation parameters, so that it can be consumed by whichever “cloud” best meets the requirements. There are some key factors:

- Where's the data?
- Where does the data go once processed?
- What's the desired availability?
- What's the desired level of scale? Now? Tomorrow? One year from now?
- What are the security requirements?
- What are the key service run-time dependencies?
- Are the service components stateless or stateful at the node level?

There are probably others, but they become increasingly irrelevant in an abstracted discussion. Any compute infrastructure that supports cloud computing should be able to deal with these issues today, but what makes clouds inter-operate, and how does a user decide which cloud to use? What if the user is itself a system of clouds?
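One way these questions could be captured in practice is as a declarative workload spec that a scheduler matches against each cloud's advertised capabilities. Every field and function name below is an illustrative assumption, not an existing format:

```python
# Hypothetical declarative workload spec covering the factors listed above.
workload_spec = {
    "name": "order-processing",
    "data_source": "s3://orders/incoming",       # where's the data?
    "data_sink": "warehouse/orders",             # where does it go once processed?
    "availability": "99.9%",                     # desired availability
    "scale": {"now": 10, "one_year": 100},       # desired level of scale
    "security": {"encrypt-at-rest", "pci"},      # security requirements
    "depends_on": ["auth-service", "catalog"],   # key run-time dependencies
    "stateless": True,                           # state at the node level
}

def suitable_clouds(spec, clouds):
    """Return the clouds whose capabilities cover the spec's requirements."""
    return [c["name"] for c in clouds
            if spec["security"] <= c["security"]          # security superset
            and c["max_nodes"] >= spec["scale"]["one_year"]]  # headroom

clouds = [
    {"name": "cloud-a", "security": {"encrypt-at-rest"}, "max_nodes": 500},
    {"name": "cloud-b", "security": {"encrypt-at-rest", "pci"}, "max_nodes": 200},
]
print(suitable_clouds(workload_spec, clouds))  # ['cloud-b']
```

A “system of clouds” acting as the user would run exactly this kind of matching automatically, placing the workload without a human choosing a provider.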


A brilliant dissertation!

Could it be said that the "SubSpace Relay Network", which originated in a 52101 dorm room, was the predecessor to what we now refer to as "Cloud Computing"? While the architecture was rather primitive, the serviceability was first rate. I seem to recall the sysadmin literally sleeping next to his computer in order to meet his desired SLO.

Posted by Martin McFly on February 16, 2009 at 10:46 AM PST #

An excellent view that helps manage scale.
An interesting side effect is the extent to which the service instance becomes unavailable. At least due to network partitioning the service instance may become unavailable but actually be up and running, consuming cloud resources. In which case, is the user billed for it? If a new service instance is spawned due to the unavailability, then the user may be billed twice, incorrectly. The difference in perception ('useful' availability of instance) and reality (existence of the instance) is a challenge to address as well. Sometimes trivial, sometimes not. Depending on the actual resources the service instance consumes.

Posted by Sharma Podila on February 18, 2009 at 01:19 AM PST #

So 1) SSR Net -- yes I'm familiar. :) It was pretty rudimentary, though UUNET enabled. :) Who is the lurker?

2) To Sharma - you're right, and this is that "context" I spoke about in the posting. Today, most products that support some root-cause analysis capability are simple -- a couple of queries here and there. As we work on the cloud architecture "abstraction" we'll want to include "helpers" to provide the pointers to the service level and the diagnosability hooks necessary to understand what's happening.

Posted by Jason on February 18, 2009 at 11:34 AM PST #


Thoughts from Jason Carolan -- Distinguished Engineer @ Sun and Global Systems Engineering Director
