Composite services and “clouds” are architectural in nature. We can no longer attempt to mediate system events at an element or “server” level. Mediation must be broader, intelligently conferring architectural context at nearly every level.
A Server Perspective
A trouble ticket is filed, possibly automatically
Someone researches the problem and confers with the admins, developers, BU, etc.
The service limps along, or is down completely, during this process
Eventually the troubled system is brought back online
- services (applications) are reloaded, things are back up.
A Service Perspective
The event management system analyzes the outage
-- are other services functioning, is the desired SLO (service level objective) still being met?
If other services are functioning and the SLO is still being met, the server is placed in a “to replace” status pool within the POD
The VM or other operational management (policy) system brings up additional workloads if required to replace the lost capacity
Someone eventually replaces the hardware (quarterly?) when the “replace” threshold limits are reached
-- new systems added to “spare”/unallocated pool
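The service-perspective flow above can be sketched as a simple policy loop. This is a minimal illustration, not a real management system: the pool names, the `slo_met` flag, and the `REPLACE_THRESHOLD` value are all assumptions introduced here.

```python
from dataclasses import dataclass, field

REPLACE_THRESHOLD = 3  # hypothetical: dead servers tolerated before a hardware swap is scheduled

@dataclass
class Pod:
    """A POD's server pools: active, awaiting replacement, and spare/unallocated."""
    active: set = field(default_factory=set)
    to_replace: set = field(default_factory=set)
    spare: set = field(default_factory=set)

def handle_server_outage(pod, server, slo_met):
    """Service-perspective event handling: no ticket, no human, just policy.

    Returns True when enough servers have accumulated in the 'to replace'
    pool that a (quarterly?) physical replacement should be scheduled.
    """
    pod.active.discard(server)
    if slo_met:
        # Other services are functioning and the SLO holds: park the dead
        # server and backfill lost capacity from the spare pool.
        pod.to_replace.add(server)
        if pod.spare:
            replacement = pod.spare.pop()
            pod.active.add(replacement)  # policy system brings up a new workload
    return len(pod.to_replace) >= REPLACE_THRESHOLD
```

Note that the human appears nowhere in the outage path; they are only summoned (by the return value) once the replacement threshold is reached.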
While simplified, this illustrates a next-generation practice that most large-scale system providers follow today – they don't care when a server goes down; in fact they (the humans) may not even know. The event is trapped by a content delivery network (e.g. Akamai – therefore perhaps someone else's problem) or by the “management” systems – a sort of architectural meta-cognition.
The difference here is like the classic question: “If a tree falls in the forest and no one is around to hear it, does it make a sound?” Is that tree architecturally significant? Maybe to a couple of squirrels, but it's very likely that the forest survives and the animals find a new home (unless it was the last tree!). Was the event architecturally significant? If not – wait until we hit that threshold.
The key to correctly applying architecture is abstraction. We will want to be able to specify a workload or process to run without implementation-specific parameters, so it can be consumed by whichever “cloud” best meets the requirements. There are some key factors:
* Where's the data?
* Where does the data go once processed?
* What's the desired availability?
* What's the desired level of scale? Now? Tomorrow? One year from now?
* What are the security requirements?
* What are the key service run-time dependencies?
* Are the service components stateless or stateful at the node level?
There are probably others, but they are increasingly irrelevant in an abstracted discussion. Any compute infrastructure that supports cloud computing should be able to deal with these issues today, but what makes clouds interoperate, and how does a user decide which cloud to use? What if the user is itself a system of clouds?
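The abstraction argument can be made concrete as a declarative workload spec matched against candidate clouds. This is a sketch under stated assumptions: the field names, the `Cloud` attributes, and the eligibility checks are illustrative, not any real provider's API.

```python
from dataclasses import dataclass

@dataclass
class WorkloadSpec:
    """Implementation-free workload description; fields mirror the key factors above."""
    data_location: str       # where the data lives
    data_destination: str    # where the data goes once processed
    availability: float      # desired availability, e.g. 0.999
    scale_now: int           # nodes needed today
    scale_future: int        # projected nodes a year from now
    security_level: str      # e.g. "public", "pci", "hipaa"
    dependencies: tuple      # key service run-time dependencies
    stateless: bool          # stateless components are easier to relocate

@dataclass
class Cloud:
    name: str
    regions: tuple
    availability: float
    max_scale: int
    certifications: tuple

def eligible_clouds(spec, clouds):
    """Return the clouds that satisfy the spec; the consumer — which may
    itself be a system of clouds — then chooses among them."""
    return [c for c in clouds
            if spec.data_location in c.regions
            and c.availability >= spec.availability
            and c.max_scale >= spec.scale_future
            and spec.security_level in c.certifications]
```

The point of the sketch is that nothing in `WorkloadSpec` names a hypervisor, an instance type, or a vendor: the same spec can be handed to any cloud that passes the filter.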