Tuesday Feb 22, 2005

SUPerG 2005 Paper

SUPerG (Sun Users Performance Group) is Sun's premier technical event for customers interested in large-scale solutions architected for data centers and high-performance technical computing. The program is designed to provide highly interactive, intimate exchanges between Sun's leading technical experts and our customers.

You can read more about the event, and register to attend at: http://www.sun.com/datacenter/superg/

This year, I've been invited to speak at this event. In the spirit of blogging, I've posted my abstract (below). I need to get busy writing the paper and creating a clear, concise, and compelling technology demonstration. If you'd like me to address a particular topic or concern or challenge in the paper (related to the abstract), or have an idea that I might include in the demo, please drop me a line or submit a comment to this entry. If I use your idea, I'll attribute it to you in the published paper, and here in blog land, so please include your name and contact info.

Hope to see you at SUPerG in April in Virginia. Stop by and say "hi".

*SUPerG 2005 Abstract*
Effective Deployment & Operational Practices for Dynamically Scaled SOA and Grid Environments

Scalability is taking on a new form and introducing new challenges. The traditional "vertical" platform, with dozens of powerful CPUs bound to local memory offering (nearly) uniform memory access, is being threatened by a new model - networked grids of powerful but low-cost compute nodes.

Grids are not new. But powerful new techniques are emerging that allow commercial workloads to take advantage of this style of computing. This includes SOA-based application design, as well as auto-deployment and provisioning to drive efficiency and utilization in infrastructure operation.

Modern designs provide for on-the-fly horizontal scaling at the push of a button... new containers join the grid, and a distributed app may expand into them to offer new levels of performance and service. A side effect of this approach is a highly resilient platform: bound dependencies can fail without a catastrophic impact on the running service.

This talk will provide an update on the State of the Technology with respect to SOA and Infrastructure Provisioning, and how these can be leveraged to offer Adaptable, Scalable, and Resilient services.

I may also include a demonstration that will show how a collection of bare metal servers can be established into a Grid using N1 SPS (integrated with JET). Following this provisioning phase, the demo will then show a sample app deployed and executed across multiple nodes. Finally, it'll show a node being added to the live Grid using SPS, and how that app can then expand, at run-time, to leverage this new node, increasing its work rate.
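For readers who think better in code, the demo flow can be sketched in miniature. This is emphatically not N1 SPS (all names below are invented): it's a toy Python model in which "nodes" are worker threads pulling from a shared task queue, so a node added at run time immediately picks up work and the aggregate work rate goes up, with no restart of the app.

```python
import queue
import threading

class Grid:
    """Toy stand-in for a provisioned grid: 'nodes' are worker threads
    pulling tasks from a shared queue."""
    def __init__(self):
        self.tasks = queue.Queue()
        self.results = []
        self._lock = threading.Lock()

    def add_node(self, name):
        # A node joining the live grid is just another consumer of the
        # queue; the running app expands onto it without a restart.
        t = threading.Thread(target=self._work, args=(name,), daemon=True)
        t.start()

    def _work(self, name):
        while True:
            try:
                task = self.tasks.get(timeout=0.5)
            except queue.Empty:
                return  # no work left; the node idles out
            with self._lock:
                self.results.append((name, task * task))
            self.tasks.task_done()

grid = Grid()
for i in range(100):
    grid.tasks.put(i)          # the "sample app": square 100 numbers
grid.add_node("node-1")
grid.add_node("node-2")        # added while the app runs: work rate increases
grid.tasks.join()              # wait for the distributed app to finish
print(len(grid.results))       # 100
```

The real demo does this with bare metal and SPS/JET instead of threads and a queue, but the shape of the idea (late-joining consumers of a shared workload) is the same.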

Monday Feb 14, 2005

Boycotting Oracle

So the news (news.com.com) is reporting that Intel and HP are getting into the game... joining the ranks of multi-core chip vendors and their customers who see Oracle's license strategy (to charge by the core) as misaligned with the times. These are times of virtualized resources that are consumed and funded as needed, they say.

I was thinking of an analogy for Oracle's position... Consider how you would feel about a policy at Blockbuster Video if, when you rented a DVD, you had to pay $10.00 per seat (your sofa counts as three - being multi-seated). No, it doesn't matter if it'll just be you and your spouse watching the movie. Since you have 15 seats that you *could* utilize (the bar stools and folding chairs count too), you will pay $150.00 per night for that movie. Oh, you'd like to display that movie in PARALLEL in your family room and in your entertainment room? Sure, you can do that with their "shared disc" technology. But now add up all the seats in both rooms (25), and that'll be $20.00 per seat! So please pay us $500.00 per night for that movie.

Now, why in the world would Oracle change that policy? They've maximized their revenue pull - and customers are still writing checks. They are in business to extract as much from their "value" as the market will bear, not to offer charity discounts to a world that can't rationalize the price tag assigned by a market share leader (I won't use the other "m" word). Oracle reports having $10B in cash, about equal to their annual revenue. It would take less than a thousand E25K customers deciding to run Oracle RAC on their servers to deliver another $10B to their war chest. Not bad, for the price of DVD blanks :-)

Choice in this market segment is the only lever that will work. Customers are demanding choice, and they will respond when it appears. Oracle should note that when choice knocks, many will answer, even if Oracle then counters with a competitive position. It takes a long time to get a bad taste out of your mouth. Many will boycott Oracle just because they finally can.

There are some hints that choice might be just around the corner.

Saturday Feb 12, 2005

The Fall and Rise of IT: Part 1

Here's a collection of charts, graphs, and images that provide insight into the abyss of the typical datacenter operation. It's scary out there when we apply the benchmarks used to measure utilization, efficiency, and contribution in other parts of the business.

But there is hope. For example, just this month Sun released a valuable and comprehensive (and free) BluePrint book called "Operations Management Capabilities Model". We've been working on this one for some time, so check it out. In addition, you can sign up (for free) with our SunTONE Program for self-assessment guides and self-remediation activities related to our ITIL-plus Certification program, which is based on, but extends, ITIL. Thousands of companies are registered. We'll help if you'd like. Finally, the Service-Optimized DataCenter program will act as a Center of Excellence for putting these concepts into practice, along with innovative new technologies in virtualization, provisioning, automation, and optimization, and other best practices. As you read about the state of IT below, realize that there is an escape from the pit of mediocrity. Part 2 will explore the opportunity.

For now, for this post, I'll survey some of the problems that need fixing...

Let's assume that the prime directive for a datacenter is simply to: Deliver IT Services that meet desired Service Level Objectives at a competitive cost point. There are all kinds of important functions that fall within those large buckets [Service Level and Financial Mgmt], but that'll work for this discussion.

In my experience working with customers, there are two primary barriers that prevent a datacenter from being as successful as it might be in this mission. First, there is rampant unmanaged complexity. Second, most IT activities are reactive in nature... triggered by unanticipated events and often initiated by unsatisfied customer calls. The result: expensive services that can't meet expectations. Which is the exact opposite of what an IT shop should deliver!

Here are some related graphics (with comments following each graphic):

This illustrates the typical "silo" or "stovepipe" deployment strategy. A customer or business unit wants a new IT service developed and deployed. They might help pick their favorite piece parts and IT builds/integrates the unique production environment for this application or service. There is often a related development and test stovepipe for this application, and maybe even a DR (disaster recovery) stovepipe at another site. That's up to four "n"-tier environments per app, with each app silo running different S/W stacks, different firmware, different patches, different middleware, etc, etc. Each a science experiment and someone's pet project.

Standish, Meta, Gartner, and others report that ~40% of all major IT initiatives that are funded and staffed are eventually canceled before they are ever delivered! And of those delivered, half never recover their costs. Overall, 80% of all major initiatives do not deliver to promise (they are canceled, late, over budget, or simply don't meet expectations). Part of the reason (there are many) for this failure rate is the one-off stovepipe mentality. Other reasons include a lack of clear business alignment, requirements, and criteria for success.

This is an interesting quote from a systems vendor. While 200M IT workers seems absurd, it describes the impact of accelerating complexity and the obvious need to manage that process. We saw the way stovepipe deployment drives complexity. We're seeing increasing demand for services (meaning more stovepipes), each with increasing service level expectations (meaning more complex designs in each stovepipe), each with increasing rates of change (meaning lots of manual adjustments in each stovepipe), each with increasing numbers of (virtual) devices to manage, each built from an increasing selection of component choices. The net result is that each stovepipe looks nothing like the previous or next IT project. Every app lives in a one-off custom creation.

If all this complexity isn't bad enough, as if to add insult to injury, each of these silos averages less than 10% utilization. Think about that... say you commit $5 million to build out your own stovepipe for an ERP service. You will leave $4.5M on the floor, running idle! That would be unacceptable in just about any other facet of your business. Taken together, high complexity (lots of people, unmet SLOs) and low utilization (more equipment, space, etc) drive cost through the roof! If we could apply techniques to increase average utilization to even 40% (and provide fault and security isolation), we could potentially eliminate the need for 75% of the deployed equipment and related overhead (or at least delay further acquisitions, or find new ways to leverage the resources).
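The arithmetic above is worth spelling out. A back-of-the-envelope sketch, using the round numbers from the text:

```python
# Back-of-the-envelope utilization math from the paragraph above.
capex = 5_000_000
utilization = 0.10
idle_spend = capex * (1 - utilization)
print(idle_spend)            # 4500000.0 "left on the floor"

# Raising average utilization from 10% to 40% means each box does 4x the
# work, so (in principle) 1/4 of the equipment covers the same load:
equipment_needed = utilization / 0.40
print(1 - equipment_needed)  # 0.75 -> 75% of gear (and overhead) eliminated
```

Of course real consolidation never hits the theoretical ceiling (peak loads don't overlap neatly, and isolation has its own costs), which is why the text hedges with "potentially".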

We've seen what complexity and utilization do to cost... But the other IT mandate is to deliver reliable IT services. This graphic summarizes a few studies performed by IEEE, Oracle, and Sun on the root causes of service outages. In the past, ~60% of all outages were planned/scheduled, and 40% were the really bad kind - unplanned. Thankfully, new features like live OS upgrades, patches, backups, and dynamic H/W reconfiguration are starting to dramatically reduce the need for scheduled outages. But we've got to deal with the unplanned outages that always seem to happen at the worst times. Gartner explains that 80% of unplanned outages are due to unskilled and/or unmotivated people making mistakes or executing poorly documented and undisciplined processes. In theory, we can fix this with training and discipline. But since each stovepipe has its own set of unique operational requirements and processes, it is nearly impossible to implement consistent policies and procedures across operations.

So it isn't surprising, then, that Gartner has found that 84% of datacenters are operating in the basement in terms of Operational Maturity... Either in Chaotic or Reactive modes.

Okay... enough. I know I didn't paint a very pretty picture. The good news is that most firms recognize these problems and are starting to work at simplifying and standardizing their operations. In Part 2, I'll provide some ideas on where to start and how to achieve high-return results.

Tuesday Feb 08, 2005

The Cell Processor

The latest buzz on the streets, at least around those neighborhoods frequented by the eXtreme crowd, seems to be about the Cell Processor. I wrote a little blog on the Power 6 recently and one reader warned me to watch out for The Cell.

Well, I have to admit, I'm a bleeding edge junkie myself at times. And the theory of operation around The Cell is pretty compelling. The problem is that theory doesn't always translate to reality! In fact, it seldom does. Especially when S/W is a critical component of the translation.

Gartner suggests that only 1 in 5 major initiatives that Sr. Mgmt funds and resources ever delivers to promise... 80% fail to meet expectations. IBM talks about a recent Standish Group report that suggests only 16.2% of S/W projects are delivered to promise. Another study suggests that > 40% are canceled before delivered (and most that are delivered are late and/or way over budget, often never recovering costs).

If you read the reports about Cell, it isn't about the H/W... That's the point really. The H/W is made up of standard building blocks (cells) of Power cores. A socket holds a Processor Element which contains a main Processor Unit (core) and several (often 8) Attached Processor Units (cores). However, the interesting part is the "Cell Object", which is a S/W construct that includes code and data that can migrate around looking for a "host" Cell system on which to execute. There is talk of dynamically-orchestrating pipelines. Of S/W self-healing properties. Of dynamic partitionability with multiple native OS support. All S/W ideas.

So it isn't really about H/W. The H/W "Cells" are simply the "amino acids". The more interesting question might be: is there an "intelligent designer" who can breathe life into a soup made up of these "single celled" organisms? There is precedent for doom - where advanced life forms failed to thrive due to a lack of S/W life support (eg: EPIC/VLIW, Stack Machines, etc).

We saw earlier the dismal failure rate of projects using well established S/W development paradigms. It'll be amazing if Sony/Toshiba/IBM can turn the PlayStation3 engine into a viable general purpose computing platform that can threaten AMD, Intel, and SPARC at home and in the datacenter. From what I hear, the development tools and processes for PlayStation2 are an absolute nightmare.

It'll be fun to watch this pan out. One thing is for sure... at least PlayStation3 will ROCK, if they can deliver a reasonable flow of affordable immersive networked games. I hope so.

The Cell makes for great reading. Unfortunately, when it comes to becoming a general purpose platform, this one might never recover from Stage 3 of the hype cycle (the Trough of Disillusionment).

Sunday Jan 30, 2005

Chips, Cores, Threads, OH MY!!

I don't know about you, but the whole mess around the emerging lexicon associated with modern processors is very frustrating. Terms are frequently redefined and twisted based on each vendor's whim and fancy. But words (should) mean something and obviously it's important that we all talk the same language.

A perfect example... you might have been approached by a Jehovah's Witness in the past. Or have a friend who is a Mormon. I do. They are wonderful people in general. When they talk about their faith their words and themes sound very similar to Biblical Christianity. But dive a little deeper and you'll find the belief systems are radically different. I'm not making a statement on value or correctness or anything like that (so don't bother starting a religious debate). My point is that two people can talk and maybe even (think they) agree, when in fact they are as far from each other as heaven and hell (so-to-speak).

When it comes to the engines that power computers, people talk about CPUs, and Processors, and Cores, and Threads, and Sockets, and Chips, and n-Way, and TEUs, and CMT, and Hyper-Threading, and and and... Whew!

I like to use three terms... Chips, Cores, and Threads. Note that this is pretty much what SPEC.ORG uses: http://www.spec.org/cpu2000/results/rint2000.html

I stay away from Sockets and Processors and CPUs and n-Way, as these are confusing or ambiguous or redundant.

Here are some examples [Chips/Cores/Threads]:
V880:          8/8/8
V490:          4/8/8
p570:          8/16/32
V40z:          4/4/4
Niagara:      1/8/32  (for a system with just one of these chips)

Here are my definitions:

Chips
This refers to the laser-scribed rectangle cut from a semiconductor wafer, which is then packaged in a plastic or ceramic casing with pins or contacts. A "chip" may have multiple processing "cores" in it (see: Cores). The US-IV, Niagara, Power5, Itanium, and Opteron are all single "chips".

Cores
This term refers to the number of discrete "processors" in a system or on a chip. Some chips, such as the Power5, US-IV, and Niagara, have more than one core per chip. A core is also known as a TEU (thread execution unit). Each "core" may also be multi-threaded (see: Threads), which can support concurrent or switched execution. Some cores have more than one integer, floating-point, or other type of "execution unit", supporting instruction-level parallelism and/or more than one concurrent thread.

Threads
Threads are really a S/W construct. These are the streams of execution scheduled by the OS to do work driven by the computer's processor(s). Some cores can handle more than one thread of execution. Some cores can execute more than one thread at the same time. Other cores can be loaded with multiple threads but perform H/W context switching at nanosecond speeds. The Thread Count of a processor equals the number of cores multiplied by the number of threads handled by each core. The US-IV has a Thread Count of 2*1=2. The Power5 has a Thread Count of 2*2=4. Niagara has a TC of 8*4=32.
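Since the counts compose multiplicatively, the whole Chips/Cores/Threads table above falls out of one formula. A quick sketch (the per-chip core and thread figures are as I understand these systems):

```python
def thread_count(cores_per_chip, threads_per_core, chips=1):
    """Thread Count = chips x cores-per-chip x threads-per-core."""
    return chips * cores_per_chip * threads_per_core

# The systems from the table above, as (chips, cores/chip, threads/core):
systems = {
    "V880":    (8, 1, 1),   # 8 single-core, single-thread US-III chips
    "V490":    (4, 2, 1),   # 4 dual-core US-IV chips
    "p570":    (8, 2, 2),   # 8 dual-core, 2-way SMT Power5 chips
    "Niagara": (1, 8, 4),   # 1 chip, 8 cores, 4 threads per core
}
counts = {}
for name, (chips, cpc, tpc) in systems.items():
    counts[name] = (chips, chips * cpc, thread_count(cpc, tpc, chips))
    print(name, counts[name])   # e.g. V490 (4, 8, 8)
```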

Sockets (avoid)
This term is designed to communicate the number of processor "chips" in a system. However, in reality it is an ambiguous term, because IBM's MCMs (multi-chip modules) have four "chips" per motherboard "socket". And, a long time ago, some sockets were stacked with more than one chip. Regardless, this term is supposed to equate to the number of "chips", so why confuse the issue? Just use "chips".

Processors (avoid)
This is technically equal to the number of cores. However, marketing has corrupted this term and some vendors (like Sun) equate this to the number of chips (or sockets), while others equate this to the number of cores. Vendors also use the term "n-Way". But since the number "n" equals the number of processors, this means different things depending on the vendor. For example, a 4-way V490 from Sun has 8 cores, and Oracle will charge you $320,000.00 (list price) to run Oracle on it.

CPUs (avoid)
This suffers from the same marketing overload problem as Processors.
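To see why the ambiguity matters to your wallet, here's the licensing arithmetic behind the V490 example as a sketch. The per-core rate is my own back-derivation from the $320K figure, not a published Oracle price:

```python
# Per-core licensing arithmetic implied by the V490 example above.
per_core_list = 320_000 / 8            # assumed: $40K per core, back-derived
v490 = {"chips": 4, "cores": 8}        # a "4-way" V490, by Sun's counting

core_based = per_core_list * v490["cores"]   # what per-core licensing bills
chip_based = per_core_list * v490["chips"]   # what "4-way" might suggest

print(core_based)   # 320000.0
print(chip_based)   # 160000.0 - half the bill, if "processor" meant "chip"
```

Same box, same workload, a 2x swing in license cost depending on which definition of "processor" you (or your vendor) use. That's exactly why this lexicon matters.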

SOA & JSR 208: Reality Check

A friend recently asked me what I'm hearing about SOA adoption and the buzz around JSR 208.

"JSR 208" might be a new term for some. Here is a brief overview: http://www.bijonline.com/PDF/Chappell%20oct.pdf

SOA is so over hyped these days that everyone probably has something different in mind when they hear that TLA (three letter acronym). Kinda like "Grid" - the concepts are real and useful, but the hype around SOA and Grid is running years ahead of reality.

Remember when N1 was first discussed... the vision of heterogeneous datacenters managed by a meta-OS that auto-provisions virtual slices into which services are deployed and managed to sustain business-driven SLOs based on priorities and charge-back constraints. Just roll in new equip and N1 will "DR" (read: dynamically reallocate) services into the increased capacity. If something fails... no problem... N1 will detect and adapt. We'll get there... eventually. And we've made important steps along the way. Investing almost $2B/yr in R&D will help. But it'll take (a long) time.

In some circles I'm hearing similar visions of grandeur around SOA. They talk of business apps described at a high level of abstraction (eg: business process models) loaded into an "application orchestrator" that broadcasts descriptions of the various components/services it needs, and then auto-builds the business app based on services from those providers (both internal and external) that offer the best cost point and service level guarantees. As new "service" providers come on-line with better value (or, if existing providers go off-line), business apps will rebind (on-the-fly) to these new service components.
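The "rebind on the fly" idea above can be sketched with a toy service registry. All names here are hypothetical; a real SOA stack would use JBI (JSR 208) machinery, discovery protocols, and SLA negotiation rather than a dictionary and a `min()`:

```python
# Toy registry sketching dynamic rebinding; all names are hypothetical.
class Registry:
    def __init__(self):
        self.providers = {}   # capability -> list of (cost, callable)

    def publish(self, capability, cost, fn):
        self.providers.setdefault(capability, []).append((cost, fn))

    def retire(self, capability, fn):
        self.providers[capability] = [
            (c, f) for c, f in self.providers[capability] if f is not fn
        ]

    def resolve(self, capability):
        # Bind to the cheapest live provider at call time, so better
        # providers are picked up without redeploying the app.
        cost, fn = min(self.providers[capability], key=lambda p: p[0])
        return fn

registry = Registry()
registry.publish("tax", cost=10, fn=lambda amt: amt * 0.08)

def app(amount):
    # The app holds a capability name, not a hard-wired provider.
    return registry.resolve("tax")(amount)

print(app(100))                  # 8.0 via the original provider

cheaper = lambda amt: amt * 0.05
registry.publish("tax", cost=5, fn=cheaper)
print(app(100))                  # 5.0 - rebound on the fly to better value

registry.retire("tax", cheaper)  # ...and that provider goes off-line
print(app(100))                  # 8.0 - the app survives, rebinding back
```

The last line is the resilience side effect mentioned earlier: when a bound dependency disappears, the app rebinds rather than dies.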

Now, combine N1 and SOA and ITIL, and we could paint a beautiful picture of Service Delivery based on self-orchestrating business apps made up of discrete reusable and shared (possibly outsourced) components that each exist in their own virtual containers that are auto-provisioned and auto-optimized (based on SLAs and Cost and Demand) to maximize asset utilization and minimize cost, all while meeting service level objectives (even in the event of various fault scenarios).

Okay - back to reality :-) I'm finding there is a common theme from many of my customers/prospects. Many are seeking to increase efficiency and agility thru "horizontal integration" of reusable building blocks (eg: identity, etc), a shared services platform (grids, virtualization, etc), and higher-level provisioning (automation, SPS).

That isn't SOA, per se, but it is a good first step. The "building blocks" most are looking to share today are infrastructure services, rather than higher-level business app components. There is a maturity gradient that simply takes a lot of hard technical and political work to negotiate. Every customer is at a different place along that gradient, but most are embarrassingly immature (relative to the grand vision). It takes strong leadership and commitment at all levels, and a synchronization of people, processes, technology, and information, to even embark on the journey. It takes a tight coupling of S/W engineering, IT Architecture, and Business Architecture.

So, yes, I'm passionate about SOA, and JSR 208 will help integrate discrete business services. There are some firms that are pushing the envelope and building interesting business/mission apps from shared "service providers". But, in general, SOA is an abused term and the hype can derail legitimate efforts.

I'd be curious if others are sensing "irrational exuberance" around SOA, which can lead to a "Trough of Disillusionment" and a rejection of the legitimate gains that an evolutionary progression can provide. As Architects, we can establish credibility and win confidence (and contracts) by setting realistic expectations (hype-busting) and presenting not only a desired-state "blueprint" (something that gets them excited about the possibilities for their environment), but a detailed roadmap that demonstrates the process and the benefits at each checkpoint along the way.

Thursday Jan 27, 2005

Sun & The Nobel Prize

Back about 15 years ago, an economist named Ronald Coase won the Nobel Prize based on some very interesting ideas that we're just starting to see drive serious consideration and behavior in the world of IT. Sun is well aware of this and responding with initiatives (that I can't talk about here). Like the "perfect storm", our industry is at an inflection point driven by the confluence of various trends and developments. These all add up to an environment that is ripe for Coase's Law to be enforced with prejudice.

Coase's Law states that: firms should only do what they can do more efficiently than others, and should outsource what others can do more efficiently, after considering the transaction costs involved in working with the outside suppliers.
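Stated as a predicate, the law looks like this. A deliberately simple sketch; real transaction-cost analysis is far messier, but the shape of the decision is exactly this:

```python
# Coase's Law as a decision rule (a simplified sketch).
def should_outsource(internal_cost, external_cost, transaction_cost):
    """Outsource only if the provider's price plus the cost of doing
    business with them beats doing it in-house."""
    return external_cost + transaction_cost < internal_cost

# Pre-bubble: the provider is cheaper, but transaction costs dominate.
print(should_outsource(internal_cost=100, external_cost=60, transaction_cost=70))  # False
# Post Moore/Gilder/Web Services: same provider, far cheaper transactions.
print(should_outsource(internal_cost=100, external_cost=60, transaction_cost=10))  # True
```

Note that in both cases the external provider is cheaper per unit; it's the transaction-cost term that flips the answer, which is the point of the paragraphs that follow.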

There is nothing earth shattering about that simple and intuitive statement. However, back in the 90's, when this idea was explored in theoretical circles by bean counters, the "escape clause" related to transaction costs rendered the idea impotent, or at least limited, in the IT world. A captive internal service (eg: payroll) might not be highly efficient, but the thought of outsourcing a business function quickly evaporated under the heat of a financial impact analysis. It just cost too much per transaction to realize a worthwhile return. And the incredible growth and prosperity of the "bubble years" leading up to Y2K did not create a climate that drove consideration of the business value of outsourcing.

But all that is changing. You are familiar with many of the various "laws" that describe trends in IT, such as:

  • Moore's Law (fab process trends that underpin cheap powerful compute infrastructures)
  • Gilder's Law (the ubiquity of high-bandwidth network fabrics interconnecting businesses and consumers)

You are also familiar with concept of Web Services that leverage standard interfaces and protocols and languages to facilitate secure B2B and B2C transactions over these networks.

Taken together, the cost of an outsourced transaction is now dramatically lower than it was pre-bubble. Today, outsourcing is not only a viable consideration for certain business functions, but a necessary competitive reality. Here's the way I interpret and apply Coase's Law... Every business has a strategy to capture value and translate that value into revenue and profit. But the realities of running a business require common support functions, and every company has had to build its own network of these supporting services (think: Payroll, HR, PR, Legal, Marketing, Manufacturing, etc). Think of these as chapters tucked away in the back of a company's Business Process handbook... necessary ingredients to implement the Business Design, but not part of basic value capture. Many of these necessary support functions operate with limited efficiency and effectiveness, because delivering these services is not part of the company's DNA.

But there are providers that live on Gilder's external network fabric, operating grids of Moore's compute capability, offering highly efficient Web Services-based business functions. These providers specialize in specific support services and can drive efficiency (and lower cost) by aggregating demand. Their core competency is delivering secure, reliable business functions at contracted service levels at a highly competitive transactional cost point. Wow! Think about that.

And moving forward, as we begin to explore the implications of Service-Oriented Architectures, as we implement business processes by orchestrating applications that are built from loosely coupled networked "services", it is not unreasonable to expect some or many of these SOA-based business components to be supplied from one (or more) outsourced suppliers.

Some people believe that targeted outsourcing will drive massive deconstruction and reconstruction, and that this will be THE major disruptive catalyst in business designs over the next several years. If so, IT will play a major part in this transformation. Sun needs to aggressively tap into this opportunity (and we are). To do so will require building B2B/B2C services (and the underlying distributed service delivery platform) that integrate and optimize business processes beyond the four walls to include the external value chain.

In his 1997 book, The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail, Harvard business professor Clayton Christensen posited that, thanks to the Internet, companies are becoming more vulnerable than ever to a competitor wielding a disruptive technology - a technical process or business model so transformative that it could shake a Fortune 500-sized corporation, or even an entire industry, to its foundation. The lesson is that companies must structure themselves so they can rapidly build a new business around a disruptive technology even as they sustain their core competency.

IBM's OnDemand Enterprise is described as: "An Enterprise whose business process – integrated end-to-end across the company and with key partners, suppliers and customers – can respond with speed to any customer demand, market opportunity or external threat".

Like Coase's Law, the expression of IBM's OnDemand vision is really common sense. It is the confluence of technology and economics today that has caused these ideas to become very interesting. Now it all comes down, as it always does, to execution.

And one of the initiatives we're driving at Sun that I can talk about is the Service Optimized Data Center (SODC). The program comprises an extensive set of services and technologies: Sun creates a comprehensive roadmap, which is used to transform your data center into an efficient, risk-averse, and agile service-driven environment that treats IT operations as a strategic business driver and competitive weapon.

Wednesday Jan 19, 2005

Oracle Tech Day & NY Cabbies

It's been awhile since I've visited New York. Last time I was there I met with customers in the World Trade Center. Yesterday I was in midtown Manhattan at the Grand Hyatt, attached to Grand Central Station.

I presented at an Oracle Technology Day. Over 500 people registered for the event to hear about technology and solutions from Sun and Oracle. I discussed, among other things, our ERP Grid Reference Architecture that combines Oracle's 10g RAC with our Opteron-based Servers and Infiniband. Sun is sponsoring five cities. Over 700 are registered for the Atlanta session, to whom I'll be presenting next week.

On the way back home from the NY session, I was dropped off at LaGuardia. I had to cross a two lane street to get across to the main gate/check-in curb. It was a clear (but cold) day, 100% visibility. In front of me was a wide, brightly painted crosswalk. Several people were standing there waiting to cross (which should have been my first clue that things are different in New York). Finally a natural break in traffic... the next group of vehicles is about 70 feet away, led by a black limo approaching at about 20mph. Great! It's our turn... I step out and start to cross. Suddenly someone yells out to warn me... "Hey Buddy, Watch Out"! I look to my right and the limo driver apparently has no intention to respect the inalienable rights of pedestrians in crosswalks! He slows down just enough to allow me to back up onto the curb and get out of his way!

The term "inalienable" is apropos to this experience :-) The root, alien, has this definition:
Adj. Belonging to, characteristic of, or constituting another and very different place, society, or person; strange

I think I saw the cabbie mutter: "you're not from around here, are you". Or, something like that :-) I'm reminded of Morpheus' line in The Matrix when he explains to Neo that: "Some rules can be bent, others can be broken". Seems to be the creed of the NY cabbie.

Anyway, New York is a lot of fun. Just look both ways before you cross. And then, run like hell.

Saturday Jan 15, 2005

CIO Longevity and IT Execution

This is a little longer than I generally like for a blog entry. So, I'll tell you what to expect... I quickly review the essence of IT, then consider why many IT groups are considered ineffective, and finally look at what can be done to improve execution.

The essence of Information Technology is to create, deliver, and sustain high-quality IT services that meet (on time and within budget) the functional specs and the on-going service level agreements (SLAs) as established thru a partnership with the owners of the requested services. This is, in a nutshell, the role and ultimate responsibility of the CIO.

The creation of IT services generally focuses on functional requirements (the purpose of the application - what the service needs to do for the consumer/user). The delivery and support of those services focuses more on quality of service (QoS) attributes, such as performance, as well as the non-functional or systemic qualities (aka: the "ilities") such as reliability, availability, maintainability, securability, manageability, adaptability, scalability, recoverability, survivability, etc. A quick Google search found this paper among many on the topic.

Unfortunately, achieving success is often doomed from the start. This is probably why the average CIO survives for just 30 months (a new Gartner report even suggests that 2/3rds of CIOs are worried about their job)! Quality is sacrificed on the altar of expedience. Developers focus exclusively on the functional spec. For example, it is rare to find developers who are concerned with Recovery-Oriented Computing techniques (ref: Berkeley's David Patterson, et al) that can help mask infrastructure faults by, say, performing run-time discovery and binding of alternate dependencies. It is too easy for a developer to assume their target platform is failsafe, or that recovery is outside their area of concern. That's just lazy or ignorant, IMHO.
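For the curious, here's roughly what "run-time discovery and binding of alternate dependencies" looks like in miniature. The names are invented for illustration; this sketches the pattern, not any particular ROC implementation:

```python
# Sketch of fault masking via alternate dependencies (names are invented).
def call_with_failover(endpoints, request):
    """Try each candidate dependency in order, masking individual faults."""
    errors = []
    for name, fn in endpoints:
        try:
            return fn(request)
        except Exception as exc:      # a real service would narrow this
            errors.append((name, exc))
    raise RuntimeError(f"all dependencies failed: {errors}")

def primary(req):
    raise ConnectionError("primary db down")   # simulated infrastructure fault

def replica(req):
    return f"answer({req})"

# Discovery would normally populate this list at run time:
print(call_with_failover([("primary", primary), ("replica", replica)], "q1"))
# prints answer(q1) - the primary's fault never reaches the caller
```

The point is that the app, not just the platform, participates in recovery: a developer who assumes a failsafe platform would have hard-wired `primary` and taken the outage.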

Just as guilty are the teams responsible for the implementation of those services. Too often new services stand alone in a datacenter as a silo, constructed using a unique set of components and patterns. Often, even if there is an IT Governance Board and/or an Enterprise Architectural Council, their strategic vision, standards and best practices are ignored, ostensibly to achieve time-to-market goals. In reality, it's just easier to not worry about the greater good.

What am I leading up to? Well, I believe there are two key areas that IT must take more seriously in order to increase their value to shareholders and to those who desire their services. These might even help the CIO keep his or her job.

The first is the effective leadership and influence of an Enterprise Architecture Council. One that has a clear and compelling vision of a shared services infrastructure, and has established a pro-active partnership with the developer community and strategic vendors to realize that vision. One that fights hard against human nature to ensure that IT services meet standards in quality, adaptability, observability, platform neutrality, etc.

The second is a focus on the disciplines associated with running a world-class datacenter operation. There is a well established set of standards that is useful as a framework around which these disciplines can be built. It's called the IT Infrastructure Library (ITIL) and is widely adopted in Europe and increasingly being pursued in the States across businesses, agencies, and institutions.

There are 10 ITIL "Best Practice" disciplines associated with the Delivery and Support of IT Services. These prescribe relevant and desirable processes that an IT group should seek to implement if they desire to evolve to a higher level of Operational Maturity. ITIL is highly focused on building a working partnership between IT and the associated Business Units, on increasing the quality of IT services, on reducing the true cost of operations, on establishing communications and execution plans, on the promotion of the value of IT, on understanding the cost and priority of services, etc.

Of the ten focus areas, the ITIL discipline that is probably the most important to start with is "Change Mgmt". This is a key area with a significant ROI in terms of service quality and cost. The cost of sloppy change control is huge. In a Fortune 500 account I visited recently, the S/W developers all have root access to the production machines and make changes ad hoc!! Unfortunately, this isn't uncommon. The introduction of structure and discipline in this area is a great test case for those who think they want to implement ITIL. While the benefits are self evident, it isn't easy. The change will take exec level commitment. There will be serious pressure to resist a transition from a cowboy/hero culture to one that produces repeatable, consistent, predictable high-quality service delivery. The "heroes" won't like it, and they often wield influence and power. But, if this ITIL discipline can be instilled, the other nine have a chance. It's a multi-year effort, but the results will be a highly tuned and business-linked competitive weapon.

The journey that many IT shops will have to take to achieve higher levels of maturity as suggested by Gartner and Meta, and described by the ITIL Best Practices, is a systemic culture change that fills gaps, eliminates overlap, aligns effort, and establishes structure and methods, designed to increase quality and lower costs. But, ultimately, it is a journey to prosperity and away from dysfunction. ITIL isn't to be taken lightly. It isn't for all IT departments (well, it is to some level, but many aren't ready to make the commitment). These charts show that most (>80%) have stopped and camped on the shore of mediocrity way too early in the journey.

There is a certification track for ITIL. A 3-day ITIL Essentials class is available to provide an introduction and "conversant" knowledge of the various discipline areas. A multiple choice cert test validates this level of understanding. This class is a pre-req for the very intense two-week ITIL Managers (aka: Masters) class. More than 50% fail the two 3-hour Harvard Business School-style essay exams that are taken to achieve this level of certification. This is a respected certification and actually demonstrates true command of the principles of IT service excellence.

Sun also has offerings around our Service Optimized Data Center program, a new comprehensive roadmap of services and technologies to help customers deploy and manage IT services faster, smarter and more cost-effectively in their data centers. EDS Hosting Services is pleased with it. SODC leverages, among other things, our Operations Management Capability Model, based on principles from the Information Technology Infrastructure Library (ITIL) and the Controls Objective for Information and Related Technology (COBIT).

I believe Sun can establish itself as more than a parts and technology vendor by demonstrating value in helping our customers address the "Process of IT", into which our Technical Solutions are best delivered.

Friday Jan 14, 2005

The Fallacy of IBM's Power6

IBM is leaking FUD about its processors again. The Power5+, it is said, will be released later this year, ramping to 3GHz. The Power6, according to a "leaked" non-disclosure preso discussed by TheRegister, will sport "very large frequency enhancements". At the end of another news.com article, IBM suggests the Power6 will run at an "ultra-high frequency".

In engineering terms, those kinds of phrases generally imply at least an "order of magnitude" type of increase. That's [3GHz \* 10\^1], or an increase to 30GHz! But let's view this through a marketing lens and say IBM is only talking about a "binary" order of magnitude [3GHz \* 2\^1]. That still puts the chip at 6GHz.

And therein lies part of the problem. First, even Intel can't get past 4GHz. In an embarrassing admission, they pulled their plans for a 4GHz Pentium and will concentrate their massive brain trust of chip designers on more intelligent ways to achieve increasing performance. More on that in a minute. Now I know IBM has some pretty impressive semiconductor fab processes and fabrication process engineers. But getting acceptable yields from a 12" wafer with 1/2 billion transistor chips at 6GHz and a 65nm fab process is pure rocket science. They can probably do it, at great shareholder expense. But even if that rocket leaves the atmosphere, they are still aiming in the wrong direction. As Sun, and now Intel, have figured out, modern apps and the realities of DRAM performance (even with large caches) render "ultra-high" clock rates impotent.

I've also got to hand it to IBM's chip designers... Here is an interesting technical overview of the z990 (MainFrame) CPU. The Power6 is targeted as the replacement for the z990, so it'll have to meet the z990 feature bar. The Power6 is rumored to be a common chip for their M/F zSeries and Unix pSeries platforms... (but they've been talking about a common chip for 10 years now, according to Gartner). Here is an excerpt of the z990 description:

"These include millicode, which is the vertical microcode that executes on the processor, and the recovery unit (R-unit), which holds the complete microarchitected state of the processor and is checkpointed after each instruction. If a hardware error is detected, the R-unit is then used to restore the checkpointed state and execute the error-recovery algorithm. Additionally, the z990 processor, like its predecessors, completely duplicates several major functional units for error-detection purposes and uses other error-detection techniques (parity, local duplication, illegal state checking, etc.) in the remainder of the processor to maintain state-of-the-art RAS characteristics. It also contains several mechanisms for completely transferring the microarchitected state to a spare processor in the system in the event of a catastrophic failure if it determines that it can no longer continue operating."

Wow! Still, they are continuing to fund rocket science based on the old "Apollo" blueprints. And that "dog don't hunt" any longer, to mix metaphors. Single thread performance and big SMP designs are still important. Sun leads the world in that area, with the 144-core E25K. And our servers with US-IVs (et al), AMD Opterons, and the engineering collaboration we're doing with Fujitsu should continue that leadership. But extreme clock rates are not the answer going forward.

In the benchmarketing world of TPC-C and SPECrates, where datasets fit nicely inside processor caches, performance appears stellar. But the problem, you see, is that for real applications, especially when micro-partitioning and multiple OS kernels and stacked applications are spread across processors, the L1/L2/L3 caches only contain a fraction of the data and instructions that the apps need to operate. At 6GHz, there is a new clock tick every 0.17ns (light only travels about 2 inches in that time)!! However, about every 100 instructions or so, the data needed by a typical app might not appear in the processor cache chain. This is called a "cache miss" and it results in a DRAM access (or worse - a disk access). Typical DRAM latency is about 150-300ns for large/complex SMP servers. Think about that... a 6GHz CPU will simply twiddle its proverbial thumbs for over 1000 clock ticks (doing nothing but generating heat) before that DRAM data makes its way back up to the CPU so that work can continue. If this happens every 100 instructions, we're at <10% efficiency (100 instructions, followed by 1000 idle cycles, repeat). Ouch!! And that ratio just gets worse as the CPU clock rate increases. Sure, big caches can help some, but not nearly enough to overcome this fundamental problem.
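If you want to play with the arithmetic yourself, here's a back-of-the-envelope model. The numbers are the illustrative ones from above (a miss roughly every 100 instructions, ~170ns to DRAM), not measurements from any particular chip:

```python
# Toy CPU efficiency model: assumes 1 instruction per cycle when not
# stalled, and a full stall on every cache miss while DRAM responds.
def cpu_efficiency(clock_ghz, instructions_per_miss, dram_latency_ns):
    """Fraction of cycles spent doing useful work."""
    cycle_ns = 1.0 / clock_ghz                    # 6GHz -> ~0.167ns/cycle
    stall_cycles = dram_latency_ns / cycle_ns     # ~170ns -> ~1020 cycles
    return instructions_per_miss / (instructions_per_miss + stall_cycles)

print(f"{cpu_efficiency(6.0, 100, 170):.0%}")    # -> 9%
```

Note that raising the clock only makes this worse: at 12GHz with the same DRAM, the same formula gives about half the efficiency, which is exactly the point about "ultra-high" frequencies chasing the wrong target.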

What to do? The answer is to build extremely efficient thread engines that can accept multiple thread contexts from the OS and manage those on chip. And we're not talking 2-way hyper-threading here. Say a single processor can accept dozens of threads from the OS. Say there are 8 cores on that processor so that 8 threads can run concurrently, with the other threads queued up ready to run. When any one of those 8 threads need to reach down into DRAM for a memory reference (and they will, frequently), one of the H/W queued threads in the chip's run queue will instantly begin to execute on the core vacated by the "stalled" thread that is now patiently waiting for its DRAM retrieval. We've just described a design that can achieve near 100% efficiency even when DRAM latency is taken into account. Ace's Hardware reports that "Niagara has reached first silicon, and is running in Sun's labs".
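Here's a rough way to quantify why queued hardware threads hide the stalls. The run/stall numbers are the same assumed figures as above, not actual Niagara specs:

```python
# Model: each thread runs ~100 cycles, then stalls ~1000 cycles on a
# memory reference. A core stays busy as long as some ready thread is
# queued behind the stalled one, so utilization scales with thread count.
def core_utilization(threads_per_core, run_cycles=100, stall_cycles=1000):
    # Each thread is runnable for run/(run+stall) of the time; a core can
    # soak up the runnable fraction of all its threads, capped at 100%.
    runnable_fraction = run_cycles / (run_cycles + stall_cycles)
    return min(1.0, threads_per_core * runnable_fraction)

for n in (1, 4, 8, 11):
    print(n, f"{core_utilization(n):.0%}")
# 1 thread  ->   9%  (the single-threaded 6GHz case from above)
# 11 threads -> 100% (enough queued work to cover every stall)
```

With these assumed numbers, a core needs about 11 resident threads to stay fully busy, which is why "dozens of threads" per processor, not 2-way hyper-threading, is the interesting design point.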

I won't comment on the veracity of that report. But if true, we are years ahead of competition. We're orbiting the Sun, and IBM is still sending its design team to the moon.

An analogy - consider an Olympic relay race... There are 8 teams of 4 runners. Each runner sprints for all they are worth around the track once, and then hands the baton, in flight, to the next runner. We've got 32 "threads" that are constantly tearing up the track at full speed. On the other hand, a 6GHz single threaded core is like a single runner who sprints like a madman around the track once, and then sits down for 15 minutes to catch his breath. Then does it again. Which model describes the kind of server you'd like running your highly threaded enterprise applications?

Thursday Jan 13, 2005

Solution Consulting @ Sun

I just met with a large customer up here in Virginia. The rep I was with spoke of a colleague who has an amazing ability to sell complete solutions (not just a collection of parts). He delivers Solution Proposals with the not so subtle expectation that they will not be broken down into component parts with line item veto authority on the part of the customer. Somehow we need to bottle that sales behavior... The benefit of a proven solution w.r.t. cost, risk, complexity, support, etc, is self-evident. Too often, I believe, Sun's field is conditioned to (or we've conditioned the customer to think that we) offer solutions as strawmen that we expect will be hacked up and put back together (with many pieces left on the garage floor).

Client Solutions (read: Professional Services from Sun and our Partners) needs to be part of the total Solution Package. And we need to present the package with the clear expectation that we'll assist in the design, test, deployment and on-going mgmt/support, be committed to our customer's success, share in the risk, etc. But that the solution stands as a whole... If the customer simply wants a miscellany of parts, then we'll need a note from their mom :-) (eg: the CIO) that they understand the increased risk to their project's cost, timeline, and ultimate success. That they are "skiing outside the boundary area".

I've noticed that about half of the customers I deal with have senior techno-geeks on their staff. They often go by the title "Architect". Often they are far from it... but they've been there forever, and they are often brilliant technologists that can integrate "creative" solutions from random piece parts. In fact, this is how they thrive and how they (think they) sustain their value add... They become threatened by and obstacles to a solution sale in which the integration work is done for them. Somehow we need to figure out how to make these "technical grease monkeys" feel comfortable with a custom automobile that comes from Detroit already well tuned and ready to run. Sun can't survive being in the auto parts business.  We need to leverage their brilliance and secure their vote of confidence. There is an art to getting folks like this to "come up with an idea" that you already have :-) If they become the "owner" of the reference architecture (upon which the proposed solution is built), and still get to play with the technology and learn new techniques, and they can still look like they came up with the idea, then I think we can get past that common roadblock.

However, I think there is a development gap in Client Solutions that we have an opportunity to address... We have a lot of people who can talk the talk... but we have fewer people that have actually implemented complex solutions such as N1 SPS, Trusted Solaris based SNAP solutions, Retail-oriented SunRay POS gigs, comprehensive ITIL compliance audits, strategic BCP consultation, etc... This is a natural fallout of the fact that most of us came from the pre-sales side of the merged Client Solutions organization. As we become even more successful in securing solution architecture and implementation gigs, we'll need to step up and hit the ball out of the park - not just talk about being able to do it. I encourage everyone to get as much hands-on experience as possible with our strategic solution offerings. I know I'm doing that with N1 SPS, SOA, and Sol10. I know we're all ramping our skills. That's goodness. Thankfully, I think it is easier to engage partners and teach (or remind) bright technical pre-sales "SEs" how to architect and implement solutions, than it is to teach implementation gurus the inter-personal skills and acumen needed to talk to CIOs about business value and relevance.

Sunday Jan 09, 2005

Original "Think Pad"

As an Electrical Engineering undergrad, I worked for IBM for four semesters as an intern/co-op student back in the very early 80's in Boca Raton, FL, just as the first IBM PC was brought to market. It was an incredible experience, in many ways. Today, about 25 years later (wow, I can't be that old!!) I was cleaning out my attic, preparing to put back all the Christmas boxes for another year. I opened some of the boxes to figure out what I had up there... And came across something from my days at IBM. An original IBM "Think Pad". Measuring just 3" x 4.5", this is the pocket-sized progenitor of the now ubiquitous lap-sized room heater.

You know... there is something to be said for the utility and durability and availability and cost-effectiveness of the original. Where will your "modern" ThinkPad be in 25 years? I'll still have mine, and it'll still be as useful as it was in 1980 :-) No upgrades and no viruses.

Wednesday Jan 05, 2005

Cobalt Qube3 w/ SunRays, RedHat 9

I've got a Cobalt Qube 3 Professional Edition computer. Remember those cute blue cube Linux appliances?  Sun was handing these out to SEs at one point.

They only have a 450MHz processor. But they are the perfect little home file server and networked backup device. The Business and Professional Editions also have a SCSI port to which additional storage can be attached. In fact, the Professional Edition has two internal disks and a built-in Raid1 Controller. It's headless, but has nice features for a server. Problem is (well, you might consider this a problem) it runs an old Linux release (based on a 2.2 kernel) and has been EOL'ed. But in true Open fashion, there is a grassroots community of developers and advocates, and there are instructions for how to refresh this device to a 2.4-based RedHat (v7.2) kernel here:


I just exchanged e-mail with Dax Kelson of Guru Labs, who told me that this procedure can be used to install RedHat 9 or even the newer Fedora releases.

I think I'm going to give this a try. I'll let you know if/how this works out. Hmmm, with the new Linux-based SunRay Server Software, I could even potentially drive a couple wireless SunRays around the house, using an 802.11g Wireless Bridge, such as: http://www.dlink.com/products/?pid=241

Thursday Dec 23, 2004

Big Sun Clusters!!

The Center for Computing and Communication (CCC) at the RWTH Aachen University has recently published details about two interesting clusters they operate using Sun technology.  RWTH Aachen is the largest university of technology in Germany and one of the most renowned technical universities in Europe, with around 28,000 students, more than half of which are in engineering (according to their website).

Check this out!

First, there is a huge Opteron-Linux-Cluster that consists of 64 of Sun's V40z servers, each with four Opteron CPUs. The 256 processors total 1.1TFlop/s (peak) and have a pool of RAM equal to 512GB. Each node runs a 64-bit version of Linux. Hybrid Programs use a combination of MPI and OpenMP, where each MPI process is multi-threaded. The hybrid parallelization approach uses a combination of coarse grained parallelism with MPI and underlying fine-grained parallelism with OpenMP in order to use as many processors efficiently as possible. For shared memory programming, OpenMP is becoming the de facto standard.

See: http://www.rz.rwth-aachen.de/computing/info/linux/primer/opteron_primer_V1.1.pdf
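To make the coarse/fine split concrete, here's a toy sketch of the hybrid pattern, not the Aachen code. For simplicity it loops over the "MPI ranks" serially (real code would launch separate MPI processes), with a thread pool standing in for the OpenMP threads inside each rank:

```python
from concurrent.futures import ThreadPoolExecutor

def rank_work(rank, ranks, data):
    # Coarse-grained level (MPI in the real thing): each rank owns a
    # strided slice of the domain.
    chunk = data[rank::ranks]
    # Fine-grained level (OpenMP in the real thing): threads share the
    # rank's memory and split its chunk further.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return sum(pool.map(lambda x: x * x, chunk))

def run_hybrid(data, ranks=4):
    # Combine the partial results from every rank (an MPI reduce, roughly).
    return sum(rank_work(r, ranks, data) for r in range(ranks))

print(run_hybrid(range(1000)))  # -> 332833500, same as the serial sum of squares
```

The point of the two-level decomposition is the one the primer makes: MPI handles distribution across nodes, while OpenMP threads soak up the processors within each shared-memory node.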

Another Cluster is based on 768 UltraSPARC-IV processors, with an accumulated peak performance of 3.5 TFlop/s and a total main memory capacity of 3 TeraByte. The Operating System's view of each of the two cores of the UltraSPARC IV processors is as if they are separate processors. Therefore from the user's perspective the Sun Fire E25Ks have 144 “processors”, the Sun Fire E6900s have 48 “processors” and the Sun Fire E2900s have 24 “processors” each. All compute nodes also have direct access to all work files via a fast storage area network (SAN) using the QFS file system. High IO bandwidth is achieved by striping multiple RAID systems.

See: http://www.rz.rwth-aachen.de/computing/info/sun/primer/primer_V4.0.pdf

Big -vs- Small Servers?

Big Iron -vs- Blades. Mainframe -vs- Micro. Hmmm. We're talking Aircraft Carriers -vs- Jet Skis, right?

Sun designs and sells servers that cost from ~$1000 to ~$10 million. Each! We continue to pour billions into R&D and constantly raise the bar on the quality and performance and reliability and feature set that we deliver in our servers. No wonder we lead in too many categories to mention. Okay, I'll mention some :-)

While the bar keeps rising on our "Enterprise Class", the Commodity/Volume Class is never too far behind. In fact, I think it may be inappropriate to continue to refer to our high-end as our Enterprise-class Servers, because that could imply that our "Volume" Servers are only for workgroups or non-mission-critical services. That is hardly the case. Both are important and play a role in even the most critical service platforms.

Let's look at the next generation Opterons... which are only months away. And how modern S/W Architectures are fueling the adoption of these types of servers...

Today's AMD CPUs, with on-board hypertransport pathways, can handle up to 8 CPUs per server! And in mid-2005, AMD will ship dual-core Opterons. That means that it is possible for a server, by mid-2005 or so, to have 16 Opteron cores (8 dual-core sockets) in just a few rack units of space!! If you compare SPECrate values, such a server would have the raw compute performance capability of a full-up $850K E6800. Wow!

AMD CPU Roadmap: http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_608,00.html
AMD 8-socket Support: http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~72268,00.html
SPECint_rate: http://www.spec.org/cpu2000/results/rint2000.html
E6800 Price: http://tinyurl.com/3xbq2

Clearly, there are many reasons why our customers are and will continue to buy our large SMP servers. They offer Mainframe-class on-line maintenance, redundancy, upgradability. They even exceed the ability of a Mainframe in terms of raw I/O capability, compute density, on-the-fly expansion, etc.

But, H/W RAS continues to improve in the Opteron line as well. One feature I hope to see soon is on-the-fly PFA-orchestrated CPU off-lining. If this is delivered, it'll be Solaris x86 rather than Linux. Predictive Fault Analysis could detect that one of those 16 cores or 32 DIMMs is starting to experience soft errors, in time to fence off that component before the server and all its services crash. The blacklisted component could be serviced at the next scheduled maintenance event. We can already do that on our Big Iron. But with that much power, and that many stacked services in a 16-way Opteron box, it would be nice not to take a node panic and extended node outage.
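The policy itself is simple to sketch. This is a hypothetical illustration of the PFA idea, not Solaris code; component names and the threshold are invented:

```python
from collections import Counter

class FaultMonitor:
    """Toy predictive-fault-analysis policy: count correctable (soft)
    errors per component and fence off any component that crosses a
    threshold, before it can take the whole node down."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.soft_errors = Counter()
        self.blacklisted = set()

    def report_soft_error(self, component):
        self.soft_errors[component] += 1
        if (self.soft_errors[component] >= self.threshold
                and component not in self.blacklisted):
            self.blacklisted.add(component)
            # A real implementation would ask the OS to offline the core
            # or retire the DIMM (e.g. via psradm on Solaris).
            print(f"fencing {component} pending scheduled maintenance")

mon = FaultMonitor()
for _ in range(3):
    mon.report_soft_error("core7")   # third soft error trips the fence
```

The payoff is exactly the one described above: the blacklisted part waits for the next scheduled maintenance window instead of causing a panic.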

On the other hand, 80% of the service layers we deploy are already or are attempting to move to the horizontal model. And modern S/W architectures are increasingly designed to provide continuity of service level even in the presence of various fault scenarios. Look at Oracle RAC, replicated state App Servers with Web-Server plug-ins to seamlessly transfer user connections, Load Balanced web services, TP monitors, Object Brokers, Grid Engines and Task Dispatchers, and SOA designs in which an alternate for a failed dependency is rebound on-the-fly.

These kinds of things, and many others, are used to build resilient services that are much more immune to component or node failures. In that regard, node level RAS is less critical to achieving a service level objective. Recovery Oriented Computing admits that H/W fails [http://roc.cs.berkeley.edu/papers/ROC_TR02-1175.pdf]. We do need to reduce the failure rate at the node/component level... but as Solution Architects, we need to design services such that node/component failure can occur, if possible, without a service interruption or degradation of "significance".
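That "rebind an alternate on-the-fly" pattern fits in a few lines. A minimal sketch, with invented endpoint names and a simulated transport standing in for a real service call:

```python
ENDPOINTS = ["svc-a.example.com", "svc-b.example.com"]  # hypothetical

def call_with_rebind(request, endpoints, send):
    """Try the bound dependency first; on failure, rebind to the next
    known alternate instead of failing the whole service."""
    last_error = None
    for endpoint in endpoints:
        try:
            return send(endpoint, request)
        except ConnectionError as err:   # bound dependency failed
            last_error = err             # fall through to the alternate
    raise RuntimeError("no alternate dependency available") from last_error

# Simulated transport: the primary is down, the alternate answers.
def fake_send(endpoint, request):
    if endpoint == "svc-a.example.com":
        raise ConnectionError("primary down")
    return f"{endpoint} handled {request}"

print(call_with_rebind("ping", ENDPOINTS, fake_send))
```

In a real SOA deployment the alternate list would come from run-time discovery (a registry lookup) rather than a static list, but the design choice is the same: the service degrades to an alternate rather than failing outright.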

In the brave new world (or, the retro MF mindset) we'll stack services in partitions across a grid of servers. Solaris 10 gives us breakthrough new Container technology that will provide this option. Those servers might be huge million dollar SMP behemoths, or $2K Opteron blades... doesn't matter from the architectural perspective. We could have dozens of services running on each server... however, most individual services will be distributed across partitions (Containers) on multiple servers, such that a partition panic or node failure has minimal impact. This is "service consolidation" which includes server consolidation as a side effect. Not into one massive server, but across a limited set of networked servers that balance performance, adaptability, service reliability, etc.

Server RAS matters. Competitive pressure will drive continuous improvement in quality and feature sets in increasingly powerful and inexpensive servers. At the same time, new patterns in S/W architecture will make "grids" of these servers work together to deliver increasingly reliable services. Interconnect breakthroughs will only accelerate this trend.

The good news for those of us who love the big iron is that there will always be a need for aircraft carriers even in an age of powerful jet skis.



