By Gerry Haskins-Oracle on Nov 06, 2012
It's not surprising that Engineered Systems accelerate the debugging and resolution of customer issues.
But what has surprised me is just how much faster issue resolution is with Engineered Systems such as SPARC SuperCluster.
These are powerful, complex, systems used by customers wanting extreme database performance, app performance, and cost saving server consolidation.
A SPARC SuperCluster consists or 2 or 4 powerful T4-4 compute nodes, 3 or 6 extreme performance Exadata Storage Cells, a ZFS Storage Appliance 7320 for general purpose storage, and ultra fast Infiniband switches. Each with its own firmware.
It runs Solaris 11, Solaris 10, 11gR2, LDoms virtualization, and Zones virtualization on the T4-4 compute nodes, a modified version of Solaris 11 in the ZFS Storage Appliance, a modified and highly tuned version of Oracle Linux running Exadata software on the Storage Cells, another Linux derivative in the Infiniband switches, etc.
It has an Infiniband data network between the components, a 10Gb data network to the outside world, and a 1Gb management network.
And customers can run whatever middleware and apps they want on it, clustered in whatever way they want.
In one word, powerful. In another, complex.
The system is highly Engineered. But it's designed to run general purpose applications.
That is, the physical components, configuration, cabling, virtualization technologies, switches, firmware, Operating System versions, network protocols, tunables, etc. are all preset for optimum performance and robustness.
That improves the customer experience as what the customer runs leverages our technical know-how and best practices and is what we've tested intensely within Oracle.
It should also make debugging easier by fixing a large number of variables which would otherwise be in play if a customer or Systems Integrator had assembled such a complex system themselves from the constituent components. For example, there's myriad network protocols which could be used with Infiniband. Myriad ways the components could be interconnected, myriad tunable settings, etc.
But what has really surprised me - and I've been working in this area for 15 years now - is just how much easier and faster Engineered Systems have made debugging and issue resolution.
All those error opportunities for sub-optimal cabling, unusual network protocols, sub-optimal deployment of virtualization technologies, issues with 3rd party storage, issues with 3rd party multi-pathing products, etc., are simply taken out of the equation.
All those error opportunities for making an issue unique to a particular set-up, the "why aren't we seeing this on any other system ?" type questions, the doubts, just go away when we or a customer discover an issue on an Engineered System.
It enables a really honed response, getting to the root cause much, much faster than would otherwise be the case.
Here's a couple of examples from the last month, one found in-house by my team, one found by a customer:
Example 1: We found a node eviction issue running 11gR2 with Solaris 11 SRU 12 under extreme load on what we call our ExaLego test system (mimics an Exadata / SuperCluster 11gR2 Exadata Storage Cell set-up). We quickly established that an enhancement in SRU12 enabled an 11gR2 process to query Infiniband's Subnet Manager, replacing a fallback mechanism it had used previously. Under abnormally heavy load, the query could return results which were misinterpreted resulting in node eviction. In several daily joint debugging sessions between the Solaris, Infiniband, and 11gR2 teams, the issue was fully root caused, evaluated, and a fix agreed upon. That fix went back into all Solaris releases the following Monday. From initial issue discovery to the fix being put back into all Solaris releases was just 10 days.
Example 2: A customer reported sporadic performance degradation. The reasons were unclear and the information sparse. The SPARC SuperCluster Engineered Systems support teams which comprises both SPARC/Solaris and Database/Exadata experts worked to root cause the issue. A number of contributing factors were discovered, including tunable parameters. An intense collaborative investigation between the engineering teams identified the root cause to a CPU bound networking thread which was being starved of CPU cycles under extreme load. Workarounds were identified. Modifications have been put back into 11gR2 to alleviate the issue and a development project already underway within Solaris has been sped up to provide the final resolution on the Solaris side. The fixed nature of the SPARC SuperCluster configuration greatly aided issue reproduction and dramatically sped up root cause analysis, allowing the correct workarounds and fixes to be identified, prioritized, and implemented. The customer is now extremely happy with performance and robustness.
Since the Engineered System configuration is common to other SPARC SuperCluster customers, the lessons learned are being proactively rolled out to other customers and incorporated into the installation procedures for future customers.
This effectively acts as a turbo-boost to performance and reliability for all SPARC SuperCluster customers.
If this had occurred in a "home grown" system of this complexity, I expect it would have taken at least 6 months to get to the bottom of the issue.
But because it was an Engineered System, known, understood, and qualified by both the Solaris and Database teams, we were able to collaborate closely to identify cause and effect and expedite a solution for the customer.
That is a key advantage of Engineered Systems which should not be underestimated.
Indeed, the initial issue mitigation on the Database side followed by final fix on the Solaris side, highlights the high degree of collaboration and excellent teamwork between the Oracle engineering teams.
It's a compelling advantage of the integrated Oracle Red Stack in general and Engineered Systems in particular.