Wednesday Dec 04, 2013

It's good to stare!

Some say it's rude to stare.  But that's not my experience.

I've been working on SuperCluster for 2 years now.  And I've been looking intently at issues arising for SuperCluster customers, both to ensure the issues are fixed a.s.a.p. and to understand what lessons we can learn and where we can improve our products and processes.

Why?

Well, I want every customer to have the best possible experience. 

Call me naive, but I sincerely believe that ensuring a good customer experience is the best way to encourage repeat business.

So, what have I learned?

I've been working in the Solaris customer lifecycle space for 15 years now.  One thing that's always puzzled me is why, while most customers have a perfectly good experience, there are always one or two who repeatedly hit problems.

The reasons are often not obvious.  They may be running very similar hardware with very similar software configuration and broadly comparable workloads to hundreds of other customers in the same industry segment who are not experiencing any issue.

It's easy to assume that there may be something subtle "wrong" in their set-up: a misconfigured network, a piece of 3rd party kit which we don't have internally to help us reproduce the issue, 3rd party or home-grown apps relying on private interfaces they shouldn't be using, or even a dodgy "favorite" /etc/system setting which the customer "knows" works from their Solaris 2.5.1 or V880 days but which hamstrings performance today.  Occasionally, despite enormous effort, it feels like we never get to true root cause, and that customer never does have an optimal experience.

More often, we do determine the root cause, which may indeed be a sub-optimal configuration.  But if the system's already in production, it may not be possible to reconfigure it and start again, so the customer experience remains compromised for that system.

Indeed, it's for this exact reason - sub-optimal customer lifecycle experiences are often due to sub-optimal initial install and configuration - that my team was asked to develop the install and configuration utilities for SuperCluster, so that systems are configured according to best practice right out of the box.  And that's worked very well indeed.

But some issues do still arise for SuperCluster customers.

Most arise when we leverage new functionality - initially Infiniband, more lately VM2 and iSCSI.  These issues are found and fixed rapidly, with proactive roll-out of the fixes to the entire SuperCluster customer base.

I previously blogged that, even though SuperCluster is configurable and certainly not an appliance, we are finding Engineered Systems issues much easier to debug: the fixed hardware layout, cabling, protocols, etc. dramatically reduce the number of variables in play, making in-house issue reproduction much easier, and hence issue analysis and resolution much faster.  This really helps to improve our customers' experience.

But we still see a very small number of customers (two or possibly three come to mind) who repeatedly hit issues not seen by any other. 

Why is that?

The hardware is identical.  The configurations are similar.  We have other customers in the same industry segment utilizing their SuperClusters for broadly similar purposes, even with similar DB and load characteristics.  We know the networking is correct - it's fixed.  We know the I/O cards are in the right slots - it's fixed.  We know we're using the optimal protocols, configured optimally.  We even have a process, ssctuner, running in the background to check that no dodgy settings are added to /etc/system, and it'll automatically remove them if they are.
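
To give a flavour of what ssctuner watches for, here's a hypothetical example of the kind of legacy /etc/system entry it's designed to catch - old System V IPC tuning carried over from earlier Solaris releases, which Solaris 11 manages via resource controls instead.  The snippet is purely illustrative, not taken from any customer system:

    $ tail -3 /etc/system
    * Legacy System V shared memory tuning from a Solaris 8-era install guide.
    * Obsolete on Solaris 11, where IPC limits are managed via resource controls.
    set shmsys:shminfo_shmmax=0xffffffff

On a SuperCluster, ssctuner would flag an entry like that and remove it automatically.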

We've gone through an interesting period over the summer.  In early summer, we were seeing very few issues indeed reported from our now large customer base.  Then, we saw 3 customers raise issues in quick succession.

The first, in Europe, looked like an Infiniband issue.  Responses would just stop for multiple seconds for no apparent reason, then restart.  We actually sent two very experienced engineers on site after attempts to debug over a shared shell were unsuccessful, and they root-caused a VM2 (Virtual Memory) issue and two scheduler issues.

Almost the same week, two U.S. SuperCluster customers raised VM2 issues.  Our lead VM2 sustaining engineer, Vamsi Nagineni, engaged Eric Lowe from the VM2 development team, and they determined that none of the customer issues had the same root cause.

In one case, a bank, the customer's database is not optimized for Exadata, so more of the load runs on the SuperCluster compute nodes rather than on the storage cells.  Nothing overly excessive, just enough to encounter an issue not seen by other customers.

In another, a State social services provider, the customer runs a high proportion of batch processing.  Again, nothing excessive, just enough to encounter a different issue not seen by other customers.

In the third, a major retailer, the customer's apps had very specific memory requirements which the VM2 algorithms were handling sub-optimally.

The outcome of this is that a number of subtle VM2 and other bugs have been found and fixed, not just for the benefit of these and other SuperCluster customers: since the fixes are put back into generic Solaris SRUs, all Solaris 11 customers benefit.

Without the reduced variables at play in Engineered Systems, it would be extremely difficult if not impossible to reproduce, analyze, and fix such subtle issues.

So even if you don't have a SuperCluster, you can still reap the benefits.

FYI, currently most of the SuperCluster install base is running Solaris 11.1 SRU7.5 (which fixes a number of VM2 issues).
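
If you want to check which SRU a given system is running, the version of the 'entire' incorporation shows it.  The output below is an illustrative sketch for an 11.1 SRU 7.5 system:

    $ pkg info entire | grep Version
           Version: 0.5.11 (Oracle Solaris 11.1.7.5.0)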

BTW: We also improved the SRU README last month to summarize the important content.

Best Wishes,

Gerry.

Tuesday Nov 06, 2012

Unexpected advantage of Engineered Systems

It's not surprising that Engineered Systems accelerate the debugging and resolution of customer issues.

But what has surprised me is just how much faster issue resolution is with Engineered Systems such as SPARC SuperCluster.

These are powerful, complex systems used by customers wanting extreme database performance, app performance, and cost-saving server consolidation.

A SPARC SuperCluster consists of 2 or 4 powerful T4-4 compute nodes, 3 or 6 extreme performance Exadata Storage Cells, a ZFS Storage Appliance 7320 for general purpose storage, and ultra-fast Infiniband switches.  Each with its own firmware.

It runs Solaris 11, Solaris 10, 11gR2, LDoms virtualization, and Zones virtualization on the T4-4 compute nodes, a modified version of Solaris 11 in the ZFS Storage Appliance, a modified and highly tuned version of Oracle Linux running Exadata software on the Storage Cells, another Linux derivative in the Infiniband switches, etc.
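
For anyone unfamiliar with that stack, the virtualization layers are visible on a compute node with the standard Solaris tools - the two commands below are just an illustrative sketch:

    $ ldm list          # list the logical domains (LDoms) on this compute node
    $ zoneadm list -cv  # list the zones, configured and running, within a domain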

It has an Infiniband data network between the components, a 10Gb data network to the outside world, and a 1Gb management network.

And customers can run whatever middleware and apps they want on it, clustered in whatever way they want.

In one word, powerful.  In another, complex.

The system is highly Engineered.  But it's designed to run general purpose applications.

That is, the physical components, configuration, cabling, virtualization technologies, switches, firmware, Operating System versions, network protocols, tunables, etc. are all preset for optimum performance and robustness.

That improves the customer experience as what the customer runs leverages our technical know-how and best practices and is what we've tested intensely within Oracle.

It should also make debugging easier by fixing a large number of variables which would otherwise be in play if a customer or Systems Integrator had assembled such a complex system themselves from the constituent components.  For example, there are myriad network protocols which could be used with Infiniband, myriad ways the components could be interconnected, myriad tunable settings, etc.

But what has really surprised me - and I've been working in this area for 15 years now - is just how much easier and faster Engineered Systems have made debugging and issue resolution.

All those error opportunities for sub-optimal cabling, unusual network protocols, sub-optimal deployment of virtualization technologies, issues with 3rd party storage, issues with 3rd party multi-pathing products, etc., are simply taken out of the equation.

All those error opportunities for making an issue unique to a particular set-up - the "why aren't we seeing this on any other system?" type questions, the doubts - just go away when an issue is discovered on an Engineered System, whether by us or by a customer.

It enables a really honed response, getting to the root cause much, much faster than would otherwise be the case.

Here are a couple of examples from the last month, one found in-house by my team, one found by a customer:

Example 1: We found a node eviction issue running 11gR2 with Solaris 11 SRU 12 under extreme load on what we call our ExaLego test system (which mimics an Exadata / SuperCluster 11gR2 Exadata Storage Cell set-up).  We quickly established that an enhancement in SRU 12 enabled an 11gR2 process to query Infiniband's Subnet Manager, replacing a fallback mechanism it had used previously.  Under abnormally heavy load, the query could return results which were misinterpreted, resulting in node eviction.  In several daily joint debugging sessions between the Solaris, Infiniband, and 11gR2 teams, the issue was fully root-caused, evaluated, and a fix agreed upon.  That fix went back into all Solaris releases the following Monday.  From initial issue discovery to the fix being put back into all Solaris releases was just 10 days.

Example 2: A customer reported sporadic performance degradation.  The reasons were unclear and the information sparse.  The SPARC SuperCluster Engineered Systems support team, which comprises both SPARC/Solaris and Database/Exadata experts, worked to root cause the issue.  A number of contributing factors were discovered, including tunable parameters.  An intense collaborative investigation between the engineering teams traced the root cause to a CPU-bound networking thread which was being starved of CPU cycles under extreme load.  Workarounds were identified.  Modifications have been put back into 11gR2 to alleviate the issue, and a development project already underway within Solaris has been sped up to provide the final resolution on the Solaris side.  The fixed nature of the SPARC SuperCluster configuration greatly aided issue reproduction and dramatically sped up root cause analysis, allowing the correct workarounds and fixes to be identified, prioritized, and implemented.  The customer is now extremely happy with performance and robustness.

Since the Engineered System configuration is common to other SPARC SuperCluster customers, the lessons learned are being proactively rolled out to other customers and incorporated into the installation procedures for future customers. 

This effectively acts as a turbo-boost to performance and reliability for all SPARC SuperCluster customers. 

If this had occurred in a "home grown" system of this complexity, I expect it would have taken at least 6 months to get to the bottom of the issue. 

But because it was an Engineered System, known, understood, and qualified by both the Solaris and Database teams, we were able to collaborate closely to identify cause and effect and expedite a solution for the customer. 

That is a key advantage of Engineered Systems which should not be underestimated. 

Indeed, the initial issue mitigation on the Database side, followed by the final fix on the Solaris side, highlights the high degree of collaboration and excellent teamwork between the Oracle engineering teams.

It's a compelling advantage of the integrated Oracle Red Stack in general and Engineered Systems in particular.

About

This blog is to inform customers about Solaris 11 maintenance best practice, feature enhancements, and key issues. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle. The Documents contained within this site may include statements about Oracle's product development plans. Many factors can materially affect these plans and the nature and timing of future product releases. Accordingly, this Information is provided to you solely for information only, is not a commitment to deliver any material code, or functionality, and SHOULD NOT BE RELIED UPON IN MAKING PURCHASING DECISIONS. The development, release, and timing of any features or functionality described remains at the sole discretion of Oracle. THIS INFORMATION MAY NOT BE INCORPORATED INTO ANY CONTRACTUAL AGREEMENT WITH ORACLE OR ITS SUBSIDIARIES OR AFFILIATES. ORACLE SPECIFICALLY DISCLAIMS ANY LIABILITY WITH RESPECT TO THIS INFORMATION. Gerry Haskins, Director, Software Lifecycle Engineering
