It's good to stare!
By Gerry Haskins-Oracle on Dec 04, 2013
Some say it's rude to stare. But that's not my experience.
I've been working on SuperCluster for 2 years now. And I've been looking intently at issues arising for SuperCluster customers, both to ensure the issues are fixed a.s.a.p. and to understand what lessons we can learn and where we can improve our products and processes.
Why stare? Because I want every customer to have the best possible experience.
Call me naive, but I sincerely believe that ensuring a good customer experience is the best way to encourage repeat business.
So, what have I learned?
I've been working in the Solaris customer lifecycle space for 15 years now. One thing that's always puzzled me is why, while most customers have a perfectly good experience, there are always one or two customers who repeatedly hit problems.
The reasons are often not obvious. They may be running very similar hardware with very similar software configuration and broadly comparable workloads to hundreds of other customers in the same industry segment who are not experiencing any issue.
It's easy to assume that there may be something subtle "wrong" in their set-up: a misconfigured network; a piece of 3rd party kit which we don't have in-house to help us reproduce the issue; 3rd party or home-grown apps relying on private interfaces they shouldn't be using; even a dodgy "favorite" /etc/system setting which the customer "knows" works from their Solaris 2.5.1 or V880 days but which hamstrings performance; or whatever. Occasionally, despite enormous effort, it feels like we never get to the true root cause, and that customer never does have an optimal experience.
More often, we do determine the root cause, which may indeed be a sub-optimal configuration but, if the system's already in production, it may not be possible to reconfigure the system and start again, so the customer experience remains compromised for that system.
Indeed, it's for this exact reason - sub-optimal customer lifecycle experiences are often due to sub-optimal initial install and configuration - that my team was asked to develop the install and configuration utilities for SuperCluster, so that systems are configured according to best practice right out of the box. And that's worked very well indeed.
But some issues do still arise for SuperCluster customers.
Most arise when we leverage new functionality - initially InfiniBand, more recently VM2 and iSCSI. These issues are found and fixed rapidly, with proactive roll-out of the fixes to the entire SuperCluster customer base.
I previously blogged that, even though SuperCluster is configurable and certainly not an appliance, we are finding Engineered Systems issues much easier to debug. The fixed hardware layout, cabling, protocols, etc., dramatically reduce the number of variables in play, making issue reproduction in-house much easier, and hence issue analysis and resolution much faster. This really helps to improve our customers' experience.
But we still see a very small number of customers (two or possibly three come to mind) who repeatedly hit issues not seen by any others.
Why is that?
The hardware is identical. The configurations are similar. We have other customers in the same industry segment utilizing their SuperClusters for broadly similar purposes, even with similar DB and load characteristics. We know the networking is correct - it's fixed. We know the I/O cards are in the right slots - it's fixed. We know we're using the optimal protocols, configured optimally. We even have a process, ssctuner, running in the background to check that no dodgy settings are added to /etc/system, and it'll automatically remove them if they are.
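To make that last check concrete, here's a minimal sketch of the idea of auditing /etc/system against a list of approved tunables. To be clear, this is not ssctuner's code or its actual logic: the function name, the allowlist, and the tunable names are all made up for illustration. The only things it relies on are the documented /etc/system conventions that comment lines start with `*` and that tunables are set with `set [module:]name = value`.

```python
import re

# Hypothetical allowlist: the tunables this imaginary site has approved.
# These names are placeholders, not a statement of what ssctuner permits.
ALLOWED = {"zfs:zfs_arc_max"}

# Matches "set [module:]name <op>" where <op> is =, | or &.
# Comment lines (starting with "*") never match this pattern.
SET_RE = re.compile(r"^\s*set\s+([A-Za-z_][\w:]*)\s*[=|&]")


def audit_etc_system(text, allowed=ALLOWED):
    """Split /etc/system content into (kept_lines, flagged_lines).

    A line is flagged only if it is a "set" directive whose tunable
    is not in the allowlist; everything else is kept untouched.
    """
    kept, flagged = [], []
    for line in text.splitlines():
        match = SET_RE.match(line)
        if match and match.group(1) not in allowed:
            flagged.append(line)
        else:
            kept.append(line)
    return kept, flagged


if __name__ == "__main__":
    # Made-up sample content, not taken from a real system.
    sample = (
        "* a comment line\n"
        "set zfs:zfs_arc_max=0x100000000\n"
        "set legacy_mod:old_tunable = 1\n"
    )
    kept, flagged = audit_etc_system(sample)
    for line in flagged:
        print("would remove:", line)
```

The sketch only reports what it would remove and never writes to the file; anything that edits a live /etc/system would want a backup and a log of what it changed.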
We've gone through an interesting period over the summer. In early summer, we were seeing very few issues indeed reported from our now large customer base. Then, we saw 3 customers raise issues in quick succession.
The first, in Europe, looked like an InfiniBand issue. Responses would just stop for multiple seconds for no apparent reason, then restart. After attempts to debug over a shared shell were unsuccessful, we actually sent two very experienced engineers on site, and they root-caused it to a VM2 (Virtual Memory) issue and two scheduler issues.
Almost the same week, two U.S. SuperCluster customers raised VM2 issues. Our lead VM2 sustaining engineer, Vamsi Nagineni, engaged Eric Lowe from the VM2 development team, and they determined that none of the customer issues had the same root cause.
In one case, a bank, the customer's database is not optimized for Exadata, so more of the load runs on the SuperCluster compute nodes rather than on the storage cells. Nothing overly excessive, just enough to encounter an issue not seen by other customers.
In another, a State social services provider, the customer runs a high proportion of batch processing. Again, nothing excessive, just enough to encounter a different issue not seen by other customers.
In the third, a major retailer, the customer's apps had very specific memory requirements which the VM2 algorithms were handling sub-optimally.
The outcome of this is that a number of subtle VM2 and other bugs have been found and fixed. That benefits not just these and other SuperCluster customers: since the fixes are put back into generic Solaris SRUs, all Solaris 11 customers benefit.
Without the reduced variables at play in Engineered Systems, it would be extremely difficult if not impossible to reproduce, analyze, and fix such subtle issues.
So even if you don't have a SuperCluster, you can still reap the benefits.
FYI, currently most of the SuperCluster install base is running Solaris 11.1 SRU7.5 (which fixes a number of VM2 issues).
BTW: We also improved the SRU README last month to summarize the important content.