Wednesday Dec 04, 2013

It's good to stare!

Some say it's rude to stare.  But that's not my experience.

I've been working on SuperCluster for 2 years now.  And I've been looking intently at issues arising for SuperCluster customers, both to ensure the issues are fixed a.s.a.p. and to understand what lessons we can learn and where we can improve our products and processes.

Why?

Well, I want every customer to have the best possible experience. 

Call me naive, but I sincerely believe that ensuring a good customer experience is the best way to encourage repeat business.

So, what have I learned?

I've been working in the Solaris customer lifecycle space for 15 years now.  One thing that's always puzzled me is why, while most customers have a perfectly good experience, there are always one or two customers who repeatedly hit problems.

The reasons are often not obvious.  They may be running very similar hardware with very similar software configuration and broadly comparable workloads to hundreds of other customers in the same industry segment who are not experiencing any issue.

It's easy to assume that there may be something subtle "wrong" in their set-up:  a misconfigured network, a piece of 3rd party kit which we don't have internally to help us reproduce the issue, 3rd party or home grown apps relying on private interfaces they shouldn't be using, or even a dodgy "favorite" /etc/system setting which the customer "knows" works from their Solaris 2.5.1 or V880 days but which now hamstrings performance.  Occasionally, despite enormous effort, it feels like we never get to true root cause and that customer never does have an optimal experience.

More often, we do determine the root cause, which may indeed be a sub-optimal configuration.  But if the system's already in production, it may not be possible to reconfigure it and start again, so the customer experience remains compromised for that system.

Indeed, it's for this exact reason - sub-optimal customer lifecycle experiences are often due to sub-optimal initial install and configuration - that my team was asked to develop the install and configuration utilities for SuperCluster, so that every system is configured according to best practice right out of the box.  And that's worked very well indeed.

But some issues do still arise for SuperCluster customers.

Most arise when we leverage new functionality - initially InfiniBand, more recently VM2 and iSCSI.  These issues are found and fixed rapidly, with proactive roll-out of the fixes to the entire SuperCluster customer base.

I previously blogged that, even though SuperCluster is configurable and certainly not an appliance, we are finding Engineered Systems issues much easier to debug.  The fixed hardware layout, cabling, protocols, etc., dramatically reduce the number of variables in play, making issue reproduction in-house much easier, and hence issue analysis and resolution much faster.  This really helps to improve our customers' experience.

But we still see a very small number of customers (two or possibly three come to mind) who repeatedly hit issues not seen by any other customer.

Why is that?

The hardware is identical.  The configurations are similar.  We have other customers in the same industry segment utilizing their SuperClusters for broadly similar purposes, even with similar DB and load characteristics.  We know the networking is correct - it's fixed.  We know the I/O cards are in the right slots - it's fixed.  We know we're using the optimal protocols, configured optimally.  We even have a process, ssctuner, running in the background to check that no dodgy settings are added to /etc/system, and it'll automatically remove any that are.
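To give a flavour of the kind of check involved, here is a minimal sketch in Python.  It is not ssctuner's actual implementation, which I'm not reproducing here:  the function name, the approved list, and the sample file contents are all made up for illustration.  It only handles simple "set name=value" lines and treats lines starting with "*" or "#" as comments.

```python
def unapproved_settings(etc_system_text, approved):
    """Return the tunables set in /etc/system-style text that are not approved.

    Hypothetical helper for illustration only; not how ssctuner works.
    """
    flagged = []
    for raw in etc_system_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("*", "#")):
            continue
        parts = line.split(None, 1)
        # Only simple "set [module:]name=value" lines are considered here.
        if parts[0] != "set" or len(parts) < 2:
            continue
        tunable = parts[1].split("=", 1)[0].strip()
        if tunable not in approved:
            flagged.append(tunable)
    return flagged


# Made-up sample contents and a made-up approved list.
SAMPLE = """\
* legacy tuning carried over from an older system
set maxphys=1048576
set zfs:zfs_arc_max=0x100000000
"""
APPROVED = {"zfs:zfs_arc_max"}

if __name__ == "__main__":
    print(unapproved_settings(SAMPLE, APPROVED))
```

A real tool would also need to decide what to do with each flagged line (report it, comment it out, or remove it);  the sketch stops at detection.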

We've gone through an interesting period over the summer.  In early summer, we were seeing very few issues indeed reported from our now large customer base.  Then, we saw 3 customers raise issues in quick succession.

The first, in Europe, looked like an InfiniBand issue.  Responses would just stop for multiple seconds for no apparent reason, then restart.  After attempts to debug over a shared shell proved unsuccessful, we sent two very experienced engineers on site, and they traced the problem to a VM2 (Virtual Memory) issue and two scheduler issues.

Almost the same week, two U.S. SuperCluster customers raised VM2 issues.  Our lead VM2 sustaining engineer, Vamsi Nagineni, engaged Eric Lowe from the VM2 development team, and they determined that no two of the customer issues shared the same root cause.

In one case, a bank, the customer's database is not optimized for Exadata, so more of the load runs on the SuperCluster compute nodes rather than on the storage cells.  Nothing overly excessive, just enough to encounter an issue not seen by other customers.

In another, a State social services provider, the customer runs a high proportion of batch processing.  Again, nothing excessive, just enough to encounter a different issue not seen by other customers.

In the third, a major retailer, the customer's apps had very specific memory requirements which the VM2 algorithms were handling sub-optimally.

The outcome of this is that a number of subtle VM2 and other bugs have been found and fixed.  That benefits not just these and other SuperCluster customers:  since the fixes are put back into generic Solaris SRUs, all Solaris 11 customers benefit.

Without the reduced variables at play in Engineered Systems, it would be extremely difficult if not impossible to reproduce, analyze, and fix such subtle issues.

So even if you don't have a SuperCluster, you can still reap the benefits.

FYI, currently most of the SuperCluster install base is running Solaris 11.1 SRU7.5 (which fixes a number of VM2 issues).

BTW: We also improved the SRU README last month to summarize the important content.

Best Wishes,

Gerry.

Tuesday Sep 17, 2013

Top Tips for Updating Solaris 11 Systems

We now have quite a bit of experience with IPS and Repositories under our belt.

Feedback from customers has been extremely positive.  I recently met a customer with 1000+ Solaris servers who told me that with Solaris 10 it took them 2 months to roll out a new patchset across their enterprise.  With Solaris 11, it takes 10 days.

That really helps lower TCO.

As with anything, experience teaches us how to optimize things.  Here are a few Top Tips around IPS / Repo management which I'd like to share with you from my experience with SuperCluster:

  • To avoid most IPS dependency resolution errors, keep your main local Repository populated with all Solaris Updates and SRUs up to and including the version you wish to apply.  A sparsely populated Repo is much more likely to result in copious IPS dependency resolution errors.
  • Keep any IDRs (Interim Diagnostics or Relief) in a separate Repo local to the Boot Environments (BEs) for which they are relevant.  For example, if you have an IDR to address an issue with 11gR2 RAC on Solaris 11.1.7.5.0 (Solaris 11.1 SRU7.5), keep it local to the relevant BEs running 11gR2.  This avoids IDRs being unnecessarily propagated to LDoms or Zones for which they are irrelevant.
  • Before upgrading, check to ensure that the issues addressed in any IDRs you are using are fixed in the Solaris version to which you are updating.  If they are, IPS will automatically supersede them - that is, unlike in Solaris 10, there's no need to manually remove them.  You can check this by looking in the Support repository, or the relevant Repo ISO image, for packages whose base name is the IDR number, that is 'idr<number>'.  If such a package exists, then the IDR has been superseded and the issues it addresses are fixed in that SRU.  If the issues are not fixed in the Solaris version to which you are updating, you may need to ask Support for new IDR(s) for that Solaris version.
  • Zone creation in Solaris 11 works differently to how it did in Solaris 10.  In Solaris 11, effectively a manifest is taken of the Global Zone and then Non-Global Zones (NGZs) are constructed from that using the Repo(s).  Therefore, your Repo(s) must be up to date with all Solaris software installed on your global zone, including any IDRs.  You can have multiple Publishers specified, so that multiple Repos can be used (e.g. main local Repo for the Solaris Updates / SRUs, BE specific Repo for IDRs).
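The IDR check in the third tip is easy to script once you have a listing of the packages in the target Repo.  Here's a minimal sketch in Python, assuming you already have that listing as a list of package FMRI strings (how you capture it is up to you).  The function names, IDR numbers, and FMRIs below are made up for illustration;  the only rule taken from the tip is that a package whose base name is 'idr<number>' means that IDR has been superseded.

```python
import re


def idr_base_names(fmris):
    """Collect the base names of 'idr<number>' packages from a list of FMRIs."""
    names = set()
    for fmri in fmris:
        # Drop the version part ("@...") and keep the last path component.
        name = fmri.split("@", 1)[0]
        base = name.rsplit("/", 1)[-1]
        if re.fullmatch(r"idr\d+", base):
            names.add(base)
    return names


def superseded_idrs(installed_idrs, repo_fmris):
    """Map each installed IDR name to True if the Repo has a package of that name."""
    available = idr_base_names(repo_fmris)
    return {idr: idr in available for idr in installed_idrs}


# Made-up example data:  two IDRs installed, one of them present in the Repo.
REPO_FMRIS = [
    "pkg://solaris/idr679@3,5.11",
    "pkg://solaris/system/kernel@0.5.11",
]
INSTALLED = ["idr679", "idr702"]

if __name__ == "__main__":
    for idr, fixed in superseded_idrs(INSTALLED, REPO_FMRIS).items():
        status = "superseded" if fixed else "not found - ask Support"
        print(idr, status)
```

Any IDR reported as not found is one for which you may need to ask Support for a new IDR for the Solaris version to which you are updating.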

I hope you find these tips useful.

My colleagues, Glynn Foster and Bart Smaalders, will be presenting on "Oracle Solaris 11 Best Practices for Software Lifecycle Management [Con3889]" @ Oracle OpenWorld next week.  The Oracle Sun "Systems" sessions are in the Westin this year.  This particular session is on Tuesday, Sept 24 @ 5:15pm in the "City" meeting room in the Westin and will have lots more tips and best practices.

Other colleagues, Rob Hulme and Colin Seymour, are presenting on "Best Practices for Maintaining and Upgrading Oracle Solaris [CON8255]" on Monday, Sept 23 @ 10:45am in the Westin San Francisco, also in the "City" meeting room.

And there's lots of other good stuff on Solaris and SuperCluster.  For example, the "Deep Dive into Oracle SuperCluster [CON8632]" on Tuesday, Sept 24 @ 5:15pm in the Westin, Metropolitan II.

I'm not presenting this year, but if you would like to meet up with me @ OpenWorld to discuss anything about Solaris / Systems / SuperCluster Lifecycle Maintenance, whether it's ideas you'd like to see implemented, what's keeping you awake at night, issues you want me to look at, etc., I am more than happy to do so.  Just ping me at Gerry.Haskins@oracle.com.

Best Wishes,

Gerry.

Monday Nov 28, 2011

Solaris 11 Customer Maintenance Lifecycle

Hi Folks,

Welcome to my new blog http://blogs.oracle.com/Solaris11Life which is all about the Customer Maintenance Lifecycle for Image Packaging System (IPS) based Solaris releases, such as Solaris 11.

It'll include policies, best practices, clarifications, and lots of other stuff which I hope you'll find useful as you get up to speed with Solaris 11 and IPS.  

Let's start with an updated version of my Solaris 11 Customer Maintenance Lifecycle presentation which I originally gave at Oracle Open World 2011 and at the 2011 Deutsche Oracle Anwendergruppe (DOAG - German Oracle Users Group) conference in Nürnberg.

Some of you may be familiar with my Patch Corner blog, http://blogs.oracle.com/patch , which fulfilled a similar purpose for System V [five] Release 4 (SVR4) based Solaris releases, such as Solaris 10 and below.

Since maintaining a Solaris 11 system is quite different to maintaining a Solaris 10 system, I thought it prudent to start this 2nd parallel blog for Solaris 11.

Actually, I have an ulterior motive for starting this separate blog. 

Since IPS is a single tier packaging architecture, it doesn't have any patches, only package updates. 

I've therefore banned the word "patch" in Solaris 11 and introduced a swear box to which my colleagues must contribute a quarter [$0.25] every time they use the word "patch" in a public forum.  From their Oracle Open World presentations, John Fowler owes 50 cents, Liane Preza owes $1.25, and Bart Smaalders owes 75 cents. 

Since I'm stinging my colleagues in what could be a lucrative enterprise, I couldn't very well discuss IPS best practices on a blog called "Patch Corner" with a URI of http://blogs.oracle.com/patch.  I simply couldn't afford all those contributions to the "patch" swear box. :)

Feel free to let me know what topics you'd like covered - just post a comment in the comment box on the blog.

Best Wishes,

Gerry.


About

This blog is to inform customers about Solaris 11 maintenance best practice, feature enhancements, and key issues. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle. The Documents contained within this site may include statements about Oracle's product development plans. Many factors can materially affect these plans and the nature and timing of future product releases. Accordingly, this Information is provided to you solely for information only, is not a commitment to deliver any material code, or functionality, and SHOULD NOT BE RELIED UPON IN MAKING PURCHASING DECISIONS. The development, release, and timing of any features or functionality described remains at the sole discretion of Oracle. THIS INFORMATION MAY NOT BE INCORPORATED INTO ANY CONTRACTUAL AGREEMENT WITH ORACLE OR ITS SUBSIDIARIES OR AFFILIATES. ORACLE SPECIFICALLY DISCLAIMS ANY LIABILITY WITH RESPECT TO THIS INFORMATION. Gerry Haskins, Director, Software Lifecycle Engineering
