Wednesday Dec 04, 2013

It's good to stare!

Some say it's rude to stare.  But that's not my experience.

I've been working on SuperCluster for 2 years now.  And I've been looking intently at issues arising for SuperCluster customers, both to ensure the issues are fixed a.s.a.p. and to understand what lessons we can learn and where we can improve our products and processes.

Why?

Well, I want every customer to have the best possible experience. 

Call me naive, but I sincerely believe that ensuring a good customer experience is the best way to encourage repeat business.

So, what have I learned?

I've been working in the Solaris customer lifecycle space for 15 years now.  One thing that's always puzzled me is why, while most customers have a perfectly good experience, there are always one or two customers who repeatedly hit problems.

The reasons are often not obvious.  They may be running very similar hardware with very similar software configuration and broadly comparable workloads to hundreds of other customers in the same industry segment who are not experiencing any issue.

It's easy to assume that there may be something subtly "wrong" in their set-up: a misconfigured network, a piece of 3rd party kit which we don't have in-house to help us reproduce the issue, 3rd party or home-grown apps relying on private interfaces they shouldn't be using, even a dodgy "favorite" /etc/system setting which the customer "knows" works from their Solaris 2.5.1 or V880 days but which hamstrings performance, or whatever.  Occasionally, despite enormous effort, it feels like we never get to the true root cause and that customer never does have an optimal experience.

More often, we do determine the root cause, which may indeed be a sub-optimal configuration but, if the system's already in production, it may not be possible to reconfigure the system and start again, so the customer experience remains compromised for that system.

Indeed, it's for this exact reason - sub-optimal customer lifecycle experiences are often due to sub-optimal initial install and configuration - that my team was asked to develop the install and configuration utilities for SuperCluster, so that systems are configured according to best practice right out of the box.  And that's worked very well indeed.

But some issues do still arise for SuperCluster customers.

Most arise when we leverage new functionality - initially InfiniBand, more recently VM2 and iSCSI.  These issues are found and fixed rapidly, with proactive roll-out of the fixes to the entire SuperCluster customer base.

I previously blogged that, even though SuperCluster is configurable and certainly not an appliance, we are finding Engineered Systems issues much easier to debug, as the fixed hardware layout, cabling, protocols, etc., dramatically reduce the number of variables in play, making issue reproduction in-house much easier, and hence issue analysis and resolution much faster.  This really helps to improve our customers' experience.

But we still see a very small number of customers (two or possibly three come to mind) who repeatedly hit issues not seen by any other customer.

Why is that?

The hardware is identical.  The configurations are similar.  We have other customers in the same industry segment utilizing SuperClusters for broadly similar purposes, even with similar DB and load characteristics.  We know the networking is correct - it's fixed.  We know the I/O cards are in the right slots - it's fixed.  We know we're using the optimal protocols, configured optimally.  We even have a process, ssctuner, running in the background to check that no dodgy settings are added to /etc/system, and it'll automatically remove them if they are.

We've gone through an interesting period over the summer.  In early summer, we were seeing very few issues indeed reported from our now large customer base.  Then, we saw 3 customers raise issues in quick succession.

The first, in Europe, looked like an InfiniBand issue.  Responses would just stop for multiple seconds for no apparent reason, then restart.  We actually sent two very experienced engineers on site after attempts to debug over a shared shell proved unsuccessful, and they root caused a VM2 (Virtual Memory) issue and two scheduler issues.

Almost the same week, two U.S. SuperCluster customers raised VM2 issues.  Our lead VM2 sustaining engineer, Vamsi Nagineni, engaged Eric Lowe from the VM2 development team, and they determined that none of the customer issues had the same root cause.

In one case, a bank, the customer's database is not optimized for Exadata, so more of the load runs on the SuperCluster compute nodes rather than on the storage cells.  Nothing overly excessive, just enough to encounter an issue not seen by other customers.

In another, a State social services provider, the customer runs a high proportion of batch processing.  Again, nothing excessive, just enough to encounter a different issue not seen by other customers.

In the third, a major retailer, the customer's apps had very specific memory requirements which the VM2 algorithms were handling sub-optimally.

The outcome of this is that a number of subtle VM2 and other bugs have been found and fixed, not just for the benefit of these and other SuperCluster customers: since the fixes are put back into generic Solaris SRUs, all Solaris 11 customers benefit.

Without the reduced variables at play in Engineered Systems, it would be extremely difficult if not impossible to reproduce, analyze, and fix such subtle issues.

So even if you don't have a SuperCluster, you can still reap the benefits.

FYI, currently most of the SuperCluster install base is running Solaris 11.1 SRU7.5 (which fixes a number of VM2 issues).

BTW: We also improved the SRU README last month to summarize the important content.

Best Wishes,

Gerry.

Tuesday Oct 23, 2012

Solaris 11 SRU / Update relationship explained, and blackout period on delivery of new bug fixes eliminated

Relationship between SRUs and Update releases

As you may know, Support Repository Updates (SRUs) for Oracle Solaris 11 are released monthly and are available to customers with an appropriate support contract.  SRUs primarily deliver bug fixes.  They may also deliver low risk feature enhancements.

Solaris Updates are typically released once or twice a year, containing support for new hardware, new software feature enhancements, and all bug fixes available at the time the Update content was finalized.  They also contain a significant number of new bug fixes, both for issues found internally in Oracle and for complex customer bugs whose fixes require significant "soak" time to ensure their efficacy prior to release.

Changes to SRU and Update Naming Conventions

We're changing the naming convention of Update releases from a date based format such as Oracle Solaris 10 8/11 to a simpler "dot" version numbering, e.g. Oracle Solaris 11.1. Oracle Solaris 11 11/11 (i.e. the initial Oracle Solaris 11 release) may be referred to as 11.0.

SRUs will simply be named as "dot.dot" releases, e.g. Oracle Solaris 11.1.1, for SRU1 after Oracle Solaris 11.1.

Many Oracle products and infrastructure tools such as BugDB and MOS are tailored towards this "dot.dot" style of release naming, so these name changes align Oracle Solaris with these conventions.
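
The "dot" and "dot.dot" names are easy to handle programmatically.  Here's a minimal Python sketch, assuming release names arrive as plain dotted strings such as "11.1" or "11.1.1".  The names parse_release, Release and is_sru are purely illustrative and are not part of any Oracle tool:

```python
from typing import NamedTuple


class Release(NamedTuple):
    """A Solaris 11 release name split into its parts (illustrative only)."""
    major: int   # e.g. 11
    update: int  # e.g. 1 for Oracle Solaris 11.1, 0 for 11 11/11 (11.0)
    sru: int     # e.g. 1 for SRU1; 0 means the Update release itself


def parse_release(name: str) -> Release:
    """Parse "11.1" (an Update) or "11.1.1" (an SRU) into a Release."""
    parts = [int(p) for p in name.split(".")]
    if len(parts) == 2:
        major, update = parts
        return Release(major, update, 0)
    if len(parts) == 3:
        return Release(*parts)
    raise ValueError(f"expected 'X.Y' or 'X.Y.Z', got {name!r}")


def is_sru(release: Release) -> bool:
    """True for "dot.dot" names such as 11.1.1, False for Updates such as 11.1."""
    return release.sru > 0
```

For example, parse_release("11.1.1") yields major 11, update 1, SRU 1.  Note that this only decodes the name; it says nothing about which release is a superset of which.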

No Blackout Periods on Bug Fix Releases

The Oracle Solaris 11 release process has been enhanced to eliminate blackout periods on the delivery of new bug fixes to customers.

Previously, Oracle Solaris Updates were a superset of all preceding bug fix deliveries.  This made for a very simple update message - whatever is released later is always a superset of whatever was delivered previously.

However, it had a downside.  Once the contents of an Update release were frozen prior to release, the release of new bug fixes for customer issues was also frozen to maintain the Update's superset relationship.

Since the amount of change allowed into the final internal builds of an Update release is reduced to mitigate risk, this throttling back also impacted the release of new bug fixes to customers.

This meant that there was effectively a 6 to 9 week hiatus on the release of new bug fixes prior to the release of each Update.  That wasn't good for customers awaiting critical bug fixes.

We've eliminated this hiatus on the delivery of new bug fixes in Oracle Solaris 11 by allowing new bug fixes to continue to be released in SRUs even after the contents of the next Update release have been frozen.

The release of SRUs will remain contiguous, with the first SRU released after the Update release effectively being a superset of both the Update release and all preceding SRUs*.

That is, later SRUs are supersets of the content of previous SRUs.

Therefore, the progression path from the final SRUs prior to the Update release is to the first SRU after the Update release, rather than to the Update release itself.

The timeline / logical sequence of releases can be shown as follows:

Updates:  11.0                                 11.1                       11.2     etc.
              \                                    \                          \
SRUs:           11.0.1, 11.0.2,...,11.0.12, 11.0.13, 11.1.1, 11.1.2,...,11.1.x, 11.2.1, etc.

For example, for systems with Oracle Solaris 11 11/11 SRU12.4 or later installed, the recommended update path is to Oracle Solaris 11.1.1 (i.e. SRU1 after Solaris 11.1) or later rather than to the Solaris 11.1 release itself.  This will ensure no bug fixes are "lost" during the update*.

If for any reason you do wish to update from SRU12.4 or later to the 11.1 release itself - for example to update a test system - the instructions to do so are in the SRU12.4 README, https://updates.oracle.com/Orion/Services/download?type=readme&aru=15607102

For systems with Oracle Solaris 11 11/11 SRU11.4 or earlier installed, customers can update to either the 11.1 release or any 11.1 SRU as both will be supersets of their current version.  My colleague, Pete Dennis, explains the step-by-step process here.
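
The update path rule above can be sketched in a few lines of Python.  This is only an illustration: the function name recommended_targets and the (SRU number, build) tuple are my assumptions, and anything below SRU12.4 is treated as "SRU11.4 or earlier" since no release in between is mentioned here:

```python
from typing import List, Tuple


def recommended_targets(current_sru: Tuple[int, int]) -> List[str]:
    """Suggest update targets for a system on Oracle Solaris 11 11/11.

    current_sru is (SRU number, build), e.g. (12, 4) for SRU12.4.
    Hypothetical helper for illustration; not an Oracle-provided tool.
    """
    # SRU12.4 and later contain bug fixes that are not in the 11.1 release,
    # so only the 11.1 SRUs are supersets of what is already installed.
    if current_sru >= (12, 4):
        return ["11.1.1 or later"]
    # SRU11.4 and earlier: both the 11.1 release and any 11.1 SRU are supersets.
    return ["11.1", "11.1.1 or later"]
```

So recommended_targets((12, 4)) offers only the 11.1 SRUs, while recommended_targets((11, 4)) offers the 11.1 release as well.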

Please do read the README of the SRU you are updating to, as it will contain important installation instructions which will save you time and effort.

*Nerdy details:

  • SRUs only contain the latest change delta relative to the Update on which they are based.  Their dependencies will, however, effectively pull in the Update content.  Customers maintaining a local Repo (e.g. behind their firewall) need to add both the 11.1 content and the relevant SRU content to their Repo, to enable the SRU's dependencies to be resolved.  Both will be available from the standard Support Repo and from MOS.  This is no different to existing SRUs for Oracle Solaris 11.0, whereby you may often get away with using just the SRU content to update, but the original 11.0 content may be needed in the Repo to resolve dependencies.
  • The following bug fixes in SRU12.4 are not in Oracle Solaris 11.1.  They'll be available in 11.1.1 (SRU1 for Oracle Solaris 11.1):
7166132 vim should be able to run its test suite
7190213 libibmad and associated files need to be delivered in an NGZ
7191495 mkisofs install is incomplete
7195687 Update fetchmail to version 6.3.22
7195704 Problem with utility/fetchmail
7196234 Problem with network/dns
7197223 vim shows high CPU usage when editing dtrace script with syntax
        highlighting enabled
7071362 tcp_icmp_source_quench and other tunables may no longer be field
        modifiable
7181137 sol_umad should allow userland MAD operations in NGZs
7196540 After 7174929 integration 0.9.0 is shown for first disk in
        second RAID volume
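
The idea of bug fixes being "lost" is just set arithmetic.  A small Python sketch, using the ten bug IDs listed above; the other set members are hypothetical placeholders (not real bug IDs) standing in for content I haven't enumerated, and lost_fixes is an illustrative name only:

```python
# Bug IDs listed above as fixed in SRU12.4 but not in Oracle Solaris 11.1.
SRU12_4_ONLY = {
    "7166132", "7190213", "7191495", "7195687", "7195704",
    "7196234", "7197223", "7071362", "7181137", "7196540",
}

# Hypothetical placeholders, NOT real bug IDs: stand-ins for the fixes shared
# by both releases and for the new content that only 11.1 adds.
SHARED_FIXES = {"shared-fix-a", "shared-fix-b"}
NEW_IN_11_1 = {"new-in-11.1"}

sru12_4 = SHARED_FIXES | SRU12_4_ONLY
update_11_1 = SHARED_FIXES | NEW_IN_11_1
sru_11_1_1 = update_11_1 | sru12_4  # first SRU after the Update


def lost_fixes(current: set, target: set) -> set:
    """Fixes installed now that the target release would not contain."""
    return current - target
```

Here lost_fixes(sru12_4, update_11_1) returns exactly the ten IDs above, while lost_fixes(sru12_4, sru_11_1_1) is empty, which is why 11.1.1 or later is the recommended target.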

Thursday Apr 12, 2012

How To Update Oracle Solaris 11

My colleague, Glynn Foster, has published a nice article on how to update Oracle Solaris 11 which I think you may find interesting.

Monday Nov 28, 2011

Solaris 11 Customer Maintenance Lifecycle

Hi Folks,

Welcome to my new blog http://blogs.oracle.com/Solaris11Life which is all about the Customer Maintenance Lifecycle for Image Packaging System (IPS) based Solaris releases, such as Solaris 11.

It'll include policies, best practices, clarifications, and lots of other stuff which I hope you'll find useful as you get up to speed with Solaris 11 and IPS.  

Let's start with an updated version of my Solaris 11 Customer Maintenance Lifecycle presentation which I originally gave at Oracle Open World 2011 and at the 2011 Deutsche Oracle Anwendergruppe (DOAG - German Oracle Users Group) conference in Nürnberg.

Some of you may be familiar with my Patch Corner blog, http://blogs.oracle.com/patch , which fulfilled a similar purpose for System V [five] Release 4 (SVR4) based Solaris releases, such as Solaris 10 and below.

Since maintaining a Solaris 11 system is quite different to maintaining a Solaris 10 system, I thought it prudent to start this 2nd parallel blog for Solaris 11.

Actually, I have an ulterior motive for starting this separate blog. 

Since IPS is a single tier packaging architecture, it doesn't have any patches, only package updates. 

I've therefore banned the word "patch" in Solaris 11 and introduced a swear box to which my colleagues must contribute a quarter [$0.25] every time they use the word "patch" in a public forum.  From their Oracle Open World presentations, John Fowler owes 50 cents, Liane Praza owes $1.25, and Bart Smaalders owes 75 cents.

Since I'm stinging my colleagues in what could be a lucrative enterprise, I couldn't very well discuss IPS best practices on a blog called "Patch Corner" with a URI of http://blogs.oracle.com/patch.  I simply couldn't afford all those contributions to the "patch" swear box. :)

Feel free to let me know what topics you'd like covered - just post a comment in the comment box on the blog.

Best Wishes,

Gerry.


About

This blog is to inform customers about Solaris 11 maintenance best practice, feature enhancements, and key issues. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle. The Documents contained within this site may include statements about Oracle's product development plans. Many factors can materially affect these plans and the nature and timing of future product releases. Accordingly, this Information is provided to you solely for information only, is not a commitment to deliver any material code, or functionality, and SHOULD NOT BE RELIED UPON IN MAKING PURCHASING DECISIONS. The development, release, and timing of any features or functionality described remains at the sole discretion of Oracle. THIS INFORMATION MAY NOT BE INCORPORATED INTO ANY CONTRACTUAL AGREEMENT WITH ORACLE OR ITS SUBSIDIARIES OR AFFILIATES. ORACLE SPECIFICALLY DISCLAIMS ANY LIABILITY WITH RESPECT TO THIS INFORMATION. Gerry Haskins, Director, Software Lifecycle Engineering
