Monday Mar 10, 2014

ORAchk Health Checks for the Oracle Stack (including Solaris)

My colleagues, Susan Miller and Erwann Chénedé, have been working with the nice people behind the ORAchk tool (formerly RACcheck) to add Solaris health checks to the tool.

ORAchk 2.2.4, containing the initial 8 Solaris health checks, is now available.

ORAchk includes EXAchk functionality and replaces the popular RACcheck tool, extending coverage (based on prioritization of the top issues reported by users) to proactively scan for known problems within:

  • E-Business Suite Financials Accounts Payables
  • Oracle Database
  • Sun Systems

ORAchk features:

  • Proactively scans for the most impactful known problems across your entire system as well as various layers of your stack
  • Simplifies and streamlines how to investigate and analyze which known issues present a risk to you
  • Lightweight tool runs within your environment; no data will be sent to Oracle
  • High level reports show your system health risks with the ability to drill down into specific problems and understand their resolutions
  • Can be configured to send email notifications when it detects problems
  • Collection Manager, a companion Application Express web app, provides a single dashboard view of collections across your entire enterprise

ORAchk will expand in the future with more high-impact checks in existing and additional product areas. If there are particular checks or product areas you would like to see covered, please post suggestions in the ORAchk community thread, accessed from the support tab of the document below.

For more details about ORAchk, see Document 1268927.1.

Wednesday Dec 04, 2013

It's good to stare!

Some say it's rude to stare.  But that's not my experience.

I've been working on SuperCluster for 2 years now.  And I've been looking intently at issues arising for SuperCluster customers, both to ensure the issues are fixed a.s.a.p. and to understand what lessons we can learn and where we can improve our products and processes.

Why?

Well, I want every customer to have the best possible experience. 

Call me naive, but I sincerely believe that ensuring a good customer experience is the best way to encourage repeat business.

So, what have I learned?

I've been working in the Solaris customer lifecycle space for 15 years now.  One thing that's always puzzled me is why, while most customers have a perfectly good experience, there are always one or two customers who repeatedly hit problems.

The reasons are often not obvious.  They may be running very similar hardware, with a very similar software configuration and broadly comparable workloads, to hundreds of other customers in the same industry segment who are not experiencing any issues.

It's easy to assume that there may be something subtly "wrong" in their set-up: a misconfigured network; a piece of 3rd-party kit which we don't have internally to help us reproduce the issue; 3rd-party or home-grown apps relying on private interfaces they shouldn't be using; or even a dodgy "favorite" /etc/system setting which the customer "knows" works from their Solaris 2.5.1 or V880 days, but which hamstrings performance.  Occasionally, despite enormous effort, it feels like we never get to true root cause, and that customer never does have an optimal experience.

More often, we do determine the root cause, which may indeed be a sub-optimal configuration.  But if the system is already in production, it may not be possible to reconfigure it and start again, so the customer experience remains compromised for that system.

Indeed, it's for this exact reason (sub-optimal customer lifecycle experiences are often due to a sub-optimal initial install and configuration) that my team was asked to develop the install and configuration utilities for SuperCluster, so that systems are configured according to best practice right out of the box.  And that's worked very well indeed.

But some issues do still arise for SuperCluster customers.

Most arise when we leverage new functionality: initially InfiniBand, more lately VM2 and iSCSI.  These issues are found and fixed rapidly, with proactive roll-out of the fixes to the entire SuperCluster customer base.

I previously blogged that, even though SuperCluster is configurable and certainly not an appliance, we are finding Engineered Systems issues much easier to debug: the fixed hardware layout, cabling, protocols, etc., dramatically reduce the number of variables in play, making in-house issue reproduction much easier and hence issue analysis and resolution much faster.  This really helps to improve our customers' experience.

But we still see a very small number of customers (two or possibly three come to mind) who repeatedly hit issues not seen by any other customer.

Why is that?

The hardware is identical.  The configurations are similar.  We have other customers in the same industry segment utilizing their SuperClusters for broadly similar purposes, even with similar DB and load characteristics.  We know the networking is correct - it's fixed.  We know the I/O cards are in the right slots - it's fixed.  We know we're using the optimal protocols, configured optimally.  We even have a process, ssctuner, running in the background to check that no dodgy settings are added to /etc/system, and to automatically remove them if they are.

We've gone through an interesting period over the summer.  In early summer, we were seeing very few issues indeed reported from our now large customer base.  Then we saw three customers raise issues in quick succession.

The first, in Europe, looked like an InfiniBand issue: responses would just stop for multiple seconds for no apparent reason, then restart.  After attempts to debug over a shared shell were unsuccessful, we sent two very experienced engineers on site, and they root-caused a VM2 (Virtual Memory) issue and two scheduler issues.

Almost the same week, two U.S. SuperCluster customers raised VM2 issues.  Our lead VM2 sustaining engineer, Vamsi Nagineni, engaged Eric Lowe from the VM2 development team, and they determined that none of the customer issues had the same root cause.

In one case, a bank, the customer's database is not optimized for Exadata, so more of the load runs on the SuperCluster compute nodes rather than on the storage cells.  Nothing overly excessive, just enough to encounter an issue not seen by other customers.

In another, a State social services provider, the customer runs a high proportion of batch processing.  Again, nothing excessive, just enough to encounter a different issue not seen by other customers.

In the third, a major retailer, the customer's apps had very specific memory requirements which the VM2 algorithms were handling sub-optimally.

The outcome is that a number of subtle VM2 and other bugs have been found and fixed, not just for the benefit of these and other SuperCluster customers: since the fixes are put back into generic Solaris SRUs, all Solaris 11 customers benefit.

Without the reduced variables at play in Engineered Systems, it would be extremely difficult if not impossible to reproduce, analyze, and fix such subtle issues.

So even if you don't have a SuperCluster, you can still reap the benefits.

FYI, most of the SuperCluster install base is currently running Solaris 11.1 SRU 7.5 (which fixes a number of VM2 issues).

BTW: We also improved the SRU README last month to summarize the important content.

Best Wishes,

Gerry.

Friday Oct 04, 2013

Top Tip: Managing Solaris 11 IDRs

Here's a Top Tip from my colleague, IPS Guru, and all-round good guy, Pete Dennis:

Background

If the issue(s) addressed by a Solaris 11 IDR (Interim Diagnostics / Relief) are fixed in a subsequent SRU (Support Repository Update), the SRU is said to "supersede" the IDR. 

As mentioned in previous posts, in Solaris 11 the IDR is automatically superseded when the system is updated to the relevant SRU (or any later SRU).  That is, unlike in Solaris 10, there's no need to manually remove the IDR before updating*.  We provide "terminal packages" for superseded IDRs in the Support Repo, enabling IPS (Image Packaging System) to automatically handle the IDRs for you.

Several weeks before a planned maintenance update, it's a good idea to check whether all the IDRs in use are superseded by the SRU to which you are planning to update.

If any of them aren't superseded, and the relevant packages they touch are updated in the SRU, you'll need to raise an SR (Service Request) with Oracle Support to get new IDRs generated for the relevant BugIDs at that SRU level, so please ensure you allow enough time for these to be generated.  Note that if the bugs are already fixed in a later SRU, you'll be told to update to that SRU.

Question:

Is there a simple way for a customer to find out which of their IDRs will be superseded by updating to a given SRU?

Answer:

All superseded IDRs are tagged in the Support Repository and on the incremental ISO images available from MOS (My Oracle Support).

The following command will list the superseded IDRs in the Support Repository, so you can then examine the ones of interest. 

I'm assuming here that you're maintaining a local Repo behind your firewall which is, at a minimum, up to date with the SRU to which you are planning to update:

pkg list -g http://<url of local repo> -af 'idr*'

For example:

pkg contents -g http://<url of local repo> -m idr679

set name=pkg.fmri value=pkg://solaris/idr679@3,5.11:20130905T193900Z
set name=pkg.description value="Terminal package"
set name=pkg.renamed value=true
depend fmri=pkg:/consolidation/osnet/osnet-incorporation@0.5.11,5.11-0.175.1.11.0.4.2 type=require

You do need to be able to interpret FMRI strings correctly (see previous posts). For example, 5.11-0.175.1.11.0.4.2 is Solaris 11.1 SRU 11.4 or, to give it its official Marketing name, Solaris 11.1.11.4.0.

So that tells us that idr679 is superseded by Solaris 11.1 SRU 11.4 (Solaris 11.1.11.4.0).
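
If you have several IDRs installed, a small loop like the following can save some eyeballing.  This is just a rough sketch: the Repo URL is the same placeholder as above, it assumes your installed IDRs follow the 'idr<number>' naming shown earlier, and it simply prints the dependency line(s) from each IDR's terminal package, if one exists in the Repo:

for idr in $(pkg list -H 'idr*' 2>/dev/null | awk '{print $1}'); do
    echo "== $idr =="
    # The incorporation dependency of the terminal package encodes the superseding SRU
    pkg contents -g http://<url of local repo> -m "$idr" | grep '^depend' \
        || echo "   no terminal package found - not superseded in this Repo"
done

The depend line tells you the SRU at which each IDR is superseded; any IDR with no terminal package in the Repo will need the SR route described above.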

We'll look to make this more transparent by adding a text field with the human-readable translation of the FMRI string to the metadata.

If you wish to restrict updates to selected SRUs which you have "qualified" in your environment (for example, a "Golden Image"), Bart's blog posting may also be of interest.

Best Wishes,

Gerry.

* There's more work required to make this happen seamlessly in Solaris 11 Zones.

Tuesday Oct 01, 2013

T5 and M6 - providing more Umpf for your Buck

Just thought you may be interested in some random metrics...

The unit of measurement is Umpf!  The more Umpf!, the better.  Comparing Oracle Sun's latest systems to some of our old favorite systems:

The new M6-32, which was announced at OpenWorld, is ~174x more powerful than a fully loaded E10K running Solaris 8.

A T5-8 running Solaris 11 is ~133x more powerful than a fully loaded V880 and ~530x more powerful than a fully loaded E450 running Solaris 8.  Holy spondulicks!

Guess it's time to replace the E450s in my lab and rent out the space I'll save for student accommodation.

With that sort of power, you can have world domination at your fingertips.

And surprisingly affordable world domination at that!

Best Wishes,

Gerry.

Usual disclaimers about my personal incompetence, lies, damn lies, and statistics, etc., apply.

Tuesday Sep 17, 2013

Top Tips for Updating Solaris 11 Systems

We now have quite a bit of experience with IPS and Repositories under our belts.

Feedback from customers has been extremely positive.  I recently met a customer with 1000+ Solaris servers who told me that with Solaris 10 it took them 2 months to roll out a new patchset across their enterprise.  With Solaris 11, it takes 10 days.

That really helps lower TCO.

As with anything, experience teaches us how to optimize things.  Here are a few Top Tips on IPS / Repo management which I'd like to share with you from my experience with SuperCluster:

  • To avoid most IPS dependency resolution errors, keep your main local Repository populated with all Solaris Updates and SRUs up to and including the version you wish to apply.  A sparsely populated Repo is much more likely to result in copious IPS dependency resolution errors.
  • Keep any IDRs (Interim Diagnostics or Relief) in a separate Repo local to the Boot Environments (BEs) for which they are relevant.  For example, if you have an IDR to address an issue with 11gR2 RAC on Solaris 11.1.7.5.0 (Solaris 11.1 SRU 7.5), keep it local to the relevant BEs running 11gR2.  This avoids IDRs being unnecessarily propagated to LDoms or Zones for which they are irrelevant.
  • Before upgrading, check that the issues addressed by any IDRs you are using are fixed in the Solaris version to which you are updating.  If they are, IPS will automatically supersede them - that is, unlike in Solaris 10, there's no need to remove them manually.  You can check this by looking in the Support Repository, or the relevant Repo ISO image, for packages whose base name is the IDR number, i.e. 'idr<number>'.  If such a package exists, the IDR has been superseded and the issues it addresses are fixed in that SRU.  If the issues are not fixed in the Solaris version to which you are updating, you may need to ask Support for new IDR(s) for that Solaris version.
  • Zone creation in Solaris 11 works differently to how it did in Solaris 10.  In Solaris 11, a manifest of the Global Zone is effectively taken, and Non-Global Zones (NGZs) are then constructed from it using the Repo(s).  Therefore, your Repo(s) must be up to date with all the Solaris software installed in your Global Zone, including any IDRs.  You can specify multiple Publishers, so that multiple Repos can be used (e.g. the main local Repo for Solaris Updates / SRUs, and a BE-specific Repo for IDRs); see the sketch after this list.
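
To make a few of the tips above concrete, here's a rough sketch of one way to set things up.  The repository paths and names are purely illustrative, the pkgrecv step assumes you have the SSL key and certificate needed to access the Oracle Support Repository (options omitted here for brevity), and it uses multiple origins of the 'solaris' publisher as one way of combining a main Repo with a BE-local IDR Repo:

# Keep the main local Repo fully populated from the Oracle Support Repository
# (add the appropriate --key and --cert options for your support entitlement)
pkgrecv -s https://pkg.oracle.com/solaris/support -d /export/repo/solaris -m all-versions '*'
pkgrepo refresh -s /export/repo/solaris

# Add both the main Repo and a separate, BE-local IDR Repo as origins
# of the 'solaris' publisher in the relevant BE
pkg set-publisher -g file:///export/repo/solaris -g file:///export/repo/idrs solaris

# Check which origins are configured before creating Zones or updating
pkg publisher solaris

With something like that in place, Zone installs and updates can pull standard packages from the main Repo and IDR packages from the BE-local IDR Repo, without the IDRs having to live in your enterprise-wide Repository.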

I hope you find these tips useful.

My colleagues, Glynn Foster and Bart Smaalders, will be presenting on "Oracle Solaris 11 Best Practices for Software Lifecycle Management [CON3889]" @ Oracle OpenWorld next week.  The Oracle Sun "Systems" sessions are in the Westin this year.  This particular session is on Tuesday, Sept 24 @ 5:15pm in the "City" meeting room in the Westin and will have lots more tips and best practices.

Other colleagues, Rob Hulme and Colin Seymour, are presenting on "Best Practices for Maintaining and Upgrading Oracle Solaris [CON8255]" on Monday, Sept 23 @ 10:45am in the Westin San Francisco, also in the "City" meeting room.

And there's lots of other good stuff on Solaris and SuperCluster.  For example, the "Deep Dive into Oracle SuperCluster [CON8632]" on Tuesday, Sept 24 @ 5:15pm in the Westin, Metropolitan II.

I'm not presenting this year, but if you would like to meet up with me @ OpenWorld to discuss anything about Solaris / Systems / SuperCluster Lifecycle Maintenance, whether it's ideas you'd like to see implemented, what's keeping you awake at night, issues you want me to look at, etc., I am more than happy to do so.  Just ping me at Gerry.Haskins@oracle.com.

Best Wishes,

Gerry.

About

This blog is to inform customers about Solaris 11 maintenance best practice, feature enhancements, and key issues. The views expressed on this blog are my own and do not necessarily reflect the views of Oracle. The Documents contained within this site may include statements about Oracle's product development plans. Many factors can materially affect these plans and the nature and timing of future product releases. Accordingly, this Information is provided to you solely for information only, is not a commitment to deliver any material code, or functionality, and SHOULD NOT BE RELIED UPON IN MAKING PURCHASING DECISIONS. The development, release, and timing of any features or functionality described remains at the sole discretion of Oracle. THIS INFORMATION MAY NOT BE INCORPORATED INTO ANY CONTRACTUAL AGREEMENT WITH ORACLE OR ITS SUBSIDIARIES OR AFFILIATES. ORACLE SPECIFICALLY DISCLAIMS ANY LIABILITY WITH RESPECT TO THIS INFORMATION. Gerry Haskins, Director, Software Lifecycle Engineering
