GSI Architecture Strategy: Maintenance
By asparks on May 08, 2009
Continuing a series of posts on Global Single Instance (GSI) architecture strategies. This one’s a touchy subject: maintenance.
Maintaining a Global Single Instance presents a number of unique challenges to the IT organization, all of which can be overcome with some common sense, discipline and full use of the relevant features that your software and hardware infrastructure offer.
Where for art thou Downtime?
The most obvious challenge for GSI maintenance is finding downtime to execute the maintenance. What influences the (potentially) available downtime? Key influences on downtime are listed below in no particular order.
Typically it costs more to have a production instance offline than a project instance.
However, one usually has more project instances to maintain as there is usually a many-to-one relation of project to production instances. Furthermore, project instances often need to be maintained more frequently to support project progress or to promote patches through the test instance set to production.
In some cases project instance downtime can be very costly. The cost of project team delays (including expensive consultant time) can add up quickly and is frequently overlooked.
Equal care can be required in planning project and production instance downtime depending on the project phase.
Instance Functional Footprint
The downtime cost for an instance depends to a degree on the business processes supported by the instance.
Revenue-side or customer-critical operations (e.g. billing, manufacturing, shipping or some CRM functions) carry a higher downtime cost to the company.
These business processes frequently run on a 2- or 3-shift or 24x7 basis.
The more "global" an instance is (the more timezones it has to support), the more difficult it can be to find a maintenance window that does not impact production (or project) operations somewhere. This becomes even more difficult when you add a new region to the program (e.g. adding project operations in Asia-Pacific to an instance with Americas & EMEA) or when you deal with regions that have a shifted workweek (e.g. Sunday - Thursday).
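To make the squeeze concrete, here is a minimal sketch that lists the UTC slots in which no region is working. The region names, fixed UTC offsets (daylight saving is ignored), working days and working hours are all illustrative assumptions, not values from any real program.

```python
# Hypothetical regions: fixed UTC offsets (DST ignored), local working days and hours.
# Weekdays follow Python's numbering: Monday=0 ... Sunday=6.
REGIONS = {
    "Americas": {"utc_offset": -5, "workdays": {0, 1, 2, 3, 4}, "start": 8, "end": 18},
    "EMEA": {"utc_offset": 1, "workdays": {0, 1, 2, 3, 4}, "start": 8, "end": 18},
}

# A hypothetical new region with a Sunday - Thursday workweek.
SUN_THU_REGION = {"utc_offset": 3, "workdays": {6, 0, 1, 2, 3}, "start": 8, "end": 18}


def is_busy(region, utc_day, utc_hour):
    """True if the given UTC weekday/hour falls inside the region's working hours."""
    local = utc_day * 24 + utc_hour + region["utc_offset"]
    local_day = (local // 24) % 7   # wraps correctly across the week boundary
    local_hour = local % 24
    return local_day in region["workdays"] and region["start"] <= local_hour < region["end"]


def quiet_hours(regions):
    """Return every (utc_weekday, utc_hour) slot in which no region is working."""
    return [
        (day, hour)
        for day in range(7)
        for hour in range(24)
        if not any(is_busy(r, day, hour) for r in regions.values())
    ]


before = quiet_hours(REGIONS)
after = quiet_hours({**REGIONS, "Sun-Thu region": SUN_THU_REGION})
print(f"Quiet hours per week before: {len(before)}, after adding a region: {len(after)}")
```

Adding the Sunday - Thursday region removes the Sunday morning UTC slots that were previously free for everyone, which is exactly the effect described above.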
Some system aspects do not impact the available downtime as such; instead they impact the scope of what has to be executed in each maintenance window. Examples are multiple installed languages (linear impact on the number of patches to be installed) and synchronization feeds to reporting instances/DWH (requiring stop/start/resynchronize time in the window).
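A rough way to size a window under these two effects is sketched below. The function name, its parameters and the sample timings are hypothetical; the only thing taken from the text is that patch work scales linearly with the number of installed languages and that each synchronization feed adds stop/start/resynchronize time.

```python
def window_minutes(patches, languages, minutes_per_patch_run, sync_feeds, minutes_per_feed_resync):
    """Estimate the work to fit into one maintenance window, in minutes.

    Assumes one patch run per patch per installed language (the linear impact)
    plus a fixed stop/start/resynchronize cost per feed.
    """
    patch_runs = patches * languages
    return patch_runs * minutes_per_patch_run + sync_feeds * minutes_per_feed_resync


# Illustrative numbers only: 5 patches, 20 minutes per run, 2 feeds at 15 minutes each.
for languages in (1, 3, 6):
    total = window_minutes(5, languages, 20, 2, 15)
    print(f"{languages} language(s): {total} minutes")
```

With these sample values, going from one language to six takes the estimate from 130 to 630 minutes, which no longer fits in a 4 - 8 hour window.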
GSI Maintenance Management
Each of the influences on downtime can be mitigated.
First and foremost, all maintenance schedules need to be clearly communicated, highly visible to all relevant parties and frequently updated with the planned activities to be executed in each window. Communication needs to combine static information on the project/IT website (the standard maintenance windows and similar relevant information) with dynamic updates: email blasts, news/RSS updates on the planned activities for each window, and maintenance start/stop notifications. All of this builds an operational environment where maintenance is an expected element of the landscape and there are no surprises when an instance goes offline for patching or other maintenance.
Allied with communication, plan the instance maintenance windows 6 - 12 months in advance (for *all* instances) and publish the plan. This becomes a framework for negotiation (particularly with the project instances) but at least everyone is starting with the same baseline of information.
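Publishing the plan is easier if the windows are generated rather than typed by hand. The following is a minimal sketch that produces one window per week for roughly six months; the start date, the Saturday 22:00 slot and the six-hour duration are illustrative assumptions.

```python
from datetime import datetime, timedelta


def weekly_windows(first_start, weeks, duration_hours):
    """Return (start, end) pairs for one window per week, beginning at first_start."""
    duration = timedelta(hours=duration_hours)
    return [
        (first_start + timedelta(weeks=n), first_start + timedelta(weeks=n) + duration)
        for n in range(weeks)
    ]


# Roughly six months of Saturday 22:00 windows, six hours each (illustrative values).
plan = weekly_windows(datetime(2009, 5, 9, 22, 0), weeks=26, duration_hours=6)
for start, end in plan[:3]:
    print(start.strftime("%a %Y-%m-%d %H:%M"), "to", end.strftime("%a %Y-%m-%d %H:%M"))
```

The resulting list can feed whatever channel you publish through (website table, calendar file, email), so every audience sees the same baseline.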
Slots and Kaizen
Use a kaizen process of refinement to keep maintenance under control and drive performance.
Plan windows of (say) 4 - 6 hours, 2x - 3x per week, for project environments. Publish this as an SLA. Then drive improvements to get to 1x or 2x per week.
Plan one 4 - 8 hour window per week for the production environment(s). Publish this as an SLA.
Then keep driving process improvements to get to 2x per month, then 1x per month then 2x per quarter...
Remember it is not about the starting point, it is about the journey of continuously reducing the frequency of planned downtimes and improving reliability and predictability of the process.
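As a worked example of what each step on that journey buys you, the sketch below converts the production frequencies named above into planned downtime hours per year. The six-hour window length is an assumption picked from the middle of the 4 - 8 hour range.

```python
# Kaizen ladder for production windows. The frequencies come from the text;
# the windows-per-year figures are the straightforward conversion of each one.
LADDER = [
    ("1x per week", 52),
    ("2x per month", 24),
    ("1x per month", 12),
    ("2x per quarter", 8),
]


def annual_downtime_hours(windows_per_year, hours_per_window):
    """Planned downtime per year for a given window frequency and length."""
    return windows_per_year * hours_per_window


def ladder_report(hours_per_window=6):
    """Return (label, planned hours per year) for every step on the ladder."""
    return [(label, annual_downtime_hours(n, hours_per_window)) for label, n in LADDER]


for label, hours in ladder_report():
    print(f"{label:15s} {hours:4d} planned hours/year")
```

Under that assumption, moving from weekly windows to two per quarter cuts planned downtime from 312 to 48 hours a year.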
Which Slot? Timezones
As stated in the section on influences, timezone coverage can squeeze your maintenance window slots badly. But you can also take advantage of it. Maintenance windows from 0200 - 0600 CET can be expensive in terms of overtime if your data center staff are also in the CET timezone.
Using (for example) staff based in IST (India) combined with PST (California) or EST (East Coast US) can mitigate this very well and give 24x7 coverage. Make the timezones work for you.
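To see why this works, the sketch below converts the 0200 - 0600 CET window into local time for each of those staff locations. It uses fixed standard-time offsets and ignores daylight saving, which is a simplifying assumption; the date is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Fixed standard-time offsets (DST ignored, a simplifying assumption).
CET = timezone(timedelta(hours=1), "CET")
STAFF_ZONES = {
    "IST": timezone(timedelta(hours=5, minutes=30), "IST"),
    "PST": timezone(timedelta(hours=-8), "PST"),
    "EST": timezone(timedelta(hours=-5), "EST"),
}


def local_window(start_cet, hours):
    """Return {zone name: (local start, local end)} for a window given in CET."""
    end_cet = start_cet + timedelta(hours=hours)
    return {
        name: (start_cet.astimezone(tz), end_cet.astimezone(tz))
        for name, tz in STAFF_ZONES.items()
    }


# The 0200 - 0600 CET window from the text, on an arbitrary Saturday.
windows = local_window(datetime(2009, 1, 10, 2, 0, tzinfo=CET), hours=4)
for name, (start, end) in windows.items():
    print(f"{name}: {start:%a %H:%M} to {end:%a %H:%M}")
```

The same window that falls in the middle of the night in CET lands in the early morning for India and in the evening for both US coasts.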
Use Available Technology
Invest time in understanding how to get the best out of the technology you have and what additional technology you may need to acquire to meet your downtime requirements. I will always recommend colleague Steven Chan’s excellent blog for more detail on this subject – but here is a grab bag of suggestions to follow…
- Invest in the hardware infrastructure (e.g. disk technology) to speed backup and recovery operations
- Hot backups must be routine
- Use the well documented features of Oracle E-Business Suite to minimize patching time. Some examples:
  - Minimize the number of individual patches and human interference in patching by merging patches
  - Minimize the software distributions to be patched by consolidating to a single shared software directory
  - Run the patches faster by spreading multiple parallel patch utility worker processes over multiple servers
  - Install patches in offline staging area (staged APPL_TOP technique)
  - Analyze which patch (components) can be installed hot according to Oracle Development guidelines