AIX Checklist for stable obiee deployment

Common AIX configuration issues     ( last updated 26 Jun 2013 )

OBIEE is a complicated system with many moving parts and connection points.
The purpose of this article is to provide a checklist to discuss OBIEE deployment with your systems administrators.

The information in this article is time sensitive, and updated as I discover new  issues or details and broken URL's.
Apologies for lack of updates.  I just discovered this blog software doesn't work with Internet Explorer.
Last 4 updates were discarded.   -- 2013-06-26

What makes OBIEE different?

When Tech Support suggests AIX component upgrades to a stable, locked-down production AIX environment, it is common to get "push back".  "Why is this necessary?  We aren't we seeing issues with other software?"

It's a fair question that I have often struggled to answer; here are the talking points:

  • OBIEE is memory intensive.  It is the entire purpose of the software to trade memory for repetitive, more expensive database requests across a network.
  • OBIEE is implemented in C++ and is very dependent on the C++ runtime to behave correctly.
  • OBIEE is aggressively thread efficient;  if atomic operations on a particular architecture do not work correctly, the software crashes.
  • OBIEE dynamically loads third-party database client libraries directly into the nqsserver process.  If the library is not thread-safe, or corrupts process memory the OBIEE crash happens in an unrelated part of the code.  These are extremely difficult bugs to find.
  • OBIEE software uses 99% common source across multiple platforms:  Windows, Linux, AIX, Solaris and HPUX.  If a crash happens on only one platform, we begin to suspect other factors.  load intensitysystem differences, configuration choices, hardware failures. 

It is rare to have a single product require so many diverse technical skills.   My role in support is to understand system configurations, performance issues, and crashes.   An analyst trained in Business Analytics can't be expected to know AIX internals in the depth required to make configuration choices.  Here are some guidelines.

  1. AIX C++ Runtime must be at  version 12.1.0.1 (was: 11.1.0.4, which still works fine )
    $ lslpp -L | grep xlC.aix
    obiee software will crash if xlC.aix.rte is downlevel;  this is not a "try it" suggestion.
    Aug 2012 version 12.1.0.1  is appropriate for all AIX versions ( 5.3, 6.1, 7.1 )
    Download from here:
    http://www-01.ibm.com/support/docview.wss?uid=swg24033340
    No reboot is necessary to install, it can even be installed while applications are using the current version.
    Restart the apps, and they will pick up the latest version.


  2. AIX 5.3 Technology Level 12 is required when running on Power5,6,7 processors
    AIX 6.1 was introduced with the newer Power chips, and we have seen no issues with 6.1 or 7.1 versions.
    Customers with an unstable deployment, dozens of unexplained crashes, became stable after the TL12 upgrade.
    If your AIX system is 5.3, the minimum TL level should be at or higher than this:
    $ oslevel -s
      5300-12-03-
    1107
    IBM typically supports only the two latest versions of AIX ( 6.1 and 7.1, for example).  AIX 5.3 is still supported and popular running in an LPAR.

  3. Java runtime should be downloaded from IBM's FixCentral.
    IBM now requires registration and login.
      Search FixCentral [ java ]
      [x] Runtimes for Java

    Fixes are available for Java 4,5,6,7.  OBIEE supports 1.5 and 1.6
    Test your installed version using this command:
      java -version

    Recent fixes from IBM have resolved Java performance and stability issues on AIX.
    I see customers have deployed these versions to repair problems.
      "SR9   FP1" 
       "SR12 FP5"
       "SR13 FP3"  ( a popular version: IBM fix IZ94331 )


  4. obiee userid limits
    $ ulimit -Ha  ( hard limits )
    $ ulimit -a   ( default limits )
    core file size (blocks)     unlimited
    data seg size (kbytes)      unlimited
    file size (blocks)          unlimited
    max memory size (kbytes)    unlimited
    open files                  10240
    cpu time (seconds)          unlimited
    virtual memory (kbytes)     unlimited

    It is best to establish the values in /etc/security/limits
    root user is needed to observe and modify this file.
    If you modify a limit in /etc/security/limits , you will need to relog in to have the change take effect.  For example,
    $ ulimit -c 0
    $ ulimit -c 2097151
    cannot modify limit: Operation not permitted
    $ ulimit -c unlimited
    $ ulimit -c
    0

    There are only two meaningful values for core files ( ulimit -c ) ; zero or unlimited.
    Anything else is likely to produce a truncated core file that cannot be analyzed.
    Lack of filesystem space and file size limit ( ulimit -f ) may also produce truncated cores.

  5. Deploy 32-bit or 64-bit ?
    Early versions of OBIEE offered 32-bit or 64-bit choice to AIX customers.
    The 32-bit choice was needed if a database vendor did not supply a 64-bit client library.
    That's no longer an issue and beginning with OBIEE 11, 32-bit code is no longer shipped.

    A common error that leads to "out of memory" conditions to to accept the 32-bit memory configuration choices on 64-bit deployments.  The significant configuration choices are:
    • Maximum process data (heap) size is in an AIX environment variable
      LDR_CNTRL=IGNOREUNLOAD@LOADPUBLIC@PREREAD_SHLIB@MAXDATA=0x...
    • Two thread stack sizes are made in obiee NQSConfig.INI
      [ SERVER ]
      SERVER_THREAD_STACK_SIZE = 0;
      DB_GATEWAY_THREAD_STACK_SIZE = 0;
    • Sort memory in NQSConfig.INI
      [ GENERAL ]
      SORT_MEMORY_SIZE = 4 MB ;
      SORT_BUFFER_INCREMENT_SIZE = 256 KB ;


    Choosing a value for MAXDATA:
    0x080000000  2GB Default maximum 32-bit heap size  ( 8 with 7 zeros )
    0x100000000  4GB 64-bit effectively same as 32-bit ( 1 with 8 zeros )
    0x200000000  8GB 64-bit quadruple the 32-bit max for 64-bit
    0x400000000 16GB 64-bit extra memory for safety ( could cause large core files)


    Using 2GB heap size for a 64-bit process will almost certainly lead to an out-of-memory situation.
    Registers are twice as big ... consume twice as much memory in the heap.
    Upgrading to a 4GB heap for a 64-bit process is just "breaking even" with 32-bit.

    A 32-bit process is constrained by the 32-bit virtual addressing limits.  Heap memory is used for dynamic requirements of obiee software, thread stacks for each of the configured threads, and sometimes for shared libraries.

    64-bit processes are not constrained in this way;  extra heap space can be configured for safety against a query that might create a sudden requirement for excessive storage.  If the storage is not available, this query might crash the whole server and disrupt existing users.

    The MAXDATA settings on obiee10 are changed in
     ./setup directory files:
    .variant.sh and systunesrv.sh

    There is no performance penalty on AIX for configuring more memory than required;  extra memory can be configured for safety.  If there are no other considerations, start with 8GB.


    Choosing a value for Thread Stack size:
    zero is the value documented to select an appropriate default for thread stack size.  My preference is to change this to an absolute value, even if you intend to use the documented default;  it provides better documentation and removes the "surprise" factor.

    There are two thread types that can be configured.
    • GATEWAY is used by a thread pool to call a database client library to establish a DB connection.
      The default size is 256KB;  many customers raise this to 512KB ( no performance penalty for over-configuring ).
      This value must be set to 1 MB if Teradata connections are used.
    • SERVER threads are used to run queries.  OBIEE uses recursive algorithms during the analysis of query structures which can consume significant thread stack storage.  It's difficult to provide guidance on a value that depends on data and complexity.  The general notion is to provide more space than you think you need,  "double down" and increase the value if you run out, otherwise inspect the query to understand why it is too complex for the thread stack.  There are protections built into the software to abort a single user query that is too complex, but the algorithms don't cover all situations.
      256 KB  The default 32-bit stack size.  Many customers increased this to 512KB on 32-bit.  A 64-bit server is very likely to crash with this value;  the stack contains mostly register values, which are twice as big.
      512 KB  The documented 64-bit default.  Some early releases of obiee didn't set this correctly, resulting in 256KB stacks.
      1 MB  The recommended 64-bit setting.  If your system only ever uses 512KB of stack space, there is no performance penalty for using 1MB stack size.
      2 MB  Many large customers use this value for safety.  No performance penalty.

      nqscheduler does not use the NQSConfig.INI file to set thread stack size.
      If this process crashes because the thread stack is too small, use this to set 2MB:
      export OBI_BACKGROUND_STACK_SIZE=2048

  6. Shared libraries are not (shared)
    1. When application libraries are loaded at run-time, AIX makes a decision on whether to load the libraries in a "public" memory segment.  If the filesystem library permissions do not have the "Read-Other" permission bit, AIX loads the library into private process memory with two significant side-effects:
      * The libraries reduce the heap storage available.  
          Might be significant in 32-bit processes;  irrelevant in 64-bit processes.
      * Library code is loaded into multiple real pages for execution;  one copy for each process.
      Multiple execution images is a significant issue for both 32- and 64-bit processes.

      The "real memory pages" saved by using public memory segments is a minor concern.  Today's machines typically have plenty of real memory.
      The real problem with private copies of libraries is that they consume processor cache blocks, which are limited.   The same library instructions executing in different real pages will cause memory delays as the i-cache ( instruction cache 128KB blocks) are refreshed from real memory.   Performance loss because instructions are delayed is something that is difficult to measure without access to low-level cache fault data.   The machine just appears to be running slowly for no observable reason.

      This is an easy problem to detect, and an easy problem to correct.

      Detection:  "
      genld -l" AIX command produces a list of the libraries used by each process and the AIX memory address where they are loaded.
      32-bit public segment is 13 ( "dxxxxxxx" ).   private segments are 2-a.
      64-bit public segment is 9 ( "9xxxxxxxxxxxxxxx") ; private segment is 8.

      genld -l | grep -v ' d| 9' | sort +2

      provides a list of privately loaded libraries. 

      Repair: chmod o+r <libname>
      AIX shared libraries will have a suffix of ".so" or ".a".
      Another technique is to change all libraries in a selected directory to repair those that might not be currently loaded.   The usual directories that need repair are obiee code, httpd code and plugins, database client libraries and java.
      chmod o+r /shr/dir/*.a /shr/dir/*.so

  7. Configure your system for diagnostics
    Production systems shouldn't crash, and yet bad things happen to good software.
    If obiee software crashes and produces a core, you should configure your system for reliable transfer of the failing conditions to Oracle Tech Support.  Here's what we need to be able to diagnose a core file from your system.
    * fullcore enabled. chdev -lsys0 -a fullcore=true
    * core naming enabled. chcore -n on -d
    * ulimit must not truncate core. see item 3.
    * pstack.sh is used to capture core documentation.
    * obidoc is used to capture current AIX configuration.
    * snapcore  AIX utility captures core and libraries. Use the proper syntax.
     $ snapcore -r corename executable-fullpath
       /tmp/snapcore will contain the .pax.Z output file.  It is compressed.
    * If cores are directed to a common directory, ensure obiee userid can write to the directory.  ( chcore -p /cores -d ; chmod 777 /cores )
    The filesystem must have sufficient space to hold a crashing obiee application.
    Use:  df -k
      Check the "Free" column ( not "% Used" )
      8388608 is 8GB.

  8. Disable Oracle Client Library signal handling
    The Oracle DB Client Library is frequently distributed with the sqlplus development kit.
    By default, the library enables a signal handler, which will document a call stack if the application crashes.   The signal handler is not needed, and definitely disruptive to obiee diagnostics.   It needs to be disabled.   sqlnet.ora is typically located at:
       $ORACLE_HOME/network/admin/sqlnet.ora
    Add this line at the top of the file:
       DIAG_SIGHANDLER_ENABLED=FALSE

  9. Disable async query in the RPD connection pool.
    This might be an obiee 10.1.3.4 issue only ( still checking  ).
    "async query" must be disabled in the connection pools.
    It was designed to enable query cancellation to a database, and turned out to have too many edge conditions in normal communication that produced random corruption of data and crashes.  Please ensure it is turned off in the RPD.

  10. Check AIX error report (errpt).
    Errors external to obiee applications can trigger crashes.
     $ /bin/errpt -a
    Hardware errors ( firmware, adapters, disks ) should be reported to IBM support.
    All application core files are recorded by AIX;  the most recent ones are listed first.

  11. Capture pstack output for the most recent crash
    $ errpt -A |grep core |head -1 |xargs pstack.sh
    produces a core*.pstack file in directory set by $obiCollect

  12. Reserved for something important to say.

Comments:

Post a Comment:
Comments are closed for this entry.
About

Dick Dunbar
is an escalation engineer working in the Customer Engineering & Advocacy Lab (CEAL team)
for Oracle Analytics and Performance Management.
I live and work in Santa Cruz, California.
I'll share the techniques I use to detect, avoid and repair problems.

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today