Wednesday Mar 26, 2014

Software Availability : Solaris Studio 12.4 Beta & ORAchk

First off, these are two unrelated software products.

Solaris Studio 12.4 Beta

Nearly two-and-a-half years after the release of Solaris Studio 12.3, Oracle is gearing up for the next major release, 12.4. In addition to the compiler and library optimizations that support the latest and greatest SPARC & Intel x64 hardware such as SPARC T5, M5, M6, Fujitsu's M10, and Intel's Ivy Bridge and Haswell line of servers, support for the C++ 2011 language standard is one of the highlights of this forthcoming release. The complete list of features and enhancements in release 12.4 is documented in the What's New page.
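
For example, based on the documented 12.4 option set, C++ 2011 mode is requested through the C++ compiler's -std flag. This is a hypothetical one-liner with an invented file name, not taken from the release notes:

% CC -std=c++11 -o app app.cpp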

Those who feel compelled to give the updated/enhanced compilers and tools a try can get started right away by downloading the beta bits from the following location. This software is available for Solaris 10 & 11 running on SPARC and x86 hardware, and for Linux 5 & 6 running on x86/x64 hardware. Anyone can download this software for free.

     Oracle Solaris Studio 12.4 Beta Download

Don't forget to check out the Release Notes for installation instructions, known issues, limitations and workarounds, features removed in this release, and so on.

Here's a pointer to the documentation (preview): Oracle Solaris Studio 12.4 Information Library

Finally, should you run into any issues or have questions about anything related, feel free to use the Solaris Studio Community Forum.




ORAchk 2.2.4 (formerly known as EXAchk)

ORAchk, the Oracle Configuration Audit Tool, enhances the functionality of the EXAchk tool and replaces the existing, popular RACcheck tool. In addition to the top issues reported by users/customers, ORAchk proactively scans for known problems within Oracle Database, Sun systems (especially engineered systems) and Oracle E-Business Suite Financials.

While checking, ORAchk covers a wide range of areas such as OS kernel settings, database installations (single instance and RAC), performance, backup and recovery, storage settings, and so on.

ORAchk-generated reports (mostly high level) show the system health risks, provide the ability to drill down into specific problems, and offer recommendations specific to the environment and product configuration. Those who do not like sending this data back to Oracle should be happy to know that there is no phone-home feature in this release.
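
For the curious, a typical run is fairly simple. The following is only a rough sketch - the archive and directory names are illustrative, and the supported options are documented in the user's guide linked below:

% unzip orachk.zip
% cd orachk
% ./orachk

The tool typically prompts for which components to check and produces an HTML report summarizing its findings.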

Note that ORAchk is available only to Oracle Premier Support customers - meaning only those customers with appropriate support contracts can use this tool. So, if you are an Oracle customer with access to the Oracle Support website, check out the following pages for additional information.

     ORAchk - Oracle Configuration Audit Tool
     ORAchk user's guide

Feel free to use the community forum to ask any related questions.

Friday Apr 12, 2013

Siebel 8.1.1.4 Benchmark on SPARC T5

Barely six months after announcing the Siebel 8.1.1.4 benchmark results on Oracle SPARC T4 servers, we have a brand new set of Siebel 8.1.1.4 benchmark results on Oracle SPARC T5 servers. There have been no updates to the Siebel benchmark kit in the last couple of years, so we continued to use the Siebel 8.1.1.4 benchmark workload to measure the performance of Siebel Financial Services Call Center and Order Management business transactions on the recently announced SPARC T5 servers.

Benchmark Details

The latest Siebel 8.1.1.4 benchmark was executed on a mix of SPARC T5-2, SPARC T4-2 and SPARC T4-1 servers. The benchmark test simulated the actions of a large corporation with 40,000 concurrent active users. To date, this is the highest user count we have achieved in a Siebel benchmark.


User Load Breakdown & Achieved Throughput

Siebel Application Module       | %Total Load | #Users | Business Trx per Hour
--------------------------------|-------------|--------|----------------------
Financial Services Call Center  |     70      | 28,000 |        273,786
Order Management                |     30      | 12,000 |         59,553
Total                           |    100      | 40,000 |        333,339

Average Transaction Response Times for both Financial Services Call Center and Order Management transactions were under one second.


Software & Hardware Specification

Test Component        | Software Version                        | Server Model  | Qty | Chips | Cores | vCPUs | CPU Speed | CPU Type           | Memory | OS
----------------------|-----------------------------------------|---------------|-----|-------|-------|-------|-----------|--------------------|--------|-------------------------
Application Server    | Siebel 8.1.1.4                          | SPARC T5-2    |  2  |   2   |  32   |  256  | 3.6 GHz   | SPARC-T5           | 512 GB | Solaris 10 1/13 (S10U11)
Database Server       | Oracle 11g R2 11.2.0.2                  | SPARC T4-2    |  1  |   2   |  16   |  128  | 2.85 GHz  | SPARC-T4           | 256 GB | Solaris 10 8/11 (S10U10)
Web Server            | iPlanet Web Server 7.0.9 (7 U9)         | SPARC T4-1    |  1  |   1   |   8   |   64  | 2.85 GHz  | SPARC-T4           | 128 GB | Solaris 10 8/11 (S10U10)
Load Generator        | Oracle Application Test Suite 9.21.0043 | SunFire X4200 |  1  |   2   |   4   |    4  | 2.6 GHz   | AMD Opteron 285 SE | 16 GB  | Windows 2003 R2 SP2
Load Drivers (Agents) | Oracle Application Test Suite 9.21.0043 | SunFire X4170 |  8  |   2   |  12   |   12  | 2.93 GHz  | Intel Xeon X5670   | 48 GB  | Windows 2003 R2 SP2

(Chips, Cores, vCPUs, CPU Speed, CPU Type and Memory are per-server values.)

Additional Notes:

  • Siebel Gateway Server was configured to run on one of the application server nodes
  • Four Siebel application servers were configured in the Siebel Enterprise to handle 40,000 concurrent users
    - Each SPARC T5-2 was configured to run two Siebel application server instances
    - The Siebel application server instances on the SPARC T5-2 servers were isolated from each other using Solaris Zones virtualization technology
    - The 40,000 concurrent user sessions were load balanced across all four Siebel application server instances
  • The Siebel database was hosted on a Sun Storage F5100 Flash Array consisting of 80 x 24 GB flash modules (FMODs)
    - The Siebel 8.1.1.4 benchmark workload is not I/O intensive and does not require flash storage for better I/O performance
  • Fourteen iPlanet Web Server virtual servers were configured with the Siebel Web Server Extension (SWSE) plug-in to handle the 40,000 concurrent user load
    - All fourteen iPlanet Web Server instances forwarded HTTP requests from Siebel clients to all four Siebel application server instances in a round-robin fashion
  • Oracle Application Test Suite (OATS) was stable and held up amazingly well over the entire duration of the test run
  • The benchmark test results were validated and thoroughly audited by the Siebel benchmark and PSR teams
    - Nothing new here. All Sun-published Siebel benchmarks, including the SPARC T4 one, were properly audited before being released to the outside world

Resource Utilization

Component                  | #Users | CPU%  | Memory Footprint
---------------------------|--------|-------|-----------------
Gateway/Application Server | 20,000 | 67.03 | 205.54 GB
Application Server         | 20,000 | 66.09 | 206.24 GB
Database Server            | 40,000 | 33.43 | 108.72 GB
Web Server                 | 40,000 | 29.48 | 14.03 GB

Finally, how does this benchmark stack up against other published benchmarks? The short answer is "very well". Head over to the Oracle Siebel Benchmark White Papers webpage to do the comparison yourself.



[Credit to our hard working colleagues in SAE, Siebel PSR, benchmark and Oracle Platform Integration (OPI) teams. Special thanks to Sumti Jairath and Venkat Krishnaswamy for the last minute fire drill]

A copy of this blog post is also available at:
Siebel 8.1.1.4 Benchmark on SPARC T5

Wednesday Jan 30, 2013

Siebel 8.1.1.4 Benchmark on SPARC T4

Siebel is a multi-threaded native application that performs well on Oracle's T-series SPARC hardware. We have several Siebel benchmarks published on previous-generation T-series servers, ranging from the Sun Fire T2000 to the Oracle SPARC T3-4. So, it is natural to see that tradition extended to the current-generation SPARC T4 as well.

Benchmark Details

A 29,000-user Siebel 8.1.1.4 benchmark on a mix of SPARC T4-1 and T4-2 servers was announced during the Oracle OpenWorld 2012 event. In this benchmark, Siebel application server instances ran on three SPARC T4-2/Solaris 10 8/11 systems, whereas the Oracle 11g R2 database server was configured on a single SPARC T4-1/Solaris 11 11/11 system. Several iPlanet Web Server 7 U9 instances with the Siebel Web Plug-in (SWE) installed ran on one SPARC T4-1/Solaris 10 8/11 system. The Siebel database was hosted on a single Sun Storage F5100 flash array consisting of 80 flash modules (FMODs), each with a capacity of 24 GB.

Siebel Call Center and Order Management System are the modules that were tested in this benchmark. The benchmark workload had 70% of the virtual users running Siebel Call Center transactions and the remaining 30% running Siebel Order Management System transactions. The benchmark on T4 exhibited sub-second average response times for both the Siebel Call Center and Order Management System modules.

Load balancing at various layers including web and test client systems ensured near uniform load across all web and application server instances. All three Siebel application server systems consumed ~78% CPU on average. The database and web server systems consumed ~53% and ~18% CPU respectively.

All these details are supposed to be available in a standard Oracle|Siebel benchmark template - but for some reason, I couldn't find it on Oracle's Siebel Benchmark White Papers web page yet. Meanwhile, check out the following press release that was posted on oracle.com on 09/28/2012.

    SPARC T4 Servers Set World Record on Siebel CRM 8.1.1.4 Benchmark

It looks like the large number of vusers (29,000, to be precise) sets this benchmark apart from the other benchmarks published with the same Siebel 8.1.1.4 benchmark workload.

[Credit to our colleagues in Siebel PSR, benchmark, SAE and ISVe teams]

Monday Feb 14, 2011

Oracle Solaris Studio C/C++: Tuning iropt for inline control

It is desirable to inline as many hot routines as possible to reduce the runtime overhead of CPU-intensive applications. In general, compilers go by their own rules when deciding when to inline and when not to inline a routine. This blog post introduces some of the not widely known (or used) compiler internal flags that can be used to tweak the compiler's pre-defined rules.

Consider the following trivial C code:


% cat inline.c

#include <stdio.h>
#include <stdlib.h>

inline void freememory(int *ptr)
{
        free(ptr);
}

extern inline void swapdata(int *ptr1, int *ptr2)
{
        int *temp;

        temp = (int *) malloc (sizeof (int));
        printf("\nswapdata(): before swap ->");

        *temp = *ptr1;
        *ptr1 = *ptr2;
        *ptr2 = *temp;

        printf("\nswapdata(): after swap ->");

        free (temp);
}

inline void printdata(int *ptr)
{
        printf("\nAddress = %x\tStored Data = %d", ptr, *ptr);
}

inline void storedata(int *ptr, int data)
{
        *ptr = data;
}

inline int *getintptr()
{
        int *ptr;
        ptr = (int *) malloc (sizeof(int));
        return (ptr);
}

inline void AllocLoadAndSwap(int val1, int val2)
{
        int *intptr1, *intptr2;

        intptr1 = getintptr();
        intptr2 = getintptr();
        storedata(intptr1, val1);
        storedata(intptr2, val2);
        printf("\nBefore swapping .. ->");
        printdata(intptr1);
        printdata(intptr2);
        swapdata(intptr1, intptr2);
        printf("\nAfter swapping .. ->");
        printdata(intptr1);
        printdata(intptr2);
        freememory(intptr1);
        freememory(intptr2);
}

inline void InitAllocLoadAndSwap()
{
        printf("\nSnapshot 1\n___________");
        AllocLoadAndSwap(100, 200);
        printf("\n\nSnapshot 2\n___________");
        AllocLoadAndSwap(435, 135);
}

int main() {
        InitAllocLoadAndSwap();
        return (0);
}

By default, auto inlining is turned off in Oracle Studio compilers; to turn it on, the code must be compiled at the -xO4 optimization level or higher. This example attempts to hint to the compiler to inline all the routines with the help of the inline keyword. Note that the inline keyword is a suggestion/request for the compiler to inline the function; there is no guarantee that the compiler will honor it. Just like everything else in the world, the compiler has a pre-defined set of rules, and based on those rules it tries to do its best as long as they are not violated. If the compiler chooses to inline a routine, the function body is expanded at all the call sites (just like a macro expansion).

When this code is compiled with the Oracle Studio C compiler, the compiler doesn't print any diagnostic information to stdout or stderr - so using nm or elfdump on the object file is one way to find out which routines were inlined and which were not.


% cc -xO3 -c inline.c
% nm inline.o

inline.o:

[Index]   Value      Size      Type  Bind  Other Shndx   Name

[4]     |         0|       0|NOTY |LOCL |0    |3      |Bbss.bss
[6]     |         0|       0|NOTY |LOCL |0    |4      |Ddata.data
[8]     |         0|       0|NOTY |LOCL |0    |5      |Drodata.rodata
[16]    |         0|       0|NOTY |GLOB |0    |ABS    |__fsr_init_value
[14]    |         0|       0|FUNC |GLOB |0    |UNDEF  |InitAllocLoadAndSwap
[1]     |         0|       0|FILE |LOCL |0    |ABS    |inline.c
[15]    |         0|      20|FUNC |GLOB |0    |2      |main

From this output we can see that InitAllocLoadAndSwap() was not inlined, yet there is no information as to why.

Compiler commentary with er_src

To get some useful diagnostic information, the Oracle Studio compiler collection offers a utility called er_src. When the source code is compiled with a debug flag (-g or -g0), er_src can print the compiler commentary. However, since the compiler does auto inlining only at the -xO4 or higher optimization levels, compiler commentary for inlining is unfortunately not available at the -xO3 optimization level used above.
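
As a quick illustration (not part of the original exercise), the following sketch shows how that commentary could be obtained once auto inlining is actually in effect. It assumes cc and er_src from the same Studio installation are on the PATH, and the output is omitted here:

% cc -xO4 -g -c inline.c        (-xO4 enables auto inlining; -g requests compiler commentary)
% er_src inline.o               (prints annotated source along with the compiler commentary)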

iropt's inlining report

"iropt" is the global optimizer in Oracle Solaris Studio compiler collection. Inlining will be taken care by iropt. It performs inlining for callees in the same file unless compiler options for cross file optimizations such as -xipo, -xcrossfile are specified on compile line. Some of the iropt options can be used to control inlining heuristics. These options have no dependency on the optimization level.

Finding the list of iropt phases and the supported options

Oracle Studio C/C++ compilers on SPARC support a variety of options for function inline control. iropt's -help option displays the list of supported flags/options.
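
Note that iropt is normally invoked behind the scenes by the compiler driver rather than directly by the user, so it is usually not on the PATH. One way to locate it (a sketch, not from the original post) is to ask the driver for a dry run with -### and look for the iropt invocation in the output; the install path below is just a placeholder:

% cc -### -xO3 -c inline.c                 (traces the compilation without executing it)
% /path/to/studio/prod/bin/iropt -help     (then run iropt directly from the path shown)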


% CC -V
CC: Sun C++ 5.9 SunOS_sparc Patch 124863-01 2007/07/25

% cc -V
cc: Sun C 5.9 SunOS_sparc Patch 124867-01 2007/07/12

% iropt -help

  ******  General Usage Information about IROPT  ******

To get general help information about IROPT, use -help
To list all the optimization phases in IROPT, use -phases
To get help on a particular phase, use -help=phase
To turn on phases, use -A<phase1>+<phase2>+...+<phasen>
To turn off phases, use -R<phase1>+<phase2>+...+<phasen>
To use phase-specific flags, use -A<phase>:<flag>

% iropt -phases


  ****** List of Optimization Phases in IROPT ******

    Phase Name          Description
-------------------------------------------------------------
bitfield	     Bitfield transformations
iv		     Strength Reduction
loop		     Loop Invariant Code Motion
cse		     Common Subexpression Elimination
copy		     Copy Propagation
const		     Const Propagation and Folding
reg		     Virtual Register Allocation
unroll		     Data Dependence Loop Unrolling
merge		     Merge Basic Blocks
reassoc		     Reconstruction of associative and/or distributive expressions
composite_breaker	     
tail		     Tail Recursion Optimization
rename		     Scalar Rename
reduction	     
mvl		     Two-version loops for parallelization
loop_dist	     Loop Distribution
ivsub		     Induction Variables Substitution: New Algorithm
ddint		     Loop Interchange
fusion		     Loop Fusion
eliminate	     Scalar Replacement on def-def and def-use
private		     Private Array Analysis
scalarrep	     Scalar Replacement for use-use
tile		     Cache Blocking
ujam		     Register Blocking
ddrefs		     Loop Invariant Array References Moving
invcc		     Invariant Conditional Code Motion
sprof		     Synthetic Profiling
restrict_g	     Assume global pointers as restrict
dead		     Dead code elimination
pde		     Partial dead code elimination
reassoc2	     loop invariant reassociative tranfsformations
distr		     distributive reassociative tranfsformations
height2		     tree height reassociative reduction
ansi_alias	     Apply ANSI Aliase Rules to Pointer References
perfect		     
yas		     Scalar Replacement for reduction arrays
pf		     Prefetch Analysis
cond_elim	     Conditional Code Elimination
vector		     Vectorizing Some Intrinsics Functions Calls in Loops
whole		     Whole Program Mode
bopt		     Branches Reordering based on Profile Data
invccexp	     Invariant Conditional Code Expansion
bcopy		     Memcpy and Memset Transformations
ccse		     Cross Iteration CSE
data_access	     Array Access Regions Analysis
ipa		     Interprocedual Analysis
contract	     Array Contraction Analysis
symbol		     Symbolic Analysis
ppg2		     optimistic strategy of constant propagation
parallel	     Parallelization
pcg		     Parallel Code Generator
lazy		     Lazy Code Motion
region		     Region-based Optimization
loop_peeling	     Loop Peeling
loop_reform	     Loop Reformulation
loop_shifting	     Loop Shifting
loop_collapsing	     Loop Collapsing
memopt		     Merge memory allocations
inline		     IPA-based inlining phase
clone		     Routine cloning phase
norm_ir		     clean-up and normalize ir
ipa_ppg		     interprocedural constant propagation
sr		     Strength reduction (new)
ivsub3		     Induction Variable Substitution
icall_opt	     indirect call optimization
cha		     Class Hierarchy Analysis
ippt		     Interprocedual pointer tracking
reverse_invcc	     reverse invariant condition code hoisting
crit		     Critical path optimisations
loop_norm	     loop normalization
loop_unimodular	     loop unimodular transformation
scalar_repl	     Scalar Replacement
loop_bound	     Redundant Loop Bound Checking Elimination
loop_condition	     Invariant Loop Bound Checking Hoisting
memopt_pattern	     Memory Access Optimization
loop_improvement	     Loop structure improvement by code specialization
pbranch_opt	     C++ Java Pbranch Optimizations
norm_ldst	     short ld/st normalisation
micro_vector	     Micro vectorization for x86
ipa_symbol_ppg	     interprocedural symbolic analysis
optinfo		     Compile-time information about loop and inlining transformations
vp		     Value profiling and code specialization
pass_ti		     Pass IR type trees to the backend
fully_unroll	     Fully Loop Unrolling
builtin_opt	     Builtin Optimization


% iropt -help=inline

NAME
    inline - Qoption for IPA-based inlining phase.

SYNOPSIS
    -Ainline[:<sub-option>][:<sub-option>]:...[:<sub-option>] - turn on inline.
    -Rinline                             - turn off inline

DESCRIPTION
    inline is on by default now. -Ainline turns it on.
    -Rinline turns it off. 
    
    NOTE: the following is a brief description of the old inliner qoptions
          1. Old inliner qoptions that do not have equivalent 
             options in the new inliner--avoid to use them later: 
             -Ml -Mi -Mm -Ma -Mc -Me -Mg -Mw -Mx -Mx -MC -MS 

          2. Old inliner qoptions that have equivalent option 
             in the new inliner--use the new options later: 
             Old options     new options 
                -Msn          recursion=n 
                -Mrn          irs=n      
                -Mtn          cs=n       
                -Mpn          cp=n       
                -MA           chk_alias  
                -MR           chk_reshape 
                -MI           chk_reshape=no 
                -MF           mi         
 
    The acceptable sub-options are:

      report[=n] - dump inlining report.
                  n=chain: 
                        show to-be-inlined call chains.
                  n=user_request: 
                        show the inlining status of user-requests.
                  n=0:  show inlined calls only.
                  n=1:  (default):  show both inlined and 
                        non-inlined calls and reasons for 
                        inlining/non-inlining.
                  n=2:  n=1 plus call id and node id
                  n=3:  show inlining summary only
                  n=4:  n=2 and iropt aborts after the 
                        inlining report is dumped out.
      cgraph     - dump cgraph.
      call_in_pragma[=no|yes]:
                 - call_in_pragma or call_in_pragma=yes: 
                   Inline a call that in the Parallel region 
		      into the original routine 
                 - call_in_pragma=no: (default) 
                   Don't inline a call that in the Parallel region
		      into the original routine 
      inline_into_mfunction[=no|yes]:(only for Fortran) 
		    - inline_into_mfunction or inline_into_mfunction=yes:(default) 
		      Inline a call into the mfunction if it is in the
		      Parallel Region
                 - inline_into_mfunction=no: 
                   Don't inline a call into the mfunction if it 
                   in the Parallel Region
NOTE: for other languages, if you specify inline_into_mfunction=yes 
	 The compiler will silently ignore this qoption. As a result, 
	 Calls in parallel region will still be inlined into pragma constructs
      rs=n       - max number of triples in inlinable routines.
                   iropt defines a routine as inlinable or not
                   based on this number. So no routines over 
                   this limit will be considered for inlining.
      irs=n      - max number of triples in a inlining routine,
                   including size increase by inlining its calls
      cs=n       - max number of triples in a callee. 
                   In general, iropt only inline calls whose 
                   callee is inlinable (defined by rs) AND 
                   whose callee size is not greater than n.
                   But some calls to inlinable routines are 
                   actually inlined because of other factors
                   such as constant actuals, etc. 
      recursion=n  
                 - max level of resursive call that is 
                   considered for inlining.
      cp=n       - minimum profile feedback counter of a call.
                   No call with counter less than this limit 
                   would be inlined.
      inc=n      - percentage of the total number of triples 
                   increased after inlining. No inlining over
                   this percentage. For instance, 'inc=30' 
                   means inlining is allowed to increase the 
                   total program size by 30%.
      create_iconf=<file>:
      use_iconf=<file>:
                   This creates/uses an inlining configuration.
                   The file lists calls and routines that are
                   inlined and routines that inline their calls.
                   Its format is:
                       air      /* actual inlining routines */
                      r11 r12 r13 ...
                      r21 r22 r23 ...
                      .....
                       ari      /* actual routines inlined */
                      r11 r12 r13 ...
                      r21 r22 r23 ...
                      .....
                       aci      /* actual calls inlined */
                      (r11,c11) (r12,c12) (r13,c13) ...
                      (r21,c21) (r22,c22) (r23,c23) ...
                      .....
                   The numbers are callids (cxx) and nodeids(rxx) 
                   printed out when report=2. It is used for
                   debugging. The usual usage is to use
                   create_iconf= to create a config file.
                   then, comment (by preceding numbers line
                   with #) to disallow inlining for those 
                   calls or routines. For instance, 
                       aci
                       (2,3) (2,5) (2,6) (3,9)
                       (3,10) (6,4) (6,7) (7,6)
                       #(7,10) (8,21) (8,22)
                   with the above config file, calls whose
                   nodeids and callids are (7,10),(8,21) and 
                   (8,22) will not be inlined.

                   NOTE:for the aci part of the configure file,
                        in each pair (rij,cij), the parentheses
                        are not necessary, but the comma is necessary 
                        and there should not be any space between
                        rji and comma, comma and cij.
      do_inline=<routine>:
                 - guide inliner to do inlining for a given
                   routine only.
      mi:
                 - Do maximum inlining for given routines if do_inline
                   is used; otherwise, do maximum inlining for main routine.
                   (The inliner will not check inlining parameters.
      inline_level[=1|2|3]: 
                 - specify the level of inline: 
                     inline_level=1    basic inlining 
                     inline_level or inline_level=2    medium inlining (default) 
                     inline_level=3 or inline_level=4,5...   aggressive inlining 
      remove_ip[=no|yes]:
                 - remove_ip or remove_ip=yes:
                      removing inliningPlan after inlining.
                 - remove_ip=no [default]:
                      keep inliningPlan after inlining.
      chk_alias[=no|yes]:
                 - chk_alias or chk_alias=yes [default]:
                      Don't inline a call if inlining it causes
                      aliases among callee's formal arrays.
                 - chk_alias=no:
                      Ignore such checking.
      chk_reshape[=no|yes]:
                 - chk_reshape or chk_reshape=yes [default]:
                      Don't inline a call if its array argument
                      is reshaped between caller and callee.
                 - chk_reshape=no:
                      Ignore such checking.
      chk_mismatch[=no|yes]:
                 - chk_mismatch or chk_mismatch=yes [default]:
                      Don't inline a call if any real argument
                      mismatches with its formal in type.
                 - chk_mismatch=no:
                      Ignore such checking.
      do_chain[=no|yes]:
                 - do_chain or do_chain=yes [default]:
                      Enable inlining for call chains.
                 - do_chain=no:
                      Disable inlining for call chains.
      callonce[=no|yes]:
                 - callonce=no [default]:
                      Disable inlining a routine that is
                      called only once.
                 - callonce or callonce=yes:
                      Enable inlining a routine that is
                      called only once.
      icall_recurse[=no|yes]:
                 - icall_recurse=no [default]:
                      Disable recursive inlining of indirect
                      and virtual call sites
                 - icall_recurse=yes:
                      Enable recursive inlining of indirect
                      and virtual call sites
      formal_dbgsym[=no|yes]: (default = no)
                 - Specify to preserve the debug information for
                   formal parameter of inlined funcion

Some of these options can be used to get diagnostic information at compile time. In particular, the report sub-option of -Ainline is useful for obtaining the inlining report. To pass these flags to iropt from the C compiler, specify -W2,<option>:<sub-option> on the compile line.

Here is an example.


% cc -xO3 -c -W2,-Ainline:report=2 inline.c

INLINING SUMMARY

   inc=400: percentage of program size increase.
   irs=4096: max number of triples allowed per routine after inlining.
   rs=450: max routine size for an inlinable routine.
   cs=400: call size for inlinable call.
   recursion=1: max level for inlining recursive calls.
   Auto inlining: OFF

   Total inlinable calls: 14
   Total inlined calls: 36
   Total inlined routines: 7
   Total inlinable routines: 7
   Total inlining routines: 3
   Program size: 199
   Program size increase: 744
   Total number of call graph nodes: 11

   Notes for selecting inlining parameters

    1. "Not inlined, compiler decision":
       If a call is not inlined by this reason, try to
       increase inc in order to inline it by
          -Qoption iropt -Ainline:inc=  for FORTRAN, C++
          -W2,-Ainline:inc=  for C

    2. "Not inlined, routine too big after inlining":
       If a call is not inlined by this reason, try to
       increase irs in order to inline it by
          -Qoption iropt -Ainline:irs=  for FORTRAN, C++
          -W2,-Ainline:irs=  for C

    3. "Not inlined, callee's size too big":
       If a call is not inlined by this reason, try to
       increase cs in order to inline it by
          -Qoption iropt -Ainline:cs=  for FORTRAN, C++
          -W2,-Ainline:cs=  for C

    4. "Not inlined, recursive call":
       If a call is not inlined by this reason, try to
       increase recursion level in order to inline it by
          -Qoption iropt -Ainline:recrusion=  for FORTRAN, C++
          -W2,-Ainline:recrusion=  for C

    5. "Routine not inlined, too many operations":
       If a routine is not inlinable by this reason, try to
       increase rs in order to make it inlinable by
          -Qoption iropt -Ainline:rs=  for FORTRAN, C++
          -W2,-Ainline:rs=  for C


ROUTINES NOT INLINABLE:

 main [id=7] (inline.c)
   Routine not inlined, user requested

CALL INLINING REPORT:

 Routine: freememory [id=0] (inline.c)
  Nothing inlined.

 Routine: swapdata [id=1] (inline.c)
  Nothing inlined.

 Routine: printdata [id=2] (inline.c)
  Nothing inlined.

 Routine: storedata [id=3] (inline.c)
  Nothing inlined.

 Routine: getintptr [id=4] (inline.c)
  Nothing inlined.

 Routine: AllocLoadAndSwap [id=5] (inline.c)
   getintptr [call_id=8], line 46: Auto inlined
   getintptr [call_id=9], line 47: Auto inlined
   storedata [call_id=10], line 48: Auto inlined
   storedata [call_id=11], line 49: Auto inlined
   printdata [call_id=13], line 51: Auto inlined
   printdata [call_id=14], line 52: Auto inlined
   swapdata [call_id=15], line 53: Auto inlined
   printdata [call_id=17], line 55: Auto inlined
   printdata [call_id=18], line 56: Auto inlined
   freememory [call_id=19], line 57: Auto inlined
   freememory [call_id=20], line 58: Auto inlined

 Routine: InitAllocLoadAndSwap [id=6] (inline.c)
   AllocLoadAndSwap [call_id=22], line 64: Not inlined, compiler decision
     (inc limit reached. See INLININING SUMMARY)
   AllocLoadAndSwap [call_id=24], line 66: Auto inlined
     swapdata [call_id=15], line 53: Auto inlined
     getintptr [call_id=8], line 46: Auto inlined
     getintptr [call_id=9], line 47: Auto inlined
     printdata [call_id=13], line 51: Auto inlined
     printdata [call_id=14], line 52: Auto inlined
     printdata [call_id=17], line 55: Auto inlined
     printdata [call_id=18], line 56: Auto inlined
     freememory [call_id=19], line 57: Auto inlined
     freememory [call_id=20], line 58: Auto inlined
     storedata [call_id=10], line 48: Auto inlined
     storedata [call_id=11], line 49: Auto inlined

 Routine: main [id=7] (inline.c)
   InitAllocLoadAndSwap [call_id=25], line 70: Auto inlined
     AllocLoadAndSwap [call_id=22], line 64: Not inlined, compiler decision
       (inc limit reached. See INLININING SUMMARY)
     AllocLoadAndSwap [call_id=24], line 66: Auto inlined
       swapdata [call_id=15], line 53: Auto inlined
       getintptr [call_id=8], line 46: Auto inlined
       getintptr [call_id=9], line 47: Auto inlined
       printdata [call_id=13], line 51: Auto inlined
       printdata [call_id=14], line 52: Auto inlined
       printdata [call_id=17], line 55: Auto inlined
       printdata [call_id=18], line 56: Auto inlined
       freememory [call_id=19], line 57: Auto inlined
       freememory [call_id=20], line 58: Auto inlined
       storedata [call_id=10], line 48: Auto inlined
       storedata [call_id=11], line 49: Auto inlined

The above report shows the threshold values used in making the decisions, lists all the routines along with whether each call was inlined, gives the reason when a call was not inlined, and offers suggestions on how to make the inlining succeed. This is cool stuff.

Going back to the example: based on the report, the compiler tries to inline all the routines as long as the program size doesn't grow beyond 400% of the original size (i.e., the size without inlining). Unfortunately, inlining AllocLoadAndSwap() went beyond that limit, and as a result the compiler decided not to inline it. Fair enough. If we are not concerned about the size of the binary and we really want this routine inlined, one solution is to increase the value of inc so that AllocLoadAndSwap()'s inclusion fits within the new limit.

eg.,


% cc -xO3 -c -W2,-Ainline:report=2,-Ainline:inc=650 inline.c
INLINING SUMMARY

   inc=650: percentage of program size increase.
   irs=4096: max number of triples allowed per routine after inlining.
   rs=450: max routine size for an inlinable routine.
   cs=400: call size for inlinable call.
   recursion=1: max level for inlining recursive calls.
   Auto inlining: OFF

   Total inlinable calls: 14
   Total inlined calls: 60
   Total inlined routines: 7
   Total inlinable routines: 7
   Total inlining routines: 3
   Program size: 199
   Program size increase: 1260
   Total number of call graph nodes: 11

   Notes for selecting inlining parameters

    ... skip ... (see prev reports for the text that goes here)

ROUTINES NOT INLINABLE:

 main [id=7] (inline.c)
   Routine not inlined, user requested


CALL INLINING REPORT:

 Routine: freememory [id=0] (inline.c)
  Nothing inlined.

 Routine: swapdata [id=1] (inline.c)
  Nothing inlined.

 Routine: printdata [id=2] (inline.c)
  Nothing inlined.

 Routine: storedata [id=3] (inline.c)
  Nothing inlined.

 Routine: getintptr [id=4] (inline.c)
  Nothing inlined.

 Routine: AllocLoadAndSwap [id=5] (inline.c)
   getintptr [call_id=8], line 46: Auto inlined
   getintptr [call_id=9], line 47: Auto inlined
   storedata [call_id=10], line 48: Auto inlined
   storedata [call_id=11], line 49: Auto inlined
   printdata [call_id=13], line 51: Auto inlined
   printdata [call_id=14], line 52: Auto inlined
   swapdata [call_id=15], line 53: Auto inlined
   printdata [call_id=17], line 55: Auto inlined
   printdata [call_id=18], line 56: Auto inlined
   freememory [call_id=19], line 57: Auto inlined
   freememory [call_id=20], line 58: Auto inlined

 Routine: InitAllocLoadAndSwap [id=6] (inline.c)
   AllocLoadAndSwap [call_id=22], line 64: Auto inlined
     swapdata [call_id=15], line 53: Auto inlined
     getintptr [call_id=8], line 46: Auto inlined
     getintptr [call_id=9], line 47: Auto inlined
     printdata [call_id=13], line 51: Auto inlined
     printdata [call_id=14], line 52: Auto inlined
     printdata [call_id=17], line 55: Auto inlined
     printdata [call_id=18], line 56: Auto inlined
     freememory [call_id=19], line 57: Auto inlined
     freememory [call_id=20], line 58: Auto inlined
     storedata [call_id=10], line 48: Auto inlined
     storedata [call_id=11], line 49: Auto inlined
   AllocLoadAndSwap [call_id=24], line 66: Auto inlined
     swapdata [call_id=15], line 53: Auto inlined
     getintptr [call_id=8], line 46: Auto inlined
     getintptr [call_id=9], line 47: Auto inlined
     printdata [call_id=13], line 51: Auto inlined
     printdata [call_id=14], line 52: Auto inlined
     printdata [call_id=17], line 55: Auto inlined
     printdata [call_id=18], line 56: Auto inlined
     freememory [call_id=19], line 57: Auto inlined
     freememory [call_id=20], line 58: Auto inlined
     storedata [call_id=10], line 48: Auto inlined
     storedata [call_id=11], line 49: Auto inlined

 Routine: main [id=7] (inline.c)
   InitAllocLoadAndSwap [call_id=25], line 70: Auto inlined
     AllocLoadAndSwap [call_id=22], line 64: Auto inlined
       swapdata [call_id=15], line 53: Auto inlined
       getintptr [call_id=8], line 46: Auto inlined
       getintptr [call_id=9], line 47: Auto inlined
       printdata [call_id=13], line 51: Auto inlined
       printdata [call_id=14], line 52: Auto inlined
       printdata [call_id=17], line 55: Auto inlined
       printdata [call_id=18], line 56: Auto inlined
       freememory [call_id=19], line 57: Auto inlined
       freememory [call_id=20], line 58: Auto inlined
       storedata [call_id=10], line 48: Auto inlined
       storedata [call_id=11], line 49: Auto inlined
     AllocLoadAndSwap [call_id=24], line 66: Auto inlined
       swapdata [call_id=15], line 53: Auto inlined
       getintptr [call_id=8], line 46: Auto inlined
       getintptr [call_id=9], line 47: Auto inlined
       printdata [call_id=13], line 51: Auto inlined
       printdata [call_id=14], line 52: Auto inlined
       printdata [call_id=17], line 55: Auto inlined
       printdata [call_id=18], line 56: Auto inlined
       freememory [call_id=19], line 57: Auto inlined
       freememory [call_id=20], line 58: Auto inlined
       storedata [call_id=10], line 48: Auto inlined
       storedata [call_id=11], line 49: Auto inlined

From the above output we can conclude that AllocLoadAndSwap() was inlined by the compiler when we allowed the program size to increase by up to 650%.

Notes:

  • Multiple iropt options separated by a comma (,) can be specified after -W2
    eg., -W2,-Ainline:report=2,-Ainline:inc=650

  • For C++ code, use -Qoption to specify iropt options.
    eg., -Qoption iropt -Ainline:report=2
    -Qoption iropt -Ainline:report=2,-Ainline:inc=650

  • Inlining functions whose call overhead is large relative to the routine body improves performance. The improvement results from eliminating the function call, the stack frame manipulation and the function return

  • Even though inlining may improve the runtime performance of an application, do not try to inline too many functions. Inline only those functions that profiling data shows would benefit from inlining (see the sketch after these notes)

  • In general, the compiler's threshold values are good enough for inlining. Use iropt's options only if some very hot routines could not be inlined for some reason. Turn on auto inlining with the -xO4 option

  • Inlined functions increase build time and program size. It is also possible that some very large routines, when inlined, no longer fit into the processor's cache, leading to poor performance mainly due to an increased cache miss rate
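
As a rough sketch of how such profiling data can be gathered with the Studio performance tools (using the example program from this post and collect's default experiment name), one could do something like:

% cc -xO3 inline.c
% collect -o test.1.er ./a.out          (record a performance experiment)
% er_print -functions test.1.er         (list functions by CPU time; consider inlining the hot ones)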

ALSO SEE
Oracle Solaris Studio: Advanced Compiler Options for Performance

(Original blog post is at:
http://technopark02.blogspot.com/2005/11/sun-studio-cc-tuning-iropt-for-inline.html)

Sunday Jan 30, 2011

PeopleSoft Financials 9.0 (Day-in-the-Life) Benchmark on Oracle Sun

It is very rare to see a vendor publish benchmarks on two of its own competing products, let alone products that are 100% compatible with each other. Well, it happened at Oracle|Sun: M-series and T-series hardware were the subjects of two similar, comparable benchmarks, and PeopleSoft Financials 9.0 DIL (Day-in-the-Life) was the benchmarked workload.

Benchmark report URLs

PeopleSoft Financials 9.0 on Oracle's SPARC T3-1 Server
PeopleSoft Financials 9.0 on Oracle's Sun SPARC Enterprise M4000 Server

Brief description of workload

The benchmark workload simulated Financial Control and Reporting business processes that a customer typically performs when closing their books at period end. "Closing the books" generally involves Journal generation, editing & posting; General Ledger allocations, summary & consolidations and reporting in GL. The applications that were involved in this process are: General Ledger, Asset Management, Cash Management, Expenses, Payables and Receivables.

The benchmark execution simulated the processing required for closing the books (background batch processes) along with some online concurrent transaction activity by 1000 end users.

Summary of Benchmark Test Results

The following table summarizes the test results of the "close the books" business processes. For the online transaction response times, check the benchmark reports (too many transactions to summarize here). Feel free to draw your own conclusions.

As of this writing, no other vendor has published any benchmark results with the PeopleSoft Financials 9.0 workload.


Configuration 1
  DB + Process Scheduler : 1 x Sun SPARC Enterprise M5000 Server
                           (8 x 2.53 GHz quad-core SPARC64 VII processors, 128 GB RAM)
  App + Web              : 1 x SPARC T3-1 Server
                           (1 x 1.65 GHz 16-core SPARC T3 processor, 128 GB RAM)

                            Batch only      Batch + 1,000 users
  Elapsed Time              24.15m          25.03m
  Reporting Time            11.67m          11.98m
  Journal Lines per Hour    6,355,093       6,141,258
  Ledger Lines per Hour     6,209,682       5,991,382

Configuration 2
  DB + Process Scheduler : 1 x Sun SPARC Enterprise M5000 Server
                           (8 x 2.66 GHz quad-core SPARC64 VII+ processors, 128 GB RAM)
  App + Web              : 1 x Sun SPARC Enterprise M4000 Server
                           (4 x 2.66 GHz quad-core SPARC64 VII+ processors, 128 GB RAM)

                            Batch only      Batch + 1,000 users
  Elapsed Time              21.74m          23.30m
  Reporting Time            11.35m          11.42m
  Journal Lines per Hour    7,059,591       6,597,239
  Ledger Lines per Hour     6,898,060       6,436,236

Software Versions

Oracle’s PeopleSoft Enterprise Financials/SCM 9.00.00.331
Oracle’s PeopleSoft Enterprise (PeopleTools) 8.49.23 64-bit
Oracle’s PeopleSoft Enterprise (PeopleTools) 8.49.23 32-bit on Windows Server 2003 SP2 for generating reports using nVision
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 64-bit + RDBMS patch 9699654
Oracle Tuxedo 9.1 RP36 Jolt 9.1 64-bit
Oracle WebLogic Server 9.2 MP3 64-bit (Java version "1.5.0_12")
MicroFocus Server Express 4.0 SP4 64-bit
Oracle Solaris 10 10/09 and 09/10

Acknowledgments

This is one of the most complex and stressful benchmarks I have ever been involved in, and it was a collaborative effort across different teams within Oracle Corporation. A sincere thanks to the PeopleSoft benchmark team for providing adequate support throughout the execution of the benchmark and for the swift validation of the benchmark results numerous times (yes, "numerous" - it is not a typo).

Saturday Dec 04, 2010

Oracle's Optimized Solution for Siebel CRM 8.1.1

A brief explanation of what an optimized solution is and what it is not can be found in the previous blog entry Oracle's Optimized Solution for PeopleSoft HCM 9.0. We went through a similar exercise to publish another optimized solution around Siebel CRM 8.1.1.

The Siebel solution implements Oracle Siebel CRM using a unique combination of SPARC servers, Sun storage, Solaris OS virtualization, Oracle application middleware and Oracle database products.

URLs to the Siebel CRM white papers:

While you are at it, do not forget to check out the 13,000-user Siebel CRM benchmark on the latest SPARC T3 platform.

Thursday Dec 02, 2010

Oracle's Optimized Solution for PeopleSoft HCM 9.0

According to Oracle Corporation: Oracle's optimized solutions are applications-to-disk solutions composed of Oracle's Sun servers, storage and networking components, Oracle Solaris, Oracle Enterprise Linux, Oracle Database, Oracle Fusion Middleware and Oracle Applications.

To be clear, an optimized/optimal solution is neither a software package nor a hardware system bundled with pre-tuned software. It is simply a set of recommendations based on some testing performed in labs. The recommendations typically provide sizing guidelines for small, medium and large configurations, best practices, tuning tips and some performance data. Customers can refer to these guidelines when deploying enterprise applications on Oracle hardware to achieve optimal configuration for better TCO and ROI.

The PeopleSoft solution implements two modules in Oracle PeopleSoft Enterprise Human Capital Management (HCM) 9.0 to demonstrate how Oracleʼs servers, disk storage and advanced flash-based storage technology can be used to accelerate database transactions and achieve unprecedented application performance. Workload consolidation is achieved through server consolidation while maintaining the appropriate balance of performance, availability, cost and expected future capacity requirements.

The optimized solution technical white paper can be accessed from the following URL:

    Oracleʼs Optimized Solution for PeopleSoft Human Capital Management Consolidation using M-series servers, Flash and Enterprise Storage

A corresponding solution brief, targeting less patient readers, is available at:

    Oracle's Optimized Solution for PeopleSoft HCM - A Business White paper

Sunday Nov 07, 2010

Instructions to Turn ON/OFF Hardware Prefetch on SPARC64 Systems

Hardware prefetch is ON by default on M-series servers such as the M8000/M9000, M4000/M5000 and M3000.

The following excerpt is from a SPARC64 document:

If there have been load accesses to consecutive addresses, the hardware speculatively issues the prefetch operation based on the prediction that there is a high possibility the following contiguous addresses will be accessed in the future.

Although this feature is designed to improve the performance of various workloads, due to its speculative nature not all workloads may benefit from the default behavior. For example, in our experiments we noticed a 10+% improvement in CPU utilization while running some of the PeopleSoft workloads on M-series hardware with hardware prefetch turned off. Hence, irrespective of the application/workload, the recommended approach is to conduct a few experiments by running representative customer workloads on the target M-series hardware with and without hardware prefetch turned on.

Instructions to Turn On/Off Hardware Prefetch:

  1. Connect to the system Service Processor (XSCF)

    % ssh -l <userid> <host>
    
  2. Check the current prefetch mode by running the following command at the XSCF> prompt

    XSCF> showprefetchmode
    
  3. Find the domain IDs of all mounted system boards (or skip to the next step)

    XSCF> showboards -a
    
  4. Power off all configured domains

    XSCF> poweroff -d <domainid> [OR]
    XSCF> poweroff -a
    

    From my experience, on larger systems with multiple domains configured, all domains must be powered off before the SP allows the prefetch mode to be changed. If someone has a correction to this information or better instructions that minimize disruption, please let me know. I'd be happy to update these instructions.

  5. Wait until the domain(s) are completely powered off. Check the status by running the showlogs command

    XSCF> showlogs power
    
  6. Change the prefetch mode to the desired value

    XSCF> setprefetchmode -s [on|off]
    
  7. Verify the prefetch mode

    XSCF> showprefetchmode
    
  8. Finally, power on all configured domains

    XSCF> poweron -d <domainid> [OR]
    XSCF> poweron -a
    
  9. Disconnect from the SP and wait for the OS to boot up

Note to Sun-Oracle customers:

If the default value of hardware prefetch is changed, please make sure to mention this in any service requests, bug reports, etc., that you may file with Oracle Corporation. Unfortunately none of the standard commands on Solaris report the status of hardware prefetch - so, providing this additional piece of information beforehand will help the person who is analyzing/diagnosing the case.

Sunday Oct 24, 2010

SPARC T3 reiterates Siebel CRM's Supremacy on T-series Hardware

It's been mentioned and proven several times that Sun/Oracle's T-series hardware is the best fit to deploy and run Siebel CRM. Feel free to browse through the list of Siebel benchmarks that Sun published in the past on T-series:

        2004-2010 : A Look Back at Sun Published Oracle Benchmarks

Oracle Corporation announced the availability of SPARC T3 servers at Oracle OpenWorld 2010, and sure enough there is a Siebel CRM benchmark on the SPARC T3-1 server to support the server launch event. Check the following web page for high-level details of the benchmark.

        SPARC T3-1 Server Posts a High Score on New Siebel CRM 8.1.1 Benchmark

I intend to provide the missing pieces of information in this blog post.

First of all, this is not a "Platform Sizing and Performance Program" (PSPP) benchmark. Siebel 8.1.1 was used to run the benchmark, and there is no Siebel PSPP benchmark kit available as of today for v8.1.1. Hence, the test results from this benchmark exercise are not directly comparable to the Siebel 8.0 PSPP benchmark results.

Workload

The benchmark workload consists of a mix of Siebel Financial Services Call Center and Siebel Web Services / EAI transactions. The FINS Call Center transactions create a set of Opportunities, Quotes and Orders, whereas the Web Services / EAI transactions submit new Service Requests (SRs) and search for and update existing SRs. The transaction mix is 40% FINS Call Center transactions and 60% Web Services / EAI transactions.

Software Versions

  • Siebel CRM 8.1.1
  • Oracle RDBMS 11g R2 (11.2.0.1), 64-bit
  • iPlanet Web Server 7.0 Update 8, 32-bit
  • Solaris 10 09/10 in the application-tier and
  • Solaris 10 10/09 in the web- and database-tiers

Hardware Configuration

  • Application Server : 1 x SPARC T3-1 Server (2 RU system)
    • One socket 16-Core 1.65 GHz SPARC T3 processor, 128 hardware threads, 6 MB L2 Cache, 64 GB RAM
  • Web Server + Database Server : 1 x Sun SPARC Enterprise T5240 Server (2 RU system)
    • Two socket 16-Core 1.165 GHz UltraSPARC T2 Plus processors, 128 hardware threads, 4 MB L2 Cache, 64 GB RAM

Virtualization Technology

The iPlanet Web Server and the Oracle 11g database server were configured on a single Sun SPARC Enterprise T5240 server. These software layers were isolated from each other with the help of Oracle Solaris Containers (zones) virtualization technology. The resource allocations are shown below, followed by a configuration sketch.

Tier     | #vCPUs | Memory (GB)
---------|--------|------------
Database |   96   |     48
Web      |   32   |     16
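
The following is a minimal sketch of how such an allocation can be expressed with Solaris 10 zone resource controls (dedicated-cpu and capped-memory). The zone name and zonepath are invented, and the values simply mirror the database tier row above rather than the actual benchmark setup:

# zonecfg -z dbzone
zonecfg:dbzone> create
zonecfg:dbzone> set zonepath=/zones/dbzone
zonecfg:dbzone> add dedicated-cpu
zonecfg:dbzone:dedicated-cpu> set ncpus=96
zonecfg:dbzone:dedicated-cpu> end
zonecfg:dbzone> add capped-memory
zonecfg:dbzone:capped-memory> set physical=48g
zonecfg:dbzone:capped-memory> end
zonecfg:dbzone> commit
zonecfg:dbzone> exit
# zoneadm -z dbzone install
# zoneadm -z dbzone boot

A similar zone with ncpus=32 and physical=16g would cover the web tier row.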

Test Results

#vUsers : 13,000

                                     FINS       EAI
Avg Trx Response Time (sec)          0.43       0.2
Business Trx Throughput per Hour     48,409     116,449

                                     App        DB         Web
Avg CPU Utilization (%)              58         42         37
Avg Memory Footprint (GB)            52         35 (DB + Web combined)

Why stop at 13K users?

Notice that the average CPU utilization on the application server node (SPARC T3-1) is only ~58%. The application server node has room to accommodate more online vusers; however, there is not enough free memory left on the server to scale beyond 13,000 concurrent users. That is the main reason for stopping at 13,000 users in this benchmark.

Siebel Best Practices

Check the following presentation:

        Siebel on Oracle Solaris : Best Practices, Tuning Tips

Acknowledgments

Credit to all our peers at Oracle Corporation who helped us with the hardware, workload, verification, validation, etc., in a timely manner. Jenny also deserves special credit for patiently spending an enormous amount of time running the benchmark.

Friday Oct 08, 2010

Is it really Solaris versus Windows & Linux?

(Even though the title explicitly states "Solaris versus ..", this blog entry is equally applicable, with a few changes, to all the operating systems in the world.)

Lately I have seen quite a few e-mails and heard a few customer representatives talking about the performance of their application(s) on Solaris, Windows and Linux. Typically the claims go like the following, with a bunch of supporting data (all numbers) and no hardware configuration specified whatsoever.

  • "Transaction X is nearly twice as slow on Solaris compared to the same transaction running on Windows or Linux"
  • "Transaction X runs much faster on my Windows laptop than on a Solaris box"

Lack of awareness, and taking the hardware completely out of the discussion and its context, are the biggest problems with complaints like these. Such claims make sense only when the underlying hardware is the same in all test cases. For example, comparing a single-user, single-threaded transaction running on Windows, Linux and Solaris on x86 hardware is appropriate (as long as the type and speed of the processor are identical), but not against Solaris running on SPARC hardware, mainly because the processor architectures of the x86 and SPARC platforms are completely different.

Besides, these days Oracle offers two types of SPARC hardware - T-series and M-series - which serve different purposes even though they are compatible with each other. It is likewise hard to compare and analyze the performance differences between the different SPARC offerings (T- and M-series) without a proper understanding of the characteristics of the CPUs in use. Choosing the right hardware for the right job is the key.

It is improper to compare business transactions running on x86 systems with those running on SPARC systems, or even between different types of SPARC systems, and to incorrectly attribute the hardware's strengths or weaknesses to the operating system running on top of the bare metal. If there is that much discrepancy among the different operating environments, it is recommended to spend some time understanding the nuances of the test hardware before spending enormous amounts of time trying to tune the application and the operating system.

The bottom line: in addition to the software (application + OS), hardware plays an important role in the performance and scalability of an application. So, unless the test hardware is the same for all test cases across the different operating systems, do not focus on the operating system alone and make hasty decisions to switch to another operating platform. Carefully choose the appropriate hardware for the task at hand.
