Saturday Nov 01, 2008

Ramifications of the Solaris 10 kernel patch 137111


A recent code change in Solaris 10 inadvertantly exposed an inherent bug in some of the 32-bit applications that rely on their own memory allocators. Due to this, some of the 3rd party applications which were working earlier without the KU 137111 may crash on Solaris 10/SPARC with the KU 137111 (any revision).

Symptoms & the Cause

It was identified that majority of such application failures are mainly due to the applications' custom memory allocator that incorrectly returns 4-byte aligned mutexes in place of the required 8-byte aligned mutexes. In Solaris, mutex_t and pthread_mutex_t structures have been defined to be aligned on an 8-byte boundary. Both of those structures contain the upad64_t member, which is a double even for the 32-bit applications. The natural alignment of a double is 8 bytes; and per the SPARC Compliance Definition 2.4, the structures must be aligned according to their strictest member. That is, applications which create 4-byte aligned mutexes are technically non-compliant on Solaris/SPARC (for the sake of simplicity, such code will be referred to as the non-complying code for the remainder of this blog entry).

Due to a change in the implementation of the userland mutexes introduced by CR 6296770 in KU 137111-01, objects of type mutex_t and pthread_mutex_t must start at 8-byte aligned addresses. If this requirement is not satisfied, all non-compliant applications on Solaris/SPARC may fail with the signal SEGV with a callstack similar to the following one or with similar callstacks containing the function mutex_trylock_process.

  \*_atomic_cas_64(0x141f2c, 0x0, 0xff000000, 0x1651, 0xff000000, 0x466d90)
  set_lock_byte64(0x0, 0x1651, 0xff000000, 0x0, 0xfec82a00, 0x0)
  fast_process_lock(0x141f24, 0x0, 0x1, 0x1, 0x0, 0xfeae5780)

Patches & the Next Steps

Note that only non-compliant 32-bit applications will be affected by the KU 137111. All other complying 32-bit applications continue to run as expected even with the KU 137111 - hence the customers, partners, ISVs and the other software vendors must understand the fact that it is not a Solaris issue. Customers running into this issue must work with the respective software vendors to obtain a patch/fix. We suggest the ISVs and the rest of the software vendors to pro-actively check their 32-bit native code for any discrepancies like the one mentioned in this blog entry.

In our testing of some of the enterprise applications, we have identified Oracle's Siebel CRM as one of the potential applications that is vulnerable to the KU 137111. It appears that IBM's Lotus Domino Server is also prone to a crash on Solaris 10 with the same kernet patch. Speaking of these two known cases, Oracle/Siebel and IBM/Lotus Domino customers (running Solaris) should approach Oracle and IBM Corporations respectively but not Sun Microsystems for a proper fix.

As it may take some time for the ISVs / software vendors to identify and fix the non-complying code in their applications, Sun is planning to provide an interim fix to the mutex byte alignment issue in the form of a Solaris kernel patch. As of this writing, we expect the fix to be integrated into the KU 137137-07. The fix is already available in the latest update of the Solaris, Solaris 10 10/08. Those who cannot upgrade to Solaris 10 10/08 from the prior versions of Solaris 10 must wait for the patch KU [Updated 12/07/08] 137137-07 137137-09.

One must note that the fix in Solaris is a tentative one that allows the non-complying code to run on SPARC hardware for the time being. There is no guarantee that the non-complying code continues to run 'as is' in the future with new Solaris kernel patches and/or major updates/releases of the Solaris operating system. So the best long term solution is for the software vendors to fix the non-compliant code before it is too late.


Steve S and Roger F of Sun Microsystems.

Monday Oct 13, 2008

Siebel on Sun CMT hardware : Best Practices

The following suggested best practices are applicable to all Siebel deployments on CMT hardware (Tx00, T5x20, T5x40) running Solaris 10 [Note: some of this tuning applies to Siebel running on conventional hardware running Solaris]. These recommendations are based on our observations from the 14,000 user benchmark on Sun SPARC Enterprise T5440. Your mileage may vary.

All Tiers
  • Ensure that the system's firmware is up-to-date.

  • Upgrade to the latest update release of Solaris 10.

      Note to the customers running Siebel on Solaris 5/08: apply the kernel patch 137137-07 as soon as it is available on web site. Patch 137137-07 and later revisions, Solaris 10 10/08 will have the workaround to a critical Siebel specific bug. Oracle Corporation will eventually fix the bug in their codebase - in the meantime Solaris is covering for Siebel and all other 32-bit applications with their own memory allocators that return unaligned mutexes. Check the RFE 6729759 Need to accommodate non-8-byte-aligned mutexes and Oracle's Siebel support document 735451.1 Do NOT apply Kernel Patch 137111-04 on Solaris 10 for more details.

  • Enable 256M large pages on all nodes. By default, the latest update of Solaris 10 will use a maximum of 4M pages even when 256M pages are a good fit.

      256M pages can be enabled with the following /etc/system tunable.
      \* 256M pages
      set max_uheap_lpsize=0x10000000

  • Pro-actively avoid running into stdio's 256 file descriptors limitation.

      Set the following in a shell or add the following lines to the shell's profile (bash/ksh).
      ulimit -n 2048
      export LD_PRELOAD_32=/usr/lib/$LD_PRELOAD_32

      Technically the file descriptor limit can be set to as high as 65536. However from the application's perspective, 2048 is a reasonable limit.

  • Improve scalability with MT-hot memory allocation library, libumem or libmtmalloc.

    To improve the scalability of the multi-threaded workloads, preload MT-hot object-caching memory allocation library like libumem(3LIB), mtmalloc(3MALLOC).

      eg., To preload the libumem library, set the LD_PRELOAD_32 environment variable in the shell (bash/ksh) as shown below.

      export LD_PRELOAD_32=/usr/lib/$LD_PRELOAD_32

      Web and the Application servers in the Siebel Enterprise stack are 32-bit. However Oracle 10g or 11g RDBMS on Solaris 10 SPARC is 64-bit. Hence the path to the libumem library in the PRELOAD statement differs slightly in the database-tier as shown below.

      export LD_PRELOAD_64=/usr/lib/sparcv9/$LD_PRELOAD_64

    Be aware that the trade-off is the increase in memory footprint -- you may notice 5 to 20% increase in the memory footprint with one of these MT-hot memory allocation libraries preloaded. Also not every Siebel application module benefits from MT-hot memory allocators. The recommendation is to experiment before implementing in production environments.

  • TCP/IP tunables

    Application fared well with the following set of TCP/IP parameters on Solaris 10 5/08.

    ndd -set /dev/tcp tcp_time_wait_interval 60000
    ndd -set /dev/tcp tcp_conn_req_max_q 1024
    ndd -set /dev/tcp tcp_conn_req_max_q0 4096
    ndd -set /dev/tcp tcp_ip_abort_interval 60000
    ndd -set /dev/tcp tcp_keepalive_interval 900000
    ndd -set /dev/tcp tcp_rexmit_interval_initial 3000
    ndd -set /dev/tcp tcp_rexmit_interval_max 10000
    ndd -set /dev/tcp tcp_rexmit_interval_min 3000
    ndd -set /dev/tcp tcp_smallest_anon_port 1024
    ndd -set /dev/tcp tcp_slow_start_initial 2
    ndd -set /dev/tcp tcp_xmit_hiwat 799744
    ndd -set /dev/tcp tcp_recv_hiwat 799744
    ndd -set /dev/tcp tcp_max_buf  8388608
    ndd -set /dev/tcp tcp_cwnd_max  4194304
    ndd -set /dev/tcp tcp_fin_wait_2_flush_interval 67500
    ndd -set /dev/udp udp_xmit_hiwat 799744
    ndd -set /dev/udp udp_recv_hiwat 799744
    ndd -set /dev/udp udp_max_buf 8388608

Siebel Application Tier
  • All T-series systems (T1000/T2000, T5120/T5220, T5120/T5240, T5440) support the 256M page size. However Siebel's siebmtshw script restricts the page size to 4M. Comment out the following lines in $SIEBEL_HOME/siebsrvr/bin/siebmtshw.
      # This will set 4M page size for Heap and 64 KB for stack

  • Experiment with less number of Siebel Object Managers.

      Configure the Object Managers in such a way that each OM will be handling at least 200 active users. Siebel's standard recommendation of 100 or less users per Object Manager is suitable for conventional systems but not ideal for CMT systems like Tx000, T5x20, T5x40, T5440. Sun's CMT systems are ideal for running multi-threaded processes with tons of LWPs per process. Besides, there will be significant improvement in the overall memory footprint with less number of Siebel Object Managers.

  • Try Oracle 11g R1 client in the application-tier. Oracle 10g R2 clients may crash under high load. For the symptoms of the crash, check Solaris/SPARC: Oracle 11gR1 client for Siebel 8.0.

      Oracle 10g R2 32-bit client is supposed to have a fix for the process crash issue - however it wasn't verified in our test environment.

Siebel Database Tier
  • Eliminate double buffering by forcing the file system to use direct I/O.

    Oracle database caches the data in its own cache within the shared global area (SGA) known as the database block buffer cache. Database reads and writes are cached in block buffer cache so the subsequent accesses for the same blocks do not need to re-read the data from the operating system. On the other hand, file systems on Solaris default to reading the data though the global file system cache for improved I/O operations. That is, by default each read is cached potentially twice - one copy in the operating system's file system cache, and the other copy in Oracle's block buffer cache. In addition to double caching, there is also some extra CPU overhead for the code which manages the operating system's file system cache. The solution is to eliminate double caching by forcing the file system to bypass the OS file system cache when reading and writing to the disk.

      In the 14,000 user benchmark setup, the UFS file systems (holding the data files and the redo logs) were mounted with the forcedirectio option.

      mount -o forcedirectio /dev/dsk/<partition> <mountpoint>

  • Store data files separate from the redo log files -- If the data files and the redo log files are stored on the same disk drive and if that disk drive fails, the files cannot be used in the database recovery procedures.

      In the 14,0000 user benchmark setup, there are two Sun StorateTek 2540 arrays connected to the T5440 - one array was holding the data files, where as the other was holding the Oracle redo log files.

  • Size online redo logs to control the frequency of log switches.

      In the 14,0000 user benchmark setup, two online redo logs were configured each with 10 GB disk space. When all 14,000 concurrent users are on-line, there is only one log switch in a 60 minute period.

  • If the storage array supports the read-ahead feature, enable it. When 'read-ahead enabled' is set to true, the write will be committed to the cache as opposed to the disk, and the OS signals the application that the write has been committed.

    Oracle Database Initialization Parameters

  • Set Oracle's initialization parameter DB_FILE_MULTIBLOCK_READ_COUNT to appropriate value. DB_FILE_MULTIBLOCK_READ_COUNT parameter specifies the maximum number of blocks read in one I/O operation during a sequential scan.

      In the 14,0000 user benchmark configuration, DB_BLOCK_SIZE was set to 8 kB. During the benchmark run, the average reads are around 18.5 kB per second. Hence setting DB_FILE_MULTIBLOCK_READ_COUNT to a high value does not necessarily improve the I/O performance. A value of 8 for the database init parameter DB_FILE_MULTIBLOCK_READ_COUNT seems to perform better.

  • On T5240 and T5440 servers, set the database initialization parameter CPU_COUNT to 64. Otherwise, by default Oracle RDBMS assumes 128 and 256 for the CPU_COUNT on T5240 and T5440 respectively. Oracle's optimizer might use a completely different execution plan when it notices such a large number for the CPU_COUNT; and the resulting execution plan need not necessarily be an optimal one. In the 14,000 user benchmark, setting CPU_COUNT to 64 produced optimal execution plans.

  • On T5240 and T5440 servers, explicitly set the database initialization parameter _enable_NUMA_optimization to FALSE. On these multi-socket servers, _enable_NUMA_optimization will be set to TRUE by default. During the 14,000 user benchmark run, we noticed intermittent shadow process crashes with the default behavior. We didn't realize any additional gains either with the default NUMA optimizations.

Siebel Web Tier
  • Upgrade to the latest service pack of Sun Java Web Server 6.1 (32-bit).

  • Run the Sun Java Web Server in multi-process mode by setting the MaxProcs directive in magnus.conf to a value that is greater than 1. In the multi-process mode, the web server can handle requests using multiple processes with multiple threads in each process.

      When you specify a value greater than 1 for the MaxProcs, the web server relies on the operating system to distribute connections among/between multiple web server processes. However many modern operating systems including Solaris do not distribute connections evenly, particularly when there are a small number of concurrent connections.

  • Tune the maximum simultaneous requests by setting the RqThrottle parameter in the magnus.conf file to appropriate value. A value of 1024 was used in the 14,000 user benchmark.

Tuesday Nov 27, 2007

Solaris/SPARC: Oracle 11gR1 client for Siebel 8.0

First things first - Oracle 11g Release 1 for Solaris/SPARC is available now; and can be downloaded from here.

In some Siebel 8.0 environments where Oracle database is being used, customers might be noticing intermittent Siebel object manager crashes under high loads when the work is actively being done by tons of LWPs with less number of object managers. Usually the call stack might look something like:

/export/home/oracle/lib32/ [ Signal 11 (SEGV)]


Setting the Siebel environment variable SIEBEL_STDERROUT to 1 shows the following heap dump in StdErrOut directory under Siebel enterprise logs directory.

% more stderrout_7762_23511113.txt
\*\*\*\*\*\*\*\*\*\* Internal heap ERROR 17112 addr=35dddae8 \*\*\*\*\*\*\*\*\*

\*\*\*\*\* Dump of memory around addr 35dddae8:
35DDCAE0 00000000 00000000 [........]
35DDCAF0 00000000 00000000 00000000 00000000 [................]
Repeat 243 times
35DDDA30 00000000 00000000 00003181 00300020 [..........1..0. ]
35DDDA40 0949D95C 35D7A888 10003179 00000000 [.I.\\5.....1y....]
35DDDA50 0949D95C 0949D8B8 35D7A89C C0000075 [.I.\\.I..5......u]
HEAP DUMP heap name="Alloc environm" desc=949d8b8
extent sz=0x1024 alt=32767 het=32767 rec=0 flg=3 opc=2
parent=949d95c owner=0 nex=0 xsz=0x1038
EXTENT 0 addr=364fb324
Chunk 364fb32c sz= 4144 free " "
EXTENT 1 addr=364f8ebc
Chunk 364f8ec4 sz= 4144 free " "
EXTENT 2 addr=364f7d5c
Chunk 364f7d64 sz= 4144 free " "
EXTENT 3 addr=364f6d04
Chunk 364f6d0c sz= 4144 recreate "Alloc statemen " latch=0
ds 2c38df34 sz= 4144 ct= 1
EXTENT 406 addr=35ddda54
Chunk 35ddda5c sz= 116 free " "
Chunk 35dddad0 sz= 24 BAD MAGIC NUMBER IN NEXT CHUNK (6)
freeable assoc with mark prv=0 nxt=0

Dump of memory from 0x35DDDAD0 to 0x35DDEAE8
35DDDAD0 20000019 35DDDA5C 00000000 00000000 [ ...5..\\........]
35DDDAE0 00000095 0000008B 00000006 35DDDAD0 [............5...]
35DDDAF0 00000000 00000000 00000095 35DDDB10 [............5...]
EXTENT 2080 addr=d067a6c
Chunk d067a74 sz= 2220 freeable "Alloc statemen " ds=2b0fffe4
Chunk d068320 sz= 1384 freeable assoc with mark prv=0 nxt=0
Chunk d068888 sz= 4144 freeable "Alloc statemen " ds=2b174550
Chunk d0698b8 sz= 4144 recreate "Alloc statemen " latch=0
ds 1142ea34 sz= 112220 ct= 147
223784cc sz= 4144
240ea014 sz= 884
28eac1bc sz= 900
2956df7c sz= 900
1ae38c34 sz= 612
92adaa4 sz= 884
2f6b96ac sz= 640
c797bc4 sz= 668
2965dde4 sz= 912
1cf6ad4c sz= 656
10afa5e4 sz= 656
2f6732bc sz= 700
27cb3964 sz= 716
1b91c1fc sz= 584
a7c28ac sz= 884
169ac284 sz= 900
Chunk 2ec307c8 sz= 12432 free " "
Chunk 3140a3f4 sz= 4144 free " "
Chunk 31406294 sz= 4144 free " "
Bucket 6 size=16400
Bucket 7 size=32784
Total free space = 340784
Chunk 949f3c8 sz= 100 perm "perm " alo=100
Permanent space = 100
Hla: 255

ORA-21500: internal error code, arguments: [17112], [0x35DDDAE8], [], [], [], [], [], []
Errors in file :
ORA-21500: internal error code, arguments: [17112], [0x35DDDAE8], [], [], [], [], [], []

----- Call Stack Trace -----
NOTE: +offset is used to represent that the
function being called is offset bytes from
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
F2365738 CALL +23052 D7974250 ? D797345C ? DD20 ?
D79741AC ? D79735F8 ?
F2ECD7A0 ?
F286DDB8 PTR_CALL 00000000 949A018 ? 14688 ? B680B0 ?
F2365794 ? F2ECD7A0 ? 14400 ?
F286E18C CALL +77460 949A018 ? 0 ? F2F0E8D4 ?
1004 ? 1000 ? 1000 ?
F286DFF8 CALL +66708 949A018 ? 0 ? 42D8 ? 1 ?
D79743E0 ? 949E594 ?
__1cN_smiWorkQdDueu CALL __1cN_smiWorkQdDueu 1C8F608 ? 18F55F38 ?
30A5A008 ? 1A98E208 ?
FDBDF178 ? FDBE0424 ?
__1cQSmiThrdEntryFu PTR_CALL 00000000 1C8F608 ? FDBE0424 ?
1AB6EDE0 ? FDBDF178 ? 0 ?
1500E0 ?
__1cROSDWslThreadSt PTR_CALL 00000000 1ABE8250 ? 600140 ? 600141 ?
105F76E8 ? 0 ? 1AC74864 ?
__1cP_AfxThreadEntr PTR_CALL 00000000 0 ? FF30A420 ? 203560 ?
1AC05AF0 ? E2 ? 1AC05A30 ?
__1cIMwThread6Fpv_v PTR_CALL 00000000 D7A7DF6C ? 17F831E0 ? 0 ? 1 ?
0 ? 17289C ?
_lwp_start()+0 ? 00000000 1 ? 0 ? 1 ? 0 ? FCE6C710 ?
1AC72680 ?

----- Argument/Register Address Dump -----

Argument/Register addr=d7974250. Dump of memory from 0xD7974210 to 0xD7974350
D7974200 0949A018 00014688 00B680B0 F2365794
D7974220 F2ECD7A0 00014400 D7974260 F286DDB8 800053FC 80000000 00000002 D79743E0
D7974240 4556454E 545F3231 35303000 32000000 F23654D4 F2365604 00000000 0949A018
D7974260 00000000 0949A114 0000000A F2ECD7A0 FC873504 00001004 F2365794 F2F0E8D4
D7974280 0949A018 00000000 F2F0E8D4 00001004 00001000 00001000 D79742D0 F286E18C
D79742A0 F6CB7400 FC872CC0 00000004 00CDE34C 4556454E 545F3437 31313200 00000000
D79742C0 00000000 00000001 D79743E0 00000000 F2F0E8D4 00001004 00001000 00007530
D79742E0 00007400 00000001 FC8731D4 00003920 0949A018 00000000 000042D8 00000001
D7974300 D79743E0 0949E594 D7974330 F286DFF8 D7974338 F244D5A8 00000000 01000000
D7974320 364FBFFF 01E2C000 00001000 00000000 00001000 00000000 F2ECD7A0 F23654D4
D7974340 000000FF 00000001 F2F0E8D4 F25F9824
Argument/Register addr=d797345c. Dump of memory from 0xD797341C to 0xD797355C
D7973400 F2365738
Argument/Register addr=1eb21388. Dump of memory from 0x1EB21348 to 0x1EB21488
1EB21340 00000000 00000000 FE6CE088 FE6CE0A4 0000000A 00000010
1EB21360 00000000 00000000 00000000 00000010 00000000 00000000 00000000 FEB99308
1EB21380 00000000 00000000 00000000 1C699634 FFFFFFFF 00000000 FFFFFFFF 00000001
1EB213A0 00000000 00000000 00000000 00000000 00000081 00000000 F16F1038 67676767
1EB213C0 00000000 FEB99308 00000000 00000000 FEB99308 FEB46EF4 00000000 00000000
1EB213E0 00000000 00000000 FEB99308 00000000 00000000 00000001 0257E3B4 03418414
1EB21400 00000000 00000000 FEB99308 FEB99308 00000000 00000000 00000000 00000000
1EB21420 00000000 00000000 00000000 00000000 1EB213B0 00000000 00000041 00000000
1EB21440 0031002D 00410036 00510038 004C0000 00000000 00000000 00000000 00000000
1EB21460 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
1EB21480 00000109 00000000

----- End of Call Stack Trace -----

Although I'm not sure what exactly is the underlying issue for the core dump, my suspicion is that there is some memory corruption in Oracle client's code; and the Siebel Object Manager crash is due to the Oracle bug 5682121 - Multithreaded OCI clients do not mutex properly for LOB operations. The fix to this particular bug would be available in Oracle release; and is already available as part of Oracle In case if you notice the symptoms of failure as described in this blog post, upgrade the Oracle client in the application-tier to Oracle 11gR1 and see if it brings stability to the Siebel environment.

Original blog post is at:

Friday Aug 10, 2007

Oracle 10gR2/Solaris x64: Must Have Patches for E-Business Suite 11.5.10

If you have an Oracle E-Business Suite 11.5.10 database running on Oracle 10gR2 ( / Solaris x86-64 platform, make sure you have the following two Oracle patches to avoid concurrency issues and intermittent Oracle shadow process crashes.

Oracle patches

1) 4770693 BUG: Intel Solaris: Unnecessary latch sleeps by default
2) 5666714 BUG: ORA-7445 ON DELETE


1) If the top 5 database timed events look something similar to the following in AWR, it is very likely that the database is running into the bug 4770693 Intel Solaris: Unnecessary latch sleeps by default.

Top 5 Timed Events

EventWaitsTime(s)Avg Wait(ms)% Total Call TimeWait Class
latch: cache buffers chains 94,301 169,403 1,796 86.1Concurrency
CPU time-- 5,478-- 2.8--
wait list latch free 247,466 4,756 19 2.4Other
buffer busy waits 14,928 1,382 93 .7Concurrency
db file sequential read 98,750 552 6 .3User I/O

Apply Oracle server patch 4770693 to get rid of the concurrency issue(s). Note that the fix will be part of the release

2) If the application becomes unstable and if you notice core dumps in cdump directory, have a look at the corresponding stack traces generated in udump directory. If the call stack looks similar to the following stack, apply Oracle server patch 5666714 to overcome this problem.

Alert log will have the following errors:

Errors in file /opt/oracle/admin/VIS/udump/vis_ora_1040.trc:
ORA-07445: exception encountered: core dump [SIGSEGV] [Address not mapped to object] [2168] [] [] []
Fri Jun 15 01:30:38 2007
Errors in file /opt/oracle/admin/VIS/udump/vis_ora_1040.trc:
ORA-07445: exception encountered: core dump [SIGSEGV] [Address not mapped to object] [9] [] [] []
ORA-07445: exception encountered: core dump [SIGSEGV] [Address not mapped to object] [2168] [] [] []
Fri Jun 15 01:30:38 2007
Errors in file /opt/oracle/admin/VIS/udump/vis_ora_1040.trc:
ORA-07445: exception encountered: core dump [SIGSEGV] [Address not mapped to object] [9] [] [] []
ORA-07445: exception encountered: core dump [SIGSEGV] [Address not mapped to object] [9] [] [] []
ORA-07445: exception encountered: core dump [SIGSEGV] [Address not mapped to object] [2168] [] [] [] 

% more vis_ora_1040.trc
\*\*\* 2007-06-15 01:30:38.403
\*\*\* SERVICE NAME:(VIS) 2007-06-15 01:30:38.402
\*\*\* SESSION ID:(1111.4) 2007-06-15 01:30:38.402
Exception signal: 11 (SIGSEGV), code: 1 (Address not mapped to object), addr: 0x878
\*\*\* 2007-06-15 01:30:38.403
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [SIGSEGV] [Address not mapped to object] [2168] [] [] []
Current SQL statement for this session:
INSERT /\*+ IDX(0) \*/ INTO "INV"."MLOG$_MTL_SUPPLY" (dmltype$$,old_new$$,snaptime$$,change_vector$$,m_row$$) 
VALUES (:d,:o,to_date('4000-01-01:00:00:00','YYYY-MM-DD:HH24:MI:SS'),:c,:m)
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
ksedst()+23          ?        0000000000000001     00177A9EC 000000000 0061D0A60
ksedmp()+636         ?        0000000000000001     001779481 000000000 00000000B
ssexhd()+729         ?        0000000000000001     000E753CE 000000000 0061D0B90
_sigsetjmp()+25      ?        0000000000000001     0FDDCB7E6 0FFFFFD7F 0061D0B50
call_user_handler()  ?        0000000000000001     0FDDC0BA2 0FFFFFD7F 0061D0EF0
+589                                               000000000
sigacthandler()+163  ?        0000000000000001     0FDDC0D88 0FFFFFD7F 000000002
kglsim_pin_simhp()+  ?        0000000000000001     0FFFFFFFF 0FFFFFFFF 00000000B
173                                                000000000
kxsGetRuntimeLock()  ?        0000000000000001     001EBF830 000000000 005E5D868
+683                                               000000000
kksfbc()+7361        ?        0000000000000001     001FB60A6 000000000 005E5D868
opiexe()+1691        ?        0000000000000001     0029045D0 000000000 0FFDF9250
opiall0()+1316       ?        0000000000000001     0028E9FB9 000000000 000000001
opikpr()+536         ?        0000000000000001     00290B2DD 000000000 0000000B7
opiodr()+1087        ?        0000000000000001     000E7BE1C 000000000 000000001
rpidrus()+217        ?        0000000000000001     000E8058E 000000000 0FFDFA6B8
skgmstack()+163      ?        0000000000000001     003F611D0 000000000 005E5D868
rpidru()+129         ?        0000000000000001     000E808A6 000000000 005E6FAD0
rpiswu2()+431        ?        0000000000000001     000E7FD8C 000000000 0FFDFB278
kprball()+1189       ?        0000000000000001     000E86E6A 000000000 0FFDFB278
kntxslt()+3150       ?        0000000000000001     0030601F3 000000000 005F7C538
kntxit()+998         ?        0000000000000001     003058EBB 000000000 005F7C538
0000000001E4866E     ?        0000000000000001     001E4864B 000000000 000000000
delrow()+9170        ?        0000000000000001     0032020B7 000000000 000000002
qerdlFetch()+640     ?        0000000000000001     0033545F5 000000000 0EF38B020
delexe()+909         ?        0000000000000001     0032034EA 000000000 005E6FC50
opiexe()+9267        ?        0000000000000001     002906368 000000000 000000001
opiodr()+1087        ?        0000000000000001     000E7BE1C 000000000 0FFDFCD10
ttcpip()+1168        ?        0000000000000001     003D031AD 000000000 0FFDFEDF4
opitsk()+1212        ?        0000000000000001     000E77C41 000000000 000E7BA00
opiino()+931         ?        0000000000000001     000E7B0D8 000000000 005E5B8F0
opiodr()+1087        ?        0000000000000001     000E7BE1C 000000000 000000000
opidrv()+748         ?        0000000000000001     000E76A11 000000000 0FFDFF6D8
sou2o()+86           ?        0000000000000001     000E73E6B 000000000 000000000
opimai_real()+127    ?        0000000000000001     000E3A7C4 000000000 000000000
main()+95            ?        0000000000000001     000E3A694 000000000 000000000
0000000000E3A4D7     ?        0000000000000001     000E3A4DC 000000000 000000000
--------------------- Binary Stack Dump ---------------------
========== FRAME [1] (ksedst()+23 -> 0000000000000001) ==========
Dump of memory from 0x00000000061D0910 to 0x00000000061D0920
0061D0910 061D0920 00000000 0177A9EC 00000000  [ .........w.....]
========== FRAME [2] (ksedmp()+636 -> 0000000000000001) ==========
Dump of memory from 0x00000000061D0920 to 0x00000000061D0A60
0061D0920 061D0A60 00000000 01779481 00000000  [`.........w.....]
0061D0930 0000000B 00000000 061D0EF0 00000000  [................]
0061D0940 05E5B96C 00000000 05E5C930 00000000  [l.......0.......]
0061D0950 05E5C930 00000000 FE0D2000 FFFFFD7F  [0........ ......]
0061D0960 061D0A40 00000000 00000000 00000000  [@...............]
0061D0970 00000000 00000000 00000000 00000000  [................]

After installing the Oracle database patches, check the installed patches by running opatch lsinventory on your database server.

% opatch lsinventory
Invoking OPatch

Oracle interim Patch Installer version
Copyright (c) 2005, Oracle Corporation.  All rights reserved..

Oracle Home       : /oracle/product/10.1.0
Central Inventory : /export/home/oracle/oraInventory
   from           : /oracle/product/10.1.0/oraInst.loc
OPatch version    :
OUI version       :
OUI location      : /oracle/product/10.1.0/oui
Log file location : /oracle/product/10.1.0/cfgtoollogs/opatch/opatch-2007_Aug_10_21-56-03-PDT_Fri.log

Lsinventory Output file location : 

Installed Top-level Products (3): 

Oracle Database 10g                                        
Oracle Database 10g Products                               
Oracle Database 10g Release 2 Patch Set 1                  
There are 3 products installed in this Oracle Home.

Interim patches (2) :

Patch  4770693      : applied on Thu Aug 02 15:27:23 PDT 2007
   Created on 12 Jul 2006, 11:52:39 hrs US/Pacific
   Bugs fixed:

Patch  5666714      : applied on Fri Jul 20 10:21:33 PDT 2007
   Created on 29 Nov 2006, 04:52:58 hrs US/Pacific
   Bugs fixed:


OPatch succeeded.

Sunday May 13, 2007

Patches to get extended FILE solution on Solaris 10

The issue was discussed in the blog post Solaris: Workaround to stdio's 255 open file descriptors limitation.

If the system is running any existing major customer releases of Solaris 10 i.e., Solaris 10 3/05 through Solaris 10 11/06, extended FILE facility can be installed on these systems by applying the kernel patch, 125100-04 or later & libc patch, 120473-05 or later.

Systems running Solaris Express (SX) or any OpenSolaris distribution after build 39 do not need the patches mentioned above to get the extended FILE solution for stdio's 256 open files limitation.

Do not forget to read the man pages of extendedFILE(5), enable_extended_FILE_stdio(3C), fopen(3C), fdopen(3C) and popen(3C) to enable the extended FILE solution by pre-loading /usr/lib/ (run-time solution) or by using the enhanced stdio's interfaces, fopen(), fdopen(), popen() or with the new interface enable_extended_FILE_stdio().

Peter Shoults, Sun Microsystems

Monday Apr 23, 2007

Solaris: (Undocumented) Thread Environment Variables

The new thread library (initially called T2) on Solaris 8 Update 7 and later versions of Solaris provides a 1:1 threading model for better performance and scalability. Just like other libraries of Solaris, the new thread library is binary compatible with the old library, and the applications linked with old library continue to work as they are, without any changes to the application or any need to re-compile the code.

The primary difference being the new thread library uses a 1:1 thread model, where as the old library implements a two-level model in which user-level threads are multiplexed over possibly fewer light weight processes. In the 1:1 (or 1x1) model, an user level (application) thread has a corresponding kernel thread. A multi-threaded application with n threads will have n kernel threads. Each user level thread has a light weight process (LWP) connecting it to the kernel thread. Due to the more number of LWPs, the resource consumption will be a little high with 1x1 model, compared to MxN model; but due to the less overhead of multiplexing and scheduling the threads over LWP, 1x1 model performs much better compared to the old MxN model.

Tuning an application with thread environment variables

Within the new thread library (libthread) framework, a bunch of tunables have been provided to tweak the performance of the application. All these tunables have default values, and often the default values are good enough for the application to perform better. For some unknown reasons, these tunables were not documented anywhere except in source code. You can have a look at the source code for the following files at OpenSolaris web site.

  • thr.c <- check the routine etest()
  • synch.c <- read the comments and code between lines: 76 and 97

Here's a brief description of these environment variables along with their default values:

    • Controls the spinning for queue locks
    • Default value: 1000


    • Specifies the number of iterations (spin count) for adaptive mutex locking before giving up and going to sleep
    • Default value: 1000


    • Spin LIBTHREAD_RELEASE_SPIN times to see if a spinner grabs the lock; if so, don’t bother to wake up a waiter
    • Default value: LIBTHREAD_ADAPTIVE_SPIN/2 ie., 500. However it can be set explicitly to override the default value


    • Limits the number of simultaneous spinners attempting to do adaptive locking
    • Default value: 100


    • Do direct mutex handoff (no adaptive locking)
    • Default value: 0


    • Specifies the frequency of FIFO queueing vs LIFO queueing (range is 0..8)
    • Default value: 4
    • LIFO queue ordering is unfair and can lead to starvation, but it gives better performance for heavily contended locks

      • 0 - every 256th time (almost always LIFO)
      • 1 - every 128th time
      • 2 - every 64th time
      • 3 - every 32nd time
      • 4 - every 16th time (default, mostly LIFO)
      • 5 - every 8th time
      • 6 - every 4th time
      • 7 - every 2nd time
      • 8 - every time (never LIFO, always FIFO)


    • Causes a dump of user−level sleep queue statistics onto stderr (file descriptor 2) on exit
    • Default value: 0


    • Specifies the number of cached stacks the library keeps for re−use when more threads are created
    • Default value: 10


    • Little bit of history:

      The old thread library (Solaris 8's default) forces the thread performing cond_wait() to block all signals until it reacquires the mutex.

      The new thread library on Solaris 9 (default) and later versions, behaves exact opposite. The state of the mutex in the signal handler is undefined and the thread does not block any more signals than it is already blocking on entry to cond_wait().

      However to accomodate the applications that rely on old behavior, the new thread library implemented the old behavior as well, and it can be controlled with the LIBTHREAD_COND_WAIT_DEFER env variable. To get the old behavior, this variable has to be set to 1.


    • Setting it to 1 issues a warning message about illegal locking operations and setting it to 2 issues the warning message and core dump
    • Default value: 0

Roger Faulkner's New/Improved Threads Library presentation

(Original blog post is at:

Sunday Apr 15, 2007

C/C++: Printing Stack Trace with printstack() on Solaris

On Solaris 9 and later, libc provides an useful function called printstack, to print a symbolic stack trace to the specified file descriptor. This is useful for reporting errors from an application during run-time.

If the stack trace appears corrupted, or if the stack cannot be read, printstack() returns -1.

Here is a programmatic example:

% more printstack.c

#include <stdio.h>
#include <ucontext.h>

int callee (int file)
    return (0);

int caller()
    int a;
    a = callee (fileno(stdout));
    return (a);

int main()
    return (0);

% cc -V
cc: Sun C 5.8 Patch 121016-04 2006/10/18 <- Sun Studio 12 C compiler
usage: cc [ options] files.  Use 'cc -flags' for details

% cc -o stacktrace stacktrace.c

% ./stacktrace

The printstack() function uses dladdr1() to obtain symbolic symbol names. As a result, only global symbols are reported as symbol names by printstack().

The stack trace from a C++ program, will have all the symbols in their mangled form. So as of now, the programmers may need to have their own wrapper functions to print the stack trace in demangled form.

% CC -V
CC: Sun C++ 5.8 Patch 121018-07 2006/11/01 <- Sun Studio 12 C++ compiler

% CC -o stacktrace stacktrace.c

% ./stacktrace

There has been an RFE (Request For Enhancement), 6182270 printstack should provide demangled names while using with c++, in place against Solaris' libc to print the stack trace in demangled form, when printstack() has been called from a C++ program. This will be released as a libc patch for Solaris 9 & 10, as soon as it gets enough attention.

% /usr/ccs/bin/elfdump  | grep printstack
     [677]  0x00069f70 0x0000004c  FUNC WEAK  D    9 .text          printstack
    [1425]  0x00069f70 0x0000004c  FUNC GLOB  D   36 .text          _printstack

Since the object code is automatically linked with libc during the creation of an executable or a dynamic library, the programmer need not specify -lc on the compile line.

Suggested Reading:
Man page of walkcontext / printstack

(Original blog post is at:

Monday Mar 12, 2007

Solaris: How To Disable Out Of The Box (OOB) Large Page Support?

Starting with the release of Solaris 10 1/06 (aka Solaris 10 Update 1), large page OOB feature turns on MPSS (Multiple Page Size Support) automatically for applications' data (heap) and text (libraries).

One obvious advantage of this large page OOB feature is that it improves the performance of user land applications by reducing the wastage of CPU cycles in serving iTLB and dTLB misses. For example, if the heap size of a process is 256M, on a Niagara (UltraSPARC-T1) box it will be mapped on to a single 256M page. On a system that doesn't support large pages, it will be mapped on to 32,768 8K pages. Now imagine having all the words of a story on a single large page versus having the words spread into 32,500+ small pages. Which one do you prefer?

However large page OOB feature may have negative impact on some applications - For example, an application may crash due to some wrong assumption(s) about the page size {by the application} or there could be an increase in virtual memory consumption due to the way the data and libraries are mapped on to larger pages.

Fortunately there are a bunch of /etc/system tunables to enable/disable large page OOB support on Solaris.

/etc/system tunables to disable large page OOB feature

  • set exec_lpg_disable = 1

    This parameter prevents large pages from being used when the kernel is allocating memory for processes being executed. These constitute the memory needed for a processes' text/data/bss.

  • set use_brk_lpg = 0

    This parameter prevents large pages from being used for heap. To enable large pages for heap, set the value of this parameter to 1 or remove this parameter from /etc/system completely.

    brk() is the kernel routine that is called whenever a user level application invokes malloc().

  • set use_stk_lpg = 0

    This parameter disables the large pages for stack. Set it to 1 to retain the default functionality.

  • set use_zmap_lpg = 0

    This variable controls the size of anonymous (anon) pages.

  • set use_text_pgsz4m = 0

    This tunable disables the default use of 4M text pages on UltraSPARC-III/III+/IV/IV+/T1 platforms.

  • set use_text_pgsz64k = 0

    This tunable disables the default use of 64K text pages on UltraSPARC-T1 (Niagara) platform.

  • set use_initdata_pgsz64k = 0

    This tunable disables the default use of 64K data pages on UltraSPARC-T1 (Niagara) platform.

Tuning off large page OOB support for heap/stack/anon pages on-the-fly

Setting /etc/system parameters require the system to be rebooted to enable/disable large page OOB support. However it is possible to set the desired page size for heap/stack/anon pages dynamically as shown below. Note that the system goes back to the default behavior when it is rebooted. Depending on the need to turn off large page support, use mdb or /etc/system parameters at your discretion.

To turn off large page support for heap, stack and anon pages dynamically, set the following under mdb -kw:

  • use_brk_lpg/W 0 (heap)
  • use_stk_lpg/W 0 (stack)
  • use_zmap_lpg/W 0 (anon)

Java sets its own page size with memctl() interface - so, the /etc/system changes won't impact Java at all. Consider using the JVM option -XX:LargePageSizeInBytes=pagesize[K|M] to set the desired page size for Java process mappings.

How to check whether disabling large page support is really helping?

Compare the outputs of the following {along with application specific data} before and after changes:

  • vmstat 2 50 - look under free and id columns
  • trapstat -T 5 5 - check %time column
  • mdb -k and then ::memstat

How to set the maximum large page size?

Run pagesize -a to get the list of supported page sizes on your system. Then set the page size of your choice as shown below.

% mdb -kw
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba random fcp fctl nca
lofs ssd logindmux ptm cpc sppp crypto nfs ipc ]
> auto_lpg_maxszc/W <hex_value>
hex_value = { 0x0 for 8K,
0x1 for 64K,
0x2 for 512K,
0x3 for 4M,
0x4 for 32M and
0x5 for 256M }

How to check the maximum page size in use?

Here is an example from a Niagara box (T2000):

% pagesize -a
% mdb -kw
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba random fcp fctl nca
lofs ssd logindmux ptm cpc sppp crypto nfs ipc ]
> auto_lpg_maxszc/X
> ::quit

See Also:
6287398 vm 1.5 dumps core with -d64

Sang-Suan Sam Gam

(Original blog post is at:

Thursday Dec 07, 2006

Solaris: Workaround to stdio's 255 open file descriptors limitation

A quick web search with key words Solaris stdio open file descriptors results in numerous references to stdio's limitation of 255 open file descriptors on Solaris.

#1085341: 32-bit stdio routines should support file descriptors >255, a 14 year old RFE explains the problem and the bug report links to handful of other bugs which are some how related to stdio's open file descriptors limitation.

Now the good news: the wait is finally over. Sun Microsystems finally made a fix/workaround available to the community in the form of a new library called extendedFILE. If you are running Solaris Express (SX) 06/06 or later builds, your system already has the workaround. You just need to enable it to get around the 255 open file descriptors problem with stdio's API. This workaround is part of the Update 3 release of Solaris 10 (Solaris 10 11/06). [Update: 02/05/07] This workaround will be part of the next major release of Solaris 10, which is going to be Solaris 10 Update 4 - For some reason, it couldn't make it to Update 3. There are some plans to backport this workaround to Solaris 9 and 10. However there is no clear timeline for completion of this backport, at the moment.

The workaround does not require any source code changes or re-compilation of the objects. You just need to increase the file descriptor limit using limit or ulimit commands; and pre-load /usr/lib/ before running the application.

However applications fiddling with _file field of FILE structure may not work. This is because when extendedFILE library is pre-loaded, descriptors > 255 will be stored in an auxiliary location and a fake descriptor will be stored in the FILE structure's _file field. In fact, accessing _file field was long discouraged; and to discourage non-standard practices even further _file has been renamed to _magic starting with SX 06/06. So, applications which access _file directly rather than with fileno() function, may encounter compilation errors starting with S10 U3 U4. This step is necessary to ensure that the source code is in a clean state, so the resulting object code is not vulnerable to data corruption during run-time.

The following example shows the failure and the steps to workaround the issue. Note that with the extendedfile library pre-loaded, the process can open upto 65532 files excluding stdin, out and err.

\* Test case (simple C program tries to open 65536 files):
% cat files.c
#include <stdio.h>
#define NoOfFILES 65536

int main()
        char filename[10];
        FILE \*fds[NoOfFILES];
        int i;

        for (i = 0; i < NoOfFILES; ++i)
                sprintf (filename, "%d.log", i);
                fds[i] = fopen(filename, "w");

                if (fds[i] == NULL)
                        printf("\\n\*\* Number of open files = %d. fopen() failed with error:  ", i);
                        fprintf (fds[i], "some string");

        return (0);

% cc -o files files.c
\* Re-producing the failure:
 % limit
 cputime         unlimited
 filesize        unlimited
 datasize        unlimited
 stacksize       8192 kbytes
 coredumpsize    unlimited
 descriptors     256
 memorysize      unlimited

 % uname -a
 SunOS sunfire4 5.11 snv_42 sun4u sparc SUNW,Sun-Fire-280R


 \*\* Number of open files = 253. fopen() failed with error: Too many open files

\* Showcasing the Solaris workaround:
 % limit descriptors 5000

 % limit
 cputime         unlimited
 filesize        unlimited
 datasize        unlimited
 stacksize       8192 kbytes
 coredumpsize    unlimited
 descriptors     5000
 memorysize      unlimited

 % setenv LD_PRELOAD_32 /usr/lib/

 % ./files

 \*\* Number of open files = 4996. fopen() failed with error: Too many open files

 % limit descriptors 65536

 % limit
 cputime         unlimited
 filesize        unlimited
 datasize        unlimited
 stacksize       8192 kbytes
 coredumpsize    unlimited
 descriptors     65536
 memorysize      unlimited


 \*\* Number of open files = 65532. fopen() failed with error: Too many open files

(Note that descriptor 0, 1 and 2 will be used by stdin, stdout and stderr)

 % ls -l 65531.log
 -rw-rw-r--   1 gmandali ccuser        11 Aug  9 12:35 65531.log

For further information about the extendedFILE library and for the extensions to fopen, fdopen, popen, .. please have a look at the new and updated man pages:


Craig Mohrman

(Original blog post is at:

Wednesday Nov 22, 2006

Java performance on Niagara platform

(The goal of this blog post is not really to educate you on how to tune Java on UltraSPARC-T1 (Niagara) platform, but to warn you not to completely rely on the out-of-the-box features of Solaris 10 and Java, with couple of interesting examples).


Customer XYZ heard very good things about UltraSPARC-T1 (Niagara) based coolthreads servers; and the out of the box performance of Solaris 10 Update 1 and Java SE 5.0. So, he bought an US-T1 based T2000 server; and deployed his application on this server running the latest update of Solaris 10 with latest version of Java SE.

Pop Quiz:
Assuming he didn't tune the application further with the blind faith on all the things he heard, is he getting all the performance he is supposed to get from the Solaris run-time environment and the underlying hardware?


Here is why, with a simple example:

US-T1 chip supports 4 different page sizes: 8K, 64K, 4M and 256M.

% pagesize -a

As long as Solaris run-time takes care of mapping heap/stack/anon/library text of a process on to appropriate page sizes, we don't have to tweak anything for better performance at least from dTLB/iTLB hits perspective. However things are little different with Java Virtual Machine (VM). Java sets its own page size with memctl() interface - so, large page OOB feature of Solaris 10 Update 1 (and later) will have no impact on Java at all. The following mappings of a native process, and a Java process confirm this.

Oracle shadow process using 256M pages for ISM (Solaris takes care of this mapping):

0000000380000000    4194304    4194304          -    4194304 256M rwxsR    [ ism shmid=0xb ]
Some anonymous mappings from a Java process (Java run-time take care of these mappings):
D8800000   90112   90112   90112       -   4M rwx--    [ anon ]
DE000000  106496  106496  106496       -   4M rwx--    [ anon ]
E4800000   98304   98304   98304       -   4M rwx--    [ anon ]
EA800000   57344   57344   57344       -   4M rwx--    [ anon ]
EE000000   45056   45056   45056       -   4M rwx--    [ anon ]

If Solaris run-time takes care of the above mappings, it would have mapped 'em on to one 256M page and the rest on other pages. So, are we losing (something that we cannot gain is a potential loss) any performance here by using 4M pages? Yes, we are. The following trapstat output gives us a hint that still there is at least 12% (check the last column, min of all %time values) CPU can be regained by switching to much larger page (256M in this example). But in reality we cannot avoid all memory translations completely - so, it is safe to assume that the potential gain by using 256M pages would be anywhere between 5% and 10%.

% grep ttl trapstat.txt
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
      ttl |    553995  0.9       711  0.0 |   6623798 11.0      4371  0.0 |12.0
      ttl |    724981  1.3       832  0.0 |   9509112 16.5      5969  0.1 |17.8
      ttl |    753761  1.3       661  0.0 |  11196949 19.7      4601  0.0 |21.1

Why didn't Java run-time use 256M pages even when it could potentially use that large page in this particular scenario?

The answer to this question is pretty simple - usually large pages (pages > default 8K pages) improve the performance of the process by reducing the number of CPU cycles spent on virtual <-> physical memory translations on-the-fly. The bigger the page size, the higher the chances for good performance. However the improvement in CPU performance with large pages is not completely free - we need to sacrifice little virtual memory due to the page alignment requirements. i.e., there will be an increase in the virtual memory consumption depending on the page size in use. When 4M pages are in use, we might be losing ~4M at the most. When 256M pages are in use, .. ? Well, you get the idea. Depending on the heap size, the performance difference between 4M and 256M pages might not be substantial for some applications - but there might be a big difference in the memory footprint with 4M and 256M pages. Due to this, Java SE development team chose 4M page size in favor of normal/lesser memory footprints; and provided a hook to the customers who wish to use different page sizes including 256M, in the form of -XX:LargePageSizeInBytes=pagesize[K|M] JVM option. That's why Java uses 4M pages by default even when it could use 256M pages.

It is up to the customers to check the dTLB/iTLB miss rate by running trapstat tool (eg., trapstat -T 5 5 ) and to decide if it helps to use 256M pages on Niagara servers with JVM option -XX:LargePageSizeInBytes=256M. Use pmap -sx <pid> to check the page size and the mappings.

Some anonymous mappings from a Java process with -XX:LargePageSizeInBytes=256M option:

90000000  262144  262144  262144       - 256M rwx--    [ anon ]
A0000000  524288  524288  524288       - 256M rwx--    [ anon ]
C0000000  262144  262144  262144       - 256M rwx--    [ anon ]
E0000000  262144  262144  262144       - 256M rwx--    [ anon ]

Let us check the time spent in virtual-to-physical and physical-to-virtual memory translations again.

% grep ttl trapstat.txt
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
      ttl |    332797  0.5       523  0.0 |   2546510  3.8      2856  0.0 | 4.3
      ttl |    289876  0.4       382  0.0 |   1984921  2.7      3226  0.0 | 3.2
      ttl |    255998  0.4       438  0.0 |   2037992  2.9      2350  0.0 | 3.3

Now scroll up a little and compare the %time columns of both 4M and 256M page experiments. There is a noticeable difference in dtlb-miss rate - more than 8%. i.e., the performance gain by merely switching from 4M to 256M pages is ~8% CPU. Since the CPU is not spending/wasting some cycles on memory translations, it'd be doing more useful work and hence the throughput or response time from JVM would improve.

Another example:

Recent versions of Java SE support parallel garbage collection with the JVM switch -XX:+UseParallelGC. When this option is used on command line, by default Java run-time starts some garbage collection threads whose count is equal to the number of processors (including virtual processors). Niagara server acts like a 32-way server (capable of running 32 threads in parallel) - so, running the Java process with -XX:+UseParallelGC option may run 32 garbage collection threads, which would probably be unnecessarily high. Unless the garbage collection thread count is restricted to a decent number with another JVM switch -XX:ParallelGCThreads=<gcthreadcount>, customers may see very high system CPU utilization (> 20%); and misinterpret it as a problem with the Niagara server.

Moral of the story:

Unless you know the auto tune policy of the OS or the software that runs on top of it, do NOT just rely on their auto tuning capability. Measure the run-time performance of the application and tune it accordingly for better performance.

Suggested reading: (Original blog post is at:

Benchmark announcements, HOW-TOs, Tips and Troubleshooting


« July 2016