Saturday Aug 31, 2013

[Oracle Database] Unreliable AWR reports on T5 & Redo logs on F40 PCIe Cards

(1) AWR report shows bogus wait events and times on SPARC T5 servers

Here is a sample from one of the Oracle 11g R2 databases running on a SPARC T5 server with Solaris 11.1 SRU 7.5

Top 5 Timed Foreground Events

Event Waits Time(s) Avg wait (ms) % DB time Wait Class
latch: cache buffers chains 278,727 812,447,335 2914850 13307324.15Concurrency
library cache: mutex X212,595449,966,33021165427370136.56Concurrency
buffer busy waits219,844349,975,25115919255732352.01Concurrency
latch: In memory undo latch25,46837,496,8001472310614171.59Concurrency
latch free2,60224,998,5839607449409459.46Other

Unknown. There is a pending bug 17214885 - Implausible top foreground wait times reported in AWR report.

Tentative workaround:
Disable power management as shown below.

# poweradm set administrative-authority=none

# svcadm disable power
# svcadm enable power

Verify the setting by running poweradm list.

Also disable NUMA I/O object binding by setting the following parameter in /etc/system (requires a system reboot).

set numaio_bind_objects=0

Oracle Solaris 11 added support for NUMA I/O architecture. Here is a brief explanation of NUMA I/O from Solaris 11 : What's New web page.

Non-Uniform Memory Access (NUMA) I/O : Many modern systems are based on a NUMA architecture, where each CPU or set of CPUs is associated with its own physical memory and I/O devices. For best I/O performance, the processing associated with a device should be performed close to that device, and the memory used by that device for DMA (Direct Memory Access) and PIO (Programmed I/O) should be allocated close to that device as well. Oracle Solaris 11 adds support for this architecture by placing operating system resources (kernel threads, interrupts, and memory) on physical resources according to criteria such as the physical topology of the machine, specific high-level affinity requirements of I/O frameworks, actual load on the machine, and currently defined resource control and power management policies.

Do not forget to rollback these changes after applying the fix for the database bug 17214885, when available.

(2) Redo logs on F40 PCIe cards (non-volatile flash storage)

Per the F40 PCIe card user's guide, the Sun Flash Accelerator F40 PCIe Card is designed to provide best performance for data transfers that are multiples of 8k size, and using addresses that are 8k aligned. To achieve optimal performance, the size of the read/write data should be an integer multiple of this block size and the data transferred should be block aligned. I/O operations that are not block aligned and that do not use sizes that are a multiple of the block size may suffer performance degration, especially for write operations.

Oracle redo log files default to a block size that is equal to the physical sector size of the disk, typically 512 bytes. And most of the time, database writes to the redo log in a normal functioning environment. Oracle database supports a maximum block size of 4K for redo logs. Hence to achieve optimal performance for redo write operations on F40 PCIe cards, tune the environment as shown below.

  1. Configure the following init parameters
  2. Create redo log files with 4K block size
    SQL> ALTER DATABASE ADD LOGFILE '/REDO/redo.dbf' size 20G blocksize 4096;
  3. [Solaris only] Append the following line to /kernel/drv/sd.conf (requires a reboot)
    sd-config-list="ATA     3E128-TS2-550B01","disksort:false,\
                 cache-nonvolatile:true, physical-block-size:4096";
  4. [Solaris only][F20] To enable maximum throughput from the MPT driver, append the following line to /kernel/drv/mpt.conf and reboot the system.

This tip is applicable to all kinds of flash storage that Oracle sells or sold including F20/F40 PCIe cards and F5100 storage array. sd-config-list in sd.conf may need some adjustment to reflect the correct vendor id and product id.

Tuesday Jun 29, 2010

Identifying Ideal Oracle Database Objects for Flash Storage and Accelerators

The Sun Storage F5100 Flash Array and Sun Flash Accelerator F20 PCIe Card help accelerate I/O bound applications such as databases. The following are some of the guidelines to identify Oracle database objects that can benefit by using the flash storage. Even though the title explicitly states "Oracle", some of these guidelines are applicable to other databases and non-database products. Exercise discretion, evaluate and experiment before implementing these recommendations as they are.

  • Heavily used database tables and indexes are ideal for flash storage

    • - The database workloads with no I/O bottlenecks may not show significant performance gains
    • - The database workloads with severe I/O bottlenecks can fully realize the benefits of flash devices

      • Top 5 Timed Foreground Events section in any AWR report that was collected on the target database system is useful in finding whether disk I/O is a bottleneck

        • Large number of Waits and the large amount of time in DB spent waiting for some blocked resource under User I/O Wait Class is an indication of I/O contention on the system
  • Identify the I/O intensive tables and indexes in a database with the help of Oracle Enterprise Manager Database Control, a web-based tool for managing Oracle database(s)

    • - The "Performance" page in OEM Database Control helps you quickly identify and analyze performance problems
    • - Historical and the real-time database activity can be viewed from the "performance" page.
      • The same page also provides information about the top resource consuming database objects
  • An alternate way to identify the I/O intensive objects in a database is to analyze the AWR reports that are generated over a period of time especially when the database is busy

    • - Scan through the SQL ordered by .. tables in each AWR report
    • - Look for the top INSERT & UPDATE statements with more elapsed and DB times
      • The database tables that are updated frequently & repeatedly, along with the indexes created on such tables are good candidates for the flash devices

    • - SQL ordered by Reads is useful in identifying the database tables with large number of physical reads
      • The database table(s) from which large amounts of data is read/fetched from physical disk(s) are also good candidates for the flash devices

        • To identify I/O intensive indexes, look through the explain plans of the top SQLs that are sorted by Physical Reads

  • Examine the File IO Stats section in any AWR report that was collected on the target database system

    • - Consider moving the database files with heavy reads, writes and relatively high average buffer wait time to flash volumes
  • Examine Segments by Physical Reads, Segments by Physical Writes and Segments by Buffer Busy Waits sections in AWR report

    • - The database tables and indexes with large number of physical reads, physical writes and buffer busy waits may benefit from the flash acceleration
  • Sun flash storage may not be ideal for storing Oracle redo logs

    • - Sun Flash Modules (FMOD) in F5100 array and F20 Flash Accelerator Card are optimized for 4K sector size

        A redo log write that is not aligned with the beginning of the 4K physical sector results in a significant performance degradation

    • - In general, Oracle redo log files default to a block size that is equal to the physical sector size of the disk, which is typically 512 bytes

      • Majority of the recent Oracle Database platforms detect the 4K sector size on Sun flash devices
      • Oracle database automatically creates redo log files with a 4K block size on file systems created on Sun flash devices
        • However with a block size of 4K for the redo logs, there will be significant increase in redo wastage that may offset expected performance gains

F5100 Flash Storage and F20 PCIe Flash Accelerator Card as Oracle Database Smart Flash Cache

In addition to the I/O intensive database objects, customers running Oracle 11g Release 2 or later versions have the flexibility of using flash devices to turn on the "Database Smart Flash Cache" feature to reduce physical disk I/O. The Database Smart Flash Cache is a transparent extension of the database buffer cache using flash storage technology. The flash storage acts as a Level 2 cache to the (Level 1) SGA. Database Smart Flash Cache can significantly improve the performance of Oracle databases by reducing the amount of disk I/O at a much lower cost than adding an equivalent amount of RAM.

F20 Flash Accelerator offers an additional benefit - since it is a PCIe card, the I/O operations bypass disk controller overhead.

The database flash cache can be enabled by setting appropriate values to the following Oracle database parameters.


Check Oracle Database Administrator's Guide 11g Release 2 (11.2) : Configuring Database Smart Flash Cache documentation for the step-by-step instructions to configure Database Smart Flash Cache on flash devices.

Wednesday Jan 20, 2010

PeopleSoft NA Payroll 240K EE Benchmark with 16 Job Streams : Another Home Run for Sun

Poor Steve A.[1] ... This entry is not about Steve A. though. It is about the new PeopleSoft NA Payroll benchmark result that Sun published today.

First things first. Here is the direct URL to our latest benchmark results:

        PeopleSoft Enterprise Payroll 9.0 using Oracle for Solaris on a Sun SPARC Enterprise M4000 (16 job streams[2] -- simply referred as 'stream' hereonwards)

The summary of the benchmark test results is shown below only for the 16 stream benchmarks. These numbers were extracted from the very first page of the benchmark results white papers where Oracle|PeopleSoft highlights the significance of the results and the actual numbers that are of interest to the customers. The results in the following table are sorted by the hourly throughput (payments/hour) in the descending order. The goal is to achieve as much hourly throughput as possible. Click on the link that is underneath the hourly throughput values to open corresponding benchmark result.

Oracle PeopleSoft North American Payroll 9.0 - Number of employees: 240,000 & Number of payments: 360,000
Vendor OS Hardware Config #Job Streams Elapsed Time (min) Hourly Throughput
Payments per Hour
Sun Solaris 10 5/09 1x Sun SPARC Enterprise M4000 with 4 x 2.53 GHz SPARC64-VII Quad-Core processors and 32 GB memory
1 x Sun Storage F5100 Flash Array with 40 Flash Modules for data, indexes
1 x Sun Storage J4200 Array for redo logs
16 43.78 493,376
HP HP-UX 1 x HP Integrity rx6600 with 4 x 1.6 GHz Intel Itanium2 9000 Dual-Core processors and 32 GB memory
1 x HP StorageWorks EVA 8100
16 68.07 317,320

This is all public information. Feel free to compare the hardware configurations and the data presented in both of the rows and draw your own conclusions. Since both Sun and HP used the same benchmark toolkit, workload and ran the benchmark with the same number of job streams, comparison should be pretty straight forward.

If you want to compare the 8 stream results, check the other blog entry: PeopleSoft North American Payroll on Sun Solaris with F5100 Flash Array : A blog Reprise. Sun used the same hardware to run both benchmark tests with 8 and 16 streams respectively. We could have gotten away with 20+ Flash Modules (FMODs), but we want to keep the benchmark environment consistent with our prior benchmark effort around the same benchmark workload with 8 job streams. Due to the same hardware setup, now we can easily demonstrate the advantage of parallelism (simply by comparing the test results from 8 and 16 stream benchmarks) and how resilient and scalable the F5100 Flash array is.

Our benchmarks showed an improvement of ~55% in overall throughput when the number of job streams were increased from 8 to 16. Also our 16 stream results showed ~55% improvement in overall throughput over HP's published results with the same number of streams at a maximum average CPU utilization of 45% compared to HP's maximum average CPU utilization of 89%. The half populated Sun Storage F5100 Flash Array played the key role in both of those benchmark efforts by demonstrating superior I/O performance over the traditional disk based arrays.

Before concluding, I would like to highlight a few known facts (just for the benefit of those people who may fall for the PR trickery):

  1. 8 job streams != 16 job streams. In other words, the results from an 8 stream effort is not comparable to that of a 16 stream result.
  2. The throughput should go up with increased number of job streams [ only up to some extent -- do not forget that there will be a saturation point for everything ]. For example, the throughput with 16 streams might be higher compared to the 8 stream throughput.
  3. The Law of Diminishing Returns applies to the software world too, not just for the economics. So, there is no guarantee that the throughput will be much better with 24 or 32 job streams.

Other blog posts and documents of interest:

  1. Best Practices for Oracle PeopleSoft Enterprise Payroll for North America using the Sun Storage F5100 Flash Array or Sun Flash Accelerator F20 PCIe Card
  2. PeopleSoft Enterprise Payroll 9.0 using Oracle for Solaris on a Sun SPARC Enterprise M4000 (8 streams benchmark white paper)
  3. PeopleSoft North American Payroll on Sun Solaris with F5100 Flash Array : A blog Reprise
  4. App benchmarks, incorrect conclusions and the Sun Storage F5100
  5. Oracle PeopleSoft Payroll (NA) Sun SPARC Enterprise M4000 and Sun Storage F5100 World Record Performance


[1] Steve A. tried so hard and his best to make everyone else believe that HP's 16 job stream NA Payroll 240K EE benchmark results are on par with Sun's 8 stream benchmark results. Apparently Steve A. failed and gave up after we showed the world a few screenshots from a published and eventually withdrawn benchmark [ by HP ]. You can read all his arguments, comparisons etc., in the comments section of my other blog entry PeopleSoft North American Payroll on Sun Solaris with F5100 Flash Array : A blog Reprise as well as in Joerg Moellenkamp's blog entries around the same topic.

[2] In PeopleSoft terminology, a job stream is something that is equivalent to a thread.


Benchmark announcements, HOW-TOs, Tips and Troubleshooting


« July 2016