Monday Oct 19, 2009

Exadata V2... Oracle grid consolidation in a box

I spent some time last week at OOW talking with Oracle customers regarding the technology in the Exadata V2 database machine. There were certainly a lot of customers excited to use this for their data warehouses - 21GB/sec disk throughput, 50GB/sec flash cache, and Hybrid Columnar Compression really accelerate this machine past the competition. The viability of Exadata V2 for DW/BI was really a given, but what impressed me the most was the number of customers looking to consolidate applications in this environment.

Ever since I was first brought onto this project, I thought Exadata V2 would be an excellent platform for consolidation. In my experience working on the largest of Sun's servers, I have seen customers with dozens of instances on a single machine. Using M9000 series machines, you can create domains in order to support multiple environments - this very much mirrors what Exadata V2 can provide. Exadata V2 allows DBAs to deploy multiple instances across a grid of RAC nodes available in the DB machine – and since you are using RAC, availability is a given. Also, the addition of Flash allows for up to 1 million IOPs to support your ERP/OLTP environments. Consider the picture below.

With this environment, your production data warehouse can share the same infrastructure as the ERP, test, and development environments. This model allows the flexibility to add/subtract nodes from a particular database as needed. But, the operational efficiency is not the biggest benefit to consolidation. The savings in terms of power, space, and cooling are substantial.

Consider for a moment the number of drives necessary to match the 1 million IOPs available in the database machine. Assuming you are using the best 15,000 rpm drive, you would be able to do 250 IOPs/drive. So, to get to 1 million IOPs, you would need 4,000 drives! A highly dense 42U storage rack can house any where from 300-400 drives. So, you would need 10 racks, just for the storage and at least one rack for servers.

With Exadata V2, you get more than 10:1 savings in floor space and all the power an cooling benefits as well. It is no wonder people are excited about Exadata V2 as a platform to consolidate compute and storage resources.

Thursday Sep 17, 2009

Breaking the silence... Exadata V2 performance details at OOW.

UPDATE: Unfortunately our OOW session was canceled. I will still be at the show participating in various performance related activities.

Now that Exadata V2 has been announced, I can begin to talk about what I have been working on lately. Kevin Closson and I have been working to characterize the performance of the Exadata V2 product. The performance of this machine is just stunning. This is the fastest machine I have ever used and I have been benchmarking big iron SMP with Oracle for the better part of 20 years now.

Exadata V2 brings together the best of Brawny Hardware and Brainy Software. The storage cells utilize Sun's new FlashFire technology along with the latest Nehalem based servers. Software advances with Storage Indexes and Hybrid Columnar compression push the performance to amazing speeds. If you are at Oracle Open World, please stop by our session to learn more about the performance of this exciting new product. I will post the session details once they are available.

Monday Sep 14, 2009

Sun Oracle Database machine arrives!

Let the games begin...

I will post some informational links as they are available. Pretty exciting stuff!

Wednesday Jun 17, 2009

Chronology of Sun technology for Oracle performance

With the pending buyout of Sun by Oracle still fresh in our minds, a recent email thread emerged wanting to put together a time-line of technologies created to help Sun servers run Oracle better. This seemed like a great way to start off the Oracle Performance portion of the Performance Technologies group new wiki. I hope you enjoy this overview of the key technologies used to optimize Oracle performance on Sun servers.

Monday Apr 20, 2009

Oracle buys Sun! Dtrace probes for Oracle?

As a Sun engineer who specializes on Oracle performance, I am pretty excited for the future. Sun and Oracle have been working together closely for years, but this will allow for even closer integration and innovation.

Could you imagine... "Dtrace probes for Oracle?" How cool would that be?

Time will tell innovations will come out of this acquisition, but I for one am pretty excited... Certainly, much more so than last week :) Heck, I now get to work with my old buddy Kevin Closson, the Exadata performance guru. Should be a fun ride ahead.

Monday Sep 22, 2008

Oracle Open World 2008 - Growing Green Databases with UltraSPARC CMT

The time has come present at Oracle Open World on UltraSPARC CMT performance. I decided to post the final version here in addition to the OOW site. I hope to see you there!
Session ID: S299785
     Title: Growing Green databases with Oracle on the UltraSPARC CMT processor
      Date: Monday Sept 22nd
      Time: 13:00 - 14:00
     Place: Moscone South Rm 236

Monday May 12, 2008

Optimizing Oracle Schema Analyze with CMT based servers

A common observation regarding Niagara based servers is that system maintenance or database administration tasks can run slower than previous generations of Sun servers. While single-threaded performance may be less, these maintenance tasks are often able to be parallelized, especially using a database engine as mature as Oracle. Take for instance the task of gathering schema statistics. Oracle offers many options on how to gather schema statistics, but there are a few ways to reduce overall gather statistics time:
  • Increased Parallelism
  • Reduced Sample Size
  • Concurrency
Oracle has written many articles in metalink which discuss sample size and the various virtues. There have also been many volumes written on optimizing the Oracle cost based optimizer (CBO). Jonathan Lewis of who is a member of the famous Oaktable network has written books and multiple white papers on the topic. You can read these for insight into the Oracle CBO. While a reasonable sample size or the use of the "DBMS_STATS.AUTO_SAMPLE_SIZE" can seriously reduce the gather statistics times, I will leave that up to you to choose the sample size the produces the best plans.


The following graph shows the total run time in seconds of a "GATHER_SCHEMA_STATS" operations at various levels of parallelism and sample size on a simple schema of 130GB. All tests were run on a Maramba T5240 with a 6140 array and two channels.

GATHER_SCHEMA_STATS parallelism and sample_size

Note that if higher levels of sampling are required, parallelism can help to significantly reduce the overall runtime of the GATHER_SCHEMA_STATS operation. Of course a smaller sample size can be employed as well.


SQL> connect / as sysdba

-- Example with 10 percent with parallel degree 32

-- Example with AUTO_SAMPLE_SIZE and parallel degree 32

Note that you must have "parallel_max_servers" set to at least the level of parallelism desired for the GATHER_SCHEMA_STATS operation. I typically set it higher to allow for other parallel operations to get servers.

        SQL> alter system set parallel_max_servers = 128;

Finally, you can easily run a schema collect on multiple schema's concurrently and in parallel by issuing GATHER_SCHEMA_STATS from multiple sessions and ensuring the level of parallelism is set high enough to accommodate.


  • T5240 - "Maramba" CMT based server
    • 2 x UltraSPARC T2 Plus @ 1.4GHz
    • 128GB memory
    • 2 x 1GB fiber channel HBA
    • 1 x 6140 Storage array with 1 lun per channel.
  • Software
    • Solaris 10 Update 5
    • Oracle
    • CoolTools
  • Schema
      SQL> Connected.
      SQL> SQL> SQL> SQL> SQL> SQL> SQL>   2    3    4  
      -------- ------------ ---------- --------
      GLENNF	 B2	       239826150    38560
      GLENNF	 B1	       237390000    32110
      GLENNF	 S2		 4706245      750
      GLENNF	 S4		 4700995      750
      GLENNF	 S5		 4699955      750
      GLENNF	 S7		 4698450      750
      GLENNF	 S8		 4706435      750
      GLENNF	 S9		 4707445      750
      GLENNF	 S10		 4700905      750
      GLENNF	 S3		 4706375      750
      GLENNF	 GTEST		 4706170      750
      -------- ------------ ---------- --------
      GLENNF	 S6		 4700980      750
      GLENNF	 S1		 4705905      710
      HAYDEN	 HTEST		 4723031      750
      14 rows selected.
      SQL>   2    3    4  
      -------- ------------ ---------- --------
      GLENNF	 B1_I2	       244841720    11623
      GLENNF	 B2_I2	       239784800    11451
      GLENNF	 B1_I1	       248169793     8926
      GLENNF	 B2_I1	       241690170     8589
      GLENNF	 S6_I2		 4790380      229
      GLENNF	 S3_I2		 4760090      227
      GLENNF	 S2_I2		 4693120      226
      GLENNF	 S5_I2		 4688230      224
      GLENNF	 S8_I2		 4665695      223
      GLENNF	 S4_I2		 4503180      216
      GLENNF	 S1_I2		 4524730      216
      -------- ------------ ---------- --------
      GLENNF	 S9_I2		 4389080      211
      GLENNF	 S10_I2 	 4364885      209
      GLENNF	 S7_I2		 4357240      208
      GLENNF	 S2_I1		 4972635      177
      GLENNF	 S3_I1		 4849660      174
      GLENNF	 S6_I1		 4830895      174
      GLENNF	 S9_I1		 4775830      171
      GLENNF	 S7_I1		 4772975      169
      GLENNF	 S5_I1		 4648410      168
      GLENNF	 GTEST_C1	 4686790      167
      GLENNF	 S1_I1		 4661605      166
      -------- ------------ ---------- --------
      GLENNF	 S4_I1		 4626965      166
      GLENNF	 S10_I1 	 4605100      164
      GLENNF	 S8_I1		 4590735      163
      25 rows selected.

Monday Apr 21, 2008

Throughput computing series: Utilizing CMT with Oracle

Since we just recently announced mutli-chip based CMT servers that provide up to 128 threads in a 1U or 2U box, it seems fitting to pick up this thread on throughput computing.

The key to fully appreciating the CMT architecture with Oracle is to exploit the available threads. As I have spoke about earlier in the "throughput computing series", this can be done through "concurrency", "parallelism", or both. Oracle, being the mature product that it is, can achieve high-levels of parallelism as well as concurrency.

Concurrent processing with Oracle

For examples of concurrent processing with Oracle, look at recent results on the Oracle Ebusiness payroll benchmark. This shows that using concurrent processes to break up the batch, you can increase batch throughput. By going from 4 to 64 processes, batch time decreased from 31.53 minutes to 4.63 minutes and throughput was increased by 6.8x!

With Oracle's Ebusiness Suite of software, you can increase the number of "concurrent manager" processes to more fully utilize the available threads on the system. Each ISV has different ways of controlling batch distribution and concurrency. You will have to check with your various software vendors to find all the ways to increase concurrency.

Parallelism in Oracle

People often associate parallelism in Oracle with parallel query. In most cases where CMT is involved, I see a lack of understanding of how to achieve parallelism with more basic administrative functions. Oracle excels in providing parallelism for important administrative tasks.

  • Schema analyze
  • Index build/rebuild
  • Parallel loader
  • Parallel export/import with datapump

    While parallelism exists for these administrative tasks, some configuration is required. I will examine the various ways to achieve optimal throughput with CMT based systems on these tasks.
  • Monday Mar 10, 2008

    Oracle db_block_checksum performance bug

    We recently ran across a pretty strange performance bug with the checksum function of Oracle. This bug (6814520) causes excessive CPU to be used for the checksum routine. The checksum feature of Oracle is enabled by the db_block_checksum=TRUE parameter in Oracle. With the release of Oracle 10gR2, "TRUE" is the default setting. The order of magnitude of CPU overhead depends on the type of Solaris SPARC architecture.

    Chip     %overhead
    ----     ---------
    SPARC64       250%  
    USIV           45%  
    w/patch         8% 

    Oracle released a patch via metalink to address this situation. This patch is for installations. The fix will be included in:,, and

    If you are unsure whether or not you are hitting this bug, you can easily alter this parameter on the fly:
      SQL> alter system set db_block_checksum=FALSE
    Warning, this will disable the checksum feature and blocks written when this is turned to FALSE will not contain checksum information.

    Regardless of whether you are currently hitting the bug or not, the recommendation is:
      INSTALL "6814520" WHEN USING ORACLE !!!

    Tuesday Oct 16, 2007

    Sun server reboot best practices

    I had a recent interaction with a customer regarding how often to "reboot" Sun servers. There is no official policy at Sun regarding rebooting - mainly because we sell enterprise class machines. The intent of an enterprise class machine is to stay up at all cost. Our enterprise class servers have the ability to add and remove memory, IO, and CPU without a reboot. I know it is common practice to reboot "windows" based machines on a regular schedule, but this simply does NOT apply to Enterprise class Sun servers.

    For sake of discussion, I will post what I think our server reboot policy should entail.

    Purposed Sun Server Reboot Policy:

    "Only reboot Sun servers when installing SW or HW that requires a reboot. It is not necessary to reboot servers on a regular schedule like Windows servers."

    Wednesday Nov 15, 2006

    Where do you cache Oracle data?

    Using the filesystem page cache in combination with the Oracle buffer cache for database files was commonplace before 64-bit databases were prevalent - machines had a lot of memory and databases could not use more than 4GB. Now after many years of 64-bit databases, there are still a fair number of systems that still use buffered IO via the filesystem or cached QIO. While buffered IO used to provide benefit, it can cause substandard performance and impede scaling on modern large-scale systems. Buffered file system issues include:
    • Single-writer lock
    • Memory fragmentation: 8k blocks instead of 4M or 32M ISM.
    • Double buffering doubles memory bandwidth requirements!
    • Segmap thrashing... lots of xcalls!

    2x throughput increase with Oracle Caching vs OS buffered IO

    A quick experiment was conducted on Oracle 10gR2 on a large memory machine (192GB).
    • 1st test: DB cache was set to 1GB and the database was mounted on a buffered file system.
    • 2nd test: DB cache was set to 50GB and the database was mounted direct - NOT buffered.
    A 46GB table was populated, indexed, and then queried by 100 processes each requesting a range of data. A single row was retrieved at a time to simulate what would happen in an OLTP environment. The data was cached so that no IO occurred during any of the runs. When the dust had settled, the Oracle buffer cache provided a 2X speedup over buffered file systems. There was also a dramatic decrease in getmaps, xcalls, and system CPU time. The table below shows the results.





























    Notice that cross calls for the Solaris 10 with FS cache have been nearly eliminated while the getmaps have increased in proportional to throughput. This is due to the elimination of xcalls associated with the getmap operation. That said, the mild improvement in throughput with S10 on filesystems, it is nothing like the 2x improvement achieved by avoiding buffered IO altogether.

    Recognizing buffered IO impact

    A higher amount of system CPU time can be observed at the high-level. It is not uncommon to see a usr/sys ratio of 1 or less on systems where buffered IO is in use. This is due to the high number of getmap reclaims and cross-calls (xcal). You can observe cross-calls with mpstat(1) command. Segmap activity can be best observed using segmapstat utility which is part of the cachekit utilities. The segmapstat utility polls "kstats" to retrieve hit/miss data in an easy to read format. If you are using Solaris 10, the impact due to cross-calls is less, but segmap activity is still visible.

    Finally, it would be nice to be able to see the amount of data in the page cache. If you are on Solaris 8, you will need to install the memtool 8 written by Richard McDougal. If you are on Solaris 9 or greater, you can use mdb(1) with the ::memstat command. Beware, this command will take a long time to run and may affect performance, therefore it is best to run this when the system is not busy.

    # mdb -k Loading modules: [ unix krtld genunix ip usba wrsm random
      ipc nfs ptm cpc ]
     > ::memstat 
    Page Summary                Pages                MB  %Tot 
    ------------     ----------------  ----------------  ---
    Kernel                     430030              3359    2% 
    Anon                       805572              6293    3% 
    Exec and libs                9429                73    0% 
    Page cache               14974588            116988   52% 
    Free (cachelist)          2547680             19903    9% 
    Free (freelist)           9853807             76982   34% 
    Total                    28621106            223602 

    How do you avoid using buffered IO?

    The easiest way to avoid using the OS page cache is to simply use RAW partitions. This is commonly done in combination with SVM or VxVM. More recently, Oracle introduced their own volume manager (ASM) which makes use of async IO and eases the administration of Oracle databases. That said, databases on RAW partitions are not for everyone. Often users perfer to use standard OS tools to view and manpulate database files in filesystems.

    Most filesystems have ways of bypassing the OS page cache for Oracle datafiles. UFS, QFS, and VxFS all support mounting filesystems to bypass the OS page cache - the only exeception is ZFS which doesn't allow for direct or async IO. Below, methods for disabling buffered IO with filesystems are discussed.

    FILESYSTEMIO_OPTIONS=SETALL (Oracle 9i and greater) init.ora parameter

    The first step to avoiding buffered IO is to use the "FILESYSTEMIO_OPTIONS" parameter. When you use the "SETALL" option, this sets all the options for a particular filesystem to enable directio or async IO. Setting the FILSYSTEMIO_OPTIONS to anything other than "SETALL" could reduce performance. Therefore, it is a best practice to set this option.

    UFS and directio

    With UFS, the only way to bypass the page cache is with directio. If you are using Oracle 9i or greater, then set the FILESYSTEMIO_OPTIONS=SETALL init.ora parameter. This the preferred way of enabling directio with Oracle. With this method, Oracle uses an api to enable directio when it opens database files. This method allows you to still use buffered IO for operations like backup and archiving. If you are using Oracle 8i, then the only way to enable directio with UFS is via the forcedirectio mount option.

    VxFS with QIO

    VxFS has several options for disabling buffered IO. Like UFS, VxFS does support directio but it is not as efficient as Quick IO (QIO) or Oracle Data Management (ODM). With VxFS, async IO is possible with QuickIO or ODM. Data files for use with QIO must be created with a special utility or converted to the QIO format. With QIO you have to be careful that the "cached" QIO option is not enabled. With the cached QIO option, blocks of selected data files will be placed in the OS page cache.

    VxFS with ODM

    Like QIO, ODM uses async IO. ODM uses an api specified by Oracle to open and manipulate data files. ODM lowers overhead in large systems by sharing file descriptors and eliminating the need for each oracle shadow/server process to open and obtain its own file descriptors.

    Convincing Oracle to cache table data

    Finally, after all this is done Oracle still may not properly cache table data. I have seen more than a few persons enable "directio" and increase the SGA only to have response time of their critical queries take longer! If a table is too large or the "cache" attribute is not set, Oracle will not attempt to cache tables when scanning. This is done to avoid flooding the Oracle buffer cache with data that will most likely not be used. Luckily, there is an easy way to correct this behavior by setting the "CACHE" storage parameter on a table.
      SQL> alter table BIGACTIVETABLE cache;

    Finally, you may need to convince some of the Oracle DBAs of the benefit. DBAs look at Oracle performance data from an Oracle centric point of view. When data such as Oracle's statspack is analyzed, some pretty awsome response times can be seen. Wait events for IO such as "db file sequential read" and "db file scattered reads" can show response times of less than 1ms when reading from the OS page cache. Often when looking at such data, DBA's are reluctant to give up this response time. This should be viewed as an oppurtunity to further improve performance by placing the data in the Oracle buffer cache and avoiding the reads alltogether.

    Summary and references

    Hopefully this has given you some background on why unbuffered IO is so critical to obtain optimal performance with Oracle. It is far more efficient to obtain an Oracle blocks from the database buffer cache than to go through the OS page cache layers.

    Tuesday Oct 31, 2006

    Solaris - Unbreakable, Unfakeable, Open, and the most Affordable!

    In light of the recent announcements by Oracle on the creation of "Larry Linux", I thought it would be prudent to comment about where Solaris fits into this picture. Oracle was hoping to fill a perceived hole in the support of Linux for the Enterprise. Oracle believes they can do this better than Red Hat. While won't comment on whether or not Oracle can support Linux better than Red Hat, I am looking forward to seeing the "tars" from Larry Linux :)

    Anyway, this recent announcement led me to realize that Solaris "now more than ever" is the best choice to run your enterprise. If you want open, Solaris is open. If you want iron-clad Unfakeable, Unshakeable, Unbreakable, ... well Solaris has been running enterprises since before Linux was a twinkle in Torvalds eye. Finally, if it comes down to price - Oracle will not cut you any breaks on Database support... You may have to mortgage your house just to get a years support for Oracle. But if you want to save some pocket change on OS support, Solaris beats Red Hat and Larry Linux in that category as well.
        Red Hat: $999
    Larry Linux: $399
        Solaris: $120
    Seems like a pretty easy choice to me...

    Monday Oct 30, 2006

    Dueling DUAL with BEA Weblogic and TestConnectionsOnReserve.

    You would think that the "DUAL" table, a simple stub table, would not be a performance topic - but I have seen this for years on high-end benchmarks. People develop applications or tests for applications which tend to over-use the DUAL table. Most commonly, this comes in the form of "select abc.nextseq from DUAL" and "select sysdate from DUAL". This is typically, not a problem for small severs with a low level of concurrency, but it can be bottle-neck on high-end severs with lots of processors.

    The problem with DUAL (in Oracle 9i and below) is that this "fake table" hashes to a "real" cache line :) If over-used it can cause a "cache buffers chains" latch contention like crazy. The most dangerous over-use situations are systemic ones. I can get around these issues in most benchmark environments, but cringe when I see the embedded use DUAL.

    In BEA websphere, there is a parameter called "TestConnectionsOnReservere". This parameter sends a SQL statement to the database before \*EVERY\* user statement.... talk about overhead! This not only adds SQL\*Net round trips increasing network use, but most commonly uses the "SQL SELECT 1 from DUAL" as the test statement :) What is worse, the overhead just continues to increase as the load is increased. Ken Gottry discusses the performance impact in an article he wrote. This study used a 2-way server to show the performance impact. It is much worse on a high-end server.

    What can you do?

    Avoid setting the TestConnectionsOnReserve within BEA. The performance cost in terms of potential latch contention and network over-head is too high. If you must use this paramenter, use the "X$DUAL" table instead. Oracle 10g, uses this by default and while it avoids the latching issues, the networking component this parameter is still present.

    Wednesday Aug 16, 2006

    Solaris Applications Specific Tuning Wiki

    As part of the Second Edition of the famous Solaris Internals and the new Solaris Performance and Tools book a performance tuning Wiki has been created. This site is meant to be a living document where best practices, tuning information, and tips are collected.

    I have began contributing Oracle performance information to the Solaris applications specific tuning Wiki. I hope you enjoy this repository of information regarding performance on Sun systems.

    Friday Aug 04, 2006

    High "times()" syscall count with Oracle processes

    "Why does Oracle call times() so often? Is something broken? When using truss or dtrace to profile Oracle shadow processes, one often sees a lot of calls to "times". Sysadmins often approach me with this query.

    root@catscratchb> truss -cp 7700
    syscall               seconds   calls  errors
    read                     .002     120
    write                    .008     210
    times                    .053   10810
    semctl                   .000      17
    semop                    .000       8
    semtimedop               .000       9
    mmap                     .003      68
    munmap                   .003       5
    yield                    .002     231
    pread                    .150    2002
    kaio                     .003      68
    kaio                     .001      68
                         --------  ------   ----
    sys totals:              .230   13616      0
    usr time:               1.127
    elapsed:               22.810
    At first glance it would seem alarming to have so many times() calls, but how much does this really effect performance? This question can best be answered by looking at the overall "elapsed" and "cpu" time. Below is output from the "procsystime" tool included in the Dtrace toolkit.

    root@catscratchb> ./procsystime -Teco -p 7700
    Hit Ctrl-C to stop sampling...
    Elapsed Times for PID 7700,
             SYSCALL          TIME (ns)
                mmap           17615703
               write           21187750
              munmap           21671772
               times           90733199       <<== Only 0.28% of elapsed time
              semsys          188622081
                read          226475874
               yield          522057977
               pread        31204749076
              TOTAL:        32293113432
    CPU Times for PID 7700,
             SYSCALL          TIME (ns)
              semsys            1346101
               yield            3283406
                read            7511421
                mmap           16701455
               write           19616610
              munmap           21576890
               times           33477300         <<== 10.6% of CPU time for the times syscall
               pread          211710238
              TOTAL:          315223421
    Syscall Counts for PID 7700,
             SYSCALL              COUNT
              munmap                 17
              semsys                 84
                read                349
                mmap                350
               yield                381
               write                540
               pread               3921
               times              24985    <<== 81.6% of syscalls.
              TOTAL:              30627
    According to the profile above, the times() syscall accounts for only 0.28% of the overall response time. It does use 10.6% of sys CPU. The usr/sys CPU percentages are "83/17" for this application. So, using the 17% for system CPU we can calculate the overall amount of CPU for the times() syscall: 100\*(.17\*.106)= 1.8%.

    Oracle uses the times() syscall to keep track of timed performance statistics. Timed statistics can be enabled/disabled by setting the init.ora parameter "TIMED_STATISTICS=TRUE". In fact, it is an \*old\* benchmark trick to disable TIMED_STATISTICS after all tuning has been done. This is usually good for another 2% in overall throughput. In a production environment, it is NOT advisable to ever disable TIMED_STATISTICS. These statistics are extremely important to monitor and maintain application performance. I would argue that disabling timed statistics would actually hurt performance in the long run.

    This blog discusses performance topics as running on Sun servers. The main focus is in database performance and architecture but other topics can and will creep in.


    « July 2016

    No bookmarks in folder


    No bookmarks in folder