Monday Oct 19, 2009

Exadata V2... Oracle grid consolidation in a box

I spent some time last week at OOW talking with Oracle customers regarding the technology in the Exadata V2 database machine. There were certainly a lot of customers excited to use this for their data warehouses - 21GB/sec disk throughput, 50GB/sec flash cache, and Hybrid Columnar Compression really accelerate this machine past the competition. The viability of Exadata V2 for DW/BI was really a given, but what impressed me the most was the number of customers looking to consolidate applications in this environment.

Ever since I was first brought onto this project, I thought Exadata V2 would be an excellent platform for consolidation. In my experience working on the largest of Sun's servers, I have seen customers with dozens of instances on a single machine. Using M9000 series machines, you can create domains in order to support multiple environments - this very much mirrors what Exadata V2 can provide. Exadata V2 allows DBAs to deploy multiple instances across a grid of RAC nodes available in the DB machine – and since you are using RAC, availability is a given. Also, the addition of Flash allows for up to 1 million IOPs to support your ERP/OLTP environments. Consider the picture below.

With this environment, your production data warehouse can share the same infrastructure as the ERP, test, and development environments. This model allows the flexibility to add/subtract nodes from a particular database as needed. But, the operational efficiency is not the biggest benefit to consolidation. The savings in terms of power, space, and cooling are substantial.

Consider for a moment the number of drives necessary to match the 1 million IOPs available in the database machine. Assuming you are using the best 15,000 rpm drive, you would be able to do 250 IOPs/drive. So, to get to 1 million IOPs, you would need 4,000 drives! A highly dense 42U storage rack can house any where from 300-400 drives. So, you would need 10 racks, just for the storage and at least one rack for servers.

With Exadata V2, you get more than 10:1 savings in floor space and all the power an cooling benefits as well. It is no wonder people are excited about Exadata V2 as a platform to consolidate compute and storage resources.

Friday Sep 25, 2009

Jumbo Frames with Oracle RAC really does Rock!!

I have been involved in a customer situation on and off for at least 6 months now. The customer had been seeing performance issues with their application running on Oracle 10g RAC. We looked through the mounds of data initially and noticed that they were indeed waiting quite a bit on Global cache buffer waits. This was during times of fairly heavy load and we could see the CPU was fairly busy with interrupts as well. After looking at the MTU size for the cluster interconnect, we noticed that it was incorrectly set to the default (1500). Thus started the odyssey to implement Jumbo Frames.

The default MTU is 1500 for Solaris, but this is not ideal when Oracle is using an 8K block size. Simple math tells us that you will require 6 transfers to transmit just one block of data across the cluster interconnect. This just creates additional overhead on the server and additional latency waiting for global blocks to be transferred. Changing the MTU to be a "jumbo frame" of 8K or greater is fairly simple from a technical point of view, but it can quickly turn into a political issue.

The cluster interconnect is often relegated to be the responsibility of the networking group. No problem right? While this is a network component, it is really part of the server - no different really from a PCI bus or processor back plane. The networking groups will often apply their tried and true methods for LANs around the company, but this doesn't translate to RAC. Modern network switches can easily handle this configuration change as well, but policy often wins. The networking group assures everyone their switch can handle the traffic with the default MTU and everyone goes on their merry way.

So, what happened?

After months looking at "other things", they finally were convinced to try this "Best Practice" with Jumbo Frames. Immediately, they saw:

  • 50% reduction in CPU overhead
  • 75% reduction in Global cache buffer waits
  • IP Reassemblies dropped by 10x

Moral of the story: Implement Jumbo Frames for Oracle RAC interconnects... It is a best practice after all :)

Thursday Sep 17, 2009

Breaking the silence... Exadata V2 performance details at OOW.

UPDATE: Unfortunately our OOW session was canceled. I will still be at the show participating in various performance related activities.

Now that Exadata V2 has been announced, I can begin to talk about what I have been working on lately. Kevin Closson and I have been working to characterize the performance of the Exadata V2 product. The performance of this machine is just stunning. This is the fastest machine I have ever used and I have been benchmarking big iron SMP with Oracle for the better part of 20 years now.

Exadata V2 brings together the best of Brawny Hardware and Brainy Software. The storage cells utilize Sun's new FlashFire technology along with the latest Nehalem based servers. Software advances with Storage Indexes and Hybrid Columnar compression push the performance to amazing speeds. If you are at Oracle Open World, please stop by our session to learn more about the performance of this exciting new product. I will post the session details once they are available.

Monday Sep 14, 2009

Sun Oracle Database machine arrives!

Let the games begin...

I will post some informational links as they are available. Pretty exciting stuff!

Wednesday Jun 17, 2009

Chronology of Sun technology for Oracle performance

With the pending buyout of Sun by Oracle still fresh in our minds, a recent email thread emerged wanting to put together a time-line of technologies created to help Sun servers run Oracle better. This seemed like a great way to start off the Oracle Performance portion of the Performance Technologies group new wiki. I hope you enjoy this overview of the key technologies used to optimize Oracle performance on Sun servers.

Monday Apr 20, 2009

Oracle buys Sun! Dtrace probes for Oracle?

As a Sun engineer who specializes on Oracle performance, I am pretty excited for the future. Sun and Oracle have been working together closely for years, but this will allow for even closer integration and innovation.

Could you imagine... "Dtrace probes for Oracle?" How cool would that be?

Time will tell innovations will come out of this acquisition, but I for one am pretty excited... Certainly, much more so than last week :) Heck, I now get to work with my old buddy Kevin Closson, the Exadata performance guru. Should be a fun ride ahead.

Tuesday Dec 09, 2008

Oracle analysis 101 : Begining analysis techniques

Recently, I was asked to present beginning Oracle analysis techniques to an internal audience of Sun engineers. This presentation was a lot of fun to put together and was well received. After cleaning it up a bit and taking out the boring internal Sun stuff, I thought the presentation might be useful to a larger audience. This presentation focuses on problem statement, environmental, and basic AWR/Statspack analysis.

If you find this useful or have suggestions, drop me a note.

Saturday Nov 08, 2008

Virtual CPUs effect on Oracle SGA allocations.

Several years ago, I wrote about how Oracle views multi-threaded processors. At the time we were just introducing a dual-core processor. This doubling of the number of cores was presented by Solaris as virtual CPUs and Oracle would automatically size the CPU_COUNT accordingly. But what happens when you introduce a 1RU server that has 128 virtual CPUs?

The UltraSPARC T1/T2/T2+ servers have many threads or virtual CPUs. The CPU_COUNT on these systems is sized no different than before. So, the newly introduced T5540 with 4xUltraSPARC T2+ processors would have 256 threads and CPU_COUNT would be set to 256.

So, what does CPU_COUNT have to do with memory?

Thanks to my friends in the Oracle Real World Performance group, I was made aware that Oracle uses CPU_COUNT to size the minimum amount of SGA allowed. In one particular case, the DBA was trying to allocate 70 database instances on a T5140 with 64GB of memory and 128 virtual CPUs. Needless to say, the SGA_TARGET would have to be set fairly low in-order to accomplish this task. A SGA_TARGET was set to 256MB, but the following error was encountered.

    ORA-00821: Specified value of sga_target 256M is too small
After experimentation, they were able to start Oracle with a target of 900MB, but with 70 instances this would not fly. Manually lowering the CPU_COUNT allowed the DBA to use an SGA_TARGET of 256MB. Obviously, this is an extreme case and changing CPU_COUNT was reasonable.

Core and virtual CPU counts have been on the rise for some years now. Combine rising virtual CPU count with the current economic climate and I would suspect that consolidation will be more popular than ever. In general, I would not advocate changing CPU_COUNT manually. If you had one instance on this box, the default be just fine. CPU_COUNT automatically sizes so many other parameters that you should be very careful before making a change.

Monday Sep 22, 2008

Oracle Open World 2008 - Growing Green Databases with UltraSPARC CMT

The time has come present at Oracle Open World on UltraSPARC CMT performance. I decided to post the final version here in addition to the OOW site. I hope to see you there!
Session ID: S299785
     Title: Growing Green databases with Oracle on the UltraSPARC CMT processor
      Date: Monday Sept 22nd
      Time: 13:00 - 14:00
     Place: Moscone South Rm 236

Monday Mar 10, 2008

Oracle db_block_checksum performance bug

We recently ran across a pretty strange performance bug with the checksum function of Oracle. This bug (6814520) causes excessive CPU to be used for the checksum routine. The checksum feature of Oracle is enabled by the db_block_checksum=TRUE parameter in Oracle. With the release of Oracle 10gR2, "TRUE" is the default setting. The order of magnitude of CPU overhead depends on the type of Solaris SPARC architecture.

Chip     %overhead
----     ---------
SPARC64       250%  
USIV           45%  
w/patch         8% 

Oracle released a patch via metalink to address this situation. This patch is for installations. The fix will be included in:,, and

If you are unsure whether or not you are hitting this bug, you can easily alter this parameter on the fly:
    SQL> alter system set db_block_checksum=FALSE
Warning, this will disable the checksum feature and blocks written when this is turned to FALSE will not contain checksum information.

Regardless of whether you are currently hitting the bug or not, the recommendation is:

Thursday Feb 14, 2008

Ensuring directIO with Oracle on Solaris UFS filesystems

I usually really dislike blog entries that have nothing to say other than repackage bug descriptions and offer them up as knowledge, but in this case I have made an exception since the full impact of the bug is not fully described.

There is a fairly nasty Oracle bug with that prevents the use of DirectIO with Solaris. The metalink note "406472.1" describes the failure modes but fails to mention the performance impact if you use "filesystemio_options=setall" and fail to have the mandatory patch "5752399" in place.

This was particularly troubling to me since we have been recommending for years the use of the "setall" to ensure all the proper filesystem options are set for optimal performance. I just finished working a customer situation where this patch was not installed and their critical batch run-times were nearly 4x as large... Not a pretty situation.... OK, So bottom line:

Friday Jan 04, 2008

Organizational stove-pipes complicate database storage configurations.

IT organizations at large companies are complex entities where people are partitioned by function. There are SAN people, system administrators, Database administrators, and Developers. While it is good to specialize by function there seems to be a mis-match when each organization optimizes their internal operations. Let me walk you though the a common situation where the SAN administrators and system administrators each try to optimize performance without consideration to overall picture.

The setup

  • DBA requests storage for new application. They are expecting filesystem(s) or RAW luns will be presented for ASM use.
  • System's administrators request luns from the Storage administrators to fulfill the request.
  • Storage administrators supply the luns.

Systems Administrators

Their job is to make sure the performance of the supplied luns map cleanly to the Database environment. For years System Administrators have been using SW volume management/Raid to improve performance. So, naturally, they request a large number of luns (say 128) from the Storage administrators so they can stripe. Past experimentation has shown that a 32k stripe width was best.

Storage Administrators

The Storage people take care of large Hitachi or EMC boxes. Their job is to supply luns to applications and make sure their "san-box" performs well. They gladly supply the luns to the Systems administrators, but to ensure performance of the SAN box, they must prevent the fiber from "resets". The maximum number of requests on a fiber is 256 requests. So, no problem, they have the system administrators adjust the "sd_max_throttle" parameter so the OS will queue events and not cause resets. The rule of thumb is to set it to:

       sd_max_throttle = 256/#luns = 256/128 = 2

Putting it all together

So, now the system administrator takes these 128 luns and creates four file systems by striping 32 luns together each with a 32k stripe width using SVM. Since this is a SAN, there are multiple connections from the host to the SAN in this case there are 4 connections. MPxIO is used to round-robin IO requests to the 4 connections to the SAN to balance load and allow for fail-over in case of an HBA failure.

This environment is turned over to the DBA who finds the performance is less than stellar.


The DBA is running 10 jobs that result in queries which full scan 10 tables. These queries request 1MB per IO. Now a stripe width of 32k breaks down the 1MB IO into 32 equal pieces... and since there are 10 concurrent jobs that equates to 32\*10 or 320 concurrent request for IO. Finally, these 320 request, are routed down one of the four channels so that would be 320/4 or 80 requests per channel. Are you beginning to see the problem?

Given the "sd_max_throttle" setting of 2, the OS will allow 2 outstanding requests at a time. If you look at the array, the performance will look great... so it must be an OS problem :)

The Fix

This issue was solved in multiple phases.
  • Quick Fix: Simply increase the "sd_max_throttle" >= 80. This will prevent queuing at the driver level.
  • Increased stripe width. Use an SVM stripe width of 1MB or greater. This will reduce the number of IO being broken down by the SW volume manager.
  • Optimal Solution. Eliminate SW striping all together and build larger luns within the SAN box.


Storage issues often involve multiple layers of HW, SW, people, and organizations. To architect a well thought out solution, all aspects must be taken into consideration. Get everyone talking and sharing information so that your organizational stove-pipes don't cripple application performance.

Friday Jun 29, 2007

Swingbench Order-Entry doesn't scale-up with equal load

In my previous post, I pointed out some considerations to deploying the Swingbench Order-Entry benchmark on large systems. The main bottle-neck in this case was the database size. When scaling too small of a database to huge transaction rates, concurrency issues in the data prevent scaling. Luckily, Swingbench has a way to adjust the number of "Users" and "Orders"... or so it would seem.

Adjusting Users and Orders
I used the "oewizard" utility to create the maximum number of customers and orders - 1 million each. This created a database that was about 65GB total. The "oewizard" is a single threaded process and therefore takes a little time... Be patient. After doing my 1st run, I was a little concerned at the difference in performance.

Scale-up differences
In the real-world as database size grows, often transactions bloat. This is often noticed by enterprising DBAs and performance analysts. Eventually, this will lead to a re-coding of SQL or some changes in the transaction logic. So as a real-world database scales-up it will go through a series of bloating and fixing.

When designing a benchmark to show scale-up and make comparisons of systems at various database sizes, it is desirable to ensure transactions are presented with a similar load. If this is not the case, it should be noted and comparisons should NOT be made across database sizes. The "Order Products", "New Registration", and "Browse Order" transactions which are part of the SwingBench Order-Entry test, all experience transaction bloat as the database size is increased.

The following response time chart shows the effects of "one" user running on databases of 25,000 and 1,000,000 orders.

The moral-- beware of comparing results of differing database sizes using the Swingbench default Order-Entry kit.

Monday Jun 11, 2007

Swing and a miss... Benchmarking is not \*that\* easy.

I applaud tools that aim to make life easier. The cell phone is a wonderful invention that when combined with my palm pilot was wonderful. Now Apple has taken it as step further with the music, movies, internet and birthed the iPhone - nicer still!

Over the past year, I have been seeing more and more IT shops experiment with benchmark tools. One such tool is a kit developed by Dominic Giles of Oracle called Swingbench. Swingbench is a benchmark toolkit that is easy to install and run. Now the DBA can install the benchmark schema and with a few clicks... Wham they are benchmarking! Now comes the hard part - What do these results mean?

After about the 4th call of a customer having performance issues with their application "Swingbench", I was compelled to take a deeper look.

Luckily, all of the performance problems were easily solved by someone who benchmarks for a living. They were typically misconfiguration issues like: filesystem features, lack of io, lack of memory, too small of a dataset, ect... The scary part, these situations all used the supplied "demo" schema's.

By pursuing the Swingbench documentation, I saw that the demo schema's top out at a 100GB database size. This is also alarming. Most IT shops that buy servers or deploy multi-node RAC configurations have more disk than the modern laptop. So you can imagine my surprise when I saw a bake-off of an enterprise class machine that is essentially doing no IO and choking to death on latches... simply the wrong test for the environment.

Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
latch free                                      4,542,675   1,137,914    79.04
log file sync                                     242,359     164,671    11.44
buffer busy waits                                 102,540      61,887     4.30
enqueue                                            35,142      42,498     2.95
CPU time                                                       25,310     1.76

Benchmarking, is simply not \*that\* easy. It takes time to scale up a workload that can simulate your environment. No question that Swingbench gives you a nice head start. It allows you to encapsulate your transactions, run simple regression tests, but you have to take the time to customize the kit to include your data and transactions. The demo schema's are simply a starting point.

Wednesday Nov 15, 2006

Where do you cache Oracle data?

Using the filesystem page cache in combination with the Oracle buffer cache for database files was commonplace before 64-bit databases were prevalent - machines had a lot of memory and databases could not use more than 4GB. Now after many years of 64-bit databases, there are still a fair number of systems that still use buffered IO via the filesystem or cached QIO. While buffered IO used to provide benefit, it can cause substandard performance and impede scaling on modern large-scale systems. Buffered file system issues include:
  • Single-writer lock
  • Memory fragmentation: 8k blocks instead of 4M or 32M ISM.
  • Double buffering doubles memory bandwidth requirements!
  • Segmap thrashing... lots of xcalls!

2x throughput increase with Oracle Caching vs OS buffered IO

A quick experiment was conducted on Oracle 10gR2 on a large memory machine (192GB).
  • 1st test: DB cache was set to 1GB and the database was mounted on a buffered file system.
  • 2nd test: DB cache was set to 50GB and the database was mounted direct - NOT buffered.
A 46GB table was populated, indexed, and then queried by 100 processes each requesting a range of data. A single row was retrieved at a time to simulate what would happen in an OLTP environment. The data was cached so that no IO occurred during any of the runs. When the dust had settled, the Oracle buffer cache provided a 2X speedup over buffered file systems. There was also a dramatic decrease in getmaps, xcalls, and system CPU time. The table below shows the results.





























Notice that cross calls for the Solaris 10 with FS cache have been nearly eliminated while the getmaps have increased in proportional to throughput. This is due to the elimination of xcalls associated with the getmap operation. That said, the mild improvement in throughput with S10 on filesystems, it is nothing like the 2x improvement achieved by avoiding buffered IO altogether.

Recognizing buffered IO impact

A higher amount of system CPU time can be observed at the high-level. It is not uncommon to see a usr/sys ratio of 1 or less on systems where buffered IO is in use. This is due to the high number of getmap reclaims and cross-calls (xcal). You can observe cross-calls with mpstat(1) command. Segmap activity can be best observed using segmapstat utility which is part of the cachekit utilities. The segmapstat utility polls "kstats" to retrieve hit/miss data in an easy to read format. If you are using Solaris 10, the impact due to cross-calls is less, but segmap activity is still visible.

Finally, it would be nice to be able to see the amount of data in the page cache. If you are on Solaris 8, you will need to install the memtool 8 written by Richard McDougal. If you are on Solaris 9 or greater, you can use mdb(1) with the ::memstat command. Beware, this command will take a long time to run and may affect performance, therefore it is best to run this when the system is not busy.

# mdb -k Loading modules: [ unix krtld genunix ip usba wrsm random
  ipc nfs ptm cpc ]
 > ::memstat 
Page Summary                Pages                MB  %Tot 
------------     ----------------  ----------------  ---
Kernel                     430030              3359    2% 
Anon                       805572              6293    3% 
Exec and libs                9429                73    0% 
Page cache               14974588            116988   52% 
Free (cachelist)          2547680             19903    9% 
Free (freelist)           9853807             76982   34% 
Total                    28621106            223602 

How do you avoid using buffered IO?

The easiest way to avoid using the OS page cache is to simply use RAW partitions. This is commonly done in combination with SVM or VxVM. More recently, Oracle introduced their own volume manager (ASM) which makes use of async IO and eases the administration of Oracle databases. That said, databases on RAW partitions are not for everyone. Often users perfer to use standard OS tools to view and manpulate database files in filesystems.

Most filesystems have ways of bypassing the OS page cache for Oracle datafiles. UFS, QFS, and VxFS all support mounting filesystems to bypass the OS page cache - the only exeception is ZFS which doesn't allow for direct or async IO. Below, methods for disabling buffered IO with filesystems are discussed.

FILESYSTEMIO_OPTIONS=SETALL (Oracle 9i and greater) init.ora parameter

The first step to avoiding buffered IO is to use the "FILESYSTEMIO_OPTIONS" parameter. When you use the "SETALL" option, this sets all the options for a particular filesystem to enable directio or async IO. Setting the FILSYSTEMIO_OPTIONS to anything other than "SETALL" could reduce performance. Therefore, it is a best practice to set this option.

UFS and directio

With UFS, the only way to bypass the page cache is with directio. If you are using Oracle 9i or greater, then set the FILESYSTEMIO_OPTIONS=SETALL init.ora parameter. This the preferred way of enabling directio with Oracle. With this method, Oracle uses an api to enable directio when it opens database files. This method allows you to still use buffered IO for operations like backup and archiving. If you are using Oracle 8i, then the only way to enable directio with UFS is via the forcedirectio mount option.

VxFS with QIO

VxFS has several options for disabling buffered IO. Like UFS, VxFS does support directio but it is not as efficient as Quick IO (QIO) or Oracle Data Management (ODM). With VxFS, async IO is possible with QuickIO or ODM. Data files for use with QIO must be created with a special utility or converted to the QIO format. With QIO you have to be careful that the "cached" QIO option is not enabled. With the cached QIO option, blocks of selected data files will be placed in the OS page cache.

VxFS with ODM

Like QIO, ODM uses async IO. ODM uses an api specified by Oracle to open and manipulate data files. ODM lowers overhead in large systems by sharing file descriptors and eliminating the need for each oracle shadow/server process to open and obtain its own file descriptors.

Convincing Oracle to cache table data

Finally, after all this is done Oracle still may not properly cache table data. I have seen more than a few persons enable "directio" and increase the SGA only to have response time of their critical queries take longer! If a table is too large or the "cache" attribute is not set, Oracle will not attempt to cache tables when scanning. This is done to avoid flooding the Oracle buffer cache with data that will most likely not be used. Luckily, there is an easy way to correct this behavior by setting the "CACHE" storage parameter on a table.

  SQL> alter table BIGACTIVETABLE cache;

Finally, you may need to convince some of the Oracle DBAs of the benefit. DBAs look at Oracle performance data from an Oracle centric point of view. When data such as Oracle's statspack is analyzed, some pretty awsome response times can be seen. Wait events for IO such as "db file sequential read" and "db file scattered reads" can show response times of less than 1ms when reading from the OS page cache. Often when looking at such data, DBA's are reluctant to give up this response time. This should be viewed as an oppurtunity to further improve performance by placing the data in the Oracle buffer cache and avoiding the reads alltogether.

Summary and references

Hopefully this has given you some background on why unbuffered IO is so critical to obtain optimal performance with Oracle. It is far more efficient to obtain an Oracle blocks from the database buffer cache than to go through the OS page cache layers.

This blog discusses performance topics as running on Sun servers. The main focus is in database performance and architecture but other topics can and will creep in.


« July 2016

No bookmarks in folder


No bookmarks in folder