Where do you cache Oracle data?

Using the filesystem page cache in combination with the Oracle buffer cache for database files was commonplace before 64-bit databases were prevalent: machines had plenty of memory, but a 32-bit database could not address more than 4GB of it. Now, after many years of 64-bit databases, a fair number of systems still use buffered IO via the filesystem or cached QIO. While buffered IO once provided a benefit, it can cause substandard performance and impede scaling on modern large-scale systems. Buffered filesystem issues include:
  • Single-writer lock
  • Memory fragmentation: 8k page-cache pages instead of 4M or 32M ISM pages.
  • Double buffering doubles memory bandwidth requirements!
  • Segmap thrashing... lots of xcalls!

2x throughput increase with Oracle Caching vs OS buffered IO

A quick experiment was conducted with Oracle 10gR2 on a large-memory machine (192GB).
  • 1st test: the DB cache was set to 1GB and the datafiles sat on a buffered filesystem.
  • 2nd test: the DB cache was set to 50GB and the filesystem was mounted direct - NOT buffered.
A 46GB table was populated, indexed, and then queried by 100 processes, each requesting a range of data. A single row was retrieved at a time to simulate an OLTP environment. The data was fully cached so that no physical IO occurred during any of the runs. When the dust had settled, the Oracle buffer cache provided a 2X speedup over buffered filesystems, along with a dramatic decrease in getmaps, xcalls, and system CPU time. The table below shows the results.

    Cache   OS    Rows/sec   getmaps/sec   xcalls/sec   usr%   sys%
    -----   ---   --------   -----------   ----------   ----   ----
    FS      S9      287,114       86,516    2,600,000     71     28
    DB      S9      695,700          296        3,254     94      5
    FS      S10     334,966      106,719        1,003     78     21

Notice that cross-calls for Solaris 10 with the FS cache have been nearly eliminated, while getmaps have increased in proportion to throughput. This is due to the elimination of the xcalls associated with the getmap operation. That said, while S10 on filesystems shows a mild improvement in throughput, it is nothing like the 2x improvement achieved by avoiding buffered IO altogether.

Recognizing buffered IO impact

At a high level, buffered IO shows up as excessive system CPU time. It is not uncommon to see a usr/sys ratio of 1 or less on systems where buffered IO is in use, due to the high number of getmap reclaims and cross-calls (xcal). Cross-calls can be observed with the mpstat(1) command. Segmap activity is best observed with the segmapstat utility, which is part of the cachekit utilities; it polls kstats to retrieve hit/miss data in an easy-to-read format. On Solaris 10 the impact of cross-calls is smaller, but segmap activity is still visible.
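
As a quick sanity check, the xcal column of mpstat(1) shows cross-calls per CPU, and segmapstat reports segmap hit/miss rates. A minimal sketch (the 5-second interval is arbitrary, and the segmapstat invocation assumes it takes an interval argument like the other *stat tools):

  # mpstat 5         <- watch the xcal column for cross-call storms
  # segmapstat 5     <- segmap hit/miss rates polled from kstats (cachekit)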

Finally, it is useful to see how much data is sitting in the page cache. If you are on Solaris 8, you will need to install memtool 8, written by Richard McDougall. If you are on Solaris 9 or greater, you can use mdb(1) with the ::memstat command. Beware: this command can take a long time to run and may affect performance, so it is best run when the system is not busy.

# mdb -k
Loading modules: [ unix krtld genunix ip usba wrsm random
ipc nfs ptm cpc ]
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     430030              3359    2%
Anon                       805572              6293    3%
Exec and libs                9429                73    0%
Page cache               14974588            116988   52%
Free (cachelist)          2547680             19903    9%
Free (freelist)           9853807             76982   34%
Total                    28621106            223602

How do you avoid using buffered IO?

The easiest way to avoid the OS page cache is to simply use RAW partitions. This is commonly done in combination with SVM or VxVM. More recently, Oracle introduced its own volume manager (ASM), which makes use of async IO and eases the administration of Oracle databases. That said, databases on RAW partitions are not for everyone. Users often prefer to use standard OS tools to view and manipulate database files in filesystems.

Most filesystems have ways of bypassing the OS page cache for Oracle datafiles. UFS, QFS, and VxFS all support mounting filesystems to bypass the OS page cache - the one exception is ZFS, which doesn't allow for direct or async IO. Below, methods for disabling buffered IO with filesystems are discussed.

FILESYSTEMIO_OPTIONS=SETALL (Oracle 9i and greater) init.ora parameter

The first step in avoiding buffered IO is to set the FILESYSTEMIO_OPTIONS init.ora parameter. The SETALL value enables all the IO options available on a particular filesystem - both directio and async IO. Setting FILESYSTEMIO_OPTIONS to anything other than SETALL can reduce performance, so it is a best practice to set this option.
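
For example, on an instance using an spfile the parameter can be set from SQL*Plus; it is not a dynamic parameter, so a restart is required (a minimal sketch):

  SQL> alter system set filesystemio_options = 'SETALL' scope=spfile;
  SQL> shutdown immediate
  SQL> startup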

UFS and directio

With UFS, the only way to bypass the page cache is with directio. If you are using Oracle 9i or greater, then set the FILESYSTEMIO_OPTIONS=SETALL init.ora parameter. This is the preferred way of enabling directio with Oracle: Oracle uses an API to enable directio when it opens database files, which allows operations like backup and archiving to still use buffered IO. If you are using Oracle 8i, the only way to enable directio with UFS is the forcedirectio mount option.
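
For illustration, forcedirectio can be applied at mount time as shown below (the device and mount point are hypothetical); adding forcedirectio to the mount-options field of the filesystem's /etc/vfstab entry makes it persist across reboots:

  # mount -F ufs -o forcedirectio /dev/dsk/c1t0d0s6 /u01/oradata

Remember that with the mount option, all IO to that filesystem bypasses the page cache, including backups and archiving - which is why the SETALL method is preferred on 9i and later.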

VxFS with QIO

VxFS has several options for disabling buffered IO. Like UFS, VxFS supports directio, but it is not as efficient as Quick IO (QIO) or Oracle Disk Manager (ODM). With VxFS, async IO is possible with QIO or ODM. Data files for use with QIO must be created with a special utility or converted to the QIO format, as shown in the sketch below. With QIO, be careful that the "cached" QIO option is not enabled; with cached QIO, blocks of selected data files will be placed in the OS page cache.
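
For illustration, QIO datafiles are preallocated with the VERITAS qiomkfile utility, and the cached-QIO tunable can be inspected with vxtunefs; the path and size below are hypothetical:

  # qiomkfile -s 1g /u01/oradata/PROD/users01.dbf
  # vxtunefs -p /u01/oradata | grep qio_cache_enable
  qio_cache_enable = 0                 <- 0 means cached QIO is off, as desired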

VxFS with ODM

Like QIO, ODM uses async IO. ODM uses an API specified by Oracle to open and manipulate data files. ODM lowers overhead on large systems by sharing file descriptors, eliminating the need for each Oracle shadow/server process to open and obtain its own file descriptors.
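
One quick way to confirm ODM is actually in use is to check the alert log: when the VERITAS ODM library is linked in, Oracle records it at instance startup (the log path and version string below are illustrative):

  # grep -i "running with odm" $ORACLE_BASE/admin/PROD/bdump/alert_PROD.log
  Oracle instance running with ODM: VERITAS 4.1 ODM Library, Version 1.1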

Convincing Oracle to cache table data

Finally, after all this is done, Oracle still may not cache table data as expected. I have seen more than a few people enable directio and increase the SGA, only to see the response time of their critical queries get worse! If a table is too large or its "cache" attribute is not set, Oracle will not attempt to cache blocks from that table when scanning. This is done to avoid flooding the Oracle buffer cache with data that will most likely not be reused. Luckily, there is an easy way to correct this behavior: set the CACHE storage parameter on the table.

  SQL> alter table BIGACTIVETABLE cache;
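
The attribute can be verified from the CACHE column of USER_TABLES; a value of Y means blocks read by full scans of this table will be kept at the recently-used end of the buffer cache LRU list:

  SQL> select table_name, cache from user_tables where table_name = 'BIGACTIVETABLE';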

Finally, you may need to convince some Oracle DBAs of the benefit. DBAs look at Oracle performance data from an Oracle-centric point of view. When data such as Oracle's statspack is analyzed, some pretty awesome response times can be seen: wait events for IO such as "db file sequential read" and "db file scattered read" can show response times of less than 1ms when reading from the OS page cache. Looking at such data, DBAs are often reluctant to give up this response time. Instead, it should be viewed as an opportunity to further improve performance by placing the data in the Oracle buffer cache and avoiding the reads altogether.
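
To see those response times yourself, the averages can be computed from v$system_event, where TIME_WAITED is reported in centiseconds (hence the factor of 10 to convert to milliseconds):

  SQL> select event, total_waits,
    2         round(time_waited * 10 / total_waits, 2) avg_ms
    3    from v$system_event
    4   where event in ('db file sequential read', 'db file scattered read');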

Summary and references

Hopefully this has given you some background on why unbuffered IO is so critical to obtaining optimal performance with Oracle. It is far more efficient to retrieve an Oracle block from the database buffer cache than to go through the OS page cache layers.
Comments:

Good info, although it would be nice to see a S10 result using the big DB cache.

Posted by William Hathaway on November 15, 2006 at 03:40 AM PST #

Unfortunately, had to give up the environment before I could get the result. I suspect it would be about the same as S9 however.

Posted by guest on November 15, 2006 at 06:45 AM PST #

This is very interesting info, Glenn. Thank you for your hard work while preparing this. I always wanted to see how raw partitions work in contrast to standard files in Oracle.

Posted by linportal on November 15, 2006 at 07:04 AM PST #

Hi Glenn, I got your blog's address from Kevin Closson in response to my question on the breakdown of memory usage in Solaris 9. I was wondering if you could explain what the following memory areas mean and what kind of information they contain: * Anon (I believe it stands for Anonymous); what does it represent? * Page cache. I have two domains on two SunFire 20k servers that are running large Oracle databases. On server A, I am using ODM for all instances, whereas on server B I am running the instance with ODM disabled. What memory area (from memstat) contains the shared memory segments from the SGA? I will appreciate your help in clarifying these stats. Thanks, Amir

Posted by Amir Hameed on November 29, 2006 at 12:19 PM PST #

Thanks Amir for the questions. Hopefully, I can clarify the memory usage with respect to Oracle on Solaris.... maybe another blog entry. Anon: is used by the Oracle PGA. Oracle started using Anon memory for the PGA in conjunction with the auto-tuning features of the PGA. The idea is to use Anon (mapped memory) in place of heap. The PGA is used for DSS-type queries for caching sorts and hashes. SGA: the SGA will also show up as Anon from the "::memstat" command. The SGA is a special type of Anon memory; it shows up under Anon since it is shareable and is mapped by all Oracle processes. To get a better idea of the total memory layout of an Oracle process, use the pmap(1) command: "pmap -xs <oracle_pid>". In your case above, I guess that on server A with ODM enabled, your page cache will be fairly low since ODM bypasses the page cache. On server B, if ODM is disabled and you are not using forcedirectio, then I suspect that you will have a larger page cache. In general, I would suggest that you use ODM, especially since you are already paying for it :)

Posted by Glenn Fawcett on December 01, 2006 at 01:18 AM PST #

"Wait events for IO such as "db file sequential read" and "db file scattered reads" can show response times of less than 1ms when reading from the OS page cache. Often when looking at such data, DBA's are reluctant to give up this response time. This should be viewed as an oppurtunity to further improve performance by placing the data in the Oracle buffer cache and avoiding the reads alltogether." hi Glenn, It is clear from the test results presented that the buffer cache is faster. However, regarding above comment - my understanding is: When the read for a full table (or index FFS) scan from OS buffers, the read is from adjacent blocks in memory. When it is done from Oracle buffer cache, would this be true? Are all blocks belonging to a table adjacent in memory? The buffer cache management may prevent this (?) - e.g. the LRU algorithm. Maybe there is something with regard to the hash mechanisms for the CBC also (it's a question, I am not saying this is true as I don't know). Given this, the read for a full scan may not be as "straight" from the Oracle buffer cache. Could you please elaborate on this? Thanks, Naresh

Posted by Naresh Bhandare on February 11, 2007 at 06:06 AM PST #

Very interesting summary. Thanks.

Posted by Pete Neve on April 18, 2007 at 01:03 AM PDT #
