Thursday Feb 14, 2008

Ensuring directIO with Oracle on Solaris UFS filesystems

I usually dislike blog entries that do nothing more than repackage bug descriptions and offer them up as knowledge, but in this case I have made an exception since the full impact of the bug is not described.

There is a fairly nasty Oracle bug that prevents the use of DirectIO with Solaris. The metalink note "406472.1" describes the failure modes but fails to mention the performance impact if you use "filesystemio_options=setall" and fail to have the mandatory patch "5752399" in place.

This was particularly troubling to me since we have been recommending for years the use of "setall" to ensure all the proper filesystem options are set for optimal performance. I just finished working a customer situation where this patch was not installed and their critical batch run times were nearly 4x as long... Not a pretty situation.... OK, so bottom line:

Wednesday Jan 16, 2008

Throughput computing series: System Concurrency and Parallelism

Most environments have some open source SW that is used as part of the application stack. Depending on the packages, this can take a fair amount of time to configure and compile. To speed the install process, parallelism can easily be used to take advantage of the throughput of CMT servers.

Let us consider the following five open source packages:
  • httpd-2.2.6
  • mysql-5.1.22-rc
  • perl-5.10.0
  • postgresql-8.2.4
  • ruby-1.8.6

The following experiments will time the installation of these packages in serial, parallel, and concurrent fashion.

Parallel builds

After the "configure" phase is complete, these packages are all compiled using gmake. This is where parallelism within each job can be used to speed the install process. By using the "gmake -j" option, the level of parallelism can be specified for each of the packages. This can dramatically improve the overall compile time as seen below.

compile time *without* concurrency
    • Jobs were run in a serial fashion but with parallelism within the job itself.
    • 79% reduction in compile time at 32 threads/job.

Concurrency and Parallelism

The build processes for the various packages cannot all be parallelized perfectly. In fact, the best overall gain for any of the packages is 6x out of a possible 32x. This is where concurrency comes into play. If we start all the compiles at the same time and use parallelism as well, this further reduces the overall build time.
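As a rough sketch of combining the two, the Python below launches one "gmake -j N" per package and runs several packages at once. It assumes each package has already been configured in a directory named after it; the helper names (build_command, run_gmake, build_all) and the injectable runner are illustrative only and are not the scripts used for the timings here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Package directories from the list above; assumed to be already configured.
PACKAGES = ["httpd-2.2.6", "mysql-5.1.22-rc", "perl-5.10.0",
            "postgresql-8.2.4", "ruby-1.8.6"]


def build_command(threads_per_job):
    """Return the gmake command line for one package build."""
    return ["gmake", "-j", str(threads_per_job)]


def run_gmake(package, threads_per_job):
    """Run gmake inside the package directory and return its exit status."""
    return subprocess.call(build_command(threads_per_job), cwd=package)


def build_all(packages, threads_per_job, concurrent_jobs, runner=run_gmake):
    """Build every package, running up to concurrent_jobs builds at once.

    concurrent_jobs=1 gives serial builds with parallelism inside each job;
    concurrent_jobs=len(packages) starts all the compiles at the same time.
    """
    with ThreadPoolExecutor(max_workers=concurrent_jobs) as pool:
        futures = {pkg: pool.submit(runner, pkg, threads_per_job)
                   for pkg in packages}
        # Collect the exit status of each build, keyed by package name.
        return {pkg: fut.result() for pkg, fut in futures.items()}
```

Calling build_all(PACKAGES, threads_per_job=32, concurrent_jobs=5) would correspond to the "concurrency and parallelism" case, while concurrent_jobs=1 corresponds to running the parallel jobs one after another.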

compile times with concurrency and parallelism
    • All 5 jobs were run concurrently with 1 and 32 (threads/job).
    • 88% overall reduction in compile time from serial to parallel with concurrency.
    • 42% reduction in compile time over parallel jobs run serially.
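
The percentages quoted for the two experiments can be cross-checked with a little arithmetic. This Python sketch assumes the 79% and 88% reductions are both measured against the fully serial, single-threaded run, and that the 42% reduction is measured against the parallel-but-serial run; the function names are made up for the illustration.

```python
def speedup(reduction_pct):
    """Convert a percent reduction in elapsed time into a speedup factor."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


def combined_reduction(first_pct, second_pct):
    """Percent reduction after applying two successive reductions."""
    remaining = (1.0 - first_pct / 100.0) * (1.0 - second_pct / 100.0)
    return 100.0 * (1.0 - remaining)


parallel_only = speedup(79)                  # parallelism within each job
with_concurrency = speedup(88)               # parallelism plus concurrency
implied_total = combined_reduction(79, 42)   # 79% followed by a further 42%

print(round(parallel_only, 1), round(with_concurrency, 1), round(implied_total, 1))
```

Under those assumptions, a 79% reduction followed by a further 42% reduction lands within a fraction of a percent of the 88% figure, so the three numbers are consistent with each other.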

Load it up!

Hopefully, this helps to better describe how to achieve better system throughput through parallelism and concurrency. Sun's CMT servers are multi-threaded machines which are capable of a high level of throughput. Whether you are building packages from source or installing pre-built packages, you have to load up the machine to see throughput.

Monday Jan 14, 2008

Throughput computing series: Parallel commands (pbzip2)

In this installment of the throughput computing series, I will explore how to get parallelism from the system point of view. The system administrator who first begins to configure the system will start forming impressions from the moment the shrink wrap comes off the server. First impressions and potential parallel options will be explored in this entry.

Off with the shrink-wrap... on with the install

Unfortunately, most installation processes involve a fair number of single-threaded procedures. As mentioned before, the CMT processor is designed to be a processor that optimizes the overall throughput of a server - often to the detriment of single threaded processes. There are several schools of thought on this one. First is, why bother - the install process happens but once and it really doesn't matter. That is true for most typical environments. But the current trend toward grid computing and virtualization makes "time to provision" often a critical factor. To help speed provisioning, there are some things that can be done by using parallelized commands and concurrency.

pbzip2 to the rescue

A very common time-consuming part of provisioning is the packing/unpacking of SW packages. Commonly, gzip or bzip2 is used to unpack data and packages, but neither is a parallel program. Fortunately, a parallel version of bzip2 is available. "pbzip2" allows you to specify the level of parallelism in order to speed the compression/decompression process.

I spent a little time experimenting with the pbzip2 program after repeated interactions that always seemed to come back to "gzip" performance. I decided to do some quick benchmarks with pbzip2 using both the T2000 (8 cores @ 1.4GHz) and the v20z (2 AMD cores @ 2.2GHz).

pbzip2 benchmark

The setup used a 135M text file. This file was the trade_history.txt created using the egen program distributed by the tpc council for the TPC-E benchmark. This file was compressed using the following simple test script:
    # Time compression and decompression at each thread count.
    for i in 1 2 4 8 16 32
    do
      print "pbzip2 compress: ${i} threads\n"
      timex pbzip2 -p${i} small.txt
      print "pbzip2 decompress: ${i} threads\n"
      timex pbzip2 -d -p${i} small.txt.bz2
    done
[Figure: T2000 pbzip2 throughput]

At lower thread counts, the v20z with two AMD cores does better. This is expected since the AMD x64 processor is optimized for single-threaded performance. But you can see that as you crank up the thread count, the T2000 starts to really shine. This demonstrates my main point: to push massive throughput within a single application, you need lots of threads and parallelism.
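
To make the idea of block-level parallel compression concrete, here is a minimal Python sketch in the same spirit: the input is cut into independent blocks, each block is compressed on its own worker, and the compressed streams are concatenated. This is not pbzip2's code; the function names and the 900,000-byte block size are assumptions for the example, and threads are used only for brevity.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bzip2(data, threads, block_size=900_000):
    """Compress independent blocks in a thread pool and concatenate them."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so the streams stay in sequence.
        streams = list(pool.map(bz2.compress, blocks))
    return b"".join(streams)


def decompress(data):
    """Decompress the result; bz2.decompress accepts concatenated streams."""
    return bz2.decompress(data)


payload = b"trade history line\n" * 200_000
packed = parallel_bzip2(payload, threads=4)
print(len(payload), len(packed), decompress(packed) == payload)
```

Because every block is a self-contained bzip2 stream, the amount of work that can proceed at once grows with the number of blocks, which is why larger files and higher thread counts favor a machine with many hardware threads.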

    ...The next entry will explore how concurrency and parallelism can help improve build times.

Tuesday Jan 08, 2008

Throughput computing series: Defining Throughput Computing

This is the first installment in a series of entries that discuss different aspects of throughput computing. This series aims to improve the understanding of how SPARC CMT servers can be utilized to increase business throughput. Let's start with a definition.

What is throughput computing?

Oxford American defines "throughput" as:
    "The amount of materials or items passing through a system or process"
In computer terms, throughput computing is the amount of "work" that can be done in a given period of time. Things like "orders per second", "paychecks per hour", "queries per second", "webpages per minute",... are all metrics of throughput. These measures help define the amount of work a system can complete in a given period of time.

Misguided throughput metrics

  • Latency or Response time is not a throughput metric.
  • CPU % is not a throughput metric.
  • IO wait% is definitely not a throughput metric... or anything other than a measure of idle time :)
  • The "Load average" of a system is not a measure of throughput. OK... you get the idea.

Job level parallelism

Job level parallelism is about taking a single job and breaking it into multiple pieces. Say you have 10,000 letters to put stamps on. If it takes 3 seconds per letter, you would need 30,000 seconds or more than 8 hours to complete the task. Now consider you are a teacher and you bring the letters to class. There are 20 students in the class so each student will place stamps on 500 letters. With only 500 letters to complete per student, the job can be done in only 1500 seconds or 25 minutes.

In terms of throughput, one person processes one letter every 3 seconds or 60/3 = 20 letters per minute... and a class of 20 students can process 20*20 = 400 letters per minute.
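
The stamp example reduces to two small formulas, sketched here in Python. The function names are made up for the illustration, and the model assumes the letters divide evenly among workers with no coordination overhead.

```python
import math


def elapsed_seconds(items, seconds_per_item, workers):
    """Elapsed time when items are split evenly across workers."""
    items_per_worker = math.ceil(items / workers)
    return items_per_worker * seconds_per_item


def items_per_minute(seconds_per_item, workers):
    """Aggregate throughput of all workers, in items per minute."""
    return workers * 60 / seconds_per_item


print(elapsed_seconds(10_000, 3, 1), elapsed_seconds(10_000, 3, 20))
print(items_per_minute(3, 1), items_per_minute(3, 20))
```

Note that each individual letter still takes 3 seconds; only the total elapsed time and the aggregate rate improve as workers are added.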


Concurrency

Concurrency comes from running multiple jobs or applications together on a system. A job may be single-threaded or use multiple threads of execution as discussed above. These jobs need not be related or even from the same application. To further increase concurrency, virtualization is often used to run multiple concurrent OS images on the same machine in order to take advantage of modern multi-threaded systems.

Putting it all together with Chip Multi-Threading

Denis Sheahan sums up Sun's throughput computing initiative in his paper on CMT as:
    "Sun’s Throughput Computing initiative represents a new paradigm in system design, with a focus toward maximizing the overall throughput of key commercial workloads rather than the speed of a single thread of execution. Chip multi-threading (CMT) processor technology is key to this approach, providing a new thread-rich environment that drives application throughput and processor resource utilization while effectively masking memory access latencies."
The salient point is that you must have an application that has multiple threads of execution in order to take advantage of CMT. Multiple threads of execution could come from a single job that has been parallelized or from multiple jobs of different types that run concurrently.


Monday Jan 07, 2008

Throughput computing series... getting the most out of your SPARC CMT server.

I was thinking about the development of a CMT throughput benchmark, but it occurred to me that there are many *good* examples of throughput already out with the benchmarks we publish... just look at the bmseer postings on the Recent T2 results and the long line of performance records on the T2000.

The biggest disconnect with CMT servers is a misunderstanding of throughput and multi-threaded applications. I made a posting last year which touched on some initial impressions, but I thought it would be a good idea to dig in further.

This entry is to kick off a series of postings that explore different aspects of throughput computing in a CMT environment. The rough outline is as follows:
    • Definition of throughput computing, multi-threading, and concurrency
    • Explore system parallelism
        • Unix commands and parallel options
        • Concurrent builds/compiles
        • Configuring the system for parallelism
    • Configuring applications for parallelism
        • Concurrency vs multi-threading
        • Single-threaded jobs
    • Database parallelism with Oracle
        • Parallel loader and datapump
        • Index build parallelism
        • Concurrent processing in Oracle
        • Configuring Oracle for CMT servers

Friday Sep 28, 2007

PGstatspack - Getting at Postgres performance data.

I thought I posted this a while ago... Maybe a blog bug?
I have been working with Oracle for the past 18 years, mostly in the performance arena. Last year, I began working with Postgres as well. Being a performance guy, I naturally was looking at how to get at the performance data necessary to tune the database for maximum performance. To my surprise, little existed in the way of performance tools for Postgres. I was looking for the "Statspack" or "AWR" report for Postgres. I found several one-off tools but nothing that provided a "Load Profile" like Statspack.

PG_STAT* tables... V$ tables in disguise

Postgres has a series of tables that are essentially counters like the V$ tables. They record the counts of things like:
  • committed transactions
  • rolled back transactions
  • tuples accessed
  • tuples inserted
  • blocks read
  • block hits
  • tuples accessed by table and index
  • physical reads by table and index

Creating a prototype

I fashioned the prototype after Oracle's Statspack. I created a simple schema where I essentially duplicated the PG_STAT* tables and added a key for the snapshot. There is also a management table "pgstatspack_snap" which stores the snapid, timestamp, and a short description.
To keep with the Statspack-like theme, a simple PL/pgSQL function was created to take snapshots:
      SELECT pgstatspack_snap('My test run');

Creating pgstatspack reports

Now *all* you have to do is create the reports. I have created a simple report that gets at the heart of what is encapsulated in the "Load Profile" section of the Statspack. Additionally, I have profiled some of the table objects in terms of access, IO, etc. The report essentially does a diff of the counters between the two snap intervals. Time data is applied to calculate the per-second rates.
This is meant to be a launch pad for experimentation. Hopefully, you will find it interesting. The prototype package and report can be downloaded here: pgstatspack.tar.gz
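
The diff-and-divide step can be sketched in a few lines of Python. This is not the pgstatspack report code: the counter names are borrowed from the pg_stat_database view, while the snap_time field (seconds), the load_profile name, and the exact formulas for tps, hitrate, lio_ps, and rd_ps are assumptions chosen to resemble the report columns.

```python
def load_profile(begin, end):
    """Diff two counter snapshots and turn the deltas into per-second rates."""
    seconds = end["snap_time"] - begin["snap_time"]
    delta = {k: end[k] - begin[k] for k in begin if k != "snap_time"}
    # Logical IO is assumed to be buffer hits plus blocks read from disk.
    logical = delta["blks_hit"] + delta["blks_read"]
    return {
        "tps": (delta["xact_commit"] + delta["xact_rollback"]) / seconds,
        "hitrate": 100.0 * delta["blks_hit"] / logical if logical else 0.0,
        "lio_ps": logical / seconds,
        "rd_ps": delta["blks_read"] / seconds,
    }


# Two made-up snapshots taken 100 seconds apart.
snap1 = {"snap_time": 0, "xact_commit": 1000, "xact_rollback": 10,
         "blks_read": 500, "blks_hit": 9500}
snap2 = {"snap_time": 100, "xact_commit": 3000, "xact_rollback": 10,
         "blks_read": 1500, "blks_hit": 28500}
print(load_profile(snap1, snap2))
```

Because the underlying counters only ever increase, the absolute values in a single snapshot say little; it is the delta over a known interval that produces a usable load profile.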

$ 1 2 

 database  |  tps   | hitrate | lio_ps  |  rd_ps  | rows_ps  | ins_ps | upd_ps | del_ps 
 igen      | 169.55 |   94.00 | 3909.70 |  211.15 | 23543.05 |  50.87 |  46.74 |   0.00 
 tpce      |   0.04 |    0.00 | 2310.97 | 2307.90 |     0.65 |   0.01 |   0.00 |   0.00 
 postgres  |   0.03 |   99.00 |    1.86 |    0.00 |     0.44 |   0.00 |   0.00 |   0.00 
 template1 |   0.00 |    0.00 |    0.00 |    0.00 |     0.00 |   0.00 |   0.00 |   0.00 
 template0 |   0.00 |    0.00 |    0.00 |    0.00 |     0.00 |   0.00 |   0.00 |   0.00 
(5 rows)

MOST ACCESSED TABLES by pct of tuples: igen database
    table     | tuples_pct | tab_hitpct | idx_hitpct | tab_read | tab_hit | idx_read | idx_hit 
 order_125    |         45 |         91 |         77 |    67566 |  698578 |    58050 |  202950
 product_125  |         42 |         99 |         99 |       82 |  120060 |       30 |  127345
 industry_125 |         10 |         99 |          0 |        1 |   22409 |        0 |       0
 customer_125 |          1 |         94 |         99 |    34978 |  657096 |     6858 | 1032477

Note: This prototype is built on top of the 8.3 version of Postgres. Some modification would be required to use it on other versions of Postgres.

Thursday Aug 16, 2007

Getting past GO with Sparc CMT

The Sparc T1/T2 chip is a lean mean throughput machine. Load on DB users, application servers, JVMs, etc. and this chip begins to hum. While the benchmark proof points are many, there still seem to be misconceptions about the performance of this chip.

I have run across several performance evaluations lately where the T2000 was not being utilized to its full potential. The story goes like so...

System Administrators First impressions

Installing SW and configuring Solaris seems a little slow compared to the V490's we have sitting in the corner. But, this is not a show stopper - just an observation. The system admin presses on and preps the machine for the DBA.

DBAs First Impressions

After the OS is installed, the machine is turned over to the DBAs to install and configure Oracle. The DBA notices that, compared to the v490, Oracle installation is taking about twice as long. They continue to configure and begin loading an export file from the production machine. Again, this too is slower than the v490. Thinking something is wrong with the OS/HW, the DBA now contacts the system administrator.

Fanning the fire

At this point, the DBA and system admin have an "ah-ha" moment and begin to speculate that something is awry. The system admin "times" some simple unix commands. "tar", "gzip", "sort", etc. all seem slower on the T2000. The DBA does a few simple queries... again slower. What gives? Google fingers are warmed up, blogs are consulted, best practices are applied, and the results are unchanged.

Throughput requires more than one thing

The DBA and system admin have fallen into the trap of not testing the *real* application. In the process of setting up the environment, the single-threaded jobs to install and configure the SW and load the database are slower. But that is not the application. The real application is an online store with several hundred users running concurrently. Now we are getting somewhere.

Throughput, Throughput, Throughput!

Finally, the real testing begins. Load generators are broken out to simulate the hundreds of users. After loading up the system, it is found that the T2000 DB server can handle about 2x the number of Orders/Sec as the V490! Wait a minute. "gzip" is slower but this little chip can process 2x the Orders? That's what CMT is all about... throughput, throughput, throughput!

Friday Jun 29, 2007

Swingbench Order-Entry doesn't scale-up with equal load

In my previous post, I pointed out some considerations for deploying the Swingbench Order-Entry benchmark on large systems. The main bottleneck in this case was the database size. When a database that is too small is driven to huge transaction rates, concurrency issues in the data prevent scaling. Luckily, Swingbench has a way to adjust the number of "Users" and "Orders"... or so it would seem.

Adjusting Users and Orders
I used the "oewizard" utility to create the maximum number of customers and orders - 1 million each. This created a database that was about 65GB total. The "oewizard" is a single threaded process and therefore takes a little time... Be patient. After doing my 1st run, I was a little concerned at the difference in performance.

Scale-up differences
In the real world, as database size grows, transactions often bloat. This is often noticed by enterprising DBAs and performance analysts. Eventually, this will lead to a re-coding of SQL or some changes in the transaction logic. So as a real-world database scales up, it will go through a series of bloating and fixing.

When designing a benchmark to show scale-up and make comparisons of systems at various database sizes, it is desirable to ensure transactions are presented with a similar load. If this is not the case, it should be noted and comparisons should NOT be made across database sizes. The "Order Products", "New Registration", and "Browse Order" transactions which are part of the SwingBench Order-Entry test, all experience transaction bloat as the database size is increased.

The following response time chart shows the effects of "one" user running on databases of 25,000 and 1,000,000 orders.

The moral: beware of comparing results across differing database sizes using the Swingbench default Order-Entry kit.

Monday Jun 11, 2007

Swing and a miss... Benchmarking is not *that* easy.

I applaud tools that aim to make life easier. The cell phone is a wonderful invention, and combined with my palm pilot it was even better. Now Apple has taken it a step further with music, movies, and internet and birthed the iPhone - nicer still!

Over the past year, I have been seeing more and more IT shops experiment with benchmark tools. One such tool is a kit developed by Dominic Giles of Oracle called Swingbench. Swingbench is a benchmark toolkit that is easy to install and run. Now the DBA can install the benchmark schema and with a few clicks... Wham they are benchmarking! Now comes the hard part - What do these results mean?

After about the 4th call from a customer having performance issues with their application "Swingbench", I was compelled to take a deeper look.

Luckily, all of the performance problems were easily solved by someone who benchmarks for a living. They were typically misconfiguration issues like: filesystem features, lack of IO, lack of memory, too small of a dataset, etc. The scary part: these situations all used the supplied "demo" schemas.

By perusing the Swingbench documentation, I saw that the demo schemas top out at a 100GB database size. This is also alarming. Most IT shops that buy servers or deploy multi-node RAC configurations have more disk than the modern laptop. So you can imagine my surprise when I saw a bake-off of an enterprise class machine that is essentially doing no IO and choking to death on latches... simply the wrong test for the environment.

Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
latch free                                      4,542,675   1,137,914    79.04
log file sync                                     242,359     164,671    11.44
buffer busy waits                                 102,540      61,887     4.30
enqueue                                            35,142      42,498     2.95
CPU time                                                       25,310     1.76

Benchmarking is simply not *that* easy. It takes time to scale up a workload that can simulate your environment. No question that Swingbench gives you a nice head start. It allows you to encapsulate your transactions and run simple regression tests, but you have to take the time to customize the kit to include your data and transactions. The demo schemas are simply a starting point.

Wednesday Nov 15, 2006

Where do you cache Oracle data?

Using the filesystem page cache in combination with the Oracle buffer cache for database files was commonplace before 64-bit databases were prevalent - machines had a lot of memory and databases could not use more than 4GB. Now, after many years of 64-bit databases, a fair number of systems still use buffered IO via the filesystem or cached QIO. While buffered IO used to provide a benefit, it can cause substandard performance and impede scaling on modern large-scale systems. Buffered file system issues include:
  • Single-writer lock
  • Memory fragmentation: 8k blocks instead of 4M or 32M ISM.
  • Double buffering doubles memory bandwidth requirements!
  • Segmap thrashing... lots of xcalls!

2x throughput increase with Oracle Caching vs OS buffered IO

A quick experiment was conducted on Oracle 10gR2 on a large memory machine (192GB).
  • 1st test: DB cache was set to 1GB and the database was mounted on a buffered file system.
  • 2nd test: DB cache was set to 50GB and the database was mounted direct - NOT buffered.
A 46GB table was populated, indexed, and then queried by 100 processes each requesting a range of data. A single row was retrieved at a time to simulate what would happen in an OLTP environment. The data was cached so that no IO occurred during any of the runs. When the dust had settled, the Oracle buffer cache provided a 2X speedup over buffered file systems. There was also a dramatic decrease in getmaps, xcalls, and system CPU time. The table below shows the results.

Notice that cross calls for Solaris 10 with the FS cache have been nearly eliminated while the getmaps have increased in proportion to throughput. This is due to the elimination of the xcalls associated with the getmap operation. That said, the mild improvement in throughput with S10 on filesystems is nothing like the 2x improvement achieved by avoiding buffered IO altogether.

Recognizing buffered IO impact

A higher amount of system CPU time can be observed at a high level. It is not uncommon to see a usr/sys ratio of 1 or less on systems where buffered IO is in use. This is due to the high number of getmap reclaims and cross-calls (xcalls). You can observe cross-calls with the mpstat(1M) command. Segmap activity is best observed using the segmapstat utility, which is part of the cachekit utilities. The segmapstat utility polls "kstats" to retrieve hit/miss data in an easy to read format. If you are using Solaris 10, the impact due to cross-calls is less, but segmap activity is still visible.

Finally, it would be nice to be able to see the amount of data in the page cache. If you are on Solaris 8, you will need to install the memtool 8 written by Richard McDougall. If you are on Solaris 9 or greater, you can use mdb(1) with the ::memstat command. Beware, this command will take a long time to run and may affect performance, so it is best to run it when the system is not busy.

# mdb -k
Loading modules: [ unix krtld genunix ip usba wrsm random ipc nfs ptm cpc ]
> ::memstat
Page Summary                Pages                MB  %Tot 
------------     ----------------  ----------------  ---
Kernel                     430030              3359    2% 
Anon                       805572              6293    3% 
Exec and libs                9429                73    0% 
Page cache               14974588            116988   52% 
Free (cachelist)          2547680             19903    9% 
Free (freelist)           9853807             76982   34% 
Total                    28621106            223602 
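
If you capture this output regularly, a few lines of script can pull out the page cache figure. The Python below is only a sketch: parse_memstat and pages_to_mb are hypothetical helper names, the sample text is abbreviated from the output above, and the 8 KB page size is an assumption that happens to reproduce the MB column shown.

```python
PAGE_SIZE_KB = 8  # assumption: 8 KB base pages

# Abbreviated copy of the ::memstat output shown above.
SAMPLE = """\
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ---
Kernel                     430030              3359    2%
Exec and libs                9429                73    0%
Page cache               14974588            116988   52%
Total                    28621106            223602
"""


def parse_memstat(text):
    """Return {category: pages} from ::memstat 'Page Summary' output."""
    pages = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows end with "<pages> <MB> <pct>%"; the Total row has no percent.
        if len(fields) >= 4 and fields[-1].endswith("%") and fields[-3].isdigit():
            pages[" ".join(fields[:-3])] = int(fields[-3])
        elif len(fields) == 3 and fields[0] == "Total":
            pages["Total"] = int(fields[1])
    return pages


def pages_to_mb(pages):
    """Convert a page count to whole megabytes."""
    return pages * PAGE_SIZE_KB // 1024


summary = parse_memstat(SAMPLE)
cache_mb = pages_to_mb(summary["Page cache"])
cache_pct = round(100 * summary["Page cache"] / summary["Total"])
print(cache_mb, cache_pct)
```

A page cache that holds half of physical memory on a database server, as in this example, is a strong hint that datafiles are being read through buffered IO.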

How do you avoid using buffered IO?

The easiest way to avoid using the OS page cache is to simply use RAW partitions. This is commonly done in combination with SVM or VxVM. More recently, Oracle introduced their own volume manager (ASM) which makes use of async IO and eases the administration of Oracle databases. That said, databases on RAW partitions are not for everyone. Often users prefer to use standard OS tools to view and manipulate database files in filesystems.

Most filesystems have ways of bypassing the OS page cache for Oracle datafiles. UFS, QFS, and VxFS all support mounting filesystems to bypass the OS page cache - the only exception is ZFS which doesn't allow for direct or async IO. Below, methods for disabling buffered IO with filesystems are discussed.

FILESYSTEMIO_OPTIONS=SETALL (Oracle 9i and greater) init.ora parameter

The first step to avoiding buffered IO is to use the "FILESYSTEMIO_OPTIONS" parameter. When you use the "SETALL" option, this sets all the options for a particular filesystem to enable directio or async IO. Setting FILESYSTEMIO_OPTIONS to anything other than "SETALL" could reduce performance. Therefore, it is a best practice to set this option.

UFS and directio

With UFS, the only way to bypass the page cache is with directio. If you are using Oracle 9i or greater, then set the FILESYSTEMIO_OPTIONS=SETALL init.ora parameter. This is the preferred way of enabling directio with Oracle. With this method, Oracle uses an API to enable directio when it opens database files. This method allows you to still use buffered IO for operations like backup and archiving. If you are using Oracle 8i, then the only way to enable directio with UFS is via the forcedirectio mount option.

VxFS with QIO

VxFS has several options for disabling buffered IO. Like UFS, VxFS does support directio but it is not as efficient as Quick IO (QIO) or Oracle Disk Manager (ODM). With VxFS, async IO is possible with QIO or ODM. Data files for use with QIO must be created with a special utility or converted to the QIO format. With QIO you have to be careful that the "cached" QIO option is not enabled. With the cached QIO option, blocks of selected data files will be placed in the OS page cache.

VxFS with ODM

Like QIO, ODM uses async IO. ODM uses an API specified by Oracle to open and manipulate data files. ODM lowers overhead in large systems by sharing file descriptors and eliminating the need for each Oracle shadow/server process to open and obtain its own file descriptors.

Convincing Oracle to cache table data

Finally, after all this is done Oracle still may not properly cache table data. I have seen more than a few persons enable "directio" and increase the SGA only to have response time of their critical queries take longer! If a table is too large or the "cache" attribute is not set, Oracle will not attempt to cache tables when scanning. This is done to avoid flooding the Oracle buffer cache with data that will most likely not be used. Luckily, there is an easy way to correct this behavior by setting the "CACHE" storage parameter on a table.

  SQL> alter table BIGACTIVETABLE cache;

Finally, you may need to convince some of the Oracle DBAs of the benefit. DBAs look at Oracle performance data from an Oracle-centric point of view. When data such as Oracle's Statspack is analyzed, some pretty awesome response times can be seen. Wait events for IO such as "db file sequential read" and "db file scattered read" can show response times of less than 1ms when reading from the OS page cache. Often when looking at such data, DBAs are reluctant to give up this response time. This should be viewed as an opportunity to further improve performance by placing the data in the Oracle buffer cache and avoiding the reads altogether.

Summary and references

Hopefully this has given you some background on why unbuffered IO is so critical to obtaining optimal performance with Oracle. It is far more efficient to obtain Oracle blocks from the database buffer cache than to go through the OS page cache layers.

Saturday Nov 11, 2006

NAS architecture presentation at OSWOUG

I had the pleasure of hearing an old Sequent friend Kevin Closson speak about NAS architecture at a recent OSWOUG meeting. It was an interesting and energetic discussion on the direction of NAS in commodity servers. If you are interested at all in the direction of storage technology for databases, you should check out Kevin's blog and paper on this technology.

Monday Oct 30, 2006

Dueling DUAL with BEA Weblogic and TestConnectionsOnReserve.

You would think that the "DUAL" table, a simple stub table, would not be a performance topic - but I have seen this for years on high-end benchmarks. People develop applications or tests for applications which tend to over-use the DUAL table. Most commonly, this comes in the form of "select abc.nextval from DUAL" and "select sysdate from DUAL". This is typically not a problem for small servers with a low level of concurrency, but it can be a bottleneck on high-end servers with lots of processors.

The problem with DUAL (in Oracle 9i and below) is that this "fake table" hashes to a "real" cache line :) If over-used, it can cause "cache buffers chains" latch contention like crazy. The most dangerous over-use situations are systemic ones. I can get around these issues in most benchmark environments, but cringe when I see the embedded use of DUAL.

In BEA WebLogic, there is a parameter called "TestConnectionsOnReserve". This parameter sends a SQL statement to the database before *EVERY* user statement.... talk about overhead! This not only adds SQL*Net round trips, increasing network use, but most commonly uses "SQL SELECT 1 from DUAL" as the test statement :) What is worse, the overhead just continues to increase as the load is increased. Ken Gottry discusses the performance impact in an article he wrote. This study used a 2-way server to show the performance impact. It is much worse on a high-end server.

What can you do?

Avoid setting TestConnectionsOnReserve within BEA. The performance cost in terms of potential latch contention and network overhead is too high. If you must use this parameter, use the "X$DUAL" table instead. Oracle 10g uses this by default, and while it avoids the latching issues, the networking component of this parameter is still present.

Wednesday Aug 16, 2006

Solaris Applications Specific Tuning Wiki

As part of the Second Edition of the famous Solaris Internals and the new Solaris Performance and Tools book, a performance tuning Wiki has been created. This site is meant to be a living document where best practices, tuning information, and tips are collected.

I have begun contributing Oracle performance information to the Solaris applications specific tuning Wiki. I hope you enjoy this repository of information regarding performance on Sun systems.

Friday Aug 04, 2006

High "times()" syscall count with Oracle processes

"Why does Oracle call times() so often? Is something broken? When using truss or dtrace to profile Oracle shadow processes, one often sees a lot of calls to "times". Sysadmins often approach me with this query.

root@catscratchb> truss -cp 7700
syscall               seconds   calls  errors
read                     .002     120
write                    .008     210
times                    .053   10810
semctl                   .000      17
semop                    .000       8
semtimedop               .000       9
mmap                     .003      68
munmap                   .003       5
yield                    .002     231
pread                    .150    2002
kaio                     .003      68
kaio                     .001      68
                     --------  ------   ----
sys totals:              .230   13616      0
usr time:               1.127
elapsed:               22.810

At first glance it would seem alarming to have so many times() calls, but how much does this really affect performance? This question can best be answered by looking at the overall "elapsed" and "cpu" time. Below is output from the "procsystime" tool included in the DTrace Toolkit.

root@catscratchb> ./procsystime -Teco -p 7700
Hit Ctrl-C to stop sampling...
Elapsed Times for PID 7700,
         SYSCALL          TIME (ns)
            mmap           17615703
           write           21187750
          munmap           21671772
           times           90733199       <<== Only 0.28% of elapsed time
          semsys          188622081
            read          226475874
           yield          522057977
           pread        31204749076
          TOTAL:        32293113432

CPU Times for PID 7700,
         SYSCALL          TIME (ns)
          semsys            1346101
           yield            3283406
            read            7511421
            mmap           16701455
           write           19616610
          munmap           21576890
           times           33477300         <<== 10.6% of CPU time for the times syscall
           pread          211710238
          TOTAL:          315223421

Syscall Counts for PID 7700,
         SYSCALL              COUNT
          munmap                 17
          semsys                 84
            read                349
            mmap                350
           yield                381
           write                540
           pread               3921
           times              24985    <<== 81.6% of syscalls.
          TOTAL:              30627

According to the profile above, the times() syscall accounts for only 0.28% of the overall elapsed time in system calls, but it does use 10.6% of sys CPU. The usr/sys CPU percentages are "83/17" for this application, which is consistent with the truss summary above (1.127 s usr versus .230 s sys). So, using the 17% for system CPU, we can calculate the overall amount of CPU for the times() syscall: 100*(.17*.106) = 1.8%.
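The percentages quoted above can be re-derived directly from the procsystime totals. The following is a minimal sketch in Python; the variable names are my own, and the 0.17 system-CPU fraction is simply the "83/17" split quoted in the text, not something measured by the script.

```python
# Values copied from the procsystime output above (times() row and TOTAL row).
elapsed_ns = {"times": 90733199, "total": 32293113432}
cpu_ns = {"times": 33477300, "total": 315223421}
calls = {"times": 24985, "total": 30627}

# Fraction of total CPU spent in system mode, from the 83/17 usr/sys split.
SYS_CPU_FRACTION = 0.17


def share(part, total):
    """Return part as a percentage of total."""
    return 100.0 * part / total


elapsed_pct = share(elapsed_ns["times"], elapsed_ns["total"])
sys_cpu_pct = share(cpu_ns["times"], cpu_ns["total"])
count_pct = share(calls["times"], calls["total"])

# times() share of *all* CPU = (sys share of CPU) * (times share of sys CPU)
overall_cpu_pct = SYS_CPU_FRACTION * sys_cpu_pct

print(f"times() share of syscall elapsed time: {elapsed_pct:.2f}%")
print(f"times() share of sys CPU time:         {sys_cpu_pct:.1f}%")
print(f"times() share of syscall count:        {count_pct:.1f}%")
print(f"times() share of overall CPU:          {overall_cpu_pct:.1f}%")
```

The point of the last line is that a syscall can dominate the call count and still be a small slice of total CPU once the usr/sys split is taken into account.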

Oracle uses the times() syscall to keep track of timed performance statistics. Timed statistics can be enabled or disabled by setting the init.ora parameter "TIMED_STATISTICS" to TRUE or FALSE. In fact, it is an *old* benchmark trick to disable TIMED_STATISTICS after all tuning has been done. This is usually good for another 2% in overall throughput. In a production environment, it is NOT advisable to ever disable TIMED_STATISTICS. These statistics are extremely important for monitoring and maintaining application performance. I would argue that disabling timed statistics would actually hurt performance in the long run.

Tuesday Jul 11, 2006

Threshold login triggers for Oracle 10046 event trace collection

There are multiple ways to gather trace data. You can instrument the application, pick an Oracle sid from sysdba, turn on tracing for all users (ouch), or use a login trigger to narrow down to a specific user. Each of these methods has merit, but recently I wanted to gather traces at various user levels.

The problem with most packaged applications is that they all use the *same* userid. For this Oracle 10g environment, I used this fact to filter only connections of the type that I wanted to sample. I wanted to gather 10046 event trace data when the number of connections was 10, 20, or 30. To achieve this, I used a logon trigger and sampled the number of sessions from v$session to come up with the connection count. I have found this little trick to be very useful in automating collection without modifying the application. I hope this can be useful to you as well.

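create or replace trigger trace_my_user
  after logon on database
declare
  mycnt  int;
begin
  -- Count the sessions currently connected as the application user
  SELECT count(*)
    INTO mycnt
    FROM v$session
   WHERE username='GLENNF';

  -- Trace only the application user, and only at the chosen thresholds
  if (user='GLENNF') and ((mycnt=10) or (mycnt=20) or (mycnt=30)) then
    -- NOTE: the body of this branch was missing from the listing; the
    -- statement and trace level (12) below are an assumption
    execute immediate
      'alter session set events ''10046 trace name context forever, level 12''';
  end if;
end;
/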
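
To illustrate which logons the threshold test would select, here is a small Python simulation of the trigger's condition. It is only a sketch: the function names are hypothetical, and it assumes sessions accumulate one at a time with no logoffs and that the session count seen by the trigger includes the session that is logging on.

```python
# Thresholds and user name taken from the trigger above.
TRACE_THRESHOLDS = {10, 20, 30}
TARGET_USER = "GLENNF"


def should_trace(user, session_count):
    """Mirror the trigger's test: target user AND count exactly at a threshold."""
    return user == TARGET_USER and session_count in TRACE_THRESHOLDS


def simulate_logons(n, user=TARGET_USER):
    """Return the session counts at which a new logon would be traced,
    assuming n logons in a row and no logoffs in between."""
    return [count for count in range(1, n + 1) if should_trace(user, count)]


traced = simulate_logons(35)
print(traced)  # [10, 20, 30]
```

Because the test is an equality check against the current count, a workload where sessions disconnect and reconnect could cross the same threshold more than once and so trace more than one session per level.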

This blog discusses performance topics as running on Sun servers. The main focus is in database performance and architecture but other topics can and will creep in.

