Did you ever have a performance problem and not know where to start looking? Well, who didn't? In such a situation, it is important to perform a proper monitoring of the system and resist the immediate urge to blame the processing power --the storage subsystem is often overlooked. Now that I spoiled the suspense giving away the answer, give me 5 minutes to illustrate this with a typical example. Why typical? Because this is a case I meet quite often with startup companies who rightly concentrate on their core business and not on their IT infrastructure --that's what we are here for.
One of our partners, Squid Solutions, recently reported a performance problem with an Oracle database; they had engaged our team as part of the Sun Startup Essentials support program. Squid Solutions makes a software called Nautilus that performs intensive analytics on database systems, whatever their kind or size. They call Nautilus an SQL Knowledge Engine because it models data and business knowledge, and then automatically generates SQL code to execute the data processing tasks. Nautilus is sold as a service performed by Squid's engineers as Customer Intelligence projects.
So, Squid Solutions had purchased a brand-new Sun
Fire X2250 server --quad-core Intel Xeon with 4 GB of RAM, running Solaris 10--, and was experiencing poor performance --much lower than their old system of previous generation-- when executing a read-n-write intensive workload on the database. When I first got on the server, something jumped right to my eyes; the server used a single hard drive for all of the following: the operating system; the SWAP files; Oracle binaries; Oracle redo logs; Oracle table spaces
This basically means
Monitoring the System
We ran the Squid Solutions
test case on the Oracle database for 10 minutes and used the
dim_STAT tool to collect and
analyze all the necessary system statistics: CPU, RAM and I/O. dim_STAT is a great performance monitoring and analysis tool
developed by our colleague Dimitri Kravtchuk at the Sun Solution Center in Paris: see Solaris Performance Primer: dimSTAT, dimSTAT by examples and Customizing db_STRESS for some detailed blogging on dim_STAT from our team.
The CPUs are hardly used;
Finally the I/O analysis shows the problem; the disk is clearly the bottleneck here. Throughout the test, the single disk is
more than 60% busy, this figure even goes up to 85% at the end of the
test. Also the wait percentage that should be close to zero, is
around 10% and even peaks at 20% once in a while. This test confirms our suspicion concerning the
Better performance can be achieved by isolating Oracle's redo logs on a separate drive. By doing so, we will no longer have the share the IOPS with other reads or writes. But it will not be enough; we will still be
limited by the maximum number of IOPS possible per drive. The industry solution to this problem is RAID storage; by stripping the Oracle redo logs over multiple disks, you then have 200x the number of drives of possible IOPS.
A budget configuration could consist of a Sun
Storage J4200, a simple JBOD array, combined to a SAS
RAID HBA to your server to compensate the lack of a RAID hardware controller in the JBOD --and thus of memory cache. This adapter
offers up to 256 MB DDR2 cache memory. The cache is important for
performance because Oracle does not have to wait until the redo log
in actually written to the hard drive. And you have a 1000x
better performance writing to RAM than disk! The RAID controller
will guarantee that what it has in cache will eventually be written
A better option for higher performance, reliability and flexibility is an external storage array for an integrated RAID controller such as the Sun
StorageTek 2510. It is an entry-level SAS storage, that requires a SAS controller on the server that connects to it. The array can host up to 48 SAS or SATA drives, and can be controlled by one or
two SAS RAID controllers. Each controller has a 512 MB cache size, 2-to-4x more cache than
the previous solution.
Would you like to monitor your own system
to check where the bottleneck is? As mentioned earlier, I used a tool called
dim_STAT to collect and
analyze all the necessary system statistics. You can
download and use it for free on Solaris SPARC, Solaris x86 and
Linux. Features of
dim_STAT include: history of system statistics saved
That said, you can already get useful
statistics from your Solaris system out of the box, without installing
dim_STAT --dim_STAT fully relies in fact on the Solaris native tools. The three most used are
vmstat for virtual memory and CPU
iostatfor IO and CPU utilization
iostatfor the disk statistics.Using these tools --and most importantly interpreting the output-- is a skill on it own though. Better than reading through the man pages, I recommend you read the Solaris
Oh, by the way! I almost forgot ! So why was our partner's old server
doing better than the new one? Because it had two disks, and the Oracle redo
logs and data files were on separate drives.
If there is any conclusion to draw from this: