Alejandro Vargas' Blog

  • February 8, 2009

Useful Scripts

Alejandro Vargas
Technical Leader, ACS Global Delivery, Infrastructure & BigData

- Service level objectives setup

- Have the DBA speak the same language as the storage admin

- Throughput: the measure of bits transferred across the media

Throughput cannot be faster than the slowest link on the path from source to destination.

Take into account that a 2 Gbit HBA does not provide 2 GBytes/s of throughput, but 256 MBytes/s.

To convert 2 Gbits into MBytes:

2 Gbits = (2 * 1024 * 1024 * 1024) bits = 2,147,483,648 bits

2,147,483,648 bits / (1024 * 1024) = 2048 Mbits

2048 Mbits / 8 bits per byte = 256 MBytes
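The same conversion as a small helper; a Python sketch (the function name is mine):

```python
def gbits_to_mbytes(gbits: float) -> float:
    """Convert a link rate in Gbits/s to MBytes/s (binary units)."""
    bits = gbits * 1024 * 1024 * 1024   # Gbits -> bits
    mbits = bits / (1024 * 1024)        # bits  -> Mbits
    return mbits / 8                    # Mbits -> MBytes

# A 2 Gbit HBA yields 256 MBytes/s, not 2 GBytes/s.
print(gbits_to_mbytes(2))  # 256.0
```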

Always size for throughput

- Good-put: the measure of the usable amount of data transferred; schedule appropriately.

Measure the transfer of usable data. That measure is known as good-put.

Good-put is the measure of usable data transferred over a given period of time.

So even though you have great throughput, your good-put may not be as good. This is because your queries are not well tuned (though they may show good response times) and thus request more data than necessary. Good-put is a measurement of resource efficiency.
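As a back-of-the-envelope illustration (the numbers are made up, not from any measurement), good-put can be expressed as a simple efficiency ratio:

```python
def goodput_efficiency(usable_bytes: float, transferred_bytes: float) -> float:
    """Good-put efficiency: usable data as a fraction of total data moved."""
    return usable_bytes / transferred_bytes

# An untuned query that reads 500 MB to return the 50 MB it actually needed
# runs at 10% good-put, even if raw throughput looks great.
print(goodput_efficiency(50e6, 500e6))  # 0.1
```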

- IOPS: I/O operations per second (OLTP)

set lines 250
set pages 9999
spool sysmetric_outp.log
alter session set nls_date_format='dd-mm-yyyy hh24:mi';
select min(begin_time), max(end_time),
       sum(case metric_name when 'Physical Read Total Bytes Per Sec' then average end) Physical_Read_Total_Bps,
       sum(case metric_name when 'Physical Write Total Bytes Per Sec' then average end) Physical_Write_Total_Bps,
       sum(case metric_name when 'Redo Generated Per Sec' then average end) Redo_Bytes_per_sec,
       sum(case metric_name when 'Physical Read Total IO Requests Per Sec' then average end) Physical_Read_IOPS,
       sum(case metric_name when 'Physical Write Total IO Requests Per Sec' then average end) Physical_Write_IOPS,
       sum(case metric_name when 'Redo Writes Per Sec' then average end) Physical_Redo_IOPS,
       sum(case metric_name when 'Current OS Load' then average end) OS_Load,
       sum(case metric_name when 'CPU Usage Per Sec' then average end) DB_CPU_Usage_per_sec,
       sum(case metric_name when 'Host CPU Utilization (%)' then average end) Host_CPU_Util, -- NOTE: 100% = 1 fully loaded RAC node
       sum(case metric_name when 'Network Traffic Volume Per Sec' then average end) Network_bytes_per_sec
from dba_hist_sysmetric_summary
group by snap_id
order by snap_id;
spool off

- MBytes per second: large-block sequential I/O (DSS/OLAP)

- Transactions per second does not translate well to storage, so we don't use it

- Determine application IO characteristics

- Recovery point objective (RPO), Recovery Time Objective (RTO), response time SLA, IO rates

- Understand each I/O component's bandwidth limits

- CPU, HBA, Switch, Controller, Disks

- Choose the IO components that match your application throughput requirements

- design for throughput (not capacity)

- design for scalability and availability

- AWR can be used to determine current IO

- IOPS = “physical reads total I/O requests” + “physical writes total I/O requests”

- MBytes/s = “physical reads total bytes” + “physical writes total bytes”

- For RAC environments - aggregate IOPS or MBytes/s for all nodes

- Estimate expected I/O throughput rates: new users, reports, etc.

- Extrapolate from known loads to estimate what the future load will be

- Use the following as basic guidelines for OLTP systems :

- Low transaction system – 1,000 IOPS or 200 MBytes/s

- Medium transaction system – 5,000 IOPS or 600 MBytes/s

- High-end transaction system – 10,000 IOPS or 1 GBytes/s <- rarely achievable, and usually only on TPC-C type workloads

- Use the following as basic guidelines for DSS systems (units are in Gig = Gigabytes/Sec):

- 2 Gig I/O entry level DW

- 4-8 Gig I/O High End DW

- 10+ Gig I/O Very High End DW (not yet seen in production!)

- 40+ Gig I/O High End TPC-H
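The OLTP guidelines above can be read as a quick tier lookup; a Python sketch (the function name and the tier labels are mine; the thresholds are the rule-of-thumb numbers from the list):

```python
def oltp_tier(iops: float, mbytes_per_sec: float) -> str:
    """Rough OLTP sizing tier from the rule-of-thumb guidelines."""
    if iops >= 10_000 or mbytes_per_sec >= 1024:
        return "high-end"   # rarely achieved outside TPC-C-style workloads
    if iops >= 5_000 or mbytes_per_sec >= 600:
        return "medium"
    return "low"

print(oltp_tier(1000, 200))   # low
print(oltp_tier(6000, 300))   # medium
```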

- Example from AWR

Instance Activity Stats – per-second stats:

physical read total IO requests
physical read total bytes
physical write total IO requests    1,050.86
physical write total bytes          19,114,880.37

physical read total IO requests + physical write total IO requests ~ 2247 IOPS

physical read total bytes + physical write total bytes ~ 110 MBytes/s
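The aggregation is just two sums; a Python sketch, where the read-side values are illustrative stand-ins (the AWR excerpt above only preserves the write-side numbers):

```python
def iops_and_mbps(read_reqs, write_reqs, read_bytes, write_bytes):
    """Aggregate AWR per-second stats into total IOPS and MBytes/s."""
    iops = read_reqs + write_reqs
    mbps = (read_bytes + write_bytes) / (1024 * 1024)
    return iops, mbps

# Read-side numbers are hypothetical; write-side numbers match the AWR example.
iops, mbps = iops_and_mbps(1196.0, 1050.86, 96_000_000, 19_114_880.37)
print(round(iops), round(mbps))  # 2247 110
```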

- Script to determine how much I/O the current session performs

SELECT a.sid,
       SUBSTR(b.name,1,40) name,
       a.value
FROM   v$sesstat a,
       v$statname b,
       v$session se
WHERE  se.audsid = (select userenv('sessionid') from dual)
AND    a.statistic# = b.statistic#
AND    se.sid = a.sid
AND    b.name in ('physical read total IO requests','physical read total bytes',
                  'physical write total IO requests','physical write total bytes')
ORDER BY b.class, b.name;

- The I/O requirement calculation should include backup-related I/O

- frequency, elapsed time, and total I/O during the backup window

- Report to storage administrators the following numbers:

- IOPS

- MBytes/s

- Backup Requirements

- Considerations for DW

- DW systems require specific considerations due to the high IO data rate

- CPU power – ETL processing; account for non-database I/O processing

- Number of HBAs – driven by MBytes/s, not IOPS

- Server Memory

- DWs are also mixed workloads – transactional + DSS – but size for DSS, since it's the bigger drain.

- To increase host I/O throughput – add nodes to RAC and distribute the workload

- Don’t oversubscribe on the FC switch

From http://www.cisco.com/en/US/solutions/collateral/ns340/ns394/ns259/ns261/net_implementation_white_paper0900aecd800f592f.html:

"It is a common practice for storage subsystem vendors to share one port on the storage subsystem among multiple HBAs on multiple servers. After all, the initial purpose for a SAN was to increase connectivity to storage subsystems that were growing in storage capacity at a faster rate than their external connectivity capabilities. This SAN fan-out ratio of storage ports typically ranges from 6:1 to 12:1 server-to-storage subsystem ports. This ratio balances different server platforms and applications across these subsystem ports to fully utilize available bandwidth while enabling the maximum throughput of each HBA to achieve near-wire-rate throughput at a given time. The ratio also implies that the Fibre Channel switch ports that the server HBAs are connected to are being underused most of the time. To achieve the best price-performance metrics, it is common practice to oversubscribe server HBA-connected switch ports, as long as storage subsystem ports have access to full line-rate performance and the Fibre Channel switching fabric is non-blocking."

"However, most application servers are oversubscribed in terms of fabric bandwidth requirements, due to the bursty and often underused fabric utilization. Oversubscription helps lower the cost of the overall application infrastructure while helping to ensure that application servers receive the appropriate I/O channel resources to meet their application needs. A computing environment comprises a CPU, memory, and the I/O subsystem resources."

From http://vinf.net/2008/04/09/how-does-an-hp-fibre-channel-virtual-connect-module-work/:

"There is a concept of over-subscription, in the Virtual Connect GUI that’s managed by setting the number of uplink ports used."

"Most people will probably choose 4 uplink ports per VC module; this is 4:1 oversubscription, meaning each FC-VC port (and there are 4 per module) has 4 individual HBA ports connected to it. If you reduce the number of uplinks you increase the oversubscription (2 uplinks = 8:1 oversubscription, 1 uplink = 16:1 oversubscription)"
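The Virtual Connect ratios in that quote follow from 16 HBA ports per module; a small Python check (the function name is mine):

```python
def vc_oversubscription(uplinks: int, hba_ports: int = 16) -> int:
    """Oversubscription ratio (N:1) for a VC module with the given uplink count."""
    return hba_ports // uplinks

for n in (4, 2, 1):
    print(f"{n} uplink(s) -> {vc_oversubscription(n)}:1")
```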

- Considerations for OLTP

Maybe for OLTP systems:

- The SAN fan-out guidance quoted above applies: typically 6:1 to 12:1 server-to-storage subsystem ports, oversubscribing server HBA-connected switch ports as long as storage subsystem ports have access to full line-rate performance and the Fibre Channel switching fabric is non-blocking.

- CPU requirements depend on user workload:

- Concurrency of users, ratio of CPU-related tasks

- Memory requirement mostly user-process driven

- IO requirements depend on query-mix:

- CPU vs. IO

- Relative CPU power for IO related tasks

- Logically Random IOs (predominant in star schema)

- required for index driven queries, e.g. Index lookups, Index driven joins, Index scans

- Logically Sequential IOs required for table scans, e.g. Hash Joins

- Find the balance between CPU and IO

- Understand bandwidth limitations and choose appropriate components

- Database – scales to the ability of the hardware. I/O throughput is bounded by host I/O components; e.g., CPU, SCSI driver, HBA, etc.

- In DSS environments ensure enough CPU resources are available to accommodate the application – plan for 75-100 MB/s per GHz of CPU
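A quick way to apply the 75-100 MB/s-per-GHz rule; a Python sketch using the conservative 75 MB/s end (the 600 MB/s target is an illustrative figure, not from the post):

```python
def ghz_needed(target_mbps: float, mbps_per_ghz: float = 75.0) -> float:
    """Aggregate CPU GHz needed to drive a target scan rate (conservative end)."""
    return target_mbps / mbps_per_ghz

# 600 MB/s of scan throughput needs ~8 GHz of aggregate CPU,
# e.g. four 2 GHz cores, at 75 MB/s per GHz.
print(ghz_needed(600))  # 8.0
```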

- Ensure largest I/O request is defined at the host – this should be 1MB.

- All database I/O against ASM devices is initiated and processed by the database itself; i.e., ASM is not in the I/O path.

- HBA – dependent on the type (1, 2, 4 Gbit/s)

- 2 Gbit FC HBA – best possible rates:

Sequential – 150-180 MBytes/s

Random rate – a function of block size; the larger the block size, the lower the random rate

- 4 Gbit FC HBA – ~375 MBytes/s

- When configuring hardware for a data warehouse, defining the I/O requirements is very challenging, because predicting the I/O demand is difficult.

In general the I/O requirements depend on the query mix.

As a rule of thumb, assuming today's CPUs (3 GHz Xeon or 2.2 GHz Opteron), an I/O-bound query can drive about 200MB/s per CPU. Having said this, how high the I/O requirements are depends on the ratio of I/O-bound to CPU-bound queries.

It is also very important to investigate whether the system will run more queries issuing random or sequential I/Os. Random I/Os are more dominant in index-driven queries such as index lookups, index-driven joins (e.g., nested loops or bitmap indexes), or index scans. Sequential I/Os are more dominant in the table scans used in hash joins.

Regarding the random IOPS rate – it is a function of block size and the per-block overhead. The maximum theoretical rate (8KB block) is about 16K IOPS, but a storage system will not respond that fast (a 100% cache hit rate would be required).

For NAS (NFS environments):

A single Gigabit Ethernet link can support about 30 MBytes/s at a reasonable 30% utilization, or 70 MBytes/s at a high 70% utilization.

- SCSI driver

The per-port I/O rates (random or sequential) are determined by the OS's ability to drive I/O and the storage system's ability to respond.

The largest I/O request the SCSI driver supports (on Linux 2.6) is 1 MByte, but the observed/typical limit is 512 KB – the SCSI layer limitation.

The SCSI driver limits a request to 128 scatter/gather entries of 4 KB pages; if two memory locations are physically adjacent, they count as one entry, so an I/O request can exceed 512 KB. With large pages (4 MB), one PTE maps 4 MB to one physical segment. On IA-64 (Itanium) with a 16 KB page size, 128 entries * 16 KB = 2 MB.

- Use RAC – if your I/O metrics exceed your server's throughput capabilities (CPU, PCI bandwidth, etc.), scale out with RAC, distributing the I/O load across the RAC nodes

- PCI Bus:

Account for PCI bandwidth – don't overwhelm the PCI bus by plugging in too many HBAs. The interrupt rate per HBA on that PCI bus is as important as the transfer rates. The host-side PCI bus bandwidth must be watched; ensure your PCI capabilities are not exceeded (a 2 Gbit HBA needs roughly 200 MBytes/s of PCI-E bandwidth).

Courtesy of Henry Newman -http://www.enterprisestorageforum.com/technology/features/article.php/3665911

"The big concept here is “lanes”. For example, if you want to use 4Gbit FC and run full duplex with a single port HBA, you need 400 MB/sec of bandwidth for each direction. In PCI-E terms, that's two lanes, since each lane will be able to run at full rate. You could use a single lane, but you would be limited to 250 MB/sec. That might not be a problem for IOPS like database index searches.

Using 400 MB/sec transfer with no other overhead and 16 KB requests would support 25,600 (400 MB per second/16 KB requests) requests per second, while 250 MB/sec single lane would support 16,000 requests. There is additional overhead, so you will never really achieve those rates, but either way, one lane or two lanes far exceeds what most servers, HBAs and RAID systems can deliver.

So from an IOPS perspective, a single lane and 4Gb HBA will work just fine, and with dual port, one or two lanes will more than saturate most RAID configurations. Assuming that a disk drive at most can do 150 random I/Os per second, you would need a large number of disk drives or cache hits to run at full rate. Since most RAID controllers do not have a command queue of 8K, you will also far exceed the command queue of RAID controllers.

I can't remember a time when the performance of an I/O bus was faster than the fastest host attachment, so we have reached an important time in technology history where the bus is fast enough to run any card at rate. This assumes a number of things"
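Newman's request-rate figures are straightforward to reproduce with binary-unit arithmetic; a Python sketch:

```python
def requests_per_sec(bandwidth_mbytes: float, request_kbytes: float) -> float:
    """Request rate a link can carry, ignoring protocol overhead."""
    return bandwidth_mbytes * 1024 / request_kbytes

print(requests_per_sec(400, 16))  # two PCI-E lanes for 4Gbit FC: 25600.0
print(requests_per_sec(250, 16))  # one PCI-E lane: 16000.0
```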

- Storage arrays come in different flavors and sizes

- High-end

- Modular arrays

- Low-cost commodity

- What performance numbers should you look for in a storage array?

- A published SPC-1 result is the best indicator of how a system will behave with typical DB workloads.

- Use aggregate I/O metrics to size the storage array

- describe I/O metrics and application characteristics to Storage vendor.

- The array architecture is a vast topic, which can be discussed at length. This section addresses the internal bandwidth provided to each disk array path in the enclosure and also to the cache memory.

- The disk enclosures or the disks are connected in FC loops. In modular storage these loops can be somewhat customized to provide higher performance. In high-end enterprise arrays, the drive loop technology is already predefined for specific throughput. The internal bandwidth of a storage system for medium performance should be around 4 Gbyte/sec, while for extreme throughput you can choose from 8 Gbyte/sec to 15 Gbyte/sec technology.

- The read/write throughput to the cache should also be considered for maximum storage performance. For a high-end array the transfer rate to cache can be as high as 800 Mbytes/s, and for a medium-rated array it could be around 300-400 Mbytes/s.

- The big differentiator is the CPU processing speed and onboard bus speed; this dictates how fast (and how well) I/O requests are passed down through to the back-end adapters and disks.

- Disk drives – a necessary evil

- Sole function is to service I/O requests from the host

- The slowest component in the I/O stack

- Includes mechanical aspects – seek, rotational, data transfer times

- Come in various flavors – SATA-II, FC, etc.



Examples: 146GB FC drives and 500GB SATA 7200 RPM drives.

- Determining disk IO Throughput (IOPS):

Courtesy of http://www.ufsdump.org/papers/io-tuning.pdf

1. Divide 10,000 RPM by 60 seconds (10,000/60 = 166 rotations per second)

2. Invert to get seconds per rotation (1/166 = 0.006 seconds per rotation)

3. Multiply the seconds per rotation by 1000 milliseconds (6 ms per rotation)

4. Divide the total in half (6/2 = 3 ms) – the average rotational delay

5. Add an average of 4 ms for seek time (3 ms + 4 ms = 7 ms)

6. Add 2 ms for latency (internal transfer) (7 ms + 2 ms = 9 ms)

7. Divide 1000 ms by 9 ms per I/O (1000/9 = 111 IOPS)

Note this is the effective IOPS that can be obtained from a 10K RPM drive. Generally I reduce this value by 10% to account for seek overhead. Each time an application issues an I/O, it takes an average of 9 ms to service that I/O on a 10K RPM disk. Since this is a fixed cost, it is imperative that the disk spend its read/write time as efficiently as possible. I/O requests are measured in I/Os per second (IOPS); a 10K RPM disk can push 80 to 100 (burst) IOPS. To measure the effectiveness of IOPS, divide the IOPS by the amount of data read or written per I/O.
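The seven steps above reduce to a few lines; a Python sketch (the seek and latency defaults follow the example's 4 ms and 2 ms figures):

```python
def disk_iops(rpm: int, seek_ms: float = 4.0, latency_ms: float = 2.0) -> float:
    """Effective IOPS of one drive from rotational delay + seek + latency."""
    rotation_ms = 1000.0 / (rpm / 60.0)     # one full rotation, in ms
    rotational_delay_ms = rotation_ms / 2   # average delay: half a rotation
    service_ms = rotational_delay_ms + seek_ms + latency_ms
    return 1000.0 / service_ms

# 10K RPM: 3 ms rotational delay + 4 ms seek + 2 ms latency = 9 ms -> ~111 IOPS
print(round(disk_iops(10_000)))  # 111
```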

- Four questions to answer:

1. What interface type ?

2. What speed ?

3. How many ?

4. What size?

Notice that size is the last question we need to answer.

- SATA disks –

Very good for sequential I/O – archived logs, flashback logs, RMAN backups.

Not so good for IOPS-driven applications.

A fit for the 2nd and 3rd tiers of tiered storage architectures (ILM, HSM).

A fit for low-cost storage solutions.

- High-speed FC disks

Great for heavy IOPS and high sequential throughput applications.

More costly than SATA

- For mixed workloads FC or SAS drives are a very good choice

[Table: Drive Type vs. IOPS @ 9ms (95%tile) vs. sustained throughput – sustained throughput per drive type: 20-30 MB/sec, 25-35 MB/sec, 20-30 MB/sec]

A word on disk bandwidth –

IOPS and throughput trade off against each other – you won't get 25 MBytes/s and 110 IOPS out of the same drive at the same time.

- So how many physical disks do I need to drive my App?

Use your IOPS or Mbytes/s metric to determine the number of physical disks.

Assume no storage array I/O cache hit ratio in the calculation

Two examples (we chose FC 15K RPM disks) –

Requirement is 600MBytes/s:

(600MBytes/s) / (25MBytes/s per disk) = 24 disks

Requirement is 3000 IOPS:

(3000 IOPS) / (130 IOPS per disk) = 23 disks

For mixed workloads, use the max value from both calculations to accommodate both the IOPS and MBytes/s requirements.

Note that this is the raw disk count; i.e., not mirrored or RAID'ed.

The value from the calculation is the required number of [service-able] disks. No matter what kind of RAID you implement, the total of service-able disks should not change; this is a particularly important point to remember for RAID5 configurations. For example, if you are doing 4+1 RAID5, then your service-able number of disks for this RAID group is 4; the parity drive is not involved in the read operation.

Note that for mirrored RAID (RAID10) you need to double the total number of disks in the backend. This may also require additional I/O bandwidth in the backend array to satisfy the mirrored writes. You'll need to do a similar calculation for RAID5, but it depends on the RAID5 deployment; i.e., 3+1, 4+1, 7+1.
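The sizing examples above as a Python sketch; note that rounding up puts the 3000 IOPS case at 24 disks rather than 23 (the per-disk figures are the FC 15K RPM assumptions from the examples):

```python
import math

def serviceable_disks(mbps_req, mbps_per_disk, iops_req, iops_per_disk):
    """Disks needed for a mixed workload: take the max of both calculations."""
    by_throughput = math.ceil(mbps_req / mbps_per_disk)
    by_iops = math.ceil(iops_req / iops_per_disk)
    return max(by_throughput, by_iops)

def total_disks_raid10(serviceable: int) -> int:
    """RAID10 doubles the backend disk count."""
    return serviceable * 2

n = serviceable_disks(600, 25, 3000, 130)
print(n, total_disks_raid10(n))  # 24 48
```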

- Once all other things are decided, review what size disks you need

But by now you may have realized that your capacity requirements have already been satisfied by your throughput requirements

You may want to go with a larger disk size to fit your storage standards or to meet future growth requirements.

Disk Configuration - Best Practices

- Keep it simple

- In most cases no need to separate out random I/O from sequential

- It's been proven that a greater number of disk spindles can absorb any I/O workload type.

- Ensure disks span multiple backend disk adapters

- On most high-end enterprise arrays this is automatic

- Midrange arrays may require manual configuration

- Make sure disks span multiple backend disk loops on midrange systems (automatic in enterprise arrays)

- Implement multiple access paths to the storage array using two or more HBAs or initiators

- Use multipathing for load balancing and failover

- Carving too many LUNs from the same RAID group will not give you any extra IOPS/MBytes/s, as the RAID group has finite bandwidth

- Multiple HBAs provide multiple access paths to the diskgroup disks. Multi-pathing software includes PowerPath, MPxIO, and Secure Path.

- Do the same for fail-over purposes alone on midrange systems (all midrange systems have controller-based ownership of LUNs or RAID groups – it is very costly to constantly change this dynamically in the storage controllers).

The ASM “factor”

- What is ASM?

Volume manager and file system built into the Oracle kernel

Provides a storage pool for your database files

File system with raw disk performance

- ASM Benefits

Distributes extents for database evenly across all disks in the diskgroup

Provides even and wide I/O distribution, resulting in improved overall throughput and fewer potential hot spots

Provides simplified storage management by masking underlying storage complexities

Can add disk storage dynamically to scale I/O throughput and capacity

- The following are some key benefits of ASM:

o I/O is spread evenly across all available disk drives to prevent hot spots and maximize performance.

o ASM eliminates the need for over-provisioning and maximizes storage resource utilization, facilitating database consolidation.

o Inherent large file support.

o Performs automatic online redistribution after the incremental addition or removal of storage capacity.

o Maintains redundant copies of data to provide high availability, or leverages 3rd party RAID functionality.

o Supports Oracle Database 10g as well as Oracle Real Application Clusters (RAC).

o Capable of leveraging 3rd party multipathing technologies.

o For simplicity and easier migration to ASM, an Oracle Database 10g Release 2 database can contain ASM and non-ASM files. Any new files can be created as ASM files whilst existing files can also be migrated to ASM.

o RMAN commands enable non-ASM managed files to be relocated to an ASM disk group.

o Oracle Database 10g Enterprise Manager can be used to manage ASM disk and file activities.

o ASM reduces Oracle Database 10g cost and complexity without compromising performance or availability.

- ASM Diskgroup Best Practices

Create two diskgroups: one for the database area and another for the recovery area.

In cases where the application performs heavy writes to the temp tablespace, create a diskgroup for temp using distinct/isolated disks.

Create diskgroups using a large number of similar-type disks with the same size and performance characteristics.

Typical question: How many disks do I need in my ASM diskgroup and what should be the LUN size?

1. How many disks? We answered it! It's driven by the I/O requirements.

2. LUN size depends on site-specific storage standards. LUN size is not as relevant as the number of LUNs.

These disks should be of similar size and performance characteristics.

In DSS environments it may be prudent to create a third diskgroup for tempfiles; e.g., a TEMP_DG.

RAID 10 or RAID 5 storage array LUNs can be used as ASM disks, to minimize the number of LUNs presented to the OS; examples would be EMC Metavolumes or HDS Sfvols.

Creating a separate temp diskgroup also allows you to set up write-cache bias in the storage array; i.e., a greater write cache.

- A walk through the configuration of a diskgroup

Present LUNs to host

Ensure correct disk permissions so that ASM disk discovery will "find" the provisioned LUNs.

Create diskgroup using the required number of disks:

SQL> create diskgroup DATA external redundancy disk '/dev/sda1', '/dev/sdb1', '/dev/sdc1', '/dev/sdd1';

Create your database files in the ASM diskgroup:

SQL> create tablespace OOW_TBS datafile '+DATA' size 200G;
