Oracle RAC's Secret
By dcb on Dec 19, 2004
I'm a big fan of Oracle's RAC technology. I (speaking for myself, not Sun) think it is the only database product out there that can solve the challenge of near-continuous database transaction access to a single (complete) data set, even when the database node that a client is connected to experiences a catastrophic failure. Traditional failover can incur a 10x longer service disruption, and multi-site state "replicated" designs are complex and subject to sync skew.
However, there is a little secret associated with the magic of Oracle RAC. Well, it isn't really a secret; it is just something that most people don't like to talk about, an elephant in the room that people choose to ignore. In fact, it is a very natural consequence of a NUMA design. NUMA, of course, means "non-uniform memory access", and it generally becomes a serious issue when memory accesses are frequent and the latency ratio from best to worst case exceeds 10x. Local SGA memory access on an SMP node takes from 100 to 400ns (depending on the type of node). However, if that node is part of a RAC cluster and it needs to access a memory block in another RAC node's SGA (via cache fusion), the latency to retrieve that block will be measured in microseconds, often 1000x worse! Here is an illustration of the NUMA aspect of Oracle RAC:
Oracle published a paper recently in which it lists GBE as having an average latency from 600us to over 1000us. That is well over 1000x worse than the local SMP node! Even Infiniband has a latency of almost 200us, which is 1000x worse than a 200ns local SMP node. Ouch. That is a serious performance hit! Here is a graphic from Oracle's paper:
There is also the issue of bandwidth. An older server from Sun, the F15K, has over 172GB/s of internal bandwidth. That's aggregate bandwidth across 18 boards, but it is still a TON of bandwidth. GBE, bless its heart, can only push about 70MB/s of user data. Even with 18 of those links (if you attempted to build an "F15K" from blades), that adds up to only about 1.2GB/s. And consider the CPU utilization needed to drive each GBE link. Hmmm. Let's see what Oracle says about bandwidth, latency, and CPU:
You can get an idea of why this is a problem when you understand the internal structure of an Oracle database. It's amazing what Oracle can do w.r.t. data integrity and performance. It takes a lot of behind the scenes action. Here is a peek:
And when you try to spread this out among even two nodes, you suffer the consequences of 1000x higher latency and 100x less throughput. Here is a look at the protocol management that must take place for every node sync or transfer, which can happen thousands of times per second:
So it is no wonder that RAC can run into scaling issues if the data is not highly partitioned to reduce the number of remote references and cache fusion transfers to a trickle. TPC-C is an example of a benchmark in which the work is split among the nodes without inter-node interaction. RAC scales wonderfully in that benchmark. The problem is that most ad-hoc databases that customers are attempting to use with RAC involve significant inter-node communication. You can imagine the challenge, even with Infiniband, which still has 1000x higher latency (according to Oracle's tests).
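To make the ratios above concrete, here is a small back-of-the-envelope sketch in Python. It uses only the figures quoted in this post; treating "over 1000us" as exactly 1000us and 172GB/s as 172,000MB/s are my own simplifying assumptions, and the function name is just for illustration.

```python
# Back-of-the-envelope ratios using only the figures quoted in this post.
# Assumption: "over 1000us" is treated as exactly 1000us.

LOCAL_SMP_NS = (100, 400)            # local SGA access on an SMP node
GBE_NS = (600_000, 1_000_000)        # GBE: 600us to about 1000us
INFINIBAND_NS = (200_000, 200_000)   # Infiniband: about 200us


def slowdown_range(remote_ns, local_ns):
    """Return (best case, worst case) remote/local latency ratios."""
    best = min(remote_ns) / max(local_ns)    # fastest remote vs slowest local
    worst = max(remote_ns) / min(local_ns)   # slowest remote vs fastest local
    return best, worst


GBE_USER_MB_S = 70        # user data per GBE link
LINKS = 18                # one link per board in the blade-built "F15K"
F15K_MB_S = 172 * 1000    # 172GB/s aggregate, assuming decimal units

blade_mb_s = GBE_USER_MB_S * LINKS       # total across all GBE links
bandwidth_gap = F15K_MB_S / blade_mb_s   # how many times more the F15K moves

if __name__ == "__main__":
    print("GBE slowdown range:", slowdown_range(GBE_NS, LOCAL_SMP_NS))
    print("Infiniband slowdown range:", slowdown_range(INFINIBAND_NS, LOCAL_SMP_NS))
    print("18 GBE links (MB/s):", blade_mb_s)
    print("F15K bandwidth advantage:", round(bandwidth_gap, 1))
```

With these inputs, GBE comes out between 1,500x and 10,000x slower than a local access, Infiniband between 500x and 2,000x, and the 18 GBE links total 1,260MB/s, roughly 136x less than the F15K's internal bandwidth.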
Compare this to an SMP node, on which we have shown near-linear scaling to 100+ CPUs running real-world workloads that involve intense remote memory access. Thankfully, a "remote" access on an SMP box (a CPU asking for a block that is cached by another CPU) is still in the nanosecond range. Here is a look at what SMP can do:
I have graphs that show Oracle RAC performance on real-world workloads, but Oracle doesn't allow anyone to publish Oracle performance results without its permission. So I will only suggest that the graph has a much different shape, and that anyone contemplating Oracle RAC should run full load tests and compare the results against a non-RAC SMP baseline.
Okay... what does all this mean? Well, just that Oracle RAC, as I started out saying, is incredible technology that solves a particularly nasty problem that many customers face. But you must enter the decision to deploy RAC with full knowledge of the engineering trade-offs. RAC can be made to perform well in many environments, given a proper design, data/query partitioning, and proper skills training. But in general, if there is appreciable inter-node communication, you should consider using fewer RAC nodes (e.g., 2 or 3), each of which is larger. This keeps memory accesses as local as possible.
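Why locality matters so much can be sketched with a simple weighted average. This is my own toy model, not anything from Oracle: it assumes a 200ns local access and a 200us remote (Infiniband-class) access, and it ignores queuing, CPU overhead, and any overlap between requests.

```python
# Toy model (an assumption, not Oracle's): average block access time as a
# weighted mix of local and remote accesses.

def average_access_ns(remote_fraction, local_ns=200.0, remote_ns=200_000.0):
    """Average access latency when remote_fraction of accesses cross nodes."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


def slowdown(remote_fraction, local_ns=200.0, remote_ns=200_000.0):
    """How many times slower the average access is than a purely local one."""
    return average_access_ns(remote_fraction, local_ns, remote_ns) / local_ns


if __name__ == "__main__":
    for fraction in (0.0, 0.001, 0.01, 0.1):
        print(f"{fraction:>6.1%} remote -> {slowdown(fraction):6.1f}x slower")
```

Under these assumptions, sending just 0.1% of accesses to another node roughly doubles the average access time, and 1% makes it about 11x slower. That is why partitioning has to reduce remote references to a trickle.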
For many customers, a traditional HA-Failover design is actually a very good choice: you leverage the linear scale of an SMP box and let SunCluster restart the database on another node if there is a problem. This generally takes ~5-10 minutes, which is an acceptable service disruption for many, especially since it might only happen a couple of times per year. And, for clusters with more than 4 CPU cores, Oracle charges $70K per CPU core for RAC+Partitioning, whereas Oracle "only" charges $40K per CPU core for an HA-Failover environment (and for failover, you pay Oracle only for the active node, provided you expect to run Oracle on the failover node for less than 10 days per year).
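Here is a rough sketch of that trade-off in Python, using the per-core prices and failover times quoted above. The two-node, 8-cores-per-node cluster and the two failovers per year are hypothetical inputs I picked for illustration, and the failover figure assumes the standby node stays under the 10-day limit so that only the active node is licensed.

```python
# Rough license and downtime comparison using the prices quoted in this post.
# The cluster size and failover count below are hypothetical examples.

RAC_PER_CORE = 70_000        # RAC+Partitioning, per CPU core
FAILOVER_PER_CORE = 40_000   # HA-Failover environment, per CPU core
MINUTES_PER_YEAR = 365 * 24 * 60


def rac_license(nodes, cores_per_node):
    """RAC: every core on every node is licensed."""
    return nodes * cores_per_node * RAC_PER_CORE


def failover_license(cores_on_active_node):
    """Failover: only the active node's cores are licensed (see assumption above)."""
    return cores_on_active_node * FAILOVER_PER_CORE


def yearly_availability(failovers_per_year, minutes_per_failover):
    """Fraction of the year the database is up, counting only failover outages."""
    downtime = failovers_per_year * minutes_per_failover
    return 1.0 - downtime / MINUTES_PER_YEAR


if __name__ == "__main__":
    print("RAC, 2 nodes x 8 cores:", rac_license(2, 8))
    print("Failover, 8 active cores:", failover_license(8))
    print("Availability, 2 failovers x 10 min:", f"{yearly_availability(2, 10):.5%}")
```

For this hypothetical cluster, RAC licensing comes to $1,120,000 versus $320,000 for failover, while two 10-minute failovers per year add up to 20 minutes of downtime. Whether avoiding those 20 minutes is worth the difference is the real design question.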