
Oracle RAC Cache Fusion Testing on SPARC Enterprise T5440



Summary of Findings



  1. The minimum latency of an 8KB block transfer can be reduced

    by 25% with network tuning.


  2. Latency increases with block size but not dramatically, for

    example latency of 16KB block transfer is around 50% more than 2KB

    block transfers.


  3. Latency increases almost linearly with load (number of blocks

    transferred per second) - as expected - till it reaches the "knee"

    of the curve, where the latency starts to increase very fast. The

    "knee" for 4KB block size is about 2 times that for 16K

    block size. This happens when the CPU(s) running LMS process(es)

    become saturated.


  4. With no tunings, 4 LMS processes are needed to saturate a 1

    Gigabit NIC using 8KB block size. The default number of LMS servers

    configured by 11gR2 on the SPARC Enterprise T5440 is 10.


  5. With or without network tuning, we are not able to saturate a

    single 1 Gigabit NIC using 2KB block size. Tuning for low latency

    will sometimes see a 1 Gigabit NIC saturated, but throughput

    degrades significantly at loads beyond the saturation point.


  6. With a 10 Gigabit NIC, throughput increases up to 2.33 times
    when the number of LMS processes is raised from the default 10 to
    the maximum 35.


  7. The response time was marginally better for 10 Gigabit than

    for 1 Gigabit.


  8. Provisioning a larger-than-optimal number of LMS processes

    does not cause a significant increase in latency at low to medium

    throughput. Further, increasing the number of LMS servers only

    increases server and client CPU consumption by a small amount.


  9. Placing LMS processes into a processor set, and fencing that

    processor set from interrupts provides up to 40% more throughput

    than Out-Of-the-Box when using a 10 Gigabit interconnect.


  10. Jumbo Frames is confirmed as a Best Practice for RAC

    clusters. They offer around 20% better throughput.


  11. On the network side, enabling UDP check-sum offload,

    disabling RX soft rings and disabling interrupt blanking can improve

    latency by up to 32%. This also has a side benefit of reduced CPU

    consumption (normalized by throughput).


  12. Use of two 10 Gigabit interfaces can potentially provide the

    best of both worlds – low latency and high throughput.






Configuration Under Test


The following components were used in testing RAC interconnect
performance on the T5440:



  • Sun SPARC Enterprise T5440, 1.4 GHz, 256 MB RAM (two)



    • 4 x 1 Gigabit Ethernet (built-in)


    • 2 x 10 Gigabit PCIe Ethernet (Neptune)



  • Solaris 10 Update 9 (development build)


  • Oracle release 11.2.0.1 (11gR2)


  • Sun StorEdge 6140 Arrays



Overview of Benchmark


This testing uses a simple test to exercise Cache Fusion. The
framework for the test includes a single table, with one row per
database block. The method involved is to:



  1. Load all rows to be used in a test on one node, to be known

    as the “block-serving node”.


  2. Update these rows then do a checkpoint


  3. Query the rows using a simple query via SQL\*Plus from the

    second node, to be known as the “client” node



With the correct database settings, we reach a steady state in which
all the SQL\*Plus clients are requesting blocks that must be satisfied
via Global Cache transfers from the block-serving node to the client
node. Very little disk I/O occurs, just network traffic.


For further details on the benchmark, see Appendix A – Notes on the Benchmark.



10 Gigabit versus 1 Gigabit Interconnect


The Sun SPARC Enterprise T5440 includes 4 x 1 Gigabit Ethernet
connections. A common choice to both increase available throughput
and reduce latency is to add one or more 10 Gigabit Ethernet links.
Let's compare the performance of these choices.













  • 1 Gigabit has reached saturation at 24 clients. 10 Gigabit

    offers much more throughput




  • 10 Gigabit does offer better latency


  • These charts do not yet explore the maximum throughput available
    for 10 Gigabit – see 10 Gigabit below.








Scaling up Number of LMS Processes


1 Gigabit


What happens if we increase the number of LMS processes?

















  • Here we can see the effects of the LMS processes becoming
    saturated, which is the case for 1, 2 and 3 LMS.


  • Once we have more LMS processes than we need to saturate the
    NIC, we can get much greater throughput, as seen for 4 and 8 LMS.



LMS CPU Consumption as LMS Processes are
Increased












  • This is not normalized to the number of LMS processes, so the
    2 LMS maximum is 200%, the 4 LMS maximum is 400%, etc.


  • Again we can see that for 1 and 2 LMS we are reaching CPU
    saturation, but not for greater numbers of LMS, where the figures
    show the NIC is saturated


  • No significant increase in CPU consumption for 12 LMS over 8,

    even though we have established 12 LMS offers no other advantage in

    throughput



10 Gigabit


Throughput - Can we Reach NIC Saturation?






Out of the box, the default number of LMS processes for 11gR2 on
the SPARC Enterprise T5440 is 10. This does not change if you change
the number or type of interconnects. The maximum number of LMS
processes that can be started is 35.




  • Using the default 10 LMS processes significantly limits
    throughput for 1 x 10 Gigabit. Using the maximum of 35 is similar
    to using 32.


  • At the peak, the throughput for 35 LMS is 2.33 times the

    throughput for 10 LMS servers



Response Time








  • At low throughput, the difference in latency as the number of

    LMS is increased is negligible.


  • As the throughput is increased, the larger number of LMS

    servers is able to respond to the greater throughput with less

    degradation.




Theoretical Peak of 10 Gigabit Interconnect


Observing that we had been able to easily saturate a 1 Gigabit
interconnect, but were unable to saturate a 10 Gigabit interconnect, a
colleague suggested that it would be unlikely we could saturate a 10
Gigabit interconnect due to the average packet size of Cache Fusion
traffic – approximately 4,300 bytes for a raw, 8KB
unidirectional Cache Fusion test workload, measured on the side from
which the requested blocks are sent.


We decided to test the 10 Gigabit network link to see just what
bandwidth we could get if there were no database or Cache Fusion test
involved. The tool for this is uperf[1], which was configured to
replicate and scale up the packet flow as observed from a one-client
run.
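As a rough illustration of why packet size matters here, the sketch below estimates the packet rate a link would have to sustain to be fully used at a given average packet size. It is a back-of-the-envelope calculation only: it treats the 4,300 byte average as the on-wire size and ignores Ethernet, IP and UDP framing overhead, and the function name is ours, not part of uperf or Oracle.

```python
def packets_per_second(link_bits_per_second: float, avg_packet_bytes: float) -> float:
    """Packet rate at which a link of the given speed would be fully used.

    Assumption: framing overhead is ignored, so this is an upper bound.
    """
    return link_bits_per_second / (avg_packet_bytes * 8)


ten_gig = packets_per_second(10e9, 4300)
one_gig = packets_per_second(1e9, 4300)
print(f"10 Gigabit: {ten_gig:,.0f} packets/second")
print(f" 1 Gigabit: {one_gig:,.0f} packets/second")
```

At the same average packet size, the 10 Gigabit link needs ten times the packet rate of the 1 Gigabit link before bandwidth becomes the limit.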




  • These tests used an “Out-of-the-Box” tuning for the network,
    with the exception of Jumbo Frames being enabled for both tests


  • uperf peak is 87.5% versus 70.6% for our Cache Fusion test


  • Our Cache Fusion test used 64 LMS processes







We can also test how our Cache Fusion test compares to uperf when
the nodes have their networks tuned for low latency. This is a
tuning we have called “N3” that is described later in Tuning for Network Latency.








  • The theoretical maximum bandwidth is lower for this tuning,

    as we have observed for our Cache Fusion test




  • uperf peak is 70.1% versus 54.5% for Cache Fusion test


  • Cache Fusion test used 35 LMS processes
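One way to read the two comparisons above is to ask what fraction of the raw uperf peak the Cache Fusion test reaches under each tuning. The sketch below does that arithmetic with the percentages quoted in the bullets; the function name is ours and is only for illustration.

```python
def fraction_of_raw_peak(test_peak_pct: float, uperf_peak_pct: float) -> float:
    """Cache Fusion peak utilization as a fraction of the uperf peak."""
    return test_peak_pct / uperf_peak_pct


oob = fraction_of_raw_peak(70.6, 87.5)  # Out-of-the-Box network tuning
n3 = fraction_of_raw_peak(54.5, 70.1)   # N3 low-latency network tuning
print(f"Out-of-the-Box: {oob:.1%} of the uperf peak")
print(f"N3:             {n3:.1%} of the uperf peak")
```

Under both tunings the Cache Fusion test reaches roughly four fifths of what uperf achieves with the same packet flow.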




CPU Tuning – Processor Sets and Interrupt Fencing


Processor Sets – 1 Gigabit


For these studies, the “T1” tuning implements a
processor set of 16 CPUs (2 cores) on the block-serving node, with
interrupts disabled on these CPUs. The “T2” tuning is
the same as T1, but in addition we have all CPUs in three of the four
sockets turned off.
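The exact commands used to build these processor sets are not given in this report. As a hedged sketch, the Python helper below assembles the Solaris psrset(1M) command lines that would create a set (-c), fence it from interrupts (-f) and bind processes to it (-b). The CPU IDs 0 to 15, the processor set ID 1 and the process IDs are hypothetical placeholders; psrset -c reports the real set ID when it runs.

```python
def psrset_commands(cpu_ids, set_id, pids):
    """Return psrset command lines for a processor set fenced from interrupts.

    cpu_ids, set_id and pids are supplied by the caller; the values used
    below are illustrative assumptions, not taken from the test systems.
    """
    cpus = " ".join(str(c) for c in cpu_ids)
    procs = " ".join(str(p) for p in pids)
    return [
        f"psrset -c {cpus}",            # create a processor set from these CPUs
        f"psrset -f {set_id}",          # disable interrupts for CPUs in the set
        f"psrset -b {set_id} {procs}",  # bind the given processes to the set
    ]


# Hypothetical T1-style set: 16 CPUs (2 cores), two LMS process IDs.
for line in psrset_commands(range(16), 1, [1234, 1235]):
    print(line)
```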


Throughput – Out of Box versus T1, T2









  • Very small improvement for the T1 psrset (e.g. 1.7% at 32 clients)


  • Same for T2 - 3.1% higher than baseline at 32 clients


  • Once the interconnect is saturated, all perform the same.



Response Time by CR Block Throughput – Out-of-Box versus T1, T2









  • As for throughput, very slight

    improvement; no difference once we saturate the interconnect.



Processor Sets – 10 Gigabit


For 10 Gigabit, let's examine runs with 32 LMS's, with and
without processor sets for the block-server node LMS processes.
Processor sets can be used to isolate processes from each other,
fence processes from the effects of interrupts, limit the cores or
sockets processes run on (which in turn affects their cache usage and
memory locality efficiency), or combinations of these.


The processor sets used were:



  • T3: 64 CPUs, on 8 cores (each core 100% dedicated to set),

    across all 4 sockets, with interrupts disabled for the processor set


  • T7: 32 CPUs, on 8 cores (each core 50% dedicated to set),

    across all 4 sockets, with interrupts disabled for the processor set
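To make the difference between T3 and T7 concrete, the sketch below lists which CPU IDs each set could contain. It assumes the usual layout of 8 hardware threads per core and 8 cores per socket, with CPU IDs numbered consecutively by socket, core and thread, and it assumes the first two cores of each socket are used. The report does not say which cores were actually chosen, so treat the IDs as illustrative.

```python
THREADS_PER_CORE = 8   # hardware threads (virtual CPUs) per core
CORES_PER_SOCKET = 8   # cores per socket


def processor_set_cpus(cores_per_socket_used, threads_per_core_used, sockets=4):
    """List CPU IDs for a set spread evenly across sockets.

    Assumption: CPU ID = (socket * 8 + core) * 8 + thread.
    """
    cpus = []
    for socket in range(sockets):
        for core in range(cores_per_socket_used):
            first = (socket * CORES_PER_SOCKET + core) * THREADS_PER_CORE
            cpus.extend(range(first, first + threads_per_core_used))
    return cpus


t3 = processor_set_cpus(2, 8)  # 8 cores, every thread of each core in the set
t7 = processor_set_cpus(2, 4)  # 8 cores, half the threads of each core in the set
print(len(t3), len(t7))
```

In the T7 layout the remaining four threads of each core stay available to other work, which is why it may co-operate better with other workloads.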



Throughput – Out of Box versus T3 and T7
Processor Sets








  • Effective? Yes. That is +33% and +35% at 240 clients for T3

    and T7, respectively


  • Both variants offer similar improvements. T7 potentially

    offers greater co-operation with other workloads on the same system

    – hard to tell from our Cache Fusion test as there is little

    other load.



Latency – Out of Box versus T3 Processor Set


In this case, let's use our network tuning, so that we have
already improved latency as much as we can that way.








  • You have to take my word – there are two lines in that

    chart. So, I think we can conclude the difference is negligible.




Network Tuning


Jumbo Frames


Jumbo Frames is a feature of some NICs that allows the use of
Ethernet Maximum Transmission Units (MTU) greater than 1500. This is
an established Best Practice for Oracle RAC. Let's see why.
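Part of the reason can be seen with simple arithmetic on how many Ethernet frames one block needs. The sketch below is a simplification: it assumes each block travels as a single UDP datagram over IPv4 with no additional RAC message header, which this report does not confirm, and it uses a jumbo MTU of 9000.

```python
import math

IPV4_HEADER = 20  # bytes, without options
UDP_HEADER = 8    # bytes


def frames_per_datagram(payload_bytes: int, mtu: int) -> int:
    """Number of IP packets needed to carry one UDP datagram."""
    datagram = payload_bytes + UDP_HEADER
    if datagram <= mtu - IPV4_HEADER:
        return 1  # fits in a single packet, no fragmentation
    # Fragment payloads (except the last) must be a multiple of 8 bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(datagram / per_fragment)


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {frames_per_datagram(8192, mtu)} frame(s) per 8KB block")
```

Under these assumptions an 8KB block needs six frames at an MTU of 1500 but only one with Jumbo Frames, so there are fewer packets to send, receive and reassemble per block.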


Throughput – Jumbo Frames versus No Jumbo Frames




  • No Jumbo is MTU of 1500


  • Jumbo frames offer 20% more throughput



Response Time by CR Block Throughput, Jumbo Frames v No Jumbo




  • Both workloads reach a peak then retrograde


  • Jumbo frames obviously better



LMS CPU Consumption by Throughput, Jumbo Frames v No Jumbo












  • As expected, use of Jumbo Frames entails lower CPU

    consumption by the LMS processes


  • There is a similar but less pronounced gap in system-wide CPU

    consumption on both nodes



Tuning for Network Latency


Here we will study a set of network tunings that are intended to
improve network latency. There are usually a number of trade-offs
involved in network tuning; between latency of individual packets,
peak packet throughput, peak bandwidth, etc.


The following configurations are compared against the
Out-Of-the-Box (OOB) network configuration – although we are
using Jumbo Frames in all configurations, which is already a
recognized best practice for RAC and other large packet workloads:



  • N1 – use of hardware offload for UDP check-sums. This

    is normally disabled for nxge interfaces, as it can sometimes fail.

    In the case of RAC, block transfers are check-summed by the

    database, so we do not risk data corruption by enabling this

    feature.


  • N2 – N1 plus disabling of RX soft rings. Soft rings

    are enabled by default to provide scalability to very high packet

    rates per interface. RAC does not involve very high packet rates,

    so can we make the trade-off the other way and use simpler delivery

    via a single CPU?


  • N3 – N2 plus disabling of interrupt blanking.

    Interrupt blanking is used to reduce the number of times a network

    interface interrupts a CPU to deliver packets. Again, we have a low

    packet rate, so can we afford one interrupt per packet?


  • N4 – N3 plus disabling of TX serialization. This is

    similar to soft rings, but on the transmit side.



For details on how to enable these network tunings, see Appendix B – How to Set Network Tunings.









  • All tunings up to N3 appear to provide incrementally better

    response time









  • N3 offers a 14-32% reduction in latency


  • Similar improvements for throughput


  • N3 looks like the best choice of tunings overall



Broader Comparisons – OOB versus N3 Tuning


Since N3 appears to be a likely recommendation, let's look more
closely at the numbers.







  • Data indicates the N3 tuning has saturated the network interface
    at 32 clients, whereas OOB peaks around 90% saturation









  • A different view of RT – normalized to the throughput


  • N3 better than OOB across range, with higher peak throughput

    capability







  • Useful to consider – does the disabling of many of

    these features mean there is higher CPU consumption? There may be

    at the system level (I can dig up the numbers from these

    experiments), but the impact on the LMS processes is a reduction

    across the range.


  • Network utilization is also plotted so that you can see how

    the OOB configuration does not manage to saturate the network




Choice of Block Size


There is a range of block sizes available with Oracle RDBMS. The
most common choices for SPARC platforms are 8 KB and 4 KB. Here we
examine the impact of block size for a single 1 Gigabit interface.












  • Larger block sizes reach saturation earlier


  • It took a larger number of LMS processes to reach saturation

    (or near saturation) as the block size decreased – 4 for 16

    and 8 KB, 16 for 4KB and 32 for 2KB


  • 2KB blocks were unable to saturate 1 x 1 Gigabit. Average
    packet size is 1240 bytes on the side from which blocks are sent.



Is There a Bottleneck For 2 KB?


The above chart shows that for 2 KB, we were not able to saturate
a single 1 Gigabit interface. This was investigated further with
some different configurations and the results are as follows:








  • Configurations using 4, 16 and 32 LMS were tried, with and
    without network tuning intended to improve latency (see the
    discussion of “N3” in Tuning for Network Latency).


  • The maximum achieved was 39,081 blocks/second, which
    corresponds to 97% utilization of the network interface.



Response Time for Various Block Sizes








Note: some samples beyond peak throughput have been omitted for legibility



  • Here we can see the response time increases as we increase

    the block size.


  • As each workload reaches or approaches saturation, it is able

    to stay close to peak throughput as the response time degrades



LMS CPU% by Throughput, for each Block Size








  • This is normalized to the number of LMS processes (2), so the
    maximum, indicating saturation of the 2 LMS processes, is 100%


  • This verifies we are saturating the LMS processes before the
    1 Gigabit NIC is saturated
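The two ways of reporting LMS CPU used in this report differ only by a division. As a small sketch (the function name is ours, and the 150% sample is a made-up value for illustration):

```python
def normalized_lms_cpu(total_cpu_pct: float, lms_count: int) -> float:
    """Convert summed LMS CPU% into a per-process percentage (maximum 100)."""
    return total_cpu_pct / lms_count


print(normalized_lms_cpu(200, 2))  # 2 LMS fully busy
print(normalized_lms_cpu(150, 2))  # hypothetical sample: 150% across 2 LMS
```

A normalized value close to 100% means the LMS processes themselves, rather than the NIC, are the limit.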




Impact of Additional 10 Gigabit Interface


The next step beyond increasing your bandwidth to 10 Gigabit
might be to use two 10 Gigabit interfaces. So, do we get any
scalability out of this? The following configurations were run
with 35 LMS processes – the maximum.




  • For the N3 (better latency) configurations, there is little
    difference between 1 and 2 interfaces below 64 clients, but then
    there is an advantage of up to 20% as we head toward 192 clients.
    Then there is some degradation for the 2 x 10 Gigabit configuration.


  • For the non-N3 configurations, we see only a slight advantage

    for 2 interfaces through the entire range. We do see the peak

    throughput advantage return to the non-N3 configurations at 192

    clients and above; or around 85,000 CR blocks/second.




  • Peak for 1 x 10 Gigabit is 89,729 at 224 clients


  • Peak for 2 x 10 Gigabit is 93,817 at 240 clients, an increase

    of 4.6%
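The 4.6% figure follows directly from the two peaks quoted above; a short check, with a helper name of our own choosing:

```python
def percent_increase(baseline: float, value: float) -> float:
    """Relative increase of value over baseline, in percent."""
    return (value - baseline) / baseline * 100


gain = percent_increase(89_729, 93_817)  # 1 x 10 Gigabit peak vs 2 x 10 Gigabit peak
print(f"{gain:.1f}%")  # prints 4.6%
```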









  • As expected, latency is better for the N3 configurations at

    lower throughput, but little different for the extra interface.


  • The N3 configuration with 2 interfaces nearly matches the
    non-N3 configurations in terms of scalability before degrading.



How About Impact for 4KB Blocks?


For these experiments, we were using 35 block-serving LMS's, 4KB
blocks (raw), with _fast_cursor_reexecute=true, N3 and T3 tunings.








  • Very little difference


  • Notice also that 4KB throughput peaks at lower number of

    clients than for 8KB.









  • Same as for throughput – little difference; 2 x 10

    Gigabit slightly worse



Conclusions


So, an additional 10 Gigabit interface does not provide a 100%
increase in maximum throughput. What we can conclude is:



  • In the absence of extreme throughput requirements, the “best

    of both worlds” configuration for 8 KB database blocks would

    be two 10 Gigabit interfaces with the N3 tuning.


  • In general, just adding a second 10 Gigabit interface and

    making no other changes offers little additional scalability.



There is one other issue with multiple network interfaces (of any
kind) with RAC at present – which is that there is no fail-over
capability that you might expect with NIC bonding at the OS level.
This OS feature is to be tested in a future study.



How About if We Use Remote SQL\*Plus Clients?


The normal configuration for our Cache Fusion test is to use
SQL\*Plus clients on the “client” RAC node. We could
instead use SQL\*Plus on remote node(s) and see if that makes a
difference. The differences would potentially be:



  • Reduced CPU/memory consumption


  • Introduction of a small amount of think time between queries



We can also investigate whether our N3 network tuning has any effect
on client network traffic.


The first thing discovered was that remote client network traffic
is very small (less than one packet/sec), so any effect of N3 network
tuning on the remote client will likely be immeasurable.


This may be because the database foreground processes still run on
the client RAC node, and they are also responsible for nearly all the
CPU and memory consumption. Here are the results:




  • Remote clients show slightly less throughput than local clients


  • Runs at 240 & 256 clients were inconsistent


  • Local clients get slightly better response time than remote –
    note that the response time is measured between RAC nodes, not
    from the perspective of the client.



One thing that did change with the move to remote clients was the
profile of Oracle's Top 5 Timed Events, as reported via statspack.
We went from 16.0% on “gc cr block lost”, to 7.7% “gc
cr block lost” and 8.0% “gc buffer busy acquire”.
See below for details.



Local Clients – 240 Client run, 97,555 blocks/second



Top 5 Timed Events                                          Avg %Total
~~~~~~~~~~~~~~~~~~                                         wait   Call
Event                                  Waits    Time (s)   (ms)   Time
------------------------------- ------------ ----------- ------ ------
gc cr block 2-way                 57,970,653      71,341      1   43.3
CPU time                                          65,674          39.9
gc cr block lost                      48,849      26,341    539   16.0
gc cr block congested                443,140         695      2     .4
ges message buffer allocation     52,587,636         384      0     .2



Remote Clients – 240 Client run, 91,039 blocks/second



Top 5 Timed Events                                          Avg %Total
~~~~~~~~~~~~~~~~~~                                         wait   Call
Event                                  Waits    Time (s)   (ms)   Time
------------------------------- ------------ ----------- ------ ------
CPU time                                          69,851          42.4
gc cr block 2-way                 54,072,799      67,608      1   41.1
gc buffer busy acquire             6,477,349      13,139      2    8.0
gc cr block lost                      23,729      12,664    534    7.7
gc cr block congested                449,769         739      2     .4
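The "Avg wait (ms)" column in these reports is the total wait time divided by the number of waits. The sketch below recomputes it for a few rows from the two tables above; the helper name is ours.

```python
def avg_wait_ms(waits: int, time_s: float) -> float:
    """Average wait per event in milliseconds, from the Waits and Time (s) columns."""
    return time_s / waits * 1000


rows = {
    "local  gc cr block 2-way": (57_970_653, 71_341),
    "local  gc cr block lost": (48_849, 26_341),
    "remote gc cr block lost": (23_729, 12_664),
    "remote gc buffer busy acquire": (6_477_349, 13_139),
}
for event, (waits, time_s) in rows.items():
    print(f"{event:<30} {avg_wait_ms(waits, time_s):8.2f} ms")
```

Each "gc cr block lost" wait costs over half a second on average, so even a small number of lost blocks accounts for a large share of the total call time.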




Bi-Directional Cache Fusion test


The conventional Cache Fusion test
involves loading up a table on one RAC node (by updating each row),
then querying it from another node in the same RAC cluster. This
produces lop-sided network traffic – for example, at a
throughput of 97,824 8KB blocks/second, we see 805 MB/s in the
direction the blocks are being sent, but only 54 MB/s in the other
direction.
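As a rough cross-check of these figures, the block payload alone can be computed from the throughput. This sketch assumes 8KB means 8,192 bytes and that MB/s means 10^6 bytes per second; the report does not state which convention its measurements use.

```python
def payload_mb_per_second(blocks_per_second: int, block_bytes: int) -> float:
    """Block payload rate in MB/s (assuming 1 MB = 1,000,000 bytes)."""
    return blocks_per_second * block_bytes / 1e6


sent = payload_mb_per_second(97_824, 8192)
print(f"block payload: {sent:.1f} MB/s")
print(f"sent/received ratio from the observed figures: {805 / 54:.1f}")
```

Under those assumptions the payload is about 801 MB/s, close to the 805 MB/s observed in the sending direction, and the traffic is roughly 15 times heavier one way than the other.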


Let us now look at what we get if we
split the table into two ranges, and have queries originating on both
nodes.


Bi-directional – 1 x 10 Gigabit


Here we compare a 16 + 16 LMS
bi-directional configuration to both 16 and 32 LMS uni-directional
setup.





  • Significant advantage to either 16 + 16 LMS bi-directional or
    32 LMS uni-directional configuration – each around 65% higher
    peak throughput.


  • Bi-directional workload has slightly higher throughput for same
    number of total LMS server processes – 2.0% higher at the peak


  • The peak on this chart corresponds to 51.8% utilization on the
    single 10 Gigabit NIC in use, so it is unlikely that network
    bandwidth or packet rate is the bottleneck



Bi-directional – 1 x 1 Gigabit


How about if we do the same where we
know we have saturated (at least one direction of) the interconnect?








  • As we might expect, the bi-directional Cache Fusion test on a
    single 1 Gigabit link sees quite a bit more Cache Fusion transfers.
    At a 79% improvement, we do not see a doubling however.




CPU Impact of Select Tunings


An important question is – What is the effect on CPU
consumption of tuning my system in these ways?

This is important because our Cache Fusion test exercises a small
part of RAC, using a very simple query. It would be good to know how
much CPU is available to a “real world” workload after we
tune for best Cache Fusion performance.


“N3” Network Tuning for best Latency


First, let's look at the impact of the
“N3” network tuning. It might be expected that these
tunings will increase CPU load, although the offloading of UDP
check-sums to network hardware should in theory reduce CPU
consumption. Let's look at an “entry-level” comparison –
a single 1 Gigabit interface and 3 LMS server processes.












  • The simple answer is that the N3 tuning actually consumes

    less CPU, both on the server node (called “1” here) and

    on the client node generating the Cache Fusion requests (called “2”

    here).


  • Is there a suggestion of a bottleneck on the client node,

    hinted at by the near-vertical increase in CPU consumption?







How about if we are generating much more throughput? Perhaps with
1 x 10 Gigabit and 64 LMS processes?








  • Again, N3 tuning consumes less CPU. We do forgo a higher

    peak throughput for this configuration, however.


  • The “base” configuration went considerably

    retrograde in throughput, and this is not shown.



Impact of Increasing Number of LMS Processes


How about if we scale up the number of LMS processes on the
block-serving node? We do this so that we can handle more throughput
– do we suffer in terms of CPU consumption?












  • Here we see the CPU consumption on the block-serving node is

    little different as we add more LMS servers


  • There is a more noticeable increase in CPU consumption on the

    client side, although it is still small




Appendix A – Notes on the Benchmark






We use several non-standard tunings to achieve the desired
behavior where we see a constant stream of database blocks from the
buffer cache of the block-serving node to the client node. These
include:


Buffer sizing


We size the database buffers on the block-serving node large
enough to hold the range of blocks we will use, but not large enough
on the client node. This way, as we iterate through the range of
blocks, any block that was previously transferred to the client node
will have been evicted from the client node's cache by the time we
query it again.
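The effect of this sizing can be shown with a toy model. The sketch below simulates a client cache with simple least-recently-used eviction while scanning a range of blocks repeatedly. This is only a model: Oracle's buffer cache replacement policy is more elaborate than pure LRU, and the block counts are made up for illustration.

```python
from collections import OrderedDict


def remote_fetches(num_blocks: int, cache_blocks: int, passes: int) -> int:
    """Count blocks the client must fetch from the other node during cyclic scans."""
    cache = OrderedDict()
    fetches = 0
    for _ in range(passes):
        for block in range(num_blocks):
            if block in cache:
                cache.move_to_end(block)   # hit: mark as most recently used
                continue
            fetches += 1                   # miss: block comes from the other node
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return fetches


print(remote_fetches(1000, 500, 3))   # client cache smaller than the range
print(remote_fetches(1000, 1000, 3))  # client cache holds the whole range
```

With the undersized cache every query in every pass needs a transfer, which is the steady stream of Global Cache traffic the test wants. With a cache that holds the whole range, only the first pass generates transfers.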


_fairness_threshold


This is set to 0, which means that the block-serving node will
retain blocks in its cache even in the face of the client node's
requests and absence of any accesses local to the block-serving
node. Without this setting, the block-serving node would relinquish
ownership of the blocks that it has cached.


_fast_cursor_reexecute


This setting advises the database to maintain cursor context for
queries, so that they can be re-executed more efficiently. Without
this setting we hit a bottleneck with many simultaneous SQL\*Plus
clients.


Fusion Compression


Fusion Compression is a feature, on by default, where “empty”
space in a database block is not transmitted in response to the GC
block request. Instead, the block is transmitted as just the meta
data and rows (or other data), then re-constituted on the receiving
side.


For our Cache Fusion test framework, we have only one row per DB
block, for each block size. This is so that we can better control
the GC block caching and transfer. This means we want to disable
GC Fusion Compression, so that our network utilization is more in
line with real world experience. Experiments with Compression
disabled will be referred to as “Raw”.


The impact of using Fusion Compression, for various block sizes, is
illustrated below:








  • The size of a transfer, when there is a single row in a block

    is fairly constant regardless of block size.













  • The effect of different block sizes on throughput is
    therefore masked by compression when we have only one row per block.









  • The effect on response time is also masked.



Hence the decision to disable Fusion Compression for these tests.



Appendix B – How to Set Network Tunings


Here are brief details on how to set the network tunings used in
these tests. Some of these are specific to the “nxge”
interface.



  1. Jumbo Frames



Depending on your Solaris version, you may need to use:


# dladm set-linkprop -p mtu=9000 nxge0


or you may need to add this to /platform/sun4v/kernel/drv/nxge.conf:


accept_jumbo = 1;


  2. Hardware offload for UDP check-sums:



Add this to /etc/system:


set nxge:nxge_cksum_offload = 1


  3. Disable RX soft rings:



Add this to /etc/system:


set ip:ip_squeue_fanout = 0

set ip:ip_soft_rings_cnt = 0



  4. Disable TX serialization:



Add this to /etc/system:


set nxge:nxge_tx_scheme = 1


  5. Disable interrupt blanking:



Add this to /platform/sun4v/kernel/drv/nxge.conf:


rxdma-intr-time = 1;

rxdma-intr-pkts = 8;
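Because the N1 to N4 tunings described in Tuning for Network Latency are cumulative, it can help to see which of the settings above belong to each level. The sketch below is one way to tabulate that in Python. It only restates the settings listed in this appendix, and it assumes the rxdma-intr-time and rxdma-intr-pkts entries in nxge.conf are the interrupt blanking settings added at N3. The names STEPS, ORDER and settings_for are ours.

```python
ETC_SYSTEM = "/etc/system"
NXGE_CONF = "/platform/sun4v/kernel/drv/nxge.conf"

# Settings added at each tuning level (each level includes the ones before it).
STEPS = {
    "N1": [(ETC_SYSTEM, "set nxge:nxge_cksum_offload = 1")],
    "N2": [(ETC_SYSTEM, "set ip:ip_squeue_fanout = 0"),
           (ETC_SYSTEM, "set ip:ip_soft_rings_cnt = 0")],
    "N3": [(NXGE_CONF, "rxdma-intr-time = 1;"),   # assumed: interrupt blanking
           (NXGE_CONF, "rxdma-intr-pkts = 8;")],
    "N4": [(ETC_SYSTEM, "set nxge:nxge_tx_scheme = 1")],
}
ORDER = ["N1", "N2", "N3", "N4"]


def settings_for(level: str) -> dict:
    """Return {file: [lines]} for a cumulative tuning level N1..N4."""
    result = {}
    for name in ORDER[: ORDER.index(level) + 1]:
        for path, line in STEPS[name]:
            result.setdefault(path, []).append(line)
    return result


for path, lines in settings_for("N3").items():
    print(path)
    for line in lines:
        print("   ", line)
```

Jumbo Frames are left out of the table because they were enabled in every configuration, including the Out-Of-the-Box baseline.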






