
Predictable, Consistent, High Performance Computing with Oracle Bare Metal Cloud Compute Service

By: Lee Gates

Hi everyone, I'm Lee Gates, and I focus on performance in the Oracle Bare Metal Cloud Services product management team.

Last November, we launched Oracle Bare Metal Cloud Services (BMCS) and introduced our Compute Service with bare metal instances. Today I am going to show you the value of our high performance bare metal instances and how they can be used seamlessly in concert with virtual machine instances.

We know it is difficult to achieve consistent performance with enterprise workloads, and customers need a range of options. I'll provide an overview of the features and the performance results for each component in a bare metal instance. You can have your cake and eat it too: compute instance mobility lets you move or clone a bare metal instance and launch it as a new virtual machine in minutes!

Understanding platform capabilities is the first step when you have enterprise applications that demand consistent and predictable performance. This can be measured using simple tests designed to find sustained peak performance. By treating platform performance review as a workload sizing exercise, you can get an understanding of how your application will run after migration. Here is a summary of our bare metal instance types:

Shape    | Instance        | Cores | RAM (GB) | Networking          | NVMe Storage
Standard | BM.Standard1.36 | 36    | 256      | 10 Gigabit Ethernet | N/A
HighIO   | BM.HighIO1.36   | 36    | 512      | 10 Gigabit Ethernet | 12.8 TB NVMe
DenseIO  | BM.DenseIO1.36  | 36    | 512      | 10 Gigabit Ethernet | 28.8 TB NVMe

BMCS bare metal instances deliver over 4 million IOPS from their local NVMe storage devices, a cloud first, meeting the most demanding and difficult workloads. The compute architecture is built around dual Intel Xeon E5-2699 v3 processors @ 2.30 GHz, enabling high performance for compute intensive workloads. An Intel 10 GbE network adapter provides line-rate networking to your block devices, to other instances in your network, and to your internet traffic. The NVMe flash storage is delivered by the Samsung NVMe SSD Controller 172X.
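
If you want to sanity-check these platform details from inside a running instance, a few stock Linux commands are enough. This is just a quick sketch; the exact output will vary by shape and image, and ethtool may need to be installed:

  # CPU model and socket/core/thread topology
  lscpu | egrep 'Model name|Socket|Core|Thread'

  # Installed memory in gigabytes
  free -g

  # Local NVMe devices and sizes (HighIO and DenseIO shapes only)
  lsblk -d -o NAME,SIZE,MODEL | grep nvme

  # Link speed of each physical NIC
  for nic in $(ls /sys/class/net | grep -v '^lo$'); do
    echo "$nic: $(ethtool "$nic" 2>/dev/null | grep Speed)"
  done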

Bare Metal Test Plan

Here we'll go through the tests I use to measure and benchmark performance, and summarize how each platform component performs from a compute instance's perspective. We'll start the review with a summary of our plan:

Component     | Measurement                                 | Observation
Compute       | SPEC CPU2017                                | SPECrate2017_int estimate up to 128
Memory        | Memory bandwidth                            | up to 100 GB/s
Network       | 10 GbE bandwidth, host-to-host latency      | < 100 microseconds
NVMe devices  | 12.8 TB and 28.8 TB latency and throughput  | < 1 millisecond latency for all R/W mixes
Block storage | Single volume and 32-volume performance     | < 1 millisecond latency @ 10 GbE for all R/W mixes

Compute

Let's move to the CPU and its integer and floating point performance.

Standard Performance Evaluation Corporation (SPEC) CPU®2017 v1 is an industry standard CPU intensive benchmark suite stressing a system's processor, memory subsystem, and compiler. It consists of 10 integer benchmarks and 14 floating point benchmarks. The SPEC CPU2017 suite can be run to provide a speed metric or a throughput metric, each using the same base optimizations or per-benchmark peak optimizations.

SPEC CPU2017 is SPEC's latest update to the CPU series of benchmarks. The focus of CPU2017 is on compute intensive performance and the benchmarks emphasize the performance of the processor, memory hierarchy, and compilers.

The benchmark is also divided into four suites:

    • SPECspeed 2017 Integer – 10 integer benchmarks
    • SPECspeed 2017 Floating Point – 10 floating point benchmarks
    • SPECrate 2017 Integer – 10 integer benchmarks
    • SPECrate 2017 Floating Point – 13 floating point benchmarks

Each of the suites contains two metrics, base and peak, which reflect the amount of optimization allowed. The commonly used overall metrics for the benchmark suites are:

    • SPECspeed2017_int_base, SPECspeed2017_int_peak: integer speed
    • SPECspeed2017_fp_base, SPECspeed2017_fp_peak: floating point speed
    • SPECrate2017_int_base, SPECrate2017_int_peak: integer rate
    • SPECrate2017_fp_base, SPECrate2017_fp_peak: floating point rate
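
For reference, SPEC's harness is driven by the runcpu command. A minimal sketch of an integer rate run, assuming you have your own licensed copy of SPEC CPU2017 installed under /opt/cpu2017 and a compiler config file named my-intel.cfg (both placeholders here), looks roughly like this:

  # Set up the SPEC environment in the install directory
  cd /opt/cpu2017 && . ./shrc

  # Build and run the integer rate suite with base tuning,
  # one copy per hardware thread (72 on these shapes)
  runcpu --config=my-intel.cfg --tune=base --copies=72 intrate

As in the tables below, treat the resulting numbers as estimates unless the run is formally submitted to and accepted by SPEC.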

When I ran the test using default values on the bare metal instances, the estimates were:

Oracle BMCS Measured Estimates
Bare Metal Machine Shapes (all stock, OL 7.3, Intel compiler 17.0.4.196)

Shape           | System                                                              | SPECspeed2017_int Base/Peak | SPECspeed2017_fp Base/Peak | SPECrate2017_int Base/Peak | SPECrate2017_fp Base/Peak
BM.DenseIO1.36  | 2x E5-2699 v3 (2.3/3.6* GHz, 2s/36c/72t), 512 GB DDR4-2133, 10 GbE | 4.13 / 4.38                 | 66.5 / 66.5                | 118 / 127                  | 108 / 108
BM.HighIO1.36   | 2x E5-2699 v3 (2.3/3.6* GHz, 2s/36c/72t), 512 GB DDR4-2133, 10 GbE | 4.16 / 4.38                 | 67.0 / 67.2                | 120 / 128                  | 109 / 109
BM.Standard1.36 | 2x E5-2699 v3 (2.3/3.6* GHz, 2s/36c/72t), 256 GB DDR4-2133, 10 GbE | 4.12 / 4.37                 | 69.2 / 69.3                | 120 / 128                  | 111 / 111

Memory

Memory bandwidth and latency are important for data intensive workloads. We're running the STREAM benchmark, automated by the Cloud Harmony memory test harness. STREAM measures sustainable memory bandwidth and the corresponding computation rate for four simple vector kernels. While it can be run serially, it is typically run in parallel (using OpenMP, pthreads, or MPI). The benchmark benefits from compiler optimization up to a point; for parallel runs, performance is ultimately constrained by thread or process synchronization (e.g., the efficiency of barrier() calls in the underlying system libraries). Additionally, some parallel library implementations will use (and bind to) only physical cores, so some care is required when interpreting results if vCPUs (e.g., Intel Hyper-Threading) are enabled.
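
As a rough sketch of what the harness runs under the covers, you can also build and run the reference stream.c from the official STREAM site yourself. The array size below is only an illustrative value chosen to comfortably exceed the last-level caches on these machines:

  # Build STREAM with gcc and OpenMP; STREAM_ARRAY_SIZE must be several
  # times larger than the combined last-level caches (this value is ~4.8 GB of data)
  gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=200000000 -DNTIMES=20 stream.c -o stream

  # One thread per hardware thread; reports Copy/Scale/Add/Triad best rates in MB/s
  export OMP_NUM_THREADS=72
  ./stream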

Oracle BMCS Results
Bare Metal Machine Shapes - STREAM memory bandwidth in GB/s (1 GB = 10^9 bytes)

System                  | O/S    | Compiler     | Threads | Copy  | Scale | Add    | Triad
BM.DenseIO1.36 (stock)  | OL 7.3 | gcc 4.8.5    | 72      | 66.13 | 65.83 | 75.23  | 75.11
BM.Standard1.36 (stock) | OL 7.3 | gcc 4.8.5    | 72      | 71.96 | 71.92 | 81.54  | 82.00
BM.HighIO1.36 (stock)   | OL 7.3 | gcc 4.8.5    | 72      | 65.96 | 65.59 | 74.97  | 74.80
BM.DenseIO1.36 (stock)  | OL 7.3 | Intel 17.0.4 | 72      | 83.95 | 84.78 | 93.45  | 93.63
BM.Standard1.36 (stock) | OL 7.3 | Intel 17.0.4 | 72      | 92.69 | 93.68 | 100.80 | 100.94
BM.HighIO1.36 (stock)   | OL 7.3 | Intel 17.0.4 | 72      | 83.93 | 84.74 | 93.41  | 93.60

Network

We'll focus now on measuring network latency; network bandwidth is exercised during the block volume test below. BMCS employs a state-of-the-art networking architecture to ensure consistent, predictable performance. This time we used the Gartner Cloud Harmony network benchmark to measure instance-to-instance latency. The test simply sends ICMP packets to a second node and measures the round-trip latency.

Test Details: ICMP ping between two bare metal instances in the same subnet
Observed Performance: < 100 microsecond latency

[opc@test1 ~]$ ping -c 10 test2
PING test2.ad1.iad.oraclevcn.com (10.0.1.7) 56(84) bytes of data.
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=1 ttl=64 time=0.097 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=2 ttl=64 time=0.080 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=3 ttl=64 time=0.090 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=4 ttl=64 time=0.087 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=5 ttl=64 time=0.096 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=6 ttl=64 time=0.086 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=7 ttl=64 time=0.087 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=8 ttl=64 time=0.083 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=9 ttl=64 time=0.089 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=10 ttl=64 time=0.086 ms
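
If you also want a quick point-to-point bandwidth number outside of the block volume test, an iperf3 run between the same two instances works as a supplemental check; remember to allow the port (5201 by default) in your security list and host firewall:

  # On test2, start the server
  iperf3 -s

  # On test1, run 4 parallel streams for 30 seconds against test2
  iperf3 -c test2 -P 4 -t 30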

NVMe Storage

When we introduced bare metal instances, NVMe defined the performance bar for high performance compute I/O. Having dedicated PCIe bandwidth to 28.8 TB of NVMe flash delivering over 4 million IOPS at < 1 ms latency is an amazing combination.

Before running any tests, protect your data by making a backup of your data and operating system environment to prevent any data loss. WARNING: Do not run FIO tests directly against a device that is already in use, such as /dev/sdX. If it is in use as a formatted disk and there is data on it, running FIO with a write workload (readwrite, randrw, write, trimwrite) will overwrite the data on the disk, and cause data corruption. Run FIO only on unformatted raw devices that are not in use.
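
If you want to spot-check a single raw NVMe device with fio before running the full Cloud Harmony suite, a minimal 4 KB random read job looks like the following. The device name and job sizes are just illustrative, and even a read-only job should only target an unused, unformatted device per the warning above:

  # 4 KB random reads with direct I/O and a deep queue across 8 jobs
  sudo fio --name=nvme-randread --filename=/dev/nvme0n1 \
    --ioengine=libaio --direct=1 --rw=randread --bs=4k \
    --iodepth=64 --numjobs=8 --time_based --runtime=120 \
    --group_reporting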

BM.DenseIO1.36 - NVMe Storage Summary for all Block Sizes and R/W Mixes

Test Details:
  NVMe Block Volume Capacity: 2.9 TB x 9
  Direct I/O
  Host Shape: DenseIO
  Region: Frankfurt
Observed Performance: 36 Core DenseIO Bare Metal Machine, 4.5 million IOPS

Reproduction Steps:
  1. Provision a 36 Core DenseIO BM instance
  2. Run the Gartner Cloud Harmony Block Storage test:

     ~/block-storage/run.sh --nopurge --precondition_once \
       --target /dev/nvme0n1,/dev/nvme1n1,/dev/nvme2n1,/dev/nvme3n1,/dev/nvme4n1,/dev/nvme5n1,/dev/nvme6n1,/dev/nvme7n1,/dev/nvme8n1 \
       --skip_blocksize 512b

The test log file is attached for the measurements. Updated February 16, 2018.

BM.HighIO1.36 - NVMe Storage Summary for all Block Sizes and R/W Mixes

Test Details:
  NVMe Block Volume Capacity: 2.9 TB x 4
  Direct I/O
  Host Shape: HighIO
  Region: Phoenix
Observed Performance: 36 Core HighIO Bare Metal Machine, 1.6 million IOPS

Reproduction Steps:
  1. Provision a 36 Core HighIO BM instance
  2. Run the Gartner Cloud Harmony Block Storage test:

     ~/block-storage/run.sh --nopurge --precondition_once \
       --target /dev/nvme0n1,/dev/nvme1n1,/dev/nvme2n1,/dev/nvme3n1 \
       --skip_blocksize 512b

Block Storage

My May 15th blog covered block volume performance with bare metal machines in detail.

Before running any tests, protect your data by making a backup of your data and operating system environment to prevent any data loss. WARNING: Do not run FIO tests directly against a device that is already in use, such as /dev/sdX. If it is in use as a formatted disk and there is data on it, running FIO with a write workload (readwrite, randrw, write, trimwrite) will overwrite the data on the disk, and cause data corruption. Run FIO only on unformatted raw devices that are not in use.
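
Before pointing any write workload at a device, it is worth double-checking that the device really is unused. A quick check with standard Linux tools (nothing BMCS-specific) might look like this:

  # Show all block devices with filesystem and mount information;
  # anything with a FSTYPE or MOUNTPOINT should be left alone
  lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT

  # Inspect a specific candidate device, e.g. /dev/sdb
  sudo blkid /dev/sdb || echo "no filesystem signature found"
  findmnt /dev/sdb   || echo "not mounted"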

Here's the amazing report showing line rate to 32 volumes:

Test Details:
  Block Volume Capacity: 32 x 1 TB
  Direct I/O
  Host Shape: Standard
  Region: Ashburn
Observed Performance: 32 x 1 TB volumes, 400,000 IOPS

Reproduction Steps:
  1. Mount the 32 x 1 TB block volumes
  2. Run the Gartner Cloud Harmony Block Storage test:

     ~/block-storage/run.sh --nopurge --noprecondition \
       --target /dev/sd[b-ag] \
       --test iops

The test log file is attached for the measurements. Updated February 16, 2018.

For reference, here's the single 1 TB volume test from the May 15th blog:

Test Details:
  Block Volume Capacity: 1 TB
  Direct I/O
  Host Shape: Dense
  Region: Phoenix
Observed Performance: 1 TB volume, 25,000 IOPS

Reproduction Steps:
  1. Mount the 1 TB block volume
  2. Run the Gartner Cloud Harmony Block Storage test:

     ~/block-storage/run.sh --nopurge --noprecondition \
       --target /dev/sdb \
       --test iops \
       --skip_blocksize 512b

[Chart-1TB.png: 1 TB block volume IOPS results]

The test log file is attached for the measurements. Updated February 16, 2018.

Bare Metal Performance Summary

Bare metal instances have significant performance advantages: there is no virtualization management overhead, and you get the additional benefit of seamless compute instance mobility between VM and BM shapes.

Start your development cycle on a virtual machine, then move to a bare metal instance for production or scale testing. You also have the flexibility to work in the other direction. Bring the most difficult workload in your datacenter to a Standard bare metal instance with remote block storage, increasing your performance and availability while converting your capital plan from CapEx to OpEx. Your development team will enjoy the flexibility of cloning VM or BM instances along with their block volumes. This combination enables continuous development and integration best practices.

I'm pleased to have taken you through the test methodology, results, and analysis behind the performance we deliver. These benchmarks can all be run in the Oracle Bare Metal Cloud Compute Service with a free cloud trial today. Here's an overview of all of the platform features you can test with the trial.

Please share your most challenging high availability and performance sensitive workloads. We've reviewed the tests and results, and the performance shown here is up to your most difficult challenges. We want to ensure your success: if you want more information on our performance methodology, have questions about specific workloads, or need help achieving similar results, please reach out to me at lee.gates [-at-] oracle.com.
