
Predictable, Consistent, High Performance Computing with Oracle Bare Metal Cloud Compute Service

By: Lee Gates

Hi everyone, I'm Lee Gates, and I focus on performance in the Oracle Bare Metal Cloud Services product management team.

Last November, we launched Oracle Bare Metal Cloud Services (BMCS) and introduced our Compute Service with bare metal instances. Today I am going to show you the value of our high performance bare metal instances and how they can be used seamlessly in concert with virtual machine instances.

We know it is difficult to achieve consistent performance with enterprise workloads, and customers need a range of options. I'll provide an overview of the features and the performance results for each component in a bare metal instance. You can have your cake and eat it too: compute instance mobility lets you move or clone a bare metal instance and launch it as a new virtual machine in minutes!

Understanding platform capabilities is the first step when you have enterprise applications that demand consistent and predictable performance. This can be measured using simple tests designed to find sustained peak performance. By treating platform performance review as a workload sizing exercise, you can get an understanding of how your application will run after migration. Here is a summary of our bare metal instance types:

Shape    | Instance        | Cores | RAM (GB) | Networking          | NVMe Storage
Standard | BM.Standard1.36 | 36    | 256      | 10 Gigabit Ethernet | N/A
HighIO   | BM.HighIO1.36   | 36    | 512      | 10 Gigabit Ethernet | 12.8 TB NVMe
DenseIO  | BM.DenseIO1.36  | 36    | 512      | 10 Gigabit Ethernet | 28.8 TB NVMe

BMCS bare metal instances deliver over 4 million IOPS from their local NVMe storage devices, a cloud first, meeting the most demanding and difficult workloads. The compute architecture is built around dual Intel Xeon E5-2699 v3 processors @ 2.30 GHz, enabling high performance for compute intensive workloads. An Intel 10 GbE network adapter provides line-rate networking to your block devices, to other instances in your network, and to your internet traffic. The NVMe flash storage is delivered by the Samsung NVMe SSD Controller 172X.
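
If you want to sanity-check these platform details from inside a running instance, a few stock Linux commands are enough. This is just a quick sketch; the exact output will vary by shape and image, and ethtool may need to be installed:

  # CPU model and socket/core/thread topology
  lscpu | egrep 'Model name|Socket|Core|Thread'

  # Installed memory in gigabytes
  free -g

  # Local NVMe devices and sizes (HighIO and DenseIO shapes only)
  lsblk -d -o NAME,SIZE,MODEL | grep nvme

  # Link speed of each physical NIC
  for nic in $(ls /sys/class/net | grep -v '^lo$'); do
    echo "$nic: $(ethtool "$nic" 2>/dev/null | grep Speed)"
  done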

Bare Metal Test Plan

Here we'll go through the tests I use to measure and benchmark performance, and summarize how each platform component performs from a compute instance's perspective. We'll start the review with a summary of our plan:

Component     | Measurement                                 | Observation
Compute       | SPEC CPU2017                                | SPECrate2017_int estimate up to 128
Memory        | Memory bandwidth                            | up to 100 GB/s
Network       | 10 GbE bandwidth, host-to-host latency      | < 100 microseconds
NVMe devices  | 12.8 TB and 28.8 TB latency and throughput  | < 1 millisecond latency for all R/W mixes
Block storage | Single volume and 32-volume performance     | < 1 millisecond latency @ 10 GbE for all R/W mixes

Compute

Let's move to the CPU and its integer and floating point performance.

Standard Performance Evaluation Corporation (SPEC) CPU®2017 v1 is an industry standard CPU intensive benchmark suite stressing a system's processor, memory subsystem, and compiler. It consists of 10 integer benchmarks and 14 floating point benchmarks. The SPEC CPU2017 suite can be run to provide a speed metric or a throughput metric, each using the same base optimizations or per-benchmark peak optimizations.

SPEC CPU2017 is SPEC's latest update to the CPU series of benchmarks. The focus of CPU2017 is on compute intensive performance and the benchmarks emphasize the performance of the processor, memory hierarchy, and compilers.

The benchmark is also divided into four suites:

    • SPECspeed 2017 Integer – 10 integer benchmarks
    • SPECspeed 2017 Floating Point – 10 floating point benchmarks
    • SPECrate 2017 Integer – 10 integer benchmarks
    • SPECrate 2017 Floating Point – 13 floating point benchmarks

Each of the suites contains two metrics, base and peak, which reflect the amount of optimization allowed. The commonly used overall metrics for the benchmark suites are:

    • SPECspeed2017_int_base, SPECspeed2017_int_peak: integer speed
    • SPECspeed2017_fp_base, SPECspeed2017_fp_peak: floating point speed
    • SPECrate2017_int_base, SPECrate2017_int_peak: integer rate
    • SPECrate2017_fp_base, SPECrate2017_fp_peak: floating point rate
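
For reference, SPEC's harness is driven by the runcpu command. A minimal sketch of an integer rate run, assuming you have your own licensed copy of SPEC CPU2017 installed under /opt/cpu2017 and a compiler config file named my-intel.cfg (both placeholders here), looks roughly like this:

  # Set up the SPEC environment in the install directory
  cd /opt/cpu2017 && . ./shrc

  # Build and run the integer rate suite with base tuning,
  # one copy per hardware thread (72 on these shapes)
  runcpu --config=my-intel.cfg --tune=base --copies=72 intrate

As in the tables below, treat the resulting numbers as estimates unless the run is formally submitted to and accepted by SPEC.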

When I ran the test using default values on the bare metal instances, the estimates were:

Oracle BMCS Measured Estimates
Bare Metal Machine Shapes (all stock, OL 7.3, Intel compiler 17.0.4.196)

Shape           | System                                                              | SPECspeed2017_int Base/Peak | SPECspeed2017_fp Base/Peak | SPECrate2017_int Base/Peak | SPECrate2017_fp Base/Peak
BM.DenseIO1.36  | 2x E5-2699 v3 (2.3/3.6* GHz, 2s/36c/72t), 512 GB DDR4-2133, 10 GbE | 4.13 / 4.38                 | 66.5 / 66.5                | 118 / 127                  | 108 / 108
BM.HighIO1.36   | 2x E5-2699 v3 (2.3/3.6* GHz, 2s/36c/72t), 512 GB DDR4-2133, 10 GbE | 4.16 / 4.38                 | 67.0 / 67.2                | 120 / 128                  | 109 / 109
BM.Standard1.36 | 2x E5-2699 v3 (2.3/3.6* GHz, 2s/36c/72t), 256 GB DDR4-2133, 10 GbE | 4.12 / 4.37                 | 69.2 / 69.3                | 120 / 128                  | 111 / 111

Memory

Memory bandwidth and latency are important for data intensive workloads. We're running the STREAM benchmark, automated by the Cloud Harmony memory test harness. STREAM measures sustainable memory bandwidth and the corresponding computation rate for four simple vector kernels. While it can be run serially, it is typically run in parallel (using OpenMP, pthreads, or MPI). The benchmark benefits from compiler optimization up to a point; for parallel runs, performance is ultimately constrained by thread or process synchronization (e.g., the efficiency of barrier() calls in the underlying system libraries). Additionally, some parallel library implementations will use (and bind to) only physical cores, so some care is required when interpreting results if vCPUs (e.g., Intel Hyper-Threading) are enabled.
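
As a rough sketch of what the harness runs under the covers, you can also build and run the reference stream.c from the official STREAM site yourself. The array size below is only an illustrative value chosen to comfortably exceed the last-level caches on these machines:

  # Build STREAM with gcc and OpenMP; STREAM_ARRAY_SIZE must be several
  # times larger than the combined last-level caches (this value is ~4.8 GB of data)
  gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=200000000 -DNTIMES=20 stream.c -o stream

  # One thread per hardware thread; reports Copy/Scale/Add/Triad best rates in MB/s
  export OMP_NUM_THREADS=72
  ./stream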

Oracle BMCS Results
Bare Metal Machine Shapes - STREAM memory bandwidth in GB/s (1 GB = 10^9 bytes)

System                  | O/S    | Compiler     | Threads | Copy  | Scale | Add    | Triad
BM.DenseIO1.36 (stock)  | OL 7.3 | gcc 4.8.5    | 72      | 66.13 | 65.83 | 75.23  | 75.11
BM.Standard1.36 (stock) | OL 7.3 | gcc 4.8.5    | 72      | 71.96 | 71.92 | 81.54  | 82.00
BM.HighIO1.36 (stock)   | OL 7.3 | gcc 4.8.5    | 72      | 65.96 | 65.59 | 74.97  | 74.80
BM.DenseIO1.36 (stock)  | OL 7.3 | Intel 17.0.4 | 72      | 83.95 | 84.78 | 93.45  | 93.63
BM.Standard1.36 (stock) | OL 7.3 | Intel 17.0.4 | 72      | 92.69 | 93.68 | 100.80 | 100.94
BM.HighIO1.36 (stock)   | OL 7.3 | Intel 17.0.4 | 72      | 83.93 | 84.74 | 93.41  | 93.60

Network

We'll focus now on measuring network latency; network bandwidth is exercised during the block volume test below. BMCS employs a state-of-the-art networking architecture to ensure consistent, predictable performance. This time we used the Gartner Cloud Harmony network benchmark to measure instance-to-instance latency. The test simply sends ICMP packets to a second node and measures the round-trip latency.

Test Details: ICMP ping between two bare metal instances in the same subnet
Observed Performance: < 100 microsecond latency

[opc@test1 ~]$ ping -c 10 test2
PING test2.ad1.iad.oraclevcn.com (10.0.1.7) 56(84) bytes of data.
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=1 ttl=64 time=0.097 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=2 ttl=64 time=0.080 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=3 ttl=64 time=0.090 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=4 ttl=64 time=0.087 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=5 ttl=64 time=0.096 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=6 ttl=64 time=0.086 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=7 ttl=64 time=0.087 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=8 ttl=64 time=0.083 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=9 ttl=64 time=0.089 ms
64 bytes from test2.ad1.iad.oraclevcn.com (10.0.1.7): icmp_seq=10 ttl=64 time=0.086 ms
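
If you also want a quick point-to-point bandwidth number outside of the block volume test, an iperf3 run between the same two instances works as a supplemental check; remember to allow the port (5201 by default) in your security list and host firewall:

  # On test2, start the server
  iperf3 -s

  # On test1, run 4 parallel streams for 30 seconds against test2
  iperf3 -c test2 -P 4 -t 30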

NVMe Storage

When we introduced bare metal instances, NVMe defined the performance bar for high performance compute I/O. Having dedicated PCIe bandwidth to 28.8 TB of NVMe flash delivering over 4 million IOPS at < 1 ms latency is an amazing combination.

Before running any tests, protect your data by making a backup of your data and operating system environment to prevent any data loss. WARNING: Do not run FIO tests directly against a device that is already in use, such as /dev/sdX. If it is in use as a formatted disk and there is data on it, running FIO with a write workload (readwrite, randrw, write, trimwrite) will overwrite the data on the disk, and cause data corruption. Run FIO only on unformatted raw devices that are not in use.
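
If you want to spot-check a single raw NVMe device with fio before running the full Cloud Harmony suite, a minimal 4 KB random read job looks like the following. The device name and job sizes are just illustrative, and even a read-only job should only target an unused, unformatted device per the warning above:

  # 4 KB random reads with direct I/O and a deep queue across 8 jobs
  sudo fio --name=nvme-randread --filename=/dev/nvme0n1 \
    --ioengine=libaio --direct=1 --rw=randread --bs=4k \
    --iodepth=64 --numjobs=8 --time_based --runtime=120 \
    --group_reporting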

BM.DenseIO1.36 - NVMe Storage Summary for all Block Sizes and R/W Mixes

Test Details:
  NVMe Block Volume Capacity: 2.9 TB x 9
  Direct I/O
  Host Shape: DenseIO
  Region: Frankfurt
Observed Performance: 36 Core DenseIO Bare Metal Machine, 4.5 million IOPS

Reproduction Steps:
  1. Provision a 36 Core DenseIO BM instance
  2. Run the Gartner Cloud Harmony Block Storage test:

     ~/block-storage/run.sh --nopurge --precondition_once \
       --target /dev/nvme0n1,/dev/nvme1n1,/dev/nvme2n1,/dev/nvme3n1,/dev/nvme4n1,/dev/nvme5n1,/dev/nvme6n1,/dev/nvme7n1,/dev/nvme8n1 \
       --skip_blocksize 512b

The test log file is attached for the measurements. Updated February 16, 2018.

BM.HighIO1.36 - NVMe Storage Summary for all Block Sizes and R/W Mixes

Test Details:
  NVMe Block Volume Capacity: 2.9 TB x 4
  Direct I/O
  Host Shape: HighIO
  Region: Phoenix
Observed Performance: 36 Core HighIO Bare Metal Machine, 1.6 million IOPS

Reproduction Steps:
  1. Provision a 36 Core HighIO BM instance
  2. Run the Gartner Cloud Harmony Block Storage test:

     ~/block-storage/run.sh --nopurge --precondition_once \
       --target /dev/nvme0n1,/dev/nvme1n1,/dev/nvme2n1,/dev/nvme3n1 \
       --skip_blocksize 512b

Block Storage

My May 15th blog covered block volume performance with bare metal machines in detail.

Before running any tests, protect your data by making a backup of your data and operating system environment to prevent any data loss. WARNING: Do not run FIO tests directly against a device that is already in use, such as /dev/sdX. If it is in use as a formatted disk and there is data on it, running FIO with a write workload (readwrite, randrw, write, trimwrite) will overwrite the data on the disk, and cause data corruption. Run FIO only on unformatted raw devices that are not in use.
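
Before pointing any write workload at a device, it is worth double-checking that the device really is unused. A quick check with standard Linux tools (nothing BMCS-specific) might look like this:

  # Show all block devices with filesystem and mount information;
  # anything with a FSTYPE or MOUNTPOINT should be left alone
  lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT

  # Inspect a specific candidate device, e.g. /dev/sdb
  sudo blkid /dev/sdb || echo "no filesystem signature found"
  findmnt /dev/sdb   || echo "not mounted"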

Here's the amazing report showing line rate to 32 volumes:

Test Details:
  Block Volume Capacity: 32 x 1 TB
  Direct I/O
  Host Shape: Standard
  Region: Ashburn
Observed Performance: 32 x 1 TB volumes, 400,000 IOPS

Reproduction Steps:
  1. Mount the 32 x 1 TB block volumes
  2. Run the Gartner Cloud Harmony Block Storage test:

     ~/block-storage/run.sh --nopurge --noprecondition \
       --target /dev/sd[b-ag] \
       --test iops

The test log file is attached for the measurements. Updated February 16, 2018.

For reference, here's the single 1 TB volume test from the May 15th blog:

Test Details:
  Block Volume Capacity: 1 TB
  Direct I/O
  Host Shape: Dense
  Region: Phoenix
Observed Performance: 1 TB volume, 25,000 IOPS

Reproduction Steps:
  1. Mount the 1 TB block volume
  2. Run the Gartner Cloud Harmony Block Storage test:

     ~/block-storage/run.sh --nopurge --noprecondition \
       --target /dev/sdb \
       --test iops \
       --skip_blocksize 512b

[Chart-1TB.png: 1 TB block volume IOPS results]

The test log file is attached for the measurements. Updated February 16, 2018.

Bare Metal Performance Summary

Bare metal instances have significant performance advantages: there is no virtualization management overhead, and you get the additional benefit of seamless compute instance mobility between VM and BM shapes.

Start your development cycle on a virtual machine, then move to a bare metal instance for production or scale testing. You also have the flexibility to work in the other direction. Bring the most difficult workload in your datacenter to a Standard bare metal instance with remote block storage, increasing your performance and availability while converting your capital plan from CapEx to OpEx. Your development team will enjoy the flexibility of cloning VM or BM instances along with their block volumes. This combination enables continuous development and integration best practices.

I'm pleased to have taken you through the test methodology, results, and analysis behind the performance we deliver. These benchmarks can all be run in the Oracle Bare Metal Cloud Compute Service with a free cloud trial today. Here's an overview of all of the platform features you can test with the trial.

Please share your most challenging high availability and performance sensitive workloads. We've reviewed the tests and results, and the performance shown here is up to your most difficult challenges. We want to ensure your success: if you want more information on our performance methodology, have questions about specific workloads, or need help achieving similar results, please reach out to me at lee.gates [-at-] oracle.com.
