Breakthrough performance with OCI Compute HPC shapes powered by AMD EPYC

November 13, 2023 | 10 minute read
Kevin Jorissen
Distinguished HPC Cloud Architect

The E5.HPC shape, based on AMD 4th Gen EPYC (codenamed Genoa), is closely related to E5.Standard, for which we previously published Ansys Fluent performance results. E5.HPC is a 144-core bare-metal server on Oracle’s ultra-low-latency remote direct memory access (RDMA) network, which scales efficiently to tens of thousands of cores. Each E5.HPC instance comes with 3.2 TB of local NVMe storage and 768 GB of fast DDR5 RAM.

What 4th Gen AMD EPYC processors can do

The defining feature of the 4th Gen AMD EPYC processors is a big jump in memory bandwidth (500 GBps) combined with fast DDR5 RAM. Memory bandwidth-hungry high-performance computing (HPC) codes can run efficiently across more cores per node than in previous generations. That in turn makes cores and compute cycles cheaper.

The result is a shape that delivers better performance for the most demanding HPC jobs while typically driving down cost per job by 20–50% compared to older shapes. You can get more simulation and modeling done within your research and development budget or research grant. Meanwhile, Oracle continues to offer the low-core-density BM.Optimized3 shape with 36 cores, so you can run high- and low-density shapes as needed.

CFD benchmarks and comparisons

First, let’s look at a small, commonly run computational fluid dynamics (CFD) benchmark. Ansys performance is typically reported as a rating, but we converted this metric to a runtime using runtime = constant/rating.
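The rating-to-runtime conversion can be sketched in a few lines of Python. The constant is not stated in this post; the value 8.64e6 used here is an assumption inferred from the product of rating and runtime in the E5.HPC rows of the aircraft_wing_14m table, and the function name is only illustrative.

```python
# Rating/runtime pairs copied from the E5.HPC rows of the
# aircraft_wing_14m table in this post.
PAIRS = [(3011, 2870), (6570, 1315), (13091, 660)]

# Assumption: the post does not state the constant. 8.64e6 is inferred
# from rating * runtime for the tabulated pairs.
CONSTANT = 8.64e6


def rating_to_runtime(rating: float, constant: float = CONSTANT) -> float:
    """Convert an Ansys Fluent rating to a runtime in seconds."""
    if rating <= 0:
        raise ValueError("rating must be positive")
    return constant / rating


for rating, tabulated in PAIRS:
    computed = rating_to_runtime(rating)
    print(f"rating {rating}: computed {computed:.0f} s, table {tabulated} s")
```

With this assumed constant, the computed runtimes agree with the tabulated ones to within about a second.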

Ansys Fluent aircraft_wing_14m

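| Shape | Nodes | Cores | Rating | Runtime (s) | Cost ($) | Performance/core | Performance/node |
|---|---|---|---|---|---|---|---|
| E5.HPC | 1 | 144 | 3,011 | 2,870 | $5.05 | 2.42 | 348.5 |
| E5.HPC | 2 | 288 | 6,570 | 1,315 | $4.63 | 2.64 | 380.2 |
| E5.HPC | 4 | 576 | 13,091 | 660 | $4.65 | 2.63 | 378.7 |
| E4 | 1 | 128 | | 6,103 | $10.63 | 1.28 | 163.8 |
| Optimized3 | 1 | 36 | | 9,907 | $7.43 | 2.80 | 100.8 |
| Optimized3 | 2 | 72 | | 4,962 | $7.44 | 2.80 | 100.8 |
| Optimized3 | 4 | 144 | | 2,500 | $7.50 | 2.78 | 100.1 |
| Optimized3 | 8 | 288 | | 1,245 | $7.47 | 2.79 | 100.5 |
| Optimized3 | 16 | 576 | | 625 | $7.50 | 2.78 | 100.1 |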

First, E5.HPC is approximately 3.5 times faster per node than BM.Optimized3 and about 2.1 times faster than E4. Second, we get nearly perfect scaling over the RDMA over Converged Ethernet v2 (RoCE v2) network. The runtime goes down linearly as you add E5.HPC or Optimized3 nodes, while performance per core and cost per job stay the same. So, speeding up the job doesn’t lower the efficiency, and E5.HPC even sees some superlinear scaling between one and two nodes.
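As a rough check on the scaling claim, speedup and parallel efficiency can be computed directly from the tabulated runtimes. This is a minimal sketch: the runtimes are copied from the aircraft_wing_14m table, and the function name `scaling` is only illustrative. Efficiency is taken relative to the smallest node count, so values above 1.0 indicate superlinear scaling.

```python
# Runtime in seconds by node count, from the aircraft_wing_14m table.
E5_HPC = {1: 2870, 2: 1315, 4: 660}
OPTIMIZED3 = {1: 9907, 2: 4962, 4: 2500, 8: 1245, 16: 625}


def scaling(runtimes: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Return {nodes: (speedup, efficiency)} relative to the smallest run."""
    base_nodes = min(runtimes)
    base_runtime = runtimes[base_nodes]
    result = {}
    for nodes in sorted(runtimes):
        speedup = base_runtime / runtimes[nodes]
        # Ideal speedup equals the ratio of node counts.
        efficiency = speedup / (nodes / base_nodes)
        result[nodes] = (speedup, efficiency)
    return result


for name, data in (("E5.HPC", E5_HPC), ("Optimized3", OPTIMIZED3)):
    for nodes, (speedup, efficiency) in scaling(data).items():
        print(f"{name:10s} {nodes:2d} nodes: "
              f"speedup {speedup:5.2f}, efficiency {efficiency:.2f}")
```

Efficiency stays close to 1.0 for Optimized3 across all node counts, and comes out slightly above 1.0 for E5.HPC at two and four nodes.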

Third, E5.HPC gets nearly as high performance per core as Optimized3 at four times the core density! E5.HPC has 144 cores per node, while BM.Optimized3 has only 36 cores per node. This comparison shows the increased memory bandwidth of the AMD 4th Gen EPYC at work. Finally, looking at the cost per job using on-demand pricing, this workload costs about $4.65 on E5.HPC, approximately 38% cheaper than BM.Optimized3 and 54% cheaper than the previous-gen BM.E4.
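The comparison with BM.Optimized3 can be reproduced from the table with simple ratios. The sketch below uses the 576-core rows (4 E5.HPC nodes and 16 Optimized3 nodes); the dictionary layout and function name are illustrative choices, not part of any Oracle tooling.

```python
# Values from the 576-core rows of the aircraft_wing_14m table.
E5_HPC = {"cores_per_node": 144, "perf_per_core": 2.63, "cost_usd": 4.65}
OPTIMIZED3 = {"cores_per_node": 36, "perf_per_core": 2.78, "cost_usd": 7.50}


def relative_saving(new_cost: float, old_cost: float) -> float:
    """Fraction by which new_cost is cheaper than old_cost."""
    return 1.0 - new_cost / old_cost


density_ratio = E5_HPC["cores_per_node"] / OPTIMIZED3["cores_per_node"]
per_core_ratio = E5_HPC["perf_per_core"] / OPTIMIZED3["perf_per_core"]
saving = relative_saving(E5_HPC["cost_usd"], OPTIMIZED3["cost_usd"])

print(f"core density ratio:      {density_ratio:.1f}x")
print(f"per-core performance:    {per_core_ratio:.0%} of Optimized3")
print(f"cost saving per job:     {saving:.0%}")
```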

For comparison, the following results are taken from Azure’s 4th Gen EPYC (HBv4) blog. To calculate runtime, cost, and performance per core, we used on-demand list pricing for the cheapest region available.

  • Shape: Azure HBv4
  • Nodes: 1
  • Cores: 176
  • Rating: 3,248
  • Runtime (in seconds): 2,660
  • Cost (in USD): $5.32
  • Performance/core: 2.14

Converge Si8

| Shape | Nodes | Cores | Runtime (s) | Cost ($) | Performance/core |
|---|---|---|---|---|---|
| E5.HPC | 1 | 144 | 8,321 | $14.64 | 8.35 |
| Optimized3 | 4 | 144 | 8,839 | $26.52 | 7.86 |

At equal core count, E5.HPC has higher per-core performance, faster turnaround time, and 45% lower cost-per-job.

WRF CONUS2.5km (v4.4)

| Shape | Nodes | Cores | Runtime (s) | Cost ($) | Performance/core |
|---|---|---|---|---|---|
| Optimized3 | 1 | 36 | 2,041 | $1.53 | 13.61 |
| E5.HPC | 1 | 144 | 631 | $1.11 | 11.01 |

This weather forecast sees a 28% reduction in cost per job. Per node, E5.HPC is over three times faster.

Averaging across a broader set of HPC benchmarks, we find average cost savings of 28%:

| Benchmark | Cost per job (E5.HPC as % of Optimized3) |
|---|---|
| Ansys Fluent | 76% |
| LSDYNA | 62% |
| Altair RADIOSS | 73% |
| PAM-CRASH | 56% |
| AVL CFD | 97% |
| WRF | 81% |
| Converge | 56% |
| Average cost | 72% |
| Average savings | 28% |
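The averages in the last two rows follow from a plain arithmetic mean of the seven per-benchmark figures, as this short sketch shows. The percentages are copied from the table; the variable names are illustrative.

```python
from statistics import mean

# Cost per job on E5.HPC as a percentage of the cost on Optimized3,
# copied from the table above.
COST_PCT = {
    "Ansys Fluent": 76,
    "LSDYNA": 62,
    "Altair RADIOSS": 73,
    "PAM-CRASH": 56,
    "AVL CFD": 97,
    "WRF": 81,
    "Converge": 56,
}

# Unweighted mean: every benchmark counts equally.
average_cost = mean(COST_PCT.values())
average_saving = 100 - average_cost

print(f"average cost:    {average_cost:.0f}% of Optimized3")
print(f"average savings: {average_saving:.0f}%")
```

The mean is unweighted, so a workload mix dominated by one of these codes would see savings closer to that code's individual figure.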

We show AVL CFD as a counterexample: It doesn’t scale well and can’t use the many cores of the E5.HPC well. Here, you could run multiple jobs per node, downcore the bare metal instance below 144 cores to reduce licensing costs, or choose a different shape, such as VM.Standard.E5. However, many HPC codes that didn’t scale well on previous generations of processors do scale very well on 4th Gen AMD EPYC.

Conclusion

In summary, E5.HPC is a high-core-density, general-purpose HPC shape with per-core performance similar to that of a low-core-density shape, but with a much faster turnaround time per node and a much lower cost per job. In future blogs, we will go into more detail on how best to run these applications on E5.HPC. Hardware has limited availability for general release, so contact your Oracle sales representative for more details.
