The following benchmarks were run by Alberto Araldo and Giovanna Vatteroni from Azimut Benetti S.p.A.
Siemens Simcenter STAR-CCM+ is a multiphysics computational fluid dynamics (CFD) software package for the simulation of products and designs. Customers can deploy it on Oracle Cloud Infrastructure (OCI) using the reference architecture, following the LiveLab, and running the Terraform scripts. In some cases, particularly for customers with predetermined budgets and minimal resource needs, running a high-performance computing (HPC) workload on a dense node with a high number of CPU cores can be more practical than running it on a multinode HPC cluster as a distributed workload.
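The Terraform scripts in the reference architecture are the recommended deployment path. Purely as an illustration of the single dense node alternative, the sketch below provisions one BM.Standard.E5.192 instance with the OCI Python SDK; the compartment, availability domain, image, and subnet identifiers are placeholders, not values from this benchmark.

```python
# Illustrative sketch only: provisioning a single BM.Standard.E5.192 node
# with the OCI Python SDK. All OCIDs and the availability domain are placeholders;
# the reference architecture's Terraform scripts remain the recommended approach.
import oci

config = oci.config.from_file()  # reads ~/.oci/config
compute = oci.core.ComputeClient(config)

launch_details = oci.core.models.LaunchInstanceDetails(
    compartment_id="ocid1.compartment.oc1..example",
    availability_domain="Example:EU-MILAN-1-AD-1",
    shape="BM.Standard.E5.192",
    display_name="starccm-e5-node",
    source_details=oci.core.models.InstanceSourceViaImageDetails(
        image_id="ocid1.image.oc1..example"  # an Oracle Linux or HPC image
    ),
    create_vnic_details=oci.core.models.CreateVnicDetails(
        subnet_id="ocid1.subnet.oc1..example"
    ),
    metadata={"ssh_authorized_keys": open("id_rsa.pub").read()},
)

instance = compute.launch_instance(launch_details).data
print(instance.id, instance.lifecycle_state)
```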
This post compares the results of running customer-specific STAR-CCM+ models from Azimut Benetti on an HPC cluster of up to four BM.Optimized3 (Intel) nodes, on a single BM.Standard.E5 (AMD) node, and on a standalone BM.Optimized3 (Intel) node. For more details regarding the E5 shape, see Available now: E5 instances on AMD EPYC processors for up to 2x better performance.
Azimut Benetti is the leading global private group in the yachting sector. Thanks to continuous innovation and experimentation, it has been the world's leading manufacturer of megayachts and the shipyard with the widest range of models on the market for 24 years.
The architecture used to run this benchmark is based on the one described on the reference architecture site and is shown in Figure 1. An alternative architecture with a Standard E5 shape is illustrated in Figure 2.
Figure 1: OCI HPC RDMA cluster with Bare Metal Optimized3 cluster nodes
Figure 2: A generic architecture with a single Bare Metal E5 standard node
The benchmark runs use two different architectures, based on the following components:
Setup 1: An HPC cluster of up to four BM.Optimized3.36 (Intel) bare metal nodes connected over an RDMA cluster network (Figure 1)
Setup 2: A single BM.Standard.E5.192 (AMD) bare metal node (Figure 2)
To evaluate a total cost-to-time-to-solution metric, we carried out CFD simulations with three models: VTTT, GAS, and Aerodynamic (AIR). Azimut owns these models, and the geometry and simulation settings are specific to Azimut's test use cases.
To perform these simulation benchmarks, we designed two variations of the infrastructure architecture. In the first variation, we deployed an HPC cluster of four Intel CPU-based BM.Optimized3.36 bare metal nodes, for a total of 144 cores. In the other variation, we chose a single AMD E5-based BM.Standard.E5.192 node to perform the same benchmarks.
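As a rough illustration of how the same STAR-CCM+ batch job could be driven in the two variations, the sketch below builds the run commands for a single-node and a multinode case. The simulation file name and host names are made up, and the starccm+ options shown (-np, -batch, -machinefile), the machine file format, and any licensing flags vary by version and site, so treat the details as assumptions rather than the exact commands used for these benchmarks.

```python
# Sketch only: launching the same STAR-CCM+ batch job in the two setups.
# File names, host names, and option usage are illustrative assumptions.
import subprocess

SIM_FILE = "vttt_model.sim"  # hypothetical simulation file

# Setup 2: single BM.Standard.E5.192 node, intranode parallelism only
single_node_cmd = ["starccm+", "-np", "192", "-batch", SIM_FILE]

# Setup 1: four BM.Optimized3.36 nodes (144 cores total) over the RDMA network.
# One hostname per line; the exact machine file format depends on the
# STAR-CCM+/MPI version in use.
with open("machinefile", "w") as f:
    f.write("\n".join(f"hpc-node-{i}" for i in range(1, 5)) + "\n")

multi_node_cmd = ["starccm+", "-np", "144", "-machinefile", "machinefile",
                  "-batch", SIM_FILE]

subprocess.run(single_node_cmd, check=True)  # or multi_node_cmd for Setup 1
```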
The following table summarizes the technical details of the simulation models:
Model | N. Cells  | Solver Setup | N. Time-steps | N. Iterations
VTTT  | 1.30 mil. | U-RANS       | 200           | 1000
GAS   | 2.75 mil. | U-RANS       | 200           | 1000
AIR   | 12.0 mil. | S-RANS       | -             | 200
The following figures compare the CPU time and cost analysis for the different use cases.
Figure 3: CPU time per core for different simulation runs
Figure 4: Total cost per simulation run for different setups
These three simulation use cases show that time to solution and total cost per simulation are minimized when using the standard E5 shape at around 120 CPU cores, compared to the HPC-optimized X9 (BM.Optimized3.36) shape. The need for multinode distribution over the RDMA network with the BM.Optimized3.36 nodes, compared to the intranode runs on the BM.Standard.E5.192, contributes to this outcome.
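The cost metric behind Figure 4 is straightforward: the number of nodes, multiplied by the hourly price of each node, multiplied by the wall-clock time of the run. The sketch below reproduces that arithmetic with placeholder prices and run times; these are not actual OCI prices or the measured timings from these benchmarks.

```python
# Cost-to-time-to-solution arithmetic, with placeholder numbers only.
# Real OCI pricing and the measured wall-clock times are NOT shown here.
def cost_per_run(nodes: int, price_per_node_hour: float, wall_hours: float) -> float:
    """Total compute cost of one simulation run."""
    return nodes * price_per_node_hour * wall_hours

# Hypothetical inputs for the two setups (illustration only):
setup1 = cost_per_run(nodes=4, price_per_node_hour=2.70, wall_hours=5.0)  # 4x BM.Optimized3.36
setup2 = cost_per_run(nodes=1, price_per_node_hour=3.80, wall_hours=6.0)  # 1x BM.Standard.E5.192

print(f"Setup 1 (four-node HPC cluster): ${setup1:.2f}")
print(f"Setup 2 (single E5 node):        ${setup2:.2f}")
```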
However, looking only at price-to-performance for these use cases, the standard bare metal E5 shows a clear lead. Distributed parallel runs on an RDMA cluster have their own strengths and performance scaling, but these use cases didn't show a speed-up when running on the RDMA cluster. There can be many reasons, such as the parallel efficiency of the model, load distribution, or memory footprint, but investigating the cause is out of scope here.
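In this context, parallel efficiency is the measured speed-up divided by the ideal speed-up from the extra cores; when it falls well below one, adding nodes increases cost faster than it reduces run time. A minimal sketch of the two quantities, using made-up timings rather than the benchmark results:

```python
# Speed-up and parallel efficiency, the quantities referred to above.
# Timings below are made up for illustration; they are not the benchmark results.
def speedup(t_baseline: float, t_parallel: float) -> float:
    return t_baseline / t_parallel

def parallel_efficiency(t_baseline: float, t_parallel: float,
                        cores_baseline: int, cores_parallel: int) -> float:
    ideal = cores_parallel / cores_baseline
    return speedup(t_baseline, t_parallel) / ideal

# Example: a 36-core single-node run vs. a 144-core four-node run (hypothetical)
t36, t144 = 10.0, 4.0  # wall-clock hours, illustrative only
print(f"speed-up:   {speedup(t36, t144):.2f}x")
print(f"efficiency: {parallel_efficiency(t36, t144, 36, 144):.0%}")
```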
The following images illustrate the visual models of the streamlines, free surface, and mesh structures:
Figure 5: Hull geometry (orange), sea surface (light blue), and velocity magnitude on streamlines in the aerodynamic simulation.
Figure 6: Hull geometry (grey) and wave formation in a virtual towing tank test simulation.
Figure 7: Hull geometry (orange), wave formation (light blue), and velocity magnitude on exhaust gas streamlines in a virtual towing tank test simulation with exhaust gases.
The benchmark results show that, for these CFD workloads, running on a single node with a standard bare metal hardware configuration without RDMA capabilities can deliver better price-to-performance than running the same workloads on specially optimized HPC cluster nodes with powerful CPUs, locally attached high-throughput I/O drives, and a fast, low-latency network. This approach is most applicable to parallel workloads that fit within the CPU count and memory of a single node.
For more information, see the following resources:
OCI Compute and HPC Specialist
I am a cloud HPC & AI Specialist at Oracle OCI. My focus is infrastructure and performance optimization. I hold a Ph.D. in Computational Physics, concentrating on massively parallel large-scale physics simulations.
I have nearly 15 years of experience in the high-performance computing (HPC), AI, and data center infrastructure industry with a demonstrated track record of success. I am skilled in HPC concepts, architecture, operations, and performance optimization, as well as IT project management. At OCI, I apply my knowledge to give our customers the best price-to-performance for HPC compute on bare metal and virtualized CPUs and GPUs.
Ask me about HPC, AI, and parallel application workloads!