Fluent is an industry-leading computational fluid dynamics (CFD) package, developed by Ansys, that is used worldwide across a wide range of applications.
We created various compute clusters using the HPC Terraform template, available on the Oracle Cloud Marketplace. Table 1 gives an overview of the shapes we tested. For a more comprehensive description, see the OCI documentation.
Table 1: Compute shapes tested

| Compute shape | CPU cores/node | GPUs/node | GPU memory/GPU |
| --- | --- | --- | --- |
| BM.Standard.E5.192 | 192 | 0 | N/A |
| VM.GPU.A10.2 | 30 | Two NVIDIA A10 Tensor Core GPUs | 24 GB |
| BM.GPU.A10.4 | 64 | Four NVIDIA A10 Tensor Core GPUs | 24 GB |
| BM.GPU4.8 | 64 | Eight NVIDIA A100 Tensor Core GPUs | 40 GB |
| BM.GPU.H100.8 | 112 | Eight NVIDIA H100 Tensor Core GPUs | 80 GB |
We obtained the benchmark datasets used in this study from Ansys; they are well known and vary in size and complexity. We carried out the benchmarks using the script “fluent_benchmark_gpu.py,” which is supplied with the Fluent release. For the performance characterization, we recorded the reported million iterative updates per second (MIUPS), where a higher value indicates better overall performance and faster processing time.
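As a concrete reading of the metric, MIUPS is conventionally the number of cell updates (mesh cells times solver iterations) completed per second of wall-clock time, in millions. The helper below is our own minimal sketch of that convention, not code from the Fluent distribution:

```python
def miups(cell_count: int, iterations: int, solver_seconds: float) -> float:
    """Million iterative updates per second (MIUPS): millions of cell
    updates the solver completes per second of wall-clock time."""
    return cell_count * iterations / (solver_seconds * 1.0e6)

# Example: a 100M-cell case completing 20 iterations in 50 s of solver time
# rates 100e6 * 20 / (50 * 1e6) = 40 MIUPS.
print(miups(100_000_000, 20, 50.0))  # 40.0
```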
We carried out all benchmarks on a single dedicated compute shape running Oracle Linux 8 and Fluent 2024R2. As the reference, we used the BM.Standard.E5.192, a state-of-the-art CPU-based system, described in the blog post Ansys Fluent performance on OCI E5 Standard Compute with 4th Gen AMD EPYC. On this shape, we used all cores, assigning a single MPI domain to each.
For the GPU systems, we used one MPI domain per GPU and assigned a single CPU core to each domain. We also tried varying the number of CPU cores per domain with the -gpu_remap option, but the results weren’t consistently better with more than one core. Chart 1 shows the results, scaled so that the reference BM.Standard.E5.192 Compute shape equals 1.0. Because of the wide variation in overall performance, the vertical axis uses a log scale.
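Chart 1's normalization is straightforward to reproduce. The sketch below is our own hypothetical helper (not part of the Fluent tooling): it divides each shape's measured MIUPS by the reference shape's value and plots the ratios on a log axis:

```python
import matplotlib.pyplot as plt

def plot_scaled(miups_by_shape: dict[str, float], reference: str) -> None:
    """Scale each shape's MIUPS to the reference (reference = 1.0) and
    plot the ratios on a log axis, mirroring Chart 1's normalization."""
    base = miups_by_shape[reference]
    scaled = {shape: value / base for shape, value in miups_by_shape.items()}
    plt.bar(list(scaled), list(scaled.values()))
    plt.yscale("log")                 # wide performance spread -> log axis
    plt.axhline(1.0, linestyle="--")  # the CPU reference line
    plt.ylabel(f"Performance relative to {reference}")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

# Usage, with the measured MIUPS values for one benchmark case filled in:
# plot_scaled({"BM.Standard.E5.192": ..., "BM.GPU.H100.8": ...},
#             reference="BM.Standard.E5.192")
```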
At first glance, several data points are missing from Chart 1, mainly because some benchmark datasets exceed the memory available on the various GPUs.
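A back-of-the-envelope memory check anticipates such gaps. The sketch below is a rough illustration only; the bytes-per-cell figure is an assumed placeholder, not an Ansys number, since the real footprint depends on the physics models, precision, and solver settings:

```python
def fits_on_gpu(cell_count: int, gpus_per_node: int, gb_per_gpu: float,
                bytes_per_cell: float = 1_000.0) -> bool:
    """Rough check: does the case fit in the node's aggregate GPU memory?

    bytes_per_cell is an assumed rule of thumb, not an Ansys figure.
    """
    required_gb = cell_count * bytes_per_cell / 1.0e9
    return required_gb <= gpus_per_node * gb_per_gpu

# A 140M-cell case on VM.GPU.A10.2 (2 GPUs x 24 GB = 48 GB total):
# 140e6 cells * ~1 kB/cell is about 140 GB, so the case cannot run there.
print(fits_on_gpu(140_000_000, 2, 24.0))  # False
```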
Overall, Fluent performance on the GPU-based Compute shapes exceeded that of the CPU-based reference system. The total user cost advantage, based on both hardware and software, is likely even greater, because faster run times often reduce the software license cost per job. The performance advantage and cost savings trend even better for the more powerful GPU shapes.
From a usability perspective, the use of GPUs required no changes to our workflow: the same Fluent release and benchmark script ran on every shape.
Our study gives Fluent users a compelling reason to investigate moving their CPU-based Fluent workloads to GPU-based Compute shapes on OCI.
For more information, see the OCI documentation on Compute shapes and the HPC Terraform template on the Oracle Cloud Marketplace.