AI infrastructure refers to the cloud resources required to build and train AI models, now made famous by products like ChatGPT. The infrastructure consists of a cluster of compute instances connected by a high-bandwidth network, which in turn attaches to a high-performance file system. In this post, we leave out the file system for simplicity.
All four hyperscalers offer compute instances targeted at this type of workload, but Oracle Cloud Infrastructure (OCI) is the only provider offering bare metal instances with GPUs, which provide more isolation and no virtualization overhead. OCI also offers more local storage, CPU cores, and CPU memory. While OCI costs as much as 22% less for individual instances, the differences between cloud providers are even more significant for clustered workloads.
| Metric | Unit | Azure | AWS | Google Cloud | OCI |
|---|---|---|---|---|---|
| Instance | Name | NDm A100 v4 | P4de.24xlarge | A2-ultragpu-8g | BM.GPU.GM4.8 |
| Region | Name | East US (N. VA) | US East (N. VA) | US-Central1 | Any region |
| Instance type | | Virtual machine | Virtual machine | Virtual machine | Bare metal |
| CPU | vCPU | 96 | 96 | 96 | 256 |
| CPU memory | | 1,900 GiB | 1,152 GB | 1,360 GB | 2,048 GB |
| GPU type | | NVIDIA A100 80GB | NVIDIA A100 80GB | NVIDIA A100 80GB | NVIDIA A100 80GB |
| GPU count per instance | GPUs | 8 | 8 | 8 | 8 |
| Local storage | TB | 6.4 | 8 | 3 | 27.2 |
| Price | $ per instance per month (730 hours) | $23,922 | $29,905 | $29,602 | $23,360 |
Training large language models like GPT requires tight coupling and low-latency data exchange across multiple compute instances. The infrastructure surrounding the GPUs is the primary determining factor of performance, particularly the networking between compute instances, sometimes called cluster networking.
OCI offers four times the cluster networking bandwidth of Amazon Web Services (AWS) and eight times that of Google Cloud Platform (GCP); OCI and Azure offer similar bandwidth. Industry experts have stated that interconnect bandwidth translates nearly directly into the performance of industry-standard communication libraries like NVIDIA’s NCCL, which is measured by a benchmark suite of the same name.
Networking between instances isn’t the same as networking between individual GPUs within an instance. Though vendors use terminology like “NVSwitch” or “NVLink,” those technologies connect GPUs within a single instance.
| Metric | Unit | Azure | AWS | Google Cloud | OCI |
|---|---|---|---|---|---|
| Instance | Name | NDm A100 v4 | P4de.24xlarge | A2-ultragpu-8g | BM.GPU.GM4.8 |
| Price | $ per instance per month (730 hours) | $23,922 | $29,905 | $29,602 | $23,360 |
| Cluster network bandwidth | Gbps | 1,600 | 400 | 200 | 1,600 |
| Cluster price-performance (lower is better) | $ per Gbps per month | 15.0 | 74.8 | 148.0 | 14.6 |
The real cost of each AI infrastructure cluster is calculated with the following equation:
Total price = Instance fee x Cluster size
Where:
Instance fee = (# GPUs per instance) x (price per GPU-hour) x (duration in hours)
Cluster size = # instances
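For illustration, here’s a minimal Python sketch of these formulas (the function and variable names are ours, not from any vendor API):

```python
def instance_fee(num_gpus: int, price_per_gpu_hour: float, duration_hours: float) -> float:
    """Cost of running one instance for the duration of the job."""
    return num_gpus * price_per_gpu_hour * duration_hours

def total_price(fee_per_instance: float, cluster_size: int) -> float:
    """Total cluster cost: the per-instance fee times the number of instances."""
    return fee_per_instance * cluster_size

# Example: 8 GPUs at $4.00 per GPU-hour for 0.5 hours, across 16 instances
print(total_price(instance_fee(8, 4.00, 0.5), 16))  # 256.0
```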
To compare the true cost of real-world use cases, let’s use a scenario that represents a common workload: training a one-billion-parameter GPT-3-style large language model (LLM). In this use case, a single cluster of 16 compute instances (128 GPUs) runs for as long as it takes to process 1B tokens. Based on a 2022 study by MosaicML, this scenario took 0.5 hours on OCI.
While we could assume a perfect correlation between bandwidth and performance, that assumption might not always hold, particularly for small clusters that don’t saturate the maximum available 1,600 Gbps of bandwidth. Let’s instead assume a modest performance degradation of 10% for each successive halving of bandwidth.
We arrive at these processing times when the bandwidth is halved:
0.5 + (10% x 0.5) = 0.55 hours with 800 Gbps of bandwidth
0.55 + (10% x 0.55) = 0.605 hours with 400 Gbps of bandwidth
0.605 + (10% x 0.605) = 0.6655 hours with 200 Gbps of bandwidth
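This model is easy to express in code. The following sketch (ours) applies a 10% penalty for each halving below the 1,600 Gbps baseline:

```python
import math

def run_time(base_hours: float, bandwidth_gbps: float,
             max_bandwidth_gbps: float = 1600, penalty: float = 0.10) -> float:
    """Apply a 10% slowdown for each successive halving of bandwidth."""
    halvings = math.log2(max_bandwidth_gbps / bandwidth_gbps)
    return base_hours * (1 + penalty) ** halvings

for bw in (1600, 800, 400, 200):
    print(bw, round(run_time(0.5, bw), 4))
# 1600 0.5, 800 0.55, 400 0.605, 200 0.6655
```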
Azure NDm A100 v4 (0.5 hours to process 1B tokens with 1,600 Gbps of bandwidth):
Instance fee = 0.5 x 8 x $4.10 = $16.40
Cluster size = 16
Total = $16.40 x 16 = $262.40
AWS P4de.24xlarge (0.605 hours to process 1B tokens with 400 Gbps of bandwidth):
Instance fee = 0.605 x 8 x $5.12 = $24.78
Cluster size = 16
Total = $24.78 x 16 = $396.48
Google A2-ultragpu-8g (0.6655 hours to process 1B tokens with 200 Gbps of bandwidth):
Instance fee = 0.6655 x (($3.93 x 8) + $9.11) = $26.99
Cluster size = 16
Total = $26.99 x 16 = $431.84
OCI BM.GPU.GM4.8 (0.5 hours to process 1B tokens with 1,600 Gbps of bandwidth):
Instance fee = 0.5 x 8 x $4.00 = $16.00
Cluster size = 16
Total = $16.00 x 16 = $256.00
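As a cross-check, here’s a short Python sketch that reproduces the comparison. The per-GPU rates and run times come from the worked examples above; the dictionary names and structure are ours:

```python
CLUSTER_SIZE = 16  # instances (128 GPUs total)

# Hourly price per instance (8 GPUs each); GCP adds a separate $9.11/hour instance charge
hourly_price = {
    "Azure NDm A100 v4":  8 * 4.10,         # $32.80/hour
    "AWS P4de.24xlarge":  8 * 5.12,         # $40.96/hour
    "GCP A2-ultragpu-8g": 8 * 3.93 + 9.11,  # $40.55/hour
    "OCI BM.GPU.GM4.8":   8 * 4.00,         # $32.00/hour
}

# Hours to process 1B tokens, per the bandwidth-degradation model above
train_hours = {
    "Azure NDm A100 v4": 0.5,
    "AWS P4de.24xlarge": 0.605,
    "GCP A2-ultragpu-8g": 0.6655,
    "OCI BM.GPU.GM4.8": 0.5,
}

oci_total = round(hourly_price["OCI BM.GPU.GM4.8"] * 0.5, 2) * CLUSTER_SIZE
for name, price in hourly_price.items():
    fee = round(price * train_hours[name], 2)  # per-instance fee, rounded to cents
    total = fee * CLUSTER_SIZE
    print(f"{name}: ${total:,.2f} (~{total / oci_total:.0%} of OCI)")
# Azure ~$262.40 (~102%), AWS ~$396.48 (~155%), GCP ~$431.84 (~169%), OCI $256.00 (100%)
```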
Even assuming a performance impact of only 10% for each successive halving of bandwidth from 1,600 Gbps, AWS costs are 155% that of OCI, and GCP costs are 169% that of OCI.
| | Azure | AWS | GCP | OCI |
|---|---|---|---|---|
| Total training cost | $262.40 | $396.48 | $431.84 | $256.00 |
| Relative to OCI | Comparable to OCI | 155% of OCI costs | 169% of OCI costs | Lowest cost and best value: 35% lower than AWS, 41% lower than GCP |
To learn more, visit our AI Infrastructure site.
I'm an experienced product manager and product marketer at both large and startup vendors. I've been a cloud application, platform, and infrastructure end user and website developer, enterprise storage system product manager and marketer, cloud storage and cloud application product manager and operations manager, and storage software product marketer. I've managed business partnerships with many infrastructure ISVs, as well as large systems vendors like Cisco, Dell, EMC, HPE, and IBM.
Akshai is a Director of Product Marketing for Oracle Cloud Infrastructure (OCI) focused on driving adoption of OCI’s services and solutions. He is a graduate of UC Berkeley and Georgia Tech. When not working, he enjoys keeping up with the latest in technology and business.