AI Infrastructure cloud cost comparison: Who provides the best value?

June 13, 2023 | 9 minute read
Leo Leung
Vice President, Product Marketing, OCI
Akshai Parthasarathy
Director of Product Marketing

AI infrastructure refers to the cloud resources required to build and train AI models, now made famous by products like ChatGPT. This infrastructure consists of a cluster of compute instances linked by a high-bandwidth network and backed by a high-performance file system. In this post, we leave out the file system for simplicity.

Price comparisons for compute instances

All four hyperscalers offer compute instances targeted at this type of workload, but Oracle Cloud Infrastructure (OCI) is the only provider with bare metal GPU instances, which provide more isolation and no virtualization overhead. OCI also offers more local storage, CPU cores, and CPU memory. While OCI costs as much as 22% less for individual instances, the differences between cloud providers are even more significant for clustered workloads.

Lowest published on-demand list prices for AI compute instances as of May 2, 2023

| Metric | Unit | Azure | AWS | Google Cloud | OCI |
|---|---|---|---|---|---|
| Instance name | | NDm A100 v4 | P4de.24xlarge | A2-ultragpu-8g | BM.GPU.GM4.8 |
| Region | | East US (N. VA) | US East (N. VA) | us-central1 | Any region |
| Instance type | | Virtual machine | Virtual machine | Virtual machine | Bare metal |
| CPU | vCPUs | 96 | 96 | 96 | 256 |
| CPU memory | GB | 1,900 | 1,152 | 1,360 | 2,048 |
| GPU type | | NVIDIA A100 80GB | NVIDIA A100 80GB | NVIDIA A100 80GB | NVIDIA A100 80GB |
| GPU count per instance | GPUs | 8 | 8 | 8 | 8 |
| Local storage | TB | 6.4 | 8 | 3 | 27.2 |
| Price per instance per month (730 hours) | USD | $23,922 | $29,905 | $29,602 | $23,360 |
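The monthly prices in the table can be converted to effective per-GPU-hour rates, which the cost walkthrough later in this post uses. A minimal sketch of that conversion (instance names are labels for this post, not official SKU identifiers):

```python
# Derive approximate per-GPU-hour rates from the monthly (730-hour)
# instance prices in the table above.
MONTHLY_PRICE = {  # USD per instance per month (730 hours)
    "Azure NDm A100 v4": 23_922,
    "AWS P4de.24xlarge": 29_905,
    "GCP A2-ultragpu-8g": 29_602,
    "OCI BM.GPU.GM4.8": 23_360,
}
GPUS_PER_INSTANCE = 8
HOURS_PER_MONTH = 730

per_gpu_hour = {}
for name, monthly in MONTHLY_PRICE.items():
    # Monthly price -> hourly instance rate -> effective GPU-hour rate
    per_gpu_hour[name] = round(monthly / HOURS_PER_MONTH / GPUS_PER_INSTANCE, 2)
    print(f"{name}: ${per_gpu_hour[name]:.2f} per GPU-hour")
```

For example, OCI's $23,360 per month works out to $4.00 per GPU-hour, and Azure's $23,922 to about $4.10.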

Training large language models like GPT requires tight coupling and low-latency data exchange across multiple compute instances. The infrastructure surrounding the GPUs, particularly the networking between compute instances, sometimes called cluster networking, is the primary determinant of performance.

OCI offers four times the cluster networking bandwidth of Amazon Web Services (AWS) and eight times that of Google Cloud Platform (GCP); OCI and Azure offer similar bandwidth. Industry experts have stated that interconnect bandwidth translates nearly directly into the performance of collective communication libraries such as NVIDIA's NCCL, which underpins industry-standard AI frameworks and has a benchmark suite of the same name.

Networking between instances isn't the same as networking between individual GPUs within an instance. Though vendors use terms like NVSwitch and NVLink, these are technologies within a single instance.

| Metric | Unit | Azure NDm A100 v4 | AWS P4de.24xlarge | Google Cloud A2-ultragpu-8g | OCI BM.GPU.GM4.8 |
|---|---|---|---|---|---|
| Price per instance per month (730 hours) | USD | $23,922 | $29,905 | $29,602 | $23,360 |
| Cluster network bandwidth | Gbps | 1,600 | 400 | 200 | 1,600 |
| Cluster price-performance (monthly price ÷ bandwidth; lower is better) | $/Gbps | 15.0 | 74.8 | 148.0 | 14.6 |
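The cluster price-performance figures above can be reproduced by dividing each monthly instance price by its cluster network bandwidth, a reasonable reading of how the table was computed:

```python
# Cluster price-performance = monthly instance price / network bandwidth.
# Lower is better: fewer dollars per Gbps of cluster interconnect.
instances = {  # name: (monthly price in USD, cluster bandwidth in Gbps)
    "Azure NDm A100 v4": (23_922, 1_600),
    "AWS P4de.24xlarge": (29_905, 400),
    "GCP A2-ultragpu-8g": (29_602, 200),
    "OCI BM.GPU.GM4.8": (23_360, 1_600),
}

price_perf = {name: round(price / bw, 1) for name, (price, bw) in instances.items()}
for name, ratio in price_perf.items():
    print(f"{name}: {ratio} $/Gbps per month")
```

OCI's combination of the lowest monthly price and the highest bandwidth yields the best (lowest) ratio, 14.6.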

AI Infrastructure cost formula

The real cost of each AI infrastructure cluster is calculated with the following equation:

Total price = Instance fee x cluster size

Where the fees are calculated as:

  • Instance fee = (# GPUs per instance) x (price per GPU-hour) x (duration in hours)

  • Cluster size = (# instances)
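The formula above can be sketched as a small function (a minimal illustration, not an official pricing tool):

```python
# Total price = instance fee x cluster size, where the instance fee is
# (# GPUs per instance) x (price per GPU-hour) x (duration in hours).
def cluster_cost(gpus_per_instance: int,
                 price_per_gpu_hour: float,
                 duration_hours: float,
                 num_instances: int) -> float:
    instance_fee = gpus_per_instance * price_per_gpu_hour * duration_hours
    return instance_fee * num_instances

# Example: 8 GPUs at $4.00/GPU-hour for 0.5 hours across 16 instances.
print(cluster_cost(8, 4.00, 0.5, 16))  # → 256.0
```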

Comparing AI Infrastructure cloud costs

To compare the true cost of real-world use cases, let's use a scenario that represents a common workload: training a one-billion-parameter GPT-3-style large language model (LLM). In this use case, we have a single cluster of 16 compute instances (128 GPUs) running for as long as it takes to process 1B tokens. Based on a 2022 study by MosaicML, this scenario took 0.5 hours on OCI.

While we could assume a perfect correlation between bandwidth and performance, that assumption might not always hold, particularly for small clusters that don't require the maximum available 1,600 Gbps of bandwidth. Let's assume a modest performance degradation of 10% for each successive halving of bandwidth.

We arrive at these processing times as the bandwidth is halved:

  • 0.5 + (10% x 0.5) = 0.55 hours with 800 Gbps of bandwidth

  • 0.55 + (10% x 0.55) = 0.605 hours with 400 Gbps of bandwidth

  • 0.605 + (10% x 0.605) = 0.6655 hours with 200 Gbps of bandwidth
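The degradation model above can be sketched in a few lines, starting from 0.5 hours at 1,600 Gbps and adding 10% to the training time at each halving:

```python
# Degradation model: +10% training time per successive halving of
# bandwidth, starting from 0.5 hours at 1,600 Gbps.
hours = 0.5
times = {}
for bandwidth_gbps in (800, 400, 200):
    hours = round(hours * 1.10, 4)  # 10% slower per halving
    times[bandwidth_gbps] = hours
    print(f"{bandwidth_gbps} Gbps: {hours} hours")
```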

Azure NDm A100 v4 (0.5 hours to process 1B tokens with 1,600 Gbps of bandwidth):

  • Instance fee = 0.5 x 8 x $4.10 = $16.40

  • Cluster size = 16

  • Total = $16.40 x 16 = $262.40

AWS P4de.24xlarge (0.605 hours to process 1B tokens with 400 Gbps of bandwidth):

  • Instance fee = 0.605 x 8 x $5.12 = $24.78

  • Cluster size = 16

  • Total = $24.78 x 16 = $396.48

Google Cloud A2-ultragpu-8g (0.6655 hours to process 1B tokens with 200 Gbps of bandwidth):

  • Instance fee = 0.6655 x [($3.93 x 8) + $9.11] = $26.99

  • Cluster size = 16

  • Total = $26.99 x 16 = $431.84

OCI BM.GPU.GM4.8 (0.5 hours to process 1B tokens with 1,600 Gbps of bandwidth):

  • Instance fee = 0.5 x 8 x $4.00 = $16.00

  • Cluster size = 16

  • Total = $16.00 x 16 = $256.00
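The four calculations above can be reproduced end to end. In this sketch, GCP's hourly rate combines a per-GPU price with a separate host component, mirroring the $9.11 term in the calculation above:

```python
# Reproduce the four cluster-cost scenarios: instance fee x cluster size.
CLUSTER_SIZE = 16  # instances (128 GPUs total)

# name: (hourly rate per instance in USD, training hours under the
# 10%-per-halving degradation model)
scenarios = {
    "Azure": (8 * 4.10, 0.5),               # 1,600 Gbps
    "AWS": (8 * 5.12, 0.605),               # 400 Gbps
    "GCP": (8 * 3.93 + 9.11, 0.6655),       # 200 Gbps; GPU price + host component
    "OCI": (8 * 4.00, 0.5),                 # 1,600 Gbps
}

totals = {}
for cloud, (hourly_rate, hours) in scenarios.items():
    instance_fee = round(hourly_rate * hours, 2)
    totals[cloud] = round(instance_fee * CLUSTER_SIZE, 2)
    print(f"{cloud}: instance fee ${instance_fee:.2f}, cluster total ${totals[cloud]:.2f}")
```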

Even assuming a performance impact of only 10% for each successive halving of bandwidth from 1,600 Gbps, AWS costs are 155% of OCI's, and GCP costs are 169% of OCI's.

Billion-parameter GPT-3 training cost

| | Azure | AWS | GCP | OCI |
|---|---|---|---|---|
| Total cost | $262.40 | $396.48 | $431.84 | $256.00 |
| Relative to OCI | Comparable to OCI | 155% of OCI cost | 169% of OCI cost | Lowest cost |

OCI has the lowest cost and best value: 35% lower cost than AWS and 41% lower cost than GCP.

Conclusion

For clustered AI training workloads, OCI offers the lowest cost of the four hyperscalers, driven by lower instance pricing and industry-leading cluster network bandwidth. To learn more, visit our AI Infrastructure site.

Leo Leung

Vice President, Product Marketing, OCI

I'm an experienced product manager and product marketer at both large and startup vendors. I've been a cloud application, platform, and infrastructure end user and website developer; an enterprise storage system product manager and marketer; a cloud storage and cloud application product manager and operations manager; and a storage software product marketer. I've managed business partnerships with many infrastructure ISVs, as well as large systems vendors like Cisco, Dell, EMC, HPE, and IBM.

Akshai Parthasarathy

Director of Product Marketing

Akshai is a Director of Product Marketing for Oracle Cloud Infrastructure (OCI) focused on driving adoption of OCI’s services and solutions. He is a graduate of UC Berkeley and Georgia Tech. When not working, he enjoys keeping up with the latest in technology and business.

