AI Infrastructure cloud cost comparison: Who provides the best value?

June 13, 2023 | 9 minute read
Leo Leung
Vice President, Product Marketing, OCI
Akshai Parthasarathy
Director of Product Marketing

AI infrastructure refers to the cloud resources required to build and train AI models, now made famous by products like ChatGPT. This infrastructure consists of a cluster of compute instances linked by a high-bandwidth network and backed by a high-performance file system. In this post, we leave out the file system for simplicity.

Price comparisons for compute instances

All four hyperscalers offer compute instances targeted at this type of workload, but Oracle Cloud Infrastructure (OCI) is the only provider with bare metal GPU instances, which provide more isolation and no virtualization overhead. OCI also offers more local storage, CPU cores, and CPU memory. While OCI costs as much as 22% less for individual instances, the differences between cloud providers are even more significant for clustered workloads.

Lowest published on-demand list prices for AI compute instances as of May 2, 2023

| Metric | Unit | Azure | AWS | Google Cloud | OCI |
|---|---|---|---|---|---|
| Instance name | | NDm A100 v4 | P4de.24xlarge | A2-ultragpu-8g | BM.GPU.GM4.8 |
| Region | | East US (N. VA) | US East (N. VA) | us-central1 | Any region |
| Instance type | | Virtual machine | Virtual machine | Virtual machine | Bare metal |
| CPU | vCPUs | 96 | 96 | 96 | 256 |
| CPU memory | GB | 1,900 | 1,152 | 1,360 | 2,048 |
| GPU type | | NVIDIA A100 80GB | NVIDIA A100 80GB | NVIDIA A100 80GB | NVIDIA A100 80GB |
| GPU count per instance | GPUs | 8 | 8 | 8 | 8 |
| Local storage | TB | 6.4 | 8 | 3 | 27.2 |
| Price per instance per month (730 hours) | USD | $23,922 | $29,905 | $29,602 | $23,360 |
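The monthly prices in the table can be converted to effective per-GPU-hour rates, which the cost walkthrough later in this post uses. A minimal sketch of that conversion (instance names are labels for this post, not official SKU identifiers):

```python
# Derive approximate per-GPU-hour rates from the monthly (730-hour)
# instance prices in the table above.
MONTHLY_PRICE = {  # USD per instance per month (730 hours)
    "Azure NDm A100 v4": 23_922,
    "AWS P4de.24xlarge": 29_905,
    "GCP A2-ultragpu-8g": 29_602,
    "OCI BM.GPU.GM4.8": 23_360,
}
GPUS_PER_INSTANCE = 8
HOURS_PER_MONTH = 730

per_gpu_hour = {}
for name, monthly in MONTHLY_PRICE.items():
    # Monthly price -> hourly instance rate -> effective GPU-hour rate
    per_gpu_hour[name] = round(monthly / HOURS_PER_MONTH / GPUS_PER_INSTANCE, 2)
    print(f"{name}: ${per_gpu_hour[name]:.2f} per GPU-hour")
```

For example, OCI's $23,360 per month works out to $4.00 per GPU-hour, and Azure's $23,922 to about $4.10.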

Training large language models like GPT requires tight coupling and low-latency data exchange across multiple compute instances. The infrastructure surrounding the GPUs, particularly the networking between compute instances, sometimes called cluster networking, is the primary determinant of performance.

OCI offers four times the cluster networking bandwidth of Amazon Web Services (AWS) and eight times that of Google Cloud Platform (GCP); OCI and Azure offer similar bandwidth. Industry experts have stated that interconnect bandwidth translates nearly directly into the performance of collective communication libraries such as NVIDIA's NCCL, which underpins industry-standard AI frameworks and has a benchmark suite of the same name.

Networking between instances isn't the same as networking between individual GPUs within an instance. Though vendors use terms like NVSwitch and NVLink, these are technologies within a single instance.

| Metric | Unit | Azure NDm A100 v4 | AWS P4de.24xlarge | Google Cloud A2-ultragpu-8g | OCI BM.GPU.GM4.8 |
|---|---|---|---|---|---|
| Price per instance per month (730 hours) | USD | $23,922 | $29,905 | $29,602 | $23,360 |
| Cluster network bandwidth | Gbps | 1,600 | 400 | 200 | 1,600 |
| Cluster price-performance (monthly price ÷ bandwidth; lower is better) | $/Gbps | 15.0 | 74.8 | 148.0 | 14.6 |
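The cluster price-performance figures above can be reproduced by dividing each monthly instance price by its cluster network bandwidth, a reasonable reading of how the table was computed:

```python
# Cluster price-performance = monthly instance price / network bandwidth.
# Lower is better: fewer dollars per Gbps of cluster interconnect.
instances = {  # name: (monthly price in USD, cluster bandwidth in Gbps)
    "Azure NDm A100 v4": (23_922, 1_600),
    "AWS P4de.24xlarge": (29_905, 400),
    "GCP A2-ultragpu-8g": (29_602, 200),
    "OCI BM.GPU.GM4.8": (23_360, 1_600),
}

price_perf = {name: round(price / bw, 1) for name, (price, bw) in instances.items()}
for name, ratio in price_perf.items():
    print(f"{name}: {ratio} $/Gbps per month")
```

OCI's combination of the lowest monthly price and the highest bandwidth yields the best (lowest) ratio, 14.6.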

AI Infrastructure cost formula

The real cost of each AI infrastructure cluster is calculated with the following equation:

Total price = Instance fee x cluster size

Where the fees are calculated as:

  • Instance fee = (# GPUs per instance) x (price per GPU-hour) x (duration in hours)

  • Cluster size = (# instances)
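The formula above can be sketched as a small function (a minimal illustration, not an official pricing tool):

```python
# Total price = instance fee x cluster size, where the instance fee is
# (# GPUs per instance) x (price per GPU-hour) x (duration in hours).
def cluster_cost(gpus_per_instance: int,
                 price_per_gpu_hour: float,
                 duration_hours: float,
                 num_instances: int) -> float:
    instance_fee = gpus_per_instance * price_per_gpu_hour * duration_hours
    return instance_fee * num_instances

# Example: 8 GPUs at $4.00/GPU-hour for 0.5 hours across 16 instances.
print(cluster_cost(8, 4.00, 0.5, 16))  # → 256.0
```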

Comparing AI Infrastructure cloud costs

To compare the true cost of real-world use cases, let's use a scenario that represents a common workload: training a one-billion-parameter GPT-3-style large language model (LLM). In this use case, we have a single cluster of 16 compute instances (128 GPUs) running for as long as it takes to process 1B tokens. Based on a 2022 study by MosaicML, this scenario took 0.5 hours on OCI.

While we could assume a perfect correlation between bandwidth and performance, that assumption might not always hold, particularly for small clusters that don't require the maximum available 1,600 Gbps of bandwidth. Let's assume a modest performance degradation of 10% for each successive halving of bandwidth.

We arrive at these processing times as the bandwidth is halved:

  • 0.5 + (10% x 0.5) = 0.55 hours with 800 Gbps of bandwidth

  • 0.55 + (10% x 0.55) = 0.605 hours with 400 Gbps of bandwidth

  • 0.605 + (10% x 0.605) = 0.6655 hours with 200 Gbps of bandwidth
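The degradation model above can be sketched in a few lines, starting from 0.5 hours at 1,600 Gbps and adding 10% to the training time at each halving:

```python
# Degradation model: +10% training time per successive halving of
# bandwidth, starting from 0.5 hours at 1,600 Gbps.
hours = 0.5
times = {}
for bandwidth_gbps in (800, 400, 200):
    hours = round(hours * 1.10, 4)  # 10% slower per halving
    times[bandwidth_gbps] = hours
    print(f"{bandwidth_gbps} Gbps: {hours} hours")
```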

Azure NDm A100 v4 (0.5 hours to process 1B tokens with 1,600 Gbps of bandwidth):

  • Instance fee = 0.5 x 8 x $4.10 = $16.40

  • Cluster size = 16

  • Total = $16.40 x 16 = $262.40

AWS P4de.24xlarge (0.605 hours to process 1B tokens with 400 Gbps of bandwidth):

  • Instance fee = 0.605 x 8 x $5.12 = $24.78

  • Cluster size = 16

  • Total = $24.78 x 16 = $396.48

Google Cloud A2-ultragpu-8g (0.6655 hours to process 1B tokens with 200 Gbps of bandwidth):

  • Instance fee = 0.6655 x [($3.93 x 8) + $9.11] = $26.99

  • Cluster size = 16

  • Total = $26.99 x 16 = $431.84

OCI BM.GPU.GM4.8 (0.5 hours to process 1B tokens with 1,600 Gbps of bandwidth):

  • Instance fee = 0.5 x 8 x $4.00 = $16.00

  • Cluster size = 16

  • Total = $16.00 x 16 = $256.00
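The four calculations above can be reproduced end to end. In this sketch, GCP's hourly rate combines a per-GPU price with a separate host component, mirroring the $9.11 term in the calculation above:

```python
# Reproduce the four cluster-cost scenarios: instance fee x cluster size.
CLUSTER_SIZE = 16  # instances (128 GPUs total)

# name: (hourly rate per instance in USD, training hours under the
# 10%-per-halving degradation model)
scenarios = {
    "Azure": (8 * 4.10, 0.5),               # 1,600 Gbps
    "AWS": (8 * 5.12, 0.605),               # 400 Gbps
    "GCP": (8 * 3.93 + 9.11, 0.6655),       # 200 Gbps; GPU price + host component
    "OCI": (8 * 4.00, 0.5),                 # 1,600 Gbps
}

totals = {}
for cloud, (hourly_rate, hours) in scenarios.items():
    instance_fee = round(hourly_rate * hours, 2)
    totals[cloud] = round(instance_fee * CLUSTER_SIZE, 2)
    print(f"{cloud}: instance fee ${instance_fee:.2f}, cluster total ${totals[cloud]:.2f}")
```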

Even assuming a performance impact of only 10% for each successive halving of bandwidth from 1,600 Gbps, AWS costs are 155% of OCI's, and GCP costs are 169% of OCI's.

Billion-parameter GPT-3 training cost

| | Azure | AWS | GCP | OCI |
|---|---|---|---|---|
| Total cost | $262.40 | $396.48 | $431.84 | $256.00 |
| Relative to OCI | Comparable to OCI | 155% of OCI cost | 169% of OCI cost | Lowest cost |

OCI has the lowest cost and best value: 35% lower cost than AWS and 41% lower cost than GCP.

Conclusion

For clustered AI training workloads, OCI offers the lowest cost of the four hyperscalers, driven by lower instance pricing and industry-leading cluster network bandwidth. To learn more, visit our AI Infrastructure site.

Leo Leung

Vice President, Product Marketing, OCI

I'm an experienced product manager and product marketer at both large and startup vendors. I've been a cloud application, platform, and infrastructure end user and website developer; an enterprise storage system product manager and marketer; a cloud storage and cloud application product manager and operations manager; and a storage software product marketer. I've managed business partnerships with many infrastructure ISVs, as well as large systems vendors like Cisco, Dell, EMC, HPE, and IBM.

Akshai Parthasarathy

Director of Product Marketing

Akshai is a Director of Product Marketing for Oracle Cloud Infrastructure (OCI) focused on driving adoption of OCI’s services and solutions. He is a graduate of UC Berkeley and Georgia Tech. When not working, he enjoys keeping up with the latest in technology and business.

