Storage Solutions for AI Applications in Oracle Cloud Infrastructure

Introduction

Training a very large language model demands extremely high performance from the compute, storage, and network services of modern data centers. Unfortunately, storage performance is often overlooked compared to the attention given to compute and networking. In real-world training, however, whether of large language models or any other kind of large-scale deep learning, storage performance is just as important as compute and networking.

This article highlights the key properties of storage performance that enterprises should consider and shows how they can scale their AI workloads in Oracle Cloud Infrastructure.

Oracle Cloud Infrastructure provides flexible, low-cost, and highly scalable options for running AI training and inference infrastructure, spanning Compute, Network, and Storage. In this article I will explain how storage for AI workloads differs from storage for traditional applications and describe the flexible Oracle Cloud Storage options for AI workloads.

Plan to be future ready

Planning for storage often involves more complexity than compute and networking because it can be challenging to move data from one storage system to another. Data often comes from a variety of source systems, and it is very unlikely that all the data required for AI training is available in one central storage system. This adds complexity and management overhead to ingesting, preparing, and maintaining the quality of data that is heterogeneous and sourced from disparate systems. Important questions for an enterprise to address include:

  • Can I store all my data needed to successfully execute my AI project at scale?
  • Does my storage system meet the performance expectation of AI training?
  • Is the current storage system able to store all types of existing and future data?
  • While choosing a storage system, how can I remain flexible enough to meet future demands?
  • Can I take advantage of cloud scale for my ever-growing data size? How does the hybrid model of storage fit into this?
  • How can I control the cost of data storage and processing? Does all data always require high performance storage?

Storage Considerations

The answers to these questions are often not obvious without a direct A/B comparison, and the impact of the decision is far-reaching over time. A poor storage choice can be expensive and can inhibit innovation by imposing a tax of slow performance and rising cost.

Considerations for early-stage storage comparisons include:

  • Is it a replacement of existing storage or a net new implementation?
  • How much refactoring of architecture needs to be made and what efforts are involved?
  • Will storage be implemented entirely in the cloud or in a hybrid deployment?
  • How much storage must be locally attached versus remotely attached?
  • What is the distribution of data between object, block or file system storage?
  • How much data will use flash (SSD) versus HDD storage, given their different performance characteristics (e.g., I/O operations per second)?
  • How much data will fit into memory of the compute used?

GPUs are a key element of high-performance enterprise AI compute. GPUs deliver maximum performance when all the data fits into memory. In most large language model or other large deep learning workloads, however, data sets are too large to fit into the memory available in the cluster. Data must be swapped in and out of memory, which impacts performance: depending on the speed of data transfer between storage and memory, training and inference performance can vary widely, model training takes longer, and inference slows down. Self-driving cars, video surveillance, and fraud detection are among the applications where the effect is significant if the GPU waits too long for data.
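
To illustrate the point, the following sketch (assuming PyTorch and a hypothetical dataset of tensor files on a fast local volume) overlaps storage reads with GPU compute so the GPU waits as little as possible for data:

    # Minimal sketch: parallel readers and pinned memory hide storage latency.
    import torch
    from torch.utils.data import DataLoader, Dataset

    class FileBackedDataset(Dataset):
        """Hypothetical dataset that loads one tensor per file from storage."""
        def __init__(self, paths):
            self.paths = paths

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, idx):
            # Each read hits the storage system; its latency and throughput
            # bound how fast batches can reach the GPU.
            return torch.load(self.paths[idx])

    paths = [f"/mnt/fast_storage/sample_{i}.pt" for i in range(100_000)]  # hypothetical layout
    loader = DataLoader(
        FileBackedDataset(paths),
        batch_size=256,
        num_workers=8,      # parallel reader processes hide storage latency
        pin_memory=True,    # page-locked buffers speed host-to-GPU copies
        prefetch_factor=4,  # keep batches queued ahead of the GPU
    )

    for batch in loader:
        batch = batch.cuda(non_blocking=True)  # overlap the copy with compute
        # ... forward/backward pass ...

If storage cannot sustain the read rate, the loop above stalls waiting for batches regardless of how fast the GPU is.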

Additionally, there are data center considerations to address, including how to integrate storage, processing, and networks to provide optimum end-to-end performance.

Storage Solutions for AI

When it comes to storage for AI workloads, there is no one-size-fits-all solution. What is perfectly fine today may not meet tomorrow's performance expectations. For certain workloads there is no such thing as too much storage performance. Besides performance, scale-out capability is extremely important, as the data required for training large AI models is growing exponentially.

The bottom line is that as GPUs and other compute advance to create and use ever more powerful AI, storage performance and scale must also grow to be ready for future demand. Enterprise investment in AI shouldn't suffer diminishing returns because of poor storage performance.

NVMe Flash Storage

NVMe (Non-Volatile Memory Express) flash drives locally attached to compute instances are the fastest storage available for AI clusters. NVMe accelerates data transfer between SSDs and CPUs or GPUs over the fast PCIe bus. The protocol is designed to take advantage of the parallelism of SSDs for low-latency data access and is optimized for Non-Uniform Memory Access (NUMA), allowing multiple CPUs and GPUs to manage data in parallel.

During AI model training, activations and model states can be offloaded to NVMe drives attached to the compute cluster to accelerate computation and communication.
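
For example, frameworks such as DeepSpeed (ZeRO-Infinity) support this kind of offload. A minimal configuration sketch, assuming a recent DeepSpeed release and a hypothetical NVMe mount at /mnt/local_nvme:

    # Sketch: ZeRO stage-3 offload of parameters and optimizer state to NVMe.
    import torch
    import deepspeed

    model = torch.nn.Linear(1024, 1024)  # stand-in for a real model

    ds_config = {
        "train_micro_batch_size_per_gpu": 4,
        "zero_optimization": {
            "stage": 3,
            "offload_param":     {"device": "nvme", "nvme_path": "/mnt/local_nvme"},
            "offload_optimizer": {"device": "nvme", "nvme_path": "/mnt/local_nvme"},
        },
    }

    engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )

The faster the local NVMe, the smaller the penalty for spilling model state out of GPU and host memory.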

Oracle Cloud Infrastructure provides a variety of instance configurations in both bare metal and virtual machine (VM) shapes. Each shape varies along multiple dimensions, including memory, CPU cores, network bandwidth, and the option of local NVMe SSD storage found in Dense IO and HPC shapes.

Oracle Cloud Infrastructure provides a service-level agreement (SLA) for NVMe performance. Measuring performance is complex and open to variability. The Oracle Cloud bare metal shape BM.DenseIO.E5.128 offers an SLA-supported 3.4 million IOPS on a 4K-block random-write FIO benchmark test. You can find more details on Oracle Cloud Compute shapes and their NVMe performance benchmarks here.
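
A test in the same spirit can be run with fio. The following sketch assumes fio is installed and that the local NVMe drives are mounted at the hypothetical path /mnt/nvme:

    # Sketch: 4K random-write fio run against a local NVMe mount.
    import subprocess

    result = subprocess.run(
        [
            "fio",
            "--name=nvme-randwrite",
            "--directory=/mnt/nvme",  # hypothetical mount point
            "--rw=randwrite",         # random writes
            "--bs=4k",                # 4K block size, as in the SLA test
            "--ioengine=libaio",
            "--iodepth=64",
            "--numjobs=8",
            "--size=10G",
            "--runtime=60",
            "--time_based",
            "--direct=1",             # bypass the page cache
            "--group_reporting",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout)  # IOPS appear in the summary lines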

File Storage

AI pipelines can take advantage of common file protocols such as Network File System (NFSv3), which supports data replication, integrity, and encoding, or the SMB protocol. The file system can be a dedicated file server or a NAS head built on top of object or block storage.

File storage can go beyond the size limits of local NVMe and scale out further to provide the amount of storage required for today's large deep neural network training. For inference, file-based storage systems can be used where data is to be discretized, such as in image recognition and object categorization.

Oracle Cloud Infrastructure File Storage employs five-way replicated storage located in different fault domains to provide data redundancy and resiliency, with erasure encoding, and offers the Network Lock Manager (NLM) feature for file locking functionality.
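
From a client's point of view, an OCI file system is consumed like any NFSv3 export. A sketch, using hypothetical values for the mount target IP address and export path:

    # Sketch: mount an OCI File Storage NFSv3 export.
    import subprocess

    MOUNT_TARGET_IP = "10.0.0.10"      # hypothetical mount target private IP
    EXPORT_PATH = "/ai-training-data"  # hypothetical export path
    LOCAL_DIR = "/mnt/fss"

    subprocess.run(["sudo", "mkdir", "-p", LOCAL_DIR], check=True)
    subprocess.run(
        [
            "sudo", "mount", "-t", "nfs",
            "-o", "vers=3,nosuid",     # NFSv3
            f"{MOUNT_TARGET_IP}:{EXPORT_PATH}",
            LOCAL_DIR,
        ],
        check=True,
    )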

Oracle Cloud High Performance File System (HFS)

High-performance file systems support workloads that require the ability to read and write data at extremely high throughput rates. OCI HPC File Systems (HFS) are available on Oracle Cloud Marketplace and make it easy to deploy a variety of industry-leading high-performance file servers. In as little as three clicks, customers can have a file server up and running at petabyte scale with double-digit gigabytes-per-second throughput.

Oracle Cloud High Performance Mount Target

OCI offers a high performance mount target (HPMT) in the File Storage service, which can significantly accelerate data processing compared to standard file storage. A single HPMT can scale throughput to 80 Gbps, and multiple mount targets can be combined to scale linearly, up to 480 Gbps of sustained read throughput, for large language model training across multiple GPU clusters. HPMT is implemented on top of the OCI distributed File Storage service, providing the throughput needed for the high-performance cluster processing that AI workloads require. You can read more about HPMT on File Storage here.
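
Because throughput scales linearly with the number of mount targets, one simple way to exploit several of them is to stripe training files across the corresponding mounts. A hypothetical sketch:

    # Sketch: round-robin training shards across several mount targets
    # so reads draw on their aggregate throughput; paths are placeholders.
    MOUNT_POINTS = ["/mnt/fss0", "/mnt/fss1", "/mnt/fss2"]  # one per mount target

    def shard_path(file_index: int, filename: str) -> str:
        """Pick a mount point deterministically so every worker agrees."""
        mount = MOUNT_POINTS[file_index % len(MOUNT_POINTS)]
        return f"{mount}/{filename}"

    # Example: spread six shards across the three mount targets.
    for i in range(6):
        print(shard_path(i, f"shard_{i:05d}.tar"))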

OCI File Storage with Lustre

Lustre is an open-source, parallel, distributed file system used for high-performance computing (HPC) clusters and environments. The name Lustre is a portmanteau of Linux and cluster. The architecture consists of three layers: metadata servers (MDS), object storage servers (OSS), and Lustre clients. Lustre is open source and runs on most commodity hardware with any kind of block storage device, including single disks, software and hardware RAID, and logical volume managers. Lustre is used in many mission-critical, large-scale AI applications, scaling up to 512 PB in one file system, 32 PB in one file, and 2 TB/s of throughput. It provides built-in capability to help meet high-availability requirements with no single point of failure.
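
On the client side, mounting a Lustre file system is a single operation against the management server. A sketch, assuming the Lustre client modules are installed and using hypothetical values for the management server address and file system name:

    # Sketch: mount a Lustre file system on a client node.
    import subprocess

    MGS_NID = "10.0.0.5@tcp"  # hypothetical management server network ID
    FS_NAME = "lfsai"         # hypothetical Lustre file system name
    LOCAL_DIR = "/mnt/lustre"

    subprocess.run(["sudo", "mkdir", "-p", LOCAL_DIR], check=True)
    subprocess.run(
        ["sudo", "mount", "-t", "lustre", f"{MGS_NID}:/{FS_NAME}", LOCAL_DIR],
        check=True,
    )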

With OCI File Storage with Lustre, Oracle Cloud Infrastructure offers four fully managed performance tiers of the Lustre file system, from 125 MB/s per TB of storage to 1,000 MB/s per TB, with a maximum file system size of 20 PB. The low cost and the flexibility to choose among performance tiers offer a unique opportunity to run your AI workloads on a best-in-class open-source file storage system.
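
Because throughput is provisioned per terabyte, aggregate throughput is a simple function of capacity and tier. A sketch of the arithmetic, assuming the two intermediate tiers fall at 250 and 500 MB/s per TB:

    # Sketch: aggregate throughput per tier for a given provisioned capacity.
    TIERS_MBPS_PER_TB = [125, 250, 500, 1000]  # MB/s per TB; middle values assumed
    capacity_tb = 100                          # example: a 100 TB file system

    for tier in TIERS_MBPS_PER_TB:
        aggregate = tier * capacity_tb  # MB/s
        print(f"{tier} MB/s/TB tier -> {aggregate / 1000:.1f} GB/s aggregate")
    # e.g., the 1,000 MB/s/TB tier on 100 TB yields ~100 GB/s of aggregate throughput.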

Object Storage

Object storage has found a critical place in AI workloads because of its ability to store data in any format. As AI has progressed over the last decades to process image, video, speech, and audio data in unstructured form, object storage has become the landing zone where data is staged before applications process it. Another benefit of object storage is the ability to store metadata. Some AI applications take advantage of object metadata while also benefiting from the near-infinite scale of a flat-address-space object storage architecture. AI analytics can use rich metadata for precise data categorization and organization, making data more useful and easier to manage and understand. Object storage can scale to hundreds of petabytes of data and can be replicated across data centers for high availability. It can be accessed publicly or privately, with the option of adding a security layer on top.

Oracle Cloud Infrastructure offers highly available, durable and scalable object storage that provides low-cost storage solutions for AI applications.
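
A sketch of storing a training artifact together with descriptive metadata using the OCI Python SDK, with hypothetical bucket and object names and the default ~/.oci/config configuration assumed:

    # Sketch: upload an object with custom metadata via the OCI Python SDK.
    import oci

    config = oci.config.from_file()  # reads ~/.oci/config by default
    client = oci.object_storage.ObjectStorageClient(config)
    namespace = client.get_namespace().data

    with open("sample_0001.jpg", "rb") as f:   # hypothetical training image
        client.put_object(
            namespace_name=namespace,
            bucket_name="ai-training-data",    # hypothetical bucket
            object_name="images/sample_0001.jpg",
            put_object_body=f,
            opc_meta={                         # custom metadata stored with the object
                "label": "pedestrian",
                "source": "camera-07",
            },
        )

The metadata travels with the object and can later drive categorization and dataset curation without opening the object itself.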

Block Storage

While block storage is ubiquitous across all types of applications, for AI workloads locally or remotely attached NVMe is preferred for its lower latency. Block storage also lacks the metadata capability that is a benefit of object storage. Most enterprise applications that store structured data and don't require massively parallel processing use block storage.

Oracle Cloud provides low-latency, high-throughput, scalable block volumes of up to 1 PB that use premium SSD disks. You can scale virtual processing units (VPUs) to choose balanced, higher performance, or ultra high performance levels of throughput and IOPS, at up to 225 IOPS/GB and a maximum of 300,000 IOPS per volume. Oracle block volumes are NVMe-backed and utilize flat, fast data center networks to provide 480 MB/s on a 1 TB block volume.
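
Performance is adjusted by changing a volume's VPUs-per-GB setting. A sketch using the OCI Python SDK with a hypothetical volume OCID; at 120 VPUs/GB the volume runs at the ultra high performance level:

    # Sketch: raise a block volume to the ultra high performance level.
    import oci

    config = oci.config.from_file()
    client = oci.core.BlockstorageClient(config)

    client.update_volume(
        volume_id="ocid1.volume.oc1..example",  # hypothetical OCID
        update_volume_details=oci.core.models.UpdateVolumeDetails(
            vpus_per_gb=120,  # higher VPUs/GB buys more IOPS and throughput
        ),
    )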

The storage options above can be summarized by type, performance characteristics, and size and limits:

  • NVMe: Ultra-low latency and high performance, at no additional cost on bare metal GPU nodes. Store models locally or use as scratch space for checkpointing. Size and limits: 8 x 3.84 TB of local NVMe on the BM.GPU.H200.8 shape.
  • File Storage (FSS): Widely used cloud-native file storage service for enterprises, with up to 80 Gbps of throughput per mount target; scales out with additional mount targets. Size and limits: up to 480 Gbps of sustained aggregate read throughput.
  • Object Storage: Eleven nines of durability, for any type of data. Supports interactive workloads and big data processing. Size and limits: scales to nearly unlimited capacity for unstructured data.
  • Lustre: Build an HPC file server on Oracle Cloud Infrastructure bare metal Compute with network-attached block storage or NVMe SSDs locally attached to Compute nodes. On a configuration as small as two Object Storage Server (OSS) nodes, Lustre on Oracle Cloud Infrastructure provides over 5 GiB/s of throughput. Size and limits: 20 GiB/s of aggregate throughput.

 

Key Takeaways in Enterprise Decision Making

Within successful AI projects, storage is not one size fits all; what is satisfactory in scale or performance today may very soon be insufficient. Considering future requirements for scale and performance is therefore pivotal for serious AI initiatives that must respond to the ever-growing compute and network demands of increasingly powerful AI, such as large language models, robotics, self-driving cars, and real-time surveillance applications. It is also important to consider interoperability between the cloud and any data center of choice, which helps ensure the solution works in a hybrid model. Additionally, enterprises should be mindful of budget and balance cost effectiveness against the lowest latency and highest throughput; for example, a lower-cost but slow-performing storage system is likely to be highly cost inefficient because it keeps higher-cost GPUs idle and increases training or inference time. Lastly, storage is a pivotal consideration for overall AI workload performance and shouldn't be placed at the bottom of your AI project priority list.

 

For more information, see the following resources: