Imagine 25,000 conversations directed at you in parallel, all of them waiting for a response. This is what storage systems face when customers train and deploy Large Language Models (LLMs) today. Add multimodal training with pictures, video, audio, and other rich content to the mix, and the problem becomes 10x larger. Artificial Intelligence (AI) and Machine Learning (ML) workloads routinely operate at this scale. GPUs process tens of petabytes (PB) of data in parallel at tens of terabits per second (Tbps) of throughput to enable the most complex models in the world. They need an extremely fast storage system that can access hundreds of thousands of files in parallel and feed them to hundreds of thousands of GPUs at high speed.

Today we are introducing Oracle Cloud Infrastructure (OCI) File Storage with Lustre to meet the performance demands of these workloads. Lustre is designed to deliver parallel I/O performance at scale and is widely used in large-scale LLM training and supercomputing projects.

OCI File Storage with Lustre is a fully managed service based on Lustre. It gives you the performance and scale benefits of Lustre, including metadata latency measured in milliseconds, capacity that scales to petabytes, and throughput of terabytes per second, while eliminating the complexity of managing it yourself. As a fully managed service, OCI automates file system deployment, scaling, and maintenance. Because the service is built on OCI’s leading Block Storage service, you can expect the same enterprise-class availability and durability as an enterprise application running on Block Storage.

A Lustre file system can be accessed in parallel by thousands of clients. OCI File Storage with Lustre is integrated with Oracle Kubernetes Engine (OKE) and can be deployed on GPU hosts, bare metal servers, or virtualized environments.

OCI File Storage with Lustre is available now in the Oracle Cloud Console. Pricing is based on provisioned capacity and performance tier. You can find more details on the Oracle Cloud Pricing webpage.

A diagram of OCI File Storage with Lustre within an Oracle Cloud region

Customer Use Cases

Our customers are already using the service for use cases such as LLM training and engineering simulation models.

Large-Scale LLM Training: In a recent large-scale LLM training run, an OCI File Storage with Lustre file system, managed by OCI and scaled to multiple petabytes, fed 25,000 GPUs at aggregate speeds of up to 20 terabits per second (Tbps).

AI for Engineering: NXAI, a leader in industrial AI simulations, delivers large language models for the manufacturing, logistics, and energy sectors and uses OCI File Storage with Lustre to improve its AI training speeds.

Physics Simulations: Emmi AI powers physics architectures and models that unlock real-time interactions for electrical systems, thermal simulations, and aerospace engineering, and uses OCI File Storage with Lustre to accelerate simulation times.

“We were impressed by the ease and speed of implementing OCI File Storage with Lustre. It transformed our AI training process, making it incredibly efficient. What used to take days is now accomplished in hours, thanks to the 4X-10X performance boost. This solution is a game-changer for our operations.”    — Fabian Schlager, AI Platform Operations, Emmi AI (NXAI Spin-off)

NXAI: www.nx-ai.com    Emmi AI: www.emmi.ai

Why choose OCI File Storage with Lustre?

Customers running large LLMs, GenAI applications, and physics simulations are already using OCI File Storage with Lustre at scale in production. The service is integrated with Oracle Kubernetes Engine (OKE), and the file system can be accessed from GPU hosts, bare metal servers, or virtualized servers. Here are some of its key features:

  • Performance at scale: Access the file system from thousands of clients and GPUs in parallel. OCI File Storage with Lustre efficiently handles massive data loads. The file system can scale up to 20 petabytes (PB), enabling you to store AI, ML, and HPC data, including training datasets, research models, and checkpoints. It delivers high sustained performance for each terabyte (TB) of provisioned capacity. OCI File Storage with Lustre offers the following performance tiers:
    • 125 MBps per provisioned TB
    • 250 MBps per provisioned TB
    • 500 MBps per provisioned TB
    • 1000 MBps per provisioned TB
  • Fully managed service: OCI File Storage with Lustre helps eliminate the complexities of setting up and maintaining Lustre infrastructure components, such as storage servers, metadata servers, and data volumes. You can scale up your capacity and aggregate performance on demand while continuing to run your production applications. This lets you focus on your core business objectives without worrying about infrastructure management. You can create a file system in minutes using the Oracle Cloud Console, command-line tools, APIs, software development kits (SDKs), or Terraform.
  • Highly available architecture: Helps keep the data your critical workloads depend on highly available and resilient to infrastructure failures.
  • Seamless copy between Lustre and Object Storage (coming soon): You can link your Lustre file system to an OCI Object Storage bucket to copy Object Storage data on demand and access it directly from the Lustre file system. This enables you to load data from Object Storage into the file system for faster access.
  • Quota management: OCI File Storage with Lustre enables you to set capacity limits for your users, groups, and projects. This enables predictable storage consumption and helps you keep your storage costs under control.
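
Because the performance tiers are expressed per provisioned terabyte, a rough sizing estimate is simple arithmetic. The Python sketch below is illustrative only. It assumes that "MBps" means decimal megabytes per second and that aggregate throughput scales linearly with provisioned capacity. It does not model any minimum file system size or capacity increment the service may require, and the function names are hypothetical helpers, not part of any OCI SDK.

```python
# Per-TB performance tiers listed above (MB/s per provisioned TB).
TIERS_MBPS_PER_TB = (125, 250, 500, 1000)


def aggregate_throughput_mbps(capacity_tb: float, tier_mbps_per_tb: int) -> float:
    """Aggregate throughput in MB/s, assuming linear scaling with capacity."""
    if tier_mbps_per_tb not in TIERS_MBPS_PER_TB:
        raise ValueError(f"unknown tier: {tier_mbps_per_tb}")
    return capacity_tb * tier_mbps_per_tb


def mbps_to_tbps(megabytes_per_s: float) -> float:
    """Convert decimal megabytes per second to terabits per second."""
    return megabytes_per_s * 8 / 1_000_000


def capacity_tb_for_target(target_tbps: float, tier_mbps_per_tb: int) -> float:
    """Provisioned TB needed to reach a target aggregate rate at a given tier."""
    target_mbps = target_tbps * 1_000_000 / 8
    return target_mbps / tier_mbps_per_tb


if __name__ == "__main__":
    # Capacity needed at each tier to sustain 20 Tbps in aggregate.
    for tier in TIERS_MBPS_PER_TB:
        capacity_tb = capacity_tb_for_target(20, tier)
        print(f"{tier:>4} MBps/TB tier: {capacity_tb / 1000:g} PB for 20 Tbps")
```

Under these assumptions, an aggregate rate of 20 Tbps corresponds to 2.5 PB of provisioned capacity at the 1000 MBps tier and 20 PB at the 125 MBps tier. Actual sizing should follow the published service limits and pricing.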

Getting Started

You can create a file system from the Oracle Cloud Console, the CLI, or the APIs. To create your own Lustre file system today, navigate to Lustre File Storage in the Oracle Cloud Console. The following figures show the main file system setup panels.

Figure 1: Select Lustre File Storage in Oracle Cloud Console

Figure 2: Create new Lustre File System

Figure 3: Select Performance Tier and Capacity

 

For more information, see the following resources: