Co-authored by Arun Sharma
Oracle Cloud Infrastructure (OCI) aims to solve complex problems for customers. OCI is leading the cloud high-performance computing (HPC) battle in performance and price. Over the last few months, we have set new cloud standards for internode latency, cloud HPC benchmarks, and application performance. Oracle Cloud Infrastructure’s bare metal infrastructure lets you get on-premises-level performance in the cloud.
To improve workload run times, OCI introduced cluster networking in 2018. A cluster network is a pool of high-performance computing (HPC) instances connected by a high-bandwidth, ultra-low-latency network. For background, see Large Clusters, Lowest Latency: Cluster Networking on Oracle Cloud Infrastructure.
Our customers asked for the ability to scale HPC clusters in and out as they further optimize workloads and reuse HPC machines. With the launch of cluster network resizing, customers can now manage costs efficiently and heal existing clusters without having to tear them down and rebuild them. You no longer have to maintain oversized clusters for just-in-case scenarios. You can shrink and expand clusters dynamically, while keeping all instances, existing or newly added, on a low-latency, high-bandwidth remote direct memory access (RDMA) network.
You can now use the full functionality of instance pools and change the number of instances in a cluster network by simply resizing the underlying instance pool. When you increase the size, instances are provisioned within the cluster’s RDMA network until the pool reaches the requested number of instances. When you decrease the size, instances are deleted in the order that they were created, as described in the step-by-step guide for resizing a cluster network. This functionality maintains the capacity that workloads need to run while making them more cost-efficient.
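The resize behavior described above can be sketched as a small, self-contained Python model. This is only an illustration under stated assumptions, not the OCI API: the `PoolModel` class, its `resize` method, and the `hpc-N` instance names are hypothetical, and the model covers just two rules from the text (grow until the target size is reached, shrink by deleting in creation order).

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class PoolModel:
    """Toy stand-in for a cluster network's instance pool (hypothetical, not the OCI API)."""
    instances: list = field(default_factory=list)  # ordered oldest -> newest
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def resize(self, target: int) -> list:
        """Grow or shrink the pool to `target`; return the names of deleted instances."""
        if target < 0:
            raise ValueError("target size must be non-negative")
        removed = []
        # Scale out: launch instances until the pool reaches the target size.
        while len(self.instances) < target:
            self.instances.append(f"hpc-{next(self._ids)}")
        # Scale in: delete in creation order, so the oldest instances go first.
        while len(self.instances) > target:
            removed.append(self.instances.pop(0))
        return removed


pool = PoolModel()
pool.resize(4)            # pool now holds hpc-1 .. hpc-4
deleted = pool.resize(2)  # the two oldest instances are deleted
print(pool.instances, deleted)
```

In this model, shrinking from four instances to two leaves the two newest instances in the pool, which is worth keeping in mind if long-running jobs are pinned to the oldest hosts.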
You can also now troubleshoot an individual host without impacting the whole cluster by detaching that instance from the pool. This feature gives you finer control over individual runs. To remove a specific instance from the cluster network, follow the step-by-step guide for detaching an instance from a cluster network. The detached instance is retained until you decide to delete it.
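The detach workflow can be pictured with a similarly small Python sketch. It is a simplified model rather than the OCI API: the `detach` and `delete` helpers and the instance names are hypothetical, and the only behavior it captures is the one stated above, namely that a detached instance leaves the pool but continues to exist until you delete it.

```python
def detach(pool: list, retained: list, instance: str) -> None:
    """Move `instance` out of the pool while keeping it alive in `retained`."""
    if instance not in pool:
        raise ValueError(f"{instance} is not in the pool")
    pool.remove(instance)
    retained.append(instance)


def delete(retained: list, instance: str) -> None:
    """Delete a previously detached instance once troubleshooting is done."""
    if instance not in retained:
        raise ValueError(f"{instance} is not a detached instance")
    retained.remove(instance)


pool = ["hpc-1", "hpc-2", "hpc-3"]  # hypothetical pool members
retained = []                        # detached instances that still exist

detach(pool, retained, "hpc-2")      # isolate the suspect host
print(pool, retained)

delete(retained, "hpc-2")            # remove it only when you decide to
print(pool, retained)
```

Keeping detached instances in a separate collection mirrors the point of the feature: the rest of the cluster keeps running while the isolated host stays available for inspection.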
Both features, dynamic resizing of cluster networks and the ability to detach an instance from the pool, help customers run workloads efficiently and troubleshoot problems faster and more precisely, while keeping their instances on a high-bandwidth, ultra-low-latency network.
Today, cluster network resizing is available in the regions and availability domains that support our HPC instances. Cluster networking continues to expand across our regions as cluster networking-enabled instances roll out.
Andrew Butterfield is the Product Manager for Oracle Cloud Infrastructure’s GPU and HPC offerings. He drives product development, product launches, and the AI and HPC strategy.