When you create a compute instance on Oracle Cloud Infrastructure (OCI), you specify the shape that it’s created from. A shape is a template that determines the number of CPUs, number of GPUs, amount of memory, and other resources that are allocated to an instance.
Choosing the optimum compute shape for machine learning tasks can be challenging, mostly because so many shapes are available to choose from. At the time of writing this article, over 20 are listed on Oracle’s OCI Compute pricing page at https://cloud.oracle.com/compute/pricing.
The shapes that are available cover a variety of cost and performance options. The task of choosing one that best suits your needs is made easier and faster by first applying a few rules of thumb and then conducting a systematic search to find the right one.
I give you an overview here, but you can find detailed instructions at Learn how to choose the best compute shape for machine learning, which is an Oracle Solutions doc.
The first thing to do is create a test workload. You’ll use this workload to assess the performance of the candidate shapes. The workload can be a standard benchmark test, or it can be a machine learning task that uses the data and methods that you plan to employ on your selected compute instance.
Next, provision an instance on OCI Compute. It can be a low-power, inexpensive instance because its only purpose is to serve as a template for creating other instances. Load this instance with the test workload and any data and software that you need to run the workload, then create a custom image of the instance.
Now you need to decide which category your workload is in. That is, do you need a GPU-powered instance to run it, or is a non-GPU instance sufficient? For training neural networks, you almost certainly need a GPU shape. For machine learning training that doesn’t involve a neural network, you can usually use a non-GPU shape unless you have an extremely large amount of data or you’re using NVIDIA’s RAPIDS toolkit. For inference tasks, you can almost always use a non-GPU shape.
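If it helps to see those rules of thumb written down, here’s a minimal Python sketch. The function name `needs_gpu` and its parameters are my own invention for illustration, not part of any OCI tool, and the result is only a starting point, not a guarantee.

```python
def needs_gpu(task, uses_neural_network=False, very_large_data=False, uses_rapids=False):
    """Rule-of-thumb check: True if a GPU shape is the likely starting point.

    `task` is either "training" or "inference" (hypothetical labels).
    """
    if task == "inference":
        # Inference can almost always run on a non-GPU shape.
        return False
    if task != "training":
        raise ValueError("task must be 'training' or 'inference'")
    # Training: neural networks almost certainly need a GPU; other methods
    # need one only for extremely large data or when using RAPIDS.
    return uses_neural_network or very_large_data or uses_rapids


if __name__ == "__main__":
    print(needs_gpu("training", uses_neural_network=True))
    print(needs_gpu("training"))
    print(needs_gpu("inference", uses_neural_network=True))
```

Treat the answer as the category to search first. If the test runs disagree with it, switch categories.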
Finally, come up with a target time that your test workload should run in. Do you need it to complete in less than a minute? Or is it okay to let it run overnight?
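To compare a run against that target, you can wrap the workload in a small timer. This is only a sketch: `run_and_time` is a hypothetical helper, and `workload` stands for whatever function kicks off your own test workload.

```python
import time


def run_and_time(workload, target_seconds):
    """Run `workload` once and report (elapsed_seconds, met_target)."""
    start = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - start
    return elapsed, elapsed <= target_seconds


if __name__ == "__main__":
    # Stand-in workload; replace it with your real benchmark or training job.
    def toy_workload():
        sum(i * i for i in range(1_000_000))

    elapsed, ok = run_and_time(toy_workload, target_seconds=60.0)
    print(f"elapsed: {elapsed:.2f}s, within target: {ok}")
```

A single run can be noisy, so you might repeat the measurement a few times before you rule a shape in or out.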
After you decide between GPU and non-GPU shapes and have a target time for the workload, you’re ready to start the selection process. You arrange the available shapes into a binary search tree (BST), start at the top node, and work your way down. This technique will get you to the best shape in a couple of tries, on average. In the worst case, it’ll take you four tries to get to the best non-GPU shape and three tries for a GPU shape.
Here’s what the GPU shape selection tree looks like, using the shapes that are available at the time of this writing:
Starting at the top, provision an instance that’s based on that shape and uses the custom image that you created earlier. When provisioning is complete, log in to the instance, run the test workload, then evaluate the results. If your workload is taking too long, take the left branch and select that shape. Otherwise, take the right branch.
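The walk down the tree can be sketched in Python as a binary search over a sorted list, which behaves like an implicit BST. Everything named here is an assumption for illustration: the shape names are placeholders rather than real OCI shapes, `meets_target` stands for "provision the shape, run the test workload, compare to the target time," and the sketch assumes that a more powerful shape never runs the workload slower than a less powerful one.

```python
def worst_case_tries(num_shapes):
    """Maximum number of shapes a binary search tries among `num_shapes`."""
    return num_shapes.bit_length()  # equals ceil(log2(num_shapes + 1))


def pick_shape(shapes, meets_target):
    """Find the cheapest shape that meets the target time.

    `shapes` is ordered from cheapest to most powerful.
    `meets_target(shape)` returns True if the test workload was fast enough.
    Returns (best_shape_or_None, list_of_shapes_tried).
    """
    lo, hi = 0, len(shapes) - 1
    best, tried = None, []
    while lo <= hi:
        mid = (lo + hi) // 2
        tried.append(shapes[mid])
        if meets_target(shapes[mid]):
            best = shapes[mid]
            hi = mid - 1  # fast enough: see whether a cheaper shape also works
        else:
            lo = mid + 1  # too slow: move to a more powerful shape
    return best, tried


if __name__ == "__main__":
    # Made-up runtimes in seconds for seven placeholder shapes.
    runtimes = {"s1": 400, "s2": 300, "s3": 200, "s4": 120, "s5": 90, "s6": 60, "s7": 30}
    best, tried = pick_shape(list(runtimes), lambda s: runtimes[s] <= 100)
    print(best, tried, worst_case_tries(len(runtimes)))
```

With seven candidates the search never needs more than three tries, and with up to 15 it never needs more than four, which is consistent with the worst cases mentioned earlier.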
Using this process should save you a good deal of time, and possibly a good deal of money too, because you’ll be able to quickly choose the shape that provides the best balance of speed and cost. For a more detailed look at the process, as well as instructions on how to provision an instance and how to create a custom image, see Learn how to choose the best compute shape for machine learning.