Oracle Cloud Infrastructure (OCI) offers automated cluster deployment, which includes a Slurm scheduler ready to accept jobs. According to its official website, Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
You can enhance Slurm with plugins that expand its capabilities. It supports GPU usage through generic resources (GRES), which are associated with Slurm nodes and consumed by jobs, and GRES plugins cater to different GPU types. Key features of Slurm include scalability to tens of thousands of GPUs and millions of cores, robust security, heterogeneous configurations that support GPU utilization, topology-aware job scheduling for optimal system utilization, and advanced scheduling options, such as reservations, backfill, suspend and resume, fair-share, and preemptive scheduling for critical tasks.
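For example, a job can request GPUs through GRES when it's launched. The following is a minimal sketch that assumes your nodes are configured with the gres/gpu plugin and have nvidia-smi available; the GPU count is a placeholder.

srun --gres=gpu:2 nvidia-smi

The same --gres option can also be used as an #SBATCH directive in a batch script.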
One major benefit of deploying your workloads on a cloud-based cluster is the ability to adjust the cluster's size to your needs with little effort, which allows for faster job completion and significant cost savings. This dynamic adjustment of the cluster's size, known as autoscaling, helps ensure optimal performance. On OCI, you can enable it on a Slurm cluster deployed through the Oracle Cloud Marketplace: autoscaling monitors the Slurm queues, adding Compute nodes when jobs are pending and removing nodes when jobs are complete. You can apply a lot of compute power to a problem with minimal overhead, and you don't have to worry about cleaning up these resources or being charged for them past their useful period.
The following are common Slurm commands that you can use to operate the Slurm cluster day to day. For detailed documentation, see the Slurm documentation.
View partition and node information.
sinfo
View information about jobs in the scheduling queue.
squeue
Submit your sbatch script to specific nodes.
sbatch -w <nodename1,nodename2,…> <location of sbatch script>
If your Slurm configuration has more than one partition, you can submit to a specific partition that your nodes belong to and that isn't the default partition.
sbatch -w <nodename1,nodename2,…> -p <partition name> <location of sbatch script>
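If you don't have a batch script yet, the following is a minimal sketch of one; the job name, resource counts, and output file are placeholder values to adapt to your workload.

#!/bin/bash
#SBATCH --job-name=example-job        # placeholder job name
#SBATCH --nodes=2                     # number of nodes to allocate
#SBATCH --ntasks-per-node=1           # one task per node
#SBATCH --output=example-job-%j.out   # %j expands to the job ID
srun hostname                         # run hostname on every allocated task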
Cancel a job.
scancel <job number>
If something is wrong with a node (a hardware failure or another error you encountered) and you're investigating it, drain the node so that no other users can schedule jobs on it in the meantime.
sudo scontrol update nodename=<nodename1,nodename2,…> state=drain reason=<reason>
When the node is ready, add it back to the Slurm cluster.
sudo scontrol update nodename=<nodename1,nodename2,…> state=resume
View accounting data for a specific job.
sacct -j <job number>
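If you only want a few fields, sacct also accepts a format list; the selection below is just one reasonable example.

sacct -j <job number> --format=JobID,JobName,Partition,State,ExitCode,Elapsed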
If you can SSH to the node but its state in Slurm is idle, SSH to the node and check the status of slurmd. If it's not active, restart the slurmd daemon.
sudo systemctl status slurmd
sudo systemctl restart slurmd
If a job fails, you can check slurmctld logs on the bastion and slurmd logs on the nodes on which the job ran.
sudo vi /var/log/slurm/slurmctld.log (bastion)
sudo vi /var/log/slurm/slurmd.log (node)
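Because these logs can grow large, it's often faster to filter them for the ID of the failed job, for example:

sudo grep <job number> /var/log/slurm/slurmctld.log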
Run parallel jobs on the cluster. The following example runs a command in parallel on the given list of nodes and prints each node's ordering and name.
srun -w <comma separated list of nodes> -N <number of nodes> bash -c 'printf "%d - %s\n" $SLURM_NODEID $SLURMD_NODENAME'
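For example, assuming two nodes named node-1 and node-2, the invocation and its output would look similar to the following:

srun -w node-1,node-2 -N 2 bash -c 'printf "%d - %s\n" $SLURM_NODEID $SLURMD_NODENAME'
0 - node-1
1 - node-2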
Get the full node names instead of the compact hostlist form.
scontrol show hostname node-[1-10]
See the reason that a node was put in the drain state.
scontrol show nodes <nodename> | grep Reason
Based on these commands, you can get started creating a cluster on Oracle Cloud Infrastructure yourself and running jobs with Slurm.