
Slurm

Slurm (which originally stood for Simple Linux Utility for Resource Management) is an open-source resource manager and job scheduler used on more than half of the world's top 500 HPC clusters. The advantage of using Slurm over launching compute jobs directly on one or more machines is that Slurm identifies available resources matching your job requirements and then launches the job(s) for you once those resources become available. Whereas too many users running on the same compute node at the same time leads to resource contention, Slurm gives your job exclusive access to the resources allocated to it. Slurm also enforces configurable fairness policies so that no single user or group can monopolize most or all of the compute resources for an extended period, and it assigns scheduling priority using a fair-share algorithm that gives preference to users who have consumed fewer resources in the period leading up to the scheduling request.

IFML Slurm Cluster

tl;dr: The job submission node for the IFML cluster is slurm-submit.cs.utexas.edu. You submit jobs by ssh'ing to slurm-submit and then running an sbatch script (see the example below). Interactive sessions are also possible, but they go into the pending queue like any other job if resources aren't currently available. Being able to ssh to slurm-submit means you can submit jobs to the cluster, but access to slurm-submit must be requested: please have the faculty member you are working with request it by sending email to help@cs.utexas.edu. Unless you have specialized compute requirements, use the allnodes partition, which includes all nodes as potential resources.
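
For an interactive session, one option is srun with --pty; here is a minimal sketch (the partition, resource amounts, and time limit are placeholders to adjust, not recommendations):

# Request an interactive shell; this queues like any other job, and the
# prompt appears only once the requested resources have been allocated.
srun --partition=allnodes --cpus-per-task=2 --mem=16G --gres=gpu:1 \
     --time=02:00:00 --pty bash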

Cluster Resources

The IFML cluster currently includes 11 compute nodes and 2 storage systems.

  • slurm-node-001 - slurm-node-003: 2 x 20-core Intel Xeon Gold 6230 @ 2.1GHz, 385GB RAM, 8 x Nvidia A40 GPUs with 45GB VRAM
  • slurm-node-004 - slurm-node-011: 2 x 24-core AMD EPYC 7413 @ 2GHz, 1TB RAM, 8 x Nvidia A40 GPUs with 45GB VRAM
  • /datastor1 and /datastor2 (note that you must request permission to write to these)
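
If you want to see the hardware and current state that Slurm reports for each node yourself, a quick check from slurm-submit is the node-oriented long listing:

sinfo -N -l
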
Local Policies

With a large number of compute users and a relatively small cluster, cluster resources must be allocated judiciously to ensure that everyone's submitted job eventually runs. To this end:

  • All jobs are restricted to a maximum 3-day runtime. Once documentation on checkpointing has been made available, this limit will be reduced to 2 days. Requeuing is already enabled on the cluster and users are encouraged to explore this option themselves, but for now, if you have a job that must run for more than 3 days, file a helpreq and we'll work something out.
  • Users are restricted to 8 GPUs total for all currently running jobs. To explore potential exceptions, file a helpreq.
  • You must specify a time limit in your sbatch script; e.g. 
    #SBATCH --time=16:00:00
    #SBATCH --time=2-00:15:12
    These tell Slurm that your job will run for at most 16 hours, or at most 2 days, 0 hours, 15 minutes, and 12 seconds, respectively (the second format is days-hours:minutes:seconds). This information is necessary for the backfill scheduler to function properly: if Slurm has no idea how long your job will run, it can't schedule jobs efficiently, leading to unnecessarily idle nodes and considerably longer queue times. Note that your job will be killed when the time limit is reached, so it's appropriate to overestimate a bit.
  • In your sbatch script you must specify (--cpus-per-task or --ntasks) and (--mem or --mem-per-cpu). If you don't specify how much memory your job will use, Slurm assumes that you need all the memory available on the node. This leads to situations where a user requests a single GPU (out of 4 or 8 available) and the entire node becomes unavailable for other jobs because there is no memory left to allocate.
  • Ssh access is restricted to nodes on which you have currently running jobs. This prevents users from circumventing the scheduler by ssh'ing directly to nodes to run jobs. Note that you are welcome to launch additional processes on the nodes you can access (by having jobs running on them) as long as the resources they require fall under the umbrella of the resources you've requested on that node. For example, suppose you submit a job request for 4 CPUs, 128GB memory, and 2 GPUs on a single node and this job is currently running on slurm-node-007. If the job is actually using only 1 CPU, 96GB memory, and 1 GPU, you can ssh to slurm-node-007 and launch additional processes directly as long as the additional resources used stay under 3 CPUs, 32GB memory, and 1 GPU. If you go over this limit, all your jobs running on that node will be killed by the scheduler. It's suggested that you use the ssh option only to check on running jobs, not to launch additional ones (see the commands after this list).
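
If you do ssh to a node to check on a job, a few standard commands cover most of it; this is only a sketch and assumes the usual NVIDIA tooling is installed on the node:

# Show what Slurm allocated to a running job (also works from slurm-submit)
scontrol show job <jobid>

# On the node itself, see what your processes are actually using
nvidia-smi          # GPU utilization and GPU memory
top -u $USER        # CPU and memory usage per process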

     

Vision DGX-Titan (DURIP) Cluster

Slurm is the cluster management and job scheduling system used to submit jobs to the titan and dgx clusters of machines.

To submit jobs to the titan or dgx clusters, you should ssh into hypnotoad.cs.utexas.edu. If you do not have access to hypnotoad, that is because you do not have access to the titan or dgx clusters. These are not public resources; if you feel that you should have access, you will need to consult your advisor or professor.

At this time, there is no priority system in place, and jobs are simply scheduled FIFO. This decision was made by the owners of these machines, not by the department, and may change in the future based on usage. Please bear that in mind as you submit your jobs. There are many people sharing these machines, and each user is responsible for enforcing their own fair-share usage. If you are having a problem with a user you feel is monopolizing resources, please send email either to them or to your advisor requesting that they consider the impact their jobs are having.

If you are having trouble with the software or the titan nodes themselves, please send mail to help@cs.utexas.edu.

To submit a job from slurm-submit or hypnotoad, you first need to create a submission script. Here is a sample script that you can customize:

#!/bin/bash

#SBATCH --job-name=slurmjob                      # Job name

### Logging
#SBATCH --output=logs/slurmjob_%j.out            # Name of stdout output file (%j expands to jobId)
#SBATCH --error=logs/slurmjob_%j.err             # Name of stderr output file (%j expands to jobId)
#SBATCH --mail-user=csusername@cs.utexas.edu     # Email address for notifications
#SBATCH --mail-type=END,FAIL,REQUEUE             # Events that trigger an email

### Node info
#SBATCH --partition=PARTITIONNAME                # Queue name - current options are allnodes, titans, and dgx (on DURIP)
#SBATCH --nodes=1                                # Always set to 1 when using the cluster
#SBATCH --ntasks-per-node=1                      # Number of tasks per node
#SBATCH --time=1:00:00                           # Run time (hh:mm:ss)

#SBATCH --gres=gpu:4                             # Number of GPUs needed
#SBATCH --mem=50G                                # Memory requirement
#SBATCH --cpus-per-task=8                        # Number of CPUs needed per task

./your_script.sh
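
Note that Slurm will not create the logs/ directory referenced by --output and --error above; create it on the submit node before submitting, or the log files will not appear:

mkdir -p logs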

If you named the script above slurm.sh, you can submit it to the titan or IFML cluster by running

sbatch slurm.sh

Once you have submitted your script, here are some commands you may find useful.

To see all jobs currently submitted:

squeue

To see all currently running jobs:

squeue -t RUNNING

To see all queued jobs which are not yet running, along with their priority weights:

sprio

To get an estimate of how long until your queued job will run:

squeue --start -j <jobid>

To requeue one of your own jobs (this requires that #SBATCH --requeue was set in your sbatch script):

scontrol requeue <jobid>

To see all jobs currently submitted by a specific user:

squeue -u <username>

To see all jobs submitted to the titan cluster since a specific start date:

sacct -S MM/DD/YY -r titans

To see all jobs submitted to the titan cluster by a specific user since a specific start date:

sacct -S MM/DD/YY -r titans -u <username>

To see utilization per user on the titan cluster since Slurm was deployed:

/lusr/opt/slurm/bin/slurm-usage

To see utilization per user on the titan cluster during a specific time period:

sreport cluster AccountUtilizationByUser Account=cs Start=MM/DD/YY End=MM/DD/YY

To cancel a running or queued job (you can find the job number with squeue):

scancel <job#>

To see the current state of the available nodes in Slurm:

sinfo

More information on how Slurm works, its commands, and related applications can be found on Slurm's webpage. Here are some excellent references from other universities to get you started: