Basic Slurm Commands

Viewing System Information

sinfo


The sinfo command shows the state of partitions and nodes managed by Slurm.


Terminal window
$ sinfo
PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
debug*       up     30:00           4  mix    prism-[1-4]
cpu          up     7-00:00:00      4  mix    prism-[1-4]
gpu          up     3-00:00:00      1  mix    prism-4
batch        up     7-00:00:00      4  mix    prism-[1-4]
interactive  up     1:00:00         1  mix    prism-4

When a node is under maintenance or has failed, you can add the -R flag to the sinfo command to see the reason it is unavailable.

Terminal window
$ sinfo -R
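By default, sinfo -R lists the reason each node is down or drained, along with who set it and when. The output below is only illustrative; your cluster will show its own reasons, users, and timestamps.

Terminal window
$ sinfo -R
REASON                 USER   TIMESTAMP            NODELIST
Scheduled maintenance  root   2024-01-15T09:00:00  prism-3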

Job Management Commands

sbatch

The sbatch command is used to submit a job script for later execution.

Terminal window
$ sbatch myjob.sh
Submitted batch job 12345

Example job script (myjob.sh):

#!/bin/bash
#SBATCH --job-name=my_test_job # Job name
#SBATCH --output=job_%j.out # Output file (%j = job ID)
#SBATCH --error=job_%j.err # Error file
#SBATCH --time=01:00:00 # Time limit hrs:min:sec
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1 # Number of tasks (processes)
#SBATCH --mem=2G # Memory limit
echo "My first Slurm job"
hostname
date
sleep 60
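The #SBATCH directives in the script act as defaults; you can override them on the command line when submitting. For example (the values here are arbitrary):

Terminal window
# Command-line options take precedence over the matching #SBATCH directives
$ sbatch --time=02:00:00 --mem=4G myjob.sh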

scancel

The scancel command is used to cancel a queued or running job.

Terminal window
$ scancel 12345 # Cancel job with ID 12345
$ scancel -u username # Cancel all jobs for a specific user
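scancel can also select jobs by name or by state, which is handy when you have many jobs queued. The job name below is just an example:

Terminal window
$ scancel --name=my_test_job # Cancel all of your jobs with this job name
$ scancel -u $USER -t PENDING # Cancel only your pending (not yet running) jobs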

squeue


The squeue command shows the status of jobs submitted to the cluster, including pending, running, and completing jobs.


Terminal window
$ squeue
JOBID  PARTITION  NAME      USER   ST   TIME  NODES  NODELIST(REASON)
12345  batch      vllm      test1  PD   0:00      1  (Resources)
12346  batch      python    test2  PD   0:00      1  (Priority)
12347  batch      python    test2  R    2:49      1  prism-1
12348  debug      bash      test1  R   15:30      1  prism-1
12349  debug      image-la  test3  R    1:00      1  prism-2

Assuming you are the test1 user and want to see only the jobs you submitted, add the -u flag to filter for your own jobs.

Terminal window
# $USER is an environment variable that contains the current session's username.
# It can also be replaced with your username directly.
# In this example, $USER refers to test1.
$ squeue -u $USER
JOBID  PARTITION  NAME      USER   ST   TIME  NODES  NODELIST(REASON)
12345  batch      vllm      test1  PD   0:00      1  (Resources)
12348  debug      bash      test1  R   15:30      1  prism-1
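If the default columns truncate long job names, you can supply your own output format with the -o flag. The format string below is just one example; see man squeue for the full list of field codes.

Terminal window
# %i=job ID, %P=partition, %j=job name, %u=user, %t=state, %M=elapsed time, %R=reason/nodelist
$ squeue -u $USER -o "%.8i %.10P %.30j %.8u %.2t %.10M %R"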

Interactive Sessions

srun

The srun command is used to run jobs interactively or create job steps.

Terminal window
$ srun --pty bash -i # Start an interactive bash session
$ srun -N1 hostname # Run 'hostname' command on one node
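As with sbatch, resource requests can be passed directly on the srun command line. The values below are only an illustration:

Terminal window
# Interactive session with 4 CPU cores, 8 GB of memory, and a 2-hour time limit
$ srun --cpus-per-task=4 --mem=8G --time=02:00:00 --pty bash -i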

GPU Job Submission

Interactive GPU Session

To request an interactive session with GPU access:

Terminal window
# Request 1 GPU with 8 CPU cores and 16GB memory
$ srun --partition=batch --gres=gpu:1 --cpus-per-gpu=8 --mem-per-gpu=16G --pty bash -i
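Once the session starts, you can confirm which GPU was allocated. On most clusters Slurm exports CUDA_VISIBLE_DEVICES for GPU jobs, though this depends on the cluster configuration:

Terminal window
# Inside the interactive session
$ echo $CUDA_VISIBLE_DEVICES # Index of the GPU(s) assigned to this job, if set by Slurm
$ nvidia-smi # GPU model, memory, and current utilization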

Batch GPU Jobs

Example GPU job script (gpu_job.sh):

#!/bin/bash
#SBATCH --job-name=gpu_test # Job name
#SBATCH --output=gpu_%j.out # Output file (%j = job ID)
#SBATCH --error=gpu_%j.err # Error file
#SBATCH --partition=batch # Partition selection
#SBATCH --gres=gpu:1 # Number of GPUs (1 in this case)
#SBATCH --cpus-per-gpu=8 # CPUs per GPU
#SBATCH --mem-per-gpu=16G # Memory per GPU
#SBATCH --time=08:00:00 # Time limit hrs:min:sec
# Load any required modules here
# module load cuda/11.8
# Your GPU program commands here
nvidia-smi # Check GPU status
python your_gpu_script.py

Submit the GPU job:

Terminal window
$ sbatch gpu_job.sh
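Before tailing the output files, you can check whether the job has started by querying it directly (replace 12345 with the job ID that sbatch printed):

Terminal window
$ squeue -j 12345 # Show only this job; ST changes from PD to R once it starts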

You can monitor your job’s output in real-time using the tail command:

Terminal window
# Monitor output file (replace JOBID with your job number)
$ tail -f gpu_JOBID.out
# Monitor error file
$ tail -f gpu_JOBID.err
# Example with actual job ID 12345
$ tail -f gpu_12345.out

Resource Monitoring

scontrol

The scontrol command is used to view and modify Slurm configuration and state. Most modifications require administrator privileges, but any user can view detailed job, node, and partition information.

Terminal window
$ scontrol show job 12345 # Show details of a specific job
$ scontrol show node prism-1 # Show details of a specific node
$ scontrol show partition # Show partition information
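scontrol can also modify jobs you own, for example holding a pending job and releasing it later. Other changes, such as raising a job's time limit, generally require administrator privileges.

Terminal window
$ scontrol hold 12345 # Keep a pending job from being scheduled
$ scontrol release 12345 # Allow a held job to be scheduled again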