Basic Slurm Commands

Viewing System Information

sinfo


The sinfo command shows the state of partitions and nodes managed by Slurm.


Terminal window
$ sinfo
PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
debug*       up     30:00           4  mix    prism-[1-4]
cpu          up     7-00:00:00      4  mix    prism-[1-4]
gpu          up     3-00:00:00      1  mix    prism-4
batch        up     7-00:00:00      4  mix    prism-[1-4]
interactive  up     1:00:00         1  mix    prism-4

When a node is under maintenance or has failed, you can add the -R flag to the sinfo command to see the reason it is unavailable.

Terminal window
$ sinfo -R
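By default, sinfo -R lists the reason each node is down or drained, along with who set it and when. The output below is only illustrative; your cluster will show its own reasons, users, and timestamps.

Terminal window
$ sinfo -R
REASON                 USER   TIMESTAMP            NODELIST
Scheduled maintenance  root   2024-01-15T09:00:00  prism-3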

Job Management Commands

sbatch

The sbatch command is used to submit a job script for later execution.

Terminal window
$ sbatch myjob.sh
Submitted batch job 12345

Example job script (myjob.sh):

#!/bin/bash
#SBATCH --job-name=my_test_job # Job name
#SBATCH --output=job_%j.out # Output file (%j = job ID)
#SBATCH --error=job_%j.err # Error file
#SBATCH --time=01:00:00 # Time limit hrs:min:sec
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1 # Number of tasks (processes)
#SBATCH --mem=2G # Memory limit
echo "My first Slurm job"
hostname
date
sleep 60
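The #SBATCH directives in the script act as defaults; you can override them on the command line when submitting. For example (the values here are arbitrary):

Terminal window
# Command-line options take precedence over the matching #SBATCH directives
$ sbatch --time=02:00:00 --mem=4G myjob.sh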

scancel

The scancel command is used to cancel a queued or running job.

Terminal window
$ scancel 12345 # Cancel job with ID 12345
$ scancel -u username # Cancel all jobs for a specific user
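scancel can also select jobs by name or by state, which is handy when you have many jobs queued. The job name below is just an example:

Terminal window
$ scancel --name=my_test_job # Cancel all of your jobs with this job name
$ scancel -u $USER -t PENDING # Cancel only your pending (not yet running) jobs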

squeue


The squeue command shows the status of jobs submitted to the cluster, including pending, running, and completing jobs.


Terminal window
$ squeue
JOBID  PARTITION  NAME      USER   ST   TIME  NODES  NODELIST(REASON)
12345  batch      vllm      test1  PD   0:00      1  (Resources)
12346  batch      python    test2  PD   0:00      1  (Priority)
12347  batch      python    test2  R    2:49      1  prism-1
12348  debug      bash      test1  R   15:30      1  prism-1
12349  debug      image-la  test3  R    1:00      1  prism-2

Assuming you are the test1 user and want to see only the jobs you submitted, add the -u flag to filter for your own jobs.

Terminal window
# $USER is an environment variable that contains the current session's username.
# It can also be replaced with your username directly.
# In this example, $USER refers to test1.
$ squeue -u $USER
JOBID  PARTITION  NAME      USER   ST   TIME  NODES  NODELIST(REASON)
12345  batch      vllm      test1  PD   0:00      1  (Resources)
12348  debug      bash      test1  R   15:30      1  prism-1
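If the default columns truncate long job names, you can supply your own output format with the -o flag. The format string below is just one example; see man squeue for the full list of field codes.

Terminal window
# %i=job ID, %P=partition, %j=job name, %u=user, %t=state, %M=elapsed time, %R=reason/nodelist
$ squeue -u $USER -o "%.8i %.10P %.30j %.8u %.2t %.10M %R"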

Interactive Sessions

srun

The srun command is used to run jobs interactively or create job steps.

Terminal window
$ srun --pty bash -i # Start an interactive bash session
$ srun -N1 hostname # Run 'hostname' command on one node
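As with sbatch, resource requests can be passed directly on the srun command line. The values below are only an illustration:

Terminal window
# Interactive session with 4 CPU cores, 8 GB of memory, and a 2-hour time limit
$ srun --cpus-per-task=4 --mem=8G --time=02:00:00 --pty bash -i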

GPU Job Submission

Interactive GPU Session

To request an interactive session with GPU access:

Terminal window
# Request 1 GPU with 8 CPU cores and 16GB memory
$ srun --partition=batch --gres=gpu:1 --cpus-per-gpu=8 --mem-per-gpu=16G --pty bash -i
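Once the session starts, you can confirm which GPU was allocated. On most clusters Slurm exports CUDA_VISIBLE_DEVICES for GPU jobs, though this depends on the cluster configuration:

Terminal window
# Inside the interactive session
$ echo $CUDA_VISIBLE_DEVICES # Index of the GPU(s) assigned to this job, if set by Slurm
$ nvidia-smi # GPU model, memory, and current utilization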

Batch GPU Jobs

Example GPU job script (gpu_job.sh):

#!/bin/bash
#SBATCH --job-name=gpu_test # Job name
#SBATCH --output=gpu_%j.out # Output file (%j = job ID)
#SBATCH --error=gpu_%j.err # Error file
#SBATCH --partition=batch # Partition selection
#SBATCH --gres=gpu:1 # Number of GPUs (1 in this case)
#SBATCH --cpus-per-gpu=8 # CPUs per GPU
#SBATCH --mem-per-gpu=16G # Memory per GPU
#SBATCH --time=08:00:00 # Time limit hrs:min:sec
# Load any required modules here
# module load cuda/11.8
# Your GPU program commands here
nvidia-smi # Check GPU status
python your_gpu_script.py

Submit the GPU job:

Terminal window
$ sbatch gpu_job.sh
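Before tailing the output files, you can check whether the job has started by querying it directly (replace 12345 with the job ID that sbatch printed):

Terminal window
$ squeue -j 12345 # Show only this job; ST changes from PD to R once it starts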

You can monitor your job’s output in real-time using the tail command:

Terminal window
# Monitor output file (replace JOBID with your job number)
$ tail -f gpu_JOBID.out
# Monitor error file
$ tail -f gpu_JOBID.err
# Example with actual job ID 12345
$ tail -f gpu_12345.out

Resource Monitoring

scontrol

The scontrol command is used to view and modify Slurm configuration and state. Most modifications require administrator privileges, but any user can view detailed job, node, and partition information.

Terminal window
$ scontrol show job 12345 # Show details of a specific job
$ scontrol show node prism-1 # Show details of a specific node
$ scontrol show partition # Show partition information
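scontrol can also modify jobs you own, for example holding a pending job and releasing it later. Other changes, such as raising a job's time limit, generally require administrator privileges.

Terminal window
$ scontrol hold 12345 # Keep a pending job from being scheduled
$ scontrol release 12345 # Allow a held job to be scheduled again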