Usage
Basic Slurm Commands
Viewing System Information
sinfo
The sinfo command shows the state of partitions and nodes managed by Slurm.
$ sinfo
PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
debug*       up     30:00       4      mix    prism-[1-4]
cpu          up     7-00:00:00  4      mix    prism-[1-4]
gpu          up     3-00:00:00  1      mix    prism-4
batch        up     7-00:00:00  4      mix    prism-[1-4]
interactive  up     1:00:00     1      mix    prism-4
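To narrow the listing to a single partition, you can pass the -p flag. An illustrative example, using the gpu partition from the listing above:

# Show only the gpu partition
$ sinfo -p gpu
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
gpu        up     3-00:00:00  1      mix    prism-4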
When a node is down for maintenance or has failed, you can add the -R flag to the sinfo command to see the reason it is unavailable.
$ sinfo -R
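By default this lists the reason each node is down, who set it, and when. A hypothetical example (the node, user, and reason below are illustrative):

REASON                USER   TIMESTAMP            NODELIST
scheduled maintenance root   2024-01-15T09:00:00  prism-3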
Job Management Commands
sbatch
The sbatch command is used to submit a job script for later execution.
$ sbatch myjob.sh
Submitted batch job 12345
Example job script (myjob.sh):
#!/bin/bash
#SBATCH --job-name=my_test_job   # Job name
#SBATCH --output=job_%j.out      # Output file (%j = job ID)
#SBATCH --error=job_%j.err       # Error file
#SBATCH --time=01:00:00          # Time limit hrs:min:sec
#SBATCH --nodes=1                # Number of nodes
#SBATCH --ntasks=1               # Number of tasks (one CPU core each by default)
#SBATCH --mem=2G                 # Memory limit

echo "My first Slurm job"
hostname
date
sleep 60
scancel
The scancel command is used to cancel a queued or running job.
$ scancel 12345        # Cancel job with ID 12345
$ scancel -u username  # Cancel all jobs for a specific user
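scancel can also filter by job name or state. A sketch using the standard --name and --state options:

# Cancel only your pending jobs, leaving running ones untouched
$ scancel -u $USER --state=PENDING

# Cancel jobs by name
$ scancel --name=my_test_job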
squeue
The squeue command shows the status of submitted jobs in the cluster, including pending, running, and completing jobs.
$ squeue
JOBID  PARTITION  NAME      USER   ST  TIME   NODES  NODELIST(REASON)
12345  batch      vllm      test1  PD  0:00   1      (Resources)
12346  batch      python    test2  PD  0:00   1      (Priority)
12347  batch      python    test2  R   2:49   1      prism-1
12348  debug      bash      test1  R   15:30  1      prism-1
12349  debug      image-la  test3  R   1:00   1      prism-2
Assuming you are the test1 user and want to see only your own submitted jobs, you can add the -u flag to filter by user.
# $USER is an environment variable that contains the current session username.
# It can also be replaced with your username directly.
# In this example, $USER refers to test1.
$ squeue -u $USER
JOBID  PARTITION  NAME  USER   ST  TIME   NODES  NODELIST(REASON)
12345  batch      vllm  test1  PD  0:00   1      (Resources)
12348  debug      bash  test1  R   15:30  1      prism-1
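You can also control which columns squeue prints with the -o/--format option. A sketch using standard format specifiers (%i job ID, %P partition, %j name, %t state, %M elapsed time, %l time limit):

# Job ID, partition, name, state, elapsed time, and time limit
$ squeue -u $USER -o "%.8i %.10P %.15j %.3t %.10M %.10l"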
Interactive Sessions
srun
The srun command is used to run jobs interactively or to create job steps.
$ srun --pty bash -i  # Start an interactive bash session
$ srun -N1 hostname   # Run 'hostname' command on one node
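For anything beyond a quick test, it is better to request explicit resources. A sketch using the interactive partition from the sinfo listing above; adjust the values to your needs:

# 4 CPU cores and 8 GB of memory for 30 minutes
$ srun --partition=interactive --time=00:30:00 --cpus-per-task=4 --mem=8G --pty bash -i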
GPU Job Submission
Interactive GPU Session
To request an interactive session with GPU access:
# Request 1 GPU with 8 CPU cores and 16 GB of memory
$ srun --partition=batch --gres=gpu:1 --cpus-per-gpu=8 --mem-per-gpu=16G --pty bash -i
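Once the session starts, you can confirm the GPU allocation. Slurm typically sets CUDA_VISIBLE_DEVICES for jobs requested with --gres=gpu:

# Inside the interactive session
$ nvidia-smi                  # List the allocated GPU(s)
$ echo $CUDA_VISIBLE_DEVICES  # GPU index assigned by Slurm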
Batch GPU Jobs
Example GPU job script (gpu_job.sh):
#!/bin/bash
#SBATCH --job-name=gpu_test      # Job name
#SBATCH --output=gpu_%j.out      # Output file (%j = job ID)
#SBATCH --error=gpu_%j.err       # Error file
#SBATCH --partition=batch        # Partition selection
#SBATCH --gres=gpu:1             # Number of GPUs (1 in this case)
#SBATCH --cpus-per-gpu=8         # CPUs per GPU
#SBATCH --mem-per-gpu=16G        # Memory per GPU
#SBATCH --time=08:00:00          # Time limit hrs:min:sec

# Load any required modules here
# module load cuda/11.8

# Your GPU program commands here
nvidia-smi                       # Check GPU status
python your_gpu_script.py
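If your program can use more than one GPU, the same script scales by raising the gres count. A sketch, assuming the cluster permits multi-GPU jobs; since --cpus-per-gpu and --mem-per-gpu are per-GPU values, the totals scale with the GPU count:

#SBATCH --gres=gpu:2             # Request 2 GPUs
#SBATCH --cpus-per-gpu=8         # 8 CPUs per GPU (16 total)
#SBATCH --mem-per-gpu=16G        # 16 GB per GPU (32 GB total)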
Submit the GPU job:
$ sbatch gpu_job.sh
You can monitor your job's output in real time using the tail command:
# Monitor the output file (replace JOBID with your job number)
$ tail -f gpu_JOBID.out

# Monitor the error file
$ tail -f gpu_JOBID.err

# Example with actual job ID 12345
$ tail -f gpu_12345.out
Resource Monitoring
scontrol
The scontrol command is the administrative tool for viewing and modifying Slurm state.
$ scontrol show job 12345     # Show details of a specific job
$ scontrol show node prism-1  # Show details of a specific node
$ scontrol show partition     # Show partition information
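Beyond viewing state, regular users can also use scontrol to hold and release their own queued jobs. A sketch using the job ID from the examples above:

# Prevent a pending job from starting, then release it
$ scontrol hold 12345
$ scontrol release 12345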