Submitting batch jobs on AccelerateAI

Once you have installed the software that your workload needs, and perhaps tested it interactively on a small problem, you are ready to move to running your workloads as batch jobs. This means defining what you want them to do in the form of a Unix shell script, and submitting this script to the Supercomputing Wales job scheduler, Slurm.

This has a number of advantages: jobs can wait in the queue and run unattended, even after you log out; the scheduler shares the GPUs fairly between users; and the submission script is a reproducible record of exactly how each job was run.

If you’re not already familiar with Unix shell scripting, you may want to work through the Software Carpentry introduction to the Unix shell.

Slurm

Slurm is the batch scheduler used by Supercomputing Wales on SUNBIRD; it manages all workloads submitted to AccelerateAI.

Details on how to use Slurm on Supercomputing Wales, including how to write and submit job scripts, can be found on the Supercomputing Wales Portal.

To get started quickly, you can base your submission scripts on the templates below. To use one, copy its contents into a file called, for example, submit.sh, modify it to run the commands you need, and submit it with the command:

$ sbatch submit.sh
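
Once the job is submitted, sbatch prints its job ID. You can check the job's position in the queue with squeue and cancel it with scancel, both standard Slurm commands:

$ squeue -u $USER
$ scancel <jobid>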

For further information, see the documentation linked above.

Queues and time limits

There are two batch queues on AccelerateAI that give access to a full A100 GPU; the example scripts on this page use the accel_ai partition. Jobs in these queues can run for at most two days.

If you need to run workloads that take longer than two days, you can use checkpointing to save the state of your computation, allowing it to be resumed in a subsequent Slurm job. For an example of how this can be done in TensorFlow, see the TensorFlow documentation.
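
How you arrange the resumption depends on your code, but a common pattern is for the job to resubmit itself until training is complete. The following is a minimal sketch of the end of a submission script, assuming (hypothetically) that training.py accepts a --checkpoint-dir argument, resumes from any checkpoint it finds there, and writes a DONE marker file once training has finished:

# Run (or resume) training; the hypothetical --checkpoint-dir flag tells
# training.py where to save and look for checkpoints.
python3 training.py --checkpoint-dir checkpoints/

# If training did not finish within this job, submit a follow-up job
# that will resume from the latest checkpoint.
if [ ! -f checkpoints/DONE ]; then
    sbatch submit.sh
fi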

There is one additional partition, accel_ai_mig, which holds the partitioned (MIG) GPUs used for interactive test workflows. We recommend using the interactive workflow to access this queue, but if you have a use case for submitting batch jobs to these nodes, please contact SA2C support and we will work with you to get up and running.
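
You can check the current availability and time limits of these partitions yourself with sinfo, a standard Slurm command:

$ sinfo -p accel_ai,accel_ai_mig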

Example job scripts

Single-GPU job

#!/bin/bash

#SBATCH --account=scw0000
#SBATCH --partition=accel_ai
#SBATCH --job-name=training
#SBATCH --output=training.out.%j
#SBATCH --error=training.err.%j
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=0-6:00:00

module load anaconda/2024.06
module load CUDA/12.4
source activate ai_2024

echo "Job ${SLURM_JOB_ID} is running on ${HOSTNAME}."
echo "It has access to GPU ${CUDA_VISIBLE_DEVICES}."

python3 training.py

Explaining the options in turn:

--account=scw0000 charges the job to the given Supercomputing Wales project; replace scw0000 with your own project code.
--partition=accel_ai places the job in the accel_ai queue described above.
--job-name=training gives the job a short name that is shown in the queue.
--output=training.out.%j and --error=training.err.%j set the files that standard output and standard error are written to; %j is replaced by the job ID.
--gres=gpu:1 requests one GPU.
--ntasks=1 requests a single task (process).
--cpus-per-task=4 allocates four CPU cores to that task.
--time=0-6:00:00 sets the maximum run time, here six hours, in days-hours:minutes:seconds format; the job is stopped if it exceeds this limit.
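
Any of these options can also be given to sbatch on the command line, where they take precedence over the #SBATCH lines in the script. For example, to request a twelve-hour run without editing submit.sh:

$ sbatch --time=0-12:00:00 submit.sh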

Two-GPU job

#!/bin/bash

#SBATCH --account=scw0000
#SBATCH --partition=accel_ai
#SBATCH --job-name=training_2gpu
#SBATCH --output=training.out.%j
#SBATCH --error=training.err.%j
#SBATCH --gres=gpu:2
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=0-6:00:00

module load anaconda/2024.06
module load CUDA/12.4
source activate ai_2024

echo "Job ${SLURM_JOB_ID} is running on ${HOSTNAME}."
echo "It has access to GPUs ${CUDA_VISIBLE_DEVICES}."

python3 training.py

Most of the options are the same as explained above. The differences: --gres=gpu:2 requests two GPUs rather than one, --cpus-per-task=8 requests twice as many CPU cores to match, and the job name has been changed accordingly.
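
Note that requesting two GPUs only makes them available to the job: your training code must itself be written or configured (for example, using your framework's data-parallel or distributed training support) to use more than one GPU. To confirm which GPUs a job can see, you can add a call to nvidia-smi to the script before the training command:

# List the GPUs visible to this job.
nvidia-smi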

Four-GPU job

#!/bin/bash

#SBATCH --account=scw0000
#SBATCH --partition=accel_ai
#SBATCH --job-name=training_4gpu
#SBATCH --output=training.out.%j
#SBATCH --error=training.err.%j
#SBATCH --gres=gpu:4
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --time=0-6:00:00

module load anaconda/2024.06
module load CUDA/12.4
source activate ai_2024

echo "Job ${SLURM_JOB_ID} is running on ${HOSTNAME}."
echo "It has access to GPUs ${CUDA_VISIBLE_DEVICES}."

python3 training.py

The options are as described above, but now using 4 GPUs and 16 CPU cores.

Many-GPU job

To use more than four GPUs, please get in touch with the SA2C RSE team so that we can help you check that your workload scales up on AccelerateAI.