GPU Resources
The BMRC cluster includes a number of NVIDIA GPU-accelerated servers to support AI/ML, image processing and other GPU-accelerated workflows.
If you have any questions or comments about using the GPU resources on BMRC, please contact us (bmrc-help@medsci.ox.ac.uk).
RECENT CHANGES
Interactive sessions are available through Slurm using the gpu_interactive partition.
compg039-compg042 have been brought into service.
VARIETIES OF GPU NODE
In our regular (i.e. non-GPU) cluster, there are groups of nodes (e.g. compa, compe, compf) where the hardware varies between groups but is identical within each group. The situation is different within the compg GPU nodes. Because hardware capabilities change rapidly, there is considerable variation among the GPU nodes: they offer different combinations of CPU and RAM as well as different numbers and types of GPU card. Furthermore, each machine is configured to host only as many scheduler slots as it has GPU cards, on the assumption that every job will need at least one GPU card. In consequence, the available RAM per slot on the GPU partition can vary widely, from a minimum of 60.8GB up to 750GB.
Because of the variation in CPU, RAM, GPU card type and number of GPUs available per node, you may need to plan your job submissions carefully. The sections below provide full information on the nodes available in order to assist with your planning.
SCHEDULED GPU CLUSTER NODES
There are two Slurm partitions for GPU resources, gpu_short and gpu_long.
Jobs run on gpu_short have a maximum job duration of 4 hours.
Jobs run on gpu_long have a maximum job duration of 60 hours.
gpu_long is only available on a subset of nodes, so it is recommended that you submit jobs to gpu_short when you can.
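If you want to check the current time limits and node lists for these partitions yourself, one way (using standard Slurm tooling; the output columns chosen here are just a suggestion) is:
sinfo -p gpu_short,gpu_long -o "%P %l %D %N"
This prints the partition name, time limit, node count and node list for each GPU partition.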
There is a partition for interactive jobs, gpu_interactive.
There are three Slurm partitions for specific resources/workflows; please contact us if you need access to these:
gpu_relion
gpu_long_palamara
gpu_long_zhang
Jobs are submitted to gpu_short (or gpu_long) using sbatch in a similar way to submitting a non-GPU job; however, you must supply some extra parameters to indicate your GPU requirements as follows:
sbatch -p gpu_short --gres gpu:<N> <JOBSCRIPT>
<N> is the number of GPUs required for each job.
Alternatively, you can use:
sbatch -p gpu_short --gpus-per-node <N> <JOBSCRIPT>
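For reference, here is a minimal sketch of what <JOBSCRIPT> itself might look like; the module name and script name are placeholders, so adjust them to your own software:
#!/bin/bash
#SBATCH --job-name=gpu-example   # name shown in squeue
#SBATCH -p gpu_short             # partition; can also be given on the sbatch command line
#SBATCH --gres gpu:1             # request one GPU, as above

# Load the software you need (placeholder module name - check 'module avail')
module load Python

# Run your GPU application (placeholder script name)
python my_training_script.py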
The recommended way to request GPUs for jobs on the BMRC Slurm GPU queues is to use --gres or --gpus-per-node.
There are other options in Slurm for requesting GPUs, including --gpus, --gpus-per-task and --gpus-per-socket. These are relevant for MPI workloads and can lead to blocking reservations, so please contact BMRC before using them.
Optionally, you can specify a particular type of GPU to run on, e.g.:
sbatch -p gpu_short --gres gpu:a100-pcie-40gb:1 <JOBSCRIPT>
You can use Slurm Features/Constraints to specify the class(es) of GPU(s) that you wish your job to run on. The features are listed in the 'Slurm GPU Feature' column in the table below. For example, to run on P100 and A100 nodes only:
sbatch -p gpu_short --gpus-per-node 1 --constraint "p100|a100" <JOBSCRIPT>
The default number of CPU cores per GPU is 6. You can request more (or fewer) CPU cores for your job with --cpus-per-gpu <N>. Alternatively, you can set the total number of cores required for the job with -c <N>, where <N> is the number of cores.
The default system memory available per GPU is 60.8 GB. You can request more (or less) system memory for your job with --mem-per-gpu <M>G. Alternatively, you can specify the total memory requirement for your job with --mem <M>G, where <M> is the number of GB of memory required.
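As an illustration (the figures here are arbitrary), a job that needs one GPU, 12 CPU cores and 120 GB of system memory could be submitted with:
sbatch -p gpu_short --gres gpu:1 --cpus-per-gpu 12 --mem-per-gpu 120G <JOBSCRIPT>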
INTERACTIVE GPU SESSIONS
You can get an interactive session using the gpu_interactive partition. The partition has a 12 hour runtime limit. Jobs submitted to this partition are scheduled on compg009 and compg010.
srun -p gpu_interactive --gres gpu:1 --pty bash
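The CPU and memory options described above also apply to interactive sessions; for example (the figure is arbitrary):
srun -p gpu_interactive --gres gpu:1 --cpus-per-gpu 4 --pty bash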
LEGACY DEDICATED NODES
We maintain a number of GPU nodes which are dedicated to specific legacy projects. Please email us (bmrc-help@medsci.ox.ac.uk) with any questions regarding these dedicated nodes.
FAST LOCAL SCRATCH SPACE
A number of nodes have fast local NVMe drives for jobs that require a lot of I/O. This space can be accessed from /flash/scratch, or from project-specific folders in /flash on the nodes.
In Slurm you can select nodes with a local scratch folder using:
sbatch -p gpu_short --gpus-per-node 1 --constraint "flash" <JOBSCRIPT>
This folder is open to all jobs, so care should be taken to protect your data by placing it in subfolders with the correct permissions.
As the space on these drives is limited, you should remove any data from the scratch space when your job is complete. Scheduled automatic deletion from /flash/scratch will be introduced.
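A minimal sketch of how a job script might use this space safely, creating a private job-specific subfolder and cleaning up at the end (all paths other than /flash/scratch are placeholders):
# Create a private working folder on the fast local drive
mkdir -p /flash/scratch/$USER
chmod 700 /flash/scratch/$USER               # only you can read/write it
SCRATCH_DIR=/flash/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCH_DIR"

# Stage data in, do the work, copy results back (placeholder paths and program)
cp /path/to/input "$SCRATCH_DIR/"
cd "$SCRATCH_DIR"
./my_gpu_program input
cp results /path/to/project/folder/

# Remove the data - local space is limited and shared with other jobs
rm -rf "$SCRATCH_DIR"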
MONITORING
In an interactive session you should use the nvidia-smi command to check what processes are running on the GPUs, and top to check what is running on the CPUs.
For the scheduled nodes, you can run squeue -p gpu_short,gpu_long from a login node to see the jobs running and waiting in the GPU queues.
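For example, to refresh the GPU view every few seconds inside an interactive session, or to restrict the queue listing to your own jobs from a login node, you could use:
watch -n 5 nvidia-smi                  # inside the interactive session
squeue -p gpu_short,gpu_long -u $USER  # from a login node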
GPU SOFTWARE
The CUDA libraries are required to run applications on NVIDIA GPUs. More advanced GPUs require later versions of the CUDA libraries. The CUDA page on Wikipedia has useful information about versions. Software packages typically need to be compiled for a particular version of CUDA.
Our pre-installed CUDA-related software is made available, in the same way as the majority of our pre-installed software, via software modules. Use module avail to see which software packages are available and module load <MODULE-NAME> to load your desired software modules.
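For example, to find and load a CUDA module (the version shown is a placeholder; pick one that module avail actually lists):
module avail CUDA          # list the installed CUDA versions
module load CUDA/12.1.1    # placeholder version - choose one from the list
nvcc --version             # confirm which CUDA toolkit is now on your path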
In addition to the main CUDA libraries themselves, we also have:
cuDNN
TensorFlow, PyTorch, Keras
a number of other widely used GPU software packages.
You can also install your own software via, for example, a Python virtualenv or conda.
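As a sketch of the virtualenv route (the module name and package are placeholders; match the package's CUDA build to the CUDA module you load):
module load Python                 # placeholder - pick a Python module from 'module avail'
python -m venv ~/my-gpu-env
source ~/my-gpu-env/bin/activate
pip install torch                  # choose a build that matches your CUDA version
python -c "import torch; print(torch.cuda.is_available())"   # should print True on a GPU node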
RELION GPU
Information about running Relion on the Slurm GPU partitions will follow.