Running CUDA on Argo
CUDA Setup on Argo
The Argo cluster has 4 GPU compute nodes (nodes 40, 50, 55, and 56). Node 40 is a Dell PowerEdge C4130 with 4 Tesla K80 (Kepler architecture) graphics cards. Each K80 card contains 2 GPUs, so there is a total of 8 GPUs on the node. Each GPU has 12 GB of VRAM, yielding a total usable VRAM of 96 GB. It is recommended that you use the CUDA 7+ standard for programming with this class of GPUs. Node 40 also has 2 Intel(R) Xeon(R) E5-2670 v3 @ 2.30GHz CPUs, each with 12 cores, and 128 GB of DDR4 RAM.
Nodes 50, 55, and 56 have 2 Tesla K80 GPU cards each, so a total of 4 GPUs per node.
Compiling and running CUDA jobs on the cluster
To compile CUDA code you need to use the nvcc compiler, which comes with the CUDA toolkit. To use the nvcc command you first need to load the cuda/9.X module.
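For example, to load the version used in the sample job script below (the versions available on the cluster may differ):
module load cuda/9.2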
Compiling CUDA Code
A simple way to compile a single CUDA program is to use the nvcc command:
nvcc sample.cu -o executable_name
Make sure you have the cuda/9.X module loaded before using this command.
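For reference, here is a minimal sketch of what sample.cu might contain; the kernel, array size, and launch configuration are hypothetical, just enough to exercise the compiler:
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: add 1.0 to each element of an array
__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main(void)
{
    const int n = 1024;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));
    add_one<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    printf("done\n");
    return 0;
}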
Running A CUDA Job
GPUs are treated as generic resources by Slurm. You need to request the number of GPUs you require with the --gres option in your Slurm wrapper script; even if you need only one GPU, you must use this option to request that one GPU card. You also need to request the GPU partition (gpuq) in the script.
Below is a sample job script that requests 4 GPUs.
#!/bin/bash
#SBATCH --job-name caffe
#SBATCH --partition gpuq
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --gres=gpu:4
# Load necessary modules
module load cuda/9.2
# List the current modules loaded
module list
# Your program executable or script goes here
./YOUR_PROGRAM
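Assuming the script above is saved as run.slurm (the file name is arbitrary), submit it to the scheduler with sbatch:
sbatch run.slurm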
OpenACC
Users are encouraged to use OpenACC if GPU acceleration is desired. OpenACC-based programs can achieve almost the same level of performance as CUDA-based programs, OpenACC is much easier to learn, and a serial CPU-based program can easily be turned into a GPU-accelerated one with it. OpenACC is a directive-based programming model, much like OpenMP. Both the PGI and GNU compilers support OpenACC; however, users are encouraged to use the PGI compiler available on Argo to compile OpenACC-accelerated programs. To make the PGI compiler available, load the pgi module as follows:
module load pgi
The PGI compiler can compile an OpenACC-accelerated program using the -acc and -Minfo=accel command line switches. The -acc switch enables support for OpenACC, and -Minfo=accel prints the compiler's optimization output. Since OpenACC is a directive-based language extension, it is important to check whether the compiler accepted and applied the directives as expected.
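For reference, a minimal sketch of a C program with an OpenACC directive might look like the following; the vector scaling loop and array size are hypothetical:
#include <stdio.h>

int main(void)
{
    const int n = 100000;
    static float a[100000], b[100000];

    for (int i = 0; i < n; i++)
        a[i] = (float)i;

    // With -acc, the compiler offloads this loop to the GPU;
    // -Minfo=accel reports how the directive was applied
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        b[i] = 2.0f * a[i];

    printf("b[1] = %f\n", b[1]);
    return 0;
}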
The following command compiles a C program with OpenACC directives:
pgcc -acc -Minfo=accel your_program.c -o your_program
Compiling a Fortran program with OpenACC directives is done as follows:
pgf90 -acc -Minfo=accel your_program.f90 -o your_program
An example Slurm script that can be used to submit an OpenACC-based GPU-accelerated program is shown below.
#!/bin/bash
#SBATCH --job-name test-openacc
#SBATCH --partition gpuq
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --gres=gpu:1
# Load the module that provides the PGI runtime libraries
module load pgi
# Your program executable goes here
./YOUR_PROGRAM
Note: OpenACC is intended to work on heterogeneous architectures, i.e., multi-core CPUs, GPUs, FPGAs, etc. However, the momentum is towards compilers that turn OpenACC directives into efficient GPU code. Users who want to make use of multi-core CPUs will benefit more from sticking with OpenMP. The initially planned merger of OpenACC and OpenMP is highly unlikely to happen.
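For comparison, here is a minimal sketch of the same hypothetical loop using OpenMP, which targets CPU threads rather than the GPU (compile with, e.g., gcc -fopenmp or pgcc -mp):
#include <stdio.h>

int main(void)
{
    const int n = 100000;
    static float a[100000], b[100000];

    for (int i = 0; i < n; i++)
        a[i] = (float)i;

    // Same loop as the OpenACC sketch, parallelized across CPU cores
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        b[i] = 2.0f * a[i];

    printf("b[1] = %f\n", b[1]);
    return 0;
}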
Running Threaded/MPI CUDA Jobs
To use multiple threads or MPI processes with CUDA, see How to run parallel jobs on Argo.