Running CUDA on Argo
CUDA Setup on Argo
The Argo cluster has 4 GPU compute nodes (nodes 40, 50, 55, and 56). Node 40 is a Dell PowerEdge C4130 with 4 Tesla K80 (Kepler architecture) graphics cards. Each K80 card contains 2 GPUs, so there is a total of 8 GPUs on the node. Each GPU has 12 GB of VRAM, yielding a total usable VRAM of 96 GB. It is recommended that you use the CUDA 7+ standard for programming with this class of GPUs. Node 40 also has 2 Intel(R) Xeon(R) E5-2670 v3 @ 2.30GHz CPUs, each with 12 cores, and 128 GB of DDR4 RAM.
Nodes 50, 55, and 56 have 2 Tesla K80 GPU cards each, so a total of 4 GPUs per node.
Compiling and running CUDA jobs on the cluster
To compile CUDA code you need to use the nvcc compiler, which comes with the CUDA toolkit. To use the nvcc command you first need to load the cuda/9.X module.
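For example, to load the version used in the sample job script below (the versions available on the cluster may differ):
module load cuda/9.2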
Compiling CUDA Code
A simple way to compile a single CUDA program is to use the nvcc command:
nvcc sample.cu -o executable_name
Make sure you have the cuda/9.X module loaded before using this command.
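For reference, here is a minimal sketch of what sample.cu might contain; the kernel, array size, and launch configuration are hypothetical, just enough to exercise the compiler:
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: add 1.0 to each element of an array
__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main(void)
{
    const int n = 1024;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));
    add_one<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    printf("done\n");
    return 0;
}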
Running A CUDA Job
GPUs are treated as generic resources by Slurm. You need to request the number of GPUs you require with the --gres option in your Slurm wrapper script; even if you need only one GPU, you must use this option to request that one GPU card. You also need to request the GPU partition (gpuq) in the script.
Below is a sample job script that requests 4 GPUs.
#!/bin/bash
#SBATCH --job-name caffe
#SBATCH --partition gpuq
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --gres=gpu:4
# Load necessary modules
module load cuda/9.2
# List the current modules loaded
module list
# Your program executable or script goes here
./YOUR_PROGRAM
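Assuming the script above is saved as run.slurm (the file name is arbitrary), submit it to the scheduler with sbatch:
sbatch run.slurm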
OpenACC
Users are encouraged to use OpenACC if GPU acceleration is desired. OpenACC-based programs can achieve almost the same level of performance as CUDA-based programs, OpenACC is much easier to learn, and a serial CPU-based program can easily be turned into a GPU-accelerated one with it. OpenACC is a directive-based programming model, much like OpenMP. Both the PGI and GNU compilers support OpenACC; however, users are encouraged to use the PGI compiler available on Argo to compile OpenACC-accelerated programs. To make the PGI compiler available, load the pgi module as follows:
module load pgi
The PGI compiler can compile an OpenACC-accelerated program using the -acc and -Minfo=accel command line switches. The -acc switch enables support for OpenACC, and -Minfo=accel prints the compiler's optimization output. Since OpenACC is a directive-based language extension, it is important to check whether the compiler accepted and applied the directives as expected.
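For reference, a minimal sketch of a C program with an OpenACC directive might look like the following; the vector scaling loop and array size are hypothetical:
#include <stdio.h>

int main(void)
{
    const int n = 100000;
    static float a[100000], b[100000];

    for (int i = 0; i < n; i++)
        a[i] = (float)i;

    // With -acc, the compiler offloads this loop to the GPU;
    // -Minfo=accel reports how the directive was applied
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        b[i] = 2.0f * a[i];

    printf("b[1] = %f\n", b[1]);
    return 0;
}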
The following command compiles a C program with OpenACC directives:
pgcc -acc -Minfo=accel your_program.c -o your_program
Compiling a Fortran program with OpenACC directives is done as follows:
pgf90 -acc -Minfo=accel your_program.f90 -o your_program
An example Slurm script that can be used to submit an OpenACC-based GPU-accelerated program is shown below.
#!/bin/bash
#SBATCH --job-name test-openacc
#SBATCH --partition gpuq
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --gres=gpu:1
# Load the module that provides the PGI runtime libraries
module load pgi
# Your program executable goes here
./YOUR_PROGRAM
Note: OpenACC is intended to work on heterogeneous architectures, i.e., multi-core CPUs, GPUs, FPGAs, etc. However, the momentum is towards compilers that turn OpenACC directives into efficient GPU code. Users who want to make use of multi-core CPUs will benefit more from sticking with OpenMP. The initially planned merger of OpenACC and OpenMP is highly unlikely to happen.
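For comparison, here is a minimal sketch of the same hypothetical loop using OpenMP, which targets CPU threads rather than the GPU (compile with, e.g., gcc -fopenmp or pgcc -mp):
#include <stdio.h>

int main(void)
{
    const int n = 100000;
    static float a[100000], b[100000];

    for (int i = 0; i < n; i++)
        a[i] = (float)i;

    // Same loop as the OpenACC sketch, parallelized across CPU cores
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        b[i] = 2.0f * a[i];

    printf("b[1] = %f\n", b[1]);
    return 0;
}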
Running Threaded/MPI CUDA Jobs
To use multiple threads or MPI processes with CUDA, see How to run parallel jobs on Argo.