Slurm has many built-in features that allow users to run many different types of parallel code, leveraging the full capabilities of the cluster. Here we shall briefly go over some common parallel environments.
Running Distributed Jobs
MPI stands for Message Passing Interface. The MPI specification is essentially an API that provides routines for communication between processes running on the nodes. The specification has two main parts: MPI-1, the original standard, and MPI-2, which extends and evolves it. MPI is the de facto standard library for message passing in a distributed computing setting, and the standard is maintained by the MPI Forum. All MPI-2 implementations can run anything originally written for MPI-1, but the reverse is not true.
We will provide guidelines on how to run MPI jobs on ARGO using Slurm. First we briefly go over the MPI implementations available on ARGO. These are provided as modules and can be loaded as required. The modules are updated frequently, and users can check the currently available MPI modules by typing:
module avail #this lists all available modules
intel-mpi (included in the intel/ps_xe module)
mpich
mvapich2
openmpi
Note: It is highly recommended that you use the intel-mpi module, as it gives the best performance of the available options.
Each MPI type comes in multiple flavors, depending on the compiler used to build it and the target architecture. For example, mvapich2 has the following flavors:
mvapich2/gcc/64/2.0b
mvapich2/gcc/64/2.2b
mvapich2/intel/64/2.2b
mvapich2/open64/64/2.0b
The first two were compiled with gcc, the third with intel/ps_xe, and the last one with open64. The MPI modules are named using the format below:
mpitype{/interface/compiler/arch/}version #interface=communication type, arch=architecture
Compiling Your Program
Depending on which MPI environment you want to use, you will need to load the respective compiler module. Each MPI compiler depends on the core C/C++/Fortran compiler it was built with. For example, if you want to compile your MPI program with the open64 version of the mpich compiler, then you first need to load the open64 module before compiling your program (see the example after the table below). The base compiler required by each MPI module is given below:
MPI-module | Base Compiler |
---|---|
mpich/gcc... | Use gcc |
mpich/ge/open64/... | Use open64 |
intel-mpi/64/... | Use intel/ps_xe |
Compiler-Module Dependency
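For example, to use the open64 build of mpich, load the base compiler first and then the matching MPI module (versions elided, as elsewhere on this page):
module load open64/...
module load mpich/ge/open64/...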
It is recommended that you use x64 architecture whenever possible.
The MPI compiler wrappers associated with the MPI modules above are given in the table below:
MPI-module | Language | MPI Compiler Wrapper |
---|---|---|
mpich | C | mpicc |
 | C++ | mpiCC |
 | Fortran77 | mpif77 |
 | Fortran90 | mpif90 |
intel-mpi | C | mpiicc |
 | C++ | mpiicpc |
 | Fortran77 | mpiifort |
MPI Compiler
We will show how to compile an MPI program using two different MPI libraries and give a corresponding job script to submit for each implementation. The purpose of this demonstration is to get you started on running MPI jobs on ARGO. Here is a sample program called MpiHello.c which we will use for compiling and running.
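The source of MpiHello.c is not reproduced here; a minimal version consistent with the output shown later on this page might look like the following sketch:
//MpiHello.c (minimal sketch; the actual example file may differ)
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);                 // initialize the MPI environment
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // rank of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes
    printf("Hello from process %d out of %d processors\n", rank, size);
    MPI_Finalize();                         // shut down MPI
    return 0;
}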
Setting the environment
To compile a program with an MPI library, start by loading the appropriate module. You can use any version available on the cluster. Here we load the gcc version of the mvapich2 library. Note that the gcc module is loaded by default on the head nodes, so you do not need to load it:
module load mvapich2/gcc/64/...
However, if you were to use, say, the open64 version of mvapich2, then you would need to load the open64 module before loading the MPI module:
module load open64/...
module load mvapich2/open64/...
Loading the environment for mpich and openmpi is similar.
Compiling the code
To compile the MpiHello.c using the gcc compiler, first load the gcc and openmpi modules and compile using mpicc:
module load gcc/7.1.0
module load openmpi/gcc/64/1.10.1
mpicc MpiHello.c -o MpiHello
To see the version that the MPI library was compiled with:
mpicc -v
mpicc -show
ldd MpiHello
To run a quick test on the head node with two processes:
mpirun -np 2 ./MpiHello
Hello from process 0 out of 2 processors
Hello from process 1 out of 2 processors
Running short jobs with a small number of processes on the head nodes is fine for testing, but anything requiring more time or more processes should be submitted to SLURM as a real job (see below).
Linking against a dynamic library
If the program requires linking against a dynamic library, the library path can be specified during compilation by passing the "-L/location/of/library/directory -llibrary" options to the compiler. For example, to link against the appropriate fftw3 library, the correct library path must be determined from the fftw3 module. Issuing a 'module show' command on the library module will show the LD_LIBRARY_PATH for the dynamic library; this path should be used during compilation. When preparing your SLURM submission script to run the program, be sure to execute the appropriate module load command so that the correct dynamic libraries are available at run time.
[user@ARGO-1 user]$ module show fftw3/openmpi/gcc/64/3.3.4
-------------------------------------------------------------------
/cm/shared/modulefiles/fftw3/openmpi/gcc/64/3.3.4:
module-whatis Adds FFTW library for 64 bits to your environment
prepend-path LD_RUN_PATH /cm/shared/apps/fftw/openmpi/gcc/64/3.3.4/lib/
**prepend-path LD_LIBRARY_PATH /cm/shared/apps/fftw/openmpi/gcc/64/3.3.4/lib/**
prepend-path MANPATH /cm/shared/apps/fftw/openmpi/gcc/64/3.3.4/share/man/
setenv FFTWDIR /cm/shared/apps/fftw/openmpi/gcc/64/3.3.4/lib
setenv FFTWINCLUDE /cm/shared/apps/fftw/openmpi/gcc/64/3.3.4/include
setenv FFTWLIB fftw3
-------------------------------------------------------------------
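For example, after loading the fftw3 module shown above (together with the matching openmpi module), a compile line could look like the following; the source file name my_fft_prog.c is only a placeholder:
module load openmpi/gcc/64/1.10.1
module load fftw3/openmpi/gcc/64/3.3.4
mpicc my_fft_prog.c -o my_fft_prog -L/cm/shared/apps/fftw/openmpi/gcc/64/3.3.4/lib -lfftw3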
Running Your MPI Job with "sbatch"
Below is a sample SLURM submission script which sets up appropriate resources and calls the MpiHello program.
#!/bin/bash
## Specify the name for your job, this is the job name by which Slurm will
## refer to your job. This can be different from the name of your executable
## or the name of your script file.
#SBATCH --job-name MPI_job
#SBATCH --qos normal # normal,cdsqos,phyqos,csqos,statsqos,hhqos,gaqos,esqos
#SBATCH -p all-HiPri # partition (queue): all-LoPri, all-HiPri,
# bigmem-LoPri, bigmem-HiPri, gpuq, CS_q, CDS_q, ...
## Deal with output and errors. Separate into 2 files (not the default).
## NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH -o /scratch/%u/%x-%N-%j.out # Output file
#SBATCH -e /scratch/%u/%x-%N-%j.err # Error file
#SBATCH --mail-type=BEGIN,END,FAIL # NONE,BEGIN,END,FAIL,REQUEUE,ALL,...
#SBATCH --mail-user=<userID>@gmu.edu # Put your GMU email address here
## -------MPI Specific Options----------
#SBATCH --nodes <N> # Number of computers to run MPI processes on
#SBATCH --ntasks-per-node <n> # Number of tasks (processes) per node --
# -- must be less than the node core count!
## Enable one of the following module corresponding to the MPI compiler used.
## These may not be the newest version. Use "module avail" to find the best.
#module load mpich/ge/gcc/64/3.2
module load openmpi/gcc/64/1.10.1
#module load mvapich2/gcc/64/2.2b
#module load intel-mpi/64 # This will load the default version
## Run your program
mpirun -np <Nxn> ./MpiHello # Here <Nxn> is the total number of process used
If you do not specify "-np" (the total number of processes), then the correct number will be determined automatically from the total number of tasks (Nxn) requested in the Slurm script.
When Using Intel-MPI
In case you are typing these steps in as you go, let's start from a clean slate to make sure there's no confusion.
module purge
In order to compile and run your MPI program using Intel's optimized MPI library, you need to make the following changes when compiling and running your code. First, load the Intel suite:
module load intel/ps_xe
module load intel-mpi/64
Compile your code using mpiicc instead of mpicc:
mpiicc MpiHello.c -o MpiHello
Then you can use the above sample job script to run your test program. You only need to comment out the line where the gcc MPI module is loaded and uncomment the corresponding intel-mpi module:
...
## Enable one of the following module corresponding to the MPI compiler used.
## These may not be the newest version. Use "module avail" to find the best.
#module load mpich/ge/gcc/64/3.2
#module load openmpi/gcc/64/1.10.1
#module load mvapich2/gcc/64/2.2b
module load intel-mpi/64 # This will load the default version
...
Running Your MPI Job with "srun"
When using the Intel-MPI compiler and libraries, there is another approach you can use instead of running a SLURM submission script: the srun command can launch MPI programs directly. Before you do this, however, you must execute the following command or you will receive errors when using srun:
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
You can then launch MpiHello directly with srun. For example, to run it on 2 nodes with 3 tasks per node:
srun --nodes=2 --ntasks-per-node=3 ./MpiHello
Note that using srun with other MPI library and compiler combinations did not work in our testing. Also note that setting the I_MPI_PMI_LIBRARY environment variable (as above) will cause mpirun to stop working. To get mpirun working again, use the following command:
unset I_MPI_PMI_LIBRARY
Threaded Parallelism
It is straightforward to run threaded jobs in Slurm. You need only specify the number of threads used per task with the "--cpus-per-task" option, as shown in the job script below.
#!/bin/bash
...
#SBATCH --ntasks=1
#SBATCH --cpus-per-task <nThreads>   # Number of threads per task (replace <nThreads> with a number)
...
#run your threaded application here
my_application
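Here my_application stands for any multithreaded program. As an illustration (this example is hypothetical and not part of the original page), a minimal OpenMP program that uses the allocated cores could look like:
//omp_hello.c (hypothetical example)
#include <omp.h>
#include <stdio.h>

int main(void)
{
    // OpenMP picks up the thread count from OMP_NUM_THREADS, which can be
    // set in the job script, e.g. export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
Compile it with the -fopenmp flag, e.g. gcc -fopenmp omp_hello.c -o omp_hello.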
Hybrid Parallelism
It is likewise straightforward to use threading in conjunction with MPI in Slurm. In this case, both "--ntasks-per-node" and "--cpus-per-task" must be specified simultaneously, as shown below:
#!/bin/bash
#SBATCH --job-name Hybrid
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 2
#SBATCH --cpus-per-task 4
module load intel/ps_xe/18.0.1.163
#The OMP_NUM_THREADS env must be set before calling your application if your application does not take the number of threads as an input argument
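## If your application reads OMP_NUM_THREADS rather than a command-line
## argument, you can set it from the Slurm allocation, for example:
# export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK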
mpirun ./hybrid 4
The above script will run on two nodes, with 2 tasks per node (MPI processes), and each process will have 4 threads. You can use the MPI Hybrid Example (hybrid.c) to test the above script; a minimal sketch is given after the compile command below. When compiling, make sure to include the "-fopenmp" flag:
mpiicc -fopenmp hybrid.c -o hybrid
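The hybrid.c source is not reproduced here; a minimal MPI+OpenMP sketch consistent with the script above (taking the thread count as its first command-line argument) might look like:
//hybrid.c (sketch; the actual example file may differ)
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Thread count passed on the command line, e.g. "mpirun ./hybrid 4"
    if (argc > 1)
        omp_set_num_threads(atoi(argv[1]));
    #pragma omp parallel
    {
        printf("MPI rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }
    MPI_Finalize();
    return 0;
}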
GPU Parallelism
GPUs are treated as generic resources in Slurm. In order to run your CUDA or GPU-based application, you first have to allocate the desired number of GPUs to your job using the "--gres=gpu:<N>" option, as shown in the sample script below:
#!/bin/bash
#SBATCH --job-name poke_cuda
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --partition=gpuq
#SBATCH --gres=gpu:2
#SBATCH --output=/scratch/%u/sample_cuda.%N.%j.out
## If you need to run on a particular node, you can request it with the
## --nodelist option. For example, uncomment (i.e. remove one '#') the line
## below and SLURM will run your job on node40 when it is available.
##SBATCH --nodelist=NODE040
module load cuda/9.2
./know_gpus
In the above example, 2 GPUs are allocated on a single node, which runs 1 task. If you are using MPI along with CUDA (for example, where each MPI process controls a set of GPUs), then you have to specify the desired number of MPI processes with the "--ntasks-per-node" option as before. Additionally, you have to load the relevant CUDA module(s). Note that the number of GPUs requested must not exceed the number available on the node.
Note: In a multi-GPU programming context, peer-to-peer communication between GPUs on a single node may be necessary. In that case, GPU resources should be selected carefully from the groups in the table below, where GPUs capable of peer-to-peer communication with each other are enclosed in square brackets.
GPU-Node Number | GPU id |
---|---|
40 | [0,1,2,3], [4,5,6,7] |
50 | [0,1], [2,3] |
55 | [0,1], [2,3] |
56 | [0,1], [2,3] |
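To verify peer-to-peer capability from inside a job, a small check can be compiled with nvcc in the same way as know_gpus.cu below; this snippet is only an illustration using the standard cudaDeviceCanAccessPeer runtime call and is not part of the original examples:
//check_peer.cu (illustrative)
#include <cstdio>

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);                      // GPUs visible to this job
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
        {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);  // can device i access device j?
            printf("GPU %d -> GPU %d : peer access %s\n", i, j, ok ? "yes" : "no");
        }
    return 0;
}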
The program know_gpus.cu, listed below, can be compiled as shown on any login node and run on the GPU nodes using the sample Slurm job submission script given above.
module load cuda/10.0
nvcc know_gpus.cu -o know_gpus
//know_gpus.cu
#include <iostream>
int main()
{
int nDevices;
cudaGetDeviceCount(&nDevices);
for (int i = 0; i < nDevices; i++)
{
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, i);
std::cout << "Device Number: " << i << std::endl;
std::cout << "Device name: "<< prop.name << std::endl;
std::cout << "Memory Clock Rate (KHz):" << prop.memoryClockRate << std::endl;
std::cout << "Memory Bus Width (bits): " << prop.memoryBusWidth << std::endl;
std::cout << "Peak Memory Bandwidth (GB/s): " <<
2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6 << std::endl;
std::cout << std::endl;
}
}