Running Containers on the Cluster
You can run calculations on the cluster by submitting jobs via SLURM in batch or interactively from the terminal. Both containerized and native applications are supported. You can run
- containerized applications using Apptainer/Singularity containers you build or ones we provide
- native applications you have compiled or those we provision using Lmod modules
The following sections describe how to run containerized applications.
Running Containerized Applications
We provide a growing list of Apptainer/Singularity containers in a shared location. You are also welcome to pull and run your own Apptainer/Singularity containers.
Using Shared Containers available on the Cluster
Containers and examples available to all users can be found at
/containers
/containers/hopper
/containers/dgx
The environment variable
$SINGULARITY_BASE
points to
/containers
Available containers on the cluster can be viewed with:
$ tree -L 2 /containers/hopper/Containers
$ tree -L 2 /containers/dgx/Containers
Currently installed containers for the non-GPU (Hopper) nodes include
/containers/hopper/Containers/
├── autodock
├── biobakery
├── busco
├── caffe
├── dealii
├── digits
├── fio
├── fmriprep
├── gamess
├── gromacs
├── hpc-stack
├── lammps
├── maker
├── mitoz
├── mysql
├── namd
├── ncl
├── ngc-preflightcheck
├── nvidia-hpc-benchmarks
├── openpose
├── plasticity
├── python
├── pytorch
├── qiime2
├── qsiprep
├── quantum_espresso
├── r
├── Rapids
├── rserver
├── sagemath
├── tensorflow
└── wine
Building your Own Containers
Modern containers come from many registries (Docker Hub, NGC, Singularity Hub, BioContainers, etc.), in different formats (Docker, Singularity, OCI), and for different runtimes (Docker, Singularity, CharlieCloud, ...).
Warning
Please keep in mind that you cannot build or run Docker containers directly on Hopper or the DGX. You need to pull Docker containers, convert them to Singularity format, and run the resulting Singularity containers.
We use Docker containers pulled from the NVIDIA GPU Cloud (NGC) catalog in the examples below, but the same steps apply to containers from other sources. NGC provides simple access to GPU-optimized software for deep learning, data science, and high-performance computing (HPC). An NGC account grants you access to these tools as well as the ability to set up a private registry to manage your customized software; however, an account is not strictly necessary. Please see the NGC catalog documentation for more details.
If you build your own containers, they should be stored under the container directories set up for cluster users:
/containers/hopper/UserContainers/$USER
/containers/dgx/UserContainers/$USER
NGC commands:
The example below demonstrates how to search for and pull a GROMACS image using the NGC CLI:
$ ngc registry image list
$ ngc registry image list | grep -i <container_name>
$ ngc registry image info nvcr.io/<org>/<container_name>:<container_tag>
$ ngc registry image list|grep -i gromacs
| GROMACS | hpc/gromac | 2020.2 | 275.47 MB | Sep 24, | unlocked|
$ ngc registry image info nvcr.io/hpc/gromacs
--------------------------------------------------
Image Repository Information
Name: GROMACS
Short Description: GROMACS is a popular molecular dynamics application used to simulate proteins and lipids.
Built By: KTH Royal Institute of Technology
Publisher: KTH Royal Institute of Technology
Multinode Support: False
Multi-Arch Support: True
Logo: https://assets.nvidiagrid.net/ngc/logos/ISV-OSS-Non-Nvidia-Publishing-Gromacs.png
Labels: Covid-19, HPC, Healthcare, High Performance Computing, Supercomputing, arm64, x86_64
Public: Yes
Last Updated: Sep 24, 2020
Latest Image Size: 275.47 MB
Latest Tag: 2020.2
Tags:
2020.2
2020
2020.2-arm64
2020.2-x86_64
2018.2
2016.4
$ ngc registry image info nvcr.io/hpc/gromacs:2020.2
--------------------------------------------------
Image Information
Name: hpc/gromacs:2020.2
Architecture: amd64
Schema Version: 1
Image Size: 275.47 MB
Last Updated: Jun 22, 2020
--------------------------------------------------
Pulling Docker containers and building Singularity containers:
Once you have selected a Docker container, pull it and convert it to Singularity image format with the following command. You need to load the singularity module first.
$ module load singularity
$ singularity build <container_name>_<container_version/tag>.sif docker://nvcr.io/<org>/<container_name>:<container_version/tag>
Here is an example for preparing a GROMACS Singularity container:
$ cd /containers/dgx/UserContainers/$USER
$ module load singularity
$ singularity build gromacs-2020_2.sif docker://nvcr.io/hpc/gromacs:2020.2
Please note that we have adopted the following conventions for naming Singularity image files:
- we use SIF instead of SIMG for the file extension
- we name containers as
<container_name>_<container_version/tag>.sif
Also note that you can pull containers from NGC, Docker Hub, or any other source, but we encourage using those from the NGC registry when available because they are optimized for NVIDIA GPUs.
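As a small illustration of the naming convention above, the helper below (our own sketch, not a cluster-provided tool) derives the target filename from an NGC image reference using plain shell parameter expansion:

```shell
# sif_name: sketch of the <container_name>_<container_version/tag>.sif
# naming convention. Not a cluster tool -- for illustration only.
sif_name() {
  local ref=$1              # e.g. nvcr.io/hpc/gromacs:2020.2
  local name=${ref##*/}     # drop registry and namespace -> gromacs:2020.2
  echo "${name%%:*}_${name##*:}.sif"
}

sif_name nvcr.io/hpc/gromacs:2020.2   # -> gromacs_2020.2.sif
```

The result can then be passed straight to `singularity build`, e.g. `singularity build "$(sif_name nvcr.io/hpc/gromacs:2020.2)" docker://nvcr.io/hpc/gromacs:2020.2`.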
Scheduling SLURM Jobs
You can run containerized applications through SLURM either interactively or using batch submission scripts. Both approaches are discussed below. To run jobs on the cluster, you need
- a SLURM account on Hopper, AND
- eligibility to use the 'gpu' Quality-of-Service (QoS)
To see the available partitions and status of various nodes on the cluster, you can run:
```bash
$ sinfo -o "%12P %5D %14F %8z %10m %.11l %15N %G"
PARTITION    NODES NODES(A/I/O/T) S:C:T    MEMORY      TIMELIMIT NODELIST        GRES
debug        3     0/3/0/3        2:24:1   180000        1:00:00 hop[043-045]    (null)
interactive  3     0/3/0/3        2:24:1   180000       12:00:00 hop[043-045]    (null)
contrib      42    6/36/0/42      2:24:1   180000     6-00:00:00 hop[001-042]    (null)
normal*      25    21/4/0/25      2:24:1   180000     3-00:00:00 hop[046-070]    (null)
gpuq         1     0/1/0/1        8:16:1   1024000    2-00:00:00 dgx-a100-01     gpu:A100.40gb:6,gpu:1g.5gb:9,gpu:2g.10gb:1,gpu:3g.20gb:1
orc-test     70    27/43/0/70     2:24:1   180000     1-00:00:00 hop[001-070]    (null)
```
Interactive Mode
You can request interactive access to the DGX A100 server through SLURM as follows:
```bash
$ salloc -p gpuq -q gpu --ntasks-per-node=1 --gres=gpu:A100.80gb:1 -t 0-01:00:00
salloc: Granted job allocation 2185
salloc: Waiting for resource configuration
salloc: Nodes dgx-a100-01 are ready for job
```
Once your reservation is available, you will be logged into the DGX automatically:
$ hostname -s
dgx-a100-01
To run the container while connected:
$ singularity run [--nv] [other_options] <container_name>_<container_version/tag>.sif <command>
As an example, the following command runs a Python script using the TensorFlow container:
$ singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd /containers/dgx/Containers/tensorflow/tensorflow_21.02-tf1-py3.sif python test_single_gpu.py
You can run on one or more GPUs. Since this is a shared resource, we encourage you to monitor GPU usage and submit to idle GPU(s) when running jobs interactively. For example, the nvidia-smi
output below shows GPUs 0, 1, and 2 being actively used, so you should run your jobs on one of the other GPUs.
$ nvidia-smi
Thu Mar 15 10:58:08 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04 Driver Version: 450.102.04 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 29C P0 52W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
...
| 6 A100-SXM4-40GB On | 00000000:B7:00.0 Off | On |
| N/A 31C P0 46W / 400W | 25MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 7 A100-SXM4-40GB On | 00000000:BD:00.0 Off | On |
| N/A 31C P0 42W / 400W | 25MiB / 40537MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 6 7 0 0 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
...
| 6 13 0 6 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 1 0 0 | 11MiB / 20096MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 5 0 1 | 7MiB / 9984MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 13 0 2 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 7 14 0 3 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0                        App1                                        1%  |
|    1                        App2                                       12%  |
|    2                        App3                                       90%  |
+-----------------------------------------------------------------------------+
To select particular GPU(s), you can use the SINGULARITYENV_CUDA_VISIBLE_DEVICES
environment variable. For example, you can select the 1st and 3rd GPUs by setting
$ export SINGULARITYENV_CUDA_VISIBLE_DEVICES=0,2
SLURM writes the GPU indices assigned to your job into the SLURM_JOB_GPUS
environment variable, so you can set
$ export SINGULARITYENV_CUDA_VISIBLE_DEVICES=${SLURM_JOB_GPUS}
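This hand-off can be exercised outside a job as well; in the sketch below the index list is hard-coded to an illustrative value, whereas on the cluster SLURM sets it for you:

```shell
# Illustrative only: on the cluster, SLURM sets SLURM_JOB_GPUS for your job.
SLURM_JOB_GPUS="0,2"

# Singularity strips the SINGULARITYENV_ prefix and exports the remainder
# into the container, so the contained process sees CUDA_VISIBLE_DEVICES=0,2.
export SINGULARITYENV_CUDA_VISIBLE_DEVICES=${SLURM_JOB_GPUS}
echo "$SINGULARITYENV_CUDA_VISIBLE_DEVICES"   # -> 0,2
```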
For example, the following commands will run on whichever GPUs are assigned to you:
$ export SINGULARITYENV_CUDA_VISIBLE_DEVICES=${SLURM_JOB_GPUS}
$ singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd /containers/dgx/Containers/tensorflow/tensorflow_21.02-tf1-py3.sif python test_single_gpu.py
Useful Tools for Monitoring GPU Usage
While you are on the server, you can use these tools to monitor the GPU usage:
nvitop -m
nvtop
nvidia-smi
Please remember to log out of the DGX A100 server when you finish running your interactive job.
Batch Mode
Below is a sample SLURM batch submission file you can use as a template for your jobs. Save it to a file (say run.slurm
) and submit it by entering sbatch run.slurm
. Please update <N_CPU_CORES>
, <MEM_PER_CORE>
and <N_GPUs>
to reflect the number of CPU cores, the memory per core, and the number of GPUs you need.
#!/bin/bash
#SBATCH --partition=gpuq
#SBATCH --qos=gpu
#SBATCH --job-name=jmultigpu_basics
#SBATCH --output=jmultigpu_basics.%j
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=<N_CPU_CORES>
#SBATCH --gres=gpu:A100.80gb:<N_GPUs>
#SBATCH --mem-per-cpu=<MEM_PER_CORE>
#SBATCH --export=ALL
#SBATCH --time=0-01:00:00
set echo
umask 0022
nvidia-smi
env|grep -i slurm
SINGULARITY_BASE=/containers/dgx/Containers
CONTAINER=${SINGULARITY_BASE}/tensorflow/tensorflow_21.02-tf1-py3.sif
SINGULARITY_RUN="singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd"
SCRIPT=multigpu_basics.py
${SINGULARITY_RUN} ${CONTAINER} python ${SCRIPT} | tee ${SCRIPT}.log
We encourage the use of environment variables to keep the job submission file clean and easily reusable.
The syntax for running different containers varies depending on the application. Please check the NGC page for more instructions on running these containers using Singularity.
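When filling in the batch-script placeholders, a quick arithmetic check helps avoid over-requesting memory. The sketch below uses illustrative values together with the 1 TB (1,000,000 MB) node total noted in the example scripts:

```shell
# Sanity-check the batch placeholders before submitting (values illustrative).
N_CPU_CORES=8          # <N_CPU_CORES>
MEM_PER_CORE=3500      # <MEM_PER_CORE>, in MB
TOTAL_MEM=$((N_CPU_CORES * MEM_PER_CORE))
echo "$TOTAL_MEM"      # -> 28000 (MB); must stay below the node's ~1,000,000 MB
```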
Storage Locations
Currently, these locations are designated for storing shared and user-specific containers:
- Containers
- Shared:
/containers/dgx/Containers
- User-specific:
/containers/dgx/UserContainers/$USER
Sample Runs
We provide some sample calculations to facilitate setting up and running calculations:
- examples of running native and containerized applications are available here:
/groups/ORC-VAST/app-tests
- The examples at https://gitlab.com/NVHPC/ngc-examples are also helpful. For many applications there are no instructions on running the containers with Singularity, but you should be able to build one from the Docker image and run it.
Running Containerized TensorFlow
These examples demonstrate how to run a TensorFlow container from NGC on the GPUs using SLURM.
SLURM script for a Single GPU Run
You can find this template and the necessary files at /containers/dgx/Examples/Tensorflow/21.02-tf1-py3/1-single-GPU-example
#!/bin/bash
#SBATCH --partition=gpuq # the DGX belongs only to the 'gpuq' partition
#SBATCH --qos=gpu # need to select 'gpu' QoS
#SBATCH --job-name=single-gpu
#SBATCH --output=jsingle-gpu.%j
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1 # up to 128;
#SBATCH --gres=gpu:A100.40gb:1 # up to 8; only request what you need
#SBATCH --mem-per-cpu=3500M # memory per CORE; total memory is 1 TB (1,000,000 MB)
#SBATCH --export=ALL
#SBATCH --time=0-01:00:00 # set to 1hr; please choose carefully
set echo
umask 0027
# to see ID and state of GPUs assigned
nvidia-smi
SINGULARITY_BASE=/containers/dgx/Containers
CONTAINER=${SINGULARITY_BASE}/tensorflow/tensorflow_21.02-tf1-py3.sif
SINGULARITY_RUN="singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd"
SCRIPT=test_single_gpu.py
${SINGULARITY_RUN} ${CONTAINER} python ${SCRIPT} | tee ${SCRIPT}.log
SLURM script for a Multi-GPU Run
You can find this example at /containers/dgx/Examples/Tensorflow/21.02-tf1-py3/2-multi-GPU-example
#!/bin/bash
#SBATCH --partition=gpuq # the DGX belongs only to the 'gpuq' partition
#SBATCH --qos=gpu # need to select 'gpu' QoS
#SBATCH --job-name=jmultigpu-2
#SBATCH --output=jmultigpu-2.%j
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8 # up to 128; note that multithreading is enabled
#SBATCH --gres=gpu:A100.80gb:2 # up to 8; only request what you need
#SBATCH --mem-per-cpu=3500M # memory per CORE; total memory is 1 TB (1,000,000 MB)
#SBATCH --export=ALL
#SBATCH --time=0-01:00:00 # set to 1hr; please choose carefully
set echo
umask 0027
# to see ID and state of GPUs assigned
nvidia-smi
# parse out number of GPUs and CPU cores assigned to your job
env | grep -i slurm
N_GPUS=`echo $SLURM_JOB_GPUS | tr "," " " | wc -w`
N_CORES=${SLURM_NTASKS}
# set up the calculation
SINGULARITY_BASE=/containers/dgx/Containers
CONTAINER=${SINGULARITY_BASE}/tensorflow/tensorflow_21.02-tf1-py3.sif
SINGULARITY_RUN="singularity run --nv -B ${PWD}:/host_pwd --pwd /host_pwd"
# run the calculation
SCRIPT=multigpu_basics.py
${SINGULARITY_RUN} ${CONTAINER} python ${SCRIPT} | tee ${N_GPUS}g-${N_CORES}c-${SCRIPT}.log
SCRIPT=multigpu_cnn.py
${SINGULARITY_RUN} ${CONTAINER} python ${SCRIPT} | tee ${N_GPUS}g-${N_CORES}c-${SCRIPT}.log
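The GPU-count parsing in the script above can be checked in isolation; here SLURM_JOB_GPUS is hard-coded to an illustrative value, whereas inside a job SLURM sets it:

```shell
# SLURM_JOB_GPUS holds comma-separated GPU indices; "0,2" is illustrative.
SLURM_JOB_GPUS="0,2"
# Same parsing as the batch script: turn commas into spaces, then count words.
N_GPUS=$(echo "$SLURM_JOB_GPUS" | tr "," " " | wc -w)
echo "$N_GPUS"   # -> 2
```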
Create a directory in /scratch and copy the necessary files:
cd $SCRATCH
mkdir tf_container_example && cd tf_container_example
cp -r /containers/dgx/Examples/Tensorflow/21.02-tf1-py3/1-single-GPU-example .
cp -r /containers/dgx/Examples/Tensorflow/21.02-tf1-py3/2-multi-GPU-example .
Once you have the files, change into one of the example directories and run the example with sbatch:
cd 1-single-GPU-example
sbatch run.slurm