Running Slurm Jobs With Multiple MIG Devices
IMPORTANT: Processes running on separate MIG GPUs are not able to communicate via CUDA. The information in this document only shows how to run multiple independent CUDA processes in a single Slurm job. Distributed training with PyTorch, TensorFlow, or any other common ML/DL framework is not currently possible using MIG devices.
As mentioned in the Running GPU Jobs article, the most significant constraint on using MIG devices is the current restriction imposed by CUDA that limits MIG device enumeration to a single device.
It is still possible to use multiple MIG devices in a single Slurm job; however, there are some adjustments that need to be made in the job environment. When a job is allocated GPU resources, Slurm sets an environment variable called CUDA_VISIBLE_DEVICES. For example, a job that requested two 2g.20gb MIG devices would have this variable set to something like this:
CUDA_VISIBLE_DEVICES=MIG-92b6c26a-bcfc-5603-b6fe-faa085154d31,MIG-2a3e9acd-9cc5-5e7c-b513-cfef043bb574
This environment variable is what CUDA uses to enumerate available GPU devices. In the case of MIG, however, CUDA will ignore any devices after the first one in the list. To use both devices, we must parse the names of the MIG devices from the CUDA_VISIBLE_DEVICES variable and then pass each device name individually to the CUDA processes when they are launched.
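To illustrate the enumeration limit, a quick interactive check (a sketch, assuming PyTorch is available in the job's Python environment) shows that only one device is reported even though two MIG UUIDs are listed:

# Run inside a job allocation that has been assigned two MIG devices.
import os
import torch

print(os.environ["CUDA_VISIBLE_DEVICES"])  # two comma-separated MIG-... UUIDs
print(torch.cuda.device_count())           # reports 1: CUDA only enumerates the first MIG device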
As a proof of concept, below is a sample Slurm script that requests two MIG devices and 9 cores. The script runs two copies of a CUDA Python program in the background, passing a single MIG device to each process (and limiting each process to four of the allocated cores by setting the OMP_NUM_THREADS variable). The script then runs nvidia-smi to verify that each process has started on a distinct GPU, and finally waits in a loop for the Python processes to complete.
#!/bin/bash
#SBATCH --job-name=multi-mig-test
#SBATCH --output=%j-%N-out.txt
#SBATCH --error=%j-%N-err.txt
#SBATCH --partition=gpuq
#SBATCH --ntasks=9
#SBATCH --mail-type=all
#SBATCH --mem-per-cpu=4G
#SBATCH --gres=gpu:2g.20gb:2
#SBATCH --qos=gpu
module load gnu10
module load python/3.9.9-jh
# Launch one background copy of the CUDA program per MIG device, overriding
# CUDA_VISIBLE_DEVICES so that each process sees exactly one device
j=0
for i in $(echo $CUDA_VISIBLE_DEVICES | tr ',' ' '); do
    OMP_NUM_THREADS=4 CUDA_VISIBLE_DEVICES=${i} ./gpu-stress.py &
    pids[${j}]=$!
    j=$((j + 1))
done
sleep 20
nvidia-smi
# wait for all pids
for pid in ${pids[*]}; do
    wait $pid
done
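The gpu-stress.py program itself is not shown here; any CUDA program will do. As a minimal sketch, assuming PyTorch is installed in the loaded python/3.9.9-jh module (the actual gpu-stress.py used to produce the output below may differ), such a program could look like this:

#!/usr/bin/env python3
# Hypothetical stand-in for gpu-stress.py: keeps the single visible MIG
# device busy with repeated matrix multiplications. Assumes PyTorch is
# available in the loaded Python module.
import torch

def main():
    # Only one MIG device is visible to this process, so it enumerates as cuda:0.
    assert torch.cuda.is_available(), "no visible CUDA/MIG device"
    device = torch.device("cuda:0")
    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)
    for _ in range(2000):
        c = a @ b                 # dense matmul to generate GPU load
        a = c / c.norm()          # re-normalize to avoid overflow
    torch.cuda.synchronize()

if __name__ == "__main__":
    main()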
We can see from the output of the job that each Python process gets its appropriate GPU device:
Fri May 17 14:58:17 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:81:00.0 Off | On |
| N/A 32C P0 118W / 500W | 3161MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:C1:00.0 Off | On |
| N/A 35C P0 190W / 500W | 21852MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 5 0 0 | 2219MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 2MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 3 0 0 | 2219MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 2MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 5 0 2962875 C python3 2186MiB |
| 1 3 0 2962877 C python3 2186MiB |
+---------------------------------------------------------------------------------------+