3. Deploy g-learn on GPU Clusters#

On GPU clusters, the NVIDIA graphics driver and CUDA libraries are pre-installed and only need to be loaded.

3.1. Load Modules#

Check which modules are available on the machine

module avail
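
To narrow the listing to a particular package, you may pass its name to module avail. For instance, to list the CUDA versions available on the cluster (the exact module names vary between clusters), run

module avail cuda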

Load Python and a compatible CUDA version by running

module load python/3.9
module load cuda/11.7

Check which modules are loaded

module list
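
To verify that the CUDA toolkit is now on the search path, you may check the location and version of the nvcc compiler (the reported path and version depend on the modules installed on your cluster)

which nvcc
nvcc --version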

3.2. Interactive Session with SLURM#

There are two ways to work with GPUs on a cluster. The first method is to open a shell on a GPU node for hands-on interaction with the GPU device. If the GPU cluster uses the SLURM manager, use srun to initiate an interactive session as follows

srun -A fc_biome -p savio2_gpu --gres=gpu:1 --ntasks 2 -t 2:00:00 --pty bash -i

In the above example:

  • -A fc_biome sets the group account associated with the user.

  • -p savio2_gpu sets the name of the GPU partition.

  • --gres=gpu:1 requests one GPU device on the node.

  • --ntasks 2 requests two parallel tasks (processes) on the node.

  • -t 2:00:00 requests a two-hour session.

  • --pty bash starts a Bash shell in a pseudo-terminal.

  • -i runs the Bash shell in interactive mode, with standard input attached to the user’s terminal.

See the list of srun options for details. As another example, to request a node on the savio2_1080ti partition with four GPU devices and eight tasks for ten hours, run

srun -A fc_biome -p savio2_1080ti --gres=gpu:4 --ntasks 8 -t 10:00:00 --pty bash -i

Note

Replace the partition and account names in the above examples with your own. The names used here are from the SAVIO Cluster (an institutional cluster at UC Berkeley).
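
Once the interactive session starts on the GPU node, you can confirm that the requested devices were allocated. For instance, nvidia-smi lists the visible GPU devices and the driver version, and on most SLURM configurations the CUDA_VISIBLE_DEVICES variable holds the device indices assigned to the session

nvidia-smi
echo $CUDA_VISIBLE_DEVICES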

3.3. Submit Jobs to GPU with SLURM#

To submit a parallel job to GPU nodes on a cluster managed by SLURM, use the sbatch command, such as

sbatch jobfile.sh

See the list of sbatch options for details. A sample job file, jobfile.sh, is shown below. The --gres option in the file instructs SLURM to request the given number of GPU devices.

 #!/bin/bash

 #SBATCH --job-name=your_project
 #SBATCH --mail-type=ALL
 #SBATCH --mail-user=your_email
 #SBATCH --partition=savio2_1080ti
 #SBATCH --account=fc_biome
 #SBATCH --qos=savio_normal
 #SBATCH --time=72:00:00
 #SBATCH --nodes=1
 #SBATCH --gres=gpu:4
 #SBATCH --ntasks=1
 #SBATCH --cpus-per-task=8
 #SBATCH --mem=64gb
 #SBATCH --output=output.log

 # Point to where Python is installed
 PYTHON_DIR=$HOME/programs/miniconda3

 # Directory containing the script to run
 SCRIPTS_DIR=$(dirname $PWD)/scripts

 # Directory of log files
 LOG_DIR=$PWD

 # Load modules
 module load cuda/11.2

 # Export OpenMP variables
 export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

 # Run the script
 $PYTHON_DIR/bin/python ${SCRIPTS_DIR}/script.py > ${LOG_DIR}/output.txt

In the above job file, modify --partition, --account, and --qos according to your user account’s allocation on the cluster.
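
After submitting the job, you can monitor its state and follow the output log, for instance with

squeue -u $USER
tail -f output.log

To cancel the job, pass the job ID reported by sbatch to scancel.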