3. Deploy g-learn on GPU Clusters#
On GPU clusters, the NVIDIA graphics driver and the CUDA libraries are usually pre-installed and only need to be loaded.
3.1. Load Modules#
Check which modules are available on the machine
module avail
Load Python and a compatible CUDA version by
module load python/3.9
module load cuda/11.7
Check which modules are loaded
module list
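As a quick sanity check, confirm that the loaded CUDA module provides the toolkit and that the GPU driver is visible. The commands below are a minimal sketch; they assume the cuda module places nvcc on the PATH, and nvidia-smi works only on a node with the NVIDIA driver installed (it may be unavailable on login nodes).
# Check the CUDA compiler provided by the loaded module
nvcc --version
# Check that the driver can see the GPU devices (run on a GPU node)
nvidia-smi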
3.2. Interactive Session with SLURM#
There are two ways to work with GPUs on a cluster. The first method is to open a shell on a GPU node for hands-on interaction with the GPU devices. If the GPU cluster uses the SLURM manager, use srun
to initiate an interactive session as follows
srun -A fc_biome -p savio2_gpu --gres=gpu:1 --ntasks 2 -t 2:00:00 --pty bash -i
In the above example:
-A fc_biome sets the group account associated with the user.
-p savio2_gpu sets the name of the GPU partition.
--gres=gpu:1 requests one GPU device on the node.
--ntasks 2 requests two parallel tasks (processes) on the node.
-t 2:00:00 requests a two-hour session.
--pty bash starts a Bash shell.
-i redirects standard input to the user's terminal for interactive use.
See the list of options of srun for details. As another example, to request a node on the savio2_1080ti partition
with four GPU devices and eight parallel tasks for 10 hours, run
srun -A fc_biome -p savio2_1080ti --gres=gpu:4 --ntasks 8 -t 10:00:00 --pty bash -i
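Once the interactive session starts, the shell runs on the allocated GPU node and the granted resources can be inspected. The following is a minimal sketch; it assumes SLURM exports its usual job variables and sets CUDA_VISIBLE_DEVICES for GPU allocations, which depends on the cluster configuration.
# Confirm the allocation from inside the interactive session
echo "Job ID: $SLURM_JOB_ID"
echo "Tasks:  $SLURM_NTASKS"
echo "GPUs:   $CUDA_VISIBLE_DEVICES"
# List the GPU devices visible to this session
nvidia-smi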
Note
Replace the partition and account names in the above examples with yours. The partition and account names in these examples are taken from the SAVIO cluster (an institutional cluster at UC Berkeley).
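If you are unsure which partitions and accounts are available to you, SLURM can list them. The commands below are a sketch using standard SLURM utilities; the exact output depends on how accounting is configured on your cluster.
# List the partitions defined on the cluster
sinfo -s
# List the account, partition, and QoS associations for your user
sacctmgr show associations user=$USER format=Account,Partition,QOS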
3.3. Submit Jobs to GPU with SLURM#
To submit a parallel job to GPU nodes on a cluster managed by SLURM, use the sbatch
command, for example
sbatch jobfile.sh
See the list of options of sbatch for details. A sample job file, jobfile.sh,
is shown below. The line with the --gres
option instructs SLURM to request the desired number of GPU devices.
#!/bin/bash
#SBATCH --job-name=your_project
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your_email
#SBATCH --partition=savio2_1080ti
#SBATCH --account=fc_biome
#SBATCH --qos=savio_normal
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64gb
#SBATCH --output=output.log
# Point to where Python is installed
PYTHON_DIR=$HOME/programs/miniconda3
# Point to the directory containing the script to run
SCRIPTS_DIR=$(dirname $PWD)/scripts
# Directory of log files
LOG_DIR=$PWD
# Load modules
module load cuda/11.7
# Export OpenMP variables
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Run the script
$PYTHON_DIR/bin/python ${SCRIPTS_DIR}/script.py > ${LOG_DIR}/output.txt
In the above job file, modify --partition, --account, and --qos according to the allocation available to your user account on the cluster.
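After submitting the job file with sbatch, the job can be monitored with standard SLURM commands. The sketch below assumes the job file above, which writes the SLURM log to output.log in the submission directory; replace <job_id> with the ID printed by sbatch.
# Submit the job; sbatch prints the assigned job ID
sbatch jobfile.sh
# List your pending and running jobs
squeue -u $USER
# Follow the SLURM log while the job runs
tail -f output.log
# Show accounting information after the job finishes
sacct -j <job_id>
# Cancel the job if needed
scancel <job_id>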