Interactive Sessions on DRAC#
Tested on: Nibi cluster
Last updated: October 2025
Interactive sessions allow you to debug and test your code on a compute node before submitting larger batch jobs. This is similar to working on a lab machine (like Tars or Thufir), but on DRAC’s compute infrastructure.
Requesting an Interactive Session#
Use salloc to request resources:
salloc --time=02:05:00 \
--mem=50G \
--ntasks=1 \
--cpus-per-task=8 \
--account=def-monahana \
--job-name=gpu_interactive \
--gres=gpu:h100:1 \
--mail-user=sbeairsto@uvic.ca \
--mail-type=BEGIN
Parameters explained:
--time: Maximum session duration (HH:MM:SS format)
--mem: RAM allocation
--ntasks: Number of tasks (1 for a single interactive shell)
--cpus-per-task: Number of CPU cores
--account: Your PI's allocation account
--job-name: Name shown in the queue
--gres=gpu:h100:1: Request 1 H100 GPU
--mail-user: Address for the notification email
--mail-type=BEGIN: Email notification when the allocation is ready
Additional details are in the DRAC (Digital Research Alliance of Canada) documentation on running jobs.
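If you only need to sanity-check CPU-side code, a smaller request is usually granted much sooner. A minimal sketch (the time, memory, and core counts below are illustrative, not recommendations):
# Smaller CPU-only request; modest allocations tend to start faster
salloc --time=00:30:00 \
--mem=8G \
--ntasks=1 \
--cpus-per-task=2 \
--account=def-monahana \
--job-name=cpu_interactive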
Note
Allocation wait times vary from 1 minute to 10+ hours depending on requested resources and cluster availability. The email notification (--mail-type=BEGIN) is helpful for longer waits.
Checking Allocation Status#
While waiting for your allocation, you can check its status:
# Check your pending/running jobs
squeue -u $USER
# Output shows job ID, status, and remaining time
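If the job is still pending, Slurm can also report an estimated start time (the estimate shifts as other jobs finish, so treat it as a rough guide). These are standard Slurm commands, not DRAC-specific:
# Show Slurm's estimated start time for your pending jobs
squeue --start -u $USER
# Full details for a specific job (requested resources, state, reason)
scontrol show job JOBID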
Connecting to Your Allocation#
Once your allocation is granted, attach to the compute node:
srun --pty bash
You’ll now have an interactive shell on the allocated compute node with access to the requested GPU.
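If you have more than one allocation, or want to attach from a different terminal, you can target a specific allocation by job ID (find it with squeue -u $USER); again, this is standard Slurm usage:
# Attach an interactive shell to a specific allocation
srun --jobid=JOBID --pty bash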
Setting Up Your Environment#
Before running any GPU code, load the required modules:
module purge
module load cuda/12.9
source /path/to/your/climatexvenv/bin/activate
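A quick sanity check that the module and virtual environment took effect (the exact versions and paths depend on your setup):
# List the currently loaded modules
module list
# The interpreter should point into your virtual environment
which python
# The CUDA toolkit version should match the loaded module
nvcc --version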
Verifying GPU Access#
# Check GPU is available
nvidia-smi
# Test PyTorch GPU access
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
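For a slightly stronger check than torch.cuda.is_available(), you can run a small tensor operation on the GPU; this is a throwaway sketch, not part of the ClimatExML code:
# Print the device name and run a tiny matrix multiply on the GPU
python - <<'EOF'
import torch
device = torch.device("cuda")
print("Device:", torch.cuda.get_device_name(0))
x = torch.randn(1024, 1024, device=device)
print("Matmul OK, result shape:", tuple((x @ x).shape))
EOF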
Debugging with Data#
Option 1: Using Synthetic Data (Recommended for Quick Testing)#
For rapid debugging without the overhead of copying and unpacking large datasets, use the synthetic data loader:
# In your training script or config
use_synthetic_data = True # Enable synthetic data mode
This allows you to test your training loop, GPU utilization, and code logic without data I/O delays.
Tip
The synthetic data loader will be added to the main branch soon. It generates random data matching your dataset’s dimensions, perfect for debugging before running full-scale experiments.
Option 2: Using Real Data#
If you need to debug with actual data:
# Copy data to fast local storage
cp /path/to/my/dataset.tar.gz $SLURM_TMPDIR/
# Extract (this can be slow for large files)
tar -xzf $SLURM_TMPDIR/dataset.tar.gz -C $SLURM_TMPDIR/
# Update your data path
export DATA_PATH="$SLURM_TMPDIR/dataset"
Warning
For large datasets, copying and unpacking to $SLURM_TMPDIR can take significant time. Use synthetic data for initial debugging, then test with real data once your code is working.
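Because the archive has to be read once either way, one option is to extract directly from project storage into $SLURM_TMPDIR and skip the intermediate copy; the source path below is the same placeholder as above, and timing the step tells you how much to budget in later batch jobs:
# Extract straight into node-local storage (no intermediate copy) and time it
time tar -xzf /path/to/my/dataset.tar.gz -C $SLURM_TMPDIR/
export DATA_PATH="$SLURM_TMPDIR/dataset"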
Debugging Your Code#
Now you can run and debug your code interactively:
# Test your training script
python ClimatExML/train.py --debug
# Run Python interactively
python
>>> import torch
>>> # debug as needed
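Two aids that are often useful at this stage: running under Python's debugger, and forcing synchronous CUDA calls so GPU errors point at the call that caused them (this slows execution, so only use it while debugging). Both are standard tooling, not ClimatExML-specific:
# Run under the debugger; an unhandled exception drops you into a post-mortem prompt
python -m pdb ClimatExML/train.py --debug
# Surface CUDA errors at the offending call instead of at a later sync point
CUDA_LAUNCH_BLOCKING=1 python ClimatExML/train.py --debug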
Exiting the Session#
When you’re done debugging:
# Exit the interactive shell
exit
# Exiting the srun shell returns you to the salloc shell; exit that
# shell too (or use scancel) to release the allocation
The session will also automatically end when the time limit is reached.
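If you lose track of how much time is left, Slurm can report the remaining walltime from inside the session:
# Show remaining walltime for the current allocation
squeue -h -j $SLURM_JOB_ID -o "%L"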
Canceling an Allocation#
If you realize you requested the wrong resources or no longer need the allocation:
# Cancel by job ID (find ID with squeue -u $USER)
scancel JOBID
# Cancel all your jobs
scancel -u $USER
Tip
Best practice: Request slightly more time than you think you’ll need. If your session times out while debugging, you’ll lose your work and need to request a new allocation.
Common Interactive Session Workflow#
1. Request allocation with salloc
2. Wait for the email notification (or check with squeue -u $USER)
3. Connect with srun --pty bash
4. Load modules and activate your environment
5. Test and debug your code
6. Exit when done
This workflow helps you catch errors before submitting long-running batch jobs.
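To avoid retyping the setup each session, you can keep the module and virtual-environment steps in a small script and source it after connecting. A sketch with placeholder paths (the file name setup_env.sh is just an example):
# setup_env.sh -- source this after `srun --pty bash`
module purge
module load cuda/12.9
source /path/to/your/climatexvenv/bin/activate
# Usage inside the interactive session:
#   source setup_env.sh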