The TensorBoard PyTorch Profiler plugin (torch-tb-profiler) gives you six views of a training run. Each one is best suited to a different class of problem.


Setup

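Install the plugin (shell):

pip install torch-tb-profiler

Wrap the training loop in the profiler (Python):

import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # Per cycle: skip 1 step, warm up for 1 step, record 3 steps.
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./tb_logs"),
    record_shapes=True,    # record operator input shapes
    profile_memory=True,   # record allocations for the Memory tab
    with_stack=True,       # record call stacks for source-location grouping
) as prof:
    for step, batch in enumerate(dataloader):
        train_step(batch)
        prof.step()  # advance the schedule once per training step

Then launch TensorBoard (shell):

tensorboard --logdir ./tb_logs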
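To make the schedule concrete, here is a minimal pure-Python sketch of how wait=1, warmup=1, active=3 maps training steps to profiler phases. The helper `phase_for_step` is hypothetical (it is not part of torch.profiler), and the sketch assumes the schedule's default settings, under which the cycle repeats for as long as the loop runs.

```python
def phase_for_step(step: int, wait: int = 1, warmup: int = 1, active: int = 3) -> str:
    """Return the profiler phase for a 0-based training step (illustrative only)."""
    pos = step % (wait + warmup + active)  # position inside the current cycle
    if pos < wait:
        return "wait"      # profiler idle
    if pos < wait + warmup:
        return "warmup"    # profiler running, results discarded
    return "active"        # profiler recording


phases = [phase_for_step(s) for s in range(7)]
print(phases)
```

Only the active steps end up in a trace, so it is usually enough to run a few cycles rather than a whole epoch under the profiler.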

The Six Views

TensorBoard → PyTorch Profiler plugin
├── Overview       — step time breakdown, GPU utilization summary
├── Operator       — per-op CPU/CUDA time, call count
├── Kernel         — per-CUDA-kernel time, occupancy, Tensor Core usage
├── Trace          — raw Perfetto-style timeline
├── Memory         — allocation timeline and peak usage
└── Distributed    — multi-rank comparison (the key tab for DDP)

Overview Tab

Your starting point. The step time breakdown shows:

DataLoader      ████░░░░░░  15%
Forward         ██████░░░░  30%
Backward        ████████░░  40%
Optimizer       ██░░░░░░░░  10%
Other           █░░░░░░░░░   5%

GPU Utilization is the headline metric. Below ~80% means the GPU is waiting on something.

  • DataLoader dominant → I/O bound, fix the pipeline
  • Forward ≈ Backward, low GPU util → memory-bandwidth bound or launch overhead
  • “Other” large → Python GIL or framework overhead
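The rules of thumb above can be written down as a small triage function. This is only a sketch: `triage` is a hypothetical helper, and the `other_threshold` and `fwd_bwd_tolerance` cutoffs are assumptions chosen for illustration, not values reported by the plugin.

```python
def triage(breakdown: dict, gpu_util: float,
           other_threshold: float = 0.20,      # assumed cutoff for a "large" Other share
           fwd_bwd_tolerance: float = 0.15) -> str:  # assumed cutoff for Forward ≈ Backward
    """Map a step-time breakdown (fractions of step time) to a first guess."""
    if gpu_util >= 0.80:
        return "GPU is not starved"
    if max(breakdown, key=breakdown.get) == "DataLoader":
        return "I/O bound: fix the input pipeline"
    if breakdown.get("Other", 0.0) >= other_threshold:
        return "Python or framework overhead"
    if abs(breakdown.get("Forward", 0.0) - breakdown.get("Backward", 0.0)) <= fwd_bwd_tolerance:
        return "memory-bandwidth bound or kernel launch overhead"
    return "inconclusive: inspect the Trace tab"


# The example breakdown from the Overview tab, with a made-up GPU utilization.
sample = {"DataLoader": 0.15, "Forward": 0.30, "Backward": 0.40,
          "Optimizer": 0.10, "Other": 0.05}
verdict = triage(sample, gpu_util=0.65)
print(verdict)
```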

Operator Tab

Per-op breakdown. Sort by Self Time (the op itself, not its children) to find the actual bottleneck rather than the wrapper.

Key things to look for:

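Op              | High what?           | Meaning
----------------|----------------------|---------------------------------------
aten::copy_     | CPU self time        | Unpinned memory → use pin_memory=True
ncclAllReduce   | CUDA self time       | Bandwidth-limited communication
many aten::add_ | Low CUDA, many calls | Fragmented ops → torch.compile()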

Switch “Group by” to Source Location to see which lines of your code are most expensive.
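The same self-time ranking can also be printed without TensorBoard, using the profiler's `key_averages()` summary. The sketch below is a CPU-only toy (the `Linear` model and tensor sizes are placeholders, not from a real workload) so it runs without a GPU; on a real run you would keep the CUDA activity enabled and sort by a CUDA self-time key instead.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input, just to generate some operators to profile.
model = torch.nn.Linear(64, 64)
x = torch.randn(32, 64)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(5):
        model(x).sum().backward()

# Sort by self time so wrappers do not hide the ops doing the real work.
table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
print(table)
```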


Kernel Tab

This tab lists the CUDA kernels that actually ran on the hardware, rather than PyTorch operator names. Sort by Total Duration.

Column         | What to look for
---------------|------------------------------------------------------------
Occupancy      | Low (<50%) → register pressure or small block size
Tensor Core %  | 0% on matmul → you’re in FP32, not using hardware properly
Mean Blocks/SM | Low → kernel not saturating the GPU

A matmul kernel with Tensor Core % = 0 means you’re leaving most of the GPU’s compute on the table. Fix: enable AMP with torch.autocast("cuda", dtype=torch.bfloat16).


Memory Tab

Timeline of allocations and frees.

  • Peak Reserved — total memory PyTorch has claimed from CUDA
  • Peak Allocated — memory actually in use at peak

A large Reserved - Allocated gap usually points to fragmentation: the caching allocator is holding blocks it isn’t using. To hand the unused cached blocks back to the driver:

torch.cuda.empty_cache()  # after validation, between phases

Allocation spikes during backward are normal (gradient buffers). Spikes outside backward suggest unexpected tensor retention — check for closures or missing .detach().
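To put a number on the Reserved - Allocated gap, a small helper is enough. This is a sketch with made-up byte counts; `reserved_gap` is a hypothetical helper, not a PyTorch function.

```python
def reserved_gap(reserved_bytes: int, allocated_bytes: int) -> tuple:
    """Return (gap in MiB, gap as a fraction of reserved memory)."""
    gap = reserved_bytes - allocated_bytes
    frac = gap / reserved_bytes if reserved_bytes else 0.0
    return gap / 2**20, frac


# Illustrative values only: 12 GiB reserved, 9 GiB allocated.
gap_mib, gap_frac = reserved_gap(reserved_bytes=12 * 2**30, allocated_bytes=9 * 2**30)
print(f"{gap_mib:.0f} MiB unused ({gap_frac:.0%} of reserved)")
```

In a live run, the inputs would come from `torch.cuda.memory_reserved()` and `torch.cuda.memory_allocated()`, or from the `max_memory_*` variants if you want the peak values.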


Distributed Tab

The primary view for multi-rank debugging. It aggregates all rank traces and highlights imbalance.

Rank | Compute  | Comm (AllReduce) | Overlap | Comm overhead
-----|----------|------------------|---------|---------------
  0  | 520 ms   | 180 ms           | 60%     | 72 ms
  1  | 518 ms   | 182 ms           | 58%     | 76 ms
  3  | 580 ms   | 181 ms           | 20%     | 145 ms  ← straggler

Overlap ratio = how much of AllReduce runs concurrently with backward. Low on one rank means that rank’s backward is slower or its bucket sizes are off.

Exposed comm time (the “Comm overhead” column above) = AllReduce duration × (1 - overlap ratio). This is the wall-clock time actually lost to communication, and it is what you want to minimize.
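As a worked check, the formula can be applied to the three ranks in the table above. The sketch reuses the table's numbers; `exposed_comm_ms` is a hypothetical helper name.

```python
# AllReduce duration and overlap ratio per rank, copied from the table above.
ranks = {
    0: {"comm_ms": 180, "overlap": 0.60},
    1: {"comm_ms": 182, "overlap": 0.58},
    3: {"comm_ms": 181, "overlap": 0.20},
}


def exposed_comm_ms(comm_ms: float, overlap: float) -> float:
    """Communication time that is not hidden behind backward compute."""
    return comm_ms * (1.0 - overlap)


exposed = {rank: round(exposed_comm_ms(**stats)) for rank, stats in ranks.items()}
straggler = max(exposed, key=exposed.get)  # rank losing the most wall-clock time
print(exposed, "straggler:", straggler)
```

All three ranks spend about the same time in AllReduce, so what singles out the straggler is the overlap ratio, not the communication volume.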


Bottlenecks and Fixes

DataLoader (I/O bound)

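from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    num_workers=8,            # parallel loading; tune to your CPU core count
    pin_memory=True,          # page-locked host memory for faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)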

CPU Overhead

# Bad: forces a CPU/GPU sync every step
running_loss += loss.item()

# Good: accumulate on the GPU, sync only when logging
running_loss += loss.detach()  # detach so the autograd graph is not retained
if step % 100 == 0:
    print(running_loss.item())

# Fuse ops to reduce kernel launches
model = torch.compile(model)

GPU Compute (FP32, Tensor Cores unused)

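import torch

# bfloat16 keeps the FP32 exponent range, so no GradScaler is needed.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(batch)

loss.backward()
optimizer.step()
optimizer.zero_grad()

# With dtype=torch.float16, use a GradScaler instead:
# scaler.scale(loss).backward(); scaler.step(optimizer); scaler.update()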

Communication (AllReduce bottleneck)

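from contextlib import nullcontext
from torch.nn.parallel import DistributedDataParallel as DDP

# Tune bucket size for your network (default 25 MB): larger buckets mean fewer,
# bigger all-reduces; smaller buckets let communication start earlier in backward.
model = DDP(model, bucket_cap_mb=25)

# Gradient accumulation: fewer all-reduces. DDP syncs on every backward()
# unless the intermediate micro-batches run under no_sync().
for i, batch in enumerate(loader):
    is_sync_step = (i + 1) % accum_steps == 0
    with nullcontext() if is_sync_step else model.no_sync():
        loss = model(batch) / accum_steps
        loss.backward()
    if is_sync_step:
        optimizer.step()
        optimizer.zero_grad()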
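To get a feel for what bucket_cap_mb changes, the number of gradient buckets (and so all-reduce calls per backward pass) can be estimated from the model size. This is a back-of-the-envelope sketch: `estimate_buckets` is a hypothetical helper, the 100M-parameter model is made up, and it assumes FP32 gradients packed perfectly into buckets. DDP assigns whole parameters to buckets, so real counts will differ somewhat.

```python
import math


def estimate_buckets(num_params: int, bucket_cap_mb: int = 25,
                     bytes_per_param: int = 4) -> int:
    """Rough count of gradient buckets, assuming perfect packing."""
    grad_bytes = num_params * bytes_per_param
    return math.ceil(grad_bytes / (bucket_cap_mb * 1024 * 1024))


# Illustrative model size: 100M parameters with FP32 gradients.
for cap in (10, 25, 100):
    print(f"bucket_cap_mb={cap}: ~{estimate_buckets(100_000_000, cap)} all-reduces per backward")
```

More buckets give finer-grained overlap with backward but add per-call overhead, which is why the best value depends on the interconnect. The Overlap column in the Distributed tab is the place to check the effect.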

Quick Reference

Symptom                      | View             | Likely cause               | First fix
-----------------------------|------------------|----------------------------|------------------------------------------
GPU util < 80%               | Overview → Trace | DataLoader or CPU overhead | num_workers, pin_memory
Fragmented GPU kernels       | Trace            | CPU dispatch overhead      | torch.compile, remove .item()
Tensor Cores = 0%            | Kernel           | FP32 compute               | Enable AMP
High NCCL time               | Distributed      | Gradient sync bottleneck   | Gradient accumulation, tune bucket_cap_mb
Large Reserved-Allocated gap | Memory           | Fragmentation              | empty_cache() between phases