Reading the PyTorch Profiler in TensorBoard

The TensorBoard PyTorch Profiler plugin (torch-tb-profiler) gives you five views of a training run. Each one is best for a different class of problem.

Setup

pip install torch-tb-profiler

import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler, schedule

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./tb_logs"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(dataloader):
        train_step(batch)
        prof.step()

tensorboard --logdir ./tb_logs

The Five Views

TensorBoard → PyTorch Profiler plugin
├── Overview       — step time breakdown, GPU utilization summary
├── Operator       — per-op CPU/CUDA time, call count
├── Kernel         — per-CUDA-kernel time, occupancy, Tensor Core usage
├── Trace          — raw Perfetto-style timeline
├── Memory         — allocation timeline and peak usage
└── Distributed    — multi-rank comparison (the key tab for DDP)

Overview Tab

Your starting point. The step time breakdown shows:

DataLoader      ████░░░░░░  15%
Forward         ██████░░░░  30%
Backward        ████████░░  40%
Optimizer       ██░░░░░░░░  10%
Other           █░░░░░░░░░   5%

GPU Utilization is the headline metric. Below ~80% means the GPU is waiting on something.

DataLoader dominant → I/O bound, fix the pipeline
Forward ≈ Backward, low GPU util → memory-bandwidth bound or launch overhead
“Other” large → Python GIL or framework overhead

Operator Tab

Per-op breakdown. Sort by Self Time (the op itself, not its children) to find the actual bottleneck rather than the wrapper.

Key things to look for:

Op	High what?	Meaning
`aten::copy_`	CPU self time	Unpinned memory → use `pin_memory=True`
`ncclAllReduce`	CUDA self time	Bandwidth-limited communication
many `aten::add_`	low CUDA, many calls	Fragmented ops → `torch.compile()`

Switch “Group by” to Source Location to see which lines of your code are most expensive.

Kernel Tab

CUDA kernels on the hardware, not PyTorch function names. Sort by Total Duration.

Column	What to look for
Occupancy	Low (<50%) → register pressure or small block size
Tensor Core %	0% on matmul → you’re in FP32, not using hardware properly
Mean Blocks/SM	Low → kernel not saturating the GPU

A matmul kernel with Tensor Core % = 0 means you’re leaving most of the GPU’s compute on the table. Fix: enable AMP with torch.autocast("cuda", dtype=torch.bfloat16).

Memory Tab

Timeline of allocations and frees.

Peak Reserved — total memory PyTorch has claimed from CUDA
Peak Allocated — memory actually in use at peak

A large Reserved - Allocated gap is fragmentation. The allocator is holding blocks it isn’t using. Fix:

torch.cuda.empty_cache()  # after validation, between phases

Allocation spikes during backward are normal (gradient buffers). Spikes outside backward suggest unexpected tensor retention — check for closures or missing .detach().

Distributed Tab

The primary view for multi-rank debugging. It aggregates all rank traces and highlights imbalance.

Rank | Compute  | Comm (AllReduce) | Overlap | Comm overhead
-----|----------|------------------|---------|---------------
  0  | 520 ms   | 180 ms           | 60%     | 72 ms
  1  | 518 ms   | 182 ms           | 58%     | 76 ms
  3  | 580 ms   | 181 ms           | 20%     | 145 ms  ← straggler

Overlap ratio = how much of AllReduce runs concurrently with backward. Low on one rank means that rank’s backward is slower or its bucket sizes are off.

Exposed comm time = AllReduce duration × (1 - overlap ratio) — the actual wall-clock time lost. This is what you want to minimize.

Bottlenecks and Fixes

DataLoader (I/O bound)

DataLoader(dataset,
    num_workers=8,
    pin_memory=True,
    persistent_workers=True,
)

CPU Overhead

# Bad — forces CPU/GPU sync every step
running_loss += loss.item()

# Good — accumulate tensor, sync periodically
running_loss += loss
if step % 100 == 0:
    print(running_loss.item())

# Fuse ops to reduce kernel launches
model = torch.compile(model)

GPU Compute (FP32, Tensor Cores unused)

from torch.amp import autocast, GradScaler

scaler = GradScaler()
with autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(batch)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Communication (AllReduce bottleneck)

# Gradient accumulation — fewer all-reduces
for i, batch in enumerate(loader):
    loss = model(batch) / accum_steps
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Tune bucket size for your network bandwidth
model = DDP(model, bucket_cap_mb=25)  # default 25 MB, try smaller for fast networks

Quick Reference

Symptom	View	Likely cause	First fix
GPU util < 80%	Overview → Trace	DataLoader or CPU overhead	`num_workers`, `pin_memory`
Fragmented GPU kernels	Trace	CPU dispatch overhead	`torch.compile`, remove `.item()`
Tensor Cores = 0%	Kernel	FP32 compute	Enable AMP
High NCCL time	Distributed	Gradient sync bottleneck	Gradient accumulation, tune `bucket_cap_mb`
Large Reserved-Allocated gap	Memory	Fragmentation	`empty_cache()` between phases