TensorBoard PyTorch Profiler — Distributed Debugging
The TensorBoard profiler plugin’s Distributed tab aggregates traces from all ranks in one view, making it easy to spot straggler ranks and communication overhead without manually comparing JSON files.
Setup
pip install torch-tb-profiler
import torch
from torch.profiler import profile, ProfilerActivity, schedule, tensorboard_trace_handler
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
on_trace_ready=tensorboard_trace_handler("./log/profiler"),
record_shapes=True,
profile_memory=True,
with_stack=True,
) as prof:
for step, batch in enumerate(loader):
train_step(batch)
prof.step()
Each rank writes to a separate file named by hostname and PID:
log/profiler/
├── [host]-[pid]-rank0.pt.trace.json
├── [host]-[pid]-rank1.pt.trace.json
└── ...
tensorboard --logdir=./log/profiler
The Distributed Tab
This is the main view for multi-rank debugging. It shows per-rank compute/communication breakdown and highlights imbalance.
Computation / Communication Breakdown
Rank | Compute | Comm (AllReduce) | Overlap | Comm overhead
-----|----------|------------------|---------|---------------
0 | 520 ms | 180 ms | 60% | 72 ms
1 | 518 ms | 182 ms | 58% | 76 ms
2 | 521 ms | 179 ms | 61% | 70 ms
3 | 580 ms | 181 ms | 20% | 145 ms ← straggler
Overlap ratio = fraction of AllReduce that runs concurrently with backward. High is good — it means DDP bucketing is working.
Exposed comm time = AllReduce duration × (1 - overlap ratio). This is what you actually lose to communication.
Low overlap on one rank means that rank’s backward is slower (straggler) or bucket sizes aren’t aligned to layer boundaries.
Step Time Variance
The plugin plots step time per rank across all recorded steps. A rank with consistently higher step time will make every other rank wait at the AllReduce barrier.
Overview Tab
Starting point for any profiling session. The step time breakdown shows which phase dominates:
DataLoader ████░░░░░░ 15%
Forward ██████░░░░ 30%
Backward ████████░░ 40%
Optimizer ██░░░░░░░░ 10%
Other █░░░░░░░░░ 5%
- DataLoader dominant → I/O bound
- Low GPU utilization with normal forward/backward split → launch overhead or memory-bandwidth bound
- “Other” large → Python/framework overhead, GIL contention
Operator Tab
Aggregated self-time per op across steps. Sort by CUDA Self Time to find what the GPU is actually spending time on.
| Op | Pattern | Action |
|---|---|---|
aten::copy_ high CPU self | Unpinned memory copies | pin_memory=True |
ncclAllReduce dominates CUDA | Bandwidth-limited | Gradient compression, fewer buckets |
Many aten::add_ with low CUDA | Fragmented ops | torch.compile() |
Switch “Group by” to Source Location to trace expensive ops back to your code.
Kernel Tab
CUDA kernel-level view. Key columns:
| Column | Check for |
|---|---|
| Mean Duration | Long kernels blocking the stream |
| Occupancy | < 50% → register pressure or small block size |
| Tensor Core % | 0% on matmul → not using FP16/BF16 |
A matmul kernel with Tensor Core % = 0 means you’re running FP32. Switching to autocast(dtype=torch.bfloat16) can give 2–4× kernel throughput improvement.
Memory Tab
- Peak Reserved — total memory PyTorch holds from CUDA (includes free pool)
- Peak Allocated — memory in active use at the peak
A large Reserved - Allocated gap is fragmentation. The allocator has blocks it isn’t using. Fix:
torch.cuda.empty_cache() # after validation
# or
torch.cuda.memory.set_per_process_memory_fraction(0.9) # leave headroom
Multi-Node: Collecting Traces
Each node writes traces locally. Gather them before launching TensorBoard:
# On each worker node
rsync -av ./log/profiler/ master:/shared/log/profiler/
# On master
tensorboard --logdir=/shared/log/profiler
Or point tensorboard_trace_handler at a shared filesystem (NFS, GPFS) directly — all ranks write to the same directory and TensorBoard reads all files at once.
Common Findings and Fixes
| Tab | Observation | Fix |
|---|---|---|
| Distributed | One rank has low overlap ratio | Smaller bucket_cap_mb, check straggler GPU |
| Distributed | All ranks have zero overlap | Remove find_unused_parameters=True |
| Operator | aten::copy_ CPU self high | pin_memory=True, non_blocking=True |
| Operator | ncclAllReduce >> compute | Gradient accumulation, fewer all-reduces |
| Kernel | matmul Tensor Core % = 0 | torch.autocast("cuda", dtype=torch.bfloat16) |
| Kernel | Low occupancy on attention | F.scaled_dot_product_attention (FlashAttention) |
| Memory | Large Reserved - Allocated gap | empty_cache(), avoid fragmentation |
| Overview | DataLoader > 20% of step | More workers, prefetch, offline preprocessing |