The TensorBoard profiler plugin’s Distributed tab aggregates traces from all ranks in one view, making it easy to spot straggler ranks and communication overhead without manually comparing JSON files.


Setup

pip install torch-tb-profiler

import torch
from torch.profiler import profile, ProfilerActivity, schedule, tensorboard_trace_handler

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
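    # loader and train_step stand for your own DataLoader and training function
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()  # advance the profiler schedule once per iteration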

Each rank writes to a separate file named by hostname and PID:

log/profiler/
├── [host]_[pid].[timestamp].pt.trace.json    ← rank 0's process
├── [host]_[pid].[timestamp].pt.trace.json    ← rank 1's process
└── ...

Then point TensorBoard at the directory:

tensorboard --logdir=./log/profiler
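
With the default handler, the file name carries the hostname and PID but not the rank, so it can be hard to tell which trace belongs to which rank. One option is to pass an explicit worker_name to tensorboard_trace_handler. The sketch below assumes the job is launched with torchrun, which exports the RANK environment variable; rank_worker_name is a hypothetical helper, not part of PyTorch.

```python
import os
import socket


def rank_worker_name(default_rank: int = 0) -> str:
    """Build a per-rank worker name such as 'node1_rank3'.

    Assumes the launcher (e.g. torchrun) exports RANK; falls back to
    default_rank for single-process runs.
    """
    rank = int(os.environ.get("RANK", default_rank))
    return f"{socket.gethostname()}_rank{rank}"


# Intended use inside the profile(...) call (requires torch):
#   on_trace_ready=tensorboard_trace_handler(
#       "./log/profiler", worker_name=rank_worker_name()
#   )
print(rank_worker_name())
```

Keeping the hostname in the name avoids collisions if two nodes ever reuse the same rank number across separate runs in one log directory.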

The Distributed Tab

This is the main view for multi-rank debugging. It shows per-rank compute/communication breakdown and highlights imbalance.

Computation / Communication Breakdown

Rank | Compute  | Comm (AllReduce) | Overlap | Comm overhead
-----|----------|------------------|---------|---------------
  0  | 520 ms   | 180 ms           | 60%     | 72 ms
  1  | 518 ms   | 182 ms           | 58%     | 76 ms
  2  | 521 ms   | 179 ms           | 61%     | 70 ms
  3  | 580 ms   | 181 ms           | 20%     | 145 ms  ← straggler

Overlap ratio = fraction of AllReduce that runs concurrently with backward. High is good — it means DDP bucketing is working.

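Exposed comm time = AllReduce duration × (1 - overlap ratio). This is the “Comm overhead” column in the table: the communication time that is not hidden behind compute, which is what you actually lose to communication. For rank 0, 180 ms × (1 - 0.60) = 72 ms.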

Low overlap on one rank means that rank’s backward is slower (straggler) or bucket sizes aren’t aligned to layer boundaries.
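
The arithmetic behind the “Comm overhead” column can be reproduced in a few lines. This is a minimal sketch using the example numbers from the table; the 1.5× median cutoff for flagging a straggler is an arbitrary illustration, not something the plugin defines.

```python
from statistics import median

# (rank, AllReduce ms, overlap ratio) copied from the example table.
ROWS = [(0, 180, 0.60), (1, 182, 0.58), (2, 179, 0.61), (3, 181, 0.20)]


def exposed_comm_ms(comm_ms: float, overlap: float) -> float:
    """Communication time that is not hidden behind backward compute."""
    return comm_ms * (1.0 - overlap)


def find_stragglers(rows, factor: float = 1.5):
    """Return ranks whose exposed comm exceeds factor x the median.

    The default factor is a hypothetical threshold chosen for illustration.
    """
    exposed = {rank: exposed_comm_ms(comm, ov) for rank, comm, ov in rows}
    cutoff = factor * median(exposed.values())
    return [rank for rank, ms in exposed.items() if ms > cutoff]


for rank, comm, ov in ROWS:
    print(f"rank {rank}: exposed comm = {exposed_comm_ms(comm, ov):.0f} ms")
print("stragglers:", find_stragglers(ROWS))
```

Comparing against the median rather than the mean keeps a single slow rank from dragging the cutoff up with it.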

Step Time Variance

The plugin plots step time per rank across all recorded steps. A rank with consistently higher step time will make every other rank wait at the AllReduce barrier.
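
To put a number on that waiting, one rough model is that each step lasts as long as its slowest rank, and every other rank idles for the difference. The sketch below uses made-up timings and assumes a single synchronization point per step, with each rank's time measured up to the moment it enters AllReduce; real DDP steps synchronize once per bucket, so treat the result as an estimate.

```python
# Hypothetical per-rank times (ms) before synchronization, three steps each.
STEP_MS = {
    0: [700, 702, 698],
    1: [701, 699, 700],
    2: [699, 700, 703],
    3: [760, 758, 762],
}


def wait_per_rank(step_ms):
    """Total time (ms) each rank spends waiting for the slowest rank.

    Simplification: one synchronization point per step, so a step lasts
    as long as its slowest rank.
    """
    n_steps = len(next(iter(step_ms.values())))
    slowest = [max(times[i] for times in step_ms.values()) for i in range(n_steps)]
    return {
        rank: sum(s - t for s, t in zip(slowest, times))
        for rank, times in step_ms.items()
    }


print(wait_per_rank(STEP_MS))
```

A rank that shows zero wait in every step is the one the others are waiting for.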


Overview Tab

Starting point for any profiling session. The step time breakdown shows which phase dominates:

DataLoader      ████░░░░░░  15%
Forward         ██████░░░░  30%
Backward        ████████░░  40%
Optimizer       ██░░░░░░░░  10%
Other           █░░░░░░░░░   5%

  • DataLoader dominant → I/O bound
  • Low GPU utilization with normal forward/backward split → launch overhead or host-side stalls leaving the GPU idle between kernels
  • “Other” large → Python/framework overhead, GIL contention

Operator Tab

Aggregated self-time per op across steps. Sort by CUDA Self Time to find what the GPU is actually spending time on.

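Op                            | Pattern                | Action
------------------------------|------------------------|------------------------------------
aten::copy_ high CPU self     | Unpinned memory copies | pin_memory=True
ncclAllReduce dominates CUDA  | Bandwidth-limited      | Gradient compression, fewer buckets
Many aten::add_ with low CUDA | Fragmented ops         | torch.compile()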

Switch “Group by” to Source Location to trace expensive ops back to your code.
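
A similar table can be printed without TensorBoard through the profiler's key_averages() API. The sketch below profiles CPU activity only so that it runs on any machine; on a GPU you would add ProfilerActivity.CUDA and sort by the CUDA self-time key (the exact key name has changed across PyTorch releases, so check your version). The tensor sizes are arbitrary.

```python
import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(256, 256)
b = torch.randn(256, 256)

# with_stack=True records Python source locations for each op.
with profile(activities=[ProfilerActivity.CPU], with_stack=True) as prof:
    for _ in range(5):
        torch.mm(a, b)

# group_by_stack_n=1 keys each row by op name plus the innermost stack frame,
# roughly what grouping by source location does in the UI.
summary = prof.key_averages(group_by_stack_n=1).table(
    sort_by="self_cpu_time_total", row_limit=10
)
print(summary)
```

This is handy for quick checks in a terminal or CI log before opening the full trace.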


Kernel Tab

CUDA kernel-level view. Key columns:

Column        | Check for
--------------|-----------------------------------------------
Mean Duration | Long kernels blocking the stream
Occupancy     | < 50% → register pressure or small block size
Tensor Core % | 0% on matmul → not using FP16/BF16

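A matmul kernel with Tensor Core % = 0 usually means it is running in FP32 without Tensor Cores. Wrapping the forward pass in torch.autocast("cuda", dtype=torch.bfloat16) can give a 2–4× kernel throughput improvement on GPUs with BF16 Tensor Cores; the actual gain depends on the hardware and the matrix shapes.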
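
A minimal sketch of the autocast change, assuming a toy nn.Linear model in place of your own. It falls back to CPU so it runs without a GPU, but Tensor Cores only come into play on CUDA devices that support BF16.

```python
import torch
from torch import nn

# Fall back to CPU so the sketch runs anywhere; the speedup discussed
# above applies to CUDA devices with BF16 Tensor Cores.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(1024, 1024).to(device)  # parameters stay in FP32
x = torch.randn(64, 1024, device=device)

with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = model(x)  # the matmul inside Linear is executed in BF16

print(y.dtype, model.weight.dtype)
```

Only the ops inside the context are cast; the stored weights and the optimizer state keep their original precision. After the change, re-profile and confirm that the matmul rows in the Kernel tab report a non-zero Tensor Core %.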


Memory Tab

  • Peak Reserved — total memory PyTorch holds from CUDA (includes free pool)
  • Peak Allocated — memory in active use at the peak

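A large Reserved - Allocated gap means the caching allocator is holding blocks it isn't currently using, often because of fragmentation. Options:

torch.cuda.empty_cache()          # after validation: return unused cached blocks to the driver
# or
torch.cuda.memory.set_per_process_memory_fraction(0.9)  # cap this process at 90% of device memory to leave headroom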
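
The same gap can be logged from inside the training script, which helps decide whether empty_cache() is worth calling. A small sketch; cache_gap_mib is a hypothetical helper, and the CUDA calls are skipped when no GPU is present.

```python
import torch


def cache_gap_mib(reserved_bytes: int, allocated_bytes: int) -> float:
    """Memory held by the caching allocator but not in active use, in MiB."""
    return (reserved_bytes - allocated_bytes) / 2**20


if torch.cuda.is_available():
    gap = cache_gap_mib(torch.cuda.memory_reserved(), torch.cuda.memory_allocated())
    print(f"reserved - allocated = {gap:.1f} MiB")
    # Per-pool breakdown from the allocator itself.
    print(torch.cuda.memory_summary(abbreviated=True))
```

These calls report current values; torch.cuda.max_memory_reserved() and torch.cuda.max_memory_allocated() give the peaks that the Memory tab shows.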

Multi-Node: Collecting Traces

Each node writes traces locally. Gather them before launching TensorBoard:

# On each worker node
rsync -av ./log/profiler/ master:/shared/log/profiler/

# On master
tensorboard --logdir=/shared/log/profiler

Or point tensorboard_trace_handler at a shared filesystem (NFS, GPFS) directly — all ranks write to the same directory and TensorBoard reads all files at once.


Common Findings and Fixes

Tab         | Observation                    | Fix
------------|--------------------------------|-------------------------------------------------
Distributed | One rank has low overlap ratio | Smaller bucket_cap_mb, check straggler GPU
Distributed | All ranks have zero overlap    | Remove find_unused_parameters=True
Operator    | aten::copy_ CPU self high      | pin_memory=True, non_blocking=True
Operator    | ncclAllReduce >> compute       | Gradient accumulation, fewer all-reduces
Kernel      | matmul Tensor Core % = 0       | torch.autocast("cuda", dtype=torch.bfloat16)
Kernel      | Low occupancy on attention     | F.scaled_dot_product_attention (FlashAttention)
Memory      | Large Reserved - Allocated gap | empty_cache(), avoid fragmentation
Overview    | DataLoader > 20% of step       | More workers, prefetch, offline preprocessing