Perfetto Trace Debugging for Distributed Training
Perfetto is the recommended viewer for PyTorch profiler traces — the same .json files that TensorBoard’s Trace tab shows, but with a faster, more flexible UI.
Open your trace at ui.perfetto.dev → drag & drop the .pt.trace.json file.
Capturing a Trace
import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # Skip 1 step, warm up for 1, then record 3. The cycle repeats unless repeat=N is set.
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./trace_dir"),
    record_shapes=True,
    with_stack=True,
) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()  # advance the profiler schedule once per training step
Each rank writes its own trace file, so a distributed run produces one file per rank.
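By default `tensorboard_trace_handler` names each file after the hostname and process ID, which makes it hard to tell ranks apart. A minimal sketch of one way to label files by rank, assuming the job is launched with `torchrun` (which exports `RANK`). The helper name `rank_worker_name` is hypothetical and not part of PyTorch.

```python
import os


def rank_worker_name(default_rank: int = 0) -> str:
    """Build a per-rank label such as "rank3" from the RANK environment variable."""
    # Assumption: the launcher (e.g. torchrun) exports RANK for every process.
    # Outside a distributed launch, fall back to default_rank.
    rank = int(os.environ.get("RANK", default_rank))
    return f"rank{rank}"


# Hypothetical use with the profiler setup above (not executed here):
#   on_trace_ready=tensorboard_trace_handler(
#       "./trace_dir", worker_name=rank_worker_name()
#   )
```

With this, the trace file names start with `rank0`, `rank1`, and so on, so the right file is easy to pick when loading several ranks.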
UI Layout
Timeline (top)
├── Process: rank 0
│ ├── Thread: CPU ops
│ │ └── [forward] [backward] [optimizer]
│ ├── Thread: CUDA stream 0
│ │ └── [kernel] [kernel] [kernel] ...
│ ├── Thread: NCCL stream
│ │ └── [AllReduce] [AllReduce] ...
│ └── Thread: DataLoader workers
└── Process: rank 1
└── ...
Zoom: Ctrl + scroll wheel, or W / S. Pan: A / D. Click a slice to see its details in the bottom panel.
What to Look for
GPU Idle Gaps
CUDA stream: [kernel▓▓▓][ gap ][kernel▓▓▓]
White space on the CUDA stream = GPU waiting. To find the cause, look at what the CPU threads of the same rank were doing during the gap.
Common causes:
- CPU kernel dispatch too slow (many small ops)
- DataLoader workers not keeping up
- NCCL AllReduce blocking the next forward pass
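Gaps can also be listed outside the UI by scanning the trace JSON directly. The sketch below assumes the usual Chrome-trace layout written by the PyTorch profiler: a top-level `traceEvents` list whose complete events (`"ph": "X"`) carry `ts` and `dur` in microseconds, with GPU kernels tagged `cat == "kernel"`. The function names and the `min_gap_us` threshold are illustrative choices, not part of any library.

```python
import json


def load_events(path):
    """Read a trace file and return its list of events."""
    with open(path) as f:
        data = json.load(f)
    # Assumption: PyTorch writes an object with a "traceEvents" key;
    # a bare list is also accepted for plain Chrome traces.
    return data["traceEvents"] if isinstance(data, dict) else data


def find_gpu_gaps(events, min_gap_us=100.0):
    """Return (gap_start, gap_duration) pairs where no GPU kernel was running."""
    kernels = sorted(
        (
            e for e in events
            if e.get("ph") == "X" and str(e.get("cat", "")).lower() == "kernel"
        ),
        key=lambda e: e["ts"],
    )
    gaps = []
    busy_until = None  # latest kernel end seen so far, across all streams
    for k in kernels:
        if busy_until is not None and k["ts"] - busy_until >= min_gap_us:
            gaps.append((busy_until, k["ts"] - busy_until))
        end = k["ts"] + k.get("dur", 0)
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps
```

Sorting the result by duration gives a short list of timestamps worth jumping to in the Perfetto timeline.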
AllReduce Overlap
Good — AllReduce runs concurrently with backward:
CUDA compute: [backward▓▓▓▓▓▓▓▓▓]
NCCL stream: [AllReduce▓▓▓▓] ← starts mid-backward
Bad — AllReduce serialized after backward:
CUDA compute: [backward▓▓▓▓▓▓▓]
NCCL stream: [AllReduce▓▓▓▓▓▓]
If you see no overlap:
- Bucket size too large → DDP(model, bucket_cap_mb=25) or smaller (25 is the default)
- find_unused_parameters=True adds an autograd-graph traversal every iteration and can reduce the overlap
- Very small models simply have nothing to overlap
Straggler Rank
Open multiple rank traces (File → Open multiple) and align by timestamp:
Rank 0: [fwd▓▓][bwd▓▓▓][AllReduce waiting...]
Rank 1: [fwd▓▓][bwd▓▓▓▓▓▓▓▓▓▓▓▓▓]→[AllReduce]
^ straggler
All ranks block at AllReduce until the slowest one arrives.
Causes:
- Uneven data shards → DistributedSampler + drop_last=True
- Thermal throttle on one GPU → nvidia-smi -q -d PERFORMANCE
- Uneven find_unused_parameters overhead
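Because every rank blocks inside the collective until the slowest one arrives, the straggler tends to spend the least time in NCCL kernels while the ranks waiting for it spend the most. The sketch below uses that heuristic to flag a candidate rank. It assumes `traces` maps each rank to the `traceEvents` list of that rank's file, and that NCCL kernels have `cat == "kernel"` and "nccl" in their name. It is only a heuristic, and the function names are hypothetical.

```python
def nccl_time_per_rank(traces):
    """Map rank -> total NCCL kernel time in microseconds."""
    totals = {}
    for rank, events in traces.items():
        totals[rank] = sum(
            e.get("dur", 0)
            for e in events
            if e.get("ph") == "X"
            and str(e.get("cat", "")).lower() == "kernel"
            and "nccl" in e.get("name", "").lower()
        )
    return totals


def likely_straggler(traces):
    """Return the rank with the least NCCL time, i.e. the one others wait for."""
    totals = nccl_time_per_rank(traces)
    return min(totals, key=totals.get)
```

Comparing durations rather than start timestamps avoids depending on clocks being aligned across hosts. The candidate rank is then the one to inspect first in Perfetto.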
DataLoader Stall
CPU thread: [DataLoader.__next__ ████████████] ← long blocking
CUDA stream: [ idle ][kernel]
Fixes:
- More num_workers
- pin_memory=True + non_blocking=True on .to(device)
- Move preprocessing offline (pre-tokenize, pre-normalize)
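To check whether a fix actually helped, it is useful to track what share of each profiled step is spent waiting on the DataLoader. A minimal sketch, where `events` is the `traceEvents` list from one rank's trace. It assumes profiled steps appear as `ProfilerStep#N` slices and DataLoader waits as slices whose name contains both `DataLoader` and `__next__`, all with `cat == "user_annotation"`. These names reflect what the PyTorch profiler typically emits, but treat them as assumptions to verify against your own trace.

```python
def dataloader_fraction(events):
    """Share of profiled step time spent inside DataLoader __next__ calls."""
    step_time = 0
    load_time = 0
    for e in events:
        # Restrict to CPU-side annotations so GPU mirrors are not counted twice.
        if e.get("ph") != "X" or e.get("cat") != "user_annotation":
            continue
        name = e.get("name", "")
        if name.startswith("ProfilerStep#"):
            step_time += e.get("dur", 0)
        elif "DataLoader" in name and "__next__" in name:
            load_time += e.get("dur", 0)
    return load_time / step_time if step_time else 0.0
```

A fraction that stays high after adding workers suggests the bottleneck is in per-sample preprocessing or storage rather than in parallelism.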
Keyboard Shortcuts
| Key | Action |
|---|---|
| W / S | Zoom in / out |
| A / D | Pan left / right |
| F | Fit selection to screen |
| M | Mark a region |
| / | Search by slice name |
| Shift+click | Select a time range → shows duration |
Common Patterns and Fixes
| Trace Pattern | Diagnosis | Fix |
|---|---|---|
| GPU idle after backward | AllReduce not overlapping | Reduce bucket_cap_mb, remove find_unused_parameters |
| One rank always last at AllReduce | Straggler | drop_last=True, check thermal throttle |
| Long DataLoader slice each step | I/O bound | More workers, pin_memory, prefetch |
| Tiny dense kernels, low throughput | Launch overhead | torch.compile(), fuse ops |
| Long Memcpy HtoD | Unpinned memory | pin_memory=True |
| All ranks idle simultaneously | Load imbalance in forward | record_shapes=True, profile per layer |
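The "tiny kernels" row is the hardest to judge by eye, so a quick summary of kernel durations can help decide whether launch overhead is plausible. A sketch under the same assumptions as elsewhere: `events` is the `traceEvents` list, GPU kernels have `cat == "kernel"`, and durations are in microseconds. The 10 µs cutoff for "tiny" is an arbitrary illustrative threshold, and the function name is hypothetical.

```python
from statistics import median


def kernel_size_summary(events, tiny_us=10.0):
    """Summarise GPU kernel durations to spot launch-overhead-bound workloads."""
    durations = [
        e.get("dur", 0)
        for e in events
        if e.get("ph") == "X" and str(e.get("cat", "")).lower() == "kernel"
    ]
    if not durations:
        return {"count": 0, "median_us": 0.0, "tiny_fraction": 0.0}
    tiny = sum(1 for d in durations if d < tiny_us)
    return {
        "count": len(durations),
        "median_us": median(durations),
        "tiny_fraction": tiny / len(durations),
    }
```

A large kernel count with a small median and a high `tiny_fraction` points toward the launch-overhead diagnosis in the table.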