Perfetto is the recommended viewer for PyTorch profiler traces — the same .json files that TensorBoard’s Trace tab shows, but with a faster, more flexible UI.

Open your trace at ui.perfetto.dev → drag & drop the .pt.trace.json file.


Capturing a Trace

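import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

# `loader` and `train_step` stand in for your own DataLoader and training step.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # Skip 1 step, warm up for 1 step, then record 3 steps; the cycle repeats.
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    # Writes a .pt.trace.json file into ./trace_dir after each recorded cycle.
    on_trace_ready=tensorboard_trace_handler("./trace_dir"),
    record_shapes=True,  # record operator input shapes
    with_stack=True,     # record the source file and line for each op
) as prof:
    for step, batch in enumerate(loader):
        train_step(batch)
        prof.step()  # tell the profiler that one training step has finished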

Each rank writes its own trace file, named after the host and process ID by default, so a distributed run produces at least one file per rank.
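
Host-and-PID file names are hard to map back to ranks. The sketch below is one possible way to label files by rank, assuming the job is launched with torchrun (which sets the RANK environment variable). The helper names rank_worker_name and make_trace_handler are hypothetical; worker_name is an existing parameter of tensorboard_trace_handler.

```python
import os


def rank_worker_name(env=None):
    """Return a label such as 'rank3' from the launcher's RANK variable.

    Falls back to 'rank0' when RANK is not set (single-process run).
    """
    env = os.environ if env is None else env
    return f"rank{int(env.get('RANK', '0'))}"


def make_trace_handler(trace_dir, env=None):
    """Build an on_trace_ready handler whose output files start with the rank label."""
    # Imported lazily so the naming helper works without PyTorch installed.
    from torch.profiler import tensorboard_trace_handler

    return tensorboard_trace_handler(trace_dir, worker_name=rank_worker_name(env))
```

Passing on_trace_ready=make_trace_handler("./trace_dir") to profile(...) should then give files whose names begin with rank0, rank1, and so on.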


UI Layout

Timeline (top)
├── Process: rank 0
│   ├── Thread: CPU ops
│   │   └── [forward] [backward] [optimizer]
│   ├── Thread: CUDA stream 0
│   │   └── [kernel] [kernel] [kernel] ...
│   ├── Thread: NCCL stream
│   │   └── [AllReduce] [AllReduce] ...
│   └── Thread: DataLoader workers
└── Process: rank 1
    └── ...

Zoom: Ctrl + scroll wheel, or W / S. Pan: Shift + click-drag, or A / D. Click a slice: details in the bottom panel.
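
To check which processes and threads a trace contains before opening it, the track names can be read from the file directly. This is a minimal sketch that assumes the trace is a JSON object with a traceEvents list and that track names are stored as Chrome-trace metadata events (ph == "M"); list_tracks and load_events are hypothetical helper names.

```python
import json


def load_events(path):
    """Read the event list from a .pt.trace.json file."""
    with open(path) as f:
        return json.load(f)["traceEvents"]


def list_tracks(events):
    """Collect process and thread names from metadata events.

    Returns (processes, threads) where processes maps pid -> name and
    threads maps (pid, tid) -> name.
    """
    processes, threads = {}, {}
    for ev in events:
        if ev.get("ph") != "M":
            continue  # only metadata events carry track names
        label = ev.get("args", {}).get("name")
        if ev.get("name") == "process_name":
            processes[ev.get("pid")] = label
        elif ev.get("name") == "thread_name":
            threads[(ev.get("pid"), ev.get("tid"))] = label
    return processes, threads
```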


What to Look for

GPU Idle Gaps

CUDA stream:  [kernel▓▓▓][  gap  ][kernel▓▓▓]

White space on the CUDA stream = GPU waiting. Drag across the gap to select that time range, then look at the CPU thread tracks above it to see what the host was doing while the GPU sat idle.

Common causes:

  • CPU kernel dispatch too slow (many small ops)
  • DataLoader workers not keeping up
  • NCCL AllReduce blocking the next forward pass
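
Gaps can also be listed programmatically instead of hunting for them by eye. The sketch below assumes GPU kernels appear as complete events (ph == "X") whose category is "kernel" and that ts and dur are in microseconds; the function names and the 100 µs threshold are illustrative choices, not part of any library.

```python
def kernel_intervals(events):
    """Return sorted (start_us, end_us) pairs for GPU kernel events."""
    spans = [
        (ev["ts"], ev["ts"] + ev["dur"])
        for ev in events
        if ev.get("ph") == "X" and str(ev.get("cat", "")).lower() == "kernel"
    ]
    return sorted(spans)


def idle_gaps(intervals, min_gap_us=100):
    """Return (gap_start, gap_end) pairs where no kernel runs for min_gap_us or more.

    `intervals` must be sorted by start time; overlapping kernels
    (for example on different streams) are treated as one busy period.
    """
    gaps = []
    busy_until = None
    for start, end in intervals:
        if busy_until is not None and start - busy_until >= min_gap_us:
            gaps.append((busy_until, start))
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps
```

Sorting the result by gap length gives a short list of timestamps worth jumping to in the UI.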

AllReduce Overlap

Good — AllReduce runs concurrently with backward:

CUDA compute:  [backward▓▓▓▓▓▓▓▓▓]
NCCL stream:          [AllReduce▓▓▓▓]   ← starts mid-backward

Bad — AllReduce serialized after backward:

CUDA compute:  [backward▓▓▓▓▓▓▓]
NCCL stream:                    [AllReduce▓▓▓▓▓▓]

If you see no overlap:

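  • Bucket size too large → lower bucket_cap_mb below its 25 MB default, e.g. DDP(model, bucket_cap_mb=10)
  • find_unused_parameters=True adds an extra graph traversal every iteration and can delay or reduce the overlap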
  • Very small models simply have nothing to overlap
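
The good and bad pictures above can be turned into a single number: the fraction of NCCL kernel time that coincides with compute kernel time. This is a rough sketch, assuming kernels are "X" events in the "kernel" category and that communication kernels can be recognized by "nccl" in their name; both are assumptions about the trace contents, and all function names are hypothetical.

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_us(a, b):
    """Total time covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def split_kernels(events):
    """Split kernel events into (communication, compute) interval lists."""
    comm, compute = [], []
    for ev in events:
        if ev.get("ph") != "X" or str(ev.get("cat", "")).lower() != "kernel":
            continue
        span = (ev["ts"], ev["ts"] + ev["dur"])
        is_comm = "nccl" in ev.get("name", "").lower()  # assumed naming
        (comm if is_comm else compute).append(span)
    return comm, compute


def allreduce_overlap_fraction(events):
    """Fraction of NCCL kernel time that runs concurrently with compute kernels."""
    comm, compute = split_kernels(events)
    comm_total = sum(end - start for start, end in merge(comm))
    return overlap_us(comm, compute) / comm_total if comm_total else 0.0
```

A value near 0 corresponds to the serialized case; higher values mean more of the communication is hidden behind backward.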

Straggler Rank

Open multiple rank traces (File → Open multiple) and align by timestamp:

Rank 0:  [fwd▓▓][bwd▓▓▓][AllReduce waiting...]
Rank 1:  [fwd▓▓][bwd▓▓▓▓▓▓▓▓▓▓▓▓▓]→[AllReduce]
                          ^ straggler

All ranks block at AllReduce until the slowest one arrives.

Causes:

  • Uneven data shards → DistributedSampler + drop_last=True
  • Thermal throttle on one GPU → nvidia-smi -q -d PERFORMANCE
  • Uneven find_unused_parameters overhead (some ranks spend longer on the unused-parameter search than others)
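
Because the waiting ranks sit inside the collective until the slowest one arrives, the straggler tends to be the rank with the least time spent in NCCL kernels. The sketch below applies that heuristic to already-loaded event lists. It assumes NCCL kernels are "X" events in the "kernel" category with "nccl" in their name, and the function names are hypothetical.

```python
def nccl_time_us(events):
    """Sum the durations of NCCL kernel events in one rank's trace."""
    return sum(
        ev["dur"]
        for ev in events
        if ev.get("ph") == "X"
        and str(ev.get("cat", "")).lower() == "kernel"
        and "nccl" in ev.get("name", "").lower()  # assumed naming
    )


def likely_straggler(events_by_rank):
    """Return (rank, totals) where rank has the smallest total NCCL time.

    Heuristic only: ranks that arrive early wait inside the collective,
    so the rank everyone waited for shows the shortest NCCL slices.
    """
    totals = {rank: nccl_time_us(evs) for rank, evs in events_by_rank.items()}
    return min(totals, key=totals.get), totals
```

If every rank reports a similar total, there is probably no single straggler and the time is real communication cost.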

DataLoader Stall

CPU thread:    [DataLoader.__next__ ████████████]   ← long blocking
CUDA stream:   [                   idle          ][kernel]

Fixes:

  • More num_workers
  • pin_memory=True + non_blocking=True on .to(device)
  • Move preprocessing offline (pre-tokenize, pre-normalize)
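
The first two fixes can be expressed as DataLoader settings. This is a sketch under assumptions: loader_kwargs is a hypothetical helper, the prefetch_factor value of 4 is illustrative, and persistent_workers and prefetch_factor are DataLoader options added here beyond the list above. Both are only set when num_workers > 0, because DataLoader rejects them otherwise.

```python
def loader_kwargs(num_workers, prefetch_factor=4):
    """Build keyword arguments for torch.utils.data.DataLoader."""
    kwargs = {"num_workers": num_workers, "pin_memory": True}
    if num_workers > 0:
        # Keep worker processes alive between epochs and let each one
        # queue several batches ahead of the training loop.
        kwargs["persistent_workers"] = True
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs


# Intended use (requires PyTorch, plus a dataset and a CUDA device):
#   loader = DataLoader(dataset, batch_size=32, **loader_kwargs(8))
#   for batch in loader:
#       batch = batch.to("cuda", non_blocking=True)  # async copy from pinned memory
```

After changing these settings, capture a new trace and check whether the DataLoader.__next__ slice has shrunk.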

Keyboard Shortcuts

Key          | Action
W / S        | Zoom in / out
A / D        | Pan left / right
F            | Fit selection to screen
M            | Mark a region
/            | Search by slice name
Click-drag   | Select a time range → shows duration

Common Patterns and Fixes

Trace Pattern                      | Diagnosis                 | Fix
GPU idle after backward            | AllReduce not overlapping | Reduce bucket_cap_mb, remove find_unused_parameters
One rank always last at AllReduce  | Straggler                 | drop_last=True, check thermal throttle
Long DataLoader slice each step    | I/O bound                 | More workers, pin_memory, prefetch
Tiny dense kernels, low throughput | Launch overhead           | torch.compile(), fuse ops
Long Memcpy HtoD                   | Unpinned memory           | pin_memory=True
All ranks idle simultaneously      | Load imbalance in forward | record_shapes=True, profile per layer
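
The "tiny kernels" pattern is easier to confirm with a count than by zooming. This is a minimal sketch that reports how many kernels fall below a duration threshold. It assumes kernels are "X" events in the "kernel" category with dur in microseconds; the 10 µs cutoff and the name tiny_kernel_stats are illustrative.

```python
def tiny_kernel_stats(events, threshold_us=10):
    """Return (kernel_count, fraction_of_kernels_shorter_than_threshold)."""
    durations = [
        ev["dur"]
        for ev in events
        if ev.get("ph") == "X" and str(ev.get("cat", "")).lower() == "kernel"
    ]
    if not durations:
        return 0, 0.0
    tiny = sum(1 for d in durations if d < threshold_us)
    return len(durations), tiny / len(durations)
```

A high fraction of short kernels alongside low GPU utilization suggests the launch-overhead row of the table applies.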