torchrun is PyTorch’s launcher for distributed training. It replaces the older torch.distributed.launch and handles process spawning, environment setup, and fault tolerance automatically.

Each process runs the same script. Processes use their rank to figure out which data shard they own and whether to log or save checkpoints.


Key Environment Variables

torchrun sets these automatically — your script just reads them.

VariableMeaning
RANKGlobal rank (0 … WORLD_SIZE-1)
LOCAL_RANKRank within this node (0 … nproc_per_node-1)
WORLD_SIZETotal processes across all nodes
MASTER_ADDRHostname/IP of rank-0 node
MASTER_PORTPort for rendezvous

Single-Node Launch

torchrun \
  --standalone \          # sets up rendezvous on localhost automatically
  --nproc-per-node=4 \    # one process per GPU
  train.py

--standalone is shorthand for --rdzv-backend=c10d --rdzv-endpoint=localhost:PORT --nnodes=1.


Multi-Node Launch

Run this on every node — only --node-rank differs.

# Node 0 (master)
torchrun \
  --nnodes=2 \
  --nproc-per-node=4 \
  --rdzv-id=job42 \          # must match on all nodes
  --rdzv-backend=c10d \
  --rdzv-endpoint=<MASTER_IP>:29500 \
  --node-rank=0 \
  train.py

# Node 1
torchrun \
  --nnodes=2 \
  --nproc-per-node=4 \
  --rdzv-id=job42 \
  --rdzv-backend=c10d \
  --rdzv-endpoint=<MASTER_IP>:29500 \
  --node-rank=1 \
  train.py

Elastic / Dynamic Node Count

torchrun \
  --nnodes=2:4 \       # min:max — starts at 2 nodes, can scale to 4
  --nproc-per-node=4 \
  --max-restarts=3 \
  --rdzv-id=job42 \
  --rdzv-backend=c10d \
  --rdzv-endpoint=<MASTER_IP>:29500 \
  train.py

Worker failures or additions trigger a re-rendezvous. Your script needs to handle re-initialization cleanly (load from last checkpoint, reinitialize process group).


Minimal Training Script

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun has already set RANK, LOCAL_RANK, WORLD_SIZE
    dist.init_process_group(backend="nccl")

    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = MyModel().to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    if dist.get_rank() == 0:
        print(f"World size: {dist.get_world_size()}")

    # ... training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Common Patterns

Data loading

sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, sampler=sampler, batch_size=batch_per_gpu)

for epoch in range(epochs):
    sampler.set_epoch(epoch)  # without this, all epochs get the same shuffle
    for batch in loader:
        ...

Checkpointing

if dist.get_rank() == 0:
    torch.save({
        "model": model.module.state_dict(),  # .module unwraps DDP
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
    }, "checkpoint.pt")

dist.barrier()  # all ranks wait here before continuing

Gradient accumulation without double all-reduce

for i, batch in enumerate(loader):
    # suppress all-reduce for the first N-1 micro-steps
    with model.no_sync() if (i + 1) % accum_steps != 0 else contextlib.nullcontext():
        loss = model(batch) / accum_steps
        loss.backward()

    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

find_unused_parameters

model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)

Only enable this if your model has conditional forward paths. It adds per-step overhead and disables the AllReduce/backward overlap optimization.


NCCL Tuning

export NCCL_IB_DISABLE=0          # enable InfiniBand if available
export NCCL_DEBUG=INFO             # verbose logging — useful for debugging hangs
export NCCL_SOCKET_IFNAME=eth0    # specify the network interface
export NCCL_TIMEOUT=1800          # seconds before a collective op times out

Common Failure Modes

SymptomLikely Cause
Hangs at init_process_groupFirewall blocking MASTER_PORT, wrong MASTER_ADDR
NCCL error: unhandled system errorNIC not found — try setting NCCL_SOCKET_IFNAME
Out-of-sync loss across rankssampler.set_epoch() missing
Deadlock after checkpoint saveMissing dist.barrier() after rank-0 save
Gradient NaN after accumulationno_sync() not used — gradients accumulated then double all-reduced

Rendezvous Backends

BackendUse Case
c10d (default)Standard use, no external dependencies
etcd / etcd-v2Elastic jobs, Kubernetes — survives pod restarts