When you hit an OOM, “reduce batch size” is the blunt fix. Knowing what is actually taking up memory lets you be smarter about it.


Where the Memory Goes

For a model with P parameters in FP32:

| Component | Size | Notes |
| --- | --- | --- |
| Parameters | 4P bytes | the model weights |
| Gradients | 4P bytes | one grad tensor per parameter |
| Optimizer states (Adam) | 8P bytes | momentum + variance, both FP32 |
| Activations | varies | depends on batch size and sequence length |
| Total (Adam, FP32) | ≥ 16P bytes | before activations |

A 7B-parameter model in FP32 needs at least 112 GB for weights, gradients, and Adam state combined (56 GB of that is optimizer state alone), and that’s before a single batch goes through.
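The arithmetic behind that figure can be sketched in a few lines. This is a back-of-the-envelope estimate under the table's assumptions (FP32 everywhere, Adam with two state tensors per parameter); the function name `adam_fp32_breakdown` is made up for illustration and the numbers ignore activations, buffers, and allocator overhead.

```python
def adam_fp32_breakdown(num_params):
    """Bytes per component for FP32 training with Adam, excluding activations."""
    return {
        "parameters": 4 * num_params,        # FP32 weights
        "gradients": 4 * num_params,         # one FP32 grad per weight
        "optimizer_states": 8 * num_params,  # Adam momentum + variance, both FP32
    }


breakdown = adam_fp32_breakdown(7_000_000_000)
total_gb = sum(breakdown.values()) / 1e9

for name, size in breakdown.items():
    print(f"{name:>16}: {size / 1e9:6.1f} GB")
print(f"{'total':>16}: {total_gb:6.1f} GB")
```

Here GB means 10^9 bytes, matching the `/ 1e9` convention used in the measurement code later in this article.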


Mixed Precision Changes the Picture

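With BF16/FP16 training that keeps an FP32 master copy of the weights (GradScaler is needed only for FP16, not BF16). Note that plain torch.autocast leaves parameters and gradients in FP32 and only casts per-op computation, so the table below describes the master-weights setup rather than autocast alone: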

| Component | Size |
| --- | --- |
| Parameters (BF16) | 2P bytes |
| Master weights (FP32 copy) | 4P bytes |
| Gradients (BF16) | 2P bytes |
| Optimizer states (FP32) | 8P bytes |
| Total | ~16P bytes |

Mixed precision saves activation memory (BF16 activations are half the size), but the per-parameter total stays at about 16P bytes because the optimizer states and master weights are still FP32. The saving is mostly in activations and gradient communication.
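A quick way to check which tensors actually change dtype is to inspect them. The sketch below is a toy illustration, not a training recipe: it uses CPU autocast with BF16 so it runs without a GPU, and the tiny `nn.Linear` is a stand-in for a real model. Under `torch.autocast` the activations come out in BF16 while the parameters and their gradients stay FP32.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 8)   # stand-in for a real model
x = torch.randn(2, 8)

# autocast casts per-op computation; it does not change how parameters are stored
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

loss = out.float().mean()
loss.backward()

print("activation dtype:", out.dtype)
print("parameter dtype: ", model.weight.dtype)
print("gradient dtype:  ", model.weight.grad.dtype)
```

On a GPU the same check works with `device_type="cuda"`.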


Measuring What’s Actually Allocated

import torch

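# MyModel and get_batch are placeholders for your own module and data loader
model = MyModel().cuda()
optimizer = torch.optim.AdamW(model.parameters())

# Before forward pass: weights only (Adam state is created on the first step)
print(f"After model init:  {torch.cuda.memory_allocated() / 1e9:.2f} GB")

x = get_batch().cuda()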
loss = model(x).mean()
print(f"After forward:     {torch.cuda.memory_allocated() / 1e9:.2f} GB")

loss.backward()
print(f"After backward:    {torch.cuda.memory_allocated() / 1e9:.2f} GB")

optimizer.step()
optimizer.zero_grad()
print(f"After optimizer:   {torch.cuda.memory_allocated() / 1e9:.2f} GB")

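The jump from “after model init” to “after forward” is roughly your activation memory for that batch size, plus the input batch itself. The change from “after forward” to “after backward” is the newly allocated gradients minus the activations that backward frees, so it understates gradient memory and can even be negative. The first optimizer.step() also allocates Adam’s state, so expect a one-time jump there. These are point-in-time readings; torch.cuda.max_memory_allocated() reports the peak, which is what decides whether you hit an OOM.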
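Point-in-time readings can miss the high-water mark inside a step, so it can help to measure the peak as well. The helper below is a minimal sketch: `peak_allocated_bytes` and `step_fn` are names invented here, and the function simply returns 0 when no CUDA device is present so the snippet still runs on a CPU-only machine.

```python
import torch


def peak_allocated_bytes(step_fn, label="step"):
    """Run step_fn once and return the peak bytes allocated on the current GPU."""
    if not torch.cuda.is_available():
        step_fn()          # nothing to measure without a CUDA device
        return 0
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()   # start a fresh high-water mark
    step_fn()
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated()
    print(f"{label}: peak {peak / 1e9:.2f} GB")
    return peak


# Toy usage: any callable works, e.g. a closure running forward + backward.
peak = peak_allocated_bytes(lambda: torch.randn(64, 64).sum(), label="toy step")
```

In a real training loop `step_fn` would wrap the forward pass, backward pass, and optimizer step.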


Activation Memory Is the Variable One

Activations are intermediate tensors saved during the forward pass for use in the backward pass. Their size scales with:

  • batch size
  • sequence length (quadratic for standard attention, which materializes an O(seq_len²) score matrix)
  • number of layers
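To get a feel for the quadratic term, here is a rough estimate of the attention score matrix for one layer. It assumes standard attention that materializes the full batch × heads × seq_len × seq_len matrix in a 2-byte dtype; the function name and the example shapes are illustrative, not taken from any particular model.

```python
def attention_scores_bytes(batch_size, num_heads, seq_len, bytes_per_element=2):
    """Bytes for one layer's attention score matrix (batch x heads x seq x seq)."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


for seq_len in (1024, 2048, 4096):
    size = attention_scores_bytes(batch_size=8, num_heads=32, seq_len=seq_len)
    print(f"seq_len={seq_len:5d}: {size / 1024**3:6.2f} GiB per layer")
```

Doubling the sequence length quadruples this term, while doubling the batch size only doubles it.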

Gradient checkpointing trades compute for memory by discarding activations and recomputing them during backward:

from torch.utils.checkpoint import checkpoint

# Instead of: out = layer(x)
out = checkpoint(layer, x, use_reentrant=False)

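As a rule of thumb this halves activation memory at the cost of ~33% more compute (one extra forward pass). Wrapping every block, so that only each block’s input is kept, can save considerably more.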
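In a real model the call usually lives inside the forward loop over blocks. The sketch below assumes a toy stack of linear layers (the `Stack` class and its `use_checkpoint` flag are invented for illustration) and checks that checkpointing changes memory behaviour only, not the gradients.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class Stack(nn.Module):
    """Toy stack of blocks; optionally recomputes each block during backward."""

    def __init__(self, dim=16, depth=4, use_checkpoint=False):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)]
        )
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpoint and self.training:
                # Keep only the block input; intermediates are recomputed in backward
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


torch.manual_seed(0)
plain = Stack(use_checkpoint=False)
ckpt = Stack(use_checkpoint=True)
ckpt.load_state_dict(plain.state_dict())   # identical weights in both models

x = torch.randn(8, 16)
plain(x).sum().backward()
ckpt(x).sum().backward()

grads_match = all(
    torch.allclose(p.grad, q.grad)
    for p, q in zip(plain.parameters(), ckpt.parameters())
)
print("gradients match:", grads_match)
```

Checkpointing is skipped in eval mode here because no activations need to be saved when there is no backward pass.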


The Reserved vs Allocated Gap

PyTorch’s allocator doesn’t return memory to CUDA after each free — it holds onto it in a pool for reuse. So:

torch.cuda.memory_allocated()  # memory in active use
torch.cuda.memory_reserved()   # memory held by the allocator (allocated + cached free blocks)

reserved - allocated is memory the allocator is holding but not currently using. PyTorch can reuse it, so a gap on its own is harmless. If you hit an OOM while reserved is near your GPU limit but allocated is much lower, fragmentation is the likely problem, not peak usage.

torch.cuda.empty_cache()  # release the cached free blocks back to CUDA

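Call this after validation or between major phases if fragmentation is hurting you. It does not free tensors that are still referenced, and later allocations have to go back to CUDA for memory, so avoid calling it every step.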
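A small helper makes the gap easier to track in logs. This is a sketch with a made-up name, `cached_gap`; it takes the two readings as plain integers so it can be tried without a GPU.

```python
def cached_gap(allocated_bytes, reserved_bytes):
    """Return (cached_bytes, cached_fraction) for the allocator's unused pool."""
    cached = reserved_bytes - allocated_bytes
    fraction = cached / reserved_bytes if reserved_bytes else 0.0
    return cached, fraction


# Example readings: 6 GB in active use, 10 GB held by the allocator
cached, fraction = cached_gap(6_000_000_000, 10_000_000_000)
print(f"cached but unused: {cached / 1e9:.2f} GB ({fraction:.0%} of reserved)")
```

On a GPU the inputs would come from torch.cuda.memory_allocated() and torch.cuda.memory_reserved().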


Quick Cheatsheet: Memory Reduction Techniques

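| Technique | Memory saved | Cost |
| --- | --- | --- |
| BF16/FP16 training | ~50% of activations | Slight precision risk |
| Gradient checkpointing | ~50% or more of activations | ~33% more compute |
| Gradient accumulation | None by itself; lets you run a smaller micro-batch (fewer activations) at the same effective batch size | More forward/backward passes per optimizer step |
| zero_grad(set_to_none=True) | Gradient buffers freed between steps (the default since PyTorch 2.0) | Minor |
| FSDP / DeepSpeed ZeRO | Optimizer states sharded across GPUs | Distributed setup required |
| Flash Attention (F.scaled_dot_product_attention) | O(seq_len) attention memory vs O(seq_len²) | Near-free with PyTorch 2.0+ |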
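Gradient accumulation is the entry that most often causes confusion, so here is a minimal sketch of it. The toy model, data, and `accum_steps` value are assumptions for illustration. It runs four micro-batches of 2 instead of one batch of 8, dividing each loss by the number of steps, and checks that the accumulated gradient matches the full-batch gradient.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)            # toy model
data = torch.randn(8, 4)
target = torch.randn(8, 1)
loss_fn = nn.MSELoss()             # mean reduction

# Reference: a single backward pass over the full batch of 8
model.zero_grad(set_to_none=True)
loss_fn(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Accumulated: 4 micro-batches of 2; only one micro-batch of activations is live at a time
accum_steps = 4
model.zero_grad(set_to_none=True)
for x_mb, y_mb in zip(data.chunk(accum_steps), target.chunk(accum_steps)):
    loss = loss_fn(model(x_mb), y_mb) / accum_steps   # scale so the sum equals the full-batch mean
    loss.backward()                                   # gradients add up in .grad
accum_grad = model.weight.grad

grads_equal = torch.allclose(full_grad, accum_grad, atol=1e-6)
print("accumulated gradient matches full batch:", grads_equal)
```

In a training loop, optimizer.step() and zero_grad() would run once after the inner loop rather than after every micro-batch.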

Further Reading