Eesung Kim
Production AI systems — ASR, translation, summarization.
Notes on ML engineering, research, and real-time speech infrastructure.
Recent Notes
- Merging NeMo FSDP Distributed Checkpoints into a Single File
How to consolidate PyTorch DCP (distributed checkpoint) shards from NeMo FSDP training into one .ckpt file, including the model config stored in meta.pt
- Understanding GPU Memory in PyTorch
Where GPU memory actually goes during training — parameters, gradients, optimizer states, activations — and how to measure each
- Modern Memory Snapshot in PyTorch
How to capture, visualize, and analyze CUDA memory snapshots in PyTorch 2.1+ to debug OOM errors and memory leaks
- NeMo PytorchProfilerCallback — Chakra Traces and Execution Profiling
How NeMo's PytorchProfilerCallback works, what it captures, and how to reproduce or extend the pattern in custom training pipelines
- Perfetto Trace Debugging for Distributed Training
How to read Perfetto traces to diagnose GPU idle gaps, AllReduce overlap, straggler ranks, and DataLoader stalls in distributed PyTorch training
- TensorBoard PyTorch Profiler — Distributed Debugging
How to use the PyTorch Profiler TensorBoard plugin to diagnose bottlenecks in DDP and multi-node training, with a focus on the Distributed tab
Recent Log