Eesung Kim

Production AI systems — ASR, translation, summarization.
Notes on ML engineering, research, and real-time speech infrastructure.

GitHub · Log · Search

Recent Notes

FSDP2 Checkpoint Resume Crashes on the First Training Step (cuda:0 vs cuda:1) Jul 19, 2026
Why resuming from an FSDP2/DCP sharded checkpoint passes validation but crashes the first optimizer step with "Expected all tensors to be on the same device", how to diagnose it DTensor-safely, and when to use weights-only loading instead of a full resume

pytorch · fsdp2 · dtensor · lightning
Merging NeMo FSDP Distributed Checkpoints into a Single File May 1, 2026
How to consolidate PyTorch DCP (distributed checkpoint) shards from NeMo FSDP training into one .ckpt file, including the model config stored in meta.pt

pytorch · fsdp · nemo · checkpoint
Understanding GPU Memory in PyTorch Apr 23, 2026
Where GPU memory actually goes during training — parameters, gradients, optimizer states, activations — and how to measure each

pytorch · cuda · memory · profiling
Modern Memory Snapshot in PyTorch Apr 23, 2026
How to capture, visualize, and analyze CUDA memory snapshots in PyTorch 2.1+ to debug OOM errors and memory leaks

pytorch · cuda · memory · profiling
NeMo PytorchProfilerCallback — Chakra Traces and Execution Profiling Apr 23, 2026
How NeMo's PytorchProfilerCallback works, what it captures, and how to reproduce or extend the pattern in custom training pipelines

pytorch · nemo · profiling · chakra
Perfetto Trace Debugging for Distributed Training Apr 23, 2026
How to read Perfetto traces to diagnose GPU idle gaps, AllReduce overlap, straggler ranks, and DataLoader stalls in distributed PyTorch training

pytorch · distributed · profiling · perfetto

Recent Log

Review: Streaming SpeechLLM Back-end Architecture for ASR & Translation Mar 5, 2026
SpeechLLM · ASR · streaming · Dynamo
qwen3_asr_triton vs speechLLM (vLLM): Performance & Architecture Comparison Mar 2, 2026
ASR · Qwen3-ASR · vLLM · Triton
SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model Feb 16, 2026
speech · s2s · duplex · paper