Checkpointing on preemptibles
Use frequent, incremental checkpoints and snapshot-aware pipelines so spot evictions cost minutes—not hours.
Pattern
- Write shard-local checkpoints; sync to object storage.
- Record dataset/version, code commit, and hyperparams with each checkpoint.
- On restart, reconcile the latest consistent global step, then resume.