Indian Institute of Technology Guwahati, India.
World Journal of Advanced Engineering Technology and Sciences, 2025, 15(02), 2091-2098
Article DOI: 10.30574/wjaets.2025.15.2.0686
Received on 04 April 2025; revised on 13 May 2025; accepted on 15 May 2025
Data Lake-Aware Checkpointing addresses a critical gap in large-scale model training resilience by treating data reader state as a first-class citizen in training checkpoints. Traditional frameworks save only model and optimizer states and neglect data reader progress, leading to duplicated or missed data reads when training resumes against massive data lakes. This article proposes a system that tracks the Parquet files and row group offsets consumed by each worker across a distributed fleet and stores that information alongside the model and optimizer states, yielding comprehensive checkpoints that enable precise recovery without data loss or duplication. The system integrates with existing training pipelines and distributed storage systems, establishing the foundation for truly epoch-less, streaming-style training on vast data repositories. By giving data consumption state the same importance as model parameters, it significantly enhances fault tolerance and training reliability for large language models at scale.
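As a minimal sketch of the idea the abstract describes, the checkpoint payload can carry per-rank reader state (current Parquet file, next unread row group, finished files) next to the model state, so a resume can skip exactly what was already consumed. All names here (`ReaderState`, `Checkpoint`, `resume_plan`) are illustrative assumptions, not the paper's actual API, and a plain JSON file stands in for real checkpoint storage:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ReaderState:
    """Per-rank data consumption state over a Parquet data lake (illustrative)."""
    current_file: str = ""
    next_row_group: int = 0                      # first unread row group in current_file
    finished_files: list = field(default_factory=list)

@dataclass
class Checkpoint:
    step: int
    model_state: dict                            # stand-in for model/optimizer tensors
    reader_states: dict                          # rank -> ReaderState, saved as first-class data

def save_checkpoint(ckpt: Checkpoint, path: str) -> None:
    """Persist model state and reader states together, so neither can drift."""
    payload = {
        "step": ckpt.step,
        "model_state": ckpt.model_state,
        "reader_states": {r: asdict(s) for r, s in ckpt.reader_states.items()},
    }
    with open(path, "w") as f:
        json.dump(payload, f)

def load_checkpoint(path: str) -> Checkpoint:
    with open(path) as f:
        payload = json.load(f)
    states = {int(r): ReaderState(**s) for r, s in payload["reader_states"].items()}
    return Checkpoint(payload["step"], payload["model_state"], states)

def resume_plan(state: ReaderState, all_files: list) -> list:
    """Return (file, starting_row_group) pairs still to read for one rank:
    finished files are skipped; the in-progress file resumes mid-way."""
    done = set(state.finished_files)
    return [(f, state.next_row_group if f == state.current_file else 0)
            for f in all_files if f not in done]
```

On resume, each rank loads its `ReaderState` and calls `resume_plan` against the current file listing, so no row group is re-read or skipped; a production system would additionally have to reconcile fleets of readers and shard reassignment, which this sketch omits.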
Distributed Training; Fault Tolerance; Checkpointing; Data Lakes; Large Language Models
Sravankumar Nandamuri. Data lake-aware checkpointing: Enabling resilient large-scale model training through precise data consumption tracking. World Journal of Advanced Engineering Technology and Sciences, 2025, 15(02), 2091-2098. Article DOI: https://doi.org/10.30574/wjaets.2025.15.2.0686.