World Journal of Advanced Engineering Technology and Sciences
ISSN: 2582-8266 (Online)

Data lake-aware checkpointing: Enabling resilient large-scale model training through precise data consumption tracking

Sravankumar Nandamuri *

Indian Institute of Technology Guwahati, India.

Review Article

World Journal of Advanced Engineering Technology and Sciences, 2025, 15(02), 2091-2098

Article DOI: 10.30574/wjaets.2025.15.2.0686

DOI url: https://doi.org/10.30574/wjaets.2025.15.2.0686

Received on 04 April 2025; revised on 13 May 2025; accepted on 15 May 2025

Data Lake-Aware Checkpointing addresses a gap in the resilience of large-scale model training by treating reader state as a first-class part of the training checkpoint. Traditional frameworks save only model and optimizer states and neglect data reader progress, so training that resumes from a massive data lake can re-read some data and skip other data. This article proposes a system that tracks consumed Parquet files and row group offsets across a distributed fleet of readers and stores that information in the checkpoint. The resulting checkpoint captures both training state and data consumption state, which allows recovery to continue from the exact read position without data loss or duplication. The system is designed to integrate with existing training pipelines and distributed storage systems, and it lays the foundation for epoch-less, streaming-style training on very large data repositories. Treating data consumption state with the same importance as model parameters improves fault tolerance and training reliability for large language models at scale.

Keywords: Distributed Training; Fault Tolerance; Checkpointing; Data Lakes; Large Language Models
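
The abstract describes the mechanism only at a high level. As a rough illustration, and not the article's implementation, the sketch below shows one way reader progress could be stored next to model state and used to resume without re-reading or skipping row groups. Every name in it (`ReaderState`, `mark_consumed`, `remaining_work`, `save_checkpoint`, `load_reader_states`, the manifest layout, and the JSON checkpoint format) is a hypothetical choice made for this example. It also assumes each worker reads its files in a fixed manifest order, one row group at a time, and it uses plain Python values in place of real Parquet readers and model tensors.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical manifest: ordered (parquet_path, number_of_row_groups) pairs.
Manifest = List[Tuple[str, int]]


@dataclass
class ReaderState:
    """Hypothetical per-worker record of data consumption progress."""
    completed_files: List[str] = field(default_factory=list)  # fully read files
    current_file: Optional[str] = None  # partially read file, if any
    next_row_group: int = 0             # first unread row group in current_file


def mark_consumed(state: ReaderState, path: str, row_group: int,
                  num_row_groups: int) -> None:
    """Record that one row group has been fully consumed by training."""
    if row_group + 1 == num_row_groups:
        # Last row group of the file: the whole file is now done.
        state.completed_files.append(path)
        state.current_file = None
        state.next_row_group = 0
    else:
        state.current_file = path
        state.next_row_group = row_group + 1


def remaining_work(manifest: Manifest,
                   state: ReaderState) -> List[Tuple[str, int]]:
    """List the (path, row_group) units that have not been consumed yet."""
    done = set(state.completed_files)
    work: List[Tuple[str, int]] = []
    for path, num_row_groups in manifest:
        if path in done:
            continue
        start = state.next_row_group if path == state.current_file else 0
        work.extend((path, rg) for rg in range(start, num_row_groups))
    return work


def save_checkpoint(step: int, model_state: dict,
                    reader_states: Dict[int, ReaderState]) -> str:
    """Serialize model state and every worker's reader state together."""
    return json.dumps({
        "step": step,
        "model": model_state,
        "readers": {str(rank): asdict(s) for rank, s in reader_states.items()},
    })


def load_reader_states(blob: str) -> Dict[int, ReaderState]:
    """Restore per-worker reader states from a serialized checkpoint."""
    payload = json.loads(blob)
    return {int(rank): ReaderState(**s)
            for rank, s in payload["readers"].items()}


# Usage: consume four row groups, checkpoint, then resume after a failure.
manifest = [("part-0.parquet", 3), ("part-1.parquet", 2)]
state = ReaderState()
for path, rg in remaining_work(manifest, state)[:4]:
    mark_consumed(state, path, rg, dict(manifest)[path])

blob = save_checkpoint(step=4, model_state={"w": [0.1]},
                       reader_states={0: state})
restored = load_reader_states(blob)[0]
print(remaining_work(manifest, restored))  # [('part-1.parquet', 1)]
```

Because the reader state is written in the same checkpoint as the model state, the two cannot drift apart: a restore either recovers both or neither. A real system would also have to decide when a row group counts as consumed (for example, only after the optimizer step that used it) and how to handle a change in the number of workers. This sketch does not address either question.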

Full text PDF: https://wjaets.com/sites/default/files/fulltext_pdf/WJAETS-2025-0686.pdf

Sravankumar Nandamuri. Data lake-aware checkpointing: Enabling resilient large-scale model training through precise data consumption tracking. World Journal of Advanced Engineering Technology and Sciences, 2025, 15(02), 2091-2098. Article DOI: https://doi.org/10.30574/wjaets.2025.15.2.0686.

Copyright © Author(s). All rights reserved. This article is published under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as appropriate credit is given to the original author(s) and source, a link to the license is provided, and any changes made are indicated.


Copyright © 2026 World Journal of Advanced Engineering Technology and Sciences