NVIDIA Corporation, USA.
World Journal of Advanced Engineering Technology and Sciences, 2025, 15(01), 1044-1052
Article DOI: 10.30574/wjaets.2025.15.1.0320
Received on 04 March 2025; revised on 12 April 2025; accepted on 14 April 2025
This article explores how Kubernetes has become a critical solution for the complex infrastructure challenges inherent in artificial intelligence and machine learning workloads. As AI models grow in size and complexity, organizations face significant hurdles in resource management, scaling, reliability, and operational efficiency. The article examines how Kubernetes provides dynamic resource allocation, intelligent scaling, self-healing capabilities, enhanced monitoring, and workload portability that directly address these challenges. Through industry-specific case studies, it demonstrates how leading organizations leverage Kubernetes to manage massive computational demands, orchestrate distributed training, and deploy models efficiently. The analysis also covers the evolving Kubernetes AI ecosystem, including specialized tools such as Kubeflow, TensorFlow operators, enhanced security technologies, and lightweight orchestration mechanisms that further extend Kubernetes's capabilities for AI workloads. Finally, the article highlights how Kubernetes has enabled organizations to accelerate AI initiatives while maintaining operational efficiency in a rapidly growing market.
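As a concrete illustration of the dynamic resource allocation described above, a minimal Kubernetes Pod specification can request GPU capacity through the extended resource `nvidia.com/gpu`, which is exposed by the NVIDIA device plugin; the scheduler then places the workload on a node with an available accelerator. This is a hedged sketch, not the article's own configuration: the Pod name, container image, and entrypoint below are hypothetical.

```yaml
# Minimal sketch: a Pod requesting one GPU for a training job.
# "trainer", the image tag, and train.py are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  restartPolicy: OnFailure
  containers:
    - name: train
      image: nvcr.io/nvidia/pytorch:24.01-py3   # hypothetical image tag
      command: ["python", "train.py"]           # hypothetical entrypoint
      resources:
        limits:
          nvidia.com/gpu: 1   # GPUs are requested via limits and cannot be overcommitted
```

Because the GPU request is declarative, the same manifest is portable across clusters, which is one of the workload-portability benefits the article discusses.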
Kubernetes; Artificial Intelligence; Machine Learning Infrastructure; Container Orchestration; Distributed Training
Praneel Madabushini. Leveraging Kubernetes for AI/ML Workloads: Case studies in autonomous driving and large language model infrastructure. World Journal of Advanced Engineering Technology and Sciences, 2025, 15(01), 1044-1052. Article DOI: https://doi.org/10.30574/wjaets.2025.15.1.0320.