Network Engineer (Network Layers and Storage) – MTS IV, Ireland.
Received on 03 January 2023; revised on 26 January 2023; accepted on 30 January 2023
This paper examines how to optimize data movement in distributed AI workloads and how that data should be managed, given that current AI training datasets exceed the petabyte scale. We focus on key network techniques such as RDMA (Remote Direct Memory Access), ECMP (Equal-Cost Multi-Path Routing), and DPDK (Data Plane Development Kit) to optimize east-west traffic in large-scale distributed systems. We also examine the role of Quality of Service (QoS) and congestion control protocols such as DCQCN (Data Center Quantized Congestion Notification) in maintaining stable data flows. The paper further examines the optimization of storage-to-GPU data paths using methods such as NVMe over Fabrics (NVMe-oF) and GPUDirect Storage, which improve data transfer rates and eliminate bottlenecks. Moreover, we examine the workload profiles of distributed training, in particular the relationship between bandwidth constraints and batch size. By providing detailed insight into these optimization methods, this paper contributes to efficient, scalable implementations of AI workloads, making model training faster and more dependable.
Optimization Techniques; Data Movement; RDMA; ECMP; GPUDirect Storage; Network Traffic
Oluwatosin Oladayo Aramide. Optimizing data movement for AI workloads: A multilayer network engineering approach. World Journal of Advanced Engineering Technology and Sciences, 2023, 08(01), 518-528. Article DOI: https://doi.org/10.30574/wjaets.2023.8.1.0017