Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Illinois, USA.
World Journal of Advanced Engineering Technology and Sciences, 2025, 17(03), 357–374
Article DOI: 10.30574/wjaets.2025.17.3.1563
Received on 06 November 2025; revised on 17 December 2025; accepted on 20 December 2025
As consumer-grade GPUs have rapidly evolved, efforts have emerged to deploy them for model training and inference workloads traditionally handled by data-center hardware. This paper explores the optimization of two next-generation graphics processing units, the NVIDIA GeForce RTX 5090 and the AMD Radeon RX 9070, for the latest generation of ML and AI applications. In a two-pronged architectural and empirical study, we examine the internal compute pipelines, tensor/matrix acceleration capabilities, memory hierarchies, and software ecosystems (CUDA/cuDNN/TensorRT versus ROCm/MIOpen/HIP) that influence ML performance. A common benchmarking methodology spanning convolutional networks, transformer models, diffusion architectures, and graph neural networks covers training throughput, inference latency, power consumption, precision scaling (FP32 to INT8), and bottleneck analysis. The experimental results show that the RTX 5090 and the RX 9070 exhibit distinct performance profiles: the RTX 5090 gains more from mixed-precision acceleration and kernel fusion, whereas the RX 9070 delivers strong throughput on BF16/INT8 workloads with high memory-bandwidth utilization. Platform-specific optimization strategies, such as kernel tuning, compiler optimization, memory prefetching, gradient checkpointing, and scaling to multiple GPUs, are developed and evaluated; a brief illustrative sketch of two of these techniques follows the abstract. Two case studies of real-world performance tuning, transformer fine-tuning and diffusion model inference, are also presented.
The findings highlight that hardware alone does not guarantee the best ML performance; effective optimization can deliver gains that exceed what raw compute provides on its own. The paper offers a step-by-step roadmap for practitioners, researchers, and engineers who want to get the most out of the RTX 5090 and RX 9070 for AI workloads, along with an outlook on unified GPU programming models and emerging precision formats.
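To make the platform-specific tuning described above concrete, the sketch below combines two of the strategies the study evaluates, mixed-precision training and gradient checkpointing, in PyTorch. This is a minimal illustration under assumed settings rather than the paper's benchmark harness: the toy model, batch shape, and hyperparameters are hypothetical, and the same script is expected to run on both cards because ROCm builds of PyTorch expose the CUDA-style torch.cuda API.

```python
# Minimal sketch (assumed setup, not the paper's benchmark code) of two of the
# optimization strategies discussed above: mixed-precision training and gradient
# checkpointing in PyTorch. The same script targets the RTX 5090 (CUDA) and the
# RX 9070 (ROCm), since ROCm builds of PyTorch reuse the torch.cuda namespace.
# The toy model, batch shape, and hyperparameters are illustrative only.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stack of linear blocks standing in for a transformer layer stack.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.GELU()) for _ in range(8)]
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Prefer BF16 autocast where supported; otherwise fall back to FP16 with loss scaling.
use_bf16 = device == "cuda" and torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda" and not use_bf16))

x = torch.randn(64, 1024, device=device)
target = torch.randn(64, 1024, device=device)

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=amp_dtype,
                        enabled=(device == "cuda")):
        # Gradient checkpointing: split the 8 blocks into 2 segments and
        # recompute their activations during backward, trading compute for memory.
        out = checkpoint_sequential(model, 2, x, use_reentrant=False)
        loss = nn.functional.mse_loss(out, target)
    scaler.scale(loss).backward()  # scaling is a no-op when the scaler is disabled (BF16)
    scaler.step(optimizer)
    scaler.update()
```

On top of a training loop like this, the backend-specific techniques the paper evaluates (kernel tuning, compiler optimization, memory prefetching, multi-GPU scaling) would be layered per platform.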
Deep learning compute efficiency; Tensor core mixed precision deep learning; Mixed precision training; Large model training GPU efficiency analysis; Deep learning optimization consumer GPUs; AI GPU benchmarking; FP8 acceleration; Low-precision inference
Mohit Jain, Adit Shah, Brahaspati Dev, Ram Kumar and Mathew Campisi. Optimizing NVIDIA GeForce RTX 5090 and AMD RX 9070 for machine learning and artificial intelligence workloads. World Journal of Advanced Engineering Technology and Sciences, 2025, 17(03), 357–374. Article DOI: https://doi.org/10.30574/wjaets.2025.17.3.1563.