Generating high-quality and diverse synthetic datasets with large language models: A survey

Abinandaraj Rajendran, Raleigh, USA.
World Journal of Advanced Engineering Technology and Sciences, 2025, 15(02), 1145-1149
Article DOI: 10.30574/wjaets.2025.15.2.0652
Received on 26 March 2025; revised on 03 May 2025; accepted on 06 May 2025
Large Language Models (LLMs) are increasingly leveraged to generate synthetic datasets that overcome challenges in real-world data collection, including privacy risks, imbalance, and scarcity. This paper surveys recent developments in LLM-based synthetic data generation, emphasizing techniques that improve diversity, task alignment, and reliability—crucial factors in high-stakes domains such as predictive maintenance. We categorize state-of-the-art approaches into four methodological pillars: prompt engineering, multi-step generation pipelines, quality control through data curation, and rigorous evaluation methods. Structured generation workflows and controlled prompting strategies significantly enhance output coherence and domain relevance, while self-correction mechanisms and diversity-aware metrics contribute to higher dataset fidelity. Despite progress, open challenges persist, including bias propagation, limited generalization across tasks and modalities, and the need for robust ethical safeguards. We outline promising future directions—such as integrating external knowledge, expanding to multilingual and multimodal settings, and fostering human-AI collaboration—for advancing synthetic data generation using LLMs.
Keywords: Synthetic Data Generation; Large Language Models; Predictive Maintenance; Anomaly Detection; Disk Failure Prediction; Cloud Storage Systems
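To make the prompt-engineering and quality-control pillars mentioned in the abstract concrete, the sketch below shows one way an LLM could be prompted to synthesize labeled disk-failure records, followed by a simple curation filter. This example is illustrative only and is not taken from the paper: `call_llm`, the prompt wording, and the SMART-style field names are assumptions standing in for whatever model client and telemetry schema a practitioner actually uses.

```python
import json

# Hypothetical stand-in for any LLM chat-completion call (hosted API or local
# model); plug in a real client to use this sketch.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a call to your LLM of choice")

# Controlled prompt: fixing the output schema and class balance up front keeps
# the generated records aligned with the downstream failure-prediction task.
PROMPT_TEMPLATE = """You are generating synthetic SMART telemetry for cloud storage disks.
Produce {n} JSON records, each with the fields:
  "reallocated_sectors" (int), "temperature_c" (int),
  "power_on_hours" (int), "failed_within_30_days" (0 or 1).
Exactly half of the records must have "failed_within_30_days" set to 1.
Return only a JSON list, with no extra commentary."""

def generate_synthetic_records(n: int = 10) -> list[dict]:
    """Generate n synthetic disk records and keep only schema-valid ones."""
    raw = call_llm(PROMPT_TEMPLATE.format(n=n))
    records = json.loads(raw)
    # Lightweight curation pass: discard records that violate the schema,
    # in the spirit of the quality-control pillar described in the abstract.
    required = {"reallocated_sectors", "temperature_c",
                "power_on_hours", "failed_within_30_days"}
    return [r for r in records if isinstance(r, dict) and required <= set(r.keys())]
```

Fixing the schema and the class ratio inside the prompt is one simple way to address the imbalance and task-alignment concerns raised in the abstract; a fuller pipeline would add the self-correction and diversity-aware checks the survey discusses.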
Abinandaraj Rajendran. Generating high-quality and diverse synthetic datasets with large language models: A survey. World Journal of Advanced Engineering Technology and Sciences, 2025, 15(02), 1145-1149. Article DOI: https://doi.org/10.30574/wjaets.2025.15.2.0652.