Generating high-quality and diverse synthetic datasets with large language models: A survey

Abinandaraj Rajendran, Raleigh, USA.
World Journal of Advanced Engineering Technology and Sciences, 2025, 15(02), 1145-1149
Article DOI: 10.30574/wjaets.2025.15.2.0652
Received on 26 March 2025; revised on 03 May 2025; accepted on 06 May 2025
Large Language Models (LLMs) are increasingly leveraged to generate synthetic datasets that overcome challenges in real-world data collection, including privacy risks, imbalance, and scarcity. This paper surveys recent developments in LLM-based synthetic data generation, emphasizing techniques that improve diversity, task alignment, and reliability—crucial factors in high-stakes domains such as predictive maintenance. We categorize state-of-the-art approaches into four methodological pillars: prompt engineering, multi-step generation pipelines, quality control through data curation, and rigorous evaluation methods. Structured generation workflows and controlled prompting strategies significantly enhance output coherence and domain relevance, while self-correction mechanisms and diversity-aware metrics contribute to higher dataset fidelity. Despite progress, open challenges persist, including bias propagation, limited generalization across tasks and modalities, and the need for robust ethical safeguards. We outline promising future directions—such as integrating external knowledge, expanding to multilingual and multimodal settings, and fostering human-AI collaboration—for advancing synthetic data generation using LLMs.
Keywords: Synthetic Data Generation; Large Language Models; Predictive Maintenance; Anomaly Detection; Disk Failure Prediction; Cloud Storage Systems
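To make the prompt-engineering and quality-control pillars mentioned in the abstract concrete, the sketch below shows one way an LLM could be prompted to synthesize labeled disk-failure records, followed by a simple curation filter. This example is illustrative only and is not taken from the paper: `call_llm`, the prompt wording, and the SMART-style field names are assumptions standing in for whatever model client and telemetry schema a practitioner actually uses.

```python
import json

# Hypothetical stand-in for any LLM chat-completion call (hosted API or local
# model); plug in a real client to use this sketch.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a call to your LLM of choice")

# Controlled prompt: fixing the output schema and class balance up front keeps
# the generated records aligned with the downstream failure-prediction task.
PROMPT_TEMPLATE = """You are generating synthetic SMART telemetry for cloud storage disks.
Produce {n} JSON records, each with the fields:
  "reallocated_sectors" (int), "temperature_c" (int),
  "power_on_hours" (int), "failed_within_30_days" (0 or 1).
Exactly half of the records must have "failed_within_30_days" set to 1.
Return only a JSON list, with no extra commentary."""

def generate_synthetic_records(n: int = 10) -> list[dict]:
    """Generate n synthetic disk records and keep only schema-valid ones."""
    raw = call_llm(PROMPT_TEMPLATE.format(n=n))
    records = json.loads(raw)
    # Lightweight curation pass: discard records that violate the schema,
    # in the spirit of the quality-control pillar described in the abstract.
    required = {"reallocated_sectors", "temperature_c",
                "power_on_hours", "failed_within_30_days"}
    return [r for r in records if isinstance(r, dict) and required <= set(r.keys())]
```

Fixing the schema and the class ratio inside the prompt is one simple way to address the imbalance and task-alignment concerns raised in the abstract; a fuller pipeline would add the self-correction and diversity-aware checks the survey discusses.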
Abinandaraj Rajendran. Generating high-quality and diverse synthetic datasets with large language models: A survey. World Journal of Advanced Engineering Technology and Sciences, 2025, 15(02), 1145-1149. Article DOI: https://doi.org/10.30574/wjaets.2025.15.2.0652.