Synthetic Data Generation with Generative AI for Low-Resource Predictive Modeling

Authors

  • Emile C. Crawford, Department of Computer Science, University of North Texas, Denton, TX, USA
  • Karan L. Sharma, Department of Computer Science, Binghamton University, Binghamton, NY, USA
  • Benjamin Lawrence, Department of Computer Science, Colorado State University, Fort Collins, CO, USA

Keywords

synthetic data, generative AI, low-resource predictive modeling, data augmentation, fairness, robustness, governance, infrastructure

Abstract

The increasing demand for predictive models in domains where labeled data are scarce has motivated the exploration of synthetic data generation using generative artificial intelligence. This paper presents a comprehensive systems-level analysis of the architectures, trade-offs, and governance challenges inherent in deploying generative models for low-resource predictive modeling. We examine the structural properties of generative adversarial networks, variational autoencoders, and diffusion models as they relate to data fidelity, diversity, and privacy preservation. The discussion extends to the infrastructure required for training such models on limited real data, including transfer learning, self-supervised pretraining, and federated setups. Critical tensions between statistical utility and fairness are analyzed, particularly the risk of amplifying biases present in small real-world samples. Policy implications, including regulatory frameworks for synthetic data provenance and accountability, are explored through cross-domain case illustrations in healthcare, finance, and natural language processing. The paper concludes with forward-looking perspectives on sustainable deployment, robustness evaluation, and the need for standardized benchmarks. Our findings suggest that while generative AI offers a powerful pathway to overcome data scarcity, its success depends on careful architectural choices, rigorous validation protocols, and governance structures that ensure equitable outcomes across populations.

Published

2024-09-30

How to Cite

Emile C. Crawford, Karan L. Sharma, & Benjamin Lawrence. (2024). Synthetic Data Generation with Generative AI for Low-Resource Predictive Modeling. Computational Intelligence Systems, 2(1). Retrieved from https://scivexus.org/index.php/CIS/article/view/337