Synthetic Data Generation for Childhood Diabetes Research Using GANs and WGANs
DOI:
https://doi.org/10.17488/RMIB.46.1.3
Keywords:
childhood diabetes, synthetic data generation, Generative Adversarial Networks, Wasserstein Generative Adversarial Networks
Abstract
Research on pediatric diabetes is often limited by data scarcity, which hinders the development of accurate predictive models for clinical applications. This study addresses that limitation by evaluating the effectiveness of Generative Adversarial Networks (GANs) and Wasserstein GANs (WGANs) for generating synthetic datasets that replicate the statistical properties of real pediatric diabetes data. A structured methodology was applied, comprising preprocessing, model design, and a dual set of evaluation metrics: Jensen-Shannon and Kullback-Leibler divergences to assess statistical fidelity, and a classification model to assess practical utility. The results show that both models generate high-fidelity synthetic data, with WGANs proving superior at capturing complex patterns thanks to their improved training stability. However, challenges remain in replicating the inherent variability of pediatric data, which is shaped by growth and developmental factors. This work highlights the potential of synthetic data to augment pediatric diabetes datasets, facilitating the development of robust, generalizable predictive models. Limitations include the dependence on the quality of the initial data and the specificity of the models to pediatric datasets. This study helps close critical gaps in data availability, advancing artificial intelligence-based personalized healthcare solutions for pediatric diabetes research.
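As a rough illustration of the dual evaluation described above, the following minimal Python sketch computes per-feature Jensen-Shannon and Kullback-Leibler divergences between a real table and a synthetic one, and then trains a Random Forest classifier on the synthetic data and tests it on the real data. It is a sketch under our own assumptions (random stand-in data, simple histogram binning, illustrative names such as feature_divergences), not the implementation used in the article.

# Minimal sketch (not the authors' code) of the dual evaluation described in the
# abstract: statistical fidelity via JS/KL divergences, practical utility via a
# train-on-synthetic / test-on-real Random Forest. Data here are random stand-ins.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def feature_divergences(real, synthetic, bins=30, eps=1e-10):
    """Per-feature JS and KL divergences between two (n_samples, n_features) arrays."""
    out = []
    for j in range(real.shape[1]):
        lo = min(real[:, j].min(), synthetic[:, j].min())
        hi = max(real[:, j].max(), synthetic[:, j].max())
        p, _ = np.histogram(real[:, j], bins=bins, range=(lo, hi))
        q, _ = np.histogram(synthetic[:, j], bins=bins, range=(lo, hi))
        p = p / p.sum() + eps                  # empirical distributions, zero bins smoothed
        q = q / q.sum() + eps
        js = jensenshannon(p, q, base=2) ** 2  # SciPy returns the JS *distance*; square it
        kl = entropy(p, q)                     # KL(real || synthetic)
        out.append((js, kl))
    return out

rng = np.random.default_rng(0)

# Stand-ins: "real" records and a slightly perturbed "synthetic" copy.
X_real = rng.normal(size=(600, 5))
y_real = (X_real[:, 0] + X_real[:, 1] > 0).astype(int)
X_syn = X_real + rng.normal(scale=0.15, size=X_real.shape)
y_syn = y_real.copy()

# 1) Statistical fidelity: low divergences mean the marginal distributions match closely.
for j, (js, kl) in enumerate(feature_divergences(X_real, X_syn)):
    print(f"feature {j}: JS = {js:.4f}, KL = {kl:.4f}")

# 2) Practical utility: train on synthetic data, evaluate on real data.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_syn, y_syn)
print("train-synthetic / test-real accuracy:", accuracy_score(y_real, clf.predict(X_real)))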
License
Copyright 2025 Revista Mexicana de Ingeniería Biomédica

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Once an article is accepted for publication in the RMIB, the lead or corresponding author will be asked to review and sign the corresponding copyright transfer letters authorizing publication of the article. That document authorizes the RMIB to publish the article in any medium, without limitations and at no cost. Authors may reuse parts of the article in other documents and reproduce it in part or in full for personal use, provided that bibliographic reference is made to the RMIB. However, any publication outside the corresponding author's own academic publications, or any other kind of derivative published work, requires written permission from the RMIB.