Medical Genetics

An Original Approach to Identifying Genomic Loci Associated with Polygenic Diseases Based on Random Forest and Resampling

https://doi.org/10.25557/2073-7998.2025.10.135-138

Abstract

An alternative to traditional genome-wide association studies for detecting genomic loci associated with polygenic diseases is machine learning, in which features are ranked by their contribution to the performance of a predictive model. Implementing this approach requires addressing the class imbalance caused by differences in the sizes of the case and control samples, and establishing how to select features by an importance metric that, unlike p-values, has no predefined threshold. This work presents a bioinformatic approach that solves both problems simultaneously. It is based on training a random forest on randomized case-control samples of similar size, then ranking features by decreasing importance score and selecting them by how frequently they appear among the top-ranked features and by the stability of their importance scores. The approach was tested on simulated genotype-phenotype data containing single nucleotide polymorphisms. Two types of synthetic datasets were used: the first contained genomic loci associated with a polygenic disease, and the second contained no such loci.
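The procedure outlined in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the parameters (`n_rounds`, `top_k`, `min_freq`, `max_cv`) and the use of the coefficient of variation as the stability measure are all assumptions, since the abstract does not specify them. The importance scorer is passed in as a callable; `rf_importance` shows how a random forest could be plugged in if scikit-learn is available, while `mean_diff_importance` is a dependency-free stand-in (not a random forest) used only so that the demo runs anywhere.

```python
import random
import statistics
from collections import Counter


def rf_importance(X, y, seed=0):
    """Random forest importances; assumes scikit-learn is installed."""
    from sklearn.ensemble import RandomForestClassifier  # imported lazily
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    return list(model.fit(X, y).feature_importances_)


def mean_diff_importance(X, y, seed=0):
    """Stand-in scorer (NOT a random forest): absolute difference in
    mean allele count between cases (label 1) and controls (label 0)."""
    cases = [row for row, label in zip(X, y) if label == 1]
    controls = [row for row, label in zip(X, y) if label == 0]
    return [
        abs(statistics.fmean(r[j] for r in cases)
            - statistics.fmean(r[j] for r in controls))
        for j in range(len(X[0]))
    ]


def select_loci(cases, controls, importance_fn, n_rounds=50, top_k=2,
                min_freq=0.8, max_cv=0.5, seed=0):
    """Return indices of features that are often top-ranked and stable."""
    rng = random.Random(seed)
    n = min(len(cases), len(controls))  # equal group size in every round
    n_features = len(cases[0])
    top_counts = Counter()
    scores = [[] for _ in range(n_features)]

    for r in range(n_rounds):
        # Draw a balanced random case-control sample.
        X = rng.sample(cases, n) + rng.sample(controls, n)
        y = [1] * n + [0] * n
        imp = importance_fn(X, y, seed=r)
        # Rank features by decreasing importance and record the top k.
        ranked = sorted(range(n_features), key=lambda j: imp[j], reverse=True)
        top_counts.update(ranked[:top_k])
        for j in range(n_features):
            scores[j].append(imp[j])

    selected = []
    for j in range(n_features):
        freq = top_counts[j] / n_rounds  # share of rounds in the top k
        mean = statistics.fmean(scores[j])
        # Assumed stability measure: coefficient of variation of the score.
        cv = statistics.pstdev(scores[j]) / mean if mean > 0 else float("inf")
        if freq >= min_freq and cv <= max_cv:
            selected.append(j)
    return selected


def simulate(n_cases, n_controls, n_snps, causal, seed=1):
    """Toy genotypes coded 0/1/2; causal SNPs are enriched in cases."""
    rng = random.Random(seed)

    def person(is_case):
        row = []
        for j in range(n_snps):
            freq = 0.7 if (is_case and j in causal) else 0.3
            row.append((rng.random() < freq) + (rng.random() < freq))
        return row

    return ([person(True) for _ in range(n_cases)],
            [person(False) for _ in range(n_controls)])


# Imbalanced toy data: 200 cases, 800 controls, causal SNPs at 2 and 7.
cases, controls = simulate(200, 800, 20, {2, 7})
selected = select_loci(cases, controls, mean_diff_importance)
print(selected)
```

Passing `rf_importance` instead of `mean_diff_importance` would use random forest importances in the same loop. Requiring both a high top-k frequency and a low spread of scores is one way to replace the missing significance threshold; the cut-off values here are arbitrary and would need tuning on real data.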

About the Authors

G. V. Khvorykh
National Research Centre «Kurchatov Institute»
Russian Federation

2, Akademika Kurchatova sq., Moscow, 123182 



N. A. Sapozhnikov
National Research Centre «Kurchatov Institute»
Russian Federation

2, Akademika Kurchatova sq., Moscow, 123182 



S. A. Limborska
National Research Centre «Kurchatov Institute»; Research Centre for Medical Genetics
Russian Federation

2, Akademika Kurchatova sq., Moscow, 123182

1, Moskvorechye st., Moscow, 115522



A. V. Khrunin
National Research Centre «Kurchatov Institute»
Russian Federation

2, Akademika Kurchatova sq., Moscow, 123182 



For citations:


Khvorykh G.V., Sapozhnikov N.A., Limborska S.A., Khrunin A.V. An Original Approach to Identifying Genomic Loci Associated with Polygenic Diseases Based on Random Forest and Resampling. Medical Genetics. 2025;24(10):135-138. (In Russ.) https://doi.org/10.25557/2073-7998.2025.10.135-138


ISSN 2073-7998 (Print)