Background Rheumatoid arthritis (RA) and systemic lupus erythematosus (SLE) are prototypical autoimmune diseases for which many genetic loci have been identified through genome-wide association studies (GWAS) in recent decades. This study aims to establish genomic prediction models for autoimmune diseases using machine learning (ML) techniques that can distinguish patients with RA or SLE from healthy controls.
Methods We obtained SNP data (N=446,097) from the Korean Chip (Affymetrix Axiom™ KOR) for 3,145 patients with RA, 1,870 patients with SLE, and 3,635 controls enrolled at Hanyang University Hospital for Rheumatic Diseases. After quality control (N=428,380), we selected the SNPs significantly associated with each disease (RA or SLE) using a univariate test (p < 5.0e-8); imputation was performed with SHAPEIT2, IMPUTE4, Eagle v2, and Minimac3. We then trained two ML classifiers, a support vector machine (SVM) and extreme gradient boosting (XGB). The contribution of each SNP to classification was quantified by its importance (or gain) value.
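The univariate filtering step can be illustrated with a minimal sketch. The abstract does not name the test that was used, so the allelic 1-df chi-square test below is an assumption chosen for illustration, and the function names, SNP labels, and allele counts are hypothetical. Only the threshold (p < 5.0e-8) comes from the Methods.

```python
import math

GENOME_WIDE_P = 5.0e-8  # significance threshold stated in the Methods


def allelic_chi2_p(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """P-value of a 1-df chi-square test on a 2x2 allele-count table.

    Assumption: an allelic test stands in for the unspecified
    "univariate test" of the abstract.
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_case = case_alt + case_ref
    row_ctrl = ctrl_alt + ctrl_ref
    col_alt = case_alt + ctrl_alt
    col_ref = case_ref + ctrl_ref
    # A table with an empty margin carries no association signal.
    if 0 in (row_case, row_ctrl, col_alt, col_ref):
        return 1.0
    diff = case_alt * ctrl_ref - case_ref * ctrl_alt
    chi2 = n * diff ** 2 / (row_case * row_ctrl * col_alt * col_ref)
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def select_snps(allele_counts, threshold=GENOME_WIDE_P):
    """Return the SNP ids whose p-value falls below the threshold."""
    return [
        snp
        for snp, counts in allele_counts.items()
        if allelic_chi2_p(*counts) < threshold
    ]


# Hypothetical counts: (case_alt, case_ref, ctrl_alt, ctrl_ref).
toy_counts = {
    "snp_a": (2500, 3790, 2000, 5270),  # strong frequency difference
    "snp_b": (1500, 4790, 1740, 5530),  # nearly identical frequencies
}
selected = select_snps(toy_counts)
```

In this toy input only `snp_a` passes the filter; a real analysis would usually rely on a dedicated association tool and adjust for covariates.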
Results A total of 2,458 SNPs were selected and used as inputs for the ML models. Each group (RA, SLE, and controls) was randomly divided into training (70%) and testing (30%) subsets. Both the SVM and XGB approaches showed high prediction performance in discriminating autoimmune disease (RA or SLE) from controls (table 1). However, the same models showed lower performance in discriminating RA from SLE (accuracy 0.7229–0.7368). Among the 30 SNPs with the highest feature importance in each ML model (figure 1), 8 SNPs were shared between the two models; these were annotated to HLA-DQA1, HLA-DOA, MICB, GTF2IRD1, HLA-DRA, TNFSF4 and STAT4.
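The per-group 70/30 split and the comparison of the top-ranked SNPs across models could look roughly like the sketch below. It is not the authors' pipeline: the helper names, the seed, and the importance scores are hypothetical, and the importance dictionaries are assumed to have been exported from already trained SVM and XGB models.

```python
import random


def split_group(sample_ids, train_fraction=0.7, seed=0):
    """Randomly split one group (e.g. RA, SLE, or controls) into train/test."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def top_k(importances, k=30):
    """Return the k feature names with the largest importance values."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


def shared_top_features(importances_a, importances_b, k=30):
    """Features that appear in the top k of both models, sorted by name."""
    return sorted(set(top_k(importances_a, k)) & set(top_k(importances_b, k)))


# Splitting each group separately keeps the class proportions equal
# in the training and testing subsets.
ra_train, ra_test = split_group(range(3145))

# Hypothetical importance scores from two models.
svm_scores = {"s1": 0.5, "s2": 0.3, "s3": 0.1, "s4": 0.05}
xgb_scores = {"s2": 0.6, "s4": 0.2, "s1": 0.1, "s3": 0.05}
shared = shared_top_features(svm_scores, xgb_scores, k=2)
```

Importance values from different model families are not on a common scale, so the sketch compares rank membership (top k in each model) rather than raw scores.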
Conclusions We established ML-based genomic prediction models that distinguish autoimmune diseases (RA or SLE) from healthy controls in the Korean population, although differentiating RA from SLE requires further research. These results suggest that an ML approach using genomic data may be useful for predicting the risk of autoimmune diseases.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.