LSO-081 Genomic prediction model using machine learning techniques that can distinguish autoimmune diseases (RA or SLE) from healthy controls
  1. Young Bin Joo1,2,
  2. Youngho Park3,
  3. So-Young Bang1,2,
  4. Sang-Cheol Bae1,2 and
  5. Hye-Soon Lee1,2
  1. Department of Rheumatology, Hanyang University Hospital for Rheumatic Diseases, Seoul, Republic of Korea
  2. Hanyang University Institute for Rheumatology Research, Seoul, Republic of Korea
  3. Department of Big Data Application, Hannam University, Daejeon, Republic of Korea


Background Rheumatoid arthritis (RA) and systemic lupus erythematosus (SLE) are prototypical autoimmune diseases for which many genetic loci have been identified through genome-wide association studies (GWAS) in recent decades. This study aims to establish genomic prediction models, using machine learning (ML) techniques, that can distinguish patients with RA or SLE from healthy controls.

Methods We obtained SNP data (N=446,097) from the Korean Chip (Affymetrix Axiom™ KOR) for 3,145 patients with RA, 1,870 patients with SLE, and 3,635 controls enrolled at Hanyang University Hospital for Rheumatic Diseases. After quality control (N=428,380), we selected the SNPs significantly associated with each disease (RA or SLE) using a univariate test (p < 5.0e-8); imputation was performed with SHAPEIT2, IMPUTE4, Eagle v2, and Minimac3. We then trained two ML classifiers, a support vector machine (SVM) and extreme gradient boosting (XGB). The contribution of each SNP to classification was quantified by its importance (or gain) value.
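The abstract does not say which univariate test was used for SNP selection. As an illustration only, the sketch below assumes an allelic Pearson chi-square test (1 degree of freedom) on a 2x2 table of allele counts and applies the stated threshold of p < 5.0e-8. The function names, the SNP identifiers, and the allele counts are hypothetical and are not taken from the study.

```python
import math

GENOME_WIDE_ALPHA = 5.0e-8  # significance threshold stated in the abstract


def allelic_chi2_pvalue(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """P value of a Pearson chi-square test (1 df) on a 2x2 allele-count table.

    ASSUMPTION: the abstract only says "univariate test"; the allelic
    chi-square test is one common choice, used here for illustration.
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_case = case_alt + case_ref
    row_ctrl = ctrl_alt + ctrl_ref
    col_alt = case_alt + ctrl_alt
    col_ref = case_ref + ctrl_ref
    if 0 in (row_case, row_ctrl, col_alt, col_ref):
        return 1.0  # empty group or monomorphic SNP: no evidence of association
    chi2 = (n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
            / (row_case * row_ctrl * col_alt * col_ref))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def select_snps(allele_counts, alpha=GENOME_WIDE_ALPHA):
    """Return the SNP ids whose association p value is below alpha."""
    return [snp for snp, counts in allele_counts.items()
            if allelic_chi2_pvalue(*counts) < alpha]


# Hypothetical allele counts: (case_alt, case_ref, ctrl_alt, ctrl_ref).
# Totals match 3,145 RA cases and 3,635 controls (two alleles per person).
toy_counts = {
    "snp_a": (3000, 3290, 2500, 4770),  # large allele-frequency difference
    "snp_b": (3100, 3190, 3600, 3670),  # nearly identical frequencies
}
selected = select_snps(toy_counts)
```

With these made-up counts only `snp_a` passes the threshold. A real analysis would normally use a dedicated GWAS tool and adjust for covariates, which this sketch does not attempt.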

Results A total of 2,458 SNPs were selected and used as inputs for the ML models. Each group (RA, SLE, and controls) was randomly divided into training (70%) and testing (30%) subsets. Both the SVM and XGB approaches showed high prediction performance in discriminating autoimmune disease (RA or SLE) from controls (table 1). However, the models performed less well in discriminating RA from SLE (accuracy 0.7229 – 0.7368). Among the top 30 SNPs ranked by feature importance in each ML model (figure 1), 8 SNPs overlapped; these were annotated to HLA-DQA1, HLA-DOA, MICB, GTF2IRD1, HLA-DRA, TNFSF4 and STAT4.
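Two steps in this paragraph can be sketched in a few lines: the per-group 70%/30% random split and the overlap between the top 30 SNPs of the two models. The sketch below is a minimal illustration, not the authors' pipeline. The function names, the random seed, and the toy importance scores are hypothetical; only the group sizes (3,145 RA, 1,870 SLE, 3,635 controls) come from the abstract.

```python
import random


def stratified_split(labels, train_percent=70, seed=0):
    """Randomly split sample indices into train/test within each label group."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    by_group = {}
    for idx, label in enumerate(labels):
        by_group.setdefault(label, []).append(idx)
    train, test = [], []
    for label in sorted(by_group):
        members = by_group[label]
        rng.shuffle(members)
        # Integer arithmetic avoids floating-point rounding surprises.
        cut = len(members) * train_percent // 100
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def top_k_overlap(importance_a, importance_b, k=30):
    """SNPs that appear in the top-k importance ranking of both models."""
    top_a = set(sorted(importance_a, key=importance_a.get, reverse=True)[:k])
    top_b = set(sorted(importance_b, key=importance_b.get, reverse=True)[:k])
    return top_a & top_b


# Group sizes from the abstract; the label strings are illustrative.
labels = ["RA"] * 3145 + ["SLE"] * 1870 + ["control"] * 3635
train_idx, test_idx = stratified_split(labels)

# Hypothetical importance scores for two models (not study results).
toy_svm = {"s1": 0.5, "s2": 0.3, "s3": 0.1}
toy_xgb = {"s2": 0.6, "s3": 0.3, "s4": 0.2}
shared = top_k_overlap(toy_svm, toy_xgb, k=2)
```

Splitting within each group keeps the RA, SLE, and control proportions the same in the training and testing subsets, which matches the abstract's description that each group was divided separately.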

Conclusions We identified ML-based genomic prediction models that could distinguish autoimmune diseases (RA or SLE) from healthy controls in the Korean population, although further research is needed to differentiate RA from SLE. These results suggest that an ML approach using genomic data might be useful for predicting the risk of autoimmune diseases.

Abstract LSO-081 Figure 1

Feature importance of top 30 SNPs of machine learning models.

Abstract LSO-081 Table 1

Statistics of machine learning models

  • Genomic prediction
  • Machine learning
  • Autoimmune diseases

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:
