Abstract
Background Nearly 20% of pregnancies in patients with systemic lupus erythematosus (SLE) result in an adverse pregnancy outcome (APO), so early identification of women with SLE who are at high risk of APO is vital. We previously derived a risk model for APO using logistic regression and data from the PROMISSE Study, a large multi-center, multi-ethnic/racial study of APO in women with mild/moderate SLE and/or antiphospholipid antibodies (aPL). While this highly interpretable regression model showed promising predictive performance, we sought to determine whether increasingly popular machine learning (ML) approaches would enhance APO risk prediction by using all available predictors and capturing potentially complex relationships such as interactions or higher-order terms. We compared logistic regression modeling to LASSO, a regression approach that handles high-dimensional and correlated predictors by shrinking estimated coefficients, as well as several ‘black box’ ML algorithms. ML techniques are well-suited to high-dimensional data, require no variable selection, and, unlike regression-based approaches, can explore complex relationships without explicit input from the user.
Methods We used the original PROMISSE data (41 predictor variables from 385 subjects), with APO (71/385, 18.4%) defined as preterm delivery due to placental insufficiency or preeclampsia, fetal or neonatal death, or fetal growth restriction. Logistic regression with stepwise selection (LR-S) was compared to LASSO, random forest (RF), a neural network (NN) with 2 hidden neurons, a support vector machine with a radial basis function kernel (SVMRBF), and gradient boosting (GB). To summarize discrimination, we present the area under the receiver operating characteristic curve (AUC), along with sensitivity (Sn) and specificity (Sp) at an optimal cut-point.
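The comparison described in Methods can be sketched roughly as follows. This is a minimal illustration only, not the authors' pipeline: it uses scikit-learn, a synthetic data set shaped like the study data (385 subjects, 41 predictors, about 18% events) rather than PROMISSE data, only two of the six classifiers, arbitrary hyperparameters, and Youden's J statistic as an assumed definition of the "optimal" cut-point, which the abstract does not specify. The names `youden_cut_point` and `evaluate` are hypothetical helpers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT PROMISSE data): 385 subjects, 41 predictors,
# roughly 18% of subjects in the positive (event) class.
X, y = make_classification(n_samples=385, n_features=41, n_informative=6,
                           weights=[0.816], random_state=0)

# Two illustrative classifiers; hyperparameters are arbitrary assumptions.
models = {
    "LASSO": make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    ),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}


def youden_cut_point(y_true, scores):
    """Return (Sn, Sp) at the threshold maximizing Youden's J = Sn + Sp - 1."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    best = np.argmax(tpr - fpr)
    return tpr[best], 1.0 - fpr[best]


def evaluate(model, X, y, n_repeats=5, n_splits=10):
    """Average AUC, Sn and Sp over repeated stratified k-fold cross-validation."""
    aucs, sns, sps = [], [], []
    for repeat in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=repeat)
        # Out-of-fold predicted probabilities of the positive class.
        proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
        sn, sp = youden_cut_point(y, proba)
        aucs.append(roc_auc_score(y, proba))
        sns.append(sn)
        sps.append(sp)
    return {"auc": np.mean(aucs), "sn": np.mean(sns), "sp": np.mean(sps)}


results = {name: evaluate(model, X, y) for name, model in models.items()}
for name, r in results.items():
    print(f"{name}: AUC={r['auc']:.2f} Sn={r['sn']:.2f} Sp={r['sp']:.2f}")
```

Because the cut-point here is chosen on the same out-of-fold predictions used to report Sn and Sp, those two values are somewhat optimistic; a stricter design would select the threshold inside each training fold.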
Results Regression-based classifiers confirmed the predictors of APO identified in our previously reported model: non-white race, use of anti-hypertensive medication, low platelets, SLE disease activity, lupus anticoagulant positivity (LAC+), and high diastolic blood pressure (DBP). RF additionally revealed two novel interactions that increased APO risk: LAC+ with anti-β2GPI IgM, and high DBP with low C3. LR-S and LASSO had similar overall discrimination (AUC=0.75 vs. 0.77, table 1), but LASSO had higher sensitivity (Sn=0.71 vs. 0.65). The ML classifiers RF and SVMRBF performed similarly well (AUC=0.77-0.78), while NN and GB were inferior.
Table 1 Summary of 5x10-fold cross-validation results
Conclusions Several popular ML algorithms did not provide meaningful improvements in the prediction of APO. The strong relative performance of regression-based models with this large and well-characterized clinical data set is notable as these models are highly interpretable, well-understood, and generally require fewer variables to generate a risk prediction. It is unlikely that complex ML algorithms with existing variables will yield superior APO predictions; new clinical and laboratory markers may improve predictions in the future.
Acknowledgments This work was supported by NIH grant R21 AR076612.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.