Abstract
Background Electronic health records (EHRs) can play an important role in generating data on the natural history, treatment, and outcomes of systemic lupus erythematosus (SLE). A key issue in using EHRs for SLE research is accurately identifying populations of patients with the disease. This is especially important because traditional definitions that rely on coding systems such as ICD-9 have had poor specificity in previous studies. We aimed to develop and test disease classification algorithms to define a population with SLE in the EHR. We analysed both traditional definitions that used structured data (ICD-9 codes, medications, laboratory results) and machine learning algorithms that used the entirety of information in the EHR, including unstructured data from clinical notes.
Materials and methods We created a repository of patients with possible SLE (based on relevant ICD-9 codes, positive auto-antibodies, and/or mention of “SLE” or “lupus” in the text of a clinical note). We combined 300 patients from that repository with 1000 randomly selected adult patients in our EHR to form our training set. Domain experts reviewed these patients for a diagnosis of SLE, and confirmed cases were used as the gold standard for training our machine learning algorithms. We calculated the test characteristics of various definitions of SLE that used only structured data. Finally, we compared these definitions with a series of supervised machine learning algorithms based on support vector machines (SVMs) that used text features extracted from clinical notes in addition to structured fields. All SVM algorithms were trained and validated using 10-fold cross-validation.
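To illustrate how the test characteristics of a structured definition can be computed against a chart-review gold standard, the following is a minimal Python sketch. It is not the study's code: the `Patient` class, the `single_icd9_definition` and `evaluate_definition` functions, and the ten-patient toy cohort are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    # Hypothetical stand-in for one patient's structured EHR data.
    icd9_codes: list = field(default_factory=list)
    expert_sle: bool = False  # gold-standard label from expert chart review


def single_icd9_definition(patient, code="710.0"):
    """Structured definition: the patient has at least one ICD-9 code 710.0."""
    return code in patient.icd9_codes


def evaluate_definition(patients, definition):
    """Compare a definition with the gold standard and return
    sensitivity, specificity, and precision (positive predictive value)."""
    tp = fp = tn = fn = 0
    for p in patients:
        predicted = definition(p)
        if predicted and p.expert_sle:
            tp += 1
        elif predicted and not p.expert_sle:
            fp += 1
        elif not predicted and p.expert_sle:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Return None when the denominator is zero (metric undefined).
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "counts": {"tp": tp, "fp": fp, "tn": tn, "fn": fn},
    }


# Toy cohort (hypothetical, not study data).
cohort = (
    [Patient(["710.0"], True) for _ in range(4)]      # coded, confirmed SLE
    + [Patient(["714.0"], True)]                      # confirmed SLE, not coded
    + [Patient(["710.0"], False)]                     # coded, not SLE
    + [Patient(["401.9"], False) for _ in range(4)]   # neither
)

result = evaluate_definition(cohort, single_icd9_definition)
print(result)
```

The same `evaluate_definition` function could be applied to any other rule, for example one that requires two or more codes or a positive auto-antibody, by passing a different `definition` callable.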
Results One hundred thirty-seven patients met criteria for SLE. The test characteristics of both the structured definitions and the supervised machine learning algorithms are shown in Table 1. A single ICD-9 code of 710.0 had a precision/positive predictive value of 79%. In contrast, machine learning algorithms greatly outperformed structured definitions in terms of precision, with precision/positive predictive value approaching 96% in the most comprehensive algorithm.
Conclusions In an EHR-based data repository, a single ICD-9 code was highly sensitive for SLE. Machine learning algorithms drew on a wide range of structured and unstructured EHR data, which improved precision/positive predictive value. Further validation across different health systems will be necessary before these algorithms are implemented on a national basis.
Acknowledgements The Rheumatology Research Foundation provided funding for this work.