Discussion
This study addresses the challenge of accurately identifying prevalent cases of LN using structured EHR data. We first characterised the performance of LN ICD codes documented in the EHRs of two large health systems and found high specificity (92–97%) but low sensitivity (43–73% across health systems). We then developed two scoring systems using a broader range of EHR data, prioritising sensitivity. Given the morbidity and mortality associated with LN, accurate identification of this manifestation of SLE is vital for patient management, clinical research and public health initiatives, but has been hampered by underutilisation of specific LN diagnosis codes. Prediction of prevalent LN using structured data elements available in EHRs was feasible and showed good accuracy and external validity. The proposed scoring systems have the potential to identify prevalent LN accurately across different health systems.
We found that the specificity of LN diagnosis codes was high, but their sensitivity was low. This allowed us to quantify the impact of underutilisation of LN ICD codes on the ability to accurately identify patients with prevalent LN. LN diagnosis codes are associated with a high false-negative rate because the absence of a code does not distinguish a diagnosis of ‘no LN’ from a lack of documentation of an LN diagnosis. The sensitivity of diagnosis codes for LN was suboptimal in both health systems examined, especially when using an inclusive definition of LN. This finding suggests that relying solely on these codes may miss a substantial proportion of prevalent LN cases (up to 52% in this study) and underscores the importance of combining multiple data elements and refined algorithms for accurate LN identification.
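The link between code sensitivity and missed cases can be illustrated with a short calculation. The following Python sketch is illustrative only: the function name and all counts are hypothetical and are not taken from the study data, and chart review is assumed to be the reference standard.

```python
def code_performance(tp, fp, fn, tn):
    """Performance of a diagnosis-code rule against a reference standard.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives (reference standard assumed
    to be chart review).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Share of true LN cases that a code-only approach would miss
        "missed_fraction": 1 - sensitivity,
    }


# Hypothetical counts for illustration, NOT study data
result = code_performance(tp=50, fp=5, fn=50, tn=95)
```

With these made-up counts, specificity is high while half of the true cases carry no code, which is the pattern described above.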
We developed two scoring systems, each designed to enhance LN identification in different settings. LN-Code, which includes LN diagnosis codes, demonstrated strong predictive performance with an AUC of 0.93 for the inclusive definition of LN and good sensitivity (0.88). This system is valuable when specific LN diagnosis codes are routinely used and could be employed to improve the accuracy of prevalent LN identification in EHRs. In this scenario, patients with two or more LN codes and two or more codes for kidney disease or proteinuria have a high predicted probability of having LN; use of mycophenolic acid and younger age further increase this probability. LN-No Code, which excludes LN diagnosis codes, was designed to address situations where LN codes are underused or unavailable. This model achieved a sensitivity of 0.95 for the inclusive definition of LN (0.97 for the strict definition) while maintaining a high AUC of 0.91. Although other clinical indicators, such as diagnoses of chronic kidney disease, proteinuria and higher UPCr, lack the specificity of LN ICD codes, their inclusion in this scoring system offers a practical solution for identifying LN in cases where specific LN codes are lacking.
We chose to develop a scoring system rather than a binary system (diagnosis present vs absent) to permit a more granular assessment of LN diagnosis. This approach allows investigators to apply definitions according to the use case and the degree of accuracy required for the intended task. Such an approach permits adjustment of the threshold for classification based on clinical or operational needs. The scoring systems outline which data elements are required for high accuracy and highlight where more limited data might compromise accuracy. Moreover, the scoring systems provide transparency, allowing users to directly observe how different variables are weighted, making the model more interpretable and clinically relevant.
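The effect of moving the classification threshold can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the scores, labels and thresholds are hypothetical, the function name is ours, and the point values do not correspond to the published LN-Code or LN-No Code weights.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Classify as LN when score >= threshold, then compare with labels.

    labels are booleans from a reference standard (True = LN).
    """
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    tn = sum((not p) and (not y) for p, y in zip(preds, labels))
    fp = sum(p and (not y) for p, y in zip(preds, labels))
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores and reference labels, NOT study data
scores = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [False, False, True, False, True, True, True, True]

# A lower threshold favours sensitivity; a higher one favours specificity
low_sens, low_spec = sensitivity_specificity(scores, labels, threshold=3)
high_sens, high_spec = sensitivity_specificity(scores, labels, threshold=5)
```

In this toy example the lower threshold captures every true case at the cost of one false positive, whereas the higher threshold removes the false positive but misses one true case.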
Investigators may choose to apply these scoring systems separately or together based on their needs. For example, whether for research or clinical work, if high sensitivity is desired and at least some LN codes are available, investigators may choose to use LN-Code first to identify patients with LN; this may then be followed by the application of LN-No Code to identify any additional patients who do not have codes for LN to increase sensitivity. In another use case, an investigator may be interested in applying a system with very high specificity; in this case, using LN-Code alone may be preferred.
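One possible way to express the sequential use case is sketched below in Python. The field names, thresholds and patient records are hypothetical and are shown only to make the order of operations explicit; they are not the published scoring rules.

```python
def identify_ln(patients, code_threshold, no_code_threshold):
    """Apply LN-Code first, then LN-No Code to patients without LN codes.

    Each patient is a dict with hypothetical keys: "id", "n_ln_codes",
    "ln_code_score" and "ln_no_code_score". Returns a dict mapping
    patient id to the scoring system that flagged the patient.
    """
    flagged = {}
    for p in patients:
        if p["ln_code_score"] >= code_threshold:
            flagged[p["id"]] = "LN-Code"
        elif p["n_ln_codes"] == 0 and p["ln_no_code_score"] >= no_code_threshold:
            # Second pass: recover patients who have no LN codes at all
            flagged[p["id"]] = "LN-No Code"
    return flagged


# Hypothetical patients and thresholds, NOT study data
patients = [
    {"id": "A", "n_ln_codes": 3, "ln_code_score": 9, "ln_no_code_score": 4},
    {"id": "B", "n_ln_codes": 0, "ln_code_score": 2, "ln_no_code_score": 7},
    {"id": "C", "n_ln_codes": 0, "ln_code_score": 1, "ln_no_code_score": 2},
]
both = identify_ln(patients, code_threshold=6, no_code_threshold=5)
# High-specificity use case: disable the second pass entirely
code_only = identify_ln(patients, code_threshold=6, no_code_threshold=float("inf"))
```

Recording which system flagged each patient keeps the two passes auditable, which may matter when the downstream task has different tolerances for false positives.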
As evident from the receiver operating characteristic curves, differences in performance between the two scoring systems were more subtle in the ZSFG health system (our external validation set) than at UCSF (the test set). We observed greater improvements in sensitivity with both scoring systems (compared with LN diagnosis codes alone) at UCSF than at ZSFG. This may be explained by a higher sensitivity of LN diagnosis codes at ZSFG, which is an integrated public health system, leaving fewer unidentified true positives. Another reason may be differences between the two health systems in EHR workflows and in the documentation of laboratory results, diagnoses and procedure codes. In addition, patient demographics differed significantly across the two health systems.
We foresee many clinical applications of these scoring systems for identifying individuals with LN. LN is associated with significant morbidity and accounts for 2% of all end-stage renal disease in the USA. While there has been some progress in treating LN, with two new drug approvals in 2020–2021, there are few comparative effectiveness studies using real-world data. Tools for disease surveillance of LN are also lacking. Application of the scoring systems presented here could facilitate such studies. The scoring systems could also help identify eligible patients for clinical trials requiring identification of prevalent disease. Moreover, in clinical settings, individuals with LN may be lost to follow-up or may not be under appropriate specialty care. Using algorithms such as those presented here for population health management and quality improvement may help ensure that all patients receive timely and appropriate care.
This study employed a rigorous methodology involving chart reviews conducted by rheumatologists, machine learning techniques to optimise predictive performance and evaluation of a comprehensive range of metrics to validate the scoring systems. We also acknowledge the limitations of our study. First, our research was conducted using data from two academic health systems in the San Francisco Bay Area, potentially limiting generalisability. Future studies should validate the scoring systems in community healthcare settings. Second, our scoring systems rely on structured EHR data and therefore may not fully capture the complexity of LN that is documented in clinical notes. Incorporating unstructured clinical notes and imaging data could further improve accuracy, although many large datasets lack unstructured data and, even when such data are available, health systems and clinics may lack the infrastructure or resources to analyse clinical notes.
In conclusion, our study corroborates previous studies demonstrating that LN diagnosis codes are underused in EHRs and presents two novel scoring systems to enhance LN identification. These systems, tailored to different data availability scenarios, offer practical solutions for healthcare providers and researchers. Improving the accuracy of LN identification has the potential to improve patient care, inform research and support public health initiatives in the field of LN. Further research and validation are warranted to ensure the robustness and applicability of our scoring systems across diverse healthcare settings.