Abstract
Background Accurate identification of prevalent cases of lupus nephritis (LN) is essential for timely patient monitoring and treatment, advancing research, and informing public health initiatives for the management of LN. However, diagnosis codes for LN are generally underutilized, which makes it challenging to identify this patient population in real-world databases. We developed a scoring system to quantify the probability of accurate LN case identification using structured data from electronic health records (EHRs).
Methods We used data from the EHRs of two large health systems and included patients with ≥1 ICD-9/10 code for systemic lupus erythematosus (SLE) from June 2012 to January 2022. Prevalent LN was defined as current active LN or a history of LN. In a training set consisting of a balanced sample of 2038 patients from the larger health system, we used regular expressions with negation to loosely tag LN within EHR notes. Testing sets comprised 100 patients randomly selected from each health system, whose charts were manually reviewed to classify them as having ‘no LN’, ‘definite LN’ (biopsy report of Class III, IV or V LN), ‘potential LN’ (no biopsy report but physician-diagnosed LN), or ‘diagnostic uncertainty’ (physician states LN is possible). A gradient boosting model (GBM) with 42 predictors covering demographics, encounters, diagnosis and procedure codes, comorbidities, medications, and laboratory test results (e.g., serologies, urine studies, chemistries) was used for predictor selection. The predictive performance of a logistic regression model (LRM) including key predictors from the GBM was evaluated for identifying patients with a ‘strict’ (definite LN) or an ‘inclusive’ (definite LN, potential LN, or diagnostic uncertainty) definition of LN. An LRM-based scoring system was developed and calibrated.
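As an illustration of what tagging with "regular expressions with negation" can look like, the following is a minimal Python sketch. The patterns, the negation cue list, and the function name `has_affirmed_ln_mention` are hypothetical; the abstract does not report the expressions actually used, so this should be read as one possible approach under those assumptions rather than the study's method.

```python
import re

# Hypothetical patterns for illustration only; the abstract does not
# list the actual expressions used in the study.
LN_PATTERN = re.compile(r"lupus\s+nephritis", re.IGNORECASE)
NEGATION_CUES = re.compile(
    r"\b(?:no|not|denies|without|negative\s+for|ruled\s+out)\b",
    re.IGNORECASE,
)


def has_affirmed_ln_mention(note: str) -> bool:
    """Return True if the note contains at least one LN mention that is
    not preceded by a negation cue within the same sentence."""
    for match in LN_PATTERN.finditer(note):
        preceding = note[: match.start()]
        # Restrict the negation search to the current sentence or line.
        sentence_start = max(
            preceding.rfind("."), preceding.rfind(";"), preceding.rfind("\n")
        ) + 1
        window = preceding[sentence_start:]
        if not NEGATION_CUES.search(window):
            return True
    return False
```

A rule this simple produces loose labels only: it ignores negation that follows the mention and hedged language such as "possible", which is consistent with using such tags for training and reserving manual chart review for the testing sets.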
Results Table 1 presents the demographics of the 4,522 patients from both health systems who met the eligibility criteria. In addition to more specific diagnosis codes for LN, the key predictors identified by the GBM were the presence of diagnosis codes for acute or chronic kidney disease or proteinuria, younger age at first SLE diagnosis code, and use of mycophenolate mofetil or mycophenolic acid. A urine protein-creatinine ratio (UPCR) >0.5, abnormal complement component 3 (C3) levels, any use of hydroxychloroquine, azathioprine, or rituximab, and glucocorticoid dose were also identified as important predictors but were omitted from the final LRM because their inclusion did not further improve performance.
The final LRM had an area under the curve (AUC), sensitivity, and positive predictive value (PPV) of 0.93, 0.88, and 0.84, respectively, for identifying LN using the inclusive definition. It performed similarly with the strict LN definition and had good external validity when tested in the second health system (table 2). Predicted and observed probabilities were well calibrated (table 2). The scoring system was derived from this model (table 3).
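For readers less familiar with the three reported metrics, the sketch below shows how AUC, sensitivity, and PPV can be computed from reference labels and model outputs using only the Python standard library. The function names and the 0/1 label encoding are assumptions made for illustration; the abstract does not state the classification threshold or the software used.

```python
def sensitivity_and_ppv(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP).
    Labels are assumed to be coded 1 (LN) and 0 (no LN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


def auc(y_true, scores):
    """Area under the ROC curve as the probability that a randomly chosen
    positive case scores higher than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Sensitivity and PPV depend on the probability threshold chosen to call a patient LN-positive, whereas AUC summarizes ranking performance across all thresholds. PPV also depends on LN prevalence in the sample, so it may not carry over unchanged to a population with a different case mix.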
Conclusions Prediction of prevalent LN using data elements available in EHR or claims data was feasible, had good accuracy, and was validated externally. With further validation, the scoring system has the potential to identify prevalent LN accurately across health systems, addressing the current challenge of LN case identification using ICD-10 codes.
Disclaimer: Aurinia Pharmaceuticals provided an unrestricted grant for this work.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.