Which outcome measures in SLE clinical trials best reflect medical judgment?
Aikaterini Thanou1, Eliza Chakravarty1, Judith A James1,2 and Joan T Merrill1

1 Oklahoma Medical Research Foundation, Oklahoma City, Oklahoma, USA
2 University of Oklahoma Health Sciences Center, Oklahoma City, Oklahoma, USA

Correspondence to Dr Aikaterini Thanou; aikaterini-thanou{at}omrf.org

Abstract

Objectives To compare two measures of systemic lupus erythematosus (SLE) response: the British Isles Lupus Assessment Group (BILAG)-based Composite Lupus Assessment (BICLA) and the Systemic Lupus Responder Index (SRI) against a clinician's assessment of improvement.

Methods Ninety-one lupus patients were identified with two visits at which Systemic Lupus Erythematosus Disease Activity Index (SLEDAI) and BILAG had been scored and with active disease (SLEDAI≥6) at the first visit. A physician rated the disease activity at the second visit as clinically significant improvement, no change or worsening. SRI and BICLA were scored both with and without the medication criteria often used in trials to restrict response definitions.

Results 68 patients were considered improved, 17 same and 6 worse at follow-up. SRI versus BICLA, performed without considering medication changes, captured physician-rated improvement with 85% vs 76% sensitivity and 74% vs 78% specificity. With medication limits both instruments had 37% sensitivity and 96% specificity for physician-assessed improvement. Seven patients considered improved by the clinician met the BICLA but not the SRI definition of improvement by failing to achieve a four-point improvement in SLEDAI. 13 clinician-rated responders met SRI but not BICLA by improving in less than all organs.

Conclusions Shortfalls of SRI and BICLA may be due to BICLA only requiring partial improvement but in all organs versus SRI requiring full improvement in some manifestation(s) and not all organs. SRI and BICLA with medication restrictions are less likely to denote response when the physician disagrees and could provide stringent proof of efficacy in appropriately powered clinical trials.

  • Systemic Lupus Erythematosus
  • Outcomes research
  • Disease Activity
  • Treatment
  • Autoimmunity

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/3.0/

Key messages

  • Shortfalls of SRI and BICLA may be due to BICLA requiring only partial improvement but in all organs versus SRI requiring full improvement in some manifestations and not necessarily in all organs.

  • BICLA may be less sensitive than SRI in determining clinically significant improvement, particularly in SLE patients with multiple organs involved at baseline.

  • Each instrument could likely be an optimal primary endpoint depending on the population under study and the design of a given clinical trial.

Introduction

Despite many promising lupus treatments reaching early-phase human studies over the past several decades, none except belimumab has yet demonstrated efficacy in phase III trials. Attempts to explain the disappointing results of most clinical trials in lupus have pointed to pitfalls in the application and interpretation of clinical endpoints.1 According to recommendations by the Food and Drug Administration, European League Against Rheumatism and Outcome Measures in Rheumatology, the ideal endpoint should be able to detect both improvement and worsening in different manifestations and discern acute lupus disease activity from chronic damage and changes related to other causes.2 Detection of both improvement and worsening is not possible using a global measure of disease activity such as the Systemic Lupus Erythematosus Disease Activity Index (SLEDAI), and it is not a trivial undertaking even with an organ-based assessment such as the British Isles Lupus Assessment Group (BILAG) since different events within one organ may not be differentiated even with that complex instrument. This led to proposals of composite endpoints to detect improvement without worsening, two of which have already been widely incorporated in clinical trials.3–5

The SLE Responder Index (SRI) was derived in part by post hoc analysis of data from a failed phase II study of belimumab.2 The SRI was subsequently used as the primary outcome measure in two successful phase III trials (BLISS-52 and BLISS-76), leading to belimumab approval by regulatory agencies.3,4 The SRI consists of scores derived from the SLEDAI and the BILAG Index: (1) ≥4-point reduction in SLEDAI global score, (2) no new severe disease activity (BILAG A organ score) or more than one new moderate organ score (BILAG B) and (3) no deterioration from baseline in the physician's global assessment (worsening limited to ≤10% of the scale). Although a requirement for no change in treatment was not formally included in the SRI definition, initiation of off-protocol medications defined non-response in the BLISS studies. Based on the success of the BLISS programme, SRI is currently being used in a number of clinical trials for SLE. Nonetheless, the treatment effect achieved by adding belimumab to standard of care using the original SRI (four-point drop in SLEDAI) was at best modest, and it remains unclear whether this is the optimal discriminatory endpoint.
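The three clinical SRI components listed above can be sketched in code. This is only an illustration under stated assumptions, not the scoring procedure used in the BLISS trials: the `Visit` record, the function names, the 0–3 length of the PGA scale, and the rule for what counts as a "new" BILAG B (a B in a domain previously graded C, D or E) are all introduced here for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# BILAG grades ordered from most to least active (E = system never involved).
GRADE_RANK = {"A": 4, "B": 3, "C": 2, "D": 1, "E": 0}


@dataclass
class Visit:
    """Hypothetical record of one clinic visit."""
    sledai: int              # total SLEDAI score
    pga: float               # physician's global assessment
    bilag: Dict[str, str]    # organ domain -> BILAG grade


def new_bilag_scores(baseline: Visit, followup: Visit) -> Tuple[int, int]:
    """Count new A and new B scores at follow-up (assumed definition)."""
    new_a = new_b = 0
    for domain, grade in followup.bilag.items():
        before = baseline.bilag.get(domain, "D")  # assumption: missing = inactive
        if grade == "A" and before != "A":
            new_a += 1
        elif grade == "B" and GRADE_RANK[before] < GRADE_RANK["B"]:
            new_b += 1
    return new_a, new_b


def sri_improved(baseline: Visit, followup: Visit,
                 sledai_drop: int = 4, pga_scale: float = 3.0) -> bool:
    """Clinical SRI criteria only; medication rules are not applied here."""
    # (1) SLEDAI falls by at least `sledai_drop` points (4 for the original SRI).
    sledai_ok = baseline.sledai - followup.sledai >= sledai_drop
    # (2) No new BILAG A and no more than one new BILAG B.
    new_a, new_b = new_bilag_scores(baseline, followup)
    bilag_ok = new_a == 0 and new_b <= 1
    # (3) PGA worsens by no more than 10% of the scale (scale length assumed).
    pga_ok = followup.pga - baseline.pga <= 0.10 * pga_scale
    return sledai_ok and bilag_ok and pga_ok
```

Passing `sledai_drop=3` or `sledai_drop=5` gives the SRI-3 and SRI-5 variants under the same assumptions.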

The BILAG-Based Composite Lupus Assessment (BICLA) was derived by expert consensus6 and employed as the primary endpoint in the EMBLEM trial of epratuzumab in lupus, where it appeared to discriminate well between standard of care and epratuzumab added to standard of care, at least in some doses tested.5 BICLA response is defined as (1) at least one gradation of improvement in baseline BILAG scores in all body systems with moderate or severe disease activity at entry (eg, all A (severe disease) scores falling to B (moderate), C (mild), or D (no activity) and all B scores falling to C or D); (2) no new BILAG A or more than one new BILAG B scores; (3) no worsening of total SLEDAI score from baseline; (4) ≤10% deterioration in physician's global assessment and (5) no initiation of non-protocol treatment. It should be noted that the EMBLEM protocol was more restrictive than the BLISS protocols in the increases allowed in background medications at baseline.
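The five BICLA criteria above can be written as a single check. The following is a rough sketch rather than the EMBLEM scoring algorithm: the function signature, the 0–3 PGA scale length, the treatment of a domain missing from a visit as grade D, and the definition of a "new" B score are assumptions made for illustration.

```python
from typing import Dict

# BILAG grades ordered from most to least active (E = system never involved).
RANK = {"A": 4, "B": 3, "C": 2, "D": 1, "E": 0}


def bicla_improved(base_bilag: Dict[str, str], fu_bilag: Dict[str, str],
                   base_sledai: int, fu_sledai: int,
                   base_pga: float, fu_pga: float,
                   new_treatment: bool = False,
                   use_medication_rule: bool = True,
                   pga_scale: float = 3.0) -> bool:
    # (1) Every domain graded A or B at baseline improves by >= one grade.
    #     Note: this is vacuously true if no domain was A or B at baseline.
    active = [d for d, g in base_bilag.items() if g in ("A", "B")]
    all_improved = all(
        RANK[fu_bilag.get(d, "D")] < RANK[base_bilag[d]] for d in active
    )
    # (2) No new A score and at most one new B score (assumed definitions).
    new_a = sum(1 for d, g in fu_bilag.items()
                if g == "A" and base_bilag.get(d, "D") != "A")
    new_b = sum(1 for d, g in fu_bilag.items()
                if g == "B" and RANK[base_bilag.get(d, "D")] < RANK["B"])
    no_flare = new_a == 0 and new_b <= 1
    # (3) Total SLEDAI does not worsen.
    sledai_ok = fu_sledai <= base_sledai
    # (4) PGA worsens by no more than 10% of the scale (scale length assumed).
    pga_ok = fu_pga - base_pga <= 0.10 * pga_scale
    # (5) No non-protocol treatment started; can be switched off to assess
    #     clinical improvement alone.
    treatment_ok = not (use_medication_rule and new_treatment)
    return all_improved and no_flare and sledai_ok and pga_ok and treatment_ok
```

Setting `use_medication_rule=False` corresponds to scoring the clinical components in isolation from the medication restriction.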

The BICLA and SRI might be compared in two possible ways. The clinical components could be studied in isolation from the medication restrictions in order to simply determine presence or absence of clinical improvement. Adding medication restrictions (as has been done in clinical trials) allows an assessment of ‘response’ to an intervention given at baseline without clouding that assessment by the impact of other medications that might have been added when patients were not responding to that treatment. Direct comparison of BICLA and SRI has been addressed by only a few studies to date,7,8 and the impact of the requirement for no medication changes on the simple determination of improvement has not yet been examined for either instrument. In this real-world exercise, the BICLA and SRI definitions were compared with physician's clinical assessment of change in disease state with and without medication restriction rules to distinguish between improvement and response.

Materials and methods

This study was performed using data from the Oklahoma Lupus Cohort study. All patients underwent informed consent procedures and Health Insurance Portability and Accountability Act disclosures in compliance with good clinical practice.

Individuals were identified who met the 1997 modified American College of Rheumatology classification criteria for SLE,9 had two cohort study encounters at which the BILAG, SLEDAI and physician's global assessment (PGA) had been scored and had a SLEDAI≥6 at the earlier (baseline) visit. In a minor deviation from either the SELENA-SLEDAI (used in the SRI) or the SLEDAI-2K (used in the BICLA), a Hybrid SLEDAI is used for this cohort (identical to the SELENA-SLEDAI except for the scoring of proteinuria, which uses the SLEDAI-2K definition). During a quality check of the data, where discrepancies were found between SLEDAI and BILAG scores and the original clinic note, these were corrected based on the original source documentation. The following data were determined retrospectively based on the medical record: some PGA values that had not been scored at the time of the clinic encounters, and an overall physician impression of clinical change at the later encounter (follow-up visit) as compared with the first visit, determined by review of clinic notes and laboratory values but without consideration of what treatments had been given in the interim. This was categorised as clinically significant improvement (physician-rated improvement (PRI)), deterioration or no change. Medication changes between baseline and follow-up and at the follow-up visit were recorded, and the SRI-4 (SRI) and BICLA composite responses were calculated, both with and without consideration of the medication restrictions that characterised the major clinical trials. SRI-3 and SRI-5 were computed similarly to SRI-4, except for a minimum three-point or five-point improvement in SLEDAI being required, respectively.2 When medication criteria were applied, improvement criteria were not considered to be met at any follow-up visit where lupus medication changes were made after the initial changes that were instituted at baseline, with the exception of NSAIDs or topical agents.

The performance of BICLA and SRI was compared against clinician-rated improvement. Potential causes of discrepancies in each instrument were explored, including the role of each component endpoint.

Statistical analysis

Descriptive statistics (mean, SD) were used to describe measures of disease activity (PGA, SLEDAI, global BILAG and number of BILAG domains involved) at baseline and follow-up. Comparison of disease activity measures between baseline and follow-up was performed by paired t test. Disease activity measures between PRI responders who did or did not meet the BICLA or SRI endpoints were compared by Mann–Whitney rank-sum test. Spearman's rank test was used to correlate PRI with SRI and BICLA responses. The number of BILAG domains with persistent A and/or B scores at follow-up among PRI responders stratified by the number of BILAG A and/or B scores at baseline was compared by Fisher's exact test. SigmaPlot V.12.5 (Systat Software, Inc) was used for all statistical analyses.

Results

Demographic characteristics

In total, 91 patients eligible for the analysis were identified, including 86 women and 5 men with SLE. The mean age at the time of the baseline visit was 41 years (SD 11.8), ranging from 21 to 68. Also, 46 subjects were Caucasian, 22 African-American, 14 Native American, 5 Caucasian/Hispanic and 4 Asian. The patients in this analysis were all assessed between June 2009 and April 2012 with a mean interval of 6.24 months between baseline and follow-up visits (range 1–25 months).

The PRI was associated with improvements in disease activity scores between baseline and follow-up

In total, 68 patients improved by PRI, 17 remained the same and 6 deteriorated. Disease activity in patients who did or did not improve by PRI was compared using the PGA, SLEDAI, global BILAG scores and the number of BILAG organ domains involved (table 1). A statistically significant improvement between baseline and follow-up in all disease activity instruments was found in patients improving by PRI, but no differences in those indices were observed in the group not deemed by the physician to be improved. This supports previous literature that outcome measures discriminate clinically relevant differences.10,11

Table 1

Disease activity at baseline and follow-up among patients clinically improving (n=68) or not (n=23)

Sensitivity of SRI compared with BICLA in detecting PRI

Of 68 patients determined to be improving by the physician (PRI), 58 met the clinical criteria for SRI and 52 met the BICLA endpoint, with sensitivity of agreement with PRI 85% and 76%, respectively (table 2). This evaluation was for improvement not response and did not include the restriction that change of medications over-rules ‘response’ to the baseline treatment. Of the 23 patients not improving by PRI, 17 did not meet the SRI clinical criteria and 18 did not improve by the BICLA; therefore, specificity of SRI and BICLA for PRI was 74% and 78%. Sensitivities and specificities were similarly computed for SRI-3 and SRI-5. Spearman's rank correlations to PRI were SRI-3 0.605, SRI-4 0.563, SRI-5 0.541 and BICLA 0.492 (all p values<0.000001).
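The sensitivity and specificity figures in this paragraph follow directly from the counts given in the text, with PRI treated as the reference standard. The short sketch below reproduces that arithmetic; the function and variable names are illustrative only.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of PRI-improved patients that the instrument also calls improved."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of PRI-not-improved patients that the instrument also calls not improved."""
    return true_neg / (true_neg + false_pos)


# Counts taken from the text: 68 patients improved by PRI, 23 did not.
counts = {
    "SRI":   {"tp": 58, "fn": 68 - 58, "tn": 17, "fp": 23 - 17},
    "BICLA": {"tp": 52, "fn": 68 - 52, "tn": 18, "fp": 23 - 18},
}

for name, c in counts.items():
    sens = sensitivity(c["tp"], c["fn"])
    spec = specificity(c["tn"], c["fp"])
    print(f"{name}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

The false-negative and false-positive counts here (10 and 6 for SRI, 16 and 5 for BICLA) match the discordant groups examined in the following paragraphs.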

Table 2

Sensitivity and specificity of SRI and BICLA compared with PRI

Of the 10 out of 68 patients improving by PRI but not by SRI, 7 did meet the BICLA improving criteria. These seven SRI-negative/BICLA-positive discordant cases failed the SRI because of less than four-point improvement on SLEDAI, which scores mild, moderate or severe arthritis four points and mild, moderate or severe rash two points (six individuals had a two-point improvement and one had a three-point improvement; therefore, all did have complete resolution of at least one feature, but the scores given to these particular features were not high enough to define improvement by SRI). Among 16 patients who improved by PRI but not BICLA, 13 were defined as improving by SRI. These 13 SRI-positive/BICLA-negative discordant cases failed BICLA because of one or more organ systems without even partial improvement, despite complete resolution of at least one other feature counting four points on the SLEDAI. Persistent B (moderate) scores in the musculoskeletal and mucocutaneous domains were present in six patients each, and there was one persistent B in each of the following domains: gastrointestinal (lupoid hepatitis), cardiorespiratory (pleurisy) and renal (proteinuria). One patient had a persistent A (severe) score in the cardiorespiratory domain (interstitial lung disease).
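The SLEDAI arithmetic behind these SRI failures can be made explicit. The sketch below uses only the two item weights quoted above (arthritis four points, rash two points) and the SLEDAI drops reported for the seven discordant cases; it checks the SLEDAI component alone, not the BILAG or PGA criteria, and the names are hypothetical.

```python
# Only the two SLEDAI item weights quoted in the text are included.
SLEDAI_POINTS = {"arthritis": 4, "rash": 2}


def sledai_drop(resolved_features):
    """Points removed from the SLEDAI total when the listed features resolve."""
    return sum(SLEDAI_POINTS[f] for f in resolved_features)


def meets_sledai_threshold(drop, threshold=4):
    """SLEDAI component of SRI-n: a reduction of at least `threshold` points."""
    return drop >= threshold


# Resolution of rash alone is complete resolution of a feature,
# yet it removes only two points.
rash_only = sledai_drop(["rash"])
arthritis_only = sledai_drop(["arthritis"])

# SLEDAI improvements reported for the seven SRI-negative/BICLA-positive cases.
reported_drops = [2, 2, 2, 2, 2, 2, 3]
n_meeting_sri4 = sum(meets_sledai_threshold(d, 4) for d in reported_drops)
n_meeting_sri3 = sum(meets_sledai_threshold(d, 3) for d in reported_drops)
```

Under these inputs none of the seven cases reaches the four-point threshold, and only the patient with a three-point improvement would satisfy the SLEDAI component of SRI-3.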

Six of the twenty-three patients who did not improve by PRI met the SRI criteria, three of whom did not improve by BICLA. These three SRI/BICLA discordant cases achieved four-point reduction in SLEDAI due to resolving mild manifestations (rash, alopecia, mucosal ulcers) despite other ongoing (but not worsening) more severe organ involvement. Five PRI failures met the BICLA, two of which did not meet the SRI. One of these two patients developed new BILAG B arthritis, and one new B score is allowed in the BICLA definition of improvement. Another had only moderate improvement in arthritis (musculoskeletal BILAG A score (severe arthritis) decreased to a high B (significant moderate arthritis)), but the patients were globally judged as overall clinically unchanged.

Those considered improved by the physician but failed the BICLA had more active organs at baseline

Since patients improving by PRI often failed BICLA (false negative) because improvement did not occur in all domains, we hypothesised this may be more common if there is a greater number of organs active at baseline. Therefore, the total number of BILAG domains involved in BICLA improvement versus non-improvement was assessed in those patients meeting PRI (table 3).

Table 3

Comparison of disease activity indices between PRI responders that met (n=52) or not (n=16) the BICLA endpoint

BICLA responders tended to have fewer domains involved at baseline compared with non-responders, an observation that was more pronounced when only organs with baseline A (severe disease) and B (moderate disease) were counted. This supports the hypothesis that the more BILAG organs are significantly involved at baseline, the more likely it is to have persistent disease in one or more organs at follow-up despite significant improvement in some aspects of the disease and an overall clinical impression of improvement. Among patients improving by PRI, 33.3% of those with two or more BILAG A or B scores at baseline had persistent A or B scores at follow-up (BICLA failure) compared with 10.3% in those with one or fewer BILAG A or B scores at baseline (Fisher's exact test, p=0.0924). This suggests that the BICLA might be less sensitive at detecting improvement when more organs are involved.

PRI responders meeting the BICLA criteria also had less disease activity at follow-up compared with those not meeting the BICLA, evident by the PGA, SLEDAI and global BILAG scores, as well as the number of BILAG organs involved (p<0.02 for all comparisons), indicating the possibility that the BICLA may provide some additional meaningful discrimination beyond what is captured by PRI or be considered a higher bar for determining improvement.

Disease activity indices at baseline and follow-up were not as informative when comparing those improved by PRI who did or did not meet the SRI criteria (table 4). With the exception of PGA, there was no difference in baseline disease activity measures in those improving by SRI versus failures.

Table 4

Comparison of disease activity indices between PRI responders that met (n=58) or not (n=10) the SRI endpoint

Response differs from improvement: the addition of medication criteria dramatically decreased the number of patients meeting the BICLA and SRI response definitions

Clinical trial endpoints using the SRI and BICLA constructs to determine response to a treatment initiated at baseline have used restrictions in rescue medications in the definition of response (underscoring a distinction between the concepts of improvement and response). A requirement for no off-protocol medication changes was included in the BICLA response definition, used in the EMBLEM trial.5 In the BLISS trials of belimumab in SLE where the SRI was used,3,4 medication increases were restricted during the latter months, and protocol deviations defined non-response, accomplishing a similar end, although medication restrictions were less restrictive in those protocols than in the EMBLEM study. To examine the impact of medication restrictions on how BICLA and SRI compare to physician's determination of improvement, we performed the SRI and BICLA determination in this real-world sample of patients (no protocolised restriction on treatment). All pharmacological changes in treatment after baseline interventions and prior to and/or at follow-up visit, except for NSAIDs or topical agents, excluded the designation of improvement by BICLA or SRI in this analysis. As expected, medication restrictions dramatically decreased the number of patients meeting BICLA and SRI response definitions. Surprisingly, when medications were factored in as ‘response’ criteria, this increased specificity for PRI (decreased the likelihood that BICLA and SRI will detect improvement when the physician did not agree), even though the physician was not considering the treatment changes in the PRI assessment (table 5).
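The medication rule applied in this analysis can be illustrated as a filter layered on top of either clinical definition. This is a sketch under assumptions: the record layout, the `drug_class` labels and the function names are hypothetical, and only the exemption for NSAIDs and topical agents comes from the text.

```python
# Drug classes whose changes do not over-rule improvement (per the text).
# The lower-case labels themselves are an assumed convention.
EXEMPT_CLASSES = {"nsaid", "topical"}


def disqualifying_changes(med_changes):
    """Medication changes made after the baseline intervention that are not exempt."""
    return [m for m in med_changes
            if m["drug_class"].lower() not in EXEMPT_CLASSES]


def response_with_medication_rule(clinically_improved, med_changes):
    """'Response' = clinical improvement with no disqualifying medication change."""
    return bool(clinically_improved) and not disqualifying_changes(med_changes)


# Hypothetical patient who improved clinically but had a steroid added.
changes = [
    {"drug": "ibuprofen", "drug_class": "NSAID"},
    {"drug": "prednisone", "drug_class": "corticosteroid"},
]
improved_but_not_responder = not response_with_medication_rule(True, changes)
```

Because the rule can only turn an "improved" result into a "non-response", it cannot raise sensitivity and cannot lower specificity against PRI, which is consistent with the direction of the changes reported here.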

Table 5

Sensitivity and specificity of SRI and BICLA with medication criteria

Discussion

The performance of two composite indices of lupus disease improvement and/or response was compared using as reference a clinician's global rating of improvement. Although the PGA, when performed in a paper patient exercise, has been demonstrated to have poor agreement between different physicians,12 a retrospective overall determination of whether or not a patient has a clinically meaningful improvement might be a valuable tool when determined in a consistent manner by one qualified assessor. This was supported in the current exercise, where changes in disease activity (including SLEDAI and BILAG performed prospectively at the time of the visits) were consistent with the global clinician ratings. The retrospective assessment of clinical change between visits may limit the validity of our observations and warrants prospective confirmation. Although the visits used to determine improvement or response were separated by various timepoints in the current study, this did not affect the global clinical assessment of change between visits and is consistent with the manner in which SRI and BICLA are used in clinical trials as landmark assessments performed at different timepoints compared with a baseline visit.

The assessment of response to an intervention is not the same as a measurement of clinical improvement (which may occur without response to a given intervention if rescue treatments have been given). A comparison of BICLA and SRI without using the medication criteria for ‘response’ to the baseline treatment was first performed to determine their utility in simply defining clinical improvement. In this assessment, BICLA was less sensitive than SRI at capturing physician-determined improvement. When only those patients ranked as improved by the physician were evaluated, BICLA but not SRI seemed to define patients with greater improvement in disease activity, suggesting the possibility that it could be more discriminatory in some settings. Since more organs involved at baseline decreased the sensitivity of the BICLA, it can be hypothesised that this instrument might be particularly useful in patients with less widespread organ involvement, including those patients considered suitable for clinical trials in which less background medication is allowed. In fact, in the EMBLEM trial of epratuzumab in lupus, which limited background medication changes more restrictively than the belimumab trials, the BICLA response (including the medication criteria) at 12 weeks effectively discriminated the 2400 mg/month cumulative epratuzumab dosage groups from placebo with placebo responses lower than in the BLISS trials.5

A comparison of two outcome measures used in different trials does not account for potential differences in patient populations, background treatments or efficacy of the test articles. However, it is worth observing that the apparent improved discriminatory capacity in the EMBLEM trial (using the BICLA) did not appear to be due to increased efficacy rates in the treatment group but to lower rates in the placebo (standard of care) group,7 which does not suggest increased sensitivity in detection of improvement, but might either reflect a more generally ill population, fewer background treatments or, indeed, the possibility of increased specificity of the improvement measurement. In post hoc analysis of this same trial dataset using the SRI,7 SRI rates were higher than BICLA rates in all arms, including the placebo group, losing discriminatory capacity with a loss of significant differences between treatment and placebo. Disagreement between BICLA and SRI in the EMBLEM trial may, however, have been driven by the baseline distribution of items with high SLEDAI weights that tended to improve at follow-up, with the greatest difference in groups with activity scored as eight-point SLEDAI items (vasculitis, lupus headache) at baseline. Scoring SLEDAI descriptors for mild/moderate cutaneous vasculitis and headache risks SLEDAI scores that are high in relation to the degree of illness of the patients. SRI discriminatory performance might improve in a protocol restricting the scoring of these features. In the current analysis, no lupus headaches were scored (consistent with the evaluation here that they are very rare), and although four individuals had cutaneous vasculitis at baseline, this resolved in only one at follow-up, thereby not accounting for discrepancies in SRI and BICLA performance found here.

Some PRI responders failed the SRI, providing potential insight into limitations of the SRI. Patients rated as improving by the physician who failed to meet the SRI improvement criteria usually did so because of less than four-point improvement in SLEDAI, which requires not only improvement but complete resolution of at least one manifestation. As expected, this limitation was less evident for SRI-3 at the expense of lower specificity for the detection of PRI, whereas the opposite was the case for the SRI-5. Our results are consistent with the post hoc sensitivity analysis of data from the BLISS-76 trial, where modifications of the SRI using higher thresholds for SELENA-SLEDAI improvement (≥5-point to ≥10-point reductions) increased differentiation of belimumab from placebo at 52 and 76 weeks but occurred less frequently.4

Conclusions

In an assessment using a physician's global rating of clinically significant improvement, the BICLA may be less sensitive than the SRI, particularly in determining improvement in patients with SLE with multiple organs involved at baseline. On the other hand, the BICLA may be more discriminatory than the SRI in selecting those patients with a greater change in disease activity. When SRI and BICLA were discrepant, this was usually due to the BICLA requiring only partial improvement but in all organs versus the SRI requiring full improvement and not necessarily in all organs. Each instrument could likely be an optimal primary endpoint depending on the population under study and the design of a given clinical trial.

References

Footnotes

  • Contributors All authors fulfil the criteria of authorship and no one else who fulfils these criteria has been excluded from the list of authors.

  • Competing interests None.

  • Ethics approval Oklahoma Medical Research Foundation IRB.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement No additional data are available.
