Original research
Fragility of randomised controlled trials for systemic lupus erythematosus and lupus nephritis therapies
  1. Gabriel Figueroa-Parra1,
  2. Michael S Putman2,
  3. Cynthia S Crowson1,3 and
  4. Alí Duarte-García1,4
  1. 1Division of Rheumatology, Mayo Clinic, Rochester, Minnesota, USA
  2. 2Division of Rheumatology, Medical College of Wisconsin, Milwaukee, Wisconsin, USA
  3. 3Department of Quantitative Health Sciences, Mayo Clinic, Rochester, Minnesota, USA
  4. 4Robert D and Patricia E Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, Minnesota, USA
  1. Correspondence to Dr Alí Duarte-García; duarte.ali{at}mayo.edu

Abstract

Objective We aimed to evaluate the robustness of phase III randomised controlled trials (RCTs) for SLE and lupus nephritis (LN) using the fragility index (FI), the reverse FI (RFI) and the fragility quotient (FQ).

Methods We searched for phase III RCTs that included patients with active SLE or LN. Data on primary endpoints, total participants and the number of events for each arm were obtained. We calculated the FI score for RCTs with statistically significant results (number of patients required to change from event to non-event to make the study lose statistical significance), the RFI for RCTs without statistically significant results (number of patients required to change from non-event to event to make study gain statistical significance) and the FQ score for both (FI or RFI score divided by the sample size).

Results We evaluated 20 RCTs (16 SLE, four LN). The mean FI/RFI score was 13.6 (SD 6.6). There were nine RCTs with statistically significant results (seven SLE, two LN), and the mean FI score was 10.2 (SD 6.2). The lowest FI was for the ILLUMINATE-2 trial (FI=2), and the highest FI was for the BLISS-52 trial (FI=17).

Twelve studies had non-statistically significant results (10 SLE, two LN) with a mean RFI score of 15.6 (SD 6.1). The lowest RFI was for the ILLUMINATE-1 trial (RFI=4), and the highest RFI was for the TULIP-1 trial (RFI=27). The lowest FQ scores were found in the ILLUMINATE trials and the highest in the rituximab trials (EXPLORER and LUNAR), meaning that the latter had the most robust results after accounting for sample size.

Conclusions The evidence of therapies for patients with SLE and LN is derived mostly from fragile RCTs. Clinicians and trialists must be aware of the fragility of these RCTs for clinical decision-making and designing trials for novel therapeutics.

  • Systemic Lupus Erythematosus
  • Lupus Nephritis
  • Clinical Trial
  • Therapeutics

Data availability statement

All data relevant to the study are included in the article or uploaded as supplementary information.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

WHAT IS ALREADY KNOWN ON THIS TOPIC

  • Randomised controlled trials (RCTs) in SLE for novel therapeutics have frequently failed to meet the criteria for regulatory approval.

  • The fragility index aids in the interpretation of RCT findings and in identifying studies in which results could be overturned due to imprecision.

WHAT THIS STUDY ADDS

  • We found that the phase III, randomised, placebo-controlled trials in patients with SLE and lupus nephritis (LN) held an important degree of fragility.

  • Neither positive nor negative RCTs of SLE and LN therapeutics are particularly robust.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

  • Clinicians should know that the evidence supporting several therapies in SLE and LN has been derived from fragile RCTs, and trialists should consider fragility among the challenges when designing clinical trials for patients with SLE and LN.

Introduction

SLE is a chronic autoimmune disease that primarily affects women and is characterised by heterogeneous clinical manifestations.1 Its treatment aims to achieve remission, prevent organ damage, minimise drug side effects and improve quality of life.2 The first US Food and Drug Administration (FDA)-approved therapy for patients with SLE was aspirin (1948), which was followed by glucocorticoids (1950s) and the antimalarial drug hydroxychloroquine (1955).3 Despite important advances in therapy, which included the use of cyclophosphamide or mycophenolate mofetil for induction therapy of lupus nephritis (LN), the next approval took more than five decades.

Randomised controlled trials (RCTs) in SLE for novel therapeutics have frequently failed to meet criteria for regulatory approval. Even among recently approved drugs, trials have been mixed in their conclusions4 5 or have had small effect sizes for clinically relevant outcomes.6 7 Also, several RCTs of new therapies for patients with SLE or LN were terminated or were not published.8 9 In the last decade, the FDA approved three novel treatments for the management of SLE. First, in 2011, belimumab (a human monoclonal antibody that inhibits the B-cell activating factor) was approved for adult patients with non-renal SLE, which was followed in 2020 by its approval for LN. In 2021, voclosporin (a calcineurin inhibitor) was approved for the treatment of LN, and anifrolumab (a human monoclonal antibody targeting type 1 interferon receptor) for adult patients with SLE.3 All four approvals were based on large double-blind RCTs that demonstrated superiority over matched placebo interventions. Although these therapies met the criteria for FDA approval,10 the robustness of the evidence supporting their use should be continuously assessed.

The fragility index (FI) is a recently described method for assessing the robustness of RCT findings, which calculates the minimum number of patients whose status would be required to change from an event to a non-event to make the study lose statistical significance.11 12 A small FI score indicates a more fragile and less statistically robust clinical trial; the fragility quotient (FQ) assesses fragility relative to a trial’s sample size.12 In this study, we aimed to assess the robustness of the published phase III RCTs for SLE and LN treatments using the FI, the reverse FI (RFI) and the FQ scores to complement the interpretation of these RCTs.

Methods

To identify RCTs, we searched ClinicalTrials.gov (up to 30 September 2023) for trials that included patients with SLE or LN. We included all phase III, randomised, placebo-controlled trials of patients with active SLE or LN. We excluded phase II and phase IV RCTs as well as those whose status was recruiting or ongoing, completed but unpublished, or terminated for reasons other than futility.

Data were obtained from the full-text publications on prespecified primary endpoints, total participants, participants in all arm doses for non-approved drugs and the arms of approved doses (in the USA), the number of events in the intervention and placebo groups, the number of patients who withdrew or discontinued and the reported p values. The FI score was calculated using an online calculator (available at https://clincalc.com/Stats/FragilityIndex.aspx). The RFI was calculated for RCTs with non-statistically significant results by modifying the number of events in the intervention arm until a p value <0.05 was reached while keeping the total number of participants constant.13 The FQ score was calculated by dividing the FI (or RFI) score by the total sample size of the trial.12 We used descriptive statistics to present the results. Analyses were performed using BlueSky Statistics software V.10.3 (BlueSky Statistics, Chicago, Illinois, USA).
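The three measures described above can be sketched in a few lines of Python. This is a minimal illustration and not the ClinCalc calculator used in the study: it assumes a two-sided Fisher exact test, changes events to non-events in the arm with the higher event rate for the FI, and adds events to the intervention arm for the RFI. The function names and the example counts are hypothetical and are not taken from any included trial.

```python
from math import comb

ALPHA = 0.05  # significance threshold used throughout the article


def fisher_p(e1, n1, e2, n2):
    """Two-sided Fisher exact p value for e1/n1 events versus e2/n2 events."""
    events = e1 + e2
    denom = comb(n1 + n2, events)

    def prob(x):
        # Hypergeometric probability of x events in arm 1 with fixed margins.
        return comb(n1, x) * comb(n2, events - x) / denom

    lo, hi = max(0, events - n2), min(n1, events)
    p_obs = prob(e1)
    # Sum every table that is no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def fragility_index(e_trt, n_trt, e_pbo, n_pbo):
    """Events changed to non-events (higher-rate arm) until p >= ALPHA."""
    trt_higher = e_trt * n_pbo >= e_pbo * n_trt
    flips = 0
    while fisher_p(e_trt, n_trt, e_pbo, n_pbo) < ALPHA:
        if trt_higher:
            e_trt -= 1
        else:
            e_pbo -= 1
        flips += 1
    return flips


def reverse_fragility_index(e_trt, n_trt, e_pbo, n_pbo):
    """Non-events changed to events (intervention arm) until p < ALPHA."""
    flips = 0
    while fisher_p(e_trt, n_trt, e_pbo, n_pbo) >= ALPHA:
        if e_trt == n_trt:
            return None  # significance cannot be reached this way
        e_trt += 1
        flips += 1
    return flips


def fragility_quotient(index, n_trt, n_pbo):
    """FI or RFI divided by the total sample size."""
    return index / (n_trt + n_pbo)


# Hypothetical counts (responders / arm size), not taken from any trial.
fi = fragility_index(60, 100, 40, 100)
rfi = reverse_fragility_index(45, 100, 40, 100)
print("FI:", fi, "FQ:", fragility_quotient(fi, 100, 100))
print("RFI:", rfi, "FQ:", fragility_quotient(rfi, 100, 100))
```

Other calculators may use a different test or a different rule for which arm is modified, so scores from this sketch need not match published values exactly.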

Results

We evaluated 20 RCTs, 16 in SLE and 4 in LN. The mean FI/RFI score of the 20 studies was 13.6 (SD 6.6). Additional characteristics of the included studies are shown in online supplemental table 1. There were nine RCTs with statistically significant results (seven in SLE, two in LN; table 1), which had a mean FI score of 10.2 (SD 6.2). The lowest FI was for the ILLUMINATE-2 trial (score: 2, tabalumab 120 mg every 4 weeks), and the highest FI was found in the BLISS-52 trial (score: 17, belimumab), meaning that only two and 17 fewer responders, respectively, would have been required for these trials to lose statistical significance.

Table 1

Fragility evaluation of phase III, randomised, placebo-controlled trials with statistically significant results involving patients with active SLE and lupus nephritis (LN)

Twelve studies showed non-statistically significant results (10 in SLE, two in LN; table 2) with a mean RFI score of 15.6 (SD 6.1). The lowest RFI (non-significant trials) was for the ILLUMINATE-1 trial (score: 4, tabalumab 120 mg every 4 weeks), and the highest RFI was for the TULIP-1 trial (score: 27, anifrolumab), meaning these trials would only need to have four and 27 more responders, respectively, to gain statistical significance.

Table 2

Fragility evaluation of non-statistically significant randomised clinical trials involving patients with active SLE and lupus nephritis (LN)

Overall, the non-significant RCTs for SLE therapies seem to be more robust than those with statistically significant findings (figure 1). The RCTs for LN therapies showed a similar distribution.

Figure 1

Fragility of randomised controlled trials (RCTs) for SLE and lupus nephritis (LN) therapies grouped by statistical significance of the primary endpoint. FI, fragility index; RFI, reverse FI.

The RCT that led to FDA approval of anifrolumab was the TULIP-2 trial, with an FI score of 11. The voclosporin trial (AURORA-1) had an FI score of 15. Regarding the belimumab studies, BLISS-52 had an FI score of 17, BLISS-76 had an FI score of 4, BLISS-SC had an FI score of 16 and BLISS-LN had an FI score of 3; all these trials had statistically significant results. The lowest FQ scores were found in the ILLUMINATE trials and the highest in the rituximab trials (EXPLORER and LUNAR) and the early-terminated LOTUS trial, suggesting that the latter have the more robust results after accounting for sample size.

Discussion

In this study of pivotal phase III, randomised, placebo-controlled trials of new treatments for SLE and LN, many trials had an important degree of fragility, even among RCTs of medications that obtained approval. These results may in part explain the perception that trials in SLE have frequently failed, which has been previously attributed to the complexity of the patients, the choice of the standard of care, the selection of the appropriate endpoints and the procedures needed for conditions with lower incidence and prevalence like SLE.14 15 Our results suggest that fragile trials, in which relatively few patients would need to change outcome status to alter the conclusion of the study, may have also contributed.

Fragility was observed across all recently studied therapies. Among the approved drugs, the BLISS-LN trial was the most fragile: it would have been considered a non-significant study (ie, a ‘negative trial’) if only three fewer patients in the treatment group had achieved the primary endpoint.16 This does not invalidate the approval of belimumab or suggest that physicians should not prescribe it. Rather, it highlights the limitations of the current methodological approach to conducting trials in SLE. These findings should be viewed in light of other concerning factors in BLISS-LN, which include the change in the original primary endpoint, the selected power of 80% to detect a difference in the endpoint and the underlying treatment, which was left at the investigator’s discretion (for both the intervention and placebo arms). Studies have been designed under optimistic assumptions of efficacy, which are difficult to meet given the heterogeneity of SLE. Given the precarity of these results, it seems reasonable to ask whether they would be reproducible if the trial were run a second time or whether a repeat trial would yield inconsistent findings, as has been the case for other recent therapies, including tabalumab,17 18 anifrolumab4 5 and baricitinib.19 20

Conversely, the most robust result was from a negative trial (ie, not statistically significant), the TULIP-1 trial of anifrolumab. It would have required 27 more patients to have achieved an SLE responder index-4 response (the prespecified primary endpoint) for the trial to be considered positive. Despite the negative result from TULIP-1, the TULIP-2 trial was tailored to detect a significant BILAG (British Isles Lupus Assessment Group)-based composite lupus assessment response, based on secondary endpoints from TULIP-1. The FDA gave credence to the somewhat inexplicably different results from the TULIP-2 trial, which related primarily to mucocutaneous and musculoskeletal involvement, and granted approval. We observed two additional interventions with mixed results across their trials. These were tabalumab (a human monoclonal antibody that binds B-cell activating factor) and baricitinib (a Janus kinase 1 and 2 inhibitor). The difference in the results from these studies might reflect the limitations of using endpoints based on global disease activity instruments, in contrast to system-based instruments, for clinically heterogeneous conditions like SLE. Switching the main endpoints to a system-based approach might be beneficial, such as studying patients with specific manifestations (eg, arthritis or cutaneous disease) instead of using a global disease activity score that might result from several combinations of signs and symptoms. Perhaps these global disease activity scores could be kept as secondary outcomes. In other words, we should consider taking the approach used for LN for other SLE manifestations, as was the case for the recent LILAC trial, which focused on patients with cutaneous and joint involvement.21 22 It would be interesting to compare the fragility of future RCTs that use the two types of endpoints (global and system based).

Another issue for placebo-controlled trials in SLE throughout the years is the lack of truly standardised underlying therapy. Although it is currently recommended that all patients with SLE without contraindications receive antimalarials,2 23 24 even in the most recent studies of anifrolumab and baricitinib, the proportion of patients on these medications did not exceed 85% (the lowest being 66–73% in TULIP-2). The use of other immunosuppressants, such as methotrexate, is also recommended (ie, for musculoskeletal and mucocutaneous manifestations, which are the most common across these RCTs) when antimalarials alone are insufficient or high-dose glucocorticoids are required.2 23 24 However, these were not used in more than 20–25% of the patients from the most recent RCTs.4 5 19 20 This low use might have different reasons, including intolerability or adverse events and patient or physician decisions, among others. If future clinical trials are system based as suggested above, it would be easier to agree on what constitutes standardised therapy for a specific disease manifestation.

One of the objectives of RCTs is to prove causality between the interventions being tested and the outcomes of interest. The most often used ‘frequentist’ approach relies on rejecting a null hypothesis (ie, that there is no difference between groups) based on a prespecified statistical significance threshold (typically p<0.05). Requiring thresholds for statistical significance establishes an arbitrary dichotomy, whereby a trial with a p value of 0.049 is categorised similarly to one with a p value of 0.001. Stated within the framework of this paper, frequentist statistical approaches do not account for the fragility of a trial outcome. Alternative approaches to trial design in SLE and LN may be considered, including Bayesian trial designs.25 This approach presents other difficulties, including a lack of understanding of Bayesian methodology and interpretation among practising clinicians, the ability of pharmaceutical sponsors to choose advantageous priors and the downstream issue that FDA approvals are necessarily dichotomous.

More simply, pivotal trials in SLE could be designed to be more robust. The easiest way to reduce the fragility of SLE trials would be to assume a lower response rate among treatment groups. This would necessitate larger trials, but if it resulted in fewer negative trials for drugs that may have some efficacy (type II error), it would be a worthy trade-off. Another approach would be to select populations more likely to benefit. Finally, developing drugs that work better should be considered. Many of the aforementioned negative studies failed to surpass relatively low bars. In our efforts to understand why so many therapies have failed in SLE, we should consider the most obvious explanation: we do not understand SLE and our drugs do not work very well.
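The trade-off between the assumed treatment response and trial size can be illustrated with the usual normal-approximation sample size formula for comparing two proportions. The sketch below is an assumption-laden illustration, not a design used by any included trial: the response rates (placebo 40%, treatment 60% or 55%), the 1:1 allocation and the function name `n_per_arm` are all hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_trt, p_pbo, alpha=0.05, power=0.80):
    """Patients per arm to detect p_trt vs p_pbo (two-sided test, 1:1 allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the binomial variances of the two arms.
    variance = p_trt * (1 - p_trt) + p_pbo * (1 - p_pbo)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_trt - p_pbo) ** 2)


# Hypothetical response rates: placebo 40%, treatment assumed at 60% or 55%.
for p_trt in (0.60, 0.55):
    for power in (0.80, 0.90):
        print(p_trt, power, n_per_arm(p_trt, 0.40, power=power))
```

Under these assumptions, lowering the assumed treatment response or raising the power increases the required number of patients per arm.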

Alternative approaches for assessing available evidence exist. The American College of Rheumatology (ACR) guidelines have embraced the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach, for instance, which helps assess the quality of evidence and the strength of recommendations. The GRADE approach assigns the highest evidence grade to RCTs but can downgrade them if the studies have serious limitations, inconsistencies in the results, indirectness or imprecision concerns, along with the risk of reporting bias.26 The FI is only one way of assessing imprecision within the GRADE domains27 and must not be interpreted in isolation. The ACR has been undergoing efforts to update SLE and LN guidelines.

Our approach has limitations. First, there were relatively few RCTs available for each treatment. We did not evaluate secondary endpoints from the included trials because of the design-linked justification and the limited applicability of the FI/RFI to non-dichotomous endpoints. There are no validated approaches for interpreting the FI scores, which correlate with sample size (ie, larger studies are likely to have larger FI scores), inversely correlate with p values (ie, small p values inevitably have larger FI scores) and may simply reflect good trial design (ie, small FI scores merely suggest that investigators correctly assessed the likely benefit of therapy and appropriately powered their study to reject the null hypothesis). FQ scores may account for this to some degree, but they are less intuitive than FI scores, and the interpretation of either score is ultimately subjective. Among our strengths is the novel application of the FI approach in SLE and LN RCTs to aid clinicians and trialists in interpreting results from RCTs. We limited our evaluation to phase III trials for a more realistic picture of the currently approved treatments and those that matured enough in their pipeline to be compared with the standard of care.
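The dependence of the FI on sample size noted above can be seen in a small numerical sketch. It assumes a two-sided Fisher exact test and hypothetical trials that share the same response rates (60% vs 40%) but differ in size; none of the numbers come from the included trials, and the helper names are illustrative.

```python
from math import comb


def fisher_p(e1, n1, e2, n2):
    """Two-sided Fisher exact p value for e1/n1 events versus e2/n2 events."""
    events = e1 + e2
    denom = comb(n1 + n2, events)

    def prob(x):
        # Hypergeometric probability of x events in arm 1 with fixed margins.
        return comb(n1, x) * comb(n2, events - x) / denom

    lo, hi = max(0, events - n2), min(n1, events)
    p_obs = prob(e1)
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def fragility_index(e_hi, n_hi, e_lo, n_lo, alpha=0.05):
    """Events changed to non-events in the higher-rate arm until p >= alpha."""
    flips = 0
    while fisher_p(e_hi - flips, n_hi, e_lo, n_lo) < alpha:
        flips += 1
    return flips


# Hypothetical trials with identical response rates at three sizes per arm.
for n in (50, 100, 200):
    fi = fragility_index(6 * n // 10, n, 4 * n // 10, n)
    print(n, "per arm -> FI", fi, "FQ", fi / (2 * n))
```

With the response rates held fixed, the FI rises as the number of patients per arm increases, which is why the FQ is reported alongside it.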

These limitations notwithstanding, the results of this study suggest that the data informing approvals of novel SLE therapeutics may not be particularly robust. Comprehensive methodologies for evaluating the evidence supporting therapeutic interventions beyond the reductionist significance interpretation should be considered. In the interim, clinicians should know that the evidence supporting these therapies has mostly been derived from fragile RCTs, and trialists should consider fragility along with the rest of the challenges when designing clinical trials for patients with SLE and LN. Factors beyond the statistical fragility, like heterogeneity of clinical manifestations, characteristics of the selected endpoints and the lack of truly standardised therapy in the control arms (particularly in SLE RCTs), may have an impact on this analysis.

Ethics statements

Patient consent for publication

Ethics approval

This study is exempt from review or approval by the institutional review board due to the use of secondary data from published studies.

Acknowledgments

This manuscript is the product of the final project for the course ‘Evidence-Based Medicine for Clinical Researchers’ as part of the Postdoctoral Master’s Degree in Clinical and Translational Science Program from the Center for Clinical and Translational Science and the Mayo Clinic Graduate School of Biomedical Sciences, where GFP is a scholar.

Footnotes

  • Twitter @DrGabrielFP, @EBRheum, @CrowsonCindy, @AliDuarteMD

  • Presented at Preliminary results were presented at ACR Convergence 2023 (Figueroa-Parra G, Putman M, Duarte-Garcia A. Fragility of randomized clinical trials of systemic lupus erythematosus and lupus nephritis therapies [abstract]. Arthritis Rheumatol. 2023; 75 (suppl 9). https://acrabstracts.org/abstract/fragility-of-randomized-clinical-trials-ofsystemic-lupus-erythematosus-and-lupus-nephritis-therapies/).

  • Contributors GF-P, MSP, CSC and AD-G contributed to the study conception and design. Material preparation and data collection were performed by GF-P. Analysis of data was performed by GF-P. Interpretation of results was made by GF-P, MSP, CSC and AD-G. The first draft of the manuscript was written by GF-P. GF-P and AD-G are responsible for the overall content of the study and act as guarantors. All authors critically revised for intellectual content and approved the final manuscript.

  • Funding This work was supported by the National Centre for Advancing Translational Sciences (grant number: UL1 TR002377).

  • Disclaimer Contents of the work are solely the responsibility of the authors and do not necessarily represent the official views of the National Institutes of Health.

  • Competing interests AD-G has received unrelated grant funding from the Centers for Disease Control and Prevention, the Rheumatology Research Foundation Scientist Development Award and the Robert D and Patricia E Kern Center for the Science of Health Care Delivery.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.