Common reasons for clinical trial failure in SLE and recommendations
In recent years, various agents in SLE clinical trials have failed to meet their primary endpoints, including epratuzumab, an anti-CD22, B-cell-directed monoclonal antibody (programme terminated)49; rituximab43 44; and tabalumab, an anti-BAFF monoclonal antibody (programme terminated).50 Several potential reasons have been cited for clinical trial failure besides the drugs not being efficacious, including heterogeneity of patients included in a trial, use of outcome measures that were not developed for clinical trials and cannot measure change accurately over time, site investigator inexperience, concomitant medication use during trials and other trial design flaws.2 51 52 Many of these factors continue to challenge trial site staff, investigators and contract research organisations involved in running an SLE trial. An overview of these factors and recommendations for avoiding them is presented in figure 1.53–59
Figure 1. The most common pitfalls in lupus clinical trials. The inner circle lists the most common pitfalls that have hindered the success of lupus clinical trials. The outside statements reflect the domains in which the pitfalls may occur and insights into each of the pitfalls, along with some guidance. GCS, glucocorticosteroid.
In general, homogeneity of enrolled patients is essential in clinical trials, especially those with small sample sizes, and stringent, well-chosen inclusion criteria help to ensure it. Currently, a SLEDAI-2K score of ≥4–6 is used to select patients with mild-to-moderate disease activity, but this approach is often insufficient to ensure homogeneity of the sample. For example, patients who enter a trial with a SLEDAI-2K score of ≥4–6 can have different manifestations that sum to the inclusion threshold. Furthermore, the degree of disease activity for a given manifestation may differ from patient to patient (eg, inflammatory skin rash involving 3% of body surface area (BSA) in one patient vs 10% of BSA in another). Inflammatory lupus rash could be discoid in one patient and subacute in another, and a particular drug may work better for subacute rash than for discoid rash. The time to improvement or resolution of these manifestations may also vary. All of these points should be considered if the aim is to have a homogeneous sample for a specific trial. Moreover, particular inclusion criteria, such as serological activity thresholds (eg, for ANA), may be overlooked in trial development even though they may have an important role in trial success. When deciding which patients to include in a trial, it may be necessary to classify the disease by immunological mechanisms, using markers such as ANA titres (eg, ANA≥1:80). Several lessons learnt from post hoc analyses of clinical trials may benefit the design of future trials.
The post hoc analyses of the phase II/III APRIL-SLE study highlighted the importance of baseline biomarkers such as elevated serum concentrations of B-lymphocyte stimulator (BLyS) and a proliferation-inducing ligand (APRIL), which may help to identify potential responders to atacicept.31 Another study by Petri et al demonstrated that BLyS concentrations of ≥2.0 ng/mL at screening are an independent prognostic factor for an increased risk of BILAG A or B flares.60 In the MUSE trial, patients with IFNGS-high test results responded better to anifrolumab than patients with IFNGS-low test results.28 In the future, lupus clinical trials will probably include and stratify patients based on their concentrations of cytokines and other biomarkers. Inclusion and exclusion criteria need to be selected carefully so that they are not too restrictive and do not thereby exclude patients who might benefit. In addition, excessively restrictive criteria will limit the external validity of the trial results and their generalisability.
We recommend the following actions to address the factors associated with heterogeneous samples. First, an accurate set of inclusion criteria should be optimised for each specific trial to ensure the homogeneity of the sample. For example, the majority of current trials mandate serologically positive patients with SLE (ANA+ or dsDNA+ antibodies). Second, a minimum severity level for disease activity should be mandated in the inclusion criteria. For example, six active joints should be mandated as opposed to ≥2 joints as per SLEDAI-2K. Finally, the inclusion criteria should require involvement of specific organ systems. Using a SLEDAI-2K score of ≥6 or BILAG 1A as the inclusion criterion is not sufficient. The inclusion criteria for trials should require activity in specific organ systems, such as the musculoskeletal or dermal systems.
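The point that a total-score threshold alone does not guarantee a homogeneous sample can be illustrated with a minimal sketch. It uses only a subset of the SLEDAI-2K descriptor weights; the domain grouping, the function names and the two example patients are hypothetical and serve only as an illustration, not as a screening tool.

```python
# Subset of SLEDAI-2K descriptor weights (the full instrument has 24 descriptors).
SLEDAI_2K_WEIGHTS = {
    "arthritis": 4,
    "rash": 2,
    "alopecia": 2,
    "mucosal_ulcers": 2,
    "low_complement": 2,
    "increased_dna_binding": 2,
}

# Hypothetical grouping of descriptors into organ domains a trial might target.
TARGET_DOMAINS = {
    "musculoskeletal": {"arthritis"},
    "mucocutaneous": {"rash", "alopecia", "mucosal_ulcers"},
}


def sledai_2k_score(active_descriptors):
    """Sum the weights of the descriptors that are active at screening."""
    return sum(SLEDAI_2K_WEIGHTS[d] for d in active_descriptors)


def meets_score_threshold(active_descriptors, threshold=6):
    """Total-score criterion only (eg, SLEDAI-2K >= 6)."""
    return sledai_2k_score(active_descriptors) >= threshold


def has_activity_in(active_descriptors, domain):
    """Organ-specific criterion: at least one active descriptor in the domain."""
    return bool(TARGET_DOMAINS[domain] & set(active_descriptors))


# Two hypothetical patients with the same total score but different disease.
patient_a = {"arthritis", "rash"}
patient_b = {"mucosal_ulcers", "low_complement", "increased_dna_binding"}

for name, patient in (("A", patient_a), ("B", patient_b)):
    print(
        name,
        sledai_2k_score(patient),
        meets_score_threshold(patient),
        has_activity_in(patient, "musculoskeletal"),
    )
```

Both patients pass the total-score criterion, but only patient A would enter a trial that also requires musculoskeletal activity.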
SLE encompasses a spectrum of manifestations, and the commonly used outcome measures in clinical trials lack the required extent of standardisation in the documentation of lupus manifestations. Accurate documentation is crucial for identifying and confirming change over time. Moreover, the different composite indices used, such as SRI and BILAG Composite Lupus Assessment (BICLA), can result in different responder rates, which can complicate between-trial comparisons. For example, SRI response is defined as (1) ≥4 point reduction in SLEDAI global score; (2) no new severe disease activity (BILAG A organ score) or >1 new moderate organ score (BILAG B); and (3) no worsening from baseline in Physician’s Global Assessment score (increase <0.3).12 13 BICLA response is defined as (1) baseline BILAG score improvement (eg, all A (severe disease) scores falling to B (moderate), C (mild), or D (no activity), and all B scores falling to C or D); (2) no new BILAG A scores and ≤1 new BILAG B score; (3) no worsening of total SLEDAI-2K score from baseline; (4) ≤10% deterioration in Physician’s Global Assessment score; and (5) no initiation of non-protocol treatment.61 One major difficulty for developing uniform outcome measurements is the low number of validated biomarkers available.
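The two composite definitions above can be written out as a minimal sketch to show how the same patient may be a responder on one index and not on the other. The `Visit` structure, field names and example values are hypothetical. Reading the BICLA criterion of ≤10% deterioration in Physician's Global Assessment as 10% of a 0–3 scale is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    bilag: dict    # organ system -> BILAG grade, "A" to "E"
    sledai: int    # total SLEDAI or SLEDAI-2K score
    pga: float     # Physician's Global Assessment (assumed 0-3 scale)


def new_bilag_counts(baseline, followup):
    """Count organ systems newly graded A, or newly graded B, at follow-up."""
    new_a = sum(1 for s, g in followup.items()
                if g == "A" and baseline.get(s) != "A")
    new_b = sum(1 for s, g in followup.items()
                if g == "B" and baseline.get(s) not in ("A", "B"))
    return new_a, new_b


def is_sri_responder(base, follow, min_drop=4):
    """SRI as described in the text; min_drop=4 corresponds to SRI(4)."""
    new_a, new_b = new_bilag_counts(base.bilag, follow.bilag)
    return (base.sledai - follow.sledai >= min_drop
            and new_a == 0 and new_b <= 1
            and follow.pga - base.pga < 0.3)


def is_bicla_responder(base, follow, non_protocol_treatment=False):
    """BICLA as described in the text (PGA scale interpretation is assumed)."""
    new_a, new_b = new_bilag_counts(base.bilag, follow.bilag)
    # Every baseline A must fall to B, C or D; every baseline B to C or D.
    improved = all(
        follow.bilag.get(s) in (("B", "C", "D") if g == "A" else ("C", "D"))
        for s, g in base.bilag.items() if g in ("A", "B")
    )
    return (improved
            and new_a == 0 and new_b <= 1
            and follow.sledai <= base.sledai
            and follow.pga - base.pga <= 0.10 * 3
            and not non_protocol_treatment)


# Hypothetical patient 1: BILAG improves, but SLEDAI falls by only 2 points.
p1_base = Visit({"musculoskeletal": "B", "mucocutaneous": "B"}, sledai=8, pga=1.5)
p1_week52 = Visit({"musculoskeletal": "C", "mucocutaneous": "C"}, sledai=6, pga=1.0)

# Hypothetical patient 2: SLEDAI falls by 4 points, but one B score persists.
p2_base = Visit({"musculoskeletal": "A", "mucocutaneous": "B"}, sledai=12, pga=2.0)
p2_week52 = Visit({"musculoskeletal": "B", "mucocutaneous": "B"}, sledai=8, pga=1.5)

print(is_sri_responder(p1_base, p1_week52), is_bicla_responder(p1_base, p1_week52))
print(is_sri_responder(p2_base, p2_week52), is_bicla_responder(p2_base, p2_week52))
```

In this sketch the first patient is a BICLA responder but not an SRI responder, and the second is the reverse, which is one way the two indices can yield different responder rates in the same population.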
We therefore recommend the use of reliable and responsive instruments. First, measures that capture partial improvement should be considered. SLEDAI-2K registers only the complete recovery of a descriptor, whereas a better approach might be to register a 50% improvement, which SLEDAI-2K SRI(50) can capture. SLEDAI-2K SRI(50) is superior to SLEDAI-2K for measuring change over time.53–55 Second, the utilisation of organ-specific instruments (eg, Cutaneous Lupus Erythematosus Disease Area and Severity Index, composite renal outcomes, and so on) should be encouraged. Several groups have recently described the development of new indices for assessing lupus activity. Touma et al recently demonstrated that SLE Disease Activity Index Glucocorticosteroid Index (SLEDAI-2KG) identifies more responders at 6 months (92% vs 84%) and at 12 months (89% vs 76%) than SLEDAI-2K for cut-off points of 5, 6 and 7.62 Abrahamowicz et al described the derivation of a new Multivariable Lupus Outcome Score (LuMOS) with data from BLISS-76. LuMOS included a reduction in SLEDAI by ≥4 points, increase in C4, decrease in DNA antibody titre and no new symptoms or worsening in renal BILAG as well as improvements in the mucocutaneous component of BILAG. Early validation of LuMOS with data from BLISS-52 demonstrated superiority in discriminating responders from non-responders compared with SRI-4.63 Furthermore, it is necessary to develop and validate other organ-specific instruments that are sensitive to change (eg, an instrument for central nervous system manifestations such as cognitive impairment; instruments for assessing serositis disease severity). In addition, the choice of outcome measures should be optimised for each trial. For example, recent analyses have shown that urinary red blood cells should not be included as a component of renal composite outcomes.56 Spot urine protein to creatinine ratio should not take the place of 24-hour proteinuria quantification.
In some cases, investigators are not adequately prepared to use the disease activity instruments correctly. Investigators need proper training on the use of outcome measures and the specific instruments selected for the study. Study sites need to be selected carefully and should have expertise in treating patients with SLE. To address issues associated with inadequately prepared investigators and centres, we recommend avoiding loose site selection criteria. Non-stringent criteria allow the participation of centres with insufficient skills for assessing and managing lupus. The importance of competent centres should not be underestimated. Including investigators who are certified in the use of specific instruments may be insufficient to assure a properly run study if competent centres are not chosen.
It is necessary to ensure adequate sample size and power to detect a significant difference between the arms of the trial.57 58 In view of the heterogeneity of the disease, the sample size in a trial needs to be large enough for a clinically meaningful difference to reach statistical significance. One case in which low sample size may have been important in determining the significance of a result is the LUNAR trial for rituximab, in which the primary endpoint of a superior renal response rate with rituximab at end of treatment was not achieved.44 In this trial of 144 patients (72 each for rituximab and control), the overall renal response rate was 56.9% for the rituximab cohort compared with 45.8% for the control cohort (P=0.18).44 By comparison, in the BLISS-76 trial for belimumab, a significant improvement (P=0.017) in efficacy (SRI response at end of treatment) was obtained with the 10 mg/kg dosage compared with placebo, although the absolute difference in response rates (43.2% vs 33.5%) was slightly smaller than in the LUNAR trial.13 More than twice as many patients per arm completed BLISS-76 (n=186 for placebo, n=191 for the 10 mg/kg dosage) as were enrolled in each arm of the LUNAR trial.13 Trials with smaller sample sizes can probably be designed and implemented once disease heterogeneity is controlled and a strict disease phenotype is achieved.
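The influence of sample size on the LUNAR result can be sketched with a standard two-proportion z-test and the usual normal-approximation sample size formula. This is an illustration only. It assumes an unstratified, two-sided comparison, which may differ from the analysis actually used in the trial, and the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def two_proportion_p_value(p1, p2, n1, n2):
    """Two-sided p value of a pooled two-proportion z-test."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * (1 - _STD_NORMAL.cdf(abs(z)))


def patients_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients per arm needed to detect p1 vs p2 (two-sided test)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Response rates and arm sizes as reported in the text for LUNAR.
lunar_p = two_proportion_p_value(0.569, 0.458, 72, 72)
lunar_n_needed = patients_per_arm(0.569, 0.458)

print(f"LUNAR p value with 72 per arm: {lunar_p:.2f}")
print(f"Patients per arm for 80% power: {lunar_n_needed}")
```

Under these assumptions, the p value comes out close to the reported 0.18, and a difference of this size would need on the order of 300 patients per arm to be detected with 80% power, several times the 72 enrolled per arm.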
Patient diversity is another consideration in patient recruitment beyond achieving adequate sample size. Centres worldwide should be chosen to promote diversity in the trial participants. Such an approach needs to be taken prudently to avoid certain types of variability associated with different geographical locations, such as high infection rates in certain centres. One method for achieving participant diversity would be to use centres that have substantial population diversity.
Endpoints with high bars can be too restrictive to demonstrate positive results. An example in which this may have occurred is the ILLUMINATE trials for tabalumab. Although the ILLUMINATE-2 trial met its primary endpoint of SRI(5) response at week 52 for the more frequent dosing regimen, it did not meet this endpoint for the less frequent dosing regimen, and neither regimen met it in ILLUMINATE-1.50 64 In ILLUMINATE-1, similar percentages of patients achieved SRI(5) response at week 52 (31.8% and 35.2% for the two treatment arms compared with 29.3% for placebo).50 Using an SRI(5) response as the primary endpoint, which requires a ≥5 point reduction in the Safety of Estrogens in Lupus Erythematosus–National Assessment-SLEDAI score as opposed to the ≥4 point reduction required for SRI(4) response, may have resulted in trial failure. The successful BLISS-52 and BLISS-76 trials for belimumab used SRI(4) response.12 13 Another example is the phase II/III abatacept trial. In this case, the bar was set overly high by using spot urine protein to creatinine ratio ≤0.26 g/g (30 mg/mmol), although EULAR (European League Against Rheumatism) guidelines define a complete renal response as <50 mg/mmol.6 37
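The two proteinuria thresholds above are quoted in different units, so a short conversion sketch may help in comparing them. It assumes a creatinine molar mass of 113.12 g/mol; the function names are hypothetical.

```python
CREATININE_MOLAR_MASS_G_PER_MOL = 113.12  # assumed molar mass of creatinine


def upcr_g_per_g_to_mg_per_mmol(upcr_g_per_g):
    """Convert a urine protein to creatinine ratio from g/g to mg/mmol.

    One mmol of creatinine weighs molar_mass / 1000 grams, so
    mg/mmol = (g/g) * 1000 mg/g * (molar_mass / 1000) g/mmol.
    """
    return upcr_g_per_g * CREATININE_MOLAR_MASS_G_PER_MOL


def upcr_mg_per_mmol_to_g_per_g(upcr_mg_per_mmol):
    """Convert a urine protein to creatinine ratio from mg/mmol to g/g."""
    return upcr_mg_per_mmol / CREATININE_MOLAR_MASS_G_PER_MOL


trial_threshold = upcr_g_per_g_to_mg_per_mmol(0.26)     # abatacept trial bar
guideline_threshold = upcr_mg_per_mmol_to_g_per_g(50)   # guideline bar

print(f"0.26 g/g is about {trial_threshold:.1f} mg/mmol")
print(f"50 mg/mmol is about {guideline_threshold:.2f} g/g")
```

On this conversion, the trial threshold of 0.26 g/g corresponds to roughly 29 mg/mmol, whereas the guideline threshold of 50 mg/mmol corresponds to roughly 0.44 g/g, so the trial required a considerably lower level of proteinuria for a complete response.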
Standard of care needs to be considered when designing a trial. It is difficult to achieve a significant difference between the placebo and drug treatment arms when patients are receiving standard-of-care treatment. For example, GCS use can increase the response rate in the placebo group and thereby influence trial results. Therefore, the exposure to and dosage of GCS should be limited. In the context of disease activity, use of GCS should be adjusted. This strategy is currently undergoing evaluation in the SLEDAI-2KG trial.65 The GCS dosage should be balanced between arms to minimise introduced bias.57 For mild lupus manifestations, GCS should be omitted if possible. Drug trials focusing on patients with dermal and musculoskeletal SLE manifestations might demonstrate results of experimental therapy more clearly if they omit GCS use as a standard of care. However, this strategy would be unethical to implement for patients with moderate-to-severe lupus.
Review of the data by an adjudication committee is important, and it should be done in a timely manner to identify deficiencies and inconsistencies. Two of the more common approaches are (1) post hoc review and adjudication once all patient visits are finalised and (2) ongoing data review and adjudication during the trial. Although the first approach allows for an overall review of the data, the second can identify difficulties with data collection while the trial is still in progress. A combination of both approaches is preferable because it would allow the identification of sites that are not adequately trained with respect to inclusion criteria and outcomes.
Drug dosages and regimens should be selected for optimal efficacy and safety. In relation to drug dosing, when safety issues do occur, the safety committee needs to consider carefully whether treatment discontinuation is appropriate. For example, in the phase II/III APRIL-SLE trial for atacicept, the 150 mg arm was discontinued prematurely because of two deaths related to infection; no deaths occurred in the placebo arm.30 In this 52-week trial, patients with moderate-to-severe SLE received atacicept twice weekly for 4 weeks followed by once weekly for the remaining 48 weeks.30 In the phase IIb ADDRESS II trial of patients with SLE, atacicept 150 mg was given weekly for 24 weeks with no increase in serious adverse events compared with placebo.32 66 Although other factors cannot be excluded, these results suggest that the dosing regimen given in the ADDRESS II trial has a better safety profile than the regimen in the APRIL-SLE trial.
The time to improvement or resolution of disease activity for a particular manifestation often depends on disease phenotype. The length of trials may be too short to observe meaningful effects, and researchers may need more than 1 year for given endpoints (eg, a significant reduction in proteinuria for patients with lupus nephritis, especially because the speed of recovery from proteinuria is slow).59 However, the length of trials for patients with dermal or musculoskeletal manifestations can be shortened, especially if GCS are omitted or tapered and stopped very early. Trials involving patients with mild skin/musculoskeletal manifestations can also be shortened both by omitting the use of GCS and using partial recovery as an endpoint instead of complete recovery.