Introduction

Systemic lupus erythematosus (SLE) is an autoimmune disease with a wide array of clinical and laboratory manifestations that can affect every body system [1–3]. The pathogenesis of SLE involves genetic, immunologic, hormonal, and environmental factors, and current therapies include NSAIDs, antimalarials, corticosteroids, and immunosuppressive agents. Despite the use of non–target-specific, steroid-sparing, immunosuppressive agents such as methotrexate, azathioprine, mycophenolate mofetil, and cyclophosphamide, most patients require the ongoing use of low- to mid-dose levels of corticosteroids together with intermittent bursts of high doses [3–5]. The morbidity of these treatments prevents many of these patients from living productive lives, particularly those who are diagnosed in their early reproductive years. Patients with moderate to severe, remitting, relapsing disease would be better served by agents that target more specific immunologic molecular targets. The basic science, pharmacology, and translational medicine efforts for the discovery of therapies to alter new molecular targets have advanced significantly over the past 25 years; however, the protean nature of SLE has made it challenging to develop reproducible and scalable measures of disease activity that are applicable to global clinical trials requiring 150 to 300 investigative sites. Drug development for SLE has therefore lagged behind that of other inflammatory autoimmune diseases such as rheumatoid arthritis, multiple sclerosis, and inflammatory bowel disease [8–11].

Several Disease Activity Indices (DAIs) have been and are currently being used for the development of composite primary end points to test a number of therapeutics that are in clinical development for moderate to severe SLE. These DAIs include versions of the SLE Disease Activity Index (SLEDAI), including the original SLEDAI, which is composed of 24 disease parameters heavily weighted toward the neurological and renal systems that are deemed “present or not present” within the previous 10 days of the assessment [12, 13]. The clinical parameters are given weighted scores of 1 to 8. The Safety of Estrogens in Lupus Erythematosus, National Assessment–Systemic Lupus Erythematosus Disease Activity Index (SELENA-SLEDAI), extended the original SLEDAI to include a flare index to measure moderate and severe flares over time in an SLE population whose disease was minimally active at baseline [14, 15]. The SLEDAI-2000, a modification of the original 24 SLEDAI clinical parameters, was developed to take into account persistent disease activity rather than only new or recurrent features of SLE [16]. It has been shown to be valid over both 30-day and 10-day periods [17].

Another commonly utilized DAI for SLE activity is the BILAG (British Isles Lupus Assessment Group) Index [18–20]. The BILAG is a group of physicians who have been meeting regularly since 1984, and originally produced what has become known as the “classic” BILAG Index, which consists of 86 clinical and laboratory parameters measured over a 28-day period. It is based upon a physician’s intention to treat and captures changes in active disease in different body systems simultaneously. The original version covered eight body systems: constitutional, mucocutaneous, neuropsychiatric, musculoskeletal, cardiovascular, vasculitis, renal, and hematological. The classic BILAG underwent a major revision in 2004 to remove some damage items, increase the number of parameters to 97, add the gastrointestinal (GI) and ophthalmic systems, and include vasculitis terms within the appropriate body systems [21]. These parameters are scored by the treating physician as new (4), worse (3), same (2), improving (1), and not present (0). Within the nine body systems, grades of A through E are assigned based on a severity scoring index. While it is a complex instrument to use, it has been validated and shown to correlate with a physician’s intent to treat [22].

The SELENA-SLEDAI modification of the Physician’s Global Assessment, or MDGA, is a three-inch visual analogue scale (VAS) with superimposed landmarks ranging from 0 to 3 in which zero is equivalent to no disease activity, and 1, 2, and 3 correlate with mild, moderate, and the most severe disease possible, respectively. Note that an MDGA of 2.5 will trigger a severe flare by SELENA-SLEDAI criteria [14, 15]. Some studies have used a VAS of 100 mm, similar to that used in rheumatoid arthritis studies [23, 24••], which can also be successfully used in lupus trials with the SELENA-SLEDAI landmarks. The use of this instrument in clinical trials is dependent upon having sequential determinations performed by the same physician, as it is largely subjective.

In contrast to these activity indices, the Systemic Lupus International Collaborating Clinics/ACR (SLICC/ACR) Damage Index is a measure of damage, defined as nonreversible change present for at least 6 months that is not related to ongoing active inflammation [25, 26].

The first drug to be approved for SLE in 60 years was belimumab, a monoclonal antibody that inhibits the activity of the B-lymphocyte stimulator (BLyS) [27, 28]. This drug was approved by the US Food and Drug Administration (FDA) in March 2011 [29]. The primary end point used in the two registration trials was the SLE Responder Index (SRI) [30••, 31••, 32••]. The SRI is a composite index with three components: a ≥4-point decrease in the SLEDAI; the absence of new flares in any of the eight body systems of the classic BILAG, defined as no new A scores (representing severe disease activity) and no more than one new B score (representing moderate disease activity); and no worsening (≥0.3 increase) of the MDGA visual analogue three-point score [30••]. Using this composite end point, there was a modest but statistically significant increase in the response rate in patients receiving 10 mg/kg belimumab for 52 weeks [31••, 32••]. The two trials, BLISS-52 and BLISS-76, had identical entry criteria and showed placebo response rates of 34 % and 44 % and belimumab (10 mg/kg) response rates of 43 % and 58 % [31••, 32••]. This represented treatment effects of only 9 % and 14 %, respectively; however, all secondary end points demonstrated directional improvement in the treatment groups, the most important of which was a higher percentage of patients whose average prednisone dose was reduced by 25 % or more from baseline to ≤7.5 mg/d during weeks 40 through 52 [31••, 32••]. The main driver of the SRI was the 4-point or greater reduction in the SLEDAI score. Placebo rates of 34 % and 44 % may have been high because of the remitting and relapsing nature of the disease and because the patients were allowed increases in standard-of-care medications during the first part of the study, many of which can be quite effective.
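The three SRI components described above can be sketched in code to make the composite logic explicit. This is a minimal illustration, not the algorithm used in the registration trials: the `Assessment` type, the function names, the treatment of a "new" B score as one arising from a C, D, or E grade, and the handling of a body system missing at baseline as grade E are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    sledai: int   # total SLEDAI score at the visit
    bilag: dict   # body system name -> grade "A" through "E"
    mdga: float   # physician's global assessment on the 0-3 scale


def count_new(baseline: dict, followup: dict, grade: str, from_grades: set) -> int:
    """Count systems that reach `grade` at follow-up from one of `from_grades`.

    Assumption: a system absent from the baseline dict is treated as grade E.
    """
    return sum(
        1
        for system, g in followup.items()
        if g == grade and baseline.get(system, "E") in from_grades
    )


def is_sri_responder(baseline: Assessment, followup: Assessment) -> bool:
    # Component 1: at least a 4-point decrease in SLEDAI.
    sledai_ok = baseline.sledai - followup.sledai >= 4
    # Component 2: no new BILAG A and no more than one new BILAG B.
    # Assumption: A -> B is an improvement and is not counted as a new B.
    new_a = count_new(baseline.bilag, followup.bilag, "A", {"B", "C", "D", "E"})
    new_b = count_new(baseline.bilag, followup.bilag, "B", {"C", "D", "E"})
    bilag_ok = new_a == 0 and new_b <= 1
    # Component 3: MDGA increase below 0.3; round first so that
    # floating-point noise (e.g. 2.3 - 2.0) does not hide a 0.3 increase.
    mdga_ok = round(followup.mdga - baseline.mdga, 2) < 0.3
    return sledai_ok and bilag_ok and mdga_ok
```

Because all three components must hold, a patient whose SLEDAI falls by 4 points but who develops a new BILAG A in another system is classified as a nonresponder in this sketch.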

Another targeted therapy currently in phase 3 is epratuzumab, which is a monoclonal antibody directed against the CD22 antigen on the B-cell surface [33–35]. Using the same DAIs, a different composite primary end point was independently developed and used in the EMBLEM phase 2 study [33]. The primary end point was the BILAG-Based Composite Lupus Assessment (BICLA) at 12 weeks. The BICLA requires BILAG 2004 improvement, defined as BILAG A’s at study entry improved to B/C/D and BILAG B’s at study entry improved to C/D, with no BILAG worsening in other BILAG organ systems; no worsening in SLEDAI total score compared with study entry; and no worsening in physician’s global VAS assessment of disease activity (defined as <10 % increase) compared with study entry. Moreover, patients who are treatment failures cannot be responders [33, 36••]. The results of the EMBLEM study showed that 21 % of patients receiving placebo met all criteria defined in the BICLA. Patients treated with epratuzumab in a total dose of 2,400 mg had a combined response rate of approximately 43 % at 12 weeks [33]. Although the treatment groups contained a relatively small number of patients (37–39 per arm), the lower placebo response resulted in the realization of a statistically significant, approximately twofold treatment effect.
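The BICLA criteria can be written out in the same way. The sketch below is illustrative only and rests on several assumptions that are not specified in the text: "worsening in other organ systems" is simplified to a system graded C, D, or E at entry becoming A or B; the <10 % VAS criterion is interpreted as an increase of less than 10 mm on a 100 mm scale; and the function and parameter names are hypothetical.

```python
def is_bicla_responder(
    entry_bilag: dict,
    visit_bilag: dict,
    entry_sledai: int,
    visit_sledai: int,
    entry_vas_mm: float,
    visit_vas_mm: float,
    treatment_failure: bool,
) -> bool:
    """Return True when every BICLA component is met.

    Both BILAG dicts map body system -> grade "A" through "E" and must
    contain the same body systems.
    """
    # Treatment failures cannot be responders.
    if treatment_failure:
        return False

    for system, entry_grade in entry_bilag.items():
        grade = visit_bilag[system]
        # Every A at entry must improve to B, C, or D.
        if entry_grade == "A" and grade not in {"B", "C", "D"}:
            return False
        # Every B at entry must improve to C or D.
        if entry_grade == "B" and grade not in {"C", "D"}:
            return False
        # Assumption: worsening elsewhere means C/D/E becoming A or B.
        if entry_grade in {"C", "D", "E"} and grade in {"A", "B"}:
            return False

    # No worsening in SLEDAI total score.
    if visit_sledai > entry_sledai:
        return False

    # Assumption: "<10 % increase" means less than 10 mm on a 100 mm VAS.
    if visit_vas_mm - entry_vas_mm >= 10:
        return False

    return True
```

Unlike the SRI sketch, this function fails a patient as soon as any single body system that was A or B at entry has not improved, which is the structural difference the post-hoc comparisons turn on.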

The EMBLEM data were subjected to post-hoc analysis comparing the BICLA versus SRI composite end points [37••]. Application of the SRI to these data showed a placebo efficacy rate of greater than 50 %, or twice that observed when data were analyzed by BICLA criteria. No treatment effect was observed using SRI [37••]. Improvement using the BILAG-based BICLA requires a response in all body systems that were involved at baseline, as well as no new flares in the remaining body systems, which may account for the fact that a placebo response occurs less often. By contrast, if just one SLEDAI element worth at least four points resolves, while other features that were present at baseline stay the same or worsen slightly, then the patient could qualify as a responder in the SRI.

A similar analysis was applied to preliminary data obtained in the Biomarkers of Lupus Disease (BOLD) study, a study of 100 patients with SLE on immunosuppressive therapy [38••]. Half of the patients are undergoing withdrawal of their immunosuppressive therapy to determine whether this would be safe and lessen the time to flare. The placebo effect was examined using multiple outcome measures including the SRI-4, SRI-5, and a stringent end point similar to the BICLA, which did not, however, allow a responder to have even one B flare. These end points were compared with BOLD study protocol criteria minimally defined by either a drop of ≥1 BILAG grade or a SLEDAI ≥4-point reduction from baseline but anchored by the investigators’ simple intent-to-treat–based determination of whether there was clinically significant improvement, no significant change, or clinically significant flare. A preliminary analysis at 4 and 8 weeks showed that the BICLA-like end point was superior to SRI in detecting improvement, and less likely to pick up a flare visit as improvement based on the BOLD standard [38••].

The potential for variability in the application of the SLEDAI, BILAG, and MDGA by physician evaluators worldwide poses a large challenge in multicenter clinical trials given the differences in disease expression in individual patients and the interpretation of the patient signs and symptoms. The focus of this article is on analyzing the potential pitfalls in the collection of clinical trial data in patients with SLE, describing approaches that have been used in the successful development of drugs for other clinical indications, and presenting solutions for assuring the uniformity of the data across multiple SLE clinical trial sites.

Centralized Adjudication Committees for Complex Clinical End Points

Centralized adjudication committees (CACs) have been used successfully to analyze clinical end points that are not solely composed of objective laboratory data and are thus subject to variable and/or biased interpretation. These adjudication committees are different from data monitoring committees (also known as data safety monitoring committees or DMCs), which are a group of individuals established for large, randomized, multicenter trials in which a treatment is intended to reduce mortality or major cardiovascular events, recurrence of cancer, or other life-threatening events [39]. These committees are used in trials predominantly to monitor study treatments or procedures in which the results could possibly have a very favorable or highly unfavorable impact on safety, or in trials in which particularly fragile populations are involved and in whom the risk of mortality or other serious outcomes is high [39]. A DMC may be charged with determining whether trials should be aborted for futility if efficacy is found to be absent or if a detrimental effect on the population is discovered. Conversely, a trial may be stopped if the treatment is shown in an interim analysis to have a very favorable outcome such that it would be unethical to continue withholding therapy in a placebo-controlled trial.

In contrast to DMCs, CACs provide more detailed, specific data review, predominantly of efficacy end points [40, 41]. Given the investment of hundreds of millions of dollars required to bring a drug to late-stage development, and the intense level of audit and scrutiny applied to trial data before marketing approval of drugs and devices, most companies have employed centralized review and adjudication to assure bias-free interpretation of critical end points.

CACs are routinely used to analyze cardiac and pulmonary events for safety and efficacy [42, 43]. Criteria are developed prospectively for determining classification of events into myocardial infarction, acute coronary syndrome without infarction, appropriate application of angioplasty and stenting procedures, episodes of congestive heart failure, and respiratory deaths due to various causes [44, 45]. Studies have shown that there can be substantial discrepancies between evaluation of these events by the site investigator as compared with the central committee [43–45]. A recent analysis of clinical event classification (CHF and cardiogenic shock) for the Assessment of Pexelizumab in Acute Myocardial Infarction (APEX-AMI) trial showed that the clinical events committee agreed with site investigator assessments of CHF in only 45 % of CHF events, and 77 % of cardiogenic shock events [44].

In the Understanding the Potential Long-term Impacts on Function with Tiotropium (UPLIFT) trial, conducted in patients with chronic obstructive pulmonary disease (COPD), a mortality adjudication committee determined cause-specific mortality due to respiratory, cancer, cardiovascular, sudden cardiac death, or sudden death, which was then compared with the site investigator assessment. In this study, there was complete agreement in only 50.2 %, incomplete agreement in 18.5 %, and no agreement in 31 % of 981 reported deaths. Thus, there is a high degree of variability in relying upon individual site assessments in classifying causes of what is usually considered a “hard” end point, mortality [45].

Assessments of complete and partial responses and time to progression in oncology trials are particularly difficult. These determinations are dependent upon physical examination and radiological imaging techniques, which are inherently variable. Central adjudication of these end points is critical for oncology drug development and provisional marketing approval in the absence of long-term mortality data [46, 47].

Disease Activity Index Interpretation in Systemic Lupus Erythematosus Clinical Trials

The FDA guidance for the development of drugs for lupus specifies that sponsors should use an adjudication committee to determine whether patients meet prespecified criteria for responder status [48••]. The following describes two general approaches that can be utilized.

Post-Hoc Versus Ongoing Data Review and Adjudication in Systemic Lupus Erythematosus Trials

Adjudication can be performed when all patient visits are finalized and after site personnel have applied the DAIs to serial patient visits and assigned scores to BILAG, SLEDAI, and MDGA. The cumulative data are then provided to members of a CAC who have developed specific rules for verifying the correct application of the DAIs. All visits for a patient are reviewed and queries are issued to investigators to clarify the interpretation and correct application of the disease measurement tools. The advantage of this approach is that the adjudicator can assess the patient over the entire period of time being studied in sequential visits, which can help detect potentially illogical patterns in the data, but the disadvantage is that a significant amount of time has passed between the original patient visit and clinical adjudication/reassessment, leading to a less accurate recall of specific patient encounters by the investigator. In addition, CAC review of data following completion of all study visits prevents early identification of potentially incorrect application of the multiple DAIs, which can lead to an increase in protocol deviations and therefore a reduction in the per protocol population.

A preferable, albeit more labor-intensive, approach is to conduct ongoing data review from the time of screening through database lock. This approach requires that patient data be made available to adjudicators soon after each patient visit, and it therefore requires full-time coordination by data managers. Adjudicators also need to be available throughout the course of the trial rather than solely at the end. Nevertheless, given the history of failed or modestly successful development programs for lupus and the complexity of the disease and its assessments, we favor this approach. In particular, we encourage sponsors to include rigorous review of all patients screened for entry into the trial prior to randomization. Most of the trials currently recruiting, as well as those having been run in the recent past [24••, 31••, 48••], have strict entry criteria to ensure the following:

  1. The patient has been correctly diagnosed as having lupus.

  2. The patient has moderate to severe disease characterized by at least one A or two B BILAG 2004 body system scores, as well as a SLEDAI-2K score of 6 or greater. Most trials are now incorporating a minimum MDGA score to qualify for a study, as well as a prespecified type of immunosuppressive regimen.

  3. The exclusion of subjects with severe neurological disease, active antiphospholipid antibody syndrome, and active lupus nephritis requiring cyclophosphamide, as these patients often require a different approach to therapy. While many investigators feel comfortable assessing the individual parameters in the SLEDAI-2K and BILAG 2004 in study patients, many do not have detailed experience in BILAG body system grading, especially the renal body system. Thus, it is important for external experts to verify the BILAG 2004 body system grades for study entry.

  4. Another potential issue at screening and baseline visits that can greatly increase the placebo response rate using the SRI is the recording of inappropriate SLEDAI-2K eight-point items as “present.” All of these items, except for vasculitis, are neurologic features of SLE, most of which are exclusion criteria. In a post-hoc analysis of the EMBLEM study, Petri et al. [37••] compared the SRI with the BICLA using the same dataset and found a much higher placebo response rate using SRI compared with BICLA. In this analysis, 61 of 227 (27 %) differed in assignment of clinical response by SRI or BICLA, and the majority of these differences (47 SRI, but not BILAG responders) were explained by a disappearance of an eight-point item that was recorded at baseline: lupus headache (n = 15) or vasculitis (n = 7). The intended definition of lupus headache as applied in the original SLEDAI was that it was a manifestation of ongoing cerebral inflammation. Recent studies have shown that headache alone refractory to narcotics is no more common in the SLE population than in the population without lupus [49–52]. The wider availability of narcotics may falsely elevate the attribution of headache to SLE. Because the SLEDAI and BILAG are the key drivers of the SRI and the BICLA responses, it is critical to make certain that the clinical parameters contained in both that share similar, if not exactly the same, definitions are scored consistently across both DAIs at every patient visit. Table 1 shows examples of suggested data correlations for the DAIs used to measure SLE. Table 2 shows differences in the clinical and laboratory definitions between the BILAG 2004 and SLEDAI-2K glossaries. Note that the thresholds for common laboratory abnormalities seen in SLE patients, such as leukopenia and thrombocytopenia, are different for SLEDAI-2K and BILAG 2004.

  5. Another challenge in using the BILAG 2004 as a key end point is to ensure that site personnel assign the features of not present, improving, same, worse, and new in a logical fashion across different visits. There are very few clinical scenarios that would result in a change from “new,” “worse,” or “same” to “not present” in the 28-day period specified by the correct use of the BILAG 2004. Most parameters should go through an “improving” step before being scored “not present.” Why is this important? The BILAG 2004 Scoring Index is based on the intent-to-treat concept [21, 22]. Heavily weighted clinical parameters will produce an A or B score when “new,” “worse,” or “same” are marked, but reduction to “improving” reduces the body system grade of an A parameter to B, and a B parameter to C. In many ongoing lupus phase 2 and phase 3 trials, swollen/tender joint counts and the Cutaneous LE Activity and Severity Index (CLASI) [53] are being used as exploratory outcome measures, but these tools can also provide additional data that can be utilized to verify that the SLEDAI-2K and BILAG are capturing joint and mucocutaneous parameters correctly. The joint count is the same 28 tender, swollen joint count used to capture active joints in rheumatoid arthritis assessments [54, 55], and the CLASI is an index that more precisely measures lupus skin involvement by applying weighted scores of activity (rashes, scale, hypertrophy) and damage (dyspigmentation) to segments of the body [53]. The CLASI is more heavily weighted toward areas of the body that are frequently involved in SLE, such as the face and scalp, while the Rule of Nines Burn Index [56], which is used in the BILAG 2004 to assess rash involvement, can greatly underestimate the degree of SLE skin involvement. Although both the joint count and CLASI give a perspective of these parameters only on the day of the visit, while SLEDAI and BILAG assess the previous 28 to 30 days, the CLASI and joint count may be particularly useful in identifying positive findings that are overlooked on the SLEDAI and BILAG. For example, if scalp erythema is scored as present and red on the CLASI, and the patient is noted to have alopecia, a central adjudicator may query a site to make certain that the term for “alopecia, severe” should be considered in addition to “alopecia, mild,” as erythema would indicate the possible characterization of inflammation associated with the alopecia.
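The expectation that a BILAG 2004 item should usually pass through "improving" before being scored "not present" lends itself to a simple serial check. The following sketch flags any item that moves directly from "new," "worse," or "same" to "not present" between consecutive visits. The function name and the input layout are hypothetical, and a flagged transition is meant to prompt a query to the site, not to be treated as an error, since a small number of clinical scenarios can legitimately produce it.

```python
# Item scores as described for the BILAG 2004.
ITEM_SCORES = {"not present": 0, "improving": 1, "same": 2, "worse": 3, "new": 4}


def flag_abrupt_resolutions(item_history):
    """Flag direct jumps from new/worse/same to not present.

    item_history: list of (visit_label, status) tuples for one BILAG item,
    ordered by visit. Returns a list of
    (previous_visit, current_visit, previous_status, current_status).
    """
    flags = []
    for (prev_visit, prev), (visit, cur) in zip(item_history, item_history[1:]):
        if ITEM_SCORES[prev] >= 2 and ITEM_SCORES[cur] == 0:
            flags.append((prev_visit, visit, prev, cur))
    return flags
```

For example, a history of "new" at week 0, "same" at week 4, and "not present" at week 8 yields one flag for the week 4 to week 8 transition, whereas "worse," "improving," "not present" yields none.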

Table 1 Suggested correlation of SLEDAI-2000 and BILAG 2004 clinical parameters for SLE clinical trials
Table 2 Differences in SLEDAI-2000 and BILAG-2004 glossary definitions

Operational Aspects of Central Adjudication for Global Systemic Lupus Erythematosus Trials

Although central review and adjudication of DAIs for consistency is desirable, and perhaps even essential to assure high-quality, interpretable data in this heterogeneous patient population, the execution of this process can be time consuming and costly for sponsors. The ideal components of this venture are included in Fig. 1.

Fig. 1 Operational aspects of central adjudication for global systemic lupus erythematosus. CRF, case report form

Investigator, Site Personnel, and Monitor Training and Certification

A uniform, concise, but detailed training program is essential. It should cover the SLEDAI, the BILAG, and the correct use of the MDGA, CLASI, and joint count, and should combine lectures at Investigator Meetings, WebEx and/or video training, and certification testing (currently provided by the Lupus Foundation of America and others) that can be accessed through Web-based portals. The training and testing should be tailored to the specific study.

Data Collection and Management

For large, global trials, Internet-based electronic data capture, case report forms, and source document templates should be designed with input from the adjudication team to facilitate collection of the correct data for the DAIs. The addition of preprogrammed edit checks minimizes manual queries by adjudicators and streamlines data review. An experienced data management team should be trained to streamline communication between adjudicators, sponsors, CRAs, and site personnel.
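One kind of preprogrammed edit check compares items that appear in both indices at the same visit. The sketch below is a hedged illustration: the `CORRELATED_ITEMS` mapping is a hypothetical placeholder with a single alopecia pairing and is not taken from the study's correlation table, and the function name and input layout are assumptions. Because glossary definitions and time windows differ between the indices, a mismatch should raise a query for review and should not be corrected automatically.

```python
# Hypothetical pairing of a SLEDAI-2K item with BILAG 2004 items that have
# overlapping definitions. A real study would populate this from its own
# prespecified correlation table.
CORRELATED_ITEMS = {
    "alopecia": ["alopecia, severe", "alopecia, mild"],
}


def cross_index_queries(sledai_present, bilag_status, pairs=CORRELATED_ITEMS):
    """Return SLEDAI item names scored inconsistently with their BILAG partners.

    sledai_present: set of SLEDAI-2K item names scored present at the visit.
    bilag_status: dict of BILAG item name -> status string at the same visit;
    an item missing from the dict is treated as "not present".
    """
    queries = []
    for sledai_item, bilag_items in pairs.items():
        in_sledai = sledai_item in sledai_present
        in_bilag = any(
            bilag_status.get(item, "not present") != "not present"
            for item in bilag_items
        )
        if in_sledai != in_bilag:
            queries.append(sledai_item)
    return queries
```

Running such a check at data entry keeps simple discordances out of the adjudicators' queue, so that manual review can concentrate on cases that need clinical judgment.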

Adjudicator Review of Screening Data Prior to Randomization

Screening source documents, subject narrative, and laboratory review should be done in a timely fashion to confirm that a subject has moderate to severe SLE by DAI entry criteria.
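The disease activity portion of the entry criteria described earlier (at least one A or two B BILAG 2004 body system scores, plus a SLEDAI-2K score of 6 or greater) can be checked mechanically before the adjudicator's clinical review. The sketch below is a minimal illustration; the function name is hypothetical, and the optional `min_mdga` parameter stands in for a protocol-specific MDGA threshold that the text does not specify.

```python
from collections import Counter


def meets_activity_entry_criteria(bilag_grades, sledai_2k, mdga=None, min_mdga=None):
    """Check the DAI-based entry criteria for moderate to severe SLE.

    bilag_grades: dict of body system -> grade "A" through "E".
    sledai_2k: total SLEDAI-2K score at screening.
    mdga, min_mdga: optional MDGA value and protocol-defined minimum
    (hypothetical parameter; applied only when min_mdga is given).
    """
    counts = Counter(bilag_grades.values())
    bilag_ok = counts["A"] >= 1 or counts["B"] >= 2
    sledai_ok = sledai_2k >= 6
    if min_mdga is None:
        mdga_ok = True
    else:
        mdga_ok = mdga is not None and mdga >= min_mdga
    return bilag_ok and sledai_ok and mdga_ok
```

A check of this kind confirms only that the recorded scores satisfy the thresholds. Whether the underlying BILAG grades were assigned correctly still requires expert review of the source documents.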

On-Study Visits

On-site source document verification by monitors should occur, and all outstanding queries should be addressed, before the adjudication team reviews the visits.

Pre-database Lock Check for Discrepant Data

Trained data managers should review serial BILAG, SLEDAI, and MDGA scores across all visits for each patient to identify unusual data patterns and have outliers reviewed by adjudicators.
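As one example of an unusual pattern that such a review might look for, the sketch below flags visits at which the SLEDAI has improved substantially since the previous visit while the MDGA has worsened. The rule itself and its default thresholds (a 4-point SLEDAI drop together with a 0.3 MDGA rise, borrowed from the SRI components) are illustrative assumptions and not a validated screening criterion; the function name and input layout are likewise hypothetical.

```python
def discordant_visits(visits, sledai_drop=4, mdga_rise=0.3):
    """Flag visits where SLEDAI improves while MDGA worsens.

    visits: list of dicts with keys "visit", "sledai", and "mdga",
    in chronological order for a single patient.
    Returns the labels of visits to refer to an adjudicator.
    """
    flagged = []
    for prev, cur in zip(visits, visits[1:]):
        improved = prev["sledai"] - cur["sledai"] >= sledai_drop
        # Round to avoid floating-point noise around the threshold.
        worsened = round(cur["mdga"] - prev["mdga"], 2) >= mdga_rise
        if improved and worsened:
            flagged.append(cur["visit"])
    return flagged
```

Flagged visits are candidates for adjudicator review only. A discordance can reflect genuine clinical change in a manifestation that one index captures and the other does not.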

Confirmation of Response

Responder status should be confirmed by the CAC.

Conclusions

The development of reproducible composite primary end points such as ACR20, 50, and 70, DAS28-CRP, and others for the measurement of changes in rheumatoid arthritis activity has led to more interpretable trial end points and to a revolution in the development of drugs for treatment of this disease compared with 20 years ago [54, 56]. By studying the differential responsiveness of patients with different phenotypes and genotypes to the new rheumatoid arthritis therapies that target specific immunologic pathways, it will be possible to develop a better understanding of the pathogenesis and genetics of this disease.

SLE may represent several different diseases given the extremely variable clinical presentation and organ system involvement. The clinical development approach to SLE is more challenging than for rheumatoid arthritis; however, lessons have been learned from rheumatoid arthritis trials in constructing potentially viable clinical end points that can be used in large, global, multicenter trials involving more than 1,500 patients. While attempts are being made to develop and validate new versions of existing DAIs [57], the BILAG and SLEDAI are being used today in ongoing phase 2 and phase 3 trials. Given the potential subjectivity and complexity of applying these tools to patients with different SLE manifestations, it is essential to make certain that they are being applied uniformly and consistently. This paper has highlighted the importance of CACs for data review and confirmation of clinical responses in SLE clinical trials. In addition, we have described a process that is currently being utilized. While certain features such as grading of BILAG body systems can be automated with computer algorithms, as in the iBLIPs program developed by Isenberg et al. [19], and in certain other proprietary databases, which include automatic edit checks, it is nevertheless important to incorporate a thoughtful medical review to ensure consistency in the use of these DAIs in global clinical trials.