Review

Patient-reported outcome measures for use in clinical trials of SLE: a review

Abstract

Inclusion of patient-reported outcomes is important in SLE clinical trials as they allow capture of the benefits of a proposed intervention in areas deemed pertinent by patients. We aimed to compare the measurement properties of health-related quality of life (HRQoL) measures used in adults with SLE and to evaluate their responsiveness to interventions in randomised controlled trials (RCTs). A systematic review was undertaken using full original papers in English identified from three databases: MEDLINE, EMBASE and PubMed. Studies describing the validation of HRQoL measures in English-speaking adult patients with SLE and SLE drug RCTs that used an HRQoL measure were retrieved. Twenty-five validation papers and 26 RCTs were included in the in-depth review evaluating the measurement properties of 4 generic (Medical Outcomes Study Short-Form 36 (SF36), Patient Reported Outcomes Measurement Information System (PROMIS) item-bank, EuroQol-5D, and Functional Assessment of Chronic Illness Therapy-Fatigue) and 3 disease-specific (Lupus Quality of Life (LupusQoL), Lupus Patient Reported Outcomes, Lupus Impact Tracker (LIT)) instruments. All measures had good convergent and discriminant validity. PROMIS provided the strongest evidence for known-group validity and reliability among generic instruments; however, data on its responsiveness have not been published. Across measures, standardised response means were generally indicative of poor to moderate sensitivity to longitudinal change. In RCTs, clinically important improvements were reported in SF36 scores from baseline; however, between-arm differences were frequently non-significant and non-important. SF36, PROMIS, LupusQoL and LIT had the strongest evidence for acceptable measurement properties, but few measures aside from the SF36 have been incorporated into clinical trials.
This review highlights the importance of incorporating a broader range of SLE-specific HRQoL measures in RCTs and warrants further research that focuses on longitudinal responsiveness of newer instruments.

Introduction

SLE is a chronic inflammatory autoimmune disorder with variable multisystem involvement, an unpredictable relapsing–remitting course, an early onset and a significant impact on health-related quality of life (HRQoL).1 Previous research has shown poor correlation between HRQoL and physician assessments of disease activity and damage, highlighting the distinct contribution of HRQoL data to understanding patient trajectories and supporting the need for its assessment in SLE.2 Further, HRQoL has been found to be an important determinant of adherence and healthcare utilisation in patients with SLE and may facilitate justifying the considerable costs of new therapies.3 Therefore, both the US Food and Drug Administration and the European Medicines Agency advocate use of patient-reported instruments such as those measuring HRQoL in clinical trials (guidelines available at fda.gov and ema.europa.eu, respectively).

Patient-reported HRQoL measures, in the form of questionnaires, have been either developed exclusively for use in SLE (disease-specific measures) or have been used in patients with SLE but developed for any disease state or healthy individuals (generic measures). Patient-reported outcome evaluation has been incorporated in drug clinical trials in SLE; however, it has not been a consistent practice,4 and it is not clear whether sensitivity to change over time has been observed.5 6 Knowledge of acceptable measurement standards, responsiveness to change, generalisability and cultural adaptability would help determine the adequacy of the HRQoL measure for clinical research.

The aims of this systematic review were (1) to compare the measurement properties of published HRQoL measures that have been developed and/or evaluated for use in adults with SLE and (2) to evaluate the responsiveness of validated HRQoL measures used in SLE randomised controlled trials (RCTs) to date. Our goal was to provide a comprehensive review of these outcome measures to inform future selection of these tools in SLE clinical trials.

Methods

Search strategy

Literature searches were conducted in MEDLINE, EMBASE and PubMed, limited to humans, English language and articles published between inception and 1 April 2018. Journal articles (excluding conference abstracts, letters to editor, dissertations and book chapters) containing the keywords in the title and/or abstract were included (search terms are available in online supplementary material 1).

For our first aim, we included papers that described the methodology of the development and validation of HRQoL measures in SLE, and papers that described the evaluation of an existing HRQoL measure or its translated/adapted version for patients with SLE. Exclusion criteria were inadequate numbers of patients with SLE (<50% of the study population) and patients <18 years old. For our second aim, we included drug RCTs (pilot studies, phase I, II and III) in patients with SLE with published HRQoL data. Exclusion criteria were transplantation or plasma exchange RCTs, cutaneous lupus RCTs and RCTs in patients <18 years old.

The selected articles were categorised into (1) validation studies of extensively published HRQoL instruments (defined as having >3 validation studies in English-speaking SLE populations) and HRQoL instruments that had been used in an RCT, and (2) RCTs that used a validated HRQoL instrument.

Outcomes

Measurement properties

To assess measurement properties, we evaluated floor and ceiling effects, construct validity, test–retest reliability, internal consistency, and responsiveness. The instrument was considered to have floor or ceiling effects if >15% of the respondents scored at the extreme ends of the scale.7 Construct validity was determined using convergent and discriminant validity and known-group validity. Convergent validity was judged to be adequately demonstrated if there were high (>0.6) positive correlations between scales, and discriminant validity was judged to be adequately demonstrated if correlations were low (<0.3) or negative.8 Known-group validity was adequate if group means differed by ≥0.5 SD.9 Test–retest reliability was gauged by the intraclass correlation coefficient (ICC) and considered adequate if ICC was >0.7.10 The acceptable statistical value for internal consistency was a Cronbach’s α >0.7.10 Responsiveness was compared using standardised response means (SRMs) and considered poor if SRMs were <0.5, moderate if SRMs were ≥0.5 but <0.8, and high if SRMs were ≥0.8.11 The generalisability of the instrument was assessed by establishing if the study population was adequately described to help investigators extrapolate the results to their study cohorts. For each measure, we also determined if estimates of SLE-specific minimally important differences (MID) were available. While our review focused on English-speaking populations, we noted the availability of validation studies in non-English-speaking populations for each measure.
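As an illustration only, the responsiveness and floor/ceiling thresholds described above could be applied as in the following Python sketch. It is not the analysis code used in the included studies. It assumes the conventional definition of the SRM (mean of the paired change scores divided by the SD of those change scores) and classifies on the absolute value so that deteriorations and improvements are treated alike; the function names and the example inputs are hypothetical.

```python
from statistics import mean, stdev


def standardised_response_mean(baseline, follow_up):
    """SRM, assumed here to be mean paired change / SD of paired change."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


def classify_srm(srm):
    """Apply the review's cut-offs: <0.5 poor, 0.5 to <0.8 moderate, >=0.8 high.

    Assumption: the magnitude is used, so a negative SRM (deterioration)
    is graded on the same scale as a positive one.
    """
    magnitude = abs(srm)
    if magnitude >= 0.8:
        return "high"
    if magnitude >= 0.5:
        return "moderate"
    return "poor"


def has_floor_or_ceiling_effect(scores, scale_min, scale_max, threshold=0.15):
    """Return (floor, ceiling) flags: True if >15% of respondents score
    at the lowest or highest obtainable value, respectively."""
    n = len(scores)
    at_floor = sum(1 for s in scores if s == scale_min) / n
    at_ceiling = sum(1 for s in scores if s == scale_max) / n
    return at_floor > threshold, at_ceiling > threshold


if __name__ == "__main__":
    # Hypothetical domain scores for five patients at two visits.
    baseline = [40, 50, 45, 60, 55]
    follow_up = [50, 55, 45, 70, 60]
    srm = standardised_response_mean(baseline, follow_up)
    print(round(srm, 2), classify_srm(srm))
```

The same helpers can be pointed at any 0–100 domain score; only the scale limits passed to `has_floor_or_ceiling_effect` would need to change for instruments scored on a different range.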

Responsiveness to interventions in RCTs

The results of any intervention during RCTs were interpreted in the context of the MID value of the instrument being used. We first determined if the direction of change in HRQoL scale was consistent with clinical changes measured by disease activity, damage or flare indices. We then determined if between-arm differences in HRQoL were ≥MID. Finally, we determined if HRQoL changes from baseline were ≥MID.
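The two MID comparisons described above can be sketched as follows. This is a minimal illustration rather than the procedure used to build table 4: it assumes an instrument on which higher scores indicate better HRQoL, takes mean changes from baseline per arm as inputs, and uses a hypothetical function name and made-up example values (the MID of 2.5 is the lower literature bound quoted for SF36 summary scores).

```python
def assess_against_mid(change_active, change_control, mid):
    """Compare within-arm and between-arm HRQoL changes with an MID.

    change_active, change_control: mean change from baseline in each arm
    (assumption: positive values mean improvement).
    mid: minimally important difference for the instrument or domain.
    """
    between_arm_difference = change_active - change_control
    return {
        "active_from_baseline_meets_mid": change_active >= mid,
        "control_from_baseline_meets_mid": change_control >= mid,
        "between_arm_meets_mid": between_arm_difference >= mid,
    }


if __name__ == "__main__":
    # Hypothetical trial: both arms improve by more than the MID,
    # but the difference between arms does not reach it.
    print(assess_against_mid(change_active=6.0, change_control=4.0, mid=2.5))
```

In this made-up example both arms improve by at least the MID while the between-arm difference (2.0) falls short of it, which is the pattern the review describes for many SF36 results.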

Screening process and data extraction

Screening and data extraction were performed by two independent researchers using predesigned templates and in line with the Centre for Reviews and Dissemination guidance, available at york.ac.uk/crd. Any disagreements were discussed and resolved based on consensus opinion. In addition to the outcomes, demographics, clinical data and information on instrument characteristics were extracted.

After removal of duplicates, screening of titles and abstracts, and full-text review, 23 validation studies in English-speaking patients with SLE that met the inclusion criteria were identified and were selected for the in-depth review (figure 1). Of the 23 studies, 21 focused on 5 HRQoL instruments (the Medical Outcomes Study Short-Form 36 (SF36), the Patient Reported Outcomes Measurement Information System item-bank (PROMIS), the Lupus Quality of Life (LupusQoL), the Lupus Patient Reported Outcomes and the Lupus Impact Tracker (LIT)) and 2 papers described the validation of the Functional Assessment of Chronic Illness Therapy-Fatigue (FACIT-Fatigue) and EuroQol-5D (EQ5D). The references of the selected 23 papers were also screened for additional relevant papers and 2 further papers were identified and included in the review. We excluded 25 studies that described the measurement properties of 22 additional HRQoL instruments not used in an RCT setting, with 1–3 studies per instrument (list available in online supplementary material 1).

Figure 1

Of the 25 validation papers identified, 8 assessed the measurement properties of Short-Form 36, 1 assessed EuroQol-5D, 7 assessed Lupus Quality of Life questionnaire, 5 assessed Patient Reported Outcomes Measurement Information System item-bank, 3 assessed Lupus Patient Reported Outcomes questionnaire, 6 assessed Lupus Impact Tracker, and 1 assessed Functional Assessment of Chronic Illness Therapy-Fatigue. Some studies assessed multiple quality of life instruments. *Extensively published defined as having >3 validation studies in English-speaking SLE populations or having been used in an RCT. Of the 26 RCT papers identified, 25 used Short-Form 36, 3 used Functional Assessment of Chronic Illness Therapy-Fatigue and 1 used EuroQol-5D. Some studies used ≥2 quality of life instruments. PRO, patient-reported outcome; RCT, randomised controlled trial.

After removal of duplicates, screening of title and abstracts, and full-text review, 26 papers describing an RCT in SLE with HRQoL data were identified. HRQoL instruments used included the SF36 (25 papers), EQ5D (1 paper) and FACIT-Fatigue (3 papers). The references of the selected 26 papers were also screened for additional relevant papers and no further papers were identified.

Results

Table 1 provides information on demographics and clinical characteristics of participants with SLE in patient-reported outcome validation studies. Information on instrument characteristics is provided in table 2. Measurement properties are summarised in table 3, and data on measure responsiveness from RCTs are summarised in table 4.

Table 1. Demographics and clinical characteristics of participants with SLE in patient-reported outcome validation studies

Table 2. Summary of characteristics of instruments used to measure patient-reported outcomes in SLE

Table 3. Summary of measurement properties of HRQoL instruments used in SLE

Table 4. Summary of measure responsiveness from RCTs

Medical Outcomes Study Short-Form 36

We found only one study that reported the frequency of maximum and minimum obtainable scores; role emotional and role physical domains were found to have significant floor and ceiling effects.12 The instrument had good internal consistency; however, its test–retest reliability using ICC is currently unknown.13 14 The Health Assessment Questionnaire scores correlated strongly with the SF36 physical function scores (r=0.75) and moderately with role physical, bodily pain and vitality scores (r=0.41–0.48).15 Weak-moderate correlations (r≤0.41) were reported with various disease activity or damage indices.12–15 The mean SF36 scores differed significantly across categories of disease activity; however, the effect size was not reported.13 16 Studies examining responsiveness were inconsistent in the methodology but seem to suggest that the measure is not particularly responsive in SLE. SRMs were poor in most domains and inconsistent (poor to moderate) across different anchors.17 18 SLE-specific MIDs have been reported using anchor-based and distributions-based methods. Anchor-based MIDs for improvement ranged from 2.8 to 10.9 for domains and from 2.1 to 2.4 for summary scores, which is consistent with literature-reported estimates from other rheumatological conditions (5–10 points for domains and 2.5–5 for summary scores).17 18 Validation studies using the Chinese, French and Turkish versions of the SF36 have shown results that are comparable with the English version.19–22

In RCTs, we assessed responsiveness in two ways. First, we determined if between-arm differences met or exceeded the commonly accepted MID (5–10 points for domains and 2.5–5 for summary scores). Then we determined if improvements over time (from baseline) were ≥MID. We found that although SF36 scores generally improved over time, between-arm differences were clinically non-important, implying that the SF36 is not responsive to interventions.23–48 Among 10 studies that met the primary efficacy endpoint and had sufficient data for analysis, 6 reported improvements in SF36 scores from baseline that were ≥MID (table 4), while only 2 reported between-arm differences in SF36 scores that were ≥MID in all or most domains. Within-arm improvements from baseline were not limited to RCTs that achieved the primary efficacy endpoint. Improvements in SF36 scores from baseline were also ≥MID in 6 of 10 RCTs that reported a statistically non-significant clinical improvement.

Patient Reported Outcomes Measurement Information System

Significant ceiling effects were reported in most domains of the 29-item profile,49 while no floor or ceiling effects were observed among 14 Computer Adaptive Tests (CATs).50 PROMIS measures had good internal consistency and test–retest reliability, although the internal consistency of the PROMIS CATs remains to be established. PROMIS scores correlated strongly (r>0.6) with other HRQoL instruments across comparable domains and weakly to moderately (r≤0.6) across divergent domains.49–52 Correlations with disease activity indices, physician global assessment, damage and physical activity (using an accelerometer) were mostly weak (r<0.3).50–52 Patients with SLE scored at least 0.5 SD worse than the general population across most domains.49–53 Longitudinal responsiveness and MIDs have not been published. While PROMIS measures have been translated into many other languages including Spanish and Chinese, additional studies are needed to validate PROMIS measures in non-English-speaking patients with SLE.50 PROMIS measures have not yet been used in RCTs.

Functional Assessment of Chronic Illness Therapy-Fatigue

We found only one study that reported the measurement properties of FACIT-Fatigue in patients with SLE. The measure was found to have good internal consistency; however, the test–retest reliability and the floor and ceiling effects are currently unknown.54 FACIT-Fatigue had moderate to high correlations (r=0.5–0.8) with SF36, the Brief Pain Inventory and patient global assessment, but poor correlations with disease activity and physician global assessments (r=0.1–0.3). Cross-sectionally, FACIT-Fatigue discriminated well between remission-mild and moderate disease activity (0.52 SDs) but not between moderate and severe disease activity (0.24 SDs). The measure was responsive to clinical improvements (SRM=0.69) but not clinical deteriorations. The measure was responsive to improvements (SRM=0.82) and deteriorations (SRM=0.53) in patient global assessment. Distribution-based and anchor-based estimates suggested an MID range of 3–6 points, which is consistent with literature reports of 3–4 points for patients with rheumatoid arthritis or cancer.

We identified three papers that used FACIT-Fatigue in an RCT setting (table 4). In all three, the intervention led to statistically non-significant improvements in disease activity. Two studies reported a change in FACIT-Fatigue scores from baseline that was ≥MID (4 points); however, only one study reported between-arm differences that were ≥MID.

EuroQol-5D

We found only one study that reported the measurement properties of EQ5D in patients with SLE.55 No floor or ceiling effects were observed. Related domains on the EQ5D and SF36 correlated strongly (r=0.60), whereas unrelated domains showed weak to moderate correlation. Disease activity and damage showed weak correlation with EQ5D domains (r<0.22). The mean scores differed significantly across categories of disease activity but not damage. The measure showed poor responsiveness to self-reported change in health (SRMs ranged from 0.08 to 0.27 in patients who deteriorated and from 0.35 to 0.43 in patients who improved) and was not responsive to longitudinal changes in disease activity (SRM=0.01 in patients who deteriorated and 0.12 in patients who improved). SLE-specific MIDs have not been reported. EQ5D was shown to have good construct and criterion validity in a group of Chinese-speaking patients with SLE.56

EQ5D was used in one RCT that met our inclusion criteria. In this study, the intervention led to statistically non-significant improvements in disease activity. While changes in EQ5D from baseline were ≥MID in one of the intervention arms, between-arm differences did not reach MID (table 4).

Lupus Quality of Life

Some domains were found to have significant floor and ceiling effects, including intimate relationships and planning.12 57 The measure was found to have good internal consistency and test–retest reliability. LupusQoL had strong correlations with SF36 across comparable domains (r>0.6) and weak correlations with age, disease duration, disease activity and damage across all domains (r<0.30).12 17 58–61 Scores differed significantly across categories of disease activity and damage in all domains except fatigue and intimate relationships.58 60 The effect size has not been reported. SRMs were poor in most domains and inconsistent (poor to moderate) across different anchors.57 59 SLE-specific MIDs derived using the anchor-based approach ranged from 2.4 to 8.7 for deteriorations and from 3.5 to 7.3 for improvements.17 MIDs using distribution-based approaches based on 0.5 SD ranged from 12.9 to 16.7. Measurement properties of the LupusQoL have been examined and published in Chinese-speaking, Farsi-speaking, French-speaking, Italian-speaking, Spanish-speaking and Turkish-speaking populations.62–70 A version adapted and validated for a US population is also available.13

We identified one RCT that used LupusQoL; however, the results have not been published.39

Lupus Patient Reported Outcomes

The measure was found to have significant floor effects in satisfaction with medical care; data on ceiling effects were inconclusive, with one study reporting no ceiling effects and another reporting significant ceiling effects in all domains except coping.71 72 Good internal consistency and test–retest reliability were reported in most domains.71–73 Procreation and satisfaction with care had the lowest ICCs. Moderate-strong correlations (r≥0.5) were reported with the SF36 across comparable domains, while correlations with disease activity, physician global assessment, damage and flare were weak-moderate (r≤0.50).71–73 In cross-sectional analyses, significant associations were reported with categories of patient-reported health status across all domains except social support, coping, satisfaction with medical care and procreation. Lupus symptom scores and pain/vitality scores differentiated among patients with flare/active disease and those without; however, estimates of effect size were not reported.72 73 Scores changed significantly in response to longitudinal changes in patient-reported health (across seven domains), physician global assessment (across six domains) and flare (across five domains); however, SRMs have not been reported.71 No data are currently available on MIDs. The instrument has been validated in several languages, including Chinese, French, Italian, Japanese, Spanish, Tagalog and Turkish.74–82 The instrument has not been used in an RCT.

Lupus Impact Tracker

The instrument had good internal consistency and test–retest reliability83–85; however, no data are available on floor and ceiling effects. Moderate-strong correlations (r≥0.40) were reported with patient-reported HRQoL measures, and correlations with disease activity, damage and physician global assessment were mostly weak (r≤0.31).83 84 86 The mean scores differed significantly across dichotomised categories of disease activity, disability, socioeconomic status, age, race, education and marital status83–86; estimates of effect size have not been reported. Scores changed significantly in response to longitudinal changes in patient-reported outcomes or disease activity.83–86 Data on SRMs were contradictory and further research is warranted. One study suggested that the measure was responsive to clinical improvements (SRM=0.69) but not to clinical deteriorations (SRM=0.20).87 MIDs range from 2 to 4 for clinical deteriorations and an MID of 28 points has been reported for clinical improvements.83 87 The instrument has been validated in several languages, including German, Italian, Spanish, Swedish and French.88 The instrument has not been used in an RCT setting.

Discussion

In this review, we compared the measurement properties of 7 patient-reported HRQoL instruments in English-speaking adult patients with SLE using data from 25 validation studies and 26 drug RCTs. Overall, we found comparable measurement properties between the disease-specific measures, the SF36 and the PROMIS measures, but few measures aside from the SF36 have been incorporated into clinical trials. In general, instruments had good validity but poor-moderate responsiveness to change over time. Cultural adaptability and responsiveness of the PROMIS measures remain to be reported. In RCTs, clinically important improvements were reported in SF36 scores from baseline; however, between-arm differences were frequently non-significant and non-important, implying the SF36 is not responsive to interventions.

Despite the validation of the PROMIS item-bank and the disease-specific instruments in SLE, SF36 frequently has been the only patient-reported outcome in RCTs. Several prior publications have called for standardisation of instruments to measure HRQoL in SLE research to enable comparison between studies and encouraged SF36 use as it is internationally recognised and well-validated across multiple conditions.89 The 1995 Systemic Lupus International Collaborating Clinics Workshop recommended SF36 for measuring HRQoL in patients with SLE.16 It was also recommended by Outcome Measures in Rheumatology IV for assessment in RCTs and longitudinal observational studies in SLE.17 Our findings do not support the use of SF36 as the key measure moving forward and show that the SF36 is not particularly responsive in SLE. Despite its extensive validation, the measure’s test–retest reliability and known-group validity using effect size remain to be reported and may provide further insight into the measure’s responsiveness. Generally, instruments found to discriminate among clinically distinct groups are also found to be responsive to change.90 PROMIS measures provided the strongest evidence for known-group validity, indicating they may be more sensitive to change over time; however, this remains to be tested. Our findings also demonstrate that the disease-specific measures had good validity and reliability and were equally or more responsive to change than the SF36. As the field of clinical trials in SLE evolves, guidelines should be revised to encourage use of a broader range of validated HRQoL measures in clinical research to improve study designs. Incorporating disease-specific HRQoL measures as endpoints is also important in providing patient-centric care to improve outcomes pertinent to patients with SLE.4

Consistent with a prior review of patient-reported outcomes in lupus clinical trials,4 data from RCTs that used the SF36 show that longitudinal changes were clinically important regardless of assignment to pharmacological intervention. In contrast, between-arm differences were mostly non-important and non-significant. One interpretation of this finding is that non-pharmacological interventions associated with RCTs (such as routine monitoring of adverse events, improved access to health services, provision of multidisciplinary care, use of background medication, provision of health-related educational material and improved patient–physician dialogue) may have a greater impact on constructs measured by the SF36 than pharmacological interventions that specifically target clinical outcomes. The observation that non-pharmacological approaches can improve HRQoL is supported by prior research89 and suggests that the combinations of pharmacological and non-pharmacological therapies may have an additive (or perhaps synergistic) effect on improving HRQoL. Selection bias may be another plausible explanation for clinically important improvements in HRQoL among patients assigned to placebo in RCTs. Strict inclusion criteria often mean that patients enrolled in RCTs are generally healthier and better-informed than the general SLE population and more likely to experience further improvements in self-perceived health. Finally, there are observations that the MID determined through anchor-based methods seen in validation studies may differ from MIDs seen in RCTs.90 As evidence accumulates in RCTs, the observed changes in HRQoL measures based on effective treatments provide a valuable source of data on responsiveness and MIDs. Therefore, it is important that clinical trial literature in SLE is reviewed for older instruments such as the SF36 and synthesised for newer instruments to further support the evidence base on responsiveness and MID for interpreting HRQoL data.

While the validity of this literature review is strengthened by the inclusion of validation studies and RCTs, this study has some limitations. First, an assessment of the quality of the studies identified from the literature search was not conducted, so as not to limit our search. Second, HRQoL measures that had few publications were not prioritised and therefore not included in the in-depth review. Third, we did not evaluate measurement properties in non-English-speaking SLE populations.

In conclusion, SLE is a condition associated with high unmet need and considerable burden to patients. SF36, PROMIS, LupusQoL and LIT have the strongest evidence for validity and as such are suitable for use in SLE RCTs; however, few measures aside from the SF36 have been incorporated into clinical trials. SRMs were inconsistent across different anchors and generally poor in all instruments with data for analysis. In RCTs, between-arm differences in SF36 scores were frequently non-significant and non-important. This review highlights the importance of incorporating a broader range of SLE-specific HRQoL measures in RCTs and warrants further research that focuses on longitudinal responsiveness and cultural adaptability of newer instruments such as the PROMIS item-bank.