Original research
Measurement properties of selected patient-reported outcome measures for use in randomised controlled trials in patients with systemic lupus erythematosus: a systematic review
Vibeke Strand (1), Lee S Simon (2), Alexa Simon Meara (3) and Zahi Touma (4)

  1. Division of Immunology/Rheumatology, Stanford University, Palo Alto, California, USA
  2. SDG, LLC, Cambridge, Massachusetts, USA
  3. Rheumatology, The Ohio State University, Columbus, Ohio, USA
  4. Department of Medicine, University of Toronto, Toronto, Ontario, Canada

Correspondence to Dr Vibeke Strand; vibekestrand{at}me.com

Abstract

Objective The heterogeneous multisystem manifestations of SLE include fatigue, pain, depression, sleep disturbance and cognitive dysfunction, and underscore the importance of a multidimensional approach when assessing health-related quality of life. The US Food and Drug Administration has emphasised the importance of patient-reported outcomes (PROs) for approval of new medications and Outcome Measures in Rheumatology has mandated demonstration of appropriate measurement properties of selected PRO instruments.

Methods Published information on the psychometric properties of the Medical Outcomes Survey Short Form 36 (SF-36), Lupus Quality of Life Questionnaire (LupusQoL) and Functional Assessment of Chronic Illness Therapy-Fatigue Scale (FACIT-F), and on their suitability as end points in randomised controlled trials (RCTs) and longitudinal observational studies (LOS), was assessed. A search of English-language literature using MEDLINE and EMBASE identified studies related to development and validation of these instruments. Evidence addressed content validity, reliability (internal consistency and test-retest reliability), construct validity (convergent and divergent) and longitudinal responsiveness, including thresholds of meaning and discrimination.

Results All instruments demonstrated strong internal consistency reliability and appropriate face/content validity, indicating that items within each instrument measure the intended concept. SF-36 and LupusQoL demonstrated test-retest reliability; for FACIT-F, test-retest reliability has not been published in SLE but is supported by evidence from other rheumatic diseases. All instruments demonstrated convergent validity with other comparable PROs and responsiveness to treatment.

Conclusion The measurement properties of SF-36, LupusQoL and FACIT-F, the PRO instruments with published data from RCTs, indicate their value as secondary end points to support labelling claims in RCTs and LOS evaluating the efficacy of SLE treatments.

  • systemic lupus erythematosus
  • outcomes research
  • patient perspective

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

Introduction

SLE is a chronic autoimmune disease that affects multiple organ systems and significantly impacts patient-reported health-related quality of life (HRQoL). The clinical manifestations of SLE are heterogeneous, vary over time, and may include fatigue, pain, depression, sleep disturbance and cognitive dysfunction.1 The multisystemic nature of SLE poses a challenge for evaluating treatment benefit and underscores the importance of using a multidimensional approach when assessing HRQoL. In 1998, the Outcome Measures in Rheumatology (OMERACT) international consensus effort recommended five domains for assessment in all randomised controlled trials (RCTs) and longitudinal observational studies (LOS) in SLE, including disease activity, damage, HRQoL, adverse events and economic costs.2 OMERACT also recommended that both generic and disease-specific instruments be used to assess HRQoL.

Since the release of the OMERACT recommendations, some RCTs in SLE have included patient-reported outcomes (PROs), such as the Medical Outcomes Survey Short Form 36 (SF-36),3 the Lupus Quality of Life questionnaire (LupusQoL)4 and the Functional Assessment of Chronic Illness Therapy-Fatigue Scale (FACIT-F).5 The US Food and Drug Administration guidance for PRO measures outlines the methodology and evidence needed to support labelling claims for new treatments6 and emphasises the importance of demonstrating content validity, reliability, construct validity and responsiveness of the measure among the target population.

Existing PROs in SLE measure patients’ perceptions of their health and assess a spectrum of HRQoL concepts, including pain, fatigue, anxiety, depression, physical function and cognitive function, among others. Current SLE PROs can be grouped as disease-specific or generic. Among the generic instruments, SF-36 is the most commonly used in research settings and RCTs; the EuroQol Five-Dimensional Questionnaire (EQ-5D) is also used.7 While several SLE-specific HRQoL questionnaires have been developed and validated, including LupusQoL, SLE-specific Quality of Life Questionnaire (SLE-QOL),8 SLE Quality of Life Questionnaire (L-QoL),9 LupusPRO and Lupus Impact Tracker (LIT),10 we focused on legacy measures and only those with publicly available data from RCTs: SF-36, LupusQoL and FACIT-F.

The objective of this analysis was to summarise available evidence supporting the psychometric properties of SF-36, LupusQoL and FACIT-F in SLE and to assess their suitability as secondary end points in RCTs to support labelling claims for SLE treatments.

Methods

Search strategy

This review used the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines.11 The search strategy was developed in consultation with a medical librarian with expertise in systematic reviews (online supplementary appendix 1). A search of English-language published literature using MEDLINE (1995 to June 2017) and EMBASE (1964 to June 2017) was conducted to identify: (1) studies related to development and validation of SF-36, LupusQoL and FACIT-F in SLE; and (2) RCTs and LOS in SLE that included these instruments. We excluded non-English articles, publications available only in abstract form, conference letters, editorials, dissertations and case reports with <20 patients. Search terms were individualised for each database. Titles and abstracts of initially identified studies were screened and reviewed to further identify articles that could be included in the final literature review and synthesis.

Psychometric properties of included instruments

To assess the psychometric properties of each instrument, evidence of content validity, reliability (internal consistency and test-retest reliability), construct validity (convergent, divergent and known-group validity) and longitudinal responsiveness was extracted.12 13 Convergent validity was judged appropriate if correlations between instruments were positive and >0.6, and discriminant validity if correlations were <0.3. For internal consistency, Cronbach’s α>0.7 was considered acceptable.12 An intraclass correlation coefficient (ICC) >0.7 for test-retest reliability was interpreted as acceptable.12 Longitudinal responsiveness was evaluated using standardised response means (SRMs) and interpreted as poor if SRM<0.5, moderate if 0.5≤SRM<0.8 and high if SRM≥0.8.14 Thresholds of meaning, particularly minimum clinically important differences (MCIDs) and minimum important differences (MIDs), are also presented. In discussions of clinically relevant thresholds for outcome scores, MCIDs refer to approaches based on the patient perspective/perception,15 16 while MIDs are not based on clinical judgement (eg, perceptions of patients or clinicians) and generally use approaches that are anchored to a statistical change defined as p<0.05 or based on a change in a laboratory marker or a functional test ≥0.5 SD.17 18 It is important to note that anchor-based methods are recommended for derivation of MCID definitions.18 Discrimination of the studied instruments in RCTs was also assessed. Instrument measurement properties of SF-36 (table 1), LupusQoL (table 3) and FACIT-F are included in the online appendix table A1. Selected RCTs and LOS in SLE that included these instruments are presented in online appendix tables A2 and A3, respectively.
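The interpretive cut-offs described in this section can be written down as a small set of rules. The sketch below is illustrative only: the function names are hypothetical, and taking the absolute value of SRMs and of discriminant correlations is an assumption made here so that negative values (worsening, inverse correlations) are banded by magnitude.

```python
def classify_srm(srm: float) -> str:
    """Band a standardised response mean using the review's cut-offs.

    Assumption: the magnitude |SRM| is banded, so worsening (negative SRM)
    is treated the same way as improvement.
    """
    magnitude = abs(srm)
    if magnitude >= 0.8:
        return "high"
    if magnitude >= 0.5:
        return "moderate"
    return "poor"


def classify_correlation(r: float) -> str:
    """Label a correlation between two instruments.

    Convergent: positive and > 0.6. Discriminant: < 0.3 (assumed here to
    mean |r| < 0.3). Anything else is left as "intermediate".
    """
    if r > 0.6:
        return "convergent"
    if abs(r) < 0.3:
        return "discriminant"
    return "intermediate"


def is_acceptable_reliability(coefficient: float) -> bool:
    """Cronbach's alpha or ICC is considered acceptable when > 0.7."""
    return coefficient > 0.7


if __name__ == "__main__":
    print(classify_srm(0.64))          # moderate
    print(classify_correlation(0.72))  # convergent
    print(is_acceptable_reliability(0.72))  # True
```

Keeping the thresholds in one place makes it easier to apply the same criteria across instruments and studies.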

Table 1

SF-36: assessment of instrument properties in SLE

Table 2

SF-36: responsiveness and ability to detect change in SLE (within-group)

Results

SF-36

Content validity

Content validity of SF-36 in patients with SLE was examined in two studies that included factor analysis of its domains. Results from the first study indicated that there were four significant factors with eigenvalues >3.3 that could be meaningfully interpreted, including ‘physical functioning’, ‘physical and emotional role functioning’, ‘mental and social health’ and ‘general health’.19 The second assessed a Chinese version of SF-36 and demonstrated that the eight-domain structure of SF-36 was supported with the overall factor loadings as ‘Physical Functioning’, ‘Role-Physical’, ‘Bodily Pain’ and ‘Role-Emotional’. These scales loaded cleanly onto one factor, while all other domains loaded on two factors.

Reliability (internal consistency and test-retest)

Evidence of internal consistency for SF-36 was demonstrated in three studies (online supplementary table S1), based on three different language versions, including Japanese,20 Chinese21 and English.19 Cronbach’s α for domains ranged from 0.72 to 0.96. Test-retest reliability of SF-36 (Spearman’s rank correlation) measured in these three studies ranged from 0.65 to 0.90 (table 1).
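For readers unfamiliar with how Cronbach’s α is obtained, a minimal sketch follows. It uses the standard formula α = k/(k−1) × (1 − Σ item variances / variance of total scores); the function name and the toy data are hypothetical and are not taken from any of the cited studies.

```python
from statistics import variance


def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha for one scale.

    `items` holds one list per item; each inner list has one score per
    respondent, in the same respondent order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    n = len(items[0])
    if any(len(scores) != n for scores in items):
        raise ValueError("every item needs a score for every respondent")

    # Total scale score per respondent (sum across items).
    totals = [sum(respondent) for respondent in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


if __name__ == "__main__":
    # Hypothetical two-item scale answered by four respondents.
    toy_items = [[1, 2, 3, 4], [2, 1, 4, 3]]
    print(round(cronbach_alpha(toy_items), 2))  # 0.75
```

With these toy data α is 0.75, which would clear the >0.7 threshold used in this review.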

Construct and known-group validity

Six studies22–27 assessed convergent validity of SF-36 (online supplementary table S2). Convergent validity with the LupusQoL was tested and established in five studies and demonstrated correlations of 0.48–0.83 between comparable domains in both questionnaires (SF-36/LupusQoL: Physical Functioning and Physical Health, Role-Emotional and Emotional Health, Bodily Pain and Pain, and Vitality and Fatigue).22–25 27 Comparison of component summary scores (Physical Component Summary (PCS) and Mental Component Summary (MCS)) and EQ-5D values indicated that PCS Score correlated more strongly than MCS Score with both EQ-5D (r=0.72 vs 0.49) and EQ-5D Visual Analogue Scale (VAS) (r=0.61 vs 0.37).26 Two studies examined the divergent validity of SF-36 by comparing individual domain scores with SLE Disease Activity Index 2000 (SLEDAI-2K) Scores and showed no significant correlations between the two measures.20 27 Using the Systemic Lupus International Collaborating Clinics Damage Index (SDI), Baba et al20 reported weak-to-moderate inverse correlations between SDI and SF-36 domain scores (r=−0.08 to −0.47, p<0.05). Two other studies that used SDI as a comparator indicated no significant correlations with SF-36 domains.19 26 Two studies correlated results from the British Isles Lupus Assessment Group (BILAG) general and organ subscales with those from SF-36. Thumboo et al21 reported correlations ranging from −0.34 to 0.17 across SF-36 domains and BILAG General; and Thumboo et al19 reported correlations between −0.07 and −0.36 (all p<0.05 except for Physical Functioning and General Health) (online supplementary table S3). One study reported known-group validity of SF-36 (online supplementary table S4). Results indicated that SF-36 scores could differentiate between neuropsychiatric events attributed to SLE and non-SLE causes. Changes in SF-36 component summary and domain scores, particularly those related to mental health, were strongly associated with the clinical outcome of neuropsychiatric events in SLE (p<0.05 except for Role Physical) (table 1).28

Longitudinal responsiveness

Five identified studies assessed the ability of SF-36 to detect changes over time (table 2).

Baba et al20 assessed responsiveness of SF-36 over 1 year to clinical worsening defined by a change of ≥3 in the SLEDAI-2K or damage accrual defined by a change of ≥1 in SDI. Using SLEDAI-2K as an anchor of change in disease activity, effect sizes (ES) and SRM for SF-36 domain and component summary scores were generally <0.20 in patients with clinical worsening and those who remained clinically unchanged. ES and SRM for social functioning and MCS Scores (≤−0.20) suggested low responsiveness in patients whose SLEDAI-2K Scores changed by ≥3. Using the SDI≥1 (reflecting the accrual of damage compared with last assessment) as an anchor of accrued damage over time, ES and SRM values for SF-36 domain and component summary scores were generally <0.19 in patients with evidence of damage worsening and patients who remained without damage change.20
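The two responsiveness statistics reported here differ only in their denominator. The sketch below assumes the conventional definitions (ES = mean change divided by the SD of baseline scores; SRM = mean change divided by the SD of the change scores), which the cited studies may have applied with variations; the function names and scores are hypothetical.

```python
from statistics import mean, stdev


def _changes(baseline: list[float], follow_up: list[float]) -> list[float]:
    """Within-patient change scores (follow-up minus baseline)."""
    if len(baseline) != len(follow_up):
        raise ValueError("baseline and follow-up must be paired")
    return [after - before for before, after in zip(baseline, follow_up)]


def effect_size(baseline: list[float], follow_up: list[float]) -> float:
    """Effect size: mean change divided by the SD of baseline scores."""
    return mean(_changes(baseline, follow_up)) / stdev(baseline)


def standardised_response_mean(baseline: list[float], follow_up: list[float]) -> float:
    """SRM: mean change divided by the SD of the change scores."""
    changes = _changes(baseline, follow_up)
    return mean(changes) / stdev(changes)


if __name__ == "__main__":
    # Hypothetical domain scores for four patients at two visits.
    visit_1 = [40, 50, 60, 70]
    visit_2 = [45, 52, 66, 71]
    print(round(effect_size(visit_1, visit_2), 2))
    print(round(standardised_response_mean(visit_1, visit_2), 2))
```

Because the SRM is scaled by the variability of change rather than of baseline scores, the two statistics can diverge substantially when patients change by similar amounts despite a wide spread of baseline scores.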

In another study, patients completed a Global Rating of Change (GRC) assessment and mean SF-36 domain scores were calculated for those with worsening (scores −7 to −2) versus improvement (scores 2 to 7). In all SF-36 domains, scores were significantly lower in the deterioration and unchanged groups versus those with improvements.23 Devilliers et al29 also assessed responsiveness of SF-36 domains using a 100 mm VAS of change in lupus-related health status over the past 3 months as the anchor. Patients reporting a difference ≥+0.5 SD were considered to report worsening health and those with a VAS difference ≥−0.5 SD improving health. For patients reporting improving health, significant improvements in Role Physical, Bodily Pain, Social Functioning, Role Emotional, and both PCS and MCS Scores were reported. In those reporting worsening health, significant decreases in Physical Functioning, Role Physical, Bodily Pain, Mental Health, Vitality and General Health domain scores and in PCS Scores were evident. Hanly et al28 examined the ability of SF-36 to detect change using a physician-completed 7-point scale assessing neuropsychiatric events. Patients with neuropsychiatric improvement reported significant increases in PCS and MCS Scores. The responsiveness of SF-36 has also been assessed using SLEDAI-2K with improvements defined as reductions ≥4 from the previous visit, and worsening as increases ≥4. Among patients with clinical worsening, SRMs of 0.64 were noted for Role Physical, 0.42 for Social Functioning and 0.30 for PCS Scores. Among those who improved, SRMs were 0.60 for MCS, 0.43 for Mental Health, 0.40 for General Health, 0.30 for Vitality, 0.30 for Role Physical, 0.24 for Social Functioning and 0.23 for Physical Functioning.27

Thresholds of meaning of SF-36 Scores

Four studies assessed MCIDs (online supplementary table S5) or MID (online supplementary table S6) of SF-36. In one study, MCID was estimated using a patient-reported overall health status anchor: ‘How would you describe your overall status since your last visit?’ Response options included much better, somewhat better, about the same, somewhat worse and much worse. Those self-rated as somewhat better or somewhat worse were considered the ‘minimally changed’ subgroups. MCID for SF-36 was 2.1 (somewhat better) and −2.2 (somewhat worse) for PCS and 2.4 (somewhat better) and −1.2 (somewhat worse) for MCS Scores.30 In a second study, MCIDs for domains and component summary scores of SF-36 were based on the 15-point global change scale (Guyatt feeling thermometer) corresponding to an improvement by a score of 6: ‘a little better’ and worsening by a score of 10: ‘a little worse’. Clinically important improvement ranged from 6.7 to 11.4 points for domain scores and 3.4 to 4.9 for PCS. Clinically important worsening ranged from −14.7 to −1.7 points for domain scores and from −2.1 to −0.8 for PCS and MCS, respectively.31 MCIDs for improvement were then defined as 2.5 for PCS and MCS and 5.0 for domain scores; for deterioration −1.8 for PCS and MCS and −2.5 for domain scores, respectively. McElhone et al23 examined MID using both anchor-based and distribution-based methods. For deterioration, mean MID ranged from −2.0 for General Health to −11.1 for Role Physical domain scores. For improvement, they ranged from 2.8 for General Health to 10.9 for both Bodily Pain and Vitality. MIDs were larger using distributional versus anchor-based approaches. The MID for SF-36 has also been estimated as the mean change observed in the minimally improved and the minimally worse categories defined by a 7-point Likert Scale (−3, much improved; −2, moderately improved; −1, minimally improved; 0, the same; +1, minimally worse; +2, moderately worse and +3, much worse). MID for global improvement ranged from 1.9 to 11.3 for SF-36 domain scores. In patients reporting worsening, MIDs ranged from −4.4 to −15.6.29
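The anchor-based estimates above share one idea: the threshold is the mean score change among patients who place themselves in a ‘minimally changed’ anchor category. A minimal sketch of that calculation follows; the function name, the anchor labels and the change scores are hypothetical and do not reproduce any cited dataset.

```python
from statistics import mean


def anchor_based_threshold(
    changes: list[float], anchors: list[str], minimal_category: str
) -> float:
    """Mean score change among patients in one anchor category.

    `changes` are within-patient change scores; `anchors` are the patients'
    own ratings of change, in the same order.
    """
    if len(changes) != len(anchors):
        raise ValueError("each change score needs an anchor rating")
    selected = [c for c, a in zip(changes, anchors) if a == minimal_category]
    if not selected:
        raise ValueError(f"no patients rated themselves {minimal_category!r}")
    return mean(selected)


if __name__ == "__main__":
    # Hypothetical PCS change scores with self-rated change since last visit.
    pcs_change = [2.0, 3.0, -1.0, -3.0, 0.5, 8.0]
    rating = [
        "somewhat better", "somewhat better",
        "somewhat worse", "somewhat worse",
        "about the same", "much better",
    ]
    print(anchor_based_threshold(pcs_change, rating, "somewhat better"))  # 2.5
    print(anchor_based_threshold(pcs_change, rating, "somewhat worse"))   # -2.0
```

Estimating improvement and worsening separately, as in this sketch, allows the two thresholds to differ in magnitude, which is what the studies summarised above report.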

Discrimination of SF-36 in RCTs and LOS

SF-36 has been the most frequently used HRQoL instrument in SLE trials (online appendix table A2). Twenty-six RCTs examined the impact of treatment on SF-36 results. Several examples are summarised in this section. Results from two of three trials that evaluated abetimus sodium indicated that SF-36 reflected clinical improvements in SLE accompanied by improvements ≥MCID in SF-36.31–33 Furie et al34 studied the safety and efficacy of belimumab 1 mg/kg and 10 mg/kg in patients with SLE and found significantly more SLE Responder Index (SRI-4) responders in the 10 mg/kg group versus placebo (p=0.017), but this difference was not sustained over 76 weeks. Results from a post hoc analysis of SRI-4 responders versus non-responders in these trials indicated that PCS, MCS and all SF-36 domain scores were significantly greater in SRI responders, across treatment groups, versus non-responders (p<0.001).35 Both belimumab groups also reported similar improvements in SF-36 domain scores at week 52 versus placebo. Secondary analyses of these two RCTs indicated that changes from baseline to week 52 in SF-36 PCS Scores were significantly greater (p<0.05) in the belimumab arms versus placebo.1 36

Seven RCTs examined the impact of physical activity, psychotherapy and alternate medicine interventions in patients with SLE with the objective of exploring improvements in SF-36. In three, physical training showed significant improvements in SF-36 Vitality and Role Physical Scores; however, significant differences in clinical improvements between treatment and control groups were reported in only three trials.37–39 Psychotherapy and cognitive behavioural therapy were tested in three RCTs. Improvements in SF-36 MCS Scores were demonstrated in two trials and clinical improvement demonstrated in one.40–42 Greco et al43 studied the benefits of acupuncture in reducing pain and fatigue in patients with SLE and reported significant improvements in SF-36 Bodily Pain and Vitality domains.

LupusQoL

Content validity

The LupusQoL was developed from qualitative interviews with patients with SLE and input from clinical experts, refined through cognitive interviews supporting content validity, and followed by two rounds of psychometric testing.4 Patients reported that most items were relevant, easy to understand and answer, and reflected their HRQoL. Factor analysis in both English-speaking and Spanish-speaking populations confirmed the eight-domain structure.4 44 The original derivation of LupusQoL involved only women of two ethnicities (white and South Indian); subsequent validations included a wider population as well as men.4

Reliability (internal consistency and test-retest)

Internal consistency of LupusQoL domains was assessed in three studies44–46 and Cronbach’s α ranged from 0.85 to 0.94 across studies and domains (online supplementary table S7). Assessment of the test-retest reliability of LupusQoL indicated intraclass correlations (ICCs) ranging from 0.68 to 0.95 (online supplementary table S7),4 45 46 (table 3).
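As an illustration of how a test-retest ICC can be computed from two administrations of the same questionnaire, the sketch below implements the one-way random-effects form, ICC(1,1). This form is an assumption chosen for simplicity; the cited studies do not state here which ICC model they used, and the function name and data are hypothetical.

```python
from statistics import mean


def icc_oneway(test: list[float], retest: list[float]) -> float:
    """One-way random-effects ICC(1,1) for two administrations per patient."""
    if len(test) != len(retest) or len(test) < 2:
        raise ValueError("need paired scores from at least two patients")
    n, k = len(test), 2
    pairs = list(zip(test, retest))
    grand_mean = mean(score for pair in pairs for score in pair)
    patient_means = [mean(pair) for pair in pairs]

    # Between-patient and within-patient mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in patient_means) / (n - 1)
    ms_within = sum(
        (score - m) ** 2 for pair, m in zip(pairs, patient_means) for score in pair
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


if __name__ == "__main__":
    # Hypothetical domain scores at first and repeat administration.
    first = [1, 2, 3]
    repeat = [2, 3, 4]
    print(round(icc_oneway(first, repeat), 2))  # 0.6
```

In this toy example every patient scores one point higher at retest; the one-way form penalises that systematic shift, so the ICC (0.6) falls below the 0.7 threshold even though the rank order of patients is unchanged.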

Table 3

LupusQoL: assessment of instrument properties in SLE

Construct and known-group validity

Six studies assessed the convergent validity of LupusQoL (online supplementary table S8). Five used SF-36 for assessment of construct validity, finding moderate-to-strong correlations between LupusQoL and the corresponding SF-36 domains (r=0.38 to 0.83). The LupusQoL also correlated strongly with the Systemic Lupus Activity Questionnaire (SLAQ) Symptom Scale (r=−0.70 to −0.76), EQ-5D Analogic Scale (r=0.76 to 0.80) and comparable EQ-5D domains (r=0.50 to 0.68). Associations between LupusQoL domain scores and SLEDAI-2K Scores were small and non-significant (r=−0.02 to 0.25) confirming divergent validity with SLE disease activity.27

Two studies examined known-group validity of LupusQoL in patients with SLE (online supplementary table S9). In one, mean LupusQoL domain scores, with the exception of Intimate Relationships, did not significantly differentiate between improved versus same/worsened groups on SLEDAI-2K (p>0.05 for all domains).27 Results from a second study indicated that the LupusQoL discriminated among groups of patients in different disease activity categories based on either BILAG Index or SDI Scores (those with SDI=0 and SDI≥1).4

Longitudinal responsiveness

Four validation studies assessed ability of LupusQoL to detect change (table 4).

Table 4

LupusQoL: responsiveness and ability to detect change in SLE

Assessment of responsiveness of LupusQoL in SRI-4 responders versus non-responders indicated that only Physical Health and Pain domains of LupusQoL were responsive.47 Results from patients who completed a GRC assessment and LupusQoL indicated that all LupusQoL domain scores were significantly worse in those with deterioration versus improvement.23 Evaluation of responsiveness of LupusQoL domains in patients with improved or worsened health status measured by a 100 mm VAS indicated that patients with improving health reported significant improvements in LupusQoL Physical Health, Pain, Emotional Health and Fatigue. Physical Health, Planning and Fatigue domain scores declined significantly in those with worsening health.29

Evaluation of LupusQoL using SLEDAI-2K over 30 days as an anchor indicated that LupusQoL displayed responsiveness in some domains, as determined by ES and SRM estimates. Among patients who improved, there were moderate effects in Pain, Fatigue and Physical Health, and small effects in Emotional Health, Body Image, Burden to Others and Planning. Among patients who worsened, a moderate-effect SRM was found in Fatigue and a small effect in Burden to Others.27

Thresholds of meaning of LupusQoL Scores

Two studies calculated MCIDs for LupusQoL domains (online supplementary table S10). In one anchor-based analysis, patient GRC was used as the anchor (improvement MCID (McElhone)=GRC of 2 or 3; deterioration MCID (McElhone)=GRC of −3 or −2). For deterioration, mean LupusQoL domain scores ranged from −2.4 for Body Image to −8.7 for Intimate Relationships, and for improvement from 3.5 for Body Image to 7.3 for Burden to Others. Using distribution-based approaches based on 0.5 SD, LupusQoL domain MCIDs (McElhone) ranged from 12.9 (Emotional Health) to 16.7 (Intimate Relationships).23 Results from a second study by Devilliers et al that used a patient-reported anchor-based approach (7-point Likert Scale of change in health status over the past 3 months, a 100 mm VAS assessing impact of illness, and Likert Scale from 0 (no problem) to 3 (severe problem) exploring patient-reported symptoms) indicated minimally improved domain scores ranging from 1.1 to 9.2 while minimally worsened scores ranged from −0.5 to −6.4.29
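The distribution-based approach mentioned above sets the threshold at a fixed fraction of the score standard deviation. A minimal sketch follows, assuming the SD is taken from baseline scores (the source does not specify which SD was used); the function name and scores are hypothetical.

```python
from statistics import stdev


def half_sd_threshold(baseline_scores: list[float], fraction: float = 0.5) -> float:
    """Distribution-based threshold: a fraction (default 0.5) of the SD.

    Assumption: the SD of baseline scores is used as the reference spread.
    """
    if len(baseline_scores) < 2:
        raise ValueError("at least two scores are needed to estimate an SD")
    return fraction * stdev(baseline_scores)


if __name__ == "__main__":
    # Hypothetical baseline domain scores on a 0-100 scale.
    scores = [40, 50, 60, 70]
    print(round(half_sd_threshold(scores), 2))
```

Because this threshold depends only on the spread of scores and not on any patient rating, it can differ markedly from anchor-based estimates for the same domain, as the ranges reported above show.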

The different MCIDs defined by McElhone et al23 were used in a recent prospective study of 78 clinically active patients with SLE22 to compare the performance of each MCID in determining worsening and improvement measured by LupusQoL. Results indicated that the percentage of patients reporting improvement or worsening across domains varied between MCID definitions. For most domains, percentages of patients reporting changes (improvement or worsening) were greater for the MCID defined by Devilliers et al29 versus those from McElhone et al.23

Discrimination of LupusQoL in RCTs and LOS

Only two RCTs used LupusQoL (online appendix table A2). In the EMBODY 1 and EMBODY 2 phase III trials in which the primary end point was not achieved (ie, no significant differences between groups in BILAG-based Combined Lupus Assessment responses at week 48), there were also no significant between-group differences with LupusQoL.48 Results from a small trial of Acthar Gel in 10 patients with SLE indicated improvements in SLEDAI-2K and LupusQoL scores over 28 days.49

FACIT-F

Content validity

While FACIT-F50 was not developed in patients with SLE, the content validity of the instrument has been confirmed in this patient population.51 Three 90 min focus groups, each including six to eight patients with SLE, were conducted to determine if FACIT-F included all aspects of fatigue relevant to these patients. Overall, the content of FACIT-F was relevant for capturing fatigue in patients with SLE and no changes to the instrument were suggested.

Reliability

Internal consistency testing of FACIT-F52 indicated that Cronbach’s α was >0.95 (online supplementary table S11) (table 5).

Table 5

FACIT-F: assessment of instrument properties in SLE

Test-retest reliability of FACIT-F has been studied in other disease states. Yellen et al5 developed and validated a measurement system for oncology patients with anaemia-related concerns. The FACT-Fatigue (FACT-F), consisting of the Functional Assessment of Cancer Therapy-General (FACT-G) plus 13 fatigue items, and the FACT-Anaemia (FACT-An), consisting of the FACT-F plus seven non-fatigue items, were found to be stable (test-retest r=0.87 for both) in the 50 patients studied. Chandran et al53 studied the reliability and validity of the FACIT-F Scale in psoriatic arthritis. The ICC for first and repeat FACIT-F Scores was 0.95 in 73 patients. The FACIT-F was also studied in patients with inflammatory bowel disease by Tinsley et al.54 The ICC for first and repeat FACIT-F Scores assessed within 180 days without change in disease state was 0.81 (Crohn’s disease 0.78; ulcerative colitis 0.87).

Finally, in patients with cancer of the head and neck, Eden et al55 established the test-retest reliability and concurrent validity of FACIT-F in 65 patients. The FACIT-F ICC was 0.866 (0.75–0.93) and internal consistency was 0.874. Nevertheless, test-retest reliability has not been studied in patients with SLE.

Construct and known-group validity

Three RCTs assessed construct validity of FACIT-F (online supplementary table S12). Secondary analysis of pooled data across BLISS-52 and BLISS-76 RCTs indicated a strong correlation (r=0.70) between FACIT-F and SF-36 Vitality domain.56 Analysis of results from a longitudinal study demonstrated construct validity of FACIT-F across time with moderate-to-strong correlations between FACIT-F and SF-36 Vitality domain, PCS and MCS scores, as well as pain intensity, pain interference, patient global assessment and SLAQ Scores (r=0.52 to 0.87).52 Results from a third study indicated moderate correlations between FACIT-F and both SLAQ and Patient Global Assessment scores (r=0.49 to 0.59)57 (table 5).

Known-group validity for FACIT-F was assessed in a longitudinal study by calculating mean FACIT-F Scores after stratifying by BILAG Musculoskeletal and General scores at baseline and week 12 (online supplementary table S12). FACIT-F significantly differentiated groups defined using BILAG anchors at both time points with ESs ranging from 0.22 to 0.65.52

Longitudinal responsiveness

Two validation studies assessed ability of FACIT-F to detect change in patients with SLE.

In one, patients were classified as SRI-4 responders or non-responders at week 52 across all treatment groups in two RCTs. FACIT-F Scores were significantly higher in SRI responders versus non-responders. Improvements in the responder group exceeded the 4-point MCID the authors defined as their meaningful threshold score.35 In the second study, patients were classified as improved, worsened or unchanged using BILAG Musculoskeletal, BILAG General and Patient Global Assessment of Change Scores and those classified as improved also reported significant mean improvements in FACIT-F Scores.50

Thresholds of meaning of FACIT-F Scores

Two studies calculated the MCID of the FACIT-F in patients with SLE (online supplementary table S13). Lai et al52 used FACIT-F Scores derived from responsiveness analyses, as well as multiple distribution-based measures and MCID from these analyses were estimated to be 3–7 points. The authors concluded the likely MCID of FACIT-F in SLE to be in the range of 3–4 points.52 Goligher et al57 estimated MCID of FACIT-F in SLE as the mean difference between the fatigue instrument scores between patients reporting ‘a little bit more’ fatigue (referred to as Greater Fatigue) and their interview partner. MCID was calculated by estimating the mean difference between patients reporting ‘a little bit less’ fatigue (referred to as Less Fatigue) and their interview partner. Using this method, Greater Fatigue MCID was 17.5 and Less Fatigue MCID was −5.3. Regression analyses estimated MCID to be −5.9 points using the original FACIT-F scaling.57 Methods in this analysis have the potential to include a self-reference bias and an interview order effect. Further, differences between patients used to estimate the MCID may not provide valid references to interpret differences within patients (ie, within individual change), which are more appropriately derived using longitudinal data (table 6).

Table 6

FACIT-F: responsiveness and ability to detect change in SLE (within-group)

Discrimination of FACIT-F in RCT and LOS

Online supplementary table A2 presents examples of seven published RCTs and one LOS in SLE that used FACIT-F. In some RCTs, significant improvements in clinical efficacy measures corresponded with significant improvements in FACIT-F. For example, in a 52-week RCT of blisibimod versus placebo, significant improvements in SELENA-SLEDAI and FACIT-F Scores were observed with a 200 mg dose. FACIT-F Scores were also improved in patients who received 100 mg of blisibimod.58 Secondary analysis of results from the BLISS RCTs indicated that FACIT-F Scores were not significantly different across treatment groups at the week 24 prespecified secondary end point. However, FACIT-F Scores from baseline to week 52 improved significantly (p<0.05) with belimumab 1 mg/kg and 10 mg/kg versus placebo in BLISS-52, and with 1 mg/kg at week 52 and week 76 in BLISS-76. These findings corresponded with significant improvements in FACIT-F reported by SRI responders versus non-responders in a combined analysis across treatment groups of both RCTs.35 In some other RCTs, significant improvements in FACIT-F Scores were not achieved, even when the primary end point was met.59

Discussion

Measurement properties of SF-36, LupusQoL and FACIT-F in patients with SLE were examined to support their use as secondary end points supporting labelling claims in RCTs evaluating the efficacy of treatments for SLE. All three instruments demonstrated strong internal consistency reliability in an SLE population (ranging from 0.72 to 0.95 across measures, domains and studies), indicating that items within each instrument measured the intended concept. In addition, both SF-36 and LupusQoL demonstrated test-retest reliability; test-retest reliability of FACIT-F has not been assessed in patients with SLE, but acceptable (>0.7) ICCs in other rheumatic diseases have been confirmed.53 All measures also demonstrated convergent validity with other comparable PROs (correlations for SF-36: 0.37–0.83; LupusQoL: 0.38–0.83; FACIT-F: 0.52–0.68). In general, correlations between these PROs and measures of disease activity and damage such as SLEDAI-2K and SDI were low, as might be expected with physician-assessed outcomes, confirming divergent validity. This finding suggests that SF-36, LupusQoL and FACIT-F assess important underlying concepts distinct from disease activity measures.

Given the multisystemic nature of SLE, it is important to use a multidimensional approach to capture a broad array of symptoms and impacts when assessing HRQoL. Both SF-36 and LupusQoL evaluate a number of domains, including physical and mental impacts. Results from several RCTs have shown that SF-36 and FACIT-F are responsive to treatment benefit.1 LupusQoL is disease specific and has the advantage of being more sensitive to anticipated changes in the health status of a patient with SLE. It has been included in only a few trials of patients with SLE.48 49 While SLE-specific concepts covered by LupusQoL have been reported to be important to patients with SLE,60 some domains (eg, Body Image, Burden to Others) have not performed as well as those similar to SF-36 domains such as fatigue and pain. As such, SF-36 and LupusQoL should be used together; whether LupusQoL may be more sensitive and appropriate for use in SLE subpopulations (eg, those with cutaneous manifestations) can be studied in future trials. On the other hand, FACIT-F has been used in seven RCTs, demonstrating responsiveness and providing evidence that FACIT-F is able to capture treatment-related improvements in fatigue in SLE (online supplementary table A2).

Responsiveness of instruments can be studied by assessing the correlations between changes in instrument scores and external anchors of change, and by the magnitude of change statistics (eg, SRMs). Responsiveness should also be interpreted very carefully in the context of a study’s hypothesis, since these statistics have little meaning on their own. In this review, we found that SRMs in some SF-36 and LupusQoL domains indicated no change or only small magnitudes of change. Interpreting responsiveness by focusing only on the magnitude of statistics (SRMs) is not appropriate. First, SRMs should be interpreted in the context of the magnitude of change (small, moderate or large) expected in each particular study. Sometimes a change is not expected and it is acceptable to have undetectable SRMs. If a therapeutic intervention is not expected to produce a large SRM and the a priori hypothesis anticipated only a small SRM, a small SRM should be accepted as a valid result. In this situation, the a priori hypothesis holds and it should be concluded that the instrument is responsive in the studied population despite small SRMs.

Second, the magnitude of statistics of change (SRMs) also depends on the baseline characteristics of the studied patients. For example, if improvements in SF-36 domains are hypothesised to be large after a specific intervention but patients have reported good levels of HRQoL at baseline across a majority of the domains, it is unlikely that moderate-to-large SRMs will be identified. Rather than indicating that the instrument is not responsive, these results indicate that the a priori hypothesis is not valid in this group of patients. Therefore, it is difficult to assess the responsiveness of an instrument from a single study and it is unlikely that all domains will demonstrate changes in all studies. In conclusion, the magnitude of statistics (eg, SRMs) of responsiveness should always be interpreted in the context of the research question: what magnitude of change was hypothesised and what magnitude was identified?
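The comparison of hypothesised and observed magnitude of change described above can be sketched as follows. This is an illustration only: the SRM is taken as the mean of paired change scores divided by the SD of those change scores, the cut-offs of 0.2, 0.5 and 0.8 are the conventional Cohen-style bands and are an assumption rather than thresholds drawn from the reviewed studies, and all function names and scores are hypothetical.

```python
from statistics import mean, stdev


def standardised_response_mean(baseline, follow_up):
    """SRM = mean of paired change scores / SD of those change scores."""
    if len(baseline) != len(follow_up):
        raise ValueError("baseline and follow_up must be paired")
    if len(baseline) < 2:
        raise ValueError("at least two patients are required")
    changes = [after - before for before, after in zip(baseline, follow_up)]
    sd = stdev(changes)
    if sd == 0:
        raise ValueError("SRM is undefined when every patient changes equally")
    return mean(changes) / sd


def magnitude(srm):
    """Label an SRM using assumed Cohen-style bands (0.2, 0.5, 0.8)."""
    size = abs(srm)
    if size < 0.2:
        return "trivial"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "moderate"
    return "large"


def consistent_with_hypothesis(srm, hypothesised_magnitude):
    """True when the observed band matches the a priori expected band."""
    return magnitude(srm) == hypothesised_magnitude


# Hypothetical domain scores for four patients before and after treatment.
baseline_scores = [50, 55, 60, 65]
follow_up_scores = [49, 55, 61, 67]
observed_srm = standardised_response_mean(baseline_scores, follow_up_scores)

# A small SRM supports responsiveness only if a small change was hypothesised.
supports_small = consistent_with_hypothesis(observed_srm, "small")
supports_large = consistent_with_hypothesis(observed_srm, "large")
```

The design point is that `magnitude` alone says nothing about responsiveness; it is the agreement between the observed band and the band stated in advance that carries the interpretation.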

The use of PROs in SLE is essential and complements the assessment and management of patients with SLE. HRQoL in SLE is measured by generic questionnaires (eg, SF-36, FACIT, EQ-5D) and SLE-specific questionnaires (eg, LupusQoL, L-QoL, SLE-QOL, LupusPRO and LIT). While this review focused on only three measures (SF-36, LupusQoL and FACIT-F), we will review the psychometric properties of all other HRQoL measures in future work under the auspices of OMERACT to reconsider the domains for the core outcome set in SLE.

In conclusion, available evidence of the measurement properties of SF-36, LupusQoL and FACIT-F in patients with SLE supports the use of these instruments as secondary end points to support labelling claims in RCTs evaluating the efficacy of treatments for SLE.

Footnotes

  • Contributors The work reviewed in this manuscript was conducted under the auspices of the SLE working group of OMERACT (VS, LSS, SM and ZT), with the SLE international experts (John Esdaile (Canada), Martin Aringer (Germany), Matthias Schneider (Germany), Anca Askenaze (USA), Rosalind Ramsey-Goldman (USA), George Karpouzas (USA), Alfred Kim (USA), Julian Thumboo (Singapore), Eric Morand (Australia), David Tunnicliffe (Australia), Roger Levy (Brazil/USA), Edward Vital (UK), Ian Bruce (UK)), the Patient Research Partners (PRPs) (Kirsten Lerstrom (Denmark), Francesca Marchiori (Italy), Davide Mazzoni (Italy), Nelma Nimaut (Brazil), Eduardo Ferreira Borba Neto (Brazil), Izabel Oliveiras (Brazil), Amy Reynolds (Australia), Corry Ang (Australia), Adwoa Parker (UK and MRC), Karina Svalya (Arthritis Research Canada), Debra Hurst (Canada), Rowena Rodriguez (Canada), Kelsey Schmitt (USA), Lucretia Taper (USA), Janeen Mays (USA)) and SLE sponsor reviewers from Amgen (Brian Ortemeier and Brad Stolshek), AstraZeneca (David Ginkel, Micki Hultquist and Ogun Sasova), EMD Serono (Amy Kao and Stephen Wax), Pfizer (Connie Chen, Sam Zwillich and Noriko Likuni), Janssen (Chetan Karyekar, Mark Chevrier and Pamela Barry), Lilly (Julie Birt) and Xencor (Debra Zack). All members from the SLE working group, the PRPs and SLE sponsor reviewers were involved in all steps of this review.

  • Competing interests None declared.

  • Patient and public involvement Patients and/or the public were involved in the design, or conduct, or reporting, or dissemination plans of this research. Refer to the Methods section for further details.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement All data relevant to the study are included in the article or uploaded as supplementary information. All data reviewed as part of this systematic review have been provided within the tables and supplementary materials.
