Design and participants
FACIT-F validation was based on data from two RCTs of tofacitinib in patients with AS, details of which have been published previously [15, 16]. The first was a 16-week (12-week treatment, 4-week follow-up) phase 2, placebo-controlled, dose-ranging study (NCT01786668; hereafter referred to as Study 1) of tofacitinib 2, 5, or 10 mg twice daily (BID) in patients (N = 207) with active AS [15]. The second was a 48-week phase 3 trial (NCT03502616; hereafter referred to as Study 2) of tofacitinib 5 mg BID in patients (N = 269) with active AS. Study 2 had a 16-week placebo-controlled double-blind phase; from Weeks 16–48, all patients received open-label tofacitinib [16].
In both studies, patients were aged ≥ 18 years, had a diagnosis of AS, and fulfilled the modified New York criteria for AS, documented by central reading of the radiograph of the sacroiliac joints. All patients had active disease at screening and baseline (defined as Bath AS Disease Activity Index [BASDAI] score ≥ 4 and back pain score [BASDAI question 2] ≥ 4), and an inadequate response or intolerance to ≥ 2 NSAIDs. Patients could continue the following (stable) background therapies: NSAIDs; methotrexate (≤ 20 [Study 1] or ≤ 25 [Study 2] mg/week); sulfasalazine (≤ 3 g/day); and oral corticosteroids (< 10 [Study 1] or ≤ 10 [Study 2] mg/day of prednisone or equivalent).
This post hoc analysis used FACIT-F data from both RCTs, captured from all treatment groups (using the 13-item FACIT-F questionnaire [Additional file 1: Appendix 4, Fig. S1]) at baseline, and at Weeks 2, 4, 8, and 12 (both studies), and Week 16 (Study 2 only). The percentage of missing FACIT-F items in the studies was negligible.
Patient and public involvement
Patients were not directly involved in the design, recruitment, or conduct of the clinical studies. Studies were conducted in accordance with the Declaration of Helsinki and International Council for Harmonisation Guidelines for Good Clinical Practice and were approved by the institutional review board and/or independent ethics committee for each study center. Written, informed consent was provided by patients.
Psychometric analyses
Measurement model assessment
The FACIT-F scale is a 13-item questionnaire that evaluates an individual’s self-reported fatigue during their usual daily activities over the past week [10, 11]. The 13 items fall into either an Experience (5 items) or Impact (8 items) domain [11]. Experience items evaluate patients’ perceptions and severity of feeling, including tiredness, energy level, weakness, fatigue, and listlessness, while Impact items evaluate how fatigue affects an individual’s daily functioning [11]. Each item is presented with a 5-point Likert scale ranging from 0 (‘not at all’) to 4 (‘very much’). After recoding so that negatively phrased items are reverse scored, items are summed to calculate a FACIT-F total score ranging from 0 to 52, with higher scores representing less fatigue [13, 17].
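The scoring rule described above can be sketched as follows. This is a minimal illustration, not the official scoring algorithm: the function name `facit_f_total` is hypothetical, and the set of positively phrased item positions (7 and 8) is an assumption that should be checked against the official FACIT scoring guidelines. Missing items are not handled.

```python
from typing import Sequence

# Assumption: 1-based positions of the positively phrased items, which keep
# their raw score. All other items are negatively phrased and are reversed.
POSITIVELY_PHRASED = frozenset({7, 8})


def facit_f_total(responses: Sequence[int],
                  positively_phrased=POSITIVELY_PHRASED) -> int:
    """Sum 13 item responses (each 0-4) into a 0-52 total score.

    Negatively phrased items are reverse scored (4 - response), so a higher
    total represents less fatigue.
    """
    if len(responses) != 13:
        raise ValueError("FACIT-F has exactly 13 items")
    total = 0
    for position, response in enumerate(responses, start=1):
        if not 0 <= response <= 4:
            raise ValueError(f"item {position}: response must be 0-4")
        if position in positively_phrased:
            total += response          # keep raw score
        else:
            total += 4 - response      # reverse score
    return total
```

Under these assumptions, a patient who answers ‘very much’ to every negatively phrased item and ‘not at all’ to both positively phrased items scores 0, the most severe fatigue on the scale.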
The current FACIT-F measurement model is represented by both domains and the total score. In the measurement model assessments, the latent construct ‘Experience’ (represented by the first-order factor f1) includes items 1, 2, 3, 4, and 7 of FACIT-F, and the latent construct ‘Impact’ (represented by the first-order factor f2) includes the remaining 8 items [11]. The latent aggregated factor (represented by the second-order factor f3) comprises both the Experience and Impact domains (Additional file 1: Appendix 4, Fig. S2).
Second-order confirmatory factor analysis (CFA) modeling tested the measurement structure of FACIT-F using Study 1 and 2 data (Additional file 1: Appendix 4, Fig. S2). As an indication of whether the model fits the data, the following simultaneous criteria were used [18]: 1) Bentler’s comparative fit index (CFI) > 0.90; 2) unstandardized path coefficients were statistically significant (p < 0.05); and 3) standardized path coefficients were > 0.40 and statistically significant (p < 0.05).
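Because the three criteria must hold simultaneously, the decision rule reduces to a single conjunction. The sketch below assumes the CFI and path estimates have already been obtained from fitted CFA output; the `PathEstimate` container and the `model_fits` function are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PathEstimate:
    """One path of the fitted model (hypothetical container)."""
    unstd_p: float    # p-value of the unstandardized coefficient
    std_coef: float   # standardized coefficient
    std_p: float      # p-value of the standardized coefficient


def model_fits(cfi: float, paths: Iterable[PathEstimate],
               alpha: float = 0.05) -> bool:
    """Return True only if all three criteria are met at once."""
    paths = list(paths)
    cfi_ok = cfi > 0.90
    unstd_ok = all(p.unstd_p < alpha for p in paths)
    std_ok = all(p.std_coef > 0.40 and p.std_p < alpha for p in paths)
    return cfi_ok and unstd_ok and std_ok
```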
To support the dimensionality of the FACIT-F scale, supplemental analyses using bifactor CFA modeling were also performed. In a bifactor model, each item loads on a single general (overall) factor and on exactly one nuisance (domain) factor. Further details of the methodology are described in Additional file 1: Appendix 1.
Internal consistency reliability
Internal consistency reliability was assessed using Cronbach’s coefficient alpha (α) and corrected item-to-total correlations. A Cronbach’s coefficient α ≥ 0.70 [19] and an item-to-total correlation ≥ 0.40 [18] were defined as acceptable.
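Both statistics can be computed directly from a patients-by-items matrix of recoded item scores. The following is a minimal sketch using the standard formulas, assuming complete data (one row per patient, one column per item); the function names are illustrative.

```python
from math import sqrt
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cronbach_alpha(rows):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(rows[0])
    item_var_sum = sum(variance(col) for col in zip(*rows))
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


def corrected_item_total(rows):
    """Correlate each item with the sum of the remaining items."""
    result = []
    for j in range(len(rows[0])):
        item = [r[j] for r in rows]
        rest = [sum(r) - r[j] for r in rows]  # total excluding item j
        result.append(pearson(item, rest))
    return result
```

The correction in `corrected_item_total` removes the item from the total before correlating, so that an item is not correlated with itself.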
Test–retest reliability
Test–retest reliability was estimated with intraclass correlation coefficients (ICCs), using baseline and Week 2 data and a one-way random model (absolute agreement) [18, 20]. An ICC ≥ 0.70 was defined as acceptable [21]. Because patients were receiving a treatment intervention between the two assessments, the analysis was restricted to a subgroup of ‘stable’ patients, defined using PtGA scores captured during the primary studies. Patients were asked to score their overall disease activity over the last week using a numerical rating scale between 0 (‘no disease activity’) and 10 (‘very active disease’) in response to the question, “How active was your spondylitis on average during the last week?”. Patients were not made aware of their baseline scores during assessment at Week 2. Test conditions and administration were consistent across visits. Two models were investigated, with only ‘stable’ patient data used in each: Model A assumed that ‘stable’ patients had the same PtGA score at baseline and Week 2; Model B was more ‘relaxed’ and assumed that the PtGA score of ‘stable’ patients could change by no more than 1 point from baseline to Week 2.
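The two stability definitions and the one-way random ICC can be illustrated as follows. This is a sketch under stated assumptions: the record layout (PtGA at baseline, PtGA at Week 2, FACIT-F at baseline, FACIT-F at Week 2) and the function names are hypothetical, and the ICC uses the standard single-measurement one-way formula, (MSB − MSW) / (MSB + (k − 1) MSW).

```python
from statistics import mean


def stable_pairs(records, max_ptga_change):
    """Keep FACIT-F pairs for patients whose PtGA changed by at most
    max_ptga_change points (0 for Model A, 1 for Model B)."""
    return [(facit_bl, facit_w2)
            for ptga_bl, ptga_w2, facit_bl, facit_w2 in records
            if abs(ptga_w2 - ptga_bl) <= max_ptga_change]


def icc_oneway(pairs):
    """One-way random, single-measurement ICC from per-patient score tuples."""
    n = len(pairs)          # number of patients
    k = len(pairs[0])       # measurements per patient (2 here)
    grand = mean(v for p in pairs for v in p)
    # Between-patient and within-patient mean squares from one-way ANOVA
    msb = k * sum((mean(p) - grand) ** 2 for p in pairs) / (n - 1)
    msw = sum((v - mean(p)) ** 2 for p in pairs for v in p) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

For example, `icc_oneway(stable_pairs(records, 0))` would correspond to Model A and `icc_oneway(stable_pairs(records, 1))` to Model B.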
Convergent validity
Evidence of convergent validity (the extent to which two concepts are related to one another [22]) was evaluated by the Pearson correlation coefficients of the FACIT-F domain/total scores with the following set of PROs from Studies 1 and 2: PtGA, total back pain/nocturnal spinal pain due to AS, Short Form-36 Health Survey version 2 (SF-36v2), Bath AS Functional Index (BASFI), BASDAI, EuroQol-5 Dimension (EQ-5D) Utility Index, and AS Quality of Life (ASQoL). Although dependent on the nature of the measures being compared, and the time points evaluated (correlations are expected to be higher following a treatment intervention than at pre-treatment or baseline), under most circumstances a correlation of 0.4–0.8 may be taken as evidence of convergent validity for the target scale under consideration (in this case, FACIT-F) [18].
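The check described above amounts to computing a Pearson correlation and comparing it with the 0.4–0.8 range. In the sketch below, comparing the magnitude of the coefficient (rather than its signed value) is an assumption made here, because higher FACIT-F scores represent less fatigue and correlations with measures scored in the opposite direction would be negative; the function names are illustrative.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length score sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def supports_convergent_validity(r, low=0.4, high=0.8):
    """True if |r| lies in the range taken as evidence of convergence.

    Assumption: the sign is ignored, since it only reflects the scoring
    direction of the two measures.
    """
    return low <= abs(r) <= high
```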
Known-groups validity
Known-groups validity was examined by evaluating differences in the reported FACIT-F domain/total scores among clinically distinct patient groups. This anchor-based approach used a repeated measures longitudinal model with the reported PtGA scores as the anchor and FACIT-F domain/total scores as the outcome. PtGA scores represented patient state from ‘no disease activity’ (PtGA score of 0) to ‘very active disease’ (PtGA score of 10).
Ability to detect change
Ability to detect change was assessed with a repeated measures longitudinal model in which change from baseline in PtGA scores at Weeks 2, 4, 8, and 12 (both studies), and Week 16 (Study 2 only), was the anchor and change from baseline in FACIT-F domain/total scores (at the same time points) was the outcome [18, 23]. The model examined the relationship between these two changes from baseline.
Defining meaningful within-patient change
Meaningful within-patient change (MWPC; i.e., meaningful improvement or deterioration from the patients’ perspective) was estimated using a repeated measures longitudinal model (the same model used to define ability to detect change).
Because 11 distinct levels of differentiation were unlikely, PtGA was transformed from a 0–10 numerical rating scale to a Patient Global Impression of Severity (PGIS) 0–4 category scale (Additional file 1: Appendix 4, Fig. S3 and Additional file 1: Appendix 3, Table S1). Thus, a 1-category difference on PGIS corresponded to a 2.5-point difference on PtGA, and a 2-category difference on PGIS corresponded to a 5.0-point difference on PtGA. As a result, MWPC was evaluated based on a 2.5-point change in PtGA, and separately, a 5.0-point change in PtGA was taken as clinically relevant change.
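The arithmetic behind this correspondence can be sketched as a proportional rescaling of the 0–10 scale onto the 0–4 scale. Two assumptions are made here and are not taken from the supplementary table: that the mapping is a simple linear rescale, and that the longitudinal model yields a single linear slope (FACIT-F change per 1-point PtGA change). The slope value in the usage note is purely hypothetical.

```python
PTGA_MAX = 10.0   # upper end of the PtGA numerical rating scale
PGIS_MAX = 4.0    # upper end of the PGIS category scale
POINTS_PER_CATEGORY = PTGA_MAX / PGIS_MAX   # 2.5 PtGA points per category


def ptga_to_pgis(ptga: float) -> float:
    """Rescale a PtGA score (0-10) onto the PGIS scale (0-4).

    Assumption: a proportional (linear) mapping.
    """
    if not 0 <= ptga <= PTGA_MAX:
        raise ValueError("PtGA must be between 0 and 10")
    return ptga / POINTS_PER_CATEGORY


def facit_change_for_pgis_shift(slope_per_ptga_point: float,
                                pgis_categories: float) -> float:
    """FACIT-F change implied by a shift of pgis_categories on PGIS,
    assuming a linear slope per PtGA point."""
    return slope_per_ptga_point * POINTS_PER_CATEGORY * pgis_categories
```

With a hypothetical slope of −2.0 FACIT-F points per PtGA point, a 1-category PGIS worsening would imply a change of −5.0 and a 2-category worsening a change of −10.0.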
For FACIT-F domain/total scores, standardized effect sizes were estimated as the difference in mean scores between groups (numerator) divided by the standard deviation at baseline (denominator). These effect sizes were interpreted against conventional benchmarks that describe the impact of an intervention, with values of 0.2 standard deviation units generally regarded as ‘small’, 0.5 as ‘medium’, and 0.8 as ‘large’ [24].
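The effect size calculation is a single ratio, shown here with illustrative inputs. Treating the benchmarks as lower bounds of bands (for example, labeling any magnitude from 0.5 up to 0.8 as ‘medium’) is a convention adopted for this sketch; the benchmarks themselves are reference points rather than strict cut-offs.

```python
def standardized_effect_size(mean_group_a: float, mean_group_b: float,
                             baseline_sd: float) -> float:
    """Difference in group means divided by the baseline standard deviation."""
    if baseline_sd <= 0:
        raise ValueError("baseline SD must be positive")
    return (mean_group_a - mean_group_b) / baseline_sd


def describe_effect_size(effect_size: float) -> str:
    """Map the magnitude of an effect size to an adjectival descriptor.

    Assumption: benchmarks are used as lower bounds of bands.
    """
    magnitude = abs(effect_size)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    if magnitude >= 0.2:
        return "small"
    return "below small"
```

For instance, hypothetical group means of 35 and 30 with a baseline standard deviation of 10 give an effect size of 0.5, which falls at the ‘medium’ benchmark.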
Empirical cumulative distribution functions (eCDFs) [25] were produced at the studies’ respective primary analysis time points: Week 12 (Study 1) and Week 16 (Study 2) (Additional file 1: Appendix 1, Supplemental methods).
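An eCDF gives, for each observed value, the proportion of patients whose score is at or below that value. A minimal sketch of the calculation follows, assuming a plain list of change-from-baseline scores for one group at one time point; the function name is illustrative.

```python
def ecdf(values):
    """Return (value, cumulative proportion) pairs for each distinct value.

    The proportion is the share of observations less than or equal to the
    value, so the final pair always has proportion 1.0.
    """
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for rank, value in enumerate(ordered, start=1):
        if points and points[-1][0] == value:
            points[-1] = (value, rank / n)   # tie: keep the highest rank
        else:
            points.append((value, rank / n))
    return points
```

Plotting one such curve per anchor category shows how well the distributions of FACIT-F change separate across groups.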