Patients
ENLIVEN was a 2-part, multi-center, double-blind, randomized, placebo-controlled Phase 3 study designed to compare the response rate of pexidartinib with that of placebo per RECIST 1.1 at Week 25 in subjects with symptomatic TGCT for whom surgical resection would be associated with potentially worsening functional limitation or severe morbidity (locally advanced disease) [18]. In Part 1, the double-blind phase, eligible candidates were enrolled from May 11, 2015, to September 30, 2016, and centrally randomized in a 1:1 ratio to receive either pexidartinib or placebo for 24 weeks. Randomization was stratified by United States (US) versus non-US sites and by upper extremity (UE) versus lower extremity (LE) involvement.
Eligible patients were aged 18 years or older, had a histologically confirmed TGCT diagnosis, and had advanced disease for which surgical resection would be associated with potentially worsening functional limitation or severe morbidity. They had symptomatic disease, defined as a worst pain or worst stiffness score of at least 4 at any time during the week preceding the Screening Visit (based on a scale of 0 to 10, with 10 representing “pain as bad as you can imagine” or “stiffness as bad as you can imagine”), and measurable disease per RECIST v1.1 with a minimum size of 2 cm. A total of 120 subjects across approximately 45 study sites in the US, Canada, EU, and Australia were treated, 61 with pexidartinib and 59 with placebo.
Instruments
PROMIS-PF
Items from the validated PROMIS-PF item bank, which was designed to assess mobility, dexterity, axial, and complex activity function irrespective of specific anatomic location or acuteness of disease [1, 17], were used to assess physical functioning. Because the physical impacts of TGCT vary with tumor location, items for two customized, tumor location-specific scales were selected based on direct input from patients about which activities were impacted by their TGCT [9, 10]. From the 121 validated items available, a 13-item scale and an 11-item scale were customized to assess physical function among patients with tumors in the LE and UE, respectively. Nine of the PROMIS-PF items overlapped across the two customized forms (i.e., were included in both the LE and UE scales), giving 15 unique items in total.
Each PROMIS-PF question had five response options ranging in value from 1 to 5. Item response theory-based parameters were used to calculate person-specific scores. A fixed-parameter calibration, with no estimation of item parameters, was applied to subjects’ responses to the PROMIS-PF items to estimate person latent trait scores. Missing items were not imputed. The item parameters used to estimate person latent trait scores were obtained from the PROMIS Assessment Center (https://www.assessmentcenter.net/). As is customary for PROMIS, the results are reported as T-scores, which represent physical functioning as a standardized score with a mean of 50 and a standard deviation (SD) of 10. A higher PROMIS T-score represents more of the concept being measured. For positively worded concepts like physical function, a T-score of 60 is one SD better than average, and a T-score of 40 is one SD worse than average.
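The T-score metric described above can be illustrated with a minimal sketch. It assumes that the latent trait estimate (called theta here, a name not used in the study text) is expressed on a standard-normal metric, so that the conversion is the usual linear rescaling to a mean of 50 and an SD of 10. The function name is hypothetical, and the sketch does not reproduce the item response theory estimation step itself.

```python
def theta_to_t_score(theta: float) -> float:
    """Rescale a latent trait estimate to the T-score metric.

    Assumption: theta is on a standard-normal metric (mean 0, SD 1)
    in the reference population, so T = 50 + 10 * theta.
    """
    return 50.0 + 10.0 * theta


# One SD above and below the reference mean.
better_than_average = theta_to_t_score(1.0)
worse_than_average = theta_to_t_score(-1.0)
```

Under this assumption, a theta of 1.0 corresponds to a T-score of 60 and a theta of -1.0 to a T-score of 40, matching the interpretation given in the text.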
Worst stiffness NRS
The Worst Stiffness NRS was a single item that stated, “The following question asks about stiffness at the site of your tumor. Please rate your stiffness by circling the one number that best describes your stiffness at its worst in the last 24 hours.” For consistency, the item had a response scale similar to that of the Brief Pain Inventory (BPI) Worst Pain NRS item [2, 4]: a 0–10 NRS where 0 was “no stiffness” and 10 was “stiffness as bad as you can imagine.” The item was included in ENLIVEN because qualitative interviews with patients and clinicians demonstrated that stiffness was an important treatment outcome [9].
The daily stiffness score was the number on the 11-point NRS selected by the patient for that day, with a range of 0 to 10. The weekly score was calculated as the average of non-missing records during each seven-day period, provided that the patient had completed entries on an outpatient basis on at least 4 of the 7 days (i.e., mean weekly score = [sum of daily scores / number of diary days completed]). Patients with fewer than 4 days of Worst Stiffness NRS entries had their stiffness score for the week set to missing.
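The weekly scoring rule can be sketched as follows. This is an illustration of the rule as stated, not the study's scoring program. The function and constant names are hypothetical, and a missing diary day is represented as None.

```python
from typing import Optional, Sequence

MIN_DIARY_DAYS = 4  # minimum completed days out of 7, per the scoring rule


def weekly_stiffness_score(daily_scores: Sequence[Optional[int]]) -> Optional[float]:
    """Average the non-missing daily 0-10 scores for one seven-day period.

    Returns None (missing) when fewer than MIN_DIARY_DAYS entries exist.
    """
    if len(daily_scores) != 7:
        raise ValueError("expected exactly 7 daily entries (None for missing days)")
    completed = [s for s in daily_scores if s is not None]
    if any(not 0 <= s <= 10 for s in completed):
        raise ValueError("daily scores must lie between 0 and 10")
    if len(completed) < MIN_DIARY_DAYS:
        return None
    return sum(completed) / len(completed)


# Four completed days: the weekly score is the mean of 4, 5, 6, and 5.
week_with_four_days = weekly_stiffness_score([4, 5, None, 6, None, 5, None])
# Three completed days: the weekly score is set to missing.
week_with_three_days = weekly_stiffness_score([4, None, None, 6, None, 5, None])
```

Note that the denominator is the number of completed diary days, not 7, so missing days do not pull the weekly mean toward zero.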
Other measures
The BPI Worst Pain NRS administered in ENLIVEN was a single item that stated, “The following question asks about pain at the site of your tumor. Please rate your pain by selecting the one number that best describes your pain at its worst in the last 24 hours.” The item was adapted from item 3 of the BPI-short form [2, 4] to include “pain at the site of your tumor.” The item had a 0–10 NRS response scale where 0 was “no pain” and 10 was “pain as bad as you can imagine.”
The EQ-5D-5L (hereafter referred to as EQ-5D) is a standardized measure of health status developed by the EuroQol Group to provide a simple, generic measure of health for clinical and economic appraisal [6]. The EQ-5D descriptive system includes five dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each dimension has five levels: no problems, slight problems, moderate problems, severe problems, and unable to/extreme. The EQ visual analogue scale (VAS) records the respondent’s self-rated health on a vertical scale from 0 to 100, where the endpoints are labeled “Best imaginable health state (100)” and “Worst imaginable health state (0).”
The Patient Global Rating of Concept (PGRC) – Physical Functioning was a single item that assessed the subject’s perception of physical functioning. Subjects were asked to indicate how much their tumor limited their ability to carry out everyday physical activities on a 5-point Likert scale from “Not at all” to “Extremely.”
The Patient Global Impression of Change (PGIC) – Stiffness was a single item that assessed the subject’s perception of change in stiffness at the site of their tumor. Subjects were asked to indicate how much the stiffness at the site of their tumor had changed at Week 25 from Baseline on a 7-point Likert scale from “Much improved” to “Much worse.”
The Tumor Volume Score (TVS) was a semi-quantitative magnetic resonance imaging (MRI) scoring system that described tumor mass. The TVS was based on 10% increments of the estimated volume of the maximally distended synovial cavity or tendon sheath involved. Thus, a tumor that was equal to the volume of a maximally distended synovial cavity or tendon sheath was scored 10, whereas a tumor that was 70% of that volume was scored 7, a tumor that was twice the volume of the maximally distended synovial cavity or tendon sheath was scored 20, etc.
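The arithmetic behind the TVS scale can be illustrated with a short sketch. This only mirrors the worked examples in the text. It assumes numeric volume estimates are available and that the score is rounded to the nearest 10% increment, neither of which the study text specifies, and the function name is hypothetical.

```python
def tumor_volume_score(tumor_volume: float, max_cavity_volume: float) -> int:
    """Express tumor volume in 10% increments of the maximally distended
    synovial cavity or tendon sheath volume.

    Assumption: rounding to the nearest increment (Python's round, which
    rounds exact halves to the nearest even integer).
    """
    if tumor_volume < 0 or max_cavity_volume <= 0:
        raise ValueError("volumes must be non-negative, with a positive cavity volume")
    return round(10 * tumor_volume / max_cavity_volume)


# The three cases described in the text, with a hypothetical cavity volume of 100.
equal_to_cavity = tumor_volume_score(100, 100)
seventy_percent = tumor_volume_score(70, 100)
twice_cavity = tumor_volume_score(200, 100)
```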
Finally, a passive range of motion (ROM) assessment, standardized according to American Medical Association disability criteria and performed with standard goniometers [11], was completed as an objective measure of physical functioning.
Assessments
All PROs were completed via electronic handheld device in the local language of the study participant. The assessment time points for these analyses focus on the double-blind phase and are shown in Fig. 1.
Statistical analysis
Analyses were undertaken to assess item performance, reliability, validity, and ability to detect change, and to identify responder definition thresholds for the PROMIS-PF and Worst Stiffness NRS. The January 31, 2018, data cutoff was used for these analyses. Descriptive statistics were used to characterize the socio-demographic and clinical characteristics of the sample, as well as the Baseline and Week 25 PROMIS-PF and Worst Stiffness NRS scores. Confirmatory factor analysis (CFA) was conducted for the PROMIS-PF LE and UE item sets to confirm that the 15 PF candidate items reflected a single underlying factor in patients with TGCT. Model fit was assessed with the comparative fit index (CFI), root mean square error of approximation (RMSEA), and standardized root mean square residual (SRMR). CFI > 0.95, RMSEA < 0.05, and SRMR < 0.08 were each considered to indicate good fit.
Internal consistency reliability of the PROMIS-PF LE and UE item sets was assessed at Baseline to determine the extent to which individual items in the instrument were related to one another. Cronbach’s alphas ≥0.70 are considered acceptable [15]. Test-retest reliability of the PROMIS-PF and the Worst Stiffness NRS was evaluated to assess the reproducibility of scores when patients were presumed to be stable. Specifically, the test-retest reliability of the PROMIS-PF was assessed among all subjects between Screening and Baseline, and from Week 9 to Week 17 among subjects with no change on the PGRC – Physical Functioning. For the Worst Stiffness NRS, data from all subjects for each pair of consecutive days from Day −1 to Day −7 (e.g., Day −2 vs. Day −3, Day −3 vs. Day −4) were used. Weekly scores (i.e., 7-day average estimates) for Baseline compared with Screening were also analyzed, as were weekly scores from Week 9 to Week 17 among subjects with no change on the PGIC – Stiffness measure. Intraclass correlation coefficients (ICCs) were calculated. The ICC ranges from 0.00 to 1.00; an ICC ≥0.70 among stable subjects is considered acceptable to demonstrate test-retest reliability [16].
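For reference, the internal consistency statistic can be sketched using the standard formula for Cronbach’s alpha, alpha = k/(k − 1) × (1 − sum of item variances / variance of the total score), where k is the number of items. The sketch assumes complete responses (no missing items) and uses hypothetical names; it is not the study’s analysis code, and it does not cover the ICC, whose specific form is not stated in the text.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(responses: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a subjects-by-items table of complete responses.

    responses: one row per subject, each row holding k item scores.
    """
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 subjects")
    if any(len(row) != k for row in responses):
        raise ValueError("every subject must have a response for every item")
    # Sample variance of each item across subjects.
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    # Sample variance of the summed score across subjects.
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


ACCEPTABLE_ALPHA = 0.70  # threshold cited in the text

# Hypothetical responses from 4 subjects to 3 items scored 1 to 5.
example_alpha = cronbach_alpha([[1, 2, 1], [3, 3, 2], [4, 4, 5], [5, 5, 4]])
is_acceptable = example_alpha >= ACCEPTABLE_ALPHA
```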
Construct validity of the PROMIS-PF and Worst Stiffness NRS was evaluated at Baseline by examining their relationships with the BPI Worst Pain NRS, EQ-5D, TVS, and ROM. All relationships were assessed via Spearman’s rank-order correlation coefficient. Cohen’s conventions were used to interpret the absolute value of each correlation, where a correlation > 0.5 is large, 0.3 to ≤0.5 is moderate, 0.1 to < 0.3 is small, and < 0.1 is insubstantial [3]. It was hypothesized that both measures would have large correlations with the BPI Worst Pain NRS and moderate correlations with each other. It was also hypothesized that the correlations with the EQ-5D mobility, self-care, usual activities, and pain/discomfort items would be moderate to large.
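The interpretation bands above can be written as a small lookup, shown here as a sketch with a hypothetical function name. It classifies the absolute value of a correlation coefficient using exactly the cut points stated in the text and does not compute the correlation itself.

```python
def cohen_magnitude(correlation: float) -> str:
    """Label a correlation by the absolute-value bands given in the text."""
    if not -1.0 <= correlation <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    size = abs(correlation)
    if size > 0.5:
        return "large"
    if size >= 0.3:
        return "moderate"  # 0.3 to <= 0.5
    if size >= 0.1:
        return "small"  # 0.1 to < 0.3
    return "insubstantial"  # < 0.1


# The sign is ignored: a strong negative correlation is still "large".
label_for_negative = cohen_magnitude(-0.62)
```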
To assess known-groups validity, which is the extent to which scores from an instrument are different for groups of participants that differ on a relevant clinical or other indicator, the PROMIS-PF and Worst Stiffness NRS were analyzed by levels of pain (no pain, mild, moderate, and severe categories), TVS (small, medium, and large categories), PF limitation (no limitation, low, medium, and high categories), and stiffness (no stiffness, low, medium, and high categories). Mean scores for the PROMIS-PF and Worst Stiffness NRS were compared for each of the groups using analysis of covariance (ANCOVA) (PROC GLM) at Baseline, controlling for age, gender, race, and body mass index (BMI).
A responsiveness analysis of the PROMIS-PF and Worst Stiffness NRS was completed to evaluate the instruments’ ability to detect change in participants who had an established change in clinical status. Changes in PROMIS-PF and Worst Stiffness NRS scores from Baseline to Week 25 were examined in relation to change on the corresponding patient-reported anchor (PGRC – Physical Functioning for the PROMIS-PF; PGIC – Stiffness for the Worst Stiffness NRS) and, for both measures, in relation to tumor response status (complete response, partial response, stable disease, or progressive disease) as defined by RECIST 1.1 response criteria and by TVS.
Methods to establish the responder definition threshold included triangulation of anchor- and distribution-based analyses. Anchor-based methods are preferred by the FDA for interpretation of PRO scores [8] and were considered the primary analysis. The anchor for the PROMIS-PF was the change in PGRC – Physical Functioning from Baseline to Week 25. An improvement of “-1” was defined as a change in response in any of the following ways: Extremely to Severely; Severely to Somewhat; Somewhat to A little; or A little to Not at all. The mean change in the PROMIS-PF scale observed in the small improvement group (“-1”) was examined as a key anchor-based indicator of a responder. The anchor for the Worst Stiffness NRS was the PGIC – Stiffness rating of change from Baseline to Week 25. The mean change score among patients who reported that they were “a little improved” was examined as a key anchor-based indicator of a responder. Distribution-based analyses included 0.50 and 0.30 of the baseline SD, as well as one standard error of measurement (SEM). Empirical cumulative distribution function (eCDF) curves were generated for the PROMIS-PF and Worst Stiffness NRS. The eCDF plots the change scores from Baseline to Week 25 (both positive and negative) on the X-axis against the cumulative proportion of patients with that level of score change on the Y-axis.
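The threshold calculations described above can be sketched as follows. This is an illustration under stated assumptions, not the study’s analysis code: all function names are hypothetical, the SEM uses the common formula SD × sqrt(1 − reliability) with the reliability coefficient supplied by the user (the text does not state which coefficient was used), and the eCDF returns the proportion of patients with a change score at or below each observed value.

```python
import math
from statistics import mean, stdev

# PGRC response levels, ordered from least to most limited (from the text).
PGRC_LEVELS = ["Not at all", "A little", "Somewhat", "Severely", "Extremely"]


def anchor_change(baseline: str, week25: str) -> int:
    """Category change on the anchor; -1 is a one-level improvement."""
    return PGRC_LEVELS.index(week25) - PGRC_LEVELS.index(baseline)


def mean_change_in_anchor_group(records, target_change=-1):
    """Mean score change among subjects with the given anchor change.

    records: iterable of (baseline_level, week25_level, score_change).
    """
    changes = [c for b, w, c in records if anchor_change(b, w) == target_change]
    return mean(changes) if changes else None


def distribution_based_thresholds(baseline_scores, reliability):
    """0.5 SD, 0.3 SD, and 1 SEM, assuming SEM = SD * sqrt(1 - reliability)."""
    sd = stdev(baseline_scores)
    return {
        "half_sd": 0.5 * sd,
        "three_tenths_sd": 0.3 * sd,
        "sem": sd * math.sqrt(1 - reliability),
    }


def ecdf(change_scores):
    """(change score, cumulative proportion at or below it) pairs."""
    ordered = sorted(change_scores)
    n = len(ordered)
    points = {}
    for rank, value in enumerate(ordered, start=1):
        points[value] = rank / n  # tied values keep the highest rank
    return sorted(points.items())


# Hypothetical data for illustration only.
example_records = [
    ("Severely", "Somewhat", 4.0),
    ("Somewhat", "A little", 6.0),
    ("Somewhat", "Somewhat", 1.0),
]
anchor_estimate = mean_change_in_anchor_group(example_records)
thresholds = distribution_based_thresholds([40, 50, 60], reliability=0.84)
curve = ecdf([-2, 0, 0, 3])
```

Reading the eCDF at a candidate threshold gives the proportion of patients whose change score is at or below that value, which allows responder proportions to be compared across a range of thresholds rather than at a single cut point.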