Sample
This study is part of a larger study whose primary aim was to assess the convergent validity of PROMIS, PRO-CTCAE, and NRS by comparing item responses between two groups defined by ECOG PS (0–1 vs. 2–4). A secondary aim was to assess the relationship between survival status and PRO scores. To achieve power for all primary and secondary analyses, the sample size for the primary study was based on a superiority analysis of survival between high and low PRO score groups. In the current equivalence study comparing modes of administration, we claim equivalence when the confidence interval for the difference in outcomes between modes lies within a predetermined equivalence margin representing a clinically acceptable range of difference. Using σ = 2 based on normative data for the overall QOL NRS that include cancer trial patients [17], Δ = 0.45 (an effect size of 0.225, which may begin to be considered non-negligible on a 0–10 scale), a two-sided type I error level of 5%, and the sample size of 1184, we obtain 94% statistical power. Using Δ = 0.40 (an effect size of 0.20), we obtain 86% statistical power.
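The power figures above can be reproduced with a normal-approximation sketch of the two one-sided tests (TOST) procedure for equivalence, assuming the 1184 patients split into two modes of 592 each and a true between-mode difference of zero; the function names are illustrative, not part of the study's analysis code.

```python
from math import erf, sqrt

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def tost_power(delta: float, sigma: float, n_per_group: int) -> float:
    """Approximate power of the TOST equivalence test when the true
    difference between modes is zero.

    delta: equivalence margin on the raw 0-10 scale
    sigma: common SD of the outcome
    Each one-sided test is run at 2.5% (two-sided 5% level overall).
    """
    se = sigma * sqrt(2.0 / n_per_group)  # SE of the difference in means
    z_alpha = 1.959964                    # z for one-sided alpha = 0.025
    # Power = P(both one-sided tests reject) under a true difference of 0
    return max(0.0, 2.0 * norm_cdf(delta / se - z_alpha) - 1.0)

# Margins from the text: effect sizes 0.225 and 0.20 with sigma = 2
print(round(tost_power(0.45, 2.0, 592), 2))  # 0.94
print(round(tost_power(0.40, 2.0, 592), 2))  # 0.86
```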
There were five participating sites (Mayo Clinic, M.D. Anderson, Memorial Sloan Kettering, Northwestern University, and University of North Carolina). Patients with a diagnosis of cancer who were initiating active anti-cancer treatment within the next seven days, were currently receiving anti-cancer treatment, or had undergone surgery for cancer treatment in the past 14 days were recruited in person by research study associates/data managers when arriving at a participating institution for a cancer-related appointment. Patients were accrued from the main hospital sites and satellite clinics of each institution. Eligible patients were adults able to read and understand the informed consent and privacy protection documentation (written in English) and to interact with the data collection modes (i.e., read and answer questions on a computer screen, listen to questions and respond using an IVR telephone system, or fill out a paper questionnaire). Each eligible patient provided informed consent. Enrollment and the distribution of accrual across disease groups and institutions were facilitated by a recruitment coordinator.
The resulting sample was randomized to PSAQ (n = 604), CSAQ (n = 603), or IVR (n = 602). Participants were asked to complete the questionnaires while at the clinic for their visit. A study coordinator handed the participant a folded paper questionnaire booklet (PSAQ arm) or an iPad tablet computer (CSAQ arm), or directed the patient to a landline telephone (i.e., hardwired to a telephone jack) with a keypad (IVR arm). Twenty-eight patients across the three arms did not respond at all, and one person switched from IVR to PSAQ. Excluding these patients, we analyzed the remaining 595 PSAQ, 596 IVR, and 589 CSAQ patients.
Measures
We used PROMIS short forms and analogous NRS and PRO-CTCAE single-item rating scales. The PROMIS domains included in the study were emotional distress-anxiety, emotional distress-depression, fatigue, pain interference, pain intensity, physical function, satisfaction with social roles, sleep disturbance, global mental health, and global physical health. We administered nine version 1.0 short forms derived from PROMIS item banks: Anxiety 8a, Depression 8a, Fatigue 7a with two added items from another fatigue short form, Sleep Disturbance 8a, Pain Intensity 3a, Pain Interference 8a, Ability to Participate in Social Roles and Activities 8a, Global Mental Health, Global Physical Health, and Physical Function 10a. PROMIS scores are on the T-score metric, which we used for comparing average scores between modes; we did not transform the T-scores to a 0–100 scale.
The National Cancer Institute (NCI)’s PRO-CTCAE is a pool of adverse symptom items for patient self-reporting in NCI-sponsored clinical trials. The CTCAE is an existing lexicon of clinician-reported adverse event items required for use in all NCI-sponsored trials. Patient versions of CTCAE symptom items are intended to provide clinicians with more comprehensive information about the patient experience with treatment when trials are completed and reported. The PRO-CTCAE item bank consists of five “types” of items (present/not present, frequency, severity, interference with usual or daily activities, and amount of symptom). In this study, we included the frequency, severity, and interference PRO-CTCAE items. The response options were never, rarely, occasionally, frequently, and almost constantly for frequency items; none, mild, moderate, severe, and very severe for severity items; and not at all, a little bit, somewhat, quite a bit, and very much for interference items.
We used NRS items for overall health-related QOL, the five major QOL domains (i.e., sleep, pain, anxiety, depression, and fatigue), and each additional domain for which a PROMIS measure exists (e.g., social and physical function). NRS scores and PRO-CTCAE scores are on 0–10 and 1–5 integer rating scales, respectively. For comparing mean scores, NRS and PRO-CTCAE item scores were linearly transformed to 0–100 scales, with higher scores indicating more of the construct in question (e.g., more fatigue, better physical function). Respondents answered 62 PROMIS items, 16 PRO-CTCAE items, and 11 NRS items.
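The linear transformation described above can be sketched as follows; the function name is illustrative, and any reflection needed so that higher scores mean more of the construct would be applied separately.

```python
def to_0_100(score: float, lo: float, hi: float) -> float:
    """Linearly rescale a score from its native range [lo, hi] to 0-100."""
    return (score - lo) / (hi - lo) * 100.0

# NRS items are scored 0-10; PRO-CTCAE items are scored 1-5
print(to_0_100(5, 0, 10))  # midpoint NRS response -> 50.0
print(to_0_100(3, 1, 5))   # midpoint PRO-CTCAE response -> 50.0
print(to_0_100(1, 1, 5))   # lowest PRO-CTCAE response -> 0.0
```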
Health literacy was measured with a single item, “How confident are you filling out medical forms by yourself?” Based on the findings of Chew et al. [18] on the screening threshold that optimizes both sensitivity and specificity, “extremely” and “quite a bit” were coded as having adequate health literacy, and “somewhat”, “a little bit”, and “not at all” were coded as not having adequate health literacy.
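The dichotomization can be expressed as a simple lookup over the item's response labels; the function name is illustrative.

```python
# Response options coded per the Chew et al. screening threshold
ADEQUATE = {"extremely", "quite a bit"}
INADEQUATE = {"somewhat", "a little bit", "not at all"}

def health_literacy(response: str) -> str:
    """Dichotomize the single health-literacy screening item."""
    r = response.strip().lower()
    if r in ADEQUATE:
        return "adequate"
    if r in INADEQUATE:
        return "inadequate"
    raise ValueError(f"unrecognized response: {response!r}")

print(health_literacy("Quite a bit"))  # adequate
print(health_literacy("somewhat"))    # inadequate
```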
Measurement equivalence or lack of differential item functioning
A critical step before using instruments to compare scores from different modes of administration is determining whether items have the same meaning to members of different groups [19]. Psychometric concern for measurement equivalence arises whenever group comparisons on observable scores are the focus of research [20]. Once measurement equivalence between modes of administration has been established, quantitative cross-mode comparisons can be meaningfully conducted.
The R software package lordif [21] was used to evaluate differential item functioning (DIF) in each of the PROMIS scales. Item-level data for all three types of measures (i.e., PROMIS, PRO-CTCAE, and NRS) were entered for each construct tested. Lordif assesses DIF using a hybrid of ordinal logistic regression and the IRT framework; the main purpose of fitting an IRT model within lordif is to obtain trait estimates to serve as the matching criterion. We tested whether each combined item set is unidimensional by conducting confirmatory factor analyses (CFA) treating the items as ordinal and using the WLSMV estimator in the lavaan R package [22]. Model fit was evaluated with the Comparative Fit Index (CFI ≥ 0.95 very good fit) and the Standardized Root Mean Square Residual (SRMR ≤ 0.08) [23]. We also estimated the proportion of total variance attributable to a general factor (i.e., coefficient omega hierarchical, ωh) [24]; values of 0.70 or higher suggest that the item set is sufficiently unidimensional for most analytic procedures that assume unidimensionality [25]. A McFadden pseudo-R2 change criterion of ≥ 0.02 was used to flag items for DIF [26]; a pseudo-R2 change of less than 0.02 indicates a lack of evidence of differential interpretation of an item across modes.
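The flagging rule can be illustrated with a minimal sketch: given the log-likelihoods of nested ordinal logistic regression models for one item (a null intercept-only model, a base model with the trait estimate only, and a model adding mode terms), compute McFadden pseudo-R² for each and flag the item when the change meets the 0.02 threshold. The log-likelihood values below are hypothetical, not study results.

```python
def mcfadden_r2(ll_model: float, ll_null: float) -> float:
    """McFadden pseudo-R^2 = 1 - LL(model) / LL(null)."""
    return 1.0 - ll_model / ll_null

def flag_dif(ll_null: float, ll_base: float, ll_dif: float,
             threshold: float = 0.02) -> bool:
    """Flag an item when adding mode terms to the trait-only model
    raises McFadden pseudo-R^2 by at least `threshold`."""
    delta = mcfadden_r2(ll_dif, ll_null) - mcfadden_r2(ll_base, ll_null)
    return delta >= threshold

# Hypothetical log-likelihoods for a single item:
print(flag_dif(ll_null=-500.0, ll_base=-450.0, ll_dif=-438.0))  # True  (change = 0.024)
print(flag_dif(ll_null=-500.0, ll_base=-450.0, ll_dif=-446.0))  # False (change = 0.008)
```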
Comparison of means by modes of administration
If no DIF is found, then differences in scores between modes can be meaningfully interpreted: any observed differences in average scores can be attributed to the characteristics of the modes themselves rather than to differential item interpretation. If DIF is found for a given mode in a given domain, we would score patients using newly derived, mode-specific item parameters before conducting mean comparisons between modes.
We compared the percentages of missing values among modes using the equality-of-proportions test with a χ2 test statistic to ensure that missing values were not driving differences across the modes. For constructs where lack of DIF was established, we concluded equivalence across modes if the 95% confidence interval for the difference in mean scores fell completely within the equivalence margin of a small effect size, defined as ± 0.20 × pooled SD. Here, one fifth of the pooled standard deviation indicates a small difference, following the observation by Coons et al. [27] that a “small” effect size difference lies between 0.20 SD and 0.49 SD and that such values indicate a minimal difference worthy of attention. If the 95% CI fell completely outside the margin, we concluded a systematic score difference by mode. If the 95% CI partly overlapped the equivalence margin, we concluded neither equivalence nor difference.
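The three-way decision rule above can be sketched as a small classifier over the confidence interval and the margin; the function and labels are illustrative.

```python
def classify_equivalence(ci_low: float, ci_high: float, pooled_sd: float,
                         margin_factor: float = 0.20) -> str:
    """Classify a between-mode comparison from the 95% CI of the mean
    score difference against the margin +/- margin_factor * pooled SD.

    'equivalent'   : CI entirely inside the margin
    'different'    : CI entirely outside the margin
    'inconclusive' : CI partly overlaps the margin boundary
    """
    m = margin_factor * pooled_sd
    if -m <= ci_low and ci_high <= m:
        return "equivalent"
    if ci_high < -m or ci_low > m:
        return "different"
    return "inconclusive"

# With a pooled SD of 10, the margin is +/- 2.0 score points:
print(classify_equivalence(-0.5, 0.8, 10.0))  # equivalent
print(classify_equivalence(2.5, 3.5, 10.0))   # different
print(classify_equivalence(1.0, 3.0, 10.0))   # inconclusive
```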