Establishing a common metric for patient-reported outcomes in cancer patients: linking patient reported outcomes measurement information system (PROMIS), numerical rating scale, and patient-reported outcomes version of the common terminology criteria for adverse events (PRO-CTCAE)

Background: Researchers and clinicians studying symptoms experienced by people with cancer must choose from various scales. It would be useful to know how the scores on one measure translate to another.
Methods: Using item response theory (IRT) with the single-group design, in which the same sample answers all measures, we produced crosswalk tables linking five 0–10 numeric rating scale (NRS) items and 15 items from the Patient-Reported Outcomes version of the Common Terminology Criteria for Adverse Events (PRO-CTCAE, scored on a 1–5 scale) to the T-score metric of six different scales from the NIH Patient-Reported Outcomes Measurement Information System (PROMIS®). The constructs for which we conducted linking include emotional distress-anxiety, emotional distress-depression, fatigue, sleep disturbance, pain intensity, and pain interference. We tested the IRT linking assumption of construct similarity between measures by comparing item content and testing unidimensionality of the item sets comprising each construct. We also investigated the correlation of the measures to be linked and, by inspecting standardized mean differences, whether the linkage is invariant across age and gender subgroups. For measures that satisfied the assumptions, we conducted linking.
Results: In general, an NRS score of 0 corresponded to about 38.2 on the PROMIS T-score scale (mean = 50; SD = 10), whereas an NRS score of 10 corresponded to a PROMIS T-score of approximately 72.7. Similarly, the lowest/best score of 1 on PRO-CTCAE corresponded to 39.8 on the T-score scale and the highest/worst score of 5 corresponded to 72.0.
Conclusion: We produced robust linking between single-item symptom measures and PROMIS short forms.
Supplementary Information: The online version contains supplementary material available at 10.1186/s41687-020-00271-0.

Patient-reported outcome measures (PROMs) are tools for directly eliciting patient experience; their use has become the standard in clinical trials for assessing symptoms and health-related quality of life (HRQOL) [1,2]. Single-item measures have been used for their simplicity of administration, reduced respondent burden, and ease of interpretation [3,4]. The validity of single-item numerical rating scales (NRS) has been demonstrated, and in some settings they are an efficient alternative to longer assessments [1,4,5].
Commonly used measures in oncology include PRO-CTCAE (PRO version of the Common Terminology Criteria for Adverse Events) and PROMIS® (PRO Measurement Information System) [1]. The PRO-CTCAE was designed to assess side-effects related to treatment toxicity or tolerability. PROMIS provides short forms for a number of selected symptoms and HRQOL, which vary in length to meet the needs of researchers, balancing a tradeoff between precision and respondent burden.
Given this heterogeneity of PROMs, it would be useful to know how the scores on single-item measures such as NRS or PRO-CTCAE map onto longer, calibrated PROMIS scales measuring the same construct. These maps, or cross-walk tables, would allow researchers and clinicians to compare results more accurately across studies that use different PROMs, and allow for a common reporting metric in comparative effectiveness research or meta-analyses. Once multiple instruments are linked on cross-walk tables, clinicians and investigators can determine if clinical cutoff scores on different instruments converge or diverge based on a common metric [6].
Previous studies have linked legacy measures to the PROMIS T-score metric for depression [6,7], anxiety [8], pain interference [9], physical function [10], and fatigue [11]. The current study is the first to link NRS and PRO-CTCAE single-item measures to their associated PROMIS short form measures. This allows placing all measures on the same (PROMIS) metric. We present the cross-walk results on the following domains: fatigue, pain intensity, pain interference, sleep disturbance, anxiety, and depression.

Sample
Adult cancer patients were recruited from five cancer centers: University of North Carolina, Memorial Sloan-Kettering Cancer Center, Northwestern University, MD Anderson Cancer Center, and Mayo Clinic in Rochester, Minnesota. Patients were eligible for the study if they had a diagnosis of cancer, were currently receiving anticancer treatment or would be initiating active anticancer treatment within the next 7 days, or had undergone surgery for cancer treatment in the past 14 days. Patients treated with only hormonal therapy and patients with clinically significant cognitive impairment were excluded. The study was reviewed by the IRB of each of the participating sites, and all patients provided consent to enter the study. Patients were randomized to three modes of administration for the baseline assessment: paper, interactive voice response system (IVRS), and web. All follow-up assessments at 6 weeks were administered by mail. The current study uses the baseline data only.

PRO-CTCAE
PRO-CTCAE is a patient version of the existing clinician-reported adverse event items for use in cancer clinical trials. The intent of PRO-CTCAE is to improve the accuracy and precision of adverse symptom assessment in cancer trials, and to bring the CTCAE into harmony with other areas of clinical research, in which the gold standard for symptom evaluation is patient self-report. The PRO-CTCAE consists of five types of items (presence/absence, amount, frequency, severity, and interference with usual or daily activities). For items asking the frequency of symptoms, the response options are (a) never, (b) rarely, (c) occasionally, (d) frequently, and (e) almost constantly. For severity items, response options are (a) none, (b) mild, (c) moderate, (d) severe, and (e) very severe. The response options for items asking about interference with daily activities are (a) not at all, (b) a little bit, (c) somewhat, (d) quite a bit, and (e) very much.

NRS
The NRS items have 11 response options from 0 to 10, but the interpretation of high scores varies by domain. For pain, fatigue, anxiety and depression, an NRS score of 0 indicates the patient does not experience the symptom and 10 indicates the symptom is as bad as imaginable. Conversely, for overall QOL, emotional/mental/physical well-being, social activity, and sleep quality, an NRS score of 0 indicates as bad as it can be and 10 indicates the best it can be. The NRS item for sleep quality was reverse-coded to represent sleep disturbance, so that it is interpreted in the same direction as the other symptoms (pain, fatigue, etc.).
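The reverse-coding step can be illustrated with a minimal sketch. The function name `reverse_code_nrs` is hypothetical, and the sketch assumes the usual reflection of a 0–10 rating about its midpoint (10 minus the raw score); the text does not state the exact rule that was applied.

```python
def reverse_code_nrs(score: int, minimum: int = 0, maximum: int = 10) -> int:
    """Reflect a rating so that higher values indicate worse sleep.

    Assumed rule: new = minimum + maximum - old (10 - old on a 0-10 scale).
    """
    if not minimum <= score <= maximum:
        raise ValueError(f"score {score} is outside {minimum}-{maximum}")
    return minimum + maximum - score


# A sleep-quality rating of 8 (good sleep) becomes a sleep-disturbance rating of 2.
sleep_disturbance = reverse_code_nrs(8)
```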

PROMIS
We administered six version 1.0 short forms derived from PROMIS item banks: Anxiety 8a, Depression 8a, Fatigue 7a with two additional items from Fatigue 8a (i.e., FATIMP3, FATIMP16), Sleep Disturbance 8a, Pain Intensity 3a, Pain Interference 8a excluding one item (8a-1) as it was redundant with a PRO-CTCAE item (i.e., In the past 7 days, how much did pain INTERFERE with your usual or daily activities?), and Physical Function 10a. We used the version 2.0 short form 8a for Ability to Participate in Social Roles and Activities. The PROMIS measures are scored on a T score metric in which 50 is the mean of a general US adult reference population and 10 is the standard deviation (SD) of that reference population.
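Because the T-score metric is a linear rescaling of the standardized IRT scale (theta, with mean 0 and SD 1 in the reference population), the conversion can be sketched in a few lines. The function names are illustrative only.

```python
def theta_to_t_score(theta: float) -> float:
    """Convert a standardized IRT score (mean 0, SD 1) to the T-score metric."""
    return 50.0 + 10.0 * theta


def t_score_to_theta(t_score: float) -> float:
    """Inverse conversion, from a T-score back to the standardized scale."""
    return (t_score - 50.0) / 10.0


# One SD above the reference-population mean corresponds to a T-score of 60.
one_sd_above = theta_to_t_score(1.0)
```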

Linking design
Following the methods of previous linking studies with PROMIS measures [6-11], we used the single-group design, in which the same sample answers all three measures (PROMIS, NRS, and PRO-CTCAE). This is the strongest of the linking methodologies [12]. We used two IRT-based linking methods: fixed-parameter calibration, and concurrent calibration followed by transformation with linking constants [13,14]. When the two approaches provide the same result, a robust linking relationship between instruments can be obtained.

Fixed-parameter calibration
In the fixed-parameter calibration, the item parameters of the anchor measure (PROMIS) were fixed at their previously established calibration [6,15], while the item parameters of the target measures (NRS or PRO-CTCAE) were freely estimated (subject to the metric defined by the anchor measure) in a single run for each domain. For example, in the anxiety domain, there were 8 PROMIS items, 1 NRS item, and 3 PRO-CTCAE items. These single-item measures were calibrated in a single run. Afterwards, each single-item measure was anchored to the metric defined by the PROMIS item parameters. This calibration yielded item parameters for the target measures that were on the PROMIS metric.

Concurrent calibration with linking constants
The second IRT-based method we applied was concurrent calibration followed by the computation of transformation constants. With concurrent calibration, all items of the anchor and target measures are freely estimated in a single calibration. This produces a common metric and avoids imposing the constraints inherent in the fixed-parameter calibration (e.g., differences in population). However, the resulting item calibrations are on an arbitrary metric, that is, they are not linked to the original anchor item calibrations. To address this, linking constants are derived from the difference between these new "free" PROMIS calibrations and the previously established PROMIS calibrations. A multiplicative and an additive constant are chosen so that the test characteristic curves (TCCs) implied by the two sets of PROMIS parameters become as similar as possible [14]. These constants can then be applied to the free calibrations of the target measures, thereby putting their parameters on the common metric. The test characteristic curve method of Stocking and Lord [14] was used to obtain the linking constants with the R package lordif [16]. We ran all calibrations using flexMIRT® [17].
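The idea behind the linking constants can be sketched as follows. This is not the lordif implementation: it assumes a graded response model, uses made-up anchor item parameters, and replaces the numerical optimizer used in practice with a coarse grid search. All function and variable names are hypothetical. Under the linear transformation theta* = A·theta + B, a discrimination a becomes a/A and a threshold b becomes A·b + B; the Stocking-Lord criterion is the summed squared difference between the two TCCs over a grid of theta values.

```python
import math


def expected_item_score(theta, a, thresholds):
    """Expected score of one graded-response item with categories 0..K.

    Each threshold contributes P(X >= k); their sum is the expected score.
    """
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)


def tcc(theta, items):
    """Test characteristic curve: expected summed score at a given theta."""
    return sum(expected_item_score(theta, a, bs) for a, bs in items)


def rescale(items, A, B):
    """Re-express item parameters on the metric theta* = A * theta + B."""
    return [(a / A, [A * b + B for b in bs]) for a, bs in items]


def stocking_lord_loss(A, B, free_items, anchor_items, grid):
    """Squared TCC difference between anchor and rescaled free calibrations."""
    rescaled = rescale(free_items, A, B)
    return sum((tcc(t, anchor_items) - tcc(t, rescaled)) ** 2 for t in grid)


def grid_search_constants(free_items, anchor_items):
    """Coarse grid search for (A, B); a stand-in for a numerical optimizer."""
    grid = [-4.0 + 0.2 * i for i in range(41)]
    candidates = [(0.5 + 0.05 * i, -1.0 + 0.05 * j)
                  for i in range(21) for j in range(41)]
    return min(candidates,
               key=lambda ab: stocking_lord_loss(ab[0], ab[1],
                                                 free_items, anchor_items, grid))


# Hypothetical "established" anchor calibrations: (discrimination, thresholds).
anchor = [(2.0, [-1.0, 0.0, 1.0, 2.0]), (1.5, [-0.5, 0.5, 1.5, 2.5])]

# Simulate a free calibration: the same items on an arbitrary metric that
# differs from the anchor metric by known constants A = 1.2 and B = -0.3.
true_A, true_B = 1.2, -0.3
free = rescale(anchor, 1.0 / true_A, -true_B / true_A)

A_hat, B_hat = grid_search_constants(free, anchor)
```

In this toy setup the search should recover the constants used to generate the free calibration. The same `rescale` call would then be applied to the freely estimated parameters of the target (NRS or PRO-CTCAE) items.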

Tests of linking assumptions
The first linking assumption is construct similarity between measures [12,18]. When two measures are developed using different test specifications but measure similar constructs, we can produce a concordance table that transforms scores from one to another. To test the similarity of constructs, we used several methods. First, we evaluated the degree of conceptual interchangeability by inspecting item content across measures. Second, since our planned IRT calibrations require that the combined item set be unidimensional, we conducted confirmatory factor analyses (CFA), treating the items as ordinal and using the WLSMV estimator with the lavaan R package [19]. Using common benchmark values [20], model fit was evaluated based on standard fit indices including the Comparative Fit Index (CFI ≥ 0.95, very good fit) and the Standardized Root Mean Square Residual (SRMR ≤ 0.08). We also estimated the proportion of total variance attributable to a general factor (i.e., coefficient omega hierarchical, ωh) [21,22] using the psych package [23] in R. This method estimates ωh from the general factor loadings derived from principal axes factor analysis and a Schmid-Leiman transformation [24]. The default was to extract three group factors; for two domains, solutions with two group factors were more desirable. Values of .70 or higher for ωh suggest that the item set is sufficiently unidimensional for most analytic procedures that assume unidimensionality [25].
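To make the ωh criterion concrete, the following sketch computes the coefficient from a set of orthogonal (Schmid-Leiman style) loadings. It is a model-implied simplification with invented loadings, not a re-implementation of the psych package, and it assumes standardized items so that each uniqueness equals one minus the squared loadings.

```python
def omega_hierarchical(general, group_factors):
    """Share of total-score variance due to the general factor.

    general:       general-factor loading of each item
    group_factors: one list of loadings per group factor (0.0 where an
                   item does not load on that factor)
    """
    general_part = sum(general) ** 2
    group_part = sum(sum(loadings) ** 2 for loadings in group_factors)
    unique_part = sum(
        1.0 - g ** 2 - sum(loadings[i] ** 2 for loadings in group_factors)
        for i, g in enumerate(general)
    )
    return general_part / (general_part + group_part + unique_part)


# Hypothetical six-item set with one general and two group factors.
general_loadings = [0.7] * 6
group_loadings = [
    [0.3, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.3, 0.3],
]
omega_h = omega_hierarchical(general_loadings, group_loadings)
sufficiently_unidimensional = omega_h >= 0.70
```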
A second linking assumption is that the scores of the two measures to be linked are highly correlated [18]. We calculated correlation coefficients between the raw scores of the measures to be linked. We evaluated a third linking assumption (i.e., linkage is invariant across important subgroups) by computing mean differences between important subpopulations [18]. We chose two types of subgroups based on gender and age (i.e., men and women; ages ≥ 60 and ages < 60). To compute the standardized mean difference (SMD) between men and women, the difference between the female and male means was divided by the total group pooled standard deviation. If the SMD values of PROMIS and the measures to be linked are similar (difference ≤ 0.10), then we can assume linkage is likely invariant between subgroups. A difference in SMDs greater than 0.11 suggests a need for subpopulation-specific cross-walks [12,18].
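The subgroup check can be sketched as below. The pooled standard deviation formula (within-group variances weighted by their degrees of freedom) is an assumption, since the text does not define "total group pooled standard deviation" further, and the function names are hypothetical. The default tolerance of 0.10 is the similarity bound given above.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(group_a, group_b):
    """Mean of group_a minus mean of group_b, in pooled-SD units (assumed formula)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def linkage_looks_invariant(smd_anchor, smd_target, tolerance=0.10):
    """True when the two measures show similar subgroup differences."""
    return abs(smd_anchor - smd_target) <= tolerance


# Toy scores for two subgroups on a single measure.
smd_example = standardized_mean_difference([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])

# Comparing a hypothetical anchor SMD with a hypothetical target-item SMD.
similar = linkage_looks_invariant(-0.26, -0.20)
dissimilar = linkage_looks_invariant(-0.26, -0.13)
```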

Sample
As displayed in Table 1, the mean age of the 1859 patients was 56 years. There were more women (61%) than men (39%) in the sample. About 74% were Caucasian, 22% Black, 3% Asian, and 0.3% American Indian or Alaska Native; about 6% were Hispanic. Breast cancer, lymphoma/myeloma, colorectal cancer, head/neck/gastroesophageal cancer, and lung cancer accounted for 71.2% of the patients. By stage, 12% of the patients had stage I cancer, 21% stage II, 30% stage III, and 37% stage IV. By education, 6% had less than high school, 23% had a high school diploma or GED, 30% had some college, and 41% were college graduates or more.

Construct similarity
The content of the items from the three kinds of measures representing each construct was similar. The PROMIS Anxiety 8a consists of statements on the frequency of feeling nervous, anxious, tense, and fearful, whereas the NRS asks about the severity of anxiety and PRO-CTCAE asks about its severity, frequency, and interference. In addition, the content of the single-item measures was fully represented in PROMIS Depression 8a, such as feeling depressed, unhappy, or that nothing could cheer one up, but PROMIS focused on the frequency of these feelings while the single-item measures asked about them in terms of frequency, severity and interference. PROMIS Depression 8a had other content not represented in NRS or PRO-CTCAE, such as feeling worthless, helpless, hopeless, feeling like a failure, or having nothing to look forward to. For fatigue, the single-item measures directly asked about the level of fatigue and its interference with activities, while PROMIS items mostly addressed the construct without mentioning "fatigue". For example, PROMIS asked about the frequency of feeling tired, experiencing extreme exhaustion, running out of energy, or feeling too tired to think clearly or take a bath. There was one PROMIS item that asked how often fatigue interferes with work. Regarding sleep, both NRS and PROMIS had an item addressing average sleep quality. Many items in Sleep Disturbance 8a asked about sleep difficulty in a variety of ways, such as the degree to which sleep was refreshing, restless, or satisfying, and how hard it was to fall asleep. The related PRO-CTCAE items asked about the severity of insomnia at its worst and its interference with activities.
The content of both NRS and PRO-CTCAE items addressing pain intensity was fully represented in Pain Intensity 3a which asked how intense one's pain was at its worst, average pain intensity, and the level of pain right now. There was no NRS item addressing pain interference. A PRO-CTCAE item asked the degree to which pain interfered with activities in general, whereas Pain Interference 8a-1 asked interference with different aspects of activities such as working around the home, social activities, family life, or household chores. Similarly, the PROMIS short forms for social and physical function covered various aspects of the functions, while the NRS counterparts asked for global ratings of social activity and physical well-being.

Correlations between measures to be linked
The Pearson correlations between PROMIS and NRS, or PROMIS and PRO-CTCAE items, ranged from .70 to .77 for anxiety; .72 to .79 for depression; .76 to .82 for fatigue; .79 to .85 for sleep disturbance; .87 to .89 for pain intensity; and .88 for pain interference (Table 2). The correlations for the following domains were deemed too low to support linking: .65 for social function; .57 to .64 for global mental health; and .53 for physical function.
Choi et al. [6] suggested a lower bound of correlation of .75 for scores to be linked.

Invariant linkage between subgroups
As shown in Fig. 1 (anxiety domain as an example) and Appendix A (for all scales), the SMDs by gender for PROMIS and the other measures were similar (≤ 0.1 difference). Those by age were more variable for the anxiety domain: the SMDs of the PRO-CTCAE frequency (−0.15) and interference (−0.13) items were at least 0.11 points away [18] from the SMD of PROMIS by age (−0.26). This suggests that the linking relationship of the PRO-CTCAE frequency and interference items to the PROMIS anxiety scale may not be the same for older and younger patients. For the five other domains, the SMDs by gender or age for PROMIS and the other measures were sufficiently close. On the basis of the findings above, the final scales to be linked were determined (Table 2).

Linkage results
Discrimination and location parameters on the PROMIS metric were estimated for the PRO-CTCAE and NRS items. Based on these parameters, we plotted the test characteristic curves (TCCs), showing the score values of the non-PROMIS items on the y-axis against the corresponding PROMIS T-score on the x-axis (anxiety NRS as an example in Fig. 2; all scales in Appendix B). Furthermore, we plotted the differences between the fixed calibration method and the concurrent calibration using linking constants (Fig. 2 and Appendix B). For each comparison between the TCCs, the expected raw score value differed by less than 1 point across thetas ranging from −4 to 4. For all domains except pain intensity, the expected raw score values differed by less than 0.5 points in absolute value across thetas. For pain intensity, in a higher T-score range of about 60-80, the absolute difference in NRS score was larger than 0.5. Because of the close similarity of the two IRT solutions on most of the domains and because the concurrent calibration using linking constants makes fewer assumptions about the population difference between the current sample and the original PROMIS calibration sample, we report only the results of the concurrent calibration followed by linking constants. The fixed and freely estimated item parameters of the PROMIS anchor items are plotted along with the identity line in Appendix C, which shows how the two calibrations compare to each other. Cross-walk tables based on concurrent calibration followed by transformation with Stocking-Lord linking constants are provided in Tables 3, 4 and 5. Table 6 shows the item parameters for NRS and PRO-CTCAE items from concurrent calibrations with linking constants. We mapped the raw scores on NRS or PRO-CTCAE to their corresponding PROMIS T-scores based on the conversion tables constructed with the Lord and Wingersky method [26]. Across domains, the score of zero on NRS was mapped to about 38.2 ± 3.3 on the T-score scale, and the maximum score of ten on NRS to about 72.7 ± 2.2.
In addition, the score of 50 which is the population norm on PROMIS scales was mapped to approximately 3 on NRS fatigue, 2 on pain intensity, 3 on NRS sleep disturbance, 2 on NRS anxiety, and 1 on NRS depression. The middle NRS categories tended to be close in terms of PROMIS scores (e.g., the NRS scores of 4, 5, and 6 in Table 3), which can be attributed to limited ability of categories of 4 and 6 to separate responders. Appendix D shows the item characteristic curves for the NRS items.
In terms of PRO-CTCAE, the (lowest/best) score of one corresponded to about 39.8 ± 3.6 on PROMIS T score metric depending on domains. The maximum score of five was mapped to about 72 ± 2 on PROMIS scale.
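How a response category on a single item can be mapped to a T-score is sketched below. For a single item, the summed-score likelihood used by the Lord and Wingersky method reduces to the item's category probabilities, so the sketch simply computes the posterior mean (EAP) of theta for each category. It assumes a graded response model, a standard normal prior, and invented item parameters; it does not reproduce the values in Tables 3 to 6, and the function names are hypothetical.

```python
import math


def category_probabilities(theta, a, thresholds):
    """Graded response model: probability of each category at a given theta."""
    cumulative = ([1.0]
                  + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
                  + [0.0])
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


def crosswalk_single_item(a, thresholds, first_score=1, lo=-4.0, hi=4.0, n=161):
    """Map each response category to an EAP-based T-score.

    Assumes a standard normal prior evaluated on an evenly spaced theta grid.
    """
    step = (hi - lo) / (n - 1)
    grid = [lo + step * i for i in range(n)]
    prior = [math.exp(-0.5 * t * t) for t in grid]
    probs = [category_probabilities(t, a, thresholds) for t in grid]
    table = {}
    for k in range(len(thresholds) + 1):
        posterior = [p[k] * w for p, w in zip(probs, prior)]
        eap = sum(t * w for t, w in zip(grid, posterior)) / sum(posterior)
        table[first_score + k] = round(50.0 + 10.0 * eap, 1)
    return table


# Hypothetical parameters for a five-category item scored 1-5.
table = crosswalk_single_item(a=2.0, thresholds=[-0.5, 0.5, 1.3, 2.0])
```

For an NRS item, the same function could be called with ten thresholds and `first_score=0`.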

Conclusions
Based on two different linking methods, we provide practical crosswalk tables that link PROMIS with 0-10 numeric rating scales (NRS) and PRO-CTCAE items for the following symptoms: pain, fatigue, anxiety, depression, and sleep. This is the first linking of these symptoms as measured by the PRO-CTCAE and NRS. Results based on the two methods (fixed-parameter calibration and concurrent calibration with linking constants) were consistent with each other. We tested whether the measures to be linked are highly similar in content, highly correlated, and likely to produce invariant linkages between subgroups. Through the IRT linking process, we found that a T-score of 50 on the PROMIS scale (the population norm) was aligned with NRS scores of 2 or 3 for domains other than depression, for which it was closer to 1. T-scores of 50 were linked with level "2" responses on the PRO-CTCAE items across domains. Interestingly, the 11-level NRS items had only a slightly wider PROMIS score range compared to the 5-level PRO-CTCAE items.

Discussion
The linking tables produced by this effort will have practical research and clinical value. One limitation is that we used only an IRT approach, so we cannot determine how our results compare to those of a non-IRT approach. With this in mind, researchers can use the tables produced by the study to estimate group means on one measure even when the sample has been assessed with another. This has particular value for systematic reviews and meta-analyses of research questions raised around these important cancer symptoms. Clinically, it will be useful to compare cut scores for symptom severity, to help refine the actionability of patient response, and change in response, in treatment settings.