Measurement properties of PROMIS short forms for pain and function in total hip arthroplasty patients

Introduction While the Patient-Reported Outcomes Measurement Information System (PROMIS) is mainly designed for computer adaptive testing, its static short forms (SF) are used when a paper-pencil format is preferred or item banks are not yet translated into the target language. This study examined the measurement properties of the German PROMIS-SF for pain intensity (PAIN), pain interference (PI) and physical function (PF) in total hip arthroplasty (THA) patients. Methods SF were collected before and 12 months post-surgery. Higher scores indicate more PAIN, higher PI and better PF. Oxford Hip Score (OHS) was the main reference measure. Six months post-surgery, a subsample completed the SF twice within 14 days to test reliability. Results Of 172 eligible patients, 147 consented to participate and received questionnaires; 132 (74 males) returned baseline questionnaires (mean age 65.8 ± 10.2 years) and 116 returned 12-month questionnaires. Forty-five patients provided test-retest data. Correlations of all SF with OHS were large (|r| ≥ 0.7; confidence intervals did not include 0.50). Cronbach’s alpha values were: PAIN, 0.86; PI, 0.93; PF, 0.91. Intraclass correlation coefficients were: PAIN, 0.77; PI, 0.81; PF, 0.69. Standard errors of measurement were: PAIN, 3.8; PI, 2.8; PF, 3.6. Smallest detectable change thresholds were: PAIN, 8.8; PI, 6.6; PF, 8.4. Follow-up data showed a ceiling effect (best score) for PAIN (66%), PI (76%), and PF (66%). SF change scores showed large correlations with OHS change scores (|r| > 0.6). Conclusion Our results provide some evidence of construct validity, and acceptable reliability and responsiveness of PROMIS-SF for pain and function in THA patients. These SF can thus be considered acceptable for use, although patients’ improvement in physical function might be underestimated due to the large follow-up PF score ceiling effects.


Introduction
The Patient-Reported Outcomes Measurement Information System (PROMIS®) aims to provide a common health metric for many medical conditions [1]. It is primarily designed for computer adaptive testing (CAT). However, PROMIS static short forms (SF) are also available and in use. PROMIS measurement properties have been investigated in total hip arthroplasty (THA) patients [2][3][4][5], but these investigations are mostly limited to CAT and focused on single aspects of validity [2], interpretability [4,5] or responsiveness [3]. In contrast, the measurement properties of PROMIS-SF for pain and function in THA patients remain largely undetermined.
German language CAT item banks for pain and function were under development by the German PROMIS group at the time of this study. In future, these PROMIS CAT instruments will be offered by this group for third party use via REDCap (personal communication). Furthermore, not all patients actually prefer electronic over paper forms (according to an internal survey where we found half of our patients reporting their preference for paper questionnaires), and this can influence response rate and adherence. The SF can be easily implemented in clinical registries (especially the shortest versions), while connecting CAT platforms to active registries might initially require additional resources. We decided to use the shortest available SF, which were most feasible for our purposes and minimized respondent and administrative burden (i.e. potential barriers to the collection of patient-reported measures in a clinical setting and registries). Therefore, the aim of the study was to examine the psychometric properties of German PROMIS-SF for pain intensity (PAIN), pain interference (PI) and physical function (PF) in THA patients. Valid

Study design and questionnaire administration
This prospective study included consecutive patients of our THA registry from November and December 2016 (Fig. 1). Enrolled patients had to provide consent to use their data for research purposes. Exclusion criteria were living abroad, insufficient knowledge of the German language, cognitive impairment or ongoing follow-up of former surgeries. Ethics approval was obtained. Patient-reported outcomes were collected with paper questionnaires administered 1 to 4 weeks before surgery (baseline) and again 12 months after surgery, either with paper questionnaires or, if chosen by the patient, via online survey. For reliability testing, a subsample of consecutive patients completed questionnaires 6 months after surgery, with a retest within 14 days (median: 6 days); recruitment continued until a sample size of 30 was reached. The patients' condition was considered stable in this period.

Outcome questionnaires
We investigated PROMIS-SF for PAIN (3 items), PI and PF (each with 4 items) provided by the PROMIS Germany research group. Answers are given on 5-point verbal rating scales. For PAIN, we used form 3a (v2.0), which assesses pain over a 7-day recall period and current pain [6]. Form 4a (v1.0) defines PI based on the consequences of pain on relevant aspects of one's life over a 7-day recall period [7,8]. For PF, we used form 4a (v2.0) [9,10], which assesses the current ability to perform various physical activities. Overall scores for PAIN, PI and PF were presented as T-scores; higher scores indicate more PAIN, higher PI and better PF. A T-score of 50 represents the US general population mean, with a standard deviation of 10. Scoring was done using the "HealthMeasures Scoring Service", powered by Assessment Center℠. Missing items were not replaced. As reference measures, we used the Oxford Hip Score (OHS), a condition-specific instrument that assesses constructs encompassing the selected PROMIS domains, and 2 single-item questions rating surgical success.
Specifically, we used the cross-culturally adapted and validated German OHS [11,12]. This 12-item, joint-specific, self-administered questionnaire is valid, reliable and responsive for assessing pain and disability in THA patients. Items are answered on 5-point Likert scales extending from 0 to 4 points, where 4 indicates the best outcome. Total scores, calculated by adding all items, range from 0 (worst) to 48 points (best). The OHS has also been shown to have a two-factor structure (pain, function) [13].
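The OHS scoring rule described above (12 items, each 0 to 4, summed to a 0 to 48 total) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `ohs_total` is hypothetical, and rejecting incomplete questionnaires is an assumption, since the handling of missing OHS items is not described here.

```python
from typing import Sequence


def ohs_total(items: Sequence[int]) -> int:
    """Sum 12 OHS item scores (0 = worst, 4 = best) into a 0-48 total.

    Assumption: incomplete or out-of-range answers are rejected rather
    than imputed; the source does not state how such cases were handled.
    """
    if len(items) != 12:
        raise ValueError("the OHS has exactly 12 items")
    if any(not 0 <= score <= 4 for score in items):
        raise ValueError("each OHS item must be scored from 0 to 4")
    return sum(items)
```

For example, a patient answering every item with the best option would obtain `ohs_total([4] * 12)`, the maximum of 48 points.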
At 12 months, patients rated their global treatment outcome (GTO): "How much did the operation help your hip problem?" on a 5-point Likert scale ranging from "helped a lot" to "made things worse" [14]. They also defined their state of symptom-specific well-being (SSWB): "If you had to spend the rest of your life with the symptoms you have right now, how would you feel about it?" on a 5-point Likert scale ranging from "very satisfied" to "very dissatisfied" [15].

Evaluation of measurement properties
Construct validity was assessed using scale-specific hypothesis testing and considered good if at least 75% of the hypotheses were confirmed. We tested correlations with OHS total score and OHS pain and function subscales at baseline and 12 months, and SSWB at 12 months. All correlations were expected to be large (|r| with confidence intervals ≥ 0.5). Correlations were expected to be negative for PAIN and PI with OHS and for PF with SSWB, and positive for PAIN and PI with SSWB and for PF with OHS.
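As an illustration of this hypothesis-testing logic, the following Python sketch checks whether the confidence interval of a correlation lies entirely at or beyond 0.5 in the expected direction, and whether at least 75% of a set of hypotheses are confirmed. The function names are hypothetical, and both the Fisher z-transformation and the 95% confidence level are assumptions; the method used to derive the confidence intervals is not specified in the text.

```python
import math
from typing import Sequence, Tuple


def fisher_ci(r: float, n: int, z_crit: float = 1.959964) -> Tuple[float, float]:
    """Approximate confidence interval for a Pearson correlation.

    Assumption: Fisher z-transformation with a 95% level (z_crit ~ 1.96).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def hypothesis_confirmed(r: float, n: int, expected_sign: int,
                         threshold: float = 0.5) -> bool:
    """True if the whole interval is 'large' in the expected direction."""
    lower, upper = fisher_ci(r, n)
    if expected_sign > 0:
        return lower >= threshold
    return upper <= -threshold


def validity_is_good(confirmed: Sequence[bool], required: float = 0.75) -> bool:
    """Apply the rule that at least 75% of hypotheses must be confirmed."""
    return sum(confirmed) / len(confirmed) >= required
```

With a hypothetical correlation of r = -0.8 between PI and OHS in 132 patients, `hypothesis_confirmed(-0.8, 132, expected_sign=-1)` returns `True`, because the upper bound of the interval stays below -0.5.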
Internal consistency was calculated using Cronbach's alpha, with values between 0.70 and 0.95 indicating appropriate internal consistency [16]. Test-retest reliability was assessed with the intraclass correlation coefficient (ICC) from a single-measurement, absolute-agreement, 2-way mixed-effects model; an ICC (confidence interval) ≥ 0.7 was considered acceptable [16]. Agreement was assessed using the standard error of measurement (SEMagr = √(variance due to systematic differences between measurements + residual variance)). The effect size based on SEMagr was calculated from the mean change score. The smallest detectable change (SDC) for individuals that can be considered above the measurement error with a 90% confidence level was calculated as SDC90 = 1.65 × √2 × SEMagr [17].
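The reliability quantities defined above can be sketched in Python as follows. This is a minimal illustration and not the Stata analysis used in the study: the function names are hypothetical, and estimating the variance components from two-way ANOVA mean squares (with the systematic component truncated at zero) is an assumption.

```python
import math
import statistics
from typing import NamedTuple, Sequence


class Agreement(NamedTuple):
    icc: float      # single measurement, absolute agreement
    sem_agr: float  # standard error of measurement (agreement)
    sdc90: float    # smallest detectable change, 90% confidence


def sdc90_from_sem(sem_agr: float) -> float:
    """SDC90 = 1.65 * sqrt(2) * SEMagr, as given in the text."""
    return 1.65 * math.sqrt(2) * sem_agr


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha; `items` holds one sequence of scores per item."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    item_var = sum(statistics.variance(item) for item in items)
    return k / (k - 1) * (1 - item_var / statistics.variance(totals))


def retest_agreement(first: Sequence[float], second: Sequence[float]) -> Agreement:
    """ICC, SEMagr and SDC90 from paired test and retest scores."""
    n, k = len(first), 2
    if n != len(second) or n < 2:
        raise ValueError("need at least two complete test-retest pairs")
    pairs = list(zip(first, second))
    grand = (sum(first) + sum(second)) / (n * k)
    row_means = [(a + b) / k for a, b in pairs]    # per patient
    col_means = [sum(first) / n, sum(second) / n]  # per occasion
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum((x - row_means[i] - col_means[j] + grand) ** 2
                 for i, pair in enumerate(pairs) for j, x in enumerate(pair))
    ms_err = ss_err / ((n - 1) * (k - 1))
    # Variance due to systematic differences between measurements.
    var_systematic = max((ms_cols - ms_err) / n, 0.0)
    sem_agr = math.sqrt(var_systematic + ms_err)
    icc = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
    return Agreement(icc, sem_agr, sdc90_from_sem(sem_agr))
```

As a plausibility check, `sdc90_from_sem(3.6)` gives approximately 8.4, which matches the SDC90 reported for PF with an SEMagr of 3.6.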
Responsiveness defines the ability of a questionnaire to detect clinically important changes over time. Longitudinal validity can be considered a measure of responsiveness and is examined by inspecting the correlation between change scores of the instrument under validation and the reference instrument. We expected negative correlations between change scores of PAIN, PI and OHS, and positive correlations between change scores of PF and OHS, each in the order of |r| (confidence intervals) ≥ 0.5. The smallest effect size of interest was defined as a Cohen's d ≥ 1.5 for the decrease in PI and increase in PF based on other studies [3,18]. Responsiveness was considered sufficient if at least 75% of the hypotheses were confirmed.
Floor and ceiling effects were considered acceptable if percentages were below 15%. To determine the individual-level minimal important change (MIC), we used linear regression with the OHS change scores and reported MIC for OHS in THA patients [19].
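The two checks in this paragraph can be illustrated with a short Python sketch. The function names are hypothetical. The reading of the regression step is also an assumption: here the SF change score is regressed on the OHS change score, and the fitted line is evaluated at the published OHS MIC, which is passed in as a parameter because its value is not stated in the text.

```python
import math
from typing import Sequence, Tuple


def best_score_share(scores: Sequence[float], best_score: float) -> float:
    """Proportion of patients at the best possible score (ceiling effect)."""
    hits = sum(1 for s in scores if math.isclose(s, best_score))
    return hits / len(scores)


def effect_is_acceptable(share: float, limit: float = 0.15) -> bool:
    """Floor or ceiling effects are acceptable if below 15%."""
    return share < limit


def ols_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x must vary to fit a regression line")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mic_from_reference(sf_change: Sequence[float],
                       ohs_change: Sequence[float],
                       ohs_mic: float) -> float:
    """SF change predicted at the reference instrument's MIC (assumed approach)."""
    slope, intercept = ols_fit(ohs_change, sf_change)
    return intercept + slope * ohs_mic
```

Because the best score is passed explicitly, the same function covers PF (highest T-score is best) and PAIN or PI (lowest T-score is best).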
Analyses were performed using Stata Statistical Software Release 15 (StataCorp LP, TX, USA).

Results
Table 1 presents the baseline demographics with pain and function status. The age range was 32 to 93 years, with a median of 66.8 years. Most surgeries were primary THA (92%); 8% of patients underwent THA revisions.

Construct validity
Scale-specific hypothesis testing for validity resulted in 100% confirmed hypotheses for PAIN, 89% for PI and 78% for PF (Table 2).

Reliability
Cronbach's alpha values (PAIN, 0.86; PI, 0.93; PF, 0.91) were all within the prespecified range of 0.70 to 0.95. ICC confidence intervals were ≥ 0.7 for PAIN and PI, but not for PF (Table 3). PAIN showed the highest SEMagr and SDC90, whereas PI had the lowest. The effect size based on SEMagr was smallest for PF, and smaller than that of OHS for all three SF.

Discussion
Our results suggest that the construct validity of PROMIS-SF is acceptable in THA patients. The SF have good internal consistency, test-retest reliability and responsiveness. For PAIN and PI, MICs were larger than the corresponding SDC90 values. Some measurement property limitations were nevertheless detected.
For PF, MIC was smaller than SDC90, meaning that clinically relevant change could not be distinguished from measurement error on the individual level. Compared to OHS, all SF show 40% to 60% smaller effect sizes based on SEMagr, which means that the joint-specific OHS allows more detailed grading of patient recovery than the PROMIS-SF scales.
The high proportion of patients with best possible scores of PI and PAIN after surgery may not be critical. These scales represent unipolar constructs where the complete absence of pain or pain interference makes it difficult (yet likely less relevant) to differentiate them any further. Nevertheless, researchers should be careful in interpreting PF after surgery because of ceiling effects. This problem may be resolved by using PF CAT without substantially increasing respondent burden [20,21]. Although confirmation of this aspect is warranted, we think it is unlikely that longer SF (i.e. 6b, 8b, 20a or 12a for people who can walk) will impact the ceiling effect because their maximum T-score is only slightly higher (59 to 66) than that of the 4-item SF (57) [9,21]. In addition, 12% of patients went from the worst possible to the best possible PI score from baseline to follow-up, which can be critical if a more detailed grading of recovery is desired.

Limitations
THA is typically associated with very high patient satisfaction. Consequently, we did not have patients in the "poor outcome" category upon dichotomisation of the GTO, and MIC could not be calculated with an anchor-based standard method using the receiver operating characteristic curve. For this reason, we adopted an alternative indirect approach using linear regression from the OHS MIC calculated in a much larger study with 82,415 THA patients [19].
Only 77% of eligible patients responded at baseline and 67% at follow-up. From our internal registry quality-control procedures, we know that "lack of time" is the most common reason for not responding. Among follow-up non-responders, fewer than 3% refused to cooperate because they were dissatisfied with their treatment, which suggests that there was no major selection bias.
Unidimensionality of the SF scale structure was not assessed because it is covered by existing reports and guidelines on the development of PROMIS item banks [1,22,23]. The unidimensionality of the PF and PI item banks has been reported previously [8,24].

Conclusion
Our results provide some evidence of construct validity, and acceptable reliability and responsiveness of PROMIS-SF for pain and function in THA patients. The SF can thus be considered as acceptable as another common static instrument (i.e. OHS) for use in these patients, although improvement in PF might be underestimated due to the large follow-up PF score ceiling effects.