The Norwegian PROMIS-29: psychometric validation in the general population for Norway

Background The Patient Reported Outcomes Measurement Information System (PROMIS) profile instruments include “high information” items drawn from large item banks following the application of modern psychometric criteria. The shortest adult profile, PROMIS-29, looks set to replace existing short-form instruments in research and clinical practice. The objective of this study was to undertake the first psychometric evaluation of the Norwegian PROMIS-29, following a postal survey of a random sample of 12,790 Norwegians identified through the National Registry of the Norwegian Tax Administration. Confirmatory factor analysis was used to assess structural validity. Fit to the Rasch partial credit model and differential item functioning (DIF) were assessed in relation to age, gender, and education. PROMIS-29 scores were compared to those for the EQ-5D-5L and the Self-administered Comorbidity Questionnaire (SCQ), for purposes of assessing validity based on a priori hypotheses. Results There were 3,200 (25.9%) respondents with a mean age (SD) of 51 (20.7) years (range 18 to 97 years), and 55% were female. The PROMIS-29 showed satisfactory structural validity and acceptable fit to the Rasch model, including unidimensionality, and measurement invariance across age and education levels. One pain interference item had uniform DIF for gender, but splitting the item gave satisfactory fit. Domain reliability estimates ranged from 0.85 to 0.95. Correlations between PROMIS-29 domain, SCQ and EQ-5D scores were largely as expected, the largest being for scores assessing very similar aspects of health. Conclusions The Norwegian version of the PROMIS-29 is a reliable and valid generic self-reported measure of health in the Norwegian general population. The instrument is recommended for further application, but the analysis should be replicated and responsiveness to change assessed in future studies before it can be recommended for clinical and health services evaluation in Norway.
Supplementary Information The online version contains supplementary material available at 10.1186/s41687-021-00357-3.


Introduction
The US National Institutes of Health (NIH) Patient Reported Outcomes Measurement Information System (PROMIS®) is the most important development in the field of health status measurement, following the advent of short-form generic instruments over three decades ago [1]. PROMIS unifies measurement through standardized measures with broad applicability across health problems in clinical practice, research, and quality measurement [2]. The system builds on recent scientific advances including item response theory (IRT) and computer adaptive testing (CAT), resulting in higher precision and lower respondent burden, respectively. Standardization, based on common metrics, allows for comparisons across domains, across health problems, and with the general population [2]. PROMIS measures are freely available and have widespread application internationally [3,4].
PROMIS IRT-calibrated item banks assess aspects of physical, mental, and social health and include over 300 measures for adults and children [4]. This approach promotes flexibility in the selection of domains and items of relevance to specific health problems or populations [5]. PROMIS items within an item bank can be administered by short form fixed questionnaires (4-10 items) or CAT (4-12 items), with the former contributing to profiles.
The PROMIS-29 adult profile is a brief generic health measure comprising 29 items from the PROMIS domains of anxiety, depression, fatigue, pain (intensity and interference), physical function, sleep disturbance, and satisfaction with participation in social roles (social participation) [2]. The PROMIS-29 has had rapid uptake since it became available in the last decade, including translation into over 40 languages [2], evaluation of measurement properties in different countries and populations [6][7][8], and application in research, including randomized controlled trials [9][10][11]. The instrument has also been used in crosswalks or mapping to other widely used PROMs including the EuroQol EQ-5D [12]. The inclusion of an extra domain of cognitive function-abilities, or its imputation using PROMIS-29 data, also makes it suitable for economic evaluation through the inclusion of values for health states in the form of PROPr [3,13].
The present study describes the evaluation of the Norwegian-language version of the PROMIS-29, following a postal survey of the general population for Norway. The measure was assessed for data quality, structural validity, fit of the seven domains to the IRT partial credit model, differential item functioning (DIF), internal consistency and convergent validity through comparisons with scores for the EQ-5D and a comorbidity questionnaire.

Data collection
This study was based on data from a national sample of Norwegians aged 18 years and over. Published Norwegian surveys [14][15][16][17][18] informed the sample size and quota sampling for seven age groups and sex. The random sample of 12,790 adults aged 18 years and over was selected from the Norwegian Tax Administration registry (Folkeregisteret). They were sent a postal questionnaire and reply-paid envelope addressed to the Norwegian Institute of Public Health on December 15, 2019. An accompanying letter explained the study purpose and that respondents would be included in a lottery of ten prizes, each to the value of 1000 Euros.
The Regional Committee for Medical and Research Ethics stated that the study did not need ethical board approval and a Data Protection Impact Assessment was approved by the Institute on the 16th October 2019.
The questionnaire included the Norwegian version of the PROMIS-29 as distributed by the PROMIS Health Organization [19]. Translations of PROMIS measures follow FACIT universal methodology, an iterative process of forward- and back-translation, expert review, harmonization and cognitive interviewing [1]. Each domain comprises four items with five-point descriptive scales, except for pain intensity, which has a single 0-10 numerical rating scale. The sum of the item responses for each multi-item domain is converted to a T-score, where a score of 50 is the average for the US general population with a standard deviation of 10 [2,19]. Higher scores represent more of a domain. Therefore, for physical function, higher scores represent better health whereas for anxiety, higher scores represent poorer health.
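The T-score metric described above can be illustrated with a minimal sketch. It assumes a latent domain score already standardized to the US reference population (mean 0, SD 1); the function name `theta_to_t_score` is hypothetical. The conversion from raw summed scores to T-scores relies on the published PROMIS scoring tables, which are not reproduced here.

```python
def theta_to_t_score(theta, reference_mean=50.0, reference_sd=10.0):
    """Place a standardized latent score on the T-score metric.

    Assumption: theta is expressed in SD units relative to the US general
    population (mean 0, SD 1). This is NOT the raw-sum lookup used in
    practice; it only illustrates the metric (mean 50, SD 10).
    """
    return reference_mean + reference_sd * theta


# One SD above the reference mean corresponds to a T-score of 60.
# For physical function this would indicate better health; for anxiety,
# poorer health, because higher scores represent more of the domain.
t_above = theta_to_t_score(1.0)
t_reference = theta_to_t_score(0.0)
```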
The questionnaire also included the Norwegian EQ-5D-5L, which includes five dimensions (mobility, self-care, usual activities, pain/discomfort, and anxiety/depression) with five levels [20]. Health states are transformed to a single index using a scoring algorithm derived from valuation tasks undertaken with general population samples. An algorithm is not yet available for Norway and hence recommendations from the Norwegian Medicines Agency [21] were followed, including the use of the UK value set [22] and mapping [23]. Scores for the EQ-5D index range from -0.59 to 1, where 1 is the best possible health state. In addition to the five dimensions, the EQ VAS assesses self-rated health on a vertical visual analogue scale, with endpoints labelled "Best imaginable health state" (100) and "Worst imaginable health state" (0). The presence of health problems was assessed by the Self-administered Comorbidity Questionnaire (SCQ), which lists thirteen medical conditions and up to three other non-specified medical problems [24]. Osteo- and rheumatoid arthritis are listed separately but scored as one. Respondents are asked if they have a condition, if they are receiving treatment for it, and if it limits their activities. All items use yes/no responses and each "yes" is scored one, giving a score range of 0 to 45, the latter equivalent to 15 conditions being present, treated, and limiting activities. The Norwegian version underwent two independent forward-backwards translations in accordance with recommendations for PROMs translation [25]. Background questions included age, gender, and education level.
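The SCQ scoring rule described above can be sketched as follows. This is a minimal illustration, not the instrument's official scoring code; the function name `scq_total` and the input layout are hypothetical, and it assumes that the two arthritis entries have already been merged so that at most 15 scored conditions remain.

```python
MAX_SCORED_CONDITIONS = 15  # 13 listed + up to 3 other, with arthritis scored as one


def scq_total(responses):
    """Sum the 'yes' answers across scored conditions.

    responses: iterable of (has_condition, receives_treatment,
    limits_activities) booleans, one tuple per scored condition.
    Assumption: osteo- and rheumatoid arthritis are already merged
    into a single entry before this function is called.
    """
    responses = list(responses)
    if len(responses) > MAX_SCORED_CONDITIONS:
        raise ValueError("expected at most 15 scored conditions")
    # Each 'yes' contributes one point, so each condition adds 0 to 3.
    return sum(int(has) + int(treated) + int(limits)
               for has, treated, limits in responses)


# A respondent with two conditions, one treated and limiting, one neither:
example_score = scq_total([(True, True, True), (True, False, False)])
```

The maximum of 45 corresponds to all 15 conditions being present, treated, and limiting activities.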

Statistical analysis
Statistical analysis followed an a priori analysis plan with explicit hypotheses. Missing data and floor and ceiling effects were assessed at the item and domain level. Confirmatory factor analysis (CFA) with robust weighted least squares (WLSMV), appropriate for categorical data [26,27], was used to assess the structural validity of the PROMIS-29, or the extent to which the item scores adequately contribute to the seven domains [28]. Model fit was assessed by the Root Mean Square Error of Approximation (RMSEA) [27,29]. The unidimensionality of each domain was tested using the partial credit model, which extends the Rasch model to polytomous items and hence has separable item and person parameters, sufficient statistics, and conjoint additivity, permitting item and person comparisons [30]. Overall and item fit statistics were used to assess whether items within the domains fitted the one-dimensional model. Item fit was assessed with the χ² statistic, standardized residuals, which should be between ± 2.5, and item characteristic curves. Local independence, a further assumption of Rasch models, was assessed through examination of the residual correlation matrix, with coefficients of ≥ 0.2 indicating redundancy among items [31,32].
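For readers less familiar with the partial credit model, the following minimal sketch shows how response category probabilities follow from a person location and the item thresholds, all in logits. It is an illustration of the standard model formula only, not the software used in this study; the function names and the threshold values are hypothetical.

```python
from math import exp


def pcm_category_probabilities(theta, thresholds):
    """Category probabilities for one item under the partial credit model.

    theta: person location (logits).
    thresholds: step difficulties delta_1..delta_m (logits); a five-category
    item has four thresholds.
    """
    # Cumulative sums of (theta - delta_j); the empty sum for category 0 is 0.
    cumulative = [0.0]
    for delta in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta))
    peak = max(cumulative)  # subtract the maximum for numerical stability
    weights = [exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def pcm_expected_score(theta, thresholds):
    """Model-expected item score (0..m) at a given person location."""
    probs = pcm_category_probabilities(theta, thresholds)
    return sum(k * p for k, p in enumerate(probs))


# Hypothetical, ordered thresholds for one five-category item.
example_thresholds = [-1.5, -0.5, 0.5, 1.5]
example_probs = pcm_category_probabilities(0.0, example_thresholds)
```

Item fit residuals of the kind reported here compare observed item scores with model-expected scores such as those returned by `pcm_expected_score`.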
Domain invariance was assessed through uniform and non-uniform differential item functioning (DIF) for age (6 categories), gender, and education level (3 categories); differences of ≥ 0.5 logits in item difficulties were considered meaningful [33,34].
Internal consistency was assessed by Cronbach's alpha [35] and the person separation index (PSI) [36]. These are similarly interpreted, but the PSI uses the logit values (linear person estimates) and is the proportion of error-free variance in the distribution of person estimates relative to the sum of this variance and the error variance in these estimates. Reliability estimates of 0.70 and 0.90 were deemed necessary for group and individual comparisons, respectively [37].
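The two reliability estimates can be sketched as below. This is a minimal illustration using textbook formulas and complete-case data, not the software used in the study; all function names are hypothetical, and the PSI is computed on the assumption that the observed variance of person estimates is the sum of error-free and error variance.

```python
from statistics import mean, variance


def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding one score per respondent."""
    k = len(item_scores)
    totals = [sum(row) for row in zip(*item_scores)]  # summed score per respondent
    item_variance_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


def person_separation_index(person_estimates, standard_errors):
    """Proportion of error-free variance in the person estimates (logits)."""
    observed_variance = variance(person_estimates)
    error_variance = mean([se ** 2 for se in standard_errors])
    # Assumption: observed variance = error-free variance + error variance.
    return (observed_variance - error_variance) / observed_variance


def supported_comparisons(reliability):
    """Apply the 0.70 (group) and 0.90 (individual) criteria from the text."""
    if reliability >= 0.90:
        return "individual"
    if reliability >= 0.70:
        return "group"
    return "neither"
```

For example, person estimates with a variance of 2.5 logits² and a mean squared standard error of 0.25 give a PSI of 0.9.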
Hypothesis testing was used to further assess the convergent validity of the PROMIS-29 domain scores through comparisons with those for the EQ-5D and SCQ. Inclusion of EQ-5D item data meant that Spearman correlation was used. Criteria for expected levels of correlation followed those used in a systematic review of generic PROMs [38]. First, correlations ≥ 0.60 were expected for scores assessing the same construct: anxiety and depression and EQ-5D anxiety/depression; pain interference/intensity and EQ-5D pain/discomfort; physical function and EQ-5D mobility, usual activities; social participation and EQ-5D usual activities. Second, correlations < 0.60 and ≥ 0.30 were expected for instruments assessing largely related but dissimilar constructs: fatigue and EQ-5D anxiety/depression; pain interference and EQ-5D mobility, usual activities; physical function and EQ-5D self-care, pain/discomfort; social participation and EQ-5D mobility. This level was also expected for correlations between all PROMIS-29 domain scores and those for the EQ-5D index and EQ VAS. Third, correlations < 0.50 and ≥ 0.20 were expected for scores assessing moderately related but dissimilar constructs: anxiety/depression and EQ-5D usual activities, pain/discomfort; fatigue and remaining EQ-5D scores; sleep disturbance and EQ-5D usual activities, pain/discomfort, anxiety/depression; pain intensity and EQ-5D mobility, usual activities, anxiety/depression; social participation, pain interference and EQ-5D self-care, anxiety/depression; social participation and EQ-5D pain/discomfort. Fourth, correlations < 0.30 were expected for scores assessing weakly related or unrelated constructs: anxiety/depression and EQ-5D mobility, self-care; pain intensity and EQ-5D self-care; physical function and EQ-5D anxiety/depression; sleep disturbance and EQ-5D mobility, self-care.
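The hypothesis-testing step above can be sketched as a rank correlation followed by a check against the four expected ranges. This is a minimal illustration, not the analysis code used in the study. The function and dictionary names are hypothetical, and the use of absolute values is an assumption made here because the scales run in different directions (for example, higher physical function scores indicate better health, whereas higher EQ-5D levels indicate more problems).

```python
from math import sqrt


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Expected ranges from the text as (lower inclusive, upper exclusive).
# The ranges overlap by design.
HYPOTHESIS_BANDS = {
    "same_construct": (0.60, float("inf")),
    "largely_related": (0.30, 0.60),
    "moderately_related": (0.20, 0.50),
    "weakly_related": (0.00, 0.30),
}


def within_band(rho, band):
    """True if |rho| falls in the hypothesized range (absolute value assumed)."""
    lower, upper = HYPOTHESIS_BANDS[band]
    return lower <= abs(rho) < upper
```

The proportion of correlations falling within their hypothesized range can then be compared with the 75% criterion referred to in the Discussion.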
Different studies using a variety of approaches to assessing multimorbidity, including simple counts, have found that higher levels of multimorbidity are associated with poorer health [39]. One third of SCQ scores comprise activity limitations and correlations of up to 0.4 have been found with SF-36 scores [24]. The great majority of SCQ items relate to somatic health problems, and hence, correlations in the range < 0.5 and ≥ 0.20 were expected for PROMIS-29 domains of physical function, social participation, pain interference/intensity. Lower correlations < 0.3 were expected for the remaining domains. EQ-5D domains comprise single items, and hence, compared to the PROMIS-29, lower correlations in the same range were expected with SCQ scores. Slightly higher correlations were expected for the EQ-5D index and EQ VAS scores which assess health more generally.

Data collection
Of the 12,790 questionnaires mailed, 426 were returned as incorrectly addressed, and one person had died. Of the remainder, 3,200 (25.9%) returned a questionnaire that was at least partly completed. The mean age (SD) was 51 (20.7) years and ages ranged from 18 to 97 years (Table 1). There were approximately 10% more female than male respondents, and 247 to 698 respondents across seven age categories; the lowest number of respondents was for those aged 80 years and above and the highest was for those aged 18-29 years. Compared to general population data available from Statistics Norway from the time of the data collection [40], survey respondents were also over-represented for the youngest and oldest age groups, highest education level, and married/domestic partner (Table 1).
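The response rate arithmetic can be reproduced directly from the figures above; the variable names are illustrative only.

```python
mailed = 12790
incorrectly_addressed = 426
deceased = 1
returned = 3200

# Denominator: questionnaires assumed to have reached a living addressee.
eligible = mailed - incorrectly_addressed - deceased
response_rate_percent = 100 * returned / eligible
non_respondents = eligible - returned

print(f"eligible={eligible}, response rate={response_rate_percent:.1f}%, "
      f"non-respondents={non_respondents}")
```

This gives 12,363 eligible recipients, a response rate of 25.9%, and 9,163 non-respondents, consistent with the figure of over 9,000 non-respondents given under Strengths and limitations.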

Distribution of scores
Levels of missing data for the PROMIS-29 ranged from 0.3 to 3.4% for items relating to sleep and anxiety respectively (Table 2). The four anxiety items had the highest levels of missing data for any domain. Floor or ceiling effects, indicative of the best possible health, were apparent and exceeded 70% for ten items. For the PROMIS-29 domains, 71% of respondents had the best possible physical function, the other domains ranging from 7.5 to 54.2% for sleep disturbance and depression respectively. The p values for the chi-square statistics in Table 3 show that the PROMIS-29 items and domains fit the Rasch unidimensional model. Moreover, the results were highly consistent, with no disordered thresholds for any item, and correlations between item residuals did not suggest any lack of local independence. Additional file 1 includes the item characteristic curves for these items. There was no evidence of age or education DIF and only the pain interference item, "How much did pain interfere with your household chores?", was affected by uniform DIF relating to gender (> 0.5 logits), indicating that, compared to males, females gave responses showing more severe impact across the scale. This item was split to create gender-specific versions of the same item, which gave satisfactory model fit.

Psychometric evaluation
The correlations with the EQ-5D were largely consistent with a priori hypotheses. Correlations ≥ 0.60 were found for PROMIS domain scores and those for the EQ-5D assessing the same construct, the highest being for those relating to pain. More moderate correlations for domain and EQ-5D scores assessing largely related but dissimilar constructs were found in the range 0.47 to 0.55. Correlations with the EQ-5D index scores were considerably higher than the expected upper level of 0.6 for the two PROMIS domains relating to pain interference and pain intensity. They were also slightly higher than this level for physical function and social participation. Table 4 also shows that PROMIS-29 domain and EQ-5D scores had statistically significant associations with those for the SCQ, the highest being for domains relating most to physical health, which were largely above the expected range of < 0.50 and ≥ 0.20, particularly for the pain domains. Correlations for the EQ-5D item scores were, as expected, slightly lower, except for anxiety/depression. The correlation for the EQ-5D index scores was higher than those for EQ-5D items and PROMIS domains. The EQ VAS correlation was lower than expected, and below that for the PROMIS-29 domains that relate most to physical health. Overall, 53 (83%) of the 64 correlations for the PROMIS-29 were within the hypothesized range.

Discussion
The PROMIS-29 performed satisfactorily in relation to measurement criteria widely recommended in the evaluation of PROMs including classical and modern psychometric methods [28]. Levels of missing data were low across the 29 items, but many items showed high ceiling effects denoting the highest possible levels of health, which meant that the domain scores for all but the sleep disturbance domain were highly skewed. This follows previous findings for general populations from France, Germany and the UK [7,41]. Short-form instruments such as the PROMIS-29 include the most important health domains and items of general relevance across sick and healthy populations, and hence, skewed data towards positive health was not unexpected in this population. Highly skewed PROMs data is common for general population samples [14][15][16]. In a comparison of data from Germany, Poland, South Korea, and USA, the 5L version of the EQ-5D reported here was found to have ceiling effects in the range of 48 to 97% and 35 to 61% for item and index scores respectively [42]. Skewed data might also be expected in younger age groups with more minor health problems. Given the potential supplementary information that they offer, additional PROMIS short forms, item banks and/or condition-specific instruments should be considered for application alongside short-form generic instruments. CFA showed that the Norwegian PROMIS-29 had good evidence for structural validity including the presence of the seven domains. Rasch analysis further confirmed unidimensionality of the seven domains, which had acceptable levels of reliability, with all domains close to, or meeting, the more stringent criterion of 0.9 [37]. This follows the findings of the developers and similar testing in general populations for other countries [7,41].
The instrument was not affected by DIF for age and education levels but, as was found previously [41], females and males were found to respond differently to one of the items within the pain interference domain. At 0.5 logits, this is considered a large effect [34]. DIF has greater implications for domains that comprise few items, including those within the PROMIS-29. It is recommended that the domain of pain interference is analysed separately for gender [41]. Several of the fit residuals were outside of the ± 2.5 range but this was a large sample size which can make them unreliable [43]. The great majority of the correlations for the convergent validity of the PROMIS-29 were as hypothesized and met the criterion of 75% [28]. The remainder were all higher than expected. The EQ-5D is the most widely tested and applied generic PROM suitable for use in economic evaluation [20,44], and hence, comparisons by means of expected correlations with the PROMIS-29 increase our understanding of the latter in terms of its validity as a short-form generic health profile. Given their general focus, criteria for expected levels of correlation followed those used in a systematic review [38] and psychometric testing of generic PROMs [44].
Table 3 Rasch analysis for the seven domains of the Norwegian PROMIS-29. a Overall fit p value for chi-square, where non-significance (p > 0.05) indicates fit to the Rasch model. Person separation index is an estimate of reliability or the proportion of error free variance of the distribution of person estimates relative to the sum of this variance and the error variance in these estimates. b Location is the item position on the latent scale or level of health assessed. Fit residuals are the difference between the observed and expected scores for the item, a non-significant (p > 0.05) chi-square indicating fit to the Rasch model.
The criteria, in terms of the range of correlations, overlap, which takes account of different approaches to assessing health constructs and their operationalization through items and scaling. For example, PROMIS-29 uses multi-item scales with several domain scores, whereas the EQ-5D uses single items that form an index based on preferences or values for health states obtained from the general population [20].
Domain scores that assess the same or very similar constructs had correlations exceeding the expected level of 0.6. The levels of correlation were highest for those assessing aspects of pain. The PROMIS-29 domain of pain interference assesses the effect of pain on daily activities, and arguably has the greatest overlap with any of the EQ-5D dimensions. The EQ-5D assesses anxiety and depression through a single item, whereas PROMIS-29 has two separate domains which are highly correlated but, as this and other studies have found, are distinct [7,41]. Previous studies have also found acceptable levels of correlation between PROMIS-29 scores and those for other legacy instruments including the SF-36 [41,45]. The consistent association with the SCQ scores provides further empirical support for the convergent validity of the PROMIS-29 [41]. Furthermore, it supports its potential use as a measure of quality of care for people with multimorbidity and for the development of systems for identifying individuals at risk of deterioration [46,47].

Strengths and limitations
The study was comparable in scope and size to existing European studies that have assessed the measurement properties of the PROMIS-29 in the general population [7,41]. This secured more than an adequate sample size for the application of CFA and the Rasch partial credit model. The latter has been widely applied in the field of health measurement and while the graded response model has been more widely used for PROMIS measures [2], the Rasch partial credit model has had considerable application in Europe, including the PROMIS-29 [41]. It is encouraging that the PROMIS-29 domains demonstrate adequate fit to both models. Previous studies have included the SF-36, an established generic health profile, for purposes of assessing the validity of the PROMIS-29 [41,45]. The current study included the EQ-5D, which is the most widely tested and used PROM suitable for use in economic evaluation [20,44]. In common with these studies, this was a cross-sectional design, and hence, responsiveness to changes in health was not assessed. The survey was conducted three months before the COVID-19 pandemic in Norway and a one-year follow-up survey that included the PROMIS-29 was implemented to assess the impact of the pandemic on the health of the Norwegian general population. It is anticipated that PROMIS measures, including the PROMIS-29, will have increasing use in Norway. The PROMIS-57 has evidence for measurement properties in a smaller Norwegian general population sample recruited through mainstream and social media [48] and is being used in a long-term follow-up of COVID-19 outpatients [49]. Several item banks and short forms have been translated for children with national applications including the Norwegian Pandemic Register [50] and Child Hip Register [51].
National data from Statistics Norway shows that the sample cannot be considered fully representative of the general population. It is uncertain whether a more representative sample would have influenced the findings of the psychometric analyses, but there was no evidence for DIF across age groups and education levels. The response rate of 26% would have increased had a reminder been used, but this would have proved costly with over 9,000 non-respondents.

Conclusions
In conclusion, the Norwegian-language PROMIS-29 has evidence for acceptable measurement properties including reliability and validity, in a large sample of the Norwegian general population. Subject to further testing including responsiveness to change, it may be suitable for applications where a short-form profile measure of health is required that offers more detailed information than the EQ-5D. However, this study only assessed a limited range of measurement properties in the general population. Further testing is recommended in patient populations along with an evaluation of responsiveness to changes in health.