Psychometrics of three Swedish physical pediatric item banks from the Patient-Reported Outcomes Measurement Information System (PROMIS)®: pain interference, fatigue, and physical activity

Background The Patient-Reported Outcomes Measurement Information System (PROMIS®) aims to provide self-reported item banks for several dimensions of physical, mental and social health. Here we investigate the psychometric properties of the Swedish pediatric versions of the Physical Health item banks for pain interference, fatigue and physical activity, which can be used in school health care and other clinical pediatric settings. Physical health has been shown to be more important for teenagers’ well-being than ever because of the link to several somatic and mental conditions. The item banks are not yet available in Sweden. Methods Participants aged 12 to 19 years (n = 681) were recruited in public school settings and at a child and adolescent psychiatric outpatient clinic. Three one-factor confirmatory factor analysis (CFA) models were fitted to evaluate scale dimensionality. We analyzed monotonicity and local independence. The items were calibrated by fitting the graded response model. Differential item functioning (DIF) analyses for age, gender and language were performed. Results The three one-factor models supported that each item bank measures a unidimensional construct. No violations of monotonicity or local independence were found. We found that 11 items had significant lack of fit in the item response theory (IRT) analyses. The results also showed DIF for age (seven items) and language (nine items). However, the differences in item fit and the McFadden effect sizes were negligible. After considering the analytic results, graphical illustration, item content and clinical relevance, we decided to keep all items in the item banks. Conclusions We translated the U.S. PROMIS item banks for pain interference, fatigue and physical activity into Swedish and validated them by applying CFA, IRT and DIF analyses. The results suggest adequacy of the translations in terms of their psychometrics. The questionnaires can be used in school health and other pediatric care. Future studies could use Computerized Adaptive Testing (CAT), which presents fewer but reliable items to the respondent compared with classical testing.

latent trait is being measured. In contrast to classical methods, precise measurement may require only a few items to measure a construct because the calibration or weighting of each question is built into the results. In computerized adaptive testing (CAT), the answer to one question is used to identify the next question to be asked, namely the one that will most reduce the error of the predicted total score. With CAT, respondents do not need to answer the same items as each other in order to produce comparable scores. Different questions within the same item bank can be used to arrive at a total score for that domain. Thus, IRT techniques minimize the number of items presented to each respondent and further prevent test-tiredness, because different questions can be answered at each test occasion.
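As a rough illustration of the item selection step described above, the following Python sketch picks the next item as the one with the highest Fisher information at the current trait estimate. It is a simplification under stated assumptions: a dichotomous two-parameter logistic model is used instead of the graded response model applied in this study, and the item names and parameters are hypothetical.

```python
import math


def item_information(theta, a, b):
    """Fisher information of a two-parameter logistic item at trait level theta.

    Simplifying assumption: dichotomous 2PL items, not the polytomous
    graded response model used for the PROMIS item banks.
    """
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)


def next_item(theta, bank, asked):
    """Return the name of the unasked item that is most informative at theta."""
    candidates = {name: params for name, params in bank.items() if name not in asked}
    return max(candidates, key=lambda name: item_information(theta, *candidates[name]))


# Hypothetical item bank: name -> (discrimination a, difficulty b)
bank = {"item_1": (1.0, -2.0), "item_2": (2.0, 0.0), "item_3": (1.5, 1.0)}

first = next_item(0.0, bank, asked=set())
second = next_item(0.0, bank, asked={first})
print(first, second)
```

In a full adaptive test, the trait estimate would be updated after every response before the next item is selected; that step is omitted here.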
This study is part of a Swedish PROMIS cooperative research group [31] aiming to translate and standardize PROMIS measures across global initiatives and settings. We work to create a shared unified terminology and metric to report common symptoms and functional life domains. PROMIS item banks offer great potential for improving Swedish and global assessment in clinical trials and evaluation of treatment and health care in clinical settings.
In this study, we validated the Swedish translations of three PROMIS Pediatric item banks. The PROMIS pediatric scale of pain interference has been used in studies among child and adolescent populations with conditions such as juvenile fibromyalgia and sickle cell disease [17][18][19][20][32] and has shown good psychometric properties. The PROMIS pediatric Fatigue item bank has previously been applied in several studies of child and adolescent populations [20][21][22]. One article using IRT, Lai et al. [21], showed that the Fatigue scale demonstrated satisfactory psychometric properties after removing two items. The PROMIS pediatric Physical activity item bank [23][24][25] has also previously been shown to be a precise and valid measurement of children's lived experiences of physical activity [23].
The Swedish versions of the item banks need to be validated to ensure that quality and consistency are maintained from the PROMIS original English versions. The aim of this study was to validate three item banks in a Swedish population: The PROMIS pediatric item banks of Pain Interference v.2.0, Pediatric Fatigue v.2.0 and Pediatric Physical Activity v.1.0. These item banks were recently translated to Swedish [21].

Study setting
The study was conducted in the northern part of Sweden and was approved by the Regional Swedish Ethical Review Board in Umeå (number 2018/59-31). The authors have been working with PROMIS Health

Procedure
Adolescents (n = 681) were recruited between September 2018 and May 2019 from four community high schools (n = 638) and one child and adolescent psychiatric (CAP) clinic (n = 43). To be eligible for the study, participants had to be fluent in spoken and written Swedish. Oral and written informed consent was gathered from participants and their parents (for children under 15 years).
All participants completed the survey online in approximately 30-45 min, and they received a gift card for their participation.

Participants
High-school students (n = 897) and CAP patients (n = 160) were asked to participate; 71% of the high-school students (n = 638) and 27% of the CAP clinic patients (n = 43) agreed to participate, which rendered a total sample of 419 girls and 262 boys between 12 and 19 years of age (M = 15.75, SD = 1.77). Most participants were of Swedish origin (91%). The socioeconomic status of the households was distributed as follows: 17% manual workers, 28% clerical or office workers, 32% higher civil servants and executives, 7% self-employed of different kinds, 1% students, and 15% unknown. A subset of the adolescents (n = 238 girls and n = 110 boys, mean age 15.39, SD = 1.68) was invited for retesting approximately 3 weeks after the first assessment.

US sample for DIF analyses
For comparative analyses of language, a US sample [33] was used in the DIF analyses, from which only the variables analyzed in the present article were extracted. US data were only available for the pain interference and fatigue PROMIS item banks. The sample consisted of N = 356 adolescents (173 girls) between 12 and 17 years of age (M = 14.70, SD = 1.72). All participants suffered from different medical conditions (19% cancer, 40% kidney problems, 15% rheumatic conditions, and 26% sickle cell anemia). The sample has been described in further detail elsewhere [33].

Translation and adaptation of the item banks
The Functional Assessment of Chronic Illness Therapy (FACIT) Multilingual Translation Methodology [34,35], with some modifications, was used for translation. Forward translation, reconciliation, expert reviews, back-translation, cognitive debriefing, and pilot testing were performed. For more details, see Blomqvist et al. [29,31]. See Fig. 1 for an overview of the Swedish translation and adaptation processes. The current translated item banks are found in the step "Reports of validation" in Fig. 1.

Self-report instruments
PROMIS (Patient-Reported Outcomes Measurement Information System) consists of item banks measuring generic health [12]. In the present study, the item banks for pain interference, fatigue, and physical activity were used.

PROMIS Pediatric Pain Interference v.2.0. [36]
The pain interference questionnaire measures the perceived extent to which pain has disrupted daily living over the last 7 days. It consists of 20 questions rated on a 5-point summated rating scale ranging from 1 (never) to 5 (almost always).

PROMIS Pediatric Fatigue v.2.0. [12]
The fatigue questionnaire measures how tired the child has felt during the last 7 days. The 25 questions are rated on a 5-point scale ranging from 1 (never) to 5 (almost always).

Statistical and psychometric methods
The analyses were performed in IBM SPSS, Version 26.0, and in R [37]. Psychometric calculations followed the method described in Reeve et al. [38]. First, descriptive statistics were calculated. Thereafter, corrected item-total correlations ($r_{itc}$) were estimated. A correlation below 0.3 indicates that the corresponding item does not correlate well with the overall scale and should be removed [39]. The reliability of the scales was calculated using Cronbach's α (good internal consistency is proposed to be between 0.70 and 0.90 [40]). Further, the IRT Test Information Function (TIF), Item Information Curves (IIC) and Standard Errors (SE) were calculated. TIF is inversely related to SE. An SE of 0.32 corresponds to a reliability of 0.90 according to the formula $r = 1 - SE^2$, e.g. $1 - 0.3^2 = 1 - 0.09 = 0.91$ [41]; the smaller the SE, the better the reliability.
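The classical indices named above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SPSS or R routines used in the study; the response matrix `rows` is hypothetical toy data, and the function names are our own.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows is a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_variances = [variance([r[j] for r in rows]) for j in range(k)]
    total_variance = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def corrected_item_total(rows, j):
    """Pearson correlation between item j and the sum of the remaining items."""
    x = [r[j] for r in rows]
    y = [sum(r) - r[j] for r in rows]
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def reliability_from_se(se):
    """Reliability implied by an IRT standard error: r = 1 - SE^2."""
    return 1 - se ** 2


# Hypothetical toy data: five respondents, three items scored 1 to 5
rows = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5], [5, 5, 4]]
alpha = cronbach_alpha(rows)
r_itc_first = corrected_item_total(rows, 0)
print(round(alpha, 3), round(r_itc_first, 3), round(reliability_from_se(0.32), 3))
```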
We performed a test-retest analysis, with 3 weeks between the tests, and agreement was measured through intraclass correlation coefficients (ICCs) with a two-way fixed effects model [42]. Values below 0.40 were considered poor, values from 0.40 to 0.75 fair to good, and values greater than 0.75 excellent, according to the criteria of Fleiss [43].
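For readers unfamiliar with the ICC, a minimal Python sketch follows. It assumes a single-measure consistency ICC from a two-way model, computed from ANOVA mean squares; the exact ICC form used in the study is not stated beyond "two-way fixed effects", so this choice is an assumption, and the test-retest scores are hypothetical.

```python
def icc_consistency(scores):
    """Single-measure consistency ICC from a two-way ANOVA decomposition.

    scores: list of subjects, each a sequence of k measurements
    (e.g. test and retest). The ICC form is an assumption of this sketch.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


def fleiss_label(icc):
    """Classify an ICC using the Fleiss criteria quoted in the text."""
    if icc < 0.40:
        return "poor"
    if icc <= 0.75:
        return "fair to good"
    return "excellent"


# Hypothetical test-retest scores for four respondents
scores = [(1, 2), (2, 1), (3, 4), (4, 3)]
icc = icc_consistency(scores)
print(round(icc, 2), fleiss_label(icc))
```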

Unidimensionality
Before using IRT, we checked for unidimensionality (all items must load on a single factor) in the item banks with three single-factor Confirmatory Factor Analyses (CFA) of the inter-item polychoric correlation matrices, as recommended by Reeve [38]. Due to the non-normal distribution found in the data and the use of ordinal data, we used the diagonally weighted least squares estimator with robust standard errors [44] in the R package lavaan for structural equation modeling, version 0.6-3 [45]. The goodness-of-fit indices used in the study were the Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), Root Mean Square Error of Approximation (RMSEA) and Standardized Root Mean Square Residual (SRMR). We followed the recommendations from Hu and Bentler [46] and the PROMIS analysis plan [38] for unidimensionality: CFI > 0.95, TLI > 0.95, RMSEA < 0.06 and SRMR < 0.08.
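The cutoff rules above can be expressed as a small check. The Python sketch below is only an illustration of how the four criteria are applied; the fit values in `example` are hypothetical and are not results from this study.

```python
# Cutoffs quoted in the text: (threshold, direction in which a value passes)
CUTOFFS = {
    "CFI": (0.95, "above"),
    "TLI": (0.95, "above"),
    "RMSEA": (0.06, "below"),
    "SRMR": (0.08, "below"),
}


def check_fit(indices):
    """Return, for each fit index, whether it meets its cutoff."""
    result = {}
    for name, value in indices.items():
        cutoff, direction = CUTOFFS[name]
        result[name] = value > cutoff if direction == "above" else value < cutoff
    return result


# Hypothetical fit indices for one single-factor model
example = {"CFI": 0.97, "TLI": 0.96, "RMSEA": 0.16, "SRMR": 0.07}
print(check_fit(example))
```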

Monotonicity and local independence
We assessed monotonicity and local independence using a non-parametric IRT model with Mokken scale analyses using the R package mokken (version 3.0.3) [47]. Coefficients of homogeneity (H) were examined, and monotonicity was indicated by item values of 0.3 or above and total scale values of at least 0.50 [48]. Local independence was checked by conditional association and reported with true/false values; if all values are true, the items show local independence.

Graded response models
In addition, the items were fitted with the graded response model [49] with the R package ltm [50]. The discrimination (slope) and difficulty (thresholds) were calculated for each item. The four threshold parameters (beta coefficients for five response alternatives) were used to indicate the level of pain interference, fatigue, and physical activity at which a response in a particular category becomes likely. The goodness of fit of the IRT model (item fit) was examined using the S-χ² statistic for polytomous response data [51]. A non-significant value indicated adequate fit of the model to the data (p > 0.001 [52]).
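To make the roles of the slope and the four thresholds concrete, the following Python sketch computes category probabilities under the graded response model for one item with five response categories. It uses the common textbook parameterization, which may differ from the internal parameterization of the ltm package, and the parameter values are hypothetical.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities under the graded response model.

    theta: latent trait level; a: discrimination (slope);
    thresholds: ascending difficulty parameters (four for five categories).
    """
    # Cumulative probabilities of responding in category k or higher,
    # padded with 1 (lowest category or higher) and 0 (above the highest).
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical item: slope 2.0, thresholds at -1, 0, 1 and 2
probs = grm_category_probs(0.0, 2.0, [-1.0, 0.0, 1.0, 2.0])
print([round(p, 3) for p in probs])
```

A higher slope makes the category probabilities change more sharply around the thresholds, which is why items with high discrimination are the most informative indicators of the trait.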

Differential item functioning (DIF)
DIF for gender, age (median split), and language (Swedish translated vs US original pediatric PROMIS item banks of pain interference and fatigue) [33] was calculated for each item on each scale using the IRT Likelihood Ratio DIF approach [53], using LR χ² item fit statistics, as implemented in the R package mirt [54]. The Benjamini-Hochberg procedure [55] was used to control for multiplicity of comparisons in DIF (see Table 2). McFadden's R² was used to evaluate when DIF was detected (> 2%) [40].
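The Benjamini-Hochberg step-up procedure mentioned above can be sketched as follows. This is a minimal Python illustration, not the routine used in the analyses; the false discovery rate `q = 0.05` and the p-values are assumptions chosen for the example.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is <= (k / m) * q, then reject all hypotheses up to rank k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from four item-level DIF tests
print(benjamini_hochberg([0.001, 0.5, 0.03, 0.04]))
```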
McFadden's R² could be interpreted as < 0.035 = negligible DIF, 0.035-0.07 = moderate DIF, and > 0.07 = large DIF [56]. For items with significant DIF, the level of the effect size was evaluated in tables and graphically using methods outlined by Steinberg and Thissen [57]. We transformed the theta scores into T-scores as recommended by PROMIS using the formula $T = (\theta \times 10) + 50$. The average T-score of the study population is 50 (SD = 10).
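Both rules in this paragraph are simple enough to write out directly. The Python sketch below applies the quoted effect size bands and the T-score transformation; the function names are our own and the input values are hypothetical.

```python
def classify_dif(mcfadden_r2):
    """Classify a McFadden R^2 change using the bands quoted in the text."""
    if mcfadden_r2 < 0.035:
        return "negligible"
    if mcfadden_r2 <= 0.07:
        return "moderate"
    return "large"


def theta_to_t_score(theta):
    """Transform a theta estimate to the T-score metric: T = theta * 10 + 50."""
    return theta * 10 + 50


# Hypothetical values
print(classify_dif(0.02), classify_dif(0.05), classify_dif(0.10))
print(theta_to_t_score(-0.5))
```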

Descriptive statistics and confirmatory factor analysis
The data showed good range and response distribution within the items. Descriptive statistics are shown in Table 1. Missing data analysis showed 0.3% missing data in each of the three item banks. Missing data were replaced with imputed values using linear regression. Data were assumed to be missing at random.
Corrected item-total correlations ($r_{itc}$) were greater than 0.3 in the total sample (ranging from 0.52 to 0.85) and in the male and female subsamples (0.62 to 0.88 vs. 0.46 to 0.86, respectively). The corresponding items correlated well with the overall scales.
The internal consistency in terms of Cronbach's α was very high for all three item banks (α ranged from 0.93 to 0.97).
Test consistency over time was calculated using a subsample of n = 348 adolescents (55% of the original sample of N = 638 answered the questionnaire again 3 weeks later). The test-retest ICCs were 0.84 for the total score of the pain interference item bank (95% CI 0.80, 0.87; F = 6.07; p ≤ 0.001), 0.89 for the fatigue item bank (95% CI 0.86, 0.91; F = 9.04; p ≤ 0.001), and 0.86 for the physical activity item bank (95% CI 0.82, 0.88; F = 6.94; p ≤ 0.001). Based on the criteria of Fleiss [43], the ICCs were considered excellent.
Local independence was found among the items.

Graded response models
The item parameter estimates and the χ² mean square item fit statistics are shown in Table 2. In this table the items are sorted in order of decreasing discrimination (a), so the generally best indicators of pain interference, fatigue, and physical activity are near the top of the table. The best and the worst discriminating items are shown in category characteristic curves, see Fig. 2. For the pain interference items, five of the items exhibited significant lack of fit as indicated by the S-χ² item fit (p < 0.001, χ² ranged from 503.88 to 754.07, df = 391) (Table 2), after Benjamini-Hochberg correction for multiplicity. For the fatigue items, three of the items showed significant lack of fit (p < 0.05, χ² ranged from 887.04 to 1232.74, df = 636), and for the physical activity items, three items showed significant lack of fit (p < 0.05, χ² ranged from 856.52 to 1007.04, df = 662).

Differential item functioning
DIF analysis was used to detect whether gender, age group or language biased an item. No DIF by gender was found in any of the subscales. For age groups (12-15 years and 16-19 years), there were, after Benjamini-Hochberg correction, seven items with significant DIF. One of them had moderate DIF: "I had trouble starting things because I was too tired" (from the fatigue item bank). For language (only measured for pain interference and fatigue), there were nine items with significant DIF after Benjamini-Hochberg correction. Most of them had negligible McFadden effect sizes, and only three of the items had moderate DIF ("Being tired kept me from having fun", "I had trouble starting things because I was too tired", and "I was too tired to go up and down a lot of stairs" [all three from the fatigue item bank]). See Table 2 for the DIF results and the McFadden effect sizes.
For the items where DIF was found by age and language, we further investigated whether the results were due to the item's discrimination (slope) or difficulty (thresholds) by using a model where the equal slope assumption was imposed and the difficulty was freely estimated for each of the two groups. There was no significant result for the seven items with DIF by age or for four of the items with DIF by language. For five items in the fatigue item bank (marked as significant with a star in Table 2 for DIF of language), non-uniformity was found, meaning that the items had different slopes. After considering the analytic results, graphical illustration, item content and clinical relevance, we decided to keep all items in the item pools.

Fig. 2 Category characteristic curves for the best and the worst discriminating items of the pain interference, fatigue and physical activity item banks

The T-score calculations were based on the full original English item bank (general and clinical population), obtained from www.assessmentcenter.net/ac_scoringservice. The mean T-scores of the study sample were as follows: Our T-scores can be provided on request.

Discussion
One major challenge prior to the use of IRT models is to resolve issues of dimensionality. For all three item banks (pain interference, fatigue and physical activity), we found good values on the fit indices CFI, TLI and SRMR. However, for all three item banks, RMSEA values indicated a moderate fit, and for physical activity a relatively poor fit (0.16). Values over 0.06 have been reported for many other PROMIS item banks, e.g. [41,58]. Traditional goodness-of-fit indices have been criticized for not being suitable to establish unidimensionality of health item banks [59], and RMSEA is sensitive to model complexity and skewed data distributions [59], the latter being the case in our distributions. SRMR has been shown to generate more robust results across different populations and estimation methods [60].
Internal consistency, or scale reliability, was high in all three item banks (Cronbach's α ranged from 0.93 to 0.97). The high value of Cronbach's α is probably partly due to the large number of items included in the scales (and some of the items were quite similar). However, when inspecting the TIF, IIC, and SE curves (IRT), this picture was confirmed but nuanced. At a total mean level, all item banks had satisfactory reliability, while at an individual level, the items varied more in reliability. We conclude that the items with low reliability could be set aside in future studies.
Test-retest reliability of the scales and the ICC [43] showed excellent reliability over a period of three weeks (from 0.84 to 0.89 for all subscales). This can be interpreted as very good internal validity and ensures that the scales are both representative and stable over time.
Systematic measurement variability by groups can lead to a number of problems, including errors in hypothesis testing (e.g. it may be assumed that the test covers all genders, all ages or all cultures, but it does not) and misguided research [61]. Ensuring equivalent testing is thus important prior to making comparisons among individuals or groups [61]. We investigated DIF for gender, age group and language in the three item pools. For all items, no DIF regarding gender was found (not in line with Lai et al. 2013 [21], which found gender-based DIF in three items), and the subscales measured symptoms equally well for girls and boys. However, some items had DIF regarding age and language, although the effect sizes were mostly negligible (three were moderate for language), and we cannot draw any firm conclusions. DIF by age and language suggests that for these items, symptoms were not measured equally well across age groups (12-15 years and 16-19 years) or language groups (a Swedish sample of children speaking Swedish compared with a US sample speaking English). For fatigue and age, this was in line with one previous study (Lai et al., 2013 [21], which found that 16 out of 25 fatigue items had DIF for age), while for the other two subscales (pain interference and physical activity) this was a new finding with regard to age. There can be several explanations for this, including that the concept of "fatigue" may not be the same across the age groups. Another potential item bias, not measured because our clinical sample was too small, was DIF regarding psychiatric and physical symptoms; our sample was more normative than the more clinical US sample.
When comparing the results with our previous review of the translated items (see [31]), we found overlap for only one of the items: "How many days did you run for 10 min or more?". It was problematic in the translation process because this item is equivocal and lacks a precise definition in the PROMIS definition list [31,62]. During cognitive interviews with Swedish children [31,63], some of them wondered whether the item meant that they had done 10 min of continuous running or whether the 10 min of running could be accumulated over a day. Even though we translated this item word by word, some children may therefore have interpreted the item differently. DIF by age for this item was not found in the original English version [23]. Several items contained the wording "how many days did you … for 10 min or more", and all of them were in the lower range of all psychometric measurements in our current study as well as in the study by Tucker et al. [23]. Measures of distance and time often need context and a qualitative description to be understandable [64].
A common strategy to deal with DIF items is to set items aside [21]. However, in brief questionnaires this strategy is not advisable, because it might result in decreased reliability and validity. In addition, the shortened scale can lead to a modification of the construct it is intended to measure [65], and removing DIF items from well-established questionnaires decreases comparability between different research studies.
An interesting finding in this study was that the average T-scores of all three item banks were lower than the expected 50.0 (general and clinical US population). This may indicate that Swedish adolescents, on average, experience less pain interference, are less tired, and do less physical activity than US adolescents. However, the samples differ, as our relatively healthy sample overall has fewer symptoms than the US sample. Further analyses are needed to explore possible alternative explanations.

Limitations and strengths
The present study had sufficient statistical power and all participants answered all questions, but some limitations should be noted. Participants were not geographically stratified and did not fully match the Swedish general pediatric population; for example, the unbalanced gender ratio limited generalizability. Instead, the participants came from four different schools along with a smaller sample from a child and adolescent psychiatric clinic. When using IRT statistics, a mixed sample is theoretically preferable because IRT offers the property of item invariance, in which item parameters are constant even if estimated in different samples [66]. However, our clinical sample was too small to test for DIF, and future studies need to investigate whether this also holds empirically. For the DIF of language, a sample more similar to ours would have been preferable, as the US sample contained a greater variety of medical diagnoses, which potentially biased the results.

Implications
The three PROMIS pediatric item banks were translated and adapted to Swedish to meet the need for short, effective and valid tests based on modern test theory, such as IRT and DIF, for use in Swedish healthcare [4,31]. A major advantage of using IRT in health-related outcomes is that it enables adaptive testing, either by multiple short forms or via computerized adaptive testing [67], which is less of a burden for the patients but not always available in research or clinical settings. Thus, short forms can be valuable alternatives.

Conclusions
The PROMIS pediatric item banks of pain, physical activity, and fatigue showed sufficient psychometric properties in a Swedish population. Future studies could use Computerized Adaptive Testing (CAT), which presents fewer but reliable items to the respondent compared with classical testing (e.g. [41]). This approach prevents test-tiredness. We hope that the item banks will be implemented both in Swedish school-based health care and in pediatric clinics.