Psychometric validation of a Saudi Arabian version of the SF-36v2 health survey and norm data for Saudi Arabia

Background Adaptation of a patient-reported outcomes survey into a new language requires careful translation procedures as well as qualitative and quantitative psychometric testing. This study aimed to evaluate the basic psychometric properties of the new Saudi Arabian SF-36v2 and establish norm data for Saudi Arabia. Methods Translation and adaptation of the SF-36v2 used standard methodology. Psychometric validation included two stages: 1) a qualitative study (n = 100) explored the components of health and health-related quality of life considered important in Saudi Arabia and evaluated the content validity of the SF-36v2 in Saudi Arabia, and 2) a quantitative study (n = 6166) evaluated the basic psychometric properties of the Saudi SF-36v2 and established norm data for Saudi Arabia. Comparison with US general population data (n = 4040) evaluated differential item function (DIF) and cross-national differences. Results The qualitative study supported the content validity of the Saudi SF-36v2. Cognitive debriefing identified only a few minor problems. Psychometric analyses supported item convergence within scales and differentiation across scales of the SF-36v2. Scale-level exploratory factor analyses did not support the typical distinction between physical health and mental health components. Internal consistency reliability was satisfactory for all scales except the social function scale (alpha = 0.67). Cross-national DIF was identified for 9 items. In the Saudi general population, the average vitality score was lower for women (−2.71 points) than for men. For men, older age groups scored lower on the physical function scale (−3.31) and the physical health component (−3.06). For women, older age groups scored lower on the role physical (−3.72), bodily pain (−3.66), and vitality (−2.32) scales as well as the physical health component (−3.52). Compared to the 2009 United States general population, and after adjusting for age, gender, and differential item function, persons in Saudi Arabia had lower average scores for the physical function (−3.10), role physical (−4.75), social function (−4.23), role emotional (−5.67), and mental health (−4.82) scales, as well as the mental health component (−4.57). Conclusion This Saudi normative study of patient-reported outcomes supported the validity and reliability of the new Saudi SF-36v2 and found cross-national differences with the USA.


Background
Patients' self-reports of health outcomes are important for measuring the impact of chronic disease, accounting for changes in health, measuring the effects of treatment, and predicting health resource utilization and thus medical expenditures. To date, most of the available patient-reported outcome (PRO) measures are in English, and few have been translated into Arabic and adapted for use in Arab countries [1]. Because the perception of health-related outcomes may differ between populations and conditions, adaptation of a questionnaire into a new language and culture requires more than just a translation. Evaluation of content validity, construct validity, and reliability as well as establishing national normative data are important steps in the translation and cultural adaptation of a PRO measure [2][3][4][5][6][7][8][9]. Despite these challenges, the literature urges investigators not to "reinvent the wheel" by developing new or ad hoc measures, but rather to cross-culturally adapt an existing health and health-related quality of life (HRQOL) measure. Cross-cultural adaptation is believed to: 1) be more cost-effective; 2) enable efficient utilization of the existing body of knowledge; 3) help standardize the concept internationally; and 4) offer the opportunity for international comparative studies. A disadvantage of culture-specific instruments is that their results are not generalizable or comparable because each has its own conceptual definition and choice of indicators [6].
The SF-36 is one of the most widely used PRO instruments [10]. Its validity, reliability, and responsiveness have been documented in many groups varying by age, sex, socio-economic status, geographical region, and clinical conditions [3]. In the 1990s, researchers within the well-documented International Quality of Life Assessment (IQOLA) project pioneered the adaptation of the SF-36 for use internationally [11]. The methods used in the IQOLA project still constitute the standard for translation and validation work today. The SF-36 has been translated into more than 150 languages and adapted to different cultures [10]. Responding to the difficulties in translating various items and response choices, the IQOLA project's investigators emphasized the importance of developing translations that are culturally appropriate to each country [2].
Published norms for the SF-36 exist in several developed countries [4,[11][12][13][14][15][16][17][18][19][20][21]. Norms permit evaluation of disease burden, i.e. the decrement in PRO scores relative to a general population comparison group with similar age and sex distribution [10]. Normative data can also help interpretation of treatment effects since no treatment effect can be expected to be larger than the disease burden. For Saudi Arabia, PRO population norms could help identify needs and subsequently guide health policies, legislation, and the development of strategic plans to allocate resources based on unmet needs. However, most previous work in Saudi Arabia has used the SF-36v1 or RAND-36 [22], rather than SF-36v2, and general population norms have been lacking.
Accordingly, this nationwide study aimed to explore the content validity of SF-36v2 in a Saudi Arabian context, test the validity and reliability of a new Saudi Arabic SF-36v2 translation, and collect Saudi normative SF-36v2 data. Since the SF-36v2 scoring is based on US general population norms, we also explored the difference between US and Saudi Arabian norms.

Methods
This project was performed in 2 stages, utilizing both qualitative and quantitative methods.
The qualitative study had two objectives: 1) To explore the concepts of health and HRQOL and evaluate the content validity of the SF-36v2 as an HRQOL instrument in a Saudi Arabic setting, and 2) To perform cognitive debriefing of the Saudi SF-36v2.
Semi-structured interviews were carried out on a convenience sample of 100 participants by trained interviewers aiming to explore which domains the participants consider important components of health and HRQOL, and to ascertain concordance with the WHO definition of health as "a state of complete physical, mental, and social well-being and not merely the absence of disease or infirmity" [23]. This definition forms the conceptual basis of the SF-36 and other commonly used HRQOL measures. Participants were asked introductory questions including: "What is the meaning of health?", "What do you think may affect a person's health?", and "What areas of life do you think are affected by health?" Participants were probed to elaborate on their answers until they indicated having no more ideas. Participants were then asked to evaluate the importance of domains commonly used in measuring HRQOL using a four-point scale ("very important", "quite important", "not quite important", and "not at all important"). Next, participants were asked to list additional domains that were not mentioned among the listed domains. It has been suggested that indicators could be added if rated important by at least 50% of subjects [24].
Subsequently, participants engaged in a cognitive debriefing of the Saudi SF-36v2 to evaluate whether the content of the translated version was easily understood and culturally relevant within a Saudi Arabian context. After completing each item, participants were asked questions about clarity, comprehensibility, relevance, and completion feasibility using a standardized response scale. The participants were probed to elaborate on their answers, using probes such as: "Interesting, can you elaborate on that?", "What do you mean, can you explain further?", "How important is that to you?", "Do you want to add anything in this regard?"
The quantitative study was based on a national general population survey involving Saudis aged 15 years or older. Saudi Arabia is divided into five regions (North, South, Central, East, and West); each region is divided into sub-regions and blocks. A probability proportional sampling method was used to randomly select sub-regions, blocks, and accordingly households. Households were chosen from each block and a roster of household members (based on age and sex) was collected by a surveyor visiting the household. One household member aged 15 years or older was randomly selected to be surveyed from each household. The surveyor handed out the SF-36v2 for self-administration. Individuals were excluded if they were unable to complete the questionnaire due to language problems, communication limitations, or cognitive impairments. If the selected person was not present, our surveyors made an appointment to return. The household was counted as non-responsive after a total of three unsuccessful visit attempts.
The Saudi population is estimated to be approximately 12,167,245 people. The study aimed to obtain 6360 completed surveys. This sample was chosen to achieve sufficient representation of all strata of the Saudi population. Based on experience with surveys in Saudi Arabia, we assumed a non-response rate of up to 40% for a target of 10,600 contacts. Of the 10,592 approached Saudi adults, 6166 participated in the study with a response rate of 61%.
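The contact target follows from simple arithmetic: 6360 completed surveys at an assumed non-response rate of 40% imply 6360 / (1 − 0.40) = 10,600 contacts. A minimal Python sketch of that calculation is given below; the function name and the use of integer percentages are illustrative choices and are not part of the study protocol.

```python
def required_contacts(target_completes: int, nonresponse_pct: int) -> int:
    """Contacts needed to reach a target number of completed surveys.

    Integer (ceiling) division is used to avoid floating-point rounding.
    """
    if not 0 <= nonresponse_pct < 100:
        raise ValueError("nonresponse_pct must be in [0, 100)")
    response_pct = 100 - nonresponse_pct
    # Ceiling division: round up, because a fraction of a contact is not possible.
    return -(-target_completes * 100 // response_pct)


# Figures taken from the text: 6360 completes, up to 40% non-response.
print(required_contacts(6360, 40))
```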
The Saudi Arabian data was compared to United States (US) general population data (n = 4040) obtained in 2009 [25]. This general population online survey of US citizens 18 years or older has been used to generate population norms for the USA (please see [25] for details).

Measures
The SF-36v2 was administered using a new Arabic version and scored according to standard recommendations of the SF-36v2 developers [25] into eight subscales: physical function (PF), role limitations due to physical health (RP), bodily pain (BP), general health perception (GH), vitality (VT), social function (SF), role limitations due to emotional problems (RE), and mental health (MH). For each subscale: 1) items were coded so that a high score indicated good health; 2) two items were weighted [26], while simple category weights (1, 2, 3 …) were used for all other items; 3) the mean score was taken across items and transformed linearly to a metric from 0 to 100; and 4) for norm-based scoring, the scale score was transformed linearly so that the US general population has a mean of 50 and an SD of 10. In addition, two overall component scores, the physical component summary (PCS) and the mental component summary (MCS), were calculated based on scoring coefficients from a principal component analysis with orthogonal rotation [27].
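The two linear transformations in steps 3 and 4 can be sketched as follows. This is a simplified illustration, not the developers' scoring software: it assumes that all items in a scale share the same response range and ignores the two weighted items, and the reference mean and SD used in the example are placeholders rather than the published US norm values.

```python
from statistics import mean


def scale_0_100(item_scores, item_min, item_max):
    """Average the recoded item scores and rescale linearly to 0-100."""
    raw = mean(item_scores)
    return (raw - item_min) / (item_max - item_min) * 100.0


def norm_based(score_0_100, ref_mean, ref_sd):
    """Linear transform so the reference population has mean 50 and SD 10."""
    return 50.0 + 10.0 * (score_0_100 - ref_mean) / ref_sd


# Hypothetical respondent: four items, each scored 1-5 after recoding.
score = scale_0_100([4, 5, 3, 4], item_min=1, item_max=5)
# ref_mean and ref_sd are placeholders, NOT the published US norm values.
t_score = norm_based(score, ref_mean=70.0, ref_sd=20.0)
print(score, t_score)
```

On the norm-based metric, a respondent scoring exactly at the reference mean receives 50, and each 10 points corresponds to one reference-population SD.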
The translation of the SF-36v2 used the principles of good practice from the International Society of Pharmacoeconomics and Outcomes Research Task Force for Translation and Cultural Adaptation [28]. This included: 1) Forward translation by two independent professional translators, 2) Back translation by a third independent translator, 3) Reconciliation by an expert panel, and 4) Independent assessment of translation quality. The translation was subsequently evaluated by cognitive debriefing and quantitative testing, as reported in this paper.
Morbidity questions included a list of 27 self-reported health conditions (hypertension, heart disease, diabetes, arthritis, depression, etc.) and one open-ended response coded as "other".
Sociodemographic questions included age, gender, education level, marital status, occupation, and financial status (monthly household income). Additional questions concerned smoking habits and major life events during the previous year.

Statistical analysis
Distributions of basic demographic variables and chronic conditions were described through standard frequency tables. The analysis of scale structure relied on multitrait analyses [29], which have been used in multiple previous studies of the SF-36. These analyses test item convergence within scales and item differentiation across scales. Item convergence within scales (sometimes called convergent validity) was evaluated by analyzing the correlation of each item with the sum of all the other items in the scale (the item-own-scale correlation, corrected for overlap; see also [30]). A correlation of 0.40 or more for all items in a scale supports item convergence within scales. Item differentiation across scales (sometimes called discriminant validity) was evaluated for each item by comparing the item's correlation with its own scale to its correlation with all other scales. Item differentiation across scales is supported if the item's correlation with its own scale is significantly larger than its correlation with any other scale. Furthermore, we analyzed the scale correlation matrix using exploratory factor analysis, as in many previous studies of the SF-36 (e.g. [31,32]). The number of factors was evaluated by eigenvalue analysis. Factors were extracted using the principal components method, followed by orthogonal rotation (Varimax). While most studies have used a two-factor solution of physical and mental health [31], analyses in some non-Western countries have suggested a three-factor structure of physical, mental, and social health [32]. For this reason, we evaluated both two- and three-factor solutions. As robustness analyses, we supplemented the Varimax rotation with an oblique rotation (Promax) and supplemented standard analyses of product-moment correlations with analyses of the polychoric correlation matrix. Internal consistency reliability (coefficient alpha) was estimated for the subscales.
Internal consistency reliability for the PCS and MCS was estimated using methods for weighted composites (see [33] page 37).
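As an illustration of two of the statistics described above, the sketch below computes the item-own-scale correlation corrected for overlap and coefficient alpha for a single scale. It is a minimal example using only the Python standard library; the response data are invented for illustration, and the function names are not taken from any software used in the study.

```python
from math import sqrt
from statistics import mean, variance


def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_scale_r(items, index):
    """Correlate one item with the sum of the OTHER items in its scale."""
    rest = [sum(v for j, v in enumerate(row) if j != index)
            for row in zip(*items)]
    return pearson(items[index], rest)


def cronbach_alpha(items):
    """Coefficient alpha from a list of item-score lists (one list per item)."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))


# Hypothetical responses from six people to a three-item scale.
scale = [
    [1, 2, 3, 4, 5, 3],
    [2, 2, 3, 5, 4, 3],
    [1, 3, 3, 4, 5, 2],
]
for i in range(len(scale)):
    r = corrected_item_scale_r(scale, i)
    print(f"item {i}: r = {r:.2f}, meets 0.40 criterion: {r >= 0.40}")
print(f"alpha = {cronbach_alpha(scale):.2f}")
```

Excluding the item from its own scale total before correlating is what "corrected for overlap" refers to; without that step the correlation would be inflated, especially in short scales.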
Differential item function (DIF) was evaluated for age, gender, and comparisons of Saudi Arabia and USA using logistic regression DIF tests [34]. Adopting a standard decision rule [35], evaluation of important DIF was based on statistical significance (p < 0.05 after Bonferroni adjustment) and magnitude in terms of increase in explained item variance (difference in pseudo R-squared [36] larger than 0.03). This criterion is slightly less conservative than a threshold of 0.035 advocated in the educational testing literature [37]. We used a standard purification strategy [34], where items with indications of DIF were excluded iteratively until a set of anchor items without DIF was identified. Then, the final DIF analyses were conducted for each item using the anchor items and the item in question. In cases of important cross-national DIF, we adjusted the cross-national comparisons using the generalized partial credit item response theory (IRT) model [38]. This model can adjust for uniform DIF (DIF with the same magnitude across score levels) by adjustment of the IRT threshold parameters and adjust for non-uniform DIF (magnitude of DIF depends on score level) by adjustment of the IRT discrimination parameter. DIF adjustment was performed using a three-step procedure: 1) We estimated item parameters for all SF-36v2 scales using the US data and the generalized partial credit IRT model [38], 2) For items with significant DIF, we re-estimated the item parameters in the Saudi data, fixing item parameters for the anchor (no-DIF) items, and 3) We performed IRT-based sum score cross-calibration to link the Saudi scale scores to the US metric [39]. After doing this for all subscales with DIF, we calculated adjusted PCS and MCS scores based on the adjusted subscales.
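The two-part decision rule for important DIF (significance plus magnitude) can be written compactly, as in the sketch below. It covers only the decision step and assumes that the p-value and the two pseudo R-squared values have already been obtained from nested logistic regression models; the function name, argument names, and example numbers are hypothetical, and the count of 35 tests is taken from the number of tested items mentioned in the Conclusion.

```python
def important_dif(p_value, n_tests, r2_without_group, r2_with_group,
                  alpha=0.05, r2_threshold=0.03):
    """Flag DIF only if it is both significant and large enough to matter."""
    # Bonferroni adjustment: multiply the p-value by the number of tests,
    # capped at 1.
    p_adjusted = min(1.0, p_value * n_tests)
    # Gain in explained item variance when group terms are added to the model.
    r2_gain = r2_with_group - r2_without_group
    return p_adjusted < alpha and r2_gain > r2_threshold


# Hypothetical items (numbers invented for illustration).
print(important_dif(1e-6, 35, 0.40, 0.45))  # significant and large
print(important_dif(1e-6, 35, 0.40, 0.41))  # significant but small
print(important_dif(0.01, 35, 0.40, 0.45))  # large but not significant
```

Requiring both conditions matters in large samples, where even trivial group differences reach statistical significance.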
Comparisons between Saudi Arabia and the USA were carried out using a linear regression model, with and without controlling for differences in age and gender, and adjustment for DIF. The magnitude of differences was evaluated according to published guidelines for minimal important differences (MID) for the SF-36v2 [25]: PF: 3 points, RP: 3 points, BP: 3 points, GH: 2 points, VT: 2 points, SF: 3 points, RE: 4 points, MH: 3 points, PCS: 2 points, and MCS: 3 points. These MID values have been established using anchors such as noticeable increase in risk of mortality, job loss, or hospitalization [40].
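To make the MID rule concrete, the sketch below checks a set of score differences against the thresholds listed above. The example differences are the adjusted Saudi-US differences reported in the abstract; treating "exceeds" as a strict inequality on the absolute difference is an assumption of this sketch.

```python
# MID thresholds for the SF-36v2 as listed in the text [25].
MID = {"PF": 3, "RP": 3, "BP": 3, "GH": 2, "VT": 2,
       "SF": 3, "RE": 4, "MH": 3, "PCS": 2, "MCS": 3}


def exceeds_mid(scale, difference):
    """True if the absolute score difference is larger than the scale's MID."""
    return abs(difference) > MID[scale]


# Adjusted Saudi-US differences as reported in the abstract.
adjusted_differences = {"PF": -3.10, "RP": -4.75, "SF": -4.23,
                        "RE": -5.67, "MH": -4.82, "MCS": -4.57}

for scale, diff in adjusted_differences.items():
    print(scale, diff, exceeds_mid(scale, diff))
```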

Health and HRQOL concepts
The characteristics of the sample (N = 100) used in the qualitative study are presented in Table 1.
In the qualitative study, four concepts were endorsed by 50% or more of participants as components of health: physical functioning (70% of participants), normal psychological function and feelings (66%), healthy eating habits and enjoyment of food (61%), normal social functioning (50%); 38% of participants defined absence of disease or illness and 28% being full of energy (free from pain and fatigue) as components of health (Table 2).
When presented with a list of domains commonly included in the assessment of HRQOL, concepts related to all eight SF-36 domains were assessed as "quite" or "very" important for HRQOL (range 95%-100%, Table 3). While the concept of being full of energy was considered a component of health by only 28% of the sample, the related concepts of "having a lot of energy" and "being free from pain" were considered "very" or "quite" important components of HRQOL by 100% and 96% of participants, respectively. Participants also reported some additional domains that are not covered by the SF-36 as important: eating habits (72%), sleep (55%), travel (53%), and sexual function (56%) (Table 3).

Item convergence within scales, item differentiation across scales
Table 6 presents results of analyses of item convergence within scales and differentiation across scales in the Saudi sample. The numbers in bold show each item's correlation with the sum of all the other items in its own scale (item-own-scale correlations). All items satisfied the standard criterion of item convergence within scales (≥0.40). For all items except one, the item-own-scale correlation was higher than the correlation with any other scale, thus supporting item differentiation across scales. One item, SF01 ("During the past 4 weeks, to what extent has your physical health or emotional problems interfered with your normal social activities with family, friends, neighbors, or groups?"), showed a higher correlation with the pain scale than with the other item in its own scale.

Exploratory factor analysis and internal consistency reliability
While all scales were positively correlated, no correlations between scales were strong (above 0.70), supporting the notion that the eight scales measure distinct domains (data not shown). The highest scale correlation (0.67) was seen between RP and RE. In exploratory factor analysis, the first four eigenvalues were 4.07, 1.08, 0.78, and 0.57, thus supporting a two-factor solution. In a two-factor model, the factor loadings did not concur with the hypothesized associations (Table 7). Rather, the PF, RP, and RE subscales loaded strongly on the first factor (Physical and role function), while the BP, GH, VT, SF, and MH subscales loaded strongly on the second factor (Symptoms, health perception and social function). Analyses using oblique rotation and analyses of polychoric correlations provided similar results (data not shown). A three-factor solution kept the first factor unchanged, but split the second factor into a factor on Symptoms and general health perception (BP, GH, and VT loaded strongly on this factor) and a factor on Social function and mental health (SF and MH loaded strongly on this factor, which also had a strong cross-loading from RE, data not shown).
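One common way to read such eigenvalues is the eigenvalue-greater-than-one (Kaiser) rule, sketched below. The text does not state which rule was applied, so the use of this criterion is an assumption; the eigenvalues are the four reported above, and the share-of-variance calculation relies on the fact that the eigenvalues of a correlation matrix of eight scales sum to 8.

```python
# First four eigenvalues reported in the text (eight scales in total).
eigenvalues = [4.07, 1.08, 0.78, 0.57]
n_scales = 8


def n_factors_kaiser(values, cutoff=1.0):
    """Count eigenvalues above the cutoff (assumed retention rule)."""
    return sum(1 for v in values if v > cutoff)


def share_of_variance(values, k, n_variables):
    """Proportion of total variance captured by the first k components.

    For a correlation matrix, total variance equals the number of variables.
    """
    return sum(values[:k]) / n_variables


k = n_factors_kaiser(eigenvalues)
print(k, round(share_of_variance(eigenvalues, k, n_scales), 3))
```

Under this rule, the second eigenvalue (1.08) only narrowly clears the cutoff, which is one reason for also inspecting a three-factor solution.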
Internal consistency reliability was above the traditional threshold of 0.70 for seven scales. The two-item SF scale had a reliability of 0.67. The internal consistency reliabilities were 0.91 for PCS and 0.90 for MCS.

Differential item function
We did not identify any DIF with regards to age and sex. Uniform and non-uniform cross-national DIF was identified for 4 and 5 items, respectively, based on explained item variance (Table 8). Due to the large sample size, all DIF results were highly significant. For all 9 items, the direction of DIF was clear and consistent over most or all of the score range. Six items (PF02, moderate  (Table 9). Slightly lower scores were also seen for the PF and MH scales, but these differences were below the suggested threshold for clinical significance. Adjusting for age and gender led to slightly larger differences for the scales reflecting physical health but had little impact on differences in scales reflecting mental health. Adjusting for DIF lowered the Saudi Arabia scores for PF, GH, MH, PCS and MCS, but provided higher scores for RE, thus slightly diminishing the difference between Saudi Arabia and the US on this scale. Both in Saudi Arabia and in the USA, separate analyses by gender and age group (Tables 10 and 11) showed lower physical health scores for older age groups. However, this trend was most pronounced in the USA, so the strongest cross-national differences were seen in the younger age groups. Saudi Arabian women, 60 years or older, reported significantly better physical function than American women in the same age group.
Comparisons according to gender in the Saudi Arabian sample showed that women scored lower on several scales: BP, GH, VT, SF, RE, and MH as well as on MCS (Tables 10 and 11). However, except for VT, the score differences were below the thresholds for clinical significance.
Among men in Saudi Arabia (Table 10), the strongest score differences across age groups were seen for the PF scale and PCS. Among women (Table 11), lower scores in older age groups were seen for the RP, BP, and VT scales and PCS, whereas other scales remained fairly constant across age.

Discussion
This nationwide study generally supported the content validity, construct validity, and reliability of a new Saudi version of the SF-36v2. In the qualitative study, participants emphasized physical and psychological function as important components of health, along with social function and healthy eating. Thus, similar to the World Health Organization (WHO) definition [23], health was Cognitive debriefing of the Saudi SF-36v2 indicated that respondents found the questionnaire easy to understand and answer. Each survey item was rated as relevant by more than 90% of participants, supporting the content validity of the survey.
The psychometric analyses supported the reliability and validity of the SF-36v2 in a Saudi general population. All items showed satisfactory convergence within scales. In all but one instance, items also showed satisfactory differentiation across scales. Such results are on par with results from the original validation of the SF-36 in the US [29]. Overall, these results support the hypothesized scale structure of the Saudi Arabic SF-36v2. However, exploratory factor analyses did not find a factor solution similar to typical results from Western countries [31]. Rather, the two-factor solution resembled results previously found in a Japanese sample [32] and to some extent in a Turkish urban population [21]. In contrast, factor analytic results from a study in Lebanon more closely resembled typical results from Western countries [7]. The factor solution in our study seems particularly driven by the high correlation between the RP and RE scales, which suggests that the distinction between physical and psychological reasons for poor role performance does not apply to the Saudi data. The implications of these results for the validity of the PCS and MCS scores in Saudi Arabia need to be explored in future studies.
Seven of the SF-36v2 scales had internal consistency reliability above 0.70, but the two-item SF scale had a reliability of only 0.67. However, this scale has also shown low reliability in some US studies, e.g., the first US general population study, where the SF scale showed a reliability of 0.63 [41]. Thus, the reliability results may be considered adequate.
Within Saudi Arabia, we found no DIF for age and gender, but we found cross-national DIF for 9 items when comparing with US general population data. In a post-hoc cognitive debriefing study of these 9 items, we were not able to identify problems that might explain the DIF (data not shown). A possible explanation of the DIF may be cultural or lifestyle differences between Saudi Arabia and the USA. For example, because of religious practices, persons in Saudi Arabia may do more bending and kneeling and thus find this activity easier than persons in the USA. The item on bending and kneeling is still a valid indicator of physical function in each country, but the item is easier for persons in Saudi Arabia, thus influencing comparisons of Physical Function. If the interest of the researcher is to compare physical function in general (and not the specific activity of bending/kneeling), the comparison can be adjusted for the DIF. The impact of such adjustments can be evaluated on the overall level in Table 9 and for age and gender subgroups in Tables 10 and 11. The impact is actually rather small for the PF scale (0.40), but larger for the GH (1.87) and MH (2.38) scales. While these impacts are smaller than the MID for each scale, the largest impacts are larger than the impact of adjustment for demographic differences. Therefore, we recommend considering DIF when interpreting cross-national comparisons between Saudi Arabia and the USA. After adjustment for differences in age and gender, as well as DIF, analysis of Saudi general population norm data showed low scores for scales concerning physical function (PF difference = −3.10), role and social function (RP difference = −4.75, SF difference = −4.23, and RE difference = −5.67), mental health (MH difference = −4.82) as well as for the mental component summary (MCS difference = −4.57) compared to US general population norms (Table 9).
In particular, scores on the RE scale were lower for women in Saudi Arabia compared to the USA, although some of this difference was explained by DIF. These differences are not likely to be caused by higher morbidity in Saudi Arabia since the self-reported prevalence of many chronic conditions was lower in Saudi Arabia than in the USA. The magnitude of the differences on these scales suggests differences in function that need to be explored. In particular, the lower scores in scales relating to mental health (SF, RE, MH, and MCS) do not concur with the low reports of clinical anxiety and depression (Table 5). A large study estimating the burden of mental disorders in the Eastern Mediterranean Region, including Saudi Arabia, reported that the stigma attached to mental illness may cause underreporting or long delays before seeking healthcare [42]. Thus, it is possible that clinical anxiety and depression are underdiagnosed or under-reported for cultural reasons. Further, the low scores on scales related to mental health may reflect subclinical, rather than clinical, mental health problems.
As in previous general population studies (e.g. [3,4,9,10]), women scored lower on all SF-36v2 scales, thus supporting known groups validity. However, the average differences were often small; only the gender difference for the vitality scale exceeded the threshold for clinical significance.
Analyses by age group found lower scores in older age groups for SF-36v2 scales concerning physical health: PF, RP, BP, and PCS. These results are in line with results from many other studies [3,4,9,10], reflecting a decline in physical function with age and thus supporting known groups validity. As in previous studies, measures reflecting mental health were relatively constant across age groups. A study by Lorem et al. [43] found that age by itself was protective against mental health symptoms when controlling for the mental health symptoms associated with physical illness.
Representation from most regions of Saudi Arabia was satisfactory, but few participants were recruited from the Northern region. We ascribe these difficulties in recruiting participants to lack of familiarity and lack of acceptance of surveys in some parts of the Saudi culture. The Northern region is the smallest (285,733 Saudi inhabitants in 2016) and least densely populated region in Saudi Arabia, with a population that is slightly younger (mean age 26.1 years against 27.4 years for all of Saudi Arabia) and with a slightly lower proportion of men (50.3% against 50.9%). However, since these differences are very small, the low proportion of participants from the Northern region is unlikely to have a noticeable impact on the overall results.

Conclusion
This is the first large-scale Saudi general population study of patient-reported outcomes. We used a new translation of a well-known patient-reported outcomes instrument, the SF-36v2. Concept elicitation, cognitive debriefing, and large-scale quantitative testing supported the validity and reliability of the Saudi SF-36v2, but an exploratory factor analysis did not support the typical distinction between a physical health and a mental health component. Also, we found cross-national DIF for 9 out of 35 tested items. After adjustment for DIF and demographic differences, we found lower patient-reported outcomes scores in Saudi Arabia for the PF, RP, SF, RE and MH scales as well as for the MCS. For the BP, GH and VT scales, as well as for PCS, score differences were smaller and did not exceed MID. Reasons for the differences in patient-reported outcomes should be further explored, and these general population differences should be taken into account when interpreting patient-reported outcomes scores for patients in Saudi Arabia.

Authors' contributions
AA, QH contributed to the development of study design. AA, QH oversaw data collection. AA, QH, JB and MT made contributions to the data analysis and results interpretation. AA, QH wrote the first draft of the manuscript. All authors contributed and made revisions to the interpretation of results, first draft, and the manuscript. All authors have read and approved the final manuscript.

Funding
This study was funded by King Abdul Aziz City for Science and Technology (T-K-12-1041).

Availability of data and materials
The datasets/tables used and/or analyzed during the current study are available from the author on reasonable request.

Ethics approval and consent to participate
The study protocol was reviewed and approved by the institutional review board at King Fahd Medical City Riyadh, KSA and conducted in compliance