
Psychometric validation of a Saudi Arabian version of the SF-36v2 health survey and norm data for Saudi Arabia



Adaptation of a patient-reported outcomes survey into a new language requires careful translation procedures as well as qualitative and quantitative psychometric testing. This study aimed to evaluate the basic psychometric properties of the new Saudi Arabian SF-36v2 and establish norm data for Saudi Arabia.


Translation and adaptation of the SF-36v2 used standard methodology. Psychometric validation included two stages: 1) A qualitative study (n = 100) explored the components of health and health-related quality of life considered important in Saudi Arabia and evaluated the content validity of the SF-36v2 in Saudi Arabia, and 2) A quantitative study (n = 6166) evaluated the basic psychometric properties of the Saudi SF-36v2 and established norm data for Saudi Arabia. Comparison with US general population data (n = 4040) evaluated differential item function (DIF) and cross-national differences.


The qualitative study supported the content validity of the Saudi SF-36v2. Cognitive debriefing identified only a few minor problems. Psychometric analyses supported item convergence within scales and differentiation across scales of the SF-36v2. Scale-level exploratory factor analyses did not support the typical distinction between physical health and mental health components. Internal consistency reliability was satisfactory for all scales except the social function scale (alpha = 0.67). Cross-national DIF was identified for 9 items. In the Saudi general population, the average vitality score was lower for women than for men (−2.71 points). For men, older age groups scored lower on the physical function scale (−3.31) and the physical health component (−3.06). For women, older age groups scored lower on the role physical (−3.72), bodily pain (−3.66), and vitality (−2.32) scales as well as the physical health component (−3.52). Compared to the 2009 United States general population, and after adjusting for age, gender, and differential item function, persons in Saudi Arabia had lower average scores for the physical function (−3.10), role physical (−4.75), social function (−4.23), role emotional (−5.67), and mental health (−4.82) scales, as well as the mental health component (−4.57).


This Saudi normative study of patient reported outcomes supported the validity and reliability of the new Saudi SF-36v2 and found cross-national differences with the USA.


Patients’ self-reports of health outcomes are important for measuring the impact of chronic disease, accounting for changes in health, measuring the effects of treatment, and predicting health resource utilization and thus medical expenditures. To date, most of the available patient-reported outcome (PRO) measures are in English, and few have been translated into Arabic and adapted for use in Arab countries [1]. Because the perception of health-related outcomes may differ between populations and conditions, adaptation of a questionnaire into a new language and culture requires more than just a translation. Evaluation of content validity, construct validity, and reliability as well as establishing national normative data are important steps in the translation and cultural adaptation of a PRO measure [2,3,4,5,6,7,8,9]. Despite these challenges, the literature urges investigators not to “reinvent the wheel” by developing new or ad hoc measures, but rather cross-culturally adapt an existing health and health-related quality of life (HRQOL) measure. Cross-cultural adaptation is believed to: 1) be more cost-effective; 2) enable efficient utilization of the existing body of knowledge; 3) help standardize the concept internationally; and 4) offer the opportunity for international comparative studies. A disadvantage of culture-specific instruments is that their results are not generalizable or comparable because each has its conceptual definition and choice of indicators [6].

The SF-36 is one of the most widely used PRO instruments [10]. Its validity, reliability, and responsiveness have been documented in many groups varying by age, sex, socio-economic status, geographical region, and clinical conditions [3]. In the 1990s, researchers within the well-documented International Quality of Life Assessment (IQOLA) project pioneered the adaptation of the SF-36 for use internationally [11]. The methods used in the IQOLA project still constitute the standard for translation and validation work today. The SF-36 has been translated into more than 150 languages and adapted to different cultures [10]. Responding to the difficulties in translating various items and response choices, the IQOLA project’s investigators emphasized the importance of developing translations that are culturally appropriate to each country [2].

Published norms for the SF-36 exist in several developed countries [4, 11,12,13,14,15,16,17,18,19,20,21]. Norms permit evaluation of disease burden, i.e. the decrement in PRO scores relative to a general population comparison group with similar age and sex distribution [10]. Normative data can also help interpretation of treatment effects since no treatment effect can be expected to be larger than the disease burden. For Saudi Arabia, PRO population norms could help identify needs and subsequently guide health policies, legislation, and the development of strategic plans to allocate resources based on unmet needs. However, most previous work in Saudi Arabia has used the SF-36v1 or RAND-36 [22], rather than SF-36v2, and general population norms have been lacking.

Accordingly, this nationwide study aimed to explore the content validity of SF-36v2 in a Saudi Arabian context, test the validity and reliability of a new Saudi Arabic SF-36v2 translation, and collect Saudi normative SF-36v2 data. Since the SF-36v2 scoring is based on US general population norms, we also explored the difference between US and Saudi Arabian norms.


This project was performed in 2 stages, utilizing both qualitative and quantitative methods.

The qualitative study had two objectives: 1) To explore the concepts of health and HRQOL and evaluate the content validity of the SF-36v2 as an HRQOL instrument in a Saudi Arabic setting, and 2) To perform cognitive debriefing of the Saudi SF-36v2.

Semi-structured interviews were carried out on a convenience sample of 100 participants by trained interviewers aiming to explore which domains the participants consider important components of health and HRQOL, and to ascertain concordance with the WHO definition of health as “a state of complete physical, mental, and social well-being and not merely the absence of disease or infirmity” [23]. This definition forms the conceptual basis of the SF-36 and other commonly used HRQOL measures. Participants were asked introductory questions including: “What is the meaning of health?”, “What do you think may affect a person’s health?”, and “What areas of life do you think are affected by health?” Participants were probed to elaborate on their answers until they indicated having no more ideas. Participants were then asked to evaluate the importance of domains commonly used in measuring HRQOL using a four-point scale (“very important”, “quite important”, “not quite important”, and “not at all important”). Next, participants were asked to list additional domains that were not mentioned among the listed domains. It has been suggested that indicators could be added if rated important by at least 50% of subjects [24].

Subsequently, participants engaged in a cognitive debriefing of the Saudi SF-36v2 to evaluate whether the content of the translated version was easily understood and culturally relevant within a Saudi Arabian context. After completing each item, participants were asked questions about clarity, comprehensibility, relevance, and completion feasibility using a standardized response scale. The participants were probed to elaborate on their answers, using probes such as: “Interesting, can you elaborate on that?”, “What do you mean, can you explain further?”, “How important is that to you?”, “Do you want to add anything in this regard?”

The quantitative study was based on a national general population survey involving Saudis aged 15 years or older. Saudi Arabia is divided into five regions (North, South, Central, East, and West); each region is divided into sub-regions and blocks. A probability proportional sampling method was used to randomly select sub-regions, blocks, and accordingly households. Households were chosen from each block and a roster of household members (based on age and sex) was collected by a surveyor visiting the household. An adult aged 15 years or older was randomly selected to be surveyed from each household. The surveyor handed out the SF-36v2 for self-administration. Individuals were excluded if they were unable to complete the questionnaire due to language problems, communication limitations or cognitive impairments. If the selected adult was not present, our surveyors made an appointment to return. The household was counted as nonresponsive after a total of three attempted unsuccessful visits.

The Saudi population is estimated to be approximately 12,167,245 people. The study aimed to obtain 6360 completed surveys. This sample was chosen to achieve sufficient representation of all strata of the Saudi population. Based on experience with surveys in Saudi Arabia, we assumed a non-response rate of up to 40% for a target of 10,600 contacts. Of the 10,592 approached Saudi adults, 6166 participated in the study with a response rate of 61%.
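The contact target follows directly from the planned number of completed surveys and the assumed non-response rate. The arithmetic can be sketched as follows; the function name is illustrative and not taken from the study, and the rate is passed as an integer percentage to avoid floating-point rounding.

```python
def required_contacts(target_completes: int, nonresponse_pct: int) -> int:
    """Contacts needed so that the expected number of completes meets the target.

    Uses integer ceiling division: contacts = ceil(target / response_share).
    """
    if not 0 <= nonresponse_pct < 100:
        raise ValueError("nonresponse_pct must be in the range 0..99")
    response_pct = 100 - nonresponse_pct
    return (target_completes * 100 + response_pct - 1) // response_pct


# Planned completes and assumed non-response rate as stated in the text above.
contacts = required_contacts(6360, 40)
print(contacts)
```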

The Saudi Arabian data was compared to United States (US) general population data (n = 4040) obtained in 2009 [25]. This general population online survey of US citizens 18 years or older has been used to generate population norms for the USA (please see [25] for details).


SF-36v2 was administered using a new Arabic version and scored according to standard recommendations of the SF-36v2 developers [25] into eight subscales: physical function (PF), role limitations due to physical health (RP), bodily pain (BP), general health perception (GH), vitality (VT), social function (SF), role limitations due to emotional problems (RE), and mental health (MH). For each subscale: 1) items were coded so that high score indicated good health, 2) two items were weighted [26], for all other items simple category weights (1, 2, 3 …) were used, 3) the mean score was taken across items and transformed linearly to a metric from 0 to 100, 4) for norm based scoring, the scale score was transformed linearly so that the US general population has a mean of 50 and an SD of 10. Also, two overall component scores, the physical component summary (PCS) and the mental component summary (MCS) were calculated based on scoring coefficients from a principal component analysis with orthogonal rotation [27].
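The two linear transformations in steps 3 and 4 can be illustrated with a short Python sketch. This is not the developers' scoring software: the item responses and the reference mean and SD below are hypothetical placeholders, the item weighting in step 2 is omitted, and the function names are our own.

```python
def to_0_100(mean_item_score: float, min_category: float, max_category: float) -> float:
    """Step 3: linearly rescale a mean item score to the 0-100 metric."""
    return (mean_item_score - min_category) / (max_category - min_category) * 100.0


def norm_based(score_0_100: float, ref_mean: float, ref_sd: float) -> float:
    """Step 4: rescale so the reference population has mean 50 and SD 10."""
    return 50.0 + 10.0 * (score_0_100 - ref_mean) / ref_sd


# Hypothetical respondent: five items with categories 1..5,
# already recoded so that a high score indicates good health (step 1).
items = [4, 5, 3, 4, 4]
raw_mean = sum(items) / len(items)
scale_score = to_0_100(raw_mean, min_category=1, max_category=5)

# Placeholder reference values, NOT the actual US norms.
t_score = norm_based(scale_score, ref_mean=70.0, ref_sd=20.0)
print(scale_score, t_score)
```

With norm-based scoring, a score below 50 means the respondent scores below the reference population average on that scale, which is what makes the cross-national differences reported later directly interpretable in SD units (10 points = 1 SD).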

The translation of the SF-36v2 used the principles of good practice from the International Society of Pharmacoeconomics and Outcomes Research Task Force for Translation and Cultural Adaptation [28]. This included: 1) Forward translation by two independent professional translators, 2) Back translation by a third independent translator, 3) Reconciliation by an expert panel, and 4) Independent assessment of translation quality. The translation was subsequently evaluated by cognitive debriefing and quantitative testing, as reported in this paper.

Morbidity questions included a list of 27 self-reported health conditions (hypertension, heart disease, diabetes, arthritis, depression, etc.) and one open-ended response coded as “other”.

Sociodemographic questions included age, gender, education level, marital status, occupation, and financial status (monthly household income). Additional questions concerned smoking habits and major life events during the previous year.

Statistical analysis

Distributions of basic demographic variables and chronic conditions were described through standard frequency tables. The analysis of scale structure relied on multi-trait analyses [29], which have been used in multiple previous studies of the SF-36. These analyses test item convergence within scales and item differentiation across scales. Item convergence within scales (sometimes called convergent validity) was evaluated by analyzing the correlation of each item with the sum of all the other items in the scale (that is, the item-own-scale correlation was corrected for overlap; see also [30]). A correlation of 0.40 or more for all items in a scale supports item convergence within scales. Item differentiation across scales (sometimes called discriminant validity) was evaluated for each item by comparing the item’s correlation with its own scale to its correlation with each of the other scales. Item differentiation across scales is supported if the item’s correlation with its own scale is significantly larger than its correlation with any other scale. Furthermore, we analyzed the scale correlation matrix using exploratory factor analysis, as in many previous studies of the SF-36 (e.g. [31, 32]). The number of factors was evaluated by eigenvalue analysis. Factors were extracted using the principal components method, followed by orthogonal rotation (Varimax). While most studies have used a two-factor solution of physical and mental health [31], analyses in some non-Western countries have suggested a three-factor structure of physical, mental, and social health [32]. For this reason, we evaluated both two- and three-factor solutions. As robustness analyses, we supplemented the Varimax rotation with an oblique rotation (Promax) and supplemented standard analyses of product-moment correlations with analyses of the polychoric correlation matrix. Internal consistency reliability (coefficient alpha) was estimated for the subscales. Internal consistency reliability for the PCS and MCS was estimated using methods for weighted composites (see [33], page 37).
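To make the two item-level statistics concrete, the following Python sketch computes the overlap-corrected item-own-scale correlation and coefficient alpha for a single scale. It runs on simulated data, not the study data; the function names and the simulation settings are illustrative assumptions.

```python
import numpy as np


def corrected_item_total(items: np.ndarray) -> np.ndarray:
    """Correlation of each item with the sum of the OTHER items in its scale."""
    total = items.sum(axis=1)
    return np.array([
        np.corrcoef(items[:, j], total - items[:, j])[0, 1]
        for j in range(items.shape[1])
    ])


def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha from item variances and the variance of the sum score."""
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - sum_item_var / total_var)


# Simulated scale: 500 respondents, 4 items sharing one latent trait plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
scale_items = latent + rng.normal(size=(500, 4))

item_rest = corrected_item_total(scale_items)
alpha = cronbach_alpha(scale_items)
print(item_rest.round(2), round(alpha, 2))
```

Subtracting the item from the total before correlating is what "corrected for overlap" means: without it, each item would be correlated partly with itself, which inflates the coefficient, particularly for short scales.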

Differential item function (DIF) was evaluated for age, gender, and comparisons of Saudi Arabia and USA using logistic regression DIF tests [34]. Adopting a standard decision rule [35], evaluation of important DIF was based on statistical significance (p < 0.05 after Bonferroni adjustment) and magnitude in terms of increase in explained item variance (difference in pseudo R-squared [36] larger than 0.03). This criterion is slightly less conservative than a threshold of 0.035 advocated in the educational testing literature [37]. We used a standard purification strategy [34], where items with indications of DIF were excluded iteratively until a set of anchor items without DIF was identified. Then, the final DIF analyses were conducted for each item using the anchor items and the item in question. In cases of important cross-national DIF, we adjusted the cross-national comparisons using the generalized partial credit item response theory (IRT) model [38]. This model can adjust for uniform DIF (DIF with the same magnitude across score levels) by adjustment of the IRT threshold parameters and adjust for non-uniform DIF (magnitude of DIF depends on score level) by adjustment of the IRT discrimination parameter. DIF adjustment was performed using a three-step procedure: 1) We estimated item parameters for all SF-36v2 scales using the US data and the generalized partial credit IRT model [38], 2) For items with significant DIF, we re-estimated the item parameters in the Saudi data, fixing item parameters for the anchor (no-DIF) items, and 3) We performed IRT-based sum score cross-calibration to link the Saudi scale scores to the US metric [39]. After doing this for all subscales with DIF, we calculated adjusted PCS and MCS scores based on the adjusted subscales.
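The decision rule combines a significance criterion with a magnitude criterion, and an item is flagged only when both are met. A minimal Python sketch of that rule follows; the item labels, p-values, and R-squared differences are hypothetical, and the logistic regression fitting itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class DifResult:
    item: str
    p_value: float    # p-value from the logistic regression DIF test
    delta_r2: float   # increase in pseudo R-squared when group terms are added


def flag_important_dif(results, alpha=0.05, r2_threshold=0.03):
    """Return items that pass both the Bonferroni-adjusted test and the magnitude rule."""
    if not results:
        return []
    bonferroni_alpha = alpha / len(results)
    return [
        r.item
        for r in results
        if r.p_value < bonferroni_alpha and r.delta_r2 > r2_threshold
    ]


# Hypothetical results for three items (values invented for illustration).
results = [
    DifResult("item_a", p_value=1e-10, delta_r2=0.045),  # significant and large
    DifResult("item_b", p_value=1e-10, delta_r2=0.010),  # significant but small
    DifResult("item_c", p_value=0.02, delta_r2=0.050),   # large, fails Bonferroni
]
flagged = flag_important_dif(results)
print(flagged)
```

In a sample of several thousand respondents nearly every test is statistically significant, so in practice the magnitude threshold does most of the work in separating important from trivial DIF.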

Comparisons between Saudi Arabia and the USA were carried out using a linear regression model, with and without controlling for differences in age and gender, and adjustment for DIF. The magnitude of differences was evaluated according to published guidelines for minimal important differences (MID) for the SF-36v2 [25]: PF: 3 points, RP: 3 points, BP: 3 points, GH: 2 points, VT: 2 points, SF: 3 points, RE: 4 points, MH: 3 points, PCS: 2 points, and MCS: 3 points. These MID values have been established using anchors such as noticeable increase in risk of mortality, job loss, or hospitalization [40].
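As an illustration of how these thresholds are applied, the Python sketch below checks score differences against the MID values listed above. The example differences are the adjusted Saudi-US differences reported in this paper; the helper name is our own.

```python
# MID thresholds for the SF-36v2 scales and component summaries, as listed above.
MID = {
    "PF": 3, "RP": 3, "BP": 3, "GH": 2, "VT": 2,
    "SF": 3, "RE": 4, "MH": 3, "PCS": 2, "MCS": 3,
}


def exceeds_mid(scale: str, difference: float) -> bool:
    """True if the absolute score difference is larger than the scale's MID."""
    return abs(difference) > MID[scale]


# Adjusted differences (Saudi Arabia minus USA) reported in this paper.
adjusted_differences = {
    "PF": -3.10, "RP": -4.75, "SF": -4.23,
    "RE": -5.67, "MH": -4.82, "MCS": -4.57,
}
important = [s for s, d in adjusted_differences.items() if exceeds_mid(s, d)]
print(important)
```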


Health and HRQOL concepts

The characteristics of the sample (N = 100) used in the qualitative study are presented in Table 1.

Table 1 Socio-demographic Characteristics of Participants in the Qualitative Study (n = 100)

In the qualitative study, four concepts were endorsed by 50% or more of participants as components of health: physical functioning (70% of participants), normal psychological function and feelings (66%), healthy eating habits and enjoyment of food (61%), normal social functioning (50%); 38% of participants defined absence of disease or illness and 28% being full of energy (free from pain and fatigue) as components of health (Table 2).

Table 2 Conceptual Domains Considered Components of Health

When presented with a list of domains commonly included in the assessment of HRQOL, concepts related to all eight SF-36 domains were assessed as “quite” or “very” important for HRQOL (range 95%-100%, Table 3). While the concept of being full of energy was considered a component of health by only 28% of the sample, the related concepts of “having a lot of energy” and “being free from pain” were considered “very” or “quite” important components of HRQOL by 100% and 96% of participants, respectively. Participants also reported some additional domains that are not covered by the SF-36 as important: eating habits (72%), sleep (55%), travel (53%), and sexual function (56%) (Table 3).

Table 3 Importance Ratings of Health Related Quality of Life Domains

We identified four SF-36v2 items that two or three participants (out of 100) had problems understanding: HT (Health compared to 1 year ago), RP4 (Difficulty performing work due to physical health), VT4 (Feeling tired), and GH3 (Healthy as anybody). No other problems were identified for any other item at this stage.

Quantitative sample characteristics

Compared with the 2009 USA general population, the Saudi sample was younger (49.4% in the age range 15–29 years, with 47.3% in the age range 18–29 years vs. 22.0% in the USA), included a higher proportion of never married (43.8% vs. 26.9%), and a lower proportion of divorced/separated (3.1% vs. 15.6%). More people in the Saudi sample had received a college degree (43.2% vs. 35.0%) (Table 4). Differences in reporting of employment status precluded a detailed comparison, but 48.6% of the Saudi sample was working compared to 53.3% in the USA sample.

Table 4 Socio-demographic Characteristics of the Participants in the Quantitative Study

Data on self-reported health conditions (Table 5) also showed noticeable differences between Saudi Arabia and the USA. In Saudi Arabia, the most prevalent conditions were: trouble seeing (26.1%), back problems (18.0%), anemia (15.8%), and allergies (15.3%). Trouble seeing was much less frequently reported in the USA (12.1%), but several other conditions were considerably more prevalent in the USA: allergies (47.9%), hypertension (32%), arthritis (26.4%), anxiety (17.2%), and depression (14.1%).

Table 5 Self-reported Health Conditions among Respondents in Saudi Arabia and the USA

Item convergence within scales, item differentiation across scales

Table 6 presents results of analyses of item convergence within scales and differentiation across scales in the Saudi sample. The numbers in bold show each item’s correlation with the sum of all the other items in its own scale (item-own-scale correlations). All items satisfied the standard criterion of item convergence within scales (≥0.40). For all items except one, the item-own-scale correlation was higher than the correlation with any other scale, thus supporting item differentiation across scales. One item, SF01 (“During the past 4 weeks, to what extent has your physical health or emotional problems interfered with your normal social activities with family, friends, neighbors, or groups?”), showed a higher correlation with the pain scale than with the other item in its own scale.

Table 6 Item-scale Correlations for the SF-36v2 – Saudi Arabia

Exploratory factor analysis and internal consistency reliability

While all scales were positively correlated, no correlations between scales were strong (above 0.70), supporting the notion that the eight scales measure distinct domains (data not shown). The highest scale correlation (0.67) was seen between RP and RE. In exploratory factor analysis, the first four eigenvalues were 4.07, 1.08, 0.78, and 0.57, thus supporting a two-factor solution. In a two-factor model, the factor loadings did not concur with the hypothesized associations (Table 7). Rather, the PF, RP, and RE subscales loaded strongly on the first factor (Physical and role function), while the BP, GH, VT, SF, and MH subscales loaded strongly on the second factor (Symptoms, health perception and social function). Analyses using oblique rotation and analyses of polychoric correlations provided similar results (data not shown). A three-factor solution kept the first factor unchanged, but split the second factor into a factor on Symptoms and general health perception (BP, GH, and VT loaded strongly on this factor) and a factor on Social function and mental health (SF and MH loaded strongly on this factor, which also had a strong cross-loading from RE, data not shown).
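The reported eigenvalues can be read with a short Python sketch. The text does not name the retention rule that was used; the eigenvalue-greater-than-one (Kaiser) criterion applied here is an assumption that happens to agree with the two-factor conclusion. The explained-variance shares use the fact that the eigenvalues of an 8 x 8 correlation matrix sum to 8.

```python
# First four eigenvalues of the 8-scale correlation matrix, as reported above.
eigenvalues = [4.07, 1.08, 0.78, 0.57]
n_scales = 8  # the eigenvalues of a correlation matrix sum to the number of variables

# Assumed retention rule (Kaiser criterion): keep factors with eigenvalue > 1.
n_factors = sum(ev > 1.0 for ev in eigenvalues)

# Share of total variance captured by the first factor and by the first two.
share_first = eigenvalues[0] / n_scales
share_first_two = (eigenvalues[0] + eigenvalues[1]) / n_scales
print(n_factors, round(share_first, 3), round(share_first_two, 3))
```

The second eigenvalue is only just above 1, so the two-factor solution is not strongly separated from a one-factor solution, which is consistent with the moderate positive correlations among all eight scales.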

Table 7 Hypothesized associations and observed factor loadings for a two-factor solution

Internal consistency reliability was above the traditional threshold of 0.70 for seven scales. The two-item SF scale had a reliability of 0.67. The internal consistency reliabilities were 0.91 for PCS and 0.90 for MCS.

Differential item function

We did not identify any DIF with regard to age and sex. Uniform and non-uniform cross-national DIF was identified for 4 and 5 items, respectively, based on explained item variance (Table 8). Due to the large sample size, all DIF results were highly significant. For all 9 items, the direction of DIF was clear and consistent over most or all of the score range. Six items (PF02, moderate activities; PF06, bending/kneeling; GH01, health in general; GH05, health is excellent; MH03, calm and peaceful; MH05, happy) provided a more positive assessment of health in Saudi Arabia compared to the anchor items. Three items (PF10, bathing or dressing; GH02, sick easier; RE03, did work less carefully) provided a more negative assessment of health in Saudi Arabia compared to the anchor items.

Table 8 Test of Differential Item Function (DIF) between Saudi and US SF-36v2 versions

Normative data for Saudi Arabia and comparisons with the USA

Compared to 2009 US general population norms, Saudi Arabia data showed lower scores for the RP, SF, and RE scales as well as for the MCS (Table 9). Slightly lower scores were also seen for the PF and MH scales, but these differences were below the suggested threshold for clinical significance. Adjusting for age and gender led to slightly larger differences for the scales reflecting physical health but had little impact on differences in scales reflecting mental health. Adjusting for DIF lowered the Saudi Arabia scores for PF, GH, MH, PCS and MCS, but provided higher scores for RE, thus slightly diminishing the difference between Saudi Arabia and the US on this scale.

Table 9 SF-36v2 Norm Tables for Saudi Arabia – Total Sample

Both in Saudi Arabia and in the USA, separate analyses by gender and age group (Tables 10 and 11) showed lower physical health scores for older age groups. However, this trend was most pronounced in the USA, so the strongest cross-national differences were seen in the younger age groups. Saudi Arabian women, 60 years or older, reported significantly better physical function than American women in the same age group.

Table 10 SF-36v2 Norm Tables for Saudi Arabia – Age Groups – Male
Table 11 SF-36v2 Norm Tables for Saudi Arabia – Age Groups – Female

Comparisons according to gender in the Saudi Arabian sample showed that women scored lower on several scales: BP, GH, VT, SF, RE, and MH as well as on MCS (Tables 10 and 11). However, except for VT, the score differences were below the thresholds for clinical significance.

Among men in Saudi Arabia (Table 10), the strongest score differences across age groups were seen for the PF scale and PCS. Among women (Table 11), lower scores in older age groups were seen for the RP, BP, and VT scales and PCS, whereas other scales remained fairly constant across age.


This nationwide study generally supported the content validity, construct validity, and reliability of a new Saudi version of the SF-36v2. In the qualitative study, participants emphasized physical and psychological function as important components of health, along with social function and healthy eating. Thus, similar to the World Health Organization (WHO) definition [23], health was seen as having physical, psychological, and social aspects. Most of the important HRQOL outcomes listed by participants overlap with domains covered by the SF-36. However, some domains mentioned by smaller proportions of participants are not covered by the SF-36: religious habits, eating habits, travel, good sleep, and sexual function. Items covering these domains have been developed and will be reported in future papers.

Cognitive debriefing of the Saudi SF-36v2 indicated that respondents found the questionnaire easy to understand and answer. Each survey item was rated as relevant by more than 90% of participants, supporting the content validity of the survey.

The psychometric analyses supported the reliability and validity of the SF-36v2 in a Saudi general population. All items showed satisfactory convergence within scales. In all but one instance, items also showed satisfactory differentiation across scales. Such results are on par with results from the original validation of the SF-36 in the US [29]. Overall, these results support the hypothesized scale structure of the Saudi Arabic SF-36v2. However, exploratory factor analyses did not find a factor solution similar to typical results from Western countries [31]. Rather, the two-factor solution resembled results previously found in a Japanese sample [32] and to some extent in a Turkish urban population [21]. In contrast, factor analytic results from a study in Lebanon more closely resembled typical results from Western countries [7]. The factor solution in our study seems particularly driven by the high correlation between the RP and RE scales, which suggests that the distinction between physical and psychological reasons for poor role performance does not apply to the Saudi data. The implications of these results for the validity of the PCS and MCS scores in Saudi Arabia need to be explored in future studies.

Seven of the SF-36v2 scales had internal consistency reliability above 0.70, but the two-item SF scale had a reliability of only 0.67. However, this scale has also shown low reliability in some US studies, e.g., the first US general population study, where the SF scale showed a reliability of 0.63 [41]. Thus, the reliability results may be considered adequate.

Within Saudi Arabia, we found no DIF for age and gender, but we found cross-national DIF for 9 items when comparing with US general population data. In a post-hoc cognitive debriefing study of these 9 items, we were not able to identify problems that might explain the DIF (data not shown). A possible explanation of the DIF may be cultural or lifestyle differences between Saudi Arabia and the USA. For example, because of religious practices, persons in Saudi Arabia may do more bending and kneeling and thus find this activity easier than persons in the USA. The item on bending and kneeling is still a valid indicator of physical function in each country, but the item is easier for persons in Saudi Arabia, thus influencing comparisons of physical function. If the interest of the researcher is to compare physical function in general (and not the specific activity of bending/kneeling), the comparison can be adjusted for the DIF. The impact of such adjustments can be evaluated on the overall level in Table 9 and for age and gender subgroups in Tables 10 and 11. The impact is rather small for the PF scale (0.40), but larger for the GH (1.87) and MH (2.38) scales. While these impacts are smaller than the MID for each scale, the largest impacts are larger than the impact of adjustment for demographic differences. Therefore, we recommend considering DIF when interpreting cross-national comparisons between Saudi Arabia and the USA.

After adjustment for differences in age and gender, as well as DIF, analysis of Saudi general population norm data showed low scores for scales concerning physical function (PF difference = −3.10), role and social function (RP difference = −4.75, SF difference = −4.23, and RE difference = −5.67), mental health (MH difference = −4.82) as well as for the mental component summary (MCS difference = −4.57) compared to US general population norms (Table 9). In particular, scores on the RE scale were lower for women in Saudi Arabia compared to the USA, although some of this difference was explained by DIF. These differences are not likely to be caused by higher morbidity in Saudi Arabia since the self-reported prevalence of many chronic conditions was lower in Saudi Arabia than in the USA. The magnitude of the differences on these scales suggests differences in function that need to be explored. In particular, the lower scores on scales relating to mental health (SF, RE, MH, and MCS) do not concur with the low reports of clinical anxiety and depression (Table 5). A large study (1990–2013) estimating the burden of mental disorders in the Eastern Mediterranean Region, including Saudi Arabia, reported that the stigma attached to mental illness may cause underreporting or long delays before seeking healthcare [42]. Thus, it is possible that clinical anxiety and depression are under-diagnosed or under-reported for cultural reasons. Further, the low scores on scales related to mental health may reflect subclinical, rather than clinical, mental health problems.

As in previous general population studies (e.g. [3, 4, 9, 10]), women scored lower on all SF-36v2 scales, thus supporting known groups validity. However, the average differences were often small – only the gender difference for the vitality scale exceeded the threshold for clinical significance.

Analyses by age group found lower scores in older age groups for SF-36v2 scales concerning physical health: PF, RP, BP, and PCS. These results are in line with results from many other studies [3, 4, 9, 10], reflecting a decline in physical function with age and thus supporting known groups validity. As in previous studies, measures reflecting mental health were relatively constant across age groups. A study by Lorem et al. [43] found that age by itself was protective against mental health symptoms when controlling for the mental health symptoms associated with physical illness.

Representation from most regions of Saudi Arabia was satisfactory, but few participants were recruited from the Northern region. We ascribe these difficulties in recruiting participants to lack of familiarity with and lack of acceptance of surveys in some parts of the Saudi culture. The Northern region is the smallest (285,733 Saudi inhabitants in 2016) and least densely populated region in Saudi Arabia, with a population that is slightly younger (mean age 26.1 years against 27.4 years for all of Saudi Arabia) and with a slightly lower proportion of men (50.3% against 50.9%). However, since these differences are very small, the low proportion of participants from the Northern region is unlikely to have a noticeable impact on the overall results.


This is the first large-scale Saudi general population study of patient-reported outcomes. We used a new translation of a well-known patient-reported outcomes instrument, the SF-36v2. Concept elicitation, cognitive debriefing, and large-scale quantitative testing supported the validity and reliability of the Saudi SF-36v2, but an exploratory factor analysis did not support the typical distinction between a physical health and a mental health component. We also found cross-national DIF for 9 out of 35 tested items. After adjustment for DIF and demographic differences, we found lower patient-reported outcomes scores in Saudi Arabia for the PF, RP, SF, RE and MH scales as well as for the MCS. For the BP, GH and VT scales, as well as for the PCS, score differences were smaller and did not exceed the MID. Reasons for the differences in patient-reported outcomes should be further explored, and these general population differences should be taken into account when interpreting patient-reported outcomes scores for patients in Saudi Arabia.

Availability of data and materials

The datasets/tables used and/or analyzed during the current study are available from the author on reasonable request.



Abbreviations

HRQoL: Health-related Quality of Life
IQOLA: International Quality of Life Assessment
SF-36: Short Form Health Survey
DIF: Differential item function
PRO: Patient-reported outcome
IRT: Item response theory
MID: Minimally important difference
WHO: World Health Organization
MCS: The Mental Component Summary
PCS: The Physical Component Summary
MH: Mental Health
RE: Role limitations due to Emotional Problems
SF: Social Function
VT: Vitality
GH: General Health Perception
BP: Bodily Pain
RP: Role Limitations due to Physical Health
PF: Physical Function
US: United States


  1. Khader, S., Hourani, M. M., & Al-Akour, N. (2011). Normative data and psychometric properties of short form 36 health survey (SF-36, version 1.0) in the population of North Jordan. Eastern Mediterranean Health Journal, 17(5), 368–374.

  2. Wagner, A. K., Gandek, B., Aaronson, N. K., Acquadro, C., Alonso, J., Apolone, G., et al. (1998). Cross-cultural comparisons of the content of SF-36 translations across 10 countries: Results from the IQOLA project. Journal of Clinical Epidemiology, 51(11), 925–932.

  3. Ware, J. E., Kosinski, M., & Gandek, B. (2000). SF–36® health survey: Manual and interpretation guide. Lincoln: QualityMetric.

  4. Hopman, W. M., Towheed, T., Anastassiades, T., Tenenhouse, A., Poliquin, S., Berger, C., et al. (2000). Canadian normative data for the SF-36 health survey. Canadian Medical Association Journal, 163(3), 265–271.

  5. Wood-Dauphinee, S. (2000). The Canadian SF-36 health survey: Normative data add to its value. Canadian Medical Association Journal, 163(3), 283–284.

  6. Lam, C. L., Lauder, I. J., Lam, T. P., & Gandek, B. (2000). Validation and norming of the MOS 36-item short form health survey in Hong Kong Chinese adults. Health Services Research Committee Dissemination Report no 711026.

  7. Sabbah, I., Drouby, N., Sabbah, S., Retel-Rude, N., & Mercier, M. (2003). Quality of life in rural and urban populations in Lebanon using SF-36 health survey. Health and Quality of Life Outcomes, 1(1), 1.

  8. Thumboo, J., Wu, Y., Tai, E.-S., Gandek, B., Lee, J., Ma, S., et al. (2013). Reliability and validity of the English (Singapore) and Chinese (Singapore) versions of the short-form 36 version 2 in a multi-ethnic urban Asian population in Singapore. Quality of Life Research, 22(9), 2501–2508.

  9. Cruz, L. N., Fleck, M. P. A., Oliveira, M. R., Camey, S. A., Hoffmann, J. F., Bagattini, Â. M., et al. (2013). Health-related quality of life in Brazil: Normative data for the SF-36 in a general population sample in the south of the country. Ciência & Saúde Coletiva, 18, 1911–1921.

  10. Pappa, E., Kontodimopoulos, N., & Niakas, D. (2005). Validating and norming of the Greek SF-36 health survey. Quality of Life Research, 14(5), 1433–1438.

  11. Bullinger, M., Alonso, J., Apolone, G., Leplège, A., Sullivan, M., Wood-Dauphinee, S., et al. (1998). Translating health status questionnaires and evaluating their quality: The IQOLA project approach. Journal of Clinical Epidemiology, 51(11), 913–923.

  12. Blake, C., Codd, M. B., & O’Meara, Y. M. (2000). The short form 36 (SF-36) health survey: Normative data for the Irish population. Irish Journal of Medical Science, 169(3), 195.

  13. Apolone, G., & Mosconi, P. (1998). The Italian SF-36 health survey: Translation, validation and norming. Journal of Clinical Epidemiology, 51(11), 1025–1036.

  14. Aaronson, N. K., Muller, M., Cohen, P. D. A., Essink-Bot, M.-L., Fekkes, M., Sanderman, R., et al. (1998). Translation, validation, and norming of the Dutch language version of the SF-36 health survey in community and chronic disease populations. Journal of Clinical Epidemiology, 51(11), 1055–1068.

  15. Scott, K. M., Tobias, M. I., Sarfati, D., & Haslett, S. J. (1999). SF-36 health survey reliability, validity and norms for New Zealand. Australian and New Zealand Journal of Public Health, 23(4), 401–406.

  16. Loge, J. H., & Kaasa, S. (1998). Short form 36 (SF-36) health survey: Normative data from the general Norwegian population. Scandinavian Journal of Social Medicine, 26(4), 250–258.

  17. Eng, B., Wee, H. L., Wu, Y., Tai, E.-S., & Gandek, B. (2014). Normative data for the Singapore English and Chinese SF-36 version 2 health survey. Annals of the Academy of Medicine, Singapore, 43, 15–23.

  18. Jenkinson, C., Coulter, A., & Wright, L. (1993). Short form 36 (SF36) health survey questionnaire: Normative data for adults of working age. BMJ, 306(6890), 1437–1440.

  19. Jenkinson, C., Stewart-Brown, S., Petersen, S., & Paice, C. (1999). Assessment of the SF-36 version 2 in the United Kingdom. Journal of Epidemiology & Community Health, 53(1), 46–50.

  20. Lyons, R. A., Fielder, H., & Littlepage, B. N. C. (1995). Measuring health status with the SF-36: The need for regional norms. Journal of Public Health, 17(1), 46–50.

  21. Demiral, Y., Ergor, G., Unal, B., Semin, S., Akvardar, Y., Kıvırcık, B., et al. (2006). Normative data and discriminative properties of short form 36 (SF-36) in Turkish urban population. BMC Public Health, 6(1), 247.

  22. Coons, S. J., Alabdulmohsin, S. A., Draugalis, J. R., & Hays, R. D. (1998). Reliability of an Arabic version of the RAND-36 health survey and its equivalence to the US-English version. Medical Care, 36(3), 428–432.

  23. World Health Organization (1948). Preamble to the Constitution of the World Health Organization as adopted by the International Health Conference, New York, 19–22 June, 1946; signed on 22 July 1946 by the representatives of 61 States (Official Records of the World Health Organization, no. 2, p. 100) and entered into force on 7 April 1948.

  24. Guyatt, G. H., Feeny, D. H., & Patrick, D. L. (1993). Measuring health-related quality of life. Annals of Internal Medicine, 118(8), 622–629.

  25. Maruish, M. E. (Ed.) (2011). User's manual for the SF-36v2 health survey (3rd ed.). Lincoln: QualityMetric Inc.

  26. Ware Jr., J. E., Snow, K. K., Kosinski, M., & Gandek, B. (1993). SF-36 health survey. Manual and Interpretation Guide. Boston: The Health Institute, New England Medical Center.

  27. Ware Jr., J. E., Kosinski, M., Bayliss, M. S., McHorney, C. A., Rogers, W. H., & Raczek, A. (1995). Comparison of methods for the scoring and statistical analysis of SF-36 health profile and summary measures: Summary of results from the medical outcomes study. Medical Care, 33(4 Suppl), AS264.

  28. Wild, D., Grove, A., Martin, M., Eremenco, S., McElroy, S., Verjee-Lorenz, A., et al. (2005). Principles of good practice for the translation and cultural adaptation process for patient-reported outcomes (PRO) measures: Report of the ISPOR task force for translation and cultural adaptation. Value in health : the journal of the International Society for Pharmacoeconomics and Outcomes Research.

  29. McHorney, C. A., Ware Jr., J. E., Lu, J. R., & Sherbourne, C. D. (1994). The MOS 36-item Short-Form Health Survey (SF-36): III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Medical Care, 32(1), 40–66.

  30. Howard, K. I., & Forehand, G. A. (1962). A method for correcting item-total correlations for the effect of relevant item inclusion. Educational and Psychological Measurement.

  31. Ware Jr., J. E., Kosinski, M., Gandek, B., Aaronson, N. K., Apolone, G., Bech, P., et al. (1998). The factor structure of the SF-36 health survey in 10 countries: Results from the IQOLA project. Journal of Clinical Epidemiology, 51(11), 1159–1165.

  32. Suzukamo, Y., Fukuhara, S., Green, J., Kosinski, M., Gandek, B., & Ware, J. E. (2011). Validation testing of a three-component model of short Form-36 scores. Journal of Clinical Epidemiology.

  33. Ware Jr., J. E., Kosinski, M., & Keller, S. D. (1994). SF-36 physical and mental health summary scales - A user's manual. Boston: The Health Institute.

  34. Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa: Directorate of Human Resources Research and Evaluation, Department of National Defense.

  35. Rose, M., Bjorner, J. B., Gandek, B., Bruce, B., Fries, J. F., & Ware Jr., J. E. (2014). The PROMIS physical function item bank was calibrated to a standardized metric and shown to improve measurement efficiency. Journal of Clinical Epidemiology.

  36. Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78, 691–692.

  37. Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329–349.

  38. Muraki, E. (1997). A generalized partial credit model. In W. van der Linden, & R. Hambleton (Eds.), Handbook of modern item response theory, (pp. 153–164). Berlin: Springer.

  39. Orlando, M., Sherbourne, C. D., & Thissen, D. (2000). Summed-score linking using item response theory: Application to depression measurement. Psychological Assessment, 12(3), 354–359.

  40. Bjorner, J. B., Wallenstein, G. V., Martin, M. C., Lin, P., Blaisdell-Gross, B., Tak, P. C., et al. (2007). Interpreting score differences in the SF-36 vitality scale: Using clinical conditions and functional outcomes to define the minimally important difference. Current Medical Research and Opinion, 23(4), 731–739.

  41. McHorney, C. A., Kosinski, M., & Ware Jr., J. E. (1994). Comparisons of the costs and quality of norms for the SF-36 health survey collected by mail versus telephone interview: results from a national survey. Medical Care, 32(6), 551–567.

  42. Charara, R., Forouzanfar, M., Naghavi, M., Moradi-Lakeh, M., Afshin, A., Vos, T., et al. (2017). The burden of mental disorders in the eastern Mediterranean region, 1990-2013. PLoS One, 12(1), e0169575.

  43. Lorem, G. F., Schirmer, H., Wang, C. E. A., & Emaus, N. (2017). Ageing and mental health: Changes in self-reported health due to physical illness and mental health status with consecutive cross-sectional analyses. BMJ Open, 7(1), e013629.


We acknowledge Dr. Youssef Al Tannir’s help in editing the manuscript.


This study was funded by King Abdul Aziz City for Science and Technology (T-K-12-1041).

Author information



AA, QH contributed to the development of study design. AA, QH oversaw data collection. AA, QH, JB and MT made contributions to the data analysis and results interpretation. AA, QH wrote the first draft of the manuscript. All authors contributed and made revisions to the interpretation of results, first draft, and the manuscript. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Mohamad Al-Tannir.

Ethics declarations

Ethics approval and consent to participate

The study protocol was reviewed and approved by the institutional review board at King Fahd Medical City, Riyadh, KSA, and the study was conducted in compliance with the Declaration of Helsinki. Informed consent was obtained from all individual participants included in the study.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

AboAbat, A., Qannam, H., Bjorner, J.B. et al. Psychometric validation of a Saudi Arabian version of the sf-36v2 health survey and norm data for Saudi Arabia. J Patient Rep Outcomes 4, 67 (2020).
