Validation of a health screening questionnaire for primary care using Rasch models
Journal of Patient-Reported Outcomes volume 3, Article number: 12 (2019)
Health inequality is on the rise due to various social and individual factors. While preventive health checks (PHC) aim to counteract health inequality, there is robust evidence against the use of PHC in general practice. It is unknown which factors can identify persons for whom preventive interventions will be more beneficial than harmful. Hence, valid screening instruments are needed.
The aim of this study was to assess the psychometric properties of a screening questionnaire (SQ-33), which targets vulnerable persons in primary care who can benefit from preventive consultations. Survey data were acquired from 20 primary care practices in the Northern Region of Jutland, Denmark. Respondents were 2056 persons between 20 and 44 years old who, for any reason, consulted their family doctor. The psychometric properties of the SQ-33 were assessed using Rasch item response modelling. Follow-up analysis was performed on a subsample of 364 persons one year after initial inclusion, in order to assess responsiveness and predictive validity using a general health anchor item.
Twenty-three of the SQ-33 items in four subscales fit a Graphical loglinear Rasch model (GLLRM) at baseline and follow-up, thus confirming the scaling properties. The modified 23-item version (HSQ-23) revealed superior responsiveness and predictive validity compared with the SQ-33.
The Health Screening Questionnaire (HSQ-23) was shown to possess adequate psychometric properties and responsiveness and can thus be used as an outcome measure in preventive intervention studies. Future studies should address whether the HSQ-23 successfully identifies patients who will benefit from PHC consultations.
Preventive health checks (PHC) have been a controversial topic for at least a decade [1, 2]. There is presently substantial evidence against the use of PHC questionnaires for screening in primary care medicine. Screening programs can be justifiably implemented only if the instrument is capable of identifying persons who will benefit from a preventive intervention. The benefits of screening must always outweigh the harms, for example those due to unnecessary interventions or overdiagnosis [3,4,5]. In Denmark, health inequality is on the rise, which can be attributed to various social and individual factors. Numerous screening strategies and selection criteria have been applied to identify persons at risk of developing life-threatening or functionally debilitating chronic diseases [7,8,9]. These strategies include stratifying the general population by age, gender, job type, financial and sociodemographic factors, as well as specific diagnoses. Nevertheless, while screening instruments should identify persons at risk who can benefit from a preventive intervention, they must not increase the risk of harm by introducing unnecessary diagnoses and treatments. In addition, the instrument must possess acceptable diagnostic test accuracy and consist of meaningful indicators for the target population in order to enhance self-efficacy.
A previous paper describes the development and implementation of a PHC screening instrument for vulnerable persons, called the Screening Questionnaire (SQ-33). The SQ-33 was developed to assess factors important to health, disease management, and child development using theories of salutogenesis, hierarchy of needs, and self-evaluated health. For a full description of the development of the original SQ-33 questionnaire, and more context, the reader is referred to Freund and Lous (2012) and Hansen et al. (2014) [11, 12].
The domains of the SQ-33 address aspects of Personal Resources (9 items), Lifestyle (8 items), Family Life (10 items), and Relationship with one’s Children (6 items). The study revealed that a third of the screened population noted difficulties on at least seven of the 33 items, and a parallel study found that a number of SQ-33 items correlated positively with certain social and medical conditions and disease states. The results of a 1-year follow-up survey showed that participants randomly assigned to a package of two follow-up consultations with the GP had fewer social problems and an improved sense of psychological well-being compared with controls, as measured by the SF-12 Mental Health Component subscale [11, 12]. This indicates a beneficial impact on these variables. However, evidence of preventive effects on morbidity and mortality remains to be seen.
The SQ-33 screening instrument is a self-report questionnaire in which the categorical responses to each item are assigned numerical values and summed to a composite score. Summation of raw item scores into a single index (i.e., a unidimensional scale) assumes that each item describes a different aspect of the same underlying latent trait [15,16,17]. The summated score is then used as a measure of the degree to which a person with limited resources is at risk of developing a disease, which in turn can affect a person’s health-related quality-of-life (HRQoL).
Item Response Theory (IRT) models are popular and robust statistical tools for validating scales used to measure HRQoL, and they add valuable measurement features provided there is adequate data-model fit. Item analyses using Rasch IRT explore in depth which items belong to a single dimension and how the items included in each dimension are interrelated and ordered on a latent trait [19, 20]. Good scales exhibit adequate spread along the dimension of interest and are unaffected by subgroups in the population across sociodemographic factors like gender and age [16, 21]. Such person factor bias is known as differential item functioning (DIF), which can undermine the scale if not addressed [22,23,24,25]. Local response dependence (LD) is another source of bias that can result in lack of fit to a Rasch model. LD is seen when items are too highly correlated, as items should only be correlated through the latent variable being measured. Loglinear Rasch models permit some level of uniform DIF and LD and yet still yield robust scales [27, 28].
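To illustrate the measurement model underlying these analyses: the dichotomous Rasch model expresses the probability that a person endorses an item as a logistic function of the difference between the person's location and the item's location on the latent trait. The following is a minimal sketch, not part of the study; the function name and all values are hypothetical:

```python
import math

def rasch_probability(theta: float, beta: float) -> float:
    """Probability of endorsing a dichotomous item under the Rasch model.

    theta: person location on the latent trait (logits)
    beta:  item location (difficulty) on the same scale
    """
    return math.exp(theta - beta) / (1.0 + math.exp(theta - beta))

# When person location equals item location, the endorsement probability is 0.5
print(round(rasch_probability(0.0, 0.0), 3))  # 0.5
# A person one logit above the item is more likely to endorse it
print(round(rasch_probability(1.0, 0.0), 3))  # 0.731
```

A key consequence is that only the difference theta − beta matters, which is what allows items and persons to be placed on a common scale and fit to be tested by conditioning on total scores.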
Reliability coefficients such as Cronbach’s alpha are often used to estimate measurement error and scale precision at the group level [29, 30]. However, using such coefficients to interpret scores at the level of the individual patient is problematic, as is using alpha as an estimate of reliability based on a single survey administration. When the purpose of a screening instrument is to identify individuals above or below some predetermined cut point, the standard error of measurement (SEM) is crucial for assessing a respondent’s location on the scale relative to that cut point [31, 34].
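The relation between a group-level reliability coefficient and individual-level uncertainty can be sketched with the classical-test-theory formula SEM = SD · sqrt(1 − reliability) and the confidence interval it implies around an observed score. The SD and reliability values below are invented for illustration and are not taken from the study:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Classical-test-theory standard error of measurement."""
    return sd * math.sqrt(1.0 - reliability)

def confidence_interval(score: float, sem_value: float, z: float = 1.96):
    """Approximate 95% confidence interval around an individual's observed score."""
    return (score - z * sem_value, score + z * sem_value)

# Hypothetical values: scale SD of 15 points, reliability of 0.84
error = sem(15.0, 0.84)                  # 6.0 points
print(confidence_interval(60.0, error))  # roughly (48.2, 71.8)
```

Even with a reliability of 0.84, which would be considered acceptable at the group level, the interval around an individual's score spans more than 20 points on a 0-100 metric, which is why cut-point decisions need the SEM rather than alpha alone.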
At present, rigorously validated self-report screening instruments that target vulnerable persons are not available. Notwithstanding, considerable political emphasis has been placed on developing methods to identify premorbid vulnerable persons in order to create interventions that can prevent health inequity and morbidity, and improve HRQoL [7, 9, 11]. Hence, there is a demand for validated self-report measures that consist of relevant indicators which can discriminate between vulnerable and non-vulnerable persons in primary care settings [35, 36].
The purpose of this study was to apply Rasch IRT models to assess the psychometric properties and criterion validity of the SQ-33 as applied to a sample of persons consulting primary care physicians in the Northern Jutland county of Denmark.
Material and methods
The data used for this analysis were derived from a previously published study of persons who completed the SQ-33 in 1998–99 in the North Jutland County of Denmark. The study included 2056 persons who paid a visit to their GP for any reason, of whom 1512 (73%) were women. The age within the sample ranged from 20 to 44 years. Of the sample, 495 persons who experienced seven or more problems on the SQ-33 survey were randomised to a 1-h preventive consultation with their GP (and a 20-min follow-up within three months), or to no preventive health consultation. Of the 495 eligible respondents, 364 persons (74%) completed a 1-year follow-up survey (180 persons from the intervention group and 184 from the control group). Data from the baseline survey (n = 2056) were used for the analysis of the measurement properties of the SQ-33 in order to establish the scaling properties for the full range of subjects.
A graphical loglinear Rasch model (GLLRM) was fitted to the proposed subscales. The item screening procedure described by Kreiner and Christensen (2011) was used to identify subscales with adequate fit. Overall model fit was evaluated using Andersen’s conditional likelihood ratio (CLR) test, which assesses measurement invariance across groups defined by total score, gender, and age. Individual item fit was assessed by comparing observed and expected item rest-score correlations [28, 38, 39] and by conditional versions of the infit and outfit item fit statistics [28, 38]. A thorough technical description of the criteria used to determine item fit is presented in Kreiner and Christensen (2011). Measurement precision was evaluated using the standard error of measurement (SEM) for the derived scales [33, 40]. The measurement models resulting from the procedures described above were tested in the 1-year follow-up data, and the predictive validity of the original 33-item version and the reduced version were compared. Predictive validity here refers to the level of correlation between summated domain scores and the anchor item, assessed by studying associations between changes in the domain scores and changes in an anchor item evaluating general health. The association was calculated as partial Spearman rank correlations adjusting for baseline values of the anchor item. Change scores were calculated as standardized effect sizes (ES), i.e., the difference between baseline and follow-up scores relative to the standard deviation of baseline scores [41, 42].
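The standardized effect size described above can be sketched as follows. The scores are invented for illustration and the function name is an assumption, not taken from the study:

```python
import statistics

def standardized_effect_size(baseline, follow_up):
    """Standardized ES: mean change divided by the SD of the baseline scores."""
    mean_change = statistics.mean(f - b for b, f in zip(baseline, follow_up))
    return mean_change / statistics.stdev(baseline)

# Hypothetical domain scores (0-100 metric) for five respondents
baseline = [40.0, 50.0, 60.0, 55.0, 45.0]
follow_up = [45.0, 52.0, 66.0, 60.0, 52.0]
print(round(standardized_effect_size(baseline, follow_up), 2))  # 0.63
```

Dividing by the baseline SD (rather than the SD of the change scores) expresses the mean change in baseline standard-deviation units, which makes effect sizes comparable across domains with different raw scales.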
In order to facilitate comparison of scores across domains, a linear transformation to a zero to 100 scale was used to report domain scores and standard error of measurement.
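Such a linear 0-100 transformation can be sketched as below. The item count and scoring range are hypothetical examples, not the actual SQ-33 scoring:

```python
def to_0_100(raw: int, raw_min: int, raw_max: int) -> float:
    """Linearly transform a raw summated score to a 0-100 metric."""
    return 100.0 * (raw - raw_min) / (raw_max - raw_min)

# Hypothetical subscale: six items scored 0-3, so raw scores run from 0 to 18
print(to_0_100(9, 0, 18))   # 50.0
print(to_0_100(18, 0, 18))  # 100.0
```

Because the transformation is linear, it preserves score ordering and relative distances; it only rescales the metric so that domain scores and SEM values can be compared across subscales with different numbers of items.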
GLLRM analyses were performed using the software program DIGRAM.
Results

Personal Resources

The GLLRM item screening procedure for all nine items (items 1 to 9) indicated massive evidence of positive LD, and no model with satisfactory fit could be identified. After deleting items 5, 8, and 9, a relatively parsimonious GLLRM (top left panel in Fig. 1) with adequate overall model fit was identified (Table 1).
Individual item fit statistics for the GLLRM analysis are shown in Table 2. Measurement error quantified as the standard error of measurement (SEM) is shown in Fig. 2. The figure illustrates how measurement error for an individual patient can approach 10 points on the zero to 100 scale and differs slightly across gender and age groups.
Lifestyle

For the eight lifestyle items, the GLLRM item screening procedure could not identify a model with satisfactory fit for all items. After deletion of items 11, 13, and 15, a GLLRM was identified (Fig. 1, top right panel). Overall model fit was acceptable (Table 1). Individual item fit statistics are shown in Table 2. The SEM for the GLLRM scale shows that measurement error is extremely large across all age and gender groups (Fig. 3).
Family Life

The item screening did not identify a model with satisfactory fit for the original 10-item scale. After removal of items 18, 19, 20, and 21, a scale consisting of six items was identified (Fig. 1, bottom left panel). Overall model fit was acceptable (Table 1). Individual item fit statistics are shown in Table 2. The SEM for the GLLRM scale shows that measurement error is very large across all age and gender groups (Fig. 4).
Relationship with child/children
The GLLRM item screening procedure indicated that a scale for all six items had satisfactory fit (Table 1). The SEM is shown in Fig. 5 and demonstrates that measurement error is also very large across all age and gender groups.
GLLRMs for the four scales identified in the baseline data were confirmed in the follow-up data (results not shown). The standardized effect sizes for the original and the revised scales are of the same magnitude for all four domains. However, the revised scales consistently show stronger associations with change in the anchor item, as shown together with the fit statistics in Table 3.
Discussion

The most significant finding of this study is that four subscales comprising twenty-three items satisfy the constraints of the GLLRM and can thus be used to measure their underlying constructs in the targeted population. This is encouraging from a measurement perspective. The Resources scale suitably addresses general health and wellbeing and reflects a person’s ability to tackle psychological challenges in daily life, which makes sense in terms of content relevance and validity. The GLLRM indicated a unidimensional scale that compensates for some degree of DIF and LD. This implies a reliable, internally consistent, and construct-valid measure of the latent variable of Resources for use in population-based studies. DIF by sex for item 3 indicates that men interpret the item differently than women (i.e., a different awareness of how to improve health), which is a source of confounding. The fact that men and women respond differentially may warrant DIF equating, as described by Brodersen et al. (2006), in order to quantify the level of discrepancy between genders and compensate for it in future studies across groups.
Figures 2, 3, and 4 demonstrate how measurement error distributed across the scale is problematic at the level of an individual respondent, as the SEM varies substantially along the scale, reaching peak values in the midrange that are at least twice the magnitude of those at the low and high ends of the scale. While this is common, it has implications for the interpretation of individual scores, as confidence intervals expand with increasing SEM. Hence, depending on an individual’s sum score (i.e., the person’s location on the scale), the uncertainty around the score can differ substantially, which jeopardizes conclusions based on cut points.
Ten of the 33 items did not fit a Rasch model and were removed from the a priori proposed subscales. There can be different reasons for this misfit. For example, two items belonged to the domain of Lifestyle and addressed substance abuse (tobacco and addictive drugs). A potential reason could be that persons with substance abuse are reluctant to respond to questions addressing abuse, or the ability to discern the level and influence of the abuse might be distorted by that very abuse or by denial. The theme of substance abuse is also captured by items 22 and 23 (alcohol and drugs) in the domain of Family Life, so the content does not disappear from the instrument.
It must be noted that items that misfit should be removed from subscales, as they do not contribute to the scaling properties. However, items can always be retained as single items. Thus, information concerning for example a patient’s perceived need to use tobacco on a daily basis can be kept as a single item (and not hidden away inside a scale that possibly measures something else). It is thus the practitioner’s prerogative to use the single item for qualitative assessment.
Other reasons for misfit could stem from local dependence. For example, the GLLRM revealed LD between item 1 (sense of general health) and item 2 (feel well enough to do what you like). This makes sense, as both items address general health. The LD between item 3 (knowledge about health) and item 4 (feel appreciated by those you see every day) is less intuitive, as the two items do not obviously address the same topic. It may indicate that cohabitation with family and proximity to friends and colleagues can influence health literacy and self-efficacy.
Poor scaling properties can also be due to the phrasing of the question or the response options. For example, items 18–27 are dichotomous (yes/no). While six of these items in fact formed a Rasch scale (items 22–27), dichotomous items may lack the nuances that an ordinal response structure captures. Respondents must have adequate response options in order to meaningfully address the item themes. Such qualitative issues can be tackled in face-to-face interviews with the target group in future explorative studies.
A weakness of this study is that follow-up data were obtained for just 495 of the original 2056 persons (of whom 364 participated in follow-up), because personal identification numbers were registered only for patients with seven or more problems on the SQ-33 in the original data collection. The rationale behind including persons with seven or more problems on the SQ-33 for follow-up stems from an a priori assumption by Freund and Lous (2012) that these persons could be classified as ‘vulnerable’ [10, 12, 13]. Excluding the majority of subjects can introduce bias if the scale is not psychometrically sound. However, the measurement properties of the reduced scales were tested and confirmed in the available follow-up data, and the predictive validity of the revised version was confirmed. Thus, we can conclude that the HSQ-23 performs better as a psychometric instrument than the SQ-33 and is responsive to clinical change (as seen by the standardized effect sizes in Table 3).
Conclusion

Rasch IRT models were used to assess the psychometric properties of the four subscales of the SQ-33 screening questionnaire as applied to persons from the region of Northern Jutland in Denmark. A 23-item version was found to possess adequate psychometric properties, and anchor-based criterion validation showed responsiveness to clinical change. The revised instrument, the Health Screening Questionnaire 23 (HSQ-23), is appropriate for monitoring the constructs of Resources, Lifestyle, Family Life, and Relationship with children. These scales can be used for outcome assessment in studies of preventive interventions. Whether the scales possess predictive value for specific types of morbidity and mortality is a topic for future investigation.
Abbreviations

CLR: Conditional likelihood ratio
DIF: Differential item functioning
GLLRM: Graphical loglinear Rasch model
HSQ: Health Screening Questionnaire
IRT: Item response theory
PHC: Preventive health checks
SEM: Standard error of measurement
Gotzsche, P. C., Jorgensen, K. J., & Krogsboll, L. T. (2014). General health checks don't work. BMJ, 348, g3680.
Krogsboll, L. T., Jorgensen, K. J., & Gotzsche, P. C. (2013). General health checks in adults for reducing morbidity and mortality from disease. JAMA, 309, 2489–2490.
Brodersen, J., Jorgensen, K. J., & Gotzsche, P. C. (2010). The benefits and harms of screening for cancer with a focus on breast screening. Pol Arch Med Wewn, 120, 89–94.
Brodersen, J., Thorsen, H., McKenna, S., & Doward, L. (2005). Assessing psychosocial/quality of life outcomes in screening: How do we do it better? J Epidemiol Community Health, 59, 609.
McCaffery, K. J., Jansen, J., Scherer, L. D., Thornton, H., Hersch, J., Carter, S. M., et al. (2016). Walking the tightrope: Communicating overdiagnosis in modern healthcare. BMJ, 352, i348.
Sundhedsstyrelsen. [Inequality in health- reasons and efforts] Ulighed i Sundhed - årsager og indsatser. Copenhagen-DK Denmark: Sundhedsstyrelsen; 2011.
Bender, A. M., Jorgensen, T., Helbech, B., Linneberg, A., & Pisinger, C. (2014). Socioeconomic position and participation in baseline and follow-up visits: The Inter99 study. Eur J Prev Cardiol, 21, 899–905.
Engberg, M., Christensen, B., Karlsmose, B., Lous, J., & Lauritzen, T. (2002). General health screenings to improve cardiovascular risk profiles: A randomized controlled trial in general practice with 5-year follow-up. J Fam Pract, 51, 546–552.
Niederdeppe, J., Fiore, M. C., Baker, T. B., & Smith, S. S. (2008). Smoking-cessation media campaigns and their effectiveness among socioeconomically advantaged and disadvantaged populations. Am J Public Health, 98, 916–924.
Freund, K. S., & Lous, J. (2002). Potentially marginalized 20-44-years-olds in general practice. Who are they? The results of a questionnaire screening. Ugeskr Laeger, 164, 5367–5372.
Freund, K. S., & Lous, J. (2012). The effect of preventive consultations on young adults with psychosocial problems: A randomized trial. Health Educ Res, 27, 927–945.
Hansen, E., Fonager, K., Freund, K. S., & Lous, J. (2014). The impact of non-responders on health and lifestyle outcomes in an intervention study. BMC Res Notes, 7, 632.
Diderichsen, F., Andersen, I., & Manuel, C. (2011). Ulighed i Sundhed - aarsager og indsatser [Inequality in health - causes and interventions]. Copenhagen, 2011.
Andrich, D. (1988). Rasch models for measurement. Newbury Park, CA: Sage Publications.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. Statistical Theories of Mental Test Scores. Reading, Mass. Addison-Wesley.
Irtel, H., & Schmalhofer, F. (1981). Psychological diagnosis from ordinal scale levels: Measurement theory principles, model test and parameter estimation. Arch Psychol (Frankf), 134, 197–218.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish National Institute for Educational Research.
Christensen, K. B., & Kreiner, S. (2007). A Monte Carlo approach to unidimensionality testing in polytomous Rasch models. Appl Psych Meas, 31, 20–30.
Andrich, D. (2002). Understanding resistance to the data-model relationship in Rasch's paradigm: A reflection for the next generation. J Appl Meas, 3, 325–359.
Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible paradigms? Med Care, 42, I7–16.
Wolfe, F., & Kong, S. X. (1999). Rasch analysis of the Western Ontario MacMaster questionnaire (WOMAC) in 2205 patients with osteoarthritis, rheumatoid arthritis, and fibromyalgia. Ann Rheum Dis, 58, 563–568.
Brodersen, J., Meads, D. M., Kreiner, S., Thorsen, H., Doward, L., & McKenna, S. P. (2007). Methodological aspects of differential item functioning in the Rasch model. J Med Econ, 10, 309–324.
Comins, J. D., Krogsgaard, M. R., Kreiner, S., & Brodersen, J. (2013). Dimensionality of the knee numeric-entity evaluation score (KNEES-ACL): A condition-specific questionnaire. Scand J Med Sci Sports, 23, e302–e312.
Kreiner, S. (2007). Validity and objectivity: Reflections on the role and nature of Rasch models. Nordic Psychology, 59, 268–298.
Kreiner S. Rasch models: validity, sufficiency and – in principle – objectivity http://www.bmj.com/content/346/bmj.f232/rr/637148. BMJ rapid response. 2013.
Kreiner, S., & Christensen, K. B. (2004). Analysis of local dependence and multidimensionality in graphical loglinear Rasch models. Communications in Statistics, 33, 1276.
Kelderman, H. (1984). Loglinear Rasch model tests. Psychometrika, 49, 223–245.
Kreiner, S., & Christensen, K. B. (2011). Item screening in graphical loglinear Rasch models. Psychometrika, 76, 228–256.
Cronbach, L. J., & Shavelson, R. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educ Psychol Meas, 64, 391–418.
Tavakol, M., & Dennick, R. (2011). Making sense of Cronbach's alpha. Int J Med Educ, 2, 53–55.
Harvill, L. M. (1991). Standard error of measurement. Instructional Topics in Educational Measurement, 10, 33–41.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika, 74, 107–120.
Dierick, F., Aveniere, T., Cossement, M., Poilvache, P., Lobet, S., & Detrembleur, C. (2004). Outcome assessment in osteoarthritic patients undergoing total knee arthroplasty. Acta Orthop Belg, 70, 38–45.
Wilson, M. (2005). Constructing measures. London: Lawrence Erlbaum Associates.
Godlee, F. (2012). Outcomes that matter to patients. BMJ.
Streiner, D. L., & Norman, G. R. (2008). Health measurement scales: A practical guide to their development and use. Oxford: Oxford University Press.
Andersen, E. B. (1973). Goodness of fit test for Rasch model. Psychometrika, 38, 123–140.
Kreiner, S. (2011). A note on item-rescore association in Rasch models. Appl Psychol Meas, 35, 557–561.
Kreiner, S., & Nielsen, T. (2007). Item analysis in Digram - notes on the use of DIGRAM for item analysis by graphical loglinear Rasch models. Department of Biostatistics - University of Copenhagen.
Kreiner, S., & Christensen, K. B. (2013). Person Parameter Estimation and Measurement in Rasch Models. Rasch Models in Health Hoboken, NJ (pp. 63–78). USA: John Wiley & Sons, Inc..
Beauchamp, M. K., Jette, A. M., Ward, R. E., Kurlinski, L. A., Kiely, D., Latham, N. K., et al. (2015). Predictive validity and responsiveness of patient-reported and performance-based measures of function in the Boston RISE study. J Gerontol A Biol Sci Med Sci, 70, 616–622.
Wiebe, S., Guyatt, G., Weaver, B., Matijevic, S., & Sidwell, C. (2003). Comparative responsiveness of generic and specific quality-of-life instruments. J Clin Epidemiol, 56, 52–60.
Walvoort, S. J., van der Heijden, P. T., Kessels, R. P., & Egger, J. I. (2016). Measuring illness insight in patients with alcohol-related cognitive dysfunction using the Q8 questionnaire: A validation study. Neuropsychiatr Dis Treat, 12, 1609–1615.
Dean, A. C., Kohno, M., Morales, A. M., Ghahremani, D. G., & London, E. D. (2015). Denial in methamphetamine users: Associations with cognition and functional connectivity in brain. Drug Alcohol Depend, 151, 84–91.
JDC and KSF received support from the Danish General Practice Research Foundation (PLU 541305 and PLU – FF-2-01-284 respectively).
Availability of data and materials
The datasets used and analyzed during the current study are available from the corresponding author upon reasonable request.
Ethics approval and consent to participate
The study complied fully with Danish data and research regulations and permissions. The Danish ethics committee waives the need for ethical approval in survey studies. Written informed consent was obtained from all respondents upon inclusion. The study was registered with ClinicalTrials.gov (registration: NCT01231256).
Competing interests

The authors declare that they have no competing interests.
Comins, J.D., Freund, K.S., Christensen, K.B. et al. Validation of a health screening questionnaire for primary care using Rasch models. J Patient Rep Outcomes 3, 12 (2019). https://doi.org/10.1186/s41687-019-0104-7