The MBI has informed much of the current US health policy discourse surrounding the physician burnout crisis and continues to be the most widely used outcome assessment to monitor physician burnout prevalence at organizational and national levels [4–6, 8, 9, 23]. However, to our knowledge, no studies have used IRT to improve what is known about its psychometric properties in a national sample of physicians. In this study, we used IRT to better understand the meaning and precision of MBI subscale scores in US physicians. After calibrating each MBI subscale, we described the burnout symptom severity represented by each subscale item; created response profiles describing the probability that a US physician endorses each item at a frequency of once weekly or more across standardized, IRT-based subscale scores; and mapped IRT-based subscale scores to raw MBI subscale scores. As an example of their utility, we used the crosswalks and response profiles to interpret the meaning of mean scores and commonly used cut-points for defining dichotomous EE, DP, and PA outcomes. These crosswalks can also be used to compare groups’ (and, for the EE subscale, individuals’) scores on each metric relative to the average level of each construct in a US physician reference population.
This analysis revealed several important findings regarding the burnout symptom burden experienced by the average US physician and represented by commonly used cut-points. The average US physician is likely to experience several EE symptoms once weekly or more due to work, including feeling emotionally drained, used up, frustrated, and working too hard; is unlikely to experience any symptoms of DP once weekly or more; and is likely to experience all indicators of PA once weekly or more. At respective EE, DP, and PA cut-points of 27, 10, and 33, a physician is likely to endorse the same EE symptoms as a physician with a mean score but is unlikely to report feeling burned out from work once weekly or more; is unlikely to experience any DP symptoms once weekly or more (or even “a few times a month” or more); and is likely to experience most indicators of PA (including feeling accomplished) once weekly or more. If a physician’s endorsement of particular symptoms on each subscale is central to the definitions of dichotomous EE, DP, and PA outcomes, then our response profiles can be used to define the raw score cut-points at which physicians are likely to report a particular EE, DP, and low PA burden. For example, if feeling “burned out from work”, experiencing ≥ 1 symptom of DP, and not feeling professionally accomplished at least once weekly are central to the definitions of dichotomous EE, DP, and PA outcomes, respectively, then our findings suggest that raw score cut-points of ≥ 31, ≥ 14, and ≤ 29 should be used on the respective EE, DP, and PA subscales. These cut-points correspond with the scores at which a physician would have a > 50% chance of endorsing feeling burned out and ≥ 1 symptom of DP, and a < 50% chance of endorsing feeling accomplished at work, once weekly or more. These cut-points also correspond with EE, DP, and PA levels that are 0.27 SDs above, 0.78 SDs above, and 1.22 SDs below the mean of US physicians, respectively.
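Under the graded response model used for calibration, the probability that a physician at a given IRT z-score endorses an item at a target frequency or more follows a logistic curve. The sketch below (with hypothetical discrimination and threshold values, not the estimates in Table 2) illustrates why a cut-point maps to the z-score at which this probability crosses 0.50:

```python
import math

def prob_at_least(theta, a, b):
    """P(endorse item at the target frequency or more | theta) under the
    graded response model: a = discrimination, b = boundary threshold."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def theta_at_probability(a, b, p):
    """Invert the logistic curve: the z-score at which the endorsement
    probability equals p."""
    return b + math.log(p / (1.0 - p)) / a

# Hypothetical parameters for one EE item (NOT the Table 2 estimates):
a, b = 2.0, 0.27  # a threshold 0.27 SDs above the US-physician mean

# At theta = b the probability is exactly 0.50, so any z-score above b
# implies a > 50% chance of endorsing the item once weekly or more.
print(prob_at_least(b, a, b))        # 0.5
print(prob_at_least(b + 1.0, a, b))  # > 0.5
```

The same inversion underlies the crosswalk logic: choosing a symptom and a response probability criterion fixes a z-score, which the crosswalk then maps to a raw subscale score.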
Importantly, when burnout is defined as a high score on the EE and/or DP subscale, use of these content-referenced cut-points would lower the estimated national prevalence of physician burnout in 2014 from 54.4% to approximately 43.3% (2709/6474) [4, 5].
Our analyses of the MBI’s precision bandwidths demonstrated that each subscale assesses the majority of physicians’ scores with ≥ 0.70 reliability. However, the EE and DP subscales lack adequate precision to assess the scores of physicians reporting the very highest EE and DP levels on each metric. Analysis of the PA subscale also revealed that it is most precise at assessing below-average levels of PA (arguably where precision matters most, given that low PA is a symptom of burnout) and lacks precision at assessing above-average levels of PA. Further, while researchers have stated that the MBI can be used for individual-level outcome measurement [2, 24], only the EE subscale showed adequate reliability for individual-level measurement. These findings highlight that each metric does not measure all physicians’ scores with equal precision—outside the score ranges possessing ≥ 0.70 and ≥ 0.90 reliability, these scales have inadequate precision to assess between-group and within-individual differences, respectively. Adding items to each subscale could improve its reliability.
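The link between the 0.70 and 0.90 reliability thresholds and test information can be made explicit. For a latent trait scaled to variance 1 (as in our z-score metric), conditional reliability at a score is I/(I + 1), since the error variance is SE² = 1/I. A minimal sketch:

```python
def reliability_from_info(info):
    """Conditional reliability for a latent trait scaled to variance 1:
    rel = I / (I + 1), because the error variance is SE^2 = 1 / I."""
    return info / (info + 1.0)

def info_for_reliability(rel):
    """Test information needed to reach a target reliability."""
    return rel / (1.0 - rel)

# The two thresholds discussed above:
print(info_for_reliability(0.70))  # ~2.33 (SE ~ 0.65): between-group use
print(info_for_reliability(0.90))  # 9.0 (SE ~ 0.33): individual-level use
```

This also shows why individual-level measurement is the harder standard: reaching 0.90 reliability requires nearly four times the information needed for 0.70.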
Strengths and limitations
This is, to our knowledge, the first study to calibrate the MBI in a national sample of US physicians and to create IRT-based response profiles mapped to raw scores. A strength of this study is that it allows investigators to classify physicians’ scores into discrete burnout outcome groups relative to 1) whether their score meets or exceeds a particular symptom burden represented by the items and 2) the mean score of a US physician reference sample. This is particularly important in the absence of a gold-standard criterion for burnout. It is also important given that the original cut-points for defining dichotomous outcomes on each subscale (examined herein) were selected by identifying the score corresponding with the third tercile in a large occupational sample. As the scale developers and others have noted, such a purely distributional approach can result in somewhat arbitrary cut-points [24, 25]. The use of content-referenced score interpretations as a complement to norm-referenced interpretations, as made possible through this study, addresses this shortcoming.
This study has several limitations. The burnout symptoms assessed by the MBI are continuous constructs, and it is important to treat scores as such where possible. Notwithstanding, the MBI’s use in research to classify physicians into burned-out versus non-burned-out groups continues to influence healthcare policy and practice [6, 26]. Therefore, identifying the symptom burden associated with various cut-points has value. This study aims not to define new cut-points but to elucidate the meaning of the cut-points used to define physician burnout outcomes on MBI subscales, such that when reports state that “X%” of physicians are “burned out,” we have a better probabilistic understanding of what symptom burden that represents.
The selection of appropriate cut-points is a multi-attribute decision that depends critically on factors such as the intended purpose of assessment, the profile of burnout symptoms that are most probable at the cut-points, and consensus among investigators regarding what symptom burden matters for the purpose(s) of the assessment. This includes answering questions such as: which symptoms and symptom frequencies define burnout on each subscale, and what response probability criterion should be used to define whether a physician is likely or unlikely to report a burnout symptom? Our response profiles indicate the probability of item endorsement at a frequency of once weekly or more, because this frequency has previously been used to define burnout in national studies [5, 15, 16], but a different symptom frequency may be of interest. In this case, investigators can use the item parameter estimates (Table 2) to identify probable responses at different frequencies (see also Supplemental Appendix 4 for plotted cumulative probability curves describing the probability of a physician endorsing each subscale item at a frequency of a few times a month or more across IRT z-scores). Further, we used a response probability criterion of > 0.50 to define whether a physician is likely to endorse each item; however, a higher probability criterion (e.g., ≥ 0.67) may be desired.
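Given a set of item parameter estimates, the probability of each response frequency at a given z-score follows from differencing adjacent cumulative curves of the graded response model. The sketch below uses made-up parameters (the calibrated values are in Table 2, and the corresponding curves in Supplemental Appendix 4):

```python
import math

def cum_prob(theta, a, b):
    """Cumulative probability of responding at or above a frequency boundary."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def category_probs(theta, a, thresholds):
    """Graded-response-model probability of each of the 7 MBI frequency
    categories ('never' ... 'every day'); thresholds sorted ascending."""
    cums = [1.0] + [cum_prob(theta, a, b) for b in thresholds] + [0.0]
    return [cums[k] - cums[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical discrimination and 6 boundary thresholds (NOT Table 2 values):
a = 1.8
thresholds = [-2.0, -1.2, -0.4, 0.3, 1.1, 2.0]

# Full response-frequency profile for the average US physician (theta = 0):
probs = category_probs(0.0, a, thresholds)
print([round(p, 3) for p in probs])

# 'Once a week or more' is the cumulative curve at the 4th boundary; with a
# stricter criterion (>= 0.67 rather than > 0.50), a higher z-score is
# required before a physician counts as likely to endorse the item.
print(cum_prob(0.0, a, thresholds[3]))
```

Swapping in the Table 2 estimates and a different boundary (e.g., “a few times a month or more”) reproduces the alternative-frequency profiles described above.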
Definitions of what symptom burden matters should also consider the relationship of a particular cut-point with external criteria. That is, what are the sensitivity and specificity of a particular cut-point with respect to important physician health and performance outcomes? To our knowledge, this has yet to be evaluated. Cut-points derived solely from content- and norm-referenced approaches may not be those at which sensitivity and specificity are maximized for a particular outcome. The optimal cut-point should be selected based on an evaluation of the costs and benefits of decisions resulting from its use to classify physicians into outcome groups (a property of context, not of the subscales themselves) [27, 28]. For example, the costs and benefits of particular subscale cut-points for defining national physician burnout prevalence may differ substantially from those associated with identifying which physicians should receive an intervention. While cut-points may vary depending on context, there is a need for consistency in the cut-points used across studies when the purpose of assessment is estimating burnout prevalence. Our findings can be used to inform consensus standards for defining outcome categories (e.g., burned out vs. not burned out; low, moderate, or high symptoms) on each subscale for this purpose. However, this study does not address which subscales should enter the definition (e.g., EE and/or DP versus EE, DP, and PA), a question that has also contributed to wide variation in prevalence estimates.
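Such a criterion-based evaluation would reduce, for each candidate cut-point, to tabulating sensitivity and specificity against the external outcome. A sketch with invented toy data (no validated external criterion for MBI cut-points exists yet, as noted above):

```python
def sens_spec(scores, outcomes, cut):
    """Sensitivity and specificity of classifying score >= cut as positive
    against a binary external criterion (1 = criterion met). Illustrative
    only; the data below are fabricated for the example."""
    tp = sum(1 for s, y in zip(scores, outcomes) if s >= cut and y == 1)
    fn = sum(1 for s, y in zip(scores, outcomes) if s < cut and y == 1)
    tn = sum(1 for s, y in zip(scores, outcomes) if s < cut and y == 0)
    fp = sum(1 for s, y in zip(scores, outcomes) if s >= cut and y == 0)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec

# Toy raw EE scores and a hypothetical criterion outcome:
scores = [12, 20, 27, 31, 35, 40, 44, 50]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
print(sens_spec(scores, outcomes, 27))
print(sens_spec(scores, outcomes, 31))
```

Comparing candidate cut-points this way, weighted by the costs of false positives and false negatives in the intended context, is the evaluation the text calls for.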
When using our crosswalk to interpret an individual’s or group’s score relative to its distance from the mean, it should be noted that comparisons will be relative to the mean EE, DP, and PA levels reported in this sample. While early- and late-responder analyses by Shanafelt et al. support the demographic representativeness of the sample, it is possible that the mean EE, DP, and PA levels in this calibration sample are not representative of those in the population. Findings from this study also cannot be assumed to generalize to non-physician populations (e.g., nurses). That is, it cannot be assumed that the symptom burden represented by the cut-points in this study has the same meaning in a non-physician sample without further research. Such research would need to place item responses from both groups onto the same metric and determine that items function invariantly across physician and non-physician workers before raw scores can be assumed to represent the same symptom burden across groups.
It should be noted that the precision of each MBI subscale as implied by the crosswalks (Table 3) differs slightly from the precision reported by each TIF (Fig. 1), owing to differences in how the standard error is estimated (the standard deviation of the posterior distribution and the square root of the inverse of the Fisher expected information, respectively). The use of each crosswalk requires complete responses on the corresponding MBI subscale. Finally, in the original study, item DP2 was slightly revised from the original MBI item: the phrase “since I took this job” was removed from “I’ve become more callous toward people since I took this job”.
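The two standard-error definitions can be contrasted directly. In this sketch, a single hypothetical boundary (a = 2, b = 0) with a standard normal prior stands in for a full subscale; neither the parameters nor the quadrature settings are the study's estimation code:

```python
import math

def cum(theta, a, b):
    """Probability of endorsing the boundary at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """Fisher information of the boundary: a^2 * p * (1 - p)."""
    p = cum(theta, a, b)
    return a * a * p * (1.0 - p)

def eap_sd(likelihood, grid):
    """SD of the posterior over a quadrature grid with a standard normal
    prior: the standard-error definition behind the crosswalk (Table 3)."""
    prior = [math.exp(-t * t / 2.0) for t in grid]
    post = [l * pr for l, pr in zip(likelihood, prior)]
    z = sum(post)
    mean = sum(t * w for t, w in zip(grid, post)) / z
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, post)) / z
    return math.sqrt(var)

# A respondent who endorses the boundary:
grid = [i / 10.0 for i in range(-40, 41)]
like = [cum(t, 2.0, 0.0) for t in grid]

se_posterior = eap_sd(like, grid)               # crosswalk-style SE
se_fisher = 1.0 / math.sqrt(item_info(0.0, 2.0, 0.0))  # TIF-style SE
print(se_posterior, se_fisher)
```

Because the posterior SD incorporates the prior while the information-based standard error does not, the two values differ, which is why the crosswalk and the TIF imply slightly different precision.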