Describing the emotional exhaustion, depersonalization, and low personal accomplishment symptoms associated with Maslach Burnout Inventory subscale scores in US physicians: an item response theory analysis

Purpose Current US health policy discussions regarding physician burnout have largely been informed by studies employing the Maslach Burnout Inventory (MBI); yet, there is little in the literature focused on interpreting MBI scores. We described the burnout symptoms and precision associated with MBI scores in US physicians. Methods Using item response theory (IRT) analyses of secondary, cross-sectional survey data, we created response profiles describing the probability of burnout symptoms associated with US physicians’ MBI emotional exhaustion (EE), depersonalization (DP), and personal accomplishment (PA) subscale scores. Response profiles were mapped to raw subscale scores and used to predict symptom endorsements at mean scores and commonly used cut-points. Results The average US physician was likely to endorse feeling he/she is emotionally drained, used up, frustrated, and working too hard and all PA indicators once weekly or more but was unlikely to endorse feeling any DP symptoms once weekly or more. At the commonly used EE and DP cut-points of 27 and 10, respectively, a physician was unlikely to endorse feeling burned out or any DP symptoms once weekly or more. Each subscale assessed the majority of sample score ranges with ≥ 0.70 reliability. Conclusions We produced a crosswalk mapping raw MBI subscale scores to scaled scores and response profiles calibrated in a US physician sample. Our results can be used to better understand the meaning and precision of MBI scores in US physicians; compare individual/group MBI scores against a reference population of US physicians; and inform the selection of subscale cut-points for defining categorical physician burnout outcomes.


Introduction
Current US health policy discussions surrounding the physician burnout crisis have largely been informed by prevalence studies employing the Maslach Burnout Inventory-Human Services Survey for Medical Personnel (MBI) [1][2][3][4][5][6][7][8][9]. While the MBI is the most widely used physician burnout outcome assessment, a recent systematic review found a lack of consistency in cut-points used to define dichotomous burnout outcomes on each continuous MBI subscale [8], contributing to a marked heterogeneity in reported burnout prevalences across studies.
One contributor to the observed inconsistencies in defining dichotomous burnout outcomes on the MBI may be the lack of clarity regarding the meaning of subscale scores. Traditional measurement methods do not permit users to directly compare subscale scores with the content of items to interpret their meaning. The use of item response theory (IRT) measurement methods can facilitate an enhanced understanding of subscale scores over traditional methods [10,11]. Using IRT to estimate physicians' probability of endorsing MBI subscale items across different burnout symptom severity levels, scores can be interpreted based on how likely a physician is to endorse a particular item (e.g., "I feel burned out from my work") at a particular frequency (e.g., "once a week" or more) and relative to the mean score of the sample (i.e., content-referenced and norm-referenced scoring, respectively). IRT analyses are routinely used in health outcome measurement and are part of the NIH Patient Reported Outcome Measurement Information System (PROMIS) scientific standards for health outcome measurement development and validation [12]. However, no studies have applied IRT methods to evaluate the MBI in a national sample of US physicians.
In this study, we leveraged the content-referenced and norm-referenced score interpretation of IRT-calibrated (estimated) models to better understand the meaning of MBI subscale scores in a national US physician sample. Our primary aim was to create response profiles describing the probability of burnout symptoms across standardized MBI subscale scores in US physicians. We produced a crosswalk mapping raw (total) MBI subscale scores to scaled (IRT-based) scores and associated response profiles. As a secondary aim, we evaluated the precision bandwidth of each MBI subscale relative to where US physicians' scores are distributed on each metric.

Design and sample
This study used secondary survey data on the 22-item MBI from the 2014 wave of the anonymous [1,2], cross-sectional study conducted by Shanafelt et al. (2015) to monitor the national prevalence of physician burnout [4]. Participants were sampled via email from the American Medical Association Physician Master File. Further sampling design details are published in Shanafelt et al. (2015) [4]. From this dataset, we excluded physicians who were retired or not in practice in the US at the time of the survey.

Measures
The MBI is a measure of job burnout defined by three subscales: emotional exhaustion (EE) (9 items), depersonalization (DP) (5 items), and personal accomplishment (PA) (8 items), each with a 7-point, Likert-type frequency response scale (0 = never, 1 = a few times a year or less, 2 = once a month or less, 3 = a few times a month, 4 = once a week, 5 = a few times a week, 6 = every day) [1,2]. Scales are scored such that higher scores indicate more of each construct. Higher scores on the EE and DP subscales indicate a higher burnout symptom burden; lower scores on the PA subscale indicate a higher burnout symptom burden.

Statistical analyses
Our analytic approach was informed by the PROMIS scientific standards for instrument development and validation [12].

IRT model calibration
We calibrated IRT models for each MBI subscale using unidimensional, graded response models (GRM) [13]. For each MBI subscale item, the GRM predicted the cumulative probability of responding in a particular item response category or higher (e.g., "once a week" to "every day") as a function of physicians' underlying (latent) burnout symptom levels (i.e., an IRT score (θ)), item threshold parameters ($b_{xj}$), and an item discrimination parameter ($a_j$). Item threshold parameters represent the IRT score at which a randomly selected physician among those with that score would have a cumulative probability of 0.50 of endorsing a particular response category or higher. The mean of the item threshold estimates from each calibrated IRT model describes the burnout symptom severity (item difficulty) represented by each item. IRT scores, item threshold parameters, and item symptom severity values are on a z-score metric (mean = 0, SD = 1). Item discrimination parameters indicate the degree to which an item differentiates between physicians who have high versus low burnout symptom levels (with higher values yielding more scale precision). The GRM assumes that physicians' item responses are a function of one primary, continuous underlying construct (unidimensionality); item responses are independent after controlling for the underlying burnout construct (local independence); and the probability of endorsing successively higher item response categories increases as physicians' underlying burnout symptom levels increase (monotonicity) [14]. Prior to calibrating each IRT model, we evaluated traditional item- and scale-level descriptive statistics; IRT model assumptions; and model-, item-, and person-level fit (Supplemental Appendix 1) [12].
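The cumulative probabilities described above can be illustrated with a minimal sketch of the logistic form of the GRM. The function names and the parameter values below are hypothetical and are not estimates from this study; the sketch also assumes the logistic metric without a scaling constant, which the text does not specify.

```python
import math

def grm_cumulative_probs(theta, a, thresholds):
    """P(X >= x | theta) for x = 1..len(thresholds) under a logistic GRM."""
    return [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]

def grm_category_probs(theta, a, thresholds):
    """P(X = x | theta) for x = 0..len(thresholds), as differences of
    adjacent cumulative probabilities."""
    cum = [1.0] + grm_cumulative_probs(theta, a, thresholds) + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# Hypothetical parameters for one 7-category item (NOT values from Table 2).
a_j = 2.0
b_j = [-2.5, -1.5, -0.6, 0.3, 1.0, 1.9]  # six ordered thresholds

# At theta equal to the fourth threshold, the cumulative probability of
# responding "once a week" or more is exactly 0.50 by definition.
p_weekly_or_more = grm_cumulative_probs(0.3, a_j, b_j)[3]
category_probs = grm_category_probs(0.3, a_j, b_j)
```

The category probabilities sum to one at any θ, and a larger discrimination value makes each cumulative curve steeper around its threshold.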

Response profiles
To describe the severity of burnout symptoms associated with MBI subscale scores, we created response profiles from each calibrated IRT model that predict the cumulative probability that a randomly selected physician endorses each item (i.e., symptom) at a frequency of "once a week" or more (i.e., "once a week," "a few times a week," or "every day") across IRT-based subscale scores. We selected a frequency of "once a week" or more for each response profile as this is commonly used as the frequency for defining burnout in national prevalence studies [5,15,16]. To enable instrument users to interpret individuals'/groups' MBI subscale scores in relation to response profiles, we created a crosswalk mapping raw (total) subscale scores to IRT-based z-scores and associated standard errors (SEs) using expected a posteriori (EAP) sum scoring [17]. The crosswalks and associated response profiles allow instrument users to interpret individuals'/groups' scores in terms of how likely a randomly selected physician among those with a particular score is to endorse each item at a frequency of "once a week" or more. We also present IRT-based t-scores (mean = 50, SD = 10) in each crosswalk. To illustrate how each response profile can be used, we interpreted the response profiles for z-scores at or nearest to the mean and at commonly used cut-points for defining dichotomous burnout outcomes on each subscale (≥ 27, ≥ 10, and ≤ 33 on the EE, DP, and PA subscales, respectively) [8]. In our interpretation of the burnout symptom severity associated with mean subscale scores and commonly used subscale cut-points, we defined an item as likely to be endorsed or not likely to be endorsed if it had a respective > 0.50 or < 0.50 cumulative probability of endorsement (response probability criterion) at a particular z-score.
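As a rough illustration of how a response profile and the t-score rescaling could be computed, the sketch below evaluates the cumulative "once a week" or more probability for a few items at a given z-score and applies the > 0.50 / < 0.50 criterion. The item names and (a, b4) parameter pairs are hypothetical, and the logistic form without a scaling constant is an assumption; this is not the authors' code.

```python
import math

def prob_weekly_or_more(z, a, b4):
    """Cumulative probability of responding "once a week" or more at z."""
    return 1.0 / (1.0 + math.exp(-a * (z - b4)))

def t_score(z):
    """Rescale a z-score (mean 0, SD 1) to a t-score (mean 50, SD 10)."""
    return 50.0 + 10.0 * z

def response_profile(z, items, criterion=0.50):
    """Return {item: (probability, label)} at z-score z."""
    profile = {}
    for name, (a, b4) in items.items():
        p = prob_weekly_or_more(z, a, b4)
        if p > criterion:
            label = "likely"
        elif p < criterion:
            label = "unlikely"
        else:
            label = "indeterminate"  # exactly at the criterion
        profile[name] = (p, label)
    return profile

# Hypothetical items: (discrimination, "once a week" threshold).
items = {"item_A": (2.0, -0.40), "item_B": (1.8, 0.27), "item_C": (1.5, 1.00)}

profile = response_profile(0.07, items)
t_at_cut = t_score(0.07)
```

Because the cumulative curve crosses 0.50 exactly at the "once a week" threshold, an item becomes likely to be endorsed once the z-score exceeds that threshold.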

Precision bandwidth
We used test information functions (TIFs) to evaluate each subscale's precision bandwidth by assessing whether each metric demonstrated adequate reliability for group- and individual-level measurement where sample scores (computed using EAP scoring) are distributed. A TIF describes the precision of a scale across z-scores and is inversely related to a scale's standard error (SE) [14]. Higher information equates to more reliability and lower SE associated with an individual's/group's subscale score. Adequate reliability for group- and individual-level measurement was defined as 0.70 and 0.90, respectively [12].
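The relationship between information, SE, and reliability can be sketched as follows. The conversion reliability = 1 − SE² assumes a latent metric with variance fixed at 1, which is a common convention for z-score-scaled IRT models but is an assumption here; the text does not state the exact formula used.

```python
import math

def se_from_information(info):
    """Standard error implied by test information at a given z-score."""
    return 1.0 / math.sqrt(info)

def reliability_from_information(info):
    """Reliability implied by information, assuming latent variance = 1."""
    return 1.0 - 1.0 / info

def information_for_reliability(rel):
    """Information needed to reach a target reliability (latent variance = 1)."""
    return 1.0 / (1.0 - rel)

info_group = information_for_reliability(0.70)       # about 3.33
info_individual = information_for_reliability(0.90)  # about 10
se_group = se_from_information(info_group)            # about 0.55
se_individual = se_from_information(info_individual)  # about 0.32
```

Under this assumption, the 0.70 and 0.90 reliability thresholds correspond to SEs of roughly 0.55 and 0.32 on the z-score metric.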

Results
The overall sample included 6682 multi-specialty US physicians (Table 1). The majority of the sample were male and non-primary care physicians.

IRT calibration
The final calibrated EE, DP, and PA IRT models (Table 2) achieved adequate model-data fit and met all model assumptions (Supplemental Appendix 2) [22]. However, items DP4, PA2, and PA5 showed a lack of monotonicity across one or more adjacent response category pairs and items EE4 (working with people all day is a real strain) and EE8 (working with people directly puts too much stress on me) showed local dependence. While the former violation can be resolved by collapsing adjacent, non-monotonic item response categories [12], we chose to maintain the original scoring of the subscales and the ability to interpret published subscale scores relative to response profiles. Sensitivity analyses of DP and PA calibrations with and without collapsed item response categories showed minimal differences in item parameter estimates. The latter violation was remedied by summing the EE4 and EE8 items to form one scale (coded 0 to 12). The combined EE4EE8 item was used in the final calibrated EE IRT model in place of the individual items.

Item symptom severity
The least severe burnout symptoms (Table 2) include feeling used up (EE2), feeling emotionally hardened (DP3), and lacking feelings of exhilaration after working closely with patients (PA6). By contrast, the most severe burnout symptoms include feeling that working with people is a real strain/too much stress (EE4EE8), not really caring what happens to some patients (DP4), and not easily understanding how patients feel (PA1).

Response profiles

Emotional exhaustion subscale
A physician scoring approximately at the mean (raw score of 26) on the EE subscale is likely to endorse feeling emotionally drained from work (EE1), used up at the end of the workday (EE2), frustrated by his/her job (EE6), and that he/she is working too hard on his/her job (EE7) at a frequency of once weekly or more (Table 3; see Supplemental Appendix 3 for plotted cumulative probability curves and option response functions). A physician at this latent EE level would, however, be unlikely to report feeling fatigued when getting up and having to face another day on the job (EE3), burned out from work (EE5), that working with people is stressful/straining (EE4EE8), or at the end of his/her rope (EE9) once weekly or more. The commonly used raw score cut-point of 27 on the EE subscale corresponds to a z-score that is 0.07 SDs above the mean EE level of US physicians. At this score, a randomly selected physician would be likely to report feeling the same EE symptoms as a physician scoring at the mean. Endorsing feeling fatigued (EE3), burned out (EE5), at the end of one's rope (EE9), and that working with people is too stressful/straining (EE4EE8) once weekly or more is likely among physicians with z-scores > 0.20, > 0.27, > 1.00, and > 1.57 SDs above the mean, respectively.

Depersonalization subscale
A physician scoring approximately at the mean (raw score of 7) on the DP subscale is unlikely to endorse feeling any depersonalization symptoms (DP1-DP5) once weekly or more. Physicians are also unlikely to endorse any depersonalization symptoms once weekly or more at the commonly used raw score cut-point of 10, which represents a z-score that is 0.38 SDs above the mean DP level of US physicians. Endorsing feeling worried that work is hardening you emotionally (DP3), more callous toward people (DP2), that patients blame you (DP5), that you treat patients as impersonal objects (DP1), and that you don't care what happens to some patients (DP4) once weekly or more is likely among physicians with z-scores > 0.78, > 0.92, > 1.07, > 1.64, and > 2.27 SDs above the mean, respectively.

Personal accomplishment subscale
A physician scoring approximately at the mean (raw score of 42) on the PA subscale is likely to endorse all items (PA1-PA8) at a frequency of once weekly or more. The commonly used raw score cut-point of 33 represents a z-score that is 0.96 SDs below the mean PA level of US physicians. A physician with this score would be likely to endorse feeling he/she: can easily understand how patients feel (PA1); deals very effectively with patient problems (PA2); positively influences other people's lives through work (PA3); can easily create a relaxed atmosphere with patients (PA5); has accomplished many worthwhile things at work (PA7); and deals with emotional work problems very calmly (PA8). A physician with this score would be unlikely, however, to endorse feeling very energetic (PA4) or exhilarated after working closely with patients (PA6) once weekly or more, representing several burnout symptoms. Additional symptoms of low PA are likely among physicians with z-scores below −1.22 (i.e., more than 1.22 SDs below the mean).
Figure 1 presents TIFs plotted against each subscale's sample score distribution. Of the score ranges in which US physicians are distributed, the EE, DP, and PA subscales have adequate reliability for group-level

Notes to Table 2. Higher scores on each scale indicate more of each construct; higher scores on the EE and DP scales indicate more burnout symptoms; lower scores on the PA scale indicate more burnout symptoms. The $a_j$ parameter for the EE, DP, and PA IRT models is the item discrimination parameter estimate, which indicates the degree to which an item discriminates between physicians with high versus low underlying EE, DP, or PA levels. Higher discrimination estimates indicate that the item is more discriminating compared to items with lower discrimination estimates. Item threshold estimates ($b_{1j}$ to $b_{6j}$) indicate the IRT score at which a randomly selected physician among those with that score would have a 50% chance of endorsing the particular response category or a higher response category. For items in each model: $b_{1j}$ = threshold parameter estimate for endorsing "a few times a year or less" or more; $b_{2j}$ = threshold parameter estimate for endorsing "once a month or less" or more; $b_{3j}$ = threshold parameter estimate for endorsing "a few times a month" or more; $b_{4j}$ = threshold parameter estimate for endorsing "once a week" or more; $b_{5j}$ = threshold parameter estimate for endorsing "a few times a week" or more; $b_{6j}$ = threshold parameter estimate for endorsing "every day".
b Item symptom severity is the mean of item threshold parameter estimates (i.e., item difficulty). On the EE and DP subscales, items with lower item symptom severity values indicate that an item is easier to endorse and represents less severe burnout symptoms; higher item symptom severity values indicate that the item is harder to endorse and represents more severe burnout symptoms. On the PA subscale, items with lower symptom severity values indicate an item is harder to endorse and represents more severe burnout symptoms; items with higher symptom severity values indicate the item is easier to endorse and represents less severe burnout symptoms.
c Item parameter estimates and associated SEs for the combined EE4EE8 item included in the EE IRT model are: a = 1.

Notes to Table 3. The burnout symptom burden represented by subscale scores in this table can be interpreted based on the profile of likely item endorsements (i.e., a content-referenced score interpretation) as well as how far above or below the scores are relative to the mean score in a 2014 reference population of US physicians (i.e., a norm-referenced score interpretation).
a A raw score on each subscale refers to the total (or sum) score on each subscale. Bolded rows correspond to IRT scores and response profiles that are at or closest to the mean for each subscale.
b Items EE4 and EE8 were combined into one item (EE4EE8) to meet local independence assumptions. The probabilities shown for the combined EE4EE8 item represent the probability of a physician endorsing that working with people puts too much stress on him/her and/or is a real strain at a frequency of once a week or more at a particular score (i.e., a score of at least 8 on the combined EE4EE8 item). Higher scores on the EE subscale indicate more emotional exhaustion (and a higher burnout symptom burden). An EE item with > 0.50 probability of endorsement indicates a physician is likely to endorse feeling that particular EE symptom at a frequency of once a week or more at a particular score.
c Higher scores on the DP subscale indicate more depersonalization (and a higher burnout symptom burden). A DP item with > 0.50 probability of endorsement indicates a physician is likely to endorse feeling that particular DP symptom at a frequency of once a week or more at a particular score.
d Higher scores on the PA subscale indicate more personal accomplishment (and a lower symptom burden), whereas lower scores on the PA subscale indicate a lower sense of personal accomplishment (and a higher burnout symptom burden). A PA item with < 0.50 probability of endorsement indicates a physician is unlikely to endorse feeling that particular PA indicator at a frequency of once a week or more at a particular score.

Discussion
The MBI has informed much of the current US health policy discourse surrounding the physician burnout crisis and continues to be the most widely used outcome assessment to monitor physician burnout prevalence at organizational and national levels [4-6, 8, 9, 23]. However, to our knowledge, no studies have used IRT to improve what is known about its psychometric properties in a national sample of physicians. In this study, we used IRT to better understand the meaning and precision of MBI subscale scores in US physicians. After calibrating each MBI subscale, we described the burnout symptom severity represented by each subscale item; created response profiles describing the probability that a US physician endorses each item at a frequency of once weekly or more across standardized, IRT-based subscale scores; and mapped IRT-based subscale scores to raw MBI subscale scores. As an example of their utility, we used the crosswalks and response profiles to interpret the meaning of mean scores and commonly used cut-points for defining dichotomous EE, DP, and PA outcomes. These crosswalks can also be used to compare groups' (and for the EE subscale, individuals') scores on each metric relative to the average level of each construct in a US physician reference population. This analysis revealed several important findings regarding the burnout symptom burden experienced by the average US physician and represented by commonly used cut-points.
The average US physician is likely to experience several EE symptoms once weekly or more, including feeling emotionally drained, used up, frustrated, and working too hard due to work; is unlikely to experience any symptoms of DP once weekly or more; and is likely to experience all indicators of PA once weekly or more. At respective EE, DP, and PA cut-points of 27, 10, and 33, a physician is likely to endorse the same EE symptoms that are experienced by a physician with a mean score and is unlikely to report feeling burned out from work once weekly or more; is unlikely to experience any DP symptoms once weekly or more (or even "a few times a month" or more); and is likely to experience most indicators of PA (including feeling accomplished) once weekly or more. If a physician's endorsement of particular symptoms on each subscale is central to the definitions of dichotomous EE, DP, and PA outcomes, then our response profiles can be used to define the raw score cut-points at which physicians are likely to report a particular EE, DP, and low PA burden. For example, if feeling "burned out from work", feeling ≥ 1 symptom of DP, and not feeling professionally accomplished at least once weekly are central to the definitions of dichotomous EE, DP, and PA outcomes, respectively, then our findings suggest that raw score cut-points of ≥ 31, ≥ 14, and ≤ 29 should be used on the respective EE, DP, and PA subscales. These cut-points correspond with the score at which a physician would have a > 50% chance of endorsing feeling burned out and ≥ 1 symptom of DP and a < 50% chance of endorsing feeling accomplished at work once weekly or more. These cut-points also correspond with EE, DP, and PA levels that are 0.27 SDs above, 0.78 SDs above, and 1.22 SDs below the mean of US physicians, respectively.
Importantly, when burnout is defined as a high score on the EE and/or DP subscales, using these content-referenced cut-points would lower the estimated national prevalence of physician burnout in 2014 from 54.4% to approximately 41.8% (2709/6474) [4,5].
Our analyses of the MBI's precision bandwidths demonstrated that each subscale assesses the majority of physicians' scores with ≥ 0.70 reliability. However, the EE and DP subscales lack adequate precision to assess the scores of physicians reporting the very highest EE and DP levels on each metric. Analysis of the PA scale also revealed that this scale is most precise at assessing below average levels of PA (arguably where the precision is most important given low PA is a symptom of burnout) and lacks precision at assessing above average levels of PA. Further, while researchers have stated that the MBI can be used for individual-level outcome measurement [2,24], only the EE subscale demonstrated adequate reliability for individual-level measurement. These findings highlight that each metric does not measure all physicians' scores with equal precision: outside the score ranges possessing ≥ 0.70 and ≥ 0.90 reliability, these scales have inadequate precision to assess between-group and within-individual differences, respectively. Adding items to each subscale could improve their reliability.

Strengths and limitations
To our knowledge, this is the first study to calibrate the MBI in a national sample of US physicians and create IRT-based response profiles mapped to raw scores. A strength of this study is that it allows investigators to classify physicians' scores into discrete burnout outcome groups relative to 1) whether their score has met or exceeded a particular symptom burden represented by the items and 2) the mean score of a US physician reference sample. This is particularly important in the absence of a gold-standard criterion for burnout. It is also important given the original cut-points for defining dichotomous outcomes on each subscale (examined herein) were selected by identifying the score corresponding with the third tercile in a large occupational sample [25]. As the scale developers and others have noted, a distributional approach such as this alone can result in somewhat arbitrary cut-points [24,25]. The use of content-referenced score interpretations as a complement to the norm-referenced interpretations, as made possible through this study, addresses this shortcoming.
This study has several limitations. The burnout symptoms assessed by the MBI are continuous constructs, and it is important to treat scores as such where possible. Notwithstanding, the use of the MBI in research to classify physicians into burned out versus non-burned out groups continues to influence healthcare policy and practice [6,26]. Therefore, identifying the symptom burden associated with various cut-points has value. This study aims not to define new cut-points but instead to elucidate the meaning of the cut-points used to define physician burnout outcomes on MBI subscales, such that when reports state "X%" of physicians are "burned out" we have a better understanding (probabilistically) of what symptom burden level that means.
The selection of appropriate cut-points is a multiattribute decision that depends critically on factors such as the intended purpose of assessment, the profile of burnout symptoms that are most probable at the cut-points, and consensus among investigators regarding what symptom burden matters for the purpose(s) of the assessment. This includes answering questions such as: which symptoms and symptom frequencies define burnout on each subscale; and what response probability criterion should be used to define whether a physician is likely or unlikely to report the burnout symptom? Our response profiles indicate the probability of item endorsements at a frequency of once weekly or more based on its prior use to define burnout in national studies [5,15,16], but it may be that a different symptom frequency is of interest. In this case, investigators can use the item parameter estimates (Table 2) to identify probable responses at different frequencies (see also Supplemental Appendix 4 for plotted cumulative probability curves describing the probability of a physician endorsing each subscale item at a frequency of a few times a month or more across IRT z-scores). Further, we use a response probability criterion of > 0.50 to define whether a physician is likely to endorse each item; however, it may be that a higher probability criterion (e.g., ≥ 0.67) is desired.
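To illustrate how a different response probability criterion shifts the score at which an item becomes "likely," the sketch below inverts the logistic cumulative curve for a single threshold. The function name and the parameter values are hypothetical (not estimates from Table 2), and the logistic form without a scaling constant is an assumption.

```python
import math

def z_for_probability(a, b, p):
    """z-score at which the cumulative endorsement probability equals p,
    i.e., the solution of p = 1 / (1 + exp(-a * (z - b)))."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return b + math.log(p / (1.0 - p)) / a

# Hypothetical discrimination and "once a week" threshold for one item.
a_j, b4_j = 2.0, 0.27

z_at_50 = z_for_probability(a_j, b4_j, 0.50)  # equals the threshold itself
z_at_67 = z_for_probability(a_j, b4_j, 0.67)  # a stricter criterion needs a higher z
```

The shift between the two criteria is ln(p / (1 − p)) / a, so items with lower discrimination move further when the criterion is raised.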
Definitions of what symptom burden matters should also consider the relationship of a particular cut-point with external criteria. That is, what are the sensitivity and specificity of a particular cut-point with respect to important physician health and performance outcomes? To our knowledge, this has yet to be evaluated. Cut-points derived solely from content- and norm-referenced approaches may not be the cut-points at which sensitivity and specificity are maximized for a particular outcome. The optimal cut-point should be selected based on an evaluation of the costs and benefits of decisions resulting from its use to classify physicians into outcome groups (a property of context, not the subscales themselves) [27,28]. For example, the costs and benefits of particular subscale cut-points for defining national physician burnout prevalence may differ substantially from those associated with identifying which physicians should receive an intervention. While cut-points may vary depending on context, there is a need for consistency in the cut-points used across studies when the purpose of assessment is estimating burnout prevalence [8]. Our findings can be used to inform consensus standards for defining outcome categories (e.g., burned out vs. not burned out; low, moderate, high symptoms) on each subscale for this purpose. However, this study does not address which subscales matter in the definition (e.g., EE and/or DP versus EE, DP, and PA, etc.) [29], which has also contributed to wide variation in prevalence estimates [8].
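An evaluation of this kind could be sketched as follows for a subscale on which higher scores indicate more symptoms (EE or DP). The scores and external outcome indicators below are invented solely for illustration and do not come from any dataset; the function name is also hypothetical.

```python
def sensitivity_specificity(scores, outcomes, cut_point):
    """Sensitivity and specificity of the rule `score >= cut_point`
    against a binary external outcome (1 = outcome present)."""
    tp = fn = tn = fp = 0
    for score, outcome in zip(scores, outcomes):
        flagged = score >= cut_point
        if outcome == 1:
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both outcome classes must be present")
    return tp / (tp + fn), tn / (tn + fp)

# Invented raw subscale scores and outcome indicators (illustration only).
scores = [10, 20, 27, 30, 31, 35, 40, 45]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]

sens_27, spec_27 = sensitivity_specificity(scores, outcomes, 27)
sens_31, spec_31 = sensitivity_specificity(scores, outcomes, 31)
```

In this toy example, raising the cut-point from 27 to 31 trades sensitivity (1.00 to 0.75) for specificity (0.50 to 0.75); for the PA subscale the comparison would need to be reversed, since lower scores indicate a higher symptom burden.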
When using our crosswalk to interpret an individual's/group's score relative to its distance from the mean, it should be noted that comparisons will be relative to the mean EE, DP, and PA levels reported in this sample. While early and late responder analyses by Shanafelt et al. support the demographic representativeness of the sample [4], it is possible that the mean EE, DP, and PA levels in this calibration sample are not representative of those in the population. Findings from this study also cannot be assumed to generalize to non-physician populations (e.g., nurses). That is, it cannot be assumed that the symptom burden represented by cut-points in this study has the same meaning in a non-physician sample. Further research would be needed to place item responses from both groups onto the same metric and determine whether items function invariantly across physician and non-physician workers before raw scores can be assumed to represent the same symptom burden across groups.
It should be noted that the precision of each MBI subscale as implied by the crosswalks (Table 3) differs slightly from the precision of each metric reported by each TIF (Fig. 1) due to differences in estimating standard error (standard deviation of posterior distribution and square root of inverse Fisher expected information value, respectively). The use of each crosswalk requires complete responses on each MBI subscale. Finally, in the original study, item DP2 was slightly revised from the original MBI item (whereby "since I took this job" was removed from the original item: "I've become more callous toward people since I took this job").

Conclusions
We produced a crosswalk mapping raw MBI subscale scores to IRT-based, standardized scores and response profiles calibrated in a US physician sample. Our results can be used in research and practice to better understand the meaning and precision of MBI scores in US physicians and compare individual/group MBI scores against a reference population of US physicians. Our response profiles underscore that the choice of cut-points for defining categorical MBI subscale outcomes matters. Different scores have different meanings with respect to the burnout symptom burden they represent, and prevalence estimates will be directly influenced by which cut-point is chosen. Our findings can be used to better inform the selection of appropriate cut-points for defining categorical physician burnout outcomes on each MBI subscale.