The MBI has informed much of the current US health policy discourse surrounding the physician burnout crisis and continues to be the most widely used outcome assessment to monitor physician burnout prevalence at organizational and national levels [4–6, 8, 9, 23]. However, to our knowledge, no studies have used IRT to improve what is known about its psychometric properties in a national sample of physicians. In this study, we used IRT to better understand the meaning and precision of MBI subscale scores in US physicians. After calibrating each MBI subscale, we described the burnout symptom severity represented by each subscale item; created response profiles describing the probability that a US physician endorses each item at a frequency of once weekly or more across standardized, IRT-based subscale scores; and mapped IRT-based subscale scores to raw MBI subscale scores. As an example of their utility, we used the crosswalks and response profiles to interpret the meaning of mean scores and commonly used cut-points for defining dichotomous EE, DP, and PA outcomes. These crosswalks can also be used to compare groups’ (and, for the EE subscale, individuals’) scores on each metric relative to the average level of each construct in a US physician reference population.
This analysis revealed several important findings regarding the burnout symptom burden experienced by the average US physician and represented by commonly used cut-points. The average US physician is likely to experience several EE symptoms once weekly or more due to work, including feeling emotionally drained, used up, frustrated, and working too hard; is unlikely to experience any symptoms of DP once weekly or more; and is likely to experience all indicators of PA once weekly or more. At respective EE, DP, and PA cut-points of 27, 10, and 33, a physician is likely to endorse the same EE symptoms as a physician with a mean score but is unlikely to report feeling burned out from work once weekly or more; is unlikely to experience any DP symptoms once weekly or more (or even “a few times a month” or more); and is likely to experience most indicators of PA (including feeling accomplished) once weekly or more. If a physician’s endorsement of particular symptoms on each subscale is central to the definitions of dichotomous EE, DP, and PA outcomes, then our response profiles can be used to define the raw score cut-points at which physicians are likely to report a particular EE, DP, and low PA burden. For example, if feeling “burned out from work”, experiencing ≥ 1 symptom of DP, and not feeling professionally accomplished at least once weekly are central to the definitions of dichotomous EE, DP, and PA outcomes, respectively, then our findings suggest that raw score cut-points of ≥ 31, ≥ 14, and ≤ 29 should be used on the respective EE, DP, and PA subscales. These cut-points correspond with the scores at which a physician would have a > 50% chance of endorsing feeling burned out and ≥ 1 symptom of DP, and a < 50% chance of endorsing feeling accomplished at work, once weekly or more. These cut-points also correspond with EE, DP, and PA levels that are 0.27 SDs above, 0.78 SDs above, and 1.22 SDs below the mean of US physicians, respectively.
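Under the graded response model used for calibration, the probability that a physician at a given IRT z-score endorses an item at a target frequency or more follows a logistic curve. The sketch below (with hypothetical discrimination and threshold values, not the estimates in Table 2) illustrates why a cut-point maps to the z-score at which this probability crosses 0.50:

```python
import math

def prob_at_least(theta, a, b):
    """P(endorse item at the target frequency or more | theta) under the
    graded response model: a = discrimination, b = boundary threshold."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def theta_at_probability(a, b, p):
    """Invert the logistic curve: the z-score at which the endorsement
    probability equals p."""
    return b + math.log(p / (1.0 - p)) / a

# Hypothetical parameters for one EE item (NOT the Table 2 estimates):
a, b = 2.0, 0.27  # a threshold 0.27 SDs above the US-physician mean

# At theta = b the probability is exactly 0.50, so any z-score above b
# implies a > 50% chance of endorsing the item once weekly or more.
print(prob_at_least(b, a, b))        # 0.5
print(prob_at_least(b + 1.0, a, b))  # > 0.5
```

The same inversion underlies the crosswalk logic: choosing a symptom and a response probability criterion fixes a z-score, which the crosswalk then maps to a raw subscale score.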
Importantly, when burnout is defined as a high score on the EE and/or DP subscale, use of these content-referenced cut-points would lower the estimated national prevalence of physician burnout in 2014 from 54.4% to approximately 43.3% (2709/6474) [4, 5].
Our analyses of the MBI’s precision bandwidths demonstrated that each subscale assesses the majority of physicians’ scores with ≥ 0.70 reliability. However, the EE and DP subscales lack adequate precision to assess the scores of physicians reporting the very highest EE and DP levels on each metric. Analysis of the PA subscale also revealed that it is most precise at assessing below-average levels of PA (arguably where precision matters most, given that low PA is a symptom of burnout) and lacks precision at assessing above-average levels of PA. Further, while researchers have stated that the MBI can be used for individual-level outcome measurement [2, 24], only the EE subscale showed adequate reliability for individual-level measurement. These findings highlight that each metric does not measure all physicians’ scores with equal precision—outside the score ranges possessing ≥ 0.70 and ≥ 0.90 reliability, these scales have inadequate precision to assess between-group and within-individual differences, respectively. Adding items to each subscale could improve its reliability.
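The link between the 0.70 and 0.90 reliability thresholds and test information can be made explicit. For a latent trait scaled to variance 1 (as in our z-score metric), conditional reliability at a score is I/(I + 1), since the error variance is SE² = 1/I. A minimal sketch:

```python
def reliability_from_info(info):
    """Conditional reliability for a latent trait scaled to variance 1:
    rel = I / (I + 1), because the error variance is SE^2 = 1 / I."""
    return info / (info + 1.0)

def info_for_reliability(rel):
    """Test information needed to reach a target reliability."""
    return rel / (1.0 - rel)

# The two thresholds discussed above:
print(info_for_reliability(0.70))  # ~2.33 (SE ~ 0.65): between-group use
print(info_for_reliability(0.90))  # 9.0 (SE ~ 0.33): individual-level use
```

This also shows why individual-level measurement is the harder standard: reaching 0.90 reliability requires nearly four times the information needed for 0.70.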
Strengths and limitations
This is, to our knowledge, the first study to calibrate the MBI in a national sample of US physicians and to create IRT-based response profiles mapped to raw scores. A strength of this study is that it allows investigators to classify physicians’ scores into discrete burnout outcome groups relative to 1) whether their score meets or exceeds a particular symptom burden represented by the items and 2) the mean score of a US physician reference sample. This is particularly important in the absence of a gold-standard criterion for burnout. It is also important given that the original cut-points for defining dichotomous outcomes on each subscale (examined herein) were selected by identifying the score corresponding with the third tercile in a large occupational sample. As the scale developers and others have noted, such a purely distributional approach can result in somewhat arbitrary cut-points [24, 25]. The use of content-referenced score interpretations as a complement to norm-referenced interpretations, as made possible through this study, addresses this shortcoming.
This study has several limitations. The burnout symptoms assessed by the MBI are continuous constructs, and it is important to treat scores as such where possible. Notwithstanding, the MBI’s use in research to classify physicians into burned-out versus non-burned-out groups continues to influence healthcare policy and practice [6, 26]. Therefore, identifying the symptom burden associated with various cut-points has value. This study aims not to define new cut-points but to elucidate the meaning of the cut-points used to define physician burnout outcomes on MBI subscales, such that when reports state that “X%” of physicians are “burned out,” we have a better probabilistic understanding of what symptom burden that represents.
The selection of appropriate cut-points is a multi-attribute decision that depends critically on factors such as the intended purpose of assessment, the profile of burnout symptoms that are most probable at the cut-points, and consensus among investigators regarding what symptom burden matters for the purpose(s) of the assessment. This includes answering questions such as: which symptoms and symptom frequencies define burnout on each subscale, and what response probability criterion should be used to define whether a physician is likely or unlikely to report a burnout symptom? Our response profiles indicate the probability of item endorsement at a frequency of once weekly or more, because this frequency has previously been used to define burnout in national studies [5, 15, 16], but a different symptom frequency may be of interest. In this case, investigators can use the item parameter estimates (Table 2) to identify probable responses at different frequencies (see also Supplemental Appendix 4 for plotted cumulative probability curves describing the probability of a physician endorsing each subscale item at a frequency of a few times a month or more across IRT z-scores). Further, we used a response probability criterion of > 0.50 to define whether a physician is likely to endorse each item; however, a higher probability criterion (e.g., ≥ 0.67) may be desired.
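Given a set of item parameter estimates, the probability of each response frequency at a given z-score follows from differencing adjacent cumulative curves of the graded response model. The sketch below uses made-up parameters (the calibrated values are in Table 2, and the corresponding curves in Supplemental Appendix 4):

```python
import math

def cum_prob(theta, a, b):
    """Cumulative probability of responding at or above a frequency boundary."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def category_probs(theta, a, thresholds):
    """Graded-response-model probability of each of the 7 MBI frequency
    categories ('never' ... 'every day'); thresholds sorted ascending."""
    cums = [1.0] + [cum_prob(theta, a, b) for b in thresholds] + [0.0]
    return [cums[k] - cums[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical discrimination and 6 boundary thresholds (NOT Table 2 values):
a = 1.8
thresholds = [-2.0, -1.2, -0.4, 0.3, 1.1, 2.0]

# Full response-frequency profile for the average US physician (theta = 0):
probs = category_probs(0.0, a, thresholds)
print([round(p, 3) for p in probs])

# 'Once a week or more' is the cumulative curve at the 4th boundary; with a
# stricter criterion (>= 0.67 rather than > 0.50), a higher z-score is
# required before a physician counts as likely to endorse the item.
print(cum_prob(0.0, a, thresholds[3]))
```

Swapping in the Table 2 estimates and a different boundary (e.g., “a few times a month or more”) reproduces the alternative-frequency profiles described above.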
Definitions of what symptom burden matters should also consider the relationship of a particular cut-point with external criteria. That is, what are the sensitivity and specificity of a particular cut-point with respect to important physician health and performance outcomes? To our knowledge, this has yet to be evaluated. Cut-points derived solely from content- and norm-referenced approaches may not be those at which sensitivity and specificity are maximized for a particular outcome. The optimal cut-point should be selected based on an evaluation of the costs and benefits of decisions resulting from its use to classify physicians into outcome groups (a property of context, not of the subscales themselves) [27, 28]. For example, the costs and benefits of particular subscale cut-points for defining national physician burnout prevalence may differ substantially from those associated with identifying which physicians should receive an intervention. While cut-points may vary depending on context, there is a need for consistency in the cut-points used across studies when the purpose of assessment is estimating burnout prevalence. Our findings can be used to inform consensus standards for defining outcome categories (e.g., burned out vs. not burned out; low, moderate, or high symptoms) on each subscale for this purpose. However, this study does not address which subscales should enter the definition (e.g., EE and/or DP versus EE, DP, and PA), a question that has also contributed to wide variation in prevalence estimates.
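Such a criterion-based evaluation would reduce, for each candidate cut-point, to tabulating sensitivity and specificity against the external outcome. A sketch with invented toy data (no validated external criterion for MBI cut-points exists yet, as noted above):

```python
def sens_spec(scores, outcomes, cut):
    """Sensitivity and specificity of classifying score >= cut as positive
    against a binary external criterion (1 = criterion met). Illustrative
    only; the data below are fabricated for the example."""
    tp = sum(1 for s, y in zip(scores, outcomes) if s >= cut and y == 1)
    fn = sum(1 for s, y in zip(scores, outcomes) if s < cut and y == 1)
    tn = sum(1 for s, y in zip(scores, outcomes) if s < cut and y == 0)
    fp = sum(1 for s, y in zip(scores, outcomes) if s >= cut and y == 0)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec

# Toy raw EE scores and a hypothetical criterion outcome:
scores = [12, 20, 27, 31, 35, 40, 44, 50]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
print(sens_spec(scores, outcomes, 27))
print(sens_spec(scores, outcomes, 31))
```

Comparing candidate cut-points this way, weighted by the costs of false positives and false negatives in the intended context, is the evaluation the text calls for.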
When using our crosswalk to interpret an individual’s or group’s score relative to its distance from the mean, it should be noted that comparisons will be relative to the mean EE, DP, and PA levels reported in this sample. While early- and late-responder analyses by Shanafelt et al. support the demographic representativeness of the sample, it is possible that the mean EE, DP, and PA levels in this calibration sample are not representative of those in the population. Findings from this study also cannot be assumed to generalize to non-physician populations (e.g., nurses). That is, it cannot be assumed that the symptom burden represented by the cut-points in this study has the same meaning in a non-physician sample without further research. Such research would need to place item responses from both groups onto the same metric and determine that items function invariantly across physician and non-physician workers before raw scores can be assumed to represent the same symptom burden across groups.
It should be noted that the precision of each MBI subscale as implied by the crosswalks (Table 3) differs slightly from the precision reported by each TIF (Fig. 1), owing to differences in how the standard error is estimated (the standard deviation of the posterior distribution and the square root of the inverse of the Fisher expected information, respectively). The use of each crosswalk requires complete responses on the corresponding MBI subscale. Finally, in the original study, item DP2 was slightly revised from the original MBI item: the phrase “since I took this job” was removed from “I’ve become more callous toward people since I took this job”.
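The two standard-error definitions can be contrasted directly. In this sketch, a single hypothetical boundary (a = 2, b = 0) with a standard normal prior stands in for a full subscale; neither the parameters nor the quadrature settings are the study's estimation code:

```python
import math

def cum(theta, a, b):
    """Probability of endorsing the boundary at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """Fisher information of the boundary: a^2 * p * (1 - p)."""
    p = cum(theta, a, b)
    return a * a * p * (1.0 - p)

def eap_sd(likelihood, grid):
    """SD of the posterior over a quadrature grid with a standard normal
    prior: the standard-error definition behind the crosswalk (Table 3)."""
    prior = [math.exp(-t * t / 2.0) for t in grid]
    post = [l * pr for l, pr in zip(likelihood, prior)]
    z = sum(post)
    mean = sum(t * w for t, w in zip(grid, post)) / z
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, post)) / z
    return math.sqrt(var)

# A respondent who endorses the boundary:
grid = [i / 10.0 for i in range(-40, 41)]
like = [cum(t, 2.0, 0.0) for t in grid]

se_posterior = eap_sd(like, grid)               # crosswalk-style SE
se_fisher = 1.0 / math.sqrt(item_info(0.0, 2.0, 0.0))  # TIF-style SE
print(se_posterior, se_fisher)
```

Because the posterior SD incorporates the prior while the information-based standard error does not, the two values differ, which is why the crosswalk and the TIF imply slightly different precision.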