
Psychometric performance of the PROMIS® depression item bank: a comparison of the 28- and 51-item versions using Rasch measurement theory



The aim of this study is to illustrate an example application of Rasch Measurement Theory (RMT) in the evaluation of patient-reported outcome (PRO) measures. RMT diagnostic methods were applied to evaluate the PROMIS® Depression items as part of a series of papers applying different psychometric paradigms in parallel to the same data.


RMT was used to examine scale-to-sample targeting, scale performance and sample measurement of two PROMIS depression item sets comprising 28 and 51 items, respectively.


The 51-item pool displayed improved, though still sub-optimal, targeting: it covered 27% of the range of depression measured in the sample compared to only 15% for the 28-item bank, and reduced the percentage of the sample with lower depression not covered by the scale (28% vs. 34%). The 28-item bank showed satisfactory scale performance with only marginal item misfit. However, the 51-item pool deviated from the RMT criteria in several respects: 9 items with reversed thresholds, 12 misfitting items, and 12 item pairs displaying dependency. Overall reliability was good for both item sets (Person Separation Index = 0.93 and 0.95), but sample measurement was sub-optimal (17% vs. 19% of person fit residuals outside the recommended range).


The RMT approach in this exercise provided evidence that, compared to the 28-item bank, the extended 51-item version of the PROMIS depression item set improved sample-to-scale targeting. However, targeting at the lower end of the concept of interest remained sub-optimal and scale performance deteriorated. The conceptual breadth of the construct under investigation may need to be improved to ensure the inclusion of items that capture the full range of the concept of interest for this context of use.


The rising profile of the patient perspective in clinical outcome assessment has increased interest in patient-reported outcome (PRO) instruments and in techniques for evaluating their scientific rigor [1, 2]. Scores generated by PRO instruments are increasingly used as central outcome variables upon which important decisions about patient care are made. It is therefore essential to assess whether they are fit for purpose, as failure to do so could lead to incorrect interpretations being drawn about patient care [1, 3]. A fundamental step in PRO evaluation is to examine whether an instrument comprehensively captures the concept of interest in the intended context of use [4]. Additionally, it is essential to assess whether summing individual items is "psychometrically sound" and whether, and to what extent, generated scores satisfy a priori reliability and validity criteria [5,6,7].

Different psychometric paradigms are available for developing and evaluating the scientific rigour of PRO instruments [1]. These include traditional psychometrics based on Classical Test Theory (CTT) [8, 9] and, more recently, modern psychometric paradigms offering mathematically testable models: Rasch Measurement Theory (RMT) [10, 11] and Item Response Theory (IRT) [12, 13]. CTT is grounded in Stevens's definition of measurement as "the assignment of numerals to objects or events according to some rule," and proposes that a person's observed score is the sum of a true score and error, where the true score is theoretical [9]. Traditional psychometric analysis examines raw scores without weighting or standardization against a theoretical measurement model [8]. Modern psychometric paradigms, on the other hand, are grounded in Thurstone's measurement criteria, which include interval scaling and measurement invariance [14, 15], and offer mathematically testable logistic models against which the measurement properties of rating scales can be examined [1].

This study constitutes part of a parallel exercise coordinated by the International Society for Quality of Life Research (ISOQOL) Psychometrics Special Interest Group (SIG), aiming to demonstrate the potential of different psychometric methodologies (CTT, IRT and RMT) for developing and evaluating PROs using the same exemplar instrument. The papers describing the three parallel studies were reviewed by members of the Psychometrics SIG and the ISOQOL board. The Patient-Reported Outcomes Measurement Information System (PROMIS) calibrated depression item bank [16, 17] was the chosen PRO example for this parallel exercise. The PROMIS depression item bank consists of a 28-item version calibrated from a larger pool of 56 well-performing items. PROMIS item calibration was completed by applying IRT models and CTT techniques to an initial 518-item bank collated from 78 depression scales [17]. Items for the depression bank were generated through an iterative process involving literature searches, conceptual framework development, expert review, focus groups and cognitive debriefing [17]. As the PROMIS depression authors suggested, validation of item banks should be an on-going process [17].

In this study we present an evaluation of the PROMIS depression item bank using RMT methods [3, 10]. Similar to the IRT approach used in the development and calibration of the PROMIS item banks [18], RMT offers a mathematical model that defines how a set of items should perform to generate reliable and valid measurements [10]. The Rasch model postulates that the probability of response to an item is always a function of the difference between an item's difficulty and a person's ability [3, 10, 11]. RMT analysis examines the extent to which the observed score data "fit" the scores expected by the Rasch model [3].

Both IRT and RMT focus on item responses within scales. Despite the similarities of these two modern psychometric approaches, they are characterized by a fundamental difference that is key in scale evaluation and modification [1]. IRT is a statistical modeling paradigm that aims to find the measurement model that best fits the observed data. RMT, on the other hand, is a diagnostic paradigm that assesses the extent to which a set of items conforms to the requirements of the Rasch model, based on subjects' responses, and identifies potential anomalies that do not fit RMT expectations [1, 19]. In practice, anomalies within the RMT paradigm are resolved by revisiting the item content and appraising the qualitative and quantitative evidence to further understand the disparity between expected and observed scores of an evaluated scale [3]. Evaluating the PROMIS depression items using RMT methods can therefore provide an additional perspective to the on-going validation of the PROMIS depression item bank for application in research, clinical practice, and policy.



The analyses described here are a secondary analysis of data initially collected in the PROMIS Wave 1 cohort [16, 17]. Recruitment took place at four PROMIS sites (University of Pittsburgh, University of North Carolina-Chapel Hill, Stanford University, and Duke University) and online via a non-partisan online polling firm administering surveys for market research and political polling. As PROMIS item banks are targeted for use in clinical research (including clinical trials, observational research, and epidemiological studies), the authors reported assembling a representative sample with a diverse range of emotional distress severity.

PROMIS depression item bank

The development of the PROMIS depression item bank has been described in detail elsewhere [17]. Items are rated on a 5-point frequency scale (never, rarely, sometimes, often, always) within a seven-day recall period, with a model-based scoring system in which higher values reflect greater depression [17]. The 28-item bank comprised a single factor combining items from cognitive (n = 17), affective (n = 9), behavioral (n = 1) and suicidal (n = 1) conceptual domains. The current analyses use data from the extended 56-item PROMIS depression item bank. Five items were eliminated for intellectual property issues, leaving 51 items in the pool available for analysis. As an item bank refers to items that have been calibrated, the 51-item set will be referred to as an item pool.

Rasch measurement theory (RMT) analysis

The original simple logistic Rasch model [11] postulates that the probability of a positive response to a dichotomous (yes/no) item is a logistic function of the difference between the respondent (person) location and the item location on the measurement continuum. The Rasch model, which articulates how measurements of constructs can be derived from responses to items, defines the odds of a "yes" response as the probability of a "yes" response divided by the probability of a "no" response. Taking the natural logarithm of these odds yields a scale on which person and item locations are additive in log-odds units (logits), thus transforming scores into an interval scale [3]. The dichotomous model was subsequently extended to polytomous data, which conceptually and mathematically is an extension of the dichotomous case [3]. Andrich [10] later developed the rating scale Rasch model, the unrestricted version of which was used in this study.
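As a concrete illustration of the dichotomous model described above (our own sketch in Python, not part of the original analyses; the function name is ours), the probability of endorsement is a logistic function of the person-item difference, so the log-odds of a "yes" response equal that difference in logits:

```python
import math

def rasch_probability(person, item):
    """Dichotomous Rasch model: probability of a 'yes' response as a
    logistic function of the person-item difference (in logits)."""
    diff = person - item
    return math.exp(diff) / (1.0 + math.exp(diff))

# A person located exactly at the item's difficulty has a 50% chance of
# endorsing it, and the log-odds of endorsement equal the person-item
# difference, the additivity that yields an interval (logit) scale.
p = rasch_probability(1.0, 0.0)
log_odds = math.log(p / (1.0 - p))   # equals the person-item difference, 1.0
```

Because only the difference matters, shifting both person and item locations by the same amount leaves the probability unchanged, which is what makes the logit metric interval-level.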

RMT analysis uses the Rasch model as the criterion against which scale performance is evaluated [1, 3, 10]. Effectively, RMT analysis examines the extent to which observed raw scores (responses to scale items) satisfy the a priori criteria and match the scores expected by the Rasch model. RMT analysis has commonly been used to develop and test measures in health, including psychiatry [20], with three broad aims: evaluation of scale-to-sample targeting, scale performance, and sample measurement. The tests and information collected to address these aims have been described in detail elsewhere [3] and are summarized briefly below.

Scale-to-sample targeting

Scale-to-sample targeting refers to the extent to which a scale's items are able to measure the sample's ability range (in this case depression). Targeting was evaluated through examination of the relative person and item distributions on the same continuum [21, 22], both graphically and numerically. RMT analysis orders items and respondents hierarchically according to their relative difficulty (item locations) and relative ability (person locations) on the same interval continuum of logits. The relative distributions inform the adequacy of the sample for evaluating the scale and the adequacy of the scale for measuring the sample: the better the two ranges match each other, the greater the potential for precise person measurement.

Scale performance

Five components of scale performance were examined [23].

Do the response categories work as intended?

Each PROMIS item is scored on a 5-point frequency response scale as follows: “1 = never”; “2 = rarely”; “3 = sometimes”; “4 = often”; “5 = always.” Respondents with higher levels of the construct (i.e., higher depression) are expected to endorse the higher response categories, while respondents with lower levels of the construct (i.e., lower depression) are expected to endorse the lower response categories. Thresholds represent the point between two response categories (i.e., the place where a person is equally likely to endorse either of the two response categories). Threshold ordering is expected to reflect the intended category ordering. Disordered thresholds signify response categories may not be working as intended. This may in turn affect scale score interpretations and scale validity, where higher scores may not necessarily reflect higher levels of the concept of interest [24].
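To illustrate why disordered thresholds matter (a hypothetical Python sketch of our own, not RUMM2030 output; an Andrich-type rating scale parameterization is assumed), a category whose thresholds are reversed is never the most probable response at any point on the continuum:

```python
import math

def category_probs(person, item, thresholds):
    """Category probabilities under an Andrich-type rating scale model.
    `thresholds` holds k threshold locations (logits) for k + 1 categories;
    the unnormalised log-probability of each category is the running sum
    of (person - item - tau)."""
    log_num = [0.0]
    for tau in thresholds:
        log_num.append(log_num[-1] + (person - item - tau))
    total = sum(math.exp(v) for v in log_num)
    return [math.exp(v) / total for v in log_num]

def modal_categories(thresholds, grid=None):
    """Which categories are ever the most probable response across the
    continuum? With ordered thresholds every category is modal somewhere;
    a disordered threshold means some category never is."""
    grid = grid or [i / 10.0 - 5.0 for i in range(101)]
    return {max(range(len(thresholds) + 1),
                key=lambda c: category_probs(b, 0.0, thresholds)[c])
            for b in grid}

# Ordered thresholds: all five categories ('never' ... 'always') are modal.
ordered = [-2.0, -0.5, 0.5, 2.0]
# Swapping the first two thresholds mimics a disordered 'rarely' category:
# that category is never the most likely response at any depression level.
disordered = [-0.5, -2.0, 0.5, 2.0]
```

This is the pattern behind the paper's finding that the "rarely" category often failed to work as intended.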

Do the items map out a continuum of depression?

An optimal scale is expected to comprise a continuum representing the construct under measurement (e.g., depression), with marked components articulated as items located at different points on the continuum [25,26,27]. The item locations are expected to cover the entire range of the construct (i.e., scale items are intended to measure, and equally represent, all levels within the construct) [23, 28]. Therefore, item location, spread, proximity to each other, and precision are examined in RMT analysis to determine the extent to which the set of items maps out a continuum for measurement from less to more ability [21, 23].

Do the items define a single construct?

Within RMT analysis, item responses are examined to assess the cohesiveness of the measurement continuum and therefore the legitimacy of the scale [23, 28]. Fit statistics identify the extent to which items work together to define a single variable. Item fit statistics summarize the differences between observed scores and expected responses for individuals (i.e., the item-person interaction [fit residual]) and for different ability groups (i.e., the item-construct interaction [chi-square values]) [3]. Criteria suggest fit residuals should lie within the recommended range of − 2.5 to + 2.5, and chi-square values should be non-significant [3, 21]. Item characteristic curves (ICCs) are graphical indicators of fit used to complement the interpretation of the fit residuals and chi-square probabilities [23, 29].
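The fit residual screening rule can be sketched as follows (our illustration; the residual values here are hypothetical, and in practice they are produced by the Rasch analysis software):

```python
def flag_misfitting_items(fit_residuals, lo=-2.5, hi=2.5):
    """Return indices of items whose summary fit residual falls outside
    the recommended [-2.5, +2.5] range, flagging candidates for closer
    review (e.g. via their item characteristic curves)."""
    return [i for i, r in enumerate(fit_residuals) if not lo <= r <= hi]

# Hypothetical fit residuals for four items: items 1 and 3 would be
# flagged for graphical review of their ICCs.
flagged = flag_misfitting_items([0.4, -3.1, 2.2, 2.8])
```

Items flagged this way are not discarded automatically; within the RMT paradigm the statistical flag prompts a return to the item content.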

Do responses to one item bias responses to other items?

RMT expects local item independence, such that the response to one item should not influence or determine the response to another. High residual correlations (where residual = observed − expected) highlight "local dependency," an item response bias which can artificially inflate reliability [21, 22]. Response bias was assessed against the r > 0.30 rule of thumb, which indicates > 9% shared variance between a pair of items, suggesting local dependence [30].
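A sketch of this screening step (the residual data below are hypothetical; in practice the residuals come from the fitted Rasch model):

```python
def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def locally_dependent_pairs(item_residuals, cutoff=0.30):
    """Flag item pairs whose residual correlation exceeds the cutoff;
    r > 0.30 implies more than r^2 = 9% shared residual variance."""
    n = len(item_residuals)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if pearson_r(item_residuals[i], item_residuals[j]) > cutoff]

# Hypothetical residuals for three items across four persons: items 0
# and 1 move together and would be flagged as locally dependent.
pairs = locally_dependent_pairs([[1, 2, 3, 4], [2, 4, 6, 8], [4, 1, 3, 2]])
```

The 9% figure follows directly from squaring the cutoff: 0.30² = 0.09.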

Are the scale items stable between different groups?

The extent to which items are stable across different sample subgroups was assessed through differential item functioning (DIF) [3, 23, 31]. An item shows differential functioning if the expected responses differ for two respondents who have the same level on the measured construct but belong to two different groups (e.g., age, sex). DIF was investigated by examining observed response differences between class intervals within groups, using analysis of variance (ANOVA) to assess item scores between the sample groups and across the different class intervals; a significant p-value for differences between subgroups is taken to indicate DIF [3, 24]. The performance of the items was tested for stability across gender (male/female); six age groups (18–30, 31–40, 41–50, 51–60, 61–70 and 70+); race (white/non-white); and Hispanic ethnicity (Hispanic/non-Hispanic).
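The group comparison can be sketched with a one-way ANOVA F statistic on item residuals (a simplified illustration with hypothetical data; the actual DIF test is a two-way ANOVA across groups and class intervals, which this sketch does not reproduce):

```python
def anova_f(groups):
    """One-way ANOVA F statistic: between-group mean square over
    within-group mean square. A large F (significant p-value) for
    residuals grouped by e.g. gender or age band would suggest DIF."""
    all_vals = [v for g in groups for v in g]
    n, k = len(all_vals), len(groups)
    grand = sum(all_vals) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical item residuals for two subgroups: the large separation in
# group means produces a large F, the pattern a DIF screen looks for.
f_stat = anova_f([[0, 1, 0, 1], [5, 6, 5, 6]])
```

In practice the F statistic is referred to an F distribution with (k − 1, n − k) degrees of freedom to obtain the p-value compared against the chosen alpha.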

Sample measurement

Two components of person measurement were examined: the Person Separation Index (PSI) and the extent to which individuals' responses were consistent with expectation. The PSI, which indicates a scale's ability to detect differences in levels of the construct within the sample, is computed as the ratio of the variation in person estimates relative to the estimated error for each person [31]. PSI scores range between 0 and 1, where 0 indicates all error and 1 indicates no error [3].
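The PSI computation can be sketched as follows (a sketch assuming the RUMM-style formula of person variance minus mean squared standard error over person variance; the data are hypothetical):

```python
def person_separation_index(locations, standard_errors):
    """PSI = (variance of person estimates - mean squared standard error)
    / variance of person estimates. 0 means the observed spread is all
    error; 1 means error-free separation of persons."""
    n = len(locations)
    mean = sum(locations) / n
    var = sum((x - mean) ** 2 for x in locations) / n
    mean_sq_err = sum(se ** 2 for se in standard_errors) / n
    return (var - mean_sq_err) / var

# Hypothetical person estimates spread over 2 logits of variance with
# modest standard errors yield a high PSI.
psi = person_separation_index([-2, -1, 0, 1, 2], [0.45] * 5)
```

The index is interpreted analogously to Cronbach's alpha, but is computed on the logit-scale person estimates rather than raw scores.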

The extent to which individuals’ responses met the expectations of the Rasch model was examined with fit statistics. Person fit residuals summarize the difference between observed scores and expected responses for each person, i.e., the person-construct interaction [3, 23]. Person fit residuals were examined with reference to the “rule of thumb” expecting 99% of the sample to produce a fit residual between − 2.5 and + 2.5. Fit residuals outside this range indicate potentially problematic measurement for those persons [3, 23].

Analysis procedure

Data were analyzed using RUMM2030 [11], one of several available software programs providing graphical and statistical item analyses using Rasch unidimensional models for measurement. In line with the methods described above, the initial RMT analysis of the 28-item bank identified limited measurement and precision at the floor of the scale. We therefore hypothesized that adding new items could improve measurement performance by expanding the range of depression measured by the PROMIS items. For this, the extended pool of 51 items (i.e., the original 28 plus 23 additional items) was analyzed with RMT and its performance compared to the original 28-item bank. For brevity, results are presented in two stages: stage 1 examines the extent to which the 51 items satisfy the RMT criteria, and stage 2 the relative measurement performance of the 51- and 28-item pools. A summed total score was calculated for both the 28- and 51-item pools. Five items were eliminated from the 56-item bank due to potential intellectual property issues.



In the sample, described elsewhere [16, 17], 8% of participants were from a clinical population and 92% from a general population. Secondary analysis was performed on a subsample (n = 825) ranging in age from 18 to 88 years (mean = 50.91, SD = 18.92); 51% were female, 79% white, and 10% Hispanic. The sample size varies slightly between the two stages of analyses (at both scale and item level), as RMT estimates are calculated on the basis of available data, excluding extreme scores located at the floor and ceiling.

Stage 1 RMT analysis: examination of the PROMIS 51-item pool measurement performance

Scale-to-sample targeting

Person locations ranged approximately from − 6 to 4 logits, while item threshold locations ranged approximately from − 3 to 3 logits, covering 60% of the depression range measured in the sample (Table 1). Graphical review of targeting (Fig. 1a, b) indicated item bunching and sub-optimal measurement for persons with lower depression scores. Sample measurements (n = 224, 28%) located below − 3.13 logits on the left of the continuum had no matching items.

Table 1 Targeting, reliability and sample measurement validity
Fig. 1

Scale-sample targeting: top histograms (pink bars): sample’s distribution of depression measurements (person locations) measured by the PROMIS 51 items in (a) and the PROMIS 28 items in (b). Person measurements reflecting lower ability (lower depression) are represented by histogram bars on the left and person measurements reflecting higher ability (higher depression) by histogram bars on the right. Lower histograms (blue bars): scale distribution of item threshold location estimates for the 51 items in (a) and the 28 items in (b). Histogram bars on the left represent the easiest items (requiring lower levels of depression to be positively endorsed) and histogram bars on the right represent the more difficult items (requiring higher levels of depression to be positively endorsed). Green curve: an inverse function of the standard error associated with measurement; it reflects the scale’s precision and shows where on the continuum the scale performs best, indicating the locations at which item thresholds and persons are well matched

Scale performance

Do the response categories work as intended?

Nine items displayed disordered response thresholds (Table 2), with the “rarely” response category most consistently not working as intended.

Table 2 PROMIS 51 Thresholds ordering, item location estimates, item fit statistics and item dependency (items ordered by location)

Do the items map out a depression ability continuum?

Item locations for the PROMIS-51 continuum ranged from − 1.58 to 1.18 logits (Table 1) and showed some overlap. For example, five items (items 21, 30, 42, 48 and 50) were situated within 0.08 logits of each other and three items (items 6, 9, and 41) within 0.06 logits of each other (Table 2, Fig. 2).

Fig. 2

Plot of relative item locations of the PROMIS 51- and 28-item pools. The blue dots represent the item locations for each item within the PROMIS 28 and PROMIS 51 (as anchored on the PROMIS-28 measurement continuum) item sets. Items located on the left of the continuum are the easiest (i.e., relevant to persons with lower levels of depression); items towards the right are the most difficult (i.e., relevant to persons with higher levels of depression). The PROMIS 51 items cover parts of the continuum (below − 0.75, and between − 0.50 and 0.00) where the PROMIS 28 item distribution displays gaps, as well as smaller gaps of the 28-item pool along the continuum

Do the items define a single construct?

Twenty items displayed fit residuals outside the recommended range, twelve of which also displayed significant chi-square values (Table 2). Graphical review of the ICCs indicated overestimation of depression for items 17, 29 and 36, and underestimation of depression (i.e., observed scores were lower than expected at higher depression-ability locations and higher than expected at lower depression-ability locations) for items 11, 15, 24, 49 and 53. Exemplar ICCs are displayed in Fig. 3.

Fig. 3

Exemplar Item Characteristic Curves (ICCs) for the items displaying the worst statistical misfit. The y-axis represents the scores expected by the Rasch model and the x-axis the depression measurement continuum of the sample. The ICC plots the observed scores in each of the 10 class intervals of depression (the black dots) and the scores expected by the Rasch model at each level of the measurement continuum (the curve). (a) item 49 “I lost weight without trying” of the 51-item pool, indicating evident underestimation of depression, as observed scores at higher person location logits are located below the expected curve and scores at lower person location logits above the expected curve. (b) item 41 “I felt hopeless” of the 28-item bank, indicating marginal overestimation of depression, as observed scores at higher person location logits are located above the expected curve and observed scores at lower person location logits below the expected curve

Do responses to one item bias responses to other items?

Twelve item pairs produced residual correlations above the recommended range, suggesting possible response bias between them. The highest residual correlation (r = 0.53) identified was between items 32 and 39 (Table 2).

Are the scale items stable between different groups?

No item displayed DIF by race or Hispanic ethnicity, whereas two items (items: 15 and 16) displayed DIF by gender, and ten items (items: 11, 12, 18, 20, 35, 37, 43, 47, 54 and 56) displayed DIF by age group (p < 0.01).

Sample measurement

The PROMIS-51 displayed high reliability and sample separation (PSI = 0.95) but suboptimal sample measurement, as 139 individuals (17%) in the sample produced fit residuals outside the recommended range.

Stage 2 RMT analysis: comparative measurement performance of the 51 and 28-item pools

Scale-to-sample targeting

Even though the 51-item pool did not resolve the targeting issues of the PROMIS 28-item pool, it did provide relative improvements to the measurement of people at the floor of the scale. Graphical review suggests the 28-item pool (Fig. 1b) is more limited than the 51-item pool (Fig. 1a) in measuring depression for people with lower depression scores. In line with this, in the 28-item bank the sample mean is further skewed away from the item mean, floor effects rise from 3 to 7% (Table 1), and the sample measurements outside the scale range amount to 34% (n = 272), compared to 28% (n = 224) in the 51-item pool. Figure 2 demonstrates how the additional items improved targeting by expanding item coverage on the left of the continuum and filling in some of the gaps on the PROMIS 28-item pool measurement continuum. This improvement is also demonstrated numerically: item locations covered 15% of the sample measurement range in the 28-item pool, compared to 27% in the 51-item pool (Table 1).

Scale performance

Compared with the 51-item pool, the 28-item pool satisfied more of the RMT scale performance criteria.

Do the response categories work as intended?

Thresholds for only one item (compared to nine in the 51-item version) were not consistently ordered sequentially, suggesting the “rarely” response category was problematic for this item (Table 3).

Table 3 PROMIS 28 Thresholds ordering, item location estimates, item fit statistics and item dependency

Do the items map out a depression ability continuum?

The item location range of the 28-item pool, from − 0.68 to 1.16 logits, was approximately 1 logit narrower than that of the 51-item pool (− 1.58 to 1.18 logits) (Table 1). Item locations again indicated some overlap: for example, five items (items 14, 21, 22, 48 and 50) were situated within 0.08 logits and three items (items 9, 19 and 41) within less than 0.05 logits (Table 3).

Do the items define a single construct?

Fewer items displayed misfit in the 28-item pool, as five items displayed fit residuals outside the recommended range, only one of which also displayed a significant chi square value (Table 3). The item characteristic curve for this item (Fig. 3b) indicated overestimation of depression: observed scores were higher than expected at higher depression-ability locations and lower than expected at lower depression-ability locations.

Do responses to one item bias responses to other items?

In comparison to the longer item pool, no item bias was identified in the 28-item pool, as no item pairs produced residual correlations above 0.30 (Table 3).

Are the scale items stable between different groups?

Compared to the longer item pool, the 28-item pool displayed minimal DIF: no items displayed DIF by gender, race or Hispanic ethnicity, and only one item (item 35) displayed DIF by age group (p < 0.01).

Sample measurement

Sample measurement was similar in the two PROMIS item pools. The PSI of the 28-item pool was marginally lower (PSI = 0.93) than that of the 51-item pool, and the validity of sample measurement marginally worse, as the percentage of person fit residuals outside the recommended range rose to 19% (Table 1).


This study was part of a parallel exercise aiming to compare three different psychometric paradigms (CTT, IRT, RMT), and explored the psychometric properties of the PROMIS depression item bank [17] using RMT [3, 10]. RMT analysis focuses on the examination of scale-to-sample targeting, scale performance and sample measurement, exploring the disparities between observed scores and those expected by the Rasch model [3]. Using this approach, improvements were observed with the 51-item version of the PROMIS depression item set compared to the original 28-item version. However, RMT also revealed shortcomings of the extended item set, specifically suboptimal sample-to-scale targeting.

Specifically, RMT evaluation of the PROMIS depression 28-item bank indicated overall adequate scale performance, but importantly revealed suboptimal targeting. More than a third of the sample was located at the lower end of the depression continuum where few items were identified to capture the individuals in this range. In other words, the 28-item bank is not well targeted for those with lower depression for whom interpretations using the PROMIS depression would be associated with limited precision (i.e., higher standard error). This was further supported by the high percentage of person fit residuals identified outside the recommended range, indicating problematic measurement for nearly 20% of the sample.

The extended 51-item PROMIS depression item pool was also analyzed to examine whether the additional 23 items improved sample-to-scale targeting by extending the item continuum range. Although the 51-item pool showed improved targeting, the problems were not resolved. In fact, the additional items negatively affected the measurement properties of the item bank. Specifically, the extended item pool appeared less statistically cohesive, suggesting the set of items was not measuring a single construct. There was also evidence of item dependency and problems with the ordering of item response thresholds.

Despite the shortcomings revealed, the approach suggests possible ways to remedy the existing scale. Exploration of the construct at the lower end of the continuum, and the identification of items to capture the concept of interest for the population under investigation, may optimize the scale for its intended use. We hypothesize that a conceptually driven effort to fill in the item gaps would improve the content validity of both the PROMIS depression 28-item and 51-item pools [1, 6]. Review of the item development process indicates that the initial pool of items was inductively categorized into conceptual domains (including 46 depression facets) before undergoing further standardization and calibration [17]. However, as Pilkonis et al. (2011) note, certain trade-offs were necessary to satisfy the assumption of unidimensionality and a single-factor scale, a requirement of the IRT models used to calibrate and evaluate these items [17]. Subsequent focus groups conducted with major depression patients [16] supported the multidimensional domains and suggested additional conceptual categories of depression.

In addition to the issue of multiple conceptual domains, the diversity of items within the PROMIS depression item bank fails to match the diversity of concepts within the conceptual framework [17]. As reported by Pilkonis et al. (2011), only the proportion of affective items remained consistent, whereas the proportion of cognitive items almost doubled, while behavioral and somatic items were removed entirely as they did not perform well within the single-factor unidimensional depression scale. Our findings suggest that conceptual breadth may have been sacrificed in favor of statistical fit.

The recommended next step within the RMT paradigm involves evidence-based conceptual review of the item bank’s content to sufficiently and comprehensively define the concept of interest under measurement for the patient population in which it will be used. RMT analysis provides statistical evidence for a scale’s measurement properties; however, this does not guarantee a scale’s content validity, as statistical cohesiveness in measuring a construct does not specify what the construct is or how it is conceptualized by the population for whom it is intended [1, 4, 6, 32, 33]. Statistical tests of a PRO instrument’s validity can therefore mislead if the intended construct is not targeted sufficiently. A conceptually driven empirical assessment of content validity, driven by the population to whom the PRO is to be administered, would improve the ability of the PROMIS depression items to quantify the construct under measurement in this study sample.

In addition, outside any psychometric paradigm but in line with clinical outcome assessment development guidelines, our findings further indicate that PROMIS depression psychometric analysis could benefit from refining the context of use [4]. RMT analysis showed that the range of PROMIS depression items matched well the range of depression reported by persons at the higher end of the continuum. These findings could therefore be interpreted as supporting the use of the item bank in populations with higher levels of depression. Contrary to this finding, item calibration for the PROMIS depression item banks was completed on a sample the vast majority of which (92%) was recruited from the general, not a clinical, population [17]. PROMIS depression psychometric analysis would therefore benefit from empirical exploration of the concept of interest in a predefined and specific context of use [4].


In this study, RMT analysis supported the statistical scale performance of the PROMIS depression scales, but also identified targeting and sample measurement limitations. In practice, diagnosed anomalies may be resolved by exploring the data further and interpreting the disparity between the expected and observed scores of an evaluated scale [3]. Within the RMT paradigm, rating scales and constructs are modified and, if necessary, more data are collected. In this respect, the application of RMT to the PROMIS depression item bank suggests that measurement could benefit from further examination of the construct under measurement, and specifically of the concept of interest and context of use.
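The disparity between expected and observed scores that drives these diagnostics can be made concrete with the dichotomous Rasch model, under which the probability of an item response depends only on the difference between person ability and item difficulty (in logits). The following is a minimal illustrative sketch, not an excerpt from any RMT software package; the function names are our own:

```python
import math

def rasch_probability(person_ability, item_difficulty):
    """Dichotomous Rasch model: probability of endorsing an item,
    given person ability and item difficulty in logits."""
    return 1.0 / (1.0 + math.exp(-(person_ability - item_difficulty)))

def standardized_residual(observed, person_ability, item_difficulty):
    """Disparity between an observed response (0 or 1) and the
    model-expected score, standardized by the model variance."""
    p = rasch_probability(person_ability, item_difficulty)
    variance = p * (1.0 - p)
    return (observed - p) / math.sqrt(variance)

# A person 1 logit above an item's difficulty is expected to endorse
# it with probability ~0.73; an endorsement then yields a small
# positive residual, consistent with the model.
p = rasch_probability(1.0, 0.0)
r = standardized_residual(1, 1.0, 0.0)
```

Fit statistics such as the item and person fit residuals reported in this study are aggregates of residuals like `r` across the response matrix; large aggregate values flag the items and persons whose observed responses deviate from model expectation.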

Availability of data and materials

The datasets supporting the conclusions of this article are available from the PROMIS Health Organization.



Abbreviations

CTT: Classical test theory

IRT: Item response theory

PROMIS: Patient-reported outcomes measurement information system

RMT: Rasch measurement theory


  1. Cano, S., & Hobart, J. C. (2011). The problem with health measurement. Patient Preference and Adherence, 5, 279–290.

  2. Darzi, A. (2008). High quality care for all: NHS next stage review final report. London: Department of Health.

  3. Hobart, J., & Cano, S. (2009). Improving the evaluation of therapeutic interventions in multiple sclerosis: The role of new psychometric methods. Health Technology Assessment, 13(12), 1–214.

  4. Food and Drug Administration. (2013). Qualification of clinical outcome assessments (COAs).

  5. Scientific Advisory Committee, Medical Outcomes Trust. (2002). Assessing health status and quality of life instruments: Attributes and review criteria. Quality of Life Research, 11, 197–209.

  6. US Department of Health and Human Services. (2009). Patient reported outcome measures: Use in medical product development to support labelling claims. Accessed October 10, 2018.

  7. Mokkink, L., Terwee, C., Patrick, D., Alonso, J., Stratford, P., Knol, D., et al. (2010). The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: An international Delphi study. Quality of Life Research, 19, 539–549.

  8. Novick, M. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3, 1–18.

  9. Stevens, S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680.

  10. Andrich, D. (1988). Rasch models for measurement. Beverly Hills: Sage Publications.

  11. Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Education Research. (Expanded edition, 1980, with foreword and afterword by B. D. Wright, Chicago: The University of Chicago Press; reprinted Chicago: MESA Press, 1993.)

  12. Lord, F. (1952). A theory of test scores. Richmond: Psychometric Corporation.

  13. Lord, F. M., & Novick, M. R. (with contributions by A. Birnbaum) (1968). Statistical theories of mental test scores. Reading: Addison-Wesley.

  14. Thurstone, L. (1925). A method of scaling psychological and educational tests. Journal of Educational Psychology, 16, 433–451.

  15. Thurstone, L. (1929). Fechner’s law and the method of equal-appearing intervals. Journal of Experimental Psychology, 12, 214–224.

  16. Cella, D., Riley, W., Stone, A., Rothrock, N., Reeve, B., Yount, S., et al. (2010). The patient-reported outcomes measurement information system (PROMIS) developed and tested its first wave of adult self-reported health outcome item banks: 2005–2008. Journal of Clinical Epidemiology, 63(11), 1179–1194.

  17. Pilkonis, P. A., Choi, S. W., Reise, S. P., Stover, A. M., Riley, W. T., Cella, D., et al. (2011). Item banks for measuring emotional distress from the patient-reported outcomes measurement information system (PROMIS®): Depression, anxiety, and anger. Assessment, 18(3), 263–283.

  18. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale: Lawrence Erlbaum Associates.

  19. Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1 Suppl), I7–I16.

  20. Barbic, S. P., & Cano, S. J. (2016). The application of Rasch measurement theory to psychiatric clinical outcomes research. BJPsych Bulletin, 40(5), 243–244.

  21. Wright, B. D., & Masters, G. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA.

  22. Hobart, J. C., Riazi, A., Thompson, A. J., Styles, I. M., Ingram, W., Vickery, P. J., et al. (2006). Getting the measure of spasticity in multiple sclerosis: The Multiple Sclerosis Spasticity Scale (MSSS-88). Brain, 129(1), 224–234.

  23. Hobart, J., Cano, S., Posner, H., Selnes, O., Stern, Y., Thomas, R., & Zajicek, J. (2013). Putting the Alzheimer’s cognitive test to the test II: Rasch measurement theory. Alzheimer’s & Dementia, 9(1), S10–S20.

  24. Andrich, D. (2011). Testlets and threshold disordering. Rasch Measurement Transactions, 25(1), 1318.

  25. Andrich, D. (1985). An elaboration of Guttman scaling with Rasch models for measurement. In N. Tuma (Ed.), Social methodology. San Francisco: Jossey-Bass.

  26. Thurstone, L. L., & Chave, E. J. (1929). The measurement of attitude: A psychophysical method and some experiments with a scale for measuring attitude toward the church. Chicago: University of Chicago Press.

  27. Wilson, M. (2005). Constructing measures: An item response modeling approach. New Jersey: Lawrence Erlbaum Associates.

  28. Wright, B., & Stone, M. (1979). Best test design: Rasch measurement. Chicago: MESA College Press.

  29. Hagquist, C., Bruce, M., & Gustavsson, J. P. (2009). Using the Rasch model in nursing research: An introduction and illustrative example. International Journal of Nursing Studies, 46(3), 380–393.

  30. Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), 1–17.

  31. Pallant, J. F., & Tennant, A. (2007). An introduction to the Rasch measurement model: An example using the hospital anxiety and depression scale (HADS). British Journal of Clinical Psychology, 46, 1–18.

  32. Andrich, D. (1982). An index of person separation in latent trait theory, the traditional KR20 index and the Guttman scale response pattern. Educational and Psychological Research, 9(1), 10.

  33. Hobart, J., Lamping, D., Fitzpatrick, R., Riazi, A., & Thompson, A. (2001). The multiple sclerosis impact scale (MSIS-29): A new patient-based outcome measure. Brain, 124(5), 962–973.



The authors would like to thank the Patient-Reported Outcomes Measurement Information System (PROMIS®) group for making available the data on the PROMIS Depression Item Bank that was used for illustrative purposes in this paper. Writing and editorial support, under the guidance of the authors, was provided by Dr. Chad Green (Clinical Outcomes Solutions). We would also like to thank Amlan RayChaudhury, PhD for editorial support in the submission process.


This paper was reviewed and endorsed by the International Society for Quality of Life Research (ISOQOL) Board of Directors as an ISOQOL publication and does not reflect an endorsement of the ISOQOL membership. All statements, findings and conclusions in this publication are solely those of the authors and do not necessarily represent the views of ISOQOL.



Author information




All authors contributed to the conception or design of the work, data analysis and interpretation, manuscript preparation, and critical revision. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Skye Pamela Barbic.

Ethics declarations

Ethics approval and consent to participate

Not applicable; this study used a publicly available data set.

Consent for publication

Not applicable.

Competing interests

The authors declare they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

Cleanthous, S., Barbic, S., Smith, S. et al. Psychometric performance of the PROMIS® depression item bank: a comparison of the 28- and 51-item versions using Rasch measurement theory. J Patient Rep Outcomes 3, 47 (2019).



  • Depression
  • Psychometrics
  • Rasch measurement theory
  • Rasch model