Psychometric performance of the PROMIS® depression item bank: a comparison of the 28- and 51-item versions using Rasch measurement theory

Purpose The aim of this study is to illustrate an example application of Rasch Measurement Theory (RMT) in the evaluation of patient-reported outcome (PRO) measures. RMT diagnostic methods were applied to evaluate the PROMIS® Depression items as part of a series of papers applying different psychometric paradigms in parallel to the same data. Methods RMT was used to examine scale-to-sample targeting, scale performance and sample measurement of two PROMIS depression item pools comprising 28 and 51 items respectively. Results Sub-optimal but improved targeting was displayed in the 51-item pool, which covered 27% of the range of depression measured in the sample compared to only 15% in the 28-item bank, reducing the percentage of the sample with lower depression not covered by the scale (28% vs. 34%). Satisfactory scale performance was observed for the 28-item bank, with marginal item misfit. However, deviations from the RMT criteria were observed in the 51-item pool, including nine reversed thresholds, 12 misfitting items and 12 item pairs displaying dependency. Overall reliability was good for both sets of items (Person Separation Index = 0.93 and 0.95), but sample measurement was sub-optimal (17% vs. 19% of fit residuals outside the recommended range). Conclusions The RMT approach in this exercise provided evidence that, compared to the 28-item bank, the extended 51-item version of the PROMIS depression improved sample-to-scale targeting. However, targeting at the lower end of the concept of interest remained sub-optimal and scale performance deteriorated. There may be a need to improve the conceptual breadth of the construct under investigation to ensure the inclusion of items that capture the full range of the concept of interest for this context of use.


Background
The rising profile of including the patient perspective in clinical outcome assessment has increased interest in patient-reported outcome (PRO) instruments and in techniques for evaluating their scientific rigor [1,2]. Scores generated by PRO instruments are increasingly used as central outcome variables upon which important decisions related to patient care are made. It is therefore essential to assess whether they are fit for purpose, as failure to do so could lead to incorrect interpretations being drawn about patient care [1,3]. A fundamental step in PRO evaluation is to examine whether an instrument comprehensively captures the concept of interest in the intended context of use [4]. Additionally, it is essential to assess whether summing individual items is "psychometrically sound" and whether, and to what extent, the generated scores satisfy a priori reliability and validity criteria [5][6][7].
Different psychometric paradigms are available for developing and evaluating the scientific rigour of PRO instruments [1]. These include traditional psychometrics based on Classical Test Theory (CTT) [8,9] and, more recently, modern psychometric paradigms offering mathematically testable models: Rasch Measurement Theory (RMT) [10,11] and Item Response Theory (IRT) [12,13]. CTT is grounded in Stevens's definition of measurement as "the assignment of numerals to objects or events according to some rule," and proposes that a person's observed score is the sum of their true and error estimates, where the true score is theoretical [9]. Traditional psychometric analysis examines raw scores without weighting or standardization against theoretical measurement [8]. Modern psychometric paradigms, on the other hand, are grounded in Thurstone's measurement criteria, which include interval scaling and measurement invariance [14,15], and offer mathematically testable logistic models against which the measurement properties of rating scales can be examined [1].
This study constitutes part of a parallel exercise coordinated by the International Society for Quality of Life Research's (ISOQOL) Psychometrics Special Interest Group (SIG) aiming to demonstrate the potential of different psychometric methodologies (CTT, IRT and RMT) for developing and evaluating PROs using the same exemplar instrument. The papers describing the three parallel studies were reviewed by members of the Psychometrics SIG and the ISOQOL board. The Patient-Reported Outcomes Measurement Information System (PROMIS) calibrated depression item bank [16,17] was the chosen PRO example for this parallel exercise. The PROMIS depression item bank consists of a 28-item version calibrated from a larger pool of 56 well-performing items. PROMIS item calibration was completed by applying IRT models and CTT techniques to an initial 518-item bank collated from 78 depression scales [17]. Items for the depression bank were generated through an iterative process involving literature searches, conceptual framework development, expert review, focus groups and cognitive debriefing [17]. As the PROMIS depression authors suggested, validation of item banks should be an on-going process [17].
In this study we present an evaluation of the PROMIS depression item bank using RMT methods [3,10]. Similar to the IRT approach used in the development and calibration of the PROMIS item banks [18], RMT offers a mathematical model that defines how a set of items should perform to generate reliable and valid measurements [10]. The Rasch model postulates that the probability of a response to an item is always a function of the difference between the item's difficulty and the person's ability [3,10,11]. RMT analysis examines the extent to which the observed data "fit" the scores expected by the Rasch model [3].
Both IRT and RMT are used to generate data focusing on item responses within scales. Despite the similarities of these two modern psychometric approaches, they are characterized by a fundamental difference that is key in scale evaluation and modification [1]. IRT is a statistical modeling paradigm that aims to find the best measurement model to fit the observed data. RMT, on the other hand, is a diagnostic paradigm that aims to assess the extent to which a set of items conforms to the requirements of the Rasch model, based on subjects' responses, and to identify potential anomalies that do not fit the RMT expectations [1,19]. In practice, anomalies within the RMT paradigm are resolved by revisiting the item content and appraising the qualitative and quantitative evidence to further understand the disparity between expected and observed scores of an evaluated scale [3]. Evaluating the PROMIS depression items using RMT methods can therefore provide an additional perspective to the on-going validation of the PROMIS depression item bank for application in research, clinical practice, and policy.

Sample
The analyses described here are a secondary analysis of data initially collected in the PROMIS Wave 1 cohort [16,17]. Recruitment took place at four PROMIS sites (University of Pittsburgh, University of North Carolina-Chapel Hill, Stanford University, and Duke University) and online via YouGovPolimetrix.com, a non-partisan online polling firm administering surveys for market research and political polling. As PROMIS item banks are targeted for use in clinical research (including clinical trials, observational research, and epidemiological studies), the authors reported assembling a representative sample with a diverse severity of emotional distress.

PROMIS depression item bank
The development of the PROMIS depression item bank has been described in detail elsewhere [17]. Items are rated on a 5-point frequency scale (never, rarely, sometimes, often, always) within a seven-day recall period, with a model-based scoring system in which higher values reflect greater depression [17]. The 28-item bank comprised a single factor combining items from cognitive (n = 17), affective (n = 9), behavioral (n = 1) and suicidal (n = 1) conceptual domains. The current analyses use data from the extended 56-item PROMIS depression item bank. Five items were eliminated for intellectual property issues, leaving 51 items in the pool available for analysis. As an item bank relates to items that have been calibrated, the 51-item set will be referred to as an item pool.

Rasch measurement theory (RMT) analysis
The original simple logistic Rasch model [11] postulates that the probability of a positive response to a dichotomous (yes/no) item is a logistic function of the difference between the respondent's (person) location and the item's location on the measurement continuum. The Rasch model, which articulates how measurements of constructs can be derived from responses to items, defines the odds of a "yes" response as the probability of a "yes" response divided by the probability of a "no" response. Taking the natural logarithm of these odds makes the person and item locations additive in log-odds units (logits), thus transforming scores onto an interval scale [3]. The dichotomous model was subsequently extended to polytomous data, which conceptually and mathematically extend the dichotomous case [3]. Andrich [10] subsequently developed the rating scale Rasch model, the unrestricted version of which was used in this study.
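The dichotomous model described above can be sketched numerically. The following is an illustrative sketch only (the function name is ours, not part of any Rasch software), showing that the response probability depends solely on the person-item difference and that the log-odds recover that difference exactly, which is what makes the logit an interval scale:

```python
import math

def rasch_probability(person_location, item_location):
    """Probability of a 'yes' response under the dichotomous Rasch model.

    Both locations are expressed in logits; the probability depends only
    on their difference.
    """
    logit = person_location - item_location
    return math.exp(logit) / (1.0 + math.exp(logit))

# When the person and item locations coincide, the response is a coin flip.
p_equal = rasch_probability(0.0, 0.0)

# A person 2 logits above the item's difficulty is very likely to endorse it,
# and the log-odds recover the person-item difference exactly.
p_above = rasch_probability(1.0, -1.0)
log_odds = math.log(p_above / (1.0 - p_above))
```

Note that only the difference matters: a person at 2.0 logits facing an item at 0.0 has the same response probability as a person at 0.0 facing an item at −2.0.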
RMT analysis uses the Rasch model as the criterion against which scale performance is evaluated [1,3,10]. Effectively, RMT analysis examines the extent to which observed raw scores (responses to scale items) satisfy a priori criteria and match the scores expected by the Rasch model. RMT analysis is commonly used to develop and test measures in health, including psychiatry [20], with three broad aims: the evaluation of scale-to-sample targeting, scale performance, and sample measurement.
The tests and information collected to address these aims have been described in detail elsewhere [3] and in brief below.

Scale-to-sample targeting
Scale-to-sample targeting refers to the extent to which a scale's items are able to measure the sample's ability range (in this case depression). Targeting was evaluated by examining the relative person and item distributions on the same continuum [21,22], both graphically and numerically. RMT analysis orders items and respondents hierarchically according to their relative difficulty (item locations) and relative ability (person locations) on the same interval continuum of logits. The relative distributions indicate the adequacy of the sample for evaluating the scale and the adequacy of the scale for measuring the sample: the better the two ranges match each other, the greater the potential for precise person measurement.
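The numerical side of a targeting check can be sketched as follows. This sketch uses hypothetical simulated locations; in a real analysis both the person measures and the item threshold locations would come from the Rasch estimation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical person measures and item threshold locations in logits.
person_locations = rng.normal(loc=-1.0, scale=2.0, size=500)
item_thresholds = np.linspace(-3.0, 3.0, num=25)

# Persons located below the easiest threshold have no matching items: the
# scale cannot distinguish among them (the targeting gap at the floor).
floor_gap = float(np.mean(person_locations < item_thresholds.min()))

# Share of the person measurement range covered by the item threshold range.
overlap = (min(person_locations.max(), item_thresholds.max())
           - max(person_locations.min(), item_thresholds.min()))
coverage = float(overlap / (person_locations.max() - person_locations.min()))
```

A graphical review would plot the two distributions against the same logit axis; the numbers above summarize the same comparison.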

Scale performance
Five components of scale performance were examined [23].
Do the response categories work as intended?
Each PROMIS item is scored on a 5-point frequency response scale as follows: "1 = never"; "2 = rarely"; "3 = sometimes"; "4 = often"; "5 = always." Respondents with higher levels of the construct (i.e., higher depression) are expected to endorse the higher response categories, while respondents with lower levels of the construct (i.e., lower depression) are expected to endorse the lower response categories. Thresholds represent the point between two response categories (i.e., the place where a person is equally likely to endorse either of the two adjacent categories). Threshold ordering is expected to reflect the intended category ordering. Disordered thresholds signify that response categories may not be working as intended. This may in turn affect scale score interpretations and scale validity, where higher scores may not necessarily reflect higher levels of the concept of interest [24].
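The threshold logic can be sketched under Andrich's rating scale model. This is an illustrative sketch with made-up threshold values, not the RUMM2030 estimation; it shows that a threshold is exactly the point where adjacent categories are equally probable, and how threshold ordering is checked:

```python
import numpy as np

def category_probabilities(theta, thresholds):
    """Category probabilities under Andrich's rating scale model (a sketch).

    `thresholds` are the logit locations at which adjacent response
    categories are equally likely; a well-behaved item has them in
    increasing order.
    """
    thresholds = np.asarray(thresholds, dtype=float)
    # Cumulative sums of (theta - tau_k) give the category exponents;
    # category 0 has an exponent of 0 by convention.
    exponents = np.concatenate(([0.0], np.cumsum(theta - thresholds)))
    numerators = np.exp(exponents - exponents.max())  # numerically stabilized
    return numerators / numerators.sum()

def thresholds_ordered(thresholds):
    """True when every threshold lies above the previous one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))

# Illustrative threshold sets for a 5-category (never..always) item.
ordered = [-2.0, -0.5, 0.5, 2.0]
disordered = [-2.0, 0.5, -0.5, 2.0]  # 'rarely' not working as intended
```

At theta equal to the first threshold (−2.0 here), the "never" and "rarely" categories are equally likely by construction; the disordered set fails the ordering check.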

Do the items map out a continuum of depression?
An optimal scale is expected to comprise a continuum representing the construct under measurement (e.g., depression), with marked components articulated as items located at different points on the continuum [25][26][27]. The item locations are expected to cover the entire range of the construct (i.e., scale items are intended to measure, and equally represent, all levels within the construct) [23,28]. Therefore, item location, spread, proximity to each other and precision are examined in RMT analysis to determine the extent to which the set of items maps out a measurement continuum from less to more ability [21,23].

Do the items define a single construct?
Within RMT analysis, item responses are examined to assess the cohesiveness of the measurement continuum and therefore the legitimacy of the scale [23,28]. Fit statistics identify the extent to which items work together to define a single variable. Item fit statistics summarize the differences between observed and expected responses for individuals (i.e., the item-person interaction [fit residual]) and for different ability groups (i.e., the item-construct interaction [chi-square values]) [3]. Criteria suggest fit residuals should lie within the recommended range of − 2.5 to + 2.5, and chi-square values should be non-significant [3,21]. Item characteristic curves (ICCs) are graphical indicators of fit used to complement the interpretation of the fit residuals and chi-square probabilities [23,29].
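The residual logic behind item fit can be sketched for the dichotomous case. This is a simplified illustration on simulated data, not RUMM2030's actual fit residual computation (which uses the polytomous model and a more involved standardization), but the core idea of comparing observed to model-expected responses is the same:

```python
import numpy as np

def expected_score(theta, delta):
    """Expected score for a dichotomous Rasch item (a simplification; the
    paper's analysis uses the polytomous analogue)."""
    return 1.0 / (1.0 + np.exp(-(theta - delta)))

rng = np.random.default_rng(1)
theta = rng.normal(size=300)      # person locations (logits)
delta = 0.2                       # item location (logits)

p = expected_score(theta, delta)
responses = rng.binomial(1, p)    # responses simulated to fit the model

# Standardized residuals: (observed - expected) / model standard deviation.
z = (responses - p) / np.sqrt(p * (1.0 - p))

# A crude item-level summary of the residuals; a well-fitting item is
# expected to stay within the +/- 2.5 range.
item_fit = z.mean() * np.sqrt(len(z))
misfit = abs(item_fit) > 2.5
```

Because the responses here are simulated from the model itself, the item should (almost always) fall inside the recommended range; misfitting items in real data are those whose residual summary escapes it.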

Do responses to one item bias responses to other items?
RMT expects item independence, such that the response to one item should not influence or determine the response to another. High residual (observed − expected = residual) correlations highlight "local dependency," or item response bias, which can artificially inflate reliability [21,22]. Response bias was assessed against the r > 0.30 rule of thumb, which indicates more than 9% shared variance between a pair of items, suggesting local dependence [30].
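The dependency check can be sketched on simulated residuals. This hypothetical example builds two items that share a nuisance factor (as similarly worded items might) and one independent item, then applies the r > 0.30 rule of thumb:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# Hypothetical standardized residuals for three items; items A and B share
# a nuisance factor, mimicking local dependency between similar items.
shared = rng.normal(size=n)
resid_a = 0.7 * shared + rng.normal(scale=0.7, size=n)
resid_b = 0.7 * shared + rng.normal(scale=0.7, size=n)
resid_c = rng.normal(size=n)  # an independent item

def flag_dependency(x, y, cutoff=0.30):
    """Flag a pair as locally dependent when the residual correlation exceeds
    the rule-of-thumb cutoff (r > 0.30, i.e. > 9% shared variance)."""
    r = float(np.corrcoef(x, y)[0, 1])
    return r, r > cutoff

r_ab, dep_ab = flag_dependency(resid_a, resid_b)
r_ac, dep_ac = flag_dependency(resid_a, resid_c)
```

The A-B pair is flagged while the A-C pair is not; in the actual analysis this comparison is run over every item pair in the residual correlation matrix.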
Are the scale items stable between different groups?
The extent to which items are stable across different sample subgroups was assessed through differential item functioning (DIF) [3,23,31]. An item shows differential functioning if the expected responses differ for two respondents who have the same level on the measured construct but belong to two different groups (e.g., age, sex). DIF was investigated by examining the observed response differences between class intervals within groups. DIF was examined using analysis of variance (ANOVA), assessing item scores between the sample groups and across the different class intervals, where a significant p-value for differences between subgroups is taken to indicate DIF [3,24]. The performance of the items was tested for stability across gender (male/female); six age groups (18-30, 31-40, 41-50, 51-60, 61-70 and 70+); race (white/non-white); and Hispanic ethnicity (Hispanic/non-Hispanic).
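The ANOVA core of the DIF test can be sketched on hypothetical scores. This simplification compares groups unconditionally, whereas the full procedure also conditions on class intervals of the construct; the group names and cutoff below are illustrative:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(3)

# Hypothetical 1-5 item scores for two gender groups assumed matched on the
# underlying construct level; under no DIF the group means should not differ.
scores_male = rng.integers(1, 6, size=200)
scores_female = rng.integers(1, 6, size=200)

# A deliberately biased item: one group systematically scores higher at the
# same construct level -- the signature of DIF.
scores_biased_group = rng.integers(3, 6, size=200)

_, p_no_dif = f_oneway(scores_male, scores_female)
_, p_dif = f_oneway(scores_male, scores_biased_group)

# A significant p-value between subgroups is taken to indicate DIF.
shows_dif = p_dif < 0.01
```

The unbiased comparison generally yields a non-significant p-value, while the deliberately shifted group is flagged.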

Sample measurement
Two components of person measurement were examined: the Person Separation Index (PSI), and the extent to which individuals' responses were consistent with expectation. The PSI, which indicates a scale's ability to detect differences in the levels of the construct within the sample, is a numerical indicator computed as the ratio of the variation of person estimates relative to the estimated error for each person [31]. PSI scores range between 0 and 1, where a score of 0 indicates all error and a score of 1 no error [3].
The extent to which individuals' responses met the expectations of the Rasch model was examined with fit statistics. Person fit residuals summarize the difference between observed scores and expected responses for each person, i.e., the person-construct interaction [3,23]. Person fit residuals were examined with reference to the "rule of thumb" expecting 99% of the sample to produce a fit residual between − 2.5 and + 2.5. Fit residuals outside this range indicate potentially problematic measurement for those persons [3,23].
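Both components of sample measurement can be sketched on hypothetical values. In a real analysis the person estimates, their standard errors and the person fit residuals would all come from the Rasch estimation; the variance-ratio form of the PSI below is one common formulation:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 600

# Hypothetical person location estimates (logits), their standard errors,
# and person fit residuals.
person_estimates = rng.normal(loc=0.0, scale=1.5, size=n)
standard_errors = rng.uniform(0.2, 0.5, size=n)
person_fit_residuals = rng.normal(size=n)

# Person Separation Index: the share of observed variance in person
# estimates not attributable to measurement error (0 = all error, 1 = none).
observed_variance = person_estimates.var(ddof=1)
error_variance = np.mean(standard_errors ** 2)
psi = float((observed_variance - error_variance) / observed_variance)

# Proportion of persons whose fit residuals fall outside +/- 2.5; the rule
# of thumb expects no more than about 1% of the sample here.
pct_outside = float(np.mean(np.abs(person_fit_residuals) > 2.5))
```

With small standard errors relative to the spread of person estimates, the PSI approaches 1; the 17-19% of residuals outside the range reported later in this paper is well above the roughly 1% expected under the model.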

Analysis procedure
Data were analyzed using RUMM2030 [11], one of several available software programs providing graphical and statistical item analyses using Rasch unidimensional models for measurement. In line with the methods described above, the initial RMT analysis of the 28-item bank identified limited measurement and precision at the floor of the scale. We therefore hypothesized that adding new items could improve measurement performance by expanding the range of depression measured by the PROMIS items. For this, the extended pool of 51 items (i.e., the original 28 plus 23 additional items) was analyzed with RMT and its performance compared to the original 28-item bank. For brevity, results are presented in two stages: stage 1 examines the extent to which the 51 items satisfy the RMT criteria, and stage 2 the relative measurement performance of the 51-item and 28-item pools. A summed total score was calculated for both the 28- and 51-item pools. As noted above, five items were eliminated from the 56-item bank due to potential intellectual property issues.

Sample
In the sample, described elsewhere [16,17], 8% of participants were from a clinical population and 92% from a general population. Secondary analysis was performed on a subsample (n = 825) of participants aged 18-88 years (mean = 50.91, SD = 18.92). Of these, 51% were female, 79% white, and 10% Hispanic. The sample size reported between the two stages of analyses (at both scale and item level) varies slightly, as RMT analysis estimates are calculated on the basis of available data, excluding extreme scores located at the floor and ceiling of the scale.
Stage 1 RMT analysis: examination of the PROMIS 51-item pool measurement performance

Scale-to-sample targeting
Person locations ranged approximately from − 6 to 4 logits, while item threshold locations ranged approximately from − 3 to 3 logits, covering 60% of the depression range measured in the sample (Table 1). Graphical review of targeting (Fig. 1a, b) indicated item bunching and sub-optimal measurement for persons with lower depression scores. Sample measurements (n = 224, 28%) located below − 3.13 logits on the left of the continuum had no matching items.

Scale performance
Do the response categories work as intended?
Nine items displayed disordered response thresholds (Table 2), with the "rarely" response category most consistently not working as intended.
Do the items map out a depression ability continuum?

Do the items define a single construct?
Twenty items displayed fit residuals outside the recommended range, twelve of which also displayed significant chi-square values (Table 2). Graphical review of the ICCs indicated overestimation of depression for items 17, 29 and 36, and underestimation of depression (i.e., observed scores were lower than expected at higher depression-ability locations and higher than expected at lower depression-ability locations) for items 11, 15, 24, 49 and 53. Exemplar ICCs are displayed in Fig. 3.

Do responses to one item bias responses to other items?
Twelve item pairs produced residual correlations above the recommended range, suggesting possible response bias between them. The highest residual correlation (r = 0.53) identified was between items 32 and 39 (Table 2).
Are the scale items stable between different groups?

Sample measurement
The PROMIS-51 displayed high reliability and sample separation (PSI = 0.95) but suboptimal sample measurement, as 139 (17%) individuals in the sample produced fit residuals outside the recommended range.
Stage 2 RMT analysis: comparative measurement performance of the 51- and 28-item pools

Scale-to-sample targeting
Even though the 51-item pool did not resolve the targeting issues of the PROMIS 28-item pool, it did provide relative improvements to the measurement of people at the floor of the scale. Graphical review suggests the relative limitation of the 28-item pool (Fig. 1b) in measuring depression for people with lower depression scores, relative to the 51-item pool (Fig. 1a). In line with this, in the 28-item bank the sample mean is further skewed away from the item mean, floor effects rise from 3% in the 51-item pool to 7% (Table 1), and the sample measurements outside the scale range amount to 34% (n = 272), as compared to 28% (n = 224) in the 51-item pool. Figure 2 demonstrates how the additional items improved targeting, as they expanded item coverage on the left of the continuum and filled in some of the gaps in the PROMIS 28-item pool measurement continuum. This improvement is further demonstrated numerically: the relative percentage coverage of the sample measurement by item locations was 15% in the 28-item bank, compared to 27% in the 51-item pool (Table 1).

Scale performance
Compared to the 51-item pool, the 28-item pool satisfied more of the RMT criteria in relation to scale performance.

Do the response categories work as intended?
Thresholds for only one item (compared to nine in the 51-item version) were not consistently ordered sequentially, suggesting the "rarely" response category was problematic for this item (Table 3).

Do the items map out a depression ability continuum?
The item location range of the 28-item pool (− 0.68 to 1.16 logits) was approximately 1 logit narrower than that of the 51-item pool (− 1.58 to 1.18 logits) (Table 1). Item locations indicated some overlap: for example, five items (items 14, 21, 22, 48 and 50) were situated within 0.08 logits of each other, and three items (items 9, 19 and 41) within less than 0.05 logits (Table 3).

Do the items define a single construct?
Fewer items displayed misfit in the 28-item pool: five items displayed fit residuals outside the recommended range, only one of which also displayed a significant chi-square value (Table 3). The item characteristic curve for this item (Fig. 3b) indicated overestimation of depression: observed scores were higher than expected at higher depression-ability locations and lower than expected at lower depression-ability locations.

Do responses to one item bias responses to other items?
In comparison to the longer item pool, no item bias was identified in the 28-item pool, as no item pairs produced residual correlations above 0.30 (Table 3).

Are the scale items stable between different groups?
Compared to the longer item pool, the 28-item pool displayed minimal DIF: no items displayed DIF by gender, race or Hispanic ethnicity, and only one item (item 35) displayed DIF by age group (p < 0.01).

Sample measurement
Sample measurement was similar in the two PROMIS item pools. The PSI of the 28-item pool was marginally lower (PSI = 0.93) than that of the 51-item pool, and the validity of sample measurement marginally worse, as the percentage of person fit residuals outside the recommended range rose to 19% (Table 1).

Discussion
This study was part of a parallel exercise aiming to compare three different psychometric paradigms (CTT, IRT, RMT), and explored the psychometric properties of the PROMIS depression item bank [17] using RMT [3,10]. RMT analysis focuses on the examination of scale-to-sample targeting, scale performance and sample measurement, exploring the disparities between the observed scores and those expected by the Rasch model [3]. Using this approach, improvements were observed with the 51-item version of the PROMIS depression inventory compared to the original 28-item version. However, RMT also revealed shortcomings of the extended item set, specifically suboptimal sample-to-scale targeting. RMT evaluation of the PROMIS depression 28-item bank indicated overall adequate scale performance, but importantly revealed suboptimal targeting. More than a third of the sample was located at the lower end of the depression continuum, where few items were available to capture the individuals in this range. In other words, the 28-item bank is not well targeted for those with lower depression, for whom interpretations using the PROMIS depression would be associated with limited precision (i.e., higher standard error). This was further supported by the high percentage of person fit residuals outside the recommended range, indicating problematic measurement for nearly 20% of the sample. The extended 51-item PROMIS depression item pool was also analyzed to examine whether the additional 23 items improved the sample-to-scale targeting by extending the item continuum range. Although the 51-item pool showed improved targeting, the problems were not resolved. In fact, the additional items impacted negatively on the measurement properties of the item bank. Specifically, the extended item pool appeared less statistically cohesive, suggesting the set of items was not measuring a single construct.
In addition, evidence suggested item dependency and problems with the ordering of item response thresholds.
Despite the shortcomings revealed, the approach suggests possible ways to remedy the existing scale. Exploration of the construct at the lower end of the continuum, and the identification of items to capture the concept of interest for the population under investigation, may optimize the scale for its intended use. We hypothesize that a conceptually driven effort to fill in the item gaps would improve the content validity of both the PROMIS depression 28-item and 51-item pools [1,6]. Review of the item development process indicates that the initial pool of items was inductively categorized into conceptual domains (including 46 depression facets) before undergoing further standardization and calibration [17]. However, as Pilkonis et al. (2011) note, certain trade-offs were necessary to satisfy the assumption of unidimensionality and a single-factor scale, a requirement of the IRT models used to calibrate and evaluate these items [17]. Subsequent focus groups were conducted with patients with major depression [16]. In addition to the issue of multiple conceptual domains, the diversity of items within the PROMIS depression item bank fails to match the diversity of concepts within the conceptual framework [17]. As reported by Pilkonis et al. (2011), only the proportion of affective items remained consistent, whereas the proportion of cognitive items almost doubled, and behavioral and somatic items were removed entirely as they did not perform well within the single-factor unidimensional depression scale. Our findings suggest that conceptual breadth may have been sacrificed in favor of statistical fit.
The recommended next step within the RMT paradigm involves an evidence-based conceptual review of the item bank's content to sufficiently and comprehensively define the concept of interest under measurement for the patient population for whom it will be used. RMT analysis provides statistical evidence for a scale's measurement properties; however, this does not guarantee a scale's content validity, as statistical cohesiveness in measuring a construct does not specify what the construct is or how it is conceptualized by the population with whom the scale is intended to be used [1,4,6,32,33]. Statistical tests of a PRO instrument's validity can therefore mislead if the intended construct is not targeted sufficiently. A conceptually driven empirical assessment of content validity, driven by the population to whom the PRO is to be administered, would improve the ability of the PROMIS depression items to quantify the construct under measurement in this study sample.
In addition, outside any psychometric paradigm but in line with clinical outcome assessment development guidelines, our findings further indicate that PROMIS depression psychometric analysis could benefit from refining the context of use [4]. RMT analysis showed that the range of PROMIS depression items matched well the range of depression reported by persons at the higher end of the continuum. These findings could therefore be interpreted as supporting the use of the item bank in populations with higher levels of depression. Contrary to this finding, item calibration for the PROMIS depression item banks was completed on a sample the vast majority of which (92%) was recruited from the general rather than a clinical population [17]. PROMIS depression psychometric analysis would therefore benefit from empirical exploration of the concept of interest in a predefined and specific context of use [4].

Conclusions
In this study, RMT analysis supported the statistical scale performance of the PROMIS depression scales, but also identified targeting and sample measurement limitations.
In practice, diagnosed anomalies may be resolved by further exploring the data and interpreting the disparity between expected and observed scores of an evaluated scale [3]. Within the RMT paradigm, rating scales and constructs are modified and, if necessary, more data are collected. In this respect, the application of RMT to the PROMIS depression item bank suggests that measurement could benefit from further examination of the construct under measurement, and specifically the concept of interest and context of use.
Abbreviations CTT: Classical test theory; IRT: Item response theory; PROMIS: Patient-reported outcomes measurement information system; RMT: Rasch measurement theory