Expected a posteriori scoring in PROMIS®

Abstract

Background

The Patient-Reported Outcome Measurement Information System® (PROMIS®) was developed to reliably measure health-related quality of life using the patient’s voice. To achieve these aims, PROMIS utilized Item Response Theory methods in its development, validation and implementation. PROMIS measures are typically scored using a specific method to calculate scores, called Expected A Posteriori estimation.

Body

Expected A Posteriori scoring methods are flexible, produce accurate scores and can be efficiently calculated by statistical software. This work seeks to make Expected A Posteriori scoring methods transparent and accessible to a larger audience through description, graphical demonstration and examples. Further applications and practical considerations of Expected A Posteriori scoring are presented and discussed. All materials used in this paper are made available through the R Markdown reproducibility framework and are intended to be reviewed and reused. Commented statistical code for the calculation of Expected A Posteriori scores is included.

Conclusion

This work seeks to provide the reader with a summary and visualization of the operation of Expected A Posteriori scoring, as implemented in PROMIS. As PROMIS is increasingly adopted and implemented, this work will provide a basis for making psychometric methods more accessible to the PROMIS user base.

Introduction

The Patient-Reported Outcome Measurement Information System® (PROMIS®) [1] is a disease-agnostic measurement system of health-related quality of life that utilizes Item Response Theory (IRT). PROMIS was originally created to leverage the benefits of IRT and Computer Adaptive Testing (CAT) to minimize patient response burden while maximizing measurement reliability. PROMIS measures have been shown to be reliable, valid and accurate in a variety of conditions and contexts [2,3,4,5,6,7]. Over the past fifteen years, there has been substantial development, adoption and implementation of PROMIS [8, 9]. Such efforts have leveraged IRT to increase the accessibility of PROMIS measures and aid their interpretation, including T-score maps [10] and “linking” between non-PROMIS and PROMIS measures [11].

This paper aims to make PROMIS IRT scoring methods accessible to a broader audience of users who have a basic statistical background by supplementing foundational psychometric literature with non-technical descriptions and illustrative graphics. To the same end, this paper was created in the reproducibility framework of R Markdown [12]. An R Markdown document (.rmd) contains both commented statistical code and the explanatory text in this document. Both the text and the statistical code for scoring are intended to be reviewed and implemented by the reader. Included in the appendices of this paper is a set of annotated statistical programming scripts for scoring PROMIS measures.

IRT foundations

The IRT methods employed in PROMIS and their foundations were developed 70–90 years ago [13,14,15,16,17] and have been used extensively in the educational field. Over the past two decades, researchers have also shown how IRT can be applied to patient-centered outcomes generally [18, 19] and documented how IRT has been applied in PROMIS specifically [8, 9]. This paper reviews the foundations of IRT in PROMIS only briefly and instead provides a focused demonstration of PROMIS scoring methods.

Response option probabilities

Building block of IRT scoring

IRT ranks individuals and their responses to survey items across a latent trait, such as fatigue. Just as two different people might have different levels of fatigue, two different sets of responses to survey items relate to two different levels of fatigue. IRT allows us to infer where an individual most likely ranks on a latent trait continuum. The inference of where an individual ranks on a latent trait is made by transforming an individual’s response to survey items (e.g., “I feel tired” with response options Never, Sometimes, and Always) to a set of probabilities across all levels of the latent trait. Each probability in the set represents the likelihood that an individual with the selected response options has a particular level of the latent trait. Expected A Posteriori scoring reduces these probability sets to a single point estimate of the latent trait (i.e., a score) and provides an estimate of the variability and reliability of that point estimate (i.e., standard deviation or standard error).

Two things are required to calculate these probabilities for a PROMIS measure: item calibration parameters, such as those shown in Table 1, and the two parameter logistic IRT model shown in Formula (1). The calibration parameters represent the relationship between a sample of individuals, their responses to a set of survey items and the latent trait. The formula allows a mathematical transformation of an individual’s response to an item to a set of probabilities across the spectrum of the latent trait.

Table 1 IRT Calibration Statistics for PROMIS Fatigue item FATEXP42: How much mental energy did you have on average?

In Formula (1) we can see the calibration parameters, annotated as “discrimination” and “threshold.” Each item has one discrimination calibration parameter and a number of threshold calibration parameters equal to the number of response options minus one. The subscript “i” in Formula (1) indicates that these parameters vary by item, and the subscript “k” indicates that there are multiple thresholds per item. An example is PROMIS Fatigue item FATEXP42 (In the past 7 days, how much mental energy did you have on average?), which has five response options (Not at all, A little bit, Somewhat, Quite a bit, and Very much). It follows that FATEXP42 has one discrimination calibration parameter (abbreviated “a”) and four threshold calibration parameters (abbreviated and numbered from “cb1” to “cb4”). The item calibration parameters for FATEXP42 are provided here in Table 1 for reference.

The remaining undefined variable in Formula (1) is “theta,” which refers to the latent trait being measured (e.g., fatigue or physical functioning). Theta is actualized as a single number for an individual level of latent trait, ranging from negative infinity to infinity. Theta is constructed based on the population included in the calibration sample and is often scaled to have a mean center of 0 and a standard deviation of 1. For the PROMIS Profile measures (that include anxiety, depression, fatigue, pain, sleep disturbance, physical function, and satisfaction with participation in social roles), Cella and Liu [1, 20] provide a picture of the people representing the PROMIS calibrations and metric. An individual’s theta score represents their level of latent trait in the context of the sample that was used to generate the calibration parameters.

For the purposes of calculation, the range of theta is limited to ± 4, with a higher theta relating to more of what is being measured, e.g., higher PROMIS Fatigue theta values relate to more fatigue or higher PROMIS Physical Function theta values relate to better physical functioning.

$$Probability = \frac{1}{{1 + e^{{ - 1{*}discrimination_{i} \left( {theta - threshold_{ik} } \right)}} }}$$
(1)

Once we evaluate Formula (1) for all levels of theta (e.g., − 4 to 4) and for all item calibration parameters provided in Table 1, we can create a set of probability curves that represent each item’s response options. Figure 1 shows an example of how the response options of FATEXP42 are ordered across levels of theta (level of fatigue), with response option Not at all having higher probabilities at lower levels of theta (lower fatigue), and response option Very much having higher probabilities at higher levels of theta (higher fatigue).

Fig. 1 Response option probabilities across theta for PROMIS Fatigue item FATEXP42

This paper demonstrates how PROMIS measures are scored using graphical representations of probability curves, such as those in Fig. 1. To aid interpretation, these probability curves are plotted with consistent formatting styles. All colors used in figures were selected from the colorBlindness package in R [21].

Figure 2 provides a more detailed example of how Formula (1) and FATEXP42’s item calibration parameters can be used to generate sets of probabilities and plot what are referred to as item characteristic curves. The black curves in the top graph of Fig. 2 are calculated with Formula (1) and the calibration parameters in Table 1 and are labeled as the “probability associated with a threshold parameter across theta” or Pcb1(Theta) to Pcb4(Theta). These curves represent the probability that a respondent at a given level of theta would endorse any response option above a given response option, e.g., Pcb2(Theta) represents the probability that an individual would endorse the third, fourth or fifth response option (Somewhat, Quite a bit, and Very much), but not the first or second response option (Not at all and A little bit). Each threshold parameter (e.g., cb1 = − 1.26) represents the value of theta where its corresponding threshold probability curve reaches 0.5, as represented by the intersection of the dotted black horizontal line and the vertical line segments underneath the threshold probability curve labels.

Fig. 2 Calculation of response option probabilities across theta using the graded response model calibrations from PROMIS Fatigue item FATEXP42

The bottom plot of Fig. 2 presents the same item characteristic curves as Fig. 1, but with the response option probability curves labeled with their calculations. To isolate the probability associated with an individual response option, we calculate a set of probability differences between the probability curves of adjacent thresholds [e.g., Pcb2(Theta) − Pcb3(Theta)]. The last threshold probability curve, Pcb4(Theta), does not have an adjacent threshold probability curve above it because the item FATEXP42 does not have a response option greater than the fifth (Very much). To calculate the probability associated with the fifth and highest response option, we subtract 0 from Pcb4(Theta). In other words, the probability associated with a respondent endorsing the fifth and highest response option is equal to the probability that a respondent will endorse any response option above the fourth, Pcb4(Theta), minus the impossibility (0 probability) that a participant will endorse a response option higher than the fifth and highest. The first threshold probability curve, Pcb1(Theta), does not have another threshold probability curve below it. To calculate the probability associated with the lowest response option (Not at all), we subtract Pcb1(Theta) from 1. In other words, the probability of a respondent endorsing the lowest response option is equal to the certainty (1 probability) that a participant will endorse any response option minus the probability that a respondent will endorse a response option above the first and lowest, Pcb1(Theta).

The procedure of subtracting adjacent threshold probability curves to obtain the probability curves of individual response options is reflected in the graded response model, Formula (2). To generate probabilities, we find the difference between two equations, one with threshold “k” and the other with threshold “k + 1”. The graded response model formula is the companion equation for interpreting PROMIS item calibration statistics and calculating probabilities. Although originally published by Samejima, the graded response model is explained in more accessible terms by Reeve, Chang, Fayers and Embretson [13, 19, 22, 23].

$$Probability = \frac{1}{{1 + e^{{ - 1{*}a_{i} \left( {theta - cb_{ik} } \right)}} }} - \frac{1}{{1 + e^{{ - 1{*}a_{i} \left( {theta - cb_{ik + 1} } \right)}} }}$$
(2)
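As an illustration of Formulas (1) and (2), the following is a minimal Python sketch, not the paper's R code. The discrimination and threshold values are hypothetical placeholders rather than the published Table 1 calibrations, and the function names `threshold_prob` and `category_probs` are invented for this example. The list of threshold curves is padded with 1 and 0 so that the lowest and highest response options follow the same subtraction rule as the middle ones.

```python
import math

def threshold_prob(theta, a, cb):
    """Formula (1): probability of responding above threshold cb at theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - cb)))

def category_probs(theta, a, thresholds):
    """Formula (2): differences between adjacent threshold curves.

    The curves are padded with 1 (certainty of endorsing some option)
    and 0 (impossibility of exceeding the highest option).
    """
    curves = [1.0] + [threshold_prob(theta, a, cb) for cb in thresholds] + [0.0]
    return [curves[k] - curves[k + 1] for k in range(len(curves) - 1)]

# Hypothetical calibration values, NOT the published Table 1 parameters.
a = 2.0
thresholds = [-1.26, -0.30, 0.60, 1.50]

# Probabilities of the five response options at theta = 0.
probs = category_probs(0.0, a, thresholds)
print([round(p, 3) for p in probs])
```

Because each option's probability is a difference of adjacent curves, the five probabilities sum to 1 at every level of theta, and each threshold curve equals 0.5 exactly at its own threshold.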

Expected a posteriori scoring

How do we go from IRT probabilities to scores?

IRT provides probability-based modeling to evaluate item- and scale-level characteristics for scale development, but we can also use IRT to find an estimate of where an individual is on the theta spectrum. In other words, we can score individuals on the latent trait. PROMIS scores are reported on the “T-score” metric, which is a linear transformation of the standardized theta scores, as shown in Formula (3). This paper reports scores on either the standardized z-score metric (labeled "theta") or the T-score metric.

$$T\text{-}score = \left( {theta \times 10} \right) + 50$$
(3)

As a score calculation example, we will again use PROMIS Fatigue item FATEXP42 (In the past 7 days, how much mental energy did you have on average?). See the item response option probability curve for the second response option (A little bit) in Fig. 3. A logical IRT score is the most probable level of theta, also known as the maximum likelihood of theta. Using this method, an individual who selected the second response option of FATEXP42 would be assigned a maximum likelihood score of − 0.2 theta or a T-score of 48, as shown in Fig. 3.

Fig. 3 A little bit response option probabilities across theta for PROMIS Fatigue item FATEXP42

This simple example has two problems, however. The first problem comes from a practical issue in measurement and the second stems from mathematical limitations. The practical measurement issue is that we are unable to differentiate individuals at the extreme ends of our measurement scale, which occurs when respondents select the absolute highest or lowest response option in an item (e.g., Never or Always). Using another fatigue item as an example, in FATEXP29 (In the past 7 days, how often did you feel totally drained?) the extreme response of Never is likely selected by people with very different experiences of fatigue: Never would be selected by a respondent with low-level fatigue (e.g., felt slightly, but not totally, drained over the past week), by a respondent who didn’t experience fatigue (e.g., didn’t feel drained at all over the past week) and by a respondent who had unusually high energy over the past week. While the extreme response option of Never is selected by all three respondents for this item, we can be more certain that respondents with even less fatigue (or more energy) are increasingly likely to pick the Never response option.

This is also true for the other extreme response option, Always. A response of Always is likely to be selected by a respondent who just had a totally draining week, by a respondent who had a totally draining month, or by a respondent who had a totally draining year. The inability of an item or scale to distinguish between extreme levels is a measurement property known as the “floor” and “ceiling” effect [24]. The response probability curves of the extreme response options show these floor and ceiling effects (Fig. 4). Probabilities of these extreme response categories are assumed to be monotonic, meaning they have a constantly increasing probability of being selected with increasingly extreme levels of theta, and there is no single point of maximum likelihood for us to use as a score.

Fig. 4 Extreme response option probabilities across theta for PROMIS Fatigue item FATEXP42

The second problem related to mathematical limitations is the infinite range that we assume exists for the latent trait (theta). All response option probability curves are asymptotic, meaning the probability curves expand over an infinite range of theta and never reach probability values of either 0 or 1. It is mathematically complex and computationally costly to perform calculations in an infinite range.

To solve these two problems, we use an IRT scoring mechanism called “Expected A Posteriori” (EAP) scoring [15, 25]. This form of scoring works by imposing constraints on how we calculate probabilities. The first constraint comes from limiting the infinite theta space to a “quadrature,” which can be visualized in Fig. 5 as a set of evenly spaced points on a number line or x axis between two bounds. Boundaries of − 4 theta to 4 theta (T-scores of 10–90), with 0.1 theta increments (1 T-score point), are used. Because theta can be interpreted in standard deviation units of the population, a range of − 4 to 4 theta encompasses 99.994% of people.
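The quadrature described above can be sketched in a few lines of Python (an illustration only, not the paper's R code). The variable names `theta_grid`, `t_grid` and `coverage` are chosen for this example; the population coverage is computed from the standard normal distribution using the error function.

```python
import math

# Quadrature: 81 evenly spaced theta points from -4 to 4 in steps of 0.1.
theta_grid = [round(-4.0 + 0.1 * i, 1) for i in range(81)]

# The same points on the T-score metric, using Formula (3).
t_grid = [round(theta * 10 + 50, 1) for theta in theta_grid]

# Share of a standard normal population lying between -4 and 4.
coverage = math.erf(4.0 / math.sqrt(2.0))

print(len(theta_grid), theta_grid[0], theta_grid[-1])
print(t_grid[0], t_grid[-1])
print(round(coverage * 100, 3))
```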

Fig. 5 Theta quadrature as number line

The quadrature stops the constant growth of the extreme response option’s probability curve at its limits ( − 4 to 4), which means that a ‘maximum likelihood’ theta score for an extreme response option will be the same as the quadrature limit. Expanding or shrinking the limits of the quadrature (e.g., − 6 to 6 or − 2 to 2) will increase or decrease the scores of extreme response options. An individual who endorses an extreme response option would receive different theta scores only due to the choice of quadrature limits, not any real difference in the latent trait (e.g., fatigue).

EAP scoring uses a “prior” in the calculation of scores to address this problem. Generally, a prior is a Bayesian concept that refers to our best guess of an individual’s theta score before they’ve selected a response option [24]. The EAP scoring prior used in PROMIS is a normal distribution which reflects the population mean (μ = 0) and standard deviation (σ = 1). It is a reasonable assumption that any individual is a member of the population.

After multiplication of the item characteristic curve by the normal prior probability curve, the extreme response probability curve is reshaped, repositioned and called the “posterior probability.” The new posterior probability curve is pulled back from the quadrature limit and is no longer monotonic: instead it looks like the normal curve of the prior. The amount of the lateral repositioning of the posterior (and movement of the maximum likelihood score) away from the quadrature limit is a function of the area under the curve of the original extreme response option probability and the area under the curve of the prior.

Figure 6 shows a graphical example of the new posterior curve. In Fig. 6, the dashed purple line represents the response probabilities from FATEXP42’s extreme response option (Not at all), the solid green line represents the prior probability curve, and the bold solid orange line represents the new posterior probability curve with a maximum probability of − 0.87. The posterior (bold solid orange) can be visualized as ‘splitting the difference’ between the probability curves of the extreme response option (dashed purple) and the prior (solid green).

Fig. 6 Calculation of extreme response option probabilities across theta for PROMIS Fatigue item FATEXP42

The bottom half of Fig. 6 shows the calculation of the posterior probability curve using the theta quadrature. At each increment on the theta quadrature (− 4 to 4 by increments of 0.1), the response option probability is multiplied by the prior probability. For example, at a theta of − 1, the response probability of 0.407 is multiplied by a prior probability of 0.242, which equals a posterior probability of 0.099. The posterior probabilities are shrunk by the multiplication of decimals, but we are only concerned with the location of the point estimate that we’ll use as an EAP score. Without the theta quadrature, integral calculus would be required to multiply the prior and the response option probability curves.

Figure 7 further demonstrates the method for calculating a single theta score from posterior probabilities across the theta quadrature. The quadrature again allows us to use simple multiplication in lieu of calculus, by multiplying posterior probabilities at each theta increment by their corresponding theta level to create a set of theta weighted posterior probabilities, e.g., theta of − 2 multiplied by a posterior probability of 0.04 equals a weighted probability of − 0.08. Dividing the sum of the weighted posterior probabilities ( − 1.82) by the sum of the posterior probabilities (2.08) gives us the final theta estimate ( − 0.87).
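The two steps described for Figs. 6 and 7 (multiply by the prior, then take the posterior-weighted average of theta) can be sketched as follows. This is a minimal Python illustration, not the paper's R code, and the item parameters `a` and `cb1` are hypothetical, so the resulting score is not expected to reproduce the − 0.87 reported for FATEXP42.

```python
import math

def threshold_prob(theta, a, cb):
    """Formula (1): probability of responding above threshold cb at theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - cb)))

def normal_pdf(theta):
    """Standard normal prior density (mean 0, standard deviation 1)."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

theta_grid = [round(-4.0 + 0.1 * i, 1) for i in range(81)]

# Hypothetical item parameters; the lowest option has probability 1 - P_cb1.
a, cb1 = 2.0, -1.26
likelihood = [1.0 - threshold_prob(t, a, cb1) for t in theta_grid]
prior = [normal_pdf(t) for t in theta_grid]

# Posterior: response option probability times prior at each quadrature point.
posterior = [lik * pri for lik, pri in zip(likelihood, prior)]

# EAP score: sum of theta-weighted posterior divided by sum of posterior.
weighted = [t * post for t, post in zip(theta_grid, posterior)]
eap = sum(weighted) / sum(posterior)

print(round(eap, 2), round(eap * 10 + 50, 1))
```

Without the prior, the likelihood of the lowest option keeps rising toward − 4; with it, the score lands between the quadrature limit and the population mean of 0.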

Fig. 7 Calculation of theta for PROMIS Fatigue item FATEXP42, Not at all response option only

We originally introduced the prior into the scoring calculation in order to circumvent problems with extreme responses. However, in order to make sure that scores from all response options (extreme or not) are comparable, the prior is used in calculating all scores. This is also true for scores calculated from multiple items.

To calculate a single score from an individual’s responses to multiple items, we combine the probability curves through multiplication. This operation is analogous to calculating the joint probability of two independent events, e.g., the probability of obtaining two heads from two coin flips is calculated as 0.5 × 0.5 = 0.25. A combined probability can then be multiplied by the prior to obtain a posterior probability.

In calculating a score from multiple items, we multiply all response probabilities together, and then multiply by the prior to generate a single set of posterior probabilities, as in Fig. 8 below. Figure 8 uses two response option probabilities from PROMIS Physical Function items PFA56 (Are you able to get in and out of a car?) and PFC46 (Are you able to transfer from a bed to a chair and back?). The calibration statistics for PFA56 and other PROMIS Physical Function items mentioned in this work can be found in the first Table of Rose et al. [26] without the “PF” item code prefix, e.g., “A56” is the same as “PFA56.”

Fig. 8 Expected A Posteriori scoring with multiple PROMIS Physical Function items

The probabilities in the graph of Fig. 8 are scaled to make the posterior probability curve more visible. The dashed purple line represents the scaled response probabilities for the extreme response option of PFC46 (Unable to do), and the dot-dashed brown line represents the scaled response probabilities of PFA56 (With some difficulty). The solid green line represents the scaled prior probabilities and the bold solid orange line represents the scaled probabilities of the posterior. The process for calculating a single theta score from multiple items is the same as in the single-item example in Fig. 7.
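A multi-item version of the calculation can be sketched as below. This is an illustrative Python example rather than the paper's R code: the two sets of calibration values are hypothetical (they are not the PFA56 or PFC46 parameters from Rose et al.), and `eap_score` is a name invented here. Responses are given as 0-based option indices, with 0 as the lowest option.

```python
import math

def category_probs(theta, a, thresholds):
    """Graded response model: probability of each response option at theta."""
    curves = [1.0]
    curves += [1.0 / (1.0 + math.exp(-a * (theta - cb))) for cb in thresholds]
    curves.append(0.0)
    return [curves[k] - curves[k + 1] for k in range(len(curves) - 1)]

def normal_pdf(theta):
    """Standard normal prior density (mean 0, standard deviation 1)."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

def eap_score(responses, items, theta_grid):
    """EAP score from 0-based option indices and (a, thresholds) pairs."""
    posterior = []
    for theta in theta_grid:
        value = normal_pdf(theta)  # the prior enters the product once
        for option, (a, thresholds) in zip(responses, items):
            value *= category_probs(theta, a, thresholds)[option]
        posterior.append(value)
    weighted_sum = sum(t * p for t, p in zip(theta_grid, posterior))
    return weighted_sum / sum(posterior)

theta_grid = [round(-4.0 + 0.1 * i, 1) for i in range(81)]

# Hypothetical calibrations for two five-option items (not published values).
item_a = (2.5, [-2.0, -1.2, -0.5, 0.3])
item_b = (3.0, [-2.6, -1.9, -1.1, -0.4])

score_ab = eap_score([2, 0], [item_a, item_b], theta_grid)
score_ba = eap_score([0, 2], [item_b, item_a], theta_grid)  # items reversed
print(round(score_ab, 2), round(score_ab * 10 + 50, 1))
```

Scoring the same two responses with the items supplied in the opposite order gives the same result up to floating-point rounding, since the posterior is a plain product.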

Practical considerations of EAP scoring

There are three practical considerations of EAP scoring: one related to the ordering of items, one related to score resolution, and one related to the bias of the prior.

Figure 8 shows that simple multiplication can be used to combine IRT response probabilities of multiple items. A property of multiplication is that any order or arrangement of multiplications has the same result (e.g., 1 × 2 × 3 = 3 × 2 × 1). Consequently, the order of items doesn’t matter in score calculation; item responses combined in any order will result in the same score.

The insensitivity to item order in IRT scoring also means that the resolution of scores increases exponentially with the number of items answered. One item with five response options has 5 possible IRT scores (5¹ = 5), two items have 25 possible IRT scores (5² = 25) and three items have 125 possible IRT scores (5³ = 125). This is a large increase in score resolution over raw sum scoring methods, in which the same three items have only 13 possible sum scores, ranging from 3 to 15. Greater score resolution allows scores to be more sensitive to an individual’s responses to a set of items and is a component of score precision.
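The counts in the paragraph above can be checked by enumerating response patterns directly. The short Python sketch below is an illustration only; it counts distinct response patterns, which equals the number of possible pattern scores on the assumption that no two patterns happen to produce the same EAP value.

```python
from itertools import product

options = range(1, 6)  # raw scores 1 to 5 for a five-option item

resolution = {}
for n_items in (1, 2, 3):
    patterns = list(product(options, repeat=n_items))
    sum_scores = {sum(pattern) for pattern in patterns}
    # (number of response patterns, number of distinct raw sum scores)
    resolution[n_items] = (len(patterns), len(sum_scores))

for n_items, (n_patterns, n_sums) in resolution.items():
    print(n_items, n_patterns, n_sums)
```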

As shown in Fig. 6, multiplication by the prior biases an EAP score inward. However, since the prior is only multiplied once in the calculation of the posterior, its influence on the final EAP score will shrink as more items are added to the calculation.
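One way to see the shrinking influence of the prior is to score a respondent who chooses the highest option on 1, 2, 4 and 8 identical items. The Python sketch below is illustrative only: the item parameters are hypothetical, identical items are a simplification, and `eap_all_top` is a helper invented for this example.

```python
import math

def top_option_prob(theta, a, cb_last):
    """Probability of the highest option: the last threshold curve minus 0."""
    return 1.0 / (1.0 + math.exp(-a * (theta - cb_last)))

def normal_pdf(theta):
    """Standard normal prior density (mean 0, standard deviation 1)."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

def eap_all_top(n_items, a, cb_last, theta_grid):
    """EAP score when the highest option is chosen on n identical items."""
    posterior = [
        normal_pdf(t) * top_option_prob(t, a, cb_last) ** n_items
        for t in theta_grid
    ]
    return sum(t * p for t, p in zip(theta_grid, posterior)) / sum(posterior)

theta_grid = [round(-4.0 + 0.1 * i, 1) for i in range(81)]

# Hypothetical discrimination and last threshold (not published values).
a, cb_last = 2.0, 1.5

scores = {n: eap_all_top(n, a, cb_last, theta_grid) for n in (1, 2, 4, 8)}
for n, score in scores.items():
    print(n, round(score, 2))
```

The prior contributes one factor to the product regardless of the number of items, so each additional extreme response moves the score further from the prior mean of 0.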

For these reasons, this paper doesn’t recommend EAP scoring with fewer than 3 items. The shortest PROMIS Profile short form has 4 items and adult PROMIS CAT will administer 4 items as the standard minimum. There are a few PROMIS short forms with fewer than 4 items, including the 2-item Global Physical and Mental Health scales [27].

Raw sum score to IRT look-up table scoring

The previous sections demonstrate that Expected A Posteriori scoring is flexible and can be efficiently calculated by computers, but requires both statistical coding and calibration parameters to generate scores from item responses. For PROMIS users who do not have access to statistical code or calibration parameters, the HealthMeasures Scoring Service (https://www.assessmentcenter.net/ac_scoringservice) allows users to upload their data to be scored with EAP scoring methods. An alternative to the HealthMeasures Scoring Service is a “look-up” table to convert a raw sum score to an EAP score. The scores in these look-up tables are calculated with EAP methods and represent the most probable theta level across all possible response pattern combinations for a single scale-level sum score [14, 28]. The maximum and minimum scale-level scores in the table relate to the floor and ceiling of the scale. Table 2 shows an example look-up table.

Table 2 IRT to Raw Sum Score Look-up Table

Figure 9 shows an example of how an EAP score for a raw sum score of 4 in Table 2 is calculated. In this example, three Physical Function items (PFA51, PFB25 and PFC46) make up a three-item scale. The minimum possible score on the three-item scale is 3 (all three items have a raw score of 1) and the maximum is 15 (all three items have a raw score of 5), as shown in Table 2.

Fig. 9 Calculation of most likely IRT score for raw sum score of 4

To calculate the EAP score for a scale-level raw sum score of 4, we first calculate the theta probabilities for each of the three possible combinations that sum to 4. Each response combination includes two 1’s and one 2, i.e., 1,1,2; 1,2,1; 2,1,1. Each of the response pattern probability curves is shown in the top three plots of Fig. 9. The total probability of multiple mutually exclusive events (in this case, three distinct response patterns that each have a sum score of 4) can be found by summation, shown at the bottom of Fig. 9. The center plot in Fig. 9 shows each scaled probability curve, including the three dotted, dashed and dot-dashed response pattern probability curves, and their sum multiplied by the prior. The result is a posterior probability curve (bold solid orange line) with a theta maximum likelihood of − 3.36 for all response combinations which sum to 4.
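The look-up calculation can be sketched by summing the pattern probabilities that share a raw sum score before multiplying by the prior. The Python example below is an illustration under stated assumptions, not the paper's R code: the three sets of calibration values are hypothetical (not the PFA51, PFB25 or PFC46 parameters), so the output will not match Table 2, and the helper names are invented for this sketch.

```python
import math
from itertools import product

def category_probs(theta, a, thresholds):
    """Graded response model: probability of each response option at theta."""
    curves = [1.0]
    curves += [1.0 / (1.0 + math.exp(-a * (theta - cb))) for cb in thresholds]
    curves.append(0.0)
    return [curves[k] - curves[k + 1] for k in range(len(curves) - 1)]

def normal_pdf(theta):
    """Standard normal prior density (mean 0, standard deviation 1)."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

theta_grid = [round(-4.0 + 0.1 * i, 1) for i in range(81)]

# Hypothetical calibrations for a three-item scale (not published values).
items = [
    (2.5, [-2.0, -1.2, -0.5, 0.3]),
    (3.0, [-2.6, -1.9, -1.1, -0.4]),
    (2.2, [-1.5, -0.8, 0.0, 0.9]),
]

def patterns_for(sum_score):
    """All raw-score patterns (1 to 5 per item) that add up to sum_score."""
    all_patterns = product(range(1, 6), repeat=len(items))
    return [p for p in all_patterns if sum(p) == sum_score]

def pattern_likelihood(pattern, theta):
    """Joint probability of one response pattern at theta."""
    value = 1.0
    for raw, (a, thresholds) in zip(pattern, items):
        value *= category_probs(theta, a, thresholds)[raw - 1]
    return value

def lookup_eap(sum_score):
    """EAP score for a raw sum score: summed pattern likelihoods times prior."""
    patterns = patterns_for(sum_score)
    posterior = [
        normal_pdf(t) * sum(pattern_likelihood(p, t) for p in patterns)
        for t in theta_grid
    ]
    return sum(t * w for t, w in zip(theta_grid, posterior)) / sum(posterior)

lookup_table = {s: lookup_eap(s) for s in range(3, 16)}
for sum_score, theta in lookup_table.items():
    print(sum_score, round(theta, 2), round(theta * 10 + 50, 1))
```

For a raw sum score of 4, `patterns_for(4)` returns the three patterns named in the text, and their likelihoods are added because a respondent can produce only one of them.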

In order to differentiate between the two forms of scoring, one is referred to as “response pattern scoring” or “pattern response scoring” because it uses an individual’s pattern of responses and the other is referred to as “look-up table scoring.” Scores calculated for a look-up table are typically very highly correlated (e.g., > 0.9) with response pattern scoring. Figure 10 shows a plot of look-up and pattern response scoring methods for all response option combinations of the three physical function items used in Table 2 and Fig. 9. The two scoring methods have a Pearson correlation coefficient of 0.96.

Fig. 10 Comparison of look-up and pattern response Expected A Posteriori scoring methods

It is important to recognize that the relative ease of use of look-up tables is balanced by a loss in resolution in comparison to pattern response scoring. Look-up scoring treats responses of equal raw score values (1, “Unable to do”) as equal, even if the responses relate to items of unequal difficulty (“Are you able to go for a walk of at least 15 min?” and “Are you able to run or jog for two miles (3 km)?”). This results in score differences or error between the pattern response scoring and look-up table methods. The choice of implementing pattern response scoring or look-up scoring should reflect the context of measurement (e.g., regulatory decision making) and the corresponding level of precision needed. Pattern response scoring methods are more sensitive to an individual’s pattern of responses and are recommended whenever possible; where appropriate, look-up table scoring is a good alternative.

Posterior standard deviation and standard error

Because of the inclusion of the prior in estimating the theta score, EAP scores don’t have a traditional standard error. Instead, we can calculate the standard deviation of the posterior distribution. The method for calculating the posterior standard deviation is the same for both pattern response and look-up table scoring methods. Formula (4) details the calculation of the posterior standard deviation.

There are parallels between the posterior standard deviation in Formula (4) and the common standard deviation in Formula (5). Notably, the numerator of both formulas is driven by the sum of squared deviations from a single point, either the EAP score in Formula (4) or the mean in Formula (5), and both formulas use a square root. They differ in that the squared deviation at each level of the theta quadrature is multiplied by the posterior probability before summation in Formula (4), and in that the denominator is the sum of the posterior distribution in Formula (4) but the sample size in Formula (5).

$$Posterior\;\;SD = \sqrt {\frac{{\sum \left( {Posterior*\left( {Theta\;\;Quadrature - EAP\;\;Score } \right)}^{2} \right)}}{{\sum \left( {Posterior} \right)}}}$$
(4)
$$SD = \sqrt {\frac{{\sum \left( {X - \overline{X}} \right)^{2} }}{N}}$$
(5)

While the posterior standard deviation is not a standard error, it is related in a number of ways. The posterior standard deviation is a function of the shape of the posterior probability curve, which is informed by the consistency of response probabilities and the number of items scored.
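Formula (4) can be sketched in Python as follows (an illustration, not the paper's R code). The item calibration is hypothetical and deliberately symmetric around 0, the respondent is assumed to choose the middle option on every item, and `eap_and_sd` is a name invented for this example. Passing no items returns the standard deviation of the prior alone, which gives a baseline for comparison.

```python
import math

def category_probs(theta, a, thresholds):
    """Graded response model: probability of each response option at theta."""
    curves = [1.0]
    curves += [1.0 / (1.0 + math.exp(-a * (theta - cb))) for cb in thresholds]
    curves.append(0.0)
    return [curves[k] - curves[k + 1] for k in range(len(curves) - 1)]

def normal_pdf(theta):
    """Standard normal prior density (mean 0, standard deviation 1)."""
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

def eap_and_sd(responses, items, theta_grid):
    """Return the EAP score and the posterior standard deviation, Formula (4)."""
    posterior = []
    for theta in theta_grid:
        value = normal_pdf(theta)
        for option, (a, thresholds) in zip(responses, items):
            value *= category_probs(theta, a, thresholds)[option]
        posterior.append(value)
    total = sum(posterior)
    eap = sum(t * p for t, p in zip(theta_grid, posterior)) / total
    # Posterior-weighted squared deviations from the EAP score.
    variance = sum(p * (t - eap) ** 2 for t, p in zip(theta_grid, posterior)) / total
    return eap, math.sqrt(variance)

theta_grid = [round(-4.0 + 0.1 * i, 1) for i in range(81)]

# Hypothetical symmetric item; option index 2 is the middle response.
item = (2.0, [-1.5, -0.5, 0.5, 1.5])

_, sd_prior = eap_and_sd([], [], theta_grid)
_, sd_three = eap_and_sd([2] * 3, [item] * 3, theta_grid)
_, sd_six = eap_and_sd([2] * 6, [item] * 6, theta_grid)

print(round(sd_prior, 2), round(sd_three, 2), round(sd_six, 2))
print(round(sd_three * 10, 1))  # the same posterior SD on the T-score metric
```

In this setup the posterior standard deviation falls as consistent responses are added, from roughly 1 with the prior alone to smaller values with three and then six items.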

Figure 11 shows an example of the relationship between the number of items scored (e.g., 3 or 6 items), consistency of item responses (e.g., raw scores of 3,3,3 or 1,3,5) and the resulting posterior standard deviation. The gray shaded area under the bold solid orange posterior probability curve in Fig. 11 indicates a bandwidth of one standard deviation from the EAP score. Generally, a smaller posterior standard deviation occurs with a larger number of items with consistent responses, which maps onto a smaller standard error. Conversely, a smaller number of inconsistent item responses leads to a larger posterior standard deviation and larger standard error. Bock draws a direct and “near identity” relationship between the posterior standard deviation and standard error as the number of items increases (p. 437) [15].

Fig. 11 Posterior standard deviation, number of items and response consistency

Similar to how T-scores are a linear transformation of theta (Formula (3)), posterior standard deviations can be put on the T-score metric by multiplication by 10, e.g., a posterior standard deviation of 0.21 on the theta metric is a posterior standard deviation of 2.1 on the T-score metric.

Conclusion

Expected A Posteriori (EAP) scoring is a flexible and efficient scoring method that can be visualized and logically explained. Item response option probabilities distributed across a latent trait spectrum, theta, are the building blocks of EAP scoring and the maximum likelihood of these probabilities can provide a score estimate. An EAP score represents the level of latent trait experienced by the respondent compared to the level of latent trait present in the people who make up the calibration sample. Introduction of a theta quadrature and a Bayesian “prior” simplifies complex mathematical operations and alleviates measurement problems. For users who don’t have access to the statistical code and item calibration statistics, a scale-level raw sum score to EAP score look-up table can be calculated for custom short forms or accessed on the HealthMeasures website for existing short forms. A posterior standard deviation, which reflects the score standard error, can be calculated for all EAP scoring methods. A more complete understanding of the operation and options in PROMIS EAP scoring will help ground PROMIS IRT methods with existing users and will support the further adoption and implementation of PROMIS among researchers, clinicians, industry sponsors and regulators.

Availability of data and materials

All materials used in this manuscript are openly available and included in the parent R Markdown document, which will be submitted with the manuscript. Additionally, statistical code for performing Expected A Posteriori scoring is provided in the appendices of the manuscript. No data was used in the creation of this manuscript.

Abbreviations

CAT: Computer adaptive test

EAP: Expected A Posteriori

IRT: Item response theory

PROMIS: Patient-Reported Outcomes Measurement Information System

SD: Standard deviation

References

  1. Cella D, Riley W, Stone A, Rothrock N, Reeve B, Yount S et al (2010) Initial adult health item banks and first wave testing of the patient-reported outcomes measurement information system (PROMIS™) network: 2005–2008. J Clin Epidemiol 63(11):1179–1194

  2. Schalet BD, Pilkonis PA, Yu L, Dodds N, Johnston KL, Yount S et al (2016) Clinical validity of PROMIS depression, anxiety, and anger across diverse clinical samples. J Clin Epidemiol 73:119–127

  3. Cook KF, Jensen SE, Schalet BD, Beaumont JL, Amtmann D, Czajkowski S et al (2016) PROMIS® measures of pain, fatigue, negative affect, physical function, and social function demonstrate clinical validity across a range of chronic conditions. J Clin Epidemiol 73:89–102

  4. Schalet BD, Hays RD, Jensen SE, Beaumont JL, Fries JF, Cella D (2016) Validity of PROMIS physical function measures in diverse clinical samples. J Clin Epidemiol 73:112–118

  5. Askew RL, Cook KF, Revicki DA, Cella D, Amtmann D (2016) Clinical validity of PROMIS® pain interference and pain behavior in diverse clinical populations. J Clin Epidemiol 73:103–111

  6. Cella D, Lai J-S, Jensen SE, Christodoulou C, Junghaenel DU, Reeve BB et al (2016) PROMIS fatigue item bank had clinical validity across diverse chronic conditions. J Clin Epidemiol 73:128–134

  7. Hahn EA, Beaumont JL, Pilkonis PA, Garcia SF, Magasi S, DeWalt DA et al (2016) The PROMIS satisfaction with social participation measures demonstrate responsiveness in diverse clinical populations. J Clin Epidemiol 73:135–141

  8. Reeve B, Hays RD, Bjorner JB, Cook KF, Crane PK, Teresi JA et al (2007) Psychometric evaluation and calibration of health-related quality of life item banks: plans for the patient-reported outcomes measurement information system (PROMIS). Med Care 45(5):S22–S31

  9. Stover AM, McLeod LD, Langer MM, Chen W-H, Reeve BB (2019) State of the psychometric methods: patient-reported outcome measure development and refinement using item response theory. J Patient-Rep Outcomes 3(1):1–6

  10. Rothrock NE, Amtmann D, Cook KF (2020) Development and validation of an interpretive guide for PROMIS scores. J Patient-Rep Outcomes 4(1):16–26

  11. Choi S, Schalet B, Cook KF, Cella D (2014) Establishing a common metric for depressive symptoms: linking the BDI-II, CES-D, and PHQ-9 to PROMIS depression. Psychol Assess 26(2):513–527

  12. Allaire J, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A et al (2021) rmarkdown: dynamic documents for R. Available from: https://github.com/rstudio/rmarkdown

  13. Samejima F (1969) Estimation of latent ability using a response pattern of graded scores. Psychometric Society, New York

  14. Lord FM, Wingersky MS (1984) Comparison of IRT true-score and equipercentile observed-score “equatings.” Appl Psychol Meas 8(4):453–461

  15. Bock RD, Mislevy RJ (1982) Adaptive EAP estimation of ability in a microcomputer environment. Appl Psychol Meas 6(4):431–444

  16. Thissen D (2016) Bad questions: an essay involving item response theory. J Educ Behav Stat 41(1):81–89. https://doi.org/10.3102/1076998615621300

  17. Symonds PM (1929) Choice of items for a test on the basis of difficulty. J Educ Psychol 20(7):481–493. https://doi.org/10.1037/h0075650

  18. Nguyen TH, Han H-R, Kim MT, Chan KS (2014) An introduction to item response theory for patient-reported outcome measurement. Pat Patient-Cent Outcomes Res 7(1):23–35

  19. Reeve B, Fayers P (2005) Applying item response theory modelling for evaluating questionnaire item and scale properties. In: Fayers PM, Hays RD (eds) Assessing quality of life in clinical trials: methods and practice, 2nd edn. Oxford University Press, Oxford, pp 55–73

  20. Liu H (2010) Representativeness of the patient-reported outcomes measurement information system internet panel. J Clin Epidemiol 63(11):1169–1178. https://doi.org/10.1016/j.jclinepi.2009.11.021

  21. Ou J (2021) colorBlindness: safe color set for color blindness. Available from: https://CRAN.R-project.org/package=colorBlindness

  22. Chang C-H, Reeve BB (2005) Item response theory and its applications to patient-reported outcomes measurement. Eval Health Prof 28(3):264–282

  23. Embretson SE, Reise SP (2000) Item response theory for psychologists. Lawrence Erlbaum Associates Publishers

  24. VandenBos GR (2007) APA dictionary of psychology, 1st edn. American Psychological Association, Washington

  25. Bock RD, Aitkin M (1981) Marginal maximum likelihood estimation of item parameters: application of an EM algorithm. Psychometrika 46(4):443–459

  26. Rose M, Bjorner JB, Gandek B, Bruce B, Fries JF, Ware JE (2014) The PROMIS physical function item bank was calibrated to a standardized metric and shown to improve measurement efficiency. J Clin Epidemiol 67(5):516–526

  27. Hays R (2017) Two-item PROMIS® global physical and mental health scales. J Patient-Rep Outcomes. https://doi.org/10.1186/s41687-017-0003-8

  28. Cai L (2015) Lord–Wingersky algorithm version 2.0 for hierarchical item factor models with applications in test scoring, scale alignment, and model fit testing. Psychometrika 80(2):535–559

  29. Choi S (2010) ThetaSEeap.r. Version 1

  30. Choi S (2010) RSSS.r. Version 1

Acknowledgements

Feedback about the legibility and accessibility of the manuscript was obtained from the PROMIS user community by posting initial drafts on the PsyArXiv preprint server. The author would like to thank Dr. Erin Anderson for her feedback and assistance in the preparation of the manuscript.

Funding

Not applicable.

Author information

Contributions

The author designed, wrote, edited and finalized the manuscript. The author read and approved the final manuscript.

Corresponding author

Correspondence to Robert Chapman.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The author declares no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

ThetaSEeap.R

The “ThetaSEeap.R” script is an R script for calculating “pattern response” EAP scores, and was originally written by Choi [29].

Figures a to d: ThetaSEeap.R code listing

RSSS.R

The “RSSS.R” script is an R script for calculating raw sum score to EAP score “look-up” tables, and was originally written by Choi [30].

Figures e to i: RSSS.R code listing
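For readers who want to see the idea behind a raw sum score to EAP look-up table in compact form, the following is a minimal Python sketch. It is not a translation of RSSS.R. It assumes three graded response model items with made-up parameters (`ITEMS`), responses scored 1 to 5, a standard normal prior, a theta quadrature from -4 to 4, and the usual PROMIS T-score form T = 10θ + 50. It uses the Lord and Wingersky recursion [14] to obtain the likelihood of each raw sum score at each quadrature point. All names are illustrative.

```python
import math

# Hypothetical graded response model items as (slope, thresholds);
# these are NOT PROMIS calibrations.
ITEMS = [
    (2.0, [-1.5, -0.5, 0.5, 1.5]),
    (1.5, [-1.0, 0.0, 1.0, 2.0]),
    (2.5, [-2.0, -1.0, 0.0, 1.0]),
]
QUADRATURE = [i / 10 for i in range(-40, 41)]  # theta grid from -4 to 4


def category_probabilities(theta, slope, thresholds):
    """Probabilities of each response option (lowest to highest) at theta."""
    cumulative = [1.0] + [
        1.0 / (1.0 + math.exp(-slope * (theta - b))) for b in thresholds
    ] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


def sum_score_likelihoods(theta):
    """Lord-Wingersky recursion: likelihood of each raw sum score at theta.

    Index 0 of the result corresponds to the minimum possible sum score.
    """
    likelihood = [1.0]  # starting point before any item is added
    for slope, thresholds in ITEMS:
        probs = category_probabilities(theta, slope, thresholds)
        updated = [0.0] * (len(likelihood) + len(probs) - 1)
        for score_index, score_likelihood in enumerate(likelihood):
            for option_index, p in enumerate(probs):
                updated[score_index + option_index] += score_likelihood * p
        likelihood = updated
    return likelihood


def sum_score_lookup_table():
    """Rows of (raw sum score, EAP theta, posterior SD, T-score, SD on T metric)."""
    prior = [math.exp(-0.5 * theta ** 2) for theta in QUADRATURE]
    by_theta = [sum_score_likelihoods(theta) for theta in QUADRATURE]
    minimum = len(ITEMS)  # every item answered with its lowest option, scored 1
    table = []
    for index in range(len(by_theta[0])):
        posterior = [lik[index] * w for lik, w in zip(by_theta, prior)]
        total = sum(posterior)
        eap = sum(t * w for t, w in zip(QUADRATURE, posterior)) / total
        variance = sum((t - eap) ** 2 * w for t, w in zip(QUADRATURE, posterior)) / total
        sd = math.sqrt(variance)
        table.append((minimum + index, eap, sd, 50 + 10 * eap, 10 * sd))
    return table


for raw, eap, sd, t_score, t_sd in sum_score_lookup_table():
    print(f"raw {raw:2d}: theta {eap:5.2f} (SD {sd:.2f}), T {t_score:4.1f} (SD {t_sd:.1f})")
```

Each row pools every response pattern that produces the same raw sum score, which is why a look-up table needs only the sum score, whereas response pattern scoring uses each item response separately.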

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Chapman, R. Expected a posteriori scoring in PROMIS®. J Patient Rep Outcomes 6, 59 (2022). https://doi.org/10.1186/s41687-022-00464-9

  • DOI: https://doi.org/10.1186/s41687-022-00464-9