Psychometric validation of the 1-month recall Uterine Fibroid Symptom and Health-Related Quality of Life questionnaire (UFS-QOL)

Background: To evaluate the psychometric characteristics of the 1-month recall Uterine Fibroid Symptom and Health-Related Quality of Life questionnaire (UFS-QOL), including the Revised Activities subscale.
Methods: VENUS I and II were phase III, randomized, double-blind, placebo-controlled trials of ulipristal acetate in women with uterine fibroids (UF) and abnormal uterine bleeding. Women completed the 1-month recall UFS-QOL at baseline and after 12 weeks’ treatment. Uterine bleeding was assessed via a daily diary (both studies); the Patient Global Impression of Improvement scale (PGI-I) was completed in VENUS II. Psychometric analyses examined internal consistency reliability and construct validity of the UFS-QOL; confirmatory factor analysis (CFA) compared model fit of the original and Revised Activities subscales. Analyses were conducted separately for VENUS I and II.
Results: One hundred and fifty-seven patients in VENUS I and 429 in VENUS II were included. Changes in mean Symptom Severity and health-related quality of life (HRQoL) scale scores indicated symptom burden reductions and HRQoL improvements. Cronbach’s alpha coefficients were high at baseline and after 12 weeks’ treatment (all ≥0.76, meeting the >0.70 threshold), demonstrating strong internal consistency reliability. Correlations between UFS-QOL scores and bleeding diary responses (range: −0.35 to −0.63), and UFS-QOL scores and PGI-I responses (range: −0.48 to −0.70), ranged from moderate to strong after 12 weeks’ treatment (all p < 0.0001). Patients with absence of bleeding or controlled bleeding after 12 weeks’ treatment scored significantly better (p < 0.001) on each UFS-QOL scale than patients not achieving those end points, supporting construct validity. CFA confirmed model fit for the Revised Activities subscale.
Conclusions: The 1-month recall UFS-QOL, including the Revised Activities subscale, is a valid, reliable measure to assess UF symptoms and their impact on HRQoL.
Trial registration: ClinicalTrials.gov, NCT02147197. Registered May 26, 2014; retrospectively registered. ClinicalTrials.gov, NCT02147158. Registered May 26, 2014; retrospectively registered.
Electronic supplementary material: The online version of this article (10.1186/s41687-019-0146-x) contains supplementary material, which is available to authorized users.


Background
Uterine fibroids (UF) are among the most common benign neoplasms of the female pelvic region. UF incidence increases as women approach menopause and has been reported to affect up to 80% of women by the age of 50 years [1]. Up to half of women with UF experience clinical symptoms [2], including abnormal uterine bleeding (AUB) and pain [3], which can cause significant emotional and psychological distress [4]. A national survey of women in the United States aged 29-59 years with self-reported symptomatic UF revealed that 31% of respondents reported symptoms interfering with physical activities "all/most of the time", while 22% reported symptoms interfering with daily/social activities "all/most of the time". In addition, almost one-third of employed respondents reported missing work due to their symptoms [5].
The symptoms of UF and their negative impact on health-related quality of life (HRQoL) and activities of daily living are some of the reasons why women seek therapy for UF. Patient-reported outcome (PRO) measures are, therefore, appropriate tools to measure the impact and outcome of interventions [6]. The Uterine Fibroid Symptom and Health-Related Quality of Life questionnaire (UFS-QOL) is widely used to evaluate patient-reported UF symptoms and their impact on HRQoL, and is the only disease-specific instrument developed and validated in a population of women with UF [7]. It was developed based on qualitative input from patients with UF; the original validation demonstrated its ability to discriminate between women with and without UF and also between varying patient-reported disease severity [7]. Furthermore, the UFS-QOL has been shown to be highly responsive to change following treatment [6].
In the original version of the UFS-QOL, patients are instructed to consider their experiences with UF during the previous 3 months. The instrument has since been modified to incorporate a shorter 1-month recall period to minimize recall bias [8] and to provide a more precise assessment of treatment effect based on a monthly menstrual cycle. The availability of multiple recall versions of the UFS-QOL also provides utility for women who do not have a monthly cycle; the 3-month recall version may be more useful for women who have less frequent menstrual cycles. In addition, a Revised Activities subscale has been created to include the most relevant items pertaining to physical and social activities. This scale was developed based on recent qualitative focus groups in which participants were asked to indicate whether each item of the UFS-QOL was relevant to them. The two items that ranked lowest in terms of relevancy to patients on the Activities subscale were removed during data analysis to create the Revised Activities subscale (i.e. the items were not removed from the questionnaire itself). As a result of these changes to the UFS-QOL, further validation is warranted.
An a priori planned validation of the 1-month recall UFS-QOL, including the Revised Activities subscale, was carried out to evaluate the instrument's psychometric properties using data from two trials of ulipristal acetate (UPA), an investigational, orally administered selective progesterone receptor modulator that reversibly blocks progesterone receptors in its target tissues (endometrium, pituitary, and UF) [9,10]. UPA has been shown to reduce AUB in women with UF [11-15], including in two pivotal phase III trials conducted in the United States and Canada (VENUS I [UL1309; NCT02147197] and VENUS II [UL1208; NCT02147158]) [16,17], in study populations representative of women with UF in the US general population.

Study designs and patients
This analysis included data from VENUS I and VENUS II, two phase III, multicenter, randomized, double-blind, placebo-controlled trials to assess the safety and efficacy of UPA for the treatment of AUB associated with UF. Both studies included pre-menopausal women aged 18-50 years who had: ultrasound evidence of at least one discrete UF; a history of cyclic (≥22 and ≤35 days) AUB; and menstrual blood loss ≥80 mL. Key exclusion criteria were: a history of uterine surgery that would interfere with the study end points; known coagulation disorder; and a history of, or current, uterine, cervical, ovarian, or breast cancers [16,17]. VENUS I included 157 patients randomized to placebo, UPA 5 mg, or UPA 10 mg for 12 weeks of treatment, followed by a 12-week drug-free follow-up period. VENUS II included 432 patients randomized to placebo followed by UPA, UPA followed by placebo, or two courses of UPA. The two 12-week treatment courses were separated by a drug-free interval of two menses. The second treatment course in VENUS II was followed by a 12-week drug-free follow-up period. This report follows recommendations described in the CONSORT PRO Extension [18].

Questionnaires and assessments

UFS-QOL
The 37-item, self-administered UFS-QOL measures Symptom Severity (eight items) and HRQoL (29 items) and has been previously validated [6,7,19,20]. The HRQoL Total scale consists of six subscales: Concern, Activities, Energy/Mood, Control, Self-Consciousness, and Sexual Function. Response options for Symptom Severity scale items are scored from 1 ("Not at all") to 5 ("A very great deal"); response options for items in the HRQoL subscales range from 1 ("None of the time") to 5 ("All of the time"). The Symptom Severity scale, HRQoL subscales, and HRQoL Total scale scores are summed and transformed into a 0-100-point scale, with higher Symptom Severity scores indicating greater symptom severity and higher HRQoL scores indicating better HRQoL. The Symptom Severity scale is unidimensional and the HRQoL subscales can be treated as unidimensional scales of a multidimensional construct (HRQoL); the HRQoL Total scale is a sum of the HRQoL subscales.
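The summing and 0-100 transformation described above can be sketched as follows. This is a minimal illustration only: it assumes a linear min-max rescaling of the summed 1-5 item responses, and it assumes that the HRQoL direction is reversed so that higher transformed scores mean better HRQoL. The function names are hypothetical, and the developers' scoring guidelines remain the authoritative source (for example, on how missing items are handled, which is not modeled here).

```python
def _check(item_scores):
    """Validate that every response is an integer from 1 to 5."""
    if not item_scores or any(s not in (1, 2, 3, 4, 5) for s in item_scores):
        raise ValueError("each item response must be an integer from 1 to 5")


def symptom_severity_score(item_scores):
    """Rescale summed responses to 0-100; higher = greater symptom severity.

    Assumption: linear rescaling between the lowest and highest possible sums.
    """
    _check(item_scores)
    n = len(item_scores)
    lowest, highest = 1 * n, 5 * n
    return (sum(item_scores) - lowest) / (highest - lowest) * 100


def hrqol_score(item_scores):
    """Rescale summed responses to 0-100; higher = better HRQoL.

    Assumption: the raw sum is reversed, because a raw response of 5
    ("All of the time") reflects the greatest impact on HRQoL.
    """
    _check(item_scores)
    n = len(item_scores)
    lowest, highest = 1 * n, 5 * n
    return (highest - sum(item_scores)) / (highest - lowest) * 100
```

Under these assumptions, a patient answering 3 on all eight Symptom Severity items would score 50, and a patient answering "None of the time" on every HRQoL item would score 100.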
Patients were instructed to consider their experiences with UF over a modified recall period of 1 month. The Revised Activities subscale, a shorter version of the original Activities subscale, was included in all validation analyses. The 1-month recall UFS-QOL was completed by patients at baseline (Visit 1; first on-treatment visit in both studies) and after 12 weeks of treatment (Visit 2; end of the 12-week treatment period in VENUS I and end of Treatment Course 1 in VENUS II), or on early withdrawal for patients who withdrew after Visit 1 and before Visit 2 (for both studies).

Patient Global Impression of Improvement scale
The Patient Global Impression of Improvement scale (PGI-I) is a self-administered measure used to rate patient-perceived response of a condition to therapy. The PGI-I, administered in VENUS II only, asked: "During treatment with study drug, how would you describe your menstrual/vaginal bleeding compared to before you started study drug?". Participants responded on a 7-point Likert scale with the following options: 1, "Very much better"; 2, "Much better"; 3, "A little better"; 4, "No change"; 5, "A little worse"; 6, "Much worse"; and 7, "Very much worse". The PGI-I was completed at Visit 2 or on early withdrawal.

Bleeding diary
Uterine bleeding was recorded in an electronic diary in both studies [16,17]. A patient's heaviest bleeding experienced over the preceding 24 h was captured by the following terms: "None", no bleeding and no spotting; "Spotting", evidence of minimal blood loss that does not require the use of sanitary protection (except for panty liners); "Bleeding", evidence of blood loss that requires the use of sanitary pads or tampons; "Heavy bleeding", more than normal bleeding relative to your experience, or the passage of clots. Absence of bleeding was defined as having no bleeding days during the last 35 consecutive days on treatment counting backward from the earlier of Day 84 or the last dose date in the treatment period (VENUS I) or in Treatment Course 1 (VENUS II). Controlled bleeding was defined as having 0 days of heavy bleeding and ≤8 days of bleeding within the analysis window (the last 56 days of treatment counting backward from the earlier of Day 84 or the last dose date in the treatment period [VENUS I] or in Treatment Course 1 [VENUS II]). No controlled bleeding was defined as having ≥1 day of heavy bleeding or ≥9 days of bleeding within the analysis window. The thresholds for bleeding, absence of bleeding, and controlled bleeding were identified a priori, based on the primary end points in the VENUS I and II clinical trials.
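The end point definitions above can be expressed as a short sketch. It assumes that the diary is a list with one entry per treatment day (index 0 is Day 1), that a "bleeding day" is a day recorded as "Bleeding" or "Heavy bleeding" (spotting is not counted), and that an incomplete window raises an error rather than being imputed. These are illustrative assumptions, and the function names are hypothetical; the trial protocols define the actual derivation, including imputation of missing diary data.

```python
HEAVY = "Heavy bleeding"
BLEEDING = "Bleeding"


def _window(diary, last_dose_day, length):
    """Return the last `length` daily entries, counting backward from the
    earlier of Day 84 or the last dose day (diary[0] is Day 1)."""
    end_day = min(84, last_dose_day)
    if end_day < length or len(diary) < end_day:
        raise ValueError("diary does not cover the full analysis window")
    return diary[end_day - length:end_day]


def absence_of_bleeding(diary, last_dose_day):
    """True if there are no bleeding days in the last 35 days on treatment."""
    window = _window(diary, last_dose_day, 35)
    return all(entry not in (BLEEDING, HEAVY) for entry in window)


def controlled_bleeding(diary, last_dose_day):
    """True if the last 56 days contain 0 heavy bleeding days and
    no more than 8 bleeding days."""
    window = _window(diary, last_dose_day, 56)
    heavy_days = sum(entry == HEAVY for entry in window)
    bleeding_days = sum(entry == BLEEDING for entry in window)
    return heavy_days == 0 and bleeding_days <= 8
```

Note that the two end points use different windows, so under these assumptions a patient can meet absence of bleeding (35-day window) without meeting controlled bleeding (56-day window) if she had nine or more bleeding days earlier in the 56-day window.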

Statistical analyses
Observed UFS-QOL and PGI-I scores were used in all analyses. Missing bleeding diary data were imputed consistent with VENUS I and II protocols. Scoring of the questionnaires was performed according to the developers' guidelines. All statistical tests were two-sided and used a significance level of 0.05 unless otherwise noted. Baseline analyses were carried out on the intent-to-treat population using an observed cases approach, defined as patients who completed at least one item of the UFS-QOL at baseline. The per protocol population was defined as all randomized patients who completed the treatment period, in addition to completing at least one item of the UFS-QOL after 12 weeks of treatment (PRO approach). Descriptive analyses were performed on baseline patient demographic and clinical characteristics. Distributional characteristics of UFS-QOL scores were examined at baseline and after 12 weeks of treatment. Analyses were conducted separately for VENUS I and II in order to assess the reproducibility of the results; additionally, minor differences between the two trials were present, such as the PGI-I being administered only in VENUS II.
Psychometric analyses were conducted on the UFS-QOL, including the Revised Activities subscale, at baseline and after 12 weeks of treatment to examine internal consistency reliability and construct validity (convergent and known groups validity).
Internal consistency reliability explores associations between different items within a scale [21]. Cronbach's coefficient alpha was calculated for each UFS-QOL scale at baseline and after 12 weeks of treatment; a value of >0.70 was considered acceptable to demonstrate internal consistency [22]. Validity refers to the extent to which an instrument measures what it purports to measure [21]. Convergent validity is the extent to which scores from the instrument are related to scores from other related instruments or concepts [21]. Spearman's rank correlation coefficients were used to establish convergent validity between the UFS-QOL scales and bleeding diary assessments (number of bleeding days and heavy bleeding days) at baseline and after 12 weeks of treatment, and between the UFS-QOL scales and PGI-I after 12 weeks of treatment (VENUS II only).
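For readers less familiar with these two statistics, the sketch below shows how Cronbach's coefficient alpha and Spearman's rank correlation can be computed from raw data. It uses only the Python standard library and is an illustration of the textbook formulas, not the software or code used in the study; the function names are hypothetical, and tied values are assumed to receive average ranks.

```python
from math import sqrt
from statistics import mean, variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents' item-score rows.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(responses[0])
    item_variance_sum = sum(variance(col) for col in zip(*responses))
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

In this framing, a scale would meet the acceptability criterion stated above when `cronbach_alpha(...)` exceeds 0.70, and a negative `spearman_rho` between an HRQoL subscale and the number of bleeding days would indicate that more bleeding days go with worse HRQoL.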
Known groups validity is the extent to which scores from an instrument distinguish between groups that differ by a key indicator, often clinical in nature [20]. Known groups validity of the UFS-QOL was assessed by number of bleeding days (categorized by ≤5, >5 to 9, and >9 days), achievement of absence of bleeding, and achievement of controlled bleeding for both VENUS I and II. Known groups validity was also assessed based on PGI-I responses after 12 weeks of treatment in VENUS II. Specifically, two separate assessments were conducted based on patients' PGI-I data: 1) by individual PGI-I score; and 2) by the collapsed PGI-I response categories of "Improved" (responses of "Very much better", "Much better", and "A little better"), "No change", and "Worsened" (responses of "A little worse", "Much worse", and "Very much worse").
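The grouping rules used for the known groups comparisons can be written down compactly. The sketch below follows the category definitions given above; the function names are hypothetical and the code is illustrative rather than the study's analysis code.

```python
def bleeding_day_group(bleeding_days):
    """Assign a bleeding-day count to one of the three known groups."""
    if bleeding_days < 0:
        raise ValueError("number of bleeding days cannot be negative")
    if bleeding_days <= 5:
        return "≤5"
    if bleeding_days <= 9:
        return ">5 to 9"
    return ">9"


def collapse_pgi_i(score):
    """Collapse the 7-point PGI-I response into three categories.

    1-3 ("Very much better" to "A little better") -> "Improved"
    4   ("No change")                             -> "No change"
    5-7 ("A little worse" to "Very much worse")   -> "Worsened"
    """
    if score in (1, 2, 3):
        return "Improved"
    if score == 4:
        return "No change"
    if score in (5, 6, 7):
        return "Worsened"
    raise ValueError("PGI-I score must be an integer from 1 to 7")
```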
The unidimensional factor structures of the Activities and Revised Activities subscales were examined by confirmatory factor analysis (CFA) in Mplus. Model fit was assessed by examining three fit statistics: the Comparative Fit Index (CFI), for which the model was considered to have a good fit if the CFI was ≥0.90 [23]; the Root Mean Square Error of Approximation (RMSEA), for which the goodness of fit of the model was considered acceptable for values <0.07 [24]; and the Standardized Root Mean Square Residual (SRMR), for which the goodness of fit of the model was considered acceptable for values ≤0.08 [25].
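The three fit criteria can be summarized as a simple check. The sketch below only encodes the thresholds stated above; it does not fit a CFA model, the function name is hypothetical, and the input values in any call would come from the model output of the software used (Mplus in this study).

```python
def assess_fit(cfi, rmsea, srmr):
    """Compare CFA fit statistics with the thresholds stated in the text.

    CFI   >= 0.90 -> good fit
    RMSEA <  0.07 -> acceptable fit (strict inequality)
    SRMR  <= 0.08 -> acceptable fit
    """
    return {
        "CFI": cfi >= 0.90,
        "RMSEA": rmsea < 0.07,
        "SRMR": srmr <= 0.08,
    }
```

Reporting the three criteria separately, rather than as a single pass/fail flag, keeps it visible when one index disagrees with the others.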

Patients
All 157 randomized patients in VENUS I had baseline UFS-QOL data; 135 completed at least one item of the UFS-QOL after 12 weeks of treatment (or early termination) (Fig. 1). Of 432 randomized patients in VENUS II, 429 had baseline UFS-QOL data and 348 completed at least one item of the UFS-QOL after 12 weeks of treatment (or early termination) (Fig. 1). Mean (standard deviation [SD]) age was 41.1 (5.4) years and 41.0 (5.6) years in VENUS I and II, respectively. Most patients were black (68.8% in VENUS I and 66.9% in VENUS II) and mean (SD) body mass index was 31.7 (8.0) kg/m² and 32.2 (7.9) kg/m² in VENUS I and II, respectively (Table 1). There were no significant differences (p > 0.05) between women who completed 12 weeks of treatment and those who discontinued in terms of age, race, ethnicity, and body mass index.

UFS-QOL: scale analysis
Descriptive statistics for UFS-QOL scales are shown in Table 2. In both studies at baseline, the mean Symptom Severity scale score was relatively high (62.0 in VENUS I and 65.5 in VENUS II), decreasing to approximately half its baseline value after 12 weeks of treatment (30. Table S1).

Convergent validity
In both studies, correlations at baseline between UFS-QOL scales and the number of heavy bleeding days were significant (except for the Sexual Function subscale in VENUS I), but weak, as is to be expected for correlations between objective and subjective measures (r_s = 0.18 [p < 0.05] for the Symptom Severity scale and ranging from r_s = −0.15 [p = not significant (NS)] to −0.25 [others p < 0.05] for the HRQoL subscales in VENUS I; r_s = 0.24 [p < 0.001] for Symptom Severity and ranging from r_s = −0.16 to −0.26 [p < 0.001] for the HRQoL subscales in VENUS II). In VENUS I, only Symptom Severity was significantly associated with the number of bleeding days; however, the association was weak (r_s = 0.16; p < 0.05). In VENUS II, most scales were weakly, but significantly, correlated with the number of bleeding days; the Energy/Mood and Self-Consciousness subscales did not have significant correlations.
After 12 weeks of treatment in both studies, there were much stronger correlations (ranging from −0.35 to −0.63; all p < 0.0001) between all UFS-QOL scales and bleeding diary responses compared to baseline (Table 3). In addition, after 12 weeks of treatment in VENUS II, correlations with the PGI-I were moderate to strong [28], and were significant (all p < 0.0001): 0.69 for the Symptom Severity scale and ranging from −0.48 (Sexual Function) to −0.70 (Concern) for the HRQoL subscales (Table 3).

Known groups validity

By bleeding diary responses
At baseline in VENUS I, patients with ≤5 days of bleeding scored significantly better (p < 0.05) on the Symptom Severity scale than those with >9 days of bleeding. However, pairwise comparisons showed no significant differences between groups on any of the HRQoL subscales, likely due to the small sample sizes in each group. In VENUS II at baseline, patients with ≤5 days of bleeding scored significantly better than those with >9 days of bleeding on most of the UFS-QOL scales (p < 0.05; except for the Energy/Mood, Self-Consciousness, and Sexual Function subscales, for which p = NS). Both studies showed similar trends after 12 weeks of treatment. There were significant differences on all scales between patients experiencing ≤5 days of bleeding versus those experiencing >5 to 9 days (all p < 0.01; both studies) and versus those experiencing >9 days (p < 0.05 for VENUS I, except the Self-Consciousness and Sexual Function subscales, for which p = NS; p < 0.001 for VENUS II). In VENUS II, there was also a significant difference between patients who experienced >5 to 9 days of bleeding versus >9 days of bleeding on the Revised Activities and Energy/Mood subscales (both p < 0.05) (data not shown).

By achievement of absence of bleeding and controlled bleeding
Patients who achieved absence of bleeding and controlled bleeding in both studies scored significantly better (p < 0.001) on each UFS-QOL scale than patients who did not achieve those outcomes (Fig. 2). For example, in VENUS II, the mean (SD) Revised Activities subscale score for women who achieved absence of bleeding compared to those who did not was 88.7 (22.7) versus 59.9 (33.5), respectively.

By PGI-I score
In VENUS II after 12 weeks of treatment, pairwise comparisons demonstrated that patients who responded "Very much better" on the PGI-I scored significantly better on almost all subscales when compared to each of the other PGI-I response categories. There were no significant differences on any subscale scores between those who responded "No change", "A little worse", "Much worse", or "Very much worse" (data not shown; the latter three of these groups had very small sample sizes). When PGI-I responses were collapsed into the categories of "Improved", "No Change", and "Worsened", pairwise comparisons demonstrated that patients in the "Improved" group scored better on all UFS-QOL scales versus those in the "No Change" (all p < 0.001) and "Worsened" groups (all p < 0.05) (Table 4). Likely due to the small sample size and greater variance in the "Worsened" category, a monotonic pattern in scores across categories was not observed for most subscales.

[...] however, the RMSEA tends not to perform well in models with small degrees of freedom, as was the case here [29].

Discussion
This psychometric validation study demonstrated that the 1-month recall UFS-QOL, including the Revised Activities subscale, is a valid and reliable PRO measure for the assessment of UF symptoms and their impact on HRQoL. The appropriate recall period for a PRO measure depends on what the measure captures, its intended use, and the attributes of the disease or study [8]. With a longer recall period, there is a risk of introducing measurement error that may reduce the chances of detecting a treatment effect [8,30]. Given that the results for the 1-month recall version of the UFS-QOL reported here are strongly consistent with those reported in the initial 3-month recall UFS-QOL validation studies [6,7,19], comparison with studies using either version of the instrument is feasible.
In this analysis, the UFS-QOL was shown to detect differences between known outcomes or groups. When comparing UFS-QOL scale scores for patients who achieved absence of, or controlled, bleeding after 12 weeks of treatment versus those who did not, scores were significantly better (p < 0.001) for the groups who achieved those outcomes. This discrimination is important because AUB is one of the most common symptoms of UF [2], and the result shows that the UFS-QOL can differentiate patients by bleeding status.
The results of the current study corroborate the findings of an earlier validation study of the 4-week recall UFS-QOL using data from a phase IIa proof-of-concept study in 271 pre-menopausal women with heavy bleeding associated with UF [20]. Both studies provide support for the tool as a valid way to measure symptom severity and impact of UF on HRQoL. The current study also has strengths of its own, in that it had a substantial sample size and minimal missing data. Correlations between UFS-QOL scores and the number of bleeding days were weaker at baseline than after 12 weeks of treatment. These findings are comparable with those observed in the previous 4-week recall study, in which correlations between UFS-QOL scales and ratings on the Mansfield-Voda-Jorgensen Menstrual Bleeding Scale were low at baseline (<0.20), but increased after the 3-month treatment period (0.28 to 0.51; p < 0.0001) [20]. We would expect correlations between UFS-QOL scores and bleeding diary responses to be greater after 12 weeks of treatment compared to at baseline because the sample has changed with treatment. At the end of 12 weeks of treatment, 37.8% and 38.8% of patients were amenorrheic in VENUS I and II, respectively, with greatly improved UFS-QOL scores from baseline (Table 2); such results are reflected in the correlations between UFS-QOL scales and bleeding diary responses. In contrast, at baseline, there was much greater variability in bleeding days and UFS-QOL responses, resulting in weaker correlations.
These analyses also showed that the Revised Activities subscale performed psychometrically as well as the original Activities subscale. The Revised Activities subscale was created by removing the two items ranked lowest in terms of relevancy, based on qualitative focus groups. The modified subscale showed excellent internal consistency reliability and strong model fit based on CFA results, lending support to the use of this shortened scale in future studies. A limitation of the current study is that it was not designed to include a direct comparison of the 1-month recall UFS-QOL with the 3-month recall version; nevertheless, the 1-month recall UFS-QOL demonstrated psychometric properties similar to those of the 3-month recall version.

Conclusion
In conclusion, this study demonstrated that the 1-month recall UFS-QOL, including the Revised Activities subscale, is a valid and reliable PRO measure for the assessment of UF symptoms and their impact on HRQoL.

Additional file
Additional file 1: Table S1. Internal consistency reliability: Cronbach's coefficient alpha values for UFS-QOL scale scores in VENUS I and VENUS II at baseline (intent-to-treat population; observed cases approach) and after 12 weeks of treatment (per protocol population; patient-reported outcome approach) (PDF 250 kb)

Table footnotes (collapsed PGI-I response categories):
a Includes responses of "Very much better", "Much better", and "A little better"
b Includes response of "No change"
c Includes responses of "A little worse", "Much worse", and "Very much worse"
d General linear model; pairwise comparisons between means were performed using Scheffe's test adjusting for multiple comparisons: *p < 0.05; **p < 0.01; ***p < 0.001; 1, "improved" versus "no change"; 2, "improved" versus "worsened". The comparison between "no change" versus "worsened" was not significant for each scale.