Validation and reliability of the Dutch version of the EORTC QLQ-NMIBC24 Questionnaire Module for patients with non-muscle-invasive bladder cancer

Background The European Organisation for Research and Treatment of Cancer (EORTC) quality of life questionnaire for non-muscle invasive bladder cancer (QLQ-NMIBC24) has been available and in use for some years, but has yet to undergo a comprehensive psychometric evaluation. The aim of this study was to investigate the psychometric properties of the Dutch version of the EORTC QLQ-NMIBC24 questionnaire in patients with low, intermediate and high risk NMIBC. Methods We included patients newly diagnosed with NMIBC participating in the multicenter, population-based prospective cohort studies UroLife or BlaZIB. Psychometric evaluation included examination of the structural validity, reliability (i.e. internal consistency and test–retest reliability), construct validity (i.e. divergent validity and known-groups validity), responsiveness and interpretability. Results A total of 1463 patients who completed the baseline questionnaire of UroLife (n = 541, response rate 50%) or BlaZIB (n = 922, response rate 58%) were included. The percentage of missing responses was low for all non-sex related scales (< 1%) and ranged from 6.9% to 50.0% for sex-related scales. More than 15% of the patients obtained the lowest possible score on nearly every scale (floor effect). The structural validity was adequate; the confirmatory factor analysis showed satisfactory results and all items of multi-item scales had higher within- than between-scale correlations. Reliability of the questionnaire was adequate for most multi-item scales (Cronbach’s α ≥ 0.70 and intraclass correlation coefficient ≥ 0.70), with the exception of the scales ‘malaise’ and ‘bloating and flatulence’. The questionnaire also showed good construct validity; it showed low correlations with the items of the EORTC core questionnaire and was able to measure differences between risk-based subgroups. The responsiveness of the questionnaire was good, but the interpretability, i.e. minimal important change, could not be determined. Conclusions This study shows that the measurement properties of the EORTC QLQ-NMIBC24 are good; it has good structural validity, reliability (i.e. internal consistency and test–retest reliability), construct validity (i.e. divergent validity and known-groups validity), and responsiveness. Interpretability could not be assessed. This questionnaire can be used to measure and monitor health-related quality of life of patients with NMIBC. Supplementary Information The online version contains supplementary material available at 10.1186/s41687-021-00372-4.


Background
The majority (75%) of new bladder cancer patients are diagnosed with non-muscle invasive bladder cancer (NMIBC) and undergo a transurethral resection (TURBT) [1]. Depending on the tumour's risk profile, this is followed by a single chemotherapy instillation (low-risk tumours), adjuvant intravesical chemotherapy or Bacillus Calmette-Guérin (BCG) for a maximum of 1 year (intermediate-risk tumours), or BCG maintenance for 1-3 years (high-risk tumours) [1]. As both the disease and its treatment can affect functional health and symptom experience, the European Organisation for Research and Treatment of Cancer (EORTC) developed a health-related quality of life (HRQoL) questionnaire specifically for patients with NMIBC, the EORTC Quality of Life Questionnaire (QLQ)-NMIBC24 [2,3]. This questionnaire module was designed to complement the EORTC core HRQoL questionnaire, the QLQ-C30. The QLQ-NMIBC24 is already partially validated (i.e. content validity) but still needs to undergo psychometric testing in a large international group of patients (phase III validated EORTC module) [4]. To date, three studies have investigated the psychometric properties of the QLQ-NMIBC24 questionnaire, showing its psychometric robustness [2,5,6]. One study examined and revised the scale structure and evaluated the internal consistency, known-groups validity, and responsiveness of the questionnaire in a British patient population [2]. The other two studies evaluated the psychometric properties of the Danish and Korean translations of the questionnaire, respectively [5,6]. However, no comprehensive evaluation of the psychometric properties of the QLQ-NMIBC24, which is required to judge the appropriateness of the measure, has been performed. Test-retest reliability, interpretability of change scores [7] and the performance of the NMIBC24 have not yet been evaluated in different risk groups of NMIBC patients or among Dutch-speaking patients.
Therefore, the aim of this study was to examine the structural validity, reliability (i.e. internal consistency and test-retest reliability), construct validity (i.e. divergent validity and known group validity), responsiveness and interpretability of the Dutch version of the QLQ-NMIBC24 [3] in patients with low, intermediate and high risk NMIBC.

Study design and participants
Dutch bladder cancer patients participating in the UroLife (Urothelial cell cancer: Lifestyle, prognosis and quality of Life) or BlaZIB ('BlaaskankerZorg In Beeld', clinical trial number: NL8106) studies were included in the current analysis. Both studies are population-based, multicenter prospective cohort studies recruiting newly diagnosed bladder cancer patients based on notifications from the nationwide network and registry of histopathology and cytopathology in the Netherlands (PALGA) and subsequent registration in the Netherlands Cancer Registry (NCR). The main aim of the UroLife study is to evaluate the association between lifestyle habits and the risk of recurrence and progression and HRQoL of patients with NMIBC. BlaZIB aims to gain insight into bladder cancer care and to identify barriers and modulators for optimal care. More detailed information on these studies can be found elsewhere [8,9]. For this analysis, patients diagnosed with NMIBC (stage Ta, T1, Tis) between April 1, 2014 and March 18, 2016 were selected from the UroLife study, and patients diagnosed with high-risk NMIBC (stage T1 and Tis) between November 1, 2017 and July 7, 2019 were selected from the BlaZIB study. All patients were Dutch speaking, between 18 and 80 years old, and treated with a transurethral resection. This study was performed in line with the principles of the Declaration of Helsinki. The Committee for Human Research in the region Arnhem-Nijmegen provided ethical approval for the UroLife study (CMO 2013-494) and deemed the BlaZIB study exempt from ethical review under the Medical Research Involving Human Subjects Act (WMO). Both studies were approved by the ethical review board of the NCR. Written informed consent was obtained from all patients participating in UroLife or BlaZIB.

Data collection
Both studies collected self-reported questionnaire data online or on paper 6 weeks after diagnosis (T6wk). The online questionnaires were collected via the data collection tool of the Patient Reported Outcomes Following Initial treatment and Long term Evaluation of Survivorship (PROFILES) registry [10]. Baseline data (T6wk) and follow-up data collected at 3 months (T3mo) and 15 months (T15mo) after diagnosis in the UroLife study, and at 6 months (T6mo) and 12 months (T12mo) after diagnosis in the BlaZIB study were used for the current analysis. The measurement points of UroLife were based on the treatment regimen of patients diagnosed with NMIBC, i.e. shortly after histological confirmation of the tumour (T6wk), at the time of cystoscopy to investigate whether the tumour was successfully removed (T3mo), and at long-term follow-up (T15mo) [2,11]. Patients who underwent a cystectomy were not or no longer invited to participate in the UroLife study. In order to assess the test-retest reliability and standard error of measurement (SEM), patients who completed the BlaZIB T12mo questionnaire between March 1, 2019 and December 7, 2019 received an additional questionnaire 2 weeks after the T12mo questionnaire (T12mo + 2wk). In total, 134 patients diagnosed with NMIBC completed the T12mo + 2wk questionnaire (response rate 86.5%). This questionnaire included the QLQ-NMIBC24 and four additional questions to assess whether the symptoms (in terms of urinary, bowel, sexual and total function) had decreased, remained the same or increased compared to the T12mo questionnaire (three-point Likert scale, see Additional file 1: Appendix A). Patients whose symptoms remained the same were regarded as stable and included in the test-retest analysis. In order to assess the minimal important change (MIC), the follow-up questionnaires of BlaZIB (T6mo, T12mo) contained an anchor question to assess changes over time, i.e. 'Did your bladder cancer-specific complaints (urinary, bowel, sexual function and overall) change compared to your complaints at diagnosis?'. Patients were asked to score their change on a nine-point Likert scale ranging from 1 (worse than ever) to 9 (no complaints anymore) for urinary, bowel, sexual and total function, separately [12]. We clustered the answers into three categories: importantly deteriorated (1-3), not importantly changed (4-6) and importantly improved (7-9) [13].
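The clustering of the nine-point anchor responses into three categories can be illustrated with a short Python sketch. The function name `cluster_anchor` is hypothetical and introduced here only for illustration; the cut-offs (1-3, 4-6, 7-9) are those stated in the text.

```python
def cluster_anchor(score: int) -> str:
    """Map a nine-point anchor response to one of three change categories.

    The function name is hypothetical; the cut-offs follow the text:
    1-3 importantly deteriorated, 4-6 not importantly changed,
    7-9 importantly improved.
    """
    if not 1 <= score <= 9:
        raise ValueError("anchor score must be between 1 and 9")
    if score <= 3:
        return "importantly deteriorated"
    if score <= 6:
        return "not importantly changed"
    return "importantly improved"


if __name__ == "__main__":
    for s in (1, 5, 9):
        print(s, cluster_anchor(s))
```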

HRQoL questionnaires
The EORTC QLQ-C30 is the core HRQoL questionnaire of the EORTC and measures the HRQoL of cancer patients. The questionnaire consists of 30 items organized into a global health status scale, five functioning scales (physical, role, cognitive, emotional, and social), three symptom scales (fatigue, pain, and nausea and vomiting) and six single items (dyspnoea, insomnia, loss of appetite, constipation, diarrhea, and financial impact) [11].
The QLQ-NMIBC24 is an EORTC module for patients diagnosed with NMIBC and should be administered in addition to the core questionnaire (EORTC QLQ-C30) [4]. The module includes constructs specific to the tumour site and treatment of NMIBC. The QLQ-NMIBC24 consists of 24 items organized into six scales (urinary symptoms, malaise, future worries, bloating and flatulence, sexual functioning, and male sexual problems) and five single items (intravesical treatment issues, sexual intimacy, risk of contaminating partner, sexual enjoyment, female sexual problems) [3].
All items were scored on a four-point Likert scale ranging from 1 (not at all) to 4 (very much), with the exception of the global health status items, which employ a seven-point scale ranging from 1 (very poor) to 7 (excellent). Scores of items were summed and linearly transformed to 0-100 scales and missing data were imputed according to the EORTC guideline [14]. Higher scores on functioning scales and global health status represent better functioning, while higher scores on the symptom scales indicate more symptom burden. Higher scores on the scales and items of the QLQ-NMIBC24 should be interpreted as more symptom burden, with exception of the sexual function scale and sexual enjoyment where higher scores represent better functioning.
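The linear transformation to a 0-100 scale can be sketched in Python as follows. This is a minimal illustration, not the authors' scoring code. It assumes the usual EORTC convention of averaging the answered items of a scale and rescaling linearly, and it assumes that a scale is scored only when at least half of its items were answered. The function name `eortc_scale_score` and its parameters are hypothetical, and the EORTC guideline [14] remains the authority on which scales are reversed.

```python
from typing import Optional, Sequence


def eortc_scale_score(
    items: Sequence[Optional[int]],
    item_range: int = 3,
    reverse: bool = False,
) -> Optional[float]:
    """Score one scale (or single item) on a 0-100 scale.

    items      : raw responses, with None marking a missing answer
    item_range : maximum minus minimum response (3 for a 1-4 item,
                 6 for a 1-7 global health item)
    reverse    : flip the direction so that a low raw mean gives a high score
                 (hypothetical flag; which scales need it is defined in the
                 scoring guideline, not here)
    """
    answered = [x for x in items if x is not None]
    # Assumed missing-data rule: require at least half of the items.
    if not answered or len(answered) * 2 < len(items):
        return None
    raw = sum(answered) / len(answered)      # mean of the answered items
    scaled = (raw - 1) / item_range * 100    # linear transformation to 0-100
    return 100 - scaled if reverse else scaled


if __name__ == "__main__":
    print(eortc_scale_score([2, None, 2]))        # one of three items missing
    print(eortc_scale_score([7, 6], item_range=6))
```

Averaging only the answered items is equivalent to imputing each missing item with the mean of the items that are present, which is why a single function covers both scoring and imputation in this sketch.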

Statistical analysis
Floor and ceiling effects were examined for each scale at each assessment point. If more than 15% of the patients scored at the lowest or highest end of the scale, the scale was considered to have a floor or ceiling effect, respectively [15]. Multitrait scaling analysis and confirmatory factor analysis (CFA) were performed to validate the constructs of the QLQ-NMIBC24. Convergent validity was defined as a correlation of 0.40 or greater between an item and its own scale. Discriminant validity was defined as a correlation of less than 0.40 between an item and any other scale [2,16]. Maximum likelihood (ML) was used as the estimator in the CFA and missing items were handled using full information maximum likelihood (FIML). Model-data fit of the CFA was assessed with the model chi-square, the Comparative Fit Index (CFI), the Root Mean Square Error of Approximation (RMSEA) and the Standardized Root Mean Square Residual (SRMR). A model chi-square p value > 0.05, CFI ≥ 0.95, RMSEA < 0.05 and SRMR < 0.05 indicate a good fit, and CFI > 0.90 and both RMSEA and SRMR > 0.05 but < 0.08 indicate an acceptable fit [15,17]. Internal consistency was assessed with Cronbach's α. A Cronbach's α of 0.70 or higher was considered adequate for group level comparisons. Test-retest reliability was assessed based on the questionnaires administered at T12mo and T12mo + 2wk using the intraclass correlation coefficient for absolute agreement (ICC; two-way mixed model, single measure) [18]. An ICC value of 0.70 or higher was considered acceptable.
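Two of the criteria above, the 15% floor/ceiling rule and Cronbach's α, are simple enough to sketch in Python. The code below is an illustration only and not the analysis code used in this study (which was run in SPSS). The function names are hypothetical, complete cases are assumed, and scale scores are assumed to lie on the 0-100 range.

```python
from statistics import variance
from typing import Dict, Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for one scale.

    rows: one response vector per patient, each holding the k items of the
    scale (complete cases only; this is an assumption of the sketch).
    """
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def floor_ceiling(
    scores: Sequence[float],
    low: float = 0.0,
    high: float = 100.0,
    threshold: float = 0.15,
) -> Dict[str, object]:
    """Share of patients at the lowest and highest possible score.

    An effect is flagged when the share exceeds the threshold (more than 15%).
    """
    n = len(scores)
    floor_share = sum(s == low for s in scores) / n
    ceiling_share = sum(s == high for s in scores) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


if __name__ == "__main__":
    # Made-up responses, purely for illustration.
    demo_items = [[1, 2], [2, 2], [3, 4], [4, 3]]
    print(round(cronbach_alpha(demo_items), 2))
    print(floor_ceiling([0, 0, 50, 100, 33.3, 0, 66.7, 0, 0, 0]))
```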
Divergent validity of the QLQ-NMIBC24 was assessed by calculating the Spearman correlation coefficients between the scales of the QLQ-C30 and QLQ-NMIBC24 [19]. Based on previous studies, we expected generally low to moderate correlations (< 0.40) between the scales of both questionnaires. Previous studies have shown that malaise was moderately to strongly correlated (> 0.40) with all the scales of the QLQ-C30 [2,6,16]. The urinary symptom scale has also previously been shown to be moderately (0.40-0.69) correlated with role function, cognitive function, social function, fatigue, nausea and vomiting, and pain [2,6]. Finally, future worries was expected to be moderately correlated with the emotional function scale of the QLQ-C30 [2] and fatigue [6].
Known group validity was assessed by comparing patients with low, intermediate and high risk NMIBC using independent t-tests. Patients were divided into risk groups based on the European Association of Urology (EAU) guidelines [1] without taking into account the tumour size (not available) and the recurrent nature of the tumour (only primary tumours included). We hypothesized that patients with high risk NMIBC would have more urinary symptoms, malaise, future worries and intravesical treatment issues at T6wk than patients with low risk NMIBC.
Responsiveness to change was examined using all three questionnaires of the UroLife study (i.e. T6wk, T3mo and T15mo) using paired t-tests. We hypothesized that differences on the scales of the NMIBC24 would only be small between T6wk and T3mo, but that symptoms and complaints would decrease from T6wk to T15mo. Effect sizes (ESs) were calculated using Cohen's d statistic (mean difference divided by pooled standard deviation). These provide a distribution-based estimate of the magnitude of mean differences/changes, where an ES of 0.2 is considered small, 0.5 moderate, and 0.8 large [20]. MIC was assessed using the visual-anchor distribution method of De Vet et al. [13]. This method determines the smallest change in scores of the QLQ-NMIBC24 that is regarded as either improvement or deterioration by taking into account the variability and importance of the scores. To determine the importance of the scores, an external anchor is used. Correlations between the anchor question and the change scores on the scales of the QLQ-NMIBC24 were assessed to determine the adequacy of the anchor (r ≥ 0.40) (i.e. does the anchor question measure the same as the change scores?). Then, patients were subdivided into three groups (importantly deteriorated, not importantly changed and importantly improved) using the anchor question, and for each group the distribution of the change scores was plotted. The optimal cut-off point on the receiver operating characteristic (ROC) curve was considered to be the MIC value.
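As an illustration of the effect size calculation, the following Python sketch computes Cohen's d as the mean difference divided by the pooled standard deviation and labels its magnitude with the thresholds given above. It is a sketch under stated assumptions rather than the study's own code: the function names are hypothetical, the pooled standard deviation uses the common two-sample formula, and the label "negligible" for values below 0.2 is our own addition.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean difference (a minus b) divided by the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def magnitude(d: float) -> str:
    """Label an effect size using the 0.2 / 0.5 / 0.8 thresholds."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "small"
    return "negligible"  # label assumed for illustration; not used in the text


if __name__ == "__main__":
    # Made-up scale scores at two time points, purely for illustration.
    t_first = [2.0, 4.0, 6.0]
    t_later = [1.0, 3.0, 5.0]
    d = cohens_d(t_first, t_later)
    print(d, magnitude(d))
```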
The CFA was conducted with the software package R using the "lavaan" package [21]. ICCs were calculated in STATA version 16.0 (StataCorp LLC, College Station, Texas, USA) and SEMs were calculated in SAS (SAS Institute, Cary, North Carolina, USA). All other statistical analyses were executed using SPSS version 25 (IBM Corporation, Armonk, New York, USA). P values < 0.05 were considered statistically significant.

Patient characteristics and data quality
Fifty percent of the NMIBC patients invited for UroLife and 58% of the NMIBC patients invited for BlaZIB completed the baseline questionnaire, resulting in a total number of 1463 eligible patients for this study (Fig. 1). Figure 1 presents the number of completed questionnaires and the response rates at follow-up. The majority of the patients were male (81%) (Table 1). Patients participating in UroLife were, on average, younger (66 vs. 72 years), more often female (21% vs. 17%), living together with a partner (85% vs. 77%) and employed (42% vs. 24%) than those participating in BlaZIB.
The percentage of missing responses was low (< 1%) for all non-sex related scales at all measurement moments of the UroLife study (Table 2). For the sex-related scales, the percentage of missing responses varied between 6.9 and 12.8%, with exception of female sexual problems (41.4-50.0% missing responses).
At T6wk, only four of the eleven scales had no floor effect (< 15%) (Table 2). The highest floor effects were observed for malaise (87.1%) and intravesical treatment issues (74.5%). Over time, the percentage of patients with the lowest possible scores decreased for sexual functioning and male sexual problems but remained stable or increased for all other scales. At T15mo, floor effects were present for all scales. No ceiling effects were observed at any assessment point. The percentage of missing responses was low for all scales, except for female sexual problems (45.7%) (Table 2). Table 3 shows the results of the multitrait scaling analysis. All items had a within-scale correlation of 0.40 or higher and a correlation of 0.40 or lower with other scales, indicating good convergent and discriminant validity, respectively. The model chi-square significance was < 0.0001, CFI was 0.93, RMSEA was 0.06 and SRMR was 0.04 after excluding female sexual function (question answered by N = 21), indicating an acceptable fit. Standardized factor loadings are presented in Table 4.

Construct validity
Correlations between the core questionnaire and the NMIBC module were low (< 0.40) for nearly all scales (Table 6). A moderate, inverse correlation (−0.57) was observed only between emotional function (QLQ-C30) and future worries (QLQ-NMIBC24). This indicates that the content of the core questionnaire and the NMIBC-specific module do not overlap excessively.
Comparison of patients' scores at T6wk according to their NMIBC risk subgroups indicated that patients with high-risk NMIBC reported more urinary symptoms (ES = 0.41), future worries (ES = 0.51), problems with sexual intimacy (ES = 0.41) and risk of contaminating partner (ES = 0.72) than patients with low-risk NMIBC (Table 7).

Interpretability
Based on the correlations between the anchor question and the scales of the questionnaire (ranging between −0.11 and 0.28), the anchor question was deemed inadequate and no MIC values were calculated.

Discussion
When evaluating the psychometric properties of the EORTC QLQ-NMIBC24 in Dutch patients with NMIBC, we observed good structural validity, reliability (i.e. internal consistency and test-retest reliability), construct validity (i.e. divergent validity and known-groups validity) and responsiveness. The number of missing items was low among patients for whom the items were applicable, with the exception of female sexual problems. At all measurement points, multiple floor effects were observed and MIC values could not be determined.
Multitrait scaling analysis and Cronbach's α for internal consistency supported the scale structure of the QLQ-NMIBC24. Only the bloating and flatulence and malaise scales (at follow-up) did not reach the 0.70 cut-off for group level use of the items in these scales. Other authors have reported similar results and also found that the bloating and flatulence scale [2] and especially the malaise scale [2,5,6] seem to yield unsatisfactory results. These results suggest that the two items in the malaise scale are heterogeneous, i.e. the items cannot be grouped into one scale, and a revision of this scale may be needed.
The high number of scales with floor effects that we observed at T6wk has also been found in other studies. Park et al. observed floor effects in nine scales and Mogensen et al. found floor effects in 20 out of the 24 items of the questionnaire [5,6]. Malaise, intravesical treatment issues and bloating and flatulence had the highest percentages of lowest possible scores in all studies (range 43.3% to 90%) [2,5,6]. Floor effects at baseline may pose a problem because further decreases in symptoms and function over time, as occurred in our study, cannot be measured. Reviewing the relevance of scales with high floor effects at baseline might improve the usefulness of this questionnaire. The low correlations between the scales of the QLQ-C30 and the QLQ-NMIBC24 questionnaires indicate good discriminant validity and added value of the QLQ-NMIBC24 to the core questionnaire. We could not confirm the moderate correlation (> 0.40) between malaise and urinary symptoms of the QLQ-NMIBC24 and the scales of the QLQ-C30 as observed in previous studies, but did confirm the moderate correlation between future worries and emotional function [2,6].
We found that the QLQ-NMIBC24 is able to discriminate between subgroups (i.e., NMIBC risk profile) and to measure change over time. However, the difference found for malaise by NMIBC risk profile was small and non-significant. Other studies have found significant differences in most scales according to physical function (> 90 vs < 90) [2] and Karnofsky performance status [6]. For gender comparisons, only differences in sexual function and sexual enjoyment were observed [2,6]. All studies were able to detect changes in score over time [2,6].
For the test-retest reliability and interpretability, we used an anchor question and measurement scale to determine changes in patients' health over the course of time, in line with previous recommendations [12]. However, it might not be suitable to assess changes in malaise (ICC of 0.07) as other, non-bladder cancer related, health issues may affect malaise as well. Furthermore, both malaise, as a consequence of treatment, and intravesical treatment issues are highly dependent on the timing of the questionnaire in the treatment process, which might also explain the rather low test-retest reliability for these scales. In addition, we were not able to calculate MIC values as the correlations between the scores on the anchor question and change scores were too low. Other factors, such as the different modes of administration in the test-retest analysis (online vs. paper), may have also contributed to these findings. Future research will be necessary to determine MIC values for the NMIBC24.
A limitation of this study is the response rate for both questionnaires, i.e. 50% and 58% for UroLife and BlaZIB, respectively. Although our response rates are as expected for Dutch patients with NMIBC, selective non-response may affect the generalizability of the scores to the entire Dutch patient population with NMIBC. We do not, however, expect selective non-response, as participants of the UroLife study were comparable to the non-responders with respect to age, gender and tumour stage (data not shown). Furthermore, selective loss to follow-up might have affected some of our results concerning comparison of outcomes over time (i.e. responsiveness), although analyses based on patients who participated in all three UroLife questionnaires (T6wk, T3mo and T15mo) showed similar results (data not shown).
Finally, the small number of sexually active women in our study (n = 35 at T6wk; n = 0 in the test-retest analysis) limited the examination of the item female sexual problems; we omitted this item from the CFA and could not assess the ICC. Previous studies investigating the psychometric properties of the QLQ-NMIBC24 have also dealt with a low number of sexually active female participants [2,5,6], often lower than is considered adequate according to the COSMIN study design checklist for patient-reported outcome measures [7]. As a consequence, it is hard to draw solid conclusions about the single item female sexual problems.

Conclusions
The QLQ-NMIBC24 questionnaire has in general a good structural validity, reliability, construct validity and responsiveness. The reliability of the scales malaise and bloating and flatulence is, however, suboptimal. Relevant change scores for this questionnaire could not be defined.

Table 8 Unadjusted mean scores (± SD) of EORTC QLQ-NMIBC24 subscales for 541 NMIBC patients participating in UroLife at three time points after diagnosis and mean differences at 3 months (T3mo) and 15 months (T15mo) compared to 6 weeks (T6wk) after diagnosis (responsiveness)
a Effect size was calculated using Cohen's d statistic (mean difference divided by pooled standard deviation) and 0.2 is considered small, 0.5 moderate, and 0.