
Table 2 Overview of statistical analysis methods

From: Psychometric evaluation of the near activity visual questionnaire presbyopia (NAVQ-P) and additional patient-reported outcome items

Analysis

Description

Stage 1: Item properties

Quality of completion

• The quality of completion for the NAVQ-P, NVCI, and NVS was assessed at the item level in the randomized population (N = 235) at Baseline, Week 2, Month 1, Month 2, and Month 3. For the NAVQ-P, missing data at the form level (the whole PRO) were also evaluated following finalisation of scoring.

Item response distributions and floor and ceiling effects

• Item response distributions for the NAVQ-P items, NVCI, NVS, and NVCP at Baseline and Month 3 were examined to identify skewed distributions or response options that were disproportionately endorsed for a given item.

Stage 2: Dimensionality and scoring

Inter-item correlations

• Inter-item correlations provided an initial exploration of dimensionality and were examined using polychoric correlation coefficients between each pair of NAVQ-P items in the cross-sectional analysis population at Month 2, to confirm that each item measured a distinct concept without redundancy. Item pairs that correlated highly with one another (> 0.90) or only weakly (< 0.40) were flagged for review; a minimal computational sketch follows.
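
As an illustration only, the following Python sketch estimates a two-step polychoric correlation (thresholds from marginal proportions, then maximum likelihood over the latent correlation) and applies the flagging rule above. The data structures and the `flag_pairs` helper are hypothetical, not the study's implementation.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def polychoric(x, y):
    """Two-step polychoric correlation between two ordinal item vectors."""
    def thresholds(v):
        cats = np.sort(np.unique(v))
        cum = np.cumsum([np.mean(v == c) for c in cats])[:-1]
        # +/-8 stands in for +/-infinity so the bivariate CDF stays finite
        return np.concatenate(([-8.0], norm.ppf(cum), [8.0])), cats

    tx, cx = thresholds(x)
    ty, cy = thresholds(y)
    counts = np.array([[np.sum((x == a) & (y == b)) for b in cy] for a in cx])

    def negloglik(rho):
        mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
        ll = 0.0
        for i in range(len(cx)):
            for j in range(len(cy)):
                # probability mass of cell (i, j) under the latent bivariate normal
                p = (mvn.cdf([tx[i + 1], ty[j + 1]]) - mvn.cdf([tx[i], ty[j + 1]])
                     - mvn.cdf([tx[i + 1], ty[j]]) + mvn.cdf([tx[i], ty[j]]))
                ll += counts[i, j] * np.log(max(p, 1e-12))
        return -ll

    return minimize_scalar(negloglik, bounds=(-0.999, 0.999), method="bounded").x

def flag_pairs(items):
    """items: dict of item name -> array of ordinal responses."""
    names = list(items)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = polychoric(np.asarray(items[a]), np.asarray(items[b]))
            if r > 0.90 or r < 0.40:  # redundancy / weak-relation flags
                print(f"review {a} vs {b}: rho = {r:.2f}")
```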

Internal consistency reliability

• Internal consistency reliability, concerned with the homogeneity of items belonging to the same domain, was evaluated using Cronbach’s alpha coefficient (≥ 0.70 indicating good internal consistency) [30].

• The impact of item removal on internal consistency reliability was also examined: Cronbach’s alpha was recalculated with each item removed in turn from its respective score (see the sketch after this list).

• Internal consistency was assessed at Month 2 in the cross-sectional analysis population for the NAVQ-P.
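
A minimal sketch of Cronbach’s alpha and the alpha-if-item-deleted check, assuming a complete participants-by-items response matrix:

```python
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = participants, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)         # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)     # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def alpha_if_deleted(items):
    """Recompute alpha with each item removed in turn (>= 0.70 deemed good)."""
    items = np.asarray(items, dtype=float)
    return {j: cronbach_alpha(np.delete(items, j, axis=1))
            for j in range(items.shape[1])}
```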

Confirmatory factor analysis (CFA)

• Confirmatory factor analysis (CFA) was conducted using data from the cross-sectional analysis population at Month 2 to assess the dimensionality of the 15-item NAVQ-P and to inform item deletion and overall scoring.

• Factor analytic models employed a weighted least square mean and variance adjusted (WLSMV) estimator, with theta parametrisation.

• Model fit was assessed using the following fit indices: the Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), Root Mean Square Error of Approximation (RMSEA), and Standardized Root Mean Square Residual (SRMR).

• Model fit indices were evaluated against the following desirable thresholds, intended to guide the assessment of model fit rather than to serve as strict cut-offs: CFI > 0.95, TLI > 0.95, RMSEA < 0.08, and SRMR < 0.05 (a minimal fitting sketch appears after this list).

• The choice between a weighted and an unweighted summary score was informed by comparing constrained (factor loadings constrained to be equal) and unconstrained (factor loadings freely estimated) CFA models.

• If the factor loadings could be considered equal across items (i.e. the constrained model did not fit significantly worse than the model with freely estimated loadings), an unweighted sum score was proposed [18].

• The NVCI, NVS, and NVCP were not included in these analyses because they measure distinct concepts that are not directly related to near vision functioning and were not expected to form part of the NAVQ-P score. Relationships with the single-item measures were instead assessed in the convergent validity analysis.
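
For orientation, a sketch of a single-factor CFA in Python’s semopy is shown below. The item names, data file, and one-factor structure are placeholders, and the availability of a DWLS objective (a diagonally weighted least squares analogue of WLSMV) depends on the semopy version, so treat the `obj` argument as an assumption; WLSMV with theta parametrisation itself is typically run in Mplus or lavaan.

```python
# pip install semopy
import pandas as pd
import semopy

# Hypothetical one-factor specification; the real NAVQ-P item names differ.
desc = "NearVisionFunctioning =~ item01 + item02 + item03 + item04 + item05"

df = pd.read_csv("navqp_month2.csv")   # hypothetical file of Month 2 item responses

model = semopy.Model(desc)
model.fit(df, obj="DWLS")              # DWLS as a WLSMV-style estimator (assumed available)
print(semopy.calc_stats(model).T)      # fit indices, including CFI, TLI, and RMSEA
```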

IRT analyses of NAVQ-P

• The NAVQ-P was assessed through item response theory (IRT) analyses to inform item properties, dimensionality, and scoring. The analysis was performed for the cross-sectional analysis population at Month 2 to assess whether the NAVQ-P was unidimensional.

• The Rating Scale Model (RSM) was applied, with the N/A response treated as missing for this analysis.

• Item characteristic curves were used to assess response probabilities and to identify weak or overlapping item response categories.

• Person fit was evaluated through assessment of standardized fit residuals; the number and proportion of participants with fit residuals outside the range 0 ± 2.5 were summarized.

• Local dependency was assessed with Yen’s Q3 statistic; any residual correlation greater than the average residual correlation + 0.30 highlighted potential redundancy and interdependence [19, 20].

• Person separation reliability, which is comparable to Cronbach’s alpha coefficient, was also assessed; values > 0.70 were deemed acceptable.

• Item fit was assessed using the infit mean square (MNSQ) and outfit MNSQ statistics to highlight observed responses that deviate from Rasch model expectations. Values between 0.5 and 1.5 indicate acceptable item fit and are considered productive for measurement (see the sketch after this list).

• Item-person maps were employed to flag overlapping items and any gaps in item locations along the latent trait continuum.
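
The infit and outfit statistics have simple closed forms once a rating scale model has been fitted. The sketch below assumes arrays of observed responses, model-expected scores, and model variances (participants × items) are already available from a fitted model; it is illustrative, not the study’s code.

```python
import numpy as np

def item_fit(observed, expected, variance):
    """Infit and outfit mean square (MNSQ) per item.

    observed, expected, variance: 2-D arrays (participants x items) of observed
    responses, model-expected scores, and model variances from a fitted model.
    """
    resid_sq = (observed - expected) ** 2
    z_sq = resid_sq / variance
    outfit = z_sq.mean(axis=0)                            # unweighted mean square
    infit = resid_sq.sum(axis=0) / variance.sum(axis=0)   # information-weighted
    return infit, outfit  # values in roughly 0.5-1.5 read as acceptable
```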

Item reduction for the NAVQ-P

• Item reduction was considered for the NAVQ-P based on the analyses of item properties and dimensionality, but also considering previous qualitative findings and the clinical relevance and importance of the items.

• IRT and internal consistency analyses were repeated iteratively following the deletion of items until a final item set was decided upon.

Stage 3: Reliability and validity of scores

Reliability

Scale-level test-retest reliability

• The stability of scale-level scores between Months 2 and 3, and between Week 2 and Month 1, was assessed in the primary and secondary test-retest analysis populations, respectively, using PGI-S- and DCNVA-defined stable groups.

• The intraclass correlation coefficient (ICC) was calculated for continuous scores. The following cut-offs were employed to interpret ICC values: < 0.40 indicated poor reliability, 0.40 to 0.75 indicated fair to good reliability, and > 0.75 indicated excellent reliability [31].

• The stability of NVCI and NVS scores was assessed by calculating weighted kappa coefficients, interpreted as follows: ≥ 0.75 excellent; 0.40 to < 0.75 fair; < 0.40 poor [31]. A combined sketch for the ICC and weighted kappa follows this list.
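
A combined sketch of both reliability statistics, assuming long-format data with hypothetical column names; the weighting scheme for kappa (quadratic here) is an assumption, as the table does not specify it.

```python
import pandas as pd
import pingouin as pg                              # pip install pingouin
from sklearn.metrics import cohen_kappa_score

# Long-format stable-group data with hypothetical columns:
# participant, visit ("Month 2" / "Month 3"), navqp_score, nvci
df = pd.read_csv("navqp_stable_group.csv")

# ICC for the continuous NAVQ-P score across the two visits
icc = pg.intraclass_corr(data=df, targets="participant",
                         raters="visit", ratings="navqp_score")
print(icc[["Type", "ICC"]])

# Weighted kappa for the ordinal NVCI between visits (complete cases assumed)
wide = df.pivot(index="participant", columns="visit", values="nvci")
kappa = cohen_kappa_score(wide["Month 2"], wide["Month 3"], weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")
```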

Construct-related validity

Convergent validity

• Convergent validity was evaluated by calculating correlations of the DCNVA with the NAVQ-P, NVCI, and NVS using data collected in the cross-sectional analysis population at Month 2.

• Scores assessing similar or related concepts were expected to have strong correlations (r ≥ 0.5), thereby demonstrating convergent validity.

Known-groups analysis

• Construct validity was also assessed using the known-groups method, to evaluate differences in mean PRO scores between groups of participants who differ in severity as defined by PGI-S and DCNVA scores.

• Known-group comparisons were assessed using Month 2 data in the cross-sectional analysis population (a minimal sketch follows this list).
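
The table does not name the statistical test; a one-way ANOVA across PGI-S severity groups is a common choice and is sketched below with hypothetical column names.

```python
import pandas as pd
from scipy.stats import f_oneway

df = pd.read_csv("month2_scores.csv")   # hypothetical: navqp_score, pgis_group

groups = [g["navqp_score"].dropna() for _, g in df.groupby("pgis_group")]
f_stat, p_value = f_oneway(*groups)

print(df.groupby("pgis_group")["navqp_score"].mean())  # mean score per severity group
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")
```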

Ability to detect change over time

• Ability to detect change over time analyses focused on evaluating changes in PRO scores over time, to demonstrate that observed improvements (or worsening) in those scores correspond to improvements (or worsening) in external criteria (anchors) related to the same construct.

• Ability to detect change was assessed using data from Baseline, Months 1, 2, and 3, with change from Baseline to Month 3 considered the primary analysis.

• The following pre-specified cut-offs were used to interpret the magnitude of each effect size (ES): small (ES = 0.20), moderate (ES = 0.50), and large (ES = 0.80) [32].
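
One common effect size for change divides the mean change by the baseline standard deviation (the standardized response mean divides by the SD of change instead); which definition the study used is not stated here, so the sketch below is an assumption.

```python
import numpy as np

def effect_size_of_change(baseline, followup):
    """Mean change divided by the baseline SD (one common ES definition)."""
    baseline = np.asarray(baseline, dtype=float)
    change = np.asarray(followup, dtype=float) - baseline
    return change.mean() / baseline.std(ddof=1)

# Interpreted against the pre-specified benchmarks:
# 0.20 small, 0.50 moderate, 0.80 large [32]
```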

Stage 4: Interpretation of scores

Anchor-based methods

• Anchor-based methods were used to identify participants who experienced an important change in their condition, by exploring the association between changes on the NAVQ-P, NVCI, and NVS and the anchor measures (PGI-S, PGI-C, and DCNVA).

• All anchor-based analyses were performed in the interpretability analysis population by examining changes between Baseline and Months 1, 2, and 3, with the change to Month 3 considered as the primary analysis.

• A theoretical justification for the relationship between the anchor and the target instrument should exist and should be empirically demonstrated [14, 33, 34]. The suitability of proposed anchors was tested using polyserial or Spearman’s rank correlation coefficients to establish the relationship between change in the anchor and change in each PRO score between Baseline and Month 3. Anchors with correlations < 0.30 were not taken forward for analysis.

• Each anchor deemed to have a sufficient relationship with the PRO scores was used to classify participants in the interpretability analysis population as having improved, remained unchanged, or worsened.

• The mean change in PRO score was calculated for participants classified as improved, stable, and worsened (meaningful within-group change). The meaningful between-group difference for each anchor was defined as the difference in mean PRO score change between the improved and stable groups.

• Receiver operating characteristic (ROC) curve analysis was used to find the change in PRO score that optimally discriminates between the improved and stable groups defined by the anchors (see the sketch after this list).

• Empirical cumulative distribution functions (eCDFs) and probability density functions (PDFs) were also plotted to aid comparison of different possible responder definitions on the PRO scores [13].

• Tables showing change from Baseline to Month 3 in NAVQ-P, NVCI, and NVS in terms of various percentiles, by baseline PGI-S, were also developed to explore any baseline dependency of meaningful change.
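
A common way to extract the optimal discriminating change score from a ROC analysis is the Youden index (sensitivity + specificity − 1). The sketch below assumes that improvement corresponds to a decrease in the PRO score; the direction and inputs are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve

def optimal_change_threshold(change_scores, improved):
    """change_scores: change from Baseline; improved: 1 if the anchor
    classifies the participant as improved, else 0 (stable group)."""
    # Negate so that larger improvements (more negative change) rank higher
    fpr, tpr, thresholds = roc_curve(improved, -np.asarray(change_scores))
    youden = tpr - fpr                      # Youden index at each threshold
    return -thresholds[np.argmax(youden)]   # undo the negation
```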

Distribution-based methods

• Distributional properties of the NAVQ-P, NVCI, and NVS scores were used to guide potential responder definitions estimated from anchor-based approaches, identifying the amount of change that exceeds measurement error [16, 35].

• These included half the standard deviation (0.5 SD) of scores at Baseline and the standard error of measurement, computed as SEM = SD × √(1 − reliability); a minimal sketch follows.
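
Both distribution-based benchmarks in one short sketch; the reliability input would typically be the test-retest ICC or Cronbach’s alpha.

```python
import numpy as np

def distribution_benchmarks(baseline_scores, reliability):
    """Return (0.5 x baseline SD, SEM), with SEM = SD * sqrt(1 - reliability)."""
    sd = np.asarray(baseline_scores, dtype=float).std(ddof=1)
    return 0.5 * sd, sd * np.sqrt(1.0 - reliability)
```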

Triangulation

• Triangulation was conducted by consolidating the different meaningful change estimates derived from anchor-based and distribution-based methods to support identification of an appropriate range of meaningful change values [24,25,26].

• Correlation-weighted average estimates of meaningful change from the anchor-based methods were also used to converge on a range of potential meaningful change estimates [27]; a minimal sketch of this weighting follows.
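
A correlation-weighted average can be sketched as below, weighting each anchor-based estimate by the absolute correlation of that anchor with the PRO change score; the numbers in the usage line are purely illustrative.

```python
import numpy as np

def weighted_meaningful_change(estimates, anchor_correlations):
    """Average anchor-based estimates, weighted by |anchor correlation|."""
    w = np.abs(np.asarray(anchor_correlations, dtype=float))
    return np.average(np.asarray(estimates, dtype=float), weights=w)

# e.g. weighted_meaningful_change([-8.0, -6.5, -9.0], [0.45, 0.38, 0.52])
```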