Does scrolling affect measurement equivalence of electronic patient-reported outcome measures (ePROM)? Results of a quantitative equivalence study

Background Scrolling is a perceived barrier in the use of bring your own device (BYOD) to capture electronic patient reported outcomes (ePROs). This study explored the impact of scrolling on the measurement equivalence of electronic patient-reported outcome measures (ePROMs) in the presence and absence of scrolling. Methods Adult participants with a chronic condition involving daily pain completed ePROMs on four devices with different scrolling properties: a large provisioned device not requiring scrolling; two provisioned devices requiring scrolling – one with a “smart-scrolling” feature that disabled the “next” button until all information was viewed, and a second without this feature; and BYOD with smart-scrolling. The ePROMs included were the SF-12, EQ-5D-5L, and three pain measures: a visual analogue scale, a numeric response scale and a Likert scale. Participants completed English or Spanish versions according to their first language. Associations between ePROM scores were assessed using intraclass correlation coefficients (ICCs), with lower bound of 95% confidence interval (CI) > 0.7 indicating comparability. Results One hundred fifteen English- or Spanish-speaking participants (21-75y) completed all four administrations. High associations between scrolling and non-scrolling were observed (ICCs: 0.71–0.96). The equivalence threshold was met for all but one SF-12 domain score (bodily pain; lower 95% CI: 0.65) and two EQ-5D-5L item scores (pain/discomfort, usual activities; lower 95% CI: 0.64/0.67). Age, language, and device size produced insignificant differences in scores. Conclusions The measurement properties of PROMs are preserved even in the presence of scrolling on a handheld device. Further studies that assess scrolling impact over long-term, repeated use are recommended.


Introduction
Patient-Reported Outcome (PRO) measures have been increasingly gaining momentum in clinical outcome research because of recent movement toward patientcenteredness in both clinical practice and research [1,2]. In the last two decades, the Food and Drug Administration (FDA) and European Medicines Agency (EMA) have progressively contributed to patient-focused drug development by requiring PRO endpoints in new drug applications [3] and including data from PROMs in drug labelling [4][5][6].
An increasing number of clinical research studies employ electronic formats to collect PRO measures (PROMs) in field-based and in-clinic settings [7]. This has been driven by the availability, low cost, and reliability of modern mobile devices such as smartphones and tablets, along with the requirement to improve the integrity and quality of data collected while limiting missing data entries and ensuring the timeliness of PROM completion [8]. Because most PROMs were originally developed and validated in pen-and-paper forms, migrating a PROM to an electronic format (ePROM) requires care to ensure the measurement properties of the original instrument are unaffected by the change in format [9].
Many clinical trials provide an electronic mobile device (provisioned device: PD) of a common make and model to all participants, to ensure that PROM presentation is identical for all participants. However, the drive to make clinical studies more patient-centric has led to increasing interest in collecting PROMs using the participants' own device (bring your own device: BYOD) with the aim to make PROM collection more convenient. Due to smartphone screen size, ePRO solution providers typically aim to present a single PROM question per screen and to ensure all content is displayed without the requirement to scroll [10,11]. When collecting PROMs using BYOD, the screen size and resolution of the participants' devices may vary, and this may introduce the requirement for the user to scroll the screen to reveal both the question and response options for some or all PROM items.
Previous studies have provided some evidence on the equivalence of the PROMs after migrating from paper to electronic formats [12,13]. However, past research examining the comprehension of information presented on computer monitors has reported mixed results when considering the impact of the requirement to scroll to retrieve information [14,15]. One concern for studies utilizing ePROM is that a user may not review the complete question and response options before giving an answer to a questionnaire item with the presence of scrolling, and this behaviour may adversely affect the PROM measurement properties. While the measurement equivalence of PROMs comparing BYOD to PD has been studied [16], the impact of scrolling features on the response pattern associated with PROM completion has not been addressed. In this study, we aimed to evaluate the measurement equivalence of ePROMs in the presence and absence of scrolling on a set of provisioned smartphone devices as well as BYOD smartphones.

Design
A Latin square crossover design enabling the randomization of four arms (sequences) and four periods (schedules) and balanced for first-order carryover was employed [17,18]. This design incorporates blocks of 4 sequences of 4 individual administrations, with sequences randomly allocated within each block. Each sequence contains a single instance of each administration in such a way that within each block the treatment periods contain the same number of each administration, and individual administrations are preceded by each other administration the same number of times (balanced first order carryover). This particular design reduces errors as a result of imbalance contribution of the interventions and requires a relatively small sample size to conduct the trial. On each period, one of the following formats was administered: 1) A provisioned device not requiring scrolling (Samsung Galaxy J7: screen size: 5.5-in., screen resolution: 720 × 1280 pixels); 2) a provisioned device requiring scrolling to reveal all item text and including a "smart-scrolling" feature that disabled the "next" navigation button until all information was viewed (Samsung Galaxy Core Prime: screen size: 4.5-in., screen resolution: 480 × 800 pixels); 3) a provisioned device requiring scrolling (Samsung Galaxy Core Prime: screen size: 4.5-in., screen resolution: 480 × 800 pixels) without the smart-scrolling feature (user can advance without scrolling to reveal all information); and 4) BYOD (Android or iOS) with smart-scrolling. We provided no instruction to the participants regarding the type of Android or iOS mobile device that they could bring for use in the BYOD administration period The format layout differences and smart-scrolling feature are illustrated in Fig. 1. A washout period of 1 hour was used between each ePROM administration schedule. This washout period included a distraction task comprising a Paced Visual Serial Addition Test (PVSAT), developed using Apple Research Kit by ICON Clinical Research (Dublin, Ireland) and CRF Bracket (Arlington, VA). This task comprised a working memory addition test with numbers repeated every 3 s for 60 repeats, and was deployed on an iPad Mini device.
Included in the study was a mix of US Englishspeaking and US Spanish-speaking participants, aged 18 years and older, with a self-reported chronic medical condition causing daily pain or discomfort. Participants completed a selected set of PROMs. Study procedures were conducted at ICON's office (Maryland, USA), with all participants being recruited from the US District of Columbia metropolitan area by Shugoll Research (Bethesda, USA) using their client database, referrals, and social media. All participants provided written informed consent. Salus Institutional Review Board (Austin, TX) provided ethical approval for the study. Participants were randomized to an administration schedule according to a pre-defined randomization list. Participants received training on use of the provisioned electronic smartphone devices from research staff to complete the PROMs.
The PROMs were delivered using the mProve Health ePRO platform (CRF Bracket, Arlington, VA). The ePRO platform was available in both US-English and US-Spanish versions, and participants were provided with the version corresponding with their primary language. The PROMs included the 12-Item Health Survey (SF-12) [19], EuroQol-5 Dimension-5 Level (EQ-5D-5L), EuroQol Visual Analog Scale (EQ-VAS) [20][21][22][23], and three items measuring pain over the past week: a visual analogue scale (VAS), an 11-point numeric rating scale (NRS), and a 7-point Likert scale (LIK). The electronic implementation of the SF-12 and EQ-5D instruments were approved by the license holders, and the VAS, NRS, and LIK for pain were implemented according to ePRO design best practices [24]. Information was collected from participants on their attitudes towards BYOD use, along with familiarity with smartphone devices, by administering an end-of-study questionnaire on paper. The ePRO platform was configured such that no item could be skipped. However, it was possible that the participant could withdraw from the study during schedule or after finishing a schedule. These participants were excluded to ensure a balanced crossover design. Hence, missing information was only possible at schedule level and not at item level. However, we only included the participants who completed all four schedules.
To calculate the required sample size, we assumed 80% power with a one-sided alpha significance of 0.05 and a true underlying Intraclass Correlation (ICC) of 0.85. We further assumed the difference we wished to equate at least a lower bound for ICC of 0.70 [7,9]. Subsequently, the required sample size per arm of the study was calculated to be 26 subjects. To compensate for losing five degrees of freedom as a result of extra variables in the model, we added 5 to the initial sample size (N = 31). The target recruitment sample size of 165 participants (assuming 25% dropout) was determined to provide 124 fully evaluable subjects with approximately 31 participants per sequence. We used the formula offered by Walter et al. to calculate the sample size [25]. No power analysis was performed for the logistic regression assumptions; however, we used a two-sided alpha at 0.05 as the significance level to interpret the results of the logistic regression analysis.

Statistical analysis
Analyses were conducted using SAS 9.4 (SAS Institute, Inc., NC, USA), Stata 15 (StataCorp LLC, College Station, TX), and SPSS 25 (IBM, Armonk, NY). Mixedeffects generalized linear models (ME-GLM) were employed to fit the data and test the association between the treatment variables (e.g. scrolling vs. non-scrolling) with each PRO score. A random intercept model with study participants treated as random effects was specified with all the covariates (schedules and sequence of administration) modelled as fixed effects. ICCs were calculated using the method specified by McGraw & Wong to derive ICCs with 95% confidence interval. ICC (A, K) for a two-way mixed effects model with absolute agreement among more than two experiments (here schedules) was applied [26] to the PROMs. Additionally, the ICCs were calculated by dividing the variance of the random intercept by the total variance of the ME-GLM model, which is the sum of variance for the random intercept and that of the error term. The 95% confidence interval was obtained using the "delta method" [27,28]. The more conservative method of estimating ICC (the one with a lower estimate) was eventually used as the primary method. Measurement equivalence was considered when a lower bound of the 95% Confidence Interval for the estimated ICC was at least 0.70 [7,9]. The results on post-estimation ICCs were compared between two software applications, SAS 9.4 and STATA 15 for consistency. Sensitivity analyses were conducted to examine differences between participants with any missing schedules and those who completed all four schedules. We fitted logistic regression models in which sex and age groups were set as the predictor variables and schedule completion status was set as the outcome variable. We also generated ICCs using all information (complete schedules and missing schedules) as well as only-complete schedules to evaluate the difference in the results given the input. Statistical significance was calculated for the twosided 0.05 level throughout.

Participants
Of the 151 eligible participants (42 US Spanish-speaking and 109 US English-speaking) initially recruited, 36 participants were excluded from the analysis for reasons described in Fig. 2. The final analyses included 115 participants (95 English-speaking and 20 Spanishspeaking), aged 21 to 75 years who completed all four schedules. Table 1 conveys detailed information on demographic features of the participants included in the final analyses. The most common self-reported cause of pain was arthritis (33.9%) followed by back pain (13%). Approximately 41% (N = 47) of the participants reported a heterogeneous array of reported morbidities, including diabetes. For all reported morbidities, only an indirect causal link between the reported morbidity and chronic pain was conceivable.
Familiarity with and attitudes toward BYOD Table 2 provides further details about BYOD familiarity and preference and attitudes toward BYOD devices. Out of 115 participants, 66 (57.4%) participants used an Apple device and 49 (42.6%) used an Android device as the BYOD device in this study. Seventy-eight participants (67.8%) carried large devices, arbitrarily defined as one with a diagonal size at least 140 mm (5.5 in). Only two of the 115 participants reported inability to download and run the study app on the BYOD device and required assistance from this ePROM study's research assistant to download the study app. Ninety-nine participants (86.1%) indicated that they were "definitely willing" to use a BYOD device for a clinical trial. Finally, 49 participants (57.4%) expressed that it was "essential/very important" that others could not see their data on their device. Table 3 presents the mean (SD) for each scale or item score under each of the four schedules and provides estimated ICC (95% CI). Comparing the scrolling and nonscrolling schedules, the equivalence threshold criterion (a minimum of 0.7 lower band of the 95% confidence interval for ICC) was met for all scale/item scores except for the bodily pain scale score from SF-12, and usual activity and pain/discomfort items of the EQ-5D-5L. Estimated ICCs for SF-12 ranged between 0.72 and 0.96, and that for EQ-5D-5L items and scores ranged between 0.71-0.90. For the three pain scales, the ICCs showed a range between 0.81 and 0.95. The lower bound for 95% CI for bodily pain from SF-12 was 0.65 and for usual activity and pain/discomfort items of the EQ. 5D-5L was 0.67 and 0.64 respectively. The same pattern of success in meeting the measurement equivalence criteria was preserved for the overall ICC (a model with no comparison), contrasting BYOD schedule with non-scrolling schedule, and smart scrolling schedule versus non-smart scrolling schedule. The equivalence threshold criterion was met for eleven of twelve SF-12 items across all the three comparisons and for the overall estimated ICCs (results are not shown). Table 4 provides detailed information on the estimated ICC (95% CI) for the models for the covariate impact. The reliability threshold meeting success pattern remained unchanged for the impact of three covariates: language (Spanish versus English), device size (large versus normal), and age (45-64 years versus 18-44 years and 65 + versus 18-44 years). For all three covariate effects, the estimated ICCs ranged from 0.71 to 0.96 across all the PROMs. For bodily pain of SF-12, and usual activity and pain/discomfort of the EQ-5D-5L the lower band of the 95% CI for the ICCs ranged between 0.62 and 0.71 across all the PROMs.

Sensitivity analysis
Cutting down the analytic sample from the full sample (N = 151) to the balanced sample (N = 115) trivially affected the ICCs and the confidence intervals. For instance, the overall ICCs for the SF-12-Physical Component Summary (PCS) score were estimated as 0.91(95% CI: 089-0.93) using the full sample and 0.92(95% CI: 0.89-0.94) after using the balanced sample. The equivalence analysis to obtain the ICCs with 95% CI was compared among SAS, STATA, and SPSS. SAS and STATA generated the exact results. However, by ignoring the covariate effect, SPSS consistently generated inflated ICCs. As an example, using SPSS the overall ICCs Mean differences of the scale/item scores across the four schedules using one-way analysis of variance and by including only the first administration (e.g., excluding administrations B, C, and D in ABCD sequence) were not statistically significant (one-sided P-value > 0.1 consistently).

Discussion
The ePRO design good practice guidelines, such as those reported by the Critical Path Institute's ePRO Consortium require the visibility of the full item stem text and its entire response options on the electric devices [24]. It follows that a principal concern in regard with migrating an existing pen and paper format PROM to an ePROM is that the participant may respond differently to items when the question and its response options are displayed fully compared to when items are partially displayed on a single screen. The difference in participants' response patterns could theoretically stem from their unawareness of all the response options if some appear off screen. In addition, participants could find it inconvenient to scroll and, therefore, pick an item in view so that they can move to the next question quickly. For that reason, we examined the hypothesis that scrolling can alter participants' response pattern. Had concern about using own device to download app to use in future 8 (7.0%)  Had Any concern about using own device to answer study questionnaire This study provided a strong indication that the presence of scrolling is unlikely to affect PROM measurement properties. More specifically, we demonstrated measurement equivalence of the SF-12, EQ-5D-5L, and three different pain scales using common response scale types in the presence and absence of scrolling on provisioned and BYOD smartphone devices. There was measurement equivalence when comparing BYOD smartphones with non-scrolling provisioned devices satisfied the measurement equivalence. Similarly, measurement equivalence was preserved in comparing smartscrolling with non-scrolling devices. Bodily pain scale score of the SF-12 and usual activity and pain/discomfort items of the EQ-5D-5L were the only scale/items which did not pass the measurement equivalence test. However, the lower band of the 95% confidence interval for the three pain scales exceeded the threshold of 0.7. Such inconsistencies may indicate discrepancies in itemlevel properties across different instruments that measure similar constructs. The impact of age, language, and smartphone size on the measurement equivalence was negligible and not statistically significant. The sensitivity analysis was done by preserving only the first administration, which converted the crossover design into a parallel design at the price of losing some power; however, the analysis of variance model showed no difference in mean scores of the PROMs across the four schedules. These sets of analyses supported the insignificant impact of the sequential testing on the measurement equivalence results. It is noteworthy to emphasize that the focus of the current study was not to test the psychometric properties of these instruments on electronic devices. This study is meant to evaluate whether the changes in the question-answer display format on smartphone screens may result in changing the subject responses. A number of approaches are currently offered by ePRO solution vendors to mitigate the need for scrolling. One is to detect device features (make, model, etc.) on app installation and block devices that do not meet minimum size/ specification criteria. Such an approach typically employs a look-up table of device specifications. While commercial databases exist, these have limitations, as it is hard to keep up to date with all makes and models (esp. Android) to enable this option for inclusion of all possible devices. A second method is to detect scrolling on a perpage basis and provide a scrolling indicator or disable navigation until scrolling has been accomplished (smartscrolling). Finally, one can ensure that the navigation buttons are always at the foot of the page so the need to scroll to advance is required to reveal the entire questionnaire item before it is possible to advance to the next question. We utilized the smart-scrolling approach in this investigation.

Study reimburses for data charges
In terms of design and analysis of the study, we employed a Latin square crossover design, which allowed the randomization of the four schedules and four different sequences. We followed previous research [7,9] to select the acceptable lower band 95% confidence interval limit (i.e., 0.70 to serve as the equivalence threshold). The fixed one-hour distraction task between each subsequent pair of ePROM administration was assumed to effectively mitigate the participant's recall of the response pattern from the previous administration to the next. By including two covariates, sequence and schedule, in the regression model for the equivalence estimation we tried to further mitigate the carryover effect.
The study comes with some limitations. While we were able to demonstrate measurement equivalence in the presence or absence of scrolling during repeated administration on a single day, we did not study the possible effects of scrolling during repeated use that is common with a typical clinical trial scenario. It would be valuable to study whether scrolling has a negative effect on completion compliance during longitudinal use, and whether response behaviour might be affected longitudinally if scrolling produces additional completion burden for the patient. Secondly, we only examined one method to mitigate scrolling, although it is likely that the other scrolling mitigation approaches would yield similar results. Finally, we had a small sample of participants who presented small BYOD smartphones and were not able to breakdown the sample for detailed analysis of the BYOD size effect. According to the latest data on smartphone sale by screen size, it is evident that small smartphones are still used by some people [29]. Hence, the results of this study on the impact of the BYOD size should be interpreted with caution.

Conclusions
This study, to our knowledge, is the first research that evaluates scrolling providing some positive signals to help mitigate concerns over use of a scrolling feature when it is necessary. While the need for scrolling is unlikely on larger devices and can be completely prevented when providing a provisioned smartphone to study participants, the need to scroll cannot be completely eliminated in a BYOD setting where a pre-defined criteria to exclude small BYOD devices is not set up. Based on the results of our study, we make the following recommendations relevant to ePRO design in the future: 1) continue to design ePROMs to avoid scrolling when using a provisioned device; 2) mitigate scrolling by using one of the approaches described (smart-scrolling, scrolling indicator/pop-up, or navigation buttons at the foot of the screen requiring scrolling to progress), 3) override certain user-adjusted screen display settings within the app display where possible; and 4) always provide partial provisioning as an option to allow for patients with unsuitable smartphones, which can be facilitated by defining a minimum specifications that can be easily identified by patient/site [9].