Structural validity of the Arabic version of the disabilities of the Arm, Shoulder and Hand (DASH) using Rasch measurement model

Background The disabilities of the arm, shoulder and hand (DASH) is a commonly used region-specific patient-reported outcome measure (PROM) that quantify upper extremity function (activity limitation) and symptoms. Current evidence suggests that measurement properties of the adapted versions of the DASH are not sufficiently examined. The Arabic DASH has evidence supporting its internal consistency, test–retest reliability, construct validity and responsiveness. On the other hand, the validity of the assumed unidimensionality of the Arabic DASH has not been examined previously. The aim of this study was to examine the structural validity of the Arabic DASH in patients with upper extremity musculoskeletal disorders using Rasch measurement model. Methods Patients with upper extremity musculoskeletal disorders were recruited and were asked to complete the Arabic DASH at their initial visit to physical therapy departments. The overall fit of the Arabic DASH to the requirement of the Rasch measurement model was examined using chi-square statistics for item-trait interaction, mean item and person fit residuals. The fit of individual items, thresholds ordering, local dependency, differential item functioning (DIF), and unidimensionality using the t-test approach were also examined. Results The Arabic DASH did not fit the Rasch measurement model initially (χ2 = 179.04, p < 0.001) with major breach of local item independence and a pattern of high residual correlations among the activity-related items and among the impairment-related items. Combining items into activity-limitation and impairment testlets accommodated the local dependency and led to satisfactory fit of the Arabic DASH to the requirement of the Rasch measurement model (χ2 = 3.99, p = 0.41). Conclusions Rasch measurement model supports the structural validity of the Arabic DASH as a unidimensional measure after the accommodation of local dependency.

inadequate testing of measurement properties [7]. The vast majority of the studies in this systematic review did not establish the structural validity of the adapted DASH versions [7]. The measurement properties of the Arabic version of the DASH has only been tested in one single study [8]. Alotaibi et al. completed the cross-cultural adaptation (forward translation then backward translation followed by expert committee review and pilot testing) of the scale into Arabic language then examined the internal consistency, test-retest reliability, measurement error, construct validity and responsiveness of the Arabic DASH [8]. The results of their study supported the measurement properties examined. In this study the authors used the common scoring method of the scale using one summary score for the whole scale which assume that all items are reflections of one underlying construct and that the scale is fairly unidimensional. The validity of the assumed unidimensionality of the Arabic DASH was not tested by the authors.
The consensus-based standards for the selection of health measurement instruments (COSMIN) defines structural validity as "the degree to which the scores of a health-related patient-reported outcomes instrument are an adequate reflection of the dimensionality of the construct to be measured" [9]. Establishing structural validity of multi-item PROM is imperative to ensure that the scale score (either total score or different subscales) actually reflects the constructs measured. The DASH contain items that capture upper extremity activity limitation and also items that capture psychosocial aspects and impairments in body function. These different domains covered by the DASH, although important and relevant to patients with upper extremity disorders [10][11][12], might threaten the unidimensionality of the measure and question the use of one total score as indicated previously using both classical and modern test theory methods [13][14][15][16][17][18][19]. Thus summarizing a scale with potential multiple constructs in one single score would be invalid and greatly limits the interpretation of the scale scores.
In addition to the unestablished dimensionality of the Arabic DASH, no prior studies examined the validity of the Arabic DASH 5 response categories, and whether the scale items function in a similar manner among subgroups of patients with different characteristics (measurement invariance). In contrast to classical test theory methods, Rasch measurement model provide unique opportunity to examine the Arabic DASH structural validity and to examine the validity of the rating scale and also to examine its measurement invariance across important clinical characteristics [20][21][22]. Therefore, the aim of this study was to examine the structural validity (internal construct validity) of the Arabic DASH in patients with upper extremity musculoskeletal disorders using Rasch measurement model.

Setting and participants
Participants in the current study were recruited from two outpatient physical therapy departments in Riyadh city using convenience sampling. Patients with upper extremity musculoskeletal disorders were recruited if they were 18 years of age or older. Potential participants were excluded if they were unable to read or understand Arabic language, or if they had any neurological disorder. Participants were also excluded if they had spinal surgery or disorder, cardiopulmonary disorder, or neurological disorder that were perceived by the patients as functionally limiting. Ethical approval for the study was obtained from the ethical committees at the participating institutes. Participants signed informed consent forms prior to participation. Two licensed physical therapists and one licensed occupational therapists with cumulative clinical experience of 20 years collected the data.

Procedure
Participants were asked to complete the Arabic DASH during their initial visit to the physical therapy department where they were seeking care for their upper extremity disorders. The Arabic DASH is 30-item region-specific PROM. Each item was scored on 1 (no functional limitation and no symptoms) to 5 (functional inability and extreme symptoms) scale [8]. Items response categories are "no difficulty", "mild difficulty", "moderate difficulty", "severe difficulty", "unable" for 21 items, and "none" "mild", "moderate", "severe", "extreme" for 5 items. Each of the remaining four items has different response categories. The typical 0-100 DASH scores can be obtained by subtracting 1 from the mean items score then multiplying by 25. Higher scores in the Arabic DASH indicates greater activity limitation and worse symptoms.

Rasch analysis
Partial credit model [23] was used to examine the satisfaction of the Arabic DASH to the requirement of the Rasch measurement model using the RUMM2030 software [24]. The overall fit of the Arabic DASH to the Rasch model was examined using chi-square statistics for itemtrait interaction [21,25]. A non-significant chi-square statistics suggest an overall fit of the measure to the Rasch model and a mean item and person fit residuals close to 0 with standard deviation close to 1 were used as indicators of good fit. Deviation of individual item and person from the Rasch measurement model was examined using standardized fit residuals and chi-square statistics. Standardized fit residuals above or below ± 2.5 and a significant chi-square statistics (after Bonferroni correction) were used to indicate deviation of individual item from the measurement model [21,25]. The item fit residuals are the sum (across persons) of standardized squared differences between the observed score and the expected score by the model transformed to approximate normal distribution (mean 0, standard deviation of 1) under the hypothesis of good fit to the measurement model [25]. Items were examined for any disordered thresholds by inspecting the category characteristic curves of each individual item [20,21,26,27]. Violation of the local item independence assumption was examined by testing the residual correlation between items [21,28,29]. Pairwise residual correlation 0.2 above the mean residual correlation was used to indicate violation of local item independence. The Arabic DASH items were also examined to see whether these items function in the same way in different patients' subgroups. Differential item functioning (DIF) were examined for sex (male vs. female), age (< 42 years vs. ≥ 42 years; split by median), surgical status (surgery vs. no surgery), and affected side (dominant vs. non-dominant) [21,27]. Two-way ANOVAs (class interval by subject characteristics) on Rasch residuals were used to detect items uniform and non-uniform DIF. The t-test approach was used to examine the unidimensionality of the Arabic DASH [21,30]. Principal component analysis on residuals was perform first then the pattern of loading on the first component was used to group items into 2 group (items with positive loadings and items with negative loadings). Each participant's upper extremity function was then estimated twice using the 2 groups of items then these estimates were compared using t-test. The Arabic DASH was considered unidimensional if the number of significant differences between the 2 estimates was not more than 5%. Person Separation Index was used to examine the internal consistency of the Arabic DASH.

Sample size estimation
The required sample size was determined based on the COSMIN guidelines [31]. A sample size of 100 participants was considered adequate by the COSMIN guideline for structural validity testing using Rasch analysis [31], thus 100 was used as the minimum required sample in the current study.

Results
One hundred and nine participants with upper extremity disorders were enrolled (Table 1) (Table 2). Fifty participants (45.9%) had no missing items on the Arabic DASH while 59 participants (54.1%) had 1 missing item (no response). The only item that was missed by the participants was item 21 (sexual activities) which is an optional item in the Arabic DASH and thus was removed before conducting the Rasch analysis.
Initially, Rasch analysis suggested that the Arabic DASH deviates from the Rasch measurement model as indicated by the significant chi-square statistics for item-trait interaction and the high standard deviations for item and person fit residuals (Table 3). The initial analysis identified 18 misfitting patients, 4 misfitting items (items 26 (tingling), 28 (stiffness), 29 (sleep), and 30 (confidence)), 12 items with disordered thresholds (items 1 (open a jar), 2 (write), 3 (turn key), 4 (prepare meal), 8 (garden/yard work), 11 (carry heavy object), 15 (put on sweater), 16 (use a knife), 18 (recreational: Force), 22 (Interference with social activities), 29 (sleep), and 30 (confidence)) and 44 item pairs with high residual correlation indicating local dependency. Four items exhibited uniform DIF by sex (items 1 (open a jar), 4 (prepare meal), 5 (open heavy door), 19 (recreational: Free arm)) while none of the items had uniform or non-uniform DIF by age, surgical status, or affected side. At this stage the t-test approach also did not support he unidimensionality of the Arabic DASH (Table 3). After removing the misfitting patients in the second run, the Arabic DASH still did not satisfy the requirement of the Rasch measurement model showing significant chi-square statistics for item-trait interaction and its unidimensionality was not supported (Table 3). At this stage, the scale had 3 misfitting items (items 26 (tingling), 29 (sleep), and 30 (confidence)), 11 items with disordered thresholds (Table 4) ( Fig. 1) and 37 item pairs with high residual correlation (Appendix 1). None of the items at the second run exhibited uniform or non-uniform DIF by age, sex, surgical status, or affected side.
In the third run of the analysis, items within the Arabic DASH were grouped into 2 testlets to accommodate for local dependency (Table 3). The activity limitation testlet (super item) included activity-related items (items 1 to 20) while the impairment testlet included mostly impairment-related items (items 22 to 30). This grouping of items was determined based on the pattern of residual correlations among items (Appendix 1). At this stage, the Arabic DASH satisfied the expectations of the Rasch measurement model as indicated by the non-significant chi-square statistics for item-trait interaction ( Table 3). The 2 super items fitted the Rasch model (activity limitation testlet; fit residual = − 1.40, χ 2 = 2.55, p = 0.280) (impairment testlet; fit residual = 1.37, χ 2 = 1.44, p = 0.486) (Fig. 2) and had no uniform or non-uniform DIF by age, sex, surgical status, or affected side (Fig. 3). The t-test approach indicated that only 3.3% of the participants showed significant difference between the two ability estimates (using the two testlets) supporting the unidimensionality of the scale (Table 3). Additionally, using the 2 testlets retained 97% of the common non-error variance further supporting the presence of one general factor. The Arabic DASH had good internal consistency with Person Separation Index of 0.93 which decreased from 0.96 after accounting for local dependency. Given the fit of the Arabic DASH to the Rasch measurement model after the creation of 2 testlets, Table 5 allows the transformation of the total raw ordinal-level scores (items scored from 0 to 4 with a total score from 0 to 116 for the 29 items) to an intervallevel scores (0-100) with 100 representing worst upper extremity function. The transformation to the intervallevel scores would allow the use of parametric statistics and would enhance the validity and interpretation of the scale change scores [32].

Discussion
The Arabic DASH was subjected to the requirements of the Rasch measurement model in the current study. Initially, the Arabic DASH did not satisfy the requirements of the Rasch measurement model with several areas of deviations including misfitting items, dysfunctional response categories, significant violation of local item independence, and lack of unidimensionality. After the accommodation of the local dependency by creating 2 super items reflecting upper extremity activity limitation and impairment, the Arabic DASH satisfied the requirement of the Rasch measurement model indicating that the scale if fairly unidimensional measure of upper extremity activity limitation and impairment.
Item 21 (sexual activities) was the only item that had missing responses (no response). In the Arabic DASH, item 21 (sexual activities) was the only item that was labeled as an optional item which might explain why only this item had missing responses. Given that item 21 cover sensitive issue (sexual activities), cultural factors might have also contributed to some missing responses to this item. The optional nature of the item and the high missing rate led to the removal of the item 21 (sexual activities) before conducting Rasch analysis in the current study.
After removing misfitting patients, Items 30 (confidence), 26 (tingling), and 29 (sleep) were misfitting based on their fit residuals and chi-square statistics. This indicate that the behavior of these items did not follow the expectation of the Rasch measurement model. The nonconformity of these items to the Rasch measurement model suggests that these items do not belong to the major underlying trait measured by the majority of the scale items, upper extremity activity limitation. The violation of local item independence observed in the current study might have driven these items to deviate from the measurement model [29,33]. Although these items were individually misfitting, the activity limitation and impairment testlets that were created to accommodate the local dependency showed good fit to the Rasch measurement model allowing the retention of these items. In patients with rheumatoid arthritis, Prodinger et al. reported good fit of the same 2 testlets (activity limitation and impairment testlets) used in our study providing support to the findings of the current study [34].  reported misfitting item across these studies followed by item 30 (confidence). Items 26 (tingling), and 30 (confidence) has also been reported to misfit the Rasch model in patients neurological disorders affecting upper extremity function [38,39]. Proper functioning of the scale response categories manifest as ordered coverage of the underlying continuum by the response categories where each response category becomes the most probable option in part of the continuum (Fig. 1). Under this ordered coverage, the response option "no difficulty" should be the most probable option for individuals with high level of upper extremity function followed by "mild difficulty", "moderate difficulty", "severe difficulty", then the response option "unable" with decreasing levels of upper extremity function. Improper functioning of the response categories and disordered thresholds were detected in 11 items in the Arabic DASH (Table 4). This indicates that response categories in these items were not used in the expected manner (either because wordings of response categories were not clear, or patients were not able to discriminate between them). Another potential reason for the disordered thresholds in the Arabic DASH is the significant violation of local item independence which could cause items' thresholds to be disordered [34]. All items with disordered thresholds in the Arabic DASH showed violation of local item independence as indicated by the high residual correlations. Prior studies that examined the internal structure of the DASH also pointed to disordered thresholds suggesting improper functioning of the scale response categories [15-18, 35, 37-39]. The number of items with disordered thresholds reported in the literature ranged from 2 items in patients with hand and elbow disorders [15,35] to 19 items in patients with various upper extremity disorders [18]. The Arabic DASH has major violation of local item independence indicating response dependency and multidimensionality [28,33]. The Arabic DASH suffered from local dependency between 37 item pairs. This indicates that these items have something in common other than the measured underlying trait that is upper extremity function violating the requirement of local item independence [20,21,27,29]. This violation caused the scale to deviate from the Rasch measurement model. The accommodation of local dependency within the scale by creating testlets (activity limitation and impairment testlets) led to satisfactory fit of the Arabic DASH to the Rasch measurement model supporting the validity of the scale. Examining the item pairs with residual correlations above the predetermined threshold indicates that most of these items inquire about similar functional activities or symptoms. Items 18 (recreational: force) and 19 (recreational: free arm) had the highest bivariate residual correlation after the removal of the underlying trait "upper extremity function". Both items inquire about the difficulty encountered in recreational activities while taking some force through the arm (item 18) and in free arm movement (item 19). The similar content of the 2 items and possible redundancy might explain the high residual correlation observed. Items with the second highest residual correlation were items 10 (carry shopping bag) and 11(carry heavy object). Both items are related to the same functional activity that is "carrying" and response to one item is likely to be linked or dependent on the response of the other item. For example if a patient was able to carry a heavy bag over 4.5 kg (item 11) then that patient would be able to carry a shopping bag (item 10). This problem of response dependency detected in the Arabic DASH is of concern given that it artificially inflates reliability and influence person estimates [28,29]. Similar to the residual correlation observed between activity-related items, impairment-related items also exhibited high residual correlation. This pattern of residual correlation might be an indicator of multidimensionality where these items represent an impairment-related dimension. Additionally, similar content might also explain some of the residual correlations among impairment-related items. Items 24 (pain), 25 (pain during activity), and 29 (sleep) for example inquire about pain severity and pain-related sleep difficulty. This enquiry about the same impairment might constitute the shared concept that lead to the high residual correlation even after the removal of the major underlying factor. Consistent with our findings, DASH has been reported to violate the requirement of local item independence. A pattern of high residual correlation similar to ours where activity-related item group together while impairment-related items group together has been reported in patients with various upper extremity musculoskeletal disorders [15,16,18,34,36,37]. Similar to the approach used in the current study, Prodinger et al. reported the use of two testlets (activity limitation and impairment testlets) to accommodate the issue of local dependency within the scale and this method yielded satisfactory fit to the Rasch model in line with the findings of the current study [34]. Table 5 Arabic DASH total raw ordinal-level score (0-116) to interval level score (0-100) with 100 representing worst upper extremity function For the raw ordinal-level scores, items scored from 0 to 4 with a total raw score ranging from 0 to 116 for the 29 items

Raw score
Interval-level score Rasch measurement model is a unidimensional model where the probability of being able to perform an activity is only governed by a single factor that is the person ability (level of upper extremity function possessed by the patient). We believe that the major breach of local item independence was the main reason explaining why the Arabic DASH did not satisfy the requirement of unidimensionality initially. After the accommodation of the local dependency by the creation of activity limitation and impairment testlets, The Arabic DASH satisfied the requirement of unidimensionality as suggested by the principal component analysis of residuals followed by the t-test [29,33]. Although items were grouped into two groups similar to having two subscales, the majority of the common non-error variance was retained by the two testlets suggesting that the scale has one general factor [34]. These results support the unidimensionality of the Arabic DASH and supports the validity of providing one single summary score for the Arabic DASH. Similar to the findings of the current study, Prodinger et al. reported that the DASH was sufficiently unidimensional in patients with rheumatoid arthritis after addressing the issue of local dependency between items using testlets [34]. The accommodation of local dependency through the use of testlets also led to unidimensionality of the Finnish DASH in patients with hand and wrist disorders [15]. Lack of unidimensionality of the DASH was reported in the literature in patients with hand disorders [16], Dupuyteren's contracture [17], various upper extremity musculoskeletal disorders [18,19], shoulder disorders [36] and also in patient with stroke [40]. Number of these studies did not examine for violation local item independence [17,19,40]. On the other hand, the studies that examined for this violation used high threshold for detecting local dependency thus underestimated the degree of dependency and also did not examine the effect of the accommodation of dependency using testlets on scale unidimensionality [16,18,36].
The Arabic DASH items seems to have no uniform or non-uniform DIF by age, sex, surgical status, or affected side. This suggests that the scale items behave in a similar manner regardless of the patient characteristics and that items were not biased to any of the levels of these characteristics (for example bias toward males versus females). The lack of uniform and non-uniform DIF was also observed in the current study at the level of testlets suggesting that the testlets also are invariant to patients' characteristics. These measurement invariance results of the Arabic DASH at the item level and at the testlet level should be interpreted with caution giving the limited number of participants in each subgroup [31] and a follow-up study might be need to confirm our findings. To the best of our knowledge, the current study is the first study which suggested that the DASH is invariant to the treatment received (surgically or non-surgical) and whether the affected side was the dominant or the nondominant side. Similar to the findings of the current study, the activity limitation and impairment testlets were reported to have no DIF by age in patients with rheumatoid arthritis and the reported DIF by sex in the testlets was considered trivial and required no modifications [34]. DASH individual items has also been reported previously to have no DIF by sex [36,37] and age [16]. On the contrary, number of studies reported DIF by sex [15-17, 35, 41] and age [15,36,37,41] in the DASH items but the results of these studies were inconsistent regarding the number of items, and the specific items exhibiting DIF.
This study represents the first attempt to examine the structural validity (internal construct validity) of the Arabic DASH. Rasch measurement model pointed to areas of dysfunction in the behavior of the Arabic DASH items mainly local dependency between items that could not have been determined using classical test theory methods. The accommodation of the local dependency between items improved the internal structure of the Arabic DASH resulting in an interval-level unidimensional measure of upper extremity function. The results of this study would help in guiding future modifications of the scale aiming to improve its validity. Although the sample size used in the current is adequate for conducting Rasch analysis [31], it is at the lower end of what is considered adequate sample size thus further testing of the measure might be needed using larger number of participants to confirm the findings of the current study. Additionally, when the whole sample was split into subsamples for DIF analysis (e.g. male and female), the number of participants within each group was below the recommended number for examining scale invariance [31]. Thus a caution should be practiced when interpreting the findings of the DIF analysis reported in the current study. The majority of participants in the current study have shoulder and arm disorders then wrist and hand disorders with few participants who had elbow and forearm disorders, thus the results of the current should be interpreted with caution especially for patients with elbow and forearm disorders.

Conclusions
Rasch measurement model supports the validity of the Arabic DASH as a unidimensional measure of upper extremity activity limitation and impairment after accommodating for local dependency between items. The total Arabic DASH score can be used in clinical practice and for research purposes to reflect the level of upper extremity activity limitation and impairment in patients with upper extremity musculoskeletal disorders.

Appendix 1
See Table 6.  Table 6 Items residual correlations after removing misfitting patients (run number 2) and before grouping items into two testlets