Our companion paper fully describes the methods used for this web-based study. We summarize them briefly here for the sake of completeness.
Sample and procedure
This study recruited DMD siblings via their parents, who had participated in an earlier study of DMD caregivers (see  for details). Comparison-group participants were recruited via IPSOS-Insight and were selected to be representative of the United States in their distributions of age, race/ethnicity, gender, and region. Eligible participants ranged from children aged 8 and older to adults (aged 18 and over) and were able to complete an online questionnaire. The protocol was reviewed and approved by the New England Independent Review Board (NEIRB #20201623), and all participants provided informed assent (if younger than 18) and/or consent (for parents of minor children or adult siblings) before beginning the survey.
Accommodating age and disability
The study design tailored the measures collected by age and/or participant preference. DMD siblings were offered the option of a simpler form of the survey if they felt they had trouble reading or concentrating. This “Alternate” survey was identical to the Child survey and so asked fewer questions. Additional file 1: Table 1 shows the questions administered by survey type.
Aspirations were measured using qualitative (open-ended) and quantitative (closed-ended) questions in order to triangulate on the concept. The open-ended questions included: (1) Three Wishes, in which participants were asked, “If you could make three wishes, any three wishes in the whole world, what would they be?”; (2) Goals: “What are the main things you want to accomplish?”; (3) QOL Definition: “In a sentence, what does the phrase ‘Quality of Life’ mean to you at this time?” The latter two are part of the QOL Appraisal Profile v1.
In addition to the open-ended questions, 29 closed-ended goal-delineation items from the QOL Appraisal Profile v2 (QOLAPv2) were included. These rating-scale items queried a broad range of life domains (e.g., living situation, work/school, social relationships, health, and spirituality), asking respondents how much each goal statement was like them (1 = “not at all like me” through 5 = “very much like me”). Participants were given the option of not responding (Not applicable/Decline/I don’t know), which was coded as missing (−99). [The interested reader can contact the corresponding author for the QOLAPv2 Goal-Delineation items.]
Participants of all ages answered the Three Wishes open-ended question; DMD sibling participants aged 18 and older who did not opt for the Alternate survey answered the open-ended goals and QOL-definition questions and the closed-ended goal-delineation items.
Demographic characteristics asked of all participants included year of birth, gender, whether they received help with the survey, height, weight, race, ethnicity, with whom they lived, and whether anyone in the household was or had been infected with the novel coronavirus-2019. Adult participants were additionally asked about education level, marital status, difficulty paying bills, employment status, number of hours worked per week, occupational complexity, and hours missed from work (from the Work Productivity & Activity Impairment questionnaire). The referring caregiver was tracked via the web recruitment link.
Coding open-ended data
The open-ended data, assembled into a data set that included responses from patients, siblings, and the comparison group, were coded into themes by six trained raters (EB, RBB, AD, JBL, EK, MCF) according to a standardized protocol and comprehensive codebook derived from an extensive sorting procedure. Themes were coded as “1” or “0” depending on whether they were reflected in the individual’s written text. A set of 40 themes was used for both the wishes and goals prompts and 17 for the QOL definition prompt. For each prompt, a theme of “No Direct Answer” was used if the respondent did not provide an answer or answered a different question than the one asked. For example, in response to the question “What are the main things you want to accomplish?” exemplary No-Direct-Answer responses were “many things” and “I’ve asked myself that question since I was a kid, and even now I have no idea”.
Each text entry could be coded for as many themes as were identified among the set for the corresponding prompt. Thus, one entry could elicit one theme or more than one, depending on how the individual worded it. For example, as a goal one individual wrote, “My bills paid, my family healthy and happy, and family go to church.” It was coded as reflecting family welfare, financial concerns, health issues, mental health/mood state, and religious/spiritual concerns. In contrast, another individual’s goal was “Move to a different state,” which was coded only with the single theme of living situation. We therefore assume that the relevant unit here is the coded theme, not the individual wish, goal, or QOL definition itself.
Training took place in two multi-hour sessions devoted to understanding the protocol and making full use of the codebook, in which themes were described in detail and exemplified. Raters coded an initial set of ten participants’ data (from all three prompts), followed by a discussion of differences in coding decisions across raters. They then coded the next ten participants’ data (again all prompts); comparison and discussion at this point revealed almost no differences across raters. Raters then coded data from 40 more responses (all prompts), from which inter-rater reliability was computed in two ways on the 240 test responses (6 raters × 40 participant entries).
Two methods were used to assess aspects of inter-rater reliability. The first, Fleiss’s kappa, computed on the entire data set, assessed the degree of agreement over and above what would be expected by chance. This variant of the more familiar Cohen’s kappa is used when there are more than two raters. While there are no generally accepted rules of thumb for a desirable level of either form of kappa, some healthcare researchers have proposed values from 0.41 to 0.60 as “moderate,” 0.61–0.80 as “good,” and 0.81–1.00 as “very good” [27, 28].
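To make the first method concrete, the following is a minimal sketch of Fleiss’s kappa for one theme, assuming every entry is coded by the same number of raters. The function name and the toy ratings are hypothetical illustrations, not the study’s code or data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss's kappa for a list of items, each a list with one label per rater.

    Assumes every item was coded by the same number of raters (six in the
    study described above; the toy data below are invented).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item must have the same number of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: proportion of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_expected = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    if p_expected == 1.0:
        return float("nan")  # kappa is undefined when only one category is ever used
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical codes for one theme: four entries, six raters each (1 = theme present).
toy_ratings = [
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
]
kappa = fleiss_kappa(toy_ratings)
```

In this invented example, two entries show perfect agreement and two show partial agreement, so kappa falls below 1 but well above the chance level of 0.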
The second method assessed what proportion of the variance could be explained by the Rater effect. A low number is preferable as it reflects that the scores relate to the individual’s data being coded rather than reflecting a response style of the rater. This method used logistic regression to assess level of agreement among raters, with each of 240 “0” or “1” values regressed on the Rater variable, with its six categories (i.e., six raters). High inter-rater reliability (IRR) for any given theme would be indicated by a nonsignificant Rater effect, and one that explained a low fraction of the variance in ratings (as estimated by a pseudo-R-squared in the low single digits).
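The Rater-effect check can be sketched as follows. With a single categorical predictor, the fitted probabilities of a logistic regression equal each rater’s observed proportion of “1” codes, so the log-likelihoods can be computed directly. The text does not say which pseudo-R-squared was used; McFadden’s is assumed here, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def rater_effect(codes, raters):
    """Likelihood-ratio test and McFadden pseudo-R-squared for a Rater effect.

    codes:  list of 0/1 theme codes
    raters: list of rater identifiers, same length as codes
    Returns (lr_chi2, df, mcfadden_r2).
    """

    def bernoulli_loglik(k, n):
        # Log-likelihood at the fitted proportion k/n, treating 0*log(0) as 0.
        # If a rater coded all 0s or all 1s this is the limiting value.
        ll = 0.0
        if k > 0:
            ll += k * math.log(k / n)
        if n - k > 0:
            ll += (n - k) * math.log((n - k) / n)
        return ll

    by_rater = defaultdict(list)
    for code, rater in zip(codes, raters):
        by_rater[rater].append(code)

    ll_null = bernoulli_loglik(sum(codes), len(codes))  # intercept-only model
    if ll_null == 0.0:
        raise ValueError("theme shows no variation; the Rater effect is undefined")
    ll_model = sum(bernoulli_loglik(sum(v), len(v)) for v in by_rater.values())

    lr_chi2 = 2.0 * (ll_model - ll_null)
    df = len(by_rater) - 1
    mcfadden_r2 = 1.0 - ll_model / ll_null
    return lr_chi2, df, mcfadden_r2


# Hypothetical data: six raters, 40 entries each, all coding identically.
pattern = [1] * 10 + [0] * 30
codes = pattern * 6
raters = [r for r in "ABCDEF" for _ in range(40)]
lr_chi2, df, r2 = rater_effect(codes, raters)
```

With six raters the test has 5 degrees of freedom, so a likelihood-ratio statistic below about 11.07 would be nonsignificant at the 0.05 level.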
Comparing length, number of themes, and inter-method associations
Analysis of Variance (ANOVA) models were used to compare length of open-ended response and number of themes (dependent variables) by role (sibling vs. comparison; independent variable). Longer open-ended responses would generally reflect more complex or comprehensive answers.
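As an illustration of the comparison described above, here is a minimal one-way ANOVA sketch that returns the F statistic and its degrees of freedom. The function name and the word counts are hypothetical; the study itself used SPSS and R.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups.

    Assumes at least two groups and some within-group variation
    (otherwise the within-group mean square is zero and F is undefined).
    """
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)

    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical response lengths (in words) for the two roles.
sibling_lengths = [12, 15, 18]
comparison_lengths = [15, 18, 21]
f_stat, df_b, df_w = one_way_anova_f([sibling_lengths, comparison_lengths])
```

With only two groups, as here, the F statistic equals the square of the pooled-variance t statistic.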
Demographic differences between DMD siblings and comparison participants were controlled in the eventual multivariate models using propensity scores. The goal of the propensity-score modeling was to create a score for covariate adjustment across all age groups, thereby allowing us to compare aspirations across the age span. This comparison was the central contribution of the present work. Because some covariates were not asked of every age group and thus were not available (see Additional file 1: Table 1), we took a pragmatic approach: the propensity-score model adjusted for those covariates that differed between the sibling and comparison groups in the bivariate analyses described below. Separately for adults and for teens/children, a logistic regression model was computed predicting role (DMD sibling or comparison; the dependent variable) from the applicable covariates. For adults who completed the adult survey, the covariates were ethnicity, White race, Black race, region, marital status, difficulty paying bills, whether currently working, education, whether they received help completing the survey, and whether someone in the household had contracted COVID-19. For children, teens, and those who completed the Alternate survey, the covariates were ethnicity, White race, Black race, region, whether they received help completing the survey, and whether someone in the household had contracted COVID-19. For a small proportion of participants (6%), propensity scores were based on the mean propensity score among the individual’s age group.
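The age-group fallback mentioned at the end of the paragraph above can be sketched as follows. The text does not say why a model-based score was unavailable for these participants; the sketch simply assumes such scores are recorded as missing. The record layout and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fill_missing_propensity(records):
    """Replace missing propensity scores with the mean score of the age group.

    records: list of dicts with keys 'age_group' and 'propensity'
             (a float, or None when no model-based score is available).
    Returns a new list of scores in the same order as the input.
    Raises KeyError if an age group has no observed scores at all.
    """
    observed = defaultdict(list)
    for rec in records:
        if rec["propensity"] is not None:
            observed[rec["age_group"]].append(rec["propensity"])
    group_means = {group: mean(scores) for group, scores in observed.items()}

    return [
        rec["propensity"]
        if rec["propensity"] is not None
        else group_means[rec["age_group"]]
        for rec in records
    ]


# Hypothetical records: one teen lacks a model-based score.
records = [
    {"age_group": "teen", "propensity": 0.2},
    {"age_group": "teen", "propensity": 0.4},
    {"age_group": "teen", "propensity": None},
    {"age_group": "adult", "propensity": 0.7},
]
filled = fill_missing_propensity(records)
```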
Differences in age distributions by role
The child, teen, and adult data sets revealed age differences between patients and comparisons and between siblings and comparisons: there were differences in mean age, in the frequency of certain age ranges, and in the shape of the age distributions. We therefore decided that, in addition to adjusting for age, we would apply weights in our models so as to simulate more comparable age distributions. While the weighting might not completely eliminate the age differences, it was aimed at making the distributions comparable enough to render the planned analyses tractable.
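The weighting scheme itself is not specified in this section. One common way to make age distributions more comparable is to weight one group by the ratio of age-band proportions, sketched below purely as an assumption. The age bands, data, and function name are hypothetical.

```python
from collections import Counter


def age_band_weights(target_bands, group_bands):
    """Weights that make `group_bands` match the age-band mix of `target_bands`.

    Each member of the weighted group gets
        (share of the band in the target group) / (share of the band in own group).
    Bands absent from the target group receive weight 0; bands absent from the
    weighted group cannot be matched, so the match may be only approximate.
    """
    target_counts = Counter(target_bands)
    group_counts = Counter(group_bands)
    n_target, n_group = len(target_bands), len(group_bands)

    return [
        (target_counts[band] / n_target) / (group_counts[band] / n_group)
        for band in group_bands
    ]


# Hypothetical age bands for siblings (target) and comparison participants.
sibling_bands = ["8-12", "8-12", "13-17", "13-17"]
comparison_bands = ["8-12", "13-17", "13-17", "13-17"]
weights = age_band_weights(sibling_bands, comparison_bands)
```

In this toy example the single comparison child aged 8-12 is weighted up and the three teens are weighted down, so both bands carry half of the weighted comparison sample.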
Analyzing the aspirations data
Descriptive statistics summarized the proportion of each group coded as reflecting a given theme (open-ended data) or the central tendency (closed-ended goal-delineation items). Effect size was summarized by phi for comparisons of proportions (open-ended data) and by Cohen’s d for comparisons of means (closed-ended data). Multivariate models were then computed to hone the contrasts between siblings and the comparison group. In these models, which were weighted to nearly equalize age distributions, the two groups were contrasted in terms of role, age, and the role-by-age interaction, after adjusting for propensity scores. Logistic regression was used to analyze the individual themes for the coded open-ended data, while Analysis of Covariance (ANCOVA) was used to analyze the closed-ended goal-delineation data.
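For readers less familiar with the two descriptive effect sizes, the sketch below computes phi from a 2×2 table of role by theme and Cohen’s d with a pooled standard deviation. The function names and example numbers are hypothetical, and unweighted data are assumed.

```python
import math
from statistics import mean, variance


def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table.

    a = group 1, theme present   b = group 1, theme absent
    c = group 2, theme present   d = group 2, theme absent
    """
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        return float("nan")  # undefined when a row or column total is zero
    return (a * d - b * c) / denom


def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


# Hypothetical counts: 30 of 100 siblings vs. 10 of 100 comparisons mention a theme.
phi = phi_coefficient(30, 70, 10, 90)

# Hypothetical ratings on a 1-5 goal-delineation item.
d = cohens_d([2, 3, 4], [1, 2, 3])
```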
For logistic regressions, 15 of the 95 theme variables showed no variation and thus were excluded from analyses. Further, some themes analyzed showed complete or quasi-complete separation in logistic regression, and for these we reported only descriptive results.
Interpretation of main effects in the context of interactions
The abovementioned multivariate models aimed to investigate how siblings compared with comparison participants in their aspirations at different ages, after adjusting for demographic variables that might have confounded the raw descriptive comparisons. Interpreting main effects can be challenging when the model contains interaction terms, because the latter are collinear with the former. To address this challenge, plots of substantial interaction effects were used to facilitate their interpretation. To display an interaction effect (role-by-age), we created scatterplots that graphed predicted values from the entire model (Y-axis) against age (X-axis), with separate lines for each Role group. Any theme variable with a group mean that was < 0.01, or with a |Beta| or |Estimated Beta| out of the usual range (> 1.3), was excluded from interaction graphs.
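As a reading aid, the adjusted logistic model for a coded theme can be written schematically as below. The notation is ours and is an assumption: Y_i is the 0/1 theme indicator, Role_i is taken to be coded 0 for comparison and 1 for sibling, and PS_i is the propensity score entered as a covariate. The exact coding of each term is not stated in the text.

```latex
\operatorname{logit}\,\Pr(Y_i = 1)
  = \beta_0
  + \beta_1\,\mathrm{Role}_i
  + \beta_2\,\mathrm{Age}_i
  + \beta_3\,(\mathrm{Role}_i \times \mathrm{Age}_i)
  + \beta_4\,\mathrm{PS}_i
```

Under this coding, the sibling-versus-comparison contrast at a given age is \beta_1 + \beta_3\,\mathrm{Age} on the logit scale, which is why \beta_1 alone cannot be read as the role effect and why plotting predicted values against age, with one line per role, is informative.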
Interpretation in the context of many contrasts
The present study involves a large number of statistical contrasts, primarily because it is investigating research questions that have not been addressed to date and which involve translating nuanced qualitative data into quantitative metrics. Beyond demographic comparisons, where we rely on p-values to identify group differences, we focused our interpretation on effect sizes (ES). Cohen’s criteria were used to facilitate interpretation of differences for medium and large effect sizes, respectively: in proportions (phi of 0.3 and 0.5); in mean differences (Cohen’s d of 0.5 and 0.8); in model explained variance (R² of 0.06 and 0.14); and in model parameter estimates (standardized coefficients or β of 0.3 and 0.5). While we report ES regardless of magnitude, we interpret only medium or large ES because these are generally considered clinically important. Tables are conditionally formatted using data bars in unadjusted comparisons, and using different colors and saturation levels in adjusted comparisons, to highlight effects’ direction and magnitude.
IBM SPSS version 27 and the R software were used for all analyses.
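The interpretation rule above amounts to a simple lookup, sketched here. The cut-offs are those quoted in the text; the dictionary and function names are hypothetical, and treating a value exactly at a cut-off as meeting it is our assumption.

```python
# Medium and large cut-offs as quoted in the text (Cohen's criteria).
ES_THRESHOLDS = {
    "phi": (0.3, 0.5),    # difference in proportions
    "d": (0.5, 0.8),      # difference in means
    "r2": (0.06, 0.14),   # model explained variance
    "beta": (0.3, 0.5),   # standardized model coefficient
}


def classify_effect(metric, value):
    """Label an effect size as 'large', 'medium', or 'below medium'.

    The sign is ignored, since direction is reported separately.
    Values exactly at a cut-off are treated as meeting it (an assumption).
    """
    medium, large = ES_THRESHOLDS[metric]
    size = abs(value)
    if size >= large:
        return "large"
    if size >= medium:
        return "medium"
    return "below medium"


# Hypothetical effect sizes for illustration.
labels = [
    classify_effect("d", -0.6),
    classify_effect("phi", 0.25),
    classify_effect("r2", 0.14),
]
```

Only effects labeled medium or large would be interpreted under the rule described above; smaller effects are reported but not discussed.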