Development and psychometric validation of the Nausea/Vomiting Symptom Assessment patient-reported outcome (PRO) instrument for adults with secondary hyperparathyroidism

Background We developed the Nausea/Vomiting Symptom Assessment (NVSA©) patient-reported outcome (PRO) instrument to capture patients’ experience with nausea and vomiting while on calcimimetic therapy to treat secondary hyperparathyroidism (SHPT) related to end-stage kidney disease. This report summarizes the content validity and psychometric validation of the NVSA©. Methods The two NVSA© items were drafted by two health outcomes researchers, one medical development lead, and one regulatory lead: it yields three scores: the number of days of vomiting or nausea per week, the number of vomiting episodes per week, and the mean severity of nausea. An eight-week prospective observational study was conducted at ten dialysis centers in the U.S. with 91 subjects. Criterion measures included in the study were the Functional Living Index-Emesis, Kidney Disease Quality of Life Instrument, EQ-5D-5 L, Static Patient Global Assessment, and Patient Global Rating of Change. Analyses included assessment of score distributions, convergent and known-groups validity, test-retest reliability, ability to detect change, and thresholds for meaningful change. Results Qualitative interviews verified that the NVSA© captures relevant aspects of nausea and vomiting. Patients understood the NVSA© instructions, items, and response scales. Correlations between the NVSA© and related and unrelated measures indicated strong convergent and discriminant validity, respectively. Mean differences between externally-defined vomiting/nausea groups supported known-groups validity. The scores were stable in subjects who reported no change on the Patient Global Rating of Change indicating sufficient test-retest reliability. The no-change group had mean differences and effect sizes close to zero; mean differences were mostly positive for a worsening group and mostly negative for the improvement group with predominantly medium or large effect sizes. Preliminary thresholds for meaningful worsening were 0.90 days for number of days of vomiting or nausea per week, 1.20 for number of episodes of vomiting per week, and 0.40 for mean severity of nausea. Conclusions The NVSA© instrument demonstrated content validity, convergent and known-groups validity, test-retest reliability, and the ability to detect change. Preliminary thresholds for minimally important change should be further refined with additional interventional research. The NVSA© may be used to support study endpoints in clinical trials comparing the nausea/vomiting profile of novel SHPT therapies.


Background
Chronic kidney disease (CKD) is a progressive loss in renal function. Secondary hyperparathyroidism (SHPT) is a frequently-encountered problem in CKD with resulting impact on the morbidity and survival of hemodialysis patients [1][2][3][4][5][6][7]. Several therapeutic strategies have been used to manage SHPT including the oral calcimimetic Sensipar® (cinacalcet) [8]. The most frequently-reported adverse reactions associated with cinacalcet have been nausea and vomiting [8][9][10]. These side effects may adversely affect medication adherence, therapeutic efficacy, and patients' clinical and quality-of-life outcomes. A new generation of calcimimetic is being developed as an intravenous formulation. This innovation may improve patients' calcimimetic nausea/vomiting experience; however, an instrument tailored to daily nausea and vomiting assessment would be needed to quantify the impact.
We developed the Nausea/Vomiting Symptom Assessment (NVSA © ) patient-reported outcome (PRO) instrument to capture the number of days of nausea or vomiting per week, number of episodes of vomiting per week, and mean severity of nausea among CKD patients on maintenance hemodialysis with SHPT. The US Food and Drug Administration (FDA) PRO Guidance document [11] states that PRO measures should be conceptually valid as they relate to the medical intervention and disease state being studied, meet a threshold of psychometric soundness, and be relevant to patients. Similarly, the Committee for Medicinal Products for Human Use reflection paper [12] states "the use of specific [health-related quality of life] HRQoL domains as study endpoints pre-supposes that the HRQoL instrument was adequately developed and fully validated prior to measuring the subset of domains chosen." The NVSA © was developed because existing measures of nausea/vomiting do not harmonize the concepts of interest (nausea/vomiting) with the context of use (comparative studies of new therapies for CKD patients with SHPT) [13]. Some existing measures are dictotomous in format or have long recall periods [14], thus potentially yielding imprecise information on change in the frequency and severity of nausea/vomiting due to therapy. Still others assess the impact of nausea/vomiting symptoms on HRQoL [15,16] instead of nausea/vomiting themselves. Thus, the NVSA © was developed to provide a direct, focused, daily measure of nausea and vomiting in comparative studies of SHPT therapies.
PRO instruments designed to assess treatment-related adverse reactions come with unique challenges in their development and validation. For example, the sample of patients used in psychometric testing must have the underlying target illness but must also be using a treatment that results in the adverse reaction. Because not all patients using the target treatment will experience adverse reactions, it can be difficult to ensure adequate variability-both between and within patients-in a study sample. Further, for PRO instruments that assess treatment-related adverse reactions, assessment of clinically-meaningful change should be focused on onset or worsening of symptoms over time. In other words, patients in a comparative trial that includes treatmentrelated adverse reactions as a study endpoint would likely start with no or minimal symptoms-some patients would then develop symptoms or experience worsening in treatment-related symptoms during the study. Rather than identifying the definition of a meaningful response to treatment, there must be an a priori identification of a meaningful worsening in the severity of reactions. We addressed these unique aspects of treatment-related PRO development herein by describing the development and psychometric evaluation of the NVSA © .

Study design
Overview of NVSA © Development and Validation.
As the FDA PRO Guidance document [11] notes, a "fundamental consideration in the review of a PRO instrument is the adequacy of the item generation process to support the final conceptual framework for the instrument." However, as the guidance also outlines, "in some cases, the question of what to measure may be obvious given the condition being treated." A review of information from clinical studies [8][9][10][17][18][19] and expert clinical guidelines [20] supported nausea and vomiting as important adverse reactions associated with the oral calcimimetic cinacalcet. Given the evidence identifying calcimimetic treatment-emergent nausea and vomiting as a recognized limitation of cinacalcet [8][9][10][17][18][19][20], the two NVSA © items focused on these two important adverse reactions. The items were developed based on a review of data from cinacalcet trials, discussions with clinical SHPT experts, a review of PRO measures that assess nausea/vomiting, and initial hypotheses regarding the endpoints that would be most appropriate for use in future efficacy trials of novel (i.e., non-cinacalcet) trials for SHPT. The disease model for the NVSA © , which guided PRO development, is presented in Fig. 1.
Because measuring vomiting and nausea was obvious for the purpose of PRO development, the initial items and response categories were drafted by a crossfunctional expert team that included two health outcomes/PRO researchers, one medical development lead, and one safety/regulatory lead. Based on the above premises, particularly the FDA PRO Guidance [11], traditional hypothesis-generating conceptelicitation interviews were not executed but were exchanged instead with qualitative cognitive-debriefing interviews to document that patients could understand the NVSA © instructions, items, and response scales and could easily comprehend and meaningfully respond to the NVSA © items. Confirmatory, qualitative concept-elicitation interviews were conducted to provide final, supportive evidence of the importance of the concepts of nausea and vomiting for patients who experienced these adverse reactions associated with oral calcimimetic cinacalcet. The psychometric findings, using data from an observational study dedicated to assessment of the NVSA © , were designed to document the measurement properties of the NVSA © as a PRO that can be used to measure nausea/vomiting in clinical trials involving the specific population of interest-CKD patients on maintenance hemodialysis with SHPT.

Phase I: Cognitive-debriefing interviews
Phase I utilized individual, face-to-face interviews to assess patient understanding and ease of selfadministration of the draft NVSA © items on an electronic diary (eDiary). Recruitment targeted 12 adult CKD patients on maintenance hemodialysis with SHPT. A quota was implemented to have 50% (n = 6) of the 12 subjects be currently treated with cinacalcet at the time of screening. Two clinical sites enrolled patients (Houston, TX and Washington, DC). Clinicians were oriented to the study procedures, protocol, and requirements over the telephone and were asked to screen and recruit patients and schedule them for an interview if they qualified. Potential patients were telephoned and screened for eligibility against study criteria. Those eligible were asked to attend an enrollment and interview visit at the site to review and sign the Informed Consent Form, complete the Demographic Form, and to participate in an interview. Interviewers conducted in-person, one-on-one interviews using a semi-structured cognitive-interview guide.

Phase III: Psychometric validation
This was an eight-week, prospective, observational study. The eight-week time period was selected based on the timing of incident nausea/vomiting in clinical trials [8,18,19] as well as anecdotal information from the real-world experiences of physicians prescribing cinacalcet. To capture the full range of experiences of nausea/vomiting, three cohorts of subjects were recruited: (1) those who started cinacalcet in the month before enrollment ('Initiating Cinacalcet' cohort); (2) those who had SHPT but had no history of previous treatment with cinacalcet ('Cinacalcet Naïve' cohort); and (3) those who were current users of cinacalcet and initiated cinacalcet greater than 1 month prior to enrollment ('Variable Time on Cinacalcet' cohort). Subjects were recruited into each cohort in a 1:1:2 ratio to enroll a sample with adequate variability in nausea/ vomiting. Subjects were enrolled if they were at least 18 years of age and had maintenance hemodialysis three times per week for at least 3 months prior to enrollment. Subjects must have had: (a) SHPT defined as two consecutive laboratory serum parathyroid hormone values > 400 pg/mL without cinacalcet; or (b) had initiated cinacalcet in the last month; or (c) been currently using cinacalcet for greater than 1 month at the time of enrollment. Subjects were excluded if they were not able to give written informed consent and/or comply with the study procedures, if they had any other illness (with the exception of diabetes) that causes nausea/ vomiting and that might confound subjects' responses to interview questions, if they were receiving chronic antiemetic therapy, if they were pregnant or lactating, or if they were currently participating in another investigational device or drug study.
A total of 91 subjects were recruited across ten dialysis centers in the US. IRB submission and approval from Quorum Review IRB for the study was secured by DaVita Clinical Research who maintained appropriate documentation and conducted management of the study in accordance with the ethical principles consistent with Good Clinical Practice and applicable regulatory requirements.

Phase I: Cognitive-debriefing interviews
Cognitive interviewing is a method of testing a PRO instrument to identify: comprehensibility of the language, format, and instructions; appropriateness of response options; and overall feasibility and burden [21]. A cognitive-interview guide facilitated the cognitive assessment (patient comprehension) of the draft NVSA © items. The cognitive-interview guide used a process requiring patients to "think aloud" where patients were asked to respond to a specific set of questions regarding the instructions, questions, and response options. The "think aloud" technique allowed patients to verbalize the thought process involved in providing a response to the questions in the NVSA. © During the interview, patients self-administered the NVSA © and answered questions about their comprehension of the items. Two primary components were tested: (1) the intent of the question (what the respondent believes the question is asking), and (2) the meaning of terms (what specific words and phrases in the question mean to the respondent). The information gathered during the individual interviews was used to inform revisions to the content of the eDiary and to create patient training for the administration of the NVSA. © The primary focus of the cognitive interview was on the items themselves rather than on administration procedures. Other questions in the cognitive interviews focused on the response option and instructions and further examined specific terminology to ensure patients understood the NVSA © items. At the close of the interview, patients completed an "Ease of Use Questionnaire" to report their experience regarding the usability of the eDiary.

Phase II: Confirmatory concept-elicitation interviews
The confirmatory concept-elicitation interviewers used a semi-structured interview guide to explore patient perceptions about their experience with nausea/ vomiting associated with treatment for SHPT. The semi-structured qualitative interview guide included open-ended questions and a day-reconstruction exercise to invite spontaneous responses from subjects regarding treatment-related nausea/vomiting. Each interview focused first on issues more proximal to treatment-related nausea/vomiting (symptoms) and then moved to more distal issues (impacts). All interviews were conducted using a process that began with open-ended questions before moving to followup probe questions for areas not initially mentioned by subjects. A series of follow-up probing questions were also included to explore sign, symptom, and impact language and areas not mentioned spontaneously by subjects. Worksheets were used to obtain numeric ratings of characteristics of the elicited concepts that subjects experienced.

Phase III: Psychometric validation
The NVSA © is a two-item PRO instrument designed to assess nausea/vomiting on a daily basis. An eDiary was used to administer the NVSA © each evening. The first item asks about severity of nausea in the past 24 h using a 0-10 numeric rating scale (no nausea to nausea as severe as you can imagine). The second item asks about the number of times the subject vomited in the past 24 h using a "spinner control" to select a number between zero and 99.
The FLIE-Emesis is an 18-item PRO instrument that measures the impact of nausea and vomiting on physical activities, social and emotional function, and ability to enjoy meals over the past 5 days [22]. The FLIE contains nine items for nausea and nine for vomiting. The total score ranges from 18 to 126 with higher scores indicating a less negative impact on patients' daily life due to nausea and vomiting. Two subscales can be derived: FLIE vomiting and FLIE nausea (both have a score range of 9 to 63).
The KDQOL-36™ is a 36-item PRO instrument specific to CKD with a four-week recall [23]. For the psychometric evaluation, three KDQOL-36™ single-item measures were used: bothersomeness of nausea, faintness, and shortness of breath. Each item yields scores ranging from 1 (not at all bothered) to 5 (extremely bothered).
The EQ-5D-5 L is a generic measure of health status that provides descriptive profiles and a single-index value. [24] The recall period is "today" (at the time of completion). The five EQ-5D-5 L descriptive profiles and its visual analogue scale (VAS) measuring health today were used to assess the discriminant validity of the NVSA © .
sPGA (Static Patient Global Assessment) -Nausea and Vomiting is a two-item measure in which subjects indicate an overall assessment of current nausea and vomiting. Response choices include no symptoms, very mild, mild, moderate, severe, and very severe symptoms. The nausea and vomiting items were scored separately.
PGRC (Patient Global Rating of Change) -Nausea and Vomiting is a two-item measure in which subjects indicate worsening or improvement in nausea and vomiting since the onset of the study. Response choices include very much improved, much improved, minimally improved, no change, minimally worse, much worse, and very much worse. The nausea and vomiting items were scored separately.
Patients were trained to use the eDiary upon enrollment. The NVSA © was completed on the eDiary at home each evening during the study. Subjects completed the FLIE, KDQOL-36,™ EQ-5D-5 L, sPGA, and PGRC on eDiaries before dialysis during the enrollment visit. Follow-up FLIE and PGRC and sPGA measures were completed weekly thereafter, also before dialysis at one of the subject's three times per week dialysis clinic visits, for the duration of the study. Follow-up KDQOL-36™ and EQ-5D-5 L measures were completed at the week 4 and week 8 dialysis clinic visit.

Cognitive-debriefing interviews
The cognitive interviews were audio-recorded. After each wave of cognitive interviews, the interviewer wrote a description of problematic cognitive issues for each item if any were apparent. The description of potential issues, along with recommendations for retaining, modifying, or eliminating an item, were discussed with the development team and, following agreement, changes were made to the draft items for the second wave of cognitive interviews. After the second wave of interviews, the results were again reviewed by the development team and recommendations for the items were documented. The transcripts were reviewed in order to extract quotations from the patients' responses to each question asked in the cognitiveinterview guide. Quotations were summarized to show patient interpretation and understanding of each item and its response option and to identify any difficulties that arose with the proper understanding of the item content.

Confirmatory concept-elicitation interviews
The confirmatory concept-elicitation interviews were audio recorded. The digital audio files of subject interviews were transcribed by a professional transcription company. Transcripts were coded using ATLAS.ti software to organize coded concepts by similarity of content. The primary goal of transcript coding was to organize and catalog subject descriptions of their treatment-related nausea/vomiting symptoms. The coding framework provided an organization for grouping like-concept codes. As the coding process continued, the coding framework was expanded by adding concepts that subjects identified in the interview process. Saturation of concepts was assessed across four grouping of the ten transcripts: (1) first to third interview; (2) fourth to sixth interview; (3) seventh to eighth interview; and (4) nine to tenth interview.

Psychometric validation
Standard psychometric techniques, consistent with those recommended by the FDA [11], were used to assess the measurement properties of the NVSA © .

Handling of data
Mean severity of nausea was calculated by averaging all available daily severity ratings. No imputation was performed for missing responses. For psychometric evaluation, daily severity data were averaged into a weekly (i.e., seven day) score.
The number of episodes of vomiting per week was estimated as the total number of episodes in a given week. Consistent with other diary-based PRO research [25][26][27][28][29][30][31], if a subject was missing valid responses on three or fewer days in a week, the available valid responses in the week were used to estimate the total number of episodes of vomiting equivalent to a seven-day week. If the subject had missing responses for four or more days for a given week, data from that week were considered missing [32,33].
Number of days of vomiting or nausea per week was derived from the first and second item. If a subject provided a valid response to the severity of nausea question that was > 0 but did not provide a valid response to the vomiting-episode question, this day was counted as a day of nausea/vomiting. If a subject provided a valid response to the vomiting-episode question that was > 0 but did not provide a valid response to the nausea-severity question, this day was counted as a day of nausea/vomiting. If the answer to the number of vomiting episodes was 99 (the maximum allowable value on the eDiary spinner), then the data were considered missing.
Descriptive statistics are presented for each of the 8 weeks of data collection. Tests of validity (convergent, discriminant, and known groups) involve week 1 data. Test-retest reliability used week 1 and week 2 data. Sensitivity to change and anchor-based assessment of meaningful change involved weekly data. All analyses were conducted using SAS Version 9.2.

Item-level score distributions
Item-level distributional statistics were tabulated including mean, median, standard deviation (SD), percentage of the sample at the floor (score of 0-lowest possible score-or the least possible nausea/vomiting) and ceiling (highest possible score or the greatestt possible nausea/vomiting), and item-level missing data. Floor and ceiling effects were considered to be present if they exceeded 15% [34].

Convergent validity
The NVSA © scores were hypothesized to correlate significantly and moderately with criterion measures assessing nausea/vomiting (convergent validity with the FLIE total, nausea, and vomiting scores and KDQOL-36™ nausea bothersomeness) but have low correlations with criterion measures assessing unrelated symptoms (discriminant validity with the KDQOL-36™ shortness of breath and faintness bothersomeness and EQ-5D-5 L). Convergent and discriminant validity were assessed with Spearman correlations due to the non-normal nature of the NVSA © distributions. A strong correlation was defined as a Spearman correlation ≥ 0.50, moderate correlation as ≥ 0.30 but < 0.50, and weak correlation as < 0.30 [35].

Known-groups validity
Known-groups validity compared NVSA © scores across groups that differed in their nausea/vomiting as defined by criteria external to the NVSA © . Subjects reporting 'no symptoms' on the sPGA were compared to subjects who reported any level of symptoms. Subjects reporting being 'not at all bothered' on the KDQOL-36™ nausea item were compared to subjects who reported any level of nausea bothersomeness. Subjects who had a FLIE total score equal to or higher than 108 (indicating no or minimal nausea/vomiting) were compared to subjects with scores lower than 108 (indicating presence of nausea/ vomiting) [36,37]. The Wilcoxon two-sample test was used to assess known-groups validity.

Test-retest reliability
A stable subgroup was defined as those who experienced no change in nausea/vomiting as measured by the PGRC. The intraclass correlation coefficient (ICC) was derived from a repeated measures, two-way, random effects model. The ICC was calculated from week 1 to week 2. ICC's of 0.70 or higher are considered acceptable [38,39].

Ability to detect change
Ability to detect change (or responsiveness) is an aspect of longitudinal construct validity [40,41]. Ability to detect change can be assessed by evaluating the relationship between changes in the PRO of interest to changes in external criteria (anchors) whether they be clinical or other patient-reported criteria [42]. Once responsiveness has been documented, range of values for thresholds for meaningful change can be estimated. The weekly sPGA was used as the external anchor in tests of ability to detect change. Sample size for the any improvement and any worsening groups had to be at least n = 6 for the analysis to be executed. sPGA nausea was the criterion for number of days of vomiting or nausea and mean severity of nausea. sPGA vomiting was the criterion for number of episodes of vomiting. Per well-established recommendations, the weekly NVSA© change scores were correlated (Spearman correlations) with their weekly anchor sPGA change score to assess the strength of the relationship between the anchor and the PRO instrument. It is recommended that anchor-PRO instrument correlations be 0.30-0.37 [42,43]. Within-patient change scores (e.g., week 2-week 1, week 3-week 2) were generated and were summarized with within-patient effect sizes [44]. The within-patient effect size was defined as the within-patient change score (post-pre score) divided by pre (group) standard deviation. Because of the small sample size and the sparse distribution of the PGRC for worsening, the PGRC could not be used as an external anchor.

Distribution-based and anchor-based assessments of minimal meaningful change
Distribution-based methods were 0.50 SD and one standard error of measurement (SEM) [42,45]. The mean and median of the eight weekly 0.50 SDs generated the 0.50 SD criterion. The SEM was calculated from SD * √1-reliability. The SEM was calculated using the test-retest ICC and the mean SD across the 8 weeks.
The sPGA was used to define minimally improved, no change, and minimally worse nausea/vomiting subgroups for anchor-based estimates. Minimal improvement was defined as a within-patient change score between contiguous sPGAs of − 1.0, minimal worsening was a within-patient change score between contiguous sPGAs of 1.0, and no change was a within-patient change score of zero. Because the sPGA was completed weekly, weekly within-patient change scores (e.g., week 2-week 1, week 3-week 2) were estimated for the minimal-improvement and minimal-worsening groups.
Because of the small sample size and the sparse distributions of the PGRC for worsening, the PGRC could not be used as an anchor.

Cognitive-debriefing interviews
A total of 12 interviews (six at each site) were scheduled. However, one subject did not show up for the interview, and no other eligible patients were available at that time. Therefore, 11 patients (six in wave 1 interviews [Washington, D.C.] and five in wave 2 interviews [Houston, TX]) were included in the data analysis across two sites. The mean age of the interview patients was 54 years, and the sample was 64% percent male. Just over one half (55%) of the patients were Black or African American. Just over one third (36%) of patients had completed less than a high-school education.
The cognitive interviews revealed that two patients mixed the meaning of nausea and vomiting. For clarity and consistency across patients' understanding of the meaning of these terms, it was decided to add patient training on the definitions of nausea and vomiting using examples of idiomatic phrases provided by patients. The second wave of interviews showed that the training solved this issue. For the nausea-severity item, four patients preferred an alternate wording for the anchor of "10" ("worst possible nausea" vs. "nausea as severe as you can imagine"). However, with exception of a single patient, patients understood both wordings of the scale, thought they were identical, and could answer the question identically with either response scale. It was decided to retain the original wording ("nausea as severe as you can imagine") to align with FDA [46] and IMMPACT [47] recommendations on assessment of pain intensity wherein the worst anchor is defined by "pain as bad as you can imagine" as well as single-item ratings scales used in other PRO instruments that define the extremity of the symptom of interest with "as bad as you can imagine." [27,[48][49][50][51][52][53][54][55] Cognitive interviews in other therapeutic areas also confirmed that patients can understand the response scale of "as bad as you can imagine." [55] Four patients did not answer items thinking of the specified 24-h recall period. This did not appear to result from wording of the items but from lack of experience with recall period. It was decided to add this point to patient training using a visual aid to clearly explain the 24h recall period. The second wave of interviews showed that the training solved this issue. Three patients had difficulty using the spinner widget to enter their numerical answer. It was decided that this would be an additional feature for which patients received training. Simple training was not enough to solve this issue for two patients in Wave 2. It was decided to enhance this training with exercises on the use of the spinner. Nearly three quarters (73%) of patients described the eDiary as "very easy" or "quite easy" to use. Two patients (18.2%) reported having difficulty using the eDiary, but 100% of patients described the device as "acceptable to use."

Confirmatory concept-elicitation interviews
A total of ten concept-elicitation interviews were completed from six sites from DaVita dialysis centers across the US. The concept-elicitation interviews were completed by four female and six male patients who had experienced nausea/vomiting associated with treatment for SHPT.
Nearly one half (46%) of the coded concepts were identified in the first group (first to third interview), additional 43% were identified in group 2 (fourth to sixth interview), and 7% were identified in group group 3 (seventh to eighth interview), respectively. Only one new concept was identified in the fourth and final group (ninth to tenth interview): a single patient reported cutting down on smoking cigarettes because smoking made the nausea/ vomiting worse. Given that this impact is relevant only to the population of patients who smoke tobacco-and its causal link to nausea/vomiting among CKD patients on maintenance hemodialysis with SHPT treated with oral calcimimetics is questionable-it is reasonable to dismiss this particular concept and assume that the ten transcripts demonstrated saturation of concept.
Patients most commonly reported that they started to experience nausea/vomiting relatively soon after starting treatment. In most cases, the symptoms began within 1 or 2 weeks of starting treatment. If they experienced gastrointestinal issues prior to the treatment, it typically became worse after starting treatment. Patients typically described having "nausea" or feeling "nauseous" or "nauseated." Other terms used to describe the experience were having a "queasy" stomach that "feels bad," "upset," or "sick." When asked how they would describe mild nausea, patients described an episode that was not bothersome or an episode of nausea without the vomiting. Nausea was considered severe if it impacted patients' daily life. Some reported that severe nausea rendered them unable to walk or "do anything." Participants were asked to rate the severity of their nausea when the symptom was at its worst (using a 0-10 numerical rating scale with zero indicating "No nausea," and 10 indicating "Nausea as severe as you can imagine"). On average, the patients rated their nausea at 6.9 out of 10 in severity (SD of 2.8) with a median of 8.0.
Patients distinguished between nausea and vomiting and experienced them as separate, but related, symptoms. While nausea was typically described as a sickly feeling, subjects described "throwing up" or "vomiting" as when this feeling was followed by ejecting the contents of their stomach into their mouth. Episodes of nausea were often, but not always, followed by vomiting. However, subjects frequently described experiencing nausea without vomiting.
While some patients experienced nausea on a daily or near-daily basis, others experienced it only once or twice a month or intermittently. Patients experienced variation in the amount of time they felt nauseous. For some, episodes of nausea were relatively brief lasting 10 to 15 min or less. Others described feeling nauseous for an entire day. When patients were asked to explain how often they vomit, most reported the number of vomiting episodes they experienced per day as opposed to per week or per month. Frequency of vomiting ranged from once or twice a month or "rarely" to up to eight or nine times in 1 day. Patients were also asked to estimate the number of times they vomit in a 24-h period when they experience vomiting at its worst. Seven out of ten patients responded to the question with a mean frequency of 2.4 times per day and answers ranged from once per day to eight times per day. Vomiting episodes were typically counted as the number of times a patient's body ejected the contents from their stomach into their mouth. Each episode of vomiting was counted individually even if multiple episodes happened within a short timeframe.

Psychometric validation Cohort description
As shown in Table 1, subject age ranged from 25 to 83 with a mean of 53.5 and a median of 54. Males constituted 59% of the sample. The racial composition of the sample included 36% white, 59% of black or African American, 2% Asian, and 2% American Indian or Alaskan native. One fifth (19%) of the sample was of Hispanic or Latino heritage. One quarter (25%) of the sample was never married, 45% married (or living as married), 14% divorced, and 8% each separated or widowed. Mean years of education was 13.1 with 9% reporting less than a high-school education, 42% a high- school education, 35% some college, and 14% a college education. In terms of employment, 38% reported being disabled/unable to work, 23% were retired, 21% were unemployed, 16% were employed either part-time or full-time, and 2% were homemakers. Almost two thirds of the sample (64%) reported an annual income of less than $15,000; 13% reported an income between $15,000-$25,000, 6% between $25,000-35,000, 10% between $35,000-$50,000, and 7% greater than $50,000. Just over one half of the sample (53%) were 'Variable Time on Cinacalcet, ' 33% were 'Cinacalcet Naïve, ' and 14% were 'Initiating Cinacalcet.'

Descriptive statistics
For each of the NVSA © scores, floor effects (absence of nausea and vomiting) were large (according to well-accepted criteria [34]) at baseline (> 47% across the three NVSA © scores) and increased over the course of the study indicating less nausea/vomiting over time (Table 2). Ceiling effects (greatest severity of nausea and vomiting) were under minimal (according to well-accepted criteria: [34] 10.5% for number of days of vomiting or nausea, under 2.0% for number of episodes of vomiting, and under 3.5% for mean severity of nausea. Missing data rates ranged from a low of 2.5% (week 1) to to a high of 9.9% (week 2). There were no discernible patterns to missing-data rates across the eight study weeks.

Test-retest reliability
The ICCs for week 1 to week 2 were 0.93 for number of days of vomiting or nausea, 0.61 for number of episodes N sample size, SD standard deviation Floor = percentage of the sample obtaining the lowest possible score (score of 0) (absence of nausea and vomiting) Ceiling = percentage of the sample obtaining the highest possible score (greatest nausea and vomiting) a Test-retest reliability was intraclass correlations between week 1 to week 2 of vomiting, and 0.94 for mean severity of nausea (Table 2).

Convergent and Discriminant validity
A total of nine convergent correlations with the FLIE were tested, and 100% were statistically significant (Table 3). Spearman correlations of the NVSA © scores with the FLIE ranged from − 0.38 to − 0.60 (mean and median of − 0.56 and − 0.51, respectively); only two of the nine convergent correlations were less than 0.40. Six of the nine correlations (67%) of the NVSA © with the FLIE met standard criteria for a strong correlation while the remaining three correlations met standard criteria for a moderate correlation. Spearman correlations of the NVSA © scores with the KDQOL-36™ ranged from 0.38 to 0.67 (mean and median of 0.57 and 0.65, respectively), and two of the three correlations (67%) met standard criteria for a strong correlation (Table 4). Hypotheses about discriminant validity were confirmed. Spearman correlations between NVSA © scores and KDQOL-36™ shortness of breath and faintness bothersomeness ranged from 0.06 to 0.31 (mean and median of 0.19 and 0.20, respectively); five of the six discriminant correlations met criteria for a small correlation ( Table 4). Six of the 15 (40%) of the Spearman discriminant correlations between NVSA© scores and EQ-5D-5 L dimensions were statistically significant at conventional probability levels (p ≤ 0.05). These discriminant correlations ranged from − 0.06 to 0.42 (mean and median of 0.19 and 0.20, respectively); 11 of the 15 discriminant correlations (73%) met criteria for a small correlation (Table 5). Spearman correlations of the NVSA © scores with the EQ-5D-5 L VAS were all statistically significant and ranged from − 0.24 to − 0.33 (mean and median of − 0.28); two of the three discriminant correlations met criteria for a small correlation (Table 5).

Known-groups validity
Each of the nine tests of known-groups validity were statistically significant: there was a statistically higher number of days of vomiting or nausea per week, number of episodes of vomiting per week, and mean severity of nausea for subjects with vomiting/nausea as defined by the external criteria ( Table 6). Number of days of vomiting or nausea per week and mean severity of nausea had similar discriminating ability to one another for three externally-defined known-groups while number of episodes of vomiting per week bore a somewhat weaker relationship to the known groups.

Ability to detect change
The Spearman correlation between the eight weekly sPGA change scores and NVSA © change scores ranged from 0.14 to 0.43 for number of days of vomiting or nausea per week (mean and median of 0.28 and 0.29, respectively), 0.09 to 0.58 for number of episodes of vomiting per week (mean and median of 0.28 and 0.29, respectively), and 0.14 to 0.49 for mean severity of nausea (mean and median of 0.34 and 0.38, respectively). Even though a small handful of the 21 calculated correlations were small-and perhaps meaningless (i.e., the single observed correlation of 0.09)-a broader gestalt view of the 21 correlations suggested that, by standard criteria [42,43], the sPGA represents a marginally-acceptable criteria for assessing ability to detect change in the NVSA © scores. Because of the small sample size and the sparse distributions of the sPGA's, some statistical comparisons were not effected (blank cells in Appendix). Table 7 presents a summary of results for ability to detect change (detailed results by week are presented in Appendix). Consistent with expectations, mean differences for the no-change group for all three NVSA © scores were small (mean and median, respectively: 0.02 and − 0.00 for number of days of vomiting or nausea; 0.06 and 0.07 for number of vomiting episodes, and 0.00 and 0.04 for mean severity of nausea). Effect sizes for the no-change group for all three NVSA © scores were small (all less than 0.13).   Across the three NVSA © scores, 75% of the mean differences for the worsening group were positive in direction as expected. For number of days of vomiting or nausea, the mean and median mean difference was 0.65 and 0.55, respectively, and mean and median effect sizes were 0.28 and 0.23, respectively. The mean and median effect size for number of vomiting episodes was 0.88 and 0.60, respectively, and the same for mean severity of nausea was 0.36 and 0.38, respectively.
Across the three NVSA © scores, 94% of the mean differences for the improvement group were negative as expected. For number of days of vomiting or nausea, the mean and median mean difference was − 0.88 and − 1.05, respectively, and mean and median effect sizes were 0.38 and 0.42, respectively. The mean and median effect sizes for number of vomiting episodes was 0.83 and 0.88, respectively, and the same for mean severity of nausea was 0.49 and 0.29, respectively.

Minimal meaningful change thresholds
Distribution-based results The mean and median onehalf SD for number of days of vomiting or nausea was 1.19 and 1.20, respectively ( Table 8). The mean and median one-half SD for number of episodes of vomiting was 1.08 and 0.99, respectively, and the same for mean severity of nausea was 0.55 and 0.54, respectively. The one SEM for number of days of vomiting and nausea was 0.63, 1.35 for number of episodes of vomiting, and 0.27 for mean severity of nausea.
Anchor-based results using the sPGA Because of the small sample size and the sparse distributions of the sPGA change scores, 52% of the analyses could not be executed. Because a small sample of subjects contributed data to these analyses and because two observed results were anomalous (positive change for minimally improve   for number of days of vomiting and nausea and negative change for minimally worsen for mean severity of nausea), a single, preliminary threshold for minimalmeaningful change is proposed rather than a range of thresholds. Thresholds for both minimal worsening and improvement are presented for comprehensiveness sake (Table 9). Triangulation between distribution-and anchorbased results According to the FDA [11], "distributionbased methods for determining clinical significance of particular score changes should be considered as supportive and are not appropriate as the sole basis for determining a responder definition." Accordingly, triangulation is used to examine multiple values from different approaches to converge on a small range of values [42,56,57]. Thresholds for meaningful change should exceed the bounds of measurement error, i.e., distribution-based approaches should define the lower limits of meaningful-change thresholds [38,58]. In the case of the NVSA, © however, the anchor-based estimates were smaller than the distribution-based estimates for number of days with nausea and vomiting and mean severity of nausea. Therefore, the following preliminary thresholds for meaningful worsening are suggested based upon interpolation between the 0.50 SD and one SEM: 0.90 for number of days of nausea and vomiting, 1.20 for number of vomiting episodes; and 0.40 for mean severity of nausea. These estimates should be considered strictly provisional until additional interventional, longitudinal research has been completed.

Discussion
We described herein the development and psychometric validation of the NVSA © -a new PRO instrument that The within-patient effect size was defined as the within-patient change score (post-pre score) divided by pre (group) standard deviation question of what to measure may be obvious given the condition being treated," and this was squarely the case with CKD patients on maintenance hemodialysis with SHPT treated with oral calcimimetic cinacalcet. Traditional cognitive-debriefing interviews revealed that patients understood the NVSA © instructions, items, and response scales (although additional training was recommended and implemented). A full 73% of patients described the eDiary as "very easy" or "quite easy" to use, and 100% of patients described the eDiary as "acceptable to use." Confirmatory concept-elicitation interviews cross-validated the data from clinical studies and expert clinical guidelines; these interviews confirmed that the NVSA © captures relevant aspects of nausea (severity) and vomiting (frequency) from the patient point of view. The psychometric analyses support most measurement characteristics of the NVSA © . The NVSA © scores correlated significantly with the FLIE scales and KDQOL-36™ nausea item and exhibited markedly smaller correlations with KDQOL-36™ shortness of breath and faintness and the EQ-5D-5 L dimensions and VAS. Tests of known-groups validity were highly supportive of the NVSA © with the group defined by the presence of nausea/vomiting having statistically higher mean NVSA © scores than the absence group. Empty rows = sample size < n = 6 for minimally improve or minimally worsen and analyses not conducted n = sample size; Δ = change score; SE = standard error Test-retest reliability was excellent for number of days of nausea and vomiting and mean severity of nausea. Test-reteat reliability was somewhat smaller for number of vomiting episodes likely because few subjects reported vomiting episodes (lack of between-subjects variability). The pattern of results for ability to detect change was as expected. The no-change group showed little change and had effect sizes close to zero. Mean differences for the worsening group were mostly positive as expected and the same for the improvement group were mostly negative as expected. With a few exceptions, the mean and median effect sizes for the worsening and improvement groups met standard criteria for medium or large effect sizes. That the worsening and improvement groups exhibited larger mean difference and effect sizes than the no-change group provides evidence of face validity of the ability-to-detectchange analyses. These analyses were limited by the small sample size overall and the fact that a large proportion of subjects in the non-interventional, observational study changed little over time. Future research is necessary to replicate these analysis in larger samples undergoing treatment.
Preliminary minimal meaningful change thresholds were an increase of 0.90 days with vomiting or nausea per week, 1.20 vomiting episodes per week, and 0.40 for mean severity of nausea. These estimates were based on group-level data. Given that vomiting and nausea are burdensome experiences, small changes in NVSA © scores could be considered clinically meaningful for CKD patients with SHPT. Definitions of minimal meaningful change are not a static or fixed attribute of a PRO instrument and evolve as more evidence is gathered through empirical studies. [56] Additional interventional research is necessary to recommend more robust thresholds for meaningful change.
Our study has several limitations. The initial items and response categories were drafted by a crossfunctional expert team that included two health outcomes/PRO researchers, one medical development lead, and one safety/regulatory lead. External clinical experts were not involved in the initial PRO development. The ten patients recruited and successfully interviewed for the confirmatory concept-elicitation interviews had completed the eight-week data collection period for the non-interventional, observational study; as such they may have been sensitized to topics covered in the noninterventional, observational study data-collection protocol. However, the intent of the concept-elicitation interviews was to be confirmatory in nature-to identify any potential outcomes aside from nausea and vomiting-not identified by the review of information from clinical studies [8][9][10][17][18][19] and expert clinical guidelines [20]. Although the most frequently-reported adverse reactions for cinacalcet are nausea and vomiting, less than 15% of the sample were subjects who would have been expected to experience new episodes of nausea/vomiting due to initiation of cinacalcet. This reduced the variability in the NVSA © scores: almost half of the sample reported no nausea/vomiting during the first study week (floor effects) thus preventing analysis of improvement in scores over time. It is, however, ideal for capturing worsening. Due to the noninterventional nature of the study, neither substantial improvement nor worsening was expected to occur. The small sample limited tests of ability to detect any improvement or worsening (since most of the sample reported no change on the sPGA). With eight sPGA's across 8 weeks, there was an endless number of possible combinations and permutations to assess ability to detect change and meaningful change-none of which have any innate conceptual, methodological, and psychometric superiority over one another. Because we were contending with small sample sizes, we utilized an approach to garner "strength across numbers." By utilizing all 8 weeks, we derived multiple estimates that could be summarized by measures of central tendency instead of a more limited approach of arbitrarily selecting a handful of weeks for analysis. By utilizing all 8 weeks, it was also possible to visually assess the consistency of within-patient change across the 8 weeks and across definitions of worsening, no change, and improvement which added confidence to the interpretation of the data despite the small sample size. Derivation of the anchor-based thresholds of meaningful change might have been more conclusive if sample size was larger for the minimally-improved and minimally-worsened groups. Additional assessment of the performace of the NVSA © , particularly thresholds for meaningful change, in the context of any clinical trial that it may be used in are warrented.

Conclusions
Tolerability issues can impact adherence with medications and, ultimately, treatment efficacy. Nausea and vomiting are the most commonly-reported adverse reactions experienced by patients taking the oral calcimimetic cinacalcet for SHPT associated with CKD requiring dialysis. Developing a reliable and valid measure of nausea/vomiting in this population will facilitate the measurement of these reactions in clinical trials. This study supported many of the psychometric characteristics of a short daily diary designed to measure nausea and vomiting. The NVSA © may be used to support study endpoints in clinical trials comparing the nausea and vomiting profile of novel SHPT therapies. Empty rows = sample size < n = 6 and analyses not conducted. The within-patient effect size was defined as the within-patient change score (post-pre score) divided by pre (group) standard deviation