Interrater Reliability in Evaluating Neurogenic Bladders on Videourodynamics
John Weaver, MD1, Jason Van Batavia, MD, MSTR1, Dana Weiss, MD1, Christopher Long, MD1, Ariana Smith, MD2, Stephen Zderic, MD1, Madalyne Martin-olenski, BA1, Antoine Selman Fermin, MD1, Gregory Tasian, MD, MSCE1.
1Children's Hospital of Philadelphia, Philadelphia, PA, USA, 2University of Pennsylvania, Philadelphia, PA, USA.

Introduction and objective: Videourodynamics (VUDS) is the gold standard for evaluating the lower urinary tract in patients with spina bifida (SB), used to ascertain whether a bladder has safe storage characteristics that protect the upper urinary tract. However, interpretation of VUDS studies varies widely, even within a single institution. We hypothesized that, after reaching consensus on the VUDS characteristics that define the degree of bladder dysfunction, a group of expert reviewers would show substantial agreement in classifying VUDS studies by risk of future upper tract injury.

Methods: We performed a pilot study of 10 VUDS studies performed on children at our institution. Five reviewers (4 fellowship-trained pediatric urologists who regularly care for SB patients [Reviewers 1-4] and 1 adult urologist with expertise in urodynamics [Reviewer 5]) rated whether each study placed the patient at high, moderate, or low risk of upper tract injury. All reviewers met biweekly to review the 10 studies and reach consensus on the characteristics defining the high, moderate, and low risk strata. Subsequently, a REDCap survey containing the VUDS tracings, real-time flowsheet data completed by a pediatric urologist in attendance during the study, and fluoroscopic images from 60 studies was distributed to all reviewers. For each study, reviewers were asked one question: What is the risk category (low, moderate, high) for upper tract deterioration for this VUDS? Cohen kappa and Fleiss kappa scores were calculated using R.

Results: All reviewers reviewed the same 60 VUDS studies. The Fleiss kappa score across all reviewers was 0.534, indicating moderate agreement. Reviewers 1 and 2 showed the most agreement, with a Cohen kappa score of 0.661 (substantial agreement), while Reviewers 3 and 4 showed the least, with a Cohen kappa score of 0.382 (fair agreement). Table 1 reports Cohen kappa scores between individual raters. All reviewers agreed perfectly on 27 studies (4 high risk, 12 moderate risk, 11 low risk). For 1 study, each of the three risk categories was chosen by at least 1 reviewer.

Conclusion: Even with extensive measures to ensure consistency of VUDS studies and to reach consensus on risk strata, expert reviewers demonstrated only moderate agreement when evaluating studies for the critical clinical question of future risk of renal injury. Improvements are needed in how pediatric urologists analyze VUDS. Machine learning algorithms may provide an opportunity to identify characteristics that predict upper tract deterioration and chronic kidney disease progression.

Reviewer combination    Cohen kappa score (p-value)
Reviewer 1 and 2        0.661 (<.001)
Reviewer 1 and 3        0.5 (<.001)
Reviewer 1 and 4        0.537 (<.001)
Reviewer 1 and 5        0.587 (<.001)
Reviewer 2 and 3        0.525 (<.001)
Reviewer 2 and 4        0.437 (<.001)
Reviewer 2 and 5        0.61 (<.001)
Reviewer 3 and 4        0.382 (<.001)
Reviewer 3 and 5        0.548 (<.001)
Reviewer 4 and 5        0.597 (<.001)

Table 1. Cohen kappa scores between individual raters with associated p-values.
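
For readers unfamiliar with the two agreement statistics reported above, the following is a minimal sketch of how pairwise Cohen kappa and multi-rater Fleiss kappa can be computed from categorical ratings. It is not the authors' analysis code (the abstract states only that the scores were calculated in R); it uses the standard textbook formulas, and the function names and the small `ratings` table are hypothetical illustrations, not study data.

```python
from collections import Counter
from itertools import combinations

CATEGORIES = ("low", "moderate", "high")


def cohen_kappa(a, b):
    """Cohen kappa for two raters who rated the same studies."""
    n = len(a)
    # Observed agreement: fraction of studies given the same category.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in CATEGORIES)
    # Undefined (division by zero) if both raters use one identical category only.
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(table):
    """Fleiss kappa; table[i] lists the ratings of study i, one per rater."""
    n_studies = len(table)
    n_raters = len(table[0])
    counts = [Counter(row) for row in table]
    # Per-study agreement: proportion of rater pairs that agree.
    p_i = [
        (sum(c[cat] ** 2 for cat in CATEGORIES) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_studies
    # Chance agreement from the overall share of each category.
    p_j = [sum(c[cat] for c in counts) / (n_studies * n_raters) for cat in CATEGORIES]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical ratings: 6 studies (rows) by 3 raters (columns). Not study data.
ratings = [
    ["high", "high", "high"],
    ["low", "low", "moderate"],
    ["moderate", "moderate", "moderate"],
    ["low", "low", "low"],
    ["high", "moderate", "moderate"],
    ["low", "moderate", "low"],
]

if __name__ == "__main__":
    print("Fleiss kappa:", round(fleiss_kappa(ratings), 3))
    for i, j in combinations(range(len(ratings[0])), 2):
        col_i = [row[i] for row in ratings]
        col_j = [row[j] for row in ratings]
        print(f"Rater {i + 1} and {j + 1}:", round(cohen_kappa(col_i, col_j), 3))
```

Both statistics correct raw agreement for the agreement expected by chance, which is why a panel can agree on many individual studies and still score only moderately. The sketch omits p-values, which would require an additional variance estimate.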

