Moving towards quantitative vesicoureteral reflux grading using machine learning
Adree Khondker, BHSc1, Jethro CC. Kwong, MD2, Priyank Yadav, MCh1, Justin YH. Chan, MD2, Anuradha Singh, MD1, Marta Skreta, BSc2, Lauren Erdman, MSc2, Daniel T. Keefe, MD1, Mandy Rickard, NP1, Armando Lorenzo, MD1.
1The Hospital for Sick Children, Toronto, ON, Canada, 2University of Toronto, Toronto, ON, Canada.
BACKGROUND: Grading vesicoureteral reflux (VUR) on a scale of 1 to 5 from voiding cystourethrograms (VCUGs) is subjective, which results in low agreement between clinicians. This creates a need for a more objective means of grading that can standardize and improve the current VUR grading system. Specific features of the ureter, such as tortuosity and dilatation, have been reported to correlate with VUR grade. The objective of this study was to use these features, coupled with machine learning (ML), to quantitatively determine individual VUR grades with high accuracy and reliability.
METHODS: The database was generated from our institution's imaging repository of VCUGs performed between January 2013 and December 2019. Each VCUG was split into left and right renal units, each containing the whole ureter and kidney, and then assessed for reflux. Renal units were then included or excluded according to the study criteria (Figure 1). Each included renal unit was annotated to generate features for supervised ML. Four features were abstracted: ureter tortuosity, ureteropelvic junction (UPJ)/proximal ureter width, ureterovesical junction (UVJ)/distal ureter width, and maximum ureter width (Figure 2). Because VUR grading is highly variable, each included renal unit was graded by at least 5 raters to determine a consensus VUR grade. Inter-rater reliability was calculated to assess the validity of grading. A support vector machine model was trained for multi-class classification to distinguish individual VUR grades. The model was also applied to clinical cases of posterior urethral valves (PUV) and pyeloplasty to assess feasibility in practice.
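The abstract does not state how the consensus grade was derived from the individual ratings. As a minimal sketch, assuming a simple majority vote with ties resolved toward the lower grade (both the rule and the function name `consensus_grade` are illustrative assumptions, not the authors' documented method), the step could look like this:

```python
from collections import Counter


def consensus_grade(ratings, min_raters=5):
    """Return a consensus VUR grade (1-5) for one renal unit.

    ASSUMPTION: consensus is the most frequent grade, and ties are
    broken toward the lower grade. The source only says a consensus
    grade was determined from at least 5 raters.
    """
    if len(ratings) < min_raters:
        raise ValueError(f"need at least {min_raters} ratings, got {len(ratings)}")
    if any(g not in range(1, 6) for g in ratings):
        raise ValueError("VUR grades must be integers from 1 to 5")

    counts = Counter(ratings)
    top = max(counts.values())
    # Among the most frequent grades, pick the lowest one (assumed tie-break).
    return min(grade for grade, count in counts.items() if count == top)
```

For example, `consensus_grade([3, 3, 4, 2, 3])` returns 3, while a tied set such as `[2, 2, 3, 3, 4]` returns 2 under the assumed tie-break. A different rule, such as the median rating, would be equally easy to substitute.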
RESULTS: A total of 6288 renal units (from 3144 VCUGs) were identified in the study period and screened for VUR. Of these, 1935 renal units had documented VUR, of which 1248 were included in the ML model. A total of 7986 independent VUR grades were collected, with at least 5 raters for each renal unit. The included cohort was evenly split between male and female patients (50/50), with a median age at imaging of 0.74 years (IQR 0.25, 3.12). The overall Fleiss' kappa for inter-rater agreement was 0.443. The model performed with 66% accuracy (AUC = 0.82) on 80/20 holdout validation. Among 51 patients with PUV, the 13 who required renal replacement therapy had ureters with greater median tortuosity than patients not requiring renal replacement therapy (2.75 vs. 2.11, p = 0.04). There was a moderate negative correlation between tortuosity and anteroposterior diameter post-pyeloplasty in 16 patients (r = -0.54, p = 0.03).
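For readers unfamiliar with the agreement statistic, the following is a minimal sketch of the standard Fleiss' kappa calculation. It assumes the same number of raters for every renal unit, which the textbook formula requires. The study reports "at least 5" raters per unit, so the authors' exact handling of unequal rater counts is not known from the abstract. The function name and the table layout are illustrative.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table of rating counts.

    table[i][j] is the number of raters who assigned item i
    (here, a renal unit) to category j (here, a VUR grade).
    ASSUMPTION: every row sums to the same number of raters n.
    """
    n_items = len(table)
    n_raters = sum(table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in table):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    n_categories = len(table[0])
    # Proportion of all ratings that fall in each category.
    p_cat = [
        sum(row[j] for row in table) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1.0 - p_expected)
```

A value of 1 indicates perfect agreement and values near 0 indicate agreement no better than chance, so the reported 0.443 falls between the two.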
CONCLUSIONS: VUR grading by quantitative metrics is feasible in large datasets and can be supported by ML-based methods. VUR features may be correlated with clinical outcomes, but further validation and prospective study are warranted.