BACKGROUND: Machine learning models are increasingly used in clinical management. Using a multicenter cohort of children with vesicoureteral reflux (VUR), we aimed to develop a clinically useful risk prediction model for urinary tract infection (UTI) and to identify potentially novel UTI risk factors.
METHODS: Data were collected from four academic centers on patients with primary VUR diagnosed by voiding cystourethrogram from 2010 to 2021. Children with secondary VUR, other urologic anomalies, less than 9 months of follow-up, or incomplete data were excluded. The primary outcome was UTI within one year of VUR diagnosis, confirmed by proper collection method, pyuria, and single-organism culture. Cox regression and Random Forest models were used for analysis. Statistical and machine learning models were run in parallel on similar patient sets with identical follow-up periods. Tenfold cross-validation was used to train and test the Random Forest classifier. The training/test fold that produced the highest area under the curve (AUC) was used to determine variable importance.
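The fold-wise evaluation described in the methods can be illustrated with a minimal sketch. It is not the authors' code: the function names (`auc`, `stratified_folds`, `cross_validated_auc`) and the `fit` callback are hypothetical, and `fit` stands in for whatever classifier is trained on each fold (a Random Forest in the study). The sketch only shows how per-fold AUC values, the best-performing fold, and the average across folds could be obtained.

```python
import random
from statistics import mean


def auc(labels, scores):
    """AUC via the Mann-Whitney statistic: the share of (event, non-event)
    pairs in which the event case has the higher score; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=10, seed=0):
    """Deal sample indices into k folds, class by class, so that each fold
    has a similar event rate."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def cross_validated_auc(X, y, fit, k=10, seed=0):
    """Train on k-1 folds, score the held-out fold, repeat for every fold.

    `fit(train_X, train_y)` is a hypothetical callback that must return a
    function mapping one row of X to a risk score.
    Returns (per-fold AUCs, index of the best fold, mean AUC).
    """
    per_fold = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        score = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        per_fold.append(auc([y[i] for i in test_idx],
                            [score(X[i]) for i in test_idx]))
    best = max(range(k), key=lambda f: per_fold[f])
    return per_fold, best, mean(per_fold)
```

By construction, the AUC of the best fold is never lower than the average across folds, so the two figures are not interchangeable and are best reported separately.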
RESULTS: Among 848 patients with primary VUR, 631 were included in one-year risk models (620 in Random Forest models). Median age at diagnosis was six months (IQR 3.0-21), and 33% of patients had VUR grades IV-V. Males comprised 40% of patients (258/631), and 63% of males (163/258) were uncircumcised. A total of 88 patients (14%) developed a UTI during one year of follow-up. Most patients (70%) were on continuous antibiotic prophylaxis (CAP) at the initial visit. In multivariable Cox regression models, one-year predictors of UTI risk were female sex, presence of foreskin, high-grade VUR, and Hispanic ethnicity (Table 1). Among the machine learning methods explored, the Random Forest model performed best in terms of AUC. Random Forest models identified CAP, VUR grade, Hispanic ethnicity, sex, circumcision status, and prior UTI history, among others, as influential for the AUC. Pruning the variable set from 22 to 12 (Figure 1) left the AUC for that test fold unchanged at 0.82. Rerunning tenfold cross-validation with the pruned variable set increased the AUC from 0.67 to 0.68. Analyses of two-year UTI risk agreed with these findings, albeit with a smaller sample size.
CONCLUSIONS: Using Random Forest models, we developed a UTI risk prediction model for patients with primary VUR. The model agreed with findings from traditional statistics and clinical knowledge. Female sex, circumcision status, severity of VUR, and age were significant predictors of UTI risk at one year. Hispanic ethnicity was a novel risk factor for UTI identified in both statistical and machine learning models. Next steps are to develop a risk prediction application and further refine the model.