Societies for Pediatric Urology


Back to 2024 Posters


A Moratorium on Manual Chart Reviews - Llama AI Model
David A. Ostrowski, MD1, Joseph R. Logan, MS2, Dennis Head, BS3, Austin Thompson, BS4, Mandy Rickard, NP5, Jessica H. Hannick, MD, MSc6, Lynn L. Woo, MD6, Armando J. Lorenzo, MD, MSc5, Gregory E. Tasian, MD, MSc, MSCE7, John K. Weaver, MD, MSTR6.
1University of Pennsylvania Health System, Philadelphia, PA, USA, 2The Children's Hospital of Philadelphia, Philadelphia, PA, USA, 3Northeast Ohio Medical University, Rootstown, OH, USA, 4Case Western University, Cleveland, OH, USA, 5The Hospital for Sick Children, Toronto, ON, Canada, 6Cleveland Clinic Children's Hospital, Cleveland, OH, USA, 7The Children's Hospital of Pennsylvania, Philadelphia, PA, USA.

Background: Our objective was to test the “out-of-the-box” capability of LLAMA-2, a large language model (LLM) developed by Meta and used here in its 13-billion-parameter version, to accurately extract text-based data and identify the presence of hydronephrosis in renal ultrasound reports. We hypothesized that LLAMA-2 would accurately classify ultrasound reports as describing the presence versus absence of hydronephrosis.
Methods: Two independent test sets of renal ultrasounds from two separate pediatric hospitals, the Children’s Hospital of Philadelphia (CHOP) and the Hospital for Sick Children (SickKids), were identified. Each test set consisted of radiologist impressions from renal ultrasounds performed for pediatric patients with spina bifida; the impressions were manually reviewed and labelled for the finding of any degree of hydronephrosis in either kidney. The CHOP set consisted of 2,392 renal ultrasounds, and the SickKids set consisted of 110 renal ultrasounds. LLAMA-2 was installed on a secure server and underwent no additional training or modification. LLAMA-2 was asked to evaluate each ultrasound report impression for the presence of any degree of hydronephrosis in either kidney and to provide a binary classification (“hydronephrosis present” or “no hydronephrosis present”). Between queries, the input history of LLAMA-2 was cleared so that prior inputs did not bias or inform later model outputs. For each test set, LLAMA-2 classifications were compared to manually derived labels using a confusion matrix (a 2x2 table comparing model classification versus manually derived labels). Accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score (the harmonic mean of PPV and sensitivity, ranging from 0 to 1, where 1 indicates perfect PPV and sensitivity) for the finding of hydronephrosis were calculated. Because LLAMA-2 provided binary outputs without classification probabilities or modifiable decision thresholds, receiver operating characteristic (ROC) curves could not be generated.
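The evaluation loop described in the Methods can be illustrated with a minimal Python sketch. It is not the study's code: the prompt wording, the names PROMPT_TEMPLATE, parse_label, and tally_confusion, and the ask_model callable (standing in for a single-turn call to a locally hosted model) are all hypothetical. The sketch shows each impression being sent in a fresh query with no shared history, the free-text answer being mapped to a binary label, and the result being tallied into a 2x2 confusion matrix against the manual label.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical prompt wording; the abstract does not give the actual prompt.
PROMPT_TEMPLATE = (
    "Does this renal ultrasound impression describe any degree of "
    "hydronephrosis in either kidney? Reply with exactly "
    "'hydronephrosis present' or 'no hydronephrosis present'.\n\n"
    "Impression: {impression}"
)


def parse_label(model_output: str) -> bool:
    """Map the model's text answer to True (present) or False (absent)."""
    text = model_output.strip().lower()
    # Test the negative phrase first: it contains the positive phrase.
    if "no hydronephrosis present" in text:
        return False
    if "hydronephrosis present" in text:
        return True
    raise ValueError(f"Unparseable model output: {model_output!r}")


def tally_confusion(
    reports: Iterable[Tuple[str, bool]],
    ask_model: Callable[[str], str],
) -> Dict[str, int]:
    """Tally a 2x2 confusion matrix of model output versus manual labels.

    `reports` yields (impression_text, manually_labelled_hydronephrosis).
    `ask_model` is assumed to be stateless: one prompt in, one answer out,
    with no history carried between calls.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for impression, has_hydro in reports:
        prompt = PROMPT_TEMPLATE.format(impression=impression)
        predicted = parse_label(ask_model(prompt))
        if predicted:
            counts["tp" if has_hydro else "fp"] += 1
        else:
            counts["fn" if has_hydro else "tn"] += 1
    return counts
```

Any callable that wraps the model can be passed as ask_model, provided each call starts from an empty context. Raising an error on unparseable answers, rather than guessing a label, is a design choice of this sketch.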
Results: In the CHOP set, there were 517 studies reporting hydronephrosis and 1,875 studies reporting no hydronephrosis. LLAMA-2 achieved 90.6% accuracy with sensitivity 66.0%, specificity 97.3%, PPV 87.2%, NPV 91.2%, and F1 0.75. In the SickKids set, there were 81 studies reporting hydronephrosis and 29 studies reporting no hydronephrosis. LLAMA-2 achieved 85.5% accuracy with sensitivity 86.4%, specificity 82.8%, PPV 93.3%, NPV 68.6%, and F1 0.90.
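As a consistency check on the reported metrics, the Python sketch below computes each one from confusion-matrix counts. The abstract does not report the cell counts; the values used here are back-calculated assumptions, chosen as the whole-number counts that reproduce the reported class totals and percentages. The function name diagnostic_metrics and the dictionary names are hypothetical.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute standard binary-classification metrics from 2x2 counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# ASSUMPTION: counts back-calculated from the reported percentages and
# class totals (517/1,875 and 81/29); they are not stated in the abstract.
CHOP_COUNTS = {"tp": 341, "fn": 176, "tn": 1825, "fp": 50}
SICKKIDS_COUNTS = {"tp": 70, "fn": 11, "tn": 24, "fp": 5}

for name, counts in (("CHOP", CHOP_COUNTS), ("SickKids", SICKKIDS_COUNTS)):
    metrics = diagnostic_metrics(**counts)
    summary = ", ".join(f"{key}={value:.3f}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

Under these assumed counts, hydronephrosis prevalence is about 22% in the CHOP set (517 of 2,392) and about 74% in the SickKids set (81 of 110). PPV and NPV depend on prevalence, so they are not directly comparable between the two sets.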
Conclusions: LLAMA-2 demonstrated good performance in evaluating renal ultrasound report impressions from pediatric patients with spina bifida for the presence of hydronephrosis in two independent test sets. Our findings suggest a future clinical application for LLAMA-2 and other LLMs in automatically, rapidly, and accurately extracting information from medical text records, making tedious manual chart review obsolete and enhancing the efficiency of research as well as clinical care.


