Background: Our objective was to test the “out-of-the-box” capability of LLAMA-2, a large language model (LLM) developed by Meta, in its 13-billion-parameter version, to accurately extract text-based data and identify the presence of hydronephrosis in renal ultrasound reports. We hypothesized that LLAMA-2 would accurately classify ultrasound reports as describing the presence versus absence of hydronephrosis.
Methods: Two independent test sets of renal ultrasounds were identified at two separate pediatric hospitals, the Children’s Hospital of Philadelphia (CHOP) and the Hospital for Sick Children (Sickkids). Each test set consisted of radiologist impressions from renal ultrasounds performed for pediatric patients with spina bifida; the impressions were manually reviewed and labelled for the finding of any degree of hydronephrosis in either kidney. The CHOP set consisted of 2,392 renal ultrasounds, and the Sickkids set consisted of 110 renal ultrasounds. LLAMA-2 was installed on a secure server and underwent no additional training or modification. LLAMA-2 was asked to evaluate each ultrasound report impression for the presence of any degree of hydronephrosis in either kidney and to provide a binary classification (“hydronephrosis present” or “no hydronephrosis present”). Between queries, the model's query history was cleared so that prior inputs could not bias or inform later outputs. For each test set, LLAMA-2 classifications were compared to the manually derived labels using a confusion matrix (a 2x2 table comparing model classification versus manually derived labels). Accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score (the harmonic mean of PPV and sensitivity, ranging from 0 to 1, where 1 indicates perfect PPV and sensitivity) for the finding of hydronephrosis were calculated. Because LLAMA-2 provided binary outputs without classification probabilities or modifiable decision thresholds, receiver operating characteristic (ROC) curves could not be generated.
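As an illustration of the evaluation described above, the following minimal Python sketch derives the six reported metrics from a 2x2 confusion matrix. It is not the study's code: the function name `binary_metrics`, the boolean encoding (True meaning "hydronephrosis present"), and the toy labels are assumptions introduced here for illustration only.

```python
from typing import Dict, Sequence


def binary_metrics(labels: Sequence[bool], predictions: Sequence[bool]) -> Dict[str, float]:
    """Compute 2x2 confusion-matrix metrics.

    Assumed encoding: True = "hydronephrosis present",
    False = "no hydronephrosis present".
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    # Cells of the 2x2 table (manual label vs. model classification).
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN rather than raising when a metric is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        # Harmonic mean of PPV and sensitivity, written in terms of counts.
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Toy data for illustration only; these are NOT study data.
toy_labels = [True, True, True, False, False, False, False, False]
toy_predictions = [True, True, False, False, False, False, False, True]
toy_metrics = binary_metrics(toy_labels, toy_predictions)
```

Because the inputs are hard binary classifications, each metric is a single point estimate; no threshold is available to sweep, which is why an ROC curve cannot be drawn from such outputs.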
Results: In the CHOP set, 517 studies reported hydronephrosis and 1,875 studies reported no hydronephrosis. LLAMA-2 achieved 90.6% accuracy with sensitivity 66.0%, specificity 97.3%, PPV 87.2%, NPV 91.2%, and F1 0.75. In the Sickkids set, 81 studies reported hydronephrosis and 29 studies reported no hydronephrosis. LLAMA-2 achieved 85.5% accuracy with sensitivity 86.4%, specificity 82.8%, PPV 93.3%, NPV 68.6%, and F1 0.90.
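The reported F1 scores can be checked against the reported PPV and sensitivity values, since F1 is their harmonic mean. The short sketch below uses only the rounded percentages given above; the helper name `f1_from_ppv_and_sensitivity` is an assumption for illustration, and small differences from the published values would be expected from rounding.

```python
def f1_from_ppv_and_sensitivity(ppv: float, sensitivity: float) -> float:
    """Harmonic mean of PPV and sensitivity, both given as fractions in [0, 1]."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# Rounded values as reported for each test set.
chop_f1 = f1_from_ppv_and_sensitivity(ppv=0.872, sensitivity=0.660)
sickkids_f1 = f1_from_ppv_and_sensitivity(ppv=0.933, sensitivity=0.864)

print(f"CHOP F1: {chop_f1:.2f}")
print(f"Sickkids F1: {sickkids_f1:.2f}")
```

Both recomputed values agree with the reported F1 scores to two decimal places.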
Conclusions: Without any additional training, LLAMA-2 demonstrated good overall performance in evaluating renal ultrasound report impressions from pediatric patients with spina bifida for the presence of hydronephrosis in two independent test sets. Our findings suggest a future clinical application for LLAMA-2 and other LLMs in automatically and rapidly extracting information from medical text records, which could reduce the need for tedious manual chart review and enhance the efficiency of research as well as clinical care.