BACKGROUND- Hydronephrosis (HN) is one of the most frequently detected abnormalities on prenatal ultrasound (U/S), affecting approximately 1% of pregnancies and persisting after birth in roughly 50% of cases. U/S is the predominant modality for both diagnosing and monitoring the condition. However, U/S reports often contain imprecise language and ambiguous terminology regarding the presence and severity of HN, which complicates interpretation of the report's conclusion and demands a more nuanced approach to assessment and diagnosis. Consequently, this study evaluates how effectively natural language processing (NLP) with a fine-tuned transformer model can grade HN from unstructured U/S reports. Transformer models have demonstrated exceptional performance across a range of language understanding tasks. In this study, we fine-tune such a model on clinical U/S reports to grade the severity of HN. We hypothesize that the fine-tuned transformer model will perform comparably or superiorly to expert urologist interpretations of U/S reports.
METHODS- A total of 213,243 kidney and abdominal U/S reports from pediatric patients were obtained from a 207-facility healthcare system. From this dataset, a subset of n=1,412 reports was annotated for the presence and severity of HN, categorized as none (n=391), mild (n=379), moderate (n=326), or severe (n=316) in accordance with established guidelines. All n=1,412 reports were dually annotated by two pediatric urologists, and discrepancies were resolved by a third pediatric urologist. A Bidirectional Encoder Representations from Transformers (BERT) model, first pre-trained on the remaining n=211,831 unannotated U/S reports, was then fine-tuned on the annotated reports to predict HN grade. Severity classification focused on the interpretation sections of the U/S reports and used an 80/20 training/test split.
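The following is a minimal sketch of the fine-tuning step described above, assuming a Hugging Face Transformers workflow. The checkpoint name, file name, column names, and hyperparameters are illustrative placeholders, not the values used in the study; in particular, the study's model was further pre-trained on its own corpus of U/S reports, whereas a generic clinical BERT checkpoint stands in here.

```python
# Hypothetical sketch: fine-tuning a BERT classifier to grade hydronephrosis
# (none / mild / moderate / severe) from U/S report interpretation text.
import pandas as pd
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["none", "mild", "moderate", "severe"]          # HN grades
label2id = {l: i for i, l in enumerate(LABELS)}

# Hypothetical annotated file with the interpretation text and assigned grade.
df = pd.read_csv("annotated_us_reports.csv")             # columns: text, grade
df["label"] = df["grade"].map(label2id)

# 80/20 training/test split, stratified by grade.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42)

# Placeholder checkpoint; the study used a BERT model pre-trained on ~212k reports.
checkpoint = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=len(LABELS),
    id2label={i: l for l, i in label2id.items()}, label2id=label2id)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=512)

train_ds = Dataset.from_pandas(train_df[["text", "label"]]).map(tokenize, batched=True)
test_ds = Dataset.from_pandas(test_df[["text", "label"]]).map(tokenize, batched=True)

args = TrainingArguments(output_dir="hn_bert", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=test_ds)
trainer.train()
predictions = trainer.predict(test_ds)                    # logits for held-out 20%
```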
RESULTS- The performance of the fine-tuned BERT model in classifying HN grade, from none to severe, is summarized in Table 1. The model achieved at least 78% sensitivity and at least 92% specificity across all grades, correctly identifying both positive and negative cases. Positive predictive value, negative predictive value, F1 score, and accuracy were likewise consistent across grades, indicating robust performance in classifying HN severity.
CONCLUSION- NLP with fine-tuned transformer models is an effective tool for extracting pertinent clinical information from unstructured U/S reports, particularly for identifying HN grade. Our findings show that these models perform comparably to expert urologists in determining HN grade from U/S reports, highlighting their potential as valuable aids in clinical decision-making.
Table 1: Weighted BERT model performance characteristics by hydronephrosis grade

| Performance | None | Mild | Moderate | Severe |
|---|---|---|---|---|
| Sensitivity | 0.92 | 0.92 | 0.78 | 0.79 |
| Specificity | 0.99 | 0.93 | 0.94 | 0.93 |
| Positive Predictive Value | 0.96 | 0.81 | 0.78 | 0.77 |
| Negative Predictive Value | 0.97 | 0.97 | 0.94 | 0.94 |
| F1 Score | 0.94 | 0.86 | 0.78 | 0.78 |
| Accuracy | 0.97 | 0.92 | 0.90 | 0.90 |
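As a supplement to Table 1, the sketch below shows how per-grade sensitivity, specificity, positive and negative predictive values, F1 score, and accuracy can be computed from test-set predictions in a one-vs-rest fashion. The `y_true` and `y_pred` arrays are placeholders standing in for the held-out labels and model predictions; they are not the study's data.

```python
# Hypothetical sketch: deriving per-grade (one-vs-rest) metrics like those in
# Table 1 from a confusion matrix of test-set predictions.
import numpy as np
from sklearn.metrics import confusion_matrix

LABELS = ["none", "mild", "moderate", "severe"]
y_true = np.array([0, 1, 2, 3, 0, 1, 2, 3])   # placeholder ground-truth grades
y_pred = np.array([0, 1, 2, 2, 0, 1, 3, 3])   # placeholder model predictions

cm = confusion_matrix(y_true, y_pred, labels=list(range(len(LABELS))))
for i, grade in enumerate(LABELS):
    tp = cm[i, i]                              # true positives for this grade
    fn = cm[i, :].sum() - tp                   # false negatives
    fp = cm[:, i].sum() - tp                   # false positives
    tn = cm.sum() - tp - fn - fp               # true negatives
    sens = tp / (tp + fn)                      # sensitivity (recall)
    spec = tn / (tn + fp)                      # specificity
    ppv = tp / (tp + fp)                       # positive predictive value
    npv = tn / (tn + fn)                       # negative predictive value
    f1 = 2 * ppv * sens / (ppv + sens)         # F1 score
    acc = (tp + tn) / cm.sum()                 # one-vs-rest accuracy
    print(f"{grade}: sens={sens:.2f} spec={spec:.2f} ppv={ppv:.2f} "
          f"npv={npv:.2f} f1={f1:.2f} acc={acc:.2f}")
```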