Sept. 10, 2025 — According to new research published in ARRS' American Journal of Roentgenology (AJR), general-purpose large language models (LLMs) such as GPT-4 can detect and classify critical findings in radiology reports with high precision and recall when guided by carefully designed prompting strategies, highlighting their potential to support timely communication in clinical workflows.
“In our evaluation of more than 400 radiology reports, GPT-4 achieved precision of 90% and recall of 87% for true critical findings using a few-shot static prompting approach,” said first author Ish A. Talati, MD, from the department of radiology at Stanford University. “These results suggest that out-of-the-box LLMs may adapt to specialized radiology tasks with minimal data annotation, although further refinement is needed before clinical implementation.”
Talati et al.'s AJR manuscript included 252 radiology reports from the MIMIC-III database and an external test set of 180 chest radiograph reports from CheXpert Plus. Reports were manually reviewed to identify critical findings, which were categorized as true, known/expected, or equivocal. Several prompting strategies, including zero-shot, few-shot static, and few-shot dynamic, were tested with GPT-4 and Mistral-7B to optimize detection performance.
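The manuscript's actual prompts are not reproduced here; the sketch below is a minimal illustration of what "few-shot static" prompting for this task could look like, assuming the OpenAI chat completions API. The system instruction, example reports, and labels are invented for illustration, and only two of the study's five examples are shown for brevity.

```python
# Minimal sketch of few-shot static prompting for critical-finding
# classification. Prompt wording, example reports, and labels are
# illustrative assumptions, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "You classify critical findings in radiology reports. "
    "For each report, list any critical findings and label each as "
    "'true', 'known/expected', or 'equivocal'."
)

# A fixed ("static") set of worked examples reused for every query,
# as opposed to dynamic few-shot, where examples vary per report.
FEW_SHOT = [
    {"role": "user", "content": "Report: New large right pneumothorax."},
    {"role": "assistant", "content": "Pneumothorax - true"},
    {"role": "user", "content": "Report: Stable known left apical pneumothorax."},
    {"role": "assistant", "content": "Pneumothorax - known/expected"},
]

def classify(report_text: str) -> str:
    messages = [{"role": "system", "content": SYSTEM}]
    messages += FEW_SHOT
    messages.append({"role": "user", "content": f"Report: {report_text}"})
    response = client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        temperature=0,  # deterministic output for evaluation
    )
    return response.choices[0].message.content

print(classify("Possible free air under the right hemidiaphragm."))
```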
For true critical findings in the holdout test set, GPT-4 achieved 90.1% precision and 86.9% recall, compared to 75.6% and 77.4% for Mistral-7B. On the external test set, GPT-4 reached 82.6% precision and 98.3% recall, while Mistral-7B achieved 75.0% and 93.1%, respectively. Static few-shot prompting with five examples emerged as the most effective approach for optimizing performance.
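For readers translating these figures: precision is the fraction of model-flagged critical findings that were correct, and recall is the fraction of all true critical findings the model caught. The arithmetic below uses hypothetical counts chosen to roughly match GPT-4's holdout numbers; the study's actual confusion-matrix cells are not reported here.

```python
# Precision and recall from raw counts. The counts are hypothetical,
# chosen only to illustrate the arithmetic behind the reported metrics.
true_positives = 90   # findings flagged by the model and confirmed critical
false_positives = 10  # model flags that were not true critical findings
false_negatives = 13  # true critical findings the model missed

precision = true_positives / (true_positives + false_positives)  # 0.900
recall = true_positives / (true_positives + false_negatives)     # ~0.874

print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```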
“Effective identification of critical findings is essential for patient safety,” Talati and colleagues concluded. “While further technical development is required, these findings underscore the promise of LLMs in improving radiology workflows by augmenting communication of urgent findings.”