GPT-4, Google Gemini Fall Short in Breast Imaging Classification

Use of publicly available large language models (LLMs) resulted in changes in the classification of breast imaging reports that could have a negative effect on patient management

April 30, 2024 — Use of publicly available large language models (LLMs) resulted in changes in the classification of breast imaging reports that could have a negative effect on patient management, according to a new international study published today in Radiology, a journal of the Radiological Society of North America (RSNA). The study findings underscore the need to regulate these LLMs in scenarios that require high-level medical reasoning, researchers said.

LLMs are a type of artificial intelligence (AI) widely used today for many purposes. In radiology, LLMs have already been tested on a broad range of clinical tasks, from processing radiology request forms to providing imaging recommendations and diagnosis support.

Publicly available generic LLMs like ChatGPT (GPT-3.5 and GPT-4) and Google Gemini (formerly Bard) have shown promising results in some tasks. Importantly, however, they are less successful at more complex tasks requiring a higher level of reasoning and deeper clinical knowledge, such as providing imaging recommendations. Users seeking medical advice may not always understand the limitations of these generic programs, which are not trained specifically for medical tasks.

“Evaluating the abilities of generic LLMs remains important as these tools are the most readily available and may unjustifiably be used by both patients and non-radiologist physicians seeking a second opinion,” said study co-lead author Andrea Cozzi, M.D., Ph.D., radiology resident and post-doctoral research fellow at the Imaging Institute of Southern Switzerland, Ente Ospedaliero Cantonale, in Lugano, Switzerland.

Dr. Cozzi and colleagues set out to test the generic LLMs on a task that is part of daily clinical routine but demands deep medical reasoning, and where the use of languages other than English would further stress the LLMs' capabilities. They focused on the agreement between human readers and LLMs for the assignment of Breast Imaging Reporting and Data System (BI-RADS) categories, a widely used system to describe and classify breast lesions.

The Swiss researchers partnered with an American team from Memorial Sloan Kettering Cancer Center in New York City and a Dutch team at the Netherlands Cancer Institute in Amsterdam.

The study included BI-RADS classifications of 2,400 breast imaging reports written in English, Italian and Dutch. Three LLMs—GPT-3.5, GPT-4 and Google Bard (now renamed Google Gemini)—assigned BI-RADS categories using only the findings described by the original radiologists. The researchers then compared the performance of the LLMs with that of board-certified breast radiologists.

The agreement for BI-RADS category assignments between human readers was almost perfect. However, the agreement between humans and the LLMs was only moderate. Most importantly, the researchers also observed a high percentage of discordant category assignments that would result in negative changes in patient management. This raises several concerns about the potential consequences of placing too much reliance on these widely available LLMs.
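
For readers who want to see how this kind of comparison can be quantified, the following is a minimal sketch, not the study's actual analysis. The article does not name the agreement statistic the researchers used, so Cohen's kappa is swapped in here purely as a familiar chance-corrected measure. The function names, the grouping of BI-RADS categories by typical next step, and the eight example category assignments are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# ASSUMPTION: a simplified grouping of BI-RADS categories by typical next step.
# The study's own definition of a management change may differ.
MANAGEMENT = {
    0: "additional imaging",
    1: "routine screening",
    2: "routine screening",
    3: "short-interval follow-up",
    4: "biopsy",
    5: "biopsy",
    6: "treatment of known cancer",
}


def _check_lengths(reader_a, reader_b):
    """Both readers must have rated the same, non-empty set of reports."""
    if len(reader_a) != len(reader_b) or len(reader_a) == 0:
        raise ValueError("Both readers must rate the same non-empty set of reports")


def cohen_kappa(reader_a, reader_b):
    """Chance-corrected agreement between two readers (Cohen's kappa)."""
    _check_lengths(reader_a, reader_b)
    n = len(reader_a)
    # Proportion of reports on which the two readers chose the same category.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Agreement expected by chance, from each reader's category frequencies.
    count_a, count_b = Counter(reader_a), Counter(reader_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both readers used one identical category throughout
    return (observed - expected) / (1.0 - expected)


def management_change_rate(reference, candidate):
    """Share of reports where the candidate's category implies a different
    management group than the reference reader's category."""
    _check_lengths(reference, candidate)
    changed = sum(
        MANAGEMENT[r] != MANAGEMENT[c] for r, c in zip(reference, candidate)
    )
    return changed / len(reference)


# Made-up BI-RADS categories for eight reports (illustration only, not study data).
radiologist = [1, 2, 2, 3, 4, 4, 5, 2]
llm = [1, 2, 3, 3, 4, 2, 5, 1]

kappa = cohen_kappa(radiologist, llm)
rate = management_change_rate(radiologist, llm)
print(f"kappa = {kappa:.2f}, management changes = {rate:.0%}")
```

Note that this sketch counts any change of management group, in either direction. It does not distinguish the changes that would harm patients, which is the distinction the researchers emphasize.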

According to Dr. Cozzi, the results highlight the need for regulation of LLMs when it is highly likely that users will ask them health-care-related questions of varying depth and complexity.

“The results of this study add to the growing body of evidence that reminds us of the need to carefully understand and highlight the pros and cons of LLM use in health care,” he said. “These programs can be a wonderful tool for many tasks but should be used wisely. Patients need to be aware of the intrinsic shortcomings of these tools, and that they may receive incomplete or even utterly wrong replies to complex questions.”

The Swiss researchers were supervised by the co-senior author Simone Schiaffino, M.D. The American team was led by the co-first author Katja Pinker, M.D., Ph.D., and the Dutch team was led by the co-senior author Ritse M. Mann, M.D., Ph.D.

For more information: www.rsna.org
