News | Artificial Intelligence | November 07, 2018

Artificial Intelligence May Fall Short Analyzing Data Across Multiple Health Systems

Study shows deep learning models must be carefully tested across multiple environments before being put into clinical practice

November 7, 2018 — Artificial intelligence (AI) tools trained to detect pneumonia on chest X-rays suffered significant decreases in performance when tested on data from outside health systems, according to a new study. The study, conducted at the Icahn School of Medicine at Mount Sinai, was published in a special issue of PLOS Medicine on machine learning and healthcare.1 These findings suggest that artificial intelligence in the medical space must be carefully tested for performance across a wide range of populations; otherwise, the deep learning models may not perform as accurately as expected.  

Interest is growing in the use of convolutional neural networks (CNNs), a class of deep learning models, to analyze medical imaging and provide computer-aided diagnosis. However, recent studies suggest that AI image classifiers may not generalize to new data as well as commonly portrayed.
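
To illustrate the kind of model involved, the sketch below shows how an ImageNet-pretrained CNN can be adapted in PyTorch for binary pneumonia detection on chest X-rays. This is a minimal, hypothetical example, not the study's exact pipeline; the DenseNet-121 backbone, input size and preprocessing choices are assumptions.

```python
# Minimal sketch (not the authors' exact pipeline) of adapting a pretrained
# CNN for binary pneumonia detection on chest radiographs.
import torch
import torch.nn as nn
from torchvision import models, transforms

# ImageNet-style preprocessing; chest X-rays are grayscale, so the single
# channel is replicated to three channels here (an assumption, for the
# pretrained backbone's expected input).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Start from an ImageNet-pretrained backbone and replace the classifier head
# with a single logit for "pneumonia vs. no pneumonia".
model = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
model.classifier = nn.Linear(model.classifier.in_features, 1)

criterion = nn.BCEWithLogitsLoss()          # binary cross-entropy on the raw logit
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```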

Researchers at the Icahn School of Medicine at Mount Sinai assessed how well AI models identified pneumonia in 158,000 chest X-rays drawn from three medical institutions: the National Institutes of Health, The Mount Sinai Hospital and Indiana University Hospital. The researchers chose to study the diagnosis of pneumonia on chest X-rays because of its common occurrence, clinical significance and prevalence in the research community.
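
The core comparison described above amounts to scoring a trained model on held-out X-rays from its own health system and on X-rays from external systems, then comparing discrimination (area under the ROC curve) per site. The following sketch is a hypothetical illustration with placeholder data, not the study's code; the site names and score arrays are invented for demonstration.

```python
# Minimal sketch of a cross-institution evaluation: compare AUC on internal
# vs. external test sets. All data below are synthetic placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_by_site(scores_by_site, labels_by_site):
    """Return the area under the ROC curve for each site separately."""
    return {site: roc_auc_score(labels_by_site[site], scores_by_site[site])
            for site in scores_by_site}

# Hypothetical model outputs (probabilities) and ground-truth labels per site.
rng = np.random.default_rng(0)
labels = {"internal": rng.integers(0, 2, 500), "external": rng.integers(0, 2, 500)}
scores = {"internal": rng.random(500), "external": rng.random(500)}

for site, auc in auc_by_site(scores, labels).items():
    print(f"{site}: AUC = {auc:.3f}")
```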

In three out of five comparisons, the CNNs’ performance in diagnosing disease on X-rays from hospitals outside their own network was significantly lower than on X-rays from the original health system. The CNNs were, however, able to detect the hospital system where an X-ray was acquired with a high degree of accuracy, and effectively cheated at their predictive task by relying on the prevalence of pneumonia at the training institution. The researchers found that a key difficulty of using deep learning models in medicine is that they rely on a massive number of parameters, making it challenging to identify the specific variables driving predictions, such as the types of computed tomography (CT) scanners used at a hospital and the resolution quality of the imaging.
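
One way to probe the confound the researchers describe, offered here as a hedged sketch rather than the study's procedure, is to check how easily a simple classifier can predict the source hospital from image-derived features: if the site is easy to recover, a diagnostic model can exploit site-specific pneumonia prevalence instead of genuine disease signal. The feature-extraction step and helper function below are hypothetical.

```python
# Sketch of a "site detectability" probe: if the source hospital is easy to
# predict from the images, diagnostic models can learn that shortcut.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def site_detectability(features, site_labels):
    """Cross-validated accuracy of predicting the source site from features.

    `features` is an (n_images, n_features) array, e.g. pooled CNN
    activations or downsampled raw pixels; `site_labels` encodes which
    health system each image came from.
    """
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, site_labels, cv=5).mean()

# Accuracy far above chance would indicate the images carry a strong
# "hospital signature" (scanner type, processing, burned-in markers) that a
# deep model could latch onto instead of true disease features.
```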

“Our findings should give pause to those considering rapid deployment of artificial intelligence platforms without rigorously assessing their performance in real-world clinical settings reflective of where they are being deployed,” said senior author Eric Oermann, M.D., instructor in neurosurgery at the Icahn School of Medicine at Mount Sinai. “Deep learning models trained to perform medical diagnosis can generalize well, but this cannot be taken for granted since patient populations and imaging techniques differ significantly across institutions.”

“If CNN systems are to be used for medical diagnosis, they must be tailored to carefully consider clinical questions, tested for a variety of real-world scenarios and carefully assessed to determine how they impact accurate diagnosis,” said first author John Zech, a medical student at the Icahn School of Medicine at Mount Sinai.

This research builds on papers published earlier this year in the journals Radiology and Nature Medicine, which laid the groundwork for applying computer vision and deep learning techniques, including natural language processing algorithms, to identify clinical concepts in radiology reports for CT scans.

Listen to the PODCAST: Radiologists Must Understand AI To Know If It Is Wrong

For more information: www.journals.plos.org/plosmedicine

Reference

1. Zech J.R., Badgeley M.A., Liu M., et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine, Nov. 6, 2018. https://doi.org/10.1371/journal.pmed.1002683
