Artificial intelligence matches subjective severity assessment of pneumonia for prediction of patient outcome and need for mechanical ventilation – a cohort study

Abstract

To compare the performance of artificial intelligence (AI) and Radiographic Assessment of Lung Edema (RALE) scores from frontal chest radiographs (CXRs) for predicting patient outcomes and the need for mechanical ventilation in COVID-19 pneumonia. Our IRB-approved study included 1367 serial CXRs from 405 adult patients (mean age 65 ± 16 years) from two sites in the US (Site A) and South Korea (Site B). We recorded information pertaining to patient demographics (age, gender), smoking history, comorbid conditions (such as cancer, cardiovascular and other diseases), vital signs (temperature, oxygen saturation), and available laboratory data (such as WBC count and CRP). Two thoracic radiologists performed the qualitative assessment of all CXRs based on the RALE score for assessing the severity of lung involvement. All CXRs were processed with a commercial AI algorithm to obtain the percentage of the lung affected with findings related to COVID-19 (AI score). Independent t- and chi-square tests were used in addition to multiple logistic regression with Area Under the Curve (AUC) as output for predicting disease outcome and the need for mechanical ventilation. The RALE and AI scores had a strong positive correlation in CXRs from each site (r2 = 0.79–0.86; p < 0.0001). Patients who died or received mechanical ventilation had significantly higher RALE and AI scores than those who recovered or did not need mechanical ventilation (p < 0.001). Patients with a larger difference between baseline and maximum RALE and AI scores had a higher prevalence of death and mechanical ventilation (p < 0.001). The addition of patients’ age, gender, WBC count, and peripheral oxygen saturation increased the AUC for outcome prediction from 0.87 to 0.94 (95% CI 0.90–0.97) for RALE scores and from 0.82 to 0.91 (95% CI 0.87–0.95) for the AI scores.
The AI algorithm is as robust a predictor of adverse patient outcome (death or need for mechanical ventilation) as subjective RALE scores in patients with COVID-19 pneumonia.
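The modeling approach the abstract describes (a severity score plus clinical covariates fed into multiple logistic regression, summarized by AUC) can be sketched as follows. This is a minimal illustration on synthetic data; the variable names (ai_score, age, spo2, wbc) and coefficients are hypothetical stand-ins, not the study's model.

```python
# Illustrative sketch: logistic regression predicting an adverse outcome
# from a radiographic severity score, alone and with clinical covariates,
# with AUC as the summary metric. All data below are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 400
ai_score = rng.uniform(0, 100, n)   # % of lung involved (AI score)
age = rng.normal(65, 16, n)         # years
spo2 = rng.normal(94, 4, n)         # peripheral O2 saturation, %
wbc = rng.normal(9, 3, n)           # WBC count, 10^9/L

# Synthetic outcome: higher severity, older age, lower SpO2 -> worse outcome
logit = 0.05 * ai_score + 0.03 * (age - 65) - 0.2 * (spo2 - 94) - 4
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Score-only model versus score + clinical covariates
X_score = ai_score.reshape(-1, 1)
X_full = np.column_stack([ai_score, age, spo2, wbc])

m1 = LogisticRegression(max_iter=5000).fit(X_score, y)
m2 = LogisticRegression(max_iter=5000).fit(X_full, y)
auc_score = roc_auc_score(y, m1.predict_proba(X_score)[:, 1])
auc_full = roc_auc_score(y, m2.predict_proba(X_full)[:, 1])
print(f"AUC score only: {auc_score:.2f}, with covariates: {auc_full:.2f}")
```

As in the abstract, adding clinical covariates to the severity score typically raises the model's discrimination.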

Automated Lateral Ventricular and Cranial Vault Volume Measurements in 13,851 Subjects Utilizing Deep Learning Algorithms

Background

Currently, no large dataset-derived standard has been established for normal or pathologic human cerebral ventricular and cranial vault volumes. Automated volumetric measurements could be used to assist in diagnosis and follow-up of hydrocephalus or craniofacial syndromes. In this work we use deep learning algorithms to measure ventricular and cranial vault volumes in a large dataset of head computed tomography (CT) scans.

Methods

A cross-sectional dataset comprising 13,851 CT scans was utilized to deploy U-net deep learning networks to segment and quantify lateral cerebral ventricular and cranial vault volumes in relation to age and sex. The models were validated against manual segmentations. Corresponding radiological reports were annotated using a rule-based natural language processing (NLP) framework to identify normal scans, cerebral atrophy, or hydrocephalus.

Results

U-net models had high fidelity to manual segmentations for lateral ventricular and cranial vault volume measurements (DICE 0.878 and 0.983, respectively). The NLP identified 6,239 (44.7%) normal radiological reports, 1,827 (13.1%) with cerebral atrophy, and 1,185 (8.5%) with hydrocephalus. Age- and sex-based reference tables with medians and 25th and 75th percentiles for scans classified as normal, atrophy, and hydrocephalus were constructed. The median lateral ventricular volume in normal scans was significantly smaller than in hydrocephalus (15.7 mL vs 82.0 mL, P < 0.001).
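The DICE metric used above to validate the segmentations is straightforward to compute. A minimal sketch on tiny synthetic binary masks (not the study's data):

```python
# Minimal sketch of the Dice similarity coefficient (DICE) used to compare
# an automated segmentation mask against a manual one.
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """DICE = 2|A n B| / (|A| + |B|) for binary masks A, B."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return 2.0 * np.logical_and(pred, truth).sum() / denom

manual = np.array([[0, 1, 1], [0, 1, 1], [0, 0, 0]])
auto   = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]])
print(f"DICE = {dice(auto, manual):.3f}")  # 2*3 / (3 + 4) ≈ 0.857
```

A DICE near 1, as reported for the cranial vault (0.983), indicates near-complete overlap between automated and manual masks.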

Conclusion

This is the first study to measure lateral ventricular and cranial vault volumes in a large dataset, made possible with artificial intelligence. We provide a robust method to establish normal values for these volumes and a tool to report these on CT scans when evaluating for hydrocephalus.

Chest x-ray analysis with deep learning-based software as a triage test for pulmonary tuberculosis – a prospective study of diagnostic accuracy for culture-confirmed disease

Background

Deep learning-based radiological image analysis could facilitate use of chest x-rays as triage tests for pulmonary tuberculosis in resource-limited settings. We sought to determine whether commercially available chest x-ray analysis software meet WHO recommendations for minimal sensitivity and specificity as pulmonary tuberculosis triage tests.

Methods

We recruited symptomatic adults at the Indus Hospital, Karachi, Pakistan. We compared two software, qXR version 2.0 (qXRv2) and CAD4TB version 6.0 (CAD4TBv6), with a reference of mycobacterial culture of two sputa. We assessed qXRv2 using its manufacturer prespecified threshold score for chest x-ray classification as tuberculosis present versus not present. For CAD4TBv6, we used a data-derived threshold, because it does not have a prespecified one. We tested for non-inferiority to preset WHO recommendations (0·90 for sensitivity, 0·70 for specificity) using a non-inferiority limit of 0·05. We identified factors associated with accuracy by stratification and logistic regression.
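The non-inferiority test described above, comparing an observed sensitivity or specificity against a fixed WHO benchmark minus a 0.05 margin, can be sketched as a one-sided test of a proportion. The counts below are illustrative (chosen to roughly mirror the abstract's sensitivity result), not the study's exact analysis.

```python
# Sketch of a one-sided non-inferiority test of a proportion:
# H0: p <= benchmark - margin   vs   H1: p > benchmark - margin.
from math import sqrt
from statistics import NormalDist

def noninferiority_p(successes: int, n: int, benchmark: float, margin: float) -> float:
    """One-sided z-test of an observed proportion against (benchmark - margin)."""
    p0 = benchmark - margin
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)          # SE under the null proportion
    z = (p_hat - p0) / se
    return 1 - NormalDist().cdf(z)        # p-value for H1: p > p0

# e.g. 252 of 272 culture-positive cases detected, WHO sensitivity target 0.90
p = noninferiority_p(252, 272, benchmark=0.90, margin=0.05)
print(f"sensitivity = {252/272:.3f}, non-inferiority p = {p:.4f}")
```

A small p-value here supports the claim that sensitivity is no more than 0.05 below the WHO-recommended 0.90.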

Findings

We included 2198 (92·7%) of 2370 enrolled participants. 2187 (99·5%) of 2198 were HIV-negative, and 272 (12·4%) had culture-confirmed pulmonary tuberculosis. For both software, accuracy was non-inferior to WHO-recommended minimum values (qXRv2 sensitivity 0·93 [95% CI 0·89–0·95], non-inferiority p=0·0002; CAD4TBv6 sensitivity 0·93 [95% CI 0·90–0·96], p<0·0001; qXRv2 specificity 0·75 [95% CI 0·73–0·77], p<0·0001; CAD4TBv6 specificity 0·69 [95% CI 0·67–0·71], p=0·0003). Sensitivity was lower in smear-negative pulmonary tuberculosis for both software, and in women for CAD4TBv6. Specificity was lower in men and in those with previous tuberculosis, and reduced with increasing age and decreasing body mass index. Smoking and diabetes did not affect accuracy.

Interpretation

In an HIV-negative population, these software met WHO-recommended minimal accuracy for pulmonary tuberculosis triage tests. Sensitivity will be lower when smear-negative pulmonary tuberculosis is more prevalent.

Initial chest radiographs and artificial intelligence (AI) predict clinical outcomes in COVID-19 patients – analysis of 697 Italian patients

Objective

To evaluate whether the initial chest X-ray (CXR) severity assessed by an AI system may have prognostic utility in patients with COVID-19.

Methods

This retrospective single-center study included adult patients presenting to the emergency department (ED) between February 25 and April 9, 2020, with SARS-CoV-2 infection confirmed on real-time reverse transcriptase polymerase chain reaction (RT-PCR). Initial CXRs obtained on ED presentation were evaluated by a deep learning artificial intelligence (AI) system and compared with the Radiographic Assessment of Lung Edema (RALE) score, calculated by two experienced radiologists. Death and critical COVID-19 (admission to intensive care unit (ICU) or deaths occurring before ICU admission) were identified as clinical outcomes. Independent predictors of adverse outcomes were evaluated by multivariate analyses.

Results

Six hundred ninety-seven (697) patients were included in the study: 465 males (66.7%), median age 62 years (IQR 52–75). Multivariate analyses adjusting for demographics and comorbidities showed that an AI system-based score ≥ 30 on the initial CXR was an independent predictor of both mortality (HR 2.60; 95% CI 1.69–3.99; p < 0.001) and critical COVID-19 (HR 3.40; 95% CI 2.35–4.94; p < 0.001). Other independent predictors were the RALE score, older age, male sex, coronary artery disease, COPD, and neurodegenerative disease.

Conclusion

AI- and radiologist-assessed disease severity scores on CXRs obtained on ED presentation were independent and comparable predictors of adverse outcomes in patients with COVID-19.

Can artificial intelligence (AI) be used to accurately detect tuberculosis (TB) from chest x-ray? A multiplatform evaluation of five AI products used for TB screening in a high TB-burden setting.

Powered by artificial intelligence (AI), particularly deep neural networks, computer-aided detection (CAD) tools can be trained to recognize TB-related abnormalities on chest radiographs, thereby screening large numbers of people and reducing the pressure on healthcare professionals. Addressing the lack of studies comparing the performance of different products, we evaluated five AI software platforms specific to TB: CAD4TB (v6), InferReadDR (v2), Lunit INSIGHT for Chest Radiography (v4.9.0), JF CXR-1 (v2), and qXR (v3), on an unseen dataset of chest X-rays collected in three TB screening centers in Dhaka, Bangladesh. All 23,566 individuals included in the study received a CXR read by a group of three Bangladeshi board-certified radiologists. A sample of CXRs was re-read by US board-certified radiologists. Xpert was used as the reference standard. All five AI platforms significantly outperformed the human readers. The areas under the receiver operating characteristic curves were: qXR 0.91 (95% CI: 0.90–0.91), Lunit INSIGHT CXR 0.89 (95% CI: 0.88–0.89), InferReadDR 0.85 (95% CI: 0.84–0.86), JF CXR-1 0.85 (95% CI: 0.84–0.85), and CAD4TB 0.82 (95% CI: 0.81–0.83). We also proposed a new analytical framework that evaluates a screening and triage test and informs threshold selection through the trade-off between cost efficiency and triage ability. Further, we assessed the performance of the five AI algorithms across subgroups defined by age, use case, and prior TB history, and found that the threshold scores performed differently across subgroups. The positive results of our evaluation indicate that these AI products can be useful screening and triage tools for active case finding in high TB-burden regions.

Performance of Qure.ai automatic classifiers against a large annotated database of patients with diverse forms of tuberculosis

The availability of trained radiologists for fast processing of CXRs in regions burdened with tuberculosis has always been a challenge, affecting both timely diagnosis and patient monitoring. The paucity of annotated images of the lungs of TB patients hampers attempts to apply data-oriented algorithms in research and clinical practice. The TB Portals Program database (TBPP, https://TBPortals.niaid.nih.gov) is a global collaboration curating a large collection of the most dangerous, hard-to-cure drug-resistant tuberculosis (DR-TB) patient cases. With 1,179 (83%) DR-TB patient cases, TBPP is a unique collection that is well positioned as a testing ground for deep learning classifiers. As of January 2019, the TBPP database contained 1,538 CXRs, of which 346 (22.5%) were annotated by a radiologist and 104 (6.7%) by a pulmonologist, leaving 1,088 (70.7%) CXRs without annotations. The Qure.ai qXR artificial intelligence automated CXR interpretation tool was blind-tested on the 346 radiologist-annotated CXRs from the TBPP database. The qXR predictions for cavity, nodule, pleural effusion, and hilar lymphadenopathy successfully matched human expert annotations. In addition, we tested the 12 Qure.ai classifiers to determine whether they correlate with treatment success (information provided by treating physicians). Ten descriptors were found to be significant: abnormal CXR (p = 0.0005), pleural effusion (p = 0.048), nodule (p = 0.0004), hilar lymphadenopathy (p = 0.0038), cavity (p = 0.0002), opacity (p = 0.0006), atelectasis (p = 0.0074), consolidation (p = 0.0004), indicator of TB disease (p < 0.0001), and fibrosis (p < 0.0001). We conclude that the fully automated Qure.ai CXR analysis tool is useful for fast, accurate, uniform, large-scale CXR annotation assistance, as it performed well even for DR-TB cases that were not used for initial training.
Testing artificial intelligence algorithms (encapsulating both machine learning and deep learning classifiers) on diverse data collections, such as TBPP, is critically important toward progressing to clinically adopted automatic assistants for medical data analysis.

Deep learning, computer-aided radiography reading for tuberculosis – A diagnostic accuracy study from a tertiary hospital in India

In general, chest radiographs (CXR) have high sensitivity and moderate specificity for active pulmonary tuberculosis (PTB) screening when interpreted by human readers. However, they are challenging to scale due to hardware costs and the dearth of professionals available to interpret CXR in low-resource, high PTB burden settings. Recently, several computer-aided detection (CAD) programs have been developed to facilitate automated CXR interpretation. We conducted a retrospective case-control study to assess the diagnostic accuracy of a CAD software (qXR, Qure.ai, Mumbai, India) using microbiologically-confirmed PTB as the reference standard. To assess the overall accuracy of qXR, receiver operating characteristic (ROC) analysis was used to determine the area under the curve (AUC), along with 95% confidence intervals (CI). Kappa coefficients and associated 95% CIs were used to investigate inter-rater reliability of the radiologists for detection of specific chest abnormalities. In total, 317 cases and 612 controls were included in the analysis. The AUC for qXR for the detection of microbiologically-confirmed PTB was 0.81 (95% CI: 0.78, 0.84). Using the threshold that maximized sensitivity and specificity of qXR simultaneously, the software achieved a sensitivity and specificity of 71% (95% CI: 66%, 76%) and 80% (95% CI: 77%, 83%), respectively. The sensitivity and specificity of radiologists for the detection of microbiologically-confirmed PTB was 56% (95% CI: 50%, 62%) and 80% (95% CI: 77%, 83%), respectively. For the detection of the key PTB-related abnormalities 'pleural effusion' and 'cavity', qXR achieved an AUC of 0.94 (95% CI: 0.92, 0.96) and 0.84 (95% CI: 0.82, 0.87), respectively. For the other abnormalities, the AUC ranged from 0.75 (95% CI: 0.70, 0.80) to 0.94 (95% CI: 0.91, 0.96). The controls had a high prevalence of other lung diseases that can cause radiological manifestations similar to PTB (e.g., 26% had pneumonia and 15% had lung malignancy).
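Choosing the operating point that maximizes sensitivity and specificity simultaneously, as described above, is commonly done by maximizing Youden's J (sensitivity + specificity − 1) over the ROC curve. A sketch on synthetic scores (not the study's data):

```python
# Sketch of threshold selection via Youden's J on a ROC curve.
# Scores and labels below are synthetic, not study data.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)
# Synthetic abnormality scores: cases score higher than controls on average
scores = np.concatenate([rng.normal(0.7, 0.15, 300), rng.normal(0.4, 0.15, 600)])
labels = np.concatenate([np.ones(300), np.zeros(600)])

fpr, tpr, thresholds = roc_curve(labels, scores)
j = tpr - fpr                      # Youden's J at each candidate threshold
best = int(np.argmax(j))
print(f"AUC = {roc_auc_score(labels, scores):.2f}")
print(f"best threshold = {thresholds[best]:.2f}, "
      f"sensitivity = {tpr[best]:.2f}, specificity = {1 - fpr[best]:.2f}")
```

In triage use, a program would typically shift this threshold toward higher sensitivity, accepting more false positives in exchange for fewer missed cases.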
In a tertiary hospital in India, qXR demonstrated moderate sensitivity and specificity for the detection of PTB. There is likely a larger role for CAD software as a triage test for PTB at the primary care level in settings where access to radiologists is limited. Larger prospective studies that can better assess heterogeneity in important subgroups are needed.

Using artificial intelligence to read chest radiographs for tuberculosis detection – A multi-site evaluation of the diagnostic accuracy of three deep learning systems

Deep learning (DL) neural networks have only recently been employed to interpret chest radiographs (CXRs) to screen and triage people for pulmonary tuberculosis (TB). No published studies have compared multiple DL systems and populations. We conducted a retrospective evaluation of three DL systems (CAD4TB, Lunit INSIGHT, and qXR) for detecting TB-associated abnormalities in chest radiographs from outpatients in Nepal and Cameroon. All 1196 individuals received an Xpert MTB/RIF assay and a CXR read by two groups of radiologists and the DL systems. Xpert was used as the reference standard. The area under the curve of the three systems was similar: Lunit (0.94, 95% CI: 0.93–0.96), qXR (0.94, 95% CI: 0.92–0.97), and CAD4TB (0.92, 95% CI: 0.90–0.95). When matched to the sensitivity of the radiologists, the specificities of the DL systems were significantly higher, except in one case. Using DL systems to read CXRs could reduce the number of Xpert MTB/RIF tests needed by 66% while maintaining sensitivity at 95% or better. Using a universal cutoff score resulted in different performance at each site, highlighting the need to select scores based on the population screened. These DL systems should be considered by TB programs where human resources are constrained and automated technology is available.

Deep learning algorithms for detection of critical findings in head CT scans – A retrospective study

Background

Non-contrast head CT scan is the current standard for initial imaging of patients with head trauma or stroke symptoms. We aimed to develop and validate a set of deep learning algorithms for automated detection of the following key findings from these scans: intracranial haemorrhage and its types (ie, intraparenchymal, intraventricular, subdural, extradural, and subarachnoid); calvarial fractures; midline shift; and mass effect.

Methods

We retrospectively collected a dataset containing 313 318 head CT scans together with their clinical reports from around 20 centres in India between Jan 1, 2011, and June 1, 2017. A randomly selected part of this dataset (Qure25k dataset) was used for validation and the rest was used to develop algorithms. An additional validation dataset (CQ500 dataset) was collected in two batches from centres that were different from those used for the development and Qure25k datasets. We excluded postoperative scans and scans of patients younger than 7 years. The original clinical radiology report and consensus of three independent radiologists were considered as gold standard for the Qure25k and CQ500 datasets, respectively. Areas under the receiver operating characteristic curves (AUCs) were primarily used to assess the algorithms.

Findings

The Qure25k dataset contained 21 095 scans (mean age 43 years; 9030 [43%] female patients), and the CQ500 dataset consisted of 214 scans in the first batch (mean age 43 years; 94 [44%] female patients) and 277 scans in the second batch (mean age 52 years; 84 [30%] female patients). On the Qure25k dataset, the algorithms achieved an AUC of 0·92 (95% CI 0·91–0·93) for detecting intracranial haemorrhage (0·90 [0·89–0·91] for intraparenchymal, 0·96 [0·94–0·97] for intraventricular, 0·92 [0·90–0·93] for subdural, 0·93 [0·91–0·95] for extradural, and 0·90 [0·89–0·92] for subarachnoid). On the CQ500 dataset, AUC was 0·94 (0·92–0·97) for intracranial haemorrhage (0·95 [0·93–0·98], 0·93 [0·87–1·00], 0·95 [0·91–0·99], 0·97 [0·91–1·00], and 0·96 [0·92–0·99], respectively). AUCs on the Qure25k dataset were 0·92 (0·91–0·94) for calvarial fractures, 0·93 (0·91–0·94) for midline shift, and 0·86 (0·85–0·87) for mass effect, while AUCs on the CQ500 dataset were 0·96 (0·92–1·00), 0·97 (0·94–1·00), and 0·92 (0·89–0·95), respectively.

Interpretation

Our results show that deep learning algorithms can accurately identify head CT scan abnormalities requiring urgent attention, opening up the possibility to use these algorithms to automate the triage process.