Artificial intelligence (AI) algorithms can be trained to recognise tuberculosis-related abnormalities on chest radiographs. Several AI algorithms are commercially available, yet there is little impartial evidence on how they perform relative to one another and to radiologists. We aimed to evaluate five commercial AI algorithms for tuberculosis triage using a large dataset that had not previously been used to train any AI algorithm.
Individuals aged 15 years or older who presented or were referred to three tuberculosis screening centres in Dhaka, Bangladesh, between May 15, 2014, and Oct 4, 2016, were recruited consecutively. Every participant was verbally screened for symptoms and received a digital posteroanterior chest x-ray and an Xpert MTB/RIF (Xpert) test. All chest x-rays were read independently by a group of three registered radiologists and by five commercial AI algorithms: CAD4TB (version 7), InferRead DR (version 2), Lunit INSIGHT CXR (version 4.9.0), JF CXR-1 (version 2), and qXR (version 3). We compared the performance of the AI algorithms with each other, with the radiologists, and with WHO's Target Product Profile (TPP) for triage tests (≥90% sensitivity and ≥70% specificity). We also used a new evaluation framework that jointly assesses sensitivity, the proportion of Xpert tests avoided, and the number needed to test, to inform implementers’ choice of software and of threshold abnormality scores.
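The three quantities in the evaluation framework can be illustrated for a single threshold abnormality score. The following is a minimal sketch, not the study's implementation. It assumes that individuals whose score is at or above the threshold are referred for Xpert testing, that Xpert is the reference standard, and that the number needed to test is the number of Xpert tests done per Xpert-positive individual detected. The function name, field names, and toy data are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TriageMetrics:
    sensitivity: float
    specificity: float
    xpert_avoided: float          # proportion of individuals not referred for Xpert
    number_needed_to_test: float  # Xpert tests done per Xpert-positive case found


def triage_metrics(scores: Sequence[float],
                   xpert_positive: Sequence[bool],
                   threshold: float) -> TriageMetrics:
    """Summarise triage performance at one threshold abnormality score.

    Assumption: a score >= threshold means the individual is referred for Xpert.
    """
    tp = fp = tn = fn = 0
    for score, positive in zip(scores, xpert_positive):
        referred = score >= threshold
        if referred and positive:
            tp += 1
        elif referred and not positive:
            fp += 1
        elif positive:
            fn += 1
        else:
            tn += 1

    total = tp + fp + tn + fn
    return TriageMetrics(
        sensitivity=tp / (tp + fn) if (tp + fn) else float("nan"),
        specificity=tn / (tn + fp) if (tn + fp) else float("nan"),
        xpert_avoided=(tn + fn) / total if total else float("nan"),
        number_needed_to_test=(tp + fp) / tp if tp else float("inf"),
    )


# Toy data for illustration only; these are not study results.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
xpert_positive = [True, True, False, True, False, False, False, False]

m = triage_metrics(scores, xpert_positive, threshold=0.5)
print(m)
```

Raising the threshold in this sketch avoids more Xpert tests and lowers the number needed to test, but at the cost of sensitivity, which is the trade-off an implementer would weigh when choosing a threshold.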
Chest x-rays from 23 954 individuals were included in the analysis. All five AI algorithms significantly outperformed the radiologists. The areas under the receiver operating characteristic curve were 90·81% (95% CI 90·33–91·29) for qXR, 90·34% (89·81–90·87) for CAD4TB, 88·61% (88·03–89·20) for Lunit INSIGHT CXR, 84·90% (84·27–85·54) for InferRead DR, and 84·89% (84·26–85·53) for JF CXR-1. Only qXR (74·3% specificity [95% CI 73·3–74·9]) and CAD4TB (72·9% specificity [72·3–73·5]) met the TPP at 90% sensitivity. All five AI algorithms reduced the number of Xpert tests required by 50% while maintaining a sensitivity above 90%. All AI algorithms performed worse among people older than 60 years and among people with a history of tuberculosis.
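Comparing specificity at a fixed 90% sensitivity requires choosing, for each algorithm, a threshold that reaches that sensitivity. The sketch below shows one plausible way to do this: it scans candidate thresholds from highest to lowest and stops at the first one whose sensitivity meets the target. The abstract does not state how thresholds were matched, so this procedure is an assumption, and the function names and toy data are hypothetical.

```python
from typing import Sequence, Tuple

TPP_MIN_SENSITIVITY = 0.90  # WHO TPP for triage tests, as quoted in the text
TPP_MIN_SPECIFICITY = 0.70


def specificity_at_sensitivity(scores: Sequence[float],
                               xpert_positive: Sequence[bool],
                               target_sensitivity: float = TPP_MIN_SENSITIVITY
                               ) -> Tuple[float, float, float]:
    """Return (threshold, sensitivity, specificity) for the highest threshold
    whose sensitivity is at least the target.

    Assumption: a score >= threshold counts as a positive triage result.
    The scan is quadratic in the number of individuals, which is fine for a sketch.
    """
    n_pos = sum(1 for p in xpert_positive if p)
    n_neg = len(xpert_positive) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")

    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, p in zip(scores, xpert_positive) if p and s >= threshold)
        sensitivity = tp / n_pos
        if sensitivity >= target_sensitivity:
            tn = sum(1 for s, p in zip(scores, xpert_positive) if not p and s < threshold)
            return threshold, sensitivity, tn / n_neg
    raise ValueError("target sensitivity cannot be reached")


def meets_tpp(sensitivity: float, specificity: float) -> bool:
    """Check both TPP criteria for a triage test."""
    return sensitivity >= TPP_MIN_SENSITIVITY and specificity >= TPP_MIN_SPECIFICITY


# Toy data for illustration only; these are not study results.
positive_scores = [0.99, 0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.20]
negative_scores = [0.62, 0.55, 0.50, 0.40, 0.35, 0.30, 0.25, 0.15, 0.10, 0.05]
scores = positive_scores + negative_scores
labels = [True] * len(positive_scores) + [False] * len(negative_scores)

threshold, sens, spec = specificity_at_sensitivity(scores, labels)
print(threshold, sens, spec, meets_tpp(sens, spec))
```

Fixing sensitivity first and then reading off specificity puts all algorithms on the same footing, since their raw abnormality scores are not necessarily on comparable scales.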
AI algorithms can be highly accurate and useful triage tools for tuberculosis detection in high-burden regions, and outperform human readers.