Time is Brain: AI helps cut down stroke diagnosis time in the Himalayan foothills

Stroke is a leading cause of death, and stroke care is often limited by the availability of specialized medical professionals. In this post, we describe a physician-led stroke unit model established at Baptist Christian Hospital (BCH) in Assam, India. Enabled with qER, Qure's AI-driven automated head CT interpretation tool, BCH can quickly and easily determine the next steps in treatment, and we examine the implications for clinical outcomes.

qER at a Stroke unit

Across the world, stroke is a leading cause of death, second only to ischemic heart disease. According to the World Stroke Organization (WSO), 13.7 million new strokes occur each year and there are about 80 million stroke survivors globally. In India, as per the Health of the Nation's State Report, the incidence rate is 119 to 152 per 100,000, with a case fatality rate of 19 to 42% across the country.

Catering to tea plantation workers in and around the town of Tezpur, the Baptist Christian Hospital, Tezpur (BCH) is a 130-bed secondary care hospital in the northeastern Indian state of Assam. The hospital is a unit of the Emmanuel Hospital Association, New Delhi. From humble beginnings offering basic dispensary services, it has grown to become one of the best healthcare providers in Assam and is heavily involved in academic and research work at both national and international levels.

Nestled below the Himalayas and interspersed with large tea plantations, the region's indigenous Assamese population and tea garden workers show a prevalence of hypertension, the largest single risk factor for stroke, reportedly between 33% and 60.8%. Anecdotal reports and hospital-based studies indicate a huge burden of stroke in Assam, a significant portion of which is addressed by the Baptist Hospital. A recent study showed that hemorrhagic strokes account for close to 50% of cases here, compared to only about 20% of strokes in the rest of India.

Baptist Christian Hospital

Baptist Christian Hospital, Tezpur. Source

Challenges in Stroke Care

One of the biggest obstacles in stroke care is the lack of awareness of stroke symptoms and the late arrival of patients, often at smaller peripheral hospitals that are not equipped with the necessary scanning facilities and specialists, leading to delays in effective treatment.

The doctors and nurses of the Stroke Unit at BCH, Tezpur were trained online by specialist neurologists, who in turn trained the rest of the team on a protocol that included stroke clinical assessment, monitoring of risk factors and vital parameters, and other supportive measures such as swallow assessment, in addition to starting the rehabilitation process and advising on long-term care at home. A study done at Tezpur indicated that after the Stroke Unit was established, there was a significant improvement in quality of life along with a reduction in deaths compared to the pre-Stroke Unit phase.

This is a crucial development in stroke care, especially in low- and middle-income countries (LMICs) like India, because it strengthens the smaller peripheral hospitals that lack specialists and are almost always the first stop for patients in emergencies like stroke.

Stroke pathway barriers

This representative image details the acute stroke care pathway. Source

The guidelines for management of acute ischemic stroke involve capturing a non-contrast CT (NCCT) study of the brain along with CT or MRI angiography and perfusion, followed by thrombolysis, the administration of rtPA (tissue plasminogen activator), within 4.5 hours of symptom onset. Equipped with a CT machine and teleradiology reporting, the physicians at BCH provide primary intervention for these stroke cases after a basic NCCT and may refer them to a tertiary facility, as applicable. They follow a telestroke model: in cases where thrombolysis is required, the ER doctors consult with neurologists at a more specialized center and decisions are made after sharing the NCCT images via phone-based media such as WhatsApp, while severe cases of head trauma are referred to faraway tertiary facilities for further management. Studies of a physician-based stroke unit model in Tezpur have shown an improvement in treatment outcomes.

How is Qure.ai helping BCH with stroke management?

BCH and Qure have worked closely since the onset of the COVID-19 pandemic, especially at a time when confirmatory RT-PCR kits were in limited supply. qXR, Qure's AI-aided chest X-ray solution, proved to be a beneficial addition for identifying COVID-19 suspects, especially asymptomatic ones, and for their treatment and management, beyond its role in comprehensive chest screening.

qER messages

AI at BCH

In an effort to improve the stroke management and care workflow at the Baptist hospital, qER, FDA approved and CE certified software that can detect 12 abnormalities, was deployed. The abnormalities, including five types of intracranial hemorrhage, cranial fractures, mass effect, midline shift, infarcts, hydrocephalus, and atrophy, are flagged within 1-2 minutes of the CT being taken. qER has been trained on CT scans from more than 22 different CT machine models, making it hardware agnostic. In addition to offering a pre-populated radiology report, the HIPAA-compliant qER solution is also able to label and annotate the abnormalities on the key slices.

Since qER integrates seamlessly with the site's existing technical framework, deployment of the software was completed in less than an hour, along with setting up a messaging group for the site. Soon after, within minutes of a head CT being taken, qER analyses were available in the PACS worklist, with messaging alerts sent to the physicians' and medical team's mobile phones for review.

The aim of this pilot project was to evaluate how qER could add value to a secondary care center where the responsibility for determining medical intervention falls on physicians, based on a teleradiology report that reaches them in 15-60 minutes. As is well established in stroke care, every minute saved is precious.

Baptist Christian Hospital

Physician using qER

At the outset, there were apprehensions among the medical team about the performance of the software and its efficacy in improving the workflow. However, this is what they have to say about qER after two months of operation:

“qER is good as it alerts the physicians in a busy casualty room even without having to open the workstation. We know if there are any critical issues with the patient” – Dr. Jemin Webster, a physician at Tezpur.

He goes on to explain how qER helps draw the attention of the emergency room doctors and nurses to critical cases that need intervention or, in some instances, referral. It boosts the confidence of the treating doctors in making the right judgement during clinical decision-making. It also helps direct the teleradiology team's attention to the flagged critical scans, as well as to scans of stroke cases that are within the window period for thrombolysis. Dr. Jemin also sees potential for qER in the workflow of high-volume, multi-specialty referral centers, where coordination between multiple departments is required.

The Way Ahead

A technology solution like qER can reduce the time to diagnosis in emergencies like stroke or trauma and boost the confidence of the Stroke Unit, even in the absence of specialists. The qER platform can help stroke neurologists in telestroke settings access good-quality scans even on their smartphones and guide the treating doctors on thrombolysis and further management. Scaling this technology to stroke units and mobile stroke units (MSUs) can empower peripheral hospitals to manage acute stroke, especially in LMICs.

We intend to conduct an observational time-motion study to analyze door-to-needle time with qER intervention via instant reports and phone alerts, as we work through the required approvals. Also in the pipeline is a comparison of qER reporting against the radiologist report as ground truth, along with a comparison of clinical outcomes and these parameters before and after the introduction of qER into the workflow. We also plan to extend the pilot project to Padhar Mission Hospital, Madhya Pradesh, and the Shanthibhavan Medical Center, Simdega, Jharkhand.

The Qure team is also working on creating a comprehensive stroke platform aimed at improving stroke workflows in LMICs and low-resource settings.

Smarter City: How AI is enabling Mumbai battle COVID-19

When the COVID-19 pandemic hit Mumbai, one of the most densely populated cities in the world, the Municipal Corporation of Greater Mumbai (MCGM) promptly embraced newer technologies while creatively utilising available resources. Here is a deeper dive into how the versatility of chest X-rays and artificial intelligence helped the financial capital of India in its efforts to contain this pandemic.

The COVID-19 pandemic is one of the most demanding adversities that the present generation has had to witness and endure. The highly virulent novel coronavirus has posed a challenge like no other to the most sophisticated healthcare systems the world over. Given the brisk transmission, it was only a matter of time before the virus spread to Mumbai, the busiest city of India, with a population more than 1.5 times that of New York.

The resilient Municipal Corporation of Greater Mumbai (MCGM) swiftly sprang into action, devising multiple strategies to test, isolate, and treat in an attempt to contain the pandemic and avoid significant damage. Given their availability and effectiveness, chest X-rays were identified as an excellent tool to rule in cases that needed further testing, ensuring that no suspected case was missed. Though Mumbai saw a steeper rise in cases than any other city in India, MCGM's efforts across various touchpoints in the city were augmented using Qure's AI-based X-ray interpretation tool, qXR, and the extension of its capabilities and benefits.

In the latter half of June, MCGM launched the MISSION ZERO initiative, a public-private partnership supported by the Bill & Melinda Gates Foundation, Bharatiya Jain Sanghatana (BJS), Desh Apnayen and CREDAI-MCHI. Mobile vans with qXR-enabled digital X-ray systems were stationed outside various quarantine centers in the city. Individuals identified as being at high risk of COVID-19 infection by on-site physicians from various camps were directed to these vans for further examination. Based on their clinical and radiological indications, the individuals screened were advised to proceed for isolation, RT-PCR testing, or continued isolation in the quarantine facility. Our objective was to reduce the load on the centers by continuously monitoring patients and discharging those who had recovered, making room for new patients to be admitted and ensuring optimal utilization of resources.

A patient being screened in a BJS van equipped with qXR

The approach adopted by MCGM was multi-pronged to ascertain that no step of the pandemic management process was overlooked:

  • Triaging of high-risk and vulnerable individuals and increased case detection in a mass screening setting to contain community transmission (11.4% of individuals screened)
  • Patient management in critical care units to manage mortality rates
  • Supporting the existing healthcare framework by launching the MISSION ZERO initiative and using chest X-ray based screening for optimum utilization of beds at quarantine centers

Learn more about Qure.ai qXR COVID in our detailed blog here

Triaging and Improvement in Case Finding

Kasturba Hospital and HBT Trauma Center were among the first few COVID-19 testing centers in Mumbai. However, due to the overwhelming caseload, it was essential that they triage individuals flowing into fever clinics for optimal utilization of testing kits. The two centers used conventional analog film-based X-ray machines: one for the standard OPD setting and another portable system for COVID isolation wards.

From early March, both these hospitals adopted

  1. qXR software – our AI-powered chest X-ray interpretation tool provided the COVID-19 risk score based on the condition of the patient’s lungs
  2. qTrack – our newly launched disease management platform

The qTrack mobile app is a simple, easy-to-use tool that interfaces qXR results with the user. It digitizes film-based X-rays and provides real-time interpretation using deep learning models. The X-ray technician simply clicks a picture of the X-ray against a view box via the app to receive the AI reading corresponding to the uploaded X-ray. The app is a complete workflow management tool, with provisions to register patients and capture all relevant information along with the X-ray. The attending physicians and the hospital deans were given separate access to the Qure portal so that they could instantly access AI analyses of the X-rays from their respective sites, from the convenience of their desktops or mobile phones.

qXR app in action at Kasturba Hospital

Triaging in Hotspots and Containment Zones

When the city went into lockdown along with the rest of the world as a measure to contain the spread of infection, social distancing guidelines were imposed across the globe. However, this is not a luxury that the second-most densely populated city in the world could always afford. It is not uncommon to have several families living in close quarters within various communities, easily making them high-risk areas and, soon, containment zones. With more than 50% of COVID-19 positive cases being asymptomatic, it was imperative to test aggressively, especially in densely populated areas, to identify individuals at high risk of infection so that they could be institutionally quarantined to prevent and contain community transmission.

Workflow for COVID-19 management in containment zones using qXR

The BMC van involved in mass screenings and qXR in action in the van

Patient Management in Critical Care Units

As the global situation worsened, the commercial capital of the country saw a steady rise in the number of positive cases. MCGM very creatively and promptly revived previously closed-down hospitals and converted large open grounds in the city into dedicated COVID-19 centers with their own critical patient units, all in record time. The BKC MMRDA grounds, NESCO grounds, NSCI (National Sports Club of India) Dome, and SevenHills Hospital are a few such centers.

NESCO COVID Center

The COVID-19 center at NESCO is a 3000-bed facility with 100+ ICU beds, catering primarily to patients from Mumbai's slums. With several critical patients admitted here, it was important for Dr. Neelam Andrade, the facility head, and her team to monitor patients closely, keep a check on their disease progression, and ensure that they acted quickly. qXR helped Dr. Andrade's team by providing instant automated reporting of the chest X-rays. It also captured all clinical information, enabling the center to make its process completely paperless.

The patient summary screen of qXR web portal

“Since the patients admitted here are confirmed cases, we take frequent X-rays to monitor their condition. qXR gives instant results and this has been very helpful for us to make decisions quickly for the patient on their treatment and management.”

– Dr Neelam Andrade, Dean, NESCO COVID centre

SevenHills Hospital, Andheri

Located in the heart of the city's suburbs, SevenHills Hospital was one of the first hospitals revived by MCGM as part of its COVID-19 response measures.

The center played a critical role on two accounts:

  1. Patients were referred to the hospital for RT-PCR testing from MCGM's door-to-door screening; if found positive, they were admitted at the center itself for quarantine and treatment.
  2. With close to 1000 beds dedicated to COVID-19 patients alone, the doctors needed assistance to easily manage critical patients and monitor their cases closely.

As with all COVID-19 cases, chest X-rays of the admitted patients were taken periodically to ascertain their lung condition and monitor the progression of the disease. All X-rays were then read by the head radiologist, Dr. Bhujang Pai, the next day, and released to the patient only after his review and approval. This meant that on most mornings, Dr. Pai was tasked with reading and reporting 200-250 X-rays, if not more. This is where qXR simplified his work.

Initially, we deployed the software on one of the two chest X-ray systems. However, after stellar feedback from Dr. Pai, our technology was installed on both machines. In this manner, an AI pre-read was available for all chest X-rays of COVID-19 patients from the center.

Where qXR adds most value:

  • several crucial indications are reported by qXR
  • the percentage of lung affected helps quantify improvement or deterioration in the patient's lungs and provides an objective assessment of the patient's condition
  • the pre-filled PDF report downloadable from the Qure portal makes it easier to finalize the radiology report prior to releasing it to the patient, especially in a high-volume setting

Dr. Pai reviews and finalizes the qXR report prior to signing it off

“At SevenHills hospital, we have a daily load of ~220 Chest X-rays from the admitted COVID-19 cases, sometimes going up to 300 films per day. Having qXR has helped me immensely in reading them in a much shorter amount of time and helps me utilise my time more efficiently. The findings from the software are useful to quickly pickup the indications and we have been able to work with the team, and make suitable modifications in the reporting pattern, to make the findings more accurate. qXR pre-fills the report which I review and edit, and this facilitates releasing the patient reports in a much faster and efficient manner. This obviously translates into better patient care and treatment outcomes. The percentage of lung involvement which qXR analyses enhances the Radiologist’s report and is an excellent tool in reporting Chest radiographs of patients diagnosed with COVID infection.”

– Dr Bhujang Pai, Radiology Head, SevenHills Hospital

Challenges and learnings

During the course of the pandemic, Qure has assisted MCGM by providing AI analyses for thousands of chest X-rays of COVID-19 suspects and patients. This has been possible through continued collaboration with key stakeholders within MCGM, who have been happy to assist in the process and provide the necessary approvals and documentation to initiate work. However, the sites posed different challenges owing to their varied nature and the limitations that came with them.

We had to navigate various technical challenges, such as interrupted network connections and the lack of an IT team, especially at the makeshift COVID centers. We crossed these hurdles repeatedly to ensure that the X-rays from these centers were processed seamlessly within the stipulated timeframe, and that the X-ray systems in use were serviced and functioning without interruption. Close coordination with the on-ground team and cooperation from their end were crucial to keeping the engagement smooth.

This pandemic has been a revelation in many ways. In addition to reiterating that a virus sees no class or creed, it also forced us to move beyond our comfort zones and take our blinders off. Owing to the limitations posed by the pandemic and the subsequent movement restrictions, every single deployment of qXR by Qure was done entirely remotely. This included end-to-end activities such as coordination with key stakeholders, planning and execution of the software deployment, training of on-ground staff and physicians using the portal or mobile app, and continuous operations support.

Robust and smart technology truly made it possible to implement what we had conceived and hoped for, proving yet again that if we are to move ahead, it has to be a healthy partnership between technology and humanity.

Qure is supported by ACT Grants and India Health Fund in joining MCGM's pandemic response efforts using qXR for COVID-19 management.

An AI Upgrade during COVID-19: Stories from the most resilient healthcare systems in Rural India

When the pandemic hit the world without discretion, it caused health systems to crumble across the globe. While a large part of the focus was on strengthening them in urban cities, rural areas were struggling to cope. In this blog, we highlight our experience working with some of the best healthcare centers in rural India that are delivering healthcare to the last mile. We describe how they embraced AI technology during this pandemic, and how it made a difference to their workflow and patient outcomes.

2020 will be remembered as the year of the COVID-19 pandemic. Affecting every corner of the world without discretion, it has caused unprecedented chaos and put healthcare systems under enormous stress. The majority of COVID-19 transmissions take place due to asymptomatic or mildly symptomatic cases. While global public health programs have steadily created evolving strategies for integrative technologies for improved case detection, there is a critical need for consistent and rigorous testing. It is at this juncture that the impact of Qure’s AI-powered chest X-ray screening tool, qXR, was felt across large testing sites such as hospital networks and government-led initiatives.

In India, Qure joined forces with the Indian government to combat COVID-19, and qXR found its value in diagnostic aid and critical care management. With the assistance of investor groups like ACT Grants and India Health Fund, we extended support to a number of sites, strengthening the urban systems fighting the virus in hotspots and containment zones.
Unfortunately, by this time the virus had already moved to rural areas, overwhelming primary healthcare systems that were already overburdened and resource-constrained.

Discovering the undiscovered healthcare providers

Technologies are meant to improve the quality of human lives, and access to quality healthcare is one of the most basic necessities. To further our work with hospitals and testing centers across the world, we asked ourselves whether more hospitals could benefit from the software in optimising testing capacity. Through our physicians, we reached out to healthcare provider networks and social impact organisations that could potentially use the software for triaging and optimisation. During this process, we discovered an entirely new segment, very different from the well-equipped urban hospitals we had been operating in so far, and interacted with a few physicians dedicated to delivering quality, affordable healthcare through these hospitals.

Working closely with the community public health systems, these secondary care hospitals act as a vital referral link to tertiary hospitals. Some are located in isolated tribal areas and address the needs of large catchment populations, hosting close to 100,000 OPD visits annually. They already faced a significant burden of TB and now had to cope with the COVID-19 crisis. With testing facilities often located far away, diagnosis can be delayed by days, which is unfortunate because chest X-rays are crucial as a primary investigation prior to confirmatory tests, mainly due to limitations in testing capacity; sufficient testing kits have not reached many parts of rural India as yet.

“I have just finished referring a 25-year-old who came in respiratory distress, flagged positive on X-ray with positive rapid antigen test to Silchar Medical College and Hospital (SMCH), which is 162kms away from here. The number of cases here in Assam is increasing”

Dr. Roshine Koshy, Makunda Christian Leprosy and General Hospital in Assam.

BSTI algorithm

On the left: Chinchpada mission hospital, Maharashtra; Right: Shanti Bhavan Medical Center, Jharkhand.

When we first reached out to these hospitals, we were struck by the heroic vigour with which they were already handling the COVID-19 crisis despite their limited resources. We spoke to the doctors, care-givers and IT experts across all of these hospitals and they had the utmost clarity from the very beginning on how the technology could help them.

Why do they need innovations?

Patients regularly present with no symptoms or atypical ones, and conceal their travel history due to the stigma associated with COVID-19. Owing to the ambiguous nature of COVID-19 presentation, there is a possibility of missing subtle findings. Beyond the risk from direct contact with the patient, this puts the healthcare team, their families, and other vulnerable patients at risk.

qXR bridges underlying gaps in these remote, isolated and resource-constrained regions around the world. Perhaps the most revolutionary, life-saving aspect is that, in less than a minute, qXR generates an AI analysis of whether the X-ray is normal or abnormal, along with a list of 27+ abnormalities including COVID-19 and TB. With qXR's assistance, X-rays suggestive of a high risk of COVID-19 are flagged, enabling quick triaging and isolation of these suspects until negative RT-PCR confirmatory results are received. As the prognosis changes with co-morbidities, alerting the referring physician by phone about life-threatening findings like pneumothorax is an added advantage.

Overview of results generated by qXR

Due to the lack of radiologists and other specialists in their own or neighbouring cities, clinicians often play multiple roles: physician, obstetrician, surgeon, intensivist, anaesthetist. It is normal in these hospitals for the same doctors to investigate, treat and perform surgeries for those in need. Detecting any at-risk case prior to a surgical procedure is therefore important, as it necessitates RT-PCR confirmation and further action.

Enabling the solution and the impact

These hospitals have been in the service of local communities with a mix of healthcare and community outreach services for decades now. Heavily dependent on funding, these setups often have to navigate severe financial crises in their mission to continue catering to people at the bottom of the pyramid. Amidst the tribal belt in Jharkhand, Dr. George Mathew (former Principal, CMC Vellore), Medical Director of Shanti Bhavan Medical Center in Simdega, had to face the herculean task of providing food and accommodation for all his healthcare and non-healthcare staff, as they were ostracised by their families owing to the stigma attached to COVID-19 care. The lack of availability of PPE kits and other protective gear also pushed these sites to innovate and produce them in-house.

Staff protecting themselves and patients

Left: the staff of Shanti Bhavan medical center making the essentials for protecting themselves in-house; Right: staff protecting themselves and a patient.

qXR was introduced to these passionate professionals, and other staff were sensitized to the technology. After their buy-in of the solution, we on-boarded 11 of these hospitals, working closely with their IT teams on secure protocols, deployment and training of the staff, all within a span of 2 weeks. A glimpse of the hospitals is given below:

Location | Hospital Name | Setting
Betul District, rural Madhya Pradesh | Padhar Hospital | A 200-bed multi-speciality charitable hospital that engages in a host of community outreach activities in nearby villages, involving education, nutrition, maternal and child health programs, mental health and cancer screening.
Nandurbar, Maharashtra | Chinchpada Mission Hospital | This secondary care hospital serves the Bhil tribal community. Patients travel up to 200 km from the interiors of Maharashtra to avail themselves of affordable, high-quality care.
Tezpur, Assam | The Baptist Christian Hospital | A 200-bed secondary care hospital in the northeastern state of Assam.
Bazaricherra, Assam | Makunda Christian Leprosy & General Hospital | They cater to tribal regions, in a district with a Maternal Mortality Rate (MMR) as high as 284 per 100,000 live births and an Infant Mortality Rate (IMR) of 69 per 1,000 live births. They conduct 6,000 deliveries and perform 3,000 surgeries annually.
Simdega, Jharkhand | Shanti Bhavan Medical Center | This secondary hospital caters to a remote tribal district and is managed entirely by 3-4 doctors who actively multitask to ensure the highest quality care for their patients. The nearest tertiary care hospital is approximately 100 km away. Currently a COVID-19 designated center, they also actively see many TB cases.

Others include hospitals in Khariar, Odisha; Dimapur, Nagaland; Raxaul, Bihar and so on.

Initially, qXR was used to process X-rays of cases with COVID-19-like symptoms, with results interpreted and updated within a minute. Soon the doctors found it useful in the OPD as well, and the solution's capability was extended to all patients who visited with various ailments requiring a chest X-ray. Alerts on every suspect are provided immediately, based on the likelihood of disease predicted by qXR, along with information on other suggestive findings. The reports are compiled and integrated in our patient workflow management solution, qTrack. Because of resource constraints around viewing X-rays on dedicated workstations, the results are also made available in real time via the qTrack mobile application.

qTrack app and web

Left: qTrack app used by physicians to view results in real time while they are attending to patients and performing routine work; Right: qTrack web used by physicians and technicians to view results instantaneously for reporting.

“It is a handy tool for our junior medical officers in the emergency department, as it helps in quick clinical decision making. The uniqueness of the system being speed, accuracy, and the details of the report. We get the report moment the x rays are uploaded on the server. The dashboard is very friendly to use. It is a perfect tool for screening asymptomatic patients for RT PCR testing, as it calculates the COVID-19 risk score. This also helps us to isolate suspected patients early and thereby helping in infection control. In this pandemic, this AI system would be a valuable tool in the battleground”

Dr Jemin Webster, Tezpur Baptist Hospital

Once the preliminary chest X-ray screening is done, hospitals equipped with COVID-19 rapid tests perform them right away, while the others send samples to the closest testing facility, which may be more than 30 miles away, with results available only after 4-5 days or more. None of these hospitals have an RT-PCR testing facility yet.

qXR Protocol

At Makunda Hospital, Assam, qXR is used as an additional input in the diagnostic workflow for managing a patient as a COVID-19 patient. They have streamlined their workflow so that X-ray technicians take digital X-rays, upload the images to qXR, and follow up and alert the doctors. Meanwhile, physicians can also access reports, review images and make clinical corroborations wherever they are through qTrack, and manage patients without any undue delay.

Dr. Roshine Koshy using qXR

Dr. Roshine Koshy using qXR system during her OPD to review and take next course of action

“One of our objectives as a clinical team has been to ensure that care for non-COVID-19 patients is not affected as much as possible as there are no other healthcare facilities providing similar care. We are seeing atypical presentations of the illness, patients without fever, with vague complaints. We had one patient admitted in the main hospital who was flagged positive on the qXR system and subsequently tested positive and referred to a higher center. All the symptomatic patients who tested positive on the rapid antigen test have been flagged positive by qXR and some of them were alerted because of the qXR input. Being a high volume center and the main service provider in the district, using Qure.ai as a triaging tool will have enormous benefits in rural areas especially where there are no well-trained doctors”

– Dr. Roshine Koshy, Makunda Christian Leprosy and General Hospital in Assam.

Our users experienced a number of changes in this short span since the introduction of qXR into their existing workflow, including:

  • Empowering front-line physicians and caregivers to make quick decisions
  • Enabling diagnosis for patients by triaging them for Rapid Antigen or RT-PCR tests immediately
  • Identifying asymptomatic cases which would have been missed otherwise
  • Ensuring safety of the health workers and other staff
  • Reducing risk of disease transmission

At Padhar Hospital, Madhya Pradesh, in addition to triaging suspected COVID-19 cases, qXR assists doctors in managing pre-operative patients, as their medicine department takes care of pre-anaesthesia checkups as well. qXR helps them identify and flag suspected cases who are scheduled for procedures. These are deferred until diagnosis, or handled with appropriate additional safety measures in case of an emergency.

“We are finding it quite useful since we get a variety of patients, both outpatients and inpatients. And anyone who has a short history of illness and has history suggestive of SARI, we quickly do the chest X-ray and if the Qure app shows a high COVID-19 score, we immediately refer the patient to the nearby district hospital for RT-PCR for further management. Through the app we are also able to pick up asymptomatic suspects who hides their travel history or positive cases who have come for second opinion, to confirm and/or guide them to the proper place for further testing and isolation”

– Dr Mahima Sonwani, Padhar Hospital, Betul, Madhya Pradesh

Left: technician capturing X-ray in Shanti Bhavan medical center; Right: Dr. Jemine Webster using qXR solution in Baptist hospital, Tezpur

In some of the high TB burden settings like Simdega in Jharkhand, qXR is used as a surveillance tool for screening and triaging Tuberculosis cases in addition to COVID-19 and other lung ailments.

“We are dependent on chest X-rays to make the preliminary diagnosis in both these conditions before we perform any confirmatory test. There are no trained radiologists available in our district or our neighbouring district and struggle frequently to make accurate diagnosis without help of a trained radiologist. The AI solution provided by Qure, is a perfect answer for our problem in this remote and isolated region. I strongly feel that the adoption of AI for Chest X-ray and other radiological investigation is the ideal solution for isolated and human resource deprived regions of the world”

– Dr.George Mathew, Medical Director, Shanti Bhavan Medical Centre

Currently, qXR processes close to 150 chest X-rays a day from these hospitals, enabling quick diagnostic decisions for lung diseases.

Challenges: Several hospitals had very basic technological infrastructure, with poor internet connectivity and limited IT systems for running supporting software. They were anxious about potential viruses or crashes on the computer where our software was installed. Most of these teams also had limited exposure to working with such software. However, they were extremely keen to learn, adapt and even propose solutions to overcome these infrastructural limitations. The engineers of Qure's customer success team deployed the software gateways carefully, ensuring no interruption to the hospitals' existing functioning.

Conclusion

At Qure, we have worked closely with public health stakeholders in recent years. It is rewarding to hear the experiences and stories of impact from these physicians. To strengthen their armour in the fight against the pandemic, even in such resource-limited settings, we will continue to expand our software solutions, making qXR available across primary, secondary, and tertiary hospitals. Meetings, deployments, and training will be done remotely, providing a seamless experience. It is reassuring to hear these words:

“Qure’s solution is particularly attractive because it is cutting edge technology that directly impacts care for those sections of our society who are deprived of many advances in science and technology simply because they never reach them! We hope that this and many more such innovative initiatives would be encouraged so that we can include the forgotten masses of our dear people in rural India in the progress enjoyed by those in the cities, where most of the health infrastructure and manpower is concentrated”

Dr. Ashita Waghmare, Chinchpada hospital

Democratizing healthcare through innovations! We will be publishing a detailed study soon.

Improving performance of AI models in presence of artifacts

Our deep learning models have become very good at recognizing hemorrhages on head CT scans, but real-world performance is sometimes hampered by external factors, both hardware-related and human-related. In this blog post, we analyze how acquisition artifacts are responsible for performance degradation and introduce two methods we tried in order to solve this problem.

Medical Imaging is often accompanied by acquisition artifacts which can be subject related or hardware related. These artifacts make confident diagnostic evaluation difficult in two ways:

  • by making abnormalities less obvious visually by overlaying on them.
  • by mimicking an abnormality.

Some common examples of artifacts are

  • Clothing artifact: due to clothing on the patient at acquisition time. See fig 1 below, where a button on the patient's clothing looks like a coin lesion on a chest X-ray (marked by the red arrow).

clothing artifact

Fig 1. A button mimicking a coin lesion in a chest X-ray, marked by the red arrow. Source.

  • Motion artifact: due to voluntary or involuntary subject motion during acquisition. Severe artifacts due to voluntary motion would usually call for a rescan. Involuntary motion, like respiration or cardiac motion, or minimal subject movement, can result in artifacts that go undetected and mimic a pathology. See fig 2, where subject movement has resulted in motion artifacts that mimic a subdural hemorrhage (SDH).

motion artifact

Fig 2. Artifact due to subject motion, mimicking a subdural hemorrhage in a head CT. Source

  • Hardware artifact: see fig 3. This artifact is caused by air bubbles in the cooling system. There are subtle irregular dark bands in the scan that can be misidentified as cerebral edema.

hardware artifact edema

Fig 3. A hardware-related artifact, mimicking cerebral edema, marked by yellow arrows. Source

Here we investigate motion artifacts that look like SDH in head CT scans. These artifacts increase the false positive (FP) predictions of subdural hemorrhage models. We confirmed this by quantitatively analyzing the FPs of our AI model deployed at an urban outpatient center: the FP rates were higher for this data than on our internal test dataset.
The reason for these false positives is the lack of variety of artifact-ridden data in the training set. It is practically difficult to acquire and include scans containing all varieties of artifacts in the training set.

artifact mistaken for sdh

Fig 4. Model identifies an artifact slice as SDH because of the similarity in shape and location: both are hyperdense areas close to the cranial bones.

We tried to solve this problem in the following two ways.

  • Making the models invariant to artifacts, by explicitly including artifact images in our training dataset.
  • Discounting slices with artifacts when calculating the probability of bleed in a scan.

Method 1. Artifact as an augmentation using Cycle GANs

We reasoned that the artifacts were misclassified as bleeds because the model had not seen enough artifact scans during training.
The number of images containing artifacts is relatively small in our annotated training dataset, but we have access to several unannotated scans containing artifacts, acquired from various centers with older CT scanners (motion artifacts are more prevalent when using older CT scanners with poor in-plane temporal resolution). If we could generate artifact-ridden versions of all the annotated images in our training dataset, we would be able to effectively augment the training set and make the model invariant to artifacts.
We decided to use a CycleGAN to generate new training data containing artifacts.

CycleGAN [1] is a generative adversarial network used for unpaired image-to-image translation. It serves our purpose because we have an unpaired image translation problem, where the X domain contains our training set CT images with no artifacts and the Y domain contains artifact-ridden CT images.
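
Had the generated images been convincing, the trained clean-to-artifact generator could have been used as an on-the-fly augmentation when building training batches. Below is a minimal PyTorch sketch of that idea; the generator interface, tensor shapes, and the replacement rate are illustrative assumptions, not details of our actual pipeline.

import torch

@torch.no_grad()
def augment_with_artifacts(generator_xy, clean_slices, p=0.3):
    """Replace a fraction p of clean training slices with versions passed
    through a trained CycleGAN generator G: clean domain -> artifact domain."""
    augmented = []
    for x in clean_slices:                               # x: (C, H, W) float tensor
        if torch.rand(1).item() < p:
            x = generator_xy(x.unsqueeze(0)).squeeze(0)  # add synthetic artifacts
        augmented.append(x)
    return augmented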

cycle gan illustration

Fig 5. CycleGAN used to convert a short clip of a horse into that of a zebra. Source

We curated a dataset A of 5000 images without artifacts and a dataset B of 4000 images with artifacts, and used these to train the CycleGAN.

Unfortunately, the quality of the generated images was not very good (see fig 6).
The generator was unable to capture all the variety in the CT dataset, and meanwhile introduced artifacts of its own, rendering it useless for augmenting the dataset. The CycleGAN authors state that the generator performs worse when the transformation involves geometric changes (e.g. dog to cat, apples to oranges) than when it involves color or style changes. Introducing artifacts is a bit more complex than a color or style change because it has to distort existing geometry. This could be one of the reasons why the generated images have extra artifacts.

cycle gan images

Fig 6. Sampling of generated images using CycleGAN. real_A are input images and fake_B are the artifact images generated by CycleGAN.

Method 2. Discounting artifact slices

In this method, we trained a model to identify slices with artifacts, and show that discounting these slices made the AI model that identifies subdural hemorrhage (SDH) robust to artifacts.
A manually annotated dataset was used to train a convolutional neural network (CNN) model to detect whether a CT slice has artifacts. The original SDH model was also a CNN, which predicted whether a slice contains SDH. The probabilities from the artifact model were used to discount the slices containing artifacts, so that only the artifact-free slices of a scan contribute to the score for the presence of bleed.
See fig 7 and the sketch below it.

Method 2 illustration

Fig 7. Method 2 Using a trained artifacts model to discount artifact slices while calculating SDH probability.
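
As a rough illustration of this discounting step, here is a minimal sketch in Python. The thresholding rule and the max-pooling aggregation over artifact-free slices are assumptions chosen for clarity, not the exact combination rule used in our model.

import numpy as np

def scan_bleed_score(sdh_probs, artifact_probs, artifact_threshold=0.5):
    """Scan-level SDH score that discounts slices flagged as artifact.

    sdh_probs, artifact_probs: per-slice probabilities from the two CNNs.
    """
    sdh_probs = np.asarray(sdh_probs, dtype=float)
    artifact_probs = np.asarray(artifact_probs, dtype=float)

    keep = artifact_probs < artifact_threshold      # artifact-free slices
    if not keep.any():
        return 0.0                                  # no usable evidence left

    # Aggregate the remaining slice probabilities into one scan score,
    # here simply the maximum over the artifact-free slices.
    return float(sdh_probs[keep].max())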

Results

Our validation dataset contained 712 head CT scans, of which 42 contained SDH. The original SDH model predicted 35 false positives and no false negatives. Quantitative analysis of the FPs confirmed that 17 (48%) of them were due to CT artifacts. Our trained artifact model had a slice-wise AUC of 96%. The proposed modification to the SDH model reduced the FPs to 18 (a decrease of 48%) without introducing any false negatives. Thus, using method 2, all scan-wise FPs due to artifacts were corrected.

In summary, using method 2, we improved the precision of SDH detection from 54.5% to 70% while maintaining a sensitivity of 100%.
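
The precision figures follow directly from the counts above (42 true positive scans, 35 false positives before the change and 18 after); a quick check:

tp, fp_before, fp_after = 42, 35, 18
print(round(tp / (tp + fp_before), 3))  # 0.545 -> 54.5% precision before
print(round(tp / (tp + fp_after), 3))   # 0.7   -> 70.0% precision after discounting artifacts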

Confusion matrices

Fig 8. Confusion matrices before and after using the artifact model for SDH prediction

See fig 9 for model predictions on a representative scan.

Artifact discount explanation

Fig 9. Model predictions for a few representative slices in a scan falsely predicted as positive by the original SDH model

A drawback of method 2 is that if SDH and an artifact are present in the same slice, it is probable that the SDH could be missed.

Conclusion

Using a CycleGAN to augment the dataset with artifact-ridden scans would solve the problem by enriching the dataset with both SDH-positive and SDH-negative scans with artifacts on top of them, but our current experiments do not give realistic-looking image synthesis results. The alternative we used meanwhile reduces the problem of high false positives due to artifacts while maintaining the same sensitivity.

References

  1. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks by Jun-Yan Zhu et al.

Morphology of the Brain: Changes in Ventricular and Cranial Vault Volumes in 15000 subjects with Aging and Hydrocephalus

This post is Part 1 of a series that uses large datasets (15,000+ scans) coupled with deep learning segmentation methods to review, and maybe re-establish, what we know about normal brain anatomy and pathology. Subsequent posts will tackle intracranial bleeds and their typical volumes and locations across similarly sized datasets.

Brain ventricular volume has been quantified by post-mortem studies [1] and by pneumoencephalography. When CT and subsequently MRI became available, facilitating non-invasive observation of the ventricular system, larger datasets could be used to study these volumes. Typical subject numbers in recent studies have ranged from 50 to 150 [2-6].

Now that deep learning segmentation methods have enabled automated, precise measurements of ventricular volume, we can re-establish these reference ranges using datasets that are two orders of magnitude larger. This is likely to be especially useful at the age extremes: in children, where very limited reference data exist, and in the elderly, where the effects of age-related atrophy may co-exist with pathologic neurodegenerative processes.

To date, no standard has been established regarding the normal ventricular volume of the human brain. The Evans index and the bicaudate index are linear measurements currently being used as surrogates to provide some indication that there is abnormal ventricular enlargement [1]. True volumetric measures are preferable to these indices for a number of reasons [7, 8] but have not been adopted so far, largely because of the time required for manual segmentation of images. Now that automated precise quantification is feasible with deep learning, it is possible to upgrade to a more precise volumetric measure.

Such quantitative measures will be useful in the monitoring of patients with hydrocephalus, and as an aid to diagnosing normal pressure hydrocephalus. In the future, automated measurements of ventricular, brain and cranial volumes could be displayed alongside established age- and gender-adjusted normal ranges as a standard part of radiology head CT and MRI reports.

Methods and Results

To train our deep learning model, lateral ventricles were manually annotated in 103 scans. We split these scans randomly in a 4:1 ratio for training and validation respectively. We trained a U-Net to segment the lateral ventricles in each slice. Another U-Net model was trained to segment the cranial vault using a similar process. Models were validated using the DICE score against the annotations.

Anatomy              DICE Score

Lateral Ventricles 0.909
Cranial Vault 0.983
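
For reference, the DICE score above compares a predicted binary mask with the manual annotation. A minimal NumPy implementation (our actual evaluation code may differ) looks like this:

import numpy as np

def dice_score(pred_mask, true_mask, eps=1e-7):
    """Dice coefficient between predicted and ground-truth binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    return 2.0 * intersection / (pred.sum() + true.sum() + eps)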

A validation set of about 20 scans might not represent all the anatomical and pathological variations in the population. Therefore, we visually verified that the resulting models worked despite pathologies like hemorrhages/infarcts or surgical implants such as shunts. We show some representative scans and model outputs below.

Focal ventricle dilatation

30 year old male reported with 'focal dilatation of left lateral ventricle.'

Mild Hydrocephalus

7 year old female child reported with 'mild obstructive hydrocephalus'

Fracture and hemorrhages

28 year old male reported with fracture and hemorrhages

Shunt

36 year old male reported with an intraventricular mass and with a VP shunt

To study lateral ventricular and cranial vault volume variation across the population, we randomly selected 14,153 scans from our database. This selection contained only 208 scans with hydrocephalus reported by the radiologist. Since we wanted to study ventricular volume variation in patients with hydrocephalus, we added 1,314 additional scans reported with 'hydrocephalus'. We excluded scans for which age or gender metadata were not available.
In total, our analysis dataset contained 15,223 scans, whose demographic characteristics are shown in the table below.

Characteristic                        Value

Number of scans 15223
Females 6317 (41.5%)
Age: median (interquartile range) 40 (24 – 56) years
Scans reported with cerebral atrophy 1999 (13.1%)
Scans reported with hydrocephalus 1404 (9.2%)

Dataset demographics and prevalences.

A histogram of the age distribution is shown below. It can be observed that there are reasonable numbers of subjects (>200) in every age and sex group, which helps ensure that our analysis is generalizable.

age histogram

We ran the trained deep learning models and measured lateral ventricular and cranial vault volumes for each of the 15,223 scans. Below is a scatter plot of all the analyzed scans.
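
Volumes can be obtained from the segmentation masks by counting foreground voxels and multiplying by the voxel size; a minimal sketch, assuming the voxel spacing is read from the scan metadata:

import numpy as np

def volume_ml(mask, spacing_mm):
    """Convert a binary segmentation mask to a volume in millilitres.
    spacing_mm is the (z, y, x) voxel spacing of the scan."""
    voxel_mm3 = float(np.prod(spacing_mm))
    return mask.astype(bool).sum() * voxel_mm3 / 1000.0   # 1 ml = 1000 mm^3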

Scatter plot

In this scatter plot, the x-axis is the lateral ventricular volume while the y-axis is the cranial vault volume. Scans with atrophy are marked in orange, while scans with hydrocephalus are marked in green. Patients with atrophy lie to the right of the majority of individuals, indicating larger ventricles in these subjects. Patients with hydrocephalus lie at the extreme right, with ventricular volumes even higher than those with atrophy.

To make this relationship clearer, we plotted the distribution of ventricular volume for patients without hydrocephalus or atrophy and for patients with one of these.

ventricular volume distribution

Interestingly, the hydrocephalus distribution has a very long tail, while the distribution for patients with neither hydrocephalus nor atrophy has a narrower peak.

Next, let us observe cranial vault volume variation with age and sex. Bands around the solid lines indicate the interquartile range of cranial vault volume for each group.

cranial vault volume variation

An obvious feature of this plot is that the cranial vault increases in size until the age of 10-20, after which it plateaus. The cranial vault of males is approximately 13% larger than that of females. Another interesting point is that the cranial vault in males continues to grow until the 15-20 age group, while in females it stabilizes at ages 10-15.

Now, let's plot the variation of lateral ventricular volume with age and sex. As before, bands indicate the interquartile range for each age group.

lateral ventricular volume variation

This plot shows that the ventricles grow in size with age. This may be explained by the fact that the brain naturally atrophies with age, leading to relative enlargement of the ventricles. This information can be used as a normal range of ventricular volume for a particular age and gender; a ventricular volume outside this normal range can be indicative of hydrocephalus or a neurodegenerative disease.
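
For instance, a report-generation step could compare a measured volume against the age- and sex-matched interquartile range. The rule below, flagging anything above the upper quartile, is an illustrative assumption, not the exact rule used in our reports.

def flag_ventricular_volume(volume_ml, expected_iqr_ml):
    """Compare a lateral ventricular volume to the matched interquartile range."""
    q1, q3 = expected_iqr_ml
    if volume_ml > q3:
        return ("Dilated lateral ventricles: volume %.0f ml exceeds the "
                "expected %d-%d ml range" % (volume_ml, q1, q3))
    return "Lateral ventricular volume within the expected range"

# Example using the automated report shown further below: a 75-year-old male
# with an 88 ml ventricular volume against an expected 28-54 ml IQR.
print(flag_ventricular_volume(88, (28, 54)))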

While the plot above showed the variation of lateral ventricular volume across age and sex, it might be easier to visualize the relative proportion of lateral ventricular volume compared to cranial vault volume. This also has a normalizing effect across sexes, since differences in ventricular volumes between sexes might simply reflect differences in cranial vault size.

relative lateral ventricular volume variation

This plot looks similar to the previous one, with the ratio of ventricular volume to cranial vault volume increasing with age. Until the age of 30-35, males and females have relatively similar ventricular volumes. After that age, however, males tend to have a larger relative ventricular size compared to females. This is in line with prior research which found that males are more susceptible to atrophy than females [10].

We can incorporate all of this analysis into our automated report. For example, the following is the CT scan of a 75-year-old patient and our automated report.

 

CT scan of a 75 Y/M patient.

qER Analysis Report
===================

Patient ID: KSA18458
Patient Age: 75Y
Patient Sex: M

Preliminary Findings by Automated Analysis:

- Infarct of 0.86 ml in left occipital region.
- Dilated lateral ventricles.
  This might indicate neurodegenerative disease/hydrocephalus.
  Lateral ventricular volume = 88 ml.
  Interquartile range for male >=75Y patients is 28 - 54 ml.

This is a report of preliminary findings by automated analysis.
Other significant abnormalities may be present.
Please refer to final report.

Our auto-generated report. Added text is indicated in bold.

Discussion

The question of how to establish the ground truth for these measurements still remains to be answered. For this study, we use DICE scores against manually outlined ventricles as an indicator of segmentation accuracy. Ventricle volumes annotated slice-wise by experts are an insufficient gold standard, not only because of scale, but also because of the lack of precision. The places where these algorithms are most likely to fail (and therefore need more testing) are anatomical variants and pathologies that might alter the structure of the ventricles. We have tested some common co-occurring pathologies (hemorrhage), but it would be interesting to see how well the method performs on scans with congenital anomalies and other conditions such as subarachnoid cysts (which caused an earlier machine-learning-based algorithm to fail [9]).

  • Recording ventricular volume in reports is useful for future reference and for monitoring ventricular size in individuals with varying pathologies such as traumatic brain injury and colloid cysts of the third ventricle.
  • It provides an objective measure to follow ventricular volumes in patients who have had shunts and can help in identifying shunt failure.
  • Establishing the accuracy of these automated segmentation algorithms also paves the way for more nuanced neuroradiology research on a scale that was not previously possible.
  • One can use the data in relation to cerebral volume and age to define hydrocephalus, atrophy and normal pressure hydrocephalus.

References

  1. Evans, William A. "An encephalographic ratio for estimating ventricular enlargement and cerebral atrophy." Archives of Neurology & Psychiatry 47.6 (1942): 931-937.
  2. Matsumae, Mitsunori, et al. “Age-related changes in intracranial compartment volumes in normal adults assessed by magnetic resonance imaging.” Journal of neurosurgery 84.6 (1996): 982-991.
  3. Scahill, Rachael I., et al. “A longitudinal study of brain volume changes in normal aging using serial registered magnetic resonance imaging.” Archives of neurology 60.7 (2003): 989-994.
  4. Hanson, J., B. Levander, and B. Liliequist. “Size of the intracerebral ventricles as measured with computer tomography, encephalography and echoventriculography.” Acta Radiologica. Diagnosis 16.346_suppl (1975): 98-106.
  5. Gyldensted, C. “Measurements of the normal ventricular system and hemispheric sulci of 100 adults with computed tomography.” Neuroradiology 14.4 (1977): 183-192.
  6. Haug, G. “Age and sex dependence of the size of normal ventricles on computed tomography.” Neuroradiology 14.4 (1977): 201-204.
  7. Toma, Ahmed K., et al. “Evans’ index revisited: the need for an alternative in normal pressure hydrocephalus.” Neurosurgery 68.4 (2011): 939-944.
  8. Ambarki, Khalid, et al. “Brain ventricular size in healthy elderly: comparison between Evans index and volume measurement.” Neurosurgery 67.1 (2010): 94-99.
  9. Yepes-Calderon, Fernando, Marvin D. Nelson, and J. Gordon McComb. “Automatically measuring brain ventricular volume within PACS using artificial intelligence.” PloS one 13.3 (2018): e0193152.
  10. Gur, Ruben C., et al. “Gender differences in age effect on brain atrophy measured by magnetic resonance imaging.” Proceedings of the National Academy of Sciences 88.7 (1991): 2845-2849.

Challenges of Development & Validation of Deep Learning for Radiology

We recently published an article on our deep learning algorithms for head CT in The Lancet. This is the first AI in medical imaging paper ever to be published in this journal.
We described the development and validation of these algorithms in the article.
In this blog, I explain some of the challenges we faced in this process and how we solved them. The challenges I describe are fairly general and should be applicable to any research involving AI and radiology images.

Development

3D Images

The first challenge we faced in the development process is that CT scans are three dimensional (3D). There is a plethora of research on two dimensional (2D) images, but far less on 3D images. You might ask: why not simply use 3D convolutional neural networks (CNNs) in place of 2D CNNs? Notwithstanding the computational and memory requirements of 3D CNNs, they have been shown to be inferior to 2D CNN based approaches on a similar problem (action recognition).

So how do we solve it? We need not reinvent the wheel when there is a lot of literature on a similar problem: action recognition, i.e. the classification of the action present in a given video.
Why is action recognition similar to 3D volume classification? The temporal dimension in videos is analogous to the Z dimension in a CT scan.

Left: Example head CT scan. Right: Example video from an action recognition dataset. The Z dimension in the CT volume is analogous to the time dimension in the video.

We took a foundational work from the action recognition literature and modified it for our purposes. Our modification was to incorporate slice-level (frame-level, in video terms) labels into the network, because the action recognition literature enjoys the comfort of pretrained 2D CNNs, which we do not share.

High Resolution

The second challenge was that CT is of high resolution, both spatially and in bit depth. We simply downsample the CT to a standard pixel spacing. How about bit depth? Deep learning doesn’t work well with data that is not normalized to [-1, 1] or [0, 1]. We solved this with what a radiologist would use – windowing. Windowing restricts the dynamic range to a certain interval (e.g. [0, 80] Hounsfield units) and then normalizes it. We applied three windows and passed them as channels to the CNNs.

Windows: brain, blood/subdural and bone
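The snippet below is a minimal NumPy sketch of this windowing step. The window centers and widths are typical textbook values used for illustration; they are not necessarily the exact windows used by our models.

import numpy as np

def window(ct_volume_hu, center, width):
    """Clip a CT volume (in Hounsfield units) to a window and scale to [0, 1]."""
    low, high = center - width / 2, center + width / 2
    clipped = np.clip(ct_volume_hu, low, high)
    return (clipped - low) / (high - low)

def windowed_channels(ct_volume_hu):
    """Stack brain, blood/subdural and bone windows as channels.
    The centers/widths are illustrative textbook values."""
    brain = window(ct_volume_hu, center=40, width=80)     # roughly [0, 80] HU
    blood = window(ct_volume_hu, center=80, width=200)    # blood/subdural window
    bone = window(ct_volume_hu, center=600, width=2800)   # bone window
    return np.stack([brain, blood, bone], axis=0)         # shape: (3, Z, H, W)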


This approach allows multi-class effects to be accounted for by the model. For example, a large scalp hematoma visible in the brain window might indicate a fracture underneath it. Conversely, a fracture visible in the bone window is usually correlated with an extra-axial bleed.

Other Challenges

There are a few other challenges that deserve mention as well:

  1. Class imbalance: We addressed the class imbalance issue by weighted sampling and loss weighting (a minimal sketch follows this list).
  2. Lack of pretraining: There’s no pretrained model like ImageNet available for medical images. We found that using ImageNet weights actually hurts performance.
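Below is a minimal PyTorch sketch of what weighted sampling and loss weighting can look like. The label tensor is hypothetical; this illustrates the general technique, not our training code.

import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical scan-level labels (0 = negative, 1 = positive)
labels = torch.tensor([0, 0, 0, 1, 0, 1, 0, 0])

# Weighted sampling: sample positives more often so batches are roughly balanced.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
# loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# Loss weighting: up-weight the positive class in the loss itself.
pos_weight = class_counts[0].float() / class_counts[1].float()
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)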

Validation

Once the algorithms were developed, validation was not without its challenges as well.
Here are the key questions we started with: do our algorithms generalize well to CT scans not in the development dataset?
Do they also generalize to CT scans from a different source altogether? How do they compare to radiologists without access to clinical history?

Low prevalences and statistical confidence

The validation looks simple enough: just acquire scans (from a different source), get them read by radiologists and compare their reads with the algorithms’.
But the statistical design is a challenge! This is because the prevalence of abnormalities tends to be low; it can be as low as 1% for some abnormalities. Our key metrics for evaluating the algorithms are sensitivity, specificity and AUC, which depends on both. Sensitivity is the troublemaker: we have to ensure there are enough positives in the dataset to obtain narrow enough 95% confidence intervals (CI). The required number of positive scans turns out to be ~80 for a CI of +/- 10% at an expected sensitivity of 0.7.
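As a sanity check on that number, here is the back-of-the-envelope calculation using the normal-approximation confidence interval (a sketch; the exact interval method used in the paper may differ):

import math

def positives_needed(sensitivity, half_width, z=1.96):
    """Positives needed so that the normal-approximation 95% CI of
    sensitivity has roughly the given half-width."""
    return math.ceil(z**2 * sensitivity * (1 - sensitivity) / half_width**2)

print(positives_needed(0.7, 0.10))  # ~81 positives for a +/- 10% CI at sensitivity 0.7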

If we were to choose a randomly sampled dataset, the number of scans to be read would be ~80/prevalence = 8000 at a 1% prevalence. With three readers per scan, the total number of reads would be 8k * 3 = 24k. This is a prohibitively large dataset to get read by radiologists. We therefore cannot use a randomly sampled dataset; we have to somehow enrich the number of positives in it.

Enrichment

To enrich a dataset with positives, we have to find the positives among all the available scans. It’s like searching for a needle in a haystack. Fortunately, all the scans usually have a clinical report associated with them. So we just have to read the reports and choose the positive ones. Even better, have an NLP algorithm parse the reports and randomly sample the required number of positives. We chose this path.

We collected the dataset in two batches, B1 & B2. B1 consisted of all the head CT scans acquired in a month and B2 was the algorithmically selected dataset. So, B1 mostly contained negatives while B2 contained a lot of positives. This approach removed any selection bias that might have been present if the scans were picked manually. For example, if positive scans were picked by cursory manual glances at the scans themselves, subtle positive findings would have been missing from the dataset.

Prevalences of the findings in batches B1 and B2. Observe the low prevalences of findings in uniformly sampled batch B1.

Reading

We called this enriched dataset the CQ500 dataset (C for CARING and Q for Qure.ai). The dataset contained 491 scans after exclusions. Three radiologists independently read the scans and the majority vote was considered the gold standard. We randomized the order of the reads to minimize recall of follow-up scans and to blind the readers to the batches of the dataset.

We have made this dataset and the radiologists’ reads public under a CC-BY-NC-SA license. Other researchers can use it to benchmark their algorithms. I think it can also be used for clinical research, such as measuring the concordance of radiologists on various tasks.

In addition to the CQ500 dataset, we validated the algorithms on a much larger randomly sampled dataset, the Qure25k dataset. The number of scans in this dataset was 21,095. Ground truths were clinical radiology reports; we used the NLP algorithm to extract structured data from them. This dataset satisfies the statistical requirements, but each scan was read by only a single radiologist, who had access to clinical history.

Results

Finding CQ500 (95% CI) Qure25k (95% CI)
Intracranial hemorrhage 0.9419 (0.9187-0.9651) 0.9194 (0.9119-0.9269)
Intraparenchymal 0.9544 (0.9293-0.9795) 0.8977 (0.8884-0.9069)
Intraventricular 0.9310 (0.8654-0.9965) 0.9559 (0.9424-0.9694)
Subdural 0.9521 (0.9117-0.9925) 0.9161 (0.9001-0.9321)
Extradural 0.9731 (0.9113-1.0000) 0.9288 (0.9083-0.9494)
Subarachnoid 0.9574 (0.9214-0.9934) 0.9044 (0.8882-0.9205)
Calvarial fracture 0.9624 (0.9204-1.0000) 0.9244 (0.9130-0.9359)
Midline Shift 0.9697 (0.9403-0.9991) 0.9276 (0.9139-0.9413)
Mass Effect 0.9216 (0.8883-0.9548) 0.8583 (0.8462-0.8703)

AUCs of the algorithms on both datasets.

The table above shows the AUCs of the algorithms on the two datasets. Note that the AUCs are directly comparable because AUC is prevalence independent. AUCs on the CQ500 dataset are generally better than those on the Qure25k dataset. This might be because:

  1. Ground truths in the Qure25k dataset incorporated clinical information not available to the algorithms, so the algorithms did not perform as well against them.
  2. A majority vote of three reads is a better ground truth than a single read.

ROC curves

ROC curves for the algorithms on the Qure25k (blue) and CQ500 (red) datasets. TPR and FPR of radiologists are also plotted.

Shown above are the ROC curves on both datasets, with readers’ TPR and FPR also plotted. We observe that radiologists are either highly sensitive or highly specific to a particular finding. The algorithms are yet to beat radiologists, on this task at least! But they should nonetheless be useful to triage or notify physicians.

Categories
Uncategorized

Deep Learning for Videos: A 2018 Guide to Action Recognition

Medical images like MRIs and CTs (3D images) are very similar to videos – both encode 2D spatial information over a 3rd dimension. Much like diagnosing abnormalities from 3D images, action recognition from videos requires capturing context from the entire video rather than just information from each frame.

Fig 1: Left: Example head CT scan. Right: Example video from an action recognition dataset. The Z dimension in the CT volume is analogous to the time dimension in the video.

In this post, I summarize the literature on action recognition from videos. The post is organized into three sections –

  1. What is action recognition and why is it tough
  2. Overview of approaches
  3. Summary of papers

What is action recognition and why is it tough?

The action recognition task involves identifying different actions from video clips (sequences of 2D frames), where the action may or may not be performed throughout the entire duration of the video. It seems like a natural extension of image classification to multiple frames, followed by aggregating the predictions from each frame. Despite the stratospheric success of deep learning architectures in image classification (ImageNet), progress in architectures for video classification and representation learning has been slower.

What made this task tough?

  1. Huge Computational Cost
    A simple 2D convolutional net for classifying 101 classes has just ~5M parameters, whereas the same architecture inflated to a 3D structure results in ~33M parameters. It takes 3 to 4 days to train a 3D ConvNet on UCF101 and about two months on Sports-1M, which makes extensive architecture search difficult and overfitting likely [1].
  2. Capturing long context
    Action recognition involves capturing spatiotemporal context across frames. Additionally, the spatial information captured has to be compensated for camera movement. Even strong spatial object detection doesn’t suffice, as the motion information also carries finer details. There is a local as well as a global context w.r.t. motion information which needs to be captured for robust predictions. For example, consider the video representations shown in Figure 2. A strong image classifier can identify the human and the water body in both videos, but only the nature of the temporal periodic action differentiates front crawl from breast stroke.

    Fig 2: Left: Front crawl. Right: Breast stroke. Capturing temporal motion is critical to differentiate these two seemingly similar cases. Also notice how the camera angle suddenly changes in the middle of the front crawl video.

  3. Designing classification architectures
    Designing architectures that can capture spatiotemporal information involves multiple choices which are non-trivial and expensive to evaluate. For example, some possible strategies are:

    • One network for capturing spatiotemporal information vs. two separate ones for each spatial and temporal
    • Fusing predictions across multiple clips
    • End-to-end training vs. feature extraction and classifying separately
  4. No standard benchmark
    The most popular benchmark datasets have long been UCF101 and Sports1M. Searching for a reasonable architecture on Sports1M can be extremely expensive. For UCF101, although the number of frames is comparable to ImageNet, the high spatial correlation among the videos makes the actual diversity in training much lower. Also, given the similar theme (sports) across both datasets, generalization of benchmarked architectures to other tasks remained a problem. This has been solved lately with the introduction of the Kinetics dataset [2].

    Sample illustration of UCF-101. Source.

It must be noted that abnormality detection from 3D medical images doesn’t involve all the challenges mentioned here. The major differences between action recognition and abnormality detection from medical images are as follows:

  1. In medical imaging, the temporal context may not be as important as in action recognition. For example, detecting hemorrhage in a head CT scan may involve very little temporal context across slices; an intracranial hemorrhage can often be detected from a single slice. In contrast, detecting a lung nodule in a chest CT involves capturing 3D context, since nodules, bronchi and vessels all look like circular objects in 2D slices. It is only when the 3D context is captured that nodules can be distinguished as spherical objects, as opposed to cylindrical objects like vessels.
  2. In action recognition, most research ideas resort to using pre-trained 2D CNNs as a starting point for drastically better convergence. For medical images, such pre-trained networks are unavailable.

Overview of approaches

Before deep learning came along, most traditional CV algorithms for action recognition could be broken down into the following 3 broad steps:


  1. Local high-dimensional visual features that describe a region of the video are extracted either densely [3] or at a sparse set of interest points [4, 5].
  2. The extracted features get combined into a fixed-sized video-level description. One popular variant of this step is to use a bag of visual words (derived using hierarchical or k-means clustering) to encode features at the video level.
  3. A classifier, like an SVM or RF, is trained on the bag of visual words for the final prediction.

Of the algorithms that use shallow hand-crafted features in Step 1, improved Dense Trajectories (iDT) [6], which uses densely sampled trajectory features, was the state-of-the-art. Meanwhile, 3D convolutions were used as-is for action recognition in 2013 without much success [7]. Soon after, in 2014, two breakthrough research papers were released which form the backbone for all the papers we are going to discuss in this post. The major difference between them was the design choice around combining spatiotemporal information.

Approach 1: Single Stream Network

In this work [June 2014], the authors – Karpathy et al. – explore multiple ways to fuse temporal information from consecutive frames using 2D pre-trained convolutions.


Fig 3: Fusion Ideas Source.

As can be seen in Fig 3, the consecutive frames of the video are presented as input in all setups. Single frame uses a single architecture that fuses information from all frames at the last stage. Late fusion uses two nets with shared parameters, spaced 15 frames apart, and also combines predictions at the end. Early fusion combines information in the first layer by convolving over 10 frames. Slow fusion fuses at multiple stages, a balance between early and late fusion. For final predictions, multiple clips were sampled from the entire video and their prediction scores were averaged.

Despite extensive experimentation, the authors found that the results were significantly worse than those of state-of-the-art hand-crafted feature based algorithms. There were multiple reasons attributed to this failure:

  1. The learnt spatiotemporal features didn’t capture motion features
  2. The dataset being less diverse, learning such detailed features was tough

Approach 2: Two Stream Networks

In this pioneering work [June 2014] by Simonyan and Zisserman, the authors build on the failures of the previous work by Karpathy et al. Given the difficulty deep architectures have in learning motion features, the authors explicitly modeled motion features in the form of stacked optical flow vectors. So instead of a single network for spatial context, this architecture has two separate networks – one for spatial context (pre-trained) and one for motion context. The input to the spatial net is a single frame of the video. The authors experimented with the input to the temporal net and found that bi-directional optical flow stacked across 10 successive frames performed best. The two streams were trained separately and combined using an SVM. The final prediction was the same as in the previous paper, i.e. averaging across sampled frames.


Fig 4: Two stream architecture Source.

Though this method improved the performance of single stream method by explicitly capturing local temporal movement, there were still a few drawbacks:

  1. Because the video-level predictions were obtained by averaging predictions over sampled clips, long-range temporal information was still missing from the learnt features.
  2. Since training clips are sampled uniformly from videos, they suffer from a false label assignment problem: the ground truth of each clip is assumed to be the same as the ground truth of the video, which may not be the case if the action happens only for a small duration within the entire video.
  3. The method involved pre-computing optical flow vectors and storing them separately. Also, the training of the two streams was separate, implying that end-to-end training on the go was still a long road away.

Summaries

The following papers, which are in a way evolutions of the two papers above (single stream and two stream), are summarized below:

  1. LRCN
  2. C3D
  3. Conv3D & Attention
  4. TwoStreamFusion
  5. TSN
  6. ActionVlad
  7. HiddenTwoStream
  8. I3D
  9. T3D

The recurrent theme across these papers can be summarized as follows; all of them are improvements on top of these basic ideas.


Recurrent theme across papers. Source.

For each of these papers, I list down their key contributions and explain them.
I also show their benchmark scores on UCF101-split1.

LRCN

  • Long-term Recurrent Convolutional Networks for Visual Recognition and Description
  • Donahue et al.
  • Submitted on 17 November 2014
  • Arxiv Link

Key Contributions:

  • Building on previous work by using RNN as opposed to stream based designs
  • Extension of encoder-decoder architecture for video representations
  • End-to-end trainable architecture proposed for action recognition

Explanation:

In a previous work by Ng et al. [9], the authors had explored the idea of using LSTMs on separately trained feature maps to see if they could capture temporal information from clips. Sadly, they concluded that temporal pooling of convolutional features proved more effective than an LSTM stacked on top of trained feature maps. In the current paper, the authors build on the same idea of using LSTM blocks (decoder) after convolution blocks (encoder), but with end-to-end training of the entire architecture. They also compared RGB and optical flow as input choices and found that a weighted score of predictions based on both inputs was the best.


Fig 5: Left: LRCN for action recognition. Right: Generic LRCN architecture for all tasks Source.

Algorithm:

During training, 16-frame clips are sampled from the video. The architecture is trained end-to-end with RGB or optical flow of the 16-frame clips as input. The prediction for each clip is the average of predictions across each time step, and the final video-level prediction is the average of predictions from each clip.
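For intuition, here is a minimal PyTorch sketch of the CNN-encoder + LSTM-decoder idea behind LRCN. The tiny encoder and layer sizes are placeholders for illustration, not the architecture used in the paper.

import torch
import torch.nn as nn

class TinyLRCN(nn.Module):
    """Minimal CNN-encoder + LSTM-decoder for clip classification (illustrative only)."""
    def __init__(self, num_classes, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(            # per-frame 2D CNN encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip):                     # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1)) # (B*T, feat_dim)
        feats = feats.view(b, t, -1)
        out, _ = self.lstm(feats)                # hidden state at every time step
        logits = self.classifier(out)            # (B, T, num_classes)
        return logits.mean(dim=1)                # average predictions over time steps

logits = TinyLRCN(num_classes=101)(torch.randn(2, 16, 3, 112, 112))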

Benchmarks (UCF101-split1):

Score Comment
82.92 Weighted score of flow and RGB inputs
71.1 Score with just RGB

My comments:

Even though the authors suggested end-to-end training frameworks, there were still a few drawbacks

  • False label assignment as video was broken to clips
  • Inability to capture long range temporal information
  • Using optical flow meant pre-computing flow features separately

Varol et al., in their work [10], tried to compensate for the limited temporal range by using a lower spatial resolution and longer clips (60 frames), which led to significantly better performance.

C3D

  • Learning Spatiotemporal Features with 3D Convolutional Networks
  • Du Tran et al.
  • Submitted on 02 December 2014
  • Arxiv Link

Key Contributions:

  • Repurposing 3D convolutional networks as feature extractors
  • Extensive search for best 3D convolutional kernel and architecture
  • Using deconvolutional layers to interpret model decision

Explanation:

In this work the authors built upon the work by Karpathy et al. However, instead of using 2D convolutions across frames, they used 3D convolutions on the video volume. The idea was to train these vast networks on Sports1M and then use them (or an ensemble of nets with different temporal depths) as feature extractors for other datasets. Their finding was that a simple linear classifier like an SVM on top of an ensemble of extracted features worked better than the state-of-the-art algorithms. The model performed even better if hand-crafted features like iDT were used additionally.


Differences in C3D paper and single stream paper Source.

The other interesting part of the work was using deconvolutional layers (explained here) to interpret the model's decisions. Their finding was that the net focused on spatial appearance in the first few frames and tracked the motion in the subsequent frames.

Algorithm:

During training, five random 2-second clips are extracted from each video, with the ground truth being the action reported for the entire video. At test time, 10 clips are randomly sampled and predictions across them are averaged for the final prediction.


3D convolution where convolution is applied on a spatiotemporal cube.
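As a minimal illustration of the difference from 2D convolutions: a 3D convolution also slides along the temporal (or depth) axis. The shapes below are hypothetical.

import torch
import torch.nn as nn

# A 2D convolution slides over (H, W); a 3D convolution also slides over time/depth.
clip = torch.randn(1, 3, 16, 112, 112)    # (batch, channels, frames, height, width)

conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), stride=1, padding=1)
out = conv3d(clip)                         # (1, 64, 16, 112, 112): spatiotemporal features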

Benchmarks (UCF101-split1):

Score Comment
82.3 C3D (1 net) + linear SVM
85.2 C3D (3 nets) + linear SVM
90.4 C3D (3 nets) + iDT + linear SVM

My comments:

Long-range temporal modeling was still a problem. Moreover, training such huge networks is computationally expensive – especially for medical imaging, where pre-training from natural images doesn’t help much.

Note: Around the same time, Sun et al. [11] introduced the concept of factorized 3D conv networks (FSTCN), where the authors explored the idea of breaking 3D convolutions into spatial 2D convolutions followed by temporal 1D convolutions. The 1D convolution, placed after the 2D conv layer, was implemented as a 2D convolution over the temporal and channel dimensions. FSTCN had comparable results on the UCF101 split.


FSTCN paper and the factorization of 3D convolution Source.
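The general factorization idea can be sketched as a spatial (1, k, k) convolution followed by a temporal (k, 1, 1) convolution. Note that FSTCN implements its temporal 1D convolution differently (as a 2D convolution over the temporal and channel dimensions); the sketch below only illustrates the spirit of the factorization.

import torch
import torch.nn as nn

class FactorizedConv3d(nn.Module):
    """A 3D convolution factorized into a spatial (1,k,k) conv followed by a
    temporal (k,1,1) conv -- the general idea behind FSTCN-style factorization."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0))

    def forward(self, x):                  # x: (B, C, T, H, W)
        return self.temporal(torch.relu(self.spatial(x)))

out = FactorizedConv3d(3, 64)(torch.randn(1, 3, 16, 56, 56))  # (1, 64, 16, 56, 56)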

Conv3D & Attention

  • Describing Videos by Exploiting Temporal Structure
  • Yao et al.
  • Submitted on 25 April 2015
  • Arxiv Link

Key Contributions:

  • Novel 3D CNN-RNN encoder-decoder architecture which captures local spatiotemporal information
  • Use of an attention mechanism within a CNN-RNN encoder-decoder framework to capture global context

Explanation:

Although this work is not directly related to action recognition, it was a landmark work for video representations. In this paper the authors use a 3D CNN + LSTM as the base architecture for a video description task. On top of the base, the authors use a pre-trained 3D CNN for improved results.

Algorithm:

The setup is almost the same as the encoder-decoder architecture described in LRCN, with two differences:

  1. Instead of passing features from the 3D CNN as-is to the LSTM, the 3D CNN feature maps for the clip are concatenated with stacked 2D feature maps for the same set of frames to enrich the representation {v1, v2, …, vn} for each frame. Note: the 2D & 3D CNNs used are pre-trained, not trained end-to-end as in LRCN.
  2. Instead of averaging temporal vectors across all frames, a weighted average is used to combine the temporal features. The attention weights are decided based on the LSTM output at every time step (see the sketch after this list).
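Here is a minimal sketch of such an attention-weighted temporal average (a simplified additive attention; the layer names and sizes are hypothetical and not the exact formulation used in the paper).

import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_pool(frame_feats, decoder_state, w_feat, w_state, v):
    """Weighted average of per-frame features {v_1..v_n}; the weights depend on
    the current LSTM decoder state (simplified additive-attention sketch)."""
    # frame_feats: (n, d), decoder_state: (h,)
    scores = v(torch.tanh(w_feat(frame_feats) + w_state(decoder_state)))  # (n, 1)
    alpha = F.softmax(scores.squeeze(-1), dim=0)                          # attention weights
    return (alpha.unsqueeze(-1) * frame_feats).sum(dim=0)                 # (d,)

n, d, h, a = 16, 512, 256, 128
pooled = attention_pool(torch.randn(n, d), torch.randn(h),
                        nn.Linear(d, a), nn.Linear(h, a), nn.Linear(a, 1))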


Attention mechanism for action recognition. Source.

Benchmarks:

No UCF101 score was reported; the network was used for video description rather than action recognition.

My comments:

This was one of the landmark works of 2015, introducing the attention mechanism for video representations for the first time.

TwoStreamFusion

  • Convolutional Two-Stream Network Fusion for Video Action Recognition
  • Feichtenhofer et al.
  • Submitted on 22 April 2016
  • Arxiv Link

Key Contributions:

  • Long range temporal modeling through better long range losses
  • Novel multi-level fused architecture

Explanation:

In this work, the authors use the base two stream architecture with two novel approaches and demonstrate a performance improvement without any significant increase in the number of parameters. The authors explore the efficacy of two major ideas:

  1. Fusion of spatial and temporal streams (how and when) – For a task like discriminating between brushing hair and brushing teeth, the spatial net can capture the spatial dependency in a video (whether it’s hair or teeth) while the temporal net can capture the presence of periodic motion at each spatial location. Hence it’s important to map the spatial feature maps pertaining to, say, a particular facial region to the temporal feature maps of the corresponding region. To achieve this, the nets need to be fused at an early level such that responses at the same pixel position are put in correspondence, rather than fusing at the end (as in the base two stream architecture).
  2. Combining temporal net output across time frames so that long-term dependency is also modeled.

Algorithm:

Everything from the two stream architecture remains almost the same, except:

  1. As described in the figure below, outputs of the conv_5 layer from both streams are fused by conv+pooling. There is another fusion at the final layer. The final fused output is used for the spatiotemporal loss evaluation.


    Possible strategies for fusing spatial and temporal streams. The one on right performed better. Source.

  2. For temporal fusion, the output of the temporal net, stacked across time and fused by conv+pooling, is used for the temporal loss.


Two stream fusion architecture. There are two paths one for step 1 and other for step 2 Source.

Benchmarks (UCF101-split1):

Score Comment
92.5 TwoStreamfusion
94.2 TwoStreamfusion + iDT

My comments:
The authors established the superiority of the TwoStreamFusion method, as it improved performance over C3D without the extra parameters used in C3D.

TSN

  • Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
  • Wang et al.
  • Submitted on 02 August 2016
  • Arxiv Link

Key Contributions:

  • Effective solution aimed at long range temporal modeling
  • Establishing the usage of batch normalization, dropout and pre-training as good practices

Explanation:

In this work the authors improved on the two stream architecture to produce state-of-the-art results. There were two major differences from the original paper:

  1. They suggest sampling clips sparsely across the video to better model the long-range temporal signal, instead of random sampling across the entire video.
  2. For the final video-level prediction the authors explored multiple strategies. The best strategy was:
    1. Combining scores of the temporal and spatial streams (and other streams if other input modalities are involved) separately by averaging across snippets
    2. Fusing the final spatial and temporal scores using a weighted average and applying a softmax over all classes.

The other important part of the work was establishing the problem of overfitting (due to small dataset sizes) and demonstrating the usage of now-prevalent techniques like batch normalization, dropout and pre-training to counter it. The authors also evaluated two new input modalities as alternatives to optical flow – namely warped optical flow and RGB difference.

Algorithm:

During training and prediction, a video is divided into K segments of equal duration. Snippets are then sampled randomly from each of the K segments. The rest of the steps remain similar to the two stream architecture, with the changes mentioned above.
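A minimal sketch of the sparse segment sampling (our own illustration, not the authors' code):

import random

def sample_snippet_indices(num_frames, k_segments=3, snippet_len=1):
    """Split a video into K equal segments and sample one snippet start index
    from each segment (sparse sampling, as opposed to random clips)."""
    seg_len = num_frames // k_segments
    starts = []
    for s in range(k_segments):
        low = s * seg_len
        high = min((s + 1) * seg_len, num_frames) - snippet_len
        starts.append(random.randint(low, max(low, high)))
    return starts

print(sample_snippet_indices(num_frames=300, k_segments=3))  # e.g. [37, 152, 260]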


Temporal Segment Network architecture. Source.

Benchmarks (UCF101-split1):

Score Comment
94.0 TSN (input RGB + Flow )
94.2 TSN (input RGB + Flow + Warped flow)

My comments:

The work attempted to tackle the two big challenges in action recognition – overfitting due to small dataset sizes and long-range temporal modeling – and the results were really strong. However, the problem of pre-computing optical flow and related input modalities still remained at large.

ActionVLAD

  • ActionVLAD: Learning spatio-temporal aggregation for action classification
  • Girdhar et al.
  • Submitted on 10 April 2017
  • Arxiv Link

Key Contributions:

  • Learnable video-level aggregation of features
  • End-to-end trainable model with video-level aggregated features to capture long term dependency

Explanation:

In this work, the most notable contribution by the authors is the usage of learnable feature aggregation (VLAD), as compared to normal aggregation using max pooling or average pooling. The aggregation technique is akin to a bag of visual words. There is a vocabulary of multiple learned anchor points (say c1, …, ck) representing k typical action (or sub-action) related spatiotemporal features. The output from each stream in the two stream architecture is encoded in terms of k-space “action word” features – each feature being the difference of the output from the corresponding anchor point for any given spatial or temporal location.


ActionVLAD – bag of action based visual “words”. Source.

Average or max pooling represents the entire distribution of points as a single descriptor, which can be sub-optimal for representing an entire video composed of multiple sub-actions. In contrast, the proposed video aggregation represents an entire distribution of descriptors with multiple sub-actions by splitting the descriptor space into k cells and pooling inside each cell.
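To make the idea concrete, here is a simplified soft-assignment VLAD-style pooling in PyTorch. It is only a sketch of the aggregation step, not the actual ActionVLAD layer (which also learns the assignment parameters end-to-end).

import torch
import torch.nn.functional as F

def soft_vlad(descriptors, anchors, alpha=10.0):
    """Aggregate descriptors (N, D) against K anchor points (K, D):
    soft-assign each descriptor to anchors, sum the residuals per anchor,
    then normalize -- a simplified sketch of VLAD-style pooling."""
    dists = torch.cdist(descriptors, anchors) ** 2                 # (N, K)
    assign = F.softmax(-alpha * dists, dim=1)                      # soft assignments
    residuals = descriptors.unsqueeze(1) - anchors.unsqueeze(0)    # (N, K, D)
    vlad = (assign.unsqueeze(-1) * residuals).sum(dim=0)           # (K, D)
    vlad = F.normalize(vlad, dim=1)                                # intra-normalization
    return F.normalize(vlad.flatten(), dim=0)                      # final (K*D,) descriptor

video_repr = soft_vlad(torch.randn(500, 128), torch.randn(32, 128))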


While max or average pooling is good for similar features, it does not adequately capture the complete distribution of features. ActionVLAD clusters the appearance and motion features and aggregates their residuals from the nearest cluster centers. Source.

Algorithm:

Everything from the two stream architecture remains almost the same, except for the usage of the ActionVLAD layer. The authors experimented with multiple positions for the ActionVLAD layer, with late fusion after the conv layers working out as the best strategy.

Benchmarks (UCF101-split1):

Score Comment
92.7 ActionVLAD
93.6 ActionVLAD + iDT

My comments:
The use of VLAD as an effective way of pooling had already been proven long before. The extension of the same into an end-to-end trainable framework made this technique extremely robust and state-of-the-art for most action recognition tasks in early 2017.

HiddenTwoStream

  • Hidden Two-Stream Convolutional Networks for Action Recognition
  • Zhu et al.
  • Submitted on 2 April 2017
  • Arxiv Link

Key Contributions:

  • Novel architecture for generating optical flow input on-the-fly using a separate network

Explanation:

The usage of optical flow in the two stream architecture makes it mandatory to pre-compute optical flow for each sampled frame beforehand, affecting storage and speed adversely. This paper advocates an unsupervised architecture to generate optical flow for a stack of frames.

Optical flow generation can be regarded as an image reconstruction problem. Given a pair of adjacent frames I1 and I2 as input, the CNN generates a flow field V. Then, using the predicted flow field V and I2, I1 can be reconstructed by inverse warping, such that the difference between I1 and its reconstruction is minimized.
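A minimal sketch of this reconstruction objective using inverse warping (PyTorch's grid_sample). The real MotionNet uses a multi-level loss with additional terms, so treat this as an illustration only.

import torch
import torch.nn.functional as F

def warp(img, flow):
    """Inverse-warp img (B, C, H, W) with a flow field (B, 2, H, W) given in pixels."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(img)   # (1, 2, H, W)
    coords = grid + flow                                               # where to sample from
    # Normalize to [-1, 1] for grid_sample (x first, then y).
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)            # (B, H, W, 2)
    return F.grid_sample(img, sample_grid, align_corners=True)

# Reconstruction loss: warp I2 back with the predicted flow and compare to I1.
I1, I2 = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)            # would come from MotionNet in practice
loss = F.l1_loss(warp(I2, flow), I1)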

Algorithm:

The authors explored multiple strategies and architectures to generate optical flow with the highest fps and fewest parameters without hurting accuracy much. The final architecture was the same as the two stream architecture, with the following changes:

  1. The temporal stream now has the optical flow generation net (MotionNet) stacked on top of the usual temporal stream architecture. The input to the temporal stream is now consecutive frames instead of pre-computed optical flow.
  2. There is an additional multi-level loss for the unsupervised training of MotionNet.

The authors also demonstrate improved performance using TSN-based fusion instead of the conventional two stream architecture.


HiddenTwoStream – MotionNet generates optical flow on-the-fly. Source.

Benchmarks (UCF101-split1):

Score Comment
89.8 Hidden Two Stream
92.5 Hidden Two Stream + TSN

My comments:
The major contribution of the paper was to improve the speed and associated cost of prediction. With automated generation of flow, the authors removed the dependency on slower traditional methods for generating optical flow.

I3D

  • Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
  • Carreira et al.
  • Submitted on 22 May 2017
  • Arxiv Link

Key Contributions:

  • Combining 3D based models into two stream architecture leveraging pre-training
  • Kinetics dataset for future benchmarking and improved diversity of action datasets

Explanation:

This paper takes off from where C3D left off. Instead of a single 3D network, the authors use two different 3D networks, one for each stream of the two stream architecture. Also, to take advantage of pre-trained 2D models, the authors repeat the pre-trained 2D weights along the 3rd dimension. The spatial stream input now consists of frames stacked along the time dimension instead of single frames as in the basic two stream architecture.

Algorithm:

Same as basic two stream architecture but with 3D nets for each stream

Benchmarks (UCF101-split1):

Score Comment
93.4 Two Stream I3D
98.0 Imagenet + Kinetics pre-training

My comments:

The major contribution of the paper was the demonstration of the benefit of using pre-trained 2D conv nets. The Kinetics dataset, open-sourced along with the paper, was the other crucial contribution.

T3D

  • Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification
  • Diba et al.
  • Submitted on 22 Nov 2017
  • Arxiv Link

Key Contributions:

  • Architecture to combine temporal information across variable depth
  • Novel training architecture & technique to supervise transfer learning between 2D pre-trained net to 3D net

Explanation:

The authors extend the work done on I3D but suggest a single-stream 3D DenseNet based architecture with a multi-depth temporal pooling layer (Temporal Transition Layer) stacked after dense blocks to capture different temporal depths. The multi-depth pooling is achieved by pooling with kernels of varying temporal sizes.


TTL Layer along with rest of DenseNet architecture. Source.

Apart from the above, the authors also devise a new technique for supervising transfer learning between pre-trained 2D conv nets and T3D. The 2D pre-trained net and T3D are both presented frames and clips from videos, where the frames and the clips may or may not come from the same video. The architecture is trained to predict 0/1 accordingly, and the error from this prediction is back-propagated through the T3D net so as to effectively transfer knowledge.


Transfer learning supervision. Source.

Algorithm:

The architecture is basically a 3D modification of DenseNet [12] with added variable temporal pooling.

Benchmarks (UCF101-split1):

Score Comment
90.3 T3D
91.7 T3D + Transfer
93.2 T3D + TSN

My comments:

Although the results don’t improve on the I3D results, that can mostly be attributed to the much lower model footprint compared to I3D. The most novel contribution of the paper was the supervised transfer learning technique.

References

  1. ConvNet Architecture Search for Spatiotemporal Feature Learning by Du Tran et al.
  2. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
  3. Action recognition by dense trajectories by Wang et. al.
  4. On space-time interest points by Laptev
  5. Behavior recognition via sparse spatio-temporal features by Dollar et al
  6. Action Recognition with Improved Trajectories by Wang et al.
  7. 3D Convolutional Neural Networks for Human Action Recognition by Ji et al.
  8. Large-scale Video Classification with Convolutional Neural Networks by Karpathy et al.
  9. Beyond Short Snippets: Deep Networks for Video Classification by Ng et al.
  10. Long-term Temporal Convolutions for Action Recognition by Varol et al.
  11. Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks by Sun et al.
  12. Densely Connected Convolutional Networks by Huang et al.

Categories
Uncategorized

Teaching Machines to Read Radiology Reports

At Qure, we build deep learning models to detect abnormalities from radiological images. These models require a huge amount of labeled data to learn to diagnose abnormalities from scans. So, we collected a large dataset from several centers, which included both in-hospital and outpatient radiology centers. These datasets contain scans and the associated clinical radiology reports.

For now, we use radiologist reports as the gold standard as we train deep learning algorithms to recognize abnormalities on radiology images. While this is not ideal for many reasons (see this), it is currently the most scalable way to supply classification algorithms with the millions of images that they need in order to achieve high accuracy.

These reports are usually written in free form text rather than in a structured format. So, we have designed a rule based Natural Language Processing (NLP) system to extract findings automatically from these unstructured reports.

CT SCAN BRAIN - PLAIN STUDY
Axial ct sections of the brain were performed from the level of base of skull. 5mm sections were done for the posterior fossa and 5 mm sections for the supra sellar region without contrast.

OBSERVATIONS: 
- Area of intracerebral haemorrhage measuring 16x15mm seen in left gangliocapsular region and left corona radiate.
- Minimal squashing of left lateral ventricle noted without any appreciable midline shift
- Lacunar infarcts seen in both gangliocapsular regions
- Cerebellar parenchyma is normal.
- Fourth ventricle is normal in position and caliber. 
- The cerebellopontine cisterns, basal cisterns and sylvian cisterns appear normal.
- Midbrain and pontine structures are normal.
- Sella and para sellar regions appear normal.
- The grey-white matter attenuation pattern is normal.
- Calvarium appears normal
- Ethmoid and right maxillary sinusitis noted

IMPRESSION:
- INTRACEREBRAL HAEMORRHAGE IN LEFT GANGLIOCAPSULAR REGION AND LEFT CORONA RADIATA 
- LACUNAR INFARCTS IN BOTH GANGLIOCAPSULAR REGIONS 

{
	"intracerebral hemorrhage": true,
	"lacunar infarct": true,
	"mass effect": true,
	"midline shift": false,
	"maxillary sinusitis": true
}

An example clinical radiology report and the automatically extracted findings

Why Rule based NLP ?

Rule based NLP systems use a list of manually created rules to parse the unorganized content and structure it. Machine Learning (ML) based NLP systems, on the other hand, automatically generate the rules when trained on a large annotated dataset.

Rule based approaches have multiple advantages when compared to ML based ones:

  1. Clinical knowledge can be manually incorporated into a rule based system, whereas capturing this knowledge in an ML based system requires a huge amount of annotation.
  2. The auto-generated rules of ML systems are difficult to interpret compared to manually curated rules.
  3. Rules can be readily added or modified to accommodate a new set of target findings in a rule based system.
  4. Previous work on clinical report parsing [1, 2] shows that the results of machine learning based NLP systems are inferior to those of rule based ones.

Development of Rule based NLP

As reports were collected from multiple centers, there were multiple reporting standards. Therefore, we constructed a set of rules to capture these variations after manually reading a large number of reports. Of these, I illustrate two common types of rules below.

Findings Detection

In reports, the same finding can be noted in several different formats. These include the definition of the finding itself or its synonyms. For example, the finding blunted CP angle could be reported in any of the following ways:

  • CP angle is obliterated
  • Hazy costophrenic angles
  • Obscured CP angle
  • Effusion/thickening

We collected all the wordings that can be used to report findings and created a rule for each finding. As an illustration, the following is the rule for blunted CP angle:

((angle & (blunt | obscur | oblitera | haz | opaci)) | (effusio & thicken))


Visualization of blunted CP angle rule

This rule is positive if a sentence contains the word angle together with blunted or one of its synonyms. Alternatively, it is also positive if a sentence contains the words effusion and thickening.

In addition, there can be a hierarchical structure among findings. For example, opacity is considered positive if any of edema, groundglass, consolidation, etc. are positive.
We therefore created an ontology of findings and rules to deal with this hierarchy.

[opacity]
rule = ((opacit & !(/ & collapse)) | infiltrate | hyperdensit)
hierarchy = (edema | groundglass | consolidation | ... )

Rule and hierarchy for opacity
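As an illustration of how such rules might be evaluated, here is a toy Python version of the blunted CP angle rule. This is not our actual rule engine, just a sketch of the idea.

import re

def blunted_cp_angle(sentence):
    """Illustrative check for the blunted CP angle rule:
    ((angle & (blunt | obscur | oblitera | haz | opaci)) | (effusio & thicken))"""
    s = sentence.lower()
    has = lambda *stems: all(re.search(stem, s) for stem in stems)
    return (has(r"angle") and has(r"blunt|obscur|oblitera|haz|opaci")) \
        or has(r"effusio", r"thicken")

print(blunted_cp_angle("Hazy costophrenic angles"))    # True
print(blunted_cp_angle("CP angles are clear"))         # False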

Negation Detection

The above mentioned rules are used to detect a finding in a report. But these are not sufficient to understand the reports. For example, consider the following sentences.

1. Intracerebral hemorrhage is absent.
2. Contusions are ruled out.
3. No evidence of intracranial hemorrhages in the brain.

Although the findings intracerebral hemorrhage, contusion and intracranial hemorrhage are mentioned in the above sentences, it is their absence rather than their presence that is noted. Therefore, we need to detect negations in a sentence in addition to findings.

We manually read several sentences that indicate negation of findings and grouped these sentences according to their structures. Rules to detect negation were created based on these groups.
One of these is illustrated below:

() & ( is | are | was | were ) & (absent | ruled out | unlikely | negative)


Negation detection structure

We can see that the first and second sentences of the above example match this rule, and therefore we can infer that the findings are negative (a minimal sketch follows the examples below):

  1. Intracerebral hemorrhage is absent → intracerebral hemorrhage negative.
  2. Contusions are ruled out → contusion negative.
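A toy sketch of how such a negation rule might be combined with a finding rule (again, illustrative only, not our production system):

import re

NEGATION = re.compile(r"\b(is|are|was|were)\b.*\b(absent|ruled out|unlikely|negative)\b")

def finding_status(sentence, finding_pattern):
    """Returns 'positive', 'negative', or None for a single sentence."""
    s = sentence.lower()
    if not re.search(finding_pattern, s):
        return None
    return "negative" if NEGATION.search(s) else "positive"

print(finding_status("Intracerebral hemorrhage is absent.", r"intracerebral h(a)?emorrhage"))  # negative
print(finding_status("Contusions are ruled out.", r"contusion"))                               # negative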

Results:

We tested our algorithm on a dataset containing 1878 clinical radiology reports of head CT scans. We manually read all the reports to create the gold standard, and used sensitivity and specificity as evaluation metrics. The results are given in the table below.

Findings — #Positives — Sensitivity (95% CI) — Specificity (95% CI)
Intracranial Hemorrhage 207 0.9807 (0.9513-0.9947) 0.9873 (0.9804-0.9922)
Intraparenchymal Hemorrhage 157 0.9809 (0.9452-0.9960) 0.9883 (0.9818-0.9929)
Intraventricular Hemorrhage 44 1.0000 (0.9196-1.0000) 1.0000 (0.9979-1.0000)
Subdural Hemorrhage 44 0.9318 (0.8134-0.9857) 0.9965 (0.9925-0.9987)
Extradural Hemorrhage 27 1.0000 (0.8723-1.0000) 0.9983 (0.9950-0.9996)
Subarachnoid Hemorrhage 51 1.0000 (0.9302-1.0000) 0.9971 (0.9933-0.9991)
Fracture 143 1.0000 (0.9745-1.0000) 1.0000 (0.9977-1.0000)
Calvarial Fracture 89 0.9888 (0.9390-0.9997) 0.9947 (0.9899-0.9976)
Midline Shift 54 0.9815 (0.9011-0.9995) 1.0000 (0.9979-1.0000)
Mass Effect 132 0.9773 (0.9350-0.9953) 0.9933 (0.9881-0.9967)

In this paper [1], the authors used an ML based NLP model (bag of words with unigrams, bigrams and trigrams, plus an average word embedding vector) to extract findings from head CT clinical radiology reports. They reported an average sensitivity and specificity of 0.9025 and 0.9172 across findings. The same metrics across our target findings on our evaluation turn out to be 0.9841 and 0.9956 respectively. So, we can conclude that rule based NLP algorithms perform better than ML based NLP algorithms on clinical reports.

References

  1. John Zech, Margaret Pain, Joseph Titano, Marcus Badgeley, Javin Schefflein, Andres Su, Anthony Costa, Joshua Bederson, Joseph Lehar & Eric Karl Oermann (2018). Natural Language–based Machine Learning Models for the Annotation of Clinical Radiology Reports. Radiology.
  2. Bethany Percha, Houssam Nassif, Jafi Lipson, Elizabeth Burnside & Daniel Rubin (2012). Automatic classification of mammography reports by BI-RADS breast tissue composition class.

Categories
Uncategorized

What We Learned Deploying Deep Learning at Scale for Radiology Images

Qure.ai is deploying deep learning for radiology across the globe. This blog is the first in the series where we will talk about our learnings from deploying deep learning solutions at radiology centers. We will cover the technical aspects of the challenges and solutions in here. The operational hurdles will be covered in the next part of this series.

The dawn of an AI revolution is upon us. Deep learning or deep neural networks have crawled into our daily lives transforming how we type, write emails, search for photos etc. It is revolutionizing major fields like healthcare, banking, driving etc. At Qure.ai, we have been working for the past couple of years on our mission of making healthcare more affordable and accessible through the power of deep learning.

Since our journey began more than two years ago, we have seen excellent progress in the development and visualization of deep learning models. With Nvidia leading the advancements in GPUs, and frameworks like Pytorch, Tensorflow and MXNet competing for mindshare, training deep learning models has become faster and easier than ever.

However, deploying these deep learning models at scale is a different beast altogether. Let’s discuss some of the major problems that Qure.ai has tackled, or is tackling, in deploying deep learning for hospitals and radiologists across the globe.

Where does the challenge lie?

Let us start by understanding how the challenges in deploying deep learning models differ from those in training them. During training, the focus is mainly on the accuracy of predictions, while deployment focuses on the speed and reliability of predictions. Models can be trained on local servers, but in deployment they need to be capable of scaling up or down depending on the volume of API requests. Companies like Algorithmia and EnvoyAI are trying to solve this problem by providing a layer over AI to serve the end users. We are already working with EnvoyAI to explore this route of deploying deep learning.

Selecting the right deep learning framework

Caffe was the first framework built to focus on production. Initially, our research team was using both Torch (flexible, imperative) as well as Lasagne/Keras (python!) for training. The release of Pytorch in late 2016 settled the debate on frameworks within our team.

Deep learning frameworks (source)

Thankfully, this happened before we started looking into deployment. Once we finalized Pytorch for training and tweaking our models, we started looking into best practices for deploying the same. Meanwhile, Facebook released Caffe2 for easier deployment, especially into mobile devices.

The AI community including Facebook, Microsoft and Amazon came together to release the Open Neural Network Exchange (ONNX), making it easier to switch between tools as needed. For example, it enables you to train your model in Pytorch and then export it into Caffe2/MXNet/CNTK (Cognitive Toolkit) for deployment. This approach is worth looking into when the load on our servers increases. But for our present needs, deploying models in Pytorch has sufficed.

Selecting the right stack

We use the following components to build our Linux servers, keeping our pythonic deep learning framework in mind.


  • Docker: For operating system level virtualization
  • Anaconda: For creating python3 virtual environments and supervising package installations
  • Django: For building and serving RESTful APIs
  • Pytorch: As deep learning framework
  • Nginx: As webserver and load balancer
  • uWSGI: For serving multiple requests at a time
  • Celery: As distributed task queue

Most of these tools can be replaced as per requirements. The following diagram represents our present stack.

Server architecture

Choosing the cloud GPU server

We use Amazon EC2 P2 instances as our cloud GPU servers primarily due to our team’s familiarity with AWS. Although, Microsoft’s Azure and Google Cloud can also be excellent options.

Automating scaling and load balancing

Our servers are built using small components performing specific services, and it was important to have them on the same host for easy configuration. Moreover, we handle large DICOM images (each between 10 and 50 MB) that get transferred between the components. It made sense to have all the components on the same host, or else the network bandwidth might get choked by these transfers. The following diagram illustrates the various software components comprising a typical Qure deployment.

Software Components

We started with launching qXR (Chest X-ray product) on a P2 instance but as the load on our servers rose, managing GPU memory became an overhead. We were also planning to launch qER (HeadCT product) which had even higher GPU memory requirements.

Initially, we started by buying new P2 instances. Optimizing their usage and making sure that a few instances were not bogged down by the incoming load while other instances remained comparatively free became a challenge. It became clear that we needed auto-scaling for our containers.

Load balancing improves the distribution of workloads across instances (source)

That was when we started looking into solutions for managing our containerized applications. We decided to go ahead with Kubernetes (Amazon ECS is also an excellent alternative), mainly because it runs independently of any specific provider (ECS has to be deployed on the Amazon cloud). Since many hospitals and radiology centers prefer on-premise deployment, Kubernetes is clearly better suited for such needs. It makes life easier through automatic bin-packing of containers based on resource requirements, simpler horizontal scaling, and load balancing.

GPU memory management

Initially, when qXR was deployed, it dealt with fewer abnormalities. So for an incoming request, loading the models into memory, processing images through them and then releasing the memory worked fine. But as the number of abnormalities (and thereby models) increased, loading all the models sequentially for each incoming request became an overhead.

We thought of accumulating incoming requests and processing images in batches on a periodic basis. This could have been a decent solution, except that time is critical when dealing with medical images, more so in emergency situations. It was especially critical for qER where, in cases of stroke, one has less than an hour to make a diagnostic decision. This ruled out the batch processing approach.

Beware of GPUs !! (warning at Qure's Mumbai office)

Moreover, our models for qER were even larger and required approximately 10x the GPU memory of the qXR models. Another thought was to keep the models loaded in memory and process images through them as the requests arrive. This is a good solution where you need to run your models every second or even millisecond (think of AI models running on the millions of images being uploaded to Facebook or Google Photos). However, this is not a typical scenario in the medical domain: radiology centers do not encounter patients at that scale. Even if the servers send back the results within a couple of minutes, that’s roughly a 30x improvement over the time a radiologist would take to report the scan – and that assumes a radiologist is immediately available. Otherwise, the average turnaround time for a chest X-ray scan varies from 1 to 2 days (700-1400x of what we take currently).

As of now, auto-scaling with Kubernetes solves our problems, but we will definitely revisit this in the future. The solution probably lies somewhere between the two approaches (think of a caching mechanism for deep learning models).

Conclusion

Training deep learning models, especially in healthcare, is only one part of building a successful AI product. Bringing it to healthcare practitioners is a formidable and interesting challenge in itself. There are other operational hurdles like convincing doctors to embrace AI, offline working style at some hospitals (using radiographic films), lack of modern infrastructure at radiology centers (operating systems, bandwidth, RAM, disk space, GPU), varying procedures for scan acquisition etc. We will talk about them in detail in the next part of this series.

Note

For a free trial of qXR and qER, please visit us at scan.qure.ai

Categories
Uncategorized

Visualizing Deep Learning Networks – Part II

In the previous post we looked at methods to visualize and interpret the decisions made by deep learning models using perturbation based techniques.
To summarize the previous post: perturbation based methods do a good job of explaining decisions, but they suffer from expensive computation and instability to surprise artifacts. In this post, we give a brief overview of the various gradient-based algorithms for deep learning based classification models, along with their drawbacks.

We would be discussing the following types of algorithms in this post:

  1. Gradient-based algorithms
  2. Relevance score based algorithms

In gradient-based algorithms, the gradient of the output with respect to the input is used to construct the saliency maps. The algorithms in this class differ in how the gradients are modified during backpropagation. Relevance score based algorithms try to attribute the relevance of each input pixel by backpropagating the probability score instead of the gradient. All of these methods involve a single forward and backward pass through the net to generate heatmaps, as opposed to multiple forward passes for the perturbation based methods. Consequently, they are computationally cheaper and free of the artifacts originating from perturbation techniques.

To illustrate each algorithm, we consider a chest X-ray (image below) of a patient diagnosed with pulmonary consolidation. Pulmonary consolidation is simply a “solidification” of the lung tissue due to the accumulation of solid and liquid material in the air spaces that would normally be filled by gas [1]. The dense material deposition in the airways could be caused by infection or pneumonia (deposition of pus), lung cancer (deposition of malignant cells), pulmonary hemorrhage (airways filled with blood), etc. An easy way to diagnose consolidation is to look for dense abnormal regions with ill-defined borders in the X-ray image.


Chest X-ray with consolidation.

We would be considering this X-ray and one of our models trained for detecting consolidation for demonstration purposes. For this patient, our consolidation model predicts a possible consolidation with 98.2% confidence.

Gradient Based

Gradient Input

  • Deep inside convolutional networks: Visualising image classification models and saliency maps
  • Submitted on 20 Dec 2013
  • Arxiv Link

Explanation:
Measure the relative importance of input features by calculating the gradient of the output decision with respect to those input features.

Two very similar papers pioneered this idea in 2013. In these papers – saliency maps [2] by Simonyan et al. and DeconvNet [3] by Zeiler et al. – the authors directly used the gradient of the predicted class score with respect to the input to observe salient features. The main difference between the papers was how they handled the backpropagation of gradients through non-linear layers like ReLU. In the saliency maps paper, the gradients of neurons with negative input were suppressed while propagating through ReLU layers. In the DeconvNet paper, the gradients of neurons with incoming negative gradients were suppressed.

Algorithm:
Given an image I0, a class c, and a classification ConvNet with class score function Sc(I), the heatmap is calculated as the absolute value of the gradient of Sc with respect to I at I0:
\[ \left. \frac{\partial S_c}{\partial I} \right|_{I_0} \]

It is worth noting that the DeepLIFT paper (which we’ll discuss later) also explores gradient * input as an alternative indicator, as it leverages the strength and sign of the input:
\[ \left. \frac{\partial S_c}{\partial I} \right|_{I_0} \cdot I_0 \]
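A minimal PyTorch sketch of computing such a gradient saliency map (and the gradient * input variant). The model and input are placeholders; this is not our production code.

import torch

def gradient_saliency(model, image, class_idx):
    """Absolute gradient of the class score w.r.t. the input image, plus the
    gradient * input variant (a minimal sketch)."""
    model.eval()
    image = image.clone().requires_grad_(True)          # (1, C, H, W)
    score = model(image)[0, class_idx]
    score.backward()
    grad = image.grad.detach()
    saliency = grad.abs().max(dim=1)[0]                 # collapse channels -> (1, H, W)
    grad_times_input = (grad * image.detach()).sum(dim=1)
    return saliency, grad_times_input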


Heatmap by GradInput against original annotation.

Shortcomings:
The problem with such a simple algorithm arises from non-linear activation functions like ReLU, ELU, etc. Such non-linear functions, being non-differentiable at certain locations, have discontinuous gradients. Since these methods measure partial derivatives with respect to each pixel, the gradient heatmap is inherently discontinuous over the image and produces artifacts if viewed as-is. Some of this can be overcome by convolving with a Gaussian kernel. The gradient flow also suffers in the case of renormalization layers like BatchNorm or max pooling.

Guided Backpropagation

  • Striving for simplicity: The all convolutional net
  • Submitted on 21 Dec 2014
  • Arxiv Link

Explanation:
The next paper [4], by Springenberg et al., released in 2014, introduces GuidedBackprop, which suppresses the flow of gradients through neurons where either the input or the incoming gradient is negative. Springenberg et al. showed the differences among these methods through a beautiful illustration, given below. As we discussed, this paper combines the gradient handling of both Simonyan et al. and Zeiler et al.

GuidedBackprop

Schematic of visualizing the activations of high layer neurons. a) Given an input image, we perform the forward pass to the layer we are interested in, then set to zero all activations except one and propagate back to the image to get a reconstruction. b) Different methods of propagating back through a ReLU nonlinearity. c) Formal definition of different methods for propagating a output activation out back through a ReLU unit in layer l; note that the ’deconvnet’ approach and guided backpropagation do not compute a true gradient but rather an imputed version. Source.

Annotated_x

Heatmap by GuidedBackprop against original annotation.
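Guided backpropagation can be sketched with PyTorch's autograd hooks: the standard ReLU backward pass already masks positions where the input was negative, and clamping the resulting gradient at zero additionally suppresses negative incoming gradients, which is exactly the guided rule. A minimal sketch, assuming a model built from non-inplace `nn.ReLU` modules and an output of shape (N, num_classes); all names are illustrative.

```python
import torch
import torch.nn as nn

class GuidedBackprop:
    """Minimal guided-backprop sketch: clamp negative gradients at every ReLU."""

    def __init__(self, model):
        self.model = model.eval()
        self.handles = []
        for m in self.model.modules():
            if isinstance(m, nn.ReLU):   # assumes ReLU(inplace=False)
                self.handles.append(m.register_full_backward_hook(self._hook))

    @staticmethod
    def _hook(module, grad_input, grad_output):
        # grad_input already carries ReLU's usual (input > 0) mask; clamping it
        # additionally zeroes out negative incoming gradients (the "guided" rule).
        return (torch.clamp(grad_input[0], min=0.0),)

    def attribute(self, image, class_idx=0):
        image = image.detach().clone().requires_grad_(True)
        score = self.model(image)[0, class_idx]
        self.model.zero_grad()
        score.backward()
        return image.grad.detach().abs().sum(dim=1).squeeze(0)

    def remove(self):
        # Detach the hooks once we are done visualising.
        for h in self.handles:
            h.remove()
```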

Shortcomings:
The flow of gradients through ReLU layers still remained a problem at large. Handling renormalization layers was also unresolved, since most of the papers above (including this one) proposed fully convolutional architectures (without max-pooling layers), and batch normalization had yet to be 'alchemised' in 2014. Another such fully-convolutional architecture paper was CAM [6].

Grad CAM

  • Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
  • Submitted on 07 Oct 2016
  • Arxiv Link

Explanation:
An effective way to circumvent these backpropagation problems was explored in GradCAM [5] by Selvaraju et al. The paper generalizes the CAM [6] algorithm of Zhou et al. so that attribution maps can be computed even for architectures with fully connected layers. The idea is: instead of propagating gradients all the way back to the input, can the activation maps of the final convolutional layer be used directly to infer a downsampled relevance map of the input pixels? This downsampled heatmap is then upsampled to obtain a coarse relevance heatmap.

Algorithm:


Let the feature maps in the final convolutional layer be F1, F2, …, Fn. As before, assume an image I0, a class c, and a classification ConvNet with class score function Sc(I).

  1. A weight wi is computed for each feature map Fi by global-average-pooling the gradient of the class score with respect to that map (Z is the number of pixels in a feature map):
    \(w_i = \frac{1}{Z}\sum_{u,v}\frac{\partial S_c}{\partial F_i(u,v)} \quad \forall\, i = 1 \dots n\)
  2. The weights and the corresponding feature maps are multiplied to compute the weighted activations (A1, A2, …, An):
    \(A_i = w_i \cdot F_i \quad \forall\, i = 1 \dots n\)
  3. The weighted activations are added pixel-wise across feature maps (and passed through a ReLU to keep only positive evidence) to give the downsampled importance map \(H\):
    \(H_{u,v} = \mathrm{ReLU}\!\left(\sum_{k=1}^{n} A_k(u,v)\right)\)
  4. The downsampled heatmap \(H\) is upsampled to the original image dimensions to produce the coarse-grained relevance heatmap.
  5. [Optional] The authors suggest multiplying the final coarse heatmap element-wise with the heatmap obtained from GuidedBackprop to obtain a finer heatmap.

Steps 1-4 make up the GradCAM method; including step 5 gives the Guided GradCAM method. Below is what a heatmap generated by GradCAM looks like. The key contribution of the paper was generalizing CAM to architectures that contain fully-connected layers.

Annotated_x

Heatmap by GradCAM against original annotation.
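A minimal PyTorch sketch of steps 1-4, using hooks to capture the final convolutional feature maps and their gradients; `model` and `target_layer` are placeholders for whichever network and final convolutional layer one wants to inspect.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx=0):
    """Grad-CAM heatmap, upsampled to the resolution of the input image."""
    feats, grads = {}, {}

    # Capture the activations F_1..F_n and their gradients at the target layer.
    h1 = target_layer.register_forward_hook(
        lambda m, inp, out: feats.update(a=out))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gin, gout: grads.update(g=gout[0]))

    model.eval()
    score = model(image)[0, class_idx]               # S_c
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    # Step 1: per-map weights = global average of the gradients.
    w = grads['g'].mean(dim=(2, 3), keepdim=True)    # shape (1, n, 1, 1)
    # Steps 2-3: weighted sum of feature maps, clipped by ReLU.
    cam = F.relu((w * feats['a']).sum(dim=1, keepdim=True))
    # Step 4: upsample the coarse map to the input resolution.
    cam = F.interpolate(cam, size=image.shape[-2:], mode='bilinear',
                        align_corners=False).squeeze()
    return cam / (cam.max() + 1e-8)                  # normalise to [0, 1]
```

For Guided GradCAM (step 5), this upsampled map can simply be multiplied element-wise with the GuidedBackprop heatmap from the earlier sketch.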

Shortcomings:
The algorithm steers clear of backpropagating gradients all the way to the input; they are propagated only up to the final convolutional layer. Whereas CAM was restricted to architectures that connect the convolutional layers to the output through a global average pooling layer, GradCAM lifts that restriction. Its major remaining drawback is that the heatmap has the resolution of the final feature maps, so upsampling it to the input size introduces artifacts and loss of signal.

Relevance score based

There are a couple of major problems with the gradient-based methods which can be summarised as follows:

  1. Discontinuous gradients for some non-linear activations: As explained in the figure below (taken from the DeepLIFT paper), the discontinuities in gradients cause undesirable artifacts. Moreover, attribution does not propagate back smoothly through such non-linearities, which distorts the attribution scores.

    Discontinuous gradients

    Discontinuity problems of gradient-based methods. Source.

  2. Saturation of gradients: As illustrated by the simple network below, once i1 + i2 > 1 the output saturates, so the gradient of the output with respect to either input no longer changes and the inputs receive no credit even though they clearly drive the prediction (a small numerical sketch follows the figure below).

Gradient saturation

Saturation problems of gradient-based methods. Source.
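The saturation effect is easy to reproduce numerically. Below is a toy PyTorch sketch with a network of the form y = 1 - max(0, 1 - (i1 + i2)), our own illustrative stand-in for the paper's example: once i1 + i2 exceeds 1, the output is stuck at 1 and both gradients vanish.

```python
import torch

def toy_net(i1, i2):
    # Saturating toy network: the output is constant (= 1) whenever i1 + i2 > 1.
    return 1.0 - torch.relu(1.0 - (i1 + i2))

for a, b in [(0.2, 0.3), (0.8, 0.9)]:
    i1 = torch.tensor(a, requires_grad=True)
    i2 = torch.tensor(b, requires_grad=True)
    y = toy_net(i1, i2)
    y.backward()
    print(f"i1={a}, i2={b} -> y={y.item():.2f}, "
          f"dy/di1={i1.grad.item():.2f}, dy/di2={i2.grad.item():.2f}")
# i1=0.2, i2=0.3 -> y=0.50, dy/di1=1.00, dy/di2=1.00
# i1=0.8, i2=0.9 -> y=1.00, dy/di1=0.00, dy/di2=0.00
```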

Layerwise Relevance Propagation

  • On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation
  • Published on July 10, 2015
  • Journal Link

Explanation:
To counter these issues, a relevance-score-based attribution technique was first discussed by Bach et al. in 2015 [7]. The authors suggested a simple yet strong technique: propagate the relevance score backwards, redistributing it at each layer in proportion to the activations of the previous layer. Because the redistribution is based on activations rather than gradients, the difficulties that arise with non-linear activation layers are avoided.

Algorithm:


This formulation follows epsilon-LRP [8], where a small epsilon is added to the denominator so that relevance is propagated with numerical stability. As before, assume an image I0, a class c, and a classification ConvNet with class score function Sc(I).

  1. The relevance score (Rf) of the final layer is set to Sc.
  2. While the input layer is not reached:
    • Redistribute the relevance scores of the current layer, \(R^{(l+1)}\), to the previous layer, \(R^{(l)}\), in proportion to the contributions of its neurons.
      Let \(z_{ij}\) be the contribution of the i-th neuron in layer l to the pre-activation of the j-th neuron in layer l+1, with
      \(z_j = \sum_{i} z_{ij}\)
      The relevance arriving at neuron j is then redistributed to the neurons of layer l as
      \(R_i^{(l)} = \sum_{j} \frac{z_{ij}}{z_j + \epsilon \cdot \mathrm{sign}(z_j)}\, R_j^{(l+1)}\)

Relevance propagation


Annotated_x

Heatmap by Epsilon LRP against original annotation.
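For a single fully-connected layer, one epsilon-LRP redistribution step can be written in a few lines. A minimal PyTorch sketch, assuming pre-activation contributions z_ij = a_i * w_ij; how the bias relevance is handled varies between formulations, and the names below are illustrative.

```python
import torch

def lrp_epsilon_step(a, weight, bias, relevance_out, eps=1e-6):
    """One epsilon-LRP step through a fully-connected layer.

    a             : activations of layer l,        shape (n_in,)
    weight, bias  : parameters of the layer,       shapes (n_out, n_in), (n_out,)
    relevance_out : relevance of layer l+1 (R_j),  shape (n_out,)
    returns       : relevance of layer l   (R_i),  shape (n_in,)
    """
    z_ij = a.unsqueeze(0) * weight                  # contributions, (n_out, n_in)
    z_j = z_ij.sum(dim=1) + bias                    # z_j = sum_i z_ij (+ bias)
    z_j = z_j + eps * torch.sign(z_j)               # epsilon for numerical stability
    # Redistribute each R_j to the inputs in proportion to their contributions.
    return (z_ij / z_j.unsqueeze(1) * relevance_out.unsqueeze(1)).sum(dim=0)
```

Starting from a relevance of Sc at the output and applying such a step layer by layer down to the input yields the pixel-wise relevance map.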

DeepLIFT

  • Learning Important Features Through Propagating Activation Differences
  • Submitted on 10 Apr 2017
  • Journal Link

Explanation:
The last paper [9] we cover in this series is also based on layer-wise relevance. However, instead of directly explaining the output prediction as the previous methods do, the authors explain the difference between the output prediction and the prediction on a baseline reference image. The concept is similar to Integrated Gradients, which we discussed in the previous post. The authors raise a valid concern with the gradient-based methods described above: gradients do not use a reference, which limits the inference, because they only describe the local behaviour of the output at the specific input value without considering how the output behaves over a range of inputs.

Algorithm:
The reference image (IR) is chosen as a neutral image suitable for the problem at hand. For a class c and a classification ConvNet with class score function Sc(I), let SRc be the score for the reference image IR. The relevance score that is propagated back to the inputs is not Sc but the difference Sc – SRc.
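In practice, DeepLIFT's layer-wise rules are easiest to apply through an existing implementation. Below is a minimal sketch using the open-source Captum library, with an all-zeros tensor standing in for the reference image IR; that baseline choice and the function names are ours for illustration, and a clinically neutral reference would be chosen in practice.

```python
import torch
from captum.attr import DeepLift

def deeplift_heatmap(model, image, class_idx=0):
    """DeepLIFT attributions with respect to an all-zeros reference image.

    model : a CNN returning class scores; image : tensor of shape (1, C, H, W)
    """
    model.eval()
    baseline = torch.zeros_like(image)        # reference image I_R (assumed blank)
    dl = DeepLift(model)
    # Attributes S_c(I_0) - S_c(I_R) to the input pixels for the chosen class.
    attributions = dl.attribute(image, baselines=baseline, target=class_idx)
    return attributions.abs().sum(dim=1).squeeze(0)   # (H, W) heatmap
```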

Discussions

We have so far covered both perturbation-based and gradient-based methods. Computationally and practically, perturbation-based methods are not much of a win, although their behaviour is relatively uniform and consistent with an intuitive notion of interpretability. Gradient-based methods are computationally cheaper and measure the contribution of pixels in the local neighborhood of the original image, but they are hampered by the difficulties of propagating gradients back through non-linear and renormalization layers. Layer-wise relevance techniques go a step further and directly redistribute relevance in proportion to activations, thereby steering clear of the problems of propagating through non-linear layers. To capture the importance of pixels beyond the local neighborhood of the original intensities, DeepLIFT redistributes the difference between the activations on an image and those on a baseline image.

We’ll be following up with a final post on the performance of all the methods discussed in the current and previous post and detailed analysis of their performance.

References

  1. Consolidation of Lung – Signs, Symptoms and Causes
  2. Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
  3. Zeiler, M. D., & Fergus, R. (2014, September). Visualizing and understanding convolutional networks. In European conference on computer vision (pp. 818-833). Springer, Cham.
  4. Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.
  5. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2016). Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. arXiv preprint arXiv:1610.02391.
  6. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2921-2929).
  7. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K. R., & Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7), e0130140.
  8. Samek, W., Binder, A., Montavon, G., Lapuschkin, S., & Müller, K. R. (2017). Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems.
  9. Shrikumar, A., Greenside, P., & Kundaje, A. (2017). Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685.