
Burning Issue: Why Opportunistic Screening for Lung Cancer is the need of the hour

'Cancer Cures Smoking'

Did the above line make you look twice and think thrice? Years ago, the Cancer Patients Aid Association published this thought-provoking message, a genuinely fresh view on the relationship between tobacco and cancer. And why not?

Extensive research from across the world indicates that cigarette smoking can explain almost 90% of lung cancer risk in men and 70-80% in women. The WHO lists tobacco use as the leading risk factor for cancer. The World Cancer Research Fund International goes a step further and plainly calls out smoking. With lung cancer racking up 2.21 million new cases and 1.8 million deaths in a single year, one can understand why healthcare stakeholders want to focus their efforts on targeting common causes and reducing the incidence of the disease.

Yet, a recent study indicates troubling trends.

Medanta Hospital is one of India’s leading medical facilities. Its research on lung cancer prevalence, conducted on 304 patients over a decade (2012-2022), threw up a startling statistic: 50% of the lung cancer patient cohort were non-smokers. According to the doctors who conducted the research, Dr Arvind Kumar, Dr Belal Bin Asaf and Dr Harsh Puri, this was a sharp rise from earlier figures for non-smoking lung cancer patients (10-20%). But there’s more.

The study indicates that, be it smokers or non-smokers, the risk group for lung cancer has expanded to a relatively more youthful population.

The WHO previously flagged a key factor behind the rising trend of young non-smokers being at risk for lung diseases: air pollution. Dr Tedros Adhanom Ghebreyesus called air pollution a ‘silent public health emergency’ and ‘the new tobacco’. It presents clinicians working to treat and prevent lung cancer with a new conundrum when evaluating risk factors for the disease.

Simply put, how does one tackle the risk of lung cancer in a 25-year-old, non-smoking individual living a reasonably healthy lifestyle when a risk factor could be the simple act of breathing?

According to Dr. Matthew Lungren, the answer could be Opportunistic Screening, which he calls “… the BEST use case for AI in radiology”. qXR, our artificial intelligence (AI) solution for chest X-rays, has been tried, tested and trusted to assist in identifying and reporting missed nodules, which highlights the importance of opportunistic screening for identifying potential lung cancers early.

Our recent studies, including a retrospective multi-center study with Massachusetts General Hospital (MGH), concluded that Qure’s CE-approved qXR could identify critical findings on chest X-rays, including malignant nodules. This strengthens the case that opportunistic screening for indicators of lung cancer and other pulmonary diseases should become the norm. Qure’s solutions can truly make a difference, augmenting the efforts of clinicians and radiologists any and every time a chest X-ray or chest CT is conducted.

November is Lung Cancer Awareness Month. What better moment than the last day of the month to urge everyone to think outside the box when it comes to demographics, risk factors, screening, and the role of AI in healthcare?


Taking No Chances: Opportunistic Screening’s Role in Early Lung Cancer Detection

Key Highlights

  • Over 20M Chest CTs are performed every year in the USA alone  
  • Every chest CT scan is a potential lung cancer screening opportunity 
  • Chest CT scanning increased significantly during the pandemic 
  • Qure conducted a deep-learning study using COVID-19 chest CTs to screen for actionable nodules


H. Jackson Brown, Jr. once said that nothing is more expensive than a missed opportunity. Lung cancer is perhaps the ideal example of this, because incidental/early detection via opportunistic screening can play a significant role in helping to successfully combat the malady.

Lung cancer accounts for 1 in 5 cancer deaths yearly and is the leading cause of cancer-related deaths worldwide. It also accounts for the greatest economic and public health burden of all cancers annually, approximately $180 billion. This is partly because the prognosis for lung cancer is poor compared to other cancers, largely due to a high proportion of cases being detected at an advanced stage, where treatment options are limited and the 5-year survival rate is only 5-15%. The global pandemic strained healthcare systems worldwide and also led to a significant increase in chest CT volumes.

“Earlier we would conduct approximately 300 chest CT scans per month. During the pandemic, this number rose to 7000 per month. It put a severe strain on doctors who must review every scan. Qure’s AI solution, qCT, made a significant difference to us by flagging missed actionable nodules on chest CT scans for further follow-ups & investigations.”
– Arpit Kothari, CEO, bodyScans

The large volume of scans during the pandemic allowed Qure to conduct a study using a deep-learning approach to opportunistic screening for actionable lung nodules.


The study used Qure’s deep-learning approach to identify lung nodules on CT scans of patients who were scanned for COVID-19 at 5 radiology centers across different cities in India.

The scans were sourced from bodyScans, a leading radiology service provider in Central India, and Aarthi Scans & Labs, another major diagnostic provider with 40 full-fledged diagnostic centers across India.

2502 scans were randomly selected from chest CTs performed at 5 sites in the two specialist radiology chains, Aarthi Scans and bodyScans, during India’s 2nd and 3rd waves of COVID-19. They were processed by qCT, Qure’s AI capable of detecting and characterizing lung nodules. The radiologist reports of the cases flagged by qCT were reviewed for findings suggestive of cancer. Flagged cases for which the nodule was not reported were re-read by an independent radiologist with AI assistance on a web portal. The radiologist was asked to either confirm or reject the flag, rate the nodule for malignancy potential if confirmed, or provide an alternate finding if rejected (see Figure).


  • 2502 CT scans were processed in total.  
  • Of these, 23.7% were flagged by qCT and re-read by an independent thoracic radiologist.  
  • In 19.4% of these flagged cases, the radiologist agreed that there were unreported actionable nodules.  
  • There were 19 cases where radiologists did not rule out the risk of malignancy and 2 out of these were rated as probably malignant.  


In the study, Qure’s AI tool assisted in reporting missed nodules, which highlights the importance of opportunistic screening for identifying potential lung cancers early. The need to improve the efficiency and speed of clinical care continues to drive multiple innovations into practice, including AI. With the increasing demand for superior healthcare services and the large volumes of data generated daily from parallel streams, streamlining clinical workflows has become a pressing issue. In our study, by using AI as a safety net, we found 21 chest CTs that should have warranted follow-up management for the patients.

“Early detection plays a critical role in successfully treating Lung Cancer. Yet, there are several factors which contribute to the significant risk of these nodules getting missed in chest CT scans. Qure’s AI solution, qCT is immensely useful because it acts as a safety net; another pair of eyes to ensure that we clinicians can identify those patients who need immediate help. Eventually, AI can augment our efforts to defeat the disease.”
– Dr. Arunkumar Govindarajan, Director, Aarthi Scans & Labs


AI-Based Gaze Deviation Detection to Aid LVO Diagnosis in NCCT


Strokes occur when blood supply to the brain is interrupted or reduced, depriving brain tissue of oxygen and nutrients. It is estimated that a patient can lose 1.9 million neurons each minute while a stroke goes untreated, so the treatment of stroke is a medical emergency that requires early intervention to minimize brain damage and complications. Furthermore, a stroke caused by emergent large vessel occlusion (LVO) requires even more prompt identification to improve clinical outcomes.

Neuro interventionalists need to activate their operating rooms to prepare candidates identified for endovascular therapy (EVT) as soon as possible. As a result, identifying imaging findings on non-contrast computed tomography (NCCT) that are predictive of LVO would aid in identifying potential EVT candidates. We present and validate gaze deviation as an indicator to detect LVO using NCCT. In addition, we offer an Artificial Intelligence (AI) algorithm to detect this indicator.

What is LVO?

Large vessel occlusion (LVO) stroke is caused by a blockage in one of the following brain vessels:

  1. Internal Carotid Artery (ICA) 
  2. ICA terminus (T-lesion; T occlusion) 
  3. Middle Cerebral Artery (MCA) 
  4. M1 MCA 
  5. Vertebral Artery 
  6. Basilar Artery

Image source: ScienceDirect

LVO strokes are considered one of the more severe kinds of strokes, accounting for approximately 24% to 46% of acute ischemic strokes. For this reason, acute LVO stroke patients often need to be treated at comprehensive centers that are equipped to handle LVOs. 

Endovascular Treatment (EVT)

EVT is a treatment given to patients with acute ischemic stroke. Using this treatment, clots in large vessels are removed, helping deliver better outcomes. EVT evaluation needs to be done as early as possible for patients who meet the eligibility criteria, since early access to EVT improves patient outcomes. The timeframe to perform EVT usually extends to 16-24 hours in most acute ischemic cases.

Image Source: PennMedicine

Goal for EVT

Since it is important to perform this procedure as early as possible, how do we get there?

LVO detection on NCCT

There are three signs to consider for this:

  1. Absence of blood
  2. Hyperdense vessel sign or dot sign
  3. Gaze deviation (often overlooked on NCCT) 

Gaze deviation and its relationship with acute stroke

Several studies suggest that gaze deviation is largely associated with the presence of LVO [1,2,3].

Stroke patients with eye deviation on admission CT have higher rates of disability/death and hemorrhagic transformation. Consistent assessment and documentation of radiological eye deviation on acute stroke CT scan may help with prognostication [4].

AI algorithm to identify gaze deviation

We developed an AI algorithm that reports the presence of gaze deviation given an NCCT scan. Such AI algorithms have tremendous potential to aid in this triage process. The AI algorithm was trained using a set of scans to identify gaze direction and midline of the brain. The gaze deviation is calculated by measuring the angle between the gaze direction and the midline of the brain. We used this AI algorithm to identify clinical symptoms of ipsiversive gaze deviation in stroke patients with LVO treated with EVT. The AI algorithm has a sensitivity and specificity of 80.8% and 80.1% to detect LVO using gaze deviation as the sole indicator. The test set had 150 scans with LVO-positive cases where thrombectomy was performed.
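
To make the geometry concrete, here is a minimal NumPy sketch of how an angle between a predicted gaze direction and the brain midline could be computed. The function name and vector inputs are illustrative; this is not Qure's production implementation, which derives both directions from models trained on annotated scans.

```python
import numpy as np

def gaze_deviation_angle(gaze_vec, midline_vec):
    """Angle (in degrees) between a 2D gaze direction and the brain midline.

    Both inputs are direction vectors in image coordinates, e.g. produced by
    landmark or segmentation models (hypothetical here).
    """
    gaze = np.asarray(gaze_vec, dtype=float)
    midline = np.asarray(midline_vec, dtype=float)
    cos_theta = gaze @ midline / (np.linalg.norm(gaze) * np.linalg.norm(midline))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Example: a gaze direction deviated ~15 degrees from a vertical midline
theta = np.radians(15)
print(gaze_deviation_angle([np.sin(theta), np.cos(theta)], [0, 1]))  # ~15.0
```

A scan would then be called positive for this sign when the measured angle exceeds a chosen threshold.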


Ipsiversive Gaze deviation on NCCT is a good predictor of LVO due to proximal vessel occlusions in ICA terminus and M1 occlusions. However, it is a poor predictor of LVO due to M2 occlusion. We report an AI algorithm that can identify this clinical sign on NCCT. These findings can aid in the triage of LVO patients and expedite the identification of EVT candidates. 

We are presenting this AI method at SNIS 2022, Toronto. Please attend our oral presentation on 28th July 2022 at 12:15 PM (Toronto time).


Upadhyay, Ujjwal; Golla, Satish; Kumar, Shubham; Szweda, Kamila; Shahripour, Reza; Tarpley, Jason (2022). Society of NeuroInterventional Surgery (SNIS).


Need for Speed: AI, AstraZeneca, and early lung cancer diagnosis

The AstraZeneca-Qure partnership

A journey of a thousand miles begins with a single step. In 2020, Qure.ai and AstraZeneca took the first step together to integrate advanced artificial intelligence (AI) solutions to identify lung diseases early in patients across AstraZeneca’s Emerging Markets region: Latin America, Asia, Africa, and the Middle East. In the past 2 years, the partnership has made significant progress, incorporating the use of AI technology with chest X-rays for multi-disease screening, including tuberculosis and heart failure along with lung cancer.

Lung Cancer: The need for early detection

In more than 40% of people suffering from lung cancer, the disease is detected at Stage 4, when the likelihood of surviving 5 years is under 10%. Only 20% are diagnosed at Stage 1, when the survival rate is between 68-92%. That is why lung cancer is responsible for 1 in 5 cancer deaths worldwide.

Though early detection facilitates early diagnosis and better patient outcomes, the disease’s silent progress to advanced stages makes it a challenge like none other. Low-dose CT (LDCT) remains the most effective means of screening for lung cancer. However, in LMICs, CTs can be prohibitively expensive, priced between USD 500-700, limiting access. Still, there is some hope.

Chest X-rays are one of the most routinely performed exams in the world, representing 40% of the approximately 3.6 billion imaging tests performed annually. As a non-invasive diagnostic test with easy access and low cost, the chest X-ray is a valuable first-line test to screen for radiological indications of issues in the lungs, heart, ribs, and more. Acquiring a chest X-ray takes only minutes, but it requires expert radiologists to read and analyze it.

Augmenting X-rays with the power of AI

qXR, Qure.ai's AI-powered chest X-ray interpretation tool, can automatically detect and localize up to 30 abnormalities, including indicators of lung cancer, TB, and COVID-19. This is particularly impactful when millions of scans are examined using qXR to report any abnormalities that could otherwise be missed due to:

  • Lack of experienced personnel
  • Increased workloads, limiting access and time for detailed reads of abnormal scans
  • Incidental nodules indicative of lung cancer being missed because physicians are only looking at the results for which the X-rays were ordered, not at incidental findings.

How Qure is making a difference

1. Working with grassroot level healthcare professionals

A. Leveraging Primary Care GP clinics in Malaysia

Qualitas Medical Group (QMG) is a chain of integrated general practice (GP) clinics, dental clinics, medical imaging centers, and ambulatory care management centers that plays an integral role in Malaysia’s health system. Along with Lung Cancer Network Malaysia, QMG uses qXR to triage all chest X-rays taken of local workers, identifying incidental lung nodules that may be indicative of lung cancer for further testing. qXR has also helped GPs reduce their dependency on radiologists for second reads and has cut the reporting turnaround time for chest X-rays from 2 days to the same day.

“Qure.ai’s state-of-the-art deep learning technology is a potential game changer that will enhance and expedite diagnosis with rapid referral to the relevant specialty,” said Dr Anand Sachithanandan, President, Lung Cancer Network Malaysia.

B. Empowering Primary Care Physicians in Latin America

Primary care centers are the first medical care touchpoint and are crucial stakeholders for early diagnosis in disease care pathways. In collaboration with Lung Ambition Alliance, Latin America, Qure is empowering primary care physicians in 12 different countries with AI-enabled smart phone-based chest X-ray analysis and lung nodule screening.

In the absence of digital X-rays, physicians only need to click a picture of the X-ray film against a lightbox and upload it on the app to receive instant qXR analysis. Based on the results, they can guide the patient to the next appropriate steps.

2. Collaborating with Cancer Care Foundations

Assam is called India’s cancer capital, as the state’s average cancer incidence rate is double the national average. The high cancer burden, low public awareness, and a lack of specialised healthcare infrastructure led the Govt. of Assam to partner with Tata Trusts and build the Assam Cancer Care Foundation (ACCF).

Individuals with suspected lung cancer are identified via door-to-door screenings as well as via a screening kiosk set up at the Fakhruddin Ali Ahmed Medical College and Hospital, Barpeta, where ACCF has built a specialised cancer care unit. Chest X-rays of these individuals will be screened for suspicious lung nodule(s) using qXR. Based on the result, they will either be called back for an LDCT/biopsy or an oncology consultation.

3. Surveillance of all chest X-rays taken in a tertiary care hospital

VPS Lakeshore, Kerala, is a tertiary care hospital and a centre of excellence in oncology and other specialities. It is well equipped to take up large-scale screening programs and facilitate the required care continuum for high-risk, suspected, and confirmed disease cases. The hospital has a program in place where a tool surveys all chest X-rays taken, to facilitate early detection of lung cancer.

Through our partnership with AstraZeneca, we have deployed qXR to scan all chest X-rays performed at the hospital to pick up possible early cases of lung cancer. Any abnormal or nodule-indicative cases picked up by the software are instantly flagged to the radiologist/referring physician so that they can guide the patient along the next steps in the care pathway.

4. Public screening road shows

The Ministry of Public Health, Thailand, along with the AstraZeneca team, initiated the “Don’t Wait. Get Checked” lung cancer campaign in April ’22 at Central World Mall, in partnership with Banphaeo General Hospital, the Digital Economy Promotion Agency (DEPA) and the Central Group. On the occasion of World No Tobacco Day, Qure.ai’s qXR was used to screen close to 200 people. The objective of this program was to directly impact Thailand’s public health policies revolving around lung cancer.

Way Forward

“Building health systems that are resilient and sustainable will require finding new ways to prevent disease, diagnose patients earlier, and treat them more effectively. The benefits of the technology that Qure.ai offers align well with our corporate values, ultimately supporting our strategic objective to reshape healthcare delivery, close the cancer care gap and better manage chronic disease, especially in low-to-middle income countries. We believe that innovative technology has the potential to transform patients’ outcomes, enabling more people to access care in timely, reliable and affordable ways, regardless of where they live,” said Pei-Chieh Fong, Medical VP, AstraZeneca International.

At the Davos World Economic Forum 2022, AstraZeneca pledged to join the WEF EDISON Alliance and committed to screening 5 million patients for lung cancer by 2025 in partnership with Qure.ai.

With the support of AstraZeneca Turkey, Qure.ai collaborated with Mersin University Hospital on a landmark study on the use of AI in heart failure detection, using our qXR suite. This study is an important indicator of the future of AI in healthcare and of the use of technology to augment the efforts of physicians in the early detection of other diseases.


Improving performance of AI models in presence of artifacts

Our deep learning models have become really good at recognizing hemorrhages on head CT scans. However, real-world performance is sometimes hampered by several external factors, both hardware-related and human-related. In this blog post, we analyze how acquisition artifacts are responsible for performance degradation and introduce two methods we tried to solve this problem.

Medical Imaging is often accompanied by acquisition artifacts which can be subject related or hardware related. These artifacts make confident diagnostic evaluation difficult in two ways:

  • by making abnormalities less obvious visually by overlaying on them.
  • by mimicking an abnormality.

Some common examples of artifacts are

  • Clothing artifact – due to clothing on the patient at acquisition time. See fig 1 below, where a button on the patient’s clothing looks like a coin lesion on a chest X-ray (marked by the red arrow).

clothing artifact

Fig 1. A button mimicking a coin lesion in a chest X-ray, marked by the red arrow. Source.

  • Motion artifact – due to voluntary or involuntary subject motion during acquisition. Severe artifacts due to voluntary motion would usually call for a rescan. Involuntary motion like respiration or cardiac motion, or minimal subject movement, can result in artifacts that go undetected and mimic a pathology. See fig 2, where subject movement has resulted in motion artifacts that mimic a subdural hemorrhage (SDH).

motion artifact

Fig 2. Artifact due to subject motion, mimicking a subdural hemorrhage in a head CT. Source.

  • Hardware artifact – see fig 3. This artifact is caused by air bubbles in the cooling system. There are subtle irregular dark bands in the scan that can be misidentified as cerebral edema.

hardware artifact edema

Fig 3. A hardware-related artifact mimicking cerebral edema, marked by yellow arrows. Source.

Here we investigate motion artifacts that look like SDH in head CT scans. These artifacts result in an increase in false positive (FP) predictions from subdural hemorrhage models. We confirmed this by quantitatively analyzing the FPs of our AI model deployed at an urban outpatient center: the FP rates were higher for this data than for our internal test dataset.
These false positives arise from the lack of variety of artifact-ridden data in the training set. It is practically difficult to acquire and include scans containing all varieties of artifacts in the training set.

artifact mistaken for sdh

Fig 4. The model identifies an artifact slice as SDH because of similarity in shape and location: both are hyperdense areas close to the cranial bones.

We tried to solve this problem in the following two ways.

  • Making the models invariant to artifacts, by explicitly including artifact images into our training dataset.
  • Discounting slices with artifact when calculating the probability of bleed in a scan.

Method 1. Artifact as an augmentation using Cycle GANs

We reasoned that the artifacts were misclassified as bleeds because the model had not seen enough artifact scans during training.
The number of images containing artifacts is relatively small in our annotated training dataset, but we have access to several unannotated scans containing artifacts, acquired from various centers with older CT scanners. (Motion artifacts are more prevalent with older CT scanners that have poor in-plane temporal resolution.) If we could generate artifact-ridden versions of all the annotated images in our training dataset, we would be able to effectively augment our training dataset and make the model invariant to artifacts.
We decided to use a Cycle GAN to generate new training data containing artifacts.

Cycle GAN [1] is a generative adversarial network used for unpaired image-to-image translation. It serves our purpose because we have an unpaired image translation problem, where domain X has our training-set CT images with no artifacts and domain Y has artifact-ridden CT images.

cycle gan illustration

Fig 5. Cycle GAN was used to convert a short clip of a horse into that of a zebra. Source.
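
For readers unfamiliar with the objective, below is a condensed PyTorch sketch of the generator-side Cycle GAN loss for our setting, where domain X holds clean CT images and domain Y holds artifact-ridden ones. The generator and discriminator networks are assumed to be defined elsewhere; this is a sketch of the idea, not our exact training code.

```python
import torch
import torch.nn.functional as F

def cycle_gan_generator_loss(G, F_, D_X, D_Y, real_x, real_y, lambda_cyc=10.0):
    """Generator-side objective of a Cycle GAN (least-squares GAN variant).

    G: X -> Y adds artifacts; F_: Y -> X removes them.
    D_X, D_Y: discriminators for the two domains (assumed defined elsewhere).
    """
    fake_y = G(real_x)   # artifact-ridden version of a clean image
    fake_x = F_(real_y)  # cleaned version of an artifact image

    # Adversarial terms: generators try to make discriminators output "real" (1)
    pred_y, pred_x = D_Y(fake_y), D_X(fake_x)
    loss_gan = F.mse_loss(pred_y, torch.ones_like(pred_y)) + \
               F.mse_loss(pred_x, torch.ones_like(pred_x))

    # Cycle consistency: X -> Y -> X and Y -> X -> Y should reconstruct the input
    loss_cyc = F.l1_loss(F_(fake_y), real_x) + F.l1_loss(G(fake_x), real_y)
    return loss_gan + lambda_cyc * loss_cyc
```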

We curated dataset A of 5000 images without artifacts and dataset B of 4000 images with artifacts, and used these to train the Cycle GAN.

Unfortunately, the quality of the generated images was not very good (see fig 6).
The generator was unable to capture all the variety in the CT dataset, and meanwhile introduced artifacts of its own, rendering it useless for augmenting the dataset. The Cycle GAN authors state that the generator performs worse when the transformation involves geometric changes (e.g., dog to cat, apples to oranges) than when it involves color or style changes. Introducing artifacts is a bit more complex than a color or style change because it has to distort existing geometry. This could be one reason why the generated images contain extra artifacts.

cycle gan images

Fig 6. Sampling of generated images using Cycle GAN. real_A are input images and fake_B are the artifact images generated by Cycle GAN.

Method 2. Discounting artifact slices

In this method, we trained a model to identify slices with artifacts and show that discounting these slices made the AI model that identifies subdural hemorrhage (SDH) robust to artifacts.
A manually annotated dataset was used to train a convolutional neural network (CNN) to detect whether a CT slice has artifacts. The original SDH model was also a CNN, which predicted whether a slice contained SDH. The probabilities from the artifact model were used to discount the slices containing artifacts, and the artifact-free slices of a scan were used to compute the score for the presence of a bleed.
See fig 7.

Method 2 illustration

Fig 7. Method 2: using a trained artifact model to discount artifact slices while calculating SDH probability.
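
A condensed sketch of one way such discounting could be implemented is shown below; the exact rule we used is simplified here to dropping flagged slices and taking the maximum SDH probability over what remains.

```python
import numpy as np

def scan_sdh_score(sdh_probs, artifact_probs, artifact_threshold=0.5):
    """Scan-level SDH score after discounting artifact slices.

    sdh_probs, artifact_probs: per-slice probabilities from the SDH CNN and
    the artifact CNN respectively (the threshold value is illustrative).
    """
    sdh_probs = np.asarray(sdh_probs)
    artifact_probs = np.asarray(artifact_probs)
    kept = sdh_probs[artifact_probs < artifact_threshold]  # artifact-free slices
    return float(kept.max()) if kept.size else 0.0         # scan score = max slice score
```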


Our validation dataset contained 712 head CT scans, of which 42 contained SDH. The original SDH model produced 35 false positives and no false negatives. Quantitative analysis of the FPs confirmed that 17 (48%) of them were due to CT artifacts. Our trained artifact model had a slice-wise AUC of 96%. The proposed modification to the SDH model reduced the FPs to 18 (a decrease of 48%) without introducing any false negatives. Thus, using method 2, all scan-wise FPs due to artifacts were corrected.

In summary, using method 2, we improved the precision of SDH detection from 54.5% to 70% while maintaining a sensitivity of 100%.
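
These figures follow directly from the counts above, since all 42 SDH scans were detected in both settings:

```python
tp, fp_before, fp_after = 42, 35, 18        # counts from the validation set above

precision_before = tp / (tp + fp_before)    # 42/77 ~ 0.545 -> 54.5%
precision_after = tp / (tp + fp_after)      # 42/60 = 0.70  -> 70%
```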

confusion matrices

Fig 8. Confusion Matrix before and after using artifact model for SDH prediction

See fig 9. for model predictions on a representative scan.

artifact discount explanation

Fig 9. Model predictions for a few representative slices in a scan falsely predicted as positive by the original SDH model.

A drawback of method 2 is that if SDH and an artifact are present in the same slice, it is probable that the SDH would be missed.


Using a Cycle GAN to augment the dataset with artifact-ridden scans would solve the problem by enriching the dataset with both SDH-positive and SDH-negative scans with artifacts on top of them. But our current experiments do not give realistic-looking image synthesis results. The alternative we used meanwhile reduces the problem of high false positives due to artifacts while maintaining the same sensitivity.


  1. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks by Jun-Yan Zhu et al.


Challenges of Development & Validation of Deep Learning for Radiology

We recently published an article on our deep learning algorithms for head CT in The Lancet. This article is the first AI in medical imaging paper ever to be published in this journal.
We described the development and validation of these algorithms in the article.
In this blog, I explain some of the challenges we faced in this process and how we solved them. The challenges I describe are fairly general and should be applicable to any research involving AI and radiology images.


3D Images

The first challenge we faced in the development process is that CT scans are three dimensional (3D). There is a plethora of research on two dimensional (2D) images, but far less on 3D images. You might ask, why not simply use 3D convolutional neural networks (CNNs) in place of 2D CNNs? Notwithstanding the computational and memory requirements of 3D CNNs, they have been shown to be inferior to 2D CNN based approaches on a similar problem (action recognition).

So how do we solve it? We need not reinvent the wheel when there is a lot of literature on a similar problem: action recognition, i.e., classifying the action present in a given video.
Why is action recognition similar to 3D volume classification? Well, the temporal dimension in videos is analogous to the Z dimension in a CT.

Left: Example head CT scan. Right: Example video from an action recognition dataset. The Z dimension in the CT volume is analogous to the time dimension in the video.

We took a foundational work from the action recognition literature and modified it for our purposes. Our modification was to incorporate slice-level labels (frame-level, in video terms) into the network, because the action recognition literature enjoys the comfort of pretrained 2D CNNs, which we do not share.

High Resolution

The second challenge was that CT is of high resolution, both spatially and in bit depth. We simply downsample the CT to a standard pixel spacing. How about bit depth? Deep learning doesn’t work great with data that is not normalized to [-1, 1] or [0, 1]. We solved this with what a radiologist would use: windowing. Windowing is the restriction of the dynamic range to a certain interval (e.g., [0, 80]) and then normalizing it. We applied three windows and passed them as channels to the CNNs.


Windows: brain, blood/subdural and bone
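
A minimal NumPy sketch of windowing is shown below. The window centers and widths are typical values for brain, subdural and bone windows and may differ from the exact settings we used.

```python
import numpy as np

def window(hu_image, center, width):
    """Clip a CT image (in Hounsfield units) to a window and scale to [0, 1]."""
    low, high = center - width / 2, center + width / 2
    return (np.clip(hu_image, low, high) - low) / (high - low)

def three_channel(hu_image):
    brain = window(hu_image, center=40, width=80)      # the [0, 80] interval
    subdural = window(hu_image, center=80, width=200)  # wider blood/subdural window
    bone = window(hu_image, center=600, width=2800)    # bone window
    return np.stack([brain, subdural, bone], axis=0)   # channels-first for the CNN
```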

This approach allows multi-class effects to be accounted for by the model. For example, a large scalp hematoma visible in the brain window might indicate a fracture underneath it. Conversely, a fracture visible in the bone window is usually correlated with an extra-axial bleed.

Other Challenges

There are few other challenges that deserve mention as well:

  1. Class imbalance: We addressed the class imbalance issue with weighted sampling and loss weighting (see the sketch below).
  2. Lack of pretraining: There is no pretrained model like ImageNet for medical images. We found that using ImageNet weights actually hurts performance.
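
A minimal PyTorch sketch of both techniques for a binary task follows; `train_labels` is a hypothetical list of 0/1 labels for the training examples.

```python
import torch
from torch import nn
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor(train_labels, dtype=torch.float)  # hypothetical 0/1 labels
pos = labels.sum()
neg = len(labels) - pos

# Loss weighting: up-weight the rare positive class in the loss.
criterion = nn.BCEWithLogitsLoss(pos_weight=neg / pos)

# Weighted sampling: draw positives and negatives with equal probability.
weights = torch.where(labels == 1, 1.0 / pos, 1.0 / neg)
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
```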


Once the algorithms were developed, validation was not without its challenges either.
Here are the key questions we started with: do our algorithms generalize well to CT scans not in the development dataset?
Do the algorithms also generalize to CT scans from a different source altogether? How do they compare to radiologists without access to clinical history?

Low prevalences and statistical confidence

The validation looks simple enough: just acquire scans (from a different source), get them read by radiologists, and compare their reads with the algorithms’.
But the statistical design is a challenge! This is because the prevalence of abnormalities tends to be low; it can be as low as 1% for some abnormalities. Our key metrics for evaluating the algorithms are sensitivity, specificity, and AUC, which depends on both. Sensitivity is the troublemaker: we have to ensure there are enough positives in the dataset to get narrow enough 95% confidence intervals (CI). The required number of positive scans turns out to be ~80 for a CI of +/- 10% at an expected sensitivity of 0.7.

If we were to choose a randomly sampled dataset, the number of scans to be read would be ~80/prevalence = 8000. With three readers per scan, the total number of reads is 8k * 3 = 24k. This is a prohibitively large dataset to get read by radiologists. We therefore cannot use a randomly sampled dataset; we have to somehow enrich the number of positives in it.
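
The ~80 figure comes from inverting the normal-approximation (Wald) half-width of a proportion's confidence interval:

```python
import math

z, sens, half_width = 1.96, 0.7, 0.10
# half_width = z * sqrt(p * (1 - p) / n), solved for n:
n_pos = z**2 * sens * (1 - sens) / half_width**2
print(math.ceil(n_pos))                 # 81 positive scans needed

prevalence = 0.01
print(math.ceil(n_pos / prevalence))    # 8067, i.e. ~8000 scans if randomly sampled
```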


To enrich a dataset with positives, we have to find the positives among all the scans available. It’s like searching for a needle in a haystack. Fortunately, the scans usually have a clinical report associated with them, so we just have to read the reports and select the positive ones. Even better, have an NLP algorithm parse the reports and randomly sample the required number of positives. We chose this path.
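
For illustration only, a crude keyword-based version of such triage might look like the sketch below. Our actual NLP algorithm is rule-based and handles negation and uncertainty far more carefully; `reports` is a hypothetical list of report strings.

```python
import random
import re

POSITIVE_PATTERNS = [r"hemorrhage", r"fracture", r"midline shift", r"mass effect"]
NEGATION = re.compile(r"\b(no|not|without)\b", re.IGNORECASE)

def looks_positive(report: str) -> bool:
    """Crude sentence-level keyword triage of a radiology report."""
    for sentence in report.split("."):
        if NEGATION.search(sentence):
            continue  # skip negated sentences entirely (a real parser is subtler)
        if any(re.search(p, sentence, re.IGNORECASE) for p in POSITIVE_PATTERNS):
            return True
    return False

positives = [r for r in reports if looks_positive(r)]  # `reports`: assumed given
sample = random.sample(positives, k=min(80, len(positives)))
```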

We collected the dataset in two batches, B1 and B2. B1 comprised all the head CT scans acquired in a month, and B2 was the algorithmically selected batch. So B1 mostly contained negatives while B2 contained a lot of positives. This approach removed any selection bias that might have been present had the scans been picked manually. For example, if positive scans were picked by cursory manual glances at the scans themselves, subtle positive findings would have been missing from the dataset.

Prevalences of the findings in batches B1 and B2. Observe the low prevalences of findings in uniformly sampled batch B1.


We called this enriched dataset the CQ500 dataset (C for CARING and Q for Qure.ai). The dataset contained 491 scans after exclusions. Three radiologists independently read the scans in the dataset, and the majority vote is considered the gold standard. We randomized the order of the reads to minimize the recall of follow-up scans and to blind the readers to the batches of the dataset.

We have made this dataset and the radiologists’ reads public under a CC-BY-NC-SA license. Other researchers can use this dataset to benchmark their algorithms. I think it can also be used for clinical research, such as measuring the concordance of radiologists on various tasks.

In addition to the CQ500 dataset, we validated the algorithms on a much larger randomly sampled dataset, the Qure25k dataset, with 21,095 scans. Ground truths were clinical radiology reports; we used the NLP algorithm to extract structured data from them. This dataset satisfies the statistical requirements, but each scan was read by only a single radiologist, who had access to clinical history.


Finding | CQ500 (95% CI) | Qure25k (95% CI)
Intracranial hemorrhage | 0.9419 (0.9187-0.9651) | 0.9194 (0.9119-0.9269)
Intraparenchymal | 0.9544 (0.9293-0.9795) | 0.8977 (0.8884-0.9069)
Intraventricular | 0.9310 (0.8654-0.9965) | 0.9559 (0.9424-0.9694)
Subdural | 0.9521 (0.9117-0.9925) | 0.9161 (0.9001-0.9321)
Extradural | 0.9731 (0.9113-1.0000) | 0.9288 (0.9083-0.9494)
Subarachnoid | 0.9574 (0.9214-0.9934) | 0.9044 (0.8882-0.9205)
Calvarial fracture | 0.9624 (0.9204-1.0000) | 0.9244 (0.9130-0.9359)
Midline shift | 0.9697 (0.9403-0.9991) | 0.9276 (0.9139-0.9413)
Mass effect | 0.9216 (0.8883-0.9548) | 0.8583 (0.8462-0.8703)

AUCs of the algorithms on both datasets.

The above table shows the AUCs of the algorithms on the two datasets. Note that the AUCs are directly comparable, because AUC is prevalence-independent. AUCs on the CQ500 dataset are generally better than those on the Qure25k dataset. This might be because:

  1. Ground truths in the Qure25k dataset incorporated clinical information not available to the algorithms, so the algorithms fall short of them.
  2. A majority vote of three reads is a better ground truth than a single read.

ROC curves

ROC curves for the algorithms on the Qure25k (blue) and CQ500 (red) datasets. TPR and FPR of radiologists are also plotted.

Shown above are the ROC curves on both datasets, with readers’ TPR and FPR also plotted. We observe that radiologists tend to be either highly sensitive or highly specific to a particular finding. The algorithms are yet to beat radiologists, on this task at least! But they should nonetheless be useful for triage or notifying physicians.


Deep Learning for Videos: A 2018 Guide to Action Recognition

Medical images like MRIs and CTs (3D images) are very similar to videos: both encode 2D spatial information over a 3rd dimension. Much like diagnosing abnormalities from 3D images, action recognition from videos requires capturing context from the entire video rather than just information from each frame.

Fig 1: Left: Example head CT scan. Right: Example video from an action recognition dataset. The Z dimension in the CT volume is analogous to the time dimension in the video.

In this post, I summarize the literature on action recognition from videos. The post is organized into three sections –

  1. What is action recognition and why is it tough
  2. Overview of approaches
  3. Summary of papers

What is action recognition and why is it tough?

The action recognition task involves identifying different actions from video clips (sequences of 2D frames), where the action may or may not be performed throughout the entire duration of the video. It seems like a natural extension of image classification to multiple frames, followed by aggregating the predictions from each frame. But despite the stratospheric success of deep learning architectures in image classification (ImageNet), progress in architectures for video classification and representation learning has been slower.

What made this task tough?

  1. Huge Computational Cost
    A simple convolution 2D net for classifying 101 classes has just ~5M parameters whereas the same architecture when inflated to a 3D structure results in ~33M parameters. It takes 3 to 4 days to train a 3DConvNet on UCF101 and about two months on Sports-1M, which makes extensive architecture search difficult and overfitting likely[1].
  2. Capturing long context
    Action recognition involves capturing spatiotemporal context across frames. Additionally, the spatial information captured has to be compensated for camera movement. Even having strong spatial object detection doesn’t suffice as the motion information also carries finer details. There’s a local as well as global context w.r.t. motion information which needs to be captured for robust predictions. For example, consider the video representations shown in Figure 2. A strong image classifier can identify human, water body in both the videos but the nature of temporal periodic action differentiates front crawl from breast stroke.

    Fig 2: Left: Front crawl. Right: Breast stroke. Capturing temporal motion is critical to differentiate these two seemingly similar cases. Also notice, how camera angle suddenly changes in the middle of front crawl video.

  3. Designing classification architectures
    Designing architectures that can capture spatiotemporal information involve multiple options which are non-trivial and expensive to evaluate. For example, some possible strategies could be

    • One network for capturing spatiotemporal information vs. two separate ones for each spatial and temporal
    • Fusing predictions across multiple clips
    • End-to-end training vs. feature extraction and classifying separately
  4. No standard benchmark
    The most popular benchmark datasets have long been UCF101 and Sports1M. Searching for a reasonable architecture on Sports1M is extremely expensive. For UCF101, although the number of frames is comparable to ImageNet, the high spatial correlation among the videos makes the actual diversity in training much smaller. Also, given the similar theme (sports) across both datasets, generalization of benchmarked architectures to other tasks remained a problem. This has been solved lately with the introduction of the Kinetics dataset [2].

    Sample illustration of UCF-101. Source.

It must be noted here that abnormality detection from 3D medical images doesn’t involve all the challenges mentioned here. The major differences between action recognition and abnormality detection from medical images are as follows:

  1. In medical imaging, the temporal context may not be as important as in action recognition. For example, detecting hemorrhage in a head CT scan involves much less temporal context across slices; intracranial hemorrhage can often be detected from a single slice. In contrast, detecting a lung nodule in a chest CT scan involves capturing temporal context, as nodules, bronchi, and vessels all look like circular objects in 2D slices. Only when 3D context is captured can nodules be seen as spherical objects, as opposed to cylindrical objects like vessels.
  2. In action recognition, most research ideas resort to using pre-trained 2D CNNs as a starting point for drastically better convergence. For medical images, such pre-trained networks are unavailable.

Overview of approaches

Before deep learning came along, most traditional CV algorithms for action recognition could be broken down into the following 3 broad steps:

  1. Local high-dimensional visual features that describe a region of the video are extracted either densely [3] or at a sparse set of interest points[4 , 5].
  2. The extracted features are combined into a fixed-size video-level description. One popular variant of this step is the bag of visual words (derived using hierarchical or k-means clustering) for encoding features at the video level.
  3. A classifier, like SVM or RF, is trained on the bag of visual words for the final prediction

Of these algorithms that use shallow hand-crafted features in step 1, improved Dense Trajectories (iDT) [6], which uses densely sampled trajectory features, was the state of the art. Meanwhile, 3D convolutions were used as-is for action recognition in 2013, without much success [7]. Soon after, in 2014, two breakthrough papers were released; they form the backbone of all the papers we are going to discuss in this post. The major difference between them was the design choice around combining spatiotemporal information.

Approach 1: Single Stream Network

In this work [June 2014], the authors – Karpathy et al. – explore multiple ways to fuse temporal information from consecutive frames using 2D pre-trained convolutions.


Fig 3: Fusion ideas. Source.

As can be seen in fig 3, consecutive frames of the video are presented as input in all setups. Single frame uses a single architecture that fuses information from all frames at the last stage. Late fusion uses two nets with shared parameters, spaced 15 frames apart, and also combines predictions at the end. Early fusion combines information in the first layer by convolving over 10 frames. Slow fusion fuses at multiple stages, a balance between early and late fusion. For final predictions, multiple clips were sampled from the entire video and their prediction scores were averaged.
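
As a concrete example, early fusion amounts to folding the time axis into the channel axis before the first convolution. A minimal PyTorch sketch, with layer sizes loosely following the paper's 170x170 inputs but otherwise illustrative:

```python
import torch
from torch import nn

T = 10                                   # frames fused at the input
frames = torch.randn(8, T, 3, 170, 170)  # (batch, time, channels, H, W)

# Early fusion: merge time into channels, then convolve once at the first layer.
early = nn.Conv2d(in_channels=3 * T, out_channels=96, kernel_size=11, stride=3)
out = early(frames.flatten(1, 2))        # -> (8, 96, 54, 54)
```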

Despite extensive experimentation, the authors found the results significantly worse than those of state-of-the-art hand-crafted feature based algorithms. Multiple reasons were attributed for this failure:

  1. The learnt spatiotemporal features didn’t capture motion features
  2. The dataset being less diverse, learning such detailed features was tough

Approach 2: Two Stream Networks

In this pioneering work [June 2014] by Simonyan and Zisserman, the authors build on the failures of the previous work by Karpathy et al. Given the difficulty deep architectures have in learning motion features, the authors explicitly modeled motion features in the form of stacked optical flow vectors. So instead of a single network for spatial context, this architecture has two separate networks: one for spatial context (pre-trained) and one for motion context. The input to the spatial net is a single frame of the video. The authors experimented with the input to the temporal net and found that bi-directional optical flow stacked across 10 successive frames performed best. The two streams were trained separately and combined using an SVM. The final prediction was obtained as in the previous paper, i.e., by averaging across sampled frames.

2 stream architecture

Fig 4: Two stream architecture. Source.

Though this method improved the performance of single stream method by explicitly capturing local temporal movement, there were still a few drawbacks:

  1. Because video-level predictions were obtained by averaging predictions over sampled clips, long-range temporal information was still missing from the learnt features.
  2. Since training clips are sampled uniformly from videos, they suffer from a false label assignment problem: the ground truth of each clip is assumed to be the same as the ground truth of the video, which may not be the case if the action happens only for a small duration within the video.
  3. The method involved pre-computing optical flow vectors and storing them separately. Also, the two streams were trained separately, implying that end-to-end training on the fly was still a long road away.


The following papers are, in a way, evolutions of these two papers (single stream and two stream), and are summarized below:

  1. LRCN
  2. C3D
  3. Conv3D & Attention
  4. TwoStreamFusion
  5. TSN
  6. ActionVlad
  7. HiddenTwoStream
  8. I3D
  9. T3D

The recurrent theme across these papers can be summarized as follows; all of them build on top of these basic ideas.


Recurrent theme across papers. Source.

For each of these papers, I list down their key contributions and explain them.
I also show their benchmark scores on UCF101-split1.


  • Long-term Recurrent Convolutional Networks for Visual Recognition and Description
  • Donahue et al.
  • Submitted on 17 November 2014
  • Arxiv Link

Key Contributions:

  • Building on previous work by using RNN as opposed to stream based designs
  • Extension of encoder-decoder architecture for video representations
  • End-to-end trainable architecture proposed for action recognition


In a previous work by Ng et al. [9], the authors had explored the idea of using LSTMs on separately trained feature maps to see if they could capture temporal information from clips. Sadly, they concluded that temporal pooling of convolutional features proved more effective than stacking an LSTM after trained feature maps. In the current paper, the authors build on the same idea of using LSTM blocks (decoder) after convolution blocks (encoder), but train the entire architecture end-to-end. They also compared RGB and optical flow as input choices and found that a weighted score of predictions based on both inputs was the best.


Fig 5: Left: LRCN for action recognition. Right: Generic LRCN architecture for all tasks. Source.


During training, 16-frame clips are sampled from the video. The architecture is trained end-to-end with RGB or optical flow of the 16-frame clips as input. The prediction for each clip is the average of the predictions across its time steps, and the final video-level prediction is the average of the predictions from each clip.
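
A condensed PyTorch sketch of this encoder-decoder pattern is below; the 2D backbone, feature sizes and class count are illustrative, not the paper's exact configuration.

```python
import torch
from torch import nn

class LRCN(nn.Module):
    """Per-frame CNN encoder + LSTM decoder; predictions averaged over time."""
    def __init__(self, cnn, feat_dim=512, hidden=256, n_classes=101):
        super().__init__()
        self.cnn = cnn                    # any 2D backbone: frame -> feat_dim
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, clip):              # clip: (batch, T, C, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)         # hidden state at every time step
        return self.fc(out).mean(dim=1)   # average predictions across time
```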

Benchmarks (UCF101-split1):

  • 82.92 – Weighted score of flow and RGB inputs
  • 71.1 – Score with just RGB

My comments:

Even though the authors suggested end-to-end training frameworks, there were still a few drawbacks

  • False label assignment, as the video was broken into clips
  • Inability to capture long-range temporal information
  • Using optical flow meant pre-computing flow features separately

Varol et al., in their work [10], tried to compensate for the stunted temporal range by using a lower spatial resolution of the video and longer clips (60 frames), which led to significantly better performance.


  • Learning Spatiotemporal Features with 3D Convolutional Networks
  • Du Tran et al.
  • Submitted on 02 December 2014
  • Arxiv Link

Key Contributions:

  • Repurposing 3D convolutional networks as feature extractors
  • Extensive search for best 3D convolutional kernel and architecture
  • Using deconvolutional layers to interpret model decision


In this work, the authors built upon the work by Karpathy et al. However, instead of using 2D convolutions across frames, they used 3D convolutions on the video volume. The idea was to train these vast networks on Sports1M and then use them (or an ensemble of nets with different temporal depths) as feature extractors for other datasets. Their finding was that a simple linear classifier like SVM on top of an ensemble of extracted features worked better than the state-of-the-art algorithms. The model performed even better if hand-crafted features like iDT were used additionally.


Differences between the C3D paper and the single stream paper. Source.

The other interesting part of the work was using deconvolutional layers (explained here) to interpret the model’s decisions. Their finding was that the net focused on spatial appearance in the first few frames and tracked the motion in subsequent frames.


During training, five random 2-second clips are extracted from each video, with the ground truth being the action reported for the entire video. At test time, 10 clips are randomly sampled and predictions across them are averaged for the final prediction.


3D convolution where convolution is applied on a spatiotemporal cube.
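
In framework terms, the only structural change from the 2D case is the extra temporal kernel dimension; a minimal PyTorch example with C3D-style 3x3x3 kernels:

```python
import torch
from torch import nn

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, C, T, H, W): a 16-frame RGB clip

# A 3x3x3 kernel convolves jointly over time and space, unlike Conv2d.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
print(conv3d(clip).shape)                # torch.Size([1, 64, 16, 112, 112])
```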

Benchmarks (UCF101-split1):

  • 82.3 – C3D (1 net) + linear SVM
  • 85.2 – C3D (3 nets) + linear SVM
  • 90.4 – C3D (3 nets) + iDT + linear SVM

My comments:

Long-range temporal modeling was still a problem. Moreover, training such huge networks is computationally expensive, especially for medical imaging, where pre-training from natural images doesn’t help a lot.

Note: Around the same time, Sun et al. [11] introduced the concept of factorized 3D conv networks (FSTCN), exploring the idea of breaking 3D convolutions into spatial 2D convolutions followed by temporal 1D convolutions. The 1D convolution, placed after the 2D conv layer, was implemented as a 2D convolution over the temporal and channel dimensions. FSTCN had comparable results on the UCF101 splits.


The FSTCN paper and the factorization of 3D convolution. Source.
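
A minimal PyTorch sketch of the factorization idea (layer sizes illustrative, not the exact FSTCN design):

```python
import torch
from torch import nn

class FactorizedConv(nn.Module):
    """A 3D conv factorized into a 2D spatial conv followed by a 1D temporal conv."""
    def __init__(self, c_in, c_mid, c_out, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(c_in, c_mid, kernel_size=(1, k, k), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(c_mid, c_out, kernel_size=(k, 1, 1), padding=(1, 0, 0))

    def forward(self, x):                 # x: (batch, C, T, H, W)
        return self.temporal(torch.relu(self.spatial(x)))
```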

Conv3D & Attention

  • Describing Videos by Exploiting Temporal Structure
  • Yao et al.
  • Submitted on 25 April 2015
  • Arxiv Link

Key Contributions:

  • Novel 3D CNN-RNN encoder-decoder architecture which captures local spatiotemporal information
  • Use of an attention mechanism within a CNN-RNN encoder-decoder framework to capture global context


Although this work is not directly related to action recognition, it was a landmark work for video representations. In this paper, the authors use a 3D CNN + LSTM as the base architecture for a video description task. On top of the base, the authors use a pre-trained 3D CNN for improved results.


The setup is almost the same as the encoder-decoder architecture described in LRCN, with two differences:

  1. Instead of passing features from the 3D CNN as-is to the LSTM, 3D CNN feature maps for the clip are concatenated with stacked 2D feature maps for the same set of frames to enrich the representation {v1, v2, …, vn} for each frame. Note: the 2D and 3D CNNs used are pre-trained, not trained end-to-end as in LRCN.
  2. Instead of averaging temporal vectors across all frames, a weighted average is used to combine the temporal features. The attention weights are decided based on the LSTM output at every time step (a condensed sketch follows after the figure below).

Attention Mechanism

Attention mechanism for action recognition. Source.
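
A condensed sketch of such soft temporal attention is shown below; it is a simplified stand-in for the paper's mechanism, with illustrative dimensions.

```python
import torch
from torch import nn

class TemporalAttention(nn.Module):
    """Weights per-frame features v_1..v_n by attention conditioned on the
    decoder LSTM state h, instead of a plain temporal average."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, feats, h):            # feats: (n, feat_dim), h: (hidden_dim,)
        h_rep = h.unsqueeze(0).expand(feats.size(0), -1)
        alphas = torch.softmax(self.score(torch.cat([feats, h_rep], dim=1)), dim=0)
        return (alphas * feats).sum(dim=0)  # attention-weighted average
```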


Network used for video description prediction

My comments:

This was one of the landmark works of 2015, introducing the attention mechanism for video representations for the first time.


  • Convolutional Two-Stream Network Fusion for Video Action Recognition
  • Feichtenhofer et al.
  • Submitted on 22 April 2016
  • Arxiv Link

Key Contributions:

  • Long range temporal modeling through better long range losses
  • Novel multi-level fused architecture


In this work, the authors use the base two stream architecture with two novel approaches and demonstrate improved performance without any significant increase in the number of parameters. The authors explore the efficacy of two major ideas:

  1. Fusion of the spatial and temporal streams (how and when). For a task discriminating between brushing hair and brushing teeth, the spatial net can capture the spatial dependency in a video (whether it’s hair or teeth) while the temporal net can capture the presence of periodic motion for each spatial location. Hence it is important to map a spatial feature map pertaining to, say, a particular facial region to the temporal feature map of the corresponding region. To achieve this, the nets need to be fused at an early level, such that responses at the same pixel position are put in correspondence, rather than fusing at the end (as in the base two stream architecture).
  2. Combining temporal net output across time frames so that long term dependency is also modeled.


Everything from the two stream architecture remains almost the same, except:

  1. As described in the figure below, the outputs of the conv_5 layers from both streams are fused by conv+pooling. There is yet another fusion at the end layer. The final fused output is used for the spatiotemporal loss.


    Possible strategies for fusing the spatial and temporal streams. The one on the right performed better. Source.

  2. For temporal fusion, the output of the temporal net, stacked across time and fused by conv+pooling, was used for the temporal loss.


Two stream fusion architecture. There are two paths, one for step 1 and the other for step 2. Source.

Benchmarks (UCF101-split1):

  • 94.2 – TwoStreamFusion + iDT

My comments:
The authors established the supremacy of the TwoStreamFusion method, as it improved performance over C3D without the extra parameters that C3D uses.


  • Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
  • Wang et al.
  • Submitted on 02 August 2016
  • Arxiv Link

Key Contributions:

  • Effective solution aimed at long range temporal modeling
  • Establishing the usage of batch normalization, dropout and pre-training as good practices


In this work, the authors improved on the two stream architecture to produce state-of-the-art results. There were two major differences from the original paper:

  1. They suggest sampling clips sparsely across the video to better model the long-range temporal signal, instead of random sampling across the entire video.
  2. For the final video-level prediction, the authors explored multiple strategies. The best was:
    1. Combining the scores of the temporal and spatial streams (and other streams, if other input modalities are involved) separately, by averaging across snippets
    2. Fusing the final spatial and temporal scores using a weighted average and applying softmax over all classes.

The other important part of the work was establishing the problem of overfitting (due to small dataset sizes) and demonstrating the usage of now-prevalent techniques like batch normalization, dropout and pre-training to counter it. The authors also evaluated two new input modalities as alternatives to optical flow, namely warped optical flow and RGB difference.


During training and prediction, a video is divided into K segments of equal duration. Snippets are then sampled randomly from each of the K segments (see the sketch below). The rest of the steps remain similar to the two stream architecture, with the changes mentioned above.
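
The segment-wise sampling itself is simple; a minimal sketch:

```python
import random

def sample_snippets(num_frames: int, k: int = 3):
    """Divide a video into K equal segments and draw one random frame index
    from each, per the TSN sampling scheme (assumes num_frames >= k)."""
    seg_len = num_frames // k
    return [i * seg_len + random.randrange(seg_len) for i in range(k)]

print(sample_snippets(300, k=3))   # e.g. [37, 142, 251]
```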


Temporal Segment Network architecture. Source.

Benchmarks (UCF101-split1):

  • 94.0 – TSN (input: RGB + flow)
  • 94.2 – TSN (input: RGB + flow + warped flow)

My comments:

The work tackled two big challenges in action recognition, overfitting due to small dataset sizes and long-range temporal modeling, and the results were really strong. However, the need to pre-compute optical flow and related input modalities was still a problem.


  • ActionVLAD: Learning spatio-temporal aggregation for action classification
  • Girdhar et al.
  • Submitted on 10 April 2017
  • Arxiv Link

Key Contributions:

  • Learnable video-level aggregation of features
  • End-to-end trainable model with video-level aggregated features to capture long term dependency


In this work, the most notable contribution by the authors is the usage of learnable feature aggregation (VLAD), as compared to normal aggregation using maxpool or avgpool. The aggregation technique is akin to the bag of visual words: there is a vocabulary of multiple learned anchor points (say c1, …, ck) representing k typical action (or sub-action) related spatiotemporal features. The output from each stream in the two stream architecture is encoded in terms of k-space “action word” features, each feature being the difference of the output from the corresponding anchor point for any given spatial or temporal location.


ActionVLAD – bag of action-based visual “words”. Source.

Average or max pooling represents the entire distribution of points as a single descriptor, which can be sub-optimal for representing an entire video composed of multiple sub-actions. In contrast, the proposed video aggregation represents an entire distribution of descriptors with multiple sub-actions by splitting the descriptor space into k cells and pooling inside each cell.


While max or average pooling works well for similar features, it does not adequately capture the complete distribution of features. ActionVLAD clusters the appearance and motion features and aggregates their residuals from the nearest cluster centers. Source.
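
A condensed sketch of the soft-assignment-plus-residual aggregation at the heart of ActionVLAD follows. It is simplified relative to the paper (which parametrizes the soft assignment with a convolution); dimensions are illustrative.

```python
import torch
from torch import nn
import torch.nn.functional as F

class ActionVLADPool(nn.Module):
    """Soft-assign each spatiotemporal descriptor to k learned anchors and
    aggregate the residuals into a single video-level descriptor."""
    def __init__(self, k, dim):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(k, dim))

    def forward(self, x):                  # x: (n_descriptors, dim)
        assign = F.softmax(-torch.cdist(x, self.anchors) ** 2, dim=1)  # (n, k)
        resid = x.unsqueeze(1) - self.anchors.unsqueeze(0)             # (n, k, dim)
        vlad = (assign.unsqueeze(-1) * resid).sum(dim=0)               # (k, dim)
        return F.normalize(vlad.flatten(), dim=0)                      # video descriptor
```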


Everything from the two stream architecture remains almost the same, except for the addition of the ActionVLAD layer. The authors experimented with multiple placements of the ActionVLAD layer, with late fusion after the conv layers working out to be the best strategy.

Benchmarks (UCF101-split1):

  • ActionVLAD + iDT: 93.6

My comments:

The effectiveness of VLAD as a pooling method had been established long before. Extending it into an end-to-end trainable framework made the technique extremely robust, and state-of-the-art for most action recognition tasks in early 2017.


  • Hidden Two-Stream Convolutional Networks for Action Recognition
  • Zhu et al.
  • Submitted on 2 April 2017
  • Arxiv Link

Key Contributions:

  • Novel architecture for generating optical flow input on-the-fly using a separate network


The usage of optical flow in the two stream architecture made it mandatory to pre-compute optical flow for each sampled frame beforehand, adversely affecting storage and speed. This paper advocates an unsupervised architecture that generates optical flow for a stack of frames.

Optical flow can be regarded as an image reconstruction problem: given a pair of adjacent frames I1 and I2 as input, the CNN generates a flow field V. Using the predicted flow field V and I2, I1 is reconstructed by inverse warping, and the network is trained so that the difference between I1 and its reconstruction is minimized.
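A rough PyTorch sketch of this idea, assuming backward warping with a simple L1 photometric penalty (MotionNet's actual multi-level loss also includes smoothness terms and operates at multiple scales):

```python
import torch
import torch.nn.functional as F

def photometric_loss(i1, i2, flow):
    """Warp I2 back towards I1 with the predicted flow and penalise the
    reconstruction error (L1). i1, i2: (B, C, H, W); flow: (B, 2, H, W) in pixels."""
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(flow.device)  # pixel grid (2, H, W)
    coords = base.unsqueeze(0) + flow              # where each I1 pixel sits in I2
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0        # normalise to [-1, 1]
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)           # (B, H, W, 2) for grid_sample
    i1_recon = F.grid_sample(i2, grid, align_corners=True)
    return (i1 - i1_recon).abs().mean()
```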


The authors explored multiple strategies and architectures to generate optical flow at the highest fps with the fewest parameters, without hurting accuracy much. The final architecture was the same as the two stream architecture, with the following changes:

  1. The temporal stream now has the optical flow generation network (MotionNet) stacked on top of the usual temporal stream architecture. The input to the temporal stream is now a stack of consecutive frames instead of pre-computed optical flow.
  2. There is an additional multi-level loss for the unsupervised training of MotionNet.

The authors also demonstrate improved performance using TSN-based fusion instead of the conventional two stream fusion.


HiddenTwoStream – MotionNet generates optical flow on-the-fly. Source.

Benchmarks (UCF101-split1):

  • Hidden Two Stream: 89.8
  • Hidden Two Stream + TSN: 92.5

My comments:
The major contribution of the paper was improving the speed and associated cost of prediction. With automated generation of flow, the authors removed the dependency on slower traditional methods for computing optical flow.


  • Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
  • Carreira et al.
  • Submitted on 22 May 2017
  • Arxiv Link

Key Contributions:

  • Combining 3D based models into two stream architecture leveraging pre-training
  • Kinetics dataset for future benchmarking and improved diversity of action datasets


This paper takes off from where C3D left off. Instead of a single 3D network, the authors use two different 3D networks, one for each stream of the two stream architecture. To take advantage of pre-trained 2D models, the authors inflate the 2D pre-trained weights by repeating them along a third (temporal) dimension. The spatial stream input now consists of frames stacked along the time dimension, instead of the single frames used in the basic two stream architecture.
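A minimal sketch of this weight inflation, assuming a plain Conv2d-to-Conv3d conversion (I3D also adapts strides and pooling, which we ignore here):

```python
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Inflate a pre-trained 2D convolution into 3D: repeat the kernel along
    time and divide by time_dim, so a video of identical frames reproduces
    the original 2D activations."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    w2d = conv2d.weight.data                        # (out, in, kH, kW)
    conv3d.weight.data = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    if conv2d.bias is not None:
        conv3d.bias.data = conv2d.bias.data.clone()
    return conv3d
```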


Same as the basic two stream architecture, but with a 3D network for each stream.

Benchmarks (UCF101-split1):

  • Two Stream I3D: 93.4
  • Two Stream I3D (ImageNet + Kinetics pre-training): 98.0

My comments:

The major contribution of the paper was demonstrating the benefit of using pre-trained 2D conv nets. The Kinetics dataset, open-sourced alongside the paper, was the other crucial contribution.


  • Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification
  • Diba et al.
  • Submitted on 22 Nov 2017
  • Arxiv Link

Key Contributions:

  • Architecture to combine temporal information across variable depth
  • Novel training architecture & technique to supervise transfer learning between 2D pre-trained net to 3D net


The authors extend the work done on I3D, but suggest a single stream 3D DenseNet based architecture with a multi-depth temporal pooling layer (Temporal Transition Layer) stacked after the dense blocks to capture different temporal depths. The multi-depth pooling is achieved by pooling with kernels of varying temporal sizes.
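A minimal sketch of such a multi-depth temporal layer; the kernel depths and channel counts are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TemporalTransitionLayer(nn.Module):
    """Parallel 3D convs with different temporal kernel depths, concatenated
    so the block mixes short- and long-range motion cues."""
    def __init__(self, in_ch, out_ch, temporal_depths=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(in_ch, out_ch, kernel_size=(t, 3, 3), padding=(t // 2, 1, 1))
            for t in temporal_depths
        )

    def forward(self, x):                            # x: (B, C, T, H, W)
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```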


TTL Layer along with rest of DenseNet architecture. Source.

Apart from the above, the authors also devise a new technique for supervising transfer learning between pre-trained 2D conv nets and T3D. The 2D pre-trained net and T3D are presented with frames and clips respectively, which may or may not come from the same video; the architecture is trained to predict whether they do, and the error from this prediction is back-propagated through the T3D net so as to effectively transfer knowledge.


Transfer learning supervision. Source.


The architecture is basically a 3D modification of DenseNet [12], with the added variable temporal pooling.

Benchmarks (UCF101-split1):

  • T3D + Transfer: 91.7
  • T3D + TSN: 93.2

My comments:

Although the results don't improve on I3D, that can mostly be attributed to T3D's much lower model footprint compared to I3D. The most novel contribution of the paper was the supervised transfer learning technique.


  1. ConvNet Architecture Search for Spatiotemporal Feature Learning by Du Tran et al.
  2. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
  3. Action recognition by dense trajectories by Wang et al.
  4. On space-time interest points by Laptev
  5. Behavior recognition via sparse spatio-temporal features by Dollar et al.
  6. Action Recognition with Improved Trajectories by Wang et al.
  7. 3D Convolutional Neural Networks for Human Action Recognition by Ji et al.
  8. Large-scale Video Classification with Convolutional Neural Networks by Karpathy et al.
  9. Beyond Short Snippets: Deep Networks for Video Classification by Ng et al.
  10. Long-term Temporal Convolutions for Action Recognition by Varol et al.
  11. Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks by Sun et al.
  12. Densely Connected Convolutional Networks by Huang et al.


Teaching Machines to Read Radiology Reports

At Qure, we build deep learning models to detect abnormalities from radiological images. These models require a huge amount of labeled data to learn to diagnose abnormalities from scans. So, we collected a large dataset from several centers, including both in-hospital and outpatient radiology centers. These datasets contain scans and the associated clinical radiology reports.

For now, we use radiologist reports as the gold standard as we train deep learning algorithms to recognize abnormalities on radiology images. While this is not ideal for many reasons (see this), it is currently the most scalable way to supply classification algorithms with the millions of images that they need in order to achieve high accuracy.

These reports are usually written in free form text rather than in a structured format. So, we have designed a rule based Natural Language Processing (NLP) system to extract findings automatically from these unstructured reports.

Axial ct sections of the brain were performed from the level of base of skull. 5mm sections were done for the posterior fossa and 5 mm sections for the supra sellar region without contrast.

- Area of intracerebral haemorrhage measuring 16x15mm seen in left gangliocapsular region and left corona radiate.
- Minimal squashing of left lateral ventricle noted without any appreciable midline shift
- Lacunar infarcts seen in both gangliocapsular regions
- Cerebellar parenchyma is normal.
- Fourth ventricle is normal in position and caliber. 
- The cerebellopontine cisterns, basal cisterns and sylvian cisterns appear normal.
- Midbrain and pontine structures are normal.
- Sella and para sellar regions appear normal.
- The grey-white matter attenuation pattern is normal.
- Calvarium appears normal
- Ethmoid and right maxillary sinusitis noted


	"intracerebral hemorrhage": true,
	"lacunar infarct": true,
	"mass effect": true,
	"midline shift": false,
	"maxillary sinusitis": true

An example clinical radiology report and the automatically extracted findings

Why Rule based NLP ?

Rule based NLP systems use a list of manually created rules to parse the unorganized content and structure it. Machine Learning (ML) based NLP systems, on the other hand, automatically generate the rules when trained on a large annotated dataset.

Rule based approaches have multiple advantages when compared to ML based ones:

  1. Clinical knowledge can be manually incorporated into a rule based system, whereas capturing this knowledge in an ML based system requires a huge amount of annotation.
  2. The auto-generated rules of ML systems are difficult to interpret compared to manually curated rules.
  3. Rules can be readily added or modified to accommodate a new set of target findings in a rule based system.
  4. Previous works on clinical report parsing [1, 2] show that the results of machine learning based NLP systems are inferior to those of rule based ones.

Development of Rule based NLP

As reports were collected from multiple centers, there were multiple reporting standards. Therefore, we constructed a set of rules to capture these variations after manually reading a large number of reports. Of these, I illustrate two common types of rules below.

Findings Detection

In reports, the same finding can be noted in several different formats: the definition of the finding itself, or one of its synonyms. For example, the finding blunted CP angle could be reported in any of the following ways:

  • CP angle is obliterated
  • Hazy costophrenic angles
  • Obscured CP angle
  • Effusion/thickening

We collected all the wordings that can be used to report each finding and created a rule per finding. As an illustration, the following is the rule for blunted CP angle.

((angle & (blunt | obscur | oblitera | haz | opaci)) | (effusio & thicken))

Blunted CP

Visualization of blunted CP angle rule

This rule is positive if a sentence contains the word angle together with blunted or one of its synonyms. Alternatively, it is also positive if a sentence contains the words effusion and thickening.
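A toy Python version of this rule (our own illustrative sketch; the production system is considerably more elaborate) could look like:

```python
import re

# Toy version of the blunted CP angle rule:
#   ((angle & (blunt | obscur | oblitera | haz | opaci)) | (effusio & thicken))
def blunted_cp_angle(sentence: str) -> bool:
    """Return True if the sentence reports a blunted CP angle."""
    s = sentence.lower()

    def has(*stems):
        # Stem match at a word boundary, e.g. 'obscur' matches 'obscured'
        return any(re.search(r"\b" + stem, s) for stem in stems)

    return (has("angle") and has("blunt", "obscur", "oblitera", "haz", "opaci")) \
        or (has("effusio") and has("thicken"))

assert blunted_cp_angle("CP angle is obliterated")
assert blunted_cp_angle("Hazy costophrenic angles")
assert not blunted_cp_angle("CP angles are clear")
```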

In addition, findings can have a hierarchical structure. For example, opacity is considered positive if any of edema, groundglass, consolidation etc. are positive. We therefore created an ontology of findings, and rules to deal with this hierarchy.

rule = ((opacit & !(/ & collapse)) | infiltrate | hyperdensit)
hierarchy = (edema | groundglass | consolidation | ... )

Rule and hierarchy for opacity

Negation Detection

The above mentioned rules are used to detect a finding in a report. But these are not sufficient to understand the reports. For example, consider the following sentences.

1. Intracerebral hemorrhage is absent.
2. Contusions are ruled out.
3. No evidence of intracranial hemorrhages in the brain.

Although the findings intracerebral hemorrhage, contusion and intracranial hemorrhage are mentioned in the above sentences, their absence is noted in these sentences rather than their presence. Therefore, we need to detect negations in a sentence in addition to findings.

We manually read several sentences that indicate negation of findings and grouped these sentences according to their structures. Rules to detect negation were created based on these groups.
One of these is illustrated below:

() & ( is | are | was | were ) & (absent | ruled out | unlikely | negative)


Negation detection structure

We can see that the first and second sentences of the above example match this rule, and we can therefore infer that the findings are negative.

  1. Intracerebral hemorrhage is absent → intracerebral hemorrhage negative.
  2. Contusions are ruled out → contusion negative.
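A sketch of how such a negation structure could be combined with a finding rule, again purely illustrative:

```python
import re

# Toy version of one negation structure:
#   (finding) & (is | are | was | were) & (absent | ruled out | unlikely | negative)
NEGATION = re.compile(
    r"\b(is|are|was|were)\b.*\b(absent|ruled out|unlikely|negative)\b",
    re.IGNORECASE,
)

def detect(sentence, finding_rule):
    """True/False if the finding is asserted/negated, None if not mentioned."""
    if not finding_rule(sentence):
        return None           # the finding does not appear in the sentence
    return not NEGATION.search(sentence)
```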


We have tested our algorithm on a dataset containing 1878 clinical radiology reports of Head CT scans. We manually read all the reports to create gold standards. We used sensitivity and specificity as evaluation metrics. The results obtained are given in the table below.

| Finding | # Positive Reports | Sensitivity |
| --- | --- | --- |
| Intracranial Hemorrhage | 207 | 0.9807 |
| Intraparenchymal Hemorrhage | 157 | 0.9809 |
| Intraventricular Hemorrhage | 44 | 1.0000 |
| Subdural Hemorrhage | 44 | 0.9318 |
| Extradural Hemorrhage | 27 | 1.0000 |
| Subarachnoid Hemorrhage | 51 | 1.0000 |
| Calvarial Fracture | 89 | 0.9888 |
| Midline Shift | 54 | 0.9815 |
| Mass Effect | 132 | 0.9773 |

In this paper [1], the authors used an ML based NLP model (bag of words with unigrams, bigrams, and trigrams, plus an average word embedding vector) to extract findings from head CT clinical radiology reports. They reported an average sensitivity and average specificity of 0.9025 and 0.9172 across findings. The same metrics across our target findings turn out to be 0.9841 and 0.9956 respectively. This suggests that rule based NLP algorithms can perform better than ML based ones on clinical reports.


  1. John Zech, Margaret Pain, Joseph Titano, Marcus Badgeley, Javin Schefflein, Andres Su, Anthony Costa, Joshua Bederson, Joseph Lehar & Eric Karl Oermann (2018). Natural Language–based Machine Learning Models for the Annotation of Clinical Radiology Reports. Radiology.
  2. Bethany Percha, Houssam Nassif, Jafi Lipson, Elizabeth Burnside & Daniel Rubin (2012). Automatic classification of mammography reports by BI-RADS breast tissue composition class.


What We Learned Deploying Deep Learning at Scale for Radiology Images

Qure.ai is deploying deep learning for radiology across the globe. This blog is the first in a series where we talk about our learnings from deploying deep learning solutions at radiology centers. We cover the technical aspects of the challenges and solutions here; the operational hurdles will be covered in the next part of this series.

The dawn of an AI revolution is upon us. Deep learning, or deep neural networks, has crawled into our daily lives, transforming how we type, write emails, search for photos etc. It is revolutionizing major fields like healthcare, banking, driving etc. At Qure.ai, we have been working for the past couple of years on our mission of making healthcare more affordable and accessible through the power of deep learning.

Since our journey began more than two years ago, we have seen excellent progress in development and visualization of deep learning models. With Nvidia leading the advancements in GPUs and the release of Pytorch, Tensorflow, MXNet etc leading the war on deep learning frameworks, training deep learning models has become faster and easier than ever.

However, deploying these deep learning models at scale is a different beast altogether. Let's discuss some of the major problems that Qure.ai has tackled/is tackling in deploying deep learning for hospitals and radiologists across the globe.

Where does the challenge lie?

Let us start with understanding how the challenges in deploying deep learning models are different from training them. During training, the focus is mainly on the accuracy of predictions, while deployment focuses on speed and reliability of predictions. Models can be trained on local servers, but in deployment, they need to be capable of scaling up or down depending upon the volume of API requests. Companies like Algorithmia and EnvoyAI are trying to solve this problem by providing a layer over AI to serve the end users. We are already working with EnvoyAI to explore this route of deploying deep learning.

Selecting the right deep learning framework

Caffe was the first framework built to focus on production. Initially, our research team was using both Torch (flexible, imperative) as well as Lasagne/Keras (python!) for training. The release of Pytorch in late 2016 settled the debate on frameworks within our team.

Deep learning frameworks (source)

Thankfully, this happened before we started looking into deployment. Once we finalized Pytorch for training and tweaking our models, we started looking into best practices for deploying the same. Meanwhile, Facebook released Caffe2 for easier deployment, especially into mobile devices.

The AI community, including Facebook, Microsoft and Amazon, came together to release the Open Neural Network Exchange (ONNX), making it easier to switch between tools as needed. For example, it enables you to train your model in Pytorch and then export it into Caffe2/MXNet/CNTK (Cognitive Toolkit) for deployment. This approach is worth looking into when the load on our servers increases, but for our present needs, deploying models in Pytorch has sufficed.
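As an illustration, exporting a trained PyTorch model to ONNX takes only a few lines; the model, input size and file name below are placeholders:

```python
import torch
import torchvision

# Export a PyTorch model to ONNX so it can be served from Caffe2 / MXNet / CNTK.
model = torchvision.models.densenet121(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # example input with the expected shape

torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["image"], output_names=["scores"],
)
```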

Selecting the right stack

We use following components to build our Linux servers keeping our pythonic deep learning framework in mind.

  • Docker: For operating system level virtualization
  • Anaconda: For creating python3 virtual environments and supervising package installations
  • Django: For building and serving RESTful APIs
  • Pytorch: As deep learning framework
  • Nginx: As webserver and load balancer
  • uWSGI: For serving multiple requests at a time
  • Celery: As distributed task queue

Most of these tools can be replaced as per requirements. The following diagram represents our present stack.

Server architecture

Choosing the cloud GPU server

We use Amazon EC2 P2 instances as our cloud GPU servers, primarily due to our team's familiarity with AWS, although Microsoft Azure and Google Cloud are also excellent options.

Automating scaling and load balancing

Our servers are built using small components performing specific services, and it was important to have them on the same host for easy configuration. Moreover, we handle large DICOM images (each between 10 and 50 MB in size) which get transferred between the components. It made sense to have all the components on the same host, or else the network bandwidth might get choked by these transfers. The following diagram illustrates the various software components comprising a typical Qure deployment.

Software Components

We started with launching qXR (Chest X-ray product) on a P2 instance but as the load on our servers rose, managing GPU memory became an overhead. We were also planning to launch qER (HeadCT product) which had even higher GPU memory requirements.

Initially, we kept buying new P2 instances. But optimizing their usage, and making sure that a few instances were not bogged down by the incoming load while others remained comparatively free, became a challenge. It became clear that we needed auto-scaling for our containers.

Load balancing improves the distribution of workloads across instances (source)

That was when we started looking into solutions for managing our containerized applications. We decided to go ahead with Kubernetes (Amazon ECS is also an excellent alternative), mainly because it runs independently of any specific provider (ECS has to be deployed on the Amazon cloud). Since many hospitals and radiology centers prefer on-premise deployment, Kubernetes is clearly more suited to such needs. It makes life easier through automatic bin-packing of containers based on resource requirements, simpler horizontal scaling, and load balancing.

GPU memory management

Initially, when qXR was deployed, it dealt with fewer abnormalities. So for an incoming request, loading the models into memory, processing images through them and then releasing the memory worked fine. But as the number of abnormalities (and thereby models) increased, loading all the models sequentially for each incoming request became an overhead.

We thought of accumulating incoming requests and processing images in batches on a periodic basis. This could have been a decent solution except that time was critical when dealing with medical images, more so in emergency situations. It was especially critical for qER where in cases of strokes, one has less than an hour to make a diagnostic decision. This ruled out the batch processing approach.

Beware of GPUs !! (warning at Qure's Mumbai office)

Moreover, our models for qER were even larger, requiring approximately 10x the GPU memory of the qXR models. Another thought was to keep the models loaded in memory and process images through them as the requests arrive. This is a good solution where you need to run your models every second or even millisecond (think of AI models running on the millions of images being uploaded to Facebook or Google Photos). However, this is not a typical scenario in the medical domain: radiology centers do not encounter patients at that scale. Even if the servers send back the results within a couple of minutes, that's about a 30x improvement on the time a radiologist would take to report the scan, and that assumes a radiologist is immediately available. Otherwise, the average turnaround for a chest X-ray scan varies from 1 to 2 days (700-1400x of what we currently take).

As of now, auto-scaling with Kubernetes solves our problems, but we will definitely revisit this in the future. The solution lies somewhere between the two approaches (think of a caching mechanism for deep learning models).


Training deep learning models, especially in healthcare, is only one part of building a successful AI product. Bringing it to healthcare practitioners is a formidable and interesting challenge in itself. There are other operational hurdles like convincing doctors to embrace AI, offline working style at some hospitals (using radiographic films), lack of modern infrastructure at radiology centers (operating systems, bandwidth, RAM, disk space, GPU), varying procedures for scan acquisition etc. We will talk about them in detail in the next part of this series.


For a free trial of qXR and qER, please visit us at


Visualizing Deep Learning Networks – Part II

In the previous post we looked at methods to visualize and interpret the decisions made by deep learning models using perturbation based techniques. To summarize: perturbation based methods do a good job of explaining decisions, but they suffer from expensive computation and instability to unexpected artifacts. In this post, we give a brief overview of the various gradient-based and relevance score based algorithms for deep learning based classification models, along with their drawbacks.

We would be discussing the following types of algorithms in this post:

  1. Gradient-based algorithms
  2. Relevance score based algorithms

In gradient-based algorithms, the gradient of the output with respect to the input is used for constructing the saliency maps. The algorithms in this class differ in the way the gradients are modified during backpropagation. Relevance score based algorithms try to attribute the relevance of each input pixel by backpropagating the probability score instead of the gradient. However, all of these methods involve a single forward and backward pass through the net to generate heatmaps as opposed to multiple forward passes for the perturbation based methods. Evidently, all of these methods are computationally cheaper as well as free of artifacts originating from perturbation techniques.

To illustrate each algorithm, we would be considering a chest X-ray (image below) of a patient diagnosed with pulmonary consolidation. Pulmonary consolidation is simply a "solidification" of the lung tissue due to the accumulation of solid and liquid material in the air spaces that would normally be filled by gas [1]. The dense material deposition in the airways could be caused by infection or pneumonia (deposition of pus), lung cancer (deposition of malignant cells), pulmonary hemorrhage (airways filled with blood) etc. An easy way to diagnose consolidation is to look for dense abnormal regions with ill-defined borders in the X-ray image.


Chest X-ray with consolidation.

We would be considering this X-ray and one of our models trained for detecting consolidation for demonstration purposes. For this patient, our consolidation model predicts a possible consolidation with 98.2% confidence.

Gradient Based

Gradient Input

  • Deep inside convolutional networks: Visualising image classification models and saliency maps
  • Submitted on 20 Dec 2013
  • Arxiv Link

Measure the relative importance of input features by calculating the gradient of the output decision with respect to those input features.

There were two very similar papers that pioneered the idea in 2013. In these papers — saliency maps [2] by Simonyan et al. and DeconvNet [3] by Zeiler et al. — the authors directly used the gradient of the majority class prediction with respect to the input to observe salient features. The main difference between the papers was how they handled the backpropagation of gradients through non-linear layers like ReLU. In the saliency maps paper, the gradients of neurons with negative input were suppressed while propagating through ReLU layers; in the DeconvNet paper, the gradients of neurons with negative incoming gradients were suppressed.

Given an image \(I_0\), a class \(c\), and a classification ConvNet with class score function \(S_c(I)\), the heatmap is calculated as the absolute value of the gradient of \(S_c\) with respect to \(I\), evaluated at \(I_0\):
\[ \left| \frac{\partial S_c}{\partial I} \Big|_{I_0} \right| \]

It is to be noted here that the DeepLIFT paper (which we'll discuss later) also explores gradient × input as an alternative indicator, since it leverages both the strength and the sign of the input:
\[ \frac{\partial S_c}{\partial I} \Big|_{I_0} \cdot I_0 \]


Heatmap by GradInput against original annotation.

The problem with such a simple algorithm arises from non-linear activation functions like ReLU, ELU etc. Such non-linear functions, being non-differentiable at certain locations, have discontinuous gradients. Since these methods measure partial derivatives with respect to each pixel, the gradient heatmap is inherently discontinuous over the image and produces artifacts if viewed as-is. Some of this can be overcome by convolving with a Gaussian kernel. The gradient flow also suffers through renormalization layers like BatchNorm and max pooling.
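Despite these caveats, the method itself takes only a few lines of PyTorch; a minimal sketch (the function name is ours):

```python
import torch

def gradient_saliency(model, image, class_idx):
    """Vanilla saliency (Simonyan et al.): |d score / d input|, max over channels.

    image: (1, C, H, W) tensor; model maps it to class scores (1, num_classes).
    """
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image)[0, class_idx]
    score.backward()                               # d(score)/d(pixels)
    return image.grad.abs().max(dim=1).values[0]   # (H, W) heatmap
```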

Guided Backpropagation

  • Striving for simplicity: The all convolutional net
  • Submitted on 21 Dec 2014
  • Arxiv Link

The next paper [4], by Springenberg et al., released in 2014, introduced GuidedBackprop, which suppresses the flow of gradients through neurons where either the input or the incoming gradient is negative. As we discussed, this combines the gradient handling of both Simonyan et al. and Zeiler et al. Springenberg et al. showed the difference amongst these methods through a beautiful illustration, given below.


Schematic of visualizing the activations of high layer neurons. a) Given an input image, we perform a forward pass to the layer we are interested in, then set all activations except one to zero and propagate back to the image to get a reconstruction. b) Different methods of propagating back through a ReLU nonlinearity. c) Formal definition of the different methods for propagating an output activation back through a ReLU unit in layer l; note that the 'deconvnet' approach and guided backpropagation do not compute a true gradient but rather an imputed version. Source.


Heatmap by GuidedBackprop against original annotation.
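GuidedBackprop can be prototyped with backward hooks on the ReLU modules; a hedged sketch follows (the hook mechanics are our own, and inplace ReLUs would need extra care):

```python
import torch
import torch.nn as nn

def guided_backprop(model, image, class_idx):
    """GuidedBackprop sketch: during backward through every ReLU, also zero
    the negative incoming gradients (autograd already zeroes positions whose
    forward input was negative). Assumes ReLU(inplace=False)."""
    def guide(module, grad_in, grad_out):
        return (torch.clamp(grad_in[0], min=0.0),)

    hooks = [m.register_full_backward_hook(guide)
             for m in model.modules() if isinstance(m, nn.ReLU)]
    image = image.clone().requires_grad_(True)
    model(image)[0, class_idx].backward()
    for h in hooks:
        h.remove()                                 # restore normal backpropagation
    return image.grad.abs()[0].max(dim=0).values   # (H, W) heatmap
```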

The problem of gradient flow through ReLU layers still remained at large. Tackling renormalization layers was also an unresolved problem, as most of the above papers (including this one) proposed fully convolutional architectures (without max pooling layers), and batch normalization had yet to be 'alchemised' in 2014. Another such fully-convolutional architecture paper was CAM [6].

Grad CAM

  • Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
  • Submitted on 07 Oct 2016
  • Arxiv Link

An effective way to circumvent the backpropagation problems was explored in GradCAM [5] by Selvaraju et al. This paper is a generalization of the CAM [6] algorithm by Zhou et al., which described attribution scores for architectures without fully connected layers. The idea is: instead of propagating the gradients all the way back to the input, the activation maps of the final convolutional layer can be used directly to infer a downsampled relevance map of the input pixels. The downsampled heatmap is then upsampled to obtain a coarse relevance heatmap.


Let the feature maps in the final convolutional layers be F1, F2 … ,Fn. Like before assume image I0, a class c, and a classification ConvNet with the class score function Sc(I).

  1. Weights \((w_1, w_2, \dots, w_n)\) for the feature maps are calculated from the gradients of the class \(c\) score w.r.t. each feature map:
    \( w_i = \frac{\partial S_c}{\partial F} \Big|_{F_i} \quad \forall\, i = 1 \dots n \)
  2. The weights and the corresponding activations of the feature maps are multiplied to compute the weighted activations \((A_1, A_2, \dots, A_n)\) of each pixel in the feature maps:
    \( A_i = w_i \cdot F_i \quad \forall\, i = 1 \dots n \)
  3. The weighted activations are added pixel-wise across feature maps to indicate the importance of each pixel \((i,j)\) in the downsampled feature-importance map \(H\):
    \( H_{i,j} = \sum_{k=1}^{n} A_k(i,j) \)
  4. The downsampled heatmap ( H_{i,j} ) is upsampled to original image dimensions to produce the coarse-grained relevant heatmap
  5. [Optional] The authors suggest multiplying the final coarse heatmap with the heatmap obtained from GuidedBackprop to obtain a finer heatmap.

Steps 1-4 make up the GradCAM method; including step 5 constitutes the Guided GradCAM method. Here's what a heatmap generated by the GradCAM method looks like. The best contribution of the paper was generalizing CAM to architectures with fully-connected layers.


Heatmap by GradCAM against original annotation.
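The whole procedure fits in a short PyTorch sketch; the hook plumbing and layer choice are ours, and following the paper the gradients are pooled into one weight per feature map:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, feature_layer, image, class_idx):
    """Grad-CAM sketch: weight the final conv feature maps by the pooled
    gradients of the class score and sum them into a coarse heatmap."""
    feats, grads = [], []
    h1 = feature_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = feature_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    model(image)[0, class_idx].backward()
    h1.remove(); h2.remove()

    f, g = feats[0], grads[0]                        # each (1, n, h, w)
    w = g.mean(dim=(2, 3), keepdim=True)             # one weight per feature map
    cam = F.relu((w * f).sum(dim=1, keepdim=True))   # weighted sum, keep positive evidence
    cam = F.interpolate(cam, size=image.shape[2:],   # upsample coarse map to input size
                        mode="bilinear", align_corners=False)
    return cam[0, 0]                                 # (H, W) heatmap
```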

The algorithm manages to steer clear of backpropagating gradients all the way to the input – it only propagates them up to the final convolutional layer. The major problem with GradCAM is its limitation to architectures that use an average pooling layer to connect the convolutional layers to the fully connected layers. The other major drawback is that upsampling the coarse heatmap introduces artifacts and a loss of signal.

Relevance score based

There are a couple of major problems with the gradient-based methods which can be summarised as follows:

  1. Discontinuous gradients for some non-linear activations: as explained in the figure below (taken from the DeepLIFT paper), the discontinuities in gradients cause undesirable artifacts. Attribution also doesn't propagate back smoothly through such non-linearities, distorting the attribution scores.

    Discontinuous gradients

    Discontinuous gradients of gradient based methods. Source.

  2. Saturation of gradients: as explained through this simplistic network, once i1 + i2 > 1 the output saturates, and the gradient of the output with respect to either input stays zero no matter how much larger the inputs become.

Gradient saturation

Saturation problems of gradient based methods. Source.

Layerwise Relevance Propagation

  • On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation
  • Published on July 10, 2015
  • Journal Link

To counter these issues, a relevance score based attribution technique was first discussed by Bach et al. in 2015 in this paper [7]. The authors suggested a simple yet strong technique: propagate relevance scores backwards, redistributing them in proportion to the activations of the previous layer. Because the redistribution is based on activation scores, we steer clear of the difficulties that arise with non-linear activation layers.


This implementation follows epsilon-LRP [8], where a small epsilon is added to the denominator to propagate relevance with numerical stability. Like before, assume an image I0, a class c, and a classification ConvNet with class score function Sc(I).

  1. The relevance score \(R_f\) for the final layer is \(S_c\)
  2. While the input layer is not reached:
    • Redistribute the relevance scores of the current layer \(l+1\) onto the previous layer \(l\) in proportion to the activations. Say \(z_{ij}\) is the contribution of the \(i\)th neuron in layer \(l\) to the activation of the \(j\)th neuron in layer \(l+1\), with
      \( z_j = \sum_{i} z_{ij} \)
      Then the relevance redistributed to neuron \(i\) of layer \(l\) is
      \[ R_i^{(l)} = \sum_{j} \frac{z_{ij}}{z_j + \epsilon \cdot \mathrm{sign}(z_j)} \, R_j^{(l+1)} \]


Heatmap by Epsilon LRP against original annotation.
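For a single linear layer, the epsilon rule reduces to a few lines; a sketch with our own function name and shapes:

```python
import torch

def lrp_linear(x, weight, relevance_out, eps=1e-6):
    """Epsilon-LRP for one linear layer: redistribute each output neuron's
    relevance to its inputs in proportion to the contributions z_ij = w_ij * x_i.

    x: (in,), weight: (out, in), relevance_out: (out,) -> relevance_in: (in,)
    """
    z = weight * x                            # z_ij, shape (out, in)
    zj = z.sum(dim=1, keepdim=True)           # z_j = sum_i z_ij, shape (out, 1)
    zj = zj + eps * torch.sign(zj)            # epsilon stabilises the denominator
    return (z / zj * relevance_out.unsqueeze(1)).sum(dim=0)
```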


  • Learning Important Features Through Propagating Activation Differences
  • Submitted on 10 Apr 2017
  • Arxiv Link

The last paper [9] we cover in this series is also based on layer-wise relevance. However, instead of directly explaining the output prediction as the previous methods do, the authors explain the difference between the output prediction and the prediction on a baseline reference image. The concept is similar to Integrated Gradients, which we discussed in the previous post. The authors bring out a valid concern with the gradient-based methods described above: gradients don't use a reference, which limits the inference. Gradient-based methods only describe the local behavior of the output at the specific input value, without considering how the output behaves over a range of inputs.

The reference image (IR) is chosen as a neutral image suitable for the problem at hand. For a class c and a classification ConvNet with class score function Sc(I), let SRc be the score for the reference image IR. The relevance score to be propagated is then not Sc but Sc − SRc.


We have so far covered both perturbation based and gradient-based methods. Computationally and practically, perturbation based methods are not much of a win, although their performance is relatively uniform and consistent with an underlying concept of interpretability. Gradient-based methods are computationally cheaper and measure the contribution of pixels in the local neighborhood of the original image, but they are plagued by the difficulties of propagating gradients back through non-linear and renormalization layers. The layer-wise relevance techniques go a step further and directly redistribute relevance in proportion to the activations, thereby steering clear of the problems with propagating through non-linear layers. And in order to capture the relative importance of pixels beyond the local neighborhood of pixel intensities, DeepLIFT redistributes the difference between the activations of an image and a baseline image.

We'll follow up with a final post on the performance of all the methods discussed in this and the previous post, along with a detailed analysis.


  1. Consolidation of Lung – Signs, Symptoms and Causes
  2. Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
  3. Zeiler, M. D., & Fergus, R. (2014, September). Visualizing and understanding convolutional networks. In European conference on computer vision (pp. 818-833). Springer, Cham.
  4. Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.
  5. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2016). Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization.
  6. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2921-2929).
  7. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K. R., & Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7), e0130140.
  8. Samek, W., Binder, A., Montavon, G., Lapuschkin, S., & Müller, K. R. (2017). Evaluating the visualization of what a deep neural network has learned. IEEE transactions on neural networks and learning systems.
  9. Shrikumar, A., Greenside, P., & Kundaje, A. (2017). Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685.