
Improving performance of AI models in presence of artifacts

Our deep learning models have become very good at recognizing hemorrhages from head CT scans. However, real-world performance is sometimes hampered by external factors, both hardware-related and human-related. In this blog post, we analyze how acquisition artifacts degrade performance and introduce two methods that we tried to solve this problem.

Medical imaging is often accompanied by acquisition artifacts, which can be subject-related or hardware-related. These artifacts make confident diagnostic evaluation difficult in two ways:

  • by overlaying on abnormalities and making them less obvious visually.
  • by mimicking an abnormality.

Some common examples of artifacts are:

  • Clothing artifact – caused by clothing on the patient at acquisition time. See fig 1 below, where a button on the patient’s clothing looks like a coin lesion on a chest X-ray (marked by the red arrow).


Fig 1. A button mimicking a coin lesion in a chest X-ray, marked by the red arrow. Source.

  • Motion artifact – caused by voluntary or involuntary subject motion during acquisition. Severe artifacts due to voluntary motion usually call for a rescan. Involuntary motion such as respiration or cardiac motion, or minimal subject movement, can result in artifacts that go undetected and mimic a pathology. See fig 2, where subject movement has produced motion artifacts that mimic a subdural hemorrhage (SDH).


Fig 2. Artifact due to subject motion, mimicking a subdural hemorrhage in a head CT. Source.

  • Hardware artifact – see fig 3. This artifact is caused by air bubbles in the cooling system. There are subtle irregular dark bands in the scan that can be misidentified as cerebral edema.


Fig 3. A hardware-related artifact, mimicking cerebral edema, marked by yellow arrows. Source.

Here we investigate motion artifacts that look like SDH in head CT scans. These artifacts increase the false positive (FP) predictions of subdural hemorrhage models. We confirmed this by quantitatively analyzing the FPs of our AI model deployed at an urban outpatient center: FP rates were higher on this data than on our internal test dataset.
The reason for these false positives is the lack of variety of artifact-ridden data in the training set. It is practically difficult to acquire and include scans containing all varieties of artifacts in the training set.


Fig 4. The model identifies an artifact slice as SDH because of similarity in shape and location: both are hyperdense areas close to the cranial bones.

We tried to solve this problem in the following two ways.

  • Making the models invariant to artifacts by explicitly including artifact images in our training dataset.
  • Discounting slices with artifacts when calculating the probability of bleed in a scan.

Method 1. Artifact as an augmentation using Cycle GANs

We reasoned that the artifacts were misclassified as bleeds because the model had not seen enough artifact scans during training.
The number of images containing artifacts in our annotated training dataset is relatively small. But we have access to several unannotated scans containing artifacts, acquired from various centers with older CT scanners. (Motion artifacts are more prevalent on older CT scanners with poor in-plane temporal resolution.) If we could generate artifact-ridden versions of all the annotated images in our training dataset, we would effectively augment the dataset and make the model invariant to artifacts.
We decided to use a Cycle GAN to generate new training data containing artifacts.

Cycle GAN [1] is a generative adversarial network used for unpaired image-to-image translation. It serves our purpose because we have an unpaired image translation problem: the X domain has our artifact-free training CT images and the Y domain has artifact-ridden CT images.
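To make this concrete, below is a minimal Pytorch sketch of the cycle-consistency loss that drives Cycle GAN training. The generators G (artifact-free to artifact) and F (artifact to artifact-free) are assumed to be defined elsewhere; this illustrates the idea and is not our exact training code.

import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G, F, x, y, lam=10.0):
    # Translating to the other domain and back should recover the input:
    # x (no artifact) -> G(x) (with artifact) -> F(G(x)) should equal x.
    loss_x = l1(F(G(x)), x)
    # y (with artifact) -> F(y) -> G(F(y)) should equal y.
    loss_y = l1(G(F(y)), y)
    return lam * (loss_x + loss_y)

This loss is added to the usual adversarial losses of the two discriminators.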


Fig 5. Cycle GAN was used to convert a short clip of a horse into that of a zebra. Source.

We curated a dataset A of 5000 images without artifacts and a dataset B of 4000 images with artifacts, and used these to train the Cycle GAN.

Unfortunately, the quality of the generated images was not very good. See fig 6.
The generator was unable to capture all the variety in the CT data and introduced artifacts of its own, rendering it useless for augmenting the dataset. The Cycle GAN authors state that the generator performs worse when the transformation involves geometric changes (e.g. dog to cat, apples to oranges) than when it involves color or style changes. Introducing artifacts is a bit more complex than a color or style change because it has to distort existing geometry. This could be one reason why the generated images have extra artifacts.


Fig 6. A sampling of images generated using Cycle GAN. real_A are input images and fake_B are the artifact images generated by Cycle GAN.

Method 2. Discounting artifact slices

In this method, we trained a model to identify slices with artifacts and show that discounting these slices made the subdural hemorrhage (SDH) model robust to artifacts.
A manually annotated dataset was used to train a convolutional neural network (CNN) to detect whether a CT slice had artifacts. The original SDH model was also a CNN, which predicted whether a slice contained SDH. The probabilities from the artifact model were used to discount slices containing artifacts, so that only the artifact-free slices of a scan contributed to the score for presence of bleed.
See fig 7.


Fig 7. Method 2: using a trained artifact model to discount artifact slices while calculating SDH probability.
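Below is a minimal sketch of how such discounting can be computed from per-slice probabilities of the two CNNs; the exact combination rule in the deployed model may differ.

import numpy as np

def scan_sdh_score(p_sdh, p_artifact, artifact_threshold=0.5):
    # p_sdh, p_artifact: per-slice probabilities for one scan.
    p_sdh = np.asarray(p_sdh)
    p_artifact = np.asarray(p_artifact)
    clean = p_artifact < artifact_threshold  # keep artifact-free slices only
    if not clean.any():
        return 0.0                           # every slice was discounted
    return float(p_sdh[clean].max())         # bleed score from clean slices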

Results

Our validation dataset contained 712 head CT scans, of which 42 contained SDH. The original SDH model predicted 35 false positives and no false negatives. Quantitative analysis of the FPs confirmed that 17 (48%) of them were due to CT artifacts. Our trained artifact model had a slice-wise AUC of 96%. The proposed modification to the SDH model reduced the FPs to 18 (a decrease of 48%) without introducing any false negatives. Thus, using method 2, all scan-wise FPs due to artifacts were corrected.

In summary, using method 2, we improved the precision of SDH detection from 54.5% to 70% while maintaining a sensitivity of 100%.


Fig 8. Confusion matrices before and after using the artifact model for SDH prediction.

See fig 9. for model predictions on a representative scan.


Fig 9. Model predictions for a few representative slices in a scan falsely predicted as positive by the original SDH model.

A drawback of method 2 is that if SDH and an artifact are present in the same slice, the SDH could be missed.

Conclusion

Using a Cycle GAN to augment the dataset with artifact-ridden scans would solve the problem by enriching the dataset with both SDH-positive and SDH-negative scans with artifacts overlaid on them. But our current experiments did not produce realistic-looking synthesized images. The alternative we used reduces the problem of high false positives due to artifacts while maintaining the same sensitivity.

References

  1. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks by Jun-Yan Zhu et al.


Challenges of Development & Validation of Deep Learning for Radiology

We recently published an article on our deep learning algorithms for head CT in The Lancet. It is the first AI in medical imaging paper to be published in this journal.
We described the development and validation of these algorithms in the article.
In this blog, I explain some of the challenges we faced in this process and how we solved them. These challenges are fairly general and should apply to any research involving AI and radiology images.

Development

3D Images

The first challenge we faced in the development process is that CT scans are three-dimensional (3D). There is a plethora of research on two-dimensional (2D) images, but far less on 3D images. You might ask, why not simply use 3D convolutional neural networks (CNNs) in place of 2D CNNs? Notwithstanding the computational and memory requirements of 3D CNNs, they have been shown to be inferior to 2D CNN based approaches on a similar problem (action recognition).

So how do we solve it? We need not reinvent the wheel when there is a lot of literature on a similar problem: action recognition, the classification of the action present in a given video.
Why is action recognition similar to 3D volume classification? The temporal dimension in videos is analogous to the Z dimension in the CT.

Left: Example head CT scan. Right: Example video from an action recognition dataset. The Z dimension in the CT volume is analogous to the time dimension in the video.

We took a foundational work from the action recognition literature and modified it for our purposes: we incorporated slice-level (frame-level, in videos) labels into the network. This is because the action recognition literature enjoys the comfort of pretrained 2D CNNs, which we do not share.

High Resolution

The second challenge was that CT is high resolution, both spatially and in bit depth. Spatially, we simply downsample the CT to a standard pixel spacing. How about bit depth? Deep learning doesn’t work well with data that is not normalized to [-1, 1] or [0, 1]. We solved this with what a radiologist would use: windowing. Windowing restricts the dynamic range to a certain interval (e.g. [0, 80]) and then normalizes it. We applied three windows and passed them as channels to the CNNs.
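As an illustration, here is a minimal windowing sketch in Python; the window intervals below are common Hounsfield-unit choices and stand in for the exact values we used.

import numpy as np

def window(ct_hu, low, high):
    # Clip a CT volume (in Hounsfield units) to [low, high], rescale to [0, 1].
    clipped = np.clip(ct_hu, low, high)
    return (clipped - low) / (high - low)

def three_channel(ct_hu):
    brain = window(ct_hu, 0, 80)       # brain window
    blood = window(ct_hu, -20, 180)    # blood/subdural window (illustrative)
    bone = window(ct_hu, -150, 1500)   # bone window (illustrative)
    return np.stack([brain, blood, bone], axis=0)  # channels-first for a CNN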


Windows: brain, blood/subdural and bone

This approach allows multi-class effects to be accounted for by the model. For example, a large scalp hematoma visible in the brain window might indicate a fracture underneath it. Conversely, a fracture visible in the bone window is usually correlated with an extra-axial bleed.

Other Challenges

A few other challenges deserve mention as well:

  1. Class imbalance: We solved the class imbalance issue by weighted sampling and loss weighting (see the sketch after this list).
  2. Lack of pretraining: There is no pretrained model like ImageNet available for medical images. We found that using ImageNet weights actually hurts performance.
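A minimal Pytorch sketch of both techniques; train_labels and train_dataset are hypothetical placeholders.

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

labels = torch.tensor(train_labels).float()  # 0/1 label per scan (hypothetical)
pos_frac = labels.mean()                     # fraction of positive examples

# Weighted sampling: positives and negatives are drawn with equal probability.
weights = labels / pos_frac + (1 - labels) / (1 - pos_frac)
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

# Loss weighting: additionally up-weight the positive class in the loss.
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=(1 - pos_frac) / pos_frac)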

Validation

Once the algorithms were developed, validation had its own challenges.
Here are the key questions we started with: do our algorithms generalize well to CT scans not in the development dataset?
Do they also generalize to CT scans from a different source altogether? How do they compare to radiologists without access to clinical history?

Low prevalences and statistical confidence

The validation looks simple enough: just acquire scans (from a different source), get them read by radiologists, and compare their reads with the algorithms’.
But the statistical design is a challenge! This is because the prevalence of abnormalities tends to be low; it can be as low as 1% for some abnormalities. Our key metrics for evaluating the algorithms are sensitivity, specificity, and AUC, which depends on both. Sensitivity is the troublemaker: we have to ensure there are enough positives in the dataset to get narrow enough 95% confidence intervals (CI). The required number of positive scans turns out to be ~80 for a CI of ±10% at an expected sensitivity of 0.7.

If we were to choose a randomly sampled dataset, the number of scans to be read would be ~80/prevalence rate = 8000. With three readers per scan, the total number of reads is 8k * 3 = 24k. This is a prohibitively large dataset to get read by radiologists. We therefore cannot use a randomly sampled dataset; we have to somehow enrich the number of positives in it.

Enrichment

To enrich a dataset with positives, we have to find the positives among all the available scans. It’s like searching for a needle in a haystack. Fortunately, scans usually have a clinical report associated with them. So we just have to read the reports and choose the positive ones. Even better, have an NLP algorithm parse the reports and randomly sample the required number of positives. We chose this path.

We collected the dataset in two batches, B1 & B2. B1 contained all the head CT scans acquired in a month and B2 was the algorithmically selected dataset. So, B1 mostly contained negatives while B2 contained a lot of positives. This approach removed any selection bias that might have been present if the scans were picked manually. For example, if positive scans were picked by manual, cursory glances at the scans themselves, subtle positive findings would have been missing from the dataset.

Prevalences of the findings in batches B1 and B2. Observe the low prevalences of findings in uniformly sampled batch B1.

Reading

We called this enriched dataset the CQ500 dataset (C for CARING and Q for Qure.ai). The dataset contained 491 scans after exclusions. Three radiologists independently read the scans and the majority vote was considered the gold standard. We randomized the order of the reads to minimize recall of follow-up scans and to blind the readers to the batches of the dataset.

We are making this dataset and the radiologists’ reads public under a CC-BY-NC-SA license. Other researchers can use it to benchmark their algorithms. I think it can also be used for clinical research, such as measuring the concordance of radiologists on various tasks.

In addition to the CQ500 dataset, we validated the algorithms on a much larger randomly sampled dataset, the Qure25k dataset, containing 21,095 scans. Ground truths were clinical radiology reports; we used the NLP algorithm to get structured data from them. This dataset satisfies the statistical requirements, but each scan was read by only a single radiologist, who had access to clinical history.

Results

Finding AUC on CQ500 (95% CI) AUC on Qure25k (95% CI)
Intracranial hemorrhage 0.9419 (0.9187-0.9651) 0.9194 (0.9119-0.9269)
Intraparenchymal 0.9544 (0.9293-0.9795) 0.8977 (0.8884-0.9069)
Intraventricular 0.9310 (0.8654-0.9965) 0.9559 (0.9424-0.9694)
Subdural 0.9521 (0.9117-0.9925) 0.9161 (0.9001-0.9321)
Extradural 0.9731 (0.9113-1.0000) 0.9288 (0.9083-0.9494)
Subarachnoid 0.9574 (0.9214-0.9934) 0.9044 (0.8882-0.9205)
Calvarial fracture 0.9624 (0.9204-1.0000) 0.9244 (0.9130-0.9359)
Midline Shift 0.9697 (0.9403-0.9991) 0.9276 (0.9139-0.9413)
Mass Effect 0.9216 (0.8883-0.9548) 0.8583 (0.8462-0.8703)

AUCs of the algorithms on both datasets.

The table above shows the AUCs of the algorithms on the two datasets. Note that the AUCs are directly comparable because AUC is prevalence independent. AUCs on the CQ500 dataset are generally better than those on the Qure25k dataset. This might be because:

  1. Ground truths in the Qure25k dataset incorporated clinical information not available to the algorithms, so the algorithms fell short of them.
  2. The majority vote of three reads is a better ground truth than a single read.


ROC curves for the algorithms on the Qure25k (blue) and CQ500 (red) datasets. TPR and FPR of radiologists are also plotted.

Shown above are the ROC curves on both datasets, with readers’ TPR and FPR also plotted. We observe that radiologists are either highly sensitive or highly specific to a particular finding. The algorithms are yet to beat radiologists, on this task at least! But they should nonetheless be useful to triage or notify physicians.


Deep Learning for Videos: A 2018 Guide to Action Recognition

Medical images like MRIs and CTs (3D images) are very similar to videos: both encode 2D spatial information over a third dimension. Much like diagnosing abnormalities from 3D images, action recognition from videos requires capturing context from the entire video rather than just information from each frame.

Fig 1: Left: Example head CT scan. Right: Example video from an action recognition dataset. The Z dimension in the CT volume is analogous to the time dimension in the video.

In this post, I summarize the literature on action recognition from videos. The post is organized into three sections –

  1. What is action recognition and why is it tough
  2. Overview of approaches
  3. Summary of papers

What is action recognition and why is it tough?

Action recognition involves identifying different actions from video clips (sequences of 2D frames), where the action may or may not be performed throughout the entire duration of the video. It seems like a natural extension of image classification to multiple frames, followed by aggregating the predictions from each frame. But despite the stratospheric success of deep learning architectures in image classification (ImageNet), progress in architectures for video classification and representation learning has been slower.

What made this task tough?

  1. Huge computational cost
    A simple 2D convolutional net for classifying 101 classes has just ~5M parameters, whereas the same architecture inflated to a 3D structure has ~33M parameters. It takes 3 to 4 days to train a 3DConvNet on UCF101 and about two months on Sports-1M, which makes extensive architecture search difficult and overfitting likely [1].
  2. Capturing long context
    Action recognition involves capturing spatiotemporal context across frames, and the captured spatial information has to be compensated for camera movement. Even strong spatial object detection doesn’t suffice, as the motion information also carries finer details. There is a local as well as a global context to the motion information which needs to be captured for robust predictions. For example, consider the video representations shown in Figure 2. A strong image classifier can identify the human and the water body in both videos, but only the nature of the temporal periodic action differentiates front crawl from breast stroke.

    Fig 2: Left: Front crawl. Right: Breast stroke. Capturing temporal motion is critical to differentiate these two seemingly similar cases. Also notice how the camera angle suddenly changes in the middle of the front crawl video.

  3. Designing classification architectures
    Designing architectures that can capture spatiotemporal information involve multiple options which are non-trivial and expensive to evaluate. For example, some possible strategies could be

    • One network for capturing spatiotemporal information vs. two separate ones for each spatial and temporal
    • Fusing predictions across multiple clips
    • End-to-end training vs. feature extraction and classifying separately
  4. No standard benchmark
    The most popular benchmark datasets have long been UCF101 and Sports1M. Searching for a reasonable architecture on Sports1M can be extremely expensive. For UCF101, although the number of frames is comparable to ImageNet, high spatial correlation among the videos makes the actual diversity in training much smaller. Also, given the similar theme (sports) across both datasets, generalization of benchmarked architectures to other tasks remained a problem. This has lately been solved with the introduction of the Kinetics dataset [2].

    Sample illustration of UCF-101. Source.

It must be noted that abnormality detection from 3D medical images doesn’t involve all the challenges mentioned here. The major differences between action recognition and abnormality detection from medical images are as follows:

  1. In medical imaging, the temporal context may not be as important as in action recognition. For example, detecting hemorrhage in a head CT scan may involve much less temporal context across slices; intracranial hemorrhage can often be detected from a single slice alone. In contrast, detecting a lung nodule in a chest CT involves capturing temporal context, as nodules, bronchi, and vessels all look like circular objects in 2D slices. Only when 3D context is captured can nodules be seen as spherical objects, as opposed to cylindrical objects like vessels.
  2. In action recognition, most research ideas resort to using pre-trained 2D CNNs as a starting point for drastically better convergence. For medical images, such pre-trained networks are unavailable.

Overview of approaches

Before deep learning came along, most traditional CV algorithms for action recognition could be broken down into the following three broad steps:


  1. Local high-dimensional visual features that describe a region of the video are extracted either densely [3] or at a sparse set of interest points[4 , 5].
  2. The extracted features are combined into a fixed-size video-level description. One popular variant of this step is the bag of visual words (derived using hierarchical or k-means clustering) for encoding features at the video level.
  3. A classifier, like an SVM or RF, is trained on the bag of visual words for the final prediction.

Of the algorithms that use shallow hand-crafted features in step 1, improved Dense Trajectories (iDT) [6], which uses densely sampled trajectory features, was the state-of-the-art. Separately, 3D convolutions were applied as-is to action recognition in 2013, without much success [7]. Soon after, in 2014, two breakthrough research papers were released which form the backbone for all the papers we are going to discuss in this post. The major difference between them was the design choice around combining spatiotemporal information.

Approach 1: Single Stream Network

In this work [June 2014], the authors – Karpathy et al. – explore multiple ways to fuse temporal information from consecutive frames using 2D pre-trained convolutions.


Fig 3: Fusion ideas. Source.

As can be seen in Fig 3, consecutive frames of the video are presented as input in all setups. Single frame uses a single architecture that fuses information from all frames at the last stage. Late fusion uses two nets with shared parameters, spaced 15 frames apart, and also combines predictions at the end. Early fusion combines in the first layer by convolving over 10 frames. Slow fusion fuses at multiple stages, a balance between early and late fusion. For final predictions, multiple clips were sampled from the entire video and their prediction scores were averaged.

Despite extensive experimentation, the authors found the results significantly worse than those of state-of-the-art hand-crafted feature based algorithms. There were multiple reasons attributed to this failure:

  1. The learnt spatiotemporal features didn’t capture motion features.
  2. The dataset was less diverse, making it tough to learn such detailed features.

Approach 2: Two Stream Networks

In this pioneering work [June 2014] by Simonyan and Zisserman, the authors built on the failures of the previous work by Karpathy et al. Given the difficulty deep architectures have in learning motion features, the authors explicitly modeled motion as stacked optical flow vectors. So instead of a single network for spatial context, this architecture has two separate networks: one for spatial context (pre-trained) and one for motion context. The input to the spatial net is a single frame of the video. The authors experimented with the input to the temporal net and found that bi-directional optical flow stacked across 10 successive frames performed best. The two streams were trained separately and combined using an SVM. The final prediction was obtained as in the previous paper, i.e. by averaging across sampled frames.


Fig 4: Two stream architecture. Source.

Though this method improved the performance of single stream method by explicitly capturing local temporal movement, there were still a few drawbacks:

  1. Because the video-level predictions were obtained by averaging predictions over sampled clips, long-range temporal information was still missing from the learnt features.
  2. Since training clips are sampled uniformly from videos, they suffer from a false label assignment problem: the ground truth of each clip is assumed to be the same as the ground truth of the video, which may not hold if the action happens only for a small duration of the entire video.
  3. The method involved pre-computing optical flow vectors and storing them separately. Also, the two streams were trained separately, implying that on-the-go end-to-end training was still a long road ahead.

Summaries

The following papers are, in a way, evolutions of these two papers (single stream and two stream); they are summarized below:

  1. LRCN
  2. C3D
  3. Conv3D & Attention
  4. TwoStreamFusion
  5. TSN
  6. ActionVlad
  7. HiddenTwoStream
  8. I3D
  9. T3D

The recurrent theme across these papers can be summarized as follows; all of them are improvisations on top of these basic ideas.


Recurrent theme across papers. Source.

For each of these papers, I list their key contributions and explain them.
I also show their benchmark scores on UCF101-split1.

LRCN

  • Long-term Recurrent Convolutional Networks for Visual Recognition and Description
  • Donahue et al.
  • Submitted on 17 November 2014
  • Arxiv Link

Key Contributions:

  • Building on previous work by using RNN as opposed to stream based designs
  • Extension of encoder-decoder architecture for video representations
  • End-to-end trainable architecture proposed for action recognition

Explanation:

In a previous work by Ng et al. [9], the authors had explored the idea of using LSTMs on separately trained feature maps to see if they could capture temporal information from clips. Sadly, they concluded that temporal pooling of convolutional features proved more effective than an LSTM stacked after trained feature maps. In the current paper, the authors build on the same idea of using LSTM blocks (decoder) after convolution blocks (encoder), but train the entire architecture end-to-end. They also compared RGB and optical flow as input choices and found that a weighted scoring of predictions based on both inputs was best.


Fig 5: Left: LRCN for action recognition. Right: Generic LRCN architecture for all tasks. Source.

Algorithm:

During training, 16-frame clips are sampled from the video. The architecture is trained end-to-end with RGB or optical flow of the 16-frame clips as input. The prediction for each clip is the average of predictions across each time step, and the final video-level prediction is the average of predictions from each clip.
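A minimal Pytorch sketch of the LRCN idea (a per-frame CNN encoder followed by an LSTM); the tiny backbone below is a stand-in, not the architecture from the paper.

import torch
import torch.nn as nn

class LRCN(nn.Module):
    def __init__(self, num_classes, feat_dim=512, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(        # stand-in 2D CNN backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip):                 # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)            # hidden state per time step
        logits = self.head(out)              # (batch, time, num_classes)
        return logits.mean(dim=1)            # average predictions over time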

Benchmarks (UCF101-split1):

Score Comment
82.92 Weighted score of flow and RGB inputs
71.1 Score with just RGB

My comments:

Even though the authors suggested end-to-end training frameworks, there were still a few drawbacks:

  • False label assignment, as the video was broken into clips
  • Inability to capture long-range temporal information
  • Using optical flow meant pre-computing flow features separately

Varol et al. in their work [10] tried to compensate for the stunted temporal range by using a lower spatial resolution and longer clips (60 frames), which led to significantly better performance.

C3D

  • Learning Spatiotemporal Features with 3D Convolutional Networks
  • Du Tran et al.
  • Submitted on 02 December 2014
  • Arxiv Link

Key Contributions:

  • Repurposing 3D convolutional networks as feature extractors
  • Extensive search for best 3D convolutional kernel and architecture
  • Using deconvolutional layers to interpret model decision

Explanation:

In this work the authors built upon the work by Karpathy et al. However, instead of using 2D convolutions across frames, they used 3D convolutions on the video volume. The idea was to train these vast networks on Sports1M and then use them (or an ensemble of nets with different temporal depths) as feature extractors for other datasets. Their finding was that a simple linear classifier like an SVM on top of the ensemble of extracted features worked better than the state-of-the-art algorithms. The model performed even better if hand-crafted features like iDT were used additionally.


Differences between the C3D paper and the single stream paper. Source.

The other interesting part of the work was using deconvolutional layers (explained here) to interpret the decisions. They found that the net focused on spatial appearance in the first few frames and tracked the motion in subsequent frames.

Algorithm:

During training, five random 2-second clips are extracted per video, with the ground truth being the action reported for the entire video. At test time, 10 clips are randomly sampled and their predictions are averaged for the final prediction.


3D convolution where convolution is applied on a spatiotemporal cube.
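For illustration, here is a single 3D convolution over a clip in Pytorch; 3x3x3 kernels on 16-frame 112x112 clips are the typical C3D configuration.

import torch
import torch.nn as nn

# One 3x3x3 convolution applied to a spatiotemporal cube.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, H, W)
features = conv3d(clip)                 # -> (1, 64, 16, 112, 112)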

Benchmarks (UCF101-split1):

Score Comment
82.3 C3D (1 net) + linear SVM
85.2 C3D (3 nets) + linear SVM
90.4 C3D (3 nets) + iDT + linear SVM

My comments:

Long-range temporal modeling was still a problem. Moreover, training such huge networks is computationally expensive, especially for medical imaging where pre-training from natural images doesn’t help a lot.

Note: Around the same time, Sun et al. [11] introduced the concept of factorized 3D conv networks (FSTCN), exploring the idea of breaking 3D convolutions into spatial 2D convolutions followed by temporal 1D convolutions. The 1D convolution, placed after the 2D conv layer, was implemented as a 2D convolution over the temporal and channel dimensions. FSTCN had comparable results on the UCF101 split.


The FSTCN paper and the factorization of 3D convolution. Source.

Conv3D & Attention

  • Describing Videos by Exploiting Temporal Structure
  • Yao et al.
  • Submitted on 25 April 2015
  • Arxiv Link

Key Contributions:

  • Novel 3D CNN-RNN encoder-decoder architecture which captures local spatiotemporal information
  • Use of an attention mechanism within a CNN-RNN encoder-decoder framework to capture global context

Explanation:

Although this work is not directly related to action recognition, it was a landmark work for video representations. In this paper the authors use a 3D CNN + LSTM as the base architecture for a video description task. On top of the base, they use a pre-trained 3D CNN for improved results.

Algorithm:

The setup is almost the same as the encoder-decoder architecture described in LRCN, with two differences:

  1. Instead of passing features from the 3D CNN as-is to the LSTM, 3D CNN feature maps for the clip are concatenated with stacked 2D feature maps for the same set of frames to enrich the representation {v1, v2, …, vn} for each frame i. Note: the 2D & 3D CNNs used are pre-trained and not trained end-to-end as in LRCN.
  2. Instead of averaging temporal vectors across all frames, a weighted average is used to combine the temporal features. The attention weights are decided based on the LSTM output at every time step (see the sketch below).
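A minimal Pytorch sketch of such temporal attention, conditioning the weights on the decoder state; this illustrates the idea rather than the paper’s exact formulation.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, feats, h):   # feats: (B, T, F); h: (B, H) LSTM state
        h_rep = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        e = self.score(torch.cat([feats, h_rep], dim=-1)).squeeze(-1)  # (B, T)
        alpha = torch.softmax(e, dim=1)                  # weight per time step
        return (alpha.unsqueeze(-1) * feats).sum(dim=1)  # weighted average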


Attention mechanism for action recognition. Source.

Benchmarks:

Score Comment
–  The network was used for video description prediction, so no UCF101 score is reported

My comments:

This was one of the landmark works of 2015, introducing the attention mechanism for video representations for the first time.

TwoStreamFusion

  • Convolutional Two-Stream Network Fusion for Video Action Recognition
  • Feichtenhofer et al.
  • Submitted on 22 April 2016
  • Arxiv Link

Key Contributions:

  • Long range temporal modeling through better long range losses
  • Novel multi-level fused architecture

Explanation:

In this work, the authors use the base two stream architecture with two novel approaches and demonstrate a performance increment without any significant increase in the number of parameters. They explore the efficacy of two major ideas:

  1. Fusion of spatial and temporal streams (how and when). For a task discriminating between brushing hair and brushing teeth, the spatial net can capture the spatial dependency in a video (whether it’s hair or teeth) while the temporal net can capture the presence of periodic motion for each spatial location. Hence it is important to map a spatial feature map pertaining to, say, a particular facial region to the temporal feature map for the corresponding region. To achieve this, the nets need to be fused at an early level, such that responses at the same pixel position are put in correspondence, rather than fusing at the end (as in the base two stream architecture).
  2. Combining temporal net output across time frames so that long-term dependency is also modeled.

Algorithm:

Everything from two stream architecture remains almost similar except

  1. As described in the figure below, the outputs of the conv_5 layers from both streams are fused by conv+pooling. There is another fusion at the end layer. The final fused output is used for the spatiotemporal loss evaluation.


    Possible strategies for fusing spatial and temporal streams. The one on the right performed better. Source.

  2. For temporal fusion, the output from the temporal net, stacked across time and fused by conv+pooling, is used for the temporal loss.


Two stream fusion architecture. There are two paths, one for step 1 and the other for step 2. Source.

Benchmarks (UCF101-split1):

Score Comment
92.5 TwoStreamfusion
94.2 TwoStreamfusion + iDT

My comments:
The authors established the supremacy of the TwoStreamFusion method as it improved performance over C3D without the extra parameters that C3D used.

TSN

  • Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
  • Wang et al.
  • Submitted on 02 August 2016
  • Arxiv Link

Key Contributions:

  • Effective solution aimed at long range temporal modeling
  • Establishing the usage of batch normalization, dropout and pre-training as good practices

Explanation:

In this work the authors improved on the two stream architecture to produce state-of-the-art results. There were two major differences from the original paper:

  1. They suggest sampling clips sparsely across the video to better model the long-range temporal signal, instead of random sampling across the entire video.
  2. For the final video-level prediction the authors explored multiple strategies. The best was:
    1. Combining the scores of the temporal and spatial streams (and other streams if other input modalities are involved) separately by averaging across snippets.
    2. Fusing the final spatial and temporal scores using a weighted average and applying softmax over all classes.

The other important part of the work was establishing the problem of overfitting (due to small dataset sizes) and demonstrating the usage of now-prevalent techniques like batch normalization, dropout, and pre-training to counter it. The authors also evaluated two new input modalities as alternatives to optical flow: warped optical flow and RGB difference.

Algorithm:

During training and prediction, a video is divided into K segments of equal duration. Thereafter, snippets are sampled randomly from each of the K segments. The rest of the steps remain similar to the two stream architecture, with the changes mentioned above.
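A sketch of TSN’s sparse snippet sampling; K and the snippet length are illustrative parameters.

import random

def sample_snippets(num_frames, k=3, snippet_len=1):
    # Divide the video into K equal-duration segments and draw one random
    # snippet start index from each segment.
    seg_len = num_frames // k
    starts = []
    for i in range(k):
        lo = i * seg_len
        hi = min((i + 1) * seg_len, num_frames) - snippet_len
        starts.append(random.randint(lo, max(lo, hi)))
    return starts

print(sample_snippets(300, k=3))  # e.g. one frame index per segment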


Temporal Segment Network architecture. Source.

Benchmarks (UCF101-split1):

Score Comment
94.0 TSN (input RGB + Flow )
94.2 TSN (input RGB + Flow + Warped flow)

My comments:

The work tackled two big challenges in action recognition, overfitting due to small dataset sizes and long-range modeling, and the results were really strong. However, the need to pre-compute optical flow and related input modalities remained a problem.

ActionVLAD

  • ActionVLAD: Learning spatio-temporal aggregation for action classification
  • Girdhar et al.
  • Submitted on 10 April 2017
  • Arxiv Link

Key Contributions:

  • Learnable video-level aggregation of features
  • End-to-end trainable model with video-level aggregated features to capture long term dependency

Explanation:

In this work, the most notable contribution is the use of learnable feature aggregation (VLAD), as compared to normal aggregation using maxpool or avgpool. The aggregation technique is akin to bag of visual words. There is a vocabulary of multiple learned anchor points (say c1, …, ck), representing k typical action (or sub-action) related spatiotemporal features. The output from each stream in the two stream architecture is encoded in terms of k “action words” features, each feature being the difference of the output from the corresponding anchor point for any given spatial or temporal location.


ActionVLAD – bag of action-based visual “words”. Source.

Average or max pooling represents the entire distribution of points as a single descriptor, which can be sub-optimal for representing an entire video composed of multiple sub-actions. In contrast, the proposed video aggregation represents an entire distribution of descriptors with multiple sub-actions by splitting the descriptor space into k cells and pooling inside each cell.
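A simplified sketch of VLAD-style pooling with soft assignment; ActionVLAD’s trainable variant is more elaborate (e.g. learnable assignment parameters), so treat this as an illustration of the idea.

import torch

def vlad_pool(feats, centers):
    # feats: (N, D) descriptors from all locations/frames; centers: (K, D).
    sim = feats @ centers.t()                              # (N, K) similarities
    assign = torch.softmax(sim, dim=1)                     # soft assignments
    residuals = feats.unsqueeze(1) - centers.unsqueeze(0)  # (N, K, D)
    vlad = (assign.unsqueeze(-1) * residuals).sum(dim=0)   # (K, D) aggregated
    return torch.nn.functional.normalize(vlad.flatten(), dim=0)  # K*D vector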


While max or average pooling is good for similar features, it does not adequately capture the complete distribution of features. ActionVLAD clusters the appearance and motion features and aggregates their residuals from the nearest cluster centers. Source.

Algorithm:

Everything from the two stream architecture remains almost the same except the use of the ActionVLAD layer. The authors experimented with multiple placements of the ActionVLAD layer; late fusion after the conv layers worked out to be the best strategy.

Benchmarks (UCF101-split1):

Score Comment
92.7 ActionVLAD
93.6 ActionVLAD + iDT

My comments:
The use of VLAD as an effective pooling method had been proven long before. Its extension into an end-to-end trainable framework made this technique extremely robust and state-of-the-art for most action recognition tasks in early 2017.

HiddenTwoStream

  • Hidden Two-Stream Convolutional Networks for Action Recognition
  • Zhu et al.
  • Submitted on 2 April 2017
  • Arxiv Link

Key Contributions:

  • Novel architecture for generating optical flow input on-the-fly using a separate network

Explanation:

The use of optical flow in the two stream architecture made it mandatory to pre-compute optical flow for each sampled frame beforehand, adversely affecting storage and speed. This paper advocates an unsupervised architecture to generate optical flow for a stack of frames.

Optical flow can be regarded as an image reconstruction problem. Given a pair of adjacent frames I1 and I2 as input, the CNN generates a flow field V. Then, using the predicted flow field V and I2, a reconstruction I1' of I1 is obtained by inverse warping, such that the difference between I1 and its reconstruction is minimized.
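A Pytorch sketch of this reconstruction objective; the paper’s full MotionNet loss is multi-level and includes additional terms, so this only illustrates the core idea.

import torch
import torch.nn.functional as F

def photometric_loss(i1, i2, flow):
    # i1, i2: (B, C, H, W) adjacent frames; flow: (B, 2, H, W) in pixels.
    b, _, h, w = i1.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().to(i1.device)  # (H, W, 2)
    coords = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)       # shift by flow
    gx = 2 * coords[..., 0] / (w - 1) - 1     # normalize x to [-1, 1]
    gy = 2 * coords[..., 1] / (h - 1) - 1     # normalize y to [-1, 1]
    i1_rec = F.grid_sample(i2, torch.stack((gx, gy), dim=-1), align_corners=True)
    return (i1 - i1_rec).abs().mean()         # L1 reconstruction error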

Algorithm:

The authors explored multiple strategies and architectures to generate optical flow with the highest fps and the fewest parameters without hurting accuracy much. The final architecture was the same as the two stream architecture, with these changes:

  1. The temporal stream now has the optical flow generation net (MotionNet) stacked on top of the usual temporal stream architecture. The input to the temporal stream is now consecutive frames instead of preprocessed optical flow.
  2. There is an additional multi-level loss for the unsupervised training of MotionNet.

The authors also demonstrate improvement in performance using TSN based fusion instead of conventional architecture for two stream approach.


HiddenTwoStream – MotionNet generates optical flow on-the-fly. Source.

Benchmarks (UCF101-split1):

Score Comment
89.8 Hidden Two Stream
92.5 Hidden Two Stream + TSN

My comments:
The major contribution of the paper was to improve the speed and associated cost of prediction. With automated flow generation, the authors removed the dependency on slower traditional methods of generating optical flow.

I3D

  • Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
  • Carreira et al.
  • Submitted on 22 May 2017
  • Arxiv Link

Key Contributions:

  • Combining 3D based models into two stream architecture leveraging pre-training
  • Kinetics dataset for future benchmarking and improved diversity of action datasets

Explanation:

This paper takes off from where C3D left off. Instead of a single 3D network, the authors use two different 3D networks, one for each stream of the two stream architecture. Also, to take advantage of pre-trained 2D models, the authors repeat the pre-trained 2D weights along the third dimension. The spatial stream input now consists of frames stacked along the time dimension instead of the single frames used in the basic two stream architecture.
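This weight inflation trick can be sketched in a few lines: repeat the 2D filters along time and rescale, so that a static (“boring”) video produces the same activations as the 2D net.

import torch

def inflate_conv_weights(w2d, t):
    # w2d: pre-trained 2D conv weights (out, in, H, W); returns 3D weights
    # (out, in, T, H, W), repeated along time and divided by T so responses
    # on a static video match the 2D network's.
    return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t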

Algorithm:

Same as the basic two stream architecture, but with 3D nets for each stream.

Benchmarks (UCF101-split1):

Score Comment
93.4 Two Stream I3D
98.0 Imagenet + Kinetics pre-training

My comments:

The major contribution of the paper was demonstrating the benefit of using pre-trained 2D conv nets. The Kinetics dataset, open-sourced along with the paper, was the other crucial contribution.

T3D

  • Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification
  • Diba et al.
  • Submitted on 22 Nov 2017
  • Arxiv Link

Key Contributions:

  • Architecture to combine temporal information across variable depth
  • Novel training architecture & technique to supervise transfer learning between 2D pre-trained net to 3D net

Explanation:

The authors extend the work done on I3D but suggest a single stream 3D DenseNet based architecture with a multi-depth temporal pooling layer (Temporal Transition Layer) stacked after dense blocks to capture different temporal depths. The multi-depth pooling is achieved by pooling with kernels of varying temporal sizes.


The TTL layer along with the rest of the DenseNet architecture. Source.

Apart from the above, the authors also devise a new technique for supervising transfer learning between pre-trained 2D conv nets and T3D. The 2D pre-trained net and T3D are presented frames and clips respectively, which may or may not come from the same video. The architecture is trained to predict whether they do (0/1), and the error from this prediction is back-propagated through the T3D net so as to effectively transfer knowledge.


Transfer learning supervision. Source.

Algorithm:

The architecture is basically a 3D modification of DenseNet [12] with added variable temporal pooling.

Benchmarks (UCF101-split1):

Score Comment
90.3 T3D
91.7 T3D + Transfer
93.2 T3D + TSN

My comments:

Although the results don’t improve on I3D, that can mostly be attributed to the much lower model footprint compared to I3D. The most novel contribution of the paper was the supervised transfer learning technique.

References

  1. ConvNet Architecture Search for Spatiotemporal Feature Learning by Du Tran et al.
  2. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
  3. Action recognition by dense trajectories by Wang et al.
  4. On space-time interest points by Laptev
  5. Behavior recognition via sparse spatio-temporal features by Dollar et al.
  6. Action Recognition with Improved Trajectories by Wang et al.
  7. 3D Convolutional Neural Networks for Human Action Recognition by Ji et al.
  8. Large-scale Video Classification with Convolutional Neural Networks by Karpathy et al.
  9. Beyond Short Snippets: Deep Networks for Video Classification by Ng et al.
  10. Long-term Temporal Convolutions for Action Recognition by Varol et al.
  11. Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks by Sun et al.
  12. Densely Connected Convolutional Networks by Huang et al.


Teaching Machines to Read Radiology Reports

At Qure, we build deep learning models to detect abnormalities from radiological images. These models require a huge amount of labeled data to learn to diagnose abnormalities from scans. So, we collected a large dataset from several centers, including both in-hospital and outpatient radiology centers. These datasets contain scans and the associated clinical radiology reports.

For now, we use radiologist reports as the gold standard as we train deep learning algorithms to recognize abnormalities on radiology images. While this is not ideal for many reasons (see this), it is currently the most scalable way to supply classification algorithms with the millions of images that they need in order to achieve high accuracy.

These reports are usually written in free form text rather than in a structured format. So, we have designed a rule based Natural Language Processing (NLP) system to extract findings automatically from these unstructured reports.

CT SCAN BRAIN - PLAIN STUDY
Axial ct sections of the brain were performed from the level of base of skull. 5mm sections were done for the posterior fossa and 5 mm sections for the supra sellar region without contrast.

OBSERVATIONS: 
- Area of intracerebral haemorrhage measuring 16x15mm seen in left gangliocapsular region and left corona radiate.
- Minimal squashing of left lateral ventricle noted without any appreciable midline shift
- Lacunar infarcts seen in both gangliocapsular regions
- Cerebellar parenchyma is normal.
- Fourth ventricle is normal in position and caliber. 
- The cerebellopontine cisterns, basal cisterns and sylvian cisterns appear normal.
- Midbrain and pontine structures are normal.
- Sella and para sellar regions appear normal.
- The grey-white matter attenuation pattern is normal.
- Calvarium appears normal
- Ethmoid and right maxillary sinusitis noted

IMPRESSION:
- INTRACEREBRAL HAEMORRHAGE IN LEFT GANGLIOCAPSULAR REGION AND LEFT CORONA RADIATA 
- LACUNAR INFARCTS IN BOTH GANGLIOCAPSULAR REGIONS 

{
	"intracerebral hemorrhage": true,
	"lacunar infarct": true,
	"mass effect": true,
	"midline shift": false,
	"maxillary sinusitis": true
}

An example clinical radiology report and the automatically extracted findings

Why Rule based NLP?

Rule based NLP systems use a list of manually created rules to parse the unorganized content and structure it. Machine Learning (ML) based NLP systems, on the other hand, automatically generate the rules when trained on a large annotated dataset.

Rule based approaches have multiple advantages when compared to ML based ones:

  1. Clinical knowledge can be manually incorporated into a rule based system. To capture this knowledge in an ML based system, a huge amount of annotation is required.
  2. Auto-generated rules of ML systems are difficult to interpret compared to the manually curated rules.
  3. Rules can be readily added or modified to accommodate a new set of target findings in a rule based system.
  4. Previous works on clinical report parsing [1, 2] show that the results of machine learning based NLP systems are inferior to those of rule based ones.

Development of Rule based NLP

As reports were collected from multiple centers, there were multiple reporting standards. We therefore constructed a set of rules to capture these variations after manually reading a large number of reports. Below, I illustrate two common types of rules.

Findings Detection

In reports, the same finding can be noted in several different formats, including the definition of the finding itself or its synonyms. For example, the finding blunted CP angle could be reported in any of the following ways:

  • CP angle is obliterated
  • Hazy costophrenic angles
  • Obscured CP angle
  • Effusion/thickening

We collected all the wordings that can be used to report findings and created a rule for each finding. As an illustration, the following is the rule for blunted CP angle.

((angle & (blunt | obscur | oblitera | haz | opaci)) | (effusio & thicken))


Visualization of blunted CP angle rule

This rule is positive if a sentence contains the word angle together with blunted or one of its synonyms. Alternatively, it is also positive if a sentence contains the words effusion and thickening.
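A hypothetical evaluator for this rule in Python; our production rule engine is more general, and the helper names here are illustrative.

import re

def has(stem, sentence):
    # True if any word in the sentence starts with the given stem.
    return re.search(r"\b" + stem, sentence, flags=re.IGNORECASE) is not None

def blunted_cp_angle(sentence):
    synonyms = ["blunt", "obscur", "oblitera", "haz", "opaci"]
    return (has("angle", sentence) and any(has(s, sentence) for s in synonyms)) \
        or (has("effusio", sentence) and has("thicken", sentence))

print(blunted_cp_angle("Hazy costophrenic angles"))  # True
print(blunted_cp_angle("CP angle is obliterated"))   # True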

In addition, there can be a hierarchical structure in findings. For example, opacity is considered positive if any of edema, groundglass, consolidation etc. are positive.
We therefore created an ontology of findings and rules to deal with this hierarchy.

[opacity]
rule = ((opacit & !(/ & collapse)) | infiltrate | hyperdensit)
hierarchy = (edema | groundglass | consolidation | ... )

Rule and hierarchy for opacity

Negation Detection

The above mentioned rules are used to detect a finding in a report. But these are not sufficient to understand the reports. For example, consider the following sentences.

1. Intracerebral hemorrhage is absent.
2. Contusions are ruled out.
3. No evidence of intracranial hemorrhages in the brain.

Although the findings intracerebral hemorrhage, contusion and intracranial hemorrhage are mentioned in the above sentences, it is their absence rather than their presence that is noted. Therefore, we need to detect negations in a sentence in addition to findings.

We manually read several sentences that indicate negation of findings and grouped them according to their structures. Rules to detect negation were created based on these groups.
One of these is illustrated below:

() & ( is | are | was | were ) & (absent | ruled out | unlikely | negative)


Negation detection structure

We can see that the first and second sentences of the above example match this rule, and therefore we can infer that the findings are negative.

  1. Intracerebral hemorrhage is absent → intracerebral hemorrhage negative.
  2. Contusions are ruled out → contusion negative.
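A hypothetical regular-expression check for this negation structure; the actual system matches the finding span in place of the empty parentheses.

import re

NEGATION = re.compile(
    r"\b(is|are|was|were)\b.*\b(absent|ruled out|unlikely|negative)\b",
    re.IGNORECASE,
)

def is_negated(sentence):
    return NEGATION.search(sentence) is not None

print(is_negated("Intracerebral hemorrhage is absent"))  # True
print(is_negated("Contusions are ruled out"))            # True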

Results:

We tested our algorithm on a dataset containing 1878 clinical radiology reports of head CT scans. We manually read all the reports to create the gold standard. We used sensitivity and specificity as evaluation metrics. The results are given in the table below.

Findings #Positives Sensitivity (95% CI) Specificity (95% CI)
Intracranial Hemorrhage 207 0.9807 (0.9513-0.9947) 0.9873 (0.9804-0.9922)
Intraparenchymal Hemorrhage 157 0.9809 (0.9452-0.9960) 0.9883 (0.9818-0.9929)
Intraventricular Hemorrhage 44 1.0000 (0.9196-1.0000) 1.0000 (0.9979-1.0000)
Subdural Hemorrhage 44 0.9318 (0.8134-0.9857) 0.9965 (0.9925-0.9987)
Extradural Hemorrhage 27 1.0000 (0.8723-1.0000) 0.9983 (0.9950-0.9996)
Subarachnoid Hemorrhage 51 1.0000 (0.9302-1.0000) 0.9971 (0.9933-0.9991)
Fracture 143 1.0000 (0.9745-1.0000) 1.0000 (0.9977-1.0000)
Calvarial Fracture 89 0.9888 (0.9390-0.9997) 0.9947 (0.9899-0.9976)
Midline Shift 54 0.9815 (0.9011-0.9995) 1.0000 (0.9979-1.0000)
Mass Effect 132 0.9773 (0.9350-0.9953) 0.9933 (0.9881-0.9967)

In [1], the authors used an ML based NLP model (bag of words with unigrams, bigrams, and trigrams, plus an average word embedding vector) to extract findings from head CT clinical radiology reports. They reported an average sensitivity and average specificity of 0.9025 and 0.9172 across findings. The same metrics across our target findings turn out to be 0.9841 and 0.9956 respectively. So, we conclude that rule based NLP algorithms can perform better than ML based NLP algorithms on clinical reports.

References

  1. John Zech, Margaret Pain, Joseph Titano, Marcus Badgeley, Javin Schefflein, Andres Su, Anthony Costa, Joshua Bederson, Joseph Lehar & Eric Karl Oermann (2018). Natural Language–based Machine Learning Models for the Annotation of Clinical Radiology Reports. Radiology.
  2. Bethany Percha, Houssam Nassif, Jafi Lipson, Elizabeth Burnside & Daniel Rubin (2012). Automatic classification of mammography reports by BI-RADS breast tissue composition class.


What We Learned Deploying Deep Learning at Scale for Radiology Images

Qure.ai is deploying deep learning for radiology across the globe. This blog is the first in a series where we talk about our learnings from deploying deep learning solutions at radiology centers. We cover the technical aspects of the challenges and solutions here; the operational hurdles will be covered in the next part of this series.

The dawn of an AI revolution is upon us. Deep learning, or deep neural networks, has crawled into our daily lives, transforming how we type, write emails, and search for photos. It is revolutionizing major fields like healthcare, banking, and driving. At Qure.ai, we have been working for the past couple of years on our mission of making healthcare more affordable and accessible through the power of deep learning.

Since our journey began more than two years ago, we have seen excellent progress in the development and visualization of deep learning models. With Nvidia leading the advancements in GPUs and the releases of Pytorch, Tensorflow, MXNet etc. fueling the competition among deep learning frameworks, training deep learning models has become faster and easier than ever.

However, deploying these deep learning models at scale has become a different beast altogether. Let’s discuss some of the major problems that Qure.ai has tackled/is tackling in deploying deep learning for hospitals and radiologists across the globe.

Where does the challenge lie?

Let us start by understanding how the challenges in deploying deep learning models differ from those in training them. During training, the focus is mainly on the accuracy of predictions, while deployment focuses on the speed and reliability of predictions. Models can be trained on local servers, but in deployment they need to be capable of scaling up or down depending on the volume of API requests. Companies like Algorithmia and EnvoyAI are trying to solve this problem by providing a layer over AI to serve the end users. We are already working with EnvoyAI to explore this route of deploying deep learning.

Selecting the right deep learning framework

Caffe was the first framework built with a focus on production. Initially, our research team was using both Torch (flexible, imperative) and Lasagne/Keras (python!) for training. The release of Pytorch in late 2016 settled the debate on frameworks within our team.

Deep learning frameworks (source)

Thankfully, this happened before we started looking into deployment. Once we finalized Pytorch for training and tweaking our models, we started looking into best practices for deploying the same. Meanwhile, Facebook released Caffe2 for easier deployment, especially into mobile devices.

The AI community, including Facebook, Microsoft and Amazon, came together to release the Open Neural Network Exchange (ONNX), making it easier to switch between tools as needed. For example, it enables you to train your model in Pytorch and then export it to Caffe2/MXNet/CNTK (Cognitive Toolkit) for deployment. This approach is worth looking into when the load on our servers increases, but for our present needs, deploying models in Pytorch has sufficed.

Selecting the right stack

We use the following components to build our Linux servers, keeping our pythonic deep learning framework in mind.


  • Docker: For operating system level virtualization
  • Anaconda: For creating python3 virtual environments and supervising package installations
  • Django: For building and serving RESTful APIs
  • Pytorch: As deep learning framework
  • Nginx: As webserver and load balancer
  • uWSGI: For serving multiple requests at a time
  • Celery: As distributed task queue

Most of these tools can be replaced as per requirements. The following diagram represents our present stack.

Server architecture
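As an illustration of how these pieces fit together, below is a hypothetical sketch of a Celery task that runs model inference outside the request/response cycle, so the Django API can return immediately; the broker URL, paths, and preprocessing helper are all illustrative.

import torch
from celery import Celery

app = Celery("qure", broker="redis://localhost:6379/0")  # illustrative broker

model = torch.load("/models/qxr.pt")  # hypothetical path; loaded once per worker
model.eval()

@app.task
def predict(scan_path):
    volume = load_and_preprocess(scan_path)  # hypothetical dicom preprocessing
    with torch.no_grad():
        probs = torch.sigmoid(model(volume))
    return probs.tolist()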

Choosing the cloud GPU server

We use Amazon EC2 P2 instances as our cloud GPU servers primarily due to our team’s familiarity with AWS. Although, Microsoft’s Azure and Google Cloud can also be excellent options.

Automating scaling and load balancing

Our servers are built from small components performing specific services, and it was important to have them on the same host for easy configuration. Moreover, we handle large dicom images (each between 10 and 50 MB) which get transferred between the components. It made sense to have all the components on the same host, or else network bandwidth might get choked by these transfers. The following diagram illustrates the various software components comprising a typical Qure deployment.

Software Components

We started by launching qXR (our chest X-ray product) on a P2 instance, but as the load on our servers rose, managing GPU memory became an overhead. We were also planning to launch qER (our head CT product), which had even higher GPU memory requirements.

Initially, we simply bought new P2 instances. But optimizing their usage, and making sure that some instances were not bogged down by the incoming load while others remained comparatively free, became a challenge. It became clear that we needed auto-scaling for our containers.

Load balancing improves the distribution of workloads across instances (source)

That was when we started looking into solutions for managing our containerized applications. We decided to go ahead with Kubernetes (Amazon ECS is also an excellent alternative), mainly because it runs independently of any specific cloud provider (ECS has to be deployed on the Amazon cloud). Since many hospitals and radiology centers prefer on-premise deployment, Kubernetes is clearly better suited for such needs. It makes life easier with automatic bin-packing of containers based on resource requirements, simpler horizontal scaling, and load balancing.

GPU memory management

Initially, when qXR was deployed, it dealt with fewer abnormalities. So, for an incoming request, loading the models into memory, processing images through them and then releasing the memory worked fine. But as the number of abnormalities (and thereby models) increased, loading all the models sequentially for each incoming request became an overhead.

We thought of accumulating incoming requests and processing images in batches on a periodic basis. This could have been a decent solution, except that time is critical when dealing with medical images, more so in emergency situations. This is especially true for qER: in cases of stroke, one has less than an hour to make a diagnostic decision. This ruled out the batch processing approach.

Beware of GPUs !! (warning at Qure's Mumbai office)

Moreover, our models for qER were even larger and required approximately 10x the GPU memory of the qXR models. Another thought was to keep the models loaded in memory and process images through them as the requests arrive. This is a good solution where you need to run your models every second or even millisecond (think of AI models running on the millions of images uploaded to Facebook or Google Photos). However, this is not a typical scenario in the medical domain; radiology centers do not encounter patients at that scale. Even if the servers send back the results within a couple of minutes, that is roughly a 30x improvement over the time a radiologist would take to report the scan. And that assumes a radiologist is immediately available; otherwise, the average turnaround period for a chest x-ray varies from 1 to 2 days (700-1400x of what we take currently).

As of now, auto-scaling with Kubernetes solves our problems, but we will definitely revisit GPU memory management in the future. The solution lies somewhere between the two approaches (think of a caching mechanism for deep learning models).
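As a rough illustration, such a cache might look like the minimal sketch below, where models are loaded lazily on first use and the least recently used one is evicted when capacity is reached. The load_model helper, the capacity of two, and the toy model are all illustrative assumptions:

```python
# Hedged sketch of an LRU cache for deep learning models.
from collections import OrderedDict
import torch

def load_model(name):
    # hypothetical loader, standing in for our actual model loading code
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return torch.nn.Linear(512, 2).to(device).eval()

class ModelCache:
    """Keep at most `capacity` models in (GPU) memory, evicting the
    least recently used model when a new one needs to be loaded."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.models = OrderedDict()

    def get(self, name):
        if name in self.models:
            self.models.move_to_end(name)      # mark as recently used
            return self.models[name]
        if len(self.models) >= self.capacity:
            self.models.popitem(last=False)    # evict the LRU model
            torch.cuda.empty_cache()           # release its GPU memory
        self.models[name] = load_model(name)
        return self.models[name]
```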

Conclusion

Training deep learning models, especially in healthcare, is only one part of building a successful AI product. Bringing it to healthcare practitioners is a formidable and interesting challenge in itself. There are other operational hurdles like convincing doctors to embrace AI, offline working style at some hospitals (using radiographic films), lack of modern infrastructure at radiology centers (operating systems, bandwidth, RAM, disk space, GPU), varying procedures for scan acquisition etc. We will talk about them in detail in the next part of this series.

Note

For a free trial of qXR and qER, please visit us at scan.qure.ai

Categories
Uncategorized

Visualizing Deep Learning Networks – Part II

In the previous post, we looked at methods to visualize and interpret the decisions made by deep learning models using perturbation based techniques. To summarize: perturbation based methods do a good job of explaining decisions, but they suffer from expensive computation and instability in the presence of unexpected artifacts. In this post, we’ll give a brief overview of the various gradient-based and relevance-based algorithms for deep learning classification models, along with their drawbacks.

We will discuss the following types of algorithms in this post:

  1. Gradient-based algorithms
  2. Relevance score based algorithms

In gradient-based algorithms, the gradient of the output with respect to the input is used to construct the saliency maps. The algorithms in this class differ in the way the gradients are modified during backpropagation. Relevance score based algorithms try to attribute the relevance of each input pixel by backpropagating the probability score instead of the gradient. All of these methods involve a single forward and backward pass through the net to generate heatmaps, as opposed to multiple forward passes for the perturbation based methods. As a result, they are computationally cheaper as well as free of the artifacts originating from perturbation techniques.

To illustrate each algorithm, we will consider a Chest X-Ray (image below) of a patient diagnosed with pulmonary consolidation. Pulmonary consolidation is simply a “solidification” of the lung tissue due to the accumulation of solid and liquid material in the air spaces that would normally have been filled by gas [1]. The dense material deposition in the airways can be caused by infection or pneumonia (deposition of pus), lung cancer (deposition of malignant cells), pulmonary hemorrhage (airways filled with blood), etc. An easy way to diagnose consolidation is to look for dense abnormal regions with ill-defined borders in the X-ray image.

Annotated_x

Chest X-ray with consolidation.

We will use this X-ray and one of our models trained to detect consolidation for demonstration purposes. For this patient, our consolidation model predicts a possible consolidation with 98.2% confidence.

Gradient Based

Gradient * Input

  • Deep inside convolutional networks: Visualising image classification models and saliency maps
  • Submitted on 20 Dec 2013
  • Arxiv Link

Explanation:
Measure the relative importance of input features by calculating the gradient of the output decision with respect to those input features.

Two very similar papers pioneered the idea in 2013. In these papers — saliency maps [2] by Simonyan et al. and DeconvNet [3] by Zeiler et al. — the authors directly used the gradient of the top predicted class with respect to the input to observe salient features. The main difference between the papers was how the authors handled the backpropagation of gradients through non-linear layers like ReLU. In the saliency maps paper, the gradients of neurons with negative input were suppressed while propagating through ReLU layers; in the DeconvNet paper, the gradients of neurons with incoming negative gradients were suppressed.

Algorithm:
Given an image I0, a class c, and a classification ConvNet with class score function Sc(I), the heatmap is calculated as the absolute value of the gradient of Sc with respect to I, evaluated at I0:
\[ \frac{\partial S_c}{\partial I} \Big|_{I_0} \]

It is worth noting that the DeepLIFT paper (which we’ll discuss later) also explores gradient * input as an alternative indicator, as it leverages the strength and sign of the input:
\[ \frac{\partial S_c}{\partial I} \Big|_{I_0} \cdot I_0 \]

Annotated_x

Heatmap by GradInput against original annotation.
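Both variants are straightforward to implement in Pytorch. Below is a minimal sketch; the model, input shape and class index are assumptions:

```python
# Hedged sketch of vanilla gradient and gradient * input saliency.
import torch

def gradient_saliency(model, image, target_class):
    """image: tensor of shape (1, C, H, W); returns two heatmaps."""
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]         # S_c(I_0)
    model.zero_grad()
    score.backward()                              # dS_c/dI at I_0
    saliency = image.grad.abs()                   # vanilla gradient heatmap
    grad_x_input = (image.grad * image).detach()  # gradient * input variant
    return saliency, grad_x_input
```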

Shortcomings:
The problems with such a simple algorithm arise from non-linear activation functions like ReLU, ELU, etc. Being non-differentiable at certain locations, these functions have discontinuous gradients. Since the method measures partial derivatives with respect to each pixel, the gradient heatmap is therefore discontinuous over the entire image and produces artifacts if viewed as-is. Some of this can be smoothed out by convolving with a Gaussian kernel. The gradient flow also suffers at renormalization layers like BatchNorm or max pooling.

Guided Backpropagation

  • Striving for simplicity: The all convolutional net
  • Submitted on 21 Dec 2014
  • Arxiv Link

Explanation:
The next paper [4], by Springenberg et al., released in 2014, introduced GuidedBackprop, which suppresses the flow of gradients through neurons where either the input or the incoming gradient is negative. In other words, it combined the gradient handling of both Simonyan et al. and Zeiler et al. Springenberg et al. showed the difference between these methods through a beautiful illustration, given below.

GuidedBackprop

Schematic of visualizing the activations of high layer neurons. a) Given an input image, we perform the forward pass to the layer we are interested in, then set to zero all activations except one and propagate back to the image to get a reconstruction. b) Different methods of propagating back through a ReLU nonlinearity. c) Formal definition of different methods for propagating an output activation out back through a ReLU unit in layer l; note that the ’deconvnet’ approach and guided backpropagation do not compute a true gradient but rather an imputed version. Source.

Annotated_x

Heatmap by GuidedBackprop against original annotation.
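In Pytorch, this rule can be sketched with a backward hook on each ReLU, as below. This is an assumption-laden illustration (hook APIs have changed across Pytorch versions), not production code:

```python
# Hedged sketch: guided backpropagation via backward hooks on ReLU.
import torch
import torch.nn as nn

def guided_relu_hook(module, grad_in, grad_out):
    # ReLU's own backward already zeroes gradients where the forward
    # input was negative (Simonyan et al.); clamping additionally
    # suppresses negative incoming gradients (Zeiler et al.).
    return (torch.clamp(grad_in[0], min=0.0),)

def enable_guided_backprop(model):
    handles = [m.register_backward_hook(guided_relu_hook)
               for m in model.modules() if isinstance(m, nn.ReLU)]
    return handles  # call handle.remove() on each to restore defaults
```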

Shortcomings:
Gradient flow through ReLU layers largely remained a problem. Handling renormalization layers was also still unresolved, as most of the above papers (including this one) proposed mostly fully convolutional architectures (without max pool layers), and batch normalization was yet to be ‘alchemised’ in 2014. Another such fully-convolutional architecture paper was CAM [6].

Grad CAM

  • Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
  • Submitted on 07 Oct 2016
  • Arxiv Link

Explanation:
An effective way to circumvent the backpropagation problems was explored in GradCAM [5] by Selvaraju et al. This paper was a generalization of the CAM [6] algorithm by Zhou et al. to networks that use fully connected layers. The idea is that, instead of trying to propagate the gradients all the way back, the activation maps of the final convolutional layer can be used directly to infer a downsampled relevance map of the input pixels. The downsampled heatmap is then upsampled to obtain a coarse relevance heatmap.

Algorithm:


Let the feature maps in the final convolutional layer be F1, F2, …, Fn. As before, assume an image I0, a class c, and a classification ConvNet with class score function Sc(I).

  1. A weight (w1, w2, …, wn) for each feature map F1, F2, …, Fn is calculated from the gradients of the class score Sc w.r.t. that feature map:
    \( w_i = \frac{\partial S_c}{\partial F} \Big|_{F_i} \quad \forall \, i = 1 \dots n \)
  2. The weights and the corresponding activations of the feature maps are multiplied to compute the weighted activations (A1, A2, …, An) of each pixel in the feature maps:
    \( A_i = w_i \cdot F_i \quad \forall \, i = 1 \dots n \)
  3. The weighted activations are added pixel-wise across feature maps to indicate the importance of each pixel in the downsampled feature-importance map \( H \):
    \( H_{i,j} = \sum_{k=1}^{n} A_k(i,j) \)
  4. The downsampled heatmap ( H_{i,j} ) is upsampled to original image dimensions to produce the coarse-grained relevant heatmap
  5. [Optional] The authors suggest multiplying the final coarse heatmap with the heatmap obtained from GuidedBackprop to obtain a finer heatmap.

Steps 1-4 make up the GradCAM method; including step 5 constitutes the Guided GradCAM method. Here’s how a heatmap generated by the GradCAM method looks. The key contribution of the paper was generalizing CAM to networks containing fully-connected layers.

Annotated_x

Heatmap by GradCAM against original annotation.
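For concreteness, here is a minimal Pytorch sketch of steps 1-4. The layer handle and shapes are assumptions; following the paper, each feature map’s weight is its gradient averaged over the spatial dimensions:

```python
# Hedged sketch of GradCAM (steps 1-4).
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, conv_layer):
    acts, grads = [], []
    h1 = conv_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = conv_layer.register_backward_hook(lambda m, gi, go: grads.append(go[0]))

    score = model(image)[0, target_class]          # S_c
    model.zero_grad()
    score.backward()                               # gradients w.r.t. F_i
    h1.remove(); h2.remove()

    # step 1: one weight per feature map (spatially pooled gradients)
    weights = grads[0].mean(dim=(2, 3), keepdim=True)
    # steps 2-3: weighted activations, summed across feature maps
    cam = F.relu((weights * acts[0]).sum(dim=1, keepdim=True))
    # step 4: upsample the coarse map to the input resolution
    return F.interpolate(cam, size=image.shape[2:],
                         mode="bilinear", align_corners=False)
```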

Shortcomings:
The algorithm manages to steer clear of backpropagating the gradients all the way to the input – it propagates them only up to the final convolutional layer. The major problem with GradCAM is its limitation to specific architectures that use an AveragePooling layer to connect the convolutional layers to the fully connected layers. The other major drawback is that upsampling the coarse heatmap introduces artifacts and a loss of signal.

Relevance score based

There are a couple of major problems with the gradient-based methods which can be summarised as follows:

  1. Discontinuous gradients for some non-linear activations: As explained in the figure below (taken from the DeepLIFT paper), the discontinuities in gradients cause undesirable artifacts. Attribution also fails to propagate back smoothly through such non-linearities, distorting the attribution scores.

    Discontinuous gradients

    Discontinuity problems of gradient based methods Source.

  2. Saturation of gradients: As illustrated by the simplistic network below, once i1 + i2 > 1, the gradient of the output w.r.t. either input no longer changes, no matter how much larger i1 or i2 becomes.

Gradient saturation

Saturation problems of gradient based methods Source.
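A hedged reconstruction of this toy network in Pytorch makes the point; here we assume the form y = 1 - relu(1 - (i1 + i2)), which saturates exactly when i1 + i2 > 1:

```python
import torch

# Assumed toy network: y = 1 - relu(1 - (i1 + i2)).
# Once i1 + i2 > 1 the ReLU is inactive, so the gradient of y w.r.t.
# both inputs is exactly zero and gradient attributions vanish.
i1 = torch.tensor(0.8, requires_grad=True)
i2 = torch.tensor(0.9, requires_grad=True)   # i1 + i2 = 1.7 > 1
y = 1 - torch.relu(1 - (i1 + i2))
y.backward()
print(i1.grad, i2.grad)   # tensor(0.) tensor(0.)
```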

Layerwise Relevance Propagation

  • On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation
  • Published on July 10, 2015
  • Journal Link

Explanation:
To counter these issues, a relevance score based attribution technique was discussed for the first time by Bach et al. in 2015 in this paper [7]. The authors suggested a simple yet strong technique: propagate relevance scores backwards, redistributing them in proportion to the activations of the previous layer. Redistribution based on activations means we steer clear of the difficulties that arise with non-linear activation layers.

Algorithm:


This implementation follows epsilon-LRP [8], where a small epsilon is added to the denominator to propagate relevance with numerical stability. As before, assume an image I0, a class c, and a classification ConvNet with class score function Sc(I).

  1. The relevance score (Rf) for the final layer is Sc.
  2. While the input layer is not reached:
    • Redistribute the relevance score of the current layer (Rl+1) to the previous layer (Rl) in proportion to activations.
      Say zij is the contribution of the ith neuron in layer l to the activation of the jth neuron in layer l+1, with total
      \( z_j = \sum_{i} z_{ij} \)
      The relevance is then redistributed as
      \( R_i^{(l)} = \sum_{j} \frac{z_{ij}}{z_j + \epsilon \cdot \mathrm{sign}(z_j)} \, R_j^{(l+1)} \)
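A minimal sketch of this redistribution rule for a single fully connected layer is given below; shapes and the epsilon value are assumptions, and a full implementation applies this rule layer by layer:

```python
# Hedged sketch of the epsilon-LRP rule for one fully connected layer.
import torch

def lrp_linear(a_prev, weight, relevance_next, eps=1e-6):
    """a_prev:         activations of layer l, shape (d_in,)
    weight:         linear layer weights, shape (d_out, d_in)
    relevance_next: relevance of layer l+1, shape (d_out,)
    returns:        relevance of layer l, shape (d_in,)"""
    z_ij = weight * a_prev                # contributions z_ij
    z_j = z_ij.sum(dim=1)                 # z_j = sum_i z_ij
    z_j = z_j + eps * ((z_j >= 0).float() * 2 - 1)  # epsilon stabilizer
    return (z_ij / z_j[:, None] * relevance_next[:, None]).sum(dim=0)
```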


Annotated_x

Heatmap by Epsilon LRP against original annotation.

DeepLIFT

  • Learning Important Features Through Propagating Activation Differences
  • Submitted on 10 Apr 2017
  • Journal Link

Explanation:
The last paper [9] we cover in this series is also based on layer-wise relevance. However, instead of directly explaining the output prediction as the previous methods do, the authors explain the difference between the output prediction and the prediction on a baseline reference image. The concept is similar to Integrated Gradients, which we discussed in the previous post. The authors bring out a valid concern with the gradient-based methods described above: gradients don’t use a reference, which limits the inference. Gradient-based methods only describe the local behavior of the output at the specific input value, without considering how the output behaves over a range of inputs.

Algorithm:
The reference image (IR) is chosen as a neutral image suitable for the problem at hand. For a class c and a classification ConvNet with class score function Sc(I), let SRc be the score for the reference image IR. The relevance score to be propagated is then not Sc but Sc – SRc.
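A small sketch of this seeding step is shown below; the all-zero reference is an assumption for illustration, since the appropriate neutral image is problem dependent:

```python
# Hedged sketch of DeepLIFT's difference-from-reference seeding.
import torch

def deeplift_seed(model, image, target_class):
    reference = torch.zeros_like(image)       # assumed "neutral" baseline
    s_c = model(image)[0, target_class]       # S_c on the input image
    s_rc = model(reference)[0, target_class]  # S_Rc on the reference
    # this difference, not S_c itself, is propagated backwards
    return s_c - s_rc
```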

Discussions

We have so far covered both perturbation based and gradient-based methods. Computationally and practically, perturbation based methods are not much of a win, although their performance is relatively uniform and consistent with an underlying concept of interpretability. The gradient-based methods are computationally cheaper and measure the contribution of the pixels in the neighborhood of the original image, but they are plagued by the difficulties of propagating gradients back through non-linear and renormalization layers. The layer-wise relevance techniques go a step further and directly redistribute relevance in proportion to activations, thereby steering clear of the problems of propagating through non-linear layers. To capture the relative importance of pixels beyond the local neighborhood of pixel intensities, DeepLIFT redistributes the difference between the activations of an image and those of a baseline image.

We’ll follow up with a final post containing a detailed analysis of the performance of all the methods discussed in this and the previous post.

References

  1. Consolidation of Lung – Signs, Symptoms and Causes
  2. Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
  3. Zeiler, M. D., & Fergus, R. (2014, September). Visualizing and understanding convolutional networks. In European conference on computer vision (pp. 818-833). Springer, Cham.
  4. Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.
  5. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2016). Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization.
  6. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2921-2929).
  7. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K. R., & Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7), e0130140.
  8. Samek, W., Binder, A., Montavon, G., Lapuschkin, S., & Müller, K. R. (2017). Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems.
  9. Shrikumar, A., Greenside, P., & Kundaje, A. (2017). Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685.

Categories
Uncategorized

Interview with Dr Mustafa Biviji – Artificial Intelligence and the Future of Radiology

With close to 30 years of radiology experience, Dr Biviji is an eminent radiologist based in Nagpur. He is an authority on developing deep learning solutions to radiology problems and works closely with early-stage healthcare technology innovators.

Q&A with Dr Mustafa Biviji on artificial intelligence in radiology.

How do you see Artificial Intelligence in radiology evolving in the future?

In the future, radiologists and radiographers could be replaced by intelligent machines. CT and MRI machines of the future would be embedded with AI programs capable of modifying scanning protocols on the fly, depending on the disease process initially identified. Highly accurate automated reports would be produced almost instantly. Machines would prognosticate, identify as yet unknown imaging patterns associated with diseases and may also uncover new diseases.

There will be objectivity to the radiology reports with personal bias of the radiologist no longer a factor. Remote and isolated areas of the world will have an equal access to the best diagnostic information. Coupled with this would be better machine navigation during surgeries or probably even complete robotic surgery based on the imaging patterns identified with AI. Through it all, I believe that radiologists will continue to reinvent themselves.

Photo of Dr Mustafa Biviji with quote

How far away is the industry from realizing these goals, and how does Qure compare to similar solutions that you may have seen/ implemented?

These are initial days and the role of AI in Radiology is currently restricted to assistance. While most solutions talk about simplifying workflows, Qure to the best of my knowledge is the only one talking about automated reports with a remarkable degree of accuracy, thereby opening up exciting new prospects for the future. While the perfect radiology AI may be far in the future, at least a promising beginning has been made.

How does Qure.ai help in your radiology practice?

Qure.ai solutions in radiology now include automated head CT reports particularly for trauma and strokes. Reporting for these conditions would earlier have either necessitated a sleepless night or a delay in reporting. Automated reports can now be used to assist residents and help can be sought in case of a doubt or discrepancy. Delayed radiology reports will soon be a thing of the past.

How do you think the Qure’s Chest X-ray solution can help or is helping radiologists in their practice?

Qure’s chest X-ray solution presently is best targeted to a general practitioner in a remote or rural location interpreting his own chest radiographs. Qure CXR could help provide radiologist-level accuracy, previously only available at the larger centers in the bigger cities. Better radiology would lead to better treatment outcomes and obviate the need for patients to travel long distances to seek a diagnosis.

How do you think young radiologists should prepare for AI?

AI in the future will radically modify the role of a radiologist. I predict a significant blurring of the roles of a diagnostic radiologist, surgeon or a physician. The radiologist of the future will have to stop behaving like an unseen backroom doctor and reinvent to participate actively in patient management. Image assisted robotic surgeries and integrated patient care are not too far off in the future.

Categories
Uncategorized

Interview with Dr Bharat Aggarwal – Artificial Intelligence and the Radiology Workflow

Dr. Bharat Aggarwal is the Director of Radiology Services at Max Healthcare. A distinguished third generation Radiologist, he was previously the promoter and lead Radiologist at Diwan Chand Aggarwal Imaging Research Centre, New Delhi. Dr. Aggarwal is an alumnus of Tata Memorial Hospital, Mumbai, and UCMS, Delhi.

Q&A with Dr. Bharat Aggarwal on artificial intelligence in radiology.

How do you see Artificial Intelligence in radiology evolving in the future?

There is going to be a significant role of AI in the field of imaging, and it will form a critical part of service delivery. There are many gaps in the existing model of service offerings. Some examples where AI will be commonly used include triaging and highlighting critical cases (reporting is done sequentially and a diagnosis requiring urgent intervention could be “at the bottom of the pile”); early diagnosis (pixel resolution of AI vs the human eye); pre-reading to take care of resource crunch, automation in comparisons, objectivization of disease & response to treatment; quality assurance etc.

Photo of Dr Bharat Aggarwal with quote

How far away is the industry from realizing these goals, and how does Qure compare to similar solutions that you may have seen/ implemented?

10-15 years.

How do you think the Qure.ai Chest X-ray solution can help radiologists in their practice?

Triaging normal from abnormal; building efficiency; quality assurance.

What is your advice to young radiologists who are just getting started on their career? How should they think about adopting AI in their practice and should they be doing anything differently to succeed as a radiologist 10-20 years from now?

Yes, adopting AI is a must. Radiologists will not be irrelevant in the world of machines. The role of the radiologists will be to direct research towards clinical gaps, validate AI diagnosis and focus on new problems that will emerge in the AI world. They need to treat AI with healthy competitiveness and build their careers with AI on their team. The opposition is the disease. The goal is health for all.

Categories
Uncategorized

Interview with Dr Shalini Govil – Training Artificial Intelligence to Read Chest X-Rays

Dr Shalini Govil is the Lead Abdominal Radiologist, Senior Advisor and Quality Controller at the Columbia Asia Radiology Group. Through her years as Associate Professor at CMC Vellore, Dr Shalini Govil has taught and mentored countless radiologists and medical students and continues to do so. Nowadays, she is busy training a new student – an Artificial Intelligence system that is learning to read Chest X-rays. Dr Govil is an accomplished researcher having published 30+ papers, and has won numerous awards for her contributions to Radiology.

Q&A with Dr Shalini Govil on artificial intelligence in radiology.

How do you see Artificial Intelligence in radiology evolving in the future?

Given the accuracy levels being reported across the world for deep learning algorithm diagnosis on imaging, I am sure AI has the potential to emerge as a strong diagnostic tool in the clinical armamentarium.

The only factor that could stand in the way of this progress is the very human fear of being “replaced”, “overtaken” or “made redundant”.

I feel that any crossroad like this in the practice of Medicine is best approached from the point of view of the patient and not from the viewpoint of commerce or market forces.
Medicine is not a “job”…Medicine is “healing”…
Medicine…is a patient trusting you at a vulnerable moment in his/her life.

From that standpoint, it is very simple – if AI is as accurate as a Senior Radiology Resident or even more accurate, let the patient have the benefit of a timely and accurate DRAFT report that can be validated by a physician or radiologist. This would certainly be better than the current practice in many parts of the world where the x-ray is not formally reported by a trained Radiologist or even a Trainee Radiologist.

Photo of Dr Shalini Govil with quote

How far away is the industry from realizing these goals, and how does Qure compare to similar solutions that you may have seen/ implemented?

Even as researchers are racing to study AI performance on increasingly complex pathology, widespread and parallel clinical testing is the need of the hour, to build confidence in Radiology AI and to reach a critical mass that will allow the threshold of human fear to be crossed.

Qure.ai has come up with a way to “see through the computer’s eyes”. I think this will be a game changer on the road to building confidence in AI. Whenever I have discussed the work I am doing on the use of AI in chest x-ray diagnosis with doctors, they tend to get a glassy look that says, “This is impractical…it’s never going to come into clinical use…”

But the minute they see a chest x-ray with the Qure.ai heatmap shading the abnormality that the AI actually “picked up”…the glassy look turns into one of wonder…because it is exactly what the doctor sees himself! I find this happens with lay people as well, even high school kids!

How do you think the Qure CXR solution can help or is helping radiologists in their practice?

Once the algorithm has been trained on a large number of chest x-rays and robust clinical testing has demonstrated a low false negative rate, I think the best use of the Qure.ai CXR solution would be to run all chest x-rays in our practice through the algorithm and obtain a DRAFT report to ease validation by a Radiologist.

What is your advice to young radiologists who are just getting started on their career? How should they think about adopting AI in their practice and what should they learn to succeed as a radiologist 10-20 years from now?

I would tell young Radiologists that help is on the way…that the days of struggling without a mentor when viewing a difficult case are over…that very soon, an “App” will help them derive a keyword tag to the image that has confounded them and that this keyword will then enable them to research and read and provide an articulate and lucid differential diagnosis.

What should they learn?
They should learn Radiology of course…as in-depth and in-breadth as has ever been done…and they possibly can….
But they should also learn the basics of neural networks, deep learning algorithms and keep abreast of evolving AI.
Oh! and another thing – it might be a good idea to brush up on their 12th grade calculus!

Categories
Recommended

Interview with Dr Bhavin Jankharia – Radiologist Perspective on AI

Dr Bhavin Jankharia is one of India’s leading radiologists, former president of the IRIA as well as a renowned educator, speaker and writer. Dr Jankharia’s radiology and imaging practice, “Picture This by Jankharia”, is known for being an early adopter of innovation and for pioneering new technologies in radiology.

Q&A with Dr Jankharia on artificial intelligence in radiology.

How do you see Artificial Intelligence in radiology evolving in the future?

AI is here to stay and will be a major factor to shape radiology over the next 10 years. It will be incorporated in some form or the other in protocols and workflows across the spectrum of radiology work and across the globe.

Photo of Dr Bhavin Jankharia with quote

You have been an early adopter of AI in your practice. What would your advice be to other institutions globally who are considering incorporating AI into their workflow?

It is about questions that need to be answered. At present, AI is good at solving specific questions or extracting numerical data from CT scans of the abdomen and pelvis, with respect to bone density and aortic size, etc. Wherever there is a need for such issues to be addressed, AI should be incorporated into those specific workflows. We still haven’t gotten to the stage where AI can report every detail in every scan, and that may actually never happen.

It may never happen that AI can do what a radiologist does, but looking at the near term (say the next 3-5 years), what do you think AI can achieve? (For example, what tasks can it automate? Can it improve reporting accuracy?) Where will the biggest value addition be?

Its basic value addition will be to take away drudge work. Automated measurements, automated checking of densities, enhancement patterns, perhaps even automated follow-ups for measurements of abnormal areas already marked out on the first scans and the like.

Now that you have experienced AI in practice, how would you differentiate this technology from traditional CAD solutions that have been around for a while?

AI learns much faster and the basic approach is different. To the end user though, it matters not, does it, how we get the answer we want…

You have seen several AI companies in Radiology. What should they be doing differently to reach this goal?

At present, all of AI is problem-solving based. And since each company deals with different problems based on the doctors they work with, this approach is fine. The company that figures out a way to handle a non-problem based approach to basic interpretation of scans, the way radiologists do, will have a head-start.

How do you think the Qure.ai solutions can help or are helping radiologists in their practice?

They are slowly saving time and helping radiologists work smarter and better.

What is your advice to young radiologists who are just getting started on their career? How should they think about adopting AI in their practice and should they be doing anything differently to succeed as a radiologist 10-20 years from now?

I don’t think radiologists per se have to do anything about AI, unless they want to change track and work in the field of AI from a technical perspective. AI incorporation into workflow will happen anyway and like all changes to radiology workflow over the decades, it will become routine and a way of life. They don’t really need to do anything different, except be willing to accept change.