Classification of X-Ray Images Using Vision Transformers (ViT)

Seminar Presentation

Guided by: Mr. JACOB THOMAS, Assistant Professor, Department of AD
Presented by: Mr. JAYASANKAR SHYAM, SJC20AD040

Nov 14, 2023
Outline
● Introduction
○ Vision Transformer (ViT)
○ Self-Attention Mechanism
○ Transformers
● History of ViT
● ViT Architecture
● Working of ViT
● Applications of ViT
● Advantages of ViT
● Disadvantages of ViT
● Research Article Discussion
○ Explaining the Dataset
○ CapsNet
○ VGG16
○ Transformers
○ Implementation of ViT
○ Implementation of VDSNet
○ Results and Comparison
● Conclusion
● References
Introduction
● Vision Transformers (ViT) have recently emerged as a competitive alternative to Convolutional Neural Networks (CNNs), which are currently state-of-the-art in many image recognition and computer vision tasks.
● ViT models can outperform current state-of-the-art CNNs by almost a factor of four in computational efficiency while achieving comparable or better accuracy.
● Vision Transformers have recently achieved highly competitive performance in benchmarks for several computer vision applications, such as image classification, object detection, and semantic image segmentation.
Vision Transformer (ViT)
● A Vision Transformer, often abbreviated as ViT, is a deep learning model architecture designed for computer vision tasks such as image classification and object detection.
● It represents a departure from traditional Convolutional Neural Networks (CNNs) by relying on self-attention mechanisms inspired by the Transformer architecture.
● It employs a Transformer-like architecture over patches of the image.
Self-Attention Mechanism
● The self-attention mechanism is a key component of the transformer architecture, used to capture long-range dependencies and contextual information in the input data.
● Self-attention allows a ViT model to attend to different regions of the input, weighted by their relevance to the task at hand (a minimal sketch is shown below).
● Attention has proven to be a key element for vision networks to achieve higher robustness.
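To make the mechanism concrete, here is a minimal sketch of single-head scaled dot-product self-attention in Python/NumPy. All names and dimensions are illustrative, not taken from the paper.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (n_tokens, d_model); Wq, Wk, Wv: (d_model, d_head) projection matrices.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise token relevance
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # relevance-weighted mix of values

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))                        # e.g. 16 patch tokens, 64-dim each
Wq, Wk, Wv = [rng.normal(size=(64, 64)) for _ in range(3)]
out = self_attention(X, Wq, Wk, Wv)                  # shape (16, 64)

Each output token is a relevance-weighted mixture of all value vectors, which is what lets the model relate any region of the input to any other.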
Transformers
● Transformers are a type of deep learning model used for natural language processing (NLP) and computer vision (CV) tasks.
● Transformers can process the entire input sequence at once, capturing context and relevance.
● They can handle longer sequences efficiently and overcome the vanishing-gradient problem faced by recurrent neural networks (RNNs).
● Transformers were introduced in 2017 in the paper "Attention Is All You Need" from Google Brain.
History of ViT
ViT Architecture
● The Vision Transformer (ViT) is an architecture that uses self-attention mechanisms to process images.
● The Vision Transformer architecture consists of a series of transformer blocks.
● Each transformer block consists of two sub-layers: a multi-head self-attention layer and a feed-forward layer.
● The self-attention layer calculates attention weights for each image patch based on its relationship with all other patches, while the feed-forward layer applies a non-linear transformation to the output of the self-attention layer. A sketch of one such block follows.
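Below is a minimal PyTorch sketch of one such transformer block (pre-norm residual layout; the dimensions shown match ViT-Base but are otherwise illustrative):

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                    # feed-forward sub-layer
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x):                            # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual + self-attention
        return x + self.mlp(self.norm2(x))                 # residual + feed-forward

x = torch.randn(1, 197, 768)                         # e.g. 196 patches + [CLS] token
y = EncoderBlock()(x)                                # same shape out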
ViT Architecture
● ViT also includes a patch embedding layer, which divides the image into fixed-size patches and maps each patch to a high-dimensional vector representation.
● The final output of the ViT architecture is a class prediction, obtained by passing the output of the last transformer block through a classification head, which typically consists of a single fully connected layer. A patch-embedding sketch follows.
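A short PyTorch sketch of the patch embedding and classification head, assuming a 224 × 224 RGB input, 16 × 16 patches, and a hypothetical 15-class output:

import torch
import torch.nn as nn

patch, dim = 16, 768
img = torch.randn(1, 3, 224, 224)                    # one RGB image

# A conv with kernel size = stride = patch size is equivalent to
# "split into patches, flatten, apply a shared linear projection".
to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_tokens(img).flatten(2).transpose(1, 2)   # (1, 196, 768)

cls = nn.Parameter(torch.zeros(1, 1, dim))           # learnable [CLS] token
pos = nn.Parameter(torch.zeros(1, 197, dim))         # learnable positional embeddings
x = torch.cat([cls, tokens], dim=1) + pos            # encoder input sequence

head = nn.Linear(dim, 15)                            # classification head (15 classes here)
logits = head(x[:, 0])                               # predict from the [CLS] token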
Department of AD (Seminar : ADQ 413) Nov 14, 2023 /43
12
Working of ViT
Department of AD (Seminar : ADQ 413) Nov 14, 2023 /43
Working of ViT
● Split an image into patches
● Flatten the patches
● Produce lower-dimensional linear embeddings from the flattened patches
● Add positional embeddings
● Feed the sequence as input to a standard transformer encoder
● Pretrain the model with image labels (fully supervised, on a huge dataset)
● Finetune on the downstream dataset for image classification (see the transfer-learning sketch below)
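As a hedged illustration of the final two steps, the sketch below fine-tunes an ImageNet-pretrained ViT from torchvision on a downstream multi-label X-ray task. The 15-class head and training details are illustrative, not the paper's exact setup:

import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)   # pretrained backbone
model.heads = nn.Linear(model.hidden_dim, 15)              # new head: 15 X-ray classes

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()       # multi-label: an image may have several findings

def train_step(images, labels):        # images: (B, 3, 224, 224); labels: (B, 15)
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()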
Applications of ViT
● Image classification
● Image segmentation
● Image captioning
● Anomaly detection
● Autonomous driving
● Image enhancement
● Video deepfake detection, etc.
Advantages of ViT over CNN
● Comparing ViT and CNN in terms of computational efficiency and accuracy, ViT can be the better choice when the time available for model training is limited.
● The self-attention mechanism makes the developed model more interpretable: while it is hard to understand the weaknesses of a CNN-based model, attention maps can be visualized to help developers see where to improve the model.
● Ability to process images at different scales and resolutions.
Disadvantages of ViT
● ViTs require a large amount of data for high accuracy, and the data collection process can extend project time.
● They can be computationally expensive to train and evaluate, which can make them less practical for some applications.
● Vision transformers are still a relatively new technology, and much research remains to be done to fully understand their capabilities and limitations.
Research Article Discussion
Vision Transformer Outperforms Deep Convolutional Neural Network-based Model in Classifying X-ray Images [1]
● For the last ten years, Convolutional Neural Networks (CNNs) have been the go-to method for automated clinical image diagnosis.
● This paper suggests a ViT-based approach for detecting lung diseases, positioning ViTs as a compelling alternative to CNNs in this domain.
● The approach is compared with a CNN-based hybrid deep learning approach, the Visual Geometry Group Data Spatial Transformer Network with CNN (VDSNet), which outperforms various existing deep learning techniques.
Explaining the Dataset
● The data for this work came from the National Institutes of Health Clinical Center; it is freely available on Kaggle and is one of the most extensive collections of chest X-ray images accessible to the research community.
● There are 112,120 frontal-view X-ray images in the collection, representing 30,805 unique patients.
● The chest images in this dataset have a resolution of 1024 × 1024.
● The collection has 15 classes, and an image can be assigned one or more of the following categories: No Finding, Atelectasis, Pneumonia, Hernia, Edema, Emphysema, Cardiomegaly, Pneumothorax, Consolidation, Infiltration, Fibrosis, Effusion, Pleural Thickening, Nodule, and Mass.

Age distribution of the full dataset by gender.
Number of each disease by patient gender.
CapsNet
● CapsNet makes use of groups of neurons (capsules) whose activity-vector length indicates the likelihood of an entity's existence.
● Equivariance, which preserves the spatial relationships of entities in an image regardless of their orientation or size, is one of the network's major qualities.
● It is also used in a variety of medical image classification applications, such as detecting brain tumors from MRI images, and it delivers dependable accuracy and a reduced feature map with a few tweaks to the parameters.
23
● 256 filters, 2 strides, 9 kernel size,'relu' activation, and 'same'
padding in a convolution layer. They had collected fewer
features with strides = 2 than with strides = 1, but they had
enhanced the strings as a result, and so considered the output
of lung pictures to be focused.
Architecture of CapsNet as implemented by the authors of VDSNet.
Department of AD (Seminar : ADQ 413) Nov 14, 2023 /43
● The primary capsule layer has a capsule dimension of 8, stride 2, kernel size 9, 32 channels, and 'same' padding. The only variation compared with Hinton's structure is that the 'valid' padding was swapped for 'same'.
● The prediction layer uses the number of classes as the number of capsules (num_capsule = n_class), with 16 as the capsule dimension. A hedged Keras sketch of these layers follows.
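A minimal Keras sketch of the convolution and primary-capsule configuration described above. The layer names, input size, and the inline squash function are illustrative, this is not the authors' code, and the final routing layer is only indicated:

import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(224, 224, 1))        # grayscale X-ray (size illustrative)

# Convolution layer: 256 filters, kernel 9, stride 2, 'relu', 'same' padding.
conv1 = layers.Conv2D(256, kernel_size=9, strides=2, activation='relu',
                      padding='same')(inputs)

# Primary capsules: 32 channels of 8-dimensional capsules, kernel 9, stride 2,
# 'same' padding (the deviation from Hinton's original 'valid' padding).
primary = layers.Conv2D(32 * 8, kernel_size=9, strides=2, padding='same')(conv1)
primary = layers.Reshape((-1, 8))(primary)        # flatten to (num_capsules, 8)

# Squash non-linearity: scales each capsule vector's length into [0, 1).
def squash(v, eps=1e-7):
    s2 = tf.reduce_sum(tf.square(v), axis=-1, keepdims=True)
    return (s2 / (1.0 + s2)) * v / tf.sqrt(s2 + eps)

primary = layers.Lambda(squash)(primary)
# A routing layer mapping to n_class capsules of dimension 16 would follow here.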
VGG16
● VGG16 takes as input images of a fixed size, in this case 224 × 224 in RGB, making three channels. The model processes these inputs and outputs a one-dimensional 1000-valued vector.
● The softmax function σ is then applied so that the output probabilities sum to 1. For an output vector z, the standard softmax is

σ(z)_j = exp(z_j) / Σ_{k=1}^{1000} exp(z_k),  j = 1, …, 1000.
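For illustration, the pretrained VGG16 described here can be loaded directly from Keras. This is a sketch assuming ImageNet weights, not the paper's exact usage:

import numpy as np
import tensorflow as tf

model = tf.keras.applications.VGG16(weights='imagenet')        # expects 224x224 RGB input

x = np.random.rand(1, 224, 224, 3).astype('float32') * 255.0   # stand-in image
x = tf.keras.applications.vgg16.preprocess_input(x)            # VGG-style preprocessing

probs = model.predict(x)            # shape (1, 1000); final layer is softmax
print(probs.sum())                  # ~1.0: softmax probabilities sum to one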
Architectural design of VGG16 model
Transformers
● The transformer is made up of an encoder-decoder pair.
● In a transformer, before passing the input language sequence on to the next encoder blocks, the current encoder turns it into an embedding.
● To produce the next token in the output sequence, the decoder uses the previously generated tokens in the target language together with the encoder branch's encoded representation of the input.
● The sequence of previous outputs is obtained by shifting the final output one position to the right and inserting a beginning-of-sentence token at the start.
● This shift prevents the model from learning to simply copy the decoder input to the output.
● The blocks containing multi-head attention and feed-forward layers are replicated N times in both the encoder and the decoder.
● Two major concepts have influenced the design of transformer algorithms:
○ Foremost is self-attention, which allows the capture of long-range information and relationships between sequence elements.
○ The second fundamental concept is (un)supervised pre-training on a huge (un)labeled corpus, followed by fine-tuning on the target task with a small labeled dataset.
Transformer model architecture, with the encoder shown in the middle row, the decoder in the last row, and multi-head attention in the first row.
Implementation of ViT
● A Vision Transformer has only the encoder section of a standard Transformer.
● The figure below illustrates the full architecture of a vision transformer model.

The ViT architecture as implemented in this work.
● The general trend observed in ViTs is that as the patch size decreases and the number of layers increases, the time required for the training and fine-tuning phases also increases because of the growing number of parameters; accuracy similarly tends to increase.
● An image is split into square tiles called patches. These patches are flattened into single-dimensional vectors, after which positional embeddings are added so that the position of each patch can be identified. The result is a 1-dimensional sequence of token embeddings.
● These serve as input to the encoder, which is itself made up of several layers.
Implementation of VDSNet

The VDSNet architecture as implemented in this work.
● Three layers are present in the first section (a sketch follows the list):
○ The first layer, called the lambda (λ) layer, rescales the input range to [-0.5, 0.5].
○ Batch normalization is the second layer.
○ A spatial transformer is the third layer, which extracts the most significant features for classifying X-ray images.
● The next section contains a pre-trained VGG16 model. It has a total of 21 layers (of which 16 are weight layers):
○ 13 convolutional
○ 5 max pooling
○ 3 fully connected (dense)
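A minimal Keras sketch of this VDSNet front section: the lambda rescaling, batch normalization, a stub where the spatial transformer would sit, and the pre-trained VGG16 backbone. The spatial transformer is a custom layer in the original work and is only indicated here; layer sizes and the head are illustrative, not the authors' code:

import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(224, 224, 3))

# Lambda layer: rescale pixel values from [0, 255] to [-0.5, 0.5].
x = layers.Lambda(lambda t: t / 255.0 - 0.5)(inputs)
x = layers.BatchNormalization()(x)
# A spatial transformer layer would go here (custom layer in the original work).

vgg = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
x = vgg(x)                                           # the 13 conv + 5 max-pool layers
x = layers.Flatten()(x)
x = layers.Dense(512, activation='relu')(x)          # dense head (sizes illustrative)
outputs = layers.Dense(15, activation='sigmoid')(x)  # multi-label output

model = Model(inputs, outputs)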
Results and Discussions
● After many trials, the best possible accuracy with the developed models was noted.
● The models were trained for 10 epochs on the fractional datasets and 20 epochs on the full dataset.
● The performance of the chosen ViT networks, as well as the self-implemented VDSNet, on lung disease detection has been recorded.
● The training and validation losses for the ViTs that produce comparable or better results on the 25%, 50%, 75%, and 100% dataset fractions have been plotted.
Loss curves for 5% fraction of the dataset.
Loss curves for 25% fraction of the dataset.
Loss curves for 50% fraction of the dataset.
Loss curves for 75% fraction of the dataset.
Loss curves for 100% fraction of the dataset.
Evaluation Metrics of Implemented Models.
Conclusion
● After appending the additional parameters, the ViT model appears to perform better, with 70.24% accuracy, compared with the CNN-based model VDSNet at 69.86% accuracy, for detecting lung diseases in chest X-rays.
● We observe that, with the help of transfer learning, ViTs can reach similar or slightly higher levels of performance than CNN-based models on smaller medical image datasets.
References
● [1] O. Uparkar, J. Bharti, R. K. Pateriya, R. K. Gupta, and A. Sharma, "Vision Transformer Outperforms Deep Convolutional Neural Network-based Model in Classifying X-ray Images," Procedia Computer Science, vol. 218, pp. 2338–2349, Jan. 2023, doi: https://doi.org/10.1016/j.procs.2023.01.209.
● [2] K. Han et al., "A Survey on Vision Transformer," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2022, doi: https://doi.org/10.1109/tpami.2022.3152247.
● [3] G. Boesch, "Vision Transformers (ViT) in Image Recognition - 2022 Guide," viso.ai, Sep. 06, 2021. https://viso.ai/deep-learning/vision-transformer-vit/
References
● [4] "Vision Transformer: What It Is & How It Works [2023 Guide]," www.v7labs.com. https://www.v7labs.com/blog/vision-transformer-guide
● [5] R. K. Gupta, N. Kunhare, N. Pathik, and B. Pathik, "An AI-enabled pre-trained model-based Covid detection model using chest X-ray images," Multimedia Tools and Applications, Jul. 2022, doi: https://doi.org/10.1007/s11042-021-11580-x.
Questions?

Thank You