Classification of X-Ray Images Using Vision Transformers (ViT)

Seminar Presentation

Guided by: Mr. JACOB THOMAS, Assistant Professor, Department of AD
Presented by: Mr. JAYASANKAR SHYAM, SJC20AD040

Nov 14, 2023
Outline
● Introduction
○ Vision Transformer (ViT)
○ Self-Attention Mechanism
○ Transformers
● History of ViT
● ViT Architecture
● Working of ViT
● Applications of ViT
● Advantages of ViT
● Disadvantages of ViT
● Research Article Discussion
○ Explaining the Dataset
○ CapsNet
○ VGG16
○ Transformers
○ Implementation of ViT
○ Implementation of VDSNet
○ Results and Comparison
● Conclusion
● References
Introduction
● Vision Transformers (ViT) have recently emerged as a competitive alternative to Convolutional Neural Networks (CNNs), which are currently state-of-the-art in many image recognition and computer vision tasks.
● ViT models can outperform current state-of-the-art CNNs by almost a factor of four in computational efficiency while achieving comparable or better accuracy.
● Vision Transformers have recently achieved highly competitive performance in benchmarks for several computer vision applications, such as image classification, object detection, and semantic image segmentation.
Vision Transformer (ViT)
● A Vision Transformer, often abbreviated as ViT, is a deep learning model architecture designed for computer vision tasks such as image classification and object detection.
● It represents a departure from traditional Convolutional Neural Networks (CNNs) by relying on self-attention mechanisms inspired by the Transformer architecture.
● It employs a Transformer-like architecture over patches of the image.
Self-Attention Mechanism
● The self-attention mechanism is a key component of the transformer architecture, used to capture long-range dependencies and contextual information in the input data.
● Self-attention allows a ViT model to attend to different regions of the input, weighted by their relevance to the task at hand (a minimal sketch is shown below).
● Attention has proven to be a key element for vision networks to achieve higher robustness.
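To make the mechanism concrete, here is a minimal sketch of single-head scaled dot-product self-attention in Python/NumPy. All names and dimensions are illustrative, not taken from the paper.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (n_tokens, d_model); Wq, Wk, Wv: (d_model, d_head) projection matrices.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise token relevance
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # relevance-weighted mix of values

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))                        # e.g. 16 patch tokens, 64-dim each
Wq, Wk, Wv = [rng.normal(size=(64, 64)) for _ in range(3)]
out = self_attention(X, Wq, Wk, Wv)                  # shape (16, 64)

Each output token is a relevance-weighted mixture of all value vectors, which is what lets the model relate any region of the input to any other.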
Transformers
● Transformers are a type of deep learning model used for natural language processing (NLP) and computer vision (CV) tasks.
● Transformers can process the entire input sequence at once, capturing context and relevance.
● They can handle longer sequences efficiently and overcome the vanishing-gradient problem faced by recurrent neural networks (RNNs).
● Transformers were introduced in 2017 in the paper "Attention Is All You Need" from Google Brain.
History of ViT
ViT Architecture
● The Vision Transformer (ViT) is an architecture that uses self-attention mechanisms to process images.
● The Vision Transformer architecture consists of a series of transformer blocks.
● Each transformer block consists of two sub-layers: a multi-head self-attention layer and a feed-forward layer.
● The self-attention layer calculates attention weights for each image patch based on its relationship with all other patches, while the feed-forward layer applies a non-linear transformation to the output of the self-attention layer. A sketch of one such block follows.
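Below is a minimal PyTorch sketch of one such transformer block (pre-norm residual layout; the dimensions shown match ViT-Base but are otherwise illustrative):

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                    # feed-forward sub-layer
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x):                            # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual + self-attention
        return x + self.mlp(self.norm2(x))                 # residual + feed-forward

x = torch.randn(1, 197, 768)                         # e.g. 196 patches + [CLS] token
y = EncoderBlock()(x)                                # same shape out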
ViT Architecture
● ViT also includes a patch embedding layer, which divides the image into fixed-size patches and maps each patch to a high-dimensional vector representation.
● The final output of the ViT architecture is a class prediction, obtained by passing the output of the last transformer block through a classification head, which typically consists of a single fully connected layer. A patch-embedding sketch follows.
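A short PyTorch sketch of the patch embedding and classification head, assuming a 224 × 224 RGB input, 16 × 16 patches, and a hypothetical 15-class output:

import torch
import torch.nn as nn

patch, dim = 16, 768
img = torch.randn(1, 3, 224, 224)                    # one RGB image

# A conv with kernel size = stride = patch size is equivalent to
# "split into patches, flatten, apply a shared linear projection".
to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_tokens(img).flatten(2).transpose(1, 2)   # (1, 196, 768)

cls = nn.Parameter(torch.zeros(1, 1, dim))           # learnable [CLS] token
pos = nn.Parameter(torch.zeros(1, 197, dim))         # learnable positional embeddings
x = torch.cat([cls, tokens], dim=1) + pos            # encoder input sequence

head = nn.Linear(dim, 15)                            # classification head (15 classes here)
logits = head(x[:, 0])                               # predict from the [CLS] token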
Department of AD (Seminar : ADQ 413) Nov 14, 2023 /43
12
Working of ViT
Department of AD (Seminar : ADQ 413) Nov 14, 2023 /43
Working of ViT
● Split an image into patches
● Flatten the patches
● Produce lower-dimensional linear embeddings from the flattened patches
● Add positional embeddings
● Feed the sequence as input to a standard transformer encoder
● Pretrain the model with image labels (fully supervised, on a huge dataset)
● Finetune on the downstream dataset for image classification (see the transfer-learning sketch below)
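As a hedged illustration of the final two steps, the sketch below fine-tunes an ImageNet-pretrained ViT from torchvision on a downstream multi-label X-ray task. The 15-class head and training details are illustrative, not the paper's exact setup:

import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)   # pretrained backbone
model.heads = nn.Linear(model.hidden_dim, 15)              # new head: 15 X-ray classes

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()       # multi-label: an image may have several findings

def train_step(images, labels):        # images: (B, 3, 224, 224); labels: (B, 15)
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()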
Applications of ViT
● Image classification
● Image segmentation
● Image captioning
● Anomaly detection
● Autonomous driving
● Image enhancement
● Video deepfake detection, etc.
Advantages of ViT over CNN
● Comparing ViT and CNN in terms of computational efficiency and accuracy, ViT can be the better choice when the time available for model training is limited.
● The self-attention mechanism makes the developed model more interpretable: while it is hard to understand the weaknesses of a CNN-based model, attention maps can be visualized to help developers see where to improve the model.
● Ability to process images at different scales and resolutions.
Disadvantages of ViT
● ViTs require a large amount of data for high accuracy, and the data collection process can extend project time.
● They can be computationally expensive to train and evaluate, which can make them less practical for some applications.
● Vision transformers are still a relatively new technology, and much research remains to be done to fully understand their capabilities and limitations.
Research Article Discussion
Vision Transformer Outperforms Deep Convolutional Neural Network-based Model in Classifying X-ray Images [1]
● For the last ten years, Convolutional Neural Networks (CNNs) have been the go-to method for automated clinical image diagnosis.
● This paper suggests a ViT-based approach for detecting lung diseases, positioning ViTs as a compelling alternative to CNNs in this domain.
● The approach is compared with a CNN-based hybrid deep learning approach, the Visual Geometry Group Data Spatial Transformer Network with CNN (VDSNet), which outperforms various existing deep learning techniques.
Explaining the Dataset
● The data for this work came from the National Institutes of Health Clinical Center; it is freely available on Kaggle and is one of the most extensive collections of chest X-ray images accessible to the research community.
● There are 112,120 frontal-view X-ray images in the collection, representing 30,805 unique patients.
● The chest images in this dataset have a resolution of 1024 × 1024.
● The collection has 15 classes, and an image can be assigned one or more of the following categories: No Finding, Atelectasis, Pneumonia, Hernia, Edema, Emphysema, Cardiomegaly, Pneumothorax, Consolidation, Infiltration, Fibrosis, Effusion, Pleural Thickening, Nodule, and Mass.

Age distribution of the full dataset by gender.
Number of each disease by patient gender.
CapsNet
● CapsNet makes use of groups of neurons (capsules) whose activity-vector length indicates the likelihood of an entity's existence.
● Equivariance, which preserves the spatial relationships of entities in an image regardless of their orientation or size, is one of the network's major qualities.
● It is also used in a variety of medical image classification applications, such as detecting brain tumors from MRI images, and it delivers dependable accuracy and a reduced feature map with a few tweaks to the parameters.
23
● 256 filters, 2 strides, 9 kernel size,'relu' activation, and 'same'
padding in a convolution layer. They had collected fewer
features with strides = 2 than with strides = 1, but they had
enhanced the strings as a result, and so considered the output
of lung pictures to be focused.
Architecture of CapsNet as implemented by the authors of VDSNet.
Department of AD (Seminar : ADQ 413) Nov 14, 2023 /43
● The primary capsule layer has a capsule dimension of 8, stride 2, kernel size 9, 32 channels, and 'same' padding. The only variation compared with Hinton's structure is that the 'valid' padding was swapped for 'same'.
● The prediction layer uses the number of classes as the number of capsules (num_capsule = n_class), with 16 as the capsule dimension. A hedged Keras sketch of these layers follows.
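A minimal Keras sketch of the convolution and primary-capsule configuration described above. The layer names, input size, and the inline squash function are illustrative, this is not the authors' code, and the final routing layer is only indicated:

import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(224, 224, 1))        # grayscale X-ray (size illustrative)

# Convolution layer: 256 filters, kernel 9, stride 2, 'relu', 'same' padding.
conv1 = layers.Conv2D(256, kernel_size=9, strides=2, activation='relu',
                      padding='same')(inputs)

# Primary capsules: 32 channels of 8-dimensional capsules, kernel 9, stride 2,
# 'same' padding (the deviation from Hinton's original 'valid' padding).
primary = layers.Conv2D(32 * 8, kernel_size=9, strides=2, padding='same')(conv1)
primary = layers.Reshape((-1, 8))(primary)        # flatten to (num_capsules, 8)

# Squash non-linearity: scales each capsule vector's length into [0, 1).
def squash(v, eps=1e-7):
    s2 = tf.reduce_sum(tf.square(v), axis=-1, keepdims=True)
    return (s2 / (1.0 + s2)) * v / tf.sqrt(s2 + eps)

primary = layers.Lambda(squash)(primary)
# A routing layer mapping to n_class capsules of dimension 16 would follow here.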
VGG16
● VGG16 takes as input images of a fixed size, in this case 224 × 224 in RGB, making three channels. The model processes these inputs and outputs a one-dimensional 1000-valued vector.
● The softmax function σ is then applied so that the output probabilities sum to 1. For an output vector z, the standard softmax is

σ(z)_j = exp(z_j) / Σ_{k=1}^{1000} exp(z_k),  j = 1, …, 1000.
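For illustration, the pretrained VGG16 described here can be loaded directly from Keras. This is a sketch assuming ImageNet weights, not the paper's exact usage:

import numpy as np
import tensorflow as tf

model = tf.keras.applications.VGG16(weights='imagenet')        # expects 224x224 RGB input

x = np.random.rand(1, 224, 224, 3).astype('float32') * 255.0   # stand-in image
x = tf.keras.applications.vgg16.preprocess_input(x)            # VGG-style preprocessing

probs = model.predict(x)            # shape (1, 1000); final layer is softmax
print(probs.sum())                  # ~1.0: softmax probabilities sum to one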
Architectural design of VGG16 model
Transformers
● The transformer is made up of an encoder-decoder pair.
● In a transformer, before passing the input language sequence on to the next encoder blocks, the current encoder turns it into an embedding.
● To produce the next token in the output sequence, the decoder uses the previously generated tokens in the target language together with the encoder branch's encoded representation of the input.
● The sequence of previous outputs is obtained by shifting the final output one position to the right and inserting a beginning-of-sentence token at the start.
● This shift prevents the model from learning to simply copy the decoder input to the output.
● The blocks containing multi-head attention and feed-forward layers are replicated N times in both the encoder and the decoder.
● Two major concepts have influenced the design of transformer algorithms:
○ Foremost is self-attention, which allows the capture of long-range information and relationships between sequence elements.
○ The second fundamental concept is (un)supervised pre-training on a huge (un)labeled corpus, followed by fine-tuning on the target task with a small labeled dataset.
Transformer model architecture, with the encoder shown in the middle row, the decoder in the last row, and multi-head attention in the first row.
Implementation of ViT
● A Vision Transformer has only the encoder section of a standard Transformer.
● The figure below illustrates the full architecture of a vision transformer model.

The ViT architecture as implemented in this work.
● The general trend observed in ViTs is that as the patch size decreases and the number of layers increases, the time required for the training and fine-tuning phases also increases because of the growing number of parameters; accuracy similarly tends to increase.
● An image is split into square tiles called patches. These patches are flattened into single-dimensional vectors, after which positional embeddings are added so that the position of each patch can be identified. The result is a 1-dimensional sequence of token embeddings.
● These serve as input to the encoder, which is itself made up of several layers.
Implementation of VDSNet

The VDSNet architecture as implemented in this work.
● Three layers are present in the first section (a sketch follows the list):
○ The first layer, called the lambda (λ) layer, rescales the input range to [-0.5, 0.5].
○ Batch normalization is the second layer.
○ A spatial transformer is the third layer, which extracts the most significant features for classifying X-ray images.
● The next section contains a pre-trained VGG16 model. It has a total of 21 layers (of which 16 are weight layers):
○ 13 convolutional
○ 5 max pooling
○ 3 fully connected (dense)
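A minimal Keras sketch of this VDSNet front section: the lambda rescaling, batch normalization, a stub where the spatial transformer would sit, and the pre-trained VGG16 backbone. The spatial transformer is a custom layer in the original work and is only indicated here; layer sizes and the head are illustrative, not the authors' code:

import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(224, 224, 3))

# Lambda layer: rescale pixel values from [0, 255] to [-0.5, 0.5].
x = layers.Lambda(lambda t: t / 255.0 - 0.5)(inputs)
x = layers.BatchNormalization()(x)
# A spatial transformer layer would go here (custom layer in the original work).

vgg = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
x = vgg(x)                                           # the 13 conv + 5 max-pool layers
x = layers.Flatten()(x)
x = layers.Dense(512, activation='relu')(x)          # dense head (sizes illustrative)
outputs = layers.Dense(15, activation='sigmoid')(x)  # multi-label output

model = Model(inputs, outputs)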
Results and Discussions
● After many trials, the best possible accuracy with the developed models was noted.
● The models were trained for 10 epochs on the fractional datasets and 20 epochs on the full dataset.
● The performance of the chosen ViT networks, as well as the self-implemented VDSNet, on lung disease detection has been recorded.
● The training and validation losses for the ViTs that produce comparable or better results on the 25%, 50%, 75%, and 100% dataset fractions have been plotted.
Loss curves for 5% fraction of the dataset.
Loss curves for 25% fraction of the dataset.
Loss curves for 50% fraction of the dataset.
Loss curves for 75% fraction of the dataset.
Loss curves for 100% fraction of the dataset.
Evaluation Metrics of Implemented Models.
Conclusion
● After appending the additional parameters, the ViT model appears to perform better, with 70.24% accuracy, compared with the CNN-based model VDSNet at 69.86% accuracy, for detecting lung diseases in chest X-rays.
● We observe that, with the help of transfer learning, ViTs can reach similar or slightly higher levels of performance than CNN-based models on smaller medical image datasets.
References
● [1] O. Uparkar, J. Bharti, R. K. Pateriya, R. K. Gupta, and A. Sharma, "Vision Transformer Outperforms Deep Convolutional Neural Network-based Model in Classifying X-ray Images," Procedia Computer Science, vol. 218, pp. 2338–2349, Jan. 2023, doi: https://doi.org/10.1016/j.procs.2023.01.209.
● [2] K. Han et al., "A Survey on Vision Transformer," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2022, doi: https://doi.org/10.1109/tpami.2022.3152247.
● [3] G. Boesch, "Vision Transformers (ViT) in Image Recognition - 2022 Guide," viso.ai, Sep. 06, 2021. https://viso.ai/deep-learning/vision-transformer-vit/
References
● [4] "Vision Transformer: What It Is & How It Works [2023 Guide]," www.v7labs.com. https://www.v7labs.com/blog/vision-transformer-guide
● [5] R. K. Gupta, N. Kunhare, N. Pathik, and B. Pathik, "An AI-enabled pre-trained model-based Covid detection model using chest X-ray images," Multimedia Tools and Applications, Jul. 2022, doi: https://doi.org/10.1007/s11042-021-11580-x.
Questions?

Thank You