How Transformers are Changing the Direction of Deep Learning Architectures
Tom Michiels
System Architect
Synopsys
• The Surprising Rise of Transformers in Vision
• The Structure of Attention and Transformer
• Transformers applied to Vision and Other Application Domains
• Why Transformers are Here to Stay for Vision
Outline
CNNs Have Dominated Many Vision Tasks Since 2012
Image Classification timeline (2012-2022): AlexNet, VGG, ResNet, MobileNet V1, MobileNet V2, EfficientNet, MobileNet V3
CNNs Have Dominated Many Vision Tasks Since 2012
Object Detection timeline (2012-2022): the image classification backbones above plus RCNN, FRCNN, SSD, YOLOv2, Mask RCNN, YOLOv3, YOLOv4/v5, EfficientDet
CNNs Have Dominated Many Vision Tasks Since 2012
Semantic Segmentation timeline (2012-2022): the classification and detection families above plus FCN, SegNet, DeepLab, DSNet
CNNs Have Dominated Many Vision Tasks Since 2012
Panoptic Vision timeline (2012-2022): the classification, detection, and segmentation families above plus DeeperLab, PFPN, EfficientPS, YOLOP
A Decade of CNN Development…
Key building blocks: Residual Connections and Inverted Residual Blocks (both sketched below)
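For readers less familiar with these two blocks, here is a minimal PyTorch-style sketch; the channel counts, activations, and omission of strides are illustrative assumptions, not details taken from the slides.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """ResNet-style block: the input is added back to the output of two 3x3 convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))        # residual (skip) connection

class InvertedResidualBlock(nn.Module):
    """MobileNetV2-style block: 1x1 expansion, depthwise 3x3, 1x1 linear projection back."""
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.body = nn.Sequential(
            nn.Conv2d(channels, hidden, 1), nn.ReLU6(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden), nn.ReLU6(),   # depthwise
            nn.Conv2d(hidden, channels, 1),                                       # linear bottleneck
        )

    def forward(self, x):
        return x + self.body(x)

x = torch.randn(1, 32, 56, 56)
print(ResidualBlock(32)(x).shape, InvertedResidualBlock(32)(x).shape)   # both torch.Size([1, 32, 56, 56])
```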
Beaten in Accuracy by Transformers
Is this the real life ? Is this just fantasy ?
The Transformer, a model designed for natural language processing, applied without any modification to image patches, beats the highly specialized CNNs in accuracy.
The Structure of Attention and Transformer
• Attention is all you need! (*)
• BERT: Bidirectional Encoder Representations from Transformers
• A Transformer is a deep learning model that uses the attention mechanism
• Transformers were primarily used for Natural Language Processing
  • Translation
  • Question Answering
  • Conversational AI
• Successful training of huge transformers
  • MTM, GPT-3, T5, ALBERT, RoBERTa, Switch
• Transformers are successfully applied in other application domains, with promising results for embedded use
BERT and Transformers
(*) https://arxiv.org/abs/1706.03762
Convolutions, Feed Forward, and Multi-Head Attention
[Diagram: a CNN block (3x3 Convolution + Add) shown alongside a Transformer block (Multi-Head Attention and Feed Forward / 1x1 Convolution, each with Add)]
• The Feed Forward layer of the Transformer is identical to a 1x1 Convolution (see the sketch below)
• In this part of the model, no information flows between tokens/pixels
• The Multi-Head Attention and 3x3 Convolution layers are responsible for mixing information between tokens/pixels
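A small PyTorch sketch of the first bullet; shapes and channel counts are illustrative. A feed-forward (linear) layer and a 1x1 convolution carrying the same weights produce identical outputs, and neither mixes information across tokens/pixels.

```python
import torch
import torch.nn as nn

tokens = torch.randn(1, 196, 256)                   # (batch, N tokens, channels)
linear = nn.Linear(256, 512)                        # one matmul of a Transformer feed-forward layer
conv1x1 = nn.Conv2d(256, 512, kernel_size=1)        # the same operation written as a 1x1 convolution

with torch.no_grad():                               # copy the weights so both compute the same function
    conv1x1.weight.copy_(linear.weight[:, :, None, None])
    conv1x1.bias.copy_(linear.bias)

as_image = tokens.transpose(1, 2).reshape(1, 256, 14, 14)          # lay the 196 tokens on a 14x14 grid
out_linear = linear(tokens)                                        # (1, 196, 512)
out_conv = conv1x1(as_image).reshape(1, 512, 196).transpose(1, 2)  # (1, 196, 512)
print(torch.allclose(out_linear, out_conv, atol=1e-5))             # True: no mixing between tokens/pixels
```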
Convolutions as Hard-Coded Attention
Both Convolution and Attention Networks mix in features of other tokens/pixels
Convolutions mix in features from tokens based on fixed spatial location
Attention mixes in features from tokens based on learned attention (both are sketched below)
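To make the contrast concrete, a rough PyTorch sketch with illustrative dimensions: the convolution mixes a fixed window of neighbouring tokens, while attention mixes all tokens with content-dependent, learned weights.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 6, 8)                 # (batch, N tokens, channels)

# Convolution: each output token mixes a fixed window of neighbours (kernel size 3 here).
w_conv = torch.randn(8, 8, 3)            # (out_channels, in_channels, kernel)
mixed_conv = F.conv1d(x.transpose(1, 2), w_conv, padding=1).transpose(1, 2)

# Attention: each output token mixes all tokens, weighted by a content-dependent similarity.
Wq, Wk, Wv = (torch.randn(8, 8) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
weights = torch.softmax(q @ k.transpose(1, 2) / 8 ** 0.5, dim=-1)   # N x N attention matrix
mixed_attn = weights @ v

print(mixed_conv.shape, mixed_attn.shape)    # both torch.Size([1, 6, 8])
```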
The Structure of a Transformer: Attention
Multi-Head Attention
Attention: Mix in Features of Other Tokens
Example sentence split into tokens: "Is this the real life ?"
Multi-Head Attention computes an N x N attention matrix, where N is the number of tokens (sketched below)
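A compact multi-head attention sketch in PyTorch, assuming a model width of 256 and 8 heads (both illustrative): every head forms its own N x N attention matrix and uses it to mix in features of the other tokens.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)          # query, key, value projections in one matmul
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                           # x: (batch, N, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5    # (batch, heads, N, N) attention matrix
        mixed = scores.softmax(dim=-1) @ v                         # mix in features of other tokens
        return self.out(mixed.transpose(1, 2).reshape(b, n, d))

tokens = torch.randn(1, 6, 256)                    # e.g. the six tokens of "Is this the real life ?"
print(MultiHeadAttention()(tokens).shape)          # torch.Size([1, 6, 256])
```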
The Structure of a Transformer: Embedding
Embedding of input tokens and the positional encoding
Example input: "Is this the real life ? Is this just fantasy ?"
Each token is mapped to an embedding vector, and a position encoding is added elementwise (see the sketch below)
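A minimal sketch of this embedding stage in PyTorch. The vocabulary, indices, and the use of learned position embeddings are assumptions for illustration; the original Transformer paper uses fixed sinusoidal position encodings.

```python
import torch
import torch.nn as nn

vocab = {"is": 0, "this": 1, "the": 2, "real": 3, "life": 4, "?": 5, "just": 6, "fantasy": 7}
token_ids = torch.tensor([[vocab[w] for w in "is this the real life ?".split()]])   # (1, 6)

dim = 256
token_embedding = nn.Embedding(len(vocab), dim)     # learned embedding vectors
position_embedding = nn.Embedding(64, dim)          # learned position encodings (up to 64 positions)

positions = torch.arange(token_ids.shape[1]).unsqueeze(0)        # (1, 6): positions 0, 1, 2, ...
x = token_embedding(token_ids) + position_embedding(positions)   # elementwise addition
print(x.shape)                                                   # torch.Size([1, 6, 256]) -> into the Transformer
```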
Other Application Domains:
Vision, Action Recognition, Speech Recognition
Vision Transformers (ViT-L/16 or ViT-G/14)
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (*)
(*) https://arxiv.org/abs/2010.11929
The image is split into tiles; the pixels in each tile are flattened into tokens (vectors) that feed into the Transformer (see the sketch below)
Each token is linearly projected, combined with a position encoding, and passed through N stacked encoder blocks (Multi-Head Attention and Feed Forward, each followed by Add & Norm)
At the time of publication, Vision Transformers were the best-known method for image classification: they beat convolutional neural networks in accuracy and training time, but not in inference time
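An illustrative sketch of that input pipeline, assuming a 224x224 RGB image, 16x16 tiles, and 768-dimensional tokens (ViT-Base-like numbers):

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)                 # (batch, channels, height, width)
patch, dim = 16, 768

# unfold cuts the image into non-overlapping 16x16 tiles and flattens each tile's pixels.
patches = nn.functional.unfold(image, kernel_size=patch, stride=patch).transpose(1, 2)   # (1, 196, 768)
tokens = nn.Linear(3 * patch * patch, dim)(patches)  # linear projection of the flattened tiles
print(tokens.shape)                                  # torch.Size([1, 196, 768]) -> add position encoding, feed encoder
```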
Vision Transformer: Increasing Resolution
The attention matrix is N x N, where N is the number of tokens/patches, so it scales quadratically with the number of patches (see the numbers below)
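A back-of-the-envelope sketch of that quadratic growth, assuming 16x16 patches:

```python
def attention_matrix_entries(height, width, patch=16):
    n = (height // patch) * (width // patch)     # N = number of tokens/patches
    return n, n * n                              # the attention matrix is N x N

for size in (224, 448, 896):
    n, entries = attention_matrix_entries(size, size)
    print(f"{size}x{size} image -> N = {n} patches -> {entries:,} attention entries per head")
# 224x224 -> N = 196  ->     38,416
# 448x448 -> N = 784  ->    614,656
# 896x896 -> N = 3136 ->  9,834,496   (4x the resolution per side, ~256x the attention matrix)
```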
Swin Transformers
Hierarchical Vision Transformer using Shifted Windows (*)
Two adaptations make Transformers scale to larger images:
1. Shifted Window Attention
2. Patch Merging (sketched below)
State of the Art for:
• Object Detection (COCO)
• Semantic Segmentation (ADE20K)
(*) https://arxiv.org/abs/2103.14030
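An illustrative PyTorch sketch of the patch-merging step; the 56x56x96 feature map mimics a Swin-T-like first stage, but the numbers are assumptions. Each 2x2 group of neighbouring patch features is concatenated and linearly projected, halving the resolution much like pooling does in a CNN.

```python
import torch
import torch.nn as nn

def patch_merge(x, proj):
    # x: (batch, H, W, C) grid of patch features
    tl, tr = x[:, 0::2, 0::2, :], x[:, 0::2, 1::2, :]
    bl, br = x[:, 1::2, 0::2, :], x[:, 1::2, 1::2, :]
    merged = torch.cat([tl, tr, bl, br], dim=-1)     # (batch, H/2, W/2, 4C)
    return proj(merged)                              # project 4C back down, typically to 2C

x = torch.randn(1, 56, 56, 96)                       # stage-1 patch features
proj = nn.Linear(4 * 96, 2 * 96)
print(patch_merge(x, proj).shape)                    # torch.Size([1, 28, 28, 192])
```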
Action Classification with Transformers
Video Swin Transformer
https://arxiv.org/abs/2106.13230
Video Swin Transformers extend the (shifted) window to three dimensions (2D spatial + time)
Today’s state of the art on Kinetics-400 and Kinetics-600
Action Classification with Transformers
Is Space-Time Attention All You Need for Video Understanding?
https://arxiv.org/abs/2102.05095
• Transformers can be applied directly to video
• As with ViT, the video frames are split up into tiles that feed directly into the Transformer
• Applying attention separately over time and over space (“Divided Attention”) gave, at the time of publication, state-of-the-art results on the Kinetics-400 and Kinetics-600 benchmarks (see the sketch below)
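A rough sketch of the divided space-time attention idea in PyTorch; the module names, shapes, and use of nn.MultiheadAttention are illustrative, and the real TimeSformer blocks also include residual connections, normalization, and a feed-forward layer.

```python
import torch
import torch.nn as nn

time_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
space_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

def divided_attention(x):
    # x: (batch, T frames, N patches per frame, dim)
    b, t, n, d = x.shape
    y = x.permute(0, 2, 1, 3).reshape(b * n, t, d)    # group by patch -> self-attention over time
    y = time_attn(y, y, y)[0]
    y = y.reshape(b, n, t, d).permute(0, 2, 1, 3).reshape(b * t, n, d)   # group by frame -> over space
    y = space_attn(y, y, y)[0]
    return y.reshape(b, t, n, d)

clip = torch.randn(1, 8, 196, 256)       # 8 frames, each split into 196 tiles of 16x16 pixels
print(divided_attention(clip).shape)     # torch.Size([1, 8, 196, 256])
```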
• Conformers are Transformers with an additional convolution module
• The convolution module contains a pointwise and a depthwise (1D, size = 31) convolution (sketched below)
• Compared to RNNs, LSTMs, DW-Conv, and Transformers, Conformers give an excellent accuracy/size ratio
• The best-known methods for speech recognition (LibriSpeech) are based on Conformers
Speech Recognition
Conformer: Convolution-augmented Transformer for Speech Recognition (*)
[Chart: word error rate (WER, %) versus number of parameters (M); series: Not Released, Hybrid Model, Transformer]
(*) https://arxiv.org/abs/2005.08100
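An illustrative sketch of that convolution module in PyTorch; the width is an assumption, and the batch normalization, Swish activation, and dropout used in the paper are simplified here.

```python
import torch
import torch.nn as nn

class ConformerConvModule(nn.Module):
    def __init__(self, dim=256, kernel_size=31):
        super().__init__()
        self.pointwise_in = nn.Conv1d(dim, 2 * dim, kernel_size=1)   # pointwise expansion feeding a GLU gate
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.pointwise_out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):                  # x: (batch, time steps, dim)
        y = x.transpose(1, 2)              # Conv1d expects (batch, dim, time)
        y = self.glu(self.pointwise_in(y))
        y = torch.relu(self.depthwise(y))  # the paper uses batch norm + Swish at this point
        y = self.pointwise_out(y)
        return x + y.transpose(1, 2)       # residual connection around the module

print(ConformerConvModule()(torch.randn(2, 100, 256)).shape)   # torch.Size([2, 100, 256])
```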
Why Attention and Transformers are Here to Stay for Vision
Visual Perception beyond Segmentation & Object Detection
Future applications like security cameras, personal assistants, storage retrieval, … require a deeper understanding of the world
→ Merging NLP and Vision using the same knowledge representation backend
Today: Panoptic Segmentation
2022-…: What is happening in this scene?
Tesla AI Day: Using Transformers to Make Predictions in Vector Space
• Convolutional neural networks extract features for every camera
• A transformer is used to:
  • Fuse multiple cameras
  • Make predictions directly in bird’s-eye-view vector space
• Attention-based networks outperform CNN-only networks on accuracy
  • The highest accuracy is required for high-end applications
• Models that combine Vision Transformers with convolutions are more efficient at inference
  • Examples: MobileViT (*), CoAtNet (**)
• Full visual perception requires knowledge that may not easily be acquired by vision only
  • Multi-modal learning is required for a deeper understanding of visual information
  • Applications integrating multiple sensors benefit from attention-based networks
Why Transformers are Here to Stay in Vision
(*) https://arxiv.org/abs/2110.02178
(**) https://arxiv.org/abs/2106.04803v2
• Transformers are deep learning models primarily used in the field of NLP
• Transformers lead to state-of-the-art results in other application domains of deep learning, like vision and speech
• They can be applied to other domains with surprisingly few modifications
• Models that combine attention and convolutions outperform convolutional neural networks on vision tasks, even for small models
• Transformers and attention for vision applications are here to stay
• Real-world applications require knowledge that is not easily captured with convolutions
Summary
Resources
Join the Synopsys Deep Dive: Optimize AI Performance & Power for Tomorrow’s Neural Network Applications (Thursday, 12-3 PM)
Synopsys Demos in Booth 719
• Executing Transformer Neural Networks in ARC NPX6 NPU IP
• Driver Management System on ARC EV Processor IP with Visidon
• Neural Network-Enhanced Radar Processing on ARC VPX5 DSP with SensorCortek
Resources
arXiv.org: https://arxiv.org/abs/1706.03762
ARC NPX6 NPU IP: www.synopsys.com/arc