Transformer & BERT
Models for long sequence
How to model long sequence (LSTM)
From: https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714
How to model long sequence (CNN)
From: http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
How to model long sequence (CNN)
Convolutional Sequence to Sequence Learning
Neural Machine Translation of Rare Words with Subword Units
Google's Neural Machine Translation System
Seq2seq
From: https://github.com/farizrahman4u/seq2seq
Attention Mechanism
NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE
Transformer (Q, K, V)
From: http://jalammar.github.io/illustrated-transformer/
From: http://jalammar.github.io/illustrated-transformer/
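The Q, K, V computation pictured on these slides can be sketched in a few lines. This is a minimal pure-Python illustration of scaled dot-product attention for a single head, not the deck's or the paper's code; the function names `softmax` and `attention` and the list-of-lists layout are assumptions made for the example.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors (hypothetical layout for this sketch).
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: dot(q, k) scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output row is the attention-weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two identical keys receive equal weight, so the output is the mean of V.
result = attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
print(result)
```

Each output row is a convex combination of the value rows, so it always lies between the smallest and largest value in each column.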
Why divide by sqrt(d_k)?
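One way to see the motivation: if the components of q and k are independent with mean 0 and variance 1, their dot product has variance d_k, so large d_k pushes the softmax into regions with very small gradients. Dividing by sqrt(d_k) brings the variance back to about 1. The sketch below checks this empirically; the function name `dot_variance`, the sample count, and the seed are assumptions chosen for illustration.

```python
import math
import random
import statistics

def dot_variance(d_k, samples=2000, seed=0):
    """Estimate the variance of q.k before and after scaling by sqrt(d_k)."""
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        # q and k have independent standard-normal components.
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    raw = statistics.variance(dots)
    scaled = statistics.variance([d / math.sqrt(d_k) for d in dots])
    return raw, scaled

raw_var, scaled_var = dot_variance(64)
print(raw_var, scaled_var)  # raw is close to d_k = 64, scaled is close to 1
```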
What about order?
From: http://jalammar.github.io/illustrated-transformer/
From: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
From: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
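Attention by itself ignores token order, so the Transformer adds a positional encoding to each input embedding. A minimal sketch of the sinusoidal encoding from "Attention Is All You Need", where even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cos, could look like this; the function name `positional_encoding` is an assumption for the example.

```python
import math

def positional_encoding(pos, d_model=512):
    """Sinusoidal positional encoding for one position.

    Assumes d_model is even; dimensions 2i and 2i+1 share one frequency.
    """
    pe = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, so i / d_model equals 2i'/d_model
        # in the paper's notation.
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))  # even dimension
        pe.append(math.cos(angle))  # odd dimension
    return pe

pe0 = positional_encoding(0)
pe1 = positional_encoding(1)
print(len(pe0), pe0[:4], pe1[:2])
```

The resulting vector is added element-wise to the token embedding at that position.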
Transformer (parameters)
 Multi-Head-Attention: (512 * 64 * 3 * 8) + (8 * 64 * 512)
 Feed-Forward: (512*2048) + 2048 + (2048 * 512) + 512
 Last-Linear-Layer: (512 * 37000)
 Total: Multi-Head-Attention * 3 * 6 + Feed-Forward * 2 * 6 + Last-Linear-Layer = 63 * 1e6
((512*64*3*8)+(8*64*512)) * 18 + ((512*2048)+(2048*512)+2048+512) * 12 + 512 * 37000
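The expanded expression above can be checked directly. The sketch below simply evaluates it, with d_model = 512, 8 heads of size 64, a 2048-wide feed-forward layer, and a 37000-entry output vocabulary as in the expanded line; the variable names are assumptions for the example. The count follows the slide in leaving out embeddings, layer normalization, and attention biases.

```python
d_model, d_head, heads, d_ff, vocab = 512, 64, 8, 2048, 37000

# Q, K, V projections for all heads plus the output projection.
multi_head_attention = (d_model * d_head * 3 * heads) + (heads * d_head * d_model)

# Two linear layers with biases.
feed_forward = (d_model * d_ff) + d_ff + (d_ff * d_model) + d_model

# Final projection onto the vocabulary.
last_linear = d_model * vocab

# 6 encoder layers with 1 attention block each, 6 decoder layers with 2 each
# (3 * 6 = 18); one feed-forward block per layer (2 * 6 = 12).
total_params = multi_head_attention * 3 * 6 + feed_forward * 2 * 6 + last_linear
print(total_params)
```

This evaluates to 63,014,912, which matches the slide's figure of roughly 63 * 1e6.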
Transformer (FLOPS per token)
 Multi-Head-Attention: ((512+511)*64)*3*8+((512+511)*512)
 Feed-Forward: ((512+511)*2048)+2048+((2048+2047)*512)+512
 Last-Linear-Layer: ((512+511)*370000)+370000
 Total: Multi-Head-Attention * 3 * 6 + Feed-Forward * 2 * 6 + Last-Linear-Layer = 467MFLOPS
(((512+511)*64)*3*8+((512+511)*512))*18+(((512+511)*2048)+2048+((2048+2047)*512)+512)*12+((512+511)*370000)+370000
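The FLOPs expression can be evaluated the same way. Here (512+511) counts a length-512 dot product as 512 multiplications plus 511 additions. The sketch below evaluates the expression as written, which uses 370000 outputs for the last linear layer; the expanded parameter expression uses 37000 instead, so the second total shows what the same formula gives under that assumption. The function and variable names are hypothetical.

```python
def flops_per_token(vocab):
    """Evaluate the slide's per-token FLOPs formula for a given vocabulary size."""
    dot512 = 512 + 511      # multiply-adds for a length-512 dot product
    dot2048 = 2048 + 2047   # multiply-adds for a length-2048 dot product
    multi_head_attention = (dot512 * 64) * 3 * 8 + (dot512 * 512)
    feed_forward = (dot512 * 2048) + 2048 + (dot2048 * 512) + 512
    last_linear = (dot512 * vocab) + vocab
    return multi_head_attention * 3 * 6 + feed_forward * 2 * 6 + last_linear

as_written = flops_per_token(370000)  # value used in the expression above
with_37000 = flops_per_token(37000)   # assumption: vocabulary of 37000
print(as_written, with_37000)
```

As written, the total is 466,923,520 (about 467 MFLOPs); with a 37000-entry vocabulary it would be 125,931,520 (about 126 MFLOPs).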
Picture from: https://www.alamy.com/stock-photo-cookie-monster-ernie-elmo-bert-grover-sesame-street-1969-30921023.html
ELMO
BERT
ERNIE
From: https://arxiv.org/pdf/1810.04805.pdf
BERT (Origin)
BERT (embedding)
From: https://arxiv.org/pdf/1810.04805.pdf
BERT (training tasks)
 Masked Language Model: replace randomly chosen words with the [MASK] token and predict the original words
 Next Sentence Prediction
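The masking step can be sketched as follows. The BERT paper selects 15% of token positions and, for each selected position, uses [MASK] 80% of the time, a random token 10% of the time, and the unchanged token 10% of the time. The function name `mask_tokens`, the toy vocabulary, and the seed are assumptions for illustration; this is not BERT's actual data pipeline.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modelling.

    labels[i] holds the original token at selected positions, else None.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)           # the model must predict this token
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)       # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)          # not a prediction target
    return inputs, labels

tokens = "my dog is hairy and my cat is not hairy at all".split()
vocab = sorted(set(tokens))              # hypothetical toy vocabulary
inputs, labels = mask_tokens(tokens, vocab)
print(inputs)
print(labels)
```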
BERT
 BERT-base: L=12, H=768, A=12, Total Parameters: 110M
 Batch size: 256 sequences (256 sequences * 512 tokens ≈ 128000 tokens/batch), for 1M steps. 128000 tokens * 467 MFLOPs per token ≈ 60 TFLOPs per batch
 Training BERT-base on 4 Cloud TPUs in Pod configuration (16 TPU chips total) took 4 days to complete
 Conclusion
 Space: 440MB + 393MB = 833MB
 Speed: 173 TFLOPS (60 TFLOPs per batch * 1M steps / 4 days)
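The conclusion figures follow from simple arithmetic, sketched below. The 440MB term matches 110M parameters at 4 bytes each. The 393MB term is consistent with 128000 tokens * 768 hidden units * 4 bytes, but the slide does not state where it comes from, so that reading is an assumption. The variable names are hypothetical.

```python
tokens_per_batch = 128000
flops_per_token = 467e6          # per-token estimate from the earlier slide
steps = 1_000_000
train_seconds = 4 * 24 * 3600    # 4 days

flops_per_batch = tokens_per_batch * flops_per_token    # about 60 TFLOPs
sustained_flops = flops_per_batch * steps / train_seconds
print(round(flops_per_batch / 1e12), round(sustained_flops / 1e12))

param_bytes = 110_000_000 * 4                  # 110M float32 parameters
activation_bytes = tokens_per_batch * 768 * 4  # assumption, see lead-in
print(param_bytes // 10**6, activation_bytes // 10**6)
```

Under these assumptions the sustained rate comes out at about 173 TFLOPS and the two memory terms at 440 MB and 393 MB.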
From Paper: Practice on Long Sequential User Behavior Modeling for Click-Through Rate Prediction
Some thoughts
 All operations are matrix add/multiply (plus a small amount of sin/cos/exp)
 More hardware-friendly Model
 Big Op (automatically)
 Transformer + NTM
