Effective Machine Learning with TPU
ATHUL K S
ROLL NO. 16
S7 B.Tech CSE
Guided by : Ms. Bintu Balachandran (Assistant Professor, CSE Dept)
Government Engineering College, Wayanad
November 18, 2019
Overview
1 INTRODUCTION
2 LITERATURE SURVEY
3 GOALS AND OBJECTIVES
4 EVOLUTION
5 ARCHITECTURE
6 COMPARISON
7 PERFORMANCE ANALYSIS
8 CONCLUSION
9 REFERENCES
INTRODUCTION
Introduction
Major ML progress over the past several years.
There has been a tremendous amount of machine learning progress over the
past several years. The number of research publications on arXiv is
growing faster than Moore's law: roughly 50 new machine learning papers
appear every day, alongside tremendous gains in accuracy.
ATHUL K S (GECW) Effective Machine Learning with TPU 3 / 31
INTRODUCTION
ML arXiv papers per year
Figure: 1. Papers published
INTRODUCTION
Rapid Accuracy Improvement
Figure: 2. Accuracy in Machine Learning
INTRODUCTION
Challenge: Increasing Complexity and Computational Cost
As these models get more accurate, they tend to be larger and trained on larger data sets, which means more computation is needed both to train the model and, eventually, to run it.
Figure: 3. Computational cost vs. accuracy
LITERATURE SURVEY
Literature Survey
Custom ASICs were deployed in the SPERT-II workstation to accelerate
both NN training and inference for speech recognition. The eight-lane
vector unit could produce up to sixteen 32-bit arithmetic results per
clock cycle based on 8-bit and 16-bit inputs, making it 25 times faster
at inference and 20 times faster at training than a SPARC-20
workstation. They found that 16 bits were insufficient for training, so
they used two 16-bit words instead, which doubled training time.
The Synapse-1 system was based upon a custom systolic
multiply-accumulate chip called the MA-16, which performed sixteen
16-bit multiplies at a time. The system concatenated several MA-16
chips together and had custom hardware to do activation functions.
GOALS AND OBJECTIVES
Goals and Objectives
Improve accuracy and speed in the machine learning domain; improve
cost-performance by 10X over GPUs.
Develop domain-specific hardware for machine learning.
Run whole inference models on a coprocessor to reduce interactions
with the host CPU, while remaining flexible enough to match the needs of the NNs.
EVOLUTION
Evolution
TPUv1 : Google's first Tensor Processing Unit (TPU)
Figure: 4. TPUv1
EVOLUTION
Evolution
TPUv2 : Cloud TPU
Figure: 5. TPUv2
EVOLUTION
Evolution
Cloud TPUv2 pod
Figure: 6. Cloud TPUv2 pod
EVOLUTION
Evolution
The TPUv3 chip runs so hot that, for the first time, Google has introduced liquid cooling
in its data centers.
A TPUv3 pod is eight times more powerful than a TPUv2 pod.
(a) TPUv3 pod (b) TPUv3 Chip
Figure: 7. TPUv3
EVOLUTION
Relentless Effort
TPUv1 (2015): 92 teraops, inference only
Cloud TPU (2017): 180 TeraFlops, 64 GB HBM, training and inference
TPU pod (2017): 11.5 PetaFlops, 4 TB HBM, 2D toroidal mesh network, training and inference
EVOLUTION
Relentless Effort
TPUv3 (2018): 420 teraflops, 128 GB HBM, new chip architecture
TPUv3 pod (2018): >100 petaflops, more than 8x the performance of a TPUv2 pod
ARCHITECTURE
Architecture: TPUv1
The matrix unit: 65,536 MACs (256 x 256)
700 MHz clock rate
Peak of 92 teraops per second
4 MiB of on-chip accumulator memory
24 MiB of on-chip Unified Buffer (activation memory)
3.5X as much on-chip memory vs. GPU
Two 2133 MHz DDR3 channels
8 GiB of off-chip weight DRAM memory
CISC instruction set
Figure: 8. Schematic diagram of TPUv1
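The peak-throughput figure above follows directly from the matrix unit's size and clock rate. A quick back-of-the-envelope check in Python (assuming the usual convention of counting a multiply-accumulate as two operations) reproduces the quoted ~92 teraops:

```python
# Peak throughput of TPUv1 from the slide's own figures.
# Assumption: each MAC cell does one multiply and one add per cycle (2 ops).
macs = 256 * 256                # 65,536 multiply-accumulate cells
ops_per_cycle = 2 * macs        # multiply + add, per clock
clock_hz = 700e6                # 700 MHz clock rate
peak_teraops = ops_per_cycle * clock_hz / 1e12
print(round(peak_teraops, 1))   # 91.8, i.e. the quoted ~92 teraops/second
```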
ARCHITECTURE
Architecture
TPU is designed to be a coprocessor on the PCIe I/O bus, allowing it
to plug into existing servers just as a GPU does.
The host server sends TPU instructions for it to execute rather than
the TPU fetching them itself. Hence, the TPU is closer in spirit to an
FPU (floating-point unit) coprocessor than it is to a GPU.
The TPU instructions are sent from the host over the PCIe Gen3 x16
bus into an instruction buffer. The internal blocks are typically
connected together by 256-byte-wide paths.
ARCHITECTURE
Instruction Set
Mathematical Operations required for Neural Network Inference
Instructions for Neural Network Inference
TPU Instruction           Function
Read Host Memory          Read data from host memory
Read Weights              Read weights from memory
MatrixMultiply/Convolve   Multiply or convolve the data with the weights, accumulate the result
Activate                  Apply activation function
Write Host Memory         Write result to host memory
Table: 1. Instructions and functions
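To make the control flow concrete, here is a hypothetical host-side sketch that mirrors the five instructions in Table 1, one comment per instruction. The function name and the choice of ReLU as the activation are illustrative assumptions, not the real TPU driver API:

```python
import numpy as np

def run_inference(host_data, weights, activation=lambda z: np.maximum(z, 0.0)):
    """Mirror the five-instruction inference sequence from Table 1."""
    x = np.asarray(host_data, dtype=float)   # Read Host Memory
    w = np.asarray(weights, dtype=float)     # Read Weights
    acc = x @ w                              # MatrixMultiply/Convolve, accumulate
    y = activation(acc)                      # Activate (here: ReLU)
    return y.tolist()                        # Write Host Memory

print(run_inference([[1.0, -2.0]], [[1.0, 0.0], [0.0, 1.0]]))  # [[1.0, 0.0]]
```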
ARCHITECTURE
Architecture
It uses a 4-stage pipeline for these CISC instructions, where each
instruction executes in a separate stage.
Because reading a large SRAM uses much more power than arithmetic, the
matrix unit uses systolic execution to save energy by reducing reads
and writes of the Unified Buffer.
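The energy argument can be illustrated with a toy model: in a systolic multiply-accumulate, each operand is read from the buffer once and partial sums accumulate inside the array, instead of being written back after every step. A minimal Python sketch (a functional model only; a real systolic array pipelines these steps across cells on every clock):

```python
import numpy as np

def systolic_matmul(a, b):
    """Toy model of systolic matmul: each element of `a` and `b` is read once,
    and partial sums stay in a local accumulator instead of round-tripping
    through a unified buffer after every multiply."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m))
    for i in range(n):
        acc = np.zeros(m)             # partial sums live "inside the array"
        for t in range(k):
            acc += a[i, t] * b[t, :]  # one multiply-accumulate wavefront
        out[i] = acc                  # single write-back per output row
    return out
```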
ARCHITECTURE
Architecture : Cloud TPU
Figure: 9. Cloud TPU
180 TeraFlops of computation
64 GB of HBM memory
2400 GB/s memory bandwidth
Designed to be connected together into larger configurations
CISC instruction set
ARCHITECTURE
Architecture : Cloud TPU
16 GB of HBM per chip
600 GB/s memory bandwidth
Vector units: 32-bit float
MXU: 32-bit float accumulation, but reduced precision for the multipliers
45 TFLOPS per chip
Figure: 10. Cloud TPU chip layout
ARCHITECTURE
Architecture : Cloud TPU
22.5 TFLOPS per core
2 cores per chip
4 chips per 180 TFLOP Cloud TPU
Scalar unit
Vector unit
Matrix unit
Mostly float32
Figure: 11. Cloud TPU core
ARCHITECTURE
Architecture : Cloud TPU
Figure: 12. Cloud TPU chip layout
Matrix Unit (MXU): 128 x 128 systolic array
bfloat16 multiplies
float32 accumulate
Figure: 13. bfloat16 - Brain Floating Point Format
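bfloat16 is simply the top 16 bits of an IEEE float32: one sign bit, the same 8 exponent bits, and 7 mantissa bits, so it keeps float32's dynamic range while giving up precision. A small Python sketch of a (truncating) conversion; real hardware may round rather than truncate:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Truncate a float32 to bfloat16 by dropping the low 16 bits
    (sign + 8 exponent bits + 7 mantissa bits survive)."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

print(to_bfloat16(3.14159))   # 3.140625 -- only ~3 significant decimal digits
print(to_bfloat16(1e38))      # still finite: exponent range matches float32
```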
COMPARISON
CPU Vs GPU Vs TPU
CPU (central processing unit):
Low latency
Low throughput
Scalar operations
GPU (graphics processing unit):
High latency
High throughput (matrix multiplication)
Vector operations
TPU (Tensor Processing Unit):
Systolic matrix multiplication
Very high throughput
bfloat16
The TPU delivers 15X to 30X more throughput
than contemporary CPUs and GPUs.
COMPARISON
Comparing CPU, GPU and TPU
Figure: 14. Performance / watt, relative to contemporary CPUs and GPUs (Incremental,
weighted mean) (in log scale)
COMPARISON
Comparing CPU, GPU and TPU
Figure: 15. Throughput under 7 ms latency limit (in log scale)
PERFORMANCE ANALYSIS
Performance Analysis
(a) Cloud TPU (b) Cloud TPU pod
Figure: 16. Performance Analysis
PERFORMANCE ANALYSIS
Scaling
(a) Performance (b) Accuracy
Figure: 17. Scaling
CONCLUSION
Conclusion
A user can mix and match the different components and use custom
virtual machines of any shape. The system can be scaled up and down
quickly; there is no need to order hardware or build a machine.
Making TPUs widely available in the cloud gives users a dial: they can
prototype on an inexpensive single device, then increase the batch
size without any other code changes, and training time drops from
hours to minutes.
Bibliography
References I
[1] Cliff Young, David Patterson, and Kaz Sato.
An in-depth look at Google's first Tensor Processing Unit (TPU).
https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-
first-tensor-processing-unit-tpu,
2017.
[2] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav
Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden,
Al Borchers, et al.
In-datacenter performance analysis of a tensor processing unit.
In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International
Symposium on, pages 1–12. IEEE, 2017.
Bibliography
References II
[3] Kaz Sato.
What makes TPUs fine-tuned for deep learning?
https://cloud.google.com/blog/products/ai-machine-learning/what-makes-
tpus-fine-tuned-for-deep-learning,
2018.
[4] Sachin Kelkar, Chetanya Rastogi, Sparsh Gupta, and GN Pillai.
SqueezeGAN: Image to image translation with minimum parameters.
In 2018 International Joint Conference on Neural Networks (IJCNN), pages
1–6. IEEE, 2018.
[5] Tsung-Yi Lin, Piotr Dollár, Ross B Girshick, Kaiming He, Bharath Hariharan,
and Serge J Belongie.
Feature pyramid networks for object detection.
In CVPR, volume 1, page 4, 2017.
Bibliography
References III
[6] Mark Nutter, Catherine H Crawford, and Jorge Ortiz.
Design of novel deep learning models for real-time human activity recognition
with mobile phones.
In 2018 International Joint Conference on Neural Networks (IJCNN), pages
1–8. IEEE, 2018.
[7] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi.
Inception-v4, Inception-ResNet and the impact of residual connections on
learning.
In AAAI, volume 4, page 12, 2017.
[8] Yoshi Tamura, Andrew Jackson, and Zak Stone.
Cloud TPUs in Kubernetes Engine powering Minigo are now available in beta.
https://cloud.google.com/blog/products/ai-machine-learning/cloud-tpus-in-
kubernetes-engine-powering-minigo-are-now-available-in-beta,
2018.
