Effective Machine Learning with TPU
ATHUL K S
ROLL NO. 16
S7 B.Tech CSE
Guided by : Ms. Bintu Balachandran (Assistant Professor, CSE Dept)
Government Engineering College, Wayanad
November 18, 2019
Overview
1 INTRODUCTION
2 LITERATURE SURVEY
3 GOALS AND OBJECTIVES
4 EVOLUTION
5 ARCHITECTURE
6 COMPARISON
7 PERFORMANCE ANALYSIS
8 CONCLUSION
9 REFERENCES
INTRODUCTION
Introduction
Major ML progress over the past several years.
There has been a tremendous amount of machine learning progress over the
past several years. The number of research publications on arXiv is
growing faster than Moore's law: roughly 50 new machine learning papers
appear every day, alongside tremendous gains in accuracy.
ATHUL K S (GECW) Effective Machine Learning with TPU 3 / 31
INTRODUCTION
ML arXiv papers per year
Figure: 1. Papers published
INTRODUCTION
Rapid Accuracy Improvement
Figure: 2. Accuracy in Machine Learning
INTRODUCTION
Challenge: Increasing Complexity and Computational Cost
As these models get more accurate, they tend to be larger and trained on larger data sets, which means more computation is needed both to train the model and, eventually, to run it.
Figure: 3. Computational cost vs. accuracy
LITERATURE SURVEY
Literature Survey
Custom ASICs were deployed in the SPERT-II workstation to accelerate
both NN training and inference for speech recognition. The eight-lane
vector unit could produce up to sixteen 32-bit arithmetic results per
clock cycle based on 8-bit and 16-bit inputs, making it 25 times faster
at inference and 20 times faster at training than a SPARC-20
workstation. They found that 16 bits were insufficient for training, so
they used two 16-bit words instead, which doubled training time.
The Synapse-1 system was based upon a custom systolic
multiply-accumulate chip called the MA-16, which performed sixteen
16-bit multiplies at a time. The system concatenated several MA-16
chips together and had custom hardware to do activation functions.
GOALS AND OBJECTIVES
Goals and Objectives
Improve accuracy and speed in the machine learning domain; improve
cost-performance by 10X over GPUs.
Develop domain-specific hardware for machine learning.
Run whole inference models on a coprocessor to reduce interactions
with the host CPU, while remaining flexible enough to match the needs of the NNs.
EVOLUTION
Evolution
TPUv1 : Google's first Tensor Processing Unit (TPU)
Figure: 4. TPUv1
EVOLUTION
Evolution
TPUv2 : Cloud TPU
Figure: 5. TPUv2
EVOLUTION
Evolution
Cloud TPUv2 pod
Figure: 6. Cloud TPUv2 pod
EVOLUTION
Evolution
The TPUv3 chip runs so hot that, for the first time, Google has introduced liquid cooling
in its data centers.
A TPUv3 pod is eight times more powerful than a TPUv2 pod.
(a) TPUv3 pod (b) TPUv3 Chip
Figure: 7. TPUv3
EVOLUTION
Relentless Effort
TPUv1 (2015): 92 teraops, inference only
Cloud TPU (2017): 180 TeraFlops, 64 GB HBM, training and inference
TPU pod (2017): 11.5 PetaFlops, 4 TB HBM, 2D toroidal mesh network, training and inference
EVOLUTION
Relentless Effort
TPUv3 (2018): 420 teraflops, 128 GB HBM, new chip architecture
TPUv3 pod (2018): >100 petaflops, more than 8x the performance of a TPUv2 pod
ARCHITECTURE
Architecture: TPUv1
The matrix unit: 65,536 MACs (256 x 256)
700 MHz clock rate
Peak of 92 teraops per second
4 MiB of on-chip accumulator memory
24 MiB of on-chip Unified Buffer (activation memory)
3.5X as much on-chip memory vs. GPU
Two 2133 MHz DDR3 channels
8 GiB of off-chip weight DRAM memory
CISC instruction set
Figure: 8. Schematic diagram of TPUv1
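The peak-throughput figure above follows directly from the matrix unit's size and clock rate. A quick back-of-the-envelope check in Python (assuming the usual convention of counting a multiply-accumulate as two operations) reproduces the quoted ~92 teraops:

```python
# Peak throughput of TPUv1 from the slide's own figures.
# Assumption: each MAC cell does one multiply and one add per cycle (2 ops).
macs = 256 * 256                # 65,536 multiply-accumulate cells
ops_per_cycle = 2 * macs        # multiply + add, per clock
clock_hz = 700e6                # 700 MHz clock rate
peak_teraops = ops_per_cycle * clock_hz / 1e12
print(round(peak_teraops, 1))   # 91.8, i.e. the quoted ~92 teraops/second
```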
ARCHITECTURE
Architecture
TPU is designed to be a coprocessor on the PCIe I/O bus, allowing it
to plug into existing servers just as a GPU does.
The host server sends TPU instructions for it to execute rather than
the TPU fetching them itself. Hence, the TPU is closer in spirit to an
FPU (floating-point unit) coprocessor than it is to a GPU.
The TPU instructions are sent from the host over the PCIe Gen3 x16
bus into an instruction buffer. The internal blocks are typically
connected together by 256-byte-wide paths.
ARCHITECTURE
Instruction Set
Mathematical Operations required for Neural Network Inference
Instructions for Neural Network Inference
TPU Instruction           Function
Read Host Memory          Read data from host memory
Read Weights              Read weights from memory
MatrixMultiply/Convolve   Multiply or convolve the data with the weights, accumulate the result
Activate                  Apply activation function
Write Host Memory         Write result to host memory
Table: 1. Instructions and functions
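To make the control flow concrete, here is a hypothetical host-side sketch that mirrors the five instructions in Table 1, one comment per instruction. The function name and the choice of ReLU as the activation are illustrative assumptions, not the real TPU driver API:

```python
import numpy as np

def run_inference(host_data, weights, activation=lambda z: np.maximum(z, 0.0)):
    """Mirror the five-instruction inference sequence from Table 1."""
    x = np.asarray(host_data, dtype=float)   # Read Host Memory
    w = np.asarray(weights, dtype=float)     # Read Weights
    acc = x @ w                              # MatrixMultiply/Convolve, accumulate
    y = activation(acc)                      # Activate (here: ReLU)
    return y.tolist()                        # Write Host Memory

print(run_inference([[1.0, -2.0]], [[1.0, 0.0], [0.0, 1.0]]))  # [[1.0, 0.0]]
```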
ARCHITECTURE
Architecture
It uses a 4-stage pipeline for these CISC instructions, where each
instruction executes in a separate stage.
Because reading a large SRAM uses much more power than arithmetic, the
matrix unit uses systolic execution to save energy by reducing reads
and writes of the Unified Buffer.
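The energy argument can be illustrated with a toy model: in a systolic multiply-accumulate, each operand is read from the buffer once and partial sums accumulate inside the array, instead of being written back after every step. A minimal Python sketch (a functional model only; a real systolic array pipelines these steps across cells on every clock):

```python
import numpy as np

def systolic_matmul(a, b):
    """Toy model of systolic matmul: each element of `a` and `b` is read once,
    and partial sums stay in a local accumulator instead of round-tripping
    through a unified buffer after every multiply."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m))
    for i in range(n):
        acc = np.zeros(m)             # partial sums live "inside the array"
        for t in range(k):
            acc += a[i, t] * b[t, :]  # one multiply-accumulate wavefront
        out[i] = acc                  # single write-back per output row
    return out
```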
ARCHITECTURE
Architecture : Cloud TPU
Figure: 9. Cloud TPU
180 TeraFlops of computation
64 GB of HBM memory
2400 GB/s memory bandwidth
Designed to be connected together into larger configurations
CISC instruction set
ARCHITECTURE
Architecture : Cloud TPU
16 GB of HBM per chip
600 GB/s memory bandwidth
Vector units: 32-bit float
MXU: 32-bit float accumulation, but reduced precision for the multipliers
45 TFLOPS per chip
Figure: 10. Cloud TPU chip layout
ARCHITECTURE
Architecture : Cloud TPU
22.5 TFLOPS per core
2 cores per chip
4 chips per 180 TFLOP Cloud TPU
Scalar unit
Vector unit
Matrix unit
Mostly float32
Figure: 11. Cloud TPU core
ARCHITECTURE
Architecture : Cloud TPU
Figure: 12. Cloud TPU chip layout
Matrix Unit (MXU): 128 x 128 systolic array
bfloat16 multiplies
float32 accumulate
Figure: 13. bfloat16 - Brain Floating Point Format
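bfloat16 is simply the top 16 bits of an IEEE float32: one sign bit, the same 8 exponent bits, and 7 mantissa bits, so it keeps float32's dynamic range while giving up precision. A small Python sketch of a (truncating) conversion; real hardware may round rather than truncate:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Truncate a float32 to bfloat16 by dropping the low 16 bits
    (sign + 8 exponent bits + 7 mantissa bits survive)."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

print(to_bfloat16(3.14159))   # 3.140625 -- only ~3 significant decimal digits
print(to_bfloat16(1e38))      # still finite: exponent range matches float32
```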
COMPARISON
CPU Vs GPU Vs TPU
CPU (central processing unit):
Low latency
Low throughput
Scalar operations
GPU (graphics processing unit):
High latency
High throughput (matrix multiplication)
Vector operations
TPU (Tensor Processing Unit):
Systolic matrix multiplication
Very high throughput
bfloat16
The TPU delivers 15X to 30X more throughput
than contemporary CPUs and GPUs.
COMPARISON
Comparing CPU, GPU and TPU
Figure: 14. Performance / watt, relative to contemporary CPUs and GPUs (Incremental,
weighted mean) (in log scale)
COMPARISON
Comparing CPU, GPU and TPU
Figure: 15. Throughput under 7 ms latency limit (in log scale)
PERFORMANCE ANALYSIS
Performance Analysis
(a) Cloud TPU (b) Cloud TPU pod
Figure: 16. Performance Analysis
PERFORMANCE ANALYSIS
Scaling
(a) Performance (b) Accuracy
Figure: 17. Scaling
CONCLUSION
Conclusion
A user can mix and match the different components and use custom
virtual machines of any shape. The system can be scaled up and down
quickly; there is no need to order hardware or build a machine.
Making TPUs widely available in the cloud gives users a dial: they can
prototype on an inexpensive single device, then increase the batch
size without any other code changes, and training time drops from
hours to minutes.
Bibliography
References I
[1] Cliff Young, David Patterson, and Kaz Sato.
An in-depth look at Google's first Tensor Processing Unit (TPU).
https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-
first-tensor-processing-unit-tpu,
2017.
[2] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav
Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden,
Al Borchers, et al.
In-datacenter performance analysis of a tensor processing unit.
In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International
Symposium on, pages 1–12. IEEE, 2017.
Bibliography
References II
[3] Kaz Sato.
What makes TPUs fine-tuned for deep learning?
https://cloud.google.com/blog/products/ai-machine-learning/what-makes-
tpus-fine-tuned-for-deep-learning,
2018.
[4] Sachin Kelkar, Chetanya Rastogi, Sparsh Gupta, and GN Pillai.
SqueezeGAN: Image to image translation with minimum parameters.
In 2018 International Joint Conference on Neural Networks (IJCNN), pages
1–6. IEEE, 2018.
[5] Tsung-Yi Lin, Piotr Dollár, Ross B Girshick, Kaiming He, Bharath Hariharan,
and Serge J Belongie.
Feature pyramid networks for object detection.
In CVPR, volume 1, page 4, 2017.
Bibliography
References III
[6] Mark Nutter, Catherine H Crawford, and Jorge Ortiz.
Design of novel deep learning models for real-time human activity recognition
with mobile phones.
In 2018 International Joint Conference on Neural Networks (IJCNN), pages
1–8. IEEE, 2018.
[7] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi.
Inception-v4, Inception-ResNet and the impact of residual connections on
learning.
In AAAI, volume 4, page 12, 2017.
[8] Yoshi Tamura, Andrew Jackson, and Zak Stone.
Cloud TPUs in Kubernetes Engine powering Minigo are now available in beta.
https://cloud.google.com/blog/products/ai-machine-learning/cloud-tpus-in-
kubernetes-engine-powering-minigo-are-now-available-in-beta,
2018.
