TensorFlow for HPC?
Peter Braam
peter@braam.io
me
1980 1993 1997 2002 2013
pure math & th. physics @oxford
cs @cmu
@5 startups & @3 big acquirers - Lustre
SKA @cambridge
worked with 100s of the largest compute centers and
virtually all major system & CPU/GPU vendors
Math / ML /
Astrophysics
@flatiron Institute
2018
Origin of this talk
I worked extensively on HPC infrastructure for the SKA telescope.
Through coincidence I was offered a generous visit to CERN and asked to
explain some of my thoughts to the HEP ML community.
I decided to offer the HEP ML community a “systems perspective” on
TensorFlow, and I came away highly impressed by this platform’s history
and promise.
Why talk about TF?
Very widely used, gained much ground on other packages
Has unmatched flexibility for deployment
Achieves very high performance
Systems Engineering Masterpiece
Best-of-breed specialists involved from multiple domains
Door Opener for new xPU design
Domain specific computation infrastructure template
Why did Google do this?
Google’s AI could mean doubling their data centres (modest use)
100s of projects will pursue ML: development productivity is central
Google released TensorFlow in 2015 (a 2nd design, following DistBelief).
TensorFlow’s scope is profound: language, compiler, chips, tools, devops
One of the most impressive software-systems-hardware projects I’ve seen
Character of mega projects ... ($10^8-$10^10)
TensorFlow
Google realized it would massively develop
ML-driven applications; even modest use would
require a twofold expansion of its data centers.
Challenge:
● high-productivity software development
● portable deployment from phones to
massive clusters
● lowest cost/performance ratio
SKA Telescope
SKA is deploying a massive new radio
telescope. It needs to provide usable science
data products (i.e. images) for astrophysicists,
using algorithms that might need adaptation.
Challenge:
● Understand required compute systems
● Flexible development and runtime
environment
● Meet energy and financial budgets
Co-design
[Diagram: experts, IT suppliers, users, and executives
all feed into the architecture.]
TensorFlow
components
High-level APIs like Keras allow rapid prototyping of ML models (this
is largely maths, not programming). Automatic differentiation.
Debugging and profiling tools exist, such as TensorBoard and a data
flow debugger.
One code base can be used for development, training, evaluation,
inference and snapshotting, and runs on everything from mobile devices
to specialized large clusters.
ML focus
this is frequently discussed
Language, execution
platform and optimization
Compiler and Chips (TPU)
Devops
The architecture reflects a
strong separation of
concerns
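As a minimal sketch of the rapid-prototyping and automatic-differentiation claims above (a hypothetical toy model, not taken from the slides):

```python
import tensorflow as tf

# A tiny Keras model: defining it is mostly maths, not programming.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])

x = tf.random.normal([16, 4])
y = tf.random.normal([16, 1])

# Automatic differentiation: TensorFlow records the forward pass
# and derives a gradient for every trainable variable.
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))
grads = tape.gradient(loss, model.trainable_variables)

# Two Dense layers, each with a kernel and a bias: four gradients.
print(len(grads))  # 4
```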
TensorFlow Core
y = Ax, a TF operation: A (N×M matrix) and x (M vector)
are inputs or “feeds”; y (N vector) is the output or “fetch”.
Data Flow model with extremely rich features.
Expressions in programming languages define
data flow graphs from call graphs and arguments
TF treats graphs declaratively, i.e. they are
defined first and executed later.
TF Graphs can be automatically split for
distributed execution on multiple devices.
TF operations reflect domain-specific
aspects found throughout TensorFlow
TF execution
[Diagram: a small graph (Matmul of w and x feeding an Add with b).
The user feeds inputs and fetches outputs; TensorFlow performs the
operations and trains the variables; tensors flow along the edges.]
Define graph
Start one or more sessions
The session executes the graph:
- recursively check which graph nodes the
fetch (output) depends on
- apply lots of optimizations
- execute dependencies in parallel
- this is called lazy evaluation, and it
enables parallelism
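The define-then-run model above can be sketched with the classic session API (reached via tf.compat.v1 in current TensorFlow releases), using the y = Ax example from the earlier slide:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # use the graph/session model

# Define the graph declaratively: nothing executes yet.
A = tf.compat.v1.placeholder(tf.float32, shape=(2, 3), name="A")  # feed
x = tf.compat.v1.placeholder(tf.float32, shape=(3, 1), name="x")  # feed
y = tf.linalg.matmul(A, x, name="y")                              # fetch

# The session walks the dependencies of the fetch and executes them.
with tf.compat.v1.Session() as sess:
    result = sess.run(y, feed_dict={A: [[1, 0, 0], [0, 1, 0]],
                                    x: [[5], [6], [7]]})
print(result.tolist())  # [[5.0], [6.0]]
```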
Execution Framework Challenges
➔ run on distributed systems and on many architectures (including the TPU)
◆ split graphs, feed data to remote architectures, control architecture
◆ understand the data: inactive and identity nodes, mapping constants, tensor dimensions
➔ create code for different architectures, optimized for scalable clusters and
for handheld devices. Compiler offers
◆ JIT: just in time (during execution) to take full advantage of the sizes of the tensors
◆ AOT: ahead of time to create a standalone binary
➔ optimizations:
◆ tiling sizes, threading, data alignment, padding, minimizing communications,
adapting queue lengths
The TF framework itself, but particularly the XLA compiler, makes this
transparent.
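A sketch of the JIT path (assuming TensorFlow 2.x, where jit_compile=True asks XLA to compile the traced function for the concrete tensor shapes it sees):

```python
import tensorflow as tf

# JIT: XLA compiles this function at trace time, specializing to the
# concrete shapes, fusing the matmul and add, and choosing tilings.
@tf.function(jit_compile=True)
def affine(w, x, b):
    return tf.matmul(x, w) + b

w = tf.ones([3, 2])
x = tf.constant([[1.0, 2.0, 3.0]])
b = tf.zeros([2])
print(affine(w, x, b).numpy().tolist())  # [[6.0, 6.0]]
```

The AOT path mentioned above is served by a separate tool (tfcompile) that emits a standalone binary for a fixed graph.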
Graph modifications for distributed execution
[Diagram: a graph on node 0 is split across xPU0 and xPU1 on two
nodes; TensorFlow inserts send/recv operation pairs, backed by
message queues, at the cut edges.]
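The partitioning can be nudged from user code with device scopes. This sketch (single machine, tf.compat.v1) shows the placement API; with accelerators present, naming "/GPU:0" or a TPU device would make TensorFlow partition the graph and insert the send/recv pairs at the cut edges:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# Pin a subgraph to an explicit device; the partitioner handles the rest.
with tf.device("/CPU:0"):
    a = tf.constant([[1.0, 2.0]])
    w = tf.constant([[3.0], [4.0]])
    y = tf.matmul(a, w)

with tf.compat.v1.Session() as sess:
    print(sess.run(y).tolist())  # [[11.0]]
```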
Execution Platforms & Tensor Processing Units (TPU)
➔ A Docker container with Python can run TensorFlow and its debugging tools
◆ can even invoke a GPU
➔ Training and inference may require performance and scale
◆ Support for GPU, FPGA acceleration - through XLA
◆ Custom TPU processors
TPUs are accelerators on the PCI bus in servers
TPU pods (clusters) - TPU chips
TPU chips have systolic MXU (matrix multiply
unit) reducing memory accesses by ~100x:
Pass data between ~100K ALUs. Small processing units
using a global clock, no registers.
Only for TF ops. ~100T ops/sec (limited precision)
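A toy simulation of the systolic idea (an illustrative sketch, not the MXU's actual microarchitecture): operands stream through a grid of accumulate cells on a global clock, so each word is fetched from memory once and reused across a whole row or column.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy output-stationary systolic array: each cell (i, j) holds one
    accumulator; rows of A stream in from the left and columns of B from
    the top, skewed by one clock per row/column."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    total_cycles = k + n + m - 2  # time for all skewed operands to pass
    for t in range(total_cycles):
        for i in range(n):
            for j in range(m):
                step = t - i - j  # skew: operands reach cell (i, j) late
                if 0 <= step < k:
                    C[i, j] += A[i, step] * B[step, j]
    return C

A = np.arange(6).reshape(2, 3).astype(float)
B = np.arange(6).reshape(3, 2).astype(float)
print(np.allclose(systolic_matmul(A, B), A @ B))  # True
```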
System Organization
Send the TF graph as a whole to a TF node.
Send individual XLA-generated operations
with their data to the TPU accelerator.
This includes instructions and data; the
TPU does not fetch instructions like a CPU.
[Diagram: gRPC over TCP/IP between nodes; gRPC over PCI to the TPU;
attached storage.]
TPU v3.0 specs (conservative guesses based on v2)
                TPU 3.0                  TPU 3.0 / node   TPU / pod
#TPUs           1 card, 4 chips, 16 MXU  4 cards          1024 cards, 256 nodes
mem BW          5 TB/sec (?)             20 TB/sec        5 PB/sec
flops/sec (*)   100 TF/sec               400 TF/sec       100 PF/sec
Operations per clock cycle:
CPU             10s (cores)
CPU vectorized  1000 (core × vector length)
GPU             10Ks
TPU             128K (TPU v1)
* flops are of various precisions
Instructions (the programming model is an RPC to the chip):
Read_Host_Memory
Write_Host_Memory
Read_Weights
MatrixMultiply/Convolve
Activate (ReLU, Sigmoid, Maxpool, LRN, …)
This should raise eyebrows ...
256 nodes for 5 PB/sec of BW and 100 PF?
That is pretty much a top-5 machine in the top500.
It would work very well for moderate-granularity
computations, like SKA (and AI, for which it was
made). It likely wouldn’t help with AMR (adaptive
mesh refinement).
Like GPUs in 2003, this is worth tinkering and
playing with to see its HPC potential.
And yes, ultimately, it may require a new chip.
Some candidate enablers:
1. does HPC use need more operations
than the TensorFlow operations?
2. does a more general systolic network
interconnect offer more opportunities?
3. can mixed precision arithmetic be
introduced? (posithub.org)
This would come at the cost of adapting the
chip. For a project like SKA it could be a
major breakthrough.
Lessons Learned
Significant cost benefits make software and
custom HW projects viable solutions
Replicating an effort of this stature is
extremely difficult.
Domain specific solutions hold a lot of
promise.
Acknowledgement: I’ve used some images
from TensorFlow documentation
(https://tensorflow.org) and Google’s blog
(https://cloud.google.com/blog/products/gcp)
Things to remember:
TensorFlow is a complete systems project:
language, compiler, hardware, devops, tools
Compiler enables advanced use models from
one code base: mobile, cloud, distributed,
GPU, TPU
TPU design has extremely high memory
bandwidth and ops/sec
Thank You
peter@braam.io