TensorFlow for HPC?
Peter Braam
peter@braam.io
me
1980 1993 1997 2002 2013
pure math & th. physics @oxford
cs @cmu
@5 startups & @3 big acquirers - Lustre
SKA @cambridge
worked with 100s of the largest compute centers and
virtually all major system & CPU/GPU vendors
Math / ML /
Astrophysics
@flatiron Institute
2018
Origin of this talk
I worked extensively on HPC infrastructure for the SKA telescope.
Through coincidence I was offered a generous visit to CERN and asked to
explain some of my thoughts to the HEP ML community.
I decided to offer the HEP ML community a “systems perspective” on
TensorFlow, and I came away highly impressed by this platform’s history
and promise.
Why talk about TF?
Very widely used, gained much ground on other packages
Has unmatched flexibility for deployment
Achieves very high performance
Systems Engineering Masterpiece
Best-of-breed specialists involved from multiple domains
Door Opener for new xPU design
Domain specific computation infrastructure template
Why did Google do this?
Google’s AI could mean doubling their data centres (modest use)
100s of projects will pursue ML: development productivity is central
Google released TensorFlow in 2015 (a 2nd design, following DistBelief).
TensorFlow’s scope is profound: language, compiler, chips, tools, devops
One of the most impressive software-systems-hardware projects I’ve seen
Character of mega projects ... ($10^8-$10^10)
TensorFlow
Google realized it would massively develop
ML-driven applications; even modest use would
require a twofold expansion of its data centers.
Challenge:
● high-productivity software development
● portable deployment from phones to
massive clusters
● lowest cost/performance ratio
SKA Telescope
SKA is deploying a massive new radio
telescope. It needs to provide usable science
data products (i.e. images) for astrophysicists,
using algorithms that might need adaptation.
Challenge:
● Understand required compute systems
● Flexible development and runtime
environment
● Meet energy and financial budgets
Co-design
[Diagram: experts, IT suppliers, users, and executives
all feed into the architecture.]
TensorFlow
components
High-level APIs like Keras allow rapid prototyping of ML models (this
is largely maths, not programming). Automatic differentiation.
Debugging and profiling tools exist, such as TensorBoard and a data
flow debugger.
One code base can be used for development, training, evaluation,
inference and snapshotting, and runs on everything from mobile devices
to specialized large clusters.
ML focus
this is frequently discussed
Language, execution
platform and optimization
Compiler and Chips (TPU)
Devops
The architecture reflects a
strong separation of
concerns
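As a minimal sketch of the rapid-prototyping and automatic-differentiation claims above (a hypothetical toy model, not taken from the slides):

```python
import tensorflow as tf

# A tiny Keras model: defining it is mostly maths, not programming.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])

x = tf.random.normal([16, 4])
y = tf.random.normal([16, 1])

# Automatic differentiation: TensorFlow records the forward pass
# and derives a gradient for every trainable variable.
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))
grads = tape.gradient(loss, model.trainable_variables)

# Two Dense layers, each with a kernel and a bias: four gradients.
print(len(grads))  # 4
```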
TensorFlow Core
y = Ax, a TF operation: A (N×M matrix) and x (M vector)
are inputs or “feeds”; y (N vector) is the output or “fetch”.
Data Flow model with extremely rich features.
Expressions in programming languages define
data flow graphs from call graphs and arguments
TF treats graphs declaratively, i.e. they are
defined first and executed later.
TF Graphs can be automatically split for
distributed execution on multiple devices.
TF operations reflect domain-specific
aspects found throughout TensorFlow
TF execution
[Diagram: a small graph (Matmul of w and x feeding an Add with b).
The user feeds inputs and fetches outputs; TensorFlow performs the
operations and trains the variables; tensors flow along the edges.]
Define graph
Start one or more sessions
The session executes the graph:
- recursively check which graph nodes the
fetch (output) depends on
- apply lots of optimizations
- execute dependencies in parallel
- this is called lazy evaluation, and it
enables parallelism
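The define-then-run model above can be sketched with the classic session API (reached via tf.compat.v1 in current TensorFlow releases), using the y = Ax example from the earlier slide:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # use the graph/session model

# Define the graph declaratively: nothing executes yet.
A = tf.compat.v1.placeholder(tf.float32, shape=(2, 3), name="A")  # feed
x = tf.compat.v1.placeholder(tf.float32, shape=(3, 1), name="x")  # feed
y = tf.linalg.matmul(A, x, name="y")                              # fetch

# The session walks the dependencies of the fetch and executes them.
with tf.compat.v1.Session() as sess:
    result = sess.run(y, feed_dict={A: [[1, 0, 0], [0, 1, 0]],
                                    x: [[5], [6], [7]]})
print(result.tolist())  # [[5.0], [6.0]]
```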
Execution Framework Challenges
➔ run on distributed systems and on many architectures (including the TPU)
◆ split graphs, feed data to remote architectures, control architecture
◆ understand the data: inactive and identity nodes, mapping constants, tensor dimensions
➔ create code for different architectures, optimized for scalable clusters and
for handheld devices. Compiler offers
◆ JIT: just in time (during execution) to take full advantage of the sizes of the tensors
◆ AOT: ahead of time to create a standalone binary
➔ optimizations:
◆ tiling sizes, threading, data alignment, padding, minimizing communications,
adapting queue lengths
The TF framework itself, but particularly the XLA compiler, makes this
transparent.
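A sketch of the JIT path (assuming TensorFlow 2.x, where jit_compile=True asks XLA to compile the traced function for the concrete tensor shapes it sees):

```python
import tensorflow as tf

# JIT: XLA compiles this function at trace time, specializing to the
# concrete shapes, fusing the matmul and add, and choosing tilings.
@tf.function(jit_compile=True)
def affine(w, x, b):
    return tf.matmul(x, w) + b

w = tf.ones([3, 2])
x = tf.constant([[1.0, 2.0, 3.0]])
b = tf.zeros([2])
print(affine(w, x, b).numpy().tolist())  # [[6.0, 6.0]]
```

The AOT path mentioned above is served by a separate tool (tfcompile) that emits a standalone binary for a fixed graph.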
Graph modifications for distributed execution
[Diagram: a graph on node 0 is split across xPU0 and xPU1 on two
nodes; TensorFlow inserts send/recv operation pairs, backed by
message queues, at the cut edges.]
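The partitioning can be nudged from user code with device scopes. This sketch (single machine, tf.compat.v1) shows the placement API; with accelerators present, naming "/GPU:0" or a TPU device would make TensorFlow partition the graph and insert the send/recv pairs at the cut edges:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# Pin a subgraph to an explicit device; the partitioner handles the rest.
with tf.device("/CPU:0"):
    a = tf.constant([[1.0, 2.0]])
    w = tf.constant([[3.0], [4.0]])
    y = tf.matmul(a, w)

with tf.compat.v1.Session() as sess:
    print(sess.run(y).tolist())  # [[11.0]]
```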
Execution Platforms & Tensor Processing Units (TPU)
➔ A Docker container with Python can run TensorFlow and its debugging tools
◆ can even invoke a GPU
➔ Training and inference may require performance and scale
◆ Support for GPU, FPGA acceleration - through XLA
◆ Custom TPU processors
TPUs are accelerators on the PCI bus in servers
TPU pods (clusters) - TPU chips
TPU chips have systolic MXU (matrix multiply
unit) reducing memory accesses by ~100x:
Pass data between ~100K ALUs. Small processing units
using a global clock, no registers.
Only for TF ops. ~100T ops/sec (limited precision)
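A toy simulation of the systolic idea (an illustrative sketch, not the MXU's actual microarchitecture): operands stream through a grid of accumulate cells on a global clock, so each word is fetched from memory once and reused across a whole row or column.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy output-stationary systolic array: each cell (i, j) holds one
    accumulator; rows of A stream in from the left and columns of B from
    the top, skewed by one clock per row/column."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    total_cycles = k + n + m - 2  # time for all skewed operands to pass
    for t in range(total_cycles):
        for i in range(n):
            for j in range(m):
                step = t - i - j  # skew: operands reach cell (i, j) late
                if 0 <= step < k:
                    C[i, j] += A[i, step] * B[step, j]
    return C

A = np.arange(6).reshape(2, 3).astype(float)
B = np.arange(6).reshape(3, 2).astype(float)
print(np.allclose(systolic_matmul(A, B), A @ B))  # True
```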
System Organization
Send the TF graph as a whole to a TF node.
Send individual XLA-generated operations
with their data to the TPU accelerator.
This includes instructions and data; the
TPU does not fetch instructions like a CPU.
[Diagram: gRPC over TCP/IP between nodes; gRPC over PCI to the TPU;
attached storage.]
TPU v3.0 specs (conservative guesses based on v2)
                TPU 3.0                  TPU 3.0 / node   TPU / pod
#TPUs           1 card, 4 chips, 16 MXU  4 cards          1024 cards, 256 nodes
mem BW          5 TB/sec (?)             20 TB/sec        5 PB/sec
flops/sec (*)   100 TF/sec               400 TF/sec       100 PF/sec
Operations per clock cycle:
CPU             10s (cores)
CPU vectorized  1000 (core × vector length)
GPU             10Ks
TPU             128K (TPU v1)
* flops are of various precisions
Instructions (the programming model is an RPC to the chip):
Read_Host_Memory
Write_Host_Memory
Read_Weights
MatrixMultiply/Convolve
Activate (ReLU, Sigmoid, Maxpool, LRN, …)
This should raise eyebrows ...
256 nodes for 5 PB/sec of BW and 100 PF?
That is pretty much a top-5 machine in the top500.
It would work very well for moderate-granularity
computations, like SKA (and AI, for which it was
made). It likely wouldn’t help with AMR (adaptive
mesh refinement).
Like GPUs in 2003, this is worth tinkering and
playing with to see its HPC potential.
And yes, ultimately, it may require a new chip.
Some candidate enablers:
1. does HPC use need more operations
than the TensorFlow operations?
2. does a more general systolic network
interconnect offer more opportunities?
3. can mixed precision arithmetic be
introduced? (posithub.org)
This would come at the cost of adapting the
chip. For a project like SKA it could be a
major breakthrough.
Lessons Learned
Significant cost benefits make software and
custom HW projects viable solutions
Replicating an effort of this stature is
extremely difficult.
Domain specific solutions hold a lot of
promise.
Acknowledgement: I’ve used some images
from TensorFlow documentation
(https://tensorflow.org) and Google’s blog
(https://cloud.google.com/blog/products/gcp)
Things to remember:
TensorFlow is a complete systems project:
language, compiler, hardware, devops, tools
Compiler enables advanced use models from
one code base: mobile, cloud, distributed,
GPU, TPU
TPU design has extremely high memory
bandwidth and ops/sec
Thank You
peter@braam.io