Deep Dive: Compiling Deep Learning Models
Companion video: https://youtu.be/Oo07fFb-aH0
Julien Simon
https://www.linkedin.com/in/juliensimon
https://www.youtube.com/juliensimonfr
The author of this material is Julien Simon (https://www.linkedin.com/in/juliensimon) unless explicitly mentioned.
This material is shared under the CC BY-NC 4.0 license (https://creativecommons.org/licenses/by-nc/4.0/).
You are free to share and adapt this material, provided that you give appropriate credit, provide a link to the license, and indicate if changes were made.
You may not use the material for commercial purposes. You may not apply any restriction on what the license permits.
Once upon a time in TensorFlow 😤
• Define a neural network as a graph, where input tensors flow (get it?) through compute operations
• This is known as graph mode, aka "define then run"
# First layer: 128 neurons
w1 = tf.Variable(tf.random_normal([128, x_dim]), name='w1')
b1 = tf.Variable(tf.constant(0.1, shape=(128, 1)), name='b1')
y1 = tf.nn.relu(tf.add(tf.matmul(w1, x), b1))
# Second layer: 256 neurons
w2 = tf.Variable(tf.random_normal([256, 128]), name='w2')
b2 = tf.Variable(tf.constant(0.1, shape=(256, 1)), name='b2')
y2 = tf.nn.relu(tf.add(tf.matmul(w2, y1), b2))
• Tensor shapes and execution flow are fully defined in advance
• There are many opportunities to optimize graph execution (see the session sketch below)
• Can the optimization process be automated?
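For context, in TensorFlow 1.x nothing actually runs until the graph is launched in a session. A minimal sketch of "define then run", reusing the layer definitions above (the placeholder shape and the random input batch are illustrative assumptions):

import numpy as np
import tensorflow as tf  # TensorFlow 1.x, graph mode

x_dim = 784  # hypothetical input dimension
x = tf.placeholder(tf.float32, shape=(x_dim, None), name='x')
# ... w1, b1, y1, w2, b2, y2 defined as above ...
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Nothing has executed so far: the graph only runs here
    y2_val = sess.run(y2, feed_dict={x: np.random.rand(x_dim, 32)})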
TensorFlow XLA
https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html (03/2017)
• Accelerated Linear Algebra (XLA) appeared in TensorFlow 1.0.0-rc0 (01/2017)
• XLA: a compiler that analyzes and optimizes TensorFlow graphs automatically
• Specialize the graph for the actual tensor dimensions and data types
• Eliminate redundancy and fuse operators when possible
• Generate device-optimized native machine code for CPUs, GPUs and TPUs
• "Just in time" (JIT) compilation at runtime, or "Ahead of time" (AOT) compilation pre-deployment
Meanwhile, in PyTorch land...
https://pytorch.org/docs/stable/jit.html
• TorchScript is a statically-typed subset of Python, available since PyTorch 1.0 (12/2018)
• torch.jit API: trace(), script(), save(), load() (a trace/save sketch follows after the IR listing)
• It lets you export PyTorch code into an intermediate representation (IR) using only low-level PyTorch primitives
• The IR can be converted to other languages (C++), or compiled for hardware accelerators
• Long story short: TorchScript has limitations and is now in maintenance mode
import torch

@torch.jit.script
def foo(len: int):
    rv = torch.zeros(3, 4)
    for i in range(len):
        if i < 10:
            rv = rv - 1.0
        else:
            rv = rv + 1.0
    return rv

print(foo.code)
Python
def foo(len: int) -> Tensor:
  rv = torch.zeros([3, 4])
  rv0 = rv
  for i in range(len):
    if torch.lt(i, 10):
      rv1 = torch.sub(rv0, 1., 1)
    else:
      rv1 = torch.add(rv0, 1., 1)
    rv0 = rv1
  return rv0
TorchScript
graph(%len.1 : int):
  %24 : int = prim::Constant[value=1]()
  %17 : bool = prim::Constant[value=1]() # test.py:10:5
  %12 : bool? = prim::Constant()
  %10 : Device? = prim::Constant()
  %6 : int? = prim::Constant()
  %1 : int = prim::Constant[value=3]() # test.py:9:22
  %2 : int = prim::Constant[value=4]() # test.py:9:25
  %20 : int = prim::Constant[value=10]() # test.py:11:16
  %23 : float = prim::Constant[value=1]() # test.py:12:23
  %4 : int[] = prim::ListConstruct(%1, %2)
  %rv.1 : Tensor = aten::zeros(%4, %6, %6, %10, %12) # test.py:9:10
  %rv : Tensor = prim::Loop(%len.1, %17, %rv.1) # test.py:10:5
    block0(%i.1 : int, %rv.14 : Tensor):
      %21 : bool = aten::lt(%i.1, %20) # test.py:11:12
      %rv.13 : Tensor = prim::If(%21) # test.py:11:9
        block0():
          %rv.3 : Tensor = aten::sub(%rv.14, %23, %24) # test.py:12:18
          -> (%rv.3)
        block1():
          %rv.6 : Tensor = aten::add(%rv.14, %23, %24) # test.py:14:18
          -> (%rv.6)
      -> (%17, %rv.13)
  return (%rv)
TorchScript IR
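The listings above come from script(); trace() is the other entry point to torch.jit. A minimal hedged sketch of trace() plus save()/load(), using a throwaway linear layer (model, file name and input are illustrative assumptions):

import torch

model = torch.nn.Linear(4, 2)
example = torch.randn(1, 4)
traced = torch.jit.trace(model, example)  # record the ops executed on the example input
torch.jit.save(traced, "model.pt")        # serialize the IR
loaded = torch.jit.load("model.pt")       # reload without the original Python class
print(loaded(example))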
PyTorch/XLA
https://github.com/pytorch/xla
• In 2018, Google and Meta started collaborating on bringing PyTorch to TPUs
• Vanilla PyTorch runs in eager mode (aka "define-by-run"): operations run immediately on the underlying hardware, so we can't build a graph beforehand
• PyTorch/XLA was launched in late 2019 at the PyTorch Developer Conference
https://www.youtube.com/watch?v=zXAzkqFXclM
• XLA introduces lazy tensors that allow a graph to be recorded, compiled and run on an accelerator (minimal sketch below)
https://arxiv.org/pdf/2102.13267.pdf
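A minimal sketch of laziness, assuming a torch_xla install and an available XLA device (tensor values are illustrative):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.randn(2, 2, device=device)  # lazy tensor: no computation happens yet
b = torch.relu(a + 1)                 # ops are only recorded in the graph
xm.mark_step()                        # the graph is compiled and executed here
print(b.cpu())                        # moving to CPU also forces execution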
"Hello world" with PyTorch/XLA
https://pytorch.org/xla
import torch
import torch.nn as nn
import torch.optim as optim
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = MNIST().train().to(device)  # MNIST model, train_loader and lr defined elsewhere
loss_fn = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=lr)
for data, target in train_loader:
    optimizer.zero_grad()
    data = data.to(device)
    target = target.to(device)
    output = model(data)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
    xm.mark_step()  # compile and run the recorded graph
• XLA makes it easy to reuse existing PyTorch code on custom AI hardware
• The vanilla model is loaded on the XLA device
• The training loop runs lazily on the host, automatically building an internal representation (IR), aka tracing
• On the host, at parameter optimization time:
• The IR is translated to High-Level Operations (HLO), aka lowering
• XLA compiles the HLO code to machine-dependent code
• This compiled code is loaded on the XLA device and executed
Build the IR on the host
Compile and run the IR on the XLA device
HLO example 😱
func.func @main(
  %image: tensor<28x28xf32>,
  %weights: tensor<784x10xf32>,
  %bias: tensor<1x10xf32>
) -> tensor<1x10xf32> {
  %0 = "stablehlo.reshape"(%image) : (tensor<28x28xf32>) -> tensor<1x784xf32>
  %1 = "stablehlo.dot"(%0, %weights) : (tensor<1x784xf32>, tensor<784x10xf32>) -> tensor<1x10xf32>
  %2 = "stablehlo.add"(%1, %bias) : (tensor<1x10xf32>, tensor<1x10xf32>) -> tensor<1x10xf32>
  %3 = "stablehlo.constant"() { value = dense<0.0> : tensor<1x10xf32> } : () -> tensor<1x10xf32>
  %4 = "stablehlo.maximum"(%2, %3) : (tensor<1x10xf32>, tensor<1x10xf32>) -> tensor<1x10xf32>
  "func.return"(%4) : (tensor<1x10xf32>) -> ()
}
Can you guess what this does?
Pretty horrible, but it's meant for compilers, not for humans (a PyTorch equivalent follows below)
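For readers who don't want to guess: a hedged PyTorch equivalent of the HLO above, a single dense layer with a ReLU on a flattened 28x28 image (tensor values are placeholders):

import torch

image = torch.rand(28, 28)
weights = torch.rand(784, 10)
bias = torch.rand(1, 10)

x = image.reshape(1, 784)                        # stablehlo.reshape
logits = x @ weights + bias                      # stablehlo.dot + stablehlo.add
out = torch.maximum(logits, torch.zeros(1, 10))  # stablehlo.maximum with 0, i.e. ReLU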
OpenXLA
https://cloud.google.com/blog/products/ai-machine-learning/googles-open-source-momentum-openxla-new-partnerships (10/2022)
https://opensource.googleblog.com/2023/03/openxla-is-ready-to-accelerate-and-simplify-ml-development.html (03/2023)
• New frameworks, new AI accelerators
• XLA is the de facto toolkit to compile and optimize models across hardware platforms
• The XLA compiler and HLO become standalone projects, outside of TensorFlow
https://github.com/openxla/xla
https://github.com/openxla/stablehlo
"Hello world" with PyTorch/XLA on AWS Inferentia 2
https://awsdocs-neuron.readthedocs-hosted.com
import torch
import torch_neuronx
import torch_xla.core.xla_model as xm

# Create XLA device
device = xm.xla_device()

# Load example model and inputs to Neuron device
model = torch.nn.Sequential(
    torch.nn.Linear(784, 120),
    torch.nn.ReLU(),
    torch.nn.Linear(120, 10),
    torch.nn.Softmax(dim=-1),
)
model.eval()
model.to(device)
example = torch.rand((1, 784), device=device)

# Inference
with torch.no_grad():
    result = model(example)
    xm.mark_step()  # Compilation occurs here
print(result.cpu())
• XLA makes it easy to run PyTorch code on custom hardware accelerators
• Hardware and SDK details are abstracted by extending the torch_xla API
• Tracing, lowering and JIT compilation happen under the hood with the AWS Neuron SDK
• AOT compilation is also possible with a TorchScript-like API (sketch below)
• Intel Habana Gaudi 2 works the same way
https://docs.habana.ai/en/latest/PyTorch/
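A hedged sketch of the AOT path with torch_neuronx.trace(), which compiles a model for Neuron ahead of time and returns a TorchScript-like module (model, input and file name are illustrative assumptions; see the Neuron docs for the authoritative workflow):

import torch
import torch_neuronx

model = torch.nn.Sequential(
    torch.nn.Linear(784, 120),
    torch.nn.ReLU(),
    torch.nn.Linear(120, 10),
    torch.nn.Softmax(dim=-1),
).eval()
example = torch.rand(1, 784)
neuron_model = torch_neuronx.trace(model, example)  # AOT compilation for Neuron
torch.jit.save(neuron_model, "model_neuron.pt")     # deployable, TorchScript-like artifact
loaded = torch.jit.load("model_neuron.pt")
print(loaded(example))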
PyTorch 2: the light at the end of the tunnel?
https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html
• torch.compile(): a new stack for model compilation
• TorchDynamo + AOT Autograd: trace forward and backward passes using only low-level primitives, and save them in torch.fx format (see the inspection sketch below)
• Data-dependent control flow, dynamic shapes, and non-PyTorch code are all handled ✅
• TorchInductor: backend compiler
• OpenMP/C++ backend for CPU code
• OpenAI Triton backend for GPU code
• AOT export added to PyTorch 2.2
• PyTorch 2 will embrace OpenXLA as a backend
https://pytorch.org/blog/pytorch-2.0-xla-path-forward/ (04/2023)
• torch.compile() for all AI devices?
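To make the Dynamo tracing step concrete, here is a minimal sketch using a custom torch.compile backend that just prints the captured torch.fx graph and then runs it eagerly (inspect_backend is a hypothetical name; the function being compiled is illustrative):

import torch

def inspect_backend(gm, example_inputs):
    gm.graph.print_tabular()  # show the torch.fx graph captured by TorchDynamo
    return gm.forward         # run the captured graph eagerly, no code generation

@torch.compile(backend=inspect_backend)
def f(x):
    return torch.relu(x) + 1.0

f(torch.randn(4))  # the first call triggers tracing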
Accelerating Hugging Face models with PyTorch 2
https://pytorch.org/blog/Accelerating-Hugging-Face-and-TIMM-models/ (12/2022)
import torch
from transformers import BertTokenizer, BertModel
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased").to(device=device)
model = torch.compile(model, backend="inductor")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt').to(device=device)
output = model(**encoded_input)
from diffusers import DiffusionPipeline
import torch

prompt = "a photo of an astronaut riding a horse"  # example prompt
steps, batch_size = 50, 1                          # example values
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
pipe.unet = torch.compile(pipe.unet)
image = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images[0]
One-line optimization for CPU and GPU 🎉 🎉 🎉
Accelerating BERT on CPU with PyTorch 2
import torch
from transformers import BertTokenizer, BertModel
import intel_extension_for_pytorch as ipex
import time

device = torch.device("cpu")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
orig_model = BertModel.from_pretrained("bert-base-uncased").to(device=device)
orig_model.eval()
text = "Replace me by any text you'd like. " * 12  # Seq length 420
print(f"Sequence length: {len(text)}")
encoded_input = tokenizer(text, return_tensors='pt').to(device=device)

def bench(model, input, n=1000):
    with torch.no_grad():
        # Warmup run, to trigger compilation before timing
        model(**input)
        start = time.time()
        for _ in range(n):
            model(**input)
        end = time.time()
    return (end - start) * 1000 / n

print(f"Average time: {bench(orig_model, encoded_input):.2f} ms")
print(torch._dynamo.list_backends())

model = torch.compile(orig_model, backend="inductor")
print(f"Average time inductor: {bench(model, encoded_input):.2f} ms")

torch._dynamo.reset()
model = ipex.optimize(orig_model)  # frontend optim
model = torch.compile(model, backend="ipex")  # backend optim
print(f"Average time ipex: {bench(model, encoded_input):.2f} ms")
Results (Amazon EC2 c6i.4xlarge, AWS Deep Learning AMI, PyTorch 2.2.0 + IPEX 2.2.0):
• Baseline: 34.13 ms
• Inductor: 31.83 ms
• IPEX: 30.86 ms
https://gitlab.com/juliensimon/huggingface-demos/-/blob/main/pt2/bench_bert.py