Deep Dive: Compiling Deep Learning Models
Companion video: https://youtu.be/Oo07fFb-aH0
Julien Simon
https://www.linkedin.com/in/juliensimon
https://www.youtube.com/juliensimonfr
The author of this material is Julien Simon (https://www.linkedin.com/in/juliensimon) unless explicitly mentioned.
This material is shared under the CC BY-NC 4.0 license (https://creativecommons.org/licenses/by-nc/4.0/).
You are free to share and adapt this material, provided that you give appropriate credit, provide a link to the license, and indicate if changes were made.
You may not use the material for commercial purposes. You may not apply any restriction on what the license permits.
Once upon a time in TensorFlow 😤
• Define a neural network as a graph, where input tensors flow (get it?) through compute operations
• This is known as graph mode, aka "define then run"
# First layer: 128 neurons
w1 = tf.Variable(tf.random_normal([128, x_dim]), name='w1')
b1 = tf.Variable(tf.constant(0.1, shape=(128, 1)), name='b1')
y1 = tf.nn.relu(tf.add(tf.matmul(w1, x), b1))
# Second layer: 256 neurons
w2 = tf.Variable(tf.random_normal([256, 128]), name='w2')
b2 = tf.Variable(tf.constant(0.1, shape=(256, 1)), name='b2')
y2 = tf.nn.relu(tf.add(tf.matmul(w2, y1), b2))
• Tensor shapes and execution flow are fully defined in advance
• There are many opportunities to optimize graph execution (see the session sketch below)
• Can the optimization process be automated?
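For context, in TensorFlow 1.x nothing actually runs until the graph is launched in a session. A minimal sketch of "define then run", reusing the layer definitions above (the placeholder shape and the random input batch are illustrative assumptions):

import numpy as np
import tensorflow as tf  # TensorFlow 1.x, graph mode

x_dim = 784  # hypothetical input dimension
x = tf.placeholder(tf.float32, shape=(x_dim, None), name='x')
# ... w1, b1, y1, w2, b2, y2 defined as above ...
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Nothing has executed so far: the graph only runs here
    y2_val = sess.run(y2, feed_dict={x: np.random.rand(x_dim, 32)})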
TensorFlow XLA
https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html (03/2017)
• Accelerated Linear Algebra (XLA) appeared in TensorFlow 1.0.0-rc0 (01/2017)
• XLA: a compiler that analyzes and optimizes TensorFlow graphs automatically
• Specialize the graph for the actual tensor dimensions and data types
• Eliminate redundancy and fuse operators when possible
• Generate device-optimized native machine code for CPUs, GPUs and TPUs
• "Just in time" (JIT) compilation at runtime, or "Ahead of time" (AOT) compilation pre-deployment
Meanwhile, in PyTorch land...
https://pytorch.org/docs/stable/jit.html
• TorchScript is a statically-typed subset of Python, available since PyTorch 1.0 (12/2018)
• torch.jit API: trace(), script(), save(), load() (a trace/save sketch follows after the IR listing)
• It lets you export PyTorch code into an intermediate representation (IR) using only low-level PyTorch primitives
• The IR can be converted to other languages (C++), or compiled for hardware accelerators
• Long story short: TorchScript has limitations and is now in maintenance mode
import torch

@torch.jit.script
def foo(len: int):
    rv = torch.zeros(3, 4)
    for i in range(len):
        if i < 10:
            rv = rv - 1.0
        else:
            rv = rv + 1.0
    return rv

print(foo.code)
Python
def foo(len: int) -> Tensor:
  rv = torch.zeros([3, 4])
  rv0 = rv
  for i in range(len):
    if torch.lt(i, 10):
      rv1 = torch.sub(rv0, 1., 1)
    else:
      rv1 = torch.add(rv0, 1., 1)
    rv0 = rv1
  return rv0
TorchScript
graph(%len.1 : int):
  %24 : int = prim::Constant[value=1]()
  %17 : bool = prim::Constant[value=1]() # test.py:10:5
  %12 : bool? = prim::Constant()
  %10 : Device? = prim::Constant()
  %6 : int? = prim::Constant()
  %1 : int = prim::Constant[value=3]() # test.py:9:22
  %2 : int = prim::Constant[value=4]() # test.py:9:25
  %20 : int = prim::Constant[value=10]() # test.py:11:16
  %23 : float = prim::Constant[value=1]() # test.py:12:23
  %4 : int[] = prim::ListConstruct(%1, %2)
  %rv.1 : Tensor = aten::zeros(%4, %6, %6, %10, %12) # test.py:9:10
  %rv : Tensor = prim::Loop(%len.1, %17, %rv.1) # test.py:10:5
    block0(%i.1 : int, %rv.14 : Tensor):
      %21 : bool = aten::lt(%i.1, %20) # test.py:11:12
      %rv.13 : Tensor = prim::If(%21) # test.py:11:9
        block0():
          %rv.3 : Tensor = aten::sub(%rv.14, %23, %24) # test.py:12:18
          -> (%rv.3)
        block1():
          %rv.6 : Tensor = aten::add(%rv.14, %23, %24) # test.py:14:18
          -> (%rv.6)
      -> (%17, %rv.13)
  return (%rv)
TorchScript IR
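The listings above come from script(); trace() is the other entry point to torch.jit. A minimal hedged sketch of trace() plus save()/load(), using a throwaway linear layer (model, file name and input are illustrative assumptions):

import torch

model = torch.nn.Linear(4, 2)
example = torch.randn(1, 4)
traced = torch.jit.trace(model, example)  # record the ops executed on the example input
torch.jit.save(traced, "model.pt")        # serialize the IR
loaded = torch.jit.load("model.pt")       # reload without the original Python class
print(loaded(example))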
PyTorch/XLA
https://github.com/pytorch/xla
• In 2018, Google and Meta started collaborating on bringing PyTorch to TPUs
• Vanilla PyTorch runs in eager mode (aka "define-by-run"): operations run immediately on the underlying hardware, so we can't build a graph beforehand
• PyTorch/XLA was launched in late 2019 at the PyTorch Developer Conference
https://www.youtube.com/watch?v=zXAzkqFXclM
• XLA introduces lazy tensors that allow a graph to be recorded, compiled and run on an accelerator (minimal sketch below)
https://arxiv.org/pdf/2102.13267.pdf
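A minimal sketch of laziness, assuming a torch_xla install and an available XLA device (tensor values are illustrative):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.randn(2, 2, device=device)  # lazy tensor: no computation happens yet
b = torch.relu(a + 1)                 # ops are only recorded in the graph
xm.mark_step()                        # the graph is compiled and executed here
print(b.cpu())                        # moving to CPU also forces execution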
"Hello world" with PyTorch/XLA
https://pytorch.org/xla
import torch
import torch.nn as nn
import torch.optim as optim
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = MNIST().train().to(device)  # MNIST model, train_loader and lr defined elsewhere
loss_fn = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=lr)
for data, target in train_loader:
    optimizer.zero_grad()
    data = data.to(device)
    target = target.to(device)
    output = model(data)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
    xm.mark_step()  # compile and run the recorded graph
• XLA makes it easy to reuse existing PyTorch code on custom AI hardware
• The vanilla model is loaded on the XLA device
• The training loop runs lazily on the host, automatically building an internal representation (IR), aka tracing
• On the host, at parameter optimization time:
• The IR is translated to High-Level Operations (HLO), aka lowering
• XLA compiles the HLO code to machine-dependent code
• This compiled code is loaded on the XLA device and executed
Build the IR on the host
Compile and run the IR on the XLA device
HLO example 😱
func.func @main(
  %image: tensor<28x28xf32>,
  %weights: tensor<784x10xf32>,
  %bias: tensor<1x10xf32>
) -> tensor<1x10xf32> {
  %0 = "stablehlo.reshape"(%image) : (tensor<28x28xf32>) -> tensor<1x784xf32>
  %1 = "stablehlo.dot"(%0, %weights) : (tensor<1x784xf32>, tensor<784x10xf32>) -> tensor<1x10xf32>
  %2 = "stablehlo.add"(%1, %bias) : (tensor<1x10xf32>, tensor<1x10xf32>) -> tensor<1x10xf32>
  %3 = "stablehlo.constant"() { value = dense<0.0> : tensor<1x10xf32> } : () -> tensor<1x10xf32>
  %4 = "stablehlo.maximum"(%2, %3) : (tensor<1x10xf32>, tensor<1x10xf32>) -> tensor<1x10xf32>
  "func.return"(%4) : (tensor<1x10xf32>) -> ()
}
Can you guess what this does?
Pretty horrible, but it's meant for compilers, not for humans (a PyTorch equivalent follows below)
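For readers who don't want to guess: a hedged PyTorch equivalent of the HLO above, a single dense layer with a ReLU on a flattened 28x28 image (tensor values are placeholders):

import torch

image = torch.rand(28, 28)
weights = torch.rand(784, 10)
bias = torch.rand(1, 10)

x = image.reshape(1, 784)                        # stablehlo.reshape
logits = x @ weights + bias                      # stablehlo.dot + stablehlo.add
out = torch.maximum(logits, torch.zeros(1, 10))  # stablehlo.maximum with 0, i.e. ReLU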
OpenXLA
https://cloud.google.com/blog/products/ai-machine-learning/googles-open-source-momentum-openxla-new-partnerships (10/2022)
https://opensource.googleblog.com/2023/03/openxla-is-ready-to-accelerate-and-simplify-ml-development.html (03/2023)
• New frameworks, new AI accelerators
• XLA is the de facto toolkit to compile and optimize models across hardware platforms
• The XLA compiler and HLO become standalone projects, outside of TensorFlow
https://github.com/openxla/xla
https://github.com/openxla/stablehlo
"Hello world" with PyTorch/XLA on AWS Inferentia 2
https://awsdocs-neuron.readthedocs-hosted.com
import torch
import torch_neuronx
import torch_xla.core.xla_model as xm

# Create XLA device
device = xm.xla_device()

# Load example model and inputs to Neuron device
model = torch.nn.Sequential(
    torch.nn.Linear(784, 120),
    torch.nn.ReLU(),
    torch.nn.Linear(120, 10),
    torch.nn.Softmax(dim=-1),
)
model.eval()
model.to(device)
example = torch.rand((1, 784), device=device)

# Inference
with torch.no_grad():
    result = model(example)
    xm.mark_step()  # Compilation occurs here
print(result.cpu())
• XLA makes it easy to run PyTorch code on custom hardware accelerators
• Hardware and SDK details are abstracted by extending the torch_xla API
• Tracing, lowering and JIT compilation happen under the hood with the AWS Neuron SDK
• AOT compilation is also possible with a TorchScript-like API (sketch below)
• Intel Habana Gaudi 2 works the same way
https://docs.habana.ai/en/latest/PyTorch/
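A hedged sketch of the AOT path with torch_neuronx.trace(), which compiles a model for Neuron ahead of time and returns a TorchScript-like module (model, input and file name are illustrative assumptions; see the Neuron docs for the authoritative workflow):

import torch
import torch_neuronx

model = torch.nn.Sequential(
    torch.nn.Linear(784, 120),
    torch.nn.ReLU(),
    torch.nn.Linear(120, 10),
    torch.nn.Softmax(dim=-1),
).eval()
example = torch.rand(1, 784)
neuron_model = torch_neuronx.trace(model, example)  # AOT compilation for Neuron
torch.jit.save(neuron_model, "model_neuron.pt")     # deployable, TorchScript-like artifact
loaded = torch.jit.load("model_neuron.pt")
print(loaded(example))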
PyTorch 2: the light at the end of the tunnel?
https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html
• torch.compile(): a new stack for model compilation
• TorchDynamo + AOT Autograd: trace forward and backward passes using only low-level primitives, and save them in torch.fx format (see the inspection sketch below)
• Data-dependent control flow, dynamic shapes, and non-PyTorch code are all handled ✅
• TorchInductor: backend compiler
• OpenMP/C++ backend for CPU code
• OpenAI Triton backend for GPU code
• AOT export added to PyTorch 2.2
• PyTorch 2 will embrace OpenXLA as a backend
https://pytorch.org/blog/pytorch-2.0-xla-path-forward/ (04/2023)
• torch.compile() for all AI devices?
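To make the Dynamo tracing step concrete, here is a minimal sketch using a custom torch.compile backend that just prints the captured torch.fx graph and then runs it eagerly (inspect_backend is a hypothetical name; the function being compiled is illustrative):

import torch

def inspect_backend(gm, example_inputs):
    gm.graph.print_tabular()  # show the torch.fx graph captured by TorchDynamo
    return gm.forward         # run the captured graph eagerly, no code generation

@torch.compile(backend=inspect_backend)
def f(x):
    return torch.relu(x) + 1.0

f(torch.randn(4))  # the first call triggers tracing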
Accelerating Hugging Face models with PyTorch 2
https://pytorch.org/blog/Accelerating-Hugging-Face-and-TIMM-models/ (12/2022)
import torch
from transformers import BertTokenizer, BertModel
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased").to(device=device)
model = torch.compile(model, backend="inductor")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt').to(device=device)
output = model(**encoded_input)
from diffusers import DiffusionPipeline
import torch

prompt = "a photo of an astronaut riding a horse"  # example prompt
steps, batch_size = 50, 1                          # example values
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
pipe.unet = torch.compile(pipe.unet)
image = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images[0]
One-line optimization for CPU and GPU 🎉 🎉 🎉
Accelerating BERT on CPU with PyTorch 2
import torch
from transformers import BertTokenizer, BertModel
import intel_extension_for_pytorch as ipex
import time

device = torch.device("cpu")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
orig_model = BertModel.from_pretrained("bert-base-uncased").to(device=device)
orig_model.eval()
text = "Replace me by any text you'd like. " * 12  # Seq length 420
print(f"Sequence length: {len(text)}")
encoded_input = tokenizer(text, return_tensors='pt').to(device=device)

def bench(model, input, n=1000):
    with torch.no_grad():
        # Warmup run, to trigger compilation before timing
        model(**input)
        start = time.time()
        for _ in range(n):
            model(**input)
        end = time.time()
    return (end - start) * 1000 / n

print(f"Average time: {bench(orig_model, encoded_input):.2f} ms")
print(torch._dynamo.list_backends())

model = torch.compile(orig_model, backend="inductor")
print(f"Average time inductor: {bench(model, encoded_input):.2f} ms")

torch._dynamo.reset()
model = ipex.optimize(orig_model)  # frontend optim
model = torch.compile(model, backend="ipex")  # backend optim
print(f"Average time ipex: {bench(model, encoded_input):.2f} ms")
Results (Amazon EC2 c6i.4xlarge, AWS Deep Learning AMI, PyTorch 2.2.0 + IPEX 2.2.0):
• Baseline: 34.13 ms
• Inductor: 31.83 ms
• IPEX: 30.86 ms
https://gitlab.com/juliensimon/huggingface-demos/-/blob/main/pt2/bench_bert.py