Glow Compiler
2018
issue.hsu@gmail.com
Outline
• Brief introduction to Glow
• Glow IR
• Glow Quantization
• Glow CPU Backend
2
Brief introduction to
Glow
Brief introduction to
Glow
Glow IR
Glow Quantization
Glow CPU Backend
3
A collaborative effort
• Over the past seven years, Facebook has learned a great deal about how best to
collaborate with the hardware community
• Our work to help found and drive the Open Compute Project has been instrumental in
allowing us to build highly scalable, efficient networking and storage technologies for our
data centers
• We’ve applied this thinking to how we work with telecom operators and the connectivity
ecosystem overall with the Telecom Infra Project, as we work to get more people around the
world better connected to the internet
• As we look ahead, we now want to take these learnings and apply them to how we work with
our silicon partners on AI and ML
• We created Glow, an open source framework, to be community driven. This approach allows
partners to more rapidly design and optimize new silicon products for AI and ML by
leveraging community-driven compiler software
• Cadence, Esperanto, Intel, Marvell, and Qualcomm Technologies Inc, a subsidiary
of Qualcomm Incorporated, have committed to supporting Glow in future silicon
products
4
How Glow works
• Glow is designed to target a wide range of hardware accelerators
• The hardware-independent parts of the compiler focus on math-related
optimizations that are not tied to a specific hardware model
• It also contains a number of utilities and building blocks that can be
configured to support multiple hardware targets, including
• a powerful linear algebra optimizer
• an extensive test suite
• a CPU-based reference implementation for testing the accuracy of hardware
accelerators
• the memory allocator
• an instruction scheduler
• etc…
5
How Glow works
6
Glow Intermediate
Representation
Brief introduction to
Glow
Glow IR
Glow Quantization
Glow CPU Backend
7
High-Level IR
• The high-level IR is a dataflow node-based graph representation
• similar to a graph that you may find inside Caffe or in ONNX format
• When we load a neural network model from some file we construct
this graph with a direct translation of one operator to one or more
nodes
• The graph is strongly typed, which means that inputs and outputs have a
known tensor type
• A tensor type consists of the tensor's shape and element type; the types of
all nodes are verified by the compiler
8
High-Level IR
• The Glow graph is structured as a module that contains multiple functions that
contain multiple nodes
• Nodes inside functions are able to reference Placeholders and Constants which
are owned by the module
• Placeholders and Constants, which are similar to global variables in C programs, are nodes
that are shared between the functions
• A module may have multiple functions
• For example, one module could contain both an inference function and the gradient of that
inference function
• The gradient function could perform training of the placeholder weights, and the
inference function could read from those same weights
9
High-Level IR
• Variable Visibility
• Glow variables are similar to PyTorch and TensorFlow variables
• They are persistent tensors that live across different executions of the neural network
• Variables are annotated with Public or Private labels. These labels specify whether
the node is visible outside of the graph
• If the node is public, then it means that C++ code from outside the graph may access the
variable directly and change its content before or after the execution of the program
• This means that the optimizer is not allowed to delete unused public variables or change their
dimensions
• In the case of private variables, the optimizer is allowed to delete unused variables, transpose
them, perform constant propagation, etc.
10
High-Level IR
• Constants
• special nodes that represent tensors that
are a part of the graph
• These nodes can be used to represent
things like the weights of neural
networks
• Constants are immutable during the
execution of the program, but graph
optimizations can access the constants
and modify them
• This feature is useful for transformations
that prepare the weights by transposing
them or quantizing them before the
execution of the program
• Placeholders
• symbolic nodes that are not backed by a
concrete tensor during the compilation of
the program
• Inputs and outputs of Glow programs
should be modeled using Placeholder
nodes
• Concrete tensors are attached to
placeholder nodes during the execution
of the program
• Unlike constants, the optimizer can't
inspect or mutate the content of
Placeholder nodes
• The same program could be compiled
using different bound tensors without
changing the semantics of the program
11
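To make the module/function/Placeholder/Constant relationship concrete, the sketch below builds a tiny function with the C++ graph-building API from the pytorch/glow repository. It is illustrative only: class and method names (ExecutionEngine, createPlaceholder, createConstant, createSave, PlaceholderBindings, and so on) follow the repository's examples, but exact signatures vary across Glow versions.

```cpp
// Minimal sketch: a module owns Placeholders and Constants; a Function owns
// the nodes that reference them. Illustrative only; signatures may differ
// between Glow versions.
#include "glow/ExecutionEngine/ExecutionEngine.h"
#include "glow/Graph/Graph.h"

using namespace glow;

void buildAndRun() {
  ExecutionEngine EE;                     // owns the Module and a backend
  Module &mod = EE.getModule();
  Function *F = mod.createFunction("main");

  // Placeholder: symbolic input, bound to a concrete tensor at run time.
  Placeholder *input = mod.createPlaceholder(ElemKind::FloatTy, {1, 4},
                                             "input", /*isTrainable=*/false);

  // Constant: a tensor owned by the module, e.g. pretrained weights.
  Constant *weights = mod.createConstant(ElemKind::FloatTy, {4, 4}, "weights");
  weights->getPayloadMutable().getHandle<float>().clear(0.5f);

  // Nodes belong to the function and may reference module-level storage.
  MatMulNode *mm = F->createMatMul("matmul", input, weights);
  SaveNode *save = F->createSave("save", mm);
  (void)save;

  // Bind concrete tensors to every placeholder, compile, and run.
  PlaceholderBindings bindings;
  bindings.allocate(mod.getPlaceholders());
  EE.compile(CompilationMode::Infer);
  EE.run(bindings);
}
```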
High-Level IR
• Glow functions contain nodes that represent the
different operations of a neural network
• The function owns the nodes and has access to the
placeholders and constants in the module
• The image on the right-hand side depicts the compute
• Glow lowers the nodes that compute the gradient of
the expression and the stochastic gradient descent
(SGD) node into a sequence of low-level operators (Div,
Mul, Add and Save)
• The different compiler backends do not need to implement
support for the DivGrad, ReLUGrad or SGD nodes
12
Node Lowering
• Instead of compiling high-level operators directly, Glow performs
“node lowering”
• In this phase, the compiler breaks the high-level operator nodes into
low-level linear algebra operator nodes
• For example, the FullyConnected layer is represented as a matrix
multiplication followed by a broadcasted add (see the sketch below)
• Different compiler backends do not have to implement the FullyConnected
layer and a dozen other high-level opcodes, just the low-level matrix
multiplication
13
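As a reference for what this lowering means semantically, the plain C++ sketch below computes FullyConnected(X, W, b) as a matrix multiplication followed by a broadcasted add. This is not Glow code, just the math the two lowered nodes implement.

```cpp
// Reference semantics of lowering FullyConnected(X, W, b) into
// Y = MatMul(X, W) + broadcast(b). Plain C++ for illustration only.
#include <vector>

std::vector<float> fullyConnected(const std::vector<float> &X, // [N x K]
                                  const std::vector<float> &W, // [K x M]
                                  const std::vector<float> &b, // [M]
                                  int N, int K, int M) {
  std::vector<float> Y(N * M, 0.0f);
  // MatMul node: Y = X * W
  for (int n = 0; n < N; ++n)
    for (int m = 0; m < M; ++m)
      for (int k = 0; k < K; ++k)
        Y[n * M + m] += X[n * K + k] * W[k * M + m];
  // Broadcasted Add node: the bias row b is added to every row of Y.
  for (int n = 0; n < N; ++n)
    for (int m = 0; m < M; ++m)
      Y[n * M + m] += b[m];
  return Y;
}
```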
Node Lowering
• In Glow, lowering is performed on the high-level graph as described above,
prior to moving to the low-level IR
• This is due to a number of reasons
• First, the new lowered graph may allow for additional graph-level
optimizations
• Second, the new graph structure may affect the decisions of the instruction
scheduler
• And third, after lowering we allow the backends to perform additional
target-specific optimizations on the lowered graph
14
Low-Level IR
• After optimizing the graph with target-independent optimizations,
and lowering from high-level operator nodes to linear algebra
operator nodes, the code is further lowered into the low-level IR in a
phase that is called "IRGen" (which stands for IR generation)
• This is a one-to-many translation where each high-level node is translated
into one or more instructions
• During IRGen, constants and placeholders are converted into
WeightVars
• These WeightVars are annotated with Mutable or Constant labels, depending
on the source and whether the weights are modified during the execution of
the program
15
Low-Level IR
• The low-level IR enables a different class of target-independent
optimizations that are not possible with the high-level graph format
• This is an instruction-based representation that operates on tensors that are
referenced by address
• This gives the compiler the ability to perform low-level memory
optimizations that are not possible at the high-level, because memory is not
represented directly
• Hiding the latency of memory operations is important for utilizing the
execution units of the hardware effectively, and the instruction-based
representation allows the compiler to create a schedule that hides the
latency of the memory operations
16
Low-Level IR
• The IR is not a Static Single Assignment (SSA) based representation,
because it does not support control flow
• The IR is strongly typed and each instruction operand kind has known
parameter types
• It is designed to be used as an in-memory form, though it can be
dumped to a human-readable, assembly-like format
17
Low-Level IR
• A function in IR form contains two sections:
'declare' and 'program’
• In the first section of the IR we declare a number
of memory regions that live throughout the
lifetime of the program
• This is similar to global variables in C
• The second part of the IR is a list of instructions
• There are two kinds of memory regions which
correspond to these two sections:
• global memory regions (found in 'declare’)
• and locally allocated regions (found in 'program’)
• The locally allocated memory regions are similar to
'alloca' in LLVM IR
• Memory regions are strongly typed, which
means that the type of tensor that the
region represents is known
18
• Note that the 'alloc' instruction does not
allocate memory; it just marks the lifetime
of the activation
Low-Level IR
• Instructions operate on either global
variables or locally allocated buffers
• Each operand is annotated with one of
the qualifiers '@in'/'@out'/'@inout’
• '@in' means that the buffer is read from
• '@out' means that the buffer is written
into
• And '@inout' means that the instruction
may read and write into the buffer
• These operand qualifiers help the
optimizer decide when it is legal to
perform certain optimizations, such as
copy elimination or buffer sharing
19
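The schematic listing below sketches what a low-level IR function might look like, combining the 'declare'/'program' split, WeightVars, the alloc/dealloc lifetime markers, and the operand qualifiers described above. The syntax is approximate and abridged; it is not an exact Glow dump.

```
declare {
  %input  = WeightVar float<8 x 28 x 28 x 1>   ; mutable (backed by a Placeholder)
  %filter = WeightVar float<16 x 5 x 5 x 1>    ; constant
  %bias   = WeightVar float<16>                ; constant
}

program {
  %act = allocactivation float<8 x 28 x 28 x 16>   ; marks lifetime, no real allocation
  %cv  = convolution @out %act, @in %input, @in %filter, @in %bias
  %rl  = relu @inout %act                          ; may read and write %act
  deallocactivation @out %act
}
```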
How Glow works
20
Glow lowers a traditional neural network dataflow graph into a
two-phase strongly-typed intermediate representation (IR):
1. The graph is either loaded via the graph loader (from ONNX or Caffe2 format), or constructed via the C++ interface
2. The high-level IR allows the optimizer to perform domain-specific optimizations
3. The lowering phase is designed to reduce the input space and allow new hardware backends to focus on a small number of linear algebra primitives
4. Additional rounds of optimizations occur, both target independent and target specific
5. IRGen translates the graph into the lower-level IR
6. The lower-level instruction-based address-only IR allows the compiler to perform memory-related optimizations, such as instruction scheduling, static memory allocation and copy elimination
7. At the lowest level, the optimizer performs machine-specific code generation to take advantage of specialized hardware features
Glow Quantization
Brief introduction to
Glow
Glow IR
Glow Quantization
Glow CPU Backend
21
Glow Quantization
• Glow is able to convert floating-point-based networks into signed 8-bit
integer networks
• The canonical quantization representation uses signed integers, though it
is possible to support other quantization formats
• Arithmetic using small integers is more efficient than the computation of full-
width floating-point numbers, and additionally decreases memory usage
• Glow uses profile-guided quantization, observing execution during
inference to estimate the possible numeric range for each stage of the
neural network
• Training-based quantization is considered future work
22
Tensor Representation
• In Glow, tensors are typed and can represent floats, quantized
non-floating-point values (currently Int8, i.e. 8-bit signed
integers), and index types
• To convert from the 8-bit integer range of [-128..127] to the floating-point
number that they represent, Glow uses the following conversion formula:
• Float value = (Int8 input - offset) * scale
• Activations, weights, and variables all use the same type-system and
represent information in a uniform way
23
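A minimal sketch of that conversion formula and its inverse, with clamping to the Int8 range (illustrative C++, not Glow's implementation):

```cpp
// Float value = (Int8 input - offset) * scale, and the inverse used when
// quantizing, clamped to [-128, 127]. Illustrative only.
#include <algorithm>
#include <cmath>
#include <cstdint>

float dequantize(int8_t q, float scale, int32_t offset) {
  return (static_cast<int32_t>(q) - offset) * scale;
}

int8_t quantize(float f, float scale, int32_t offset) {
  int32_t q = static_cast<int32_t>(std::round(f / scale)) + offset;
  return static_cast<int8_t>(std::clamp(q, -128, 127));
}
```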
Network Conversion
• Glow’s quantization conversion works
using a two-phase process
• First, we statically instrument the
network with special profiling nodes
that record the ranges of activations
that flow in the network, optimize the
network including these profiling nodes,
and then run inference
• Then, we recompile the network using
this profile information to convert the
network into a quantized form,
allowing for static optimization of the
quantized graph
• We convert portions of the network
into islands of integer computation
and aim to generate outputs in the
range that the original floating-point
network produces
24
A quantized subgraph from ResNet50 — worked example:
• Profiled range: Min = -2.259, Max = 7.031
• Derived quantization parameters: Scale = 0.0364, Offset = -66
• Float value = (Int8 input - offset) * scale
• Max: 7.031 = (input - (-66)) * 0.0364 → input = 127.159 → 127 (Int8)
• Min: -2.259 = (input - (-66)) * 0.0364 → input = -128.060 → -128 (Int8)
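The scale and offset in this example can be derived from the profiled range by mapping [Min, Max] onto the Int8 range [-128, 127]. The sketch below reproduces the slide's numbers; the exact rounding scheme Glow uses may differ.

```cpp
// Deriving (scale, offset) from a profiled range [min, max] so that the float
// range maps onto [-128, 127]. With min = -2.259 and max = 7.031 this gives
// scale ~= 0.0364 and offset ~= -66, matching the worked example above.
#include <cmath>
#include <cstdio>

int main() {
  float min = -2.259f, max = 7.031f;
  float scale = (max - min) / 255.0f;                                // ~0.0364
  int offset = static_cast<int>(std::round(-128.0f - min / scale));  // ~-66
  std::printf("scale = %f, offset = %d\n", scale, offset);
  // Check: dequantizing the Int8 extremes recovers the profiled range.
  std::printf("range = %f .. %f\n", (-128 - offset) * scale, (127 - offset) * scale);
}
```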
Compiler Optimizations
• There are a few classes of optimizations and parameters to optimize
• First, we attempt to minimize the number of conversions between floating-point tensors and
integer tensors, in both directions
• Some operations, such as 'transpose' and 'concat' operate on both types, and changing the representation can
minimize conversions
• Second, the neural network contains 'rescale' nodes that change the range of the integers
• These nodes are required to convert between numeric ranges that mimic the original floating-point network
• However, in many cases, it is possible to fold the rescale operations into numeric-producing operations, and
eliminate them
• Third, it's possible to rescale the values in the network in order to allow fast hardware
implementations of the quantized operations
• Normalizing both sides of the 'max' operation to the same scale allows the hardware to perform a simple,
efficient comparison (see the sketch below)
• For more specific graph optimizations, see the Glow documentation
25
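As an illustration of the second and third points, the sketch below shows what a 'rescale' does conceptually, and why bringing both inputs of a quantized 'max' to the same scale and offset reduces it to a plain integer comparison. Illustrative C++ only, not Glow's implementation.

```cpp
// Rescale: re-express a quantized value (q, scaleIn, offsetIn) in a new
// quantization (scaleOut, offsetOut) so that both represent the same real
// value: (q - offsetIn) * scaleIn == (q' - offsetOut) * scaleOut.
// Illustrative only; Glow folds such rescales into producers where possible.
#include <algorithm>
#include <cmath>
#include <cstdint>

int8_t rescale(int8_t q, float scaleIn, int32_t offsetIn,
               float scaleOut, int32_t offsetOut) {
  float real = (q - offsetIn) * scaleIn;
  int32_t out = static_cast<int32_t>(std::round(real / scaleOut)) + offsetOut;
  return static_cast<int8_t>(std::clamp(out, -128, 127));
}

// Once both inputs of 'max' share the same (scale, offset), the element-wise
// max is just an integer comparison -- no per-element float math is needed.
int8_t quantizedMax(int8_t a, int8_t b /* same scale and offset as a */) {
  return std::max(a, b);
}
```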
Glow CPU Backend
Brief introduction to
Glow
Glow IR
Glow Quantization
Glow CPU Backend
26
Introduction
• The CPU Backend is a JIT ("Just In Time") compiler that generates
code in memory on demand for the host CPU
• The host CPU can be x86, ARM, or anything that LLVM can target
• The Glow interpreter goes over the low-level IR one instruction at a
time and executes a switch statement that dispatches a C++
implementation for each instruction. This is suboptimal
• First, after each low-level instruction is executed via a function call, control
returns to the dispatch switch-loop
• Second, the C++ implementation of a low-level instruction has no
knowledge of the specific situation in which the instruction is being executed
27
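The sketch below illustrates this switch-based dispatch style, to contrast with the JIT described next. The types and instruction kinds are hypothetical, not Glow's actual interpreter code.

```cpp
// Interpreter-style dispatch: one switch per low-level instruction, each case
// running a generic kernel whose sizes are only known at run time.
// Hypothetical types/names, for illustration only.
#include <vector>

enum class Kind { Relu, ElementAdd /* ... */ };

struct Instruction {
  Kind kind;
  float *dst, *src0, *src1;  // operand buffers
  int size;                  // element count, unknown at compile time
};

void interpret(const std::vector<Instruction> &program) {
  for (const Instruction &I : program) {
    switch (I.kind) {  // dispatch overhead on every instruction
    case Kind::Relu:
      for (int i = 0; i < I.size; ++i)
        I.dst[i] = I.src0[i] > 0 ? I.src0[i] : 0;
      break;
    case Kind::ElementAdd:
      for (int i = 0; i < I.size; ++i)
        I.dst[i] = I.src0[i] + I.src1[i];
      break;
    }
  }
}
```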
Introduction
• The JIT, on the other hand, generates a single stream of highly
optimized instructions that don't go back to the interpreter
• Each instruction is optimized based on specific information on the context in
which the instruction is executed
• When a matrix multiplication is compiled, the JIT knows exactly the dimensions of the
matrices being multiplied and where the tensors are placed in memory
• The JIT knows whether the buffers alias, and exactly the number of iterations
of each loop
• This knowledge enables much better code generation and vectorization
• The JIT is also able to eliminate all calls to 'malloc', because the memory is
statically allocated
• The whole network is allocated by a single malloc call
28
How the JIT Works
• The JIT accepts the low-level IR, and allocates concrete memory addresses for the
AllocActivation instructions in the module
• After this process the allocator knows the maximum number of bytes that the network
consumes
• The allocator assigns offsets for each alloc activation within the buffer
• Then, the JIT performs a single call to 'malloc' to allocate the heap buffer
• At this point each activation and each weight has a concrete address on the heap
• Next, the JIT opens new LLVM functions and prepares for code generation
• The compiler goes over each low-level instruction and generates a sequence of LLVM-IR instructions
• After the LLVM module is generated, the compiler calls the LLVM optimizer to
optimize the generated module and the code generator to generate efficient
machine code
• At this point the compilation phase is complete, and the network is ready for execution
29
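The sketch below shows the static-allocation step in its simplest form: assign each activation an offset within one buffer, then make a single malloc for the whole network. It is simplified (no alignment or liveness-based buffer reuse), and the names are illustrative rather than Glow's.

```cpp
// Assign each activation an offset inside one contiguous buffer and allocate
// the whole network with a single malloc. Each activation's concrete address
// is then base + offset, so generated code needs no runtime allocation.
#include <cstdlib>
#include <vector>

struct Activation {
  size_t bytes;
  size_t offset;  // filled in by the allocator
};

char *allocateNetwork(std::vector<Activation> &activations) {
  size_t total = 0;
  for (Activation &a : activations) {  // assign offsets within the buffer
    a.offset = total;
    total += a.bytes;                  // (real allocator also reuses memory)
  }
  return static_cast<char *>(std::malloc(total));  // the one malloc call
}
```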
Usage of the Standard Library
• During the compilation process, each Glow low-level instruction is
converted into a sequence of LLVM-IR instructions
• One way to implement this lowering is to use the IRBuilder to generate low-level
programs
• This approach does not scale: implementing and maintaining low-level implementations of so many
operations directly in LLVM-IR is impractical
• Instead, the CPU backend compiles a small standard library into LLVM bitcode that it
ships with the compiler
• During the compilation process, Glow loads the bitcode from disk and specializes the operator
implementations for the specific context
• Glow replaces function arguments that represent the dimensions of some tensor or buffer
addresses with constants that LLVM can optimize to generate efficient code
• Most operators are very simple and the LLVM vectorizer is able to generate very efficient code
• The convolution and matrix multiplication operations are hand-optimized in C++ using the
clang extended OpenCL vector syntax, and LLVM does a good job allocating registers and
encoding the instructions, removing the need to use inline assembly
30
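The sketch below shows what such a standard-library kernel looks like: a generic element-wise operation whose sizes arrive as ordinary arguments. The function name and signature are hypothetical (Glow ships its real kernels, e.g. in its 'libjit' bitcode library); the point is that once the backend substitutes the actual constant for 'size', LLVM can unroll and vectorize the loop for that exact shape.

```cpp
// Hypothetical standard-library kernel, compiled to LLVM bitcode and shipped
// with the compiler. At compile time the backend replaces 'size' with the
// concrete constant for the node being lowered, enabling vectorization.
void example_element_add_f(float *dst, const float *lhs, const float *rhs,
                           unsigned long size) {
  for (unsigned long i = 0; i < size; ++i) {
    dst[i] = lhs[i] + rhs[i];
  }
}
```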
Operator Stacking
• One important optimization that the CPU backend implements is stacking of data-parallel
operators
• Consider a sequence of operators that operate one element at a time, for example a
ReLU, Add, Sub
• Iterating over a large buffer multiple times is inefficient because it requires the CPU to load the
memory multiple times, each time invalidating the whole cache
• Instead, Glow stacks operators and performs a few data-parallel operators one after the other on
the same memory location
• Operator stacking is similar to operator fusion
• However, when fusing multiple operators (e.g. Conv and ReLU fused together), all backends that
want to support this fused operator must implement a specific kernel for each permutation of
operators
• In contrast, Glow’s stacking automatically creates such kernels; all of the possible permutations of
data-parallel nodes are automatically fused into a fast kernel
31
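A sketch of what stacking a ReLU → Add → Sub chain means in practice: instead of three passes over the buffers, the stacked kernel performs all three operations per element in a single pass over memory. Illustrative C++ only; Glow emits the fused loop automatically during code generation.

```cpp
// Unstacked: three data-parallel operators, three passes over memory.
void unstacked(float *t, const float *a, const float *b, const float *c, int n) {
  for (int i = 0; i < n; ++i) t[i] = a[i] > 0 ? a[i] : 0;  // ReLU
  for (int i = 0; i < n; ++i) t[i] = t[i] + b[i];          // Add
  for (int i = 0; i < n; ++i) t[i] = t[i] - c[i];          // Sub
}

// Stacked: one pass, each element stays in a register across the chain.
void stacked(float *t, const float *a, const float *b, const float *c, int n) {
  for (int i = 0; i < n; ++i) {
    float v = a[i] > 0 ? a[i] : 0;  // ReLU
    v = v + b[i];                   // Add
    t[i] = v - c[i];                // Sub
  }
}
```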
End
Thanks!
Brief introduction to
Glow
Glow IR
Glow Quantization
Glow CPU Backend
32
Reference
• Glow: A community-driven approach to AI infrastructure
• Glow: Graph Lowering Compiler Techniques for Neural Networks
• https://github.com/pytorch/glow/
• https://github.com/pytorch/glow/issues/1575
33