Facebook Glow Compiler: a meetup for rambling
through the source code
@DeNA
Written: 2018/08/26, 9/16, 9/22, 10/28
Published on SlideShare: 2018/11/29
@Vengineer
Blog (since 2007): Vengineerの戯言 (Vengineer's idle talk)
 http://blogs.yahoo.co.jp/verification_engineer
SlideShare:
 https://www.slideshare.net/ssuser479fa3
Twitter (since 2009):
@Vengineer
Source-code analysis craftsman
A quick plug
A quick plug
I came all this way
to be teased
by the bearded guy.
That said, the one who
proposed this event
was me!
Beware of
the bearded guy.
One more
plug
Code that converts PyTorch to
XLA and runs ResNet-50 on a
Cloud TPU, perhaps?
Saturday, December 1, 2018
Now, the main topic
What is Glow?
Phase 1
 Deep learning frameworks
  ・Keras + TensorFlow: far ahead of the pack
  ・PyTorch
  ・Chainer: big in Japan?
Phase 2
 Graph compilers
  ・TensorFlow XLA
  ・NNVM (Relay) / TVM
  ・Glow
Glow: Graph Lowering Compiler
Techniques for Neural Networks
May 2, 2018
https://arxiv.org/abs/1805.00907
Facebook
Glow: A community-driven approach to
AI infrastructure
Sep 13, 2018
https://code.fb.com/ml-applications/glow-a-community-driven-approach-to-ai-infrastructure/
Facebook
@Scale
2018 Keynote: Glow: A community-driven
approach to AI
SEPTEMBER 19, 2018
https://atscaleconference.com/videos/scale-2018-keynote-glow-a-community-driven-approach-to-ai/
Facebook
Now, let's look
at the source code
$ sudo apt-get install graphviz cmake wget libpng-dev \
    ninja-build clang llvm-5.0 \
    libprotobuf-dev protobuf-compiler
  cmake 3.7.1 or later is required;
I installed 3.12.1 from source separately.
 llvm 6.0 or 7.0 also seem to work.
Setup
$ git clone https://github.com/pytorch/glow.git
$ cd glow
$ git submodule update --init --recursive
$ mkdir build_Debug
$ cd build_Debug
$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug ..
$ ninja all
$ ninja test
 
Build
In CMakeLists.txt, change
option(GLOW_WITH_OPENCL "Build the OpenCL backend" ON)
to
option(GLOW_WITH_OPENCL "Build the OpenCL backend" OFF)
or pass the following parameter on the command line:
-DGLOW_WITH_OPENCL=OFF

OpenCL is ON by default
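For example, combined with the build commands shown earlier, the same configuration with the OpenCL backend disabled would be:

$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DGLOW_WITH_OPENCL=OFF ..
$ ninja all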
https://github.com/pytorch/glow
Glow : Graph Compiler & Execution Engine
High-Level Graph => Low-Level IR => Machine Code
 
TensorFlow XLA: a JIT compiler (since r1.5)
 TensorFlow graph
  => converted into an XLA graph
  => optimization pass 1 (HLO, High Level Optimizer):
     target-hardware-independent optimizations on the XLA graph
  => optimization pass 2 and code generation (LLO, Low Level Optimizer):
     target-hardware-dependent optimizations
  => an executable object for the target hardware
High-Level IR
 ・domain-specific optimizations
Low-Level IR
 ・memory-related optimizations:
   instruction scheduling
   static memory allocation
   memory-copy elimination
 ・machine-dependent code generation
What kinds of things does Glow do?
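Both levels can be inspected from C++. A minimal sketch, assuming the API that appears on the following slides (Function::dump(), glow::generateAndOptimizeIR, and IRFunction::dump()):

// Sketch: inspect the high-level graph, then the low-level IR.
ExecutionEngine EE(BackendKind::Interpreter);
Function *F = EE.getModule().createFunction("main");
// ... create nodes on F, e.g. createConv / createRELU (see the mnist slides) ...
F->dump(); // high-level, domain-specific graph
auto IR = glow::generateAndOptimizeIR(F, /*shouldShareBuffers=*/true);
IR->dump(); // low-level, memory-aware instruction stream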
ExecutionEngine EE(executionBackend);
TrainingConfig TC;
TC.learningRate = 0.001;
TC.momentum = 0.9;
TC.L2Decay = 0.001;
TC.batchSize = minibatchSize;
Function *T = glow::differentiate(F, TC); // <= training requires this
EE.compile(CompilationMode::Train, T); // <= CompilationMode::Train
Example: let's look at mnist (it can even do training)
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
Tensor imageInputs;
Tensor labelInputs;
Variable *A =
mod.createVariable(ElemKind::FloatTy, {minibatchSize, 28, 28, 1}, "input",
VisibilityKind::Public, false);
Variable *selected =
mod.createVariable(ElemKind::Int64ITy, {minibatchSize, 1}, "selected",
VisibilityKind::Public, false);
unsigned numImages = loadMNIST(imageInputs, labelInputs);
EE.runBatch(numIterations, {A, selected}, {&imageInputs, &labelInputs});
Example: let's look at mnist (it can even do training)
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
auto *result = F->createSave("return", SM);
EE.compile(CompilationMode::Infer, F); // <= CompilationMode::Infer
Tensor sample(ElemKind::FloatTy, {minibatchSize, 28, 28, 1});
for (int iter = numIterations; iter < numIterations + 10; iter++) {
sample.copyConsecutiveSlices(&imageInputs, minibatchSize * iter);
EE.run({A}, {&sample});
Tensor &res = result->getVariable()->getPayload();
Example: let's look at mnist (and of course inference works too)
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
llvm::cl::opt<BackendKind> executionBackend(
llvm::cl::desc("Backend to use:"),
llvm::cl::values(clEnumValN(BackendKind::Interpreter, "interpreter",
"Use interpreter (default option)"),
clEnumValN(BackendKind::CPU, "cpu", "Use CPU"),
clEnumValN(BackendKind::OpenCL, "opencl", "Use OpenCL")
),
llvm::cl::init(BackendKind::Interpreter),
llvm::cl::cat(mnistCat)
);
The backends: "Interpreter (default)", "CPU", and "OpenCL"
Which backends are available?
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
auto *CV0 = F->createConv("conv", A, 16, 5, 1, 2, 1);
auto *RL0 = F->createRELU("relu", CV0);
auto *MP0 = F->createMaxPool("pool", RL0, 3, 3, 0);
auto *CV1 = F->createConv("conv", MP0, 16, 5, 1, 2, 1);
auto *RL1 = F->createRELU("relu", CV1);
auto *MP1 = F->createMaxPool("pool", RL1, 3, 3, 0);
auto *FCL1 = F->createFullyConnected("fc", MP1, 10);
auto *SM = F->createSoftMax("sm", FCL1, selected);
auto *result = F->createSave("return", SM);
Building the mnist model
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
The Lifetime of
a Glow Instruction
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
Loading a model
ONNX
Caffe2
PyTorch 1.0
PyTorch + Caffe2 + Glow
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
ExecutionEngine EE{BackendKind::Interpreter};
auto &mod = EE.getModule();
Function *F = mod.createFunction("main");
std::string NetFilename("tests/models/onnxModels/simpleConv.onnxtxt");
Variable *graphOutputVar;
Tensor data;
getNCHWData(&data, 1, 1, 3, 3);
ONNXModelLoader onnxLD(NetFilename, {"data"}, {&data}, *F);
graphOutputVar = onnxLD.getSingleOutput();
EE.compile(CompilationMode::Infer, F);
EE.run({}, {});
Load an ONNX model, compile, and run inference
https://github.com/pytorch/glow/blob/master/tests/unittests/onnxImporterTest.cpp#L28
ExecutionEngine EE{BackendKind::Interpreter};
auto &mod = EE.getModule();
Function *F = mod.createFunction("main");
std::string NetDescFilename("tests/models/caffe2Models/predict_net.pbtxt");
std::string NetWeightFilename("tests/models/caffe2Models/init_net.pbtxt");
Variable *output;
Tensor data;
getNCHWData(&data, 1, 1, 3, 3);
Caffe2ModelLoader caffe2LD(NetDescFilename, NetWeightFilename,
{"data"}, {&data}, *F);
output = caffe2LD.getSingleOutput();
EE.compile(CompilationMode::Infer, F);
EE.run({}, {});
Load a Caffe2 model, compile, and run inference
https://github.com/pytorch/glow/blob/master/tests/unittests/caffe2ImporterTest.cpp
ExecutionEngine
 - compile a model
 - run a model
 - save a model
ExecutionEngine
 compile: the backend's generateIR (IR generation)
          and creation of a CompiledFunction (one per backend)
 run:     execution of the CompiledFunction (execute)
 save:    the backend's save (saves the IR)
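Putting compile and run together, a typical inference flow looks like this (a minimal sketch assembled from the mnist and importer slides in this deck):

ExecutionEngine EE(BackendKind::Interpreter);
auto &mod = EE.getModule();
Function *F = mod.createFunction("main");
// ... build the graph on F (see the mnist slides) ...
EE.compile(CompilationMode::Infer, F); // generate + optimize IR, create the CompiledFunction
EE.run({}, {}); // execute the CompiledFunction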
void ExecutionEngine::compile(CompilationMode mode, Function *F,
                              const Context &ctx) {
  optimizeFunction(mode, F); // optimization (covered later)
  function_ = backend_->compile(F, ctx); // compilation (covered later)
}
The mode argument is used during optimization
ExecutionEngine::compile
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void glow::runBatch(ExecutionEngine &EE, size_t iterations,
                    size_t &sampleCounter, llvm::ArrayRef<Variable *> vars,
                    llvm::ArrayRef<Tensor *> inputs) {
  size_t batchSize = vars[0]->getType()->dims()[0];
  for (size_t i = 0; i < iterations; i++) {
    // Copy the next consecutive slice of every input tensor into the
    // tensor backing the corresponding variable, essentially:
    //   size_t slc = sampleCounter % inputs[i]->dims()[0];
    //   backingTensor->copyConsecutiveSlices(inputs[i], slc);
    glow::updateVariablesFromBatch(vars, inputs, sampleCounter);
    // Run the network.
    EE.run();
    sampleCounter += batchSize;
  }
}
glow::runBatch
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void ExecutionEngine::run(Context &ctx) {
  assert(function_ && "No function has been compiled");
  // Make sure that the context has backing tensors for all placeholders.
  ctx.allocate(M_.getPlaceholders());
  function_->setupRuns();
  function_->beforeRun(ctx);
  function_->execute();
  function_->afterRun(ctx);
  function_->tearDownRuns();
}
ExecutionEngine::run
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void ExecutionEngine::save(CompilationMode mode, Function *F,
                           llvm::StringRef outputDir,
                           llvm::StringRef networkName) {
  optimizeFunction(mode, F); // optimization (covered later)
  backend_->save(F, outputDir, networkName);
}
ExecutionEngine::save
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
Optimization
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
void ExecutionEngine::compile(CompilationMode mode, Function *F,
                              const Context &ctx) {
  optimizeFunction(mode, F); // optimization
  function_ = backend_->compile(F, ctx); // compilation (covered later)
}
The mode argument is used during optimization
ExecutionEngine::compile
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void ExecutionEngine::optimizeFunction(CompilationMode mode,
                                       Function *F) {
// Verify the function pre-optimization/lowering.
F->verify();
// Optimize the graph.
::glow::optimize(F, mode);
// Allow the backend to transform the graph prior to lowering.
if (backend_->transformPreLowering(F, mode)) {
// Optimize the graph again after the backend transformation.
// In particular, DCE is very likely to be useful.
::glow::optimize(F, mode);
}
ExecutionEngine::optimizeFunction
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
// Lower the graph into a sequence of low-level linear algebra operations.
::glow::lower(F, *backend_);
// Optimize the graph again.
::glow::optimize(F, mode);
// Allow the backend to transform the graph after lowering.
if (backend_->transformPostLowering(F, mode)) {
// Optimize the graph again after the backend transformation.
// In particular, DCE is very likely to be useful.
::glow::optimize(F, mode);
}
}
ExecutionEngine::optimizeFunction
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
1) ::glow::optimize(F, mode);
2) if (backend_->transformPreLowering(F, mode))
     ::glow::optimize(F, mode);
3) ::glow::lower(F, *backend_);
4) ::glow::optimize(F, mode);
5) if (backend_->transformPostLowering(F, mode))
     ::glow::optimize(F, mode);
Extracting just the optimization steps of optimizeFunction
1) ::glow::optimize(F, mode);
2) if (backend_->transformPreLowering(F, mode))
     ::glow::optimize(F, mode);
3) ::glow::lower(F, *backend_);
4) ::glow::optimize(F, mode);
5) if (backend_->transformPostLowering(F, mode))
     ::glow::optimize(F, mode);
transformPreLowering / transformPostLowering
In the current implementations (Interpreter, CPU, OpenCL),
transformPostLowering is implemented by CPU and OpenCL,
but nothing implements transformPreLowering.
In CPUBackend:
 1) Convolution is replaced with a CPU-optimized version
 2) MaxPooling and Splat are merged and replaced with CPUMaxSplat
In OpenCLBackend:
 1) Convolution is replaced with an OpenCL-optimized version
 2) MaxPooling is replaced with an OpenCL-optimized version
 3) AvgPooling is replaced with an OpenCL-optimized version
Implementations of transformPreLowering / transformPostLowering
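For intuition, a hypothetical backend overriding the post-lowering hook could look like the sketch below. MyBackend and the replacement logic are made up for illustration; only the transformPostLowering signature follows the calls shown above.

bool MyBackend::transformPostLowering(Function *F, CompilationMode mode) const {
  (void)mode;
  bool changed = false;
  for (auto &node : F->getNodes()) {
    if (auto *CN = llvm::dyn_cast<ConvolutionNode>(&node)) {
      // A real backend builds its specialized convolution node here and
      // replaces CN's users with it (this is what CPUBackend does).
      (void)CN;
      changed = true;
    }
  }
  return changed; // true => the engine runs ::glow::optimize(F, mode) again
}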
Backends
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
// The backend kind is specified when the ExecutionEngine is instantiated
ExecutionEngine EE(executionBackend);
ExecutionEngine.h
  // The default is Interpreter
ExecutionEngine(BackendKind backendKind = BackendKind::Interpreter);
ExecutionEngine.cpp
// Create a backend of the specified kind
ExecutionEngine::ExecutionEngine(BackendKind backendKind)
    : backend_(createBackend(backendKind)) {}
ExecutionEngine::ExecutionEngine
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
Backend *glow::createBackend(BackendKind backendKind) {
  switch (backendKind) {
  case BackendKind::Interpreter: // Interpreter (naive implementation)
    return createInterpreter();
  case BackendKind::OpenCL: // OpenCL (host code & OpenCL kernels)
    return createOCLBackend();
  case BackendKind::CPU: // CPU (LLVM)
    return createCPUBackend();
  }
  llvm_unreachable("unreachable");
}
glow::createBackend
https://github.com/pytorch/glow/blob/master/lib/Backends/Backends.cpp
Backend *createInterpreter() { return new Interpreter(); }
Backend *createCPUBackend() { return new CPUBackend(); }
Backend *createOCLBackend() { return new OCLBackend(); }
Creating the backends
https://github.com/pytorch/glow/blob/master/lib/Backends/
Compilation: compile
 The backend's compile
  => generateAndOptimizeIR: IR generation & IR optimization
  => compileIR: code generation from the IR
virtual std::unique_ptr<CompiledFunction>
compile(std::unique_ptr<IRFunction> IR) const = 0;
InterpreterBackend
 llvm::make_unique<InterpreterFunction>(std::move(IR))
CPUBackend
 llvm::make_unique<CPUFunction>(std::move(JIT), heap)
OpenCLBackend
 llvm::make_unique<OpenCLFunction>(std::move(IR))
compile
https://github.com/pytorch/glow/blob/master/include/glow/Backends/Backend.h#L43
std::unique_ptr<CompiledFunction>
Interpreter::compile(Function *F) const {
auto IR = generateAndOptimizeIR(F, shouldShareBuffers());
return compileIR(std::move(IR));
}
Interpreter::compile
https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/Interpreter.cpp#L27
std::unique_ptr<CompiledFunction>
Interpreter::compileIR(std::unique_ptr<IRFunction> IR) const {
MemoryAllocator constantWeightsAllocator("ConstantWeights", 0);
MemoryAllocator placeholderWeightsAllocator("PlaceholderWeights", 0);
MemoryAllocator activationsAllocator("Activations", 0);
runtime::RuntimeBundle bundle =
generateRuntimeBundle(*IR, constantWeightsAllocator,
placeholderWeightsAllocator, activationsAllocator);
  return llvm::make_unique<InterpreterFunction>(std::move(IR), bundle);
}
Interpreter::compileIR
https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/Interpreter.cpp#L27
std::unique_ptr<CompiledFunction>
CPUBackend::compile(Function *F) const {
auto IR = generateAndOptimizeIR(F, shouldShareBuffers());
return compileIR(std::move(IR));
}
CPUBackend::compile
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUBackend.cpp#L146
std::unique_ptr<CompiledFunction>
CPUBackend::compileIR(std::unique_ptr<IRFunction> IR) const {
AllocationsInfo allocationsInfo;
std::unique_ptr<LLVMIRGen> irgen = createIRGen(IR.get(), allocationsInfo);
irgen->initTargetMachine(target.empty() ? "" : target.getValue(),
llvm::CodeModel::Model::Large);
irgen->initCodeGen();
allocateJITMemory(IR.get(), irgen->getAllocationsInfo());
emitJitMain(*irgen);
irgen->performCodeGen();
CPUBackend::compileIR
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUBackend.cpp
auto JIT = llvm::make_unique<llvm::orc::GlowJIT>(irgen->getTargetMachine());
JIT->addModule(irgen->borrowModule());
MemoryAllocator constantAllocator("ConstantWeights", 0);
MemoryAllocator placeholderAllocator("Placeholders", 0);
MemoryAllocator activationsAllocator("Activations", 0);
runtime::RuntimeBundle runtimeInfo = generateRuntimeBundle(
*IR, constantAllocator, placeholderAllocator, activationsAllocator);
return llvm::make_unique<CPUFunction>(std::move(JIT), runtimeInfo);
}
CPUBackend::compileIR
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUBackend.cpp
std::unique_ptr<CompiledFunction>
OCLBackend::compile(Function *F) const {
auto IR = generateAndOptimizeIR(F, shouldShareBuffers());
return compileIR(std::move(IR));
}
OpenCLBackend::compile
https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.cpp
std::unique_ptr<CompiledFunction>
OCLBackend::compileIR(std::unique_ptr<IRFunction> IR) const {
MemoryAllocator allocator("GPU", 0xFFFFFFFF);
runtime::RuntimeBundle bundle =
generateRuntimeBundle(*IR, allocator, allocator, allocator);
  return llvm::make_unique<OpenCLFunction>(std::move(IR), bundle);
}
OpenCLBackend::compileIR
https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.cpp
IR Generation
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
std::unique_ptr<IRFunction>
glow::generateAndOptimizeIR(Function *F,
                            bool shouldShareBuffers) {
  auto IR = llvm::make_unique<IRFunction>(F);
  // Generate the IR
  IR->generateIR();
  // Optimize it, consulting the backend
  ::glow::optimize(*IR, shouldShareBuffers);
  return IR;
}
IR generation, then IR optimization driven by the backend
https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp
void IRFunction::generateIR() {
assert(G_->verify() && "Invalid function");
// Schedule the nodes.
NodesPtrList ScheduledNodes;
scheduleGraph(ScheduledNodes);
IRGenVisitor irgen(this);
for (auto &N : ScheduledNodes) {
N->visit(nullptr, &irgen);
}
}
IR generation
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void IRFunction::scheduleGraph(NodesPtrList &Schedule) {
Schedule.clear();
for (auto &N : G_->getParent()->getVars()) {
Schedule.push_back(N);
}
for (auto &N : G_->getParent()->getPlaceholders()) {
Schedule.push_back(N);
}
Scheduling the graph: first half
https://github.com/pytorch/glow/blob/master/lib/IR/GraphScheduler.cpp
auto numVars = G_->getParent()->getConstants().size();
auto numPlaceholders = G_->getParent()->getPlaceholders().size();
(void)numVars;
(void)numPlaceholders;
std::unique_ptr<Scheduler> scheduler{
createScheduler(graphScheduler, *G_, Schedule)};
scheduler->schedule();
assert(scheduler->getSchedule().size() ==
G_->getNodes().size() + numPlaceholders + numVars &&
"All graph nodes have to be scheduled");
}
Scheduling the graph: second half
https://github.com/pytorch/glow/blob/master/lib/IR/GraphScheduler.cpp#L172
IR Optimization
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
std::unique_ptr<IRFunction>
glow::generateAndOptimizeIR(Function *F, bool shouldShareBuffers) {
  auto IR = llvm::make_unique<IRFunction>(F);
  // Generate the IR
  IR->generateIR();
  // Optimize it, consulting the backend
  ::glow::optimize(*IR, shouldShareBuffers);
  return IR;
}
IR generation, then IR optimization driven by the backend
https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp
void glow::optimize(IRFunction &M, CompilationMode mode, const Backend &B) {
M.verify();
if (!optimizeIR) return;
performPeepholeOptimizations(M);
eliminateDeadStores(M);
// Replace applicable InsertTensors and ExtractTensors with TensorViews.
optimizeInserts(M);
optimizeExtracts(M);
  if (B.shouldShareBuffers()) // Reuse buffers from previous operations.
    shareBuffers(M);
IR optimization
https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp#L1602
performPeepholeOptimizations(M);
hoistDealloc(M); // Shorten the lifetime of buffers.
sinkAllocas(M);
eliminateDeadStores(M); // Perform Dead Store Elimination.
deleteDeadAllocs(M);
makeWeightsConst(M); // Turn read-only weights into constant weights.
performDebugInstrumentation(M);
if (dumpOptMod) // Print the module to stdout if requested.
M.dump();
M.verify();
}
IR optimization (continued)
https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp#L1596
Execution: execute
class CompiledFunction {
public:
virtual ~CompiledFunction() = default;
virtual void execute() = 0;
virtual void setupRuns() = 0;
virtual void beforeRun(const Context &ctx) = 0;
virtual void afterRun(const Context &ctx) = 0;
virtual void tearDownRuns() = 0;
};
CompiledFunction
https://github.com/pytorch/glow/blob/master/include/glow/Backends/CompiledFunction.h
class InterpreterFunction final : public CompiledFunction {
/// The IR to be executed.
std::unique_ptr<IRFunction> F_;
/// Maps values to Tensors, that are owned by this class.
std::unordered_map<const Value *, Tensor *> tensors_;
/// Maps values to Tensors, that are *not* owned by this class.
std::unordered_map<const Value *, Tensor *> externalTensors_;
public:
InterpreterFunction(std::unique_ptr<IRFunction> F, const Context &ctx);
~InterpreterFunction() override;
void execute() override;
InterpreterFunction
https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/InterpreterFunction.h#L43
void InterpreterFunction::execute() {
#define DEF_VALUE(CLASS, NAME)
#define DEF_INSTR(CLASS, NAME)                                                 \
  case Kinded::Kind::CLASS##Kind: {                                            \
    fwd##CLASS(llvm::cast<CLASS>(&I));                                         \
    break;                                                                     \
  }
#define DEF_BACKEND_SPECIFIC_INSTR(CLASS, NAME)
  for (const auto &I : F_->getInstrs()) {
    switch (I.getKind()) { // <= dispatch on each operator kind!
#include "glow/AutoGenInstr.def"
    default:
      llvm_unreachable("Invalid instruction.");
    }
  }
}
InterpreterFunction::execute
https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/InterpreterFunction.cpp
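The dispatch relies on the X-macro trick: glow/AutoGenInstr.def is included with different DEF_INSTR definitions to stamp out code per instruction kind. A standalone illustration of the pattern (all names here are made up, not Glow's):

#include <cstdio>

#define FOR_EACH_INSTR(M) M(Add) M(Mul) // stand-in for glow/AutoGenInstr.def

// First expansion: generate one fwd* handler per instruction kind.
#define DEF_INSTR(CLASS) void fwd##CLASS() { std::puts("fwd" #CLASS); }
FOR_EACH_INSTR(DEF_INSTR)
#undef DEF_INSTR

enum class Kind { Add, Mul };

// Second expansion: generate one switch case per instruction kind.
void execute(Kind k) {
  switch (k) {
#define DEF_INSTR(CLASS) case Kind::CLASS: fwd##CLASS(); break;
    FOR_EACH_INSTR(DEF_INSTR)
#undef DEF_INSTR
  }
}

int main() {
  execute(Kind::Add); // prints "fwdAdd"
  execute(Kind::Mul); // prints "fwdMul"
}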
class CPUFunction final : public CompiledFunction {
std::unique_ptr<llvm::orc::GlowJIT> JIT_;
void *heap_;
public:
CPUFunction(std::unique_ptr<llvm::orc::GlowJIT> JIT, void *heap);
~CPUFunction() override;
void execute() override;
};
CPUFunction
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUFunction.h
void CPUFunction::execute() {
auto sym = JIT_->findSymbol("jitmain");
using JitFuncType =
void (*)(uint8_t * constantWeightVars, uint8_t * mutableWeightVars,
uint8_t * activations);
auto address = sym.getAddress();
if (address) {
JitFuncType funcPtr = reinterpret_cast<JitFuncType>(address.get());
funcPtr(runtimeBundle_.getConstants(), baseMutableWeightVarsAddress_,
baseActivationsAddress_);
} else {
GLOW_ASSERT(false && "Error getting address.");
}
}
CPUFunction::execute
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUFunction.cpp#L29
class OpenCLFunction final : public CompiledFunction {
cl_device_id deviceId_;
cl_context context_;
cl_command_queue commands_;
cl_mem deviceBuffer_{0};
std::vector<KernelLaunch> kernelLaunches_;
public:
explicit OpenCLFunction(std::unique_ptr<IRFunction> F);
~OpenCLFunction() override;
void execute() override;
OpenCLFunction
https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.h
void OpenCLFunction::execute() {
  // Really long.
  //
  // Basically:
  //
  // loop over the layers:
  //
  //   1) generate the host-side code / OpenCL kernel for the layer
  //   2) compile the OpenCL kernel
  //   3) launch it following the usual OpenCL conventions (enqueueKernel)
  //
  // clFinish(commands_); then waits until every OpenCL kernel has finished.
}
OpenCLFunction::execute
https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.cpp
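A rough sketch of step 3 and the final wait, using only standard OpenCL host API calls (program creation and error handling omitted; "conv_forward" is a made-up kernel name):

cl_int err;
cl_kernel kernel = clCreateKernel(program, "conv_forward", &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &deviceBuffer);
size_t global[1] = {numElements};
clEnqueueNDRangeKernel(commands, kernel, /*work_dim=*/1, nullptr, global,
                       nullptr, 0, nullptr, nullptr); // queue one layer's kernel
// ... enqueue the kernels for the remaining layers ...
clFinish(commands); // block until every enqueued kernel has completed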
Quantization (FP32 => INT8)
https://github.com/pytorch/glow/blob/master/docs/Quantization.md
 ・FP32 => INT8
 ・Profile-guided quantization:
   observe execution during inference and estimate
   the possible numeric range of each stage of the network
 ・Training-based quantization: support is under consideration for the future
Quantization in Glow
https://github.com/pytorch/glow/blob/master/docs/Quantization.md
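The scale/offset pairs that appear on the next slide fit the usual affine mapping float ≈ scale * (int8 - offset). A minimal sketch of that arithmetic (my own illustration of the idea, not Glow's actual implementation):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

struct QuantParams {
  float scale;
  int32_t offset;
};

// Map an observed [min, max] range (from profiling) onto the int8 range.
QuantParams chooseParams(float min, float max) {
  float scale = (max - min) / 255.0f;
  int32_t offset = static_cast<int32_t>(std::round(-128.0f - min / scale));
  return {scale, offset};
}

int8_t quantize(float v, QuantParams p) {
  int32_t q = static_cast<int32_t>(std::round(v / p.scale)) + p.offset;
  return static_cast<int8_t>(std::min(127, std::max(-128, q)));
}

float dequantize(int8_t q, QuantParams p) { return p.scale * (q - p.offset); }

int main() {
  QuantParams p = chooseParams(-1.0f, 1.0f); // e.g. a tanh output
  std::printf("%f\n", dequantize(quantize(0.5f, p), p)); // ~0.5
}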
std::vector<NodeQuantizationInfo> QI{
{NodeQuantizationInfo::generateNodeOutputName(input->getName()),
{0.2f, 0}},
{NodeQuantizationInfo::generateNodeOutputName(W->getName()), {0.3f, 0}},
{NodeQuantizationInfo::generateNodeOutputName(B->getName()), {0.4f, 0}},
{NodeQuantizationInfo::generateNodeOutputName(FC->getName()), {0.6f, 0}},
};
F = quantization::quantizeFunction(EE, QI, F);
// Make sure that graph can be compiled and run.
EE.compile(CompilationMode::Infer, F);
EE.run({}, {});
Example of quantization::quantizeFunction
https://github.com/pytorch/glow/blob/master/tests/unittests/quantizationTest.cpp
Function *
quantizeFunction(const ExecutionEngine &EE,
llvm::ArrayRef<NodeQuantizationInfo> quantizationInfos,
Function *F, llvm::StringRef newFuncName = "");
quantization::quantizeFunction
https://github.com/pytorch/glow/blob/master/include/glow/Quantization/Quantization.h
https://github.com/pytorch/glow
Glow : Graph Compiler & Execution Engine
High-Level Graph => Low-Level IR => Machine Code
 
Backends
  Interpreter
CPU
OpenCL
 
I am not
a deep learning craftsman;
I am a computer engineer.
Thank you very much!
@Vengineer
Source-code analysis craftsman
