Facebook Glow Compiler: a meetup for rambling
through the source code
@DeNA
Written: 2018/08/26, 9/16, 9/22, 10/28
Published on SlideShare: 2018/11/29
@Vengineer
Blog (since 2007): Vengineerの戯言 (Vengineer's idle talk)
 http://blogs.yahoo.co.jp/verification_engineer
SlideShare:
 https://www.slideshare.net/ssuser479fa3
Twitter (since 2009):
@Vengineer
Source-code analysis craftsman
A quick plug
A quick plug
I came all this way
to be teased
by the bearded guy.
That said, the one who
proposed this event
was me!
Beware of
the bearded guy.
One more
plug
Code that converts PyTorch to
XLA and runs ResNet-50 on a
Cloud TPU, perhaps?
Saturday, December 1, 2018
Now, the main topic
What is Glow?
Phase 1
 Deep learning frameworks
  ・Keras + TensorFlow: far ahead of the pack
  ・PyTorch
  ・Chainer: big in Japan?
Phase 2
 Graph compilers
  ・TensorFlow XLA
  ・NNVM (Relay) / TVM
  ・Glow
Glow: Graph Lowering Compiler
Techniques for Neural Networks
May 2, 2018
https://arxiv.org/abs/1805.00907
Facebook
Glow: A community-driven approach to
AI infrastructure
Sep 13, 2018
https://code.fb.com/ml-applications/glow-a-community-driven-approach-to-ai-infrastructure/
Facebook
@Scale
2018 Keynote: Glow: A community-driven
approach to AI
SEPTEMBER 19, 2018
https://atscaleconference.com/videos/scale-2018-keynote-glow-a-community-driven-approach-to-ai/
Facebook
Now, let's look
at the source code
$ sudo apt-get install graphviz cmake wget libpng-dev \
    ninja-build clang llvm-5.0 \
    libprotobuf-dev protobuf-compiler
  cmake 3.7.1 or later is required;
I installed 3.12.1 from source separately.
 llvm 6.0 or 7.0 also seem to work.
Setup
$ git clone https://github.com/pytorch/glow.git
$ cd glow
$ git submodule update --init --recursive
$ mkdir build_Debug
$ cd build_Debug
$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug ..
$ ninja all
$ ninja test
 
Build
In CMakeLists.txt, change
option(GLOW_WITH_OPENCL "Build the OpenCL backend" ON)
to
option(GLOW_WITH_OPENCL "Build the OpenCL backend" OFF)
or pass the following parameter on the command line:
-DGLOW_WITH_OPENCL=OFF

OpenCL is ON by default
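For example, combined with the build commands shown earlier, the same configuration with the OpenCL backend disabled would be:

$ cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DGLOW_WITH_OPENCL=OFF ..
$ ninja all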
https://github.com/pytorch/glow
Glow : Graph Compiler & Execution Engine
High-Level Graph => Low-Level IR => Machine Code
 
TensorFlow XLA: a JIT compiler (since r1.5)
 TensorFlow graph
  => converted into an XLA graph
  => optimization pass 1 (HLO, High Level Optimizer):
     target-hardware-independent optimizations on the XLA graph
  => optimization pass 2 and code generation (LLO, Low Level Optimizer):
     target-hardware-dependent optimizations
  => an executable object for the target hardware
High-Level IR
 ・domain-specific optimizations
Low-Level IR
 ・memory-related optimizations:
   instruction scheduling
   static memory allocation
   memory-copy elimination
 ・machine-dependent code generation
What kinds of things does Glow do?
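Both levels can be inspected from C++. A minimal sketch, assuming the API that appears on the following slides (Function::dump(), glow::generateAndOptimizeIR, and IRFunction::dump()):

// Sketch: inspect the high-level graph, then the low-level IR.
ExecutionEngine EE(BackendKind::Interpreter);
Function *F = EE.getModule().createFunction("main");
// ... create nodes on F, e.g. createConv / createRELU (see the mnist slides) ...
F->dump(); // high-level, domain-specific graph
auto IR = glow::generateAndOptimizeIR(F, /*shouldShareBuffers=*/true);
IR->dump(); // low-level, memory-aware instruction stream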
ExecutionEngine EE(executionBackend);
TrainingConfig TC;
TC.learningRate = 0.001;
TC.momentum = 0.9;
TC.L2Decay = 0.001;
TC.batchSize = minibatchSize;
Function *T = glow::differentiate(F, TC); // <= training requires this
EE.compile(CompilationMode::Train, T); // <= CompilationMode::Train
Example: let's look at mnist (it can even do training)
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
Tensor imageInputs;
Tensor labelInputs;
Variable *A =
mod.createVariable(ElemKind::FloatTy, {minibatchSize, 28, 28, 1}, "input",
VisibilityKind::Public, false);
Variable *selected =
mod.createVariable(ElemKind::Int64ITy, {minibatchSize, 1}, "selected",
VisibilityKind::Public, false);
unsigned numImages = loadMNIST(imageInputs, labelInputs);
EE.runBatch(numIterations, {A, selected}, {&imageInputs, &labelInputs});
Example: let's look at mnist (it can even do training)
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
auto *result = F->createSave("return", SM);
EE.compile(CompilationMode::Infer, F); // <= CompilationMode::Infer
Tensor sample(ElemKind::FloatTy, {minibatchSize, 28, 28, 1});
for (int iter = numIterations; iter < numIterations + 10; iter++) {
sample.copyConsecutiveSlices(&imageInputs, minibatchSize * iter);
EE.run({A}, {&sample});
Tensor &res = result->getVariable()->getPayload();
Example: let's look at mnist (and of course inference works too)
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
llvm::cl::opt<BackendKind> executionBackend(
llvm::cl::desc("Backend to use:"),
llvm::cl::values(clEnumValN(BackendKind::Interpreter, "interpreter",
"Use interpreter (default option)"),
clEnumValN(BackendKind::CPU, "cpu", "Use CPU"),
clEnumValN(BackendKind::OpenCL, "opencl", "Use OpenCL")
),
llvm::cl::init(BackendKind::Interpreter),
llvm::cl::cat(mnistCat)
);
The backends: "Interpreter (default)", "CPU", and "OpenCL"
Which backends are available?
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
auto *CV0 = F->createConv("conv", A, 16, 5, 1, 2, 1);
auto *RL0 = F->createRELU("relu", CV0);
auto *MP0 = F->createMaxPool("pool", RL0, 3, 3, 0);
auto *CV1 = F->createConv("conv", MP0, 16, 5, 1, 2, 1);
auto *RL1 = F->createRELU("relu", CV1);
auto *MP1 = F->createMaxPool("pool", RL1, 3, 3, 0);
auto *FCL1 = F->createFullyConnected("fc", MP1, 10);
auto *SM = F->createSoftMax("sm", FCL1, selected);
auto *result = F->createSave("return", SM);
Building the mnist model
https://github.com/pytorch/glow/blob/master/examples/mnist.cpp
The Lifetime of
a Glow Instruction
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
Loading a model
ONNX
Caffe2
PyTorch 1.0
PyTorch + Caffe2 + Glow
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
ExecutionEngine EE{BackendKind::Interpreter};
auto &mod = EE.getModule();
Function *F = mod.createFunction("main");
std::string NetFilename("tests/models/onnxModels/simpleConv.onnxtxt");
Variable *graphOutputVar;
Tensor data;
getNCHWData(&data, 1, 1, 3, 3);
ONNXModelLoader onnxLD(NetFilename, {"data"}, {&data}, *F);
graphOutputVar = onnxLD.getSingleOutput();
EE.compile(CompilationMode::Infer, F);
EE.run({}, {});
Load an ONNX model, compile, and run inference
https://github.com/pytorch/glow/blob/master/tests/unittests/onnxImporterTest.cpp#L28
ExecutionEngine EE{BackendKind::Interpreter};
auto &mod = EE.getModule();
Function *F = mod.createFunction("main");
std::string NetDescFilename("tests/models/caffe2Models/predict_net.pbtxt");
std::string NetWeightFilename("tests/models/caffe2Models/init_net.pbtxt");
Variable *output;
Tensor data;
getNCHWData(&data, 1, 1, 3, 3);
Caffe2ModelLoader caffe2LD(NetDescFilename, NetWeightFilename,
{"data"}, {&data}, *F);
output = caffe2LD.getSingleOutput();
EE.compile(CompilationMode::Infer, F);
EE.run({}, {});
Load a Caffe2 model, compile, and run inference
https://github.com/pytorch/glow/blob/master/tests/unittests/caffe2ImporterTest.cpp
ExecutionEngine
 - compile a model
 - run a model
 - save a model
ExecutionEngine
 compile: the backend's generateIR (IR generation)
          and creation of a CompiledFunction (one per backend)
 run:     execution of the CompiledFunction (execute)
 save:    the backend's save (saves the IR)
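Putting compile and run together, a typical inference flow looks like this (a minimal sketch assembled from the mnist and importer slides in this deck):

ExecutionEngine EE(BackendKind::Interpreter);
auto &mod = EE.getModule();
Function *F = mod.createFunction("main");
// ... build the graph on F (see the mnist slides) ...
EE.compile(CompilationMode::Infer, F); // generate + optimize IR, create the CompiledFunction
EE.run({}, {}); // execute the CompiledFunction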
void ExecutionEngine::compile(CompilationMode mode, Function *F,
                              const Context &ctx) {
  optimizeFunction(mode, F); // optimization (covered later)
  function_ = backend_->compile(F, ctx); // compilation (covered later)
}
The mode argument is used during optimization
ExecutionEngine::compile
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void glow::runBatch(ExecutionEngine &EE, size_t iterations,
                    size_t &sampleCounter, llvm::ArrayRef<Variable *> vars,
                    llvm::ArrayRef<Tensor *> inputs) {
  size_t batchSize = vars[0]->getType()->dims()[0];
  for (size_t i = 0; i < iterations; i++) {
    // Copy the next consecutive slice of every input tensor into the
    // tensor backing the corresponding variable, essentially:
    //   size_t slc = sampleCounter % inputs[i]->dims()[0];
    //   backingTensor->copyConsecutiveSlices(inputs[i], slc);
    glow::updateVariablesFromBatch(vars, inputs, sampleCounter);
    // Run the network.
    EE.run();
    sampleCounter += batchSize;
  }
}
glow::runBatch
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void ExecutionEngine::run(Context &ctx) {
  assert(function_ && "No function has been compiled");
  // Make sure that the context has backing tensors for all placeholders.
  ctx.allocate(M_.getPlaceholders());
  function_->setupRuns();
  function_->beforeRun(ctx);
  function_->execute();
  function_->afterRun(ctx);
  function_->tearDownRuns();
}
ExecutionEngine::run
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void ExecutionEngine::save(CompilationMode mode, Function *F,
                           llvm::StringRef outputDir,
                           llvm::StringRef networkName) {
  optimizeFunction(mode, F); // optimization (covered later)
  backend_->save(F, outputDir, networkName);
}
ExecutionEngine::save
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
Optimization
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
void ExecutionEngine::compile(CompilationMode mode, Function *F,
                              const Context &ctx) {
  optimizeFunction(mode, F); // optimization
  function_ = backend_->compile(F, ctx); // compilation (covered later)
}
The mode argument is used during optimization
ExecutionEngine::compile
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void ExecutionEngine::optimizeFunction(CompilationMode mode,
                                       Function *F) {
// Verify the function pre-optimization/lowering.
F->verify();
// Optimize the graph.
::glow::optimize(F, mode);
// Allow the backend to transform the graph prior to lowering.
if (backend_->transformPreLowering(F, mode)) {
// Optimize the graph again after the backend transformation.
// In particular, DCE is very likely to be useful.
::glow::optimize(F, mode);
}
ExecutionEngine::optimizeFunction
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
// Lower the graph into a sequence of low-level linear algebra operations.
::glow::lower(F, *backend_);
// Optimize the graph again.
::glow::optimize(F, mode);
// Allow the backend to transform the graph after lowering.
if (backend_->transformPostLowering(F, mode)) {
// Optimize the graph again after the backend transformation.
// In particular, DCE is very likely to be useful.
::glow::optimize(F, mode);
}
}
ExecutionEngine::optimizeFunction
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
1) ::glow::optimize(F, mode);
2) if (backend_->transformPreLowering(F, mode))
     ::glow::optimize(F, mode);
3) ::glow::lower(F, *backend_);
4) ::glow::optimize(F, mode);
5) if (backend_->transformPostLowering(F, mode))
     ::glow::optimize(F, mode);
Extracting just the optimization steps of optimizeFunction
1) ::glow::optimize(F, mode);
2) if (backend_->transformPreLowering(F, mode))
     ::glow::optimize(F, mode);
3) ::glow::lower(F, *backend_);
4) ::glow::optimize(F, mode);
5) if (backend_->transformPostLowering(F, mode))
     ::glow::optimize(F, mode);
transformPreLowering / transformPostLowering
In the current implementations (Interpreter, CPU, OpenCL),
transformPostLowering is implemented by CPU and OpenCL,
but nothing implements transformPreLowering.
In CPUBackend:
 1) Convolution is replaced with a CPU-optimized version
 2) MaxPooling and Splat are merged and replaced with CPUMaxSplat
In OpenCLBackend:
 1) Convolution is replaced with an OpenCL-optimized version
 2) MaxPooling is replaced with an OpenCL-optimized version
 3) AvgPooling is replaced with an OpenCL-optimized version
Implementations of transformPreLowering / transformPostLowering
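For intuition, a hypothetical backend overriding the post-lowering hook could look like the sketch below. MyBackend and the replacement logic are made up for illustration; only the transformPostLowering signature follows the calls shown above.

bool MyBackend::transformPostLowering(Function *F, CompilationMode mode) const {
  (void)mode;
  bool changed = false;
  for (auto &node : F->getNodes()) {
    if (auto *CN = llvm::dyn_cast<ConvolutionNode>(&node)) {
      // A real backend builds its specialized convolution node here and
      // replaces CN's users with it (this is what CPUBackend does).
      (void)CN;
      changed = true;
    }
  }
  return changed; // true => the engine runs ::glow::optimize(F, mode) again
}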
Backends
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
// The backend kind is specified when the ExecutionEngine is instantiated
ExecutionEngine EE(executionBackend);
ExecutionEngine.h
  // The default is Interpreter
ExecutionEngine(BackendKind backendKind = BackendKind::Interpreter);
ExecutionEngine.cpp
// Create a backend of the specified kind
ExecutionEngine::ExecutionEngine(BackendKind backendKind)
    : backend_(createBackend(backendKind)) {}
ExecutionEngine::ExecutionEngine
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
Backend *glow::createBackend(BackendKind backendKind) {
  switch (backendKind) {
  case BackendKind::Interpreter: // Interpreter (naive implementation)
    return createInterpreter();
  case BackendKind::OpenCL: // OpenCL (host code & OpenCL kernels)
    return createOCLBackend();
  case BackendKind::CPU: // CPU (LLVM)
    return createCPUBackend();
  }
  llvm_unreachable("unreachable");
}
glow::createBackend
https://github.com/pytorch/glow/blob/master/lib/Backends/Backends.cpp
Backend *createInterpreter() { return new Interpreter(); }
Backend *createCPUBackend() { return new CPUBackend(); }
Backend *createOCLBackend() { return new OCLBackend(); }
Creating the backends
https://github.com/pytorch/glow/blob/master/lib/Backends/
Compilation: compile
 The backend's compile
  => generateAndOptimizeIR: IR generation & IR optimization
  => compileIR: code generation from the IR
virtual std::unique_ptr<CompiledFunction>
compile(std::unique_ptr<IRFunction> IR) const = 0;
InterpreterBackend
 llvm::make_unique<InterpreterFunction>(std::move(IR))
CPUBackend
 llvm::make_unique<CPUFunction>(std::move(JIT), heap)
OpenCLBackend
 llvm::make_unique<OpenCLFunction>(std::move(IR))
compile
https://github.com/pytorch/glow/blob/master/include/glow/Backends/Backend.h#L43
std::unique_ptr<CompiledFunction>
Interpreter::compile(Function *F) const {
auto IR = generateAndOptimizeIR(F, shouldShareBuffers());
return compileIR(std::move(IR));
}
Interpreter::compile
https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/Interpreter.cpp#L27
std::unique_ptr<CompiledFunction>
Interpreter::compileIR(std::unique_ptr<IRFunction> IR) const {
MemoryAllocator constantWeightsAllocator("ConstantWeights", 0);
MemoryAllocator placeholderWeightsAllocator("PlaceholderWeights", 0);
MemoryAllocator activationsAllocator("Activations", 0);
runtime::RuntimeBundle bundle =
generateRuntimeBundle(*IR, constantWeightsAllocator,
placeholderWeightsAllocator, activationsAllocator);
  return llvm::make_unique<InterpreterFunction>(std::move(IR), bundle);
}
Interpreter::compileIR
https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/Interpreter.cpp#L27
std::unique_ptr<CompiledFunction>
CPUBackend::compile(Function *F) const {
auto IR = generateAndOptimizeIR(F, shouldShareBuffers());
return compileIR(std::move(IR));
}
CPUBackend::compile
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUBackend.cpp#L146
std::unique_ptr<CompiledFunction>
CPUBackend::compileIR(std::unique_ptr<IRFunction> IR) const {
AllocationsInfo allocationsInfo;
std::unique_ptr<LLVMIRGen> irgen = createIRGen(IR.get(), allocationsInfo);
irgen->initTargetMachine(target.empty() ? "" : target.getValue(),
llvm::CodeModel::Model::Large);
irgen->initCodeGen();
allocateJITMemory(IR.get(), irgen->getAllocationsInfo());
emitJitMain(*irgen);
irgen->performCodeGen();
CPUBackend::compileIR
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUBackend.cpp
auto JIT = llvm::make_unique<llvm::orc::GlowJIT>(irgen->getTargetMachine());
JIT->addModule(irgen->borrowModule());
MemoryAllocator constantAllocator("ConstantWeights", 0);
MemoryAllocator placeholderAllocator("Placeholders", 0);
MemoryAllocator activationsAllocator("Activations", 0);
runtime::RuntimeBundle runtimeInfo = generateRuntimeBundle(
*IR, constantAllocator, placeholderAllocator, activationsAllocator);
return llvm::make_unique<CPUFunction>(std::move(JIT), runtimeInfo);
}
CPUBackend::compileIR
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUBackend.cpp
std::unique_ptr<CompiledFunction>
OCLBackend::compile(Function *F) const {
auto IR = generateAndOptimizeIR(F, shouldShareBuffers());
return compileIR(std::move(IR));
}
OpenCLBackend::compile
https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.cpp
std::unique_ptr<CompiledFunction>
OCLBackend::compileIR(std::unique_ptr<IRFunction> IR) const {
MemoryAllocator allocator("GPU", 0xFFFFFFFF);
runtime::RuntimeBundle bundle =
generateRuntimeBundle(*IR, allocator, allocator, allocator);
  return llvm::make_unique<OpenCLFunction>(std::move(IR), bundle);
}
OpenCLBackend::compileIR
https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.cpp
IR Generation
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
std::unique_ptr<IRFunction>
glow::generateAndOptimizeIR(Function *F,
                            bool shouldShareBuffers) {
  auto IR = llvm::make_unique<IRFunction>(F);
  // Generate the IR
  IR->generateIR();
  // Optimize it, consulting the backend
  ::glow::optimize(*IR, shouldShareBuffers);
  return IR;
}
IR generation, then IR optimization driven by the backend
https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp
void IRFunction::generateIR() {
assert(G_->verify() && "Invalid function");
// Schedule the nodes.
NodesPtrList ScheduledNodes;
scheduleGraph(ScheduledNodes);
IRGenVisitor irgen(this);
for (auto &N : ScheduledNodes) {
N->visit(nullptr, &irgen);
}
}
IR generation
https://github.com/pytorch/glow/blob/master/lib/ExecutionEngine/ExecutionEngine.cpp
void IRFunction::scheduleGraph(NodesPtrList &Schedule) {
Schedule.clear();
for (auto &N : G_->getParent()->getVars()) {
Schedule.push_back(N);
}
for (auto &N : G_->getParent()->getPlaceholders()) {
Schedule.push_back(N);
}
Scheduling the graph: first half
https://github.com/pytorch/glow/blob/master/lib/IR/GraphScheduler.cpp
auto numVars = G_->getParent()->getConstants().size();
auto numPlaceholders = G_->getParent()->getPlaceholders().size();
(void)numVars;
(void)numPlaceholders;
std::unique_ptr<Scheduler> scheduler{
createScheduler(graphScheduler, *G_, Schedule)};
scheduler->schedule();
assert(scheduler->getSchedule().size() ==
G_->getNodes().size() + numPlaceholders + numVars &&
"All graph nodes have to be scheduled");
}
Scheduling the graph: second half
https://github.com/pytorch/glow/blob/master/lib/IR/GraphScheduler.cpp#L172
IR Optimization
  1) The graph is either loaded via the graph loader
     (from ONNX or Caffe2 format),
     or constructed via the C++ interface.
  2) The graph is differentiated if needed.
  3) The graph is optimized.
  4) Linear algebra node lowering takes place.
  5) Additional rounds of optimizations occur,
     both target independent and target specific.
  6) The graph is scheduled into a linear sequence of nodes
     that minimizes memory usage.
  7) IRGen converts the low-level graph into instructions.
  8) Low-level IR optimizations are performed.
  9) Backend-specific optimizations
     and code generation are performed.
https://github.com/pytorch/glow/blob/master/docs/IR.md
std::unique_ptr<IRFunction>
glow::generateAndOptimizeIR(Function *F, bool shouldShareBuffers) {
  auto IR = llvm::make_unique<IRFunction>(F);
  // Generate the IR
  IR->generateIR();
  // Optimize it, consulting the backend
  ::glow::optimize(*IR, shouldShareBuffers);
  return IR;
}
IR generation, then IR optimization driven by the backend
https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp
void glow::optimize(IRFunction &M, CompilationMode mode, const Backend &B) {
M.verify();
if (!optimizeIR) return;
performPeepholeOptimizations(M);
eliminateDeadStores(M);
// Replace applicable InsertTensors and ExtractTensors with TensorViews.
optimizeInserts(M);
optimizeExtracts(M);
  if (B.shouldShareBuffers()) // Reuse buffers from previous operations.
    shareBuffers(M);
IR optimization
https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp#L1602
performPeepholeOptimizations(M);
hoistDealloc(M); // Shorten the lifetime of buffers.
sinkAllocas(M);
eliminateDeadStores(M); // Perform Dead Store Elimination.
deleteDeadAllocs(M);
makeWeightsConst(M); // Turn read-only weights into constant weights.
performDebugInstrumentation(M);
if (dumpOptMod) // Print the module to stdout if requested.
M.dump();
M.verify();
}
IR optimization (continued)
https://github.com/pytorch/glow/blob/master/lib/Optimizer/IROptimizer.cpp#L1596
Execution: execute
class CompiledFunction {
public:
virtual ~CompiledFunction() = default;
virtual void execute() = 0;
virtual void setupRuns() = 0;
virtual void beforeRun(const Context &ctx) = 0;
virtual void afterRun(const Context &ctx) = 0;
virtual void tearDownRuns() = 0;
};
CompiledFunction
https://github.com/pytorch/glow/blob/master/include/glow/Backends/CompiledFunction.h
class InterpreterFunction final : public CompiledFunction {
/// The IR to be executed.
std::unique_ptr<IRFunction> F_;
/// Maps values to Tensors, that are owned by this class.
std::unordered_map<const Value *, Tensor *> tensors_;
/// Maps values to Tensors, that are *not* owned by this class.
std::unordered_map<const Value *, Tensor *> externalTensors_;
public:
InterpreterFunction(std::unique_ptr<IRFunction> F, const Context &ctx);
~InterpreterFunction() override;
void execute() override;
InterpreterFunction
https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/InterpreterFunction.h#L43
void InterpreterFunction::execute() {
#define DEF_VALUE(CLASS, NAME)
#define DEF_INSTR(CLASS, NAME)                                                 \
  case Kinded::Kind::CLASS##Kind: {                                            \
    fwd##CLASS(llvm::cast<CLASS>(&I));                                         \
    break;                                                                     \
  }
#define DEF_BACKEND_SPECIFIC_INSTR(CLASS, NAME)
  for (const auto &I : F_->getInstrs()) {
    switch (I.getKind()) { // <= dispatch on each operator kind!
#include "glow/AutoGenInstr.def"
    default:
      llvm_unreachable("Invalid instruction.");
    }
  }
}
InterpreterFunction::execute
https://github.com/pytorch/glow/blob/master/lib/Backends/Interpreter/InterpreterFunction.cpp
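The dispatch relies on the X-macro trick: glow/AutoGenInstr.def is included with different DEF_INSTR definitions to stamp out code per instruction kind. A standalone illustration of the pattern (all names here are made up, not Glow's):

#include <cstdio>

#define FOR_EACH_INSTR(M) M(Add) M(Mul) // stand-in for glow/AutoGenInstr.def

// First expansion: generate one fwd* handler per instruction kind.
#define DEF_INSTR(CLASS) void fwd##CLASS() { std::puts("fwd" #CLASS); }
FOR_EACH_INSTR(DEF_INSTR)
#undef DEF_INSTR

enum class Kind { Add, Mul };

// Second expansion: generate one switch case per instruction kind.
void execute(Kind k) {
  switch (k) {
#define DEF_INSTR(CLASS) case Kind::CLASS: fwd##CLASS(); break;
    FOR_EACH_INSTR(DEF_INSTR)
#undef DEF_INSTR
  }
}

int main() {
  execute(Kind::Add); // prints "fwdAdd"
  execute(Kind::Mul); // prints "fwdMul"
}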
class CPUFunction final : public CompiledFunction {
std::unique_ptr<llvm::orc::GlowJIT> JIT_;
void *heap_;
public:
CPUFunction(std::unique_ptr<llvm::orc::GlowJIT> JIT, void *heap);
~CPUFunction() override;
void execute() override;
};
CPUFunction
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUFunction.h
void CPUFunction::execute() {
auto sym = JIT_->findSymbol("jitmain");
using JitFuncType =
void (*)(uint8_t * constantWeightVars, uint8_t * mutableWeightVars,
uint8_t * activations);
auto address = sym.getAddress();
if (address) {
JitFuncType funcPtr = reinterpret_cast<JitFuncType>(address.get());
funcPtr(runtimeBundle_.getConstants(), baseMutableWeightVarsAddress_,
baseActivationsAddress_);
} else {
GLOW_ASSERT(false && "Error getting address.");
}
}
CPUFunction::execute
https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/CPUFunction.cpp#L29
class OpenCLFunction final : public CompiledFunction {
cl_device_id deviceId_;
cl_context context_;
cl_command_queue commands_;
cl_mem deviceBuffer_{0};
std::vector<KernelLaunch> kernelLaunches_;
public:
explicit OpenCLFunction(std::unique_ptr<IRFunction> F);
~OpenCLFunction() override;
void execute() override;
OpenCLFunction
https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.h
void OpenCLFunction::execute() {
  // Really long.
  //
  // Basically:
  //
  // loop over the layers:
  //
  //   1) generate the host-side code / OpenCL kernel for the layer
  //   2) compile the OpenCL kernel
  //   3) launch it following the usual OpenCL conventions (enqueueKernel)
  //
  // clFinish(commands_); then waits until every OpenCL kernel has finished.
}
OpenCLFunction::execute
https://github.com/pytorch/glow/blob/master/lib/Backends/OpenCL/OpenCL.cpp
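A rough sketch of step 3 and the final wait, using only standard OpenCL host API calls (program creation and error handling omitted; "conv_forward" is a made-up kernel name):

cl_int err;
cl_kernel kernel = clCreateKernel(program, "conv_forward", &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &deviceBuffer);
size_t global[1] = {numElements};
clEnqueueNDRangeKernel(commands, kernel, /*work_dim=*/1, nullptr, global,
                       nullptr, 0, nullptr, nullptr); // queue one layer's kernel
// ... enqueue the kernels for the remaining layers ...
clFinish(commands); // block until every enqueued kernel has completed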
Quantization (FP32 => INT8)
https://github.com/pytorch/glow/blob/master/docs/Quantization.md
 ・FP32 => INT8
 ・Profile-guided quantization:
   observe execution during inference and estimate
   the possible numeric range of each stage of the network
 ・Training-based quantization: support is under consideration for the future
Quantization in Glow
https://github.com/pytorch/glow/blob/master/docs/Quantization.md
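The scale/offset pairs that appear on the next slide fit the usual affine mapping float ≈ scale * (int8 - offset). A minimal sketch of that arithmetic (my own illustration of the idea, not Glow's actual implementation):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

struct QuantParams {
  float scale;
  int32_t offset;
};

// Map an observed [min, max] range (from profiling) onto the int8 range.
QuantParams chooseParams(float min, float max) {
  float scale = (max - min) / 255.0f;
  int32_t offset = static_cast<int32_t>(std::round(-128.0f - min / scale));
  return {scale, offset};
}

int8_t quantize(float v, QuantParams p) {
  int32_t q = static_cast<int32_t>(std::round(v / p.scale)) + p.offset;
  return static_cast<int8_t>(std::min(127, std::max(-128, q)));
}

float dequantize(int8_t q, QuantParams p) { return p.scale * (q - p.offset); }

int main() {
  QuantParams p = chooseParams(-1.0f, 1.0f); // e.g. a tanh output
  std::printf("%f\n", dequantize(quantize(0.5f, p), p)); // ~0.5
}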
std::vector<NodeQuantizationInfo> QI{
{NodeQuantizationInfo::generateNodeOutputName(input->getName()),
{0.2f, 0}},
{NodeQuantizationInfo::generateNodeOutputName(W->getName()), {0.3f, 0}},
{NodeQuantizationInfo::generateNodeOutputName(B->getName()), {0.4f, 0}},
{NodeQuantizationInfo::generateNodeOutputName(FC->getName()), {0.6f, 0}},
};
F = quantization::quantizeFunction(EE, QI, F);
// Make sure that graph can be compiled and run.
EE.compile(CompilationMode::Infer, F);
EE.run({}, {});
Example of quantization::quantizeFunction
https://github.com/pytorch/glow/blob/master/tests/unittests/quantizationTest.cpp
Function *
quantizeFunction(const ExecutionEngine &EE,
llvm::ArrayRef<NodeQuantizationInfo> quantizationInfos,
Function *F, llvm::StringRef newFuncName = "");
quantization::quantizeFunction
https://github.com/pytorch/glow/blob/master/include/glow/Quantization/Quantization.h
https://github.com/pytorch/glow
Glow : Graph Compiler & Execution Engine
High-Level Graph => Low-Level IR => Machine Code
 
Backends
  Interpreter
CPU
OpenCL
 
I am not
a deep learning craftsman;
I am a computer engineer.
Thank you very much!
@Vengineer
Source-code analysis craftsman
