Versatile tensor accelerator (vta) introduction and usage

Versatile Tensor Accelerator (VTA)
Introduction and Usage
2019.7.16
이제민

Contents
1. Background: The Emergence of NPUs
2. VTA Architecture and Performance
3. Code Analysis and Tutorial Walkthrough
4. Future Plans

Idea: tailor your chip architecture to the characteristics of a stable workload
Hardware Specialization
Background: The Emergence of NPUs

Evolution of Deep Learning

Tape-out costs for ASICs are exorbitant: 10x cost gap between 16nm and 65nm
Risky bet to design hardware accelerators for ever-changing applications
Flexibility vs. Efficiency Tradeoffs
• Does deep learning constitute a stable enough workload to justify ASIC-based
hardware accelerators?
Specialization challenge

Highlights:
• Custom ASIC deployed in datacenters since 2015
• 65k 8-bit matrix multiply unit that offers a peak throughput of 92 TOPS
• Targets mainstream NN applications (MLPs, CNNs, and LSTMs)
• Shows 30-80x improved TOPS/Watt over K80
What makes TPUs efficient?
• Integer inference (saves 6-30x energy over 16bit FP)
• Large amount of MACs (25x over K80)
• Large amount of on-chip memory (3.5x over K80)
TPU: Google’s Entry in the Deep Learning Acceleration Race
[1] Jouppi et al., In-Datacenter Performance Analysis of a Tensor Processing Unit, ISCA 2017
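The 92 TOPS figure can be checked from the size of the matrix unit. A quick sketch, assuming the 700 MHz clock reported in [1] (the clock rate is not stated on this slide) and counting each multiply-accumulate as two operations:

```python
# Sanity check of the TPU peak-throughput figure.
# Assumption: 700 MHz clock, taken from Jouppi et al. [1]; not stated on the slide.
MACS = 256 * 256      # 65,536 8-bit multiply-accumulate units ("65k")
OPS_PER_MAC = 2       # one multiply and one add per cycle
CLOCK_HZ = 700e6

peak_tops = MACS * OPS_PER_MAC * CLOCK_HZ / 1e12
print(f"peak throughput = {peak_tops:.2f} TOPS")
```

Under these assumptions the result is just under 92 TOPS, which is consistent with the rounded figure on the slide.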

Implementing a Convolutional Layer with matrix Multiplication
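A minimal sketch of the idea, assuming the common "im2col" formulation: every input patch is unrolled into a row, every kernel is flattened into a column, and the whole layer becomes one matrix product. The function names are illustrative, and the example is single-channel with stride 1 and no padding.

```python
# Convolution as matrix multiplication (im2col), in plain Python.
# As in most deep learning frameworks, the kernel is not flipped.

def im2col(image, kh, kw):
    """Unroll every kh x kw patch of `image` into one row."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows

def matmul(a, b):
    """Naive matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def conv2d_as_matmul(image, kernels):
    """Apply several kernels at once: patches (P x K) times weights (K x F)."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    patches = im2col(image, kh, kw)
    # Each kernel is flattened into one column of the weight matrix.
    flat = [[k[di][dj] for di in range(kh) for dj in range(kw)] for k in kernels]
    weights = [list(col) for col in zip(*flat)]
    return matmul(patches, weights)  # entry [p][f]: filter f applied at patch p

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernels = [[[1, 0], [0, 1]],   # sums the main diagonal of each patch
           [[1, 1], [1, 1]]]   # sums the whole patch
out = conv2d_as_matmul(image, kernels)
print(out)
```

The price of this formulation is that overlapping patches are duplicated in memory; the benefit is that a single dense matrix unit can serve every convolutional layer.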

Problem: Reading a large SRAM uses much more power than arithmetic
Solution: Use "systolic execution" to save energy by reducing reads and
writes of the Unified Buffer
A systolic array is a two-dimensional collection of arithmetic units,
each of which independently computes a partial result as a function of
inputs from the arithmetic units upstream of it
It resembles blood being pumped through the human circulatory system by
the heart, which is the origin of the name "systolic"
Systolic Execution in Matrix Array
[1] Why systolic architectures?, IEEE Computer, 1982.

Systolic Execution in one dimension
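A cycle-by-cycle sketch of a one-dimensional systolic chain, assuming a weight-stationary design: each processing element (PE) holds one weight, input elements arrive skewed by one cycle per PE, and partial sums move one PE to the right per cycle. The function name and the data are illustrative.

```python
def systolic_1d(weights, vectors):
    """Dot product of `weights` with each vector, computed by a chain of PEs."""
    n = len(weights)
    psum = [0] * n                      # partial-sum register of each PE
    results = []
    for cycle in range(len(vectors) + n - 1):
        new_psum = [0] * n
        for k in range(n):
            t = cycle - k               # index of the vector reaching PE k now
            upstream = psum[k - 1] if k > 0 else 0
            if 0 <= t < len(vectors):
                new_psum[k] = upstream + weights[k] * vectors[t][k]
        psum = new_psum                 # all registers update together
        t_done = cycle - (n - 1)        # vector leaving the last PE this cycle
        if 0 <= t_done < len(vectors):
            results.append(psum[n - 1])
    return results

print(systolic_1d([1, 2, 3], [[1, 1, 1], [1, 0, 2], [0, 5, 0]]))
```

Each weight is read once and then reused for every vector in the stream, which is the source of the energy saving: no intermediate sum is written back to a large buffer.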

Systolic Execution in two dimensions
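The same idea extends to two dimensions. The sketch below assumes a weight-stationary grid in which PE (k, n) holds W[k][n], activations enter from the left (row k delayed by k cycles) and partial sums flow downwards. It is a simplified software model with illustrative names, not a description of any particular chip.

```python
def systolic_2d(W, X):
    """Compute Y = X * W (X is T x K, W is K x N) on a K x N grid of PEs."""
    K, N, T = len(W), len(W[0]), len(X)
    x_reg = [[0] * N for _ in range(K)]   # activation latched in each PE
    p_reg = [[0] * N for _ in range(K)]   # partial sum latched in each PE
    Y = [[0] * N for _ in range(T)]
    for cycle in range(T + K + N - 2):
        new_x = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    t = cycle - k                      # skewed feed from the left edge
                    x_in = X[t][k] if 0 <= t < T else 0
                else:
                    x_in = x_reg[k][n - 1]             # from the left neighbour
                p_in = p_reg[k - 1][n] if k > 0 else 0  # from the neighbour above
                new_x[k][n] = x_in
                new_p[k][n] = p_in + W[k][n] * x_in
        x_reg, p_reg = new_x, new_p
        for n in range(N):
            t = cycle - (K - 1) - n       # result leaving the bottom of column n
            if 0 <= t < T:
                Y[t][n] = p_reg[K - 1][n]
    return Y

W = [[1, 2], [3, 4], [5, 6]]
X = [[1, 0, 1], [2, 1, 0]]
print(systolic_2d(W, X))
```

After the pipeline fills, every PE performs one multiply-accumulate per cycle while only the left edge reads from memory.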

In the TPU, the systolic array is rotated
• Weights are loaded from the top and the input data flows into the array from the left
• Weights are preloaded and take effect with the advancing wave alongside the first data of a
new block
Pros & Cons
• Pro (principled): makes efficient use of limited memory bandwidth and balances
computation against the available bandwidth
• Pro (specialized): the computation is fitted to the PE organization/functions
- Improved efficiency, simple design, high concurrency/performance
- Does more with a lower memory bandwidth requirement
• Con (specialized): not generally applicable, because the computation needs to fit the PE
functions/organization
Systolic array architecture

TPU Performance (Roofline)
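A roofline plot caps attainable throughput by whichever is lower: peak compute, or memory bandwidth multiplied by operational intensity. A small sketch, assuming the 92 TOPS peak quoted earlier and the 34 GB/s weight-memory bandwidth reported by Jouppi et al. (the bandwidth figure does not appear in this deck):

```python
# Roofline model: min(compute roof, bandwidth slope).
# Assumption: 34 GB/s memory bandwidth, from the TPU paper, not from these slides.
PEAK_OPS = 92e12   # operations per second
MEM_BW = 34e9      # bytes per second

def attainable_ops(intensity):
    """intensity: operations performed per byte fetched from memory."""
    return min(PEAK_OPS, MEM_BW * intensity)

ridge = PEAK_OPS / MEM_BW   # intensity at which the two limits meet

for intensity in (100, 1000, 10000):
    print(intensity, attainable_ops(intensity) / 1e12, "TOPS")
```

Workloads whose intensity lies left of the ridge point are memory-bound, so adding more MAC units would not help them.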


Overall Architecture
VTA Architecture and Performance

4 modules: fetch, load, compute, store
3-stage architecture: load, compute, store
Two-level ISA: provides the right tradeoff between expressiveness
and code compactness
• LOAD, GEMM, ALU, STORE instructions (CISC-like instructions)
- Multi-cycle compute and memory operations
• RISC micro-ops perform single-cycle tensor operations
Parameterizability
Exposing task-level pipeline parallelism (TLPP): TLPP is based on the
paradigm of access-execute decoupling [1].
VTA Hardware Architecture Features
[1] Decoupled access/execute computer architectures, ISCA’82

VTA Modules

Fetch Module

Load Module

Compute Module

Store Module

3-stage pipeline (1)

3-stage pipeline (2)

The GEMM core performs computation over data stored in
the input, weight, and accumulator memories.
No control flow: loops must be unrolled
Two types of compute micro-ops: ALU and GEMM operations.
VTA GEMM core
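A simplified software model of the GEMM core, assuming a two-level loop nest wrapped around a flat list of micro-ops, with each micro-op naming one accumulator, input and weight tile. The field layout, strides and tile sizes are illustrative and do not reproduce the real instruction encoding.

```python
BATCH, BLOCK_IN, BLOCK_OUT = 1, 2, 2   # small tensor-intrinsic shape for the example

def gemm_tile(inp_tile, wgt_tile, acc_tile):
    """One tensor operation: acc[b][o] += sum_i inp[b][i] * wgt[o][i]."""
    for b in range(BATCH):
        for o in range(BLOCK_OUT):
            acc_tile[b][o] += sum(inp_tile[b][i] * wgt_tile[o][i]
                                  for i in range(BLOCK_IN))

def run_gemm(uops, extent_out, extent_in, stride_out, stride_in,
             inp_mem, wgt_mem, acc_mem):
    """uops: list of (acc, inp, wgt) base indices; strides use the same order."""
    for i0 in range(extent_out):
        for i1 in range(extent_in):
            for uop in uops:            # micro-ops carry no control flow of their own
                acc_idx, inp_idx, wgt_idx = (
                    base + i0 * s_out + i1 * s_in
                    for base, s_out, s_in in zip(uop, stride_out, stride_in))
                gemm_tile(inp_mem[inp_idx], wgt_mem[wgt_idx], acc_mem[acc_idx])

# y = W x with a 4x4 weight matrix split into four 2x2 tiles (index = out_tile*2 + in_tile)
inp_mem = [[[1, 2]], [[3, 4]]]
wgt_mem = [[[1, 0], [0, 1]], [[0, 0], [0, 0]],
           [[0, 0], [1, 1]], [[1, 0], [1, 1]]]
acc_mem = [[[0, 0]], [[0, 0]]]
run_gemm([(0, 0, 0)], 2, 2, (1, 0, 2), (0, 1, 1), inp_mem, wgt_mem, acc_mem)
print(acc_mem)
```

Because the micro-op list cannot branch, any loop beyond the two hardware levels has to be unrolled into more micro-ops or more instructions.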

Vta/hardware/Xilinx/sources
• vta.cc: defines the VTA modules and their behavioral models.
• vta.h: defines the ap_int types and the function prototypes.
• `vta/include/vta/hw_spec.h`
• `vta/config/vta_config.json` is generated from these parameters; the
generation is done by `vta/config/vta_config.py`.
Vivado High-Level Synthesis, Xilinx

The shape of the tensor intrinsic
Clock frequency
Pipelining
Data type width
On-chip buffer sizes
LOG_INP_WIDTH, LOG_WGT_WIDTH, and LOG_OUT_WIDTH are designed to be equal
H/W parameters
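The parameters are stored as base-2 logarithms, so concrete sizes follow by shifting. A sketch with assumed values: only LOG_INP_WIDTH, LOG_WGT_WIDTH and LOG_OUT_WIDTH are named on the slide, and the other keys and all numbers below are illustrative assumptions rather than values read from a real configuration file.

```python
# Deriving concrete sizes from log2-encoded hardware parameters.
# All values are illustrative assumptions.
cfg = {
    "LOG_INP_WIDTH": 3, "LOG_WGT_WIDTH": 3, "LOG_OUT_WIDTH": 3,
    "LOG_ACC_WIDTH": 5,
    "LOG_BATCH": 0, "LOG_BLOCK_IN": 4, "LOG_BLOCK_OUT": 4,
    "LOG_INP_BUFF_SIZE": 15,            # buffer size in bytes, as a log2
}

def derive(cfg):
    # Design rule from the slide: the three data widths are equal.
    assert cfg["LOG_INP_WIDTH"] == cfg["LOG_WGT_WIDTH"] == cfg["LOG_OUT_WIDTH"]
    p = {k[4:]: 1 << v for k, v in cfg.items()}   # strip "LOG_", take 2**value
    # Bits in one input tensor element: BATCH x BLOCK_IN values of INP_WIDTH bits.
    p["INP_ELEM_BITS"] = p["BATCH"] * p["BLOCK_IN"] * p["INP_WIDTH"]
    # Number of such elements that fit in the input buffer.
    p["INP_BUFF_DEPTH"] = p["INP_BUFF_SIZE"] * 8 // p["INP_ELEM_BITS"]
    return p

params = derive(cfg)
print(params)
```

Keeping everything a power of two means address arithmetic in the hardware reduces to shifts and bit slicing.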

Pipelining Tasks to Hide Memory Latency
[1] Decoupled access/execute computer architectures, ISCA’82
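A toy timing model of why decoupling helps, assuming each tile goes through load, compute and store, that each stage handles one tile at a time, and that a fixed number of input buffer slots is shared between load and compute. Cycle counts and names are illustrative; output-buffer hazards are ignored for brevity.

```python
def serial_latency(n, load, compute, store):
    """No overlap: every tile is loaded, computed and stored in turn."""
    return n * (load + compute + store)

def pipelined_latency(n, load, compute, store, slots=2):
    """Finish time when the three stages run concurrently (n >= 1).
    slots=2 corresponds to double buffering of the input."""
    load_done, comp_done, store_done = [], [], []
    for i in range(n):
        # WAR: the slot being refilled must already have been consumed.
        slot_free = comp_done[i - slots] if i >= slots else 0
        prev_load = load_done[i - 1] if i else 0
        load_done.append(max(prev_load, slot_free) + load)
        # RAW: compute needs its tile loaded and the compute unit free.
        prev_comp = comp_done[i - 1] if i else 0
        comp_done.append(max(load_done[i], prev_comp) + compute)
        # RAW: store needs the result and the store unit free.
        prev_store = store_done[i - 1] if i else 0
        store_done.append(max(comp_done[i], prev_store) + store)
    return store_done[-1]

print(serial_latency(4, 3, 2, 1), pipelined_latency(4, 3, 2, 1))
```

In this example the pipelined run is bounded by the slowest stage (load), so compute and store time is almost entirely hidden behind memory traffic.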

Addressing the programmability challenge

NNVM
• Graph-level IR; merging (fusing) operations at this level improves efficiency.
VTA Runtime
• JIT compilation of VTA binaries (instruction streams and micro-kernel
code)
• Manages shared memory
• Performs synchronization to hand off execution to VTA
VTA's two-level ISA
• High-level CISC ISA
- Defines variable-latency operations
- DMA loads, deep learning operators
• Low-level, fixed-latency RISC ISA
- Low-level matrix-matrix operations
VTA micro-architecture
• Intended to keep the detailed design of the deep learning hardware flexible.
Components of the TVM Stack

VTA’s JIT runtime enables cooperative execution of deep learning
workloads between a CPU host and the accelerator.
• 1) Enable heterogeneous execution: one challenge with fixed-function
accelerators is model evolution, because most of these accelerators
are built for fixed models. Heterogeneous execution overcomes this
limitation by scheduling operators onto targets (e.g., CPUs or VTA)
according to their affinity for different types of operators.
- Ex: it is well known that the first convolutional layer in most CNNs
contains operators with low arithmetic intensity that perform well on
CPUs.
- Another motivation behind heterogeneous execution is providing a
fallback mechanism for supporting emerging operators that are not yet
supported by VTA.
• 2) lower compiler design complexity
• 3) overcome physical limitations
• 4) reduce binary bloat
• 5) Future proofing: advances in systems show a trend towards
heterogeneous multi-accelerator systems and scale-out acceleration.
JIT Runtime System
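The placement decision behind heterogeneous execution can be sketched as follows. This is a hypothetical illustration, not TVM's actual partitioner: the supported-operator set, the intensity threshold and the intensity values are all made up for the example.

```python
# Hypothetical operator placement with CPU fallback.
VTA_SUPPORTED = {"conv2d", "dense", "add", "relu"}   # assumed capability list

def place(ops, min_intensity=8.0):
    """ops: list of (name, op_type, arithmetic_intensity). Returns {name: target}."""
    plan = {}
    for name, op_type, intensity in ops:
        if op_type in VTA_SUPPORTED and intensity >= min_intensity:
            plan[name] = "vta"
        else:
            plan[name] = "cpu"   # fallback: unsupported or low-intensity operator
    return plan

graph = [("conv1", "conv2d", 2.0),     # first layer: low arithmetic intensity
         ("conv2", "conv2d", 64.0),
         ("softmax", "softmax", 1.0)]  # not supported by the accelerator
plan = place(graph)
print(plan)
```

New operator types land on the CPU by default, so a model that evolves faster than the hardware still runs end to end.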

Full evaluation on PYNQ FPGA board (Z1)
Full Stack Evaluation (TVM)
TVM can offload most convolution operations to
the FPGA (40x speedup on off-loadable layers)

For comparable systems, VTA provides a significant performance
edge over conventional CPU and GPU-based inference
Evaluation over multiple CPU, GPU, and FPGA-equipped edge systems


Execution Environment
Code Analysis and Tutorial Walkthrough

FPGA Programming

Computation definition: A[1024] + B[1024] = C[1024]
Computation declaration

Computation Definition
BATCH and BLOCK_OUT default to 1 and 16.
64 * 16 = 1024 operations
The declaration only defines the computation; nothing is executed yet.
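A plain-Python picture of the tiled layout and of a lazy declaration, assuming the defaults quoted above (BATCH = 1, BLOCK_OUT = 16) and 64 tiles. The helper names are illustrative and this is not the TVM API; the packing shown is only meant for the BATCH = 1 case.

```python
BATCH, BLOCK_OUT = 1, 16   # defaults quoted above
M = 64                     # number of tiles, so M * BLOCK_OUT = 1024

def pack(flat):
    """Reshape a flat list of 1024 values into tiles indexed [M][BATCH][BLOCK_OUT]."""
    assert len(flat) == M * BATCH * BLOCK_OUT
    tile = BATCH * BLOCK_OUT
    return [[flat[m * tile + b * BLOCK_OUT: m * tile + (b + 1) * BLOCK_OUT]
             for b in range(BATCH)] for m in range(M)]

def declare_add(a, b):
    """Describe C = A + B; no element is computed until the result is indexed."""
    return lambda m, bt, o: a[m][bt][o] + b[m][bt][o]

A = pack(list(range(1024)))
B = pack([1] * 1024)
C = declare_add(A, B)      # declaration only
print(C(63, 0, 15))        # evaluation happens here, for one element
```

Separating the declaration from its evaluation is what later allows a schedule to decide how and where the 1024 additions run.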

There are many ways to obtain the computation C.
The most basic one is to create a schedule and lower it to VTA hardware
primitives.
Scheduling the Computation

Default Schedule (tvm.lowering, 12 passes)
Buffer definition
Buffer lowering
Buffer lowering
Store into the result buffer
Store the final result

Default Schedule (vta.lowering, 9 passes)
Marks CPU accesses to the VTA buffer

To be lowered to VTA hardware intrinsics, the schedule must include the
following:
• DMA copy operations: operations that copy data from global scope to
local scope
• Vector ALU operations: these perform the vector add
VTA has the following three on-chip SRAMs:
• env.inp_scope (read-only)
- Stores the input matrices
- Shape: (env.BATCH, env.BLOCK_IN), type = env.inp_dtype
• env.wgt_scope (read-only)
- Stores the weight matrices
- Shape: (env.BLOCK_OUT, env.BLOCK_IN), type = env.wgt_dtype
• env.acc_scope (read/write SRAM buffer): general purpose register file
- Stores the accumulator matrices
- Shape: (env.BATCH, env.BLOCK_OUT), type = env.acc_dtype
What the Default Schedule Lacks

LoweredFunc after applying on-chip scoping

The usual approach on hardware accelerators
• Move data from DRAM into the VTA on-chip buffers
- Pragmas tell the compiler to perform the copy operation in bulk
using DMA
ALU Operations
• VTA has a built-in ALU that operates on tensors held in the accumulator
buffer.
• The vector addition loop must be explicitly marked to run on VTA's
ALU.
DMA Transfers and ALU Operations
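The three steps can be pictured with a toy model: a bulk copy into the accumulator buffer, a vector add on the ALU, and a bulk copy back to DRAM. All function names are illustrative, and the model ignores buffer capacity and data types.

```python
def dma_copy(src):
    """Bulk transfer: copy a whole tiled tensor between DRAM and an on-chip buffer."""
    return [list(tile) for tile in src]

def alu_add(acc_a, acc_b):
    """Vector ALU: element-wise add over tiles resident in the accumulator buffer."""
    return [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(acc_a, acc_b)]

def vadd_on_accelerator(dram_a, dram_b):
    a_buf = dma_copy(dram_a)        # DRAM -> accumulator buffer
    b_buf = dma_copy(dram_b)        # DRAM -> accumulator buffer
    c_buf = alu_add(a_buf, b_buf)   # computed on-chip
    return dma_copy(c_buf)          # accumulator buffer -> DRAM

dram_a = [[m] * 16 for m in range(64)]   # 64 tiles of 16 elements
dram_b = [[1] * 16 for _ in range(64)]
dram_c = vadd_on_accelerator(dram_a, dram_b)
print(dram_c[5])
```

Without the explicit marking, the same loops would simply run on the CPU against DRAM, which is what the default schedule produces.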

Lowered TVM schedule code adapted for VTA
Computation performed with the ALU
Marks CPU accesses to the VTA buffer
Stores the computation result

Compile, Run, Verify
Compile
Run
Verify

Compile into a TVM function
Create the function with tvm.build
• Arguments: the schedule, the desired signature of the function (inputs
and outputs), and the target language
Save as a module
• Save the module to a file
• Load it later
• This provides ahead-of-time compilation
• The cross-compiled executable can be shipped to another environment
(using RPC)
Load the module
TVM Compilation

A compiled TVM function exposes a C API, so it can be invoked from any
language
TVM provides an array API based on DLPack (for quick testing and prototyping)
• Create the remote context (PYNQ)
• tvm.nd.array
• f() runs the actual computation
• asnumpy() copies the result back and formats it so it can be interpreted
Running the function

Finally, verify the result with an assert
Verifying Correctness
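A sketch of such a check in plain Python, assuming the sum is formed in a wider type and then narrowed to signed 8 bits (the data types are an assumption here), so the reference has to wrap in the same way. The device result below is a stand-in list, not output captured from real hardware.

```python
def to_int8(v):
    """Wrap an integer to signed 8 bits, like a narrowing cast from a wider type."""
    return (v + 128) % 256 - 128

def reference_vadd(a, b):
    """Add in wide (Python int) precision, then narrow each element to int8."""
    return [to_int8(x + y) for x, y in zip(a, b)]

a = [100, -128, 5, 127]
b = [100, -1, 7, 1]
device_result = [-56, 127, 12, -128]   # stand-in for values copied back from the device
assert reference_vadd(a, b) == device_result
```

Including values that overflow in the test data checks the narrowing behaviour as well as the addition itself.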

PYNQ-Z1 board: RPC server session
Print Instructions on PYNQ-Z1 (debug option)

Schedule to Machine Code
Tutorial
Example: A[1024] + B[1024] = C[1024]
Flow: TVM Schedule -> Lowering -> LoweredFunc -> runtime -> Machine Code
IR_PASS (11, 9)
CPU and VTA code
Ir.pass.py::Irb.emit("VTAStoreBuff2D")
JIT compilation (Runtime.cc): converts each call into machine code
VTABufferAlloc
VTABufferFree
VTABufferCopy
VTATLSCommandHandle
VTARuntimeShutdown
VTASetDebugMode
VTABufferCPUPtr
VTAWriteBarrier
VTAReadBarrier
VTALoadBuffer2D
VTAStoreBuffer2D
VTAUopPush
VTAUopLoopBegin
VTAUopLoopEnd
VTAPushGEMMOp
VTAPushALUOp
VTADepPush
VTADepPop
VTASynchronize


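TVM and VTA analysis
• Analyzing how each IR pass transforms the code (in a debugging environment)
- TVM IR PASS
- VTA IR PASS
• Analysis of JIT compilation and the runtime
Deep learning compiler stack development
• Develop and integrate the lowering part (with Principal Researcher 유미선 and Dr. 김영주)
• A[1024] + B[1024] = C[1024] #vector addition
Analysis of ResNet-18 network operations
• Analyze the parts that are transformed so that they can run on VTA
Agenda
Future Plans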
