1-bit Semantic Segmentation
김정훈
Sharing a side-project development experience!
//Neural Network Quantization & Inference Acceleration
//Project. Drop the bit
김정훈, AI Robotics KR
https://www.facebook.com/jeounghoon.kim.5
KakaoTalk: @SamuelKJH
Speaker: 김정훈
. M.S. in Electrical & Electronic Engineering, Korea University
. Researcher in control & robot systems
. Deep learning engineer, serving as technical research personnel
. Interests: State Estimation & Neural Networks
Activities in 2019:
. Talk at Google Developer Group Gwangju DevFest 2019
. Samsung Electronics seminar, State Estimation with Probabilistic Data Association & Multi-Object Tracking (scheduled for 2019.11.26)
. Advisory committee member, Korea University of Technology and Education (KOREATECH) Online Lifelong Education Institute
. Judge, Samsung Open Source Conference (SOSCON) 2019
. Mathworks Advisory Board 2019
. Study group leader, Neural Network Quantization & Compact Network Design (like and subscribe…)
. Organizer, AI Robotics KR
I started community activities to find people to research and talk with!
Today's talk: Neural Network Quantization & Inference Acceleration
Project. Drop the bit
Weight & Bias: FP32 → 1 bit
Neural Network: Heavy & Power Hungry → Small Computing Device! On-device AI!
Project. Drop the bit
• What: realizing on-device AI using neural-network model compression techniques and hardware acceleration techniques
• Participants: 김정훈 (SW, Korea University), 김현우 (HW, Hanyang University)
• Format: SW & HW collaboration
• Deliverable: an on-device AI implementation,
Stage 1: Lightweight End-to-End Semantic Segmentation
Many thanks to HW Kim..
Processing Environments
Jetson Nano, Raspberry Pi, PC/CPU/GPU, ASIC, FPGA
Computing environments: CPU, GPU, Arm processors, Neural Processing Units (NPU), dedicated hardware, …
Neural Network Quantization
FP32 → Lower Bit
Weight & Bias
Lower volume, computation power, memory access & usage, …
Model Compression!
&& Acceleration
(provided a lot of supporting pieces are in place)
Neural Network Quantization
Wu, Shuang, et al. "Training and inference with integers in deep neural networks." ICLR2018, arXiv:1802.04680 (2018).
If you need a conceptual explanation of 8-bit quantization, see…
https://kr.mathworks.com/company/newsletters/articles/what-is-int8-quantization-and-why-is-it-popular-for-deep-neural-networks.html?s_v1=29204&elqem=2890150_EM_KR_19-11_NEWSLETTER_CG-
DIGEST&elqTrackId=a602b18751024fa5a27b1359e4a129a4&elq=f393de06767f4f3ba8bde323d1cf7176&elqaid=29204&elqat=1&elqCampaignId=10302&fbclid=IwAR3MQomYaE8RmG5CICO6ZJ5rP_NXyFfjr18grV82jOA5CLvjXRcbGSC_igE
Strictly a story about the network's weights & biases && activations.
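As a quick illustration of the idea behind the linked article (my own minimal sketch, not code from this project), symmetric int8 quantization maps an FP32 tensor onto 8-bit integers with a single scale:

import torch

def int8_quantize(x: torch.Tensor):
    # Symmetric quantization: map [-max|x|, max|x|] onto [-127, 127].
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def int8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(64, 64, 3, 3)                    # a hypothetical FP32 weight tensor
q, s = int8_quantize(w)
print((int8_dequantize(q, s) - w).abs().max())   # worst-case quantization error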
Neural Network Quantization
Workflow: Data Analysis & Pre-processing → Architecture Design → Model Training → Network Analysis (t-SNE spaces, weight histograms) → Post-Processing
Neural Network Quantization
These truly precious features… the ones we had been using without a second thought…
Binarized Neural Networks
Network Parameter Comparison – DeepLabV3+
DeepLabV3+ (Baseline: ResNet18), FP32 vs 1 bit

Total weight count:
      | bit         | byte
FP32  | 659,111,936 | 82,388,992 (82.388992 MB)
1-bit |  20,597,248 |  2,574,656 (2.574656 MB)

Maximum activation (initial input size: 360 x 480 x 3):
      | bit         | byte
FP32  | 105,062,400 | 13,132,800 (13.1328 MB)
1-bit |   3,283,200 |    410,400 (0.4104 MB)

Per-layer weight counts (Width | Height | Input Channel | Output Channel | Total):
7 | 7 | 3    | 64  | 9408
3 | 3 | 64   | 64  | 36864
3 | 3 | 64   | 64  | 36864
3 | 3 | 64   | 64  | 36864
3 | 3 | 64   | 64  | 36864
1 | 1 | 64   | 48  | 3072
1 | 1 | 64   | 128 | 8192
3 | 3 | 64   | 128 | 73728
3 | 3 | 128  | 128 | 147456
3 | 3 | 128  | 128 | 147456
3 | 3 | 128  | 128 | 147456
3 | 3 | 128  | 256 | 294912
1 | 1 | 128  | 256 | 32768
3 | 3 | 256  | 256 | 589824
3 | 3 | 256  | 256 | 589824
3 | 3 | 256  | 256 | 589824
1 | 1 | 256  | 512 | 131072
3 | 3 | 256  | 512 | 1179648
3 | 3 | 512  | 512 | 2359296
3 | 3 | 512  | 512 | 2359296
3 | 3 | 512  | 512 | 2359296
3 | 3 | 512  | 256 | 1179648
3 | 3 | 512  | 256 | 1179648
3 | 3 | 512  | 256 | 1179648
1 | 1 | 512  | 256 | 131072
1 | 1 | 1024 | 256 | 262144
8 | 8 | 256  | 256 | 4194304
3 | 3 | 304  | 256 | 700416
3 | 3 | 256  | 256 | 589824
1 | 1 | 256  | 11  | 2816
8 | 8 | 11   | 11  | 7744
Obviously, 32-bit vs 1-bit is a 32× difference in volume.
What if we applied 1-bit quantization to DeepLabV3+?
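The totals above can be checked directly; a small sketch (plain Python, my own illustration) sums the per-layer counts from the table:

# Per-layer weight counts copied from the table above.
layer_totals = [
    9408, 36864, 36864, 36864, 36864, 3072, 8192, 73728, 147456, 147456,
    147456, 294912, 32768, 589824, 589824, 589824, 131072, 1179648, 2359296,
    2359296, 2359296, 1179648, 1179648, 1179648, 131072, 262144, 4194304,
    700416, 589824, 2816, 7744,
]
total = sum(layer_totals)               # 20,597,248 weights
print(total * 4 / 1e6, "MB at FP32")    # 82.388992 MB (4 bytes per weight)
print(total / 8 / 1e6, "MB at 1 bit")   # 2.574656 MB (1 bit per weight)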
Network Parameter Comparison – XNOR-Net
Rastegari, Mohammad, et al. "XNOR-Net: ImageNet classification using binary convolutional neural networks." European Conference on Computer Vision. Springer, Cham, 2016.
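The core XNOR-Net trick is to approximate each FP32 filter W as alpha·sign(W), where the per-filter scale alpha = mean(|W|) minimizes the L2 quantization error. A minimal PyTorch sketch of that idea (my illustration, not the paper's code):

import torch

def xnor_binarize(w: torch.Tensor):
    # Approximate W ≈ alpha * B with B = sign(W) and a per-output-channel
    # scale alpha = mean(|W|), the L2-optimal choice from the XNOR-Net paper.
    alpha = w.abs().mean(dim=(1, 2, 3), keepdim=True)
    b = w.sign()
    return b, alpha

w = torch.randn(128, 64, 3, 3)       # hypothetical conv weights (out, in, kH, kW)
b, alpha = xnor_binarize(w)
print((w - alpha * b).abs().mean())  # average approximation error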
Building a Binarized Neural Network Trainer!
• Courbariaux, Matthieu, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint arXiv:1602.02830 (2016).
• Rastegari, Mohammad, et al. "XNOR-Net: ImageNet classification using binary convolutional neural networks." European Conference on Computer Vision. Springer, Cham, 2016.
• Darabi, Sajad, et al. "BNN+: Improved binary network training." arXiv preprint arXiv:1812.11800 (2018).
• Liu, Zechun, et al. "Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
• Zhou, Shuchang, et al. "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients." arXiv preprint arXiv:1606.06160 (2016).
• Jung, Sangil, et al. "Learning to quantize deep networks by optimizing quantization intervals with task loss." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
Code Example (PyTorch)
Straight Through Estimator for Gradient Propagation

import torch

class SignumActivation(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        # Channel-wise mean of |x|: an XNOR-Net style scaling factor.
        mean = torch.mean(input.abs(), 1, keepdim=True)
        # sign() maps exact zeros to 0; nudge and re-sign so outputs are strictly +1/-1.
        output = input.sign().add(0.01).sign()
        return output, mean

    @staticmethod
    def backward(ctx, grad_output, grad_output_mean):  # STE part
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        # Smooth sech^2-shaped surrogate derivative instead of the hard clip
        # (grad_input[input.ge(1)] = 0; grad_input[input.le(-1)] = 0) of the vanilla STE.
        grad_input = (2 / torch.cosh(input)) * (2 / torch.cosh(input)) * grad_input
        return grad_input
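A minimal usage sketch for the class above (my illustration; the tensor shapes are hypothetical), showing that gradients flow through the surrogate derivative:

x = torch.randn(2, 64, 8, 8, requires_grad=True)
binary, scale = SignumActivation.apply(x)   # binary in {-1, +1}, channel-wise scale
loss = (binary * scale).sum()
loss.backward()                             # grad reaches x via the sech^2 surrogate
print(x.grad.shape)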
Code Example (PyTorch)

import torch
import torch.nn as nn

class BinConv2d(nn.Conv2d):
    def __init__(self, *kargs, **kwargs):
        super(BinConv2d, self).__init__(*kargs, **kwargs)

    def forward(self, input):
        # Keep a full-precision shadow copy: the optimizer updates the FP values,
        # while the forward pass always sees the binarized weights.
        if not hasattr(self.weight, 'fp'):
            self.weight.fp = self.weight.data.clone()
        self.weight.data = self.weight.fp.sign().add(0.01).sign()
        out = nn.functional.conv2d(input, self.weight, None, self.stride,
                                   self.padding, self.dilation, self.groups)
        if self.bias is not None:
            self.bias.fp = self.bias.data.clone()
            out += self.bias.view(1, -1, 1, 1).expand_as(out)
        return out

FP32 weights && quantized weights → optimization for minimum quantization error
→ In the end, quantization comes down to how you handle the gradients and the weights.
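To make the FP32-shadow idea concrete, here is a hedged sketch of the usual BNN training-loop pattern (the `fp` attribute comes from the code above; the model, loss, and data are hypothetical, and the restore-then-step order is the standard recipe, not necessarily this project's exact code):

import torch
import torch.nn as nn

model = nn.Sequential(BinConv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x, target = torch.randn(4, 3, 32, 32), torch.randn(4, 16, 32, 32)
for _ in range(10):
    loss = criterion(model(x), target)    # forward runs on binarized weights
    optimizer.zero_grad()
    loss.backward()                       # grads land on the binary weights (STE)
    for p in model.parameters():
        if hasattr(p, 'fp'):
            p.data.copy_(p.fp)            # restore FP weights before the update
    optimizer.step()                      # so small updates accumulate in FP32
    for p in model.parameters():
        if hasattr(p, 'fp'):
            p.fp = p.data.clone()         # refresh the FP shadow copy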
BNN Performance Comparison
Nurvitadhi, Eriko, et al. "Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC." 2016 International Conference on Field-Programmable Technology (FPT). IEEE, 2016.
Architecture Study – 1. SegNet
SegNet Architecture (Baseline: VGG16)
Original unit: Convolution → Batch Normalization → Activation → Pool
Modified unit: Convolution → Batch Normalization → Pool → Activation
Pooling right after the Signum activation loses too much information, so we place the activation after the pool.
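As a sketch, the modified unit could be composed from the earlier code slides like this (a hypothetical composition assuming the BinConv2d and SignumActivation classes above, not the project's actual module):

import torch.nn as nn

class BinUnit(nn.Module):
    # Modified SegNet unit: binarize AFTER pooling, so max-pool operates on
    # real-valued BatchNorm outputs rather than on +1/-1 features.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = BinConv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        x = self.pool(self.bn(self.conv(x)))
        binary, _ = SignumActivation.apply(x)
        return binary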
Architecture Study – 1. SegNet
      | Global Accuracy | Mean Accuracy | Mean IoU    | Weighted IoU | Mean BF Score
FP32  | 0.936334255     | 0.804027548   | 0.71979825  | 0.885348723  | 0.788193434
1-bit | 0.924336215     | 0.770357729   | 0.678117835 | 0.865555414  | 0.723602791
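For reference, the Mean IoU above follows the standard per-class definition; a minimal sketch of the computation from a confusion matrix (the standard formula, not the project's evaluation code):

import torch

def mean_iou(conf: torch.Tensor) -> torch.Tensor:
    # conf[i, j] = number of pixels of true class i predicted as class j.
    tp = conf.diag().float()
    fp = conf.sum(0).float() - tp
    fn = conf.sum(1).float() - tp
    iou = tp / (tp + fp + fn)
    return iou.mean()  # average over classes (ignoring absent-class handling)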
Architecture Study – 2. DeepLabV3+
Difficulties & problems
1. Many residual connections → poor memory utilization on the hardware
2. Lots of down-sampling → a weak point for BNNs
3. A network structure where feature precision matters
4. For detail-critical features such as dilated convolutions, binary features don't seem to be enough
5. Tried hard to modify and train it, but the accuracy never came out… ㅠㅠ
Looking for someone to give architecture advice….
Segmentation Network Architecture Setting!
Hardware for the network architecture?
Or a network architecture for the hardware?
Trade-off!!
Segmentation Network Architecture Setting!
Unit: Convolution or Transposed Convolution → Batch Normalization → Signum Activation
Since this is our first attempt at the hardware, we sacrifice the architecture for the hardware! → Start with a simple structure first!
Still looking for someone to give architecture advice….
Encoder-decoder of 11 such units (Unit1–Unit11); the feature-map sizes run
360x480x3 (input) → 360x480x64 → 180x240x128 → 90x120x256 (encoding) → 180x240x128 → 360x480x64 → 360x480x11 (output, decoding).
Segmentation Result Comparison – SegNet
Per-class accuracies, with overall mIoU and BF score (CamVid):
            | Iteration | Sky   | Building | Pole  | Road  | Pavement | Tree  | Sign Symbol | Fence  | Car   | Pedestrian | Bicyclist | mIoU   | BF
FP32 SegNet | 80k>      | 0.896 | 0.834    | 0.961 | 0.877 | 0.527    | 0.964 | 0.622       | 0.5345 | 0.321 | 0.933      | 0.365     | 0.6010 | 0.4684
Ours        | 60k       | 0.940 | 0.649    | 0.782 | 0.904 | 0.891    | 0.823 | 0.804       | 0.750  | 0.826 | 0.780      | 0.723     | 0.5474 | 0.5478

Ours, overall:
Global Accuracy | Mean Accuracy | Mean IoU | Weighted IoU | Mean BF Score
0.82310         | 0.80665       | 0.54736  | 0.74198      | 0.54778

Ours, per class:
            | Accuracy | IoU      | Mean BF Score
Sky         | 0.94006  | 0.89583  | 0.88903
Building    | 0.64936  | 0.62527  | 0.48987
Pole        | 0.78200  | 0.17563  | 0.45098
Road        | 0.90385  | 0.88358  | 0.66327
Pavement    | 0.89108  | 0.642277 | 0.59924
Tree        | 0.82300  | 0.721303 | 0.62532
Sign Symbol | 0.80385  | 0.22311  | 0.31997
Fence       | 0.75018  | 0.43108  | 0.43870
Car         | 0.82649  | 0.646385 | 0.54331
Pedestrian  | 0.77986  | 0.274434 | 0.44095
Bicyclist   | 0.72339  | 0.502063 | 0.47152
Segmentation Result Comparison – SegNet
(Qualitative comparison figure: input image, ground truth, SegNet output, and ours.)
FINN* based BNN Segmentation HW
• Heterogeneous streaming architecture
• Scalable architecture – configurable SIMD/PE
• Developed using High Level Synthesis (HLS)
* Umuroglu, Yaman, et al. "Finn: A framework for fast, scalable binarized neural network inference." Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017.
Current HW specification
• Target FPGA board: Xilinx ZCU104 (Zynq UltraScale+ XCZU7EV-2FFVC1156 MPSoC with 504K logic cells) → it's the only board we have.. ㅠㅠ
• Resource: FF 145479, LUT 321172, BRAM_18K 324
• Performance: 360p (360x480x3) 30 FPS @ 200 MHz
• The longest pipeline-stage latency (a conv layer) is 6,220,805 cycles → 6,220,805 / 200,000,000 ≈ 0.031 sec
• Performance and resources are scalable (by adjusting SIMD/PE)
(Figures: convolution/transposed-convolution logic; resource and performance analysis of the synthesized HW using Xilinx HLS; resource utilization of the synthesized HW.)
Comparison with ESPNet*
* Mehta, Sachin, et al. "Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
Comparison with ESPNet
                    | ESPNet            | Ours
Platform            | NVIDIA Jetson TX2 | Xilinx ZCU104
Dataset             | Cityscapes        | CamVid
Inference speed     | 6~9 FPS           | 30 FPS
Operating frequency | 828~1300 MHz      | 200 MHz
DRAM usage          | 3.52 GB           | Only for storing input images; weights / biases / activations are stored in on-chip memory.
On-device “Light Weight Semantic Segmentation”!!
Small & Low-Power Processor
Honestly, the video is massive cherry-picking!!!
Neural network acceleration that processes 360p video in real time at 30 fps!
Difficulties, and what comes next
• Having to choose the network architecture with the hardware in mind... made the trade-off hard to settle
• A BNN has to segment with 1-bit features, and it is hard to get a high-accuracy network that way
• 1-bit Drop the bit: extend to GPU compute kernels && accelerators for Arm environments
• Low-bit quantized networks seem to need some research into architectures that differ from existing ones
→ a BNN architecture golden rule, hardware-aware design, …
• On the hardware-aware side, the recent SqueezeNext* && MobileNetV3 (MobileNetEdgeTPU)** show exactly that direction
• For the next project, follow the trend and consider multi-bit quantization for a higher-accuracy network?
• 1-bit GAN
**https://ai.googleblog.com/2019/11/introducing-next-generation-on-device.html?fbclid=IwAR28AznWOPf-NUj_S1P5ZUWwTTlrtTk56HpZA7XpnSjWfICGZ1mBzfspFqU
*https://arxiv.org/abs/1803.10615
Collaboration!
Collaboration, something I couldn't do in graduate school.
The experience of researching together with people from other fields matters.
I learned that research doesn't have to be done alone! Why couldn't I have done this sooner?
I'll keep studying together, even harder, from now on!!
Thanks again to HW Kim