Analysis and Improvement of IOTA PoW Implementation
chenwei (魏禛)
<zhenwei.tw@gmail.com>
AndyYang (楊子賢)
<kukry5566@gmail.com>
March 10, 2018 / SITCON2018
1
chenwei (魏禛)
● From Tainan, Taiwan
● Master's student at National Taiwan University
● Recent work
○ Learning how to implement an interpreter
○ Learning Golang
○ Optimizing neural networks on multiple GPUs
● GitHub <https://github.com/chenwei-tw>
2
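AndyYang (楊子賢)
● From Taipei
● First-year graduate student at the Graduate Institute of Computer Science and Information Engineering, National Taiwan University
● Research areas:
○ Machine learning
○ Computer architecture
● Recent work:
○ ReRAM-based accelerator for convolutional neural networks
3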
Brief Introduction to IOTA
from: “Iota Tangle Visualization” <https://simulation1.tangle.works/>
4
Brief Introduction to IOTA
● IRI (IOTA Reference Implementation)
○ Provides a RESTful API for participating in the Tangle
○ Exchanges transactions with other nodes
○ Maintains a database for storing transactions
Referenced: “IOTA 輕量錢包、完整錢包與 IOTA Node 的關係” (The relationship between the IOTA light wallet, full wallet, and IOTA node)
<https://blog.louie.lu/2017/12/06/relationship-between-iota-light-wallet-full-wallet-and-full-node/>
Referenced: “IOTA API Reference”
<https://iota.readme.io/v1.2.0/reference>
5
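Brief Introduction to IOTA
● (Light) Wallet
○ Checks balances, receives payments, and sends transfers
○ Because it does not run a full node, the wallet must obtain all of its information by talking to a full node through the RESTful API described earlier
○ Before performing any operation with your wallet, check that the host it is connected to is available
Referenced: “IOTA 輕量錢包、完整錢包與 IOTA Node 的關係” (The relationship between the IOTA light wallet, full wallet, and IOTA node)
<https://blog.louie.lu/2017/12/06/relationship-between-iota-light-wallet-full-wallet-and-full-node/>
6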
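Brief Introduction to IOTA
● How is a transaction issued?
○ The node selects two transactions to validate
○ It checks whether those two transactions conflict (e.g. a negative account balance)
○ It solves a cryptographic puzzle (PoW), which costs computing power
Referenced: “Tangle 白皮書” (Tangle white paper) <https://hackmd.io/s/ryriSgvAW>
Further Reading: “深入理解 IOTA 交易方式” (An in-depth look at how IOTA transactions work)
<https://blog.louie.lu/2018/01/10/in-depth-explain-iota-transaction/>
7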
How I got involved
● <attachToTangle> in IRI
Referenced: “iotaledger/iri” <https://github.com/iotaledger/iri>
8
How I got involved
● There are many IOTA PoW implementations hidden in these libraries
○ curl.lib.js <https://github.com/iotaledger/curl.lib.js>
○ gIOTA <https://github.com/iotaledger/gIOTA>
○ ccurl <https://github.com/iotaledger/ccurl>
○ iota-pearldiver <https://github.com/mlouielu/iota-pearldiver>
9
Why choose gIOTA?
● gIOTA collects several PoW implementations (C, SSE, AVX, OpenCL)
○ Most of them are embedded in Golang as C code
● So once we build IOTA's underlying trinary structure in C, we can quickly port these implementations over
10
Trinary Structure
● Trinary is a base-3 numeral system, an alternative to binary
● Trits: analogous to a bit, a ternary digit is a trit. A trit may take the value 1, 0, or -1
● Trytes: a tryte consists of 3 trits, so it can represent 27 values
○ In IOTA, trytes are represented by the characters '9' and 'A'-'Z'
Referenced: “IOTA Glossary” <https://iota.readme.io/docs/glossary>
11
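To make the trit/tryte relationship concrete, here is a minimal sketch of converting one tryte character to its three trits and back. It assumes balanced ternary with the least significant trit first and the alphabet order '9', 'A' to 'Z' (so '9' is 0, 'A' to 'M' are 1 to 13, and 'N' to 'Z' are -13 to -1). The function names are illustrative and are not taken from dcurl.

```python
TRYTE_ALPHABET = "9ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # 27 symbols, one per tryte value


def tryte_to_trits(tryte):
    """Return the 3 balanced-ternary trits of one tryte, least significant first."""
    value = TRYTE_ALPHABET.index(tryte)
    if value > 13:
        value -= 27  # 'N'..'Z' encode -13..-1
    trits = []
    for _ in range(3):
        rem = value % 3  # Python's % always yields 0, 1, or 2 here
        if rem == 2:
            rem = -1     # a remainder of 2 is written as -1 with a carry
        trits.append(rem)
        value = (value - rem) // 3
    return trits


def trits_to_tryte(trits):
    """Inverse of tryte_to_trits for a list of exactly 3 trits."""
    value = trits[0] + 3 * trits[1] + 9 * trits[2]
    return TRYTE_ALPHABET[value % 27]  # negative values wrap to 'N'..'Z'
```

For example, `tryte_to_trits("9")` gives `[0, 0, 0]` and `tryte_to_trits("B")` gives `[-1, 1, 0]`, since -1 + 3 = 2.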
Our Trinary Structure
Source Code: “chenwei-tw/dcurl” <https://github.com/chenwei-tw/dcurl/blob/dev/src/trinary/trinary.h>
12
What is PoW (Proof Of Work)?
● 9 in tryte = {0,0,0} in trits
[Figure: a Hash ending in a run of zeros ("...000...0"); the run is labeled MWM]
Referenced: “The Anatomy of a Transaction”
<https://domschiener.gitbooks.io/iota-guide/content/chapter1/transactions-and-bundles.html>
13
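The figure's condition can be sketched as a small check: a hash passes PoW when at least MWM (Minimum Weight Magnitude) trits at its end are zero. The sketch assumes the hash is already available as a list of trits; computing the hash itself (IOTA's Curl function) is outside its scope, and the function names are illustrative.

```python
def trailing_zero_trits(hash_trits):
    """Count how many consecutive trits at the end of the hash are 0."""
    count = 0
    for trit in reversed(hash_trits):
        if trit != 0:
            break
        count += 1
    return count


def satisfies_mwm(hash_trits, mwm):
    """True when the hash ends in at least `mwm` zero trits."""
    return trailing_zero_trits(hash_trits) >= mwm
```

Because '9' is {0,0,0}, every three trailing zero trits appear as one trailing '9' when the hash is printed as trytes.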
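How to find the nonce?
● The multi-threaded implementations collected in gIOTA do not actually divide the computation among threads. They run several copies of the same function at once and take whichever finishes first, which is a brute-force approach
● Each thread starts from a different seed
14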
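The racing strategy described above can be sketched as follows. This is not gIOTA's code: SHA-256 stands in for IOTA's Curl hash, difficulty is counted in trailing zero bits rather than trits, and all names and the starting offsets are assumptions chosen for illustration.

```python
import hashlib
import threading


def is_valid(data, nonce, zero_bits):
    """Stand-in PoW check: the SHA-256 digest must end in `zero_bits` zero bits."""
    digest = hashlib.sha256(data + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") % (1 << zero_bits) == 0


def search(data, start, zero_bits, found, result):
    """Every worker runs this same loop; only the starting nonce differs."""
    nonce = start
    while not found.is_set():
        if is_valid(data, nonce, zero_bits):
            result.append(nonce)
            found.set()  # tell the other workers to stop
            return
        nonce += 1


def race_for_nonce(data, zero_bits=8, workers=4):
    found = threading.Event()
    result = []  # more than one worker may append; the first entry is used
    threads = [
        threading.Thread(target=search,
                         args=(data, i * (1 << 32), zero_bits, found, result))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

The workers do not share or split any work, so the expected speedup comes only from the chance that one of the starting points happens to lie close to a valid nonce.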
Testing gIOTA correctness
● The C, Go, and SSE implementations work correctly
Referenced: “用 C 開發 IOTA PoW 的各種實作” (Developing various IOTA PoW implementations in C) <https://hackmd.io/s/HyNw4VM-z>
15
Testing gIOTA correctness
● The AVX and OpenCL implementations, however, fail the tests
pow_avx_test.go:47: pow is illegal
J9QTUNNMONCMIR9JBNMRC9SC9QTBRKBUVCBYBUITBHEICYVQ9HXEXSPWPU9KACTSDRSQBDOJPOOEAFVMP
pow_cl_test.go:46: pow is illegal
IIHYVX9VHSMQWSNDJYWZOJBCBTPVQBLVBF9UYIYSTEKJVEFVY9JPJJMRLFWOJFKNWKAANSZKLXDBWMALI
● We later found that iotaledger/ccurl and gIOTA use the same OpenCL kernel function, yet ccurl produces correct results. We suspect the problem occurs when gIOTA launches the kernel
● The later GPU performance evaluation and design are therefore based on a modified version of iotaledger/ccurl
16
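Measuring the performance of the PoW implementations
● We first measured three PoW implementations using a single trytes input
● We then found that the time needed to find a nonce differs from one input to another
17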
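Measuring the performance of the PoW implementations
● Measure with a large number of trytes inputs and plot the distribution to compare the implementations
● Results for 30 trytes, 200 samples
47 samples took about 10 seconds to run
The price of repeatedly initializing the OpenCL context
Source Code: “chenwei-tw/iota-pow-in-c”
<https://github.com/chenwei-tw/iota-pow-in-c>
18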
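Why does OpenCL perform poorly?
● Question: why is the GPU-based OpenCL implementation so slow?
● Possible causes:
○ Does the kernel function that searches for the nonce take a long time to compute?
○ Is the communication overhead between device and host too large?
○ Or is one of the OpenCL API calls at fault?
● A further problem:
○ The GPU in our test environment is from Nvidia, and Nvidia does not provide a profiling tool for its OpenCL implementation
19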
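Roll our own CUDA version!
● The most intuitive idea was to rewrite the OpenCL implementation in CUDA and then inspect it with nvprof, one of the tools in the CUDA toolkit
● The result in the figure below did not directly reveal the cause of the slowdown
Further Reading: “Profiler :: CUDA Toolkit Documentation”
<http://docs.nvidia.com/cuda/profiler-users-guide/index.html>
20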
Trying another profiling tool
● We later found another profiling tool on GitHub, uftrace, which reports information such as:
○ Duration
○ TID
○ Number of function calls
○ Total time
● uftrace cannot profile GPU activity, but the information it provides still shows where the time is being spent
Referenced: “namhyung/uftrace” <https://github.com/namhyung/uftrace>
21
uftrace measurement results
● record : runs a program and saves the trace data
● graph : shows function call graph in the trace data
$ uftrace record pow_cl
$ uftrace graph main
22
uftrace measurement results
● The GPU initialization phase accounts for nearly 70% of the total time
Function               Time
total time             1.938 s
init_clcontext         1.354 s
init_cl_kernel         14.362 us
write_cl_buffer        1.541 ms
clEnqueueWriteBuffer   1.538 ms
clWaitForEvents        569.901 ms
clEnqueueReadBuffer    84.981 us
Hash                   5.502 ms
Phases marked in the figure: OpenCL context initialization, OpenCL nonce search
23
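The "nearly 70%" figure can be reproduced from the measured times, and the same numbers give a rough estimate of what reusing the context would save. The sketch assumes the times pair with the function names in the order they appear on the slide, that the total is in seconds, and that everything other than init_clcontext has to be repeated for each task. The helper `estimated_total` is illustrative only.

```python
# Times in seconds; pairing them with the function names in slide order
# is an assumption.
timings_s = {
    "init_clcontext": 1.354,
    "init_cl_kernel": 14.362e-6,
    "write_cl_buffer": 1.541e-3,
    "clEnqueueWriteBuffer": 1.538e-3,
    "clWaitForEvents": 569.901e-3,
    "clEnqueueReadBuffer": 84.981e-6,
    "Hash": 5.502e-3,
}
total_s = 1.938

init_share = timings_s["init_clcontext"] / total_s
print(f"context initialization share: {init_share:.1%}")  # prints 69.9%


def estimated_total(n_tasks, reuse_context):
    """Rough total time for n_tasks PoW runs, with or without context reuse."""
    per_task_s = total_s - timings_s["init_clcontext"]
    if reuse_context:
        return timings_s["init_clcontext"] + n_tasks * per_task_s
    return n_tasks * total_s
```

Under these assumptions, ten tasks take about 19.4 s when the context is rebuilt every time and about 7.2 s when it is created once.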
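How to fix the OpenCL version
● Find a way to avoid initializing the OpenCL context repeatedly
○ ccurl's solution is to run only one PoW task at a time and reuse the same context
● After reading ccurl's code, we believe its data structures were also designed with multi-threaded PoW tasks in mind. However, when we launched several <ccurl_pow> calls at the same time in the same address space, the resulting hashes were wrong
24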
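The "one task at a time, one shared context" approach can be sketched as below. `GpuContext` is a hypothetical stand-in for an OpenCL context, not ccurl's actual data structure; its constructor only simulates an expensive setup, and `search_nonce` returns a placeholder.

```python
import threading
import time


class GpuContext:
    """Hypothetical stand-in for an OpenCL context; creating it is the slow part."""
    instances_created = 0

    def __init__(self):
        GpuContext.instances_created += 1
        time.sleep(0.01)  # stands in for the expensive initialization

    def search_nonce(self, trytes):
        return f"nonce-for-{trytes}"  # placeholder for the real kernel launch


_context = None
_context_lock = threading.Lock()


def pow_task(trytes):
    """Run one PoW task, creating the shared context only on first use."""
    global _context
    with _context_lock:  # only one PoW task uses the context at a time
        if _context is None:
            _context = GpuContext()
        return _context.search_nonce(trytes)
```

Calling `pow_task` repeatedly pays the initialization cost once. The lock also serializes the tasks, which is the limitation the multi-threaded design tries to remove.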
New IOTA PoW Library - dcurl
● Goal
○ Make PoW run as fast as possible on the given hardware
○ Integrate it into IRI and check whether performance improves
● Our ideas
○ PoW tasks can be executed by multiple threads
○ Integrate powerful IOTA PoW implementations
25
New IOTA PoW Library - dcurl
● Hardware Environment
○ Ubuntu 16.04
○ Intel(R) Xeon(R) CPU E5-2650 v4 @2.2GHz 48 cores
○ Nvidia Titan Xp
○ 94.2 GB RAM
26
New IOTA PoW Library - dcurl
27
New IOTA PoW Library - dcurl
It’s important to find the respective lock
28
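The slides do not show dcurl's locking code, so the following is only one possible reading of "finding the respective lock": each compute resource has its own lock, and a PoW task takes the first resource whose lock it can acquire without blocking. `Slot`, `acquire_free_slot`, and `run_pow` are hypothetical names.

```python
import threading


class Slot:
    """Hypothetical compute resource (a CPU worker or a GPU context) with its own lock."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()


def acquire_free_slot(slots):
    """Return the first slot whose lock can be taken without blocking, else None."""
    for slot in slots:
        if slot.lock.acquire(blocking=False):
            return slot
    return None


def run_pow(slots, trytes):
    slot = acquire_free_slot(slots)
    if slot is None:
        return None  # every slot is busy; the caller may retry later
    try:
        return (slot.name, trytes)  # placeholder for the real nonce search
    finally:
        slot.lock.release()
```

With one lock per resource, several PoW tasks can run at the same time without sharing a context, which avoids the kind of wrong results seen when multiple calls shared one context.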
Does multi-threading really bring a speedup?
[Histogram: Frequency vs. Time (s)]
29
Does multi-threading really bring a speedup?
[Histogram: Frequency vs. Time (s)]
30
Compare dcurl with other PoW Libraries
[Histogram: Frequency vs. Time (s)]
31
Integrate dcurl into IRI
32
Integrate dcurl into IRI
● Use javah to generate the header file for the C program
$ javah com.iota.iri.hash.PearlDiver
33
Integrate dcurl into IRI
● <jni.h> provides many functions for converting between Java objects and C data, such as:
○ GetIntArrayElements() takes a Java int array and returns a pointer to a C int array
○ SetIntArrayRegion() copies a C int array into a Java int array
Further Reading: “JNI Functions”
<https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html>
Further Reading: “Java Programming Tutorial Java Native Interface (JNI)”
<https://www.ntu.edu.sg/home/ehchua/programming/java/JavaNativeInterface.html>
34
Integrate dcurl into IRI
● Reminder
○ Give the compiler the include path to the OpenJDK headers
○ Set the Java library path before launching the JVM
● Let's compile it!
○ The result is a shared library for the JVM to load
○ Done!
Source code: “chenwei-tw/iri” <https://github.com/chenwei-tw/iri/tree/task/integrate_dcurl>
35
Performance between IRI and dcurl
<attachToTangle> Performance Comparison
[Histogram: Frequency vs. Time (s)]
Different Hardware Platform
● Intel(R) Core(™) i7-8700K Processor
● Nvidia GeForce GTX 1080 Ti
● 32 GB Memory
36
Something in progress ...
● Fix the AVX implementation
● Let dcurl configure its environment and support multiple GPUs
● dcurl crashes if there is not enough GPU memory
● dcurl should choose a suitable parameter set automatically
37
Future Work
● Add a new interface for PearlDiver in IRI, so everyone can load a PoW implementation suited to their hardware environment
● Search for other bottlenecks in IRI and try to improve them
38

Editor's Notes

  • #6: Anything that can perform these operations can be called a “full node”
  • #25: cue: