[Sitcon2018] Analysis and Improvement of IOTA PoW Implementation

Analysis and Improvement of IOTA PoW Implementation
chenwei (魏禛)
<zhenwei.tw@gmail.com>
AndyYang (楊子賢)
<kukry5566@gmail.com>
March 10, 2018 / SITCON2018
1

chenwei (魏禛)
● From Tainan, Taiwan
● Pursuing a master's degree at National Taiwan University
● Recent work
○ Learning how to implement an interpreter
○ Learning Golang
○ Optimizing neural networks on multiple GPUs
● GitHub <https://github.com/chenwei-tw>
2

AndyYang (楊子賢)
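● From Taipei
● First-year graduate student in Computer Science and Information Engineering at National Taiwan University
● Research areas:
○ Machine learning
○ Computer architecture
● Recent Work:
○ ReRAM-based accelerator for convolutional neural networks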
3

Brief Introduction to IOTA
from: “Iota Tangle Visualization” <https://simulation1.tangle.works/>
4

● IRI (IOTA Reference Implementation)
○ Provides RESTful API to participate in Tangle
○ Exchange transactions with other nodes
○ Maintain Database for storing transactions
Referenced: “IOTA 輕量錢包、完整錢包與 IOTA Node 的關係”
<https://blog.louie.lu/2017/12/06/relationship-between-iota-light-wallet-full-wallet-and-full-node/>
Referenced: “IOTA API Reference”
<https://iota.readme.io/v1.2.0/reference>
5

● (Light) Wallet
○ Check the balance, receive payments, transfer funds
○ Because it does not run a full Node, the Wallet gets all of its information by talking to a full node through the RESTful API described earlier
○ Before doing any operation with your wallet, check that the host you are connected to is available
Referenced: “IOTA 輕量錢包、完整錢包與 IOTA Node 的關係”
<https://blog.louie.lu/2017/12/06/relationship-between-iota-light-wallet-full-wallet-and-full-node/>
6

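● How is a transaction issued?
○ The Node selects two transactions to validate
○ It checks whether those two transactions conflict (e.g. an account balance goes negative)
○ It solves a cryptographic puzzle (PoW), which costs computing power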
Referenced: “Tangle 白皮書” <https://hackmd.io/s/ryriSgvAW>
Further Reading: “深入理解 IOTA 交易方式”
<https://blog.louie.lu/2018/01/10/in-depth-explain-iota-transaction/>
7

How I got involved
● <attachToTangle> in IRI
Referenced: “iotaledger/iri” <https://github.com/iotaledger/iri>
8

How I got involved
● Many IOTA PoW implementations are hidden in these libraries
○ curl.lib.js
<https://github.com/iotaledger/curl.lib.js>
○ gIOTA <https://github.com/iotaledger/gIOTA>
○ ccurl <https://github.com/iotaledger/ccurl>
○ iota-pearldiver
<https://github.com/mlouielu/iota-pearldiver>
9

Why choose gIOTA?
● gIOTA collects several PoW implementations (C, SSE, AVX, OpenCL)
○ Most of them are C code embedded in Golang
● So once we build IOTA's low-level trinary structure in C, we can port those implementations over quickly
10

Trinary Structure
● As an alternative to binary, trinary is a base-3 numeral system
● Trits: analogous to bits, a ternary digit is a trit. The digits may have the values 1, 0, or -1
● Trytes: a tryte consists of 3 trits, which can represent 27 values.
○ In IOTA, trytes are represented as the characters '9,A-Z'.
Referenced: “IOTA Glossary” <https://iota.readme.io/docs/glossary>
11
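To make the tryte-to-trit relationship concrete, here is a minimal sketch in C. It assumes the usual IOTA convention that the alphabet "9ABC...Z" maps to the balanced-ternary values 0, 1, ..., 13, -13, ..., -1, with the least significant trit stored first. The function name tryte_to_trits is made up for this illustration and is not taken from dcurl.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The 27 tryte characters in index order: '9' = 0, 'A' = 1, ..., 'Z' = 26. */
static const char TRYTE_ALPHABET[] = "9ABCDEFGHIJKLMNOPQRSTUVWXYZ";

/* Convert one tryte character into 3 balanced trits (least significant first).
 * Returns 0 on success, -1 if c is not a valid tryte character.
 * Hypothetical helper for illustration only. */
int tryte_to_trits(char c, int8_t trits[3])
{
    assert(trits != NULL);

    const char *pos = (c != '\0') ? strchr(TRYTE_ALPHABET, c) : NULL;
    if (pos == NULL)
        return -1;

    int value = (int)(pos - TRYTE_ALPHABET); /* 0..26 */
    if (value > 13)
        value -= 27;                         /* map to -13..13 */

    for (int i = 0; i < 3; i++) {
        int r = ((value % 3) + 3) % 3;       /* remainder in 0..2 */
        if (r == 2)
            r = -1;                          /* balanced ternary digit */
        trits[i] = (int8_t)r;
        value = (value - r) / 3;
    }
    return 0;
}
```

With this mapping, '9' becomes {0,0,0}, 'A' becomes {1,0,0}, 'M' becomes {1,1,1}, and 'N' becomes {-1,-1,-1}.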

Our Trinary Structure
Source Code: “chenwei-tw/dcurl” <https://github.com/chenwei-tw/dcurl/blob/dev/src/trinary/trinary.h>
12

What is PoW (Proof Of Work)?
● 9 in tryte = {0,0,0} in trits
Diagram: a Hash ending in ...000...0, with the closing run of zero trits labeled MWM
Referenced: “The Anatomy of a Transaction”
<https://domschiener.gitbooks.io/iota-guide/content/chapter1/transactions-and-bundles.html>
13
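The PoW condition can be sketched as a check on the end of the hash. The sketch below assumes the hash is given as an array of trits (243 in IOTA) and that MWM counts the trailing trits that must be zero. The function name meets_mwm is hypothetical and not taken from any of the libraries mentioned here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the last `mwm` trits of the hash are all zero, 0 otherwise.
 * Hypothetical helper: a nonce is acceptable when the resulting hash
 * passes this check. */
int meets_mwm(const int8_t *hash_trits, size_t len, size_t mwm)
{
    assert(hash_trits != NULL);

    if (mwm > len)
        return 0;

    for (size_t i = len - mwm; i < len; i++) {
        if (hash_trits[i] != 0)
            return 0;
    }
    return 1;
}
```

A larger MWM makes a valid nonce rarer, which is why the search time varies so much from one input to the next.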

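How is the Nonce found?
● The multi-threaded implementations collected in giota do not really divide the computation among threads. They take a brute-force approach: run several copies of the same function at once and see which one finds the answer first
● Each thread starts from a different seed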
14
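The "race" described above can be sketched with POSIX threads. This is only an illustration under stated assumptions: toy_hash is a made-up 32-bit mixing function standing in for the real Curl hash, the target is a number of trailing zero bits rather than zero trits, and all names (find_nonce, worker, worker_arg) are hypothetical. Build with -pthread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define NUM_THREADS 4

/* Toy stand-in for the Curl hash. NOT the real IOTA function. */
static uint32_t toy_hash(uint32_t data, uint32_t nonce)
{
    uint32_t x = data ^ (nonce * 2654435761u);
    x ^= x >> 16;
    x *= 0x45d9f3bu;
    x ^= x >> 16;
    return x;
}

typedef struct {
    uint32_t data, start, span, mask;
    atomic_int *found;   /* shared flag: set once by the winning thread */
    uint32_t *result;    /* shared slot for the winning nonce */
} worker_arg;

/* Every thread runs the same search, only the starting nonce differs. */
static void *worker(void *p)
{
    worker_arg *a = p;
    for (uint32_t i = 0; i < a->span; i++) {
        if (atomic_load(a->found))
            break;                       /* another thread already won */
        uint32_t nonce = a->start + i;
        if ((toy_hash(a->data, nonce) & a->mask) == 0) {
            int expected = 0;
            if (atomic_compare_exchange_strong(a->found, &expected, 1))
                *a->result = nonce;      /* only the first finder writes */
            break;
        }
    }
    return NULL;
}

/* Returns 1 and stores a nonce whose hash has `zero_bits` trailing zero
 * bits, or 0 if no thread found one. */
int find_nonce(uint32_t data, unsigned zero_bits, uint32_t *nonce_out)
{
    assert(zero_bits < 32 && nonce_out != NULL);

    pthread_t tid[NUM_THREADS];
    worker_arg args[NUM_THREADS];
    atomic_int found = 0;
    const uint32_t span = UINT32_MAX / NUM_THREADS + 1; /* 2^32 / 4 */
    int created = 0;

    for (int i = 0; i < NUM_THREADS; i++) {
        args[i] = (worker_arg){ data, (uint32_t)i * span, span,
                                (UINT32_C(1) << zero_bits) - 1,
                                &found, nonce_out };
        if (pthread_create(&tid[created], NULL, worker, &args[i]) != 0)
            break;
        created++;
    }
    for (int i = 0; i < created; i++)
        pthread_join(tid[i], NULL);

    return atomic_load(&found);
}
```

The threads do not cooperate on a single hash computation. The speedup comes only from trying several starting points at once, which matches the brute-force behavior described on the slide.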

Testing the correctness of giota
● The C, Go, and SSE implementations have no problems
Referenced: “用 C 開發 IOTA PoW 的各種實作" <https://hackmd.io/s/HyNw4VM-z>
15

Testing the correctness of giota
● AVX and OpenCL, however, did not pass
pow_avx_test.go:47: pow is illegal
J9QTUNNMONCMIR9JBNMRC9SC9QTBRKBUVCBYBUITBHEICYVQ9HXEXSPWPU9KACTSDRSQBDOJPOOEAFVMP
pow_cl_test.go:46: pow is illegal
IIHYVX9VHSMQWSNDJYWZOJBCBTPVQBLVBF9UYIYSTEKJVEFVY9JPJJMRLFWOJFKNWKAANSZKLXDBWMALI
● We later found that iotaledger/ccurl and gIOTA use the same OpenCL kernel function, yet ccurl produces correct results, so we suspect that gIOTA goes wrong when it launches the kernel
● The GPU performance evaluation and the later design are therefore based on a modified version of iotaledger/ccurl
16

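Measuring the performance of each PoW implementation
● We measured the performance of three PoW implementations with a single trytes input
● We later found that the time needed to find the Nonce differs from one trytes input to another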
17

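Measuring the performance of each PoW implementation
● Measure with a large number of trytes and plot the distribution to compare the implementations
● Results for 30 trytes, 200 samples
Chart annotation: 47 samples took about 10 seconds to run, the price of repeatedly initializing the OpenCL context
Source Code: “chenwei-tw/iota-pow-in-c”
<https://github.com/chenwei-tw/iota-pow-in-c>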
18
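Turning a list of measured run times into a distribution plot comes down to counting samples per time bucket. The following is a minimal sketch of that binning step, assuming the run times are already collected in an array of seconds. The function name histogram and its parameters are made up for this illustration and do not come from iota-pow-in-c.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many samples fall into each of `bins` equal-width buckets
 * covering [lo, hi). Samples outside that range are ignored.
 * Returns the number of samples that were counted. */
size_t histogram(const double *samples, size_t n,
                 double lo, double hi,
                 size_t bins, unsigned *counts)
{
    assert(samples != NULL && counts != NULL);

    if (bins == 0 || hi <= lo)
        return 0;

    for (size_t b = 0; b < bins; b++)
        counts[b] = 0;

    double width = (hi - lo) / (double)bins;
    size_t counted = 0;

    for (size_t i = 0; i < n; i++) {
        if (samples[i] < lo || samples[i] >= hi)
            continue;
        size_t b = (size_t)((samples[i] - lo) / width);
        if (b >= bins)
            b = bins - 1;   /* guard against rounding at the upper edge */
        counts[b]++;
        counted++;
    }
    return counted;
}
```

Plotting counts against the bucket start times gives a Frequency versus Time (s) chart, which makes outliers such as samples stuck behind a slow initialization easy to spot.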

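Why is OpenCL performance poor?
● Question: why does the OpenCL implementation on the GPU perform so badly?
● Possible causes:
○ Does the kernel function that searches for the Nonce take a long time to compute?
○ Is the communication overhead between Device and Host too large?
○ Or is one of the OpenCL API calls at fault?
● Another problem:
○ The GPU in our test environment is from Nvidia, and Nvidia does not provide a Profiling Tool for its OpenCL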
19

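Rolling our own CUDA version!
● The most intuitive idea was to rewrite the OpenCL implementation in CUDA and then examine it with nvprof, one of the tools in the toolkit
● The nvprof result (shown as a figure on the slide) does not directly reveal the cause of the slowdown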
Further Reading: “Profiler :: CUDA Toolkit Documentation”
<http://docs.nvidia.com/cuda/profiler-users-guide/index.html>
20

Trying another Profiling Tool
● We later found another Profiling Tool on GitHub, uftrace, which reports information such as:
○ Duration
○ TID
○ Number of function calls
○ Total time
● Although uftrace cannot analyze GPU profiling information, what it reports is still enough to show us where the performance is stuck
Referenced: “namhyung/uftrace” <https://github.com/namhyung/uftrace>
21

uftrace measurement results
● record: runs a program and saves the trace data
● graph: shows the function call graph in the trace data
$ uftrace record pow_cl
$ uftrace graph main
22

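uftrace measurement results
● GPU initialization accounts for nearly 70% of the total time (init_clcontext takes 1.354 s of the 1.938 total)
Function               Time
total time             1.938
init_clcontext         1.354 s
init_cl_kernel         14.362 us
write_cl_buffer        1.541 ms
clEnqueueWriteBuffer   1.538 ms
clWaitForEvents        569.901 ms
clEnqueueReadBuffer    84.981 us
Hash                   5.502 ms
Phases marked on the chart: OpenCL context initialization, OpenCL searching nonce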
23

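How to fix the problems in the OpenCL version
● Find a way to avoid initializing the OpenCL context repeatedly
○ ccurl's solution is to run only one PoW Task at a time and reuse the same context
● After reading ccurl's code, we believe its data structures were also designed with multi-threaded PoW Tasks in mind. However, when we launched several <ccurl_pow> calls at once in the same address space, the resulting hashes were wrong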
24

New IOTA PoW Library - dcurl
● Goal
○ Make PoW run as fast as possible on the given hardware
○ Integrate it into IRI and check whether performance improves
● Our ideas
○ PoW tasks can be executed by multiple threads
○ Integrate powerful IOTA PoW implementations
25

● Hardware Environment
○ Ubuntu 16.04
○ Intel(R) Xeon(R) CPU E5-2650 v4 @2.2GHz 48 cores
○ Nvidia Titan Xp
○ 94.2 GB RAM
26

27

New IOTA PoW Library - dcurl
It’s important to find the respective lock
28

Does multi-threading really bring a speedup?
Figure: Frequency versus Time (s)
29

Does multi-threading really bring a speedup?
Figure: Frequency versus Time (s)
30

Compare dcurl with other PoW Libraries
Figure: Frequency versus Time (s)
31

Integrate dcurl into IRI
● Use javah to generate a header file for the C program
$ javah com.iota.iri.hash.PearlDiver
33

● <jni.h> provides many functions for converting between Java objects and C data, such as ...
○ GetIntArrayElements() takes a Java int array and returns a pointer to its elements as a C array
○ SetIntArrayRegion() copies a region of a C int array into a Java int array
Further Reading: “JNI Functions”
<https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html>
Further Reading: “Java Programming Tutorial Java Native Interface (JNI)”
<https://www.ntu.edu.sg/home/ehchua/programming/java/JavaNativeInterface.html>
34

● Reminder
○ Give the compiler the include path to OpenJDK
○ Set the Java library path before launching your JVM
● Let's compile it!
○ We get a shared library for the JVM to load
○ Done!
Source code: “chenwei-tw/iri” <https://github.com/chenwei-tw/iri/tree/task/integrate_dcurl>
35

<attachToTangle> Performance Comparison
Performance between IRI and dcurl
Figure: Frequency versus Time (s)
Different Hardware Platform
● Intel(R) Core(™) i7-8700K Processor
● Nvidia GeForce GTX 1080 Ti
● 32 GB Memory
36

Something in progress ...
● Fix AVX implementation
● Let dcurl configure its environment and support multiple GPUs
● dcurl crashes if there is not enough GPU memory
● Have dcurl choose a suitable parameter set automatically
37

Future Work
● Add a new interface for PearlDiver in IRI, so everyone can load a PoW implementation suited to their hardware environment
● Search for other bottlenecks in IRI and try to improve them
38
