Yukio Okuda
(freelance)
okf.yukio@gmail.com
PyConJP2018/9 Y. Okuda
Me= A Programmer over 40 Years= Joyful
2
   
[Chart: CPU Clock (GHz) and # of Real Cores by Year, spanning the One Core, Multi Core, and HyperThread eras]
GIL@Python
One Core
N-Threads ≤ 1-Thread
CPU-Bound ➡Special
Vector Processor
Multi Core
N-Threads@M-Cores = min(N,M)
CPU-Bound ➡General
Low-price GPGPU
IBM/SRB, SUN/process, DECα/VMS-thread, POSIX/pthread, Linux/pthread
✈Wiki-Xeon
Story(1/2)
3
How to accelerate CPU-bound code in Python
Fast Execution
Compile Languages:
CPython-extension
No GIL:
Cython, PyPy, Jython, IronPython, ..
Device depend:
OpenMP, OpenACC, PyCuda
Fast Development
● Compatibility
● Portability
On-The-Fly (OTF)
Packages
PDF in the cloud; codes in the Appendix (✍); links marked ✈ (e.g., ✈GIL)
Introduction
Story(2/2)
4
■ Speedups are shown, but returns diminish: 10✕ removes 90% of the time, 20✕ removes 95%, 50✕ removes 98%, 100✕ removes 99%
[Chart: Time (sec) vs. Shots for Python, NumPy, NumPy-Tf, Python-Numba, CPython-Ext, NumPy-Tf@Gpu, Python-Numba Thread, Tf-Graph-While@Gpu, PyTorch/CuPy@Gpu]
Note: very little data transfer; no tuning of the packages
1000 Loops
Monte Carlo ΠCalculation
H/W S/W
5
env-all: Python 3.6, Conda 5.1.0
Tf-cpu,gpu: Python 3.5, VirtualEnv
Mint Linux(Ubuntu 16.04)
CPU + GPU
Batch python on shell
SSH, NFS
CPU: i7-2630QM stepping=5
(Sandy Bridge’12 mobile)
Turbo=Off, EIST=Off
SSE SSE2 SSE4.1 SSE4.2 AVX
2.0 GHz 4Core 8HT
L1=256K, L2=1M, L3=6M
PCIe II 5GT/s
DDR3 16G 21.3G/s,swap off
QM77, NF9G(Jetway Co.)
GPU: GTX-1060
(Pascal GP-106)
1.5 GHz, 1280 cores
L2=1.5M(192bI/F)
PCIe II 5GT/s
DDR5 6G 8G/s
CUDA-9 CC-6.1
Test bench
Background
➊ Python Thread
➋ GIL
➌ CPython-Ext
➍ NumPy
Speeds of Process and Thread (1/2)
7
def add(n):
    a = 0
    for _ in range(n):
        a += 1

for n in [ .. ]:
    ts = time.monotonic()
    for _ in range(1000):
        f(n)
    te = time.monotonic()
def series(n):
    add(n)
    add(n)

def process(n):
    p1 = Process(target=add,
                 args=(n,))
    p1.start()
    p2 = Process(target=add,
    ...
    p1.join(); p2.join()

def thread(n):
    t1 = Thread(target=add,
                args=(n,))
    t1.start()
    t2 = Thread(target=add,
    ...
    t1.join(); t2.join()
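The fragments above can be assembled into one runnable harness. This is a sketch, not the original script: the elided arguments are filled with one plausible choice (two workers, the 1000-loop timing reduced to 5 loops), and the "fork" start method is assumed so that `Process` targets defined in the script work without a `__main__` guard (the default on Linux).

```python
import time
from threading import Thread
import multiprocessing as mp

# Assumption: "fork" start method (Linux default).
ctx = mp.get_context("fork")

def add(n):
    a = 0
    for _ in range(n):
        a += 1

def series(n):
    add(n); add(n)

def thread(n):
    t1 = Thread(target=add, args=(n,))
    t2 = Thread(target=add, args=(n,))
    t1.start(); t2.start()
    t1.join(); t2.join()

def process(n):
    p1 = ctx.Process(target=add, args=(n,))
    p2 = ctx.Process(target=add, args=(n,))
    p1.start(); p2.start()
    p1.join(); p2.join()

def bench(f, n, loops=5):
    # Time `loops` runs of f(n), as on the slide (time.monotonic pair).
    ts = time.monotonic()
    for _ in range(loops):
        f(n)
    return time.monotonic() - ts

times = {f.__name__: bench(f, 100_000) for f in (series, thread, process)}
```

On a multi-core machine this reproduces the next slide's pattern: `thread` is no faster than `series` (the GIL serializes the work), while `process` runs the two halves in parallel at the cost of a per-process launch overhead.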
Background
Speeds of Process and Thread (2/2)
8
■ Speed
●Thread 1× Series
▼
(25% down @ Turbo On)
●Process 1.8× Series
[Chart: Time (sec) vs. Shots for Thread, Series, Process]
■ Launch time
●Thread Zero
●Process 6 msec /each
[Chart: launch-time detail, Time (sec) vs. Shots for Thread, Series, Process]
Background
Is Thread Safe? (1/2)
9
def add(n):
    global g
    for _ in range(n):
        g += 1

def sub(n):
    global g
    for _ in range(n):
        g -= 1

g = None
def a_s(n):
    global g
    g = 0
    t1 = Thread( ..
        add, .. n)
    t2 = Thread( ..
        sub, .. n)
    .. .. ..
    return g

for n in [ .. ]:
    gs = []
    for _ in range(1000):
        gs.append(a_s(n))
    n0 = not_zero_count(gs)
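A self-contained version of this race test, with the elided thread setup filled in and `not_zero_count` realized as a simple comprehension (both are illustrative choices, not the slide's exact code). `g += 1` is a read-modify-write, so non-zero results can appear; on a fast machine with short runs the count is often zero, matching the slide's finding that errors only show up once runs get long enough.

```python
from threading import Thread

g = 0

def add(n):
    global g
    for _ in range(n):
        g += 1   # read-modify-write on a shared global: not atomic

def sub(n):
    global g
    for _ in range(n):
        g -= 1

def a_s(n):
    global g
    g = 0
    t1 = Thread(target=add, args=(n,))
    t2 = Thread(target=sub, args=(n,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return g

results = [a_s(50_000) for _ in range(20)]
not_zero = sum(1 for r in results if r != 0)  # may well be 0 on short runs
```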
Background
Is Thread Safe? (2/2)
10
■ Time ≥ 8 msec ➡ not thread-safe (global and local)
[Charts: Time (sec) and Not-Zero-in-GS counts vs. Shots; non-zero results appear as run time grows]
Background
Why not Safe? GIL
11
■ The GIL lets only one thread run at a time, to avoid object corruption ✈Dabeaz ✈Abhinav Ajitsaria
● GIL: Global Interpreter Lock
■ Thread execution is chopped into tslice intervals, and updates can be lost ✈A. Jesse
● tslice = 5 msec ● Errors appear from 8 msec
☞ For acceleration, avoid GIL and Python object access
☞ For no errors, finish within tslice or apply GIL-safe operations
[Diagram: Thread1 and Thread2 alternately holding the GIL for one tslice at a time]
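The 5 msec tslice above is CPython's thread switch interval, which can be inspected and changed at runtime. A minimal sketch (changing it does not make thread-unsafe code safe; it only changes how often the races get a chance to occur):

```python
import sys

default = sys.getswitchinterval()  # 0.005 sec = the 5 msec tslice above
sys.setswitchinterval(0.001)       # ask CPython to switch threads more often
assert abs(sys.getswitchinterval() - 0.001) < 1e-6
sys.setswitchinterval(default)     # restore the default
```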
Background
How to Avoid GIL
12
■ CPython-Extension:
1. Copy Python objects to C-Data
2. Apply the “Py_BEGIN_ALLOW_THREADS” C macro
3. Execute C-Native codes or OTF codes
4. Apply the “Py_END_ALLOW_THREADS” C macro
5. Copy C-Data to Python objects
[Diagram: each thread copies in, releases the GIL with Py_BEGIN_ALLOW_THREADS, runs C-native code, re-acquires it with Py_END_ALLOW_THREADS, then copies out]
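Steps 2–4 can be observed from pure Python without writing an extension: several stdlib C modules already wrap their native loops in these macros. For example, zlib's compression loop releases the GIL, so two compressing threads genuinely overlap on two cores; a small demonstration:

```python
import zlib
from threading import Thread

data = b"PyConJP " * 500_000          # a few MB of compressible input
out = [None, None]

def work(i):
    # zlib's C loop runs between the ALLOW_THREADS macros, so two
    # compressions can proceed in parallel despite the GIL.
    out[i] = zlib.compress(data, 6)

threads = [Thread(target=work, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Timing this against two sequential `zlib.compress` calls shows near-2✕ scaling on a multi-core machine, which is exactly what the five steps above buy a CPython extension.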
Background
Monte Carlo ΠCalculation
13
[Figure: random shots on a unit square with an inscribed quarter circle]
H hits in the circle targeting
N random shots at a square
π = 4 ·H/N ✈WikiPi-2 ✈LLNL
Error/π = a · N^b ✈WikiPi-1
Python C
import random
def pin(n):
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r2 = x*x + y*y
        if r2 <= 1.:
            h += 1
    return 4. * h / n
double pin(int n) {
    unsigned int s = time(NULL);
    int h = 0;
    for (int i = 0; i < n; ++i) {
        double x = (double)rand_r(&s) / (double)RAND_MAX;
        double y = (double)rand_r(&s) / (double)RAND_MAX;
        double r2 = x*x + y*y;
        if (r2 <= 1.)
            h += 1;
    }
    return 4. * (double)h / (double)n;
}
Background
Multi-Threaded ΠCalculation
14
■ Original: pin(n)
Get h hits in n shots ➡ 4 ·h/n
■ m Threading: pinm(n, m)
Launch (Map): h1 in n/m shots, h2 in n/m, ..., hm in n/m
h = sum(h1, h2, .., hm) (Reduce)
4 ·h/n
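The Map/Reduce shape above, sketched with `concurrent.futures` (the names `hits` and `pinm` follow the slide's pinm(n, m) idea but are illustrative). Because pure-Python `hits` holds the GIL, threads give no speedup here; the point is the structure: split n shots over m workers, then sum the partial hit counts.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def hits(n):
    # One worker's share: count hits inside the quarter circle.
    h = 0
    for _ in range(n):
        if random.random() ** 2 + random.random() ** 2 <= 1.0:
            h += 1
    return h

def pinm(n, m):
    per = n // m
    with ThreadPoolExecutor(max_workers=m) as ex:
        hs = list(ex.map(hits, [per] * m))   # Map: m partial counts
    return 4.0 * sum(hs) / (per * m)         # Reduce: sum, then scale

pi = pinm(200_000, 4)
```

Swapping the executor for a `ProcessPoolExecutor` (or a nogil-compiled `hits`) is what turns this structure into the real min(N, M) speedup of the following slides.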
Background
Π@CPython
15
■ Easy Operation (good tools and official documents)
■ Requires running setup for each release of CPython
import cif
pi = cif.pin(n)
dist/cif- • • • egg
python setup.py • • •
(Compile, Link, Deploy, Test)
cifmodule.c
#include <Python.h>
static PyObject *
pin( • • • ){
    Py_BEGIN_ALLOW_THREADS
    for (int i = 0; i < n; ++i){ • • •
    Py_END_ALLOW_THREADS
setup.py
import setuptools
setup( • • • )
cif test.py
import unittest
import cif
Background
Effects of Threads and Cores
16
☞ min(N, M)× – Overhead
N: # of Threads, M: # of Real Cores ;
     
[Chart: Relative Speed vs. Threads; speed follows the ideal min(N, M) line over the Real Cores, then flattens over the HyperThread cores]
Background
Hard to Develop Threading
17
■ Design issues: out of scope of this talk
■ An issue in this trial: rand_r vs. random_r
● rand_r: low randomness, ideal speedup ➡ Selected
● random_r: good randomness, but slows down under threading ✈stackoverflow
● The standard gives no clear speed specification for multi-threading ✈open-std
● 80 stdlib functions are not thread-safe ✈opengroup (among them: rand, random, drand48, lrand48, mrand48)
● “more standardization―for compilers, users, and libraries .. activation of threads” Shameem, p.291, Multi-Core Programming ✈Intel-Press
☞ Check the speeds of official thread-safe functions
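In Python the rand_r/random_r concern maps onto `random.Random`: the module-level functions share one generator, while giving each thread its own seeded `Random` instance (the analogue of rand_r's per-thread state) avoids any contention on shared state. A sketch, with the helper `hits` and the seed choices being illustrative:

```python
import random
from threading import Thread

def hits(seed, n, out, i):
    rng = random.Random(seed)   # private generator: no shared RNG state
    h = 0
    for _ in range(n):
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0:
            h += 1
    out[i] = h

out = [0, 0]
threads = [Thread(target=hits, args=(s, 50_000, out, i))
           for i, s in enumerate((1, 2))]
for t in threads:
    t.start()
for t in threads:
    t.join()
pi = 4.0 * sum(out) / 100_000
```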
[Charts: Π Error vs. # Shots with fitted curves (0.0001, -0.005) for Rand_r and Random_r; Time (sec) vs. Shots for Two Threads vs. No Thread]
Background
NumPy Speedup
18
■ Converting to NumPy 7.7✕
● Vectorize: Move “for loops” into functions
● Numpy Vector/Matrix functions are compiled C-codes
■ Not only numeric calculation
● count_nonzero ● less_equal, less, .. ● sort, lexsort, ..
● where, searchsorted ● I/O
Python NumPy
import random
def pin(n):
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r2 = x*x + y*y
        if r2 <= 1.:
            h += 1
    return 4. * h / n
import numpy as np
def np_pi(n):
    x = np.random.rand(n).astype(np.float64)
    y = np.random.rand(n).astype(np.float64)
    rs = np.add(np.multiply(x, x, dtype=np.float64),
                np.multiply(y, y, dtype=np.float64),
                dtype=np.float64)
    ones = np.ones(n, dtype=np.float64)
    lss = np.less_equal(rs, ones)
    hit = np.count_nonzero(lss)
    pi = np.float64(4.) * np.float64(hit) / np.float64(n)
    return pi
Background
Summary
19
➊Avoid GIL to speed up
➋Apply GIL-Safe operations
for Thread-Safe
➌min(N, M) acceleration
[Chart: Time (sec) vs. Shots for Python, NumPy, CPython-Ext, CPython-Ext@Thread]
Background
Numba (V0.38.0) ✈Official
●Background
●Accelerate on:
CPU, M-Core, CUDA
(SSE, AVX, AVX2, AVX-512)
●@numba.jit
Just in Time Compile
■ Few user’s guides ✈Conda2018Slide
■ An excellent review ✈Matthew Rocklin
■ Supported by Anaconda, Inc.
■ The Gordon and Betty Moore Foundation
■ GPU version free from end of 2017
■ Requires: mkl, mkl_fft, mkl_random, ncurses, llvmlite
■ CUDA 2.0 or above
Outline
21
■Two different systems
CPU CUDA
@numba.jit(• •)
def pin(n)
• • •
• • •
return out
pi = pin(100)
@numba.cuda.jit(• •)
def pin(n, out)
• • •
• • •
(no return)
pin[25, 40](100, pi)
●Many Python codes: ✈Official ✈NumPy
▼
Language: All except class, try, except, with, yield
▼
Function call: inner,closure, recursive
▼
Built-in: abs() bool() complex() divmod() enumerate() float() int() iter() len() min() max() next() print() range() round() sorted() type() zip()
▼
NumPy: all() any() argmax() argmin() cumprod() cumsum() max() mean() min() nonzero() prod() std()
take() var() argsort() astype() copy() flatten() item() itemset() ravel() reshape() sort() sum() transpose() view()
▼
Modules: array, cmath, collections, ctypes, enum, math, operator, functools, random, cffi
● CUDA Kernel codes ● NumPy: Not Supported
Numba
On CPU
22
■ @numba.jit(): compiles compatible Python code to native via LLVM and executes it
●Apply to Python π ✍ ➡ 21✕
Cf. manual conversion to a CPython extension ✍ ➡ 23✕
☞ Comparable speed to manually converted C
●Apply to NumPy π ✍ ➡ 1✕
☞ NumPy functions are not accelerated
Cf. Python to NumPy ➡ 7.7✕
☞ Jit gives 21/7.7 = 3✕ over NumPy functions
▼
Numba: Python ➡LLVM ➡Python
▼
NumPy: (Python ➡C ➡Python)✕Repeat
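A sketch of the @numba.jit speedup on the Python π. If Numba is not installed, the `try/except` swaps in a no-op decorator so the same code still runs (just at plain-Python speed); with Numba, the first call compiles `pin` and subsequent calls run at the native speed quoted above.

```python
import random

try:
    from numba import jit
except ImportError:             # fallback so the sketch runs without Numba
    def jit(**kwargs):
        def deco(f):
            return f
        return deco

@jit(nopython=True)
def pin(n):
    # Same Monte Carlo loop as the earlier slides; Numba compiles it whole.
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            h += 1
    return 4.0 * h / n

pi = pin(200_000)
```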
Numba
Accelerate NumPy Indexing
23
■ Jit on NumPy indexing ➡ 817✕, actually 100✕ ✈Murillo
● Compare a “for loop” and a vectorized function
on List and NdArray, native and Jit
def for_add(n, vs):
    for i in range(n):
        vs[i] += 1

def np_add(n, vs):
    a = np.add(vs, 1)
[Chart: Time (sec) vs. Shots for the for-loop and np.add on List and NdArray, native and Jit]
Comparing On-The-Fly Accelerating Packages: Numba, TensorFlow, Dask, etc

Accelerate NumPy Indexing (cont.)
▼
NdArray indexing is 3.8✕ slower than List ✈stackoverflow
▼
Indexing requires setup calculations and branches in the main loop
▼
np.add(NdArray) is 100✕ faster than np.add(List)
Numba
On M-Core
24
■ Get all cores working:
➊ set @jit(parallel=True)
➋ change “range” to “numba.prange”
● Apply to Python π ➡ 89✕ ➡ 4.4✕ of @jit()
● No way to control the # of cores
▼ Multi-user/multi-process setups need core assignment
■ @jit(nogil=True) + ThreadPoolExecutor gives that control ✍
[Chart: Relative Speed vs. Threads against the ideal, over Real Cores → HyperThread Cores]
Numba
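Steps ➊–➋ above can be sketched as follows. The function name `grid_pi` is hypothetical, and a deterministic lattice over the unit square stands in for the random shots (per-thread random state under `prange` needs extra care), so the parallel reduction on `h` is unambiguous; without Numba the fallback degrades to a serial loop.

```python
try:
    from numba import njit, prange
except ImportError:             # fallback: serial execution, plain range
    prange = range
    def njit(**kwargs):
        def deco(f):
            return f
        return deco

@njit(parallel=True)
def grid_pi(n):
    side = 1000                 # n = side * side lattice points assumed
    h = 0
    for i in prange(n):         # parallel loop; h is a recognized reduction
        x = (i % side) / side
        y = (i // side) / side
        if x * x + y * y <= 1.0:
            h += 1
    return 4.0 * h / n

pi = grid_pi(1_000_000)
```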
On CUDA
25
■ Non-compatible Python codes (details are out of scope)
● CUDA kernel codes in definitions
▼ Python-like, not the C of PyCuda
● Insert “[#blocks, #threads]” into calls
▼ Ex. pin[25, 40](n)
● Rewriting π ✍ ➡ 1160✕ ➡ 152✕ of NumPy
▼ Use the 2nd run; the 1st includes 1.8 sec of compile/load time
[Chart: Time (sec) vs. Shots for CUDA 1st run vs. CUDA 2nd run; overhead dominates at small n]
Numba
Summary
26
➊ Convert to nogil functions
➋ Accelerate “for/while” loops
➌ Improve NumPy indexing
[Chart: Time (sec) vs. Shots for Python, CPU, Nogil@Thread, Parallel, CUDA]
Numba
Machine Learning Packages
●NumPy accelerators ●Kernel-less CUDA access ●Tensor objects ●Poor documents ➡ my own reading ?
➊ TensorFlow (V1.9) ✈Official ✈奥田
● CPU, CUDA, (TPU, ROCm, Phi); own SIMD + (SSE, AVX, AVX2, AVX-512)
➋ PyTorch (V0.4.11) ✈Official ● CUDA
➌ CuPy (V4.1.0) –Chainer– ✈Official ● CUDA
Exec Modes
28
■ TensorFlow (tf): (CPU, CUDA) ✕ (Eager, Graph) = 4
● Eager: Python is a direct executor for ordinary actions
● Graph: Python is a macro generator for computing graphs
● Eager if the 1st code is tf.enable_eager_execution(), else Graph
● Two pip packages: CPU, GPU (= GPU + CPU)
▼ Implicit: the package sets the default device
▼ Explicit: a “with tf.device('/cpu:0'):” block
■ PyTorch (torch, pt): [CPU], CUDA = 2 (NN-Graph)
● torch.func(.., device=D, ..) with D = device('cuda') or device('cpu')
● Implicit: auto-decided from operands ➡ fast
● Explicit-2: torch.func(..).cuda() ➡ slow
■ CuPy (cp): CUDA = 1 (NN-Graph)
● Only CUDA; use NumPy for the CPU
ML Packages
CUDA
29
■ TensorFlow Eager ✍: ➊ np. ➡ tf. ➋ change some func names ➌ add “tf.cast” to some funcs ➍ select the env for CUDA
■ PyTorch ✍ / CuPy ✍ Graph: ➊ np. ➡ pt. / cp. ➋ change some func names / no change ➌ add “device” options / no change ➍ set the global device type / no change
■ TensorFlow Graph ✍: ➊ create “tf.placeholder” inputs ➋ run a function with the inputs
■ TensorFlow CPU: ● execute the same codes on the CPU env
[Chart: Time (sec) vs. Shots for Tf-Eager, Tf-Graph, CuPy, PyTorch]
ML Packages
CPU
30
■ TensorFlow ✍ 2.4✕, 3.8✕: 8 cores running SIMD ?
■ PyTorch ✍ 0.7✕: for CUDA-less develop/debug
[Chart: Time (sec) vs. Shots for PyTorch, NumPy, Tf, Tf-Graph]
■ TensorFlow Eager is in progress: more functional and faster ?
[Chart: Time (sec) vs. Shots for Conda-env, VirtualEnv, Conda-MKL]
● V1.5@Jan./2018: contribution version ✈奥田
● V1.7: moved out of contribution
● V1.8: SSE, AVX link
● V1.9@Aug.: Conda links Intel MKL ✈Conda (MKL: Math Kernel Library: BLAS, LAPACK, ScaLAPACK, FFT, NN, ..) ✈Intel
● V?: contribution AutoGraph ✈GitHub
ML Packages
TensorFlow Graph
31
■ Advanced computing graph
● While, Branch, Parallel, Reduce, Scatter, etc. in CUDA
● Concurrent main-memory accesses from CUDA and CPUs
▼ Written with non-portable special control functions, not Python: a macro language
▼ The functions are hard to understand, but contrib.AutoGraph converts “for, if, ..” to Graph
● Slower than PyTorch in the π calculation
● 1000 While@CUDA ✍ ● 10 Parallel@CUDA ✍
ML Packages
Overhead (OH)
32
■ Negligible OHs for heavy functions such as fft, cv, solvers, etc.
● TensorFlow: tf.( 1. linalg 2. math 3. image 4. distributions 5. sets 6. strings ); tf.contrib.( 1. linalg 2. integrate 3. image 4. ffmpeg 5. signal 6. timeseries )
● CuPy: 1. linalg 2. math 3. fft
■ Prediction of array-fetch OHs in ordinary cases
● NumPy (CPU): np.RNG(n) ➡ xs ➡ xs[0] ➡ x
➊ CuPy–Array 1/16✕: cp.RNG(n) ➡ xs (CUDA) ➡ cp.asnumpy ➡ nd ➡ nd[0] ➡ x
➋ CuPy–Scalar: cp.RNG(n) ➡ xs ➡ xs[0] (scalar on CUDA) ➡ cp.asnumpy ➡ x
(RNG: Random Number Generator)
[Chart: Time (sec) vs. Shots for Array vs. Scalar]
▼ Transfer time from CUDA to CPU
▼ Jump caused by cache ?
ML Packages
All Overheads
33
■ Accelerating a function “r = f(a1, a2)”
● NumPy: everything on the CPU: def f(p1, p2): .. return r
● Accelerator: copy-in a1, a2 ➡ compute ➡ copy-out r
[Charts: copy-in and copy-out overheads, Time (sec) vs. Shots]
ML Packages

Tensor
34
■ Bridge between CPU and Accelerator ?
● copy in (create a Tensor object): TensorFlow convert_to_tensor(nd), PyTorch tensor(nd), CuPy array(nd) (nd: NdArray)
● copy out (convert to NdArray): TensorFlow t_obj.numpy(), PyTorch t_obj.cpu().numpy(), CuPy asnumpy(t_obj) (t_obj: Tensor object)
● Others: ▼ neural-network functions ▼ MM-Direct: scatter read/write ▼ “if”, “while”
● Buffer/Cache ✈PyTorch ✈DlPack ▼ not stored in CPU memory, cf. NumPy functions
ML Packages
Summary
35
➊ CuPy: NumPy-compatible CUDA
☞ TensorFlow: CPU-SIMD/CUDA/.., application modules
☞ PyTorch: debugging on CPU
☞ Consider the copy-in/out overhead
[Chart: Time (sec) vs. Shots for NumPy, Tf@Cpu, Tf@Gpu, CuPy@Gpu, Tf-While@Gpu, PyTorch@Gpu]
ML Packages
Dask (V0.18.0) ✈Official
●Background ●“Delayed”: a simple graph for threading
■ PyData's answer to columnar data without the main-memory limit: Hadoop ➡ Arrow ➡ 7 systems + Pandas ✈Official ✈Mckinney, HANA(SAP), RevolutionR(MS) ✈Official
■ Conda + DARPA, NSF, Gordon Moore Found., HHMI
■ Expands NumPy, Pandas, Scikit-Learn
■ Parallel computing: ● Process: Futures ● Thread: Delayed
  • 146. Graph for Multi-Threading 37
■ "delayed" defines nodes of parallel computing ✍ # Thread
● mn.visualize() at m=3 [graph: three get_pi nodes ① ② ③ feeding np.mean]
cnt = int(n / m)
ps = []
for _ in range(m):
    p = dask.delayed(get_pi)(cnt)
    ps.append(p)
mn = dask.delayed(np.mean)(ps)
pi = mn.compute()  # Execute
■ Applies delayed get_pi to all m=3 partitions
Dask
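For comparison (my own sketch, not from the slides), the same fan-out/aggregate graph can be written with the stdlib ThreadPoolExecutor; get_pi here is a plain-Python stand-in for the nogil-jitted worker of the appendix.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def get_pi(n):
    # plain-Python Monte Carlo pi worker (stand-in for the nogil version)
    hit = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hit += 1
    return 4.0 * hit / n

def pi_threads(n, m):
    # fan out m workers, aggregate with a mean, like the delayed graph
    cnt = n // m
    with ThreadPoolExecutor(max_workers=m) as tpe:
        futures = [tpe.submit(get_pi, cnt) for _ in range(m)]
        return sum(f.result() for f in futures) / m

print(pi_threads(10**5, 3))
```

With a pure-Python worker the GIL serializes the threads; only a GIL-releasing worker would actually run the three nodes in parallel, which is the point the next slides measure.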
  • 147. The Results at 3 Threads 38
■ NumPy shows little improvement (NumPy vs NumPy+Dask)
● nogil ufuncs ✈HP affect acceleration ▼ short intervals of "add, multiply, less_equal"
■ No-GIL functions improve well (CPython+Dask, NogilPy+Dask)
● ThreadPoolExecutor showed: ▼ 3✕ at CPython ▼ 3✕ at NogilPy
■ The others show no improvement (Python+Dask, Python@Jit+Dask, TfCpu+Dask, CuPy+Dask); CuPy may have nogil functions
Dask
  • 148. Delayed vs ThreadPoolExecutor 39 ■ NogilPy with ThreadPool shows lower launch overhead and higher speed ● Delayed ● ThreadPool
  • 154. [Chart (Delayed): Time (sec) vs Shots at T1-T7]
  • 160. [Charts (ThreadPool): Time (sec) vs Shots at T1-T7; Relative Slope vs Threads and Relative Speed vs Threads, each against Ideal] Dask
  • 161. Summary 40
➊ No guide about GIL-safety ● only "+=, -=" are prohibited, without reasoning
➋ Large overheads for the π calculation
■ A tool for Dask components? ■ Too early to evaluate
➊ NumPy has nogil functions ➋ CuPy may have nogil functions
● PyTorch froze ● TensorFlow@CPU: segmentation fault
  • 167. [Chart: Time (sec) vs Shots for NumPy, NumPy@Thread, NogilPy@Thread] Dask
  • 168. Threading and Nogil ● ThreadPoolExecutor ➊ Confirm the nogil-ness of CuPy ➋ GIL-safe prediction ➌ Nogil-forced NumPy
  • 169. NumPy vs CuPy 42 ■ NumPy partial-nogil, CuPy full-nogil? ● NumPy ● CuPy
  • 175. [Chart (NumPy): Time (sec) vs Shots at T1-T7]
  • 181. [Charts (CuPy): Time (sec) vs Shots at T1-T5; Relative Speed vs Threads and Relative Slope vs Threads, each against Ideal] Threading and Nogil
  • 182. Confirm CuPy 43
■ Error/π = a · N^b ✈WikiPi-1
[Log-log chart, 1 loop: Absolute Relative Error vs Shots N; series: Data (CuPy@T8) against the fitted line, and NumPy errors]
● CuPy at 8 threads: ▼ thread-safe RNG ▼ parallel execution in CUDA
● NumPy at 8 threads: ▼ GIL error caused by
h = 0
for v in lss:
    if v == 1:
        h = h + 1
(not +=)
Threading and Nogil
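The Error/π = a · N^b law can be checked with a linear fit in log-log space; a minimal sketch on synthetic data (the a and b values below are illustrative, not the measured ones; Monte Carlo error typically scales as N^-1/2):

```python
import numpy as np

# synthetic errors following a * N**b exactly
ns = np.array([10.0**k for k in range(3, 8)])
a_true, b_true = 2.0, -0.5          # illustrative values only
errs = a_true * ns**b_true

# a degree-1 fit of log(err) vs log(N) recovers exponent and prefactor
b_fit, log_a_fit = np.polyfit(np.log(ns), np.log(errs), 1)
print(b_fit, np.exp(log_a_fit))
```

On real measurements the fitted b near -0.5 is what confirms that the CuPy threads are producing independent samples rather than corrupted counts.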
  • 183. GIL-Safe Prediction 44
■ Almost impossible to predict GIL-safety: local functions turn out safe or unsafe non-deterministically ✍
# def rng_count(n): x = np.random.rand(n); ones = np.ones(n); c = np.count_nonzero(ones); return c
# def count(n): ones = np.ones(n); c = np.count_nonzero(ones); return c  # expect n == c
● count: 14 errors; no error @T2,3,4 on the test bench; no error on Intel-Atom ✍
● rng_count: no error
☞ Apply forced-nogil functions
Threading and Nogil
  • 195. [Chart, 1 loop: Time (sec) vs N for Rng_Count and Count] Threading and Nogil
  • 196. Numba JIT Options 45
■ Set nopython=True for the nogil guarantee
● Local objects are stored in a heap storage whose accesses must be serialized (mutex-like).
● In object mode those heap accesses are protected by GIL-held intervals rather than per-access mutexes, so nogil=True alone cannot be honored.
Guaranteed: @jit(nogil=True, nopython=True)
Non-guaranteed: @jit(nogil=True, nopython=False)
[Diagram: Thread-1..3 each with Variables/NameSpaces and LLVM objects; a thread releases and later catches the GIL around the Python heap storage (EntryObject Manager, Obj-1 .. Obj-n); all accesses pass through it.]
Threading and Nogil
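A tiny sketch of the guaranteed option pair above. Numba is assumed to be installed; if it is not, the plain Python function is used unchanged (that fallback is my addition, not part of the talk).

```python
try:
    import numba
    # guaranteed GIL-free: both nogil and nopython
    jit = numba.jit(nogil=True, nopython=True)
    # numba.jit(nogil=True, nopython=False) would NOT guarantee it
except ImportError:
    def jit(f):          # fallback: run the plain Python function
        return f

@jit
def ssum(n):
    # a loop nopython mode can compile without touching Python objects
    s = 0
    for i in range(n):
        s += i
    return s

print(ssum(10))
```

With nopython=True the compiled body never touches the Python heap, so releasing the GIL inside it is safe; object mode would fall back to GIL-protected object accesses.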
  • 197. Nogil NumPy by Numba 46 ■ Some NumPy functions require rewriting ● Guaranteed nogil
  • 203. [Charts: Time (sec) vs Shots, Rewritten vs Original at 1 thread; Relative Speed vs Threads against Ideal]
● Rewriting slows down (0.02✕): h = count_nonzero(lss) becomes
h = 0
for v in lss:
    if v == 1:
        h = h + 1
● Numba speeds up 1.6✕ ● 6 threads speed up 3.2✕ (5✕ of Original)
Threading and Nogil
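The rewrite trades NumPy's count_nonzero for an explicit loop that Numba can compile in nopython mode; both count the same hits. A small equivalence sketch (the array and names here are mine):

```python
import numpy as np

# 0/1 hits array like lss in the slide
lss = (np.random.rand(10**4) <= 0.5).astype(np.int64)

# rewritten, nopython-friendly count (note "h = h + 1", not "+=")
h = 0
for v in lss:
    if v == 1:
        h = h + 1

print(h, np.count_nonzero(lss))
```

In plain Python the loop is far slower than count_nonzero (the 0.02✕ above); it only pays off once Numba compiles it and the GIL can be released across threads.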
  • 204. Summary 47 ➊ Apply nogil functions for thread-safety ■ Set nopython=True with nogil=True in numba.jit ➋ Almost impossible to predict GIL-safety ➌ CuPy parallel execution in CUDA?
  • 210. [Chart: Time (sec) vs Shots for NumPy vs Nogil-NumPy] Threading and Nogil
  • 211. Conclusion 48 Execution-Time Confirmation (ETC) on run-time signatures showed: ➊ Ideal threading acceleration = min(N, M) ➋ A comparison of On-The-Fly packages: ● Numba ● TensorFlow ● PyTorch ● CuPy ● Dask ➌ Basic issues and solutions: ● GIL ● Nogil ● GIL-Safe ● Threading ● Graph ● NumPy indexing ● Copy overhead. Enjoy On-The-Fly in your own way ✍
  • 213. Appendix
MIT License, Copyright (c) 2018 Yukio Okuda

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  • 214. Numba 51

Nogil:
import numba
import random
from concurrent.futures import ThreadPoolExecutor

@numba.jit(nogil=True, nopython=True)
def nba_pi_nogil(n):
    hit = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    return 4. * hit / n

tpe = ThreadPoolExecutor(max_workers=12)
#--
def nba_pi_nogil_tp_nm(n, m):
    global tpe
    cnt = int(n / m)
    if cnt < 1:
        cnt = 1
    ans = []
    for i in range(m):
        ans.append(tpe.submit(nba_pi_nogil, cnt))
    pi = 0.
    for f in ans:
        pi += f.result()
    return pi / m

print('Test', nba_pi_nogil_tp_nm(10**5, 4))

CUDA:
import numba
import numpy as np
from numba.cuda.random import xoroshiro128p_uniform_float64
from numba.cuda.random import create_xoroshiro128p_states

@numba.cuda.jit()
def nba_cuda(n, pi, rng):
    thread_id = numba.cuda.grid(1)
    hit = 0
    for _ in range(n):
        x = xoroshiro128p_uniform_float64(rng, thread_id)
        y = xoroshiro128p_uniform_float64(rng, thread_id)
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    pi[thread_id] = 4. * hit / n

def nba_cuda_rec(n):
    threads_per_block = 25
    blocks = 40
    rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
    pis = np.ones(threads_per_block * blocks, dtype=np.float64)
    nba_cuda[blocks, threads_per_block](n, pis, rng_states)
    return pis.mean()

print('Test', nba_cuda_rec(10**5))
Appendix
  • 215. ML:TfEager,PyTorch,Cupy 52

TensorFlow-CPU/CUDA-Eager:
import tensorflow as tf
tf.contrib.eager.enable_eager_execution()
# tf.enable_eager_execution()

def tf_pi_n(n):
    xs = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    ys = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
    ones = tf.ones([n], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.count_nonzero(lss)
    pi = tf.divide(tf.multiply(tf.cast(4., tf.float64), tf.cast(hit, tf.float64)),
                   tf.cast(n, tf.float64))
    return pi.numpy()

print('Test', tf_pi_n(10**5))

CuPy-CUDA:
import cupy as cp
import numpy as np

def cp_pi_gpu(n):
    x = cp.random.rand(n, dtype=cp.float64)
    y = cp.random.rand(n, dtype=cp.float64)
    rs = cp.add(cp.multiply(x, x, dtype=np.float64),
                cp.multiply(y, y, dtype=np.float64), dtype=np.float64)
    ones = cp.ones(n, dtype=cp.float64)
    lss = cp.less_equal(rs, ones)
    hit = cp.count_nonzero(lss)
    return 4. * hit / n

PyTorch-CPU:
import torch
torch.set_default_dtype(torch.float64)

def pt_pi_cpu(n):
    x = torch.rand(n, dtype=torch.float64)
    y = torch.rand(n, dtype=torch.float64)
    rs = torch.add(torch.mul(x, x), torch.mul(y, y))
    ones = torch.ones(n, dtype=torch.float64)
    lss = torch.le(rs, ones)
    hit = torch.nonzero(lss).size()[0]
    pi = 4. * hit / n
    return pi

print('Test', pt_pi_cpu(10**5))

PyTorch-CUDA:
import torch
torch.set_default_dtype(torch.float64)
DEVICE = torch.device('cuda')

def pt_pi_gpu_all(n):
    x = torch.rand(n, device=DEVICE)
    y = torch.rand(n, device=DEVICE)
    rs = torch.add(torch.mul(x, x), torch.mul(y, y))
    ones = torch.ones(n, device=DEVICE)
    lss = torch.le(rs, ones)
    hit = torch.nonzero(lss).size()[0]
    return 4. * hit / n

print('Test', pt_pi_gpu_all(10**5))
Appendix
  • 216. ML:TfGraph 53

TensorFlow-Simple Graph:
import tensorflow as tf

def tf_pi_n(n):
    xs = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    ys = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
    ones = tf.ones([n], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.count_nonzero(lss)
    pi = tf.divide(tf.multiply(tf.cast(4., tf.float64), tf.cast(hit, tf.float64)),
                   tf.cast(n, tf.float64))
    return pi

tf_n = tf.placeholder(tf.int32, [], name='n')
tf_graph = tf_pi_n(tf_n)
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(n):
    pi = session.run(tf_graph, feed_dict={tf_n: n})
    return pi

if __name__ == "__main__":
    print('Test', get_pi(10**5))

TensorFlow-While Graph:
import tensorflow as tf
from tf_graph_simple import tf_pi_n

def tf_graph_pi_n_while_sub(i, n, pis):
    pis = tf.add(pis, tf_pi_n(n))
    return pis

def tf_graph_pi_n_while(n, loop):
    i = tf.constant(0)
    pis = tf.constant(0., dtype=tf.float64)
    i, pis = tf.while_loop(
        lambda i, pis: tf.less(i, loop),
        lambda i, pis: (tf.add(i, 1), tf_graph_pi_n_while_sub(i, n, pis)),
        [i, pis])
    pi = tf.divide(pis, tf.cast(loop, tf.float64))
    return pi

tf_n = tf.placeholder(tf.int32, [], name='n')
tf_loop = tf.placeholder(tf.int32, [], name='loop')
tf_graph_while = tf_graph_pi_n_while(tf_n, tf_loop)
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(n):
    pi = session.run(tf_graph_while, feed_dict={tf_n: n, tf_loop: 1000})
    return pi

print('Test', get_pi(10**5))
Appendix
  • 217. ML:TfGraph Dask 54

TensorFlow-Parallel Graph:
import tensorflow as tf

M = 10
m = tf.placeholder(tf.int32, [], name='m')
n = tf.placeholder(tf.int32, [], name='n')
step = tf.cast(tf.divide(n, m), dtype=tf.int32)
hit = tf.zeros([], dtype=tf.int64, name='hit')
for _ in range(M):
    xs = tf.random_uniform(shape=[step], minval=0., maxval=1., dtype=tf.float64)
    ys = tf.random_uniform(shape=[step], minval=0., maxval=1., dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
    ones = tf.ones([step], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.add(hit, tf.count_nonzero(lss, dtype=tf.int64))
pi = tf.divide(tf.multiply(tf.cast(4., tf.float64), tf.cast(hit, tf.float64)),
               tf.cast(n, tf.float64))
ans = pi
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(in_n, in_m):
    pi = session.run(ans, feed_dict={n: in_n, m: in_m})
    return pi

print('Test', get_pi(10**5, 10))

Dask-Numba:
import numpy as np
import random
import dask
import numba

@numba.jit(nogil=True)
def get_pi(n):
    hit = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    return 4. * hit / n

def dsk_nba_pi_nogil(n, m, v=False):
    cnt = int(n / m)
    ps = []
    for _ in range(m):
        p = dask.delayed(get_pi)(cnt)
        ps.append(p)
    mn = dask.delayed(np.mean)(ps)
    if v:
        mn.visualize(optimize_graph=True)
        pi = 0
    else:
        pi = mn.compute()
    return pi

# visualize() requires python-graphviz and the Graphviz utility;
# it generates ./mydask.png
# dsk_nba_pi_nogil(10**5, 3, v=True)
print('Test', dsk_nba_pi_nogil(10**5, 3))
Appendix
  • 218. Miscellaneous 55

GIL-Safe:
import numpy as np
from concurrent.futures import ThreadPoolExecutor

tpe = ThreadPoolExecutor(max_workers=25)

def rng_count(n):
    x = np.random.rand(n).astype(np.float64)
    ones = np.ones(n, dtype=np.float64)
    c = np.count_nonzero(ones)
    return c

def count(n):
    ones = np.ones(n, dtype=np.float64)
    c = np.count_nonzero(ones)
    return c

def tpe_pi_nm_min(n, m, f):
    global tpe
    ts = []
    for i in range(m):
        ts.append(tpe.submit(f, n))
    pis = []
    for t in ts:
        pis.append(t.result())
    return min(pis)

for n in (7*10**6, 8*10**6, 9*10**6, 10**7):
    c = tpe_pi_nm_min(n, 9, count)
    print("count:", n == c, n, c)
    c = tpe_pi_nm_min(n, 9, rng_count)
    print("rng_count:", n == c, n, c)

GIL-Safe-Note:
Results of the prints depend on the executing machine.
Benchmark machine:
count: False 7000000 34302
rng_count: True 7000000 7000000
count: False 8000000 10750
rng_count: True 8000000 8000000
count: False 9000000 525822
rng_count: True 9000000 9000000
count: False 10000000 455166
rng_count: True 10000000 10000000
Intel-Atom N3150 @ 1.60GHz, 4 cores, no Hyper-Thread, stepping=3: all True!!
Appendix