Yukio Okuda
(freelance)
okf.yukio@gmail.com
PyConJP2018/9 Y. Okuda
Me= A Programmer over 40 Years= Joyful
2
   
[Chart: CPU Clock (GHz) and # of Real Cores by Year, spanning the One Core, Multi Core, and HyperThread eras]
GIL@Python
One Core
N-Threads ≤ 1-Thread
CPU-Bound ➡Special
Vector Processor
Multi Core
N-Threads@M-Cores = min(N,M)
CPU-Bound ➡General
Low-price GPGPU
IBM/SRB, SUN/process, DECα/VMS-thread, POSIX/pthread, Linux/pthread
✈Wiki-Xeon
Story(1/2)
3
How to accelerate CPU-bound code in Python
Fast Execution
Compile Languages:
CPython-extension
No GIL:
Cython, PyPy, Jython, IronPython, ..
Device depend:
OpenMP, OpenACC, PyCuda
Fast Development
● Compatibility
● Portability
On-The-Fly (OTF)
Packages
PDF in the cloud; codes in the Appendix (✍); links marked ✈ (e.g., ✈GIL)
Introduction
Story(2/2)
4
■ Speedups are shown, but returns diminish: 10✕ removes 90% of the time, 20✕ removes 95%, 50✕ removes 98%, 100✕ removes 99%
[Chart: Time (sec) vs. Shots for Python, NumPy, NumPy-Tf, Python-Numba, CPython-Ext, NumPy-Tf@Gpu, Python-Numba Thread, Tf-Graph-While@Gpu, PyTorch/CuPy@Gpu]
Note: very little data transfer; no tuning of the packages
1000 Loops
Monte Carlo ΠCalculation
H/W S/W
5
env-all: Python 3.6, Conda 5.1.0
Tf-cpu,gpu: Python 3.5, VirtualEnv
Mint Linux(Ubuntu 16.04)
CPU + GPU
Batch python on shell
SSH, NFS
CPU: i7-2630QM stepping=5
(Sandy Bridge’12 mobile)
Turbo=Off, EIST=Off
SSE SSE2 SSE4.1 SSE4.2 AVX
2.0 GHz 4Core 8HT
L1=256K, L2=1M, L3=6M
PCIe II 5GT/s
DDR3 16G 21.3G/s,swap off
QM77, NF9G(Jetway Co.)
GPU: GTX-1060
(Pascal GP-106)
1.5 GHz, 1280 cores
L2=1.5M(192bI/F)
PCIe II 5GT/s
DDR5 6G 8G/s
CUDA-9 CC-6.1
Test bench
Background
➊ Python Thread
➋ GIL
➌ CPython-Ext
➍ NumPy
Speeds of Process and Thread (1/2)
7
def add(n):
    a = 0
    for _ in range(n):
        a += 1

for n in [ .. ]:
    ts = time.monotonic()
    for _ in range(1000):
        f(n)
    te = time.monotonic()
def series(n):
    add(n)
    add(n)

def process(n):
    p1 = Process(target=add,
                 args=(n,))
    p1.start()
    p2 = Process(target=add,
    ...
    p1.join(); p2.join()

def thread(n):
    t1 = Thread(target=add,
                args=(n,))
    t1.start()
    t2 = Thread(target=add,
    ...
    t1.join(); t2.join()
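The fragments above can be assembled into one runnable harness. This is a sketch, not the original script: the elided arguments are filled with one plausible choice (two workers, the 1000-loop timing reduced to 5 loops), and the "fork" start method is assumed so that `Process` targets defined in the script work without a `__main__` guard (the default on Linux).

```python
import time
from threading import Thread
import multiprocessing as mp

# Assumption: "fork" start method (Linux default).
ctx = mp.get_context("fork")

def add(n):
    a = 0
    for _ in range(n):
        a += 1

def series(n):
    add(n); add(n)

def thread(n):
    t1 = Thread(target=add, args=(n,))
    t2 = Thread(target=add, args=(n,))
    t1.start(); t2.start()
    t1.join(); t2.join()

def process(n):
    p1 = ctx.Process(target=add, args=(n,))
    p2 = ctx.Process(target=add, args=(n,))
    p1.start(); p2.start()
    p1.join(); p2.join()

def bench(f, n, loops=5):
    # Time `loops` runs of f(n), as on the slide (time.monotonic pair).
    ts = time.monotonic()
    for _ in range(loops):
        f(n)
    return time.monotonic() - ts

times = {f.__name__: bench(f, 100_000) for f in (series, thread, process)}
```

On a multi-core machine this reproduces the next slide's pattern: `thread` is no faster than `series` (the GIL serializes the work), while `process` runs the two halves in parallel at the cost of a per-process launch overhead.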
Background
Speeds of Process and Thread (2/2)
8
■ Speed
●Thread 1× Series
▼
(25% down @ Turbo On)
●Process 1.8× Series
[Chart: Time (sec) vs. Shots for Thread, Series, Process]
■ Launch time
●Thread Zero
●Process 6 msec /each
[Chart: launch-time detail, Time (sec) vs. Shots for Thread, Series, Process]
Background
Is Thread Safe? (1/2)
9
def add(n):
    global g
    for _ in range(n):
        g += 1

def sub(n):
    global g
    for _ in range(n):
        g -= 1

g = None
def a_s(n):
    global g
    g = 0
    t1 = Thread( ..
        add, .. n)
    t2 = Thread( ..
        sub, .. n)
    .. .. ..
    return g

for n in [ .. ]:
    gs = []
    for _ in range(1000):
        gs.append(a_s(n))
    n0 = not_zero_count(gs)
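A self-contained version of this race test, with the elided thread setup filled in and `not_zero_count` realized as a simple comprehension (both are illustrative choices, not the slide's exact code). `g += 1` is a read-modify-write, so non-zero results can appear; on a fast machine with short runs the count is often zero, matching the slide's finding that errors only show up once runs get long enough.

```python
from threading import Thread

g = 0

def add(n):
    global g
    for _ in range(n):
        g += 1   # read-modify-write on a shared global: not atomic

def sub(n):
    global g
    for _ in range(n):
        g -= 1

def a_s(n):
    global g
    g = 0
    t1 = Thread(target=add, args=(n,))
    t2 = Thread(target=sub, args=(n,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return g

results = [a_s(50_000) for _ in range(20)]
not_zero = sum(1 for r in results if r != 0)  # may well be 0 on short runs
```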
Background
Is Thread Safe? (2/2)
10
■ Time ≥ 8 msec ➡ not thread-safe (global and local)
[Charts: Time (sec) and Not-Zero-in-GS counts vs. Shots; non-zero results appear as run time grows]
Background
Why not Safe? GIL
11
■ The GIL lets only one thread run at a time, to avoid object corruption ✈Dabeaz ✈Abhinav Ajitsaria
● GIL: Global Interpreter Lock
■ Thread execution is chopped into tslice intervals, and updates can be lost ✈A. Jesse
● tslice = 5 msec ● Errors appear from 8 msec
☞ For acceleration, avoid GIL and Python object access
☞ For no errors, finish within tslice or apply GIL-safe operations
[Diagram: Thread1 and Thread2 alternately holding the GIL for one tslice at a time]
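The 5 msec tslice above is CPython's thread switch interval, which can be inspected and changed at runtime. A minimal sketch (changing it does not make thread-unsafe code safe; it only changes how often the races get a chance to occur):

```python
import sys

default = sys.getswitchinterval()  # 0.005 sec = the 5 msec tslice above
sys.setswitchinterval(0.001)       # ask CPython to switch threads more often
assert abs(sys.getswitchinterval() - 0.001) < 1e-6
sys.setswitchinterval(default)     # restore the default
```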
Background
How to Avoid GIL
12
■ CPython-Extension:
1. Copy Python objects to C-Data
2. Apply the “Py_BEGIN_ALLOW_THREADS” C macro
3. Execute C-Native codes or OTF codes
4. Apply the “Py_END_ALLOW_THREADS” C macro
5. Copy C-Data to Python objects
[Diagram: each thread copies in, releases the GIL with Py_BEGIN_ALLOW_THREADS, runs C-native code, re-acquires it with Py_END_ALLOW_THREADS, then copies out]
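Steps 2–4 can be observed from pure Python without writing an extension: several stdlib C modules already wrap their native loops in these macros. For example, zlib's compression loop releases the GIL, so two compressing threads genuinely overlap on two cores; a small demonstration:

```python
import zlib
from threading import Thread

data = b"PyConJP " * 500_000          # a few MB of compressible input
out = [None, None]

def work(i):
    # zlib's C loop runs between the ALLOW_THREADS macros, so two
    # compressions can proceed in parallel despite the GIL.
    out[i] = zlib.compress(data, 6)

threads = [Thread(target=work, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Timing this against two sequential `zlib.compress` calls shows near-2✕ scaling on a multi-core machine, which is exactly what the five steps above buy a CPython extension.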
Background
Monte Carlo ΠCalculation
13
[Figure: random shots on a unit square with an inscribed quarter circle]
H hits in the circle targeting
N random shots at a square
π = 4 ·H/N ✈WikiPi-2 ✈LLNL
Error/π = a · N^b ✈WikiPi-1
Python C
import random
def pin(n):
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r2 = x*x + y*y
        if r2 <= 1.:
            h += 1
    return 4. * h / n
double pin(int n) {
    unsigned int s = time(NULL);
    int h = 0;
    for (int i = 0; i < n; ++i) {
        double x = (double)rand_r(&s) / (double)RAND_MAX;
        double y = (double)rand_r(&s) / (double)RAND_MAX;
        double r2 = x*x + y*y;
        if (r2 <= 1.)
            h += 1;
    }
    return 4. * (double)h / (double)n;
}
Background
Multi-Threaded ΠCalculation
14
■ Original: pin(n)
Get h hits in n shots ➡ 4 ·h/n
■ m Threading: pinm(n, m)
Launch (Map): h1 in n/m shots, h2 in n/m, ..., hm in n/m
h = sum(h1, h2, .., hm) (Reduce)
4 ·h/n
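The Map/Reduce shape above, sketched with `concurrent.futures` (the names `hits` and `pinm` follow the slide's pinm(n, m) idea but are illustrative). Because pure-Python `hits` holds the GIL, threads give no speedup here; the point is the structure: split n shots over m workers, then sum the partial hit counts.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def hits(n):
    # One worker's share: count hits inside the quarter circle.
    h = 0
    for _ in range(n):
        if random.random() ** 2 + random.random() ** 2 <= 1.0:
            h += 1
    return h

def pinm(n, m):
    per = n // m
    with ThreadPoolExecutor(max_workers=m) as ex:
        hs = list(ex.map(hits, [per] * m))   # Map: m partial counts
    return 4.0 * sum(hs) / (per * m)         # Reduce: sum, then scale

pi = pinm(200_000, 4)
```

Swapping the executor for a `ProcessPoolExecutor` (or a nogil-compiled `hits`) is what turns this structure into the real min(N, M) speedup of the following slides.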
Background
Π@CPython
15
■ Easy Operation (good tools and official documents)
■ Requires running setup for each release of CPython
import cif
pi = cif.pin(n)
dist/cif- • • • egg
python setup.py • • •
(Compile, Link, Deploy, Test)
cifmodule.c
#include <Python.h>
static PyObject *
pin( • • • ){
    Py_BEGIN_ALLOW_THREADS
    for (int i = 0; i < n; ++i){ • • •
    Py_END_ALLOW_THREADS
setup.py
import setuptools
setup( • • • )
cif test.py
import unittest
import cif
Background
Effects of Threads and Cores
16
☞ min(N, M)× – Overhead
N: # of Threads, M: # of Real Cores ;
     
[Chart: Relative Speed vs. Threads; speed follows the ideal min(N, M) line over the Real Cores, then flattens over the HyperThread cores]
Background
Hard to Develop Threading
17
■ Design issues: out of scope of this talk
■ An issue in this trial: rand_r vs. random_r
● rand_r: low randomness, ideal speedup ➡ Selected
● random_r: good randomness, but slows down under threading ✈stackoverflow
● The standard gives no clear speed specification for multi-threading ✈open-std
● 80 stdlib functions are not thread-safe ✈opengroup (among them: rand, random, drand48, lrand48, mrand48)
● “more standardization―for compilers, users, and libraries .. activation of threads” Shameem, p.291, Multi-Core Programming ✈Intel-Press
☞ Check the speeds of official thread-safe functions
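In Python the rand_r/random_r concern maps onto `random.Random`: the module-level functions share one generator, while giving each thread its own seeded `Random` instance (the analogue of rand_r's per-thread state) avoids any contention on shared state. A sketch, with the helper `hits` and the seed choices being illustrative:

```python
import random
from threading import Thread

def hits(seed, n, out, i):
    rng = random.Random(seed)   # private generator: no shared RNG state
    h = 0
    for _ in range(n):
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0:
            h += 1
    out[i] = h

out = [0, 0]
threads = [Thread(target=hits, args=(s, 50_000, out, i))
           for i, s in enumerate((1, 2))]
for t in threads:
    t.start()
for t in threads:
    t.join()
pi = 4.0 * sum(out) / 100_000
```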
[Charts: Π Error vs. # Shots with fitted curves (0.0001, -0.005) for Rand_r and Random_r; Time (sec) vs. Shots for Two Threads vs. No Thread]
Background
NumPy Speedup
18
■ Converting to NumPy 7.7✕
● Vectorize: Move “for loops” into functions
● Numpy Vector/Matrix functions are compiled C-codes
■ Not only numeric calculation
● count_nonzero ● less_equal, less, .. ● sort, lexsort, ..
● where, searchsorted ● I/O
Python NumPy
import random
def pin(n):
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r2 = x*x + y*y
        if r2 <= 1.:
            h += 1
    return 4. * h / n
import numpy as np
def np_pi(n):
    x = np.random.rand(n).astype(np.float64)
    y = np.random.rand(n).astype(np.float64)
    rs = np.add(np.multiply(x, x, dtype=np.float64),
                np.multiply(y, y, dtype=np.float64),
                dtype=np.float64)
    ones = np.ones(n, dtype=np.float64)
    lss = np.less_equal(rs, ones)
    hit = np.count_nonzero(lss)
    pi = np.float64(4.) * np.float64(hit) / np.float64(n)
    return pi
Background
Summary
19
➊Avoid GIL to speed up
➋Apply GIL-Safe operations
for Thread-Safe
➌min(N, M) acceleration
[Chart: Time (sec) vs. Shots for Python, NumPy, CPython-Ext, CPython-Ext@Thread]
Background
Numba (V0.38.0) ✈Official
●Background
●Accelerate on:
CPU, M-Core, CUDA
(SSE, AVX, AVX2, AVX-512)
●@numba.jit
Just in Time Compile
■ Few user’s guides ✈Conda2018Slide
■ An excellent review ✈Matthew Rocklin
■ Supported by Anaconda, Inc.
■ The Gordon and Betty Moore Foundation
■ GPU version free from end of 2017
■ Requires: mkl, mkl_fft, mkl_random, ncurses, llvmlite
■ CUDA 2.0 or above
Outline
21
■Two different systems
CPU CUDA
@numba.jit(• •)
def pin(n)
• • •
• • •
return out
pi = pin(100)
@numba.cuda.jit(• •)
def pin(n, out)
• • •
• • •
(no return)
pin[25, 40](100, pi)
●Many Python codes: ✈Official ✈NumPy
▼
Language: All except class, try, except, with, yield
▼
Function call: inner,closure, recursive
▼
Built-in: abs() bool() complex() divmod() enumerate() float() int() iter() len() min() max() next() print() range() round() sorted() type() zip()
▼
NumPy: all() any() argmax() argmin() cumprod() cumsum() max() mean() min() nonzero() prod() std()
take() var() argsort() astype() copy() flatten() item() itemset() ravel() reshape() sort() sum() transpose() view()
▼
Modules: array, cmath, collections, ctypes, enum, math, operator, functools, random, cffi
● CUDA Kernel codes ● NumPy: Not Supported
Numba
On CPU
22
■ @numba.jit(): compiles compatible Python code to native via LLVM and executes it
●Apply to Python π ✍ ➡ 21✕
Cf. manual conversion to a CPython extension ✍ ➡ 23✕
☞ Comparable speed to manually converted C
●Apply to NumPy π ✍ ➡ 1✕
☞ NumPy functions are not accelerated
Cf. Python to NumPy ➡ 7.7✕
☞ Jit gives 21/7.7 = 3✕ over NumPy functions
▼
Numba: Python ➡LLVM ➡Python
▼
NumPy: (Python ➡C ➡Python)✕Repeat
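A sketch of the @numba.jit speedup on the Python π. If Numba is not installed, the `try/except` swaps in a no-op decorator so the same code still runs (just at plain-Python speed); with Numba, the first call compiles `pin` and subsequent calls run at the native speed quoted above.

```python
import random

try:
    from numba import jit
except ImportError:             # fallback so the sketch runs without Numba
    def jit(**kwargs):
        def deco(f):
            return f
        return deco

@jit(nopython=True)
def pin(n):
    # Same Monte Carlo loop as the earlier slides; Numba compiles it whole.
    h = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            h += 1
    return 4.0 * h / n

pi = pin(200_000)
```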
Numba
Accelerate NumPy Indexing
23
■ Jit on NumPy indexing ➡ 817✕, actually 100✕ ✈Murillo
● Compare a “for loop” and a vectorized function
on List and NdArray, native and Jit
def for_add(n, vs):
    for i in range(n):
        vs[i] += 1

def np_add(n, vs):
    a = np.add(vs, 1)
[Chart: Time (sec) vs. Shots for the for-loop and np.add on List and NdArray, native and Jit]
Comparing On-The-Fly Accelerating Packages: Numba, TensorFlow, Dask, etc

Accelerate NumPy Indexing (cont.)
▼
NdArray indexing is 3.8✕ slower than List ✈stackoverflow
▼
Indexing requires setup calculations and branches in the main loop
▼
np.add(NdArray) is 100✕ faster than np.add(List)
Numba
On M-Core
24
■ Get all cores working:
➊ set @jit(parallel=True)
➋ change “range” to “numba.prange”
● Apply to Python π ➡ 89✕ ➡ 4.4✕ of @jit()
● No way to control the # of cores
▼ Multi-user/multi-process setups need core assignment
■ @jit(nogil=True) + ThreadPoolExecutor gives that control ✍
[Chart: Relative Speed vs. Threads against the ideal, over Real Cores → HyperThread Cores]
Numba
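Steps ➊–➋ above can be sketched as follows. The function name `grid_pi` is hypothetical, and a deterministic lattice over the unit square stands in for the random shots (per-thread random state under `prange` needs extra care), so the parallel reduction on `h` is unambiguous; without Numba the fallback degrades to a serial loop.

```python
try:
    from numba import njit, prange
except ImportError:             # fallback: serial execution, plain range
    prange = range
    def njit(**kwargs):
        def deco(f):
            return f
        return deco

@njit(parallel=True)
def grid_pi(n):
    side = 1000                 # n = side * side lattice points assumed
    h = 0
    for i in prange(n):         # parallel loop; h is a recognized reduction
        x = (i % side) / side
        y = (i // side) / side
        if x * x + y * y <= 1.0:
            h += 1
    return 4.0 * h / n

pi = grid_pi(1_000_000)
```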
On CUDA
25
■ Non-compatible Python codes (details are out of scope)
● CUDA kernel codes in definitions
▼ Python-like, not the C of PyCuda
● Insert “[#blocks, #threads]” into calls
▼ Ex. pin[25, 40](n)
● Rewriting π ✍ ➡ 1160✕ ➡ 152✕ of NumPy
▼ Use the 2nd run; the 1st includes 1.8 sec of compile/load time
[Chart: Time (sec) vs. Shots for CUDA 1st run vs. CUDA 2nd run; overhead dominates at small n]
Numba
Summary
26
➊ Convert to nogil functions
➋ Accelerate “for/while” loops
➌ Improve NumPy indexing
[Chart: Time (sec) vs. Shots for Python, CPU, Nogil@Thread, Parallel, CUDA]
Numba
Machine Learning Packages
●NumPy accelerators ●Kernel-less CUDA access ●Tensor objects ●Poor documents ➡ my own reading ?
➊ TensorFlow (V1.9) ✈Official ✈奥田
● CPU, CUDA, (TPU, ROCm, Phi); own SIMD + (SSE, AVX, AVX2, AVX-512)
➋ PyTorch (V0.4.11) ✈Official ● CUDA
➌ CuPy (V4.1.0) –Chainer– ✈Official ● CUDA
Exec Modes
28
■ TensorFlow (tf): (CPU, CUDA) ✕ (Eager, Graph) = 4
● Eager: Python is a direct executor for ordinary actions
● Graph: Python is a macro generator for computing graphs
● Eager if the 1st code is tf.enable_eager_execution(), else Graph
● Two pip packages: CPU, GPU (= GPU + CPU)
▼ Implicit: the package sets the default device
▼ Explicit: a “with tf.device('/cpu:0'):” block
■ PyTorch (torch, pt): [CPU], CUDA = 2 (NN-Graph)
● torch.func(.., device=D, ..) with D = device('cuda') or device('cpu')
● Implicit: auto-decided from operands ➡ fast
● Explicit-2: torch.func(..).cuda() ➡ slow
■ CuPy (cp): CUDA = 1 (NN-Graph)
● Only CUDA; use NumPy for the CPU
ML Packages
CUDA
29
■ TensorFlow Eager ✍: ➊ np. ➡ tf. ➋ change some func names ➌ add “tf.cast” to some funcs ➍ select the env for CUDA
■ PyTorch ✍ / CuPy ✍ Graph: ➊ np. ➡ pt. / cp. ➋ change some func names / no change ➌ add “device” options / no change ➍ set the global device type / no change
■ TensorFlow Graph ✍: ➊ create “tf.placeholder” inputs ➋ run a function with the inputs
■ TensorFlow CPU: ● execute the same codes on the CPU env
[Chart: Time (sec) vs. Shots for Tf-Eager, Tf-Graph, CuPy, PyTorch]
ML Packages
CPU
30
■ TensorFlow ✍ 2.4✕, 3.8✕: 8 cores running SIMD ?
■ PyTorch ✍ 0.7✕: for CUDA-less develop/debug
[Chart: Time (sec) vs. Shots for PyTorch, NumPy, Tf, Tf-Graph]
■ TensorFlow Eager is in progress: more functional and faster ?
[Chart: Time (sec) vs. Shots for Conda-env, VirtualEnv, Conda-MKL]
● V1.5@Jan./2018: contribution version ✈奥田
● V1.7: moved out of contribution
● V1.8: SSE, AVX link
● V1.9@Aug.: Conda links Intel MKL ✈Conda (MKL: Math Kernel Library: BLAS, LAPACK, ScaLAPACK, FFT, NN, ..) ✈Intel
● V?: contribution AutoGraph ✈GitHub
ML Packages
TensorFlow Graph
31
■ Advanced computing graph
● While, Branch, Parallel, Reduce, Scatter, etc. in CUDA
● Concurrent main-memory accesses from CUDA and CPUs
▼ Written with non-portable special control functions, not Python: a macro language
▼ The functions are hard to understand, but contrib.AutoGraph converts “for, if, ..” to Graph
● Slower than PyTorch in the π calculation
● 1000 While@CUDA ✍ ● 10 Parallel@CUDA ✍
ML Packages
Overhead (OH)
32
■ Negligible OHs for heavy functions such as fft, cv, solvers, etc.
● TensorFlow: tf.( 1. linalg 2. math 3. image 4. distributions 5. sets 6. strings ); tf.contrib.( 1. linalg 2. integrate 3. image 4. ffmpeg 5. signal 6. timeseries )
● CuPy: 1. linalg 2. math 3. fft
■ Prediction of array-fetch OHs in ordinary cases
● NumPy (CPU): np.RNG(n) ➡ xs ➡ xs[0] ➡ x
➊ CuPy–Array 1/16✕: cp.RNG(n) ➡ xs (CUDA) ➡ cp.asnumpy ➡ nd ➡ nd[0] ➡ x
➋ CuPy–Scalar: cp.RNG(n) ➡ xs ➡ xs[0] (scalar on CUDA) ➡ cp.asnumpy ➡ x
(RNG: Random Number Generator)
[Chart: Time (sec) vs. Shots for Array vs. Scalar]
▼ Transfer time from CUDA to CPU
▼ Jump caused by cache ?
ML Packages
All Overheads
33
■ Accelerating a function “r = f(a1, a2)”
● NumPy: everything on the CPU: def f(p1, p2): .. return r
● Accelerator: copy-in a1, a2 ➡ compute ➡ copy-out r
[Charts: copy-in and copy-out overheads, Time (sec) vs. Shots]
ML Packages

Tensor
34
■ Bridge between CPU and Accelerator ?
● copy in (create a Tensor object): TensorFlow convert_to_tensor(nd), PyTorch tensor(nd), CuPy array(nd) (nd: NdArray)
● copy out (convert to NdArray): TensorFlow t_obj.numpy(), PyTorch t_obj.cpu().numpy(), CuPy asnumpy(t_obj) (t_obj: Tensor object)
● Others: ▼ neural-network functions ▼ MM-Direct: scatter read/write ▼ “if”, “while”
● Buffer/Cache ✈PyTorch ✈DlPack ▼ not stored in CPU memory, cf. NumPy functions
ML Packages
Summary
35
➊ CuPy: NumPy-compatible CUDA
☞ TensorFlow: CPU-SIMD/CUDA/.., application modules
☞ PyTorch: debugging on CPU
☞ Consider the copy-in/out overhead
[Chart: Time (sec) vs. Shots for NumPy, Tf@Cpu, Tf@Gpu, CuPy@Gpu, Tf-While@Gpu, PyTorch@Gpu]
ML Packages
Dask (V0.18.0) ✈Official
●Background ●“Delayed”: a simple graph for threading
■ PyData's answer to columnar data without the main-memory limit: Hadoop ➡ Arrow ➡ 7 systems + Pandas ✈Official ✈Mckinney, HANA(SAP), RevolutionR(MS) ✈Official
■ Conda + DARPA, NSF, Gordon Moore Found., HHMI
■ Expands NumPy, Pandas, Scikit-Learn
■ Parallel computing: ● Process: Futures ● Thread: Delayed
  • 146. Graph for Multi-Threading 37
■ "delayed" defines nodes of parallel computing ✍ # Thread
● mn.visualize() at m=3 [graph: three get_pi nodes ① ② ③ feeding np.mean]
cnt = int(n / m)
ps = []
for _ in range(m):
    p = dask.delayed(get_pi)(cnt)
    ps.append(p)
mn = dask.delayed(np.mean)(ps)
pi = mn.compute()  # Execute
■ Applies delayed get_pi to all m=3 partitions
Dask
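For comparison (my own sketch, not from the slides), the same fan-out/aggregate graph can be written with the stdlib ThreadPoolExecutor; get_pi here is a plain-Python stand-in for the nogil-jitted worker of the appendix.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def get_pi(n):
    # plain-Python Monte Carlo pi worker (stand-in for the nogil version)
    hit = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hit += 1
    return 4.0 * hit / n

def pi_threads(n, m):
    # fan out m workers, aggregate with a mean, like the delayed graph
    cnt = n // m
    with ThreadPoolExecutor(max_workers=m) as tpe:
        futures = [tpe.submit(get_pi, cnt) for _ in range(m)]
        return sum(f.result() for f in futures) / m

print(pi_threads(10**5, 3))
```

With a pure-Python worker the GIL serializes the threads; only a GIL-releasing worker would actually run the three nodes in parallel, which is the point the next slides measure.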
  • 147. The Results at 3 Threads 38
■ NumPy shows little improvement (NumPy vs NumPy+Dask)
● nogil ufuncs ✈HP affect acceleration ▼ short intervals of "add, multiply, less_equal"
■ No-GIL functions improve well (CPython+Dask, NogilPy+Dask)
● ThreadPoolExecutor showed: ▼ 3✕ at CPython ▼ 3✕ at NogilPy
■ The others show no improvement (Python+Dask, Python@Jit+Dask, TfCpu+Dask, CuPy+Dask); CuPy may have nogil functions
Dask
  • 148. Delayed vs ThreadPoolExecutor 39 ■ NogilPy with ThreadPool shows lower launch overhead and higher speed ● Delayed ● ThreadPool
  • 154. [Chart (Delayed): Time (sec) vs Shots at T1-T7]
  • 160. [Charts (ThreadPool): Time (sec) vs Shots at T1-T7; Relative Slope vs Threads and Relative Speed vs Threads, each against Ideal] Dask
  • 161. Summary 40
➊ No guide about GIL-safety ● only "+=, -=" are prohibited, without reasoning
➋ Large overheads for the π calculation
■ A tool for Dask components? ■ Too early to evaluate
➊ NumPy has nogil functions ➋ CuPy may have nogil functions
● PyTorch froze ● TensorFlow@CPU: segmentation fault
  • 167. [Chart: Time (sec) vs Shots for NumPy, NumPy@Thread, NogilPy@Thread] Dask
  • 168. Threading and Nogil ● ThreadPoolExecutor ➊ Confirm the nogil-ness of CuPy ➋ GIL-safe prediction ➌ Nogil-forced NumPy
  • 169. NumPy vs CuPy 42 ■ NumPy partial-nogil, CuPy full-nogil? ● NumPy ● CuPy
  • 175. [Chart (NumPy): Time (sec) vs Shots at T1-T7]
  • 181. [Charts (CuPy): Time (sec) vs Shots at T1-T5; Relative Speed vs Threads and Relative Slope vs Threads, each against Ideal] Threading and Nogil
  • 182. Confirm CuPy 43
■ Error/π = a · N^b ✈WikiPi-1
[Log-log chart, 1 loop: Absolute Relative Error vs Shots N; series: Data (CuPy@T8) against the fitted line, and NumPy errors]
● CuPy at 8 threads: ▼ thread-safe RNG ▼ parallel execution in CUDA
● NumPy at 8 threads: ▼ GIL error caused by
h = 0
for v in lss:
    if v == 1:
        h = h + 1
(not +=)
Threading and Nogil
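The Error/π = a · N^b law can be checked with a linear fit in log-log space; a minimal sketch on synthetic data (the a and b values below are illustrative, not the measured ones; Monte Carlo error typically scales as N^-1/2):

```python
import numpy as np

# synthetic errors following a * N**b exactly
ns = np.array([10.0**k for k in range(3, 8)])
a_true, b_true = 2.0, -0.5          # illustrative values only
errs = a_true * ns**b_true

# a degree-1 fit of log(err) vs log(N) recovers exponent and prefactor
b_fit, log_a_fit = np.polyfit(np.log(ns), np.log(errs), 1)
print(b_fit, np.exp(log_a_fit))
```

On real measurements the fitted b near -0.5 is what confirms that the CuPy threads are producing independent samples rather than corrupted counts.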
  • 183. GIL-Safe Prediction 44
■ Almost impossible to predict GIL-safety: local functions turn out safe or unsafe non-deterministically ✍
# def rng_count(n): x = np.random.rand(n); ones = np.ones(n); c = np.count_nonzero(ones); return c
# def count(n): ones = np.ones(n); c = np.count_nonzero(ones); return c  # expect n == c
● count: 14 errors; no error @T2,3,4 on the test bench; no error on Intel-Atom ✍
● rng_count: no error
☞ Apply forced-nogil functions
Threading and Nogil
  • 195. [Chart, 1 loop: Time (sec) vs N for Rng_Count and Count] Threading and Nogil
  • 196. Numba JIT Options 45
■ Set nopython=True for the nogil guarantee
● Local objects are stored in a heap storage whose accesses must be serialized (mutex-like).
● In object mode those heap accesses are protected by GIL-held intervals rather than per-access mutexes, so nogil=True alone cannot be honored.
Guaranteed: @jit(nogil=True, nopython=True)
Non-guaranteed: @jit(nogil=True, nopython=False)
[Diagram: Thread-1..3 each with Variables/NameSpaces and LLVM objects; a thread releases and later catches the GIL around the Python heap storage (EntryObject Manager, Obj-1 .. Obj-n); all accesses pass through it.]
Threading and Nogil
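A tiny sketch of the guaranteed option pair above. Numba is assumed to be installed; if it is not, the plain Python function is used unchanged (that fallback is my addition, not part of the talk).

```python
try:
    import numba
    # guaranteed GIL-free: both nogil and nopython
    jit = numba.jit(nogil=True, nopython=True)
    # numba.jit(nogil=True, nopython=False) would NOT guarantee it
except ImportError:
    def jit(f):          # fallback: run the plain Python function
        return f

@jit
def ssum(n):
    # a loop nopython mode can compile without touching Python objects
    s = 0
    for i in range(n):
        s += i
    return s

print(ssum(10))
```

With nopython=True the compiled body never touches the Python heap, so releasing the GIL inside it is safe; object mode would fall back to GIL-protected object accesses.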
  • 197. Nogil NumPy by Numba 46 ■ Some NumPy functions require rewriting ● Guaranteed nogil
  • 203. [Charts: Time (sec) vs Shots, Rewritten vs Original at 1 thread; Relative Speed vs Threads against Ideal]
● Rewriting slows down (0.02✕): h = count_nonzero(lss) becomes
h = 0
for v in lss:
    if v == 1:
        h = h + 1
● Numba speeds up 1.6✕ ● 6 threads speed up 3.2✕ (5✕ of Original)
Threading and Nogil
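The rewrite trades NumPy's count_nonzero for an explicit loop that Numba can compile in nopython mode; both count the same hits. A small equivalence sketch (the array and names here are mine):

```python
import numpy as np

# 0/1 hits array like lss in the slide
lss = (np.random.rand(10**4) <= 0.5).astype(np.int64)

# rewritten, nopython-friendly count (note "h = h + 1", not "+=")
h = 0
for v in lss:
    if v == 1:
        h = h + 1

print(h, np.count_nonzero(lss))
```

In plain Python the loop is far slower than count_nonzero (the 0.02✕ above); it only pays off once Numba compiles it and the GIL can be released across threads.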
  • 204. Summary 47 ➊ Apply nogil functions for thread-safety ■ Set nopython=True with nogil=True in numba.jit ➋ Almost impossible to predict GIL-safety ➌ CuPy parallel execution in CUDA?
  • 210. [Chart: Time (sec) vs Shots for NumPy vs Nogil-NumPy] Threading and Nogil
  • 211. Conclusion 48 Execution-Time Confirmation (ETC) on run-time signatures showed: ➊ Ideal threading acceleration = min(N, M) ➋ A comparison of On-The-Fly packages: ● Numba ● TensorFlow ● PyTorch ● CuPy ● Dask ➌ Basic issues and solutions: ● GIL ● Nogil ● GIL-Safe ● Threading ● Graph ● NumPy indexing ● Copy overhead. Enjoy On-The-Fly in your own way ✍
  • 213. Appendix
MIT License, Copyright (c) 2018 Yukio Okuda

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  • 214. Numba 51

Nogil:
import numba
import random
from concurrent.futures import ThreadPoolExecutor

@numba.jit(nogil=True, nopython=True)
def nba_pi_nogil(n):
    hit = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    return 4. * hit / n

tpe = ThreadPoolExecutor(max_workers=12)
#--
def nba_pi_nogil_tp_nm(n, m):
    global tpe
    cnt = int(n / m)
    if cnt < 1:
        cnt = 1
    ans = []
    for i in range(m):
        ans.append(tpe.submit(nba_pi_nogil, cnt))
    pi = 0.
    for f in ans:
        pi += f.result()
    return pi / m

print('Test', nba_pi_nogil_tp_nm(10**5, 4))

CUDA:
import numba
import numpy as np
from numba.cuda.random import xoroshiro128p_uniform_float64
from numba.cuda.random import create_xoroshiro128p_states

@numba.cuda.jit()
def nba_cuda(n, pi, rng):
    thread_id = numba.cuda.grid(1)
    hit = 0
    for _ in range(n):
        x = xoroshiro128p_uniform_float64(rng, thread_id)
        y = xoroshiro128p_uniform_float64(rng, thread_id)
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    pi[thread_id] = 4. * hit / n

def nba_cuda_rec(n):
    threads_per_block = 25
    blocks = 40
    rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
    pis = np.ones(threads_per_block * blocks, dtype=np.float64)
    nba_cuda[blocks, threads_per_block](n, pis, rng_states)
    return pis.mean()

print('Test', nba_cuda_rec(10**5))
Appendix
  • 215. ML:TfEager,PyTorch,Cupy 52

TensorFlow-CPU/CUDA-Eager:
import tensorflow as tf
tf.contrib.eager.enable_eager_execution()
# tf.enable_eager_execution()

def tf_pi_n(n):
    xs = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    ys = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
    ones = tf.ones([n], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.count_nonzero(lss)
    pi = tf.divide(tf.multiply(tf.cast(4., tf.float64), tf.cast(hit, tf.float64)),
                   tf.cast(n, tf.float64))
    return pi.numpy()

print('Test', tf_pi_n(10**5))

CuPy-CUDA:
import cupy as cp
import numpy as np

def cp_pi_gpu(n):
    x = cp.random.rand(n, dtype=cp.float64)
    y = cp.random.rand(n, dtype=cp.float64)
    rs = cp.add(cp.multiply(x, x, dtype=np.float64),
                cp.multiply(y, y, dtype=np.float64), dtype=np.float64)
    ones = cp.ones(n, dtype=cp.float64)
    lss = cp.less_equal(rs, ones)
    hit = cp.count_nonzero(lss)
    return 4. * hit / n

PyTorch-CPU:
import torch
torch.set_default_dtype(torch.float64)

def pt_pi_cpu(n):
    x = torch.rand(n, dtype=torch.float64)
    y = torch.rand(n, dtype=torch.float64)
    rs = torch.add(torch.mul(x, x), torch.mul(y, y))
    ones = torch.ones(n, dtype=torch.float64)
    lss = torch.le(rs, ones)
    hit = torch.nonzero(lss).size()[0]
    pi = 4. * hit / n
    return pi

print('Test', pt_pi_cpu(10**5))

PyTorch-CUDA:
import torch
torch.set_default_dtype(torch.float64)
DEVICE = torch.device('cuda')

def pt_pi_gpu_all(n):
    x = torch.rand(n, device=DEVICE)
    y = torch.rand(n, device=DEVICE)
    rs = torch.add(torch.mul(x, x), torch.mul(y, y))
    ones = torch.ones(n, device=DEVICE)
    lss = torch.le(rs, ones)
    hit = torch.nonzero(lss).size()[0]
    return 4. * hit / n

print('Test', pt_pi_gpu_all(10**5))
Appendix
  • 216. ML:TfGraph 53

TensorFlow-Simple Graph:
import tensorflow as tf

def tf_pi_n(n):
    xs = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    ys = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
    ones = tf.ones([n], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.count_nonzero(lss)
    pi = tf.divide(tf.multiply(tf.cast(4., tf.float64), tf.cast(hit, tf.float64)),
                   tf.cast(n, tf.float64))
    return pi

tf_n = tf.placeholder(tf.int32, [], name='n')
tf_graph = tf_pi_n(tf_n)
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(n):
    pi = session.run(tf_graph, feed_dict={tf_n: n})
    return pi

if __name__ == "__main__":
    print('Test', get_pi(10**5))

TensorFlow-While Graph:
import tensorflow as tf
from tf_graph_simple import tf_pi_n

def tf_graph_pi_n_while_sub(i, n, pis):
    pis = tf.add(pis, tf_pi_n(n))
    return pis

def tf_graph_pi_n_while(n, loop):
    i = tf.constant(0)
    pis = tf.constant(0., dtype=tf.float64)
    i, pis = tf.while_loop(
        lambda i, pis: tf.less(i, loop),
        lambda i, pis: (tf.add(i, 1), tf_graph_pi_n_while_sub(i, n, pis)),
        [i, pis])
    pi = tf.divide(pis, tf.cast(loop, tf.float64))
    return pi

tf_n = tf.placeholder(tf.int32, [], name='n')
tf_loop = tf.placeholder(tf.int32, [], name='loop')
tf_graph_while = tf_graph_pi_n_while(tf_n, tf_loop)
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(n):
    pi = session.run(tf_graph_while, feed_dict={tf_n: n, tf_loop: 1000})
    return pi

print('Test', get_pi(10**5))
Appendix
  • 217. ML:TfGraph Dask 54

TensorFlow-Parallel Graph:
import tensorflow as tf

M = 10
m = tf.placeholder(tf.int32, [], name='m')
n = tf.placeholder(tf.int32, [], name='n')
step = tf.cast(tf.divide(n, m), dtype=tf.int32)
hit = tf.zeros([], dtype=tf.int64, name='hit')
for _ in range(M):
    xs = tf.random_uniform(shape=[step], minval=0., maxval=1., dtype=tf.float64)
    ys = tf.random_uniform(shape=[step], minval=0., maxval=1., dtype=tf.float64)
    rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
    ones = tf.ones([step], dtype=tf.float64)
    lss = tf.less_equal(rs, ones)
    hit = tf.add(hit, tf.count_nonzero(lss, dtype=tf.int64))
pi = tf.divide(tf.multiply(tf.cast(4., tf.float64), tf.cast(hit, tf.float64)),
               tf.cast(n, tf.float64))
ans = pi
session = tf.Session()
session.run(tf.global_variables_initializer())

def get_pi(in_n, in_m):
    pi = session.run(ans, feed_dict={n: in_n, m: in_m})
    return pi

print('Test', get_pi(10**5, 10))

Dask-Numba:
import numpy as np
import random
import dask
import numba

@numba.jit(nogil=True)
def get_pi(n):
    hit = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        r = x*x + y*y
        if r <= 1.:
            hit += 1
    return 4. * hit / n

def dsk_nba_pi_nogil(n, m, v=False):
    cnt = int(n / m)
    ps = []
    for _ in range(m):
        p = dask.delayed(get_pi)(cnt)
        ps.append(p)
    mn = dask.delayed(np.mean)(ps)
    if v:
        mn.visualize(optimize_graph=True)
        pi = 0
    else:
        pi = mn.compute()
    return pi

# visualize() requires python-graphviz and the Graphviz utility;
# it generates ./mydask.png
# dsk_nba_pi_nogil(10**5, 3, v=True)
print('Test', dsk_nba_pi_nogil(10**5, 3))
Appendix
  • 218. Miscellaneous 55

GIL-Safe:
import numpy as np
from concurrent.futures import ThreadPoolExecutor

tpe = ThreadPoolExecutor(max_workers=25)

def rng_count(n):
    x = np.random.rand(n).astype(np.float64)
    ones = np.ones(n, dtype=np.float64)
    c = np.count_nonzero(ones)
    return c

def count(n):
    ones = np.ones(n, dtype=np.float64)
    c = np.count_nonzero(ones)
    return c

def tpe_pi_nm_min(n, m, f):
    global tpe
    ts = []
    for i in range(m):
        ts.append(tpe.submit(f, n))
    pis = []
    for t in ts:
        pis.append(t.result())
    return min(pis)

for n in (7*10**6, 8*10**6, 9*10**6, 10**7):
    c = tpe_pi_nm_min(n, 9, count)
    print("count:", n == c, n, c)
    c = tpe_pi_nm_min(n, 9, rng_count)
    print("rng_count:", n == c, n, c)

GIL-Safe-Note:
Results of the prints depend on the executing machine.
Benchmark machine:
count: False 7000000 34302
rng_count: True 7000000 7000000
count: False 8000000 10750
rng_count: True 8000000 8000000
count: False 9000000 525822
rng_count: True 9000000 9000000
count: False 10000000 455166
rng_count: True 10000000 10000000
Intel-Atom N3150 @ 1.60GHz, 4 cores, no Hyper-Thread, stepping=3: all True!!
Appendix