Introduction to NumPy for Machine Learning Programmers

Introduction to NumPy
for Machine Learning Programmers
PyData Tokyo Meetup
April 3, 2015 @ Denso IT Laboratory
Kimikazu Kato
Silver Egg Techonogy
1 / 26

Target Audience
People who want to implement machine learning algorithms in Python.
2 / 26

Outline
Preliminaries
Basic usage of NumPy
Indexing, Broadcasting
Case study
Reading source code of scikit-learn
Conclusion
3 / 26

Who am I?
Kimikazu Kato
Chief Scientist at Silver Egg Technology
Algorithm designer for a recommendation system
Ph.D in computer science (Master's degree in math)
4 / 26

Python is Very Slow!
Code in C
#include<stdio.h>
intmain(){
inti;doubles=0;
for(i=1;i<=100000000;i++)s+=i;
printf("%.0fn",s);
}
Code in Python
s=0.
foriinxrange(1,100000001):
s+=i
prints
Both of the codes compute the sum of integers from 1 to 100,000,000.
Result of benchmark in a certain environment:
Above: 0.109 sec (compiled with -O3 option)
Below: 8.657 sec
(80+ times slower!!)
5 / 26

Better code
importnumpyasnp
a=np.arange(1,100000001)
printa.sum()
Now it takes 0.188 sec. (Measured by "time" command in Linux, loading time
included)
Still slower than C, but sufficiently fast as a script language.
6 / 26

Lessons
Python is very slow when written badly
Translate C (or Java, C# etc.) code into Python is often a bad idea.
Python-friendly rewriting sometimes result in drastic performance
improvement
7 / 26

Basic rules for better performance
Avoid for-sentence as far as possible
Utilize libraries' capabilities instead
Forget about the cost of copying memory
Typical C programmer might care about it, but ...
8 / 26

Basic techniques for NumPy
Broadcasting
Indexing
9 / 26

Broadcasting
>>>importnumpyasnp
>>>a=np.array([0,1,2,3])
>>>a*3
array([0,3,6,9])
>>>np.exp(a)
array([ 1. , 2.71828183, 7.3890561, 20.08553692])
expis called a universal function.
10 / 26

Broadcasting (2D)
>>>importnumpyasnp
>>>a=np.arange(9).reshape((3,3))
>>>b=np.array([1,2,3])
>>>a
array([[0,1,2],
[3,4,5],
[6,7,8]])
>>>b
array([1,2,3])
>>>a*b
array([[0, 2, 6],
[3, 8,15],
[6,14,24]])
11 / 26

Indexing
>>>importnumpyasnp
>>>a=np.arange(10)
>>>a
array([0,1,2,3,4,5,6,7,8,9])
>>>indices=np.arange(0,10,2)
>>>indices
array([0,2,4,6,8])
>>>a[indices]=0
>>>a
array([0,1,0,3,0,5,0,7,0,9])
>>>b=np.arange(100,600,100)
>>>b
array([100,200,300,400,500])
>>>a[indices]=b
>>>a
array([100, 1,200, 3,300, 5,400, 7,500, 9])
12 / 26

Boolean Indexing
>>>a=np.array([1,2,3])
>>>b=np.array([False,True,True])
>>>a[b]
array([2,3])
>>>c=np.arange(-3,4)
>>>c
array([-3,-2,-1, 0, 1, 2, 3])
>>>d=c>0
>>>d
array([False,False,False,False, True, True, True],dtype=bool)
>>>c[d]
array([1,2,3])
>>>c[c>0]
array([1,2,3])
>>>c[np.logical_and(c>=0,c%2==0)]
array([0,2])
>>>c[np.logical_or(c>=0,c%2==0)]
array([-2, 0, 1, 2, 3])
13 / 26

Cf. In Pandas
>>>importpandasaspd
>>>importnumpyasnp
>>>df=pd.DataFrame(np.random.randn(5,3),columns=["A","B","C"])
>>>df
A B C
0 1.084117-0.626930-1.818375
1 1.717066 2.554761-0.560069
2-1.355434-0.464632 0.322603
3 0.013824 0.298082-1.405409
4 0.743068 0.292042-1.002901
[5rowsx3columns]
>>>df[df.A>0.5]
A B C
0 1.084117-0.626930-1.818375
1 1.717066 2.554761-0.560069
4 0.743068 0.292042-1.002901
[3rowsx3columns]
>>>df[(df.A>0.5)&(df.B>0)]
A B C
1 1.717066 2.554761-0.560069
4 0.743068 0.292042-1.002901
[2rowsx3columns]
14 / 26

Case Study 1: Ridge Regression
(sklearn.linear_model.Ridge)
, : input, output of training data
: hyper parameter
The optimum is given as:
The corresponding part of the code:
K=safe_sparse_dot(X,X.T,dense_output=True)
try:
dual_coef=_solve_cholesky_kernel(K,y,alpha)
coef=safe_sparse_dot(X.T,dual_coef,dense_output=True).T
exceptlinalg.LinAlgError:
(sklearn.h/linear_model/ridge.py L338-343)
∥y − Xw + α∥wmin
w
∥
2
2
∥
2
2
X y
α
w = ( X + αI yX
T
)
−1
X
T
15 / 26

K=safe_sparse_dot(X,X.T,dense_output=True)
try:
dual_coef=_solve_cholesky_kernel(K,y,alpha)
coef=safe_sparse_dot(X.T,dual_coef,dense_output=True).T
exceptlinalg.LinAlgError:
(sklearn.h/linear_model/ridge.py L338-343)
safe_sparse_dotis a wrapper function of dotwhich can be applied to
sparse and dense matrices.
_solve_cholesky_kernelcomputes
w = ( X + α yX
T
)
−1
X
T
(K + αI y)
−1
16 / 26

Inside _solve_cholesky_kernel
K.flat[::n_samples+1]+=alpha[0]
try:
dual_coef=linalg.solve(K,y,sym_pos=True,
overwrite_a=False)
exceptnp.linalg.LinAlgError:
(sklearn.h/linear_model/ridge.py L138-146, comments omitted)
invshould not be used; solveis faster (general knowledge in numerical
computation)
flat???
(K + αI y)
−1
17 / 26

flat
classflatiter(builtins.object)
| Flatiteratorobjecttoiterateoverarrays.
|
| Aflatiteriteratorisreturnedby``x.flat``foranyarrayx.
| Itallowsiteratingoverthearrayasifitwerea1-Darray,
| eitherinafor-looporbycallingitsnextmethod.
|
| IterationisdoneinC-contiguousstyle,withthelastindexvaryingthe
| fastest.Theiteratorcanalsobeindexedusingbasicslicingor
| advancedindexing.
|
| SeeAlso
| --------
| ndarray.flat:Returnaflatiteratoroveranarray.
| ndarray.flatten:Returnsaflattenedcopyofanarray.
|
| Notes
| -----
| AflatiteriteratorcannotbeconstructeddirectlyfromPythoncode
| bycallingtheflatiterconstructor.
In short, x.flatis a reference to the elements of the array x, and can be used
like a one dimensional array.
18 / 26

try:
dual_coef=linalg.solve(K,y,sym_pos=True,
overwrite_a=False)
exceptnp.linalg.LinAlgError:
(sklearn.h/linear_model/ridge.py L138-146, comments omitted)
is equivalent to
K+=alpha[0]*np.eyes(n_samples)
(The size of is n_samples n_samples)
The upper is an inplace version.
K ×
19 / 26

Case Study 2: NMF
(sklearn.decomposition.nmf)
NMF = Non-negative Matrix Factorization
Successful in face part detection
20 / 26

Idea of NMF
Approximate the input matrix as a product of two smaller non-negative
matrix:
Notation
Parameter set:
Error function:
X ≈ HW
T
≥ 0, ≥ 0Wij Hij
Θ = (W , H), : i-th element of Θθi
f(Θ) = ∥X − HW
T
∥
2
F
21 / 26

Algorithm of NMF
Projected gradient descent (Lin 2007):
where
Convergence condition:
where
(Note: )
= P [ − α∇f( )]Θ
(k+1)
Θ
(k)
Θ
(k)
P [x = max(0, )]i xi
f( ) ≤ ϵ f( )∥
∥∇
P
Θ
(k)
∥
∥
∥
∥∇
P
Θ
(1)
∥
∥
f(Θ) = {∇
P
∇f(Θ)i
min (0, ∇f(Θ ))i
if > 0θi
if = 0θi
≥ 0θi
22 / 26

Computation of where
Code:
proj_norm=norm(np.r_[gradW[np.logical_or(gradW<0,W>0)],
gradH[np.logical_or(gradH<0,H>0)]])
(sklearn/decomposition/nmf.py L500-501)
norm: utility function of scikit-learn which computes L2-norm
np.r_: concatenation of arrays
f(Θ)∥∥∇
P
∣∣
f(Θ) = {∇
P
∇f(Θ)i
min (0, ∇f(Θ ))i
if > 0θi
if = 0θi
23 / 26

means
Code:
proj_norm=norm(np.r_[gradW[np.logical_or(gradW<0,W>0)],
gradH[np.logical_or(gradH<0,H>0)]])
(sklearn/decomposition/nmf.py L500-501)
gradW[np.logical_or(gradW<0,W>0)],
means that an element is employed if or , and discarded
otherwise.
Only non-zero elements remains after indexing.
f(Θ) = {∇
P
∇f(Θ)i
min (0, ∇f(Θ ))i
if > 0θi
if = 0θi
f(Θ) =∇
P
⎧
⎩
⎨
⎪
⎪
∇f(Θ)i
∇f(Θ)i
0
if > 0θi
if = 0 and ∇f(Θ < 0θi )i
otherwise
∇f(Θ < 0)i > 0θi
24 / 26

Conclusion
Avoid for-sentence; use NumPy/SciPy's capabilities
Mathematical derivation is important
You can learn a lot from the source code of scikit-learn
25 / 26

References
Official
scikit-learn
For beginners of NumPy/SciPy
Gabriele Lanaro, "Python High Performance Programming," Packt
Publishing, 2013.
Stéfan van der Walt, Numpy Medkit
Python Scientific Lecture Notes
Algorithm of NMF
C.-J. Lin. Projected gradient methods for non-negative matrix
factorization. Neural Computation 19, 2007.
26 / 26

Introduction to NumPy for Machine Learning Programmers

More Related Content

What's hot (20)

Viewers also liked (20)

Similar to Introduction to NumPy for Machine Learning Programmers (20)

More from Kimikazu Kato (18)

Recently uploaded (20)

Introduction to NumPy for Machine Learning Programmers