Sebastian Raschka, Ph.D.
MSU Data Science workshop
East Lansing, Michigan State University • Feb 21, 2018
Machine Learning with Python
Today’s focus: machine learning with scikit-learn.
And if we have time, a quick overview of deep learning with TensorFlow and PyTorch ...
2
Contact:
o E-mail: mail@sebastianraschka.com
o Website: http://sebastianraschka.com
o Twitter: @rasbt
o GitHub: rasbt
Tutorial Material on GitHub:
https://github.com/rasbt/msu-datascience-ml-tutorial-2018
3
Machine learning is used & useful (almost) anywhere
4
5
3 Types of Learning
o Supervised
o Unsupervised
o Reinforcement
6
Working with Labeled Data: Supervised Learning
o Regression: predict a continuous output y from an input x.
o Classification: predict a discrete class label from inputs x1, x2, ...
7
Working with Unlabeled Data: Unsupervised Learning
o Clustering: group similar, unlabeled examples.
o Compression: represent the data with fewer dimensions.
8
Topics
1. Introduction to Machine Learning
2. Linear Regression
3. Introduction to Classification
4. Feature Preprocessing & scikit-learn Pipelines
5. Dimensionality Reduction: Feature Selection & Extraction
6. Model Evaluation & Hyperparameter Tuning
9
Simple Linear Regression
A fitted line ŷ = w0 + w1x relates the explanatory variable x to the response variable y:
o w0 is the intercept
o w1 is the slope, w1 = Δy / Δx
o |ŷ − y| is the vertical offset between a data point (xi, yi) and the fitted line
10
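For concreteness, the slope and intercept have a closed-form least-squares solution; a small sketch (the data points are made up for illustration):

import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])

# least-squares estimates: w1 = cov(x, y) / var(x), w0 = mean(y) - w1 * mean(x)
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
w0 = y.mean() - w1 * x.mean()
y_hat = w0 + w1 * x  # fitted line, minimizing the squared vertical offsets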
11
Data Representation
Rows: training examples (observations, records, instances, samples).
Columns: features (explanatory variables, independent variables, covariates, predictors, variables, inputs, attributes).

X is the feature matrix, one row per training example and one column per feature x0, x1, …, xm:

X = [ x0,0  x0,1  …  x0,m
      x1,0  x1,1  …  x1,m
      x2,0  x2,1  …  x2,m
      x3,0  x3,1  …  x3,m
      ...
      xn,0  xn,1  …  xn,m ]

y = [ y0, y1, y2, y3, …, yn ] holds the targets (target variable, response variable, dependent variable, labels, ground truth).
“Basic” Supervised Learning Workflow
1. Split the data and labels into training data/labels and test data/labels.
2. The learning algorithm, given hyperparameter values, fits a model to the training data and training labels.
3. The model predicts labels for the test data; comparing these predictions with the test labels yields a performance estimate.
4. The learning algorithm, with the same hyperparameter values, fits the final model to the complete dataset (data and labels).
12
Jupyter Notebook
13
Topics
1. Introduction to Machine Learning
2. Linear Regression
3. Introduction to Classification
4. Feature Preprocessing & scikit-learn Pipelines
5. Dimensionality Reduction: Feature Selection & Extraction
6. Model Evaluation & Hyperparameter Tuning
14
Scikit-learn API
class SupervisedEstimator(...):

    def __init__(self, hyperparam, ...):
        ...

    def fit(self, X, y):
        ...
        return self

    def predict(self, X):
        ...
        return y_pred

    def score(self, X, y):
        ...
        return score

    ...

15
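A minimal usage sketch of this estimator API (the choice of LinearRegression and the toy data are illustrative assumptions, not from the slides):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.], [2.], [3.], [4.]])  # feature matrix, shape (n_samples, n_features)
y = np.array([1.1, 1.9, 3.2, 3.9])      # target vector

model = LinearRegression()  # hyperparameters go into __init__
model.fit(X, y)             # fit returns self, so calls can be chained
y_pred = model.predict(X)   # array of predictions
r2 = model.score(X, y)      # R^2 for regressors, accuracy for classifiers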
Iris Dataset
Three species: Iris-Setosa, Iris-Versicolor, Iris-Virginica.
16
Iris Dataset
features (columns): sepal length [cm], sepal width [cm], petal length [cm], petal width [cm]
samples (rows): 150 flowers

        sepal length  sepal width  petal length  petal width
        [cm]          [cm]         [cm]          [cm]
  1     5.1           3.5          1.4           0.2
  2     4.9           3.0          1.4           0.2
  ...
  50    6.4           3.5          4.5           1.2
  ...
  150   5.9           3.0          5.0           1.8

X = the 150 × 4 feature matrix above
y = [setosa, setosa, ..., versicolor, ..., virginica] (the species labels)
17
Note about Non-Stratified Splits
A purely random split of the 150 iris examples (50 per class) does not preserve the class proportions, e.g.:
§ training set → 38 x Setosa, 28 x Versicolor, 34 x Virginica
§ test set → 12 x Setosa, 22 x Versicolor, 16 x Virginica
A stratified split keeps the class proportions equal in both sets.
18
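A sketch of a stratified split on the iris data (the split proportion and random seed are arbitrary choices here):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y preserves the 50/50/50 class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)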
Linear Regression Recap
Linear regression viewed as a single computational unit: the input values x1, x2, ..., xm and a bias unit (constant 1) are weighted by the weight coefficients w1, w2, ..., wm and w0. A net input function computes z = w0 + w1x1 + ... + wmxm, and an activation function a(z) yields the predicted output y.
19
Linear Regression Recap
Same picture as on the previous slide; here the activation function is the identity function, so the predicted output is simply y = z.
20
Logistic Regression, a Generalized Linear Model (a Classifier)
The structure is the same: input values, bias unit, weight coefficients, and net input z. The activation function is now the logistic sigmoid, whose output a is the predicted probability; a unit step function then converts that probability into the predicted class label.
21
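A short sketch, reusing the iris split from earlier, of how scikit-learn's LogisticRegression exposes both the predicted probabilities and the thresholded class labels:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
probas = lr.predict_proba(X_test[:3])  # predicted class-membership probabilities
labels = lr.predict(X_test[:3])        # thresholded predicted class labels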
A “Lazy Learner:” K-Nearest Neighbors Classifier
To classify a query point ? in the (x1, x2) feature space, inspect its k nearest neighbors; in the figure, k = 5 and the neighborhood contains 3 examples of one class and 1 example each of two other classes, so the majority class is predicted.
22
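A corresponding scikit-learn sketch (k = 5 as in the figure; the iris split from earlier is assumed):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)  # 'lazy': fit essentially just stores the training data
print(knn.score(X_test, y_test))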
Jupyter Notebook
23
http://scikit-learn.org/stable/supervised_learning.html
There are many, many more classification
and regression algorithms ...
24
Topics
1. Introduction to Machine Learning
2. Linear Regression
3. Introduction to Classification
4. Feature Preprocessing & scikit-learn Pipelines
5. Dimensionality Reduction: Feature Selection & Extraction
6. Model Evaluation & Hyperparameter Tuning
25
Categorical Variables

color   size   price    class label
red     M      $10.49   0
blue    XL     $15.00   1
green   L      $12.99   1
26
Encoding Categorical Variables (Ordinal vs Nominal)

color   size   price    class label
red     M      $10.49   0
blue    XL     $15.00   1
green   L      $12.99   1

Ordinal feature (size): map onto an ordered scale, M → 0, L → 1, XL → 2, giving the size column [0, 2, 1].
Nominal feature (color): one-hot encode,

red   blue   green
1     0      0
0     1      0
0     0      1
27
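One way to produce both encodings with pandas; a sketch assuming the toy DataFrame above:

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green'],
                   'size': ['M', 'XL', 'L'],
                   'price': [10.49, 15.00, 12.99],
                   'classlabel': [0, 1, 1]})

# ordinal feature: map sizes onto an ordered integer scale
df['size'] = df['size'].map({'M': 0, 'L': 1, 'XL': 2})

# nominal feature: one-hot encode the color column
df = pd.get_dummies(df, columns=['color'])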
Feature Normalization

feature   min-max   z-score
1.0       0.0       -1.46385
2.0       0.2       -0.87831
3.0       0.4       -0.29277
4.0       0.6       0.29277
5.0       0.8       0.87831
6.0       1.0       1.46385

Min-max scaling: x' = (x − xmin) / (xmax − xmin). Z-score standardization: z = (x − μ) / σ.
28
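A sketch reproducing the table's columns with scikit-learn's scalers:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_feat = np.array([[1.], [2.], [3.], [4.], [5.], [6.]])
X_minmax = MinMaxScaler().fit_transform(X_feat)    # (x - min) / (max - min)
X_zscore = StandardScaler().fit_transform(X_feat)  # (x - mean) / std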
Scikit-learn API
class UnsupervisedEstimator(...):

    def __init__(self, ...):
        ...

    def fit(self, X):
        ...
        return self

    def transform(self, X):
        ...
        return X_transf

    def predict(self, X):
        ...
        return pred
29
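For example, KMeans follows this unsupervised API, including predict (the 3-cluster setting is an illustrative assumption matching the iris data from earlier):

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, random_state=123)
km.fit(X)                    # unsupervised: fit takes no y
cluster_ids = km.predict(X)  # assign each example to its nearest cluster center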
Scikit-learn Pipelines
A pipeline chains preprocessing steps with a final estimator: Scaling → Dimensionality Reduction → Learning Algorithm.
o Training data (and class labels): each preprocessing step is fit & transform, and the learning algorithm is fit, producing the model.
o Test data: each preprocessing step only applies transform, and the model applies predict to produce the class labels.
30
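A sketch of such a pipeline (the particular steps are an illustrative choice):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(StandardScaler(),      # scaling
                     PCA(n_components=2),   # dimensionality reduction
                     LogisticRegression())  # learning algorithm

pipe.fit(X_train, y_train)         # transformers are fit on training data only
print(pipe.score(X_test, y_test))  # test data is only transformed, never fit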
Jupyter Notebook
31
Topics
1. Introduction to Machine Learning
2. Linear Regression
3. Introduction to Classification
4. Feature Preprocessing & scikit-learn Pipelines
5. Dimensionality Reduction: Feature Selection & Extraction
6. Model Evaluation & Hyperparameter Tuning
32
Dimensionality Reduction – why?
Figure: a scatterplot matrix of four features, all measured in cm; pairwise plots become hard to inspect as the number of features grows.
33
Dimensionality Reduction – why?
o predictive performance
o storage & speed
o visualization & interpretability
34
Recursive Feature Elimination
available features: [ f1 f2 f3 f4 ]
Fit the model to get the weights [ w1 w2 w3 w4 ], remove the feature with the lowest weight, and repeat:
[ w1 w2 w3 w4 ] → [ w1 w2 w4 ] → [ w1 w4 ] → [ w4 ]
35
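A corresponding scikit-learn sketch (the estimator and the number of features to keep are illustrative assumptions):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# repeatedly fit, drop the feature with the smallest weight, refit
rfe = RFE(LogisticRegression(), n_features_to_select=2)
rfe.fit(X_train, y_train)
print(rfe.support_)  # boolean mask of the selected features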
Sequential Feature Selection
available features: [ f1 f2 f3 f4 ]
Fit one model per candidate feature, pick the best, and repeat with the remaining features:
[ f1 ] [ f2 ] [ f3 ] [ f4 ] → pick f1
[ f1 f2 ] [ f1 f3 ] [ f1 f4 ] → pick f1 f3
[ f1 f3 f2 ] [ f1 f3 f4 ] → ...
36
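Forward selection as sketched above is available in the presenter's mlxtend library; a sketch (the parameter values are an arbitrary choice):

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier

sfs = SFS(KNeighborsClassifier(n_neighbors=5),
          k_features=3,       # stop once 3 features are selected
          forward=True,       # forward selection, as in the diagram
          scoring='accuracy',
          cv=5)
sfs.fit(X_train, y_train)
print(sfs.k_feature_idx_)  # indices of the selected features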
Principal Component Analysis
Figure: in the (x1, x2) feature space, PCA finds the orthogonal directions of maximal variance, PC1 and PC2.
37
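A scikit-learn sketch projecting onto the first two principal components:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # keep the 2 directions of maximal variance
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)  # reuse the projection learned on training data
print(pca.explained_variance_ratio_)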
Jupyter Notebook
38
Topics
1. Introduction to Machine Learning
2. Linear Regression
3. Introduction to Classification
4. Feature Preprocessing & scikit-learn Pipelines
5. Dimensionality Reduction: Feature Selection & Extraction
6. Model Evaluation & Hyperparameter Tuning
39
“Basic” Supervised Learning Workflow (recap)
1. Split the data and labels into training data/labels and test data/labels.
2. The learning algorithm, given hyperparameter values, fits a model to the training data and labels.
3. The model predicts labels for the test data; comparing these predictions with the test labels yields a performance estimate.
4. The learning algorithm fits the final model to the complete dataset (data and labels).
40
Holdout Method and Hyperparameter Tuning 1-3
1. Split the data and labels three ways: training data/labels, validation data/labels, and test data/labels.
2. For each candidate set of hyperparameter values, the learning algorithm fits a model to the training data; each model's predictions on the validation data are scored against the validation labels.
3. Select the best hyperparameter values (and with them the best model).
41
Holdout Method and Hyperparameter Tuning 4-6
4. The selected model's predictions on the test data are compared with the test labels to estimate generalization performance.
5. The learning algorithm refits a model with the best hyperparameter values on the combined training and validation data (and labels).
6. The final model is fit with the best hyperparameter values on the complete dataset (data and labels).
42
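A sketch of steps 1-6 in code (the split proportions, candidate values, and KNN estimator are illustrative assumptions; X and y are the iris arrays from earlier):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. three-way split into training / validation / test
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=123, stratify=y_tmp)

# 2.-3. fit one model per hyperparameter value, keep the best on validation data
best_score, best_k = -1.0, None
for k in [1, 3, 5, 7]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_valid, y_valid)
    if score > best_score:
        best_score, best_k = score, k

# 4.-5. estimate performance on test data after refitting on training + validation data
model = KNeighborsClassifier(n_neighbors=best_k).fit(
    np.vstack((X_train, X_valid)), np.hstack((y_train, y_valid)))
print(model.score(X_test, y_test))

# 6. final model: refit with best_k on all data (X, y)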
K-fold Cross-Validation
The training data is split into K folds (here K = 5, iterations 1st through 5th). In each iteration, one fold serves as the validation fold and the remaining folds as training folds: the learning algorithm, given hyperparameter values, fits a model to the training-fold data and labels, and the model's predictions on the validation-fold data are scored against the validation-fold labels. The K performance estimates are then averaged:

Performance = (1/K) Σ_{i=1}^{K} Performance_i
43
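A sketch of this averaging with scikit-learn (K = 5; the estimator is an illustrative choice):

import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=cv)
print(np.mean(scores))  # average performance across the K folds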
K-fold Cross-Validation Workflow 1-3
1. Split the data and labels into training data/labels and test data/labels.
2. For each candidate set of hyperparameter values, fit and evaluate models via K-fold cross-validation on the training data, then select the best hyperparameter values.
3. The learning algorithm refits a model with the best hyperparameter values on the whole training set.
44
K-fold Cross-Validation Workflow 4-5
4. The model's predictions on the test data are compared with the test labels to estimate generalization performance.
5. The final model is fit with the best hyperparameter values on the complete dataset (data and labels).
45
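GridSearchCV bundles steps 1-3 and, by default, refits the best model on the whole training set; a sketch (the grid values are an arbitrary choice):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': [1, 3, 5, 7, 9]}  # candidate hyperparameter values
gs = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
gs.fit(X_train, y_train)         # K-fold CV over the grid, then refit on all training data
print(gs.best_params_)
print(gs.score(X_test, y_test))  # step 4: performance estimate on the test set
# step 5: for deployment, refit with gs.best_params_ on the complete dataset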
More info about model evaluation (one of the most important topics in ML):
https://sebastianraschka.com/blog/index.html
• Model evaluation, model selection, and algorithm selection in machine learning Part I - The basics
• Model evaluation, model selection, and algorithm selection in machine learning Part II - Bootstrapping and uncertainties
• Model evaluation, model selection, and algorithm selection in machine learning Part III - Cross-validation and hyperparameter tuning
46
Jupyter Notebook
47
BONUS SLIDES
48
https://www.tensorflow.org
49
TensorFlow:
Large-Scale Machine Learning on Heterogeneous Distributed Systems
(Preliminary White Paper, November 9, 2015)
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng
Google Research

From the abstract: “TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas ...”

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf
50
https://sebastianraschka.com/pdf/books/dlb/appendix_g_tensorflow.pdf
“... at performing highly parallelized numerical computations. In addition, TensorFlow also supports distributed systems as well as mobile computing platforms, including Android and Apple’s iOS.
But what is a tensor? In simplifying terms, we can think of tensors as multidimensional arrays of numbers, as a generalization of scalars, vectors, and matrices.
1. Scalar: R
2. Vector: Rn
3. Matrix: Rn × Rm
4. 3-Tensor: Rn × Rm × Rp
5. …
When we describe tensors, we refer to their “dimensions” as the rank (or order) of a tensor, which is not to be confused with the dimensions of a matrix. For instance, an m × n matrix, where m is the number of rows and n is the number of columns, would be a special case of a rank-2 tensor. A visual explanation of tensors and their ranks is given in the figure below.”
Tensors?
o rank 0 tensor: scalar, dimensions [ ]
o rank 1 tensor: vector, dimensions [5], indexed like [2]
o rank 2 tensor: matrix, dimensions [5, 3], indexed like [0,0]
o rank 3 tensor: dimensions [4, 4, 2], indexed like [0,2,1]
51
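The same ranks expressed with NumPy arrays (a small sketch; ndim corresponds to the rank):

import numpy as np

scalar = np.array(5.)          # rank 0, shape ()
vector = np.zeros(5)           # rank 1, shape (5,)
matrix = np.zeros((5, 3))      # rank 2, shape (5, 3)
tensor3 = np.zeros((4, 4, 2))  # rank 3, shape (4, 4, 2)
print(tensor3.ndim, tensor3.shape)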
GPUs
52
Vectorization
X = np.random.random((num_train_examples, num_features))
W = np.random.random((num_features, num_hidden))
53
Vectorization
Figure: the matrix product of X and W, computed in a single vectorized operation.
54
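For instance, the product of the two random matrices above reduces to one vectorized call instead of nested Python loops (the sizes are illustrative assumptions):

import numpy as np

num_train_examples, num_features, num_hidden = 100, 10, 5
X = np.random.random((num_train_examples, num_features))
W = np.random.random((num_features, num_hidden))

Z = X.dot(W)  # one vectorized matrix multiplication; shape (100, 5)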
Computation Graphs
a(x, w, b) = relu(w*x + b)
decomposed into graph nodes: u = wx (the * node), v = u + b (the + node), and a = relu(v).
55
Computation Graphs

import tensorflow as tf

g = tf.Graph()
with g.as_default() as g:
    x = tf.placeholder(dtype=tf.float32, shape=None, name='x')
    w = tf.Variable(initial_value=2, dtype=tf.float32, name='w')
    b = tf.Variable(initial_value=1, dtype=tf.float32, name='b')
    u = x * w
    v = u + b
    a = tf.nn.relu(v)
    print(x, w, b, u, v, a)

Output:
Tensor("x:0", dtype=float32) <tf.Variable 'w:0' shape=() dtype=float32_ref> <tf.Variable 'b:0' shape=() dtype=float32_ref> Tensor("mul:0", dtype=float32) Tensor("add:0", dtype=float32) Tensor("Relu:0", dtype=float32)
56
Computation Graphs
Graph with constants filled in: u = wx, v = u + b, a = relu(v), where w = 2 and b = 1.

# init_op is not defined on the slide; presumably:
init_op = tf.global_variables_initializer()

with tf.Session(graph=g) as sess:
    sess.run(init_op)
    b_res = sess.run('b:0')
    print(b_res)

Output:
1.0
57
Forward pass with x = 3, w = 2, b = 1: u = wx = 6, v = u + b = 7, a = relu(v) = 7.

Backward pass (chain rule):
∂a/∂w = (∂a/∂v)(∂v/∂u)(∂u/∂w) = 1 · 1 · 3 = 3
∂a/∂b = (∂a/∂v)(∂v/∂b) = 1 · 1 = 1
since ∂a/∂v = 1 (relu in the positive regime), ∂v/∂u = ∂v/∂b = 1, and ∂u/∂w = x = 3.
https://github.com/rasbt/pydata-annarbor2017-dl-tutorial 58
g = tf.Graph()
with g.as_default() as g:
    x = tf.placeholder(dtype=tf.float32, shape=None, name='x')
    w = tf.Variable(initial_value=2, dtype=tf.float32, name='w')
    b = tf.Variable(initial_value=1, dtype=tf.float32, name='b')
    u = x * w
    v = u + b
    a = tf.nn.relu(v)
    d_a_w = tf.gradients(a, w)  # da/dw
    d_a_b = tf.gradients(a, b)  # da/db (named d_b_w on the original slide)

with tf.Session(graph=g) as sess:
    sess.run(tf.global_variables_initializer())
    res = sess.run([d_a_w, d_a_b], feed_dict={'x:0': 3})

Output:
[3.0] [1.0]

59
http://pytorch.org
60
import torch
import torch.nn.functional as F
from torch.autograd import Variable
from torch.autograd import grad

x = Variable(torch.Tensor([3]))
w = Variable(torch.Tensor([2]), requires_grad=True)
b = Variable(torch.Tensor([1]), requires_grad=True)

u = x * w
v = u + b
a = F.relu(v)

partial_derivatives = grad(a, (w, b))
for name, g in zip("wb", partial_derivatives):  # loop variable renamed to avoid shadowing grad
    print('d_a_%s:' % name, g)

Output:
d_a_w: Variable containing:
 3
[torch.FloatTensor of size 1]
d_a_b: Variable containing:
 1
[torch.FloatTensor of size 1]
61
https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/code/ch12/images/12_02.png
Multilayer Perceptron
62
g = tf.Graph()
with g.as_default():

    # Input data
    tf_x = tf.placeholder(tf.float32, [None, n_input], name='features')
    tf_y = tf.placeholder(tf.float32, [None, n_classes], name='targets')

    # Model parameters
    # (the slide used n_hidden_2 for the output weights; with a single hidden
    # layer, the input size must match layer_1's width, n_hidden_1)
    weights = {
        'h1': tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1)),
        'out': tf.Variable(tf.truncated_normal([n_hidden_1, n_classes], stddev=0.1))
    }
    biases = {
        'b1': tf.Variable(tf.zeros([n_hidden_1])),
        'out': tf.Variable(tf.zeros([n_classes]))
    }

    # Multilayer perceptron
    layer_1 = tf.add(tf.matmul(tf_x, weights['h1']), biases['b1'])
    layer_1 = tf.nn.relu(layer_1)
    out_layer = tf.matmul(layer_1, weights['out']) + biases['out']

    # Loss and optimizer
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=out_layer, labels=tf_y)
    cost = tf.reduce_mean(loss, name='cost')
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    train = optimizer.minimize(cost, name='train')

    # Prediction
    correct_prediction = tf.equal(tf.argmax(tf_y, 1), tf.argmax(out_layer, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name='accuracy')

with tf.Session(graph=g) as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = mnist.train.num_examples // batch_size
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            _, c = sess.run(['train', 'cost:0'],
                            feed_dict={'features:0': batch_x,
                                       'targets:0': batch_y})
class MultilayerPerceptron(torch.nn.Module):

    def __init__(self, num_features, num_classes):
        super(MultilayerPerceptron, self).__init__()

        ### 1st hidden layer
        self.linear_1 = torch.nn.Linear(num_features, num_hidden_1)

        ### Output layer
        # (the slide used num_hidden_2; it must match linear_1's output width)
        self.linear_out = torch.nn.Linear(num_hidden_1, num_classes)

    def forward(self, x):
        out = self.linear_1(x)
        out = F.relu(out)
        logits = self.linear_out(out)
        probas = F.softmax(logits, dim=1)
        return logits, probas

model = MultilayerPerceptron(num_features=num_features,
                             num_classes=num_classes)

if torch.cuda.is_available():
    model.cuda()

for epoch in range(num_epochs):
    for batch_idx, (features, targets) in enumerate(train_loader):

        features = Variable(features.view(-1, 28*28))
        targets = Variable(targets)
        if torch.cuda.is_available():
            features, targets = features.cuda(), targets.cuda()

        ### FORWARD AND BACK PROP
        logits, probas = model(features)
        cost = cost_fn(logits, targets)
        optimizer.zero_grad()
        cost.backward()

        ### UPDATE MODEL PARAMETERS
        optimizer.step()
63
Further Resources
Figure: recommended book covers, captioned math-heavy; math-free scikit-learn intro; and a mix of code & math (~60% scikit-learn).
64
Contact:
o E-mail: mail@sebastianraschka.com
o Website: http://sebastianraschka.com
o Twitter: @rasbt
o GitHub: rasbt
Tutorial Material on GitHub:
https://github.com/rasbt/msu-datascience-ml-tutorial-2018
Thanks for attending!
65