Sep 25th, 2001
Copyright © 2001, 2003, Andrew W. Moore
Regression and
Classification with
Neural Networks
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Note to other teachers and users of
these slides. Andrew would be
delighted if you found this source
material useful in giving your own
lectures. Feel free to use these slides
verbatim, or to modify them to fit your
own needs. PowerPoint originals are
available. If you make use of a
significant portion of these slides in
your own lecture, please include this
message, or the following link to the
source repository of Andrew’s tutorials:
http://www.cs.cmu.edu/~awm/tutorials
. Comments and corrections gratefully
received.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 2
Linear Regression
Linear regression assumes that the expected
value of the output given an input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.
inputs outputs
x1 = 1 y1 = 1
x2 = 3 y2 = 2.2
x3 = 2 y3 = 2
x4 = 1.5 y4 = 1.9
x5 = 4 y5 = 3.1
DATASET
 1 

w

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 3
1-parameter linear regression
Assume that the data is formed by
yi = wxi + noisei
where…
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²
P(y|w,x) has a normal distribution with
• mean wx
• variance σ²
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 4
Bayesian Linear Regression
P(y|w,x) = Normal(mean wx, var σ²)
We have a set of datapoints (x1,y1) (x2,y2) … (xn,yn)
which are EVIDENCE about w.
We want to infer w from the data.
P(w|x1, x2, x3,…xn, y1, y2…yn)
•You can use BAYES rule to work out a posterior
distribution for w given the data.
•Or you could do Maximum Likelihood Estimation
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 5
Maximum likelihood estimation of
w
Asks the question:
“For which value of w is this data most likely to have
happened?”
<=>
For what w is
P(y1, y2…yn |x1, x2, x3,…xn, w) maximized?
<=>
For what w is
$\prod_{i=1}^{n} P(y_i \mid w, x_i)$
maximized?
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 6
For what w is
$\prod_{i=1}^{n} P(y_i \mid w, x_i)$   maximized?

For what w is
$\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^{2}\right)$   maximized?

For what w is
$\sum_{i=1}^{n} -\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^{2}$   maximized?

For what w is
$\sum_{i=1}^{n} \left(y_i - w x_i\right)^{2}$   minimized?
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 7
Linear Regression
The maximum
likelihood w is
the one that
minimizes sum-
of-squares of
residuals
We want to minimize a quadratic function of w.
$\sum_i \left(y_i - w x_i\right)^{2} \;=\; \left(\sum_i y_i^{2}\right) - 2w\left(\sum_i x_i y_i\right) + w^{2}\left(\sum_i x_i^{2}\right)$
(Figure: E(w) plotted against w is a parabola.)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 8
Linear Regression
Easy to show the sum of squares is minimized when
$w = \dfrac{\sum_i x_i y_i}{\sum_i x_i^{2}}$
The maximum likelihood model is
$\text{Out}(x) = w x$
We can use it for prediction.
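A minimal sketch (my own illustration, not from the slides) of this closed-form estimate on the toy dataset from the earlier slide, assuming plain Python with NumPy:

```python
import numpy as np

# Toy dataset from the earlier slide (inputs x, outputs y).
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Maximum-likelihood slope for the model Out(x) = w*x:
# w = sum(x_i * y_i) / sum(x_i^2)
w = np.sum(x * y) / np.sum(x ** 2)

print("MLE slope w =", w)
print("Prediction at x = 2.5:", w * 2.5)
```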
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 9
Linear Regression
Easy to show the sum of squares is minimized when
$w = \dfrac{\sum_i x_i y_i}{\sum_i x_i^{2}}$
The maximum likelihood model is $\text{Out}(x) = w x$; we can use it for prediction.
Note: In Bayesian stats you'd have ended up with a probability distribution over w, and predictions would have given a probability distribution over the expected output.
Often useful to know your confidence. Max likelihood can give some kinds of confidence too.
(Figure: posterior density p(w) plotted against w.)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 10
Multivariate Regression
What if the inputs are vectors?
Dataset has form
x1  y1
x2  y2
x3  y3
 :   :
xR  yR
(Figure: a 2-d input example, with the points plotted over the (x1, x2) plane.)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 11
Multivariate Regression
Write matrix X and Y thus:
$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_R \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & & & \vdots \\ x_{R1} & x_{R2} & \cdots & x_{Rm} \end{bmatrix} \qquad \mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_R \end{bmatrix}$
(there are R datapoints. Each input has m components)
The linear regression model assumes a vector w such that
Out(x) = wᵀx = w₁x[1] + w₂x[2] + … + wₘx[m]
The max. likelihood w is w = (XᵀX)⁻¹(XᵀY)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 12
Multivariate Regression
(Same matrices and model as the previous slide; the max. likelihood w is w = (XᵀX)⁻¹(XᵀY).)
IMPORTANT EXERCISE:
PROVE IT !!!!!
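A small sketch (mine, not part of the slides) of the closed-form solution w = (XᵀX)⁻¹(XᵀY), using made-up data; solving the linear system is preferred over forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: R = 50 datapoints, m = 3 input components.
R, m = 50, 3
X = rng.normal(size=(R, m))
true_w = np.array([2.0, -1.0, 0.5])
Y = X @ true_w + 0.1 * rng.normal(size=R)   # linear signal plus a little noise

# Max. likelihood weights: solve (X^T X) w = X^T Y.
w = np.linalg.solve(X.T @ X, X.T @ Y)

print("estimated w:", w)                      # close to true_w
print("Out(x) for x = [1, 1, 1]:", w @ np.array([1.0, 1.0, 1.0]))
```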
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 13
Multivariate Regression (con’t)
The max. likelihood w is w = (XᵀX)⁻¹(XᵀY)
XᵀX is an m × m matrix: its i,j'th elt is $\sum_{k=1}^{R} x_{ki} x_{kj}$
XᵀY is an m-element vector: its i'th elt is $\sum_{k=1}^{R} x_{ki} y_k$
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 14
What about a constant term?
We may expect
linear data that
does not go
through the origin.
Statisticians and
Neural Net Folks all
agree on a simple
obvious hack.
Can you guess??
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 15
The constant term
• The trick is to create a fake input “X0” that
always takes the value 1
X1 X2 Y
2 4 16
3 4 17
5 5 20
X0 X1 X2 Y
1 2 4 16
1 3 4 17
1 5 5 20
Before:
Y=w1X1+ w2X2
…has to be a poor
model
After:
Y= w0X0+w1X1+ w2X2
= w0+w1X1+ w2X2
…has a fine constant
term
In this example,
You should be
able to see the
MLE w0 , w1 and w2
by inspection
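A quick sketch (my own, not the slides') of the hack applied to the little table above: prepend a column of 1's and solve as before. The answer it prints can be checked against the "by inspection" exercise.

```python
import numpy as np

# The three datapoints from the slide: columns X1, X2 and output Y.
X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

# The trick: a fake input X0 that always takes the value 1.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve the least-squares problem on the augmented inputs.
w, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)

print("w0, w1, w2 =", w)   # should print approximately [10. 1. 1.]
```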
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 16
Regression with varying noise
• Suppose you know the variance of the noise that was added to each datapoint.
(Figure: the five datapoints plotted with their individual noise levels σ.)
xi   yi   σi²
½    ½    4
1    1    1
2    1    1/4
2    3    4
3    2    1/4
Assume $y_i \sim N(w x_i,\; \sigma_i^{2})$.  What's the MLE estimate of w?
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 17
MLE estimation with varying noise
$w_{\mathrm{MLE}} = \operatorname*{argmax}_{w} \; \log p\!\left(y_1, y_2, \ldots, y_R \mid x_1, \ldots, x_R, \sigma_1^2, \ldots, \sigma_R^2, w\right)$
  (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)
$= \operatorname*{argmin}_{w} \; \sum_{i=1}^{R} \frac{\left(y_i - w x_i\right)^{2}}{\sigma_i^{2}}$
  (setting dLL/dw equal to zero)
$=$ the w such that $\;\sum_{i=1}^{R} \frac{x_i \left(y_i - w x_i\right)}{\sigma_i^{2}} = 0$
  (trivial algebra)
$\Rightarrow\; w = \dfrac{\sum_{i=1}^{R} x_i y_i / \sigma_i^{2}}{\sum_{i=1}^{R} x_i^{2} / \sigma_i^{2}}$
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 18
This is Weighted Regression
• We are asking to minimize the weighted sum of squares
$w_{\mathrm{MLE}} = \operatorname*{argmin}_{w} \; \sum_{i=1}^{R} \frac{\left(y_i - w x_i\right)^{2}}{\sigma_i^{2}}$
(Figure: the same five datapoints with their noise levels σ.)
…where the weight for the i'th datapoint is $\dfrac{1}{\sigma_i^{2}}$
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 19
Regression
The max. likelihood w is w = (WXᵀWX)⁻¹(WXᵀWY)
(WXᵀWX) is an m × m matrix: its i,j'th elt is $\sum_{k=1}^{R} \frac{x_{ki} x_{kj}}{\sigma_k^{2}}$
(WXᵀWY) is an m-element vector: its i'th elt is $\sum_{k=1}^{R} \frac{x_{ki} y_k}{\sigma_k^{2}}$
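A hedged sketch (mine, not the slides' code) of weighted least squares: each row of X and Y is scaled by 1/σᵢ, which is equivalent to minimizing the weighted sum of squares above. The helper name and the re-use of the varying-noise toy data are my assumptions.

```python
import numpy as np

def weighted_regression(X, Y, sigma):
    """Max-likelihood weights when datapoint i has noise std sigma[i].

    Scales row i of X and Y by 1/sigma[i], then solves ordinary least
    squares, i.e. minimizes sum_i (y_i - w.x_i)^2 / sigma_i^2.
    """
    scale = 1.0 / np.asarray(sigma)
    Xw = X * scale[:, None]
    Yw = Y * scale
    return np.linalg.solve(Xw.T @ Xw, Xw.T @ Yw)

# The varying-noise toy data from the earlier slide (1-d inputs, sigma = sqrt(sigma^2)).
x = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
y = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
sigma = np.array([2.0, 1.0, 0.5, 2.0, 0.5])

w = weighted_regression(x[:, None], y, sigma)
print("weighted MLE slope:", w[0])
```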
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 20
Non-linear Regression
• Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:
(Figure: the five datapoints plotted in the (x, y) plane.)
xi   yi
½    ½
1    2.5
2    3
3    2
3    3
Assume $y_i \sim N\!\left(\sqrt{w + x_i},\; \sigma^{2}\right)$.  What's the MLE estimate of w?
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 21
Non-linear MLE estimation
$w_{\mathrm{MLE}} = \operatorname*{argmax}_{w} \; \log p\!\left(y_1, y_2, \ldots, y_R \mid x_1, \ldots, x_R, \sigma, w\right)$
  (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)
$= \operatorname*{argmin}_{w} \; \sum_{i=1}^{R} \left(y_i - \sqrt{w + x_i}\right)^{2}$
  (setting dLL/dw equal to zero)
$=$ the w such that $\;\sum_{i=1}^{R} \frac{y_i - \sqrt{w + x_i}}{\sqrt{w + x_i}} = 0$
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 22
Non-linear MLE estimation
(Same derivation as the previous slide.)
We're down the algebraic toilet. So guess what we do?
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 23
Non-linear MLE estimation
(Same derivation as the previous slide. We're down the algebraic toilet, so guess what we do?)
Common (but not only) approach:
Numerical Solutions:
• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg-Marquardt
• Newton’s Method
Also, special purpose statistical-
optimization-specific tricks such as
E.M. (See Gaussian Mixtures lecture
for introduction)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 24
GRADIENT DESCENT
Suppose we have a scalar function  $f(w)\,:\, \Re \rightarrow \Re$
We want to find a local minimum.
Assume our current weight is w
GRADIENT DESCENT RULE:   $w \leftarrow w - \eta \,\dfrac{\partial f}{\partial w}(w)$
η is called the LEARNING RATE. A small positive number, e.g. η = 0.05
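A tiny sketch (mine) of the rule on a one-dimensional example, using the slide's example η = 0.05 and a function whose minimum we know, f(w) = (w - 3)²:

```python
# Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.
def f_prime(w):
    return 2.0 * (w - 3.0)

eta = 0.05      # the learning rate value mentioned on the slide
w = 0.0         # arbitrary starting weight

for step in range(200):
    w = w - eta * f_prime(w)   # w <- w - eta * df/dw

print("w after gradient descent:", w)   # close to 3.0
```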
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 25
GRADIENT DESCENT
(Same as the previous slide.)
QUESTION: Justify the Gradient Descent
Rule
Recall Andrew’s favorite
default value for anything
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 26
Gradient Descent in “m”
Dimensions
Given  $f(\mathbf{w})\,:\, \Re^{m} \rightarrow \Re$
$\nabla f(\mathbf{w}) = \begin{bmatrix} \dfrac{\partial f}{\partial w_1}(\mathbf{w}) \\ \vdots \\ \dfrac{\partial f}{\partial w_m}(\mathbf{w}) \end{bmatrix}$  points in the direction of steepest ascent, and  $\left|\nabla f(\mathbf{w})\right|$  is the gradient in that direction.
GRADIENT DESCENT RULE:   $\mathbf{w} \leftarrow \mathbf{w} - \eta \,\nabla f(\mathbf{w})$
Equivalently   $w_j \leftarrow w_j - \eta \,\dfrac{\partial f}{\partial w_j}(\mathbf{w})$   ….where wj is the jth weight
"just like a linear feedback system"
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 27
What’s all this got to do with
Neural Nets, then, eh??
For supervised learning, neural nets are also models with
vectors of w parameters in them. They are now called
weights.
As before, we want to compute the weights to minimize
sum-of-squared residuals.
Which turns out, under “Gaussian i.i.d noise”
assumption to be max. likelihood.
Instead of explicitly solving for max. likelihood weights,
we use GRADIENT DESCENT to SEARCH for them.
“Why?” you ask, a querulous expression in your eyes.
“Aha!!” I reply: “We’ll see later.”
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 28
Linear Perceptrons
They are multivariate linear models:   Out(x) = wᵀx
And "training" consists of minimizing sum-of-squared residuals by gradient descent:
$E = \sum_{k} \left(y_k - \mathrm{Out}(\mathbf{x}_k)\right)^{2} = \sum_{k} \left(y_k - \mathbf{w}^{T}\mathbf{x}_k\right)^{2}$
QUESTION: Derive the perceptron training rule.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 29
Linear Perceptron Training Rule
$E = \sum_{k=1}^{R} \left(y_k - \mathbf{w}^{T}\mathbf{x}_k\right)^{2}$
Gradient descent tells us we should update w thusly if we wish to minimize E:
$w_j \leftarrow w_j - \eta \,\dfrac{\partial E}{\partial w_j}$
So what's $\dfrac{\partial E}{\partial w_j}$ ?
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 30
Linear Perceptron Training Rule
$E = \sum_{k=1}^{R} \left(y_k - \mathbf{w}^{T}\mathbf{x}_k\right)^{2}$
Gradient descent tells us we should update w thusly if we wish to minimize E:
$w_j \leftarrow w_j - \eta \,\dfrac{\partial E}{\partial w_j}$
So what's $\dfrac{\partial E}{\partial w_j}$ ?
$\dfrac{\partial E}{\partial w_j} = \dfrac{\partial}{\partial w_j} \sum_{k=1}^{R} \left(y_k - \mathbf{w}^{T}\mathbf{x}_k\right)^{2} = \sum_{k=1}^{R} 2\left(y_k - \mathbf{w}^{T}\mathbf{x}_k\right)\dfrac{\partial}{\partial w_j}\left(y_k - \mathbf{w}^{T}\mathbf{x}_k\right) = -2\sum_{k=1}^{R} \delta_k \,\dfrac{\partial}{\partial w_j}\,\mathbf{w}^{T}\mathbf{x}_k$
…where   $\delta_k = y_k - \mathbf{w}^{T}\mathbf{x}_k$
$= -2\sum_{k=1}^{R} \delta_k \,\dfrac{\partial}{\partial w_j}\sum_{i=1}^{m} w_i x_{ki} = -2\sum_{k=1}^{R} \delta_k x_{kj}$
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 31
Linear Perceptron Training Rule
$E = \sum_{k=1}^{R} \left(y_k - \mathbf{w}^{T}\mathbf{x}_k\right)^{2}$
Gradient descent tells us we should update w thusly if we wish to minimize E:
$w_j \leftarrow w_j - \eta \,\dfrac{\partial E}{\partial w_j}$
…where…
$\dfrac{\partial E}{\partial w_j} = -2\sum_{k=1}^{R} \delta_k x_{kj}$
so
$w_j \leftarrow w_j + 2\eta \sum_{k=1}^{R} \delta_k x_{kj}$
We frequently neglect the 2 (meaning we halve the learning rate)
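A batch-mode sketch (my own, with made-up data) of this rule, with the factor of 2 folded into the learning rate as the slide suggests:

```python
import numpy as np

def train_linear_perceptron(X, Y, eta=0.005, iters=500):
    """Gradient descent on E = sum_k (y_k - w.x_k)^2.

    Update: w_j <- w_j + eta * sum_k delta_k * x_kj,
    where delta_k = y_k - w.x_k (the 2 is absorbed into eta).
    """
    R, m = X.shape
    w = np.zeros(m)
    for _ in range(iters):
        delta = Y - X @ w            # residuals, one per datapoint
        w = w + eta * (X.T @ delta)  # batch update of every weight
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.5, -2.0, 0.3]) + 0.05 * rng.normal(size=100)
print(train_linear_perceptron(X, Y))   # close to [1.5, -2.0, 0.3]
```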
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 32
algorithm
1) Randomly initialize weights w1 w2 … wm
2) Get your dataset (append 1's to the inputs if you don't want to go through the origin).
3) for i = 1 to R:   $\delta_i := y_i - \mathbf{w}^{T}\mathbf{x}_i$
4) for j = 1 to m:   $w_j \leftarrow w_j + \eta \sum_{i=1}^{R} \delta_i x_{ij}$
5) if $\sum_i \delta_i^{2}$ stops improving then stop. Else loop back to 3.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 33
$w_j \leftarrow w_j + \eta \,\delta_i x_{ij}$   where   $\delta_i = y_i - \mathbf{w}^{T}\mathbf{x}_i$
A RULE KNOWN BY MANY NAMES
The LMS Rule
The delta rule
The Widrow-Hoff rule
Classical conditioning
The adaline rule
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 34
If data is voluminous and arrives
fast
Input-output pairs (x,y) come streaming in very quickly. THEN
Don't bother remembering old ones.
Just keep using new ones.
observe (x, y)
$\delta \leftarrow y - \mathbf{w}^{T}\mathbf{x}$
$w_j \leftarrow w_j + \eta \,\delta\, x_j$
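A streaming sketch (mine, with a simulated data stream) of the online rule: each incoming pair updates w immediately and is then forgotten.

```python
import numpy as np

def lms_online(stream, m, eta=0.05):
    """Online LMS / delta rule: w_j <- w_j + eta * delta * x_j per example."""
    w = np.zeros(m)
    for x, y in stream:              # (x, y) pairs arrive one at a time
        delta = y - w @ x            # prediction error on this example only
        w = w + eta * delta * x      # update and move on; x, y are not stored
    return w

rng = np.random.default_rng(2)
true_w = np.array([0.7, -1.2])
stream = ((x, x @ true_w + 0.01 * rng.normal())
          for x in rng.normal(size=(5000, 2)))
print(lms_online(stream, m=2))       # drifts toward true_w
```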
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 35
GD Advantages (MI disadvantages):
• Biologically plausible
• With very very many attributes each iteration costs only O(mR). If
fewer than m iterations needed we’ve beaten Matrix Inversion
• More easily parallelizable (or implementable in wetware)?
GD Disadvantages (MI advantages):
• It’s moronic
• It’s essentially a slow implementation of a way to build the XTX matrix
and then solve a set of linear equations
• If m is small it’s especially outageous. If m is large then the direct
matrix inversion method gets fiddly but not impossible if you want to
be efficient.
• Hard to choose a good learning rate
• Matrix inversion takes predictable time. You can’t be sure when
gradient descent will stop.
Gradient Descent vs Matrix
Inversion for Linear Perceptrons
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 37
Gradient Descent vs Matrix Inversion for Linear Perceptrons
(Same comparison as the previous slide, with this addition:)
But we'll soon see that GD has an important extra trick up its sleeve
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 38
Perceptrons for Classification
What if all outputs are 0’s or 1’s ?
or
We can do a linear fit.
Our prediction is 0 if out(x)≤1/2
1 if out(x)>1/2
WHAT’S THE BIG PROBLEM WITH THIS???
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 39
Perceptrons for Classification
What if all outputs are 0’s or 1’s ?
or
We can do a linear fit.
Our prediction is 0 if out(x)≤½
1 if out(x)>½
WHAT’S THE BIG PROBLEM WITH THIS???
Blue = Out(x)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 40
Perceptrons for Classification
What if all outputs are 0’s or 1’s ?
or
We can do a linear fit.
Our prediction is 0 if out(x)≤½
1 if out(x)>½
Blue = Out(x)
Green =
Classification
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 41
Don't minimize   $\sum_i \left(y_i - \mathbf{w}^{T}\mathbf{x}_i\right)^{2}$
Minimize the number of misclassifications instead:   $\sum_i \left(y_i - \mathrm{Round}\!\left(\mathbf{w}^{T}\mathbf{x}_i\right)\right)^{2}$   [Assume outputs are +1 & -1, not +1 & 0]
where Round(x) = -1 if x < 0,  1 if x ≥ 0
The gradient descent rule can be changed to:
if (xi, yi) correctly classed, don't change
if wrongly predicted as 1:   w ← w - xi
if wrongly predicted as -1:  w ← w + xi
NOTE: CUTE & NON OBVIOUS WHY THIS WORKS!!
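A sketch (my own, on a made-up linearly separable dataset) of the misclassification-driven rule just described, for ±1 labels; the helper name and the appended constant input are my choices:

```python
import numpy as np

def train_classifier_perceptron(X, Y, epochs=100):
    """Classic perceptron rule for labels in {+1, -1}.

    Correctly classified points leave w alone; a point wrongly
    predicted as +1 subtracts x_i, wrongly predicted as -1 adds x_i.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            pred = 1 if w @ x >= 0 else -1
            if pred != y:
                w = w + y * x      # covers both cases on the slide
    return w

rng = np.random.default_rng(3)
X = np.hstack([rng.normal(size=(40, 2)), np.ones((40, 1))])   # constant input appended
Y = np.where(X[:, 0] + 2 * X[:, 1] - 0.5 > 0, 1, -1)           # separable labels
w = train_classifier_perceptron(X, Y)
errors = int(np.sum(np.where(X @ w >= 0, 1, -1) != Y))
print("learned w:", w, " training errors:", errors)   # usually 0 on this toy set
```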
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 42
Classification with Perceptrons II:
Sigmoid Functions
Least squares fit useless
This fit would classify much
better. But not a least
squares fit.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 43
Classification with Perceptrons II:
Sigmoid Functions
Least squares fit useless
This fit would classify much
better. But not a least
squares fit.
SOLUTION:
Instead of  Out(x) = wᵀx
We'll use  Out(x) = g(wᵀx)
where  $g\,:\, \Re \rightarrow (0, 1)$  is a squashing function
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 44
The Sigmoid
$g(h) = \dfrac{1}{1 + \exp(-h)}$
Note that if you rotate this curve through 180° centered on (0, ½) you get the same curve.
i.e. g(h) = 1 - g(-h)
Can you prove this?
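A tiny sketch (mine) of the sigmoid and a numeric check of the rotation property g(h) = 1 - g(-h):

```python
import numpy as np

def g(h):
    """The sigmoid squashing function g(h) = 1 / (1 + exp(-h))."""
    return 1.0 / (1.0 + np.exp(-h))

h = np.linspace(-5, 5, 11)
print(np.allclose(g(h), 1.0 - g(-h)))   # True: the symmetry noted on the slide
print(g(0.0))                           # 0.5, the centre of rotation
```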
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 45
The Sigmoid
Now we choose w to minimize
$\sum_{i=1}^{R} \left(y_i - \mathrm{Out}(\mathbf{x}_i)\right)^{2} = \sum_{i=1}^{R} \left(y_i - g\!\left(\mathbf{w}^{T}\mathbf{x}_i\right)\right)^{2}$
where  $g(h) = \dfrac{1}{1 + \exp(-h)}$
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 46
Linear Perceptron Classification
Regions
(Figure: points labelled 0 and 1 in the (X1, X2) plane, separated by a line.)
We'll use the model  Out(x) = g(wᵀ(x,1)) = g(w1x1 + w2x2 + w0)
Which region of the above diagram is classified with +1, and which with 0 ??
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 47
The sigmoid perceptron update rule
First, notice that  $g'(x) = g(x)\,\big(1 - g(x)\big)$,  because  $g(x) = \dfrac{1}{1 + e^{-x}}$,  so
$g'(x) = \dfrac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} = \dfrac{1}{1 + e^{-x}}\left(1 - \dfrac{1}{1 + e^{-x}}\right) = g(x)\,\big(1 - g(x)\big)$
Writing  $\mathrm{net}_i = \sum_k w_k x_{ik}$  and  $\mathrm{Out}(\mathbf{x}_i) = g(\mathrm{net}_i)$,  gradient descent on the sum-of-squared residuals gives the sigmoid perceptron update rule:
$w_j \leftarrow w_j + \eta \sum_{i=1}^{R} \delta_i \, g_i \big(1 - g_i\big)\, x_{ij}$
…where   $g_i = g\!\left(\sum_{j=1}^{m} w_j x_{ij}\right)$   and   $\delta_i = y_i - g_i$
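A sketch (my own, with made-up 0/1 data and an assumed learning rate) of training a sigmoid perceptron with this rule; the g_i(1 - g_i) factor is the g' term derived above:

```python
import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def train_sigmoid_perceptron(X, Y, eta=0.05, iters=3000):
    """Gradient descent on sum_i (y_i - g(w.x_i))^2."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        gi = g(X @ w)                        # current outputs g_i
        delta = Y - gi                       # delta_i = y_i - g_i
        grad_factor = delta * gi * (1.0 - gi)
        w = w + eta * (X.T @ grad_factor)    # w_j += eta * sum_i delta_i g_i (1-g_i) x_ij
    return w

rng = np.random.default_rng(4)
X = np.hstack([rng.normal(size=(60, 2)), np.ones((60, 1))])   # constant input appended
Y = (X[:, 0] - X[:, 1] > 0).astype(float)                     # 0/1 targets
w = train_sigmoid_perceptron(X, Y)
print("training accuracy:", np.mean((g(X @ w) > 0.5) == (Y == 1)))   # should be near 1.0
```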
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 48
Perceptrons
• Invented and popularized by Rosenblatt (1962)
• Even with sigmoid nonlinearity, correct
convergence is guaranteed
• Stable behavior for overconstrained and
underconstrained problems
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 49
Perceptrons and Boolean
Functions
If inputs are all 0's and 1's and outputs are all 0's and 1's…
• Can learn the function x1 ∧ x2
• Can learn the function x1 ∨ x2
• Can learn any conjunction of literals, e.g.  x1 ∧ ~x2 ∧ ~x3 ∧ x4 ∧ x5
QUESTION: WHY?
(Figures: the decision regions for x1 ∧ x2 and x1 ∨ x2 in the (X1, X2) plane.)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 50
Perceptrons and Boolean
Functions
• Can learn any disjunction of literals, e.g.  x1 ∨ ~x2 ∨ ~x3 ∨ x4 ∨ x5
• Can learn the majority function
  f(x1, x2 … xn) = 1 if n/2 xi's or more are = 1,
                   0 if fewer than n/2 xi's are = 1
• What about the exclusive or function?
  f(x1, x2) = x1 ⊕ x2 = (x1 ∧ ~x2) ∨ (~x1 ∧ x2)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 51
Multilayer Networks
The class of functions representable by perceptrons is limited:
$\mathrm{Out}(\mathbf{x}) = g\!\left(\mathbf{w}^{T}\mathbf{x}\right) = g\!\left(\sum_j w_j x_j\right)$
Use a wider representation!
$\mathrm{Out}(\mathbf{x}) = g\!\left(\sum_j W_j \; g\!\left(\sum_k w_{jk} x_k\right)\right)$
This is a nonlinear function of a linear combination of nonlinear functions of linear combinations of inputs.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 52
A 1-HIDDEN LAYER NET
NINPUTS = 2   NHIDDEN = 3
$\mathrm{Out} = g\!\left(\sum_{k=1}^{N_{HID}} W_k v_k\right)$
$v_1 = g\!\left(\sum_{k=1}^{N_{INS}} w_{1k} x_k\right), \quad v_2 = g\!\left(\sum_{k=1}^{N_{INS}} w_{2k} x_k\right), \quad v_3 = g\!\left(\sum_{k=1}^{N_{INS}} w_{3k} x_k\right)$
(Figure: inputs x1, x2 feed the three hidden units through weights w11 … w32; the hidden units feed the output through weights w1, w2, w3.)
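A forward-pass sketch (mine) of the 2-input, 3-hidden-unit net above; the weight values are arbitrary examples chosen just to run the computation:

```python
import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def one_hidden_layer_out(x, w_hidden, W_out):
    """Out(x) = g( sum_k W_k * v_k ), with v_j = g( sum_k w_jk * x_k )."""
    v = g(w_hidden @ x)      # hidden-unit activations v_1..v_NHID
    return g(W_out @ v)      # single sigmoid output

# Example weights (arbitrary): row j holds the weights w_j1, w_j2 of hidden unit j.
w_hidden = np.array([[ 0.5, -1.0],
                     [ 1.5,  0.3],
                     [-0.7,  2.0]])
W_out = np.array([1.0, -2.0, 0.5])   # W_1, W_2, W_3

print(one_hidden_layer_out(np.array([1.0, 0.5]), w_hidden, W_out))
```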
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 53
OTHER NEURAL NETS
2-Hidden layers + Constant Term
(Figure: a net with two hidden layers, a constant input 1, and inputs x1, x2, x3.)
"JUMP" CONNECTIONS
(Figure: inputs x1, x2 also connect directly to the output.)
$\mathrm{Out} = g\!\left(\sum_{k=0}^{N_{INS}} w_k x_k + \sum_{k=1}^{N_{HID}} W_k v_k\right)$
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 54
Backpropagation
Find a set of weights  $\{W_j\},\ \{w_{jk}\}$  to minimize
$\sum_i \left(y_i - \mathrm{Out}(\mathbf{x}_i)\right)^{2}$
by gradient descent, where
$\mathrm{Out}(\mathbf{x}) = g\!\left(\sum_j W_j \, g\!\left(\sum_k w_{jk} x_k\right)\right)$
That's it!
That's the backpropagation algorithm.
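A compact sketch (my own, not code from the slides) of gradient descent on that sum of squares for the 1-hidden-layer net, with the gradients obtained by the chain rule. The XOR-style demo data, helper names, learning rate, and iteration count are all assumptions; depending on the random start the net may or may not nail XOR exactly, but the sum-of-squares error should fall well below the roughly 1.0 an untrained net gives on this data.

```python
import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def net_out(X, w, W):
    """Out(x) = g( sum_j W_j * g( sum_k w_jk * x_k ) )."""
    return g(g(X @ w.T) @ W)

def train_backprop(X, Y, n_hidden=3, eta=0.5, iters=20000, seed=0):
    """Gradient descent on E = sum_i (y_i - Out(x_i))^2, gradients via the chain rule."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.5, size=(n_hidden, X.shape[1]))  # input -> hidden weights w_jk
    W = rng.normal(scale=0.5, size=n_hidden)                # hidden -> output weights W_j
    for _ in range(iters):
        V = g(X @ w.T)                              # hidden activations, shape (R, n_hidden)
        out = g(V @ W)                              # network outputs, shape (R,)
        d_out = (Y - out) * out * (1.0 - out)       # output-layer "delta"
        d_hid = np.outer(d_out, W) * V * (1.0 - V)  # hidden-layer deltas
        W = W + eta * (V.T @ d_out)                 # the factor of 2 is folded into eta
        w = w + eta * (d_hid.T @ X)
    return w, W

# XOR-like data, with a constant 1 appended as a third input (a bias for the hidden units).
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
Y = np.array([0.0, 1.0, 1.0, 0.0])

w, W = train_backprop(X, Y)
print("sum-of-squares error:", np.sum((Y - net_out(X, w, W)) ** 2))
print("outputs:", np.round(net_out(X, w, W), 2))
```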
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 55
Backpropagation Convergence
Convergence to a global minimum is not
guaranteed.
•In practice, this is not a problem, apparently.
Tweaking to find the right number of hidden
units, or a useful learning rate η, is more
hassle, apparently.
IMPLEMENTING BACKPROP: Differentiate the monster sum-square residual → write down the Gradient Descent Rule → it turns out to be easier & computationally efficient to use lots of local variables with names like hj, ok, vj, neti etc…
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 56
Choosing the learning rate
• This is a subtle art.
• Too small: can take days instead of
minutes to converge
• Too large: diverges (MSE gets larger and
larger while the weights increase and
usually oscillate)
• Sometimes the “just right” value is hard to
find.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 57
Learning-rate problems
From J. Hertz, A. Krogh, and R.
G. Palmer. Introduction to the
Theory of Neural Computation.
Addison-Wesley, 1994.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 58
Improving Simple Gradient
Descent
Momentum
Don't just change weights according to the current datapoint.
Re-use changes from earlier iterations.
Let ∆w(t) = weight changes at time t.
Let  $-\eta \dfrac{\partial E}{\partial \mathbf{w}}$  be the change we would make with regular gradient descent.
Instead we use
$\Delta \mathbf{w}(t+1) = -\eta \dfrac{\partial E}{\partial \mathbf{w}} + \alpha\, \Delta \mathbf{w}(t)$    (α is the momentum parameter)
$\mathbf{w}(t+1) = \mathbf{w}(t) + \Delta \mathbf{w}(t+1)$
Momentum damps oscillations.
A hack? Well, maybe.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 59
Momentum illustration
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 60
Improving Simple Gradient
Descent
Newton’s method
)
|
(|
2
1
)
(
)
( 3
2
2
h
h
w
h
w
h
w
h
w O
E
E
E
E T
T









If we neglect the O(h3
) terms, this is a quadratic form
Quadratic form fun facts:
If y = c + bT
x - 1/2 xT
A x
And if A is SPD
Then
xopt
= A-1
b is the value of x that maximizes y
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 61
Improving Simple Gradient
Descent
Newton’s method
)
|
(|
2
1
)
(
)
( 3
2
2
h
h
w
h
w
h
w
h
w O
E
E
E
E T
T









If we neglect the O(h3
) terms, this is a quadratic form
w
w
w
w













E
E
1
2
2
This should send us directly to the global minimum if
the function is truly quadratic.
And it might get us close if it’s locally quadraticish
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 62
Improving Simple Gradient
Descent
Newton’s method
)
|
(|
2
1
)
(
)
( 3
2
2
h
h
w
h
w
h
w
h
w O
E
E
E
E T
T









If we neglect the O(h3
) terms, this is a quadratic form
w
w
w
w













E
E
1
2
2
This should send us directly to the global minimum if
the function is truly quadratic.
And it might get us close if it’s locally quadraticish
BUT (and it’s a big but)…
That second derivative matrix can be
expensive and fiddly to compute.
If we’re not already in the quadratic bowl,
we’ll go nuts.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 63
Improving Simple Gradient
Descent
Conjugate Gradient
Another method which attempts to exploit the "local quadratic bowl" assumption
But does so while only needing to use  $\dfrac{\partial E}{\partial \mathbf{w}}$  and not  $\dfrac{\partial^{2} E}{\partial \mathbf{w}^{2}}$
It is also more stable than Newton's method if the local quadratic bowl assumption is violated.
It's complicated, outside our scope, but it often works well. More details in Numerical Recipes in C.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 64
BEST GENERALIZATION
Intuitively, you want to use the smallest,
simplest net that seems to fit the data.
HOW TO FORMALIZE THIS INTUITION?
1. Don’t. Just use intuition
2. Bayesian Methods Get it Right
3. Statistical Analysis explains what’s going on
4. Cross-validation
Discussed in the next
lecture
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 65
What You Should Know
• How to implement multivariate Least-
squares linear regression.
• Derivation of least squares as max.
likelihood estimator of linear coefficients
• The general gradient descent rule
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 66
What You Should Know
• Perceptrons
  - Linear output, least squares
  - Sigmoid output, least squares
• Multilayer nets
  - The idea behind back prop
  - Awareness of better minimization methods
• Generalization. What it means.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 67
APPLICATIONS
To Discuss:
• What can non-linear regression be useful for?
• What can neural nets (used as non-linear
regressors) be useful for?
• What are the advantages of N. Nets for
nonlinear regression?
• What are the disadvantages?
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 68
Other Uses of Neural Nets…
• Time series with recurrent nets
• Unsupervised learning (clustering
principal components and non-linear
versions thereof)
• Combinatorial optimization with Hopfield
nets, Boltzmann Machines
• Evaluation function learning (in
reinforcement learning)