Sep 25th, 2001
Copyright © 2001, 2003, Andrew W. Moore
Regression and
Classification with
Neural Networks
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Note to other teachers and users of
these slides. Andrew would be
delighted if you found this source
material useful in giving your own
lectures. Feel free to use these slides
verbatim, or to modify them to fit your
own needs. PowerPoint originals are
available. If you make use of a
significant portion of these slides in
your own lecture, please include this
message, or the following link to the
source repository of Andrew’s tutorials:
http://www.cs.cmu.edu/~awm/tutorials
. Comments and corrections gratefully
received.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 2
Linear Regression
Linear regression assumes that the expected
value of the output given an input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.
inputs outputs
x1 = 1 y1 = 1
x2 = 3 y2 = 2.2
x3 = 2 y3 = 2
x4 = 1.5 y4 = 1.9
x5 = 4 y5 = 3.1
DATASET
 1 

w

Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 3
1-parameter linear regression
Assume that the data is formed by
yi = wxi + noisei
where…
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²
P(y|w,x) has a normal distribution with
• mean wx
• variance σ²
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 4
Bayesian Linear Regression
P(y|w,x) = Normal(mean wx, var σ²)
We have a set of datapoints (x1,y1) (x2,y2) … (xn,yn)
which are EVIDENCE about w.
We want to infer w from the data.
P(w|x1, x2, x3,…xn, y1, y2…yn)
•You can use BAYES rule to work out a posterior
distribution for w given the data.
•Or you could do Maximum Likelihood Estimation
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 5
Maximum likelihood estimation of
w
Asks the question:
“For which value of w is this data most likely to have
happened?”
<=>
For what w is
P(y1, y2…yn |x1, x2, x3,…xn, w) maximized?
<=>
For what w is
$\prod_{i=1}^{n} P(y_i \mid w, x_i)$
maximized?
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 6
For what w is
$\prod_{i=1}^{n} P(y_i \mid w, x_i)$   maximized?

For what w is
$\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^{2}\right)$   maximized?

For what w is
$\sum_{i=1}^{n} -\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^{2}$   maximized?

For what w is
$\sum_{i=1}^{n} \left(y_i - w x_i\right)^{2}$   minimized?
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 7
Linear Regression
The maximum
likelihood w is
the one that
minimizes sum-
of-squares of
residuals
We want to minimize a quadratic function of w.
$\sum_i \left(y_i - w x_i\right)^{2} \;=\; \left(\sum_i y_i^{2}\right) - 2w\left(\sum_i x_i y_i\right) + w^{2}\left(\sum_i x_i^{2}\right)$
(Figure: E(w) plotted against w is a parabola.)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 8
Linear Regression
Easy to show the sum of squares is minimized when
$w = \dfrac{\sum_i x_i y_i}{\sum_i x_i^{2}}$
The maximum likelihood model is
$\text{Out}(x) = w x$
We can use it for prediction.
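A minimal sketch (my own illustration, not from the slides) of this closed-form estimate on the toy dataset from the earlier slide, assuming plain Python with NumPy:

```python
import numpy as np

# Toy dataset from the earlier slide (inputs x, outputs y).
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Maximum-likelihood slope for the model Out(x) = w*x:
# w = sum(x_i * y_i) / sum(x_i^2)
w = np.sum(x * y) / np.sum(x ** 2)

print("MLE slope w =", w)
print("Prediction at x = 2.5:", w * 2.5)
```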
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 9
Linear Regression
Easy to show the sum of squares is minimized when
$w = \dfrac{\sum_i x_i y_i}{\sum_i x_i^{2}}$
The maximum likelihood model is $\text{Out}(x) = w x$; we can use it for prediction.
Note: In Bayesian stats you'd have ended up with a probability distribution over w, and predictions would have given a probability distribution over the expected output.
Often useful to know your confidence. Max likelihood can give some kinds of confidence too.
(Figure: posterior density p(w) plotted against w.)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 10
Multivariate Regression
What if the inputs are vectors?
Dataset has form
x1  y1
x2  y2
x3  y3
 :   :
xR  yR
(Figure: a 2-d input example, with the points plotted over the (x1, x2) plane.)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 11
Multivariate Regression
Write matrix X and Y thus:
$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_R \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & & & \vdots \\ x_{R1} & x_{R2} & \cdots & x_{Rm} \end{bmatrix} \qquad \mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_R \end{bmatrix}$
(there are R datapoints. Each input has m components)
The linear regression model assumes a vector w such that
Out(x) = wᵀx = w₁x[1] + w₂x[2] + … + wₘx[m]
The max. likelihood w is w = (XᵀX)⁻¹(XᵀY)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 12
Multivariate Regression
(Same matrices and model as the previous slide; the max. likelihood w is w = (XᵀX)⁻¹(XᵀY).)
IMPORTANT EXERCISE:
PROVE IT !!!!!
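A small sketch (mine, not part of the slides) of the closed-form solution w = (XᵀX)⁻¹(XᵀY), using made-up data; solving the linear system is preferred over forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: R = 50 datapoints, m = 3 input components.
R, m = 50, 3
X = rng.normal(size=(R, m))
true_w = np.array([2.0, -1.0, 0.5])
Y = X @ true_w + 0.1 * rng.normal(size=R)   # linear signal plus a little noise

# Max. likelihood weights: solve (X^T X) w = X^T Y.
w = np.linalg.solve(X.T @ X, X.T @ Y)

print("estimated w:", w)                      # close to true_w
print("Out(x) for x = [1, 1, 1]:", w @ np.array([1.0, 1.0, 1.0]))
```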
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 13
Multivariate Regression (con’t)
The max. likelihood w is w = (XᵀX)⁻¹(XᵀY)
XᵀX is an m × m matrix: its i,j'th elt is $\sum_{k=1}^{R} x_{ki} x_{kj}$
XᵀY is an m-element vector: its i'th elt is $\sum_{k=1}^{R} x_{ki} y_k$
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 14
What about a constant term?
We may expect
linear data that
does not go
through the origin.
Statisticians and
Neural Net Folks all
agree on a simple
obvious hack.
Can you guess??
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 15
The constant term
• The trick is to create a fake input “X0” that
always takes the value 1
X1 X2 Y
2 4 16
3 4 17
5 5 20
X0 X1 X2 Y
1 2 4 16
1 3 4 17
1 5 5 20
Before:
Y=w1X1+ w2X2
…has to be a poor
model
After:
Y= w0X0+w1X1+ w2X2
= w0+w1X1+ w2X2
…has a fine constant
term
In this example,
You should be
able to see the
MLE w0 , w1 and w2
by inspection
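A quick sketch (my own, not the slides') of the hack applied to the little table above: prepend a column of 1's and solve as before. The answer it prints can be checked against the "by inspection" exercise.

```python
import numpy as np

# The three datapoints from the slide: columns X1, X2 and output Y.
X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

# The trick: a fake input X0 that always takes the value 1.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve the least-squares problem on the augmented inputs.
w, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)

print("w0, w1, w2 =", w)   # should print approximately [10. 1. 1.]
```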
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 16
Regression with varying noise
• Suppose you know the variance of the noise that was added to each datapoint.
(Figure: the five datapoints plotted with their individual noise levels σ.)
xi   yi   σi²
½    ½    4
1    1    1
2    1    1/4
2    3    4
3    2    1/4
Assume $y_i \sim N(w x_i,\; \sigma_i^{2})$.  What's the MLE estimate of w?
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 17
MLE estimation with varying noise
$w_{\mathrm{MLE}} = \operatorname*{argmax}_{w} \; \log p\!\left(y_1, y_2, \ldots, y_R \mid x_1, \ldots, x_R, \sigma_1^2, \ldots, \sigma_R^2, w\right)$
  (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)
$= \operatorname*{argmin}_{w} \; \sum_{i=1}^{R} \frac{\left(y_i - w x_i\right)^{2}}{\sigma_i^{2}}$
  (setting dLL/dw equal to zero)
$=$ the w such that $\;\sum_{i=1}^{R} \frac{x_i \left(y_i - w x_i\right)}{\sigma_i^{2}} = 0$
  (trivial algebra)
$\Rightarrow\; w = \dfrac{\sum_{i=1}^{R} x_i y_i / \sigma_i^{2}}{\sum_{i=1}^{R} x_i^{2} / \sigma_i^{2}}$
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 18
This is Weighted Regression
• We are asking to minimize the weighted sum of squares
$w_{\mathrm{MLE}} = \operatorname*{argmin}_{w} \; \sum_{i=1}^{R} \frac{\left(y_i - w x_i\right)^{2}}{\sigma_i^{2}}$
(Figure: the same five datapoints with their noise levels σ.)
…where the weight for the i'th datapoint is $\dfrac{1}{\sigma_i^{2}}$
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 19
Regression
The max. likelihood w is w = (WXᵀWX)⁻¹(WXᵀWY)
(WXᵀWX) is an m × m matrix: its i,j'th elt is $\sum_{k=1}^{R} \frac{x_{ki} x_{kj}}{\sigma_k^{2}}$
(WXᵀWY) is an m-element vector: its i'th elt is $\sum_{k=1}^{R} \frac{x_{ki} y_k}{\sigma_k^{2}}$
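A hedged sketch (mine, not the slides' code) of weighted least squares: each row of X and Y is scaled by 1/σᵢ, which is equivalent to minimizing the weighted sum of squares above. The helper name and the re-use of the varying-noise toy data are my assumptions.

```python
import numpy as np

def weighted_regression(X, Y, sigma):
    """Max-likelihood weights when datapoint i has noise std sigma[i].

    Scales row i of X and Y by 1/sigma[i], then solves ordinary least
    squares, i.e. minimizes sum_i (y_i - w.x_i)^2 / sigma_i^2.
    """
    scale = 1.0 / np.asarray(sigma)
    Xw = X * scale[:, None]
    Yw = Y * scale
    return np.linalg.solve(Xw.T @ Xw, Xw.T @ Yw)

# The varying-noise toy data from the earlier slide (1-d inputs, sigma = sqrt(sigma^2)).
x = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
y = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
sigma = np.array([2.0, 1.0, 0.5, 2.0, 0.5])

w = weighted_regression(x[:, None], y, sigma)
print("weighted MLE slope:", w[0])
```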
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 20
Non-linear Regression
• Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:
(Figure: the five datapoints plotted in the (x, y) plane.)
xi   yi
½    ½
1    2.5
2    3
3    2
3    3
Assume $y_i \sim N\!\left(\sqrt{w + x_i},\; \sigma^{2}\right)$.  What's the MLE estimate of w?
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 21
Non-linear MLE estimation
$w_{\mathrm{MLE}} = \operatorname*{argmax}_{w} \; \log p\!\left(y_1, y_2, \ldots, y_R \mid x_1, \ldots, x_R, \sigma, w\right)$
  (assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying)
$= \operatorname*{argmin}_{w} \; \sum_{i=1}^{R} \left(y_i - \sqrt{w + x_i}\right)^{2}$
  (setting dLL/dw equal to zero)
$=$ the w such that $\;\sum_{i=1}^{R} \frac{y_i - \sqrt{w + x_i}}{\sqrt{w + x_i}} = 0$
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 22
Non-linear MLE estimation
(Same derivation as the previous slide.)
We're down the algebraic toilet. So guess what we do?
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 23
Non-linear MLE estimation
(Same derivation as the previous slide. We're down the algebraic toilet, so guess what we do?)
Common (but not only) approach:
Numerical Solutions:
• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg-Marquardt
• Newton’s Method
Also, special purpose statistical-
optimization-specific tricks such as
E.M. (See Gaussian Mixtures lecture
for introduction)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 24
GRADIENT DESCENT
Suppose we have a scalar function  $f(w)\,:\, \Re \rightarrow \Re$
We want to find a local minimum.
Assume our current weight is w
GRADIENT DESCENT RULE:   $w \leftarrow w - \eta \,\dfrac{\partial f}{\partial w}(w)$
η is called the LEARNING RATE. A small positive number, e.g. η = 0.05
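A tiny sketch (mine) of the rule on a one-dimensional example, using the slide's example η = 0.05 and a function whose minimum we know, f(w) = (w - 3)²:

```python
# Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.
def f_prime(w):
    return 2.0 * (w - 3.0)

eta = 0.05      # the learning rate value mentioned on the slide
w = 0.0         # arbitrary starting weight

for step in range(200):
    w = w - eta * f_prime(w)   # w <- w - eta * df/dw

print("w after gradient descent:", w)   # close to 3.0
```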
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 25
GRADIENT DESCENT
(Same as the previous slide.)
QUESTION: Justify the Gradient Descent
Rule
Recall Andrew’s favorite
default value for anything
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 26
Gradient Descent in “m”
Dimensions
Given  $f(\mathbf{w})\,:\, \Re^{m} \rightarrow \Re$
$\nabla f(\mathbf{w}) = \begin{bmatrix} \dfrac{\partial f}{\partial w_1}(\mathbf{w}) \\ \vdots \\ \dfrac{\partial f}{\partial w_m}(\mathbf{w}) \end{bmatrix}$  points in the direction of steepest ascent, and  $\left|\nabla f(\mathbf{w})\right|$  is the gradient in that direction.
GRADIENT DESCENT RULE:   $\mathbf{w} \leftarrow \mathbf{w} - \eta \,\nabla f(\mathbf{w})$
Equivalently   $w_j \leftarrow w_j - \eta \,\dfrac{\partial f}{\partial w_j}(\mathbf{w})$   ….where wj is the jth weight
"just like a linear feedback system"
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 27
What’s all this got to do with
Neural Nets, then, eh??
For supervised learning, neural nets are also models with
vectors of w parameters in them. They are now called
weights.
As before, we want to compute the weights to minimize
sum-of-squared residuals.
Which turns out, under “Gaussian i.i.d noise”
assumption to be max. likelihood.
Instead of explicitly solving for max. likelihood weights,
we use GRADIENT DESCENT to SEARCH for them.
“Why?” you ask, a querulous expression in your eyes.
“Aha!!” I reply: “We’ll see later.”
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 28
Linear Perceptrons
They are multivariate linear models:   Out(x) = wᵀx
And "training" consists of minimizing sum-of-squared residuals by gradient descent:
$E = \sum_{k} \left(y_k - \mathrm{Out}(\mathbf{x}_k)\right)^{2} = \sum_{k} \left(y_k - \mathbf{w}^{T}\mathbf{x}_k\right)^{2}$
QUESTION: Derive the perceptron training rule.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 29
Linear Perceptron Training Rule
$E = \sum_{k=1}^{R} \left(y_k - \mathbf{w}^{T}\mathbf{x}_k\right)^{2}$
Gradient descent tells us we should update w thusly if we wish to minimize E:
$w_j \leftarrow w_j - \eta \,\dfrac{\partial E}{\partial w_j}$
So what's $\dfrac{\partial E}{\partial w_j}$ ?
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 30
Linear Perceptron Training Rule
$E = \sum_{k=1}^{R} \left(y_k - \mathbf{w}^{T}\mathbf{x}_k\right)^{2}$
Gradient descent tells us we should update w thusly if we wish to minimize E:
$w_j \leftarrow w_j - \eta \,\dfrac{\partial E}{\partial w_j}$
So what's $\dfrac{\partial E}{\partial w_j}$ ?
$\dfrac{\partial E}{\partial w_j} = \dfrac{\partial}{\partial w_j} \sum_{k=1}^{R} \left(y_k - \mathbf{w}^{T}\mathbf{x}_k\right)^{2} = \sum_{k=1}^{R} 2\left(y_k - \mathbf{w}^{T}\mathbf{x}_k\right)\dfrac{\partial}{\partial w_j}\left(y_k - \mathbf{w}^{T}\mathbf{x}_k\right) = -2\sum_{k=1}^{R} \delta_k \,\dfrac{\partial}{\partial w_j}\,\mathbf{w}^{T}\mathbf{x}_k$
…where   $\delta_k = y_k - \mathbf{w}^{T}\mathbf{x}_k$
$= -2\sum_{k=1}^{R} \delta_k \,\dfrac{\partial}{\partial w_j}\sum_{i=1}^{m} w_i x_{ki} = -2\sum_{k=1}^{R} \delta_k x_{kj}$
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 31
Linear Perceptron Training Rule
$E = \sum_{k=1}^{R} \left(y_k - \mathbf{w}^{T}\mathbf{x}_k\right)^{2}$
Gradient descent tells us we should update w thusly if we wish to minimize E:
$w_j \leftarrow w_j - \eta \,\dfrac{\partial E}{\partial w_j}$
…where…
$\dfrac{\partial E}{\partial w_j} = -2\sum_{k=1}^{R} \delta_k x_{kj}$
so
$w_j \leftarrow w_j + 2\eta \sum_{k=1}^{R} \delta_k x_{kj}$
We frequently neglect the 2 (meaning we halve the learning rate)
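A batch-mode sketch (my own, with made-up data) of this rule, with the factor of 2 folded into the learning rate as the slide suggests:

```python
import numpy as np

def train_linear_perceptron(X, Y, eta=0.005, iters=500):
    """Gradient descent on E = sum_k (y_k - w.x_k)^2.

    Update: w_j <- w_j + eta * sum_k delta_k * x_kj,
    where delta_k = y_k - w.x_k (the 2 is absorbed into eta).
    """
    R, m = X.shape
    w = np.zeros(m)
    for _ in range(iters):
        delta = Y - X @ w            # residuals, one per datapoint
        w = w + eta * (X.T @ delta)  # batch update of every weight
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.5, -2.0, 0.3]) + 0.05 * rng.normal(size=100)
print(train_linear_perceptron(X, Y))   # close to [1.5, -2.0, 0.3]
```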
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 32
algorithm
1) Randomly initialize weights w1 w2 … wm
2) Get your dataset (append 1's to the inputs if you don't want to go through the origin).
3) for i = 1 to R:   $\delta_i := y_i - \mathbf{w}^{T}\mathbf{x}_i$
4) for j = 1 to m:   $w_j \leftarrow w_j + \eta \sum_{i=1}^{R} \delta_i x_{ij}$
5) if $\sum_i \delta_i^{2}$ stops improving then stop. Else loop back to 3.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 33
$w_j \leftarrow w_j + \eta \,\delta_i x_{ij}$   where   $\delta_i = y_i - \mathbf{w}^{T}\mathbf{x}_i$
A RULE KNOWN BY MANY NAMES
The LMS Rule
The delta rule
The Widrow-Hoff rule
Classical conditioning
The adaline rule
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 34
If data is voluminous and arrives
fast
Input-output pairs (x,y) come streaming in very quickly. THEN
Don't bother remembering old ones.
Just keep using new ones.
observe (x, y)
$\delta \leftarrow y - \mathbf{w}^{T}\mathbf{x}$
$w_j \leftarrow w_j + \eta \,\delta\, x_j$
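A streaming sketch (mine, with a simulated data stream) of the online rule: each incoming pair updates w immediately and is then forgotten.

```python
import numpy as np

def lms_online(stream, m, eta=0.05):
    """Online LMS / delta rule: w_j <- w_j + eta * delta * x_j per example."""
    w = np.zeros(m)
    for x, y in stream:              # (x, y) pairs arrive one at a time
        delta = y - w @ x            # prediction error on this example only
        w = w + eta * delta * x      # update and move on; x, y are not stored
    return w

rng = np.random.default_rng(2)
true_w = np.array([0.7, -1.2])
stream = ((x, x @ true_w + 0.01 * rng.normal())
          for x in rng.normal(size=(5000, 2)))
print(lms_online(stream, m=2))       # drifts toward true_w
```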
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 35
GD Advantages (MI disadvantages):
• Biologically plausible
• With very very many attributes each iteration costs only O(mR). If
fewer than m iterations needed we’ve beaten Matrix Inversion
• More easily parallelizable (or implementable in wetware)?
GD Disadvantages (MI advantages):
• It’s moronic
• It’s essentially a slow implementation of a way to build the XTX matrix
and then solve a set of linear equations
• If m is small it’s especially outageous. If m is large then the direct
matrix inversion method gets fiddly but not impossible if you want to
be efficient.
• Hard to choose a good learning rate
• Matrix inversion takes predictable time. You can’t be sure when
gradient descent will stop.
Gradient Descent vs Matrix
Inversion for Linear Perceptrons
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 37
Gradient Descent vs Matrix Inversion for Linear Perceptrons
(Same comparison as the previous slide, with this addition:)
But we'll soon see that GD has an important extra trick up its sleeve
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 38
Perceptrons for Classification
What if all outputs are 0’s or 1’s ?
or
We can do a linear fit.
Our prediction is 0 if out(x)≤1/2
1 if out(x)>1/2
WHAT’S THE BIG PROBLEM WITH THIS???
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 39
Perceptrons for Classification
What if all outputs are 0’s or 1’s ?
or
We can do a linear fit.
Our prediction is 0 if out(x)≤½
1 if out(x)>½
WHAT’S THE BIG PROBLEM WITH THIS???
Blue = Out(x)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 40
Perceptrons for Classification
What if all outputs are 0’s or 1’s ?
or
We can do a linear fit.
Our prediction is 0 if out(x)≤½
1 if out(x)>½
Blue = Out(x)
Green =
Classification
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 41
Don't minimize   $\sum_i \left(y_i - \mathbf{w}^{T}\mathbf{x}_i\right)^{2}$
Minimize the number of misclassifications instead:   $\sum_i \left(y_i - \mathrm{Round}\!\left(\mathbf{w}^{T}\mathbf{x}_i\right)\right)^{2}$   [Assume outputs are +1 & -1, not +1 & 0]
where Round(x) = -1 if x < 0,  1 if x ≥ 0
The gradient descent rule can be changed to:
if (xi, yi) correctly classed, don't change
if wrongly predicted as 1:   w ← w - xi
if wrongly predicted as -1:  w ← w + xi
NOTE: CUTE & NON OBVIOUS WHY THIS WORKS!!
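A sketch (my own, on a made-up linearly separable dataset) of the misclassification-driven rule just described, for ±1 labels; the helper name and the appended constant input are my choices:

```python
import numpy as np

def train_classifier_perceptron(X, Y, epochs=100):
    """Classic perceptron rule for labels in {+1, -1}.

    Correctly classified points leave w alone; a point wrongly
    predicted as +1 subtracts x_i, wrongly predicted as -1 adds x_i.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            pred = 1 if w @ x >= 0 else -1
            if pred != y:
                w = w + y * x      # covers both cases on the slide
    return w

rng = np.random.default_rng(3)
X = np.hstack([rng.normal(size=(40, 2)), np.ones((40, 1))])   # constant input appended
Y = np.where(X[:, 0] + 2 * X[:, 1] - 0.5 > 0, 1, -1)           # separable labels
w = train_classifier_perceptron(X, Y)
errors = int(np.sum(np.where(X @ w >= 0, 1, -1) != Y))
print("learned w:", w, " training errors:", errors)   # usually 0 on this toy set
```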
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 42
Classification with Perceptrons II:
Sigmoid Functions
Least squares fit useless
This fit would classify much
better. But not a least
squares fit.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 43
Classification with Perceptrons II:
Sigmoid Functions
Least squares fit useless
This fit would classify much
better. But not a least
squares fit.
SOLUTION:
Instead of  Out(x) = wᵀx
We'll use  Out(x) = g(wᵀx)
where  $g\,:\, \Re \rightarrow (0, 1)$  is a squashing function
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 44
The Sigmoid
$g(h) = \dfrac{1}{1 + \exp(-h)}$
Note that if you rotate this curve through 180° centered on (0, ½) you get the same curve.
i.e. g(h) = 1 - g(-h)
Can you prove this?
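A tiny sketch (mine) of the sigmoid and a numeric check of the rotation property g(h) = 1 - g(-h):

```python
import numpy as np

def g(h):
    """The sigmoid squashing function g(h) = 1 / (1 + exp(-h))."""
    return 1.0 / (1.0 + np.exp(-h))

h = np.linspace(-5, 5, 11)
print(np.allclose(g(h), 1.0 - g(-h)))   # True: the symmetry noted on the slide
print(g(0.0))                           # 0.5, the centre of rotation
```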
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 45
The Sigmoid
Now we choose w to minimize
$\sum_{i=1}^{R} \left(y_i - \mathrm{Out}(\mathbf{x}_i)\right)^{2} = \sum_{i=1}^{R} \left(y_i - g\!\left(\mathbf{w}^{T}\mathbf{x}_i\right)\right)^{2}$
where  $g(h) = \dfrac{1}{1 + \exp(-h)}$
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 46
Linear Perceptron Classification
Regions
(Figure: points labelled 0 and 1 in the (X1, X2) plane, separated by a line.)
We'll use the model  Out(x) = g(wᵀ(x,1)) = g(w1x1 + w2x2 + w0)
Which region of the above diagram is classified with +1, and which with 0 ??
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 47
The sigmoid perceptron update rule
First, notice that  $g'(x) = g(x)\,\big(1 - g(x)\big)$,  because  $g(x) = \dfrac{1}{1 + e^{-x}}$,  so
$g'(x) = \dfrac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} = \dfrac{1}{1 + e^{-x}}\left(1 - \dfrac{1}{1 + e^{-x}}\right) = g(x)\,\big(1 - g(x)\big)$
Writing  $\mathrm{net}_i = \sum_k w_k x_{ik}$  and  $\mathrm{Out}(\mathbf{x}_i) = g(\mathrm{net}_i)$,  gradient descent on the sum-of-squared residuals gives the sigmoid perceptron update rule:
$w_j \leftarrow w_j + \eta \sum_{i=1}^{R} \delta_i \, g_i \big(1 - g_i\big)\, x_{ij}$
…where   $g_i = g\!\left(\sum_{j=1}^{m} w_j x_{ij}\right)$   and   $\delta_i = y_i - g_i$
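A sketch (my own, with made-up 0/1 data and an assumed learning rate) of training a sigmoid perceptron with this rule; the g_i(1 - g_i) factor is the g' term derived above:

```python
import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def train_sigmoid_perceptron(X, Y, eta=0.05, iters=3000):
    """Gradient descent on sum_i (y_i - g(w.x_i))^2."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        gi = g(X @ w)                        # current outputs g_i
        delta = Y - gi                       # delta_i = y_i - g_i
        grad_factor = delta * gi * (1.0 - gi)
        w = w + eta * (X.T @ grad_factor)    # w_j += eta * sum_i delta_i g_i (1-g_i) x_ij
    return w

rng = np.random.default_rng(4)
X = np.hstack([rng.normal(size=(60, 2)), np.ones((60, 1))])   # constant input appended
Y = (X[:, 0] - X[:, 1] > 0).astype(float)                     # 0/1 targets
w = train_sigmoid_perceptron(X, Y)
print("training accuracy:", np.mean((g(X @ w) > 0.5) == (Y == 1)))   # should be near 1.0
```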
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 48
Perceptrons
• Invented and popularized by Rosenblatt (1962)
• Even with sigmoid nonlinearity, correct
convergence is guaranteed
• Stable behavior for overconstrained and
underconstrained problems
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 49
Perceptrons and Boolean
Functions
If inputs are all 0's and 1's and outputs are all 0's and 1's…
• Can learn the function x1 ∧ x2
• Can learn the function x1 ∨ x2
• Can learn any conjunction of literals, e.g.  x1 ∧ ~x2 ∧ ~x3 ∧ x4 ∧ x5
QUESTION: WHY?
(Figures: the decision regions for x1 ∧ x2 and x1 ∨ x2 in the (X1, X2) plane.)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 50
Perceptrons and Boolean
Functions
• Can learn any disjunction of literals, e.g.  x1 ∨ ~x2 ∨ ~x3 ∨ x4 ∨ x5
• Can learn the majority function
  f(x1, x2 … xn) = 1 if n/2 xi's or more are = 1,
                   0 if fewer than n/2 xi's are = 1
• What about the exclusive or function?
  f(x1, x2) = x1 ⊕ x2 = (x1 ∧ ~x2) ∨ (~x1 ∧ x2)
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 51
Multilayer Networks
The class of functions representable by perceptrons is limited:
$\mathrm{Out}(\mathbf{x}) = g\!\left(\mathbf{w}^{T}\mathbf{x}\right) = g\!\left(\sum_j w_j x_j\right)$
Use a wider representation!
$\mathrm{Out}(\mathbf{x}) = g\!\left(\sum_j W_j \; g\!\left(\sum_k w_{jk} x_k\right)\right)$
This is a nonlinear function of a linear combination of nonlinear functions of linear combinations of inputs.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 52
A 1-HIDDEN LAYER NET
NINPUTS = 2   NHIDDEN = 3
$\mathrm{Out} = g\!\left(\sum_{k=1}^{N_{HID}} W_k v_k\right)$
$v_1 = g\!\left(\sum_{k=1}^{N_{INS}} w_{1k} x_k\right), \quad v_2 = g\!\left(\sum_{k=1}^{N_{INS}} w_{2k} x_k\right), \quad v_3 = g\!\left(\sum_{k=1}^{N_{INS}} w_{3k} x_k\right)$
(Figure: inputs x1, x2 feed the three hidden units through weights w11 … w32; the hidden units feed the output through weights w1, w2, w3.)
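A forward-pass sketch (mine) of the 2-input, 3-hidden-unit net above; the weight values are arbitrary examples chosen just to run the computation:

```python
import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def one_hidden_layer_out(x, w_hidden, W_out):
    """Out(x) = g( sum_k W_k * v_k ), with v_j = g( sum_k w_jk * x_k )."""
    v = g(w_hidden @ x)      # hidden-unit activations v_1..v_NHID
    return g(W_out @ v)      # single sigmoid output

# Example weights (arbitrary): row j holds the weights w_j1, w_j2 of hidden unit j.
w_hidden = np.array([[ 0.5, -1.0],
                     [ 1.5,  0.3],
                     [-0.7,  2.0]])
W_out = np.array([1.0, -2.0, 0.5])   # W_1, W_2, W_3

print(one_hidden_layer_out(np.array([1.0, 0.5]), w_hidden, W_out))
```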
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 53
OTHER NEURAL NETS
2-Hidden layers + Constant Term
(Figure: a net with two hidden layers, a constant input 1, and inputs x1, x2, x3.)
"JUMP" CONNECTIONS
(Figure: inputs x1, x2 also connect directly to the output.)
$\mathrm{Out} = g\!\left(\sum_{k=0}^{N_{INS}} w_k x_k + \sum_{k=1}^{N_{HID}} W_k v_k\right)$
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 54
Backpropagation
Find a set of weights  $\{W_j\},\ \{w_{jk}\}$  to minimize
$\sum_i \left(y_i - \mathrm{Out}(\mathbf{x}_i)\right)^{2}$
by gradient descent, where
$\mathrm{Out}(\mathbf{x}) = g\!\left(\sum_j W_j \, g\!\left(\sum_k w_{jk} x_k\right)\right)$
That's it!
That's the backpropagation algorithm.
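A compact sketch (my own, not code from the slides) of gradient descent on that sum of squares for the 1-hidden-layer net, with the gradients obtained by the chain rule. The XOR-style demo data, helper names, learning rate, and iteration count are all assumptions; depending on the random start the net may or may not nail XOR exactly, but the sum-of-squares error should fall well below the roughly 1.0 an untrained net gives on this data.

```python
import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def net_out(X, w, W):
    """Out(x) = g( sum_j W_j * g( sum_k w_jk * x_k ) )."""
    return g(g(X @ w.T) @ W)

def train_backprop(X, Y, n_hidden=3, eta=0.5, iters=20000, seed=0):
    """Gradient descent on E = sum_i (y_i - Out(x_i))^2, gradients via the chain rule."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.5, size=(n_hidden, X.shape[1]))  # input -> hidden weights w_jk
    W = rng.normal(scale=0.5, size=n_hidden)                # hidden -> output weights W_j
    for _ in range(iters):
        V = g(X @ w.T)                              # hidden activations, shape (R, n_hidden)
        out = g(V @ W)                              # network outputs, shape (R,)
        d_out = (Y - out) * out * (1.0 - out)       # output-layer "delta"
        d_hid = np.outer(d_out, W) * V * (1.0 - V)  # hidden-layer deltas
        W = W + eta * (V.T @ d_out)                 # the factor of 2 is folded into eta
        w = w + eta * (d_hid.T @ X)
    return w, W

# XOR-like data, with a constant 1 appended as a third input (a bias for the hidden units).
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
Y = np.array([0.0, 1.0, 1.0, 0.0])

w, W = train_backprop(X, Y)
print("sum-of-squares error:", np.sum((Y - net_out(X, w, W)) ** 2))
print("outputs:", np.round(net_out(X, w, W), 2))
```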
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 55
Backpropagation Convergence
Convergence to a global minimum is not
guaranteed.
•In practice, this is not a problem, apparently.
Tweaking to find the right number of hidden
units, or a useful learning rate η, is more
hassle, apparently.
IMPLEMENTING BACKPROP: Differentiate the monster sum-square residual → write down the Gradient Descent Rule → it turns out to be easier & computationally efficient to use lots of local variables with names like hj, ok, vj, neti etc…
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 56
Choosing the learning rate
• This is a subtle art.
• Too small: can take days instead of
minutes to converge
• Too large: diverges (MSE gets larger and
larger while the weights increase and
usually oscillate)
• Sometimes the “just right” value is hard to
find.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 57
Learning-rate problems
From J. Hertz, A. Krogh, and R.
G. Palmer. Introduction to the
Theory of Neural Computation.
Addison-Wesley, 1994.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 58
Improving Simple Gradient
Descent
Momentum
Don't just change weights according to the current datapoint.
Re-use changes from earlier iterations.
Let ∆w(t) = weight changes at time t.
Let  $-\eta \dfrac{\partial E}{\partial \mathbf{w}}$  be the change we would make with regular gradient descent.
Instead we use
$\Delta \mathbf{w}(t+1) = -\eta \dfrac{\partial E}{\partial \mathbf{w}} + \alpha\, \Delta \mathbf{w}(t)$    (α is the momentum parameter)
$\mathbf{w}(t+1) = \mathbf{w}(t) + \Delta \mathbf{w}(t+1)$
Momentum damps oscillations.
A hack? Well, maybe.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 59
Momentum illustration
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 60
Improving Simple Gradient
Descent
Newton’s method
)
|
(|
2
1
)
(
)
( 3
2
2
h
h
w
h
w
h
w
h
w O
E
E
E
E T
T









If we neglect the O(h3
) terms, this is a quadratic form
Quadratic form fun facts:
If y = c + bT
x - 1/2 xT
A x
And if A is SPD
Then
xopt
= A-1
b is the value of x that maximizes y
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 61
Improving Simple Gradient
Descent
Newton’s method
)
|
(|
2
1
)
(
)
( 3
2
2
h
h
w
h
w
h
w
h
w O
E
E
E
E T
T









If we neglect the O(h3
) terms, this is a quadratic form
w
w
w
w













E
E
1
2
2
This should send us directly to the global minimum if
the function is truly quadratic.
And it might get us close if it’s locally quadraticish
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 62
Improving Simple Gradient
Descent
Newton’s method
)
|
(|
2
1
)
(
)
( 3
2
2
h
h
w
h
w
h
w
h
w O
E
E
E
E T
T









If we neglect the O(h3
) terms, this is a quadratic form
w
w
w
w













E
E
1
2
2
This should send us directly to the global minimum if
the function is truly quadratic.
And it might get us close if it’s locally quadraticish
BUT (and it’s a big but)…
That second derivative matrix can be
expensive and fiddly to compute.
If we’re not already in the quadratic bowl,
we’ll go nuts.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 63
Improving Simple Gradient
Descent
Conjugate Gradient
Another method which attempts to exploit the "local quadratic bowl" assumption
But does so while only needing to use  $\dfrac{\partial E}{\partial \mathbf{w}}$  and not  $\dfrac{\partial^{2} E}{\partial \mathbf{w}^{2}}$
It is also more stable than Newton's method if the local quadratic bowl assumption is violated.
It's complicated, outside our scope, but it often works well. More details in Numerical Recipes in C.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 64
BEST GENERALIZATION
Intuitively, you want to use the smallest,
simplest net that seems to fit the data.
HOW TO FORMALIZE THIS INTUITION?
1. Don’t. Just use intuition
2. Bayesian Methods Get it Right
3. Statistical Analysis explains what’s going on
4. Cross-validation
Discussed in the next
lecture
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 65
What You Should Know
• How to implement multivariate Least-
squares linear regression.
• Derivation of least squares as max.
likelihood estimator of linear coefficients
• The general gradient descent rule
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 66
What You Should Know
• Perceptrons
  - Linear output, least squares
  - Sigmoid output, least squares
• Multilayer nets
  - The idea behind back prop
  - Awareness of better minimization methods
• Generalization. What it means.
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 67
APPLICATIONS
To Discuss:
• What can non-linear regression be useful for?
• What can neural nets (used as non-linear
regressors) be useful for?
• What are the advantages of N. Nets for
nonlinear regression?
• What are the disadvantages?
Copyright © 2001, 2003, Andrew W. Moore Neural Networks: Slide 68
Other Uses of Neural Nets…
• Time series with recurrent nets
• Unsupervised learning (clustering
principal components and non-linear
versions thereof)
• Combinatorial optimization with Hopfield
nets, Boltzmann Machines
• Evaluation function learning (in
reinforcement learning)