5. Artificial Neuron
(Figure: an artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, an aggregation σ, and output y)
• The most fundamental unit of a deep neural network is called an artificial neuron
• Why is it called a neuron? Where does the inspiration come from?
• The inspiration comes from biology (more specifically, from the brain)
• biological neurons = neural cells = neural processing units
• We will first see what a biological neuron looks like ...
6. Biological Neurons∗
• dendrite: receives signals from other neurons
• synapse: point of connection to other neurons
• soma: processes the information
• axon: transmits the output of this neuron
∗ Image adapted from https://cdn.vectorstock.com/i/composite/12,25/neuron-cell-vector-81225.jpg
13. • Of course, in reality, it is not just a single neuron which does all this
• There is a massively parallel interconnected network of neurons
• The sense organs relay information to the lowest layer of neurons
• Some of these neurons may fire (in red) in response to this information and in turn relay information to other neurons they are connected to
• These neurons may also fire (again, in red) and the process continues, eventually resulting in a response (laughter in this case)
• An average human brain has around $10^{11}$ (100 billion) neurons!
20. A simplified illustration
• This massively parallel network also ensures that there is division of work
• Each neuron may perform a certain role or respond to a certain stimulus
21. • The neurons in the brain are arranged in a hierarchy
22. Sample illustration of hierarchical processing∗
∗ Idea borrowed from Hugo Larochelle’s lecture slides
23. Disclaimer
• I understand very little about how the brain works!
• What you saw so far is an overly simplified explanation of how the brain works!
• But this explanation suffices for the purpose of this course!
36. (Figure: an MP neuron with binary inputs x1, x2, ..., xn ∈ {0, 1}, an aggregation g, a decision function f, and output y ∈ {0, 1})
• McCulloch (neuroscientist) and Pitts (logician) proposed a highly simplified computational model of the neuron (1943)
• g aggregates the inputs and the function f takes a decision based on this aggregation
• The inputs can be excitatory or inhibitory
• y = 0 if any xi is inhibitory, else
$$g(x_1, x_2, ..., x_n) = g(\mathbf{x}) = \sum_{i=1}^{n} x_i$$
$$y = f(g(\mathbf{x})) = \begin{cases} 1 & \text{if } g(\mathbf{x}) \geq \theta \\ 0 & \text{if } g(\mathbf{x}) < \theta \end{cases}$$
• θ is called the thresholding parameter
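A minimal sketch of this MP neuron in Python (the function name and structure are ours, not from the slides), assuming binary inputs, a hand-chosen threshold θ, and an optional list of inhibitory input indices:

```python
def mp_neuron(inputs, theta, inhibitory=None):
    """McCulloch Pitts unit: g sums the binary inputs, f thresholds the sum at theta."""
    inhibitory = inhibitory or []
    # y = 0 if any inhibitory input is active
    if any(inputs[i] == 1 for i in inhibitory):
        return 0
    g = sum(inputs)                    # g(x) = sum_i x_i
    return 1 if g >= theta else 0      # f(g(x)) = 1 if g(x) >= theta, else 0

# Example: three binary inputs with threshold theta = 2
print(mp_neuron([1, 1, 0], theta=2))   # 1
print(mp_neuron([1, 0, 0], theta=2))   # 0
```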
37. Let us implement some boolean functions using this McCulloch Pitts (MP) neuron ...
48. A McCulloch Pitts unit has inputs x1, x2, x3 ∈ {0, 1}, a threshold θ, and an output y ∈ {0, 1}. By choosing θ (and marking some inputs as inhibitory) it can implement several boolean functions:
• AND function: θ = 3 (all three inputs must be 1)
• OR function: θ = 1 (at least one input must be 1)
• x1 AND !x2∗: θ = 1, with x2 as an inhibitory input
• NOR function: θ = 0, with both x1 and x2 as inhibitory inputs
• NOT function: θ = 0, with x1 as an inhibitory input
∗ A circle at the end of an input edge indicates an inhibitory input: if any inhibitory input is 1, the output will be 0
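The thresholds above can be checked with a small self-contained sketch (ours, not from the slides); inhibitory inputs are passed by index:

```python
def mp_unit(xs, theta, inhibitory=()):
    # Output 0 if any inhibitory input fires, else threshold the sum of inputs at theta.
    if any(xs[i] for i in inhibitory):
        return 0
    return 1 if sum(xs) >= theta else 0

# Thresholds as on the slide: AND needs all three inputs, OR needs at least one.
for x in [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]:
    print(x, "AND:", mp_unit(x, theta=3), "OR:", mp_unit(x, theta=1))

# x1 AND !x2 (x2 inhibitory, theta = 1) and NOR (both inputs inhibitory, theta = 0)
for x in [(a, b) for a in (0, 1) for b in (0, 1)]:
    print(x, "x1 AND !x2:", mp_unit(x, theta=1, inhibitory=(1,)),
          "NOR:", mp_unit(x, theta=0, inhibitory=(0, 1)))

# NOT (single inhibitory input, theta = 0)
print("NOT 0:", mp_unit((0,), theta=0, inhibitory=(0,)))   # 1
print("NOT 1:", mp_unit((1,), theta=0, inhibitory=(0,)))   # 0
```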
50. • Can any boolean function be represented using a McCulloch Pitts unit?
• Before answering this question let us first see the geometric interpretation of an MP unit ...
57. OR function (θ = 1): $x_1 + x_2 = \sum_{i=1}^{2} x_i \geq 1$
(Figure: the four input points (0, 0), (0, 1), (1, 0), (1, 1) in the $x_1$–$x_2$ plane with the line $x_1 + x_2 = \theta = 1$)
• A single MP neuron splits the input points (4 points for 2 binary inputs) into two halves: points lying on or above the line $\sum_{i=1}^{n} x_i - \theta = 0$ and points lying below this line
• In other words, all inputs which produce an output 0 will be on one side ($\sum_{i=1}^{n} x_i < \theta$) of the line and all inputs which produce an output 1 will lie on the other side ($\sum_{i=1}^{n} x_i \geq \theta$) of this line
• Let us convince ourselves about this with a few more examples (if it is not already clear from the math)
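To see this numerically for the OR unit (θ = 1), a short check of which side of the line $x_1 + x_2 - \theta = 0$ each of the four input points falls on:

```python
theta = 1   # threshold of the OR unit
for x1 in (0, 1):
    for x2 in (0, 1):
        side = x1 + x2 - theta           # >= 0: on or above the line, < 0: below it
        y = 1 if side >= 0 else 0        # output of the MP unit
        print(f"({x1}, {x2}): x1 + x2 - theta = {side:+d} -> y = {y}")
# Only (0, 0) lies below the line (output 0); the other three points lie on or above it (output 1).
```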
64. AND function (θ = 2): $x_1 + x_2 = \sum_{i=1}^{2} x_i \geq 2$
(Figure: the line $x_1 + x_2 = \theta = 2$ separates (1, 1) from the other three input points)
Tautology, always ON (θ = 0):
(Figure: the line $x_1 + x_2 = \theta = 0$ leaves all four input points on or above it)
69. OR function with three inputs (θ = 1)
(Figure: the eight input points from (0, 0, 0) to (1, 1, 1) in $x_1$–$x_2$–$x_3$ space, with the plane $x_1 + x_2 + x_3 = \theta = 1$)
• What if we have more than 2 inputs?
• Well, instead of a line we will have a plane
• For the OR function, we want a plane such that the point (0, 0, 0) lies on one side and the remaining 7 points lie on the other side of the plane
71. The story so far ...
• A single McCulloch Pitts neuron can be used to represent boolean functions which are linearly separable
• Linear separability (for boolean functions): there exists a line (plane) such that all inputs which produce a 1 lie on one side of the line (plane) and all inputs which produce a 0 lie on the other side of the line (plane)
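One way to probe the question asked two slides back empirically: a brute-force sketch (ours) that tries every threshold θ ∈ {0, ..., n} for a plain MP unit (no inhibitory inputs) and reports whether some θ reproduces a given truth table. XOR appears below only as a standard example of a boolean function that is not linearly separable.

```python
from itertools import product

def mp_can_represent(target, n):
    """target maps each n-bit input tuple to 0/1; try every threshold theta in 0..n."""
    for theta in range(n + 1):
        if all((1 if sum(x) >= theta else 0) == target[x]
               for x in product((0, 1), repeat=n)):
            return theta
    return None   # no single threshold works

OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
print("OR :", mp_can_represent(OR, 2))    # 1  (linearly separable)
print("XOR:", mp_can_represent(XOR, 2))   # None (not linearly separable)
```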
76. The story ahead ...
• What about non-boolean (say, real) inputs?
• Do we always need to hand code the threshold?
• Are all inputs equal? What if we want to assign more weight (importance) to some inputs?
• What about functions which are not linearly separable?
82. (Figure: a perceptron with inputs x1, x2, ..., xn, weights w1, w2, ..., wn, and output y)
• Frank Rosenblatt, an American psychologist, proposed the classical perceptron model (1958)
• A more general computational model than McCulloch–Pitts neurons
• Main differences: introduction of numerical weights for inputs and a mechanism for learning these weights
• Inputs are no longer limited to boolean values
• Refined and carefully analyzed by Minsky and Papert (1969) - their model is referred to as the perceptron model here
93. (Figure: a perceptron with inputs x1, x2, ..., xn, weights w1, w2, ..., wn, and an additional input x0 = 1 with weight w0 = −θ)
$$y = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i * x_i \geq \theta \\ 0 & \text{if } \sum_{i=1}^{n} w_i * x_i < \theta \end{cases}$$
Rewriting the above,
$$y = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i * x_i - \theta \geq 0 \\ 0 & \text{if } \sum_{i=1}^{n} w_i * x_i - \theta < 0 \end{cases}$$
A more accepted convention,
$$y = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i * x_i \geq 0 \\ 0 & \text{if } \sum_{i=0}^{n} w_i * x_i < 0 \end{cases}$$
where $x_0 = 1$ and $w_0 = -\theta$
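A minimal sketch of this convention in Python (the weight values are placeholders of our choosing): x0 = 1 is prepended to the input and w0 = −θ to the weights, so the decision reduces to checking the sign of a single dot product.

```python
def perceptron_output(x, w):
    """y = 1 if sum_{i=0}^{n} w_i * x_i >= 0 else 0, with x[0] = 1 and w[0] = -theta."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s >= 0 else 0

theta = 1.0
w = [-theta, 1.1, 1.1]           # w0 = -theta, then w1, w2 (placeholder values)
x = [1, 0, 1]                    # x0 = 1, then the actual inputs x1, x2
print(perceptron_output(x, w))   # 1, since -1 + 1.1*0 + 1.1*1 >= 0
```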
94. We will now try to answer the following questions:
• Why are we trying to implement boolean functions?
• Why do we need weights?
• Why is w0 = −θ called the bias?
98. x1 = isActorDamon, x2 = isGenreThriller, x3 = isDirectorNolan
• Consider the task of predicting whether we would like a movie or not
• Suppose we base our decision on 3 inputs (binary, for simplicity)
• Based on our past viewing experience (data), we may give a high weight to isDirectorNolan as compared to the other inputs
• Specifically, even if the actor is not Matt Damon and the genre is not thriller, we would still want to cross the threshold θ by assigning a high weight to isDirectorNolan
102. • w0 is called the bias as it represents the prior (prejudice)
• A movie buff may have a very low threshold and may watch any movie irrespective of the genre, actor, or director [θ = 0]
• On the other hand, a selective viewer may only watch thrillers starring Matt Damon and directed by Nolan [θ = 3]
• The weights (w1, w2, ..., wn) and the bias (w0) will depend on the data (viewer history in this case)
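A sketch of the movie example with made-up numbers (the weights below are ours, chosen only to illustrate a high weight on isDirectorNolan and the two viewer thresholds):

```python
def will_watch(x, w, theta):
    """x = (isActorDamon, isGenreThriller, isDirectorNolan); watch if the weighted sum reaches theta."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

nolan_only = (0, 0, 1)   # not Damon, not a thriller, but directed by Nolan

# A high (made-up) weight on isDirectorNolan lets this input alone cross a threshold of 1.
print(will_watch(nolan_only, w=(0.2, 0.3, 1.5), theta=1))   # 1

# Movie buff (theta = 0) vs. selective viewer (unit weights, theta = 3).
print(will_watch(nolan_only, w=(1, 1, 1), theta=0))   # 1: watches anything
print(will_watch(nolan_only, w=(1, 1, 1), theta=3))   # 0: not a Damon thriller by Nolan
print(will_watch((1, 1, 1), w=(1, 1, 1), theta=3))    # 1: all three conditions hold
```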
103. What kind of functions can be implemented using the perceptron? Any difference from McCulloch Pitts neurons?
110. McCulloch Pitts Neuron (assuming no inhibitory inputs):
$$y = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} x_i \geq 0 \\ 0 & \text{if } \sum_{i=0}^{n} x_i < 0 \end{cases}$$
Perceptron:
$$y = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i * x_i \geq 0 \\ 0 & \text{if } \sum_{i=0}^{n} w_i * x_i < 0 \end{cases}$$
• From the equations it should be clear that even a perceptron separates the input space into two halves
• All inputs which produce a 1 lie on one side and all inputs which produce a 0 lie on the other side
• In other words, a single perceptron can only be used to implement linearly separable functions
• Then what is the difference? The weights (including threshold) can be learned and the inputs can be real valued
• We will first revisit some boolean functions and then see the perceptron learning algorithm (for learning weights)
127. For the OR function, the weights must satisfy:
x1 = 0, x2 = 0, OR = 0: $w_0 + \sum_{i=1}^{2} w_i x_i < 0$
x1 = 1, x2 = 0, OR = 1: $w_0 + \sum_{i=1}^{2} w_i x_i \geq 0$
x1 = 0, x2 = 1, OR = 1: $w_0 + \sum_{i=1}^{2} w_i x_i \geq 0$
x1 = 1, x2 = 1, OR = 1: $w_0 + \sum_{i=1}^{2} w_i x_i \geq 0$
That is,
$w_0 + w_1 \cdot 0 + w_2 \cdot 0 < 0 \implies w_0 < 0$
$w_0 + w_1 \cdot 0 + w_2 \cdot 1 \geq 0 \implies w_2 \geq -w_0$
$w_0 + w_1 \cdot 1 + w_2 \cdot 0 \geq 0 \implies w_1 \geq -w_0$
$w_0 + w_1 \cdot 1 + w_2 \cdot 1 \geq 0 \implies w_1 + w_2 \geq -w_0$
• One possible solution to this set of inequalities is w0 = −1, w1 = 1.1, w2 = 1.1 (and various other solutions are possible)
(Figure: the four input points with the line $-1 + 1.1 x_1 + 1.1 x_2 = 0$)
• Note that we can come up with a similar set of inequalities and find the value of θ for a McCulloch Pitts neuron also (Try it!)
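As a sanity check, plugging the suggested solution back into the decision rule confirms that w0 = −1, w1 = 1.1, w2 = 1.1 reproduces the OR truth table (and hence satisfies all four inequalities):

```python
w0, w1, w2 = -1.0, 1.1, 1.1                       # the solution suggested above
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}

for (x1, x2), target in OR.items():
    s = w0 + w1 * x1 + w2 * x2
    y = 1 if s >= 0 else 0
    print(f"({x1}, {x2}): w0 + w1*x1 + w2*x2 = {s:+.2f} -> y = {y} (target {target})")
# All four rows match the OR column.
```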
137. • Let us fix the threshold (−w0 = 1) and try different values of w1, w2
• Say, w1 = −1, w2 = −1
• What is wrong with this line? We make errors on 3 out of the 4 inputs
• Let's try some more values of w1, w2 and note how many errors we make
w1     w2     errors
−1     −1     3
1.5    0      1
0.45   0.45   3
• We are interested in those values of w0, w1, w2 which result in 0 error
• Let us plot the error surface corresponding to different values of w0, w1, w2
(Figure: the four input points with the candidate lines $-1 + 1.1x_1 + 1.1x_2 = 0$, $-1 + (-1)x_1 + (-1)x_2 = 0$, $-1 + (1.5)x_1 + (0)x_2 = 0$, and $-1 + (0.45)x_1 + (0.45)x_2 = 0$)
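The error counts in the table can be reproduced with a few lines (a sketch with w0 fixed at −1, as on the slide):

```python
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
w0 = -1.0   # threshold fixed at -w0 = 1

def count_errors(w1, w2):
    """Number of OR inputs misclassified by the line w0 + w1*x1 + w2*x2 = 0."""
    errors = 0
    for (x1, x2), target in OR.items():
        y = 1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0
        errors += (y != target)
    return errors

for w1, w2 in [(-1, -1), (1.5, 0), (0.45, 0.45), (1.1, 1.1)]:
    print(f"w1={w1}, w2={w2}: {count_errors(w1, w2)} errors")
# Matches the table: 3, 1, 3 errors, and 0 errors for the earlier solution (1.1, 1.1).
```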
142. • For ease of analysis, we will keep w0 fixed (−1) and plot the error for different values of w1, w2
• For a given w0, w1, w2 we will compute $w_0 + w_1 * x_1 + w_2 * x_2$ for all combinations of (x1, x2) and note down how many errors we make
• For the OR function, an error occurs if (x1, x2) = (0, 0) but $w_0 + w_1 * x_1 + w_2 * x_2 \geq 0$, or if (x1, x2) ≠ (0, 0) but $w_0 + w_1 * x_1 + w_2 * x_2 < 0$
• We are interested in finding an algorithm which finds the values of w0, w1, w2 which result in 0 error
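A sketch of the error-surface computation just described: w0 stays fixed at −1 and the error is evaluated on a grid of (w1, w2) values (the grid range is our choice):

```python
import numpy as np

OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
w0 = -1.0                                 # w0 kept fixed, as on the slide

w1_grid = np.linspace(-2, 2, 41)          # grid range chosen arbitrarily for illustration
w2_grid = np.linspace(-2, 2, 41)
errors = np.zeros((len(w1_grid), len(w2_grid)), dtype=int)

for i, w1 in enumerate(w1_grid):
    for j, w2 in enumerate(w2_grid):
        for (x1, x2), target in OR.items():
            y = 1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0
            errors[i, j] += (y != target)

# Zero error exactly where w1 >= -w0 = 1 and w2 >= -w0 = 1 (cf. the inequalities earlier).
print("minimum error on the grid:", errors.min())
print("number of zero-error grid points:", int((errors == 0).sum()))
```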
146. x1 = isActorDamon, x2 = isGenreThriller, x3 = isDirectorNolan, x4 = imdbRating (scaled to 0 to 1), ...
• Let us reconsider our problem of deciding whether to watch a movie or not
• Suppose we are given a list of m movies and a label (class) associated with each movie indicating whether the user liked this movie or not: a binary decision
• Further, suppose we represent each movie with n features (some boolean, some real valued)
• We will assume that the data is linearly separable and we want a perceptron to learn how to make this decision
• In other words, we want the perceptron to find the equation of this separating plane (or find the values of w0, w1, w2, ..., wn)