1. Machine Learning:
k-Nearest Neighbor and
Support Vector Machines
skim 20.4, 20.6-20.7
CMSC 471
2. Revised End-of-Semester Schedule
Wed 11/21  Machine Learning IV
Mon 11/26  Philosophy of AI (You must read the three articles!)
Wed 11/28  Special Topics
Mon 12/3   Special Topics
Wed 12/5   Review / Tournament dry run #2 (HW6 due)
Mon 12/10  Tournament
Wed 12/19  FINAL EXAM (1:00pm - 3:00pm) (Project and final report due)
NO LATE SUBMISSIONS ALLOWED!
Special Topics
Robotics
AI in Games
Natural language processing
Multi-agent systems
3. k-Nearest Neighbor
Instance-Based Learning
Some material adapted from slides by Andrew Moore, CMU.
Visit http://www.autonlab.org/tutorials/ for
Andrew’s repository of Data Mining tutorials.
4. 1-Nearest Neighbor
One of the simplest of all machine learning classifiers
Simple idea: label a new point the same as the closest known point
Label it red.
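To make the idea concrete, here is a minimal 1-NN sketch in Python; the training points and labels are hypothetical, and plain Euclidean distance is assumed:

```python
import math

# Hypothetical labeled training points: (x, y) coordinates and a color label
training = [((1.0, 2.0), "red"), ((2.0, 1.5), "red"), ((5.0, 6.0), "blue")]

def euclidean(a, b):
    """Standard Euclidean distance between two points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def one_nn(query):
    """Label the query point the same as its single closest known point."""
    nearest_point, nearest_label = min(training, key=lambda pl: euclidean(pl[0], query))
    return nearest_label

print(one_nn((1.5, 1.8)))  # -> "red", since the closest training point is red
```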
5. 1-Nearest Neighbor
A type of instance-based learning
Also known as “memory-based” learning
Forms a Voronoi tessellation of the instance space
6. Distance Metrics
Different metrics can change the decision surface
Standard Euclidean distance metric:
Two-dimensional: Dist(a,b) = sqrt((a1 - b1)^2 + (a2 - b2)^2)
Multivariate: Dist(a,b) = sqrt(∑i (ai - bi)^2)
Dist(a,b) = (a1 - b1)^2 + (a2 - b2)^2
Dist(a,b) = (a1 - b1)^2 + (3a2 - 3b2)^2
Adapted from “Instance-Based Learning” lecture slides by Andrew Moore, CMU.
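A small sketch of how rescaling one dimension (as in the second formula above) can change which stored point is nearest; the query and points are made up for illustration:

```python
def dist(a, b, weights=(1.0, 1.0)):
    """Weighted squared Euclidean distance; weights rescale each dimension."""
    return sum((w * (ai - bi)) ** 2 for w, ai, bi in zip(weights, a, b))

query = (0.0, 0.0)
p, q = (2.0, 0.0), (0.0, 1.5)

# Plain Euclidean metric: q is closer to the query than p.
print(dist(query, p) > dist(query, q))                   # True
# Scaling the second dimension by 3 flips which point is nearest:
print(dist(query, p, (1, 3)) > dist(query, q, (1, 3)))   # False
```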
7. Four Aspects of an Instance-Based Learner:
1. A distance metric
2. How many nearby neighbors to look at?
3. A weighting function (optional)
4. How to fit with the local points?
Adapted from “Instance-Based Learning” lecture slides by Andrew Moore, CMU.
8. 1-NN’s Four Aspects as an Instance-Based Learner:
1. A distance metric
Euclidean
2. How many nearby neighbors to look at?
One
3. A weighting function (optional)
Unused
4. How to fit with the local points?
Just predict the same output as the nearest neighbor.
Adapted from “Instance-Based Learning” lecture slides by Andrew Moore, CMU.
9. Zen Gardens
Mystery of renowned zen garden revealed [CNN Article]
Thursday, September 26, 2002 Posted: 10:11 AM EDT (1411 GMT)
LONDON (Reuters) -- For centuries visitors to the renowned Ryoanji Temple garden in
Kyoto, Japan have been entranced and mystified by the simple arrangement of rocks.
The five sparse clusters on a rectangle of raked gravel are said to be pleasing to the eyes
of the hundreds of thousands of tourists who visit the garden each year.
Scientists in Japan said on Wednesday they now believe they have discovered its
mysterious appeal.
"We have uncovered the implicit structure of the Ryoanji garden's visual ground and
have shown that it includes an abstract, minimalist depiction of natural scenery," said
Gert Van Tonder of Kyoto University.
The researchers discovered that the empty space of the garden evokes a hidden image of a
branching tree that is sensed by the unconscious mind.
"We believe that the unconscious perception of this pattern contributes to the enigmatic
appeal of the garden," Van Tonder added.
He and his colleagues believe that whoever created the garden during the Muromachi era
between 1333-1573 knew exactly what they were doing and placed the rocks around the
tree image.
By using a concept called medial-axis transformation, the scientists showed that the
hidden branched tree converges on the main area from which the garden is viewed.
The trunk leads to the prime viewing site in the ancient temple that once overlooked the
garden. It is thought that abstract art may have a similar impact.
"There is a growing realisation that scientific analysis can reveal unexpected structural
features hidden in controversial abstract paintings," Van Tonder said.
Adapted from “Instance-Based Learning” lecture slides by Andrew Moore, CMU.
10. k-Nearest Neighbor
Generalizes 1-NN to smooth away noise in the labels
A new point is now assigned the most frequent label of its k nearest neighbors
Label it red, when k = 3
Label it blue, when k = 7
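A minimal k-NN sketch under the same assumptions as the earlier 1-NN example (hypothetical points, Euclidean distance), showing how the majority vote can flip the label as k grows:

```python
import math
from collections import Counter

# Hypothetical labeled points; the query's label changes as k grows.
training = [((1, 1), "red"), ((2, 2), "red"), ((1, 2), "red"),
            ((4, 4), "blue"), ((5, 4), "blue"), ((4, 5), "blue"), ((5, 5), "blue")]

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn(query, k):
    """Assign the most frequent label among the k nearest training points."""
    neighbors = sorted(training, key=lambda pl: euclidean(pl[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(knn((2, 2), k=3))  # "red": the 3 closest points are all red
print(knn((2, 2), k=7))  # "blue": with all 7 points voting, blue wins 4 to 3
```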
11. k-Nearest Neighbor (k = 9)
A magnificent job of noise smoothing. Three cheers for 9-nearest-neighbor.
But the lack of gradients and the jerkiness isn’t good. Appalling behavior! Loses all the detail that 1-nearest neighbor would give. The tails are horrible!
Fits much less of the noise, captures trends. But still, frankly, pathetic compared with linear regression.
Adapted from “Instance-Based Learning” lecture slides by Andrew Moore, CMU.
12. Support Vector Machines and Kernels
Adapted from slides by Tim Oates
Cognition, Robotics, and Learning (CORAL) Lab
University of Maryland Baltimore County
Doing Really Well with Linear Decision Surfaces
13. Outline
Prediction
Why might predictions be wrong?
Support vector machines
Doing really well with linear models
Kernels
Making the non-linear linear
14. Supervised ML = Prediction
Given training instances (x, y)
Learn a model f
Such that f(x) = y
Use f to predict y for new x
Many variations on this basic theme
15. Why might predictions be wrong?
True Non-Determinism
Flip a biased coin
p(heads) = θ
Estimate θ
If θ > 0.5 predict heads, else tails
Lots of ML research on problems like this
Learn a model
Do the best you can in expectation
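As a tiny illustration of "do the best you can in expectation": estimating θ from simulated flips of a hypothetical coin (bias 0.7 chosen only for this sketch) and predicting the majority outcome:

```python
import random

random.seed(0)
true_theta = 0.7                          # hypothetical bias of the coin
flips = [random.random() < true_theta for _ in range(1000)]

theta_hat = sum(flips) / len(flips)       # estimate p(heads) from the sample
prediction = "heads" if theta_hat > 0.5 else "tails"
print(theta_hat, prediction)              # estimate near 0.7 -> predict "heads"
```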
16. Why might predictions be wrong?
Partial Observability
Something needed to predict y is missing from observation x
N-bit parity problem
x contains N-1 bits (hard PO)
x contains N bits but learner ignores some of them (soft PO)
20. Strengths of SVMs
Good generalization in theory
Good generalization in practice
Work well with few training instances
Find globally best model
Efficient algorithms
Amenable to the kernel trick
21. Linear Separators
Training instances
x ∈ ℝ^n
y ∈ {-1, 1}
w ∈ ℝ^n
b ∈ ℝ
Hyperplane
<w, x> + b = 0
w1x1 + w2x2 + … + wnxn + b = 0
Decision function
f(x) = sign(<w, x> + b)
Math Review
Inner (dot) product:
<a, b> = a · b = ∑i ai bi = a1b1 + a2b2 + … + anbn
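A minimal sketch of the decision function, assuming a hypothetical weight vector and bias (not values produced by an actual SVM):

```python
# Hypothetical hyperplane parameters; a trained SVM would supply these.
w = [2.0, -1.0]
b = -0.5

def dot(a, b):
    """Inner product <a, b> = sum_i a_i * b_i."""
    return sum(ai * bi for ai, bi in zip(a, b))

def f(x):
    """Decision function f(x) = sign(<w, x> + b), returning a label in {-1, 1}."""
    return 1 if dot(w, x) + b >= 0 else -1

print(f([1.0, 0.5]))   # 2*1 - 0.5 - 0.5 = 1.0 >= 0 -> +1
print(f([0.0, 1.0]))   # -1 - 0.5 = -1.5 < 0     -> -1
```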
33. The Math
Training instances
x ∈ ℝ^n
y ∈ {-1, 1}
Decision function
f(x) = sign(<w, x> + b)
w ∈ ℝ^n
b ∈ ℝ
Find w and b that
Perfectly classify training instances
(assuming linear separability)
Maximize the margin
34. The Math
For perfect classification, we want
yi (<w, xi> + b) ≥ 0 for all i
Why?
To maximize the margin, we want
w that minimizes |w|^2
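For reference, the standard hard-margin primal problem is usually written with the constraint rescaled to 1 (the slide's ≥ 0 condition only requires the signs to agree); this is the conventional formulation rather than anything specific to these slides:

```latex
\min_{w,\,b} \;\; \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\left(\langle w, x_i\rangle + b\right) \ge 1 \;\; \text{for all } i
```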
35. Dual Optimization Problem
Maximize over α
W(α) = ∑i αi - 1/2 ∑i,j αi αj yi yj <xi, xj>
Subject to
αi ≥ 0
∑i αi yi = 0
Decision function
f(x) = sign(∑i αi yi <x, xi> + b)
36. Strengths of SVMs
Good generalization in theory
Good generalization in practice
Work well with few training instances
Find globally best model
Efficient algorithms
Amenable to the kernel trick …
37. What if Surface is Non-Linear?
[Figure: X and O points arranged so that no straight line can separate the two classes]
Image from http://www.atrandomresearch.com/iclass/
39. When Linear Separators Fail
[Figure: X and O points that no linear separator can split in the original space become separable after mapping x1 to the squared feature x1^2]
40. Mapping into a New Feature Space
Rather than run SVM on xi, run it on Φ(xi)
Find non-linear separator in input space
What if Φ(xi) is really big?
Use kernels to compute it implicitly!
Φ : x → X = Φ(x)
Φ(x1, x2) = (x1, x2, x1^2, x2^2, x1x2)
Image from http://web.engr.oregonstate.edu/~afern/classes/cs534/
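A sketch of this explicit mapping, written out only for illustration (the point of kernels, next, is to avoid computing it); the separator weights below are made up:

```python
def phi(x):
    """Map a 2-D point (x1, x2) to the feature vector (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return (x1, x2, x1 ** 2, x2 ** 2, x1 * x2)

# A linear separator in feature space, e.g. x1^2 + x2^2 = 1,
# corresponds to a circle (a non-linear boundary) in the original space.
w, b = (0, 0, 1, 1, 0), -1.0
inside = lambda x: sum(wi * fi for wi, fi in zip(w, phi(x))) + b < 0
print(inside((0.1, 0.2)), inside((2.0, 2.0)))  # True, False
```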
41. Kernels
Find kernel K such that
K(x1, x2) = <Φ(x1), Φ(x2)>
Computing K(x1, x2) should be efficient, much more so than computing Φ(x1) and Φ(x2)
Use K(x1, x2) in SVM algorithm rather than <x1, x2>
Remarkably, this is possible
42. The Polynomial Kernel
K(x1, x2) = <x1, x2>^2
x1 = (x11, x12)
x2 = (x21, x22)
<x1, x2> = (x11x21 + x12x22)
<x1, x2>^2 = (x11^2 x21^2 + x12^2 x22^2 + 2 x11x12x21x22)
Φ(x1) = (x11^2, x12^2, √2 x11x12)
Φ(x2) = (x21^2, x22^2, √2 x21x22)
K(x1, x2) = <Φ(x1), Φ(x2)>
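A quick numeric check of this identity on arbitrary example vectors (chosen only for illustration):

```python
import math

def poly_kernel(a, b):
    """K(x1, x2) = <x1, x2>^2, computed directly in the 2-D input space."""
    return (a[0] * b[0] + a[1] * b[1]) ** 2

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x1, x2 = (1.0, 2.0), (3.0, 4.0)
lhs = poly_kernel(x1, x2)
rhs = sum(a * b for a, b in zip(phi(x1), phi(x2)))
print(lhs, rhs)  # both 121.0, since <(1,2),(3,4)>^2 = 11^2
```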
43. The Polynomial Kernel
Φ(x) contains all monomials of degree d
Useful in visual pattern recognition
Number of monomials
16x16 pixel image
10^10 monomials of degree 5
Never explicitly compute Φ(x)!
Variation: K(x1, x2) = (<x1, x2> + 1)^2
44. A Few Good Kernels
Dot product kernel
K(x1, x2) = <x1, x2>
Polynomial kernel
K(x1, x2) = <x1, x2>^d (monomials of degree d)
K(x1, x2) = (<x1, x2> + 1)^d (all monomials of degree 1, 2, …, d)
Gaussian kernel
K(x1, x2) = exp(-|x1 - x2|^2 / 2σ^2)
Radial basis functions
Sigmoid kernel
K(x1, x2) = tanh(<x1, x2> + θ)
Neural networks
Establishing “kernel-hood” from first principles is non-trivial
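Sketches of these kernels as plain functions; d, c, σ, and θ here are free hyperparameters with arbitrary defaults, not values taken from the slides:

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(a, b, d=3, c=1.0):
    """(<a, b> + c)^d: all monomials up to degree d when c = 1."""
    return (dot(a, b) + c) ** d

def gaussian_kernel(a, b, sigma=1.0):
    """exp(-|a - b|^2 / (2 sigma^2)): the radial basis function kernel."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def sigmoid_kernel(a, b, theta=0.0):
    """tanh(<a, b> + theta): related to neural-network activations."""
    return math.tanh(dot(a, b) + theta)

print(poly_kernel((1, 2), (3, 4)), gaussian_kernel((1, 2), (3, 4)))
```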
45. The Kernel Trick
“Given an algorithm which is
formulated in terms of a positive
definite kernel K1, one can construct
an alternative algorithm by replacing
K1 with another positive definite
kernel K2”
SVMs can use the kernel trick
46. Using a Different Kernel in the Dual Optimization Problem
For example, using the polynomial kernel with d = 4 (including lower-order terms).
Maximize over α
W(α) = ∑i αi - 1/2 ∑i,j αi αj yi yj <xi, xj>
Subject to
αi ≥ 0
∑i αi yi = 0
Decision function
f(x) = sign(∑i αi yi <x, xi> + b)
Replace <xi, xj> with (<xi, xj> + 1)^4
Replace <x, xi> with (<x, xi> + 1)^4
These are kernels! So by the kernel trick, we just replace them!
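In practice one rarely solves this quadratic program by hand; as a sketch, scikit-learn's SVC takes the kernel as a parameter, and with gamma=1 and coef0=1 its polynomial kernel matches the (<xi, xj> + 1)^4 form above. The toy data here is made up:

```python
from sklearn.svm import SVC

# Toy 2-D data: class +1 near the origin, class -1 farther out (not linearly separable).
X = [[0, 0], [0.2, 0.1], [-0.1, 0.2], [2, 2], [-2, 2], [2, -2], [-2, -2]]
y = [1, 1, 1, -1, -1, -1, -1]

# Polynomial kernel (<xi, xj> + 1)^4: degree=4, gamma=1, coef0=1.
clf = SVC(kernel="poly", degree=4, gamma=1.0, coef0=1.0)
clf.fit(X, y)
print(clf.predict([[0.1, 0.1], [3, 3]]))  # expected: [ 1 -1]
```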
47. Conclusion
SVMs find the optimal linear separator
The kernel trick makes SVMs non-linear learning algorithms