Machine Learning
Learning
• Learning is essential for unknown
environments,
– i.e., when designer lacks omniscience
• Learning is useful as a system construction
method,
– i.e., expose the agent to reality rather than trying
to write it down
• Learning modifies the agent's decision
mechanisms to improve performance
Learning agents
[Diagram: learning-agent architecture showing the performance element, learning element, critic, and problem generator, acting on the environment through actuators.]
Learning Agent
1. Performance Element: Collection of knowledge
and procedures to decide on the next action.
2. Learning Element: takes in feedback from the
critic and modifies the performance element
accordingly.
3. Critic: provides the learning element with
information on how well the agent is doing
based on a fixed performance standard.
E.g. the audience
4. Problem Generator: provides the performance
element with suggestions on new actions to
take.
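As a rough illustration of how these four components fit together, here is a minimal Python sketch of the learning-agent loop. All class and function names (PerformanceElement, Critic, and so on) are hypothetical and only mirror the structure described above; a real agent would be far richer.

```python
# Minimal sketch of a learning agent; names and behaviour are illustrative only.

class PerformanceElement:
    """Chooses actions from percepts using its current knowledge."""
    def __init__(self):
        self.knowledge = {}                     # e.g. percept -> preferred action

    def choose_action(self, percept):
        return self.knowledge.get(percept, "default_action")

class Critic:
    """Scores the outcome of an action against a fixed performance standard."""
    def evaluate(self, percept, action, outcome):
        return 1.0 if outcome == "good" else -1.0

class LearningElement:
    """Uses the critic's feedback to modify the performance element."""
    def update(self, performer, percept, action, feedback):
        if feedback < 0:
            performer.knowledge[percept] = "try_something_else"

class ProblemGenerator:
    """Suggests exploratory actions so the agent gathers new experience."""
    def suggest(self, percept):
        return "exploratory_action"

def run_step(performer, critic, learner, percept, environment):
    action = performer.choose_action(percept)
    outcome = environment(percept, action)      # environment is a callable here
    feedback = critic.evaluate(percept, action, outcome)
    learner.update(performer, percept, action, feedback)
    return action, feedback

if __name__ == "__main__":
    env = lambda percept, action: "good" if action == "default_action" else "bad"
    print(run_step(PerformanceElement(), Critic(), LearningElement(), "state_1", env))
```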
Machine Learning Systems
• Rote learning: the process of memorization. It involves
a one-to-one mapping from inputs to stored
representations. An example is caching, where large
pieces of data are stored and recalled when
required by a computation
Machine Learning Systems
• learning by being told (advice-taking)
• learning from examples (induction)
• learning by analogy
Two types of learning in AI
Deductive: Deduce rules/facts from already known
rules/facts. (We have already dealt with this)
Inductive: Learn new rules/facts from a data set D.
[Schematic from the slide: deduction derives a conclusion C from known rules and facts involving A and B; induction infers a rule of the form A ⇒ C from a data set D = {(xn, yn)}, n = 1, …, N.]
We will be dealing with the latter, inductive learning, now
Inductive Learning
• Key idea:
– To use specific examples to reach general
conclusions
• Given a set of examples, the system tries
to approximate the evaluation function.
• Also called Pure Inductive Inference
Inductive learning - example
• f(x) is the target function
• An example is a pair [x, f(x)]
• Learning task: find a hypothesis h such that h(x) ≈ f(x) given a
training set of examples D = {[xi, f(xi)]}, i = 1,2,…,N
[Slide shows three training examples, each a small binary pattern x of nine 0/1 values with its label: two patterns have f(x) = 1 and one has f(x) = 0.]
Etc...
Inspired by a slide from V. Pavlovic
Inductive learning – example B
• Construct h so that it agrees with f.
• The hypothesis h is consistent if it agrees with f on all
observations.
• Ockham’s razor: Select the simplest consistent hypothesis.
• How to achieve good generalization?
[Figure panels: consistent linear fit; consistent 7th-order polynomial fit; inconsistent linear fit; consistent 6th-order polynomial fit; consistent sinusoidal fit.]
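To make the generalization point concrete, the short script below (NumPy assumed; the data points are invented) fits a degree-1 and a degree-7 polynomial to eight noisy, roughly linear points. The high-degree fit can match every training point almost exactly, yet it is a far more complicated hypothesis than the data warrant, which is exactly what Ockham's razor warns against.

```python
import numpy as np

# Invented 1-D training data: a noisy linear trend.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = 2.0 * x + 0.1 * rng.standard_normal(8)

for degree in (1, 7):
    fit = np.poly1d(np.polyfit(x, y, degree))       # least-squares polynomial fit
    train_err = np.max(np.abs(fit(x) - y))          # worst error on the training points
    between = fit(0.5 * (x[:-1] + x[1:]))           # predictions between training points
    print(f"degree {degree}: max training error {train_err:.3f}, "
          f"predictions between points span [{between.min():.2f}, {between.max():.2f}]")
```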
Types of Learning
– Supervised learning: correct answer for each
example. Answer can be a numeric variable,
categorical variable etc.
– The machine has access to a teacher who corrects it.
– Unsupervised learning: correct answers not given –
just examples (e.g., the same figures as above,
without the labels).
– No access to teacher. Instead, the machine must
search for “order” and “structure” in the
environment.
– Reinforcement learning: occasional rewards
[Figure: example data items, labeled M and F.]
Learning problems
• The hypothesis takes as input a set of
attributes x and returns a ”decision” h(x)
= the predicted (estimated) output value
for the input x.
• Discrete valued function ⇒ classification
• Continuous valued function ⇒ regression
Example: Robot color vision
Classify the Lego pieces into red, blue, and yellow.
Classify white balls, black sideboard, and green carpet.
Input = pixel in image, output = category
Example: Predict price for cotton futures
Input: Past history
of closing prices,
and trading volume
Output: Predicted
closing price
Method: Decision trees
• “Divide and conquer”:
Split data into smaller and
smaller subsets.
• Splits usually on a single
variable
[Example tree: first split on x1 > a?; the yes-branch then splits on x2 > b? and the no-branch on x2 > g?, each followed by further yes/no splits.]
The wait@restaurant decision tree
This is our true function.
Can we learn this tree from examples?
Decision tree learning
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose "most significant" attribute as root
of (sub)tree
Choosing an attribute
• Idea: a good attribute splits the examples into
subsets that are (ideally) "all positive" or "all
negative"
• Patrons? is a better choice
Using information theory
• To implement Choose-Attribute in the DTL
algorithm
• Information Content (Entropy):
I(P(v1), … , P(vn)) = Σi=1..n –P(vi) log2 P(vi)
• For a training set containing p positive
examples and n negative examples:
I( p/(p+n), n/(p+n) ) = –[p/(p+n)]·log2[p/(p+n)] – [n/(p+n)]·log2[n/(p+n)]
The entropy is maximal when
all possibilities are equally
likely.
The goal of the decision tree
is to decrease the entropy in
each node.
Entropy is zero in a pure ”yes”
node (or pure ”no” node).
The second law of thermodynamics:
Elements in a closed system tend
to seek their most probable distribution;
in a closed system entropy always increases
Entropy is a measure of ”disorder” in a
system.
Information gain
• A chosen attribute A divides the training set E into
subsets E1, … , Ev according to their values for A,
where A has v distinct values.
• Information Gain (IG) or reduction in entropy from
the attribute test:
remainder(A) = Σi=1..v [(pi + ni)/(p + n)] · I( pi/(pi + ni), ni/(pi + ni) )
IG(A) = I( p/(p+n), n/(p+n) ) – remainder(A)
• Choose the attribute with the largest IG
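A small Python sketch of these two formulas may help; base-10 logarithms are used so the printed numbers match the worked example on the following slides (the definition above uses log2, and the base only rescales the values). The class counts are taken from that example.

```python
import math

def entropy(counts, base=10):
    """I(p1, ..., pn) for a list of class counts, with 0*log(0) treated as 0."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts if c > 0)

def information_gain(parent_counts, child_counts, base=10):
    """IG(A) = I(parent) - remainder(A); children given as a list of count lists."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child, base) for child in child_counts)
    return entropy(parent_counts, base) - remainder

# Whole training set: 6 positive (True) and 6 negative (False) examples.
print(round(entropy([6, 6]), 2))                                      # 0.3
# Patrons splits them into None (0 T, 2 F), Some (4 T, 0 F), Full (2 T, 4 F).
print(round(information_gain([6, 6], [[0, 2], [4, 0], [2, 4]]), 2))   # 0.16
```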
Example
Decision tree learning example
10 attributes:
1. Alternate: Is there a suitable alternative restaurant
nearby? {yes,no}
2. Bar: Is there a bar to wait in? {yes,no}
3. Fri/Sat: Is it Friday or Saturday? {yes,no}
4. Hungry: Are you hungry? {yes,no}
5. Patrons: How many are seated in the restaurant? {none,
some, full}
6. Price: Price level {$,$$,$$$}
7. Raining: Is it raining? {yes,no}
8. Reservation: Did you make a reservation? {yes,no}
9. Type: Type of food {French,Italian,Thai,Burger}
10. Wait: {0-10 min, 10-30 min, 30-60 min, >60 min}
Decision tree learning example
T = True, F = False
6 True, 6 False
Entropy = –(6/12)·log10(6/12) – (6/12)·log10(6/12) ≈ 0.30
(The entropy values on these slides are computed with base-10 logarithms; the choice of base only rescales the numbers, so the comparison between attributes is unaffected.)
Decision tree learning example
Entropy = (6/12)·[–(3/6)·log10(3/6) – (3/6)·log10(3/6)] + (6/12)·[–(3/6)·log10(3/6) – (3/6)·log10(3/6)] ≈ 0.30
Alternate?
3 T, 3 F 3 T, 3 F
Yes No
Entropy decrease = 0.30 – 0.30 = 0
Decision tree learning example
Entropy = (6/12)·[–(3/6)·log10(3/6) – (3/6)·log10(3/6)] + (6/12)·[–(3/6)·log10(3/6) – (3/6)·log10(3/6)] ≈ 0.30
Bar?
3 T, 3 F 3 T, 3 F
Yes No
Entropy decrease = 0.30 – 0.30 = 0
Decision tree learning example
Entropy = (5/12)·[–(2/5)·log10(2/5) – (3/5)·log10(3/5)] + (7/12)·[–(4/7)·log10(4/7) – (3/7)·log10(3/7)] ≈ 0.29
Sat/Fri?
2 T, 3 F 4 T, 3 F
Yes No
Entropy decrease = 0.30 – 0.29 = 0.01
Decision tree learning example
Entropy = (7/12)·[–(5/7)·log10(5/7) – (2/7)·log10(2/7)] + (5/12)·[–(1/5)·log10(1/5) – (4/5)·log10(4/5)] ≈ 0.24
Hungry?
5 T, 2 F 1 T, 4 F
Yes No
Entropy decrease = 0.30 – 0.24 = 0.06
Decision tree learning example
Entropy = (4/12)·[–(2/4)·log10(2/4) – (2/4)·log10(2/4)] + (8/12)·[–(4/8)·log10(4/8) – (4/8)·log10(4/8)] ≈ 0.30
Raining?
2 T, 2 F 4 T, 4 F
Yes No
Entropy decrease = 0.30 – 0.30 = 0
Decision tree learning example
Entropy = (5/12)·[–(3/5)·log10(3/5) – (2/5)·log10(2/5)] + (7/12)·[–(3/7)·log10(3/7) – (4/7)·log10(4/7)] ≈ 0.29
Reservation?
3 T, 2 F 3 T, 4 F
Yes No
Entropy decrease = 0.30 – 0.29 = 0.01
Decision tree learning example
Entropy = (2/12)·[–(0/2)·log10(0/2) – (2/2)·log10(2/2)] + (4/12)·[–(4/4)·log10(4/4) – (0/4)·log10(0/4)] + (6/12)·[–(2/6)·log10(2/6) – (4/6)·log10(4/6)] ≈ 0.14 (taking 0·log 0 = 0)
Patrons?
2 F
4 T
None Full
Entropy decrease = 0.30 – 0.14 = 0.16
2 T, 4 F
Some
Decision tree learning example
Entropy = (6/12)·[–(3/6)·log10(3/6) – (3/6)·log10(3/6)] + (2/12)·[–(2/2)·log10(2/2) – (0/2)·log10(0/2)] + (4/12)·[–(1/4)·log10(1/4) – (3/4)·log10(3/4)] ≈ 0.23
Price
3 T, 3 F
2 T
$ $$$
Entropy decrease = 0.30 – 0.23 = 0.07
1 T, 3 F
$$
Decision tree learning example
Entropy = (2/12)·[–(1/2)·log10(1/2) – (1/2)·log10(1/2)] + (2/12)·[–(1/2)·log10(1/2) – (1/2)·log10(1/2)] + (4/12)·[–(2/4)·log10(2/4) – (2/4)·log10(2/4)] + (4/12)·[–(2/4)·log10(2/4) – (2/4)·log10(2/4)] ≈ 0.30
Type
1 T, 1 F
1 T, 1 F
French Burger
Entropy decrease = 0.30 – 0.30 = 0
2 T, 2 F
Italian
2 T, 2 F
Thai
Decision tree learning example
Entropy = (6/12)·[–(4/6)·log10(4/6) – (2/6)·log10(2/6)] + (2/12)·[–(1/2)·log10(1/2) – (1/2)·log10(1/2)] + (2/12)·[–(1/2)·log10(1/2) – (1/2)·log10(1/2)] + (2/12)·[–(0/2)·log10(0/2) – (2/2)·log10(2/2)] ≈ 0.24
Est. waiting
time
4 T, 2 F
1 T, 1 F
0-10 > 60
Entropy decrease = 0.30 – 0.24 = 0.06
2 F
10-30
1 T, 1 F
30-60
Decision tree learning example
Patrons?
2 F
4 T
None Full
Largest entropy decrease (0.16)
achieved by splitting on Patrons.
2 T, 4 F
Some
X? Continue like this, making new
splits, always purifying nodes.
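The "continue like this" step can be sketched as a recursive procedure: pick the attribute with the largest information gain, split, and recurse on every impure subset. The version below is only illustrative (base-10 logs as before), and the four-row dataset is a made-up fragment rather than the full 12-example restaurant table.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n, 10) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Attribute whose split leaves the smallest remainder (largest entropy decrease)."""
    def remainder(attr):
        total = 0.0
        for value in set(r[attr] for r in rows):
            subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
            total += len(subset) / len(rows) * entropy(subset)
        return total
    return min(attributes, key=remainder)

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:                      # pure node: stop
        return labels[0]
    if not attributes:                             # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    tree = {attr: {}}
    for value in set(r[attr] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[attr] == value]
        tree[attr][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attributes if a != attr])
    return tree

# Made-up fragment, for illustration only.
rows = [{"Patrons": "Some", "Hungry": "Yes"}, {"Patrons": "None", "Hungry": "Yes"},
        {"Patrons": "Full", "Hungry": "Yes"}, {"Patrons": "Full", "Hungry": "No"}]
labels = [True, False, True, False]
print(build_tree(rows, labels, ["Patrons", "Hungry"]))
```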
Decision tree learning example
Induced tree (from examples)
Decision tree learning example
True tree
Decision tree learning example
Induced tree (from examples)
Cannot make it more complex
than what the data supports.
Support Vector Machine -
Linear Classifiers
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
How would you
classify this data?
w x + b<0
w x + b>0
Support Vector Machine -
Linear Classifiers
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
Any of these
would be fine..
..but which is
best?
Support Vector Machine -
Linear Classifiers
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
How would you
classify this data?
Misclassified
to +1 class
Support Vector Machine -
Linear Classifiers
Classifier Margin
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
Define the margin
of a linear
classifier as the
width that the
boundary could be
increased by
before hitting a
datapoint.
Maximum Margin
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
The maximum
margin linear
classifier is the
linear classifier
with the, um,
maximum margin.
This is the
simplest kind of
SVM (Called an
LSVM)
Linear SVM
Support Vectors
are those
datapoints that
the margin
pushes up
against
1. Maximizing the margin is good
according to intuition and PAC theory
2. Implies that only support vectors are
important; other training examples
are ignorable.
3. Empirically it works very very well.
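As a small illustration of these points (scikit-learn and NumPy assumed; the toy data are invented), the snippet below fits a linear SVM with a large C, prints the learned decision function sign(w·x + b), and shows that only a handful of training points end up as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Invented, linearly separable 2-D data: class -1 on the lower left, +1 on the upper right.
X = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 0.5],
              [3.0, 3.0], [3.5, 2.0], [2.5, 3.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)        # very large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("decision function: sign(%.2f*x1 + %.2f*x2 + %.2f)" % (w[0], w[1], b))
print("support vectors:")
print(clf.support_vectors_)              # only the points that the margin pushes against
print("training predictions:", clf.predict(X))
```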
Non-linear SVMs
 Datasets that are linearly separable with some noise
work out great:
 But what are we going to do if the dataset is just too hard?
 How about… mapping data to a higher-dimensional
space:
[Figure: 1-D data on the x-axis that no single threshold can separate becomes linearly separable after adding the feature x2.]
Non-linear SVMs: Feature spaces
 General idea: the original input space can always be
mapped to some higher-dimensional feature space
where the training set is separable:
Φ: x → φ(x)
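A tiny numeric version of the idea in the figure above (NumPy assumed; the numbers are made up): 1-D points that no single threshold can separate become linearly separable once the extra feature x2 (x squared) is added.

```python
import numpy as np

# Made-up 1-D data: class +1 sits in the middle, class -1 on both sides,
# so no single threshold on x separates the two classes.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, +1, +1, +1, -1, -1])

# Map each point into the 2-D feature space phi(x) = (x, x**2).
phi = np.column_stack([x, x ** 2])

# In that space the classes are separated by the horizontal line x**2 = 2,
# i.e. the linear rule sign(2 - x**2) classifies every point correctly.
pred = np.sign(2.0 - phi[:, 1])
print(np.all(pred == y))   # True
```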
The “Kernel Trick”
 The linear classifier relies on the dot product between vectors: K(xi,xj) = xi^T xj
 If every data point is mapped into a high-dimensional space via some
transformation Φ: x → φ(x), the dot product becomes: K(xi,xj) = φ(xi)^T φ(xj)
 A kernel function is a function that corresponds to an inner product in
some expanded feature space.
 Example:
2-dimensional vectors x = [x1 x2]; let K(xi,xj) = (1 + xi^T xj)^2.
Need to show that K(xi,xj) = φ(xi)^T φ(xj):
K(xi,xj) = (1 + xi^T xj)^2
= 1 + xi1^2 xj1^2 + 2 xi1xj1 xi2xj2 + xi2^2 xj2^2 + 2 xi1xj1 + 2 xi2xj2
= [1  xi1^2  √2 xi1xi2  xi2^2  √2 xi1  √2 xi2]^T [1  xj1^2  √2 xj1xj2  xj2^2  √2 xj1  √2 xj2]
= φ(xi)^T φ(xj),  where φ(x) = [1  x1^2  √2 x1x2  x2^2  √2 x1  √2 x2]
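The algebra above is easy to check numerically. The short script below (NumPy assumed; the two vectors are chosen arbitrarily) confirms that the kernel value (1 + xi^T xj)^2 computed in the original 2-D space equals the dot product of the expanded feature vectors φ(xi) and φ(xj).

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel (1 + x.y)^2 in two dimensions."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

def quadratic_kernel(xi, xj):
    return (1.0 + xi @ xj) ** 2

xi = np.array([0.7, -1.2])               # arbitrary 2-D vectors
xj = np.array([2.0, 0.5])

print(quadratic_kernel(xi, xj))          # kernel evaluated directly in input space
print(phi(xi) @ phi(xj))                 # the same value via the explicit feature map
```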
Examples of Kernel Functions
 Linear: K(xi,xj) = xi^T xj
 Polynomial of power p: K(xi,xj) = (1 + xi^T xj)^p
 Gaussian (radial-basis function network): K(xi,xj) = exp( –‖xi – xj‖^2 / (2σ^2) )
 Sigmoid: K(xi,xj) = tanh(β0 xi^T xj + β1)
CLUSTERING
Introduction
• The goal of clustering is to
– group data points that are close (or similar) to
each other
– identify such groupings (or clusters) in an
unsupervised manner
• Unsupervised: no information is provided to the
algorithm on which data points belong to which
clusters
• Example
[Figure: a scatter of unlabeled data points.]
What should the
clusters be for
these data
points?
Clustering Algorithms
• Exclusive clustering: objects are grouped in an
exclusive way; an object that belongs to one
cluster cannot belong to any other cluster.
– Ex: k-means
• Overlapping clustering: each point may
belong to two or more clusters with
different degrees of membership.
– Ex: fuzzy c-means
• Hierarchical clustering: based on the union
between the two nearest clusters.
Hierarchical clustering
• Given the input set S, the goal is to produce a
hierarchy (dendrogram) in which nodes
represent subsets of S.
• Features of the tree obtained:
– The root is the whole input set S.
– The leaves are the individual elements of S.
– The internal nodes are defined as the union of their
children.
• Each level of the tree represents a partition of
the input data into several (nested) clusters
or groups.
Hierarchical clustering
Hierarchical clustering
• There are two styles of hierarchical
clustering algorithms to build a tree from
the input set S:
– Agglomerative (bottom-up):
• Beginning with singletons (sets with 1 element)
• Merging them until S is achieved as the root.
• It is the most common approach.
– Divisive (top-down):
• Recursively partitioning S until singleton sets are
reached.
Hierarchical clustering
• Input: a pairwise distance matrix involving all instances
in S
• Algorithm
1. Place each instance of S in its own cluster (singleton),
creating the list of clusters L (initially, the leaves of T):
L= S1, S2, S3, ..., Sn-1, Sn.
2. Compute a merging cost function between every pair
of elements in L to find the two closest clusters {Si, Sj}
which will be the cheapest couple to merge.
3. Remove Si and Sj from L.
4. Merge Si and Sj to create a new internal node Sij in T
which will be the parent of Si and Sj in the resulting
tree.
5. Go to Step 2 until there is only one set remaining.
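A minimal Python sketch of this agglomerative procedure (NumPy assumed; single linkage is used as the merging cost, and the toy points are invented). It follows the five steps above but, for brevity, records only the merge history instead of building an explicit tree, and it recomputes all pairwise costs at every step, which is fine for small inputs.

```python
import numpy as np

def single_linkage_merges(points):
    """Agglomerative clustering with single linkage; returns the list of merges."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]           # step 1: singletons
    history = []
    while len(clusters) > 1:                                # step 5: until one set remains
        best = None
        for a in range(len(clusters)):                      # step 2: cheapest pair to merge
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], round(d, 3)))
        merged = clusters[a] + clusters[b]                  # step 4: new internal node
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]   # step 3
        clusters.append(merged)
    return history

pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]      # invented points, two groups
for left, right, dist in single_linkage_merges(pts):
    print(left, "+", right, "at distance", dist)
```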
Hierarchical clustering
• Step 2 can be done in different ways, which is what
distinguishes single-linkage from complete-linkage and
average-linkage clustering.
– In single-linkage clustering (also called the
connectedness or minimum method): we consider the
distance between one cluster and another cluster to be
equal to the shortest distance from any member of one
cluster to any member of the other cluster.
– In complete-linkage clustering (also called the diameter
or maximum method), we consider the distance
between one cluster and another cluster to be equal to
the greatest distance from any member of one cluster
to any member of the other cluster.
– In average-linkage clustering, we consider the distance
between one cluster and another cluster to be equal to
the average distance from any member of one cluster
to any member of the other cluster.
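All three linkage criteria are also available off the shelf; a brief sketch assuming SciPy and NumPy are installed (the points are invented):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented 2-D points forming two loose groups.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                       # dendrogram as a merge table
    labels = fcluster(Z, t=2, criterion="maxclust")     # cut the tree into two clusters
    print(method, labels)
```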
Hierarchical clustering:
example
Hierarchical clustering:
example using single linkage
Hierarchical clustering:
forming clusters
• Forming clusters from dendrograms
Hierarchical clustering
• Advantages
– Dendrograms are great for visualization
– Provides hierarchical relations between clusters
– Shown to be able to capture concentric
clusters
• Disadvantages
– Not easy to define levels for clusters
– Experiments showed that other clustering
techniques outperform hierarchical clustering
K-means
• Input: n objects (or points) and a number k
• Algorithm
1. Randomly place K points into the space
represented by the objects that are being
clustered. These points represent initial group
centroids.
2. Assign each object to the group that has the
closest centroid.
3. When all objects have been assigned, recalculate
the positions of the K centroids.
4. Repeat Steps 2 and 3 until the stopping criterion is met.
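A minimal NumPy sketch of these four steps (the toy data are invented; initial centroids are drawn from the data points themselves, one common but by no means the only choice):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # step 1
    for _ in range(n_iter):
        # step 2: assign each object to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # step 3: recompute the centroid positions (keep the old one if a cluster empties)
        new_centroids = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                    # step 4: no change
            break
        centroids = new_centroids
    return centroids, assign

X = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]   # invented points, two clear groups
centroids, assign = kmeans(X, k=2)
print("centroids:")
print(centroids)
print("assignments:", assign)
```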
K-means
• Stopping criteria:
– No change in the members of all clusters
– when the squared error is less than some small
threshold value a
• Squared error: se = Σi=1..k Σp∈ci |p – mi|^2
– where mi is the mean of all instances in cluster ci
• se(j) < a
• Properties of k-means
– Guaranteed to converge
– Guaranteed to reach a local optimum, not
necessarily the global optimum.
• Example:
http://www.kdnuggets.com/dmcourse/data_mining_course/mod-13-clustering.ppt
K-means
• Pros:
– Low complexity
• complexity is O(nkt), where t = #iterations
• Cons:
– Necessity of specifying k
– Sensitive to noise and outlier data points
• Outliers: a small number of such data can
substantially influence the mean value)
– Clusters are sensitive to initial assignment of
centroids
• K-means is not a deterministic algorithm
• Clusters can be inconsistent from one run to another
Fuzzy c-means
• An extension of k-means
• Hierarchical and k-means clustering generate hard
partitions – each data point can be assigned to
only one cluster
• Fuzzy c-means allows data points to be
assigned into more than one cluster
– each data point has a degree of membership
(or probability) of belonging to each cluster
Fuzzy c-means algorithm
• Let xi be a vector of values for data point gi.
1. Initialize the membership matrix U(0) = [ uij ] for data
point gi and cluster clj at random
2.At the k-th step, compute the fuzzy centroid
C(k) = [ cj ] for j = 1, .., nc, where nc is the
number of clusters, using
cj = [ Σi=1..n (uij)^m xi ] / [ Σi=1..n (uij)^m ]
where m is the fuzzy parameter and n is the number
of data points.
Fuzzy c-means algorithm
3. Update the fuzzy membership U(k) = [ uij ], using
uij = 1 / Σl=1..nc ( ‖xi – cj‖ / ‖xi – cl‖ )^(2/(m–1))
4. If ||U(k) – U(k–1)|| < ε, then STOP, else return to step 2.
5. Determine membership cutoff
– For each data point gi, assign gi to cluster clj if uij of
U(k) > a
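A compact NumPy sketch of these update equations (the data are invented; m = 2 is a common choice for the fuzzy parameter, and a small constant guards against division by zero when a point coincides with a centroid):

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, eps=1e-4, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)          # step 1: random memberships, rows sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centroids = (Um.T @ X) / Um.sum(axis=0)[:, None]            # step 2: fuzzy centroids
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        # step 3: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
        U_new = 1.0 / ((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.linalg.norm(U_new - U) < eps:                          # step 4: converged
            U = U_new
            break
        U = U_new
    return centroids, U

X = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]   # invented points, two clear groups
centroids, U = fuzzy_c_means(X, n_clusters=2)
print("centroids:")
print(np.round(centroids, 2))
print("memberships (each row sums to 1):")
print(np.round(U, 2))
```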
Fuzzy c-means
• Pros:
– Allows a data point to be in multiple clusters
– A more natural representation of the behavior
of genes
• genes usually are involved in multiple functions
• Cons:
– Need to define c, the number of clusters
– Need to determine membership cutoff value
– Clusters are sensitive to initial assignment of
centroids
• Fuzzy c-means is not a deterministic algorithm