Machine Learning
Learning
• Learning is essential for unknown
environments,
– i.e., when designer lacks omniscience
• Learning is useful as a system construction
method,
– i.e., expose the agent to reality rather than trying
to write it down
• Learning modifies the agent's decision
mechanisms to improve performance
Learning agents
[Diagram: learning-agent architecture showing the performance element, learning element, critic, and problem generator, acting on the environment through actuators.]
Learning Agent
1. Performance Element: Collection of knowledge
and procedures to decide on the next action.
2. Learning Element: takes in feedback from the
critic and modifies the performance element
accordingly.
3. Critic: provides the learning element with
information on how well the agent is doing
based on a fixed performance standard.
E.g. the audience
4. Problem Generator: provides the performance
element with suggestions on new actions to
take.
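As a rough illustration of how these four components fit together, here is a minimal Python sketch of the learning-agent loop. All class and function names (PerformanceElement, Critic, and so on) are hypothetical and only mirror the structure described above; a real agent would be far richer.

```python
# Minimal sketch of a learning agent; names and behaviour are illustrative only.

class PerformanceElement:
    """Chooses actions from percepts using its current knowledge."""
    def __init__(self):
        self.knowledge = {}                     # e.g. percept -> preferred action

    def choose_action(self, percept):
        return self.knowledge.get(percept, "default_action")

class Critic:
    """Scores the outcome of an action against a fixed performance standard."""
    def evaluate(self, percept, action, outcome):
        return 1.0 if outcome == "good" else -1.0

class LearningElement:
    """Uses the critic's feedback to modify the performance element."""
    def update(self, performer, percept, action, feedback):
        if feedback < 0:
            performer.knowledge[percept] = "try_something_else"

class ProblemGenerator:
    """Suggests exploratory actions so the agent gathers new experience."""
    def suggest(self, percept):
        return "exploratory_action"

def run_step(performer, critic, learner, percept, environment):
    action = performer.choose_action(percept)
    outcome = environment(percept, action)      # environment is a callable here
    feedback = critic.evaluate(percept, action, outcome)
    learner.update(performer, percept, action, feedback)
    return action, feedback

if __name__ == "__main__":
    env = lambda percept, action: "good" if action == "default_action" else "bad"
    print(run_step(PerformanceElement(), Critic(), LearningElement(), "state_1", env))
```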
Machine Learning Systems
• Rote learning: the process of memorization. It involves
a one-to-one mapping from inputs to stored
representations. An example is caching, where large
pieces of data are stored and recalled when
required by a computation
Machine Learning Systems
• learning by being told (advice-taking)
• learning from examples (induction)
• learning by analogy
Two types of learning in AI
Deductive: Deduce rules/facts from already known
rules/facts. (We have already dealt with this)
Inductive: Learn new rules/facts from a data set D.
[Schematic from the slide: deduction derives a conclusion C from known rules and facts involving A and B; induction infers a rule of the form A ⇒ C from a data set D = {(xn, yn)}, n = 1, …, N.]
We will be dealing with the latter, inductive learning, now
Inductive Learning
• Key idea:
– To use specific examples to reach general
conclusions
• Given a set of examples, the system tries
to approximate the evaluation function.
• Also called Pure Inductive Inference
Inductive learning - example
• f(x) is the target function
• An example is a pair [x, f(x)]
• Learning task: find a hypothesis h such that h(x) ≈ f(x) given a
training set of examples D = {[xi, f(xi)]}, i = 1,2,…,N
[Slide shows three training examples, each a small binary pattern x of nine 0/1 values with its label: two patterns have f(x) = 1 and one has f(x) = 0.]
Etc...
Inspired by a slide from V. Pavlovic
Inductive learning – example B
• Construct h so that it agrees with f.
• The hypothesis h is consistent if it agrees with f on all
observations.
• Ockham’s razor: Select the simplest consistent hypothesis.
• How to achieve good generalization?
[Figure panels: consistent linear fit; consistent 7th-order polynomial fit; inconsistent linear fit; consistent 6th-order polynomial fit; consistent sinusoidal fit.]
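To make the generalization point concrete, the short script below (NumPy assumed; the data points are invented) fits a degree-1 and a degree-7 polynomial to eight noisy, roughly linear points. The high-degree fit can match every training point almost exactly, yet it is a far more complicated hypothesis than the data warrant, which is exactly what Ockham's razor warns against.

```python
import numpy as np

# Invented 1-D training data: a noisy linear trend.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = 2.0 * x + 0.1 * rng.standard_normal(8)

for degree in (1, 7):
    fit = np.poly1d(np.polyfit(x, y, degree))       # least-squares polynomial fit
    train_err = np.max(np.abs(fit(x) - y))          # worst error on the training points
    between = fit(0.5 * (x[:-1] + x[1:]))           # predictions between training points
    print(f"degree {degree}: max training error {train_err:.3f}, "
          f"predictions between points span [{between.min():.2f}, {between.max():.2f}]")
```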
Types of Learning
– Supervised learning: correct answer for each
example. Answer can be a numeric variable,
categorical variable etc.
– The machine has access to a teacher who corrects it.
– Unsupervised learning: correct answers not given –
just examples (e.g., the same figures as above,
without the labels).
– No access to teacher. Instead, the machine must
search for “order” and “structure” in the
environment.
– Reinforcement learning: occasional rewards
[Figure: example data items, labeled M and F.]
Learning problems
• The hypothesis takes as input a set of
attributes x and returns a ”decision” h(x)
= the predicted (estimated) output value
for the input x.
• Discrete valued function ⇒ classification
• Continuous valued function ⇒ regression
Example: Robot color vision
Classify the Lego pieces into red, blue, and yellow.
Classify white balls, black sideboard, and green carpet.
Input = pixel in image, output = category
Example: Predict price for cotton futures
Input: Past history
of closing prices,
and trading volume
Output: Predicted
closing price
Method: Decision trees
• “Divide and conquer”:
Split data into smaller and
smaller subsets.
• Splits usually on a single
variable
[Example tree: first split on x1 > a?; the yes-branch then splits on x2 > b? and the no-branch on x2 > g?, each followed by further yes/no splits.]
The wait@restaurant decision tree
This is our true function.
Can we learn this tree from examples?
Decision tree learning
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose "most significant" attribute as root
of (sub)tree
Choosing an attribute
• Idea: a good attribute splits the examples into
subsets that are (ideally) "all positive" or "all
negative"
• Patrons? is a better choice
Using information theory
• To implement Choose-Attribute in the DTL
algorithm
• Information Content (Entropy):
I(P(v1), … , P(vn)) = Σi=1..n –P(vi) log2 P(vi)
• For a training set containing p positive
examples and n negative examples:
I( p/(p+n), n/(p+n) ) = –[p/(p+n)]·log2[p/(p+n)] – [n/(p+n)]·log2[n/(p+n)]
The entropy is maximal when
all possibilities are equally
likely.
The goal of the decision tree
is to decrease the entropy in
each node.
Entropy is zero in a pure ”yes”
node (or pure ”no” node).
The second law of thermodynamics:
Elements in a closed system tend
to seek their most probable distribution;
in a closed system entropy always increases
Entropy is a measure of ”disorder” in a
system.
Information gain
• A chosen attribute A divides the training set E into
subsets E1, … , Ev according to their values for A,
where A has v distinct values.
• Information Gain (IG) or reduction in entropy from
the attribute test:
remainder(A) = Σi=1..v [(pi + ni)/(p + n)] · I( pi/(pi + ni), ni/(pi + ni) )
IG(A) = I( p/(p+n), n/(p+n) ) – remainder(A)
• Choose the attribute with the largest IG
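A small Python sketch of these two formulas may help; base-10 logarithms are used so the printed numbers match the worked example on the following slides (the definition above uses log2, and the base only rescales the values). The class counts are taken from that example.

```python
import math

def entropy(counts, base=10):
    """I(p1, ..., pn) for a list of class counts, with 0*log(0) treated as 0."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts if c > 0)

def information_gain(parent_counts, child_counts, base=10):
    """IG(A) = I(parent) - remainder(A); children given as a list of count lists."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child, base) for child in child_counts)
    return entropy(parent_counts, base) - remainder

# Whole training set: 6 positive (True) and 6 negative (False) examples.
print(round(entropy([6, 6]), 2))                                      # 0.3
# Patrons splits them into None (0 T, 2 F), Some (4 T, 0 F), Full (2 T, 4 F).
print(round(information_gain([6, 6], [[0, 2], [4, 0], [2, 4]]), 2))   # 0.16
```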
Example
Decision tree learning example
10 attributes:
1. Alternate: Is there a suitable alternative restaurant
nearby? {yes,no}
2. Bar: Is there a bar to wait in? {yes,no}
3. Fri/Sat: Is it Friday or Saturday? {yes,no}
4. Hungry: Are you hungry? {yes,no}
5. Patrons: How many are seated in the restaurant? {none,
some, full}
6. Price: Price level {$,$$,$$$}
7. Raining: Is it raining? {yes,no}
8. Reservation: Did you make a reservation? {yes,no}
9. Type: Type of food {French,Italian,Thai,Burger}
10. Wait: {0-10 min, 10-30 min, 30-60 min, >60 min}
Decision tree learning example
T = True, F = False
6 True, 6 False
Entropy = –(6/12)·log10(6/12) – (6/12)·log10(6/12) ≈ 0.30
(The entropy values on these slides are computed with base-10 logarithms; the choice of base only rescales the numbers, so the comparison between attributes is unaffected.)
Decision tree learning example
Entropy = (6/12)·[–(3/6)·log10(3/6) – (3/6)·log10(3/6)] + (6/12)·[–(3/6)·log10(3/6) – (3/6)·log10(3/6)] ≈ 0.30
Alternate?
3 T, 3 F 3 T, 3 F
Yes No
Entropy decrease = 0.30 – 0.30 = 0
Decision tree learning example
Entropy = (6/12)·[–(3/6)·log10(3/6) – (3/6)·log10(3/6)] + (6/12)·[–(3/6)·log10(3/6) – (3/6)·log10(3/6)] ≈ 0.30
Bar?
3 T, 3 F 3 T, 3 F
Yes No
Entropy decrease = 0.30 – 0.30 = 0
Decision tree learning example
Entropy = (5/12)·[–(2/5)·log10(2/5) – (3/5)·log10(3/5)] + (7/12)·[–(4/7)·log10(4/7) – (3/7)·log10(3/7)] ≈ 0.29
Sat/Fri?
2 T, 3 F 4 T, 3 F
Yes No
Entropy decrease = 0.30 – 0.29 = 0.01
Decision tree learning example
Entropy = (7/12)·[–(5/7)·log10(5/7) – (2/7)·log10(2/7)] + (5/12)·[–(1/5)·log10(1/5) – (4/5)·log10(4/5)] ≈ 0.24
Hungry?
5 T, 2 F 1 T, 4 F
Yes No
Entropy decrease = 0.30 – 0.24 = 0.06
Decision tree learning example
Entropy = (4/12)·[–(2/4)·log10(2/4) – (2/4)·log10(2/4)] + (8/12)·[–(4/8)·log10(4/8) – (4/8)·log10(4/8)] ≈ 0.30
Raining?
2 T, 2 F 4 T, 4 F
Yes No
Entropy decrease = 0.30 – 0.30 = 0
Decision tree learning example
Entropy = (5/12)·[–(3/5)·log10(3/5) – (2/5)·log10(2/5)] + (7/12)·[–(3/7)·log10(3/7) – (4/7)·log10(4/7)] ≈ 0.29
Reservation?
3 T, 2 F 3 T, 4 F
Yes No
Entropy decrease = 0.30 – 0.29 = 0.01
Decision tree learning example
Entropy = (2/12)·[–(0/2)·log10(0/2) – (2/2)·log10(2/2)] + (4/12)·[–(4/4)·log10(4/4) – (0/4)·log10(0/4)] + (6/12)·[–(2/6)·log10(2/6) – (4/6)·log10(4/6)] ≈ 0.14 (taking 0·log 0 = 0)
Patrons?
2 F
4 T
None Full
Entropy decrease = 0.30 – 0.14 = 0.16
2 T, 4 F
Some
Decision tree learning example
Entropy = (6/12)·[–(3/6)·log10(3/6) – (3/6)·log10(3/6)] + (2/12)·[–(2/2)·log10(2/2) – (0/2)·log10(0/2)] + (4/12)·[–(1/4)·log10(1/4) – (3/4)·log10(3/4)] ≈ 0.23
Price
3 T, 3 F
2 T
$ $$$
Entropy decrease = 0.30 – 0.23 = 0.07
1 T, 3 F
$$
Decision tree learning example
Entropy = (2/12)·[–(1/2)·log10(1/2) – (1/2)·log10(1/2)] + (2/12)·[–(1/2)·log10(1/2) – (1/2)·log10(1/2)] + (4/12)·[–(2/4)·log10(2/4) – (2/4)·log10(2/4)] + (4/12)·[–(2/4)·log10(2/4) – (2/4)·log10(2/4)] ≈ 0.30
Type
1 T, 1 F
1 T, 1 F
French Burger
Entropy decrease = 0.30 – 0.30 = 0
2 T, 2 F
Italian
2 T, 2 F
Thai
Decision tree learning example
Entropy = (6/12)·[–(4/6)·log10(4/6) – (2/6)·log10(2/6)] + (2/12)·[–(1/2)·log10(1/2) – (1/2)·log10(1/2)] + (2/12)·[–(1/2)·log10(1/2) – (1/2)·log10(1/2)] + (2/12)·[–(0/2)·log10(0/2) – (2/2)·log10(2/2)] ≈ 0.24
Est. waiting
time
4 T, 2 F
1 T, 1 F
0-10 > 60
Entropy decrease = 0.30 – 0.24 = 0.06
2 F
10-30
1 T, 1 F
30-60
Decision tree learning example
Patrons?
2 F
4 T
None Full
Largest entropy decrease (0.16)
achieved by splitting on Patrons.
2 T, 4 F
Some
X? Continue like this, making new
splits, always purifying nodes.
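The "continue like this" step can be sketched as a recursive procedure: pick the attribute with the largest information gain, split, and recurse on every impure subset. The version below is only illustrative (base-10 logs as before), and the four-row dataset is a made-up fragment rather than the full 12-example restaurant table.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n, 10) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Attribute whose split leaves the smallest remainder (largest entropy decrease)."""
    def remainder(attr):
        total = 0.0
        for value in set(r[attr] for r in rows):
            subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
            total += len(subset) / len(rows) * entropy(subset)
        return total
    return min(attributes, key=remainder)

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:                      # pure node: stop
        return labels[0]
    if not attributes:                             # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    tree = {attr: {}}
    for value in set(r[attr] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[attr] == value]
        tree[attr][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attributes if a != attr])
    return tree

# Made-up fragment, for illustration only.
rows = [{"Patrons": "Some", "Hungry": "Yes"}, {"Patrons": "None", "Hungry": "Yes"},
        {"Patrons": "Full", "Hungry": "Yes"}, {"Patrons": "Full", "Hungry": "No"}]
labels = [True, False, True, False]
print(build_tree(rows, labels, ["Patrons", "Hungry"]))
```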
Decision tree learning example
Induced tree (from examples)
Decision tree learning example
True tree
Decision tree learning example
Induced tree (from examples)
Cannot make it more complex
than what the data supports.
Support Vector Machine -
Linear Classifiers
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
How would you
classify this data?
w x + b<0
w x + b>0
Support Vector Machine -
Linear Classifiers
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
Any of these
would be fine..
..but which is
best?
Support Vector Machine -
Linear Classifiers
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
How would you
classify this data?
Misclassified
to +1 class
Support Vector Machine -
Linear Classifiers
Classifier Margin
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
Define the margin
of a linear
classifier as the
width that the
boundary could be
increased by
before hitting a
datapoint.
Maximum Margin
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
The maximum
margin linear
classifier is the
linear classifier
with the, um,
maximum margin.
This is the
simplest kind of
SVM (Called an
LSVM)
Linear SVM
Support Vectors
are those
datapoints that
the margin
pushes up
against
1. Maximizing the margin is good
according to intuition and PAC theory
2. Implies that only support vectors are
important; other training examples
are ignorable.
3. Empirically it works very very well.
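As a small illustration of these points (scikit-learn and NumPy assumed; the toy data are invented), the snippet below fits a linear SVM with a large C, prints the learned decision function sign(w·x + b), and shows that only a handful of training points end up as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Invented, linearly separable 2-D data: class -1 on the lower left, +1 on the upper right.
X = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 0.5],
              [3.0, 3.0], [3.5, 2.0], [2.5, 3.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)        # very large C approximates a hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("decision function: sign(%.2f*x1 + %.2f*x2 + %.2f)" % (w[0], w[1], b))
print("support vectors:")
print(clf.support_vectors_)              # only the points that the margin pushes against
print("training predictions:", clf.predict(X))
```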
Non-linear SVMs
 Datasets that are linearly separable with some noise
work out great:
 But what are we going to do if the dataset is just too hard?
 How about… mapping data to a higher-dimensional
space:
[Figure: 1-D data on the x-axis that no single threshold can separate becomes linearly separable after adding the feature x2.]
Non-linear SVMs: Feature spaces
 General idea: the original input space can always be
mapped to some higher-dimensional feature space
where the training set is separable:
Φ: x → φ(x)
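A tiny numeric version of the idea in the figure above (NumPy assumed; the numbers are made up): 1-D points that no single threshold can separate become linearly separable once the extra feature x2 (x squared) is added.

```python
import numpy as np

# Made-up 1-D data: class +1 sits in the middle, class -1 on both sides,
# so no single threshold on x separates the two classes.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([-1, -1, +1, +1, +1, -1, -1])

# Map each point into the 2-D feature space phi(x) = (x, x**2).
phi = np.column_stack([x, x ** 2])

# In that space the classes are separated by the horizontal line x**2 = 2,
# i.e. the linear rule sign(2 - x**2) classifies every point correctly.
pred = np.sign(2.0 - phi[:, 1])
print(np.all(pred == y))   # True
```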
The “Kernel Trick”
 The linear classifier relies on the dot product between vectors: K(xi,xj) = xi^T xj
 If every data point is mapped into a high-dimensional space via some
transformation Φ: x → φ(x), the dot product becomes: K(xi,xj) = φ(xi)^T φ(xj)
 A kernel function is a function that corresponds to an inner product in
some expanded feature space.
 Example:
2-dimensional vectors x = [x1 x2]; let K(xi,xj) = (1 + xi^T xj)^2.
Need to show that K(xi,xj) = φ(xi)^T φ(xj):
K(xi,xj) = (1 + xi^T xj)^2
= 1 + xi1^2 xj1^2 + 2 xi1xj1 xi2xj2 + xi2^2 xj2^2 + 2 xi1xj1 + 2 xi2xj2
= [1  xi1^2  √2 xi1xi2  xi2^2  √2 xi1  √2 xi2]^T [1  xj1^2  √2 xj1xj2  xj2^2  √2 xj1  √2 xj2]
= φ(xi)^T φ(xj),  where φ(x) = [1  x1^2  √2 x1x2  x2^2  √2 x1  √2 x2]
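The algebra above is easy to check numerically. The short script below (NumPy assumed; the two vectors are chosen arbitrarily) confirms that the kernel value (1 + xi^T xj)^2 computed in the original 2-D space equals the dot product of the expanded feature vectors φ(xi) and φ(xj).

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel (1 + x.y)^2 in two dimensions."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

def quadratic_kernel(xi, xj):
    return (1.0 + xi @ xj) ** 2

xi = np.array([0.7, -1.2])               # arbitrary 2-D vectors
xj = np.array([2.0, 0.5])

print(quadratic_kernel(xi, xj))          # kernel evaluated directly in input space
print(phi(xi) @ phi(xj))                 # the same value via the explicit feature map
```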
Examples of Kernel Functions
 Linear: K(xi,xj) = xi^T xj
 Polynomial of power p: K(xi,xj) = (1 + xi^T xj)^p
 Gaussian (radial-basis function network): K(xi,xj) = exp( –‖xi – xj‖^2 / (2σ^2) )
 Sigmoid: K(xi,xj) = tanh(β0 xi^T xj + β1)
CLUSTERING
Introduction
• The goal of clustering is to
– group data points that are close (or similar) to
each other
– identify such groupings (or clusters) in an
unsupervised manner
• Unsupervised: no information is provided to the
algorithm on which data points belong to which
clusters
• Example
[Figure: a scatter of unlabeled data points.]
What should the
clusters be for
these data
points?
Clustering Algorithms
• Exclusive clustering: objects are grouped in an
exclusive way; an object that belongs to one
cluster cannot belong to any other cluster.
– Ex: k-means
• Overlapping clustering: each point may
belong to two or more clusters with
different degrees of membership.
– Ex: fuzzy c-means
• Hierarchical clustering: based on the union
between the two nearest clusters.
Hierarchical clustering
• Given the input set S, the goal is to produce a
hierarchy (dendrogram) in which nodes
represent subsets of S.
• Features of the tree obtained:
– The root is the whole input set S.
– The leaves are the individual elements of S.
– The internal nodes are defined as the union of their
children.
• Each level of the tree represents a partition of
the input data into several (nested) clusters
or groups.
Hierarchical clustering
Hierarchical clustering
• There are two styles of hierarchical
clustering algorithms to build a tree from
the input set S:
– Agglomerative (bottom-up):
• Beginning with singletons (sets with 1 element)
• Merging them until S is achieved as the root.
• It is the most common approach.
– Divisive (top-down):
• Recursively partitioning S until singleton sets are
reached.
Hierarchical clustering
• Input: a pairwise distance matrix involving all instances
in S
• Algorithm
1. Place each instance of S in its own cluster (singleton),
creating the list of clusters L (initially, the leaves of T):
L= S1, S2, S3, ..., Sn-1, Sn.
2. Compute a merging cost function between every pair
of elements in L to find the two closest clusters {Si, Sj}
which will be the cheapest couple to merge.
3. Remove Si and Sj from L.
4. Merge Si and Sj to create a new internal node Sij in T
which will be the parent of Si and Sj in the resulting
tree.
5. Go to Step 2 until there is only one set remaining.
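A minimal Python sketch of this agglomerative procedure (NumPy assumed; single linkage is used as the merging cost, and the toy points are invented). It follows the five steps above but, for brevity, records only the merge history instead of building an explicit tree, and it recomputes all pairwise costs at every step, which is fine for small inputs.

```python
import numpy as np

def single_linkage_merges(points):
    """Agglomerative clustering with single linkage; returns the list of merges."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]           # step 1: singletons
    history = []
    while len(clusters) > 1:                                # step 5: until one set remains
        best = None
        for a in range(len(clusters)):                      # step 2: cheapest pair to merge
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], round(d, 3)))
        merged = clusters[a] + clusters[b]                  # step 4: new internal node
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]   # step 3
        clusters.append(merged)
    return history

pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]      # invented points, two groups
for left, right, dist in single_linkage_merges(pts):
    print(left, "+", right, "at distance", dist)
```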
Hierarchical clustering
• Step 2 can be done in different ways, which is what
distinguishes single-linkage from complete-linkage and
average-linkage clustering.
– In single-linkage clustering (also called the
connectedness or minimum method): we consider the
distance between one cluster and another cluster to be
equal to the shortest distance from any member of one
cluster to any member of the other cluster.
– In complete-linkage clustering (also called the diameter
or maximum method), we consider the distance
between one cluster and another cluster to be equal to
the greatest distance from any member of one cluster
to any member of the other cluster.
– In average-linkage clustering, we consider the distance
between one cluster and another cluster to be equal to
the average distance from any member of one cluster
to any member of the other cluster.
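All three linkage criteria are also available off the shelf; a brief sketch assuming SciPy and NumPy are installed (the points are invented):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented 2-D points forming two loose groups.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                       # dendrogram as a merge table
    labels = fcluster(Z, t=2, criterion="maxclust")     # cut the tree into two clusters
    print(method, labels)
```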
Hierarchical clustering:
example
Hierarchical clustering:
example using single linkage
Hierarchical clustering:
forming clusters
• Forming clusters from dendrograms
Hierarchical clustering
• Advantages
– Dendrograms are great for visualization
– Provides hierarchical relations between clusters
– Shown to be able to capture concentric
clusters
• Disadvantages
– Not easy to define levels for clusters
– Experiments showed that other clustering
techniques outperform hierarchical clustering
K-means
• Input: n objects (or points) and a number k
• Algorithm
1. Randomly place K points into the space
represented by the objects that are being
clustered. These points represent initial group
centroids.
2. Assign each object to the group that has the
closest centroid.
3. When all objects have been assigned, recalculate
the positions of the K centroids.
4. Repeat Steps 2 and 3 until the stopping criterion is met.
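A minimal NumPy sketch of these four steps (the toy data are invented; initial centroids are drawn from the data points themselves, one common but by no means the only choice):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # step 1
    for _ in range(n_iter):
        # step 2: assign each object to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # step 3: recompute the centroid positions (keep the old one if a cluster empties)
        new_centroids = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                    # step 4: no change
            break
        centroids = new_centroids
    return centroids, assign

X = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]   # invented points, two clear groups
centroids, assign = kmeans(X, k=2)
print("centroids:")
print(centroids)
print("assignments:", assign)
```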
K-means
• Stopping criteria:
– No change in the members of all clusters
– when the squared error is less than some small
threshold value a
• Squared error: se = Σi=1..k Σp∈ci |p – mi|^2
– where mi is the mean of all instances in cluster ci
• se(j) < a
• Properties of k-means
– Guaranteed to converge
– Guaranteed to reach a local optimum, not
necessarily the global optimum.
• Example:
http://www.kdnuggets.com/dmcourse/data_mining_course/mod-13-clustering.ppt
K-means
• Pros:
– Low complexity
• complexity is O(nkt), where t = #iterations
• Cons:
– Necessity of specifying k
– Sensitive to noise and outlier data points
• Outliers: a small number of such data can
substantially influence the mean value)
– Clusters are sensitive to initial assignment of
centroids
• K-means is not a deterministic algorithm
• Clusters can be inconsistent from one run to another
Fuzzy c-means
• An extension of k-means
• Hierarchical and k-means clustering generate hard
partitions – each data point can be assigned to
only one cluster
• Fuzzy c-means allows data points to be
assigned into more than one cluster
– each data point has a degree of membership
(or probability) of belonging to each cluster
Fuzzy c-means algorithm
• Let xi be a vector of values for data point gi.
1. Initialize the membership matrix U(0) = [ uij ] for data
point gi and cluster clj at random
2.At the k-th step, compute the fuzzy centroid
C(k) = [ cj ] for j = 1, .., nc, where nc is the
number of clusters, using
cj = [ Σi=1..n (uij)^m xi ] / [ Σi=1..n (uij)^m ]
where m is the fuzzy parameter and n is the number
of data points.
Fuzzy c-means algorithm
3. Update the fuzzy membership U(k) = [ uij ], using
uij = 1 / Σl=1..nc ( ‖xi – cj‖ / ‖xi – cl‖ )^(2/(m–1))
4. If ||U(k) – U(k–1)|| < ε, then STOP, else return to step 2.
5. Determine membership cutoff
– For each data point gi, assign gi to cluster clj if uij of
U(k) > a
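A compact NumPy sketch of these update equations (the data are invented; m = 2 is a common choice for the fuzzy parameter, and a small constant guards against division by zero when a point coincides with a centroid):

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, eps=1e-4, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)          # step 1: random memberships, rows sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centroids = (Um.T @ X) / Um.sum(axis=0)[:, None]            # step 2: fuzzy centroids
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        # step 3: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
        U_new = 1.0 / ((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.linalg.norm(U_new - U) < eps:                          # step 4: converged
            U = U_new
            break
        U = U_new
    return centroids, U

X = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]   # invented points, two clear groups
centroids, U = fuzzy_c_means(X, n_clusters=2)
print("centroids:")
print(np.round(centroids, 2))
print("memberships (each row sums to 1):")
print(np.round(U, 2))
```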
Fuzzy c-means
• Pros:
– Allows a data point to be in multiple clusters
– A more natural representation of the behavior
of genes
• genes usually are involved in multiple functions
• Cons:
– Need to define c, the number of clusters
– Need to determine membership cutoff value
– Clusters are sensitive to initial assignment of
centroids
• Fuzzy c-means is not a deterministic algorithm