Instance based learning
SWAPNA.C
1.Introduction
 Instance-based learning methods such as nearest neighbour and locally weighted regression are conceptually straightforward approaches to approximating real-valued or discrete-valued target functions.
 This family of methods is also called lazy learning.
 Learning in these algorithms consists of simply storing the presented training data.
 When a new query instance is encountered, a set of similar related instances is retrieved from memory and used to classify the new query instance.
 It mainly relies on a memorization technique to learn.
 Instance-based learning can be achieved by three approaches:
 1. Lazy learning (nearest neighbourhood), k-NN
 2. Radial basis functions (based on weighted methods)
 3. Case-based reasoning
 Instance-based approaches can construct a different approximation to the target function for each distinct query instance that must be classified.
 In fact, they typically construct only a local approximation to the target function for each new instance, rather than one that must fit the entire instance space.
 Advantages:
 Instance-based methods can also use more complex, symbolic representations for instances.
 In case-based learning, instances are represented in this fashion and the process for identifying "neighbouring" instances is elaborated accordingly.
 Case-based reasoning has been applied to tasks such as storing and reusing past experience at a help desk, reasoning about legal cases by referring to previous cases, and solving complex scheduling problems by reusing relevant portions of previously solved problems.
 Disadvantages:
 The cost of classifying new instances can be high. Techniques for efficiently indexing training examples are therefore a significant practical issue in reducing the computation required at query time.
 A second disadvantage, especially for nearest-neighbour approaches, is that they typically consider all attributes of the instances when attempting to retrieve similar training examples from memory.
 If the target concept depends on only a few of the many available attributes, then the instances that are truly most "similar" may well be a large distance apart.
2.K-NEAREST NEIGHBOR LEARNING
 This algorithm assumes all instances correspond to points in the n-dimensional space $\mathbb{R}^n$. The nearest neighbours of an instance are defined in terms of the standard Euclidean distance.
 Let an arbitrary instance x be described by the feature vector
$\langle a_1(x), a_2(x), \ldots, a_n(x) \rangle$
where $a_r(x)$ denotes the value of the r-th attribute of instance x. The distance between two instances $x_i$ and $x_j$ is then defined to be
$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2}$
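 A minimal Python sketch of the plain k-NEAREST NEIGHBOR classifier built on this distance; the data, function names, and k values are illustrative, not from the slides:

```python
import numpy as np
from collections import Counter

def euclidean_distance(xi, xj):
    # d(xi, xj) = sqrt(sum_r (a_r(xi) - a_r(xj))^2)
    return np.sqrt(np.sum((np.asarray(xi) - np.asarray(xj)) ** 2))

def knn_classify(X_train, y_train, xq, k=5):
    """Classify query xq by a majority vote of its k nearest training examples."""
    dists = [euclidean_distance(x, xq) for x in X_train]
    nearest = np.argsort(dists)[:k]                  # indices of the k closest instances
    votes = Counter(y_train[i] for i in nearest)     # count class labels among the neighbours
    return votes.most_common(1)[0][0]

# toy usage: a boolean-valued target over points in a 2-D space
X = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])
y = np.array(["+", "+", "-", "-", "-"])
print(knn_classify(X, y, xq=[2.0, 2.0], k=1))   # "+"  (nearest single neighbour is positive)
print(knn_classify(X, y, xq=[2.0, 2.0], k=5))   # "-"  (majority of the five neighbours is negative)
```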
 Consider the operation of the k-NEAREST NEIGHBOR algorithm for the case where the instances are points in a two-dimensional space and the target function is boolean-valued.
 The positive and negative training examples are shown by "+" and "-" respectively. The 1-NEAREST NEIGHBOR algorithm classifies the query point x_q as a positive example, whereas the 5-NEAREST NEIGHBOR algorithm classifies it as a negative example.
 We can still ask what the implicit general function is, or what classifications would be assigned if we were to hold the training examples constant and query the algorithm with every possible instance in X.
 Consider the shape of the decision surface induced by 1-NEAREST NEIGHBOR over the entire instance space.
 The decision surface is a combination of convex polyhedra surrounding each of the training examples.
 For every training example, the polyhedron indicates the set of query points whose classification will be completely determined by that training example. Query points outside the polyhedron are closer to some other training example. This kind of diagram is often called the Voronoi diagram of the set of training examples.
 The k-NEAREST NEIGHBOR algorithm is easily
adapted to approximating continuous-valued target
functions.
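 The standard adaptation replaces the majority vote with the mean of the k nearest neighbours' target values, $\hat{f}(x_q) = \frac{1}{k} \sum_{i=1}^{k} f(x_i)$. A minimal sketch (names are illustrative):

```python
import numpy as np

def knn_regress(X_train, y_train, xq, k=5):
    """Approximate a continuous-valued target: mean f-value of the k nearest neighbours."""
    dists = np.linalg.norm(X_train - np.asarray(xq), axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()
```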
2.1 Distance-Weighted NEAREST NEIGHBOR Algorithm
 One obvious refinement of the k-NEAREST NEIGHBOR algorithm is to weight the contribution of each of the k neighbours according to their distance to the query point x_q, giving greater weight to closer neighbours.
 For example, in the version that approximates discrete-valued target functions, we might weight the vote of each neighbour according to the inverse square of its distance from x_q, i.e. use weights $w_i = \frac{1}{d(x_q, x_i)^2}$ and classify by the weighted vote $\hat{f}(x_q) = \arg\max_{v} \sum_{i=1}^{k} w_i\, \delta\!\left(v, f(x_i)\right)$, where $\delta(a, b) = 1$ if $a = b$ and 0 otherwise.
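 A sketch of this distance-weighted variant using the inverse-square weights above; the exact-match shortcut and the names are assumptions of the sketch:

```python
import numpy as np
from collections import defaultdict

def weighted_knn_classify(X_train, y_train, xq, k=5):
    """Distance-weighted k-NN: each neighbour votes with weight 1 / d(xq, xi)^2."""
    dists = np.linalg.norm(X_train - np.asarray(xq), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        if dists[i] == 0.0:
            return y_train[i]          # query coincides with a training instance: use its label
        votes[y_train[i]] += 1.0 / dists[i] ** 2
    return max(votes, key=votes.get)
```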
2.2 Remarks on K-NEAREST NEIGHBOUR
Algorithm
 The distance-weighted k-NEAREST NEIGHBOR Algorithm
is a highly effective inductive inference method for many
practical problems.
 It is robust to noisy training data and quite effective when it
is provided a sufficiently large set of training data.
 Note that by taking the weighted average of the k
neighbours nearest to the query point, it can smooth out
the impact of isolated noisy training examples.
 The inductive bias of k-NEAREST NEIGHBOR corresponds to an assumption that the classification of an instance x_q will be most similar to the classification of other instances that are nearby in Euclidean distance.
 In k-NEAREST NEIGHBOR, the distance between neighbours can be dominated by a large number of irrelevant attributes; this difficulty is sometimes referred to as the curse of dimensionality.
 One way to overcome this problem is to weight each attribute differently when calculating the distance between two instances. This corresponds to stretching the axes in the Euclidean space, shortening the axes that correspond to less relevant attributes and lengthening the axes that correspond to more relevant attributes.
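 A small sketch of such an attribute-weighted distance; the scale factors here are hypothetical values chosen only for illustration:

```python
import numpy as np

def scaled_distance(xi, xj, scales):
    """Weighted Euclidean distance: per-attribute scale factors stretch or shrink each axis.
    Setting a scale factor to zero is equivalent to dropping that attribute entirely."""
    diff = scales * (np.asarray(xi) - np.asarray(xj))
    return np.sqrt(np.sum(diff ** 2))

# e.g. keep the first attribute, downweight the second, and ignore the third
d = scaled_distance([1.0, 4.0, 7.0], [2.0, 0.0, 3.0], scales=np.array([1.0, 0.25, 0.0]))
```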
 The process of stretching the axes in order to optimize the performance of k-NEAREST NEIGHBOR algorithms provides a mechanism for suppressing the impact of irrelevant attributes.
 An alternative is to completely eliminate the least relevant attributes from the instance space. This is equivalent to setting some of the scaling factors to zero.
 Moore and Lee discuss efficient cross-validation methods for selecting relevant subsets of the attributes for k-NEAREST NEIGHBOR algorithms.
 They explore methods based on leave-one-out cross-validation, in which the set of m training instances is repeatedly divided into a training set of size m-1 and a test set of size 1.
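 The slides mention leave-one-out cross-validation for selecting attribute subsets; the same machinery can also be used to select k, as in this illustrative sketch:

```python
import numpy as np
from collections import Counter

def knn_classify(X, y, xq, k):
    nearest = np.argsort(np.linalg.norm(X - xq, axis=1))[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

def loo_error(X, y, k):
    """Leave-one-out cross-validation: train on m-1 instances, test on the held-out one."""
    m = len(X)
    mistakes = 0
    for i in range(m):
        keep = np.arange(m) != i                     # training set of size m-1
        pred = knn_classify(X[keep], y[keep], X[i], k)
        mistakes += (pred != y[i])
    return mistakes / m

# pick the k with the lowest leave-one-out error, e.g.
# best_k = min([1, 3, 5, 7], key=lambda k: loo_error(X, y, k))
```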
 There is, however, a risk of overfitting, and the approach of locally stretching the axes is much less common.
 A practical issue in applying k-NEAREST NEIGHBOR is efficient memory indexing.
 Because this algorithm delays all processing until a new query is received, significant computation can be required to process each new query.
 One indexing method is the kd-tree, in which instances are stored at the leaves of a tree, with nearby instances stored at the same or nearby nodes.
 The internal nodes of the tree sort the new query x_q to the relevant leaf by testing selected attributes of x_q.
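 One readily available kd-tree implementation in practice is SciPy's cKDTree; a small usage sketch (the data here is random and purely illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))      # training instances are stored once in the tree
tree = cKDTree(X)                     # nearby instances end up in the same or nearby leaves

xq = np.zeros(3)
dists, idx = tree.query(xq, k=5)      # retrieve the 5 nearest stored instances for the query
neighbours = X[idx]
```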
3.LOCALLY WEIGHTED REGRESSION
 This is a generalization of the k-NEAREST NEIGHBOR approach.
 The phrase "locally weighted regression" is called local because the function is approximated based only on data near the query point, weighted because the contribution of each training example is weighted by its distance from the query point, and regression because this is the term used widely in the statistical learning community for the problem of approximating real-valued functions.
 It constructs an explicit approximation to f over a local region surrounding the query point x_q.
3.1 Locally Weighted Linear Regression
 Consider the case of locally weighted regression in which the target function f is approximated near x_q using a linear function of the form
$\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$
where, as before, $a_i(x)$ denotes the value of the i-th attribute of instance x.
 We choose the weights that minimize the squared error summed over the set D of training examples:
$E = \frac{1}{2} \sum_{x \in D} \left( f(x) - \hat{f}(x) \right)^2$
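 A minimal numpy sketch of locally weighted linear regression at a single query point: it weights each training example by a Gaussian kernel of its distance to x_q and, for brevity, solves the kernel-weighted least-squares problem in closed form rather than by the gradient descent rule discussed next; tau is an assumed kernel-width hyperparameter:

```python
import numpy as np

def lwr_predict(X, y, xq, tau=1.0):
    """Locally weighted linear regression evaluated at one query point xq."""
    d = np.linalg.norm(X - xq, axis=1)
    K = np.exp(-d ** 2 / (2 * tau ** 2))        # kernel weight of each training example
    A = np.hstack([np.ones((len(X), 1)), X])    # prepend a constant column for w0
    W = np.diag(K)
    w = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return np.append(1.0, xq) @ w               # f_hat(xq) = w0 + w1*a1(xq) + ... + wn*an(xq)
```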
 Gradient descent training rule (minimizing E with a linear unit, where η is a small learning-rate constant):
$\Delta w_j = \eta \sum_{x \in D} \left( f(x) - \hat{f}(x) \right) a_j(x)$
 Error criterion E: there are several ways to redefine E so that it is local to the query point x_q:
 1. Minimize the squared error over just the k nearest neighbours:
$E_1(x_q) = \frac{1}{2} \sum_{x \in\, k \text{ nearest nbrs of } x_q} \left( f(x) - \hat{f}(x) \right)^2$
 2. Minimize the squared error over the entire set D, weighting the error of each training example by a decreasing function K of its distance from x_q:
$E_2(x_q) = \frac{1}{2} \sum_{x \in D} \left( f(x) - \hat{f}(x) \right)^2 K\!\left(d(x_q, x)\right)$
 3. Combine 1 and 2:
$E_3(x_q) = \frac{1}{2} \sum_{x \in\, k \text{ nearest nbrs of } x_q} \left( f(x) - \hat{f}(x) \right)^2 K\!\left(d(x_q, x)\right)$
 Criterion two is perhaps the most esthetically pleasing because it allows every training example to have an impact on the classification of x_q.
 However, this approach requires computation that grows linearly with the number of training examples.
 Criterion three is a good approximation to criterion two and has the advantage that its computational cost is independent of the total number of training examples; its cost depends only on the number k of neighbours considered.
 If we choose criterion three and rederive the gradient descent rule, we obtain
$\Delta w_j = \eta \sum_{x \in\, k \text{ nearest nbrs of } x_q} K\!\left(d(x_q, x)\right) \left( f(x) - \hat{f}(x) \right) a_j(x)$
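 A sketch of this criterion-three update implemented literally as an iterative gradient procedure; the learning rate eta, the number of steps, and the kernel width tau are illustrative assumptions:

```python
import numpy as np

def lwr_gradient_fit(X, y, xq, k=5, tau=1.0, eta=0.001, n_steps=2000):
    """Fit the local linear model with the criterion-3 rule:
    delta w_j = eta * sum over the k nearest neighbours of K(d(xq, x)) (f(x) - f_hat(x)) a_j(x)."""
    d = np.linalg.norm(X - xq, axis=1)
    nn = np.argsort(d)[:k]                       # only the k nearest neighbours contribute
    A = np.hstack([np.ones((k, 1)), X[nn]])      # a_0(x) = 1 for the constant term w0
    K = np.exp(-d[nn] ** 2 / (2 * tau ** 2))     # kernel weight of each neighbour
    w = np.zeros(A.shape[1])
    for _ in range(n_steps):
        err = y[nn] - A @ w                      # f(x) - f_hat(x) for each neighbour
        w += eta * A.T @ (K * err)               # kernel-weighted gradient step
    return np.append(1.0, xq) @ w
```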
3.2 Remarks on Locally Weighted Regression
 The literature on locally weighted regression contains a broad
range of alternative methods for distance weighting the training
examples, and a range of methods for locally approximating the
target function.
 In most cases, the target function is approximated by a constant,
linear, or quadratic function.
 More complex functional forms are not often found because
 (1) the cost of fitting more complex functions for each query instance is prohibitively high, and
 (2) these simple approximations model the target function quite well over a sufficiently small subregion of the instance space.
4.RADIAL BASIS FUNCTIONS
 One approach to function approximation that is closely related to distance-weighted regression, and also to artificial neural networks, is learning with radial basis functions.
 Gaussian kernel function: in this approach the learned hypothesis has the form
$\hat{f}(x) = w_0 + \sum_{u=1}^{k} w_u K_u\!\left(d(x_u, x)\right)$
where each $x_u$ is an instance from X and the kernel function $K_u$ decreases as the distance $d(x_u, x)$ increases. A common choice is the Gaussian kernel centred at $x_u$ with variance $\sigma_u^2$:
$K_u\!\left(d(x_u, x)\right) = e^{-\frac{1}{2\sigma_u^2} d^2(x_u, x)}$
 Given a set of training examples of the target function, RBF networks are typically trained in a two-stage process.
 First, the number k of hidden units is determined, and each hidden unit u is defined by choosing the values of x_u and σ_u that define its kernel function K_u(d(x_u, x)).
 Second, the weights w_u are trained to maximize the fit of the network to the training data, using the global error criterion.
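 A minimal numpy sketch of this two-stage procedure, fitting the output weights by least squares in the second stage; the choice of centers and the shared width sigma are assumptions of the sketch:

```python
import numpy as np

def train_rbf(X, y, centers, sigma=1.0):
    """Two-stage RBF training sketch:
    (1) the hidden units are fixed by the chosen centers x_u and width sigma;
    (2) the output weights w_0..w_k are then fit to the training data (here by least squares)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # (m, k) distances
    Phi = np.exp(-d ** 2 / (2 * sigma ** 2))                          # Gaussian kernel activations
    Phi = np.hstack([np.ones((len(X), 1)), Phi])                      # constant column for w_0
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def rbf_predict(w, centers, xq, sigma=1.0):
    d = np.linalg.norm(centers - xq, axis=1)
    phi = np.append(1.0, np.exp(-d ** 2 / (2 * sigma ** 2)))
    return phi @ w

# e.g. one Gaussian kernel per training example (centers = X) can fit the data exactly,
# or a smaller set of centers (say, from clustering) when the training set is large.
```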
 Several alternative methods have been proposed for choosing an appropriate number of hidden units or, equivalently, kernel functions.
 One approach is to allocate a Gaussian kernel function for each training example (x_i, f(x_i)), centering this Gaussian at the point x_i. Each of these kernels may be assigned the same width σ.
 One advantage of this choice of kernel functions is that it allows the RBF network to fit the training data exactly. That is, for any set of m training examples the weights $w_0, \ldots, w_m$ for combining the m Gaussian kernel functions can be set so that $\hat{f}(x_i) = f(x_i)$ for each training example <x_i, f(x_i)>.
 A second approach is to choose a set of kernel functions that is smaller than the number of training examples. This can be much more efficient, especially when the number of training examples is large.
 Alternatively, we may wish to distribute the centers non-uniformly, especially if the instances themselves are found to be distributed non-uniformly over X.
 Radial basis function networks provide a global approximation to the target function, represented by a linear combination of many local kernel functions.
 The value of any given kernel function is non-negligible only when the input x falls into the region defined by its particular center and width.
 Thus, the network can be viewed as a smooth linear combination of many local approximations to the target function.
 One key advantage of RBF networks is that they can be trained much more efficiently than feedforward networks trained with BACKPROPAGATION.
 This follows from the fact that the input layer and the output layer of an RBF network are trained separately.
5.CASE-BASED REASONING
 Instance-based methods such as k-NEAREST NEIGHBOR and locally weighted regression share three key properties.
 First, they are lazy learning methods in that they defer the
decision of how to generalize beyond the training data
until a new query instance is observed.
 Second, they classify new query instances by analyzing
similar instances while ignoring instances that are very
different from the query.
 Third, they represent instances as real-valued points in an
n-dimensional Euclidean space.
 In CBR, instances are typically represented using richer symbolic descriptions, and the methods used to retrieve similar instances are correspondingly more elaborate.
 CBR has been applied to problems such as conceptual design of mechanical devices based on a stored library of previous designs (Sycara et al. 1992), reasoning about new legal cases based on previous rulings (Ashley 1990), and solving planning and scheduling problems by reusing and combining portions of previous solutions to similar problems.
 For example, the CADET system supports the conceptual design of mechanical devices based on a library of previously stored designs.
 Several generic properties of case-based reasoning systems distinguish them from approaches such as k-NEAREST NEIGHBOR:
 Instances or cases may be represented by rich symbolic descriptions, such as the function graphs used in CADET. This may require a similarity metric different from Euclidean distance, such as the size of the largest shared subgraph between two function graphs.
 Multiple retrieved cases may be combined to form the solution to the new problem. This is similar to the k-NEAREST NEIGHBOUR approach in that multiple similar cases are used to construct a response for the new query. However, the process for combining these multiple retrieved cases can be very different, relying on knowledge-based reasoning rather than statistical methods.
 There may be a tight coupling between case retrieval, knowledge-based reasoning, and problem solving.
 One simple example of this is found in CADET, which uses generic knowledge about influences to rewrite function graphs during its attempt to find matching cases.
 Other systems have been developed that more fully integrate case-based reasoning into general search-based problem-solving systems. Two examples are ANAPRON and PRODIGY/ANALOGY.
6.REMARKS ON LAZY AND EAGER LEARNING
 We considered three lazy learning methods: the k-NEAREST NEIGHBOR algorithm, locally weighted regression, and case-based reasoning.
 These methods are lazy because they defer the decision of how to generalize beyond the training data until each new query instance is encountered.
 We also considered one eager learning method: the method for learning radial basis function networks.
 Lazy and eager methods differ in computation time and in the classifications they produce for new queries.
 Lazy methods will generally require less computation during training, but more computation when they must predict the target value for a new query.
 The key difference between lazy and eager methods in this regard is that lazy methods may consider the query instance x_q when deciding how to generalize beyond the training data D, whereas eager methods cannot: by the time they observe the query instance x_q, they have already chosen their (global) approximation to the target function.
 For each new query x_q, a lazy method such as locally weighted linear regression generalizes from the training data by choosing a new hypothesis based on the training examples near x_q.
 In contrast, an eager learner that uses the same hypothesis space of linear functions must choose its approximation before the queries are observed.
 The eager learner must therefore commit to a single linear function hypothesis that covers the entire instance space and all future queries.
 The lazy method effectively uses a richer hypothesis space because it uses many different local linear functions to form its implicit global approximation to the target function.
 A lazy learner has the option of (implicitly) representing the target function by a combination of many local approximations, whereas an eager learner must commit at training time to a single global approximation.
 The distinction between eager and lazy learning is thus related to the distinction between global and local approximations to the target function.
 The RBF learning methods we discussed are eager methods that commit to a global approximation to the target function at training time.
 RBF networks are built eagerly from local approximations
centred around the training examples, or around clusters
of training examples, but not around the unknown future
query points.
 Lazy methods have the option of selecting a different hypothesis or local approximation to the target function for each query instance.
 Eager methods using the same hypothesis space are more
restricted because they must commit to a single hypothesis
that covers the entire instance space.
 Eager methods can, of course, employ hypothesis spaces
that combine multiple local approximations, as in RBF
networks.