Introduction to Classification
Amit Praseed
October 9, 2019
Introduction
The question of whether machines can learn from experience and emulate humans has always intrigued researchers.
The Turing Test (1950): a “test of a machine’s ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human”.
This, in turn, led to the development of several mechanisms for blocking automated access, such as CAPTCHAs.
In 2014, Google demonstrated that its algorithms could defeat CAPTCHAs with 99.8% accuracy.
Types of Learning
Supervised Learning
Labelled data (data + class labels) is provided as input to the system.
When a new unlabelled example is provided to the system, it maps it to a class based on the examples it has encountered.
Eg: Classification
Types of Learning
Unsupervised Learning
Unlabelled data is provided as input to the system.
The system identifies patterns in the data and creates internal groupings.
Eg: Clustering
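To make the unsupervised setting concrete, here is a brief sketch (an addition to these notes, not part of the original slides) in which k-means is given only unlabelled points and forms its own groupings; the toy data is made up for illustration.

```python
# Unsupervised learning: only the data X is provided, no class labels.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 1], [2, 2],    # one natural group of points
              [8, 9], [9, 8], [9, 9]])   # another natural group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # internal grouping discovered by the algorithm, e.g. [0 0 0 1 1 1]
print(km.cluster_centers_)  # centres of the two discovered clusters
```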
Basic Idea of Classification
Input: Data set X = {x_1, x_2, ..., x_n} and associated label set Y = {y_1, y_2, ..., y_n}
Learning: Identify a function/procedure f, based on X and Y, such that f(x_i) = y_i.
Testing: Given a new input x', predict its class label y' = f(x').
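As a minimal sketch of this contract (illustrative data only; any classifier with a fit/predict interface would do, here scikit-learn's nearest-neighbour classifier):

```python
# Learning f from (X, Y), then predicting the label of a new input x'.
from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.5], [9.0, 9.2]]  # data set X = {x_1, ..., x_n}
Y = [0, 0, 1, 1]                                      # label set Y = {y_1, ..., y_n}

f = KNeighborsClassifier(n_neighbors=1).fit(X, Y)  # Learning: identify f with f(x_i) = y_i
print(f.predict([[1.2, 2.1]]))                     # Testing: y' = f(x') -> [0]
```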
Features and Feature Vectors
Each data item used in classification is represented by its features.
The array holding all of a data item's features and their corresponding values is called a feature vector.
Eg: The popular open-source Iris data set contains 150 samples of flowers belonging to three classes (50 per class). For each sample, four features are measured: the length and width of the sepals and petals. A particular sample may look like [5.1, 3.5, 1.4, 0.2], so the sample is said to have 4 dimensions.
This means that every data item can be represented as a point in an n-dimensional data space.
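The Iris feature vectors can be inspected directly; a short sketch assuming scikit-learn, which bundles the data set:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4): 150 samples, each a 4-dimensional feature vector
print(iris.feature_names)  # sepal length/width and petal length/width (in cm)
print(iris.data[0])        # one feature vector: [5.1 3.5 1.4 0.2]
print(iris.target_names)   # the three classes: setosa, versicolor, virginica
```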
Geometric View of Classification
[Figure: data points plotted in a two-dimensional feature space (x and y axes, scale 1 to 10)]
The Nearest Neighbour Approach - Example 1
[Figure: nearest-neighbour example in the same two-dimensional feature space]
The Nearest Neighbour Approach - Example 2
[Figures: a second nearest-neighbour example in the same feature space, shown across two slides]
k Nearest Neighbours (kNN) Approach
[Figure: classification of a query point using its k = 3 nearest neighbours]
k Nearest Neighbours Approach
[Figure: k-nearest-neighbour classification in the same feature space]
Advantages and Disadvantages of kNN
Advantages
Simple and easy to implement
Lazy learner: no training phase required
Only two parameters: k and the distance measure (illustrated in the sketch below)
Disadvantages
High complexity for large datasets and high-dimensional data
Does not work well with categorical data
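The following from-scratch sketch (an illustration added here, not code from the lecture) shows the kNN decision rule, with k and the distance measure as the only choices to make:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, distance=None):
    """Classify x_query by majority vote among its k nearest training points."""
    if distance is None:
        distance = lambda a, b: np.linalg.norm(a - b)  # Euclidean distance by default
    dists = [distance(x, x_query) for x in X_train]
    nearest = np.argsort(dists)[:k]                    # indices of the k closest training points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                  # majority class among the neighbours

X_train = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # -> 'A'
```

Note that there is no training step at all: the "model" is simply the stored training data, which is what makes kNN a lazy learner.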
Scalability of kNN
The kNN classifier has a complexity of O(nd + nk), where n is the number of data points, d is the number of attributes or dimensions, and k is the number of neighbours.
Different mechanisms can be used to reduce the complexity of kNN, especially for large datasets (see the sketch after this list):
Parallelization
Exact Space Partitioning: kd-Trees, Ball Trees, Cover Trees
Approximate Neighbour Search: Space Partitioning Trees, Nearest Neighbour Graphs, Locality Sensitive Hashing
Dimensionality Reduction: Feature Extraction, Feature Selection
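In practice, libraries expose several of these mechanisms behind one interface. For example, scikit-learn's KNeighborsClassifier can switch between brute-force search, kd-trees and ball trees; the sketch below uses synthetic data purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((10_000, 3))                  # 10,000 points in 3 dimensions
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # synthetic class labels

for algorithm in ("brute", "kd_tree", "ball_tree"):
    clf = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm).fit(X, y)
    print(algorithm, clf.predict([[0.2, 0.9, 0.5]]))  # same answer, different search strategy
```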
Space Partitioning using kd-Trees
Similar to binary search trees.
Each node in a kd-tree stores a multidimensional point (vector).
Each level of the tree is aligned along one particular dimension and splits the search space along it (see the construction sketch below).
Each node is therefore confined to a particular bounding box in the search space.
Approximate neighbour search restricts itself to a single bounding box, thus reducing complexity.
Exact neighbour search may have to back-track into neighbouring boxes and is therefore more expensive.
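A minimal construction sketch (a simplified illustration; the original slides may have built their tree differently): split at the median point, cycling through the dimensions level by level.

```python
from collections import namedtuple

Node = namedtuple("Node", ["point", "axis", "left", "right"])

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree; each level splits along one dimension."""
    if not points:
        return None
    axis = depth % len(points[0])                   # cycle through x, y, x, y, ... for 2-D data
    points = sorted(points, key=lambda p: p[axis])  # sort along the current dimension
    mid = len(points) // 2                          # the median point becomes the splitting node
    return Node(point=points[mid], axis=axis,
                left=build_kdtree(points[:mid], depth + 1),
                right=build_kdtree(points[mid + 1:], depth + 1))

pts = [(51, 75), (25, 40), (10, 30), (1, 10), (50, 50), (70, 70), (55, 1), (60, 80)]
root = build_kdtree(pts)
print(root.point, root.axis)  # (51, 75) 0 -- the median x-split, matching the root in the figures
```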
kd-Tree Construction
[Figures: step-by-step construction of a kd-tree, starting from the root (51,75) and ending with the complete tree over the points (51,75), (25,40), (10,30), (1,10), (50,50), (70,70), (55,1), (60,80)]
NN-Search for Query (1,5)
[Figure: kd-tree traversal for the query point (1,5)]
Current NN distance = 5 (to (1,10))
NN-Search for Query (12,33)
[Figure: kd-tree traversal for the query point (12,33)]
Current NN distance = √13 ≈ 3.6055 (to (10,30))
NN-Search for Query (50,2)
[Figure: kd-tree traversal for the query point (50,2)]
Initial NN distance = √2384 ≈ 48.83 (to (10,30))
Final NN distance = √26 ≈ 5.10 (to (55,1))
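The worked answers above can be cross-checked with an off-the-shelf kd-tree; a quick sketch assuming scipy is available:

```python
import numpy as np
from scipy.spatial import KDTree

points = np.array([(51, 75), (25, 40), (10, 30), (1, 10),
                   (50, 50), (70, 70), (55, 1), (60, 80)])
tree = KDTree(points)

for query in [(1, 5), (12, 33), (50, 2)]:
    dist, idx = tree.query(query)  # exact nearest-neighbour search
    print(query, "->", points[idx].tolist(), "at distance", round(dist, 4))
# (1, 5)   -> [1, 10]  at distance 5.0
# (12, 33) -> [10, 30] at distance 3.6056
# (50, 2)  -> [55, 1]  at distance 5.099
```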